
# Session 5: Strings, Queries and APIs

*Joachim Kahr Rasmussen*

## Recap (I/II) 

We can think of plots as coming in two 'types':
- **Exploratory** plots: Figures for understanding data
    - Quick to produce $\sim$ minimal polishing
    - Interesting feature may be implied by the producer
    - Be careful showing these out of context
- **Explanatory** plots: Figures to convey a message
    - Polished figures
    - Direct attention to interesting feature in the data
    - Minimize risk of misunderstanding

There exist several packages for plotting. 

Some popular ones:
- `Matplotlib` is good for customization (explanatory plots)
    - Might take a lot of time when customizing!
- `Seaborn` and `Pandas` are good for quick and dirty plots (exploratory)

## Recap (II/II) 

We need to put a lot of thought into how to present data.

In particular, one must consider the *type* of data that is to be presented:

- One variable:
    - Categorical: Pie charts, simple counts, etc.
    - Numeric: Histograms, distplot (/cumulative) in seaborn


- Multiple variables:
    - `scatter` (matplotlib) or `jointplot` (seaborn) for (i) simple descriptives when (ii) both variables are numeric and (iii) there are not too many observations
    - `lmplot` or `regplot` (seaborn) when you also want to fit a linear model
    - `bar` (matplotlib), `barplot`, `catplot` and `violinplot` (all seaborn) when one or more variables are categorical
    - the option `hue` allows you to add a "third" categorical dimension... use with care
    - Lots of other plot types and options. Go explore yourself!

- When you just want to explore: `pairplot` (seaborn) plots all pairwise relationships

# Questions from Yesterday

I have tried to gather some questions that seemed to address more general issues:
- *4432*

Other questions?


# Overview of Session 5

Today, we will work with strings, requests and APIs. In particular, we will cover:
1. Git for version control:
    - It is useful when several people are collaborating on the same code. 
    - Today: Motivation for trying Git
2. Markdown for exposition:
    - How to start using Markdown 
3. Text as data:
    - What is a string, and how do we work with it?
    - What kinds of text data exist?
4. Key Based Containers:
    - What is a dictionary, and how is this different from lists and tuples?
    - When are dictionaries useful, and how do we work with them?
5. Interacting with the web:
    - What is HTTP and HTML?
    - What is an API, and how do we interact with it?
6. Leveraging APIs:
    - What kinds of data can be extracted via an API?
    - How do we translate an API into useful data?

# Associated Readings

PDA:
- Section 2.3: How to work with strings in Python
- Section 3.3: Opening text files, interpreting characters
- Section 6.1: Opening and working with CSV files
- Section 6.3: Intro to interacting with APIs
- Section 7.3: Manipulating strings

Gazarov (2016): "What is an API? In English, please."
- Excellent and easily understood intro to the concept
- Examples of different 'types' of APIs
- Intro to the concepts of servers, clients and HTML

# Text as Data

## Why Text Data

Data is everywhere... and collection is picking up speed! 
- Personal devices and [what we have at home](https://www.washingtonpost.com/technology/2019/05/06/alexa-has-been-eavesdropping-you-this-whole-time/)
- Online in terms of news websites, wikipedia, social media, blogs, document archives 

Working with text data opens up interesting new avenues for analysis and research. Some cool examples:
  - [The predictive information from central bank meeting minutes](https://sekhansen.github.io/pdf_files/qje_2018.pdf)
  - [Choice of words by politicians - which shows increased polarization](https://www.brown.edu/Research/Shapiro/pdfs/politext.pdf)

## How Text Data

Data from the web often comes as HTML or in another text format.

In this course, you will get tools to do basic work with text as data.

However, in order to do that, we need to:

- learn how to manipulate and save strings
- save our text data in smart ways (JSON)
- interact with the web

First things first...

# Video 5.1: Key Based Containers

## Containers Recap (I/II)

*What are containers? Which have we seen?*

Sequential containers:
- `list` which we can modify (**mutable**).
    - useful to collect data on the go
- `tuple` which is **immutable** after initial assignment
     - tuples are faster as they can do fewer things
- `array` 
    - which is mutable in content (i.e. we can change elements)
    - but immutable in size
    - great for data analysis
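
The difference between mutable and immutable containers can be checked directly. The following is a minimal sketch using only built-in types; the variable names and values are made up for illustration.

```python
# A list can be changed after it has been created ...
grades = [7, 10]
grades.append(12)        # add an element on the go
grades[0] = 4            # overwrite an existing element

# ... whereas a tuple cannot.
point = (1.5, 2.5)
try:
    point[0] = 0.0       # item assignment is not supported for tuples
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(grades)            # the modified list
print(tuple_changed)     # False: the tuple was left untouched
```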

## Containers Recap (II/II)

Non-sequential containers:
- Dictionaries (`dict`) which are accessed by keys (immutable objects).
- Sets (`set`) where elements are
    - unique (no duplicates) 
    - not ordered
    - disadvantage: cannot access specific elements by position!
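
The properties listed above can be illustrated with a short sketch. The names come from the slides; the `offices` dictionary and its contents are hypothetical and only serve to show which objects can be used as keys.

```python
names = ['Terne', 'Joachim', 'Terne', 'Nicklas']
unique_names = set(names)            # duplicates are dropped

print(len(unique_names))             # number of distinct names
print('Joachim' in unique_names)     # membership tests still work

# Sets have no positions, so indexing fails
try:
    unique_names[0]
    indexable = True
except TypeError:
    indexable = False

# Dictionary keys must be immutable (hashable): a tuple works, a list does not
offices = {('building', 26): 'Joachim'}      # hypothetical data
try:
    offices[['building', 26]] = 'Nicklas'
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(indexable, list_key_ok)
```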

## Dictionaries Recap (I/II)

*How did we make a container which is accessed by arbitrary keys?*

By using a dictionary, `dict`. Simple way of constructing a `dict`:

In [3]:
my_dict = {'Andreas': 'Assistant Professor',
           'Joachim': 'PhD Fellow',
           'Nicklas': 'PhD Fellow',
           'Terne': 'PhD Fellow'}

In [2]:
print(my_dict['Joachim'])

PhD Fellow


In [49]:
my_new_dict = {}
for a in range(0,100):
    my_new_dict["cube%s" %a] = a**3
    
print(my_new_dict['cube10'])

1000


## Dictionaries Recap (II/II)

Dictionaries can also be constructed from two associated lists. These are tied together with the `zip` function. Try the following code:

In [51]:
keys = ['a', 'b', 'c']
values = range(2,5)

key_value_pairs = list(zip(keys, values))
print(key_value_pairs) #Print as a list of tuples

[('a', 2), ('b', 3), ('c', 4)]


In [54]:
my_dict2 = dict(key_value_pairs)
print(my_dict2) #Print dictionary

{'a': 2, 'b': 3, 'c': 4}


In [55]:
print(my_dict2['a']) #Fetch the value associated with 'a'

2
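
A few more ways of working with dictionaries, shown as a minimal sketch that reuses the names from the cells above. None of this is required; it only shows that the intermediate list is optional and how to loop over a dictionary or look up a key that may be missing.

```python
keys = ['a', 'b', 'c']
values = range(2, 5)

# dict() accepts the zip object directly; no intermediate list is needed
my_dict2 = dict(zip(keys, values))

# Loop over key-value pairs
for key, value in my_dict2.items():
    print(key, value)

# .get returns a default value instead of raising KeyError for a missing key
missing = my_dict2.get('d', 0)

# A dict comprehension builds the cube dictionary in a single expression
cubes = {f'cube{a}': a**3 for a in range(100)}
print(cubes['cube10'])
```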


## Storing Containers

*Does there exist a file format for easy storage of containers?*

Yes, the JSON file format.
- Can store lists and dictionaries.
- Syntax is almost the same as for Python lists and dictionaries: the object is written as a string, and text inside it uses double quotation marks. 
    - Example: `'{"a":1,"b":1}'`

*Why is JSON so useful?*

- Standard format that looks almost exactly like Python lists and dictionaries.
- Extreme flexibility:
    - Can hold any list or dictionary of any depth which contains only float, int, str, bool and `None`.
    - Other Python objects (e.g. sets or dates) must be converted first, but JSON can normally hold any structured data.
        - Extension to spatial data: GeoJSON
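
The round trip between Python containers and JSON can be sketched with the standard `json` module. The `staff` dictionary below is made up for illustration (the `sessions` values in particular are invented).

```python
import json

# A nested container holding only str, int and list values (illustrative data)
staff = {'Joachim': {'title': 'PhD Fellow', 'sessions': [5, 6]},
         'Andreas': {'title': 'Assistant Professor', 'sessions': []}}

json_string = json.dumps(staff)       # Python object -> JSON string
restored = json.loads(json_string)    # JSON string -> Python object
print(json_string)
print(restored == staff)              # the round trip preserves the content

# The example string from the slide parses into a dictionary
parsed = json.loads('{"a":1,"b":1}')

# Types outside the JSON standard (here: a set) are rejected
try:
    json.dumps({'ids': {1, 2, 3}})
    set_ok = True
except TypeError:
    set_ok = False
print(parsed, set_ok)
```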

# VIDEO: Interacting with the Web

## The Internet as Data (I/II)

When we surf around the internet we are exposed to a wealth of information.

- What if we could take this and analyze it?

Well, we can. And we will. 

Examples: Facebook, Twitter, Reddit, Wikipedia, Airbnb etc.

## The Internet as Data (II/II)

Sometimes we get lucky. The data is served to us.

- The data is provided as an `API` service (today)
- The data can be extracted by queries on underlying tables (scraping sessions)


However, often we need to do the work ourselves (scraping sessions)

- We need to explore the structure of the webpage we are interested in
- We can extract relevant elements 
   

## Web Interactions

In the words of Gazarov (2016): The web can be seen as a large network of connected servers
- Every page on the internet is stored somewhere on a remote server
    - Remote server $\sim$ remotely located computer that is optimized to process requests

- When accessing a web page through browser:
    - Your browser (the *client*) sends a request to the website's server
    - The server then sends code back to the browser
    - This code is interpreted by the browser and displayed


- Websites come in the form of HTML $-$ APIs only contain data (often in *JSON* format) without presentational overhead

## The Web Protocol
*What is `http` and where is it used?*

- `http` stands for HyperText Transfer Protocol.
- `http` is used for transmitting the data when a webpage is visited:
   - the visiting client sends a request for a URL or object;
   - the server returns the relevant data if it is active.

*Should we care about `http`?*

- In this course we ***do not*** care explicitly about `http`. 
- We use a Python module called `requests` as a `http` interface.
- However... Some useful advice - you should **always**:
  - use the encrypted version, `https`;
  - use an authenticated connection, i.e. private login, whenever possible.
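
The request/response cycle described above can be tried out without touching the internet. The sketch below is an illustration under stated assumptions: it starts a tiny server on the local machine with the standard library and queries it with `urllib` instead of `requests`, so that it runs without extra installs. The handler class, the path `/users/demo` and the returned message are all made up. It uses plain `http` only because the traffic never leaves your own computer.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET request with a small JSON document."""

    def do_GET(self):
        body = json.dumps({'path': self.path, 'message': 'hello'}).encode('utf-8')
        self.send_response(200)                          # status code: OK
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)                           # response body

    def log_message(self, format, *args):
        pass                                             # keep the output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client sends a GET request and reads the server's response
opener = build_opener(ProxyHandler({}))                  # ignore any proxy settings
with opener.open(f'http://127.0.0.1:{port}/users/demo') as response:
    status = response.status
    content_type = response.headers['Content-Type']
    payload = json.loads(response.read().decode('utf-8'))

server.shutdown()
server.server_close()

print(status, content_type, payload)
```

With `requests`, the client part would shrink to a single `requests.get(url)` call followed by `response.json()`.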

## Markup Language
*What is `html` and where is it used?*

- HyperText Markup Language
- `html` is a language for communicating what a webpage looks like and how it behaves.
  - That is, `html` contains: content, design, available actions.

*Should we care about `html`?*

- Yes, `html` is often where the interesting data can be found.
- Sometimes, we are lucky, and instead of `html` we get a JSON in return. 
- Getting data from `html` will be the topic of the subsequent scraping sessions.
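
To give a first impression of how content and markup are mixed in `html`, here is a minimal sketch using the standard library's `html.parser`. The page is a made-up example, and `TextAndLinks` is a hypothetical helper class; the scraping sessions may well use other tools.

```python
from html.parser import HTMLParser

# A tiny, made-up web page
page = """
<html>
  <body>
    <h1>Session 5</h1>
    <p>Strings, Queries and <a href="https://api.github.com">APIs</a></p>
  </body>
</html>
"""


class TextAndLinks(HTMLParser):
    """Collect the visible text and the link targets of a page."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')

    def handle_data(self, data):
        # Skip chunks that consist of whitespace only
        if data.strip():
            self.texts.append(data.strip())


parser = TextAndLinks()
parser.feed(page)
print(parser.texts)
print(parser.links)
```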

# VIDEO: Leveraging APIs 

## Web APIs (I/IV)
*So when do we get lucky, i.e. when is `html` not important?*

- When we get an Application Programming Interface (`API`) on the web
- What does this mean?
  - We send a query to the Web API 
  - We get a response from the Web API with data back in return, typically as JSON.
  - The API usually provides access to a database or some service

## Web APIs (II/IV)
*So where is the API?*

- Usually on a separate subdomain, e.g. `api.github.com`
- Sometimes hidden in code (see sessions on scraping) 

*So how do we know how the API works?*

- There usually is some documentation. E.g. google ["api github com"](https://www.google.com/search?q=api+github)

## Web APIs (III/IV)
*So is data free? As in free lunch?*

- Most commercial APIs require authentication and have limited free usage
  - e.g. Google Maps, various weather services
- Some open APIs that are free
  - Danish 
    - Danish statistics (DST)
    - Danish weather data (DMI, this fall)
    - Danish spatial data (DAWA, danish addresses) 
  - Global
      - OpenStreetMap, Wikipedia
- If no authentication is required the API may be rate limited.
  - This means only a certain number of requests can be handled per second or per hour from a given IP address.
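
One simple way to stay within such a limit is to pause between requests. The sketch below assumes a limit of roughly five requests per second; `Throttle` is a hypothetical helper written for this illustration, not part of any library, and the actual limit must be taken from the documentation of the API in question.

```python
import time


class Throttle:
    """Wait so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)        # pause until the interval has passed
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.2)        # assumed limit: about 5 requests per second
start = time.monotonic()
for page in range(3):
    throttle.wait()
    # a real script would send its API request for `page` here
elapsed = time.monotonic() - start

print(round(elapsed, 1))                     # total time spent on the three calls
```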

## Web APIs (IV/IV)
*So how do we make the URLs?*

- An `API` query is a URL consisting of:
  - Server URL, e.g. `https://api.github.com`
  - Endpoint path, `/users/isdsucph/repos`

We can convert a JSON string into Python lists and dictionaries with `json.loads`.
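
The pieces above can be put together as follows. This is a sketch that does not contact the server: the query parameters are illustrative, and `response_text` is a made-up stand-in for the text an API would return.

```python
import json
from urllib.parse import urlencode

server_url = 'https://api.github.com'
endpoint_path = '/users/isdsucph/repos'
query = server_url + endpoint_path
print(query)

# Optional query parameters are appended after a '?'
params = {'per_page': 5, 'sort': 'updated'}      # illustrative parameters
query_with_params = query + '?' + urlencode(params)
print(query_with_params)

# With `requests`, the call would be: response = requests.get(query, params=params)
# Here a made-up response body stands in for response.text
response_text = '[{"name": "example-repo", "stargazers_count": 3}]'
repos = json.loads(response_text)                # JSON string -> list of dictionaries

for repo in repos:
    print(repo['name'], repo['stargazers_count'])
```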

## File handling
*How can we remove a file?*

The module `os` can do a lot of file handling tasks, e.g. removing files:

In [71]:
import os

os.remove('my_file.json')
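
Note that `os.remove` raises a `FileNotFoundError` if the file does not exist. A minimal sketch of the full life cycle of a JSON file (write, read back, remove) could look as follows; it works in a temporary folder so that nothing in your working directory is touched, and the dictionary content is only an example.

```python
import json
import os
import tempfile

my_dict = {'Joachim': 'PhD Fellow', 'Terne': 'PhD Fellow'}

# Work in a temporary folder to avoid cluttering the working directory
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'my_file.json')

# Store the dictionary as a JSON file
with open(path, 'w') as f:
    json.dump(my_dict, f)

# Read it back into a Python dictionary
with open(path) as f:
    loaded = json.load(f)

existed_before = os.path.exists(path)
if os.path.exists(path):             # only remove the file if it is there
    os.remove(path)
existed_after = os.path.exists(path)
os.rmdir(folder)                     # clean up the now empty folder

print(loaded == my_dict, existed_before, existed_after)
```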

In [None]:
############################################
### NOT CLEAR WHERE THIS SLIDE SHOULD BE ###
############################################

import os

# Find current working directory
print(os.getcwd()) # Check working directory

# What files are stored here?
print(os.listdir(os.getcwd())) # Check files in working directory

# I want to access the folder 'my_csv_file'. How?

# I can also change the directory (note 'forward slash'!)
os.chdir('C:/Users/xtw562/Documents/isds2021-master/teaching_material/session_2/my_csv_file')
print(os.listdir(os.getcwd())) # Check files in working directory

# And change back again!
os.chdir('C:/Users/xtw562/Documents/isds2021-master/teaching_material/session_2')
print(os.listdir(os.getcwd())) # Check files in working directory

############################################
### NOT CLEAR WHERE THIS SLIDE SHOULD BE ###
############################################

import pandas as pd

# The simple case
data = pd.read_csv('my_csv_file/titanic.csv')
print(data.head(10))

print(data.dtypes)  # Check the data type of each column

data = pd.read_csv(
    'data/files/complex_data_example.tsv',     # Relative path to subdirectory
    sep='\t',                                  # Tab-separated value file
    quotechar="'",                             # Single quote allowed as quote character
    dtype={"salary": int},                     # Parse the salary column as an integer
    usecols=['name', 'birth_date', 'salary'],  # Only load the three columns specified
    parse_dates=['birth_date'],                # Interpret the birth_date column as a date
    skiprows=10,                               # Skip the first 10 rows of the file
    na_values=['.', '??']                      # Take any '.' or '??' values as NA
)
