# Session 1 Recap

## Recap (I/III) 

Three trends driving increasing interest in data science:
- **Data** is increasingly available
- Faster and bigger **computers**
- Improved **algorithms**, methods for computation and accessibility of tools

*Social data science* captures certain applications of data science within social sciences:
- **Handling** (big) sets of structured (tabular) and unstructured (image, text and social media) data
- **Analyzing** data with machine learning techniques (statistics, causal inference and economic modelling)

## Recap (II/III) 

In this course, we use Python for most tasks:
- There are **other popular languages** out there (e.g. R and Julia)
- However, Python is becoming increasingly popular due to its **broad applicability** (data structuring, ML, general programming) and **ease of learning**

However, it is not *that* easy to learn... Remember:
- Coding takes **time** and **practice** to learn
- We are here to **help** you!
- Eventually, you will **learn the fundamentals**

When you have learned the fundamentals, there are other margins to focus on:
- **Speed**: Use functions and classes when possible
- **Clarity**: Remember to annotate code so that others (including 'future you') understand what you have done

## Recap (III/III) 

Git is popular for file sharing and collaborative coding:
- **Track changes** system for files (log of all changes is kept)
- **Share the files** you want, how you want (you can 'copy' repositories)

Markdown is an easy and flexible alternative to LaTeX for building slides:
- **Coding** is easy and fast!
- Works well with **Jupyter**

## Questions?

<center><img src='https://media.giphy.com/media/7PfwoiCwBp6Ra/giphy.gif' alt="Drawing" style="width: 400px;"/></center>

# Session 2: Strings and APIs

## Overview of Session

Today, we will work with strings, requests and APIs. In particular, we will cover:
- Text as data:
    - What is a string, and how do we work with it?
    - What kinds of text data exist?
- Interacting with the web:
    - What is HTTP and HTML?
    - What is an API, and how do we interact with it?
- Leveraging APIs:
    - What kinds of data can be extracted via an API?
    - How do we translate an API into useful data?

## Associated Readings

Gazarov (2016): "What is an API? In English, please."
- Excellent and easily understood intro to the concept
- Examples of different 'types' of APIs
- Intro to the concepts of servers, clients and HTML

PDA:
- Section 2.3: How to work with strings in Python
- Section 3.3: Opening text files, interpreting characters
- Section 6.1: Opening and working with CSV files
- Section 6.3: Intro to interacting with APIs
- Section 7.3: Manipulating strings

# Text Data and Containers

## Why Text Data

Data is everywhere... and collection is picking up speed!
- Personal devices and [what we have at home](https://www.washingtonpost.com/technology/2019/05/06/alexa-has-been-eavesdropping-you-this-whole-time/)
- Online, in the form of news websites, Wikipedia, social media, blogs and document archives

Working with text data opens up interesting new avenues for analysis and research. Some cool examples:
  - [The predictive information from central bank meeting minutes](https://sekhansen.github.io/pdf_files/qje_2018.pdf)
  - [Choice of words by politicians - which shows increased polarization](https://www.brown.edu/Research/Shapiro/pdfs/politext.pdf)

## How Text Data

Data from the web often comes as HTML or in another text format.

In this course, you will get tools to do basic work with text as data.

However, in order to do that, we need to:

- learn how to manipulate and save strings
- save our text data in smart ways (JSON)
- interact with the web
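
As a minimal sketch of the first two points, the cell below cleans a string with built-in string methods and then stores the result as JSON. The sample text and the `record` dictionary with its keys are made up for illustration.

```python
import json

# Hypothetical raw text, e.g. copied from a web page
raw = "  Data Science is FUN, and data is everywhere!  "

# Basic string manipulation: strip whitespace and convert to lowercase
clean = raw.strip().lower()

# Remove punctuation and split the text into a list of words
words = clean.replace(",", "").replace("!", "").split()

# Count how often the word 'data' occurs
n_data = words.count("data")

# Collect the results in a dictionary and serialize it to a JSON string
record = {"text": clean, "n_words": len(words), "n_data": n_data}
as_json = json.dumps(record)

# Parsing the JSON string gives us back an equal dictionary
restored = json.loads(as_json)
print(restored["n_words"], restored["n_data"])
```

JSON is convenient here because nested dictionaries and lists survive the round trip, which a flat text file would not give us for free.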



# Interacting with the Web

## The Internet as Data (I/II)

When we surf around the internet we are exposed to a wealth of information.

- What if we could take this and analyze it?

Well, we can. And we will. 

Examples: Facebook, Twitter, Reddit, Wikipedia, Airbnb etc.

## The Internet as Data (II/II)

Sometimes we get lucky. The data is served to us.

- The data is provided as an `API` service (today)
- The data can be extracted by queries on underlying tables (scraping sessions)


However, often we need to do the work ourselves (scraping sessions)

- We need to explore the structure of the webpage we are interested in
- We can extract relevant elements 
   

## Web Interactions

In the words of Gazarov (2016): The web can be seen as a large network of connected servers
- Every page on the internet is stored somewhere on a remote server
    - Remote server $\sim$ remotely located computer that is optimized to process requests

- When accessing a web page through a browser:
    - Your browser (the *client*) sends a request to the website's server
    - The server then sends code back to the browser
    - This code is interpreted by the browser and displayed


- Websites come in the form of HTML, whereas APIs only return data (often in *JSON* format) without presentational overhead
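
The request and response cycle described above can be tried out without leaving your own machine. The sketch below starts a toy server in a background thread and then acts as the client. It uses only the standard library so that it runs offline; the handler class, the `/api` path and the returned content are all made up for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: '/api' returns JSON, every other path returns HTML."""

    def do_GET(self):
        if self.path == "/api":
            body = json.dumps({"course": "ISDS", "session": 2}).encode("utf-8")
            content_type = "application/json"
        else:
            body = b"<html><body><h1>Session 2</h1></body></html>"
            content_type = "text/html"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default request log


# Client that ignores any proxy settings, since we only talk to localhost
opener = build_opener(ProxyHandler({}))


def fetch(url):
    """Act as the client: send a GET request, return (content type, text)."""
    with opener.open(url, timeout=5) as response:
        content_type = response.headers.get_content_type()
        text = response.read().decode("utf-8")
    return content_type, text


# Start the server on a free local port in a background thread
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

# Request a 'web page' and an 'API endpoint' from the same server
page_type, page = fetch(base_url + "/")
api_type, api_text = fetch(base_url + "/api")
data = json.loads(api_text)

# Stop the server again
server.shutdown()
server.server_close()

print(page_type, page)
print(api_type, data)
```

The HTML response needs a browser (or a parser) to make sense of it, while the JSON response turns directly into a Python dictionary.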

## The Web Protocol
*What is `http` and where is it used?*

- `http` stands for HyperText Transfer Protocol.
- `http` is used for transmitting data when a webpage is visited:
   - the visiting client sends a request for a URL or object;
   - the server returns the relevant data if it is active.

*Should we care about `http`?*

- In this course we ***do not*** care explicitly about `http`. 
- We use a Python module called `requests` as a `http` interface.
- However... Some useful advice - you should **always**:
  - use the encrypted version, `https`;
  - use an authenticated connection, i.e. private login, whenever possible.
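
To see where the protocol shows up in practice, the sketch below splits a URL into its parts and upgrades `http` to `https`. The helper `ensure_https` and the example path are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlsplit


def ensure_https(url):
    """Return the URL with the scheme switched to https if it was http."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return parts.geturl()


# Illustrative URL; the path and query are only an example
url = "http://api.github.com/users/octocat?per_page=5"
parts = urlsplit(url)

print(parts.scheme)  # the protocol
print(parts.netloc)  # the server we talk to
print(parts.path)    # the object we ask for
print(parts.query)   # extra parameters sent along with the request
print(ensure_https(url))
```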

## Markup Language
*What is `html` and where is it used?*

- `html` stands for HyperText Markup Language.
- `html` is a language for communicating what a webpage looks like and how it behaves.
  - That is, `html` contains: content, design, available actions.

*Should we care about `html`?*

- Yes, `html` is often where the interesting data can be found.
- Sometimes, we are lucky, and instead of `html` we get a JSON in return. 
- Getting data from `html` will be the topic of the subsequent scraping sessions.
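
As a small preview, the sketch below pulls the links out of a made-up HTML page using the standard library's `html.parser`. The page content and the `LinkCollector` class are assumptions for illustration; the scraping sessions may well rely on other tools.

```python
from html.parser import HTMLParser

# Made-up page: the data (two links) is wrapped in layout markup
page = """
<html>
  <body>
    <h1>Course overview</h1>
    <a href="/session_1">Session 1</a>
    <a href="/session_2">Session 2</a>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed(page)
print(collector.links)
```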

# Web APIs 

## Web APIs (I/III)
*So when do we get lucky, i.e. when is `html` not important?*

- When we get an Application Programming Interface (`API`) on the web
- What does this mean?
  - We send a query to the Web API 
  - We get a response from the Web API with data back in return, typically as JSON.
  - The API usually provides access to a database or some service
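
The query and response steps can be sketched as follows. The endpoint `api.example.com`, its parameters and the JSON response are all hypothetical, and the response is hard-coded so that no request is sent.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and query parameters (not a real API)
base_url = "https://api.example.com/v1/weather"
params = {"city": "Copenhagen", "days": 2}

# The query is encoded into the URL after a question mark
query_url = f"{base_url}?{urlencode(params)}"

# Hypothetical JSON response from the API
response_text = """
{
  "city": "Copenhagen",
  "forecast": [
    {"date": "2021-08-10", "temp_c": 21.5},
    {"date": "2021-08-11", "temp_c": 19.0}
  ]
}
"""

# Parse the JSON text into Python objects and flatten it into rows
payload = json.loads(response_text)
rows = [(day["date"], day["temp_c"]) for day in payload["forecast"]]
mean_temp = sum(temp for _, temp in rows) / len(rows)

print(query_url)
print(rows, mean_temp)
```

Once the response is a list of rows, it can be handed to the usual data tools for further analysis.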

## Web APIs (II/III)
*So where is the API?*

- Usually on a separate subdomain, e.g. `api.github.com`
- Sometimes hidden in code (see sessions on scraping) 

*So how do we know how the API works?*

- There usually is some documentation. E.g. google ["api github com"](https://www.google.com/search?q=api+github)

## Web APIs (III/III)
*So is data free? As in free lunch?*

- Most commercial APIs require authentication and have limited free usage
  - e.g. Google Maps, various weather services
- Some open APIs that are free
  - Danish 
    - Danish statistics (DST)
    - Danish weather data (DMI, this fall)
    - Danish spatial data (DAWA, Danish addresses)
  - Global
    - OpenStreetMap, Wikipedia
- If no authentication is required, the API may be rate limited.
  - This means that only a certain number of requests per second or per hour are handled from a given IP address.
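
One simple way to stay within such a limit is to pause between requests. The sketch below is one possible approach, not a prescribed method: `throttled` and `fake_fetch` are hypothetical helpers, and `fake_fetch` stands in for a real API call.

```python
import time


def throttled(items, fetch, min_interval):
    """Call fetch(item) for each item, at least min_interval seconds apart."""
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            # Wait for whatever remains of the minimum interval
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        results.append(fetch(item))
    return results


def fake_fetch(page):
    """Stand-in for a real API request."""
    return {"page": page}


start = time.monotonic()
results = throttled([1, 2, 3], fake_fetch, min_interval=0.1)
elapsed = time.monotonic() - start

print(results)
print(f"Three calls took {elapsed:.2f} seconds")
```

The right interval depends on the API in question, so check its documentation for the actual limits.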

In [None]:
import os

# Find the current working directory
print(os.getcwd())

# What files are stored here?
print(os.listdir(os.getcwd()))

# I want to access the folder 'my_csv_file'. How?

# I can change the directory (note the forward slashes, which also work on Windows).
# The absolute paths below are specific to one machine; adjust them to your own setup.
os.chdir('C:/Users/xtw562/Documents/isds2021-master/teaching_material/session_2/my_csv_file')
print(os.listdir(os.getcwd()))

# And change back again!
os.chdir('C:/Users/xtw562/Documents/isds2021-master/teaching_material/session_2')
print(os.listdir(os.getcwd()))
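
An alternative to changing the working directory is to build the path to the subfolder and use it directly. The sketch below shows this with a temporary directory so that it runs anywhere; the file `example.csv` and its content are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as session_dir:
    # Create a subfolder containing one small CSV file
    csv_dir = os.path.join(session_dir, "my_csv_file")
    os.mkdir(csv_dir)
    csv_path = os.path.join(csv_dir, "example.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("name,age\nAnna,30\n")

    # List the subfolder and read the file without changing directory
    files = os.listdir(csv_dir)
    with open(csv_path, encoding="utf-8") as f:
        first_line = f.readline().strip()

print(files, first_line)
```

Because `os.path.join` inserts the correct separator for the operating system, the same code works on Windows, macOS and Linux.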

import pandas as pd

# The simple case
data = pd.read_csv('my_csv_file/titanic.csv')
print(data.head(10))

# Inspect the data type of each column
print(data.dtypes)

# A more complex case
data = pd.read_csv(
    'data/files/complex_data_example.tsv',       # Relative path to a file in a subdirectory
    sep='\t',                                    # Tab-separated values file
    quotechar="'",                               # Single quote used as the quote character
    dtype={'salary': int},                       # Parse the salary column as an integer
    usecols=['name', 'birth_date', 'salary'],    # Only load the three columns specified
    parse_dates=['birth_date'],                  # Interpret the birth_date column as a date
    skiprows=10,                                 # Skip the first 10 rows of the file
    na_values=['.', '??']                        # Treat any '.' or '??' values as NA
)