# Session 1 Recap

*Joachim Kahr Rasmussen*

## Recap (I/II) 

Three trends driving increasing interest in data science:
- **Data** is increasingly available
- Faster and bigger **computers**
- Improved **algorithms**, methods for computation and accessibility of tools

*Social data science* captures certain applications of data science within social sciences:
- **Handling** (big) sets of structured (tabular) and unstructured (image, text and social media) data
- **Analyzing** data with machine learning techniques (statistics, causal inference and economic modelling)

## Recap (II/II) 

In this course, we use Python for most tasks:
- **Other popular languages** out there (e.g. R and Julia)
- However, Python is becoming increasingly popular due to **broad applicability** (data structuring, ML, general programming) and **ease of learning**

That said, learning to code is not always easy... Remember:
- Coding takes **time** and **practice** to learn
- We are here to **help** you!
- Eventually, you will **learn the fundamentals**

When you have learned the fundamentals, there are other margins to focus on:
- **Speed**: Use functions and classes when possible
- **Clarity**: Remember to annotate code so that others (including 'future you') understand what you have done

## Questions?

<center><img src='https://media.giphy.com/media/7PfwoiCwBp6Ra/giphy.gif' alt="Drawing" style="width: 400px;"/></center>

# Session 2: Git, Markdown, Strings and APIs

*Joachim Kahr Rasmussen*

## Overview of Session

Today, we will work with strings, requests and APIs. In particular, we will cover:
1. Git for version control:
    - It is useful when several people are collaborating on the same code. 
    - Today: Motivation for trying Git
2. Markdown for exposition:
    - How to start using Markdown 
3. Text as data:
    - What is a string, and how do we work with it?
    - What kinds of text data exist?
4. Key Based Containers:
    - What is a dictionary, and how is this different from lists and tuples?
    - When are dictionaries useful, and how do we work with them?
5. Interacting with the web:
    - What is HTTP and HTML?
    - What is an API, and how do we interact with it?
6. Leveraging APIs:
    - What kinds of data can be extracted via an API?
    - How do we translate an API into useful data?

## Associated Readings

Gazarov (2016): "What is an API? In English, please."
- Excellent and easily understood intro to the concept
- Examples of different 'types' of APIs
- Intro to the concepts of servers, clients and HTML

PDA:
- Section 2.3: How to work with strings in Python
- Section 3.3: Opening text files, interpreting characters
- Section 6.1: Opening and working with CSV files
- Section 6.3: Intro to interacting with APIs
- Section 7.3: Manipulating strings

Zachery (2015):
- **Technical** guide to Git

# Git for Version Control

## Git for Version Control - a Non-technical Overview

Git is a command-line tool:

1) "Track changes" system for files
- A log of all changes is kept - from nothing to current version
- All changes are explicitly declared by you, and you may annotate them
  - You can try out things, but only save meaningful changes!

2) Share the files you want, how you want 
- A git folder, called **repository**, can be copied by others
- Many sites allow public and private repositories - you decide access
    


## Why version control: Track of files/code

### Without git
<img src="https://raw.githubusercontent.com/abjer/sds2017/master/slides/figures/nogit.jpg" style="margin: 0 0 -400px 0" alt="Donald Duck">
 

### With Git
![](https://raw.githubusercontent.com/abjer/sds2017/master/slides/figures/git.jpg)

## What Dropbox/Google Drive etc. does

Synchronize folder (including subfolders)
- All changes are synchronized continuously (no choice)
- If the folder is shared, you keep only the latest copy (changes can be reverted for one month)

## What Git does

- Keeps a log of the entire history of changes to files. 
- You can decide **what** and **when** to put in this log.
- You can synchronize the log
    - in a centralized place, e.g. GitHub (which can be public or private).
    - in a decentralized place, e.g. your servers, your computer
- Others can see **who** contributed from this log.

## Why is Git useful

Can scale to many people as it solves:
- Handling of conflicting copies 
- Clutter: Only relevant changes are kept $\rightarrow$ less storage space needed
- Eternal memory: You can revert changes that are very old!
- Contributions: Clear attribution of work

## Vocabulary


- Git: Git is an open source command line program for version control.
- Repository: the location where your files are stored
- GitHub: Company/web services that hosts Git repositories and enables 'social coding'
- Clone: copy another repository to a new location (e.g. from GitHub to your PC)
- Pull: to download the newest version of a repository
- Push: push the changes you have made, to the repository

## Git in this course



Homepage hosted on [underlying github repo](https://github.com/abjer/sds2019)
- Has all material and info
- Careful: always ***copy*** notebooks before you use them (lectures and exercises!)

## Alternatives 

*GitHub for Mac/Windows*
- A point and click version of Git.

*Google's [Colab](https://colab.research.google.com/notebooks/welcome.ipynb)* 
- Is a combination of Google Docs and Jupyter Notebook. 
- Plug and play: is an easy to use, less flexible, alternative. 

# Markdown for Exposition

## LaTeX vs Markdown (I/II)

[LaTeX](https://en.wikipedia.org/wiki/LaTeX) is often used for creating slides in economics and social sciences more generally. 

Undeniable advantages:
- Same language for slides and documents
- Capable of creating beautiful scientific written products
- Lots of people within academia can do the coding
- Can be written jointly in [Overleaf](https://www.overleaf.com/)

## Example of Latex Slide

<img src="https://github.com/joachimkrasmussen/ISDS/blob/master/session_1/Latex_slides_preview.png?raw=true" alt="drawing" width="600"/>

## Example of Latex Document
<img src="https://github.com/joachimkrasmussen/ISDS/blob/master/session_1/Latex_doc_preview.png?raw=true" alt="drawing" width="600"/>


## LaTeX vs Markdown (II/II)

However, it has some problems. In particular:
- Background code is heavy to read $-$ and write!
- If including many images, it can be heavy to compile...

Any alternative?



Yes: Markdown. Like Python, it keeps code simple. Example:
        
Making italic text:
- markdown: `*Some text*`
- LaTeX: `\textit{Some text}`

In addition: Works very well with Jupyter + can be used for making homepages!

How do you learn it? Open our notebook cells or see the tutorial in the reading list.

## Headlines

\# This will be the headline    
\#\# This will be the sub headline  
\#\#\# And so on 

---------------------

# This will be the headline
## This will be the sub headline
### And so on


## Bold and italics

- \*\*Text in bold\*\* ->  **Text in bold**

- \*Text in italics\* -> *Text in italics* 

\> This text will be indented 
> This text will be indented 


## Lists

``` 
- fruits 
    - apples
        - macintosh
        - red delicious
    - pears 
    - peaches
- vegetables
    - broccoli
    - chard 
```

## ... gives you this list
- fruits 
    - apples
        - macintosh
        - red delicious
    - pears 
    - peaches
- vegetables
    - broccoli
    - chard

## Links

This is how you insert a link `[name of link](URL)` 
```
The subreddit [DataIsBeautiful](https://www.reddit.com/r/dataisbeautiful/) loves data
```

-> 

The subreddit [DataIsBeautiful](https://www.reddit.com/r/dataisbeautiful/) loves data


## Images
Inserting an image works almost the same way: `![](URL)`

```
This is a cat ![](https://upload.wikimedia.org/wikipedia/commons/a/a2/Cat_Golden_Chinchilla.jpg)
```

This is a cat ![](https://upload.wikimedia.org/wikipedia/commons/a/a2/Cat_Golden_Chinchilla.jpg)

# Text as Data

## Why Text Data

Data is everywhere... and collection is picking up speed! 
- Personal devices and [what we have at home](https://www.washingtonpost.com/technology/2019/05/06/alexa-has-been-eavesdropping-you-this-whole-time/)
- Online in terms of news websites, wikipedia, social media, blogs, document archives 

Working with text data opens up interesting new avenues for analysis and research. Some cool examples:
  - [The predictive information from central bank meeting minutes](https://sekhansen.github.io/pdf_files/qje_2018.pdf)
  - [Choice of words by politicians - which shows increased polarization](https://www.brown.edu/Research/Shapiro/pdfs/politext.pdf)

## How Text Data

Data from the web often comes in HTML or other text format

In this course, you will get tools to do basic work with text as data.

However, in order to do that, we need to:

- learn how to manipulate and save strings
- save our text data in smart ways (JSON)
- interact with the web

First things first...

## Strings Recap

*What are strings? What do they consist of?*

Strings are sequential containers of characters    

Python characters can be:
- Unicode (`UTF`)  
    - Characters from European and Asian languages and much more
    - Variable size: in the common UTF-8 encoding, a character takes 1 to 4 bytes
    - Python 3 default and newer web, e.g. [møn.dk](https://møn.dk)        
-  American Standard Code (`ascii`)
    - Characters from English alphabet, numbers, symbols for writing 
    - 7 bit information (128 characters), usually stored as one byte per character
    - (Python 2 default, faster)
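
To make the difference between the two character sets concrete, the sketch below compares the number of characters in a string with the number of bytes it takes up when encoded as UTF-8. The example strings are chosen for illustration only.

```python
# Illustrative strings (not from the course material)
ascii_text = 'police'
danish_text = 'Møn'

# len() on a str counts characters (Unicode code points)
n_chars = len(danish_text)

# encode() turns the string into raw bytes; in UTF-8, 'ø' takes two bytes
utf8_bytes = danish_text.encode('utf-8')
n_bytes = len(utf8_bytes)
print(n_chars, n_bytes)

# Pure ASCII text takes exactly one byte per character in UTF-8
print(len(ascii_text), len(ascii_text.encode('utf-8')))

# isascii() (Python 3.7+) checks whether a string contains only ASCII characters
print(ascii_text.isascii(), danish_text.isascii())
```

The practical consequence is that character counts and file sizes can differ as soon as the text contains non-English characters.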

## String Concatenation

*How can I combine strings?*

Strings can be added together...

In [5]:
s1 = 'police'
s2 = 'officer'

print(s1 + ' ' + s2)

police officer


... arranged via lists...

In [6]:
s = ' '
my_list = [s1, s2, 'arrests']

print(my_list)

['police', 'officer', 'arrests']


... and joined in order to, say, write sentences

In [9]:
my_join_string = s.join(my_list)

print(my_join_string)

police officer arrests


In [10]:
s_ = '+'
my_join_string2 = s_.join(my_list)

print(my_join_string2)

police+officer+arrests


## String Changing Case

*Can I alter the sentence-case of strings?*

- Yes, using the string methods `upper`, `lower` and `capitalize`. Example:

In [11]:
s1.upper()

'POLICE'
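
As a small sketch of all three methods side by side (the mixed-case string `title` is made up for illustration), note that each method returns a new string and leaves the original untouched:

```python
s1 = 'police'              # same example string as above
title = 'pOLICE officer'   # illustrative string with mixed case

print(s1.upper())          # all characters in upper case
print(title.lower())       # all characters in lower case
print(title.capitalize())  # first character upper case, the rest lower case

# Strings are immutable, so s1 itself is unchanged
print(s1)
```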

## Substrings
*How can I check if a substring is contained in the string?*

- in/not in

In [24]:
'pol' in s1 #Check in string


True

In [25]:
[('pol' in my_s) for my_s in my_list] #Check in list of strings

[True, False, False]

*How can I replace a specific substring?*

- replace

In [26]:
s1.replace('po', 'ma') #Replace in string


'malice'

In [29]:
[(my_s.replace('po', 'ma')) for my_s in my_list] #Replace in list of strings


['malice', 'officer', 'arrests']

*Can I also access a string via indices? (in the sequence of characters)*

- sequence form - slicing/indexing

In [34]:
print(s1[:4]) #Return first 4 symbols

poli


In [35]:
print(s1[-4:]) #Return last 4 symbols

lice


In [36]:
print(s1[len(s1)-4:]) #Return last 4 symbols

lice


## Strings quiz
*Which Python object do strings remind you of?*

- Lists. Strings work much like lists:
  - Concatenation (`+`) and repetition (`*`) work the same way.
  - We check if an element/character is contained with `in`.
  - We can slice and use indices to access elements/characters.
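
The parallels can be sketched by running the same operations on a string and on a list of its characters. The last lines also show one important difference that is worth keeping in mind: lists can be changed in place, strings cannot.

```python
word = 'police'
letters = ['p', 'o', 'l', 'i', 'c', 'e']

# Concatenation and repetition
print(word + 'man', letters + ['s'])
print(word * 2, letters * 2)

# Membership
print('p' in word, 'p' in letters)

# Slicing
print(word[:3], letters[:3])

# Difference: lists are mutable, strings are not
letters[0] = 'm'       # works
try:
    word[0] = 'm'      # raises TypeError
except TypeError as error:
    print('Strings cannot be changed in place:', error)
```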

## More about strings 

There are many things about strings which we have not covered:

- Methods for splitting or combining strings etc.
- [String formatting](http://www.python-course.eu/python3_formatted_output.php) is exceptionally useful, e.g for making URLs, printing etc. 
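
A minimal sketch of both points: `split` is the counterpart of `join`, and f-strings are one convenient way to format strings. The user name is simply taken from the GitHub examples used in this session.

```python
sentence = 'police officer arrests'

# split() breaks a string into a list of substrings (on whitespace by default)
words = sentence.split()
print(words)

# f-strings insert values into a string, which is handy for building URLs
user = 'abjer'
url = f'https://api.github.com/users/{user}/repos'
print(url)

# The older str.format() method gives the same result
url_alt = 'https://api.github.com/users/{}/repos'.format(user)
print(url == url_alt)
```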


# Key Based Containers

## Containers recap

*What are containers? Which have we seen?*

Sequential containers:
- `list` which we can modify (**mutable**).
    - useful to collect data on the go
- `tuple` which cannot be modified after initial assignment (**immutable**)
     - tuples are faster as they can do fewer things
- `array` 
    - which is mutable in content (i.e. we can change elements)
    - but immutable in size
    - great for data analysis

Non-sequential containers:
- Dictionaries (`dict`) which are accessed by keys (immutable objects).
    - Focus of tomorrow.
- Sets (`set`) where elements are
    - unique (no duplicates) 
    - not ordered
    - disadvantage: cannot access specific elements!
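
The differences in mutability and access can be sketched with the built-in containers (arrays are left out here, since they require NumPy). The values are arbitrary.

```python
my_list = [1, 2, 2, 3]
my_tuple = (1, 2, 2, 3)
my_set = {1, 2, 2, 3}

# Lists are mutable: elements can be changed and appended
my_list[0] = 10
my_list.append(4)
print(my_list)

# Tuples are immutable: assignment raises a TypeError
try:
    my_tuple[0] = 10
except TypeError as error:
    print('Tuple error:', error)

# Sets drop duplicates and cannot be indexed
print(my_set)
try:
    my_set[0]
except TypeError as error:
    print('Set error:', error)
```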

## Dictionaries (I/II)

*How can we make a container which is accessed by arbitrary keys?*

By using a dictionary, `dict`. Try executing the two pieces of code below:

In [48]:
my_dict = {'Andreas': 'Lecturer 3',
           'Joachim': 'Lecturer 1',
           'Nicklas': 'Lecturer 2',
           'Terne': 'Lecturer 4'}

print(my_dict['Joachim'])

Lecturer 1


In [49]:
my_new_dict = {}
for a in range(0,100):
    my_new_dict["cube%s" %a] = a**3
    
print(my_new_dict['cube10'])

1000


## Dictionaries (II/II)

Dictionaries can also be constructed from two associated lists. These are tied together with the `zip` function. Try the following code:

In [51]:
keys = ['a', 'b', 'c']
values = range(2,5)

key_value_pairs = list(zip(keys, values))
print(key_value_pairs) #Print as a list of tuples

[('a', 2), ('b', 3), ('c', 4)]


In [54]:
my_dict2 = dict(key_value_pairs)
print(my_dict2) #Print dictionary

{'a': 2, 'b': 3, 'c': 4}


In [55]:
print(my_dict2['a']) #Fetch the value associated with 'a'

2
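
Beyond looking up a single key, a few other operations come up all the time when working with dictionaries. The sketch below reuses the dictionary from above; the added key `'d'` and the missing key `'z'` are just for illustration.

```python
my_dict2 = {'a': 2, 'b': 3, 'c': 4}   # same dictionary as above

# Add a new key (or overwrite an existing one)
my_dict2['d'] = 5

# get() returns a default instead of raising KeyError for a missing key
missing = my_dict2.get('z', 0)
print(missing)

# Check whether a key exists
print('a' in my_dict2)

# Loop over key-value pairs
for key, value in my_dict2.items():
    print(key, value)

# keys() and values() give views that can be turned into lists
print(list(my_dict2.keys()), list(my_dict2.values()))
```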


## Storing Containers

*Is there a file format for easy storage of containers?*

Yes, the JSON file format.
- Can store lists and dictionaries.
- Syntax is almost the same as for Python lists and dictionaries; stored as text, the whole object becomes a string. 
    - Example: `'{"a":1,"b":1}'`

*Why is JSON so useful?*

- Standard format that looks almost exactly like Python.
- Extreme flexibility:
    - Can hold any list or dictionary of any depth which contains only float, int, str, booleans and empty values (`None`).
    - Does not work well with other formats, but normally holds any structured data.
        - Extension to spatial data: GeoJSON
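
As a rough sketch of what "looks like Python" means in practice, the example below converts a nested container to a JSON string and back with the standard `json` module. The record itself is hypothetical.

```python
import json

# Illustrative nested container (hypothetical data)
record = {'name': 'Joachim', 'sessions': [1, 2],
          'details': {'active': True, 'office': None}}

# dumps() serializes the container to a JSON string
record_str = json.dumps(record)
print(record_str)

# loads() parses the JSON string back into Python containers
record_back = json.loads(record_str)
print(record_back == record)

# Small differences from Python: True/None are written as true/null,
# and tuples come back as lists
print(json.loads(json.dumps((1, 2))))
```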

# Interacting with the Web

## The Internet as Data (I/II)

When we surf around the internet we are exposed to a wealth of information.

- What if we could take this and analyze it?

Well, we can. And we will. 

Examples: Facebook, Twitter, Reddit, Wikipedia, Airbnb etc.

## The Internet as Data (II/II)

Sometimes we get lucky. The data is served to us.

- The data is provided as an `API` service (today)
- The data can be extracted by queries on underlying tables (scraping sessions)


However, often we need to do the work ourselves (scraping sessions)

- We need to explore the structure of the webpage we are interested in
- We can extract relevant elements 
   

## Web Interactions

In the words of Gazarov (2016): The web can be seen as a large network of connected servers
- Every page on the internet is stored somewhere on a remote server
    - Remote server $\sim$ remotely located computer that is optimized to process requests

- When accessing a web page through browser:
    - Your browser (the *client*) sends a request to the website's server
    - The server then sends code back to the browser
    - This code is interpreted by the browser and displayed


- Websites come in the form of HTML $-$ APIs only contain data (often in *JSON* format) without presentational overhead

## The Web Protocol
*What is `http` and where is it used?*

- `http` stands for HyperText Transfer Protocol.
- `http` is good for transmitting the data when a webpage is visited:
   - the visiting client sends request for URL or object;
   - the server returns relevant data if active.

*Should we care about `http`?*

- In this course we ***do not*** care explicitly about `http`. 
- We use a Python module called `requests` as a `http` interface.
- However... Some useful advice - you should **always**:
  - use the encrypted version, `https`;
  - use authenticated connection, i.e. private login, whenever possible.

## Markup Language
*What is `html` and where is it used?*

- HyperText Markup Language
- `html` is a language for communicating what a webpage looks like and how it behaves.
  - That is, `html` contains: content, design, available actions.

*Should we care about `html`?*

- Yes, `html` is often where the interesting data can be found.
- Sometimes, we are lucky, and instead of `html` we get a JSON in return. 
- Getting data from `html` will be the topic of the subsequent scraping sessions.
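
To illustrate why a JSON response counts as being lucky, the sketch below extracts the same hypothetical information from an HTML snippet and from a JSON string. It uses the standard library's `html.parser` purely for illustration; this is not necessarily the tool used in the scraping sessions.

```python
import json
from html.parser import HTMLParser

# The same hypothetical information, once as HTML and once as JSON
html_doc = ('<html><body><h1>Lecturers</h1>'
            '<ul><li>Joachim</li><li>Nicklas</li></ul></body></html>')
json_doc = '{"lecturers": ["Joachim", "Nicklas"]}'

# JSON: one call gives us ready-to-use Python containers
names_from_json = json.loads(json_doc)['lecturers']


# HTML: we have to walk through the tags and pick out what we need
class ListItemParser(HTMLParser):
    """Collect the text found inside <li> tags."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items.append(data)


parser = ListItemParser()
parser.feed(html_doc)
names_from_html = parser.items

print(names_from_json)
print(names_from_html)
```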

# Leveraging APIs 

## Web APIs (I/IV)
*So when do we get lucky, i.e. when is `html` not important?*

- When we get an Application Programming Interface (`API`) on the web
- What does this mean?
  - We send a query to the Web API 
  - We get a response from the Web API with data back in return, typically as JSON.
  - The API usually provides access to a database or some service

## Web APIs (II/IV)
*So where is the API?*

- Usually on separate sub-domain, e.g. `api.github.com`
- Sometimes hidden in code (see sessions on scraping) 

*So how do we know how the API works?*

- There usually is some documentation. E.g. google ["api github com"](https://www.google.com/search?q=api+github)

## Web APIs (III/IV)
*So is data free? As in free lunch?*

- Most commercial APIs require authentication and have limited free usage
  - e.g. Google Maps, various weather services
- Some open APIs that are free
  - Danish 
    - Danish statistics (DST)
    - Danish weather data (DMI, this fall)
    - Danish spatial data (DAWA, danish addresses) 
  - Global
      - OpenStreetMap, Wikipedia
- If no authentication is required, the API may be rate limited.
  - This means only a certain number of requests can be handled per second or per hour from a given IP address.

## Web APIs (IV/IV)
*So how do we make the URLs?*

- An `API` query is a URL consisting of:
  - Server URL, e.g. `https://api.github.com`
  - Endpoint path, `/users/abjer/repos`
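
Many APIs also accept query parameters after the endpoint path. As a sketch, the standard library's `urllib.parse` can both build and take apart such a URL. The parameter names below are assumptions for illustration; the documentation of each API lists the parameters an endpoint actually supports.

```python
from urllib.parse import urlencode, urlsplit

server_url = 'https://api.github.com'
endpoint_path = '/users/abjer/repos'

# Assumed query parameters, for illustration only
params = {'per_page': 5, 'page': 1}

# urlencode() turns the dictionary into 'key=value' pairs joined by '&'
query_string = urlencode(params)
url = server_url + endpoint_path + '?' + query_string
print(url)

# urlsplit() takes the URL apart again
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path, parts.query)
```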

## Web APIs in Python (I/V)
*How do we make a simple query?*

In [59]:
server_url = 'https://api.github.com'
endpoint_path = '/users/abjer/repos'
url = server_url + endpoint_path

print(url)

https://api.github.com/users/abjer/repos


## Web APIs in Python (II/V)
*How can we send a query with the `requests` module?*

In [61]:
import requests # import the module requests
response = requests.get(url) # submit query with `get` and save response as object

response.text

'[{"id":111244798,"node_id":"MDEwOlJlcG9zaXRvcnkxMTEyNDQ3OTg=","name":"abjer.github.io","full_name":"abjer/abjer.github.io","private":false,"owner":{"login":"abjer","id":6363844,"node_id":"MDQ6VXNlcjYzNjM4NDQ=","avatar_url":"https://avatars.githubusercontent.com/u/6363844?v=4","gravatar_id":"","url":"https://api.github.com/users/abjer","html_url":"https://github.com/abjer","followers_url":"https://api.github.com/users/abjer/followers","following_url":"https://api.github.com/users/abjer/following{/other_user}","gists_url":"https://api.github.com/users/abjer/gists{/gist_id}","starred_url":"https://api.github.com/users/abjer/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/abjer/subscriptions","organizations_url":"https://api.github.com/users/abjer/orgs","repos_url":"https://api.github.com/users/abjer/repos","events_url":"https://api.github.com/users/abjer/events{/privacy}","received_events_url":"https://api.github.com/users/abjer/received_events","type":"User","s

## Web APIs in Python (III/V)
*How do we extract anything useful from this type of response?*

We can start by inspecting the raw text of the response.

In [63]:
print(len(response.text))

print(response.text[:500])

61124
[{"id":111244798,"node_id":"MDEwOlJlcG9zaXRvcnkxMTEyNDQ3OTg=","name":"abjer.github.io","full_name":"abjer/abjer.github.io","private":false,"owner":{"login":"abjer","id":6363844,"node_id":"MDQ6VXNlcjYzNjM4NDQ=","avatar_url":"https://avatars.githubusercontent.com/u/6363844?v=4","gravatar_id":"","url":"https://api.github.com/users/abjer","html_url":"https://github.com/abjer","followers_url":"https://api.github.com/users/abjer/followers","following_url":"https://api.github.com/users/abjer/following{


## Web APIs in Python (IV/V)
*Not really there yet. Can we get something more meaningful or structured?*

Yes, the output of this API can be parsed as JSON

In [64]:
response_json = response.json()
response_json[0]

{'id': 111244798,
 'node_id': 'MDEwOlJlcG9zaXRvcnkxMTEyNDQ3OTg=',
 'name': 'abjer.github.io',
 'full_name': 'abjer/abjer.github.io',
 'private': False,
 'owner': {'login': 'abjer',
  'id': 6363844,
  'node_id': 'MDQ6VXNlcjYzNjM4NDQ=',
  'avatar_url': 'https://avatars.githubusercontent.com/u/6363844?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/abjer',
  'html_url': 'https://github.com/abjer',
  'followers_url': 'https://api.github.com/users/abjer/followers',
  'following_url': 'https://api.github.com/users/abjer/following{/other_user}',
  'gists_url': 'https://api.github.com/users/abjer/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/abjer/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/abjer/subscriptions',
  'organizations_url': 'https://api.github.com/users/abjer/orgs',
  'repos_url': 'https://api.github.com/users/abjer/repos',
  'events_url': 'https://api.github.com/users/abjer/events{/privacy}',
  'received_events_
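
Once parsed, the response is an ordinary list of dictionaries, so everything from the section on key based containers applies. The sketch below uses a hand-written stand-in with the same structure as the parsed response, trimmed to a few fields so that it runs without an internet connection; the second entry is made up.

```python
# Hand-written stand-in for the parsed response (second entry is hypothetical)
sample_response = [
    {'name': 'abjer.github.io', 'full_name': 'abjer/abjer.github.io',
     'private': False, 'owner': {'login': 'abjer', 'id': 6363844}},
    {'name': 'example-repo', 'full_name': 'abjer/example-repo',
     'private': False, 'owner': {'login': 'abjer', 'id': 6363844}},
]

# Each element is a dictionary, so fields are accessed by key
first_name = sample_response[0]['name']

# Nested dictionaries are accessed by chaining keys
owner_login = sample_response[0]['owner']['login']

# A list comprehension collects one field across all repositories
repo_names = [repo['name'] for repo in sample_response]

print(first_name, owner_login)
print(repo_names)
```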

## Web APIs in Python (V/V)
*And how can we see it even more clearly?*

In [67]:
import pprint #Data pretty printer
pprint.pprint(response.json()) #Everything is arranged alphabetically and in appropriate levels

[{'archive_url': 'https://api.github.com/repos/abjer/abjer.github.io/{archive_format}{/ref}',
  'archived': False,
  'assignees_url': 'https://api.github.com/repos/abjer/abjer.github.io/assignees{/user}',
  'blobs_url': 'https://api.github.com/repos/abjer/abjer.github.io/git/blobs{/sha}',
  'branches_url': 'https://api.github.com/repos/abjer/abjer.github.io/branches{/branch}',
  'clone_url': 'https://github.com/abjer/abjer.github.io.git',
  'collaborators_url': 'https://api.github.com/repos/abjer/abjer.github.io/collaborators{/collaborator}',
  'comments_url': 'https://api.github.com/repos/abjer/abjer.github.io/comments{/number}',
  'commits_url': 'https://api.github.com/repos/abjer/abjer.github.io/commits{/sha}',
  'compare_url': 'https://api.github.com/repos/abjer/abjer.github.io/compare/{base}...{head}',
  'contents_url': 'https://api.github.com/repos/abjer/abjer.github.io/contents/{+path}',
  'contributors_url': 'https://api.github.com/repos/abjer/abjer.github.io/contributors',
  '

## Text files
*How can we save a string as a text file?*

In [68]:
my_str = 'This is important...'
my_str2 = 'Written in Python!'

# mode 'a' appends: running this cell again adds the text once more
with open('my_file.txt', 'a') as f:
    f.write(my_str+'\n'+my_str2)

*How can we load a string from a text file?*

In [69]:
with open('my_file.txt', 'r') as f:    
    my_str_load = f.read()
print(my_str_load)

This is important...
Written in Python!This is important...
Written in Python!
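
The repeated lines in the output come from the file mode: `'a'` appends to whatever is already in the file, so running the cell twice writes the text twice. A small sketch of the difference between `'w'` and `'a'`, using a temporary folder and a hypothetical file name so that no existing files are touched:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')   # hypothetical file name

    # 'w' creates the file, or overwrites it if it already exists
    with open(path, 'w') as f:
        f.write('first\n')
    with open(path, 'w') as f:
        f.write('second\n')

    # 'a' appends to the end of the existing content
    with open(path, 'a') as f:
        f.write('third\n')

    with open(path, 'r') as f:
        content = f.read()

print(content)
```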


## JSON files
*How can we save a JSON file?*

The trick is to convert the JSON object (a Python list or dictionary) to a string. This can be done with `dumps` in the module `json`:

In [70]:
# import JSON module and convert to string
import json
response_json_str = json.dumps(response_json)

# save string as text file
with open('my_file.json', 'w') as f:
    f.write(response_json_str)

We can convert a string to JSON with `loads`.
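
As a sketch of the way back, the example below saves a small stand-in object to a file and reads it again, both via `loads` on the file's text and via `load`, which reads and parses in one step. The data and file name are illustrative, and a temporary folder is used so that nothing is left behind.

```python
import json
import os
import tempfile

data = [{'name': 'abjer.github.io', 'private': False}]   # small illustrative stand-in

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.json')   # hypothetical file name

    # dump() writes the container directly to an open file
    with open(path, 'w') as f:
        json.dump(data, f)

    # Read the raw text and parse it with loads() ...
    with open(path, 'r') as f:
        data_from_str = json.loads(f.read())

    # ... or let load() read and parse in one step
    with open(path, 'r') as f:
        data_from_file = json.load(f)

print(data_from_str == data, data_from_file == data)
```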

## File handling
*How can we remove a file?*

The module `os` can do a lot of file handling tasks, e.g. removing files:

In [71]:
import os

os.remove('my_file.json')
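
Note that `os.remove` raises a `FileNotFoundError` if the file does not exist, for example when the cell is run twice. A minimal sketch of a safer pattern, again with a temporary folder and a hypothetical file name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'to_delete.txt')   # hypothetical file name
    with open(path, 'w') as f:
        f.write('temporary content')

    existed_before = os.path.exists(path)

    # Only remove the file if it is actually there
    if os.path.exists(path):
        os.remove(path)

    existed_after = os.path.exists(path)

print(existed_before, existed_after)
```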

In [None]:
############################################
### NOT CLEAR WHERE THIS SLIDE SHOULD BE ###
############################################

import os

# Find current working directory
print(os.getcwd()) # Check working directory

# What files are stored here?
print(os.listdir(os.getcwd())) # Check files in working directory

# I want to access the folder 'my_csv_file'. How?

# I can also change the directory (note 'forward slash'!)
os.chdir('C:/Users/xtw562/Documents/isds2021-master/teaching_material/session_2/my_csv_file')
print(os.listdir(os.getcwd())) # Check files in working directory

# And change back again!
os.chdir('C:/Users/xtw562/Documents/isds2021-master/teaching_material/session_2')
print(os.listdir(os.getcwd())) # Check files in working directory

############################################
### NOT CLEAR WHERE THIS SLIDE SHOULD BE ###
############################################

import pandas as pd

# The simple case
data = pd.read_csv('my_csv_file/titanic.csv')
print(data.head(10))

print(data.dtypes) # Check the data type of each column

data = pd.read_csv(
    'data/files/complex_data_example.tsv',      # relative path to the file in a subdirectory
    sep='\t',                                   # tab-separated values file
    quotechar="'",                              # single quote allowed as quote character
    dtype={"salary": int},                      # parse the salary column as an integer
    usecols=['name', 'birth_date', 'salary'],   # only load the three columns specified
    parse_dates=['birth_date'],                 # interpret the birth_date column as a date
    skiprows=10,                                # skip the first 10 rows of the file
    na_values=['.', '??']                       # take any '.' or '??' values as NA
)