# Collecting Information from the Web

Some of the material is, again, gently stolen from [Doing Computational Social Science with Python](https://github.com/damian0604/bdaca/blob/master/book/bd-aca_book.pdf) written by Damian Trilling.

## 1. HTTP request methods and status codes

### 1.1. Connecting to online documents with `requests`

`requests` is an external library that facilitates downloading data from the Web into your Python Notebook. Similar to NLTK, `requests` doesn't load automatically but has to be imported. This is handled by the `import` statement.

In [2]:
import requests

With `requests` it becomes fairly easy to read data from the Web into your Notebook, especially "raw" text. We need to provide the `.get()` method with a string that contains a URL (Uniform Resource Locator). The URL below points to Franz Kafka's Metamorphosis on Gutenberg.org.

In [8]:
data = requests.get('http://www.gutenberg.org/cache/epub/5200/pg5200.txt')
print(data)

<Response [200]>



What operation has the `.get()` method actually performed? Printing the `data` variable does not give us the text, but a short message indicating the **Status Code**.

HTTP (or HyperText Transfer Protocol) contains methods to request data. The two most common ones (supported by all Web browsers) are **GET** and **POST**.

**[From Wikipedia](https://en.wikipedia.org/wiki/POST_(HTTP))**

- The **POST** request method requests that a web server accepts the data enclosed in the body of the request message, most likely for storing it;

- The HTTP **GET** request method retrieves information from the server. As part of a GET request, some data can be passed within the URL's query string, specifying (for example) search terms, date ranges, or other information that defines the query.

`requests.get()` is the Python tool to execute an HTTP GET request.

In other words, `.get()` returns a `Response` object; printing it shows the HTTP status code that the server sent back in answer to the GET request. 200 means that everything is OK. An overview of status codes is given [here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

One of the most infamous status codes is the [**404 Error**](https://en.wikipedia.org/wiki/HTTP_404), which indicates the server could not find the requested information.
<img src='https://s3.amazonaws.com/images.seroundtable.com/t-google-404-1303660172.jpg'>
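The meaning of a status code can also be looked up from within Python. The sketch below uses `HTTPStatus` from the standard library's `http` module (not part of `requests`; it is brought in here only for illustration) to print the official phrase for a few codes. The helper name `describe_status` is hypothetical.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found'."""
    status = HTTPStatus(code)  # raises a ValueError for unknown codes
    return str(code) + ' ' + status.phrase

for code in [200, 404, 500]:
    print(describe_status(code))
```

With `requests`, the numeric code of a response is available as the `.status_code` attribute, so `describe_status(data.status_code)` would describe the response downloaded above.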

To return to the above example: If we want to print the actual text, we need to access the `.text` attribute of the `data` object.

In [7]:
content = data.text
print(content[:90])

﻿The Project Gutenberg EBook of Metamorphosis, by Franz Kafka
Translated by David Wyllie.


### --Exercise--

Can you make a GET request that returns a 404 Error?

In [11]:
# insert your code here

### --Exercise--

- Assign the [complete works of William Shakespeare](https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt) to a variable with the name `sh_compl`;
- Use the `.get()` method from the requests library;
- Perform the operation in only **one line** of code


In [None]:
# insert your code here

### 1.2 Automatically importing texts

## Python Dictionaries and JSON

### Overview
- Dictionaries
- Counting words, mapping information
- Importing JSON data
- Nested dictionaries

## 1 Introduction

A dictionary resembles a list, but is a more general data type. In a list, the indices have to be **integers** (i.e. they refer to the position of an item in a sequence); in a dictionary, the indices can be of (almost) **any type**. 

Before inspecting dictionaries, let's revisit lists for just a moment.

Using the index operator, we can retrieve the elements at a certain position (for example my first friend):

In [None]:
# Example 
all_my_friends = ['John','Mary','Benny']
# retrieve element by index
my_first_friend = all_my_friends[0]
print(my_first_friend)

Imagine having to look up someone's number in a phone book. Here the index (or key) would be the names of all citizens with a telephone, and the values their numbers. A numerical index does not make sense here because **we want to retrieve the number by name, not by position in the book**. The same applies, of course, to a normal dictionary, where we'd look up translations or descriptions by word. 

Dictionaries provide you with the data structure that makes such tasks (**looking up values by keys**) exceptionally easy.

For example, if we look at the dictionary `telephone_numbers` below, what is Susan's phone number?

In Python you can easily look up a key (the element before the `:`) in a dictionary:

In [None]:
telephone_numbers = {'Frank': 4334030, 'Susan': 400230, 'Guido': 487239}
print(telephone_numbers)

... and now print Susan's telephone number:

In [None]:
print(telephone_numbers['Susan'])

In [None]:
# What is Guido's phone number?
print(telephone_numbers['Guido'])

Note how similar `telephone_numbers['Susan']` looks to retrieving the *n*-th element in a list, e.g. `my_list[n]`.

Of course, you could do something similar with a list (the look-up by key), but that would be very impractical.

In [None]:
telephone_numbers = ['Frank', 4334030, 'Susan', 400230, 'Guido', 487239]
print(telephone_numbers[telephone_numbers.index('Susan')+1])

That's pretty inefficient. The take-home message here is **that lists are not really suitable if we want to keep two pieces of information together**. Dictionaries come to the rescue!

Generally, you can think of a dictionary as **a mapping** between a set of indices (which are called keys) and a set of values. **Each key maps to a value.** 

The **association** of a key and a value is called a **key-value** pair or sometimes an **item**.

Essentially, a dictionary **is a mapping between keys and values**: for example words (key) to their frequencies (value), but we will cover other examples below.

## 2.2 Creating a dictionary

* a dictionary is surrounded by **curly brackets** 

* a dictionary consists of one or more **key:value pairs**, the key is the 'identifier' or "name" that is used to describe the value.
* the **keys** in a dictionary are **unique**
* the syntax for a key/value pair is: `key : value`
* and the **key/value** pairs (i.e. **items**) are separated by **commas**.
* the keys (e.g. 'Frank') in a dictionary have to be **immutable** (more precisely: hashable)
* the values (e.g. 8) in a dictionary can be **any python object**
* a dictionary can be empty


An empty dictionary:

In [None]:
x = {}

A mapping between English and German words:

In [None]:
english2deutsch = {'ambulance':'Krankenwagen',
                  'clever':'klug',
                  'concrete':'Beton'}

* Please note that **keys** in a dictionary have to be **immutable and unique**. Lists, therefore, cannot appear as keys. 
* **Anything** can be a value.


Because keys have to be immutable, a list cannot appear in this location. This should raise an error:

In [None]:
a_dict = {['a', 'list']: 8}
print(a_dict)

This should work:

In [None]:
a_dict = { 8:['a', 'list']}
print(a_dict)
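To make the rule about keys more concrete, here is a small sketch that contrasts a tuple (immutable, so allowed as a key) with a list (mutable, so refused). The dictionary `coordinates2city` and its contents are made up for this illustration.

```python
# a tuple is immutable (and hashable), so it can serve as a key
coordinates2city = {(52.37, 4.90): 'Amsterdam', (51.51, -0.13): 'London'}
print(coordinates2city[(52.37, 4.90)])

# a list is mutable, so Python refuses it as a key
try:
    broken = {[52.37, 4.90]: 'Amsterdam'}
except TypeError as error:
    message = str(error)
    print('TypeError:', message)
```

The `try` / `except` construction is only there so that the cell keeps running after the error; it catches the `TypeError` and prints its message.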

**Exercise**: correct the code below.

In [None]:
# incorrect
d = ['a'=4:
     'a':5,
     ['b'] = [1,2,34]
    'f':'bb'
    }

In [None]:
# correct
d = {'a' : 4,
     'a_2': 5,
     'b' : [1,2,34],
    'f':'bb'
    }
print(d)

**Exercise**: make a dictionary that maps three cities to the size of their population. Call it `city2population`.

In [None]:
city2population = {'Shanghai' : 24256800,'Beijing' : 21516000, 'Delhi': 16349831}
print(city2population)

### 2.2.1 Adding items to a dictionary

There is a very simple way to add a **key:value** pair to a dictionary. Please look at the following code snippet:

In [None]:
english2deutsch = dict()
#or try english2deutsch = {}
print(english2deutsch)

In [None]:
english2deutsch['one'] = 'eins'
english2deutsch['two'] = 'zwei'
english2deutsch['three'] = 'drei'
print(english2deutsch)

**Exercise**: add two more cities to `city2population`

In [None]:
city2population['Lagos'] = 16060303
city2population['Tianjin'] = 15200000
print(city2population)

Please note that key:value pairs get overwritten if you assign a different value to an existing key.

In [None]:
english2deutsch = dict()
print(english2deutsch)
english2deutsch['one'] = 'eins?'
print(english2deutsch)
english2deutsch['one'] = 'zwei?'
print(english2deutsch)
english2deutsch['one'] = 'drei?'
print(english2deutsch)

**Exercise**: overwrite the value for the last key you added.

In [None]:
city2population['Tianjin'] = 15200001
print(city2population)

## 2.3 Inspecting the dictionary

In a dictionary we store values we'd like to inspect later by their keys. **Common situations are**:
- mapping words to frequencies (values are integers, floats)
- mapping names to a person's individual characteristics (age, gender, etc) (values are strings or numbers)
- mapping dates to counts (creating timelines) (values are integers, floats)
- mapping bands to their songs titles (values are lists)

The most basic operation on a dictionary is a **look-up**. Simply enter the key and the dictionary returns the value. In the example below, we mapped movies to their box-office performance. Keys are the movie titles, and values represent the ticket sales.

In [None]:
bo = {'Avatar': 2787965087, 'Titanic': 2187463944, 'Star Wars: The Force Awakens': 2068223624}

In [None]:
print(bo['Avatar'])

If the key is not in the dictionary, Python will raise a ``KeyError``.

In [None]:
bo['The Lion King']

**Membership operators** are often used to check whether a dictionary contains a specific key. Let's assume we possess a dictionary that maps writers to their date of birth.

In [None]:
writer2dob = {'Edgar Allan Poe': 'January 19, 1809',
             'Virginia Woolf':'January 25, 1882',
             'James Joyce':'February 2, 1882'}

I am curious whether I appear in this illustrious set of authors. 

In [None]:
print('Do I appear in this dictionary?')
writer2dob['Kaspar von Beelen']
print('Yes!')

Damn, obviously I am not, and Python raises a `KeyError`, which would be annoying when running a larger program, because, as you've noticed, **the program did not get to the last `print()` statement** (meaning it crashed). So, let's write a little program that prints "X is in the dictionary" if a particular writer appears in the collection and "X is NOT in the dictionary" if it does not.

In [None]:
# Tip: a membership operation looks as follows:
'Edgar Allan Poe' in writer2dob

In [None]:
name = 'Kaspar von Beelen' 

print('Does ' + name +' appear in this dictionary?')
print('\n')

if name in writer2dob:
    print(writer2dob[name])
else:
    print(name + ' not in dictionary')

## 2.4 Dictionary Methods
### .get()

In order to avoid checking all the time for membership or getting a `KeyError` when a key does not appear in the dictionary, you can use the ``get`` method. The **first argument** is the **key** to look up, the **second argument** defines the **value** to be returned if the key is not found:

In [None]:
print(bo.get('The Lion King','Not in Dictionary'))
# a good alternative could be 
print(bo.get('Avatar',False))
print(bo.get('The Lion King',False))

Other methods allow us to access the different components of the dictionary:

### .keys()

The **keys** method returns the keys in a dictionary.

In [None]:
student_grades = {'Frank': 8, 'Susan': 7, 'Guido': 10}
the_keys = student_grades.keys()
print(the_keys)

### .values()

The **values** method returns the values in a dictionary.

In [None]:
the_values = student_grades.values()
print(the_values)

We can use the other built-in functions to inspect the keys and values. For example:

In [None]:
the_values = student_grades.values()
print(len(the_values)) # number of values in a dict
print(max(the_values)) # highest value of values in a dict
print(min(the_values)) # lowest value of values in a dict
print(sum(the_values)) # sum of all values of values in a dict

### .items()

The **items** method returns the key-value pairs as **tuples** (we have a look at tuples later), which allows us to easily loop through a dictionary. In Python 3 the result is a view object (`dict_items`) rather than a plain list.

In [None]:
student_grades = {'Frank': 8, 'Susan': 7, 'Guido': 10}
print(student_grades.items())
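In Python 3, `.items()` (like `.keys()` and `.values()`) returns a *view* on the dictionary rather than a separate list. The short sketch below, with a made-up two-student dictionary, shows what that means in practice: the view follows later changes to the dictionary, and it has to be converted with `list()` before it can be indexed.

```python
student_grades = {'Frank': 8, 'Susan': 7}
pairs = student_grades.items()      # a view, not a copy
student_grades['Guido'] = 10        # change the dictionary afterwards
print(len(pairs))                   # the view now contains three pairs

pairs_list = list(pairs)            # convert to a list to index or slice
print(pairs_list[0])
```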

### `sorted()`

Often we want to sort a dictionary by either its keys or its values. The Python `sorted()` function is very useful here.

In [None]:
student_grades = {'Frank': 8, 'Susan': 7, 'Guido': 10}
print(student_grades.items())

Sort by key:

In [None]:
sorted(student_grades.items())

Sort by value, in ascending order:

In [None]:
sorted(student_grades.items(),key=lambda x:x[1])

Sort by value, in descending order:

In [None]:
sorted(student_grades.items(),key=lambda x:x[1],reverse=True)

**Exercise**: Sort `word_counts`
- Alphabetically
- By word frequency in descending order:

In [None]:
word_counts = {',': 1, '.': 1, 'announced': 1, 'are': 1,  'be': 2, 'bosses': 1, 'by': 1,
         'company': 1,'failing': 1,'fines': 1,'government': 1, 'hit': 1, 'huge': 1,'irresponsible': 1,
         'line': 1, 'may': 1, 'own': 1,'pension': 1, 'plans': 1, 'protect': 1,'schemes': 1,
         'their': 1,'theresa': 1, 'to': 3, 'under': 1, 'weeks': 1, 'while': 1, 'who': 1,
         'with': 1, 'within': 1, 'workers': 1,}

In [None]:
print(sorted(word_counts.items()))
print(sorted(word_counts.items(),key = lambda x: x[1], reverse=True))

## 2.5 Iterating over dictionaries

Since dictionaries are iterable objects, we can iterate through our good reads collection as well. This will iterate over the *keys* of a dictionary:

In [None]:
good_reads = {"The Magic Mountain":9,
             "The Idiot":7,
             "Don Quixote": 9.5}

for book in good_reads:
    print(book)

We can also iterate over both the keys and the values of a dictionary. This is done as follows:

In [None]:
good_reads.items()

In [None]:
for x, y in good_reads.items():
    print(x + " has score " + str(y))

**Important**: Notice that we write `for x, y in` and not `for x in`. Because we are interested in titles and scores as separate items, we unpack the item.

In [None]:
# just to compare: without unpacking, x is a (title, score) tuple
for x in good_reads.items():
    print(x[0] + " has score " + str(x[1]))
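Unpacking is not specific to loops. As a minimal sketch (the variable names `item`, `title` and `score` are chosen for this example only), the same mechanism can be tried on a single pair:

```python
item = ("The Idiot", 7)       # one (title, score) pair
title, score = item           # unpacking: two names for two elements
print(title + " has score " + str(score))

# without unpacking, we have to index into the tuple
print(item[0] + " has score " + str(item[1]))
```

Both `print()` calls produce the same line; unpacking mainly makes the code easier to read.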

## 2.6 Example: Counting with dictionaries

Dictionaries are very useful to keep track of our data, for example by counting words:

In [None]:
sentence = 'Obama was the president of the USA' # assign the string to the variable sentence
words = sentence.split() # split the sentence
word2freq = {} # initialize an empty dictionary, here we store the word counts
# i.e. word2freq is a mapping from words to their frequencies
 

for word in words: # loop over all the words, word takes each word in turn
    if word in word2freq: # add 1 to the count if the key exists, here we perform a membership check on the keys
        word2freq[word] += 1 # notice that we use the shorthand for incrementing the count
                             # which is an abbreviation for  word2freq[word] =  word2freq[word] + 1
    else: # if the above condition does not hold (word does not appear as key in the dictionary) then set the key's value to one
        word2freq[word] = 1 # set default value to 1 if the key does not exist 

    print('Word added = ',word, 'Updated dictionary = ',word2freq)

print('\n')
print(word2freq)

A lot is happening in the previous code block; the examples below aim to clarify the individual steps.

#### `if` and `else`: see Notebook 3.1

In [None]:
# change x to see how if else works

x = 5

if x >= 0:
    print(x,' is positive or zero.')
else:
    print(x, ' is negative.')

##### Line 8 (and implicitly line 11): Membership check on the keys of the dictionary

In [None]:
w2fr = {'USA': 1, 'of': 1, 'president': 1, 'the': 2, 'was': 1, 'Obama': 1}


print('USA' in w2fr) # `in` performs a membership check on the keys if no method is appended to the dictionary
print(1 in w2fr) # it does not check whether an item appears among the values
print(1 in w2fr.values()) # unless you call the values method, of course

##### Line 9 and line 12: Updating (9) and setting (12) a dictionary key

Why do we distinguish between updating and setting a key? 

If we'd only update (which makes sense somehow) Python would raise a `KeyError` because the value for the key we want to increment does not appear yet in the dictionary. For this reason we explicitly set a default (start) value for each new key (i.e. each word which does not appear in the dictionary--in which case the membership condition `in` returns `False`)

In [None]:
w2fr = {'USA': 1, 'of': 1, 'president': 1, 'the': 2, 'was': 1, 'Obama': 1}

print('Obama has frequency ',w2fr['Obama'])
w2fr['Obama']+=1 
print('Obama has frequency ',w2fr['Obama'])

# remember +=1 is equal to var = var + 1 
# but this is not recommended
#w2fr['Obama'] = w2fr['Obama'] + 1 
#print('Obama has frequency ',w2fr['Obama'])

Now if we want to update the value for a word whose key is not in the dictionary, Python throws back a `KeyError`:

In [None]:
w2fr['Barack']+=1 

Therefore we have to add the key first, and update it once we encounter it a second time.

In [None]:
w2fr['Barack'] = 1 # set key
w2fr['Barack'] += 1 # update key, increment count with one

### `setdefault()` method

The `setdefault()` method simplifies the above code by automatically checking if a key exists, and if not, setting a default value for this key. This method takes two arguments, the key to be set, and the default value for the key.

In [None]:
sentence = 'Obama was the president of the USA' # assign the string to the variable sentence
words = sentence.split() # split the sentence
word2freq = {} # initialize an empty dictionary, here we store the word counts
# i.e. word2freq is a mapping from words to their frequencies
 

for word in words: # loop over all the words, word takes each word in turn
    
    word2freq.setdefault(word,0) # if the key does not appear yet, set its value to 0
    word2freq[word] += 1 # notice that we use the shorthand for incremental count
    
    print('Word added = ',word, 'Updated dictionary = ',word2freq)

print('\n')
print(word2freq)
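The `.get()` method from section 2.4 offers yet another way to write the same loop. The following is a sketch of that alternative, not a replacement for the code above: it looks up the current count, falls back to 0 for a new word, and stores the incremented value in a single line.

```python
sentence = 'Obama was the president of the USA'
word2freq = {}

for word in sentence.split():
    # look up the current count (0 if the word is new) and add one
    word2freq[word] = word2freq.get(word, 0) + 1

print(word2freq)
```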

### Easier ways to count words with Python

As counting items in a list is such a common task, Python comes with other tools that make it easier to obtain frequencies.

#### Counter()

In [None]:
from collections import Counter
sentence = 'Obama was the president of the USA he is no longer the president of the USA' # assign the string to the variable sentence
words = sentence.split() # split the sentence
wf = Counter(words)
print(wf) # Counter works very much like a dictionary
print(wf['the']) # you can look up a word by key
print(wf['aa']) # it returns zero if the word does not appear
print(wf.most_common(2)) # you can rank the words by their frequency

Again, to discover what you can do with a `Counter` object use `help()` or `dir()`.

### `pop()`

We can also delete keys with the `pop` method.

In [None]:
print(wf)
wf.pop('the')
print(wf)

### `update()`

`update()` merges two dictionaries. For `Counter` objects it adds the counts together, i.e. it combines the word counts of two different documents in this case.

In [None]:
help(Counter.update)

In [None]:
wf1 = Counter('Obama was the president of the USA '.split())
wf2 = Counter('he is no longer the president of the USA'.split())
print(wf1)
print(wf2)
wf1.update(wf2)
print('')
print(wf1)
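Note that `update()` behaves differently for a plain dictionary and for a `Counter`. The sketch below uses two small, made-up count dictionaries to show the difference: `dict.update()` overwrites the value of a shared key, whereas `Counter.update()` adds the counts.

```python
from collections import Counter

counts_a = {'the': 2, 'USA': 1}
counts_b = {'the': 2, 'he': 1}

plain = dict(counts_a)
plain.update(counts_b)        # dict.update replaces the value for 'the'
print(plain)

summed = Counter(counts_a)
summed.update(counts_b)       # Counter.update adds the counts for 'the'
print(summed)
```

When merging word frequencies of several documents, the `Counter` behaviour is usually the one we want.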

**Exercise**: download Volume I and II of Schopenhauer's 'The World as Will and Idea'
- As previously, use requests to download the page from Gutenberg
- `word_tokenize()` each of the books
- Compute the word frequencies using `Counter()`
- Get the hundred most common words (`most_common`)
- Which of the common words in volume I are not that common in volume II? 

In [None]:
import requests
from collections import Counter
from nltk.tokenize import word_tokenize

vol_i = 'http://www.gutenberg.org/files/38427/38427-0.txt'
vol_ii = 'http://www.gutenberg.org/files/40097/40097-0.txt'

In [None]:
text_vol_i = requests.get(vol_i).text
tokens = word_tokenize(text_vol_i)
wf = Counter(tokens)
wf.most_common(100)
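For the last question of the exercise, one possible approach is to compare the two lists of common words. The sketch below demonstrates the idea on two tiny, invented token lists that stand in for the downloaded volumes; the names `tokens_i`, `tokens_ii`, `top_i`, `top_ii` and `only_in_i` are assumptions made for this example.

```python
from collections import Counter

# two toy "volumes", stand-ins for the tokenized books
tokens_i = 'will will will idea idea world'.split()
tokens_ii = 'will will will body body idea'.split()

# most_common() returns (word, count) pairs; keep only the words
top_i = [word for word, count in Counter(tokens_i).most_common(2)]
top_ii = [word for word, count in Counter(tokens_ii).most_common(2)]

# words that are common in the first volume but not in the second
only_in_i = [word for word in top_i if word not in top_ii]
print(only_in_i)
```

For the real exercise, the toy lists would be replaced by the output of `word_tokenize()` and `most_common(2)` by `most_common(100)`.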

### 2.6.1 Count song titles by year (making timelines)

Now let's apply these techniques to studying our song titles corpus. 

First we make a little program that collects the counts of a search term by year. This will allow us to make timelines that plot the evolution of a topic by year. 

In [None]:
# load the data
import requests
from nltk.tokenize import word_tokenize
url = 'https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/tracks_per_year.txt'
data = requests.get(url).text.strip() # download the song titles
song_titles = data.strip().split('\n') # split the string into lines

In [None]:
year2counts = {} # create an empty dictionary here we map years to the frequency of a word

search = 'love' # define your search term here

for row in song_titles: # iterate over the song_titles variable

    year,song_id,group,title = row.split('<SEP>') # unpack the row using multiple assignment
    # cast year as an integer with int()
    year = int(year)
    # set the default value for key year to 0
    year2counts.setdefault(year,0)
    # lowercase the title
    title_lower = title.lower()
    # tokenize the lowercased title
    words = word_tokenize(title_lower)
    
    if search in words: # membership check, does the word love appear in the list called words
        year2counts[year] +=1 # add one if the condition holds

# print the results sorted by year
print(sorted(year2counts.items()))

We can easily plot the time series using functions from an external library called **Pandas**. The code below is just to help you plot your data, do not worry about it now.

In [None]:
import pandas
%matplotlib inline
series = pandas.Series(year2counts)
series.plot()

**Question**: Can we conclude that "love" has become a more popular theme in pop culture over time?

Ideally, we'd like to calculate the probability that a song from a certain year is about love. This is relatively straightforward: we have to divide the number of songs about love from year X by the total number of songs from year X.

The program below therefore features a small addition: the dictionary `year2counts_all`, which *also* tracks the total number of songs by year.

In [None]:
year2counts = {} # create an empty dictionary here we map years to the frequency of a word
year2counts_all = {}

search = 'love' # define your search term here

for row in song_titles: # iterate over the song_titles variable

    year,song_id,group,title = row.split('<SEP>') # unpack the row using multiple assignment
    # cast year as an integer with int()
    year = int(year)
    # set the default value for key year to 0
    year2counts.setdefault(year,0)
    # lowercase the title
    title_lower = title.lower()
    # tokenize the lowercased title
    words = word_tokenize(title_lower)
    
    year2counts_all.setdefault(year,0)
    year2counts_all[year]+=1
    
    if search in words: # membership check, does the word love appear in the list called words
        year2counts[year] +=1 # add one if the condition holds

# print the results sorted by year
print(sorted(year2counts.items()))

If we take a moment to study the total number of songs by year, we also see an increase:

In [None]:
# plot the total number of songs by year
series = pandas.Series(year2counts_all)
series.plot()

For this reason, the **relative frequency of a word will tell us more than the absolute frequency**--almost every search term we define will show an increase over time. Below we add a few more lines to divide the number of songs about love **by the total number of songs** for that specific year. This is called **normalization**.

Because we defined the two mappings (year to counts) earlier, this becomes a relatively straightforward task:

In [None]:
# calculate the probability that a song is about love 
# create an empty dictionary

ratios = {}

for year, count in year2counts.items():
    ratios[year] = count / year2counts_all[year]

In [None]:
# plot the results
pandas.Series(ratios).plot()
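To see why normalization matters, consider the following toy example. The numbers are invented for illustration only and do not come from the song corpus: the absolute count rises from 2 to 10, but because the total number of songs grows even faster, the relative frequency drops.

```python
# toy numbers, invented for illustration only
love_counts = {1990: 2, 2000: 10}     # songs mentioning the search term
all_counts = {1990: 20, 2000: 200}    # all songs released in that year

toy_ratios = {}
for year, count in love_counts.items():
    # relative frequency = songs with the term / all songs of that year
    toy_ratios[year] = count / all_counts[year]

print(toy_ratios)
```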

Exercise: Can you plot the chronological evolution of another term (such as "dirty")?

In [None]:
# copy-paste your code here

## 2.7 Recap

To finish this section, here is an overview of the new concepts and functions you have learnt. Make sure you understand them all.

-  dictionary
-  indexing or accessing keys of dictionaries
-  adding items to a dictionary
-  `.keys()`
-  `.values()`
-  `.get()`
-  `pandas.Series().plot()`

## 2.8 JSON

The previous section covered the most basic elements of Python dictionaries. You know now how to assign values to variables, keys to values etc. A variable is a **box** that can contain almost anything. Instead of strings and integers, we can also analyse  **a whole corpus of Tweets**.

As an example we use all tweets of Donald Trump, the American President at the time of writing. These we obtained via the [Trump Twitter Archive](http://www.trumptwitterarchive.com/archive).

The database is a [JSON](https://en.wikipedia.org/wiki/JSON) file in which each item is a tweet. The cell below shows the first tweet of the collection. It may seem difficult to read JSON notation, but there are various tools to help you. Go for example to this [JSON viewer](http://jsonviewer.stack.hu/) and copy-paste the text from the cell below into it.

Basically, a JSON object combines Python lists and dictionaries. As it is a very common data type, Python has some libraries to process and read JSON data.

``{"source":"Twitter for iPhone",
   "text":"The Tax Cut Bill is coming along very well, great support. With just a few changes, some mathematical, the middle class and job producers can get even more in actual dollars and savings and the pass through provision becomes simpler and really works well!",
   "created_at":"Mon Nov 27 14:24:36 +0000 2017",
   "retweet_count":15663,
   "favorite_count":79868,
   "is_retweet":false,
   "id_str":"935152378747195392"}``

This looks almost exactly like a dictionary! To **convert** this JSON item to a proper Python object we can use the `json.loads()` function from the `json` library. `loads()` reads a string and transforms it into a Python object.

In [None]:
import json
tweet = json.loads('''{"source":"Twitter for iPhone",
                    "text":"The Tax Cut Bill is coming along very well, great support. With just a few changes, some mathematical, the middle class and job producers can get even more in actual dollars and savings and the pass through provision becomes simpler and really works well!",
                    "created_at":"Mon Nov 27 14:24:36 +0000 2017",
                    "retweet_count":15663,
                    "favorite_count":79868,
                    "is_retweet":false,
                    "id_str":"935152378747195392"}''')
print(tweet)

Now we can treat the `tweet` as a dictionary...

In [None]:
# print the number of retweets of the tweet defined above
print(tweet['retweet_count'])
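The conversion also works in the other direction. The sketch below uses a short, made-up JSON string (the variable names `raw`, `record` and `back` are assumptions for this example) to show how `json.loads()` translates JSON values into their Python counterparts, and how `json.dumps()` turns the dictionary back into a string.

```python
import json

raw = '{"is_retweet": false, "retweet_count": 15663, "source": null}'

record = json.loads(raw)          # JSON string -> Python dictionary
print(record)                     # false becomes False, null becomes None

back = json.dumps(record)         # Python dictionary -> JSON string
print(back)
```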

We can also load the whole corpus from disk, but we need some additional syntax.

In [None]:
with open('data/trump_tweets.json','r') as in_file: # the file is closed automatically afterwards
    data = json.load(in_file)
print(data[:2])

If the previous cell raised a `UnicodeError`, try this:

In [None]:
import codecs
data = json.load(codecs.open('data/trump_tweets.json','r',encoding='utf-8'))
print(data[:2])

We can now see that the loaded JSON corpus is actually a `list`

In [None]:
type(data)

Each item in this `list` is a `dict`.

In [None]:
type(data[0])

**Exercise**: Store all retweet counts in a list named `retweet_counts`. Ignore the retweets.

In [None]:
retweet_counts = []
for tweet in data:
    if not tweet['is_retweet']: # ignore the retweets
        retweet_counts.append(tweet['retweet_count'])
print(retweet_counts)

**Exercise**: What is the maximum retweet count? What is the minimum?
> use max(list)

In [None]:
max(retweet_counts)

**Exercise**: Compute the average retweet count. Use the `sum` and `len` functions.
> average = sum of items in the list / number of items in the list

In [None]:
average = sum(retweet_counts)/len(retweet_counts)
print(average)

The numpy library provides tools for computing such values.

In [None]:
import numpy as np
mock_example = [i**2 for i in range(4,100,3)]
print(mock_example)
print(np.mean(mock_example))
print(np.median(mock_example))

Dictionaries are useful for **mapping** two series. Let's map the 'id_str' to the actual 'text' of the tweet:

In [None]:
id2text = {}
for tweet in data:
    id2text[tweet['id_str']] = tweet['text']

In [None]:
list(id2text.items())[:4]

**Exercise**: Printing the most popular tweets.
- Map 'text' to 'retweet_count' (Assume for a moment that each tweet text is unique);
- sort the mapping by value in descending order with `sorted()`;
- and print the text of the five most popular tweets.

In [None]:
text2retweet = {}
for tweet in data:
    text2retweet[tweet['text']] = tweet['retweet_count']

sorted(text2retweet.items(),key = lambda x: x[1],reverse=True)[:5]

**Exercise**: Lastly, make frequency dictionaries for tweets whose retweet count is above average and for those whose count is not. Collect the words of the former in `popular`, the others in `not_popular`.

In [None]:
popular, not_popular = [],[]
for tweet in data:
    if tweet['retweet_count'] > average:
        popular.extend(word_tokenize(tweet['text']))
    else:
        not_popular.extend(word_tokenize(tweet['text']))
        
print(Counter(popular).most_common(100))
print(Counter(not_popular).most_common(100))

## Vader Sentiment Analyzer
[from Github](https://github.com/cjhutto/vaderSentiment): VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

VADER uses a lexicon (a mapping of words to sentiment values, e.g. bad=-1.0, good=+1.0) to compute the sentiment (positivity or negativity) of a text.
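To get a feeling for what a lexicon-based approach does, here is a deliberately naive sketch. It is **not** VADER's actual algorithm: the lexicon `toy_lexicon`, its values and the function `toy_sentiment` are all invented for this illustration, and the function simply sums the values of the words it recognises.

```python
# a tiny, made-up lexicon; VADER's real lexicon is far larger
toy_lexicon = {'good': 1.0, 'great': 2.0, 'bad': -1.0, 'awful': -2.0}

def toy_sentiment(text):
    """Sum the lexicon values of the words in a text (unknown words count as 0)."""
    total = 0.0
    for word in text.lower().split():
        total += toy_lexicon.get(word, 0.0)
    return total

print(toy_sentiment('A great film with a bad ending'))
print(toy_sentiment('Not good'))
```

The second example receives a positive score although the sentence is negative. This is why a tool such as VADER combines its lexicon with rules, for instance for negation.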

In [None]:
import nltk
nltk.download('vader_lexicon')

In [None]:
from nltk.sentiment import vader
analyzer = vader.SentimentIntensityAnalyzer()

Below you can test VADER yourself by changing the value of the ``text`` variable, and running the code block. 

Can you trick the system? Not very easy, is it?!

In [None]:
text = "Not interesting."
vs = analyzer.polarity_scores(text)['compound']
print("{:_<65} {}".format(text, str(vs)))

**Exercise**: Find the most positive and negative tweets in the corpus. Map the tweet texts to their sentiment values.

In [None]:
text2sentiment = {}
for tweet in data:
    text2sentiment[tweet['text']] = analyzer.polarity_scores(tweet['text'])['compound']
sorted(text2sentiment.items(),key = lambda x:x[1])

### Example: JSON and the New York Times API
Retrieving data from the **New York Times API** (Application Programming Interface).


Below you see an example of a JSON file that represents an article retrieved via the New York Times API. Copy-paste this example into http://jsonviewer.stack.hu/ to inspect the document's structure.

``
{"status":"OK", "copyright":"Copyright (c) 2017 The New York Times Company. All Rights Reserved.", "response": {"docs":[{"web_url": "https://query.nytimes.com/gst/abstract.html?res=9D03E5D71E3AE433A25753C3A9649D946696D6CF", "snippet":"\"But Colonel ROOSEVELT,\" I suggested, \"is advocating universal service as a permanent thing.\"","abstract":"for endless militarism, editorial","print_page":"E2","blog":{},"source":"The New York Times","multimedia":[],"headline":{"main":"FOR ENDLESS MILITARISM."},"keywords":[{"isMajor":null,"rank":0,"name":"subject","value":"EUROPEAN WAR"},{"isMajor":null,"rank":0,"name":"subject","value":"PEACE AND MEDIATION"},{"isMajor":null,"rank":0,"name":"subject","value":"OFFICIAL OVERTURES AND STATEMENTS"}],"pub_date":"1917-12-30T00:00:00Z", "document_type":"article","type_of_material":"Editorial","_id":"4fc079c745c1498b0d307c64","word_count":692,"score":0.0}]}}
``

Please note that the JSON document has a **tree-like shape** (head nodes with branches).
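Walking down such a tree in Python means chaining look-ups: a key for every dictionary and an index for every list. The sketch below applies this to a shortened copy of the example response (only a few fields are kept; the variable names are assumptions for this illustration).

```python
import json

# a shortened version of the response shown above
raw = '''{"status": "OK",
          "response": {"docs": [{"headline": {"main": "FOR ENDLESS MILITARISM."},
                                 "keywords": [{"name": "subject", "value": "EUROPEAN WAR"},
                                              {"name": "subject", "value": "PEACE AND MEDIATION"}],
                                 "word_count": 692}]}}'''
result = json.loads(raw)

first_doc = result['response']['docs'][0]     # dict -> dict -> list -> dict
headline = first_doc['headline']['main']      # one branch further down
subjects = [keyword['value'] for keyword in first_doc['keywords']]

print(headline)
print(subjects)
```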

Many institutions provide web APIs, allowing users to access information about their collections in JSON via simple HTTP requests. But how to access the historical archive of the New York Times?

First, get an API Key on: https://developer.nytimes.com/

In the examples below, the api-key is replaced by ``###``. Please put your own key there.

After you received the key, you can interact with the API using URLs as queries. The URL below is an example in which I wanted to find all articles mentioning "Armenia" during the First World War. 

https://api.nytimes.com/svc/search/v2/articlesearch.json?q=armenia&begin_date=19140101&end_date=19180101&api-key=###

The URL contains the following parts:
* The **base URL** is **``https://api.nytimes.com/svc/search/v2/articlesearch.json``**.
* The base URL is followed by a question mark **``?``**, which introduces the **query**.
* The query contains different **parameters**: **``q``**, **``begin_date``**, **``end_date``** and **``api-key``**.
* We want to search for the term "Armenia" roughly during the First World War.
* A list of all available parameters can be found [here](https://developer.nytimes.com/article_search_v2.json#/Documentation/GET/articlesearch.json): click **Show details**.
* All the parameters are joined using the **``&``** symbol.
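The anatomy described in this list can be reproduced with Python's standard library. The sketch below is only an illustration: it uses `urllib.parse` rather than `requests`, and `'YOUR_KEY'` is a placeholder, not a working key.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'
params = {
    'q': 'armenia',
    'begin_date': '19140101',
    'end_date': '19180101',
    'api-key': 'YOUR_KEY',  # placeholder, not a real key
}

# urlencode joins the key=value pairs with '&' and escapes special characters
url = base_url + '?' + urlencode(params)
print(url)

# Going the other way: split a URL back into its parts
parts = urlsplit(url)
print(parts.path)             # the path of the base URL
print(parse_qs(parts.query))  # the parameters as a dictionary of lists
```

Building the query from a dictionary makes it easy to change a single parameter without retyping the whole URL.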

In [None]:
import requests
# read the API key from a local text file; strip() removes a trailing newline
key = open('/Users/kasparbeelen/Desktop/apikey.txt','r').read().strip()
#key = # put your API key here
url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?q=armenia&begin_date=19140101&end_date=19180101&api-key='
call = url+key
data = requests.get(call).json()

**Exercise**: Inspect the data using the JSON Viewer. Run `import json`, then `print(json.dumps(data))`, and copy-paste the result [here](http://jsonviewer.stack.hu/).

All the articles are hidden under the 'response' key and then the 'docs' key.

In [None]:
data['response']['docs']

**Exercise**: Print the abstract of each article. Use the `.get()` method and return "NaN" if the article lacks an abstract.

In [None]:
for doc in data['response']['docs']:
    print(doc.get('abstract','NaN'))

This prints only the first ten hits, while the total number of articles that mention Armenia is:

In [None]:
data['response']['meta']['hits']

To retrieve the other results, we have to set the 'page' parameter:

In [None]:
base_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?'
# instead of typing the URL, we use a dictionary to save the specifics of the query
query_dict = {
                'q':'armenia',
                'begin_date':'19140101',
                'end_date':'19180101',
                'page':'3', # change the value here to download the other articles
                'api-key':key}

query = '&'.join(['='.join(item) for item in query_dict.items()])
print(query[:-4])
url = base_url + query
print(url)
data = requests.get(url).json()
for doc in data['response']['docs']:
    print(doc.get('abstract','NaN'))
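Changing `page` by hand quickly becomes tedious. As a rough sketch, the helper functions below (`pages_needed` and `page_queries` are hypothetical names, not part of `requests` or the NYT API) work out how many pages a query needs and produce one query dictionary per page. The sketch assumes ten results per page, as observed above, and that page numbering starts at 0; check the API documentation before relying on either.

```python
import math

def pages_needed(hits, per_page=10):
    """Number of result pages required to cover `hits` results."""
    return math.ceil(hits / per_page)

def page_queries(query_dict, hits, max_pages=None):
    """Yield one copy of query_dict per page, with 'page' set to '0', '1', ..."""
    n_pages = pages_needed(hits)
    if max_pages is not None:
        n_pages = min(n_pages, max_pages)  # optional cap on the number of pages
    for page in range(n_pages):
        query = dict(query_dict)           # copy, so the original stays untouched
        query['page'] = str(page)
        yield query

# Example with a made-up hit count of 25: three pages are needed
for query in page_queries({'q': 'armenia'}, hits=25):
    print(query)
```

Each dictionary can then be turned into a URL and requested exactly as in the cell above. APIs typically limit how many requests you may send in a given period, so it is wise to keep `max_pages` small while experimenting.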

**Exercise**: Print the first 10 abstracts of the articles between 1938 and 1940 that mention the pattern 'jew'.

In [None]:
base_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?'
# instead of typing the URL, we use a dictionary to save the specifics of the query
query_dict = {
                'q':'jew',
                'begin_date':'19380101',
                'end_date':'19400101',
                'page':'0', # pages are numbered from 0, so page 0 holds the first ten hits
                'api-key':key}

query = '&'.join(['='.join(item) for item in query_dict.items()])
print(query[:-4])
url = base_url + query
print(url)
data = requests.get(url).json()
for doc in data['response']['docs']:
    print(doc.get('abstract','NaN'))

Once you know how to use one API, learning the others will be a piece of cake. Obtaining data from, for example, the [Guardian](http://open-platform.theguardian.com/) or ['Chronicling America'](https://chroniclingamerica.loc.gov/) is very similar. Let's have a look at Chronicling America.

In [None]:
state = 'New York'
year = '1865'
url="http://chroniclingamerica.loc.gov/search/pages/results/?state=" +state + "&date1=" + year + "&date2=" + year + "&dateFilterType=yearRange&sequence=1&sort=date&rows=100&format=json"
print(url)

Look at the value pairs after the ? to understand the parameters of the search. Explicitly, the first query URL will ask for newspapers:

- from New York (state=New York)
- from the year 1865 (date1=1865&date2=1865&dateFilterType=yearRange)
- only the front pages (sequence=1)
- sorting by date (sort=date)
- returning a maximum of 100 results (rows=100)
- in JSON (format=json)
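Instead of reading the value pairs by eye, you can let Python split the query string for you. This is a small sketch using the standard-library module `urllib.parse`; it rebuilds the same URL as the cell above so that it can run on its own.

```python
from urllib.parse import urlsplit, parse_qs

state = 'New York'
year = '1865'
url = ("http://chroniclingamerica.loc.gov/search/pages/results/?state=" + state
       + "&date1=" + year + "&date2=" + year
       + "&dateFilterType=yearRange&sequence=1&sort=date&rows=100&format=json")

# Everything after the '?' is the query string; parse_qs turns it into a dict
params = parse_qs(urlsplit(url).query)
for name, values in params.items():
    print(name, '=', values[0])  # each value is a list, here with one element
```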

In [None]:
newspaper_data = requests.get(url).json()

In [None]:
len(newspaper_data['items'])

In [None]:
print(newspaper_data['items'][0].keys())

In [None]:
print(newspaper_data['items'][0]['ocr_eng'])

### The Google Books API

In [None]:
from urllib.request import urlopen
import json
from pprint import pprint
antwoord=urlopen("https://www.googleapis.com/books/v1/volumes?q=shakespeare").read()
data=json.loads(antwoord.decode("utf-8"))
pprint(data)
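The printed response is long. Below is a hedged sketch of how to pull out just the titles and authors. The `sample` dictionary is a hand-written stand-in that mimics the general shape of a Google Books response (an `items` list whose entries carry a `volumeInfo` dictionary). It is not real API output, and the field names should be checked against what `pprint(data)` shows.

```python
# Hand-written stand-in for `data`; NOT real API output
sample = {
    'totalItems': 2,
    'items': [
        {'volumeInfo': {'title': 'Hamlet', 'authors': ['William Shakespeare']}},
        {'volumeInfo': {'title': 'A title without authors'}},
    ],
}

summaries = []
for item in sample.get('items', []):
    info = item.get('volumeInfo', {})                  # empty dict if the key is missing
    title = info.get('title', 'NaN')
    authors = ', '.join(info.get('authors', ['NaN']))  # authors is a list of names
    summaries.append((title, authors))
    print(title, '|', authors)
```

Replacing `sample` with `data` applies the same loop to the actual response.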

## 2.9 Advanced Examples: Nested Dictionaries

### Advanced Example 1: Topical shifts over time

The example in which we plotted the evolution of 'love' was useful, but it is quite slow if we'd like to plot many other timelines: for each query, the program iterates over all the 500,000+ songs. We can make this more efficient by using **nested dictionaries**.

... What?

In [None]:
a_nested_dict = {1960:{'a':5,'the':9},
                1961:{'a':3,'the':10}}

How to access the word frequencies?

In [None]:
print(a_nested_dict[1960])
print(a_nested_dict[1960]['a'])

In [None]:
# print the value of 'the' in the year 1961

In a nested dictionary, a key maps to a value of type `dict`. In the example we map years to word frequencies for that year (we map years to a mapping between words and their frequencies). This makes computing the historical change over time for different words much faster.
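Because both levels are ordinary dictionaries, you can loop over them with `.items()`. The following minimal sketch reuses the toy dictionary from above; `a_by_year` is just an illustrative name.

```python
a_nested_dict = {1960: {'a': 5, 'the': 9},
                 1961: {'a': 3, 'the': 10}}

# Outer loop: one (year, inner dictionary) pair at a time
for year, counts in a_nested_dict.items():
    # Inner loop: one (word, frequency) pair at a time
    for word, freq in counts.items():
        print(year, word, freq)

# Follow a single word through the years
a_by_year = {year: counts.get('a', 0) for year, counts in a_nested_dict.items()}
print(a_by_year)
```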

### The speed of your program: `Counter.update()` vs `dict.setdefault()`

In Python there are many ways to obtain the same result. The most important factor is often **speed**.

In [None]:
# install tqdm to monitor the progress of your script
!pip install tqdm

In [None]:
from collections import Counter
from nltk.tokenize import word_tokenize
from tqdm import tqdm

wf = {} # an empty dictionary where we will save the word frequencies by year
for row in tqdm(song_titles): # iterate over the song_titles variable

    year,song_id,group,title = row.split('<SEP>') # parse the row using multiple assignment
    year = int(year) # cast year as an integer

    # make sure each year has its own Counter before we update it
    wf.setdefault(year,Counter())

    words = word_tokenize(title.lower()) # split the lowercased string into words

    # add this title's words to the yearly word frequencies
    wf[year].update(words)

In [None]:
print(wf[1960]['a'])
print(wf[1961]['a'])

In [None]:
wf = {} # an empty dictionary where we will save the word frequencies by year
from tqdm import tqdm
for row in tqdm(song_titles): # iterate over the song_titles variable

    year,song_id,group,title = row.split('<SEP>') # parse the row using multiple assignment
    year = int(year) # cast year as an integer

    # make sure each year has its own (plain) dictionary
    wf.setdefault(year,{})

    words = word_tokenize(title.lower()) # split the lowercased string into words

    # count each word, starting from 0 the first time we see it in this year
    for w in words:
        wf[year].setdefault(w,0)
        wf[year][w] +=1

Now we have to loop through our corpus only once to get the frequency of a word at a certain point in time!

In [None]:
print(wf[1960]['a'])
print(wf[1961]['a'])
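To compare the two approaches yourself, you can time them. The sketch below is an assumption-laden toy: it uses synthetic rows instead of the real `song_titles`, and `str.split()` instead of `word_tokenize`, so that it runs without any data or NLTK. The function names `with_counter` and `with_setdefault` are made up for this illustration, and the timings will differ from machine to machine.

```python
import time
from collections import Counter

# Synthetic stand-in for song_titles: 'year<SEP>id<SEP>group<SEP>title'
rows = ['%d<SEP>id%d<SEP>band<SEP>love me tender love' % (1960 + i % 5, i)
        for i in range(20000)]

def with_counter(rows):
    """Yearly word frequencies using Counter.update()."""
    wf = {}
    for row in rows:
        year, song_id, group, title = row.split('<SEP>')
        wf.setdefault(int(year), Counter()).update(title.lower().split())
    return wf

def with_setdefault(rows):
    """Yearly word frequencies using plain dictionaries and setdefault()."""
    wf = {}
    for row in rows:
        year, song_id, group, title = row.split('<SEP>')
        counts = wf.setdefault(int(year), {})
        for w in title.lower().split():
            counts.setdefault(w, 0)
            counts[w] += 1
    return wf

timings = {}
for func in (with_counter, with_setdefault):
    start = time.perf_counter()
    result = func(rows)
    timings[func.__name__] = time.perf_counter() - start
    print(func.__name__, round(timings[func.__name__], 3), 'seconds')
```

Both functions produce the same counts, so the only thing to compare is the elapsed time.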

Of course, a word might not appear in a given year. To avoid a `KeyError`, we use the `.get()` method.

In [None]:
# this does not work
wf[1960]['madonna']

In [None]:
# this works, do you understand the syntax?
wf[1960].get('madonna',0.0)
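Whether the square-bracket lookup fails depends on the type of the inner dictionary. As a small illustration on toy data (the names `plain` and `counted` are made up), a plain `dict` raises a `KeyError` for an unseen word, whereas a `Counter` simply reports 0:

```python
from collections import Counter

plain = {'love': 3}
counted = Counter({'love': 3})

print(plain.get('madonna', 0))   # .get() returns the default instead of raising
print(counted['madonna'])        # a Counter reports 0 for unseen keys

try:
    plain['madonna']             # square brackets on a plain dict raise KeyError
except KeyError as err:
    missing = err.args[0]
    print('KeyError for', missing)
```

Using `.get()` works for both types, which is why it is the safer habit.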

The code below returns the same result as the long program above but is much faster, because we prepared everything as a nested dictionary!

In [None]:
search = 'tears' # define query
results = {} # empty dictionary to save frequencies by years
for year in wf: # loop over all the keys in the wf dictionary which are the years
    results[year] = wf[year].get(search,0.0) # get the value for word search in year year
pandas.Series(results).plot() # plot the results

Of course, we can also **normalize** the results:

In [None]:
search = 'tears' # define query
results = {} # empty dictionary to save frequencies by years
for year in wf: # loop over all the keys in the wf dictionary which are the years
    total = sum(wf[year].values()) # the sum of all the values equals the total word count for that year
    results[year] = wf[year].get(search,0.0) / total # get the value for word search in year year and divide it by total 
pandas.Series(results).plot() # plot the results

In [None]:
# to understand how `total` is computed in the cell above
a_nested_dict = {1960:{'a':5,'the':9},
                1961:{'a':3,'the':10}}
print(a_nested_dict[1960])
print(a_nested_dict[1960].values())
print(sum(a_nested_dict[1960].values()))

**Exercise**: So far we only plotted the evolution of **one word over time**. A more realistic scenario would be to monitor topics. Instead of just one word, we'd like to see the presence of a set of semantically related words. Adapt the program below: it should iterate over a list of words and plot total word count over time.
> Tip: use two for loops

In [None]:
search = ['tears','cry','crying'] # define query
results = {} # empty dictionary to save frequencies by years
for year in wf: # loop over all the keys in the wf dictionary which are the years
    total = #??
    results.setdefault(#??)
    for s in search:
        #??
    results[year] = results[year] / total # get the value for word search in year year and divide it by total 
pandas.Series #?? plot the results

**Exercise**: In the previous exercises we only inspected changes over time. Can you make a nested dictionary which keeps track of word counts by band instead of by year?

In [None]:
wf = {} # an empty dictionary where we will save 
from tqdm import tqdm
for row in tqdm(song_titles): # iterate over the song_titles variable

    year,song_id,group,title = row.split('<SEP>') # parse the row using multiple assignment
    
    # here start collecting word frequencies by band
    for w in words:
    

**Exercise**: Rank the bands by their total word count.
> Tip: the total word count is equal to the sum of the values, as in `sum(wf[band_name].values())`

In [None]:
counts = Counter()

for band in wf:
    counts[band] = 
    # ??
counts.most_common(100)

In [None]:
# or use sorted

**Exercise**: Map each group to the frequency with which they use the word 'love'.

In [None]:
counts = Counter()

for band in wf:



### Advanced Example 2: The Lexical Diversity of Pop Culture

We can now also estimate (approximately) whether the topics of song titles are becoming more or less diverse.

This can be done by computing the type-token ratio or the "lexical diversity".

From the [NLTK book](http://www.nltk.org/book/ch01.html):
    
> A token is the technical name for a sequence of characters — such as hairy, his, or :) — that we want to treat as a group. [...] A word type is the form or spelling of the word independently of its specific occurrences in a text — that is, the word considered as a unique item of vocabulary. 

The type-token ratio is then a measure of lexical diversity. It can be calculated as follows:

In [None]:
song_title = 'love love love all I want is candy'
tokens = song_title.split()
print(tokens)
types = set(tokens)
print(types)
ratio = len(types) / len(tokens)
print(ratio)

The maximum lexical diversity is one (each word in the corpus appears only once).

In [None]:
song_title = 'love all I want is candy'
tokens = song_title.split()
print(tokens)
types = set(tokens)
print(types)
ratio = len(types) / len(tokens)
print(ratio)

Now we can compute the lexical diversity of song titles by year.

In [None]:
lexdiv = {}
for year in wf:
    lexdiv[year] = len(wf[year]) / sum(wf[year].values())
pandas.Series(lexdiv).plot()
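To check that the yearly computation does what we expect, it helps to try it on a toy example where the answer can be worked out by hand. In this sketch, `type_token_ratio`, `toy_wf` and `toy_lexdiv` are illustrative names, and the counts are invented.

```python
def type_token_ratio(counts):
    """Number of distinct words divided by the total number of words."""
    total = sum(counts.values())
    return len(counts) / total if total else 0.0  # avoid dividing by zero

toy_wf = {1960: {'love': 3, 'all': 1},             # 2 types, 4 tokens
          1961: {'love': 1, 'candy': 1, 'is': 1}}  # 3 types, 3 tokens

toy_lexdiv = {year: type_token_ratio(counts) for year, counts in toy_wf.items()}
print(toy_lexdiv)
```

Keep in mind that the type-token ratio generally falls as the amount of text grows, so years with many more songs are not directly comparable to years with few. This is one reason to treat the plot as an approximation.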

In [None]:
# if you are interested, try some of the above code to study band names!

## Exercises - DIY Lists and dictionaries

Inspired by *Think Python* by Allen B. Downey (http://thinkpython.com) and *Introduction to Programming Using Python* by Y. Liang (Pearson, 2013). Some exercises below have been taken from: http://www.ling.gu.se/~lager/python_exercises.html.

- Ex. 1: Consider the following strings `sentence1 = "Mike and Lars kick the bucket"` and `sentence2 = "Bonny and Clyde are really famous"`. Split these strings into words and create the following strings via list manipulation: `sentence3 = "Mike and Lars are really famous"` and `sentence4="Bonny+and+Clyde+kick+the+bucket"` (mind the plus signs!). Can you print the middle letter of the fourth sentence?

- Ex. 2: Create an empty list and add three names (strings) to it using the *append* method

Please use a built-in function to determine the number of strings in the list below:

In [None]:
friend_list = ['John', 'Bob', 'John', 'Marry', 'Bob']
#  your code here

Please remove both *John* names from the list below using a list method:

In [None]:
friend_list = ['John', 'Bob', 'John', 'Marry', 'Bob']
# your code here

-  Ex. 3: Consider the `lookup` dictionary below. The following letters are still missing from it: 'k':'kilo', 'l':'lima', 'm':'mike'. Add them to `lookup`! Could you spell the word "marvellous" in code language now? Collect these codes into the list object `msg`. Next, join the items in this list together with a comma and print the spelled out version!

> lookup = {'a':'alfa', 'b':'bravo', 'c':'charlie', 'd':'delta', 'e':'echo', 'f':'foxtrot', 'g':'golf', 'h':'hotel', 'i':'india', 'j':'juliett', 'n':'november', 'o':'oscar', 'p':'papa', 'q':'quebec', 'r':'romeo', 's':'sierra', 't':'tango', 'u':'uniform', 'v':'victor', 'w':'whiskey', 'x':'x-ray', 'y':'yankee', 'z':'zulu'}


-  Ex. 4: Collect the code terms in the lookup dict (`alfa`, `bravo`, ...) from the previous exercise into a list called `code_words`. Is this list alphabetically sorted? No? Then make sure that this list is sorted alphabetically. Now remove the items `victor`, `india` and `papa`. Append the words `pigeon` and `potato` at the end of this list. Combine this new list of items into a single string, using a semicolon as a delimiter, and print this string.

- Ex. 5: Write a program that, given a long string containing multiple words, prints the same string with the words in reverse order. For example, say I type the string:

`My name is Kaspar von Beelen`

Then I would see the string:

`Beelen von Kaspar is name My`

**Tip**: Try using a negative `step`.

Extra: Try to do this in just one line of code!