# Python Dictionaries 

## 1 Introduction

A dictionary is like a list, but more general. In a list, the indices have to be **integers** (i.e. they refer to the position of an item in a sequence); in a dictionary, the indices can be of (almost) **any type**. 

In text mining, dictionaries are often useful for keeping track of word counts. How this works exactly will be shown below.

But first: to understand why this is useful, let's return to lists for just a moment.

Using the index operator, we can retrieve the elements at a certain position (for example my first friend):

In [None]:
# Example 
all_my_friends = ['John','Mary','Benny']
# retrieve element by index
my_first_friend = all_my_friends[0]
print(my_first_friend)

Imagine having to look up someone's number in a telephone book. Here the index (or key) would be the names of all citizens with a telephone, and the values the numbers. A numerical index does not make sense because **we want to retrieve the number by name, not by position in the book**. The same applies, of course, to a normal dictionary, where we'd look up translations or descriptions of specific words. 

Dictionaries provide you with the data structure that makes such tasks (looking up values by keys) exceptionally easy.

For example, if we look at the dictionary `telephone_numbers` below, what is Susan's phone number?

In Python you can easily look up a key (the element before the `:`) in a dictionary:

In [None]:
telephone_numbers = {'Frank': 4334030, 'Susan': 400230, 'Guido': 487239}
print(telephone_numbers)

... and now print Susan's telephone number:

In [None]:
print(telephone_numbers['Susan'])

More generally, you can think of a dictionary as **a mapping** between a set of indices (which are called keys) and a set of values. **Each key maps to a value.** The association of a key and a value is called a **key-value** pair or sometimes an **item**.

In [None]:
# What is Guido's phone number?

Note how similar `telephone_numbers['Susan']` looks to retrieving the *n*-th element in a list, e.g. `my_list[n]`.

Of course, you could do something similar with a list (the look-up by key), but that would be very impractical.

In [None]:
telephone_numbers = ['Frank', 4334030, 'Susan', 400230, 'Guido', 487239]
print(telephone_numbers[telephone_numbers.index('Susan')+1])

That's pretty inefficient. The take-home message here is **that lists are not well suited to storing two pieces of information that belong together**. Dictionaries to the rescue!

## 2.2 Creating a dictionary

* A dictionary is surrounded by **curly brackets**.
* The **key/value** pairs are separated by **commas**.
* A dictionary consists of zero or more **key:value pairs**; the key is the "identifier" or "name" that is used to describe the value.
* The **keys** in a dictionary are unique.
* The syntax for a key/value pair is: KEY : VALUE
* The keys (e.g. 'Frank') in a dictionary have to be **immutable**.
* The values (e.g. 8) in a dictionary can be **any Python object**.
* A dictionary can be empty.
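
The rules above can be illustrated with a short sketch. The dictionaries `mixed_keys` and `grades` below are made-up examples, not part of the original material:

```python
# keys may be strings, numbers, or tuples, as long as they are immutable
mixed_keys = {'Frank': 8, 42: 'a number as key', ('Susan', 'Guido'): 'a tuple as key'}
print(mixed_keys[('Susan', 'Guido')])

# keys are unique: if a key is repeated, the last value wins
grades = {'Frank': 8, 'Frank': 9}
print(grades)
print(len(grades))
```

Python does not complain about the repeated key; it silently keeps only the last value.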


An empty dictionary:

In [None]:
x = {}

A mapping between English and German words:

In [None]:
english2deutsch = {'ambulance':'Krankenwagen',
                  'clever':'klug',
                  'concrete':'Beton'}


* Please note that **keys** in a dictionary have to be **immutable**. Lists cannot appear as keys. 
* Anything can be a value.


Because keys have to be immutable, a list cannot appear in this location. This should raise an error:

In [None]:
a_dict = {['a', 'list']: 8}
print(a_dict)

This should work:

In [None]:
a_dict = { 8:['a', 'list']}
print(a_dict)

In [None]:
# Exercise: make a dictionary which maps three cities to the size of their population
# call it city2population
# https://en.wikipedia.org/wiki/List_of_cities_proper_by_population
# Look up a key and print its value

### 2.2.1 Adding items to a dictionary

There is one very simple way to add a **key:value** pair to a dictionary. Please look at the following code snippet:

In [None]:
english2deutsch = dict()
#or try english2deutsch = {}
print(english2deutsch)

In [None]:
english2deutsch['one'] = 'eins'
english2deutsch['two'] = 'zwei'
english2deutsch['three'] = 'drei'
print(english2deutsch)

Please note that key:value pairs get overwritten if you assign a different value to an existing key.

In [None]:
# Exercise: add two cities to city2population

In [None]:
english2deutsch = dict()
print(english2deutsch)
english2deutsch['one'] = 'eins?'
print(english2deutsch)
english2deutsch['one'] = 'zwei?'
print(english2deutsch)
english2deutsch['one'] = 'drei?'
print(english2deutsch)

## 2.3 Inspecting the dictionary

In a dictionary we store values that we'd like to inspect later by their keys. Common situations are:
- mapping words to frequencies (values are integers, floats)
- mapping names to the individual characteristics of the person (age, gender, etc.) (values are strings, numbers)
- mapping dates to counts (creating timelines) (values are integers, floats)
- mapping bands to their song titles (values are lists)

The most basic operation on a dictionary is a **look-up**. Simply enter the key and the dictionary returns the value. In the example below, we map movies to their box-office performance. Keys are the movie titles, and values represent the ticket sales.

In [None]:
bo = {'Avatar': 2787965087, 'Titanic': 2187463944, 'Star Wars: The Force Awakens': 2068223624}

In [None]:
print(bo['Avatar'])

If the key is not in the dictionary, Python will raise a ``KeyError``.

In [None]:
bo['The Lion King']
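
One way to avoid this error is to check whether the key is present before looking it up. The sketch below uses the `in` operator for that check; the dictionary is repeated in shortened form (two of the entries from above) so that the snippet runs on its own:

```python
# two of the box-office entries from the example above
bo = {'Titanic': 2187463944, 'Star Wars: The Force Awakens': 2068223624}

title = 'The Lion King'
if title in bo:  # membership check on the keys
    print(bo[title])
else:
    print(title, 'is not in the dictionary')
```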

## 2.4 Dictionary Methods
### .get()

In order to avoid getting a `KeyError` every time a key does not appear in the dictionary, you can use the ``get`` method. The **first argument** is the **key** to look up, the **second argument** defines the **value** to be returned if the key is not found:

In [None]:
print(bo.get('The Lion King','Not in Dictionary'))
# a good alternative could be 
print(bo.get('Avatar',False))
print(bo.get('The Lion King',False))
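
If the second argument is left out, `get` returns `None` for a missing key instead of raising an error. A small sketch (the dictionary is redefined in shortened form so the snippet runs on its own):

```python
bo = {'Titanic': 2187463944, 'Star Wars: The Force Awakens': 2068223624}

result = bo.get('The Lion King')  # no default value given
print(result)  # None

# get never changes the dictionary: the missing key is still missing
print('The Lion King' in bo)  # False
```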

Other methods allow us to access the different components of the dictionary:

### .keys()

The **keys** method returns the keys in a dictionary.

In [None]:
student_grades = {'Frank': 8, 'Susan': 7, 'Guido': 10}
the_keys = student_grades.keys()
print(the_keys)

### .values()

The **values** method returns the values in a dictionary.

In [None]:
the_values = student_grades.values()
print(the_values)

We can use built-in functions to inspect the keys and values. For example:

In [None]:
the_values = student_grades.values()
print(len(the_values)) # number of values in the dict
print(max(the_values)) # highest value in the dict
print(min(the_values)) # lowest value in the dict
print(sum(the_values)) # sum of all values in the dict

### .items()

The **items** method returns the key-value pairs as **tuples** (we have a look at tuples later), which allows us to easily loop through a dictionary.

In [None]:
student_grades = {'Frank': 8, 'Susan': 7, 'Guido': 10}
print(student_grades.items())
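
In Python 3, `keys()`, `values()` and `items()` return view objects rather than plain lists, which is why the printed output starts with `dict_keys`, `dict_values` or `dict_items`. If a real list is needed (for example to index into it), wrapping the result in `list()` is one option. A minimal sketch:

```python
student_grades = {'Frank': 8, 'Susan': 7, 'Guido': 10}

names = list(student_grades.keys())    # a real list of the keys
pairs = list(student_grades.items())   # a list of (key, value) tuples

print(names[0])   # lists support indexing by position
print(pairs[0])
print(sorted(student_grades.values()))  # sorted() always returns a list
```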

## 2.5 Iterating over dictionaries

Since dictionaries are iterable objects, we can loop through our `good_reads` collection. This will iterate over the *keys* of the dictionary:

In [None]:
good_reads = {"The Magic Mountain":9,
             "The Idiot":7,
             "Don Quixote": 9.5}

for book in good_reads:
    print(book)

We can also iterate over both the keys and the values of a dictionary. This is done as follows:

In [None]:
good_reads["Pride and Prejudice"] = 8
good_reads["A Clockwork Orange"] = 9

In [None]:
good_reads.items()

In [None]:
for x, y in good_reads.items():
    print(x + " has score " + str(y))
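
In recent Python versions (3.7 and later), the loop above visits the books in the order in which they were added. If an alphabetical order is preferred, one option is to loop over the sorted keys instead. A sketch, with a shortened collection repeated so the snippet runs on its own:

```python
good_reads = {"The Magic Mountain": 9,
              "The Idiot": 7,
              "Don Quixote": 9.5}

# sorted() applied to a dictionary returns a sorted list of its keys
for book in sorted(good_reads):
    print(book + " has score " + str(good_reads[book]))
```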

## 2.6 Example: Counting with dictionaries

Dictionaries are very useful to keep track of our data, for example by counting words:

In [None]:
sentence = 'Obama was the president of the USA' # assign the string to the variable sentence
words = sentence.split() # split the sentence
word2freq = {} # initialize an empty dictionary, here we store the word counts
# i.e. word2freq is a mapping from words to their frequencies
 

for word in words: # loop over all the words, word takes each word in turn
    if word in word2freq: # membership check on the keys: if the key exists, add 1 to its value
        word2freq[word] += 1 # notice that we use the shorthand for incrementing a count
                             # which is an abbreviation for word2freq[word] = word2freq[word] + 1
    else: # if the above condition does not hold (the word does not appear as a key in the dictionary), then set the key's value to one
        word2freq[word] = 1 # set the initial value to 1 if the key does not exist yet

    print('Word added = ',word, 'Updated dictionary = ',word2freq)

print('\n')
print(word2freq)

#### `if` and `else`: see Notebook 2.2

In [None]:
# change x to see how if else works

x = 5

if x >= 0:
    print(x,' is positive or zero.')
else:
    print(x, ' is negative.')

A lot is happening in the previous code block; the examples below aim to clarify the individual steps.

##### Line 8 (and implicitly line 11): Membership check on the keys of the dictionary

In [None]:
w2fr = {'USA': 1, 'of': 1, 'president': 1, 'the': 2, 'was': 1, 'Obama': 1}


print('USA' in w2fr) # `in` performs a membership check on the keys if no method is appended to the dictionary
print(1 in w2fr) # it does not check whether an item appears among the values
print(1 in w2fr.values()) # unless you call the values method, of course

##### Line 9 and line 12: Updating (9) and setting (12) a dictionary key

Why do we distinguish between updating and setting a key? If we only updated (which seems to make sense), Python would raise a `KeyError`, because the key we are incrementing does not appear in the dictionary yet. For this reason we explicitly set a default (start) value for each new key (i.e. each word which does not yet appear in the dictionary, in which case the membership check with `in` returns `False`).

In [None]:
w2fr = {'USA': 1, 'of': 1, 'president': 1, 'the': 2, 'was': 1, 'Obama': 1}

print('Obama has frequency ',w2fr['Obama'])
w2fr['Obama']+=1 
print('Obama has frequency ',w2fr['Obama'])

# remember: += 1 is shorthand for var = var + 1
# the long form below works too, but the shorthand is preferred
w2fr['Obama'] = w2fr['Obama'] + 1 
print('Obama has frequency ',w2fr['Obama'])

Now if we want to update the value for a word whose key is not in the dictionary, Python raises a `KeyError`:

In [None]:
w2fr['Barack']+=1 

### `setdefault()` method

The `setdefault()` method simplifies the above code by automatically checking if a key exists, and if not, setting a default value for this key. This method takes two arguments, the key to be set, and the default value for the key.

In [None]:
sentence = 'Obama was the president of the USA' # assign the string to the variable sentence
words = sentence.split() # split the sentence
word2freq = {} # initialize an empty dictionary, here we store the word counts
# i.e. word2freq is a mapping from words to their frequencies
 

for word in words: # loop over all the words, word takes each word in turn
    
    word2freq.setdefault(word,0) # if the key does not appear yet, set its value to 0
    word2freq[word] += 1 # notice that we use the shorthand for incremental count
    
    print('Word added = ',word, 'Updated dictionary = ',word2freq)

print('\n')
print(word2freq)
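
Two other common ways to write the same count are sketched below, as alternatives rather than as the approach used in this notebook: the `get` method with a default of 0, and the `Counter` class from the standard-library `collections` module.

```python
from collections import Counter

sentence = 'Obama was the president of the USA'
words = sentence.split()

# alternative 1: get() returns 0 for unseen words, so we can always add 1
word2freq = {}
for word in words:
    word2freq[word] = word2freq.get(word, 0) + 1
print(word2freq)

# alternative 2: Counter does the counting for us
counts = Counter(words)
print(counts['the'])
print(dict(counts) == word2freq)  # both approaches give the same counts
```

All three versions produce the same mapping; which one reads best is largely a matter of taste.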

### An example: count songs about a topic by year

Now let's apply these techniques to studying our song titles corpus. 

First we make a little program that collects the counts of a search term by year. This will allow us to make timelines that plot the evolution of a topic by year. 

In [None]:
# load the data
import requests
url = 'https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/tracks_per_year.txt'
data = requests.get(url).text.strip() # download the song titles
song_titles = data.strip().split('\n') # split the string into lines

In [None]:
year2counts = {} # create an empty dictionary; here we map years to the frequency of a word

search = 'love' # define your search term here

for row in song_titles: # iterate over the song_titles variable

    year,song_id,group,title = row.split('<SEP>') # parse the row using multiple assignment
    year = int(year) # cast year as an integer
    
    year2counts.setdefault(year,0) # set the value for key year to the default value of 0 if the year is new
    
    title_lower = title.lower() # lowercase the title
    words = title_lower.split() # split the lowercased string into words
    
    if search in words: # membership check, does the word love appear in the list called words
        year2counts[year] +=1 # add one if the above condition holds

# print the results sorted by year
print(sorted(year2counts.items()))
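
To see what the loop does without downloading the full file, the same logic can be tried on a handful of rows. The three rows below are invented for illustration (the identifiers, group names, and titles are not real entries from the dataset); they only mimic the `<SEP>`-separated layout that the code above assumes.

```python
# hypothetical rows in the layout year<SEP>song_id<SEP>group<SEP>title
sample_rows = ['1965<SEP>ID001<SEP>Band A<SEP>Love Me Again',
               '1965<SEP>ID002<SEP>Band B<SEP>Gloves Off',
               '1970<SEP>ID003<SEP>Band C<SEP>All My Love']

search = 'love'
year2counts = {}

for row in sample_rows:
    year, song_id, group, title = row.split('<SEP>')
    year = int(year)
    year2counts.setdefault(year, 0)
    words = title.lower().split()
    if search in words:  # matches whole words only, so 'gloves' is not counted
        year2counts[year] += 1

print(sorted(year2counts.items()))
```

Because the membership check runs on the list `words`, the title 'Gloves Off' is not counted. A check on the string itself (`search in title.lower()`) would count it, since 'love' is a substring of 'gloves'.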

We can easily plot the time series using functions from an external library called Pandas. The code below is just to help you plot your data; do not worry about it now.

In [None]:
import pandas
%matplotlib inline
series = pandas.Series(year2counts)
series.plot()

**Question**: Can we conclude that "love" became a more popular theme in pop culture over time?

Ideally, we'd like to calculate the probability that a song from a certain year is about love. This is relatively straightforward: we have to divide the number of songs about love from year X by the total number of songs from year X.

The program below therefore features a small addition: the dictionary `wf` that tracks the number of songs by year.

In [None]:
year2counts = {} # create an empty dictionary; here we map years to the frequency of a word
wf = {} # an empty dictionary where we will save the total number of songs by year

search = 'love' # define your search term here

for row in song_titles: # iterate over the song_titles variable

    year,song_id,group,title = row.split('<SEP>') # parse the row using multiple assignment
    year = int(year) # cast year as an integer
    
    year2counts.setdefault(year,0) # set the value for key year to the default value of 0 if the year is new
    # here we add some lines to keep track of the total number of song titles by year
    wf.setdefault(year,0)
    wf[year]+=1
    
    title_lower = title.lower() # lowercase the title
    words = title_lower.split() # split the lowercased string into words
    
    if search in words: # membership check, does the word love appear in the list called words
        year2counts[year] +=1 # add one if the above condition holds


If we take a moment to study the total number of songs by year, we also see an increase:

In [None]:
# plot the total number of songs by year

For this reason, the relative frequency of a word will tell us more than the absolute frequency: almost every search term we define will show an increase over time. Below we add a few more lines to divide the number of songs about love by the total number of songs for that specific year. 

Because we defined two mappings earlier, this becomes a relatively straightforward task:

In [None]:
# calculate the probability that a song is about love 

# create an empty dictionary
ratios = {}

for year, count in year2counts.items():
    ratios[year] = count / wf[year]
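
With small made-up numbers the calculation is easy to verify by hand. The two dictionaries below are hypothetical stand-ins for `year2counts` and `wf`, not values taken from the dataset:

```python
year2counts = {1965: 2, 1970: 5}   # hypothetical: songs about love per year
wf = {1965: 10, 1970: 50}          # hypothetical: all songs per year

ratios = {}
for year, count in year2counts.items():
    ratios[year] = count / wf[year]  # share of songs about love in that year

print(ratios)
```

In this toy example 1970 has more songs about love in absolute terms, yet its share is smaller than that of 1965.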

In [None]:
# plot the results
pandas.Series(ratios).plot()

Exercise: Can you plot the chronological evolution of another term (such as "dirty")?

In [None]:
# copy-paste your code here

## 2.7 Recap

To finish this section, here is an overview of the new concepts and functions you have learnt. Make sure you understand them all.

-  dictionary
-  indexing or accessing keys of dictionaries
-  adding items to a dictionary
-  `.get()`
-  `.keys()`
-  `.values()`
-  `.items()`
-  `.setdefault()`

## 2.8 Advanced Examples

### Advanced Example 1: Nested Dictionaries

The example above is already useful, but let's improve it a little. For each query we currently have to iterate over all the 500,000+ songs. We can make this more efficient by using **nested dictionaries**. 

... What?

In [None]:
a_nested_dict = {1960:{'a':5,'the':9},
                1961:{'a':3,'the':10}}

How to access the word frequencies?

In [None]:
a_nested_dict[1960]['a']

In [None]:
# print the value of 'the' for the year 1961

In a nested dictionary, a key maps to a value of type `dict`. In the example we map years to word frequencies for that year (we map years to a mapping between words and their frequencies). This will make it much faster to compute the change over time for different words.

In [None]:
wf = {} # an empty dictionary where we will save the word frequencies by year

search = 'love' # define your search term here

for row in song_titles: # iterate over the song_titles variable

    year,song_id,group,title = row.split('<SEP>') # parse the row using multiple assignment
    year = int(year) # cast year as an integer
    
    
    # here we create an empty word-frequency dictionary for each year we have not seen yet
    wf.setdefault(year,{})
    
    title_lower = title.lower() # lowercase the title
    words = title_lower.split() # split the lowercased string into words
    
    # here we start collecting yearly word frequencies
    for w in words:
        wf[year].setdefault(w,0) # set the default value for word w in the given year to zero
        wf[year][w] += 1

Now we have to loop through our corpus only once to get the frequency of a word at a certain point in time!

In [None]:
wf[1960]['a']

Of course, sometimes a word might not appear. To avoid a `KeyError` we use the `.get()` method.

In [None]:
# this does not work
wf[1960]['madonna']

In [None]:
# this works, do you understand the syntax?
wf[1960].get('madonna',0.0)
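
The same idea extends to a missing year. A lookup such as `wf[1800]` would raise a `KeyError` if no song from that year is in the data; chaining two `get` calls, with an empty dictionary as the first default, is one way around this. A sketch on a small made-up nested dictionary (`small_wf` is a hypothetical stand-in for `wf`):

```python
small_wf = {1960: {'a': 5, 'the': 9},
            1961: {'a': 3, 'the': 10}}

# known year, unknown word
print(small_wf[1960].get('madonna', 0))

# unknown year: the first get returns {}, the second get then returns 0
print(small_wf.get(1800, {}).get('madonna', 0))

# known year and known word still work as before
print(small_wf.get(1960, {}).get('the', 0))
```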

The code below produces the same kind of timeline as the long program above, but is much faster because we prepared everything as a nested dictionary!

In [None]:
search = 'tears' # define query
results = {} # empty dictionary to save frequencies by years
for year in wf: # loop over all the keys in the wf dictionary which are the years
    results[year] = wf[year].get(search,0.0) # get the value for word search in year year
pandas.Series(results).plot() # plot the results

Of course, we can also normalize the results:

In [None]:
search = 'tears' # define query
results = {} # empty dictionary to save frequencies by years
for year in wf: # loop over all the keys in the wf dictionary which are the years
    total = sum(wf[year].values()) # the sum of all the values equals the total word count 
    results[year] = wf[year].get(search,0.0) / total # get the value for word search in year year and divide it by total 
pandas.Series(results).plot() # plot the results

In [None]:
# to understand line four (the computation of total)
a_nested_dict = {1960:{'a':5,'the':9},
                1961:{'a':3,'the':10}}
print(a_nested_dict[1960])
print(a_nested_dict[1960].values())
print(sum(a_nested_dict[1960].values()))

### Advanced Example 2: The Lexical Diversity of Pop Culture

We can now also compute (approximately) whether the topics of song titles are becoming more or less diverse. 

This can be done by computing the type-token ratio or the "lexical diversity".

From the [NLTK book](http://www.nltk.org/book/ch01.html):
    
> A token is the technical name for a sequence of characters — such as hairy, his, or :) — that we want to treat as a group. [...] A word type is the form or spelling of the word independently of its specific occurrences in a text — that is, the word considered as a unique item of vocabulary. 

The type-token ratio is then a measure of lexical diversity. It can be calculated as follows:

In [None]:
song_title = 'love love love all I want is candy'
tokens = song_title.split()
print(tokens)
types = set(tokens)
print(types)
ratio = len(types) / len(tokens)
print(ratio)

The maximum lexical diversity is one (each word in the corpus appears only once).

In [None]:
song_title = 'love all I want is candy'
tokens = song_title.split()
print(tokens)
types = set(tokens)
print(types)
ratio = len(types) / len(tokens)
print(ratio)
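
Since the same three steps recur, they can be wrapped in a small function. The name `type_token_ratio` is chosen here for illustration and does not come from a library; lowercasing the text first is an added assumption, so that 'Love' and 'love' count as the same type.

```python
def type_token_ratio(text):
    """Return the number of distinct words divided by the number of words."""
    tokens = text.lower().split()
    if not tokens:  # avoid dividing by zero for an empty string
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio('love love love all I want is candy'))  # 0.75
print(type_token_ratio('love all I want is candy'))            # 1.0
```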

Now we can compute the lexical diversity of song titles by year.

In [None]:
lexdiv = {}
for year in wf:
    lexdiv[year] = len(wf[year]) / sum(wf[year].values())
pandas.Series(lexdiv).plot()

In [None]:
# if you are interested, try some of the above code to study band names!

## Exercises - DIY Lists and dictionaries

Inspired by *Think Python* by Allen B. Downey (http://thinkpython.com), *Introduction to Programming Using Python* by Y. Liang (Pearson, 2013). Some exercises below have been taken from: http://www.ling.gu.se/~lager/python_exercises.html.

- Ex. 1: Consider the following strings `sentence1 = "Mike and Lars kick the bucket"` and `sentence2 = "Bonny and Clyde are really famous"`. Split these strings into words and create the following strings via list manipulation: `sentence3 = "Mike and Lars are really famous"` and `sentence4="Bonny+and+Clyde+kick+the+bucket"` (mind the plus signs!). Can you print the middle letter of the fourth sentence?

- Ex. 2: Create an empty list and add three names (strings) to it using the *append* method.

Please use a built-in function to determine the number of strings in the list below:

In [None]:
friend_list = ['John', 'Bob', 'John', 'Marry', 'Bob']
#  your code here

Please remove both *John* names from the list below using a list method:

In [None]:
friend_list = ['John', 'Bob', 'John', 'Marry', 'Bob']
# your code here

-  Ex. 3: Consider the `lookup` dictionary below. The following letters are still missing from it: 'k':'kilo', 'l':'lima', 'm':'mike'. Add them to `lookup`! Could you spell the word "marvellous" in code language now? Collect these codes into the list object `msg`. Next, join the items in this list together with a comma and print the spelled out version!

> lookup = {'a':'alfa', 'b':'bravo', 'c':'charlie', 'd':'delta', 'e':'echo', 'f':'foxtrot', 'g':'golf', 'h':'hotel', 'i':'india', 'j':'juliett', 'n':'november', 'o':'oscar', 'p':'papa', 'q':'quebec', 'r':'romeo', 's':'sierra', 't':'tango', 'u':'uniform', 'v':'victor', 'w':'whiskey', 'x':'x-ray', 'y':'yankee', 'z':'zulu'}


-  Ex. 4: Collect the code terms in the lookup dict (`alfa`, `bravo`, ...) from the previous exercise into a list called `code_words`. Is this list alphabetically sorted? No? Then make sure that this list is sorted alphabetically. Now remove the items `victor`, `india` and `papa`. Append the words `pigeon` and `potato` at the end of this list. Combine this new list of items into a single string, using a semicolon as a delimiter and print this string. 

- Ex. 5: Write a program that, given a long string containing multiple words, prints the same string, except with the words in backwards order. For example, say I type the string:

`My name is Kaspar von Beelen`
Then I would see the string:

`Beelen von Kaspar is name My`

**Tip**: Try using a negative `step`.

Extra: Try to do this in just one line of code!