# Dictionaries

A dictionary is a data structure for storing _pairs_ of data. Each pair consists of a _key_ and a _value_. Dictionaries have the type `dict` and are written with curly brackets, `{...}`.

## Creation and Common Operations

In [None]:
# a few imports to be used later
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  

In [None]:
d = {} # an empty dictionary
d['Na'] = 'sodium' # 'Na' is the key, 'sodium' is the value
d['K'] = 'potassium'
d

In [None]:
dict(Na='sodium', K='potassium') # alternative syntax

In [None]:
type(_)

In [None]:
print('The symbol K means "{}"'.format(d['K'])) # lookup key "K" in dict

In [None]:
for key, value in d.items(): # loop over all key/value pairs in dictionary
    print('The symbol for {0} is {1}'.format(value, key))
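
Beyond assignment and lookup, a few other built-in operations come up often. The sketch below rebuilds the small `d` from above so that it runs on its own; the symbol `'Li'` is used only as an example of a key that is missing from the dictionary.

```python
# rebuild the small dictionary from above so this cell is self-contained
d = {'Na': 'sodium', 'K': 'potassium'}

print(len(d))       # number of key/value pairs
print('Na' in d)    # membership tests look at the keys
print('Li' in d)    # 'Li' was never added

# d['Li'] would raise a KeyError; .get() returns a default value instead
missing = d.get('Li', 'unknown')
print(missing)

d['K'] = 'kalium'   # assigning to an existing key overwrites its value
del d['Na']         # remove a key/value pair
print(d)
```

Using `.get()` with a default is one common way to avoid a `KeyError` when it is not known in advance whether a key exists.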

### Exercise

Amino acids can be ranked according to how hydrophobic they are via a _hydrophobicity scale_. On [this Wikipedia page](https://en.wikipedia.org/wiki/Hydrophobicity_scales#Wimley-White_whole_residue_hydrophobicity_scales)
you will find a table with $\Delta G_{wif}$. Create a dictionary of the first 5 amino acids in the table, where the name is the key and $\Delta G_{wif}$ is the value.

# Pandas (much more about this later)

In the previous exercise you _manually_ extracted data from the internet, which is problematic for several reasons:

1. it's time consuming
2. it's error prone
3. what if the source is updated / corrected?

The following example uses _Pandas_, an external module for handling large data sets; millions of entries are not a problem. As you will see, we simply point it to the Wikipedia page and it will automatically, and almost magically, detect the table and extract the values.

In [None]:
import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/Hydrophobicity_scales")
p = tables[0] # read_html returns a list of the tables found; only one is found on this page
p.head() # show the first five rows (the head)

In [None]:
p.to_excel('hydrophobicity.xlsx') # save to MS Excel file

In [None]:
# we "zip" up two columns to form a dictionary. In Pandas, columns can be selected by their names.
d = dict( zip( p['Amino acid'], p['Interface scale, ΔGwif (kcal/mol)'] ) )
d
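
To see what `zip` and `dict` do in the cell above without downloading anything, here is a minimal sketch with two short lists. The lists `symbols` and `names` are toy data invented for illustration, not values from the Wikipedia table.

```python
# two made-up lists of equal length (toy data, not taken from the table)
symbols = ['Na', 'K', 'Li']
names = ['sodium', 'potassium', 'lithium']

pairs = list(zip(symbols, names))  # zip pairs up the items position by position
print(pairs)

elements = dict(pairs)             # dict() turns each (key, value) tuple into an entry
print(elements['Li'])
```

The same pattern applies to two Pandas columns, since iterating over a column yields its values one by one.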

In [None]:
d['Cys']

In [None]:
# let's do a simple plot using Matplotlib (imported at the top). More on this later!
delta_G_values = list(d.values())
delta_G_values = [float(i.replace('−', '-')) for i in delta_G_values] # Wikipedia uses a Unicode minus sign that float() cannot parse
aminoacid_names = list(d.keys())
plt.bar(aminoacid_names, delta_G_values) # bar plot expects x and y values as lists
plt.xticks(rotation=90)
plt.xlabel('amino acid')
plt.ylabel(r'$\Delta G$ (kcal/mol)') # raw string, so that \D is not treated as an escape sequence
plt.title('source: Wikipedia')
plt.show()
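
The `replace` call in the cell above is needed because the minus sign used on the Wikipedia page is the Unicode character U+2212, not the ASCII hyphen-minus that `float()` understands. A minimal sketch of the problem, using a made-up value `raw` rather than one from the table:

```python
raw = '\u22120.5'  # made-up value written with the Unicode minus sign (U+2212)

# float() only accepts the ASCII hyphen-minus as a sign
try:
    float(raw)
    parsed_directly = True
except ValueError:
    parsed_directly = False
print(parsed_directly)

# swap the Unicode minus for the ASCII one before converting
value = float(raw.replace('\u2212', '-'))
print(value)
```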

### Bonus Exercise

In the above we created two lists, `delta_G_values` and `aminoacid_names`, but the data is taken directly from the Wikipedia table and is unsorted with respect to $\Delta G$. Use the answer to [this question](https://stackoverflow.com/questions/9764298/is-it-possible-to-sort-two-listswhich-reference-each-other-in-the-exact-same-w) to simultaneously sort both lists and re-plot the results.

### Exercise

1. Extract all tables from Wikipedia's [list of Nobel laureates in Chemistry](https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Chemistry) into a list named `nobel`
1. How many tables did you find?
1. Show and investigate the first table
1. Try the following code:
~~~ py
# https://stackoverflow.com/questions/40581312/how-to-create-a-frequency-table-in-pandas-python
# count prizes per country; the columns are named explicitly so this works across pandas versions
df = nobel[0]['Country[B]'].value_counts().rename_axis('country').reset_index(name='prizes')
mask = df['prizes']>1 # only countries with two or more prizes
df = df[mask]
explode = df['country']=='Sweden'
plt.pie( df['prizes'].values, labels=df['country'], autopct='%1.0f%%', radius=1.5, explode=explode )
plt.show()
~~~
1. Have a look at `df`.
1. Have a look at `mask`. How does it work?
1. Have a look at `explode`. How does it work?


In [None]:
url     = 'https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Chemistry'
nobel   = pd.read_html(url)
# count prizes per country; the columns are named explicitly so this works across pandas versions
df      = nobel[0]['Country[B]'].value_counts().rename_axis('country').reset_index(name='prizes')
mask    = df['prizes']>1 # only countries with two or more prizes
df      = df[mask]
explode = df['country']=='Sweden'
plt.pie(df['prizes'].values,
        labels=df['country'],
        autopct='%1.0f%%',
        radius=1.,
        explode=explode )
plt.show()
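
The counting and filtering steps can also be written with the standard library alone, which may help when reading `df` and `mask`. This is a minimal sketch using `collections.Counter`, not the Pandas approach used above; the list `countries` is made up for illustration and is not the Wikipedia data.

```python
from collections import Counter

# made-up toy data, one entry per prize
countries = ['Sweden', 'Germany', 'Sweden', 'France', 'Germany', 'Sweden', 'Japan']

counts = Counter(countries)  # frequency table: country -> number of occurrences
print(counts.most_common())  # sorted from most to least frequent

# keep only countries that occur more than once (the role played by `mask`)
frequent = {country: n for country, n in counts.items() if n > 1}
print(frequent)
```

A `Counter` is itself a dictionary, so everything from the first part of this notebook (lookup, `.items()`, `in`) applies to it as well.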