# Week 5 Lecture - Dictionaries


* What is a Dictionary
* Data in / Data Out
* Structuring Data in Data Structures
* Loops?
* Making a word counter

## What is a Dictionary?

![Dictionary Picture](http://www.trytoprogram.com/images/python_dictionary.jpg)

*Image used without permission from [Trytoprogram](http://www.trytoprogram.com/python-programming/python-dictionary/)*

Similar to a Python list, a Python dictionary is a data structure that behaves as a *container* or *collection* for other data values.
It is like a list because it contains values and each value has an index.

It is not like a list because the indices are not implicit, they can be more than numbers, and there is no numeric position to index by.

In a dictionary, data is stored in *key*/*value* pairs. The *key* is the index and the *value* is the actual data.
When you create an *item* in a dictionary (another term for a key/value pair), you store the data in the dictionary.
To get the data back out, we use the key to *look up* the value in the dictionary. In this way the dictionary behaves like an English language dictionary. The *key* is like the word and the *value* is like the definition.
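
To make the contrast with lists concrete, here is a minimal sketch. The names `spanish_list` and `spanish_dict` are made up for this illustration and are not used elsewhere in the lecture.

```python
# a list: the index is implicit and always a number
spanish_list = ["cero", "uno", "dos"]
print(spanish_list[1])      # look up by position

# a dictionary: we choose each key ourselves, and it can be a string
spanish_dict = {"zero": "cero", "one": "uno", "two": "dos"}
print(spanish_dict["one"])  # look up by key, not by position
```

Both lookups use square brackets, but only the list cares about position.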

In [None]:
# create a new dictionary
english2spanish = dict()
# another way to do the same thing using the literal syntax
english2spanish = {}

Now we have an empty dictionary with zero items.

In [None]:
len(english2spanish)

### Putting Data in a Dictionary
If you have an empty dictionary, you can add items using the assignment operator or the `update()` method.

In [None]:
# add values with the assignment operator
english2spanish["one"] = "uno"

# add items with the update method
english2spanish.update({"two":"dos"})

print(english2spanish) # look at the contents of the entire dictionary

Look at the syntax of what Python spit out: curly braces containing pairs of strings, each pair joined by a colon. The curly braces mean we are looking at a dictionary (as opposed to a list, which uses square brackets `[]`). The colon always separates the key from the value within an *item*, and commas separate the items.

In [None]:
# add another item to the dictionary
english2spanish["three"] = "tres"

# print dictionary contents
print(english2spanish)
#display the length of the dict
len(english2spanish)

Now our `english2spanish` dictionary has three items! 

![The count counting](https://media.giphy.com/media/FHzemFzwkyRfq/giphy.gif)

### Getting data out of a dictionary
We can use the familiar *indexing syntax* (square brackets) to get a value based on its key. 

In [None]:
# lookup the value associated with the key
english2spanish["one"]

In [None]:
# you can't look up by value, so this raises a KeyError
english2spanish["uno"]

If you try to look up a key that does not exist using the index syntax, Python will yell at you with a `KeyError`.

Use the `get()` method to look up keys without raising an error. It returns `None` when the key is missing.

In [None]:
# getting data with get
english2spanish.get("one")

In [None]:
# getting data that does not exist
english2spanish.get("seven")

Look! No error. Sometimes this is the preferred behavior when you are processing data.
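
As a small sketch of handling missing keys, `get()` also accepts an optional second argument to return instead of `None`, and the `in` operator checks whether a key exists. The dictionary is rebuilt here so the cell runs on its own, and the fallback text is just an example value.

```python
english2spanish = {"one": "uno", "two": "dos", "three": "tres"}

# get() returns None when the key is missing
missing = english2spanish.get("seven")
print(missing)

# a second argument supplies a fallback value instead of None
fallback = english2spanish.get("seven", "no translation")
print(fallback)

# the in operator checks for a key before you look it up
has_seven = "seven" in english2spanish
print(has_seven)
```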

## Structuring Data in Data Structures

Dictionary keys must be hashable, which in practice means immutable basic data types (ints and strings are most common, but floats and tuples work too). The values can be anything, even more complicated data structures like lists and dictionaries. This allows you to create complicated, multidimensional data structures for storing your data in Python.

In [None]:
# create a dictionary with data already loaded up
floaty_keys = {4.5:"four point five", 2.1:"Two point one", "one point 2":1.2, "4.2":4.2}

In [None]:
# grab the value with the float key 4.5
floaty_keys[4.5]

In [None]:
# get the value with the string key "one point 2"
floaty_keys["one point 2"]

In [None]:
# get the value with the string key "4.2"
floaty_keys.get("4.2")
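
Keys are not limited to numbers and strings. As a minimal sketch (the `grid` dictionary is made up for this example), any hashable value such as a tuple can be a key, while a mutable list cannot:

```python
# a tuple is immutable, so it can be a key
grid = {(0, 0): "origin", (1, 2): "somewhere else"}
print(grid[(1, 2)])

# a list is mutable, so Python refuses to use it as a key
try:
    bad = {[0, 0]: "origin"}
except TypeError as error:
    message = str(error)
    print("TypeError:", message)
```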

Dictionary values can be any Python data type or data structure. Here is an example of storing tabular data in lists with the key as the column name.

In [None]:
# put some tabular data into a dictionary of lists
user_data = {
    "ages":[12,23,234,95],
    "name":["Bobby", "Karl", "Dracula", "Henry"],
    "occupation":["Boy", "Barista", "Sandwich Artist by Night", "Taxi Driver"]
}
user_data

In [None]:
# get the occupation data
user_data["occupation"]

In [None]:
# get the third item in the occupation list
user_data["occupation"][2]

Another way of storing this data would be to have a dictionary of dictionaries. Note how the key must be explicit.

In [57]:
# Create a dictionary of dictionaries to hold all the data
data_as_dictionary = {
    1:{"name":"Bobby", "age":12, "occupation":"Boy"},
    2:{"name":"Karl", "age":23, "occupation":"Barista"},
    3:{"name":"Dracula", "age":234, "occupation":"Sandwich Artist by Night"},
    4:{"name":"Henry", "age":95, "occupation":"Taxi Driver"}
}
data_as_dictionary

{1: {'name': 'Bobby', 'age': 12, 'occupation': 'Boy'},
 2: {'name': 'Karl', 'age': 23, 'occupation': 'Barista'},
 3: {'name': 'Dracula', 'age': 234, 'occupation': 'Sandwich Artist by Night'},
 4: {'name': 'Henry', 'age': 95, 'occupation': 'Taxi Driver'}}

In [None]:
# get the data dictionary for user 3
data_as_dictionary[3]

In [None]:
# get the occupation of user 3
data_as_dictionary[3]['occupation']

In [None]:
# what the cell above becomes once the first lookup has been evaluated
{'name': 'Dracula', 'age': 234, 'occupation': 'Sandwich Artist by Night'}["occupation"]

You can even make a list of dictionaries, which will behave very similarly.

In [None]:
list_of_dictionaries = [
    {"name":"Bobby", "age":12, "occupation":"Boy"},
    {"name":"Karl", "age":23, "occupation":"Barista"},
    {"name":"Dracula", "age":234, "occupation":"Sandwich Artist by Night"},
    {"name":"Henry", "age":95, "occupation":"Taxi Driver"}
]
list_of_dictionaries

In [None]:
# get the item at index position 3 from the list
list_of_dictionaries[3]

In [None]:
# get the value with the key "name" from the dictionary at index position 3 of list_of_dictionaries
list_of_dictionaries[3]['name']
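
To compare the three layouts side by side, here is a sketch that pulls the same value out of each one. The data is a trimmed-down copy of the examples above (two users, two fields) so the cell runs on its own, and the `from_...` variable names are made up for this illustration.

```python
# the same two users stored three different ways
user_data = {
    "name": ["Karl", "Dracula"],
    "occupation": ["Barista", "Sandwich Artist by Night"],
}
data_as_dictionary = {
    2: {"name": "Karl", "occupation": "Barista"},
    3: {"name": "Dracula", "occupation": "Sandwich Artist by Night"},
}
list_of_dictionaries = [
    {"name": "Karl", "occupation": "Barista"},
    {"name": "Dracula", "occupation": "Sandwich Artist by Night"},
]

# dictionary of lists: pick the column, then the position
from_columns = user_data["occupation"][1]
# dictionary of dictionaries: pick the user key, then the field
from_nested = data_as_dictionary[3]["occupation"]
# list of dictionaries: pick the position, then the field
from_rows = list_of_dictionaries[1]["occupation"]

print(from_columns == from_nested == from_rows)
```

Which layout is best depends on whether you more often want a whole column or a whole record.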

## Looping over dictionaries

While dictionaries are not sequential like lists, you can loop over all the items. The order will be the insertion order (guaranteed since Python 3.7), but remember there is no implicit numerical index.


In [61]:
# Looping over a dictionary will go over all the keys
for item in user_data:
    print(item)

ages
name
occupation


In [62]:
# Use each key inside the loop to look up its value
for item in user_data:
    print(item)
    print(user_data[item])

ages
[12, 23, 234, 95]
name
['Bobby', 'Karl', 'Dracula', 'Henry']
occupation
['Boy', 'Barista', 'Sandwich Artist by Night', 'Taxi Driver']
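
If you want the key and the value at the same time, the dictionary methods `items()`, `keys()`, and `values()` can help. This is a small sketch using a shortened copy of `user_data`; the names `key_list` and `value_list` are made up here.

```python
user_data = {
    "ages": [12, 23, 234, 95],
    "name": ["Bobby", "Karl", "Dracula", "Henry"],
}

# items() hands back each key together with its value
for key, value in user_data.items():
    print(key, value)

# keys() and values() give just one half of each item
key_list = list(user_data.keys())
value_list = list(user_data.values())
```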


## Putting it all together - Stylometrics


Don't be afraid of the $5 word, it is just a technique for analyzing texts (usually to [determine authorship](https://www.latimes.com/science/sciencenow/la-sci-sn-shakespeare-play-linguistic-analysis-20150410-story.html)). Computational Stylometrics does a lot of fancy statistics, but much of it is based on *counting words*.

Given the text below, we want to write a program that counts all the words.

In [None]:
#swoon
romeo = """
But, soft! what light through yonder window breaks?
It is the east, and Juliet is the sun.
Arise, fair sun, and kill the envious moon,
Who is already sick and pale with grief,
That thou her maid art far more fair than she:
Be not her maid, since she is envious;
Her vestal livery is but sick and green
And none but fools do wear it; cast it off.
It is my lady, O, it is my love!
O, that she knew she were!
She speaks yet she says nothing: what of that?
Her eye discourses; I will answer it.
I am too bold, 'tis not to me she speaks:
Two of the fairest stars in all the heaven,
Having some business, do entreat her eyes
To twinkle in their spheres till they return.
What if her eyes were there, they in her head?
The brightness of her cheek would shame those stars,
As daylight doth a lamp; her eyes in heaven
Would through the airy region stream so bright
That birds would sing and think it were not night.
See, how she leans her cheek upon her hand!
O, that I were a glove upon that hand,
That I might touch that cheek!
"""

### Computational Thinking

If we want to count the words in the text above, we need to do the following things.

1. Normalize the text by removing punctuation and converting to lowercase.
2. Split the string of text into a list of words
3. Loop over the list and count each instance of a word
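
The three steps above can be sketched as a single function. This is only one possible way to arrange them: `count_words` is a hypothetical helper name, and the sample sentence is a single line from the text rather than the whole speech.

```python
from string import punctuation

def count_words(text):
    """Count how often each word appears in text (hypothetical helper)."""
    # step 1: normalize by stripping punctuation and lowercasing
    table = str.maketrans("", "", punctuation)
    normalized = text.translate(table).lower()
    # step 2: split on whitespace into a list of words
    words = normalized.split()
    # step 3: loop over the list and tally each word
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

sample_counts = count_words("It is the east, and Juliet is the sun.")
print(sample_counts)
```

Each piece of this function is built up one step at a time in the cells that follow.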

In [None]:
# convert everything to lowercase
romeo.lower()

That gives us everything in lowercase, but we still need to remove the punctuation. We could use the `replace()` string method and manually identify and remove each punctuation mark, but that would make for some ugly code.

In [None]:
# ugly approach to removing punctuation
romeo.replace(".","").replace(",","").replace("!","").replace("'","") #and so on

Wouldn't it be nice if we could do this all in one shot? Fortunately we can, with the string method `translate()`, but it is a bit complicated. Here is its documentation:


```
Replace each character in the string using the given translation table.

table
    Translation table, which must be a mapping of Unicode ordinals to
    Unicode ordinals, strings, or None.
```

In [None]:
# remember ord(): it returns the Unicode ordinal of a character
print("Period:", ord("."))
print("Comma:", ord(","))
print("Exclamation:", ord("!"))

In [None]:
translation_table = {46:"",
                     44:"",
                     33:""}
romeo.translate(translation_table)

We can use the `maketrans()` function to automatically translate characters into their ordinal values.

In [None]:
punctuation_dictionary = {
    ".":"",
    "!":"",
    ":":"",
    ",":"",
    "?":"",
    ";":""
}
translation_table = romeo.maketrans(punctuation_dictionary)
translation_table

Now we can use this table to remove the punctuation from our string

In [None]:
romeo.translate(translation_table)

But, that was still a lot of typing and it looks like we missed the apostrophe...ugh. More typing means more bugs...

If Python is *actually* batteries included, then wouldn't this already be a solved problem?

In [None]:
# Get a list of all the punctuation from the standard library
from string import punctuation
print(punctuation)

In [None]:
# make our translation table programmatically; the third argument lists characters to delete
translation_table = romeo.maketrans("", "", punctuation)
translation_table

In [None]:
# test out our punctuation remover
romeo.translate(translation_table)


Yes! Now we have almost solved the first step. All that is left is to make everything lowercase, and fortunately that is easy.

In [None]:
# normalize the text
romeo_normalized = romeo.translate(translation_table).lower()
romeo_normalized

Now we can do computational thinking step 2: split the string of words into a list. This is also an easy task thanks to the string method `split()`, which will automatically split on whitespace.

In [None]:
#split text into a list of words
romeo_list = romeo_normalized.split()
romeo_list[0:10] #look at the first 10 words in the list

Ok, now we can do the final step, which is to loop over each word and count them up in a dictionary.

In [None]:
# create a counter
word_counter = {}

# loop over each word
for word in romeo_list:
    # check to see if we have encountered the word
    if word not in word_counter:
        # have not seen this word before, so create a key with value 1
        word_counter[word] = 1
    else: 
        # we have seen this word before, so increment the value by 1
        word_counter[word] += 1

print(word_counter)
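
Since counting things is so common, the standard library has a shortcut. As a sketch of an alternative to the loop above (not a replacement for understanding it), `collections.Counter` builds the same kind of word-to-count mapping. The short word list here stands in for `romeo_list` so the cell runs on its own, and `top_two` is a made-up name.

```python
from collections import Counter

# a short stand-in for the full list of normalized words
words = "it is the east and juliet is the sun".split()

# Counter builds a word -> count mapping in one call
word_counter = Counter(words)

# most_common(n) returns the n highest counts as (word, count) pairs
top_two = word_counter.most_common(2)
print(top_two)
```

A `Counter` is a kind of dictionary, so everything from this lecture (indexing, `get()`, looping) still works on it.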