<a href="https://colab.research.google.com/github/moO0lk/LING227/blob/main/10_Python_dictionaries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Relevant readings

[NLTK Book, Chapter 5, Section 3.1 and 3.2](https://www.nltk.org/book/ch05.html#sec-dictionaries)

# Python Dictionaries

`nltk.FreqDist()` returns a specialised version of a built-in Python data type known as a dictionary (`dict`). In this notebook we will cover how to create and use dictionaries in more depth.

We have already looked at using lists as a way to store data. A list is like a bucket which you can toss all of your stuff in. A dictionary imposes more structure than a list, and is often more useful for the linguistic analyses we would like to perform.

A Python dictionary works in a similar manner to book dictionaries, where you look up the meaning of words. You first think of the word you want to look up, find it, then read the entry. A Python dictionary works the same way — you query the dictionary for a specific entry, and instead of returning the entry itself, the dictionary returns information associated with that entry.

Python dictionaries store information using `key:value` pairs. The `nltk.FreqDist()` function returns a dictionary where the `key` is the thing that was counted, and the `value` is the frequency of that thing.
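
As a rough illustration of the `key:value` idea, the sketch below counts words with `collections.Counter` from the Python standard library, which `nltk.FreqDist` is built on in current NLTK versions. The sentence and the variable names are made up for this example.

```python
from collections import Counter

# a made-up sentence, split into words on whitespace
words = 'the cat saw the dog and the bird'.split()

# Counter is a kind of dictionary: the keys are the words,
# and the values are how often each word occurred
counts = Counter(words)

print(counts['the'])  # frequency of the key 'the'
print(counts['cat'])  # frequency of the key 'cat'
```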

To create an empty Python dictionary, we can use `dict()`. To add entries to the dictionary, we use the square bracket notation `[]` after the name of the dictionary, then use `=` to assign the value.


In [None]:
# initial creation of a dictionary
species = dict()

In [None]:
# adding a key and a value
species['dog'] = 'canine'

In [None]:
# calling a key to get a value
species['dog']

You can also choose to pre-populate a dictionary using curly braces and colons:

In [None]:
# You can create using curly braces
species = {'dog': 'canine', 'cat': 'feline'}

In [None]:
species['cat']

In [None]:
species['dog']

## Rules for dictionary keys

- Dictionary keys need to be *immutable* values, such as strings, integers, booleans, and tuples. In practice, this mostly means that you can't use a list as a dictionary key.
- Much like a real dictionary, each key can occur only once in a dictionary.
  - If you assign a value to a key that already exists, you overwrite what is already there.
- The value associated with a dictionary key can be almost anything, including another dictionary, which can then contain other dictionaries.



In [None]:
# create an empty dictionary
temp_dict = dict()

In [None]:
# add a string as a key
temp_dict['one'] = 1

In [None]:
# look at the dictionary
temp_dict

In [None]:
# add a number as a key
temp_dict[1] = 'one'

In [None]:
# look at the dictionary
temp_dict[1]

In [None]:
# update our dictionary:
temp_dict['one'] = 'ONE'

In [None]:
# we have overwritten the value, because there can be only one.
temp_dict

In [None]:
# try adding a list as a key, you get an error.
temp_dict[['one', 'two']] = [1,2]
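
A tuple holding the same items is accepted as a key, because tuples are immutable. Here is a minimal sketch, using a separate dictionary named `pair_dict` (a name made up for this example) and catching the error so that the cell runs to the end:

```python
# tuples are immutable, so they can be used as dictionary keys
pair_dict = dict()
pair_dict[('one', 'two')] = [1, 2]

# a list is mutable, so using it as a key raises a TypeError
try:
  pair_dict[['one', 'two']] = [1, 2]
  list_key_worked = True
except TypeError:
  list_key_worked = False

print(pair_dict[('one', 'two')])
print(list_key_worked)
```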

Crucially, you should notice an important difference in how `[]` is used with dictionaries versus strings and lists. While `[]` indexes specific *locations* in a string or list, it indexes specific *keys* in a dictionary. This means that what matters is not the order of a dictionary's keys, but the keys themselves.
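
To make the contrast concrete, here is a small sketch that stores the same three words in a list and in a dictionary. The names `numbers_list` and `numbers_dict` are made up for this example.

```python
numbers_list = ['tahi', 'rua', 'toru']
numbers_dict = {'one': 'tahi', 'two': 'rua', 'three': 'toru'}

# a list is indexed by position: 0 is the first location
from_list = numbers_list[0]

# a dictionary is indexed by key: position plays no role
from_dict = numbers_dict['one']

print(from_list, from_dict)
```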

## **Your Turn**

Take this moment to make some dictionaries of your own. Start simple, such as a dictionary containing first and last names, or a dictionary containing names and phone numbers of people you know.

Explore creating an empty dictionary and then adding values, as well as creating a dictionary using curly braces.

In [None]:
# make some dictionaries.

## Searching dictionaries

Because dictionaries do not use indexing in the same way as strings and lists, we need to explore some alternative methods for looping and searching through dictionaries.

Below, I will initialise a simple translation dictionary between English and te reo Māori. The `keys` are the English words, and the `values` are their Māori translations.

In [None]:
# create a translation dictionary
eng2mri = {'one': 'tahi','two': 'rua', 'three': 'toru', 'four': 'whā', 'five': 'rima', 'six': 'ono', 'seven': 'whitu', 'eight': 'waru', 'nine': 'iwa', 'ten': 'tekau'}

In [None]:
# test if it works
eng2mri['ten']
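
Looking up a key that is not in the dictionary raises a `KeyError`. One way to guard against that, sketched here with a shortened copy of the dictionary (named `small_eng2mri` for this example only), is to test membership with `in`, or to use the `.get()` method with a fallback value.

```python
small_eng2mri = {'one': 'tahi', 'two': 'rua', 'three': 'toru'}

# `in` tests whether a key exists in the dictionary
has_two = 'two' in small_eng2mri
has_eleven = 'eleven' in small_eng2mri

# .get() returns the fallback value instead of raising a KeyError
translation = small_eng2mri.get('eleven', 'unknown')

print(has_two, has_eleven, translation)
```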

To see the entire dictionary, we can use the `.items()` dictionary method:

In [None]:
# what are the key:value pairs in this dictionary?
eng2mri.items()

Alternatively, we can inspect all of the entries in a dictionary by using the `.keys()` method...


In [None]:
# what are the keys (entries) of this dictionary?
eng2mri.keys()

...or we can inspect all of the values using the `.values()` method.

In [None]:
# what are the values of this dictionary?
eng2mri.values()

Now that we know how to access the keys of a dictionary, we can start to search through dictionaries using loops. For example, if we wanted to loop through each entry of a dictionary, we could use:

```
# my_dict stands in for the name of any dictionary
for key in my_dict.keys():
  print(key)
```

While getting the keys is useful to know what is inside the dictionary, we commonly also want to do something with the values associated with the key. So, we could adjust our for loop to loop over the keys, but then return the value associated with the key, rather than the key itself:

```
# my_dict stands in for the name of any dictionary
for key in my_dict.keys():
  print(my_dict[key])
```

If this seems a bit confusing in the abstract, compare the for loops below:


In [None]:
# print all of the keys
for key in eng2mri.keys():
  print(key)

In [None]:
# now print all of the values by using the key as an index in the loop
for key in eng2mri.keys():
  print(eng2mri[key])

In [None]:
# could also print both the key and the value:
for key in eng2mri.keys():
  print(key, eng2mri[key])

In [None]:
# the same thing but with some fancier print formatting
for key in eng2mri.keys():
  print(f'{key} = {eng2mri[key]}')

And we could reverse the order just by swapping the variables:


In [None]:
# swap the order
for key in eng2mri.keys():
  print(f'{eng2mri[key]} = {key}')
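
The same swap can be used to build a reversed dictionary rather than just printing it. This is a minimal sketch on a shortened copy of the data (`small_eng2mri` and `small_mri2eng` are names made up for this example), and it assumes that no two keys share the same value, since a repeated value would be overwritten.

```python
small_eng2mri = {'one': 'tahi', 'two': 'rua', 'three': 'toru'}

# start with an empty dictionary for the reversed pairs
small_mri2eng = dict()

for key in small_eng2mri.keys():
  # the old value becomes the new key, and the old key the new value
  small_mri2eng[small_eng2mri[key]] = key

print(small_mri2eng['rua'])
```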

Technically, you don't need to call `.keys()`, as looping over the dictionary itself gives you each key in turn. But it is sometimes handy to be explicit about looping over the keys.

In [None]:
for entry in eng2mri:
  print(entry)
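
If you want the key and the value together, another option is to loop over `.items()` and unpack each pair into two variables. Here is a short sketch on a shortened copy of the dictionary (the names `small_eng2mri` and `pairs` are made up for this example):

```python
small_eng2mri = {'one': 'tahi', 'two': 'rua', 'three': 'toru'}

pairs = []
# each item is a (key, value) pair, unpacked here into two variables
for key, value in small_eng2mri.items():
  pairs.append(f'{key} = {value}')

print(pairs)
```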

## Adding conditions to our search

Let's create some conditional logic to spruce up our searches. For example, we could look at each translation pair and print whichever of the two words is longer:

In [None]:
# start a loop
for key in eng2mri.keys():
  # if the value is at least as long as the key (so ties print the value):
  if len(eng2mri[key]) >= len(key):
    print(eng2mri[key])

  else:
    print(key)


Now let's consider another dictionary, which stores the populations of countries and territories in Oceania.

In [None]:
oceania = {'Tokelau': 1499, 'Norfolk Island': 1748, 'Federated States of Micronesia': 103000, 'Nauru': 10084, 'New Zealand': 4795886, 'Tonga': 100651, 'Tuvalu': 10640, 'Niue': 1611, 'Cook Islands': 18100, 'Samoa': 199052, 'Marshall Islands': 55500, 'Wallis and Futuna': 11700, 'French Polynesia': 275918, 'Australia': 25710853, 'American Samoa': 56700, 'Palau': 21000, 'Papua New Guinea': 8558800, 'New Caledonia': 278500, 'Northern Mariana Islands': 56200, 'Vanuatu': 304500, 'Kiribati': 120100, 'Fiji': 896445, 'Pitcairn Islands': 50, 'Guam': 172400, 'Solomon Islands': 667044}

Let's first print out each country and its population:

In [None]:
for key in oceania.keys():
  print(f'the population of {key} is {oceania[key]}')

Let's do this alphabetically by employing the `sorted()` function around the keys:

In [None]:
for key in sorted(oceania.keys()):
  print(f'the pop of {key} is {oceania[key]}')
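
`sorted()` can also order the keys by their values instead of alphabetically. The sketch below does this on a four-entry subset of the data (named `small_oceania` for this example only), using the optional `key` and `reverse` arguments of `sorted()`.

```python
small_oceania = {'Niue': 1611, 'Samoa': 199052, 'Tuvalu': 10640, 'Pitcairn Islands': 50}

# rank each country by its population, largest first
by_population = sorted(small_oceania.keys(),
                       key=lambda country: small_oceania[country],
                       reverse=True)

for country in by_population:
  print(f'the pop of {country} is {small_oceania[country]}')
```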


Now let's consider sorting our population sizes into different categories, such as these:

> *large* = more than 1,000,000 people

> *medium* = 100,000 to 1,000,000 people

> *small* = 10,000 to fewer than 100,000 people

> *very small* = fewer than 10,000 people

Using these criteria, we can adapt the previous `for` loop so that it returns statements like this:

```
The population of Australia is large: 25710853.
```

We can do this with an `if` statement followed by a series of `elif` statements.




In [None]:
# loop through the sorted keys
for key in sorted(oceania.keys()):
  # consider different sizes and categories
  if oceania[key] < 10000:
    size = 'very small'
  # each elif is only reached when the checks above it have failed, so only
  # the upper bound is tested (exactly 10000 or 100000 is no longer missed)
  elif oceania[key] < 100000:
    size = 'small'
  elif oceania[key] <= 1000000:
    size = 'medium'
  else:
    size = 'large'

  print(f'the pop of {key} is {size}! - {oceania[key]}')

Let's take the total population of Oceania to be 41,909,794.

We can continue adapting the previous `for` loop to return statements like this:

```
The population of Australia is large: 25710853, which is _____ percent of the total population of Oceania.
```



In [None]:
# total population of Oceania
total_pop = 41909794

# loop through the sorted keys
for key in sorted(oceania.keys()):
  # consider the size and assign a value
  # can you think of a different way to do this?
  if oceania[key] < 10000:
    size = 'very small'
  # each elif is only reached when the checks above it have failed, so only
  # the upper bound is tested (exactly 10000 or 100000 is no longer missed)
  elif oceania[key] < 100000:
    size = 'small'
  elif oceania[key] <= 1000000:
    size = 'medium'
  else:
    size = 'large'

  # calculate population percentage and round to 3 decimals.
  pop_perc = round((oceania[key] / total_pop) *100, 3)

  print(f'the pop of {key} is {size}! - it is {oceania[key]}, which is {pop_perc}% of Oceania!')
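
One possible answer to the question in the comment above is to move the size logic into a function, so that it is written only once. This is a sketch rather than the only way to do it: `size_category` and `small_oceania` are names made up for this example, and the treatment of values that sit exactly on a boundary is an assumption based on the category list.

```python
def size_category(population):
  """Return a size label for a population count."""
  if population < 10000:
    return 'very small'
  elif population < 100000:
    return 'small'
  elif population <= 1000000:
    return 'medium'
  else:
    return 'large'

# a four-entry subset of the data, for illustration
small_oceania = {'Niue': 1611, 'Tuvalu': 10640, 'Samoa': 199052, 'New Zealand': 4795886}

# store each label in a new dictionary, keyed by country
sizes = dict()
for key in sorted(small_oceania.keys()):
  sizes[key] = size_category(small_oceania[key])
  print(f'the pop of {key} is {sizes[key]}! - {small_oceania[key]}')
```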

# Building more complex dictionaries

The dictionaries used so far have all included single `key:value` pairs. We can include more complex information in our dictionaries. For example, we could create a dictionary which stores lexical information about different texts — this is an ideal way to contain information about language because a dictionary can be continuously updated with new texts and new features.

Let's start with two small texts.

In [None]:
whale = """The sea was angry that day, my friends - like an old man trying to send back soup in a deli.
I got about fifty feet out and suddenly, the great beast appeared before me.
I tell you, he was ten stories high if he was a foot.
As if sensing my presence, he let out a great bellow."""

the_kramer = """I sense great vulnerability. A man-child crying out for love. An innocent orphan in the post-modern world.
I see a parasite. A sexually depraved miscreant who is seeking only to gratify his basest and most immediate urges.
His struggle is man's struggle. He lifts my spirit.
He is a loathsome, offensive brute. Yet I can’t look away.
He transcends time and space.
He sickens me.
I love it.
Me too.
"""

Let's create a dictionary in which the keys will represent each of these texts. The values of each key will be a *new* dictionary which contains information about the text. The first piece of information will be the text itself:



In [None]:
# create a dictionary named sf_dict with one key(whale)
# the value for the key is another dictionary, with one key (text) and the value is the string associated with whale (above)
sf_dict = {'whale': {'text': whale} }
sf_dict

In [None]:
# we can index the text by using [key][key]
# this is asking first to look at the key "whale", then at the key "text"
sf_dict['whale']['text']

Now that we've added one text to the dictionary, let's add the second:

In [None]:
sf_dict['the_kramer'] = {'text': the_kramer}
sf_dict

Now that we have our texts stored in the same dictionary, each with its own sub-dictionary, let's start performing some actions. Let's start simple and include the total number of words per text using `len()` and `.split()`.

Now, we could independently calculate these values and then add them to the dictionary manually.

Or, we could initiate a for loop which simultaneously loops over the dictionary and creates the new values for us in one go.

In the following cell, I do this in one line:

In [None]:
# add total length of text to the dictionary
for key in sf_dict.keys():
  sf_dict[key]['word_length'] = len(sf_dict[key]['text'].split())

There's a lot going on in that one line, so I have created an annotation which can help unpack what is going on:

(and yes, I recognize I am using `.split()` and not `nltk.word_tokenize`!)

<img src = https://i.imgur.com/202gE2K.png>

Basically, we are first selecting an entry in our top-level dictionary, and because that will return a second dictionary, we then select an entry from *that* dictionary. If the annotated screenshot above doesn't quite help, perhaps this sketch of the structure is more straightforward:

```
top-level dictionary (sf_dict)
  - entry 1 (whale)
    - whale dictionary
      - entry 1 (text)
      - entry 2 (word_length)
  - entry 2 (the_kramer)
    - the_kramer dictionary
      - entry 1 (text)
      - entry 2 (word_length)

```

In [None]:
# we can see the results of our new variable by calling it, for each text...
sf_dict['whale']['word_length']

In [None]:
sf_dict['the_kramer']['word_length']
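
The same pattern can be written by looping over `.items()`, which hands you each inner dictionary directly. Here is a minimal sketch on a toy dictionary (`mini_dict`, `name`, and `info` are names made up for this example); because `info` refers to the inner dictionary itself, adding a key to it updates `mini_dict`.

```python
mini_dict = {'text_a': {'text': 'the sea was angry that day'},
             'text_b': {'text': 'I love it'}}

for name, info in mini_dict.items():
  # info is the inner dictionary for this text
  info['word_length'] = len(info['text'].split())

print(mini_dict['text_a']['word_length'])
print(mini_dict['text_b']['word_length'])
```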

Hopefully you can see how a single loop over a dictionary, performing operations on the information it holds, turns the dictionary into a one-stop data container that can be expanded to include as many pieces of information as you like.

Let's extend the above example and create a dictionary of lexical information for our texts. For each text, we will report:

- the total number of words according to `.split()`
- the total number of tokens according to `nltk.word_tokenize()`
- the resulting TTR using `.split()` versus `nltk.word_tokenize()`
- the top 5 most frequent words using `nltk.FreqDist()` and `.split()`
- the top 5 most frequent words using `nltk.FreqDist()` and `nltk.word_tokenize()`

Sounds like a lot, right? You should know how to do each of these operations - it's just a matter of adding them to the for loop that we've already seen above.
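
Before building the full loop, here is a minimal sketch of the arithmetic behind two of those features. It uses only `.split()` and `collections.Counter` from the standard library (in place of `nltk.FreqDist()`, so that the example runs without NLTK). The sentence is made up for illustration, and TTR is computed here as the number of distinct types divided by the number of tokens.

```python
from collections import Counter

text = 'the sea was angry and the sky was grey'

# tokens: every word, including repeats
tokens = text.split()

# types: each distinct word counted once
types = set(tokens)

# type-token ratio: number of types divided by number of tokens
ttr = len(types) / len(tokens)

# the two most frequent words and their counts
top_two = Counter(tokens).most_common(2)

print(len(tokens), len(types), round(ttr, 3))
print(top_two)
```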

In [None]:
# load in required nltk resources.
import nltk
nltk.download('punkt_tab')

In [None]:
# let's delete the word_length entry we made above
del sf_dict['whale']['word_length']
del sf_dict['the_kramer']['word_length']

In [None]:
# Expand the for loop to calculate more features:

for key in sf_dict.keys():
  # number of words according to .split()
  sf_dict[key]['num_split_tokens'] = len(sf_dict[key]['text'].split())

  # number of words according to nltk.word_tokenize
  sf_dict[key]['num_nltk_tokens'] = len(nltk.word_tokenize(sf_dict[key]['text']))

  # ttr from split: number of types (unique words) divided by number of tokens
  sf_dict[key]['split_ttr'] = len(set(sf_dict[key]['text'].split())) / len(sf_dict[key]['text'].split())

  # ttr from nltk tokens: number of types divided by number of tokens
  sf_dict[key]['nltk_tokens_ttr'] = len(set(nltk.word_tokenize(sf_dict[key]['text']))) / len(nltk.word_tokenize(sf_dict[key]['text']))

  # top 5 most frequent words using .split()
  sf_dict[key]['split_five_most_frequent'] = nltk.FreqDist(sf_dict[key]['text'].split()).most_common(5)

  # top 5 most frequent words using nltk tokens
  sf_dict[key]['nltk_five_most_frequent'] = nltk.FreqDist(nltk.word_tokenize(sf_dict[key]['text'])).most_common(5)


In [None]:
# now look at the information we have about the text
sf_dict['whale']

In [None]:
sf_dict['the_kramer']

## **Your Turn**

Compare the output from the two texts above. Think about the difference you see between the `.split()` and `nltk.word_tokenize()` approaches.

- Do you remember why these functions are providing different results?
- What do you notice about the frequency results? Is there any pre-processing we might want to do, and if so, which results would be changed as a result of different pre-processing?