# Dictionaries



## Table of Contents
- [1. A Bit of History](#history)
- [2. The Dictionary Data Type](#dictionary)
- [3. Dictionary as a Set of Counters](#counters)
- [4. Dictionaries and Files](#dict-files)
- [5. Dictionaries and Lists](#dict-lists)
- [6. Iteration and Dictionaries](#iter-dict)
- [7. Reverse Lookup](#reverse-lookup)
- [8. Summary](#summary)

## 1. A Bit of History <a class="anchor" id="history"></a>

<img src="assets/samuel-johnson.jpeg.webp" alt="Samuel Johnson" width="400"/>

<div style="text-align:center">
    <span style="font-size:0.9em; font-weight: bold;"><b> Samuel Johnson</b></span>
</div>

Samuel Johnson is the man behind "A Dictionary of the English Language", the first definitive English dictionary. The work, also called Johnson’s Dictionary, was first published in 1755 and is viewed with reverence by modern lexicographers.

His dictionary was the first book to address English as it was written and spoken. It was the first to include context-based information about English, and it was the first to attempt to enforce a standard of spelling and grammar upon unruly English, which had no equivalent of an academy to defend its use as proper or improper.

To understand Johnson’s undertaking, it is important to understand the state of English lexicography in the middle of the 18th century. There were a handful of glossaries of difficult words, but overall there was no reference the English reader could consult for the words one might encounter on a day-to-day basis. In addition, books were becoming widely available and literacy in England was growing.

Several book publishers got together and commissioned Johnson to compile a dictionary similar to the one created by the French Academy. In France, that effort took 40 scholars and 40 years to complete, while Johnson completed his dictionary in nine years.

## 2. The Dictionary Data Type <a class="anchor" id="dictionary"></a>

The **dictionary** data type in Python is similar to a list, but quite different from 
the dictionaries we know from daily life.
In a list, the indices have to be integers; in a dictionary they can be (almost) any type.

A dictionary contains a collection of indices, which are called **keys**, and a collection of
**values**. 
Every key is associated with a single value. 
The association of a key and a value is called a **key-value pair** or sometimes an **item**.

In mathematical language, a dictionary represents a mapping from keys to values, so you
can also say that each key “maps to” a value. 
Consider a dictionary that maps English words to Dutch words: both the keys and the values are strings.

If you want to use a dictionary as argument or result of a function together with a type hint you have to add the following line to your code:

`from typing import Dict`

In [None]:
from typing import Dict

In [None]:
eng2dut: Dict = dict()
print(eng2dut)

The function `dict` creates a new dictionary with no items. 
Because `dict` is the name of a built-in function, you should avoid using it as a variable name: doing so would hide the built-in.

The curly-brackets, `{}`, represent an empty dictionary. 
To add items to the dictionary, you can use square brackets.

In [None]:
eng2dut['one'] = 'een'

We have now created one entry with the key `'one'` and the value `'een'`.

If we print our dictionary we will see one key-value pair, where the key and value are separated by a colon.

In [None]:
print(eng2dut)

You can also create a dictionary as follows.

In [None]:
eng2dut: Dict = {
    'one': 'een', 
    'two': 'twee', 
    'three': 'drie', 
    'four': 'vier'
}
print(eng2dut)

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Create a dictionary <i>capitals</i> that maps countries to capital cities. Use the country as key of the pair. Include the following: The Netherlands - Amsterdam, Brazil - Brasilia, Australia - Canberra, Cuba - Havana, Kenya - Nairobi, Canada - Ottawa, Japan - Tokyo.
</div>

In [None]:
# Remove this line and add your code here

We can add a new *value* to the dictionary by assigning it to a new *key*. This works because dictionaries are mutable.

In [None]:
eng2dut['five'] = 'vijf'
print(eng2dut)

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Add two new pairs to the dictionary <i>capitals</i>.
</div>

In [None]:
# Remove this line and add your code here

We can also build a dictionary via the built-in `dict()` function.

In [None]:
eng2dut = dict([
    ('one', 'een'),
    ('two', 'twee'),
    ('three', 'drie'),
    ('four', 'vier'),
    ('five', 'vijf'),
    ('six', 'zes'),
    ('seven', 'zeven'),
    ('eight', 'acht'),
    ('nine', 'negen')
])

When the dictionary is printed, the key-value pairs appear in the order in which they were inserted (this is guaranteed since Python 3.7; in older versions the order was unpredictable).
The order is rarely an issue, because values are retrieved via their keys and **not** via positional indices.

This means that in order to retrieve a value from a dictionary you need to know its corresponding key.

In [None]:
eng2dut['two']

If you try to retrieve a value for a non-existing key, you will get a `KeyError` exception.

In [None]:
eng2dut['ten']

The function `len` returns the number of key-value pairs in the dictionary.

In [None]:
len(eng2dut)

If you are not certain whether a key is present in your dictionary, you can use the `in` operator, which also works on dictionaries. 

It tells you whether a value is used as a **key**.

In [None]:
'one' in eng2dut

In [None]:
'een' in eng2dut

If you want to check whether a **value** is stored in the dictionary, you first have to retrieve all values through the `values` method and then apply the `in` operator to the result.

In [None]:
dutch_words = eng2dut.values()

print(dutch_words)

Similar to the `values` method, there is also the `keys` method.

In [None]:
english_words = eng2dut.keys()

print(english_words)

The `in` operator uses different algorithms for lists and dictionaries.
In a list, it uses a linear search: the longer the list gets, the more time the search takes.

In a dictionary, a **hash table** is used: the `in` operator takes about the same amount of time no matter how many items are in the dictionary. 
This property makes dictionaries a very powerful data structure when storing large amounts of data.
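The difference can be made visible with a small timing experiment. The sketch below is illustrative only: the collection size, the names `numbers_list` and `numbers_dict`, and the use of the standard `timeit` module are choices made here, and the actual timings depend on your machine.

```python
import timeit

size = 100_000
numbers_list = list(range(size))
numbers_dict = dict.fromkeys(range(size))  # keys 0..size-1, all values None

target = size - 1  # worst case for the list: the last element

# Both containers give the same answer ...
print(target in numbers_list, target in numbers_dict)

# ... but the list has to be scanned, while the dictionary uses hashing.
list_time = timeit.timeit(lambda: target in numbers_list, number=100)
dict_time = timeit.timeit(lambda: target in numbers_dict, number=100)
print(f'list: {list_time:.5f} s, dict: {dict_time:.5f} s')
```

Increasing `size` should make the list lookup noticeably slower, while the dictionary lookup should stay roughly the same.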

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Check if the <i>capitals</i> dictionary contains the key <i>Mexico</i>.
</div>

In [None]:
# Remove this line and add your code here

## 3. Dictionary as a Set of Counters <a class="anchor" id="counters"></a>

Suppose you are given a string and you want to count how many times each letter appears, the so-called letter frequency. 
This type of exercise is useful in its own right, and you will encounter variations of it often: think of the frequency of words in a text, names in an archive, purchases per customer, etc.

We are computing a **histogram**, which is a statistical term for a set of frequencies.

There are multiple ways of doing this:
- **Option 1:** You could create *26 variables*, one for each letter of the alphabet. 
   Then you could traverse the string and, for each character, increment the corresponding counter, probably using a chained conditional. It is not hard to imagine that this does not work for arbitrary words in a text.

- **Option 2:** You could create a *list* with 26 elements. Then you could convert each character to a number (using the built-in function `ord`). Again, this does not work for arbitrary characters in a text: the main problem is to find an efficient way to convert a word into an index.

- **Option 3:** You could create a *dictionary* with characters as keys and counters as the corresponding values. The first time you see a character, you would add an item to the dictionary. After that you would increment the value of an existing item.
This is actually a very general and reusable solution.
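For comparison, Option 2 could be sketched as follows. The function name `histogram_list` is hypothetical, and the sketch assumes that the input consists only of the lowercase letters `a` to `z`; any other character would produce a wrong or out-of-range index, which is exactly the limitation described above.

```python
from typing import List

def histogram_list(word: str) -> List[int]:
    """Counts lowercase letters a-z using a list of 26 counters."""
    counters: List[int] = [0] * 26
    for character in word:
        index = ord(character) - ord('a')  # 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
        counters[index] += 1
    return counters

counts = histogram_list('banana')
print(counts[0], counts[1], counts[13])  # counters for 'a', 'b' and 'n'
```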

This is, by the way, a typical example where you have to think carefully before choosing your data representation.
A simple, but wrong, choice may work for a small set of elements, but may get complicated or expensive if the data
becomes more bulky.

An **implementation** is a way of performing a computation; some implementations are more efficient than others.
For example, an advantage of the dictionary implementation is that we
do not have to know ahead of time which letters appear in the string and we only have to
make room for the letters that do appear.

In [None]:
def histogram(word: str) -> Dict:
    """
    Creates a dictionary for counting the number of letters in a word.
    :param word: word to process
    :returns: dictionary with characters and their frequency.
    """
    
    letter_freq: Dict = dict()
    for character in word:
        if character not in letter_freq:
            letter_freq[character] = 1
        else:
            letter_freq[character] += 1
    return letter_freq

letter_hist: Dict = histogram('softwareengineering')
print(letter_hist)

The `dict` data type offers a number of built-in methods; one of them is `get`.
The `get` method takes a *key* as first argument and a *default* value as second.
If the key is found in the dictionary, `get` returns the corresponding value; otherwise it returns the default value. The dictionary itself is not modified.
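A small sketch of how `get` behaves for a key that is present and for one that is missing, using a made-up dictionary `counts`. Note that `get` only reads from the dictionary: a new item appears only when you assign the result back yourself.

```python
counts = {'a': 2}

print(counts.get('a', 0))  # key present: its value is returned
print(counts.get('z', 0))  # key absent: the default is returned
print('z' in counts)       # get() did not add 'z' to the dictionary
```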

In [None]:
def histogram(word : str) -> Dict:
    """
    Creates a dictionary for counting the number of letters in a word.
    :param word: word to process
    :returns: dictionary with characters and their frequency.
    """
    letter_freq: Dict = dict()
    for character in word:
        letter_freq[character] = letter_freq.get(character, 0) + 1
    return letter_freq

letter_hist: Dict = histogram('computerscience')

print(letter_hist)

<div class="alert alert-info">
    <b>Plotting the histogram</b><br>
    Now we can use some useful Python libraries to plot our histogram. In this case, we use <code>matplotlib.pyplot</code> and <code>pandas</code> libraries.
</div>

In [None]:
# import the relevant libraries, see 
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

df = pd.DataFrame.from_dict(letter_hist, orient="index")
df.plot.bar(color="r");

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Call the <i>histogram</i> function with different inputs and print it to see how the histogram changes from case to case.
</div>

In [None]:
# Remove this line and add your code here

## 4. Dictionaries and Files <a class="anchor" id="dict-files"></a>

Now that we have seen and understood the **dictionary** data structure, we want to illustrate how to use dictionaries when solving certain types of programming challenges. Suppose you have to count the occurrences of words in a file; dictionaries turn out to be very convenient for this.

For this task we will consider a piece of text taken from *Romeo and Juliet* by William Shakespeare. 
This text has been altered in the sense that it does not include any punctuation marks.

In [None]:
try:
    # open() returns a file object (not a str), so no str type hint is used here
    file_handle = open('datasets/romeo.txt')
except OSError:
    print('File cannot be opened')
    exit()
    
word_cnt: Dict = dict()
for line in file_handle:
    words: list = line.split()
    for word in words:
        if word not in word_cnt:
            word_cnt[word] = 1
        else:
            word_cnt[word] += 1
file_handle.close()
            
print(word_cnt)

In the previous code we use two `for` loops. The first one iterates over the lines of the document while the second one iterates over the words of the current line in the first loop.

This pattern is quite common and is known as **nested loops**. The first loop is called the **outer loop** while the second one is named the **inner loop**.

In addition we have used the abbreviation of `word_cnt[word] = word_cnt[word] + 1` as `word_cnt[word] += 1`. 
This abbreviation is also used with `-=`, `*=`, and `/=`.
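As a quick illustration of these abbreviations, the sketch below applies each of them to a made-up variable `total`:

```python
total = 10
total += 5   # same as total = total + 5
total -= 3   # same as total = total - 3
total *= 2   # same as total = total * 2
total /= 4   # same as total = total / 4; the result is a float
print(total)
```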

Now, let us consider a piece of text that does have punctuation marks and capital letters. 
In such a text, 'soft' and 'soft!', as well as 'Who' and 'who', would be counted as different words.

To solve both problems we can rely on the `lower` and `translate` string methods. 

The `translate` method receives a translation table as input. To create this table we rely on the `maketrans` function, which takes three parameters: the characters to be replaced, the characters that replace them, and the characters to delete. For our challenge we only need the third parameter, so the first two are empty strings.
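A minimal sketch of this combination on a single made-up line of text. Only three punctuation characters are deleted here, to keep the example easy to follow; the names `table` and `cleaned` are chosen for illustration.

```python
# Only the third argument is used: every character in it is deleted.
table = str.maketrans('', '', '!?,')
cleaned = 'Soft! What light, through yonder window breaks?'.translate(table)
print(cleaned.lower())
```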

We will also use the string constant `punctuation`, a string that contains the punctuation characters. We need to import the module `string` to have access to this value.

In [None]:
import string
print(string.punctuation)

In [None]:
import string

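file_name = 'datasets/romeo-full.txt'
try:
    file_handle = open(file_name)
except OSError:
    # file_handle is not assigned when open() fails, so report the file name instead
    print('File cannot be opened:', file_name)
    exit()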
    
word_cnt: Dict = dict()
for line in file_handle:
    line = line.translate(str.maketrans('', '', string.punctuation))
    line = line.lower()
    words: list = line.split()
    for word in words:
        if word not in word_cnt:
            word_cnt[word] = 1
        else:
            word_cnt[word] += 1
file_handle.close()
            
print(word_cnt)

## 5. Dictionaries and Lists <a class="anchor" id="dict-lists"></a>

Lists can be used as values in a dictionary.

Suppose you want to create a dictionary where the frequency of the letters is the key and the value is a list of letters with that frequency.
In fact, you invert the dictionary, creating a dictionary that maps frequencies to letters.

In [None]:
def invert_dict(dct: Dict) -> Dict:
    """
    Inverts a dictionary.
    :param dct: dictionary to be inverted
    :returns: inverted dictionary
    """
    inv_dct: Dict = dict()
    for key in dct:
        value = dct[key]
        if value not in inv_dct:
            inv_dct[value] = [key]
        else:
            inv_dct[value].append(key)
    return inv_dct

At each loop iteration, `key` gets a key from `dct` and `value` gets the corresponding value.

If `value` is not in `inv_dct`, that means we have not seen it before, so we create a new item and
initialize it with a **singleton** (a list that contains a single element). 

Otherwise we have seen this value before, so we append the corresponding key to the list.

In [None]:
new_dct: Dict = invert_dict(letter_hist)

#print(letter_hist)
print(new_dct)

Lists can be values in a dictionary, as this example shows, but they cannot be used as keys. 

In [None]:
from typing import List

t: List = [1, 2, 3]
d: Dict = dict()
d[t] = 'oops'

So far, we have used integers or strings as keys in our dictionaries. In general, keys must be **hashable**, that is, usable as an argument of a hash function.

A **hash function** is a function that takes a value (of almost any kind) and returns an integer. 
Dictionaries use these integers, called **hash values**, to store and look up key-value pairs.
A value must be immutable in order to be usable for hashing; lists are mutable, so they cannot be hashed.

When you create a key-value pair, Python hashes the key
and stores it in the corresponding location. 
If you modify the key and then hash it again, it
would go to a different location. 
In that case you might have two entries for the same key,
or you might not be able to find a key.
So, dictionaries themselves are not suitable for hashing, because they are mutable as well.
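The built-in function `hash` makes this visible. In the sketch below (the names `point` and `locations` are made up for illustration), a tuple works as a key because it is immutable, while hashing a list fails with a `TypeError`:

```python
point = (1, 2)          # a tuple is immutable, so it can be hashed
locations = {point: 'home'}
print(locations[(1, 2)])

try:
    hash([1, 2])        # a list is mutable, so hashing it fails
except TypeError as error:
    print('TypeError:', error)
```

If you need a list-like key, converting the list to a tuple with `tuple(...)` is a common way out.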

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Use the <i>invert_dict</i> function to invert the <i>word_cnt</i> dictionary built from the Romeo and Juliet text.
</div>

In [None]:
# Remove this line and add your code here

## 6. Iteration and Dictionaries <a class="anchor" id="iter-dict"></a>

If you want to iterate over the elements of a dictionary, you can use a `for` statement. The loop traverses the keys. 

Consider the following `print_histogram` function.

In [None]:
def print_histogram(histo : Dict) -> None:
    """
    Prints a histogram dictionary.
    :param histo: histogram to print
    """
    for key in histo:
        print(key, histo[key])
        
print_histogram(letter_hist)

Be aware that a dictionary does not sort its entries: the keys are visited in insertion order, which is generally not sorted order. 

The built-in function `sorted` can be used if you want the keys to be sorted.

In [None]:
def print_histogram(histo : Dict) -> None:
    """
    Prints a histogram dictionary in a sorted manner
    :param histo: histogram to print
    """
    for key in sorted(histo):
        print(key, histo[key])
        
print_histogram(letter_hist)

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Sort the keys of the <i>capitals</i> dictionary, and print each pair.
</div>

In [None]:
# Remove this line and add your code here

## 7. Reverse Lookup <a class="anchor" id="reverse-lookup"></a>

Given a key, it is easy to find the corresponding value; this is a so-called **lookup**.
Now, suppose you have a value and you want to find the corresponding key.

There are two challenges: 

1. A value may appear multiple times in the dictionary: keys are unique, but values are not.
Consider the translation of the English words: the noun "lettuce" and the verb "beat"; the Dutch word in both cases is "sla".
Depending on the application, you might be able to pick one, or you create a list that contains all relevant keys.

2. There is no simple syntax to do a reverse lookup; you have to search explicitly.

In [None]:
from typing import Any  # `any` is a built-in function, `Any` is the type hint

def reverse_lookup(dct: Dict, val: Any) -> Any:
    """
    Finds at which key the value is stored.
    :param dct: dictionary to be searched
    :param val: value to be found
    :returns: key at which the value is stored.
    """
    for key in dct:
        if dct[key] == val:
            return key
    raise LookupError()

This function is again a typical application of the search pattern. However, if the value is not found,
we **raise** an exception.
The **raise statement** raises an exception; in our case it raises a `LookupError`, a built-in
exception that indicates that a lookup operation failed.

In [None]:
histo: Dict = histogram('programming')
key = reverse_lookup(histo, 2)
key

In [None]:
histo: Dict = histogram('programming')
key = reverse_lookup(histo, 4)
key

The effect when you raise an exception is the same as when Python raises one: it prints a
traceback and an error message.

The `raise` statement can take a detailed error message as an optional argument. 

```Python
raise LookupError('value does not appear in the dictionary')
```
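If you do not want the program to stop when a value is missing, you can catch the exception with `try` and `except`. The sketch below repeats a compact version of `reverse_lookup` so that it is self-contained; the dictionary `eng2dut_small` and the choice to fall back to `None` are assumptions made for illustration.

```python
from typing import Any, Dict

def reverse_lookup(dct: Dict, val: Any) -> Any:
    """Returns the first key that maps to val, or raises LookupError."""
    for key in dct:
        if dct[key] == val:
            return key
    raise LookupError('value does not appear in the dictionary')

eng2dut_small: Dict = {'one': 'een', 'two': 'twee'}

try:
    key = reverse_lookup(eng2dut_small, 'drie')
except LookupError as error:
    key = None  # fall back to None when the value is missing
    print('Lookup failed:', error)
```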

Beware that a reverse lookup is not efficient: it has to scan all the items, so for a large dictionary your program may become very slow.

An alternative is to return a list of keys. If the list is empty, we can conclude that the value does not appear in the dictionary.

In [None]:
from typing import Any, List

In [None]:
def reverse_lookup(dct : Dict, val : Any) -> List[Any]:
    """
    Finds at which keys the value is stored.
    :param dct: dictionary to be searched
    :param val: value to be found
    :returns: keys for which the value was found.
    """
    
    return_list: List[Any] = []
    for key in dct:
        if dct[key] == val:
            return_list.append(key)
    
    return return_list

In [None]:
histo: Dict = histogram('programming')
key = reverse_lookup(histo, 4)
key

<div class="alert alert-success">
    <b>Do It Yourself!</b><br>
    Use the <i>reverse_lookup</i> function to look for existing and non-existing values in the <i>capitals</i> dictionary.
</div>

In [None]:
# Remove this line and add your code here

## 8. Summary <a class="anchor" id="summary"></a>

Suppose you want to set up an administration where you keep track of your guests and the food you served, then **dictionaries** are very useful.

* The name of your guest can be used as key. Beware that if you have multiple guests you cannot use the guest list as key, because lists are mutable.

* The dish you served can be the value. If you serve more than one dish, you need to create a list of dishes as value.

* If you want to see which guests were served a particular dish, you need to use *reverse_lookup* or *invert_dict*. Both use a `for` loop to iterate over the keys (in our case: guests).

* You can also visualise statistics by transforming the dictionary into a histogram. This allows you to visualise the number of dishes served per guest.
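The points above could be sketched as follows. The guest names, the dishes and the variable names (`served`, `guests_per_dish`, `dishes_per_guest`) are all made up for illustration; this is one possible way to organise the data, not the only one.

```python
from typing import Dict, List

# Hypothetical data: guest name -> list of dishes served to that guest
served: Dict[str, List[str]] = {
    'Alice': ['soup', 'pasta'],
    'Bob': ['pasta'],
    'Carol': ['soup', 'salad', 'pie'],
}

# Invert the mapping: dish -> list of guests who were served that dish
guests_per_dish: Dict[str, List[str]] = dict()
for guest in served:
    for dish in served[guest]:
        if dish not in guests_per_dish:
            guests_per_dish[dish] = [guest]
        else:
            guests_per_dish[dish].append(guest)

# Number of dishes per guest: the data behind a histogram
dishes_per_guest: Dict[str, int] = dict()
for guest in served:
    dishes_per_guest[guest] = len(served[guest])

print(guests_per_dish)
print(dishes_per_guest)
```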

---

# (End of Notebook)

**TU/e** - Eindhoven University of Technology