# Dictionaries and Sets

# Using dictionaries

Dictionaries are common and useful in Python for organizing stored data.  They let you look up a value using a key such as a string, instead of a numeric position.  The object you use to look up the item is called a **key**, and the item retrieved is called the **value**.



In the following example, we create a dictionary of prices for some foods at a restaurant, and then look up the value for the salmon.

In [5]:
my_menu_dict = {
    "Salmon": 25,
    "Steak": 30,
    "Mac and cheese" : 18
}

print(my_menu_dict["Salmon"])

25


We could instead initialize an empty dictionary and then insert the key-value pairs.  That would look like this:

In [6]:
my_menu_dict = {} # empty dictionary
my_menu_dict["Salmon"] = 25
my_menu_dict["Steak"] = 30
my_menu_dict["Mac and cheese"] = 18

print(my_menu_dict["Salmon"])

25


Depending on what you're doing, you may want your dictionary to return a default value if the key isn't found; if you don't specify one, you will get a KeyError when you try to look up a key with no stored value.  You can do this with the get() method, which takes the key to look up as its first argument and the default value to return as its second argument.  This takes the place of looking up the value with square brackets.

In [10]:
my_dict = {}
my_dict.get('sushi', 0)

0
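
As a minimal sketch of the difference between the two lookup styles, the following compares square brackets with get() on a missing key.  The menu contents and the key "sushi" are only illustrative.

```python
menu = {"Salmon": 25, "Steak": 30}

# Square brackets raise KeyError when the key is absent.
try:
    price = menu["sushi"]
except KeyError:
    price = None

# get() returns the supplied default instead of raising.
default_price = menu.get("sushi", 0)

# Without a second argument, get() falls back to None.
no_default = menu.get("sushi")

print(price, default_price, no_default)
```

Note that get() does not store the default in the dictionary; it only returns it.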

There are other ways we could store data.  We could, for example, store everything in lists of tuples, and iterate down the list looking for the right menu item.  But, that would be slow if we planned to do a lot of lookups - every one of them would involve a little search for the right word.

Dictionaries offer *expected constant time* lookups, which means that, on average, a lookup doesn't get slower as the number of items stored gets bigger.
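
To make the comparison concrete, here is a rough sketch of the list-of-tuples alternative next to a dictionary holding the same data.  The helper name find_price is made up for this illustration.

```python
menu_pairs = [("Salmon", 25), ("Steak", 30), ("Mac and cheese", 18)]

def find_price(pairs, name):
    """Linear search: inspects pairs one by one until the name matches."""
    for item, price in pairs:
        if item == name:
            return price
    return None  # not on the menu

menu_dict = dict(menu_pairs)  # same data, indexed by key

print(find_price(menu_pairs, "Mac and cheese"))  # walks past two other entries first
print(menu_dict["Mac and cheese"])               # goes straight to the entry
```

With three items the difference is negligible; the search cost of the list version grows with the length of the list.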

# Why hash tables are fast

Dictionaries are fast because they use a technique called *hashing*.  But before we can explain hashing, we need to take a step back and explain why arrays are fast.



Recall that arrays are also fast data structures.  They are indexed by natural numbers, not strings.  Indexing into an array doesn't require looking through all the data to find the matching record.  Rather, the computer takes the address in memory where the array starts and adds (size of data type * index) to get the correct address.  All it needs is a quick computation to find the right place in memory given an index.
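
The address computation can be written out as a one-line formula.  The function name and the numbers below (an array of 8-byte items starting at address 1000) are assumptions chosen purely for illustration.

```python
def element_address(base_address, item_size, index):
    """Address of element `index` in a contiguous array (hypothetical helper)."""
    return base_address + item_size * index

print(element_address(1000, 8, 0))  # 1000
print(element_address(1000, 8, 3))  # 1024
```

The cost of this computation is the same whether the array holds ten items or ten million.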



Hashing similarly only needs a quick computation on the key to figure out where it should go.  A *hash function* is a function specially designed to send data to an address within a range seemingly at random.  The function isn't actually random, which is good, because we'll want to find the data again quickly in the same spot where we left it.  Giving the same argument to the hash function always produces the same arbitrary value.



The hash function is just designed to spread the data evenly over the space allocated to the hash table.  It's deterministic, so the data will be found again in the same place by calling the hash function again.  But it's "pseudorandom," in the sense that it spreads out the data so well and with such apparent unpredictability that we may as well treat the function as producing a random index.
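
A small sketch of the idea, using Python's built-in hash() and a made-up table size of 8.  The helper bucket_index is hypothetical, and real dictionaries do more work than this.  By default Python randomizes string hashes per interpreter process, so the index may differ between runs, but it stays the same within one run.

```python
def bucket_index(key, table_size):
    """Map a key to a slot in a table of `table_size` slots (hypothetical helper)."""
    return hash(key) % table_size

first = bucket_index("Salmon", 8)
second = bucket_index("Salmon", 8)

print(first == second)  # True: same key, same slot
```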



What might this hash function look like?  One approach for a string could be to read its bits as a number $M$, then compute $M^n$ mod $p$ for some large numbers $p$ and $n$.  It isn't supposed to make sense relative to what the string said; in fact, rather the opposite, because we want it to produce very different results for very similar strings, lest we get *collisions* where multiple pieces of data try to go to the same place.
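
Here is a toy version of that $M^n$ mod $p$ idea.  The values of n and p are arbitrary choices for this sketch, and this is not the hash function Python itself uses.

```python
def toy_hash(text, n=65_537, p=1_000_003):
    """Toy hash: read the string's bytes as an integer M, return M**n mod p."""
    m = int.from_bytes(text.encode("utf-8"), "big")
    return pow(m, n, p)  # three-argument pow performs modular exponentiation

# Two strings that differ only in their last letter.
print(toy_hash("Salmon"), toy_hash("Salmoo"))
```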



Collisions are something that could slow us down, but aside from those, storing data and looking up data both take a short amount of time that doesn't change with the amount of data stored.  It's just a quick computation of a hash function in either case to find the right address in memory.

# Collisions

Suppose we find that applying the hash function to two different keys produces the same address for both.  This is always possible, since there are more possible strings than locations in the hash table's memory.  What happens next?



Python dictionaries use a scheme called "open addressing."  If the hash function said to store the value at address h, and that's occupied, Python queries another function for where to try next.  Similar to the first call, the subsequent calls also generate random-looking addresses to try.



This continues until it finds a free space.  In practice Python enlarges the table well before it fills up, so running out of room is an unusual occurrence.

Lookup goes through a similar process, reconstructing the addresses that would have been tried on storing the value.
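
The following is a simplified sketch of open addressing, not Python's actual implementation.  It swaps in *linear probing* (try the next slot, then the next) in place of random-looking probe addresses, uses a fixed-size table, and does not support deletion.  The class name ToyHashTable is made up for this example.

```python
class ToyHashTable:
    """Fixed-size open-addressing table using linear probing (illustrative only)."""

    def __init__(self, size=8):
        self.slots = [None] * size  # each slot is None or a (key, value) pair

    def _probe(self, key):
        """Yield candidate slot indices, starting at the hashed position."""
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value):
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is None or slot[0] == key:  # free slot, or same key: overwrite
                self.slots[index] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        # Retrace the same probe sequence that put() would have followed.
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is None:  # reached an empty slot: the key was never stored
                return default
            if slot[0] == key:
                return slot[1]
        return default


table = ToyHashTable(size=4)
table.put("Salmon", 25)
table.put("Steak", 30)
table.put("Salmon", 26)  # same key: overwrites the earlier value
print(table.get("Salmon"), table.get("Steak"), table.get("Sushi", 0))
```

Because get() follows the same sequence of slots as put(), it either finds the key or hits an empty slot, which proves the key was never stored.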

# Example of dictionary in action

Something that machine learning sometimes wants to do to text is count how many times each word appears in the text.  The frequency of different words can help to tell you who wrote a piece, and what they were feeling at the time.


In [11]:
two_cities = """It was the best of times, it was the worst of times,
 it was the age of wisdom, it was the age of foolishness, it was the epoch of belief,
 it was the epoch of incredulity, it was the season of light, it was the season of darkness,
 it was the spring of hope, it was the winter of despair."""

worddict = {}
wordlist = two_cities.split()
for word in wordlist:
  if word in worddict:  # Check for presence of key
    worddict[word] += 1
  else:
    worddict[word] = 1

print(worddict["age"])
print(worddict["of"])

2
10
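
The same counting pattern can be written a little more compactly with get(), which supplies 0 the first time a word is seen.  The sketch below uses a shortened sample sentence so that it runs on its own, and compares the result with collections.Counter from the standard library.

```python
from collections import Counter

sample = "it was the best of times it was the worst of times"

counts = {}
for word in sample.split():
    # get() returns 0 for a word that has no entry yet
    counts[word] = counts.get(word, 0) + 1

print(counts["times"])  # 2
print(counts["best"])   # 1

# Counter builds the same word-to-count mapping in one step.
same_counts = Counter(sample.split())
print(same_counts == counts)  # True
```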


# Iterating over a dictionary

You can iterate over a dictionary using items() to generate key, value pairs.  You can also call individual methods to get just the keys or just the values; these are keys() and values(), not demonstrated here.

(This also reveals we weren't careful about handling capitalization and punctuation before making our counts - ideally, we should use a more advanced function to break the sentence into parts, called a tokenizer.)

In [12]:
for word, count in worddict.items():
  print(word + ":" + str(count))

It:1
was:10
the:10
best:1
of:10
times,:2
it:9
worst:1
age:2
wisdom,:1
foolishness,:1
epoch:2
belief,:1
incredulity,:1
season:2
light,:1
darkness,:1
spring:1
hope,:1
winter:1
despair.:1
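
One simple way to reduce the capitalization and punctuation problem is to lowercase the text and keep only runs of letters.  This is a rough sketch using a regular expression on a shortened sample, not a substitute for a real tokenizer.

```python
import re

sample = "It was the best of times, it was the worst of times,"

# Lowercase first, then keep runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", sample.lower())

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

print(counts["it"])     # 2
print(counts["times"])  # 2
```

With this normalization, "It" and "it" share one entry, and "times," is counted as "times".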


# Sets

Sometimes there are no key-value relationships to remember, just a set of keys.  For example, a set of IP addresses may be banned from connecting to our server.  

This is a good application for sets, which just store keys using hash tables.  As with dictionaries, we could just store the data in lists, but that would be inefficient if we want a quick lookup (there are lots of IP addresses out there).

In [15]:
bigIPs = {"209.85.231.104", "207.46.170.123", "72.30.2.43"}

bigIPs.add("208.80.152.2")
len(bigIPs)

4

In [13]:
newset = set()
newset.add("First item")
print("First item" in newset)

True
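
As a small sketch of the banned-address use case: adding an item that is already present leaves the set unchanged, and membership tests read naturally.  The function is_allowed is a hypothetical helper, not part of any server library.

```python
banned = {"209.85.231.104", "207.46.170.123"}

banned.add("207.46.170.123")  # already present: the set does not grow
print(len(banned))  # 2

def is_allowed(ip, banned_ips):
    """Hypothetical check a server might run on each incoming connection."""
    return ip not in banned_ips

print(is_allowed("72.30.2.43", banned))      # True
print(is_allowed("209.85.231.104", banned))  # False
```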


# in

Python tries to give some shared functionality across its different container data structures, but the implementations are very different.  "in" works for dictionaries, sets, and lists, returning True if the key or item is present.  But in dictionaries and sets, the actual implementation is a quick hash lookup, while for lists, the list is scanned for the item.

In [16]:
"72.30.2.43" in bigIPs

True

In [17]:
bigIPsList = ["209.85.231.104", "207.46.170.123", "72.30.2.43"]
"72.30.2.43" in bigIPsList

True
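
One detail worth spelling out: for a dictionary, "in" tests the keys, not the values.  A brief illustration with a made-up menu:

```python
menu = {"Salmon": 25, "Steak": 30}

print("Salmon" in menu)     # True: "Salmon" is a key
print(25 in menu)           # False: 25 is a value, not a key
print(25 in menu.values())  # True, but this scans the values one by one
```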

# Speed demo

Here's a demo of the speed difference between searching lists and looking up in dictionaries.  The difference becomes more significant as the tables get bigger.

In [18]:
two_cities_extended = """It was the best of times,
it was the worst of times, it was the age of wisdom,
it was the age of foolishness, it was the epoch of belief,
it was the epoch of incredulity, it was the season of Light,
it was the season of Darkness, it was the spring of hope,
it was the winter of despair, we had everything before us,
we had nothing before us, we were all going direct to Heaven,
we were all going direct the other way--in short, the period was
so far like the present period that some of its noisiest authorities
insisted on its being received, for good or for evil, in the superlative
degree of comparison only.

There were a king with a large jaw and a queen with a plain face,
on the throne of England; there were a king with a large jaw and a
queen with a fair face, on the throne of France. In both countries
it was clearer than crystal to the lords of the State preserves of
loaves and fishes, that things in general were settled for ever.

It was the year of Our Lord one thousand seven hundred and seventy-five.
Spiritual revelations were conceded to England at that favoured period,
as at this. Mrs. Southcott had recently attained her five-and-twentieth
blessed birthday, of whom a prophetic private in the Life Guards had heralded
the sublime appearance by announcing that arrangements were made for the
swallowing up of London and Westminster. Even the Cock-lane ghost had been
laid only a round dozen of years, after rapping out its messages, as the
spirits of this very year last past (supernaturally deficient in originality)
rapped out theirs. Mere messages in the earthly order of events had lately
come to the English Crown and People, from a congress of British subjects
in America: which, strange to relate, have proved more important to the human
race than any communications yet received through any of the chickens of the
Cock-lane brood. 
"""

# Both pieces of timed code look for every word in their respective data structures

wordlist = two_cities_extended.split()

# Using a list
def find_by_list(wordlist):
  for word in wordlist:
    if word in wordlist:
      continue # Move on to next iteration of the for loop

%time find_by_list(wordlist)


CPU times: user 1.2 ms, sys: 68 µs, total: 1.27 ms
Wall time: 1.42 ms


In [19]:
# Using dictionary
worddict = {}
for word in wordlist:
  if word in worddict:
    worddict[word] += 1
  else:
    worddict[word] = 1

def find_by_dict(wordlist, lookup):  # "lookup" avoids shadowing the built-in dict
  for word in wordlist:
    if word in lookup:
      continue # Move on to next iteration of the for loop

%time find_by_dict(wordlist,worddict)

CPU times: user 48 µs, sys: 1 µs, total: 49 µs
Wall time: 55.8 µs
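
The %time command only works inside IPython or Jupyter.  As a rough sketch of the same comparison in plain Python, the standard timeit module can be used instead.  The word list here is made-up data, and the timings will vary from machine to machine.

```python
import timeit

words = [f"word{i}" for i in range(1000)]  # made-up data for illustration
word_dict = dict.fromkeys(words, 0)        # same words as dictionary keys

def count_hits(candidates, container):
    """Count how many candidates are found in the container."""
    return sum(1 for word in candidates if word in container)

# Run each search three times and report the total elapsed seconds.
list_seconds = timeit.timeit(lambda: count_hits(words, words), number=3)
dict_seconds = timeit.timeit(lambda: count_hits(words, word_dict), number=3)

print(f"list: {list_seconds:.4f} s, dict: {dict_seconds:.4f} s")
```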


# Pass by reference

As long as we're talking about lists, dictionaries, and sets, there's another point these all have in common:  they don't get copied when they are passed to functions or assigned to variables.  The new variable gets a copy of the *address* of the data, so any changes made inside the function or through the new variable will be seen on the original.  This avoids the time it would take to copy the data, which may be unnecessary, but it can be confusing to newcomers.

This is best illustrated with examples.

In [20]:
mydict = {"a":1000}
dict2 = mydict # gets the address, so any changes are permanent to the original
dict2["b"] = 500
print(mydict) # we modified the original!

{'a': 1000, 'b': 500}
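
The "is" operator can confirm whether two names refer to the very same object.  A short sketch, with variable names chosen for this illustration:

```python
original = {"a": 1000}
alias = original         # second name for the same dictionary
clone = original.copy()  # separate dictionary with the same contents

print(alias is original)  # True
print(clone is original)  # False
print(clone == original)  # True: equal contents, different objects

alias["b"] = 500
print("b" in original)  # True: the alias changed the shared object
print("b" in clone)     # False: the copy was unaffected
```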


In [21]:
def add1(mylist):
  mylist.append(1)
  return mylist

mylist = ["a","b"]
mylist2 = add1(mylist)
print(mylist)  # changes were permanent here

['a', 'b', 1]


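As a reminder, immutable types such as integers don't work like this: a += 1 inside a function produces a new value and rebinds only the local name, so the caller's variable is untouched.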

In [22]:
def add1(a):
  a += 1 # rebinds only the local name, and nothing is returned, so it's forgotten on return

a = 10
add1(a)
print(a) # still 10

10
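
One way to keep the two behaviors straight is to separate *mutating* an object from *rebinding* a name.  The sketch below, with function names invented for this example, shows both applied to a list.

```python
def append_in_place(items):
    items.append(1)      # mutates the caller's list

def rebind_locally(items):
    items = items + [1]  # builds a new list and rebinds only the local name

first = ["a", "b"]
append_in_place(first)
print(first)   # ['a', 'b', 1]

second = ["a", "b"]
rebind_locally(second)
print(second)  # ['a', 'b']
```

So even a list is left alone when the function only rebinds its parameter; it is the in-place methods such as append() that reach the caller's data.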


This is actually a very important point for avoiding bugs, especially when working with lists - there's a grave danger of handing off a value thinking it's a copy, and then modifying the original data structure accidentally.

Working with memory is on the expensive side as operations go, and this behavior makes sure large blocks of memory aren't copied unless the programmer specifically requests a copy, for example with the copy() method.

In [23]:
def add1(mylist):
  # Returns a copy of the list with "1" appended
  list2 = mylist.copy()
  list2.append(1) # original list is now unmodified
  return list2

mylist = ["a","b"]
mylist2 = add1(mylist)
print(mylist)  # changes were made to original before, now they're not
print(mylist2)

['a', 'b']
['a', 'b', 1]
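
One caveat: the list copy() method makes a *shallow* copy, so nested lists inside it are still shared.  The sketch below contrasts it with copy.deepcopy from the standard library, using made-up order data.

```python
import copy

orders = [["Salmon", 25], ["Steak", 30]]

shallow = orders.copy()       # new outer list, same inner lists
deep = copy.deepcopy(orders)  # new outer list and new inner lists

orders[0][1] = 99             # change a price inside the original

print(shallow[0][1])  # 99: the inner list is shared
print(deep[0][1])     # 25: fully independent
```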


# Exercise

Does the longer passage from *A Tale of Two Cities* use all the letters of the alphabet?  Find out by first populating a set with the letters inside (note that you can index into strings as if they were lists, with []).  Then iterate over string.ascii_lowercase (a string containing all lowercase letters of the alphabet) checking whether each letter is in the set.  (This method is faster than some other possible approaches because it makes only a single pass through the original long string.)


In [24]:
from string import ascii_lowercase
# To iterate through lc letters, "for c in ascii_lowercase"

myset = set()
for i in range(len(two_cities_extended)):
  myset.add(two_cities_extended[i].lower())

def checkletters(myset):
  for c in ascii_lowercase:
    if c not in myset:
      print("Missing: " + c)
      return False
  print("All found")
  return True

checkletters(myset)

Missing: x


False
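
As one possible alternative, set difference can report every missing letter at once instead of stopping at the first.  The function name missing_letters is made up for this sketch, and the short test sentences stand in for the longer passage.

```python
from string import ascii_lowercase

def missing_letters(text):
    """Return the lowercase ASCII letters that never appear in `text`, sorted."""
    return sorted(set(ascii_lowercase) - set(text.lower()))

print(missing_letters("The quick brown fox jumps over the lazy dog"))  # []
print(missing_letters("It was the best of times"))
```

Calling missing_letters(two_cities_extended) in the notebook would list all absent letters in one pass over the passage.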