# DSCI 511: Data acquisition and pre-processing<br>Chapter 2: Working with Python's data types
We addressed some very basic, atomic data types in Chapter 1. While we used containers here and there in the processing fundamentals to exhibit Python's syntax, we'll take some time here to go into depth, looking specifically at some of the types most frequently used in data science and exploring their methods and properties.

## 2.1 Ordered data types

### 2.1.1 Lists
A list is the most basic Python data structure. In some other programming languages, a similar data structure is called an "array". A list is simply a sequence of values. Lists are defined using square brackets, like so:

In [1]:
x = [8, 1, 5, 1, 89, 34, 3, 2, 144, 13, 34, 55, 0, 21]

In this particular example, we defined a list called "x" containing a few integer values. Python lists can contain a mix of data types, for example:

In [2]:
y = [1, "two", 3.0]

Lists can be nested, meaning we could put lists inside of lists:

In [3]:
list_of_lists = [[1, 2, 3], ["four", "five", "six"], [7.0, "eight", 9]]

#### 2.1.1.1 Size and truthiness
We can quickly look up the length of a list using the `len()` function:

In [4]:
len(list_of_lists)

3

In [5]:
len(x)

14

All container types can be tested for truth value. When a container is empty, its truth value is `False`; as soon as it contains any elements, this value switches to `True`. This truth value can be used with `if` and `while` statements.

Empty lists have length 0 and evaluate as `False` in conditional and boolean operations:

In [6]:
z = []
len(z)

0

In [7]:
if z:
    print("This isn't empty!")
else:
    print("This is empty!")

This is empty!


#### 2.1.1.2 Nesting and repetition
Using nested lists, we might represent matrices of numbers:

In [8]:
matrix = [[4, 8, 4], [3, 2, 6], [5, 3, 7]] # this is a 3 x 3 matrix

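For sequential integers there's the `range()` function. Since `range()` returns a lazy range object rather than a list (it produces its values on demand, much like the generators of Section 2.1.4), we'll have to coerce its result to a list to be able to print its values and modify the sequence as a list.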

In [9]:
seq = list(range(10))
print(seq)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


There are a lot more ways to define lists:

In [10]:
empty_list = []
print(empty_list)

[]


We can also use the `*` multiplication operator to initialize lists with repeated values:

In [11]:
list_of_empty_lists = [[]] * 5
print(list_of_empty_lists)

[[], [], [], [], []]
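
One caveat is worth knowing before relying on this: `*` repeats references to the same inner object rather than copying it, so the five inner lists above are all the same list. The following is a minimal sketch of the difference; the names `shared` and `independent` are illustrative only.

```python
# Repeating with * copies the reference, so every slot is the same inner list
shared = [[]] * 3
shared[0].append("a")
print(shared)       # the appended item shows up in every slot

# A list comprehension (Section 2.1.1.9) builds a fresh inner list on each pass
independent = [[] for _ in range(3)]
independent[0].append("a")
print(independent)  # only the first inner list changes
```

For repeated immutable values such as numbers or strings (e.g. `[0] * 5`), this sharing is harmless.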


#### 2.1.1.3 Accessing elements
We access elements of lists using their index:

In [12]:
x[0]

8

In [13]:
x[3]

1

It is also possible, and quite convenient, to access elements in the reverse order by simply using a negative sign for the index:

In [14]:
x[-1]

21

In [15]:
x[-4]

34

We can also "slice" lists, picking out specific sequences of elements by indicating indices:

In [16]:
print(x)

[8, 1, 5, 1, 89, 34, 3, 2, 144, 13, 34, 55, 0, 21]


In [17]:
print(x[4:10])

[89, 34, 3, 2, 144, 13]


Notice that the slice begins at index 4 (Python indexing starts at 0, so the element at index 4 is actually the 5th element) and ends with index 9, i.e. just before index 10. 

We can also slice using negative indices. We can perform open-ended slices to grab the rest of the list from a particular index.

In [18]:
print(x[-6:-3])

[144, 13, 34]


In [19]:
print(x[3:])

[1, 89, 34, 3, 2, 144, 13, 34, 55, 0, 21]


In [20]:
print(x[:-8])

[8, 1, 5, 1, 89, 34]


#### 2.1.1.4 Modifying lists
Lists can be modified after being created. This is the hallmark of an important concept called _mutability_. Specifically, a mutable object is one that can be modified without direct re-assignment. Following our discussion on lists we'll move on into immutable ordered arrays, which are referred to as _tuples_.
When we want to add new elements to the end of a list, we use `.append()`:

In [21]:
seq.append(10)
print(seq)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


But we have to be careful when appending to lists. Since Python allows multiple data types to be mixed in the same data structure, it is very easy to mistakenly append an unwanted item to a list. For example:

In [22]:
seq.append([11]) # Don't do this!
print(seq)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, [11]]


The use of an extra pair of square brackets can lead to list dimensions and types getting messed up.

We can delete the unwanted element by specifying its index and using the built-in `del` statement:

In [23]:
del seq[-1]
print(seq)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


As it turns out, there's a different method for the event that we want to 'glue' two lists together. Here's an example of this using `.extend()`:

In [24]:
seq.extend([11, 12, 13]) # Do this!
print(seq)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]


#### 2.1.1.5 List iteration
Iterating over lists in a loop is extremely common. In Python, we don't need to use indices to iterate over lists, as we would in some other languages. We can loop over the elements directly with the convenient `for ... in` syntax:

In [25]:
for number in seq:
    print(number)

0
1
2
3
4
5
6
7
8
9
10
11
12
13


We can use nested loops to access nested lists:

In [26]:
for row in matrix:
    for item in row:
        print(item)

4
8
4
3
2
6
5
3
7


The built-in function `enumerate` gives us easy access to the index of the current element inside a loop:

In [27]:
for m, row in enumerate(matrix):
    print("Printing row " + str(m)) # We're using str() to convert the integer variable m to a string, which allows us to concatenate it to the message
    for n, item in enumerate(row):
        print("Item from column " + str(n) + ": " + str(item))

Printing row 0
Item from column 0: 4
Item from column 1: 8
Item from column 2: 4
Printing row 1
Item from column 0: 3
Item from column 1: 2
Item from column 2: 6
Printing row 2
Item from column 0: 5
Item from column 1: 3
Item from column 2: 7


#### 2.1.1.6 Exercise: trace of a matrix

The trace of a (square) matrix (list of lists) is the sum of its diagonal elements.
Loop over the matrix's rows and add up the diagonal values for the trace. We'll be working with the simple 3 x 3 matrix:
```
[[4, 8, 4],
 [3, 2, 6],
 [5, 3, 7]]
```

In [28]:
matrix = [[4, 8, 4], [3, 2, 6], [5, 3, 7]] # this is the matrix to work with
# Hint: You'll need for loops!

#### 2.1.1.7 Sorting Lists
One of the most useful built-in functionalities of Python is its ability to sort collections. We can rearrange an existing list using the `.sort()` method, or we can create a sorted copy of it using the `sorted()` function, leaving the original list unchanged.

In [29]:
x = [8, 1, 5, 1, 89, 34, 3, 2, 144, 13, 34, 55, 0, 21]
x.sort()
print(x)

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 34, 55, 89, 144]


In [30]:
x = [8, 1, 5, 1, 89, 34, 3, 2, 144, 13, 34, 55, 0, 21]
y = sorted(x)
print(y)

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 34, 55, 89, 144]


In [31]:
print(x)

[8, 1, 5, 1, 89, 34, 3, 2, 144, 13, 34, 55, 0, 21]


We can set the `reverse` flag to `True` to sort a list in descending order:

In [32]:
z = sorted(x, reverse = True)
print(z)

[144, 89, 55, 34, 34, 21, 13, 8, 5, 3, 2, 1, 1, 0]


#### 2.1.1.8 Exercise: Sorting lists

You might have noticed that, apart from a repeated 34, the list `y` contains the start of the Fibonacci sequence, in which every number after the first two is the sum of the previous two numbers.

Now, use a `for` loop to calculate the next 10 Fibonacci numbers and append them to list `y`.

In [33]:
y = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144] # These are the first 13 numbers in the sequence (repeated 34 removed)
# Hint: You'll need to use list indices

#### 2.1.1.9 List Comprehensions
Say we want to define the multiples of 7 from 0 up to 700, i.e., 7 * n for n = 0, 1, ..., 100. We would ordinarily write a `for` loop like this:

In [34]:
mult7 = []
for n in range(101): # We're specifying 101 because range() stops right before the specified value, and we want n to reach 100 exactly.
    mult7.append(7 * n)
print(mult7)

[0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98, 105, 112, 119, 126, 133, 140, 147, 154, 161, 168, 175, 182, 189, 196, 203, 210, 217, 224, 231, 238, 245, 252, 259, 266, 273, 280, 287, 294, 301, 308, 315, 322, 329, 336, 343, 350, 357, 364, 371, 378, 385, 392, 399, 406, 413, 420, 427, 434, 441, 448, 455, 462, 469, 476, 483, 490, 497, 504, 511, 518, 525, 532, 539, 546, 553, 560, 567, 574, 581, 588, 595, 602, 609, 616, 623, 630, 637, 644, 651, 658, 665, 672, 679, 686, 693, 700]


We could perform exactly the same operation in a single line of code as follows, using a _list comprehension_:

In [35]:
mult7 = [7 * n for n in range(101)]
print(mult7)

[0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98, 105, 112, 119, 126, 133, 140, 147, 154, 161, 168, 175, 182, 189, 196, 203, 210, 217, 224, 231, 238, 245, 252, 259, 266, 273, 280, 287, 294, 301, 308, 315, 322, 329, 336, 343, 350, 357, 364, 371, 378, 385, 392, 399, 406, 413, 420, 427, 434, 441, 448, 455, 462, 469, 476, 483, 490, 497, 504, 511, 518, 525, 532, 539, 546, 553, 560, 567, 574, 581, 588, 595, 602, 609, 616, 623, 630, 637, 644, 651, 658, 665, 672, 679, 686, 693, 700]


The basic comprehension syntax (which can be used not just with lists, but with other container data types as well) consists of brackets containing an expression followed by a `for` clause. It looks like `[expression(x) for x in iterable]`. In the example above, the expression is a multiplication by 7. Since it is a simple operation, we typed the expression inside the comprehension instead of defining it elsewhere and calling it as a function. The expression can also be just the loop variable itself (i.e. copying the value), in which case the syntax is further simplified to `[x for x in iterable]`.

Additional `for` and `if` clauses can be incorporated to construct more complex comprehensions. By using additional `for` loops, we can nest comprehensions. For example, we could "transpose" (switch columns and rows) a matrix using a nested list comprehension:

In [36]:
matrix = [[8, 2, 3], [9, 1, 9], [5, 4, 1]]

transposed_matrix = [[row[i] for row in matrix] for i in range(3)] # We generate indices using the range() function
print(transposed_matrix)

[[8, 9, 5], [2, 1, 4], [3, 9, 1]]
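
The `if` clause and the use of comprehensions with other containers are mentioned above but not shown. As a brief sketch (the variable names here are made up for illustration, and sets and dictionaries are covered in Section 2.2):

```python
numbers = list(range(20))

# An if clause keeps only the items that pass the test
even_squares = [n * n for n in numbers if n % 2 == 0]
print(even_squares)

# Two for clauses flatten a nested list; they read left to right like nested loops
matrix = [[8, 2, 3], [9, 1, 9], [5, 4, 1]]
flat = [item for row in matrix for item in row]
print(flat)

# The same syntax inside curly braces builds a set or a dictionary
unique_items = {item for row in matrix for item in row}
squares_by_n = {n: n * n for n in range(5)}
print(unique_items)
print(squares_by_n)
```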


#### 2.1.1.10 Some more useful list methods
To formalize, methods are simply built-in functions that are available as attributes of a data type. For example, we have been using `list.append()` to add elements to a list. This is a method that comes built-in with the list data type. The other list method we've already shown is `list.sort()`. We'll take a look at a few more useful list methods.

Let's first define a list.

In [37]:
a = [2, 1, 4, 2, 5, 1, 5, 7, 0, 1, 2, 5, 6, 7, 8]

`.index()` returns the index of the first instance of an item in the list.

In [38]:
print(a.index(1))

1


`.pop()` removes the last item in the list and returns it.

In [39]:
last = a.pop()
print(last)
print(a)

8
[2, 1, 4, 2, 5, 1, 5, 7, 0, 1, 2, 5, 6, 7]


`.count()` returns the number of times an item appears in a list.

In [40]:
print(a.count(1))

3


`.reverse()` reverses the order of the list; note that the list is changed in place!

In [41]:
a.reverse()
print(a)

[7, 6, 5, 2, 1, 0, 7, 5, 1, 5, 2, 4, 1, 2]


If we wanted to create a reverse copy of a list, we would need to use the built-in `reversed()` function, rather than the `list.reverse()` method.

In [42]:
b = list(reversed(a)) # the list() call is necessary because reversed() returns an iterator, not an actual list
print(b) 

[2, 1, 4, 2, 5, 1, 5, 7, 0, 1, 2, 5, 6, 7]


### 2.1.2 Tuples
Tuples are similar to lists, but with an important difference: they are _immutable_, meaning tuple objects can't be changed once they've been created (a variable holding a tuple can only be reassigned to a new one). Tuples are often used in similar ways to lists and can be operated on by many of the same functions as lists, like `sorted()`.
#### 2.1.2.1 Basics
Tuples are printed with parentheses, but can be defined by simply separating values with commas. The parentheses are optional in a tuple definition.

In [43]:
numbers = 4, 2, 3, 1, 2, 4, 4
print(numbers)

(4, 2, 3, 1, 2, 4, 4)


Because they can't be changed after definition, we can't append anything to tuples:

In [44]:
numbers.append(3)

AttributeError: 'tuple' object has no attribute 'append'
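
Item assignment is rejected for the same reason. The sketch below shows the error being caught, followed by the usual workaround of building a new tuple and reassigning the name.

```python
numbers = 4, 2, 3, 1, 2, 4, 4

# Item assignment is rejected because tuples are immutable
try:
    numbers[0] = 10
except TypeError as error:
    print("TypeError:", error)

# Building a new tuple and reassigning the name is allowed
numbers = numbers + (3,)   # the trailing comma makes (3,) a one-element tuple
print(numbers)
```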

#### 2.1.2.2 Packing
Often, tuples are used for "packing" and "unpacking" data. For example, we can create a tuple to hold information related to a book and unpack that information into separate variables:

In [45]:
book = "A Tale of Two Cities", "Charles Dickens", 1859

title, author, year = book
print(title)
print(author)
print(year)

A Tale of Two Cities
Charles Dickens
1859


Tuple elements can be accessed using indexes just like lists:

In [46]:
print(book[1])

Charles Dickens
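
Unpacking is useful in a couple of other ways as well. The sketch below, using made-up values, swaps two variables through an implicit tuple and uses a starred name to collect leftover items.

```python
# Swapping: the right-hand side is packed into a tuple, then unpacked
a, b = 1, 2
a, b = b, a
print(a, b)

# A starred name gathers any remaining items into a list
book = "A Tale of Two Cities", "Charles Dickens", 1859
title, *details = book
print(title)
print(details)
```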


#### 2.1.2.3 Exercise: tuple unpacking and list transformation
Use a list comprehension to create a list of all the author names from the given list of tuples containing title, author, and publication year.

In [47]:
books = [("A Tale of Two Cities", "Charles Dickens", 1859), 
         ("Crime and Punishment", "Fyodor Dostoyevski", 1866),
         ("Heart of Darkness", "Joseph Conrad", 1899), 
         ("Brave New World", "Aldous Huxley", 1932), 
         ("The Stranger", "Albert Camus", 1942)]

# Dazzle yourself with your list comprehension skills!
authors = []

### 2.1.3 Strings are immutable ordered arrays of characters
Interestingly, strings in Python are also sequential (ordered) objects. A string is like a list of characters, but is immutable (like a tuple). More details on strings can be found [here](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str), and on numeric data types like integers and floats, which are not containers but also have some surprising properties worth learning about, [here](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex).
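
To illustrate, here is a small sketch (the example word is arbitrary) showing that the indexing, slicing, and looping tools from lists carry over to strings, while item assignment fails just as it does for tuples.

```python
word = "Python"

print(len(word))      # number of characters
print(word[0])        # indexing works as with lists
print(word[-3:])      # so does slicing
for character in word[:2]:
    print(character)

# Item assignment fails because strings are immutable
try:
    word[0] = "J"
except TypeError as error:
    print("TypeError:", error)

# A "changed" string is always a new object
new_word = "J" + word[1:]
print(new_word)
```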

#### 2.1.3.1 Some helpful string methods
Strings also come with many useful methods. For example, `.split()` breaks a string apart at each occurrence of a specified separator. A common use of `.split()` is to break up sentences into words by splitting on whitespace, which is the default for the method.

In [48]:
sentence = "I care for no man on earth, and no man on earth cares for me."
words = sentence.split()

print(words)

['I', 'care', 'for', 'no', 'man', 'on', 'earth,', 'and', 'no', 'man', 'on', 'earth', 'cares', 'for', 'me.']


Custom separators can come in handy with `.split()`:

In [49]:
filename = "unusually-long-sequence-of-seemingly-related-words"
filename_words = filename.split("-")

print(filename_words)

['unusually', 'long', 'sequence', 'of', 'seemingly', 'related', 'words']
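
The reverse operation, gluing a list of strings back together, is handled by the `.join()` method. It is not covered above, so treat this as a short supplementary sketch.

```python
filename = "unusually-long-sequence-of-seemingly-related-words"
filename_words = filename.split("-")

# .join() is called on the separator and takes the list of pieces
rejoined = "-".join(filename_words)
print(rejoined == filename)

# Any separator string can be used when joining
spaced = " ".join(filename_words)
print(spaced)
```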


The `.strip()` method can clean up strings by removing leading and trailing whitespace:

In [50]:
badly_processed_sentence = "      There was so much that was wrong with the situation.     "
stripped_sentence = badly_processed_sentence.strip()

print("'" + badly_processed_sentence + "'")
print("'" + stripped_sentence + "'")

'      There was so much that was wrong with the situation.     '
'There was so much that was wrong with the situation.'


The `.upper()`, `.lower()`, and `.title()` methods can change the case of characters:

In [51]:
name = "PETER PARKER"
lower_name = name.lower()
print(lower_name)

peter parker


In [52]:
upper_name = lower_name.upper()
print(upper_name)

PETER PARKER


In [53]:
proper_name = name.title()
print(proper_name)

Peter Parker


Methods like `.isalpha()` and `.isdigit()` are handy ways to check string contents.

In [54]:
name = "Peter"

if name.isalpha():
    print("This string only contains alphabetic characters.")

This string only contains alphabetic characters.


In [55]:
number = "19"

if number.isdigit():
    print("This string only contains digits.")

This string only contains digits.


`.startswith()` and `.endswith()` can be used to check the beginnings and endings of strings.

In [56]:
url = "https://www.website.com"
if url.startswith("http"):
    print("This looks like a URL.")

This looks like a URL.


In [57]:
filename = "file.pdf"
if filename.endswith(".pdf"):
    print("This looks like a document name.")

This looks like a document name.


For more detail on methods, check out the [documentation](https://docs.python.org/3/library/stdtypes.html)!

### 2.1.4 Data transformation and traversals using generators and lambda functions
While not really a data type, generators are a way of containing and traversing what data _will be_. Since generators are most closely related to the concept of functions, we'll review those first.
#### 2.1.4.1 Functions review
Recall from Chapter 1 that a function is a series of statements that can return one or more values. We define functions when we want to reuse some code at multiple points; i.e., instead of retyping the same statements, we package them in a function and call the function where needed. Functions are defined explicitly using the `def` keyword and can take as many arguments as needed:

In [58]:
def multiply(x, y): # x and y are arguments
    return x * y

In [59]:
a = 7
b = 8
c = multiply(a, b)
print(c)

56


Default values for arguments can be set by specifying them in the definition:

In [60]:
def raise_to_power(x, n = 2):
    y = 1
    for i in range(n):
        y = y * x
    return y

In [61]:
print(raise_to_power(5))

25


In [62]:
print(raise_to_power(5, 3))

125


#### 2.1.4.2 Exercise: functions review
Write your own basic version of the built-in `range()` function. It should take only one argument, the number before which the range should stop, and it should return a list containing all integers from 0 up to, but not including, that end point.

In [None]:
# def my_range(stop):
#     # code goes here

In [64]:
# using the my_range() function, print out the list of the first 25 integers.

#### 2.1.4.3 Generator functions
Lists returned by functions need to be stored in the computer's memory. In the case of very large lists, this can become a significant problem. A convenient alternative is a generator function, which utilizes "lazy" evaluation. Essentially, calling a generator function returns a generator object instead of an actual in-memory list, which is much more memory efficient. Generator objects can be iterated over, and produce values one by one as needed rather than producing an entire sequence at once. When defining a generator function, we simply use `yield` instead of `return`.

Let's define a generator function to calculate the cubes of a sequence of numbers:

In [65]:
def cube(seq):
    for n in seq:
        yield n * n * n

When we call `cube()` by passing it a list, it produces a generator object:

In [66]:
cubes = cube([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(cubes)

<generator object cube at 0x1119f1b10>


The generator object can be looped over in the same way we would loop over a list:

In [67]:
for n in cubes:
    print(n)

1
8
27
64
125
216
343
512
729
1000


Once a generator object has yielded all of its values, it is exhausted, so looping over `cubes` again produces nothing:

In [68]:
for n in cubes:
    print(n)
print("All done!")

All done!


A full list can be produced from the generator (if the memory cost is acceptable) using `list()`:

In [69]:
cubes = cube([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
cubes_list = list(cubes)

print(cubes_list)

[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
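
To make the lazy behaviour concrete, the sketch below steps through a generator by hand with the built-in `next()` function and then compares rough object sizes with `sys.getsizeof()`. The size comparison assumes the standard CPython interpreter, and exact byte counts vary by version and platform.

```python
import sys

def cube(seq):
    for n in seq:
        yield n * n * n

cubes = cube(range(1, 4))

# next() pulls one value at a time; nothing is computed until it is requested
print(next(cubes))
print(next(cubes))
print(next(cubes))

# Asking an exhausted generator for more raises StopIteration
try:
    next(cubes)
except StopIteration:
    print("No values left")

# Rough memory comparison: the list holds every value, the generator holds none yet
big_list = list(cube(range(100000)))
big_generator = cube(range(100000))
print(sys.getsizeof(big_list) > sys.getsizeof(big_generator))
```

A `for` loop does this same stepping behind the scenes and simply stops when `StopIteration` is raised.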


#### 2.1.4.4 Exercise: generator functions

Rewrite `my_range()` as a generator function and print out the first 25 integers using this generator function.

In [None]:
# def my_range(stop):
#     # code goes here, this time as a generator function

In [71]:
# using the my_range() generator function, print out the first 25 integers

#### 2.1.4.5 Transforming data with lambda functions
As mentioned in Chapter 1, sometimes we need to write quick functions that don't necessarily need a name or a separate definition. We can define these anonymous functions using the `lambda` keyword. For example, say we have a list of numbers and want to check which of those numbers are multiples of 3. We could perform the check with a concise lambda function:

In [72]:
numbers = [5, 9, 21, 14, 81, 31, 66, 43, 90, 1, 10]

multiple_of_3 = lambda x : x % 3 == 0

for number in numbers:
    if multiple_of_3(number):
        print(str(number) + " is a multiple of 3.")

9 is a multiple of 3.
21 is a multiple of 3.
81 is a multiple of 3.
66 is a multiple of 3.
90 is a multiple of 3.


The lambda function syntax is essentially `lambda argument : expression`. In the example above, the argument is `x` and the expression is `x % 3 == 0`, which returns a boolean `True` or `False` value after evaluating whether the remainder from dividing x by 3 is 0.

A very useful built-in function in Python is `filter()`, which can simplify filtering lists when combined with lambda functions. `filter()` takes a function and an iterable (such as a list) as arguments. If we wanted to collect the multiples of 3 into a new list, we could do this very succinctly. Note: since `filter()` returns a lazy iterator, much like a generator, we have to coerce it to a list to see the data and interact with it as a list.

In [73]:
multiples = filter(lambda x : x % 3 == 0, numbers)

print(list(multiples))

[9, 21, 81, 66, 90]
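
For comparison, the same selection can be written as a list comprehension with an `if` clause. The built-in `map()`, which is not discussed above, is the companion to `filter()` for transforming every item. This is a short sketch, not a recommendation of one style over the other.

```python
numbers = [5, 9, 21, 14, 81, 31, 66, 43, 90, 1, 10]

# filter() with a lambda ...
multiples_filter = list(filter(lambda x: x % 3 == 0, numbers))

# ... selects the same items as a comprehension with an if clause
multiples_comprehension = [x for x in numbers if x % 3 == 0]
print(multiples_filter == multiples_comprehension)

# map() applies a function to every item and, like filter(), is lazy
halves = list(map(lambda x: x / 2, multiples_filter))
print(halves)
```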


Lambda functions are useful when sorting lists. They can be used as arguments to the built-in `sorted()` function. For example, say we have a list of tuples, each containing two numbers. By default, `sorted()` compares tuples element by element, so it sorts by the first number and breaks ties with the second:

In [74]:
pairs = [(4, 30), (23, 1), (4, 32), (90, 21), (82, 23)]
ordered_pairs = sorted(pairs)

print(ordered_pairs)

[(4, 30), (4, 32), (23, 1), (82, 23), (90, 21)]


By specifying the `key` argument as a lambda function returning the second element, we can sort using the second number:

In [75]:
ordered_pairs = sorted(pairs, key = lambda x : x[1])

print(ordered_pairs)

[(23, 1), (90, 21), (82, 23), (4, 30), (4, 32)]


If we wanted the reverse order, we could specify the `reverse` argument for `sorted()`.

In [76]:
reverse_order = sorted(pairs, key = lambda x : x[1], reverse = True)

print(reverse_order)

[(4, 32), (4, 30), (82, 23), (90, 21), (23, 1)]


#### 2.1.4.6 Exercise: lambda function data transformation for a sort
Sort this list of books in descending order of year of publication (i.e. the more recently published books should come first) using a lambda function with `sorted()`.

In [None]:
books = [("A Tale of Two Cities", "Charles Dickens", 1859), 
         ("Crime and Punishment", "Fyodor Dostoyevski", 1866),
         ("Heart of Darkness", "Joseph Conrad", 1899), 
         ("Brave New World", "Aldous Huxley", 1932), 
         ("The Stranger", "Albert Camus", 1942)]

# # code goes here
# reverse_books = 

## 2.2 Unordered data objects
Ordered data objects are extremely important in Python programming for data science, but for many applications they are either impractical or inefficient. There are several unordered data types that we'll use throughout the course and that you'll need in order to move forward in data science. These are discussed in detail here, with a focus on some useful methods.

### 2.2.1 Sets
A set is another powerful data structure. Essentially, a set is like a list in which all the elements are unique and there are no indices (no ordering). This means that if you try to insert a value that is already in a set, it is not duplicated. In math, sets are written with curly braces; it's the same in Python.

In [78]:
names = {"Joey", "Diane", "Mike", "Jessie", "Alex", "Jane"}
print(names)

{'Joey', 'Alex', 'Jessie', 'Diane', 'Jane', 'Mike'}


An important thing to remember is that empty sets can't be defined by simply writing two curly braces, because `{}` creates an empty dictionary (Section 2.2.2). They need to be defined using the `set()` function:

In [79]:
empty_set = set()

Similar to the `append()` method for lists, for a set we have an `add()` method.

In [80]:
names.add("Mark")
print(names)

{'Joey', 'Alex', 'Mark', 'Jessie', 'Diane', 'Jane', 'Mike'}


What happens when we try to add something that's already in the set?

In [81]:
names.add("Diane")
print(names)

{'Joey', 'Alex', 'Mark', 'Jessie', 'Diane', 'Jane', 'Mike'}


Perhaps unsurprisingly, sets have methods for mathematical set operations, like `.union()` and `.intersection()`. Let's see which of our names were names of the main characters on Full House:

In [82]:
print(names.intersection({"Jessie", "D.J.", "Joey", "Danny", "Michelle", "Stephanie"}))

{'Jessie', 'Joey'}
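
A few more set operations follow the same pattern. The sketch below uses the same two sets plus a made-up `visits` list to show `.union()`, `.difference()`, membership testing, and removing duplicates from a list.

```python
names = {"Joey", "Diane", "Mike", "Jessie", "Alex", "Jane"}
cast = {"Jessie", "D.J.", "Joey", "Danny", "Michelle", "Stephanie"}

print(names.union(cast))        # everything in either set
print(names.difference(cast))   # in names but not in cast
print("Mike" in names)          # membership test

# Converting a list to a set removes duplicates (the original order is not kept)
visits = ["Joey", "Alex", "Joey", "Jane", "Alex"]
unique_visitors = set(visits)
print(len(visits), len(unique_visitors))
```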


### 2.2.2 Dictionaries
Dictionaries are an extremely important and fundamental Python data type. They are _associative arrays_, in which data is organized into key-value pairs.

#### 2.2.2.1 Basics
They are defined with curly braces. For example, we might define a dictionary containing the phone numbers of a few people:

In [83]:
phone_numbers = {"John" : 9867934,
            "Diane" : 3409344,
            "Mike" : 2340903,
            "Jess" : 8983993,
            "Alex" : 8920022,
            "Jane" : 6673391}

print(phone_numbers)

{'John': 9867934, 'Diane': 3409344, 'Mike': 2340903, 'Jess': 8983993, 'Alex': 8920022, 'Jane': 6673391}


#### 2.2.2.2 Access elements and iteration
In this dictionary, the names are keys and the numbers are values. If we want to access a value associated with a particular key, we can use the key in square brackets, similar to how indices are used with lists and tuples:

In [84]:
print(phone_numbers["Alex"])

8920022


In loops, we iterate over the keys of a dictionary:

In [85]:
for name in phone_numbers: # It is also possible to iterate over both keys and values together, we would just write "for name, number in phone_numbers.items():"
    print(name + ", Phone: " + str(phone_numbers[name]))

John, Phone: 9867934
Diane, Phone: 3409344
Mike, Phone: 2340903
Jess, Phone: 8983993
Alex, Phone: 8920022
Jane, Phone: 6673391


#### 2.2.2.3 Complex definitions, nesting, and modification
Dictionaries are often very useful when it comes to storing complex data. For example, let's create a dictionary to store information related to a book.

In [86]:
book = {"title" : "A Tale of Two Cities",
       "author" : "Charles Dickens",
       "year" : 1859}

We can easily access the different values using their keys. Dictionaries can be nested, so we could create a dictionary of dictionaries to store a few different books and use a unique key, an ID, for each book.

In [87]:
books_dict = {0 : {"title" : "A Tale of Two Cities",
                   "author" : "Charles Dickens",
                   "year" : 1859},
             1 : {"title" : "Crime and Punishment",
                 "author" : "Fyodor Dostoyevski",
                 "year" : 1866},
             2 : {"title" : "Heart of Darkness",
                 "author" : "Joseph Conrad",
                 "year" : 1899},
             3 : {"title" : "Brave New World",
                 "author" : "Aldous Huxley",
                 "year" : 1932},
             4 : {"title" : "The Stranger",
                 "author" : "Albert Camus",
                 "year" : 1942}}

Now this makes accessing the data for a book's title very convenient:

In [88]:
print(books_dict[3]["title"])

Brave New World


Adding new entries to a dictionary doesn't require something like an `append()` method. We can simply specify the key and assign it a value. Of course, that value can be another dictionary, too:

In [89]:
books_dict[5] =  {"title" : "The Old Man and the Sea",
                 "author" : "Ernest Hemingway",
                 "year" : 1952}

print(books_dict)

{0: {'title': 'A Tale of Two Cities', 'author': 'Charles Dickens', 'year': 1859}, 1: {'title': 'Crime and Punishment', 'author': 'Fyodor Dostoyevski', 'year': 1866}, 2: {'title': 'Heart of Darkness', 'author': 'Joseph Conrad', 'year': 1899}, 3: {'title': 'Brave New World', 'author': 'Aldous Huxley', 'year': 1932}, 4: {'title': 'The Stranger', 'author': 'Albert Camus', 'year': 1942}, 5: {'title': 'The Old Man and the Sea', 'author': 'Ernest Hemingway', 'year': 1952}}


#### 2.2.2.4 Exercise: interacting with dictionaries
Write code that uses a loop to iterate over `books_dict` and create three separate lists of titles, authors, and years, populated by the correct values.

In [90]:
titles = []
authors = []
years = []
# Remember, dictionaries iterate over keys by default

#### 2.2.2.5 Managing keys and values
Since default iteration on dictionaries occurs over keys, we already have some capacity for managing dictionary keys. However, to iterate over (or even make lists of) keys, values, or both, there are useful built-in methods, each returning a view object that can be looped over or passed to `list()`:
- `.keys()`: returns a view of the dictionary's keys
- `.values()`: returns a view of the dictionary's values
- `.items()`: returns a view of key-value tuple pairs
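
As a quick sketch of these three methods in use (with a shortened, illustrative copy of the phone book from earlier):

```python
phone_numbers = {"John": 9867934, "Diane": 3409344, "Mike": 2340903}

names = phone_numbers.keys()
print(list(names))

# The object returned by .keys() is live: it reflects later changes to the dictionary
phone_numbers["Jess"] = 8983993
print(list(names))

# .items() produces (key, value) tuples that can be unpacked in the loop header
for name, number in phone_numbers.items():
    print(name + ", Phone: " + str(number))

# Wrapping the result in list() takes a snapshot that no longer changes
numbers_snapshot = list(phone_numbers.values())
print(numbers_snapshot)
```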

Built-in syntax and dictionary methods are also essential for dictionary modification. Attempting to modify a value in place (for example with `+=`) when its key does not exist raises a `KeyError`! To avoid issues like this, one can use the `.setdefault(key, default_value)` method, which sets the value for `key` to `default_value` only if `key` does not already exist in the dictionary. Without `.setdefault()`, a more syntax-intensive approach is a membership test: the expression `key in some_dictionary` results in a boolean value, just as with lists!
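
Both approaches can be sketched on a small, made-up `stock` dictionary:

```python
stock = {"apples": 3}

# Updating a missing key in place fails
try:
    stock["pears"] += 1
except KeyError as error:
    print("KeyError:", error)

# Option 1: test membership first, then assign or modify
if "pears" in stock:
    stock["pears"] += 1
else:
    stock["pears"] = 1

# Option 2: .setdefault() inserts the default only when the key is absent
stock.setdefault("plums", 0)
stock["plums"] += 1
stock.setdefault("apples", 0)   # "apples" already exists, so its value is left alone
stock["apples"] += 1

print(stock)
```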

#### 2.2.2.6 Example: Counting with a base Python dictionary
Say we have a list of results of soccer games. There are three possibilities: win (W), loss (L), and draw (D). Count the number of wins, losses, and draws using the `counts` dictionary:

In [None]:
results = ["W", "W", "L", "D", "W", "L", "L", "L", "W", 
           "D", "W", "W", "D", "W", "D", "L", "W", "W", "L", "W"]
counts = {}
## loop through results (below) and count the number of times each occurred.
## using a membership test (`in`): check whether a value needs to be assigned or modified
## using `.setdefault()`: make sure a starting value is assumed, i.e., pre-assigned

counts

#### 2.2.2.7 Counters
A counter is a special kind of dictionary data structure. Think of it as a dictionary where the keys are the things being counted and the associated values are by default the number 0. First, the counter data structure needs to be imported from the `collections` module.

In [91]:
from collections import Counter

We define a counter by calling `Counter()`:

In [92]:
count = Counter()

Returning to the list of soccer game results, a counter is the perfect data structure for counting the different outcomes:

In [93]:
results = ["W", "W", "L", "D", "W", "L", "L", "L", "W", "D", "W", "W", "D", "W", "D", "L", "W", "W", "L", "W"]

for result in results:
    count[result] += 1 # The += operator is a shortcut way to write count[result] = count[result] + 1
    
print(count)

Counter({'W': 10, 'L': 6, 'D': 4})


As it turns out, counters have a built-in sorting method: `.most_common(n)`. Note: this method returns a list of `(<key>, <count>)` tuples, ordered from the most to the least common; leaving out `n` returns every entry. Let's try:

In [94]:
## unpack the key and count on the fly from the tuples that .most_common() returns
for ky, ct in count.most_common():
    print(ky, ct)

W 10
L 6
D 4
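
A counter can also be built in one step by passing it any iterable. The sketch below counts the letters of an arbitrary word and shows that looking up an item that was never seen reports 0 instead of raising an error.

```python
from collections import Counter

# Passing an iterable counts its items directly, without an explicit loop
letters = Counter("mississippi")
print(letters)

# Limit the ranking to the top n entries
print(letters.most_common(2))

# A missing key reports 0 and is not added to the counter
print(letters["z"])
print("z" in letters)
```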


#### 2.2.2.8 Exercise: counters

Count the results and calculate a point tally, where a win equals 3 points, a draw equals 1 point, and a loss equals 0 points.

In [95]:
results = ["W", "W", "L", "D", "W", "L", "W", "D", "W", "L", "L", "W", "D", "W", "W", "L", "L", "W", "D", "W", "D", "L", "W", "W", "L", "W"]
result_counts = Counter()
points = 0
# Use a loop!

#### 2.2.2.9 Default valued dictionaries
A `defaultdict` is another special dictionary type. It is very similar to a dictionary, but lets the user specify a function (the default factory) that supplies a value whenever a missing key is accessed. Aside from its specialized methods, a `Counter()` behaves much like `defaultdict(int)`, i.e., a dictionary whose default value is the integer 0. You set the default factory for a `defaultdict` at its definition. Here's one that defaults to an empty list:

In [96]:
from collections import defaultdict

dict_of_lists = defaultdict(list)
print(dict_of_lists)

defaultdict(<class 'list'>, {})


Let's look at an example of how defaultdicts can be useful. Say, we have a list of tuples where each tuple contains the name of a player and the number of goals they scored in one match. We want to organize these numbers by players and create lists of their goal record in all matches. We could use a defaultdict to achieve this very simply:

In [97]:
goals = [("Ronaldo", 3), ("Lukaku", 2), ("Messi", 1), ("Ronaldo", 1), ("Kane", 2), ("Lukaku", 2), ("Kane", 1), ("Kane", 2)]

goals_by_player = defaultdict(list)

for player, n_goals in goals: # n_goals avoids reusing (and overwriting) the name of the goals list
    goals_by_player[player].append(n_goals)
    
print(goals_by_player)

defaultdict(<class 'list'>, {'Ronaldo': [3, 1], 'Lukaku': [2, 2], 'Messi': [1], 'Kane': [2, 1, 2]})


Defaultdicts can be nested, too. A rather unusual way of defining nested defaultdicts is to do it recursively, by writing a function that creates a new defaultdict.

In [98]:
def nested_dd():
    return defaultdict(nested_dd)

This gives us a lot of power, since now we can define as many layers of nesting as we want in a single line!

In [99]:
super_dict = nested_dd()

super_dict[0][1][2][3] = 4

print(super_dict)

defaultdict(<function nested_dd at 0x1119f9840>, {0: defaultdict(<function nested_dd at 0x1119f9840>, {1: defaultdict(<function nested_dd at 0x1119f9840>, {2: defaultdict(<function nested_dd at 0x1119f9840>, {3: 4})})})})


We can convert (un-nested) defaultdicts to dicts by simply using the `dict()` function:

In [100]:
goals_by_player = dict(goals_by_player)
print(goals_by_player)

{'Ronaldo': [3, 1], 'Lukaku': [2, 2], 'Messi': [1], 'Kane': [2, 1, 2]}
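
Calling `dict()` only converts the outermost layer, which is why the note above says "un-nested". For nested defaultdicts, one possible approach is a small recursive helper. The function `to_plain_dict` below is a hypothetical name introduced for this sketch, not part of the `collections` module.

```python
from collections import defaultdict

def nested_dd():
    return defaultdict(nested_dd)

def to_plain_dict(d):
    # Hypothetical helper: rebuild every nested defaultdict as a regular dict
    if isinstance(d, defaultdict):
        return {key: to_plain_dict(value) for key, value in d.items()}
    return d

super_dict = nested_dd()
super_dict[0][1][2][3] = 4

plain = to_plain_dict(super_dict)
print(plain)
```

After the conversion, looking up a missing key raises a `KeyError` again instead of silently creating a new entry.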
