In [1]:
from __future__ import division, unicode_literals, print_function

# Data structures

So far, we have mostly dealt with variables that are organized in an *ad hoc* way. Most data, however, is not organized as single variables. Data has relationships - it has a *structure* that is often as meaningful as the individual data points themselves.

As an example, imagine if characters were always stored in individual variables, and there was no concept of a "string." We would lose a lot of the common grammar that we have for dealing with string data (replacements, counting, searching for substrings, etc.). This would be incredibly painful. While the data structures that we talk about here are a little more abstract than strings, you will quickly find that they are as indispensable as strings.

Python offers four major built-in ways of organizing data, all of which we will talk about here. To list them briefly, they are:

0. Lists: Ordered collections
1. Dictionaries: Associated (key, value) pairs
2. Sets: Unordered collections of unique items
3. Tuples: Immutable ordered collections

Of these, to be frank, sets and tuples are probably less important, but still good to know about. For the moment, we won't go into much detail about them. Lists and dictionaries, however, will be something you use basically every day in Python.

Data structures are a field of huge importance in computer science, and this list is definitely not exhaustive. However, these are the built-in structures available in Python.

# Lists

Lists are ordered collections of items. They model sequential data (for example, data observations over time, or the collection of lines in a file). By items, we mean any object in Python (numbers, strings, booleans, even lists!). The main way that people construct a list in Python is to surround the data with square brackets.

In [6]:
list_numbers = [0, 1, 2, 3, 4]

# Lists can extend over multiple lines
list_strings = ['Now', 'this', 'is', 'a', 'story', 
                'all', 'about', 'how', 'my', 'life', 
                'got', 'flipped', 'turned', 'upside', 'down']

# Any type of data can be stored in lists, including lists
list_mixed = [1, 'apple', [1,2,3], 4.567]

There are also many functions that return lists. One such function is the string method .split(), which "splits" a string into a list of "words" based on a separating character (which would be a space in a normal sentence).

In [39]:
"This is a sentence to be split into words".split()

['This', 'is', 'a', 'sentence', 'to', 'be', 'split', 'into', 'words']
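
.split() also accepts an explicit separator. As a small sketch (the strings here are made up for illustration):

```python
# A made-up comma-separated record
record = "apple,banana,cherry"

# Passing a separator splits on that character instead of whitespace
fruits = record.split(",")
print(fruits)  # ['apple', 'banana', 'cherry']

# With no argument, runs of whitespace are treated as a single separator
words = "two   spaces  here".split()
print(words)  # ['two', 'spaces', 'here']
```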

## List slicing
Like strings, lists can be indexed and sliced. The syntax is identical to that for strings!

In [30]:
list_strings[5]

'all'

In [8]:
list_numbers[0:2]

[0, 1]

In [9]:
list_numbers[::2]

[0, 2, 4]

In [10]:
list_strings[::-1]

['down',
 'upside',
 'turned',
 'flipped',
 'got',
 'life',
 'my',
 'how',
 'about',
 'all',
 'story',
 'a',
 'is',
 'this',
 'Now']
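
Negative indexes work for lists just as they do for strings. A minimal sketch, using a short stand-in list rather than the lists defined above:

```python
letters = ['a', 'b', 'c', 'd', 'e']  # example list, not used elsewhere

last = letters[-1]        # negative indexes count from the end
last_two = letters[-2:]   # slice from the second-to-last item to the end
middle = letters[1:4]     # items at indexes 1, 2 and 3

print(last)      # e
print(last_two)  # ['d', 'e']
print(middle)    # ['b', 'c', 'd']
print(letters)   # slicing returns a new list; the original is unchanged
```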

## List methods

Lists are *mutable*, which means that they can be altered after they are created. For example:

In [11]:
list_strings[9] = 'cat'

In [12]:
list_strings

['Now',
 'this',
 'is',
 'a',
 'story',
 'all',
 'about',
 'how',
 'my',
 'cat',
 'got',
 'flipped',
 'turned',
 'upside',
 'down']

Because lists are mutable, they have a variety of methods allowing mutation. Let's look at a few useful ones.

Probably the most important method for a list is .append(item), which grows the list by adding an item to the end.

In [18]:
list_numbers.append(10)

In [14]:
list_numbers

[0, 1, 2, 3, 4, 10]

This allows you to grow a list bit by bit.

Some other examples of methods are:

In [15]:
# Returns the index of the first matching item in the list.
# Raises a ValueError if the item is not found (like str.index(), unlike str.find(), which returns -1)
list_numbers.index(10)

5
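
If the item is not in the list, .index() raises a ValueError rather than returning a value. A minimal sketch of two ways to handle that, using a stand-in list:

```python
numbers = [0, 1, 2, 3, 4]  # stand-in list for this example

# Check membership first to avoid the error
if 10 in numbers:
    position = numbers.index(10)
else:
    position = None

# Or catch the error that .index() raises for a missing item
try:
    numbers.index(10)
    found = True
except ValueError:
    found = False

print(position, found)  # None False
```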

In [19]:
# Removes an item by value
list_numbers.remove(10)

In [20]:
list_numbers

[0, 1, 2, 3, 4]

In [22]:
# Exactly the same as for strings. Shows you how many times an item occurs in a list
list_numbers.count(4)

1
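
Lists have a few other methods worth knowing. The sketch below uses a made-up list to show .insert(), .extend(), .pop() and .sort():

```python
shopping = ['eggs', 'milk']      # made-up example data

shopping.insert(0, 'bread')      # insert an item at a given index
shopping.extend(['tea', 'jam'])  # append every item from another list
last_item = shopping.pop()       # remove and return the last item
shopping.sort()                  # sort in place (alphabetical for strings)

print(last_item)  # jam
print(shopping)   # ['bread', 'eggs', 'milk', 'tea']
```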

## Lists are shared in memory

An important "gotcha" with lists is that assigning a list to another variable does not copy it. Lists can be long and take up a lot of space in memory, so instead of copying the list, Python just makes the second variable a "link" (a reference) to the original list.

The effect of this is that modifying the list through the second variable will also modify the original list.

In [23]:
list_copy = list_numbers

In [25]:
list_numbers

[0, 1, 2, 3, 4]

In [28]:
list_copy.append(500) # Also affects list_numbers! (This cell was run twice, so 500 appears twice below)

In [29]:
list_numbers

[0, 1, 2, 3, 4, 500, 500]

If you want to make a real copy of a list, the best way to do so is using the copy module. A quick and dirty way to do it, though, is to use a full slice of the list, as below:

In [44]:
real_copy = list_numbers[:]

In [45]:
real_copy.append(4567)

In [47]:
print("Original list: {}".format(list_numbers))
print("True copy: {}".format(real_copy))

Original list: [0, 1, 2, 3, 4, 500, 500]
True copy: [0, 1, 2, 3, 4, 500, 500, 4567]
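
For reference, here is a minimal sketch of the copy module mentioned above. The nested list is made up for illustration; it shows the difference between a shallow copy (which, like a full slice, copies only the outer list) and a deep copy:

```python
import copy

nested = [1, 2, [3, 4]]       # made-up list containing another list

shallow = copy.copy(nested)   # new outer list, but the inner list is still shared
deep = copy.deepcopy(nested)  # the inner list is copied as well

nested[2].append(5)           # modify the inner list of the original

print(shallow)  # [1, 2, [3, 4, 5]]  (inner list is shared)
print(deep)     # [1, 2, [3, 4]]     (fully independent)
```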


# Tuples

Tuples are somewhat similar to lists - they are created by surrounding comma-separated values with parentheses, like so:

```python
number_tuple = (1,2,3)
```

The main difference is that tuples are *immutable*, meaning that once they're created they cannot be changed. Tuples are often used in Python to represent lists where the structure of the list is fixed (i.e. where each entry has a specific meaning).

For example, colors are represented on computers as three integers representing the red, green and blue channels. In this case, the structure of the list is fixed. We will never have another color channel, and we don't want the order of the channels being mixed up. We would represent each pixel then as a tuple of (r, g, b) values.

In [56]:
color_1 = (50, 255, 127) # Green
color_2 = (127, 127, 127) # Grey

In [60]:
# Tuples are immutable: assignment causes an error

color_1[0] = 25

TypeError: 'tuple' object does not support item assignment

## Tuple indexing

Tuples can be indexed and sliced much like lists and strings.

In [59]:
print(color_2[0])
print(color_1[0:2])

127
(50, 255)


## Multiple assignment (destructuring)

Tuples can be used to assign multiple variables in one line. For example, if we wanted to set the variables a and b to 1 and 2, respectively, we could write

In [63]:
a = 1
b = 2

But it is more concise to write this as

In [64]:
a, b = 1, 2

This also allows for easy swapping of variables

In [65]:
a, b = b, a
print("a = {}, b = {}".format(a, b))

a = 2, b = 1


Finally, we can easily use this to extract the individual variables from a tuple into their own variables. This is sometimes called "destructuring" in programming.

In [67]:
r, g, b = color_1

print("R: ", r)
print("G: ", g)
print("B: ", b)

R:  50
G:  255
B:  127
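
Destructuring is also a convenient way to receive several values from a function that returns a tuple. As a small sketch, the built-in divmod returns a (quotient, remainder) tuple:

```python
# divmod is a built-in that returns a (quotient, remainder) tuple
result = divmod(17, 5)
print(result)  # (3, 2)

# Destructure the returned tuple into two variables
quotient, remainder = divmod(17, 5)
print(quotient, remainder)  # 3 2
```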


# The *for* loop

One of the most common tasks in programming is to visit every item in a list, and do something with each item. Most programming languages offer a construct to perform this task, called a *for loop*. In Python, you express a *for* loop as follows:

```python
for item in some_list:
    # Code that does something with the variable item
```

The way we define a *for* loop is similar to how *if* statements or functions are defined, with a colon and an indented block. In this case, the indented block is executed for every item in the list, with the variable *item* passed to the indented code.

Let's look at some examples

In [31]:
# Go through the list of numbers, and print each number + 1

for number in list_numbers:
    print(number+1)

1
2
3
4
5
501
501


In [34]:
# Construct a sentence out of the list of strings using string concatenation

sentence = ""
for word in list_strings:
    sentence = sentence + word + " " # Add a space between words

sentence

'Now this is a story all about how my cat got flipped turned upside down '

You will find that *for* loops are a powerful way to work with all of these data structures.
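
When you need the position of each item as well as the item itself, the built-in enumerate() produces (index, item) tuples that can be destructured in the loop. A minimal sketch, with a short stand-in list:

```python
words = ['Now', 'this', 'is']  # short stand-in for list_strings

numbered = []
for position, word in enumerate(words):
    # position counts up from 0 alongside each word
    numbered.append("{}: {}".format(position, word))

print(numbered)  # ['0: Now', '1: this', '2: is']
```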

# Dictionaries

Dictionaries are part of a common class of data structures in programming called "lookup tables" or "hash tables." Dictionaries are useful for cases where we want to express a relationship between two points, or a mapping from one set of objects onto another. 

We deal with data that has direct relationships all the time. One simple example from biology is the relationship from a DNA base to its complementary base:

- A -> T
- T -> A
- C -> G
- G -> C
    
If we're storing DNA bases as strings, we can express this in Python the following way:

In [35]:
complements = {'A':'T',
 'T': 'A',
 'C': 'G',
 'G': 'C'
}

Like a list, we separate elements with commas. Unlike a list, each element has two parts: the *key* and the *value*, which are stored in pairs as key: value.

Dictionaries use indexing like lists, but instead of the indexes being numbers, they are the keys of the dictionary. So when we see

In [36]:
complements['A']

'T'

Python looks in the dictionary complements and tries to find the key 'A'. If the key exists, it returns the value associated with it.

Because of the way this lookup works, dictionaries only allow one value per key. However, multiple keys can have the same value. If we reassign the value of a key, as below

In [37]:
complements['A'] = 'U'

In [38]:
complements

{'A': 'U', 'C': 'G', 'G': 'C', 'T': 'A'}

It completely overwrites the previous value for that key.
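
Looking up a key that does not exist raises a KeyError. A minimal sketch of how to check for a key first, or fall back to a default with .get(); the mapping is redefined here, and the 'N' key is made up for the example:

```python
bases = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}  # fresh mapping for this example

# The in operator checks whether a key is present
has_a = 'A' in bases   # True
has_n = 'N' in bases   # False

# .get() returns a default (None unless specified) instead of raising KeyError
missing = bases.get('N')             # None
missing_default = bases.get('N', '?')  # '?'

# Assigning to a key that does not exist yet adds a new pair
bases['N'] = 'N'
print(len(bases))  # 5
```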

Dictionaries are very useful in Python, but their usefulness might not immediately be apparent. One common use of dictionaries is to use the keys as labels for data, and the values as the data points. For example, say we had a DNA sequence, and we wanted to store the counts of each nucleotide. We could easily create a dictionary where the keys are the nucleotides, and the values are the counts of each one, as below

In [43]:
dna_sequence = 'acgttattgggacgtgtccacgt'.upper()

sequence_counts = {} # Empty dictionary

for nucleotide in ['A', 'C', 'G', 'T']:
    sequence_counts[nucleotide] = dna_sequence.count(nucleotide)

In [42]:
sequence_counts

{'A': 4, 'C': 5, 'G': 7, 'T': 7}
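
Another common use is translating each item of a sequence through a dictionary. As a sketch, the loop below builds the complement of a short, made-up DNA sequence (the complement mapping is redefined here so the example is self-contained):

```python
pairs = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}  # DNA complements, redefined here
sequence = 'ACGTTA'  # short made-up sequence

complement = ""
for base in sequence:
    complement = complement + pairs[base]  # look up each base's partner

print(complement)  # TGCAAT
```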

## Dictionary methods

Like lists and strings, dictionaries have a variety of methods. I find that they are generally less useful than list or string methods, however.

In [53]:
# Let's start with a dictionary representing some kitchen items and their counts

kitchen_inventory = {
    'cutting_boards': 2,
    'knives': 1,
    'oven': 1
}

In [54]:
# We can remove keys
kitchen_inventory.pop('oven')
print(kitchen_inventory)

{'cutting_boards': 2, 'knives': 1}


In [55]:
# We can get collections of the keys and values (dict_keys and dict_values objects, not plain lists)
print(kitchen_inventory.keys())
print(kitchen_inventory.values())

dict_keys(['cutting_boards', 'knives'])
dict_values([2, 1])


Last, but probably most importantly, we can use .items() to produce a collection of (key, value) tuples for all items in the dictionary.

In [68]:
kitchen_inventory.items()

dict_items([('cutting_boards', 2), ('knives', 1)])

## Iterating over dictionaries with *for* loops

Like lists, we can use *for* loops to visit every item in a dictionary. There are two idiomatic ways to do this, depending on whether we just want to access the keys, or want to see the values as well.

To just access the keys, we can express the for loop as we would for a list:

In [69]:
for key in kitchen_inventory:
    print(key)

cutting_boards
knives


More often though, we're interested in both the keys and the values of a dictionary. To access these, the most idiomatic approach is to use .items(), and use tuple destructuring to assign each key and value to loop variables.
    

In [71]:
for key, value in sequence_counts.items():
    print(key, ":", value)

C : 5
G : 7
T : 7
A : 4
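
The order in which a dictionary's items come out depends on the Python version: since Python 3.7 it is the insertion order, while in older versions it was arbitrary. If you need a predictable order, one option is to loop over the sorted keys. A minimal sketch, with the counts re-entered by hand:

```python
counts = {'C': 5, 'G': 7, 'T': 7, 'A': 4}  # same counts as above, re-entered here

lines = []
for key in sorted(counts):  # sorted() returns the keys in alphabetical order
    lines.append("{} : {}".format(key, counts[key]))

print(lines)  # ['A : 4', 'C : 5', 'G : 7', 'T : 7']
```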


Here's a more useful example of what we might use iterating over a dictionary to do: say we wanted to take these nucleotide counts in the DNA sequence, and convert them to percentages. Let's build a new dictionary out of these counts called nucleotide_percentages, and populate it from the original dictionary by dividing each count by the length of the original sequence.

In [77]:
nucleotide_percentages = {}

for nuc, count in sequence_counts.items():
    nucleotide_percentages[nuc] = 100 * (count / len(dna_sequence))

In [78]:
nucleotide_percentages

{'A': 17.391304347826086,
 'C': 21.73913043478261,
 'G': 30.434782608695656,
 'T': 30.434782608695656}

# Exercises (Lists)

Since data structures are such a critical part of Python programming, there are a few more exercises than normal to make sure you get the hang of them. 

## Squaring a list of numbers

Write a function that takes a list of numbers, and returns a new list where every number is squared

Bonus 1: Make your function more robust! What happens if one of the elements of the list is not a number? Can you write the function to ignore these values? You can test if a value represents a number (rather than a string, dictionary, etc.) using:
```python
isinstance(value, Number)
```
In order to do this, we import the definition of a number from a library in the first line of the code cell below. We'll talk about library imports in a future lesson. Use the second function definition (square_numbers_ignore_non_values) for this.

In [126]:
from numbers import Number # Only used for the bonus!

def square_numbers(numbers):
    squared_numbers = []
    return squared_numbers

def square_numbers_ignore_non_values(numbers):
    squared_numbers = []
    return squared_numbers

In [107]:
assert square_numbers([3,4,5]) == [9, 16, 25]
assert square_numbers([-3,-4,-5]) == [9, 16, 25]
assert square_numbers([3.5,4.5,5.5]) == [12.25, 20.25, 30.25]
assert square_numbers([]) == []

In [99]:
assert square_numbers_ignore_non_values([4.5, 'a', 6, 7]) == [20.25, 36, 49]
assert square_numbers_ignore_non_values(['4.5', 'a', 6, 7]) == [36, 49]

[36, 49]
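
If you get stuck, here is one possible solution, offered as a sketch rather than the only correct answer. It fills in the two skeleton functions using a *for* loop and .append():

```python
from numbers import Number

def square_numbers(numbers):
    squared_numbers = []
    for number in numbers:
        squared_numbers.append(number ** 2)
    return squared_numbers

def square_numbers_ignore_non_values(numbers):
    squared_numbers = []
    for value in numbers:
        if isinstance(value, Number):  # skip strings, lists, etc.
            squared_numbers.append(value ** 2)
    return squared_numbers
```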

## Use every other word from a sentence

Take a sentence as a string, and output a new sentence where you have taken every other word from it. If the new sentence doesn't end in a period, add a period to it.

HINT: It would be good practice to write a loop to combine the sentence together. However, there's a useful way to convert a list back into a string using the .join() method on a *string* (Not a list!) 

HINT 2: If you can't remember how to take every other item from a collection, review slicing!

In [108]:
def take_every_other_word(sentence):
    every_other_sentence = " ".join(sentence.split()[::2])
    if not every_other_sentence.endswith('.'):
        every_other_sentence += '.'
    return every_other_sentence
    

In [111]:
assert take_every_other_word("I'm sorry Dave, but I can't do that.") == "I'm Dave, I do."

## Second largest element in a list

Given a list of numbers, return the second largest number in the list.

Example input: 
```python 
[1,2,3,4,500,3]
```
Example output:
```python 
4
```

As a sample, I've provided a function that finds the largest number 

In [135]:
def largest(numbers):
    largest_so_far = numbers[0]
    # Start at the second element of the list
    # If the list is only one element, numbers[1:] is empty
    # and the for loop does not do anything
    for number in numbers[1:]: 
        if number > largest_so_far:
            largest_so_far = number
    return largest_so_far
        

def second_largest(numbers):
    return numbers[0]

In [130]:
# Uses the example input and output given above
assert second_largest([1, 2, 3, 4, 500, 3]) == 4

8

# Exercises (Dictionaries)
## Word counter

Write a function that takes a sentence as a string, and returns a dictionary with the count of every word in the string. For this exercise, don't worry about making the capitalization consistent, or removing punctuation.

HINT 1: Remember that there is a function to convert a string to a list of words

HINT 2: When you encounter a word in the sentence, how do you check if the word is already contained in the dictionary? If it is in the dictionary, how would you update the entry to reflect the word being found again? If it's not in the dictionary, how might you add it? What should its original count be?

HINT 3: Python has a nice way of adding one to an existing variable. The verbose way of doing this would be:

```python
a = 10
a = a + 1 # a is now 11
```

But a shorter and cleaner way to express this is:

```python
a = 10
a += 1 # a is now 11
```

The program you will write here is such a common pattern that Python provides some solutions in the standard library. If you're curious, take a look at the collections module, particularly the Counter and defaultdict types. You can do this by typing

```python
import collections
?collections.Counter
```
into a cell below.
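
As a quick sketch of what Counter does (the sentence is made up for illustration), it builds the word-to-count mapping in one step:

```python
from collections import Counter

# Counter counts how many times each item in the list occurs
counts = Counter("the cat saw the other cat".split())

print(counts['the'])  # 2
print(counts['cat'])  # 2
print(counts['dog'])  # 0  (missing keys count as zero instead of raising KeyError)
```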

In [105]:
def word_counter(sentence):
    word_counts = {}
    for word in sentence.split():
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts

In [106]:
assert word_counter('It was the best of times, it was the worst of times.') == {'It':1, 'it':1, 'was':2, 'the':2, 
                                                                                'of':2, 'times,':1, 'times.':1, 'best':1,
                                                                                'worst':1}
assert word_counter('') == {}

## Sum the values together from two different dictionaries

Write a function that takes two different dictionaries, and returns a new dictionary with the same keys as the two dictionaries together, where the values are the sum of the values from each individual dictionary. If a key only appears in one dictionary, use its value alone. For example: if you had the following two dictionaries:

```python
a = {'A':5, 'C':6, 'G':7, 'T':10}
b = {'A':1, 'C':12, 'G':7, 'U':3}
```

Then the output of your function should be:

```python
{'A':6, 'C':18, 'G':14, 'T':10, 'U':3}
```

NOTE/Warning: In the provided skeleton, we create a new empty dictionary. You should copy each of the values you use individually into this new dictionary, rather than setting it equal to one of the arguments. Looking back at the way that Python treats lists, can you predict what would happen if you were just to set combined_counts = counts_1? Test your prediction below. 

In [125]:
def merge_and_sum_dictionaries(counts_1, counts_2):
    combined_counts = {} 
    for k, v in counts_1.items():
        combined_counts[k] = v
    for k,v in counts_2.items():
        if k in counts_1:
            combined_counts[k] += v
        else:
            combined_counts[k] = v
    return combined_counts

In [123]:
a = {'A':5, 'C':6, 'G':7, 'T':10}
b = {'A':1, 'C':12, 'G':7, 'U':3}

assert merge_and_sum_dictionaries(a, b) == {'A': 6, 'C': 18, 'G': 14, 'T': 10, 'U': 3}
assert a == {'A':5, 'C':6, 'G':7, 'T':10}

{'A': 6, 'C': 18, 'G': 14, 'T': 10, 'U': 3}

In [124]:
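# With the function as written above, a is left unchanged. The output below matches
# the combined_counts = counts_1 experiment from the note, where a is modified in place
a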

{'A': 6, 'C': 18, 'G': 14, 'T': 10, 'U': 3}