# Data Structures

Last time we introduced the primitive Python data types.  Examples:

In [0]:
my_int = 42 #Integer
my_float = 3.1415926 #Floating point number
my_string = "Hello!" #String
my_boolean = False #Boolean
my_nothing = None #Nothingness

We also introduced lists:

In [0]:
list_of_ints = [1, 2, 3, 4, 5]
list_of_strings = ["Amy", "Ian", "Paul"]
list_of_stuff = [42, 3.1415926, "Hello!", False, None]

(By the way, you can add items to an existing list using the `append()` method.)

In [0]:
list_of_ints.append(6)
print(list_of_ints)

(You can also concatenate two lists using the `+` operator.)

In [0]:
print(list_of_ints + list_of_strings)
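
The two operations behave differently: `append()` changes the existing list, while `+` builds a new list and leaves both inputs alone. Here is a minimal sketch of the difference, using two made-up lists (`first` and `second` are illustrative names, not part of the notebook):

```python
# Hypothetical example lists, separate from the ones defined above
first = [1, 2, 3]
second = ["a", "b"]

combined = first + second  # builds a new list; first and second are unchanged
first.append(4)            # modifies first in place

print(combined)  # [1, 2, 3, 'a', 'b']
print(first)     # [1, 2, 3, 4]
print(second)    # ['a', 'b']
```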

Additionally, we learned how to extract information from lists (and strings) using index notation:

In [0]:
print(list_of_stuff)

In [0]:
print(list_of_stuff[0])
print(list_of_stuff[-1])
print(list_of_stuff[1:-1])

In [0]:
list_of_stuff[1:]

In [0]:
print(str(list_of_stuff[0]) + " " + str(list_of_stuff[1]))

In [0]:
print(list_of_stuff[0] * list_of_stuff[1])

A list is an example of a *data structure*.  Data structures are tools for organizing potentially large amounts of information (represented using the primitive data types) in a way that helps solve problems.  Choosing the right data structure for your problem is one of the most important and difficult parts of programming, but Python makes it much easier than most languages!

## More on Lists

We have seen how to create lists and how to extract data from them using indexing and `for` loops.  Python supports some additional helper functions to make it easy to do basic computations - we'll see some of them in action later on.

Let's begin with a list of numbers:

In [0]:
list_of_numbers = [3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9,5,0,2,8,8,4,1,9,7,1,6,9,3,9,9,3,7,5,1]

Here's how to get the number of items in the list:

In [0]:
print(len(list_of_numbers))

You can also count the number of times a specific item appears in the list:

In [0]:
print(list_of_numbers.count(3))

Which number appears more often: `4` or `5`?

In [0]:
print(list_of_numbers.count(4))

In [0]:
print(list_of_numbers.count(5))

The `sum` function allows you to add up all the numbers in the list; generally speaking it will throw an error if any of the items are not numbers.

In [0]:
print(sum(list_of_numbers))
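
To see what that error looks like, here is a small sketch with a made-up list (`mixed_items` is an illustrative name) that contains one string. The `try`/`except` block is only there so that the cell keeps running after the error:

```python
# Hypothetical list with one non-numeric item
mixed_items = [1, 2, "three"]

try:
    total = sum(mixed_items)
except TypeError as error:
    # sum cannot add an int and a str, so Python raises a TypeError
    total = None
    print("sum failed:", error)
```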

What is the sum of the last 15 numbers in the list?

In [0]:
print(sum(list_of_numbers[-15:]))

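The `sorted` function returns a new list with the same items in sorted order.  By default the order is increasing; passing `reverse=True`, as in the cell below, sorts in decreasing order instead:

In [0]:
print(sorted(list_of_numbers, reverse=True))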
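
As a small sketch of both orderings (the short `numbers` list is made up for illustration), note that `sorted` leaves the original list untouched:

```python
# Hypothetical short list, used only for this illustration
numbers = [3, 1, 4, 1, 5]

ascending = sorted(numbers)                 # default: increasing order
descending = sorted(numbers, reverse=True)  # decreasing order

print(ascending)   # [1, 1, 3, 4, 5]
print(descending)  # [5, 4, 3, 1, 1]
print(numbers)     # [3, 1, 4, 1, 5]  (the original order is unchanged)
```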

The `sorted` function can also be used to sort a list of strings alphabetically:

In [0]:
print(sorted(["paul", "amy", "ian"]))

Finally, if you only care about the largest or smallest items in the list you can use `max` or `min`:

In [0]:
print(max(list_of_numbers))
print(min(list_of_numbers))

## Dictionaries

The final basic Python data structure is the *dictionary*.  An ordinary dictionary stores the definition of a word next to the word itself; moreover, it is structured so that it is easy to search for a particular word (the words are stored alphabetically), so you can quickly look up the definition of any word.

A Python dictionary works similarly.  It consists of a collection of *key-value* pairs, where the key is analogous to a word and the value is analogous to its definition.  The difference is that the key can be any primitive data type and the value can be anything!  As a dictionary is created and modified, Python organizes it so that it is easy to find the value associated to any key.

Here is a simple example of a dictionary:

In [0]:
tweet = {
    "author": "pw_siegel",
    "text": "I've never actually tweeted before...",
    "followers": 29,
    "verified": False,
    "interests": ["math", "programming", "cats", "games"]
}

Here the keys in the dictionary are the strings `author`, `text`, `followers`, `verified`, and `interests`.  (The keys all happen to be strings, but they could have been any primitive data type.  They can't, however, be mutable data structures such as lists or dictionaries; immutable ones such as tuples are allowed.)
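
A quick way to see which values can serve as keys is to try a few.  In the sketch below (the `grid` dictionary is made up for illustration), a tuple is accepted as a key, while a list raises a `TypeError` because lists are mutable:

```python
# Hypothetical dictionary used only to test which keys are allowed
grid = {}
grid[(0, 1)] = "tuple keys are fine"  # tuples are immutable, so this works

try:
    grid[[0, 1]] = "list keys are not"  # lists are mutable, so this fails
except TypeError as error:
    print("Cannot use a list as a key:", error)

print(grid)
```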

Use square brackets to look up the value associated to a key, similarly to how you would look up an item in a list by its index.  For instance, the following lines print the text of the tweet and its number of followers:

In [0]:
print(tweet["text"])
print(tweet["followers"])
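
Looking up a key that is not in the dictionary raises a `KeyError`.  As a minimal sketch (using a made-up `profile` dictionary rather than `tweet`), the `get()` method is one way to avoid the error: it returns `None`, or a default of your choosing, when the key is missing:

```python
# Hypothetical dictionary, used only for this illustration
profile = {"author": "example_user", "followers": 29}

try:
    print(profile["location"])  # this key does not exist
except KeyError as error:
    print("Missing key:", error)

location = profile.get("location")                        # None
location_or_default = profile.get("location", "unknown")  # "unknown"
print(location, location_or_default)
```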

This syntax can also be used to add a new key-value pair to the dictionary:

In [0]:
tweet["tweet_id"] = 12345
print(tweet)

In fact, it's perfectly legitimate to add the key-value pairs one at a time:

In [0]:
book = {} #Start with an empty dictionary
book["author"] = ["John Milnor"]
book["title"] = "Morse Theory"
book["date"] = 1960

In [0]:
print(book)

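Notice that the value associated to the key `interests` in `tweet` is a list, and so is the value associated to the key `author` in `book`.  This means that `tweet["interests"]` and `book["author"]` can be manipulated just like any other list; for instance, the following cells append a second name to `book["author"]` and then access the last item of `tweet["interests"]`: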

In [0]:
book["author"].append("Stephen Smale")

In [0]:
print(book)

In [0]:
print(tweet["interests"][-1])

If you forget the keys to your dictionary, you can always recover them by converting the dictionary to a `list` (though this throws away the values):

In [0]:
print(list(tweet))
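
If you want the values as well, dictionaries also provide `values()` and `items()` methods.  A small sketch, again with a made-up `profile` dictionary:

```python
# Hypothetical dictionary, used only for this illustration
profile = {"author": "example_user", "followers": 29}

keys = list(profile)            # the keys only
values = list(profile.values()) # the values only
pairs = list(profile.items())   # (key, value) tuples

print(keys)    # ['author', 'followers']
print(values)  # ['example_user', 29]
print(pairs)   # [('author', 'example_user'), ('followers', 29)]
```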

You can also check whether or not a specific key is in the dictionary:

In [0]:
if "author" in tweet:
    print("'author' is a key in the dictionary")
if "blorp" in tweet:
    print("'blorp' is a key in the dictionary")

You can for-loop through a dictionary, though this really loops through the keys:

In [0]:
for key in tweet:
    print("Key: " + key + ", Value: " + str(tweet[key]))
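
An alternative, sketched here with a made-up `profile` dictionary, is to loop over `items()`, which hands you the key and the value together so there is no need for a second lookup:

```python
# Hypothetical dictionary, used only for this illustration
profile = {"author": "example_user", "followers": 29, "verified": False}

lines = []
for key, value in profile.items():
    # each iteration unpacks one (key, value) pair
    lines.append("Key: " + key + ", Value: " + str(value))

for line in lines:
    print(line)
```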

Use the `del` statement to remove a key (and its value) from the dictionary:

In [0]:
del tweet["verified"]
print(tweet)
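
Note that `del` raises a `KeyError` if the key is not there.  One alternative, sketched below with a made-up `profile` dictionary, is the `pop()` method, which returns the removed value and accepts a default for missing keys:

```python
# Hypothetical dictionary, used only for this illustration
profile = {"author": "example_user", "verified": False}

removed = profile.pop("verified")        # removes the key and returns False
missing = profile.pop("verified", None)  # key is already gone, so returns None

print(removed, missing)
print(profile)  # {'author': 'example_user'}
```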

### Example 1

At the bottom of this notebook is a cell containing the article from last time.  Execute that cell before continuing.

In [0]:
print(article)

In this example we will create a dictionary whose keys are words in the article and whose values give the number of times each word appears in the article.  To start, let's create an empty dictionary:

In [0]:
word_count_dict = {}

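Next, let's use the `split()` method to create a list of words in the article.  Calling it with no argument splits on any run of whitespace, so the double spaces in the article do not produce empty strings:

In [0]:
list_of_words = article.split()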

How many times does the word "the" appear in the article?

In [0]:
print(list_of_words.count("the"))
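
How the text is split affects the counts.  As a small sketch (with a made-up string), `split(" ")` yields an empty string wherever two spaces appear in a row, whereas `split()` with no argument splits on any run of whitespace:

```python
# Hypothetical string with two spaces after "cats"
sample = "cats  and dogs"

by_single_space = sample.split(" ")  # an empty string appears between the two spaces
by_whitespace = sample.split()       # runs of whitespace are treated as one separator

print(by_single_space)  # ['cats', '', 'and', 'dogs']
print(by_whitespace)    # ['cats', 'and', 'dogs']
```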

Add a key-value pair to `word_count_dict` whose key is the word "sleep" and whose value is the number of times "sleep" appears in the article.

In [0]:
word_count_dict["sleep"] = list_of_words.count("sleep")

Do the same thing with the word "all":

In [0]:
word_count_dict["all"] = list_of_words.count("all")

Now do the same thing for all of the words in the article at once!  (Hint: use a `for` loop.)

In [0]:
for word in list_of_words:
    word_count_dict[word] = list_of_words.count(word)
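
The loop above calls `count` once per word, and each call scans the whole list.  That is fine for a short article, but for longer texts a single pass is usually preferable.  Here is a minimal sketch of that idea, assuming a made-up `sample_text` in place of `article`; it also checks the result against `collections.Counter` from the standard library:

```python
from collections import Counter

# Hypothetical stand-in for `article`
sample_text = "the cat and the dog and the bird"
sample_words = sample_text.split()

single_pass_counts = {}
for word in sample_words:
    # get() returns 0 the first time a word is seen
    single_pass_counts[word] = single_pass_counts.get(word, 0) + 1

print(single_pass_counts)
# {'the': 3, 'cat': 1, 'and': 2, 'dog': 1, 'bird': 1}

# Counter does the same bookkeeping for you
print(Counter(sample_words) == single_pass_counts)  # True
```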

How many distinct words appeared in the article?

In [0]:
print(len(word_count_dict))

### Example 2

At the bottom of this notebook is a cell containing some data about expert hours billed by the professional services team in May.  Execute that cell before continuing.

In [0]:
from pprint import pprint  # pretty-printer from the standard library
pprint(expert_hours[:5])

What are the keys in the first dictionary in the list?

In [0]:
for key in expert_hours[0]:
    print(key)

In [0]:
print(list(expert_hours[0]))

How many expert hours were used in the 10th project in the list?

In [0]:
print(expert_hours[9]["hours"])

Print every item in the list in which more than 10 expert hours were used.

In [0]:
for item in expert_hours:
    if item["hours"] > 10:
        print(item)

Print every item in the list whose client is State Farm.

In [0]:
for project in expert_hours:
    if project["client"] == "State Farm":
        print(project)

How many total expert hours did State Farm use?

In [0]:
statefarm_hours_list = []

for item in expert_hours:
    if item["client"] == "State Farm":
        statefarm_hours_list.append(item["hours"])

print(sum(statefarm_hours_list))

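Who used more expert hours: Samsung US or 3M (listed as `3M US` in the data)?

In [0]:
samsung_hours_list = []
three_m_hours_list = []

for item in expert_hours:
    if item["client"] == "Samsung US":
        samsung_hours_list.append(item["hours"])
    elif item["client"] == "3M US":
        three_m_hours_list.append(item["hours"])

print(sum(samsung_hours_list))
print(sum(three_m_hours_list))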
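
The filter-and-sum pattern above handles one client at a time.  A dictionary can instead accumulate a total for every client in a single pass.  The sketch below uses a small made-up list (`sample_hours`, with invented client names) that has the same shape as `expert_hours`:

```python
# Hypothetical data with the same shape as expert_hours
sample_hours = [
    {"client": "Client A", "hours": 2.0},
    {"client": "Client B", "hours": 1.5},
    {"client": "Client A", "hours": 3.0},
    {"client": "Client C", "hours": 4.0},
    {"client": "Client B", "hours": 0.5},
]

hours_by_client = {}
for item in sample_hours:
    client = item["client"]
    # start from 0 the first time a client appears, then keep adding
    hours_by_client[client] = hours_by_client.get(client, 0) + item["hours"]

print(hours_by_client)
# {'Client A': 5.0, 'Client B': 2.0, 'Client C': 4.0}

# The client with the most hours: compare keys by their totals
top_client = max(hours_by_client, key=hours_by_client.get)
print(top_client)  # Client A
```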

Note that the list of dictionaries above is basically a Python representation of an Excel spreadsheet: each dictionary is like a row, and the keys are like the columns.  But Python dictionaries are much more powerful: the "cells" can themselves contain complex data structures, and Python syntax allows unlimited options for processing and investigating the data.

## Comprehensions (optional)

Consider the following dictionary:

In [0]:
my_dict = {"a": 5, "b": -1, "c": 7, "d": 4, "e": -6}

Suppose you wanted to find the maximum value in this dictionary.  Perhaps the simplest way to do it would be to create a list of values and plug it into the `max` function:

In [0]:
list_of_values = []

for key in my_dict:
    list_of_values.append(my_dict[key])
    
print(max(list_of_values))

It seems a bit wasteful to create the list of values when in the end we only actually care about one value.  Fortunately Python provides syntax for inserting `for` loops directly into functions like `max`:

In [0]:
max_value = max(my_dict[key] for key in my_dict)
print(max_value)

This is called a *generator expression*.  Wrapping the same syntax in square brackets gives a *list comprehension*, which builds the list directly:

In [0]:
list_of_values = [my_dict[key] for key in my_dict]
print(list_of_values)
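
To make the correspondence concrete, here is a sketch that builds the same list twice, once with a `for` loop and once with a list comprehension.  The `prices` dictionary is made up for illustration; the last line shows that `values()` is another way to reach the values without looking up each key:

```python
# Hypothetical dictionary, used only for this illustration
prices = {"apple": 3, "pear": 5, "fig": 8}

# for-loop version
doubled_loop = []
for name in prices:
    doubled_loop.append(prices[name] * 2)

# list comprehension version: same result in one line
doubled = [prices[name] * 2 for name in prices]

print(doubled_loop)  # [6, 10, 16]
print(doubled)       # [6, 10, 16]

# values() gives the values directly, so max needs no key lookups
largest = max(prices.values())
print(largest)  # 8
```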

Python also supports *dictionary comprehensions*.  For instance, here is a dictionary comprehension which creates the dictionary of word counts for the string `article`:

In [0]:
words = article.split()  # no argument: split on any run of whitespace
word_counts = {word: words.count(word) for word in words}
print(word_counts)

You can also use conditional expressions inside comprehensions to do even more refined data management.  For instance, here is how to create a word count dictionary where we only consider words that show up more than once:

In [0]:
words = article.split()  # no argument: split on any run of whitespace
word_counts = {word: words.count(word) for word in words if words.count(word) > 1}
print(word_counts)
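
A conditional comprehension corresponds to a `for` loop with an `if` inside it.  The sketch below, which uses a short made-up word list instead of `article`, writes the same filter both ways:

```python
# Hypothetical word list, used only for this illustration
sample_words = "a b a c b a".split()

# for-loop version with an explicit if
repeated_loop = {}
for word in sample_words:
    if sample_words.count(word) > 1:
        repeated_loop[word] = sample_words.count(word)

# dictionary comprehension version
repeated_comp = {
    word: sample_words.count(word)
    for word in sample_words
    if sample_words.count(word) > 1
}

print(repeated_loop)  # {'a': 3, 'b': 2}
print(repeated_comp)  # {'a': 3, 'b': 2}
```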

See if you can use a list comprehension to sort the positive values of the dictionary `my_dict` above in descending order.

In [0]:
sorted([my_dict[key] for key in my_dict if my_dict[key] > 0], reverse=True)

Now see if you can use the same comprehension syntax on the `expert_hours` data structure to compute the total number of expert hours used by State Farm:

In [0]:
sum(item["hours"] for item in expert_hours if item["client"] == "State Farm")

There is nothing you can do with a comprehension that you can't do with an ordinary `for` loop: they are simply there to make your code simpler and easier to read.  If your comprehension gets very complex (e.g. if it has lots of conditional clauses) then it is probably best to use a `for` loop instead; never apologize for sticking with the approach which is clearest to you.  That said, once you get used to comprehensions you can use them to do complex operations very quickly and easily.

## Data

In [0]:
article = "what do ducklings kittens puppies parrots and ferrets all have in common they all love their zzzs need proof these adorable sleepy animals drift and slither off to dreamland in this video from the pet collective but their sleep similarities may end there pets and all animals can have very different sleep and nap habits than their owners  and from each other cats spend more than 60 percent of their day asleep  and some spend as much as 80 percent slumbering dogs spend more than half their days snoozing some young pups need as much as 20 hours of sleep every day and ducks sometimes sleep with one eye open to keep a lookout for potential predators experts suspect watch the video the droopy eyes stretches and yawns will make you melt sarah digiulio is the huffington posts sleep reporter"

In [0]:
expert_hours = [{'client': 'Lookout',
  'date': '2016-05-12T21:47:45Z',
  'hours': 0.0,
  'id': '32237',
  'project': 'Lookout Competitor Analysis - Japanese',
  'team': 'US PS'},
 {'client': 'Lookout',
  'date': '2016-05-12T21:47:35Z',
  'hours': 0.0,
  'id': '32237',
  'project': 'Lookout Competitor Analysis - German',
  'team': 'US PS'},
 {'client': 'Lookout',
  'date': '2016-05-12T21:47:29Z',
  'hours': 5.0,
  'id': '32237',
  'project': 'Lookout Competitor Analysis - French',
  'team': 'US PS'},
 {'client': 'Archer Daniels Midland',
  'date': '2016-05-12T21:47:58Z',
  'hours': 4.0,
  'id': '34746',
  'project': 'Archer Daniels Midland - Money Laundering',
  'team': 'US PS'},
 {'client': 'Archer Daniels Midland',
  'date': '2016-05-05T15:34:46Z',
  'hours': 4.5,
  'id': '34746',
  'project': 'Archer Daniels Midland - Initial Setup',
  'team': 'US PS'},
 {'client': 'State Farm',
  'date': '2016-05-31T21:34:26Z',
  'hours': 1.0,
  'id': '05673',
  'project': 'State Farm - ABC Project- Competitive Channels',
  'team': 'US PS'},
 {'client': 'State Farm',
  'date': '2016-05-31T18:08:24Z',
  'hours': 40.0,
  'id': '05673',
  'project': 'State Farm - EH - Programs A/B/C - migration',
  'team': 'US PS'},
 {'client': 'State Farm',
  'date': '2016-05-27T19:44:47Z',
  'hours': 3.0,
  'id': '05673',
  'project': 'State Farm | Training | 05.26.2016',
  'team': 'US UA'},
 {'client': 'State Farm',
  'date': '2016-05-27T19:44:43Z',
  'hours': 3.0,
  'id': '05673',
  'project': 'State Farm | Training | 05.25.2016',
  'team': 'US UA'},
 {'client': 'State Farm',
  'date': '2016-05-19T15:51:29Z',
  'hours': 3.0,
  'id': '05673',
  'project': 'State Farm - EH - A/B/B - Drones Query test',
  'team': 'US PS'},
 {'client': 'State Farm',
  'date': '2016-05-13T18:15:54Z',
  'hours': 19.0,
  'id': '05673',
  'project': 'State Farm -EH - A/B/C - Comp Queries',
  'team': 'US PS'},
 {'client': 'State Farm',
  'date': '2016-05-06T16:44:39Z',
  'hours': 1.0,
  'id': '05673',
  'project': 'State Farm | Training | 05.06.2016',
  'team': 'US UA'},
 {'client': 'State Farm',
  'date': '2016-05-06T15:58:58Z',
  'hours': 1.0,
  'id': '05673',
  'project': 'State Farm - EH - Andrew McMahon',
  'team': 'US PS'},
 {'client': 'State Farm',
  'date': '2016-05-04T21:26:15Z',
  'hours': 1.0,
  'id': '05673',
  'project': 'State Farm | Training | 05.04.2016',
  'team': 'US UA'},
 {'client': 'Chevron',
  'date': '2016-05-27T18:13:34Z',
  'hours': 6.0,
  'id': '10156',
  'project': 'Chevron - Influencers',
  'team': 'US PS'},
 {'client': 'Chevron',
  'date': '2016-05-24T15:11:56Z',
  'hours': 0.5,
  'id': '10156',
  'project': 'Chevron - EH Rule upload',
  'team': 'US PS'},
 {'client': 'Chevron',
  'date': '2016-05-23T15:54:29Z',
  'hours': 0.5,
  'id': '10156',
  'project': 'Chevron - Please charge 0.5 EH',
  'team': 'US PS'},
 {'client': 'Nielsen Brazil',
  'date': '2016-05-03T14:52:05Z',
  'hours': 2.0,
  'id': '146220',
  'project': 'Vizia - Sponsors Setup EH',
  'team': 'US PS'},
 {'client': 'LG Ad America',
  'date': '2016-05-26T14:38:17Z',
  'hours': 1.0,
  'id': '06595',
  'project': 'LG Ad America + HT Tag Script',
  'team': 'US PS'},
 {'client': 'LG Ad America',
  'date': '2016-05-04T13:05:23Z',
  'hours': 30.0,
  'id': '06595',
  'project': 'LG Ad America + Influencer Identification',
  'team': 'US PS'},
 {'client': 'Whirlpool',
  'date': '2016-05-27T18:22:39Z',
  'hours': 2.0,
  'id': '07679',
  'project': 'Query Refinement: Whirlpool',
  'team': 'US PS'},
 {'client': 'Swisscom',
  'date': '2016-05-23T07:49:38Z',
  'hours': 2.0,
  'id': '142986',
  'project': 'Setup Check + First test on Research queries with demo - 3 hrs',
  'team': 'DACH PS'},
 {'client': 'Apprio',
  'date': '2016-05-10T15:19:39Z',
  'hours': 5.0,
  'id': '21500',
  'project': 'Apprio - EH - CTP Brand Additions',
  'team': 'US PS'},
 {'client': 'Adecco',
  'date': '2016-05-19T15:52:34Z',
  'hours': 4.0,
  'id': '06759',
  'project': 'EH Adecco - Dashboard + Query Work',
  'team': 'US PS'},
 {'client': 'Adecco',
  'date': '2016-05-19T12:49:05Z',
  'hours': 4.0,
  'id': '06759',
  'project': 'Adecco | Training | AM Session |5/18/16',
  'team': 'US UA'},
 {'client': 'Adecco',
  'date': '2016-05-06T13:05:49Z',
  'hours': 15.0,
  'id': '06759',
  'project': 'Adecco - EH - Account Cleanup',
  'team': 'US PS'},
 {'client': 'Adecco',
  'date': '2016-05-04T21:47:12Z',
  'hours': 30.0,
  'id': '06759',
  'project': 'Adecco - EH - Global Adecco Brands',
  'team': 'US PS'},
 {'client': 'Bosch AG',
  'date': '2016-05-31T17:43:29Z',
  'hours': 45.0,
  'id': '04362',
  'project': 'May - Project Management',
  'team': 'DACH PS'},
 {'client': 'Bosch AG',
  'date': '2016-05-06T15:50:12Z',
  'hours': 45.0,
  'id': '04362',
  'project': 'April - Calls',
  'team': 'DACH PS'},
 {'client': "L'Oreal DACH",
  'date': '2016-05-10T08:04:37Z',
  'hours': 20.0,
  'id': '15287',
  'project': 'Setup Devision 17 hrs',
  'team': 'DACH PS'},
 {'client': "L'Oreal DACH",
  'date': '2016-05-03T16:24:59Z',
  'hours': 20.0,
  'id': '15287',
  'project': 'CPD Monatsreport Februar 2016',
  'team': 'DACH PS'},
 {'client': "L'Oreal DACH",
  'date': '2016-05-02T17:15:19Z',
  'hours': 20.0,
  'id': '15287',
  'project': 'April - Calls',
  'team': 'DACH PS'},
 {'client': 'Samsung US',
  'date': '2016-05-25T19:08:57Z',
  'hours': 1.0,
  'id': '35257',
  'project': 'Samsung US - EH - @thesamsungside',
  'team': 'US PS'},
 {'client': 'Samsung US',
  'date': '2016-05-21T15:22:59Z',
  'hours': 45.0,
  'id': '35257',
  'project': 'Samsung - EH - Product Team - Laptops & Computers',
  'team': 'US PS'},
 {'client': 'Samsung US',
  'date': '2016-05-21T12:26:02Z',
  'hours': 5.0,
  'id': '35257',
  'project': 'Samsung Kids Additions',
  'team': 'US PS'},
 {'client': 'Samsung US',
  'date': '2016-05-13T18:01:04Z',
  'hours': 4.0,
  'id': '35257',
  'project': 'Samsung - EH - Product Team - Wearables',
  'team': 'US PS'},
 {'client': 'Samsung US',
  'date': '2016-05-12T19:10:39Z',
  'hours': 8.0,
  'id': '35257',
  'project': 'Samsung - EH - Product Team - VR',
  'team': 'US PS'},
 {'client': 'Samsung US',
  'date': '2016-05-11T16:37:29Z',
  'hours': 3.0,
  'id': '35257',
  'project': 'Samsung - EH - Retail & Carrier Teams Implementation - phase 1',
  'team': 'US PS'},
 {'client': 'Samsung US',
  'date': '2016-05-10T14:26:12Z',
  'hours': 3.0,
  'id': '35257',
  'project': 'Samsung - EH - Samsung Dev Conference',
  'team': 'US PS'},
 {'client': 'Samsung US',
  'date': '2016-05-04T19:14:58Z',
  'hours': 1.0,
  'id': '35257',
  'project': 'Samsung - EH - LG OLED Burn-in',
  'team': 'US PS'},
 {'client': 'Samsung US',
  'date': '2016-05-03T21:38:50Z',
  'hours': 4.5,
  'id': '35257',
  'project': 'Samsung | Training | 05.03.2016',
  'team': 'US UA'},
 {'client': 'Samsung US',
  'date': '2016-05-03T13:31:22Z',
  'hours': 1.0,
  'id': '35257',
  'project': 'Samsung | Training | 05.02.2016',
  'team': 'US UA'},
 {'client': 'Netzkern',
  'date': '2016-05-31T13:20:13Z',
  'hours': 14.0,
  'id': '08139',
  'project': 'Netzkern - Mai - Linda momentan krank',
  'team': 'DACH PS'},
 {'client': 'Netzkern',
  'date': '2016-05-03T12:20:13Z',
  'hours': 14.0,
  'id': '08139',
  'project': 'Netzkern - April',
  'team': 'DACH PS'},
 {'client': '3M US',
  'date': '2016-05-25T20:46:36Z',
  'hours': 3.0,
  'id': '34850',
  'project': '3M US - EH - Novec Competitors',
  'team': 'US PS'},
 {'client': '3M US',
  'date': '2016-05-24T21:11:03Z',
  'hours': 1.0,
  'id': '34850',
  'project': '3M | Vizia Training | 05.24.2016',
  'team': 'US UA'},
 {'client': '3M US',
  'date': '2016-05-19T15:43:41Z',
  'hours': 4.0,
  'id': '34850',
  'project': '3M - EH - Scotch Dash Updates',
  'team': 'US PS'},
 {'client': '3M US',
  'date': '2016-05-11T16:35:00Z',
  'hours': 20.0,
  'id': '34850',
  'project': '3M - EH - Naming Inconsistencies',
  'team': 'US PS'},
 {'client': '3M US',
  'date': '2016-05-11T14:24:57Z',
  'hours': 9.0,
  'id': '34850',
  'project': '3M - EH - Healthcare IT',
  'team': 'US PS'},
 {'client': '3M US',
  'date': '2016-05-03T14:18:40Z',
  'hours': 2.0,
  'id': '34850',
  'project': '3M MC Hammer Command Segmentation',
  'team': 'US PS'},
 {'client': 'ConocoPhillips',
  'date': '2016-05-05T18:00:39Z',
  'hours': 6.0,
  'id': '14046',
  'project': 'ConocoPhillips EH - Wildfires',
  'team': 'US PS'},
 {'client': 'Toyota',
  'date': '2016-05-19T20:45:53Z',
  'hours': 0.5,
  'id': '01380',
  'project': 'Toyota | Training | 05.18.2016',
  'team': 'US UA'},
 {'client': 'Toyota',
  'date': '2016-05-02T14:16:14Z',
  'hours': 8.0,
  'id': '01380',
  'project': 'Toyota - Stagecoach',
  'team': 'US PS'},
 {'client': 'Henry Ford Health Systems',
  'date': '2016-05-19T02:45:02Z',
  'hours': 1.0,
  'id': '72821',
  'project': 'Henry Ford Health System | Training | 05.17.2016',
  'team': 'US UA'},
 {'client': 'Wells Fargo',
  'date': '2016-05-23T15:01:23Z',
  'hours': 2.0,
  'id': '03565',
  'project': 'Wells Fargo - EH - Chase Bank cleanup',
  'team': 'US PS'},
 {'client': 'Wells Fargo',
  'date': '2016-05-19T02:46:50Z',
  'hours': 1.0,
  'id': '03565',
  'project': 'Wells Fargo | Training | 05.17.2016',
  'team': 'US UA'},
 {'client': 'Wells Fargo',
  'date': '2016-05-12T21:47:51Z',
  'hours': 8.0,
  'id': '03565',
  'project': 'Wells Fargo - WFVC Competitive Analysis Dashboards',
  'team': 'US PS'},
 {'client': 'Wells Fargo',
  'date': '2016-05-04T17:06:44Z',
  'hours': 1.0,
  'id': '03565',
  'project': 'Wells Fargo - FEEDBACK: WFVC Risk & Compliance Listening Dashboard',
  'team': 'US PS'},
 {'client': 'Toys R Us',
  'date': '2016-05-12T21:48:22Z',
  'hours': 3.0,
  'id': '03221',
  'project': 'TRU - French Query Build EH',
  'team': 'US PS'},
 {'client': 'Kohler',
  'date': '2016-05-09T15:10:50Z',
  'hours': 1.0,
  'id': '09120',
  'project': 'Kohler Hosp. Exclusions',
  'team': 'US PS'},
 {'client': 'Kohler',
  'date': '2016-05-03T13:30:31Z',
  'hours': 1.0,
  'id': '09120',
  'project': 'Kohler Vizia Hospitality Installation',
  'team': 'US PS'}]