# Python data structures

* list: available everywhere we have python; stores values in order; indexed by number; like an array
* DataFrame: requires the pandas library; stores values in a table; indexed by number, by column, etc.; rows and columns can have string labels as well; like a spreadsheet

There are more data structures in Python:

## dictionary

Other languages have the same idea under different names: JavaScript "object" (or `Map`); PHP "associative array"; C++ `std::map` / `std::unordered_map`.

A series of key->value pairs (sort of like a two-column spreadsheet)

Dictionaries are FAST! Python runs each key through a "hashing" algorithm that turns it into a number, and that number determines where the value is stored. To look up a value, Python hashes the key again and goes straight to that spot instead of searching through every entry. It's like indexing all the data beforehand so that every key has a place in memory that is easy to find.
For that reason, they potentially use more memory than a list. So this is a tradeoff between memory and speed.
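To make the hashing idea concrete, here is a toy hash table built from plain lists. It is only a sketch of the principle: the names `NUM_BUCKETS`, `make_table`, `put`, and `get` are made up for this example, and Python's real `dict` is considerably more sophisticated.

```python
# Toy hash table: a fixed number of buckets, each holding (key, value) pairs.
# Illustration only; this is not how CPython's dict is actually implemented.
NUM_BUCKETS = 8


def make_table():
    return [[] for _ in range(NUM_BUCKETS)]


def bucket_index(key):
    # hash() turns the key into an integer; the remainder picks a bucket.
    return hash(key) % NUM_BUCKETS


def put(table, key, value):
    bucket = table[bucket_index(key)]
    for i, (existing_key, _) in enumerate(bucket):
        if existing_key == key:
            bucket[i] = (key, value)  # same key: overwrite the old value
            return
    bucket.append((key, value))


def get(table, key):
    # Only one bucket is searched, no matter how many keys the table holds.
    for existing_key, value in table[bucket_index(key)]:
        if existing_key == key:
            return value
    raise KeyError(key)


table = make_table()
put(table, "Obama", "Hawaii")
put(table, "Clinton", "Arkansas")
print(get(table, "Obama"))  # Hawaii
```

The extra memory comes from the buckets: the table reserves space up front so that each key has somewhere to land.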

In [5]:
presidents = {
    "Obama" : "Hawaii",
    "Clinton" : "Arkansas",
    "Biden" : "Pennsylvania",
    "Van Buren" : "Pennsylvania",
    "Biden" : "New York"
}

In [2]:
presidents

{'Obama': 'Hawaii', 'Clinton': 'Arkansas', 'Biden': 'New York', 'Van Buren': 'Pennsylvania'}

In [3]:
type(presidents)

dict

In [4]:
presidents["Obama"]

'Hawaii'

In [7]:
presidents["Biden"] # Can't have two keys with the same name; the last one overrides the first

'New York'
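Because the override happens silently, it can be worth checking for repeated keys before building a dictionary. A minimal sketch, assuming the data arrives as a list of (key, value) pairs; the names `pairs`, `seen`, and `duplicates` are made up for this example:

```python
pairs = [
    ("Obama", "Hawaii"),
    ("Biden", "Pennsylvania"),
    ("Biden", "New York"),
]

seen = {}
duplicates = []
for name, state in pairs:
    if name in seen:
        duplicates.append(name)  # remember keys that appear more than once
    seen[name] = state           # a later value still replaces an earlier one

print(seen["Biden"])  # New York
print(duplicates)     # ['Biden']
```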

In [8]:
len(presidents)

4

In [9]:
presidents.keys()

dict_keys(['Obama', 'Clinton', 'Biden', 'Van Buren'])

In [10]:
presidents.values()

dict_values(['Hawaii', 'Arkansas', 'New York', 'Pennsylvania'])

In [13]:
"Obama" in presidents

True

In [14]:
"Arkansas" in presidents

False

In [15]:
"Arkansas" in presidents.values()

True
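Note that `in` on the dictionary itself checks keys only, and checking `.values()` has to scan every value. If lookups by value are needed often, one option is to build a second dictionary that goes the other way. A sketch, where `by_state` is a name made up for this example:

```python
presidents = {
    "Obama": "Hawaii",
    "Clinton": "Arkansas",
    "Biden": "New York",
    "Van Buren": "Pennsylvania",
}

# Invert the mapping once: state -> list of presidents from that state.
by_state = {}
for name, state in presidents.items():
    by_state.setdefault(state, []).append(name)

print("Arkansas" in by_state)  # True
print(by_state["Arkansas"])    # ['Clinton']
```

Each state maps to a list because several keys in the original dictionary may share the same value.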

In [53]:
presidents.items() # returns a view of (key, value) "tuples"

dict_items([('Obama', 'Hawaii'), ('Clinton', 'Arkansas'), ('Biden', 'New York'), ('Van Buren', 'Pennsylvania')])

In [19]:
["president " + k + " was from " + v for k, v in presidents.items()]

['president Obama was from Hawaii',
 'president Clinton was from Arkansas',
 'president Biden was from New York',
 'president Van Buren was from Pennsylvania']

## representing tabular data in python with dictionaries

A table can be represented as a list of dictionaries: the list holds the rows of the spreadsheet, and each dictionary maps the column names to the values for that row.

    name         state     population
    New York     NY        9000000
    Los Angeles  CA        4000000
    Chicago      IL        3000000

    [
        {"name": "New York", "state": "NY", "population": 9000000},
        {"name": "Los Angeles", "state": "CA", "population": 4000000},
        {"name": "Chicago", "state": "IL", "population": 3000000},
    ]

In [20]:
cities =  [
        {"name": "New York", "state": "NY", "population": 9000000},
        {"name": "Los Angeles", "state": "CA", "population": 4000000},
        {"name": "Chicago", "state": "IL", "population": 3000000},
    ]

In [21]:
cities

[{'name': 'New York', 'state': 'NY', 'population': 9000000},
 {'name': 'Los Angeles', 'state': 'CA', 'population': 4000000},
 {'name': 'Chicago', 'state': 'IL', 'population': 3000000}]

In [22]:
type(cities)

list

In [23]:
type(cities[0])

dict

In [24]:
cities[1]

{'name': 'Los Angeles', 'state': 'CA', 'population': 4000000}

In [27]:
cities[0]['population']

9000000

In [29]:
sum([item['population'] for item in cities])

16000000

In [30]:
# It's like we can implement our own mini pandas using just Python data structures.
# If we are running on a small computer that doesn't have enough memory for pandas, this can come in handy!
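A few more "mini pandas" operations on the same list of dictionaries, as a sketch using only built-in Python. The variable names and the 3.5 million threshold are chosen just for this example:

```python
cities = [
    {"name": "New York", "state": "NY", "population": 9000000},
    {"name": "Los Angeles", "state": "CA", "population": 4000000},
    {"name": "Chicago", "state": "IL", "population": 3000000},
]

# Select one column.
names = [row["name"] for row in cities]

# Filter rows: keep cities with more than 3.5 million people.
big = [row for row in cities if row["population"] > 3_500_000]

# Sort rows by a column, smallest first.
by_population = sorted(cities, key=lambda row: row["population"])

# Aggregate a column.
mean_population = sum(row["population"] for row in cities) / len(cities)

print(names)                            # ['New York', 'Los Angeles', 'Chicago']
print([row["name"] for row in big])     # ['New York', 'Los Angeles']
print(by_population[0]["name"])         # Chicago
```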

In [31]:
# but we could also load the same list of dictionaries into pandas:
import pandas as pd
pd.DataFrame(cities)

          name state  population
0     New York    NY     9000000
1  Los Angeles    CA     4000000
2      Chicago    IL     3000000


## Set

In [32]:
fruits = set()

In [34]:
fruits.add("apple")
fruits.add("orange")
fruits.add("banana")

In [35]:
fruits

{'apple', 'banana', 'orange'}

In [36]:
fruits.add("apple")

In [37]:
fruits

{'apple', 'banana', 'orange'}

In [40]:
# adding something that is already in the set does nothing: a set never holds duplicates!
# also, you cannot get an item by index...

In [41]:
"apple" in fruits

True

In [42]:
"arugula" in fruits

False

In [46]:
# the main thing sets are good for is removing duplicates from lists (note that the original order is not kept).

In [44]:
fruit_salad = ["apple", "orange", "banana", "apple", "kiwi", "orange"]

In [45]:
list(set(fruit_salad))

['banana', 'orange', 'apple', 'kiwi']
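If the original order matters, one common alternative is to go through a dictionary instead of a set, since dictionary keys are unique and (in Python 3.7 and later) keep their insertion order. A small sketch; `unique_in_order` is a name made up for this example:

```python
fruit_salad = ["apple", "orange", "banana", "apple", "kiwi", "orange"]

# dict.fromkeys() keeps the first occurrence of each item, in order.
unique_in_order = list(dict.fromkeys(fruit_salad))

print(unique_in_order)  # ['apple', 'orange', 'banana', 'kiwi']
```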

In [47]:
# sets use the same hashing as dictionaries, so checking whether an item is in a set is going to be faster than checking whether it is in a list.
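The speed difference can be measured with the standard `timeit` module. This is a rough sketch, not a rigorous benchmark: the collection size, the repeat count, and the variable names are arbitrary choices for this example, and the actual timings depend on the machine.

```python
import timeit

numbers_list = list(range(100_000))
numbers_set = set(numbers_list)
missing = -1  # not present, so the list has to compare against every element

list_time = timeit.timeit(lambda: missing in numbers_list, number=100)
set_time = timeit.timeit(lambda: missing in numbers_set, number=100)

print(f"list: {list_time:.6f} s")
print(f"set:  {set_time:.6f} s")
```

The list lookup walks through the elements one by one, while the set lookup hashes the value and checks a single spot.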

## tuples

Single, double, triple, quadruple... ("tuple" as in n-tuple).

They are almost exactly the same as lists, but they are immutable: a tuple can't be changed after it's created. Basically read-only.

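It's a tradeoff in efficiency again: by giving up the ability to change, tuples can use a bit less memory than lists, and (unlike lists) they can be hashed, so they can serve as dictionary keys or set members.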

In [48]:
t = (1, 2, 3, 4, 5)

In [49]:
type(t)

tuple

In [51]:
t[4]

5

In [52]:
"cheese" in t

False
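The immutability described above can be seen directly: assigning to an element raises a `TypeError`. The sketch below also shows a tuple used as a dictionary key, which a list cannot be. The `grid` dictionary and its contents are made up for this example.

```python
t = (1, 2, 3, 4, 5)

# Trying to change an element fails.
try:
    t[0] = 99
    changed = True
except TypeError:
    changed = False
print(changed)  # False

# A tuple of hashable values can be a dictionary key.
grid = {(0, 0): "start", (2, 3): "goal"}
print(grid[(2, 3)])  # goal

# A list cannot, because lists are mutable and therefore not hashable.
try:
    grid[[1, 1]] = "wall"
    list_key_worked = True
except TypeError:
    list_key_worked = False
print(list_key_worked)  # False
```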