# Scripting Week 4: Container Patterns and Pandas

## Announcements

## Final Project

- proposal due date is week 6

### Details

- Data analysis project
- Develop a research question
- Collect or retrieve the data
- Write a narrative answering your research questions

# Today

- Containers
- Pandas
- Basics of Split-Apply-Combine

# Review

Trying something new for review - in groups of 3 or 4, let's take 10 minutes to discuss the questions where we stumbled together.

## Containers

If we want to design an object for holding or aggregating multiple objects, what decisions would we have to make?

- **Named**: How do we refer to individual objects? Are positions in the container named, or numbered?
- **Dynamic Typing**: Does it hold anything you give it, or just one type of object (e.g. integers, floats, strings)?
- **Unique**: Do objects have to be unique? (i.e. no two items that are `==`)
- **Ordered**: Are objects in the container in a predictable, consistent order?
- **Mutable**: Can you change the container (add, remove objects) after creating it?

We've seen a list, one type of container in Python.

In [7]:
my_list = [1, 'two', True, 4.0, 4.0, 4.0]
my_list

[1, 'two', True, 4.0, 4.0, 4.0]

Are lists *named*?

**No**. Items are numbered by position, not named.

In [11]:
my_list[3]

4.0

Example of a named container: a `dict`.

In [12]:
my_dict = { 'name': 'Peter', 'pet': 'Kiki' }
my_dict

{'name': 'Peter', 'pet': 'Kiki'}

In [13]:
my_dict['pet']

'Kiki'

Are lists *dynamically typed*?

**Yes**.

In [16]:
my_list

[1, 'two', True, 4.0, 4.0, 4.0]

In [17]:
for obj in my_list:
    print("The type of", obj, "is", type(obj))

The type of 1 is <class 'int'>
The type of two is <class 'str'>
The type of True is <class 'bool'>
The type of 4.0 is <class 'float'>
The type of 4.0 is <class 'float'>
The type of 4.0 is <class 'float'>


Are lists *Unique*?

**No**.

In [7]:
my_list

[1, 'two', True, 4.0, 4.0, 4.0]

Example of a container with uniqueness: a `set`:

In [18]:
my_set = { 1, 1, 2, 2, 2 }
my_set

{1, 2}

Are lists *Ordered*?

**Yes**.

In [9]:
for obj in my_list:
    print(obj, end=', ')

1, two, True, 4.0, 4.0, 4.0, 

Sets are unordered. Dicts keep insertion order in Python 3.7 and later, but were unordered in earlier versions. E.g.

In [21]:
my_set = { 1, 'two', True, 4.0, 4.0, 4.0 }
my_set

{1, 4.0, 'two'}

In [16]:
my_dict = { "zoo": "Denver Zoo", "admission": 21.00 }
my_dict

{'zoo': 'Denver Zoo', 'admission': 21.0}

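*(The set came back in a different order from the one specified, and `True` was dropped because `True == 1`. The dict kept its insertion order, which Python guarantees from version 3.7 on, but its items are still looked up by key rather than by position.)*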

Are lists *mutable*?

**Yes**.

In [12]:
my_list = [1, 'two', True, 4.0, 4.0, 4.0]
print(my_list)
my_list[4] = 'New value!'
print(my_list) 

[1, 'two', True, 4.0, 4.0, 4.0]
[1, 'two', True, 4.0, 'New value!', 4.0]


*I didn't set a new list to the `my_list` variable name: I changed something in the existing list*.

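Compare a `set`. It is mutable too (through methods like `add` and `remove`), but it has no positions, so assigning by index fails: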

In [23]:
my_set = { 1, 'two', 4.0 }
print(my_set)
my_set[2] = 'New value!'
print(my_set)

{1, 'two', 4.0}


TypeError: 'set' object does not support item assignment
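To separate "no positions" from "not mutable", here is a small sketch. It assumes a `tuple` as the immutable comparison, which these notes have not introduced yet; the variable names are made up for illustration.

```python
# Sets are mutable: they change in place through methods, not positions.
my_set = {1, 'two', 4.0}
my_set.add('New value!')   # add an item
my_set.discard(4.0)        # remove an item

# Tuples are ordered like lists, but immutable (an assumption-level
# comparison: tuples are not covered elsewhere in these notes).
my_tuple = (1, 'two', 4.0)
try:
    my_tuple[2] = 'New value!'
    tuple_error = None
except TypeError as err:
    tuple_error = str(err)

print(my_set)
print(tuple_error)
```

The set ends up with different contents than it started with, while the tuple refuses the change entirely.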

Let's think about other containers of stuff based on these properties: *named*, *dynamic*, *unique*, *ordered*, *mutable*

In groups, consider:

- Box of assorted LEGO
- A line at the DMV
- The Pokédex
- A table in a relational database

## Dimensionality

![](../images/2d.png)
Once we start working with data, another property becomes important: how many dimensions does a dataset have?
![](../images/3d.png)

# Pandas

In [17]:
import pandas as pd

- Pandas is a Python port of a data structure from `R`: the dataframe.
- Pandas makes high-performance data analysis easier, and meshes with a broader set of tools for analysis

### What's notable about Pandas?

Here's how Pandas is *pitched* in their [documentation](https://pandas.pydata.org/pandas-docs/stable/overview.html). Let's consider the points and work backwards toward understanding it, by untangling the jargon.

![](../images/pandas-pitch.png)

*Data Structures* 

> A set of labeled array data structures, the primary of which are Series and DataFrame

- `labeled` - Named!
- `array` - A generalized, high-performance container.

# Series

A `Series` is a 1-dimensional array of objects.

It's like a super-charged list, that can also be named!

In [20]:
beatles = ['John', 'Paul', 'George', 'Ringo']
my_series = pd.Series(beatles)
my_series

0      John
1      Paul
2    George
3     Ringo
dtype: object

Right are the *values* - this is the *data*.

Left is the *index* - this is the *metadata*.
 - These are the names for the data
 - We didn't give any names, so 0, 1, 2, 3, etc. were selected as default names

Providing an index explicitly:

In [23]:
my_series = pd.Series(['John', 'Paul', 'George', 'Ringo'],
                      index=["rhythm", "bass", "guitar", "drums"])
my_series

rhythm      John
bass        Paul
guitar    George
drums      Ringo
dtype: object

*Wait, `['John', 'Paul', 'George', 'Ringo']` is just a list!*

Yup, we *supercharged* it!
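The two halves of a Series (index on the left, values on the right) can be pulled apart again. A minimal sketch, reusing the Beatles example; `names`, `values`, and `as_plain_list` are illustrative variable names:

```python
import pandas as pd

my_series = pd.Series(['John', 'Paul', 'George', 'Ringo'],
                      index=['rhythm', 'bass', 'guitar', 'drums'])

names = list(my_series.index)         # the index: the labels (metadata)
values = list(my_series.to_numpy())   # the values: the data itself
as_plain_list = my_series.tolist()    # back to an ordinary Python list

print(names)
print(values)
```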

Here's another type of collection that can be converted to a Series:

In [4]:
my_dict = { "zoo": "Denver Zoo", "admission": 21.00 }
my_series2 = pd.Series(my_dict)
my_series2

zoo          Denver Zoo
admission          21.0
dtype: object

## Investigating `Series`

In [25]:
my_series.iloc[3]

'Ringo'

In [27]:
my_series[0:2]

rhythm    John
bass      Paul
dtype: object

In [28]:
my_series['guitar']

'George'
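Square brackets on a Series can mean "by position" or "by label" depending on what you pass in. If you want to be explicit about which one you mean, pandas offers `.iloc` (position) and `.loc` (label). A short sketch on the same Series:

```python
import pandas as pd

my_series = pd.Series(['John', 'Paul', 'George', 'Ringo'],
                      index=['rhythm', 'bass', 'guitar', 'drums'])

by_label = my_series.loc['guitar']             # look up by index label
by_position = my_series.iloc[3]                # look up by integer position
first_two = my_series.iloc[0:2]                # positional slice, end excluded
label_slice = my_series.loc['rhythm':'bass']   # label slice, end included

print(by_label, by_position)
print(first_two)
print(label_slice)
```

Note the difference in the two slices: positional slices stop before the end, label slices include it.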

In [8]:
my_series.iloc[0]

'John'

You can also give a collection of `True` or `False` values to select specific elements:

In [30]:
bools = [True, True, False, True]
my_series[bools]

rhythm     John
bass       Paul
drums     Ringo
dtype: object

Remember lists... what happened when they were added?

In [33]:
list1 = ['hello', 'world']
list2 = ['foo', 'bar']
list1 + list2

['hello', 'world', 'foo', 'bar']

It concatenates. What about a Series?

In [11]:
series1 = pd.Series(list1)
series2 = pd.Series(list2)
series1 + series2

0    hellofoo
1    worldbar
dtype: object

Seems useless, but I'm going to tell you that it is *awesome* for working with data.

*WHY?*

What it's actually doing is working with values one-by-one. Adding strings is rarely useful, but how about:

In [13]:
series1 = pd.Series([100, 200, 100])
series2 = pd.Series([400, 300, 500])
series1 + series2

0    500
1    500
2    600
dtype: int64
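One detail worth knowing: the one-by-one matching is done by index label, not by position. A small sketch with made-up labels (`x`, `y`, `z`, `w`) to show what happens when the labels are in a different order or do not all match:

```python
import pandas as pd

a = pd.Series([100, 200, 100], index=['x', 'y', 'z'])
b = pd.Series([400, 300, 500], index=['z', 'y', 'x'])

# Values are paired up by label: x with x, y with y, z with z.
total = a + b

# Labels that exist on only one side produce NaN (missing).
c = pd.Series([1, 2], index=['x', 'w'])
partial = a + c

print(total)
print(partial)
```

This is why dividing `revenue` by `population` further down lines up state by state.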

In [34]:
# Loading some data
imls = pd.read_csv('../data/imls_mini_by_state.csv', index_col='State')
population = imls['Population'].head(10)
revenue = imls['Revenue'].head(10).astype(int)

IMLS Library Data by State (2015)

In [35]:
population

State
AK      647586
AL     4833722
AR     2643930
AS       60324
AZ     6758252
CA    38697702
CO     5265461
CT     3596673
DC      672228
DE      935043
Name: Population, dtype: int64

In [36]:
revenue

State
AK      35984725
AL     104479197
AR      74999703
AS        466525
AZ     179986697
CA    1417433751
CO     299659360
CT     198873464
DC      57526154
DE      26470866
Name: Revenue, dtype: int64

In [38]:
revenue / population

State
AK    55.567484
AL    21.614647
AR    28.366751
AS     7.733655
AZ    26.632138
CA    36.628370
CO    56.910375
CT    55.293730
DC    85.575361
DE    28.309785
dtype: float64

You can also make comparisons easily:

In [40]:
population > 1000000

State
AK    False
AL     True
AR     True
AS    False
AZ     True
CA     True
CO     True
CT     True
DC    False
DE    False
Name: Population, dtype: bool

In [41]:
revenue > population

State
AK    True
AL    True
AR    True
AS    True
AZ    True
CA    True
CO    True
CT    True
DC    True
DE    True
dtype: bool

Hmmm... What's next?

In [42]:
population[population > 1000000]

State
AL     4833722
AR     2643930
AZ     6758252
CA    38697702
CO     5265461
CT     3596673
Name: Population, dtype: int64

_or_

In [45]:
matches = population > 1000000
population[matches]

State
AL     4833722
AR     2643930
AZ     6758252
CA    38697702
CO     5265461
CT     3596673
Name: Population, dtype: int64

Everything between the square brackets is evaluated first. If the result is a collection of True/False values, it is used to select the rows where the value is True.

^^ *write this down* ^^
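Because the comparison is just a Series of True/False values, two of them can be combined before selecting. A sketch using the first four population values from above; the thresholds and variable names are arbitrary choices for illustration:

```python
import pandas as pd

population = pd.Series({'AK': 647586, 'AL': 4833722,
                        'AR': 2643930, 'AS': 60324})

big = population > 1_000_000
not_huge = population < 4_000_000

# & means "and", | means "or", ~ means "not".
# Wrap inline comparisons in parentheses when combining them.
mid_sized = population[big & not_huge]
small_or_huge = population[~big | (population > 4_000_000)]

print(mid_sized)
print(small_or_huge)
```

Python's `and` / `or` keywords do not work here; pandas needs the `&` / `|` operators.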

## Some Useful `Series` Methods

First, review language from the lab:
*What's a method?*
    
See if you can explain it to the person closest to you, or if their explanation matches what you remember.

### Accessing just a few rows

In [50]:
population.head(5)

State
AK     647586
AL    4833722
AR    2643930
AS      60324
AZ    6758252
Name: Population, dtype: int64

In [83]:
population.sample(5)

State
AL     4833722
DC      672228
CA    38697702
AK      647586
DE      935043
Name: Population, dtype: int64

### Summary statistics (for numbers!)

In [94]:
population.mean()

6411092.1

In [51]:
population.median()

3120301.5

In [52]:
population.min()

60324

In [53]:
population.max()

38697702

In [55]:
population.std()

11570703.733362896

### `unique`

Lists only the unique values. (Less useful for numerical data.)

In [58]:
ddg = pd.Series(["duck", "duck", "duck", "goose"])
ddg.unique()

array(['duck', 'goose'], dtype=object)

### `value_counts`

Counts how often each unique value shows up in the series.

In [64]:
ddg.value_counts()

duck     3
goose    1
dtype: int64

## Lab Break!

Let's try some of the Series questions from the lab to get the hang of it. (up to **Q7**)

Before we start though, here's a Colab tip. You can see all the variables that you've set with the Variable inspector:

![Variable Inspector](../images/var-inspector.png)

Useful for keeping track of whether things are what you expect them to be.

# `DataFrame`

- Table-like structure

In [25]:
pd.DataFrame([[1,2,3], [4,5,6]])

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |


In [65]:
my_df = pd.DataFrame([[1,2,3], [4,5,6]],
                     index=['row1', 'row2'],
                     columns=['col1', 'col2', 'col3']
                    )
my_df

| | col1 | col2 | col3 |
|---|---|---|---|
| row1 | 1 | 2 | 3 |
| row2 | 4 | 5 | 6 |


## Investigating Dataframes

### Selecting Columns

In [76]:
my_df['col1']

row1    1
row2    4
Name: col1, dtype: int64

Or:

In [68]:
my_df.col1

row1    1
row2    4
Name: col1, dtype: int64

This second approach only works when the column name doesn't conflict with built-in methods, properties, or keywords.

e.g.

DataFrames have a method called 'count', so you can run

``my_df.count()``

That means that if you have a column named 'count', you can only use `my_df['count']`.
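A quick sketch of that conflict, using a made-up two-row table with a column that happens to be called `count`:

```python
import pandas as pd

# Hypothetical data: a column named 'count' collides with DataFrame.count()
df = pd.DataFrame({'count': [3, 1], 'animal': ['duck', 'goose']})

attr = df.count       # this is the built-in method, not the column
col = df['count']     # square brackets always give you the column

print(callable(attr))   # the attribute is something you can call
print(list(col))
print(list(df.animal))  # dot notation still works for 'animal'
```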

## When in doubt, use the square bracket notation

`DataFrame` columns are just `Series`

In [69]:
my_df

| | col1 | col2 | col3 |
|---|---|---|---|
| row1 | 1 | 2 | 3 |
| row2 | 4 | 5 | 6 |


In [70]:
type(my_df)

pandas.core.frame.DataFrame

In [71]:
x = my_df['col1']
type(x)

pandas.core.series.Series

Remember this!!
 
## A `DataFrame` column is a `Series`

If selecting multiple columns, give a list of column names:

In [81]:
# option 1
list_of_columns = ['col1', 'col3']
my_df[list_of_columns]

# option 2
my_df[['col1', 'col3']]

| | col1 | col3 |
|---|---|---|
| row1 | 1 | 3 |
| row2 | 4 | 6 |


How does this look different from the following?

In [119]:
my_df['col1']

row1    1
row2    4
Name: col1, dtype: int64
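One way to check the difference for yourself is to ask for the type and shape of each result. A sketch on the same small table:

```python
import pandas as pd

my_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                     index=['row1', 'row2'],
                     columns=['col1', 'col2', 'col3'])

as_series = my_df['col1']     # a column name -> Series (1-dimensional)
as_frame = my_df[['col1']]    # a list of names -> DataFrame (2-dimensional)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

A list of column names gives back a DataFrame even when the list has only one name in it.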

### Selecting rows by comparison

Just like with a `Series`, a `DataFrame` can take a collection of True / False values, one per row, to select rows:

In [99]:
my_df

| | col1 | col2 | col3 |
|---|---|---|---|
| row1 | 1 | 2 | 3 |
| row2 | 4 | 5 | 6 |


In [98]:
my_df[[True, False]]

| | col1 | col2 | col3 |
|---|---|---|---|
| row1 | 1 | 2 | 3 |


## Convert from SQL

```sql
SELECT * FROM my_df
    WHERE col1 < col2
```

In [100]:
rows_to_return = my_df['col1'] < my_df['col2']
my_df[rows_to_return]

| | col1 | col2 | col3 |
|---|---|---|---|
| row1 | 1 | 2 | 3 |
| row2 | 4 | 5 | 6 |


```sql
SELECT col2, col3 FROM my_df;
```

In [101]:
list_of_columns = ['col2', 'col3']
my_df[list_of_columns]

| | col2 | col3 |
|---|---|---|
| row1 | 2 | 3 |
| row2 | 5 | 6 |


In [60]:
my_df[['col2', 'col3']]

| | col2 | col3 |
|---|---|---|
| row1 | 2 | 3 |
| row2 | 5 | 6 |


```sql
SELECT * FROM my_df
    WHERE col1 > 2
    LIMIT 10
```

In [103]:
matches = my_df['col1'] > 2
my_df[matches].head(10)

| | col1 | col2 | col3 |
|---|---|---|---|
| row2 | 4 | 5 | 6 |


In [61]:
my_df[my_df.col1 > 2].head(10)

| | col1 | col2 | col3 |
|---|---|---|---|
| row2 | 4 | 5 | 6 |


In [62]:
my_df[(my_df.col1 > 2)][:10]

| | col1 | col2 | col3 |
|---|---|---|---|
| row2 | 4 | 5 | 6 |


```sql
SELECT col2, col3 FROM my_df
    WHERE col1 > 2
    LIMIT 10
```

In [108]:
subset = my_df[my_df['col1'] > 2]
subset2 = subset[['col2', 'col3']]
subset2.head(10)

| | col2 | col3 |
|---|---|---|
| row2 | 5 | 6 |


In [None]:
# Parentheses let the chain continue on the next line
(my_df[my_df['col1'] > 2][['col2', 'col3']]
 .head(10))

In [110]:
my_df[my_df.col1 > 2][['col2', 'col3']].head(10)

| | col2 | col3 |
|---|---|---|
| row2 | 5 | 6 |


(We'll learn tidier ways to do this in the future, but this is perfectly fine)
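One candidate for a tidier form, offered here only as a sketch and not necessarily the approach we'll use later, is `.loc`, which takes the row selection and the column list in a single pair of brackets:

```python
import pandas as pd

my_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                     index=['row1', 'row2'],
                     columns=['col1', 'col2', 'col3'])

# .loc[rows, columns]: WHERE col1 > 2, SELECT col2, col3, LIMIT 10
result = my_df.loc[my_df['col1'] > 2, ['col2', 'col3']].head(10)

print(result)
```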

### Method Chaining Returns!

Consider what is happening here:

In [73]:
my_df[my_df.col1 > 2][['col2', 'col3']].head()

| | col2 | col3 |
|---|---|---|
| row2 | 5 | 6 |


Break it down:

This is one action, that returns a DataFrame:

>**my_df[my_df.col1 > 2]**[['col2', 'col3']].head()

This is a second action, that returns a DataFrame:

>**my_df[my_df.col1 > 2][['col2', 'col3']]**.head()

`head()` is run on the second DataFrame.

In [146]:
selection1 = my_df[my_df.col1 > 2]
selection2 = selection1[['col2', 'col3']]
selection2.head()

| | col2 | col3 |
|---|---|---|
| row2 | 5 | 6 |


## Load more substantial data
A dataset of craft beers from https://www.kaggle.com/nickhould/craft-cans.

`read_csv`, which we saw last week, returns a DataFrame.

In [111]:
beers = pd.read_csv('../data/beers.csv')

`head()` works the same as it did for a Series:

In [79]:
beers.head(4)

| | abv | ibu | id | name | type | brewery_id | ounces |
|---|---|---|---|---|---|---|---|
| 0 | 0.05 | NaN | 1436 | Pub Beer | American Pale Lager | 408 | 12.0 |
| 1 | 0.066 | NaN | 2265 | Devil's Cup | American Pale Ale (APA) | 177 | 12.0 |
| 2 | 0.071 | NaN | 2264 | Rise of the Phoenix | American IPA | 177 | 12.0 |
| 3 | 0.09 | NaN | 2263 | Sinister | American Double / Imperial IPA | 177 | 12.0 |


### Capitalizing on the "column == Series" functionality

- How do we see all the unique breweries?
- How do we get the counts of each Beer type in the dataset?
- What types of beer most commonly have an abv at 9% or higher?
- What's the average size of beers under 9% abv?
- How do I find the coolest name for a Tripel-style beer?

Tip: Method chaining is useful here!
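Here is a sketch of how a few of these could be approached. To keep it self-contained it does not load `beers.csv`; instead it builds a tiny stand-in table (`beers_small`, a made-up name) from six rows that appear in the outputs in these notes, so the answers below describe only those six rows, not the real dataset.

```python
import pandas as pd

# Stand-in for the full dataset: six rows copied from the notes' outputs
beers_small = pd.DataFrame({
    'abv': [0.05, 0.066, 0.071, 0.09, 0.125, 0.12],
    'name': ['Pub Beer', "Devil's Cup", 'Rise of the Phoenix',
             'Sinister', 'London Balling', 'Csar'],
    'type': ['American Pale Lager', 'American Pale Ale (APA)',
             'American IPA', 'American Double / Imperial IPA',
             'English Barleywine', 'Russian Imperial Stout'],
    'brewery_id': [408, 177, 177, 177, 1, 17],
    'ounces': [12.0, 12.0, 12.0, 12.0, 16.0, 16.0],
})

# All the unique breweries
breweries = beers_small['brewery_id'].unique()

# Counts of each beer type
type_counts = beers_small['type'].value_counts()

# Which types most commonly have an abv of 9% or higher?
strong_types = beers_small[beers_small['abv'] >= 0.09]['type'].value_counts()

# Average size of beers under 9% abv
avg_size = beers_small[beers_small['abv'] < 0.09]['ounces'].mean()

print(breweries)
print(strong_types)
print(avg_size)
```

Each answer is the same pattern: select rows with a comparison, pick a column (a Series), then call a Series method on it.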

In [14]:
isstrong = beers['abv'] >= 0.11
beers[isstrong]

| | abv | ibu | id | name | type | brewery_id | ounces |
|---|---|---|---|---|---|---|---|
| 70 | 0.125 | 80.0 | 2685 | London Balling | English Barleywine | 1 | 16.0 |
| 2189 | 0.12 | 90.0 | 2621 | Csar | Russian Imperial Stout | 17 | 16.0 |
| 2278 | 0.128 | NaN | 2565 | Lee Hill Series Vol. 5 - Belgian Style Quadrup... | Quadrupel (Quad) | 51 | 19.2 |


In [None]:
beers[beers['abv'] >= 0.11]

In [172]:
beers.groupby('type').mean(numeric_only=True).head(10)

| type | abv | ibu | id | brewery_id | ounces |
|---|---|---|---|---|---|
| Abbey Single Ale | 0.049000 | 22.000000 | 2196.000000 | 57.000000 | 12.000000 |
| Altbier | 0.054385 | 34.125000 | 1480.923077 | 220.000000 | 13.538462 |
| American Adjunct Lager | 0.048722 | 11.000000 | 1714.055556 | 163.000000 | 13.333333 |
| American Amber / Red Ale | 0.057456 | 36.298701 | 1272.398496 | 248.045113 | 13.293233 |
| American Amber / Red Lager | 0.049464 | 23.250000 | 1154.413793 | 305.965517 | 13.241379 |
| American Barleywine | 0.099000 | 96.000000 | 1368.333333 | 317.000000 | 10.800000 |
| American Black Ale | 0.068971 | 68.900000 | 1562.416667 | 261.861111 | 13.888889 |
| American Blonde Ale | 0.049635 | 20.983607 | 1320.527778 | 253.222222 | 13.185185 |
| American Brown Ale | 0.057926 | 29.894737 | 1282.985714 | 238.928571 | 13.405714 |
| American Dark Wheat Ale | 0.052333 | 27.600000 | 1023.714286 | 250.714286 | 13.142857 |


### Quick Summary Stats

In [112]:
beers.describe()

| | abv | ibu | id | brewery_id | ounces |
|---|---|---|---|---|---|
| count | 2348.0 | 1405.0 | 2410.0 | 2410.0 | 2410.0 |
| mean | 0.059773 | 42.713167 | 1431.113278 | 231.749793 | 13.592241 |
| std | 0.013542 | 25.954066 | 752.459975 | 157.685604 | 2.352204 |
| min | 0.001 | 4.0 | 1.0 | 0.0 | 8.4 |
| 25% | 0.05 | 21.0 | 808.25 | 93.0 | 12.0 |
| 50% | 0.056 | 35.0 | 1453.5 | 205.0 | 12.0 |
| 75% | 0.067 | 64.0 | 2075.75 | 366.0 | 16.0 |
| max | 0.128 | 138.0 | 2692.0 | 557.0 | 32.0 |


# Next Week
Here's how Pandas is *pitched* in their [documentation](https://pandas.pydata.org/pandas-docs/stable/overview.html). We'll consider the points and work backwards toward understanding it, by untangling the jargon.

>- A set of labeled array data structures, the primary of which are Series and DataFrame
>- Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing
>- An integrated group by engine for aggregating and transforming data sets
>- Date range generation (date_range) and custom date offsets enabling the implementation of customized frequencies
>- Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.
>- Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)
>- Moving window statistics (rolling mean, rolling standard deviation, etc.)

# Overview of Split-Apply-Combine

![](../images/split-apply-combine.png)

![](../images/example-split-apply-combine.png)

You don't have to split by just *one* property.

![](../images/dcube1.png)

![](../images/dcube.jpg)
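As a rough preview of what the pictures describe, here is a sketch using pandas' `groupby`. It assumes a tiny stand-in table (`beers_small`, a made-up name) built from six rows shown earlier in these notes, so the numbers describe only those rows.

```python
import pandas as pd

# Stand-in table: six rows copied from earlier outputs
beers_small = pd.DataFrame({
    'abv': [0.05, 0.066, 0.071, 0.09, 0.125, 0.12],
    'brewery_id': [408, 177, 177, 177, 1, 17],
    'ounces': [12.0, 12.0, 12.0, 12.0, 16.0, 16.0],
})

# Split by one property (brewery), apply mean() to abv, combine into a Series
mean_abv = beers_small.groupby('brewery_id')['abv'].mean()

# Split by two properties at once, then count the rows in each group
counts = beers_small.groupby(['brewery_id', 'ounces']).size()

print(mean_abv)
print(counts)
```

The result of the first `groupby` is a Series indexed by brewery; the second is indexed by every (brewery, size) pair that occurs.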

## Next Week
- Split-Apply-Combine In Practice