# Scripting Week 4: Container Patterns and Pandas

## Announcements

## Final Project

- proposal due date is week 6

### Details

- Data analysis project
- Develop a research question
- Collect or retrieve the data
- Write a narrative answering your research questions

# Today

- Containers
- Pandas
- Basics of Split-Apply-Combine

# Review

## Containers

If we want to design an object for holding or aggregating multiple objects, what decisions would we have to make?

- **Named**: How do we refer to individual objects? Are positions in the container named, or numbered?
- **Dynamic Typing**: Does it hold anything you give it, or just one type of object (e.g. integers, floats, strings)?
- **Unique**: Do objects have to be unique? (i.e. no two items that are `==`)
- **Ordered**: Are objects in the container in a predictable, consistent order?
- **Mutable**: Can you change the container (add, remove objects) after creating it?

We've seen a list, one type of container in Python.

In [1]:
my_list = [1, 'two', True, 4.0, 4.0, 4.0]
my_list

[1, 'two', True, 4.0, 4.0, 4.0]

Are lists *named*?

 **No**.

In [2]:
my_list[3]

4.0

Example of a named container: a `dict`.

In [2]:
my_dict = { 'name': 'Peter', 'pet': 'Kiki'}
my_dict

{'name': 'Peter', 'pet': 'Kiki'}

In [4]:
my_dict['name']

'Peter'

Are lists *dynamically typed*?

**Yes**.

In [5]:
my_list

[1, 'two', True, 4.0, 4.0, 4.0]

In [6]:
for obj in my_list:
    print("The type of", obj, "is", type(obj))

The type of 1 is <class 'int'>
The type of two is <class 'str'>
The type of True is <class 'bool'>
The type of 4.0 is <class 'float'>
The type of 4.0 is <class 'float'>
The type of 4.0 is <class 'float'>


Are lists *Unique*?

**No**.

In [7]:
my_list

[1, 'two', True, 4.0, 4.0, 4.0]

Example of a container with uniqueness: a `set`:

In [8]:
my_set = { 1, 1, 2, 2, 2 }
my_set

{1, 2}

Are lists *Ordered*?

**Yes**.

In [9]:
for obj in my_list:
    print(obj, end=', ')

1, two, True, 4.0, 4.0, 4.0, 

Sets are unordered, and so were dicts in older versions of Python (since Python 3.7, dicts keep insertion order). E.g.

In [5]:
my_set = { 1, 'two', True, 4.0, 4.0, 4.0 }
my_set

{1, 4.0, 'two'}

In [8]:
my_dict = { "zoo": "Denver Zoo", "admission": 21.00 }
my_dict

{'admission': 21.0, 'zoo': 'Denver Zoo'}

*(There's an order here, but not the one you specified!)*
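The dict output above comes from an older Python/IPython setup. As a hedged aside, on Python 3.7 and later a dict keeps the order in which keys were inserted, while sets still make no ordering promise. A minimal sketch, assuming a recent Python (the `'city'` key is added here only for illustration):

```python
# Assumes Python 3.7 or newer, where dict insertion order is guaranteed.
my_dict = {"zoo": "Denver Zoo", "admission": 21.00}
keys_in_order = list(my_dict)    # keys come back in insertion order
print(keys_in_order)             # ['zoo', 'admission']

my_dict["city"] = "Denver"       # a new key is appended at the end
print(list(my_dict))             # ['zoo', 'admission', 'city']
```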

Are lists *mutable*?

**Yes**.

In [12]:
my_list = [1, 'two', True, 4.0, 4.0, 4.0]
print(my_list)
my_list[4] = 'New value!'
print(my_list)

[1, 'two', True, 4.0, 4.0, 4.0]
[1, 'two', True, 4.0, 'New value!', 4.0]


*I didn't assign a new list to the name `my_list`: I changed something in the existing list*.

Compare a `set`, which does not support assignment by position:

In [11]:
my_set = { 1, 'two', 4.0 }
print(my_set)
my_set[2] = 'New value!'
print(my_set)

{1, 'two', 4.0}


TypeError: 'set' object does not support item assignment
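Note that the error above is about indexing: a set has no positions, so there is nothing to assign to. A set can still be changed in place through methods such as `add` and `discard`. A `tuple` is an example of a container that cannot be changed after creation. A minimal sketch of both cases:

```python
my_set = {1, 'two', 4.0}
my_set.add('New value!')      # sets are mutable: add an element in place
my_set.discard('two')         # ...and remove one
print(my_set)

my_tuple = (1, 'two', 4.0)
tuple_error = None
try:
    my_tuple[2] = 'New value!'    # tuples are immutable, so this fails
except TypeError as err:
    tuple_error = str(err)
    print("TypeError:", err)
```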

## Dimensionality

![](../images/2d.png)
Once we start working with data, another property becomes important: how many dimensions does a dataset have?
![](../images/3d.png)

# Pandas

In [12]:
import pandas as pd

- Pandas is a Python port of a data structure from `R`: the dataframe.
- Pandas makes high-performance data analysis easier, and meshes with a broader set of tools for analysis

### What's notable about Pandas?

Here's how Pandas is *pitched* in their [documentation](https://pandas.pydata.org/pandas-docs/stable/overview.html). Let's consider the points and work backwards toward understanding it, by untangling the jargon.

![](../images/pandas-pitch.png)

*Data Structures* 

> A set of labeled array data structures, the primary of which are Series and DataFrame

- `labeled` - Named!
- `array` - A generalized, high-performance container.

# Series

A `Series` is a 1-dimensional array of objects.

It's like a super-charged list, that can also be named!

In [14]:
my_series = pd.Series(['John', 'Paul', 'George', 'Ringo'])
my_series

0      John
1      Paul
2    George
3     Ringo
dtype: object

Right are the *values* - this is the *data*.

Left is the *index* - this is the *metadata*.
 - These are the names for the data
 - We didn't give any names, so 0, 1, 2, 3, etc. were selected as default names
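The two halves described above can be inspected separately. A minimal sketch, using the `index` and `values` attributes of a Series:

```python
import pandas as pd

my_series = pd.Series(['John', 'Paul', 'George', 'Ringo'])

labels = list(my_series.index)    # the metadata: default labels
data = list(my_series.values)     # the data itself

print(labels)   # [0, 1, 2, 3]
print(data)     # ['John', 'Paul', 'George', 'Ringo']
```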

Providing an index explicitly:

In [15]:
my_series = pd.Series(['John', 'Paul', 'George', 'Ringo'],
                      index=["Smart One", "Cute One", "Quiet One", "Funny One"])
my_series

Smart One      John
Cute One       Paul
Quiet One    George
Funny One     Ringo
dtype: object

*Wait, `['John', 'Paul', 'George', 'Ringo']` is just a list!*

Yup, we *supercharged* it!

Here's another type of collection that can be converted to a Series:

In [18]:
my_dict = { "zoo": "Denver Zoo", "admission": 21.00 }
my_series2 = pd.Series(my_dict)
my_series2

zoo          Denver Zoo
admission            21
dtype: object

## Investigating `Series`

In [19]:
my_series.iloc[0]  # by position; plain my_series[0] is deprecated in recent pandas when the index holds labels

'John'

In [21]:
my_series[:2]

Smart One    John
Cute One     Paul
dtype: object

In [21]:
my_series['Cute One']

'Paul'
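Square brackets accept both positions and labels, which can get ambiguous. A hedged sketch of the explicit alternatives: `.iloc` always selects by position and `.loc` always selects by label.

```python
import pandas as pd

my_series = pd.Series(['John', 'Paul', 'George', 'Ringo'],
                      index=["Smart One", "Cute One", "Quiet One", "Funny One"])

by_position = my_series.iloc[0]         # first item, whatever its label
by_label = my_series.loc['Cute One']    # the item named 'Cute One'
first_two = my_series.iloc[:2]          # positional slice, end excluded
```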

You can also give a collection of `True` or `False` values to select specific items:

In [22]:
my_series

Smart One      John
Cute One       Paul
Quiet One    George
Funny One     Ringo
dtype: object

In [23]:
my_series[[True, False, False, True]]

Smart One     John
Funny One    Ringo
dtype: object

Remember lists... what happened when they were added?

In [24]:
list1 = ['hello', 'world']
list2 = ['foo', 'bar']
list1 + list2

['hello', 'world', 'foo', 'bar']

It concatenates. What about a Series?

In [25]:
series1 = pd.Series(list1)
series2 = pd.Series(list2)
series1 + series2

0    hellofoo
1    worldbar
dtype: object

Seems useless, but I'm going to tell you that it is *awesome* for working with data.

*WHY?*

What it's actually doing is working with values one-by-one. Adding strings is rarely useful, but how about:

In [27]:
series1 = pd.Series([100, 200, 100])
series2 = pd.Series([400, 300, 500])
series1 + series2

0    500
1    500
2    600
dtype: int64
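One detail worth knowing: the addition pairs values by index label, not by position. The sketch below uses made-up labels `x`, `y`, `z` (an assumption for illustration) and deliberately lists them in a different order in each Series.

```python
import pandas as pd

a = pd.Series([100, 200, 100], index=['x', 'y', 'z'])
b = pd.Series([500, 400, 300], index=['z', 'x', 'y'])   # same labels, different order

total = a + b    # values are matched up by label before adding
print(total)
```

Here `total['x']` is 100 + 400, because both values carry the label `x`, even though they sit at different positions.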

In [25]:
# Loading some data
imls = pd.read_csv('../data/imls_mini_by_state.csv', index_col='State')
population = imls['Population'].head(10)
revenue = imls['Revenue'].head(10).astype(int)

IMLS Library Data by State (2015)

In [26]:
population

State
AK      647586
AL     4833722
AR     2643930
AS       60324
AZ     6758252
CA    38697702
CO     5265461
CT     3596673
DC      672228
DE      935043
Name: Population, dtype: int64

In [34]:
revenue

State
AK      35984725
AL     104479197
AR      74999703
AS        466525
AZ     179986697
CA    1417433751
CO     299659360
CT     198873464
DC      57526154
DE      26470866
Name: Revenue, dtype: int64

In [35]:
revenue / population

State
AK    55.567484
AL    21.614647
AR    28.366751
AS     7.733655
AZ    26.632138
CA    36.628370
CO    56.910375
CT    55.293730
DC    85.575361
DE    28.309785
dtype: float64

You can also make comparisons easily:

In [36]:
population > 1000000

State
AK    False
AL     True
AR     True
AS    False
AZ     True
CA     True
CO     True
CT     True
DC    False
DE    False
Name: Population, dtype: bool

In [37]:
revenue > population

State
AK    True
AL    True
AR    True
AS    True
AZ    True
CA    True
CO    True
CT    True
DC    True
DE    True
dtype: bool

Hmmm... What's next?

In [45]:
population[population > 1000000]

State
AL     4833722
AR     2643930
AZ     6758252
CA    38697702
CO     5265461
CT     3596673
Name: Population, dtype: int64
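The selection above passes a True/False Series inside the brackets, keeping only the rows marked `True`. As a sketch, conditions can also be combined with `&` (and) and `|` (or), with each comparison wrapped in parentheses. The values are re-typed from the first five rows shown above, and the threshold of 25 is arbitrary.

```python
import pandas as pd

# First five rows from the outputs above, re-typed so the example is self-contained.
population = pd.Series({'AK': 647586, 'AL': 4833722, 'AR': 2643930,
                        'AS': 60324, 'AZ': 6758252})
revenue = pd.Series({'AK': 35984725, 'AL': 104479197, 'AR': 74999703,
                     'AS': 466525, 'AZ': 179986697})

per_capita = revenue / population

# Over a million people AND more than 25 (arbitrary cutoff) in revenue per person.
big_and_funded = population[(population > 1000000) & (per_capita > 25)]
print(big_and_funded)
```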

## Some Useful `Series` Methods

### Accessing just a few rows

In [30]:
population.head(5)

State
AK     647586
AL    4833722
AR    2643930
AS      60324
AZ    6758252
Name: Population, dtype: int64

In [40]:
population.sample(5)

State
AZ    6758252
CT    3596673
CO    5265461
AK     647586
AR    2643930
Name: Population, dtype: int64

### Summary statistics (for numbers!)

In [31]:
population.mean()

6411092.1

In [32]:
population.median()

3120301.5

In [33]:
population.min()

60324

In [34]:
population.max()

38697702

In [35]:
population.std()

11570703.733362896
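These statistics can also be requested in one call with `describe()`. A minimal sketch, with the ten population values re-typed from the output shown earlier:

```python
import pandas as pd

population = pd.Series([647586, 4833722, 2643930, 60324, 6758252,
                        38697702, 5265461, 3596673, 672228, 935043])

summary = population.describe()   # count, mean, std, min, quartiles, max
print(summary)
```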

### `unique`

Lists only the unique values. (Less useful for numerical data.)

In [37]:
ddg = pd.Series(["duck", "duck", "duck", "goose"])
ddg.unique()

array(['duck', 'goose'], dtype=object)

### `value_counts`

Counts how often each unique value shows up in the series.

In [40]:
ddg.value_counts()

duck     3
goose    1
dtype: int64
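If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A short sketch on the same data:

```python
import pandas as pd

ddg = pd.Series(["duck", "duck", "duck", "goose"])

shares = ddg.value_counts(normalize=True)   # fractions that sum to 1
print(shares)
```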

# `DataFrame`

- Table-like structure

In [46]:
pd.DataFrame([[1,2,3], [4,5,6]])

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |


In [48]:
my_df = pd.DataFrame([[1,2,3], [4,5,6]],
                     index=['row1', 'row2'],
                     columns=['col1', 'col2', 'col3']
                    )
my_df

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |
| row2 | 4    | 5    | 6    |


## Investigating Dataframes

### Selecting Columns

In [52]:
my_df['col1']

row1    1
row2    4
Name: col1, dtype: int64

Or:

In [53]:
my_df.col1

row1    1
row2    4
Name: col1, dtype: int64

This second approach only works when the column name doesn't conflict with built-in methods, properties, or keywords.

e.g.

DataFrames have a method called 'count', so you can run

``my_df.count()``

That means that if you have a column named 'count', you can only use `my_df['count']`.
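A small sketch of that conflict, using a hypothetical frame (the `item` and `count` columns are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with a column that happens to be called 'count'.
df = pd.DataFrame({'item': ['duck', 'goose'], 'count': [3, 1]})

attr = df.count         # the DataFrame.count method, not the column
column = df['count']    # the column, as a Series

print(callable(attr))     # True
print(column.tolist())    # [3, 1]
```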

## When in doubt, use the square bracket notation

`DataFrame` columns are just `Series`

In [53]:
type(my_df)

pandas.core.frame.DataFrame

In [54]:
type(my_df.col1)

pandas.core.series.Series

Remember this!!
 
## A `DataFrame` column is a `Series`

If selecting multiple columns, give a list of column names:

In [57]:
my_df[['col1', 'col3']]

|      | col1 | col3 |
|------|------|------|
| row1 | 1    | 3    |
| row2 | 4    | 6    |


How does this look different from the following?

In [56]:
my_df['col1']

row1    1
row2    4
Name: col1, dtype: int64
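One way to check the difference is to look at the types. In this sketch, a single column name returns a `Series`, while a list of names returns a `DataFrame`, even when the list has only one name in it.

```python
import pandas as pd

my_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                     index=['row1', 'row2'],
                     columns=['col1', 'col2', 'col3'])

as_series = my_df['col1']     # single name: a Series
as_frame = my_df[['col1']]    # list of names: a DataFrame

print(type(as_series).__name__, as_series.shape)   # Series (2,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (2, 1)
```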

### Selecting rows by comparison

Just like with a `Series`, a `DataFrame` can take True / False values

In [57]:
my_df

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |
| row2 | 4    | 5    | 6    |


In [58]:
my_df[[True, False]]

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |


## Convert from SQL

```sql
SELECT * FROM my_df
    WHERE col1 = 1;
```

In [59]:
my_df[my_df.col1 == 1]

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |


```sql
SELECT col2, col3 FROM my_df;
```

In [60]:
my_df[['col2', 'col3']]

|      | col2 | col3 |
|------|------|------|
| row1 | 2    | 3    |
| row2 | 5    | 6    |


```sql
SELECT * FROM my_df
    WHERE col1 > 2
    LIMIT 10
```

In [61]:
my_df[my_df.col1 > 2].head(10)

|      | col1 | col2 | col3 |
|------|------|------|------|
| row2 | 4    | 5    | 6    |


In [62]:
my_df[(my_df.col1 > 2)][:10]

|      | col1 | col2 | col3 |
|------|------|------|------|
| row2 | 4    | 5    | 6    |


```sql
SELECT col2, col3 FROM my_df
    WHERE col1 > 2
    LIMIT 10
```

In [63]:
my_df

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |
| row2 | 4    | 5    | 6    |


In [64]:
my_df[my_df.col1 > 2][['col2', 'col3']].head()

|      | col2 | col3 |
|------|------|------|
| row2 | 5    | 6    |


(We'll learn tidier ways to do this in the future, but this is perfectly fine)
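One candidate for a tidier form, offered as a sketch and not necessarily the approach covered later in the course, is `.loc`, which takes the row condition and the column list in a single pair of brackets:

```python
import pandas as pd

my_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                     index=['row1', 'row2'],
                     columns=['col1', 'col2', 'col3'])

# Rows where col1 > 2, columns col2 and col3, at most 10 rows.
result = my_df.loc[my_df['col1'] > 2, ['col2', 'col3']].head(10)
print(result)
```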

### Method Chaining Returns!

Consider what is happening here:

In [73]:
my_df[my_df.col1 > 2][['col2', 'col3']].head()

|      | col2 | col3 |
|------|------|------|
| row2 | 5    | 6    |


Break it down:

This is one action, that returns a DataFrame:

>**my_df[my_df.col1 > 2]**[['col2', 'col3']].head()

This is a second action, that returns a DataFrame:

>**my_df[my_df.col1 > 2][['col2', 'col3']]**.head()

`head()` is run on the second DataFrame.

In [66]:
selection1 = my_df[my_df.col1 > 2]
selection2 = selection1[['col2', 'col3']]
selection2.head()

|      | col2 | col3 |
|------|------|------|
| row2 | 5    | 6    |


## Load more substantial data
A dataset of craft beers from https://www.kaggle.com/nickhould/craft-cans.

`read_csv`, which we saw last week, returns a DataFrame.

In [75]:
beers = pd.read_csv('../data/beers.csv')

`head()` works the same as it did for a Series:

In [79]:
beers.head(4)

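|   | abv   | ibu | id   | name                | type                           | brewery_id | ounces |
|---|-------|-----|------|---------------------|--------------------------------|------------|--------|
| 0 | 0.05  | NaN | 1436 | Pub Beer            | American Pale Lager            | 408        | 12.0   |
| 1 | 0.066 | NaN | 2265 | Devil's Cup         | American Pale Ale (APA)        | 177        | 12.0   |
| 2 | 0.071 | NaN | 2264 | Rise of the Phoenix | American IPA                   | 177        | 12.0   |
| 3 | 0.09  | NaN | 2263 | Sinister            | American Double / Imperial IPA | 177        | 12.0   |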


### Capitalizing on the "column == Series" functionality

- How do we see all the unique breweries?
- How do we get the counts of each Beer type in the dataset?
- What types of beer most commonly have an abv at 9% or higher?
- What's the average size of beers under 9% abv?
- How do I find the coolest name for a Tripel-style beer?

Tip: Method chaining is useful here!
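As a starting point, here is a hedged sketch of how the first few questions might be approached. It rebuilds only the four rows shown by `beers.head(4)` so that it runs without the CSV file. Results on the full dataset will differ, and the Tripel question is left open.

```python
import pandas as pd

# The four rows shown above, re-typed so the example is self-contained.
beers = pd.DataFrame({
    'abv': [0.05, 0.066, 0.071, 0.09],
    'name': ['Pub Beer', "Devil's Cup", 'Rise of the Phoenix', 'Sinister'],
    'type': ['American Pale Lager', 'American Pale Ale (APA)',
             'American IPA', 'American Double / Imperial IPA'],
    'brewery_id': [408, 177, 177, 177],
    'ounces': [12.0, 12.0, 12.0, 12.0],
})

breweries = beers['brewery_id'].unique()      # unique brewery ids
type_counts = beers['type'].value_counts()    # number of beers per type

# Filter first, then chain a Series method onto the selected column.
strong_types = beers[beers['abv'] >= 0.09]['type'].value_counts()
mean_size = beers[beers['abv'] < 0.09]['ounces'].mean()
```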

### Quick Summary Stats

In [69]:
beers.describe()

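|       | abv      | ibu       | id          | brewery_id | ounces    |
|-------|----------|-----------|-------------|------------|-----------|
| count | 2348.0   | 1405.0    | 2410.0      | 2410.0     | 2410.0    |
| mean  | 0.059773 | 42.713167 | 1431.113278 | 231.749793 | 13.592241 |
| std   | 0.013542 | 25.954066 | 752.459975  | 157.685604 | 2.352204  |
| min   | 0.001    | 4.0       | 1.0         | 0.0        | 8.4       |
| 25%   | 0.05     | 21.0      | 808.25      | 93.0       | 12.0      |
| 50%   | 0.056    | 35.0      | 1453.5      | 205.0      | 12.0      |
| 75%   | 0.067    | 64.0      | 2075.75     | 366.0      | 16.0      |
| max   | 0.128    | 138.0     | 2692.0      | 557.0      | 32.0      |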


# Next Week
Here's how Pandas is *pitched* in their [documentation](https://pandas.pydata.org/pandas-docs/stable/overview.html). Let's consider the points and work backwards toward understanding it, by untangling the jargon.

>- A set of labeled array data structures, the primary of which are Series and DataFrame
>- Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing
>- An integrated group by engine for aggregating and transforming data sets
>- Date range generation (date_range) and custom date offsets enabling the implementation of customized frequencies
>- Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.
>- Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)
>- Moving window statistics (rolling mean, rolling standard deviation, etc.)

# Overview of Split-Apply-Combine

![](../images/split-apply-combine.png)

![](../images/example-split-apply-combine.png)

You don't have to split by just *one* property.

![](../images/dcube1.png)

![](../images/dcube.jpg)
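In Pandas, the split-apply-combine pattern is usually expressed with `groupby`. The sketch below uses a tiny made-up table (the `style`, `ounces`, and `abv` values are hypothetical, for illustration only) and groups first by one property, then by two.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    'style': ['IPA', 'IPA', 'Lager', 'Lager', 'IPA'],
    'ounces': [12.0, 16.0, 12.0, 12.0, 12.0],
    'abv': [0.07, 0.06, 0.05, 0.04, 0.08],
})

# Split by one property, apply mean() to each group, combine into one Series.
mean_abv_by_style = df.groupby('style')['abv'].mean()

# Split by two properties at once: the result is indexed by (style, ounces) pairs.
mean_abv_by_style_and_size = df.groupby(['style', 'ounces'])['abv'].mean()

print(mean_abv_by_style)
print(mean_abv_by_style_and_size)
```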

## Next Week
- Split-Apply-Combine In Practice