Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Owen McGrattan"
COLLABORATORS = ""

---

# Lab 2: Pandas Overview

**This assignment should be completed before Tuesday 1/30 at 1:00AM.**

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating dataframes
* Slicing data frames (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)
* Data Aggregation/Grouping dataframes

In this lab, you are going to use several pandas tools such as the `drop()` and `groupby()` methods and the `.loc[]` indexer. You may press `shift+tab` with the cursor inside a method's parentheses to see the documentation for that method.

## Setup

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a two-dimensional labeled data structure with columns of potentially different types.

**Method 1:** You can create a data frame by specifying the columns and values as shown below.

In [3]:
fruit_info = pd.DataFrame(
    data={'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink']
          })
fruit_info

Unnamed: 0,color,fruit
0,red,apple
1,orange,orange
2,yellow,banana
3,pink,raspberry


**Method 2:** You can also define a dataframe by specifying the rows as shown below.

In [4]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2

Unnamed: 0,color,fruit
0,red,apple
1,orange,orange
2,yellow,banana
3,pink,raspberry


You can obtain the dimensions of a dataframe by using its `shape` attribute: `dataframe.shape`.

In [5]:
(num_rows, num_columns) = fruit_info.shape
num_rows, num_columns

(4, 2)

### Question 1

You can add a column with `dataframe['new column name'] = [data]`. Please add a column called `rank` to the `fruit_info` table that contains 1, 2, 3, or 4 for each fruit, based on your personal preference ordering.


In [8]:
fruit_info['rank'] = [1, 3, 2, 4]
#raise NotImplementedError()

In [9]:
fruit_info

Unnamed: 0,color,fruit,rank
0,red,apple,1
1,orange,orange,3
2,yellow,banana,2
3,pink,raspberry,4


In [10]:
assert fruit_info["rank"].dtype == np.dtype('int64')

### Question 2

Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) the `rank` column you created. (Make sure to use the `axis` parameter correctly)

In [24]:
fruit_info_original = fruit_info.drop(['rank'], axis = 1)
# YOUR CODE HERE
#raise NotImplementedError()

In [25]:
fruit_info_original

Unnamed: 0,color,fruit
0,red,apple
1,orange,orange
2,yellow,banana
3,pink,raspberry


In [26]:
assert fruit_info_original.shape[1] == 2

### Question 3

Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info_original` so they begin with a capital letter. Set the `inplace` parameter correctly to change the `fruit_info_original` dataframe. (hint: in Question 2, `drop` creates and returns a new dataframe instead of changing `fruit_info` because `inplace` by default is `False`)
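
To make the effect of `inplace` concrete, here is a minimal sketch on a hypothetical one-row frame named `toy` (not part of the lab data). It only illustrates the difference between the default behavior and `inplace=True`.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
toy = pd.DataFrame({'color': ['red'], 'fruit': ['apple']},
                   columns=['color', 'fruit'])

# Default (inplace=False): a new frame is returned and `toy` is unchanged.
renamed = toy.rename(columns={'color': 'Color'})

# inplace=True: `toy` itself is modified and the call returns None.
returned = toy.rename(columns={'fruit': 'Fruit'}, inplace=True)
```

After this cell, `renamed` has the columns `Color` and `fruit`, while `toy` has `color` and `Fruit`. Assigning the result of an `inplace=True` call to a variable is a common mistake, since that variable ends up holding `None`.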

In [34]:
# YOUR CODE HERE
fruit_info_original.rename(index = str, columns = {"color": 'Color', "fruit": "Fruit"}, inplace = True)
#raise NotImplementedError()

In [35]:
fruit_info_original

Unnamed: 0,Color,Fruit
0,red,apple
1,orange,orange
2,yellow,banana
3,pink,raspberry


In [36]:
assert fruit_info_original.columns[0] == 'Color'

### Babynames dataset
Now that we have learned the basics, we will work with the babynames dataset. Let's clean and wrangle the following data frames for the remainder of the lab.

First, let's run the following cells to build the dataframe.
They download the data from the web; the California subset that we extract later should contain 367931 records in total.

### `fetch_and_cache` Helper

The following function downloads and caches data in the `data/` directory and returns the `Path` to the downloaded file

In [37]:
def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the path to the cached file.
    
    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded 
    
    return: The pathlib.Path object representing the file.
    """
    import requests
    from pathlib import Path
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir/Path(file)
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time 
        birth_time = time.ctime(file_path.stat().st_ctime)
        print("Using cached version downloaded:", birth_time)
    return file_path

Below we use `fetch_and_cache` to download the `namesbystate.zip` zip file.

**This might take a little while! Consider stretching.**

In [38]:
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'
namesbystate_path = fetch_and_cache(data_url, 'namesbystate.zip')

Downloading... Done!


The following cell builds the final full `baby_names` DataFrame. 

In [39]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']

def load_dataframe_from_zip(zf, f):
    with zf.open(f) as fh: 
        return pd.read_csv(fh, header=None, names=field_names)

# List comprehension
states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x:x.filename) 
    if f.filename.endswith('.TXT')
]

baby_names = pd.concat(states).reset_index(drop=True)

In [40]:
baby_names.head()

Unnamed: 0,State,Sex,Year,Name,Count
0,AK,F,1910,Mary,14
1,AK,F,1910,Annie,12
2,AK,F,1910,Anna,10
3,AK,F,1910,Margaret,8
4,AK,F,1910,Helen,7


In [41]:
len(baby_names)

5838786

## Slicing Data Frames - selecting rows and columns


### Selection Using Label

**Column Selection** 
To select a column of a `DataFrame` by column label, the safest way is to use the `.loc` [accessor](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage looks like `frame.loc[rowname, colname]`. (Reminder that the colon `:` means "everything".) For example, if we want the `color` column of a data frame `ex`, we would use `ex.loc[:, 'color']`.

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` would select the `Name` column and all columns after it.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` method, which takes on the form `frame['colname']`.
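
As a minimal sketch of the column-selection forms above, the cell below uses a small hypothetical frame named `toy` (not part of the lab data), so the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
toy = pd.DataFrame({
    'color': ['red', 'orange', 'yellow'],
    'fruit': ['apple', 'orange', 'banana'],
    'rank': [1, 3, 2],
}, columns=['color', 'fruit', 'rank'])

# All rows, one column label: the result is a Series.
color_col = toy.loc[:, 'color']

# All rows, a slice of column labels: 'fruit' and every column after it.
fruit_onward = toy.loc[:, 'fruit':]

# Bracket shorthand for interactive use; selects the same column.
color_short = toy['color']
```

Here `fruit_onward` keeps the `fruit` and `rank` columns, and `color_short` holds the same values as `color_col`.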

**Row Selection**
Similarly, if we want to select a row by its label, we can use the same `.loc` accessor. In this case, the "label" of each row refers to the index (i.e., primary key) of the dataframe.

In [42]:
#Example:
baby_names.loc[2:5, 'Name']

2        Anna
3    Margaret
4       Helen
5       Elsie
Name: Name, dtype: object

In [43]:
#Example:  Notice the difference between these two methods
baby_names.loc[2:5, ['Name']]

Unnamed: 0,Name
2,Anna
3,Margaret
4,Helen
5,Elsie


`.loc` uses the index labels rather than row positions to perform the selection. The previous example resembles array slicing only because the default index labels happen to be the integers 0, 1, 2, and so on. Note that, unlike array slicing, the label slice `2:5` includes row 5.

We can always use [`.iloc`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) to slice the dataframe by row position and column position instead.

See the following example:

In [44]:
#Example: We change the index from 0,1,2... to the Name column
df = baby_names[:5].set_index("Name") 
df

Unnamed: 0_level_0,State,Sex,Year,Count
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mary,AK,F,1910,14
Annie,AK,F,1910,12
Anna,AK,F,1910,10
Margaret,AK,F,1910,8
Helen,AK,F,1910,7


We can now lookup rows by name directly:

In [45]:
df.loc[['Mary', 'Anna'], :]

Unnamed: 0_level_0,State,Sex,Year,Count
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mary,AK,F,1910,14
Anna,AK,F,1910,10


However, if we want to access rows by location we will need to use the integer loc (`iloc`) accessor:

In [46]:
#Example: 
#df.loc[2:5,"Year"] You can't do this
df.iloc[1:4,2:3]

Unnamed: 0_level_0,Year
Name,Unnamed: 1_level_1
Annie,1910
Anna,1910
Margaret,1910
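
One difference worth keeping in mind is that label slices with `.loc` include the end label, while positional slices with `.iloc` exclude the end position, as ordinary Python slicing does. The sketch below shows this on a hypothetical frame `toy` whose values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame indexed by name, used only for illustration.
toy = pd.DataFrame(
    {'Year': [1910, 1911, 1912, 1913], 'Count': [14, 12, 10, 8]},
    index=['Mary', 'Annie', 'Anna', 'Margaret'],
    columns=['Year', 'Count'],
)

# Label slice: both 'Annie' and 'Anna' are included.
by_label = toy.loc['Annie':'Anna', 'Year']

# Positional slice: rows at positions 1 and 2; position 3 is excluded.
by_position = toy.iloc[1:3, 0]
```

Both selections return the same two rows, even though the second slice names an end position one past the last row it returns.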


### Question 4

Selecting multiple columns is easy.  You just need to supply a list of column names.  Select the `Name` and `Year` **in that order** from the `baby_names` table.

In [50]:
name_and_year = baby_names.loc[:, ['Name', 'Year']]
# YOUR CODE HERE
#raise NotImplementedError()

In [51]:
name_and_year[:5]

Unnamed: 0,Name,Year
0,Mary,1910
1,Annie,1910
2,Anna,1910
3,Margaret,1910
4,Helen,1910


In [52]:
assert name_and_year.shape == (5838786, 2)

As you may have noticed above, the `.loc` accessor can also be used to re-order the columns within a dataframe.

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material.  In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, culling out fishy outliers, or analyzing subgroups of your data set.  Note that compound expressions have to be grouped with parentheses. Example usage looks like `df[df['column name'] < 5]`.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
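
The operators above act element-wise on boolean Series. As a small sketch on a hypothetical frame `toy` (values made up for illustration), note that each comparison sits in its own parentheses, because `&`, `|`, and `^` bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
toy = pd.DataFrame({
    'Name': ['Mary', 'Anna', 'John', 'Liam'],
    'Year': [1999, 2000, 2000, 2001],
    'Count': [3, 7, 9, 4],
}, columns=['Name', 'Year', 'Count'])

# A comparison yields a boolean Series with one entry per row.
is_2000 = toy['Year'] == 2000

# AND: rows from 2000 with a count above 8.
both = toy[(toy['Year'] == 2000) & (toy['Count'] > 8)]

# OR: rows from 2000, or rows with a count below 4.
either = toy[(toy['Year'] == 2000) | (toy['Count'] < 4)]

# NOT: rows from any year other than 2000.
not_2000 = toy[~is_2000]
```

Storing a condition in a variable such as `is_2000` can make longer filters easier to read and reuse.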

In the following cell, we construct a DataFrame containing only the names registered in California.

In [53]:
ca = baby_names[baby_names['State'] == "CA"]

### Question 5a
Select the rows for the year 2000 (across all of `baby_names`) whose `Count` is greater than 3000. What do you notice?

(If you use `p & q` to filter the dataframe, make sure to use `df[(p) & (q)]`.)

In [78]:
result = baby_names[(baby_names['Year'] == 2000) & (baby_names['Count'] > 3000)]
# YOUR CODE HERE
#raise NotImplementedError()

In [79]:
result

Unnamed: 0,State,Sex,Year,Name,Count
687838,CA,M,2000,Daniel,4339
687839,CA,M,2000,Anthony,3837
687840,CA,M,2000,Jose,3803
687841,CA,M,2000,Andrew,3600
687842,CA,M,2000,Michael,3570
687843,CA,M,2000,Jacob,3520
687844,CA,M,2000,Joshua,3356
687845,CA,M,2000,Christopher,3332
687846,CA,M,2000,David,3280
687847,CA,M,2000,Matthew,3254


In [80]:
assert len(result) == 11
assert result["Count"].sum() == 38988

## Data Aggregation (Grouping Data Frames)

### Question 6a
To count the number of instances of a value in a `Series`, we can use the `value_counts()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) as `df["col_name"].value_counts()`. Count the number of different names for each Year in `CA` (California).  (You may use the `ca` DataFrame created above.)

**Note:** *We are not computing the number of babies but instead the number of names (rows in the table) for each year.*
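
As a minimal sketch of what `value_counts()` returns, the cell below uses a short hypothetical Series named `years`. The result is a Series indexed by the distinct values and sorted by count in descending order, not by the values themselves.

```python
import pandas as pd

# Hypothetical Series of years, used only for illustration.
years = pd.Series([2000, 2001, 2000, 2002, 2000, 2001], name='Year')

# Number of rows for each distinct value, most frequent first.
counts = years.value_counts()

# Look up the count for one value by its label.
rows_in_2000 = counts[2000]
```

Because every row is counted exactly once, the counts always add up to the length of the original Series.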

In [95]:
num_of_names_per_year = ca['Year'].value_counts()
# YOUR CODE HERE
#raise NotImplementedError()

In [96]:
num_of_names_per_year[:5]

2007    7247
2008    7156
2009    7118
2006    7074
2010    7008
Name: Year, dtype: int64

In [97]:
assert num_of_names_per_year[2007] == 7247
assert num_of_names_per_year[:5].sum() == 35603

### Question 6b
Count the number of different names for each gender in `CA`. Does the result help explain the findings in Question 5?

In [98]:
num_of_names_per_gender = ca['Sex'].value_counts()
# YOUR CODE HERE
#raise NotImplementedError()

In [99]:
num_of_names_per_gender

F    217309
M    150622
Name: Sex, dtype: int64

In [100]:
assert num_of_names_per_gender["F"] > 200000

### Question 7a

A more versatile way to aggregate data is to use the `.groupby()` [function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html). Find the sum of `Count` for each `Name` in the `ca` table. You should use `df.groupby("col_name").sum()`. Your result should be a Pandas Series.

**Note:** *In this question we are now computing the number of registered babies with a given name.*
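
The sketch below illustrates the split-and-sum pattern on a hypothetical frame `toy` with made-up counts. It also shows one way to end up with a Series rather than a DataFrame: select a single column from the grouped object before aggregating.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
toy = pd.DataFrame({
    'Name': ['Mary', 'Anna', 'Mary', 'Anna', 'John'],
    'Year': [1910, 1910, 1911, 1911, 1911],
    'Count': [14, 10, 12, 9, 7],
}, columns=['Name', 'Year', 'Count'])

# Summing the whole grouped frame gives a DataFrame; every remaining
# numeric column (including Year) is summed within each group.
frame_totals = toy.groupby('Name').sum()

# Selecting one column first gives a Series indexed by the group key.
totals = toy.groupby('Name')['Count'].sum()
```

By default the group keys become the index of the result and are sorted, so `totals` is indexed by `Anna`, `John`, `Mary` in that order.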

In [121]:
count_for_names = ca.groupby(['Name'])['Count'].sum()
# YOUR CODE HERE
#raise NotImplementedError()

In [122]:
count_for_names.sort_values(ascending=False)[:5]

Name
Michael    428290
David      370070
Robert     350423
John       312809
James      278456
Name: Count, dtype: int64

In [120]:
assert count_for_names["Michael"] == 428290
assert count_for_names[:100].sum() == 96149

### Question 7b

Find the sum of `Count` for each female name after year 1999 (`>1999`) in California.


In [128]:
female_name_count = ca[(ca['Year'] > 1999) & (ca['Sex'] == 'F')].groupby(['Name'])['Count'].sum()
# YOUR CODE HERE
#raise NotImplementedError()

In [129]:
female_name_count.sort_values(ascending=False)[:5]

Name
Emily       46277
Isabella    42875
Sophia      41475
Samantha    33436
Mia         33029
Name: Count, dtype: int64

In [130]:
assert female_name_count["Emily"] == 46277
assert female_name_count[:100].sum() == 45883

#### You are done! Remember to validate and submit via JupyterHub