Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = ""
COLLABORATORS = ""

---

# Lab 2: Pandas Overview

**This lab was distributed Monday 9/9/2019 and should be completed by Friday 9/13/2019 at 11:59PM.**

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating dataframes
* Slicing data frames (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)
* Data Aggregation/Grouping dataframes

In this lab, you are going to use several pandas methods and indexers, such as `drop()`, `.loc[]`, and `groupby()`. You may press `shift+tab` while your cursor is on a method's parameters to see the documentation for that method.

## Setup

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a two-dimensional labeled data structure with columns of potentially different types.

**Method 1:** You can create a data frame by specifying the columns and values as shown below.

Notice the syntax: you're passing a dictionary into the DataFrame. The keys become the column names (e.g. `'fruit'`), and the values are lists (e.g. `['apple', ...]`).

In [3]:
fruit_info = pd.DataFrame(
    data={'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink']
          })
fruit_info

,fruit,color
0,apple,red
1,orange,orange
2,banana,yellow
3,raspberry,pink


**Method 2:** You can also define a dataframe by specifying the rows like below.

Here, you're passing in tuples for each row of data (e.g. `("red", "apple")`) and specifying the column names separately.

In [4]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2

,color,fruit
0,red,apple
1,orange,orange
2,yellow,banana
3,pink,raspberry


**Other methods**: Usually you won't be creating data frames in such a manual way. You'll often be loading data frames from other file types, for example comma-separated values (CSV) files. More on that later.
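
As a small preview of loading from a file, the sketch below reads comma-separated text with `pd.read_csv`. To stay self-contained it reads from an in-memory `io.StringIO` buffer instead of a file on disk; the variable names and sample rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the contents of a file on disk
csv_text = "fruit,color\napple,red\nbanana,yellow\n"

# read_csv accepts a file path or any file-like object;
# the first line is used as the column names by default
fruit_from_csv = pd.read_csv(io.StringIO(csv_text))

print(fruit_from_csv)
```

With a real file you would pass its path (for example a string or a `pathlib.Path`) in place of the buffer.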

You can obtain the dimensions of a dataframe by using the `shape` attribute: `dataframe.shape`.

In [5]:
(num_rows, num_columns) = fruit_info.shape
num_rows, num_columns

(4, 2)

### Question 1

You can add a column by `dataframe['new column name'] = [data]`. Add a column called `rank` to the `fruit_info` table which contains a 1, 2, 3, or 4 based on your personal preference ordering for each fruit.

(Note: you'll need to comment out or delete the `NotImplementedError()`.)

In [6]:
# YOUR CODE HERE
fruit_info['rank'] = [3,2,4,1]
# raise NotImplementedError()

In [7]:
fruit_info

,fruit,color,rank
0,apple,red,3
1,orange,orange,2
2,banana,yellow,4
3,raspberry,pink,1


In [8]:
assert fruit_info["rank"].dtype == np.dtype('int64')

### Question 2

Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) the `rank` column you created, and save the new dataframe (without the `rank` column) to `fruit_info_original`. Some notes:

* You'll need to look up `drop` to figure out the right syntax.
* Make sure to use the `axis` parameter correctly
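
To illustrate what the `axis` parameter controls, here is a minimal sketch on a made-up frame (`toy` and its columns are hypothetical and not part of the lab data). It shows that `axis=1` refers to column labels, `axis=0` (the default) refers to row labels, and that `drop` returns a new frame.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=1 tells drop that "b" is a column label
dropped_axis = toy.drop("b", axis=1)

# The columns= keyword is an equivalent spelling
dropped_cols = toy.drop(columns=["b"])

# axis=0 (the default) drops rows by index label instead
dropped_row = toy.drop(0, axis=0)

# drop returns a new frame; toy still has all three columns
print(list(toy.columns), list(dropped_axis.columns))
```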

In [9]:
fruit_info_original = fruit_info.drop(['rank'], axis=1)
#raise NotImplementedError()

In [10]:
fruit_info_original

,fruit,color
0,apple,red
1,orange,orange
2,banana,yellow
3,raspberry,pink


In [11]:
assert fruit_info_original.shape[1] == 2

### Question 3

Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info_original` so they begin with a capital letter. Set the `inplace` parameter correctly to change the `fruit_info_original` dataframe. (hint: in Question 2, `drop` creates and returns a new dataframe instead of changing `fruit_info` because `inplace` by default is `False`)
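
The effect of `inplace` can be seen on a made-up one-row frame (the names `toy`, `renamed`, and `result` are hypothetical). This is only a sketch of the two modes, not the answer to the question:

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({"fruit": ["apple"], "color": ["red"]})

# Default inplace=False: a renamed copy is returned, toy is unchanged
renamed = toy.rename(columns={"fruit": "Fruit"})

# inplace=True: toy itself is modified and the call returns None
result = toy.rename(columns={"color": "Color"}, inplace=True)

print(list(renamed.columns), list(toy.columns), result)
```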

In [12]:
fruit_info_original.rename(columns = {"fruit": "Fruit", "color":"Color"}, inplace = True)
# raise NotImplementedError()

In [13]:
fruit_info_original

,Fruit,Color
0,apple,red
1,orange,orange
2,banana,yellow
3,raspberry,pink


In [14]:
assert fruit_info_original.columns[1] == 'Color' # the column number might be different for you

## Babyname datasets
Now that we have learned the basics, we'll spend some time working with the babynames dataset. Let's clean and wrangle the following data frames for the remainder of the lab.

First, let's run the following cells to build the dataframe.
They download the data from the web for baby names with at least 5 occurrences in a given year from 1880-2018. There should be 1957046 records.

### `fetch_and_cache` Helper

The following function downloads and caches data in the `data/` directory and returns the `Path` to the downloaded file.

In [15]:
def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a URL and return the path to the cached file.
    
    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded 
    
    return: The pathlib.Path object representing the file.
    """
    import requests
    from pathlib import Path
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir/Path(file)
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        resp.raise_for_status()  # fail loudly instead of caching an error page
        with file_path.open('wb') as f:  # the with-block closes the file for us
            f.write(resp.content)
        print('Done!')
    else:
        import time 
        birth_time = time.ctime(file_path.stat().st_ctime)
        print("Using cached version downloaded:", birth_time)
    return file_path

Below we use `fetch_and_cache` to download the `names.zip` file.

**This might take a little while! Consider stretching.**

In [16]:
data_url = 'https://www.ssa.gov/oact/babynames/names.zip'
names_path = fetch_and_cache(data_url, 'names.zip')

Downloading... Done!


The following cell builds the final full `baby_names` DataFrame. 

In [17]:
import zipfile
zf = zipfile.ZipFile(names_path, 'r')

field_names = ['Name', 'Sex', 'Count']

def load_dataframe_from_zip(zf, f):
    with zf.open(f) as fh: 
        year = int(f.filename[3:7])
        names = pd.read_csv(fh, header=None, names=field_names)
        names["Year"] = year
        return names
    
# Build one DataFrame per yearly file (yobYYYY.txt), in filename order
yearly_frames = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x:x.filename) 
    if f.filename.endswith('.txt')
]

baby_names = pd.concat(yearly_frames).reset_index(drop=True)

In [18]:
baby_names.head()

,Name,Sex,Count,Year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880


In [19]:
len(baby_names)

1957046

## Slicing Data Frames - selecting rows and columns


### Selection Using Label

**Column Selection** 
To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` [indexer](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage looks like `frame.loc[rowname, colname]`. (Recall that a bare colon `:` means "everything".)

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` would select the column `Name` and all columns after it.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `frame['colname']`.

**Row Selection**
Similarly, if we want to select a row by its label, we can use the same `.loc` indexer. In this case, the "label" of each row refers to the index (i.e., primary key) of the dataframe.

In [20]:
#Example:
baby_names.loc[2:5, 'Name']

2         Emma
3    Elizabeth
4       Minnie
5     Margaret
Name: Name, dtype: object

In [21]:
#Example:  Notice the difference between these two methods
baby_names.loc[2:5, ['Name']]

,Name
2,Emma
3,Elizabeth
4,Minnie
5,Margaret
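
The difference between the two cells above is the return type: a single column label gives a `Series`, while a list of labels gives a `DataFrame`, even if the list has only one entry. A minimal sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({"Name": ["Mary", "Anna", "Emma"],
                    "Count": [7065, 2604, 2003]})

as_series = toy.loc[0:1, "Name"]    # scalar column label -> Series
as_frame = toy.loc[0:1, ["Name"]]   # list of labels -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```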


`.loc` actually uses the index (the bolded, leftmost column in the displayed dataframe) rather than the row position to perform the selection. It is only a coincidence that the previous example resembles array slicing: the index and the row position aren't always the same value, and unlike array slicing, a `.loc` slice such as `2:5` includes both endpoints. For example, you could set your index to a non-numeric code, like a serial number or other unique ID, if that's how you want to identify your records.

But we can always use [`.iloc`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) to slice the dataframe by row position and column position.

See the following example:

In [22]:
#Example: We change the index from 0,1,2... to the Name column
df = baby_names[:5].set_index("Name") 
df

Name,Sex,Count,Year
Mary,F,7065,1880
Anna,F,2604,1880
Emma,F,2003,1880
Elizabeth,F,1939,1880
Minnie,F,1746,1880


We can now look up rows by name directly:

In [23]:
df.loc[['Mary', 'Anna'], :]

Name,Sex,Count,Year
Mary,F,7065,1880
Anna,F,2604,1880


However, if we want to access rows by location we will need to use the integer loc (`iloc`) accessor:

In [24]:
#Example: 
# df.loc[2:5, "Year"] would fail here: the index now holds names, not integers
df.iloc[1:4,1:3]

Name,Count,Year
Anna,2604,1880
Emma,2003,1880
Elizabeth,1939,1880
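
One practical difference between the two accessors is how slices treat their endpoints. The sketch below uses a made-up name-indexed frame (`toy` is hypothetical) to show that a `.loc` label slice includes both ends, while an `.iloc` position slice excludes the stop position, as ordinary Python slicing does.

```python
import pandas as pd

# Made-up frame indexed by name, for illustration only
toy = pd.DataFrame(
    {"Count": [7065, 2604, 2003, 1939]},
    index=["Mary", "Anna", "Emma", "Elizabeth"],
)

# Label slice: both "Anna" and "Emma" are included
by_label = toy.loc["Anna":"Emma"]

# Position slice: starts at position 1 and stops before position 2
by_position = toy.iloc[1:2]

print(list(by_label.index))
print(list(by_position.index))
```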


### Question 4

Selecting multiple columns is easy. You just need to supply a list of column names. Select the `Year` and `Name` columns **in that order** from the `baby_names` table.

In [25]:
year_and_name = baby_names.loc[:, ["Year", "Name"]]
# YOUR CODE HERE
#raise NotImplementedError()

In [26]:
year_and_name[:5]

,Year,Name
0,1880,Mary
1,1880,Anna
2,1880,Emma
3,1880,Elizabeth
4,1880,Minnie


In [27]:
assert year_and_name.shape == (1957046, 2)
assert year_and_name.columns[0] == "Year"

As you may have noticed above, selecting with `.loc` is also a way to re-order the columns of a dataframe.
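
For instance, passing every column name in a new order returns a re-ordered copy. This is a sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({"Name": ["Mary"], "Sex": ["F"], "Year": [1880]})

# The columns come back in the order the labels are listed
reordered = toy.loc[:, ["Year", "Name", "Sex"]]

print(list(reordered.columns))
```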

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material. In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, culling out fishy outliers, or analyzing subgroups of your data set. Note that compound expressions have to be grouped with parentheses. Example usage looks like `df[df['column name'] < 5]`.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
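
The sketch below shows a few of these operators applied to boolean Series. The frame `toy` and its values are made up for illustration and are not taken from the baby names data.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({
    "Name": ["Emily", "Jacob", "Hannah", "Zed"],
    "Sex": ["F", "M", "F", "M"],
    "Count": [25956, 34483, 23082, 7],
})

# Each comparison yields a boolean Series aligned with the rows of toy
is_female = toy["Sex"] == "F"
is_common = toy["Count"] > 1000

both = toy[is_female & is_common]     # AND: rows where both are True
either = toy[is_female | is_common]   # OR: rows where at least one is True
not_female = toy[~is_female]          # NOT: rows where is_female is False

# Written inline, each comparison needs its own parentheses because
# & binds more tightly than == and > in Python
inline = toy[(toy["Sex"] == "F") & (toy["Count"] > 1000)]

print(list(both["Name"]), list(either["Name"]), list(not_female["Name"]))
```

Naming the intermediate masks, as in `is_female`, is optional but tends to make compound filters easier to read.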

In the following we construct the DataFrame containing only babies born in the year 2000.

In [28]:
baby_2000 = baby_names[baby_names['Year'] == 2000]

### Question 5
Select the female names in the year 2000 (from the `baby_names` data) that have a count greater than 3000.

(If you use condition `p` & condition `q` to filter the dataframe, make sure to use `df[(df[p]) & (df[q])]`)

In [29]:
baby_2000_f = baby_2000[(baby_2000["Sex"] == "F") & (baby_2000["Count"] > 3000)]
# YOUR CODE HERE
#raise NotImplementedError()

In [30]:
baby_2000_f.head()

,Name,Sex,Count,Year
1332810,Emily,F,25956,2000
1332811,Hannah,F,23082,2000
1332812,Madison,F,19968,2000
1332813,Ashley,F,17997,2000
1332814,Sarah,F,17702,2000


In [31]:
assert len(baby_2000_f) == 113
assert baby_2000_f["Count"].sum() == 803955

## Data Aggregation (Grouping Data Frames)

### Question 6
To count the number of instances of a value in a `Series`, we can use the `value_counts()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) as `df["col_name"].value_counts()`. 

Count the number of different names for each sex in the year 2000. (You may use the `baby_2000` DataFrame created above.)

**Note:** *We are not computing the number of babies but instead the number of names (rows in the table) for each sex.*
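
To see what `value_counts()` returns, here is a minimal sketch on a made-up frame (`toy` and `counts` are hypothetical names):

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({
    "Name": ["Ava", "Liam", "Mia", "Noah", "Zoe"],
    "Sex": ["F", "M", "F", "M", "F"],
})

# Counts rows per distinct value; the result is a Series indexed by
# those values and sorted from most to least frequent
counts = toy["Sex"].value_counts()

print(counts)
```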

In [32]:
num_of_names_mf = baby_2000["Sex"].value_counts()
# YOUR CODE HERE
#raise NotImplementedError()

In [33]:
num_of_names_mf

F    17655
M    12117
Name: Sex, dtype: int64

In [34]:
assert num_of_names_mf["M"] == 12117
assert num_of_names_mf.sum() == len(baby_2000)

### Question 7a

A more versatile way to aggregate data is to use the `.groupby()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html). Find the sum of `Count` for each `Name` in the `baby_names` table (the dataframe with baby names for all years). You should use `df.groupby("col_name").sum()`. Your result should be a Pandas DataFrame indexed by `Name`.

**Note:** *In this question we are now computing the total number of registered babies from 1880-2018 with a given name.*

In [35]:
count_for_names = baby_names.groupby("Name").sum()
# YOUR CODE HERE
# raise NotImplementedError()

In [36]:
count_for_names.sort_values(by = "Count", ascending=False)[:5]

Name,Count,Year
James,5187679,541822
John,5146508,541822
Robert,4840228,535771
Michael,4384463,490699
Mary,4140840,525715


In [37]:
assert count_for_names.loc["Michael", "Count"] == 4384463
assert count_for_names.sort_values(by = "Count", ascending=False).index[0] == "James"

### Question 7b

What do the values in `Year` represent in the dataframe `count_for_names`? How come the column `Sex` is no longer present in the dataframe?

`Year` is the sum of all years in which a given name appears. Although this isn't particularly meaningful to us, it's how pandas applies `.sum()` to a `groupby()`: it sums every numeric column in the dataframe within each group of the specified column (`Name`).

`Sex` is no longer present in the dataframe because it's not a numeric column, so pandas left it out of the sums in the output above. We could keep sex by grouping by both `Name` and `Sex` in the groupby. That way, we would get a dataframe that has the total number of registered babies by both name and sex (for example, male Taylors and female Taylors would have two separate rows in such a dataframe, while they share one row in this dataframe).
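
Grouping by both columns can be sketched on a made-up frame as follows (`toy` and its numbers are hypothetical). The sketch selects `Count` before summing, which is a choice made here, not the lab's approach: it avoids the meaningless `Year` sum and does not depend on how the installed pandas version treats non-numeric columns, which is an assumption worth checking against your own version.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({
    "Name": ["Taylor", "Taylor", "Taylor", "Emma"],
    "Sex": ["F", "M", "F", "F"],
    "Count": [10, 4, 6, 20],
    "Year": [2000, 2000, 2001, 2000],
})

# One total per name: male and female Taylors share a row
by_name = toy.groupby("Name")["Count"].sum()

# One total per (name, sex) pair: Taylors are split into two rows
by_name_sex = toy.groupby(["Name", "Sex"])["Count"].sum()

print(by_name)
print(by_name_sex)
```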

### Question 7c

Find the sum of `Count` for each female name after year 1999 (`>1999`).


In [38]:
female_name_count = baby_names[(baby_names["Sex"] == "F") & (baby_names["Year"] > 1999)].groupby("Name").sum()
# YOUR CODE HERE
# raise NotImplementedError()

In [39]:
female_name_count.sort_values(by = "Count", ascending=False)[:5]

Name,Count,Year
Emma,358667,38171
Emily,332839,38171
Olivia,321581,38171
Isabella,306218,38171
Sophia,286197,38171


In [40]:
assert female_name_count.loc["Emily", "Count"] == 332839
assert female_name_count[:100].sum()["Count"] == 90759

## Thinking about sampling

### Question 8a
The data we used in the questions above is pulled from [here](https://www.ssa.gov/oact/babynames/limits.html) (we're using the link "National"). From your exploration of the data and the details provided by the Social Security Administration, is the data contained in the .zip file that we downloaded a sample? What population is it a sample of? How was it sampled?

*Your answer here*

### Question 8b
The [Baby Name Uniqueness Analyzer](https://datayze.com/name-uniqueness-analyzer.php), which was built using SSA data, tells you how likely a person with a given name is to meet someone else with the same name. Try it out with your name, your friends' names, your parents' names, etc., and answer the following questions:<br>
1. In a couple of sentences, describe how this tool was built. You don't have to talk about specific functions or coding approaches, but you should describe how you think the data was manipulated or subset to show the result.
1. Is the question that the tool aims to answer an example of a conditional probability question?
1. How might the results of some of your queries change if the SSA had provided data that was sampled differently?<br>

*Your answer here*

#### You are done! Remember to submit this lab on bCourses in both html and ipynb formats after clicking Kernel -> Restart & Run All.