Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# Lab 2: Pandas Overview

**This lab was distributed Monday 9/3/2018 and should be completed by Friday 9/7/2018 at 11:59PM.**

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating dataframes
* Slicing data frames (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)
* Data Aggregation/Grouping dataframes

In this lab, you are going to use several pandas methods and indexers like `drop()`, `.loc[]`, and `groupby()`. You may press `shift+tab` on the method parameters to see the documentation for that method.

## Setup

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a two-dimensional labeled data structure with columns of potentially different types.

**Method 1:** You can create a data frame by specifying the columns and values as shown below.

Notice the syntax: you're passing a dictionary into the DataFrame. The keys become the column names (e.g. `'fruit'`), and the values are lists (e.g. `['apple', ...]`).

In [2]:
fruit_info = pd.DataFrame(
    data={'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink']
          })
fruit_info

|   | fruit | color |
| - | ----- | ----- |
| 0 | apple | red |
| 1 | orange | orange |
| 2 | banana | yellow |
| 3 | raspberry | pink |


**Method 2:** You can also define a dataframe by specifying the rows like below.

Here, you're passing in tuples for each row of data (e.g. `("red", "apple")`) and specifying the column names separately.

In [3]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2

|   | color | fruit |
| - | ----- | ----- |
| 0 | red | apple |
| 1 | orange | orange |
| 2 | yellow | banana |
| 3 | pink | raspberry |


**Other methods**: Usually you won't be creating data frames in such a manual way. You'll often be loading dataframes from other file types, for example comma-separated values (CSV) files. More on that later.
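As a rough preview of loading from CSV, here is a minimal sketch. The CSV text and the name `fruit_from_csv` are made up for illustration; with a real file you would pass its path to `pd.read_csv` instead of an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file on disk.
csv_text = "fruit,color\napple,red\nbanana,yellow\n"

# pd.read_csv accepts a file path or any file-like object.
# The first line of the CSV is used as the column names by default.
fruit_from_csv = pd.read_csv(io.StringIO(csv_text))
print(fruit_from_csv)
```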

You can obtain the dimensions of a dataframe by using the `shape` attribute: `dataframe.shape`.

In [None]:
(num_rows, num_columns) = fruit_info.shape
num_rows, num_columns

### Question 1

You can add a column with `dataframe['new column name'] = [data]`. Add a column called `rank` to the `fruit_info` table which contains 1, 2, 3, or 4 based on your personal preference ordering for each fruit.

(note you'll need to comment out or delete the `NotImplementedError()`)
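The column-assignment syntax can be sketched on a small, hypothetical `pets` table (not part of the lab data). The list on the right-hand side needs one value per row.

```python
import pandas as pd

# Hypothetical table, unrelated to the lab's fruit_info.
pets = pd.DataFrame({'pet': ['cat', 'dog', 'fish']})

# Assigning a list to a new label creates the column.
# The list length must match the number of rows.
pets['legs'] = [4, 4, 0]
print(pets)
```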

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
fruit_info

In [None]:
assert fruit_info["rank"].dtype == np.dtype('int64')

### Question 2

Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) the `rank` column you created, and store the result in `fruit_info_original`. Some notes:

* You'll need to look up `drop` to figure out the right syntax.
* Make sure to use the `axis` parameter correctly
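To illustrate what the `axis` parameter does, here is a small sketch on a hypothetical `pets` table (the table and column names are made up). By default `drop` looks for the label in the row index, so dropping a column requires `axis=1`.

```python
import pandas as pd

# Hypothetical table, unrelated to the lab's fruit_info.
pets = pd.DataFrame({'pet': ['cat', 'dog'], 'legs': [4, 4], 'age': [3, 7]})

# axis=1 tells drop to look for the label among the columns, not the rows.
pets_without_age = pets.drop('age', axis=1)

print(pets_without_age.columns.tolist())
print(pets.columns.tolist())  # the original dataframe still has 'age'
```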

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
fruit_info_original

In [None]:
assert fruit_info_original.shape[1] == 2

### Question 3

Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info_original` so they begin with a capital letter. Set the `inplace` parameter correctly to change the `fruit_info_original` dataframe. (hint: in Question 2, `drop` creates and returns a new dataframe instead of changing `fruit_info` because `inplace` by default is `False`)
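The effect of `inplace` can be sketched on a hypothetical `pets` table (the names here are illustrative only). With the default `inplace=False` a new dataframe comes back; with `inplace=True` the existing one is changed and the call returns `None`.

```python
import pandas as pd

# Hypothetical table, unrelated to the lab's fruit_info_original.
pets = pd.DataFrame({'pet': ['cat', 'dog'], 'legs': [4, 4]})

# Default (inplace=False): rename returns a new dataframe, pets is unchanged.
renamed = pets.rename(columns={'pet': 'animal'})

# inplace=True: pets itself is modified and the call returns None.
returned = pets.rename(columns={'legs': 'num_legs'}, inplace=True)

print(renamed.columns.tolist())
print(pets.columns.tolist())
```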

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
fruit_info_original

In [5]:
assert fruit_info_original.columns[0] == 'Color' # the column number might be different for you


## Babyname datasets
Now that we have learned the basics, we'll spend some time working with the babynames dataset. Let's clean and wrangle the following data frames for the remainder of the lab.

First, let's run the following cells to build the dataframe.
They download the data from the web and extract the data specific to California. There should be 367931 records in total.

### `fetch_and_cache` Helper

The following function downloads and caches data in the `data/` directory and returns the `Path` to the downloaded file.

In [11]:
def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a URL and return the path to the cached file.
    
    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded 
    
    return: The pathlib.Path object representing the file.
    """
    import requests
    from pathlib import Path
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir/Path(file)
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time 
        birth_time = time.ctime(file_path.stat().st_ctime)
        print("Using cached version downloaded:", birth_time)
    return file_path

Below we use `fetch_and_cache` to download the `namesbystate.zip` file.

**This might take a little while! Consider stretching.**

In [12]:
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'
namesbystate_path = fetch_and_cache(data_url, 'namesbystate.zip')


The following cell builds the final full `baby_names` DataFrame. 

In [None]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']

def load_dataframe_from_zip(zf, f):
    with zf.open(f) as fh: 
        return pd.read_csv(fh, header=None, names=field_names)

# List comprehension
states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x:x.filename) 
    if f.filename.endswith('.TXT')
]

baby_names = pd.concat(states).reset_index(drop=True)

In [None]:
baby_names.head()

In [None]:
len(baby_names)

## Slicing Data Frames - selecting rows and columns


### Selection Using Label

**Column Selection** 
To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` [indexer](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage looks like `frame.loc[rowname, colname]`. (Recall that the colon `:` means "everything".) For example, if we want the `color` column of the `ex` data frame, we would use `ex.loc[:, 'color']`.

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` selects the `Name` column and all the columns after it.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `frame['colname']`.
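The column-selection forms described above can be compared side by side. This is a sketch on a hypothetical `ex` data frame whose contents are made up for illustration.

```python
import pandas as pd

# Hypothetical data frame standing in for `ex`.
ex = pd.DataFrame({
    'fruit': ['apple', 'orange', 'banana'],
    'color': ['red', 'orange', 'yellow'],
    'price': [1.0, 0.5, 0.25],
})

# All rows (:) of the 'color' column; the result is a Series.
color_loc = ex.loc[:, 'color']

# The shorter [] form selects the same column.
color_brackets = ex['color']

# A label slice over columns: 'color' and every column after it.
from_color = ex.loc[:, 'color':]

print(color_loc.equals(color_brackets))
print(from_color.columns.tolist())
```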

**Row Selection**
Similarly, if we want to select a row by its label, we can use the same `.loc` indexer. In this case, the "label" of each row refers to the index (i.e., primary key) of the dataframe.

In [None]:
#Example:
baby_names.loc[2:5, 'Name']

In [None]:
#Example:  Notice the difference between these two methods
baby_names.loc[2:5, ['Name']]

`.loc` uses the index labels rather than the row positions to perform the selection. It is only a coincidence that the previous example resembles array slicing syntax: the index labels there happen to be the integers 0, 1, 2, and so on.

We can always use [`.iloc`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) to slice the dataframe by row position and column position instead.

See the following example:

In [None]:
#Example: We change the index from 0,1,2... to the Name column
df = baby_names[:5].set_index("Name") 
df

We can now lookup rows by name directly:

In [None]:
df.loc[['Mary', 'Anna'], :]

However, if we want to access rows by location we will need to use the integer loc (`iloc`) accessor:

In [None]:
# Example:
# df.loc[2:5, "Year"] would fail here: the index now holds names, not integers
df.iloc[1:4, 2:3]
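One further difference is worth keeping in mind: a label slice with `.loc` includes both endpoints, while a position slice with `.iloc` excludes the stop position, like ordinary Python slicing. The sketch below uses a small, hypothetical `toy` frame with made-up values.

```python
import pandas as pd

# Hypothetical frame indexed by name instead of 0, 1, 2, ...
toy = pd.DataFrame(
    {'Year': [1910, 1911, 1912, 1913]},
    index=['Mary', 'Helen', 'Dorothy', 'Margaret'],
)

# Label slice: both 'Helen' and 'Dorothy' are included.
by_label = toy.loc['Helen':'Dorothy', 'Year']

# Position slice: rows 1 and 2 are included, row 3 is not.
by_position = toy.iloc[1:3, 0]

print(by_label.tolist())
print(by_position.tolist())
```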

### Question 4

Selecting multiple columns is easy. You just need to supply a list of column names. Select the `Name` and `Year` columns **in that order** from the `baby_names` table.

In [None]:
name_and_year = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
name_and_year[:5]

In [None]:
assert name_and_year.shape == (5933561, 2)

As you may have noticed above, the `.loc` indexer is also a way to re-order the columns within a dataframe.

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material. In your quest for cleaner data, you will undoubtedly filter your data at some point, whether to clear out cases with missing values, cull fishy outliers, or analyze subgroups of your data set. Example usage looks like `df[df['column name'] < 5]`. Note that each comparison in a compound expression has to be wrapped in parentheses.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
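A few of these operators can be seen in action on a small, hypothetical `toy` table (the names and counts are made up). The parentheses around each comparison matter because `&` binds more tightly than `==` or `>` in Python.

```python
import pandas as pd

# Hypothetical table for illustration only.
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'Sex': ['F', 'M', 'M', 'F'],
    'Count': [12, 3, 8, 5],
})

# A comparison produces a boolean Series with one True/False per row.
is_female = toy['Sex'] == 'F'

# Indexing with the boolean Series keeps only the rows marked True.
females = toy[is_female]

# Compound condition: wrap each comparison in parentheses before combining with &.
female_and_common = toy[(toy['Sex'] == 'F') & (toy['Count'] > 10)]

# ~ negates the condition.
not_female = toy[~is_female]

print(females['Name'].tolist())
print(female_and_common['Name'].tolist())
print(not_female['Name'].tolist())
```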

In the following cell we construct the DataFrame containing only names registered in California.

In [None]:
baby_ca = baby_names[baby_names['State'] == "CA"]

### Question 5a
Select the names in Year 2000 (for all baby_names) that have counts larger than 3000. What do you notice?

(If you use `p & q` to filter the dataframe, make sure to use `df[(p) & (q)]`.)

In [None]:
baby_ca_2000 = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
baby_ca_2000

## Data Aggregation (Grouping Data Frames)

### Question 6a
To count the number of instances of a value in a `Series`, we can use the `value_counts()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) as `df["col_name"].value_counts()`. Count the number of different names for each Year in `CA` (California). (You may use the `baby_ca` DataFrame created above.)

**Note:** *We are not computing the number of babies but instead the number of names (rows in the table) for each year.*
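As a quick sketch of what `value_counts()` returns, consider a hypothetical Series of years (values made up). The result is a new Series whose index holds the distinct values and whose values are the number of rows containing each one, sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical column of years, one entry per row of some table.
years = pd.Series([2000, 2001, 2000, 2000, 2001, 2002])

# Index: distinct years. Values: how many rows have that year.
counts = years.value_counts()
print(counts)
```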

In [None]:
num_of_names_per_year = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
num_of_names_per_year[:5]

### Question 6b
Count the number of different names for each gender in `CA`. Does the result help explain the findings in Question 5?

In [None]:
num_of_names_per_gender = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
num_of_names_per_gender

### Question 7a

A more versatile way to aggregate data is to use the `.groupby()` [function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html). Find the sum of `Count` for each `Name` in the `baby_ca` table. You should use `df.groupby("col_name").sum()`. Your result should be a Pandas Series.

**Note:** *In this question we are now computing the number of registered babies with a given name.*
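The grouping pattern can be sketched on a small, hypothetical `toy` table (names and counts are made up). Selecting a single column from the grouped object before calling `sum()` is one way to get a Series back rather than a DataFrame.

```python
import pandas as pd

# Hypothetical table: the same name can appear in several years.
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Ann', 'Bob', 'Cy'],
    'Year': [2000, 2000, 2001, 2001, 2001],
    'Count': [5, 2, 7, 1, 4],
})

# Group rows that share a Name, pick the Count column, and add it up per group.
totals = toy.groupby('Name')['Count'].sum()
print(totals)
```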

In [None]:
count_for_names = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
count_for_names.sort_values(ascending=False)[:5]

### Question 7b

Find the sum of `Count` for each female name after year 1999 (`>1999`) in California.


In [None]:
female_name_count = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
female_name_count.sort_values(ascending=False)[:5]

#### You are done! Remember to submit this lab on bCourses in both html and ipynb formats.