In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

# Lab 04: Pandas Overview

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating dataframes

* Slicing data frames (i.e. selecting rows and columns)

* Filtering data (using boolean arrays)

In this lab you are going to use several pandas methods, such as `drop` and `loc`. You may press `shift+tab` on the method parameters to see the documentation for that method. 

**Hint:** If you are familiar with the `datascience` library used in Foundations of Data Science, the `datascience-to-pandas.ipynb` conversion notebook may serve as a useful guide. It can be found in the lessons folder.

To receive credit for a lab, answer all questions correctly and submit before the deadline.

**Due Date:** 

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

**Note**: The Pandas interface is notoriously confusing, and the documentation is not consistently great. Throughout the semester, you will have to search through Pandas documentation and experiment, but remember it is part of the learning experience and will help shape you as a data scientist.

Run the cell below, but **please** don't change it.

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# 1. Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a table in which each column has a type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).

The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for the pandas `DataFrame` class provides at least two syntaxes for creating a dataframe.

**Syntax 1:** You can create a data frame by specifying the columns and values using a dictionary as shown below. 

The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.

In [None]:
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
            'color': ['red', 'orange', 'yellow', 'pink']})
fruit_info

**Syntax 2:** You can also define a dataframe by specifying the rows like below. 

Each row corresponds to a distinct tuple, and the columns are specified separately.

In [None]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2
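
As a quick check that the two syntaxes describe the same table, here is a minimal sketch on a two-row frame. The names `by_columns`, `by_rows`, and `same_data` are made up for this illustration and are not part of the lab.

```python
import pandas as pd

# Syntax 1: dictionary mapping column names to lists of values.
by_columns = pd.DataFrame({'fruit': ['apple', 'orange'],
                           'color': ['red', 'orange']})

# Syntax 2: list of row tuples plus a separate list of column names.
by_rows = pd.DataFrame([('red', 'apple'), ('orange', 'orange')],
                       columns=['color', 'fruit'])

# The column order differs, so align the columns before comparing.
same_data = by_columns.equals(by_rows[['fruit', 'color']])
print(same_data)
```

Only the column order differs between the two frames, which is why the comparison selects the columns in a fixed order first.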

You can obtain the dimensions of a dataframe by using the shape attribute `dataframe.shape`.

In [None]:
fruit_info.shape

You can also convert the entire dataframe into a two-dimensional numpy array.

In [None]:
fruit_info.values
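
The following sketch shows how `shape` and `values` relate on a small frame. The frame `toy` and its contents are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'fruit': ['apple', 'banana', 'cherry'],
                    'color': ['red', 'yellow', 'red']})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
as_array = toy.values        # 2-D numpy array laid out like the table

print(n_rows, n_cols)
print(as_array.shape)
print(as_array[1, 0])        # row position 1, column position 0
```

The array is indexed by position only, so the column labels are not carried over. The pandas documentation also offers `DataFrame.to_numpy()` for the same conversion.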

<!-- BEGIN QUESTION -->

**Question 1.** For a DataFrame `df`, you can add a column with `df['new column name'] = ...` and assign a list or array of values to the column. Use `pd.Series` to add a column of integers called `rank1` to the `fruit_info` table. The column should contain 1, 2, 3, and 4, expressing your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty).

**Note:** To earn all the points for this question you **must** use `pd.Series`. 


In [None]:
...
fruit_info

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.** You can also add a column to `df` with `df.loc[:, 'new column name'] = ...`. As discussed in the lesson, the first parameter is for the rows and second is for columns. The `:` means change all rows and the `new column name` indicates the column you are modifying (or in this case, adding). 

Make a copy of the `fruit_info` dataframe named `fruit_info_copy` using the `.copy` method. Then add a column called `rank2` to the `fruit_info_copy` table which contains the same values in the same order as the `rank1` column.

**Note:** To earn all the points for this question you **must** use the `.copy` method and `.loc`.


In [None]:
...
fruit_info_copy

<!-- END QUESTION -->

**Question 3.** Use the `.drop` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) both the `rank1` and `rank2` columns you created in `fruit_info_copy` (make sure to use the `axis` parameter correctly). Save the output to an object named `fruit_info_original`.

**Note:** `.drop` does not change the table it is called on. It returns a new table with fewer columns or rows, unless you set the optional parameter `inplace=True`.

**Hint:** Look through the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to see how you can drop multiple columns of a Pandas dataframe at once using a list of column names.
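
To make the note about `inplace` concrete, here is a small sketch on an unrelated frame. The frame `toy` and its column names are hypothetical, and only a single column is dropped.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# drop returns a new frame; axis=1 means the label refers to a column.
without_c = toy.drop('c', axis=1)
print(without_c.columns.tolist())
print(toy.columns.tolist())        # toy still has all three columns

# With inplace=True the frame itself is modified and None is returned.
returned = toy.drop('c', axis=1, inplace=True)
print(returned, toy.columns.tolist())
```

Because the default call leaves `toy` untouched, forgetting to assign the result is a common source of "my column is still there" confusion.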


In [None]:
fruit_info_original = ...
fruit_info_original

In [None]:
grader.check("q3")

**Question 4.** Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info_copy` so they begin with capital letters. Set this new dataframe to `fruit_info_caps`.


In [None]:
fruit_info_caps = ...
fruit_info_caps

In [None]:
grader.check("q4")

# 2. Babynames

Now that we have learned the basics, let's move on to the `babynames` dataset. The `babynames` dataset contains a record of the given names of babies born in the United States each year.

First, run the following cells to build the dataframe `babynames`. The cell below reads the data from a CSV file into a dataframe. There should be a total of 890627 records.

In [None]:
babynames = pd.read_csv('data/baby_names.csv', index_col = 0)
len(babynames)

In [None]:
babynames.head()

# 3. Selecting Rows and Columns (Slicing)

## Selection Using Label/Index (using `.loc`)

### Column Selection 

To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage of `.loc` looks like `df.loc[rowname, colname]`. (Recall that the colon `:` means "everything.") For example, if we want the `color` column of the `fruit_info` dataframe, we would use `fruit_info.loc[:, 'color']`.

- You can also slice across columns. For example, `babynames.loc[:, 'Name':]` would select the column `Name` and all columns after `Name`.

- **Alternative:** While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` method, which takes on the form `df['colname']`.
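
The three forms of column selection described above can be sketched on a small frame. The frame `toy` is invented for illustration; its columns only loosely resemble those of `babynames`.

```python
import pandas as pd

toy = pd.DataFrame({'State': ['NC', 'NC', 'VA'],
                    'Year': [2019, 2020, 2019],
                    'Name': ['Ava', 'Liam', 'Emma'],
                    'Count': [10, 12, 9]})

one_col = toy.loc[:, 'Name']       # all rows, one column
from_year = toy.loc[:, 'Year':]    # 'Year' and every column after it
shorthand = toy['Name']            # bracket form of the first selection

print(from_year.columns.tolist())
print(one_col.equals(shorthand))
```

Here `from_year` keeps `Year`, `Name`, and `Count`, and the bracket form returns the same Series as the `.loc` form.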

### Row Selection

Similarly, if we want to select a row by its label, we can use the same `.loc` method. In this case, the "label" of each row refers to the index (i.e. primary key) of the dataframe.

**Example 1.**

In [None]:
babynames.loc[2:5, 'Name']

**Example 2.** Notice the difference between this method and the method in **Example 1.**

Just passing in `'Name'` returns a `Series`, while `['Name']` returns a `DataFrame`.

In [None]:
babynames.loc[2:5, ['Name']]
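
The difference in return type can be checked directly. This is a minimal sketch on a hypothetical three-row frame named `toy`.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Ava', 'Liam', 'Emma'], 'Count': [10, 12, 9]})

as_series = toy.loc[0:1, 'Name']     # single label: Series
as_frame = toy.loc[0:1, ['Name']]    # list of labels: DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has a one-dimensional shape `(2,)`, while the DataFrame keeps two dimensions with shape `(2, 1)`.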

**Note:** `.loc` actually uses the Pandas row index rather than row id/position of rows in the dataframe to perform the selection. Also, notice that if you write `2:5` with `.loc[]`, contrary to normal Python slicing functionality, the end index is included, so you get the row with index 5.

### Selection using Integer location (using `.iloc`)

Another pandas feature is `.iloc[]` which lets you slice the dataframe by row position and column position instead of by row index and column label (which is the case for `.loc[]`). This is really the main difference between the two functions and it is **important** that you remember the difference and why you might want to use one over the other. In addition, with `.iloc[]`, the end index is **not** included, like with normal Python slicing.

**Note:** As a mnemonic, remember that the i in `.iloc` means "integer". 

Below, we have sorted the `babynames` dataframe. Notice how the **position** of a row is not necessarily equal to the **index** of a row. For example, the first row is not necessarily the row associated with index 0. This distinction is important in understanding the difference between `.loc[]` and `.iloc[]`.

**Example 3.**

In [None]:
sorted_babynames = babynames.sort_values(by = ['Name'])
sorted_babynames.head()

**Example 4.** Here is an example of how we would get the 2nd, 3rd, and 4th rows with only the `Name` column of the `babynames` dataframe using both `.iloc[]` and `.loc[]`. Observe the difference, especially after sorting `babynames` by name.

In [None]:
sorted_babynames.iloc[1:4, 3]

Notice that using `.loc[]` with 1:4 gives different results, since it selects using the **index**.

In [None]:
sorted_babynames.loc[1:4, 'Name']
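
The same contrast is easier to verify by hand on a tiny frame. In this sketch the frame `toy` and its four names are invented; sorting leaves the index labels attached to their original rows.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Cy', 'Dana', 'Al', 'Bo']})   # index 0, 1, 2, 3
by_name = toy.sort_values(by='Name')                        # rows reordered

print(by_name.index.tolist())

# iloc counts positions: the first two rows of the sorted frame (end excluded).
first_two_positions = by_name.iloc[0:2, 0].tolist()
# loc matches index labels: from label 0 through label 1 (end included).
labels_zero_to_one = by_name.loc[0:1, 'Name'].tolist()

print(first_two_positions)
print(labels_zero_to_one)
```

After sorting, the index reads 2, 3, 0, 1, so the position-based slice returns `Al` and `Bo` while the label-based slice returns `Cy` and `Dana`.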

**Example 5.** Lastly, we can change the index of a dataframe using the `set_index` method. We change the index from $0,1,2,\ldots$ to the `Name` column.

In [None]:
df = babynames[:5].set_index("Name") 
df

**Example 6.** However, if we still want to access rows by location we will need to use the integer location (`.iloc`) accessor.

**Note:** `df.loc[2:5, 'Year']` no longer works because the index labels are now names rather than integers, but we can still select by integer position.

In [None]:
df.iloc[1:4, 2:3]
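
Once names are the index, `.loc` takes names as row labels. The sketch below uses a hypothetical three-row frame `toy` to show label and position selection side by side.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Ava', 'Liam', 'Emma'],
                    'Year': [2019, 2020, 2019],
                    'Count': [10, 12, 9]}).set_index('Name')

# Labels are now names, so loc slices by name (the end label is still included).
by_label = toy.loc['Ava':'Liam', 'Count'].tolist()
# iloc is unaffected by the index change and still counts positions.
by_position = toy.iloc[0:2, 1].tolist()

print(toy.columns.tolist())
print(by_label, by_position)
```

`Name` is no longer a regular column after `set_index`, so column position 1 refers to `Count` and both selections return the same two values.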

<!-- BEGIN QUESTION -->

**Question 5.** List the names of the states that are in the `babynames` data set. Do this using a `pandas` method.

**Note:** To earn all the points for the question you must do it programmatically.

In [None]:
babynames.State.unique()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 6.** Selecting multiple columns is easy.  You just need to supply a list of column names.  Use `.loc` to select the `Name` and `Year` **in that order** from the `babynames` table.

**Note:** To earn all the points for this question you **must** use the `.loc` method. 


In [None]:
name_and_year = ...
name_and_year[:5]

<!-- END QUESTION -->

**Note:** `.loc[]` can be used to re-order the columns within a dataframe.
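
As a minimal sketch of that note, the list passed to `.loc` decides both which columns are kept and the order they appear in. The frame `toy` and its column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# The list controls both which columns are kept and their order.
reordered = toy.loc[:, ['c', 'a']]

print(reordered.columns.tolist())
print(toy.columns.tolist())   # the original frame keeps its column order
```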

# 4. Filtering Data

## Filtering with boolean arrays

Filtering is the process of removing unwanted material. In your quest for cleaner data, you will undoubtedly filter your data at some point, whether to clear up cases with missing values, to cull fishy outliers, or to analyze subgroups of your data set. Note that compound expressions have to be grouped with parentheses. Example usage looks like `df[df['column name'] < 5]`.

For your reference, some commonly used comparison operators are given below.

Symbol   | Usage      | Meaning 
------   | ---------- | -------------------------------------
$==$     | a == b     | Does a equal b?
$\lt =$  | a <= b     | Is a less than or equal to b?
$\gt =$  | a >= b     | Is a greater than or equal to b?
$\lt$    | a < b      | Is a less than b?
$\gt$    | a > b      | Is a greater than b?
~        | ~p         | Returns negation of p
&#124;   | p &#124; q | p OR q
&        | p & q      | p AND q
^        | p ^ q      | p XOR q (exclusive or)

In the following we construct the DataFrame containing only names registered in North Carolina.

In [None]:
nc = babynames[babynames['State'] == 'NC']
nc.head()
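
The comparison inside the brackets produces a boolean Series with one entry per row, and several such Series can be combined with the operators from the table. The sketch below uses an invented four-row frame `toy`.

```python
import pandas as pd

toy = pd.DataFrame({'State': ['NC', 'NC', 'VA', 'VA'],
                    'Count': [3, 8, 5, 9]})

mask = toy['State'] == 'NC'    # boolean Series, one entry per row
print(mask.tolist())

# Each comparison gets its own parentheses because & and | bind
# more tightly than == and > in Python.
both = toy[(toy['State'] == 'VA') & (toy['Count'] > 6)]
either = toy[(toy['State'] == 'NC') | (toy['Count'] > 6)]
neither = toy[~((toy['State'] == 'NC') | (toy['Count'] > 6))]

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```

Only row 3 satisfies both conditions, rows 0, 1, and 3 satisfy at least one, and row 2 satisfies neither.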

**Question 7.** To count the number of instances of each unique value in a Series, we can use the `value_counts()` method as `df['col_name'].value_counts()`. Count the number of different names for each Year in NC (North Carolina).

**Note:** We are **not** computing the number of babies but instead the number of names (rows in the table) for each year.
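
To see what `value_counts()` returns, here is a small sketch on a standalone Series. The Series `years` and its values are made up and unrelated to the `babynames` data.

```python
import pandas as pd

years = pd.Series([2019, 2020, 2019, 2019, 2021, 2020])

# The result is a Series indexed by the unique values,
# ordered from most frequent to least frequent by default.
counts = years.value_counts()

print(counts.loc[2019], counts.loc[2020], counts.loc[2021])
print(counts.index[0])
```

Each entry counts rows, not any quantity stored in another column, which matches the distinction drawn in the note above.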


In [None]:
num_of_names_per_year = ...
num_of_names_per_year.head()

In [None]:
grader.check("q7")

**Question 8.** Count the number of different names that were given to male and female babies in NC (North Carolina).


In [None]:
num_of_names_per_sex = ...
num_of_names_per_sex

In [None]:
grader.check("q8")

**Question 9.** Using a boolean array, select the names in Year 2019 from the `nc` table that have at least 500 counts. Keep all columns from the original `nc` dataframe.

**Hint:** Any time you combine two boolean conditions `p` and `q` (for example, `df['col'] > 5`) with `&` to filter a dataframe, wrap each condition in parentheses: `df[(p) & (q)]` or `df.loc[(p) & (q)]`.

**Note:** Both slicing and `.loc` will achieve the same result; `.loc` is typically faster in production. You are free to use whichever one you would like.


In [None]:
result = ...
result.head()

In [None]:
grader.check("q9")

In [None]:
all(i >= 500 for i in result['Count'].values)

<!-- BEGIN QUESTION -->

**Question 10.** Some names gain or lose popularity because of cultural phenomena, such as a political figure coming to power or a successful athlete or entertainer reaching the prime years of their career. 

Below, we plot the popularity of the name Jordan in North Carolina over time. What do you notice about this plot? What might be the cause of the steep drop?


In [None]:
name = 'Jordan'
state = 'NC'

male_baby = babynames[(babynames['Name'] == name) & (babynames['State'] == state) & (babynames['Sex'] == 'M')]
female_baby = babynames[(babynames['Name'] == name) & (babynames['State'] == state) & (babynames['Sex'] == 'F')]

plt.rcParams["figure.figsize"] = (18,7)
plt.plot(male_baby['Year'], male_baby['Count'], 'b', label = 'Male')
plt.plot(female_baby['Year'], female_baby['Count'], 'r', label = 'Female')
plt.title(f'Popularity of {name} Over Time')
plt.xticks(np.arange(1940, 2025, step = 10))
plt.xlabel('Year')
plt.ylabel('Count')
plt.legend();

<!-- END QUESTION -->

**Solution:** Answers will vary

<!-- BEGIN QUESTION -->

**Question 11.** Another recent cultural phenomenon is the "Karen" meme. Read the NY Times article [A Brief History of Karen](https://ncssm.instructure.com/courses/6050/pages/a-brief-history-of-karen). Then use the `babynames` data set to investigate the change in popularity of the name Karen. 

Below, we plot the popularity of the name Karen in North Carolina over time. What do you notice about this plot? When did the first drop in popularity occur? What might be the cause of the steep drop? Look at plots from other states in the data set and compare the results to North Carolina. Do you think the change in popularity is restricted to the Southeastern part of the US? What other information would you like to know?

Finally, write 2-3 paragraphs based on your findings from the questions listed above. Be sure to mention any information you may have gotten from other sources (make sure you cite your sources). 


In [None]:
name = 'Karen'
state = 'NC'

karen = babynames[(babynames['Name'] == name) & (babynames['State'] == state) & (babynames['Sex'] == 'F')]

plt.rcParams["figure.figsize"] = (18,7)
plt.plot(karen['Year'], karen['Count'], 'r')
plt.title(f'Popularity of {name} Over Time')
plt.xticks(np.arange(1940, 2025, step = 5))
plt.xlabel('Year')
plt.ylabel('Count');

<!-- END QUESTION -->

**Solution:** Answers will vary

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Alternatively, find the .zip file on the left side of the screen, right-click it, and select **Download**. Submit this .zip file to Gradescope through the assignment in Canvas for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)