# Run the cell below

To run a code cell (i.e., execute the Python code inside a Jupyter notebook), you can click the play button (▶|) on the ribbon underneath the name of the notebook, or press `Shift` + `Return`.

Before you begin, run the code cell below.

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("dsc201_001_003_a7.ipynb")

**Name:** 

**Section:** 

**Date:**

## This Week's Assignment

In this week's assignment, you'll learn how to:

- load data into a `pandas` `DataFrame`.

- access, manipulate and slice data that is stored in a `pandas` `DataFrame`.

Let's get started!

**Note**: The Pandas interface is notoriously confusing, and the documentation is not consistently great. Throughout the semester, you will have to search through the Pandas documentation, experiment, and use tools like ChatGPT. Remember that this is part of the learning experience and will help shape you as a data scientist.

**Question 1.** Use the `import` statement to import `pandas` as `pd` and `numpy` as `np`.

In [None]:
# Import the pandas and numpy modules using the appropriate aliases
import ... as ...
import ... as ...

In [None]:
grader.check("q1")

## Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a table in which each column has a type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).

The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for the pandas `DataFrame` class provides at least two syntaxes to create a data frame.

**Syntax 1:** You can create a data frame by specifying the columns and values using a dictionary as shown below. 

The keys of the **dictionary** are the column names, and the values of the dictionary are lists containing the row entries.

In [None]:
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
            'color': ['red', 'orange', 'yellow', 'pink']})
fruit_info

**Syntax 2:** You can also define a dataframe by specifying the rows like below. 

Each row corresponds to a distinct **tuple**, and the columns are specified separately.

In [None]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2

**Question 2.** For a `DataFrame` named `df`, you can add a column with 

```
df['new column name'] = ...
``` 

and assign a list or array of values to the column. Add a column of integers containing 1, 2, 3, and 4 called `rank1` to the `fruit_info` table which expresses your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty).

In [None]:
fruit_info[...] = [...]
fruit_info

In [None]:
grader.check("q2")

Let's make a copy of the `fruit_info` dataframe named `fruit_info_copy`. Making a copy of a dataframe in Pandas is easy. You can use the copy method of the data frame object.

Run the cell below

In [None]:
fruit_info_copy = fruit_info.copy()
fruit_info_copy

So why would we use `.copy()` instead of simply assigning `fruit_info` to a new name with `fruit_info_copy = fruit_info`? Plain assignment does not create a new dataframe. Both names refer to the same object, so a change made through one name also shows up under the other. When working with complex code or in a team environment, this can lead to unintended side effects if someone later modifies the "new" dataframe and unintentionally changes the original. Using `.copy()` creates a separate dataframe, which removes that risk. It also makes your code more readable by stating explicitly that a copy is intended, which is especially useful when reviewing code or sharing it with others.
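
The difference between the two approaches can be seen in a small sketch. The names `toy`, `alias`, and `separate` are made up for illustration and are not part of the assignment.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate assignment versus .copy()
toy = pd.DataFrame({'fruit': ['apple', 'orange'], 'color': ['red', 'orange']})

alias = toy            # a second name for the same DataFrame object
separate = toy.copy()  # an independent copy of the data

# Adding a column through the alias also changes toy, because they are one object
alias['rank'] = [1, 2]

print(alias is toy)                # True
print('rank' in toy.columns)       # True
print('rank' in separate.columns)  # False
```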

We can also add a column to a dataframe that is a `Series`. Remember, a `Series` is a one-dimensional data structure that can hold various types of data, including integers, floats, strings, and more. It is similar to a column in a spreadsheet or a single column of data in a dataframe. Each element in a `Series` has a label called an index, which is used to access elements in the `Series`. 

**Note:** By default, a `Series` will have integer index labels, starting with 0.

Run the cell below to create a series named `rank2`.

In [None]:
rank2 = pd.Series([4, 3, 2, 1])
rank2

In the output, the first column shows the index labels and the second column shows the values.

Now, let's add the series to the `fruit_info_copy` dataframe.

Run the cell below.

In [None]:
fruit_info_copy['rank2'] = rank2
fruit_info_copy
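
One detail worth knowing: when a `Series` is assigned as a column, pandas matches values to rows by index label, not by position. The sketch below uses made-up data (`toy` and `scores` are hypothetical names) with a deliberately shuffled index to make this visible.

```python
import pandas as pd

# Hypothetical toy data; the default index is 0, 1, 2
toy = pd.DataFrame({'fruit': ['apple', 'orange', 'banana']})

# A Series whose index labels are listed in a different order than toy's index
scores = pd.Series([30, 10, 20], index=[2, 0, 1])

# Values are matched to rows by index label, not by their order in the Series
toy['score'] = scores

print(toy['score'].tolist())  # [10, 20, 30]
```

In the cell above, `rank2` has the default index 0 through 3, which matches the index of `fruit_info_copy`, so the values land in the order they were written.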

**Question 3.** Use the `.drop` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) the `rank2` column that was added to `fruit_info_copy` (make sure to use the `axis` parameter correctly). Save the output to an object named `fruit_info_original`.

**Note:** `.drop` does not change a table, but instead returns a new table with fewer columns or rows unless you set the optional `inplace` parameter.
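
To see that `.drop` leaves the original table alone, here is a small sketch that drops a row (not a column) from made-up data. The names `toy` and `fewer_rows` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the default index is 0, 1, 2
toy = pd.DataFrame({'fruit': ['apple', 'orange', 'banana']})

# Drop the row whose index label is 1; the result is a new DataFrame
fewer_rows = toy.drop(1)

print(len(toy), len(fewer_rows))  # 3 2
print(list(fewer_rows.index))     # [0, 2]
```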

**Hint:** Look through the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to see how you can drop a column from a Pandas dataframe or send the following message to ChatGPT $-$ _"how can I drop a column from a pandas dataframe"_.

In [None]:
fruit_info_original = ...
fruit_info_original

In [None]:
grader.check("q3")

**Question 4.** Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info_original` so they begin with capital letters. Save this new dataframe to `fruit_info_caps`.

**Note:** Feel free to use multiple lines of code.

In [None]:
...
fruit_info_caps = ...

fruit_info_caps

In [None]:
grader.check("q4")

## Babynames

Now that we have learned the basics, let's move on to the `babynames` dataset. The `babynames` dataset contains a record of the given names of babies born in the United States each year.

First let's run the following cell to build the dataframe `babynames`. The cell below loads the data into a dataframe. There should be a total of 890627 records.

In [None]:
babynames = pd.read_csv('data/baby_names.csv', index_col = 0)
len(babynames)

Next, let's take a look at the first 5 observations.

In [None]:
babynames.head()

<!-- BEGIN QUESTION -->

**Question 5.** Based on the output in the previous cell, what do you think each column represents?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Selecting Rows and Columns (Slicing)

### Selection Using Label & Index (using `.loc`)

#### Column Selection 

To select a column of a `DataFrame` by column label, the safest way is to use the `.loc` [indexer](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage of `.loc` looks like `df.loc[rowname, colname]`. (Reminder: the colon `:` means "everything.") For example, if we want the `color` column of the `fruit_info` dataframe, we would use `fruit_info.loc[:, 'color']`.

- You can also slice across columns. For example, `babynames.loc[:, 'Name':]` would select the column `Name` and all columns after `Name`.

- **Alternative:** While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `df['colname']`.
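
The three column-selection forms described above can be compared on a small table. This is a minimal sketch on made-up data; `toy` and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only to compare the selection forms
toy = pd.DataFrame({'fruit': ['apple', 'orange', 'banana'],
                    'color': ['red', 'orange', 'yellow'],
                    'price': [1.2, 0.8, 0.5]})

by_loc = toy.loc[:, 'color']          # a Series, selected by label with .loc
by_brackets = toy['color']            # the same Series, using the shorter [] form
from_color_on = toy.loc[:, 'color':]  # a DataFrame: 'color' and every column after it

print(by_loc.equals(by_brackets))   # True
print(list(from_color_on.columns))  # ['color', 'price']
```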

#### Row Selection

Similarly, if we want to select a row by its label, we can use the same `.loc` method. In this case, the "label" of each row refers to the index (i.e. primary key) of the dataframe.

**Example 1.**

In [None]:
# Select the first 5 values in the babynames dataframe from the Name column
# Notice that the range 0:4 starts at the first row (index 0) and includes
# the fifth row (index 4).

# This command returns a Series.

babynames.loc[0:4, 'Name']

**Example 2.**  

Notice the difference between this method and the method in **Example 1.**

Just passing in `'Name'` returns a `Series`, while `['Name']` returns a `DataFrame`.

In [None]:
# Select the first 5 values in the babynames dataframe from the Name column
# Notice that the range 0:4 starts at the first row (index 0) and includes
# the fifth row (index 4).

# This command returns a DataFrame.

babynames.loc[0:4, ['Name']]

**Note:** `.loc` actually uses the Pandas row index rather than row id/position of rows in the dataframe to perform the selection. Also, notice that if you write `0:4` with `.loc[]`, contrary to normal Python slicing functionality, the end index is included, so you get the row with index 4.
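
The contrast with ordinary Python slicing is easy to check on a short list. This is a sketch on made-up data; `letters` and `toy` are hypothetical names.

```python
import pandas as pd

letters = ['a', 'b', 'c', 'd', 'e', 'f']
toy = pd.DataFrame({'letter': letters})  # default index 0, 1, ..., 5

python_slice = letters[0:4]                  # end position 4 is excluded
loc_slice = toy.loc[0:4, 'letter'].tolist()  # end label 4 is included

print(python_slice)  # ['a', 'b', 'c', 'd']
print(loc_slice)     # ['a', 'b', 'c', 'd', 'e']
```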

### Selection using Integer location (using `.iloc`)

Another Pandas feature is `.iloc[]` which lets you slice the dataframe by row position and column position instead of by row index and column label (which is the case for `.loc[]`). This is really the main difference between the two functions and it is **important** that you remember the difference and why you might want to use one over the other. In addition, with `.iloc[]`, the end index is **not** included, like with normal Python slicing.

**Note:** As a mnemonic, remember that the i in `.iloc` means "integer". 

Below, we have sorted the `babynames` dataframe. Notice how the **position** of a row is not necessarily equal to the **index** (the values in **bold** in the far left column) of a row. For example, the first row is not necessarily the row associated with index 0. This distinction is important in understanding the difference between `.loc[]` and `.iloc[]`.

**Example 3.**

In [None]:
sorted_babynames = babynames.sort_values(by = ['Name'])
sorted_babynames.head()

**Example 4.** Here is an example of how we would get the 2nd, 3rd, and 4th rows with only the `Name` column of the sorted `babynames` dataframe using both `.iloc[]` and `.loc[]`. Observe the difference, especially after sorting `babynames` by name.

In [None]:
sorted_babynames.iloc[1:4, 3]

Notice that using `.loc[]` with `1:4` gives different results, since it selects using the **index**.

In [None]:
sorted_babynames.loc[1:4, 'Name']
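
The same contrast shows up on a table small enough to check by hand. This is a sketch on made-up data; `toy` and `sorted_toy` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data; the default index is 0, 1, 2, 3
toy = pd.DataFrame({'Name': ['Zoe', 'Adam', 'Mia', 'Ben']})

# Sorting reorders the rows, but each row keeps its original index label
sorted_toy = toy.sort_values(by='Name')

first_by_position = sorted_toy.iloc[0, 0]     # row at position 0
row_labeled_zero = sorted_toy.loc[0, 'Name']  # row whose index label is 0

print(list(sorted_toy.index))  # [1, 3, 2, 0]
print(first_by_position)       # Adam
print(row_labeled_zero)        # Zoe
```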

**Example 5.** Lastly, we can change the index of a dataframe using the `set_index` method. We change the index from $0,1,2,\ldots$ to the `Name` column.

In [None]:
babynames_idx = babynames[:5].set_index("Name") 
babynames_idx

**Example 6.** However, if we still want to access rows by location we will need to use the integer location (`.iloc`) accessor.

**Note:** We can't use `babynames_idx.loc[:5, 'Year']` here, because the index labels are now names rather than integers. We can still use the integer position.

In [None]:
# Select the first 5 values in the babynames_idx dataframe using row
# positions 0, 1, 2, 3, 4 from the Year column, which is at column position 2

# This command returns a Series.
babynames_idx.iloc[:5, 2]

To use the name location we need to use the index values **Mary** and **Willie** and the column name **Year**.

In [None]:
# Select the first 5 values in the babynames dataframe using row index 
# values Mary through Willie from the Year column

# This command returns a Series.

babynames_idx.loc['Mary':'Willie', 'Year']

**Question 6.** List only the unique names of the states that are in the `babynames` dataset. Save the states to a **list** named `states`. Do this using the `pandas` method named `.unique()`. You will need to make sure you know what data type can be used with the `.unique()` method and the data type that gets returned from the `.unique()` method.

**Note:** To earn all the points for the question you must do it programmatically. 

In [None]:
states = ...
states

In [None]:
grader.check("q6")

**Question 7.** Selecting multiple columns is easy.  You just need to supply a list of column names.  Use `.loc` to select the `Name` and `Year` **(in that order)** from the `babynames` table. Save this new object to a dataframe called `name_and_year`. For help, you can refer back to the **Examples** from the previous section in this notebook.

**Note:** To earn all the points for this question you **must** use the `.loc` method.

In [None]:
name_and_year = ...
name_and_year

In [None]:
grader.check("q7")

## Filtering Data

### Filtering with Boolean arrays

Filtering is the process of removing unwanted material.  In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, for culling out fishy outliers, or for analyzing subgroups of your data set.  Note that compound expressions have to be grouped with parentheses. Example usage looks like 

```
df[df['column name'] < 5]
```

where `df` is the name of the dataframe, `column name` is the name of the column, and `< 5` is the comparison statement. Meaning, only the rows whose value in that column is less than 5 will be displayed.


For your reference, some commonly used comparison operators are given below.

Symbol   | Usage      | Meaning 
------   | ---------- | -------------------------------------
$==$     | a == b     | Does a equal b?
$\lt =$  | a <= b     | Is a less than or equal to b?
$\gt =$  | a >= b     | Is a greater than or equal to b?
$\lt$    | a < b      | Is a less than b?
$\gt$    | a > b      | Is a greater than b?
~        | ~p         | Returns negation of p
&#124;   | p &#124; q | p OR q
&        | p & q      | p AND q
^        | p ^ q      | p XOR q (exclusive or)
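
As a sketch of how the operators in the table combine, the example below filters a small made-up table. The names `toy`, `young_nc`, `young_or_nc`, and `not_nc` are hypothetical. Each comparison is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than `<` and `==`.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate compound filters
toy = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy', 'Dee'],
                    'age': [3, 7, 4, 9],
                    'state': ['NC', 'NC', 'VA', 'VA']})

# AND: both conditions must be True for a row to be kept
young_nc = toy[(toy['age'] < 5) & (toy['state'] == 'NC')]

# OR: at least one condition must be True
young_or_nc = toy[(toy['age'] < 5) | (toy['state'] == 'NC')]

# NOT: flips True and False in the Boolean array
not_nc = toy[~(toy['state'] == 'NC')]

print(young_nc['name'].tolist())     # ['Ann']
print(young_or_nc['name'].tolist())  # ['Ann', 'Bob', 'Cy']
print(not_nc['name'].tolist())       # ['Cy', 'Dee']
```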

In the following we construct the DataFrame containing only names registered in North Carolina.

In [None]:
babynames[babynames['State'] == 'NC']

Let's break down each part.

In [None]:
# This returns a Series

babynames['State']

In [None]:
# This returns a Boolean array. Meaning a Series of
# True or False values based on whether the value for
# the observation (row) is equal to NC

babynames['State'] == 'NC'

In [None]:
# This returns a dataframe where any row that returns 
# False from the expression babynames['State'] == 'NC'
# will be hidden (masked). Only the rows where the 
# result is True will be displayed.

babynames[babynames['State'] == 'NC']

<!-- BEGIN QUESTION -->

**Question 8.** Write an expression that will filter the `babynames` dataframe and only return the rows for the year you were born.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 9.** Write 2 questions that you think can be answered by exploring the `babynames` dataframe. What would you need to do to the dataframe (i.e. _"data move"_) in order to answer your questions?

For example, if my question is "How many male babies named Kanye were born in each year?", then I would need to filter on the name Kanye and on the sex male. If I wanted to know whether the name was increasing or decreasing in popularity, I could create a line plot with the year on the $x$-axis and the count on the $y$-axis.

**Note:** To earn all the points for this question you do not need to know *all* the details required to complete your data move. If you get stuck, ask ChatGPT to explain it to you in plain, simple English. If you use ChatGPT, make a note of it in your response.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Or, find the .zip file in the left side of the screen and right-click and select **Download**. You'll submit this .zip file for the assignment in Moodle to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)