# Denison CS181/DA210 SW Lab #5 - Step 1

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [None]:
import os
import os.path
import pandas as pd

datadir = "publicdata"

---

## Part A: Deleting Columns/Rows

We'll use the following subset of the `topnames` dataset as an example throughout.  Note that we'll use the `.copy()` method to copy our `DataFrame` before making any changes to it.

In [None]:
# Create the DataFrame from a List of Lists
topnamesLoL = [ [2018, "Male", "Liam", 19837],
                [2018, "Female", "Emma", 18688],
                [2017, "Male", "Liam", 18798],
                [2017, "Female", "Emma", 19800],
                [2016, "Male", "Noah", 19117],
                [2016, "Female", "Emma", 19496] ]
topnamesColumns = ["year", "sex", "name", "count"]

topnames0 = pd.DataFrame(topnamesLoL, columns=topnamesColumns)

# View the DataFrame before making any changes
topnames0

#### Single-column deletion

If we want to delete a single column from a `DataFrame`, we can use a `del` statement.

In [None]:
# First, copy the DataFrame (we'll modify the copy)
tn2 = topnames0.copy()

# Delete the 'name' column
del tn2["name"]

# Display the modified DataFrame
tn2

Similarly, we could use `pop()` to delete a single column.  As with `list`s in Python, `pop()` returns the removed element.

In [None]:
# First, copy the DataFrame (we'll modify the copy)
tn3 = topnames0.copy()

# Delete and store the 'count' column
column_series = tn3.pop("count")

# Display the modified DataFrame
tn3

Because `pop()` removes a single column, the value it returns is a `Series`.

In [None]:
# Display the popped Series
column_series
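As a quick check of the claim above, the following minimal sketch pops a column and inspects what comes back. The small `df` here is a hypothetical stand-in for the `topnames` subset, not part of the lab data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the topnames subset
df = pd.DataFrame({"year": [2018, 2018],
                   "name": ["Liam", "Emma"],
                   "count": [19837, 18688]})

# pop() removes the column from df and returns it
popped = df.pop("count")

print(type(popped).__name__)  # Series
print(popped.name)            # count
print(list(df.columns))       # ['year', 'name']
```

The returned `Series` keeps the column label as its `name`, and the `DataFrame` itself no longer has that column.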

#### Multiple-column deletion

We can use the `drop()` method of the `DataFrame` class to drop multiple columns.  The first argument to `drop()` should be a single column label (to drop one column) or a list of column labels (to drop multiple columns).

In [None]:
# First, copy the DataFrame (we'll modify the copy)
tn4 = topnames0.copy()

# Delete just the 'name' column
tn4.drop('name', axis=1, inplace=True)

# Display the modified DataFrame
tn4

In the above example, we specified `inplace=True`, which modifies the given `DataFrame` directly.  We could skip the copy step by using `inplace=False` (the default), which leaves the original unchanged and returns a modified copy.

In [None]:
# Delete just the 'name' column
tn5 = topnames0.drop('name', axis=1, inplace=False) # return a modified copy

# Display the modified DataFrame (same as tn4)
tn5
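The two cells above each drop one column. A minimal sketch of the multiple-column case, passing a list of labels, might look like the following. The name `topnames_demo` is a hypothetical stand-in so the example runs on its own.

```python
import pandas as pd

# Hypothetical two-row stand-in for topnames0
topnames_demo = pd.DataFrame(
    [[2018, "Male", "Liam", 19837],
     [2018, "Female", "Emma", 18688]],
    columns=["year", "sex", "name", "count"],
)

# A list of labels drops several columns at once;
# without inplace=True, drop() returns a modified copy
dropped_two = topnames_demo.drop(["name", "count"], axis=1)

print(list(dropped_two.columns))    # ['year', 'sex']
print(list(topnames_demo.columns))  # ['year', 'sex', 'name', 'count']
```

Because `inplace` is left at its default, `topnames_demo` still has all four columns afterward.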

#### Row deletion

In the previous two examples, we used `axis=1` to specify that we wanted to drop one or more columns.  We could instead use `axis=0` to specify that we should drop rows.

In [None]:
# Delete rows 2-3 (using the row labels)
tn6 = topnames0.drop([2,3], axis=0, inplace=False) # return a modified copy

# Display the modified DataFrame
tn6
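As an alternative to the `axis` argument, `drop()` also accepts `index=` and `columns=` keyword arguments that name the axis directly. The sketch below uses a hypothetical `demo` frame to show both forms.

```python
import pandas as pd

# Hypothetical three-row frame with the default 0, 1, 2 row labels
demo = pd.DataFrame(
    [[2018, "Liam", 19837],
     [2017, "Liam", 18798],
     [2016, "Noah", 19117]],
    columns=["year", "name", "count"],
)

# index= names row labels, equivalent to passing the labels with axis=0
without_row_1 = demo.drop(index=[1])

# columns= names column labels, equivalent to passing the labels with axis=1
without_name = demo.drop(columns=["name"])

print(list(without_row_1.index))   # [0, 2]
print(list(without_name.columns))  # ['year', 'count']
```

Some readers find the keyword form easier to read, since there is no need to remember which axis number means rows and which means columns.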

If we use a multi-level index for row labels, we can specify a drop using a specific level.

In [None]:
# Copy topnames0 and give it a two-level index
tn7 = topnames0.set_index(['year', 'sex'], inplace=False)

# Delete rows for 2017 (using the two-level index)
tn7.drop([2017], level="year", axis=0, inplace=True)

# Display the modified DataFrame
tn7

In [None]:
# Copy topnames0 and give it a two-level index
tn8 = topnames0.set_index(['year', 'sex'], inplace=False)

# Delete rows for Male (using the two-level index)
tn8.drop(["Male"], level="sex", axis=0, inplace=True)

# Display the modified DataFrame
tn8

---

## Part B: Adding a Column

We can add a new column, represented as a `Series`, to an existing `DataFrame`.  To do this, we use the same syntax that we use to project a column, but on the _left-hand side_ of an assignment statement.

In [None]:
# First, copy the DataFrame (we'll modify the copy)
tn9 = topnames0.copy()

# Add some new columns
tn9["oddyear"] = tn9["year"] % 2 == 1                     # year is odd
tn9["namelen"] = tn9["name"].apply(len)                   # length of each name
tn9["namecaps"] = tn9["name"].apply(lambda s: s.upper())  # name in all caps

# Display the modified DataFrame
tn9

Note that if we project multiple columns, `pandas` may treat the projection as tied to the original data.  This means it can be unclear whether modifying the projection also modifies the original.

A clear-cut version of this sharing occurs with Python `list`s, when two names refer to the same object.

In [None]:
# Start with a list
mylist = [1, 3, 6, 7, 19, 22]

# Create an alias for my list
alias = mylist

# Modify the alias
alias[-1] = -10
alias.append(1000)

# Check the state of both
print(mylist)
print(alias)

With `DataFrame`s, this can cause issues if we try to add a column to a projection.

In [None]:
# First, copy the DataFrame (we'll modify the copy)
tn10 = topnames0.copy()

# Project the "name" and "count" columns
tn10_proj = tn10[["name", "count"]]

# Attempt to add a column to the projection
tn10_proj["namelen"] = tn10["name"].apply(len) # displays warning

Despite this warning, the column is added to the projection, while the original `DataFrame` is left unchanged.

In [None]:
# Display the original DataFrame
tn10

In [None]:
# Display the modified projection
tn10_proj

Because `pandas` assumes a projection may correspond to the original data, we can call `copy()` on the projection to make clear that it is independent, so that later assignments affect only the copy.
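A minimal sketch of that approach is shown here, using a hypothetical two-row `demo` frame in place of `tn10`.

```python
import pandas as pd

# Hypothetical stand-in for tn10
demo = pd.DataFrame({"name": ["Liam", "Emma"],
                     "count": [19837, 18688]})

# Copy the projection so it is explicitly independent of demo
proj = demo[["name", "count"]].copy()

# Adding a column now affects only the copy
proj["namelen"] = proj["name"].apply(len)

print(list(proj.columns))  # ['name', 'count', 'namelen']
print(list(demo.columns))  # ['name', 'count']
```

Since `proj` is an explicit copy, there is no ambiguity about which object the assignment changes.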

---

## Part C: Updating columns

We can use the same syntax for adding a column to update all values in a given column.

In [None]:
# First, copy the DataFrame (we'll modify the copy)
tn11 = topnames0.copy()

# Modify the 'count' column (e.g., to change units)
tn11["count"] = tn11["count"] / 1000

# Display the modified DataFrame
tn11

We can use `.loc` and a row filter to update only some values in a given column.

In [None]:
# First, copy the DataFrame (we'll modify the copy)
tn12 = topnames0.copy()

# Modify the 'count' column for rows with Female 'sex' (e.g., to change units)
tn12.loc[tn12.sex == "Female", "count"] = tn12["count"] / 1000

# Display the modified DataFrame
tn12
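The same kind of update can be written with the row filter stored in a variable and applied on both sides of the assignment, which some find easier to read. This is a sketch under assumptions: the `demo` frame is hypothetical, and its `count` column is created as floats so that storing fractional values does not change the column's type.

```python
import pandas as pd

# Hypothetical stand-in; counts are floats from the start
demo = pd.DataFrame({"sex": ["Male", "Female", "Male"],
                     "count": [19837.0, 18688.0, 18798.0]})

# Boolean Series marking the rows to update
is_female = demo["sex"] == "Female"

# Read and write only the selected rows
demo.loc[is_female, "count"] = demo.loc[is_female, "count"] / 1000

print(demo)
```

Rows where the filter is `False` keep their original values.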

We have the same issue trying to modify the values in a projection as we did trying to add a column.

In [None]:
# First, copy the DataFrame (we'll modify the copy)
tn13 = topnames0.copy()

# Project the "name" and "count" columns
tn13_proj = tn13[["name", "count"]]

# Modify the projection
tn13_proj.loc[:,"count"] = tn13_proj["count"] / 1000 # displays a warning
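One way to avoid the ambiguity, sketched below with a hypothetical `demo` frame in place of `tn13`, is to copy the projection before updating it and then confirm that the original values are untouched.

```python
import pandas as pd

# Hypothetical stand-in for tn13
demo = pd.DataFrame({"name": ["Liam", "Emma"],
                     "count": [19837, 18688]})

# Take an explicit copy of the projection, then update the copy
proj = demo[["name", "count"]].copy()
proj["count"] = proj["count"] / 1000

print(proj)
print(demo)  # counts here are still the original integers
```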

---

## Part D: Try it Yourself

**Q1:** Read the CSV file `indicators2016.csv` in `datadir` into a data frame named `indicators0`, with no index.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display a subset of the DataFrame
indicators0.head()

**Q2:** Use `.pop()` to remove the `'code'` column from the dataset (modifying the original dataset), and store the resulting `Series` in `code_series`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the Series
code_series

In [None]:
# Display the modified DataFrame (should not have a 'code' column)
indicators0.head()

**Q3:** Make a copy of `indicators0`, called `indicators`, and assign `code_series` to be its index.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the modified DataFrame copy
indicators

> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: After popping the `'code'` column from `indicators0`, how could you instead create a new column in the data frame with the same value as `code_series`, effectively putting that column back into the data frame?

---

## Part E: A New Dataset

The file `members.csv` in `publicdata` has (fake) information on a number of individuals in Ohio.

**Q4:** Read this dataset into a `pandas DataFrame` using `read_csv`.  Name it `members0` and do not include an index.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the DataFrame
members0

**Q5:** Repeat the above, but now do include an index by specifying `index_col` in the call to `read_csv`.  Name this `DataFrame` `members`.  (Hint: take a look at the file to determine a reasonable index.)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the DataFrame
members

**Q6:** We will now split the `'Name'` column into two different columns for first and last name.  As a first step:

  1. Write a lambda function that will, given a string, split on spaces and select only the first element of the resultant list.
  2. Apply the lambda function to the `'Name'` column, and store the result as `fname_series`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the Series
fname_series

**Q7:** Assign `fname_series` as a new column in `members` with column name `FName`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the DataFrame (should have a new column, FName)
members

**Q8:** Similar to Q6-Q7, create a new column, called `LName`, in `members` using the last name.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the DataFrame (should have a new column, LName)
members

**Q9:** Similar to Q6-Q8, create two new columns, called `City` and `State`, based on the `Address` column.  Make sure that neither the city nor state has any leading or trailing spaces.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the DataFrame (should have two new columns, City and State)
members

**Q10:** Drop the original `'Name'` and `'Address'` columns.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the DataFrame (should have removed Name and Address columns)
members

> You've reached the second checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: One of the users, Kirk Marshall, wants to change his last name to Crossley.  How would you use `loc` to do this?  What about `iloc`?

---

## Part F

How much time (in minutes/hours) did you spend on this lab outside of class?

YOUR ANSWER HERE