## Notebook 2b: Pandas Dataframe Manipulations
-------------------------
<br>

**By the end of this notebook, you should be able to**: 
- Get column statistics
- Create a new column
- Select columns and rows
- Apply these dataframe manipulations to the cholera problem
<br><br>


In the previous notebook (Notebook 2a), we created a new column `deaths_per_1000` by running a calculation on two other columns. This is an example of a **dataframe manipulation**: creating new data from existing data, or reorganizing the data to make it easier to work with. In this section, we will practice more dataframe manipulations. To begin, let's look at some data about the cholera outbreak of 1849.

19th-century London was divided into districts, much like Chicago is divided into neighborhoods. These districts were grouped by geography, just like Chicago's (South Side, North Side, West Side, Far South Side, etc.).

<br>

<img src="https://i2.wp.com/londontopia.net/wp-content/uploads/2014/08/london-county.jpg" width=600>





In [None]:
# Load our Pandas data science library
import pandas as pd
pd.options.mode.chained_assignment = None

In [None]:
# Load data about London
outbreak = pd.read_csv("Datasets/The_Outbreak_of_1849.csv")
outbreak

### Action 1. Column statistics. 
-------------------------------------
As data scientists, we often want to summarize a lot of data, such as all the values in a column, with a single number.

Pandas provides a number of utilities to give us these **summary statistics**:

- **min**: find the minimum value(s) of a column. 
- **max**: find the maximum value(s) of a column. 
- **mean**: compute the average of the column. 
- **sum**: add up elements in a column. 

In [None]:
deaths_min = outbreak['deaths'].min()
deaths_max = outbreak['deaths'].max()
deaths_mean = outbreak['deaths'].mean()
deaths_sum = outbreak['deaths'].sum()

print(f"deaths min: {deaths_min}\ndeaths max: {deaths_max}\ndeaths mean: {deaths_mean}\ndeaths sum: {deaths_sum}")
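To check these four statistics by hand, here is a minimal sketch on a tiny made-up table. The dataframe `toy` and its values are hypothetical and are not part of the 1849 dataset; they are just small enough to verify with mental arithmetic.

```python
import pandas as pd

# Hypothetical toy data (NOT the real 1849 dataset), small enough to check by hand
toy = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "deaths": [10, 40, 25, 5],
})

toy_min = toy["deaths"].min()    # smallest value in the column: 5
toy_max = toy["deaths"].max()    # largest value in the column: 40
toy_mean = toy["deaths"].mean()  # (10 + 40 + 25 + 5) / 4 = 20.0
toy_sum = toy["deaths"].sum()    # 10 + 40 + 25 + 5 = 80

print(f"min: {toy_min}, max: {toy_max}, mean: {toy_mean}, sum: {toy_sum}")
```

Each call collapses the whole column into one number, which is what makes these useful as summaries.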

### Action 2. Creating a new column. 
------------------------------------------
Let's say that you want to create a new column in Pandas. You can do this by assigning values to a column name with the `df["column_name"] = ...` notation. Let's create a new column that holds the number of people per house, using the following calculation: 

$$people \ per \ house = {population \over number \ of \ houses}$$

In [None]:
# A. Create a new column that calculates the number of people per house. 
outbreak['people_per_house'] = outbreak['population']/outbreak['number_houses']
outbreak

### Action 3. Selecting columns.
-----------------------------------
Let's say that you want to 'select' only certain dataframe columns. You can select just one column using `df["column_name"]` or multiple columns as follows `df[["column_name_1", "column_name_2"]]`. See the following...

In [None]:
# Select only the district column. 
new_df = outbreak["district"]
new_df

In [None]:
# Select the "district" and "region" columns. Notice the double brackets: we are putting a list of column names inside outbreak[...]
new_df = outbreak[["district", "region"]]
new_df
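One detail worth seeing for yourself is that single and double brackets give back different kinds of objects. The sketch below uses a hypothetical `toy` dataframe (not the outbreak data) to show that single brackets return a pandas `Series` (one column on its own), while double brackets return a `DataFrame` (a table), even when the list holds just one name.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "district": ["A", "B"],
    "region": ["North", "South"],
    "deaths": [10, 40],
})

one_column = toy["district"]                # single brackets -> Series
two_columns = toy[["district", "region"]]   # double brackets (a list) -> DataFrame
also_a_table = toy[["district"]]            # a one-item list still gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(also_a_table).__name__)
```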

### Action 4. Selecting rows.
-----------------------------------
Selecting certain rows is a different process, because a row holds one value from every column and those values can have different data types. Instead of naming the rows we want, we filter them: we keep only the rows that contain a certain value or range of values.

We use what is called a "boolean", which is a true/false statement, and code it like this: `df[(df["column_name"] == "some_value")]`

The double equals sign `==` means "is it equal to?", as opposed to the single `=`, which means "**make** it equal to", like when we set a variable equal to a certain value.

For example, let's say that we only want to see the row for East London...

In [None]:
new_row = outbreak[(outbreak["district"] == "East London")]
new_row
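To see what the boolean part does on its own, the sketch below splits the filter into two steps on a hypothetical `toy` dataframe (made-up values, not the outbreak data). The comparison produces one `True` or `False` per row, and putting that result inside `toy[...]` keeps only the `True` rows.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "district": ["A", "B", "C"],
    "elevation": [5, 50, 12],
})

# Step 1: ask the true/false question for every row
mask = toy["district"] == "B"
print(mask.tolist())   # [False, True, False]

# Step 2: keep only the rows where the answer was True
selected = toy[mask]
print(selected)
```

Writing `toy[(toy["district"] == "B")]` in one line does exactly the same thing; the intermediate name `mask` is only there to make the two steps visible.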

You can also use other **"operators"** like: 
- `>` greater than
- `<` less than
- `<=` less than or equal to 
- `>=` greater than or equal to
- `!=` not equal to

For example, to select the districts that are below an elevation of 20:

In [None]:
low_elev = outbreak[(outbreak["elevation"] < 20)]
low_elev
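The other operators work the same way. Here is a short sketch on a hypothetical `toy` dataframe (made-up districts and elevations) that tries three of them, so you can compare which rows each one keeps.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "district": ["A", "B", "C"],
    "elevation": [5, 50, 12],
})

below_20 = toy[toy["elevation"] < 20]       # keeps A (5) and C (12)
at_least_12 = toy[toy["elevation"] >= 12]   # keeps B (50) and C (12)
not_b = toy[toy["district"] != "B"]         # keeps every district except B

print(below_20["district"].tolist())     # ['A', 'C']
print(at_least_12["district"].tolist())  # ['B', 'C']
print(not_b["district"].tolist())        # ['A', 'C']
```

Notice that `>=` includes the row whose elevation is exactly 12, while a strict `>` would leave it out.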

## Now it's your turn...

### Applying Dataframe Manipulations to the Problem

One of the first things we can do when exploring data for insights is to see if there are **spatial patterns** to our outcome variable. In other words, does location (region) affect the outcome (death rate)?

Let's apply what we just learned about dataframe manipulations to see if cholera's impact was spatial.

Perform the following dataframe manipulation exercises! Fill in the `???` with the proper Python code!

**1. Print out a version of `outbreak` containing only the district, region, population and deaths columns. Call this dataframe `outbreak_spatial`.**

In [None]:
# Put your answer here!
# .copy() makes an independent dataframe, so we can safely add columns to it later
outbreak_spatial = outbreak[["district", "region", "population", "deaths"]].copy()
outbreak_spatial

**2. Create a column called "deaths_per_1000" that is the mortality rate.  *Hint: see outcome variable*.**

In [None]:
# Put your answer here! 
outbreak_spatial["deaths_per_1000"] = outbreak_spatial["deaths"] / outbreak_spatial["population"] * 1000
outbreak_spatial
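The "per 1000" arithmetic is easier to trust after checking it on numbers you can work out in your head. The sketch below uses a hypothetical `toy` dataframe with made-up populations and deaths. It also calls `.copy()` on the selected columns, which is one common way to tell pandas that the smaller table is independent of the original; in many pandas versions, adding a column to a selection without it can trigger a `SettingWithCopyWarning`.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "district": ["A", "B"],
    "population": [2000, 10000],
    "deaths": [10, 30],
    "elevation": [5, 50],
})

# Select a few columns and make an independent copy before adding to it
toy_spatial = toy[["district", "population", "deaths"]].copy()

# deaths / population is the fraction who died; * 1000 rescales it to "per 1000 people"
# District A: 10 / 2000 * 1000 = 5 per 1000
# District B: 30 / 10000 * 1000 = 3 per 1000
toy_spatial["deaths_per_1000"] = toy_spatial["deaths"] / toy_spatial["population"] * 1000

print(toy_spatial.round(2))
```

District B has more deaths in total, but district A has the higher rate, which is why we compare rates rather than raw counts when populations differ.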

**3. Select only the rows for the North districts and put them in a new dataframe called `outbreak_north`.**

In [None]:
outbreak_north = outbreak_spatial[(outbreak_spatial["region"] == "North")]
outbreak_north

**4. Repeat Step 3 for the other four regions.** Make a new dataframe for each: `outbreak_south`, `outbreak_east`, etc.

In [None]:
outbreak_south = outbreak_spatial[(outbreak_spatial["region"] == "South")]
outbreak_south

In [None]:
outbreak_east = outbreak_spatial[(outbreak_spatial["region"] == "East")]
outbreak_east

In [None]:
outbreak_west = outbreak_spatial[(outbreak_spatial["region"] == "West")]
outbreak_west

In [None]:
outbreak_central = outbreak_spatial[(outbreak_spatial["region"] == "Central")]
outbreak_central

**5. Calculate the average (mean) death rate for each region.**

In [None]:
death_rate_north = outbreak_north['deaths_per_1000'].mean()

print(f"North: {death_rate_north}")

In [None]:
death_rate_south = outbreak_south['deaths_per_1000'].mean()
death_rate_east = outbreak_east['deaths_per_1000'].mean()
death_rate_west = outbreak_west['deaths_per_1000'].mean()
death_rate_central = outbreak_central['deaths_per_1000'].mean()

print(f"North: {death_rate_north}\nSouth: {death_rate_south}\nEast: {death_rate_east}\nWest: {death_rate_west}\nCentral: {death_rate_central}")
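Steps 3 to 5 repeat the same filter-then-average pattern once per region. As an optional aside, the sketch below shows how that pattern could be written once inside a `for` loop. It is not a required part of the exercise, and it uses a hypothetical `toy` dataframe with made-up rates and only two regions, so the results can be checked by hand.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "deaths_per_1000": [2.0, 4.0, 10.0, 20.0],
})

region_means = {}
for region in ["North", "South"]:
    rows = toy[toy["region"] == region]                    # same row filter as Step 3
    region_means[region] = rows["deaths_per_1000"].mean()  # same mean as Step 5

# North: (2 + 4) / 2 = 3.0, South: (10 + 20) / 2 = 15.0
for region, rate in region_means.items():
    print(f"{region}: {rate}")
```

The loop gives the same numbers as writing out each region separately; it just avoids copying and editing the same cell several times.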


To help visualize the spatial patterns, here is a map of death rates by borough. Boroughs are similar to districts:

<img src="imgs/1849_map.png" align=left style="width: 400px;"/>


### 2.3 Reflection

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 2A**: Is it spatial?
    
**Based on the death rates for the different regions, do you think that cholera had spatial patterns?** If so, which region(s) of London were most impacted?</font>

> Write your answer here! 

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 2B**: The cartoon
    
**See the cartoon at the start of the previous notebook. What variables would you want to use to \*explore or explain\* your outcome variable?** Why? Come up with a potential explanation of how a variable such as elevation or home value could affect the spread of cholera.</font>

> Write your answer here! 

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size="4">**Journal 2C:** Reflection </font>

**What did you learn in this notebook?**
> Write your answer here! 

**Please fill out the Notebook survey here!**
> https://forms.gle/54KHEbPGsRxQU3Bh9

<br>

--------------------------------

<br>

<img src="imgs/save-icon.jpeg" alt="Drawing" align=left style="width: 20px;"/> <font size="4">     **&ensp;&ensp;&ensp;Last step: save your work!** </font>