# Notebook 7: Seeing the Problem as Data 

## Part 1: Is Covid Real?

Surprisingly, this was a hotly debated question in the early months of Covid. Many people suggested that Covid was a "hoax" or that it was no worse than the flu. Social media has allowed conspiracy theories and misinformation like this to spread quickly and easily.

Fortunately, data is a powerful tool to help us see a problem clearly. Let's see if we can use data to set the record straight!

Chicago is the third-largest city in the United States, with an estimated 2.71 million people living within the city limits. This is _approximately_ the size of 19th-century London (2.5 million).

<table><tr>
    <td> <img src="../imgs/chicago_zip_map.jpeg" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>

<br>

How do we know there's a problem? We have gathered some data from the Illinois Department of Public Health. Specifically, the data show how many people in Chicago died each year from 2012 to 2021. What does this data tell us?

In [None]:
from matplotlib import pyplot as plt
import pandas as pd

years = list(range(2012,2022))  
chicago_total_deaths = [18911, 18825, 19003, 19308, 19809, 19742, 19660, 19630, 25769, 22882]
chicago_populations = [2714856, 2718782, 2722389, 2720546, 2704958, 2716450, 2705994, 2693976, 2741730, 2696555]

Chicago = pd.DataFrame(list(zip(years, chicago_total_deaths, chicago_populations)), columns=['year', 'deaths', 'population'])
Chicago

In [None]:
# Task 1: Normalize the number of deaths in Chicago by creating a new column: “deaths_per_1000”

Chicago['deaths_per_1000'] = ???

Chicago

We can also visualize this data with a line graph.

In [None]:
# Column names must match the DataFrame exactly: 'year' and 'deaths_per_1000' (from Task 1)
plt.plot(Chicago['year'], Chicago['deaths_per_1000'])
plt.title("Chicago 'Deaths per 1000' per Year")
plt.xlabel("Year")
plt.ylabel("Deaths per 1000 People")

print("Source: https://dph.illinois.gov/data-statistics/vital-statistics/death-statistics.html")

------------------------------------------
 
<br>
 
<img src="../imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 7a:** From Cholera to COVID</font>

We did a similar study of deaths in London to identify the years that cholera outbreaks occurred. How would you use the above data to argue against the suggestion that Covid was a "hoax"?

> Write your answer here! 

<br>

<img src="../imgs/save-icon.jpeg" alt="Drawing" align=left style="width: 20px;"/> <font size="4">     **&ensp;&ensp;&ensp;Stop and save your work!** </font>

## Part 2: Research is 'Standing on the Shoulders of Giants'
In May 2020, Kevin Credit from the Center for Spatial Data Science at UChicago wrote "Neighborhood inequity: Exploring the factors underlying racial and ethnic disparities in COVID-19 testing and infection rates using ZIP code data in Chicago and New York".

Fortunately, after publishing this paper, Kevin shared his data with us! This poses one important question: 

Kevin's research also would not have been possible without open-source research data from: 

Illinois Department of Public Health: https://www.dph.illinois.gov/covid19/covid19-statistics

U.S. Census Bureau: https://data.census.gov/cedsci


In [None]:
# Next we load our data into a usable format and view the first few rows
    
frame = pd.read_csv("../data/cov_chi_with_positivity_lite.csv")
frame.head()

In [None]:
# Since there are so many columns, let's list them

cols = frame.columns.tolist()
print(cols)

_____________________________________________________________________
It's not always easy to know what a column name means.

Thankfully, Kevin's data are accompanied by a comprehensive variable guide! 

**NOTE: after the variable guide, there is a cell where you can peek at each variable. Give it a try!**

_____________________________________________________________________

**Population**

`POP`: Total population 

`P0_44`: Number of people ages 0 to 44 

`P45_64`: Number of people ages 45 to 64 

`P65_`: Number of people ages 65 and older 

<p>&nbsp;</p>

**Socio-economic status** 

`MEDINC`: Median household income 

`PERNOINS`: Percent without health insurance 

<p>&nbsp;</p>

**Racial/Ethnic neighborhood types**

`BLKNH`: Black non-Hispanic-Majority Neighborhood  

`HISPNH`: Hispanic-Majority Neighborhood  

`WNH`: White non-Hispanic-Majority Neighborhood 

`PERASN`: Asian-Majority Neighborhood 

<p>&nbsp;</p>

**Occupations**

`PEROFFTC`: Percent office and telecommute workers

`PERHSRV`: Percent healthcare service workers

`PERPSRV`: Percent public service workers

`PERFOOD`: Percent food service workers

`PERCLEAN`: Percent cleaning service workers

<p>&nbsp;</p>

**How people get to work**

`PERAUTO`: Percent personal automobile commuters

`PERTRAN`: Percent public transportation commuters

`PERPEDB`: Percent pedestrian and bike commuters

`PERTELE`: Percent telecommuters (work from home)


<p>&nbsp;</p>

**Built environments**

`FDTRTPER`: Percent food desert tracts 

`WS_5`: Hospital accessibility score 

`POPDENS`: Population density (per square meter)  

`PERCROWD`: Percent housing units w/ > 1 person per room 

<p>&nbsp;</p>

 
**COVID-19**

`CASE4_16`: Number of positive cases the week ending 4/16  

`TEST4_16`: Number of tests performed the week ending 4/16 


In [None]:
# Use this cell to explore the different variables by changing the name inside. 
# Remove '.head()' to see all of the data!
frame['???'].head()


<br>
 
<img src="../imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Task 2:** Variable Identification</font>

Identify variables from the dataset that could be used to measure Covid's impact on communities.  

> Write your answer here! 

<br>

<br>

<img src="../imgs/save-icon.jpeg" alt="Drawing" align=left style="width: 20px;"/> <font size="4">     **&ensp;&ensp;&ensp;Stop and save your work!** </font>

________________________________________________________________

## Part 3: Creating New Data

First we want to examine the `case rate`: **the percentage of the population that has COVID-19**. 

You'll notice that there is *not* already a case rate variable in Kevin's data -- this means we need to construct it ourselves!

In Python, we can do this by "declaring a function" that inputs **something** and outputs the case rate. In this case **something** is the total number of cases and the population size. The function says how to use these inputs to **return** a value, in this case, the case rate.
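
To see inputs and a returned value in action, here is a minimal sketch using a made-up function, `percent_of`, and made-up numbers (neither comes from Kevin's data). It assumes pandas is available, as elsewhere in this notebook, and shows that the same function can take single numbers or whole columns.

```python
import pandas as pd

# Hypothetical example function, for illustration only
def percent_of(part, whole):
    """Return `part` as a percentage of `whole`."""
    return part / whole * 100

# Called with single numbers, the function returns a single number.
single_value = percent_of(5, 20)

# Called with pandas columns, the arithmetic runs row by row
# and the function returns a whole new column.
toy = pd.DataFrame({"part": [1, 2, 3], "whole": [4, 4, 4]})
toy["percent"] = percent_of(toy["part"], toy["whole"])

print(single_value)
print(toy)
```

Because pandas applies the arithmetic to every row, a single function call can fill an entire new column.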

In [None]:
# This is a function! If you run this cell nothing happens. 
# Why do you think that is?

def case_rate(total_cases, population): 
    cases_over_population = total_cases / population
    return cases_over_population * 100  # Why multiply it by 100? 

In [None]:
# Generate a new column of data for the week of 4/16 called “case_rate_4_16” 
# using the “case_rate” function for April 16th, 2020 

frame['case_rate_4_16'] = case_rate(frame[???], frame[???])

# Next we can preview our new data to see if they pass the **smell-test**!

frame['case_rate_4_16'].head()

<br>

<img src="../imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 7b:** Case Rate 'Smell Test'</font>

What do the values for `case_rate_4_16` mean? Do your generated numbers appear to be a reasonable size (not too big and not too small)?  

> Write your answer here! 

<br>


<br>

Next, we want to consider the **testing rate: the total percentage of the population that got tested!**

**Don't forget to preview your new data field to ensure everything looks good!** 

<br>

In [None]:
# Make a function called “test_rate” that normalizes the number of tests by population
# HINT: it looks a LOT like the case_rate function

def test_rate(total_tests, population):
    ??? = ??? ??? ???
    return ??? ??? ???


In [None]:
# Generate a new column of data for the week of 4/16 called 
# “test_rate_4_16” using the “test_rate” function.

frame["test_rate_4_16"] = test_rate(???, ???)

# Preview our new data
frame['test_rate_4_16'].head()

<br>

<img src="../imgs/save-icon.jpeg" alt="Drawing" align=left style="width: 20px;"/> <font size="4">     **&ensp;&ensp;&ensp;Stop and save your work!** </font>

## Part 4: Positivity Rate

Now that you have some experience building and using functions, let's apply it to build our desired outcome variable, **positivity rate**!

<img src="../imgs/pos_rate.png" alt="Drawing" style="width: 600px;"/>

In [None]:
# Define a function called "pos_rate" that calculates positivity rate 

def pos_rate(cases, tests):
    ??? = ??? ??? ???
    return ??? ??? ???


In [None]:
# Using your pos_rate function, create a new 
# column in the dataframe for the week of 4/16 called “pos_rate_4_16”

???

# Preview your new data

???

## Congratulations! You've just created your very own outcome variable! 

Positivity rate enables us as researchers to see the impact of COVID, even when the total number of tests completed is low. In the upcoming exercises, we will visualize and test your new variable of interest (among others)!
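
As a rough illustration of why this matters, the sketch below compares two made-up ZIP codes that report the same number of positive cases but very different numbers of tests. The numbers are hypothetical (not from Kevin's data), and the calculation assumes positivity rate means positive cases divided by tests performed, expressed as a percentage.

```python
# Hypothetical numbers for two made-up ZIP codes (not taken from Kevin's data).
# Both report the same number of positive cases, but very different test counts.
zip_a = {"cases": 25, "tests": 400}
zip_b = {"cases": 25, "tests": 100}

# Assumption: positivity rate = positive cases / tests performed, as a percentage.
rate_a = zip_a["cases"] / zip_a["tests"] * 100
rate_b = zip_b["cases"] / zip_b["tests"] * 100

print(f"ZIP A positivity: {rate_a:.2f}%")
print(f"ZIP B positivity: {rate_b:.2f}%")
```

Looking only at case counts, the two ZIP codes appear identical. The positivity rate suggests that the second one, where far fewer tests were done, may be missing more cases.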

<br>

<img src="../imgs/save-icon.jpeg" alt="Drawing" align=left style="width: 20px;"/> <font size="4">     **&ensp;&ensp;&ensp;Stop and save your work!** </font>