In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_palette('colorblind')
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv')
data.head()

Unnamed: 0,ZIP Code,Week Number,Week Start,Week End,Cases - Weekly,Cases - Cumulative,Case Rate - Weekly,Case Rate - Cumulative,Tests - Weekly,Tests - Cumulative,...,Test Rate - Cumulative,Percent Tested Positive - Weekly,Percent Tested Positive - Cumulative,Deaths - Weekly,Deaths - Cumulative,Death Rate - Weekly,Death Rate - Cumulative,Population,Row ID,ZIP Code Location
0,60632,36,08/30/2020,09/05/2020,117.0,4282.0,128.0,4703.5,1446.0,25164,...,27640.9,0.1,0.2,1,100,1.1,109.8,91039,60632-2020-36,POINT (-87.711251 41.810038)
1,60601,17,04/19/2020,04/25/2020,7.0,45.0,48.0,306.6,35.0,197,...,1342.4,0.1,0.1,1,2,6.8,13.6,14675,60601-2020-17,POINT (-87.622844 41.886262)
2,60632,39,09/20/2020,09/26/2020,107.0,4617.0,118.0,5071.5,1355.0,29203,...,32077.5,0.1,0.2,3,103,3.3,113.1,91039,60632-2020-39,POINT (-87.711251 41.810038)
3,60632,49,11/29/2020,12/05/2020,615.0,9447.0,676.0,10376.9,2884.0,53923,...,59230.7,0.3,0.2,6,135,6.6,148.3,91039,60632-2020-49,POINT (-87.711251 41.810038)
4,60632,50,12/06/2020,12/12/2020,581.0,10028.0,638.0,11015.1,2848.0,56771,...,62359.0,0.2,0.2,10,145,11.0,159.3,91039,60632-2020-50,POINT (-87.711251 41.810038)


## Data quality check / cleaning / preparation 

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.** An example is given below.

### Data cleaning
*By Joanna Chi*

In [3]:
# Check the number of rows and columns in the data
data.shape

(11280, 21)

In [4]:
# Find the missing values in the dataset
data.isnull().sum()

ZIP Code                                  0
Week Number                               0
Week Start                                0
Week End                                  0
Cases - Weekly                          223
Cases - Cumulative                      223
Case Rate - Weekly                      223
Case Rate - Cumulative                  223
Tests - Weekly                          330
Tests - Cumulative                        0
Test Rate - Weekly                        0
Test Rate - Cumulative                    0
Percent Tested Positive - Weekly          0
Percent Tested Positive - Cumulative      0
Deaths - Weekly                           0
Deaths - Cumulative                       0
Death Rate - Weekly                       0
Death Rate - Cumulative                   0
Population                                0
Row ID                                    0
ZIP Code Location                       188
dtype: int64

In [5]:
# Drop null values and see how many rows are left
data = data.dropna()
data.shape

(10543, 21)

In [6]:
# Fraction of rows removed by dropna()
(11280-10543)/11280
# The dropped rows are only about 6.5% of the dataset, so we can remove them.

0.06533687943262412
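The fraction above is typed in from the two `shape` outputs. A minimal sketch of computing it from the frames themselves is shown below, assuming the unfiltered frame is kept under a separate name. The names `raw` and `cleaned` and the toy values are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
raw = pd.DataFrame({
    "ZIP Code": ["60601", "60632", "60640", "60657"],
    "Cases - Weekly": [7.0, 117.0, np.nan, 12.0],
    "Tests - Weekly": [35.0, 1446.0, 10.0, np.nan],
})

# Keep the original frame so the row counts can be compared afterwards.
cleaned = raw.dropna()

# Share of rows lost to dropna(), derived from the frames rather than typed in.
dropped_fraction = 1 - len(cleaned) / len(raw)

# Share of missing values per column, to see which columns drive the loss.
missing_share = raw.isnull().mean()

print(f"Dropped {dropped_fraction:.1%} of rows")
print(missing_share)
```

Keeping the raw frame around also makes it possible to inspect which rows were dropped before deciding that removing them is acceptable.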

In [7]:
# Remove the "Row ID" and "ZIP Code Location" columns because we don't need them in our analysis
data = data.drop(['Row ID','ZIP Code Location'],axis=1)
data.head()

Unnamed: 0,ZIP Code,Week Number,Week Start,Week End,Cases - Weekly,Cases - Cumulative,Case Rate - Weekly,Case Rate - Cumulative,Tests - Weekly,Tests - Cumulative,Test Rate - Weekly,Test Rate - Cumulative,Percent Tested Positive - Weekly,Percent Tested Positive - Cumulative,Deaths - Weekly,Deaths - Cumulative,Death Rate - Weekly,Death Rate - Cumulative,Population
0,60632,36,08/30/2020,09/05/2020,117.0,4282.0,128.0,4703.5,1446.0,25164,1588,27640.9,0.1,0.2,1,100,1.1,109.8,91039
1,60601,17,04/19/2020,04/25/2020,7.0,45.0,48.0,306.6,35.0,197,238,1342.4,0.1,0.1,1,2,6.8,13.6,14675
2,60632,39,09/20/2020,09/26/2020,107.0,4617.0,118.0,5071.5,1355.0,29203,1488,32077.5,0.1,0.2,3,103,3.3,113.1,91039
3,60632,49,11/29/2020,12/05/2020,615.0,9447.0,676.0,10376.9,2884.0,53923,3168,59230.7,0.3,0.2,6,135,6.6,148.3,91039
4,60632,50,12/06/2020,12/12/2020,581.0,10028.0,638.0,11015.1,2848.0,56771,3128,62359.0,0.2,0.2,10,145,11.0,159.3,91039


In [8]:
# Check the datatypes of each variable to see if there are any we need to fix
data.dtypes

ZIP Code                                 object
Week Number                               int64
Week Start                               object
Week End                                 object
Cases - Weekly                          float64
Cases - Cumulative                      float64
Case Rate - Weekly                      float64
Case Rate - Cumulative                  float64
Tests - Weekly                          float64
Tests - Cumulative                        int64
Test Rate - Weekly                        int64
Test Rate - Cumulative                  float64
Percent Tested Positive - Weekly        float64
Percent Tested Positive - Cumulative    float64
Deaths - Weekly                           int64
Deaths - Cumulative                       int64
Death Rate - Weekly                     float64
Death Rate - Cumulative                 float64
Population                                int64
dtype: object

In [9]:
# Convert "Week Start" and "Week End" from object to datetime, and "ZIP Code" from object to numeric.
# The dates are stored as MM/DD/YYYY, so we pass the format explicitly.
data['Week Start'] = pd.to_datetime(data['Week Start'], format='%m/%d/%Y')
data['Week End'] = pd.to_datetime(data['Week End'], format='%m/%d/%Y')
data['ZIP Code'] = pd.to_numeric(data['ZIP Code'])
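Because `ZIP Code` is read in as `object`, it may contain non-numeric labels, in which case `pd.to_numeric` raises an error. The sketch below shows one way to surface such entries first. The `"Unknown"` label and the toy values are assumptions for illustration, not something confirmed by the notebook.

```python
import pandas as pd

# Toy ZIP column; the "Unknown" entry is an assumed example of a non-numeric label.
zips = pd.Series(["60601", "60632", "Unknown"], name="ZIP Code")

# errors="coerce" turns unparseable entries into NaN instead of raising.
zip_numeric = pd.to_numeric(zips, errors="coerce")

# Entries that failed to parse, for inspection before deciding how to handle them.
unparsed = zips[zip_numeric.isna()]
print(unparsed.tolist())

# Dates in MM/DD/YYYY form can be parsed with an explicit format string.
starts = pd.to_datetime(pd.Series(["08/30/2020", "04/19/2020"]), format="%m/%d/%Y")
print(starts.dt.month.tolist())
```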

In [10]:
# Finally, ZIP code 60666 contains only the airport and no residents, so we remove it
data = data[data['ZIP Code'] != 60666]

### Data preparation
*By Ryu Kimiko*

The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:

1. Since we are analyzing house price, we derived some new variables *(from existing variables)* that intuitively seem to be associated with house price.

2. We have created a standardized version of the dataset, as we are computing Euclidean distances to find houses similar to a given house.
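The standardization step in item 2 can be sketched as follows. This is a minimal illustration, not the notebook's own preparation code: it uses two numeric columns from the dataset loaded above with a few toy values, and it applies a z-score with the population standard deviation (`ddof=0`), which is an assumed choice.

```python
import pandas as pd

# Toy numeric frame; column names come from the dataset, values are illustrative.
toy = pd.DataFrame({
    "Case Rate - Weekly": [128.0, 48.0, 118.0, 676.0],
    "Test Rate - Weekly": [1588.0, 238.0, 1488.0, 3168.0],
})

# z-score: subtract each column's mean and divide by its standard deviation,
# so every column contributes on a comparable scale.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# Euclidean distance of every row to the first row, in standardized units.
distances = ((standardized - standardized.iloc[0]) ** 2).sum(axis=1) ** 0.5

print(standardized.round(3))
print(distances.round(3))
```

Without standardization, the column with the larger numeric range (here the test rate) would dominate the distance.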

In [11]:
######---------------Creating new predictors----------------#########

#Creating number of bedrooms per unit floor area

#Creating ratio of bathrooms to bedrooms

#Creating ratio of carpet area to floor area

In [12]:
######-----Standardizing the dataset for Lasso / Ridge-------#########

## Exploratory data analysis


### Analysis 1
*By \<Name of person doing the analysis>*

### Analysis 2
*By \<Name of person doing the analysis>*

### Analysis 3
*By \<Name of person doing the analysis>*

### Analysis 4
*By Joanna Chi*

How do testing rates and positive test rates vary across different regions of Chicago?
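The plots in this analysis group by `Chicago Region` and `Year`, which are not among the columns loaded from the CSV. One way such columns might be derived is sketched below. The `region_by_zip` lookup and its region labels are hypothetical placeholders, and the actual mapping used for this analysis is not shown in the notebook.

```python
import pandas as pd

# Toy frame with the two columns the derivation needs; values are illustrative.
toy = pd.DataFrame({
    "ZIP Code": [60601, 60632, 60632],
    "Week Start": pd.to_datetime(["2020-04-19", "2020-08-30", "2021-01-03"]),
})

# Calendar year of the week's start date.
toy["Year"] = toy["Week Start"].dt.year

# HYPOTHETICAL lookup from ZIP code to region; replace with the real mapping.
region_by_zip = {60601: "Central", 60632: "Southwest"}
toy["Chicago Region"] = toy["ZIP Code"].map(region_by_zip)

print(toy)
```

ZIP codes missing from the lookup would map to `NaN`, so it is worth checking `toy["Chicago Region"].isna().sum()` after the mapping.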

In [None]:
# Barplot visualizing test rates by region and year
a = sns.barplot(data=data, x='Chicago Region', y='Test Rate - Weekly', hue='Year')
a.figure.set_figwidth(15)
a.set_title('Mean Testing Rates Across Chicago')
a.set_xlabel('Chicago Area')
a.set_ylabel('Test Rates')
plt.show()

In [None]:
# Barplot visualizing positivity rates by region and year
a = sns.barplot(data=data, x='Chicago Region', y='Percent Tested Positive - Weekly', hue='Year')
a.figure.set_figwidth(15)
a.set_title('Mean Positive Test Rates Across Chicago')
a.set_xlabel('Chicago Area')
a.set_ylabel('Percent Tested Positive')
plt.show()
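By default, `sns.barplot` draws the mean of the `y` variable within each group. The same numbers can be tabulated with a `groupby`, which makes it easier to quote exact values alongside the plots. The sketch below uses a toy frame with illustrative values. The `Chicago Region` and `Year` columns are assumed to exist as in the plotting cells.

```python
import pandas as pd

# Toy frame mirroring the columns used in the bar plots; values are illustrative.
toy = pd.DataFrame({
    "Chicago Region": ["Central", "Central", "Southwest", "Southwest"],
    "Year": [2020, 2020, 2020, 2021],
    "Test Rate - Weekly": [200.0, 300.0, 1500.0, 3000.0],
})

# Mean weekly test rate per region and year: one row per region, one column per year.
region_year_means = (
    toy.groupby(["Chicago Region", "Year"])["Test Rate - Weekly"]
    .mean()
    .unstack("Year")
)

print(region_year_means)
```

Region and year combinations with no observations appear as `NaN` in the table, which corresponds to a missing bar in the plot.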

## Other sections


Put each model in a section of its name and mention the name of the team member tuning the model.

## Conclusions and Recommendations to stakeholder(s)

You may or may not have code to put in this section. Delete this section if it is irrelevant.