# The Impact of the Coronavirus Pandemic on Connecticut's Residential Real Estate Market

I'll be using this file to do all the necessary data wrangling, analysis and plotting with python, and maybe even take a few notes that will go to the final paper.

#### Objectives
- [ ] import dataset
- [ ] add data about aggregate infection numbers in Connecticut
- [ ] if poverty rates found, add
- [ ] describe where we got the data from
- [ ] show table summary
- [ ] answer questions using plots and graphs

#### Questions
The background goes over the various questions we want to answer with this paper. Specifically, we want to investigate whatever demographic data we can find to explain income, along with infection rates, affected the residential real estate market in CT. A few questions we have to start include the following:
1. What do the sale prices in each quartile look like?
2. What is the trend in sales from 2016 to 2018?
3. How does infection affect sales in each quartile?
4. What cities show the largest changes? Why do we think that was?
We might have to define the quartiles.

In [5]:
# install dependencies
!pip install --upgrade pip

Collecting pip
  Downloading pip-23.1.1-py3-none-any.whl (2.1 MB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0mm eta [36m0:00:01[0m[36m0:00:01[0m
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.0.1
    Uninstalling pip-23.0.1:
      Successfully uninstalled pip-23.0.1
Successfully installed pip-23.1.1


In [120]:
# import modules
from IPython.display import display

import pandas as pd

# import datasets
data = pd.read_csv('./data/CTRRE_2011-2021.csv')
infection_data = pd.read_csv('./data/COVID19_Tests_Cases_Deaths_by_Town-ARCHIVE.csv')
income_data = pd.read_csv('./data/median-household-income-town-2020.csv')

# preview primary dataset
display(data.head())

We want to add 2016 to 2020 median household income from all cities in Connecticut, as well as relevant infection rates in the state.

In [18]:
# preview infection dataset
display(infection_data.tail())

Unnamed: 0,Last update date,Town number,Town,Total cases,Confirmed cases,Probable cases,Case rate,Total deaths,Confirmed deaths,Probable deaths,People tested,Rate tested per 100k,Number of tests,Number of positives,Number of negatives,Number of indeterminates
102240,06/24/2022,165,Windsor Locks,2702,2507.0,195.0,21021.0,44,40.0,4.0,10507.0,81741.0,44778.0,3347.0,41376.0,55.0
102241,06/24/2022,166,Wolcott,4089,3557.0,532.0,24652.0,64,54.0,10.0,14324.0,86357.0,71416.0,5163.0,66148.0,105.0
102242,06/24/2022,167,Woodbridge,1699,1519.0,180.0,19417.0,48,42.0,6.0,8668.0,99063.0,50559.0,2115.0,48373.0,71.0
102243,06/24/2022,168,Woodbury,1646,1367.0,279.0,17323.0,12,10.0,2.0,7918.0,83330.0,34243.0,2023.0,32192.0,28.0
102244,06/24/2022,169,Woodstock,1440,1391.0,49.0,18325.0,7,7.0,0.0,6740.0,85772.0,23654.0,1680.0,21945.0,29.0


In [124]:
# preview dataset with median household income
display(income_data.head())

Unnamed: 0,Town,FIPS,Year,Race/Ethnicity,Measure Type,Variable,Value
0,Andover,901301080,2005-2009,All,Number,Median Household Income,84757.0
1,Andover,901301080,2005-2009,All,Number,Margins of Error,9003.0
2,Andover,901301080,2005-2009,All,Ratio to State Median,Median Household Income,1.25
3,Andover,901301080,2005-2009,All,Ratio to State Median,Margins of Error,0.13
4,Andover,901301080,2005-2009,American Indian and Alaska Native Alone,Number,Median Household Income,-9999.0


The data about infection rates looks clean and easy enough to merge with the primary dataset. The data about median income households is a little less straightforward, and from the outset, it looks like I definitely need to do some row and column manipulation to get the desired values.

I'll start with wrangling the infection rates dataset and merging it with the primary dataset.

In [52]:
# rename columns for convenience
infection_data = infection_data.rename(columns={
    'Last update date': 'update_date',
    'Town': 'town',
    'Total cases ': 'total_cases'
})

inf_df1 = infection_data.loc[:, ['update_date', 'town', 'total_cases']]
display(inf_df1.head())

Unnamed: 0,update_date,town,total_cases
0,01/17/2021,Andover,118
1,01/17/2021,Ansonia,1236
2,01/17/2021,Ashford,158
3,01/17/2021,Avon,614
4,01/17/2021,Barkhamsted,115


In [53]:
# convert 'update_date' column to date datatype
# offset every date to the last date of the month
# then filter to get the highest number of cases for each month
inf_df1['update_date'] = pd.to_datetime(inf_df1['update_date'])
inf_df1['month'] = inf_df1['update_date'] + pd.offsets.MonthEnd(0)
inf_df1 = inf_df1.groupby([pd.Grouper(key='month', freq='M'), 'town'])['total_cases'].idxmax().reset_index()

display(inf_df1.head(20))

Unnamed: 0,month,town,total_cases
0,2020-03-31,Andover,11401
1,2020-03-31,Ansonia,37011
2,2020-03-31,Ashford,36741
3,2020-03-31,Avon,37013
4,2020-03-31,Barkhamsted,35858
5,2020-03-31,Beacon Falls,37015
6,2020-03-31,Berlin,36576
7,2020-03-31,Bethany,37017
8,2020-03-31,Bethel,37018
9,2020-03-31,Bethlehem,36748


In [167]:
# find how many unique values exist in 'Variable' column
# since it has all the income data
#income_data['Variable'].unique()

# we want to pivot the income dataset so that we can isolate the
# 'Median Household Income' value and make it it's own column

# in new 'idx' col, rename every other row to be as the one above it
inc_tmp1 = income_data.set_index(pd.Index([i//2 for i in range(len(income_data))]))
inc_tmp1 = inc_tmp1.rename_axis('idx').reset_index()

# pivot to isolate values in 'Variable' column
inc_tmp1 = inc_tmp1.pivot(index='idx', columns='Variable', values='Value')

repeat each row but fix indices
inc_tmp1 = inc_tmp1.loc[inc_tmp1.index.repeat(2)].reset_index(drop=True)

# merge tables laterally, then drop every other row
inc_tmp2 = pd.concat([income_data, inc_tmp1], axis=1)
inc_tmp2 = inc_tmp2.iloc[::2]

income_data2 = inc_df1.rename(columns={
    'Town': 'town',
    'Year': 'year',
    'Race/Ethnicity': 'demographics',
    'Median Household Income': 'med_hsehld_income'
})
inc_df1 = income_data2.loc[:, ['town', 'year', 'demographics', 'med_hsehld_income']].reset_index()
inc_df1 = inc_df1.drop('index', axis=1)
inc_df1.head()

Unnamed: 0,town,year,demographics,med_hsehld_income
0,Andover,2005-2009,All,84757.0
1,Andover,2005-2009,All,1.25
2,Andover,2005-2009,American Indian and Alaska Native Alone,-9999.0
3,Andover,2005-2009,American Indian and Alaska Native Alone,-9999.0
4,Andover,2005-2009,Asian Alone,250001.0


I decided to create a sequence of doubly-repeating indices to get a single column from the 2 unique values in 'Variable', instead of two rows with `NaN` in both of them where the other's value was supposed to be.

By this point, we have the relevant columns from the income and infections dataset that we could merge with the primary dataset. Unfortunately, the primary dataset with residential housing sales does not have a date variable that we definitely need to merge with the other two.

My solution at the moment is to instead get the necessary rows from the housing dataset and concatenate them with the infections and income datasets appropriately.