# The Impact of the Coronavirus Pandemic on Connecticut's Residential Real Estate Market

I'll be using this file to do all the necessary data wrangling, analysis and plotting with Python, and maybe even take a few notes that will go into the final paper.

#### Objectives
- [ ] import dataset
- [ ] add data about aggregate infection numbers in Connecticut
- [ ] if poverty rates found, add
- [ ] describe where we got the data from
- [ ] show table summary
- [ ] answer questions using plots and graphs

#### Questions
The background goes over the various questions we want to answer with this paper. Specifically, we want to investigate how whatever demographic data we can find (income in particular), along with infection rates, affected the residential real estate market in CT. A few questions we have to start with include the following:
1. What do the sale prices in each quartile look like?
2. What is the trend in sales from 2016 to 2018?
3. How does infection affect sales in each quartile?
4. What cities show the largest changes? Why do we think that was?

We might have to define the quartiles.
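
One possible way to define the quartiles is to bin sale prices with `pd.qcut`. The sketch below runs on made-up prices, not the actual dataset; `sale_amount` matches the column name in the sales table, while `price_quartile` and the `Q1` to `Q4` labels are names chosen here for illustration.

```python
import pandas as pd

# toy sale prices (made up for illustration, not taken from the dataset)
sales = pd.DataFrame({
    'sale_amount': [100_000, 150_000, 200_000, 250_000,
                    300_000, 400_000, 600_000, 900_000]
})

# split the prices into four bins holding an equal share of the sales
sales['price_quartile'] = pd.qcut(
    sales['sale_amount'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4']
)

# summarize the price range covered by each quartile
quartile_summary = (
    sales.groupby('price_quartile', observed=True)['sale_amount']
    .agg(['min', 'median', 'max'])
)
print(quartile_summary)
```

Computing the bins over the whole period keeps the quartile boundaries fixed across years; computing them per year instead would be a different, equally defensible choice.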

In [2]:
# install dependencies
# !pip install --upgrade pip

In [3]:
# import modules
from IPython.display import display

import pandas as pd

# import datasets
data = pd.read_csv('./data/CTRRE_2011-2021.csv')
infection_data = pd.read_csv('./data/COVID19_Tests_Cases_Deaths_by_Town-ARCHIVE.csv')
income_data = pd.read_csv('./data/median-household-income-town-2020.csv')

# preview primary dataset
display(data.head())

Unnamed: 0,list_year,town,population,residential_type,month,year,in_pandemic,assessed_value,sale_amount,price_index,norm_assessed_value,norm_sale_amount,norm_sales_ratio,latitude,longitude
0,2020,Ashford,4193,Single Family,10,2020,1,253000,430000.0,254.076,99576.5,169240.7,0.588372,41.8731,-72.1216
1,2020,Avon,18821,Condo,3,2021,1,130400,179900.0,258.935,50360.13,69476.9,0.724847,41.8096,-72.8305
2,2020,Avon,18821,Single Family,4,2021,1,619290,890000.0,261.237,237060.6,340686.81,0.695831,41.8096,-72.8305
3,2020,Avon,18821,Single Family,7,2021,1,862330,1447500.0,267.789,322018.45,540537.51,0.595737,41.8096,-72.8305
4,2020,Avon,18821,Single Family,12,2020,1,847520,1250000.0,254.081,333562.92,491969.1,0.678016,41.8096,-72.8305


We want to add the 2016 to 2020 median household income for every town in Connecticut, as well as the relevant infection numbers for the state.

In [4]:
# preview infection dataset
display(infection_data.tail())

Unnamed: 0,Last update date,Town number,Town,Total cases,Confirmed cases,Probable cases,Case rate,Total deaths,Confirmed deaths,Probable deaths,People tested,Rate tested per 100k,Number of tests,Number of positives,Number of negatives,Number of indeterminates
102240,06/24/2022,165,Windsor Locks,2702,2507.0,195.0,21021.0,44,40.0,4.0,10507.0,81741.0,44778.0,3347.0,41376.0,55.0
102241,06/24/2022,166,Wolcott,4089,3557.0,532.0,24652.0,64,54.0,10.0,14324.0,86357.0,71416.0,5163.0,66148.0,105.0
102242,06/24/2022,167,Woodbridge,1699,1519.0,180.0,19417.0,48,42.0,6.0,8668.0,99063.0,50559.0,2115.0,48373.0,71.0
102243,06/24/2022,168,Woodbury,1646,1367.0,279.0,17323.0,12,10.0,2.0,7918.0,83330.0,34243.0,2023.0,32192.0,28.0
102244,06/24/2022,169,Woodstock,1440,1391.0,49.0,18325.0,7,7.0,0.0,6740.0,85772.0,23654.0,1680.0,21945.0,29.0


In [5]:
# preview dataset with median household income
display(income_data.head())

Unnamed: 0,Town,FIPS,Year,Race/Ethnicity,Measure Type,Variable,Value
0,Andover,901301080,2005-2009,All,Number,Median Household Income,84757.0
1,Andover,901301080,2005-2009,All,Number,Margins of Error,9003.0
2,Andover,901301080,2005-2009,All,Ratio to State Median,Median Household Income,1.25
3,Andover,901301080,2005-2009,All,Ratio to State Median,Margins of Error,0.13
4,Andover,901301080,2005-2009,American Indian and Alaska Native Alone,Number,Median Household Income,-9999.0


The data about infection rates looks clean and easy enough to merge with the primary dataset. The data about median household income is less straightforward: it stores several measures per town as separate rows, so I will need to do some row and column manipulation to get the desired values.

I'll start with wrangling the infection rates dataset and merging it with the primary dataset.

In [6]:
# rename columns for convenience
infection_data = infection_data.rename(columns={
    'Last update date': 'update_date',
    'Town': 'town',
    'Total cases ': 'total_cases'
})

inf_df1 = infection_data.loc[:, ['update_date', 'town', 'total_cases']]
display(inf_df1.head())

Unnamed: 0,update_date,town,total_cases
0,01/17/2021,Andover,118
1,01/17/2021,Ansonia,1236
2,01/17/2021,Ashford,158
3,01/17/2021,Avon,614
4,01/17/2021,Barkhamsted,115


In [7]:
# convert 'update_date' column to date datatype
# offset every date to the last date of the month
# then filter to get the highest number of cases for each month
inf_df1['update_date'] = pd.to_datetime(inf_df1['update_date'])
inf_df1['month'] = inf_df1['update_date'] + pd.offsets.MonthEnd(0)
inf_df1 = inf_df1.groupby([pd.Grouper(key='month', freq='M'), 'town'])['total_cases'].idxmax().reset_index()

display(inf_df1.head(20))

Unnamed: 0,month,town,total_cases
0,2020-03-31,Andover,11401
1,2020-03-31,Ansonia,37011
2,2020-03-31,Ashford,36741
3,2020-03-31,Avon,37013
4,2020-03-31,Barkhamsted,35858
5,2020-03-31,Beacon Falls,37015
6,2020-03-31,Berlin,36576
7,2020-03-31,Bethany,37017
8,2020-03-31,Bethel,37018
9,2020-03-31,Bethlehem,36748
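
The `total_cases` values in the output above (for example 37011 for Ansonia in March 2020) are row labels, not case counts, because `idxmax()` returns the index label at which the largest value occurs. A minimal sketch on made-up numbers, assuming the goal is the highest cumulative count per town and month, compares the two aggregations:

```python
import pandas as pd

# toy cumulative case counts (made up), two reports per town in one month;
# the index values stand in for row labels of the raw table
toy = pd.DataFrame({
    'update_date': pd.to_datetime(['2020-03-24', '2020-03-31',
                                   '2020-03-24', '2020-03-31']),
    'town': ['Andover', 'Andover', 'Avon', 'Avon'],
    'total_cases': [1, 3, 5, 12],
}, index=[500, 501, 502, 503])

# snap each date to the last day of its month
toy['month'] = toy['update_date'] + pd.offsets.MonthEnd(0)

grouped = toy.groupby(['month', 'town'])['total_cases']

# idxmax() gives the row label where the maximum occurs
labels = grouped.idxmax().reset_index()

# max() gives the maximum case count itself
monthly_max = grouped.max().reset_index()

print(labels)
print(monthly_max)
```

Since `month` is already snapped to month-end, grouping on the plain column is enough here and no `pd.Grouper` is needed.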


In [8]:
# find how many unique values exist in 'Variable' column
# since it has all the income data
#income_data['Variable'].unique()

# we want to pivot the income dataset so that we can isolate the
# 'Median Household Income' value and make it its own column

# in a new 'idx' column, give each consecutive pair of rows the same value
inc_tmp1 = income_data.set_index(pd.Index([i//2 for i in range(len(income_data))]))
inc_tmp1 = inc_tmp1.rename_axis('idx').reset_index()

# pivot to isolate values in 'Variable' column
inc_tmp1 = inc_tmp1.pivot(index='idx', columns='Variable', values='Value')

# repeat each row but fix indices
inc_tmp1 = inc_tmp1.loc[inc_tmp1.index.repeat(2)].reset_index(drop=True)

# merge tables laterally, then drop every other row
inc_tmp2 = pd.concat([income_data, inc_tmp1], axis=1)
inc_tmp2 = inc_tmp2.iloc[::2]

income_data2 = inc_tmp2.rename(columns={
    'Town': 'town',
    'Year': 'year',
    'Race/Ethnicity': 'demographics',
    'Median Household Income': 'med_hsehld_income'
})
inc_df1 = income_data2.loc[:, ['town', 'year', 'demographics', 'med_hsehld_income']].reset_index()
inc_df1 = inc_df1.drop('index', axis=1)
inc_df1.head()

Unnamed: 0,town,year,demographics,med_hsehld_income
0,Andover,2005-2009,All,84757.0
1,Andover,2005-2009,All,1.25
2,Andover,2005-2009,American Indian and Alaska Native Alone,-9999.0
3,Andover,2005-2009,American Indian and Alaska Native Alone,-9999.0
4,Andover,2005-2009,Asian Alone,250001.0


I decided to create a sequence of doubly-repeating row numbers so that each pair of rows pivots into a single row with one column per unique value in 'Variable'. Pivoting on the original index would instead give two rows per pair, each holding `NaN` where the other row's value was supposed to be.
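
An alternative to the paired-index pivot, sketched here on a toy table built from the first four rows of the income preview, is to filter rows directly with boolean masks. Restricting `Measure Type` to `Number` is an assumption about what the analysis needs: it keeps dollar amounts and drops the ratio-to-state-median rows. The names `toy_income` and `toy_inc` are made up for this example.

```python
import pandas as pd

# the four 'All' rows for Andover from the income preview
toy_income = pd.DataFrame({
    'Town': ['Andover'] * 4,
    'Year': ['2005-2009'] * 4,
    'Race/Ethnicity': ['All'] * 4,
    'Measure Type': ['Number', 'Number',
                     'Ratio to State Median', 'Ratio to State Median'],
    'Variable': ['Median Household Income', 'Margins of Error'] * 2,
    'Value': [84757.0, 9003.0, 1.25, 0.13],
})

# keep only dollar-valued median household income rows
mask = (
    (toy_income['Variable'] == 'Median Household Income')
    & (toy_income['Measure Type'] == 'Number')
)

toy_inc = (
    toy_income.loc[mask, ['Town', 'Year', 'Race/Ethnicity', 'Value']]
    .rename(columns={
        'Town': 'town',
        'Year': 'year',
        'Race/Ethnicity': 'demographics',
        'Value': 'med_hsehld_income',
    })
    .reset_index(drop=True)
)
print(toy_inc)
```

Unlike the paired-index approach, this does not rely on the rows alternating strictly between the two `Variable` values.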

By this point, we have the relevant columns from the income and infections datasets that we could merge with the primary dataset. Unfortunately, the primary dataset with residential housing sales does not have a single date variable, which we definitely need in order to merge it with the other two.

My solution at the moment is to instead get the necessary rows from the housing dataset and concatenate them with the infections and income datasets appropriately.

Let's get a quick summary of both the infections and income tables so we know how to proceed with matching them to the sales dataset. You might have already noticed that there are quite a few unreasonable numbers in the income table, such as -9999. We hope not to lose too much data from this table.

In [9]:
# summarize the covid cases table
cov_summary = inf_df1.describe()
display(cov_summary)

Unnamed: 0,month,total_cases
count,4732,4732.0
mean,2021-05-15 23:59:59.999999744,51965.761834
min,2020-03-31 00:00:00,181.0
25%,2020-10-23 06:00:00,23045.5
50%,2021-05-15 12:00:00,54369.5
75%,2021-12-07 18:00:00,78397.25
max,2022-06-30 00:00:00,102243.0
std,,31197.901314


In [10]:
# summarize the median household income table
inc_summary = inc_df1.describe()
inc_summary

Unnamed: 0,med_hsehld_income
count,40800.0
mean,21562.241111
std,47818.093792
min,-9999.0
25%,-9999.0
50%,1.0
75%,50114.0
max,250001.0
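
The minimum and lower quartile of -9999 point to a placeholder, not a real income. A minimal sketch, assuming -9999 marks suppressed or missing values (an assumption made here, not something confirmed by the dataset's documentation), converts it to `NaN` before summarizing. The toy column reuses two dollar values from the previews above.

```python
import numpy as np
import pandas as pd

# toy income column mixing real values with the -9999 placeholder
toy = pd.DataFrame({
    'med_hsehld_income': [84757.0, -9999.0, 99449.0, -9999.0]
})

# treat the placeholder as missing so it no longer skews the statistics
cleaned = toy['med_hsehld_income'].replace(-9999.0, np.nan)

# describe() skips NaN, so count, mean and quantiles use real values only
print(cleaned.describe())
```

The maximum of 250001 may be a similar kind of coded value (a top-coded income), but that is also a guess and would need to be checked against the data source.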


In [11]:
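# create a date column for the month in which each sale was made
# (copy first so the original `data` frame is left untouched)
df1_tmp1 = data.copy()
# zero-pad the month so year + month parses unambiguously with '%Y%m'
year_month = df1_tmp1['year'].astype(str) + df1_tmp1['month'].astype(str).str.zfill(2)
df1_tmp1['month_sold'] = pd.to_datetime(year_month, format='%Y%m')
display(df1_tmp1.head(10))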

Unnamed: 0,list_year,town,population,residential_type,month,year,in_pandemic,assessed_value,sale_amount,price_index,norm_assessed_value,norm_sale_amount,norm_sales_ratio,latitude,longitude,month_sold
0,2020,Ashford,4193,Single Family,10,2020,1,253000,430000.0,254.076,99576.5,169240.7,0.588372,41.8731,-72.1216,2020-10-01
1,2020,Avon,18821,Condo,3,2021,1,130400,179900.0,258.935,50360.13,69476.9,0.724847,41.8096,-72.8305,2021-03-01
2,2020,Avon,18821,Single Family,4,2021,1,619290,890000.0,261.237,237060.6,340686.81,0.695831,41.8096,-72.8305,2021-04-01
3,2020,Avon,18821,Single Family,7,2021,1,862330,1447500.0,267.789,322018.45,540537.51,0.595737,41.8096,-72.8305,2021-07-01
4,2020,Avon,18821,Single Family,12,2020,1,847520,1250000.0,254.081,333562.92,491969.1,0.678016,41.8096,-72.8305,2020-12-01
5,2020,Berlin,20107,Single Family,7,2021,1,412000,677500.0,267.789,153852.47,252997.7,0.608118,41.6215,-72.7457,2021-07-01
6,2020,Bethel,20287,Single Family,12,2020,1,171360,335000.0,254.081,67443.06,131847.72,0.511522,41.3712,-73.414,2020-12-01
7,2020,Bethlehem,3408,Single Family,8,2021,1,168900,352000.0,268.387,62931.51,131153.89,0.47983,41.6404,-73.2058,2021-08-01
8,2020,Bloomfield,21399,Condo,9,2021,1,163730,250000.0,269.086,60846.72,92907.1,0.65492,41.8265,-72.7301,2021-09-01
9,2020,Branford,28230,Single Family,1,2021,1,530500,700000.0,255.296,207798.01,274191.53,0.757857,41.2799,-72.8141,2021-01-01


In [12]:
df1_tmp1 = df1_tmp1.sort_values(by='month_sold')
df1 = df1_tmp1.loc[:, [
    'list_year', 'town', 'population', 'residential_type',
    'assessed_value', 'sale_amount', 'month_sold'
]]
df1.head(10)

Unnamed: 0,list_year,town,population,residential_type,assessed_value,sale_amount,month_sold
175807,2017,Washington,3619,Single Family,2260610,4250000.0,2010-05-01
157600,2017,Bloomfield,21399,Single Family,126000,140000.0,2010-07-01
161780,2017,Bloomfield,21399,Single Family,125580,215000.0,2010-08-01
411306,2010,Cromwell,14252,Single Family,302670,427115.0,2010-10-01
407664,2010,Hamden,61160,Single Family,277340,387500.0,2010-10-01
421587,2010,Killingly,17681,Single Family,113540,155000.0,2010-10-01
414199,2010,Norwalk,90821,Condo,182700,245000.0,2010-10-01
414192,2010,Stonington,18354,Single Family,14400,15700.0,2010-10-01
421594,2010,Stamford,134820,Single Family,622760,790000.0,2010-10-01
427628,2010,Stamford,134820,Single Family,615700,842800.0,2010-10-01


In [13]:
# what does the month column look like in the other datasets
inf_df1.head(10)

Unnamed: 0,month,town,total_cases
0,2020-03-31,Andover,11401
1,2020-03-31,Ansonia,37011
2,2020-03-31,Ashford,36741
3,2020-03-31,Avon,37013
4,2020-03-31,Barkhamsted,35858
5,2020-03-31,Beacon Falls,37015
6,2020-03-31,Berlin,36576
7,2020-03-31,Bethany,37017
8,2020-03-31,Bethel,37018
9,2020-03-31,Bethlehem,36748


In [14]:
# filter the income table to keep only rows from the period '2016-2020'
inc_df1.query("year == '2016-2020'")

Unnamed: 0,town,year,demographics,med_hsehld_income
220,Andover,2016-2020,All,99449.00
221,Andover,2016-2020,All,1.25
222,Andover,2016-2020,American Indian and Alaska Native Alone,-9999.00
223,Andover,2016-2020,American Indian and Alaska Native Alone,-9999.00
224,Andover,2016-2020,Asian Alone,-9999.00
...,...,...,...,...
40795,Woodstock,2016-2020,Two or More Races,-9999.00
40796,Woodstock,2016-2020,White Alone,93182.00
40797,Woodstock,2016-2020,White Alone,1.07
40798,Woodstock,2016-2020,White Alone Not Hispanic or Latino,92302.00


In [15]:
# count the rows whose value is exactly 1.0
(inc_df1['med_hsehld_income'] == 1).sum()

239

It is reasonable to proceed with only income data for the period of 2016-2020.

It also appears that I need to rename the `month_sold` column to `month` so I can merge the covid and sales datasets.

In [16]:
# rename `month_sold` column to make it easy to merge the dataframes
df2_tmp1 = df1.rename(columns={ 'month_sold': 'month' })

# only keep rows from 2016 onward (the sales data runs through 2021)
df2_tmp2 = df2_tmp1[df2_tmp1['month'] >= '2016-01-01']

# merge the covid cases and sales tables
df2_tmp3 = pd.merge(df2_tmp2, inf_df1, on=['month', 'town'], how='outer')
df2_tmp3.loc[df2_tmp3['month'] < '2020-03-01', 'total_cases'] = 0
df2_tmp3

Unnamed: 0,list_year,town,population,residential_type,assessed_value,sale_amount,month,total_cases
0,2015.0,West Hartford,64034.0,Single Family,309750.0,604000.0,2016-01-01,0.0
1,2015.0,West Hartford,64034.0,Two Family,223090.0,395000.0,2016-01-01,0.0
2,2015.0,West Hartford,64034.0,Two Family,237090.0,475000.0,2016-01-01,0.0
3,2015.0,West Hartford,64034.0,Single Family,179690.0,318000.0,2016-01-01,0.0
4,2015.0,West Hartford,64034.0,Condo,161630.0,282500.0,2016-01-01,0.0
...,...,...,...,...,...,...,...,...
272499,,Windsor Locks,,,,,2022-06-30,102240.0
272500,,Wolcott,,,,,2022-06-30,102072.0
272501,,Woodbridge,,,,,2022-06-30,102242.0
272502,,Woodbury,,,,,2022-06-30,102243.0
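
The empty sales columns in the last rows above suggest that the COVID rows did not match any sale: `month` in the sales table is the first day of the month (for example 2020-10-01), while `month` in the COVID table is the last day (for example 2020-03-31). A minimal sketch on toy rows, assuming the sales dates should be snapped to month-end and that a left merge (one row per sale) is what the analysis needs; the case counts are made up:

```python
import pandas as pd

# toy sales rows with first-of-month dates, as in the sales table
sales = pd.DataFrame({
    'town': ['Avon', 'Avon', 'Ashford'],
    'sale_amount': [179900.0, 890000.0, 430000.0],
    'month': pd.to_datetime(['2021-03-01', '2021-04-01', '2019-10-01']),
})

# toy monthly case counts with month-end dates (numbers are made up)
cases = pd.DataFrame({
    'town': ['Avon', 'Avon'],
    'month': pd.to_datetime(['2021-03-31', '2021-04-30']),
    'total_cases': [1000, 1100],
})

# align the sales dates with the month-end convention of the cases table
sales['month'] = sales['month'] + pd.offsets.MonthEnd(0)

# a left merge keeps every sale and attaches cases where a match exists
merged = pd.merge(sales, cases, on=['month', 'town'], how='left')

# sales from before the pandemic get zero cases
merged.loc[merged['month'] < '2020-03-01', 'total_cases'] = 0
print(merged)
```

With `how='outer'`, months that have case reports but no sales would also appear as rows with empty sales columns, which may or may not be wanted.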


In [19]:
# let's merge the income table with the one we just created
# merge on town only: we're using a single income period (2016-2020),
# so `town` is enough to identify the matching income row
#pd.merge(df2_tmp3, inc_df1, on=['town'])
df2_tmp3

Unnamed: 0,list_year,town,population,residential_type,assessed_value,sale_amount,month,total_cases
0,2015.0,West Hartford,64034.0,Single Family,309750.0,604000.0,2016-01-01,0.0
1,2015.0,West Hartford,64034.0,Two Family,223090.0,395000.0,2016-01-01,0.0
2,2015.0,West Hartford,64034.0,Two Family,237090.0,475000.0,2016-01-01,0.0
3,2015.0,West Hartford,64034.0,Single Family,179690.0,318000.0,2016-01-01,0.0
4,2015.0,West Hartford,64034.0,Condo,161630.0,282500.0,2016-01-01,0.0
...,...,...,...,...,...,...,...,...
272499,,Windsor Locks,,,,,2022-06-30,102240.0
272500,,Wolcott,,,,,2022-06-30,102072.0
272501,,Woodbridge,,,,,2022-06-30,102242.0
272502,,Woodbury,,,,,2022-06-30,102243.0
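
The commented-out merge could be sketched as follows, assuming the income table is first reduced to one row per town (here the `All` demographic for 2016-2020). Note that `inc_df1` as built above still carries two `All` rows per town and period, the dollar amount and the ratio, so the ratio rows would need to be dropped first. Passing `validate='many_to_one'` makes pandas raise an error if a town appears more than once on the income side. The sale amounts and the `White Alone` income are toy values.

```python
import pandas as pd

# toy sales rows (sale amounts are made up)
sales = pd.DataFrame({
    'town': ['Andover', 'Andover', 'Avon'],
    'sale_amount': [250000.0, 310000.0, 179900.0],
})

# toy income rows; only the first one is the row we want to keep
income = pd.DataFrame({
    'town': ['Andover', 'Andover', 'Andover'],
    'year': ['2016-2020', '2016-2020', '2005-2009'],
    'demographics': ['All', 'White Alone', 'All'],
    'med_hsehld_income': [99449.0, 98000.0, 84757.0],
})

# reduce the income table to one row per town
income_all = income.query(
    "year == '2016-2020' and demographics == 'All'"
)[['town', 'med_hsehld_income']]

# attach income to every sale; fails loudly if a town is duplicated
merged = pd.merge(
    sales, income_all, on='town', how='left', validate='many_to_one'
)
print(merged)
```

Towns without an income row, such as Avon in this toy example, end up with `NaN` in `med_hsehld_income` and are not dropped.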
