### CT Statewide House Sales Transactions
This notebook is exploring the data from https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-2001-2016/5mzw-sjtu?category=Housing-and-Development , which lists CT statewide sales transactions on individual properties from 2001-2016.
* Note: this link includs a PDF that explains each of the NonUseCodes.

My GitHub repository: https://github.com/jamiekasulis/ct_real_estate_sales, where you can view
* The raw data file
* The notebook I used to clean that file
* The clean data files that I analyze here

### Potential Inquiries
* See where properties with multiple transactions gained and lost value --> (How many houses, how much value, by town, over different periods of time)
* Foreclosures-- where and when have there been a lot?
* Building of new developments? (Might be shown by selling many houses in a short period of time on a new road)
* Signs of house flipping, i.e. a purchase and a sale for significantly more within a short period
* Has recovery been different for different segments of the market (different price-range houses)?
* Are there observable effects of the crumbling fundations in the northeastern part of CT?

### Calculations
* Adjust sales prices for inflation/season -- there is a Python package for seasonal adjustment
* Take a close look at the assessment column
* Each town's assessment rate, or look at a hosue's sale ratio relative to its town only
* Distribution of house prices in given towns, or on given streets
* Town-by-town medians, ranges

### Themes
* Variability
* Recovery

In [1]:
import pandas as pd

#### Update MAX_YEAR when new data comes out.

In [2]:
MIN_YEAR = 2001
MAX_YEAR = 2016 # update this when new data comes out

In [9]:
def combine_data_into_master_df():
    """
    Returns a DataFrame which combines all of the clean CSVs for each year.
    """
    # Combine the year-by-year clean csvs, which are located at 'data/clean_data_20xx_listings.csv'
    df = pd.read_csv('data/clean_data_' + str(MIN_YEAR) + '_listings.csv')
    for year in range(MIN_YEAR+1, MAX_YEAR+1):
        df = df.append(pd.read_csv('data/clean_data_' + str(year) + '_listings.csv'))

    # Now remove the index column
    #df = df.drop('Unnamed: 0', 1)
    return df

In [11]:
df = combine_data_into_master_df()
df.head()

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,684,10173,2001,04/17/2002 12:00:00 AM,Ansonia,1-3 EAGLE STREET,63630,116000.0,54.853448,Residential,Two Family,0,
1,694,10005,2001,10/04/2001 12:00:00 AM,Ansonia,1 CRESTWOOD ROAD,76370,160000.0,47.73125,Residential,Single Family,0,
2,697,10253,2001,06/18/2002 12:00:00 AM,Ansonia,1 DAVIES CT,97720,180000.0,54.288889,Residential,Single Family,0,
3,698,10094,2001,01/17/2002 12:00:00 AM,Ansonia,1 DOREL TER,110600,259900.0,42.554829,Residential,Single Family,0,
4,710,10100,2001,01/30/2002 12:00:00 AM,Ansonia,1 JAMES STREET,63210,132000.0,47.886364,Residential,Single Family,0,


In [12]:
df.tail()

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
49768,813238,160015,2016,10/25/2016 12:00:00 AM,Woodbury,UPPER GRASSY HILL ROAD(66-4),3500,115000.0,0.030435,Vacant Land,,28,PURCHASED LAND IN OPEN SPACE FROM UNCLE
49769,813243,160085,2016,02/17/2017 12:00:00 AM,Woodbury,WASHINGTON ROAD,1540,77000.0,0.02,Vacant Land,,14,SALE WITH ID: 143600
49770,813244,160153,2016,06/20/2017 12:00:00 AM,Woodbury,WASHINGTON ROAD,68230,72500.0,0.941103,Vacant Land,,-1,UNIQUE ID: 174200
49771,813245,160112,2016,04/24/2017 12:00:00 AM,Woodbury,WEEKEEPEEMEE ROAD,186430,150000.0,1.242867,Vacant Land,,10,ID 208400
49772,813263,160047,2016,12/30/2016 12:00:00 AM,Woodbury,WHITE DEER ROCKS ROAD (8-12D),5090,50000.0,0.1018,Vacant Land,,28,OPEN SPACE


In [285]:
# Make sure the IDs match the right rows from the original raw file
df.sort_values('ID') # Looks good

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,1,14046,2014,09/29/2015 12:00:00 AM,Andover,US ROUTE 6 M 33 B 36 L 22,10720,75000.0,0.142933,Vacant Land,,,
0,2,900035,2009,07/20/2010 12:00:00 AM,Andover,1 DOGWOOD DRIVE,55600,99000.0,0.561616,Vacant Land,,,
1,3,14011,2014,01/14/2015 12:00:00 AM,Andover,1 JUROVATY LANE,153100,190000.0,0.805789,Residential,Single Family,,
0,4,80009,2008,01/21/2009 12:00:00 AM,Andover,1 ROSE LANE,116600,138900.0,0.839453,Residential,Single Family,,
0,5,15006,2015,11/30/2015 12:00:00 AM,Andover,1 ROSE LANE,102900,50000.0,2.058000,Residential,Single Family,14 - Foreclosure,PROPERTY WAS OWNED BY THE BANK
0,6,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA ROAD,91800,189900.0,48.341232,Residential,Single Family,0,
1,7,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA ROAD,91800,189900.0,48.340000,Residential,Single Family,0,
0,8,30047,2003,04/19/2004 12:00:00 AM,Andover,10 CHESTER BRKS LANE,56600,80000.0,70.750000,Vacant Land,,0,
0,9,40003,2004,10/18/2004 12:00:00 AM,Andover,10 CHESTER BRKS LANE,194100,446639.0,43.457916,Residential,Single Family,7,
0,10,70005,2007,11/19/2007 12:00:00 AM,Andover,10 CHESTER BROOKS LANE,313400,425000.0,0.737412,Residential,Single Family,,


### Notes on the data frames
* Use DataFrame 'df' if you want to look at ALL of the data
* Use 'use_df' if you want to just look at the rows that don't have NonUseCodes. You should use use_df if you are calculating any statistics.
* Use 'res_df' if you want to look at RESIDENTIAL properties (but not condos or apartments) that don't have NonUseCodes. Best way to observe the real estate market in general.

In [287]:
def get_residential(df):
    """
    Returns just the residential properties.
    """
    return df[df['PropertyType'] == 'Residential']

def get_commercial(df):
    """
    Returns just the commercial properties.
    """
    return df[df['PropertyType'] == 'Commercial']

In [324]:
#use_df = df[df['NonUseCode'] == ] Finish after CSVs are updated
res_df = get_residential(use_df)

### Foreclosures-- where and when have there been a lot?
Note: In order for this analysis to be truthfully valuable, you should do some kind of adjustment. This might be very complicated. (Adjusting by population fails to weigh folks living in apartments, for example.)

In [289]:
# How many foreclosures on properties 2001-2016?
foreclosures = df[df['NonUseCode'] == 14]
print("Number of properties that have been foreclosed on: %d" %len(foreclosures['Address'].unique()))

Number of properties that have been foreclosed on: 26736


#### Make a function to return a DataFrame that ranks towns by most foreclosures.

In [290]:
def count_foreclosures_by_town(df, town, years=(MIN_YEAR, MAX_YEAR), property_type = 'Residential'):
    """
    Returns a subset of df of foreclosed properties from town.
    Is a helper function for make_foreclosures_by_town_dict()
    """
    # Make sure proper arguments passed
    if property_type not in ['Residential', 'Commercial', 'All']:
        raise Exception("Not a valid property_type.")
        return
    if years[0] < MIN_YEAR or years[1] > MAX_YEAR:
        raise Exception("Not a valid year range")
        return
    
    else:
        # Convert year range to a list of all years in that range
        years = list(range(years[0], years[1]+1))
    
    subset = df[(df['Town'] == town) & (df['ListYear'].isin(years)) & (df['NonUseCode'] == 14)]
    return subset

In [291]:
def make_foreclosures_by_town_dict(df, years=(MIN_YEAR, MAX_YEAR), property_type = 'Residential'):
    """
    Returns a dictionary with keys = town name and value = number of foreclosures.
    This function feeds make_rankings_df_from_dict.
    """
    foreclosure_ranks_by_town = {}
    for town in df['Town'].unique():
        if town not in foreclosure_ranks_by_town.keys():
            foreclosure_ranks_by_town[town] = len(count_foreclosures_by_town(df, town, years, property_type))
    return foreclosure_ranks_by_town

In [292]:
def make_rankings_df_from_dict(dictionary, value_name):
    """
    Makes a dataframe out of a dictionary. First column name is 'Town' and second is value_name.
    dictionary should be produced by an explicit call to make_foreclosures_by_town_dict.
    """
    df = pd.DataFrame.from_dict(data=dictionary, orient='index') # make initial df
    df['Town'] = df.index # pull index out into its own column
    df.columns = [value_name, 'Town'] # Set column names
    df = df[['Town', value_name]] # Reorder the columns
    
    df = df.sort_values(value_name, ascending=False, ) # Sort
    df = df.reset_index(drop=True)
    
    return df

In [293]:
def rank_towns_by_foreclosure_count(df, years=(MIN_YEAR, MAX_YEAR), property_type = 'Residential'):
    """
    Returns a dataframe of each town and the number of foreclosures they experienced in the range 'years'.
    Descending order. Index+1 can serve as the rank.
    The only function you should have to explicitly call to do this. All helper functions are called within.
    """
    return make_rankings_df_from_dict(make_foreclosures_by_town_dict(df), 'Foreclosures')

In [294]:
# Dataframe of foreclosure rankings by town, all years (MIN_YEAR to MAX_YEAR)
fc_rankings_town_all_years = rank_towns_by_foreclosure_count(df)

In [295]:
fc_rankings_town_all_years.head()

Unnamed: 0,Town,Foreclosures
0,Bridgeport,2912
1,Waterbury,2511
2,New Britain,1237
3,New Haven,1124
4,Meriden,990


### Omit properties with a NonUseCode
These ones should not be used in calculations.
__NOTE:__ Delete this section after the CSVs have been regenerated. use_df is redefined above.

In [296]:
use_df = df[df['NonUseCode'].astype(str) == "nan"]
use_df.head()

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
69324,683453,40010,2004,05/31/2005 12:00:00 AM,Union,334 STREETICKNEY HILL,199890,374900.0,0.0,Residential,Single Family,,
0,686,60235,2006,09/14/2007 12:00:00 AM,Ansonia,1-3 WOODLAWN AVE,113300,163000.0,0.695092,Residential,Three Family,,
1,693,60249,2006,09/26/2007 12:00:00 AM,Ansonia,1 CONDON DRIVE,149100,335000.0,0.445075,Residential,Single Family,,
3,714,60203,2006,07/31/2007 12:00:00 AM,Ansonia,1 LAROVERA TERRACE,177000,387000.0,0.457364,Residential,Single Family,,
5,730,68052,2006,11/06/2006 12:00:00 AM,Ansonia,1 THOMAS STREET,97000,243000.0,0.399177,Residential,Single Family,,addl remarks


### Grab just residential, non-condo/apartment homes
__NOTE:__ Delete this (short) section after the CSVs have been regenerated. res_df will be assigned above.

In [297]:
res_df = get_residential(use_df)

### Calculate town-by-town statistics
* Median assessed value
* Median sale amount
* Median sales ratio

### Dataframes:
* town_df for all residential properties without NonUseCodes in time range MINYEAR - MAXYEAR

In [373]:
# Make a dataframe with each town. We will store summary statistics in this dataframe.
town_df = pd.DataFrame(columns=['Town', 'MedianAssessedValue', 'MedianSaleAmount', 'MedianSalesRatio',
                               'MinAssessedValue', 'MaxAssessedValue', 'MinSaleAmount', 'MaxSaleAmount',
                               'MinSalesRatio', 'MaxSalesRatio'])
town_df['Town'] = df['Town'].unique()

In [374]:
town_df.head()

Unnamed: 0,Town,MedianAssessedValue,MedianSaleAmount,MedianSalesRatio,MinAssessedValue,MaxAssessedValue,MinSaleAmount,MaxSaleAmount,MinSalesRatio,MaxSalesRatio
0,Ansonia,,,,,,,,,
1,Ashford,,,,,,,,,
2,Avon,,,,,,,,,
3,Barkhamsted,,,,,,,,,
4,Berlin,,,,,,,,,


In [381]:
def calculate_median_for_town(source_df, town, column, residential=True):
    """
    Calculates the median value of 'column' for a given town from df.
    NOTE: This is the median assessed value for SOLD properties. Will be
    different from the median assessed value for ALL properties.
    
    source_df should omit properties with a NonUseCode.
    """
    town_df = source_df[source_df['Town'] == town]
    
    if residential==True:
        town_df = get_residential(town_df)
    
    return town_df[column].median()

In [382]:
def calculate_medians_for_all_towns(town_df, source_df, column, residential=True):
    """
    Calculates the median value of 'column' for ALL unique towns in df.
    Returns this information as a DataFrame
    """
    median_column_name = 'Median' + column
    new_town_df = town_df.copy()
    for index in new_town_df.index:
        town = new_town_df.loc[index, 'Town']
        new_town_df.loc[index, median_column_name] = calculate_median_for_town(source_df, town, column, residential)
    
    return new_town_df

In [383]:
town_df = calculate_medians_for_all_towns(town_df, use_df, 'AssessedValue').head()

In [384]:
town_df.head()

Unnamed: 0,Town,MedianAssessedValue,MedianSaleAmount,MedianSalesRatio,MinAssessedValue,MaxAssessedValue,MinSaleAmount,MaxSaleAmount,MinSalesRatio,MaxSalesRatio
0,Ansonia,128900,,,,,,,,
1,Ashford,137700,,,,,,,,
2,Avon,282610,,,,,,,,
3,Barkhamsted,159900,,,,,,,,
4,Berlin,172400,,,,,,,,


In [385]:
town_df = calculate_medians_for_all_towns(town_df, use_df, 'SaleAmount').head()
town_df = calculate_medians_for_all_towns(town_df, use_df, 'SalesRatio').head()

In [386]:
town_df.head()

Unnamed: 0,Town,MedianAssessedValue,MedianSaleAmount,MedianSalesRatio,MinAssessedValue,MaxAssessedValue,MinSaleAmount,MaxSaleAmount,MinSalesRatio,MaxSalesRatio
0,Ansonia,128900,195750,0.691169,,,,,,
1,Ashford,137700,213000,0.7,,,,,,
2,Avon,282610,412000,0.699646,,,,,,
3,Barkhamsted,159900,250000,0.660984,,,,,,
4,Berlin,172400,258000,0.683516,,,,,,


In [387]:
# Let's check to make sure the numbers match.
berlin_sales = res_df[(res_df['Town'] == 'Berlin') & (res_df['PropertyType'] == 'Residential')]
print(berlin_sales['AssessedValue'].median())
print(berlin_sales['SaleAmount'].median())
print(berlin_sales['SalesRatio'].median())

172400.0
258000.0
0.6835164835164841


In [388]:
avon_sales = res_df[(res_df['Town'] == 'Avon') & (res_df['PropertyType'] == 'Residential')]
print(avon_sales['AssessedValue'].median())
print(avon_sales['SaleAmount'].median())
print(avon_sales['SalesRatio'].median())

282610.0
412000.0
0.699645645645646


### Calculate SalesRatio
If my calculation of it is the same, then they use their SaleAmount to calculate it. Otherwise, they are using a market estimation.

*Data page defines SalesRatio as 'Ratio of the sale price to the assessed value.' I want to know if 'sales price' refers to SalesAmount or not.*

In [389]:
res_df[['Address', 'AssessedValue', 'SaleAmount', 'SalesRatio']].head(5)

Unnamed: 0,Address,AssessedValue,SaleAmount,SalesRatio
69324,334 STREETICKNEY HILL,199890,374900.0,0.0
0,1-3 WOODLAWN AVE,113300,163000.0,0.695092
1,1 CONDON DRIVE,149100,335000.0,0.445075
3,1 LAROVERA TERRACE,177000,387000.0,0.457364
5,1 THOMAS STREET,97000,243000.0,0.399177


I'll calculate SalesRatio myself and compare them side-by-side.

In [390]:
res_df_testing = res_df.copy()
res_df_testing['MySalesRatio'] = res_df_testing['AssessedValue'] / res_df_testing['SaleAmount']
res_df_testing[['SalesRatio', 'MySalesRatio']]

Unnamed: 0,SalesRatio,MySalesRatio
69324,0.000000,0.533182
0,0.695092,0.695092
1,0.445075,0.445075
3,0.457364,0.457364
5,0.399177,0.399177
6,0.487727,0.487727
7,0.295909,0.295909
8,0.455556,0.455556
9,0.465730,0.465730
10,0.409362,0.409362


For some reason, the first row has a SalesRatio of 0 when it doesn't seem like it should be. Could be an error when the data was entered.

### Calculate minimum and maximum sale amounts for each town

In [391]:
def get_town_min(source_df, town_df, town, column):
    """
    Finds the row with the minimum value of column for a given town in source_df.
    Saves this to town_df (as a separate copy).
    
    source_df should be residential properties w/o NonUseCodes if you are using res_df.
    """
    new_town_df = town_df.copy()
    just_this_town = source_df[source_df['Town'] == town] # get the data for just this town
    
    min_val = just_this_town[column] # a list of values to find the minimum from
    min_val = min_val.min() # Grab the minimum column value
    min_row = just_this_town[just_this_town['SaleAmount'] == min_val] # Grab the row
    return min_row

In [392]:
get_town_min(res_df, town_df, 'Berlin', 'SaleAmount')

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
1076,14948,140059,2014,12/04/2014 12:00:00 AM,Berlin,23 JEFFREY LANE,220400,4000.0,55.1,Residential,Single Family,,


In [393]:
def get_town_max(source_df, town_df, town, column):
    """
    Finds the row with the maximum value of column for a given town in source_df.
    Saves this to town_df (as a separate copy).
    
    source_df should be residential properties w/o NonUseCodes if you are using res_df.
    """
    new_town_df = town_df.copy()
    just_this_town = source_df[source_df['Town'] == town] # get the data for just this town
    
    max_val = just_this_town[column] # a list of values to find the minimum from
    max_val = max_val.max() # Grab the minimum column value
    max_row = just_this_town[just_this_town['SaleAmount'] == max_val] # Grab the row
    return max_row

In [394]:
get_town_max(res_df, town_df, 'Berlin', 'SaleAmount')

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
635,16666,130084,2013,12/31/2013 12:00:00 AM,Berlin,66 QUINCY TRAIL,542900,700000.0,0.776,Residential,Single Family,,


In [395]:
def calculate_town_mins_and_maxes(source_df, town_df, column):
    """
    Will use the data from source_df to calculate mins and maxes for column for every town.
    """
    new_town_df = town_df.copy()
    
    # Calculate min and max for each town
    for index in town_df.index:
        max_val = get_town_max(source_df, town_df, town_df.loc[index, 'Town'], column)[column].iloc[0]
        min_val = get_town_min(source_df, town_df, town_df.loc[index, 'Town'], column)[column].iloc[0]
    
        # Add to new_town_df
        min_column_name = 'Min' + column
        max_column_name = 'Max' + column
        new_town_df.loc[index, min_column_name] = min_val
        new_town_df.loc[index, max_column_name] = max_val
    
    return new_town_df

In [405]:
town_df = calculate_town_mins_and_maxes(res_df, town_df, 'SaleAmount')
town_df.head(5)

Unnamed: 0,Town,MedianAssessedValue,MedianSaleAmount,MedianSalesRatio,MinAssessedValue,MaxAssessedValue,MinSaleAmount,MaxSaleAmount,MinSalesRatio,MaxSalesRatio
0,Ansonia,128900,195750,0.691169,,,20100,645000.0,,
1,Ashford,137700,213000,0.7,,,40000,495000.0,,
2,Avon,282610,412000,0.699646,,,7500,5000000.0,,
3,Barkhamsted,159900,250000,0.660984,,,16000,700000.0,,
4,Berlin,172400,258000,0.683516,,,4000,700000.0,,


In [406]:
calculate_town_mins_and_maxes(res_df, town_df, 'AssessedValue')

IndexError: single positional indexer is out-of-bounds

In [407]:
calculate_town_mins_and_maxes(res_df, town_df, 'SalesRatio')

IndexError: single positional indexer is out-of-bounds