### CT Statewide House Sales Transactions
This notebook produces a cleaned version of the data from https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-2001-2016/5mzw-sjtu?category=Housing-and-Development, which lists statewide CT sales transactions on individual properties from 2001 to 2016.

View the raw data here: https://raw.githubusercontent.com/jamiekasulis/ct_real_estate_sales/master/Real_Estate_Sales_2001-2016.csv

View the meanings of NonUseCodes here: file:///C:/Users/jleekas/Downloads/OPM-RealEstate_Codes.pdf

#### Needed Cleaning
* Trim whitespace and replace double-spaces with single-spaces (done)
* Replace address abbreviations like "LN" with their full form, "LANE" (done)
* Catch misspellings of towns and street names using Python fuzzywuzzy
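
The misspelling step in the list above is not implemented yet. As a rough sketch of the idea, the block below uses `difflib` from the Python standard library as a stand-in for fuzzywuzzy (a swapped-in technique, not the library named above). The function `correct_town`, the `cutoff` value, and the short `known_towns` list are all hypothetical and would need tuning against the real data.

```python
import difflib

def correct_town(name, known_towns, cutoff=0.8):
    """Return the closest known town name, or name unchanged if nothing is close enough."""
    matches = difflib.get_close_matches(name.title(), known_towns, n=1, cutoff=cutoff)
    return matches[0] if matches else name

# Hypothetical reference list; a real one would hold every CT town.
known_towns = ["Andover", "Ansonia", "Ashford"]

print(correct_town("Andovr", known_towns))  # close to "Andover", so it is corrected
print(correct_town("Zzzz", known_towns))    # nothing similar, so it is left alone
```

Returning the input unchanged when no candidate clears the cutoff keeps the step conservative: unfamiliar names are left for manual review rather than being forced onto a wrong town.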

__Note:__ There are several data cleaning notebooks because running the cleaning steps on all years of data at once takes a very long time.

This is where I test my cleaning process on just a small sample of 2000 listings.

In [None]:
import pandas as pd

In [2]:
raw_df = pd.read_csv("https://raw.githubusercontent.com/jamiekasulis/ct_real_estate_sales/master/Real_Estate_Sales_2001-2016.csv")

### Subset: top 2000 listings.

In [3]:
sample = raw_df[0:2000]
len(sample)

2000

### Trim whitespace at the ends and middle of fields.

In [76]:
def trim_whitespace(df, column):
    """
    Removes all leading and trailing whitespace from the strings in column.
    Also collapses every run of two or more spaces into a single space.
    """
    new_df = df.copy()
    
    # Vectorized string methods avoid a slow row-by-row loop and leave missing values (NaN) untouched.
    new_df[column] = new_df[column].str.strip().str.replace(r' {2,}', ' ', regex=True)
    return new_df

In [77]:
new_sample = trim_whitespace(sample, 'Town')
new_sample = trim_whitespace(new_sample, 'Address') # build on the Town-trimmed copy, not the original sample

In [78]:
new_sample.head()

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,1,14046,2014,09/29/2015 12:00:00 AM,Andover,US ROUTE 6 M 33 B 36 L 22,10720,75000.0,0.142933,Vacant Land,,,
1,2,900035,2009,07/20/2010 12:00:00 AM,Andover,1 DOGWOOD DRIVE,55600,99000.0,0.561616,Vacant Land,,,
2,3,14011,2014,01/14/2015 12:00:00 AM,Andover,1 JUROVATY LANE,153100,190000.0,0.805789,Residential,Single Family,,
3,4,80009,2008,01/21/2009 12:00:00 AM,Andover,1 ROSE LANE,116600,138900.0,0.839453,Residential,Single Family,,
4,5,15006,2015,11/30/2015 12:00:00 AM,Andover,1 ROSE LANE,102900,50000.0,2.058,Residential,Single Family,14 - Foreclosure,PROPERTY WAS OWNED BY THE BANK
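
One subtlety of the whitespace trimming above: a single pass of `str.replace('  ', ' ')` only halves a run of spaces, so three spaces in a row still leave a double space behind. The sketch below shows the difference on one string using only the standard library; `collapse_spaces` and the sample address are hypothetical, for illustration only.

```python
import re

def collapse_spaces(text):
    """Strip both ends, then turn every run of two or more spaces into one space."""
    return re.sub(r' {2,}', ' ', text.strip())

messy = "  12   MAIN  ST "

# A single replace pass handles the pair of spaces but not the run of three.
single_pass = messy.strip().replace('  ', ' ')

print(repr(single_pass))
print(repr(collapse_spaces(messy)))
```

The single pass still contains a double space after `12`, while the regular expression version leaves exactly one space between every pair of words.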


In [79]:
import re

street_conversions = {
    ' LN': ' LANE',
    ' RD': ' ROAD',
    ' ST': ' STREET',
    ' DR': ' DRIVE',
    ' PL': ' PLACE',
    ' HL': ' HILL'
}

def convert_address_street_abbreviations(df, conversions):
    """
    Returns a copy of df in which street abbreviations in the 'Address' column are replaced
    by their full names, i.e. "10 CHESTER BROOKS LN" will become "10 CHESTER BROOKS LANE".
    """
    new_df = df.copy()
    addresses = new_df['Address']
    
    for abbreviation, full_name in conversions.items():
        # \b matches the abbreviation only as a whole word, so ' DR' is not replaced
        # inside ' DRIVE' and ' ST' is not replaced inside ' STONE'.
        pattern = re.escape(abbreviation) + r'\b'
        addresses = addresses.str.replace(pattern, full_name, regex=True)
    
    new_df['Address'] = addresses
    return new_df

In [80]:
clean_df = convert_address_street_abbreviations(new_sample, street_conversions)

In [81]:
clean_df.head(50)

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,1,14046,2014,09/29/2015 12:00:00 AM,Andover,US ROUTE 6 M 33 B 36 L 22,10720,75000.0,0.142933,Vacant Land,,,
1,2,900035,2009,07/20/2010 12:00:00 AM,Andover,1 DOGWOOD DRIVE,55600,99000.0,0.561616,Vacant Land,,,
2,3,14011,2014,01/14/2015 12:00:00 AM,Andover,1 JUROVATY LANE,153100,190000.0,0.805789,Residential,Single Family,,
3,4,80009,2008,01/21/2009 12:00:00 AM,Andover,1 ROSE LANE,116600,138900.0,0.839453,Residential,Single Family,,
4,5,15006,2015,11/30/2015 12:00:00 AM,Andover,1 ROSE LANE,102900,50000.0,2.058,Residential,Single Family,14 - Foreclosure,PROPERTY WAS OWNED BY THE BANK
5,6,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA ROAD,91800,189900.0,48.341232,Residential,Single Family,0,
6,7,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA ROAD,91800,189900.0,48.34,Residential,Single Family,0,
7,8,30047,2003,04/19/2004 12:00:00 AM,Andover,10 CHESTER BRKS LANE,56600,80000.0,70.75,Vacant Land,,0,
8,9,40003,2004,10/18/2004 12:00:00 AM,Andover,10 CHESTER BRKS LANE,194100,446639.0,43.457916,Residential,Single Family,7,
9,10,70005,2007,11/19/2007 12:00:00 AM,Andover,10 CHESTER BROOKS LANE,313400,425000.0,0.737412,Residential,Single Family,,
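
When expanding abbreviations like the ones above, plain substring replacement can hit text inside longer words, which is why `' DR'` needs care around `' DRIVE'`. A minimal sketch of whole-word matching with the standard library `re` module follows; `expand_whole_words`, the two-entry dictionary, and the sample addresses are hypothetical, for illustration only.

```python
import re

# Small subset of the conversions, for illustration only.
conversions = {' ST': ' STREET', ' DR': ' DRIVE'}

def expand_whole_words(address, conversions):
    """Replace each abbreviation only where it ends at a word boundary."""
    for abbreviation, full_name in conversions.items():
        address = re.sub(re.escape(abbreviation) + r'\b', full_name, address)
    return address

# Plain substring replacement also rewrites the start of 'STONE'.
print('10 STONE ST'.replace(' ST', ' STREET'))

# Whole-word matching only touches the trailing 'ST' and leaves 'DRIVE' alone.
print(expand_whole_words('10 STONE ST', conversions))
print(expand_whole_words('1 DOGWOOD DRIVE', conversions))
```

The `\b` anchor removes the need for a special case per abbreviation, since any abbreviation that is merely the prefix of a longer word no longer matches.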


### Create CSVs for each year of listings.

In [82]:
def clean_raw_data(raw_data):
    """
    Returns a cleaned copy of raw_data: trims whitespace in 'Town' and 'Address',
    then expands street abbreviations in 'Address'.
    """
    clean_data = raw_data.copy()
    clean_data = trim_whitespace(clean_data, 'Town')
    print("First clean done.")
    clean_data = trim_whitespace(clean_data, 'Address')
    print("Second clean done.")
    clean_data = convert_address_street_abbreviations(clean_data, street_conversions)
    print("Third clean done. Returning...")
    
    return clean_data

In [83]:
new_clean_data = clean_raw_data(sample)
new_clean_data.head()

First clean done.
Second clean done.
Third clean done. Returning...


Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,1,14046,2014,09/29/2015 12:00:00 AM,Andover,US ROUTE 6 M 33 B 36 L 22,10720,75000.0,0.142933,Vacant Land,,,
1,2,900035,2009,07/20/2010 12:00:00 AM,Andover,1 DOGWOOD DRIVE,55600,99000.0,0.561616,Vacant Land,,,
2,3,14011,2014,01/14/2015 12:00:00 AM,Andover,1 JUROVATY LANE,153100,190000.0,0.805789,Residential,Single Family,,
3,4,80009,2008,01/21/2009 12:00:00 AM,Andover,1 ROSE LANE,116600,138900.0,0.839453,Residential,Single Family,,
4,5,15006,2015,11/30/2015 12:00:00 AM,Andover,1 ROSE LANE,102900,50000.0,2.058,Residential,Single Family,14 - Foreclosure,PROPERTY WAS OWNED BY THE BANK


In [84]:
import os

def create_clean_csv(raw_df, year):
    """
    Given a raw dataframe and a listing year, this extracts all the listings
    for that year from raw_df, creates a clean version of that subset,
    and writes it to a csv file.
    
    The file name convention is data/clean_data_<year>_listings.csv
    """
    raw_subset = raw_df[raw_df['ListYear'] == year]
    clean_subset = clean_raw_data(raw_subset)
    os.makedirs("data", exist_ok=True) # make sure the output folder exists before writing
    file_location = "data/clean_data_" + str(year) + "_listings.csv"
    clean_subset.to_csv(file_location)

In [None]:
create_clean_csv(raw_df, 2001)
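
To produce a file for every listing year rather than calling the function once per year by hand, one option is to group on `ListYear` and write each group. The sketch below shows that on a tiny made-up dataframe written to a temporary folder; `write_yearly_csvs`, `toy_df`, and the use of `index=False` are assumptions for illustration, and the cleaning step is left out to keep the example short.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for raw_df; the real data has many more columns and rows.
toy_df = pd.DataFrame({
    "ListYear": [2001, 2001, 2002],
    "Town": ["Andover", "Ansonia", "Andover"],
    "Address": ["1 ROSE LANE", "2 MAIN STREET", "10 BAUSOLA ROAD"],
})

def write_yearly_csvs(df, out_dir):
    """Write one CSV per distinct ListYear and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for year, subset in df.groupby("ListYear"):
        path = os.path.join(out_dir, "clean_data_" + str(year) + "_listings.csv")
        subset.to_csv(path, index=False)  # index=False avoids adding an extra unnamed column
        paths.append(path)
    return paths

out_dir = tempfile.mkdtemp()
written = write_yearly_csvs(toy_df, out_dir)
print([os.path.basename(p) for p in written])
```

In the notebook itself, each `subset` would first be passed through `clean_raw_data` before being written, and processing one year at a time keeps each run small.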