### CT Statewide House Sales Transactions
This notebook produces a cleaned version of the data from https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-2001-2016/5mzw-sjtu?category=Housing-and-Development , which lists CT statewide sales transactions on individual properties from 2001-2016. It cleans the data in year-by-year sections and outputs a csv for each year. It then recombines the data in these csvs into one cleaned csv with all the data.

View the raw data here: https://raw.githubusercontent.com/jamiekasulis/ct_real_estate_sales/master/Real_Estate_Sales_2001-2016.csv

View the meanings of NonUseCodes here: file:///C:/Users/jleekas/Downloads/OPM-RealEstate_Codes.pdf

#### Needed Cleaning
* Trim whitespace and replace double-spaces with single-spaces (done)
* Replace address abbreviations like "LN" with their full form, "LANE" (done)
* Fix NonUseCodes. Should be ints only. Use 0 or -1 for absence of a NonUseCode. (done)
* Remove duplicate transactions. (done)
* Catch misspellings of towns and street names using Python fuzzywuzzy (do later, when you are ready to look at individual properties.)
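
The fuzzy-matching step in the last bullet is deferred, but the idea can be sketched with the standard library's `difflib` standing in for fuzzywuzzy. The short town list, the `cutoff` value, and the helper name `closest_town` are illustrative assumptions, not part of this notebook.

```python
import difflib

# Hypothetical reference list; in practice this would hold all 169 CT town names.
KNOWN_TOWNS = ["Andover", "Ansonia", "Avon", "Woodstock"]

def closest_town(name, known_towns=KNOWN_TOWNS, cutoff=0.8):
    """Return the best-matching known town for name, or None if nothing is close enough."""
    candidate = name.strip().title()
    matches = difflib.get_close_matches(candidate, known_towns, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_town("Andvoer"))
print(closest_town("WOODSTOK"))
print(closest_town("Hartford"))
```

A town that is not in the reference list returns `None` rather than being forced onto the nearest name, so unknown values can be reviewed by hand.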

__Note:__ There are several data cleaning notebooks because running the cleaning on all years of data at once takes a very long time.

This is where I test my cleaning process on just a small sample of 2000 listings.

In [1]:
import pandas as pd

In [2]:
raw_df = pd.read_csv("https://raw.githubusercontent.com/jamiekasulis/ct_real_estate_sales/master/Real_Estate_Sales_2001-2016.csv")

### Subset: top 2000 listings. Use this to test cleaning functions.

In [79]:
sample = raw_df[0:2000]
len(sample)

2000

### Trim whitespace at the ends and middle of fields.

In [3]:
def trim_whitespace(df, column):
    """
    Removes all leading and trailing whitespace from the values in column. Will also turn double spaces into single spaces.
    """
    new_df = df.copy()
    
    for index in new_df.index:
        new_df.loc[index,column] = str(new_df.loc[index,column]).strip().replace('  ', ' ')
    return new_df
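
The row-by-row loop is likely one reason the cleaning is slow on the full data. A minimal sketch of the same idea with pandas string methods follows. The name `trim_whitespace_vectorized` is hypothetical, and the behavior differs in two ways that are assumptions on my part: it collapses any run of whitespace (not only double spaces), and it leaves missing values missing instead of turning them into the string 'nan'.

```python
import pandas as pd

def trim_whitespace_vectorized(df, column):
    """Hypothetical vectorized variant: strip the ends and collapse runs of whitespace."""
    new_df = df.copy()
    new_df[column] = (
        new_df[column]
        .str.strip()                            # drop leading/trailing whitespace
        .str.replace(r"\s+", " ", regex=True)   # collapse inner runs to one space
    )
    return new_df

demo = pd.DataFrame({"Address": ["  10  CHESTER   BROOKS LN ", "1 ROSE LANE", None]})
cleaned = trim_whitespace_vectorized(demo, "Address")
print(cleaned["Address"].tolist())
```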

In [5]:
new_sample = trim_whitespace(sample, 'Town')
# Pass new_sample (not sample) so the Town cleaning is not discarded.
new_sample = trim_whitespace(new_sample, 'Address')

In [6]:
new_sample.head()

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,1,14046,2014,09/29/2015 12:00:00 AM,Andover,US ROUTE 6 M 33 B 36 L 22,10720,75000.0,0.142933,Vacant Land,,,
1,2,900035,2009,07/20/2010 12:00:00 AM,Andover,1 DOGWOOD DRIVE,55600,99000.0,0.561616,Vacant Land,,,
2,3,14011,2014,01/14/2015 12:00:00 AM,Andover,1 JUROVATY LANE,153100,190000.0,0.805789,Residential,Single Family,,
3,4,80009,2008,01/21/2009 12:00:00 AM,Andover,1 ROSE LANE,116600,138900.0,0.839453,Residential,Single Family,,
4,5,15006,2015,11/30/2015 12:00:00 AM,Andover,1 ROSE LANE,102900,50000.0,2.058,Residential,Single Family,14 - Foreclosure,PROPERTY WAS OWNED BY THE BANK


### Convert abbreviated street names to their full names.
Will have to do more thorough Address cleaning later.

In [4]:
street_conversions = {
    ' LN':' LANE',
    ' RD':' ROAD',
    ' ST':' STREET',
    ' DR':' DRIVE',
    ' PL':' PLACE',
    ' HL': ' HILL'
}
def convert_address_street_abbreviations(df, conversions):
    """
    Will go through all the rows in a copy of df and change street abbreviations to their full names,
    i.e. "10 CHESTER BROOKS LN" will become "10 CHESTER BROOKS LANE".
    """
    new_df = df.copy()
    
    # Iterate through each row
    for index in new_df.index:
        current = str(new_df.loc[index, 'Address']) # get current address
        #print(current)
        
        # Use the conversions argument rather than the global dictionary
        for key in conversions.keys():
            if key in current:
                # DR is a special case because 'DR' in 'DRIVE' already. Avoid changing to 'DRIVEIVE'
                if key != ' DR' or ' DRIVE' not in current:
                    new_df.loc[index, 'Address'] = current.replace(key, conversions[key])
                break
    
    return new_df
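
Plain substring replacement can also match text inside a longer word: ' ST' is found at the start of ' STREET' or ' STONE', not only as a standalone abbreviation. Below is a sketch of a stricter approach that expands an abbreviation only when it is a whole word at the end of the address. The names `SUFFIXES` and `expand_street_suffix` are illustrative, and looking only at the final word is an assumption that would miss cases such as "113 LONG HL ROAD".

```python
import re

# Same abbreviations as street_conversions, keyed without the leading space.
SUFFIXES = {"LN": "LANE", "RD": "ROAD", "ST": "STREET",
            "DR": "DRIVE", "PL": "PLACE", "HL": "HILL"}

# Match one abbreviation as a whole word at the very end of the address.
SUFFIX_PATTERN = re.compile(r"\b(" + "|".join(SUFFIXES) + r")$")

def expand_street_suffix(address):
    """Expand a trailing street abbreviation; leave everything else untouched."""
    return SUFFIX_PATTERN.sub(lambda m: SUFFIXES[m.group(1)], address)

for addr in ["10 CHESTER BROOKS LN", "1 DOGWOOD DRIVE", "5 STONE RD", "12 WEST"]:
    print(addr, "->", expand_street_suffix(addr))
```

On a dataframe this could be applied with `df['Address'].map(expand_street_suffix)` after missing addresses have been handled.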

In [8]:
clean_df = convert_address_street_abbreviations(new_sample, street_conversions)

In [9]:
clean_df.head(50)

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,1,14046,2014,09/29/2015 12:00:00 AM,Andover,US ROUTE 6 M 33 B 36 L 22,10720,75000.0,0.142933,Vacant Land,,,
1,2,900035,2009,07/20/2010 12:00:00 AM,Andover,1 DOGWOOD DRIVE,55600,99000.0,0.561616,Vacant Land,,,
2,3,14011,2014,01/14/2015 12:00:00 AM,Andover,1 JUROVATY LANE,153100,190000.0,0.805789,Residential,Single Family,,
3,4,80009,2008,01/21/2009 12:00:00 AM,Andover,1 ROSE LANE,116600,138900.0,0.839453,Residential,Single Family,,
4,5,15006,2015,11/30/2015 12:00:00 AM,Andover,1 ROSE LANE,102900,50000.0,2.058,Residential,Single Family,14 - Foreclosure,PROPERTY WAS OWNED BY THE BANK
5,6,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA ROAD,91800,189900.0,48.341232,Residential,Single Family,0,
6,7,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA ROAD,91800,189900.0,48.34,Residential,Single Family,0,
7,8,30047,2003,04/19/2004 12:00:00 AM,Andover,10 CHESTER BRKS LANE,56600,80000.0,70.75,Vacant Land,,0,
8,9,40003,2004,10/18/2004 12:00:00 AM,Andover,10 CHESTER BRKS LANE,194100,446639.0,43.457916,Residential,Single Family,7,
9,10,70005,2007,11/19/2007 12:00:00 AM,Andover,10 CHESTER BROOKS LANE,313400,425000.0,0.737412,Residential,Single Family,,


### Create CSVs for each year of listings.

In [126]:
def clean_raw_data(raw_data):
    clean_data = raw_data.copy()
    clean_data = trim_whitespace(clean_data, 'Town')
    print("First clean done.")
    clean_data = trim_whitespace(clean_data, 'Address')
    print("Second clean done.")
    clean_data = convert_address_street_abbreviations(clean_data, street_conversions)
    print("Third clean done.")
    clean_data = clean_nonusecode(clean_data)
    print("Fourth cleaning done.")
    clean_data = remove_duplicate_rows(clean_data)
    print("Fifth cleaning done. Returning...")
    
    return clean_data

In [14]:
new_clean_data = clean_raw_data(sample)
new_clean_data.head()

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,1,14046,2014,09/29/2015 12:00:00 AM,Andover,US ROUTE 6 M 33 B 36 L 22,10720,75000.0,0.142933,Vacant Land,,-1,
1,2,900035,2009,07/20/2010 12:00:00 AM,Andover,1 DOGWOOD DRIVE,55600,99000.0,0.561616,Vacant Land,,-1,
2,3,14011,2014,01/14/2015 12:00:00 AM,Andover,1 JUROVATY LANE,153100,190000.0,0.805789,Residential,Single Family,-1,
3,4,80009,2008,01/21/2009 12:00:00 AM,Andover,1 ROSE LANE,116600,138900.0,0.839453,Residential,Single Family,-1,
4,5,15006,2015,11/30/2015 12:00:00 AM,Andover,1 ROSE LANE,102900,50000.0,2.058,Residential,Single Family,14,PROPERTY WAS OWNED BY THE BANK


In [127]:
def create_clean_csv(raw_df, year):
    """
    Given a raw dataframe and a listing year, this will first extract all the listings
    from year from raw_df. Then, it will create a clean version of that dataframe.
    Then, it will write this to a csv file.
    
    The file name convention is clean_data_year_listings.csv
    """
    raw_subset = raw_df[raw_df['ListYear'] == year]
    clean_subset = clean_raw_data(raw_subset)
    file_location = "data/clean_data_" + str(year) + "_listings.csv"
    clean_subset.to_csv(file_location, index=False)

In [16]:
create_clean_csv(raw_df, 2001)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


In [17]:
df_2001 = pd.read_csv("data/clean_data_2001_listings.csv")

In [18]:
raw_2001 = raw_df[raw_df['ListYear'] == 2001]
raw_2001.head()

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
683,684,10173,2001,04/17/2002 12:00:00 AM,Ansonia,1-3 EAGLE ST,63630,116000.0,54.853448,Residential,Two Family,0,
693,694,10005,2001,10/04/2001 12:00:00 AM,Ansonia,1 CRESTWOOD RD,76370,160000.0,47.73125,Residential,Single Family,0,
696,697,10253,2001,06/18/2002 12:00:00 AM,Ansonia,1 DAVIES CT,97720,180000.0,54.288889,Residential,Single Family,0,
697,698,10094,2001,01/17/2002 12:00:00 AM,Ansonia,1 DOREL TER,110600,259900.0,42.554829,Residential,Single Family,0,
709,710,10100,2001,01/30/2002 12:00:00 AM,Ansonia,1 JAMES ST,63210,132000.0,47.886364,Residential,Single Family,0,


In [19]:
df_2001.head()

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,684,10173,2001,04/17/2002 12:00:00 AM,Ansonia,1-3 EAGLE STREET,63630,116000.0,54.853448,Residential,Two Family,0,
1,694,10005,2001,10/04/2001 12:00:00 AM,Ansonia,1 CRESTWOOD ROAD,76370,160000.0,47.73125,Residential,Single Family,0,
2,697,10253,2001,06/18/2002 12:00:00 AM,Ansonia,1 DAVIES CT,97720,180000.0,54.288889,Residential,Single Family,0,
3,698,10094,2001,01/17/2002 12:00:00 AM,Ansonia,1 DOREL TER,110600,259900.0,42.554829,Residential,Single Family,0,
4,710,10100,2001,01/30/2002 12:00:00 AM,Ansonia,1 JAMES STREET,63210,132000.0,47.886364,Residential,Single Family,0,


In [20]:
raw_2001.tail()

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
815813,815814,10084,2001,02/07/2002 12:00:00 AM,Woodstock,SENEXET,79630,140000.0,56.87857,Vacant Land,,0,
815814,815815,10239,2001,09/11/2002 12:00:00 AM,Woodstock,SENEXET RD,110780,335000.0,33.06866,Vacant Land,,14,
815865,815866,10210,2001,08/02/2002 12:00:00 AM,Woodstock,TOWN FARM RD,33040,100000.0,33.04,Vacant Land,,0,
815868,815869,10138,2001,05/02/2002 12:00:00 AM,Woodstock,UNDERWOOD RD,3920,10000.0,39.2,Vacant Land,,1,
815875,815876,10254,2001,09/23/2002 12:00:00 AM,Woodstock,VALLEY VIEW RD,20330,16000.0,127.0625,Vacant Land,,0,


In [21]:
df_2001.tail()

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
59579,815814,10084,2001,02/07/2002 12:00:00 AM,Woodstock,SENEXET,79630,140000.0,56.87857,Vacant Land,,0,
59580,815815,10239,2001,09/11/2002 12:00:00 AM,Woodstock,SENEXET ROAD,110780,335000.0,33.06866,Vacant Land,,14,
59581,815866,10210,2001,08/02/2002 12:00:00 AM,Woodstock,TOWN FARM ROAD,33040,100000.0,33.04,Vacant Land,,0,
59582,815869,10138,2001,05/02/2002 12:00:00 AM,Woodstock,UNDERWOOD ROAD,3920,10000.0,39.2,Vacant Land,,1,
59583,815876,10254,2001,09/23/2002 12:00:00 AM,Woodstock,VALLEY VIEW ROAD,20330,16000.0,127.0625,Vacant Land,,0,


In [22]:
len(raw_2001)

59584

In [23]:
len(df_2001)

59584

### Let's standardize this length test.

In [6]:
def get_raw_df(year):
    """Get the raw data in a dataframe for a particular ListYear year."""
    return raw_df[raw_df['ListYear'] == year]

In [7]:
def get_clean_df(year):
    """Get the clean data in a dataframe for a particular ListYear year.
    Uses a clean csv."""
    filename = "data/clean_data_" + str(year) + "_listings.csv"
    return pd.read_csv(filename)

In [26]:
len(get_raw_df(2001)) == len(get_clean_df(2001))

True

In [8]:
def lengths_are_equal(year):
    """
    Will get the raw and clean data for the given year and compare the lengths of these dataframes.
    """
    return len(get_raw_df(year)) == len(get_clean_df(year))

In [28]:
lengths_are_equal(2001)

True

### Now let's produce csvs for all years and test them all.

In [29]:
create_clean_csv(raw_df, 2002)
lengths_are_equal(2002)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [30]:
create_clean_csv(raw_df, 2003)
lengths_are_equal(2003)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [31]:
create_clean_csv(raw_df, 2004)
lengths_are_equal(2004)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [32]:
create_clean_csv(raw_df, 2005)
lengths_are_equal(2005)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [33]:
create_clean_csv(raw_df, 2006)
lengths_are_equal(2006)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [34]:
create_clean_csv(raw_df, 2007)
lengths_are_equal(2007)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [35]:
create_clean_csv(raw_df, 2008)
lengths_are_equal(2008)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [36]:
create_clean_csv(raw_df, 2009)
lengths_are_equal(2009)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [37]:
create_clean_csv(raw_df, 2010)
lengths_are_equal(2010)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [38]:
create_clean_csv(raw_df, 2011)
lengths_are_equal(2011)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [39]:
create_clean_csv(raw_df, 2012)
lengths_are_equal(2012)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [40]:
create_clean_csv(raw_df, 2013)
lengths_are_equal(2013)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [41]:
create_clean_csv(raw_df, 2014)
lengths_are_equal(2014)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [42]:
create_clean_csv(raw_df, 2015)
lengths_are_equal(2015)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True

In [43]:
create_clean_csv(raw_df, 2016)
lengths_are_equal(2016)

First clean done.
Second clean done.
Third clean done.
Fourth cleaning done. Returning.


True
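
The per-year cells above all follow the same create-then-check pattern, so they could be driven by one loop. This is only a sketch: `build_all_years` is a hypothetical helper, and the stand-in functions exist so the block runs on its own. In this notebook the two callables would be `lambda y: create_clean_csv(raw_df, y)` and `lengths_are_equal`.

```python
def build_all_years(years, create_fn, check_fn):
    """Run create_fn and then check_fn for each year; return {year: check result}."""
    results = {}
    for year in years:
        create_fn(year)
        results[year] = check_fn(year)
    return results

# Stand-ins for the notebook's functions, used here only for demonstration.
written = []
results = build_all_years(range(2001, 2017), written.append, lambda y: y in written)

failed = [year for year, ok in results.items() if not ok]
print("Years failing the length check:", failed)
```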

### Recombine the clean csvs into one clean dataframe and write it to a master csv file
Note: You will NOT be able to upload this to GitHub.

In [5]:
# Stack the per-year clean csvs. pd.concat is used because DataFrame.append was removed in pandas 2.0.
clean_df = pd.concat(
    [pd.read_csv("data/clean_data_" + str(year) + "_listings.csv") for year in range(2001, 2017)]
)

In [9]:
clean_df

Unnamed: 0.1,Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,683,684,10173,2001,04/17/2002 12:00:00 AM,Ansonia,1-3 EAGLE STREET,63630,116000.0,54.853448,Residential,Two Family,0,
1,693,694,10005,2001,10/04/2001 12:00:00 AM,Ansonia,1 CRESTWOOD ROAD,76370,160000.0,47.731250,Residential,Single Family,0,
2,696,697,10253,2001,06/18/2002 12:00:00 AM,Ansonia,1 DAVIES CT,97720,180000.0,54.288889,Residential,Single Family,0,
3,697,698,10094,2001,01/17/2002 12:00:00 AM,Ansonia,1 DOREL TER,110600,259900.0,42.554829,Residential,Single Family,0,
4,709,710,10100,2001,01/30/2002 12:00:00 AM,Ansonia,1 JAMES STREET,63210,132000.0,47.886364,Residential,Single Family,0,
5,714,715,10268,2001,06/27/2002 12:00:00 AM,Ansonia,1 LESTER STREET,82530,74500.0,110.778523,Residential,Two Family,0,
6,732,733,10012,2001,10/11/2001 12:00:00 AM,Ansonia,1 WESTBROOK AVE,74830,131000.0,57.122137,Residential,Two Family,0,
7,737,738,10115,2001,02/22/2002 12:00:00 AM,Ansonia,10-12 CLIFTON AVE,60550,20000.0,302.750000,Residential,Single Family,25,
8,738,739,10187,2001,04/29/2002 12:00:00 AM,Ansonia,10-12 HALL STREET,87710,168000.0,52.208333,Residential,Single Family,0,
9,740,741,10337,2001,09/03/2002 12:00:00 AM,Ansonia,10-12 PARKER STREET,112630,186500.0,60.391421,Residential,Two Family,0,


In [None]:
clean_df.tail()

In [None]:
len(clean_df)

In [None]:
sum_length = 0
for year in range(2001, 2017):
    years_df = pd.read_csv("data/clean_data_" + str(year) + "_listings.csv")
    sum_length += len(years_df)

print("The length of clean_df should equal the sum length of each individual year df.")
print("Length of clean_df = %d" %len(clean_df))
print("Sum length of dfs = %d" %sum_length)
print("Equal? " + str(len(clean_df) == sum_length))

In [None]:
# Now write to a csv file
clean_df.to_csv("data/clean_data.csv", index=False)

### Additional checks on the data
#### Town names

In [None]:
clean_df['Town'].unique()

In [None]:
len(clean_df['Town'].unique()) # Should be 169 towns

### Additional checks on the data
#### Years

In [None]:
clean_df['ListYear'].unique() # Should only contain the years 2001 through 2016

### Additional checks on the data
#### PropertyType

In [None]:
clean_df['PropertyType'].unique()

In [None]:
len(clean_df[clean_df['PropertyType'] == 'Condo Family'])

In [None]:
len(clean_df[clean_df['PropertyType'] == 'Condo'])

In [None]:
# So we have this one problematic row. It probably is a Condo but I could just omit it.
clean_df[clean_df['PropertyType'] == 'C']

In [None]:
len(clean_df[clean_df['PropertyType'] == 'Apartments'])

In [None]:
len(clean_df[clean_df['PropertyType'] == 'Apartment'])

In [None]:
clean_df[clean_df['PropertyType'] == 'Apartment']

In [None]:
clean_df[clean_df['PropertyType'] == 'Apartments']

In [None]:
apartments_sales_median = clean_df[clean_df['PropertyType'] == 'Apartments']['SaleAmount'].median()
apartment_sales_median = clean_df[clean_df['PropertyType'] == 'Apartment']['SaleAmount'].median()

In [None]:
print("'Apartments': %d\n'Apartment': %d" %(apartments_sales_median, apartment_sales_median))
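
If the checks in this section lead to the conclusion that 'Apartment' and 'Apartments' are the same category, and that the lone 'C' row is a condo, the labels could be merged with an exact-value mapping. This is a sketch under that assumption: `PROPERTY_TYPE_FIXES` is a hypothetical name and the mapping itself is a guess that the notebook has not confirmed.

```python
import pandas as pd

# Hypothetical mapping; whether these labels really mean the same thing is an open question.
PROPERTY_TYPE_FIXES = {"Apartment": "Apartments", "C": "Condo"}

demo = pd.DataFrame(
    {"PropertyType": ["Apartment", "Apartments", "C", "Condo", "Vacant Land"]}
)

# Series.replace with a dict swaps whole values only, so 'Condo' is not affected by the 'C' key.
demo["PropertyType"] = demo["PropertyType"].replace(PROPERTY_TYPE_FIXES)
print(demo["PropertyType"].value_counts())
```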

In [None]:
clean_df[clean_df['PropertyType'] == '10 Mill Forest']
# Note: I believe these are purchases of forest land for the 10 Mill Law

### Additional checks on the data
#### NonUseCode

In [None]:
clean_df['NonUseCode'].unique()

In [12]:
# This loads the 2003 listings, so name the dataframe accordingly.
df_2003 = pd.read_csv("data/clean_data_2003_listings.csv")
small_df = df_2003[0:200]

In [13]:
small_df.head(100)

Unnamed: 0.1,Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,7,8,30047,2003,04/19/2004 12:00:00 AM,Andover,10 CHESTER BRKS LANE,56600,80000.0,70.75,Vacant Land,,0,
1,15,16,30063,2003,07/15/2004 12:00:00 AM,Andover,101 GILEAD ROAD,104000,218000.0,47.70,Residential,Single Family,0,
2,19,20,30055,2003,01/06/2004 12:00:00 AM,Andover,104 WEST STREET,52300,314867.0,16.61,Residential,Single Family,7,
3,20,21,30037,2003,02/02/2004 12:00:00 AM,Andover,106 HENDEE ROAD,106100,220000.0,48.22,Residential,Single Family,0,
4,25,26,30072,2003,08/16/2004 12:00:00 AM,Andover,11 DOGWOOD DRIVE,45600,449900.0,10.13,Residential,Single Family,7,
5,30,31,30068,2003,07/29/2004 12:00:00 AM,Andover,11 WOOD FERN WAY,30200,100000.0,30.20,Vacant Land,,0,
6,37,38,30073,2003,08/23/2004 12:00:00 AM,Andover,113 LONG HL ROAD,85600,182500.0,46.90,Residential,Single Family,0,
7,40,41,30054,2003,04/30/2004 12:00:00 AM,Andover,114 HENDEE ROAD,90400,189900.0,47.60,Residential,Single Family,0,
8,46,47,30050,2003,04/01/2004 12:00:00 AM,Andover,12 CHESTER BRKS LANE,56600,80000.0,70.75,Vacant Land,,0,
9,47,48,30051,2003,03/31/2004 12:00:00 AM,Andover,12 CHESTER BRKS LANE,56600,125000.0,45.28,Vacant Land,,26,


In [None]:
small_df['NonUseCode'].describe()

In [12]:
def clean_nonusecode(df):
    """
    Some of the NonUseCodes are long strings with descriptors, which we don't need because the OPM data includes their
    descriptions in a separate pdf. Some are also NaN.
    This function turns all NonUseCodes into ints and sets the NaN ones to -1.
    """
    new_df = df.copy()
    new_df['NonUseCode'] = new_df['NonUseCode'].astype(str)
    
    for index in new_df.index:
        current_code = new_df.loc[index, 'NonUseCode']
        # NaN case: astype(str) turns a missing value into the string 'nan'.
        # Compare exactly, since a descriptor may itself contain the letters 'na'.
        if pd.isna(current_code) or current_code == 'nan':
            new_df.loc[index, 'NonUseCode'] = -1
        # 0-9 case
        elif len(current_code) < 2:
            new_df.loc[index, 'NonUseCode'] = "0" + current_code
        # XX... case, where we want to cut off additional text if there is any
        else:
            new_df.loc[index, 'NonUseCode'] = current_code[0:2]
    
    new_df['NonUseCode'] = new_df['NonUseCode'].astype(int)
    return new_df
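
The same rule (keep the leading one or two digits, use -1 when there is no code) can also be written without a Python loop. The sketch below is an alternative, not the function used in this notebook. `clean_nonusecode_vectorized` is a hypothetical name, and treating any value without leading digits as missing is an assumption.

```python
import pandas as pd

def clean_nonusecode_vectorized(df):
    """Hypothetical variant: keep the leading 1-2 digit code, use -1 when none is present."""
    new_df = df.copy()
    # Pull out the leading digits; values with no leading digits become NaN.
    codes = new_df["NonUseCode"].astype(str).str.extract(r"^\s*(\d{1,2})", expand=False)
    # Convert to numbers, fill the gaps with -1, then cast to int.
    new_df["NonUseCode"] = pd.to_numeric(codes).fillna(-1).astype(int)
    return new_df

demo = pd.DataFrame({"NonUseCode": ["14 - Foreclosure", "7", "0", None, "25"]})
cleaned_codes = clean_nonusecode_vectorized(demo)["NonUseCode"].tolist()
print(cleaned_codes)
```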

In [13]:
clean_nonusecode(raw_df[0:100])

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,1,14046,2014,09/29/2015 12:00:00 AM,Andover,US ROUTE 6 M 33 B 36 L 22,10720,75000.0,0.142933,Vacant Land,,-1,
1,2,900035,2009,07/20/2010 12:00:00 AM,Andover,1 DOGWOOD DRIVE,55600,99000.0,0.561616,Vacant Land,,-1,
2,3,14011,2014,01/14/2015 12:00:00 AM,Andover,1 JUROVATY LANE,153100,190000.0,0.805789,Residential,Single Family,-1,
3,4,80009,2008,01/21/2009 12:00:00 AM,Andover,1 ROSE LANE,116600,138900.0,0.839453,Residential,Single Family,-1,
4,5,15006,2015,11/30/2015 12:00:00 AM,Andover,1 ROSE LANE,102900,50000.0,2.058000,Residential,Single Family,14,PROPERTY WAS OWNED BY THE BANK
5,6,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.341232,Residential,Single Family,0,
6,7,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.340000,Residential,Single Family,0,
7,8,30047,2003,04/19/2004 12:00:00 AM,Andover,10 CHESTER BRKS LN,56600,80000.0,70.750000,Vacant Land,,0,
8,9,40003,2004,10/18/2004 12:00:00 AM,Andover,10 CHESTER BRKS LN,194100,446639.0,43.457916,Residential,Single Family,7,
9,10,70005,2007,11/19/2007 12:00:00 AM,Andover,10 CHESTER BROOKS LN,313400,425000.0,0.737412,Residential,Single Family,-1,


In [36]:
clean_nonusecode(raw_df[100:200])

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
100,101,100025,2010,08/15/2011 12:00:00 AM,Andover,15 GILEAD ROAD,120900,175000.0,0.690857,Residential,Single Family,-1,
101,102,50033,2005,05/01/2006 12:00:00 AM,Andover,15 SHODDY ML RD,84100,223000.0,37.700000,Residential,Single Family,0,
102,103,30079,2003,09/01/2004 12:00:00 AM,Andover,15 WOOD FERN WAY,44500,105000.0,42.380000,Vacant Land,,0,
103,104,40020,2004,02/08/2005 12:00:00 AM,Andover,15 WOOD FERN WAY,54100,435100.0,12.433923,Residential,Single Family,7,
104,105,15038,2015,06/14/2016 12:00:00 AM,Andover,150 LAKE RD,135100,97000.0,1.392784,Residential,Single Family,-1,
105,106,900031,2009,07/07/2010 12:00:00 AM,Andover,150 LONG HILL ROAD,125700,165000.0,0.761818,Residential,Single Family,-1,
106,107,900002,2009,10/13/2009 12:00:00 AM,Andover,151 LAKESIDE DR,175700,221000.0,0.795023,Residential,Single Family,-1,
107,108,13028,2013,08/04/2014 12:00:00 AM,Andover,152 HENDEE RD,221000,325000.0,0.680000,Residential,Single Family,-1,
108,109,13034,2013,09/04/2014 12:00:00 AM,Andover,154 LONG HILL RD,51800,58647.0,0.883000,Vacant Land,,-1,
109,110,30084,2003,09/09/2004 12:00:00 AM,Andover,154 LONG HL RD,62200,75000.0,82.930000,Vacant Land,,0,


### Additional checks on the data
These informed the cleaning that has been done above.
#### AssessedValue, SaleAmount, SalesRatio

In [41]:
clean_df['AssessedValue'].describe()

count    8.159050e+05
mean     2.629412e+05
std      1.327561e+06
min      0.000000e+00
25%      8.113000e+04
50%      1.306000e+05
75%      2.147600e+05
max      1.389588e+08
Name: AssessedValue, dtype: float64

In [124]:
clean_df['AssessedValue'].head()


In [125]:
clean_df['SaleAmount'].head()


In [47]:
clean_df['SaleAmount'].describe()

count    8.159050e+05
mean     3.679508e+05
std      2.068168e+06
min      0.000000e+00
25%      1.349000e+05
50%      2.160000e+05
75%      3.500000e+05
max      9.409400e+08
Name: SaleAmount, dtype: float64

In [50]:
clean_df['SalesRatio']

0         54.853448
1         47.731250
2         54.288889
3         42.554829
4         47.886364
5        110.778523
6         57.122137
7        302.750000
8         52.208333
9         60.391421
10        44.984211
11        78.811765
12        48.475570
13        48.185075
14        62.039216
15        46.211429
16        67.394161
17        61.574850
18        51.450000
19        68.783570
20       132.284569
21        88.700000
22        53.480000
23        46.472222
24        52.809524
25        42.508065
26        50.807273
27        40.400000
28        89.687500
29        49.585014
            ...    
49743      4.498598
49744      1.490652
49745      0.854260
49746      0.868561
49747      0.751132
49748      0.645022
49749      1.012250
49750      1.192088
49751      0.788815
49752      5.627879
49753      1.875960
49754      0.871413
49755      0.630475
49756      1.002476
49757      0.060500
49758      1.508857
49759      0.001121
49760      0.143000
49761      0.113884


In [51]:
clean_df['SalesRatio'].describe()

count    8.150720e+05
mean     7.976728e+02
std      1.407832e+05
min      0.000000e+00
25%      7.000000e-01
50%      1.459335e+00
75%      4.710000e+01
max      6.119000e+07
Name: SalesRatio, dtype: float64

In [53]:
# Some of these sales ratios are high. Let's make sure the assessed price and sale price are appropriately different.
clean_df[clean_df['SalesRatio'] > 5][['Address', 'AssessedValue', 'SaleAmount', 'SalesRatio', 'PropertyType']]

Unnamed: 0,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType
0,1-3 EAGLE STREET,63630,116000.0,54.853448,Residential
1,1 CRESTWOOD ROAD,76370,160000.0,47.731250,Residential
2,1 DAVIES CT,97720,180000.0,54.288889,Residential
3,1 DOREL TER,110600,259900.0,42.554829,Residential
4,1 JAMES STREET,63210,132000.0,47.886364,Residential
5,1 LESTER STREET,82530,74500.0,110.778523,Residential
6,1 WESTBROOK AVE,74830,131000.0,57.122137,Residential
7,10-12 CLIFTON AVE,60550,20000.0,302.750000,Residential
8,10-12 HALL STREET,87710,168000.0,52.208333,Residential
9,10-12 PARKER STREET,112630,186500.0,60.391421,Residential


In [55]:
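# This makes sense once the scale is taken into account. SalesRatio appears to be AssessedValue / SaleAmount
# (e.g. 63630 / 116000 = 0.5485), and these rows record it as a percentage (54.853448). They therefore pass
# the > 5 filter even when the assessed value is below the sale price.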
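
The recorded SalesRatio values appear to be AssessedValue divided by SaleAmount, but on two different scales: some rows store a fraction (0.142933) and others a percentage (54.853448). The sketch below recomputes the ratio for four rows copied from the tables in this notebook and flags the ones that look percent-scaled. The column names `RecomputedRatio` and `LooksLikePercent` and the tolerance are illustrative assumptions, and rows with a SaleAmount of 0 would need separate handling.

```python
import pandas as pd

# Four rows copied from the notebook's tables: two from ListYear 2001, two from later years.
demo = pd.DataFrame({
    "ListYear": [2001, 2001, 2014, 2015],
    "AssessedValue": [63630, 76370, 10720, 102900],
    "SaleAmount": [116000.0, 160000.0, 75000.0, 50000.0],
    "SalesRatio": [54.853448, 47.731250, 0.142933, 2.058],
})

# Assumed definition: SalesRatio = AssessedValue / SaleAmount, as a fraction.
demo["RecomputedRatio"] = demo["AssessedValue"] / demo["SaleAmount"]

# Flag rows whose recorded ratio matches the recomputed one only after dividing by 100.
demo["LooksLikePercent"] = (
    (demo["SalesRatio"] / 100 - demo["RecomputedRatio"]).abs() < 1e-4
)
print(demo[["ListYear", "SalesRatio", "RecomputedRatio", "LooksLikePercent"]])
```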

### Additional checks on the data
#### Remarks

In [63]:
clean_df['Remarks'].head(20)

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
15    NaN
16    NaN
17    NaN
18    NaN
19    NaN
Name: Remarks, dtype: object

In [65]:
clean_df['Remarks'].tail(20)

49753                                         PARTIAL SALE
49754                                                  NaN
49755                                                  NaN
49756                                                  NaN
49757                                                  NaN
49758                 UNIQUE: 19700  PARTIAL INTEREST SALE
49759                                SOLD WITH ACCT 261700
49760                                         PARTIAL SALE
49761         3.326 ACRES SOLD TO NEIGHBOR/FARM LAND VALUE
49762                                   LOT 1 HILLTOP EAST
49763                                                  NaN
49764                                                  NaN
49765                                                  NaN
49766    STATE HELPED TOWN PURCHASE LAND FOR OPEN SPACE...
49767    STATE HELPED TOWN PURCHASE LAND FOR OPEN SPACE...
49768              PURCHASED LAND IN OPEN SPACE FROM UNCLE
49769                                 SALE WITH ID: 1436

In [66]:
# Looks like more remarks were recorded towards the end of the data (later years).
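
The impression that later years carry more remarks can be checked directly instead of by reading the head and tail. A minimal sketch on a made-up frame follows; in this notebook `demo` would be `clean_df`, and `remark_share` is a hypothetical name.

```python
import pandas as pd

# Tiny illustrative frame; the values are placeholders, not real counts.
demo = pd.DataFrame({
    "ListYear": [2001, 2001, 2001, 2016, 2016],
    "Remarks": [None, None, "PARTIAL SALE", "SOLD WITH ACCT 261700", "LOT 1 HILLTOP EAST"],
})

# Share of rows per ListYear that have a non-missing remark.
remark_share = demo["Remarks"].notna().groupby(demo["ListYear"]).mean()
print(remark_share)
```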

In [11]:
MIN_YEAR = 2001
MAX_YEAR = 2016

In [128]:
def get_dataframe(min_year, max_year):
    """
    Pass in a range of years (inclusive). Combines all of the clean CSVs corresponding to that time range
    into one main dataframe. These CSVs have already been cleaned in the following ways:
        1. Remove leading and trailing whitespace
        2. Replace double-spaces with single-spaces in the Address field
        3. Replace abbreviations like "LN" and "RD" in the Address field with their full names ("LANE", "ROAD", etc.)
        4. Fix NonUseCodes so that they are integers of at most two digits.
        5. Remove duplicate rows.
    """
    # Combine the year-by-year clean csvs, which are located at 'data/clean_data_20xx_listings.csv'.
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    frames = [pd.read_csv('data/clean_data_' + str(year) + '_listings.csv')
              for year in range(min_year, max_year + 1)]
    df = pd.concat(frames)

    # Now remove the index column
    #df = df.drop('Unnamed: 0', axis=1)
    return df

In [129]:
df = get_dataframe(MIN_YEAR, MAX_YEAR)

### Additional checks on the data
#### Looking for duplicate transactions

In [16]:
# Are all the SerialNumbers unique?
len(df['SerialNumber'].unique())

56011

In [18]:
len(df['SerialNumber'])

815905

In [19]:
# How many duplicates are there?
len(df['SerialNumber']) - len(df['SerialNumber'].unique())

759894

In [24]:
import random

In [41]:
def get_random_serial_number(df):
    """
    Returns a serial number from df at random.
    """
    # randrange excludes len(df), so the index is always valid (randint would include it).
    random_index = random.randrange(len(df))
    random_serial = df.iloc[random_index]['SerialNumber']
    return random_serial

In [150]:
def show_all_rows_with_random_serial_number(df):
    """
    Returns a subset of df based on a random serial number.
    """
    random_serial = get_random_serial_number(df)
    subset = df[df['SerialNumber'] == random_serial]
    #print("SERIAL #: %d\t\tSIZE: %d" %(random_serial, len(subset)))
    return subset

In [57]:
show_all_rows_with_random_serial_number(df)

SERIAL #: 16074		SIZE: 23


Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
5812,89157,16074,2016,03/07/2017 12:00:00 AM,Burlington,3 WILDES WAY,295330,402500.0,0.733739,Residential,Single Family,-1,
9250,141082,16074,2016,12/12/2016 12:00:00 AM,Darien,58 MANSFIELD AVE,1358490,2295000.0,0.591935,Residential,Single Family,-1,
9329,142142,16074,2016,09/11/2017 12:00:00 AM,Deep River,100 LORDS LA,160720,260000.0,0.618154,Residential,Single Family,-1,
9477,144045,16074,2016,02/13/2017 12:00:00 AM,Derby,2 SHERWOOD AVE,131600,189900.0,0.692996,Residential,Single Family,-1,
9648,146357,16074,2016,05/31/2017 12:00:00 AM,Durham,192 PISGAH ROAD,182980,389900.0,0.4693,Residential,Single Family,7,New construction. Changed percent completed
10522,158039,16074,2016,10/31/2016 12:00:00 AM,East Hartford,16 SKYLINE DRIVE.,139810,217000.0,0.644286,Residential,Single Family,-1,
13362,214350,16074,2016,11/17/2016 12:00:00 AM,Farmington,2 LINCOLN STREET,135520,195000.0,0.694974,Residential,Single Family,-1,
16301,264224,16074,2016,03/13/2017 12:00:00 AM,Haddam,885 SAYBROOK ROAD,204870,310000.0,0.660871,Residential,Single Family,-1,
18238,307671,16074,2016,04/13/2017 12:00:00 AM,Lebanon,644 BEAUMONT HWY,107530,30000.0,3.584333,Residential,Single Family,25,HOUSE TO BE DEMOLISHED
19456,335772,16074,2016,01/09/2017 12:00:00 AM,Mansfield,6 WESTWOOD ROAD,199500,288000.0,0.692708,Residential,Single Family,-1,


Findings
* The first 1-2 digits of a serial number are the last 1-2 digits of the list year, with no leading zero (e.g. 2015 serial numbers start with '15'; 2004 serial numbers start with '4').
* Towns do not seem to repeat within a year for a given serial number. So if there are 10 rows with the same serial number in 2010, those should all be from 10 different towns.
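
Both findings can be checked over a whole table instead of by eye. The following is a minimal sketch, assuming the column names shown in the output above (`SerialNumber`, `ListYear`, `Town`). The toy frame and the helper names `serial_prefix_matches_year` and `towns_repeat_within_serial` are hypothetical and not part of this notebook.

```python
import pandas as pd

# Toy rows standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "SerialNumber": [16074, 16074, 20074, 20074, 20074],
    "ListYear":     [2016, 2016, 2002, 2002, 2002],
    "Town":         ["Burlington", "Darien", "Avon", "Avon", "Bethany"],
})

def serial_prefix_matches_year(df):
    """Per-row check: does the serial number start with the last 1-2 digits of ListYear?"""
    year_suffix = (df["ListYear"] % 100).astype(str)   # 2016 -> '16', 2002 -> '2'
    serial = df["SerialNumber"].astype(str)
    return pd.Series(
        [s.startswith(y) for s, y in zip(serial, year_suffix)],
        index=df.index,
    )

def towns_repeat_within_serial(df):
    """Rows whose (SerialNumber, ListYear, Town) combination appears more than once."""
    return df[df.duplicated(["SerialNumber", "ListYear", "Town"], keep=False)]

prefix_ok = serial_prefix_matches_year(toy)
repeats = towns_repeat_within_serial(toy)
```

Note that the prefix check is weak for single-digit years: a prefix of '1' also matches serial numbers from 2010-2016, so it can only rule the finding out, not prove it. Any rows returned by `towns_repeat_within_serial` are candidates for duplicate transactions.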

In [166]:
def test_duplicate_row_removal(df):
    """
    Raises an exception if a duplicate row is likely, judged by whether the number of unique towns is
    not equal to the number of rows in a sample subset of rows with the same serial number.
    The failing subset is kept in the global fails_test_table so it can be inspected afterwards.
    """
    global fails_test_table
    fails_test_table = None
    for i in range(0,50):
        # Are there rows with the same serial number AND same Town?
        sample_subset = show_all_rows_with_random_serial_number(df)
        passes_test = len(sample_subset) == len(sample_subset['Town'].unique())     # Best case scenario: lengths are equal.

        if not passes_test:
            fails_test_table = sample_subset
            raise Exception("Test failed! Duplicate row likely.")

In [70]:
fails_test_table

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
147,677,20074,2002,09/02/2003 12:00:00 AM,Andover,WEST STREET,40500,82500.0,49.090000,Vacant Land,,6,
1074,9986,20074,2002,11/15/2002 12:00:00 AM,Avon,6 WILTSHIRE LANE,320320,575000.0,55.707826,Residential,Single Family,0,
1076,9988,20074,2002,11/15/2002 12:00:00 AM,Avon,6 WILTSHIRE LANE,320320,575000.0,55.700000,Residential,Single Family,0,
1431,11709,20074,2002,08/08/2003 12:00:00 AM,Barkhamsted,44 BRIARWOOD ROAD,133110,240000.0,55.460000,Residential,Single Family,14,
1556,12422,20074,2002,04/30/2003 12:00:00 AM,Beacon Falls,18 LORRAINE DRIVE,123570,250000.0,49.428000,Residential,Single Family,0,
1560,12427,20074,2002,04/30/2003 12:00:00 AM,Beacon Falls,18 LORRAINE DRIVE,123570,250000.0,49.420000,Residential,Single Family,0,
1829,18206,20074,2002,03/27/2003 12:00:00 AM,Bethany,227 AMITY ROAD,145180,289900.0,50.079338,Residential,Single Family,0,
1830,18209,20074,2002,03/27/2003 12:00:00 AM,Bethany,227 AMITY ROAD,145180,289900.0,50.070000,Residential,Single Family,0,
2683,27006,20074,2002,11/20/2002 12:00:00 AM,Bloomfield,35 SPICE BUSH LANE,94960,199000.0,47.718590,Residential,Single Family,0,
2684,27007,20074,2002,11/20/2002 12:00:00 AM,Bloomfield,35 SPICE BUSH LANE,94960,199000.0,47.710000,Residential,Single Family,0,


Findings
* If the same serial number occurs among rows with the same year and town, they are duplicates.
* One row seems to be the partially rounded version of the other, and the rounded one is usually the second one.
* The rounding here is pretty inconsequential, I think, as these are not measurements that demand decimal accuracy.

__Fix this with a clean that removes duplicates, preferring the first row over the second or subsequent duplicate rows.__
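
A plain `drop_duplicates()` would not catch these pairs, because the `ID` and `SalesRatio` values differ between the two copies. A minimal sketch of keeping the first copy, assuming a subset of identifying columns is enough to recognize a duplicate (the toy frame below reuses a few values from the table above and is only illustrative):

```python
import pandas as pd

# Two duplicate pairs: same sale, but the second copy has a rounded SalesRatio.
toy = pd.DataFrame({
    "ID":           [9986, 9988, 18206, 18209],
    "SerialNumber": [20074, 20074, 20074, 20074],
    "ListYear":     [2002, 2002, 2002, 2002],
    "Town":         ["Avon", "Avon", "Bethany", "Bethany"],
    "SalesRatio":   [55.707826, 55.70, 50.079338, 50.07],
})

# Comparing whole rows finds nothing, since ID and SalesRatio differ.
full_row_dedup = toy.drop_duplicates()

# Comparing only the identifying columns keeps the first (unrounded) copy of each pair.
deduped = toy.drop_duplicates(subset=["SerialNumber", "ListYear", "Town"], keep="first")
```

With `keep="first"`, the surviving row of each pair is whichever appears first in the frame, so this relies on the unrounded copy usually coming first, as noted in the findings.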

In [159]:
def remove_duplicate_rows(df):
    """
    Removes duplicate rows.
    Rows are duplicates if they have the same SerialNumber, ListYear, Town, DateRecorded, and Address.
    In most cases, there are just two copies of a row with the difference being that the second one is slightly
    rounded.
    Arbitrarily keeps the first row and throws out subsequent duplicate rows.
    """
    # drop_duplicates returns a new DataFrame, so the input df is left unchanged.
    return df.drop_duplicates(subset=['SerialNumber', 'ListYear', 'Town', 'DateRecorded', 'Address'], keep='first')
    

In [160]:
df = get_dataframe(MIN_YEAR, MAX_YEAR)

In [161]:
df_no_duplicates = remove_duplicate_rows(df)

In [165]:
# Spot check: within one serial number, every remaining row should have a distinct address.
for i in range(0, 50):
    randomly = show_all_rows_with_random_serial_number(df_no_duplicates)
    passes_test = len(randomly['Address'].unique()) == len(randomly)
    print(passes_test)
    if not passes_test:
        break
randomly

True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
False


Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
242,2782,10218,2001,05/21/2002 12:00:00 AM,Ansonia,4-6 COOK STREET,75670,110000.0,68.790909,Residential,Two Family,0,
949,9861,10218,2001,03/27/2002 12:00:00 AM,Avon,6 BRIGHTON WAY,105630,178000.0,59.342697,Vacant Land,,0,
1414,15768,10218,2001,03/28/2002 12:00:00 AM,Berlin,37 CRATER LANE,104780,204000.0,51.362745,Residential,Single Family,0,
1914,19721,10218,2001,03/27/2002 12:00:00 AM,Bethel,1205 LEXINGTON BLVD,42980,109900.0,39.108280,Condo,,7,
2873,28723,10218,2001,03/21/2002 12:00:00 AM,Bloomfield,86 W DUDLEY TOWN ROAD,50900,45000.0,113.111100,Vacant Land,,0,
3571,35181,10218,2001,01/15/2002 12:00:00 AM,Branford,49 JEFFERSON WDS,67760,131600.0,51.489360,Condo,,0,
4139,40445,10218,2001,10/29/2001 12:00:00 AM,Bridgeport,129 COLUMBIA STREET,43670,36000.0,121.305600,Residential,Two Family,0,
7521,75006,10218,2001,11/16/2001 12:00:00 AM,Bristol,37 BERKSHIRE DRIVE,106270,170000.0,62.511760,Residential,Single Family,0,
8680,90447,10218,2001,08/16/2002 12:00:00 AM,Burlington,BIGWOOD LANE,130060,255000.0,51.003920,Residential,Single Family,16,
9119,95250,10218,2001,07/15/2002 12:00:00 AM,Canton,MULTI ADDRESSES,89140,190000.0,46.915790,Vacant Land,,3,
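
A `False` from the address check does not necessarily mean a duplicate survived. The check compares addresses across all towns that share a serial number, so two different towns with the same address string would also fail it. The truncated output above does not show whether that is what happened here. A minimal sketch of a deterministic alternative that tests the same key columns used for de-duplication; the helper name `remaining_duplicates` and the toy rows are hypothetical:

```python
import pandas as pd

KEY_COLUMNS = ["SerialNumber", "ListYear", "Town", "DateRecorded", "Address"]

def remaining_duplicates(df, keys=KEY_COLUMNS):
    """Return every row whose key combination still appears more than once."""
    return df[df.duplicated(subset=keys, keep=False)]

# Invented rows: two different towns that happen to share an address string.
toy = pd.DataFrame({
    "SerialNumber": [10218, 10218, 10218],
    "ListYear":     [2001, 2001, 2001],
    "Town":         ["Ansonia", "Avon", "Berlin"],
    "DateRecorded": ["05/21/2002", "03/27/2002", "03/28/2002"],
    "Address":      ["MAIN STREET", "MAIN STREET", "37 CRATER LANE"],
})

# The address-only spot check fails on this frame...
address_test_passes = len(toy["Address"].unique()) == len(toy)

# ...even though no row repeats on the full key.
leftover = remaining_duplicates(toy)
```

Running `remaining_duplicates(df_no_duplicates)` should return an empty frame if the cleaning step worked, and it covers every serial number rather than 50 random ones.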
