### CT Statewide House Sales Transactions
This notebook produces a cleaned version of the data from https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-2001-2016/5mzw-sjtu?category=Housing-and-Development , which lists CT statewide sales transactions on individual properties from 2001-2016. It cleans the data in year-by-year sections and outputs a csv for each year. It then recombines the data in these csvs into one cleaned csv with all the data.

View the raw data here: https://raw.githubusercontent.com/jamiekasulis/ct_real_estate_sales/master/Real_Estate_Sales_2001-2016.csv

View the meanings of NonUseCodes here: file:///C:/Users/jleekas/Downloads/OPM-RealEstate_Codes.pdf

#### Needed Cleaning
* Trim whitespace and replace double-spaces with single-spaces (done)
* Replace address abbreviations like "LN" with their full form, "LANE" (done)
* Fix NonUseCodes. Should be ints only. Use 0 or -1 for absence of a NonUseCode. (done)
* Remove duplicate transactions. (done)
* Catch misspellings of towns and street names using Python fuzzywuzzy (do later, when you are ready to look at individual properties.)
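
The misspelling check in the last bullet has not been done yet. As a rough sketch of the idea, the block below uses the standard library's `difflib` as a stand-in for fuzzywuzzy (a swapped-in technique, not what the notebook will necessarily use). `KNOWN_TOWNS`, `closest_town`, and the `cutoff` value are assumptions made for illustration; a real run would use the full list of 169 town names.

```python
import difflib

# Hypothetical reference list: a few CT town names. A real run would list all 169.
KNOWN_TOWNS = ["Andover", "Ansonia", "Ashford", "Avon"]

def closest_town(name, known=KNOWN_TOWNS, cutoff=0.8):
    """Return the best-matching known town, or the input unchanged if nothing is close."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else name

print(closest_town("Andovr"))    # close to a known name, so it is corrected
print(closest_town("Hartford"))  # not in the short list above, so it is left alone
```

A high cutoff keeps the function conservative: names that are not close to anything in the list pass through untouched and can be reviewed by hand.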

__Note:__ There are several data cleaning notebooks because running the processes on all the years of data at once takes a very long time.

This is where I test my cleaning process on just a small sample of 2000 listings.

In [5]:
import pandas as pd

In [6]:
def make_raw_df():
    """
    Return the raw df, which is uncleaned, to test cleaning repeatedly.
    """
    return pd.read_csv("https://media.githubusercontent.com/media/jamiekasulis/ct_real_estate_sales/master/data/Real_Estate_Sales_2001-2016.csv")

### Subset: top 2000 listings. Use this to test cleaning functions.

In [7]:
def sample_raw_df():
    """
    Return a sample of 2000 listings from the raw data. Use for testing.
    """
    return make_raw_df()[0:2000]

In [8]:
sample = sample_raw_df()
print(len(sample))

2000


### Trim whitespace at the ends and middle of fields.

In [9]:
def trim_whitespace(df, column):
    """
    Removes all leading and trailing whitespace from the strings in column. Also turns double spaces into single spaces.
    """
    new_df = df.copy()
    
    for index in new_df.index:
        new_df.loc[index,column] = str(new_df.loc[index,column]).strip().replace('  ', ' ')
    return new_df

In [10]:
new_sample = trim_whitespace(sample, 'Town')
new_sample = trim_whitespace(new_sample, 'Address')

In [None]:
new_sample.head()
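
The row-by-row loop in `trim_whitespace` is one likely reason the cleaning is slow on the full data. A possible vectorized variant, assuming pandas string methods are acceptable here, is sketched below. `trim_whitespace_vectorized` is a hypothetical name, and unlike the loop it collapses any run of spaces rather than only double spaces.

```python
import pandas as pd

def trim_whitespace_vectorized(df, column):
    """Strip leading/trailing whitespace and collapse runs of spaces in one column."""
    new_df = df.copy()
    new_df[column] = (
        new_df[column]
        .astype(str)                             # mirrors the str(...) call in the loop version
        .str.strip()
        .str.replace(r" {2,}", " ", regex=True)  # two or more spaces become one
    )
    return new_df

demo = pd.DataFrame({"Address": ["  10  BAUSOLA RD ", "109 LAKESIDE   DR"]})
print(trim_whitespace_vectorized(demo, "Address")["Address"].tolist())
```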

### Convert abbreviated street names to their full names.
Will have to do more thorough Address cleaning later.

In [11]:
street_conversions = {
    ' LN':' LANE',
    ' RD':' ROAD',
    ' ST':' STREET',
    ' DR':' DRIVE',
    ' PL':' PLACE',
    ' HL': ' HILL',
    ' TER': ' TERRACE'
}
def convert_address_street_abbreviations(df, conversions):
    """
    Will go through all the rows in a copy of df and change street abbreviations to their full names,
    i.e. "10 CHESTER BROOKS LN" will become "10 CHESTER BROOKS LANE".
    """
    new_df = df.copy()
    
    # Iterate through each row
    for index in new_df.index:
        current = str(new_df.loc[index, 'Address']) # get current address
        #print(current)
        
        for key in conversions.keys():
            if key in current:
                # Skip when the full form is already present: ' DR' is in ' DRIVE' and ' ST' is in
                # ' STREET', so replacing again would produce 'DRIVEIVE' or 'STREETREET'.
                if conversions[key] not in current:
                    new_df.loc[index, 'Address'] = current.replace(key, conversions[key])
                break
    
    return new_df

In [12]:
clean_df = convert_address_street_abbreviations(new_sample, street_conversions)

In [None]:
clean_df.head(50)
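
Because the conversion above matches substrings, an abbreviation can also match the start of a longer word (for example ' ST' inside ' STATE'). One way to avoid that, sketched here under the assumption that abbreviations always appear as whole words, is a regular expression with word boundaries. `ABBREVIATIONS`, `PATTERN`, and `expand_abbreviations` are hypothetical names, not part of the notebook.

```python
import re
import pandas as pd

# Same mapping idea as street_conversions, but keyed without the leading space.
ABBREVIATIONS = {"LN": "LANE", "RD": "ROAD", "ST": "STREET", "DR": "DRIVE",
                 "PL": "PLACE", "HL": "HILL", "TER": "TERRACE"}

# \b...\b matches an abbreviation only when it stands alone as a whole word.
PATTERN = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b")

def expand_abbreviations(addresses):
    """Replace whole-word street abbreviations in a Series of address strings."""
    return addresses.astype(str).str.replace(
        PATTERN, lambda m: ABBREVIATIONS[m.group(1)], regex=True)

demo = pd.Series(["10 CHESTER BROOKS LN", "5 STATE ST", "109 LAKESIDE DRIVE"])
print(expand_abbreviations(demo).tolist())
```

This still cannot tell "ST" meaning STREET from "ST" meaning SAINT, so addresses like that would need the more thorough Address cleaning mentioned above.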

### Create CSVs for each year of listings.

In [13]:
def remove_duplicate_rows(df):
    """
    Removes duplicate rows.
    Rows are duplicates if they have the same serial number, ListYear, and town.
    In most cases, there are just two copies of a row with the difference being that the second one is slightly
    rounded.
    Arbitrarily choose the first row and throw out subsequent duplicate rows.
    """
    new_df = df.copy()
    return new_df.drop_duplicates(['SerialNumber', 'ListYear', 'Town']) #, 'DateRecorded', 'Address'])
    

In [None]:
clean_df = remove_duplicate_rows(clean_df)

In [30]:
clean_df.columns

Index(['ID', 'SerialNumber', 'ListYear', 'DateRecorded', 'Town', 'Address',
       'AssessedValue', 'SaleAmount', 'SalesRatio', 'PropertyType',
       'ResidentialType', 'NonUseCode', 'Remarks'],
      dtype='object')

In [87]:
def identify_duplicate_rows(clean_df, raw_df):
    """
    clean_df is a df with duplicates removed (currently unused). raw_df is a df that has yet to have its duplicates removed.
    Returns the rows of raw_df that share 'SerialNumber', 'ListYear', and 'Town' with at least one other row,
    the same columns that remove_duplicate_rows works on.
    Each group of matching rows is appended once per member, so every duplicated row appears more than once.
    """
    new_df = raw_df.copy()
    # Create a new column, 'duplication', which is the string concatenation of the 3 columns that remove_duplicate_rows
    # checks to identify duplicates. This SHOULD be a unique identifier, so we return the rows
    # whose 'duplication' value is shared with other rows.
    new_df['SerialNumber'] = new_df['SerialNumber'].astype(str)
    new_df['ListYear'] = new_df['ListYear'].astype(str)
    new_df['duplication'] = new_df['SerialNumber'] + new_df['ListYear'] + new_df['Town']
    
    duplicates = new_df[0:0] # Create an empty df of duplicates
    
    # Iterate through all rows, identifying rows that have the same 'duplication' column
    for index in new_df.index:
        current_dupe = new_df.loc[index, 'duplication']
        dupe_in_common = new_df[new_df['duplication'] == current_dupe]
        if len(dupe_in_common) > 1:
            duplicates = pd.concat([duplicates, dupe_in_common])
    return duplicates

In [88]:
identify_duplicate_rows(sp, sp_raw)

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks,duplication
5,6,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.341232,Residential,Single Family,0,,200302002Andover
6,7,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.340000,Residential,Single Family,0,,200302002Andover
5,6,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.341232,Residential,Single Family,0,,200302002Andover
6,7,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.340000,Residential,Single Family,0,,200302002Andover
22,23,20026,2002,03/04/2003 12:00:00 AM,Andover,109 LAKESIDE DR,96700,146000.0,66.232877,Residential,Single Family,0,,200262002Andover
23,24,20026,2002,03/04/2003 12:00:00 AM,Andover,109 LAKESIDE DR,96700,146000.0,66.230000,Residential,Single Family,0,,200262002Andover
22,23,20026,2002,03/04/2003 12:00:00 AM,Andover,109 LAKESIDE DR,96700,146000.0,66.232877,Residential,Single Family,0,,200262002Andover
23,24,20026,2002,03/04/2003 12:00:00 AM,Andover,109 LAKESIDE DR,96700,146000.0,66.230000,Residential,Single Family,0,,200262002Andover
33,34,20037,2002,05/05/2003 12:00:00 AM,Andover,112 LAKE RD,0,129000.0,0.000000,Residential,Single Family,0,,200372002Andover
34,35,20037,2002,05/05/2003 12:00:00 AM,Andover,112 LAKE RD,78400,129000.0,60.770000,Residential,Single Family,0,,200372002Andover


In [21]:
sp_raw = sample_raw_df()

In [56]:
# Do all cleans that come before removing duplicates
sp = trim_whitespace(sp_raw, 'Town')
sp = trim_whitespace(sp, 'Address')
sp = convert_address_street_abbreviations(sp, street_conversions)

In [57]:
print("Length of sp before removing duplicates: " + str(len(sp)))

Length of sp before removing duplicates: 2000


In [62]:
sp = remove_duplicate_rows(sp)

In [63]:
print("Length of sp after removing duplicates: " + str(len(sp)))

Length of sp after removing duplicates: 1932


In [82]:
dupes = identify_duplicate_rows(sp, new_sample)
print("Number of duplicate rows: " + str(len(dupes)))
print(len(sp) == len(sp_raw) - len(dupes) / 2)
print(len(sp_raw) - len(dupes) / 2)

Number of duplicate rows: 272
False
1864.0
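
The check above prints False because `identify_duplicate_rows` appends each group of matching rows once per member, so a duplicated pair contributes four rows to `dupes`, not two. The numbers are consistent with every group being a pair: 272 / 4 = 68, and 2000 - 68 = 1932. A sketch of a count that lists each duplicated row only once, using `DataFrame.duplicated` on toy data (the `demo` frame and the `KEY` name are made up for illustration):

```python
import pandas as pd

KEY = ["SerialNumber", "ListYear", "Town"]

# Toy data: two duplicated pairs and one unique row.
demo = pd.DataFrame({
    "SerialNumber": [20030, 20030, 20026, 20026, 20001],
    "ListYear": [2002, 2002, 2002, 2002, 2002],
    "Town": ["Andover"] * 5,
})

dupes = demo[demo.duplicated(subset=KEY, keep=False)]  # every row in a duplicated group, once
removed = demo.duplicated(subset=KEY).sum()            # rows that drop_duplicates would discard

print(len(dupes), removed, len(demo) - removed)
```

With this count, `len(demo) - removed` should always equal the length after `drop_duplicates(KEY)`, whatever the group sizes are.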


In [79]:
dupes

Unnamed: 0,ID,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,NonUseCode,Remarks
0,6,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.341232,Residential,Single Family,0,
1,7,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.340000,Residential,Single Family,0,
2,6,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.341232,Residential,Single Family,0,
3,7,20030,2002,04/24/2003 12:00:00 AM,Andover,10 BAUSOLA RD,91800,189900.0,48.340000,Residential,Single Family,0,
4,23,20026,2002,03/04/2003 12:00:00 AM,Andover,109 LAKESIDE DR,96700,146000.0,66.232877,Residential,Single Family,0,
5,24,20026,2002,03/04/2003 12:00:00 AM,Andover,109 LAKESIDE DR,96700,146000.0,66.230000,Residential,Single Family,0,
6,23,20026,2002,03/04/2003 12:00:00 AM,Andover,109 LAKESIDE DR,96700,146000.0,66.232877,Residential,Single Family,0,
7,24,20026,2002,03/04/2003 12:00:00 AM,Andover,109 LAKESIDE DR,96700,146000.0,66.230000,Residential,Single Family,0,
8,34,20037,2002,05/05/2003 12:00:00 AM,Andover,112 LAKE RD,0,129000.0,0.000000,Residential,Single Family,0,
9,35,20037,2002,05/05/2003 12:00:00 AM,Andover,112 LAKE RD,78400,129000.0,60.770000,Residential,Single Family,0,


In [70]:
def clean_nonusecode(df):
    """
    Some of the NonUseCodes are long strings with descriptors, which we don't need because the OPM data includes their
    descriptions in a separate pdf. Some are also NaN.
    This function turns all NonUseCodes into ints and sets the NaN ones to -1.
    """
    new_df = df.copy()
    new_df['NonUseCode'] = new_df['NonUseCode'].astype(str)
    
    for index in new_df.index:
        # NaN case
        current_code = new_df.loc[index, 'NonUseCode']
        if 'na' in current_code:
            new_df.loc[index, 'NonUseCode'] = -1
        #0-9 case
        elif len(current_code) < 2:
            new_df.loc[index, 'NonUseCode'] = "0" + current_code
        # XX... case, where we want to cut off additional text if there is any
        else:
            new_df.loc[index, 'NonUseCode'] = current_code[0:2]
    
    new_df['NonUseCode'] = new_df['NonUseCode'].astype(int)
    return new_df

In [None]:
clean_df = clean_nonusecode(clean_df)
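
For comparison, here is a possible vectorized take on the same cleaning, assuming every valid NonUseCode starts with one or two digits. `clean_nonusecode_vectorized` is a hypothetical name, and the descriptor text in the toy data is invented.

```python
import pandas as pd

def clean_nonusecode_vectorized(df):
    """Keep the leading one- or two-digit code; use -1 where no code is present."""
    new_df = df.copy()
    # Pull out the leading digits; rows without any (NaN, free text) become missing.
    codes = new_df["NonUseCode"].astype(str).str.extract(r"^\s*(\d{1,2})", expand=False)
    new_df["NonUseCode"] = pd.to_numeric(codes, errors="coerce").fillna(-1).astype(int)
    return new_df

# "14 - SOME DESCRIPTION" stands in for a code that carries descriptor text.
demo = pd.DataFrame({"NonUseCode": ["7", "14 - SOME DESCRIPTION", None, "25"]})
print(clean_nonusecode_vectorized(demo)["NonUseCode"].tolist())
```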

#### Clean SalesRatio, which flips from 50% being 0.5 to 50.0 at some point.

In [None]:
# Notice how, at the top, some of the sales ratios are not just AssessedValue/SaleAmount but 
# AssessedValue/SaleAmount * 100...
clean_df.sort_values('SalesRatio', ascending=False).head(5)

In [None]:
def clean_salesratio(df):
    new_df = df.copy()
    new_df['SalesRatio'] = new_df['AssessedValue'] / new_df['SaleAmount']
    return new_df

In [None]:
clean_df = clean_salesratio(clean_df)
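
Recomputing the ratio as AssessedValue / SaleAmount puts every row on the same fractional scale. One detail the function above does not handle is a SaleAmount of 0, which would give an infinite ratio. A minimal sketch that leaves those rows as NaN instead, assuming that is the preferred treatment (`sales_ratio` is a hypothetical helper, and the third toy row is invented):

```python
import numpy as np
import pandas as pd

def sales_ratio(df):
    """AssessedValue / SaleAmount as a fraction; NaN where SaleAmount is 0."""
    sale = df["SaleAmount"].replace(0, np.nan)  # avoid dividing by zero
    return df["AssessedValue"] / sale

demo = pd.DataFrame({
    "AssessedValue": [91800, 78400, 50000],
    "SaleAmount": [189900.0, 129000.0, 0.0],
})
ratios = sales_ratio(demo)
print(ratios.round(4).tolist())
```

The first two rows use values from the sample output above, where the same sales appear on the percentage scale as 48.341232 and 60.77.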

In [None]:
def clean_raw_data(raw_data):
    clean_data = raw_data.copy()
    clean_data = trim_whitespace(clean_data, 'Town')
    print("First clean done. (whitespace trimmed on 'Town')")
    clean_data = trim_whitespace(clean_data, 'Address')
    print("Second clean done (whitespace trimmed on 'Address').")
    clean_data = convert_address_street_abbreviations(clean_data, street_conversions)
    print("Third clean done (street abbreviations).")
    clean_data = clean_nonusecode(clean_data)
    print("Fourth cleaning done (nonusecode).")
    clean_data = remove_duplicate_rows(clean_data)
    print("Fifth cleaning done (removing duplicates).")
    clean_data = clean_salesratio(clean_data)
    print("Sixth cleaning done (salesratio). Returning...")
    
    return clean_data

In [None]:
sample = sample_raw_df()
print(len(sample))
new_clean_data = clean_raw_data(sample)
new_clean_data.head()
print(len(new_clean_data))

In [None]:
def create_clean_csv(raw_df, year):
    """
    Given a raw dataframe and a listing year, this will first extract all the listings
    from year from raw_df. Then, it will create a clean version of that dataframe.
    Then, it will write this to a csv file.
    
    The file name convention is clean_data_year_listings.csv
    """
    raw_subset = raw_df[raw_df['ListYear'] == year]
    clean_subset = clean_raw_data(raw_subset)
    file_location = "data/clean_data_" + str(year) + "_listings.csv"
    clean_subset.to_csv(file_location, index=False)

In [None]:
raw_df = make_raw_df()

In [None]:
raw_df.head()

In [None]:
create_clean_csv(raw_df, 2001)

In [None]:
df_2001 = pd.read_csv("data/clean_data_2001_listings.csv")

In [None]:
raw_2001 = raw_df[raw_df['ListYear'] == 2001]
raw_2001.head()

In [None]:
df_2001.head()

In [None]:
raw_2001.tail()

In [None]:
df_2001.tail()

In [None]:
len(raw_2001)

In [None]:
len(df_2001)

### Let's standardize this length test.

In [None]:
def get_raw_df(year):
    """Get the raw data in a dataframe for a particular ListYear year."""
    return raw_df[raw_df['ListYear'] == year]

In [None]:
def get_clean_df(year):
    """Get the clean data in a dataframe for a particular ListYear year.
    Uses a clean csv."""
    filename = "data/clean_data_" + str(year) + "_listings.csv"
    return pd.read_csv(filename)

In [None]:
len(get_raw_df(2001)) == len(get_clean_df(2001))

In [None]:
"""
EDIT: Should not use this as a test anymore because the cleaning functions now remove duplicates, so of course the
lengths would not be equal.
"""
def lengths_are_equal(year):
    """
    Will get the raw and clean data for the given year and compare the lengths of these dataframes.
    """
    return len(get_raw_df(year)) == len(get_clean_df(year))

In [None]:
lengths_are_equal(2001)

### Now let's produce csvs for all years and test them all.

In [None]:
create_clean_csv(raw_df, 2002)
lengths_are_equal(2002)

In [None]:
create_clean_csv(raw_df, 2003)
lengths_are_equal(2003)

In [None]:
create_clean_csv(raw_df, 2004)
lengths_are_equal(2004)

In [None]:
create_clean_csv(raw_df, 2005)
lengths_are_equal(2005)

In [None]:
create_clean_csv(raw_df, 2006)
lengths_are_equal(2006)

In [None]:
create_clean_csv(raw_df, 2007)
lengths_are_equal(2007)

In [None]:
create_clean_csv(raw_df, 2008)
lengths_are_equal(2008)

In [None]:
create_clean_csv(raw_df, 2009)
lengths_are_equal(2009)

In [None]:
create_clean_csv(raw_df, 2010)
lengths_are_equal(2010)

In [None]:
create_clean_csv(raw_df, 2011)
lengths_are_equal(2011)

In [None]:
create_clean_csv(raw_df, 2012)
lengths_are_equal(2012)

In [None]:
create_clean_csv(raw_df, 2013)
lengths_are_equal(2013)

In [None]:
create_clean_csv(raw_df, 2014)
lengths_are_equal(2014)

In [None]:
create_clean_csv(raw_df, 2015)
lengths_are_equal(2015)

In [None]:
create_clean_csv(raw_df, 2016)
lengths_are_equal(2016)

### Recombine the clean csvs into one clean dataframe and write it to a master csv file
Note: You will NOT be able to upload this to GitHub.

In [None]:
# DataFrame.append was removed in pandas 2.0, so combine the yearly csvs with pd.concat instead.
clean_df = pd.concat(
    [pd.read_csv("data/clean_data_" + str(year) + "_listings.csv") for year in range(2001, 2017)]
)

In [None]:
clean_df

In [None]:
clean_df.tail()

In [None]:
len(clean_df)

In [None]:
sum_length = 0
for year in range(2001, 2017):
    years_df = pd.read_csv("data/clean_data_" + str(year) + "_listings.csv")
    sum_length += len(years_df)

print("The length of clean_df should equal the sum length of each individual year df.")
print("Length of clean_df = %d" %len(clean_df))
print("Sum length of dfs = %d" %sum_length)
print("Equal? " + str(len(clean_df) == sum_length))

In [None]:
# Now write to a csv file
clean_df.to_csv("data/clean_data.csv", index=False)
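
The write-one-csv-per-year and recombine steps can also be expressed as two short loops. The sketch below is self-contained: it uses a toy frame and a temporary directory in place of the real data and the `data/` folder, and skips the cleaning step entirely.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the raw data.
raw = pd.DataFrame({
    "ListYear": [2001, 2001, 2002, 2003],
    "Town": ["Andover", "Avon", "Andover", "Avon"],
})

with tempfile.TemporaryDirectory() as data_dir:
    years = sorted(raw["ListYear"].unique())

    # One csv per year, following the clean_data_<year>_listings.csv convention.
    for year in years:
        path = os.path.join(data_dir, "clean_data_%d_listings.csv" % year)
        raw[raw["ListYear"] == year].to_csv(path, index=False)

    # Read the yearly files back and stack them into one frame.
    frames = [pd.read_csv(os.path.join(data_dir, "clean_data_%d_listings.csv" % year))
              for year in years]
    combined = pd.concat(frames, ignore_index=True)

print(len(combined) == sum(len(f) for f in frames))
```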

### Additional checks on the data
#### Town names

In [None]:
clean_df['Town'].unique()

In [None]:
len(clean_df['Town'].unique()) # Should be 169 towns

### Additional checks on the data
#### Years

In [None]:
clean_df['ListYear'].unique() # Should only contain the years 2001 through 2016

### Additional checks on the data
#### PropertyType

In [None]:
clean_df['PropertyType'].unique()

In [None]:
len(clean_df[clean_df['PropertyType'] == 'Condo Family'])

In [None]:
len(clean_df[clean_df['PropertyType'] == 'Condo'])

In [None]:
# So we have this one problematic row. It probably is a Condo but I could just omit it.
clean_df[clean_df['PropertyType'] == 'C']

In [None]:
len(clean_df[clean_df['PropertyType'] == 'Apartments'])

In [None]:
len(clean_df[clean_df['PropertyType'] == 'Apartment'])

In [None]:
clean_df[clean_df['PropertyType'] == 'Apartment']

In [None]:
clean_df[clean_df['PropertyType'] == 'Apartments']

In [None]:
apartments_sales_median = clean_df[clean_df['PropertyType'] == 'Apartments']['SaleAmount'].median()
apartment_sales_median = clean_df[clean_df['PropertyType'] == 'Apartment']['SaleAmount'].median()

In [None]:
print("'Apartments': %d\n'Apartment': %d" %(apartments_sales_median, apartment_sales_median))

In [None]:
clean_df[clean_df['PropertyType'] == '10 Mill Forest']
# Note: I believe these are purchases of forest land for the 10 Mill Law

### Additional checks on the data
#### NonUseCode

In [None]:
clean_df['NonUseCode'].unique()

In [None]:
df_2003 = pd.read_csv("data/clean_data_2003_listings.csv")
small_df = df_2003[0:200]

In [None]:
small_df.head(100)

In [None]:
small_df['NonUseCode'].describe()

In [None]:
clean_nonusecode(raw_df[0:100])

In [None]:
clean_nonusecode(raw_df[100:200])

### Additional checks on the data
These informed the cleaning that has been done above.
#### AssessedValue, SaleAmount, SalesRatio

In [None]:
clean_df['AssessedValue'].describe()

In [None]:
clean_df['AssessedValue'].head()

In [None]:
clean_df['SaleAmount'].head()

In [None]:
clean_df['SaleAmount'].describe()

In [None]:
clean_df['SalesRatio']

In [None]:
clean_df['SalesRatio'].describe()

In [None]:
# Some of these sales ratios are high. Let's make sure the assessed price and sale price are appropriately different.
clean_df[clean_df['SalesRatio'] > 5][['Address', 'AssessedValue', 'SaleAmount', 'SalesRatio', 'PropertyType']]

In [None]:
# This makes sense. Homes with a high sales ratio should be on the lower end in terms of salesprice, which
# would also decrease the taxed value of the home?

### Additional checks on the data
#### Remarks

In [None]:
clean_df['Remarks'].head(20)

In [None]:
clean_df['Remarks'].tail(20)

In [None]:
# Looks like more remarks were recorded towards the end of the data (later years).

In [None]:
MIN_YEAR = 2001
MAX_YEAR = 2016

In [None]:
def get_dataframe(MIN_YEAR, MAX_YEAR):
    """
    Pass in a range of years. Will combine all of the clean CSVs corresponding to that time range into one main dataframe.
    These CSVs have already been cleaned in various ways:
        1. Remove leading and trailing whitespace
        2. Replace double-spaces with single-spaces in the Address field
        3. Replace abbreviations like "LN" and "RD" in the Address field with their full names ("LANE", "ROAD", etc.)
        4. Fix NonUseCodes so that they are only two-digit or less integers.
        5. Remove duplicate rows.
    """
    # Combine the year-by-year clean csvs, which are located at 'data/clean_data_20xx_listings.csv'.
    # DataFrame.append was removed in pandas 2.0, so use pd.concat.
    df = pd.concat(
        [pd.read_csv('data/clean_data_' + str(year) + '_listings.csv') for year in range(MIN_YEAR, MAX_YEAR+1)]
    )

    # Now remove the index column
    #df = df.drop('Unnamed: 0', 1)
    return df

In [None]:
df = get_dataframe(MIN_YEAR, MAX_YEAR)

### Additional checks on the data
#### Looking for duplicate transactions

In [None]:
# Are all the SerialNumbers unique?
len(df['SerialNumber'].unique())

In [None]:
len(df['SerialNumber'])

In [None]:
# How many duplicates are there?
len(df['SerialNumber']) - len(df['SerialNumber'].unique())

In [None]:
import random

In [None]:
def get_random_serial_number(df):
    """
    Returns a serial number from df at random.
    """
    # randint(0, len(df)) can return len(df), which is out of bounds for iloc; randrange excludes it.
    random_index = random.randrange(len(df))
    random_serial = df.iloc[random_index]['SerialNumber']
    return random_serial

In [None]:
def show_all_rows_with_random_serial_number(df):
    """
    Returns a subset of df based on a random serial number.
    """
    random_serial = get_random_serial_number(df)
    subset = df[df['SerialNumber'] == random_serial]
    #print("SERIAL #: %d\t\tSIZE: %d" %(random_serial, len(subset)))
    return subset

In [None]:
show_all_rows_with_random_serial_number(df)

Findings
* The first 1-2 digits of the serial number represent the last 1-2 digits of the year (i.e. 2015 serial numbers start with '15'. 2004 serial numbers start with '4')
* A serial number does not seem to repeat towns in a year. So if there are 10 rows with the same serial number in 2010, those will all be from 10 different towns.
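
The first finding can be spot-checked in code. The sketch below assumes the prefix is the last two digits of ListYear with no leading zero, as described above. `serial_matches_year` is a hypothetical helper, and only the first toy row comes from the sample output; the other two are invented.

```python
import pandas as pd

def serial_matches_year(df):
    """True where SerialNumber starts with the last two digits of ListYear (no leading zero)."""
    prefix = (df["ListYear"] % 100).astype(str)
    serial = df["SerialNumber"].astype(str)
    return pd.Series([s.startswith(p) for s, p in zip(serial, prefix)], index=df.index)

demo = pd.DataFrame({
    "SerialNumber": [20030, 150012, 40007],
    "ListYear": [2002, 2015, 2003],
})
print(serial_matches_year(demo).tolist())
```

This is a weak check for the early years: a one-digit prefix such as '1' for 2001 would also match serial numbers from 2010 onward.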

In [None]:
def test_duplicate_row_removal(df):
    """
    Samples 50 random serial numbers and returns the first subset that contains a likely duplicate row,
    judged by whether the number of unique towns is not equal to the number of rows with that serial number.
    Returns None if every sampled subset passes.
    """
    for i in range(0,50):
        # Are there rows with the same serial number AND same Town?
        sample_subset = show_all_rows_with_random_serial_number(df)
        passes_test = len(sample_subset) == len(sample_subset['Town'].unique())     # Best case scenario: lengths are equal.

        if not passes_test:
            print("Test failed! Duplicate row likely.")
            return sample_subset
    return None

In [None]:
fails_test_table = test_duplicate_row_removal(df)
fails_test_table

Findings
* If the same serial number occurs among rows with the same year and town, they are duplicates.
* One row seems to be the partially rounded version of the other, and the rounded one is usually the second one.
* The rounding here is pretty inconsequential, I think, as these are not measurements that demand decimal accuracy.

__Fix this with a clean that removes duplicates, preferring the first row over the second or subsequent duplicate rows.__

In [None]:
df = get_dataframe(MIN_YEAR, MAX_YEAR)

In [None]:
df_no_duplicates = remove_duplicate_rows(df)

In [None]:
for i in range(0, 50):
    randomly = show_all_rows_with_random_serial_number(df_no_duplicates)
    passes_test = len(randomly['Address'].unique()) == len(randomly)
    print(passes_test)
    if not passes_test:
        break
randomly
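
Random sampling can miss a remaining duplicate. An exhaustive version of the same idea, sketched with `groupby`, compares the row count and the number of distinct towns for every serial number at once. `serials_with_repeated_towns` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd

def serials_with_repeated_towns(df):
    """Return per-serial counts for serial numbers that repeat a town."""
    counts = df.groupby("SerialNumber")["Town"].agg(["size", "nunique"])
    return counts[counts["size"] != counts["nunique"]]

demo = pd.DataFrame({
    "SerialNumber": [20030, 20030, 20026, 20026],
    "Town": ["Andover", "Andover", "Andover", "Avon"],
})
print(serials_with_repeated_towns(demo).index.tolist())
```

An empty result would mean no serial number repeats a town anywhere in the frame, which is the condition the random test above checks on 50 samples.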