# Haystacks AI Project 4 Group 1: Quantitative Explainability Solution

## Geolocation

<a id=toc></a>
## Table of Contents

<ul>
    <li><a href=#01-import-packages>Import Packages</a>
    <li><a href=#02-load-dataset>Load Datasets and Check Properties</a>
        <ul>
            <li><a href=#02-a-counties>Georgia Counties</a>
        </ul>
        <ul>
            <li><a href=#02-b-zip-codes>Georgia Zip Codes</a>
        </ul>
        <ul>
            <li><a href=#02-c-houses>Georgia Houses</a>
        </ul>
    <li><a href=#03-clean-data>Clean House Address Coordinate Data</a>
        <ul>
            <li><a href=#03-a-drop-extra-index>Drop Unnecessary Index Column</a>
        </ul>
        <ul>
            <li><a href=#03-b-drop-extra-county>Drop Redundant County Name Column</a>
        </ul>
        <ul>
            <li><a href=#03-c-rename-columns>Rename and Relocate Columns</a>
        </ul>
    <li><a href=#04-save-file>Save Cleaned File</a>
</ul>

<a id=01-import-packages></a>
## Import Packages

Import necessary packages.

In [1]:
# Dataframes and numerical
import pandas as pd
import numpy as np

# Geolocation
import geopandas as gpd
import matplotlib.pyplot as plt

# Apache parquet files (to save space)
# import pyarrow as pa
# import pyarrow.parquet as pq

# Increase pandas default display limits
pd.options.display.max_rows = 250
pd.options.display.max_columns = 250

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
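
Silencing every warning globally can also hide deprecation notices from pandas or geopandas. As a hedged alternative (not what this notebook does), warnings can be suppressed only around a specific call with the standard library's `warnings.catch_warnings()`. The function `noisy_step` below is a hypothetical stand-in for any call that emits a warning.

```python
import warnings

def noisy_step():
    # Hypothetical stand-in for a library call that emits a warning
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Suppress warnings only inside this block; the previous filters are restored on exit
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

print(result)
```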

<a href=#toc>Back to the top</a>

<a id=02-load-dataset></a>
## Load Datasets and Check Properties

Load the reference boundary files (GeoJSON) and the list of addresses to be tagged.

<a id=02-a-counties></a>
### Georgia Counties

In [2]:
# From https://towardsdatascience.com/tagging-a-location-to-a-shapefile-area-using-geopandas-5d74336128bf

# Set the filepath and load in a shapefile
# Shape file found here:
# https://maps.princeton.edu/catalog/tufts-gacounties10
ga_counties = "data/geojson/tufts-gacounties10-geojson.json"
map_ga_counties = gpd.read_file(ga_counties)

# Check the GeoDataframe
map_ga_counties.head()

,id,statefp10,countyfp10,countyns10,geoid10,name10,namelsad10,lsad10,classfp10,mtfcc10,csafp10,cbsafp10,metdivfp10,funcstat10,aland10,awater10,intptlat10,intptlon10,geometry
0,GISPORTAL.GISOWNER01.GACOUNTIES10.1,13,173,348102,13173,Lanier,Lanier County,6,H1,G4020,,46660.0,,A,479824426,37625302,31.0381973,-83.0631635,"POLYGON ((-83.04290 30.94730, -83.04300 30.947..."
1,GISPORTAL.GISOWNER01.GACOUNTIES10.2,13,29,350496,13029,Bryan,Bryan County,6,H1,G4020,496.0,42340.0,,A,1129148153,47887043,32.0179692,-81.4385431,"POLYGON ((-81.40500 31.93700, -81.40500 31.937..."
2,GISPORTAL.GISOWNER01.GACOUNTIES10.3,13,1,349113,13001,Appling,Appling County,6,H1,G4020,,,,A,1313333924,13417422,31.739712,-82.2901025,"POLYGON ((-82.45870 31.83810, -82.43140 31.838..."
3,GISPORTAL.GISOWNER01.GACOUNTIES10.4,13,241,351489,13241,Rabun,Rabun County,6,H1,G4020,,,,A,958276574,17845339,34.8830262,-83.4047353,"POLYGON ((-83.61820 34.91140, -83.61790 34.911..."
4,GISPORTAL.GISOWNER01.GACOUNTIES10.5,13,23,347451,13023,Bleckley,Bleckley County,6,H1,G4020,,,,A,559100179,8447354,32.4354034,-83.3317174,"POLYGON ((-83.29780 32.54970, -83.29780 32.549..."


To map addresses to counties by their coordinates, either the **name10** column (e.g. Lanier) or the **namelsad10** column (e.g. Lanier County) can serve as the county label.
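
One way to tag coordinates with a county is a spatial join. The sketch below is an illustration on toy data, not the notebook's own code: the two square "counties" and the three points are made up, and it assumes both layers share the same CRS (EPSG:4326 here; check `map_ga_counties.crs` for the real file). It uses `geopandas.sjoin` with the `predicate` argument, which older geopandas releases called `op`.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Toy stand-ins for the county polygons (two unit squares side by side)
counties = gpd.GeoDataFrame(
    {"name10": ["West", "East"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

# Toy stand-ins for the house coordinates
houses = pd.DataFrame({"latitude": [0.5, 0.5, 5.0], "longitude": [0.25, 1.75, 5.0]})

# Build point geometries; x is longitude and y is latitude
points = gpd.GeoDataFrame(
    houses,
    geometry=gpd.points_from_xy(houses["longitude"], houses["latitude"]),
    crs="EPSG:4326",
)

# Left join keeps every house; points outside all polygons get NaN for name10
tagged = gpd.sjoin(points, counties[["name10", "geometry"]], how="left", predicate="within")
print(tagged[["latitude", "longitude", "name10"]])
```

The same pattern would apply to the zip code boundaries by swapping in that layer and its label column.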

<a href=#toc>Back to the top</a>

<a id=02-b-zip-codes></a>
### Georgia Zip Codes

In [3]:
# From https://towardsdatascience.com/tagging-a-location-to-a-shapefile-area-using-geopandas-5d74336128bf

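# Set the filepath and load the zip code (ZCTA) boundaries from GeoJSON
# Note: the link below is the county boundary source reused from the previous cell;
# the file name suggests this data comes from the harvard-tg00gazcta record
# https://maps.princeton.edu/catalog/tufts-gacounties10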
ga_zipcodes = "data/geojson/harvard-tg00gazcta-geojson.json"
map_ga_zipcodes = gpd.read_file(ga_zipcodes)

# Check the GeoDataframe
map_ga_zipcodes.head()

,id,GIST_ID,COUNTY,ZCTA,SHAPE_AREA,SHAPE_LEN,geometry
0,TG00GAZCTA.1,1,13001,31513,0.084476,2.237114,"MULTIPOLYGON (((-82.11453 31.90056, -82.11375 ..."
1,TG00GAZCTA.2,2,13001,31518,0.007428,0.428838,"MULTIPOLYGON (((-82.24437 31.58115, -82.24420 ..."
2,TG00GAZCTA.3,3,13001,31539,0.000378,0.084954,"MULTIPOLYGON (((-82.52332 31.74917, -82.52129 ..."
3,TG00GAZCTA.4,4,13001,31555,0.003701,0.500718,"MULTIPOLYGON (((-82.16560 31.56300, -82.16580 ..."
4,TG00GAZCTA.5,5,13001,31563,0.021709,1.657408,"MULTIPOLYGON (((-82.19452 31.75583, -82.19442 ..."


To map addresses to zip codes by their coordinates, the **ZCTA** column can serve as the zip code label.
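
ZCTAs are Census Bureau approximations of USPS zip code areas, so a zip code in the address list may have no matching polygon. A quick sanity check, sketched here on made-up values, is to normalize both key columns to five-character strings and list the zip codes that are missing from the boundary file. The dtypes shown (strings for **ZCTA**, integers for the house zip codes) are assumptions for illustration, and `normalize_zip` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-ins: ZCTA codes as strings, house zip codes as integers
zcta = pd.DataFrame({"ZCTA": ["31513", "31518", "31539"]})
houses = pd.DataFrame({"zipcode": [31513, 31643, 31539]})

def normalize_zip(series):
    # Cast to string and left-pad to five characters so keys compare equal
    return series.astype(str).str.zfill(5)

house_zips = set(normalize_zip(houses["zipcode"]))
zcta_zips = set(normalize_zip(zcta["ZCTA"]))

# House zip codes that have no matching polygon in the boundary file
missing = sorted(house_zips - zcta_zips)
print(missing)
```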

<a href=#toc>Back to the top</a>

<a id=02-c-houses></a>
### Georgia Houses

Load .csv file into pandas dataframe.

In [4]:
# Load the Georgia sold properties and their Lat Longs
list_location = pd.read_csv('data/cleaned.csv')

# Check the Pandas Dataframe
list_location.head()

,Unnamed: 0,latitude,longitude,full_street_address,city,county_name,beds,baths_full,baths_half,square_footage,lot_size,year_built,details,special_features,price,transaction_type,listing_status,listing_special_features,census_state_name,census_county_name,zipcode,overall_crime_grade,property_crime_grade,HS_rating,MS_rating,ES_rating,rent,caprate
0,0,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks-County,3.0,1.0,0.0,1460.0,0.0,1910.0,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",0,99000,1,1,0,Georgia,"Brooks, GA",31643,D-,D-,2.5,2.0,1.0,1219.0,1.231313
1,1,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks-County,3.0,1.0,0.0,1460.0,0.0,1910.0,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",0,99000,1,1,0,Georgia,"Brooks, GA",31643,D-,D-,2.5,2.0,1.0,1219.0,1.231313
2,2,30.762972,-81.66024,84 Whippoorwill Circle,Kingsland,Camden-County,3.0,2.0,0.0,1618.0,0.0,1986.0,"Detached, 3 Beds, 2 Baths, 1,618 Sq Ft",0,200000,1,1,0,Georgia,"Camden, GA",31548,B-,C+,8.0,6.0,7.333333,1380.0,0.69
3,3,30.804209,-81.653325,101 College Street,Kingsland,Camden-County,4.0,2.0,0.0,2103.0,0.0,2020.0,"Detached, 4 Beds, 2 Baths, 2,103 Sq Ft",2,339900,1,1,2,Georgia,"Camden, GA",31548,B-,C+,8.0,6.0,7.333333,1603.0,0.471609
4,4,30.823195,-81.635187,241 Jake Colton Drive,Kingsland,Camden-County,4.0,3.0,0.0,2954.0,0.0,2019.0,"Detached, 4 Beds, 3 Baths, 2,954 Sq Ft",0,679900,1,1,0,Georgia,"Camden, GA",31548,B-,C+,8.0,6.0,7.333333,1603.0,0.23577


To streamline the CSV file, redundant columns are dropped and the location columns are grouped together, which makes the file easier to use in the proposed Plotly Dash website.

<a href=#toc>Back to the top</a>

<a id=03-clean-data></a>
## Clean House Address Coordinate Data

Only a few simple modifications are needed to present the data more efficiently.

<a id=03-a-drop-extra-index></a>
### Drop Unnecessary Index Column

First, drop the **Unnamed: 0** column. It only duplicates the row index and serves no other purpose.

In [5]:
# More about this here:
# https://stackoverflow.com/questions/36519086/how-to-get-rid-of-unnamed-0-column-in-a-pandas-dataframe-read-in-from-csv-fil
# Drop the column by its name, 'Unnamed: 0'
list_location.drop('Unnamed: 0', axis=1, inplace=True)

# Check the Pandas Dataframe
list_location.head()

,latitude,longitude,full_street_address,city,county_name,beds,baths_full,baths_half,square_footage,lot_size,year_built,details,special_features,price,transaction_type,listing_status,listing_special_features,census_state_name,census_county_name,zipcode,overall_crime_grade,property_crime_grade,HS_rating,MS_rating,ES_rating,rent,caprate
0,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks-County,3.0,1.0,0.0,1460.0,0.0,1910.0,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",0,99000,1,1,0,Georgia,"Brooks, GA",31643,D-,D-,2.5,2.0,1.0,1219.0,1.231313
1,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks-County,3.0,1.0,0.0,1460.0,0.0,1910.0,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",0,99000,1,1,0,Georgia,"Brooks, GA",31643,D-,D-,2.5,2.0,1.0,1219.0,1.231313
2,30.762972,-81.66024,84 Whippoorwill Circle,Kingsland,Camden-County,3.0,2.0,0.0,1618.0,0.0,1986.0,"Detached, 3 Beds, 2 Baths, 1,618 Sq Ft",0,200000,1,1,0,Georgia,"Camden, GA",31548,B-,C+,8.0,6.0,7.333333,1380.0,0.69
3,30.804209,-81.653325,101 College Street,Kingsland,Camden-County,4.0,2.0,0.0,2103.0,0.0,2020.0,"Detached, 4 Beds, 2 Baths, 2,103 Sq Ft",2,339900,1,1,2,Georgia,"Camden, GA",31548,B-,C+,8.0,6.0,7.333333,1603.0,0.471609
4,30.823195,-81.635187,241 Jake Colton Drive,Kingsland,Camden-County,4.0,3.0,0.0,2954.0,0.0,2019.0,"Detached, 4 Beds, 3 Baths, 2,954 Sq Ft",0,679900,1,1,0,Georgia,"Camden, GA",31548,B-,C+,8.0,6.0,7.333333,1603.0,0.23577
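
An **Unnamed: 0** column typically appears when a dataframe was saved with its index and then read back. As an alternative to dropping it after the fact, `pd.read_csv` can be told to treat the first column as the index. The sketch below uses a small in-memory CSV with made-up contents rather than the real file.

```python
import io
import pandas as pd

# Toy CSV that was written with its index, so the first header field is empty
csv_text = ",latitude,longitude\n0,30.781796,-83.558475\n1,30.762972,-81.66024\n"

# Default read: the unnamed first column becomes 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the index, so nothing needs dropping afterwards
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra.columns.tolist())
print(without_extra.columns.tolist())
```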


<a href=#toc>Back to the top</a>

<a id=03-b-drop-extra-county></a>
### Drop Redundant County Name Column

Similarly, drop the **census_county_name** column, which repeats the county already given in **county_name**.

In [6]:
# Delete the census_county_name column
list_location.drop('census_county_name', axis=1, inplace=True)

# Check the Pandas Dataframe
list_location.head()

,latitude,longitude,full_street_address,city,county_name,beds,baths_full,baths_half,square_footage,lot_size,year_built,details,special_features,price,transaction_type,listing_status,listing_special_features,census_state_name,zipcode,overall_crime_grade,property_crime_grade,HS_rating,MS_rating,ES_rating,rent,caprate
0,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks-County,3.0,1.0,0.0,1460.0,0.0,1910.0,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",0,99000,1,1,0,Georgia,31643,D-,D-,2.5,2.0,1.0,1219.0,1.231313
1,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks-County,3.0,1.0,0.0,1460.0,0.0,1910.0,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",0,99000,1,1,0,Georgia,31643,D-,D-,2.5,2.0,1.0,1219.0,1.231313
2,30.762972,-81.66024,84 Whippoorwill Circle,Kingsland,Camden-County,3.0,2.0,0.0,1618.0,0.0,1986.0,"Detached, 3 Beds, 2 Baths, 1,618 Sq Ft",0,200000,1,1,0,Georgia,31548,B-,C+,8.0,6.0,7.333333,1380.0,0.69
3,30.804209,-81.653325,101 College Street,Kingsland,Camden-County,4.0,2.0,0.0,2103.0,0.0,2020.0,"Detached, 4 Beds, 2 Baths, 2,103 Sq Ft",2,339900,1,1,2,Georgia,31548,B-,C+,8.0,6.0,7.333333,1603.0,0.471609
4,30.823195,-81.635187,241 Jake Colton Drive,Kingsland,Camden-County,4.0,3.0,0.0,2954.0,0.0,2019.0,"Detached, 4 Beds, 3 Baths, 2,954 Sq Ft",0,679900,1,1,0,Georgia,31548,B-,C+,8.0,6.0,7.333333,1603.0,0.23577
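
Before discarding a column as redundant, it can be worth confirming that the two columns really agree on every row. The sketch below runs that check on toy rows shaped like the two county columns. The third row is a deliberately mismatched, made-up example, and the normalization rules (strip `-County`, take the text before the comma) are assumptions based on the sample rows shown.

```python
import pandas as pd

# Toy rows in the same shape as the two county columns
df = pd.DataFrame({
    "county_name": ["Brooks-County", "Camden-County", "Camden-County"],
    "census_county_name": ["Brooks, GA", "Camden, GA", "Glynn, GA"],
})

# Reduce both columns to the bare county name
from_listing = df["county_name"].str.replace("-County", "", regex=False)
from_census = df["census_county_name"].str.split(",").str[0].str.strip()

# Rows where the two sources disagree; an empty result supports dropping one column
mismatches = df[from_listing != from_census]
print(len(mismatches))
```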


<a href=#toc>Back to the top</a>

<a id=03-c-rename-columns></a>
### Rename and Relocate Columns

Rename the **full_street_address**, **county_name**, and **census_state_name** columns for simplicity.

In [7]:
# Rename columns for simplicity
list_location.rename(columns = {'full_street_address': 'address',
                                'county_name': 'county',
                                'census_state_name': 'state'}, inplace = True)

# Remove the "-County" suffix in the new county column to reduce verbiage and space
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
list_location['county'] = list_location['county'].str.replace("-County", "", regex=False)

# Check the Pandas Dataframe
list_location.head()

,latitude,longitude,address,city,county,beds,baths_full,baths_half,square_footage,lot_size,year_built,details,special_features,price,transaction_type,listing_status,listing_special_features,state,zipcode,overall_crime_grade,property_crime_grade,HS_rating,MS_rating,ES_rating,rent,caprate
0,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks,3.0,1.0,0.0,1460.0,0.0,1910.0,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",0,99000,1,1,0,Georgia,31643,D-,D-,2.5,2.0,1.0,1219.0,1.231313
1,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks,3.0,1.0,0.0,1460.0,0.0,1910.0,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",0,99000,1,1,0,Georgia,31643,D-,D-,2.5,2.0,1.0,1219.0,1.231313
2,30.762972,-81.66024,84 Whippoorwill Circle,Kingsland,Camden,3.0,2.0,0.0,1618.0,0.0,1986.0,"Detached, 3 Beds, 2 Baths, 1,618 Sq Ft",0,200000,1,1,0,Georgia,31548,B-,C+,8.0,6.0,7.333333,1380.0,0.69
3,30.804209,-81.653325,101 College Street,Kingsland,Camden,4.0,2.0,0.0,2103.0,0.0,2020.0,"Detached, 4 Beds, 2 Baths, 2,103 Sq Ft",2,339900,1,1,2,Georgia,31548,B-,C+,8.0,6.0,7.333333,1603.0,0.471609
4,30.823195,-81.635187,241 Jake Colton Drive,Kingsland,Camden,4.0,3.0,0.0,2954.0,0.0,2019.0,"Detached, 4 Beds, 3 Baths, 2,954 Sq Ft",0,679900,1,1,0,Georgia,31548,B-,C+,8.0,6.0,7.333333,1603.0,0.23577


Now relocate the **state** and **zipcode** columns next to the **county** column for consistency:

In [8]:
# A function found here for reordering columns and also dropping them if necessary:
# https://stackoverflow.com/questions/35321812/move-column-in-pandas-dataframe
# Note: the set operations do not preserve the original order of the remaining columns
def reorder_columns(columns, first_cols=(), last_cols=(), drop_cols=()):
    # Columns not explicitly placed first, placed last, or dropped (arbitrary order)
    remaining = set(columns) - set(first_cols) - set(drop_cols) - set(last_cols)
    return list(first_cols) + list(remaining) + list(last_cols)

In [9]:
# Now execute the function above
my_list = list_location.columns.tolist()
location_data = ['latitude', 'longitude', 'address', 'city', 'county', 'state', 'zipcode']
reordered_cols = reorder_columns(my_list, first_cols=location_data)
list_location = list_location[reordered_cols]

# Check the Pandas Dataframe
list_location.head()

,latitude,longitude,address,city,county,state,zipcode,listing_status,details,square_footage,overall_crime_grade,ES_rating,caprate,lot_size,baths_half,MS_rating,HS_rating,listing_special_features,rent,beds,special_features,price,baths_full,year_built,property_crime_grade,transaction_type
0,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks,Georgia,31643,1,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",1460.0,D-,1.0,1.231313,0.0,0.0,2.0,2.5,0,1219.0,3.0,0,99000,1.0,1910.0,D-,1
1,30.781796,-83.558475,505 S Lee Street,Quitman,Brooks,Georgia,31643,1,"Detached, 3 Beds, 1 Bath, 1,460 Sq Ft",1460.0,D-,1.0,1.231313,0.0,0.0,2.0,2.5,0,1219.0,3.0,0,99000,1.0,1910.0,D-,1
2,30.762972,-81.66024,84 Whippoorwill Circle,Kingsland,Camden,Georgia,31548,1,"Detached, 3 Beds, 2 Baths, 1,618 Sq Ft",1618.0,B-,7.333333,0.69,0.0,0.0,6.0,8.0,0,1380.0,3.0,0,200000,2.0,1986.0,C+,1
3,30.804209,-81.653325,101 College Street,Kingsland,Camden,Georgia,31548,1,"Detached, 4 Beds, 2 Baths, 2,103 Sq Ft",2103.0,B-,7.333333,0.471609,0.0,0.0,6.0,8.0,2,1603.0,4.0,2,339900,2.0,2020.0,C+,1
4,30.823195,-81.635187,241 Jake Colton Drive,Kingsland,Camden,Georgia,31548,1,"Detached, 4 Beds, 3 Baths, 2,954 Sq Ft",2954.0,B-,7.333333,0.23577,0.0,0.0,6.0,8.0,0,1603.0,4.0,0,679900,3.0,2019.0,C+,1
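
The `set` arithmetic in `reorder_columns` does not keep the remaining columns in their original order, which is why the non-location columns above appear shuffled. If a stable order is preferred, a list comprehension can replace the set differences. The function name `reorder_columns_stable` and the short column list are hypothetical, chosen only for this sketch.

```python
def reorder_columns_stable(columns, first_cols=(), last_cols=(), drop_cols=()):
    """Hypothetical order-preserving variant of reorder_columns."""
    excluded = set(first_cols) | set(last_cols) | set(drop_cols)
    # The list comprehension keeps the remaining columns in their original order
    middle = [col for col in columns if col not in excluded]
    return list(first_cols) + middle + list(last_cols)

cols = ["latitude", "longitude", "address", "beds", "price", "state", "zipcode"]
new_order = reorder_columns_stable(cols, first_cols=["address", "state", "zipcode"])
print(new_order)
# ['address', 'state', 'zipcode', 'latitude', 'longitude', 'beds', 'price']
```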


<a href=#toc>Back to the top</a>

<a id=04-save-file></a>
## Save Cleaned File

Since county and zip code information is already available in the .csv file itself for mapping to the GeoJSON files in the Plotly Dash website, only the cleaned **pandas** dataframe created here needs to be saved for future use.

In [10]:
# index=False prevents another 'Unnamed: 0' column from being written to the cleaned .csv file
list_location.to_csv('data/haystacks_ga_clean_new_format.csv', index=False)
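
A quick round trip can confirm that `index=False` has the intended effect. The sketch below writes a one-row toy dataframe to an in-memory buffer instead of the real output path, reads it back, and checks the column names.

```python
import io
import pandas as pd

df = pd.DataFrame({"latitude": [30.781796], "longitude": [-83.558475], "zipcode": [31643]})

# Write to an in-memory buffer instead of a file for this sketch
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# No 'Unnamed: 0' column should appear after the round trip
print(reloaded.columns.tolist())
```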

<a href=#toc>Back to the top</a>