# Haystacks AI Project 4 Group 1: Quantitative Explainability Solution

## Geolocation

<a id=toc></a>
## Table of Contents

<ul>
    <li><a href=#01-import-packages>Import Packages</a>
    <li><a href=#02-load-dataset>Load Datasets and Check Properties</a>
        <ul>
            <li><a href=#02-a-counties>Georgia Counties</a>
        </ul>
        <ul>
            <li><a href=#02-b-zip-codes>Georgia Zip Codes</a>
        </ul>
        <ul>
            <li><a href=#02-c-houses>Georgia Houses</a>
        </ul>
    <li><a href=#03-clean-data>Clean House Address Coordinate Data</a>
        <ul>
            <li><a href=#03-a-drop-extra-index>Drop Unnecessary Index Column</a>
        </ul>
        <ul>
            <li><a href=#03-b-drop-extra-county>Drop Redundant County Name Column</a>
        </ul>
        <ul>
            <li><a href=#03-c-rename-columns>Rename and Relocate Columns</a>
        </ul>
    <li><a href=#04-save-file>Save Cleaned File</a>
</ul>

<a id=01-import-packages></a>
## Import Packages

Import necessary packages.

In [None]:
# Dataframes and numerical
import pandas as pd
import numpy as np

# Geolocation
import geopandas as gpd
import matplotlib.pyplot as plt

# Apache parquet files (to save space)
# import pyarrow as pa
# import pyarrow.parquet as pq

# Increase pandas default display limits
pd.options.display.max_rows = 250
pd.options.display.max_columns = 250

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

<a href=#toc>Back to the top</a>

<a id=02-load-dataset></a>
## Load Datasets and Check Properties

Load the reference boundary files (GeoJSON) and the list of addresses to be tagged.

<a id=02-a-counties></a>
### Georgia Counties

In [None]:
# From https://towardsdatascience.com/tagging-a-location-to-a-shapefile-area-using-geopandas-5d74336128bf

# Set the filepath and load in a shapefile
# Shape file found here:
# https://maps.princeton.edu/catalog/tufts-gacounties10
ga_counties = "data/geojson/tufts-gacounties10-geojson.json"
map_ga_counties = gpd.read_file(ga_counties)

# Check the GeoDataframe
map_ga_counties.head()

For mapping addresses by their coordinates to county, referencing column **name10** or **namelsad10** should work.
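
As a rough illustration of how coordinates can be tagged with a county name, the sketch below runs a point-in-polygon spatial join with `geopandas.sjoin`. Everything in it is a toy stand-in: the two square "counties", the two addresses, and the assumption that both layers use EPSG:4326 (plain latitude/longitude) are hypothetical, and only the **name10** column name comes from the GeoDataFrame above. The `predicate` keyword assumes a reasonably recent geopandas release; older releases used `op` instead.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy stand-in for the county layer: two adjacent square "counties" (made-up data)
toy_counties = gpd.GeoDataFrame(
    {"name10": ["Toy County A", "Toy County B"]},
    geometry=[box(-85.0, 33.0, -84.0, 34.0), box(-84.0, 33.0, -83.0, 34.0)],
    crs="EPSG:4326",  # assumed CRS: plain longitude/latitude
)

# Toy stand-in for the house list: one point inside each square
toy_houses = pd.DataFrame(
    {
        "address": ["1 Example St", "2 Sample Ave"],
        "latitude": [33.5, 33.7],
        "longitude": [-84.5, -83.5],
    }
)

# Turn the latitude/longitude columns into point geometries (x = longitude, y = latitude)
house_points = gpd.GeoDataFrame(
    toy_houses,
    geometry=gpd.points_from_xy(toy_houses["longitude"], toy_houses["latitude"]),
    crs="EPSG:4326",
)

# Left join keeps every house; houses outside all polygons would get NaN in name10
tagged = gpd.sjoin(
    house_points,
    toy_counties[["name10", "geometry"]],
    how="left",
    predicate="within",
)
```

The same pattern would apply to the zip code layer by swapping in its polygon file and the **ZCTA** column.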

<a href=#toc>Back to the top</a>

<a id=02-b-zip-codes></a>
### Georgia Zip Codes

In [None]:
# From https://towardsdatascience.com/tagging-a-location-to-a-shapefile-area-using-geopandas-5d74336128bf

# Set the filepath and load in a shapefile
# Shape file found here:
# https://maps.princeton.edu/catalog/harvard-tg00gazcta
ga_zipcodes = "data/geojson/harvard-tg00gazcta-geojson.json"
map_ga_zipcodes = gpd.read_file(ga_zipcodes)

# Check the GeoDataframe
map_ga_zipcodes.head()

For mapping addresses by their coordinates to zip code, referencing column **ZCTA** should work.

<a href=#toc>Back to the top</a>

<a id=02-c-houses></a>
### Georgia Houses

Load .csv file into pandas dataframe.

In [None]:
# Load the Georgia sold properties and their Lat Longs
list_location = pd.read_csv('data/cleaned.csv')

# Check the Pandas Dataframe
list_location.head()

To streamline the CSV file, redundant columns are dropped and the relevant ones are arranged in a logical order for use in the proposed Plotly Dash website.

<a href=#toc>Back to the top</a>

<a id=03-clean-data></a>
## Clean House Address Coordinate Data

Only simple modifications need to be made in order for the data to be presented more efficiently.

<a id=03-a-drop-extra-index></a>
### Drop Unnecessary Index Column

First, drop the **Unnamed: 0** column. It is a leftover index column that takes up space and serves no purpose.

In [None]:
# More about this here:
# https://stackoverflow.com/questions/36519086/how-to-get-rid-of-unnamed-0-column-in-a-pandas-dataframe-read-in-from-csv-fil
# The column is named 'Unnamed: 0', so drop it by its name
list_location.drop('Unnamed: 0', axis=1, inplace=True)

# Check the Pandas Dataframe
list_location.head()
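
An **Unnamed: 0** column usually appears when a CSV was saved with its index and then read back without declaring that index. As a minimal sketch, assuming that is what happened here, the column can also be avoided at load time with the `index_col` parameter of `pd.read_csv`. The CSV text below is a made-up two-row sample, not the project data.

```python
import io

import pandas as pd

# Made-up CSV text whose first column is a saved index with an empty header
csv_text = ",latitude,longitude\n0,33.75,-84.39\n1,32.08,-81.09\n"

# Default read: the headerless first column shows up as 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the extra column entirely
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Either approach gives the same cleaned frame; dropping the column after loading, as done above, is the safer choice when it is unclear whether the file actually contains a saved index.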

<a href=#toc>Back to the top</a>

<a id=03-b-drop-extra-county></a>
### Drop Redundant County Name Column

Similarly, drop the redundant **census_county_name** column.

In [None]:
# Delete the census_county_name column
list_location.drop('census_county_name', axis=1, inplace=True)

# Check the Pandas Dataframe
list_location.head()
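
Because the two drop cells above modify the dataframe in place, running either of them a second time raises a `KeyError`. The sketch below shows one possible way to make the step safe to re-run: drop both columns in a single call with `errors="ignore"`. The toy frame and the `redundant_cols` name are hypothetical, used only for illustration.

```python
import pandas as pd

# Toy frame with the same two redundant columns (values are made up)
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "census_county_name": ["Fulton", "Cobb"],
        "county_name": ["Fulton-County", "Cobb-County"],
    }
)

redundant_cols = ["Unnamed: 0", "census_county_name"]

# errors="ignore" skips labels that are already gone instead of raising KeyError
toy_clean = toy.drop(columns=redundant_cols, errors="ignore")

# Running the same drop a second time is a no-op rather than an error
toy_clean_again = toy_clean.drop(columns=redundant_cols, errors="ignore")
```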

<a href=#toc>Back to the top</a>

<a id=03-c-rename-columns></a>
### Rename and Relocate Columns

Rename the **full_street_address**, **county_name**, and **census_state_name** columns for simplicity.

In [None]:
# Rename columns to shorter, simpler names
list_location.rename(columns = {'full_street_address': 'address',
                                'county_name': 'county',
                                'census_state_name': 'state'}, inplace = True)

# Remove the "-County" suffix in the new county column to reduce verbiage and space
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
list_location['county'] = list_location['county'].apply(lambda x: x.replace("-County", ""))

# Check the Pandas Dataframe
list_location.head()
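
The `apply` with a lambda above works as long as every county value is a string, but it raises an `AttributeError` on a missing value. As a hedged alternative, the vectorized `Series.str.replace` does the same substitution and passes missing values through unchanged. The sample values below are made up and assume the `Name-County` pattern implied by the code above.

```python
import pandas as pd

# Made-up sample in the assumed "Name-County" format, including one missing value
toy_counties = pd.Series(["Fulton-County", "Cobb-County", None], name="county")

# regex=False treats "-County" as a literal substring, not a regular expression
stripped = toy_counties.str.replace("-County", "", regex=False)
```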

Now relocate the **state** and **zipcode** columns next to the **county** column for consistency:

In [None]:
# A function found here for reordering columns and also dropping them if necessary:
# https://stackoverflow.com/questions/35321812/move-column-in-pandas-dataframe
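def reorder_columns(columns, first_cols=(), last_cols=(), drop_cols=()):
    # Columns that are pinned to the front or back, or dropped entirely
    excluded = set(first_cols) | set(last_cols) | set(drop_cols)
    # Keep the remaining columns in their original order (a plain set would scramble them)
    remaining = [col for col in columns if col not in excluded]
    new_order = list(first_cols) + remaining + list(last_cols)
    return new_order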

In [None]:
# Now execute the function above
my_list = list_location.columns.tolist()
location_data = ['latitude', 'longitude', 'address', 'city', 'county', 'state', 'zipcode']
reordered_cols = reorder_columns(my_list, first_cols=location_data)
list_location = list_location[reordered_cols]

# Check the Pandas Dataframe
list_location.head()

<a href=#toc>Back to the top</a>

<a id=04-save-file></a>
## Save Cleaned File

County and zip code information is already available in the .csv file for mapping to the GeoJSON files on the Plotly Dash website, so only the cleaned **pandas** dataframe created here needs to be saved for future use.

In [None]:
# To ensure that another Unnamed: 0 column is not created in the cleaned .csv file:
list_location.to_csv('data/haystacks_ga_clean_new_format.csv', index=False)
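
A quick way to confirm that `index=False` had the intended effect is to read the file back and inspect its columns. The sketch below does this with a made-up one-row frame written to a temporary directory, so it does not touch the project's `data/` folder; the file name `toy_clean.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up one-row frame standing in for the cleaned data
toy = pd.DataFrame({"latitude": [33.75], "longitude": [-84.39], "county": ["Fulton"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_clean.csv")

    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)

    # Read the file back to check what was actually written
    reloaded = pd.read_csv(path)

# The reloaded columns should match the original, with no 'Unnamed: 0'
columns_match = list(reloaded.columns) == list(toy.columns)
```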

<a href=#toc>Back to the top</a>