# CSCI S-96 Final Project
## Real Estate Clustering Models
## Phillip Booth      

## Introduction    

**Clustering models** are core data mining methods and a form of **unsupervised learning**. Their goal is to extract structure and relationships from complex data. This kind of exploration is difficult to evaluate because there is no ground truth to compare the results against.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests

from sklearn.decomposition import PCA, KernelPCA
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering, OPTICS, MiniBatchKMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import SpectralEmbedding, MDS
from sklearn.metrics.pairwise import pairwise_distances 
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

import itertools
import time
from pathlib import Path

# Data Collection

## Constants

In [9]:
# Currently my housing search is limited to Massachusetts and New Hampshire
STATES_CONSIDERED = ['MA']
ZIP_CODE_PREFIXES = {
    'MA': 0,
    'NH': 0
}

ATOM_API_KEY = "d8188fda10d5747de082ca4cde8adcc7"
ATOM_API_SLEEP_LENGTH = 1
ATOM_COMMUNITY_API_ENDPOINT = "https://api.gateway.attomdata.com/communityapi/v2.0.0/area/full"
# Change this if you want to save a new file with a new name
ATOM_OUTPUT_FILE_TAG = "NH"

WS_PATH_SEARCH_API_ENDPOINT = "https://www.walkscore.com/auth/search_suggest?query="
WS_OUTPUT_FILE_TAG = "MA"


### Getting the associated US Zip Codes

In [4]:
zip_code_df = pd.read_excel('data/zip_code_database.xls')
print("zip code df shape: ", zip_code_df.shape)
zip_code_df.head()

zip code df shape:  (42632, 15)


Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
0,501,UNIQUE,0,Holtsville,,I R S Service Center,NY,Suffolk County,America/New_York,631.0,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Irs Service Center,NY,Suffolk County,America/New_York,631.0,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939.0,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939.0,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787.0,,US,18.43,-67.15,0


In [10]:
# Filter the zip codes down into the ones that I care about for my housing search

zip_code_filtered_df = zip_code_df[zip_code_df['state'].isin(STATES_CONSIDERED)]
print("filtered zip code df shape: ", zip_code_filtered_df.shape)

filtered zip code df shape:  (703, 15)


In [11]:
zip_code_filtered_df[zip_code_filtered_df['primary_city'] == "Andover"]

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
492,1810,STANDARD,0,Andover,,,MA,Essex County,America/New_York,508978.0,,US,42.65,-71.14,33590
493,1812,UNIQUE,0,Andover,,Internal Revenue Service,MA,Essex County,America/New_York,351.0,,US,42.65,-71.14,0
534,1899,UNIQUE,0,Andover,,Bar Coded I R S,MA,Essex County,America/New_York,351.0,,US,42.65,-71.14,0
1926,5501,UNIQUE,0,Andover,,Irs Service Center,MA,Essex County,America/New_York,351.0,,US,42.65,-71.14,510
1927,5544,UNIQUE,0,Andover,,Irs Service Center,MA,Essex County,America/New_York,351.0,,US,42.65,-71.14,0


According to this post, there are several different types of zip codes: https://www.policymap.com/2013/04/tips-on-zips-part-iii-making-sense-of-zip-code-boundaries/

We are only interested in the `STANDARD` ones, so the `UNIQUE` codes (such as the IRS entries for Andover shown above) are dropped.

In [12]:
zip_code_filtered_df = zip_code_filtered_df[zip_code_filtered_df['type'] == "STANDARD"]
print("filtered zip code df shape: ", zip_code_filtered_df.shape)

filtered zip code df shape:  (496, 15)


## Getting Accessible Data from Atom Data Solutions By Zip Code 

Now that I have the list of zip codes, I can use the Atom real estate API to fetch data for each zip code of interest and build a profile of each one.

In [16]:
def get_processed_zips_from_csv(filePath):
    
    output_file = Path(filePath)
    
    if output_file.exists():
        output_df = pd.read_csv(output_file)
        return output_df['geo_code'].unique()
    else:
        return []

def getAtomData(zip_df):
    
    missed_zip_codes = []
    current_output_to_write = []
    cur_processing_number = 0
    num_disk_writes = 0
    atom_file_data_path = 'data/atom_output_{0}.csv'.format(ATOM_OUTPUT_FILE_TAG)
    GLOBAL_PROCESSED_ZIPCODES = get_processed_zips_from_csv(atom_file_data_path)
    
    headers = {'Accept': 'application/json', 'APIKey': ATOM_API_KEY}
    
    for index, row in zip_df.iterrows():
        if row['state'] in ZIP_CODE_PREFIXES:
            
            cur_processing_number += 1
            # The zip code data set stores zips as numbers, so the leading zero is missing
            zip_prefix = ZIP_CODE_PREFIXES[row['state']]

            full_zip_str = str(zip_prefix) + str(row['zip'])
            
            if row['zip'] not in GLOBAL_PROCESSED_ZIPCODES:

                ## Build URL For Request
                atom_url = ATOM_COMMUNITY_API_ENDPOINT + "?AreaId=ZI" + full_zip_str

                atom_resp = requests.get(atom_url, headers=headers)

                if atom_resp.status_code != 200:
                    # We will need to check and see what happened here 
                    missed_zip_codes.append(full_zip_str)
                    print("ATOM Request failed for url: ", atom_url)
                    continue

                atom_resp_json = atom_resp.json()
                
                current_output_to_write.append(atom_resp_json['response']['result']['package']['item'][0])

                if len(current_output_to_write) == 20:
                    process_data_and_save_to_disk(atom_file_data_path, current_output_to_write, num_disk_writes)
                    num_disk_writes += 1
                    
                    # After saving to disk we need to update our PROCESSED ZIP information
                    processed_zips = list(map(lambda x: x["geo_code"], current_output_to_write))
                    GLOBAL_PROCESSED_ZIPCODES = np.append(GLOBAL_PROCESSED_ZIPCODES, processed_zips)
                        
                    current_output_to_write = []
                    print("current number of missed zips is: ", len(missed_zip_codes))
                    print("*******Processed {0} total zipcodes *****".format(cur_processing_number))
                
                # need to sleep for 6 seconds between calls to avoid 10 per min api limits
                time.sleep(6)
            else:
                print("SKIPPING Already Processed Zip Code: ", full_zip_str)
        else:
            raise Exception("STATE DOES NOT EXIST IN ZIP_CODE_PREFIXES CONSTANT")

    # Save whatever is left to disk      
    process_data_and_save_to_disk(atom_file_data_path, current_output_to_write, num_disk_writes)

    processed_zips = list(map(lambda x: x["geo_code"], current_output_to_write))
    GLOBAL_PROCESSED_ZIPCODES = np.append(GLOBAL_PROCESSED_ZIPCODES, processed_zips)
    
    print("Finished Retrieving Atom Data")
    print("{0} Zip Codes Were Missed".format(len(missed_zip_codes)))
            
def process_data_and_save_to_disk(atom_file_path, new_values, write_number):
    print("Processing File: {0} for disk write number: {1}".format(atom_file_path, write_number))
    
    atom_output_file = Path(atom_file_path)
    
    if atom_output_file.exists():
        atom_df = pd.read_csv(atom_output_file)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        atom_df = pd.concat([atom_df, pd.DataFrame(new_values)], ignore_index=True)
    else:
        atom_df = pd.DataFrame(new_values)
    
    atom_df.to_csv(atom_output_file, index=False)  
    
    print("Finished Disk Write")

        

In [1]:
#getAtomData(zip_code_filtered_df[['zip', 'state']])
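
The functions above rebuild each five-digit zip code by prepending a fixed prefix, and they pause between requests to stay under the API limit. The following is a minimal sketch of both ideas, assuming the zip codes are stored as numbers that lost their leading zeros. The helper names `format_zip`, `min_delay_seconds`, and `build_area_url` are hypothetical and not part of the notebook; the `?AreaId=ZI` query form is copied from the code above.

```python
def format_zip(zip_value, width=5):
    """Left-pad a numeric zip code with zeros, e.g. 1810 -> '01810'."""
    return str(int(zip_value)).zfill(width)


def min_delay_seconds(calls_per_minute):
    """Smallest pause between calls that stays within a per-minute limit."""
    return 60.0 / calls_per_minute


def build_area_url(endpoint, zip_value):
    """Build the request URL in the same '?AreaId=ZI<zip>' form used above."""
    return endpoint + "?AreaId=ZI" + format_zip(zip_value)


endpoint = "https://api.gateway.attomdata.com/communityapi/v2.0.0/area/full"
print(format_zip(1810))          # four-digit value, one zero added
print(format_zip(501))           # three-digit value, two zeros added
print(min_delay_seconds(10))     # a limit of 10 calls per minute
print(build_area_url(endpoint, 1810))
```

Unlike a fixed one-character prefix, `zfill` also handles three-digit values such as 501, which would matter if the search were extended to states outside MA and NH.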


## Getting accessible Walk Score data by zipcode

Now that I have processed the Atom data, I have baseline information about each zip code. Another important factor is how walkable each town is. Walk Score is a rating of how accessible a neighborhood is on foot.

As a commuter who does not like driving, I value walkability very highly.

### Walkscore Definitions:

Walk Score measures the walkability of any address based on the distance to nearby places and pedestrian friendliness.

90–100	Walker’s Paradise. Daily errands do not require a car

70–89	Very Walkable. Most errands can be accomplished on foot

50–69	Somewhat Walkable. Some errands can be accomplished on foot

25–49	Car-Dependent. Most errands require a car

0–24	Car-Dependent. Almost all errands require a car

### Transit Score Definitions:

Transit Score measures how well a location is served by public transit based on the distance and type of nearby transit lines.

90–100	Rider’s Paradise. World-class public transportation

70–89	Excellent Transit. Transit is convenient for most trips

50–69	Good Transit. Many nearby public transportation options

25–49	Some Transit. A few nearby public transportation options

0–24	Minimal Transit. It is possible to get on a bus

### Bike Score Definitions:

Bike Score measures whether an area is good for biking based on bike lanes and trails, hills, road connectivity, and destinations.

90–100	Biker’s Paradise. Daily errands can be accomplished on a bike

70–89	Very Bikeable. Biking is convenient for most trips

50–69	Bikeable. Some bike infrastructure

0–49	Somewhat Bikeable. Minimal bike infrastructure
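
The three sets of bands above can be written as a small lookup, which is handy when turning raw scores into categorical features. This is only a sketch: `SCORE_BANDS` and `score_label` are hypothetical names, and the thresholds and labels are copied from the definitions above.

```python
# Lower bound of each band and its label, highest band first.
SCORE_BANDS = {
    "walk": [
        (90, "Walker's Paradise"),
        (70, "Very Walkable"),
        (50, "Somewhat Walkable"),
        (0, "Car-Dependent"),  # covers both the 25-49 and 0-24 bands
    ],
    "transit": [
        (90, "Rider's Paradise"),
        (70, "Excellent Transit"),
        (50, "Good Transit"),
        (25, "Some Transit"),
        (0, "Minimal Transit"),
    ],
    "bike": [
        (90, "Biker's Paradise"),
        (70, "Very Bikeable"),
        (50, "Bikeable"),
        (0, "Somewhat Bikeable"),
    ],
}


def score_label(kind, score):
    """Return the band label for a score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, label in SCORE_BANDS[kind]:
        if score >= lower_bound:
            return label


print(score_label("walk", 72))
print(score_label("transit", 30))
print(score_label("bike", 49))
```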

In [26]:
def getFormattedListOfZipCodes(zip_df):
    processedZipList = []
    
    for index, row in zip_df.iterrows():
        if row['state'] in ZIP_CODE_PREFIXES:
            # The zip code data set stores zips as numbers, so the leading zero is missing
            zip_prefix = ZIP_CODE_PREFIXES[row['state']]
            full_zip_str = str(zip_prefix) + str(row['zip'])
            processedZipList.append(full_zip_str)
        else:
            raise Exception("STATE DOES NOT EXIST IN ZIP_CODE_PREFIXES CONSTANT")
    
    return processedZipList

def get_processed_zips_from__df():
    ws_file_path = 'data/walkscore_output_{0}.csv'.format(WS_OUTPUT_FILE_TAG)
    
    ws_output_file = Path(ws_file_path)
    
    if ws_output_file.exists():
        ws_df = pd.read_csv(ws_output_file)
        return ws_df['geo_code'].unique()
    else:
        return []
    

def getWalkScorePaths(zip_list):
    
    missed_zip_codes = []
    complex_zip_codes = []
    current_output_to_write = []
    cur_processing_number = 0
    num_disk_writes = 0
    ws_file_data_path = 'data/walkscore_output_{0}.csv'.format(WS_OUTPUT_FILE_TAG)
    GLOBAL_PROCESSED_ZIPCODES = get_processed_zips_from_csv(ws_file_data_path)
    
    headers = {'Accept': 'application/json'}
    
    for zip_code in zip_list:
        
        cur_processing_number += 1
        
        if zip_code not in GLOBAL_PROCESSED_ZIPCODES:
            
            ## Build URL For Request
            ws_url = WS_PATH_SEARCH_API_ENDPOINT + zip_code

            ws_resp = requests.get(ws_url, headers=headers)

            if ws_resp.status_code != 200:
                # We will need to check and see what happened here 
                missed_zip_codes.append(zip_code)
                print("Walkscore Request failed for url: ", ws_url)
                continue

            ws_resp_json = ws_resp.json()
            
            # Just in case the search results return a lot of options I will need to manually take a look
            if len(ws_resp_json['suggestions']) > 1:
                complex_zip_codes.append(ws_resp_json)
                continue
            
            if len(ws_resp_json['suggestions']) == 0:
                missed_zip_codes.append(zip_code)
                print("0 Walkscore Results for url: ", ws_url)
                continue
            
            pathObj = {
                "geo_code": zip_code,
                "path": ws_resp_json['suggestions'][0]["path"]
            }

            current_output_to_write.append(pathObj)

            if len(current_output_to_write) == 20:
                process_data_and_save_to_disk(ws_file_data_path, current_output_to_write, num_disk_writes)
                num_disk_writes += 1

                # After saving to disk we need to update our PROCESSED ZIP information
                processed_zips = list(map(lambda x: x["geo_code"], current_output_to_write))
                GLOBAL_PROCESSED_ZIPCODES = np.append(GLOBAL_PROCESSED_ZIPCODES, processed_zips)

                current_output_to_write = []
                print("current number of missed zips is: ", len(missed_zip_codes))
                print("current number of complex zips is: ", len(complex_zip_codes))
                print("*******Processed {0} total zipcodes *****".format(cur_processing_number))

            # need to sleep for 3 seconds between calls to avoid api limits
            time.sleep(3)
            
        else:
            print("SKIPPING Already Processed Zip Code: ", zip_code)
    
    # Save whatever is left to disk      
    process_data_and_save_to_disk(ws_file_data_path, current_output_to_write, num_disk_writes)

    processed_zips = list(map(lambda x: x["geo_code"], current_output_to_write))
    GLOBAL_PROCESSED_ZIPCODES = np.append(GLOBAL_PROCESSED_ZIPCODES, processed_zips)

    

In [27]:
formattedZips = getFormattedListOfZipCodes(zip_code_filtered_df[['zip', 'state']])
getWalkScorePaths(formattedZips)

0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01002
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01005
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01007
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01008
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01010
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01011
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01012
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01026
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01027
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01028
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01031
0 Walkscore Results for url:  ht

0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01376
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01378
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01379
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01380
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01430
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01431
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01432
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01434
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01436
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01450
0 Walkscore Results for url:  https://www.walkscore.com/auth/search_suggest?query=01451
0 Walkscore Results for url:  ht

KeyboardInterrupt: 
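
The run above was stopped by hand. In `getWalkScorePaths`, any rows collected since the last 20-row write exist only in memory at that point, so an interrupt would discard them. One possible way to keep them is to flush the pending batch in a `finally` block. The sketch below uses hypothetical names (`collect_in_batches`, `fetch`, `save`) rather than the notebook's own functions.

```python
def collect_in_batches(items, fetch, save, batch_size=20):
    """Call fetch(item) for each item and pass results to save() in batches.

    The finally block flushes a partial batch even if the loop is cut short,
    for example by a KeyboardInterrupt.
    """
    pending = []
    try:
        for item in items:
            result = fetch(item)
            if result is None:  # nothing usable for this item
                continue
            pending.append(result)
            if len(pending) == batch_size:
                save(pending)
                pending = []
    finally:
        if pending:
            save(pending)


# Usage with stand-in fetch and save functions
saved = []
collect_in_batches(range(5), lambda z: {"geo_code": z}, saved.extend, batch_size=2)
print(len(saved))
```

Here `save` stands in for a disk write such as `process_data_and_save_to_disk`, and `fetch` for the request and parsing step.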