# CSCI S-96 Final Project
## Real Estate Clustering Models
## Phillip Booth      

## Introduction    

**Clustering models** are core data mining methods and a form of **unsupervised learning**. Their goal is to extract structure and relationships from complex data. This kind of exploration is difficult because there is no ground truth against which to compare the results.
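Without labels to score against, cluster quality is usually judged with internal measures. As a minimal sketch, scikit-learn's `silhouette_score` can be computed for a K-means labelling. The synthetic two-blob data and the parameter values below are assumptions for illustration and are not part of this project's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (illustrative assumption): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Fit K-means with two clusters; n_init and random_state are fixed for reproducibility.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The silhouette score lies in [-1, 1]; values near 1 indicate compact,
# well-separated clusters.
score = silhouette_score(X, labels)
print("silhouette score:", round(float(score), 3))
```

A score like this only measures how compact and separated the clusters are; it does not say whether the grouping is meaningful for the housing search.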

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests

from sklearn.decomposition import PCA, KernelPCA
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering, OPTICS, MiniBatchKMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import SpectralEmbedding, MDS
from sklearn.metrics.pairwise import pairwise_distances 
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

import itertools
import time
from pathlib import Path

# Data Collection

## Constants

In [25]:
# Currently my housing search is limited to Massachusetts and New Hampshire
STATES_CONSIDERED = ['MA', 'NH']
ZIP_CODE_PREFIXES = {
    'MA': 0,
    'NH': 0
}

ATOM_API_KEY = "d8188fda10d5747de082ca4cde8adcc7"
ATOM_API_SLEEP_LENGTH = 1
ATOM_COMMUNITY_API_ENDPOINT = "https://api.gateway.attomdata.com/communityapi/v2.0.0/area/full"
# Change this if you want to save a new file with a new name
ATOM_OUTPUT_FILE_TAG = "NH"


### Getting the associated US Zip Codes

In [19]:
zip_code_df = pd.read_excel('data/zip_code_database.xls')
print("zip code df shape: ", zip_code_df.shape)
zip_code_df.head()

zip code df shape:  (42632, 15)


| | zip | type | decommissioned | primary_city | acceptable_cities | unacceptable_cities | state | county | timezone | area_codes | world_region | country | latitude | longitude | irs_estimated_population_2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 501 | UNIQUE | 0 | Holtsville | | I R S Service Center | NY | Suffolk County | America/New_York | 631.0 | | US | 40.81 | -73.04 | 562 |
| 1 | 544 | UNIQUE | 0 | Holtsville | | Irs Service Center | NY | Suffolk County | America/New_York | 631.0 | | US | 40.81 | -73.04 | 0 |
| 2 | 601 | STANDARD | 0 | Adjuntas | | Colinas Del Gigante, Jard De Adjuntas, Urb San... | PR | Adjuntas Municipio | America/Puerto_Rico | 787939.0 | | US | 18.16 | -66.72 | 0 |
| 3 | 602 | STANDARD | 0 | Aguada | | Alts De Aguada, Bo Guaniquilla, Comunidad Las ... | PR | Aguada Municipio | America/Puerto_Rico | 787939.0 | | US | 18.38 | -67.18 | 0 |
| 4 | 603 | STANDARD | 0 | Aguadilla | Ramey | Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba... | PR | Aguadilla Municipio | America/Puerto_Rico | 787.0 | | US | 18.43 | -67.15 | 0 |
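The `zip` column in the output above has lost its leading zeros (for example `501` rather than `00501`), presumably because the spreadsheet column was read as an integer. One way to restore the five-digit form is to left-pad with zeros. The helper name `normalize_zip` below is hypothetical, and this is only a sketch of that idea, not the approach used later in this notebook.

```python
def normalize_zip(zip_value):
    """Return a five-character zip code string, left-padded with zeros."""
    # int() guards against values such as 3216.0 that arrive as floats.
    return str(int(zip_value)).zfill(5)

print(normalize_zip(501))
print(normalize_zip(3216))
```

With pandas, a similar effect could probably be had with `zip_code_df['zip'].astype(str).str.zfill(5)`, which avoids maintaining a per-state prefix.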


In [20]:
# Filter the zip codes down into the ones that I care about for my housing search

zip_code_filtered_df = zip_code_df[zip_code_df['state'].isin(STATES_CONSIDERED)]
print("filtered zip code df shape: ", zip_code_filtered_df.shape)

filtered zip code df shape:  (284, 15)


In [21]:
zip_code_filtered_df[zip_code_filtered_df['primary_city'] == "Andover"]

| | zip | type | decommissioned | primary_city | acceptable_cities | unacceptable_cities | state | county | timezone | area_codes | world_region | country | latitude | longitude | irs_estimated_population_2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1036 | 3216 | STANDARD | 0 | Andover | | | NH | Merrimack County | America/New_York | 603.0 | | US | 43.43 | -71.82 | 1940 |


According to this post, there are several different types of zip codes: https://www.policymap.com/2013/04/tips-on-zips-part-iii-making-sense-of-zip-code-boundaries/

We are only interested in the standard ones, which carry the type `STANDARD` in this data set.

In [22]:
zip_code_filtered_df = zip_code_filtered_df[zip_code_filtered_df['type'] == "STANDARD"]
print("filtered zip code df shape: ", zip_code_filtered_df.shape)

filtered zip code df shape:  (233, 15)


## Getting Accessible Data from ATTOM Data Solutions by Zip Code

Now that I have the list of zip codes, I can use the ATTOM real estate API to retrieve data for each zip code of interest and build a picture of each one.
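A request for a single zip code could be assembled roughly as follows. This sketch uses only the standard library to build the URL and headers. The function name `build_area_request` and the placeholder key are hypothetical, and the `AreaId=ZI<zip>` query format and `APIKey` header are taken from the code in this notebook rather than from the API documentation.

```python
from urllib.parse import urlencode

ENDPOINT = "https://api.gateway.attomdata.com/communityapi/v2.0.0/area/full"

def build_area_request(zip_code, api_key):
    """Return (url, headers) for a community-data request for one zip code."""
    # The area identifier is the five-digit zip code prefixed with "ZI".
    query = urlencode({"AreaId": "ZI" + zip_code})
    headers = {"Accept": "application/json", "APIKey": api_key}
    return ENDPOINT + "?" + query, headers

url, headers = build_area_request("03031", "YOUR_API_KEY")
print(url)
```

The returned pair can be passed to `requests.get(url, headers=headers, timeout=30)`; a timeout is worth setting so that a stalled connection does not hang a long-running loop.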

In [12]:
GLOBAL_PROCESSED_ZIPCODES = []

def get_processed_zips_from_atom_df():
    atom_file_path = 'data/atom_output_{0}.csv'.format(ATOM_OUTPUT_FILE_TAG)
    
    atom_output_file = Path(atom_file_path)
    
    if atom_output_file.exists():
        atom_df = pd.read_csv(atom_output_file)
        return atom_df['geo_code'].unique()
    else:
        return []

def getAtomData(zip_df):
    
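    # Declare the module-level name as global so the assignments below update it
    # instead of creating a function-local variable with the same name
    global GLOBAL_PROCESSED_ZIPCODES
    GLOBAL_PROCESSED_ZIPCODES = get_processed_zips_from_atom_df()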
    
    missed_zip_codes = []
    current_output_to_write = []
    cur_processing_number = 0
    num_disk_writes = 0
    
    headers = {'Accept': 'application/json', 'APIKey': ATOM_API_KEY}
    
    for index, row in zip_df.iterrows():
        if row['state'] in ZIP_CODE_PREFIXES:
            
            cur_processing_number += 1
            # The zip code data set drops the leading digit (0) of MA and NH zip codes, so restore it here
            zip_prefix = ZIP_CODE_PREFIXES[row['state']]

            full_zip_str = str(zip_prefix) + str(row['zip'])
            
            if row['zip'] not in GLOBAL_PROCESSED_ZIPCODES:

                ## Build URL For Request
                atom_url = ATOM_COMMUNITY_API_ENDPOINT + "?AreaId=ZI" + full_zip_str

                atom_resp = requests.get(atom_url, headers=headers)

                if atom_resp.status_code != 200:
                    # We will need to check and see what happened here 
                    missed_zip_codes.append(full_zip_str)
                    print("ATOM Request failed for url: ", atom_url)
                    continue

                atom_resp_json = atom_resp.json()
                
                current_output_to_write.append(atom_resp_json['response']['result']['package']['item'][0])

                if len(current_output_to_write) == 20:
                    process_atom_data_and_save_to_disk(current_output_to_write, num_disk_writes)
                    num_disk_writes += 1
                    
                    # After saving to disk we need to update our PROCESSED ZIP information
                    processed_zips = list(map(lambda x: x["geo_code"], current_output_to_write))
                    GLOBAL_PROCESSED_ZIPCODES = np.append(GLOBAL_PROCESSED_ZIPCODES, processed_zips)
                        
                    current_output_to_write = []
                    print("current number of missed zips is: ", len(missed_zip_codes))
                    print("*******Processed {0} total zipcodes *****".format(cur_processing_number))
                
                # need to sleep for 6 seconds between calls to avoid 10 per min api limits
                time.sleep(6)
            else:
                print("SKIPPING Already Processed Zip Code: ", full_zip_str)
        else:
            raise Exception("STATE DOES NOT EXIST IN ZIP_CODE_PREFIXES CONSTANT")

    # Save whatever is left to disk      
    process_atom_data_and_save_to_disk(current_output_to_write, num_disk_writes)

    processed_zips = list(map(lambda x: x["geo_code"], current_output_to_write))
    GLOBAL_PROCESSED_ZIPCODES = np.append(GLOBAL_PROCESSED_ZIPCODES, processed_zips)
    
    print("Finished Retreiving Atom Data")
    print("{0} Zip Codes Were Missed".format(len(missed_zip_codes)))
            
def process_atom_data_and_save_to_disk(new_values, write_number):
    atom_file_path = 'data/atom_output_{0}.csv'.format(ATOM_OUTPUT_FILE_TAG)
    print("Processing File: {0} for disk write number: {1}".format(atom_file_path, write_number))
    
    atom_output_file = Path(atom_file_path)
    
    if atom_output_file.exists():
        atom_df = pd.read_csv(atom_output_file)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
        atom_df = pd.concat([atom_df, pd.DataFrame(new_values)], ignore_index=True)
    else:
        atom_df = pd.DataFrame(new_values)
    
    atom_df.to_csv(atom_output_file, index=False)  
    
    print("Finished Disk Write")
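The fixed `time.sleep(6)` above keeps successful requests under the limit of 10 per minute, but it always waits the full six seconds, and the `continue` on a failed request skips the sleep entirely. A small alternative, sketched here with a hypothetical `Throttle` class, waits only for the remainder of the interval before every request. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed.
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# 10 requests per minute corresponds to at least 6 seconds between requests.
throttle = Throttle(min_interval=6)
throttle.wait()  # the first call returns immediately
```

Calling `throttle.wait()` immediately before each `requests.get` would apply the same spacing to successful and failed requests alike.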

        

In [26]:
getAtomData(zip_code_filtered_df[['zip', 'state']])


SKIPPING Already Processed Zip Code:  03031
SKIPPING Already Processed Zip Code:  03032
SKIPPING Already Processed Zip Code:  03033
SKIPPING Already Processed Zip Code:  03034
SKIPPING Already Processed Zip Code:  03036
SKIPPING Already Processed Zip Code:  03037
SKIPPING Already Processed Zip Code:  03038
SKIPPING Already Processed Zip Code:  03042
SKIPPING Already Processed Zip Code:  03043
SKIPPING Already Processed Zip Code:  03044
SKIPPING Already Processed Zip Code:  03045
SKIPPING Already Processed Zip Code:  03046
SKIPPING Already Processed Zip Code:  03047
SKIPPING Already Processed Zip Code:  03048
SKIPPING Already Processed Zip Code:  03049
SKIPPING Already Processed Zip Code:  03051
SKIPPING Already Processed Zip Code:  03052
SKIPPING Already Processed Zip Code:  03053
SKIPPING Already Processed Zip Code:  03054
SKIPPING Already Processed Zip Code:  03055
SKIPPING Already Processed Zip Code:  03057
SKIPPING Already Processed Zip Code:  03060
SKIPPING Already Processed Zip C

ATOM Request failed for url:  https://api.gateway.attomdata.com/communityapi/v2.0.0/area/full?AreaId=ZI03897
Processing File: data/atom_output_NH.csv for disk write number: 0
Finished Disk Write
Finished Retreiving Atom Data
3 Zip Codes Were Missed
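The run above reports three missed zip codes. `getAtomData` collects them in `missed_zip_codes` but does not return the list, so one option would be to return it and retry those requests with a growing delay. The helper below is a hypothetical sketch of such a retry loop; `fetch` stands in for any function that returns a result or raises on failure, and the delay values are assumptions.

```python
import time

def retry(fetch, attempts=3, base_delay=6.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= 2

# Example with a stand-in fetch that fails twice before succeeding.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated failure")
    return "ok"

# A no-op sleep keeps the example instantaneous.
result = retry(flaky_fetch, sleep=lambda seconds: None)
print(result, calls["n"])
```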
