<h1>Capstone Project - The Battle of Neighborhoods</h1>

<h2>Relocation Decision Assistance</h2>

<h3>1. Introduction</h3>

<h4>1.1. Background</h4>

According to the US Census Bureau about 40 million people (or 14% of US population) move every year. People are moving for different reasons: job, family, school, cost of living, etc. But despite the reason the questions people face when they are looking for a new place to live are very common. Is the housing price in the new location affordable? Will I have same or better amenities in my new neighborhood? What is the quality of public schools? The purpose of this project is to help answer those questions with the use of the data science.

<h4>1.2. Problem</h4>

These days, Internet is great source of information on the topic of interest. But despite wide resources or sometimes because of the resource abundance, the answers may not be so easy to find. For example, a person would need to spend hours researching a single neighborhood and may not even investigate other close by areas that might be more suitable for relocation. In some cases, a person may not even know where to start looking. This project is aimed to guide the relocation decision making by showing clusters of suitable neighborhoods according to provided criteria. 

To demonstrate the results, I will study a specific case. A hypothetical family is happy with their current location but want to be closer to their family. Therefore, they want to relocate to the state of MA. Their criteria are good public schools, affordable housing and good amenities – just like their current place.  

<h4>1.3. Interest</h4>

The results of the project will be equally interesting for real estate agents as well as their clients. 

<h3>2. Data acquisition and cleaning</h3>

<h4>2.1. Data sources</h4>

These project uses three data sources:<ul>
<li> USZipCode database available on <a href="https://pypi.org/project/uszipcode/">PyPI</a>. The database provides geographical, statistical, real-estate data for every US zip code.</li>
<li> <a href="https://foursquare.com/">FourSquare</a> data for information on amenities. FourSquare API is used to collect the data.</li>
<li> <a href="https://www.greatschools.org/">GreatSchools</a> web site for public schools rating. I wrote a function to scrape the web site and to get the average schools rating for the provided zip code.</li>

<b>USZipCode database exploration</b>

In [1]:
# Install PyPI library
!python -m pip install --upgrade pip

Requirement already up-to-date: pip in c:\users\olgazaychykova\anaconda3\lib\site-packages (19.2.1)


In [2]:
# Install USzipcode database
# !pip install uszipcode # uncomment this line if the notebook is ran for the first time.

Explore data available for the target zip code 94598. This is the zip code of the hypothetical family.

In [78]:
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=True) # set simple_zipcode=False to use rich info database
targetZip=search.by_zipcode("94598")
targetZip=targetZip.to_dict() # convert to dictionary
targetZip # check the info

{'area_code_list': ['510', '707', '925'],
 'bounds_east': -121.925567,
 'bounds_north': 37.942412,
 'bounds_south': 37.849998,
 'bounds_west': -122.05489,
 'common_city_list': ['Walnut Creek'],
 'county': 'Contra Costa County',
 'housing_units': 10756,
 'land_area_in_sqmi': 15.25,
 'lat': 37.91,
 'lng': -122.05,
 'major_city': 'Walnut Creek',
 'median_home_value': 719200,
 'median_household_income': 121067,
 'occupied_housing_units': 10390,
 'population': 25818,
 'population_density': 1693.0,
 'post_office_city': 'Walnut Creek, CA',
 'radius_in_miles': 6.0,
 'state': 'CA',
 'timezone': 'Pacific',
 'water_area_in_sqmi': 0.01,
 'zipcode': '94598',
 'zipcode_type': 'Standard'}

The following geo information will be used for scraping GreatSchools web site: post office city, latitude, longitude, state, radius, zip code. But first, lets visualize this location on the map.

In [4]:
# Install and import map libraries
# !pip install folium # uncomment this line if the notebook is ran first time
import folium # map rendering library

In [5]:
# create map of using latitude and longitude values of the target zip code
lat=targetZip["lat"]
lng=targetZip["lng"]
map_targetZip = folium.Map(location=[lat, lng], zoom_start=5)

# add marker to the map
label = '{}, {}'.format(targetZip["post_office_city"], targetZip["zipcode"])
label = folium.Popup(label, parse_html=True)
folium.CircleMarker(
       [lat, lng],
       radius=5,
       popup=label,
       color='blue',
       fill=True,
       fill_color='#3186cc',
       fill_opacity=0.7,
       parse_html=False).add_to(map_targetZip)  
    
map_targetZip

<b>Scrape GreatSchools site</b>

In [90]:
# Import libraries
import pandas as pd
import requests
import math
from bs4 import BeautifulSoup

# Create function to scrape GreatSchools
# The function return average rating of all selected grade (elementary, middle, high) public schools 
# within 5 miles radius of specified zipcode.
# Variables description:
#      lat = latitude
#      lon = longitude
#      rad = radius in miles
#      locationLabel = City%20Name%2C%20State (%20 is the space and %2C%20 is comma space). Example: San%20Francisco%2C%20CA
#      gradeLevels = one of the following: e, m, h
#      state = two letter state code
#      zipCode = 5 digits zip code
def scrape_GreatSchools(lat,lon,rad,locationLabel,gradeLevels,state,zipCode):
    gsURL='https://www.greatschools.org/search/search.page?distance='+rad+'&lat='+lat+'&lon='+lon+'&locationLabel='+locationLabel+'%20'+zipCode+'&locationType=zip&st=public_charter&st=public&gradeLevels='+gradeLevels+'&state='+state+'&sort=distance&view=list'
    res = requests.get(gsURL)
    soup = BeautifulSoup(res.content, 'html.parser')
    # Convert soup object to string and get only the substring with school ratings
    soup_string=str(soup)
    soup_string.find('gon.search={"schools":[')
    soup_string.find('"resultSummary":')
    schools=soup_string[soup_string.find('gon.search={"schools":[')+len('gon.search={"schools":['):soup_string.find('"resultSummary":')]
    # Split by '"remediationData":[]}' string
    schools_split=schools.split('"remediationData":[]}')
    # build cleaned up arrays for zip and rating
    i = 0
    rating=[] # empty rating array
    while i < len(schools_split)-1:
        if schools_split[i][schools_split[i].find('"rating":')+len('"rating":'):schools_split[i].find(',"ratingScale"')]=='null':
            i+=1
        else:
            rating.append(schools_split[i][schools_split[i].find('"rating":')+len('"rating":'):schools_split[i].find(',"ratingScale"')])
            i += 1
    # find average rating for schools
    i=0 
    sum=0
    n=0
    while i<len(rating):
        sum+=int(rating[i])
        n+=1
        i+=1
    if n==0:
        return -1
    else:
        return math.floor(sum/n)

Scrape GreatSchools with target zip code.

In [91]:
#Test function with targetZip
lat=str(targetZip["lat"])
lng=str(targetZip["lng"])
rad='5' #look for schools within 5 miles radius of the zip code
city=targetZip["post_office_city"].replace(' ','%20').replace(',','%2C')
state=targetZip["state"]
zipcode=str(targetZip["zipcode"])
print('Average elementary schools rating: ',scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode))
print('Average middle schools rating: ',scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode))
print('Average high schools rating: ',scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode))

Average elementary schools rating:  6
Average middle schools rating:  7
Average high schools rating:  7


In [92]:
# add elementary schools rating info to the dictionary
targetZip["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
#targetZip["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
#targetZip["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
targetZip

{'area_code_list': ['510', '707', '925'],
 'bounds_east': -121.925567,
 'bounds_north': 37.942412,
 'bounds_south': 37.849998,
 'bounds_west': -122.05489,
 'common_city_list': ['Walnut Creek'],
 'county': 'Contra Costa County',
 'e_school': 6,
 'housing_units': 10756,
 'land_area_in_sqmi': 15.25,
 'lat': 37.91,
 'lng': -122.05,
 'major_city': 'Walnut Creek',
 'median_home_value': 719200,
 'median_household_income': 121067,
 'occupied_housing_units': 10390,
 'population': 25818,
 'population_density': 1693.0,
 'post_office_city': 'Walnut Creek, CA',
 'radius_in_miles': 6.0,
 'state': 'CA',
 'timezone': 'Pacific',
 'water_area_in_sqmi': 0.01,
 'zipcode': '94598',
 'zipcode_type': 'Standard'}

<b>Explore FourSquare API with target zip code</b>

In [10]:
#Define Foursquare Credentials and Version
CLIENT_ID = 'QGKUC2UWM2XORZTDMPZLZCQP55HC325UHK25KFJQOKP5LHWH' # your Foursquare ID
CLIENT_SECRET = 'PMXKF3LGTXVJEGBVKH2XAUUJ23K5JRLKENWF053MRJXI0ISB' # your Foursquare Secret
VERSION = '20180604' # Foursquare API version

In [11]:
#Create GET request URL
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = int(targetZip["radius_in_miles"])*1609 # define radius in meters
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat, 
    lng, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=QGKUC2UWM2XORZTDMPZLZCQP55HC325UHK25KFJQOKP5LHWH&client_secret=PMXKF3LGTXVJEGBVKH2XAUUJ23K5JRLKENWF053MRJXI0ISB&v=20180604&ll=37.91,-122.05&radius=9654&limit=100'

In [12]:
#Send the GET request and examine the resutls.
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d4429eec53093002ce48c63'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4a084f23f964a520ae731fe3-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d163941735',
         'name': 'Park',
         'pluralName': 'Parks',
         'primary': True,
         'shortName': 'Park'}],
       'id': '4a084f23f964a520ae731fe3',
       'location': {'address': '301 N San Carlos Dr',
        'cc': 'US',
        'city': 'Walnut Creek',
        'country': 'United States',
        'crossStreet': 'at Ygnacio Valley Rd',
        'distance': 1196,
        'formattedAddress': ['301 N San Carlos Dr (at Ygnacio Valley Rd)',
         'Walnut Creek, CA 94598',
         'United St

In [13]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [14]:
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

#Clean the json and structure it into a pandas dataframe.
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Heather Farm Park,Park,37.91857,-122.041768
1,Gardens at Heather Farm,Garden,37.91882,-122.044038
2,Lottie's Creamery,Ice Cream Shop,37.899577,-122.060806
3,Sports Basement,Sporting Goods Shop,37.918327,-122.036975
4,Montecatini Ristorante,Italian Restaurant,37.901636,-122.062617


In [15]:
# Print number of venues returned by Foursquare.
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


<h4>2.2. Data cleaning</h4>

The data for the desired relocation area (state of MA) is collected from the USZipCode data. I don't need locations that are lacking information about median house price. Therefore, those locations are excluded.

In [94]:
# Get all zip codes for the state of MA
relocateInfo = search.by_state('MA',returns=500)

# Convert output to the list of dictionaries
relocateZip=[]
for y in relocateInfo:
    relocateZip.append(y.to_dict())
len(relocateZip)

496

In [96]:
# remove locations with missing median home value
i=0
while i<len(relocateZip):
    if relocateZip[i]["median_home_value"]==None: # exclude location with missing median home value        
        del relocateZip[i]
    i+=1
len(relocateZip)

481

Visualize locations on the map.

In [19]:
# Create a map of MA with zip codes superimposed on top.
map_MA = folium.Map(location=[42.292, -71.5], zoom_start=8)

# add markers to map
i=0
while i<len(relocateZip): 
    label = '{}, {}'.format(relocateZip[i]["major_city"], relocateZip[i]["zipcode"])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [relocateZip[i]["lat"], relocateZip[i]["lng"]],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_MA)  
    i+=1
    
map_MA

To narrow down the potential relocation areas, I will remove zip codes that have median house price above the desired target. For the target price, I will take house median price in the current location of the client. 

In [98]:
# Remove locations with median house price above the price that client can afford.
i=0
while i<len(relocateZip):
    if relocateZip[i]["median_home_value"]>targetZip["median_home_value"]: 
        del relocateZip[i]
    i+=1
len(relocateZip)


463

Scrape GreatSchools for each of the zip codes to get the average schools rating. Add schools rating to each dictionary in the list. I have to limit the scraping function call to 50-100 consecutive cals, otherwise I get disconnected from the web site.

In [99]:
i=0
while i<100:
    lat=str(relocateZip[i]["lat"])
    lng=str(relocateZip[i]["lng"])
    rad='5'
    city=relocateZip[i]["post_office_city"].replace(' ','%20').replace(',','%2C')
    state=relocateZip[i]["state"]
    zipcode=str(relocateZip[i]["zipcode"])
    relocateZip[i]["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
    #relocateZip[i]["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
    #relocateZip[i]["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
    i+=1
print('done')

done


In [100]:
i=100
while i<150:
    lat=str(relocateZip[i]["lat"])
    lng=str(relocateZip[i]["lng"])
    rad='5'
    city=relocateZip[i]["post_office_city"].replace(' ','%20').replace(',','%2C')
    state=relocateZip[i]["state"]
    zipcode=str(relocateZip[i]["zipcode"])
    relocateZip[i]["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
    #relocateZip[i]["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
    #relocateZip[i]["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
    i+=1
print('done')

done


In [101]:
i=150
while i<200:
    lat=str(relocateZip[i]["lat"])
    lng=str(relocateZip[i]["lng"])
    rad='5'
    city=relocateZip[i]["post_office_city"].replace(' ','%20').replace(',','%2C')
    state=relocateZip[i]["state"]
    zipcode=str(relocateZip[i]["zipcode"])
    relocateZip[i]["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
    #relocateZip[i]["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
    #relocateZip[i]["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
    i+=1
print('done')

done


In [102]:
i=200
while i<250:
    lat=str(relocateZip[i]["lat"])
    lng=str(relocateZip[i]["lng"])
    rad='5'
    city=relocateZip[i]["post_office_city"].replace(' ','%20').replace(',','%2C')
    state=relocateZip[i]["state"]
    zipcode=str(relocateZip[i]["zipcode"])
    relocateZip[i]["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
    #relocateZip[i]["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
    #relocateZip[i]["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
    i+=1
print('done')

done


In [103]:
i=250
while i<300:
    lat=str(relocateZip[i]["lat"])
    lng=str(relocateZip[i]["lng"])
    rad='5'
    city=relocateZip[i]["post_office_city"].replace(' ','%20').replace(',','%2C')
    state=relocateZip[i]["state"]
    zipcode=str(relocateZip[i]["zipcode"])
    relocateZip[i]["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
    #relocateZip[i]["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
    #relocateZip[i]["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
    i+=1
print('done')

done


In [104]:
i=300
while i<350:
    lat=str(relocateZip[i]["lat"])
    lng=str(relocateZip[i]["lng"])
    rad='5'
    city=relocateZip[i]["post_office_city"].replace(' ','%20').replace(',','%2C')
    state=relocateZip[i]["state"]
    zipcode=str(relocateZip[i]["zipcode"])
    relocateZip[i]["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
    #relocateZip[i]["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
    #relocateZip[i]["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
    i+=1
print('done')

done


In [105]:
i=350
while i<400:
    lat=str(relocateZip[i]["lat"])
    lng=str(relocateZip[i]["lng"])
    rad='5'
    city=relocateZip[i]["post_office_city"].replace(' ','%20').replace(',','%2C')
    state=relocateZip[i]["state"]
    zipcode=str(relocateZip[i]["zipcode"])
    relocateZip[i]["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
    #relocateZip[i]["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
    #relocateZip[i]["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
    i+=1
print('done')

done


In [106]:
i=400
while i<450:
    lat=str(relocateZip[i]["lat"])
    lng=str(relocateZip[i]["lng"])
    rad='5'
    city=relocateZip[i]["post_office_city"].replace(' ','%20').replace(',','%2C')
    state=relocateZip[i]["state"]
    zipcode=str(relocateZip[i]["zipcode"])
    relocateZip[i]["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
    #relocateZip[i]["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
    #relocateZip[i]["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
    i+=1
print('done')

done


In [107]:
i=450
while i<len(relocateZip):
    lat=str(relocateZip[i]["lat"])
    lng=str(relocateZip[i]["lng"])
    rad='5'
    city=relocateZip[i]["post_office_city"].replace(' ','%20').replace(',','%2C')
    state=relocateZip[i]["state"]
    zipcode=str(relocateZip[i]["zipcode"])
    relocateZip[i]["e_school"]=scrape_GreatSchools(lat,lng,rad,city,'e',state,zipcode)
    #relocateZip[i]["m_school"]=scrape_GreatSchools(lat,lng,rad,city,'m',state,zipcode)
    #relocateZip[i]["h_school"]=scrape_GreatSchools(lat,lng,rad,city,'h',state,zipcode)
    i+=1
print('done')

done


In [108]:
relocateZip

[{'area_code_list': ['413'],
  'bounds_east': -72.582535,
  'bounds_north': 42.100467,
  'bounds_south': 42.030795,
  'bounds_west': -72.667902,
  'common_city_list': ['Agawam'],
  'county': 'Hampden County',
  'e_school': 4,
  'housing_units': 7557,
  'land_area_in_sqmi': 11.44,
  'lat': 42.07,
  'lng': -72.63,
  'major_city': 'Agawam',
  'median_home_value': 213000,
  'median_household_income': 58733,
  'occupied_housing_units': 7215,
  'population': 16769,
  'population_density': 1466.0,
  'post_office_city': 'Agawam, MA',
  'radius_in_miles': 3.0,
  'state': 'MA',
  'timezone': 'Eastern',
  'water_area_in_sqmi': 0.86,
  'zipcode': '01001',
  'zipcode_type': 'Standard'},
 {'area_code_list': ['413'],
  'bounds_east': -72.355041,
  'bounds_north': 42.437947,
  'bounds_south': 42.301437,
  'bounds_west': -72.546776,
  'common_city_list': ['Amherst', 'Cushman', 'Pelham'],
  'county': 'Hampshire County',
  'e_school': 4,
  'housing_units': 10388,
  'land_area_in_sqmi': 55.04,
  'lat': 42

Next, I will add target zip to the list and convert final result into dataframe.

In [109]:
# add targetZip to the list since it also will be used for future clustering.
relocateZip.append(targetZip)
len(relocateZip)

464

In [110]:
# Convert list to the dataframe
df = pd.DataFrame(relocateZip)
df.head()

Unnamed: 0,area_code_list,bounds_east,bounds_north,bounds_south,bounds_west,common_city_list,county,e_school,housing_units,land_area_in_sqmi,...,occupied_housing_units,population,population_density,post_office_city,radius_in_miles,state,timezone,water_area_in_sqmi,zipcode,zipcode_type
0,[413],-72.582535,42.100467,42.030795,-72.667902,[Agawam],Hampden County,4,7557,11.44,...,7215,16769,1466.0,"Agawam, MA",3.0,MA,Eastern,0.86,1001,Standard
1,[413],-72.355041,42.437947,42.301437,-72.546776,"[Amherst, Cushman, Pelham]",Hampshire County,4,10388,55.04,...,9910,29049,528.0,"Amherst, MA",7.0,MA,Eastern,1.65,1002,Standard
2,[978],-72.007388,42.484473,42.356423,-72.205174,[Barre],Worcester County,4,2044,44.24,...,1904,5079,115.0,"Barre, MA",7.0,MA,Eastern,0.26,1005,Standard
3,[413],-72.331642,42.358762,42.185812,-72.472287,[Belchertown],Hampshire County,3,5839,52.64,...,5595,14649,278.0,"Belchertown, MA",8.0,MA,Eastern,2.68,1007,Standard
4,[413],-72.87218,42.25134,42.113028,-73.034916,[Blandford],Hampden County,-1,586,53.8,...,503,1263,23.0,"Blandford, MA",7.0,MA,Eastern,1.96,1008,Standard


<h4>2.3. Feature Selection</h4>

Current data set has 25 features but I only need few in order to obtain data from FourSquare API and for future clustering. Thus, I will keep the following features that will be used for clustering: e_school,and median_home_value. The following features are needed for FourSquare API and record identification: zipcode lat, lng, post_office_city.

In [111]:
# create new dataframe with required features
dfFeature=df[['zipcode','post_office_city','lat','lng','e_school','median_home_value']]
dfFeature.head()

Unnamed: 0,zipcode,post_office_city,lat,lng,e_school,median_home_value
0,1001,"Agawam, MA",42.07,-72.63,4,213000
1,1002,"Amherst, MA",42.38,-72.52,4,338900
2,1005,"Barre, MA",42.42,-72.12,4,208500
3,1007,"Belchertown, MA",42.3,-72.4,3,260000
4,1008,"Blandford, MA",42.2,-73.0,-1,247200


In [112]:
# get some basic statistics about the dataframe
dfFeature.describe(include='all')

Unnamed: 0,zipcode,post_office_city,lat,lng,e_school,median_home_value
count,464.0,464,464.0,464.0,464.0,464.0
unique,464.0,403,,,,
top,2421.0,"Springfield, MA",,,,
freq,1.0,10,,,,
mean,,,42.243129,-71.661545,4.30819,338721.12069
std,,,0.366382,2.474838,1.686517,127844.459974
min,,,37.91,-122.05,-1.0,64000.0
25%,,,42.09,-72.1125,4.0,247175.0
50%,,,42.29,-71.295,5.0,313300.0
75%,,,42.47,-71.0175,5.0,398650.0


In [113]:
# drop those zipcodes where school rating below target
dfFeatureClean=dfFeature[dfFeature["e_school"]>=targetZip["e_school"]]

In [114]:
dfFeatureClean

Unnamed: 0,zipcode,post_office_city,lat,lng,e_school,median_home_value
5,01010,"Brimfield, MA",42.150,-72.200,6,264800
26,01054,"Leverett, MA",42.470,-72.490,6,341700
40,01081,"Wales, MA",42.060,-72.230,7,204300
72,01238,"Lee, MA",42.300,-73.220,6,251300
77,01254,"Richmond, MA",42.380,-73.370,6,404800
90,01338,"Buckland, MA",42.580,-72.800,6,255900
93,01341,"Conway, MA",42.510,-72.700,7,291000
102,01360,"Northfield, MA",42.670,-72.450,6,241400
107,01370,"Shelburne Falls, MA",42.600,-72.800,6,237100
115,01431,"Ashby, MA",42.670,-71.840,6,246400


Get FourSquare data and construct a dataframe of venues.

In [115]:
# Define function to get venues within about 5 miles (8000 meters) radius of the zipcodes.
def getNearbyVenues(names, latitudes, longitudes, radius=8000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Zipcode', 
                  'Zipcode Latitude', 
                  'Zipcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [116]:
#Run the above function on each zipcode and create a new dataframe called location_venues.
location_venues = getNearbyVenues(names=dfFeatureClean['zipcode'],
                                   latitudes=dfFeatureClean['lat'],
                                   longitudes=dfFeatureClean['lng']
                                  )

01010
01054
01081
01238
01254
01338
01341
01360
01370
01431
01432
01434
01451
01460
01463
01506
01515
01516
01521
01522
01532
01536
01541
01568
01581
01612
01718
01719
01720
01740
01742
01746
01747
01748
01752
01754
01757
01775
01776
01778
01801
01864
01867
01880
01886
01890
01929
01940
01949
01982
01983
02019
02032
02038
02043
02045
02050
02052
02053
02054
02056
02061
02062
02066
02071
02081
02093
02169
02188
02189
02191
02324
02332
02360
02420
02421
02461
02462
02464
02465
02466
02474
02476
02492
02494
02536
02540
02556
02557
02568
02631
02642
02649
02653
02657
02659
02703
02718
02739
02769
02770
02771
94598


In [135]:
# Check new dataframe
print(location_venues.shape)
location_venues.head()

(8785, 7)


Unnamed: 0,Zipcode,Zipcode Latitude,Zipcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,1010,42.15,-72.2,Brimfield Antique Show,42.121464,-72.209101,Antique Shop
1,1010,42.15,-72.2,Brimfield Antiques Center,42.120812,-72.213299,Antique Shop
2,1010,42.15,-72.2,Rapscallion Brewery,42.139765,-72.109926,Brewery
3,1010,42.15,-72.2,Old Village Grille,42.11675,-72.117853,Breakfast Spot
4,1010,42.15,-72.2,Simple Indulgence Day Spa,42.11611,-72.115648,Spa


In [136]:
# Number of venues were returned for each Zipcode
location_venues.groupby('Zipcode').count()

Unnamed: 0_level_0,Zipcode Latitude,Zipcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Zipcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
01010,28,28,28,28,28,28
01054,27,27,27,27,27,27
01081,23,23,23,23,23,23
01238,86,86,86,86,86,86
01254,84,84,84,84,84,84
01338,20,20,20,20,20,20
01341,15,15,15,15,15,15
01360,16,16,16,16,16,16
01370,25,25,25,25,25,25
01431,27,27,27,27,27,27


In [137]:
#How many unique categories can be curated from all the returned venues
print('There are {} uniques categories.'.format(len(location_venues['Venue Category'].unique())))

There are 320 uniques categories.


Analyze venues in each zip code.

In [138]:
# one hot encoding
location_onehot = pd.get_dummies(location_venues[['Venue Category']], prefix="", prefix_sep="")

# add zipcode column back to dataframe
location_onehot['Zipcode'] = location_venues['Zipcode'] 

# move neighborhood column to the first column
fixed_columns = [location_onehot.columns[-1]] + list(location_onehot.columns[:-1])
location_onehot = location_onehot[fixed_columns]

location_onehot.head()

Unnamed: 0,Zipcode,Accessories Store,African Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,Aquarium,Arcade,...,Warehouse Store,Waterfall,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio,Zoo
0,1010,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1010,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group rows by neighborhood and by taking the mean of the frequency of occurrence of each category.

In [139]:
location_grouped = location_onehot.groupby('Zipcode').mean().reset_index()
location_grouped

Unnamed: 0,Zipcode,Accessories Store,African Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,Aquarium,Arcade,...,Warehouse Store,Waterfall,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio,Zoo
0,01010,0.000000,0.00,0.000000,0.035714,0.00,0.071429,0.000000,0.00,0.00,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.00
1,01054,0.000000,0.00,0.000000,0.037037,0.00,0.000000,0.000000,0.00,0.00,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.037037,0.00
2,01081,0.000000,0.00,0.000000,0.000000,0.00,0.086957,0.000000,0.00,0.00,...,0.000000,0.043478,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.00
3,01238,0.034884,0.00,0.000000,0.069767,0.00,0.011628,0.000000,0.00,0.00,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.011628,0.000000,0.00
4,01254,0.000000,0.00,0.000000,0.023810,0.00,0.011905,0.000000,0.00,0.00,...,0.000000,0.000000,0.000000,0.0,0.000000,0.023810,0.0,0.000000,0.011905,0.00
5,01338,0.000000,0.00,0.000000,0.100000,0.00,0.000000,0.000000,0.00,0.00,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.00
6,01341,0.000000,0.00,0.000000,0.066667,0.00,0.000000,0.000000,0.00,0.00,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.00
7,01360,0.000000,0.00,0.000000,0.062500,0.00,0.000000,0.000000,0.00,0.00,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.00
8,01370,0.000000,0.00,0.000000,0.120000,0.00,0.000000,0.000000,0.00,0.00,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.00
9,01431,0.000000,0.00,0.000000,0.037037,0.00,0.000000,0.000000,0.00,0.00,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.00


Add school rating, median home price, and venues into one dataframe for future clustering.

In [143]:
location_grouped.rename(columns={'Zipcode': 'zipcode'}, inplace=True)
location_grouped.head()

Unnamed: 0,zipcode,Accessories Store,African Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,Aquarium,Arcade,...,Warehouse Store,Waterfall,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio,Zoo
0,1010,0.0,0.0,0.0,0.035714,0.0,0.071429,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1054,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0
2,1081,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,...,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1238,0.034884,0.0,0.0,0.069767,0.0,0.011628,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,0.0
4,1254,0.0,0.0,0.0,0.02381,0.0,0.011905,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.011905,0.0


In [155]:
dfCluster=pd.merge(location_grouped, dfFeatureClean, on='zipcode')
dfCluster.head()

Unnamed: 0,zipcode,Accessories Store,African Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,Aquarium,Arcade,...,Wine Shop,Winery,Women's Store,Yoga Studio,Zoo,post_office_city,lat,lng,e_school,median_home_value
0,1010,0.0,0.0,0.0,0.035714,0.0,0.071429,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,"Brimfield, MA",42.15,-72.2,6,264800
1,1054,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.037037,0.0,"Leverett, MA",42.47,-72.49,6,341700
2,1081,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,"Wales, MA",42.06,-72.23,7,204300
3,1238,0.034884,0.0,0.0,0.069767,0.0,0.011628,0.0,0.0,0.0,...,0.0,0.0,0.011628,0.0,0.0,"Lee, MA",42.3,-73.22,6,251300
4,1254,0.0,0.0,0.0,0.02381,0.0,0.011905,0.0,0.0,0.0,...,0.02381,0.0,0.0,0.011905,0.0,"Richmond, MA",42.38,-73.37,6,404800


<h3>3. Cluster the potential re-location areas</h3>

In [156]:
# drop zipcode, post_office_city, lat, lng to use only numerical features for clustering
dropcolumns=['zipcode', 'post_office_city', 'lat', 'lng']
df_clustering = dfCluster.drop(dropcolumns, 1)



In [157]:
df_clustering.head()

Unnamed: 0,Accessories Store,African Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,Aquarium,Arcade,Arepa Restaurant,...,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio,Zoo,e_school,median_home_value
0,0.0,0.0,0.0,0.035714,0.0,0.071429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6,264800
1,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,6,341700
2,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7,204300
3,0.034884,0.0,0.0,0.069767,0.0,0.011628,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,0.0,6,251300
4,0.0,0.0,0.0,0.02381,0.0,0.011905,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02381,0.0,0.0,0.011905,0.0,6,404800


In [158]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 10

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([8, 0, 3, 3, 9, 3, 8, 3, 3, 3])

In [159]:
# add clustering labels back to the original dataframe
dfCluster.insert(0, 'Cluster Labels', kmeans.labels_)


In [160]:
dfCluster.head()

Unnamed: 0,Cluster Labels,zipcode,Accessories Store,African Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,Aquarium,...,Wine Shop,Winery,Women's Store,Yoga Studio,Zoo,post_office_city,lat,lng,e_school,median_home_value
0,8,1010,0.0,0.0,0.0,0.035714,0.0,0.071429,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,"Brimfield, MA",42.15,-72.2,6,264800
1,0,1054,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.037037,0.0,"Leverett, MA",42.47,-72.49,6,341700
2,3,1081,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,"Wales, MA",42.06,-72.23,7,204300
3,3,1238,0.034884,0.0,0.0,0.069767,0.0,0.011628,0.0,0.0,...,0.0,0.0,0.011628,0.0,0.0,"Lee, MA",42.3,-73.22,6,251300
4,9,1254,0.0,0.0,0.0,0.02381,0.0,0.011905,0.0,0.0,...,0.02381,0.0,0.0,0.011905,0.0,"Richmond, MA",42.38,-73.37,6,404800


Visualize clusters

In [165]:
import numpy as np # library to handle data in a vectorized manner

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[42.292, -71.5], zoom_start=3)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dfCluster['lat'], dfCluster['lng'], dfCluster['zipcode'], dfCluster['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The desired locations are in cluster #6.

In [173]:
# create new dataframe to keep only locations that belong to cluster 1 
df_cluster1=dfCluster[dfCluster['Cluster Labels']== 6]
df_cluster1

Unnamed: 0,Cluster Labels,zipcode,Accessories Store,African Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,Aquarium,...,Wine Shop,Winery,Women's Store,Yoga Studio,Zoo,post_office_city,lat,lng,e_school,median_home_value
30,6,1742,0.0,0.0,0.01,0.06,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,"Concord, MA",42.46,-71.36,7,686700
45,6,1890,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.03,0.0,0.0,0.0,0.0,"Winchester, MA",42.45,-71.15,6,676800
74,6,2420,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,"Lexington, MA",42.46,-71.22,6,688500
75,6,2421,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.0,...,0.02,0.0,0.0,0.0,0.0,"Lexington, MA",42.44,-71.23,6,718300
76,6,2461,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.02,0.0,"Newton Highlands, MA",42.31,-71.21,6,682600
83,6,2492,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,"Needham, MA",42.28,-71.25,7,678500
102,6,94598,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,...,0.02,0.0,0.0,0.02,0.0,"Walnut Creek, CA",37.91,-122.05,6,719200


<h3>4. Conclusion and Recommendations</h3>

Based on the results above, the desired relocation places reside in the cluster 6 - same cluster as current location. Let's Map cluter #6 locations.

In [178]:
# create map
map_relocate = folium.Map(location=[42.292, -71.5], zoom_start=10)

# add markers to the map
for lat, lon, poi, cluster in zip(df_cluster1['lat'], df_cluster1['lng'], df_cluster1['post_office_city'], df_cluster1['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_relocate)
       
map_relocate

There are 6 recommended relocation areas that are comparable to the current location.

In [182]:
df_cluster1[['post_office_city','median_home_value','e_school']]

Unnamed: 0,post_office_city,median_home_value,e_school
30,"Concord, MA",686700,7
45,"Winchester, MA",676800,6
74,"Lexington, MA",688500,6
75,"Lexington, MA",718300,6
76,"Newton Highlands, MA",682600,6
83,"Needham, MA",678500,7
102,"Walnut Creek, CA",719200,6


Out of 6 recommended locations there are two with higher school rating and lower home price: Concord,MA and Needham, MA.