<img style="float: right;" src="https://dm2302files.storage.live.com/y4pXLLDC4ZAAKymK7XTZBRc_KQ-6aa299K8wkzXobgdJlxvBw6GX8mwRhHbCy3amu2hThezqYmXLbUgpqSxjW5S9dLPFtjIh3Zaojk8wZT3mM1x9urAm3wDSTxD1BUdDLv9NipF94T-ClvcX_x5sOr3I6XKAH324IP-UBaQObRF0hqBaCjz2ibb3DjrnKvCCzwMcl7bAxPAXQpLFHhyi8MnovXEanjYqnt4pMVtQpblElg/WhatsApp%20Image%202020-05-26%20at%2015.14.51.jpeg?psid=1&width=640&height=97" width="300">

# The Battle of Neighborhoods

<span style="color:gray">*IBM Applied Data Science Capstone - Week 5*</span>

## 1. Introduction

### 1.1 Background

Evaluating houses is a deeply personal and complex process, impacted by diverse factors ranging from the physical characteristics and local amenities to politic-economic factors. It's important to analyze the situation, research the options, and gather all the necessary information before making an important decision about moving.

Besides the intrinsic features of a property such as the number of beds, baths, square footage, price, age, etc, a  factor that can affect the decision of evaluating houses is the proximity to the things that matter most to them. This may include a workplace, views, parks, schools, community service, residences of relatives and so on.

The same rings true for all who seeks a new place around the world in pursuit of their dreams or in search of a better life.

### 1.2 Problem

Moving for work is a different than simply moving, typically because the timeline involved in taking a job in a new location is a lot shorter than when you decide on a change of scenery and then focus on getting the new position.

One of my clients lives in Marble Hill, Manhattan, New York and she loves her neighborhood. She recieved a great job offer from Toronto, and she decided to move to Toronto in 3 weeks to take up the new opportunity. She wants to find out a neighborhood in Toronto that has similar amenities available near her that she gets in Marble Hill. 

In this project Python's data analysis and geospatial analysis packages was used to analyze the whole spectrum of available listings in a market, evaluate and score properties based on various attribute and spatial parameters and arrive at a shortlist of neighborhoods similars to Marble Hill.

### 1.3 Target Audience

Target audience for this project is anyone who is searching for a new properties in neighborhoods that provides similar amenities of their current neighborhood.

## 2. Data collection and cleaning

### 2.1 Data sources

Housing data for the city of New York was collected from the New York University Libraries <span style="color:blue">[1]</span> and the data for the city of Toronto was scraped from Wikipedia <span style="color:blue">[2]</span>. Data was read using Pandas as DataFrame objects. These DataFrames form the foundation of this study upon which both spatial and attribute analysis are performed.

[1] https://geo.nyu.edu/catalog/nyu_2451_34572

[2] https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Foursquare API was used to explore Manhattan, New York City and neighborhoods in Toronto. The explore function was used to get the most common venue categories in each neighborhood, and then the neighborhoods was grouped into clusters. To complete this task k-means clustering algorithm was used. Finally, the Folium library was used to visualize the neighborhoods in New Toronto and their neighborhoods similars to Marble Hill.

#### Import required libraries

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

from tqdm.notebook import tqdm # progress bar library

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import warnings # warnings library
warnings.filterwarnings("ignore")

print('Libraries imported.')

Libraries imported.


#### Create New York DataFrame

In [3]:
# Get New York Dataset
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

newyork_data = newyork_data['features']

# Define the DataFrame columns
columns_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# Instantiate the dataframe
ny_df = pd.DataFrame(columns=columns_names)

for data in newyork_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    ny_df = ny_df.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

ny_df.drop_duplicates(subset=['Neighborhood'], inplace=True)

ny_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


#### Create Toronto DataFrame

In [4]:
# Get tables from the URL and transforme into a DataFrame
tr_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0] # index 0 is the table of interest

# Rename columns
columns_names = ["PostalCode", "Borough", "Neighborhood"]
tr_df.columns = columns_names

# Drop cells with a borough that is "Not assigned"
tr_df = tr_df[tr_df.Borough != "Not assigned"].reset_index(drop=True)

# Make neighborhood equals the borough if neighborhood is "Not assigned"
for index, row in tr_df.iterrows():
    if row["Neighborhood"] == "Not assigned":
        row["Neighborhood"] = row["Borough"]

# Convert Neighborhood strings to list
tr_df['Neighborhood'] = tr_df.Neighborhood.str.split(',', expand=False)

# Explode Neighborhood list to single rows
tr_df = tr_df.explode('Neighborhood').reset_index(drop=True)

tr_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park
3,M5A,Downtown Toronto,Harbourfront
4,M6A,North York,Lawrence Manor


In [5]:
# Define a function to get coordinates
def get_latlng(neighborhood, state='Toronto', country='Canada'):
    # Initialize your variable to None
    lat_lng_coords = None
    # Loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, {}, {}'.format(neighborhood, state, country))
        lat_lng_coords = g.latlng
    return lat_lng_coords

In [6]:
# Call the function to get the coordinates, store in a new list using list comprehension
coords = [get_latlng(neighborhood) for neighborhood in tqdm(tr_df["Neighborhood"].tolist(), 'Getting latitudes and longitudes')]

# Create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

# Merge the coordinates into the original dataframe
tr_df['Latitude'] = df_coords['Latitude']
tr_df['Longitude'] = df_coords['Longitude']

# Check the neighborhoods and the coordinates
print(tr_df.shape)

tr_df.head()

HBox(children=(FloatProgress(value=0.0, description='Getting latitudes and longitudes', max=217.0, style=Progr…


(217, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.686575,-79.409993
1,M4A,North York,Victoria Village,43.73154,-79.31428
2,M5A,Downtown Toronto,Regent Park,43.66069,-79.36031
3,M5A,Downtown Toronto,Harbourfront,43.63951,-79.38316
4,M6A,North York,Lawrence Manor,43.72294,-79.43116


#### Define Foursquare Credentials and Version

In [7]:
CLIENT_ID = 'MDOOB1JIPM2FOIP0BDZ2IJQFB14NTHEVORYGDCITKFFW2GSC' # your Foursquare ID
CLIENT_SECRET = 'X0BEVTOL5WVOY3JKAJNYKU5FD3P0QHY2D0HCYG2MJKEF1KW2' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#### Get the top 100 venues that are within a radius of 500 meters

In [8]:
def getNearbyVenues(neighborhoods, latitudes, longitudes, radius=750, LIMIT=None):
    
    venues_list=[]
    for neighborhood, lat, lng in tqdm(zip(neighborhoods, latitudes, longitudes), 'Getting data'):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        except:
            pass
        # return only relevant information for each nearby venue
        venues_list.append([(
            neighborhood,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                            'Neighborhood_Latitude', 
                            'Neighborhood_Longitude', 
                            'Venue', 
                            'Venue_Latitude', 
                            'Venue_Longitude', 
                            'Venue_Category']
                  
    
    return(nearby_venues)

In [9]:
ny_df[ny_df['Borough'] == 'Manhattan']

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
6,Manhattan,Marble Hill,40.876551,-73.91066
100,Manhattan,Chinatown,40.715618,-73.994279
101,Manhattan,Washington Heights,40.851903,-73.9369
102,Manhattan,Inwood,40.867684,-73.92121
103,Manhattan,Hamilton Heights,40.823604,-73.949688
104,Manhattan,Manhattanville,40.816934,-73.957385
105,Manhattan,Central Harlem,40.815976,-73.943211
106,Manhattan,East Harlem,40.792249,-73.944182
107,Manhattan,Upper East Side,40.775639,-73.960508
108,Manhattan,Yorkville,40.77593,-73.947118


In [10]:
neighborhood = 'Yorkville'
# Get Marble Hill venues
ny_venues = getNearbyVenues(neighborhoods=ny_df.loc[ny_df['Neighborhood'] == neighborhood, 'Neighborhood'],
                                 latitudes=ny_df.loc[ny_df['Neighborhood'] == neighborhood, 'Latitude'],
                                 longitudes=ny_df.loc[ny_df['Neighborhood'] == neighborhood, 'Longitude'],
                                 LIMIT=100
                                  )
ny_venues.head()

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Getting data', max=1.0, style=ProgressS…




Unnamed: 0,Neighborhood,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Yorkville,40.77593,-73.947118,Bagel Bob's on York,40.776459,-73.946972,Bagel Shop
1,Yorkville,40.77593,-73.947118,Peng's Noodle Folk,40.777258,-73.94911,Asian Restaurant
2,Yorkville,40.77593,-73.947118,Carl Schurz Park,40.775118,-73.943763,Park
3,Yorkville,40.77593,-73.947118,Park East Wines & Spirits,40.776715,-73.946663,Liquor Store
4,Yorkville,40.77593,-73.947118,Shorty's,40.777957,-73.948561,Sandwich Place


In [11]:
# Get Toronto venues
toronto_venues = getNearbyVenues(neighborhoods=tr_df['Neighborhood'],
                                 latitudes=tr_df['Latitude'],
                                 longitudes=tr_df['Longitude'],
                                 LIMIT=100
                                  )
toronto_venues.head()

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Getting data', max=1.0, style=ProgressS…




Unnamed: 0,Neighborhood,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Parkwoods,43.686575,-79.409993,Aroma Espresso Bar,43.68817,-79.412599,Café
1,Parkwoods,43.686575,-79.409993,Sir Winston Churchill Park,43.683732,-79.409881,Park
2,Parkwoods,43.686575,-79.409993,Mashu Mashu Mediterranean Grill,43.688297,-79.412563,Middle Eastern Restaurant
3,Parkwoods,43.686575,-79.409993,What A Bagel,43.688079,-79.414544,Bagel Shop
4,Parkwoods,43.686575,-79.409993,Loblaws,43.684188,-79.415485,Grocery Store


#### One Hot Encoded DataFrame 

In [12]:
# One hot encoding
ny_onehot = pd.get_dummies(ny_venues[['Venue_Category']], prefix="", prefix_sep="")
toronto_onehot = pd.get_dummies(toronto_venues[['Venue_Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
ny_onehot['Neighborhood'] = ny_venues['Neighborhood']
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Move neighborhood column to the first column
ny_fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[ny_fixed_columns]

tr_fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[tr_fixed_columns]

# Group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
ny_grouped = ny_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

In [13]:
ny_grouped.head()

Unnamed: 0,Neighborhood,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,Bank,Bar,Beer Store,Burger Joint,Butcher,Café,Chinese Restaurant,Cocktail Bar,Coffee Shop,Cycle Studio,Daycare,Deli / Bodega,Dessert Shop,Diner,Dog Run,Farmers Market,French Restaurant,Gastropub,German Restaurant,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gymnastics Gym,Health & Beauty Service,Hot Dog Joint,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Juice Bar,Latin American Restaurant,Liquor Store,Massage Studio,Mexican Restaurant,Monument / Landmark,Nail Salon,Park,Peruvian Restaurant,Pharmacy,Pizza Place,Pool,Pub,Ramen Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Szechuan Restaurant,Thai Restaurant,Turkish Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Yorkville,0.01,0.01,0.03,0.02,0.01,0.03,0.01,0.02,0.01,0.01,0.01,0.02,0.07,0.01,0.01,0.04,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.02,0.04,0.02,0.01,0.01,0.01,0.04,0.02,0.04,0.01,0.01,0.01,0.01,0.01,0.03,0.01,0.01,0.02,0.01,0.01,0.04,0.01,0.01,0.01,0.01,0.01,0.02,0.01,0.01,0.01,0.02,0.02,0.01,0.03,0.02


## 3. Results

So far, the data set was feature engineered with spatial attributes. The frequence for different venues were explicitly defined, using which properties were ranked. In reality, buyers decision making process, although logical, is a little less calculated and a bit fuzzier. Thus, we could get buyers to simply ‘favorite’ and ‘blacklist’ a set of houses to infer their preferences.

#### Get top 10 Marble Hill venues

In [14]:
ny_grouped_top10 = ny_grouped.copy()
ny_grouped_top10.drop(labels='Neighborhood', axis=1, inplace=True)
ny_grouped_top10 = ny_grouped_top10.sort_values(by=0, axis=1, ascending=False).iloc[:,:10]
ny_grouped_top10['Neighborhood'] = ny_venues['Neighborhood']
ny_grouped_top10_columns = [ny_grouped_top10.columns[-1]] + list(ny_grouped_top10.columns[:-1])
ny_grouped_top10_n = ny_grouped_top10[ny_grouped_top10_columns]
ny_grouped_top10_n

Unnamed: 0,Neighborhood,Coffee Shop,Ice Cream Shop,Pizza Place,Gym,Italian Restaurant,Deli / Bodega,Bagel Shop,Wine Shop,Bar,Mexican Restaurant
0,Yorkville,0.07,0.04,0.04,0.04,0.04,0.04,0.03,0.03,0.03,0.03


#### Get top 5 Toronto Neighborhoods

In [15]:
ny_grouped_top10_columns = ny_grouped_top10_n.drop(labels='Neighborhood', axis=1)

toronto_grouped_intersection = toronto_grouped.copy()
toronto_grouped_intersection.drop(labels='Neighborhood', axis=1, inplace=True)
toronto_grouped_intersection = toronto_grouped_intersection[ny_grouped_top10_columns.columns]
toronto_grouped_intersection['Neighborhood'] = toronto_venues['Neighborhood']
toronto_grouped_intersection_columns = [toronto_grouped_intersection.columns[-1]] + list(toronto_grouped_intersection.columns[:-1])
toronto_grouped_intersection = toronto_grouped_intersection[toronto_grouped_intersection_columns]
toronto_grouped_top = toronto_grouped_intersection.groupby('Neighborhood').mean().reset_index()
toronto_grouped_top

Unnamed: 0,Neighborhood,Coffee Shop,Ice Cream Shop,Pizza Place,Gym,Italian Restaurant,Deli / Bodega,Bagel Shop,Wine Shop,Bar,Mexican Restaurant
0,Harbourfront,0.063075,0.005779,0.030619,0.009808,0.013503,0.009807,0.001723,0.000688,0.007991,0.010083
1,Lawrence Manor,0.065583,0.015518,0.031787,0.004697,0.019946,0.002273,0.0,0.0,0.007108,0.009367
2,Parkwoods,0.070725,0.018485,0.030372,0.015229,0.008921,0.006578,0.0,0.0,0.007644,0.005332
3,Regent Park,0.072782,0.007766,0.064346,0.008766,0.016415,0.003538,0.002019,0.0,0.007092,0.004394
4,Victoria Village,0.054502,0.008497,0.06986,0.0,0.008497,0.011364,0.0,0.0,0.012529,0.0


In [28]:
toronto_merged = toronto_grouped_top.copy()

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(tr_df.set_index('Neighborhood'), on='Neighborhood')
toronto_merged_columns = list(toronto_merged.columns[11:13]) + [toronto_merged.columns[0]] + list(toronto_merged.columns[13:15]) +  list(toronto_merged.columns[1:10])
toronto_merged = toronto_merged[toronto_merged_columns]
toronto_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Coffee Shop,Ice Cream Shop,Pizza Place,Gym,Italian Restaurant,Deli / Bodega,Bagel Shop,Wine Shop,Bar
0,M5A,Downtown Toronto,Harbourfront,43.63951,-79.38316,0.063075,0.005779,0.030619,0.009808,0.013503,0.009807,0.001723,0.000688,0.007991
1,M6A,North York,Lawrence Manor,43.72294,-79.43116,0.065583,0.015518,0.031787,0.004697,0.019946,0.002273,0.0,0.0,0.007108
2,M3A,North York,Parkwoods,43.686575,-79.409993,0.070725,0.018485,0.030372,0.015229,0.008921,0.006578,0.0,0.0,0.007644
3,M5A,Downtown Toronto,Regent Park,43.66069,-79.36031,0.072782,0.007766,0.064346,0.008766,0.016415,0.003538,0.002019,0.0,0.007092
4,M4A,North York,Victoria Village,43.73154,-79.31428,0.054502,0.008497,0.06986,0.0,0.008497,0.011364,0.0,0.0,0.012529


#### Get coordinates of Toronto

In [16]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinates of Toronto are 43.6534817, -79.3839347.


#### Create a map to visualize the top 5 neighborhoods

In [29]:
# Create map
top_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add markers to the map
markers_colors = []
for lat, lon, poi in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood']):
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label).add_to(top_map)
       
top_map

## 4. Conclusion

In this case study, the input data set was spatially enriched with information about access to different facilities. It demonstrates how data science can be employed to one aspect of the real estate industry. Buying a home is a personal process, however a lot of decisions are heavily influenced by location. As shown in this study, Python libraries such as Pandas can be used for visualization and statistical analysis, and libraries such as the Foursquare API for Python for spatial analysis. The methods adopted in this study can be applied to any other real estate market to build other recommendation engines.

---

<h4>Author:  <a href="https://br.linkedin.com/in/henrique-mand">Henrique Mandt</a></h4>
<p><a href="https://br.linkedin.com/in/henrique-mand">Henrique Mandt</a>, Civil Engineer and Consultant with a track record of developing soluctions that substantially increases operational efficiency, mitigate risks and maximize benefits for teams, investors and customers. He is a Data Scientist enthusiast with interest in data mining, machine learning and spatial statistical modelling.</p>

---