# Applied Data Science Capstone Project

# 1. Introduction

### 1.1 Background

A recent study from the reputable Fake Immigration Research Institute found that a large number of New York City residents are moving to Toronto. This creates a potential business opportunity for a service that recommends Toronto neighbourhoods to them.

### 1.2 Problem
This project will provide a recommendation service to New York City residents planning a move to Toronto. It will do this by comparing the neighbourhoods of New York City and Toronto, and then listing the Toronto neighbourhoods most similar to each New York City neighbourhood.

### 1.3 Audience
The intended audience for this project is any New York City resident looking for a similar neighbourhood to move to in Toronto.

# 2. Data

### 2.1 New York University Spatial Data Repository

This dataset consists of the neighbourhoods of New York City, as well as the latitude and longitude coordinates for each neighbourhood.

Link: https://geo.nyu.edu/catalog/nyu_2451_34572

Example data point:

<blockquote>
{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
</blockquote>

### 2.2 Wikipedia page of Canadian postal codes starting with M

This page contains postal codes in Canada where the first letter is M. Postal codes beginning with 'M' are located within the city of Toronto.

Link: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Example data point:

<blockquote>
{'Postal Code': 'M4A',
 'Borough': 'North York',
 'Neighbourhood': 'Victoria Village'}
</blockquote>

### 2.3 Toronto geospatial coordinates

This dataset contains the geospatial coordinates (latitude and longitude) of each postal code in Toronto.

Link: https://cocl.us/Geospatial_data

Example data point:

<blockquote>
{'Postal Code': 'M1B',
 'Latitude': '43.8066863',
 'Longitude': '-79.1943534'}
</blockquote>

### 2.4 Foursquare API

This dataset describes places and venues, including information such as their geographical location, category, working hours and full address. It also contains information about which venues exist within a defined radius of any given location.

Link: https://api.foursquare.com

# 3. Methodology

### 3.1 Gather Toronto neighbourhood locations

First, I imported the BeautifulSoup package and used it to scrape a list of Toronto postal codes, boroughs and neighbourhoods from the Wikipedia page described in Section 2.2. I then examined each row of the scraped table. If a row's borough was not assigned, the row was skipped. If the neighbourhood was not assigned, it was set to the value of the borough field.

This relied on two assumptions, both confirmed visually: the scraped page contains only one table, and the first row of that table holds the column headers, so it was skipped.

In [2]:
# Import BeautifulSoup package
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Use BeautifulSoup package to scrape URL contents
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

# Import pandas package
import pandas as pd
pd.set_option('display.max_rows', None)

# Create pandas dataframe to store data and define the column names
column_names = ['PostalCode','Borough','Neighbourhood']
toronto_hoods = pd.DataFrame(columns = column_names)

# Create an index to track current row # of the scraped table
currRow = 0

# Find each row ("<tr>") of the scraped table
for row in soup.table.find_all('tr'):
    
    # Increment the current row's index
    currRow = currRow + 1
    
    # If this is the first row scraped, skip it (since this just lists the column headers)
    if (currRow == 1): continue
    
    # Reset row-level index and values
    rowIndex = 0
    rowPostal = ''
    rowBorough = ''
    rowHood = ''

    # Find each column ("<td>") of the scraped table
    for col in row.find_all('td'):
        if (rowIndex==0): rowPostal = col.text.strip()
        elif (rowIndex==1): rowBorough = col.text.strip()
        else: rowHood = col.text.strip()
        rowIndex = rowIndex+1
    
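    # If a borough is not assigned, skip the row
    if (rowBorough == 'Not assigned'): continue

    # If a neighbourhood is not assigned, fall back to the borough name (as described in the text)
    if (rowHood == 'Not assigned'): rowHood = rowBorough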
    
    # Append a new row to the dataframe
    toronto_hoods = toronto_hoods.append({'PostalCode': rowPostal, 'Borough':rowBorough, 'Neighbourhood':rowHood}, ignore_index=True)
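
The row-filtering rules can also be expressed without appending to a dataframe inside the loop, which matters because `DataFrame.append` was removed in pandas 2.0. The following is a minimal sketch, not the notebook's code: `clean_postal_rows` is a hypothetical helper, it assumes each scraped row has already been reduced to a (postal code, borough, neighbourhood) tuple of strings, and the `M9Z` row is made up to show the fallback.

```python
def clean_postal_rows(raw_rows):
    """Apply the filtering rules to (postal_code, borough, neighbourhood) tuples.

    Hypothetical helper for illustration; raw_rows is assumed to exclude
    the header row.
    """
    records = []
    for postal, borough, hood in raw_rows:
        postal, borough, hood = postal.strip(), borough.strip(), hood.strip()

        # Rows without an assigned borough are dropped
        if borough == 'Not assigned':
            continue

        # A missing neighbourhood falls back to the borough name
        if hood == 'Not assigned':
            hood = borough

        records.append({'PostalCode': postal,
                        'Borough': borough,
                        'Neighbourhood': hood})
    return records


sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M4A', 'North York', 'Victoria Village\n'),
    ('M9Z', 'Etobicoke', 'Not assigned'),   # made-up row to show the fallback
]
cleaned = clean_postal_rows(sample_rows)
print(cleaned)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(cleaned, columns=column_names)` in a single call, which avoids growing the dataframe one row at a time.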

I then downloaded the Toronto geographical coordinates (the dataset described in Section 2.3) and merged them with the dataframe of neighbourhoods gathered in the previous step.

In [3]:
# Download Toronto's geographical coordinates by postal code
!wget -q -O 'toronto_geodata.csv' https://cocl.us/Geospatial_data
toronto_geodata = pd.read_csv('toronto_geodata.csv')

# Join the dataframe containing the neighbourhood info with the dataframe containing the geographical coordinates
toronto_df = toronto_hoods.set_index('PostalCode').join(toronto_geodata.set_index('Postal Code'))
toronto_df = toronto_df.reset_index()

The result of this merge was the following dataframe containing the postal codes, boroughs, neighbourhoods, latitudes and longitudes of Toronto. For brevity, only the first five rows are shown.

In [4]:
toronto_df.head()

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


I imported the Folium library among other necessary packages, and plotted each neighbourhood of Toronto onto a map.

In [5]:
# The code was removed by Watson Studio for sharing.



In [6]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add markers to map
for lat, lng, label in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

### 3.2 Gather New York City neighbourhood locations

I pulled a list of New York City neighbourhoods from the dataset described in section 2.1, and loaded it into a dataframe.

In [7]:
# Download the NYC neighbourhood dataset
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json

# Load the data into a dataset
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

# All the relevant neighbourhood data is in the 'features' key
neighbourhoods_data = newyork_data['features']

# Instantiate an empty dataframe and define the columns
column_names = ['Borough', 'Neighbourhood', 'Latitude', 'Longitude']
nyc_df = pd.DataFrame(columns=column_names)

# Loop through the data and fill the dataframe one row at a time
for data in neighbourhoods_data:
    borough = data['properties']['borough']
    neighbourhood_name = data['properties']['name']
        
    neighbourhood_latlon = data['geometry']['coordinates']
    neighbourhood_lat = neighbourhood_latlon[1]
    neighbourhood_lon = neighbourhood_latlon[0]
    
    nyc_df = nyc_df.append({'Borough': borough,
                                          'Neighbourhood': neighbourhood_name,
                                          'Latitude': neighbourhood_lat,
                                          'Longitude': neighbourhood_lon}, ignore_index=True)
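
GeoJSON stores point coordinates as [longitude, latitude], which is why the loop reads index 1 for the latitude and index 0 for the longitude. As a sketch under that assumption, the same extraction can be written as a function that returns one record per feature. `features_to_records` is a hypothetical name, and the sample feature is a trimmed version of the data point shown in Section 2.1.

```python
def features_to_records(features):
    """Turn GeoJSON point features into flat records (hypothetical helper)."""
    records = []
    for feature in features:
        # GeoJSON orders point coordinates as [longitude, latitude]
        lon, lat = feature['geometry']['coordinates']
        records.append({'Borough': feature['properties']['borough'],
                        'Neighbourhood': feature['properties']['name'],
                        'Latitude': lat,
                        'Longitude': lon})
    return records


sample_features = [{
    'type': 'Feature',
    'geometry': {'type': 'Point',
                 'coordinates': [-73.84720052054902, 40.89470517661]},
    'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
}]
records = features_to_records(sample_features)
print(records[0]['Neighbourhood'], records[0]['Latitude'], records[0]['Longitude'])
```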

The result was the following dataframe containing the boroughs, neighbourhoods, latitudes and longitudes of New York City. For brevity, only the first five rows are shown.

In [8]:
nyc_df.head()

| | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


I plotted each neighbourhood of New York City onto a map.

In [9]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="nyc_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# Create map of New York City using latitude and longitude values
map_nyc = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add markers to map
for lat, lng, label in zip(nyc_df['Latitude'], nyc_df['Longitude'], nyc_df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_nyc)

map_nyc

### 3.3 Explore the neighbourhoods of Toronto

In [10]:
# The code was removed by Watson Studio for sharing.

I then created a function to explore the various neighbourhoods of a given city.

In [11]:
# Create a function to get nearby venues for a neighbourhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
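
The function above assumes that every request succeeds and that every venue has at least one category. A more defensive way to read the same response structure is sketched below. `extract_venues` is a hypothetical helper, the payload is invented for illustration, and the `'Uncategorized'` placeholder is an arbitrary choice rather than anything returned by the API.

```python
def extract_venues(payload):
    """Pull (name, lat, lng, category) tuples out of an explore-style response.

    Hypothetical helper: returns an empty list instead of raising when the
    expected top-level keys are missing.
    """
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []

    venues = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to a placeholder when a venue has no category
        category = categories[0]['name'] if categories else 'Uncategorized'
        venues.append((venue['name'],
                       venue['location']['lat'],
                       venue['location']['lng'],
                       category))
    return venues


# Invented payload that mimics the nesting used in getNearbyVenues
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled Place',
               'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}

print(extract_venues(fake_payload))
print(extract_venues({}))
```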

I used this function to explore the venues for each neighbourhood of Toronto. The summary of the resulting dataframe is as follows:

In [12]:
# Use the getNearbyVenues function to get venues for each neighbourhood
toronto_venues = getNearbyVenues(names=toronto_df['Neighbourhood'],
                                  latitudes=toronto_df['Latitude'],
                                  longitudes=toronto_df['Longitude'])

In [13]:
# Count the number of venues and categories
toronto_catCount = len(toronto_venues['Venue Category'].unique())
toronto_venueCount = len(toronto_venues['Venue Category'])

print('There are', toronto_venueCount , 'total venues, split amongst', toronto_catCount, 'categories.')

There are 2141 total venues, split amongst 273 categories.


I then used one-hot encoding to convert the categorical venue categories into numeric columns that are easier to feed into algorithms.

In [14]:
# One Hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

I grouped the rows by neighbourhood and took the mean of each one-hot column, which gives the frequency of occurrence of each category within the neighbourhood.

In [15]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()

The first five rows of the resulting dataframe are as follows:

In [16]:
toronto_grouped.head()

| | Neighbourhood | Accessories Store | Afghan Restaurant | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | ... | Train Station | Turkish Restaurant | Vegetarian / Vegan Restaurant | Video Game Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agincourt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Alderwood, Long Branch | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Bathurst Manor, Wilson Heights, Downsview North | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Bayview Village | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Bedford Park, Lawrence Manor East | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.045455 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
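
To make the grouping step concrete: averaging the one-hot columns within a neighbourhood gives the share of that neighbourhood's venues that fall in each category. The plain-Python sketch below reproduces this with `collections.Counter` on invented data. The value 0.045455 in the last row above is consistent with, for example, 1 venue out of 22, although the underlying counts are not shown here.

```python
from collections import Counter, defaultdict

# Invented (neighbourhood, venue category) pairs for illustration
venues = [
    ('Hood A', 'Coffee Shop'),
    ('Hood A', 'Coffee Shop'),
    ('Hood A', 'Park'),
    ('Hood A', 'Pizza Place'),
    ('Hood B', 'Park'),
]

# Count the categories observed in each neighbourhood
counts = defaultdict(Counter)
for hood, category in venues:
    counts[hood][category] += 1

# Dividing by the number of venues gives the same value as the mean of the
# one-hot column for that category within the neighbourhood
frequencies = {
    hood: {cat: n / sum(cat_counts.values()) for cat, n in cat_counts.items()}
    for hood, cat_counts in counts.items()
}

print(frequencies['Hood A'])
print(frequencies['Hood B'])
```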


### 3.4 Explore the neighbourhoods of New York City

I then used the same function from Section 3.3 to explore the venues for each neighbourhood of New York City. The summary of the resulting dataframe is as follows:

In [17]:
# Use the getNearbyVenues function to get venues for each neighbourhood
nyc_venues = getNearbyVenues(names=nyc_df['Neighbourhood'],
                                  latitudes=nyc_df['Latitude'],
                                  longitudes=nyc_df['Longitude'])

In [18]:
# Count the number of venues and categories
nyc_catCount = len(nyc_venues['Venue Category'].unique())
nyc_venueCount = len(nyc_venues['Venue Category'])

print('There are', nyc_venueCount , 'total venues, split amongst', nyc_catCount, 'categories.')

There are 10070 total venues, split amongst 428 categories.


I used one-hot encoding to convert the categorical venue categories into numeric columns that are easier to feed into algorithms.

In [19]:
# One Hot encoding
nyc_onehot = pd.get_dummies(nyc_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
nyc_onehot['Neighbourhood'] = nyc_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [nyc_onehot.columns[-1]] + list(nyc_onehot.columns[:-1])
nyc_onehot = nyc_onehot[fixed_columns]

I grouped the rows by neighbourhood and took the mean of each one-hot column, which gives the frequency of occurrence of each category within the neighbourhood.

In [20]:
# Group the rows by neighbourhood, by taking the mean of the frequency of occurrence of each category
nyc_grouped = nyc_onehot.groupby('Neighbourhood').mean().reset_index()

The first five rows of the resulting dataframe are as follows:

In [21]:
nyc_grouped.head()

| | Neighbourhood | Accessories Store | Adult Boutique | Afghan Restaurant | African Restaurant | Airport Terminal | American Restaurant | Antique Shop | Arcade | Arepa Restaurant | ... | Warehouse Store | Waste Facility | Waterfront | Weight Loss Center | Whisky Bar | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allerton | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Annadale | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.181818 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Arden Heights | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Arlington | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Arrochar | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### 3.5 Normalize Toronto's and New York City's venue information

Because the Toronto and New York City grouped datasets did not contain the same set of categories, I added the missing category columns to each, filled with 0 values.

In [22]:
# Get the column headers for each city's dataset
toronto_cols = pd.DataFrame(list(toronto_grouped))
nyc_cols = pd.DataFrame(list(nyc_grouped))

# Merge the two sets of column headers without duplicates
merged_cols = pd.merge(toronto_cols, nyc_cols, how = 'outer')

# Create new dataframes for each city, with the new set of column headers, and a fill value of 0.0 when missing
toronto_merged = toronto_grouped.reindex(columns=merged_cols[0], fill_value=0.0)
nyc_merged = nyc_grouped.reindex(columns=merged_cols[0], fill_value=0.0)

Now, both Toronto's and New York City's dataframes should have the same number of columns.

In [25]:
print('The Toronto dataframe has',len(toronto_merged.columns),'columns.')

The Toronto dataframe has 459 columns.


In [26]:
print('The New York City dataframe has',len(nyc_merged.columns),'columns.')

The New York City dataframe has 459 columns.
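
The alignment step amounts to taking the union of both cities' category names and filling the categories a city lacks with 0.0. Below is a minimal sketch with plain dictionaries, using invented category values and a hypothetical `align_features` helper. It sorts the merged column names for readability, which is a choice of this sketch and not something the notebook does.

```python
def align_features(rows_a, rows_b):
    """Give two lists of {category: frequency} rows the same columns.

    Hypothetical helper: missing categories are filled with 0.0, mirroring
    reindex(..., fill_value=0.0).
    """
    # Sorted union of every category seen in either list
    columns = sorted({cat for row in rows_a + rows_b for cat in row})

    def to_vector(row):
        return [row.get(cat, 0.0) for cat in columns]

    return columns, [to_vector(r) for r in rows_a], [to_vector(r) for r in rows_b]


toronto_rows = [{'Coffee Shop': 0.5, 'Park': 0.5}]
nyc_rows = [{'Deli': 0.25, 'Park': 0.75}]

columns, toronto_vecs, nyc_vecs = align_features(toronto_rows, nyc_rows)
print(columns)
print(toronto_vecs)
print(nyc_vecs)
```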


### 3.6 Find most similar Toronto neighbourhoods for each part of New York City

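Because the target is one of many categories (i.e. many candidate neighbourhoods) rather than a single value, I chose a simple Euclidean distance comparison: each New York City neighbourhood is matched to the Toronto neighbourhood whose category-frequency vector is closest to its own.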
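
As a small worked example of the distance itself, the sketch below computes the Euclidean distance between two invented frequency vectors, first with the explicit formula and then with `math.dist`. Because each frequency vector is non-negative and sums to 1, the distance between two neighbourhoods is at most the square root of 2 (about 1.414), reached when each has a single category and the two categories differ.

```python
import math

# Invented frequency vectors over the same three categories
hood_x = [0.5, 0.0, 0.5]
hood_y = [0.0, 0.25, 0.75]

# Explicit formula: square root of the sum of squared differences
manual = math.sqrt(sum((a - b) ** 2 for a, b in zip(hood_x, hood_y)))

# math.dist (Python 3.8+) computes the same quantity
builtin = math.dist(hood_x, hood_y)

print(manual, builtin)

# Two neighbourhoods with one venue category each and nothing in common
# sit at the maximum possible distance, sqrt(2)
extreme = math.dist([1.0, 0.0], [0.0, 1.0])
print(extreme)
```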

In [52]:
# Import scipy package
from scipy.spatial import distance

In [70]:
# Create a new dataframe to hold the nearest Toronto neighbourhood for each part of New York City
closest_hoods = pd.DataFrame(columns = ['NYC Neighbourhood','Closest Toronto Neighbourhood'])

# For each neighbourhood of New York City
for n_indx in nyc_merged.index:
    
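    nearest_index = -1
    # Start from infinity so that the first Toronto neighbourhood is always accepted;
    # a starting value of 1 could miss matches, since distances here can be as large as sqrt(2)
    nearest_distance = float('inf')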
    
    # Iterate through each neighbourhood of Toronto
    for t_indx in toronto_merged.index:
    
        # Find the Euclidean distance between the Toronto neighbourhood and the New York City neighbourhood
        dst = distance.euclidean(toronto_merged.iloc[t_indx,1:], nyc_merged.iloc[n_indx,1:])
        
        if (dst < nearest_distance):
            nearest_index = t_indx
            nearest_distance = dst
    
    # Append the pair of neighbourhoods to a dataframe
    closest_hoods = closest_hoods.append({'NYC Neighbourhood': nyc_merged.iloc[n_indx]['Neighbourhood'],
                                          'Closest Toronto Neighbourhood': toronto_merged.iloc[nearest_index]['Neighbourhood']}, ignore_index=True)
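
The search can also be written so that it does not rely on an initial distance threshold. This is a sketch on invented vectors rather than the notebook's dataframes, and `closest_match` is a hypothetical helper. Using `min` with a key function always returns some candidate, whatever the distances are.

```python
import math


def closest_match(target, candidates):
    """Return (name, distance) of the candidate vector nearest to target.

    Hypothetical helper: candidates maps a name to a feature vector with the
    same length and column order as target.
    """
    name = min(candidates, key=lambda n: math.dist(target, candidates[n]))
    return name, math.dist(target, candidates[name])


# Invented, already-aligned frequency vectors
toronto_vectors = {
    'Toronto Hood 1': [0.5, 0.0, 0.5],
    'Toronto Hood 2': [0.0, 0.3, 0.7],
    'Toronto Hood 3': [1.0, 0.0, 0.0],
}
nyc_vector = [0.0, 0.25, 0.75]

best_name, best_distance = closest_match(nyc_vector, toronto_vectors)
print(best_name, round(best_distance, 4))
```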

# 4. Results

The results of the project are displayed below. Each New York City neighbourhood is listed with its most similar neighbourhood in Toronto. For brevity, only the first ten rows are shown.

In [72]:
closest_hoods

| | NYC Neighbourhood | Closest Toronto Neighbourhood |
|---|---|---|
| 0 | Allerton | St. James Town, Cabbagetown |
| 1 | Annadale | St. James Town, Cabbagetown |
| 2 | Arden Heights | Alderwood, Long Branch |
| 3 | Arlington | Stn A PO Boxes |
| 4 | Arrochar | Stn A PO Boxes |
| 5 | Arverne | Richmond, Adelaide, King |
| 6 | Astoria | Kensington Market, Chinatown, Grange Park |
| 7 | Astoria Heights | St. James Town, Cabbagetown |
| 8 | Auburndale | Stn A PO Boxes |
| 9 | Bath Beach | St. James Town, Cabbagetown |

# 5. Discussion

Based on the results, a very basic recommendation can now be offered to New York City residents moving to Toronto.

In terms of future work, other distance metrics such as the Manhattan, Minkowski and Hamming distances could be explored. Their results could then be compared to see whether they agree with the Euclidean results.
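
As a starting point for such a comparison, the metrics named above can each be written in a few lines. This is a sketch on invented vectors. The Hamming variant shown is the fraction of positions that differ, which on frequency vectors mostly reflects whether the same categories are present at all.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def hamming(x, y):
    """Fraction of positions at which the two vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


# Invented frequency vectors over four categories
x = [0.5, 0.0, 0.5, 0.0]
y = [0.0, 0.25, 0.75, 0.0]

results = {
    'manhattan': minkowski(x, y, 1),
    'euclidean': minkowski(x, y, 2),
    'minkowski_p3': minkowski(x, y, 3),
    'hamming': hamming(x, y),
}
for name, value in results.items():
    print(name, round(value, 4))
```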

Another direction for future work would be to limit the features to the venue categories that a given user cares about.
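
One way this could look is to keep only the user's chosen categories before computing distances. The sketch below uses invented data, and `restrict` is a hypothetical helper. After restriction the remaining frequencies no longer sum to 1, so whether to renormalise them is an open design choice.

```python
import math


def restrict(columns, vector, wanted):
    """Keep only the entries of vector whose column name is in wanted."""
    return [value for col, value in zip(columns, vector) if col in wanted]


columns = ['Coffee Shop', 'Deli', 'Park', 'Yoga Studio']
nyc_hood = [0.25, 0.25, 0.25, 0.25]
toronto_a = [0.25, 0.0, 0.25, 0.5]
toronto_b = [0.0, 0.25, 0.5, 0.25]

# Hypothetical user preference: only coffee shops and parks matter
wanted = {'Coffee Shop', 'Park'}

dist_a = math.dist(restrict(columns, nyc_hood, wanted),
                   restrict(columns, toronto_a, wanted))
dist_b = math.dist(restrict(columns, nyc_hood, wanted),
                   restrict(columns, toronto_b, wanted))
print(dist_a, dist_b)
```

In this invented example the two Toronto neighbourhoods are equally far from the New York City one when all four categories are used, but the restricted comparison favours the first.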

# 6. Conclusion

In this study, I analyzed the neighbourhoods of New York City to find, for each one, the most similar Toronto neighbourhood. I gathered the locations of the neighbourhoods in each city from several datasets and passed them to the Foursquare API to gather the venues around them. I then encoded the venue categories using one-hot encoding, aligned the two cities' category columns, and finally used Euclidean distance to find the most similar neighbourhoods. This project should be useful for New York City residents who are planning a move to Toronto and need a recommendation of where to look for housing.