**Introduction/Business Problem**

When people travel to or move to a new place, they often want to stay in areas similar to the ones they are used to. New York City has many neighborhoods. When residents of New York City visit Toronto, knowing which Toronto neighborhoods resemble which New York neighborhoods would help them understand the new city quickly. Conversely, residents of Toronto visiting New York City for the first time would like to know which parts of New York are similar to parts of Toronto. This project therefore addresses the problem of measuring similarity between the neighborhoods of the two cities. We do this by exploring and analyzing the venues near New York and Toronto neighborhoods and clustering together neighborhoods with similar venues.

**Data Description**

**Neighborhood Data**

The neighborhood data for New York and Toronto have the following attributes: borough, neighborhood name, latitude of neighborhood, and longitude of neighborhood. The New York neighborhood data is publicly available from NYU and can be obtained from https://geo.nyu.edu/catalog/nyu_2451_34572. The Toronto neighborhood data is obtained from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. There will be five columns in the final neighborhood data frame: Borough, Neighborhood Name, Latitude, Longitude, Region. Borough tells us the district to which a neighborhood belongs. Neighborhood Name is the name of the neighborhood. Latitude and Longitude are the coordinates. Region tells us whether the neighborhood is in New York City or Toronto.

**Venue Data**

Venue data for the neighborhoods is obtained using the Foursquare API. By making regular calls through the API, we can obtain the name, latitude, longitude, and category of each venue within a certain radius of a given neighborhood.

**Methodology**

First, we will load the neighborhood data. The New York neighborhood data is downloaded from https://geo.nyu.edu/catalog/nyu_2451_34572 and handled using the json package. The first five rows of the New York neighborhood data look like the following:

In [19]:
# Import packages:
import json
import pandas as pd
import numpy as np

# Open the json file:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)


# Collect one record per neighborhood, then build the dataframe in one step.
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list.)
ny_neighborhoods_data = newyork_data['features']
rows = []
for data in ny_neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are stored as [longitude, latitude].
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood Name': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# Define the dataframe columns and instantiate the dataframe:
column_names = ['Borough', 'Neighborhood Name', 'Latitude', 'Longitude']
ny_nbhds = pd.DataFrame(rows, columns=column_names)
ny_nbhds['Region'] = 'NYC'
ny_nbhds.head()

Unnamed: 0,Borough,Neighborhood Name,Latitude,Longitude,Region
0,Bronx,Wakefield,40.894705,-73.847201,NYC
1,Bronx,Co-op City,40.874294,-73.829939,NYC
2,Bronx,Eastchester,40.887556,-73.827806,NYC
3,Bronx,Fieldston,40.895437,-73.905643,NYC
4,Bronx,Riverdale,40.890834,-73.912585,NYC


After we get the New York data, we use the BeautifulSoup package to scrape the Toronto neighborhood data from the Wikipedia page at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. For the purposes of the project, we only keep the rows whose borough has a valid name (that is, the borough is not "Not assigned"). In addition, if a neighborhood of a borough does not have a name, we use the borough name as the neighborhood name. After the neighborhood names and boroughs are handled, we use the table called "Geospatial_Coordinates.csv" to append the latitude and longitude information. The table looks like the following:

<img src="files/geo.png">

Note that the coordinates in this table are given per postal code, not per neighborhood. As a result, neighborhoods that share a postal code are merged into one record, as they have the same latitude and longitude. The first five rows of the Toronto neighborhoods look like the following:

In [20]:
# Import packages:
from bs4 import BeautifulSoup
import requests

# Scrape raw text from the link:
link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
txt = requests.get(link).text
raw = BeautifulSoup(txt, 'lxml')
content = raw.find('div', class_='mw-parser-output')
table = content.table.tbody

# Find relevant information and construct table:
trs = table.find_all('tr')
tds = []
for tr in trs:
    tds.append(tr.find_all('td'))
postcodes = []
boroughs = []
neighborhoods = []
for td in tds[1:]:
    postcodes.append(td[0].text)
    boroughs.append(td[1].text)
    neighborhoods.append(td[2].text.strip('\n'))
df = pd.DataFrame({'PostalCode': postcodes, 'Borough': boroughs, 'Neighborhood': neighborhoods})
df_valid = df[(df['Borough'] != 'Not assigned')].reset_index(drop = True)
df_valid['Neighborhood'] = np.where(df_valid['Neighborhood'] == 'Not assigned', df_valid['Borough'], \
                                    df_valid['Neighborhood'])
df_grouped = df_valid.groupby(by = ['PostalCode', 'Borough'])['Neighborhood'].apply(lambda x: ','.join(x))\
                     .reset_index()

# Append coordinates information:
geo = pd.read_csv('Geospatial_Coordinates.csv').rename(columns = {'Postal Code': 'PostalCode'})
to_nbhds = df_grouped.merge(geo, how = 'left', on = 'PostalCode').drop(columns = 'PostalCode')\
                     .rename(columns = {'Neighborhood': 'Neighborhood Name'})
to_nbhds['Region'] = 'Toronto'
to_nbhds.head()

Unnamed: 0,Borough,Neighborhood Name,Latitude,Longitude,Region
0,Scarborough,"Rouge,Malvern",43.806686,-79.194353,Toronto
1,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,Toronto
2,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Toronto
3,Scarborough,Woburn,43.770992,-79.216917,Toronto
4,Scarborough,Cedarbrae,43.773136,-79.239476,Toronto
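
The cleaning rules described above (drop unassigned boroughs, fall back to the borough name, merge neighborhoods that share a postal code) can be tried on a few hand-written rows without scraping anything. The sketch below is only an illustration: the four sample rows are assumptions chosen to exercise each rule, not data taken from the live Wikipedia page.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the scraped table (illustrative values only).
df = pd.DataFrame({
    "PostalCode": ["M1A", "M1B", "M1B", "M7A"],
    "Borough": ["Not assigned", "Scarborough", "Scarborough", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Rouge", "Malvern", "Not assigned"],
})

# 1. Drop rows that have no borough.
df_valid = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

# 2. Fall back to the borough name when the neighborhood is missing.
df_valid["Neighborhood"] = np.where(
    df_valid["Neighborhood"] == "Not assigned",
    df_valid["Borough"],
    df_valid["Neighborhood"],
)

# 3. Merge neighborhoods that share a postal code into one comma-separated record.
df_grouped = (
    df_valid.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(",".join)
    .reset_index()
)
print(df_grouped)
```

In this toy input, the unassigned row disappears, the two Scarborough rows collapse into a single record, and the row without a neighborhood name inherits its borough name.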


Now that we have neighborhood data for both New York and Toronto, we use the Foursquare API to explore the venues near the neighborhoods. First, we merge the New York neighborhoods data frame and the Toronto neighborhoods data frame for the venue exploration. Then, we define the Foursquare API parameters. With the merged data frame and the API parameters in place, we define a function that fetches the venue name, venue latitude, venue longitude, and venue category.

In [39]:
# Merge New York and Toronto neighborhoods data frame:
ny_to_nbhds = pd.concat([ny_nbhds, to_nbhds])

In [40]:
# Define Foursquare Credentials and Version

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (never publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' 
LIMIT = 100
radius = 500

In [41]:
# Define function to get venues:

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood Name', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
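
The function above indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` if a venue came back with an empty category list. Whether the API ever returns such venues is an assumption here, not something the data above confirms. The sketch below shows one way the flattening step could tolerate that case; `extract_venue_rows` and the sample items are hypothetical and only mirror the response shape used in `getNearbyVenues`.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style 'items' into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when the venue carries no category instead of failing.
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hypothetical response items (made-up venues, same nesting as the real response).
sample_items = [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 40.0, "lng": -73.0},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 40.1, "lng": -73.1},
               "categories": []}},
]

rows = extract_venue_rows("Wakefield", 40.894705, -73.847201, sample_items)
for row in rows:
    print(row)
```

Rows with a `None` category could then be dropped or labeled before the one-hot encoding step, depending on how they should count toward a neighborhood's profile.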

We then use the function to get up to 100 venues within a 500-meter radius of each neighborhood of New York City and Toronto. The first five rows of the venue data look like the following:
<img src="files/ny_venues.png">

In [43]:
# Get venues within a 500-meter radius of each New York and Toronto neighborhood.

ny_to_venues = getNearbyVenues(names = ny_to_nbhds['Neighborhood Name'],
                               latitudes = ny_to_nbhds['Latitude'],
                               longitudes = ny_to_nbhds['Longitude'])

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

With the venue data available, we use K-Means clustering to group the neighborhoods based on the venues around them. To prepare the data, we first create dummy (one-hot) variables for each venue category. Each neighborhood is then represented by the mean of these dummy variables, that is, the share of its venues that fall into each category. A K-Means clustering model with k = 10 is fit on the data.

In [71]:
# Get dummies for each venue category:
ny_to_dummy = pd.get_dummies(ny_to_venues[['Venue Category']], prefix = '', prefix_sep = '')

# Add neighborhood name and move to first column:
ny_to_dummy.insert(0,'Neighborhood Name', ny_to_venues['Neighborhood Name'])

# Group venues for same neighborhoods and return the mean:
ny_to_mean = ny_to_dummy.groupby('Neighborhood Name').mean().reset_index()

# Run k-means clustering with k = 10:
from sklearn.cluster import KMeans
k_mean_clustering = ny_to_mean.drop(columns = ['Neighborhood Name'])
kmeans = KMeans(n_clusters = 10, random_state = 666).fit(k_mean_clustering)

# Add neighborhood and latitude, longitude information
ny_to_mean.insert(0, 'Cluster Labels', kmeans.labels_)
ny_to_cluster = ny_to_mean.merge(ny_to_nbhds, how = 'left', on = ['Neighborhood Name'])
ny_to_cluster.head(10)

Unnamed: 0,Cluster Labels,Neighborhood Name,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,...,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Borough,Latitude,Longitude,Region
0,0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.01,0.0,0.0,0.01,0.0,Downtown Toronto,43.650571,-79.384568,Toronto
1,0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,Scarborough,43.7942,-79.262029,Toronto
2,8,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,Scarborough,43.815252,-79.284577,Toronto
3,5,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,Etobicoke,43.739416,-79.588437,Toronto
4,5,"Alderwood,Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,Etobicoke,43.602414,-79.543484,Toronto
5,5,Allerton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,Bronx,40.865788,-73.859319,NYC
6,5,Annadale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,Staten Island,40.538114,-74.178549,NYC
7,1,Arden Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,Staten Island,40.549286,-74.185887,NYC
8,1,Arlington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,Staten Island,40.635325,-74.165104,NYC
9,1,Arrochar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,Staten Island,40.596313,-74.067124,NYC
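
To make the feature construction concrete, the following minimal sketch applies the same one-hot encoding and per-neighborhood averaging to a tiny made-up venue table. The neighborhoods "A" and "B" and their venues are assumptions for illustration only; `dtype=float` is passed so the averaging does not depend on how a given pandas version types the dummy columns.

```python
import pandas as pd

# Hypothetical venue table: neighborhood "A" has four venues, "B" has two.
venues = pd.DataFrame({
    "Neighborhood Name": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Pizza Place", "Pizza Place", "Park", "Bank", "Park", "Park"],
})

# One column per venue category, 1.0 where the venue belongs to that category.
dummies = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
dummies.insert(0, "Neighborhood Name", venues["Neighborhood Name"])

# The mean of the dummies is the share of a neighborhood's venues in each category.
freq = dummies.groupby("Neighborhood Name").mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1 across the category columns, so neighborhoods are compared by the mix of venues rather than by how many venues they have. This is the kind of matrix that is passed to `KMeans` above.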


**Results**

In [76]:
k = 10
clusters = {}
sizes = {}
for i in range(k):
    clusters[i] = ny_to_cluster[ny_to_cluster['Cluster Labels'] == i]
    sizes[i] = len(clusters[i])
sizes

{0: 178, 1: 15, 2: 3, 3: 1, 4: 1, 5: 42, 6: 143, 7: 3, 8: 15, 9: 2}

Using k-means clustering, we grouped the neighborhoods into 10 clusters based on the venues around each neighborhood. First, let's look at the sizes of the clusters. The sizes of the ten clusters are: 178, 15, 3, 1, 1, 42, 143, 3, 15, 2. From the sizes, we can see that most of the neighborhoods of the two cities fall into three clusters, with sizes of 178, 143 and 42. Let's take a look at which neighborhoods are clustered together.
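
Before doing so, the claim about the three largest clusters can be checked with a little arithmetic on the sizes printed above. This is just a sketch that re-enters the `sizes` dictionary by hand; the variable names are illustrative.

```python
# Cluster sizes copied from the output above.
sizes = {0: 178, 1: 15, 2: 3, 3: 1, 4: 1, 5: 42, 6: 143, 7: 3, 8: 15, 9: 2}

total = sum(sizes.values())

# Labels of the three largest clusters, largest first.
top3_labels = sorted(sizes, key=sizes.get, reverse=True)[:3]

# Fraction of all clustered records that fall into those three clusters.
top3_share = sum(sizes[label] for label in top3_labels) / total

print(top3_labels, total, round(top3_share, 3))
```

The three largest clusters are labels 0, 6 and 5, and together they hold 363 of the 403 records, roughly 90%.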

In [108]:
top3_dfs = {}
for i in [0, 5, 6]:
    top3_dfs[i] = clusters[i][['Neighborhood Name', 'Borough', 'Region']]
top3_dfs[6].reset_index(drop = True)

Unnamed: 0,Neighborhood Name,Borough,Region
0,Astoria Heights,Queens,NYC
1,Bath Beach,Brooklyn,NYC
2,"Bathurst Manor,Downsview North,Wilson Heights",North York,Toronto
3,Baychester,Bronx,NYC
4,Bedford Park,Bronx,NYC
5,Beechhurst,Queens,NYC
6,Belle Harbor,Queens,NYC
7,Bellerose,Queens,NYC
8,Belmont,Bronx,NYC
9,Bensonhurst,Brooklyn,NYC


Then, let's visualize the three largest clusters on the map.

In [102]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

top3 = pd.concat([clusters[0], clusters[5], clusters[6]])
top3['Cluster Labels'] = np.where(top3['Cluster Labels'] == 0, 1, np.where(top3['Cluster Labels'] == 5, 2, 3))
latitude = 42.0987
longitude =  -75.9180
map_clusters = folium.Map(location=[latitude, longitude], zoom_start = 6)


# set color scheme for the clusters
kclusters = 3
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(top3['Latitude'], top3['Longitude'], \
                                  top3['Neighborhood Name'], top3['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=0.05,
        popup=label,
        color=rainbow[cluster - 1],
        fill=True,
        fill_color=rainbow[cluster - 1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Finally, we define a function that lets users find the boroughs in a given region (NYC or Toronto) that are similar to a borough of interest. The function takes three parameters: input_borough, input_region, and output_region. It finds the most common cluster label among the input borough's neighborhoods and returns the list of boroughs in the output region that have at least one neighborhood in that cluster. For example, to find the boroughs in the Toronto region that are similar to the Bronx, the input would be "Bronx", "NYC", "Toronto". Based on the data frame constructed above, the function returns the following list of boroughs: North York, Scarborough, West Toronto, Etobicoke, Downtown Toronto, East York.

In [186]:
# Define function to find similar boroughs:

from collections import Counter

def find_similar_borough(input_borough, input_region, output_region):
    # Copy the slice so that adding the helper column below does not
    # trigger pandas' SettingWithCopyWarning.
    df_clusters = ny_to_cluster[['Borough', 'Region', 'Cluster Labels']].copy()
    df_clusters['B+R'] = df_clusters['Borough'] + df_clusters['Region']
    
    if input_borough + input_region in list(df_clusters['B+R']):
        # Most common cluster label among the input borough's neighborhoods:
        label_list = list(df_clusters[df_clusters['Borough'] == input_borough]['Cluster Labels'])
        cluster_label = Counter(label_list).most_common(1)[0][0]
        df = df_clusters[df_clusters['Cluster Labels'] == cluster_label]
        
        if output_region in list(df['Region']):
            df_output = df[df['Region'] == output_region]
            output = list(df_output[df_output['Borough'] != input_borough]['Borough'].unique())
            
            if len(output) == 0:
                final_output = 'no similar boroughs in the region'
            else:
                final_output = output
        else:
            final_output = 'no similar boroughs in the region'
        
    else:
        final_output = 'no such borough in the data collected'
        
    return final_output


In [188]:
find_similar_borough('Bronx', 'NYC', 'Toronto')


['North York',
 'Scarborough',
 'West Toronto',
 'Etobicoke',
 'Downtown Toronto',
 'East York']
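
The lookup logic can be illustrated on a small hand-made table. The sketch below is a simplified, hypothetical version of the same idea (dominant cluster of the input borough, then boroughs of the other region in that cluster); the cluster labels in `toy` are invented for the example and do not come from the fitted model.

```python
from collections import Counter

import pandas as pd

# Hypothetical neighborhood-level table: one row per neighborhood.
toy = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Bronx", "Queens",
                "Scarborough", "Etobicoke", "North York"],
    "Region": ["NYC", "NYC", "NYC", "NYC",
               "Toronto", "Toronto", "Toronto"],
    "Cluster Labels": [5, 5, 6, 6, 5, 5, 0],
})

# Cluster labels of the Bronx neighborhoods, and the most frequent one.
bronx_labels = toy.loc[
    (toy["Borough"] == "Bronx") & (toy["Region"] == "NYC"), "Cluster Labels"
]
dominant = Counter(bronx_labels).most_common(1)[0][0]

# Toronto boroughs with at least one neighborhood in that dominant cluster.
in_cluster = (toy["Cluster Labels"] == dominant) & (toy["Region"] == "Toronto")
matches = toy.loc[in_cluster, "Borough"].unique().tolist()

print(dominant, matches)
```

Because a borough matches as soon as a single one of its neighborhoods shares the dominant cluster, the returned list can be broad when that cluster is large, which is consistent with the six Toronto boroughs returned for the Bronx above.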