<a href="https://colab.research.google.com/github/popovstefan/Coursera_Capstone/blob/master/Coursera_Neighborhood_Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Segmenting and Clustering Neighborhoods in Toronto

## Scraping data

A list of all Canadian postal codes starting with 'M' (all of which are located in Toronto, Ontario) is published on this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). I scrape the page with the ``beautifulsoup4`` Python library to obtain the postal codes together with their borough and neighbourhood names.

In [0]:
# Import http request and beautifulsoup scraping libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [0]:
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
HTML = urlopen(URL)
soup = BeautifulSoup(HTML, 'html.parser')
# Get the first table in the page, which is the only table we need (find_all returns two tables)
table = soup.find_all(name='table')[0]

The table is scraped as follows. I create two dictionaries: ``code_borough``, mapping each postal code to its borough name, and ``code_neighbourhood``, mapping each postal code to a list of neighbourhood names (a postal code can cover more than one neighbourhood).

Then, after parsing each table row into cells, I populate the dictionaries, ignoring rows where the borough is not assigned, i.e. the borough name contains the string 'Not assigned'.

Lastly, before the dictionaries are turned into dataframes in the next stage, I collapse each list in ``code_neighbourhood`` into a single comma-separated string.

In [0]:
import numpy as np
import pandas as pd

# The two dictionaries
code_borough = dict()
code_neighbourhood = dict()

rows = table.find_all(name='tr')
for row in rows[1:]: # skip first row with column headers
  cells = row.find_all(name='td')

  # Parse row content into cells
  postal_code = cells[0].text.strip()
  borough = cells[1].text.strip()
  neighbourhood = cells[2].text.strip()

  # Populate the dictionaries
  if 'Not assigned' in borough:
    continue # Ignoring rows with unassigned borough value
  else:
    code_borough[postal_code] = borough
    if postal_code not in code_neighbourhood:
      code_neighbourhood[postal_code] = list()
    code_neighbourhood[postal_code].append(neighbourhood)

# Join each list of neighbourhood names into a single comma-separated string
code_neighbourhood = [(key, ', '.join(value)) for key, value in code_neighbourhood.items()]
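
The grouping step can be tried on toy data without touching the network. This is a minimal sketch: the sample rows are invented for illustration (they are not scraped values) and ``group_neighbourhoods`` is a hypothetical helper name, not part of the notebook.

```python
from collections import defaultdict

def group_neighbourhoods(rows):
    """Collect neighbourhood names per postal code and join them with commas.

    rows: iterable of (postal_code, borough, neighbourhood) tuples.
    Rows whose borough contains 'Not assigned' are skipped.
    """
    grouped = defaultdict(list)
    for postal_code, borough, neighbourhood in rows:
        if 'Not assigned' in borough:
            continue
        grouped[postal_code].append(neighbourhood)
    # One (code, "a, b, c") pair per postal code, in first-seen order
    return [(code, ', '.join(names)) for code, names in grouped.items()]

# Made-up sample rows, for illustration only
sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M3A', 'North York', 'Parkwoods'),
]
pairs = group_neighbourhoods(sample_rows)
print(pairs)
```

Every resulting pair has exactly two elements, so it fits a two-column dataframe regardless of how many neighbourhoods share a postal code.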

After populating the dictionaries, I turn them into pandas dataframes with ``pd.DataFrame()``. The third line of the next cell merges the two frames on their shared ``Postal Code`` column, producing a single dataframe. Lastly, I print the dataframe's shape.

In [6]:
df1 = pd.DataFrame(data=list(code_borough.items()), columns=['Postal Code', 'Borough'])
df2 = pd.DataFrame(data=code_neighbourhood, columns=['Postal Code', 'Neighborhoods'])
df = df1.merge(df2, on='Postal Code')
df.shape

(103, 3)
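
As a sanity check on a merge like this one, pandas can be asked to verify that each key appears only once on both sides. The sketch below uses small stand-in frames with illustrative values; the ``validate`` argument is an optional extra that the notebook itself does not use.

```python
import pandas as pd

# Toy frames standing in for df1 and df2 (values are illustrative)
boroughs = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
hoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Neighborhoods': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

# validate='one_to_one' raises pandas.errors.MergeError
# if a postal code is repeated in either frame
merged = boroughs.merge(hoods, on='Postal Code', validate='one_to_one')
print(merged.shape)
```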

## Getting the geographical coordinates for each neighborhood

This step is straightforward. I used ``pd.read_csv(URL)`` to download the .csv file specified in the assignment, which contains the geographical coordinates, and then merged it with the previous dataframe on the ``Postal Code`` column.

In [13]:
URL = 'https://cocl.us/Geospatial_data'
df_geo = pd.read_csv(URL)
df = df.merge(df_geo, on='Postal Code')
df

| | Postal Code | Borough | Neighborhoods | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |


## Data analysis

In this part I will replicate the analysis that was performed in the Lab on the New York data set.

In [0]:
import folium # map rendering library
import requests # library to handle requests

Let us first visualize the city with the neighborhoods marked.

In [16]:
# create map of Toronto using latitude and longitude values
toronto_latitude, toronto_longitude = 43.6532, -79.3832
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhoods']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [0]:
# Defining FourSquare developer details
CLIENT_ID = 'XX'
CLIENT_SECRET = 'YY'
VERSION = '20180605'

In [0]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
  """Function that iterates over the list of neighborhood names, latitudes and longitudes and finds the
    top 100 venues in a radius of 500m (by default).
    Arguments:
      names - list of neighborhood names
      latitudes - list of neighborhood latitudes
      longitudes - list of neighborhood longitudes
      radius - number of meters around the neighborhood location in which the venues will be looked up
      limit - limits the number of venues returned by the API call
      
    Returns: dataframe containing information about each venue (and the neighborhood it's in) found: latitude, longitude, category, name.
    
    Remarks: names, latitudes, longitudes all have the same length."""
  venues_list=[]
  for name, lat, lng in zip(names, latitudes, longitudes):
      #print(name)
      # create the API request URL
      url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
          CLIENT_ID, 
          CLIENT_SECRET, 
          VERSION, 
          lat, 
          lng, 
          radius, 
          limit)
          
      # make the GET request
      results = requests.get(url).json()["response"]['groups'][0]['items']
      
      # return only relevant information for each nearby venue
      venues_list.append([(
          name, 
          lat, 
          lng, 
          v['venue']['name'], 
          v['venue']['location']['lat'], 
          v['venue']['location']['lng'],  
          v['venue']['categories'][0]['name']) for v in results])

  nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
  nearby_venues.columns = ['Neighborhood', 
                'Neighborhood Latitude', 
                'Neighborhood Longitude', 
                'Venue', 
                'Venue Latitude', 
                'Venue Longitude', 
                'Venue Category']
  
  return(nearby_venues)
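
The one-line indexing in ``getNearbyVenues`` assumes that every response contains at least one group and that every venue has at least one category. A more defensive variant of the parsing step could look like the sketch below. ``extract_venue_rows`` is a hypothetical helper, and the payload is hand-written in the shape the function above indexes into; it is not real API output.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into tuples of
    (neighborhood, lat, lng, venue, venue lat, venue lng, category)."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written sample payload, for illustration only
fake_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Example Cafe',
                   'location': {'lat': 43.75, 'lng': -79.33},
                   'categories': [{'name': 'Coffee Shop'}]}},
        {'venue': {'name': 'Unlabelled Spot',
                   'location': {'lat': 43.76, 'lng': -79.32},
                   'categories': []}},
    ]}]}
}
rows = extract_venue_rows(fake_payload, 'Parkwoods', 43.753259, -79.329656)
print(rows)
```

With this shape, a neighborhood that yields no venues simply contributes an empty list rather than stopping the loop with an exception.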

In [35]:
toronto_venues = getNearbyVenues(names=df['Neighborhoods'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )
print ('Number of venues {}\n'.format(toronto_venues.shape))
print ('Unique venue categories {}'.format(len(toronto_venues['Venue Category'].unique())))

Number of venues (2119, 7)

Unique venue categories 264


Once we have the venues, let us analyze each neighborhood separately. Let's start by one-hot encoding the venue categories: each category gets its own column, and each venue row holds a 1 in the column of its category and 0 everywhere else.

In [42]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

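# move neighborhood column to the first column, selecting it by name:
# taking the last column only works if the assignment above appended a new column,
# which is not the case when a venue category is itself named 'Neighborhood'
fixed_columns = ['Neighborhood'] + [col for col in toronto_onehot.columns if col != 'Neighborhood']
toronto_onehot = toronto_onehot[fixed_columns]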

toronto_onehot.tail()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Garage,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bike Shop,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,...,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Snack Place,Soccer Field,Soup Place,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Strip Club,Supermarket,Supplement Shop,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
2114,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2115,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2116,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2117,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2118,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
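
To see what the encoding produces on a small scale, here is a sketch with three invented venues. The frame and its values are assumptions made for illustration. ``insert`` is used instead of plain assignment because it refuses to overwrite an existing column of the same name.

```python
import pandas as pd

# Three made-up venues in two neighborhoods
venues = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Parkwoods', 'Victoria Village'],
    'Venue Category': ['Park', 'Coffee Shop', 'Coffee Shop'],
})

# One indicator column per category; dtype=int gives 0/1 rather than booleans
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# Put the neighborhood name in front; insert() raises ValueError
# if a category column with that name already exists
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
print(onehot)
```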


Next, I will group the rows by neighborhood name and take the mean of each one-hot column, which gives the frequency of each category per neighborhood. Then I will list each neighborhood with its top 5 venue categories.

In [0]:
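def return_most_common_venues(row, num_top_venues):
  """Returns the names of the num_top_venues most frequent venue categories in a row.
  The first element of the row (the neighborhood name) is skipped."""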
  row_categories = row.iloc[1:]
  row_categories_sorted = row_categories.sort_values(ascending=False)
  return row_categories_sorted.index.values[0:num_top_venues]

In [64]:
# Group dataframe
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

num_top_venues = 5
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no 'st'/'nd'/'rd' suffix past the third position
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|
| 0 | Agincourt | Latin American Restaurant | Clothing Store | Lounge | Breakfast Spot | Skating Rink |
| 1 | Alderwood, Long Branch | Pizza Place | Gym | Athletics & Sports | Pharmacy | Pub |
| 2 | Bathurst Manor, Wilson Heights, Downsview North | Coffee Shop | Bank | Fried Chicken Joint | Shopping Mall | Diner |
| 3 | Bayview Village | Chinese Restaurant | Café | Japanese Restaurant | Bank | Distribution Center |
| 4 | Bedford Park, Lawrence Manor East | Italian Restaurant | Coffee Shop | Juice Bar | Sandwich Place | Sushi Restaurant |
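
The group-then-rank idea can also be expressed compactly with ``Series.nlargest``. The sketch below is an alternative formulation on invented data, not the notebook's own method. The neighborhood names 'A' and 'B' and all values are placeholders.

```python
import pandas as pd

# Four made-up venue rows, already one-hot encoded
onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Coffee Shop': [1, 1, 0, 0],
    'Park': [0, 0, 1, 0],
    'Pub': [0, 0, 0, 1],
})

# Mean of the 0/1 indicators per neighborhood = share of its venues in each category
freq = onehot.groupby('Neighborhood').mean()

# For each neighborhood, list the category names with the highest shares
top2 = {hood: row.nlargest(2).index.tolist() for hood, row in freq.iterrows()}
print(top2)
```

Note that categories with a share of zero can still appear in the ranking when a neighborhood has fewer distinct categories than the number requested, which is also true of the sorting approach used above.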


As a last step in this analysis, I will cluster the neighborhoods using the K-means algorithm.

In [0]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [67]:

# set number of clusters
kclusters = 5

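# drop the name column by keyword; the positional axis argument is not accepted by recent pandas versions
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')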

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# join the cluster labels and top venues onto the dataframe that holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhoods')

toronto_merged.head() # check the last columns!

| | Postal Code | Borough | Neighborhoods | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | 0.0 | Park | Food & Drink Shop | Women's Store | Discount Store | Deli / Bodega |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | 1.0 | Coffee Shop | Portuguese Restaurant | Hockey Arena | Pizza Place | French Restaurant |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | 1.0 | Coffee Shop | Park | Bakery | Café | Pub |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 | 1.0 | Furniture / Home Store | Clothing Store | Women's Store | Shoe Store | Gift Shop |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 1.0 | Coffee Shop | Sushi Restaurant | Gym | Distribution Center | Mexican Restaurant |
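
For intuition about what K-means does with frequency rows like these, here is a toy run on four invented "neighborhoods" with three category frequencies each. The numbers are made up, and ``n_init`` is set explicitly only because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows are neighborhoods, columns are category frequencies (invented numbers)
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

The first two rows share one label and the last two share the other, because rows with similar category profiles lie close together in feature space. Which group receives label 0 is arbitrary.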


Finally, let us visualize the map.

In [0]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

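Some of the rows contain NaN values. A likely cause is that the venue query returned no results for those neighborhoods, so they are missing from ``neighborhoods_venues_sorted`` and the join leaves their cluster and venue columns empty. Since these rows cannot be plotted with a cluster color, they are removed from the final result.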

In [0]:
toronto_merged = toronto_merged.dropna()
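
One plausible source of the NaN rows, offered here as an assumption rather than a confirmed diagnosis, is the left join itself: a neighborhood that has no entry in the venue table gets NaN in every joined column. The sketch below reproduces that effect with invented names and labels.

```python
import pandas as pd

# All neighborhoods, including one for which no venues were found (made-up names)
all_hoods = pd.DataFrame({'Neighborhoods': ['Parkwoods', 'Victoria Village', 'No Venues Here']})

# Cluster labels exist only for neighborhoods that had venues
clustered = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
    'Cluster Labels': [0, 1],
})

# DataFrame.join is a left join by default, so unmatched rows are kept with NaN
joined = all_hoods.join(clustered.set_index('Neighborhood'), on='Neighborhoods')
print(joined)

# List the neighborhoods that received no cluster label
missing = joined[joined['Cluster Labels'].isna()]
print(missing['Neighborhoods'].tolist())
```

The NaN also forces the integer labels to become floats, which would explain why the cluster labels are displayed as 0.0, 1.0 and so on.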


In [82]:
# create map
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhoods'], toronto_merged['Cluster Labels']):
    if pd.isna(cluster):
        continue  # NaN never compares equal to anything, so pd.isna() is needed to skip unlabeled rows
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
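
The color assignment can be isolated from the map code. The sketch below builds one hex color per cluster from the same ``rainbow`` colormap. ``color_for`` is a hypothetical helper, and indexing the palette directly by the label (rather than by ``cluster-1`` as in the cell above) is a choice made for this sketch.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Evenly spaced points on the rainbow colormap, one per cluster label
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

def color_for(cluster):
    """Map a (possibly float) cluster label in 0..kclusters-1 to a hex color."""
    return palette[int(cluster)]

print(palette)
print(color_for(2.0))
```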

Each cluster can be examined individually below.

Cluster 0

In [0]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 1

In [0]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Clusters 2, 3 and 4 (one cell each below) contain only one neighborhood each.

In [89]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

| | Borough | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|---|
| 6 | Scarborough | 2.0 | Fast Food Restaurant | Women's Store | Dance Studio | Electronics Store | Eastern European Restaurant |


In [90]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

| | Borough | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|---|
| 62 | Central Toronto | 3.0 | Garden | Women's Store | Discount Store | Deli / Bodega | Department Store |


In [91]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

| | Borough | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|---|
| 32 | Scarborough | 4.0 | Playground | Distribution Center | Deli / Bodega | Department Store | Dessert Shop |
