<h2> Segmenting and Clustering Neighbourhoods in Toronto </h2>

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the postal codes, boroughs and neighbourhoods used in the rest of the analysis.

<h2> Solutions to Question 1 </h2>

<b>Importing the needed libraries</b>

In [1]:
import requests
import lxml.html as lh
import pandas as pd


<b> Retrieve the web page and load its table into the notebook</b>

In [2]:
canadapost_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' # URL of the wiki page

page = requests.get(canadapost_url) # fetch the contents of the wiki page

doc = lh.fromstring(page.content) # parse the page content into an element tree

tr_elements = doc.xpath('//tr') # select every table row (<tr>) in the HTML

[len(T) for T in tr_elements[:12]] # check the number of cells in each of the first 12 rows

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

<b> Check the table headers </b>

In [3]:
tr_elements = doc.xpath('//tr') # select all table rows; the first one is the header

col = [] # create empty list
i = 0

for t in tr_elements[0]: # for each cell of the header row, store its text together with an empty list
    i+=1
    name=t.text_content()
    print("%d:%s" % (i,name))
    col.append((name,[]))

1:Postcode
2:Borough
3:Neighbourhood



<b> Collect the data from the other rows </b>

In [4]:
for j in range(1,len(tr_elements)): # the header is the first row, so the data is stored in the subsequent rows
    T = tr_elements[j] # T is the j-th row
    
    if len(T)!=3: # if a row does not have 3 cells, it is not part of the table
        break
        
    i = 0 #i is the index of the first column
    
    for t in T.iterchildren(): #iterate through each element of the row
        data=t.text_content()
            
        col[i][1].append(data) #append the data to the empty list of the i'th column
            
        i+=1 #increment i for the next column

<b>Check the number of entries in each column</b>

In [5]:
[len(C) for (title,C) in col]

[287, 287, 287]
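The header-then-rows pattern used above can be tried offline on a tiny table. The sketch below is only an illustration: it swaps lxml for the standard library's `html.parser` so that it has no dependencies, and the `TableRows` class and `SAMPLE_HTML` string are hypothetical names and data, not part of the notebook.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample in the same shape as the Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)

# First row is the header; every later row contributes one value per column.
header, *body = parser.rows
columns = {name: [row[i] for row in body] for i, name in enumerate(header)}
print(columns)
```

The resulting `columns` dictionary has the same title-to-list layout as the `col` list built in the notebook, so it could be passed to `pd.DataFrame` in the same way.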

<b> Build the dataframe with three columns </b>

In [6]:
Dict = {title:column for (title,column) in col}
df = pd.DataFrame(Dict)

In [7]:
df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
5,M6A,North York,Lawrence Heights\n
6,M6A,North York,Lawrence Manor\n
7,M7A,Downtown Toronto,Queen's Park\n
8,M8A,Not assigned,Not assigned\n
9,M9A,Queen's Park,Not assigned\n


<b> Check the shape</b>

In [8]:
df.shape

(287, 3)

<b> Remove the newline characters and rename the columns</b>

In [9]:

df = df.replace('\n','', regex=True) # remove the trailing \n from the cell values
df.rename(columns = {'Postcode':'PostalCode', 'Neighbourhood\n':'Neighbourhood'}, inplace = True) # rename Postcode to PostalCode and drop the \n from the Neighbourhood column name

In [10]:
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Not assigned


<b> Clean the dataframe</b>

In [14]:
df = df[df.Borough != 'Not assigned'].copy() # drop rows whose borough is not assigned; copy so later edits do not act on a slice

df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df['Borough'] # where the neighbourhood is not assigned, use the borough name instead

In [15]:
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


<b>Grouping the neighbourhoods with the same postal code </b>

In [16]:
df['Neighbourhood'] = df.groupby('PostalCode')['Neighbourhood'].transform(lambda neigh: ', '.join(neigh)) # join the neighbourhoods that share a postal code into one string

df = df.drop_duplicates() # rows with the same postal code are now identical, so drop the duplicates

if (df.index.name != 'PostalCode'): # set the postal code as the index before resetting it
    df = df.set_index('PostalCode')
    
df.reset_index(inplace=True) # reset_index moves PostalCode back to a column and renumbers the rows

<b> Print the final version of the dataframe</b>

In [17]:
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
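The transform-then-drop-duplicates step above can also be written as a single aggregation. The sketch below shows that variant on a hypothetical three-row frame (`toy` is made-up data in the same shape as the cleaned table); it is an alternative under those assumptions, not the notebook's own method.

```python
import pandas as pd

# Hypothetical rows in the same shape as the cleaned table.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor"],
})

# One row per (PostalCode, Borough); neighbourhood names joined with ", ".
grouped = (
    toy.groupby(["PostalCode", "Borough"], as_index=False, sort=False)["Neighbourhood"]
       .agg(", ".join)
)
print(grouped)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result, and `as_index=False` returns a frame with a fresh row index, so no separate index reset is needed.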


<b> Print the number of rows.</b>

In [23]:
df.shape # print the number of rows and columns

(103, 5)

<h2> Solutions to Question 2 </h2>

In [24]:
!pip -q install geopy

from geopy.geocoders import Nominatim # library to convert an address to latitude and longitude

!pip -q install geocoder
import geocoder

<b> Get the latitude and longitude for each row of the dataframe. </b>

In [25]:
def get_latlng(postal_code): # look up the coordinates of one postal code
    
    lat_lng_coords = None # initialise the coordinates to None
    
    while(lat_lng_coords is None): # keep querying the ArcGIS geocoder until it returns a result
        g = geocoder.arcgis('{}, Toronto, Canada'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords
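The `while` loop in `get_latlng` retries forever if the geocoder never returns a result. A bounded variant might look like the sketch below. Everything in it is an assumption for illustration: `latlng_with_retries`, `max_attempts`, `pause_seconds` and the stub `flaky_lookup` are hypothetical names, and the lookup is passed in as a plain callable so the example runs without network access.

```python
import time


def latlng_with_retries(lookup, query, max_attempts=5, pause_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable returning [lat, lng] or None. In the notebook it
    could plausibly be: lambda q: geocoder.arcgis(q).latlng (assumed usage).
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)  # optional pause between attempts
    return None  # the caller decides what to do with an unresolved query


# Stub that fails twice and then succeeds, to exercise the loop offline.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.65, -79.36]


result = latlng_with_retries(flaky_lookup, "M5A, Toronto, Canada")
never = latlng_with_retries(lambda q: None, "M0X, Toronto, Canada", max_attempts=2)
print(result, never)
```

Returning `None` after the last attempt leaves a visible gap in the coordinates instead of hanging the notebook, at the cost of having to handle missing values afterwards.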

<b> Get the latitude and longitude based on PostalCode </b>

In [26]:
postal_codes = df['PostalCode']
coordinates = [get_latlng(code) for code in postal_codes.tolist()]

<b>Add the Latitude and Longitude columns to the dataframe and print the first 12 rows. </b>

In [27]:
df_loc = df

df_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])

df_loc['Latitude'] = df_coordinates['Latitude']

df_loc['Longitude'] = df_coordinates['Longitude']

df_loc.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75242,-79.329242
1,M4A,North York,Victoria Village,43.7306,-79.313265
2,M5A,Downtown Toronto,Harbourfront,43.650295,-79.359166
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.72327,-79.451286
4,M7A,Downtown Toronto,Queen's Park,43.66115,-79.391715
5,M9A,Queen's Park,Queen's Park,43.662299,-79.528195
6,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517
7,M3B,North York,Don Mills North,43.749055,-79.362227
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.707535,-79.311773
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657363,-79.37818


<h2> Solutions to Question 3 </h2>

<b> Importing the needed libraries</b>

In [28]:
import matplotlib.cm as cm
import matplotlib.colors as colors

import numpy as np

import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a flat pandas dataframe (pandas.io.json.json_normalize is deprecated since pandas 1.0)

from sklearn.cluster import KMeans

!pip -q install folium
print('folium installed...')
import folium # library for map rendering
print('folium imported...')
print('Done')

folium installed...
folium imported...
Done


<b> Using the geopy library to get the latitude and longitude values of Toronto</b>

In [29]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ln_explorer")

location = geolocator.geocode(address)

latitude = location.latitude

longitude = location.longitude

print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


<b> Create a map of Toronto with folium</b>

In [30]:
map_toronto = folium.Map(location = [latitude, longitude], zoom_start=12)

map_toronto

<b>Add a marker for each neighbourhood to the map</b>

In [31]:
for lat, lng, borough, loc in zip(df_loc['Latitude'],
                                  df_loc['Longitude'], 
                                  df_loc['Borough'], 
                                  df_loc['Neighbourhood']):
    label = '{} - {}'.format(loc, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

display(map_toronto)

<b> Explore the neighbourhoods in boroughs whose name contains the word 'Toronto' </b>

In [32]:
df_toronto = df

df_toronto = df[df['Borough'].str.contains('Toronto')]

df_toronto.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.650295,-79.359166
4,M7A,Downtown Toronto,Queen's Park,43.66115,-79.391715
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657363,-79.37818
15,M5C,Downtown Toronto,St. James Town,43.65121,-79.375481
19,M4E,East Toronto,The Beaches,43.676531,-79.295425
20,M5E,Downtown Toronto,Berczy Park,43.64516,-79.373675
24,M5G,Downtown Toronto,Central Bay Street,43.656091,-79.38493
25,M6G,Downtown Toronto,Christie,43.668781,-79.42071
30,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.6497,-79.382582
31,M6H,West Toronto,"Dovercourt Village, Dufferin",43.665087,-79.438705


<b> Define the Foursquare credentials and API version</b>

In [33]:

CLIENT_ID = 'NEV1SHCVX1CYKBXU0OO22DMSNMRBEFT2C00HM1LXCICNKHGM' # your Foursquare ID
CLIENT_SECRET = 'TNJTVRJDTVWYJ0LJSUPHV3LHKDPGQZADLWORB2FMULAEICBC' # your Foursquare Secret
VERSION = '20180604' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: NEV1SHCVX1CYKBXU0OO22DMSNMRBEFT2C00HM1LXCICNKHGM
CLIENT_SECRET:TNJTVRJDTVWYJ0LJSUPHV3LHKDPGQZADLWORB2FMULAEICBC


<b> Analyse the neighbourhood of Harbourfront in the borough of Downtown Toronto</b>

In [34]:

df_toronto.loc[2, 'Neighbourhood'] # get the name of the neighbourhood

'Harbourfront'

<b> Print the latitude and longitude values of Harbourfront </b>

In [35]:
neighborhood_latitude = df_toronto.loc[2, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_toronto.loc[2, 'Longitude'] # neighborhood longitude value

neighborhood_name = df_toronto.loc[2, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Harbourfront are 43.65029500000003, -79.35916572299999.


<b> With Foursquare, get up to 100 venues within a 500 metre radius of Harbourfront</b>

In [36]:
LIMIT = 100 # limit of 100 venues

radius = 500 # radius of 500 meters

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=NEV1SHCVX1CYKBXU0OO22DMSNMRBEFT2C00HM1LXCICNKHGM&client_secret=TNJTVRJDTVWYJ0LJSUPHV3LHKDPGQZADLWORB2FMULAEICBC&v=20180604&ll=43.65029500000003,-79.35916572299999&radius=500&limit=100'
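The request URL above is assembled by string formatting with the credentials written into the notebook. As a hedged sketch of one alternative, the query string can be built with `urllib.parse.urlencode` and the credentials read from environment variables. The function name `build_explore_url`, the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, and the placeholder defaults are assumptions made for this example.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, radius=500, limit=100, version="20180604"):
    """Build the explore URL from keyword parameters instead of a format string."""
    # The environment variable names below are assumed, not a Foursquare convention.
    params = {
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID"),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET"),
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


example_url = build_explore_url(43.650295, -79.359166)
print(example_url)
```

`urlencode` percent-encodes the comma in the `ll` value as `%2C`, which decodes to the same coordinate pair on the server side. Reading the credentials from the environment keeps them out of the notebook's saved output.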

<b> Send the GET request </b>

In [37]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e42d646542890001b2d4940'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'The Distillery District',
  'headerFullLocation': 'The Distillery District, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 40,
  'suggestedBounds': {'ne': {'lat': 43.65479500450003,
    'lng': -79.35295813298289},
   'sw': {'lat': 43.645794995500026, 'lng': -79.36537331301709}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4ad4c05ef964a520bff620e3',
       'name': 'The Distillery Historic District',
       'location': {'address': 'btwn Front, Cherry, Gardiner & Parliament',
        'lat': 43.65024435658077,
        'lng': -79.35932278633118

<b>Extract the category of the venue </b>

In [38]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

<b> Clean json and structure into pandas dataframe </b>

In [39]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,The Distillery Historic District,Historic Site,43.650244,-79.359323
1,Distillery Sunday Market,Farmers Market,43.650075,-79.361832
2,Arvo,Coffee Shop,43.649963,-79.361442
3,Cacao 70,Dessert Shop,43.650067,-79.360723
4,SOMA chocolatemaker,Chocolate Shop,43.650622,-79.358127


<b> How many venues were returned by Foursquare? </b>

In [40]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

40 venues were returned by Foursquare.


<b> Define a function that repeats the same process for all the neighbourhoods in Toronto.</b>

In [41]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)


<b>Produce new dataframe called toronto_venues </b>

In [44]:
toronto_venues = getNearbyVenues(names=df_toronto['Neighbourhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                  )

Harbourfront
Queen's Park
Ryerson, Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
The Danforth West, Riverdale
Design Exchange, Toronto Dominion Centre
Brockton, Exhibition Place, Parkdale Village
The Beaches West, India Bazaar
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North, Forest Hill West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
Harbord, University of Toronto
Runnymede, Swansea
Moore Park, Summerhill East
Chinatown, Grange Park, Kensington Market
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown, St. James Town
Fir

<b> Display toronto venues dataframe to explore the venues, latitude and longitude values of the Neighbourhood </b>

In [45]:
toronto_venues.head(12)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.650295,-79.359166,The Distillery Historic District,43.650244,-79.359323,Historic Site
1,Harbourfront,43.650295,-79.359166,Distillery Sunday Market,43.650075,-79.361832,Farmers Market
2,Harbourfront,43.650295,-79.359166,Arvo,43.649963,-79.361442,Coffee Shop
3,Harbourfront,43.650295,-79.359166,Cacao 70,43.650067,-79.360723,Dessert Shop
4,Harbourfront,43.650295,-79.359166,SOMA chocolatemaker,43.650622,-79.358127,Chocolate Shop
5,Harbourfront,43.650295,-79.359166,Young Centre for the Performing Arts,43.650825,-79.357593,Performing Arts Venue
6,Harbourfront,43.650295,-79.359166,Balzac's Coffee,43.649797,-79.359142,Coffee Shop
7,Harbourfront,43.650295,-79.359166,Spotify,43.649919,-79.358861,Tech Startup
8,Harbourfront,43.650295,-79.359166,Brick Street Bakery,43.650574,-79.359539,Bakery
9,Harbourfront,43.650295,-79.359166,Cluny Bistro & Boulangerie,43.650565,-79.357843,French Restaurant
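`getNearbyVenues` reads `v['venue']['categories'][0]['name']` directly, which raises an `IndexError` for any venue whose category list is empty. The sketch below shows one way the per-venue extraction could tolerate that case. The helper `venue_rows` and the `sample_items` data are hypothetical; the items only imitate the response shape shown earlier.

```python
def venue_rows(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when the venue carries no category at all.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hypothetical items; the second venue has an empty category list.
sample_items = [
    {"venue": {"name": "Arvo",
               "location": {"lat": 43.649963, "lng": -79.361442},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled Spot",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": []}},
]

rows = venue_rows("Harbourfront", 43.650295, -79.359166, sample_items)
for row in rows:
    print(row)
```

Rows with a `None` category can be dropped or relabelled before one-hot encoding, depending on whether uncategorised venues should count towards a neighbourhood's profile.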


<b> Examine the neighbourhoods using one-hot encoding</b>

In [46]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head(12)


Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Train Station,Tram Station,Tunnel,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<b> Group the rows by neighbourhood and take the mean of the frequency of occurrence of each category.</b>

In [47]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head(12)


Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Train Station,Tram Station,Tunnel,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.03,0.0,0.01,0.0,0.03,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.016393,0.0,0.0,0.0,0.016393,...,0.0,0.0,0.0,0.016393,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.029412,0.014706,0.0,0.0,0.0,...,0.0,0.0,0.014706,0.029412,0.0,0.0,0.014706,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.02,0.0,0.0,0.01,0.02,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0,0.0,0.0,0.0,0.014085,0.0,0.0,...,0.014085,0.0,0.0,0.0,0.014085,0.0,0.0,0.0,0.0,0.014085
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.010101,0.0,0.010101,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.010101,0.010101,0.010101,0.0,0.0
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.041667,0.013889,0.0,0.0,0.0,...,0.0,0.0,0.0,0.041667,0.0,0.0,0.041667,0.013889,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.0,0.012821,0.012821,0.0,0.0,0.012821,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.012821,0.0,0.0,0.0
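Because each venue contributes exactly one 1 in its one-hot row, the mean per neighbourhood is simply the share of that neighbourhood's venues falling into each category. The toy example below checks this on made-up data (`toy_venues` is hypothetical) by comparing the one-hot mean with row-normalised counts from `pd.crosstab`.

```python
import pandas as pd

# Hypothetical venue list: three venues in neighbourhood A, one in B.
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Bakery", "Park"],
})

# One-hot encode the category, then average per neighbourhood.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", toy_venues["Neighbourhood"])
shares = onehot.groupby("Neighbourhood").mean()

# The same numbers, computed directly as counts normalised within each row.
direct = pd.crosstab(
    toy_venues["Neighbourhood"], toy_venues["Venue Category"], normalize="index"
)

print(shares)
print(direct)
```

In neighbourhood A, two of three venues are coffee shops, so its "Coffee Shop" share is 2/3 in both tables.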


<b> Print each neighbourhood with its 5 most common venues.</b>

In [48]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----Adelaide, King, Richmond----
         venue  freq
0  Coffee Shop  0.07
1         Café  0.06
2        Hotel  0.05
3   Steakhouse  0.04
4          Bar  0.03


----Berczy Park----
          venue  freq
0   Coffee Shop  0.08
1  Cocktail Bar  0.05
2    Restaurant  0.03
3      Beer Bar  0.03
4    Steakhouse  0.03


----Brockton, Exhibition Place, Parkdale Village----
                    venue  freq
0             Coffee Shop  0.09
1              Restaurant  0.06
2  Furniture / Home Store  0.06
3                    Café  0.06
4                  Bakery  0.04


----Business Reply Mail Processing Centre 969 Eastern----
         venue  freq
0  Coffee Shop  0.09
1   Steakhouse  0.04
2          Bar  0.04
3        Hotel  0.04
4         Café  0.03


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
                venue  freq
0         Coffee Shop  0.11
1  Italian Restaurant  0.07
2                Café  0.04
3                 Bar  0.

<b> Define a function that sorts the venues in descending order of frequency.</b>

In [49]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
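To see what this function returns, it helps to run the same logic on a small row. The sketch below uses a hypothetical helper `top_categories` and a made-up `toy_row` laid out like one row of `toronto_grouped`: the neighbourhood name first, then one frequency per category.

```python
import pandas as pd


def top_categories(row, num_top_venues):
    # Same idea as above: skip the first entry (the name), sort the rest.
    frequencies = row.iloc[1:].astype(float)
    return list(frequencies.sort_values(ascending=False).index[:num_top_venues])


# Hypothetical row in the layout of toronto_grouped.
toy_row = pd.Series({
    "Neighbourhood": "Christie",
    "Bakery": 0.1,
    "Coffee Shop": 0.5,
    "Park": 0.4,
})

top_two = top_categories(toy_row, 2)
print(top_two)
```

Categories with identical frequencies (for example, the many zeros in a sparse row) come out in whatever order the sort leaves them, so the tail of a top-10 list for a neighbourhood with few venues should not be read as a meaningful ranking.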

<b> Produce a new dataframe with the top 10 venues for each neighbourhood.</b>

In [50]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head(12)


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Hotel,Steakhouse,Bar,Japanese Restaurant,Breakfast Spot,Restaurant,Gym,Bakery
1,Berczy Park,Coffee Shop,Cocktail Bar,Breakfast Spot,Hotel,Seafood Restaurant,Café,Cheese Shop,Steakhouse,Restaurant,Beer Bar
2,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Café,Restaurant,Furniture / Home Store,Bakery,Bar,Sandwich Place,Italian Restaurant,Gym,Hotel
3,Business Reply Mail Processing Centre 969 Eastern,Coffee Shop,Bar,Hotel,Steakhouse,Café,Pub,Seafood Restaurant,Gym,Sushi Restaurant,Thai Restaurant
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Coffee Shop,Italian Restaurant,Bar,Café,Park,Intersection,Sandwich Place,Gym / Fitness Center,Electronics Store,Speakeasy
5,"Cabbagetown, St. James Town",Restaurant,Coffee Shop,Pizza Place,Italian Restaurant,Café,Bakery,Butcher,Breakfast Spot,Indian Restaurant,Pub
6,Central Bay Street,Coffee Shop,Clothing Store,Bakery,Ice Cream Shop,Sandwich Place,Plaza,Sushi Restaurant,Spa,Bookstore,Restaurant
7,"Chinatown, Grange Park, Kensington Market",Café,Bar,Chinese Restaurant,Dumpling Restaurant,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Coffee Shop,Art Gallery,Ice Cream Shop,Mexican Restaurant
8,Christie,Café,Grocery Store,Playground,Italian Restaurant,Candy Store,Athletics & Sports,Coffee Shop,Baby Store,Yoga Studio,Farm
9,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Restaurant,Sushi Restaurant,Gastropub,Men's Store,Hotel,Pub,Dance Studio
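The column names above are built by indexing a three-item suffix list and falling back to "th" when the index runs out. That works for the ten columns used here; a small helper such as the hypothetical `ordinal` below is one way to make the suffix rule explicit and keep it correct for larger numbers (21st, 22nd, 23rd).

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 10th to 20th all take "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Rebuild the same column names as in the cell above.
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```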


<h2> Run k-means to cluster the neighbourhoods into 5 clusters. </h2>

In [52]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
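The first ten labels are all 0, which suggests that one cluster may hold most of the neighbourhoods. Counting the rows per label is a quick way to check how balanced a clustering is. The sketch below uses a hypothetical `cluster_sizes` helper and a made-up label vector; in the notebook the input would be `kmeans.labels_`.

```python
from collections import Counter


def cluster_sizes(labels):
    """Map each cluster label to the number of rows assigned to it."""
    counts = Counter(int(label) for label in labels)
    return dict(sorted(counts.items()))


# Hypothetical label vector, not the notebook's actual result.
example_labels = [0, 0, 0, 0, 1, 0, 2, 2, 0, 3]
sizes = cluster_sizes(example_labels)
print(sizes)
```

If one label dominates and the others hold only one or two rows each, it may be worth trying a different number of clusters before interpreting the groups.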

<b> Produce a new dataframe that has the cluster labels and the top 10 venues for each neighbourhood. </b>

In [53]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = df_toronto

# join the sorted venues and cluster labels onto df_toronto, which holds the latitude/longitude of each neighbourhood
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head(12) # check the last columns!


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,Harbourfront,43.650295,-79.359166,0,Coffee Shop,Bakery,Café,Theater,Boat or Ferry,Historic Site,Breakfast Spot,Hotel,Ice Cream Shop,French Restaurant
4,M7A,Downtown Toronto,Queen's Park,43.66115,-79.391715,0,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Indian Restaurant,Deli / Bodega,Food Truck,Fried Chicken Joint,Bookstore,Sushi Restaurant
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657363,-79.37818,0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Japanese Restaurant,Furniture / Home Store,Italian Restaurant,Middle Eastern Restaurant,Lingerie Store,Bakery
15,M5C,Downtown Toronto,St. James Town,43.65121,-79.375481,0,Coffee Shop,Café,Restaurant,Cocktail Bar,Bakery,Seafood Restaurant,Hotel,Breakfast Spot,Italian Restaurant,Clothing Store
19,M4E,East Toronto,The Beaches,43.676531,-79.295425,0,Health Food Store,Pub,Trail,Neighborhood,Eastern European Restaurant,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
20,M5E,Downtown Toronto,Berczy Park,43.64516,-79.373675,0,Coffee Shop,Cocktail Bar,Breakfast Spot,Hotel,Seafood Restaurant,Café,Cheese Shop,Steakhouse,Restaurant,Beer Bar
24,M5G,Downtown Toronto,Central Bay Street,43.656091,-79.38493,0,Coffee Shop,Clothing Store,Bakery,Ice Cream Shop,Sandwich Place,Plaza,Sushi Restaurant,Spa,Bookstore,Restaurant
25,M6G,Downtown Toronto,Christie,43.668781,-79.42071,0,Café,Grocery Store,Playground,Italian Restaurant,Candy Store,Athletics & Sports,Coffee Shop,Baby Store,Yoga Studio,Farm
30,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.6497,-79.382582,0,Coffee Shop,Café,Hotel,Steakhouse,Bar,Japanese Restaurant,Breakfast Spot,Restaurant,Gym,Bakery
31,M6H,West Toronto,"Dovercourt Village, Dufferin",43.665087,-79.438705,0,Furniture / Home Store,Park,Athletics & Sports,Pharmacy,Bar,Bank,Bakery,Fast Food Restaurant,Pet Store,Café


<h2> Visualization of the clusters. </h2>

In [54]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


<h2> 1. First Cluster. </h2>

In [56]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,0,Coffee Shop,Bakery,Café,Theater,Boat or Ferry,Historic Site,Breakfast Spot,Hotel,Ice Cream Shop,French Restaurant
4,Downtown Toronto,0,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Indian Restaurant,Deli / Bodega,Food Truck,Fried Chicken Joint,Bookstore,Sushi Restaurant
9,Downtown Toronto,0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Japanese Restaurant,Furniture / Home Store,Italian Restaurant,Middle Eastern Restaurant,Lingerie Store,Bakery
15,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Cocktail Bar,Bakery,Seafood Restaurant,Hotel,Breakfast Spot,Italian Restaurant,Clothing Store
19,East Toronto,0,Health Food Store,Pub,Trail,Neighborhood,Eastern European Restaurant,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
20,Downtown Toronto,0,Coffee Shop,Cocktail Bar,Breakfast Spot,Hotel,Seafood Restaurant,Café,Cheese Shop,Steakhouse,Restaurant,Beer Bar
24,Downtown Toronto,0,Coffee Shop,Clothing Store,Bakery,Ice Cream Shop,Sandwich Place,Plaza,Sushi Restaurant,Spa,Bookstore,Restaurant
25,Downtown Toronto,0,Café,Grocery Store,Playground,Italian Restaurant,Candy Store,Athletics & Sports,Coffee Shop,Baby Store,Yoga Studio,Farm
30,Downtown Toronto,0,Coffee Shop,Café,Hotel,Steakhouse,Bar,Japanese Restaurant,Breakfast Spot,Restaurant,Gym,Bakery
31,West Toronto,0,Furniture / Home Store,Park,Athletics & Sports,Pharmacy,Bar,Bank,Bakery,Fast Food Restaurant,Pet Store,Café


<h2> 2. Second Cluster. </h2>

In [57]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
36,Downtown Toronto,1,Harbor / Marina,Pier,Park,Yoga Studio,Eastern European Restaurant,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market


<h2> 3. Third Cluster. </h2>

In [58]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
68,Central Toronto,2,Park,Yoga Studio,Dumpling Restaurant,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
69,West Toronto,2,Sandwich Place,Park,Dumpling Restaurant,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant


<h2> 4. Fourth Cluster. </h2>

In [59]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
62,Central Toronto,3,Health & Beauty Service,IT Services,Eastern European Restaurant,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm


<h2> 5. Fifth Cluster. </h2>

In [60]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
73,Central Toronto,4,Playground,Gym Pool,Park,Garden,Yoga Studio,Eastern European Restaurant,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
83,Central Toronto,4,Playground,Gym,Park,Tennis Court,Donut Shop,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm
91,Downtown Toronto,4,Playground,Grocery Store,Candy Store,Park,Eastern European Restaurant,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market


<h2> Implications</h2>

The neighbourhoods are divided into 5 clusters based on their most common venues. 
<b>Cluster 1:</b> Restaurants, cafes and historic sites are the most common venues here. This seems to be the main area to eat in Toronto. 
<b>Cluster 2:</b> This cluster is defined by the harbour or marina. 
<b>Cluster 3:</b> Parks and yoga studios. 
<b>Cluster 4:</b> Health and beauty services and IT services. 
<b>Cluster 5:</b> Playgrounds, gyms, and parks. 
