# Segmenting and Clustering Neighborhoods in Toronto

### Import all the libraries to be used by the project.

In [108]:
# Scrape https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
# to obtain the data in the table of postal codes
# and transform that data into a pandas DataFrame

!pip install lxml
!pip install beautifulsoup4

import urllib.request as ur
import lxml.html as lh
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

import io

!pip install geocoder
import geocoder # import geocoder

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas DataFrame (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors



#### Import _Nominatim_ from the _geopy.geocoders_ library for converting the location address to coordinates.

In [109]:
!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



#### Import _KMeans_ from the _sklearn.cluster_ module for clustering data and the _folium_ library for rendering maps.

In [110]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Define functions __joinStrSpaces(_string_)__ to modify strings or lists and __cleanTR(_input,length,row_number_)__ to clean scraped data.

In [111]:
def joinStrSpaces(string):
    #check type of attribute passed to function
    pType = type(string)
    
    #if attribute is a string, split the spaces and rejoin split elements as one string,
    #  if attribute is a Python list, split the list by comma delimiter, then split the spaces 
    #  in each string, rejoin split elements, and return the results as a list
    if pType == str:
        combined_string = ''.join(string.split())
        return combined_string
    elif pType == list:
        tmpList = []
        for i,st in enumerate(string):
            st = ''.join(st.split())
            tmpList.append(st)
        return tmpList
   
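def cleanTR(arr,length,row_num):
    # strip newline characters from every cell in the row
    tmp = []

    for i,s in enumerate(arr):
        s = s.replace('\n','')    
        tmp.append(s)

    # keep rows with an assigned borough and neighborhood (appended to the global list `rows`)
    if arr[1] != "Not assigned" and arr[2] != "Not assigned":
        rows.append(tmp)
        return tmp
    # flag rows that have a borough but no neighborhood, reporting the table row number
    elif arr[1] != "Not assigned" and arr[2] == "Not assigned":
        print("Check row {} and copy the value from the borough column to the Neighborhood column".format(row_num))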
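`cleanTR` appends to the module-level `rows` list as a side effect and only prints a reminder when a neighborhood is missing. As a possible alternative (a sketch, not the notebook's code), the same filtering can be written as a pure function that returns the cleaned rows and copies the borough into an unassigned neighborhood. The name `clean_rows` and the sample rows, including the made-up code `X9Z`, are assumptions for illustration.

```python
def clean_rows(raw_rows, placeholder="Not assigned"):
    """Return cleaned [postal_code, borough, neighborhood] rows.

    Rows without an assigned borough are dropped; rows with a borough
    but no neighborhood reuse the borough name as the neighborhood.
    """
    cleaned = []
    for cells in raw_rows:
        # strip newlines and surrounding whitespace from every cell
        cells = [c.replace('\n', '').strip() for c in cells]
        if len(cells) < 3 or cells[1] == placeholder:
            continue  # no borough assigned: skip the row
        if cells[2] == placeholder:
            cells[2] = cells[1]  # fall back to the borough name
        cleaned.append(cells[:3])
    return cleaned


# Hypothetical raw rows in the shape produced by row.text.split('\n\n')
sample = [
    ['\nM1A', 'Not assigned', 'Not assigned\n'],
    ['\nM3A', 'North York', 'Parkwoods\n'],
    ['\nX9Z', 'Some Borough', 'Not assigned\n'],  # made-up row
]
print(clean_rows(sample))
```

Because the function returns its result instead of mutating a global, it can be re-run on the same input without accumulating duplicate rows.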

### Load scraped data from Wikipedia into a _pandas_ DataFrame
1. Define a __URL__ of the web page to scrape 
1. Open the page and read the contents into a string
1. Parse the resulting string with _BeautifulSoup_ to retrieve the relevant table rows
1. Clean the data 
1. Insert it into a _pandas_ DataFrame __df__

In [112]:
# URL location of Wikipedia page 'List of postal codes of Canada: M' table
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# load url and store the contents of the website
page = ur.urlopen(url)

# read page and store as variable
data = page.read()

soup = BeautifulSoup(data, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})
tr = table.find_all('tr')

columns = []
rows = []

# loop through retrieved rows from HTML table and call function to clean raw data
# and store it in rows
for i,row in enumerate(tr):

    row = str(row.text)
    row = row.split('\n\n')
    L = len(row)

    row = cleanTR(row,L,i)

# retrieve column names from first row
column_names = rows.pop(0)

# call function to check for spaces in column names
column_names = joinStrSpaces(column_names)

df = pd.DataFrame(rows, columns=column_names)
df.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True)
df.head(105)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


## __Answer #1:__
### Print out the shape of the _pandas_ DataFrame consisting of three columns: __PostalCode__, __Borough__, and __Neighborhood__.

In [113]:
df.shape

(103, 3)

In [114]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
ll_joined = latitude, longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))
print(ll_joined)

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
(43.6534817, -79.3839347)


### Retrieve __postal code__ coordinates by using either:
1. Google _geocoder_ API (unstable) or 
1. Geospatial_Coordinates.csv file

In [115]:
## v 2, use geopy geocode = questionable results
"""
def get_postcode(df, geolocator, lat_field, lon_field):
    location = geolocator.reverse((df[lat_field], df[lon_field]))
    return location.raw['address']['postcode']

def get_ll(df, geolocator, postcode, city, country, dic):
    
    p = df[postcode]
    c = country
    s = city
    
    psc = p+', '+c
    location = geolocator.geocode(psc, addressdetails=True, timeout=2)
    
    if location != None:
        lat,lon = location.raw['lat'],location.raw['lon']
        return lat,lon
    else:
        return p


geolocator = geopy.Nominatim(user_agent='city-search')

city_name = 'Toronto'
country_name = 'Canada'
example_dict = {'city': 'Toronto','country':'Canada'}

latlon = df.apply(get_ll, axis=1, geolocator=geolocator, postcode='PostalCode', city=city_name, country=country_name, dic={'city': 'Toronto','country':'Canada'})
latlon.head(40)
"""

In [116]:
## v 3, use Geospatial_Coordinates.csv = full results returned

# read csv file into a pandas dataframe, using the first row as column headers
df_geo = pd.read_csv('Geospatial_Coordinates.csv', delimiter=",", header=0)
    
# send column names to function to remove inner spaces (if present) and rejoin as a single column heading
geo_t = joinStrSpaces(list(df_geo.columns.values))
##print(geo_t)
df_geo.columns=geo_t

selected_column = df_geo.iloc[:, 0:3]  
df_geo.head(105)

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


### Merge the two _pandas_ DataFrames to produce a single DataFrame with __PostalCode__, __Neighborhood__, __Latitude__, and __Longitude__ columns

In [117]:
# join DataFrames (side by side)
concat_df = pd.concat([df, df_geo], axis=1)
concat_df.shape

(103, 6)

## __Answer #2:__
### Merge the first DataFrame (_df_, columns: __PostalCode__, __Borough__, __Neighborhood__) with the second DataFrame (_df_geo_, __Latitude__, __Longitude__), creating a new DataFrame (_merged_pbnll_) featuring the five (5) joined columns printed below.

In [118]:
# merge dataframes by joining on common column PostalCode
merged_pbnll = pd.merge(left=df, right=df_geo, left_on='PostalCode', right_on='PostalCode')

merged_pbnll

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
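The side-by-side `pd.concat` in cell 117 pairs rows by position, and the two tables are in different orders (`df` starts at M3A, `df_geo` at M1B), so only the key-based merge on __PostalCode__ attaches the right coordinates to each row. The plain-Python sketch below illustrates the difference on two rows taken from the tables above; the variable names are hypothetical and no pandas call is involved.

```python
# Two rows from each table, deliberately in different orders
postal = [('M3A', 'North York'), ('M4A', 'North York')]
coords = [('M4A', 43.725882, -79.315572), ('M3A', 43.753259, -79.329656)]

# Positional pairing (what a side-by-side concat does): matched by row order
by_position = [p + c for p, c in zip(postal, coords)]

# Key-based pairing (what a merge on PostalCode does): matched by postal code
lookup = {code: (lat, lng) for code, lat, lng in coords}
by_key = [(code, borough) + lookup[code]
          for code, borough in postal if code in lookup]

print(by_position[0])  # M3A ends up next to M4A's coordinates
print(by_key[0])       # M3A is paired with its own coordinates
```

The `if code in lookup` filter mirrors an inner join: postal codes without coordinates would be dropped rather than paired with a neighbor's values.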


#### Generate map of Toronto featuring __postal codes__, __boroughs__ and __neighborhoods__ (grouped by number of neighborhoods within a __PostalCode__)

In [119]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

boroughs_in_toronto = []

def getPlotCircles(postal_codes, boroughs, neighborhoods, lats, lngs):
    # add markers to map
    for postal_code, borough, neighborhood, lat, lng in zip(postal_codes, boroughs, neighborhoods, lats, lngs):
    
        if ("Toronto" in borough):
            label = 'Neighborhood: {}; Borough: {}; Postal Code: {}'.format(neighborhood, borough, postal_code)
            label = folium.Popup(label, parse_html=True)

            # count the neighborhoods in the postal code and scale the radius by that count
            radius = (neighborhood.count(",") + 1) * 2

            if 10 > radius > 2:
                color = 'green'
            elif radius > 4:
                color = 'red'
            else:
                color = 'blue'

            # pair latitude and longitude for the marker position
            lat_lng = [lat,lng]

            # record the plotted postal code (avoids shadowing the built-in vars)
            record = [postal_code, borough, neighborhood, lat, lng]
            boroughs_in_toronto.append(record)

            folium.CircleMarker(
                lat_lng,
                #[lat, lng],
                radius=radius,
                popup=label,
                color='blue',
                fill=True,
                fill_color=color,
                fill_opacity=0.5,
                parse_html=False).add_to(map_toronto)

# generate plot points to create a circle for each Neighborhood
postal_codes = merged_pbnll['PostalCode']
boroughs = merged_pbnll['Borough']
neighborhoods = merged_pbnll['Neighborhood']
lats = merged_pbnll['Latitude']
lngs = merged_pbnll['Longitude']

getPlotCircles(postal_codes,boroughs,neighborhoods,lats,lngs)
map_toronto

**Folium** is a great visualization library. Zoom into the map above and click a circle marker to reveal its postal code, neighborhood(s), and borough. Only boroughs with "Toronto" in their name are plotted, and the bigger the circle, the more neighborhoods share that postal code.
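The radius and fill color in `getPlotCircles` both derive from the number of comma-separated neighborhoods in a postal code. The sketch below isolates that rule in a small helper so the thresholds are easier to read and check; the name `marker_style` is hypothetical, and the branches are meant to reproduce the original conditions for the even radii the code produces.

```python
def marker_style(neighborhood):
    """Return (radius, fill_color) for a comma-separated neighborhood string."""
    count = neighborhood.count(',') + 1   # number of neighborhoods listed
    radius = count * 2
    if 2 < radius < 10:       # 2 to 4 neighborhoods
        color = 'green'
    elif radius >= 10:        # 5 or more neighborhoods
        color = 'red'
    else:                     # a single neighborhood
        color = 'blue'
    return radius, color


print(marker_style('Parkwoods'))
print(marker_style('Regent Park, Harbourfront'))
```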


However, for illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in West Toronto. So let's slice the original dataframe and create a new dataframe of the West Toronto data.


In [120]:
wtoronto_data = merged_pbnll[merged_pbnll['Borough'] == 'West Toronto'].reset_index(drop=True)
print(wtoronto_data.shape)
wtoronto_data.head(50)

(6, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
1,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975
2,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191
3,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763
4,M6R,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325
5,M6S,West Toronto,"Runnymede, Swansea",43.651571,-79.48445


Let's get the geographical coordinates of __West Toronto__.


In [121]:
address = 'West Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of West Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of West Toronto are 43.6534817, -79.3839347.


As we did for the boroughs with "Toronto" in their name, let's visualize __West Toronto__ and the neighborhoods in it.


In [122]:
# create map of West Toronto using latitude and longitude values
map_wtoronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(wtoronto_data['Latitude'], wtoronto_data['Longitude'], wtoronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_wtoronto)  
    
map_wtoronto

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.


#### Define Foursquare Credentials and Version


In [123]:
CLIENT_ID = 'CFWCY4RZCR0GUGC1WS51AUT5IMQZPJBUDMYBDMRCDEX14PST' # your Foursquare ID
CLIENT_SECRET = 'S3DDXFJY50CHSIUYW4BLT0SSWJ3QVUS51HO5PQTN1HZH205X' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: CFWCY4RZCR0GUGC1WS51AUT5IMQZPJBUDMYBDMRCDEX14PST
CLIENT_SECRET:S3DDXFJY50CHSIUYW4BLT0SSWJ3QVUS51HO5PQTN1HZH205X
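Hard-coding and printing API credentials makes them visible to anyone who reads the notebook. One common alternative, sketched here under assumptions, is to read them from environment variables and mask the secret when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the credentials from a mapping of environment variables.

    The variable names are assumptions made for this sketch.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None


def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('CLIENT_SECRET: ' + mask(client_secret))
```

In the notebook itself, calling `load_foursquare_credentials()` with no argument would read from the real environment.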


#### Let's explore the first neighborhood in our dataframe.


Get the neighborhood's name.


In [124]:
wtoronto_data.loc[0, 'Neighborhood']

'Dufferin, Dovercourt Village'

Get the neighborhood's latitude and longitude values.


In [125]:
neighborhood_latitude = wtoronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = wtoronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = wtoronto_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Dufferin, Dovercourt Village are 43.66900510000001, -79.4422593.


#### Now, let's get the top 100 venues that are in Dufferin, Dovercourt Village within a radius of 500 meters.


First, let's create the GET request URL. Name your URL **url**.


In [126]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# build the request URL for the venues/explore endpoint
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    neighborhood_latitude,                                                      
    neighborhood_longitude,
    VERSION,
    radius,
    LIMIT)

url

'https://api.foursquare.com/v2/venues/explore?client_id=CFWCY4RZCR0GUGC1WS51AUT5IMQZPJBUDMYBDMRCDEX14PST&client_secret=S3DDXFJY50CHSIUYW4BLT0SSWJ3QVUS51HO5PQTN1HZH205X&ll=43.66900510000001,-79.4422593&v=20180605&radius=500&limit=100'

Send the GET request and examine the results.


In [127]:
results = requests.get(url).json()

From the Foursquare lab in the previous module, we know that all the information is in the _items_ key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [128]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a _pandas_ dataframe.


In [129]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(15)



Unnamed: 0,name,categories,lat,lng
0,The Greater Good Bar,Bar,43.669409,-79.439267
1,Parallel,Middle Eastern Restaurant,43.669516,-79.438728
2,Blood Brothers Brewing,Brewery,43.669944,-79.436533
3,Happy Bakery & Pastries,Bakery,43.66705,-79.441791
4,FreshCo,Grocery Store,43.667918,-79.440754
5,Rehearsal Factory,Music Venue,43.668877,-79.443603
6,The Sovereign,Café,43.673116,-79.440265
7,Nova Era Bakery,Bakery,43.669886,-79.437582
8,Food Basics,Supermarket,43.666886,-79.446691
9,TD Canada Trust,Bank,43.667934,-79.441698


And how many venues were returned by Foursquare?


In [130]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

13 venues were returned by Foursquare.


## 2. Explore Neighborhoods in __West Toronto__


#### Let's create a function to repeat the same process for all the neighborhoods in West Toronto


In [131]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
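`getNearbyVenues` formats the query string by hand and assumes every venue has at least one category, so a venue with an empty `categories` list would raise an `IndexError`. The sketch below shows one way to separate URL construction from response parsing and to tolerate such gaps. The endpoint and parameter names are taken from the URL above; the helper names and the second demo venue are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


def extract_venues(name, lat, lng, payload):
    """Flatten one decoded explore response into rows, tolerating gaps."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Stand-in payload: the first venue comes from the output above,
# the second is made up to exercise the empty-categories branch
demo_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'The Greater Good Bar',
               'location': {'lat': 43.669409, 'lng': -79.439267},
               'categories': [{'name': 'Bar'}]}},
    {'venue': {'name': 'Uncategorized place (made up)',
               'location': {'lat': 0.0, 'lng': 0.0},
               'categories': []}},
]}]}}
demo_rows = extract_venues('Dufferin, Dovercourt Village',
                           43.669005, -79.442259, demo_payload)
print(demo_rows[0][3], '->', demo_rows[0][-1])
```

Keeping the parsing step free of network access also makes it possible to test against a saved response.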

#### Now write the code to run the above function on each neighborhood and create a new dataframe called _wtoronto_venues_.


In [132]:
# collect nearby venues for every West Toronto neighborhood
wtoronto_venues = getNearbyVenues(names=wtoronto_data['Neighborhood'],
                                   latitudes=wtoronto_data['Latitude'],
                                   longitudes=wtoronto_data['Longitude']
                                  )

Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High Park, The Junction South
Parkdale, Roncesvalles
Runnymede, Swansea


#### Let's check the size of the resulting dataframe


In [133]:
print(wtoronto_venues.shape)
wtoronto_venues.head(154)

(153, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Dufferin, Dovercourt Village",43.669005,-79.442259,The Greater Good Bar,43.669409,-79.439267,Bar
1,"Dufferin, Dovercourt Village",43.669005,-79.442259,Parallel,43.669516,-79.438728,Middle Eastern Restaurant
2,"Dufferin, Dovercourt Village",43.669005,-79.442259,Blood Brothers Brewing,43.669944,-79.436533,Brewery
3,"Dufferin, Dovercourt Village",43.669005,-79.442259,Happy Bakery & Pastries,43.667050,-79.441791,Bakery
4,"Dufferin, Dovercourt Village",43.669005,-79.442259,FreshCo,43.667918,-79.440754,Grocery Store
...,...,...,...,...,...,...,...
148,"Runnymede, Swansea",43.651571,-79.484450,West End Mamas,43.648703,-79.484919,Health Food Store
149,"Runnymede, Swansea",43.651571,-79.484450,My Place - a Canadian Pub,43.648458,-79.485187,Pub
150,"Runnymede, Swansea",43.651571,-79.484450,(The New) Moksha Yoga Bloor West,43.648658,-79.485242,Yoga Studio
151,"Runnymede, Swansea",43.651571,-79.484450,The Coffee Bouquets,43.648785,-79.485940,Coffee Shop


Let's check how many venues were returned for each neighborhood in __West Toronto__


In [134]:
wtoronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
"Dufferin, Dovercourt Village",13,13,13,13,13,13
"High Park, The Junction South",25,25,25,25,25,25
"Little Portugal, Trinity",45,45,45,45,45,45
"Parkdale, Roncesvalles",14,14,14,14,14,14
"Runnymede, Swansea",33,33,33,33,33,33


#### Let's find out how many unique categories can be curated from all the returned venues


In [135]:
print('There are {} unique categories.'.format(len(wtoronto_venues['Venue Category'].unique())))

There are 77 unique categories.


## 3. Analyze Each Neighborhood


In [136]:
# one hot encoding
wtoronto_onehot = pd.get_dummies(wtoronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
wtoronto_onehot['Neighborhood'] = wtoronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [wtoronto_onehot.columns[-1]] + list(wtoronto_onehot.columns[:-1])
wtoronto_onehot = wtoronto_onehot[fixed_columns]

And let's examine the new dataframe size.


In [137]:
wtoronto_onehot.shape

(153, 78)

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category


In [138]:
wtoronto_grouped = wtoronto_onehot.groupby('Neighborhood').mean().reset_index()
wtoronto_grouped

Unnamed: 0,Neighborhood,Antique Shop,Art Gallery,Arts & Crafts Store,Asian Restaurant,Bakery,Bank,Bar,Beer Store,Bookstore,...,Speakeasy,Stadium,Supermarket,Sushi Restaurant,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,...,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Dufferin, Dovercourt Village",0.0,0.0,0.0,0.0,0.153846,0.076923,0.076923,0.0,0.0,...,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"High Park, The Junction South",0.04,0.0,0.04,0.0,0.04,0.0,0.08,0.0,0.04,...,0.04,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.0
3,"Little Portugal, Trinity",0.0,0.022222,0.0,0.044444,0.022222,0.0,0.088889,0.022222,0.0,...,0.0,0.0,0.0,0.0,0.0,0.022222,0.022222,0.044444,0.022222,0.022222
4,"Parkdale, Roncesvalles",0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.071429,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Runnymede, Swansea",0.0,0.0,0.0,0.0,0.0,0.030303,0.030303,0.0,0.030303,...,0.0,0.0,0.0,0.060606,0.0,0.0,0.030303,0.0,0.0,0.030303


#### Let's confirm the new size


In [139]:
wtoronto_grouped.shape

(6, 78)
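The mean of a one-hot column within a neighborhood is simply the share of that neighborhood's venues in the category. For example, Dufferin, Dovercourt Village has 13 venues, so its Bakery value of 0.153846 corresponds to 2/13 and its Bank value of 0.076923 to 1/13. The plain-Python sketch below reproduces that calculation on a made-up data set; the function name `category_frequencies` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def category_frequencies(venue_rows):
    """Map each neighborhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venue_rows:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for hood, cats in counts.items()
    }


# Made-up data: four venues in neighborhood 'A', one in neighborhood 'B'
venue_rows = [('A', 'Bakery'), ('A', 'Bakery'), ('A', 'Bar'),
              ('A', 'Park'), ('B', 'Bar')]
freq = category_frequencies(venue_rows)
print(freq['A']['Bakery'])  # 2 of 4 venues
```

Unlike the one-hot table, this dictionary omits categories with a zero share, which is why the pandas version has a column for every category in every row.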

#### Let's print each neighborhood along with the top 5 most common venues


In [140]:
num_top_venues = 5

for hood in wtoronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = wtoronto_grouped[wtoronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0            Café  0.13
1       Nightclub  0.09
2     Coffee Shop  0.09
3  Breakfast Spot  0.09
4      Restaurant  0.04


----Dufferin, Dovercourt Village----
           venue  freq
0       Pharmacy  0.15
1         Bakery  0.15
2  Grocery Store  0.08
3           Park  0.08
4    Music Venue  0.08


----High Park, The Junction South----
                venue  freq
0  Mexican Restaurant  0.08
1                 Bar  0.08
2     Thai Restaurant  0.08
3                Café  0.08
4        Antique Shop  0.04


----Little Portugal, Trinity----
              venue  freq
0               Bar  0.09
1       Coffee Shop  0.07
2        Restaurant  0.04
3       Men's Store  0.04
4  Asian Restaurant  0.04


----Parkdale, Roncesvalles----
                         venue  freq
0                    Gift Shop  0.14
1               Breakfast Spot  0.14
2  Eastern European Restaurant  0.07
3                      Dog Run  0.07
4        

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [141]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [142]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = wtoronto_grouped['Neighborhood']

for ind in np.arange(wtoronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(wtoronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Brockton, Parkdale Village, Exhibition Place",Café,Nightclub,Breakfast Spot,Coffee Shop,Convenience Store,Performing Arts Venue,Pet Store,Climbing Gym,Italian Restaurant,Burrito Place
1,"Dufferin, Dovercourt Village",Pharmacy,Bakery,Park,Bar,Middle Eastern Restaurant,Music Venue,Café,Brewery,Grocery Store,Supermarket
2,"High Park, The Junction South",Mexican Restaurant,Thai Restaurant,Bar,Café,Diner,Discount Store,Music Venue,Park,Fast Food Restaurant,Cajun / Creole Restaurant
3,"Little Portugal, Trinity",Bar,Coffee Shop,Restaurant,Vietnamese Restaurant,Asian Restaurant,Men's Store,Café,Yoga Studio,Greek Restaurant,Miscellaneous Shop
4,"Parkdale, Roncesvalles",Gift Shop,Breakfast Spot,Cuban Restaurant,Bookstore,Movie Theater,Dog Run,Coffee Shop,Eastern European Restaurant,Restaurant,Dessert Shop
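The column labels above rely on an index lookup that falls back to "th", which is correct for 1 through 10 but would label position 21 as "21th" if `num_top_venues` were raised. A small helper, sketched here with the hypothetical name `ordinal`, handles the general case.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'          # 10th through 20th, 110th through 120th, ...
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Rebuild the column labels used above for the top 10 venues
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, 11)]
print(columns[1], '|', columns[-1])
```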


## 4. Cluster Neighborhoods in West Toronto


Run _k_-means to cluster the neighborhoods into 5 clusters.


In [143]:
# set number of clusters
kclusters = 5

wtoronto_grouped_clustering = wtoronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(wtoronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 2, 4, 1, 0, 1], dtype=int32)
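With only six neighborhoods and five clusters, four clusters hold a single neighborhood and one holds two, as the labels above show. To make the algorithm behind those labels concrete, here is a minimal pure-Python sketch of Lloyd's algorithm, the iteration that k-means is based on. It is a teaching aid under stated assumptions (random initial centroids, no k-means++ seeding, no restarts), not a reimplementation of `sklearn.cluster.KMeans`, and the sample points are made up.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster equal-length numeric tuples with Lloyd's algorithm.

    Returns (labels, centroids).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: nearest centroid by squared Euclidean distance
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:                 # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
        if new_labels == labels:
            break                       # assignments stable: converged
        labels = new_labels
    return labels, centroids


# Made-up, well-separated points: two near the origin, two near (10, 10)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only which points share a label is meaningful, which is also true of the `kmeans.labels_` array above.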

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood in West Toronto


In [144]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

wtoronto_merged = wtoronto_data

# merge neighborhoods_venues_sorted with wtoronto_data to add the cluster label and top venues for each neighborhood
wtoronto_merged = wtoronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

wtoronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,2,Pharmacy,Bakery,Park,Bar,Middle Eastern Restaurant,Music Venue,Café,Brewery,Grocery Store,Supermarket
1,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,1,Bar,Coffee Shop,Restaurant,Vietnamese Restaurant,Asian Restaurant,Men's Store,Café,Yoga Studio,Greek Restaurant,Miscellaneous Shop
2,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191,3,Café,Nightclub,Breakfast Spot,Coffee Shop,Convenience Store,Performing Arts Venue,Pet Store,Climbing Gym,Italian Restaurant,Burrito Place
3,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763,4,Mexican Restaurant,Thai Restaurant,Bar,Café,Diner,Discount Store,Music Venue,Park,Fast Food Restaurant,Cajun / Creole Restaurant
4,M6R,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325,0,Gift Shop,Breakfast Spot,Cuban Restaurant,Bookstore,Movie Theater,Dog Run,Coffee Shop,Eastern European Restaurant,Restaurant,Dessert Shop


Finally, let's visualize the resulting clusters in West Toronto.


In [145]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(wtoronto_merged['Latitude'], wtoronto_merged['Longitude'], wtoronto_merged['Neighborhood'], wtoronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>


## 5. Examine Clusters in West Toronto


Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.


#### Cluster 1


In [146]:
wtoronto_merged.loc[wtoronto_merged['Cluster Labels'] == 0, wtoronto_merged.columns[[1] + list(range(5, wtoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,West Toronto,0,Gift Shop,Breakfast Spot,Cuban Restaurant,Bookstore,Movie Theater,Dog Run,Coffee Shop,Eastern European Restaurant,Restaurant,Dessert Shop


#### Cluster 2


In [147]:
wtoronto_merged.loc[wtoronto_merged['Cluster Labels'] == 1, wtoronto_merged.columns[[1] + list(range(5, wtoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,West Toronto,1,Bar,Coffee Shop,Restaurant,Vietnamese Restaurant,Asian Restaurant,Men's Store,Café,Yoga Studio,Greek Restaurant,Miscellaneous Shop
5,West Toronto,1,Café,Coffee Shop,Italian Restaurant,Pizza Place,Pub,Sushi Restaurant,Fish & Chips Shop,Indie Movie Theater,Gym,Gourmet Shop


#### Cluster 3


In [148]:
wtoronto_merged.loc[wtoronto_merged['Cluster Labels'] == 2, wtoronto_merged.columns[[1] + list(range(5, wtoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,West Toronto,2,Pharmacy,Bakery,Park,Bar,Middle Eastern Restaurant,Music Venue,Café,Brewery,Grocery Store,Supermarket


#### Cluster 4


In [149]:
wtoronto_merged.loc[wtoronto_merged['Cluster Labels'] == 3, wtoronto_merged.columns[[1] + list(range(5, wtoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,West Toronto,3,Café,Nightclub,Breakfast Spot,Coffee Shop,Convenience Store,Performing Arts Venue,Pet Store,Climbing Gym,Italian Restaurant,Burrito Place


#### Cluster 5


In [150]:
wtoronto_merged.loc[wtoronto_merged['Cluster Labels'] == 4, wtoronto_merged.columns[[1] + list(range(5, wtoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,West Toronto,4,Mexican Restaurant,Thai Restaurant,Bar,Café,Diner,Discount Store,Music Venue,Park,Fast Food Restaurant,Cajun / Creole Restaurant
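
The five cells above repeat the same filter with a different label each time. As a rough sketch, the same inspection could be done in one loop with `groupby`. The small `demo` frame below is an illustrative stand-in for `wtoronto_merged`, not the notebook's data.

```python
import pandas as pd

# toy stand-in for wtoronto_merged (values are illustrative only)
demo = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'Cluster Labels': [0, 1, 1, 2],
    '1st Most Common Venue': ['Gift Shop', 'Bar', 'Café', 'Pharmacy'],
})

cluster_sizes = {}
for label, group in demo.groupby('Cluster Labels'):
    # the descriptions number clusters from 1, while the labels start at 0
    print('Cluster {} (label {})'.format(label + 1, label))
    print(group.drop(columns='Cluster Labels').to_string(index=False))
    cluster_sizes[int(label)] = len(group)
```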


# __Answer #3 Explore, Cluster and Observations*__

### Observations for West Toronto Neighborhoods: Restaurants and food services make up approximately half of the venues. Cluster 1 features a lot of specialty retail spots. Cluster 2 features eclectic food options and may appeal to hipster types with its coffee shops, cafes and niche entertainment opportunities.  Cluster 3 appears less eclectic, with more pharmacies, bakeries, grocery and convenience stores, but still contains its share of social entertainment opportunities like bars, breweries, cafes and music venues.  Cluster 4 is less eclectic than cluster 3 but has more diverse venue types; it may be where groups of younger commuting business people come together at night, on the assumption that their lives are more structured and busier, with less focus on diversity of food types.  Cluster 5 has more food options than cluster 4 and still includes its fair share of bars, cafes, diners and parks where socialization is likely to occur.  From an economic standpoint, cluster 5 may be less affluent than cluster 4, having less of a 'downtown' vibe.

#### Cluster labels on the map above run from 0 to 4; label 0 corresponds to cluster 1 in the descriptions, 1 to cluster 2, 2 to cluster 3, 3 to cluster 4, and 4 to cluster 5.

## __*__ Observations for clusters of all boroughs containing 'Toronto' in their names are included under their set of clusters at the bottom of this notebook.

# Create a DataFrame with only boroughs containing the text __'Toronto'__

In [151]:
merged_column_names = list(merged_pbnll.columns.values)

df_boroughs = pd.DataFrame(boroughs_in_toronto, columns=merged_column_names)
#print(df_boroughs)

borough_num = len(boroughs_in_toronto)
#print(borough_num)

print('The dataframe {} has {} total boroughs and {} neighborhood listings.'.format(
        "df_boroughs",
        len(df_boroughs['Borough'].unique()),
        df_boroughs.shape[0]
    )
)

df_boroughs.sort_values(['PostalCode','Borough','Latitude','Longitude'], inplace=True)
df_boroughs.head(45)

The dataframe df_boroughs has 4 total boroughs and 39 neighborhood listings.


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
12,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
15,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
17,M4M,East Toronto,Studio District,43.659526,-79.340923
18,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
20,M4P,Central Toronto,Davisville North,43.712751,-79.390197
23,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
26,M4S,Central Toronto,Davisville,43.704324,-79.38879
29,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
31,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049
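
How `boroughs_in_toronto` was built is not shown in this part of the notebook. One possible way to get the same kind of subset, assuming a frame with a `Borough` column like `merged_pbnll`, is a plain substring filter. The rows in `demo` below are illustrative.

```python
import pandas as pd

# toy stand-in for merged_pbnll (rows are illustrative)
demo = pd.DataFrame({
    'PostalCode': ['M6P', 'M3A', 'M4E'],
    'Borough': ['West Toronto', 'North York', 'East Toronto'],
    'Neighborhood': ['High Park, The Junction South', 'Parkwoods', 'The Beaches'],
})

# keep rows whose borough name contains the literal text 'Toronto'
toronto_only = demo[demo['Borough'].str.contains('Toronto', regex=False)]

# sort_values returns a new frame, so no inplace=True is needed
toronto_only = toronto_only.sort_values(['PostalCode', 'Borough']).reset_index(drop=True)
```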


### Prepare Foursquare API details and define retrieval function (_getNearbyVenues()_)

In [152]:
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

CLIENT_ID = 'CFWCY4RZCR0GUGC1WS51AUT5IMQZPJBUDMYBDMRCDEX14PST' # your Foursquare ID
CLIENT_SECRET = 'S3DDXFJY50CHSIUYW4BLT0SSWJ3QVUS51HO5PQTN1HZH205X' # your Foursquare Secret
VERSION = '20180605'
LIMIT = 100 # A default Foursquare API limit value

##print('Your credentials:')
##print('CLIENT_ID: ' + CLIENT_ID)
##print('CLIENT_SECRET:' + CLIENT_SECRET)

def getNearbyVenues(boroughs, names, postcodes, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for borough, name, postcode, lat, lng in zip(boroughs, names, postcodes, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            borough,
            name,
            postcode,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough',
                  'Neighborhood',
                  'PostalCode',
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
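
The function above assumes every response contains at least one group and every venue at least one category, and it keeps the credentials in the notebook itself. The sketch below shows one way to loosen both assumptions. The environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, the helper names, and the `sample` payload are all hypothetical; the payload is hand-written to match the fields accessed above and is not real API output.

```python
import os
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'  # endpoint used above

def build_explore_url(lat, lng, radius=500, limit=100, version='20180605'):
    """Build the explore URL; credentials come from hypothetical env variables."""
    params = {
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_URL + '?' + urlencode(params)

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from a decoded response.

    A missing group gives an empty list and a missing category gives None,
    instead of raising IndexError or KeyError.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# hand-written payload shaped like the fields accessed above
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.46},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabeled Spot',
               'location': {'lat': 43.66, 'lng': -79.47},
               'categories': []}},
]}]}}

rows = extract_venues(sample)
url = build_explore_url(43.661608, -79.464763)
```

In `getNearbyVenues()`, `extract_venues(requests.get(url).json())` could then stand in for the chained indexing on the response.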

#### Run the above _getNearbyVenues()_ function on each neighborhood and create a new dataframe called _city_venues_.


In [153]:
city_venues = getNearbyVenues(boroughs=df_boroughs['Borough'],
                              names=df_boroughs['Neighborhood'],
                              postcodes=df_boroughs['PostalCode'],
                              latitudes=df_boroughs['Latitude'],
                              longitudes=df_boroughs['Longitude']
                             )
print(city_venues.shape)
print(city_venues['Borough'].unique())
print(len(city_venues['Neighborhood'].unique()))
print(len(city_venues['Venue Category'].unique()))

city_venues.head(1625)

(1624, 9)
['East Toronto' 'Central Toronto' 'Downtown Toronto' 'West Toronto']
39
237


Unnamed: 0,Borough,Neighborhood,PostalCode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,East Toronto,The Beaches,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,East Toronto,The Beaches,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,East Toronto,The Beaches,M4E,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,East Toronto,The Beaches,M4E,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,East Toronto,"The Danforth West, Riverdale",M4K,43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
...,...,...,...,...,...,...,...,...,...
1619,East Toronto,"Business reply mail Processing Centre, South C...",M7Y,43.662744,-79.321558,TTC Russell Division,43.664908,-79.322560,Light Rail Station
1620,East Toronto,"Business reply mail Processing Centre, South C...",M7Y,43.662744,-79.321558,Jonathan Ashbridge Park,43.664702,-79.319898,Park
1621,East Toronto,"Business reply mail Processing Centre, South C...",M7Y,43.662744,-79.321558,Olliffe On Queen,43.664503,-79.324768,Butcher
1622,East Toronto,"Business reply mail Processing Centre, South C...",M7Y,43.662744,-79.321558,ONE Academy,43.662253,-79.326911,Gym / Fitness Center


## Group all __'Toronto'__ boroughs

Let's check how many venues were returned for each __borough__

In [154]:
ff = city_venues.groupby(['Borough']).count() #[['Neighborhood Latitude','Neighborhood Longitude']]
ff.head()

Unnamed: 0_level_0,Neighborhood,PostalCode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Central Toronto,104,104,104,104,104,104,104,104
Downtown Toronto,1248,1248,1248,1248,1248,1248,1248,1248
East Toronto,119,119,119,119,119,119,119,119
West Toronto,153,153,153,153,153,153,153,153
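
`count()` repeats the same number in every column, which is why each row above shows one value eight times. A compact alternative, sketched here on a toy stand-in for `city_venues`, is `size()`, which returns a single count per group. The column name `Count` is an arbitrary choice.

```python
import pandas as pd

# toy stand-in for city_venues (rows are illustrative)
demo = pd.DataFrame({
    'Borough': ['East Toronto', 'East Toronto', 'West Toronto'],
    'Neighborhood': ['The Beaches', 'The Beaches', 'Runnymede, Swansea'],
    'Venue Category': ['Pub', 'Trail', 'Pub'],
})

# one count per borough
per_borough = demo.groupby('Borough').size()

# one count per (borough, neighborhood, category) combination
per_category = (demo.groupby(['Borough', 'Neighborhood', 'Venue Category'])
                    .size()
                    .reset_index(name='Count'))
```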


Let's check how many venues per __Venue Category__ were returned for each __Neighborhood__ in each __Borough__

In [155]:
ff = city_venues.groupby(['Borough','Neighborhood','Venue Category'], as_index=False).count() #[['Neighborhood Latitude','Neighborhood Longitude']]
ff.head(2000)

Unnamed: 0,Borough,Neighborhood,Venue Category,PostalCode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
0,Central Toronto,Davisville,Brewery,1,1,1,1,1,1
1,Central Toronto,Davisville,Café,2,2,2,2,2,2
2,Central Toronto,Davisville,Coffee Shop,2,2,2,2,2,2
3,Central Toronto,Davisville,Dessert Shop,3,3,3,3,3,3
4,Central Toronto,Davisville,Diner,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...
1080,West Toronto,"Runnymede, Swansea",Sandwich Place,1,1,1,1,1,1
1081,West Toronto,"Runnymede, Swansea",Smoothie Shop,1,1,1,1,1,1
1082,West Toronto,"Runnymede, Swansea",Sushi Restaurant,2,2,2,2,2,2
1083,West Toronto,"Runnymede, Swansea",Vegetarian / Vegan Restaurant,1,1,1,1,1,1


## Analyze all 'Toronto' __Borough__s and move categories from rows to columns


In [156]:
# one hot encoding
cities_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

# add boroughs/neighborhood column back to dataframe
cities_onehot['Borough'] = city_venues['Borough']
cities_onehot['Neighborhood'] = city_venues['Neighborhood']

# move the last column to the front
fixed_columns = [cities_onehot.columns[-1]] + list(cities_onehot.columns[:-1])
cities_onehot = cities_onehot[fixed_columns]

cities_onehot

Unnamed: 0,Borough,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1619,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1620,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1621,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1622,East Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
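
The venue data above includes a category that is literally named `Neighborhood` (the "Upper Beaches" row), so `get_dummies` already produces a column with that name. Assigning `cities_onehot['Neighborhood']` then appears to overwrite that column in place rather than append a new one, which would explain why `Borough` ends up in front. A small sketch of the collision and one way around it follows. The data is illustrative and the replacement name `Neighborhood (category)` is an arbitrary choice.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'The Beaches', 'Studio District'],
    'Venue Category': ['Trail', 'Neighborhood', 'Café'],
})

onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# True here: a venue category shares its name with the column we want to add
collides = 'Neighborhood' in onehot.columns

# rename the category column first, then put the real names in front
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (category)'})
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
```

`DataFrame.insert` raises a `ValueError` if the column already exists, so a missed collision fails loudly instead of silently replacing data.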


In [157]:
# one hot encoding
ff_onehot = pd.get_dummies(ff[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ff_onehot['Borough'] = ff['Borough']
ff_onehot['Neighborhood'] = ff['Neighborhood']

# move the last column to the front
fixed_columns = [ff_onehot.columns[-1]] + list(ff_onehot.columns[:-1])
ff_onehot = ff_onehot[fixed_columns]

print(ff_onehot.shape)
print(ff_onehot['Neighborhood'])
ff_onehot

(1085, 238)
0               Davisville
1               Davisville
2               Davisville
3               Davisville
4               Davisville
               ...        
1080    Runnymede, Swansea
1081    Runnymede, Swansea
1082    Runnymede, Swansea
1083    Runnymede, Swansea
1084    Runnymede, Swansea
Name: Neighborhood, Length: 1085, dtype: object


Unnamed: 0,Borough,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Central Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Central Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Central Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Central Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Central Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1080,West Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1081,West Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1082,West Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1083,West Toronto,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


#### Next, let's group rows by __neighborhood__ (in boroughs including 'Toronto' within their names) and take the mean of the frequency of occurrence of each category


In [158]:
cities_onehotM = ff_onehot.groupby('Neighborhood').mean().reset_index()
cities_onehotM.head(100)

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.071429,0.071429,0.071429,0.071429,0.071429,0.071429,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.022727,0.0,0.022727
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.019608,0.0,0.0,0.0,0.0,0.0,0.0,0.019608,0.0,...,0.019608,0.0,0.0,0.0,0.0,0.0,0.019608,0.0,0.0,0.019608
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019608,0.0,...,0.0,0.0,0.0,0.0,0.019608,0.0,0.0,0.019608,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
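
Because `ff` holds one row per neighborhood and category, the mean above gives every category present in a neighborhood the same weight, regardless of how many venues it has. If count-weighted frequencies are wanted instead, one option is `pd.crosstab` with `normalize='index'` applied to the raw venue rows. This is a sketch on illustrative data, not the method used in this notebook.

```python
import pandas as pd

# toy stand-in for the raw venue rows in city_venues
demo = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Café', 'Pub', 'Park'],
})

# each row sums to 1: the share of a neighborhood's venues in each category
freq = pd.crosstab(demo['Neighborhood'], demo['Venue Category'], normalize='index')
```

Here neighborhood `A` has two cafés and one pub, so its `Café` share is 2/3 rather than 1/2.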


#### Let's confirm the new size


In [159]:
cities_onehotM.shape

(39, 237)

#### Print each __neighborhood__ (in boroughs including 'Toronto' within their names) along with the top 5 most common venue types


In [160]:
num_top_venues = 5

for hood in cities_onehotM['Neighborhood']:
    print("----"+hood+"----")
    temp = cities_onehotM[cities_onehotM['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 6})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                 venue      freq
0                  Pub  0.022727
1          Coffee Shop  0.022727
2               Bistro  0.022727
3  Sporting Goods Shop  0.022727
4       Breakfast Spot  0.022727


----Brockton, Parkdale Village, Exhibition Place----
            venue      freq
0       Pet Store  0.055556
1    Intersection  0.055556
2  Breakfast Spot  0.055556
3            Café  0.055556
4         Stadium  0.055556


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                  venue    freq
0  Gym / Fitness Center  0.0625
1         Auto Workshop  0.0625
2           Pizza Place  0.0625
3            Comic Shop  0.0625
4      Recording Studio  0.0625


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
                venue      freq
0               Plane  0.071429
1    Sculpture Garden  0.071429
2  Airport Food Court  0.071429
3        Airport Gate  0.07

In [161]:
# merge the frequency table with merged_pbnll on the shared Neighborhood column
merged_nf = pd.merge(left=cities_onehotM, right=merged_pbnll, on='Neighborhood')

# move the last three columns to the front, one at a time
fixed_columns = [merged_nf.columns[-1]] + list(merged_nf.columns[:-1])
merged_nf = merged_nf[fixed_columns]
fixed_columns = [merged_nf.columns[-1]] + list(merged_nf.columns[:-1])
merged_nf = merged_nf[fixed_columns]
fixed_columns = [merged_nf.columns[-1]] + list(merged_nf.columns[:-1])
merged_nf = merged_nf[fixed_columns]
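
An inner merge on neighborhood names silently drops any name that does not match on both sides. A hedged sketch of how such a merge could be checked is shown below, using the `validate` and `indicator` options of `pd.merge` on two small illustrative frames.

```python
import pandas as pd

left = pd.DataFrame({'Neighborhood': ['A', 'B'], 'Café': [0.5, 0.0]})
right = pd.DataFrame({'Neighborhood': ['A', 'C'], 'PostalCode': ['M1A', 'M1C']})

# validate raises MergeError if a neighborhood name is duplicated on either side;
# indicator adds a _merge column showing where each row came from
checked = pd.merge(left, right, on='Neighborhood', how='left',
                   validate='one_to_one', indicator=True)

# neighborhoods present on the left but missing on the right
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Neighborhood'].tolist()
```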

In [162]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
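
For comparison, the same idea can be written with `Series.nlargest`, which avoids sorting the whole row. `top_categories` is a hypothetical helper name and the frequencies below are made up for illustration.

```python
import pandas as pd

def top_categories(row, n):
    """Return the n category names with the highest frequency (hypothetical helper)."""
    # the first entry is the neighborhood name; the rest are frequencies
    freqs = row.iloc[1:].astype(float)
    return list(freqs.nlargest(n).index)

row = pd.Series({'Neighborhood': 'Demo', 'Café': 0.30, 'Pub': 0.25,
                 'Park': 0.20, 'Gym': 0.15})
top2 = top_categories(row, 2)
```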

Now let's create the new dataframe and display the _top 10 venues_ for each __neighborhood__ (in boroughs including 'Toronto' within their names).


In [163]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = cities_onehotM['Neighborhood']

for ind in np.arange(cities_onehotM.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(cities_onehotM.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(40)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Café,Fish Market,Sporting Goods Shop,Bakery,Pharmacy,Liquor Store,Department Store,Basketball Stadium,Beach,Farmers Market
1,"Brockton, Parkdale Village, Exhibition Place",Gym,Bakery,Convenience Store,Performing Arts Venue,Pet Store,Coffee Shop,Climbing Gym,Restaurant,Café,Burrito Place
2,"Business reply mail Processing Centre, South C...",Pizza Place,Garden Center,Farmers Market,Butcher,Brewery,Burrito Place,Recording Studio,Garden,Auto Workshop,Fast Food Restaurant
3,"CN Tower, King and Spadina, Railway Lands, Har...",Boutique,Plane,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,Bar,Harbor / Marina
4,Central Bay Street,Yoga Studio,Poke Place,Miscellaneous Shop,Modern European Restaurant,Coffee Shop,New American Restaurant,Park,Café,Portuguese Restaurant,Comic Shop
5,Christie,Grocery Store,Coffee Shop,Candy Store,Athletics & Sports,Café,Italian Restaurant,Restaurant,Baby Store,Nightclub,Park
6,Church and Wellesley,Yoga Studio,Food & Drink Shop,Gay Bar,Health & Beauty Service,Hobby Shop,Hotel,Ice Cream Shop,Indian Restaurant,Japanese Restaurant,Juice Bar
7,"Commerce Court, Victoria Hotel",Gym,Coffee Shop,Monument / Landmark,Museum,New American Restaurant,Park,Café,Pizza Place,Burger Joint,Poke Place
8,Davisville,Gym,Pharmacy,Seafood Restaurant,Restaurant,Diner,Italian Restaurant,Farmers Market,Indoor Play Area,Café,Indian Restaurant
9,Davisville North,Department Store,Dog Run,Park,Gym / Fitness Center,Breakfast Spot,Dance Studio,Hotel,Sandwich Place,Food & Drink Shop,Discount Store
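
The `try`/`except` above produces correct suffixes for the first ten positions. If the number of top venues ever grew past 20, a small helper along these lines could build the column names instead; `ordinal` is a hypothetical name, not something defined elsewhere in this notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # numbers ending in 11, 12 or 13 take 'th' despite their last digit
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
```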


<a id='item4'></a>


## Cluster Neighborhoods (in boroughs including 'Toronto' within their names)


Run _k_-means to cluster the neighborhoods (in boroughs including 'Toronto' within their names) into 5 clusters.


In [164]:
# set number of clusters
kclusters = 5

cities_onehotM_clustering = cities_onehotM.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cities_onehotM_clustering)

# check cluster labels generated for each row in the dataframe
print(kmeans.labels_)

[0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 4 0 1 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0
 0 0]
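
Before interpreting the clusters, it can help to see how many neighborhoods fall under each label. A minimal sketch using the labels printed above:

```python
from collections import Counter

# labels copied from the printed kmeans.labels_ output above
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0,
          1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

sizes = Counter(labels)
for label in sorted(sizes):
    print('cluster label {}: {} neighborhoods'.format(label, sizes[label]))
```

Label 0 covers 34 of the 39 neighborhoods, so the other four clusters describe only one or two neighborhoods each.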


Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood (in boroughs including 'Toronto' within their names).


In [165]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
###print(neighborhoods_venues_sorted)

toronto_merged = df_boroughs

# join neighborhoods_venues_sorted onto df_boroughs to add cluster labels and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Health Food Store,Pub,Trail,Yoga Studio,Dog Run,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Donut Shop
12,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Yoga Studio,Café,Pub,Spa,Indian Restaurant,Ice Cream Shop,Italian Restaurant,Lounge,Bakery,Liquor Store
15,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,0,Gym,Pub,Sandwich Place,Liquor Store,Burrito Place,Italian Restaurant,Restaurant,Fast Food Restaurant,Fish & Chips Shop,Ice Cream Shop
17,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Yoga Studio,Convenience Store,Bookstore,Seafood Restaurant,Brewery,Café,Cheese Shop,Clothing Store,Pet Store,Coffee Shop
18,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Park,Bus Line,Swim School,Dim Sum Restaurant,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


Finally, let's visualize the resulting clusters (in boroughs including 'Toronto' within their names)


In [166]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Cluster 1 (in boroughs including 'Toronto' within their names)


In [167]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,East Toronto,0,Health Food Store,Pub,Trail,Yoga Studio,Dog Run,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Donut Shop
12,East Toronto,0,Yoga Studio,Café,Pub,Spa,Indian Restaurant,Ice Cream Shop,Italian Restaurant,Lounge,Bakery,Liquor Store
15,East Toronto,0,Gym,Pub,Sandwich Place,Liquor Store,Burrito Place,Italian Restaurant,Restaurant,Fast Food Restaurant,Fish & Chips Shop,Ice Cream Shop
17,East Toronto,0,Yoga Studio,Convenience Store,Bookstore,Seafood Restaurant,Brewery,Café,Cheese Shop,Clothing Store,Pet Store,Coffee Shop
20,Central Toronto,0,Department Store,Dog Run,Park,Gym / Fitness Center,Breakfast Spot,Dance Studio,Hotel,Sandwich Place,Food & Drink Shop,Discount Store
23,Central Toronto,0,Yoga Studio,Chinese Restaurant,Salon / Barbershop,Fast Food Restaurant,Spa,Diner,Sporting Goods Shop,Restaurant,Café,Mexican Restaurant
26,Central Toronto,0,Gym,Pharmacy,Seafood Restaurant,Restaurant,Diner,Italian Restaurant,Farmers Market,Indoor Play Area,Café,Indian Restaurant
31,Central Toronto,0,Coffee Shop,Pizza Place,Light Rail Station,Liquor Store,Restaurant,Bank,Bagel Shop,Pub,Supermarket,Fried Chicken Joint
35,Downtown Toronto,0,Gastropub,Caribbean Restaurant,Beer Store,Indian Restaurant,Japanese Restaurant,Sandwich Place,Jewelry Store,Restaurant,Butcher,Café
37,Downtown Toronto,0,Yoga Studio,Food & Drink Shop,Gay Bar,Health & Beauty Service,Hobby Shop,Hotel,Ice Cream Shop,Indian Restaurant,Japanese Restaurant,Juice Bar


#### Cluster 2 (in boroughs including 'Toronto' within their names)


In [168]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,Central Toronto,1,Playground,Trail,Yoga Studio,Department Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
33,Downtown Toronto,1,Park,Playground,Trail,Department Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


#### Cluster 3 (in boroughs including 'Toronto' within their names)


In [169]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Central Toronto,2,Music Venue,Garden,Yoga Studio,Department Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


#### Cluster 4 (in boroughs including 'Toronto' within their names)


In [170]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,Central Toronto,3,Park,Jewelry Store,Trail,Sushi Restaurant,Yoga Studio,Dessert Shop,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


#### Cluster 5 (in boroughs including 'Toronto' within their names)


In [171]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Central Toronto,4,Park,Bus Line,Swim School,Dim Sum Restaurant,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


## Observations for clusters of neighborhoods containing the word __'Toronto'__ in their respective __borough__ names:__*__

### Cluster 1 has a plurality of gyms, yoga studios and other wellness-focused opportunities, while also providing plenty of breweries, nightclubs, bars, specialty retail shops and a diverse array of dining options. It wouldn't take too many glances to realize that this represents a downtown area where young business professionals spend a lot of time and which tourists frequent.  Cluster 2 has a lot of parks and playgrounds, department and electronics stores, and Eastern European and dumpling restaurants.  While still in the city, a cluster like 2 could be more residential in nature and more family-focused than the 'downtown' feel assumed in cluster 1.  Cluster 3 appears more eclectic, with more artistic pursuits like music venues, yoga studios and gardens, but contains an extremely similar distribution of department and electronics stores, which suggests that it is less family-oriented than cluster 2 but still offers a diverse array of activities for residents who like live music and appreciate the beauty of nature (gardens).  Cluster 4 appears more domestic still than cluster 3, but includes more outdoor facilities such as parks and trails.  The data has a 'suburban' feel to it, and you can almost visualize sushi restaurants dotted between electronics stores and other restaurants common to the rest of the area in shopping centers.  Cluster 5 has more diverse, eclectic dining options than cluster 4, but few opportunities for a social nightlife without hopping on one of the bus lines common in the area.  It does have plenty of parks, escape rooms and, oddly, swim schools, where socialization is likely to occur.  The data suggests that this cluster is less 'suburban' than cluster 4, yet likely more mature and less family-oriented than cluster 2.

#### __*__ Cluster labels on the map above run from 0 to 4; label 0 corresponds to cluster 1 in the descriptions, 1 to cluster 2, 2 to cluster 3, 3 to cluster 4, and 4 to cluster 5.