# TORONTO NEIGHBORHOOD SCRAPE, SEGMENT, CLUSTER CAPSTONE
### explore, segment, and cluster the neighborhoods in the city of Toronto.
1. Create structured dataframe of Toronto data
   * To explore and cluster the neighborhoods in Toronto, scrape https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M .
        * Ignore neighborhoods/postal codes that are not assigned a borough.
        * A borough with no assigned neighborhood uses the borough name as its neighborhood.
   * Read the data into a pandas dataframe with column keys: Postal Code, Borough, Neighborhood.
        * Postal Code serves as the primary key.
        * Neighborhoods with the same Postal Code are concatenated into a single value: "N1,N2".
2. Comment code and provide Markdown titles.
3. The last line should display the dataframe and give the shape of the cleaned dataframe.
4. Make this the first notebook. Publish to github in an open access directory, provide link to github ipynb for first submission.
5. *Change name of notebook to xxx-v1.ipynb
6. Use Geocoder Python package: https://geocoder.readthedocs.io/index.html to get mapping data. 
    * Final output should be table with column keys: postal code, borough, neighborhood, latitude, longitude
    * May want to incorporate code:
                import geocoder # import geocoder
                lat_lng_coords = None
                while(lat_lng_coords is None):
                  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
                  lat_lng_coords = g.latlng
                latitude = lat_lng_coords[0]
                longitude = lat_lng_coords[1] 
7. Verify latitude and longitude http://cocl.us/Geospatial_data
8. Publish to Github, in an open access directory, provide link to github ipynb file for final submission.
9. *Change name to xxx-v2.ipynb
10. Explore and cluster the neighborhoods in Toronto. Use only boroughs that contain the word Toronto and replicate the same analysis as done with New York City data. 
    * use foursquare to identify categories of venues.
    * use folium to generate maps to visualize your neighborhoods and how they cluster together. 
11. Publish to Github, in an open access directory, provide link to github ipynb file for final submission.
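
Item 7 asks for the latitude and longitude to be verified against a reference file. One way to do that, sketched below using only the standard library, is to compute the great-circle (haversine) distance between two coordinate pairs and flag pairs that are implausibly far apart. The function name `haversine_km` and the choice of which points to compare are illustrative assumptions; the two coordinate pairs are the Toronto centre and the M2N postal code values that appear later in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Toronto centre (from the geocoder output) versus postal code M2N (from the coordinates table)
toronto = (43.653963, -79.387207)
m2n = (43.77012, -79.408493)
distance = haversine_km(*toronto, *m2n)
print("M2N is about {:.1f} km from the Toronto centre".format(distance))
```

A postal code that lands hundreds of kilometres from the city centre would point to a bad geocoding result or a mismatched merge key.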

## OVERVIEW
1. Load Relevant Libraries
2. Scrape the Wikipedia page.
3. Clean Toronto data.
4. Get postal code latitude and longitude.
5. Use Foursquare to pull data on venues in the 5 most populous Toronto boroughs.
6. Plot venues on a map of Toronto using Folium.
7. Analyze and cluster Toronto neighborhoods.

### 1. LOAD RELEVANT LIBRARIES
1. General processing
2. Scraping
3. HTML and image display
4. geolocation
5. mapping
6. clustering

In [1]:
#1. load General processing libraries
import pandas as pd 
import numpy as np 
import random 
import types
from botocore.client import Config
import ibm_boto3

#2. load libraries for scraping
from bs4 import BeautifulSoup
import requests # library to handle requests

#3. libraries for displaying HTML and images
from IPython.display import Image 
from IPython.display import display, HTML

#4. loading libraries for geolocation
# module to convert an address into latitude and longitude values#
#try:
#    import geocoder
#except:
#    !conda install -c conda-forge geocoder --yes 
#    import geocoder
#
try:
    from geopy.geocoders import Nominatim
except ImportError:
    !conda install -c conda-forge geopy --yes 
    from geopy.geocoders import Nominatim
from geopy.exc import GeopyError

#5. loading libraries for Mapping    
try:
    import folium # plotting library
except ImportError:
    !conda install -c conda-forge folium=0.5.0 --yes
    import folium # plotting library

#6. Loading libraries for clustering
# transforming JSON into a pandas dataframe
# (pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize)
from pandas import json_normalize
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans


### 2. SCRAPING WIKIPEDIA DATA
1. Load wikipedia page
2. Extract table
3. format HTML table into string type and read data into pd.DataFrame
4. Display wrangled data

In [2]:
#1. Load wikipedia page
# use requests.get to read in wikipedia list of toronto postal codes.
html_url =  'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_file = requests.get(html_url)

#2. Extract table
#structure website using beautiful soup
file = BeautifulSoup(html_file.content, 'lxml')
#Extract postal data table (review website coding to identify data is located in a table structure)
table = file.table

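#3. format HTML table into string type
#cast the postal data table HTML to a string and wrap it in StringIO so that pandas read_html
#receives a file-like object (newer pandas versions deprecate passing literal HTML strings).
from io import StringIO
toronto_raw_postal_wiki_df = pd.read_html(StringIO(str(table)))[0]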

#4. Display wrangled data
#verify raw postal data table
print("Raw dataframe contains {} postal codes".format(toronto_raw_postal_wiki_df.shape[0]))
toronto_raw_postal_wiki_df.head()

Raw dataframe contains 288 postal codes


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### 3. CLEAN TORONTO DATA
1. identify and eliminate rows with Borough value "Not assigned"
2. set Neighborhood to the Borough value if Neighborhood is "Not assigned"
3. group by postcode. If there are multiple neighborhoods, replace with comma delimited string.
4. Display cleaned table, show count of rows and columns.

In [3]:
#1. identify and eliminate rows with Borough value "Not assigned"
#Make logical index of rows where Borough is "Not assigned"
Bad_Borough_Index = toronto_raw_postal_wiki_df.loc[:,'Borough'].isin(['Not assigned'])
#Use logical negation to drop bad rows. Take an explicit copy so the assignment below
#modifies a new dataframe rather than a slice of the raw one (avoids SettingWithCopyWarning).
toronto_postal_df = toronto_raw_postal_wiki_df.loc[~Bad_Borough_Index,:].copy()

#2. set Neighbourhood to the Borough value if Neighbourhood is "Not assigned"
#Find rows where Neighbourhood is not assigned.
Bad_Neighborhood_Index = toronto_postal_df.loc[:,'Neighbourhood'].isin(['Not assigned'])
print(" {} rows were found to have missing Neighborhood values".format(sum(Bad_Neighborhood_Index)))
print()        
#Assign Borough value to Bad Neighborhood values.
toronto_postal_df.loc[Bad_Neighborhood_Index,'Neighbourhood']=toronto_postal_df.loc[Bad_Neighborhood_Index,'Borough']
#check data frame for Missing Neighborhood values.
Bad_Neighborhood_Index2 = toronto_postal_df.loc[:,'Neighbourhood'].isin(['Not assigned'])
print("After correction, {} rows were found to have missing Neighborhood values".format(sum(Bad_Neighborhood_Index2)))
toronto_postal_df = toronto_postal_df.sort_values(by=['Postcode']).reset_index(drop=True)
toronto_postal_df.head()

 1 rows were found to have missing Neighborhood values

After correction, 0 rows were found to have missing Neighborhood values


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,Rouge
1,M1B,Scarborough,Malvern
2,M1C,Scarborough,Port Union
3,M1C,Scarborough,Rouge Hill
4,M1C,Scarborough,Highland Creek


In [4]:
#3. group by postcode. If there are multiple neighborhoods, replace with comma delimited string.
#group by postcode, replace neighborhoods with list.
toronto_clean_df = pd.DataFrame(toronto_postal_df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(list))
toronto_clean_df.reset_index(inplace=True)  
toronto_clean_df['Neighbourhood'] = toronto_clean_df.Neighbourhood.apply(lambda x: ", ".join(x)).tolist()

#4. Display cleaned table, show count of rows and columns.
#print("The shape is {} rows (Postcodes) and {} columns (Pstcd-Brgh-Nghbrhd)".format(toronto_clean_df.shape[0],toronto_clean_df.shape[1]))
print(toronto_clean_df.shape)
print()
#HTML(toronto_clean_df.to_html())
toronto_clean_df.head()

(103, 3)



Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Port Union, Rouge Hill, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


### 4. GET LATITUDE AND LONGITUDE FROM POSTAL CODE
1. Download data file with latitude and longitude.
2. Merge by postal code with the clean Toronto borough table (the provided CSV replaces per-postal-code geocoding with geopy, which proved too time-consuming).
3. clean and display toronto map data table.

In [5]:
#@hidden_cell
def __iter__(self): return 0
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_824420d8c0ee4fceb89cbd2b7bd91213 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='vNXsdaKCFabapzUardOEbXs-CrncSWjxN6JhP2chFYEE',
    ibm_auth_endpoint="https://iam.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

In [6]:

body = client_824420d8c0ee4fceb89cbd2b7bd91213.get_object(Bucket='courseracapstone-donotdelete-pr-dwmizuffkolnrx',Key='Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_3 = pd.read_csv(body)

In [7]:
#1. Download data file with latitude and longitude.
# use the geospatial coordinates CSV loaded above from object storage (df_data_3).
#CA_df = pd.read_csv("Geospatial_Coordinates.csv")
CA_df = df_data_3
CA_df.rename({'Postal Code':'Postcode'},axis=1, inplace=True)

#2. Merge by postal code with clean toronto borough table
toronto_map_df = toronto_clean_df.merge(CA_df, how='outer', on='Postcode')

#3. clean and display toronto map data table.
HTML(toronto_map_df.to_html())

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Port Union, Rouge Hill, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Oakridge, Clairlea",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Scarborough Village West, Cliffside",43.716316,-79.239476
9,M1N,Scarborough,"Cliffside West, Birch Cliff",43.692657,-79.264848


### 5. LIST VENUES IN THE 5 MOST POPULOUS TORONTO BOROUGHS
1. Find the 5 most populous postal codes
    1. download census data
    2. merge population data into map data
    3. sort by population and take top 5 most populous.
2. Pull foursquare venue information on the postal codes.
    1. Define Foursquare Credentials and Version
    2. Query foursquare for venues by location (lat and long of top5, 1000 meter radius, LIMIT=30)
        * define query url stem
        * specify query parameter tuple (CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
        * call for top5, record JSON results in dictionary.
    3. Extract JSON venue location and category information, convert to dataframe. 
       * call get_category_type function
       * iterate through JSON dictionary, 
           * use NYC JSON cleaning routine, 
           * extract (name, postalCode, categories, lat, lng) assign to dataframe dictionary.
3. Clean dictionary of dataframes and Display top 3 entries for each
    1. remove venues with no postal code specified
    2. remove venues from different postal code (radius is inaccurate)
    3. standardize fields so easy to merge/join
    4. display top 3 entries for each postal code
4. Merge dictionary into dataframe.

In [8]:
# @hidden_cell
body = client_824420d8c0ee4fceb89cbd2b7bd91213.get_object(Bucket='courseracapstone-donotdelete-pr-dwmizuffkolnrx',Key='data_asset/Canada_demo.csv_shaped_HA5i5aAUSICLDh1vu50_AQ.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body)

In [9]:
# 1. Find the 5 most populous postal codes
# a. download census data
# https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Tables/File.cfm?T=1201&SR=1&RPP=9999&PR=0&CMA=0&CSD=0&S=22&O=A&Lang=Eng&OFT=CSV
#population_df = pd.read_csv("CANADA_demo.csv_shaped.csv")
population_df = df_data_1
population_df.rename({"Geographic code": "Postcode", "Population, 2016":"Population"},axis=1, inplace=True)
population_df = population_df[["Postcode","Population"]]

# b. merge population data into map data
toronto_demo_df = toronto_map_df.merge(population_df, how='inner', on='Postcode')

# c. sort by population.
POPLIMIT=5
toronto_5_df = toronto_demo_df.sort_values(by="Population", ascending=False).reset_index(drop=True)
toronto_5_df = toronto_5_df.iloc[:POPLIMIT,:5]
toronto_5_df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M2N,North York,Willowdale South,43.77012,-79.408493
1,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
2,M2J,North York,"Fairview, Oriole, Henry Farm",43.778517,-79.346556
3,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
4,M1V,Scarborough,"Milliken, Agincourt North, L'Amoreaux East, St...",43.815252,-79.284577


In [10]:
#@hidden_cell
#2. Pull foursquare venue information on the postal codes.
#a. Define Foursquare Credentials and Version
CLIENT_ID = '3QJ2HBBGDVKEXJFVQGSQZPVM0STTPTPVUGYO5B1LGB4HXDQS'
CLIENT_SECRET = 'R1IIYYE1BRTBDJLWH3COP3GZMVUNPMAWGAPTABJR0WTMXS1C' 

In [11]:
#2. Pull foursquare venue information on the postal codes.
#b Query foursquare for venues by location (lat and long of top5)
# * define query url stem
url_query_stem = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'

#* specify query parameter tuple (CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
LIMIT =30 
radius = 1000
VERSION = '20180604'

# * call for top5, record results in dictionary.
records_dict = dict.fromkeys(toronto_5_df['Postcode'])
for idx, postalcode in enumerate(toronto_5_df['Postcode']):
    latitude = toronto_5_df.loc[idx,'Latitude']
    longitude = toronto_5_df.loc[idx,'Longitude']
    url_query = url_query_stem.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
    records_dict[postalcode] = requests.get(url_query).json()
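
As an alternative to filling a format string by position, the query URL can be assembled from named parameters, which makes it harder to swap two values by accident. The sketch below uses only the standard library; the helper name `build_explore_url` and the placeholder credentials are assumptions for illustration, while the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      version="20180604", radius=1000, limit=30):
    """Build the venue-explore URL from named parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-escapes values, so the comma in 'll' becomes %2C
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials; real values would come from your own Foursquare account.
url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", 43.77012, -79.408493)
print(url)
```

The same dictionary could instead be passed to `requests.get(EXPLORE_ENDPOINT, params=params)`, which performs the encoding itself.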


In [12]:
#3. Extract JSON venue location and category information, convert to dataframe. 
#* call get_category_type function
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened Foursquare rows store the list under 'venue.categories'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# * iterate through JSON dictionary, assign to dataframe dictionary.
dataframe_dict = dict.fromkeys(toronto_5_df['Postcode'])

for postalcode in toronto_5_df['Postcode']:
    #readout category relevant foursquare data.
    items = records_dict[postalcode]['response']['groups'][0]['items']
    
    #Run NY JSON data extraction routine
    # flatten JSON
    dataframe = json_normalize(items)
    # filter columns
    filtered_columns = ['venue.name', 'venue.categories'] + [col for col in dataframe.columns if col.startswith('venue.location.')] + ['venue.id']
    # copy so the column assignment below does not trigger SettingWithCopyWarning
    dataframe_filtered = dataframe.loc[:, filtered_columns].copy()
    # filter the category for each row
    dataframe_filtered['venue.categories'] = dataframe_filtered.apply(get_category_type, axis=1)
    # clean columns
    dataframe_filtered.columns = [col.split('.')[-1] for col in dataframe_filtered.columns]
    
    #extract (name, postalCode, categories, lat, lng) assign to dataframe dictionary.
    dataframe_dict[postalcode] = dataframe_filtered[['name','postalCode','categories','lat','lng']]
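
To make the flattening step concrete, the sketch below runs `pd.json_normalize` (pandas 1.0 or later) on two hand-written items. The nested layout is an assumption modelled on the column names used above (`venue.name`, `venue.categories`, `venue.location.*`); the venue names and values are made up.

```python
import pandas as pd

# Hand-written items mimicking the assumed shape of response['groups'][0]['items']
items = [
    {"venue": {"name": "Example Cafe",
               "categories": [{"name": "Cafe"}],
               "location": {"lat": 43.77, "lng": -79.41, "postalCode": "M2N 5P2"}}},
    {"venue": {"name": "Example Park",
               "categories": [],
               "location": {"lat": 43.76, "lng": -79.40}}},  # no postal code, no category
]

# Nested keys become dotted column names such as 'venue.location.lat'
flat = pd.json_normalize(items)

# Keep only the first category name, or None when the list is empty
flat["category"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Strip the dotted prefixes: 'venue.location.lat' -> 'lat'
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat[["name", "category", "lat", "lng", "postalCode"]])
```

The second item shows why the cleaning step that follows is needed: a venue without a postal code surfaces as a missing value after flattening.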


In [13]:
#3. Clean dictionary of dataframes and Display top 3 entries for each
for idx,postalcode in enumerate(toronto_5_df['Postcode']):
    #a. remove venues with no postal code specified
    #drop rows with postal code nan
    dataframe_dict[postalcode] = dataframe_dict[postalcode].dropna(subset=['postalCode'], axis=0)
    
    #b. remove venues from different postal code (radius is inaccurate)
    #check that all venues are from postal code, if not label nan
    dataframe_dict[postalcode]['Postcode'] = dataframe_dict[postalcode]['postalCode'].apply(lambda x: postalcode if x.startswith(postalcode) else np.nan)
    #drop venues with nan in Postcode
    dataframe_dict[postalcode] = dataframe_dict[postalcode].dropna(subset=['Postcode'], axis=0)
    
    #c. standardize fields so easy to merge/join
    dataframe_dict[postalcode] = dataframe_dict[postalcode].rename({'name':'venue', 
                                                                    'postalCode': 'Postalcode', 
                                                                    'categories':'Category', 
                                                                    'lat':'Latitude', 
                                                                    'lng':'Longitude'},
                                                                   axis=1)   
    #make index orderly
    dataframe_dict[postalcode] = dataframe_dict[postalcode].reset_index(drop=True)
    #Put postcode first
    dataframe_dict[postalcode]= dataframe_dict[postalcode].loc[:,['Postcode',
                                                                  'Postalcode',
                                                                  'venue',
                                                                  'Category',
                                                                  'Latitude',
                                                                  'Longitude']]
    #d. display top 3 entries for each postal code
    print("{}. First 3 Venues for postal code {}".format(idx+1, postalcode))
    display(HTML(dataframe_dict[postalcode].head(3).to_html()))
    print()
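
Steps a and b (drop venues with no postal code, then drop venues from a different postal code) can be combined into one vectorized mask. This is a sketch on made-up rows; `Series.str.startswith` with `na=False` treats a missing postal code as a non-match, so both filters happen in a single pass.

```python
import numpy as np
import pandas as pd

# Toy venues; names and postal codes are illustrative only.
venues = pd.DataFrame({
    "name": ["Venue A", "Venue B", "Venue C"],
    "postalCode": ["M2N 5P2", np.nan, "M2M 1A1"],
})

target = "M2N"
# True only where a postal code exists and begins with the target prefix
in_area = venues["postalCode"].str.startswith(target, na=False)
kept = venues[in_area].reset_index(drop=True)
print(kept)
```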

1. First 3 Venues for postal code M2N


Unnamed: 0,Postcode,Postalcode,venue,Category,Latitude,Longitude
0,M2N,M2N 5P2,Konjiki Ramen,Ramen Restaurant,43.766998,-79.412222
1,M2N,M2N 5P1,The Keg,Steakhouse,43.766579,-79.412131
2,M2N,M2N 5R4,The Captain's Boil,Seafood Restaurant,43.773255,-79.413805



2. First 3 Venues for postal code M1B


Unnamed: 0,Postcode,Postalcode,venue,Category,Latitude,Longitude
0,M1B,M1B 3W3,Images Salon & Spa,Spa,43.802283,-79.198565
1,M1B,M1B,Caribbean Wave,Caribbean Restaurant,43.798558,-79.195777
2,M1B,M1B 5N7,Staples Morningside,Paper / Office Supplies Store,43.800285,-79.196607



3. First 3 Venues for postal code M2J


Unnamed: 0,Postcode,Postalcode,venue,Category,Latitude,Longitude
0,M2J,M2J 5A7,The LEGO Store,Toy / Game Store,43.778207,-79.343483
1,M2J,M2J 5A7,CF Fairview Mall,Shopping Mall,43.77775,-79.344105
2,M2J,M2J 5A7,SilverCity Fairview Mall Cinemas,Movie Theater,43.778681,-79.344085



4. First 3 Venues for postal code M9V


Unnamed: 0,Postcode,Postalcode,venue,Category,Latitude,Longitude
0,M9V,M9V 1B4,Sheriff's No Frills,Grocery Store,43.741968,-79.586639
1,M9V,M9V 3Y5,Subway,Sandwich Place,43.742421,-79.589471
2,M9V,M9V 1B4,Shoppers Drug Mart,Pharmacy,43.740832,-79.583347



5. First 3 Venues for postal code M1V


Unnamed: 0,Postcode,Postalcode,venue,Category,Latitude,Longitude
0,M1V,M1V 0B3,Jim Chai Kee Wonton Noodle 沾仔記,Noodle House,43.814783,-79.293138
1,M1V,M1V 5B5,DaanGo Cake Lab,Bakery,43.809334,-79.290442
2,M1V,M1V 5P1,The Brighton Convention & Event Centre,Event Space,43.81357,-79.295421





In [14]:
#4. Merge dictionary into dataframe.
# stack the per-postal-code venue tables in one call instead of growing a dataframe in a loop
toronto_top5_venues_df = pd.concat([dataframe_dict[postalcode] for postalcode in toronto_5_df['Postcode']],
                                   ignore_index=True)
# attach borough and neighbourhood names to every venue
toronto_top5_venues_df=toronto_5_df[['Postcode','Borough','Neighbourhood']].merge(toronto_top5_venues_df,
                                                          how='outer',
                                                          on='Postcode')
toronto_top5_venues_df

Unnamed: 0,Postcode,Borough,Neighbourhood,Postalcode,venue,Category,Latitude,Longitude
0,M2N,North York,Willowdale South,M2N 5P2,Konjiki Ramen,Ramen Restaurant,43.766998,-79.412222
1,M2N,North York,Willowdale South,M2N 5P1,The Keg,Steakhouse,43.766579,-79.412131
2,M2N,North York,Willowdale South,M2N 5R4,The Captain's Boil,Seafood Restaurant,43.773255,-79.413805
3,M2N,North York,Willowdale South,M2N 6Z4,Loblaws,Grocery Store,43.768648,-79.412597
4,M2N,North York,Willowdale South,M2N 6L7,Starbucks,Coffee Shop,43.768192,-79.413021
5,M2N,North York,Willowdale South,M2N 5P2,Satay Sate,Indonesian Restaurant,43.766690,-79.412100
6,M2N,North York,Willowdale South,M2N 6Z4,Cineplex Cinemas Empress Walk,Movie Theater,43.768625,-79.412613
7,M2N,North York,Willowdale South,M2N 6R8,Toronto Centre for the Arts,Theater,43.766228,-79.414115
8,M2N,North York,Willowdale South,M2N 5N4,Sushi Moto Sake & Wine Bar,Sushi Restaurant,43.763902,-79.411559
9,M2N,North York,Willowdale South,M2N 3G1,MYMY Chicken,Fried Chicken Joint,43.764658,-79.411096


### 6. Plot Venues on Map of Toronto Using Folium
1. Get latitude and Longitude of Toronto
2. Create a map of Toronto with venues superimposed on top.
3. Zoom in on North York.

In [15]:
#1. Get latitude and Longitude of Toronto
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

#2. Create a map of Toronto with venues superimposed on top.
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add one marker per venue, labelled with its neighbourhood and borough
for lat, lng, borough, neighborhood in zip(toronto_top5_venues_df['Latitude'], 
                                            toronto_top5_venues_df['Longitude'], 
                                            toronto_top5_venues_df['Borough'], 
                                            toronto_top5_venues_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [16]:
#3. Zoom in on North York
NorthYork = toronto_5_df.loc[0,:]
latitude = NorthYork.Latitude
longitude = NorthYork.Longitude

nyork_venues_df = toronto_top5_venues_df.loc[toronto_top5_venues_df.Postcode.isin(['M2N']),:]

# create map of North York (postal code M2N) using latitude and longitude values
map_northyork = folium.Map(location=[latitude, longitude], zoom_start=14)

# add markers to map
for lat, lng, label in zip(nyork_venues_df['Latitude'], 
                           nyork_venues_df['Longitude'], 
                           nyork_venues_df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_northyork)  
    
map_northyork

### 7. Analyze and cluster Top 5 Toronto neighborhoods using NYC template.
1. Create Toronto Venue Category Feature Vector
    * compute frequency of category.
    * display top 5 and top 10 venues in each neighborhood
2. Cluster Neighborhoods
3. Visualize clusters using map

In [17]:
#1. Create Toronto Venue Category Feature Vector
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_top5_venues_df[['Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_top5_venues_df['Neighbourhood']
# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
# add Postcode column back to dataframe
toronto_onehot['Postcode'] = toronto_top5_venues_df['Postcode']
# move Postcode column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_grouped = toronto_onehot.groupby(['Postcode','Neighbourhood']).mean().reset_index()
toronto_grouped

Unnamed: 0,Postcode,Neighbourhood,American Restaurant,Arts & Crafts Store,Auto Workshop,BBQ Joint,Bakery,Bank,Beer Store,Bubble Tea Shop,...,Shopping Mall,Smoothie Shop,Spa,Steakhouse,Supermarket,Sushi Restaurant,Tea Room,Theater,Toy / Game Store,Video Store
0,M1B,"Rouge, Malvern",0.0,0.0,0.076923,0.0,0.076923,0.0,0.0,0.0,...,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1V,"Milliken, Agincourt North, L'Amoreaux East, St...",0.0,0.0,0.0,0.058824,0.117647,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M2J,"Fairview, Oriole, Henry Farm",0.038462,0.0,0.0,0.0,0.076923,0.038462,0.0,0.0,...,0.038462,0.038462,0.0,0.0,0.0,0.0,0.038462,0.0,0.038462,0.0
3,M2N,Willowdale South,0.0,0.04,0.0,0.0,0.04,0.0,0.0,0.04,...,0.0,0.0,0.0,0.04,0.04,0.08,0.0,0.04,0.0,0.0
4,M9V,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667


In [18]:
toronto_grouped.shape

(5, 62)
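
The feature vector built above is simply the share of each venue category within a postal code: one-hot encode the category, then average the indicator columns per group. The sketch below shows this on four made-up venues so the arithmetic can be checked by hand.

```python
import pandas as pd

# Toy venues; three in M1B (two bakeries, one gym) and one in M2N.
venues = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1B", "M2N"],
    "Category": ["Bakery", "Bakery", "Gym", "Bakery"],
})

# One indicator column per category (1.0 where the venue has that category)
onehot = pd.get_dummies(venues["Category"], dtype=float)
onehot.insert(0, "Postcode", venues["Postcode"])

# The mean of an indicator column is the fraction of venues in that category
freq = onehot.groupby("Postcode").mean()
print(freq)
```

For M1B this gives 2/3 for Bakery and 1/3 for Gym, so every row of the frequency table sums to 1 as long as each venue has exactly one category.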

In [19]:
# display top 5 and top 10 venues in each neighborhood
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Rouge, Malvern----
                     venue  freq
0     Fast Food Restaurant  0.15
1           Sandwich Place  0.08
2                      Gym  0.08
3              Coffee Shop  0.08
4  Fruit & Vegetable Store  0.08


----Milliken, Agincourt North, L'Amoreaux East, Steeles East----
                venue  freq
0  Chinese Restaurant  0.24
1              Bakery  0.12
2         Pizza Place  0.12
3          Hobby Shop  0.06
4    Malay Restaurant  0.06


----Fairview, Oriole, Henry Farm----
                 venue  freq
0       Clothing Store  0.15
1          Coffee Shop  0.12
2               Bakery  0.08
3  American Restaurant  0.04
4            Juice Bar  0.04


----Willowdale South----
                   venue  freq
0      Korean Restaurant  0.08
1       Ramen Restaurant  0.08
2       Sushi Restaurant  0.08
3          Movie Theater  0.04
4  Indonesian Restaurant  0.04


----Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, South Steeles, Thistletown, Silverstone---

In [20]:
# display top 5 and top 10 venues in each neighborhood
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the 3rd fall outside 'indicators' and take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Rouge, Malvern",Fast Food Restaurant,Chinese Restaurant,Sandwich Place,Greek Restaurant,Fruit & Vegetable Store,Paper / Office Supplies Store,Coffee Shop,Caribbean Restaurant,Gym,Bakery
1,"Milliken, Agincourt North, L'Amoreaux East, St...",Chinese Restaurant,Pizza Place,Bakery,Park,Noodle House,Malay Restaurant,Dessert Shop,Korean Restaurant,Coffee Shop,Event Space
2,"Fairview, Oriole, Henry Farm",Clothing Store,Coffee Shop,Bakery,American Restaurant,Restaurant,Electronics Store,Toy / Game Store,Department Store,Japanese Restaurant,Juice Bar
3,Willowdale South,Korean Restaurant,Ramen Restaurant,Sushi Restaurant,Coffee Shop,Indonesian Restaurant,Movie Theater,Café,Burrito Place,Creperie,Gym
4,"Albion Gardens, Beaumond Heights, Humbergate, ...",Pizza Place,Grocery Store,Video Store,Sandwich Place,Beer Store,Coffee Shop,Construction & Landscaping,Fast Food Restaurant,Fried Chicken Joint,Liquor Store
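
The column labels above rely on a three-element suffix list with a fallback to "th", which is correct for ranks 1 to 20 but would produce labels such as "21th" if more columns were requested. A small helper, sketched here under the hypothetical name `ordinal`, handles any positive rank.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    # 11, 12 and 13 (and 111, 112, ...) always take 'th'
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Build the same style of column labels as in the cell above
num_top_venues = 10
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(rank)) for rank in range(1, num_top_venues + 1)
]
print(columns)
```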


In [21]:
#2. Cluster Neighborhoods
# set number of clusters
kclusters = 2

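# drop the label columns so that only the category frequencies are clustered
# (the positional axis argument to drop is no longer accepted by recent pandas versions)
toronto_grouped_clustering = toronto_grouped.drop(columns=['Postcode', 'Neighbourhood'])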
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:3] 
toronto_grouped_clustering.shape

(5, 60)
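
With only five rows to cluster, k-means can do little more than split the postal codes into coarse groups. The sketch below illustrates the call on four made-up frequency vectors that form two obvious groups. Passing `n_init` explicitly is a choice made here because its default has changed between scikit-learn releases; the values in `X` are not taken from the Toronto data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency vectors: rows are postal codes, columns are venue categories.
X = np.array([
    [0.8, 0.2, 0.0],
    [0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# random_state fixes the initialisation so repeated runs give the same labels
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary; only which rows share a label is meaningful.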

In [22]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_top5_venues_df

# join the cluster labels and top-10 venue columns onto the venue table, matching on neighbourhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.tail() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Postalcode,venue,Category,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
91,M1V,Scarborough,"Milliken, Agincourt North, L'Amoreaux East, St...",M1V 5K5,Dong Kee Chinese Restaurant,Chinese Restaurant,43.819558,-79.294536,1,Chinese Restaurant,Pizza Place,Bakery,Park,Noodle House,Malay Restaurant,Dessert Shop,Korean Restaurant,Coffee Shop,Event Space
92,M1V,Scarborough,"Milliken, Agincourt North, L'Amoreaux East, St...",M1V 5L6,Fragrant Bakery,Bakery,43.813985,-79.291458,1,Chinese Restaurant,Pizza Place,Bakery,Park,Noodle House,Malay Restaurant,Dessert Shop,Korean Restaurant,Coffee Shop,Event Space
93,M1V,Scarborough,"Milliken, Agincourt North, L'Amoreaux East, St...",M1V 1K4,Reginos Pizza,Pizza Place,43.81056,-79.28012,1,Chinese Restaurant,Pizza Place,Bakery,Park,Noodle House,Malay Restaurant,Dessert Shop,Korean Restaurant,Coffee Shop,Event Space
94,M1V,Scarborough,"Milliken, Agincourt North, L'Amoreaux East, St...",M1V 1R7,Alexmuir Park,Park,43.80864,-79.282189,1,Chinese Restaurant,Pizza Place,Bakery,Park,Noodle House,Malay Restaurant,Dessert Shop,Korean Restaurant,Coffee Shop,Event Space
95,M1V,Scarborough,"Milliken, Agincourt North, L'Amoreaux East, St...",M1V 5B5,South China Noodles 桂林粉麵食店,Chinese Restaurant,43.808849,-79.290205,1,Chinese Restaurant,Pizza Place,Bakery,Park,Noodle House,Malay Restaurant,Dessert Shop,Korean Restaurant,Coffee Shop,Event Space


In [23]:
#3. create map of resulting clusters
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], 
                                  toronto_merged['Longitude'], 
                                  toronto_merged['Neighbourhood'], 
                                  toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
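
The map cell picks marker colors with `rainbow[cluster-1]`, which works for label 0 only because Python's negative indexing wraps around to the last color. A more direct alternative, sketched below, builds one hex color per cluster and indexes it by the label itself. The helper name `cluster_palette` is an assumption for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(n_clusters):
    """Return n_clusters hex colors spaced evenly along the rainbow colormap."""
    rgba_rows = cm.rainbow(np.linspace(0, 1, n_clusters))
    return [colors.rgb2hex(rgba) for rgba in rgba_rows]


kclusters = 2
palette = cluster_palette(kclusters)

# Map each k-means label (0 .. kclusters-1) straight to its color
color_for = {label: palette[label] for label in range(kclusters)}
print(color_for)
```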