# Segmenting and Clustering Neighborhoods in Toronto

> - <a href="#Part1">First Part</a>
> - <a href="#Part2">Second Part</a>
> - <a href="#Part3">Third Part</a>
<br/>
# <a name="Part1">First Part</a>
## I.1. Import Libraries

In [1]:
# Needed libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import sklearn.cluster

import csv

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## I.2. Scrape Wikipedia Page Content

In [2]:
# Scraping data
page_url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_content= requests.get(page_url)

## I.3. Create the Content Dataframe

In [3]:
df_page = pd.read_html(page_content.content, header=0)[0]
df_page

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...


## I.4. Ignore cells with a borough that is Not assigned.

In [4]:
df_page_na=df_page[df_page.Borough != 'Not assigned']
df_page_na

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
160,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,Business reply mail Processing Centre
169,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


## I.5. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood 

> We create a new index and drop the old one.

In [5]:
df_page_na = df_page_na.reset_index(drop=True)
df_page_na

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


## I.6. Test if a cell has a "Not assigned" neighborhood

In [6]:
# Report whether any remaining row still has a "Not assigned" neighborhood
if (df_page_na['Neighborhood'] == 'Not assigned').any():
    print('yes')
else:
    print("Nothing to do here")

Nothing to do here


## I.7. Test if more than one neighborhood can exist in one postal code area

In [7]:
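# Look for postal codes that appear on more than one row (not only fully identical rows)
duplicate_code = df_page_na[df_page_na.duplicated(subset='Postal code')]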
if duplicate_code.empty:
    print("Nothing to do here")
else:
    print("You have to do more coding...")

Nothing to do here
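
If the check above had found a postal code on more than one row, one possible follow-up is to collapse those rows into a single row and join the neighborhood names. The following is only a sketch on made-up rows: the toy data and the `", "` separator are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy rows (hypothetical): M5A appears twice with different neighborhoods
toy = pd.DataFrame({
    'Postal code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per (postal code, borough); neighborhood names joined in order of appearance
combined = (
    toy.groupby(['Postal code', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```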


## I.8. Print Dataframe Shape

In [8]:
df_shape = df_page_na.shape
print("The Dataframe Shape is ", df_shape)

The Dataframe Shape is  (103, 3)


# <a name="Part2">Second Part</a>
## II.1. Create Geospatial Dataframe

In [9]:
df_geo = pd.read_csv("Geospatial_Coordinates.csv") 
# Preview the first 5 lines of the loaded data 
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


## II.2 Concatenate Dataframes

In [10]:
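# NOTE: axis=1 pairs rows by index position, not by postal code, so the coordinates
# in a row (e.g. M1B) do not belong to the neighborhood beside them (e.g. M3A)
df_concat = pd.concat([df_geo, df_page_na], ignore_index=False, axis=1)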
df_concat

Unnamed: 0,Postal Code,Latitude,Longitude,Postal code,Borough,Neighborhood
0,M1B,43.806686,-79.194353,M3A,North York,Parkwoods
1,M1C,43.784535,-79.160497,M4A,North York,Victoria Village
2,M1E,43.763573,-79.188711,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M1G,43.770992,-79.216917,M6A,North York,Lawrence Manor / Lawrence Heights
4,M1H,43.773136,-79.239476,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...,...,...,...
98,M9N,43.706876,-79.518188,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,M9P,43.696319,-79.532242,M4Y,Downtown Toronto,Church and Wellesley
100,M9R,43.688905,-79.554724,M7Y,East Toronto,Business reply mail Processing Centre
101,M9V,43.739416,-79.588437,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...
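
The table above shows `Postal Code` and `Postal code` disagreeing on every row, because concatenating along `axis=1` aligns rows by index rather than by value. One way to pair each neighborhood with its own coordinates is to merge on the postal code. This is a minimal sketch on toy frames, not the notebook's method; the coordinate values are placeholders, not real data.

```python
import pandas as pd

# Toy stand-ins for df_geo and df_page_na; the coordinates are placeholders
geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M3A', 'M4A'],
    'Latitude': [1.0, 2.0, 3.0],
    'Longitude': [-1.0, -2.0, -3.0],
})
page = pd.DataFrame({
    'Postal code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

# Match rows on the postal code value rather than on row position
merged = page.merge(geo, left_on='Postal code', right_on='Postal Code', how='left')
merged = merged.drop(columns='Postal Code')  # drop the duplicated key column
print(merged)
```

With `how='left'`, every scraped row is kept, and a postal code missing from the coordinates file would show up as `NaN` rather than silently shifting the other rows.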


# <a name="Part3">Third Part</a>

## III.1 Get Foursquare Credentials

In [12]:

#Foursquare credentials 
CLIENT_ID = 'CNDJIS2MT5OZCATEL3J33X3AZQI41MVAM0OJIFUFJHGYOPVT' # your Foursquare ID
CLIENT_SECRET = 'MO3DRHIE0PFAJWQE1AIS3K4VY1SQQVRHL0N1K2YMWXNTCXO4' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: CNDJIS2MT5OZCATEL3J33X3AZQI41MVAM0OJIFUFJHGYOPVT
CLIENT_SECRET:MO3DRHIE0PFAJWQE1AIS3K4VY1SQQVRHL0N1K2YMWXNTCXO4
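
Credentials written into a cell end up in every copy of the notebook, including its printed output. A common alternative is to read them from environment variables. The sketch below assumes two variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are hypothetical and not defined anywhere in this notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from an environment-like mapping."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Set {} before running the notebook'.format(missing)) from None

# Usage with a stand-in mapping; in the notebook, call it with no argument
client_id, client_secret = load_foursquare_credentials(
    {'FOURSQUARE_CLIENT_ID': 'example-id', 'FOURSQUARE_CLIENT_SECRET': 'example-secret'}
)
print('CLIENT_ID loaded:', bool(client_id))
```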


## III.2 Get Toronto Venues 
> You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

### Based on the Coursera Lab function

In [13]:
def getNearbyVenues(names, latitudes, longitudes):
    radius=500
    LIMIT=100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    toronto_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    toronto_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return toronto_venues

toronto_venues = getNearbyVenues(names=df_concat['Neighborhood'],
                                   latitudes=df_concat['Latitude'],
                                   longitudes=df_concat['Longitude']
                                  )



Parkwoods
Victoria Village
Regent Park / Harbourfront
Lawrence Manor / Lawrence Heights
Queen's Park / Ontario Provincial Government
Islington Avenue
Malvern / Rouge
Don Mills
Parkview Hill / Woodbine Gardens
Garden District / Ryerson
Glencairn
West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale
Rouge Hill / Port Union / Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate / Bloordale Gardens / Old Burnhamthorpe / Markland Wood
Guildwood / Morningside / West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor / Wilson Heights / Downsview North
Thorncliffe Park
Richmond / Adelaide / King
Dufferin / Dovercourt Village
Scarborough Village
Fairview / Henry Farm / Oriole
Northwood Park / York University
East Toronto
Harbourfront East / Union Station / Toronto Islands
Little Portugal / Trinity
Kennedy Park / Ionview / East Birchmount Park
Bayview Village
D
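
`getNearbyVenues` indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a response with no groups or a venue with no category would raise an error partway through the loop. The following sketch shows one more tolerant way to extract the same fields. The helper name `venue_rows` and the sample payload are made up for illustration; only the key names come from the function above.

```python
def venue_rows(payload, name, lat, lng):
    """Turn one explore response (already parsed from JSON) into row tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Made-up payload in the shape the notebook's function indexes into
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 1.0, 'lng': 2.0},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 3.0, 'lng': 4.0},
               'categories': []}},
]}]}}

rows = venue_rows(sample, 'Parkwoods', 0.0, 0.0)
empty = venue_rows({'response': {}}, 'Parkwoods', 0.0, 0.0)
print(rows, empty)
```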

> Let's check how many venues were returned for each neighborhood

In [26]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,22,22,22,22,22,22
Bathurst Manor / Wilson Heights / Downsview North,20,20,20,20,20,20
Bayview Village,18,18,18,18,18,18
Bedford Park / Lawrence Manor East,78,78,78,78,78,78
Berczy Park,1,1,1,1,1,1
...,...,...,...,...,...,...
Willowdale / Newtonbrook,78,78,78,78,78,78
Woburn,34,34,34,34,34,34
Woodbine Heights,2,2,2,2,2,2
York Mills / Silver Hills,7,7,7,7,7,7


> Print the number of different categories

In [15]:
print('There are {} unique categories.'.format(toronto_venues['Venue Category'].nunique()))

There are 269 unique categories.


## III.3 Analyze Toronto Neighborhoods

In [16]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## III.4 Grouping by neighborhoods and getting the frequency for each category

In [17]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Agincourt,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.0,0.0,0.0
1,Bathurst Manor / Wilson Heights / Downsview North,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.05,0.000000,0.000000,0.000000,0.0,0.0,0.0
2,Bayview Village,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.00,0.000000,0.055556,0.000000,0.0,0.0,0.0
3,Bedford Park / Lawrence Manor East,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,...,0.0,0.012821,0.000000,0.00,0.000000,0.000000,0.012821,0.0,0.0,0.0
4,Berczy Park,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,Willowdale / Newtonbrook,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012821,...,0.0,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.0,0.0,0.0
89,Woburn,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.00,0.029412,0.000000,0.000000,0.0,0.0,0.0
90,Woodbine Heights,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.0,0.0,0.0
91,York Mills / Silver Hills,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.0,0.0,0.0
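
To see why the grouped values read as frequencies, it may help to run the same two steps on a tiny made-up table. Taking the mean of one-hot columns within a neighborhood gives the share of that neighborhood's venues in each category, so every row sums to 1. The neighborhood names `A` and `B` and their venues below are invented for the example.

```python
import pandas as pd

# Toy venues: neighborhood A has four venues, B has one
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Café', 'Bank', 'Park', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# Mean of the indicator columns = share of venues per category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```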


## III.5 Print each neighborhood along with the top 5 most common venues

In [19]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
               venue  freq
0               Café  0.14
1     Breakfast Spot  0.09
2        Coffee Shop  0.09
3  Convenience Store  0.05
4                Bar  0.05


----Bathurst Manor / Wilson Heights / Downsview North----
            venue  freq
0            Bank  0.10
1     Coffee Shop  0.10
2     Pizza Place  0.05
3  Ice Cream Shop  0.05
4  Sandwich Place  0.05


----Bayview Village----
                  venue  freq
0     Indian Restaurant  0.11
1           Yoga Studio  0.06
2           Coffee Shop  0.06
3  Fast Food Restaurant  0.06
4          Burger Joint  0.06


----Bedford Park / Lawrence Manor East----
                 venue  freq
0          Coffee Shop  0.06
1                 Café  0.06
2         Cocktail Bar  0.04
3  American Restaurant  0.04
4            Gastropub  0.04


----Berczy Park----
                      venue  freq
0                 Cafeteria   1.0
1             Movie Theater   0.0
2            Massage Studio   0.0
3            Medical Center   0.0

## III.6 Gather all this into a DataFrame

In [30]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # indicators has no entry past the 3rd position, so use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Café,Coffee Shop,Breakfast Spot,Pet Store,Stadium,Burrito Place,Restaurant,Climbing Gym,Performing Arts Venue,Bakery
1,Bathurst Manor / Wilson Heights / Downsview North,Coffee Shop,Bank,Gift Shop,Pharmacy,Sushi Restaurant,Middle Eastern Restaurant,Deli / Bodega,Fried Chicken Joint,Restaurant,Ice Cream Shop
2,Bayview Village,Indian Restaurant,Gym,Supermarket,Bank,Burger Joint,Coffee Shop,Fast Food Restaurant,Gas Station,Grocery Store,Intersection
3,Bedford Park / Lawrence Manor East,Coffee Shop,Café,Gastropub,Cocktail Bar,American Restaurant,Art Gallery,Hotel,Italian Restaurant,Lingerie Store,Department Store
4,Berczy Park,Cafeteria,Women's Store,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dance Studio
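
The `indicators` lookup in the cell above is enough for ten columns, but it would label a 21st column as "21th". If `num_top_venues` were ever raised that far, a small helper could build the suffixes instead. This is a sketch only, and the name `ordinal` is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column names as the notebook builds for num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```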


## III.7 Cluster the results 

In [44]:
# set number of clusters
kclusters = 5

# drop the label column so only the category frequencies are clustered
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 2, 1, 1, 1, 1, 1])
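
Nine of the ten labels printed above fall in cluster 1, so it can be worth checking how evenly the neighborhoods are spread across clusters. A short sketch using only the ten labels shown follows; in the notebook, the full `kmeans.labels_` array would take the place of the hand-copied `labels`.

```python
import numpy as np

labels = np.array([1, 1, 1, 1, 2, 1, 1, 1, 1, 1])  # the ten labels printed above
kclusters = 5

# Count how many neighborhoods landed in each cluster, including empty ones
sizes = np.bincount(labels, minlength=kclusters)
for cluster, size in enumerate(sizes):
    print('cluster {}: {} neighborhoods'.format(cluster, size))
```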

## III.8 Merge Cluster Labels with the Top 10 Venues

In [45]:

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)



toronto_merged = df_concat

# join the cluster labels and top venues onto df_concat, matching on the neighborhood name
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Latitude,Longitude,Postal code,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,43.806686,-79.194353,M3A,North York,Parkwoods,1.0,Fast Food Restaurant,Print Shop,Women's Store,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
1,M1C,43.784535,-79.160497,M4A,North York,Victoria Village,1.0,Moving Target,Bar,Women's Store,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Deli / Bodega
2,M1E,43.763573,-79.188711,M5A,Downtown Toronto,Regent Park / Harbourfront,1.0,Rental Car Location,Moving Target,Electronics Store,Breakfast Spot,Mexican Restaurant,Medical Center,Bank,Intersection,Dessert Shop,Dim Sum Restaurant
3,M1G,43.770992,-79.216917,M6A,North York,Lawrence Manor / Lawrence Heights,1.0,Coffee Shop,Indian Restaurant,Korean Restaurant,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center
4,M1H,43.773136,-79.239476,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,1.0,Fried Chicken Joint,Gas Station,Bakery,Hakka Restaurant,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Department Store,Dessert Shop


In [66]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)



# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        fill=True,  # without fill, fill_opacity has no visible effect
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
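
The notebook imports `matplotlib.cm` and `matplotlib.colors` but draws every marker in the same color. One way to tell the clusters apart is to build one hex color per cluster and look it up by label. This is a hedged sketch: the helper `cluster_color` and the gray fallback are assumptions, added because the `Cluster Labels` column holds floats after the join and could contain `NaN` for a neighborhood with no venues.

```python
import math
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced rainbow color per cluster, as '#rrggbb' strings
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

def cluster_color(cluster, fallback='#808080'):
    """Hex color for a cluster label; gray when the label is missing (NaN)."""
    if cluster is None or (isinstance(cluster, float) and math.isnan(cluster)):
        return fallback
    return palette[int(cluster)]

print(cluster_color(1.0), cluster_color(float('nan')))
```

Inside the marker loop, the result could then be passed as `color=cluster_color(cluster)` and `fill_color=cluster_color(cluster)` to `folium.CircleMarker`.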