<h1 align=center><font size = 5>Week 3 Assignment - Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Introduction

This Jupyter notebook implements the Week 3 assignment, Segmenting and Clustering Neighborhoods in Toronto, which is part of the Applied Data Science Capstone course on Coursera.


Here I import the libraries that we will use throughout this notebook.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library

import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

## Task 1: Create the dataframe

- Dataframe specification:
    - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
    - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
    - In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
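
The combining and fallback rules in the specification can be sketched on a few toy rows. This is a minimal illustration only: the rows in `raw` are made-up sample values in the shape of the Wikipedia table, and the names `raw`, `clean`, and `combined` are assumptions that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy rows in the shape of the Wikipedia table (illustrative values only)
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. keep only rows that have an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. a missing neighborhood falls back to the borough name
missing = clean['Neighborhood'].isna() | (clean['Neighborhood'] == 'Not assigned')
clean.loc[missing, 'Neighborhood'] = clean.loc[missing, 'Borough']

# 3. one row per postal code, neighborhoods joined with a comma
combined = (clean.groupby(['Postal Code', 'Borough'])['Neighborhood']
                 .agg(', '.join)
                 .reset_index())
print(combined)
```

If the source table already lists one row per postal code, step 3 leaves the rows unchanged, so it should be safe to apply either way.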


The data is in a well-defined tabular format,<br>
so I will use the pandas read_html() function to read it into a dataframe.<br>
Let's also print the first five rows with the head() method.

In [2]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Let's use the shape attribute to see how many rows we have

In [3]:
df.shape

(180, 3)

So we have 180 rows.<br>
Let's drop the rows where Borough is "Not assigned" and <br>
print the number of remaining rows with the shape attribute

In [4]:
df = df[df.Borough != "Not assigned"]
df.shape

(103, 3)

So we have 103 rows left, and every one of them has an assigned Borough.<br>
Let's proceed with the next task

## Task 2: Use the Geocoder package or the csv file to add GPS coordinates to the dataframe

Let's load the CSV data into a pandas dataframe and explore it with the head() method

In [5]:
gps_df = pd.read_csv('https://cocl.us/Geospatial_data')
gps_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


So we can merge the GPS data into the original dataframe.<br>
This is effectively a LEFT JOIN on the postal code, and I will use the pandas merge() function to implement it

In [6]:
df = pd.merge(df, gps_df, on='Postal Code', how='left')
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
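
A left join keeps every postal code even when no coordinates are found, so it can be worth checking for unmatched rows. The sketch below shows one way to do that with the `indicator` option of `pd.merge`. The frames `codes` and `coords` are toy data, and `M9Z` is a made-up postal code that deliberately has no coordinates.

```python
import pandas as pd

# Toy frames; 'M9Z' is a made-up code that has no coordinates on purpose
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Borough': ['North York', 'North York', 'Nowhere']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# indicator=True adds a '_merge' column that says where each row came from
merged = pd.merge(codes, coords, on='Postal Code', how='left', indicator=True)

# rows marked 'left_only' found no match and have NaN coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that every postal code received a latitude and longitude.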


We have the dataframe in the needed format, so we are finished with Task 2

## Task 3: Explore and cluster the neighborhoods in Toronto

You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:
    - to add enough Markdown cells to explain what you decided to do and to report any observations you make.
    - to generate maps to visualize your neighborhoods and how they cluster together

As a start, let's create a map of Toronto with the neighborhoods marked on it

In [7]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [8]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

As instructed, let's simplify the above map and segment and cluster only the neighborhoods that contain Toronto in their borough name.<br>So let's slice the original data and create a new dataframe

In [9]:
toronto_nbr_data = df[df['Borough'].str.contains("Toronto")].reset_index(drop=True)
toronto_nbr_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


Now I put the 'Toronto' neighborhoods on the map, so we see what we will work with

In [10]:
# create map of the Toronto neighborhoods using latitude and longitude values
map_toronto_nbr = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_nbr_data['Latitude'], toronto_nbr_data['Longitude'], toronto_nbr_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto_nbr)
    
map_toronto_nbr

Let's now collect the venues for each neighborhood<br>
We will use the Foursquare API, so we start by setting up the Foursquare credentials

In [11]:
CLIENT_ID = 'CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

Let's define the function that collects the top 100 venues for a given neighborhood

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
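
The function above assumes that every response contains at least one group and that every venue has at least one category. The following sketch shows a more defensive way to read the same keys. The helper `extract_venues`, its `default_category` parameter, and the `sample` payload are hypothetical. The payload mirrors only the keys that `getNearbyVenues` reads and is not a real API response.

```python
def extract_venues(payload, default_category='Unknown'):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # a venue may come back with an empty category list
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else default_category
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hypothetical payload that mirrors only the keys read by getNearbyVenues
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Example Venue',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

venue_rows = extract_venues(sample)
print(venue_rows)
```

With this shape, a neighborhood that returns no venues yields an empty list instead of raising an error.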

Now comes the code to run the above function on each neighborhood and create a new dataframe called toronto_venues

In [13]:
toronto_venues = getNearbyVenues(names=toronto_nbr_data['Neighborhood'],
                                   latitudes=toronto_nbr_data['Latitude'],
                                   longitudes=toronto_nbr_data['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town,

We check the resulting dataframe

In [14]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


Let's check how many unique venue categories we have.

In [15]:
len(toronto_venues["Venue Category"].unique())

239

Now we transform the dataframe using one-hot encoding

In [16]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']
# move neighborhood column to the first column
toronto_onehot = toronto_onehot[ ['Neighborhood'] + [ col for col in toronto_onehot.columns if col != 'Neighborhood' ] ]
toronto_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group the rows by neighborhood and take the mean of each one-hot column, which gives the relative frequency of each venue category per neighborhood

In [17]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.018519,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing Centre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.066667,0.066667,0.066667,0.066667,0.2,0.133333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015385,0.0,0.0,0.0,0.0,0.0,0.015385
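
To make the one-hot and group-mean steps concrete, here is a small worked sketch on toy data. The neighborhood names `A` and `B` and the venue lists are made up for illustration. `dtype=int` is passed to `get_dummies` so that the columns are 0/1 integers.

```python
import pandas as pd

# Toy venue list: neighborhood 'A' has four venues, 'B' has two
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bakery', 'Park', 'Park'],
})

# one column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of a 0/1 column is the share of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1, so the values can be read as category shares: in this toy data half of the venues in `A` are coffee shops.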


Now we will use this dataset to cluster the neighborhoods

In [18]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 1, 1, 0, 0, 0, 0, 0, 0])
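
k-means can produce groups of very different sizes, so counting the members of each cluster may help before interpreting them. This is a minimal sketch that uses only the ten labels printed above as sample input. In the notebook one would pass `kmeans.labels_` instead. The names `sample_labels` and `cluster_sizes` are illustrative.

```python
import numpy as np

kclusters = 5

# the first ten labels printed above, used here as sample input
sample_labels = np.array([0, 0, 1, 1, 0, 0, 0, 0, 0, 0])

# position i holds the number of rows assigned to cluster i
cluster_sizes = np.bincount(sample_labels, minlength=kclusters)
print(cluster_sizes.tolist())  # [8, 2, 0, 0, 0]
```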

We now have the five clusters.<br>
To make what we have collected so far more readable, let's create a dataframe with:
- the neighborhood,
- the cluster number,
- and the top 3 venue types for each neighborhood

We will use a function to sort the venues in descending order.

In [19]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
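
As a quick check of what the function returns, the sketch below applies it to a single toy row. The frequencies in `row` are made-up values, and the function is repeated so that the example runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighborhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row: a neighborhood name followed by made-up category frequencies
row = pd.Series({'Neighborhood': 'Berczy Park',
                 'Bakery': 0.1, 'Coffee Shop': 0.5, 'Park': 0.3})

top2 = list(return_most_common_venues(row, 2))
print(top2)  # ['Coffee Shop', 'Park']
```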

Now let's use the function on our dataset

In [20]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
top3_venues = pd.DataFrame(columns=columns)
top3_venues['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    top3_venues.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

top3_venues.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Nightclub
2,Business reply mail Processing Centre,Pizza Place,Auto Workshop,Comic Shop
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Terminal,Plane
4,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant


We now create a dataframe that adds the GPS coordinates and cluster labels

In [21]:
# add the cluster column
top3_venues['Cluster'] = kmeans.labels_

# work on a copy without the columns we no longer need, leaving toronto_nbr_data intact
toronto_merged = toronto_nbr_data.drop(columns=['Postal Code', 'Borough'])

# join top3_venues onto toronto_merged so each neighborhood has coordinates, cluster label and top venues
toronto_merged = toronto_merged.join(top3_venues.set_index('Neighborhood'), on='Neighborhood')

# order columns
lead_cols = ['Neighborhood', 'Latitude', 'Longitude', 'Cluster']
toronto_merged = toronto_merged[lead_cols + [col for col in toronto_merged.columns if col not in lead_cols]]
toronto_merged


Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Park,Bakery
1,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Sushi Restaurant,College Cafeteria
2,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Middle Eastern Restaurant
3,St. James Town,43.651494,-79.375418,0,Coffee Shop,Café,Cocktail Bar
4,The Beaches,43.676357,-79.293031,1,Trail,Health Food Store,Pub
5,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Cocktail Bar,Restaurant
6,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Sandwich Place,Italian Restaurant
7,Christie,43.669542,-79.422564,0,Grocery Store,Café,Park
8,"Richmond, Adelaide, King",43.650571,-79.384568,0,Coffee Shop,Café,Restaurant
9,"Dufferin, Dovercourt Village",43.669005,-79.442259,1,Bakery,Pharmacy,Music Venue


Let's visualize the clusters

In [22]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [23]:
toronto_merged.loc[toronto_merged['Cluster'] == 0]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Park,Bakery
1,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Sushi Restaurant,College Cafeteria
2,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Middle Eastern Restaurant
3,St. James Town,43.651494,-79.375418,0,Coffee Shop,Café,Cocktail Bar
5,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Cocktail Bar,Restaurant
6,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Sandwich Place,Italian Restaurant
7,Christie,43.669542,-79.422564,0,Grocery Store,Café,Park
8,"Richmond, Adelaide, King",43.650571,-79.384568,0,Coffee Shop,Café,Restaurant
10,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752,0,Coffee Shop,Aquarium,Hotel
13,"Toronto Dominion Centre, Design Exchange",43.647177,-79.381576,0,Coffee Shop,Café,Hotel


Cluster 0 seems to be a shopping area dominated by coffee shops

In [24]:
toronto_merged.loc[toronto_merged['Cluster'] == 1]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
4,The Beaches,43.676357,-79.293031,1,Trail,Health Food Store,Pub
9,"Dufferin, Dovercourt Village",43.669005,-79.442259,1,Bakery,Pharmacy,Music Venue
11,"Little Portugal, Trinity",43.647927,-79.41975,1,Bar,Asian Restaurant,Restaurant
12,"The Danforth West, Riverdale",43.679557,-79.352188,1,Greek Restaurant,Italian Restaurant,Coffee Shop
22,"High Park, The Junction South",43.661608,-79.464763,1,Mexican Restaurant,Café,Thai Restaurant
25,"Parkdale, Roncesvalles",43.64896,-79.456325,1,Breakfast Spot,Gift Shop,Restaurant
27,"University of Toronto, Harbord",43.662696,-79.400049,1,Café,Bar,Italian Restaurant
30,"Kensington Market, Chinatown, Grange Park",43.653206,-79.400049,1,Café,Vietnamese Restaurant,Bakery
32,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442,1,Airport Service,Airport Terminal,Plane
38,Business reply mail Processing Centre,43.662744,-79.321558,1,Pizza Place,Auto Workshop,Comic Shop


Cluster 1 seems to be related to air traffic and shopping

In [25]:
toronto_merged.loc[toronto_merged['Cluster'] == 2]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
21,Forest Hill North & West,43.696948,-79.411307,2,Trail,Park,Sushi Restaurant
29,"Moore Park, Summerhill East",43.689574,-79.38316,2,Restaurant,Park,Trail
33,Rosedale,43.679563,-79.377529,2,Park,Trail,Playground


Cluster 2 covers the green areas

In [26]:
toronto_merged.loc[toronto_merged['Cluster'] == 3]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
18,Lawrence Park,43.72802,-79.38879,3,Park,Bus Line,Swim School


Cluster 3 appears to contain pools and green areas

In [27]:
toronto_merged.loc[toronto_merged['Cluster'] == 4]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
19,Roselawn,43.711695,-79.416936,4,Home Service,Garden,Yoga Studio


Cluster 4 may be a residential area due to the high number of Home services.

And this is the end of this notebook, thank you for reading!