# Segmenting and Clustering Neighborhoods in Toronto

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Question 1</a>
    
1.1. <a href="#item1">Conclusion</a>
2. <a href="#item2">Question 2</a>
    
2.1. <a href="#item1">Conclusion</a>
    
3. <a href="#item3">Question 3</a>
    
3.1. <a href="#item1">Conclusion</a>
</font>
</div>

## 1. Question 1

*Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:*

<img src = "https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1588636800000&hmac=ssKIQrsG6VHIIby2_yiH4jQ1yUt124BPn_UWPv6ncGk" width="500" height="500" />

The code below scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes and loads it into a pandas dataframe.

(optional) If needed, use *pip* to install the *BeautifulSoup4* package in the current Jupyter kernel.

In [None]:
# import sys
# !{sys.executable} -m pip install BeautifulSoup4

Scrape the webpage from Wikipedia that contains the complete list of postal codes for the city of Toronto. We use the *requests* library to download the page.

In [None]:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

We will use the *BeautifulSoup* library to handle the content of the page. We will use the *lxml* parser since it is the recommended one and is also reportedly very fast.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')

We will look for a table in the code. *BeautifulSoup* allows us to search for an HTML element by its type. So let's look for a *table*.

In [None]:
soup.table.name

`soup.table` returns the first *table* element in the page, and its tag name is 'table'. We will get the first table row (the first *tr* element) to see if it contains the table headers.

In [None]:
# another way to inspect the same row: soup.table.tr.find_all()
print(soup.table.tr.text)

So now let's scrape the postal codes into a data frame. We will iterate over the table, look for all the *tr* elements (the rows), and scrape the data from each one. Since the first row contains the column headers, we will need to remove it from the data frame after all the data is loaded.

In [None]:
import pandas as pd

The list of headers for our data frame.

In [None]:
headers=["Postalcode","Borough","Neighborhood"]

The data frame is named *df_toronto*.

In [None]:
df_toronto = pd.DataFrame(columns= headers)

In [None]:
for row in soup.table.find_all('tr'):
    row_data=[]
    for data in row.find_all('th'):
        row_data.append(data.text.strip())
    for data in row.find_all('td'):
        row_data.append(data.text.strip())
    df_toronto.loc[len(df_toronto)] = row_data
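
Assigning to `df_toronto.loc[len(df_toronto)]` grows the data frame one row at a time and fails on any row that does not have exactly three cells. A possible alternative, sketched here on hand-written sample cells rather than the live page, is to collect the rows in a plain list and build the data frame once. The sample rows and the names `parsed_rows` and `df_sample` are illustrative assumptions, not scraped data.

```python
import pandas as pd

headers = ["Postalcode", "Borough", "Neighborhood"]

# Hand-written stand-in for the cell texts of each <tr>; not real scraped data.
parsed_rows = [
    ["Postal Code", "Borough", "Neighborhood"],  # header row
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M9Z"],  # malformed row with too few cells
]

# Skip the header row and keep only rows with one value per column.
clean_rows = [r for r in parsed_rows[1:] if len(r) == len(headers)]

# Build the data frame in one step instead of appending row by row.
df_sample = pd.DataFrame(clean_rows, columns=headers)
print(df_sample)
```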

Now that we have our data frame, we should check it.

In [None]:
df_toronto.head()

We need to remove the first row, which holds the column headers, from the data frame.

In [None]:
# delete the first row from the dataFrame
df_toronto.drop(0, inplace=True)

In [None]:
df_toronto.head()

Now we will remove the postal codes not assigned to a borough.

In [None]:
# Get names of indexes for which the column Borough has a value "Not assigned"
not_assigned = df_toronto[df_toronto['Borough'] =='Not assigned'].index

# Delete the row indexes from the data frame
df_toronto.drop(not_assigned, inplace=True)

In [None]:
df_toronto.head()

We can check how many rows the data frame contains now.

In [None]:
df_toronto.shape

How many distinct postal codes are in the data frame?

In [None]:
df_toronto.nunique()

So there is no need to merge different rows, since this version of the page no longer has repeated postal codes. However, the neighbourhood names are separated by slashes (/) rather than commas. We can fix this.

In [None]:
df_toronto.Neighborhood = df_toronto.Neighborhood.replace(" /", ',', regex=True)
df_toronto.head()
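
The same two cleaning steps can also be written without `inplace`, using a boolean mask and the `.str.replace` string accessor. This is only a sketch on a small hand-made frame (`df_demo` is an assumed name), with `regex=False` so that the slash is matched as plain text.

```python
import pandas as pd

# Small hand-made frame that mimics the scraped columns.
df_demo = pd.DataFrame({
    "Postalcode": ["M1A", "M5A", "M6A"],
    "Borough": ["Not assigned", "Downtown Toronto", "North York"],
    "Neighborhood": ["Not assigned", "Regent Park / Harbourfront",
                     "Lawrence Manor / Lawrence Heights"],
})

# Keep only rows whose borough is assigned, and renumber the index.
df_demo = df_demo[df_demo["Borough"] != "Not assigned"].reset_index(drop=True)

# Replace the " /" separator with a comma; regex=False treats it as plain text.
df_demo["Neighborhood"] = df_demo["Neighborhood"].str.replace(" /", ",", regex=False)
print(df_demo)
```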

### 1.1. Conclusion

We can see the full data frame below.

In [None]:
df_toronto

## 2. Question 2

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create the following dataframe:
<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1588636800000&hmac=epN7y9Ean0VJJ-40CNmcZ9ztJhgYTUYI5v09sgxi9WY" width="500" height="500" />

No data could be retrieved from Google using the *geocoder* library. *Geocoder* offers other geodata sources for the same purpose, for example *arcgis*.

I will first retrieve the coordinates data from the csv file available for this question.

In [None]:
# load the geospatial data
df_coord = pd.read_csv("http://cocl.us/Geospatial_data")

We can check that the file was loaded in the data frame.

In [None]:
df_coord.head()

We need to rename the first column to *Postalcode* so we can merge this data frame with the *boroughs* dataframe we previously created, *df_toronto*.

In [None]:
df_coord.rename(columns={'Postal Code':'Postalcode'},inplace=True)

This dataframe should have the same number of rows as *df_toronto*.

In [None]:
df_coord.shape

Let's merge the two dataframes, using the *Postalcode* column as the column to join on.

In [None]:
df_all_boroughs = pd.merge(df_toronto, df_coord, on = 'Postalcode')
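
`pd.merge` defaults to an inner join, so a postal code missing from either frame would silently disappear from the result. A hedged sketch on made-up coordinates shows how `how='left'`, `validate` and `indicator` can surface such cases. The frames `left` and `right` and their values are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"Postalcode": ["M3A", "M4A", "M5A"],
                     "Borough": ["North York", "North York", "Downtown Toronto"]})
# Made-up coordinates; "M5A" is deliberately missing.
right = pd.DataFrame({"Postalcode": ["M3A", "M4A"],
                      "Latitude": [43.75, 43.72],
                      "Longitude": [-79.33, -79.31]})

# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column that records where each row came from.
merged = pd.merge(left, right, on="Postalcode", how="left",
                  validate="one_to_one", indicator=True)

# Rows flagged "left_only" have no coordinates in the second frame.
missing = merged.loc[merged["_merge"] == "left_only", "Postalcode"].tolist()
print(missing)
```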

The result of the merge is the dataframe *df_all_boroughs*.

In [None]:
df_all_boroughs.head()

### Retrieving the geo coordinates from the ArcGIS database using the *geocoder* library

Since using *geocoder* to retrieve geo coordinates from Google was not working, we use another well-known geo data provider, *ArcGIS*.
We don't need an API key to use it.

First we need to install *geocoder* and import it.

In [None]:
import sys
!{sys.executable} -m pip install geocoder

In [None]:
import geocoder

Using *geocoder* to retrieve geo data from Google is not working:

In [None]:
g = geocoder.google('Mountain View, CA')
print(g.latlng)

So we will create a function to retrieve geo data from ArcGis using *geocoder*.

In [None]:
def get_latlng(postal_code):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords[0],lat_lng_coords[1]
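
The `while` loop above keeps calling the service until it answers, so it never returns if a postal code cannot be resolved. One way to bound it, sketched here with a stand-in lookup function instead of a real *geocoder* call, is to cap the number of attempts. The names `retry_lookup`, `max_attempts` and `flaky_lookup` are hypothetical.

```python
import time

def retry_lookup(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay)  # pause before the next attempt
    raise RuntimeError(f"No result for {query!r} after {max_attempts} attempts")

# Stand-in for a geocoding call: fails twice, then returns made-up coordinates.
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else (43.0, -79.0)

coords = retry_lookup(flaky_lookup, "M5G, Toronto, Ontario")
print(coords, calls["count"])
```

In the notebook, `lookup` could be a small function that wraps `geocoder.arcgis(...)` and returns its `latlng` attribute.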

If we feed the function with a postal code, we get the geo data for it.

In [None]:
get_latlng('M3A')

We will create a new column on the data frame and fill it with the geo data for the postal code in each line.

In [None]:
df_toronto['coord'] = df_toronto.Postalcode.apply(lambda x: get_latlng(x))

The data frame now has all the geo data we need, but we still have to split it into two columns, *Latitude* and *Longitude*, and then drop the *coord* column.

In [None]:
df_toronto.head()

In [None]:
df_toronto['Latitude'] = df_toronto.coord.apply(lambda x: x[0])
df_toronto['Longitude'] = df_toronto.coord.apply(lambda x: x[1])

In [None]:
df_toronto.drop("coord", axis=1, inplace=True)
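
As an alternative to two `apply` calls, the tuples can be expanded into both columns at once. The sketch below uses a toy frame with made-up coordinates, and the name `df_pairs` is an assumption.

```python
import pandas as pd

df_pairs = pd.DataFrame({
    "Postalcode": ["M3A", "M4A"],
    "coord": [(43.75, -79.33), (43.72, -79.31)],  # made-up (lat, lng) tuples
})

# Expand each tuple into two columns, aligned on the original index.
df_pairs[["Latitude", "Longitude"]] = pd.DataFrame(
    df_pairs["coord"].tolist(), index=df_pairs.index
)
df_pairs = df_pairs.drop(columns="coord")
print(df_pairs)
```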

Finally, we have the data frame with the desired data, labeled as we wanted.

We can see that both data frames hold the same data. The slight differences in the latitude and longitude values can be explained by the different ways each source defines the coordinates of a neighbourhood.

In [None]:
df_toronto.head()

In [None]:
df_toronto.nunique()

In [None]:
df_toronto[df_toronto.Neighborhood.duplicated()]

In [None]:
df_all_boroughs.nunique()

In [None]:
df_all_boroughs[df_all_boroughs.Neighborhood.duplicated()]

### 2.1. Conclusion

We retrieved the geo data from two sources. The first source was the CSV file provided by IBM and the second source was the geo data from ArcGis retrieved using *geocoder*.

In [None]:
# from a CSV file
df_all_boroughs

In [None]:
# from the ArcGis geo data using the geocoder library
df_toronto

## 3. Question 3

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository.

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files


#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

## 3.1. Explore Toronto and select the neighborhoods to cluster

We first display a map of all Toronto neighborhoods.

In [None]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the city of Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

We will focus on downtown Toronto. We will cluster only the neighbourhoods included in the Downtown Toronto borough.

In [None]:
downtown_data = df_toronto[df_toronto['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
downtown_data.head()

Now display the map zoomed in on Downtown Toronto.

In [None]:
address = 'Downtown Toronto, Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
# create map of Downtown Toronto using latitude and longitude values
map_downtown = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(downtown_data['Latitude'], downtown_data['Longitude'], downtown_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_downtown)  
    
map_downtown

Foursquare API credentials

In [None]:
CLIENT_ID = 'XEJXIC4ABL0SSRAWKMYJ5UKIO2DRR0HIU14KOCR5YIHZR2MP' # your Foursquare ID
CLIENT_SECRET = 'P0L3PE2EPASSSHZS513AFA15UNOARCAMHXVRS1ZOUYKOMG3R' # your Foursquare Secret

# needed now
ACCESS_TOKEN = 'CIIQUWECMQFQIEDQW3IKZUNBIJTLCXFW4DFQAVX13CZBRQBD'
VERSION = '20180605' # Foursquare API version
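
Hard-coded credentials end up in every shared copy of a notebook. A minimal sketch of an alternative reads them at run time, assuming the values are stored in environment variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (names chosen here for illustration, as is the helper `load_credentials`).

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping such as os.environ."""
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]

# Demonstration with a plain dict standing in for the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_credentials(demo_env)
print(client_id)
```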

#### Exploring the geo data

Get the first neighborhood in the dataset and explore the information we can get from Foursquare related to it.

In [None]:
downtown_data.loc[0, 'Neighborhood']

Get the neighborhood's latitude and longitude values.

In [None]:
neighborhood_latitude = downtown_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = downtown_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = downtown_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

#### Exploring the venues around it

Let's use a radius of 500 metres around the latitude and longitude we have.

In [None]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()
results
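
Formatting the query string by hand works, but the parameters can also be kept in a dictionary and encoded by the standard library, which escapes special characters. The sketch below only builds the URL and sends no request. The credential values and coordinates are placeholders, and `build_explore_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the venues/explore URL for one location."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials and made-up coordinates.
demo_url = build_explore_url("ID", "SECRET", "20180605", 43.65, -79.38, 500, 100)
print(demo_url)
```

With *requests*, the same dictionary can be passed directly through the `params` argument of `requests.get`.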

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, we will add a **get_category_type** function that extracts the category name of a venue.

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

## 3.2. Explore Neighborhoods in Downtown Toronto

So now we will do the same data collection for all the neighborhoods in our selection.

First we define a function to retrieve the data for each neighborhood.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
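
Indexing `v['venue']['categories'][0]` raises an `IndexError` for a venue that has no category. The following sketch flattens a response-shaped dictionary while tolerating that case. The `payload` is hand-written to mimic the structure used above, not real API output, and `flatten_items` is a hypothetical helper.

```python
def flatten_items(name, lat, lng, items):
    """Turn explore-style items into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when the venue carries no category at all.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payload with one categorised venue and one without categories.
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 43.65, "lng": -79.38},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Place B", "location": {"lat": 43.66, "lng": -79.39},
               "categories": []}},
]}]}}

items = payload["response"]["groups"][0]["items"]
rows = flatten_items("Demo Neighborhood", 43.65, -79.38, items)
print(rows)
```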

#### Using this function we create a new dataframe with all the venues within a 500-metre radius of each neighborhood. The dataframe is called *downtown_venues*.

In [None]:
downtown_venues = getNearbyVenues(names=downtown_data['Neighborhood'],
                                   latitudes=downtown_data['Latitude'],
                                   longitudes=downtown_data['Longitude']
                                  )

#### Let's check the size of the resulting dataframe

In [None]:
print(downtown_venues.shape)
downtown_venues.head()

Grouping the dataframe by neighborhood we can see how many venues are in each neighborhood.

In [None]:
downtown_venues.groupby('Neighborhood').count()

#### How many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(downtown_venues['Venue Category'].unique())))

## 3.3. Analyze Each Neighborhood

#### We now create a dataframe with one row per venue and one indicator column per venue category (one-hot encoding).

##### After creating the new dataframe, we rearrange it so that the neighborhood name appears as the first column: we drop any category column that happens to be named *Neighborhood*, then add the neighborhood names from the original data frame.

In [None]:
# one hot encoding
downtown_onehot = pd.get_dummies(downtown_venues[['Venue Category']], prefix="", prefix_sep="")

# drop a venue category that is itself called "Neighborhood", if there is one,
# so that it does not clash with the neighborhood names added below
downtown_onehot = downtown_onehot.drop(columns=['Neighborhood'], errors='ignore')

# add neighborhood column back to dataframe
downtown_onehot['Neighborhood'] = downtown_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [downtown_onehot.columns[-1]]  + list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head()

##### This data frame is wide: it has the same number of rows as the venues we fetched before, and one column per distinct venue category we retrieved, plus the *Neighborhood* column.

In [None]:
downtown_onehot.shape

#### Next, we group the rows by neighborhood and take the mean of the frequency of occurrence of each category

In [None]:
downtown_grouped = downtown_onehot.groupby('Neighborhood').mean().reset_index()
downtown_grouped
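
To see what the one-hot encoding and the grouped mean produce, here is a toy example with four made-up venues in two neighborhoods. The names `toy_venues`, `toy_onehot` and `toy_grouped` are illustrative.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Park"],
})

# One column per category, one row per venue; dtype=float keeps the means numeric.
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
toy_onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of each indicator column is the share of that category per neighborhood.
toy_grouped = toy_onehot.groupby("Neighborhood").mean().reset_index()
print(toy_grouped)
```

Neighborhood A has two cafes out of three venues, so its *Cafe* value is 2/3, while neighborhood B consists of a single park.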

#### Grouping by *Neighborhood* provides a new data frame with one row per neighborhood (19 here) and one column per unique venue category found earlier, plus the *Neighborhood* column. Each value now shows the relative frequency of that category among the venues of the neighborhood. Check this in the table above and the dimensions in the line below.

In [None]:
downtown_grouped.shape

#### Print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in downtown_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

#### Once again, we create a *pandas* dataframe with this data

First, we add a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now we create the new dataframe and display the top 15 venues for each neighborhood.

In [None]:
num_top_venues = 15

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = downtown_grouped['Neighborhood']

for ind in np.arange(downtown_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(downtown_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted
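
The `indicators` list gives correct column names up to the 20th venue, but beyond that `21` would be labeled `21th`. If more columns are ever needed, a small helper such as the hypothetical `ordinal` below handles the general case.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, for example 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # covers the irregular 11th, 12th and 13th
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 4)]
print(columns)
```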

## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 7 clusters.

In [None]:
# set number of clusters
kclusters = 7

downtown_grouped_clustering = downtown_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(downtown_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
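
A quick sanity check after clustering is to count how many neighborhoods fall into each cluster, since many single-member clusters can be a hint that *k* is larger than the data supports. The sketch uses a made-up list of labels in place of `kmeans.labels_`.

```python
from collections import Counter

# Made-up labels standing in for kmeans.labels_ (one label per neighborhood).
demo_labels = [0, 0, 3, 0, 1, 0, 2, 0, 0, 1]

cluster_sizes = Counter(demo_labels)
for cluster, size in sorted(cluster_sizes.items()):
    print(f"Cluster {cluster}: {size} neighborhoods")
```

With the real labels, `Counter(kmeans.labels_.tolist())` gives the same kind of summary.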

Create a new dataframe that includes the cluster as well as the top 15 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
downtown_merged = downtown_data

# join downtown_data with neighborhoods_venues_sorted to add the cluster label and top venues to each neighborhood
downtown_merged = downtown_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

downtown_merged.head() # check the last columns!

Finally, we can visualize the resulting clusters.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=14)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(downtown_merged['Latitude'], downtown_merged['Longitude'], downtown_merged['Neighborhood'], downtown_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters