# Segmenting and Clustering Neighborhoods in Toronto

## Introduction
As part of the final assignment for the IBM Data Science Certification, we are going to explore neighborhoods in Toronto, use the FourSquare API to get the venues in each neighborhood (restaurants, bars, sports venues, etc.), and then cluster those neighborhoods using a summary of those venues as the features for our algorithm.

## 1. Web scraping and creating the Toronto postcodes dataframe
As a first step, we are going to extract the list of Toronto postcodes, boroughs and neighborhoods from an HTML table in a Wikipedia article: https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969



In [19]:
#Imports that we need
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import requests
from bs4 import BeautifulSoup

In [2]:
#Download the HTML from the URL and convert it into a BeautifulSoup object
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
html_data = response.text
soup_object = BeautifulSoup(html_data, "html5lib")  # create a soup object using the variable 'html_data'

In [3]:
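#Extract the tables/table
wiki_tables = soup_object.find_all('table')

#Use pandas to transform the first table into a dataframe
#(wrapped in StringIO because newer pandas versions deprecate passing raw HTML strings)
from io import StringIO
wiki_df = pd.read_html(StringIO(str(wiki_tables[0])), flavor='bs4')[0]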

In [4]:
wiki_df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [5]:
#Clean the dataframe: drop the postal codes that are not assigned
wiki_clean_df = wiki_df[wiki_df['Neighbourhood']!='Not assigned'].reset_index(drop=True)
print(wiki_clean_df.shape)
wiki_clean_df.head()

(103, 3)


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


**Note: Adjacent neighborhoods that share a postal code are joined and treated as a single neighborhood. The Wikipedia table used as the data source already contains this join.**
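
If the source table had listed one row per neighborhood instead, the same join could be reproduced with a group-by. The following is only a minimal sketch on made-up rows; `raw_df` and `joined_df` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical raw rows: one row per (postal code, neighbourhood) pair.
raw_df = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Collapse rows that share a postal code into one comma-separated entry,
# keeping the groups in their order of first appearance.
joined_df = (
    raw_df.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(joined_df)
```

With the Wikipedia revision used here this step is not needed, since the table already arrives in the joined form.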

## 2. Add geolocation data (latitude and longitude) to dataframe

Because we will use the FourSquare API, we need to add the geographical coordinates of each postal code to our dataframe.
One option is the Geocoder Python package, which returns the latitude and longitude for each postal code.

However, this package is very unreliable in practice.
Hence, it is not feasible to get the coordinates for all the target postal codes within an acceptable amount of time.

In [6]:
#!pip install geocoder
#import geocoder # import geocoder
# initialize your variable to None
#lat_lng_coords = None
#postal_code = 'M3A' 

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
#latitude
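
The commented-out loop above retries forever, which is part of why it can take an unacceptable amount of time. A bounded variant might look like the sketch below. `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the lookup is passed in as a callable so the example runs without the Geocoder package; with the package installed, `lambda q: geocoder.google(q).latlng` would presumably play that role.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, base_delay=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning [lat, lng] or None.
    """
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for an unreliable service: fails twice, then succeeds.
calls = {"n": 0}


def flaky_lookup(query):
    calls["n"] += 1
    return [43.753259, -79.329656] if calls["n"] >= 3 else None


coords = geocode_with_retry(flaky_lookup, "M3A, Toronto, Ontario", base_delay=0)
print(coords, calls["n"])
```

Returning `None` after the last attempt lets the caller decide whether to skip the postal code or fall back to another data source.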

So, instead of the Geocoder package, we are going to download the coordinates and their corresponding postal codes directly from http://cocl.us/Geospatial_data
This link points to a CSV file, which we will load as a dataframe.

In [7]:
#Use the pandas option to read a csv from URL
url_csv = 'https://cocl.us/Geospatial_data'
postcode_csv_df = pd.read_csv(url_csv)
postcode_csv_df.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Once we have loaded that new dataframe, we need to join it with the one obtained from Wikipedia to create our final dataframe containing:
* postal codes
* boroughs
* neighborhoods
* latitudes
* longitudes

In [8]:
#Join our 2 dataframes to get the final one that we will use in next steps
toronto_df = wiki_clean_df
toronto_df = toronto_df.join(postcode_csv_df.set_index('Postal Code'),on='Postal Code')
toronto_df

|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |
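
A join like this silently leaves `NaN` coordinates for any postal code missing from the CSV. One way to check for that is sketched below with `DataFrame.merge` on a toy example. The frames `left` and `coords` and the postal code `M9Z` are made up for illustration and are not part of the real data.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up code with no coordinates
    "Borough": ["North York", "North York", "Example"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every neighbourhood row; validate raises if a postal code
# is duplicated on either side; indicator adds a "_merge" column that shows
# which rows found a match.
merged = left.merge(
    coords, on="Postal Code", how="left", validate="one_to_one", indicator=True
)
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

An empty `missing` list on the real data would confirm that every neighborhood received coordinates before the map is drawn.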


## 3. Show our aggregated neighborhoods on a map of Toronto

We are going to use the Folium package for Python to show the center of each neighborhood, aggregated by postal code, on a map of Toronto.

In [9]:
#Install and import folium and nominatim
!pip install folium==0.5.0
!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import folium # plotting library



In [10]:
#Use Nominatim to get the coordinates of the center of Toronto
#We will need it to fix the center of the Folium Map
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [11]:
#Use Folium to plot the map of Toronto and the neighborhoods
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## 4. Invoke FourSquare API in order to get the venues for each neighborhood

In the next step, we are going to make some calls to the FourSquare API in order to get the venues within 500 meters of the center of each neighborhood.
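
To get a feel for what "within 500 meters" means, the great-circle distance between a neighborhood center and a venue can be computed with the haversine formula. The sketch below is only a sanity check, not something the API call needs. `haversine_m` is a hypothetical helper, and the coordinates are those of Parkwoods and Brookbanks Park as they appear in the venue table later in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Parkwoods center versus Brookbanks Park
distance = haversine_m(43.753259, -79.329656, 43.751976, -79.33214)
print(distance < 500)
```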


In [21]:
# The code was removed by Watson Studio for sharing.
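
The function defined further down references `CLIENT_ID`, `CLIENT_SECRET`, `VERSION` and `LIMIT`, so the hidden cell presumably defined those four names. One way to supply them without writing secrets into the notebook is to read them from environment variables, as in the sketch below. `load_foursquare_settings` and the environment variable names are assumptions for illustration, and the values shown are placeholders, not the ones used in the original run.

```python
import os


def load_foursquare_settings(env=None):
    """Read the API settings from a mapping of environment variables.

    The variable names are hypothetical; pick any names you like.
    """
    env = os.environ if env is None else env
    return {
        "CLIENT_ID": env["FOURSQUARE_CLIENT_ID"],
        "CLIENT_SECRET": env["FOURSQUARE_CLIENT_SECRET"],
        "VERSION": env["FOURSQUARE_VERSION"],   # API version date, e.g. in YYYYMMDD form
        "LIMIT": int(env["FOURSQUARE_LIMIT"]),  # maximum venues per request
    }


# Placeholder values only, passed explicitly so the sketch runs anywhere.
example_env = {
    "FOURSQUARE_CLIENT_ID": "your-client-id",
    "FOURSQUARE_CLIENT_SECRET": "your-client-secret",
    "FOURSQUARE_VERSION": "YYYYMMDD",
    "FOURSQUARE_LIMIT": "50",
}
settings = load_foursquare_settings(example_env)
print(sorted(settings))
```

In a notebook, the four globals could then be assigned from `settings`, for example `CLIENT_ID = settings["CLIENT_ID"]`.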

We will loop through all the neighborhoods, sending a request to FourSquare to get the venues for each one and collecting all the venues in a single dataframe.


In [13]:
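def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a dataframe with the venues found within `radius` meters of each location.

    Relies on the globals CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT defined in the hidden cell above.
    """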
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
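
The chained lookup `requests.get(url).json()["response"]['groups'][0]['items']` in `getNearbyVenues` raises an exception if a response has no groups, and `v['venue']['categories'][0]` fails for a venue without categories. A more defensive parsing step might look like the sketch below. `extract_venue_rows` is a hypothetical helper, and `sample_payload` is hand-written to mirror only the keys the function above reads; it is not a real API response.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style response dict into a list of row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written payload shaped like the fields accessed above.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Parkwoods", 43.753259, -79.329656, sample_payload)
print(len(rows))
```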

Once our function "getNearbyVenues" has been defined, we are going to call it for our Toronto neighborhoods.

In [14]:
toronto_venues = getNearbyVenues(names=toronto_df['Neighbourhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo


In [22]:
#Check head and size
print(toronto_venues.shape)
toronto_venues.head()

(2119, 7)


|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 43.753259 | -79.329656 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | Parkwoods | 43.753259 | -79.329656 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 2 | Victoria Village | 43.725882 | -79.315572 | Victoria Village Arena | 43.723481 | -79.315635 | Hockey Arena |
| 3 | Victoria Village | 43.725882 | -79.315572 | Portugril | 43.725819 | -79.312785 | Portuguese Restaurant |
| 4 | Victoria Village | 43.725882 | -79.315572 | Tim Hortons | 43.725517 | -79.313103 | Coffee Shop |


Once we have the list of venues for our Toronto neighbourhoods and the category of each venue, we want to analyze each neighborhood taking those venue categories into account.
Our hypothesis is that analyzing the frequency of the different venue categories in each neighborhood will help us classify the neighborhoods and create clusters of similar neighborhoods in the city of Toronto.

So, as a first step, we are going to use one-hot encoding to create a new dataframe with one row per venue, holding 1 in the column that represents the venue's category and 0 in all the other category columns.
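
As a rough illustration of this step, the sketch below one-hot encodes four venue rows taken from the table above and then averages the columns per neighborhood to obtain category frequencies. The names `venues`, `onehot` and `grouped` are hypothetical, and the grouping by mean is an assumption about how the frequencies would be computed, not necessarily the code used in the rest of the notebook.

```python
import pandas as pd

# Four venue rows copied from the table above.
venues = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village", "Victoria Village"],
    "Venue Category": ["Park", "Food & Drink Shop", "Hockey Arena", "Coffee Shop"],
})

# One row per venue: 1 in the column of its category, 0 elsewhere.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Averaging the 0/1 columns per neighborhood gives the share of each category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so neighborhoods with different venue counts stay comparable when they are used as clustering features.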