# Segmenting and Clustering Neighborhoods in Toronto

## Introduction
As part of the final assignment for the IBM Data Science Certification, we are going to explore neighborhoods in Toronto, use the FourSquare API to get the venues in each neighborhood (restaurants, bars, sports venues, etc.), and then cluster those neighborhoods using a summary of those venues as the features for our algorithm.

## 1. Web scraping and creating the Toronto postcodes dataframe
As a first step, we are going to extract the list of Toronto postcodes, boroughs and neighborhoods from the HTML table in a Wikipedia article: https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969



In [1]:
#Imports that we need
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
#Download the html from the URL and convert into a BeautifulSoup object
url='https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969'
html_data  = requests.get(url).text 
soup_object = BeautifulSoup(html_data,"html5lib")  # create a soup object using the variable 'html_data'

In [3]:
#Extract the tables/table
wiki_tables = soup_object.find_all('table')

#Use pandas to transform the table into a dataframe
wiki_df = pd.read_html(str(wiki_tables[0]),flavor='bs4')[0]

In [4]:
wiki_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [5]:
#Clean the dataframe: drop the postal codes that are not assigned
wiki_clean_df = wiki_df[wiki_df['Neighbourhood']!='Not assigned'].reset_index(drop=True)
print(wiki_clean_df.shape)
wiki_clean_df.head()

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


**Note: Adjacent neighborhoods that share a postal code are joined and treated as a single neighborhood. The Wikipedia table used as the data source already contains this join.**
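
Had the source table listed one neighborhood per row instead, the same join could be reproduced with a `groupby`. The following is a minimal sketch on made-up rows; the `raw_df` frame and its contents are illustrative assumptions, not data taken from the Wikipedia table.

```python
import pandas as pd

# Hypothetical un-joined input: one row per neighbourhood (illustrative data only)
raw_df = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Collapse rows sharing a postal code and borough into one comma-separated entry;
# sort=False keeps the postal codes in their order of first appearance
joined_df = (
    raw_df.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(joined_df)
```

With this layout, each postal code ends up on a single row, which is the shape the rest of the notebook relies on.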

## 2. Add geolocation data (latitude and longitude) to dataframe

As we will need to use the FourSquare API, we need to add the geographical coordinates of each postal code to our dataframe.
One option is the Geocoder Python package, which returns the latitude and longitude for each of those postal codes.

However, this package proved very unreliable: a call can return no coordinates and has to be retried, as the loop in the next cell does.
Hence, it is impossible to get the coordinates for all the target postal codes within an acceptable amount of time.

In [6]:
#!pip install geocoder
#import geocoder # import geocoder
# initialize your variable to None
#lat_lng_coords = None
#postal_code = 'M3A' 

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
#latitude
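
The commented-out loop above never ends if the service keeps returning nothing. One way to avoid that is a bounded retry, sketched below. The helper `geocode_with_retry` and the stand-in `fake_lookup` are hypothetical names introduced for illustration; they are not part of the geocoder package.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable returning a (lat, lng) pair or None. This helper
    is hypothetical and is not part of the geocoder package.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None  # give up instead of looping forever


# Stand-in lookup that fails twice before answering (illustrative only)
calls = {'count': 0}

def fake_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else (43.753259, -79.329656)

coords = geocode_with_retry(fake_lookup, 'M3A, Toronto, Ontario', delay_seconds=0)
print(coords)
```

With the real package, `lookup` could wrap the call from the cell above, for example a function that returns `geocoder.google(query).latlng`. A `None` result then signals that the postal code needs another data source.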

So, instead of the Geocoder package, we are going to download the coordinates for each postal code directly from http://cocl.us/Geospatial_data.
This is a link to a CSV file, which we will load as a dataframe.

In [7]:
#Use the pandas option to read a csv from URL
url_csv = 'https://cocl.us/Geospatial_data'
postcode_csv_df = pd.read_csv(url_csv)
postcode_csv_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Once we have loaded that new dataframe, we need to join it with the one that we obtained from Wikipedia to create our final dataframe, which contains:
* postal codes
* boroughs
* neighborhoods
* latitudes
* longitudes

In [12]:
#Join our 2 dataframes on 'Postal Code' to get the final one that we will use in the next steps
toronto_df = wiki_clean_df
toronto_df = toronto_df.join(postcode_csv_df.set_index('Postal Code'),on='Postal Code')
toronto_df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
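
A join like this silently leaves `NaN` coordinates for any postal code missing from the CSV. As a hedged sketch of how that could be checked, the example below uses `DataFrame.merge` with `validate` and `indicator` on two small stand-in frames. The frames `left_df` and `coords_df` and the code `M9Z` are made up for illustration; the two coordinate pairs are copied from the output above.

```python
import pandas as pd

# Small stand-ins for wiki_clean_df and postcode_csv_df (illustrative rows only)
left_df = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],  # 'M9Z' is a made-up code with no match
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coords_df = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate='one_to_one' raises if a postal code is duplicated on either side;
# indicator=True adds a '_merge' column showing where each row came from
checked_df = left_df.merge(coords_df, on='Postal Code', how='left',
                           validate='one_to_one', indicator=True)

# Rows present only in the left frame have no coordinates
missing = checked_df.loc[checked_df['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

An empty `missing` list on the real data would confirm that every postal code received a latitude and longitude.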


## 3. Show our aggregated neighborhoods on a map of Toronto

We are going to use the Folium package for Python to show the center of each neighborhood, aggregated by postal code, on a map of Toronto.

In [15]:
#Install and import folium and nominatim
!pip install folium==0.5.0
!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import folium # plotting library



In [17]:
#Use Nominatim to get the coordinates of the center of Toronto
#We will need it to fix the center of the Folium Map
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [20]:
#Use Folium to plot the map of Toronto and the neighborhoods
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## 4. Invoke the FourSquare API to get the venues for each neighborhood

In the next step, we are going to make some calls to the FourSquare API in order to get the venues within 500 meters of the center of each neighborhood.


In [22]:
# The code was removed by Watson Studio for sharing.
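
Since the credentials cell was removed before sharing, the sketch below only illustrates how a request URL for one neighborhood center might be built. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its parameter names; the function `build_explore_url`, the placeholder credentials, and the default `limit` and `version` values are hypothetical and do not come from this notebook. Only the 500 meter radius is taken from the text above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng,
                      radius=500, limit=100, version='20180605'):
    """Build a venue-search URL for one neighbourhood centre.

    Assumption: the legacy Foursquare v2 `venues/explore` endpoint and its
    parameter names; check the current Foursquare documentation before use.
    """
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,                     # API version date (placeholder value)
        'll': '{},{}'.format(lat, lng),   # latitude,longitude of the centre
        'radius': radius,                 # search radius in meters
        'limit': limit,                   # maximum venues returned (placeholder value)
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials; the coordinates are those of postal code M3A
url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', 43.753259, -79.329656)
print(url)
```

The resulting URL could then be fetched with `requests.get(url).json()`, once per row of `toronto_df`, keeping the credentials out of the shared notebook.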