<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

<h2>1.- Extracting Data from Wikipedia into a Pandas Dataframe</h2>

Start by fetching the data from the Wikipedia page and building a dataframe from the postal code table.

In [16]:
#importing library to open URLs
import urllib.request

#set the URL we want to scrape
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#set all HTML page content into a variable named page using urllib
page = urllib.request.urlopen(url)

#importing the BeautifulSoup library to parse the HTML document in page variable
from bs4 import BeautifulSoup

#parsing the HTML in the page variable with BeautifulSoup
soup = BeautifulSoup(page, "lxml")

#print content of the page parsed by BeautifulSoup
#print(soup.prettify())

#get all tables from the content page using the find_all function of BeautifulSoup
#all_tables = soup.find_all("table")
#all_tables

#get the table we are interested in
canada_postal_codes_table = soup.find("table", class_="wikitable sortable")
#canada_postal_codes_table

#to create the dataframe lets create three different lists that represents each column of the dataframe
postal_code = []
borough = []
neighborhood = []

#iterate over each row of the HTML table stored in the canada_postal_codes_table variable
for row in canada_postal_codes_table.find_all("tr"):
    #get all the data cells of the current row
    cells = row.find_all("td")
    #keep only rows with exactly three data cells, otherwise discard the row
    if len(cells) != 3:
        continue
    #read the first text node of each cell and strip the trailing newline
    code, boro, hood = (cell.find(string=True).rstrip("\n") for cell in cells)
    #rows whose borough is "Not assigned" are excluded
    if boro == "Not assigned":
        continue
    postal_code.append(code)
    borough.append(boro)
    #if the neighborhood is "Not assigned", the borough name is used as the neighborhood name
    neighborhood.append(boro if hood == "Not assigned" else hood)

#import Pandas to create the dataframe
import pandas as pd

#create the dataframe from the postal_code, borough and neighborhood lists built above
df = pd.DataFrame({"PostalCode": postal_code, "Borough": borough, "Neighborhood": neighborhood})
df


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Print the number of rows and columns of the dataframe.

In [18]:
df.shape

(103, 3)
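
The two cleaning rules applied inside the scraping loop above can be checked in isolation, without fetching the page. The following is a minimal sketch: the function name `clean_rows` and the sample rows are hypothetical and are not part of the scraped data.

```python
NOT_ASSIGNED = "Not assigned"

def clean_rows(rows):
    """Apply the two rules to (postal_code, borough, neighborhood) tuples."""
    cleaned = []
    for postal_code, borough, neighborhood in rows:
        # Rule 1: drop rows whose borough is not assigned.
        if borough == NOT_ASSIGNED:
            continue
        # Rule 2: fall back to the borough name when the neighborhood is not assigned.
        if neighborhood == NOT_ASSIGNED:
            neighborhood = borough
        cleaned.append((postal_code, borough, neighborhood))
    return cleaned

# Hypothetical sample rows used only to exercise both rules.
sample = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
    ("M9Z", "Example Borough", "Not assigned"),
]
result = clean_rows(sample)
print(result)
```

The first sample row is dropped, the second passes through unchanged, and the third gets its borough name as its neighborhood.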

<h2>2.- Getting the Latitude and Longitude of each Neighborhood</h2>

Now let's create a new dataframe named df_canada_neighborhood that contains the Canada neighborhood info together with the latitude and longitude of each neighborhood.

For simplicity, let's take the latitude and longitude from the CSV file provided on the laboratory page.

In [22]:
#reading the latitude and longitude from CSV file and creating a new dataframe
df_lat_lon = pd.read_csv("Geospatial_Coordinates.csv")
#df_lat_lon

#joining df and df_lat_lon on the postal code (inner join) to get a single dataframe with
#neighborhood, latitude and longitude info
df_canada_neighborhood = pd.concat([df.set_index("PostalCode"), df_lat_lon.set_index("Postal Code")], axis=1, join='inner').reset_index()
df_canada_neighborhood

Unnamed: 0,index,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
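
The postal code column in the output above is named `index` because `reset_index` names the restored index that way. As an alternative sketch, the same inner join can be written with `DataFrame.merge`, which keeps the `PostalCode` column name. The dataframes below are small stand-ins: two rows are copied from the tables above and the `X0X` row is hypothetical, added only to show that unmatched codes are dropped.

```python
import pandas as pd

# Stand-in for df: two real rows plus one hypothetical code without coordinates.
df_demo = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "X0X"],
    "Borough": ["North York", "North York", "Example Borough"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Example"],
})

# Stand-in for df_lat_lon, using the column names of the CSV file.
coords_demo = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Inner join on the postal code; the duplicate key column is dropped afterwards.
merged = (
    df_demo.merge(coords_demo, how="inner", left_on="PostalCode", right_on="Postal Code")
    .drop(columns="Postal Code")
)
print(merged)
```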


<h2>3.- Explore and Cluster the Neighborhoods in Toronto</h2>

In [34]:
#Install and import geopy to get the latitude and longitude of Toronto
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = "Toronto, TO"

#Get the latitude and longitude of Toronto
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

#create a new dataframe named toronto_data with the rows whose borough name contains the string "Toronto"
toronto_data = df_canada_neighborhood[df_canada_neighborhood["Borough"].str.contains("Toronto")].reset_index()
toronto_data.head()

The geographical coordinates of Toronto are 43.65238435, -79.38356765.


Unnamed: 0,level_0,index,Borough,Neighborhood,Latitude,Longitude
0,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,19,M4E,East Toronto,The Beaches,43.676357,-79.293031
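
The `level_0` column in the output above is the previous row index, which `reset_index` keeps as a column by default. If that column is not needed, one option is to pass `drop=True`. The sketch below shows the idea on a small stand-in dataframe whose rows are taken from the tables above; the name `toronto_only` is made up for this example.

```python
import pandas as pd

# Small stand-in for df_canada_neighborhood.
demo = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "East Toronto"],
    "Neighborhood": ["Parkwoods", "St. James Town", "The Beaches"],
})

# Keep the boroughs whose name contains "Toronto", then renumber the rows.
# drop=True discards the old index instead of turning it into a column.
toronto_only = demo[demo["Borough"].str.contains("Toronto")].reset_index(drop=True)
print(toronto_only)
```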


<h3>3.1.- Exploring the Neighborhoods in Toronto using Foursquare API</h3>

Now let's explore the neighborhoods dataframe using the Foursquare API. We will apply the same exploration steps that were used in the laboratory.

In [35]:
#set the credentials for Foursquare (the real credentials were changed for security)
CLIENT_ID = 'PVDRGPZL035RJZD5U4LD4H2A5QCUXBO1EDY2DVT3FB2HWECH'
CLIENT_SECRET = 'ZJNLMNVCGSGMI4ICXQLJWQ23ZTJGTVTAMRABPKJZS3R4524C'
VERSION = '20180605'
LIMIT = 100

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: PVDRGPZL035RJZD5U4LD4H2A5QCUXBO1EDY2DVT3FB2HWECH
CLIENT_SECRET:ZJNLMNVCGSGMI4ICXQLJWQ23ZTJGTVTAMRABPKJZS3R4524C
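
Since credentials written into a notebook end up in its saved output, a common alternative is to read them from environment variables and to mask them when printing. The following is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example, not names required by Foursquare.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping of environment variables.

    The variable names are assumptions chosen for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demo with a stand-in mapping instead of the real environment.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID: " + mask(client_id))
print("CLIENT_SECRET: " + mask(client_secret))
```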


The next step is to create a function that gets the venues for each neighborhood in Toronto.

In [42]:
# library to handle requests
import requests

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
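
The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that comes back without a category. The sketch below isolates the parsing step so it can be tried without a network call. The helper name `extract_venues`, the `"Unknown"` label and the hand-written payload are assumptions for illustration; only the nesting (`response`, `groups`, `items`, `venue`) follows the access pattern used above.

```python
def extract_venues(name, lat, lng, payload):
    """Flatten one explore response into the row tuples built by getNearbyVenues."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Assumption: label venues without a category as "Unknown" instead of failing.
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# Hand-written payload (not real API output); the second venue is hypothetical.
demo_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Tandem Coffee",
                           "location": {"lat": 43.653559, "lng": -79.361809},
                           "categories": [{"name": "Coffee Shop"}]}},
                {"venue": {"name": "Hypothetical Venue",
                           "location": {"lat": 43.65, "lng": -79.36},
                           "categories": []}},
            ]
        }]
    }
}
rows = extract_venues("Regent Park, Harbourfront", 43.65426, -79.360636, demo_payload)
for row in rows:
    print(row)
```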

Now let's execute the previous function for each neighborhood in the toronto_data dataframe.  <b>Important:</b> since we have one latitude and longitude per postal code, neighborhoods that are listed together, separated by commas, will be analyzed as a single neighborhood.

In [43]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West,  Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport


Now let's check the resulting dataframe.

In [44]:
print(toronto_venues.shape)
toronto_venues.head()

(1616, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
4,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


Let's check how many venues were returned for each neighborhood.

In [45]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,58,58,58,58,58,58
"Brockton, Parkdale Village, Exhibition Place",22,22,22,22,22,22
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",17,17,17,17,17,17
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17,17,17,17,17,17
Central Bay Street,61,61,61,61,61,61
Christie,16,16,16,16,16,16
Church and Wellesley,81,81,81,81,81,81
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,35,35,35,35,35,35
Davisville North,7,7,7,7,7,7


Let's find out how many unique categories there are among all the returned venues.

In [47]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 238 unique categories.
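
The per-neighborhood counts and the unique-category count above can also be obtained more compactly. As a sketch, `groupby(...).size()` returns a single count per neighborhood instead of repeating it in every column, and `nunique()` counts distinct categories directly. The dataframe below is a hypothetical stand-in for toronto_venues; its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for toronto_venues with only the two columns needed here.
demo_venues = pd.DataFrame({
    "Neighborhood": ["Christie", "Christie", "Davisville", "Davisville", "Davisville"],
    "Venue Category": ["Cafe", "Park", "Cafe", "Cafe", "Gym"],
})

# One number per neighborhood: how many venues were returned for it.
venues_per_neighborhood = demo_venues.groupby("Neighborhood").size()

# Number of distinct venue categories across all neighborhoods.
n_categories = demo_venues["Venue Category"].nunique()

print(venues_per_neighborhood)
print("There are {} unique categories.".format(n_categories))
```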


<h3>3.2.- Analyze each Neighborhood</h3>

Let's create a new dataframe with one column per venue category, where each row represents a venue and holds a 1 in the column of its category and 0 elsewhere. Assign the result to a dataframe named toronto_onehot.

In [74]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
neighborhood_column = toronto_onehot.pop("Neighborhood")
toronto_onehot.insert(0, "Neighborhood", neighborhood_column)
toronto_onehot  

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1611,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1612,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1613,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1614,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group the rows by neighborhood and take the mean of each column, which gives the frequency of occurrence of each category in the neighborhood.

In [75]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.016393,0.0,0.0,0.016393,0.0,0.016393
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.012346,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.012346,0.0,0.0,0.024691
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
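
To see why the mean of one-hot columns is a frequency, here is a small worked sketch on made-up data: a neighborhood with two cafes and one park ends up with 2/3 for Cafe and 1/3 for Park. The venue categories below are hypothetical, and `dtype=int` is passed on the assumption that integer 0/1 columns are wanted, since the default dtype of `get_dummies` depends on the pandas version.

```python
import pandas as pd

# Hypothetical venues: three in one neighborhood, one in another.
demo = pd.DataFrame({
    "Neighborhood": ["Christie", "Christie", "Christie", "Davisville"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Park"],
})

# One row per venue, one 0/1 column per category.
onehot = pd.get_dummies(demo[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighborhood", demo["Neighborhood"])

# The mean of a 0/1 column within a group is the share of venues in that category.
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1 across the category columns, because every venue belongs to exactly one category in this encoding.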
