<h1>Clustering Neighborhoods in Toronto</h1>

Here we will explore, segment, and cluster the neighborhoods in the city of Toronto.

A Wikipedia page lists all the information we need to explore and cluster the neighborhoods in Toronto. We will scrape that page, clean and wrangle the data, and load it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we will explore and cluster the neighborhoods in the city of Toronto.

## Table of Contents

1. <a href="#part1">Scrape the Wikipedia web page to build a dataframe</a>
2. <a href="#part2">Get latitude & longitude details of the neighbourhoods and add to the dataframe</a>  
3. <a href="#part3">Explore and cluster the neighborhoods in Toronto</a>  

### 1. Scrape the Wikipedia web page to build a dataframe

_Import BeautifulSoup and the other libraries needed to scrape the web page and load the data into a dataframe_

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

_Scrape the webpage and create a Soup object_

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(source, 'lxml')

_Create a data frame with the webpage data_

In [3]:
datatable = soup.find('div', class_='mw-parser-output').table
acttable = datatable.find_all('td')

# Collect one dict per assigned postal code, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for td in acttable:
    boroughNeigh = td.p.span.text
    if boroughNeigh != "Not assigned":
        rows.append({'PostalCode': td.p.b.text, 'Borough': boroughNeigh})

column_names = ['PostalCode', 'Borough']
df = pd.DataFrame(rows, columns=column_names)
df.head(10)

Unnamed: 0,PostalCode,Borough
0,M3A,North York(Parkwoods)
1,M4A,North York(Victoria Village)
2,M5A,Downtown Toronto(Regent Park / Harbourfront)
3,M6A,North York(Lawrence Manor / Lawrence Heights)
4,M7A,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke(Islington Avenue)
6,M1B,Scarborough(Malvern / Rouge)
7,M3B,North York(Don Mills)North
8,M4B,East York(Parkview Hill / Woodbine Gardens)
9,M5B,"Downtown Toronto(Garden District, Ryerson)"
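The extraction loop can be checked offline on a small hand-written sample before it is run against the live page. The sketch below is only an illustration: `SAMPLE_HTML` and `extract_rows` are hypothetical names, and the sample markup is an assumption inferred from the `td.p.b` and `td.p.span` lookups used above, not a copy of the Wikipedia page. It also skips any cell that lacks the expected tags instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the cell layout the scraping loop relies on
SAMPLE_HTML = """
<table>
  <tr>
    <td><p><b>M1A</b><br><span>Not assigned</span></p></td>
    <td><p><b>M3A</b><br><span>North York(Parkwoods)</span></p></td>
    <td><p><b>M4A</b><br><span>North York(Victoria Village)</span></p></td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return one dict per assigned postal code found in the table cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for td in soup.find_all("td"):
        # Skip cells that do not have the expected <p><b>/<span> structure
        if td.p is None or td.p.b is None or td.p.span is None:
            continue
        text = td.p.span.text.strip()
        if text != "Not assigned":
            rows.append({"PostalCode": td.p.b.text.strip(), "Borough": text})
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`.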


_Wrangle the dataframe to correct format for analysis_

In [4]:
# The M7A entry (Queen's Park / Ontario Provincial Government) has no borough associated with it, so drop that row.

df = df[df.PostalCode != 'M7A'].reset_index(drop=True)

# Split the Borough column into Borough and Neighborhood:
# the text before '(' is the borough, the text between '(' and ')' is the neighborhood list
df["Neighborhood"] = df["Borough"].str.split(pat='(', n=-1, expand=True)[1]
df["Neighborhood"] = df["Neighborhood"].str.split(pat=')', n=-1, expand=True)[0]
df["Borough"] = df["Borough"].str.split(pat='(', n=-1, expand=True)[0]

# If there are multiple neighborhoods, replace the separator ' /' with ','
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',', n=-1)
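The same split can be expressed on a single cell value with a regular expression, which makes the rule easy to test on individual strings. This is a minimal sketch rather than the method used in the notebook; `CELL_PATTERN` and `split_cell` are hypothetical names introduced for illustration.

```python
import re

# Borough text, then an optional parenthesised neighborhood list;
# anything after the closing ')' is ignored
CELL_PATTERN = re.compile(r"^(?P<borough>[^(]*)(?:\((?P<neighborhoods>[^)]*)\))?")


def split_cell(text):
    """Split 'Borough(Neigh A / Neigh B)' into ('Borough', 'Neigh A, Neigh B')."""
    match = CELL_PATTERN.match(text)  # always matches, both parts may be empty
    borough = match.group("borough").strip()
    raw = match.group("neighborhoods") or ""
    neighborhoods = [part.strip() for part in raw.split("/") if part.strip()]
    return borough, ", ".join(neighborhoods)


print(split_cell("Downtown Toronto(Regent Park / Harbourfront)"))
print(split_cell("North York(Don Mills)North"))
```

A value without parentheses comes back with an empty neighborhood string, so such rows can be inspected before deciding whether to drop them.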

_Count the rows per borough to check the distinct borough values_

In [5]:
df['Borough'].value_counts()

North York                                                      24
Scarborough                                                     17
Downtown Toronto                                                17
Etobicoke                                                       11
Central Toronto                                                  9
West Toronto                                                     6
York                                                             5
East York                                                        4
East Toronto                                                     4
MississaugaCanada Post Gateway Processing Centre                 1
East YorkEast Toronto                                            1
Downtown TorontoStn A PO Boxes25 The Esplanade                   1
EtobicokeNorthwest                                               1
East TorontoBusiness reply mail Processing Centre969 Eastern     1
Name: Borough, dtype: int64

_Some borough values appear only once and are clearly malformed: extra text has been fused onto the borough name. We correct these by hand, for example "East YorkEast Toronto" to "East York" and "Downtown TorontoStn A PO Boxes25 The Esplanade" to "Downtown Toronto", assigning each row to a borough that already has more than one record where possible._

In [6]:
# Use df.loc[row, column] so the assignment is applied to df itself rather than to a possible copy
df.loc[34, 'Borough'] = 'East York'
df.loc[91, 'Borough'] = 'Downtown Toronto'
df.loc[75, 'Borough'] = 'Mississauga'
df.loc[99, 'Borough'] = 'East Toronto'
df.loc[93, 'Borough'] = 'Etobicoke'
df['Borough'].value_counts()

North York          24
Downtown Toronto    18
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
East Toronto         5
York                 5
Mississauga          1
Name: Borough, dtype: int64
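Hard-coded row labels such as 34 or 91 stop being valid if the Wikipedia table changes. As an alternative sketch, each value can be mapped to the longest known borough name it starts with. `KNOWN_BOROUGHS` and `normalize_borough` are hypothetical names, and the list is simply taken from the corrected counts above.

```python
# Borough names taken from the corrected value_counts() output above
KNOWN_BOROUGHS = [
    "North York", "Downtown Toronto", "Scarborough", "Etobicoke",
    "Central Toronto", "West Toronto", "East York", "East Toronto",
    "York", "Mississauga",
]


def normalize_borough(name, known=KNOWN_BOROUGHS):
    """Return the longest known borough that `name` starts with, else `name` unchanged."""
    matches = [borough for borough in known if name.startswith(borough)]
    return max(matches, key=len) if matches else name


for raw in ["East YorkEast Toronto", "EtobicokeNorthwest", "North York"]:
    print(raw, "->", normalize_borough(raw))
```

With pandas this could be applied to the whole column as `df['Borough'] = df['Borough'].map(normalize_borough)`. Values that match no known borough are returned unchanged, so they remain visible in `value_counts()`.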

_The formatted dataframe is displayed_

In [7]:
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M9A,Etobicoke,Islington Avenue
5,M1B,Scarborough,"Malvern, Rouge"
6,M3B,North York,Don Mills
7,M4B,East York,"Parkview Hill, Woodbine Gardens"
8,M5B,Downtown Toronto,"Garden District, Ryerson"
9,M6B,North York,Glencairn


_Shape of the dataframe_

In [8]:
df.shape

(102, 3)

### 2. Get latitude & longitude details of the neighbourhoods and add to the dataframe

_Get the latitude and longitude of each neighborhood using Nominatim._  
_Nominatim allows at most 1 request per second, so we add a time delay between requests._  
_The latitude and longitude values are collected in lists and then added to the dataframe._

In [81]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import time

torontodata = df.copy() # work on a copy of the df dataframe so that df itself is left unchanged

# Create the geocoder once and reuse it for every request
geolocator = Nominatim(user_agent="toronto_explorer")

latList = []
lngList = []

for ind in torontodata.index:
    borough = torontodata['Borough'][ind]
    neighborhood = torontodata['Neighborhood'][ind]

    # First try the full neighborhood string together with the borough
    location = geolocator.geocode(neighborhood + ', ' + borough)
    time.sleep(1)

    # If that fails, try each individual neighborhood in turn, then the borough alone
    i = 0
    while location is None:
        try:
            tempneighbor = neighborhood.split(', ')[i] + ', ' + borough
        except IndexError:
            # No individual neighborhoods left to try
            tempneighbor = borough

        location = geolocator.geocode(tempneighbor)
        time.sleep(1)
        i += 1

    latList.append(location.latitude)
    lngList.append(location.longitude)

torontodata['Latitude'] = latList
torontodata['Longitude'] = lngList

torontodata.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7588,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654174,-79.380812
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.722079,-79.437507
4,M9A,Etobicoke,Islington Avenue,43.622575,-79.514215
5,M1B,Scarborough,"Malvern, Rouge",43.809196,-79.221701
6,M3B,North York,Don Mills,43.775347,-79.345944
7,M4B,East York,"Parkview Hill, Woodbine Gardens",43.712078,-79.302567
8,M5B,Downtown Toronto,"Garden District, Ryerson",43.653552,-79.379373
9,M6B,North York,Glencairn,43.708712,-79.440685
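The fallback order used for geocoding (full neighborhood string, then each individual neighborhood, then the borough alone) can be separated from the network call so that it can be tested without contacting Nominatim. The sketch below is illustrative only: `candidate_queries` and `geocode_with_fallback` are hypothetical helpers, and `geocode` is assumed to be any callable that returns `None` when nothing is found, such as `geolocator.geocode`. Unlike a `while` loop, it stops and returns `None` once every candidate has been tried.

```python
import time


def candidate_queries(neighborhood, borough):
    """Yield search strings from most specific to least specific."""
    yield f"{neighborhood}, {borough}"
    parts = [part.strip() for part in neighborhood.split(",") if part.strip()]
    if len(parts) > 1:
        # Several neighborhoods share this postal code: try each one separately
        for part in parts:
            yield f"{part}, {borough}"
    yield borough


def geocode_with_fallback(geocode, neighborhood, borough, delay=1.0):
    """Return the first non-None result, or None if every candidate fails."""
    for query in candidate_queries(neighborhood, borough):
        location = geocode(query)
        time.sleep(delay)  # stay within the 1 request per second limit
        if location is not None:
            return location
    return None


# Stand-in geocoder for demonstration; the coordinates are made up
fake_geocode = {"Rouge, Scarborough": (43.8, -79.2)}.get
print(list(candidate_queries("Malvern, Rouge", "Scarborough")))
print(geocode_with_fallback(fake_geocode, "Malvern, Rouge", "Scarborough", delay=0))
```

In the notebook, the call would look like `geocode_with_fallback(geolocator.geocode, neighborhood, borough)`, with rows that return `None` handled explicitly afterwards.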


### 3. Explore and cluster the neighborhoods in Toronto