# ***SEGMENTING AND CLUSTERING NEIGHBORHOODS IN TORONTO, CANADA***
***Part 1***

### In this project, I scrape data from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. The scraped table is the foundation of the project: it contains the Postal Codes, Boroughs and Neighborhoods of Toronto, Canada.

### The goal of the project is to use a clustering algorithm, K-Means to be precise, to segment neighborhoods in the North York Borough of Toronto.

# ***Libraries for this Project***

In [None]:
!pip install bs4
!pip install geopy 
!pip install folium==0.5.0

In [4]:
from urllib.request import urlopen as uReq
from  bs4 import BeautifulSoup as soup
import json 
from geopy.geocoders import Nominatim 
import requests 
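try:
    from pandas import json_normalize  # top-level location since pandas 1.0
except ImportError:
    from pandas.io.json import json_normalize  # fallback for older pandas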
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
from sklearn import metrics
import folium 
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist

print('Libraries imported.')

Libraries imported.


# ***Web Scraping the table data from Wikipedia***

##### **In this section, I read the page's HTML source, create a BeautifulSoup object to parse the document, and extract the table cells that hold the data.**

In [5]:
my_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

uClient = uReq(my_url)
page_html = uClient.read()

uClient.close()

In [6]:
page_soup = soup(page_html, "html.parser")

page_soup.h1 # Here I am just looking at the first header

<h1 class="firstHeading" id="firstHeading" lang="en">List of postal codes of Canada: M</h1>

In [8]:
myTable = page_soup.find("table",{"class":"wikitable sortable"})



##### **After inspecting 'myTable', I noticed that the "td" elements in the HTML contain the Postal Code, Borough and Neighborhood values, so I will find all "td" elements and extract the needed data.**

In [9]:
myTable_row = myTable.find_all('td')  # find_all is the current name; findAll is a legacy alias

myTable_row[0:7]

[<td>M1A
 </td>,
 <td>Not assigned
 </td>,
 <td>Not assigned
 </td>,
 <td>M2A
 </td>,
 <td>Not assigned
 </td>,
 <td>Not assigned
 </td>,
 <td>M3A
 </td>]
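
The flat list of "td" cells above loses the row boundaries of the table. As a minimal sketch of an alternative, assuming each data row holds exactly three "td" cells, the cells can be read one "tr" at a time. The HTML string and the names `sample_html`, `sample_table` and `rows` are hypothetical stand-ins written for illustration, not the real Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that mimics the layout of the scraped table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

rows = []
for tr in sample_table.find_all("tr"):
    # Collect the text of every <td> in this row, without surrounding whitespace
    cells = [td.get_text().strip() for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
        rows.append(cells)

print(rows)
```

Reading row by row keeps each postal code next to its own borough and neighborhood, so a row with a missing cell would be skipped instead of shifting every later value.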

##### **Since the cells repeat in groups of three (postal code, borough and neighborhood, in that order), I will use a for loop with a step of 3 to put each item in its respective list.**

In [10]:
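post_code = []
# Postal codes sit at positions 0, 3, 6, ... of the flat list of cells
for i in range(0, len(myTable_row), 3):
    # get_text() also copes with cells that contain nested tags such as links
    joined_pc = myTable_row[i].get_text().replace("\n", "")
    post_code.append(joined_pc)

post_code[0:4]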

['M1A', 'M2A', 'M3A', 'M4A']

In [11]:
borough = []
# Boroughs sit at positions 1, 4, 7, ... of the flat list of cells
for i in range(1, len(myTable_row), 3):
    # get_text() also copes with cells that contain nested tags such as links
    joined_bo = myTable_row[i].get_text().replace("\n", "")
    borough.append(joined_bo)
print(borough[:5])

['Not assigned', 'Not assigned', 'North York', 'North York', 'Downtown Toronto']


In [12]:
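neig = []
# Neighborhoods sit at positions 2, 5, 8, ... of the flat list of cells
for i in range(2, len(myTable_row), 3):
    # get_text() also copes with cells that contain nested tags such as links
    joined_ne = myTable_row[i].get_text().replace("\n", "")
    neig.append(joined_ne)
print(neig[:5])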

['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront']
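
The three loops above all walk the same list with a step of 3 and differ only in their starting offset. A minimal sketch of the same idea with list slicing, assuming the cell texts have already been extracted into a plain list of strings; `cells`, `post_code_s`, `borough_s`, `neig_s` and `records` are hypothetical names and the values are toy data.

```python
# Toy list of cell texts, in the same order as the scraped <td> elements.
cells = [
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M4A", "North York", "Victoria Village",
]

# A slice with step 3 picks every third item, starting at the given offset
post_code_s = cells[0::3]
borough_s = cells[1::3]
neig_s = cells[2::3]

# zip() stitches the three columns back into (postal code, borough, neighborhood) rows
records = list(zip(post_code_s, borough_s, neig_s))

print(post_code_s)
print(records[1])
```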


In [13]:
df = pd.DataFrame({"Postal Code":post_code,"Borough":borough,"Neighborhood":neig})

df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


##### **Having built the dataframe 'df' from the scraped data and inspected it, I noticed that some Boroughs and Neighborhoods are not assigned.**

##### ***Assumption:*** If both the Borough and Neighborhood are not assigned for a particular Postal Code I will drop the entire row. 

In [14]:
df.replace("Not assigned",np.nan, inplace = True)

df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


##### **I am checking for missing values to see whether any rows have an assigned Borough but a missing Neighborhood. If there are none, I will drop the unassigned rows, reset the index and sort by Postal Code.**

In [15]:
missing_values = df.isnull()

# For each column, count the missing (True) and present (False) entries
for column in missing_values.columns:
    print(column)
    print(missing_values[column].value_counts())
    print("")

Postal Code
False    180
Name: Postal Code, dtype: int64

Borough
False    103
True      77
Name: Borough, dtype: int64

Neighborhood
False    103
True      77
Name: Neighborhood, dtype: int64
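
Equal per-column counts suggest that the same rows are missing both values, but they do not test it row by row. A minimal sketch of a direct check on a toy dataframe, with one possible way to handle such rows. The frame `toy`, the names `orphan_mask`, `n_orphans` and `cleaned`, the made-up "M9Z" row, and the policy of reusing the Borough name are all assumptions for illustration, not something the data above requires.

```python
import numpy as np
import pandas as pd

# Toy data: one fully unassigned row, one complete row, and one made-up row
# that has a Borough but no Neighborhood.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M9Z"],
    "Borough": [np.nan, "North York", "Example Borough"],
    "Neighborhood": [np.nan, "Parkwoods", np.nan],
})

# Rows where the Borough is known but the Neighborhood is missing
orphan_mask = toy["Borough"].notna() & toy["Neighborhood"].isna()
n_orphans = int(orphan_mask.sum())
print(n_orphans)

# One possible policy (an assumption): reuse the Borough name as the Neighborhood
toy.loc[orphan_mask, "Neighborhood"] = toy.loc[orphan_mask, "Borough"]

# Drop the rows that still have no Borough, then renumber the index
cleaned = toy.dropna(subset=["Borough"]).reset_index(drop=True)
print(cleaned)
```

If `n_orphans` is 0, as the counts above suggest for the scraped table, the fill step changes nothing and only the `dropna` matters.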



In [16]:
df.dropna(subset=["Borough"],axis = 0,inplace = True)

In [17]:
df.reset_index(drop = True, inplace = True)

In [20]:
df.sort_values(by=["Postal Code"], inplace =True)



In [21]:
df.reset_index(drop = True, inplace = True)

In [23]:
df.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [24]:
df.to_csv("postalcodeTable.csv", index = False)

In [19]:
df.shape

(103, 3)
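
Beyond the shape, a few quick checks can confirm that the cleaned table is in the expected state before it is reused. A minimal sketch, assuming the column names used above; `check_postal_table` is a hypothetical helper, not part of pandas, and the two small frames are toy data.

```python
import pandas as pd

def check_postal_table(frame):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if frame.isna().any().any():
        problems.append("missing values present")
    if frame["Postal Code"].duplicated().any():
        problems.append("duplicate postal codes")
    if (frame == "Not assigned").any().any():
        problems.append("'Not assigned' placeholders remain")
    return problems

# Toy frame that should pass every check
good = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})

# Toy frame with a repeated postal code and a leftover placeholder
bad = pd.DataFrame({
    "Postal Code": ["M1A", "M1A"],
    "Borough": ["Not assigned", "Scarborough"],
    "Neighborhood": ["Not assigned", "Malvern, Rouge"],
})

print(check_postal_table(good))
print(check_postal_table(bad))
```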