## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, to obtain the data in the table of postal codes and transform it into a pandas dataframe like the one shown below. (The picture is shown after reading, cleaning, and formatting the data according to the requirements.)

3. To create the above dataframe:

   - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
   - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
   - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
   - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
   - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
   - In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
4. Submit a link to your Notebook on your Github repository. (10 marks)

Note: There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas to read the table into a pandas dataframe.

Another way, which is worth learning for more complicated cases of web scraping, is to use the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.
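As one illustration of "any other way", the sketch below pulls table cells out of HTML with only the standard library's `HTMLParser` and hands the rows to pandas. The class name `TableRowParser` and the two-row `sample_html` excerpt are hypothetical; the excerpt only imitates the shape of the Wikipedia table, so treat this as a starting point under those assumptions rather than a complete scraper.

```python
from html.parser import HTMLParser

import pandas as pd


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (for example header text in <th>) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Hypothetical excerpt in the same shape as the Wikipedia table.
sample_html = """
<table>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
sample_df = pd.DataFrame(parser.rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

For the real page, the same parser could be fed the text returned by `requests.get(url).text`, although the page contains more than one table, so the rows would still need filtering.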

# Part-1

In [1]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
text = requests.get(url).text

In [4]:
text


'\n<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of postal codes of Canada: M - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"b445d210-b58f-4c98-b1bd-b6b43c9c505b","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":960187814,"wgRevisionId":960187814,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toronto","Ontar

In [5]:
#Parse HTML

In [7]:
soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly so the result does not depend on which parsers are installed

In [8]:
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"b445d210-b58f-4c98-b1bd-b6b43c9c505b","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":960187814,"wgRevisionId":960187814,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toronto","Ontario-relat

In [9]:
table=soup.find('table')

In [10]:
table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td>

In [11]:
#The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
columns = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = columns)

In [12]:
df

Unnamed: 0,Postalcode,Borough,Neighborhood


In [13]:
# Collect the postal code, borough, and neighborhood from every table row
for tr in table.find_all('tr'):
    row = []
    for td in tr.find_all('td'):
        row.append(td.text.strip())
    if len(row)==3:
        df.loc[len(df)] = row
        

In [14]:
tr

<tr>
<td>M9Z
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>

In [15]:
row

['M9Z', 'Not assigned', 'Not assigned']

In [16]:
td

<td>Not assigned
</td>

In [17]:
df

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.\
More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
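The table scraped in this notebook already lists the M5A neighborhoods in a single cell, so no combining is needed here. If the page were laid out with one row per neighborhood, as the requirement describes, the rows could be combined roughly as follows. The three-row `raw` frame is a hypothetical sample, not data taken from the page.

```python
import pandas as pd

# Hypothetical rows in the one-neighborhood-per-row layout, where M5A appears twice.
raw = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
    }
)

# Group by postal code and borough, then join the neighborhood names with a comma.
combined = (
    raw.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Passing `sort=False` keeps the postal codes in the order they first appear instead of sorting them alphabetically.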

In [18]:
df = df[df['Borough'] != 'Not assigned']

In [19]:
df

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
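One way to express this rule without boolean-indexed assignment is `Series.where`, which keeps a value where the condition holds and takes the replacement elsewhere. The two-row `sample` frame below is hypothetical and only illustrates the rule.

```python
import pandas as pd

# Hypothetical sample: the second row has a borough but no assigned neighborhood.
sample = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M7A"],
        "Borough": ["North York", "Queen's Park"],
        "Neighborhood": ["Parkwoods", "Not assigned"],
    }
)

# Keep the neighborhood where it is assigned; otherwise fall back to the borough.
assigned = sample["Neighborhood"] != "Not assigned"
sample["Neighborhood"] = sample["Neighborhood"].where(assigned, sample["Borough"])
print(sample)
```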

In [20]:
# Replace only the Neighborhood value, and only in rows where it is 'Not assigned'.
df = df.copy()  # detach from the filtered slice so pandas does not raise SettingWithCopyWarning
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']


In [21]:
df

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [22]:
df.reset_index(inplace=True)

In [23]:
df.shape

(103, 4)

In [25]:
df.columns

Index(['index', 'Postalcode', 'Borough', 'Neighborhood'], dtype='object')

In [26]:
df.drop('index', axis=1)  # returns a new DataFrame; df itself still keeps the 'index' column

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [27]:
# unique is a method and has to be called; without the parentheses, a holds the bound method itself
a = df['Postalcode'].unique()

In [28]:
a

# Part-2

Now that you have built a dataframe with the postal code, borough name, and neighborhood name of each neighborhood, you need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data.\
Here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
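A compact way to attach the coordinates is a left merge on the postal code. The sketch below uses two small hypothetical frames whose column names mirror the ones in this notebook (`Postalcode` in the scraped table, `Postal Code` in the csv file); the coordinate values are the ones listed for M3A and M4A.

```python
import pandas as pd

# Hypothetical two-row stand-ins for the scraped table and the csv file.
neighborhoods = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighborhood": ["Parkwoods", "Victoria Village"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M4A", "M3A"],  # deliberately in a different order
        "Latitude": [43.725882, 43.753259],
        "Longitude": [-79.315572, -79.329656],
    }
)

# A left merge keeps every neighborhood row and matches on the key, not on row position.
merged = neighborhoods.merge(
    coords, left_on="Postalcode", right_on="Postal Code", how="left"
).drop(columns="Postal Code")
print(merged)
```

Because the match is made on the key, the two tables do not need to be sorted the same way.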

In [29]:
csvFile = pd.read_csv('Geospatial_Coordinates.csv')

In [30]:
csvFile

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [31]:
csvFile.loc[1, 'Postal Code']

'M1C'

In [32]:
df['Postalcode']==csvFile['Postal Code']

0      False
1      False
2      False
3      False
4      False
       ...  
98     False
99     False
100    False
101    False
102    False
Length: 103, dtype: bool
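The comparison above is made row by row, so it returns all False whenever the two columns list the same codes in a different order. A small sketch with hypothetical series shows the difference between a positional comparison and a membership test:

```python
import pandas as pd

# Hypothetical code lists: two codes are shared, but at different positions.
codes_a = pd.Series(["M3A", "M4A", "M5A"])
codes_b = pd.Series(["M1B", "M3A", "M4A"])

positional = codes_a == codes_b      # compares row 0 with row 0, row 1 with row 1, ...
membership = codes_a.isin(codes_b)   # asks whether each code appears anywhere in codes_b

print(positional.tolist())
print(membership.tolist())
```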

In [33]:
colms = ['Latitude', 'Longitude']
sorted_lt_lg=pd.DataFrame(columns = colms)

In [34]:
sorted_lt_lg

Unnamed: 0,Latitude,Longitude


In [44]:
# For each postal code in df, look up its coordinates in csvFile.
# The resulting lists follow the row order of df.
lat = []
lon = []
for post in df['Postalcode']:
    for j, code in enumerate(csvFile['Postal Code']):
        if post == code:
            lat.append(csvFile.iloc[j, 1])  # Latitude column
            lon.append(csvFile.iloc[j, 2])  # Longitude column
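The nested loop silently skips a postal code that has no match, which would leave the lists shorter than the dataframe. An alternative sketch uses `Series.map` against a lookup indexed by postal code, so an unmatched code shows up as NaN instead. The frames and the code `M0X` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins; "M0X" is a made-up code with no coordinates.
sample = pd.DataFrame({"Postalcode": ["M3A", "M4A", "M0X"]})
coords = pd.DataFrame(
    {
        "Postal Code": ["M4A", "M3A"],
        "Latitude": [43.725882, 43.753259],
        "Longitude": [-79.315572, -79.329656],
    }
)

# Index the coordinates by postal code, then map each code to its values.
lookup = coords.set_index("Postal Code")
sample["Latitude"] = sample["Postalcode"].map(lookup["Latitude"])
sample["Longitude"] = sample["Postalcode"].map(lookup["Longitude"])
print(sample)
```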

In [43]:
csvFile.iloc[0][2]

-79.19435340000001

In [45]:
lat

[43.7532586,
 43.725882299999995,
 43.6542599,
 43.718517999999996,
 43.6623015,
 43.6678556,
 43.806686299999996,
 43.745905799999996,
 43.7063972,
 43.6571618,
 43.709577,
 43.6509432,
 43.7845351,
 43.72589970000001,
 43.695343900000005,
 43.6514939,
 43.6937813,
 43.6435152,
 43.7635726,
 43.67635739999999,
 43.644770799999996,
 43.6890256,
 43.7709921,
 43.7090604,
 43.6579524,
 43.669542,
 43.773136,
 43.8037622,
 43.7543283,
 43.7053689,
 43.65057120000001,
 43.66900510000001,
 43.7447342,
 43.7785175,
 43.7679803,
 43.685347,
 43.6408157,
 43.647926700000006,
 43.7279292,
 43.7869473,
 43.737473200000004,
 43.6795571,
 43.6471768,
 43.6368472,
 43.711111700000004,
 43.7574902,
 43.7390146,
 43.6689985,
 43.6481985,
 43.713756200000006,
 43.7563033,
 43.716316,
 43.789053,
 43.7284964,
 43.6595255,
 43.7332825,
 43.6911158,
 43.7247659,
 43.692657000000004,
 43.7701199,
 43.7616313,
 43.7280205,
 43.7116948,
 43.67318529999999,
 43.706876,
 43.7574096,
 43.752758299999996,
 43.7

In [48]:
lon

[-79.3296565,
 -79.31557159999998,
 -79.3606359,
 -79.46476329999999,
 -79.3894938,
 -79.53224240000002,
 -79.19435340000001,
 -79.352188,
 -79.309937,
 -79.37893709999999,
 -79.44507259999999,
 -79.55472440000001,
 -79.16049709999999,
 -79.340923,
 -79.3183887,
 -79.3754179,
 -79.42819140000002,
 -79.57720079999999,
 -79.1887115,
 -79.2930312,
 -79.3733064,
 -79.453512,
 -79.21691740000001,
 -79.3634517,
 -79.3873826,
 -79.4225637,
 -79.23947609999999,
 -79.3634517,
 -79.4422593,
 -79.34937190000001,
 -79.3845675,
 -79.4422593,
 -79.23947609999999,
 -79.3465557,
 -79.48726190000001,
 -79.3381065,
 -79.38175229999999,
 -79.4197497,
 -79.26202940000002,
 -79.385975,
 -79.46476329999999,
 -79.352188,
 -79.38157640000001,
 -79.42819140000002,
 -79.2845772,
 -79.37471409999999,
 -79.5069436,
 -79.31557159999998,
 -79.37981690000001,
 -79.4900738,
 -79.56596329999999,
 -79.23947609999999,
 -79.40849279999999,
 -79.49569740000001,
 -79.340923,
 -79.4197497,
 -79.47601329999999,
 -79.53224240

In [49]:
# assign returns a new DataFrame, so no SettingWithCopyWarning is raised
df = df.assign(Latitude=lat, Longitude=lon)


In [50]:
df.head()

Unnamed: 0,index,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,2,M3A,North York,Parkwoods,43.753259,-79.329656
1,3,M4A,North York,Victoria Village,43.725882,-79.315572
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [51]:
df.drop('index',axis=1)  # returns a new DataFrame; df itself still keeps the 'index' column

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


# Part-3

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together.
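If you choose to work only with boroughs that contain the word Toronto, the selection could look like the sketch below. The three-row `sample` frame is hypothetical; its values are copied from rows shown earlier in this notebook.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns used for clustering.
sample = pd.DataFrame(
    {
        "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront", "Church and Wellesley"],
        "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
        "Latitude": [43.753259, 43.654260, 43.665860],
        "Longitude": [-79.329656, -79.360636, -79.383160],
    }
)

# Keep only the rows whose borough name contains the word "Toronto".
toronto_only = sample[sample["Borough"].str.contains("Toronto")].reset_index(drop=True)
print(toronto_only)
```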

In [58]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs # sklearn.datasets.samples_generator was removed in newer scikit-learn releases

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas 1.0 or later)

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [60]:
data = df[['Neighborhood','Borough', 'Latitude', 'Longitude']]

In [61]:
data.head()

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,Parkwoods,North York,43.753259,-79.329656
1,Victoria Village,North York,43.725882,-79.315572
2,"Regent Park, Harbourfront",Downtown Toronto,43.65426,-79.360636
3,"Lawrence Manor, Lawrence Heights",North York,43.718518,-79.464763
4,"Queen's Park, Ontario Provincial Government",Downtown Toronto,43.662301,-79.389494
