For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data in the table of postal codes and to transform the data into a pandas dataframe like the one shown.
3. To create the above dataframe:
   - The dataframe will consist of three columns: <b>PostalCode, Borough, and Neighborhood</b>.
   - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
   - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that <b>M5A</b> is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma, as shown in row 11 in the above table.
   - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
   - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
   - In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
4. Submit a link to your Notebook on your GitHub repository. (10 marks)

## Part 1

### Import library

In [1]:
import pandas as pd
import numpy as np
import lxml
from bs4 import BeautifulSoup as bs
import requests

### Get table from URL

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" # URL required
req = requests.get(url) # Send request
html_parser = bs(req.content, 'html.parser') # Parse the response with Python's built-in html.parser

In [3]:
table = html_parser.find('table') # Find Table
table

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>
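
The markup above puts each value in a `<td>` cell, sometimes wrapped in an `<a>` link. As a rough sketch of what the table-to-rows step involves, the snippet below walks a shortened copy of that markup with the standard library's `html.parser`. The class name `TableRowParser` and the `sample_html` string are made up for this illustration; the notebook itself relies on `pd.read_html` instead.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # text inside <a> links arrives here too, so links need no special case
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# a shortened, hand-copied sample of the markup shown above
sample_html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
</tbody></table>
"""

parser = TableRowParser()
parser.feed(sample_html)
header, *records = parser.rows
print(header)
print(records)
```

The first row holds the column names and the remaining rows hold the records, which is the same split that `header=0` asks `pd.read_html` to make.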

In [4]:
df = pd.read_html(str(table),header=0)[0] # put table into dataframe

df = df[df.Borough != 'Not assigned'].copy() # Keep only rows with an assigned borough; .copy() avoids a SettingWithCopyWarning on later assignments
df.head(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


<b>M5A</b> is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma, as shown in row 11 <b>(Wikipedia content was rewritten)</b> in the above table. <b>If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.</b>

In [5]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [6]:
df.loc[df.Neighbourhood =="Not assigned", 'Neighbourhood'] = df['Borough']
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [7]:

df = pd.DataFrame(df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join) )
df = df.reset_index()


In [8]:
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [9]:
df.shape

(103, 3)
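
The three Part 1 rules can also be collected into one function and checked on a small hand-made frame. This is only a sketch: the function name `clean_postcode_table` and the `toy` rows are assumptions for illustration, with values borrowed from the table output above.

```python
import pandas as pd

def clean_postcode_table(raw):
    """Apply the Part 1 rules to a frame with Postcode, Borough, Neighbourhood columns."""
    # rule 1: drop rows whose borough is 'Not assigned' (copy so later writes are safe)
    out = raw[raw['Borough'] != 'Not assigned'].copy()
    # rule 2: an unassigned neighbourhood takes the borough's name
    missing = out['Neighbourhood'] == 'Not assigned'
    out.loc[missing, 'Neighbourhood'] = out.loc[missing, 'Borough']
    # rule 3: one row per postcode, neighbourhoods joined with ', '
    return (out.groupby(['Postcode', 'Borough'])['Neighbourhood']
               .apply(', '.join)
               .reset_index())

# toy input covering each rule once
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

cleaned = clean_postcode_table(toy)
print(cleaned)
```

Running the function on a toy frame like this is a quick way to confirm each rule before applying it to the scraped table.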

## Part 2

In [10]:
import geocoder

def getlatlon(postcode):
    """Return [latitude, longitude] for a postal code using the ArcGIS geocoder."""
    # initialize the result to None
    lat_lng_coords = None
    search_query = '{}, Toronto, Ontario'.format(postcode)
    # loop until the geocoder returns coordinates (note: there is no retry limit)
    try:
        while lat_lng_coords is None:
            g = geocoder.arcgis(search_query)
            lat_lng_coords = g.latlng

    except IndexError:
        # fall back to placeholder coordinates if the lookup raises
        latitude = 0.0
        longitude = 0.0
        print('BACKUP')
        return [latitude, longitude]

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    print(latitude, longitude)
    return [latitude, longitude]

In [11]:
postbox_list = df['Postcode'].apply(getlatlon).tolist()

43.811525000000074 -79.19551721399995
43.78573000000006 -79.15874999999994
43.76569000000006 -79.17525603599995
43.76835912100006 -79.21758999999997
43.76968799900004 -79.23943999999995
43.74312500000008 -79.23174973599998
43.726244585000074 -79.26366999999993
43.71313321100007 -79.28505499999994
43.72357500000004 -79.23497617799995
43.69666500000005 -79.26016331599999
43.759975000000054 -79.26897402899993
43.750710464000065 -79.30055999999996
43.79394000000008 -79.26798280099996
43.78472500000004 -79.29904659999994
43.817810000000065 -79.28024362199994
43.80088094900003 -79.32073999999994
43.83421500000003 -79.21670085099998
43.80284500000005 -79.35623615099996
43.780880000000025 -79.34779577599994
43.781015000000025 -79.38054242199996
43.75719200000003 -79.37986499999994


Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)


43.79147500000005 -79.41360487299994
43.76816500000007 -79.40741984599998
43.74785500000007 -79.40006223799998
43.77769500000005 -79.44579657299994
43.752440000000036 -79.32927072599995
43.74919500000004 -79.36190541699995
43.72142500000007 -79.34345278999996
43.757565000000056 -79.44819079299998
43.764665000000036 -79.48718266299994
43.73902625000005 -79.46731999999997
43.74088500000005 -79.50502651899995
43.73458500000004 -79.49315062599999
43.755330601000026 -79.51958999999994
43.730420577000075 -79.31331999999998
43.707535000000064 -79.31177329699995
43.68966500000005 -79.30716910999996
43.67684518300007 -79.29522499999996
43.70976500000006 -79.36390090899994
43.70127000000008 -79.34984401799994
43.688765344000046 -79.33417499999996
43.68326150000007 -79.35511999999994
43.66796500000004 -79.31467251099997
43.662765652000076 -79.33482999999995
43.72816000000006 -79.38708518799996
43.712815000000035 -79.38852582199996
43.71452278400005 -79.40695999999997
43.70339500000006 -79.3859636

Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)


43.69065500000005 -79.38356145799997
43.68608285400006 -79.40233499999994
43.681940000000054 -79.37847416699998
43.66816000000006 -79.36660236199998
43.666585000000055 -79.38130203699995
43.65512000000007 -79.36263979699999
43.65736301100003 -79.37817999999999
43.65121000000005 -79.37548057699996
43.64516015600003 -79.37367499999993
43.65609081300005 -79.38492999999994
43.649515000000065 -79.38250344699998
43.623470000000054 -79.39150736399995
43.64710000000008 -79.38153109899997
43.648205000000075 -79.37879339899996
43.735460000000046 -79.41916412899997
43.711941154000044 -79.41911999999996
43.69478500000008 -79.41440483299994
43.674840000000074 -79.40369769099993
43.663110000000074 -79.40180056699995
43.65357000000006 -79.39724915699998
43.64081500000003 -79.39953781899999
43.648690000000045 -79.38543999999996
43.64828000000006 -79.38146082599997
43.72312500000004 -79.45158914699994
43.70799000000005 -79.44836733199998
43.692105179000066 -79.43035499999996
43.68864000000008 -79.45101

Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)


43.66878132800008 -79.42070999999999
43.66508694300006 -79.43870499999997
43.64848500000005 -79.41774150899994
43.639410000000055 -79.42436201999999
43.71381000000008 -79.48830541199999
43.69454500000006 -79.48464278099993
43.67582500000003 -79.48205223199994
43.65997500000003 -79.46287357999995
43.64787000000007 -79.44976249999996
43.64988500000004 -79.47492879599997
43.66110229800006 -79.39103499999999
43.648690000000045 -79.38543999999996
43.648690000000045 -79.38543999999996
43.60987000000006 -79.49817823899997
43.60113082100003 -79.53878499999996
43.65369000000004 -79.51111717299995
43.63276500000006 -79.48960141199996
43.624630000000025 -79.52694976199996
43.662242201000026 -79.52837877199994
43.64969222700006 -79.55394499999994
43.648573449000025 -79.57824999999997
43.75950000000006 -79.55685235299995
43.733760000000075 -79.53752189499994
43.704855000000066 -79.51755242199994
43.696300000000065 -79.53039862799994
43.686915000000056 -79.55727609599995
43.743205000000046 -79.58470
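
The output above includes several read timeouts from the ArcGIS service, after which the lookup was simply repeated. One possible way to bound those repeats is sketched below. The helper `geocode_with_retry` and the stand-in `flaky_lookup` are hypothetical names, not part of the `geocoder` package; `lookup` stands for any callable that returns `[lat, lng]` or `None`, such as `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def geocode_with_retry(lookup, query, max_tries=5, delay=1.0, sleep=time.sleep):
    """Call lookup(query) until it returns coordinates or max_tries is reached."""
    for attempt in range(1, max_tries + 1):
        coords = lookup(query)
        if coords is not None:
            return list(coords)
        if attempt < max_tries:
            sleep(delay * attempt)  # wait a little longer after each failure
    return None                     # the caller decides how to handle a miss

# hypothetical stand-in for the geocoder: fails twice, then succeeds
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.811525, -79.195517]

# sleep is replaced with a no-op so the demonstration runs instantly
result = geocode_with_retry(flaky_lookup, 'M1B, Toronto, Ontario', sleep=lambda s: None)
print(result, calls['n'])
```

Returning `None` after the last attempt, rather than looping forever, lets the caller choose between skipping the row and substituting placeholder coordinates.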

In [12]:
df[['Latitude','Longitude']] = pd.DataFrame(postbox_list, columns=['Latitude', 'Longitude'])
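
The assignment above pairs rows correctly because `df` was rebuilt with `reset_index()` and so carries the default 0, 1, 2, ... index, the same one a fresh `pd.DataFrame(postbox_list)` gets. pandas generally matches rows by index label in this kind of assignment, so a frame with a different index could end up with misplaced or missing values. A small sketch of a more defensive variant follows, using a made-up two-row `frame` with a non-default index; passing `index=frame.index` is the only change.

```python
import pandas as pd

# toy frame with a non-default index, as a filtered frame would have
frame = pd.DataFrame({'Postcode': ['M4E', 'M4K']}, index=[37, 41])
coords = [[43.676845, -79.295225], [43.683262, -79.35512]]

# build the coordinate frame on the same index so rows pair up by label
coord_frame = pd.DataFrame(coords, columns=['Latitude', 'Longitude'], index=frame.index)
frame[['Latitude', 'Longitude']] = coord_frame
print(frame)
```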

In [13]:
df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785730,-79.158750
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765690,-79.175256
3,M1G,Scarborough,Woburn,43.768359,-79.217590
4,M1H,Scarborough,Cedarbrae,43.769688,-79.239440
5,M1J,Scarborough,Scarborough Village,43.743125,-79.231750
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.726245,-79.263670
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.713133,-79.285055
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.723575,-79.234976
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.696665,-79.260163


## Part 3

### Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

In [14]:
df.shape

(103, 5)

In [15]:
df_toronto = df[df['Borough'].str.contains('Toronto', regex=False)] #Filter only boroughs that contain the word Toronto

In [16]:
df_toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676845,-79.295225
41,M4K,East Toronto,"The Danforth West, Riverdale",43.683262,-79.35512
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.667965,-79.314673
43,M4M,East Toronto,Studio District,43.662766,-79.33483
44,M4N,Central Toronto,Lawrence Park,43.72816,-79.387085
45,M4P,Central Toronto,Davisville North,43.712815,-79.388526
46,M4R,Central Toronto,North Toronto West,43.714523,-79.40696
47,M4S,Central Toronto,Davisville,43.703395,-79.385964
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.690655,-79.383561
49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686083,-79.402335


In [18]:
df_toronto['Borough'].unique()

array(['East Toronto', 'Central Toronto', 'Downtown Toronto',
       'West Toronto'], dtype=object)

In [21]:
print('Toronto summary : \n There are {} unique Postcodes \
and \n {} Boroughs in dataFrame'.format(df_toronto['Postcode'].unique().size,df_toronto['Borough'].unique().size))

Toronto summary : 
 There are 38 unique Postcodes and 
 4 Boroughs in dataFrame


In [23]:
df_toronto['Borough'].value_counts() # Postcode count by Borough

Downtown Toronto    18
Central Toronto      9
West Toronto         6
East Toronto         5
Name: Borough, dtype: int64

In [25]:
len(df_toronto['Borough'].unique())

4

In [31]:
import folium
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.7178922, -79.6582404], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto
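
The map above is centred on a hard-coded point whose longitude (-79.658) lies west of the coordinates listed in the `df_toronto` output. A possible alternative, sketched here with only the standard library, is to derive the centre from the data. The helper name `map_centre` is an assumption, and `points` holds just the first four rows of that output.

```python
from statistics import mean

# first four (Latitude, Longitude) pairs from the df_toronto output above
points = [
    (43.676845, -79.295225),
    (43.683262, -79.35512),
    (43.667965, -79.314673),
    (43.662766, -79.33483),
]

def map_centre(coords):
    """Return [mean latitude, mean longitude] for a list of (lat, lng) pairs."""
    lats, lngs = zip(*coords)
    return [mean(lats), mean(lngs)]

centre = map_centre(points)
print(centre)
```

The resulting list could be passed as the `location` argument of `folium.Map` in place of the fixed coordinates.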