# Peer-graded Assignment:
## Segmenting and Clustering Neighborhoods in Toronto
### by: Jeffrey Dupree

This notebook scrapes neighborhood information from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to create a dataframe consisting of the postal code, the borough name, and the neighborhood name.

First, we import the necessary libraries, installing them first if they are missing.

In [1]:
# If you don't have these packages available, uncomment the appropriate lines below to install them.

#import sys
#!{sys.executable} -m pip install beautifulsoup4
#!{sys.executable} -m pip install lxml
#!{sys.executable} -m pip install requests

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

Next, we need to get the information from the Wikipedia page using `requests.get`.

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

Use the BeautifulSoup package to parse the HTML returned from the Wikipedia page. I used the lxml parser, but any parser that BeautifulSoup supports will work.

In [3]:
soup = BeautifulSoup(source, 'lxml')

Find the table using `soup.find` from BeautifulSoup. Uncomment the second line to see the structure and content of the table. The tags are needed for the next steps.

In [4]:
table = soup.find('table')
#print(table.prettify())

Now a pandas dataframe needs to be created. This requires looping through the rows of the table and appending the cell text of each row to a list. The list can then be made into a dataframe using `pd.DataFrame`. The columns need header names. I assigned these manually instead of pulling them from the BeautifulSoup object `table`.

In [5]:
table_rows = table.find_all('tr')

res = []
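for tr in table_rows:
    td = tr.find_all('td')
    # Collect the stripped text of every non-empty cell in this row.
    row = [cell.text.strip() for cell in td if cell.text.strip()]
    if row:  # Rows without any <td> cells (such as the header row) produce an empty list and are skipped.
        res.append(row)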

# Label the columns.
df = pd.DataFrame(res, columns=['PostalCode','Borough','Neighborhood'])
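
To make the row loop above easier to follow, here is a minimal sketch that runs the same idea on a tiny hand-written table. The HTML string is a hypothetical stand-in, not the real Wikipedia content, and it uses Python's built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for tr in table.find_all("tr"):
    # Only <td> cells are collected, so the header row (all <th>) yields an empty list.
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(rows)
```

The header row drops out on its own because it contains no `<td>` cells, which is why the column names have to be supplied by hand afterwards.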

Next, remove the rows where the borough is "Not assigned", use the borough name for any neighborhood that is "Not assigned", and combine rows that share a postal code into a single row with a comma-separated list of neighborhoods.

In [6]:
# Remove rows with Borough = "Not assigned"
df = df[df.Borough != 'Not assigned']

In [7]:
# If Neighborhood = "Not assigned" then assign with Borough value.
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])

In [8]:
# Combine rows where the Postal Code is the same.
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

The resulting dataframe looks like this.

In [9]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


Check the size of the dataframe.

In [10]:
df.shape

(103, 3)
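
The three cleaning steps from cells 6 to 8 can be checked on a small hand-made table. This is only a sketch: the rows below are hypothetical examples in the same shape as the scraped data, and `toy` is a name I made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the same shape as the scraped table.
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Step 1: drop rows without a borough (copy so later assignments act on a real dataframe).
toy = toy[toy.Borough != "Not assigned"].copy()

# Step 2: fall back to the borough name where the neighborhood is missing.
toy["Neighborhood"] = np.where(
    toy["Neighborhood"] == "Not assigned", toy["Borough"], toy["Neighborhood"]
)

# Step 3: one row per postal code, neighborhoods joined with ", ".
toy = toy.groupby(["PostalCode", "Borough"])["Neighborhood"].apply(", ".join).reset_index()

print(toy)
```

The M1A row disappears, M5A collapses to one row with both neighborhoods, and M7A takes its borough name as its neighborhood.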

In [11]:
# The code was removed by Watson Studio for sharing.

In [12]:
import re
import geopy
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent=user_agent)  # user_agent is not defined in the visible cells; presumably it is set in the hidden cell above.

# Create empty lists for the latitude and longitude values.
latitude = []
longitude = []


for i in range(df.shape[0]):  # Loop through each postal code.
    # First try to geocode based on the postal code, restricting results to Canada.
    g = geolocator.geocode({"postalcode": df.PostalCode[i]}, exactly_one=False, country_codes="ca")
    if g is not None and len(g) == 1:  # If the postal code returns a single response, extract the lat/lon and record them.
        latitude.append(g[0].latitude)
        longitude.append(g[0].longitude)
    else:  # If the postal code does not geocode to a single location, or no location at all, then use the neighborhoods to geocode a location.
        hoods = df.Neighborhood[i].split(', ')
        for hood in hoods:
            # The sums are reset for every neighborhood, so only the matches for the last neighborhood in the list contribute to the average below.
            sum_lat = 0
            sum_lon = 0
            sum_loc = 0
            g = geolocator.geocode({"city": hood, "state": "on", "county": "toronto"}, exactly_one=False, country_codes="ca")
            if g is not None:
                for loc in reversed(g):  # Loop through the location objects returned to collect the lat/long data. Average to get a geometric center if more than one.
                    # First non-digit, digit, non-digit sequence in the address (intended to catch the start of a postal code such as M1B).
                    pc = re.search(r'\D\d\D', loc.address)
                    hm = loc.address.find(hood)
                    # Count the location if its postal code matches or, failing that, if its address mentions the neighborhood.
                    if pc is not None and (pc.group(0) == df.PostalCode[i] or hm >= 0):
                        sum_lat = sum_lat + loc.latitude
                        sum_lon = sum_lon + loc.longitude
                        sum_loc = sum_loc + 1
        if sum_loc < 1:  # Prevent a 'divide by zero' error by ensuring sum_loc is at least 1. The resulting 0, 0 marks a failed lookup.
            sum_loc = 1
        avg_lat = sum_lat / sum_loc
        avg_lon = sum_lon / sum_loc
        latitude.append(avg_lat)
        longitude.append(avg_lon)
#Add the latitude and longitude lists to the dataframe as two new columns.
df['Latitude'] = latitude
df['Longitude'] = longitude

Unfortunately, this method finds latitude and longitude data for only a little more than half of the postal codes, as can be seen below. Postal codes that could not be geocoded are left with a latitude and longitude of 0.

In [13]:
print("Only ",df[df.Latitude != 0].shape[0]," of the ",df.shape[0]," postal codes were geocoded.")
df

Only  61  of the  103  postal codes were geocoded.


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.809196,-79.221701
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.775504,-79.134976
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.768914,-79.187291
3,M1G,Scarborough,Woburn,43.765717,-79.221898
4,M1H,Scarborough,Cedarbrae,0.000000,0.000000
5,M1J,Scarborough,Scarborough Village,43.743742,-79.211632
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.724878,-79.253969
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.697174,-79.274823
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",0.000000,0.000000
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.711170,-79.248177
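
The neighborhood fallback in the geocoding loop is easier to reason about without the network calls. The sketch below isolates the matching and averaging for a single neighborhood. `Candidate` and `average_matches` are hypothetical names, and the addresses and coordinates are made up for illustration rather than taken from Nominatim.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a geopy Location; only the fields used here.
Candidate = namedtuple("Candidate", ["address", "latitude", "longitude"])


def average_matches(candidates, postal_code, hood):
    """Average the coordinates of candidates whose address carries the given
    postal code or mentions the neighborhood; return (0.0, 0.0) if none do."""
    lats, lons = [], []
    for c in candidates:
        pc = re.search(r"\D\d\D", c.address)
        if pc is None:
            continue  # As in the notebook, addresses without a code are skipped.
        if pc.group(0) == postal_code or hood in c.address:
            lats.append(c.latitude)
            lons.append(c.longitude)
    if not lats:
        return 0.0, 0.0  # Same "failed lookup" marker the notebook uses.
    return sum(lats) / len(lats), sum(lons) / len(lons)


# Made-up candidates: two inside M1B, one elsewhere.
candidates = [
    Candidate("Malvern, Scarborough, Toronto, Ontario, M1B 4Y7, Canada", 43.80, -79.22),
    Candidate("Malvern, Toronto, Ontario, M1B 5K9, Canada", 43.82, -79.20),
    Candidate("Rouge Park, Pickering, Ontario, L1V 1A1, Canada", 43.90, -79.10),
]

center = average_matches(candidates, "M1B", "Malvern")
empty = average_matches([], "M1B", "Malvern")
print(center, empty)
```

Returning 0 for both coordinates when nothing matches is what makes the failed rows easy to count afterwards with `df.Latitude != 0`.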


To complete the dataframe, I use the provided csv file to fill in the coordinates that the geocoder could not find.

In [14]:
df_csv = pd.read_csv("https://cocl.us/Geospatial_data") #Import the csv as a dataframe.
for i in range(0, df.shape[0]):
    postalcode = df.PostalCode[i]
    # Select the row in the csv dataframe with the postal code that matches the original dataframe.
    csv_row = df_csv.loc[df_csv['Postal Code'] == postalcode].index[0]
    if df.Latitude[i] == 0:  # If the geocoding failed for this postal code, copy the lat/long from the csv dataframe.
        # Write through .loc so the values are set on df itself rather than on a temporary copy.
        df.loc[i, 'Latitude'] = df_csv.Latitude[csv_row]
        df.loc[i, 'Longitude'] = df_csv.Longitude[csv_row]


Now there are latitude and longitude values for each of the postal codes.

In [15]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.809196,-79.221701
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.775504,-79.134976
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.768914,-79.187291
3,M1G,Scarborough,Woburn,43.765717,-79.221898
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.743742,-79.211632
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.724878,-79.253969
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.697174,-79.274823
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.711170,-79.248177
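
As an aside, the fill from the csv file could also be written without an explicit loop. This is a sketch of one possible alternative, not the method used above. The two small dataframes are hypothetical excerpts, and `geo`, `coords`, and `lookup` are names invented for the example; only the `Postal Code`, `Latitude`, and `Longitude` column names follow the csv.

```python
import pandas as pd

# Hypothetical geocoding results: 0.0 marks a postal code that failed to geocode.
geo = pd.DataFrame({
    "PostalCode": ["M1B", "M1H"],
    "Latitude": [43.809196, 0.0],
    "Longitude": [-79.221701, 0.0],
})

# Hypothetical excerpt of the csv data, using its 'Postal Code' column name.
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1H"],
    "Latitude": [43.806686, 43.773136],
    "Longitude": [-79.194353, -79.239476],
})

# Index the csv data by postal code so it can be used as a lookup table.
lookup = coords.set_index("Postal Code")

# Only rows where geocoding failed are overwritten.
failed = geo["Latitude"] == 0
for col in ["Latitude", "Longitude"]:
    geo.loc[failed, col] = geo.loc[failed, "PostalCode"].map(lookup[col])

print(geo)
```

Because the lookup is keyed on the postal code itself, this version does not depend on the two dataframes being in the same row order.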
