# Scraping Data Table of Toronto Neighborhoods with 'M' Postal Code

Importing the necessary libraries (and ones I might use later on).

In [1]:
import pandas as pd
import numpy as np
import csv
from bs4 import BeautifulSoup
import requests

Scraping and cleaning up the table (the full explanations can be found in the 'Clustering and Segmenting' notebook).

In [2]:
# using BeautifulSoup to scrape the data table from Wikipedia.
html=requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
source=BeautifulSoup(html, 'lxml')

table=source.find('table', class_='wikitable sortable')

rows=table.find_all('tr')

# a header row made of <th> cells has no <td> cells, so it yields an empty list here.
data=[]
for row in rows:
    data.append([t.text.strip() for t in row.find_all('td')])

# appending the data table into a dataframe

df=pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood']) 
df = df[~df['PostalCode'].isnull()] #to filter out bad rows
df.drop(df[df['Borough']=='Not assigned'].index, axis=0, inplace=True) #deleting 'Not assigned' cells.
df1=df.reset_index(drop=True) # drop=True keeps the old integer index out of the string join below.
df2=df1.groupby('PostalCode').agg(lambda x: ','.join(x)) # grouping neighborhoods with the same postal codes.
df2.loc[df2['Neighborhood']=="Not assigned",'Neighborhood']=df2.loc[df2['Neighborhood']=="Not assigned",'Borough'] # assigning the unnamed neighborhoods with their borough names.
df3=df2.reset_index()
df3['Borough']= df3['Borough'].str.replace(r'nan|[{}\s]','',regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(",{2,}",",",regex=True) #deleting replicates of borough names from each postal code.

df3.head() # displaying the first 5 rows of the appended and cleansed data of Toronto neighborhoods with 'M' postal codes.

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
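
The cleaning steps above can be sketched on a handful of made-up rows, which makes each step easier to inspect than on the live page. This is a simplified variant, not the notebook's exact method: it fills unnamed neighborhoods before grouping, and it keeps the first borough per postal code instead of de-duplicating a joined string, which assumes each postal code belongs to a single borough. The rows and the name `demo` are illustrative only.

```python
import pandas as pd

# Made-up rows in the shape the scraping loop produces (not the live table).
# The first entry is empty, as a header row with no <td> cells would be.
rows = [
    [],
    ["M1A", "Not assigned", "Not assigned"],
    ["M1B", "Scarborough", "Rouge"],
    ["M1B", "Scarborough", "Malvern"],
    ["M7A", "Queen's Park", "Not assigned"],
]

rows = [r for r in rows if r]  # drop empty rows before building the frame
demo = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

# Remove postal codes that have no borough assigned.
demo = demo[demo["Borough"] != "Not assigned"].copy()

# A neighborhood with no name falls back to its borough name.
unnamed = demo["Neighborhood"] == "Not assigned"
demo.loc[unnamed, "Neighborhood"] = demo.loc[unnamed, "Borough"]

# One row per postal code: first borough, comma-joined neighborhoods.
demo = demo.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ",".join}
)
print(demo)
```

Taking the first borough avoids the regex clean-up entirely, at the cost of hiding any postal code that might be listed under two boroughs.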


# Reading the Geospatial File for the Latitudes and Longitudes of the Toronto Postal Codes

Obtaining the latitude and longitude of each Toronto 'M' postal code from the csv file, since the geocoder was unresponsive.

In [3]:
df_coor=pd.read_csv('http://cocl.us/Geospatial_data')
df_coor.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Renaming the 'Postal Code' column to 'PostalCode' to prepare the table for merging.

In [4]:
df_coor.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

In [5]:
df_coor.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
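
The read and rename steps can also be chained into one expression, which avoids `inplace=True`. The sketch below uses inline CSV text copied from the first rows of the preview above in place of the remote file, so it runs without network access; the names `csv_text` and `coords` are illustrative.

```python
import io
import pandas as pd

# Inline CSV text copied from the preview above; it stands in for the remote file.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
"""

# Read the file and rename the key column in one chain.
coords = (
    pd.read_csv(io.StringIO(csv_text))
      .rename(columns={"Postal Code": "PostalCode"})
)
print(coords.columns.tolist())  # ['PostalCode', 'Latitude', 'Longitude']
```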


Merging the two tables on 'PostalCode' with an inner join, so only postal codes present in both tables are kept.

In [6]:
df_Toronto=pd.merge(df3, df_coor, on='PostalCode', how='inner')
df_Toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
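
An inner join silently drops postal codes that appear in only one table. A minimal sketch of one way to check for that, using an outer join with pandas' `indicator=True` option on two small made-up frames (the code `M9Z` and the frame names `left`, `right` and `check` are hypothetical):

```python
import pandas as pd

# Made-up frames; 'M9Z' is a hypothetical code with no coordinates.
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# The outer join keeps every key; the '_merge' column records where each came from.
# validate="one_to_one" raises if a postal code is duplicated on either side.
check = pd.merge(left, right, on="PostalCode", how="outer",
                 indicator=True, validate="one_to_one")

# Rows that an inner join would have dropped.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["PostalCode", "_merge"]])
```

If `unmatched` is empty for the real tables, the inner join loses nothing.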


In [7]:
df_Toronto.shape

(103, 5)