<div class="alert alert-block alert-info">


### Using BeautifulSoup to scrape data from the List of postal codes of Canada: M

The URL used is https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


</div>

In [1]:
from bs4 import BeautifulSoup

In [2]:
import requests
import re
import pandas as pd

In [3]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "lxml")


<div class="alert alert-block alert-info">
The target table has the class <code>wikitable</code>,
so we find all tables of that class on the page and store them in <code>gdp</code>.
</div>

In [4]:
gdp = soup.find_all("table", attrs={"class": "wikitable"})
print("Number of tables on site: ",len(gdp))

Number of tables on site:  1


<div class="alert alert-block alert-info">
Scraping the first table, i.e. the table of Toronto postal codes, whose HTML is held in <code>gdp[0]</code>
</div>

In [5]:
table1 = gdp[0]
# the head will form the column names
body = table1.find_all("tr")
# Head values (Column names) are the first items of the body list
head = body[0] # 0th item is the header row
body_rows = body[1:] # All other items becomes the rest of the rows

In [6]:
# Iterating through the head HTML code and making list of clean headings

headings = []
for item in head.find_all("th"): # loop through all th elements
    # convert the th elements to text and strip "\n"
    item = (item.text).rstrip("\n")
    # append the clean column name to headings
    headings.append(item)
print("Table Headings",headings)

Table Headings ['Postal Code', 'Borough', 'Neighbourhood']


In [7]:
# Looping through the rest of the rows

all_rows = []  # list of lists, one inner list per table row
for body_row in body_rows:  # one row at a time
    row = []  # holds the entries for one row
    for row_item in body_row.find_all("td"):  # loop through all cells in the row
        # strip non-breaking spaces, newlines and commas from the cell text
        aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
        # append the cleaned cell text to the current row
        row.append(aa)
    # append the finished row to all_rows
    all_rows.append(row)

In [8]:
# We now use the data in all_rows and headings to build a table:
# all_rows becomes the data and headings the column names
df = pd.DataFrame(data=all_rows,columns=headings)
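
The header and row extraction in the cells above can be condensed into one helper. The following is a minimal sketch under stated assumptions, not the notebook's own method: `wikitable_to_frame` and `SAMPLE_HTML` are hypothetical names, the inline HTML is a small stand-in for the Wikipedia page so that the example runs offline, and the built-in `html.parser` is used instead of `lxml`. Unlike the regex above, it only strips surrounding whitespace and does not remove commas.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table (assumption: same three columns).
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
<tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
<tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park Harbourfront</td></tr>
</table>
"""


def wikitable_to_frame(html, index=0):
    """Turn the index-th table of class 'wikitable' into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table", attrs={"class": "wikitable"})[index]
    rows = table.find_all("tr")
    # The first row holds the <th> header cells.
    headings = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    data = []
    for tr in rows[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows that contain no <td> cells
            data.append(cells)
    return pd.DataFrame(data, columns=headings)


df_sample = wikitable_to_frame(SAMPLE_HTML)
print(df_sample)
```

To run it against the live page, the `html_content` string fetched earlier could be passed in place of `SAMPLE_HTML`.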


In [9]:
df.describe()

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,180,180,180
unique,180,11,100
top,M7M,Not assigned,Not assigned
freq,1,77,77


<div class="alert alert-block alert-info">
A total of 180 rows of postal codes and associated data were retrieved from the table
</div>

In [10]:
# Removing rows where Borough is "Not assigned"

df_wona = df[df.Borough != 'Not assigned']
df_wona.describe()

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,103,103,103
unique,103,10,99
top,M1M,North York,Downsview
freq,1,24,4


<div class="alert alert-block alert-info">
Removing rows where Borough is "Not assigned" reduces the number of rows to 103; the result is stored in a new DataFrame, <code>df_wona</code>
</div>

In [11]:
#Resetting the index
df_wona = df_wona.reset_index(drop=True)
df_wona

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park Harbourfront
3,M6A,North York,Lawrence Manor Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,The Kingsway Montgomery Road Old Mill North
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre South Ce...
101,M8Y,Etobicoke,Old Mill South King's Mill Park Sunnylea Humbe...


In [12]:

print("\nIterating over rows to check if any row has a borough but a Not assigned  neighborhood, \n") 
  
for i in range(len(df_wona)):
    # fall back to the borough name when no neighbourhood is assigned
    if df_wona.loc[i, "Neighbourhood"] == "Not assigned":
        df_wona.loc[i, "Neighbourhood"] = df_wona.loc[i, "Borough"]
    print(df_wona.loc[i, "Postal Code"], " | ", df_wona.loc[i, "Borough"], " | ", df_wona.loc[i, "Neighbourhood"])


Iterating over rows to check if any row has a borough but a Not assigned  neighborhood, 

M3A  |  North York  |  Parkwoods
M4A  |  North York  |  Victoria Village
M5A  |  Downtown Toronto  |  Regent Park Harbourfront
M6A  |  North York  |  Lawrence Manor Lawrence Heights
M7A  |  Downtown Toronto  |  Queen's Park Ontario Provincial Government
M9A  |  Etobicoke  |  Islington Avenue Humber Valley Village
M1B  |  Scarborough  |  Malvern Rouge
M3B  |  North York  |  Don Mills
M4B  |  East York  |  Parkview Hill Woodbine Gardens
M5B  |  Downtown Toronto  |  Garden District Ryerson
M6B  |  North York  |  Glencairn
M9B  |  Etobicoke  |  West Deane Park Princess Gardens Martin Grove Islington Cloverdale
M1C  |  Scarborough  |  Rouge Hill Port Union Highland Creek
M3C  |  North York  |  Don Mills
M4C  |  East York  |  Woodbine Heights
M5C  |  Downtown Toronto  |  St. James Town
M6C  |  York  |  Humewood-Cedarvale
M9C  |  Etobicoke  |  Eringate Bloordale Gardens Old Burnhamthorpe Markland Wood
M
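
The filter and the row-by-row fallback above can also be written with boolean masks instead of an explicit loop. This is a minimal sketch on a tiny hand-made frame: `df_demo`, `assigned` and `missing` are hypothetical names, and the `X1X`/`X2X` rows are invented placeholders, not real postal codes.

```python
import pandas as pd

# Tiny demo frame; X1X and X2X are made-up placeholder rows (assumption).
df_demo = pd.DataFrame(
    {
        "Postal Code": ["M3A", "X1X", "X2X"],
        "Borough": ["North York", "Not assigned", "Example Borough"],
        "Neighbourhood": ["Parkwoods", "Not assigned", "Not assigned"],
    }
)

# Keep only rows with an assigned borough and renumber them from 0.
assigned = df_demo[df_demo["Borough"] != "Not assigned"].reset_index(drop=True)

# Where the neighbourhood is missing, reuse the borough name.
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

print(assigned)
```

The mask-based assignment changes only the rows selected by `missing`, so rows that already have a neighbourhood are left untouched.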

In [13]:
df_wona.shape

(103, 3)

<div class="alert alert-block alert-info">
Printing the number of rows in the cleaned-up DataFrame
</div>

In [14]:
print('No of Rows',df_wona.shape[0])

No of Rows 103


In [15]:
!pip install pgeocode



In [30]:
import pgeocode
import math

rows = []
nomi = pgeocode.Nominatim('ca')  # geocoder for Canadian postal codes
for i in range(len(df_wona)):
    postal_code = df_wona.loc[i, "Postal Code"]
    location = nomi.query_postal_code(postal_code)
    latitude = location.latitude
    longitude = location.longitude
    # coordinates are NaN when the postal code cannot be resolved
    if math.isnan(latitude) or math.isnan(longitude):
        print(postal_code, " does not have geocoded coordinates")
    else:
        # keep only rows that have both coordinates
        rows.append([postal_code, df_wona.loc[i, "Borough"], df_wona.loc[i, "Neighbourhood"], latitude, longitude])

df_nhoods = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighbourhood", "lat", "lng"])

df_nhoods

M7R  does not have geocoded coordinates


Unnamed: 0,PostalCode,Borough,Neighbourhood,lat,lng
0,M3A,North York,Parkwoods,43.7545,-79.3300
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,Regent Park Harbourfront,43.6555,-79.3626
3,M6A,North York,Lawrence Manor Lawrence Heights,43.7223,-79.4504
4,M7A,Downtown Toronto,Queen's Park Ontario Provincial Government,43.6641,-79.3889
...,...,...,...,...,...
97,M8X,Etobicoke,The Kingsway Montgomery Road Old Mill North,43.6518,-79.5076
98,M4Y,Downtown Toronto,Church and Wellesley,43.6656,-79.3830
99,M7Y,East Toronto,Business reply mail Processing Centre South Ce...,43.7804,-79.2505
100,M8Y,Etobicoke,Old Mill South King's Mill Park Sunnylea Humbe...,43.6325,-79.4939
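
The NaN check in the loop above can also be done after the fact with `dropna`. The sketch below assumes the coordinates have already been collected into a frame; `coords`, `unresolved` and `df_geo` are hypothetical names, and the three rows reuse values from the output above (M7R being the code reported without coordinates).

```python
import numpy as np
import pandas as pd

# Coordinates as they appear in the output above; M7R could not be geocoded.
coords = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M7R"],
        "lat": [43.7545, 43.7276, np.nan],
        "lng": [-79.3300, -79.3148, np.nan],
    }
)

# Report the codes where either coordinate is missing.
unresolved = coords[coords[["lat", "lng"]].isna().any(axis=1)]
for code in unresolved["PostalCode"]:
    print(code, "does not have geocoded coordinates")

# Keep only fully geocoded rows and renumber them from 0.
df_geo = coords.dropna(subset=["lat", "lng"]).reset_index(drop=True)
print(df_geo)
```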


In [31]:
#from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#address = 'Toronto, Ontario'

#geolocator = Nominatim(user_agent="ny_explorer")
#location = geolocator.geocode(address)
#latitude = location.latitude
#longitude = location.longitude

#print (latitude,longitude)

In [32]:
! pip install folium==0.5.0
import folium # plotting library

print('Folium installed')
print('Libraries imported.')


Libraries imported.


In [37]:
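# create map of Toronto, centred on the mean of the geocoded coordinates
# (latitude/longitude left over from the geocoding loop would point at the last postal code only)
map_toronto = folium.Map(location=[df_nhoods['lat'].mean(), df_nhoods['lng'].mean()], zoom_start=10)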

# add markers to map
for lat, lng, label in zip(df_nhoods['lat'], df_nhoods['lng'], df_nhoods['PostalCode']):
    #label = '{}, {}'.format(label)
    #label = folium.Popup(label, parse_html=True)
    #print(lat,lng,label)
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto