# Canadian Postal Code Web Scraping Exercise

We are going to scrape a wiki page listing the postal codes of Canadian neighborhoods and build a data frame that we can use for data analysis later.

##### Disclaimer: I've completed this exercise without breaking it out into multiple notebooks, so all links will be the same. Please don't mark me off for this >_<

In [1]:
# First, grab the imports we are going to need
from bs4 import BeautifulSoup
import requests
import pandas as pd

Next, we use the requests library to fetch the content of the wiki article.

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

We then use the Beautiful Soup library to parse the page so that we can loop over its data.

In [3]:
#Creating our beautiful soup object
soup = BeautifulSoup(source.text, 'lxml')

# Grab the first table with the 'wikitable' class from the page fetched above
wikiTable = soup.find('table', class_='wikitable')

After setting up our data, we create an empty data frame
that we will fill with the formatted data.

In [4]:
cd_neigh_df = pd.DataFrame(columns=['Postcode', 'Borough', 'Neighbourhood'])

In the code below we loop over specific tags found in the table.

The outer loop over the 'tr' tags visits each row of the table. To fill our data frame, however, we need each data element individually, which is where the inner loop over the 'td' tags comes in. Inside that loop we have access to each item of data in the row.

In [5]:
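col = 0
for table_row in wikiTable.find_all('tr'):
    # One-row scratch frame; every cell is overwritten before the row is kept.
    # Empty strings keep the columns string-typed for the text written below.
    cd_neigh_df_temp = pd.DataFrame({'Postcode':[''], 'Borough':[''], 'Neighbourhood':['']})
    for table_data in table_row.find_all('td'):
        cd_neigh_df_temp.iloc[0,col] = table_data.text.replace("\n", "")
        if col == 2:
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            cd_neigh_df = pd.concat([cd_neigh_df, cd_neigh_df_temp])
        else:
            col += 1
    col = 0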
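
The row-by-row loop above can also be sketched by collecting each row's cells into a plain list and building the data frame once at the end. This is only an illustration under stated assumptions: `sample_html` is a made-up stand-in for the wiki table so the example runs offline, it uses the built-in `'html.parser'` instead of `'lxml'`, and it uses `strip()` where the notebook uses `replace("\n", "")`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the wiki table (not the real page content)
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, 'html.parser').find('table', class_='wikitable')

columns = ['Postcode', 'Borough', 'Neighbourhood']
rows = []
for table_row in sample_table.find_all('tr'):
    # The header row holds <th> cells only, so it yields an empty list here
    cells = [td.text.strip() for td in table_row.find_all('td')]
    if len(cells) == len(columns):
        rows.append(cells)

# Build the frame in one step instead of growing it row by row
sample_df = pd.DataFrame(rows, columns=columns)
print(sample_df)
```

Building the frame in one call also gives each row its own index value, rather than the repeated 0 that shows up in the output below.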

In [6]:
cd_neigh_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
0,M2A,Not assigned,Not assigned
0,M3A,North York,Parkwoods
0,M4A,North York,Victoria Village
0,M5A,Downtown Toronto,Harbourfront


In [7]:
#simply removing the rows where the Borough is not assigned
cd_neigh_df = cd_neigh_df[cd_neigh_df.Borough != 'Not assigned']

In [8]:
cd_neigh_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
0,M4A,North York,Victoria Village
0,M5A,Downtown Toronto,Harbourfront
0,M6A,North York,Lawrence Heights
0,M6A,North York,Lawrence Manor


In [41]:
# Group the neighbourhoods to eliminate duplicate postcode rows.
# We now have a data frame with comma-separated neighbourhoods
# instead of an individual row for each one.
cd_neigh_df = cd_neigh_df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
cd_neigh_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
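
To see what the grouping step does in isolation, here is a minimal sketch on three rows taken from the output shown earlier. The frame name `toy` and the result name `grouped` are just illustrative.

```python
import pandas as pd

# Three rows copied from the earlier output; M6A appears twice
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M6A', 'M3A'],
    'Borough': ['North York', 'North York', 'North York'],
    'Neighbourhood': ['Lawrence Heights', 'Lawrence Manor', 'Parkwoods'],
})

# Join the neighbourhood names within each (Postcode, Borough) group
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .apply(', '.join)
       .reset_index()
)
print(grouped)
```

The two M6A rows collapse into a single row whose Neighbourhood is "Lawrence Heights, Lawrence Manor", and the groups come back sorted by their keys.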


In [10]:
cd_neigh_df.shape

(103, 3)

Ideally, this section would use the geocoder package to look up the coordinates for each postal code. However, this calling method proved unreliable, and I was not able to collect the required coordinates this way.

In [23]:
# !conda install -c conda-forge geopy --yes 
# from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values


In [22]:
#import geocoder

In [21]:
# I couldn't get the API to work... ugh, but here is the code they referenced
# # initialize your variable to None
# lat_lng_coords = None
# postal_code = 'M5G'

# # loop until you get the coordinates
# while(lat_lng_coords is None):
#   g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#   lat_lng_coords = g.latlng

# latitude = lat_lng_coords[0]
# longitude = lat_lng_coords[1]
# print(latitude)
# print(longitude)
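
The commented-out loop above never exits if the geocoder keeps returning nothing. One way to bound it is sketched below. This is an assumption-laden illustration and not the course's code: `lookup_with_retries` and `flaky_lookup` are hypothetical names, and the fake lookup returns placeholder coordinates so the example runs without any geocoding service.

```python
import time

def lookup_with_retries(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        # Pause between attempts, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None

# Hypothetical stand-in geocoder: fails twice, then returns placeholder values
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [0.0, 0.0]

result = lookup_with_retries(flaky_lookup, 'M5G, Toronto, Ontario', delay_seconds=0)
print(result, calls['count'])
```

In real use, `lookup` could wrap whichever geocoding call is available. A `None` result after the last attempt signals that the fallback CSV is needed.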

In [42]:
# Load the coordinates provided by the course, keyed by postal code
geospatial = pd.read_csv('Geospatial_Coordinates.csv')
geospatial

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


The next step is to join the provided coordinates with the data we scraped from the wiki page.

In [37]:
cd_neigh_long_lat = cd_neigh_df.set_index('Postcode').join(geospatial.set_index('Postal Code'))

In [40]:
cd_neigh_long_lat.reset_index()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
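
A join like the one above silently leaves NaN coordinates for any postcode missing from the CSV. The sketch below shows one way to check for that with `merge` and its `indicator` option. The frames are small illustrative samples, and 'M0Z' is a made-up postcode included only to show an unmatched row.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M0Z'],  # 'M0Z' is hypothetical, with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge keeps every scraped postcode; indicator=True adds a '_merge' column
merged = neighbourhoods.merge(
    coordinates, how='left', left_on='Postcode', right_on='Postal Code', indicator=True
)

# Postcodes that found no match in the coordinates table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every scraped postcode received a latitude and longitude.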


Now we loop over our joined data frame and plot each latitude and longitude pair on a map of Toronto.

In [79]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

In [78]:
lat_toronto = 43.653200
long_toronto = -79.383200

toronto_map = folium.Map(location=[lat_toronto, long_toronto], zoom_start=11) # generate map centred on Toronto

for index, coords in cd_neigh_long_lat.iterrows():
    folium.features.CircleMarker(
        [coords['Latitude'], coords['Longitude']],
        radius=10,
        popup=folium.Popup(coords['Neighbourhood'], parse_html=True),
        fill=True,
        color='red',
        fill_color='red',
        fill_opacity=0.6
    ).add_to(toronto_map)

toronto_map