"Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe"

## Imports

In [50]:
#imports
from bs4 import BeautifulSoup
import pandas as pd
import requests
import numpy as np

## Part 1 - Get the data from the page

In [51]:
html_doc = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
#a soup object of the whole page
soup = BeautifulSoup(html_doc, 'html.parser')

#just the table we're interested in
table = soup.find('table', class_='wikitable sortable')

#parse the table entries into a list
wikis = []

rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells)==3:
        postalCode = cells[0].find(text=True)
        borough = cells[1].find(text=True)
        neigh = cells[2].find(text=True)
        
        #Ignore cells with a borough that is Not assigned
        if "Not assigned" not in borough:
            #append to list
            wikis.append([postalCode, borough, neigh])
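
The row filtering in the cell above can be tried offline on a small inline table. This is a minimal sketch, assuming an HTML fragment shaped like the Wikipedia table; the `sample_html` string and the `parse_rows` helper are made up for illustration. It uses `get_text(strip=True)`, so trailing newlines are removed while parsing rather than afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the layout of the Wikipedia table
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
<tr><td>M7A</td><td>Queen's Park</td><td>Not assigned
</td></tr>
</table>
"""

def parse_rows(html):
    """Return [postal code, borough, neighborhood] for rows with an assigned borough."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='wikitable sortable')
    parsed = []
    for row in table.find_all('tr'):
        # header rows contain only <th> cells, so they yield no <td> cells
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(cells) == 3 and cells[1] != 'Not assigned':
            parsed.append(cells)
    return parsed

sample_rows = parse_rows(sample_html)
print(sample_rows)
```

The row with an unassigned borough is dropped, while the row that has a borough but no neighborhood is kept for the later clean-up step.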

## Transform the data into a pandas dataframe

In [52]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 
# create the dataframe
dfWikis = pd.DataFrame(wikis, columns=column_names)

## Prepare the data

Combine the neighborhoods that share a postal code and borough into a single comma-separated row

In [53]:
dfPostalCode = dfWikis.groupby(['PostalCode','Borough'], sort = False).agg(lambda x: ','.join(x))
dfPostalCode = dfPostalCode.reset_index()
#remove the \n from some "Neighborhood" fields
dfPostalCode.replace({'\n': ''}, regex=True, inplace=True)
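
To see what the grouping step produces, here is a small sketch on made-up rows (the `toy` frame is illustrative, not scraped data). Rows that share a postal code and borough collapse into one row whose neighborhoods are joined with commas.

```python
import pandas as pd

# Illustrative rows only; two of them share postal code M5A
toy = pd.DataFrame(
    [['M3A', 'North York', 'Parkwoods'],
     ['M5A', 'Downtown Toronto', 'Harbourfront'],
     ['M5A', 'Downtown Toronto', 'Regent Park']],
    columns=['PostalCode', 'Borough', 'Neighborhood'])

# sort=False keeps the groups in order of first appearance;
# as_index=False returns the grouping keys as ordinary columns
combined = (toy.groupby(['PostalCode', 'Borough'], sort=False, as_index=False)
               .agg({'Neighborhood': ','.join}))
print(combined)
```

Passing `as_index=False` is one way to avoid the separate `reset_index()` call; both approaches should give the same columns.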

If a row has a borough but its neighborhood is "Not assigned", set the neighborhood to the borough

In [54]:
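# Assigning to the row yielded by iterrows() is not guaranteed to update the dataframe,
# so select the affected rows with a boolean mask and assign through .loc
not_assigned = dfPostalCode['Neighborhood'] == 'Not assigned'
dfPostalCode.loc[not_assigned, 'Neighborhood'] = dfPostalCode.loc[not_assigned, 'Borough']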

dfPostalCode.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


Use the .shape attribute to print the number of rows and columns of the dataframe

In [55]:
dfPostalCode.shape

(103, 3)

## Part 2 - Get the latitude and the longitude coordinates of each neighborhood

Imports

In [56]:
!conda install -c conda-forge geocoder -y
import geocoder # import geocoder

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geocoder                  1.38.1                     py_0    conda-forge


Create a function that receives a postal code and returns a list with its coordinates.
#### NOTE: Geocoder is not returning coordinates, so this function is not used below; it is kept for reference

In [57]:
def getCoordinates(postalCode):
    # initialize the result to None
    lat_lng_coords = None
    counter = 0
    # try up to 10 times to get the coordinates
    while lat_lng_coords is None and counter < 10:
        g = geocoder.google('{}, Toronto, Ontario'.format(postalCode))
        lat_lng_coords = g.latlng
        counter += 1

    # if every attempt failed, return a list of two None values
    if lat_lng_coords is None:
        lat_lng_coords = [None, None]

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    print(postalCode, latitude, longitude)

    return lat_lng_coords
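
The bounded retry in `getCoordinates` can be factored out and exercised without a geocoding service. The sketch below assumes a caller-supplied `lookup` function that returns a `[lat, lng]` pair or `None`; `retry_lookup`, `lookup`, and `flaky_lookup` are hypothetical names and not part of geocoder.

```python
def retry_lookup(lookup, query, max_attempts=10):
    """Call lookup(query) until it returns a value or max_attempts is reached."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
    # every attempt failed: return real None values rather than the string 'None'
    return [None, None]

# Stand-in for a geocoding call that fails twice before succeeding
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.806686, -79.194353]

found = retry_lookup(flaky_lookup, 'M1B, Toronto, Ontario')
not_found = retry_lookup(lambda q: None, 'M1B, Toronto, Ontario', max_attempts=3)
print(found, not_found)
```

With the real library, the `lookup` argument could presumably be something like `lambda q: geocoder.google(q).latlng`, mirroring the call used in the function above.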

Load the coordinates from the CSV file into a dataframe

In [58]:
path = 'https://cocl.us/Geospatial_data'
dfCoords = pd.read_csv(path)
dfCoords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Rename "Postal Code" to "PostalCode" on the coordinates df so they match

In [59]:
dfCoords.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
dfCoords.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the dataframes

In [60]:
dfComplete = pd.merge(dfPostalCode, dfCoords, on='PostalCode')
dfComplete.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
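
The merge above uses the default inner join, which silently drops postal codes that have no match in the coordinates file. One way to check for that, sketched here on made-up frames (`codes`, `coords`, and the `M0Z` postal code are placeholders), is a left merge with `indicator=True`, which adds a `_merge` column recording where each row was found.

```python
import pandas as pd

# Illustrative frames; M0Z is a placeholder code with no coordinates on purpose
codes = pd.DataFrame({'PostalCode': ['M1B', 'M3A', 'M0Z'],
                      'Borough': ['Scarborough', 'North York', 'Example']})
coords = pd.DataFrame({'PostalCode': ['M1B', 'M3A'],
                       'Latitude': [43.806686, 43.753259],
                       'Longitude': [-79.194353, -79.329656]})

# how='left' keeps every row of codes; indicator=True adds the _merge column
checked = pd.merge(codes, coords, on='PostalCode', how='left', indicator=True)

# rows found only in the left frame have no coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list would suggest that the inner merge loses no rows.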
