# Task 2: Including the longitude and latitude in the table

We will use the `pgeocode` package (as recommended [here](https://www.coursera.org/learn/applied-data-science-capstone/discussions/weeks/3/threads/hqLU1FXiEeuhwwo1pM_uuQ/replies/SeYqp1ncEeuujg7tcW_-dw/comments/D5oiUlrXEeuujg7tcW_-dw)) to look up the longitude and latitude from the postal code of each region. First we have to recreate the table; the cell will be hidden because it is a trimmed-down version of the same code.

In [1]:
# Begin by recreating the dataframe

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'lxml')

neighbourhoods = []

table = soup.find('table', attrs={'class':'wikitable sortable'})
table_data = table.find_all('tr')
headers = [th.text.replace('\n','').replace(' ', '') for th in table_data[0].find_all('th')]
for n in range(1, len(table_data)):
    neighbourhood = dict()
    for header, td in zip(headers, table_data[n].find_all('td')):
        neighbourhood[header] = td.text.replace('\n','')
    neighbourhoods.append(neighbourhood)

df = pd.DataFrame(neighbourhoods)
# Drop rows whose borough has not been assigned
empty_rows = df[df['Borough']=='Not assigned'].index
df.drop(empty_rows, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

# Combine neighbourhoods that share a postal code into one comma-separated row
df_postcode = df.groupby(['PostalCode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()

def get_borough(row):
    # Fall back to the borough name when no neighbourhood is assigned
    if row['Neighbourhood'] == 'Not assigned':
        return row['Borough']
    return row['Neighbourhood']

df_postcode['Neighbourhood'] = df_postcode.apply(get_borough, axis=1)
df_postcode.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


Let's quickly extract postal codes to use with `pgeocode`.

In [2]:
pcodes = df_postcode['PostalCode'].values

We import `pgeocode` and set the country code to `ca` so that the package knows we are dealing with Canadian postal codes.

In [3]:
import pgeocode
geo = pgeocode.Nominatim('ca')

Querying the postal codes in `pcodes` returns a dataframe containing a number of fields, with the columns named in lowercase. To make the merge easier, we keep the `postal_code` column alongside the coordinates and rename all three columns to match our table.

In [4]:
geo_df = geo.query_postal_code(pcodes)
geo_df = geo_df[['postal_code', 'latitude', 'longitude']]
geo_df.columns = ['PostalCode', 'Latitude', 'Longitude']
geo_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.8113,-79.193
1,M1C,43.7878,-79.1564
2,M1E,43.7678,-79.1866
3,M1G,43.7712,-79.2144
4,M1H,43.7686,-79.2389
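
As an alternative to assigning `geo_df.columns` by position, the select-and-rename step can be sketched with an explicit name mapping, which keeps working if the column order ever changes. This is a minimal sketch on a small stand-in frame: `raw`, `column_map` and `tidy` are illustrative names, and only a few of the columns returned by `pgeocode` are reproduced, using coordinates from the output above.

```python
import pandas as pd

# Stand-in for the frame returned by geo.query_postal_code(pcodes);
# only a handful of its columns are reproduced here (assumption for illustration).
raw = pd.DataFrame({
    "postal_code": ["M1B", "M1C"],
    "country_code": ["CA", "CA"],
    "latitude": [43.8113, 43.7878],
    "longitude": [-79.193, -79.1564],
})

# Map each column we want to keep to the name used in our own table.
column_map = {
    "postal_code": "PostalCode",
    "latitude": "Latitude",
    "longitude": "Longitude",
}

# Select only the mapped columns, then rename them by name rather than by position.
tidy = raw[list(column_map)].rename(columns=column_map)
print(tidy.columns.tolist())  # ['PostalCode', 'Latitude', 'Longitude']
```

Because the mapping is keyed by the original column name, a typo or a missing column raises a `KeyError` at the selection step instead of silently mislabelling a column.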


We now do a left merge on the original dataframe, joining on the column `PostalCode`. This way any missing longitude and latitude values are included as missing data and we don't lose any rows.

In [5]:
df_postcode = df_postcode.merge(geo_df, how='left', on='PostalCode')
df_postcode.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7878,-79.1564
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.7678,-79.1866
3,M1G,Scarborough,Woburn,43.7712,-79.2144
4,M1H,Scarborough,Cedarbrae,43.7686,-79.2389
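
To illustrate why a left merge keeps every row, here is a small sketch on toy data. The frames `left` and `right` are stand-ins for `df_postcode` and `geo_df`; the postal code `M9Z` and its borough are made up for the example and deliberately have no coordinates.

```python
import pandas as pd

# Toy version of the postal code table; "M9Z" is a hypothetical code
# added only to show what happens when no coordinates are found.
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})

# Toy version of the coordinates table, missing an entry for "M9Z".
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.8113, 43.7878],
    "Longitude": [-79.193, -79.1564],
})

# A left merge keeps all rows of `left`; unmatched rows get NaN coordinates.
merged = left.merge(right, how="left", on="PostalCode")

# Rows that still lack coordinates can be listed for follow-up.
missing = merged[merged["Latitude"].isna()]
print(len(merged))                     # 3
print(missing["PostalCode"].tolist())  # ['M9Z']
```

Running the same `isna()` check on the real `df_postcode` after the merge is a quick way to see whether any postal codes failed to resolve.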
