### Question 1
Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.

#### Import required libraries for web scraping and setting up the DataFrame

In [1]:
import pandas as pd
import numpy as np
import bs4 as BeautifulSoup
import matplotlib.pyplot as plt
import requests

In [2]:
url = r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(url).text
soup = BeautifulSoup.BeautifulSoup(html,'html5lib')

#### Web scraping steps taken:
- Find the table containing the location information
- Extract the Postal Code from the "b" tag
- Scrape the Borough and Neighbourhood from the row's span text, splitting the two at the '('
- If no Neighbourhood is found, set both Borough and Neighbourhood to the Borough
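
The splitting rule in the steps above can be tried on plain strings before touching the live page. The following is a minimal sketch: the helper name `split_cell_text` and the sample strings are illustrative assumptions, not text copied from the Wikipedia table.

```python
def split_cell_text(text):
    """Split a cell's text into (borough, neighbourhood) at the first '('."""
    if '(' not in text:
        # No neighbourhood listed: reuse the borough for both fields
        return text, text
    borough, rest = text.split('(', 1)
    # Keep only what precedes the last ')', then turn ' /' separators into ','
    inner = rest[:rest.rfind(')')]
    return borough, ','.join(inner.split(' /'))

print(split_cell_text('North York(Parkwoods)'))
print(split_cell_text('Downtown Toronto(Regent Park / Harbourfront)'))
print(split_cell_text("Queen's Park"))
```

Splitting on `' /'` rather than `' / '` leaves the space after each separator in place, so the joined result reads `Regent Park, Harbourfront`.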

In [3]:
table = soup.find('table')
table_contents = []
for row in table.find_all('td'):
    text = row.span.text
    # Skip postal codes that have no borough assigned
    if 'Not assigned' in text:
        continue
    cell = {'Postal Code': row.b.string}
    if '(' in text:
        # Splits the Borough from the Neighbourhood at the '('
        split = text.split("(")
        cell['Borough'] = split[0]
        # Splits the Neighbourhoods and joins them together with ","
        cell['Neighbourhood'] = ','.join(split[1][:split[1].rfind(')')].split(' /'))
    else:
        # If no Neighbourhood is assigned to the postal code set Neighbourhood to Borough
        cell['Borough'] = text
        cell['Neighbourhood'] = text
    # Append in both branches so rows without a Neighbourhood are not dropped
    table_contents.append(cell)

# Convert list of data cells into a DataFrame
df=pd.DataFrame(table_contents)
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})

In [4]:
df.shape

(103, 3)

### Question 2
Append the latitude and longitude of each postal code to the above DataFrame.

Once you are able to create the above dataframe, submit a link to the new Notebook on your Github repository. (2 marks)

In [10]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import geocoder

Set up a rate limiter so that geocoding every row of a DataFrame sends at most one request per second.

In [6]:
geolocator = Nominatim(user_agent="toronto_neighbourhoods")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
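
To make the idea behind the rate limiter concrete, here is a simplified standard-library sketch of a wrapper that spaces calls apart. It is not geopy's implementation (geopy's `RateLimiter` does more, such as retrying failed calls); the name `rate_limited` and the `str.upper` stand-in for a geocoding call are assumptions made for illustration.

```python
import time

def rate_limited(func, min_delay_seconds=1.0):
    """Return a wrapper that waits so calls are at least min_delay_seconds apart."""
    last_call = [None]  # mutable cell holding the time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Stand-in for a geocoding call; a short delay keeps the demo fast
slow_upper = rate_limited(str.upper, min_delay_seconds=0.05)
start = time.monotonic()
results = [slow_upper(name) for name in ['parkwoods', 'victoria village', 'regent park']]
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

The first call goes through immediately; each later call sleeps only for whatever remains of the minimum delay.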

In [11]:
def check_location_geocoder(df):
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates (never exits if every request fails)
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(df['Postal Code']))
        lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return pd.Series([latitude, longitude])

Tested the geocoder code on a smaller dataset. It was unable to reliably retrieve the location data: the requests to the Google endpoint kept timing out, so the run had to be interrupted.

In [12]:
df1 = df.head(10)
print(df1.apply(check_location_geocoder, axis=1))

Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='maps.googleapis.com', port=443): Read timed out. (read timeout=5.0)
Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='maps.googleapis.com', port=443): Read timed out. (read timeout=5.0)
Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='maps.googleapis.com', port=443): Read timed out. (read timeout=5.0)
Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='maps.googleapis.com', port=443): Read timed out. (read timeout=5.0)
Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='maps.googleapis.com', port=443): Read timed out. (read timeout=5.0)
Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='

KeyboardInterrupt: 

In [7]:
def check_location_geopy(df):
    location_info = None
    i = 0
    # Limit the number of geocode calls to at most 6 attempts per row
    while (location_info is None) and i <= 5:
        location_info = geocode(f'{df.Neighbourhood}, {df.Borough}', country_codes='ca', addressdetails=True)
        i += 1
    if location_info is not None:
        return pd.Series([location_info.latitude, location_info.longitude])
    else:
        # Fall back to (0, 0) when no location was found
        return pd.Series([0, 0])

In [8]:
df2 = df.head(10)
print(df2.apply(check_location_geopy, axis=1))

           0          1
0  43.758800 -79.320197
1  43.732658 -79.311189
2   0.000000   0.000000
3  43.716391 -79.442566
4   0.000000   0.000000
5  43.638959 -79.521050
6  43.809196 -79.221701
7  43.775347 -79.345944
8   0.000000   0.000000
9   0.000000   0.000000
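
The rows of `0.000000` in the output above are the `[0, 0]` fallback from `check_location_geopy`, not real coordinates. A small sketch of how such rows could be flagged, assuming a two-column result like the one printed above; the column names are illustrative, and the three sample rows are copied from that output.

```python
import numpy as np
import pandas as pd

coords = pd.DataFrame(
    [[43.758800, -79.320197], [43.732658, -79.311189], [0.0, 0.0]],
    columns=['Latitude', 'Longitude'],
)

# A lookup failed when both coordinates are the 0 placeholder
failed = (coords['Latitude'] == 0) & (coords['Longitude'] == 0)

# Replace placeholders with NaN so they cannot be mistaken for a real location
coords.loc[failed, ['Latitude', 'Longitude']] = np.nan

print(f'{failed.sum()} of {len(coords)} lookups failed')
```

Marking failures as `NaN` keeps them out of later plots and averages, where a point at (0, 0) would otherwise skew the results.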


#### Load the location data from CSV, using Postal Code as the index for joining the DataFrames

In [9]:
loc_data = pd.read_csv(r'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv').set_index('Postal Code')
df = df.set_index('Postal Code')
df = df.join(loc_data).reset_index()
display(df.head())

|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
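
The index-based `join` keeps every row of `df` and fills in coordinates wherever a postal code matches. To check whether any postal codes were left without coordinates, one option is a left `merge` with `indicator=True`. The sketch below uses a tiny hand-made sample in place of the real frames: two rows copied from the output above plus a made-up `M0X` code.

```python
import pandas as pd

areas = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M0X'],  # 'M0X' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Example Borough'],
})
loc_sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# Left merge keeps every row of `areas`; `indicator=True` adds a `_merge`
# column showing whether a match was found in `loc_sample`
merged = areas.merge(loc_sample, on='Postal Code', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would indicate that every scraped postal code found its coordinates.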


#### Question 4

