"Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe"

## Imports

In [1]:
#imports
from bs4 import BeautifulSoup
import pandas as pd
import requests
import numpy as np

## Part 1 - Get the data from the page

In [2]:
html_doc = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
#a soup object of the whole page
soup = BeautifulSoup(html_doc, 'html.parser')

#just the table we're interested in
table = soup.find('table', class_='wikitable sortable')

#parse the table entries into a list
wikis = []

rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells)==3:
        #string=True is the current name of the deprecated text=True argument
        postalCode = cells[0].find(string=True)
        borough = cells[1].find(string=True)
        neigh = cells[2].find(string=True)
        
        #Ignore cells with a borough that is Not assigned
        if "Not assigned" not in borough:
            #append to list
            wikis.append([postalCode, borough, neigh])
        
print(wikis)
        

[['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village'], ['M5A', 'Downtown Toronto', 'Harbourfront'], ['M5A', 'Downtown Toronto', 'Regent Park'], ['M6A', 'North York', 'Lawrence Heights'], ['M6A', 'North York', 'Lawrence Manor'], ['M7A', "Queen's Park", 'Not assigned\n'], ['M9A', 'Etobicoke', 'Islington Avenue'], ['M1B', 'Scarborough', 'Rouge'], ['M1B', 'Scarborough', 'Malvern'], ['M3B', 'North York', 'Don Mills North\n'], ['M4B', 'East York', 'Woodbine Gardens'], ['M4B', 'East York', 'Parkview Hill'], ['M5B', 'Downtown Toronto', 'Ryerson'], ['M5B', 'Downtown Toronto', 'Garden District\n'], ['M6B', 'North York', 'Glencairn'], ['M9B', 'Etobicoke', 'Cloverdale\n'], ['M9B', 'Etobicoke', 'Islington'], ['M9B', 'Etobicoke', 'Martin Grove\n'], ['M9B', 'Etobicoke', 'Princess Gardens'], ['M9B', 'Etobicoke', 'West Deane Park'], ['M1C', 'Scarborough', 'Highland Creek'], ['M1C', 'Scarborough', 'Rouge Hill'], ['M1C', 'Scarborough', 'Port Union'], ['M3C', 'North York', 'Flemin
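The row filter can be checked offline on a small fragment before running it against the live page. The following is a minimal sketch, not the notebook's own code: `SAMPLE_HTML` and `parse_rows` are hypothetical names, the fragment only imitates the layout of the Wikipedia table, and `get_text(strip=True)` is swapped in for `find(text=True)` so that trailing newlines are removed at parse time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the layout of the Wikipedia table
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""


def parse_rows(html):
    """Return [postal code, borough, neighbourhood] for rows with an assigned borough."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable sortable")
    parsed = []
    for row in table.find_all("tr"):
        # get_text(strip=True) drops the trailing newline that some cells carry
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        # the header row has no <td> cells, so it is skipped by the length check
        if len(cells) == 3 and cells[1] != "Not assigned":
            parsed.append(cells)
    return parsed


sample_rows = parse_rows(SAMPLE_HTML)
print(sample_rows)
```

The row with an unassigned borough is dropped, while the row with a borough but an unassigned neighbourhood is kept for the later clean-up step.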

## Transform the data into a pandas dataframe

In [4]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# create the dataframe
dfWikis = pd.DataFrame(wikis, columns=column_names)

dfWikis.head()


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


## Prepare the data

Combine the neighborhoods that share a postal code and borough into a single comma-separated string

In [22]:
dfPostalCode = dfWikis.groupby(['PostalCode','Borough'], sort = False).agg(lambda x: ','.join(x))
dfPostalCode = dfPostalCode.reset_index()
#remove the \n from some "Neighborhood" fields
dfPostalCode.replace({'\n': ''}, regex=True, inplace=True)
dfPostalCode.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 4 | M7A | Queen's Park | Not assigned |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge,Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens,Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson,Garden District |
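The grouping step can be tried on a few rows in isolation. This is a minimal sketch under stated assumptions: `sample` and `combined` are hypothetical names, the rows are made up to mirror the scraped list, and the whitespace is stripped before the join rather than after it, which is a variation on the cell above.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped list
sample = pd.DataFrame(
    [
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park\n"],
        ["M3A", "North York", "Parkwoods"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Strip stray whitespace first so no newline ends up inside a joined string
sample["Neighborhood"] = sample["Neighborhood"].str.strip()

# as_index=False keeps the grouping keys as columns; sort=False keeps first-seen order
combined = sample.groupby(["PostalCode", "Borough"], as_index=False, sort=False).agg(
    {"Neighborhood": ",".join}
)
print(combined)
```

Passing `as_index=False` makes the separate `reset_index()` call unnecessary in this variant.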


If a row has a borough but its neighborhood is "Not assigned", set the neighborhood to the borough name

In [25]:
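# rows yielded by iterrows() can be copies, so assigning to them is not
# guaranteed to change the dataframe; a boolean mask with .loc writes in place
not_assigned = dfPostalCode['Neighborhood'] == 'Not assigned'
dfPostalCode.loc[not_assigned, 'Neighborhood'] = dfPostalCode.loc[not_assigned, 'Borough']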

dfPostalCode.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge,Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens,Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson,Garden District |
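The same rule can also be expressed as a single column operation with `Series.mask`, which replaces values where a condition is true. A minimal sketch, assuming a small hypothetical dataframe `sample` with the same three columns:

```python
import pandas as pd

# Hypothetical two-row stand-in for the grouped dataframe
sample = pd.DataFrame(
    {
        "PostalCode": ["M6A", "M7A"],
        "Borough": ["North York", "Queen's Park"],
        "Neighborhood": ["Lawrence Heights,Lawrence Manor", "Not assigned"],
    }
)

# Where the condition is True, take the value from Borough; elsewhere keep Neighborhood
is_unassigned = sample["Neighborhood"] == "Not assigned"
sample["Neighborhood"] = sample["Neighborhood"].mask(is_unassigned, sample["Borough"])
print(sample)
```

Because the result is assigned back to the column, the change is visible in the dataframe regardless of how pandas handles row copies.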


Use the .shape attribute to print the number of rows and columns of the dataframe

In [26]:
#(rows, columns) of the final dataframe
dfPostalCode.shape

(103, 3)
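Beyond the row count, a couple of quick checks can confirm that the preparation steps did what was intended. This is a sketch only: `sample` is a hypothetical stand-in for the final dataframe, and the two checks (one row per postal code, no leftover "Not assigned" neighborhoods) are assumptions about what the cleaned data should look like.

```python
import pandas as pd

# Hypothetical stand-in for the final dataframe
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Victoria Village", "Harbourfront,Regent Park"],
    }
)

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = sample.shape

# after grouping, each postal code should appear exactly once
codes_unique = sample["PostalCode"].is_unique

# after the clean-up, no neighborhood should still read "Not assigned"
none_unassigned = not (sample["Neighborhood"] == "Not assigned").any()

print(n_rows, n_cols, codes_unique, none_unassigned)
```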