# Week 3 Assignment

### Web Scraping
The first step is scraping the Canadian postal codes for each Borough and Neighborhood by querying the Wikipedia page.

In [1]:
import requests
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw_wikipedia_page=requests.get(wikipedia_link)
#print(raw_wikipedia_page.text)

By examining the page HTML we can identify the tree path of our target.
We need BeautifulSoup to parse our HTML page.

In [10]:
#!conda install -c conda-forge beautifulsoup4 --yes

In [19]:
from bs4 import BeautifulSoup

page = BeautifulSoup(raw_wikipedia_page.text, 'html.parser')
#print(page.prettify())

First we find the table in our HTML, then we extract all of its rows. For each row we take **Postcode**, **Borough** and **Neighborhood**.

Note that we need to skip the first row of our table, the header, and strip the **\n** characters from our values.

In [33]:
import pandas as pd

# define the dataframe columns
column_names = ['Postcode', 'Borough', 'Neighborhood']

table = page.find('table', {'class': 'wikitable'})

# collect one record per data row, skipping the first row (the header)
records = []
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    # strip the newline characters from each of the three cell values
    postcode, borough, neighborhood = (cell.get_text().replace('\n', '') for cell in data[:3])
    records.append({
        'Postcode': postcode,
        'Borough': borough,
        'Neighborhood': neighborhood
    })

# build the dataframe once from the collected records, which avoids row-by-row appends
neighborhoods_df = pd.DataFrame(records, columns=column_names)
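
The row extraction described above can be tried without any network access. The following is a minimal sketch that assumes a made-up HTML fragment (`sample_html`) laid out like the Wikipedia table; the helper name `parse_rows` is hypothetical and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the layout of the Wikipedia table
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

def parse_rows(html_text):
    """Return one [postcode, borough, neighborhood] list per data row."""
    soup = BeautifulSoup(html_text, 'html.parser')
    table = soup.find('table', {'class': 'wikitable'})
    parsed = []
    # Skip the first <tr>, which holds the header cells
    for row in table.find_all('tr')[1:]:
        cells = row.find_all('td')
        # Strip the newline characters left over from the page markup
        parsed.append([cell.get_text().replace('\n', '') for cell in cells])
    return parsed

sample_rows = parse_rows(sample_html)
print(sample_rows)
```

Returning plain lists keeps the parsing step separate from pandas, so the parsed rows can be inspected before a dataframe is built from them.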

In [34]:
neighborhoods_df.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


The next step is dropping the rows whose Borough is **Not assigned** and, where the Neighborhood is **Not assigned**, setting it to the corresponding Borough.

In [35]:
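neighborhoods_df = neighborhoods_df[neighborhoods_df.Borough != 'Not assigned'].copy()
# where the Neighborhood is not assigned, fall back to the Borough name
not_assigned = neighborhoods_df['Neighborhood'] == 'Not assigned'
neighborhoods_df.loc[not_assigned, 'Neighborhood'] = neighborhoods_df.loc[not_assigned, 'Borough']
neighborhoods_df.head(10)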

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
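
To see the cleaning step in isolation, here is a small sketch on made-up rows shaped like the scraped table. The names `sample_df`, `cleaned` and `missing` are assumptions for illustration only. It filters with a boolean mask and then fills values with a single `.loc` assignment.

```python
import pandas as pd

# Made-up rows shaped like the scraped table
sample_df = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows with a real Borough; copy so later writes go to an independent frame
cleaned = sample_df[sample_df['Borough'] != 'Not assigned'].copy()

# Boolean mask of the rows whose Neighborhood still needs a value
missing = cleaned['Neighborhood'] == 'Not assigned'
cleaned.loc[missing, 'Neighborhood'] = cleaned.loc[missing, 'Borough']

print(cleaned)
```

Selecting rows and column in one `.loc` call writes directly into `cleaned`, which avoids the ambiguity of assigning through a chained selection.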


Finally, we aggregate by **Postcode** and **Borough**, setting **Neighborhood** to the comma-separated concatenation of all the corresponding values.

In [39]:
new_df = neighborhoods_df.groupby(['Postcode','Borough'],as_index=False).agg(lambda x: ','.join(x))
new_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
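
The aggregation can be checked on a few made-up rows. This sketch assumes a toy frame `sample_df` and selects the **Neighborhood** column explicitly before joining, which is one possible variant of the call used above rather than the only way to write it.

```python
import pandas as pd

# Made-up rows: two neighborhoods share the postcode M1B
sample_df = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per (Postcode, Borough); neighborhoods are joined with commas
grouped = (
    sample_df
    .groupby(['Postcode', 'Borough'], as_index=False)['Neighborhood']
    .agg(','.join)
)

print(grouped)
```

Naming the column makes it explicit that only **Neighborhood** is concatenated, which matters if the frame later gains non-string columns.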


In [40]:
new_df.shape