# Capstone Project

## Wikipedia postal codes scraper

import pandas and numpy

In [1]:
import pandas as pd
import numpy as np

define the url & get the page

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
url

'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
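import requests
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early if the request did not succeed
html_doc = response.text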

### BeautifulSoup 

import BeautifulSoup and load the HTML

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# print(soup.prettify())

define an empty dataframe to store the results

In [5]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
canada = pd.DataFrame(columns=column_names)

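find all rows of the table and, for each row, collect the individual cells; the collected rows are then loaded into the canada dataframe defined above

In [6]:
rows = []

# skip the header row, then read the cells of each remaining table row
for row in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['th', 'td'])
    if len(cells) < 3:
        continue  # ignore rows that do not have all three cells

    postal_code = cells[0].text
    borough = cells[1].text
    neighborhood = cells[2].text

    # rows without an assigned borough are dropped
    if borough == 'Not assigned':
        continue

    rows.append({'PostalCode': postal_code,
                 'Borough': borough,
                 'Neighborhood': neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
canada = pd.DataFrame(rows, columns=column_names)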
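
The row-extraction logic can be tried offline on a small, made-up table. The sketch below assumes a hypothetical HTML snippet (`sample_html`) and a hypothetical helper (`extract_rows`) that mimic the structure of the Wikipedia table; it is an illustration, not the page's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the layout of the Wikipedia table
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

def extract_rows(html):
    """Return one dict per table row whose borough is assigned."""
    table = BeautifulSoup(html, 'html.parser').find('table', class_='wikitable')
    rows = []
    for tr in table.find_all('tr')[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if len(cells) < 3 or cells[1] == 'Not assigned':
            continue
        rows.append(dict(zip(['PostalCode', 'Borough', 'Neighborhood'], cells)))
    return rows

sample_rows = extract_rows(sample_html)
print(sample_rows)
```

Note that `get_text(strip=True)` already trims surrounding whitespace, whereas the notebook cell uses `.text` and cleans the values in a later step.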

In [7]:
canada.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods\n |
| 1 | M4A | North York | Victoria Village\n |
| 2 | M5A | Downtown Toronto | Harbourfront\n |
| 3 | M5A | Downtown Toronto | Regent Park\n |
| 4 | M6A | North York | Lawrence Heights\n |
| 5 | M6A | North York | Lawrence Manor\n |
| 6 | M7A | Queen's Park | Not assigned\n |
| 7 | M9A | Etobicoke | Islington Avenue\n |
| 8 | M1B | Scarborough | Rouge\n |
| 9 | M1B | Scarborough | Malvern\n |


### Data pre-processing

The output shows that every value in the Neighborhood column ends with a newline character (\n), which needs to be removed.

In [8]:
canada['Neighborhood'] = canada['Neighborhood'].str.replace('\n', '')
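
As a possible alternative, `str.strip()` removes any leading or trailing whitespace, not only the newline. A minimal sketch on made-up values (the `sample` Series is hypothetical):

```python
import pandas as pd

# Hypothetical values with trailing newlines and stray spaces
sample = pd.Series(['Parkwoods\n', 'Victoria Village\n', ' Rouge \n'])

# strip() trims whitespace (including '\n') from both ends of each string
cleaned = sample.str.strip()
print(cleaned.tolist())
```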

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [9]:
canada_sorted = canada.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
canada_sorted.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
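
To see the grouping step in isolation, the sketch below applies it to a three-row toy dataframe (`toy_groups` is a hypothetical name; the values are taken from the first rows of the scraped table). It uses `agg` instead of `apply`, which should give the same result for a plain string join.

```python
import pandas as pd

# Toy data: M5A appears twice, M3A once
toy_groups = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (PostalCode, Borough); neighborhoods are joined with ', '
combined = (toy_groups.groupby(['PostalCode', 'Borough'])['Neighborhood']
                      .agg(', '.join)
                      .reset_index())
print(combined)
```

Because `groupby` sorts by the group keys by default, the result is ordered by postal code, which is why the grouped dataframe starts at M1B instead of M3A.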


After this transformation, the dataframe should have as many rows as there are unique postal codes. We can check that this is the case:

In [10]:
canada_sorted.shape

(103, 3)

In [11]:
canada['PostalCode'].unique().shape

(103,)
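
The same comparison could also be written as an explicit check. The helper below (`has_one_row_per_code`, a hypothetical name) is a sketch that returns True when a column contains no repeated values, shown on two made-up dataframes:

```python
import pandas as pd

def has_one_row_per_code(df, column='PostalCode'):
    """Return True if every value in `column` appears exactly once."""
    return bool(df[column].is_unique)

# Hypothetical examples: one grouped, one still containing a duplicate code
grouped = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M1E']})
ungrouped = pd.DataFrame({'PostalCode': ['M5A', 'M5A', 'M6A']})

print(has_one_row_per_code(grouped))
print(has_one_row_per_code(ungrouped))
```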

If a cell has a borough but a Not assigned neighborhood, then the neighborhood is set to the borough. So for the 9th cell in the table on the Wikipedia page (postal code M7A), the value of both the Borough and the Neighborhood columns will be Queen's Park.

In [12]:
temp = canada_sorted[canada_sorted['Neighborhood'] == 'Not assigned']

for idx in temp.index.values:
    canada_sorted.loc[idx,'Neighborhood'] = temp.loc[idx,'Borough']
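
The loop above can also be expressed without explicit iteration. As a sketch on a hypothetical two-row dataframe (`toy_unassigned`), a boolean mask selects the affected rows and copies the borough across in a single assignment:

```python
import pandas as pd

# Toy data: one row with an unassigned neighborhood, one regular row
toy_unassigned = pd.DataFrame({
    'PostalCode': ['M7A', 'M9A'],
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Islington Avenue'],
})

# Select rows whose neighborhood is missing and reuse the borough name
mask = toy_unassigned['Neighborhood'] == 'Not assigned'
toy_unassigned.loc[mask, 'Neighborhood'] = toy_unassigned.loc[mask, 'Borough']
print(toy_unassigned)
```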

Finally, we check how many rows the final dataframe contains:

In [13]:
canada_sorted.shape

(103, 3)