# Part 1: Collecting data from Wikipedia

## 1.1 Installing packages
Install the BeautifulSoup package (`beautifulsoup4`), which is used to parse and analyze the HTML of the crawled web pages.

In [2]:
!conda install -c anaconda beautifulsoup4 --yes

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    soupsieve-1.9.2            |           py36_0          61 KB  anaconda
    openssl-1.0.2s             |       h7b6447c_0         3.1 MB  anaconda
    certifi-2019.6.16          |           py36_1         156 KB  anaconda
    beautifulsoup4-4.7.1       |           py36_1         143 KB  anaconda
    ------------------------------------------------------------
                                           Total:         3.5 MB

The following NEW packages will be INSTALLED:

    soupsieve:      1.9.2-py36_0      anaconda   

The following packages will be UPDATED

## 1.2 Importing required packages

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
print('Packages are imported')

Packages are imported
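
To confirm which versions were actually installed, the standard library can be queried directly. The sketch below assumes Python 3.8 or newer (where `importlib.metadata` is available); the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on PyPI (these differ from the import names,
# e.g. the distribution "beautifulsoup4" is imported as "bs4").
for dist in ("beautifulsoup4", "pandas", "requests"):
    print(dist, installed_version(dist) or "not installed")
```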


## 1.3 Crawling wiki page

Download the required Wikipedia page, [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), and parse it with BeautifulSoup.

In [4]:
link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
crawled_page = requests.get(link)
soup = BeautifulSoup(crawled_page.text, 'html.parser')
print('Page\'s title is: ', soup.title.string)

Page's title is:  List of postal codes of Canada: M - Wikipedia
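
The request above has no timeout and does not check the HTTP status, so a failed download would only surface later as a parsing error. A slightly more defensive variant might look like the sketch below; the function name `fetch_page` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_page(url, timeout=10):
    """Download a page and return its HTML, raising on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text


# Example usage (requires network access, so it is left commented out):
# html = fetch_page('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
# soup = BeautifulSoup(html, 'html.parser')
```

All errors raised by `requests` derive from `requests.exceptions.RequestException`, so a single `except` clause is enough to handle a failed download.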


## 1.4 Analyzing HTML content

In [5]:
table_data = soup.find("table", {"class" : "wikitable sortable"})
table_body = table_data.find('tbody')
columns_rs = table_body.find_all('th')
columns_rs

[<th>Postcode</th>, <th>Borough</th>, <th>Neighbourhood
 </th>]

Remove HTML tags using a regular expression.

In [6]:
import re
def clean_html_tag(raw_html): 
    reg = re.compile('<.*?>')
    cleantext = re.sub(reg, '', raw_html)
    return cleantext
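
The pattern `<.*?>` is non-greedy, which is what makes it match one tag at a time. The sketch below shows its effect on a header cell like the one printed earlier and contrasts it with the greedy form; `strip_tags` is a hypothetical stand-in for `clean_html_tag` so that the example runs on its own.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')  # non-greedy: each match stops at the first '>'


def strip_tags(raw_html):
    """Remove anything that looks like an HTML tag from the string."""
    return TAG_PATTERN.sub('', raw_html)


header_cell = '<th>Neighbourhood\n</th>'
print(repr(strip_tags(header_cell)))           # 'Neighbourhood\n'
print(repr(strip_tags(header_cell).rstrip()))  # 'Neighbourhood'

# A greedy pattern would swallow everything between the first '<' and the last '>'.
print(repr(re.sub(r'<.*>', '', '<td>M5A</td>')))  # ''
print(repr(strip_tags('<td>M5A</td>')))           # 'M5A'
```

The trailing newline left after stripping the tags is why the notebook calls `rstrip()` on every cleaned value.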

Read the column titles.

In [7]:
columns_list = []
for element in columns_rs:
    temp = str(element)
    temp = clean_html_tag(temp)
    temp = temp.rstrip()
    if (len(temp) > 0):
        columns_list.append(temp)
columns_list

['Postcode', 'Borough', 'Neighbourhood']
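
As an alternative to the regex helper, BeautifulSoup can extract the text of a tag itself through `get_text`. The sketch below runs on a small hand-written table (the `sample_html` string is an assumption made for illustration, not the real page) and should yield the same three column names.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the header row of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
<tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
</tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
header_cells = sample_soup.find('table').find_all('th')

# get_text(strip=True) drops the tags and the surrounding whitespace in one call.
column_names = [th.get_text(strip=True) for th in header_cells]
print(column_names)  # ['Postcode', 'Borough', 'Neighbourhood']
```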

Create an empty data frame with these columns.

In [8]:
df = pd.DataFrame(columns = columns_list)
df

Unnamed: 0,Postcode,Borough,Neighbourhood


Read the table rows and load them into the data frame. The rows are collected in a list first, because `DataFrame.append` was removed in pandas 2.0.

In [9]:
rows = table_body.find_all('tr')
records = []
for row in rows:
    cols = row.find_all('td')
    if len(cols) >= 3:  # header rows contain only <th> cells, so they are skipped
        values = [clean_html_tag(str(col)).rstrip() for col in cols[:3]]
        records.append(dict(zip(columns_list, values)))
# Build the data frame in one step from the collected rows
df = pd.DataFrame(records, columns=columns_list)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


In [10]:
df.shape

(288, 3)

Drop the rows whose borough is `Not assigned`.

In [11]:
row_indices = df.index[df['Borough'] == 'Not assigned'].tolist()
print('Total rows will be deleted: ', len(row_indices))

Total rows will be deleted:  77


In [12]:
final_df = df.drop(row_indices)
final_df.reset_index(inplace=True,drop=True)
final_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [13]:
final_df.shape

(211, 3)
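
Dropping rows by index, as done in the cells above, can also be written as a single boolean-mask filter. The sketch below shows the idea on a three-row sample frame (the `sample` data is made up from the first rows of the table for illustration).

```python
import pandas as pd

sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows whose borough is assigned, then renumber the index.
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned.shape)  # (2, 3)
```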

If a row has a borough but its neighborhood is `Not assigned`, the neighborhood is set to the name of the borough.

In [14]:
row_indices = final_df.index[final_df['Neighbourhood'] == 'Not assigned'].tolist()
final_df.loc[row_indices]

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,Not assigned


In [15]:
final_df.loc[row_indices, 'Neighbourhood'] = final_df.loc[row_indices, 'Borough']

In [16]:
final_df.loc[row_indices]

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,Queen's Park
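
The same replacement can be expressed without collecting indices, using `Series.mask`, which substitutes values wherever a condition holds. The sketch below applies it to a two-row sample frame (the `sample` data is illustrative only).

```python
import pandas as pd

sample = pd.DataFrame({
    'Postcode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront', 'Not assigned'],
})

# Where the neighbourhood is unassigned, take the value from the Borough column.
unassigned = sample['Neighbourhood'] == 'Not assigned'
sample['Neighbourhood'] = sample['Neighbourhood'].mask(unassigned, sample['Borough'])
print(sample['Neighbourhood'].tolist())  # ['Harbourfront', "Queen's Park"]
```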


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. Such rows are combined into a single row, with the neighborhoods separated by commas.

In [17]:
final_df['Neighbourhood'] = final_df.groupby('Postcode')['Neighbourhood'].transform(lambda x: ','.join(x))
final_df.drop_duplicates(inplace=True)
final_df.reset_index(inplace=True, drop=True)
final_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [18]:
final_df.shape

(103, 3)
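
The `transform` plus `drop_duplicates` combination above can also be written as a single aggregation. The sketch below groups a small sample frame by postcode and borough and joins the neighbourhood names; the `sample` data is made up for illustration. Because `groupby` sorts by the group keys by default, the result also comes out ordered by postcode.

```python
import pandas as pd

sample = pd.DataFrame({
    'Postcode': ['M7A', 'M5A', 'M5A'],
    'Borough': ["Queen's Park", 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ["Queen's Park", 'Harbourfront', 'Regent Park'],
})

# One row per (Postcode, Borough); neighbourhood names are joined with commas.
merged = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
          .agg(','.join)
          .reset_index()
)
print(merged)
```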

In [22]:
final_df.sort_values(by='Postcode', inplace=True)
final_df.reset_index(inplace=True, drop=True)
final_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [23]:
final_df.shape

(103, 3)

## 1.5 Export data frame to CSV file

In [24]:
# astype(str) converts every column to strings (applymap is deprecated in recent pandas)
export_df = final_df.astype(str)
export_df.to_csv('canada_postal_code.csv', encoding='utf-8', index=False)
print('Finished exporting CSV')

Finished exporting CSV
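
A quick way to check an export like this is to read the CSV back and compare it with the original frame. The sketch below does the round trip through an in-memory buffer instead of a file so that it runs on its own; the `sample` frame is illustrative data, not the full table.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    'Postcode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront,Regent Park', "Queen's Park"],
})

# Write to an in-memory buffer, then read it back as if it were a file.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Values containing commas are quoted on export, so they survive the round trip.
print(restored['Neighbourhood'].tolist() == sample['Neighbourhood'].tolist())
```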
