#### *Imports Needed*

In [1]:
from pandas import read_html
import numpy as np

* *HTML style to float Markdown tables to the left*

In [2]:
%%html
<style>
table {float:left}
</style>

## **Part 1 - Get and Prepare Dataset**

* Get data from Wikipedia -  [link here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)
* Create a dataframe with the collected data
* Remove rows where borough is *Not assigned*
* Join neighbourhoods that share the same postal code
* Replace *Not assigned* neighbourhoods with the borough name
***
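
The steps listed above can be sketched offline on a few made-up rows. This is only an illustration: the postal codes, boroughs and neighbourhoods below are hypothetical stand-ins (not data from the Wikipedia page), and the names `raw`, `clean` and `toy` are assumptions introduced here. The sketch also fills in *Not assigned* neighbourhoods before joining, which is a different order from the cleaning cell in this notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Wikipedia table (not real data)
raw = pd.DataFrame(
    {
        "Postcode": ["X1A", "X2A", "X2A", "X3A"],
        "Borough": ["Not assigned", "North", "North", "South"],
        "Neighbourhood": ["Not assigned", "Alpha", "Beta", "Not assigned"],
    }
)

# Remove rows where the borough is 'Not assigned'
clean = raw[raw["Borough"] != "Not assigned"].copy()

# A 'Not assigned' neighbourhood takes the borough name
unassigned = clean["Neighbourhood"] == "Not assigned"
clean.loc[unassigned, "Neighbourhood"] = clean.loc[unassigned, "Borough"]

# Join neighbourhoods that share a postal code into one comma-separated string
toy = (
    clean.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(toy)
```

Each postal code ends up on a single row, with its neighbourhoods joined by `', '`.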

#### *1. get html data and convert to a pandas dataframe*

In [8]:
page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikitables = read_html(page, index_col=0,  attrs={"class":"wikitable"})
toronto_postal = wikitables[0]
print(toronto_postal.shape)

(287, 2)


#### *2. clean and prepare data*

In [13]:
# mark 'Not assigned' boroughs as NaN, then drop the rows that contain NaN
toronto_postal['Borough'] = toronto_postal['Borough'].replace('Not assigned', np.nan)
toronto_postal = toronto_postal.dropna()
# join neighbourhoods that share the same postal code (the index)
toronto_postal['Neighbourhood'] = toronto_postal.groupby(['Postcode'])['Neighbourhood'].apply(', '.join)
# keep one row per postal code
toronto_postal = toronto_postal.loc[~toronto_postal.index.duplicated(keep='first')].copy()
# replace 'Not assigned' neighbourhoods with the borough name
not_assigned = toronto_postal['Neighbourhood'] == 'Not assigned'
toronto_postal.loc[not_assigned, 'Neighbourhood'] = toronto_postal.loc[not_assigned, 'Borough']
print('Shape of dataset: %s'%(str(toronto_postal.shape)))

Shape of dataset: (103, 2)


#### *3. test data*
* **must match:**

| PostalCode | Borough          | Neighbourhood                          |
|------------|------------------|----------------------------------------|
| M7A        | Queen's Park     | Queen's Park                           |
| M5X        | Downtown Toronto | First Canadian Place, Underground city |
| M1C        | Scarborough      | Highland Creek, Rouge Hill, Port Union |



In [5]:
toronto_postal.loc[['M7A','M5X','M1C']] #they match ;)

| Postcode | Borough          | Neighbourhood                          |
|----------|------------------|----------------------------------------|
| M7A      | Queen's Park     | Queen's Park                           |
| M5X      | Downtown Toronto | First Canadian Place, Underground city |
| M1C      | Scarborough      | Highland Creek, Rouge Hill, Port Union |
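
The visual check could also be done with a small helper. This is a minimal sketch: `matches_expected` and `stand_in` are hypothetical names introduced here, and the expected rows are copied from the table in step 3. A stand-in frame is used so the sketch runs without fetching the page.

```python
import pandas as pd

# Expected rows copied from the "must match" table
expected = pd.DataFrame(
    {
        "Borough": ["Queen's Park", "Downtown Toronto", "Scarborough"],
        "Neighbourhood": [
            "Queen's Park",
            "First Canadian Place, Underground city",
            "Highland Creek, Rouge Hill, Port Union",
        ],
    },
    index=pd.Index(["M7A", "M5X", "M1C"], name="Postcode"),
)


def matches_expected(df, expected):
    """Return True if df holds the expected values for the expected postal codes."""
    actual = df.loc[expected.index, expected.columns]
    return actual.equals(expected)


# Stand-in for toronto_postal (assumption: same index and column layout)
stand_in = expected.copy()
print(matches_expected(stand_in, expected))
```

With the real data, the call would be `matches_expected(toronto_postal, expected)`, assuming `toronto_postal` is still indexed by `Postcode`.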


In [14]:
# reset the index so Postcode becomes a column, and store the result in a variable called toronto
toronto = toronto_postal.reset_index()
print('Shape of dataset: %s'%(str(toronto.shape)))

Shape of dataset: (103, 3)
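
As a small illustration of what `reset_index` does to the shape, here is a sketch on a hypothetical two-row frame (the rows and the names `indexed` and `flat` are made up for this example). The index moves into a regular column, so the reset frame has one more column than the indexed one.

```python
import pandas as pd

# Hypothetical frame indexed by Postcode, like toronto_postal
indexed = pd.DataFrame(
    {"Borough": ["North", "South"], "Neighbourhood": ["Alpha, Beta", "South"]},
    index=pd.Index(["X2A", "X3A"], name="Postcode"),
)

# reset_index moves the Postcode index into the first column
flat = indexed.reset_index()

print(indexed.shape, flat.shape)
print(list(flat.columns))
```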
