# Notebook for Extraction of Toronto Neighbourhood Data from Wikipedia

In [1]:
from pandas import read_html

In [2]:
page = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wikitables = read_html(page, attrs={"class":"wikitable"})
print ("Extracted {num} wikitables".format(num=len(wikitables)))

Extracted 1 wikitables


Now check whether this is the table we are looking for.

In [3]:
print (wikitables[0].head())

          0             1                 2
0  Postcode       Borough     Neighbourhood
1       M1A  Not assigned      Not assigned
2       M2A  Not assigned      Not assigned
3       M3A    North York         Parkwoods
4       M4A    North York  Victoria Village
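
The visual check above can also be done programmatically. This is a minimal sketch, assuming the target table is the one whose first row holds the three header labels shown in the output. The helper `find_table` and the sample frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

EXPECTED_HEADER = ["Postcode", "Borough", "Neighbourhood"]

def find_table(tables, expected_header=EXPECTED_HEADER):
    """Return the first DataFrame whose first row matches expected_header, else None."""
    for table in tables:
        if len(table) > 0 and list(table.iloc[0]) == list(expected_header):
            return table
    return None

# Hand-built stand-ins for the list that read_html returns (illustrative data only)
other = pd.DataFrame([["Code", "City"], ["K1A", "Ottawa"]])
target = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M3A", "North York", "Parkwoods"],
])

match = find_table([other, target])
print(match is target)
```

With only one table extracted the check is trivial, but it guards against the page gaining additional wikitables later.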


This is the right table. Next, promote its first row to the column header and drop that row from the data.

In [4]:
canada_neighbourhood_details_df = wikitables[0]
# Use the first row (Postcode, Borough, Neighbourhood) as the column names
canada_neighbourhood_details_df.columns = canada_neighbourhood_details_df.iloc[0]
# Drop that header row from the data; the remaining rows keep their labels 1..n
canada_neighbourhood_details_df = canada_neighbourhood_details_df.drop(index=0)
print (canada_neighbourhood_details_df.head())

0 Postcode           Borough     Neighbourhood
1      M1A      Not assigned      Not assigned
2      M2A      Not assigned      Not assigned
3      M3A        North York         Parkwoods
4      M4A        North York  Victoria Village
5      M5A  Downtown Toronto      Harbourfront
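
The stray `0` in the top-left corner of the printed header appears to be the name of the column index, inherited from the label of the row that was promoted. Below is a small sketch of the same header promotion that avoids it by assigning a plain list. The sample rows are illustrative only, and passing `header=0` to `read_html` is another option not shown here.

```python
import pandas as pd

# Illustrative rows only; the real table comes from read_html
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

header = raw.iloc[0]
df = raw.iloc[1:].copy()         # all rows except the header row
df.columns = list(header)        # a plain list leaves no name on the column index
df = df.reset_index(drop=True)   # renumber the rows from 0

print(df.columns.tolist())
print(df.shape)
```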


Let's clean up the data now, starting by dropping the rows whose Borough is "Not assigned".

In [5]:
mod_can_nei_det_df = canada_neighbourhood_details_df[canada_neighbourhood_details_df.Borough != "Not assigned"]
mod_can_nei_det_df = mod_can_nei_det_df.reset_index(drop=True)
print (mod_can_nei_det_df.head())

0 Postcode           Borough     Neighbourhood
0      M3A        North York         Parkwoods
1      M4A        North York  Victoria Village
2      M5A  Downtown Toronto      Harbourfront
3      M5A  Downtown Toronto       Regent Park
4      M6A        North York  Lawrence Heights
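
The filter above relies on a boolean mask. Here is a minimal sketch on made-up rows that also reports how many rows the mask removes; the frame and the variable names `keep` and `filtered` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

keep = df["Borough"] != "Not assigned"       # boolean mask, one entry per row
filtered = df[keep].reset_index(drop=True)   # keep matching rows, renumber from 0

print("Dropped {} of {} rows".format((~keep).sum(), len(df)))
```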


Let's see whether any "Not assigned" values remain in the Neighbourhood column.

In [6]:
print (mod_can_nei_det_df[mod_can_nei_det_df.Neighbourhood == "Not assigned"].head(10))

0 Postcode       Borough Neighbourhood
6      M7A  Queen's Park  Not assigned


Let's replace the 'Not assigned' neighbourhood with the name of its borough.

In [7]:
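# Row 6 is the only row still marked "Not assigned" (see the previous output);
# copy its Borough (column 1) into Neighbourhood (column 2) by position
mod_can_nei_det_df.iloc[[6], [2]] = mod_can_nei_det_df.iloc[[6], [1]].values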
print ("Value Post assignment is = \n{}".format(mod_can_nei_det_df.iloc[[6], [2]]))
print (mod_can_nei_det_df[mod_can_nei_det_df.Neighbourhood == "Not assigned"].head(10))
print ("Shape of the table after above modifications is {}".format(mod_can_nei_det_df.shape))

Value Post assignment is = 
0 Neighbourhood
6  Queen's Park
Empty DataFrame
Columns: [Postcode, Borough, Neighbourhood]
Index: []
Shape of the table after above modifications is (212, 3)
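
The assignment above hard-codes the position of the affected row, which would break if the page content changed. As a hedged alternative, the sketch below selects the rows with a boolean mask instead. The sample frame is made up and `missing` is a hypothetical variable name.

```python
import pandas as pd

df = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Not assigned"],
})

# Select every row whose Neighbourhood is still unassigned, wherever it sits
missing = df["Neighbourhood"] == "Not assigned"
# Copy the Borough into Neighbourhood for those rows only
df.loc[missing, "Neighbourhood"] = df.loc[missing, "Borough"]

print(df["Neighbourhood"].tolist())
```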


Time to consolidate the neighbourhoods that share a postcode into a single comma-separated row.

In [9]:
mod_can_nei_det_df = mod_can_nei_det_df.groupby(['Postcode','Borough'],as_index=False).agg(','.join)

In [10]:
print (mod_can_nei_det_df.head())

0 Postcode      Borough                         Neighbourhood
0      M1B  Scarborough                         Rouge,Malvern
1      M1C  Scarborough  Highland Creek,Rouge Hill,Port Union
2      M1E  Scarborough       Guildwood,Morningside,West Hill
3      M1G  Scarborough                                Woburn
4      M1H  Scarborough                             Cedarbrae


In [11]:
print ("Shape of table after grouping Neighbourhood values is {}".format(mod_can_nei_det_df.shape))

Shape of table after grouping Neighbourhood values is (103, 3)
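
The drop from 212 to 103 rows is what one row per postcode and borough pair would produce. A small sketch of the same grouping on made-up rows, with an explicit check that the result has exactly one row per pair; the sample data and the dictionary form of `agg` are choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Join the neighbourhood names of each (Postcode, Borough) group with commas
grouped = df.groupby(["Postcode", "Borough"], as_index=False).agg(
    {"Neighbourhood": ",".join}
)

# One output row per distinct (Postcode, Borough) pair
expected_rows = len(df.drop_duplicates(["Postcode", "Borough"]))
print(len(grouped) == expected_rows)
print(grouped["Neighbourhood"].tolist())
```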
