# <center>Scraping Wikipedia page for Neighborhoods in Toronto City</center>

In [2]:
# Import the necessary libraries
from urllib.request import urlopen  # open and read HTTP responses
from bs4 import BeautifulSoup  # parse and scrape web pages
import numpy as np  # handle data in a vectorized manner
import pandas as pd  # data analysis
print('Libraries imported.')

Libraries imported.


<b>Assumptions</b>
<ul>
<li>The web page we are going to look at is 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'.</li>
<li>It has the full list of Toronto neighborhoods, their boroughs and postal codes.</li>
<li>A few postal codes are not assigned to any borough. We are going to process only the cells that have an assigned borough.</li>
<li>More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma.</li>
<li>A few rows have a borough but a 'Not assigned' neighborhood; in that case the neighborhood is set to the borough name. For example, in the 9th row of the table on the Wikipedia page, both the Borough and the Neighborhood columns will be Queen's Park.</li>
<li>We are going to use the <b>BeautifulSoup</b> library in this notebook to scrape the table on the given Wikipedia page.</li>
</ul>
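
The three cleaning rules listed above can be sketched in plain Python before touching the real page. This is only an illustration under stated assumptions: the function name `clean_rows` is hypothetical, and the sample rows are typed by hand (the postal codes M1A and M7A are placeholders chosen for the example, not values taken from this notebook's output).

```python
NOT_ASSIGNED = 'Not assigned'

def clean_rows(rows):
    """Apply the three rules to (postal_code, borough, neighborhood) tuples."""
    combined = {}  # postal code -> (borough, [neighborhoods]); dicts keep insertion order
    for postcode, borough, neighborhood in rows:
        if borough == NOT_ASSIGNED:
            continue  # rule 1: ignore rows without a borough
        if neighborhood == NOT_ASSIGNED:
            neighborhood = borough  # rule 3: fall back to the borough name
        entry = combined.setdefault(postcode, (borough, []))
        entry[1].append(neighborhood)  # rule 2: collect neighborhoods per postal code
    return [(pc, bor, ', '.join(names)) for pc, (bor, names) in combined.items()]

# Hand-typed sample rows (hypothetical, for illustration only)
sample = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M7A', "Queen's Park", 'Not assigned'),
]
result = clean_rows(sample)
for row in result:
    print(row)
```

The rest of the notebook applies the same rules with BeautifulSoup and pandas instead of plain tuples.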

Open the Wikipedia URL using the <b>urllib.request.urlopen</b> function

In [499]:
html = urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
html

<http.client.HTTPResponse at 0xc8a6240>
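
The call above leaves the response open and has no timeout. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html` and the 10-second default timeout are assumptions made for this example, not part of the original notebook.

```python
from urllib.error import URLError
from urllib.request import urlopen

def fetch_html(url, timeout=10):
    """Return the decoded body of `url`; raise RuntimeError if the request fails."""
    try:
        # The context manager closes the connection once the body has been read
        with urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset)
    except URLError as err:  # HTTPError is a subclass of URLError
        raise RuntimeError(f'Could not fetch {url}: {err}') from err
```

The returned string could then be passed to `BeautifulSoup` in place of `html.read()`, for example `BeautifulSoup(fetch_html(url), 'html5lib')`.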

Create a <b>BeautifulSoup</b> object to parse the HTML response

In [500]:
res = BeautifulSoup(html.read(), 'html5lib')

Print the title of the Wikipedia page

In [501]:
print(res.title)

<title>List of postal codes of Canada: M - Wikipedia</title>


Create an empty dataframe with appropriate columns

In [502]:
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
toronto_data = pd.DataFrame(columns=column_names)

Scrape the table content using the BeautifulSoup object 'res'

In [503]:
table = res.find('table', {'class': 'wikitable sortable'})  # find the table with the 'wikitable sortable' class
rows = []  # collect one dict per row; DataFrame.append was removed in pandas 2.0
for row in table.find_all('tr')[1:]:  # [1:] skips the header row of <th> cells
    cells = row.find_all('td')  # all <td> elements in the row
    postcode = cells[0].text.strip()
    bor = cells[1].text.strip()
    neigh = cells[2].text.strip()  # also removes the trailing newline
    if bor == 'Not assigned':  # skip rows with no assigned borough
        continue
    if neigh == 'Not assigned':  # fall back to the borough name
        neigh = bor
    rows.append({'PostalCode': postcode, 'Borough': bor, 'Neighborhood': neigh})

# Build the dataframe in one step
toronto_data = pd.DataFrame(rows, columns=column_names)
toronto_data.head()

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [504]:
toronto_data.tail()

,PostalCode,Borough,Neighborhood
207,M8Z,Etobicoke,Kingsway Park South West
208,M8Z,Etobicoke,Mimico NW
209,M8Z,Etobicoke,The Queensway West
210,M8Z,Etobicoke,Royal York South West
211,M8Z,Etobicoke,South of Bloor
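
The row-extraction step can be tried offline on a small inline table, which makes it easier to see what each call returns. This is a minimal sketch, assuming a hand-written HTML snippet that mimics the page layout: the header labels and the three sample rows are hypothetical. It uses the built-in `'html.parser'` so that `html5lib` is not required, and matches the table with `class_='wikitable'`, which matches any table whose class list contains `wikitable` regardless of the other classes.

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating the structure of the Wikipedia table (hypothetical data)
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront
</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find('table', class_='wikitable')

rows = []
for tr in table.find_all('tr'):
    # get_text(strip=True) drops surrounding whitespace, including trailing newlines
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) != 3:
        continue  # the header row has <th> cells, so it yields no <td> values
    rows.append(tuple(cells))

print(rows)
```

Checking the cell count instead of slicing with `[1:]` is a design choice for this sketch: it also skips any malformed row rather than raising an `IndexError`.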


Shape of the original DataFrame

In [505]:
toronto_data.shape

(212, 3)

Group the data by the PostalCode column and join the neighborhoods of each postal code with a comma

In [506]:
toronto_grouped = toronto_data.groupby('PostalCode')['Neighborhood'].apply(', '.join).reset_index()
toronto_grouped.head()

,PostalCode,Neighborhood
0,M1B,"Rouge, Malvern"
1,M1C,"Highland Creek, Rouge Hill, Port Union"
2,M1E,"Guildwood, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae


Merge the grouped neighborhoods with the Borough column of the original dataframe 'toronto_data', then drop the duplicate rows the merge produces

In [507]:
toronto_grouped = toronto_grouped.merge(toronto_data[['PostalCode','Borough']], on='PostalCode').drop_duplicates().reset_index(drop=True)
toronto_grouped.head()

,PostalCode,Neighborhood,Borough
0,M1B,"Rouge, Malvern",Scarborough
1,M1C,"Highland Creek, Rouge Hill, Port Union",Scarborough
2,M1E,"Guildwood, Morningside, West Hill",Scarborough
3,M1G,Woburn,Scarborough
4,M1H,Cedarbrae,Scarborough
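
The group-then-merge steps above could also be expressed as a single aggregation. The sketch below is an alternative, not the method used in this notebook, and it assumes every postal code maps to exactly one borough, so taking the first borough per group loses nothing. The sample frame is a hand-typed subset used only for illustration.

```python
import pandas as pd

# Small hand-typed sample in the same shape as toronto_data
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One pass: keep the first borough and join the neighborhoods of each postal code
grouped = (
    sample.groupby('PostalCode', as_index=False)
          .agg({'Borough': 'first', 'Neighborhood': ', '.join})
)
print(grouped)
```

Because the aggregation dict lists Borough before Neighborhood, the columns already come out in the order PostalCode, Borough, Neighborhood.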


Move the Neighborhood column to the last position in the DataFrame

In [508]:
# Select the columns by name in the desired order
fixed_columns = ['PostalCode', 'Borough', 'Neighborhood']
toronto_grouped = toronto_grouped[fixed_columns]

The final dataframe

In [509]:
toronto_grouped.head()

,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [510]:
toronto_grouped.tail()

,PostalCode,Borough,Neighborhood
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
102,M9W,Etobicoke,Northwest


Shape of the final dataframe

In [511]:
toronto_grouped.shape

(103, 3)
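
Beyond checking the shape, a few quick assertions can confirm that the cleaning rules actually held. The helper below is a minimal sketch: the name `find_problems` and the two small frames are hypothetical and only illustrate the kind of checks one might run.

```python
import pandas as pd

def find_problems(df):
    """Return a list of human-readable problems found in a cleaned frame."""
    problems = []
    if df['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    for column in ('Borough', 'Neighborhood'):
        # regex=False treats the pattern as a plain substring
        if df[column].str.contains('Not assigned', regex=False).any():
            problems.append(f"'Not assigned' left in {column}")
    return problems

# A frame that follows the rules, and one that breaks two of them
clean = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Harbourfront, Regent Park'],
})
dirty = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

print(find_problems(clean))
print(find_problems(dirty))
```

If the earlier steps worked as intended, calling `find_problems(toronto_grouped)` should return an empty list.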