<h1><center> Scraping the Wikipedia page</center></h1>

The general idea behind web scraping is to retrieve data that exists on a website and convert it into a format that is usable for analysis.

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#import_library">Importing required Libraries</a></li>
        <li><a href="#collect_parse">Collecting and Parsing a Webpage</a></li>
        <li><a href="#saving">Saving into Dataframe</a></li>
        <li><a href="#cleaning_data">Cleaning data</a></li>
    </ol>
</div>
<br>
<hr>

<div id="import_library">
    <h2>Importing required Libraries</h2>
</div>

<ul>
<li>Here we scrape data in Python using Beautiful Soup, a Python package for parsing HTML and XML documents. It creates a parse tree for each parsed page, which can be used to extract data from the HTML and is therefore useful for web scraping.</li>

<li>We also import the requests library, which lets us send HTTP/1.1 requests without building query strings or encoding form data by hand.</li>
</ul>

In [49]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

<div id="collect_parse">
    <h2>Collecting and Parsing a Webpage</h2>
</div>

<ul>
<li>requests.get(url) sends an HTTP GET request to the given URL, and the .text attribute of the response returns the page's HTML as a string.</li>

<li>We begin by reading the source code of the web page and creating a BeautifulSoup object (soup) with the BeautifulSoup constructor, here using the lxml parser.</li>
</ul>
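The two steps above can be sketched with basic error handling. This is a minimal illustration, not the notebook's own code: the helper names <code>fetch_html</code> and <code>parse_html</code> and the 10-second timeout are assumptions, and the built-in <code>html.parser</code> is swapped in for <code>lxml</code> so the example runs without an extra dependency. The demonstration parses a small hand-written page rather than the Wikipedia page.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def parse_html(html):
    """Build a BeautifulSoup parse tree with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Offline demonstration on a tiny hand-written page (not the Wikipedia page).
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
sample_soup = parse_html(sample_html)

page_title = sample_soup.title.string                 # text of the <title> tag
first_paragraph = sample_soup.find("p").get_text()    # text of the first <p> tag
```

Calling <code>raise_for_status()</code> before parsing means a failed request surfaces as an exception instead of an error page being parsed silently.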

In [99]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')

If we inspect the page's HTML carefully, we see that the table we want to extract has the class attribute <code>wikitable sortable</code>.

In [142]:
table = soup.find('table', class_='wikitable sortable')
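One detail worth knowing about this lookup: when <code>class_</code> is given a string containing a space, Beautiful Soup compares it against the exact value of the class attribute, so a table whose classes appear in a different order would not be found. The sketch below illustrates this on a small hand-written table (an assumption for illustration, not the Wikipedia markup) and shows a CSS selector as an order-independent alternative. The helper <code>require_table</code> is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample whose class names are in the opposite order.
sample_html = '<table class="sortable wikitable"><tr><td>M1A</td></tr></table>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string comparison against the class attribute: no match here.
by_string = sample_soup.find("table", class_="wikitable sortable")

# CSS selector: matches a table carrying both classes, in any order.
by_selector = sample_soup.select_one("table.wikitable.sortable")


def require_table(table):
    """Fail early with a clear message if the lookup returned None."""
    if table is None:
        raise ValueError("No matching table found; check the class names.")
    return table


found = require_table(by_selector)
first_cell = found.find("td").get_text()
```

Checking for <code>None</code> right after the lookup gives a clearer error than the <code>AttributeError</code> that would otherwise appear later when iterating over the rows.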

<div id="saving">
    <h2>Saving into Dataframe</h2>
</div>

Now that we have the table, we can go through it row by row, extract the postal code, borough, and neighborhood from each row, and store them in three lists:

In [100]:
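# One list per table column
A = []  # postal codes
B = []  # boroughs
C = []  # neighborhoods
for row in table.find_all('tr'):
    cells = row.find_all('td')
    # Rows without exactly three <td> cells (such as a header row of <th> cells) are skipped
    if len(cells) == 3:
        A.append(cells[0].find(string=True).strip())
        B.append(cells[1].find(string=True).strip())
        C.append(cells[2].find(string=True).strip())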
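To see why the three-cell check skips the header, the same loop can be run on a small hand-written table. This is a sketch under assumptions: the sample markup is illustrative, not copied from Wikipedia, and it uses <code>get_text(strip=True)</code>, which returns an empty string for an empty cell, instead of taking the first text node.

```python
from bs4 import BeautifulSoup

# Hand-written sample: one header row (<th>) and two data rows (<td>).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

rows = []
for row in sample_table.find_all("tr"):
    cells = row.find_all("td")
    # The header row contains only <th> cells, so it yields zero <td> cells.
    if len(cells) == 3:
        rows.append(tuple(cell.get_text(strip=True) for cell in cells))
```

Collecting each row as a tuple keeps the three values of a row together, which can be convenient when building a table from them later.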

Putting the lists into a DataFrame and giving the columns appropriate names:

In [129]:
df=pd.DataFrame(A,columns=['Postal code'])
df['Borough']=B
df['Neighborhood']=C
df.head(10)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


<div id="cleaning_data">
    <h2>Cleaning data</h2>
</div>

We only process rows that have an assigned borough, so we drop every row whose Borough value is "Not assigned".

In [138]:
# Index labels of the rows whose borough is not assigned
indexNames = df[df['Borough'] == 'Not assigned'].index
# Delete these rows from the DataFrame and renumber the index from 0
df.drop(indexNames, inplace=True)
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


In [139]:
df.shape

(103, 3)
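The cleaning step can also be written as a single boolean-mask filter that returns a new DataFrame instead of modifying the original in place. The sketch below is an alternative formulation, not the notebook's method, and runs on a three-row sample built from values shown in the output above. The names <code>sample_df</code> and <code>assigned</code> are illustrative.

```python
import pandas as pd

# Small sample using rows from the scraped table.
sample_df = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["", "Parkwoods", "Victoria Village"],
})

# Keep rows whose borough is assigned, then renumber the index from 0.
mask = sample_df["Borough"] != "Not assigned"
assigned = sample_df[mask].reset_index(drop=True)
```

Because the filter produces a new object, <code>sample_df</code> keeps all of its rows, which makes it easy to rerun the cell or compare the data before and after cleaning.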