**Table Scraping from Wikipedia using BeautifulSoup**

Create an algorithm that reads a Wikipedia page and recognizes the table on it.

Using soup.find_all with the table row (tr) and table cell (td) tags, read the contents of the table.

In [66]:
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url,'lxml')



In [67]:
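# Tables matching the target class (kept for reference; the loop below scans every row on the page)
my_table = soup.find_all('table', class_ = 'wikitable sortable')

# One list per column: postal code, borough, neighbourhood
A=[]
B=[]
C=[]

# Keep only rows with exactly three <td> cells; other rows (such as the header row) are skipped
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    if len(cells)==3:
        # find(text=True) returns the first text node inside the cell
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))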
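
The row-and-cell scan can also be limited to the matched table rather than the whole page. The following is a minimal sketch under assumptions: `SAMPLE_HTML` is a hypothetical miniature of the page (not the live Wikipedia markup), `parse_postal_table` is an illustrative helper name, and the built-in `html.parser` is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page, used instead of a live request.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
<table class="navbox">
  <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

def parse_postal_table(html):
    """Return (postal_code, borough, neighbourhood) tuples from the first wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any one of an element's classes, so "wikitable sortable" is found.
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) == 3:  # the header row has <th> cells only, so it is skipped
            # get_text(strip=True) also drops the trailing newline in each cell
            rows.append(tuple(cell.get_text(strip=True) for cell in cells))
    return rows

rows = parse_postal_table(SAMPLE_HTML)
```

Because the search starts from the table element, three-cell rows in unrelated tables (the hypothetical `navbox` here) are never collected.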

Convert the data read from the table into a DataFrame with the column names PostalCode, Borough and Neighbourhood.

In [68]:

import pandas as pd
import numpy as np
df = pd.DataFrame(A,columns = ['PostalCode'])
df['Borough']=B
df['Neighbourhood']=C
print('Dataframe shape ',df.shape)
df.head(10)

Dataframe shape  (288, 3)


,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
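
As an alternative sketch, the three lists can be passed to the DataFrame constructor in a single dictionary, which names every column at once. The short lists below are hypothetical stand-ins for A, B and C from the scraping step.

```python
import pandas as pd

# Hypothetical sample lists standing in for A, B and C.
A = ["M1A", "M3A", "M4A"]
B = ["Not assigned", "North York", "North York"]
C = ["Not assigned", "Parkwoods", "Victoria Village"]

# Dictionary keys become the column names, in insertion order.
df = pd.DataFrame({"PostalCode": A, "Borough": B, "Neighbourhood": C})
print("Dataframe shape", df.shape)
```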


Filter out the rows where the Borough column equals 'Not assigned'.

Reset the index so that the numbering restarts at 0.

In [69]:

df_filter = df.drop(df[df.Borough == 'Not assigned'].index)
df_reset = df_filter.reset_index()
df_reset.head(10)

,index,PostalCode,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,Harbourfront
3,5,M5A,Downtown Toronto,Regent Park
4,6,M6A,North York,Lawrence Heights
5,7,M6A,North York,Lawrence Manor
6,8,M7A,Queen's Park,Not assigned
7,10,M9A,Etobicoke,Islington Avenue
8,11,M1B,Scarborough,Rouge
9,12,M1B,Scarborough,Malvern


Drop the old index column

In [70]:
df_reindex = df_reset.drop(columns=['index'])
df_reindex.head()

,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
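
The filter, reset and drop steps can also be combined. A minimal sketch, using a small hypothetical sample frame: a boolean mask keeps the assigned boroughs, and `reset_index(drop=True)` discards the old index instead of storing it in an `index` column.

```python
import pandas as pd

# Hypothetical sample rows in the same layout as the scraped table.
df = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows with an assigned borough and renumber from 0 in one step.
df_reindex = df[df["Borough"] != "Not assigned"].reset_index(drop=True)
```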


Clean up: group the rows by PostalCode and join the Neighbourhood values of each postal code into one comma-separated string.

In [72]:
df_clean = df_reindex.groupby('PostalCode', as_index=False).agg(lambda x: ', '.join(set(x.dropna())))
df_clean.head()

,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Port Union, Highland Creek, Rouge Hill"
2,M1E,Scarborough,"Guildwood\n, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae\n
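
Joining through `set` removes duplicates but does not guarantee the order of the joined names. The sketch below is one possible variant that keeps first-seen order by deduplicating with `dict.fromkeys`; `join_unique` and the sample frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def join_unique(values):
    """Join distinct non-null values with ', ' in first-seen order."""
    return ", ".join(dict.fromkeys(values.dropna()))

# Hypothetical sample rows: two neighbourhoods share postal code M1B.
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# The function is applied to each remaining column within every group.
grouped = sample.groupby("PostalCode", as_index=False).agg(join_unique)
```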


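Try to remove stray characters such as "]]" and "\n". Note that the pattern used here only matches "]]" immediately followed by "\n", so newlines embedded in the middle of a value are left in place; only leading and trailing whitespace is removed, by str.strip().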

In [73]:
df_clean.Neighbourhood = df_clean.Neighbourhood.str.replace(r"]]\n", "").str.strip()
df_clean.head(10)

,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Port Union, Highland Creek, Rouge Hill"
2,M1E,Scarborough,"Guildwood\n, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Oakridge, Clairlea"
8,M1M,Scarborough,"Cliffside, Scarborough Village West\n, Cliffcrest"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


Remove the remaining "\n" characters.

In [75]:
df_clean.Neighbourhood = df_clean.Neighbourhood.str.replace('\n', '').str.strip()
df_clean.head(10)

,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Port Union, Highland Creek, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Oakridge, Clairlea"
8,M1M,Scarborough,"Cliffside, Scarborough Village West, Cliffcrest"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
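
Both kinds of stray characters can be removed in a single pass with a regular-expression alternation. A minimal sketch, assuming a pandas version that accepts the `regex` keyword of `Series.str.replace`; the sample values are hypothetical, and "Woburn]]" in particular is invented to exercise the "]]" branch.

```python
import pandas as pd

# Hypothetical sample values containing embedded newlines and stray brackets.
s = pd.Series(["Guildwood\n, Morningside, West Hill", "Cedarbrae\n", "Woburn]]"])

# \]\] matches a literal "]]"; | makes it an alternative to the newline \n.
cleaned = s.str.replace(r"\]\]|\n", "", regex=True).str.strip()
```

Passing `regex=True` explicitly avoids depending on the default, which differs between pandas versions.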


Find the rows where the Neighbourhood column is "Not assigned".

In [77]:
df_clean.loc[df_clean['Neighbourhood'] == 'Not assigned']

,PostalCode,Borough,Neighbourhood
85,M7A,Queen's Park,Not assigned


Replace each 'Not assigned' value in the Neighbourhood column with the value from the Borough column of the same row.

In [78]:

df_clean['Neighbourhood'] = np.where(df_clean['Neighbourhood'] == 'Not assigned', df_clean['Borough'], df_clean['Neighbourhood'])

In [79]:
df_clean.loc[df_clean['PostalCode'] == 'M7A']

,PostalCode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park
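
The same replacement can be sketched with `Series.mask`, which stays within pandas and keeps the result as a Series aligned on the index. The sample frame below is hypothetical.

```python
import pandas as pd

# Hypothetical sample rows: M7A has a borough but no neighbourhood.
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M1B"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Rouge, Malvern"],
})

# Where the condition is True, take the value from Borough instead.
unassigned = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(unassigned, sample["Borough"])
```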


In [80]:

df_clean.shape

(103, 3)