## Python web scraping of a Wikipedia page into a pandas DataFrame

#### This Wikipedia page lists the postal codes of Canadian regions

###### Importing the necessary libraries and installing BeautifulSoup for web scraping

In [2]:
import numpy as np
import pandas as pd

!conda install -c anaconda beautifulsoup4 --yes
import requests
from bs4 import BeautifulSoup

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    numpy-base-1.15.4          |   py36h81de0dd_0         4.2 MB  anaconda
    numpy-1.15.4               |   py36h1d66e8a_0          35 KB  anaconda
    beautifulsoup4-4.8.1       |           py36_0         153 KB  anaconda
    openssl-1.1.1              |       h7b6447c_0         5.0 MB  anaconda
    soupsieve-1.9.5            |           py36_0          61 KB  anaconda
    mkl_fft-1.0.6              |   py36h7dd41cf_0         150 KB  anaconda
    certifi-2019.11.28         |           py36_0         156 KB  anaconda
    blas-1.0                   |           

#### Scraping the Wikipedia page with the HTML parser available in BeautifulSoup and collecting the scraped rows for a DataFrame

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
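# Locate the postal code table; find() returns None if the page layout has changed.
postal_code = soup.find('table', {'class': 'wikitable sortable'})
if postal_code is None:
    raise ValueError("Postal code table not found on the page")
pcodes = []
for row in postal_code.find_all('tr'):
    cols = row.find_all('td')
    # Header rows hold <th> cells, so only data rows yield three <td> cells.
    if len(cols) == 3:
        pcodes.append(tuple(col.text.strip() for col in cols))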
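
The row filter in this cell can be checked offline on a small hand-written table. The sketch below is illustrative only: `sample_html` is a made-up string, not the live Wikipedia markup, which may change over time. It suggests why the header row is skipped: it contains `th` cells, so `find_all('td')` returns an empty list for it.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table; not copied from Wikipedia.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

sample_rows = []
for row in sample_table.find_all("tr"):
    cols = row.find_all("td")
    # Only rows with exactly three data cells are kept.
    if len(cols) == 3:
        # strip() removes the trailing newline inside the last cell.
        sample_rows.append(tuple(col.text.strip() for col in cols))

print(sample_rows)
```

Working on a fixed string like this makes it possible to test the parsing logic without a network request.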

In [4]:
pcodes_array = np.asarray(pcodes)
len(pcodes_array)

287

In [5]:
df = pd.DataFrame(pcodes_array)
df.columns = ['Postcode','Borough','Neighborhood']
df.head(10)

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
| 7 | M7A | Queen's Park | Not assigned |
| 8 | M8A | Not assigned | Not assigned |
| 9 | M9A | Downtown Toronto | Queen's Park |


#### Preprocessing the DataFrame: replacing 'Not assigned' with NaN, dropping the rows that have no Borough, and using the Borough as the Neighborhood where the Neighborhood is missing

In [6]:
df = df.replace(to_replace='Not assigned', value=np.nan)
df = df.loc[df.Borough.notna(),:]
# Where only the Neighborhood is missing, reuse the Borough name.
df = df.assign(Neighborhood=df['Neighborhood'].fillna(df['Borough']))
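
The three preprocessing steps can be followed on a toy frame. This is a minimal sketch, assuming three rows taken from the `df.head(10)` output shown earlier; the name `toy` is hypothetical, and `fillna` is used here as one way to copy the Borough into a missing Neighborhood.

```python
import numpy as np
import pandas as pd

# Toy rows mirroring a few lines of the scraped table.
toy = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M3A", "North York", "Parkwoods"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["Postcode", "Borough", "Neighborhood"],
)

# Step 1: mark unassigned cells as missing.
toy = toy.replace("Not assigned", np.nan)
# Step 2: drop rows that have no Borough at all.
toy = toy[toy["Borough"].notna()]
# Step 3: fall back to the Borough where the Neighborhood is missing.
toy = toy.assign(Neighborhood=toy["Neighborhood"].fillna(toy["Borough"]))

print(toy)
```

In this sketch the `M1A` row is dropped, while `M7A` is kept with `Queen's Park` as its Neighborhood.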

#### Merging the neighborhoods that share the same postal code

In [11]:
# Keep the first Borough per postal code and join its neighborhoods with commas.
df = df.groupby('Postcode', as_index=False).agg({'Borough': 'first', 'Neighborhood': ', '.join})
df.reset_index(inplace=True, drop=True)
df.head(5)

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
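
The merge step can be tried in isolation on a few rows. The sketch below uses `groupby(...).agg(...)`, which is one way to express the step; the rows are taken from the `df.head(10)` output earlier in the notebook, and the names `toy_rows` and `merged` are hypothetical.

```python
import pandas as pd

# Two rows share the postal code M6A and should collapse into one.
toy_rows = pd.DataFrame(
    [
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M6A", "North York", "Lawrence Heights"),
        ("M6A", "North York", "Lawrence Manor"),
    ],
    columns=["Postcode", "Borough", "Neighborhood"],
)

# Per postal code: keep the first Borough, join the Neighborhood names.
merged = toy_rows.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)

print(merged)
```

Passing `as_index=False` keeps `Postcode` as an ordinary column, so the result has the same three columns as the input.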


In [12]:
df.tail(5)

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, Martin Grove Gardens, Richv... |
| 101 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |
| 102 | M9W | Etobicoke | Northwest |


In [13]:
df.shape

(103, 3)

In [None]:
df.to_csv('Processed_data1.csv', index=False)
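
Because the merged Neighborhood values contain commas, it may be worth confirming that the CSV round-trips cleanly. The following is a minimal sketch, assuming an in-memory buffer in place of `Processed_data1.csv` and a single row copied from the `df.head(5)` output; `toy_out`, `buffer` and `restored` are hypothetical names.

```python
import io

import pandas as pd

# One merged row whose Neighborhood value contains a comma.
toy_out = pd.DataFrame(
    {
        "Postcode": ["M1B"],
        "Borough": ["Scarborough"],
        "Neighborhood": ["Rouge, Malvern"],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_out.to_csv(buffer, index=False)

# Read the same text back and compare with the original value.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

`to_csv` quotes fields that contain the delimiter, so the comma-separated neighborhood list is read back as a single column value.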