# This notebook completes the Week 3 assignment of the Applied Data Science Capstone

## First, let's start by scraping the Toronto postal codes page on Wikipedia.

##### Step 1 - Fetch the HTML of the Wikipedia page from its URL. For this, we will use the urllib.request library.

In [1]:
import urllib.request

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
url

'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
Torontopage = urllib.request.urlopen(url)
# Torontopage

##### Step 2 - Use the BeautifulSoup library to parse the HTML and extract the required table

In [6]:
!pip install beautifulsoup4
from bs4 import BeautifulSoup

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.1 soupsieve-2.0.1


In [7]:
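# Name the parser explicitly so the result does not depend on which parsers are installed
Torontosoup = BeautifulSoup(Torontopage, 'html.parser')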
# Torontosoup

##### You can look at the HTML in two ways: 1) inspect the web page in the browser, or 2) use BeautifulSoup's prettify function

In [8]:
# Torontosoup.prettify()
# Torontosoup.title
# Torontosoup.title.string
# Torontosoup.table
# all_tables = Torontosoup.find_all("table")
Toronto_table=Torontosoup.find('table', class_='wikitable sortable')

##### Step 3 - Store the table data in lists, then convert the lists into a dataframe

In [9]:
# Collect each column of the table into its own list
A = []  # postal codes
B = []  # boroughs
C = []  # neighborhoods
for row in Toronto_table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 3:  # only body rows have three <td> cells; heading rows use <th>
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
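The row-by-row extraction above can be tried out on a tiny inline table. This is only a sketch under stated assumptions: `sample_html` is a made-up stand-in for the Wikipedia markup, and `rows` is a hypothetical name. It also swaps in `get_text(strip=True)` for `find(text=True)`, which collects all of a cell's text rather than only its first text node and drops the trailing newline at extraction time.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the heading row uses <th>, so it yields no <td> cells
        # get_text(strip=True) removes surrounding whitespace, including '\n'
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Collecting whole rows instead of three parallel lists means the result can be passed straight to `pd.DataFrame(rows, columns=[...])`.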

In [10]:
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(A,columns=['Postal Code'])
df['Borough']=B
df['Neighborhood']=C
df

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A\n | Not assigned\n | Not assigned\n |
| 1 | M2A\n | Not assigned\n | Not assigned\n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront\n |
| ... | ... | ... | ... |
| 175 | M5Z\n | Not assigned\n | Not assigned\n |
| 176 | M6Z\n | Not assigned\n | Not assigned\n |
| 177 | M7Z\n | Not assigned\n | Not assigned\n |
| 178 | M8Z\n | Etobicoke\n | Mimico NW, The Queensway West, South of Bloor,... |


## Now that we have scraped the data, the next logical step is data wrangling and cleaning.

##### Step 1 - Remove the trailing '\n' from every value

In [11]:
df_nremoved = df.replace('\n','', regex=True)
df_nremoved

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |
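As a small illustration of this cleaning step, the sketch below compares the regex replacement used above with a column-wise `str.strip()`. The frame `toy` and its values are invented for the example. The two approaches agree here only because each newline sits at the end of its value; `replace` would also remove a newline in the middle of a cell, while `strip` would not.

```python
import pandas as pd

# Toy frame mimicking the scraped values, each ending in a newline.
toy = pd.DataFrame({
    "Postal Code": ["M3A\n", "M4A\n"],
    "Borough": ["North York\n", "North York\n"],
})

# Approach used above: regex replacement of every '\n' anywhere in a cell.
replaced = toy.replace("\n", "", regex=True)

# Alternative: strip leading and trailing whitespace, column by column.
stripped = toy.apply(lambda col: col.str.strip())

print(replaced)
print(stripped)
```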


##### Step 2 - Remove rows whose borough is <b>Not assigned</b>

In [12]:
# Keep only rows with an assigned borough, then renumber the index from 0
df_boroughremoved = df_nremoved[df_nremoved['Borough'] != "Not assigned"].reset_index(drop=True)
df_boroughremoved
# df_nremoved['Borough'].value_counts()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


##### Step 3 - Group rows that share a postal code, combining their neighbourhoods into a single comma-separated entry. We first check whether any postal code appears more than once.

In [13]:
df_boroughremoved["Postal Code"].value_counts()
# df_boroughremoved["Postal Code"].unique()

M4G    1
M4M    1
M1L    1
M1W    1
M1K    1
      ..
M2L    1
M6H    1
M6N    1
M3L    1
M9A    1
Name: Postal Code, Length: 103, dtype: int64

##### <b>Every postal code appears exactly once, so there are no duplicates to group</b>
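Had the check above turned up repeated postal codes, the grouping could be done along the following lines. This is a minimal sketch on invented data: the frame `toy` contains a deliberately duplicated postal code, which the real table does not, and `grouped` is a hypothetical name.

```python
import pandas as pd

# Hypothetical data with a repeated postal code; the scraped table has none.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group on postal code and borough, then join the neighborhoods of each group.
# sort=False keeps the groups in order of first appearance.
grouped = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(grouped)
```

Grouping on both `Postal Code` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.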

##### Step 4 - Where a Neighborhood entry is <b>Not assigned</b>, replace it with that row's borough

In [14]:
df_test = (df_boroughremoved["Neighborhood"] == "Not assigned")
df_test.value_counts()

False    103
Name: Neighborhood, dtype: int64
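The count above shows that no neighborhood is <b>Not assigned</b>, so no replacement is needed for this data. For reference, the replacement itself could look like the sketch below. The frame `toy` and the name `missing` are made up for illustration and are not part of the scraped table.

```python
import pandas as pd

# Hypothetical rows: the first has no neighborhood assigned.
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Boolean mask of the rows that need fixing.
missing = toy["Neighborhood"] == "Not assigned"

# Copy the borough into the neighborhood column for those rows only.
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]

print(toy)
```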

##### <b>The first-stage data is ready</b>

### Finally, let's print the shape (rows, columns) of our dataframe

In [15]:
print("The shape of our dataframe is", df_boroughremoved.shape)

The shape of our dataframe is (103, 3)


#### Let's save this data to a CSV file for further processing

In [16]:
df_boroughremoved.to_csv('Toronto_Data.csv',index=False)
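To see what `index=False` does, the sketch below writes a toy frame to an in-memory buffer and reads it back. The buffer stands in for `Toronto_Data.csv` purely so the example leaves no file behind; `toy`, `buffer`, and `restored` are hypothetical names.

```python
import io

import pandas as pd

# Toy frame with the same columns as the cleaned data.
toy = pd.DataFrame({
    "Postal Code": ["M3A"],
    "Borough": ["North York"],
    "Neighborhood": ["Parkwoods"],
})

# Write without the index, so no extra unnamed column is stored.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
```

Without `index=False`, the row index would be written as an extra, unlabeled first column, which `read_csv` then loads as `Unnamed: 0`.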