# PART 1: Scrape website to acquire Toronto, Canada neighborhood data

## The first step is to import pandas and install lxml. The latter is the HTML parser that pandas read_html uses by default, so it is needed to scrape the website.

In [1]:
# import required libraries and install lxml

import pandas as pd
!pip install lxml
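
pandas read_html needs an HTML parser backend and tries lxml first. Below is a minimal sketch, using only the standard library, of checking whether lxml is already importable before installing it. The variable name `has_lxml` is illustrative and not part of the original notebook.

```python
import importlib.util

# find_spec returns None when the package cannot be found in the current environment
has_lxml = importlib.util.find_spec("lxml") is not None

if has_lxml:
    print("lxml is available")
else:
    print("lxml is missing; install it with: pip install lxml")
```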



## I used pandas read_html to scrape the table from the Wikipedia page. For the purposes of this task, reading the table directly from the page was adequate.

In [2]:
# Use read_html to parse the tables on the page into a list of dataframes

toronto_df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", header=0)

df = toronto_df[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


## As it turns out, there are 3 tables on the Wikipedia page, so read_html returned a list of 3 dataframes.
## Therefore, toronto_df[0] selects the first table, which contains the postal code, borough and neighborhood data needed.
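
Because read_html returns one dataframe per table, selecting by position depends on the page layout staying the same. The sketch below shows one possible alternative: the `match` parameter of read_html keeps only tables whose text matches a given string. The HTML here is a small hypothetical stand-in, not the real Wikipedia markup, and the names `all_tables`, `matched` and `postal_df` are illustrative.

```python
import io
import pandas as pd

# Hypothetical stand-in page with two tables (not the real Wikipedia markup)
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
<table>
  <tr><th>Letter</th><th>Region</th></tr>
  <tr><td>M</td><td>Toronto</td></tr>
</table>
"""

# Without match, every table on the page comes back as its own dataframe
all_tables = pd.read_html(io.StringIO(html), header=0)
print(len(all_tables))

# With match, only tables containing the text "Postal Code" are returned
matched = pd.read_html(io.StringIO(html), match="Postal Code", header=0)
postal_df = matched[0]
print(postal_df.columns.tolist())
```

Selecting by content rather than by index may be more robust if tables are added to or reordered on the page, although it still assumes the header text itself does not change.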

## The next step is to clean the data. The most recent version of the Wikipedia page has no duplicate postal codes and lists all of the neighborhoods for each postal code in a single row, separated by commas. Therefore, the only cleaning required was to remove all rows that had 'Not assigned' under the column 'Borough'.

In [3]:
# Remove all rows that do not have an assigned borough

df_new = df[df.Borough != 'Not assigned']
df_new = df_new.reset_index(drop=True)
df_new

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
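
The cleaning step relies on two assumptions stated above: postal codes are not duplicated, and dropping unassigned boroughs is the only cleaning needed. The sketch below shows one way to check both on a small hand-built dataframe. The sample rows are copied from the output above, and the names `sample`, `assigned`, `codes_unique` and `leftover` are illustrative.

```python
import pandas as pd

# Tiny hand-built sample mirroring the structure of the scraped table
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront"],
})

# Same filter as in the notebook: keep only rows with an assigned borough
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

# Check that each postal code appears only once after filtering
codes_unique = assigned["Postal Code"].is_unique

# Count rows that have a borough but still lack a neighborhood
leftover = int((assigned["Neighborhood"] == "Not assigned").sum())

print(assigned.shape[0], codes_unique, leftover)
```

Running the same two checks on df_new would confirm whether the assumptions hold for the version of the page that was actually scraped.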


In [4]:
# Use .shape to show the number of rows in the dataframe

print("The number of rows in the data frame is", df_new.shape[0])

The number of rows in the data frame is 103
