## Part 1: Scraping and cleaning the Toronto neighborhood data from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

**Data scraping** - Using the BeautifulSoup package to scrape data from the website

In [27]:
# import modules needed
import requests
from bs4 import BeautifulSoup as BS
import pandas as pd
import numpy as np

Use requests.get().text to fetch the HTML of the page, then parse it with BeautifulSoup; calling prettify() on the parsed result shows the tag structure. This follows the tutorial at "https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722".

In [28]:
datawiki_url = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BS(datawiki_url,'lxml')
#print(soup.prettify())

In [29]:
neighbor_table = soup.find('table',{'class':'wikitable sortable'}) #extract the section reading the table 
type(neighbor_table)

bs4.element.Tag
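
To see what the table lookup returns without hitting the network, here is a minimal sketch on a hand-written HTML fragment. The fragment is made up for illustration and is not the real Wikipedia markup, and it uses Python's built-in "html.parser" instead of lxml. It also reads cells with get_text(), which is a swapped-in alternative to the link-or-text handling used in this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; not the real Wikipedia markup.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td>
      <td><a href="/wiki/Parkwoods">Parkwoods</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml required
table = soup.find("table", {"class": "wikitable sortable"})

# get_text() returns the visible text whether or not the cell wraps it in a link.
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)
```

Note that get_text() returns all the text in a cell, whereas taking the first link's contents returns only that link's text, so the two approaches can differ on cells that mix links and plain text.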

Extract the table content into a nested list 

In [30]:
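zips = neighbor_table.find_all('td')  # every table cell, in row-major order
data_list = []
for row in zips:
    try:
        # linked cell: take the text of the first hyperlink
        data_list.append(row.find('a').contents[0])
    except (AttributeError, IndexError):
        # no usable hyperlink in this cell, so fall back to its plain text
        data_list.append(row.contents[0].strip())
# split the flat list into three column lists by striding in steps of 3
data_list_nest = [[data_list[i] for i in range(0,len(data_list),3)],
                  [data_list[i] for i in range(1,len(data_list),3)],
                  [data_list[i] for i in range(2,len(data_list),3)]]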

del data_list # delete the original data list to release memory
#data_list_nest
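
The reshaping step can also be viewed as cutting the flat cell list into rows of three. The sketch below uses only the standard library and a made-up list standing in for data_list; the names flat, n_cols, rows and cols are hypothetical.

```python
# Made-up flat list standing in for data_list: three cells per table row.
flat = ["M3A", "North York", "Parkwoods",
        "M4A", "North York", "Victoria Village"]

n_cols = 3
# Slice the flat list into consecutive chunks of n_cols items, one chunk per row.
rows = [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]

# zip(*rows) turns rows into columns, the same layout that striding
# with range(k, len(flat), 3) produces.
cols = [list(c) for c in zip(*rows)]

print(rows)
print(cols)
```

A row-oriented list like rows can be passed to pd.DataFrame together with the column names directly, which would make the transpose step unnecessary.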

Convert the nested list into a dataframe 

In [31]:
columns = ['Postcode','Borough','Neighborhood']
df_neighbor = pd.DataFrame(np.transpose(data_list_nest), columns = columns)
df_neighbor.head(20)

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
| 7 | M7A | Downtown Toronto | Queen's Park |
| 8 | M8A | Not assigned | Not assigned |
| 9 | M9A | Etobicoke | Islington Avenue |


**Clean the dataframe to meet the following criteria**
- Ignore rows with a borough that is "Not assigned".
- More than one neighborhood can exist in one postal code area; neighborhoods that share a postcode are combined into one row, separated by commas.
- If a cell has a borough but its neighborhood is "Not assigned", the neighborhood is set to the borough name.
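
The three criteria can be sketched on a tiny made-up frame as follows. The rows are invented for illustration (including a borough with no neighborhood assigned, a case the check further down shows does not occur in the scraped data), and the names df, clean, missing and combined are hypothetical. This is one possible way to express the rules, not necessarily the exact steps taken in the cells below.

```python
import pandas as pd

# Made-up rows for illustration; the last row has a borough but no neighborhood.
df = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M6A", "North York", "Lawrence Heights"],
        ["M6A", "North York", "Lawrence Manor"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighborhood"],
)

# Criterion 1: drop rows whose borough is "Not assigned".
# copy() makes an independent frame so later assignments are unambiguous.
clean = df[df["Borough"] != "Not assigned"].copy()

# Criterion 3: where only the neighborhood is missing, fall back to the borough.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# Criterion 2: combine neighborhoods that share a postcode into one row.
combined = (
    clean.groupby(["Postcode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)
print(combined)
```

Applying the borough fallback before grouping keeps a literal "Not assigned" string from being joined into a combined neighborhood list.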

In [32]:
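# .copy() gives an independent frame, so the assignments below do not raise SettingWithCopyWarning
df_neighbor = df_neighbor[df_neighbor["Borough"] != "Not assigned"].copy()
df_neighbor['Borough'] = df_neighbor['Borough'].str.replace(', Toronto', '', regex=False)
# regex=False matches the parentheses literally, regardless of the pandas version's default
df_neighbor['Neighborhood'] = df_neighbor['Neighborhood'].str.replace(', Toronto', '', regex=False).str.replace(' (Toronto)', '', regex=False)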

print(df_neighbor["Borough"].unique())
print("Not assigned" in df_neighbor["Neighborhood"].unique())

['North York' 'Downtown Toronto' 'Etobicoke' 'Scarborough' 'East York'
 'York' 'East Toronto' 'West Toronto' 'Central Toronto' 'Mississauga']
False


In [33]:
df_neighbor_group = df_neighbor.groupby(['Postcode','Borough'])['Neighborhood'].apply(lambda x: ', '.join(x))
df_neighbor_group = pd.DataFrame(df_neighbor_group).reset_index()

**In the latest Wikipedia table, 'M5A' lists only one neighborhood (Harbourfront), although its hyperlink points to 'Regent Park'. The page may have been edited recently, so only the values shown in the table are treated as correct.**
- Use 'M4B' to check that neighborhoods sharing a postcode were combined

In [34]:
df_neighbor_group[df_neighbor_group['Postcode'] == 'M4B']

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 35 | M4B | East York | Woodbine Gardens, Parkview Hill |


Show the first 20 rows after cleaning

In [35]:
df_neighbor_group.head(20)

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [36]:
print(f'The number of rows after cleaning is {df_neighbor_group.shape[0]}')

The number of rows after cleaning is 103
