# Analyzing neighborhoods in Toronto

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract its table of postal codes and load it into a pandas dataframe.

The final dataframe has three data columns, Postcode, Borough, and Neighbourhood, plus an `index` column that keeps each row's original position in the scraped table. It is built in 3 steps:
1. Read the Wiki page and parse the HTML using BeautifulSoup
2. Parse through the postcode table and create a dataframe with contents
3. Cleanse the dataframe and retain useful rows

The dimensions of the final dataframe are displayed at the end of the notebook.

## 1. Read the Wiki page and parse the HTML using BeautifulSoup

In [4]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

response = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.text, "html.parser")

# Find the postcode table, which is the first table on the Wiki page
table = soup.table
sample_row = table.find('tr')
sample_row

<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
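
`soup.table` returns whichever table comes first in the page, so the lookup would silently pick the wrong table if another one were ever placed above the postcode list. Below is a minimal sketch of a more explicit lookup. It assumes the postcode table carries the `wikitable` CSS class, which is an assumption about the page's markup and not something verified here. The inline HTML is a made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page: a layout table followed by the data table.
html = """
<table class="infobox"><tr><td>Navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Assumption: the postcode table is marked with the "wikitable" class.
demo_table = demo_soup.find("table", class_="wikitable")
if demo_table is None:
    raise ValueError("no table with class 'wikitable' found")

# Read the header cells of the selected table to confirm it is the right one.
header_cells = [th.get_text(strip=True) for th in demo_table.find("tr").find_all("th")]
print(header_cells)
```

Checking the header cells right after the lookup gives an early, readable failure if the page layout changes.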

## 2. Parse through the postcode table and create a dataframe with contents

In [6]:
# Read column titles and find number of rows
column_names = []
n_rows = 0
for row in table.find_all('tr'):
    # Find heading row <th> tags
    th_tags = row.find_all('th') 
    if len(th_tags) > 0 and len(column_names) == 0:
        for th in th_tags:
            column_names.append(th.get_text())
    # Count rows with data
    td_tags = row.find_all('td')
    if len(td_tags) > 0:
        n_rows += 1

# Create a dataframe and read data into it
df = pd.DataFrame(columns=column_names, index=range(n_rows))
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        df.iat[row_marker,column_marker] = column.get_text()
        column_marker += 1
    if len(columns) > 0:
        row_marker += 1                  

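# Remove newline \n characters from the cell values
# (column names are not affected, so 'Neighbourhood\n' keeps its trailing newline)
df = df.replace('\n','', regex=True)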
df.head()


|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
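
The cell above walks the table twice: once to count rows and once to fill a preallocated dataframe cell by cell. An alternative sketch collects the rows into plain lists in a single pass and builds the dataframe at the end. The inline HTML is a small made-up excerpt shaped like the sample row shown earlier, and `demo_table`, `header`, and `records` are illustrative names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up excerpt that mimics the trailing newlines seen in the real table.
html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").table

header, records = [], []
for tr in demo_table.find_all("tr"):
    th_cells = tr.find_all("th")
    td_cells = tr.find_all("td")
    if th_cells and not header:
        # First header row: strip whitespace so column names carry no newline.
        header = [th.get_text(strip=True) for th in th_cells]
    elif td_cells:
        # Data row: one list of cleaned cell strings per table row.
        records.append([td.get_text(strip=True) for td in td_cells])

demo_df = pd.DataFrame(records, columns=header)
print(demo_df)
```

Because `get_text(strip=True)` cleans both the header and the data cells, no separate newline-removal step is needed in this variant.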


## 3. Cleanse the dataframe and retain useful rows

In [7]:
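# The Neighbourhood column name still carries the trailing newline from its header cell, so strip it
df.columns = df.columns.str.strip()

# Remove rows with a borough that is Not assigned
df = df[df['Borough'] != 'Not assigned'].copy()

# Make Neighbourhood values same as Borough when Not assigned
unassigned = df['Neighbourhood'] == 'Not assigned'
df.loc[unassigned, 'Neighbourhood'] = df.loc[unassigned, 'Borough']

# Combine rows with the same postcode, with the neighbourhoods separated by a comma
df['Neighbourhood'] = df.groupby(['Postcode','Borough'])['Neighbourhood'].transform(lambda x: ', '.join(x))
df = df.drop_duplicates()

# reset_index() keeps the old row labels as an extra 'index' column
df = df.reset_index()

df.head(15)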

|   | index | Postcode | Borough | Neighbourhood |
|---|---|---|---|---|
| 0 | 2 | M3A | North York | Parkwoods |
| 1 | 3 | M4A | North York | Victoria Village |
| 2 | 4 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | 6 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | 8 | M7A | Queen's Park | Queen's Park |
| 5 | 10 | M9A | Etobicoke | Islington Avenue |
| 6 | 11 | M1B | Scarborough | Rouge, Malvern |
| 7 | 14 | M3B | North York | Don Mills North |
| 8 | 15 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | 17 | M5B | Downtown Toronto | Ryerson, Garden District |
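
The two cleaning rules, filling an unassigned neighbourhood from its borough and merging rows that share a postcode, can be seen in isolation on a tiny hand-made frame. The sketch below uses `groupby(...).agg(', '.join)` as an alternative to `transform` followed by `drop_duplicates`. The three rows are illustrative and only modelled on values from the output above.

```python
import pandas as pd

# Illustrative rows modelled on the scraped table, not the full data.
demo = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: an unassigned neighbourhood takes the name of its borough.
mask = demo["Neighbourhood"] == "Not assigned"
demo.loc[mask, "Neighbourhood"] = demo.loc[mask, "Borough"]

# Rule 2: one row per postcode, neighbourhoods joined with a comma.
# sort=False keeps the groups in order of first appearance.
merged = (
    demo.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

Aggregating returns one row per group directly, so there is no duplicate-removal step and no leftover `index` column.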


## Dimensions of the final dataframe

In [8]:
print("(Rows, Columns) - ",df.shape)

(Rows, Columns) -  (103, 4)
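
The shape alone does not show whether the cleaning rules held. As a rough sketch, a small helper could confirm that every postcode appears once and that no unassigned borough survived. The name `check_postcode_frame` is hypothetical and not part of the notebook, and the sample rows are illustrative.

```python
import pandas as pd

def check_postcode_frame(frame):
    """Hypothetical helper: return a list of problems found in a cleaned frame."""
    problems = []
    if frame["Postcode"].duplicated().any():
        problems.append("duplicate postcodes")
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned boroughs")
    return problems

# Illustrative cleaned rows taken from the output shown earlier.
demo = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})
print(demo.shape, check_postcode_frame(demo))
```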
