## Week 3 - CAPSTONE: IBM Data Science Professional Certificate
# Segmenting and Clustering Neighborhoods in Toronto

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Scrape the Data

Pull the postal code table from the Wikipedia page by requesting the page and parsing its HTML.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)
# print the server response to make sure we get a 200 (OK) status code
print(r) 

<Response [200]>


In [3]:
# parse the page and get a list of all the tables on the page
page = BeautifulSoup(r.text, 'html.parser')
tables = page.find_all('table')
print(f'There are {len(tables)} HTML tables on the page')

There are 5 HTML tables on the page


Since there is more than one table, let's pull the header cells (HTML tag **th**) from each one to find the correct table.

In [4]:
[table.find_all('th') for table in tables]

[[<th>Postcode</th>,
  <th>Borough</th>,
  <th>Neighbourhood
  </th>],
 [],
 [],
 [<th class="navbox-title" style="font-size:110%"><a href="/wiki/Postal_codes_in_Canada" title="Postal codes in Canada">Canadian postal codes</a>
  </th>],
 []]

From the Python list above we can see that the first table on the page (index 0) is the one we want to scrape: it is the only one whose headers are Postcode, Borough and Neighbourhood.
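Picking the table by position works for this page, but it depends on the page layout staying the same. As a minimal sketch, assuming the header texts are exactly the three shown above, the table could instead be selected by its headers. The helper name `find_table_by_headers` and the inline `SAMPLE_HTML` are hypothetical stand-ins for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Tiny made-up stand-in for the Wikipedia page: a navbox table, then the data table.
SAMPLE_HTML = """
<table><tr><th class="navbox-title">Canadian postal codes</th></tr></table>
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
  </th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def find_table_by_headers(soup, expected):
    """Return the first <table> whose <th> texts equal `expected`, else None."""
    for table in soup.find_all('table'):
        headers = [th.get_text(strip=True) for th in table.find_all('th')]
        if headers == expected:
            return table
    return None

sample_page = BeautifulSoup(SAMPLE_HTML, 'html.parser')
wanted = find_table_by_headers(sample_page, ['Postcode', 'Borough', 'Neighbourhood'])
# read the cells of the single data row in the matched table
first_row = [td.get_text(strip=True) for td in wanted.find_all('td')]
print(first_row)
```

With the real page, the same helper would be called on `page` in place of `sample_page`.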

In [5]:
# list to hold one record (dict) per postal code row
zipcodes = []
# collect every row of the table by finding its <tr> tags
zipcode_table = tables[0].find_all('tr')
header = [th.get_text(strip=True) for th in zipcode_table[0].find_all('th')]

# loop over every data row (skipping the header row) and append each record to the zipcodes list
for row in zipcode_table[1:]:
    zipcodes.append({header[i]: cell.get_text(strip=True) for i, cell in enumerate(row.find_all('td'))})

# convert to a pandas dataframe
zip_df = pd.DataFrame(zipcodes)
zip_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## Clean the Dataframe

In [6]:
# replace "Not assigned" with pd.NA so pandas treats those cells as missing
zip_df = zip_df.replace("Not assigned", pd.NA)

In [7]:
# drop all rows without a Borough
zip_df = zip_df.dropna(subset=['Borough'])
# fill any missing Neighbourhoods with the Borough Name
zip_df['Neighbourhood'] = zip_df['Neighbourhood'].fillna(value=zip_df['Borough'])
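The cleaning steps can be checked on a small frame before trusting them on the full table. The sketch below uses made-up toy rows (the name `toy` and its values are illustrative assumptions) and chains the same three operations: replace the placeholder, drop rows with no Borough, then fill a missing Neighbourhood from the Borough.

```python
import pandas as pd

# Made-up toy rows covering the three cases: nothing assigned,
# fully assigned, and a Borough without a Neighbourhood.
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

cleaned = (
    toy.replace('Not assigned', pd.NA)   # mark the placeholder as missing
       .dropna(subset=['Borough'])       # drop rows that have no Borough
       .copy()                           # work on an independent copy
)
# fillna aligns on the index, so each row takes its own Borough value
cleaned['Neighbourhood'] = cleaned['Neighbourhood'].fillna(cleaned['Borough'])
print(cleaned)
```

The explicit `.copy()` is a defensive choice here so that the column assignment clearly targets a new frame rather than a view of `toy`.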

In [8]:
# For each postal code and borough group, join all the unique Neighbourhoods into one comma-separated string
zip_df = zip_df.groupby(['Postcode','Borough']).apply(lambda x: ", ".join(x['Neighbourhood'].unique())).reset_index()
zip_df.head()

Unnamed: 0,Postcode,Borough,0
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
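Note that the combined column comes out named `0`, because `apply` returns an unnamed Series. One possible alternative, sketched here on a toy frame built from a few of the rows above, is to select the `Neighbourhood` column before aggregating so the result keeps that name. This is a variation for illustration, not the method used in the cell above.

```python
import pandas as pd

# Toy rows taken from the output above, one Neighbourhood per row.
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(lambda s: ', '.join(s.unique()))  # one string per group
       .reset_index()                         # group keys back to columns
)
print(grouped.columns.tolist())
print(grouped)
```

Selecting the column first means the aggregated Series carries the name `Neighbourhood`, so no rename is needed after `reset_index()`.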


## Shape of the Dataframe

In [10]:
zip_df.shape

(103, 3)