# Segmenting and Clustering Neighbourhoods in Toronto
This notebook is for the Applied Data Science Capstone course on Coursera.

**Step 1.**
We will read the Wikipedia page to get the table of Toronto neighbourhoods, using the BeautifulSoup package to parse the HTML.

In [1]:
# Read the webpage
from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
toronto_postal = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = http.request('GET', toronto_postal)
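# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.data, 'html.parser')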

**Step 2.**
We will import pandas for this exercise.

In [2]:
# Import packages
import pandas as pd

**Step 3.**
We convert the HTML table to a pandas dataframe.

In [3]:
# Create the DataFrame and delete rows with NaN value(s)
My_table = soup.find('table',{'class':'wikitable sortable'})

# Count the rows that belong directly to this table (the header row is included)
number_of_rows = len(My_table.find_all(lambda tag: tag.name == 'tr' and tag.findParent('table') == My_table))

# Pre-allocate an empty frame; unfilled rows stay NaN and are dropped below
new_table = pd.DataFrame(columns=['PostalCode','Borough','Neighbourhood'], index = range(0,number_of_rows))

row_marker = 0
for row in My_table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker,column_marker] = column.get_text()
        column_marker += 1
    if len(columns) > 0:
        row_marker += 1
new_table['Neighbourhood'] = new_table['Neighbourhood'].astype(str).str.replace('\n', '')
new_table.dropna(inplace=True)
new_table.shape

(288, 3)
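
As a self-contained illustration of the cell-by-cell conversion above, the sketch below parses a small inline HTML table instead of the live Wikipedia page. The HTML snippet and the helper name `table_to_dataframe` are made up for this example; only the column names mirror the ones used in this notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample standing in for the Wikipedia table
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
<tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront
</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Collect the <td> text of every data row into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    rows = []
    for tr in table.find_all("tr"):
        # Strip surrounding whitespace, including the trailing newline
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighbourhood"])

sample_table = table_to_dataframe(SAMPLE_HTML)
print(sample_table.shape)
```

Collecting the rows in a list first avoids pre-allocating the dataframe and dropping the unused rows afterwards.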

**Step 4.**
We will remove the rows whose Borough is "Not assigned".

In [4]:
# We will remove rows with a borough that is Not assigned.

new_table.drop(new_table[new_table['Borough'] == 'Not assigned'].index, inplace=True)
new_table.shape

(211, 3)
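
The same filter can also be written as a boolean mask that keeps the wanted rows rather than dropping the unwanted ones. A minimal sketch on a toy dataframe (the rows and the names `toy` and `assigned` are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront"],
})

# Keep only the rows whose Borough is assigned, then renumber the index
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned.shape)
```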

**Step 5.**
If a row has a borough but its neighbourhood is "Not assigned", the neighbourhood is set to the same value as the borough.

In [5]:
new_table.loc[new_table['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = new_table.loc[new_table['Neighbourhood'] == 'Not assigned', 'Borough']

new_table.shape

(211, 3)
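
To see the substitution in isolation, here is a small sketch on invented rows. It uses `Series.mask`, which is one possible alternative to the `.loc` assignment above, not the method this notebook relies on.

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Where the neighbourhood is unassigned, substitute the borough from the same row
unassigned = toy["Neighbourhood"] == "Not assigned"
toy["Neighbourhood"] = toy["Neighbourhood"].mask(unassigned, toy["Borough"])
print(toy)
```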

**Step 6.**
We will re-arrange the dataframe so that each postal code appears in exactly one row, with its neighbourhoods joined into one comma-separated string. We are assuming there is only one borough value per postal code.

In [6]:
# Re-arrange the dataframe such that we have a unique Postal Code.

final_table = new_table.groupby(by=['PostalCode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()
final_table.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
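
The grouping relies on the assumption that each postal code has a single borough. The sketch below shows one way to check that assumption before joining the neighbourhoods; the toy rows and the names `toy`, `boroughs_per_code`, and `grouped` are introduced only for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Check the assumption: each postal code maps to exactly one borough
boroughs_per_code = toy.groupby("PostalCode")["Borough"].nunique()
assert (boroughs_per_code == 1).all()

# Join the neighbourhoods of each postal code into one comma-separated string
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

If the check failed, grouping on both columns would produce more than one row for the affected postal code.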


**Step 7.**
We check the shape of the final dataframe.

In [7]:
final_table.shape

(103, 3)

**Step 8.**
We will import the coordinate data for each postal code. The geocoder approach did not return results even after running for more than 7 hours, so we read the coordinates directly from a CSV file instead.

In [8]:
# The following code did not work!
# import geocoder
# for num_postcode in range(0,final_table.shape[0]):
#    # initialize your variable to None
#    lat_lng_coords = None
#
#    # loop until you get the coordinates
#    while(lat_lng_coords is None):
#        postal_code = final_table.loc[num_postcode, 'PostalCode']
#        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#        lat_lng_coords = g.latlng
#
#    final_table.loc[num_postcode, 'Latitude'] = lat_lng_coords[0]
#    final_table.loc[num_postcode, 'Longitude'] = lat_lng_coords[1]

df_Geospatial_Coordinate = pd.read_csv('Geospatial_Coordinates.csv')
df_Geospatial_Coordinate.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
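
The commented-out loop above never exits if the geocoder keeps returning `None`, which may explain the multi-hour run. One way to avoid that is to cap the number of attempts and give up gracefully. The sketch below is only an illustration: `lookup` stands for any function that returns a `(lat, lng)` pair or `None`, and `geocode_with_retries`, `fake_lookup`, `max_attempts`, and `delay_seconds` are hypothetical names introduced here, not part of the geocoder library.

```python
import time

def geocode_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before trying again
    return None  # the caller decides how to handle a missing result

# Hypothetical stand-in for a geocoding service that fails twice, then succeeds
calls = {"count": 0}

def fake_lookup(query):
    calls["count"] += 1
    return (43.806686, -79.194353) if calls["count"] >= 3 else None

result = geocode_with_retries(fake_lookup, "M1B, Toronto, Ontario")
always_none = geocode_with_retries(lambda q: None, "M1B, Toronto, Ontario", max_attempts=3)
print(result, always_none)
```

With a bounded loop, postal codes that cannot be resolved come back as `None` and can be filled from the CSV file afterwards.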


**Step 9.**
Rename the column "Postal Code" to "PostalCode" and merge the two dataframes on that column.

In [9]:
df_Geospatial_Coordinate.rename(columns={'Postal Code':'PostalCode'}, inplace = True)

merged_table = pd.merge(final_table, df_Geospatial_Coordinate, on='PostalCode')
merged_table.head()

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
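
`pd.merge` performs an inner join by default, so a postal code missing from either dataframe would be dropped silently. A possible safeguard, sketched on toy data, is a left merge with `indicator=True` to list unmatched codes and `validate` to confirm that keys are unique on both sides. The postal code `M9Z` and the names `left`, `right`, `checked`, and `missing` are invented for this example.

```python
import pandas as pd

left = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M9Z"],
                     "Borough": ["Scarborough", "Scarborough", "Etobicoke"]})
right = pd.DataFrame({"PostalCode": ["M1B", "M1C"],
                      "Latitude": [43.806686, 43.784535],
                      "Longitude": [-79.194353, -79.160497]})

# A left merge keeps every postal code; the _merge column shows which ones matched
checked = pd.merge(left, right, on="PostalCode", how="left",
                   validate="one_to_one", indicator=True)

# Postal codes that have no coordinates in the right-hand frame
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```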
