# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, we are tasked with studying data pertaining to the neighborhoods of Toronto.

## Part 1: Scraping Toronto postal codes from Wikipedia

We start by scraping data on Toronto's neighborhoods into a Pandas dataframe. For this part, we are only interested in the postal code, the borough, and the neighborhood name, and we want the table to contain one row per unique postal code. The data is scraped from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

In [123]:
# Import necessary Python libraries to scrape web data and store it
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests

In [38]:
# Load wiki page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
response.raise_for_status()  # Fail early if the page could not be fetched
source = response.text
soup = BeautifulSoup(source,'lxml')


In [138]:
# Parse Wiki table
neighborhood_dict = {'PostalCode':[],'Borough':[],'Neighborhood':[]}
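table = soup.find('table')
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    # Skip header rows (no <td> cells) and any row without exactly three cells
    if len(tds) != 3:
        continue
    pc, bor, nh = [td.text.strip() for td in tds]
    # Drop postal codes that have no borough assigned
    if bor == 'Not assigned':
        continue
    # Fall back to the borough name when no neighborhood is assigned
    if nh == 'Not assigned':
        nh = bor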
    neighborhood_dict['PostalCode'].append(pc)
    neighborhood_dict['Borough'].append(bor)
    neighborhood_dict['Neighborhood'].append(nh)
df_toronto = pd.DataFrame.from_dict(neighborhood_dict)
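
The two cleaning rules used in the loop above (drop rows whose borough is 'Not assigned', and reuse the borough name when the neighborhood is 'Not assigned') can be checked in isolation. The following is a minimal sketch on hand-written rows; `clean_rows` and `sample_rows` are illustrative names that do not appear in the notebook, and the sample tuples are made up for the check rather than scraped.

```python
def clean_rows(rows):
    """Apply the two cleaning rules to (postal_code, borough, neighborhood) tuples."""
    cleaned = []
    for row in rows:
        # Skip malformed rows instead of failing on tuple unpacking
        if len(row) != 3:
            continue
        pc, bor, nh = (cell.strip() for cell in row)
        # Rule 1: ignore postal codes with no borough assigned
        if bor == 'Not assigned':
            continue
        # Rule 2: an unassigned neighborhood takes the borough's name
        if nh == 'Not assigned':
            nh = bor
        cleaned.append((pc, bor, nh))
    return cleaned


# Hand-written sample rows (assumed for illustration only)
sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M3A', 'North York', 'Parkwoods'),
    ('M7A', "Queen's Park", 'Not assigned'),
]
cleaned_rows = clean_rows(sample_rows)
print(cleaned_rows)
```

Keeping the rules in a small function like this makes it easy to confirm their behavior without fetching the page again.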

In [139]:
# Process table to merge rows with the same PostalCode
df_toronto=df_toronto.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df_toronto.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
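
To see what the `groupby` step does, here is a minimal sketch on a three-row sample. The `sample` and `merged` names are assumptions made for this illustration; the pattern is the same `groupby(...).apply(', '.join)` call used on `df_toronto`.

```python
import pandas as pd

# Small hand-made sample: M1B appears twice, once per neighborhood
sample = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Collapse rows that share a postal code and borough,
# joining their neighborhood names with a comma
merged = (
    sample.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .apply(', '.join)
    .reset_index()
)

# After merging, each postal code should appear exactly once
assert merged['PostalCode'].is_unique
print(merged)
```

The `is_unique` check is a cheap way to confirm that the row count really equals the number of distinct postal codes.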


In [103]:
print(f"The shape of the dataframe is: {df_toronto.shape}")

The shape of the dataframe is: (103, 3)


The dataframe has 103 rows, so there are 103 unique Toronto postal codes with an assigned borough.

## Part 2: Getting Latitude and Longitude
Next, we want the latitude and longitude of each postal code in the table above.

In [126]:
# Attempted lookup with the geocoder package (left commented out)
# import geocoder
# pos={"PostalCode":[],"Latitude":[],"Longitude":[]}
# for code in df_toronto['PostalCode']:
#     lat_lng_coords = None
#     while(lat_lng_coords is None):
#         g = geocoder.google(f'{code}, Toronto, Ontario')
#         lat_lng_coords = g.latlng
#         print(f'{code}, Toronto, Ontario', lat_lng_coords)
# This is unreliable: g.latlng is None almost all of the time, so the loop may never finish.
# Other free geocoders available through geopy are also unreliable.

In [143]:
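# Load the coordinates from a CSV file instead of a geocoding service
df_locations = pd.read_csv('Geospatial_Coordinates.csv')
# Match the column name used in df_toronto so the two tables can be joined
df_locations.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
df_toronto = df_toronto.merge(df_locations, on='PostalCode')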

In [144]:
df_toronto.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
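
`DataFrame.merge` performs an inner join by default, so a postal code that is missing from the coordinates file would be dropped without warning. The following is a minimal sketch of one way to check for that, using a left join with `indicator=True`. The `codes`, `coords`, and `checked` frames are made up for illustration, and `M9Z` is a deliberately unmatched placeholder code.

```python
import pandas as pd

# Hand-made stand-ins for df_toronto and df_locations
codes = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z']})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left join keeps every postal code; the _merge column records
# whether a matching row was found in the coordinates table
checked = codes.merge(coords, on='PostalCode', how='left', indicator=True)

# Postal codes that did not receive coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list would confirm that the join lost no rows.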


## Part 3: Explore and Cluster