# Segmenting and Clustering Neighborhoods in Toronto

## Problem 1

In [4]:
# imports
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

### Data scraping

In the code snippet below I request the webpage from the URL and parse it into a BeautifulSoup object so that I can easily find the table I need. I then read the table into a pandas DataFrame. Since `read_html` returns a list of dataframes, I select the first (and only) dataframe in the list.

In [56]:
# scrape data from website
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)

# extract table from url raw data
bs = BeautifulSoup(response.text, 'html.parser')
table = bs.find('table', {'class': 'wikitable'})

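# convert table to dataframe
# (newer pandas versions deprecate passing a raw HTML string, so wrap it in StringIO)
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]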
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
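
To make the parsing step easier to follow, here is a minimal sketch that builds the dataframe row by row with BeautifulSoup instead of `read_html`. It is not the method used above. The `SAMPLE_HTML` string and the `wikitable_to_df` helper are hypothetical stand-ins I made up for illustration, and the sketch assumes the table is a plain grid whose first row holds the column headers.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; the real page is much larger.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def wikitable_to_df(html):
    """Return the first 'wikitable' in html as a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    if table is None:
        raise ValueError("no table with class 'wikitable' found")
    # Collect the text of every header/data cell, one list per table row.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    # Assumption: the first row is the header, the rest is data.
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


sample_df = wikitable_to_df(SAMPLE_HTML)
print(sample_df)
```

Checking for a missing table up front gives a clear error if the page layout changes, rather than a confusing failure further down.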


### Preprocessing

The code snippet below preprocesses the dataframe. First I remove all rows where no borough is assigned. Then every row with no assigned neighborhood gets the name of its borough as its neighborhood. Since I already removed the rows with no assigned borough, this can never copy 'Not assigned' from the borough column into the neighborhood column. Lastly, I group the dataframe by postal code, joining the neighborhoods with commas and keeping the first borough; the borough should be the same for all rows sharing a postal code anyway.

In [57]:
# preprocessing
df = df[df['Borough'] != 'Not assigned'].copy()
# where no neighbourhood is assigned, fall back to the borough name
no_hood = df['Neighbourhood'] == 'Not assigned'
df.loc[no_hood, 'Neighbourhood'] = df.loc[no_hood, 'Borough']
df = df.groupby('Postal Code', as_index=False).agg({'Borough': 'first','Neighbourhood': ', '.join})
df.head(103)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
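
The three preprocessing steps can be traced on a tiny example. This is only a sketch: the `toy` rows are invented for illustration, and they include a duplicated postal code and a missing neighbourhood so that each step has a visible effect.

```python
import pandas as pd

# Toy rows invented for illustration; they are not the scraped table.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# 1. Drop rows without a borough; .copy() makes the later assignment unambiguous.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# 2. Where the neighbourhood is missing, fall back to the borough name.
missing = toy["Neighbourhood"] == "Not assigned"
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]

# 3. One row per postal code, with the neighbourhoods joined by ", ".
toy = toy.groupby("Postal Code", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)
print(toy)
```

Step 2 has to be an assignment through `.loc`. A comparison written with `==` in that position would run without error but leave the dataframe unchanged.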


In [37]:
df.shape

(103, 3)

## Problem 2

In [39]:
# imports
try: 
    import geocoder as gc
except ModuleNotFoundError:
    !pip install geocoder
    import geocoder as gc

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [59]:
# add the coordinates to the dataframe
pc_df = pd.read_csv('http://cocl.us/Geospatial_data')
full_df = pd.merge(df, pc_df, on='Postal Code')
full_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [60]:
full_df.shape

(103, 5)
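
The shape check matters because `pd.merge` does an inner join by default, so postal codes without coordinates would be dropped silently. A minimal sketch of a more explicit check, assuming nothing beyond pandas itself: the `left` and `coords` frames are toy data, with M1E deliberately left out of `coords` to simulate a missing match.

```python
import pandas as pd

# Toy frames for illustration; M1E is deliberately missing from coords.
left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every postal code from the left frame,
# validate raises if a postal code appears more than once on either side,
# indicator adds a "_merge" column saying where each row was found.
merged = pd.merge(
    left, coords, on="Postal Code",
    how="left", validate="one_to_one", indicator=True,
)

# Postal codes that found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm the same thing as the unchanged row count above, and a non-empty one names the postal codes that need attention.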

## Problem 3