# Segmenting and Clustering Neighborhoods in Toronto

## Outline of this project
+ Scrape the Wikipedia page, extract the table content, and convert it into a DataFrame
+ Process and clean the DataFrame
+ Cluster the data using K-means and plot the clusters on a map with folium

## Import the required libraries

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as bs4
import folium
from sklearn.cluster import KMeans

## Create the DataFrame
+ The data frame will consist of three columns: PostalCode, Borough, and Neighborhood
+ Only process the cells that have an assigned borough. Ignore cells whose borough is Not assigned.
+ More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.
+ If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
+ Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
+ In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
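
The rules above can be sketched on a small in-memory example. The rows below are hypothetical toy data, not the real Wikipedia table, and the variable names `raw`, `assigned`, and `combined` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not the real Wikipedia data) that exercise every rule
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Rule 1: keep only rows with an assigned borough
assigned = raw[raw['Borough'] != 'Not assigned']

# Rule 2: one row per postal code, neighborhoods joined with a comma
combined = (
    assigned.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

# Rule 3: an unassigned neighborhood falls back to the borough name
combined['Neighborhood'] = np.where(
    combined['Neighborhood'] == 'Not assigned',
    combined['Borough'],
    combined['Neighborhood'],
)

print(combined)
print(combined.shape)  # (rows, columns)
```

Selecting the `Neighborhood` column before aggregating keeps the join limited to that column, which matters if the frame ever carries extra columns.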

## Scraping the Web page
Using Python's BeautifulSoup library, we can extract the table from the Wikipedia page.


In [2]:
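from io import StringIO

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Fetch the page and fail early on an HTTP error
page = requests.get(url)
page.raise_for_status()

# Parse the HTML and take the first <table> element on the page
soup = bs4(page.text, 'lxml')
table = str(soup.table)

# Wrap the HTML in StringIO: newer pandas versions deprecate passing literal HTML strings
dfs = pd.read_html(StringIO(table))
df = dfs[0]
df.head()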

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


## Data Processing and Cleaning

In [3]:
# Drop the rows where Borough is "Not assigned"
df1 = df[df.Borough != 'Not assigned']

# Combine the neighborhoods with same postal code
df2 = df1.groupby(['Postal Code', 'Borough'], sort=False).agg(', '.join)
df2.reset_index(inplace=True)

# Replace the name of the neighbourhoods which are 'Not assigned' with names of Borough
df2['Neighbourhood'] = np.where(df2['Neighbourhood'] == 'Not assigned', df2['Borough'], df2['Neighbourhood'])

df2.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [4]:
# Shape of DataFrame
df2.shape

(103, 3)

---

## Import the CSV file to get geospatial coordinates (latitudes and longitudes)

In [5]:
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


## Merge the two tables

In [6]:
df3 = pd.merge(df2, lat_lon, on='Postal Code')
df3.head()

|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
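
By default `pd.merge` performs an inner join, so a postal code with no matching coordinates would be dropped without warning. A minimal sketch of one way to check for that follows. The frames `areas` and `coords` and the postal code `M9Z` are made up for illustration; they are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy frames; 'M9Z' is an invented code with no coordinates
areas = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every postal code from `areas`,
# indicator=True adds a '_merge' column recording where each row was found,
# validate='one_to_one' raises if a postal code is duplicated on either side
checked = pd.merge(areas, coords, on='Postal Code', how='left',
                   indicator=True, validate='one_to_one')

# Postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

If `unmatched` is empty, the plain inner join used in the notebook loses no rows.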


---

## From the DataFrame, select the rows whose borough contains 'Toronto'

In [7]:
df4 = df3[df3['Borough'].str.contains('Toronto', regex=False)]
df4.head()

|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
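
The filter is a plain substring match, and the selected rows keep their original index labels (2, 4, 9, ... in the output above). A small sketch on hypothetical toy data shows both effects. The name `boroughs` is illustrative, and resetting the index is an optional step the notebook does not take.

```python
import pandas as pd

# Hypothetical toy frame with a mix of matching and non-matching boroughs
boroughs = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'North York', 'East Toronto', 'Scarborough'],
})

# regex=False treats 'Toronto' as a literal substring rather than a pattern
selected = boroughs[boroughs['Borough'].str.contains('Toronto', regex=False)]
original_labels = selected.index.tolist()
print(original_labels)  # index labels carried over from the unfiltered frame

# Optional: renumber the rows from 0 and discard the old labels
renumbered = selected.reset_index(drop=True)
print(renumbered)
```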


## Visualize the DataFrame above on a map using folium

In [8]:
map_toronto = folium.Map(location=[43.651070,-79.347015], zoom_start=10)

# Add one circle marker per postal code area, labelled "<neighbourhood>, <borough>"
for lat, lng, borough, neighborhood in zip(df4['Latitude'], df4['Longitude'], df4['Borough'], df4['Neighbourhood']):
    label = f'{neighborhood}, {borough}'
    # Wrap the label in a Popup so it appears when the marker is clicked
    popup = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=popup,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7
    ).add_to(map_toronto)

map_toronto
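
The clustering step named in the outline could look roughly like the sketch below. It reuses the five coordinates shown in the filtered output, and it assumes two things the notebook does not specify: that latitude and longitude are the only features, and that two clusters are requested. The frame name `toronto` and the `Cluster` column are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Coordinates of the five rows shown in the filtered output above
toronto = pd.DataFrame({
    'Postal Code': ['M5A', 'M7A', 'M5B', 'M5C', 'M4E'],
    'Latitude': [43.65426, 43.662301, 43.657162, 43.651494, 43.676357],
    'Longitude': [-79.360636, -79.389494, -79.378937, -79.375418, -79.293031],
})

# Assumptions: k=2, and latitude/longitude are the only features.
# random_state fixes the initialisation so repeated runs give the same labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toronto['Cluster'] = kmeans.fit_predict(toronto[['Latitude', 'Longitude']])

print(toronto[['Postal Code', 'Cluster']])
```

The `Cluster` column could then pick the marker color in the folium loop, for example by indexing into a list of color names.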