# Segmenting and Clustering Neighborhoods in Toronto
With the help of:
https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722

In [1]:
# !pip install beautifulsoup4  # already installed in the Watson environment
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

Get the table from the website

In [2]:
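# Download the page HTML and parse it (the 'lxml' parser requires the lxml package)
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html, 'lxml')
# Locate the postal code table by its CSS classes
My_table = soup.find('table', {'class': 'wikitable sortable'})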

Create empty lists for columns

In [3]:
zip_list = []
boro_list = []
neigh_list = []

Read the values into the three lists above.  
_This is probably not the best way to do it; I am still working on a cleaner approach._

In [4]:
table_rows = My_table.find_all('tr')[1:]  # skip the first row, which holds the header
for tr in table_rows:
    # Extract the text of every cell once, without newlines
    cells = [td.text.replace('\n', '') for td in tr.find_all('td')]
    if cells[1] != '':  # skip rows whose Borough cell is empty
        zip_list.append(cells[0])
        boro_list.append(cells[1])
        neigh_list.append(cells[2].replace(' / ', ', '))

In [5]:
df = pd.DataFrame()
df['PostalCode'] = zip_list
df['Borough'] = boro_list
df['Neighborhood'] = neigh_list
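
The row-reading loop and the column-by-column DataFrame construction above can also be written by collecting one record per table row and passing the list to `pd.DataFrame`. The following is a minimal sketch, not the notebook's actual code: the inline `sample_html` snippet and the helper name `parse_postal_table` are hypothetical stand-ins for the downloaded Wikipedia page, and the built-in `html.parser` is used in place of `lxml` so that no extra parser package is assumed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, so the sketch runs offline.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park / Harbourfront</td></tr>
</table>
"""


def parse_postal_table(html):
    """Return one dict per table row that has exactly three <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    records = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 3:  # the header row uses <th>, so it has no <td> cells
            continue
        postal_code, borough, neighborhood = cells
        if borough == "":  # skip rows with an empty Borough cell
            continue
        records.append({
            "PostalCode": postal_code,
            "Borough": borough,
            "Neighborhood": neighborhood.replace(" / ", ", "),
        })
    return records


df_sketch = pd.DataFrame(parse_postal_table(sample_html))
print(df_sketch)
```

Because the header row is detected by its missing `<td>` cells, this variant does not depend on slicing off the first row, and each row's cells are looked up only once.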

---
Drop the rows whose Borough is **Not assigned**, then reset the **index** so that it is contiguous again.
## Below is the answer for Question 1

In [6]:
# Keep only the rows with an assigned borough
df = df[df.Borough != 'Not assigned']
# Renumber the rows from 0 and discard the old index
df = df.reset_index(drop=True)
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [7]:
df.shape

(103, 3)

# Second Part - Defining Longitude and Latitude for Given Zip Codes

In [8]:
df_location = pd.read_csv('https://cocl.us/Geospatial_data')

In [9]:
df_location.shape

(103, 3)

In [10]:
df_merged = pd.merge(left=df, right=df_location, left_on='PostalCode', right_on='Postal Code').drop('Postal Code', axis=1)

In [11]:
df_merged.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
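
Both frames have 103 rows, but equal shapes alone do not guarantee that every postal code found a match in the merge. As a hedged sketch on made-up data (the two small frames and the `M9Z` row below are hypothetical, not the real tables), `pd.merge` can report this itself: `validate='one_to_one'` raises an error if a key is duplicated on either side, and `indicator=True` adds a `_merge` column that flags rows present in only one frame.

```python
import pandas as pd

# Hypothetical stand-ins for df and df_location
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is an invented, unmatched code
    "Latitude": [43.753259, 43.725882, 43.0],
    "Longitude": [-79.329656, -79.315572, -79.0],
})

# An outer join keeps unmatched rows from both sides so they can be inspected
checked = pd.merge(
    left,
    right,
    left_on="PostalCode",
    right_on="Postal Code",
    how="outer",
    validate="one_to_one",
    indicator=True,
)

unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["PostalCode", "Postal Code", "_merge"]])
```

If `unmatched` is empty for the real data, the default inner join used in the notebook loses no neighborhoods.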


# Third Part - Analyzing Toronto Neighborhoods and Clustering

The code from the New York lab is adapted here to the Toronto data.  
No venues were extracted from Foursquare for Toronto.  
Venue-based clustering will be done in later projects, so it is skipped for now.  
The clustering uses only the latitude and longitude values, which does not yield any useful insight;  
it is included only to show that the pipeline works.  
As a result, the model simply groups geographically close neighborhoods into the same cluster.  
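
To illustrate why clustering on coordinates alone only recovers geographic proximity, here is a small sketch on made-up points (not the Toronto data). Two tight groups of latitude/longitude pairs are given to `KMeans`, and each group ends up in its own cluster. The coordinates, `n_clusters=2`, and `n_init=10` are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates: two tight groups that lie far apart from each other
coords = np.array([
    [43.65, -79.38], [43.66, -79.39], [43.64, -79.37],  # first group
    [43.78, -79.20], [43.79, -79.21], [43.77, -79.19],  # second group
])

# Fit k-means on the raw coordinates, as the notebook does for Toronto
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary; what matters is that points within a group share a label and the two groups receive different ones.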

In [12]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [13]:
# set number of clusters
kclusters = 5
toronto_grouped_clustering = df_merged.drop(['Neighborhood','PostalCode','Borough'], axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

In [14]:
# add cluster labels to main dataframe
df_merged['Cluster Labels'] = kmeans.labels_

In [15]:
import folium  # map rendering library; install it with pip in the Watson custom environment
import matplotlib.cm as cm
import matplotlib.colors as colors

In [16]:
# create map
map_clusters = folium.Map(location=[43.753259, -79.329656], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Neighborhood'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters - 1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters