# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this lab, we explore neighborhoods in Toronto, Canada, using the Foursquare API, and segment them by their most common venues into clusters with k-means clustering.
The results are then visualized with Folium.

## 1. Download and Explore Dataset

Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, then transform that data into a pandas dataframe.

In [1]:
#Download data
!wget https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M -O canada_postal_code.xml

--2019-05-20 06:27:37--  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Resolving en.wikipedia.org (en.wikipedia.org)... 103.102.166.224
Connecting to en.wikipedia.org (en.wikipedia.org)|103.102.166.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79026 (77K) [text/html]
Saving to: ‘canada_postal_code.xml’


2019-05-20 06:27:38 (547 KB/s) - ‘canada_postal_code.xml’ saved [79026/79026]



In [2]:
#Install the BeautifulSoup library to parse the HTML page downloaded above
!pip install bs4



In [3]:
#Import the needed libraries
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [4]:
#Load and parse the downloaded page (it is HTML, despite the .xml file name)
with open('canada_postal_code.xml') as f:
    soup=BeautifulSoup(f,'html.parser')

In [5]:
#Extract data from postal table
#Find the table rows once and skip the header row
rows=soup.table.find_all('tr')[1:]
#Keep the non-empty strings of each row
L=[[x.rstrip('\n') for x in row.strings if x.rstrip('\n') != ''] for row in rows]
print("Let's see first 5 values")
L[:5]

Let's see first 5 values


[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront']]
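
The row-extraction idea can be tried without downloading anything. The following is a minimal sketch that swaps BeautifulSoup for Python's standard-library `html.parser` and runs on a small hand-written table. The `SAMPLE_HTML` snippet and the `TableRowParser` class are hypothetical names introduced for illustration; they are not part of the notebook.

```python
from html.parser import HTMLParser

# Hypothetical, hand-written stand-in for the Wikipedia postal code table.
SAMPLE_HTML = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the non-empty text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # cells of the row being read, or None outside a row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and text:
            self._current.append(text)


parser = TableRowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows[1:]  # skip the header row, as the notebook cell does
print(rows)
```

Each inner list has the same shape as the rows collected in `L`, so it could be passed to `pd.DataFrame` in the same way.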

#### Create a dataframe from the data above with three columns: PostCode, Borough, and Neighborhood

In [6]:
#Import our lists to a dataframe
columns=['PostCode','Borough','Neighborhood']
df=pd.DataFrame(L,columns=columns)
df.head()

| | PostCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


#### Ignore rows whose borough is Not assigned

In [7]:
#Drop rows with Borough == 'Not assigned'
df=df[df['Borough'] != 'Not assigned']
df.head()

| | PostCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


#### Group neighborhoods by PostCode

In [8]:
#Join the neighborhoods that share a postal code into one comma-separated string
df=df.groupby(['PostCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head()

| | PostCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
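
The filter-then-group steps can be checked on a handful of rows. This is a minimal sketch on made-up toy data in the same three-column shape as the scraped table; the `toy`, `assigned`, and `grouped` names are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative).
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M3A", "North York", "Parkwoods"],
    ],
    columns=["PostCode", "Borough", "Neighborhood"],
)

# Drop unassigned boroughs, then join the neighborhoods of each postal code.
assigned = toy[toy["Borough"] != "Not assigned"]
grouped = (
    assigned.groupby(["PostCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(grouped)
```

Because `groupby` sorts its keys by default, the result is ordered by postal code rather than by the original row order.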


#### Replace "Not assigned" Neighborhood by Borough name

In [9]:
#Replace 'Not assigned' neighborhoods with the borough name
#(.loc with a boolean mask avoids chained assignment)
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']

#Verify result
try:
    df.set_index('Neighborhood').loc["Not assigned"]
except KeyError:
    print("There is no Not assigned neighborhood anymore\n")
    
print("The value of Neighborhood for Queen's Park borough now is: %s" % df.set_index('Borough').loc["Queen's Park"].Neighborhood)

There is no Not assigned neighborhood anymore

The value of Neighborhood for Queen's Park borough now is: Queen's Park
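
The replacement step can be checked on a two-row toy frame. This sketch uses a boolean mask with `.loc`; the `toy` frame and its values are assumptions made for illustration only.

```python
import pandas as pd

# Illustrative two-row frame: one neighborhood is missing, one is present.
toy = pd.DataFrame(
    {
        "PostCode": ["M7A", "M3A"],
        "Borough": ["Queen's Park", "North York"],
        "Neighborhood": ["Not assigned", "Parkwoods"],
    }
)

# Boolean mask of the rows to fix; .loc writes into the frame itself,
# avoiding the chained-indexing pattern that pandas may warn about.
mask = toy["Neighborhood"] == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

print(toy)
```

Only the masked row changes; rows that already have a neighborhood are left untouched.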


#### Our dataframe shape

In [10]:
df.shape

(103, 3)

### Add latitude and longitude values to our dataframe

#### Download the CSV file that contains the latitude and longitude of each postal code from https://cocl.us/Geospatial_data

In [11]:
!wget https://cocl.us/Geospatial_data -O Geospatial_data

--2019-05-20 06:27:42--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 161.202.50.39
Connecting to cocl.us (cocl.us)|161.202.50.39|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-20 06:27:46--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 103.116.4.197
Connecting to ibm.box.com (ibm.box.com)|103.116.4.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-20 06:27:46--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-20 06:2

#### Process and merge the latitude and longitude data into our dataframe

In [12]:
#Load latitude and longitude data
lat_long=pd.read_csv('Geospatial_data')
#Rename the key column so it matches our dataframe
lat_long.rename(columns={'Postal Code':'PostCode'},inplace=True)

#Merge the coordinates into our dataframe on the postal code
full_df=pd.merge(df,lat_long,on='PostCode')
full_df.head()

| | PostCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
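
An inner merge silently drops postal codes that have no coordinates. One way to check for that, sketched here on toy data, is a left merge with `indicator=True`. The `left`, `coords`, `checked`, and `missing` names are illustrative, and the postal code `M9Z` is made up to force a mismatch.

```python
import pandas as pd

# Two real-looking rows plus one made-up code (M9Z) with no coordinates.
left = pd.DataFrame(
    {
        "PostCode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    }
)
coords = pd.DataFrame(
    {
        "PostCode": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# how="left" keeps every postal code from `left`; indicator=True adds a
# `_merge` column saying whether each row found a match in `coords`.
checked = pd.merge(left, coords, on="PostCode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "PostCode"].tolist()
print(missing)
```

An empty `missing` list would suggest that every postal code received coordinates, in which case the inner merge loses nothing.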


#### Keep only the boroughs whose name contains "Toronto"

In [13]:
#Keep rows whose borough name contains "Toronto"
toronto_data=full_df[full_df['Borough'].str.contains("Toronto")].reset_index(drop=True)
print(toronto_data.head(5))
toronto_data.shape

  PostCode          Borough                    Neighborhood   Latitude  \
0      M4E     East Toronto                     The Beaches  43.676357   
1      M4K     East Toronto    The Danforth West, Riverdale  43.679557   
2      M4L     East Toronto  The Beaches West, India Bazaar  43.668999   
3      M4M     East Toronto                 Studio District  43.659526   
4      M4N  Central Toronto                   Lawrence Park  43.728020   

   Longitude  
0 -79.293031  
1 -79.352188  
2 -79.315572  
3 -79.340923  
4 -79.388790  


(38, 5)

## 2. Explore Neighborhoods in Toronto

## 3. Analyze Each Neighborhood

## 4. Cluster Neighborhoods

## 5. Examine clusters 