# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this analysis, I explored how to convert addresses into their equivalent latitude and longitude values, and learned how to use the Foursquare API to explore neighborhoods in the City of Toronto. I used the API's explore function to get the most common venue categories in each neighborhood, and then used these categories as features to group the neighborhoods into clusters with the k-means clustering algorithm. Finally, I used the Folium library to visualize the neighborhoods and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Scrape the Wikipedia page to obtain the data: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</a>

2. <a href="#item2">Transform the data into a pandas DataFrame</a>
    
3. <a href="#item3">Get the Geographical Coordinates for each Neighborhood</a>
    
   
</font>
</div>

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [2]:
# Importing the dependencies

from bs4 import BeautifulSoup  # library for web scraping
import requests  # library to handle requests
import json  # library to handle JSON files
import xml
import pandas as pd  # library for data manipulation and analysis

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


from geopy.geocoders import Nominatim # convert an address into latitude and longitude values


from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library



print('Libraries imported.')

Libraries imported.


The BeautifulSoup package is used to scrape the data in the table on the Wikipedia page into a pandas DataFrame.

## 1. Scrape the Wikipedia page to obtain the data


Scrape the content of the "List of postal codes of Canada" wiki page using BeautifulSoup.

The City of Toronto data is in the postal code table on the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [3]:
# download url data from internet
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
Canada_data = BeautifulSoup(source, 'lxml')

Convert the content of the postal code HTML table into a DataFrame

In [4]:
# column names for the new DataFrame
column_names = ['Postalcode', 'Borough', 'Neighborhood']

# loop through the table rows to find postcode, borough, neighborhood
# (assumes the three-column table layout used throughout this notebook)
content = Canada_data.find('div', class_='mw-parser-output')
table = content.table.tbody
rows = []

for tr in table.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if len(cells) < 3:
        continue  # skip the header row, which has <th> cells and no <td> cells
    postcode, borough, neighborhood = cells[:3]
    rows.append({'Postalcode': postcode, 'Borough': borough, 'Neighborhood': neighborhood.replace(']', '')})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
toronto = pd.DataFrame(rows, columns=column_names)

# clean dataframe: drop rows without an assigned borough
toronto = toronto[toronto.Borough != 'Not assigned']
toronto.reset_index(drop=True, inplace=True)

# a neighborhood that is 'Not assigned' takes the name of its borough
# (.loc assigns in place; chained iloc[i][2] = ... may write to a temporary copy)
mask = toronto.Neighborhood == 'Not assigned'
toronto.loc[mask, 'Neighborhood'] = toronto.loc[mask, 'Borough']
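
The row-by-row parsing and cleaning above can be tried offline on a small inline table. The following is only a sketch: the HTML string and the `sample` name are hypothetical stand-ins for the Wikipedia page, and Python's built-in `html.parser` is used instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical three-column table standing in for the Wikipedia page
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')  # built-in parser, assumption for this sketch

rows = []
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        rows.append(cells)

sample = pd.DataFrame(rows, columns=['Postalcode', 'Borough', 'Neighborhood'])

# Drop unassigned boroughs, then let unassigned neighborhoods inherit the borough name
sample = sample[sample.Borough != 'Not assigned'].reset_index(drop=True)
mask = sample.Neighborhood == 'Not assigned'
sample.loc[mask, 'Neighborhood'] = sample.loc[mask, 'Borough']

print(sample)
```

Collecting the rows in a list and building the DataFrame once avoids growing the frame inside the loop.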
                                 


## 2. Transform the data into a pandas DataFrame

In [5]:
df = toronto.groupby(['Postalcode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
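
To see what the `groupby` call does, here is a minimal sketch on a hypothetical three-row frame (the `sample` and `grouped` names are illustrative, not part of the notebook). Rows that share a postal code and borough collapse into one row whose neighborhoods are joined with a comma.

```python
import pandas as pd

# Hypothetical input: M5A appears twice with different neighborhoods
sample = pd.DataFrame({
    'Postalcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (Postalcode, Borough); neighborhoods are joined into one string
grouped = (
    sample.groupby(['Postalcode', 'Borough'])['Neighborhood']
    .apply(', '.join)
    .reset_index()
)

print(grouped)
```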


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
Postalcode      103 non-null object
Borough         103 non-null object
Neighborhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB



### Data Cleaning

Drop any rows that contain missing (`None`/NaN) values.

Drop any row that contains the value 'Not assigned' in any column.

The 'Not assigned' rows are filtered out directly with a boolean mask, so no conversion to NaN is needed.


In [11]:
df = df.dropna()
empty = 'Not assigned'
df = df[(df.Postalcode != empty ) & (df.Borough != empty) & (df.Neighborhood != empty)]
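
The cell above filters with a boolean mask. An alternative route converts the placeholder to NaN with numpy first and then drops incomplete rows. The sketch below compares the two on a hypothetical frame; `sample`, `by_mask`, and `by_nan` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one unassigned row
sample = pd.DataFrame({
    'Postalcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

empty = 'Not assigned'

# Approach used in the cell above: boolean mask over every column
by_mask = sample[
    (sample.Postalcode != empty)
    & (sample.Borough != empty)
    & (sample.Neighborhood != empty)
]

# Alternative: turn the placeholder into NaN, then drop incomplete rows
by_nan = sample.replace(empty, np.nan).dropna()

print(by_mask.Postalcode.tolist())
print(by_nan.Postalcode.tolist())
```

The NaN route scales more easily when the frame has many columns, since it does not need one comparison per column.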

In [34]:
df.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [10]:
df.shape

(103, 3)

## 3. Get the Geographical Coordinates for each Neighborhood

I was not able to get the geographical coordinates of the neighborhoods using the Geocoder package. Hence, as given in the instructions, the CSV file at http://cocl.us/Geospatial_data, which lists the geographical coordinates of each postal code, has been used to fetch the coordinates of the neighborhoods.

In [13]:
import requests
import io
# in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.
url="http://cocl.us/Geospatial_data"
result=requests.get(url).content
col=pd.read_csv(io.StringIO(result.decode('utf-8')))

# rename the columns so that both dataframes share the Postalcode key for merging
col.columns = ['Postalcode', 'Latitude', 'Longitude']
df_new = pd.merge(col, df, on='Postalcode')

# reorder column names and show the dataframe
df_new = df_new[['Postalcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']]

In [36]:
df_new.head(10)

| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |


In [14]:
df_new.shape

(103, 5)
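
The merged frame has the same 103 rows as `df`, which suggests that every postal code found its coordinates. One way to check this explicitly is a left merge with `indicator=True`, sketched below. The CSV text, its header names, and the `M9Z` row are hypothetical and stand in for the downloaded file and the real data.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the downloaded coordinates file
csv_text = (
    "Postal Code,Latitude,Longitude\n"
    "M1B,43.806686,-79.194353\n"
    "M1C,43.784535,-79.160497\n"
)
coords = pd.read_csv(io.StringIO(csv_text))
coords.columns = ['Postalcode', 'Latitude', 'Longitude']

hoods = pd.DataFrame({
    'Postalcode': ['M1B', 'M1C', 'M9Z'],  # M9Z is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
    'Neighborhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union', 'Nowhere'],
})

# A left merge keeps every neighborhood row; the _merge column flags unmatched codes
checked = pd.merge(hoods, coords, on='Postalcode', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postalcode'].tolist()

print(missing)
```

An empty `missing` list on the real data would confirm that the default inner merge dropped nothing.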