<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

### Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Obtain data from Wikipedia and create dataframe</a>

2. <a href="#item2">Get the latitude and the longitude coordinates</a>
   
</font>
</div>

Before we get the data and start exploring it, let's install and import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
from bs4 import BeautifulSoup # library to parse HTML
import requests # library to handle requests
import json # library to handle JSON files
#!conda install -c conda-forge geopy --yes
!pip install geocoder
import geocoder
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library



## 1. Obtain data from Wikipedia and create dataframe

Scrape the Wikipedia table with the BeautifulSoup library, then extract the rows, skip those without an assigned borough, and load the rest into a pandas dataframe.

In [2]:
page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_doc = requests.get(page).text
soup = BeautifulSoup(html_doc, 'html.parser')
column_names = ['PostalCode', 'Borough', 'Neighborhood']
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0

# skip the header row, then read the first three cells of every table row
for row in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['th', 'td'])
    if len(cells) < 3:
        continue  # short or malformed row
    # strip whitespace so a trailing newline cannot defeat the comparison below
    postal_code, borough, neighborhood = (cell.text.strip() for cell in cells[:3])
    if borough == 'Not assigned':
        continue
    rows.append({'PostalCode': postal_code,
                 'Borough': borough,
                 'Neighborhood': neighborhood})

nbh = pd.DataFrame(rows, columns=column_names)
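
The row-filtering logic can be tried offline on a small inline table before pointing it at the live page. The sketch below is illustrative only: `SAMPLE_HTML` and `parse_rows` are hypothetical names, and the sample rows are a made-up miniature of the Wikipedia table, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td></tr>
</table>
"""


def parse_rows(html):
    """Return (postal_code, borough, neighborhood) tuples for assigned boroughs."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='wikitable')
    parsed = []
    for tr in table.find_all('tr')[1:]:  # skip the header row
        # get_text(strip=True) removes surrounding whitespace and newlines
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if len(cells) < 3:
            continue  # short row: nothing to extract
        if cells[1] == 'Not assigned':
            continue  # borough missing: skip the row
        parsed.append(tuple(cells[:3]))
    return parsed


sample_rows = parse_rows(SAMPLE_HTML)
print(sample_rows)
```

Only the `M3A` row survives: the `M1A` row has no assigned borough and the `M4A` row has too few cells.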

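Clean the neighborhood names, then group the rows so that each postal code appears once, with its neighborhoods joined into a comma-separated string.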

In [3]:
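# remove stray newlines left over from the HTML cells
nbh['Neighborhood'] = nbh['Neighborhood'].str.replace('\n', '')
# treat unassigned neighborhoods as missing values and drop those rows
nbh['Neighborhood'] = nbh['Neighborhood'].replace('Not assigned', np.nan)
nbh = nbh.dropna(axis=0, subset=['Neighborhood'])
# one row per postal code, with its neighborhoods joined into one string
dft = nbh.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
dft.head()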

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
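
To see what the grouping step does in isolation, here is a minimal sketch on a hand-built dataframe. It assumes the input has one row per (postal code, neighborhood) pair; `long_df` and `grouped` are illustrative names, and the three sample rows are chosen to mirror the first entries of the output above.

```python
import pandas as pd

# Hypothetical long-format input: one row per (postal code, neighborhood) pair.
long_df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Group by postal code and borough, then join the neighborhood names
# of each group into a single comma-separated string.
grouped = (long_df
           .groupby(['PostalCode', 'Borough'])['Neighborhood']
           .apply(', '.join)
           .reset_index())
print(grouped)
```

The two `M1B` rows collapse into a single row whose `Neighborhood` value is `Rouge, Malvern`, while `M1G` keeps its single neighborhood.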


## 2. Get the latitude and the longitude coordinates

Get the geographical coordinates of each postal code from the `Geospatial_Coordinates.csv` file, merge them into the dataframe, and keep only the boroughs whose name contains "Toronto".

In [5]:
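url = "http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv"
coordinates = pd.read_csv(url)
# rename the columns so that the key matches the 'PostalCode' column of dft
coordinates.columns = ['PostalCode', 'Latitude', 'Longitude']
# inner join: postal codes without coordinates are dropped
nbh = pd.merge(dft, coordinates, on="PostalCode")

# keep only the boroughs whose name contains 'Toronto'
nbh = nbh[nbh['Borough'].str.contains('Toronto')].reset_index(drop=True)
nbh.head(15)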

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 2 | M4L | East Toronto | The Beaches West, India Bazaar | 43.668999 | -79.315572 |
| 3 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
| 5 | M4P | Central Toronto | Davisville North | 43.712751 | -79.390197 |
| 6 | M4R | Central Toronto | North Toronto West | 43.715383 | -79.405678 |
| 7 | M4S | Central Toronto | Davisville | 43.704324 | -79.38879 |
| 8 | M4T | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 |
| 9 | M4V | Central Toronto | Deer Park, Forest Hill SE, Rathnelly, South Hi... | 43.686412 | -79.400049 |
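
Because `pd.merge` defaults to an inner join, any postal code missing from the coordinates file disappears without warning. One way to check for that, sketched here on made-up data, is a left join with `indicator=True`. The `codes` and `coords` frames are illustrative, and the `M9Z` row is a hypothetical code added only to trigger the mismatch.

```python
import pandas as pd

# Small illustrative frames; 'M9Z' is a made-up code with no coordinates.
codes = pd.DataFrame({
    'PostalCode': ['M4E', 'M4K', 'M9Z'],
    'Borough': ['East Toronto', 'East Toronto', 'Hypothetical Borough'],
})
coords = pd.DataFrame({
    'PostalCode': ['M4E', 'M4K'],
    'Latitude': [43.676357, 43.679557],
    'Longitude': [-79.293031, -79.352188],
})

# The default inner join keeps only postal codes present in both frames.
inner = pd.merge(codes, coords, on='PostalCode')

# A left join with indicator=True adds a '_merge' column that flags
# rows of `codes` that found no match in `coords`.
checked = pd.merge(codes, coords, on='PostalCode', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()

# Same borough filter as in the notebook: a substring match on 'Toronto'.
toronto_only = inner[inner['Borough'].str.contains('Toronto')].reset_index(drop=True)

print(len(inner), missing, len(toronto_only))
```

If `missing` is non-empty, those postal codes would be dropped by the inner join and may deserve a closer look before clustering.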
