<h1> Segmenting and Clustering Neighborhoods in Toronto</h1>

## Introduction

<i> This notebook segments Toronto into neighborhoods and clusters them according to the venues available in the vicinity of each neighborhood.
    
There are 3 steps:
1. Download and structure the Toronto data
2. Explore the Toronto neighborhoods
3. Analyze and cluster the neighborhoods </i> 

<u> Note: Only the first step is covered here.</u>

In [10]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests

! pip install beautifulsoup4 # library for parsing web page data
from bs4 import BeautifulSoup

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

! pip install geocoder

# json_normalize is exposed at the top level in pandas >= 1.0; the pandas.io.json import path is deprecated
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
    100% |████████████████████████████████| 102kB 17.3MB/s 
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Libraries imported.


<i> Download the Toronto neighborhood data (tip: Toronto postal codes start with 'M') </i>

In [11]:
!wget -q -O 'toronto_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Web data downloaded!')

Web data downloaded!


<i> Parse the web data into meaningful information </i> 

In [12]:
with open('toronto_data.html') as html_file:
    soup=BeautifulSoup(html_file,'lxml')
table_html=soup.find('table',class_='wikitable sortable')
#print(table_html)

<i> Create a dataframe to hold the Toronto postal code data from the web page </i>

In [18]:
col_names=['PostalCode','Borough','Neighborhood']
neighborhoods = pd.DataFrame(columns=col_names)
neighborhoods

In [19]:
# Collect one record per table row; the first <tr> is the header row, so skip it
records = []
for tr in table_html.tbody.find_all('tr')[1:]:
    cells = [td.text for td in tr.find_all('td')]
    if len(cells) == 3:
        postalcode_cd, borough_name, neighborhood_name = cells
        records.append({'PostalCode': postalcode_cd,
                        'Borough': borough_name,
                        'Neighborhood': neighborhood_name})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
neighborhoods = pd.DataFrame(records, columns=['PostalCode','Borough','Neighborhood'])

print('Size',neighborhoods.shape)
neighborhoods.head()

Size (288, 3)


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |
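
The row-by-row extraction above can be tried offline on a small inline table. The following is only a sketch: `SAMPLE_HTML` and `table_to_frame` are hypothetical names introduced for illustration, and it assumes the built-in `html.parser` backend rather than `lxml`, so it runs without the downloaded file. Unlike the cell above, it strips whitespace from each cell while reading it.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the Wikipedia table, used only for illustration
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</tbody>
</table>
"""

def table_to_frame(html, columns):
    """Return a DataFrame with one row per <tr> that holds <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='wikitable')
    records = []
    for tr in table.find_all('tr'):
        # Header rows use <th>, so they yield an empty list and are skipped
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == len(columns):
            records.append(cells)
    return pd.DataFrame(records, columns=columns)

sample = table_to_frame(SAMPLE_HTML, ['PostalCode', 'Borough', 'Neighborhood'])
print(sample)
```

Building the list of records first and constructing the dataframe once avoids growing a dataframe inside the loop.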


In [20]:
# Strip the trailing "\n" from the Neighborhood column text
# (vectorized; writing to the rows yielded by iterrows() is not guaranteed to update the dataframe)
neighborhoods['Neighborhood'] = neighborhoods['Neighborhood'].str.rstrip()

# Drop rows where the Borough is not assigned
neighborhoods = neighborhoods[neighborhoods.Borough != "Not assigned"]
neighborhoods = neighborhoods.reset_index(drop=True)

# Use the Borough name where the Neighborhood is not assigned
unassigned = neighborhoods['Neighborhood'] == "Not assigned"
neighborhoods.loc[unassigned, 'Neighborhood'] = neighborhoods.loc[unassigned, 'Borough']
            
print('Size',neighborhoods.shape)
neighborhoods.head()

Size (211, 3)


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
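
The three cleaning rules in this step (strip the newline, drop unassigned boroughs, fall back to the borough name) can be checked on a toy dataframe. This is a sketch under assumptions: the rows in `raw` are made up for illustration and `clean_neighborhoods` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped table, for illustration only
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned\n', 'Parkwoods\n', 'Not assigned\n'],
})

def clean_neighborhoods(df):
    """Apply the three cleaning rules and return a new dataframe."""
    out = df.copy()
    # 1. Remove the trailing newline left over from the HTML cell text
    out['Neighborhood'] = out['Neighborhood'].str.strip()
    # 2. Drop rows whose borough is not assigned
    out = out[out['Borough'] != 'Not assigned'].reset_index(drop=True)
    # 3. where() keeps a value when the condition holds, else takes the Borough
    out['Neighborhood'] = out['Neighborhood'].where(
        out['Neighborhood'] != 'Not assigned', out['Borough'])
    return out

cleaned = clean_neighborhoods(raw)
print(cleaned)
```

Working on a copy leaves `raw` untouched, which makes it easy to rerun the cleaning while experimenting.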


In [21]:
# Combine the neighborhoods that share a postal code into one comma-separated row
neighborhoods = neighborhoods.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
print('Size',neighborhoods.shape)
neighborhoods.head(15)

Size (103, 3)


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
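
To see what the grouping step does to rows that share a postal code, here is a minimal sketch on three rows taken from the earlier output. It uses `agg` where the notebook uses `apply`; for a function that returns one string per group, the two should give the same result.

```python
import pandas as pd

# Three rows in the pre-grouping shape: M5A appears twice
rows = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Group on the two key columns and join each group's neighborhoods into one string
merged = (rows.groupby(['PostalCode', 'Borough'])['Neighborhood']
              .agg(', '.join)
              .reset_index())
print(merged)
```

Because `groupby` sorts by the group keys by default, the result is ordered by postal code, which is why the table above starts at M1B rather than M3A.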


In [22]:
neighborhoods.shape

(103, 3)