# Segmenting and Clustering Neighborhoods in Toronto

### Part 1

##### Installing LXML to be able to read html page:

In [1]:
!pip install lxml



##### Importing required libraries:

In [2]:
import numpy as np # library to handle data in a vectorized manner
import html5lib
import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
            
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


##### Reading HTML into 'data' object:

In [3]:
data = pd.read_html ('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641')
#data

##### Reading only 1st table of the HTML page into dataframe:

In [4]:
df = data[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


##### Removing the rows with a NAN value in 'Borough' column, also sorted the Dataframe by the column 'Postcode':

In [5]:
df1 = df[df.Borough != 'Not assigned'].reset_index(drop=True)
df1.sort_values(by='Postcode',ascending=True,inplace=True)
df1.reset_index(drop=True).head(7)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,Rouge
1,M1B,Scarborough,Malvern
2,M1C,Scarborough,Port Union
3,M1C,Scarborough,Rouge Hill
4,M1C,Scarborough,Highland Creek
5,M1E,Scarborough,Guildwood
6,M1E,Scarborough,Morningside


##### Identifying the NAN values in the Neighbourhood column.
##### Replacing the NAN values by the corresponding Borough value.

In [6]:
i=0
for name in list (df1.Neighbourhood):
    if name == 'Not assigned':
        df1.Neighbourhood[i] = df1.Borough[i]
    i+=1
df1.head(7)

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
23,M1C,Scarborough,Port Union
22,M1C,Scarborough,Rouge Hill
21,M1C,Scarborough,Highland Creek
33,M1E,Scarborough,Guildwood
34,M1E,Scarborough,Morningside


##### Shrinked the table with the use of Groupby on columns Postcode & Borough. Used count() method to obtain the number of Neighbourhoods in each group.

In [7]:
dfgroup = df1.groupby(['Postcode','Borough']).count()
dfgroup.reset_index(inplace=True)
dfgroup.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,2
1,M1C,Scarborough,3
2,M1E,Scarborough,3
3,M1G,Scarborough,1
4,M1H,Scarborough,1


##### As required, Neighbourhoods with the same Postcode are brought into 1 cell seperated by comma using join function:
Here, we have used 2 dataframes, df1 & dfgroup. <br>***df1***: It has all the rows obtained from HTML page after removing 'NAN' Boroughs.<br>***dfgroup***: It has unique values of Postcodes
<br><br>Hence, referring to the Postcode values of dfgroup dataframe, we are traversing throught the df1 dataframe to obtain all the corresponding Neighbourhood values for that particular Postcode. The obtained Neighbourhood values are appended into the list 'lst'. Once, the traversing through the df1 dataframe is complete, the list 'lst' is converted into string seperated by comma and feeded into the corresponding row of Neighbourhood column of the dfgroup dataframe.
<br>Hence, finally the desired dataframe is obtained.

In [8]:
j=0
for pcgp in list (dfgroup.Postcode):
    lst = []
    i=0
    for pc1 in list(df1.Postcode):
        if pcgp == pc1:
            lst.append(df1.Neighbourhood[i])
        i+=1
    lst = list (set(lst))
    dfgroup.Neighbourhood[j] = ','.join([str(ngh) for ngh in lst])
    j+=1
    
print ('Loop completed')   
dfgroup

Loop completed


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Parkwoods,Victoria Village"
1,M1C,Scarborough,"Harbourfront,Regent Park,Lawrence Heights"
2,M1E,Scarborough,"Islington Avenue,Lawrence Manor,Not assigned"
3,M1G,Scarborough,Rouge
4,M1H,Scarborough,Malvern
5,M1J,Scarborough,Don Mills North
6,M1K,Scarborough,"Woodbine Gardens,Ryerson,Parkview Hill"
7,M1L,Scarborough,"Garden District,Glencairn,Cloverdale"
8,M1M,Scarborough,"Princess Gardens,Martin Grove,Islington"
9,M1N,Scarborough,"Highland Creek,West Deane Park"


##### Shape of dfgroup dataframe:

In [9]:
dfgroup.shape

(103, 3)