<h1>Capstone: Segmenting and Clustering Neighbourhoods in Toronto</h1>

<h4>This notebook segments and clusters neighbourhood data for Toronto, Canada, scraped from a Wikipedia page.</h4>

<h2> Part 1: Building the postal code dataframe</h2>

<h4>Import pandas</h4>

In [1]:
# for converting the parsed data in a pandas dataframe
import pandas as pd

<h4>Parse the tables from the web page with pandas and check what is returned.</h4>

In [2]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
print(type(df))
len(df)

<class 'list'>


3

<h4>pd.read_html returns a list of dataframes, one per table found on the page; here there are three. Select the first into a new dataframe and check its length.</h4>

In [3]:
df_hoods=df[0]
print(type(df_hoods))
len(df_hoods)

<class 'pandas.core.frame.DataFrame'>


180
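
Selecting the table by position (`df[0]`) works as long as the page layout stays the same. As a hedged alternative, the sketch below picks the table by its expected column names instead. The helper `pick_table` and the stand-in `tables` list are hypothetical and hand-made for illustration; they are not part of the original notebook and contain no scraped data.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first dataframe whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError("no table with columns {}".format(sorted(required)))


# Stand-in for the list returned by pd.read_html (hand-made, not scraped).
# The postal code table is deliberately placed second to show that
# the selection does not depend on position.
tables = [
    pd.DataFrame({"Code": ["x"], "Note": ["navigation box"]}),
    pd.DataFrame({
        "Postal Code": ["M3A"],
        "Borough": ["North York"],
        "Neighbourhood": ["Parkwoods"],
    }),
]

selected = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
print(selected.shape)
```

In the notebook itself this would be called on the list returned by `pd.read_html`, assuming the column names shown in the preview below stay unchanged.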

<h4>Preview the first rows with head().</h4>

In [4]:
df_hoods.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


<h4>Remove rows with Borough "Not assigned", rename the postal code column, and reset the index. Preview the first rows.</h4>

In [5]:
# .copy() makes the filtered result an independent dataframe,
# so the in-place changes below do not raise SettingWithCopyWarning
df_hoods = df_hoods[df_hoods.Borough != 'Not assigned'].copy()
df_hoods.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df_hoods.reset_index(drop=True, inplace=True)
df_hoods.head(12)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


<h4>Check whether any Neighbourhood is 'Not assigned'.</h4>

In [6]:
df_hoods_NNa = df_hoods[df_hoods.Neighbourhood == 'Not assigned']
len(df_hoods_NNa)

0
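
The same check can be written as a boolean mask, which also makes it easy to act on any matches. The sketch below uses a small hand-made dataframe, not the scraped data. The fallback of copying the borough name into the neighbourhood column is an assumption added for illustration; the notebook found no such rows and does not need it.

```python
import pandas as pd

# Hand-made sample with one placeholder neighbourhood
sample = pd.DataFrame({
    "Borough": ["North York", "Etobicoke", "Scarborough"],
    "Neighbourhood": ["Parkwoods", "Not assigned", "Malvern, Rouge"],
})

# Boolean mask: True where the neighbourhood is the placeholder
is_placeholder = sample["Neighbourhood"] == "Not assigned"
n_missing = int(is_placeholder.sum())
print(n_missing)

# Assumed fallback (not in the original notebook): reuse the borough name
sample.loc[is_placeholder, "Neighbourhood"] = sample.loc[is_placeholder, "Borough"]
print(sample)
```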

<h4>Print the number of rows in the dataframe using the .shape attribute.</h4>

In [7]:
print(df_hoods.shape[0])

103
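
As a brief illustration of what `.shape` holds, the sketch below unpacks it on a small hand-made dataframe. `shape` is a `(rows, columns)` tuple read without parentheses. The uniqueness check on the postal code column is an extra sanity check added here as an assumption, not a step from the original notebook.

```python
import pandas as pd

# Hand-made sample, not the scraped data
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})

# shape is an attribute holding (number of rows, number of columns)
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# len() of a dataframe gives the same row count as shape[0]
print(len(sample) == n_rows)

# Optional sanity check: each postal code should appear only once
print(sample["PostalCode"].is_unique)
```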


<h2> Part 2: Adding the coordinates using the geocoder library</h2>

<h4>Install and import the geocoder library</h4>

In [8]:
!pip install geocoder




In [9]:
import geocoder

<h4>Obtain coordinates for each postal code in the dataframe using geocoder</h4>

In [None]:
# Collect one {'Latitude', 'Longitude'} row per postal code
rows = []

# Iterate through the postal codes and look up each one
for postal_code in df_hoods['PostalCode']:
    # initialize the result to None
    lat_lng_coords = None
    # loop until the geocoder returns coordinates
    # (note: this never ends if a postal code cannot be resolved)
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    latitude, longitude = lat_lng_coords
    rows.append({'Latitude': latitude, 'Longitude': longitude})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
df_coords = pd.DataFrame(rows, columns=['Latitude', 'Longitude'])
print(df_coords)
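
The lookup loop above retries until the geocoder answers, so it runs forever if a postal code is never resolved. A minimal sketch of a bounded retry follows. The helper `lookup_with_retries`, its parameters, and the fake `flaky_lookup` are hypothetical names introduced for illustration, and the coordinates returned by the fake are made up.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a value, at most max_attempts times.

    Returns None if every attempt came back empty.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        # Pause between attempts to avoid hammering the service
        time.sleep(delay_seconds)
    return None


# Fake lookup for demonstration: fails twice, then returns made-up coordinates
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [1.0, 2.0]


result = lookup_with_retries(flaky_lookup, "M3A, Toronto, Ontario")
print(result, calls["count"])

# A lookup that never succeeds gives up after max_attempts and returns None
never = lookup_with_retries(lambda q: None, "unknown", max_attempts=3)
print(never)
```

In the notebook, the `lookup` argument could be `lambda q: geocoder.arcgis(q).latlng`, and rows that come back as `None` could then be reported instead of blocking the whole run.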

<h4>Create a new dataframe containing the neighbourhood data and the coordinates</h4>

In [None]:
# Create dataframe list
dfs = [df_hoods, df_coords]
# Concatenate dataframes in the list
df_hoods_coords = pd.concat(dfs, join='outer', axis=1)
df_hoods_coords.head(12)

In [None]:
df_hoods_coords.shape
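
`pd.concat` with `axis=1` aligns rows by index label, not by position, which is why the index reset in Part 1 matters. The sketch below shows the difference on two hand-made dataframes; the coordinate values are made up and the variable names are illustrative only.

```python
import pandas as pd

# Index 2, 3 mimics what is left after filtering rows without resetting the index
hoods = pd.DataFrame({"PostalCode": ["M3A", "M4A"]}, index=[2, 3])

# Made-up coordinates with the default index 0, 1
coords = pd.DataFrame({"Latitude": [10.0, 20.0], "Longitude": [-10.0, -20.0]})

# Labels do not match, so the outer join produces four rows padded with NaN
misaligned = pd.concat([hoods, coords], axis=1)
print(misaligned.shape)

# After resetting the index, labels match and the rows pair up one to one
aligned = pd.concat([hoods.reset_index(drop=True), coords], axis=1)
print(aligned.shape)
```

A row count larger than 103 in the combined dataframe would therefore point to mismatched indexes rather than extra postal codes.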