
<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>IBM Applied Data Science Capstone Course by Coursera</font></h1>
<h2 align=center><font size = 4> Segmenting and Clustering Neighborhoods in Toronto </font></h2>

* Build a dataframe of the postal code of each neighborhood in Toronto along with its borough name.
* Get the geographical coordinates (latitude and longitude) of the neighborhoods in Toronto.
* Explore and cluster the neighborhoods in Toronto by replicating the analysis done on the New York data.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

1. [Download and Explore Dataset](#0)<br>
2. [Explore Neighborhoods in Toronto](#1)<br>
3. [Analyze Each Neighborhood](#2) <br>
4. [Cluster Neighborhoods](#3) <br>
5. [Examine Clusters](#4) <br>
6. [Summary Analysis](#5) <br>
</div>
<hr>

## Assignment Part 1.

## Import libraries

In [44]:
# library to handle vectorized data
import numpy as np

# library for data analysis
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

#library to handle JSON files
import json

#!pip install geopy

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# library to open URLs
import urllib.request

# library to handle HTTP requests
import requests

# library for pulling and parsing data out of HTML and XML files
import bs4 as bs

# transform JSON into a pandas dataframe
# (pandas 1.0+ exposes json_normalize at the top level; the pandas.io.json path is deprecated)
from pandas import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
#!pip install folium
import folium

print("Libraries imported")

Libraries imported


## 1. Download and Explore Dataset <a id="0"></a>


Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, then transform that data into a pandas dataframe.


### Data Wrangling 

In [73]:
# Get data from Wikipedia page and convert to table

source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')

# create lists to hold table columns data

postalcodeList = []
boroughList = []
neighborhoodList =[]

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    if (len(row) > 0):
        postalcodeList.append(row[0].rstrip('\n'))
        boroughList.append(row[1].rstrip('\n'))
        neighborhoodList.append(row[2].rstrip('\n'))

# Create dataframe with three columns: PostalCode, Borough, and Neighborhood

df = pd.DataFrame({"PostalCode":postalcodeList,
                      "Borough": boroughList,
                 "Neighborhood":neighborhoodList})

# Only process the cells that have an assigned borough.
# Ignore cells with a borough that is Not assigned.

df_dropna = df[df.Borough != "Not assigned"].reset_index(drop = True)
df_dropna.head()


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
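
The row-filtering step in the cell above (skip any `tr` that has no `td` cells, strip the trailing newline from each cell) can be checked offline. The following is a minimal, dependency-free sketch using the standard library's `html.parser` rather than BeautifulSoup; the `TableRowParser` class and the sample markup are made up for illustration and are not taken from the Wikipedia page.

```python
from html.parser import HTMLParser

# Made-up markup shaped like a postal code table (not real page content).
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows that contained at least one <td>
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None


parser = TableRowParser()
parser.feed(SAMPLE_HTML)
sample_rows = parser.rows
print(sample_rows)
```

The header row is dropped because it contains no `td` cells, which mirrors the `len(row) > 0` check in the notebook.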


### Group neighborhoods that share a postal code

More than one neighborhood can exist in one postal code area.
For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows are combined into one row, with the neighborhoods separated by a comma.

In [74]:
# group neighborhoods that share a postal code and borough,
# joining their names with a comma and a space

df_grouped = df_dropna.groupby(["PostalCode","Borough"], as_index=False).agg(lambda x: ", ".join(x))
df_grouped.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
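
To see the grouping rule in isolation, here is a small sketch on a toy dataframe in which M5A appears on two rows. The frame name `toy_rows` and its contents are illustrative assumptions, and the separator `", "` is a choice made here for readability.

```python
import pandas as pd

# Toy rows (illustrative only): M5A appears twice, M3A once.
toy_rows = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One output row per (PostalCode, Borough) pair; neighborhood names are joined.
toy_grouped = (
    toy_rows.groupby(["PostalCode", "Borough"], as_index=False)
            .agg({"Neighborhood": ", ".join})
)
print(toy_grouped)
```

`groupby` sorts the keys by default, so M3A comes before M5A in the result, while the neighborhood names keep their original order within each group.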


###  Make "Not assigned" Neighborhoods value equal to Borough

If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.

In [75]:
# Make "Not assigned" Neighborhoods value equal to Borough

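# use a boolean mask so the assignment is written back to df_grouped
# (assigning to a row yielded by iterrows() would only change a copy)
not_assigned = df_grouped["Neighborhood"] == "Not assigned"
df_grouped.loc[not_assigned, "Neighborhood"] = df_grouped.loc[not_assigned, "Borough"]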
df_grouped.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
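
The first five rows shown above contain no "Not assigned" neighborhoods, so they do not demonstrate the rule. A quick way to check it is a toy frame that does contain one; the sketch below uses `Series.mask` as one possible vectorized form. The frame `toy_fill` and its values are assumptions made for illustration.

```python
import pandas as pd

# Toy frame (illustrative only): the first row has no neighborhood assigned.
toy_fill = pd.DataFrame({
    "PostalCode": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Islington Avenue"],
})

# Where the condition is True, take the value from Borough; otherwise keep it.
unassigned = toy_fill["Neighborhood"] == "Not assigned"
toy_fill["Neighborhood"] = toy_fill["Neighborhood"].mask(unassigned, toy_fill["Borough"])
print(toy_fill)
```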


### Print number of rows in clean dataframe

In the last cell of the notebook, use the .shape attribute to print the dimensions of the dataframe; the first value is the number of rows.

In [76]:
# print number of rows in the clean dataframe

df_grouped.shape

(103, 3)

## Assignment Part 2.

Now that the dataframe holds the postal code, borough name, and neighborhood name for each area, we need the latitude and longitude of each postal code in order to use the Foursquare location data. The coordinates can be obtained with the Geocoder package; alternatively, this CSV file lists the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

## 2. Explore Neighborhoods in Toronto <a id="1"></a>


### Load the coordinates from downloaded .csv file

In [77]:
# Load the coordinates from downloaded .csv file

coordinates = pd.read_csv("Geospatial_Coordinates.csv")

# rename the column "PostalCode"

coordinates.rename(columns={"Postal Code":"PostalCode"}, inplace=True)

# merge clean Wikipedia table data with coordinates using PostalCode column

df_c = df_grouped.merge(coordinates, on = "PostalCode", how = "left")
df_c.head(10)


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
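
A left merge keeps every postal code from the Wikipedia table, so any code missing from the coordinates file ends up with NaN latitude and longitude. The sketch below shows one way to surface such rows using the `indicator` and `validate` options of `DataFrame.merge`. The frames `left_codes` and `coords_sample` are toy data, and "M9Z" is a hypothetical code added only to force a mismatch.

```python
import pandas as pd

# Toy inputs (illustrative only); "M9Z" is hypothetical and has no coordinates.
left_codes = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords_sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column saying where each row came from.
merged = left_codes.merge(
    coords_sample, on="PostalCode", how="left",
    validate="one_to_one", indicator=True,
)
missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

An empty `missing` list would mean every postal code received coordinates.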


###  Use geopy library to get Toronto's coordinates

In [78]:
address = 'Toronto'

geolocator = Nominatim(user_agent = "my_application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347
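
geopy's `geocode` returns `None` when the service finds no match, in which case reading `location.latitude` raises an `AttributeError`. The sketch below guards against that case. The helper `coordinates_or_fallback` and the two stub geocoders are hypothetical names introduced here so the example runs offline; in the notebook, `geolocator.geocode` would be passed in place of the stubs.

```python
from types import SimpleNamespace

# Fallback taken from the output printed by the notebook cell above.
TORONTO_FALLBACK = (43.6534817, -79.3839347)


def coordinates_or_fallback(geocode, address, fallback=TORONTO_FALLBACK):
    """Return (latitude, longitude) from geocode(address), or fallback if nothing is found."""
    location = geocode(address)
    if location is None:
        return fallback
    return (location.latitude, location.longitude)


# Stub geocoders standing in for geolocator.geocode (assumptions for this sketch).
def stub_found(address):
    return SimpleNamespace(latitude=43.7, longitude=-79.4)


def stub_not_found(address):
    return None


hit = coordinates_or_fallback(stub_found, "Toronto")
miss = coordinates_or_fallback(stub_not_found, "Nowhere")
print(hit, miss)
```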


###  Create Toronto Neighborhoods' Map.

In [79]:
# create map of Toronto using latitude and longitude

map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 10)

# add markers to map

for lat, lng, borough, neighborhood in zip(df_c['Latitude'], df_c['Longitude'], df_c['Borough'], df_c['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 6,
        popup = label,
        color = 'red',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_toronto)
map_toronto
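
The map above is centered on the geocoded city coordinates with a fixed zoom level. As an alternative sketch, the view can be derived from the marker data itself by computing a bounding box. The frame `marker_sample` below holds three coordinate pairs copied from the merged table, and the folium call is mentioned only in a comment so the block runs without folium installed.

```python
import pandas as pd

# Three coordinate pairs copied from the merged table above.
marker_sample = pd.DataFrame({
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# South-west and north-east corners of the box enclosing all markers.
south_west = [float(marker_sample["Latitude"].min()), float(marker_sample["Longitude"].min())]
north_east = [float(marker_sample["Latitude"].max()), float(marker_sample["Longitude"].max())]
bounds = [south_west, north_east]

# With a folium map, these corners could be applied via map_toronto.fit_bounds(bounds).
print(bounds)
```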

###  Select boroughs that contain the word Toronto

In [80]:
# Select boroughs that contain the word Toronto

borough_names = list(df_c.Borough.unique())

borough_with_toronto = []

for x in borough_names:
    if "toronto" in x.lower():
        borough_with_toronto.append(x)
borough_with_toronto

['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto']

In [81]:
# create a new DataFrame with only boroughs that contain the word Toronto
df_c = df_c[df_c['Borough'].isin(borough_with_toronto)].reset_index(drop = True)
print(df_c.shape)
df_c.head()

(39, 5)


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 2 | M4L | East Toronto | India Bazaar, The Beaches West | 43.668999 | -79.315572 |
| 3 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
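
The two cells above (collect the matching borough names, then filter with `isin`) can also be written as a single boolean mask. The following sketch uses `Series.str.contains` with `case=False` on a toy frame; `borough_sample` and its rows are illustrative assumptions.

```python
import pandas as pd

# Toy frame (illustrative only): two boroughs contain "Toronto", two do not.
borough_sample = pd.DataFrame({
    "PostalCode": ["M1B", "M4E", "M4N", "M3A"],
    "Borough": ["Scarborough", "East Toronto", "Central Toronto", "North York"],
})

# Case-insensitive substring match on the borough name.
is_toronto = borough_sample["Borough"].str.contains("toronto", case=False)
toronto_only = borough_sample[is_toronto].reset_index(drop=True)
print(toronto_only)
```

The explicit loop in the notebook has the advantage of printing the matched borough names first, which makes the filter easy to inspect.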


In [82]:
# create a map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start = 12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_c['Latitude'], 
                                           df_c['Longitude'],
                                           df_c['Borough'],
                                           df_c['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'red',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_toronto)
    
map_toronto