

IBM Applied Data Science Capstone Course by Coursera
Segmenting and Clustering Neighborhoods in Toronto

* Build a dataframe of the postal code of each neighborhood in Toronto, along with the name of its borough.
* Get the geographical coordinates (latitude and longitude) of the neighborhoods in Toronto.
* Explore and cluster the neighborhoods in Toronto by replicating the analysis done on the New York data.

## Assignment Part 1.

## Import libraries

In [28]:
# library to handle vectorized data
import numpy as np

# library for data analysis
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# library to handle JSON files
import json

!pip install geopy

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# library to handle requests
import requests

# library to open URLs and parse URL strings
from urllib import request
import urllib.parse

# library for pulling and parsing data out of HTML and XML files
!pip install bs4
import bs4 as bs


!pip install html5lib
import html5lib

!pip install lxml
import lxml

import html.parser

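# transform JSON into a pandas dataframe
# (json_normalize is exposed at the top level of pandas since version 1.0;
# the old pandas.io.json import path is deprecated)
from pandas import json_normalize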

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
!pip install folium
import folium

print("Libraries imported")

Libraries imported


## 1. Download and Explore Dataset <a id="0"></a>


Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data from its table of postal codes, and load that data into a pandas dataframe.
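To illustrate the idea of pulling cell text out of an HTML table without touching the network, here is a minimal sketch that uses only the standard library. The class name `TableRowParser` and the `SAMPLE_HTML` markup are made up for illustration and are not the real Wikipedia page; the notebook itself uses BeautifulSoup for this step.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup standing in for the postal code table.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Each inner list holds one table row, so a list like `rows` can be passed to `pd.DataFrame` together with a list of column names.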


### Data Wrangling 

In [30]:
# Get data from the Wikipedia page and locate the rows of its first table

source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').read()
# name the parser explicitly (lxml is installed above) instead of the generic 'html'
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')

# create lists to hold table columns data

postalcodeList = []
boroughList = []
neighborhoodList =[]

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    if (len(row) > 0):
        postalcodeList.append(row[0].rstrip('\n'))
        boroughList.append(row[1].rstrip('\n'))
        neighborhoodList.append(row[2].rstrip('\n'))

# Create dataframe with three columns: PostalCode, Borough, and Neighborhood

df = pd.DataFrame({"PostalCode":postalcodeList,
                      "Borough": boroughList,
                 "Neighborhood":neighborhoodList})

# Only process the cells that have an assigned borough.
# Ignore cells with a borough that is Not assigned.

df_dropna = df[df.Borough != "Not assigned"].reset_index(drop = True)
df_dropna.head()


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


### Group neighborhoods that share a postal code

More than one neighborhood can exist in one postal code area.
For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma.
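As a small sketch of this combining step, the snippet below groups a made-up three-row frame by postal code and borough and joins the neighborhood names. The `sample` data is invented for illustration, and the `", "` separator is one possible choice.

```python
import pandas as pd

# Made-up rows imitating a postal code that is listed twice.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One output row per (PostalCode, Borough) pair; the neighborhood
# names of each group are joined into a single string.
combined = sample.groupby(["PostalCode", "Borough"], as_index=False).agg(
    {"Neighborhood": ", ".join}
)
print(combined)
```

Grouping on both `PostalCode` and `Borough` keeps the borough as a regular column in the result instead of dropping it or trying to join it as text.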

In [31]:
# combine neighborhoods that share a postal code (and borough) into one comma-separated string

df_grouped = df_dropna.groupby(["PostalCode","Borough"], as_index=False).agg(lambda x:",".join(x))
df_grouped.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


### Make "Not assigned" Neighborhood values equal to Borough

If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
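One way to apply this rule is a boolean mask combined with `.loc`, which writes to the dataframe itself. The sketch below uses a made-up two-row frame; the `sample` data and the `unassigned` variable name are illustrative only.

```python
import pandas as pd

# Made-up rows: the first has a borough but no assigned neighborhood.
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# True for rows whose neighborhood is missing.
unassigned = sample["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only.
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```

Rows that already have a neighborhood are left untouched because the mask is `False` for them.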

In [32]:
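# Make "Not assigned" Neighborhoods value equal to Borough
# (rows yielded by iterrows() are copies, so assign through .loc instead)

not_assigned = df_grouped["Neighborhood"] == "Not assigned"
df_grouped.loc[not_assigned, "Neighborhood"] = df_grouped.loc[not_assigned, "Borough"]
df_grouped.head()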

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


### Print number of rows in clean dataframe

In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [33]:
# print number of rows in the clean dataframe

df_grouped.shape

(103, 3)
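Since `.shape` is a `(rows, columns)` tuple, the row count alone can be read from its first element. A minimal sketch on a made-up frame (the `sample` data is illustrative only):

```python
import pandas as pd

# Made-up frame with three rows and two columns.
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})

# Unpack the tuple to get the row and column counts separately.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

`len(sample)` gives the same row count when the number of columns is not needed.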