<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City</font></h1>
<h1 align=center><font size = 5>Moaeed Sajid - May 2020</font></h1>
<h1 align=center><font size = 5>Part 1</font></h1>

## Introduction

In this assignment, we will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike the New York data, the Toronto neighborhood data is not readily available on the internet in a structured form, so it has to be scraped and cleaned first. This requires me to be agile and to pick up new libraries and tools quickly as the project demands.

*Part 1*

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. We will scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

*Part 2*

Import the latitude and longitude of each area.

*Part 3*

Once the data is in a structured format, replicate the analysis applied to the New York City dataset, to explore and cluster the neighborhoods in the city of Toronto.

*Viewing in GitHub*

Some of the content in these notebooks may not display within GitHub, so feel free to view the code on nbviewer:

https://nbviewer.jupyter.org/github/moaeedsajid/Coursera_Capstone/blob/master/Week3_Part1.ipynb 
https://nbviewer.jupyter.org/github/moaeedsajid/Coursera_Capstone/blob/master/Week3_Part2.ipynb 
https://nbviewer.jupyter.org/github/moaeedsajid/Coursera_Capstone/blob/master/Week3_Part3.ipynb

### Import Wikipedia Page


In [20]:
#!pip install Beautifulsoup4 
#!pip install lxml

from bs4 import BeautifulSoup
import lxml
import requests

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text

soup = BeautifulSoup(source, 'lxml')
soup


### Create the dataframe

The dataframe will consist of three columns: PostalCode, Borough, and Neighbourhood.

In [7]:
import pandas as pd

codeTable = soup.find('table')

entries = [] # Used for splitting each row 
postalCode = []
borough = []
neighbourhood = []


# Split the text of each table row (<tr>) on newlines
print("Splitting each table entry")
for tr in codeTable.find_all('tr'):
    entries.append(tr.text.split('\n'))

print("First 5 Entries")
display(entries[0:5])

for entry in entries:
    postalCode.append(entry[1])
    borough.append(entry[3])
    neighbourhood.append(entry[5])

# Dictionary mapping DF column names to their values
dfDict = {'PostalCode' : postalCode, 'Borough' : borough, 'Neighbourhood' : neighbourhood}


dfToronto = pd.DataFrame(dfDict) # Build the DF from the column lists
dfToronto.drop([0], inplace = True) # First row holds the header text, drop it

print()
print("Converted to DataFrame")
display(dfToronto)




Splitting each table entry
First 5 Entries


[['', 'Postal Code', '', 'Borough', '', 'Neighborhood', ''],
 ['', 'M1A', '', 'Not assigned', '', '', ''],
 ['', 'M2A', '', 'Not assigned', '', '', ''],
 ['', 'M3A', '', 'North York', '', 'Parkwoods', ''],
 ['', 'M4A', '', 'North York', '', 'Victoria Village', '']]


Converted to DataFrame


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 1 | M1A | Not assigned | |
| 2 | M2A | Not assigned | |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 176 | M5Z | Not assigned | |
| 177 | M6Z | Not assigned | |
| 178 | M7Z | Not assigned | |
| 179 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |
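
The indices 1, 3 and 5 used in the loop above follow from how each row's text splits on newlines. A minimal sketch, using a hand-written string as a stand-in for one row's `.text` (an assumption, modelled on the entries printed above):

```python
# Hypothetical row text, modelled on the entries printed above
row_text = "\nM3A\n\nNorth York\n\nParkwoods\n"

parts = row_text.split('\n')
print(parts)

# The odd positions hold the three fields; the even ones are empty separators
postal_code, borough, neighbourhood = parts[1], parts[3], parts[5]
print(postal_code, borough, neighbourhood)
```

If the page layout changes so that cells no longer sit on alternating lines, these fixed positions would need to be revisited.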


### Remove not assigned Boroughs

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.



In [8]:
print(" \nDrop any Not Assigned Boroughs \n")
dfToronto.drop(dfToronto.index[dfToronto.Borough == 'Not assigned'], inplace = True)
display(dfToronto)


Drop any Not Assigned Boroughs 



| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 6 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 7 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 161 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 166 | M4Y | Downtown Toronto | Church and Wellesley |
| 169 | M7Y | East Toronto | Business reply mail Processing Centre |
| 170 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
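
The same filtering can also be written as a boolean mask that keeps the assigned rows instead of dropping the unassigned ones. A minimal sketch on toy data (the three rows below are illustrative, not the full scraped table):

```python
import pandas as pd

# Toy data (hypothetical), shaped like the table scraped above
df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['', 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows whose borough is assigned; the original index labels are preserved
assigned = df[df['Borough'] != 'Not assigned']
print(assigned)
```

This returns a new dataframe rather than modifying `df` in place, which can make it easier to re-run the cell without side effects.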


### Check for duplicate postal codes

More than one neighborhood can exist in one postal code area. For example, M5A covers two neighborhoods: Regent Park and Harbourfront. Such neighborhoods should share a single row, separated by commas, as in the M5A row (index 5) of the above table. The cell below checks that no postal code appears more than once.

In [9]:
print("\nCheck for any Duplicate PostCodes")
if dfToronto['PostalCode'].nunique() == len(dfToronto.PostalCode):
    print("The", len(dfToronto.PostalCode), "entries are unique.  There are no duplicates, Let's check M5A")
else:
    print("There are", dfToronto['PostalCode'].nunique(),  "unique entries but the length of the table at", len(dfToronto.PostalCode),  "suggests duplicates exist")

print()
print(dfToronto[dfToronto.PostalCode == 'M5A'])

dfToronto.reset_index(drop = True, inplace = True)
dfToronto.replace(to_replace= " /", value = ",", regex=True, inplace=True)
print("\nIndex has been reset and any slashes in Neighbourhood replaced with commas")
display(dfToronto)


Check for any Duplicate PostCodes
The 103 entries are unique.  There are no duplicates, Let's check M5A

  PostalCode           Borough              Neighbourhood
5        M5A  Downtown Toronto  Regent Park, Harbourfront

Index has been reset and any slashes in Neighbourhood replaced with commas


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
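
No duplicates were found here, so no rows had to be merged. If a postal code did appear on several rows, one possible way to combine them is a `groupby` that joins the neighbourhood names. A minimal sketch on toy data (the duplicated M5A rows below are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical): M5A appears on two rows, one per neighbourhood
df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Group rows sharing a postal code and borough, then join their neighbourhoods
combined = (
    df.groupby(['PostalCode', 'Borough'], as_index=False, sort=False)['Neighbourhood']
      .agg(', '.join)
)
print(combined)
```

Passing `sort=False` keeps the groups in order of first appearance instead of sorting them by postal code.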


### Checking for not assigned neighbourhoods

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [10]:
print("\nChecking for any Not Assigned Neighbourhoods")
print("There are", len(dfToronto[dfToronto.Neighbourhood == 'Not assigned']), "Not Assigned Neighbourhoods.  The data is clean") # 'Not assigned' matches the capitalisation used in the table


Checking for any Not Assigned Neighbourhoods
There are 0 Not Assigned Neighbourhoods.  The data is clean
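
Since no such rows were found, the fallback rule (neighbourhood takes the borough name) was never applied. A minimal sketch of how it could be applied, on toy data; treating an empty string as unassigned is an assumption of this sketch, not something the assignment states:

```python
import pandas as pd

# Toy data (hypothetical): one row has a borough but no assigned neighbourhood
df = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Assumption: both 'Not assigned' and an empty string count as missing
missing = df['Neighbourhood'].isin(['Not assigned', ''])

# Copy the borough name into the neighbourhood column for those rows only
df.loc[missing, 'Neighbourhood'] = df.loc[missing, 'Borough']
print(df)
```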


### Shape

In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

I will also save this dataframe to CSV so it can be used in Part 2.


In [11]:
dfToronto.to_csv('dfToronto.csv', index = False, header = True)

print("\nThe shape of our final table is", dfToronto.shape)


The shape of our final table is (103, 3)
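
When the file is reloaded in Part 2, a quick check can confirm that the shape survived the round trip, including neighbourhood values that contain commas. A minimal sketch, using toy data and an in-memory buffer in place of `dfToronto.csv` (both are assumptions for illustration):

```python
import io
import pandas as pd

# Toy data (hypothetical); the second neighbourhood value contains a comma
df = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park, Harbourfront'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind and read the CSV back, as Part 2 would do from the saved file
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.shape)
```

`to_csv` quotes fields that contain the delimiter, so the comma-separated neighbourhood names stay in a single column.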
