# Segmenting and Clustering Neighborhoods in Toronto

<b>IBM Data Science Specialization - Coursera</b><br>
Week 3 assignment

Md Ibtehajul Islam<br>
Email: iibtehajul@gmail.com<br>
LinkedIn: islam-md-ibtehajul<br>

### Importing the libraries

In [21]:
import pandas as pd
import numpy as np
import requests
import math
from bs4 import BeautifulSoup

# Part 1: Scraping Toronto neighborhood data and building a clean dataframe

## Extracting the dataset from the Wikipedia link

#### Requesting the page from Wikipedia:

In [43]:

wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikipedia_page = requests.get(wikipedia_link)
page = wikipedia_page.text


## Scraping the HTML
The data we want is in a table with three columns: PostalCode, Borough and Neighborhood.<br>
The table contains a list of postal codes in Canada where the first letter is M. Postal codes beginning with M are located within the city of Toronto in the province of Ontario. Only the first three characters are listed, corresponding to the Forward Sortation Area.
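
As a small illustration of the Forward Sortation Area format described above (a letter, a digit, a letter), the sketch below checks whether a string looks like a Toronto FSA. The helper name `is_toronto_fsa` and the pattern are illustrative assumptions and are not part of the assignment.

```python
import re

# Assumed pattern: 'M', then one digit, then one uppercase letter (e.g. 'M5A').
FSA_M_PATTERN = re.compile(r'M\d[A-Z]')

def is_toronto_fsa(code):
    """Return True if `code` is a three-character FSA starting with 'M'."""
    return FSA_M_PATTERN.fullmatch(code.strip().upper()) is not None

print(is_toronto_fsa('M5A'))      # a Toronto FSA
print(is_toronto_fsa('K1A'))      # an FSA outside the 'M' range
print(is_toronto_fsa('M5A 1A1'))  # a full postal code, not just the FSA
```

A check like this could be applied to the scraped `Postalcode` column to catch rows that were picked up from outside the table.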

In [44]:

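soup = BeautifulSoup(page, 'html.parser')
match = soup.find_all('tr')
# Drop the first (header) row and the last five rows
results = match[1:-5]
# Quick inspection: the last kept row, and one neighbourhood cell with its trailing newline stripped
results[-1]
results[0].contents[5].text[0:-1]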

records = []
for result in results:
    postalcode = result.contents[1].text
    borough = result.contents[3].text
    neighbourhood = result.contents[5].text[0:-1]
    records.append((postalcode, borough, neighbourhood))
    
df = pd.DataFrame(records, columns=['Postalcode', 'Borough', 'Neighbourhood'])
df.to_csv('List of postal codes of Canada.csv', index=False)
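
The positional `.contents[1]`, `.contents[3]`, `.contents[5]` indexing above depends on the whitespace nodes between cells and on slicing a fixed number of rows. As a hedged alternative, the sketch below selects the `<td>` cells of each row explicitly and skips any row that does not have exactly three of them. The HTML string is a hand-written stand-in for the Wikipedia table, and `parse_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written sample standing in for the real page (an assumption for illustration).
sample_html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

def parse_rows(html):
    """Return (postalcode, borough, neighbourhood) tuples from three-cell rows."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) != 3:
            # Header rows use <th>, and unrelated rows have a different cell count
            continue
        # get_text(strip=True) removes the trailing newline in the last cell
        records.append(tuple(cell.get_text(strip=True) for cell in cells))
    return records

sample_records = parse_rows(sample_html)
print(sample_records)
```

On the real page, other tables may also contain three-cell rows, so the result would still need a quick inspection before it is written to CSV.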


## Reading the created csv file

In [45]:
df = pd.read_csv('List of postal codes of Canada.csv')
df.head()

| | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


## Data Wrangling: 

In [46]:
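# Replace every 'Not assigned' borough with NaN.
# Assigning the result back avoids a chained in-place replace, which newer pandas versions warn about.
df['Borough'] = df['Borough'].replace('Not assigned', np.nan)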

In [47]:
df = df.dropna()

In [48]:
df.shape

(211, 3)

In [49]:
df.head()

| | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
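
The replace-then-drop steps above can also be expressed as a single boolean filter, which avoids the intermediate NaN values. This is a minimal sketch on a toy frame whose rows are copied from the `df.head()` output shown earlier. The names `toy` and `assigned` are illustrative only.

```python
import pandas as pd

# Toy frame built from the first rows of the scraped table
toy = pd.DataFrame({
    'Postalcode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows whose Borough is assigned; the original index labels are preserved
assigned = toy[toy['Borough'] != 'Not assigned']
print(assigned)
```

Unlike `df.dropna()`, this filter looks only at the Borough column, so rows with missing values in other columns would be kept.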


## Grouping the neighbourhood by Postalcode and Borough

In [50]:
df = df.groupby(['Postalcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df.head()

| | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
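
To make the effect of the grouping step easier to see, here is a small sketch on three rows taken from the earlier output, where M5A appears twice. It uses `agg` instead of `apply`, which is an equivalent choice for this case. The names `toy` and `grouped` are illustrative.

```python
import pandas as pd

# M5A has two neighbourhood rows, M6A has one
toy = pd.DataFrame({
    'Postalcode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# Join all neighbourhoods that share a postal code and borough into one string
grouped = (
    toy.groupby(['Postalcode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(grouped)
```

After grouping, each postal code appears on exactly one row, which is what the later merge on postal code relies on.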


## Dealing with 'Not assigned' neighborhoods
For M7A (Queen's Park), no neighborhood is assigned. We replace the 'Not assigned' value with the value of the corresponding Borough.

In [51]:
df.iloc[85]

Postalcode                M7A
Borough          Queen's Park
Neighbourhood    Not assigned
Name: 85, dtype: object

In [52]:
df_n = df.Neighbourhood == 'Not assigned'
df.loc[df_n, 'Neighbourhood'] = df.loc[df_n, 'Borough']
df[df_n]

| | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Queen's Park |
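
The same substitution can be sketched with `Series.mask`, which replaces values where a condition is true and leaves the rest untouched. The toy frame below reuses two rows from the outputs above; it is an illustration, not the method used in the assignment.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postalcode': ['M7A', 'M1G'],
    'Borough': ["Queen's Park", 'Scarborough'],
    'Neighbourhood': ['Not assigned', 'Woburn'],
})

# Where Neighbourhood is 'Not assigned', take the Borough value; otherwise keep it as is
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'] == 'Not assigned', toy['Borough']
)
print(toy)
```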


## Checking the shape of the Dataframe

In [53]:
df.shape

(103, 3)

# Part 2: Adding the latitude and longitude to the dataframe


In [54]:
latlong_df = pd.read_csv(r'http://cocl.us/Geospatial_data')


In [34]:
latlong_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [35]:
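# df names its key column 'Postalcode' while latlong_df uses 'Postal Code'; align the name before merging
new_df = pd.merge(df.rename(columns={'Postalcode': 'Postal Code'}), latlong_df, on='Postal Code')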
new_df.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [36]:
new_df.shape

(103, 5)
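
Because the row count is the main check on this merge, it may help to have pandas verify the join itself. The sketch below, on two rows copied from the outputs above, uses the `validate` and `indicator` options of `pd.merge` to confirm that keys are unique on both sides and that every postal code found coordinates. The frame names `left`, `right`, `merged` and `unmatched` are illustrative.

```python
import pandas as pd

left = pd.DataFrame({
    'Postalcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

merged = pd.merge(
    left.rename(columns={'Postalcode': 'Postal Code'}),  # align the key column name
    right,
    on='Postal Code',
    how='left',             # keep every neighbourhood row, even without coordinates
    validate='one_to_one',  # raise if a postal code is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Rows that did not find coordinates; expected to be empty here
unmatched = merged[merged['_merge'] != 'both']
print(merged)
print(len(unmatched))
```

With a left join, a missing coordinate shows up as an explicit `left_only` row instead of silently reducing the row count, as an inner join would.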