# Segmenting and Clustering Neighborhoods in the city of Toronto, Canada

## Part 1, Week 3: IBM Data Science - Applied Data Science Capstone

This assignment explores, segments, and clusters the neighborhoods of Toronto based on postal code and borough information. Part 1 scrapes the Toronto postal codes (those beginning with M) from the Wikipedia page "List of postal codes of Canada: M" and cleans the data for the clustering step.

### **PART 1**

**1. Installing and importing libraries**

In [1]:
# This code installs the required libraries
!pip install beautifulsoup4
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

print('Libraries installed successfully!')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries installed successfully!


In [2]:
# This code imports the required libraries
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

print('Libraries imported successfully!')

Libraries imported successfully!


**2. Scraping the list of Canadian postal codes (prefix M) from Wikipedia**

In [3]:
# This code requests the Wikipedia page and stops with an error if the request failed
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
response.raise_for_status()
results = response.text

# This code parses the HTML document with BeautifulSoup and selects the first table on the page
Canada_data = BeautifulSoup(results, 'html.parser')
wikipedia_table = Canada_data.find('table')

# This code creates an empty pandas DataFrame with the target columns; the rows are filled in the next cell
column_names = ['Postal Code', 'Borough', 'Neighborhood']
df = pd.DataFrame(columns=column_names)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood


In [4]:
# This code iterates over every table row and appends each row that has
# exactly three cells (postal code, borough, neighborhood) to the DataFrame
for tr_cell in wikipedia_table.find_all('tr'):
    row_data = []
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data) == 3:
        df.loc[len(df)] = row_data

# This code displays the first 12 rows of the DataFrame
df.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
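
Appending with `df.loc[len(df)]` adds one row at a time, which can be slow for larger tables. A minimal sketch of an alternative, assuming the same three-cell row layout: collect the rows in a plain list and build the DataFrame once. The inline `sample_html` string is a made-up stand-in for the Wikipedia table so the example runs offline, and `parse_postal_table` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up sample standing in for the Wikipedia table
# (assumption: each data row has exactly three <td> cells)
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""

def parse_postal_table(html):
    """Hypothetical helper: build a DataFrame from the first <table> in html."""
    table = BeautifulSoup(html, 'html.parser').find('table')
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.text.strip() for td in tr.find_all('td')]
        # The header row contains only <th> cells, so it yields no <td> and is skipped
        if len(cells) == 3:
            rows.append(cells)
    # Build the DataFrame once from the collected rows
    return pd.DataFrame(rows, columns=['Postal Code', 'Borough', 'Neighborhood'])

sample_df = parse_postal_table(sample_html)
print(sample_df)
```

The row filter is the same `len(...) == 3` test used in the cell above; only the way the rows are accumulated differs.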


**3. Data cleaning**

In [5]:
# This code removes the rows whose borough is "Not assigned" and renumbers the rows;
# reset_index() keeps the old row numbers in a new 'index' column
df = df[~df['Borough'].str.contains("Not assigned")].reset_index()

# This code displays the first 12 rows of the df DataFrame
df.head(12)

Unnamed: 0,index,Postal Code,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,9,M1B,Scarborough,"Malvern, Rouge"
7,11,M3B,North York,Don Mills
8,12,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [6]:
# This code removes the leftover 'index' column created by reset_index()
df.drop(['index'], axis = 1, inplace = True)

# This code displays the first 12 rows of the cleaned df DataFrame
df.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
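
The filtering in cell 5 and the column drop in cell 6 can probably be combined into one step. The sketch below assumes the same column layout and uses a small made-up `raw` DataFrame rather than the scraped data. It swaps in an exact-match comparison (`!=`) for `str.contains`, which is slightly stricter, and passes `drop=True` to `reset_index` so that no `index` column is created in the first place.

```python
import pandas as pd

# Made-up sample rows in the same layout as df (not the full Wikipedia data)
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows with an assigned borough; drop=True discards the old
# row numbers instead of inserting them as a new 'index' column
cleaned = raw[raw['Borough'] != 'Not assigned'].reset_index(drop=True)
print(cleaned)
```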


In [7]:
# This code prints the df DataFrame dimensions
print('The df DataFrame shape is:', df.shape)

The df DataFrame shape is: (103, 3)


The df DataFrame contains 103 postal codes (rows) and 3 columns: Postal Code, Borough, and Neighborhood.
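
Beyond checking the shape, a few quick assertions can help confirm that the cleaning behaved as intended. The following is only a sketch on made-up sample rows: `summarize_checks` is a hypothetical helper, and the pattern `M` + digit + letter is an assumption based on the postal codes shown in the outputs above.

```python
import pandas as pd

# Made-up sample in the same layout as the cleaned df
sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

def summarize_checks(frame):
    """Hypothetical helper: return a dict of simple data-quality checks."""
    return {
        # No row should still carry the placeholder borough
        'no_unassigned_borough': not (frame['Borough'] == 'Not assigned').any(),
        # Each postal code should appear only once
        'unique_postal_codes': frame['Postal Code'].is_unique,
        # Assumed format: the letter M, one digit, one uppercase letter
        'valid_code_format': bool(frame['Postal Code'].str.match(r'^M\d[A-Z]$').all()),
    }

checks = summarize_checks(sample)
print(checks)
```

Calling `summarize_checks(df)` on the cleaned DataFrame would run the same checks on the scraped data.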