# Exploring and clustering Toronto neighborhoods

In this project we'll gather the Toronto neighborhoods by scraping the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, and then clean the data.

- Let's start by importing our usual data analysis packages.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium

plt.style.use('ggplot')
%matplotlib inline

## Gathering The Data

> In this step we'll scrape the postal codes, boroughs, and neighborhoods from <a href='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'>Wikipedia</a>, using the pandas `read_html` function to read the tables contained in the page. We pass `na_values='Not assigned'` so that cells marked 'Not assigned' are read as missing values (NaN).

> The web page contains multiple tables; we are interested only in the first one, which contains the Toronto data.

In [2]:
## setting the link to the wikipedia webpage
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

## reading the tables in the webpage and assigning the first DataFrame to scraped_df
scraped_df = pd.read_html(URL, na_values = 'Not assigned')[0]

## checking the DataFrame
scraped_df.head()

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
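
> The `na_values` argument makes pandas treat the literal string 'Not assigned' as missing. As a rough sketch of the same effect, the toy frame below (made-up rows that only mimic the scraped table) swaps the placeholder for NaN after loading with `DataFrame.replace`. This is an illustrative alternative, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the scraped table before any NaN conversion
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Replace the placeholder string with NaN so pandas treats it as missing
toy_clean = toy.replace('Not assigned', np.nan)

# Count the missing values in each column
missing_per_column = toy_clean.isna().sum()
print(missing_per_column)
```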


## Cleaning The Data

> First we'll create a copy of our DataFrame so that changes do not affect the original one.

In [3]:
Toronto = scraped_df.copy()
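
> Why `.copy()` rather than plain assignment? A minimal sketch on made-up data (the names `original`, `alias`, and `independent` are illustrative only): plain assignment gives a second name for the same object, while `.copy()` gives a separate DataFrame.

```python
import pandas as pd

# Made-up data, for illustration only
original = pd.DataFrame({'Borough': ['North York', 'Etobicoke']})

alias = original               # a second name for the same object
independent = original.copy()  # a separate DataFrame with its own data

# Editing the copy leaves the original untouched
independent.loc[0, 'Borough'] = 'changed'

print(alias is original)
print(original.loc[0, 'Borough'])
print(independent.loc[0, 'Borough'])
```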

> Let's check our data.

In [5]:
## Checking the dtype and non-null count of each column
Toronto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Postal Code   180 non-null    object
 1   Borough       103 non-null    object
 2   Neighborhood  103 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


In [6]:
## Checking the unique Toronto boroughs
Toronto['Borough'].unique()

array([nan, 'North York', 'Downtown Toronto', 'Etobicoke', 'Scarborough',
       'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

> Because we will process only the rows that have an assigned borough, let's remove the rows whose borough is missing (originally 'Not assigned') from our table.

In [7]:
Toronto.dropna(subset=['Borough'], inplace= True)
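
> A small sketch of what `dropna(subset=...)` does, on a made-up three-row frame. It uses the non-mutating form instead of `inplace=True`, and it also shows `reset_index(drop=True)`, an optional extra step not used in this notebook, which renumbers the rows from 0 because `dropna` keeps the original row labels.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the scraped table
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': [np.nan, 'North York', 'North York'],
    'Neighborhood': [np.nan, 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows whose Borough is not missing (returns a new frame)
kept = toy.dropna(subset=['Borough'])

# Optional: renumber the surviving rows from 0
renumbered = kept.reset_index(drop=True)

print(list(kept.index))
print(list(renumbered.index))
```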

> Let's check our data again to confirm.

In [8]:
Toronto.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 2 to 178
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Postal Code   103 non-null    object
 1   Borough       103 non-null    object
 2   Neighborhood  103 non-null    object
dtypes: object(3)
memory usage: 3.2+ KB


> Are there any duplicated postal codes in our data?

In [9]:
Toronto['Postal Code'].duplicated().sum()

0
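
> For reference, `duplicated()` flags every occurrence of a value after the first, so summing the flags counts the extra repeats. A sketch on a made-up Series that does contain repeats:

```python
import pandas as pd

# Made-up postal codes with repeats, for illustration only
codes = pd.Series(['M3A', 'M4A', 'M3A', 'M5A', 'M3A'])

# True for the second and later occurrences of a value
dup_mask = codes.duplicated()
n_repeats = int(dup_mask.sum())

# keep=False flags every row that shares its value with another row
all_involved = codes[codes.duplicated(keep=False)]

print(n_repeats)
print(list(all_involved.index))
```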

> How many postal codes do we have in our dataset?

In [10]:
len(Toronto['Postal Code'].unique())

103

> How many boroughs are there?

In [11]:
len(Toronto.Borough.unique())

10
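
> A note on counting distinct values: `unique()` includes NaN in its result, so `len(...unique())` is only safe here because the missing boroughs were already dropped. `nunique()` ignores NaN by default. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up boroughs including one missing value
boroughs = pd.Series([np.nan, 'North York', 'York', 'North York'])

# unique() keeps NaN as one of the distinct values
with_nan = len(boroughs.unique())

# nunique() skips NaN unless dropna=False is passed
without_nan = boroughs.nunique()

print(with_nan, without_nan)
```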

> What is the shape of our cleaned DataFrame?

In [12]:
Toronto.shape

(103, 3)
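
> The individual checks above could be bundled into one function. The helper below, `check_cleaned`, is a hypothetical name and not part of the notebook; it is sketched on made-up rows and assumes the same column names as our table.

```python
import pandas as pd

def check_cleaned(df):
    """Hypothetical helper: fail loudly if the frame is not clean, else return its shape."""
    assert df['Borough'].notna().all(), 'found a row without a borough'
    assert df['Postal Code'].is_unique, 'found a duplicated postal code'
    return df.shape

# Made-up rows that mimic the cleaned table
toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

shape = check_cleaned(toy)
print(shape)
```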