# Segmenting and Clustering Neighborhoods in Toronto Part 1

## Obtaining the postal codes 

First, we need to parse the wiki page and transform the postal codes into a dataframe.

In [1]:
!pip3 install pandas lxml



In [2]:
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

toronto_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


The dataframe should consist of three columns: PostalCode, Borough, and Neighborhood. 
Therefore, we'll rename "Postal Code" to "PostalCode" and "Neighbourhood" to "Neighborhood"

In [3]:
# Rename both columns in a single call
toronto_df.rename(columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'}, inplace=True)


toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Print the column values:

In [4]:
list(toronto_df.columns.values)

['PostalCode', 'Borough', 'Neighborhood']

We only process the rows that have an assigned borough. First, check the current size of the dataframe:

In [5]:
toronto_df.shape

(180, 3)

Remove the rows whose borough is "Not assigned":

In [6]:
# Keep only rows with an assigned borough and renumber the index from 0
toronto_df_filtered = toronto_df[toronto_df['Borough'] != 'Not assigned'].reset_index(drop=True)
toronto_df_filtered.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [7]:
toronto_df_filtered.shape

(103, 3)
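
The filtering step can be sanity-checked on a small sample. The sketch below is a minimal illustration, assuming a hypothetical five-row `sample_df` built from the rows shown in the earlier output rather than the full table. It checks that the number of rows dropped equals the number of rows whose borough is "Not assigned".

```python
import pandas as pd

# Hypothetical five-row sample mirroring the head of the scraped table.
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York',
                'North York', 'Downtown Toronto'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods',
                     'Victoria Village', 'Regent Park, Harbourfront'],
})

# Count the rows that the filter is expected to drop.
unassigned_count = (sample_df['Borough'] == 'Not assigned').sum()

# Keep only assigned boroughs and renumber the index from 0.
sample_filtered = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)

# The row counts before and after should differ by exactly unassigned_count.
assert len(sample_df) - len(sample_filtered) == unassigned_count
print(unassigned_count, sample_filtered.shape)
```

Applied to the notebook's data, the same comparison would use `toronto_df` and `toronto_df_filtered` in place of the sample frames.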

Filtering reduced the row count from 180 to 103. Next, we check whether there are any duplicated postal codes:

In [8]:
toronto_df_filtered['PostalCode'].value_counts()

M5C    1
M7A    1
M1W    1
M4V    1
M2K    1
M3C    1
M1E    1
M6P    1
M4X    1
M4M    1
M5V    1
M1J    1
M6R    1
M9N    1
M3L    1
M4R    1
M9V    1
M1H    1
M9A    1
M6S    1
M5S    1
M1R    1
M5W    1
M1C    1
M1K    1
M9W    1
M5X    1
M5P    1
M4G    1
M6N    1
M6J    1
M1T    1
M1X    1
M5B    1
M5E    1
M3H    1
M1L    1
M5N    1
M5G    1
M1M    1
M9B    1
M5K    1
M4K    1
M6H    1
M1S    1
M3J    1
M9M    1
M4T    1
M6A    1
M6M    1
M1V    1
M2P    1
M5L    1
M9P    1
M7R    1
M6B    1
M5T    1
M8V    1
M5M    1
M1N    1
M4A    1
M3N    1
M7Y    1
M5R    1
M9C    1
M4B    1
M6K    1
M4S    1
M2R    1
M8Z    1
M5A    1
M6L    1
M1P    1
M3A    1
M2N    1
M8Y    1
M1G    1
M2H    1
M9R    1
M4E    1
M9L    1
M4H    1
M4L    1
M8W    1
M4Y    1
M4C    1
M3K    1
M1B    1
M4J    1
M4W    1
M5J    1
M2J    1
M6E    1
M3B    1
M5H    1
M2M    1
M3M    1
M6C    1
M8X    1
M6G    1
M2L    1
M4P    1
M4N    1
Name: PostalCode, dtype: int64

Every postal code has a count of 1, and the number of entries (103) matches the row count, so all postal codes are unique.
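
A more direct uniqueness check avoids reading through the whole `value_counts()` listing. The sketch below uses `Series.is_unique` and `Series.duplicated()` on two hypothetical Series (`codes` and `codes_with_dupe` are illustrative names and values, not part of the notebook).

```python
import pandas as pd

# Hypothetical postal codes; 'M3A' is repeated on purpose in the second Series.
codes = pd.Series(['M3A', 'M4A', 'M5A'], name='PostalCode')
codes_with_dupe = pd.Series(['M3A', 'M4A', 'M3A'], name='PostalCode')

# is_unique is True only when no value occurs more than once.
print(codes.is_unique)
print(codes_with_dupe.is_unique)

# duplicated(keep=False) flags every occurrence of a repeated value,
# which makes it easy to list the offending codes.
repeated = codes_with_dupe[codes_with_dupe.duplicated(keep=False)]
print(repeated.tolist())
```

On the notebook's data, the equivalent check would be `toronto_df_filtered['PostalCode'].is_unique`.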

Any neighborhood marked "Not assigned" should take the name of its borough. Let's check whether such rows exist:

In [9]:
toronto_df_filtered[toronto_df_filtered['Neighborhood'] == 'Not assigned'].value_counts()

Series([], dtype: int64)

The result is empty, which means no row has a "Not assigned" neighborhood. So, the dataframe is already formatted as required.
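
If rows with an unassigned neighborhood had existed, the borough name could have been copied over roughly as follows. This is a sketch on a hypothetical two-row `sample_df` (the second row, with an unassigned neighborhood, is invented for illustration), using `Series.mask` to replace values where the condition holds.

```python
import pandas as pd

# Hypothetical sample: the second row has no neighborhood assigned.
sample_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Boolean mask of the rows that need a fallback value.
unassigned = sample_df['Neighborhood'] == 'Not assigned'

# Where the mask is True, take the value from 'Borough'; otherwise keep the original.
sample_df['Neighborhood'] = sample_df['Neighborhood'].mask(unassigned, sample_df['Borough'])
print(sample_df)
```

Rows that already have a neighborhood are left unchanged, so the step is safe to run even when, as here, nothing needs replacing.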

In [10]:
row, col = toronto_df_filtered.shape

print('The row size is', row, 'and the column size is', col)

The row size is 103 and the column size is 3


In [11]:
row

103