# Segmenting and Clustering Neighborhoods in Toronto, Canada

## Build DataFrame

Build a DataFrame with the **PostalCode**, **Borough** and **Neighbourhood** columns for the city of Toronto, Canada. The data was prepared as a CSV file based on the information from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

The resulting DataFrame should look like this:

<img src = 'attachment:ee33dd20-f59a-45fc-8a2b-1ca9f756aef2.png' width = '600'>

In [72]:
import pandas as pd 

df = pd.read_csv("Toronto.csv")
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Check the shape and summary statistics of the DataFrame:

In [73]:
df.shape

(180, 3)

In [74]:
df.describe()

Unnamed: 0,PostalCode,Borough,Neighbourhood
count,180,180,180
unique,180,14,100
top,M1M,Not assigned,Not assigned
freq,1,77,77
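
For text columns, `describe()` reports the number of non-null entries (`count`), the number of distinct values (`unique`), the most frequent value (`top`) and how often it occurs (`freq`). The sketch below cross-checks those figures with `nunique()` and `value_counts()` on a small illustrative frame; the name `toy` and its values are assumptions made for the example, not the notebook's data.

```python
import pandas as pd

# Illustrative stand-in for the Borough column (not the real data)
toy = pd.DataFrame({
    "Borough": ["Not assigned", "North York", "Not assigned", "Scarborough"],
})

# describe() on an object column yields count, unique, top and freq
summary = toy.describe()

# The same figures computed directly
counts = toy["Borough"].value_counts()
n_unique = toy["Borough"].nunique()
top_value = counts.idxmax()
top_freq = counts.max()

print(summary)
print(n_unique, top_value, top_freq)
```

Applied to the real `df`, the same calls should agree with the `unique`, `top` and `freq` rows shown above.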


## Missing values

Convert "Not assigned" to NaN:

In [75]:
import numpy as np

df.replace("Not assigned", np.nan, inplace=True)
df.head(5)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
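
As an alternative to replacing the placeholder after loading, `pd.read_csv` can treat it as missing while parsing, via its `na_values` parameter. The following is a minimal sketch; the inline text `sample_csv` is a hypothetical stand-in for `Toronto.csv`, used only so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for Toronto.csv
sample_csv = io.StringIO(
    "PostalCode,Borough,Neighbourhood\n"
    "M1A,Not assigned,Not assigned\n"
    "M3A,North York,Parkwoods\n"
    "M4A,North York,Victoria Village\n"
)

# Parse the placeholder string directly as NaN
sample = pd.read_csv(sample_csv, na_values=["Not assigned"])

# Count missing entries per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

This avoids a separate `replace` step, at the cost of hiding the placeholder from any inspection done right after loading.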


Drop all rows that do not have an assigned borough or neighbourhood:

In [76]:
df.dropna(subset=["Borough", "Neighbourhood"], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Check the shape of the DataFrame again:

In [80]:
df.shape

(103, 3)
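
The new row count is consistent with the earlier summary: 180 rows minus the 77 "Not assigned" rows leaves 103. The sketch below runs the same kind of check on a small illustrative frame, chaining the cleaning steps without `inplace`; the name `toy` and its values are assumptions for the example, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Illustrative frame with two unassigned postal codes (not the real data)
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "North York", "North York", "Not assigned"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village", "Not assigned"],
})

# Count placeholder rows before cleaning
n_unassigned = (toy["Borough"] == "Not assigned").sum()

# Same steps as above, written as one chain that returns a new frame
cleaned = (
    toy.replace("Not assigned", np.nan)
       .dropna(subset=["Borough", "Neighbourhood"])
       .reset_index(drop=True)
)

# Rows remaining should equal the original count minus the unassigned rows
assert len(cleaned) == len(toy) - n_unassigned
print(cleaned.shape)
```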

<a id='item1'></a>
