# Segmenting and Clustering Neighborhoods in Toronto

### IBM Data Science Professional Certificate

Let's import useful packages:

In [None]:
import pandas as pd

### Preprocessing the data

We are going to use the postal codes of Toronto.

Unfortunately, there is no ready-made dataset for them. However, Wikipedia has a table listing them!

Let's obtain it and convert it to a pandas dataframe!

In [None]:
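# read_html returns a list with one dataframe per HTML table found on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
# The first table on the page holds the postal codes
postal_codes = tables[0]
postal_codes.head()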

We don't want to use postal codes that don't have an assigned Borough, so we need to filter our dataframe.

In [None]:
postal_codes = postal_codes[postal_codes['Borough'] != 'Not assigned'].reset_index(drop=True)
print('Our dataframe contains', postal_codes.shape[0], 'rows!')
postal_codes.head()

Some Neighborhoods are not assigned. In order to solve this, we are going to assign the name of the corresponding Borough to those Neighborhoods.

In [None]:
for i in range(len(postal_codes)):
    if postal_codes.loc[i, 'Neighborhood'] == 'Not assigned':
        # Write through .loc so the dataframe itself is updated, not a temporary copy
        postal_codes.loc[i, 'Neighborhood'] = postal_codes.loc[i, 'Borough']
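
The same replacement can also be written without an explicit loop. The following is a minimal sketch on a small made-up dataframe (the `toy` frame and its values are illustrative assumptions, not the Wikipedia data); it uses a boolean mask together with `.loc`:

```python
import pandas as pd

# Toy data standing in for the Wikipedia table (illustrative values only)
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Harbourfront', 'Not assigned', 'Islington Avenue'],
})

# Boolean mask marking the rows whose Neighborhood is missing
missing = toy['Neighborhood'] == 'Not assigned'

# Copy the Borough name into those rows in a single assignment
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

print(toy['Neighborhood'].tolist())
```

Because the assignment goes through `.loc` on the dataframe itself, it should not trigger the chained-assignment warning that `df[col][i] = ...` can produce.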

We need to check that the code worked:

In [None]:
print('Number of rows now: ', postal_codes.shape[0])
postal_codes = postal_codes[postal_codes['Neighborhood'] != 'Not assigned']
print('Number of rows excluding possible "Not assigned" rows: ', postal_codes.shape[0])

We can see that the number of rows is the same before and after filtering out possible 'Not assigned' values, so no unassigned Neighborhoods remain.

Now we have to group the Neighborhoods by postcode.

First, we sort our dataframe by postcode.
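
A more direct check is to count the 'Not assigned' values that remain in each column. The sketch below does this on a small made-up dataframe (the `toy` frame is an illustrative assumption; one unassigned Neighborhood is left in on purpose to show what the check reports):

```python
import pandas as pd

# Toy data: one Neighborhood is deliberately left as 'Not assigned'
toy = pd.DataFrame({
    'Borough': ['Scarborough', 'Etobicoke', 'York'],
    'Neighborhood': ['Rouge', 'Not assigned', 'Weston'],
})

# Compare every value against the placeholder and count the matches per column
remaining = toy[['Borough', 'Neighborhood']].eq('Not assigned').sum()

print(remaining)
```

A count of zero in both columns would mean the clean-up is complete.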

In [152]:
postal_codes.sort_values(by=['Postcode'],axis=0, inplace=True)
postal_codes.reset_index(inplace=True,drop=True)
postal_codes.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,Rouge
1,M1B,Scarborough,Malvern
2,M1C,Scarborough,Port Union
3,M1C,Scarborough,Rouge Hill
4,M1C,Scarborough,Highland Creek
5,M1E,Scarborough,Guildwood
6,M1E,Scarborough,Morningside
7,M1E,Scarborough,West Hill
8,M1G,Scarborough,Woburn
9,M1H,Scarborough,Cedarbrae


We will need an auxiliary dataframe with one row per postcode, so that we can later attach the list of grouped neighborhoods to it.

In [161]:
# Chain the calls so that we build a new dataframe instead of modifying a slice of postal_codes in place
postal_codes_new = postal_codes[['Postcode','Borough']].drop_duplicates().reset_index(drop=True)
postal_codes_new


Unnamed: 0,Postcode,Borough
0,M1B,Scarborough
1,M1C,Scarborough
2,M1E,Scarborough
3,M1G,Scarborough
4,M1H,Scarborough
...,...,...
98,M9N,York
99,M9P,Etobicoke
100,M9R,Etobicoke
101,M9V,Etobicoke


In [None]:
postcode = postal_codes.Postcode.unique()
postcode

In [154]:
neighborhoods = []
n_rows = len(postal_codes)  # avoid hard-coding the number of rows
i = 0
while i < n_rows:
    # Start a new group with the neighborhood on row i
    aux = postal_codes['Neighborhood'][i]
    j = i + 1
    # Append every following row that shares the same postcode
    while j < n_rows and postal_codes['Postcode'][j] == postal_codes['Postcode'][i]:
        aux = aux + ', ' + postal_codes['Neighborhood'][j]
        j = j + 1
    neighborhoods.append(aux)
    i = j  # jump to the first row of the next postcode

len(neighborhoods)

103

We can see that our code is correct because the length of 'neighborhoods' is equal to the number of unique postcodes stored in 'postcode'.

Checking the 'neighborhoods' list:

In [153]:
neighborhoods[0:10]

['Rouge, Malvern',
 'Port Union, Rouge Hill, Highland Creek',
 'Guildwood, Morningside, West Hill',
 'Woburn',
 'Cedarbrae',
 'Scarborough Village',
 'East Birchmount Park, Ionview, Kennedy Park',
 'Golden Mile, Oakridge, Clairlea',
 'Cliffcrest, Scarborough Village West, Cliffside',
 'Cliffside West, Birch Cliff']

Then, let's insert the neighborhoods list into our new postal codes dataframe:

In [162]:
postal_codes_new.insert(2, 'Neighborhood', neighborhoods)
postal_codes_new.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Port Union, Rouge Hill, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [164]:
postal_codes_new.shape

(103, 3)
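
As an alternative to the manual loop and the auxiliary dataframe, the grouping can be sketched with pandas `groupby`. This is a minimal sketch on a few rows copied from the sorted table above (the `toy` and `grouped` names are illustrative); it assumes every postcode belongs to exactly one Borough, which holds for the rows shown:

```python
import pandas as pd

# A few rows taken from the sorted postal_codes output above
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1C', 'M1C', 'M1C', 'M1G'],
    'Borough': ['Scarborough'] * 6,
    'Neighborhood': ['Rouge', 'Malvern', 'Port Union',
                     'Rouge Hill', 'Highland Creek', 'Woburn'],
})

# Group by postcode (and borough) and join each group's neighborhoods into one string
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(grouped)
```

The result has one row per postcode, with the same three columns as `postal_codes_new`.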