# Segmenting and Clustering Neighborhoods in Toronto
### Peer-graded Assignment for the course:<br/>*Applied Data Science Capstone (IBM Data Science Professional Certificate)*, Coursera/IBM.
**Author: Paw Hermansen, 2018, Oct. 19**


## Part 2: Add Geographical Locations to the Neighborhoods

### Import Python Libraries

In [1]:
import pandas as pd
import csv

### Load the Toronto postal codes with neighborhoods created in Part 1

In [2]:
df = pd.read_csv('data/toronto_postal_codes.csv')

print(df.shape)
df.head()

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


### Load the locations from the csv

I tried several times to get the coordinates with the Python geocoder package, using several different providers, but every attempt failed. Instead, I chose to read the locations from the csv file provided with the assignment.

In [3]:
dfLocs = pd.read_csv('data/Geospatial_Coordinates.csv')

print(dfLocs.shape)
dfLocs.head()

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Merge the postal codes dataframe and the locations dataframe

Note that the postal code column has a different name in each dataframe (*PostalCode* and *Postal Code*), so both columns are included in the result. I will remove one of them after checking the result.

In [4]:
df = pd.merge(df, dfLocs, left_on='PostalCode', right_on='Postal Code', how='outer')

print(df.shape)
df.head()

(103, 6)


Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
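
As an aside, the duplicate key column can be avoided by giving both dataframes the same key name before merging. The following is only a minimal sketch on toy data, not what this notebook does: the names `codes`, `locs` and `merged` are made up for the illustration, and the two rows are copied from the tables above. The `validate='one_to_one'` argument is an optional extra that makes pandas raise an error if a postal code is repeated in either dataframe.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (hypothetical names, values from the tables above)
codes = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})
locs = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Rename the key so both dataframes share one column name, then merge on it
locs = locs.rename(columns={'Postal Code': 'PostalCode'})
merged = pd.merge(codes, locs, on='PostalCode', how='outer', validate='one_to_one')

# Only one postal code column remains in the result
print(merged.columns.tolist())
```

With a shared key name there is no extra column to drop afterwards; keeping the two names apart, as done here, has the advantage that both key columns can be inspected side by side.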


### Check for consistency between the dataframes

I used an 'outer' merge, which keeps every postal code from both dataframes. If a postal code exists in only one of the dataframes, the columns from the other dataframe are *NaN* in that row.

The two original dataframes each have 103 rows. Provided the postal codes are unique within each dataframe, the merged dataframe has exactly 103 rows if and only if the two sets of postal codes match exactly, as they should. The following result shows that this is the case, as expected.

In [5]:
print("Number of rows = ", df.shape[0])

Number of rows =  103
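
To illustrate what a mismatch would look like, here is a small sketch on deliberately inconsistent toy data. The dataframes `left` and `right` are hypothetical and not part of the assignment data. It uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column telling whether each row came from the left dataframe only, the right dataframe only, or both.

```python
import pandas as pd

# Hypothetical toy data: 'M1E' exists only on the left, 'M1G' only on the right
left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M1E']})
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1G'],
    'Latitude': [43.806686, 43.784535, 43.770992],
})

# indicator=True adds a '_merge' column with 'left_only', 'right_only' or 'both'
check = pd.merge(left, right, left_on='PostalCode', right_on='Postal Code',
                 how='outer', indicator=True)

# Rows that did not find a partner in the other dataframe
unmatched = check[check['_merge'] != 'both']

print("Number of rows = ", len(check))
print(unmatched[['PostalCode', 'Postal Code', '_merge']])
```

Each input has three rows, but the merged result has four, and the two unmatched rows carry *NaN* in the columns of the dataframe they are missing from. For the real data the row count stays at 103, so a check like this would presumably return no unmatched rows.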


### Remove the extra *Postal Code* column

In [6]:
df = df.drop(columns=['Postal Code'])

print(df.shape)
df.head()

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


### Save the dataframe

In [7]:
df.to_csv('data/toronto_neigborhoods.csv', quoting=csv.QUOTE_ALL, index=False)
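
The `quoting=csv.QUOTE_ALL` option matters because several neighborhood names contain commas. The sketch below shows the effect on a single row copied from the table above. It writes to an in-memory buffer instead of a file, and the names `sample`, `buffer`, `text` and `restored` are made up for the illustration.

```python
import csv
import io

import pandas as pd

# One row from the final dataframe; the neighborhood name contains a comma
sample = pd.DataFrame({
    'PostalCode': ['M1B'],
    'Borough': ['Scarborough'],
    'Neighborhood': ['Rouge, Malvern'],
    'Latitude': [43.806686],
    'Longitude': [-79.194353],
})

# Write to an in-memory buffer with every field quoted, including the numbers
buffer = io.StringIO()
sample.to_csv(buffer, quoting=csv.QUOTE_ALL, index=False)
text = buffer.getvalue()
print(text)

# Reading the text back strips the quotes and parses the numbers again
restored = pd.read_csv(io.StringIO(text))
print(restored.dtypes)
```

Because `index=False` is used, the saved file has no extra index column when it is loaded again.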