# Capstone Project Notebook
### (To get IBM Data Science Professional Certificate)

![alt text](Img01.PNG "Data Science")

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain its table of postal codes and loads that data into a pandas dataframe.

> Given a city like the City of Toronto, you will segment it into different neighborhoods using the geographical coordinates of the center of each neighborhood, and then, using a combination of location data and machine learning, you will group the neighborhoods into clusters.



In [2]:
import pandas as pd
import numpy as np

# Read source
tmp = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# The first webpage table has the needed information for this project
postal_codes_tmp = tmp[0]

# Drop the rows whose Borough is 'Not assigned'
na_filter = postal_codes_tmp['Borough'] != 'Not assigned'
filtered_postal_codes = postal_codes_tmp[na_filter]

# A postal code area can contain more than one neighborhood. In such cases, the neighborhoods are combined into a single row, separated by commas (e.g. M1C, M3J, M5A)
postal_codes = filtered_postal_codes.groupby(['Postcode', 'Borough']).agg(', '.join).reset_index()

# Rename column headers
postal_codes.columns = ['PostalCode', 'Borough', 'Neighborhood']

# Whenever the Neighborhood is 'Not assigned', use the Borough name as the Neighborhood value (e.g. M7A)
not_assigned = postal_codes['Neighborhood'] == 'Not assigned'
postal_codes['Neighborhood'] = postal_codes['Neighborhood'].mask(not_assigned, postal_codes['Borough'])

print('Pandas is awesome to read html tables!')
postal_codes.head()

Pandas is awesome to read html tables!


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
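
The three cleaning steps in the cell above (filter, group and join, fall back to the borough name) can be tried without network access on a small hand-made dataframe. This is only a sketch: the rows in `raw` are made up for illustration, and the column names `Postcode`, `Borough` and `Neighbourhood` are an assumption about the layout of the scraped table.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of the scraped table (not real scraped data)
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M1B', 'M1B', 'M7A'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Rouge', 'Malvern', 'Not assigned'],
})

# Step 1: drop the rows that have no borough
kept = raw[raw['Borough'] != 'Not assigned']

# Step 2: one row per postal code, with its neighborhoods joined by a comma
grouped = (kept.groupby(['Postcode', 'Borough'])['Neighbourhood']
               .agg(', '.join)
               .reset_index())
grouped.columns = ['PostalCode', 'Borough', 'Neighborhood']

# Step 3: where the neighborhood is still 'Not assigned', use the borough name
unassigned = grouped['Neighborhood'] == 'Not assigned'
grouped['Neighborhood'] = grouped['Neighborhood'].mask(unassigned, grouped['Borough'])

print(grouped)
```

In this toy input, M1A disappears in step 1, the two M1B rows collapse into one in step 2, and M7A takes its borough name in step 3.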


In [3]:
# Only the first rows are shown above, so check the total number of rows in the dataframe
print(str(postal_codes.shape[0]) + ' dataframe rows.')

103 dataframe rows.


In [13]:
# Read coordinates source
coordinates = pd.read_csv('Geospatial_Coordinates.csv')

# Rename column headers
coordinates.columns = ['PostalCode', 'Latitude', 'Longitude']

# Attach the coordinates to each postal code; an inner join keeps only the codes present in both tables
bigdata = pd.merge(postal_codes, coordinates, how='inner', on='PostalCode')
print(bigdata.shape)
bigdata.head()

(103, 5)


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
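
An inner merge silently drops any postal code that has no coordinates, so comparing the row counts before and after (103 in both cases here) is the only evidence that nothing was lost. A more explicit check, sketched below, is a left merge with `indicator=True`, which labels each row with where it matched. The dataframes `codes` and `coords` are hypothetical miniature stand-ins for `postal_codes` and `coordinates`, and `M9Z` is a made-up code included only to show an unmatched row.

```python
import pandas as pd

# Hypothetical miniature stand-ins for postal_codes and coordinates
codes = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                      'Borough': ['Scarborough', 'Scarborough', 'Example']})
coords = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# A left merge keeps every postal code; indicator=True adds a '_merge' column,
# and validate='one_to_one' raises an error if a key is duplicated on either side
checked = pd.merge(codes, coords, on='PostalCode', how='left',
                   indicator=True, validate='one_to_one')

# Postal codes that found no coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

If `missing` is empty for the real data, the inner merge used above loses no rows.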
