<font size=4><b>Coursera Capstone Project Notebook</b></font>

This notebook will be used for the completion of the Coursera Capstone Project in Python.

In [1]:
import numpy as np
import pandas as pd

The following packages need to be installed in order to complete this project

In [2]:
from geopy.geocoders import Nominatim #for getting location data
from sklearn.cluster import KMeans #for performing kmeans clustering
import folium #for visualizing on a world map

In [3]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


Load and view our dataset

In [4]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

In [5]:
data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


We need to remove all rows with a 'Not assigned' borough from our dataset.

In [6]:
data = data[data['Borough'] != 'Not assigned']

In [7]:
data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


We need to combine rows with the same postcode

In [8]:
neighborhoods_in_pc = {}
for postcode in data['Postcode'].unique():
    pc_data = data[data['Postcode'] == postcode]
    neighborhoods_in_pc[postcode] = pc_data['Neighbourhood'].unique()
neighborhoods_in_pc

{'M3A': array(['Parkwoods'], dtype=object),
 'M4A': array(['Victoria Village'], dtype=object),
 'M5A': array(['Harbourfront', 'Regent Park'], dtype=object),
 'M6A': array(['Lawrence Heights', 'Lawrence Manor'], dtype=object),
 'M7A': array(['Not assigned'], dtype=object),
 'M9A': array(['Islington Avenue'], dtype=object),
 'M1B': array(['Rouge', 'Malvern'], dtype=object),
 'M3B': array(['Don Mills North'], dtype=object),
 'M4B': array(['Woodbine Gardens', 'Parkview Hill'], dtype=object),
 'M5B': array(['Ryerson', 'Garden District'], dtype=object),
 'M6B': array(['Glencairn'], dtype=object),
 'M9B': array(['Cloverdale', 'Islington', 'Martin Grove', 'Princess Gardens',
        'West Deane Park'], dtype=object),
 'M1C': array(['Highland Creek', 'Rouge Hill', 'Port Union'], dtype=object),
 'M3C': array(['Flemingdon Park', 'Don Mills South'], dtype=object),
 'M4C': array(['Woodbine Heights'], dtype=object),
 'M5C': array(['St. James Town'], dtype=object),
 'M6C': array(['Humewood-Cedarvale'

Create a new dataframe with one row per postcode, where each row holds every neighbourhood in that postcode

In [9]:
fixed_data = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])
fixed_data['Postcode'] = neighborhoods_in_pc.keys()
fixed_data['Neighbourhood'] = neighborhoods_in_pc.values()
fixed_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,,[Parkwoods]
1,M4A,,[Victoria Village]
2,M5A,,"[Harbourfront, Regent Park]"
3,M6A,,"[Lawrence Heights, Lawrence Manor]"
4,M7A,,[Not assigned]


Now, we need to find the Borough for each Postcode

In [10]:
boroughs = []
for pc in fixed_data['Postcode']:
    pc_data = data[data['Postcode'] == pc] #only rows for this postcode in here
    boroughs.append(pc_data['Borough'].unique())
fixed_data['Borough'] = boroughs

In [11]:
fixed_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,[North York],[Parkwoods]
1,M4A,[North York],[Victoria Village]
2,M5A,[Downtown Toronto],"[Harbourfront, Regent Park]"
3,M6A,[North York],"[Lawrence Heights, Lawrence Manor]"
4,M7A,[Queen's Park],[Not assigned]


For rows without an assigned Neighbourhood, the Neighbourhood should be set to the Borough

In [12]:
for index, row in fixed_data.iterrows():
    if (row['Neighbourhood'][0] == 'Not assigned'):
        row['Neighbourhood'][0] = row['Borough']

In [13]:
fixed_data.head(25)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,[North York],[Parkwoods]
1,M4A,[North York],[Victoria Village]
2,M5A,[Downtown Toronto],"[Harbourfront, Regent Park]"
3,M6A,[North York],"[Lawrence Heights, Lawrence Manor]"
4,M7A,[Queen's Park],[[Queen's Park]]
5,M9A,[Etobicoke],[Islington Avenue]
6,M1B,[Scarborough],"[Rouge, Malvern]"
7,M3B,[North York],[Don Mills North]
8,M4B,[East York],"[Woodbine Gardens, Parkview Hill]"
9,M5B,[Downtown Toronto],"[Ryerson, Garden District]"
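
Row 4 above shows `[[Queen's Park]]` rather than `[Queen's Park]`: `row['Borough']` is itself an array, so assigning it to element 0 nests one array inside another. A minimal sketch of the un-nested assignment, using plain lists as stand-ins for the arrays and a hypothetical helper `fill_unassigned` that is not part of this notebook:

```python
def fill_unassigned(neighbourhoods, boroughs):
    """Replace a lone 'Not assigned' entry with the borough name itself."""
    if neighbourhoods[0] == 'Not assigned':
        # boroughs[0] is the name (a string); assigning `boroughs` would nest the list
        neighbourhoods[0] = boroughs[0]
    return neighbourhoods

# Stand-in values copied from the table above
print(fill_unassigned(['Not assigned'], ["Queen's Park"]))
print(fill_unassigned(['Harbourfront', 'Regent Park'], ['Downtown Toronto']))
```

In the notebook's loop this would correspond to assigning `row['Borough'][0]` instead of `row['Borough']`.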


Clean up the Borough column

In [14]:
fixed_boroughs = []
for borough in fixed_data['Borough']:
    b = ','.join(borough) #each entry is an array of borough names
    fixed_boroughs.append(b)
fixed_data['Borough'] = fixed_boroughs

Clean up the Neighbourhood column

In [15]:
fixed_neighbourhoods = []
for neighborhood in fixed_data['Neighbourhood']:
    if len(neighborhood) == 1:
        n = neighborhood[0]
    else:
        n = ",".join(str(x) for x in neighborhood)
    fixed_neighbourhoods.append(n)
fixed_data['Neighbourhood'] = fixed_neighbourhoods

In [16]:
fixed_data.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,[Queen's Park]
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"
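
The cleanup steps above (combining rows per postcode, falling back to the borough name, and joining with commas) can also be expressed as a single `groupby`. This is only a sketch on a small hand-built frame that mirrors rows shown earlier, not the approach used in this notebook; the frame `toy` and the helper `combine_postcodes` are names introduced here for illustration.

```python
import pandas as pd

# Toy rows copied from the tables printed earlier (a stand-in for `data`)
toy = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

def combine_postcodes(df):
    """Return one row per Postcode with its neighbourhoods joined by commas."""
    df = df.copy()
    # Fall back to the borough name when no neighbourhood is assigned
    missing = df['Neighbourhood'] == 'Not assigned'
    df.loc[missing, 'Neighbourhood'] = df.loc[missing, 'Borough']
    # sort=False keeps the postcodes in their original order
    return (df.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
              .agg(','.join)
              .reset_index())

combined = combine_postcodes(toy)
print(combined)
```

Grouping on both `Postcode` and `Borough` relies on the assumption that each postcode has exactly one borough; if that did not hold, a postcode would appear in more than one output row.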


Assumptions: I am assuming that each postcode can contain multiple neighbourhoods and that no neighbourhood is listed under more than one postcode. I am also assuming each postcode belongs to exactly 1 Borough (a borough can span several postcodes, as North York does above).

In [17]:
fixed_data.shape

(103, 3)

103 rows and 3 columns.

Now, we need to get the longitude and latitude using GeoPy

In [18]:
neighborhoods = fixed_data['Neighbourhood']
latitudes = []
longitudes = []
for city in neighborhoods:
    latitude, longitude = np.nan, np.nan #kept if both attempts fail
    for attempt in range(2): #try each lookup at most twice
        try:
            user_agent_name = city.replace(' ','_') + '_explorer'
            geolocator = Nominatim(user_agent=user_agent_name)
            location = geolocator.geocode(city) #None when nothing matches
            latitude, longitude = location.latitude, location.longitude
            print('The geograpical coordinate of ', city, ' are {}, {}.'.format(latitude, longitude))
            break
        except Exception: #bad input, no match, or a request error
            continue
    latitudes.append(latitude)
    longitudes.append(longitude)

The geograpical coordinate of  Parkwoods  are 26.5652643, -81.8817227102639.
The geograpical coordinate of  Victoria Village  are 43.732658, -79.3111892.
The geograpical coordinate of  Harbourfront,Regent Park  are 43.6400801, -79.3801495.
The geograpical coordinate of  Lawrence Heights,Lawrence Manor  are 43.7227784, -79.4509332.
The geograpical coordinate of  Islington Avenue  are 43.6393743, -79.5212175.
The geograpical coordinate of  Rouge,Malvern  are 43.8091955, -79.2217008.
The geograpical coordinate of  Don Mills North  are 43.737178, -79.3434514.
The geograpical coordinate of  Ryerson,Garden District  are 45.5797934, -79.5084658621578.
The geograpical coordinate of  Glencairn  are -34.1595402, 18.4284987.
The geograpical coordinate of  Woodbine Heights  are 43.6999302, -79.3191316.
The geograpical coordinate of  St. James Town  are 43.6694032, -79.3727041.
The geograpical coordinate of  Humewood-Cedarvale  are 43.69079835, -79.4253981936993.
The geograpical coordinate of  Guil

In [19]:
fixed_data['Latitude'] = latitudes
fixed_data['Longitude'] = longitudes
fixed_data.dropna(inplace=True,how='any')

In [20]:
fixed_data.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,26.565264,-81.881723
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.64008,-79.38015
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.722778,-79.450933
5,M9A,Etobicoke,Islington Avenue,43.639374,-79.521218
6,M1B,Scarborough,"Rouge,Malvern",43.809196,-79.221701
7,M3B,North York,Don Mills North,43.737178,-79.343451
9,M5B,Downtown Toronto,"Ryerson,Garden District",45.579793,-79.508466
10,M6B,North York,Glencairn,-34.15954,18.428499
14,M4C,East York,Woodbine Heights,43.69993,-79.319132
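
Some of the coordinates above are clearly not in Toronto: Parkwoods resolves to roughly (26.57, -81.88) and Glencairn to (-34.16, 18.43), most likely because the bare neighbourhood name matches a place elsewhere. One possible safeguard is to flag results that fall outside a rough bounding box. In the sketch below the four bounds are my own approximate estimates for Toronto, not values from this notebook, and `in_toronto` is a hypothetical helper.

```python
# Approximate bounding box for Toronto; these limits are rough assumptions
LAT_MIN, LAT_MAX = 43.55, 43.90
LON_MIN, LON_MAX = -79.70, -79.05

def in_toronto(latitude, longitude):
    """Return True when the point lies inside the assumed bounding box."""
    return LAT_MIN <= latitude <= LAT_MAX and LON_MIN <= longitude <= LON_MAX

# A few results copied from the geocoding output above
results = {
    'Parkwoods': (26.5652643, -81.8817227102639),
    'Victoria Village': (43.732658, -79.3111892),
    'Ryerson,Garden District': (45.5797934, -79.5084658621578),
    'Glencairn': (-34.1595402, 18.4284987),
}

# Names whose coordinates fall outside the box and deserve a second look
suspect = [name for name, (lat, lon) in results.items() if not in_toronto(lat, lon)]
print(suspect)
```

Adding a qualifier such as ", Toronto, Ontario" to the string passed to `geocode` may also reduce these mismatches, though that is untested here.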


Now, we need to do some visualizations

In [21]:
geo_map = folium.Map(
    location=[43.5,-80],
    zoom_start=8,
    tiles='Stamen Terrain'
)

In [22]:
geo_map

Let's add markers to our map

Make a new df containing only unique rows

In [23]:
marker_df = fixed_data.drop_duplicates(subset='Neighbourhood',keep='first',inplace=False)

In [24]:
for index, row in marker_df.iterrows():
    coords = [row['Latitude'],row['Longitude']]
    #red circle marking the neighbourhood
    city = folium.map.FeatureGroup()
    city.add_child(
        folium.CircleMarker(
            coords,
            radius=5,
            color='red',
            fill_color='red'
        )
    )
    geo_map.add_child(city)
    #pin with the neighbourhood name as a popup
    folium.Marker(coords,popup=row['Neighbourhood']).add_to(geo_map)

In [25]:
geo_map

Now, we will run an ordinary least squares (OLS) regression on educational attainment data to help judge whether Toronto is a good place to move to

In [26]:
data = pd.read_excel('Industry_Profile.xlsx',sheet_name=['i000'])
data = data['i000']

Let's view the shape of each column

In [38]:
for col in data.columns:
    print(data[col].shape)

(11,)
(11,)
(11,)
(11,)
(11,)
(11,)
(11,)
(11,)
(11,)


In [41]:
data = data.transpose()

In [42]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
% By Educational Attainment,0 to 8 years,Some secondary,High school graduate,Some postsecondary,Postsecondary certif or diploma,Trade certificate or diploma,College diploma,University certif below bachelor,University degree,Bachelor's degree,Above bachelor's degree
2003,0.0619327,0.122007,0.261352,0.0788846,0.280734,0.0752969,0.190987,0.0144505,0.195103,0.143447,0.0516559
2008,0.0469742,0.102549,0.246576,0.0650875,0.324526,0.0928828,0.216371,0.015272,0.214271,0.150821,0.0634506
2013,0.0414446,0.07912,0.241229,0.0579161,0.332487,0.07912,0.237443,0.015907,0.247804,0.175874,0.0719303
2014,0.0354115,0.0753149,0.275845,0.0548482,0.322622,0.0826438,0.220676,0.0193016,0.235976,0.165828,0.0701307


In [43]:
data.index

Index(['% By Educational Attainment', 2003, 2008, 2013, 2014, 2015, 2016, 2017,
       2018],
      dtype='object')

In [44]:
data.columns = data.iloc[0]
data = data[1:]

In [45]:
data.head()

% By Educational Attainment,0 to 8 years,Some secondary,High school graduate,Some postsecondary,Postsecondary certif or diploma,Trade certificate or diploma,College diploma,University certif below bachelor,University degree,Bachelor's degree,Above bachelor's degree
2003,0.0619327,0.122007,0.261352,0.0788846,0.280734,0.0752969,0.190987,0.0144505,0.195103,0.143447,0.0516559
2008,0.0469742,0.102549,0.246576,0.0650875,0.324526,0.0928828,0.216371,0.015272,0.214271,0.150821,0.0634506
2013,0.0414446,0.07912,0.241229,0.0579161,0.332487,0.07912,0.237443,0.015907,0.247804,0.175874,0.0719303
2014,0.0354115,0.0753149,0.275845,0.0548482,0.322622,0.0826438,0.220676,0.0193016,0.235976,0.165828,0.0701307
2015,0.0289058,0.0675521,0.25219,0.0499992,0.340304,0.0777248,0.23979,0.0228055,0.26105,0.198185,0.0628646


In [50]:
data.columns

Index(['  0 to 8  years', '  Some secondary', '  High school graduate',
       '  Some postsecondary', '  Postsecondary certif or diploma',
       '    Trade certificate or diploma', '    College diploma',
       '    University certif below bachelor', '  University degree',
       '    Bachelor's degree', '    Above bachelor's degree'],
      dtype='object', name='% By Educational Attainment')
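
The column labels printed above carry leading spaces, which is why the cells that follow have to index with strings such as `"    Bachelor's degree"`. A small sketch of normalizing the labels with `Index.str.strip`, using a hypothetical two-column frame `toy` built from values shown earlier:

```python
import pandas as pd

# Toy frame with the same padded labels as the real data (values from 2003 and 2008)
toy = pd.DataFrame(
    {"    Bachelor's degree": [0.143447, 0.150821],
     "    Above bachelor's degree": [0.0516559, 0.0634506]},
    index=[2003, 2008],
)

# Remove leading and trailing whitespace from every column label
toy.columns = toy.columns.str.strip()
print(toy["Bachelor's degree"])
```

After stripping, columns can be selected by their plain names without counting spaces.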

In [55]:
X = np.array(data["    Bachelor's degree"])
Y = np.array(data["    Above bachelor's degree"])

In [63]:
X = np.array(X, dtype=float)
Y = np.array(Y, dtype=float)

In [65]:
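import statsmodels.api as sm
#sm.OLS takes (endog, exog): X (Bachelor's degree) is the dependent variable,
#Y (Above bachelor's degree) is the only regressor, and no intercept is added
results = sm.OLS(X, Y).fit()
summary = results.summary()
print(summary)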

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.986
Method:                 Least Squares   F-statistic:                     555.3
Date:                Tue, 16 Jul 2019   Prob (F-statistic):           6.29e-08
Time:                        20:46:42   Log-Likelihood:                 20.016
No. Observations:                   8   AIC:                            -38.03
Df Residuals:                       7   BIC:                            -37.95
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.5104      0.107     23.565      0.0


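This regression shows that, across the 8 years in the data, the share of people with a bachelor's degree closely tracks the share with a degree above bachelor's (coefficient of about 2.51, R-squared of 0.988). Because the model has no intercept, the reported R-squared is the uncentered one, and 8 observations is a small sample, so the fit should be read with some caution. With that caveat, we can predict the proportion of college graduates with a fairly high degree of confidence, which I take as a sign that Toronto is a great city to move to.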
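
As a cross-check on the regression above, the slope of a least-squares line through the origin has the closed form sum(x*y) / sum(x*x). The sketch below computes it with NumPy for the five years whose values are printed in this notebook (2003, 2008, 2013, 2014, 2015), so its numbers will differ somewhat from the 8-observation summary. It also fits the same data with an intercept for comparison. The variable names are illustrative and not taken from the notebook.

```python
import numpy as np

# Values copied from the data.head() output (2003, 2008, 2013, 2014, 2015)
bachelors = np.array([0.143447, 0.150821, 0.175874, 0.165828, 0.198185])
above = np.array([0.0516559, 0.0634506, 0.0719303, 0.0701307, 0.0628646])

# Least-squares slope for bachelors ~ slope * above (no intercept)
slope_origin = np.sum(above * bachelors) / np.sum(above ** 2)

# Same data fitted with an intercept: bachelors ~ slope * above + intercept
slope, intercept = np.polyfit(above, bachelors, deg=1)

print('through origin: slope = {:.3f}'.format(slope_origin))
print('with intercept: slope = {:.3f}, intercept = {:.3f}'.format(slope, intercept))
```

If the two slopes differ noticeably, the through-origin model may be overstating how well one share predicts the other.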