### Peer-graded Assignment: Capstone Project - The Battle of Neighborhoods

#### Part I

**Research Question**

For this project, I will look for a location in Toronto that is suitable for opening a childcare center. As the female labor force participation rate has risen over the past decade, the need for childcare services has climbed as well. There are plenty of opportunities to open a childcare center in Toronto, a diverse financial center with a robust workforce. This project targets stakeholders who are interested in opening a childcare center in Toronto and are looking for the right neighborhood in which to establish their business.

To find an ideal location for a childcare center, I propose to look at the following key indicators:
* A convenient neighborhood with few competitors
* A large proportion of working households with children
* High income, high education, and a high occupancy rate

**Data Description**

This project uses three sets of data:

1. Postal code data for Toronto are extracted from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.
2. The Foursquare API is used to get common venues for each neighborhood in Toronto.
3. Statistics Canada's 2016 Census data are used to extract postal-code-level population and demographic data (downloaded from: https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/index.cfm?Lang=E).
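
The Foursquare step in item 2 is not shown in this part. As a rough sketch only, the request URL for venues around one postal code could be assembled as below. The endpoint is assumed to be the legacy v2 `venues/explore` route that was in common use when this project was written, and the credentials, version date, radius, and limit are placeholders rather than values from this notebook. `build_explore_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed legacy Foursquare v2 endpoint; newer API versions use different routes and auth.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Return a request URL for venues near (lat, lng). All defaults are illustrative."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Coordinates of M3A (Parkwoods) as listed in the merged table further down;
# the credentials are placeholders.
url = build_explore_url(43.753259, -79.329656, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

Building the URL separately from sending it keeps the credentials and query parameters easy to inspect before any request is made.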

In [1]:
# Install BeautifulSoup if it is not already available
!pip install beautifulsoup4

import re
from urllib.request import urlopen

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup



In [2]:
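# Download the Wikipedia page that lists Toronto postal codes (prefix M)
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html = urlopen(url)

# Parse the page so that its table rows can be searched
soup = BeautifulSoup(html, 'html.parser')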

In [10]:
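# Collect every table row on the page
rows = soup.find_all('tr')

# Pattern that matches any HTML tag; compiled once, outside the loop
tag_pattern = re.compile('<.*?>')

list_rows = []
for row in rows:
    cells = row.find_all('td')
    # str(cells) is a bracketed, comma-separated list of <td> elements; remove the tags
    list_rows.append(tag_pattern.sub('', str(cells)))

df = pd.DataFrame(list_rows)
# Split into at most three fields, because neighborhood lists contain commas of their own
df = df[0].str.split(',', n=2, expand=True)
df.head(10)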

Unnamed: 0,0,1,2
0,[],,
1,[M1A\n,Not assigned\n,Not assigned\n]
2,[M2A\n,Not assigned\n,Not assigned\n]
3,[M3A\n,North York\n,Parkwoods\n]
4,[M4A\n,North York\n,Victoria Village\n]
5,[M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n]"
6,[M6A\n,North York\n,"Lawrence Manor, Lawrence Heights\n]"
7,[M7A\n,Downtown Toronto\n,"Queen's Park, Ontario Provincial Government\n]"
8,[M8A\n,Not assigned\n,Not assigned\n]
9,[M9A\n,Etobicoke\n,"Islington Avenue, Humber Valley Village\n]"
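
The limit of 2 passed to `str.split` matters because several neighborhood entries contain commas themselves. A small plain-Python illustration, using a string reconstructed from row 5 above (the exact whitespace is an assumption):

```python
# Row 5 as it looks after the HTML tags are removed (whitespace reconstructed for illustration).
raw = "[M5A\n, Downtown Toronto\n, Regent Park, Harbourfront\n]"

unlimited = raw.split(",")    # also splits inside the neighborhood list
limited = raw.split(",", 2)   # at most 2 splits, so exactly 3 fields here

# Strip brackets, newlines, and spaces from each field.
fields = [part.strip("[]\n ") for part in limited]

print(len(unlimited))
print(fields)
```

Without the limit, "Regent Park" and "Harbourfront" would land in separate columns and the rows would no longer line up.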


In [11]:
df.tail(10)

Unnamed: 0,0,1,2
175,[M4Z\n,Not assigned\n,Not assigned\n]
176,[M5Z\n,Not assigned\n,Not assigned\n]
177,[M6Z\n,Not assigned\n,Not assigned\n]
178,[M7Z\n,Not assigned\n,Not assigned\n]
179,[M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor..."
180,[M9Z\n,Not assigned\n,Not assigned\n]
181,[\n,\n],
182,[\n\n\nNL\n\nNS\n\nPE\n\nNB\n\nQC\n\nON\n\nMB\...,NL\n,"NS\n, PE\n, NB\n, QC\n, ON\n, MB\n, SK\n, AB\..."
183,[NL\n,NS\n,"PE\n, NB\n, QC\n, ON\n, MB\n, SK\n, AB\n, BC\..."
184,[A\n,B\n,"C\n, E\n, G\n, H\n, J\n, K\n, L\n, M\n, N\n, ..."
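
The tail shows that the page's navigation tables are scraped along with the postal codes. One alternative to slicing by fixed row numbers is to keep only the rows whose first field looks like a Toronto postal code prefix. This is a sketch, not the approach used in this notebook; the pattern (the letter M, one digit, one uppercase letter) is inferred from the values shown above, and `is_postal_prefix` is a hypothetical helper name.

```python
import re

# Codes in this table: 'M', one digit, one uppercase letter (for example 'M3A').
POSTAL_PREFIX = re.compile(r"M\d[A-Z]")


def is_postal_prefix(value):
    """Return True when the cleaned value is exactly a code such as 'M3A'."""
    return POSTAL_PREFIX.fullmatch(value.strip("[]\n ")) is not None


# First-column values taken from the head and tail outputs above.
first_column = ["[]", "[M1A\n", "[M3A\n", "[M9Z\n", "[\n", "[NL\n", "[A\n"]
kept = [v.strip("[]\n ") for v in first_column if is_postal_prefix(v)]
print(kept)
```

A pattern-based filter keeps working if Wikipedia adds or removes rows, whereas fixed row numbers would need to be updated by hand.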


In [12]:
# Keep only the postal code rows (row 0 is empty; rows after 180 come from navigation tables)
df = df.loc[1:180, :].copy()

# Remove the list brackets and newlines left over from str(cells),
# then any leading and trailing spaces
df[0] = df[0].str.strip('[]\n').str.strip()
df[1] = df[1].str.strip('\n').str.strip()
df[2] = df[2].str.strip('\n]').str.strip()

df.rename(columns={0:'Postal Code', 1:'Borough', 2:'Neighborhood'}, inplace = True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [14]:
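# Drop postal codes that are not assigned to a borough.
# The original row labels are kept here; the merge further down produces a fresh 0-based index.
df_final = df.loc[df['Borough'] != "Not assigned"].copy()
df_final.head()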

Unnamed: 0,Postal Code,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
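
The index in the output above still starts at 3 because filtering keeps the original row labels. If a fresh 0-based index is wanted, note that `reset_index` returns a new DataFrame by default, so the result has to be assigned. A minimal illustration with three rows from the table above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
    },
    index=[3, 4, 5],
)

df.reset_index(drop=True)              # result discarded: df itself is unchanged
unchanged_index = list(df.index)

df_reset = df.reset_index(drop=True)   # result assigned: df_reset has a fresh index
new_index = list(df_reset.index)

print(unchanged_index, new_index)
```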


In [15]:
# Attach latitude and longitude to each postal code
coord = pd.read_csv('Geospatial_Coordinates.csv')
df_comb = df_final.join(coord.set_index('Postal Code'), on = 'Postal Code')
df_comb.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
3,M3A,North York,Parkwoods,43.753259,-79.329656
4,M4A,North York,Victoria Village,43.725882,-79.315572
5,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
6,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [23]:
ext_data = pd.read_csv('ext_data_process.csv')
ext_data.rename(columns={'pcode': 'Postal Code'}, inplace = True)
ext_data['Postal Code'] = ext_data['Postal Code'].str.strip()
ext_data.head(10)

Unnamed: 0,Postal Code,cfam_size,emp_rate,tot_hhldinc_med,population,tot_priv_dwl,workage_prop,singledetach_prop,apt_prop,cfam_wchild_prop,immi_prop,uniab_prop,workftfy_prop,work_usualplc_prop,commute60m_prop
0,M1B,3.2,56.0,69126,66108,20957,0.611935,3.267919,0.20959,0.520609,0.599105,0.254412,0.295118,0.843617,0.275528
1,M1C,3.1,58.8,109785,35626,11588,0.606456,3.161047,0.014197,0.541506,0.448852,0.404743,0.351572,0.826874,0.260165
2,M1E,3.0,52.7,62047,46943,17637,0.595548,2.735431,0.344697,0.442604,0.47793,0.281898,0.270305,0.813225,0.282285
3,M1G,3.2,49.3,54450,29690,10116,0.588919,3.040451,0.41321,0.498419,0.560155,0.278678,0.234486,0.826362,0.259929
4,M1H,2.9,55.0,58492,24383,9274,0.632561,2.713968,0.573178,0.467262,0.579668,0.347083,0.292399,0.845635,0.239944
5,M1J,3.1,52.6,54507,36699,12797,0.597411,2.989817,0.564562,0.477064,0.568508,0.255145,0.25619,0.829356,0.277006
6,M1K,3.0,54.4,53260,48434,18620,0.623516,2.701338,0.453709,0.463026,0.554128,0.265969,0.277882,0.833754,0.219948
7,M1L,3.1,56.7,56779,35081,12884,0.627476,2.823742,0.346881,0.523991,0.521892,0.361104,0.288636,0.819662,0.190141
8,M1M,3.0,55.6,68550,22913,8908,0.601048,2.657772,0.311485,0.489567,0.40394,0.341454,0.311794,0.805005,0.229156
9,M1N,2.9,59.6,73256,22136,9535,0.622855,2.434304,0.103903,0.451827,0.284487,0.338976,0.347241,0.811982,0.18837


In [25]:
# Combine the Toronto postal code and neighborhood data with the pre-processed Statistics Canada data
df_proc = df_comb.merge(ext_data, on='Postal Code', how='left')
df_proc.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,cfam_size,emp_rate,tot_hhldinc_med,population,tot_priv_dwl,workage_prop,singledetach_prop,apt_prop,cfam_wchild_prop,immi_prop,uniab_prop,workftfy_prop,work_usualplc_prop,commute60m_prop
0,M3A,North York,Parkwoods,43.753259,-79.329656,3.0,57.5,64761.0,34615.0,13847.0,0.61534,2.613439,0.506606,0.461218,0.496812,0.410005,0.309756,0.803214,0.205002
1,M4A,North York,Victoria Village,43.725882,-79.315572,2.9,54.1,54905.0,14443.0,6299.0,0.597439,2.341167,0.34765,0.414062,0.516408,0.347541,0.315121,0.838028,0.212272
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2.6,64.4,52623.0,41078.0,24186.0,0.764606,1.839266,0.723976,0.251294,0.378371,0.517506,0.367251,0.815626,0.090066
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,3.1,57.4,53933.0,21048.0,8751.0,0.570444,2.606192,0.334985,0.399402,0.497793,0.369856,0.264238,0.822839,0.144707
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,,,,,,,,,,,,,,


In [28]:
df_proc.describe(include = 'all')

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,cfam_size,emp_rate,tot_hhldinc_med,population,tot_priv_dwl,workage_prop,singledetach_prop,apt_prop,cfam_wchild_prop,immi_prop,uniab_prop,workftfy_prop,work_usualplc_prop,commute60m_prop
count,103,103,103,103.0,103.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0
unique,103,10,99,,,,,,,,,,,,,,,,
top,M4G,North York,Downsview,,,,,,,,,,,,,,,,
freq,1,24,4,,,,,,,,,,,,,,,,
mean,,,,43.704608,-79.397153,2.875,60.0,71346.822917,28459.3125,12284.1875,0.648171,2.473797,0.426547,0.426164,0.44615,0.465058,0.333854,0.803587,0.15272
std,,,,0.052463,0.097146,0.241922,7.092012,19833.83983,14007.981457,6171.844265,0.075253,0.459515,0.251209,0.094336,0.129936,0.18088,0.066461,0.024672,0.064819
min,,,,43.602414,-79.615819,2.3,46.4,40291.0,2005.0,1718.0,0.549657,1.549453,0.0,0.16,0.198828,0.111006,0.197018,0.737646,0.042969
25%,,,,43.660567,-79.464763,2.8,55.0,57151.0,18531.25,8175.5,0.60081,2.175181,0.249711,0.396162,0.333759,0.324293,0.28705,0.786945,0.090893
50%,,,,43.696948,-79.38879,2.9,58.7,65508.0,25724.0,11215.5,0.626724,2.494556,0.388622,0.44765,0.443735,0.461745,0.322986,0.803687,0.144288
75%,,,,43.74532,-79.340923,3.0,64.925,79188.5,37820.5,15744.0,0.671057,2.722973,0.5657,0.480081,0.55626,0.633168,0.370895,0.819807,0.210302


The original Toronto data have 103 valid postal codes, but only 96 of them have population and demographic data (based on the 2016 Census).
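
A quick way to see which postal codes lack census data is to count the non-missing values after the left merge, or to ask `merge` to flag unmatched rows with `indicator=True`. The sketch below uses a three-row toy version of the two tables (values copied from the outputs above), not the full data.

```python
import pandas as pd

# Toy stand-ins for df_comb and ext_data; M7A has no census row, as in the output above.
left = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M7A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
    }
)
right = pd.DataFrame({"Postal Code": ["M3A", "M4A"], "population": [34615, 14443]})

# indicator=True adds a '_merge' column that records where each row came from.
merged = left.merge(right, on="Postal Code", how="left", indicator=True)

# Rows flagged 'left_only' exist in the neighborhood table but not in the census extract.
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
n_with_data = int(merged["population"].notna().sum())

print(missing, n_with_data)
```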

Here is a dictionary of the added variables:

* cfam_size: Average size of census family
* emp_rate: Employment rate
* tot_hhldinc_med: Median total household income (before tax)
* population: Population size
* tot_priv_dwl: Total private dwellings
* workage_prop: Proportion of individuals of working age (20-64)
* singledetach_prop: Proportion of single-detached houses
* apt_prop: Proportion of apartments
* cfam_wchild_prop: Proportion of census families with children
* immi_prop: Proportion of immigrants
* uniab_prop: Proportion of individuals with college/university degree or above
* workftfy_prop: Proportion of individuals who worked full time, full year
* work_usualplc_prop: Proportion of individuals who work at a usual place of work (instead of working from home)
* commute60m_prop: Proportion of individuals who commute 60 minutes or more to work

I will use these data, along with the most common venues data (extracted from the Foursquare API), to determine which neighborhood is best for opening a childcare center.
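
One possible way to turn the key indicators listed under the research question into a ranking is to standardize a few of the census variables and average them. The sketch below is only an illustration: the choice of columns, the equal weights, and the use of z-scores are assumptions rather than the method of this project, and the competitor count from Foursquare is left out because it is not part of this section. The four rows are copied from the `df_proc` output above.

```python
import pandas as pd

# Four rows copied from the df_proc output above (M7A is omitted: it has no census data).
toy = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M5A", "M6A"],
        "emp_rate": [57.5, 54.1, 64.4, 57.4],
        "tot_hhldinc_med": [64761.0, 54905.0, 52623.0, 53933.0],
        "cfam_wchild_prop": [0.461218, 0.414062, 0.251294, 0.399402],
        "uniab_prop": [0.410005, 0.347541, 0.517506, 0.369856],
    }
).set_index("Postal Code")

# Standardize each column so that variables on different scales are comparable.
z = (toy - toy.mean()) / toy.std()

# Equal-weight average of the z-scores (the weights are an assumption).
score = z.mean(axis=1).sort_values(ascending=False)
print(score)
```

With only four rows the resulting order says little; the point is the mechanics, which carry over unchanged to the full 96-row table.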