# Coursera Capstone - Toronto Segmentation and Clustering

## Overview

This notebook scrapes the PostalCode, Borough and Neighborhood columns from the Wikipedia page that lists Toronto postal codes. Below is a brief outline of useful resources and takeaways from completing this Assignment (for my own future reference and for others):

* Portions of the Python for Data Science labs were helpful for the web-scraping portion of this assignment.
* Ultimately, I decided to use the Pandas tools (rather than BeautifulSoup), since the main objective was to extract tables from the target URL.
* Stackoverflow.com was a great resource for getting through some sticking points (especially those related to pandas dataframes).
* Pandas cheat sheets helped as quick references for methods that summarize data and its structure (here's one of the few that I used: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).
* The most challenging portion of this Assignment (at least for me) was flattening the groupby results in the Neighborhood column of the dataframe. The following page helped get me over the "hump" in that regard: http://queirozf.com/entries/pandas-dataframe-groupby-examples#flatten-hierarchical-indices-created-by-groupby. Credit where credit is due.
* Thank you for reviewing my notebook, and good luck on your endeavors!


## Start off by importing libraries

In [285]:
!pip install bs4 
!pip install lxml
import numpy as np
import pandas as pd
import requests
import urllib.request, urllib.parse, urllib.error

from bs4 import BeautifulSoup #can use BeautifulSoup to parse website



You are using pip version 19.0, however version 19.1.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.






## Use Pandas and its features to read the target URL

In [286]:
#web scraping using pandas
#note: set header = 0 to use the first row as column headers

d = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)

#Notes on the list of tables that pd.read_html returns for this URL:
#type(d) #prints 'list'
#print(len(d)) #prints '3' since there are 3 tables on this page; the one we want is at index 0
#print(d[0]) #prints the first table
#to read a single cell, use df['column_name'][row_label]

### Extract the table from the URL as a Pandas dataframe (note the shape and column names)

In [287]:
#set the target table to a 'dataframe'
df = d[0]

#check the shape of df
print(len(df)) #prints '288', the number of data rows (the header row is not counted)
print(df.shape) #prints '(288, 3)'
print(df.columns) #prints 'Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')'

288
(288, 3)
Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')


### View the dataframe contents (use .head() if you only want to take a peek)

In [288]:
#view the df
print(df)

    Postcode           Borough  \
0        M1A      Not assigned   
1        M2A      Not assigned   
2        M3A        North York   
3        M4A        North York   
4        M5A  Downtown Toronto   
5        M5A  Downtown Toronto   
6        M6A        North York   
7        M6A        North York   
8        M7A      Queen's Park   
9        M8A      Not assigned   
10       M9A         Etobicoke   
11       M1B       Scarborough   
12       M1B       Scarborough   
13       M2B      Not assigned   
14       M3B        North York   
15       M4B         East York   
16       M4B         East York   
17       M5B  Downtown Toronto   
18       M5B  Downtown Toronto   
19       M6B        North York   
20       M7B      Not assigned   
21       M8B      Not assigned   
22       M9B         Etobicoke   
23       M9B         Etobicoke   
24       M9B         Etobicoke   
25       M9B         Etobicoke   
26       M9B         Etobicoke   
27       M1C       Scarborough   
28       M1C  

### Check to see how many Boroughs are 'Not assigned'

In [289]:
#filter for Boroughs that are "Not Assigned"
df[df.Borough == 'Not assigned'] #shows df filtering rows with Borough == 'Not assigned'

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
9,M8A,Not assigned,Not assigned
13,M2B,Not assigned,Not assigned
20,M7B,Not assigned,Not assigned
21,M8B,Not assigned,Not assigned
30,M2C,Not assigned,Not assigned
36,M7C,Not assigned,Not assigned
37,M8C,Not assigned,Not assigned
45,M2E,Not assigned,Not assigned


### Remove rows whose Borough is 'Not assigned' (and view the results)

In [290]:
#remove rows whose Borough is 'Not assigned'
#general pattern: df = df.drop(df[<condition>].index)
df_del_na = df.drop(df[df.Borough == 'Not assigned'].index)
df_del_na

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


### Observe the shape of the updated dataframe

In [291]:
df_del_na.shape

(211, 3)

### Check for any remaining 'Not assigned' Neighborhoods

In [292]:
#check to see if there are any 'Not assigned' Neighborhoods
df_del_na[df_del_na.Neighbourhood == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Not assigned


### Set Neighborhood from 'Not assigned' to "Queen's Park" (and view the results)

In [293]:
#set Neighborhood from 'Not assigned' to "Queen's Park"
df_del_na.loc[df_del_na['Neighbourhood'] == "Not assigned", 'Neighbourhood'] = "Queen's Park"
df_del_na

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [294]:
#verify that the "Queen's Park" Neighbourhood now exists
df_del_na[df_del_na.Neighbourhood == "Queen's Park"] 

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Queen's Park
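
Hardcoding "Queen's Park" works here because only one row is affected. If more rows had an unassigned Neighbourhood, one possible generalization is to copy the value from the Borough column. This is a sketch under that assumption, using a two-row illustrative frame (`sample` and `mask` are names made up for the example):

```python
import pandas as pd

# Two illustrative rows taken from the outputs above
sample = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Not assigned"],
})

# Rows whose Neighbourhood is missing
mask = sample["Neighbourhood"] == "Not assigned"

# Fill those rows with the Borough value from the same row
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]
print(sample)
```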


### Rename columns: 'Postcode' to 'PostalCode' and 'Neighbourhood' to 'Neighborhood'

In [299]:
df_del_na.rename(columns={'Postcode': 'PostalCode', 'Neighbourhood': 'Neighborhood'}, inplace=True)
df_del_na.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


### View the groupby table

In [298]:
df_del_na_grouped = df_del_na.groupby(['PostalCode', 'Borough', 'Neighborhood']).count()
df_del_na_grouped

PostalCode,Borough,Neighborhood
M1B,Scarborough,Malvern
M1B,Scarborough,Rouge
M1C,Scarborough,Highland Creek
M1C,Scarborough,Port Union
M1C,Scarborough,Rouge Hill
M1E,Scarborough,Guildwood
M1E,Scarborough,Morningside
M1E,Scarborough,West Hill
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


### Flatten the table (collect the Neighborhood values for each PostalCode into one row)

In [309]:
#flatten groupby results by setting Neighborhood to list per row
df_del_na_grouped_flat = df_del_na.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda group_series: group_series.tolist()).reset_index()
print(df_del_na_grouped_flat.head(10), '\n', '\n')
print(df_del_na_grouped_flat.tail(10))

  PostalCode      Borough                                       Neighborhood
0        M1B  Scarborough                                   [Rouge, Malvern]
1        M1C  Scarborough           [Highland Creek, Rouge Hill, Port Union]
2        M1E  Scarborough                [Guildwood, Morningside, West Hill]
3        M1G  Scarborough                                           [Woburn]
4        M1H  Scarborough                                        [Cedarbrae]
5        M1J  Scarborough                              [Scarborough Village]
6        M1K  Scarborough      [East Birchmount Park, Ionview, Kennedy Park]
7        M1L  Scarborough                  [Clairlea, Golden Mile, Oakridge]
8        M1M  Scarborough  [Cliffcrest, Cliffside, Scarborough Village West]
9        M1N  Scarborough                      [Birch Cliff, Cliffside West] 
 

    PostalCode     Borough                                       Neighborhood
93         M9A   Etobicoke                                 [Islington A
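
If a single comma-separated string per postal code is preferred over a Python list, the same groupby can aggregate with `", ".join` instead. This is a sketch of that alternative on sample rows, not the approach used in the cell above; `sample` and `joined` are illustrative names.

```python
import pandas as pd

# Illustrative sample, not the full table
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1C", "M1C", "M1C", "M1G"],
    "Borough": ["Scarborough"] * 6,
    "Neighborhood": ["Rouge", "Malvern", "Highland Creek",
                     "Rouge Hill", "Port Union", "Woburn"],
})

# Join the neighborhoods of each (PostalCode, Borough) group into one string
joined = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(joined)
```

The number of rows is the same as with the list version, since only the type of the Neighborhood cell changes.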

### Evaluate the shape of the final table

In [311]:
df_del_na_grouped_flat.shape

(103, 3)