### Toronto Segmenting and Clustering Notebook
This notebook scrapes a Wikipedia page with BeautifulSoup. The target is the Wikipedia list of Toronto postal codes (those beginning with M) and their neighborhoods, parsed as part of the week 3 assignment of the IBM Professional Data Science Capstone.

### Download and Clean Dataset from Wikipedia with BeautifulSoup
Use BeautifulSoup to parse the neighborhood data from the Wikipedia list of Toronto postal codes.

#### Import required libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd


# k-means, used in the later clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#### Get HTML content to parse

In [2]:
# Fetch the page with requests
page_link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_doc = requests.get(page_link, timeout=5)
html_doc.raise_for_status()  # stop early on an HTTP error instead of parsing an error page

# Parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(html_doc.content, 'html.parser')

#### Extract table content from soup into a list

In [3]:
# Collect the stripped text of every <td> cell in the first table, in document order
table = []
for cell in soup.table.tbody.find_all('td'):
    table.append(cell.text.strip())

# Preview list
table[:12]

['M1A',
 'Not assigned',
 'Not assigned',
 'M2A',
 'Not assigned',
 'Not assigned',
 'M3A',
 'North York',
 'Parkwoods',
 'M4A',
 'North York',
 'Victoria Village']
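
Flattening every `td` into one list works as long as each row has exactly three cells. As a hedged alternative, the sketch below walks the table row by row and keeps only complete rows. The `sample_html` snippet and the `parse_rows` helper are illustrative assumptions, not part of the original notebook, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal-code table, for illustration only.
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
    <tr><td>M4A</td><td>North York</td><td> Victoria Village </td></tr>
  </tbody>
</table>
"""


def parse_rows(html, n_cols=3):
    """Return one list of cell strings per table row that has n_cols <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.table.find_all("tr"):
        # Header rows use <th>, so they yield no <td> cells and are skipped.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == n_cols:
            rows.append(cells)
    return rows


rows = parse_rows(sample_html)
```

Keeping rows intact means a malformed row is dropped rather than shifting every later value into the wrong column.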

#### Split list into a DataFrame

In [4]:
d = {'PostalCode': table[0::3], 'Borough': table[1::3], 'Neighborhood': table[2::3]}
df = pd.DataFrame(d)

df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
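
The stride slices `table[0::3]`, `table[1::3]`, and `table[2::3]` assume the flat list holds exactly three cells per row. A minimal sketch of that idea with an explicit length check follows. The `chunk_rows` helper is a hypothetical name introduced here for illustration.

```python
def chunk_rows(flat, n_cols=3):
    """Split a flat list of cell values into tuples of n_cols consecutive items."""
    if len(flat) % n_cols != 0:
        # A leftover cell would silently misalign the stride slices.
        raise ValueError(
            f"expected a multiple of {n_cols} cells, got {len(flat)}"
        )
    return [tuple(flat[i:i + n_cols]) for i in range(0, len(flat), n_cols)]


flat = ['M1A', 'Not assigned', 'Not assigned',
        'M3A', 'North York', 'Parkwoods']

records = chunk_rows(flat)

# The stride slices pick out the same values column by column.
postal_codes = flat[0::3]
boroughs = flat[1::3]
neighborhoods = flat[2::3]
```

Checking the length first turns a silent column shift into an immediate error.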


#### Clean 'Not assigned' values from Borough and Neighborhood

In [5]:
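# Drop rows whose Borough is "Not assigned"; copy so the assignment below
# modifies an independent DataFrame and does not trigger SettingWithCopyWarning
df = df[df.Borough != 'Not assigned'].copy()

# Where Neighborhood is "Not assigned", use the Borough value instead
df['Neighborhood'] = df.apply(lambda x: x['Borough'] if x['Neighborhood'] == 'Not assigned' else x['Neighborhood'], axis=1)
df.head(10)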

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
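
The same cleaning rule can also be written without a row-wise `apply`, using a boolean mask. This is a sketch on a small made-up frame: the third row (`M0Z`, `Example Borough`) is invented purely to exercise the replacement branch and does not come from the Wikipedia table.

```python
import pandas as pd

# Illustrative data; the last row is hypothetical.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M0Z"],
    "Borough": ["Not assigned", "North York", "Example Borough"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Step 1: drop rows with no borough; copy to get an independent frame.
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Step 2: where the neighborhood is missing, fall back to the borough name.
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]
```

On a table of about a hundred rows the speed difference is negligible, so the mask form is mainly a readability choice.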


#### Group by PostalCode to concatenate Neighborhood entries

In [6]:
# Group by PostalCode to concatenate Neighborhood
df_postalgroup = df.groupby('PostalCode')

# Concatenate Neighborhood entries
y = df_postalgroup['Neighborhood'].apply(', '.join)
y = pd.DataFrame(y)

# Keep one (PostalCode, Borough) row per postal code, then merge in the joined neighborhoods
k = df[['PostalCode', 'Borough']].drop_duplicates(subset='PostalCode')
df_torontogrouped = pd.merge(k, y, on='PostalCode', how='left')

df_torontogrouped.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
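
The group, join, and merge steps can also be expressed as one aggregation. The sketch below assumes each postal code maps to a single borough, so taking the first Borough value per group is safe. The sample rows are illustrative and split one postal code across two rows to show the concatenation.

```python
import pandas as pd

# Illustrative input with one postal code spread over two rows.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# sort=False keeps postal codes in order of first appearance;
# as_index=False keeps PostalCode as a regular column.
grouped = sample.groupby("PostalCode", as_index=False, sort=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)
```

This avoids the intermediate frames, at the cost of relying on the one-borough-per-postal-code assumption stated above.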


In [7]:
# Print shape
df_torontogrouped.shape

(103, 3)