# Explore, segment, and cluster the neighborhoods in the city of Toronto  
A Wikipedia page contains all the information we need to explore and cluster the neighborhoods in Toronto.  
I will scrape the Wikipedia page, wrangle and clean the data, and read it into a pandas dataframe so that it is in a structured format. Once the data is structured, I can analyze it to explore and cluster the neighborhoods in the city of Toronto.  
### Sorry for my English!

In [1]:
import pandas as pd
import numpy as np

print('Pandas version: ',pd.__version__)
print('Numpy version: ',np.__version__)

Pandas version:  0.20.3
Numpy version:  1.13.1


Install the *lxml* and *BeautifulSoup* packages for parsing HTML pages if you haven't installed them yet.

In [2]:
# install lxml
#!conda install lxml --yes

# install BeautifulSoup
#!conda install BeautifulSoup4 --yes

Let's use the *BeautifulSoup* package to parse the wiki page. Note that the page contains several tables, so I will use the *find()* method, which returns only the first match, instead of *findAll()*. Luckily, the neighborhood data is in the first table on the page, so the first *find()* gives us everything we need.

In [24]:
# import libraries
from urllib.request import urlopen
import bs4

#wiki page for neighborhoods in Toronto
WIKI_PAGE = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# array for table content
data = []

# get contents of wiki page
page = urlopen(WIKI_PAGE)
# init soup
soup = bs4.BeautifulSoup(page, "lxml")
# get all content inside _first_ 'tbody' tag
table_body = soup.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)

# the first row of the table holds the header names in 'th' tags, so it has
# no 'td' cells and the first element of data is an empty list - just remove it
del data[0]

# check result
data[0:5]

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront']]
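
To illustrate why *find()* is enough here and why the first element of `data` comes out empty, here is a minimal sketch on a small inline HTML fragment. The fragment is hypothetical (it is not the real Wikipedia markup), and it uses Python's built-in `html.parser` instead of *lxml* only so that it runs without extra packages.

```python
import bs4

# Hypothetical HTML fragment with two tables (not the real Wikipedia markup)
html = """
<table><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</tbody></table>
<table><tbody>
<tr><td>another table</td></tr>
</tbody></table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag, find_all() returns every match
first_body = soup.find("tbody")
all_bodies = soup.find_all("tbody")

# the header row has 'th' cells only, so its list of 'td' cells is empty
rows = [[td.text.strip() for td in tr.find_all("td")]
        for tr in first_body.find_all("tr")]
print(len(all_bodies))
print(rows)
```

The header row produces an empty list because it has no `td` cells, which is why the notebook deletes `data[0]` after the loop.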

Everything looks good. Convert the list to a pandas dataframe:

In [25]:
df = pd.DataFrame(data, columns=['PostalCode','Borough','Neighborhood'])
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [26]:
df.shape

(287, 3)

Remove all rows where the borough is not assigned:

In [28]:
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)
df.shape

(210, 3)
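
The same filtering can also be written with a boolean mask instead of `drop(..., inplace=True)`. The sketch below is only an alternative, not the approach used in this notebook; it works on a few sample rows taken from the output above, and the names `sample` and `assigned` are made up for the illustration.

```python
import pandas as pd

# Sample rows copied from the scraped data shown earlier
sample = pd.DataFrame(
    [['M1A', 'Not assigned', 'Not assigned'],
     ['M3A', 'North York', 'Parkwoods'],
     ['M4A', 'North York', 'Victoria Village']],
    columns=['PostalCode', 'Borough', 'Neighborhood'])

# Keep only the rows whose borough is assigned; 'sample' itself is not modified
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

Note that `reset_index(drop=True)` renumbers the rows from 0, while the notebook's `drop()` call keeps the original index labels.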

Let's use the borough name for any neighborhood that is 'Not assigned':

In [38]:
df[df['Neighborhood'] == 'Not assigned']

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 7 | M7A | Queen's Park | Not assigned |


In [48]:
df['Neighborhood'] = df.apply(lambda x : x['Borough'] if x['Neighborhood'] == 'Not assigned' else x['Neighborhood'], axis=1)
df[df['Neighborhood'] == 'Not assigned']

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
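
A row-wise `apply` works fine on a table of this size. As a possible vectorized alternative, the sketch below uses `Series.where`, which keeps a value where the condition is true and takes the replacement otherwise. The two sample rows come from the outputs above; the names `sample` and `is_assigned` are made up for the illustration.

```python
import pandas as pd

# Two sample rows taken from the data above
sample = pd.DataFrame(
    [['M5A', 'Downtown Toronto', 'Harbourfront'],
     ['M7A', "Queen's Park", 'Not assigned']],
    columns=['PostalCode', 'Borough', 'Neighborhood'])

# True where the neighborhood already has a real name
is_assigned = sample['Neighborhood'] != 'Not assigned'

# keep assigned names, fall back to the borough name everywhere else
sample['Neighborhood'] = sample['Neighborhood'].where(is_assigned, sample['Borough'])
print(sample)
```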


Done! All neighborhoods are assigned.  
More than one neighborhood can exist in one postal code area. I'll group the dataframe by postal code and borough, then apply a lambda function to concatenate the neighborhoods in each group:

In [56]:
df_grouped = df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).to_frame().reset_index()
df_grouped.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [57]:
df_grouped.shape

(103, 3)
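
To make the grouping step easier to follow, here is a minimal sketch of the same idea on three sample rows, so the effect of the join is visible at a glance. It uses `agg` with `', '.join` instead of the lambda above, which should give the same result; the names `sample` and `grouped` are made up for the illustration.

```python
import pandas as pd

# Three ungrouped sample rows: two neighborhoods share the postal code M1B
sample = pd.DataFrame(
    [['M1B', 'Scarborough', 'Rouge'],
     ['M1B', 'Scarborough', 'Malvern'],
     ['M1G', 'Scarborough', 'Woburn']],
    columns=['PostalCode', 'Borough', 'Neighborhood'])

# one row per (PostalCode, Borough) pair, neighborhoods joined with ', '
grouped = (sample.groupby(['PostalCode', 'Borough'])['Neighborhood']
                 .agg(', '.join)
                 .reset_index())
print(grouped)
```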