# Segmenting and Clustering 

### Neighborhoods in Toronto:

#### Parsing data from the Wikipedia page and creating a dataframe.

#### Import libraries:

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

#### URL of the wiki page:

In [2]:
url_w = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#### Use BeautifulSoup to parse the HTML of the wiki page and find the first table in it:

In [3]:
shtml = requests.get(url_w).text
soup = BeautifulSoup(shtml, 'html.parser')

In [4]:
# Take the first <table> element on the page
table = soup.find('table')
print("... table data parsed ...")

... table data parsed ...


#### Define "table_rows" and find all "tr" tags:

In [5]:
table_rows = table.find_all('tr')

#### Define the rows_list list, and use a loop to append every row to it:

In [6]:
rows_list = []

In [7]:
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    rows_list.append(row)
    print(row)

[]
['M1A', 'Not assigned', 'Not assigned\n']
['M2A', 'Not assigned', 'Not assigned\n']
['M3A', 'North York', 'Parkwoods\n']
['M4A', 'North York', 'Victoria Village\n']
['M5A', 'Downtown Toronto', 'Harbourfront\n']
['M5A', 'Downtown Toronto', 'Regent Park\n']
['M6A', 'North York', 'Lawrence Heights\n']
['M6A', 'North York', 'Lawrence Manor\n']
['M7A', "Queen's Park", 'Not assigned\n']
['M8A', 'Not assigned', 'Not assigned\n']
['M9A', 'Etobicoke', 'Islington Avenue\n']
['M1B', 'Scarborough', 'Rouge\n']
['M1B', 'Scarborough', 'Malvern\n']
['M2B', 'Not assigned', 'Not assigned\n']
['M3B', 'North York', 'Don Mills North\n']
['M4B', 'East York', 'Woodbine Gardens\n']
['M4B', 'East York', 'Parkview Hill\n']
['M5B', 'Downtown Toronto', 'Ryerson\n']
['M5B', 'Downtown Toronto', 'Garden District\n']
['M6B', 'North York', 'Glencairn\n']
['M7B', 'Not assigned', 'Not assigned\n']
['M8B', 'Not assigned', 'Not assigned\n']
['M9B', 'Etobicoke', 'Cloverdale\n']
['M9B', 'Etobicoke', 'Islington\n']
['M9B'

In [8]:
rows_list[0:5]

[[],
 ['M1A', 'Not assigned', 'Not assigned\n'],
 ['M2A', 'Not assigned', 'Not assigned\n'],
 ['M3A', 'North York', 'Parkwoods\n'],
 ['M4A', 'North York', 'Victoria Village\n']]
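
The empty list at index 0 most likely comes from the table's header row, whose cells are `<th>` rather than `<td>` elements, so `find_all('td')` returns nothing for it. The following is a minimal sketch of that behavior, assuming a small made-up HTML snippet (`sample_html` and the `sample_*` names are illustrative and are not the real page content). It also shows one possible way to skip empty rows and strip whitespace while extracting.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table')

sample_rows = []
for tr in sample_table.find_all('tr'):
    # get_text(strip=True) removes the trailing newline from each cell
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row has only <th> cells, so its list is empty
        sample_rows.append(cells)

print(sample_rows)
```

Filtering at this stage would make the later "drop row 0" and newline-stripping steps unnecessary, at the cost of a slightly busier loop.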

#### Add list content to dataframe (df_neigh):

In [9]:
df_neigh = pd.DataFrame(rows_list)
df_neigh.head(5)

Unnamed: 0,0,1,2
0,,,
1,M1A,Not assigned,Not assigned\n
2,M2A,Not assigned,Not assigned\n
3,M3A,North York,Parkwoods\n
4,M4A,North York,Victoria Village\n


#### Rename the columns with proper names, drop the empty first row (index 0), and remove all rows where "Borough" is "Not assigned":

In [10]:
df_neigh.columns = ['PostalCode', 'Borough','Neighborhood']
df_neigh.drop(0, inplace = True)
df_neigh.drop(df_neigh.loc[df_neigh['Borough']=='Not assigned'].index, inplace=True)
df_neigh.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods\n
4,M4A,North York,Victoria Village\n
5,M5A,Downtown Toronto,Harbourfront\n
6,M5A,Downtown Toronto,Regent Park\n
7,M6A,North York,Lawrence Heights\n
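
As an alternative to the in-place `drop` calls, the same cleanup can be sketched with a boolean mask. The example assumes a tiny hand-written `sample_rows` list that mimics the scraped rows; it is illustrative only.

```python
import pandas as pd

# Hypothetical rows in the same shape as rows_list.
sample_rows = [
    [],
    ['M1A', 'Not assigned', 'Not assigned\n'],
    ['M3A', 'North York', 'Parkwoods\n'],
    ['M7A', "Queen's Park", 'Not assigned\n'],
]

# Skip empty rows while building the frame and name the columns up front.
sample_df = pd.DataFrame(
    [row for row in sample_rows if row],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# Keep only rows with an assigned borough; original index labels are preserved.
sample_df = sample_df[sample_df['Borough'] != 'Not assigned']
print(sample_df)
```

Note that the `M7A` row survives the filter: its borough is assigned even though its neighborhood is not.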


#### While parsing the HTML, the newline character (\n) was captured too, so we need to remove it from the values in the "Neighborhood" column:

In [11]:
df_neigh['Neighborhood'] = df_neigh['Neighborhood'].map(lambda x: x.rstrip('\n'))
df_neigh.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
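
The same cleanup can also be written with the vectorized string accessor instead of `map` with a lambda. A minimal sketch on a made-up `sample_df`:

```python
import pandas as pd

# Hypothetical sample; the last value has no trailing newline.
sample_df = pd.DataFrame(
    {'Neighborhood': ['Parkwoods\n', 'Victoria Village\n', 'Harbourfront']}
)

# .str.rstrip removes the given trailing characters from every value.
sample_df['Neighborhood'] = sample_df['Neighborhood'].str.rstrip('\n')
print(sample_df['Neighborhood'].tolist())
```

Unlike the lambda version, the `.str` accessor passes missing values through as NaN instead of raising an error.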


#### The next step is to group the dataframe by postal code and combine the neighborhoods that share a postal code into one value, separated by the ',' character. We also reset the index of the resulting dataset.

In [12]:
df_tor = df_neigh.astype(str).groupby('PostalCode').agg(lambda x: ','.join(x.unique()))
df_tor.reset_index(inplace = True) 
df_tor.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
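
To make the grouping step easier to follow, here is a small sketch on a made-up `sample_df`. It assumes each postal code belongs to exactly one borough, so it takes the first borough per group and joins the neighborhoods. Unlike the cell above, it does not de-duplicate values with `unique()`.

```python
import pandas as pd

# Hypothetical sample with two postal codes that span several neighborhoods.
sample_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M1B', 'M1B', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'Scarborough', 'Scarborough', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Rouge', 'Malvern', 'Parkwoods'],
})

grouped = (
    sample_df
    .groupby('PostalCode', as_index=False)  # keep PostalCode as a regular column
    .agg({'Borough': 'first', 'Neighborhood': ','.join})
)
print(grouped)
```

With `as_index=False` the separate `reset_index` call is not needed, and the groups come back sorted by postal code.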


#### Some rows have a borough but "Not assigned" as the neighborhood. To give those rows the borough name, first replace "Not assigned" in the "Neighborhood" column with NaN, then fill the NaN values with the values from the "Borough" column:

In [13]:
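# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df_tor
df_tor['Neighborhood'] = df_tor['Neighborhood'].replace("Not assigned", np.nan)
df_tor['Neighborhood'] = df_tor['Neighborhood'].fillna(df_tor['Borough'])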
df_tor.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
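
The first ten rows shown above contain no unassigned neighborhoods, so the effect of this step is not visible there. A small sketch on a made-up `sample_df` shows it, using `Series.mask` as one possible alternative that avoids the intermediate NaN values:

```python
import pandas as pd

# Hypothetical sample: M7A has a borough but no assigned neighborhood.
sample_df = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# Where the condition is True, take the value from Borough instead.
unassigned = sample_df['Neighborhood'] == 'Not assigned'
sample_df['Neighborhood'] = sample_df['Neighborhood'].mask(unassigned, sample_df['Borough'])
print(sample_df)
```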


#### Just to keep things safe, we are going to save dataset to .csv file:

In [14]:
df_tor.to_csv('Toronto_PostalCodes.csv')
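
By default `to_csv` also writes the index as an unnamed first column, which `read_csv` later labels `Unnamed: 0`. If that extra column is not wanted, passing `index=False` avoids it. A minimal round-trip sketch, assuming a made-up one-row `sample_df` and an in-memory buffer in place of a file:

```python
import io
import pandas as pd

# Hypothetical one-row sample in the same shape as df_tor.
sample_df = pd.DataFrame({
    'PostalCode': ['M1B'],
    'Borough': ['Scarborough'],
    'Neighborhood': ['Rouge,Malvern'],
})

buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)  # omit the index column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

The joined neighborhood value contains a comma, so it is quoted in the CSV and comes back as a single field.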

#### For testing and inspection, sort the dataset by 'Neighborhood':

In [15]:
toronto_set = df_tor.sort_values(by=['Neighborhood'])
toronto_set.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
58,M5H,Downtown Toronto,"Adelaide,King,Richmond"
12,M1S,Scarborough,Agincourt
14,M1V,Scarborough,"Agincourt North,L'Amoreaux East,Milliken,Steel..."
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam..."
89,M8W,Etobicoke,"Alderwood,Long Branch"
28,M3H,North York,"Bathurst Manor,Downsview North,Wilson Heights"
19,M2K,North York,Bayview Village
62,M5M,North York,"Bedford Park,Lawrence Manor East"
56,M5E,Downtown Toronto,Berczy Park
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


#### Size of the set:

In [16]:
toronto_set.shape

(103, 3)