In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np

from pandas import json_normalize  # top-level since pandas 1.0; the pandas.io.json import path is deprecated

## The Battle of Neighborhoods! 

The Battle of Neighborhoods is a project for segmenting neighborhoods based on their popularity and attractiveness for living. Great!  
The ML techniques used in the project are *segmentation* and *clustering*, implemented in Python. 

Isn't that great?  
<img src="https://static.turbosquid.com/Preview/001281/021/5V/_DHQ.jpg" height="100" width="300">

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
result = requests.get(url).text
#result[:10000] # limit the output shown to the first 10 000 characters

We will use Beautiful Soup to pull the table of interest out of the HTML.

To make it more fun ...:  
'I like to refer to myself in 1st person plural' ;)  
So, from here on, it is *me* + *you* <3

In [3]:
# Pull the desired data/table out of the HTML using beautiful soup
soup = BeautifulSoup(result, 'lxml')
#print(soup.prettify())

In [4]:
table = soup.find('table',{'class':'wikitable sortable'})
#print(table)

We can explore the resulting table:  


In [5]:
# find all table cells, i.e. the td tags
cells = table.find_all('td')
#cells[:15]

Get rid of the tags and keep only the text between them.

In [6]:
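# Extract the text of each cell. str.lstrip/rstrip strip character sets, not
# prefixes, so get_text() is the safer way to drop the surrounding <td> tags.
cells = [cell.get_text().strip() for cell in cells]
cells[:15]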

['M1A',
 'Not assigned',
 '',
 'M2A',
 'Not assigned',
 '',
 'M3A',
 'North York',
 'Parkwoods',
 'M4A',
 'North York',
 'Victoria Village',
 'M5A',
 'Downtown Toronto',
 'Regent Park / Harbourfront']
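
A note on stripping tags by hand: `str.lstrip` and `str.rstrip` treat their argument as a set of characters, not as a prefix or suffix, so text that happens to start or end with one of those characters can be truncated. The sketch below illustrates this with a made-up cell (not taken from the wiki table) and a hypothetical helper `strip_td`, assuming Python 3.9+ for `removeprefix`/`removesuffix`. Beautiful Soup's `get_text()` avoids the problem altogether.

```python
def strip_td(cell_html: str) -> str:
    """Remove one leading <td> and one trailing </td>, then trim whitespace."""
    text = cell_html.strip()
    text = text.removeprefix("<td>").removesuffix("</td>")
    return text.strip()

# Hypothetical cell whose text starts with a character from the set "<td>"
raw = "<td>downtown\n</td>"

# Character-set stripping also eats the leading "d" of the text itself
naive = raw.lstrip("<td>").rstrip("</td>").strip()

# Prefix/suffix removal only drops the literal tags
safe = strip_td(raw)

print(repr(naive))
print(repr(safe))
```

With the capitalized names in our table the naive version happens to work, but we should not rely on that.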

The cells from the wiki table appear in the list row by row.   
So we have len(cells) // 3 rows, each with 3 values, one for each of the 3 columns.
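
The row-by-row layout means the flat list can be cut into groups of three. Here is a minimal sketch of that idea on a short sample list; `chunk_rows` and `sample_cells` are names we made up for illustration, and dropping an incomplete trailing group is our assumption about what should happen with leftover cells.

```python
from typing import List, Sequence, Tuple

def chunk_rows(cells: Sequence[str], width: int = 3) -> List[Tuple[str, ...]]:
    """Group a flat, row-ordered list of cells into tuples of `width` values."""
    usable = len(cells) - len(cells) % width  # ignore an incomplete trailing row
    return [tuple(cells[i:i + width]) for i in range(0, usable, width)]

# Sample values copied from the first assigned rows of the table
sample_cells = ['M3A', 'North York', 'Parkwoods',
                'M4A', 'North York', 'Victoria Village']

rows = chunk_rows(sample_cells)
for row in rows:
    print(row)
```

A list of rows like this can be passed straight to `pd.DataFrame(rows, columns=[...])`.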

Next, we create a *pandas* dataframe and fill it with these rows:

In [7]:
df = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighborhood'])
df

Unnamed: 0,PostalCode,Borough,Neighborhood


In [8]:
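# Group the flat list of cells into rows of three (postal code, borough, neighborhood).
# DataFrame.append was removed in pandas 2.0, so we build the frame from a list of rows.
# The range bound also includes the last complete row, which the old loop skipped.
rows = [cells[idx:idx + 3] for idx in range(0, len(cells) - 2, 3)]
df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
df.head(5)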

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


We only keep the rows that have an assigned borough. Rows whose borough is 'Not assigned' or, as in our case, an empty string '' are dropped.

In [9]:
# Keep only rows with an assigned, non-empty borough and renumber the index
df = df[~df['Borough'].isin(['Not assigned', ''])].reset_index(drop=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
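
The filtering step can be tried in isolation on a tiny frame. This is only a sketch: the postal code `M0X` and its empty borough are invented to show the empty-string case, and `sample`, `unassigned` and `filtered` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A', 'M0X'],  # 'M0X' is a made-up code
    'Borough': ['Not assigned', 'North York', 'North York', ''],
    'Neighborhood': ['', 'Parkwoods', 'Victoria Village', ''],
})

# True for rows whose borough is missing in either of the two forms
unassigned = sample['Borough'].isin(['Not assigned', ''])

# Invert the mask to keep assigned rows, then renumber the index from 0
filtered = sample[~unassigned].reset_index(drop=True)
print(filtered)
```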


More than one neighborhood can exist in one postal code area. We can see from the table that the names in the Neighborhood column are separated by '/'. We replace each separator with a comma.

In [10]:
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',', regex=False)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
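
If the spacing around the slash were irregular, a regular expression would be more forgiving than a literal `' /'`. Below is a hedged sketch of that variant; the values `'A/B'` and `'C /D'` are invented test strings, and whether the real table ever contains such spacing is an assumption we have not checked.

```python
import pandas as pd

names = pd.Series([
    'Parkwoods',
    'Regent Park / Harbourfront',
    'A/B',    # made-up: no spaces around the slash
    'C /D',   # made-up: space on one side only
])

# Replace a slash plus any surrounding whitespace with a comma and one space
cleaned = names.str.replace(r'\s*/\s*', ', ', regex=True)
print(cleaned.tolist())
```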


If a row has a borough but a 'Not assigned' neighborhood, then the neighborhood is set to the borough.

In [11]:
# Where the neighborhood is 'Not assigned', fall back to the borough name
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
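
None of the first five rows has a 'Not assigned' neighborhood, so the effect of this step is not visible above. A small sketch with an invented row shows what it does; the postal code `M9X` and its values are hypothetical and only serve the illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M9X'],  # 'M9X' is a made-up code
    'Borough': ['North York', 'Etobicoke'],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Select the rows that need the fallback and copy the borough over
missing = sample['Neighborhood'] == 'Not assigned'
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']
print(sample)
```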


In [12]:
df.shape

(103, 3)

### Thanks for having stayed tuned :) and for grading this :)