# Segmenting and Clustering Neighborhoods in Toronto
# Case -1

#### Notebook that scrapes the following Wikipedia page,
<a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</a>, to obtain the table of postal codes and transform it into a pandas dataframe.

### Library Installation

In [1]:
# !pip install BeautifulSoup4
# !pip install lxml



Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
Collecting soupsieve>1.2 (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.1 soupsieve-2.0.1
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/79/37/d420b7fdc9a550bd29b8cfeacff3b38502d9600b09d7dfae9a69e623b891/lxml-4.5.2-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.5.2


## Import Libraries

In [52]:
from bs4 import BeautifulSoup
import requests

import pandas as pd 

In [53]:
data_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(data_url)
response.raise_for_status()  # stop early if the page could not be fetched
source = response.text

In [54]:
# The page is HTML, so use lxml's HTML parser ('lxml') rather than its XML parser ('xml')
soup = BeautifulSoup(source, 'lxml')

In [55]:
table=soup.find('table')

In [56]:
# The dataframe has three columns: Postalcode, Borough, and Neighborhood
column_names = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = column_names)

In [57]:
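# Collect the postal code, borough, and neighborhood from each table row;
# rows without exactly three <td> cells (such as the header row) are skipped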
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data
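
Appending with `df.loc[len(df)]` grows the dataframe one row at a time. A minimal sketch of an alternative, assuming the same three-cell row layout: collect the rows in a plain list and build the dataframe once at the end. The inline `sample_html` and the helper name `table_to_frame` are hypothetical and exist only so the example runs without network access. It uses `html.parser`, Python's built-in parser, so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def table_to_frame(html, column_names):
    """Build a DataFrame from the <tr> rows that have one <td> per column."""
    table = BeautifulSoup(html, 'html.parser').find('table')
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.text.strip() for td in tr.find_all('td')]
        # The header row uses <th> cells, so it yields no <td> and is skipped.
        if len(cells) == len(column_names):
            rows.append(cells)
    return pd.DataFrame(rows, columns=column_names)

sample_df = table_to_frame(sample_html, ['Postalcode', 'Borough', 'Neighborhood'])
```

Building the dataframe from a list in one call avoids repeated row-by-row assignment, which can matter for larger tables.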

In [58]:
df.head()

|   | Postalcode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |



## Data Cleaning
### Remove rows where Borough is 'Not assigned'

In [59]:
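# .copy() gives an independent dataframe, so later assignments do not act on a view of the original
df = df[df['Borough'] != 'Not assigned'].copy()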

In [60]:
# Where a neighborhood is 'Not assigned', use the borough name instead
not_assigned = df['Neighborhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough']
df.head(5)
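
The intent of this step is to copy the borough name into the Neighborhood column wherever the neighborhood is 'Not assigned'. A minimal sketch of one way to express that, using `Series.mask` on toy data; the `toy` dataframe and its 'Example Borough' row are made up for illustration and are not part of the scraped table.

```python
import pandas as pd

# Toy data (not scraped): the second row has a borough but no neighborhood.
toy = pd.DataFrame({
    'Postalcode': ['M3A', 'M9Z'],
    'Borough': ['North York', 'Example Borough'],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Series.mask replaces values where the condition is True with the
# corresponding value from `other` (here, the Borough column).
toy['Neighborhood'] = toy['Neighborhood'].mask(
    toy['Neighborhood'] == 'Not assigned', toy['Borough']
)
```

Only the Neighborhood column changes; Postalcode and Borough are left as they were.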


In [61]:
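# Join the neighborhoods that share a postal code into one comma-separated string
temp_df = df.groupby('Postalcode')['Neighborhood'].apply(lambda x: ', '.join(x))
temp_df = temp_df.reset_index(drop=False)
# Use a temporary column name so it does not clash with 'Neighborhood' in the merge
temp_df.rename(columns={'Neighborhood':'Neighborhood_new'}, inplace=True)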

In [62]:
df_merge = pd.merge(df, temp_df, on='Postalcode')

In [63]:
df_merge.drop(['Neighborhood'],axis=1,inplace=True)

In [64]:
df_merge.drop_duplicates(inplace=True)

In [65]:
df_merge.rename(columns={'Neighborhood_new':'Neighborhood'},inplace=True)
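
The group, merge, drop, de-duplicate, and rename steps combine rows that share a postal code into a single row with a comma-separated neighborhood list. A minimal sketch of the same idea in one pass, assuming each postal code belongs to exactly one borough so that both can serve as grouping keys; the `toy` dataframe is illustrative data, not the scraped table.

```python
import pandas as pd

# Toy data: M5A appears twice, once per neighborhood.
toy = pd.DataFrame({
    'Postalcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Group on both keys so Borough survives the aggregation, then join the
# neighborhood names of each group. sort=False keeps first-appearance order.
combined = (
    toy.groupby(['Postalcode', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
```

Grouping on both columns removes the need for the temporary `Neighborhood_new` column and the merge, at the cost of relying on the one-borough-per-postal-code assumption.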

In [66]:
df_merge.head()

|   | Postalcode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [67]:
df_merge.shape

(103, 3)