## This notebook obtains the data used to explore and cluster neighborhoods
I'll scrape the webpage https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the dataset.

### Scraping the Data

In [97]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [98]:
import urllib.request  # a bare "import urllib" does not guarantee that the request submodule is loaded
html=urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
data=BeautifulSoup(html,'html.parser')
table=data.find('table')
#print(table.prettify())

### Convert the table into *pandas* dataframe

In [99]:
content=[]
columns=['PostalCode','Borough','Neighborhood']
for item in table.find_all('tr')[1:]:
    td=item.find_all('td')
    content.append([cell.text for cell in td])
df=pd.DataFrame(content,columns=columns)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
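
The trailing `\n` in every cell comes straight from the cell text in the HTML. One option is to strip whitespace while the cells are being collected, so the DataFrame starts out clean. The sketch below illustrates the idea with the standard library's `html.parser` on a hypothetical inline snippet (`SAMPLE_HTML` and `CellCollector` are names made up for this example, not part of the notebook). With BeautifulSoup, the analogous change would be to use `cell.get_text(strip=True)` in place of `cell.text`.

```python
from html.parser import HTMLParser

# Hypothetical two-row snippet standing in for the Wikipedia table
SAMPLE_HTML = """
<table>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            # strip() removes the trailing newline at collection time
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # the header row has only <th> cells, so it stays empty
                self.rows.append(self._row)
            self._row = None


parser = CellCollector()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Stripping at this stage would also mean that the later filter compares against `'Not assigned'` rather than `'Not assigned\n'`.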


### Pre-processing the DataFrame

In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   PostalCode    180 non-null    object
 1   Borough       180 non-null    object
 2   Neighborhood  180 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


In [101]:
#remove all rows for which Borough is not assigned
df=df[df['Borough']!='Not assigned\n']
df.reset_index(drop=True,inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A\n,North York\n,Parkwoods\n
1,M4A\n,North York\n,Victoria Village\n
2,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
3,M6A\n,North York\n,"Lawrence Manor, Lawrence Heights\n"
4,M7A\n,Downtown Toronto\n,"Queen's Park, Ontario Provincial Government\n"


In [102]:
#remove the trailing \n from every cell
df['PostalCode']=df['PostalCode'].str.split('\n',expand=True)[0]
df['Neighborhood']=df['Neighborhood'].str.split('\n',expand=True)[0]
df['Borough']=df['Borough'].str.split('\n',expand=True)[0]
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
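
A more compact way to do the same cleanup is to strip whitespace from every text column in one pass. This is only a sketch on a made-up two-row frame (`toy` and `cleaned` are illustrative names), and it assumes every column holds strings, as is the case for the scraped table.

```python
import pandas as pd

# Made-up rows that mimic the scraped cells, trailing newline included
toy = pd.DataFrame({
    'PostalCode': ['M3A\n', 'M4A\n'],
    'Borough': ['North York\n', 'North York\n'],
    'Neighborhood': ['Parkwoods\n', 'Victoria Village\n'],
})

# Apply str.strip() column by column; this removes leading and trailing whitespace
cleaned = toy.apply(lambda col: col.str.strip())
print(cleaned)
```

Unlike splitting on `'\n'`, `str.strip()` also removes stray leading or trailing spaces, which may or may not be what is wanted for a given table.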


If a row has a borough but its neighborhood is **Not assigned**, the neighborhood is set to the borough name.

In [103]:
#use a boolean mask with .loc instead of chained indexing, which may fail to write back to df
unassigned=df['Neighborhood']=='Not assigned'
df.loc[unassigned,'Neighborhood']=df.loc[unassigned,'Borough']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
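
Since the first five rows contain no unassigned neighborhood, the output above does not show the rule taking effect. The sketch below applies the same rule to a made-up two-row frame using `Series.mask`, which replaces values wherever the condition is true. The rows and the names `toy` and `unassigned` are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows: the first has a borough but no assigned neighborhood
toy = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# Where the neighborhood is unassigned, take the borough name instead
unassigned = toy['Neighborhood'] == 'Not assigned'
toy['Neighborhood'] = toy['Neighborhood'].mask(unassigned, toy['Borough'])
print(toy)
```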


In [104]:
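#drop any rows still marked 'Not assigned'; .all() then reports, per column, whether every remaining value is non-empty
df[df['Neighborhood']!='Not assigned'].all()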

PostalCode      True
Borough         True
Neighborhood    True
dtype: bool

Now we group the neighborhoods that share the same postal code and show them in one row, separated by commas.

In [110]:
df_group=df.groupby(['PostalCode','Borough'])  #axis=0 is the default, and the axis argument is deprecated in recent pandas
df.shape[0]==len(df_group)

True

The check above shows that the webpage has since been updated so that no PostalCode is duplicated. I will still perform the join here to illustrate the processing.

In [111]:
data=df_group['Neighborhood'].apply(', '.join).to_frame().reset_index()
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
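
Because the scraped table no longer has duplicated postal codes, the join above leaves every row unchanged. To see what it would do on the older, one-neighborhood-per-row layout, here is a sketch on a made-up frame in which two postal codes appear twice (`toy` and `merged` are illustrative names, and the row split is an assumption, not scraped data).

```python
import pandas as pd

# Made-up long-form rows: one neighborhood per row, postal codes repeated
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M1B', 'M1B', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'Scarborough', 'Scarborough', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Malvern', 'Rouge', 'Parkwoods'],
})

# Group by postal code and borough, then join each group's neighborhoods with ', '
merged = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(merged)
```

Here five input rows collapse to three, one per postal code. `groupby` sorts the result by its keys by default, which is also why the real output starts at M1B rather than M3A.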


In [112]:
#save the data into csv for further use
data.to_csv('capstone_data.csv', index=False)

In [113]:
data.shape

(103, 3)