# IBM Capstone Assignment


## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto


### Load required libs

In [1]:
# import libs
import pandas as pd
import seaborn as sns
import requests
from bs4 import BeautifulSoup


### Data scraping

In [2]:
# data scraping
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
data = requests.get(url).text
soup = BeautifulSoup(data, "html5lib")
tables = soup.find_all('table')

# create dataframe: collect one dict per table row, then build the frame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for row in tables[0].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # header rows contain no <td> cells
        rows.append({
            "PostalCode": col[0].text,
            "Borough": col[1].text,
            "Neighborhood": col[2].text.strip(),
        })
pdf = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

### Data wrangling

In [3]:
# remove new line symbols (\n)
pdf = pdf.replace('\n',' ', regex=True)

# remove empty space after postal code
pdf['PostalCode'] = pdf['PostalCode'].str.strip()

# remove rows whose borough is "Not assigned"
pdf = pdf[~pdf['Borough'].str.contains('Not assigned')]
pdf

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


### Quick data analysis on repeated values

1) Whenever a Borough is "Not assigned", its Neighborhood is "Not assigned" as well. Dropping the rows with an unassigned Borough therefore also removes every unassigned Neighborhood.

2) The check below confirms that no Postal Code appears more than once.

3) Because every Postal Code is unique, grouping the rows by Postal Code with "groupby" would leave the data frame unchanged.
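The three points above can be sketched on a small stand-in table. This is a minimal illustration, not the notebook's actual run: the `toy` frame and its four rows are made up for the example, and joining neighborhoods with `", ".join` is only one assumed way the grouping could be done.

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative rows, not scraped data)
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village",
                     "Regent Park, Harbourfront"],
})

# Point 1: count unassigned neighborhoods that sit on an assigned borough
unassigned_borough = toy["Borough"] == "Not assigned"
unassigned_nbhood = toy["Neighborhood"] == "Not assigned"
orphans = int((unassigned_nbhood & ~unassigned_borough).sum())

cleaned = toy[~unassigned_borough]

# Point 2: are any postal codes repeated after cleaning?
has_duplicates = bool(cleaned["PostalCode"].duplicated().any())

# Point 3: with unique postal codes, grouping yields one group per row
grouped = (
    cleaned.groupby("PostalCode", as_index=False)
    .agg({"Borough": "first", "Neighborhood": ", ".join})
)
same_shape = grouped.shape == cleaned.shape

print(orphans, has_duplicates, same_shape)
```

If `orphans` were nonzero or `has_duplicates` were true on the real data, the grouping step would no longer be a no-op and would need to be applied.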

In [4]:
# check repeated values
multi_pcode = pdf['PostalCode'].duplicated().any()
multi_borough = pdf['Borough'].duplicated().any()
multi_neighborhood = pdf['Neighborhood'].duplicated().any()
str1 = 'Are there repeated values for Postal Code? {}'
str2 = 'Are there repeated values for Borough? {}'
str3 = 'Are there repeated values for Neighborhood? {}'

print(str1.format(multi_pcode))
print(str2.format(multi_borough))
print(str3.format(multi_neighborhood))

Are there repeated values for Postal Code? False
Are there repeated values for Borough? True
Are there repeated values for Neighborhood? True


In [7]:
# postal codes are unique, so no groupby is applied; report the final shape
print('Shape of the dataframe: ', pdf.shape)

Shape of the dataframe:  (103, 3)
