# Cluster the neighborhoods in Toronto. 

## Step 1: Explore our data

We install and import the libraries we will use.

In [77]:
!pip install bs4
!pip install requests



In [78]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

First, we fetch the Wikipedia page with requests and parse it with BeautifulSoup. We then locate the postal code table, read the column names from its header row, and loop over the remaining rows to collect the cell values of each row.

In [91]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(response.text, 'html.parser')

In [92]:
table = soup.find('table',{'class':"wikitable sortable"}).tbody

In [81]:
rows = table.find_all('tr')
print(rows[0])

<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>


In [82]:
# rows[0] contains the column names
columns = [v.text.replace('\n','') for v in rows[0].find_all('th')]
print(columns)

['Postcode', 'Borough', 'Neighbourhood']


In [83]:
df = pd.DataFrame(columns = columns)

In [84]:
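# Collect the cell values of every data row, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
data = []
for row in rows[1:]:
    values = [td.text.replace('\n','') for td in row.find_all('td')]
    data.append(values)
df = pd.DataFrame(data, columns = columns)
df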

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor
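
The scraping steps above can be tried offline on a small table. The following is a minimal sketch, assuming an inline HTML string (`html`, made up for illustration) in place of the Wikipedia page; the names `sample_rows`, `sample_columns`, `sample_data` and `sample_df` are not part of the notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical inline HTML standing in for the downloaded Wikipedia page
html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
sample_rows = soup.find('table', {'class': 'wikitable sortable'}).find_all('tr')

# The first row holds the headers, the remaining rows hold the data cells
sample_columns = [th.text.strip() for th in sample_rows[0].find_all('th')]
sample_data = [[td.text.strip() for td in row.find_all('td')]
               for row in sample_rows[1:]]

# Build the dataframe in one step from the collected rows
sample_df = pd.DataFrame(sample_data, columns=sample_columns)
print(sample_df)
```

Using `strip()` instead of `replace('\n','')` also removes stray spaces around the cell text, which may or may not matter for the real page.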


After obtaining the dataframe from that table, we clean the data.

We drop the rows whose borough is 'Not assigned'.

In [85]:
# Keep only the rows with an assigned borough and renumber the index from 0
df = df[df['Borough'] != 'Not assigned'].reset_index(drop = True)
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
...,...,...,...
205,M8Z,Etobicoke,Kingsway Park South West
206,M8Z,Etobicoke,Mimico NW
207,M8Z,Etobicoke,The Queensway West
208,M8Z,Etobicoke,Royal York South West


We combine the neighbourhoods that share a postcode into a single comma-separated string, then drop the duplicate rows.

In [86]:
df["Neighbourhood"] = df.groupby("Postcode")["Neighbourhood"].transform(lambda x: ', '.join(x))
df = df.drop_duplicates()
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
5,M7A,Downtown Toronto,Queen's Park
...,...,...,...
192,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
195,M4Y,Downtown Toronto,Church and Wellesley
196,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
197,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So..."
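
The same merge can also be written as a single aggregation. This is a minimal sketch on a hypothetical three-row frame (`toy`), assuming each postcode belongs to exactly one borough; it is an alternative formulation, not the cell used in this notebook.

```python
import pandas as pd

# Hypothetical sample with one postcode that appears twice
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M6A', 'M7A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Lawrence Heights', 'Lawrence Manor', "Queen's Park"],
})

# One row per (Postcode, Borough); neighbourhood names joined with ', '
merged = (
    toy.groupby(['Postcode', 'Borough'], as_index=False, sort=False)['Neighbourhood']
       .agg(', '.join)
)
print(merged)
```

Unlike `transform` followed by `drop_duplicates`, this variant returns a frame whose index is already contiguous.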


Lastly, any neighbourhood that is 'Not assigned' is replaced with the name of its borough.

In [87]:
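# Where the neighbourhood is 'Not assigned', use the borough name instead
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
df.head()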

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
5,M7A,Downtown Toronto,Queen's Park
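
The first rows shown above do not contain an unassigned neighbourhood, so the effect of this step is easier to see on a small example. The sketch below uses a hypothetical frame (`toy`) whose second row is invented for illustration, and `Series.mask` as one possible way to do the replacement.

```python
import pandas as pd

# Hypothetical rows; the second one has no neighbourhood assigned
toy = pd.DataFrame({
    'Postcode': ['M3A', 'M9Z'],
    'Borough': ['North York', 'Etobicoke'],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Where the neighbourhood is 'Not assigned', take the value from Borough
missing = toy['Neighbourhood'] == 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].mask(missing, toy['Borough'])
print(toy)
```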


We check the shape of the resulting dataframe:

In [89]:
df.shape

(103, 3)
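
Beyond the shape, a couple of quick checks can confirm that the cleaning steps did what was intended. This is a minimal sketch, assuming a hypothetical helper `check_clean` and made-up sample frames; it is not part of the original notebook.

```python
import pandas as pd

def check_clean(frame):
    """Hypothetical helper: True if no 'Not assigned' values remain
    and every postcode appears exactly once."""
    unassigned = (frame[['Borough', 'Neighbourhood']] == 'Not assigned').any().any()
    return bool(not unassigned and frame['Postcode'].is_unique)

# Hypothetical cleaned sample
clean = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})

# Hypothetical sample that still has a duplicated postcode
dirty = pd.DataFrame({
    'Postcode': ['M6A', 'M6A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Lawrence Heights', 'Lawrence Manor'],
})

print(check_clean(clean), check_clean(dirty))
```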