# Segmentation and Clustering of Neighborhoods in the city of Toronto, Canada

In this assignment, we will explore, segment, and cluster the neighborhoods in the city of Toronto. A Wikipedia page lists the postal codes, boroughs, and neighborhoods needed for this analysis. The data is scraped from that page, wrangled and cleaned, and then read into a pandas dataframe so that it is in a structured format.

To scrape the data from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, I used BeautifulSoup and requests.

In [1]:
!pip install beautifulsoup4
!pip install lxml
!pip install html5lib
!pip install requests

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

Scraping the table from the Wikipedia URL

In [3]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(wiki_url)
data = r.text
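# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, 'lxml')
My_table = soup.find('table', {'class': 'wikitable sortable'})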
#My_table


In [4]:
# Skip the first two rows, which contain headers, by slicing with [2:]
rows = My_table.find_all('tr')[2:]
#rows

In [15]:
data = {'Postal Code': [], 'Borough': [],'Neighborhood': []}
for row in rows:
    cols = row.find_all('td')
    data['Postal Code'].append(cols[0].get_text())
    data['Borough'].append(cols[1].get_text())
    data['Neighborhood'].append(cols[2].get_text())
#print(data)
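The row-parsing loop can be tried offline on a small hand-written table. The sketch below is only an illustration: the HTML string is a hypothetical stand-in for the Wikipedia table (the real page layout may differ), it uses Python's built-in `html.parser`, and it skips any row without exactly three `td` cells rather than slicing off a fixed number of header rows.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real markup may differ.
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

records = {"Postal Code": [], "Borough": [], "Neighborhood": []}
for row in table.find_all("tr"):
    cols = row.find_all("td")
    # Header rows use <th>, so they have no <td> cells and are skipped here.
    if len(cols) != 3:
        continue
    code, borough, hood = (c.get_text().strip() for c in cols)
    records["Postal Code"].append(code)
    records["Borough"].append(borough)
    records["Neighborhood"].append(hood)

print(records)
```

Stripping each cell at this stage is a design choice of the sketch; the notebook itself removes the trailing newline later, after loading the CSV.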

 

Converting the data into a dataframe

In [6]:
Final_table = pd.DataFrame(data)
Final_table.to_csv("Toronto Postcodes.csv", index = False)


Exploratory Data Analysis
In this section, the data is wrangled and cleaned, and unwanted elements are removed.

In [7]:
df = pd.read_csv("Toronto Postcodes.csv")
df.head()



Unnamed: 0,Postal Code,Borough,Neighborhood
0,M2A,Not assigned,Not assigned\n
1,M3A,North York,Parkwoods\n
2,M4A,North York,Victoria Village\n
3,M5A,Downtown Toronto,Harbourfront\n
4,M5A,Downtown Toronto,Regent Park\n


Remove the trailing "\n" from the values in the column "Neighborhood"

In [8]:
df['Neighborhood'] = df['Neighborhood'].map(lambda x: x.rstrip('\n'))
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M2A,Not assigned,Not assigned
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Harbourfront
4,M5A,Downtown Toronto,Regent Park
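As an alternative sketch, pandas' vectorized string accessor can do the same cleanup. The toy frame below is made up for illustration. Unlike a `lambda` that calls `x.rstrip`, `.str.strip()` leaves missing values as they are instead of raising an error, and it also removes leading whitespace.

```python
import pandas as pd

# Made-up sample values that mimic the scraped column.
toy = pd.DataFrame(
    {"Neighborhood": ["Parkwoods\n", " Victoria Village\n", "Regent Park"]}
)

# Vectorized strip: removes leading and trailing whitespace, including "\n".
toy["Neighborhood"] = toy["Neighborhood"].str.strip()

cleaned = toy["Neighborhood"].tolist()
print(cleaned)
```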


Replace "Not assigned" with NaN. The call applies to the whole dataframe, so it affects both the "Borough" and "Neighborhood" columns.

In [9]:
df.replace('Not assigned', np.nan, inplace = True)

df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M2A,,
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Harbourfront
4,M5A,Downtown Toronto,Regent Park
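To see how many placeholders were converted, the missing values can be counted per column. This is a minimal sketch on a made-up three-row frame, not the scraped data.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "Postal Code": ["M2A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Replace the placeholder everywhere, then count NaN values in each column.
toy = toy.replace("Not assigned", np.nan)
missing = toy.isna().sum().to_dict()
print(missing)
```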


Drop all rows where "Borough" is NaN (originally "Not assigned")

In [10]:
df_new = df.dropna(subset=['Borough'])
df_new.head(5)



Unnamed: 0,Postal Code,Borough,Neighborhood
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Harbourfront
4,M5A,Downtown Toronto,Regent Park
5,M6A,North York,Lawrence Heights


Fill missing values in the column "Neighborhood" with the name of the corresponding borough

In [11]:
# "Not assigned" was already replaced with NaN above, so only NaN values remain to be filled.
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning.
df_new = df_new.assign(Neighborhood=df_new['Neighborhood'].fillna(df_new['Borough']))
df_new.head()


Unnamed: 0,Postal Code,Borough,Neighborhood
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Harbourfront
4,M5A,Downtown Toronto,Regent Park
5,M6A,North York,Lawrence Heights
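The first five rows have no missing neighborhoods, so the effect of this step is not visible in the output above. A small made-up frame shows the intended behavior, using `Series.fillna` with another column as the source of fill values:

```python
import numpy as np
import pandas as pd

# Made-up rows: the second one has a borough but no neighborhood.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", np.nan],
})

# fillna aligns on the index, so each NaN takes the borough from its own row.
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])

filled = toy["Neighborhood"].tolist()
print(filled)
```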


Combine the rows that share the same postal code

In [12]:
df_final = df_new.groupby(['Postal Code','Borough'], sort = False).agg(lambda x: ','.join(x))
# Turn the group keys back into regular columns so that all three column titles sit on the same line
df_final.reset_index(level=['Postal Code','Borough'], inplace=True)
df_final.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
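The same grouping can be written without the separate `reset_index` call by passing `as_index=False`. The sketch below runs on a made-up three-row frame and is one possible variant, not a replacement for the cell above.

```python
import pandas as pd

# Made-up rows: M5A appears twice with different neighborhoods.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

# sort=False keeps the groups in order of first appearance;
# as_index=False keeps the group keys as ordinary columns.
merged = (
    toy.groupby(["Postal Code", "Borough"], sort=False, as_index=False)["Neighborhood"]
       .agg(lambda names: ",".join(names))
)

print(merged)
```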


In [13]:
df_final.shape
print('Number of rows in dataframe: ', df_final.shape[0])

Number of rows in dataframe:  103
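Beyond the row count, a few sanity checks can confirm that the cleaning steps did what they were meant to do. The helper below is hypothetical (the name `check_postcode_table` is not part of the notebook) and is only a sketch of the kind of checks that might be useful.

```python
import numpy as np
import pandas as pd

def check_postcode_table(df):
    """Return a list of problems found in a cleaned postal-code table."""
    problems = []
    if df["Postal Code"].duplicated().any():
        problems.append("duplicate postal codes")
    if df[["Borough", "Neighborhood"]].isna().any().any():
        problems.append("missing borough or neighborhood")
    if (df == "Not assigned").any().any():
        problems.append("'Not assigned' placeholder still present")
    return problems

# Made-up frames: one clean, one with a repeated code and a missing value.
good = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})
bad = pd.DataFrame({
    "Postal Code": ["M5A", "M5A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Harbourfront", np.nan],
})

good_problems = check_postcode_table(good)
bad_problems = check_postcode_table(bad)
print(good_problems, bad_problems)
```

Calling `check_postcode_table(df_final)` before saving would return an empty list if every check passes.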


In [14]:
# Save the cleaned table to a new CSV file
df_final.to_csv("Toronto Table with Postcodes.csv", index = False)