<h1> Part 1: Segmenting and Clustering Neighborhoods in Toronto </h1>

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Creating an empty dataframe to house the data:

neighbourhoods=pd.DataFrame(columns=['PostalCode','Borough','Neighborhood'])
neighbourhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood


<h2> I will be using Beautiful Soup to scrape the data off Wikipedia </h2>
<p> If you are taking this course as well and can't figure out how to scrape the data, I am following the instructions in this <a href="https://www.youtube.com/watch?v=ng2o98k983k"> YouTube tutorial </a> and on <a href="https://towardsdatascience.com/step-by-step-tutorial-web-scraping-wikipedia-with-beautifulsoup-48d7f2dfa52d">this page</a>.</p>

<h3> First step: install and import all the necessary libraries </h3>

In [3]:
#installing beautiful soup
! pip install beautifulsoup4 



In [4]:
#installing the html parser 
! pip install lxml 



In [5]:
#installing  requests 
!pip install requests  



In [6]:
#importing my newly installed packages 
from bs4 import BeautifulSoup
import requests

<h3> Second step: scrape the data </h3>
<p> IMPORTANT NOTE: the Wikipedia page has been modified since IBM staff created the instructions. I found the link to the old version by following some tips in the forums. </p>

In [7]:
#this line uses requests to .get the page behind the link and keeps its .text (i.e. the html) as a string 
page=requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641.').text

soup= BeautifulSoup(page,'lxml') #this gets us a parsed version of the file using a parser called 'lxml'
   
#print(soup.prettify()) #this has been commented out to save space after checking the printed output made sense


In [8]:
wiki_table= soup.find_all('table', class_="wikitable") #isolating the table we want, it is the only one with class wikitable
# print(wiki_table)  #this has been commented out to save space after checking the printed output made sense

In [9]:
#creating 3 empty lists to store the data from each column before it gets turned into a pandas df 
PostalCode=[]
Borough=[]
Neighborhood=[]

In [10]:
#and then create a for loop that will go through the table pulling out the data 

for item in wiki_table:  
    rows=item.find_all('tr') #tr is the html item that defines the rows of a table
    
    for row in rows: #inside each tr there will be td that denote the cells
        cells=row.find_all('td') 
        #when python runs this for the first row it won't find any td because that row is the header and has no data
        #therefore we can put an if condition based on the len of the result to skip the header row
        if len(cells)>1:
            
            Post=cells[0] #the post code is the first element in the list of cells
            PostalCode.append(Post.text.strip()) #we append its text, stripped of whitespace, to the list we created earlier
            
            Bor=cells[1]
            Borough.append(Bor.text.strip())
            
            Hood=cells[2]
            Neighborhood.append(Hood.text.strip())
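
<p> To see why the header row is skipped, here is a minimal offline sketch of the same loop. The HTML string is a made-up stand-in for the Wikipedia table, and it uses Python's built-in 'html.parser' instead of 'lxml' so nothing extra is needed; both choices are assumptions for illustration only. </p>

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the wikitable: one header row (th cells) and two data rows (td cells)
html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for table in soup.find_all("table", class_="wikitable"):
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # the header row only contains th cells, so it yields an empty list and is skipped
        if len(cells) == 3:
            records.append(tuple(cell.text.strip() for cell in cells))

print(records)
```

<p> Collecting each row as one tuple keeps the three values together, so the columns cannot drift out of step the way three separate lists could. </p>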

In [11]:
#as a sanity check let's make sure that all 3 lists have the same number of items and that the content matches that from wikipedia

#print(PostalCode)  #this has been commented out to save space after checking the printed output made sense
print(len(PostalCode)) 

288


In [12]:
#print(Borough)  #this has been commented out to save space after checking the printed output made sense
print(len(Borough))

288


In [13]:
#print(Neighborhood)  #this has been commented out to save space after checking the printed output made sense
print(len(Neighborhood))

288


<h3> Third step: add the data to the pandas dataframe </h3>

In [14]:
neighbourhoods=pd.DataFrame(columns=['PostalCode','Borough','Neighborhood'])
neighbourhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood


In [15]:
#building the dataframe in one call from the three lists
#DataFrame.append was removed in pandas 2.0 and appending row by row is slow, so the lists are passed as columns instead
neighbourhoods=pd.DataFrame({'PostalCode':PostalCode,'Borough':Borough,'Neighborhood':Neighborhood})

In [17]:
neighbourhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [18]:
neighbourhoods.shape #just as a sanity check let's make sure that there are indeed 288 rows and 3 columns 

(288, 3)

<h2> Now that we have the dataframe it's time to clean the data </h2>
<h3> The first step will be deleting rows that don't have a borough</h3>

In [19]:
#the challenge here is that the rows without a borough don't have an empty field or a NaN value, instead they contain the words "Not assigned"
#dropna() on its own won't remove them, so we first have to replace "Not assigned" with NaN

neighbourhoods.replace('Not assigned', np.nan, inplace=True )


In [20]:
neighbourhoods.dropna(subset=['Borough'], inplace=True) #dropping all the rows where the borough is empty 
neighbourhoods.reset_index(drop=True, inplace=True) #resetting the index
neighbourhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


In [21]:
neighbourhoods.shape #checking the shape to be able to use it as reference in future sanity checks 

(211, 3)
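
<p> The replace-then-drop pattern used above can be sketched on a small made-up dataframe. The rows below are illustrative only, and the chained, non-inplace style is just one possible way to write it. </p>

```python
import numpy as np
import pandas as pd

# Toy data: one row without a borough, one row with a borough but no neighborhood
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

cleaned = (
    toy.replace("Not assigned", np.nan)   # turn the placeholder text into real missing values
       .dropna(subset=["Borough"])        # drop only the rows whose Borough is missing
       .reset_index(drop=True)            # renumber the remaining rows from 0
)

print(cleaned)
```

<p> Because dropna is restricted to the Borough column, the M7A row survives even though its Neighborhood is now NaN, which is what the next cleaning step relies on. </p>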

In [24]:
neighbourhoods.dtypes

PostalCode      object
Borough         object
Neighborhood    object
dtype: object

In [26]:
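#astype returns a new dataframe and the result is not assigned here, so neighbourhoods itself is unchanged
#the columns already have dtype object (strings), so no conversion is needed
neighbourhoods.astype(str)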
neighbourhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
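
<p> A quick sketch of how astype behaves, using a made-up numeric column so the effect is visible: the call returns a converted copy and leaves the original dataframe alone unless the result is assigned. </p>

```python
import pandas as pd

toy = pd.DataFrame({"count": [1, 2]})

toy.astype(str)                  # the converted copy is discarded, toy is untouched
converted = toy.astype(str)      # keeping the returned copy is what makes the conversion stick

print(toy["count"].tolist())
print(converted["count"].tolist())
```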


<h3> The second step is filling in the neighborhoods with no value with the name of their Borough </h3>

In [30]:
neighbourhoods['Neighborhood']=neighbourhoods['Neighborhood'].fillna(neighbourhoods['Borough']) #filling NaN values in Neighborhood with the name of the Borough as per instructions 

In [31]:
neighbourhoods.head(10) #checking the head to see if it worked, row 6 for example had a NaN value, this time it should have the name of the Borough 

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


In [32]:
#sanity check, if we have not lost any data the shape should be the same
neighbourhoods.shape 

(211, 3)
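
<p> Passing a column to fillna fills each missing value with the value from the same row of that column, because pandas aligns the two on the row index. A small sketch on made-up rows, with a check that nothing is left missing: </p>

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", np.nan],
})

# missing neighborhoods take the borough name from the same row
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])

# sanity check: no missing values should remain in the column
remaining_missing = int(toy["Neighborhood"].isna().sum())
print(toy)
print(remaining_missing)
```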

<h3> The third step will be combining neighborhoods with the same post code into a single row</h3>

In [39]:
neighbourhoods=neighbourhoods.groupby(['PostalCode','Borough'], sort=False, as_index=False).agg(lambda x:','.join(x))

In [40]:
neighbourhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


In [41]:
neighbourhoods.shape #sanity check again, if the grouping has worked there should be fewer rows now 

(103, 3)
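
<p> The grouping step can be checked on a small made-up dataframe: rows sharing a post code and borough collapse into one row whose Neighborhood is the comma-joined list. The extra uniqueness check is an assumption about what a useful sanity test might look like, not part of the assignment. </p>

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Queen's Park"],
})

# sort=False keeps the original row order, as_index=False keeps the keys as ordinary columns
grouped = toy.groupby(["PostalCode", "Borough"], sort=False, as_index=False).agg(
    lambda names: ",".join(names)
)

# after grouping, each post code should appear exactly once
codes_are_unique = bool(grouped["PostalCode"].is_unique)
print(grouped)
print(codes_are_unique)
```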