<h1> Part 1: Scraping and cleaning data for Neighborhoods in Toronto </h1>

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Creating an empty dataframe to house the data:

neighbourhoods=pd.DataFrame(columns=['PostalCode','Borough','Neighborhood'])
neighbourhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood


<h2> I will be using Beautiful Soup to scrape the data off Wikipedia </h2>
<p> If you are taking this course as well and can't figure out how to scrape the data, I followed the instructions in this <a href="https://www.youtube.com/watch?v=ng2o98k983k"> YouTube tutorial </a> and on <a href="https://towardsdatascience.com/step-by-step-tutorial-web-scraping-wikipedia-with-beautifulsoup-48d7f2dfa52d">this page</a>.</p>

<h3> Installing and importing all the necessary libraries </h3>

In [3]:
#installing beautiful soup
! pip install beautifulsoup4 



In [4]:
#installing the html parser 
! pip install lxml 



In [5]:
#installing  requests 
!pip install requests  



In [6]:
#importing my newly installed packages 
from bs4 import BeautifulSoup
import requests

<h3> Scraping the data </h3>
<p> IMPORTANT NOTE: the Wikipedia page has been modified since IBM staff created the instructions. I found the link to the old version by following some tips in the forums. </p>

In [7]:
#requests.get fetches the page behind the link and .text returns the body of the response (i.e. the html) as a string
page=requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641.').text

soup= BeautifulSoup(page,'lxml') #this gets us a parsed version of the file using a parser called 'lxml'
   
#print(soup.prettify()) #this has been commented out to save space after checking the printed output made sense
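
<p> The cell above reads <code>.text</code> without checking whether the request succeeded, so an error page would be parsed as if it were the article. A minimal sketch of a more defensive fetch follows. The helper name <code>fetch_html</code>, the 10 second timeout and the injectable <code>get</code> argument are assumptions made for this example, not part of the course instructions. </p>

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Return the HTML body of `url`, raising if the server reports an error.

    `fetch_html`, the timeout value and the `get` parameter are illustrative
    choices; `get` only exists so the function can be tried without a network.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

<p> It could then be used as <code>page = fetch_html(url)</code> before handing <code>page</code> to BeautifulSoup. </p>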


In [8]:
wiki_table= soup.find_all('table', class_="wikitable") #isolating the table we want, it is the only one with class wikitable
# print(wiki_table)  #this has been commented out to save space after checking the printed output made sense

In [9]:
#creating 3 empty lists to store the data from each column before it gets turned into a pandas df 
PostalCode=[]
Borough=[]
Neighborhood=[]

In [10]:
#and then create a for loop that will go through the table pulling out the data 

for item in wiki_table:  
    rows=item.find_all('tr') #tr is the html item that defines the rows of a table
    
    for row in rows: #inside each tr there will be td that denote the cells
        cells=row.find_all('td') 
        #when python runs this for the first row it won't find any td because that row is the header and has no data
        #therefore we can put an if condition based on the len of the result to skip the header row
        if len(cells)>1:
            
            Post=cells[0] #the post code is the first cell in the row
            PostalCode.append(Post.text.strip()) #we append its text to the list we created earlier
            
            Bor=cells[1]
            Borough.append(Bor.text.strip())
            
            Hood=cells[2]
            Neighborhood.append(Hood.text.strip())
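
<p> To see the header-skipping logic in isolation, here is a small sketch that runs the same kind of loop over a hand-written table instead of the live page. The HTML snippet is made up to mimic the layout of the Wikipedia table, and it uses Python's built-in <code>'html.parser'</code> rather than <code>'lxml'</code> so that it runs without the extra install. </p>

```python
from bs4 import BeautifulSoup

# Toy HTML with the same layout as the Wikipedia table:
# one header row made of th cells, then data rows made of td cells.
html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

records = []
for table in soup.find_all('table', class_='wikitable'):
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) == 3:  # the header row has no td cells, so it is skipped
            records.append([cell.text.strip() for cell in cells])

print(records)
```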

In [11]:
#as a sanity check let's make sure that all 3 lists have the same number of items and that the content matches that from wikipedia

#print(PostalCode)  #this has been commented out to save space after checking the printed output made sense
print(len(PostalCode)) 

288


In [12]:
#print(Borough)  #this has been commented out to save space after checking the printed output made sense
print(len(Borough))

288


In [13]:
#print(Neighborhood)  #this has been commented out to save space after checking the printed output made sense
print(len(Neighborhood))

288
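
<p> The three length checks can also be folded into a single assertion that fails loudly if the lists ever get out of step. This is only a sketch: the helper name <code>same_length</code> and the toy lists are invented for the example. </p>

```python
# Toy stand-ins for the three scraped lists
postal_codes = ['M1A', 'M3A', 'M4A']
boroughs = ['Not assigned', 'North York', 'North York']
hoods = ['Not assigned', 'Parkwoods', 'Victoria Village']

def same_length(*columns):
    """Return True when every list passed in has the same number of items."""
    return len({len(column) for column in columns}) <= 1

assert same_length(postal_codes, boroughs, hoods), "scraped columns are out of step"
```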


<h3> Adding the data to the pandas dataframe </h3>

In [14]:
neighbourhoods=pd.DataFrame(columns=['PostalCode','Borough','Neighborhood'])
neighbourhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood


In [15]:
#building the dataframe in one step from the three lists
#(DataFrame.append was removed in pandas 2.0, and appending row by row is slow anyway)
neighbourhoods=pd.DataFrame({'PostalCode':PostalCode,'Borough':Borough,'Neighborhood':Neighborhood})

In [16]:
neighbourhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [17]:
neighbourhoods.shape #just as a sanity check let's make sure that there are indeed 288 rows and 3 columns 

(288, 3)

<h2> Now that we have the dataframe it's time to clean the data </h2>
<h3> Deleting rows that don't have a borough</h3>

In [18]:
#the challenge here is that we have to remove the rows that don't have a borough assigned, but those rows don't have an empty field or NaN value, instead they have the words "Not assigned"
#the standard dropna() option won't help us, we first have to replace the "Not assigned" with NaN

neighbourhoods.replace('Not assigned', np.nan, inplace=True )


In [19]:
neighbourhoods.dropna(subset=['Borough'], inplace=True) #dropping all the rows where the borough is empty 
neighbourhoods.reset_index(drop=True, inplace=True) #resetting the index
neighbourhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


In [20]:
neighbourhoods.shape #checking the shape to be able to use it as reference in future sanity checks 

(211, 3)
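
<p> The replace, dropna and reset_index steps above can also be chained without <code>inplace=True</code>. A minimal sketch on a made-up three-row frame (the rows are illustrative, not the scraped data): </p>

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Same three steps as above, returning a new frame instead of modifying in place
cleaned = (
    toy.replace('Not assigned', np.nan)   # turn the placeholder text into real NaN
       .dropna(subset=['Borough'])        # drop rows with no borough
       .reset_index(drop=True)            # renumber the remaining rows from 0
)
print(cleaned)
```

<p> Note that the M7A row survives with a NaN neighborhood, which is exactly the case the next section deals with. </p>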

In [21]:
neighbourhoods.dtypes

PostalCode      object
Borough         object
Neighborhood    object
dtype: object

In [22]:
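#note: astype(str) returns a new dataframe and the result is not assigned, so neighbourhoods is left unchanged by this line
#assigning it would turn the NaN values into the string 'nan' and the fillna step below would then have nothing to fill
neighbourhoods.astype(str)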
neighbourhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


<h3> Filling in the neighborhoods with no value with the name of their Borough </h3>

In [23]:
neighbourhoods['Neighborhood']=neighbourhoods['Neighborhood'].fillna(neighbourhoods['Borough']) #filling NaN values in Neighborhood with the name of the Borough as per instructions 

In [24]:
neighbourhoods.head(10) #checking the head to see if it worked, row 6 for example had a NaN value, this time it should have the name of the Borough 

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


In [25]:
#sanity check, if we have not lost any data the shape should be the same
neighbourhoods.shape 

(211, 3)
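
<p> Passing a Series to <code>fillna</code> works because pandas lines the two columns up by index, so each missing neighborhood takes the borough from its own row. A small sketch on two made-up rows: </p>

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', np.nan],
})

# Only the NaN entry is replaced; existing values are left alone
toy['Neighborhood'] = toy['Neighborhood'].fillna(toy['Borough'])
print(toy['Neighborhood'].tolist())
```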

<h3> Combining neighborhoods with the same post code into a single row</h3>

In [26]:
neighbourhoods=neighbourhoods.groupby(['PostalCode','Borough'], sort=False, as_index=False).agg(lambda x:','.join(x))

In [27]:
neighbourhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"
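
<p> The grouping step is easier to follow on a few rows. The sketch below uses three rows taken from the table above and selects the <code>Neighborhood</code> column explicitly before joining, which is one possible way to write it rather than the only one. </p>

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# sort=False keeps the groups in order of first appearance;
# as_index=False keeps PostalCode and Borough as ordinary columns
combined = (
    toy.groupby(['PostalCode', 'Borough'], sort=False, as_index=False)['Neighborhood']
       .agg(','.join)
)
print(combined)
```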


In [28]:
neighbourhoods.sort_values(by='PostalCode', inplace=True) 
neighbourhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
6,M1B,Scarborough,"Rouge,Malvern"
12,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
18,M1E,Scarborough,"Guildwood,Morningside,West Hill"
22,M1G,Scarborough,Woburn
26,M1H,Scarborough,Cedarbrae
32,M1J,Scarborough,Scarborough Village
38,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
44,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
51,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
58,M1N,Scarborough,"Birch Cliff,Cliffside West"


In [29]:
neighbourhoods.shape 

(103, 3)

<h1> Part 2: Adding Geo Data  </h1>

In [30]:
coordinates=pd.read_csv('Geospatial_Coordinates.csv')  #loading the csv with the data

In [31]:
coordinates.head() #checking if it has loaded correctly

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [32]:
coordinates.shape #this is a sanity check to see if the number of rows is the same in this dataframe and in the previous one, if it is we can expect every postal code to find a match in the merge 

(103, 3)
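
<p> Matching row counts are a good sign, but two frames can have the same shape and still hold different postal codes. A stricter check compares the codes themselves. The sketch below uses made-up frames in which one code differs on purpose: </p>

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M1E']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'M1G']})

# Codes present in one frame but missing from the other
left_only = set(left['PostalCode']) - set(right['Postal Code'])
right_only = set(right['Postal Code']) - set(left['PostalCode'])
print(sorted(left_only), sorted(right_only))
```

<p> Both sets being empty would mean every postal code has a partner in the other frame. </p>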

In [33]:
coordinates=coordinates.set_index('Postal Code') #changing the index to be the postal code
coordinates.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [34]:
neighbourhoods=neighbourhoods.set_index('PostalCode')
neighbourhoods.head()

PostalCode,Borough,Neighborhood
M1B,Scarborough,"Rouge,Malvern"
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,Scarborough,"Guildwood,Morningside,West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


In [35]:
toronto_df=neighbourhoods.join(coordinates, how='outer')   #merging the two dataframes
toronto_df.head(10)

PostalCode,Borough,Neighborhood,Latitude,Longitude
M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
M1G,Scarborough,Woburn,43.770992,-79.216917
M1H,Scarborough,Cedarbrae,43.773136,-79.239476
M1J,Scarborough,Scarborough Village,43.744734,-79.239476
M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


In [36]:
toronto_df.shape #another sanity check to make sure nothing got lost in the merge 

(103, 4)
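
<p> An alternative to setting both indexes and calling <code>join</code> is <code>merge</code>, which can check the match while it runs. This is a sketch of that variant on two rows copied from the tables above, not the approach used in this notebook: <code>validate</code> raises if a postal code is duplicated on either side, and <code>indicator=True</code> adds a <code>_merge</code> column that shows where each row came from. </p>

```python
import pandas as pd

hoods = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

merged = hoods.merge(
    coords,
    left_on='PostalCode',
    right_on='Postal Code',
    how='outer',
    validate='one_to_one',  # fail if either side repeats a postal code
    indicator=True,         # adds a _merge column: both / left_only / right_only
)

# Every row should have found a partner in the other frame
assert (merged['_merge'] == 'both').all()

merged = merged.drop(columns=['Postal Code', '_merge'])
print(merged.shape)
```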

<h1> Part 3: Analyzing the data  </h1>

I will be using the full dataset to replicate the analysis we did in the labs.

In [37]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

In [38]:
!conda install -c conda-forge geocoder --yes
import geocoder

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [39]:
!conda install -c conda-forge geopy --yes

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [40]:
from geopy.geocoders import Nominatim

In [41]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

In [42]:
!conda install -c conda-forge folium=0.5.0 --yes 

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [43]:
import folium # map rendering library

<h3> Creating a map of Toronto with neighborhoods superimposed on top. </h3>

In [44]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
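
<p> geopy's <code>geocode</code> returns <code>None</code> when it cannot match an address, in which case <code>location.latitude</code> would fail with an AttributeError. A minimal sketch of a guard follows. The helper name <code>coordinates_for</code> is made up for this example, and it takes the geocoding function as an argument so it can be tried without a network connection. </p>

```python
def coordinates_for(geocode, address):
    """Return (latitude, longitude) for `address`, or raise if nothing was found.

    `geocode` is any callable that behaves like `geolocator.geocode`;
    the helper itself is hypothetical and not part of geopy.
    """
    location = geocode(address)
    if location is None:  # no match for this address
        raise ValueError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude
```

<p> In this notebook it could be called as <code>latitude, longitude = coordinates_for(geolocator.geocode, address)</code>. </p>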


In [47]:
toronto_map=folium.Map(location=[latitude,longitude], zoom_start=10)

toronto_map