<a href="https://www.linkedin.com/in/amit-maindola-51801423/"><img src = "https://cdn-images-1.medium.com/max/1200/1*u16a0WbJeckSdi6kGD3gVA.jpeg" width = 400> </a>

<h1 align="center"><font size=5 color="DE5538">Web Scraping and Loading Toronto Neighborhoods data</font></h1>

## Introduction
In this notebook I scrape and download the neighborhood data for Toronto. 
The neighborhood data is extracted from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
After scraping and wrangling the data from the Wikipedia page, we read it into a Pandas DataFrame and add the latitude and longitude of each postal code

### Table of Contents
<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3 color="black">

1. <a href="#item1">Scrape the Wikipedia page</a>

2. <a href="#item2">Perform Data Cleaning/Wrangling</a>

3. <a href="#item3">Get Coordinates of the Neighborhoods</a>

</font>
</div>

### Import all the required libraries and dependencies

In [6]:
import numpy as np   # library to handle data in a vectorized manner
import pandas as pd  # library for data analysis
import requests      # library to handle requests
from bs4 import BeautifulSoup   # library to parse the Wikipedia page

print('Libraries imported.')

Libraries imported.


<a id=item1><font size=4 color="229C75">Scrape the Wikipedia page</font></a>

Let's read the source code of the page and create a BeautifulSoup object (soup) with the BeautifulSoup function.
Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree for each parsed page, and that tree can be used to extract data from the HTML, which is useful for web scraping. The prettify() function in BeautifulSoup lets us view how the tags are nested in the document.

In [7]:
wikki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
soup = BeautifulSoup(requests.get(wikki_url).text,'lxml')
# print(soup.prettify())
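
To see what the parse tree looks like without fetching the live page, here is a minimal sketch on a hand-written HTML fragment. The fragment and the `html.parser` backend are assumptions chosen so the example runs offline; the notebook itself parses the real page with `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the downloaded page
sample_html = (
    "<html><body><h1>Postal codes</h1>"
    "<p>Codes starting with <b>M</b></p></body></html>"
)

# html.parser ships with Python, so no extra parser package is needed here
sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the document as an indented string, one tag per line
print(sample_soup.prettify())

# Tags can be reached as attributes of the soup object or with find()
heading = sample_soup.h1.text
bold_text = sample_soup.find("b").text
print(heading, "|", bold_text)
```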

On inspecting the elements we can see that the table contents, i.e. the Neighborhoods, are in a table with the class 'wikitable sortable'.  
So let's find the class 'wikitable sortable' in the HTML and extract the header and row values as lists.  
Use the strip function to remove "\n" characters from the values.  

In [8]:
# Get the table values in a variable
toronto_tbl = soup.find('table',{'class':'wikitable sortable'})
# Extract the table column names
tbl_hdr = [hdr.text.strip() for hdr in toronto_tbl.find_all('th')]
print(tbl_hdr)
# Loop for each row and print the first 10 values
for row in toronto_tbl.find_all('tr')[0:10]:
    values = [cell.text.strip() for cell in row.find_all('td')]
    if len(values) != 0: print(values )

['Postcode', 'Borough', 'Neighbourhood']
['M1A', 'Not assigned', 'Not assigned']
['M2A', 'Not assigned', 'Not assigned']
['M3A', 'North York', 'Parkwoods']
['M4A', 'North York', 'Victoria Village']
['M5A', 'Downtown Toronto', 'Harbourfront']
['M5A', 'Downtown Toronto', 'Regent Park']
['M6A', 'North York', 'Lawrence Heights']
['M6A', 'North York', 'Lawrence Manor']
['M7A', "Queen's Park", 'Not assigned']
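
The same extraction steps can be tried offline on a small table. This is only a sketch: the three-row HTML below is a hypothetical miniature of the Wikipedia table, and `html.parser` is used instead of `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table (trailing newlines kept on purpose)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# Passing the full class string matches the exact value of the class attribute
sample_tbl = sample_soup.find("table", {"class": "wikitable sortable"})

# Column names come from the <th> cells; strip() removes the "\n" characters
header = [th.text.strip() for th in sample_tbl.find_all("th")]

rows = []
for tr in sample_tbl.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows.append(cells)

print(header)
print(rows)
```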


Save the data into a Pandas DataFrame

In [9]:
df = pd.DataFrame(columns=[hdr.text.strip() for hdr in toronto_tbl.find_all('th')])
for row in toronto_tbl.find_all('tr'):
    values = [cell.text.strip() for cell in row.find_all('td')]
    if len(values) != 0: df.loc[len(df)] = values

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
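
Appending with `df.loc[len(df)]` works, but it grows the frame one row at a time. A possible alternative, sketched here on hand-typed sample rows rather than the scraped table, is to collect the rows in a plain list and build the DataFrame in one call.

```python
import pandas as pd

# Sample values typed in by hand; in the notebook they would come from the table
header = ["Postcode", "Borough", "Neighbourhood"]
rows = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]

# Build the frame once from the list of rows
sample_df = pd.DataFrame(rows, columns=header)
print(sample_df.shape)
```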


View the shape and size of the DataFrame

In [10]:
df.shape

(289, 3)

<a id=item2 ><font size=4 color="229C75">Perform Data Cleaning/Wrangling</font></a>

Create a new DataFrame **toronto_df** from the original DataFrame **df** with columns: PostalCode, Borough, and Neighborhood  
Remove the rows that have Borough as **'Not assigned'**

In [150]:
toronto_df = df[df['Borough'] != 'Not assigned']
# Change column names
toronto_df=toronto_df.rename(columns={'Postcode':'PostalCode','Neighbourhood':'Neighborhood'})
# Reset Index
toronto_df.index=range(len(toronto_df.index))

In [151]:
toronto_df.head(3)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront


In [152]:
# keep=False removes every copy of a duplicated row, not only the repeats
toronto_df.drop_duplicates(keep=False,inplace=True)
toronto_df.shape

(212, 3)
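
The `keep` argument decides which copies of a duplicated row survive. The sketch below uses a made-up three-row frame (not the scraped data) to compare `keep=False` with `keep="first"`.

```python
import pandas as pd

# Made-up frame with one fully duplicated row
sample_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Harbourfront", "Lawrence Manor"],
})

# keep=False drops every row that has a duplicate, including the first copy
dropped_all = sample_df.drop_duplicates(keep=False)

# keep="first" keeps one copy of each duplicated row
kept_first = sample_df.drop_duplicates(keep="first")

print(len(dropped_all), len(kept_first))
```

If the goal is to keep one copy of each row, `keep="first"` is the option to use.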

View the counts of Unique Neighborhoods and Borough

In [153]:
print('The DataFrame has {} Neighborhoods and {} Borough'.format(
    len(toronto_df['Neighborhood'].unique())
    ,len(toronto_df['Borough'].unique()) 
    )
)

The DataFrame has 210 Neighborhoods and 11 Borough


For rows that have a Borough but a Neighborhood of **'Not assigned'**, set the Neighborhood to the Borough name

In [154]:
for i in toronto_df[toronto_df['Neighborhood'] =='Not assigned'].index.values:
    print("'{}' - '{}'".format(toronto_df.loc[i,'Neighborhood'], toronto_df.loc[i,'Borough']))
    toronto_df.at[i,'Neighborhood'] = toronto_df.at[i,'Borough']

# Check if Neighbourhood has been reset
toronto_df[toronto_df['Neighborhood'] =='Not assigned']

'Not assigned' - 'Queen's Park'


Unnamed: 0,PostalCode,Borough,Neighborhood
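
The same replacement can also be written without a loop by using a boolean mask. This is a sketch on a made-up two-row frame, not the notebook's data.

```python
import pandas as pd

# Made-up frame with one unassigned neighborhood
sample_df = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# True for the rows whose Neighborhood still needs a value
mask = sample_df["Neighborhood"] == "Not assigned"

# Copy the Borough into the Neighborhood for the masked rows only
sample_df.loc[mask, "Neighborhood"] = sample_df.loc[mask, "Borough"]

print(sample_df["Neighborhood"].tolist())
```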


View the distribution of Values

In [155]:
toronto_df.describe()

Unnamed: 0,PostalCode,Borough,Neighborhood
count,212,212,212
unique,103,11,210
top,M8Y,Etobicoke,Runnymede
freq,8,45,2


Combine Neighborhoods based on Postal Code  
We group the DataFrame by PostalCode and Borough and concatenate the Neighborhood values of each group using agg

In [156]:
toronto_df = toronto_df.groupby(['PostalCode','Borough'],as_index=False)[['Neighborhood']].agg(lambda col : ', '.join(col) )
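
To make the effect of the aggregation concrete, here is a small sketch on a made-up frame in which one postal code has two neighborhoods.

```python
import pandas as pd

# Made-up frame: M5A appears twice with different neighborhoods
sample_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# One row per (PostalCode, Borough); neighborhoods joined into one string
grouped = sample_df.groupby(["PostalCode", "Borough"], as_index=False)[["Neighborhood"]].agg(
    lambda col: ", ".join(col)
)

print(grouped)
```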

###### Save the DataFrame as CSV file for reference

In [157]:
toronto_df.to_csv('toronto_neighborhood.csv',index=False)

### View the shape and size of the DataFrame

In [158]:
toronto_df.shape

(103, 3)

<a id=item3><font size=4 color="229C75">Get Coordinates of the Neighborhoods</font></a>

As our DataFrame is ready, we need to get the latitude and the longitude coordinates of each neighborhood.   

Let's try the Google Geocoding API to get the coordinates. It is called through the requests library imported earlier.

### Use Google Geocode REST API to get the Latitude and Longitude Values

###### Method to get the Latitude and Longitude of a passed address using Google API

In [160]:
# @hidden_cell
GOOGLE_MAPS_API_URL = 'https://maps.googleapis.com/maps/api/geocode/json'
GOOGLE_API_KEY = ""
BACKOFF_TIME = 30

In [161]:
def get_latlng(address):
    """
    Get the latitude and longitude of the passed address
    Parameters:
        address : Address of the location
    Returns : Dictionary value of Latitude and Longitude
    """
    if address is None:
        raise ValueError("Missing address")

    # Pass the query as params so that requests URL-encodes the address
    params = {'address': address, 'key': GOOGLE_API_KEY}

    # Do the request and get the response data
    result = requests.get(GOOGLE_MAPS_API_URL, params=params).json()
    # Return the Dictionary value of the first match
    return result['results'][0]['geometry']['location']

Test the method on a single address before looping over every address

In [162]:
# Test the defined Method for a value in the Dataset

# result=get_latlang("M5G, Toronto, Ontario")
# result

print(get_latlng("M5G, Toronto, Ontario"))

{'lat': 43.6579524, 'lng': -79.3873826}
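
Indexing `result['results'][0]` raises an error when the API returns no match. A more defensive way to read the response is sketched below. `extract_latlng` is a hypothetical helper that is not part of the notebook, and the two payloads are hand-written stand-ins for real responses. The `status` field and its `OK` and `ZERO_RESULTS` values belong to the Geocoding API response format.

```python
def extract_latlng(payload):
    """Return the first result's location dict, or None if there is no match."""
    # Anything other than OK (e.g. ZERO_RESULTS, OVER_QUERY_LIMIT) has no usable result
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    return payload["results"][0]["geometry"]["location"]


# Hand-written payloads standing in for real API responses
ok_payload = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 43.6579524, "lng": -79.3873826}}}],
}
empty_payload = {"status": "ZERO_RESULTS", "results": []}

print(extract_latlng(ok_payload))
print(extract_latlng(empty_payload))
```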


##### Recreate the **toronto_df** DataFrame from the CSV file with additional columns

In [163]:
toronto_df = pd.read_csv('toronto_neighborhood.csv')
toronto_df['Latitude'] = toronto_df['Longitude'] = ""
toronto_df.head(2)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",,
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",,


Add Latitude and Longitude values in the Dataframe

In [168]:
# Loop for every index
for i in toronto_df.index.values:
    # Form the address string
    addr = '{}, Toronto, Ontario'.format(toronto_df.at[i,'PostalCode'])
    # Get the coordinates for the address
    cordinates=get_latlng(addr)
    
#     print("Address : {} - codinates {}".format(addr,cordinates))
    toronto_df.at[i,'Latitude'],toronto_df.at[i,'Longitude'] = cordinates['lat'], cordinates['lng']

# View the populated values     
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.8067,-79.1944
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.7845,-79.1605
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.7636,-79.1887
3,M1G,Scarborough,Woburn,43.771,-79.2169
4,M1H,Scarborough,Cedarbrae,43.7731,-79.2395
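
The two columns can also be filled without writing cell by cell. The sketch below assumes a stub, `fake_get_latlng`, in place of the real API call so that it runs offline. The stub and its lookup table are hypothetical, and the coordinates are the rounded values shown in the output above.

```python
import pandas as pd

sample_df = pd.DataFrame({"PostalCode": ["M1B", "M1C"]})

# Hypothetical lookup table standing in for the geocoding service
stub_coords = {
    "M1B, Toronto, Ontario": {"lat": 43.8067, "lng": -79.1944},
    "M1C, Toronto, Ontario": {"lat": 43.7845, "lng": -79.1605},
}


def fake_get_latlng(address):
    # Stand-in for get_latlng: returns the same {'lat', 'lng'} dictionary shape
    return stub_coords[address]


# Build every address string at once, then geocode each one
addresses = sample_df["PostalCode"] + ", Toronto, Ontario"
coords = pd.DataFrame(addresses.map(fake_get_latlng).tolist(), index=sample_df.index)

# The resulting columns are numeric (float64) rather than object
sample_df["Latitude"] = coords["lat"]
sample_df["Longitude"] = coords["lng"]
print(sample_df)
```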


###### Save the DataFrame as CSV file for reference

In [171]:
toronto_df.to_csv('toronto_neigh_latlng.csv',index=False)
toronto_df.shape

(103, 5)