# Web Scraping Project

This notebook scrapes Toronto's neighbourhood data from a Wikipedia page.

## Criteria for marking
1. The dataframe should consist of three columns: postal code, borough and neighbourhood
2. All rows must have a borough; ignore rows where the borough is not assigned
3. Merge rows that share a postal code and borough but have different neighbourhoods into one row, separating the neighbourhoods with commas
4. If the neighbourhood is missing, use the borough as the neighbourhood
5. Clean the dataframe
6. Print dimensions of table in the last cell


Loading the required packages

In [1]:
#installing urllib2 package

#!conda install -c conda-forge urllib2 --yes

In [2]:
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np
#import urllib2
import requests as request
print('Packages are loaded')

Packages are loaded


In [3]:
#Link for the wikipedia page
link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [4]:
wiki_page = request.get(link).text

In [5]:
html_page = bs(wiki_page, 'html.parser')
#print(html_page.prettify())

In [6]:
#html_page.find_all('tbody')

In [7]:
element_list = []
postal_code = []
borough = []
neighbourhood = []

for child in html_page.tbody.stripped_strings:
    element_list.append(child)
#print(element_list)

for i in range(3, len(element_list)- 2, 3):
    pcode = element_list[i]
    bcode = element_list[i+1]
    ncode = element_list[i+2]
    
    postal_code.append(pcode)
    borough.append(bcode)
    neighbourhood.append(ncode)


df = pd.DataFrame(data = {'Postal Code': postal_code, 'Borough': borough, 'Neighbourhood': neighbourhood}, columns = ['Postal Code', 'Borough', 'Neighbourhood'])
df.head()
#print(postal_code)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
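The stride-of-three loop above assumes that every table cell yields exactly one string. As a hedged alternative, the sketch below reads the table row by row, so an empty cell or a cell containing a link does not shift the columns. The `sample_html` markup and the `parse_table_rows` helper are illustrative assumptions, not the structure of the live Wikipedia page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup that mimics a three-column postal code table.
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td><a href="#">North York</a></td><td><a href="#">Parkwoods</a></td></tr>
    <tr><td>M5A</td><td>Downtown Toronto</td><td></td></tr>
  </tbody>
</table>
"""

def parse_table_rows(html):
    """Return one dataframe row per <tr> that has exactly three <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find('table').find_all('tr'):
        # get_text(strip=True) flattens nested tags such as links into plain text
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return pd.DataFrame(rows, columns=['Postal Code', 'Borough', 'Neighbourhood'])

sample_df = parse_table_rows(sample_html)
print(sample_df)
```

Because each row is handled separately, the empty neighbourhood cell in the last sample row becomes an empty string instead of pulling values from the next row.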


Removing *Not assigned* values from Borough

In [8]:
df = df[df.Borough != 'Not assigned']

Checking if there are any Not assigned values in Neighbourhood

In [9]:
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood']

Series([], Name: Neighbourhood, dtype: object)

The column Neighbourhood doesn't contain any *Not assigned* values.
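Criterion 4 asks for the borough to be used whenever the neighbourhood is missing. If the check had found such rows, one possible way to apply the fallback is sketched below on a toy dataframe. The second row (`M0Z`, `Example Borough`) is made up for illustration and does not come from the scraped data.

```python
import pandas as pd

# Toy data: the second row is a made-up example with a missing neighbourhood.
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M0Z'],
    'Borough': ['Downtown Toronto', 'Example Borough'],
    'Neighbourhood': ['Harbourfront', 'Not assigned'],
})

# Boolean mask of rows whose neighbourhood is missing
missing = toy['Neighbourhood'] == 'Not assigned'

# Copy the borough into the neighbourhood column for those rows only
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']
print(toy)
```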

Merging the rows that have the same postal code and borough but different neighbourhoods. For that, we use the groupby method from the pandas library: we group the dataframe *df* by *Postal Code* and *Borough*, then join the neighbourhood names in each group with a comma.

In [10]:
grouped = df.groupby(['Postal Code', 'Borough'])['Neighbourhood'].apply(lambda x: ', '.join(x))
grouped = pd.DataFrame(grouped)

Resetting the index so that *Postal Code* and *Borough* become regular columns of the grouped dataframe

In [11]:
grouped = grouped.reset_index()
grouped.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
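As a small illustration of the grouping step, the sketch below runs the same idea on three toy rows. It uses `as_index=False` so the group keys stay as columns, which is an alternative to calling `reset_index()` afterwards. The dataframe is hand-built for the example rather than scraped.

```python
import pandas as pd

# Toy rows: two neighbourhoods share postal code M1B, one stands alone.
toy = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# as_index=False keeps 'Postal Code' and 'Borough' as ordinary columns
merged = (
    toy.groupby(['Postal Code', 'Borough'], as_index=False)
       .agg({'Neighbourhood': ', '.join})
)
print(merged)
```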


Printing the shape and the number of rows of the dataframe

In [12]:
print('Shape of dataframe: ', grouped.shape)
print('Number of rows in dataframe: ', grouped.shape[0])

Shape of dataframe:  (103, 3)
Number of rows in dataframe:  103


## Part 2: Plotting the coordinates on the map

Installing the geocoder package

In [13]:
#!conda install -c conda-forge geocoder --yes

In [14]:
#importing the geocoder package
import geocoder

Getting coordinates using geocoder package

In [15]:
#latlng_list = []

#for pcode, ncode in zip(grouped['Postal Code'], grouped['Neighbourhood']):
#    lat_long_coord = None  # reset for every postal code so each row gets its own lookup
#    while (lat_long_coord is None):
#        g = geocoder.google('{}, {}'.format(pcode, ncode))
#        lat_long_coord = g.latlng
#    latlng_list.append(lat_long_coord)

#coordinates = pd.DataFrame(data = latlng_list, columns = ['Latitude', 'Longitude'])
#coordinates
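A `while` loop that waits for a non-None result never ends if the geocoding service keeps returning nothing. The sketch below shows one way to cap the number of attempts. `latlng_with_retries` and `fake_lookup` are hypothetical names introduced for this example; in the notebook the lookup would presumably be something like `lambda q: geocoder.google(q).latlng`.

```python
import time

def latlng_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # pause before trying again
    return None  # caller decides how to handle a failed lookup

# Stand-in for a real geocoding call: fails once, then succeeds.
calls = {'count': 0}

def fake_lookup(query):
    calls['count'] += 1
    return [43.806686, -79.194353] if calls['count'] >= 2 else None

coords = latlng_with_retries(fake_lookup, 'M1B, Rouge, Malvern')
always_none = latlng_with_retries(lambda q: None, 'M1C')
print(coords, always_none)
```

Returning `None` after the last attempt lets the calling loop record the failure and move on to the next postal code instead of hanging.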

Loading the coordinates from a CSV file, because the geocoder code above did not return coordinates

In [16]:
coordinates = pd.read_csv(r'C:\Users\kswp234\Box Sync\3rd Rotation\Data Science\Coursera\IBM Data Science\IBM_DS_coursera\Geospatial_Coordinates.csv')
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [17]:
#merging grouped (neighbourhood) data with coordinates on the shared 'Postal Code' column

grouped = pd.merge(grouped, coordinates, on='Postal Code')
grouped.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
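The default inner merge silently drops any postal code that has no match in the coordinates file. A hedged way to check for that is sketched below with toy data: a left merge with `indicator=True` flags unmatched rows, and `validate='one_to_one'` raises an error if a postal code appears more than once on either side. The code `M9Z` is made up to show an unmatched row.

```python
import pandas as pd

# Toy neighbourhood data; 'M9Z' is a made-up code with no coordinates.
neighbourhoods = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Example Borough'],
})

coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# how='left' keeps every neighbourhood row; '_merge' records where each row matched
checked = pd.merge(
    neighbourhoods, coords,
    on='Postal Code', how='left',
    validate='one_to_one', indicator=True,
)

# Postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```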


## Creating Visualizations

In [18]:
#!conda install -c conda-forge folium --yes

In [19]:
import folium

Creating a map centred on Toronto

In [20]:
map_canada = folium.Map(location = [43.6532, -79.3832], zoom_start = 12, tiles = 'Stamen Terrain')
map_canada

Superimposing a marker for each postal code on the map, with the borough and neighbourhood names as its popup label

In [21]:
for lat, lng, borough, neighborhood in zip(grouped['Latitude'], grouped['Longitude'], grouped['Borough'], grouped['Neighbourhood']):
    label = '{}, {}'.format(borough, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)

map_canada