
<h1>Part 1: Web Scraping the Toronto Neighborhood Data</h1>

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

A Google search turned up a Wikipedia page that contains the required information. The page describes itself as follows:

***"This is a list of postal codes in Canada where the first letter is M. Postal codes beginning with M are located within the city of Toronto in the province of Ontario. Only the first three characters are listed, corresponding to the Forward Sortation Area"***


Use the `requests` library to download the webpage [https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). 

Save the text of the response as a variable named `html_data`.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_data = requests.get(url).text

Parse the HTML data using `BeautifulSoup`.

In [3]:
soup = BeautifulSoup(html_data, 'html.parser')
soup.title

<title>List of postal codes of Canada: M - Wikipedia</title>

Using Beautiful Soup, extract the table and store it in a dataframe named `df`. The dataframe should have the columns PostalCode, Borough, and Neighborhood.

An empty list, `table_contents`, is created to hold the rows. Inside the for loop, each row is built as a dictionary called `cell`.

The for loop iterates over each `td` element and processes only the cells that have an assigned borough. Cells whose borough is `Not assigned` are ignored.

More than one neighborhood can exist in one postal code area. In that case the neighborhoods are combined into one row, separated by commas. Additionally, if a cell has a borough but a `Not assigned` neighborhood, then the neighborhood will be the same as the borough.



In [4]:
table_contents=[]
table=soup.find('table') # find the table element within the Beautiful soup object
for row in table.findAll('td'): # this loops over each 'cell' within the table
    cell = {}
    if row.span.text=='Not assigned': # ignore 'Not assigned' cells
        pass
    else:
        cell['PostalCode'] = row.p.text[:3] # the first 3 characters are the postal code
        cell['Borough'] = (row.span.text).split('(')[0] # the text before the opening parenthesis is the borough
        # the text inside the parentheses holds the neighborhoods, separated by " /"
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

print(table_contents[:2])

[{'PostalCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'}, {'PostalCode': 'M4A', 'Borough': 'North York', 'Neighborhood': 'Victoria Village'}]
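
The cell-cleaning rules described above can be sketched as a standalone function that works on plain strings. This is a minimal sketch, not the notebook's own code: the name `parse_cell` is hypothetical, and the input format (`Borough(Neighborhood A / Neighborhood B)`) is assumed from the way the loop above splits the text. It also applies the rule that a missing or `Not assigned` neighborhood falls back to the borough.

```python
def parse_cell(code_text, span_text):
    """Return a row dict for one table cell, or None if the borough is unassigned."""
    span_text = span_text.strip()
    if span_text == 'Not assigned':
        return None
    # split "Borough(Neighborhoods)" at the first opening parenthesis
    borough, sep, rest = span_text.partition('(')
    borough = borough.strip()
    if sep:
        # "A / B)" becomes "A, B"
        neighborhood = rest.rstrip(')').replace(' /', ',').replace(')', ' ').strip()
    else:
        neighborhood = ''
    # fall back to the borough when no neighborhood is given
    if not neighborhood or neighborhood == 'Not assigned':
        neighborhood = borough
    return {'PostalCode': code_text.strip()[:3],
            'Borough': borough,
            'Neighborhood': neighborhood}

print(parse_cell('M5A\n', 'Downtown Toronto(Regent Park / Harbourfront)'))
print(parse_cell('M1A\n', 'Not assigned'))
```

Keeping the string handling in one function makes it possible to check the rules against sample text without downloading the page.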


Create a DataFrame from the `table_contents` list.

In [5]:
df=pd.DataFrame(table_contents)
# clean Borough names
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
# verify if 2 neighborhoods are assigned to M5A
df[df['PostalCode']=='M5A']

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [6]:
# show how many rows there are in the DataFrame
print('There are {} rows of data scraped from Wikipedia'.format(df.shape[0]))

There are 103 rows of data scraped from Wikipedia


<h1>Part 2: Get the Latitude and Longitude Coordinates of Each Neighborhood</h1>


Use the Geocoder Python package to get the latitude and longitude coordinates of each neighborhood. Start by importing the module.

In [7]:
import geocoder # import geocoder

Call geocoder inside a while loop that retries until a latitude and longitude pair is found for each postal code, giving up after 30 failed attempts.

In [8]:
latitudes = []
longitudes = []

for postal_code in df['PostalCode']:
    # initialize your variable to None
    lat_lng_coords = None
    counter = 0 #initiate counter for loops
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        counter+=1
        if counter>30:
            break
    try:
        latitudes.append(lat_lng_coords[0])
        longitudes.append(lat_lng_coords[1])
    except TypeError: # lat_lng_coords is still None after the retries ran out
        print('Unreliable package so we go for downloading dataset!')
        break

Unreliable package so we go for downloading dataset!
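
The bounded retry used above can be written as a small helper. The sketch below is only an illustration: `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the lookup is a stand-in function rather than a real geocoding call, so it runs without network access.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay)  # pause between attempts (0 seconds by default)
    return None

# Hypothetical stand-in for a geocoder that fails twice before succeeding
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return [43.65426, -79.360636] if calls['n'] >= 3 else None

print(geocode_with_retry(flaky_lookup, 'M5A, Toronto, Ontario'))
```

Returning `None` after the last attempt lets the caller decide what to do next, for example falling back to a downloaded dataset.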


In [9]:
!curl -O https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2788  100  2788    0     0   4369      0 --:--:-- --:--:-- --:--:--  4363


In [10]:
file_name = 'Geospatial_Coordinates.csv'
df_coordinate = pd.read_csv(file_name)
print('There are {} rows from the downloaded csv file'.format(df_coordinate.shape[0]))
df_coordinate.head()

There are 103 rows from the downloaded csv file


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


The downloaded dataset contains only 3 columns: Postal Code, Latitude, and Longitude. The coordinates therefore need to be added to the original `df` with a left join on the postal code.

In [11]:
df = df.join(df_coordinate.set_index('Postal Code'), on='PostalCode')
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
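
As a small illustration of left-join behavior when the key columns have different names, the sketch below uses `DataFrame.merge` on toy data. The postal code `M9Z` and its borough are made up to show what happens to a row with no match.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the coordinates file
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Hypothetical']})
right = pd.DataFrame({'Postal Code': ['M4A', 'M3A'],
                      'Latitude': [43.725882, 43.753259],
                      'Longitude': [-79.315572, -79.329656]})

# a left join keeps every row of `left`, in its original order
merged = left.merge(right, how='left', left_on='PostalCode', right_on='Postal Code')
merged = merged.drop(columns='Postal Code')  # drop the duplicate key column
print(merged)
```

Rows without a match keep their place and get `NaN` coordinates, so counting missing values after the join is a quick way to confirm that every postal code was found.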


In [12]:
# show how many rows and columns there are in the DataFrame
print('There are {} rows and {} columns in the DataFrame'.format(df.shape[0], df.shape[1]))

There are 103 rows and 5 columns in the DataFrame


<h1>Part 3: Explore and cluster the neighborhoods in Toronto</h1>

In [13]:
import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas 1.0 or later)
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Get the geographical coordinates of Toronto for mapping.

In [14]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Define the Foursquare credentials and API version. The credentials are read from a local `credentials.json` file so that they do not appear in the notebook.


In [15]:
with open('credentials.json') as f:
    data = json.load(f)
    CLIENT_ID = data['CLIENT_ID'] 
    CLIENT_SECRET = data['CLIENT_SECRET']  

VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
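
If a key is missing from the credentials file, the error is easier to diagnose when it is raised at load time. The sketch below shows one way to do that. The helper name `load_credentials` is hypothetical, and the demonstration writes placeholder values to a temporary file rather than reading real credentials.

```python
import json
import os
import tempfile

def load_credentials(path, required=('CLIENT_ID', 'CLIENT_SECRET')):
    """Read a JSON credentials file and fail early if a required key is absent."""
    with open(path) as f:
        data = json.load(f)
    missing = [key for key in required if key not in data]
    if missing:
        raise KeyError('missing credential keys: ' + ', '.join(missing))
    return {key: data[key] for key in required}

# Demonstration with a temporary file holding placeholder values
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'credentials.json')
    with open(demo_path, 'w') as f:
        json.dump({'CLIENT_ID': 'placeholder-id', 'CLIENT_SECRET': 'placeholder-secret'}, f)
    creds = load_credentials(demo_path)

print(sorted(creds))
```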

Define a helper function that extracts the category name of a venue. It is used to keep only the information of interest when filtering the results.


In [16]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # fall back to the nested column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
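
For plain dictionaries, the same lookup can be sketched without a `try`/`except` block. The function name `first_category_name` and the sample records are illustrative assumptions, not Foursquare responses.

```python
def first_category_name(venue):
    """Return the first category name of a venue record, or None if it has none."""
    categories = venue.get('categories') or venue.get('venue.categories') or []
    return categories[0]['name'] if categories else None

samples = [
    {'categories': [{'name': 'Park'}]},
    {'venue.categories': [{'name': 'Golf Course'}]},
    {'categories': []},
]
print([first_category_name(v) for v in samples])  # ['Park', 'Golf Course', None]
```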

Define the function below to extract the nearby venues for each location:

In [17]:
def getNearbyVenues(postalcodes, boroughs, neighborhoods, latitudes, longitudes, radius=1500):
    
    venues_list=[]
    for postal_code, borough, neighborhood, lat, lng in zip(postalcodes, boroughs, neighborhoods, latitudes, longitudes):
        print(neighborhood)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            postal_code, 
            borough,
            neighborhood,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code','Borough','Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    return(nearby_venues)
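
The request URL can also be assembled from a dictionary of parameters, which leaves the escaping to the standard library. This is a sketch under assumptions: `build_explore_url` is a hypothetical helper, the parameter names are copied from the URL used in the function above, and the credentials are placeholders. No request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng, radius=1500, limit=100):
    """Build the explore URL with the same query parameters as the notebook's request."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

url = build_explore_url('PLACEHOLDER_ID', 'PLACEHOLDER_SECRET', '20180605', 43.753259, -79.329656)
print(url)

# parse the query string back to confirm the parameters survived the round trip
print(parse_qs(urlsplit(url).query)['ll'])
```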

In [18]:
toronto_venues = getNearbyVenues(postalcodes=df['PostalCode'],
                                 boroughs=df['Borough'],
                                 neighborhoods=df['Neighborhood'],
                                 latitudes=df['Latitude'],
                                 longitudes=df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills North
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview East
The Danforth

Quickly check the size of the resulting dataframe and how many distinct venue categories and neighborhoods the explore results contain:


In [19]:
print(toronto_venues.shape)
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))
print('There are {} unique neighborhoods.'.format(len(toronto_venues['Neighborhood'].unique())))
toronto_venues.head()

(6876, 9)
There are 342 unique categories.
There are 103 unique neighborhoods.


Unnamed: 0,Postal Code,Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,North York,Parkwoods,43.753259,-79.329656,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,M3A,North York,Parkwoods,43.753259,-79.329656,Donalda Golf & Country Club,43.752816,-79.342741,Golf Course
2,M3A,North York,Parkwoods,43.753259,-79.329656,Tim Hortons,43.760668,-79.326368,Café
3,M3A,North York,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
4,M3A,North York,Parkwoods,43.753259,-79.329656,LCBO,43.757774,-79.314257,Liquor Store


### Analyze Each Neighborhood

Before the categories can be ranked, the dataframe needs to be converted to a one-hot encoding.

In [22]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
# group the dataframe by neighborhood and take the mean frequency of each venue category
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
# write a function to sort the venues in descending order:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
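
To make the encoding step concrete, the sketch below runs the same idea on four made-up venues. The neighborhood names `A` and `B` and the categories are illustrative only. After one-hot encoding, the mean of each column within a neighborhood is the share of that neighborhood's venues in the category.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Cafe', 'Park', 'Bank'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# mean of the indicator columns = frequency of each category per neighborhood
grouped = onehot.groupby('Neighborhood').mean()
print(grouped)

# the two most frequent categories in neighborhood A
top_two = grouped.loc['A'].sort_values(ascending=False).index[:2].tolist()
print(top_two)  # ['Park', 'Cafe']
```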

Create the new dataframe and display the top 10 venues for each neighborhood

In [23]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd'] # ordinal suffixes for the first three ranks

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # ranks beyond 3rd take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Chinese Restaurant,Shopping Mall,Coffee Shop,Bakery,Gym / Fitness Center,Caribbean Restaurant,Breakfast Spot,Cantonese Restaurant,Hong Kong Restaurant,Park
1,"Alderwood, Long Branch",Café,Park,Discount Store,Pizza Place,Burger Joint,Grocery Store,Bank,Coffee Shop,Department Store,Dog Run
2,"Bathurst Manor, Wilson Heights, Downsview North",Park,Pizza Place,Coffee Shop,Bank,Gas Station,Bridal Shop,Shopping Mall,Fried Chicken Joint,Gym,Mediterranean Restaurant
3,Bayview Village,Park,Japanese Restaurant,Trail,Bank,Gas Station,Grocery Store,Café,Playground,Restaurant,Chinese Restaurant
4,"Bedford Park, Lawrence Manor East",Italian Restaurant,Bakery,Sushi Restaurant,Pizza Place,Coffee Shop,Bagel Shop,Café,Pub,Grocery Store,Fast Food Restaurant
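
The suffix list above is sufficient for ten columns. If more columns were ever needed, a general ordinal helper would avoid labels such as "11st". The function `ordinal` below is a hypothetical sketch of that idea and is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 4)]
print(columns)
```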


Run _k_-means to group the neighborhoods into 5 clusters:

In [24]:
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood') # keep only the numeric frequency columns
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

In [25]:
kmeans.labels_.shape

(103,)

In [26]:
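toronto_merged = df
# merge neighborhoods_venues_sorted with df to add the top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
# add clustering labels; kmeans.labels_ follows the row order of toronto_grouped (sorted by
# neighborhood name), not the row order of df, so look each label up by neighborhood
cluster_labels = pd.Series(kmeans.labels_, index=toronto_grouped['Neighborhood'])
toronto_merged.insert(0, 'Cluster Labels', toronto_merged['Neighborhood'].map(cluster_labels).values)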
toronto_merged.head()

Unnamed: 0,Cluster Labels,PostalCode,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,M3A,North York,Parkwoods,43.753259,-79.329656,Coffee Shop,Pharmacy,Bus Stop,Gas Station,Supermarket,Bank,Mobile Phone Shop,BBQ Joint,Skating Rink,Sandwich Place
1,1,M4A,North York,Victoria Village,43.725882,-79.315572,Coffee Shop,Gym,Fast Food Restaurant,Middle Eastern Restaurant,Indian Restaurant,Gym / Fitness Center,Thrift / Vintage Store,Park,Grocery Store,Shoe Store
2,0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Coffee Shop,Café,Park,Restaurant,Pub,Gastropub,Diner,Japanese Restaurant,Bakery,Farmers Market
3,0,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Furniture / Home Store,Dessert Shop,Vietnamese Restaurant,Sandwich Place,Grocery Store,Greek Restaurant
4,1,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494,Coffee Shop,Pizza Place,Café,Ramen Restaurant,Restaurant,Japanese Restaurant,Park,Dance Studio,Middle Eastern Restaurant,Breakfast Spot
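
Cluster labels come back in the row order of the table that was passed to `fit`. When they are attached to a table in a different order, matching by neighborhood name keeps each label with the right row. The sketch below shows this on three rows. The label values are made up for illustration.

```python
import pandas as pd

# labels in the row order of a table sorted by name (as a groupby result would be)
grouped_names = ['Agincourt', 'Parkwoods', 'Victoria Village']
labels = [2, 0, 1]  # illustrative labels, one per row of the sorted table

# a second table listing the same neighborhoods in a different order
places = pd.DataFrame({'Neighborhood': ['Parkwoods', 'Victoria Village', 'Agincourt']})

# index the labels by name, then look them up for each row of `places`
by_name = pd.Series(labels, index=grouped_names)
places['Cluster Labels'] = places['Neighborhood'].map(by_name)
print(places)
```

Assigning `labels` by position here would give Parkwoods the label that belongs to Agincourt.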


In [31]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, postal_code, borough, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Borough'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(
        str(postal_code)+' '+str(borough)+' '+str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
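
The map uses one color per cluster, taken from matplotlib's rainbow colormap. As an alternative sketch that uses only the standard library (a swapped-in technique, not what the notebook does), evenly spaced hues can be converted to hex strings with `colorsys`. The function name `cluster_palette` and the saturation and brightness values are assumptions.

```python
import colorsys

def cluster_palette(k, saturation=0.8, value=0.9):
    """Return k hex color strings with evenly spaced hues."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, saturation, value)
        palette.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return palette

palette = cluster_palette(5)
print(palette)
```

Indexing the palette directly with the cluster label (`palette[cluster]`) gives each cluster its own color.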