This notebook contains my final project for IBM's Data Science Professional Certificate.

In [1]:
import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim

# Part 1 - Scraping
The pandas read_html function parses the HTML of a page and looks for tables. A new DataFrame is created for every table found, and the function returns them as a list. The postal code table is the first one on the page, so it is always at position 0.

In [2]:
canada_pc = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', na_values='Not assigned')[0]

The na_values argument lists the values that should be read as np.nan, in this case the "Not assigned" string. With this, we can easily drop the rows with a missing Borough.

In [3]:
canada_pc.dropna(axis=0, inplace=True)
canada_pc.rename(columns={'Post Code': 'Postal Code'}, inplace=True)
canada_pc.reset_index(drop=True, inplace=True)

canada_pc.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
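
The effect of na_values followed by dropna can be seen on a small inline table. This is only a sketch: the three rows are made up, and read_csv stands in for read_html (both accept na_values) so that the snippet runs without network access or an HTML parser.

```python
import io
import pandas as pd

# Made-up sample rows; "Not assigned" marks a postal code without a borough
raw = """Postal Code,Borough,Neighborhood
M1A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
"""

# Every "Not assigned" cell is read as NaN
sample = pd.read_csv(io.StringIO(raw), na_values='Not assigned')

# Rows containing NaN are dropped, then the index is renumbered from 0
cleaned = sample.dropna(axis=0).reset_index(drop=True)
print(cleaned)
```

Only the two North York rows remain, which mirrors what happens to the unassigned postal codes in the real table.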


In [4]:
canada_pc.shape

(103, 3)

Since the library used to get geospatial data isn't working reliably, a CSV file with the coordinates of each postal code is used instead:

In [5]:
geo_df = pd.read_csv('https://cocl.us/Geospatial_data')

In [6]:
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


# Part 2 - Parsing Data
Now both DataFrames are merged into one containing all the data, using the "Postal Code" column as the key. For this analysis we keep only the boroughs whose name contains "Toronto".

In [7]:
canada_geo = canada_pc.merge(geo_df, how="inner", on="Postal Code")

In [8]:
canada_geo = canada_geo[canada_geo['Borough'].str.contains('Toronto')]

canada_geo.reset_index(drop=True, inplace=True)

In [9]:
canada_geo.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
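
The merge and the borough filter can be tried on a few rows first. The frames below are illustrative only (the M3A coordinates in particular are placeholders), and the names codes, coords, merged and toronto_only are not part of the notebook.

```python
import pandas as pd

# Illustrative rows in the same shape as the two tables above
codes = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A', 'M4E'],
    'Borough': ['North York', 'Downtown Toronto', 'East Toronto'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A', 'M1B'],
    'Latitude': [43.75, 43.65426, 43.806686],
    'Longitude': [-79.33, -79.360636, -79.194353],
})

# Inner join: only postal codes present in both frames survive
merged = codes.merge(coords, how='inner', on='Postal Code')

# Keep the boroughs whose name contains "Toronto"
toronto_only = merged[merged['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(toronto_only)
```

M4E and M1B each appear in only one frame, so the inner join discards them; the filter then removes North York.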


As you can see, when a postal code covers more than one Neighborhood, the names are separated by commas. Data in this form is not easy to manipulate, so we split the neighborhoods and put each one in its own row.

In [10]:
def splitNeighborhoods(df):
    for index, row in df.iterrows():
        neighborhoods = row['Neighborhood'].split(', ')
        
        if (len(neighborhoods)) > 1:
            neighborhood_df = pd.DataFrame({
                'Postal Code': [row['Postal Code']] * len(neighborhoods),
                'Borough': [row['Borough']] * len(neighborhoods),
                'Neighborhood': neighborhoods,
                'Latitude': [row['Latitude']] * len(neighborhoods),
                'Longitude': [row['Longitude']] * len(neighborhoods)
            })

            df = pd.concat(objs=[df, neighborhood_df], axis=0)
            df.drop(index, axis=0, inplace=True)
            
    df.reset_index(drop=True, inplace=True)

    return df

In [11]:
canada_geo = splitNeighborhoods(canada_geo)

In [12]:
canada_geo.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
1,M4E,East Toronto,The Beaches,43.676357,-79.293031
2,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
3,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
4,M6G,Downtown Toronto,Christie,43.669542,-79.422564


In [13]:
canada_geo.shape

(75, 5)
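
The same split can also be written without a loop. The sketch below is an alternative, not the function used in this notebook: it relies on str.split and DataFrame.explode (available since pandas 0.25), and split_neighborhoods is a hypothetical name. Because the loop version concatenates frames whose index restarts at 0 and then drops rows by label, a drop can also remove earlier split rows that carry the same label, so the two approaches may not return the same number of rows.

```python
import pandas as pd

def split_neighborhoods(df):
    """Return a copy of df with one row per neighborhood."""
    out = df.copy()
    # Turn "A, B" into the list ['A', 'B']
    out['Neighborhood'] = out['Neighborhood'].str.split(', ')
    # explode() repeats the other columns once per list element
    return out.explode('Neighborhood').reset_index(drop=True)

# Two rows taken from the table above
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5C'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Regent Park, Harbourfront', 'St. James Town'],
    'Latitude': [43.65426, 43.651494],
    'Longitude': [-79.360636, -79.375418],
})

result = split_neighborhoods(toy)
print(result[['Postal Code', 'Neighborhood']])
```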

Using the venues/explore endpoint of the Foursquare API, we can obtain the venues that are close to a neighborhood. We request up to 100 venues within 500 meters of every neighborhood and put this data in a DataFrame.

In [14]:
import requests

CLIENT_ID = 'CD5BPAG42DVQAUWY1NJDFMQUWNTOH1FKQSB55JWAUCHQQD2X' # your Foursquare ID
CLIENT_SECRET = '5RG2JUBRQDJ1QTEG3XBZ3FDFVGI4441K3HJPL3E2VGD5VLNK' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

def createDataFrameFromVenues(neighborhood, venues):
    columns = [
        'Neighborhood',
        'Neighborhood Latitude',
        'Neighborhood Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category',
    ]

    # DataFrame.append was removed in pandas 2.0, so the rows are collected
    # in a list and converted to a DataFrame once at the end
    rows = []

    for item in venues['response']['groups'][0]['items']:
        venue = item['venue']

        rows.append({
            'Neighborhood': neighborhood['name'],
            'Neighborhood Latitude': neighborhood['lat'],
            'Neighborhood Longitude': neighborhood['lon'],
            'Venue': venue['name'],
            'Venue Latitude': venue['location']['lat'],
            'Venue Longitude': venue['location']['lng'],
            'Venue Category': venue['categories'][0]['name']
        })

    return pd.DataFrame(rows, columns=columns)

def getTorontoVenues(neighborhoods_df, radius, limit):
    neighborhood_venues_df = pd.DataFrame({
        'Neighborhood': [],
        'Neighborhood Latitude': [],
        'Neighborhood Longitude': [],
        'Venue': [],
        'Venue Latitude': [],
        'Venue Longitude': [],
        'Venue Category': [],
    })

    for index, row in neighborhoods_df.iterrows():
        neighborhood = {
            'name': row['Neighborhood'],
            'lat': row['Latitude'],
            'lon': row['Longitude']
        }
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            neighborhood['lat'], 
            neighborhood['lon'],
            radius, 
            limit
        )
        
        venues_response = requests.get(url)
        
        venues_data = venues_response.json()
        
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        neighborhood_venues_df = pd.concat([neighborhood_venues_df, createDataFrameFromVenues(neighborhood, venues_data)])

    neighborhood_venues_df.reset_index(drop=True, inplace=True)

    return neighborhood_venues_df
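
A possible variation on the request set-up is sketched below. It reads the credentials from environment variables instead of writing them in the notebook, and lets urlencode build the query string. The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET and the helper build_explore_url are assumptions made for this example.

```python
import os
from urllib.parse import urlencode

# Hypothetical variable names; set them in the shell before starting Jupyter
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

def build_explore_url(lat, lon, radius, limit, version='20180605'):
    """Build the venues/explore URL with encoded query parameters."""
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': version,
        'll': '{},{}'.format(lat, lon),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Coordinates of St. James Town from the table above
url = build_explore_url(43.651494, -79.375418, radius=500, limit=100)
```

The resulting URL can be passed to requests.get exactly as in the function above.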

In [15]:
venues_df = getTorontoVenues(canada_geo, radius=500, limit=100)

In [16]:
venues_df.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,St. James Town,43.651494,-79.375418,Gyu-Kaku Japanese BBQ,43.651422,-79.375047,Japanese Restaurant
1,St. James Town,43.651494,-79.375418,GEORGE Restaurant,43.653346,-79.374445,Restaurant
2,St. James Town,43.651494,-79.375418,Fahrenheit Coffee,43.652384,-79.372719,Coffee Shop
3,St. James Town,43.651494,-79.375418,Crepe TO,43.650063,-79.374587,Creperie
4,St. James Town,43.651494,-79.375418,GoodLife Fitness Toronto 137 Yonge Street,43.651242,-79.378068,Gym


In [17]:
venues_df.shape

(3077, 7)
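
The nested response can also be flattened with pd.json_normalize, shown here as a rough alternative. The items list is written by hand in the shape that the code above reads (item, venue, location, categories); it reuses two venue names from the table but is not a real API response.

```python
import pandas as pd

# Hand-written items for illustration only
fake_items = [
    {'venue': {'name': 'Fahrenheit Coffee',
               'location': {'lat': 43.652384, 'lng': -79.372719},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Crepe TO',
               'location': {'lat': 43.650063, 'lng': -79.374587},
               'categories': [{'name': 'Creperie'}]}},
]

# Nested dictionaries become dotted column names such as 'venue.location.lat'
flat = pd.json_normalize(fake_items)

venues = pd.DataFrame({
    'Venue': flat['venue.name'],
    'Venue Latitude': flat['venue.location.lat'],
    'Venue Longitude': flat['venue.location.lng'],
    # The category list is left as-is by json_normalize, so take its first name
    'Venue Category': flat['venue.categories'].apply(
        lambda cats: cats[0]['name'] if cats else None
    ),
})
print(venues)
```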

We can now count the venues that were returned by Foursquare for every neighborhood

In [18]:
venues_df.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Adelaide,99,99,99,99,99,99
Bathurst Quay,14,14,14,14,14,14
Berczy Park,58,58,58,58,58,58
Brockton,22,22,22,22,22,22
Business reply mail Processing Centre,15,15,15,15,15,15
CN Tower,14,14,14,14,14,14
Cabbagetown,47,47,47,47,47,47
Central Bay Street,66,66,66,66,66,66
Chinatown,67,67,67,67,67,67
Christie,16,16,16,16,16,16
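
Since count() repeats the same number in every column, a single count per neighborhood may be easier to read. A small sketch with made-up rows, using groupby(...).size():

```python
import pandas as pd

# Made-up venue rows for illustration
venues = pd.DataFrame({
    'Neighborhood': ['Christie', 'Christie', 'Brockton'],
    'Venue': ['A', 'B', 'C'],
})

# size() returns one count per group instead of one per column
venue_counts = venues.groupby('Neighborhood').size().sort_values(ascending=False)
print(venue_counts)
```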


To understand how different neighborhoods are related, we use the categories of their venues, on the assumption that neighborhoods with the same kinds of venues are similar. For this, the "one-hot encoding" technique is used: every venue becomes a row with a 1 in the column of its category and a 0 in all the others.

In [19]:
# one hot encoding
toronto_onehot = pd.get_dummies(venues_df[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
neighborhood_index = toronto_onehot.columns.get_loc('Neighborhood')
fixed_columns = [toronto_onehot.columns[neighborhood_index]] + list(toronto_onehot.columns[:neighborhood_index]) + list(toronto_onehot.columns[neighborhood_index + 1:])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,St. James Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
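
On a three-venue example the encoding looks as follows. The rows are made up, and the shortcut with insert assumes that no venue category is itself named "Neighborhood", which is why it is only a sketch.

```python
import pandas as pd

# Made-up venues for illustration
venues = pd.DataFrame({
    'Neighborhood': ['Christie', 'Christie', 'Brockton'],
    'Venue Category': ['Café', 'Park', 'Café'],
})

# One column per category, with a single 1 in every row
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# Put the neighborhood in front of the category columns
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
print(onehot)
```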


Grouping the rows by neighborhood and taking the mean of each category gives the fraction of a neighborhood's venues that belong to that category.

In [20]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

toronto_grouped.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.020202,0.0,0.0,...,0.0,0.0,0.0,0.0,0.010101,0.0,0.0,0.0,0.010101,0.0
1,Bathurst Quay,0.071429,0.071429,0.071429,0.142857,0.214286,0.142857,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0
3,Brockton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Business reply mail Processing Centre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667
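
A toy case makes the meaning of these means concrete. The one-hot rows below are invented: Christie has four venues, two of them cafés, so its "Café" value should be 0.5.

```python
import pandas as pd

# Invented one-hot rows: four venues in Christie, one in Brockton
onehot = pd.DataFrame({
    'Neighborhood': ['Christie', 'Christie', 'Christie', 'Christie', 'Brockton'],
    'Café': [1, 1, 0, 0, 1],
    'Park': [0, 0, 1, 0, 0],
    'Diner': [0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of rows that hold a 1
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```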


Now we can get the most common categories for every neighborhood. For this analysis, we use the top 10 in descending order of frequency. We can say that two neighborhoods are alike if they have a lot of categories in common.

In [21]:
indicators = ['st', 'nd', 'rd']

def getTopNCategories(df, key, n):
    sorted_df = df.drop('Neighborhood')
    sorted_df.sort_values(ascending=False, inplace=True)

    return sorted_df.iloc[0:n].index

def createCategoriesDataFrame(neighborhood, categories):
    df = pd.DataFrame({'Neighborhood': [neighborhood]})

    for key, category in enumerate(categories):
        try:
            # the first three positions take the suffixes st, nd and rd
            df['{}{} Most Common Venue'.format(key+1, indicators[key])] = [category]
        except IndexError:
            # every later position falls back to "th"
            df['{}th Most Common Venue'.format(key+1)] = [category]

    return df

def createTopNCategoriesDataFrame(df, n):
    categories_df = pd.DataFrame()

    for key, row in df.iterrows():
        top_categories = getTopNCategories(row, key, n)
        
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        categories_df = pd.concat([categories_df, createCategoriesDataFrame(row['Neighborhood'], list(top_categories))])
        
    categories_df.reset_index(drop=True, inplace=True)

    return categories_df

In [22]:
toronto_top_categories = createTopNCategoriesDataFrame(toronto_grouped, 10)

In [23]:
toronto_top_categories

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelaide,Coffee Shop,Café,Hotel,Restaurant,Gym,Clothing Store,Deli / Bodega,Thai Restaurant,Steakhouse,Bakery
1,Bathurst Quay,Airport Service,Airport Lounge,Airport Terminal,Airport,Sculpture Garden,Boat or Ferry,Rental Car Location,Harbor / Marina,Airport Gate,Airport Food Court
2,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Restaurant,Beer Bar,Cheese Shop,Pharmacy,Bakery,Farmers Market,Café
3,Brockton,Café,Coffee Shop,Breakfast Spot,Grocery Store,Stadium,Burrito Place,Restaurant,Climbing Gym,Performing Arts Venue,Bakery
4,Business reply mail Processing Centre,Yoga Studio,Fast Food Restaurant,Comic Shop,Pizza Place,Restaurant,Burrito Place,Brewery,Skate Park,Light Rail Station,Spa
5,CN Tower,Airport Service,Airport Lounge,Airport Terminal,Airport,Sculpture Garden,Boat or Ferry,Rental Car Location,Harbor / Marina,Airport Gate,Airport Food Court
6,Cabbagetown,Coffee Shop,Pizza Place,Italian Restaurant,Bakery,Park,Chinese Restaurant,Pub,Restaurant,Café,Convenience Store
7,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Japanese Restaurant,Salad Place,Burger Joint,Bar,Department Store,Thai Restaurant
8,Chinatown,Café,Coffee Shop,Mexican Restaurant,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Bar,Pizza Place,Dessert Shop,Park,Dumpling Restaurant
9,Christie,Grocery Store,Café,Park,Nightclub,Italian Restaurant,Diner,Candy Store,Restaurant,Baby Store,Coffee Shop
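
The ranking step can be expressed more compactly with Series.nlargest. The sketch below uses invented frequencies, and both top_categories and the "Rank N" column names are hypothetical, chosen only for this example.

```python
import pandas as pd

# Invented category frequencies for two neighborhoods
grouped = pd.DataFrame({
    'Neighborhood': ['Christie', 'Brockton'],
    'Café': [0.5, 0.2],
    'Park': [0.3, 0.1],
    'Diner': [0.2, 0.7],
})

def top_categories(frame, n):
    """Return one row per neighborhood with its n most frequent categories."""
    freqs = frame.set_index('Neighborhood')
    # nlargest sorts each row's frequencies and keeps the n highest
    ranked = freqs.apply(
        lambda row: list(row.nlargest(n).index), axis=1, result_type='expand'
    )
    ranked.columns = ['Rank {}'.format(i + 1) for i in range(n)]
    return ranked.reset_index()

top2 = top_categories(grouped, 2)
print(top2)
```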


# Part 3 - Clustering

With the category frequencies for every neighborhood, we can now cluster this data. Neighborhoods will be in the same cluster if their frequencies are similar (they have categories in common). We will use 6 clusters to split neighborhoods and see how they are related to each other.

In [24]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 6

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 2, 1, 1, 1, 2, 1, 1, 1, 1], dtype=int32)
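
A quick way to see how the neighborhoods are spread over the clusters is to count the labels. This sketch uses only the ten labels printed above, not the full array, so the counts are partial.

```python
import numpy as np

# First ten labels as printed above; the full array has one label per neighborhood
labels = np.array([1, 2, 1, 1, 1, 2, 1, 1, 1, 1])

# counts[i] is the number of neighborhoods assigned to cluster i;
# minlength keeps a slot for clusters that do not appear in this sample
counts = np.bincount(labels, minlength=6)
print(counts)  # [0 8 2 0 0 0]
```

With the real data, np.bincount(kmeans.labels_, minlength=kclusters) would give the size of each of the 6 clusters.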

With the clusters assigned, we create a dataframe that contains all neighborhood data, their popular categories and the cluster.

In [25]:
# add clustering labels
neighborhoods_venues_sorted = toronto_top_categories.copy()
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = canada_geo

# join the cluster labels and top categories onto canada_geo, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [26]:
toronto_merged.drop('Postal Code', inplace=True, axis=1)
toronto_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Restaurant,Clothing Store,Italian Restaurant,American Restaurant,Park,Cosmetics Shop,Cocktail Bar,Pizza Place
1,East Toronto,The Beaches,43.676357,-79.293031,0,Health Food Store,Asian Restaurant,Trail,Pub,Doner Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Yoga Studio
2,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Coffee Shop,Cocktail Bar,Seafood Restaurant,Restaurant,Beer Bar,Cheese Shop,Pharmacy,Bakery,Farmers Market,Café
3,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Japanese Restaurant,Salad Place,Burger Joint,Bar,Department Store,Thai Restaurant
4,Downtown Toronto,Christie,43.669542,-79.422564,1,Grocery Store,Café,Park,Nightclub,Italian Restaurant,Diner,Candy Store,Restaurant,Baby Store,Coffee Shop
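
One thing to watch with this join: a neighborhood for which no venues were returned has no row in the category table, so it ends up with NaN in "Cluster Labels" and the column becomes floating point. The sketch below, with made-up rows, shows the effect and one possible clean-up.

```python
import pandas as pd

# Made-up rows; the third neighborhood has no cluster label
geo = pd.DataFrame({
    'Neighborhood': ['Christie', 'Brockton', 'No Venues Here'],
    'Latitude': [43.669542, 43.64, 43.70],
    'Longitude': [-79.422564, -79.43, -79.40],
})
labels = pd.DataFrame({
    'Neighborhood': ['Christie', 'Brockton'],
    'Cluster Labels': [1, 0],
})

# Same join pattern as above: match on the Neighborhood name
merged = geo.join(labels.set_index('Neighborhood'), on='Neighborhood')

# Drop the unmatched row, then restore the integer dtype
clean = merged.dropna(subset=['Cluster Labels']).copy()
clean['Cluster Labels'] = clean['Cluster Labels'].astype(int)
print(clean)
```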


Using folium, a map is created with circle markers that represent neighborhoods. Each cluster has its own color, so neighborhoods in the same cluster are shown with the same color. The geopy library is used to get Toronto's coordinates.

In [27]:
import folium
from geopy.geocoders import Nominatim

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [28]:
geolocator = Nominatim(user_agent='Capstone')
full_address, latlon = geolocator.geocode('Toronto, Canada')

toronto_clusters = folium.Map(location=[latlon[0], latlon[1]], zoom_start=12)

In [29]:
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
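
If matplotlib is not available, a similar palette can be built with the standard library. This is a different technique from the colormap above: it spaces the hues evenly with colorsys, and cluster_palette is a hypothetical helper.

```python
import colorsys

def cluster_palette(k):
    """Return k hex colors with evenly spaced hues."""
    palette = []
    for i in range(k):
        # Full saturation and brightness, hue stepped around the color wheel
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)
        ))
    return palette

palette = cluster_palette(6)
print(palette)
```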

When a circle marker is clicked, a popup shows the neighborhood and its cluster.

In [30]:
for key, row in toronto_merged.iterrows():
    label = folium.Popup(str(row['Neighborhood']) + ' - Cluster ' + str(row['Cluster Labels'] + 1), parse_html=True)
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=5,
        color=rainbow[row['Cluster Labels']],
        popup=label
    ).add_to(toronto_clusters)
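
The popup text and the color lookup can be isolated in a small helper, which makes it easy to skip a neighborhood whose cluster label is missing. The function marker_info and the three-color palette are assumptions for this sketch, not part of the notebook.

```python
import math

def marker_info(neighborhood, cluster, palette):
    """Return (popup text, color) for a neighborhood, or None if it has no cluster."""
    if cluster is None or (isinstance(cluster, float) and math.isnan(cluster)):
        return None
    cluster = int(cluster)
    # Clusters are shown starting at 1, as in the popup above
    text = '{} - Cluster {}'.format(neighborhood, cluster + 1)
    return text, palette[cluster]

palette = ['#ff0000', '#00ff00', '#0000ff']
info = marker_info('Christie', 1.0, palette)
skipped = marker_info('No Venues Here', float('nan'), palette)
print(info, skipped)
```

Inside the loop, a None result would simply mean that no marker is added for that row.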

In [31]:
toronto_clusters