# IBM Data Science Professional Certificate: Capstone Project

Welcome! This notebook is my Capstone Project for the IBM Data Science Professional Certificate. I'll be scraping some data and applying a few machine learning tools, and I hope you'll pick up some tips and tricks along the way. Alright, let's do this!

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

## Toronto Neighborhood Data

We'll be segmenting and clustering location data about Toronto. To start, we'll need to grab a dataset with geographic data to plot Toronto neighborhoods. Here's how we'll do it:

1. We'll scrape the Wikipedia list of postal codes beginning with M (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), which contains the names of Toronto boroughs and neighborhoods. For boroughs that do not have an assigned neighborhood name, we'll use the borough name as the neighborhood name.
2. We'll then get the coordinates for each postal code by merging in a CSV file of latitudes and longitudes.

### 1. Scrape Wikipedia for Toronto Neighborhood Data

In [2]:
# imports
import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
toronto_wiki = requests.get(URL)

soup = BeautifulSoup(toronto_wiki.content, 'html.parser')

The data we want is in the table whose class attribute is "wikitable sortable".

In [3]:
# grab the table
table = soup.find('table',{'class':'wikitable sortable'})
print(table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postal Code
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    North York
   </td>
   <td>
    Parkwoods
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    North York
   </td>
   <td>
    Victoria Village
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    Downtown Toronto
   </td>
   <td>
    Regent Park, Harbourfront
   </td>
  </tr>
  <tr>
   <td>
    M6A
   </td>
   <td>
    North York
   </td>
   <td>
    Lawrence Manor, Lawrence Heights
   </td>
  </tr>
  <tr>
   <td>
    M7A
   </td>
   <td>
    Downtown Toronto
   </td>
   <td>
    Queen's Park, Ontario Provincial Government
   </td>
  </tr>
  <tr>
   <td>
    M8

In [4]:
# scrape the data and append to list of columns

rows = table.find_all('tr')

# each of the lists below will be a column in our dataframe
c1 = []
c2 = []
c3 = []

for row in rows:
    # take only the <th>/<td> cells; iterating over the row directly also yields whitespace text nodes
    cells = row.find_all(['th', 'td'])
    if len(cells) == 3:
        c1.append(cells[0])
        c2.append(cells[1])
        c3.append(cells[2])

# print first three items in each column
print(c1[0:3])
print('\n')
print(c2[0:3])
print('\n')
print(c3[0:3])

[<th>Postal Code
</th>, <td>M1A
</td>, <td>M2A
</td>]


[<th>Borough
</th>, <td>Not assigned
</td>, <td>Not assigned
</td>]


[<th>Neighbourhood
</th>, <td>Not assigned
</td>, <td>Not assigned
</td>]


In [5]:
# create a function to remove the html tags from the rows

def remove_html_tags(item_list):
    item_list = [str(x).replace('<th>','') for x in item_list]
    item_list = [str(x).replace('</th>','') for x in item_list]
    
    item_list = [str(x).replace('\n','') for x in item_list]
    
    item_list = [str(x).replace('<td>','') for x in item_list]
    item_list = [str(x).replace('</td>','') for x in item_list]

    return item_list

In [6]:
# clean the rows

c1 = remove_html_tags(c1)
c2 = remove_html_tags(c2)
c3 = remove_html_tags(c3)
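
As an aside, BeautifulSoup can return the text of a cell directly, which avoids replacing tags by hand. The sketch below is one possible alternative, not the approach used in the rest of this notebook. It runs on a small inline HTML string that stands in for the Wikipedia table (an assumption for illustration), and the names `sample_table`, `parsed_rows`, `header` and `records` are hypothetical.

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for the Wikipedia table (an assumption, not the live page)
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_table = BeautifulSoup(html, 'html.parser').find('table')

# one list of cleaned strings per table row;
# get_text(strip=True) drops the tags and the surrounding whitespace
parsed_rows = [
    [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    for row in sample_table.find_all('tr')
]

# first row is the header, the rest are data rows
header, records = parsed_rows[0], parsed_rows[1:]
print(header)
print(records)
```

Each inner list could then be passed to `pd.DataFrame(records, columns=header)` to build the dataframe in one step.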

In [7]:
# create our dataframe!

toronto = pd.DataFrame({'postal_code': c1[1:], 'borough': c2[1:], 'neighborhood': c3[1:]})
toronto.head()

Unnamed: 0,postal_code,borough,neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [8]:
# remove the rows where the postal code is not assigned to a borough

toronto = toronto[toronto['borough'] != 'Not assigned'].reset_index(drop=True)
toronto.head()

Unnamed: 0,postal_code,borough,neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [9]:
# check for rows where the neighborhood is still 'Not assigned' (there are none)

toronto[toronto['neighborhood'] == 'Not assigned']

Unnamed: 0,postal_code,borough,neighborhood
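
The plan at the top said that a borough without an assigned neighborhood would use the borough name instead. No such rows exist here, but if the source table changes, a fallback along these lines might be useful. This is a minimal sketch on toy data: the `sample` dataframe and the postal code `M9Z` are made up for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values) with one unassigned neighborhood
sample = pd.DataFrame({
    'postal_code': ['M3A', 'M9Z'],
    'borough': ['North York', 'Etobicoke'],
    'neighborhood': ['Parkwoods', 'Not assigned'],
})

# where the neighborhood is missing, fall back to the borough name
unassigned = sample['neighborhood'] == 'Not assigned'
sample.loc[unassigned, 'neighborhood'] = sample.loc[unassigned, 'borough']

print(sample)
```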


In [10]:
# some postal codes have multiple neighborhoods, separated in the cell by ','

toronto[toronto['postal_code'] == 'M5A']

Unnamed: 0,postal_code,borough,neighborhood
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [11]:
print(f'Number of rows in the table: {toronto.shape[0]}')

Number of rows in the table: 103


### 2. Get Geographic Coordinates for Each Neighborhood

Unfortunately, geocoding each postal code through Google's API proved too slow and prone to disruption. I will instead get the latitude and longitude data by merging in a CSV file with the coordinates. The commented-out code below shows how the API approach would work.

In [12]:
# get lat lon columns using geocoder and Google API

# import geocoder

# latitudes = []
# longitudes = []

# for postal_code in toronto['postal_code']:
#     # initialize the result to None for each postal code
#     lat_lng_coords = None

#     # loop until you get the coordinates
#     while lat_lng_coords is None:
#         g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#         lat_lng_coords = g.latlng

#     latitudes.append(lat_lng_coords[0])
#     longitudes.append(lat_lng_coords[1])

# toronto['latitude'] = latitudes
# toronto['longitude'] = longitudes

# toronto.head()
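
The loop in the commented-out code keeps retrying until it gets coordinates, which can hang indefinitely when the service is down. A bounded retry is one way around that. The sketch below is an illustration only: `geocode_with_retry`, `flaky_lookup` and the coordinates it returns are hypothetical, and `lookup` stands in for whatever geocoding call you use.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable returning [lat, lng] or None; it stands in for
    a real geocoding call and is an assumption of this sketch.
    """
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # brief pause before trying again
    return None  # give up instead of looping forever


# usage with a fake lookup that fails twice, then succeeds
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.65, -79.38]

result = geocode_with_retry(flaky_lookup, 'M5A, Toronto, Ontario', delay_seconds=0)
print(result)
```

Returning `None` after the last attempt lets the caller decide what to do with postal codes that could not be geocoded.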

In [13]:
# import csv file with lat, lon
toronto_coords = pd.read_csv('toronto_geo.csv')

# clean column names
toronto_coords.columns = toronto_coords.columns.str.lower()

# take a look at the dataframe
toronto_coords.head()

Unnamed: 0,postal code,latitude,longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
# merge dataframes to produce one dataframe with all information
toronto = toronto.merge(toronto_coords, how='left', left_on=['postal_code'], 
                        right_on=['postal code']).drop('postal code', axis=1)

# look at output
toronto.head()

Unnamed: 0,postal_code,borough,neighborhood,latitude,longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
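
A left merge silently leaves `NaN` coordinates for any postal code that is missing from the CSV file, so it can be worth checking the result. The sketch below shows one way to do that with the `validate` and `indicator` options of `merge`. It uses toy frames rather than the real data, and `M0X` is a made-up postal code with no match.

```python
import pandas as pd

# Toy frames (hypothetical values); 'M0X' has no match in the coordinates table
left = pd.DataFrame({'postal_code': ['M3A', 'M4A', 'M0X']})
coords = pd.DataFrame({
    'postal code': ['M3A', 'M4A'],
    'latitude': [43.753259, 43.725882],
    'longitude': [-79.329656, -79.315572],
})

merged = left.merge(
    coords,
    how='left',
    left_on='postal_code',
    right_on='postal code',
    validate='one_to_one',   # raise if a postal code is duplicated on either side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
).drop('postal code', axis=1)

# postal codes that found no coordinates in the right-hand table
missing = merged.loc[merged['_merge'] == 'left_only', 'postal_code'].tolist()
print(f'Postal codes without coordinates: {missing}')
```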


### Visualize Neighborhoods

Alright, let's visualize our neighborhoods!

In [15]:
# we'll use folium for our visualizations
import folium

# zoom lat lon

zoom_lat = 43.726437 
zoom_lon = -79.360177

# initialize map of Toronto using latitude and longitude values
toronto_map = folium.Map(location=[zoom_lat, zoom_lon], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto['latitude'], toronto['longitude'], toronto['borough'], toronto['neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

## Explore Neighborhoods

Let's look at the top venues surrounding each of these neighborhoods. We'll create a function that queries the Foursquare API to get the top 50 venues in the area. I will hide my Foursquare API credentials; you can input yours if you want to run the notebook yourself.

In [16]:
# Foursquare API credentials

CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
ACCESS_TOKEN = '' # your Foursquare Access Token
VERSION = '' # Foursquare API version date, in YYYYMMDD format
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

In [17]:
# create a function that queries the Foursquare API and returns up to `limit` venues per neighborhood

def get_neighborhood_top_venues(neighborhoods, latitudes, longitudes, radius, limit):
    
    venues_list=[]
    for neighborhood, lat, lng in zip(neighborhoods, latitudes, longitudes):
        print(f'Getting top venues: {neighborhood}')
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            neighborhood, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
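
The function above assumes every response has the expected shape and that every venue has at least one category. Two small helpers can make it more forgiving, sketched here as a possible variation. `build_explore_url`, `extract_venues` and `fake_payload` are hypothetical names, and the payload is a hand-written stand-in that mirrors only the keys the function above reads.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL with properly encoded query parameters."""
    query = urlencode({
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',
        'radius': radius,
        'limit': limit,
    })
    return f'https://api.foursquare.com/v2/venues/explore?{query}'


def extract_venues(payload):
    """Pull (name, lat, lng, category) tuples out of a response dict.

    Returns an empty list when the expected keys are missing, and uses
    None as the category for venues that have none.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    venues = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        venues.append((
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            categories[0]['name'] if categories else None,
        ))
    return venues


# hand-written payload (an assumption) with one venue that has no category
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}

url = build_explore_url('ID', 'SECRET', '', 43.753259, -79.329656, 500, 50)
venues = extract_venues(fake_payload)
print(url)
print(venues)
```

Keeping the parsing in its own function also makes it possible to test it without calling the API.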


In [18]:
toronto_venues = get_neighborhood_top_venues(neighborhoods=toronto['neighborhood'], latitudes=toronto['latitude'],
                                        longitudes=toronto['longitude'],radius=500,limit=50)


Getting top venues: Parkwoods
Getting top venues: Victoria Village
Getting top venues: Regent Park, Harbourfront
Getting top venues: Lawrence Manor, Lawrence Heights
Getting top venues: Queen's Park, Ontario Provincial Government
Getting top venues: Islington Avenue, Humber Valley Village
Getting top venues: Malvern, Rouge
Getting top venues: Don Mills
Getting top venues: Parkview Hill, Woodbine Gardens
Getting top venues: Garden District, Ryerson
Getting top venues: Glencairn
Getting top venues: West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Getting top venues: Rouge Hill, Port Union, Highland Creek
Getting top venues: Don Mills
Getting top venues: Woodbine Heights
Getting top venues: St. James Town
Getting top venues: Humewood-Cedarvale
Getting top venues: Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Getting top venues: Guildwood, Morningside, West Hill
Getting top venues: The Beaches
Getting top venues: Berczy Park
Getting top venues: Caledon

In [19]:
print(f'The function returned {toronto_venues.shape[0]} venues')
toronto_venues.head()

The function returned 1668 venues


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [20]:
# let's count the number of venues returned for each neighborhood

toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,5,5,5,5,5,5
"Alderwood, Long Branch",7,7,7,7,7,7
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",22,22,22,22,22,22
...,...,...,...,...,...,...
"Willowdale, Willowdale West",5,5,5,5,5,5
Woburn,4,4,4,4,4,4
Woodbine Heights,8,8,8,8,8,8
York Mills West,2,2,2,2,2,2


## Analyze Each Neighborhood

Let's analyze each neighborhood. The goal is to cluster neighborhoods that have a similar mix of venue categories.

To accomplish this, we'll first create a dataframe where the columns are the different venue categories.

In [21]:
# one hot encoding
toronto_venues_analysis = pd.get_dummies(toronto_venues['Venue Category'], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_venues_analysis['neighborhood_name'] = toronto_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [toronto_venues_analysis.columns[-1]] + list(toronto_venues_analysis.columns[:-1])
toronto_venues_analysis = toronto_venues_analysis[fixed_columns]

# lower case column names
toronto_venues_analysis.columns = toronto_venues_analysis.columns.str.lower()

toronto_venues_analysis.head()

Unnamed: 0,neighborhood_name,accessories store,airport,airport food court,airport gate,airport lounge,airport service,airport terminal,american restaurant,antique shop,...,train station,turkish restaurant,vegetarian / vegan restaurant,video game store,vietnamese restaurant,warehouse store,wine bar,wings joint,women's store,yoga studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


If we group the dataset by neighborhood, we can take the mean of each venue category's indicator column, which gives the share of a neighborhood's venues that fall into that category. Later, we can use this information to see which venue categories are most common in each neighborhood.
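
To make the idea concrete, here is a toy version of the same calculation. The data is made up: neighborhood `A` has two cafes and a park, and neighborhood `B` has a single park. The names `venues`, `one_hot` and `shares` are hypothetical and do not touch the real dataframes.

```python
import pandas as pd

# Toy venue list (hypothetical): A has two cafes and a park, B has one park
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Park'],
})

# one-hot encode the categories, then average the indicator columns per neighborhood
one_hot = pd.get_dummies(venues['Venue Category'], dtype=int)
one_hot.insert(0, 'neighborhood_name', venues['Neighborhood'])
shares = one_hot.groupby('neighborhood_name').mean()

print(shares.round(2))
```

Each row of `shares` sums to 1, so a value can be read as the fraction of that neighborhood's venues in a given category.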

In [22]:
# group by neighborhood and calculate mean of each category of venue in each neighborhood
toronto_venues_grouped = toronto_venues_analysis.groupby('neighborhood_name').mean().reset_index()

toronto_venues_grouped.head()

Unnamed: 0,neighborhood_name,accessories store,airport,airport food court,airport gate,airport lounge,airport service,airport terminal,american restaurant,antique shop,...,train station,turkish restaurant,vegetarian / vegan restaurant,video game store,vietnamese restaurant,warehouse store,wine bar,wings joint,women's store,yoga studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The dataframe above is the dataset we'll use to cluster the neighborhoods. 

Before clustering, however, let's create a dataframe that includes the top 10 categories for each neighborhood. This dataframe will enable us to easily explore the most common venues in a particular neighborhood.

In [23]:
# first, let's create a function that sorts the venues in descending order. The output is a 1 by 'x' array 
# of values, where x is the number of top venues input into the function

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [24]:
# then, create a dataframe with the neighborhood and top 10 categories for each neighborhood

# We'll select top 10, but the below code can get more or less by changing the num_top_venues variable
num_top_venues = 10

# indicators will be used to label the columns in the new dataframe
indicators = ['st', 'nd', 'rd']

# create column names according to selected number of top venues
columns = ['neighborhood_name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
toronto_neigbhorhood_venues = pd.DataFrame(columns=columns) # use generated column names
toronto_neigbhorhood_venues['neighborhood_name'] = toronto_venues_grouped['neighborhood_name'] 

# loop through range of numbers equivalent to the number of rows in the grouped dataframe
# the for loop essentially loops through what will be each row in the new dataframe

for ind in np.arange(toronto_venues_grouped.shape[0]):
    # set each row equivalent to the top 10 venues
    toronto_neigbhorhood_venues.iloc[ind, 1:] = return_most_common_venues(toronto_venues_grouped.iloc[ind, :], num_top_venues)

toronto_neigbhorhood_venues.head()


Unnamed: 0,neighborhood_name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,lounge,breakfast spot,latin american restaurant,skating rink,clothing store,dog run,dim sum restaurant,diner,discount store,distribution center
1,"Alderwood, Long Branch",pizza place,pharmacy,coffee shop,pub,sandwich place,gym,gay bar,gastropub,dog run,distribution center
2,"Bathurst Manor, Wilson Heights, Downsview North",coffee shop,bank,pizza place,chinese restaurant,shopping mall,bridal shop,diner,sandwich place,deli / bodega,restaurant
3,Bayview Village,japanese restaurant,bank,chinese restaurant,café,yoga studio,distribution center,dim sum restaurant,diner,discount store,dog run
4,"Bedford Park, Lawrence Manor East",sandwich place,italian restaurant,coffee shop,pharmacy,thai restaurant,juice bar,butcher,café,restaurant,indian restaurant
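
The column labels above rely on an `indicators` list plus a fallback to "th", which works for 1 through 10 but would produce labels such as "21th" if `num_top_venues` were raised past 20. A small helper can handle the general case. This is a sketch with a hypothetical function name, `ordinal`, offered as an optional replacement.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # numbers ending in 11, 12 or 13 always take 'th' (11th, 112th, ...)
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


# build the same column labels as above for the top 10 venues
column_names = [f'{ordinal(i)} Most Common Venue' for i in range(1, 11)]
print(column_names[:3], column_names[-1])
```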


### Cluster Neighborhoods with _k_-means

We can use _k_-means to cluster the neighborhoods according to the most common categories of venues in each neighborhood. The dataset we'll use for clustering is the toronto_venues_grouped dataframe.

In [25]:
# imports
from sklearn.cluster import KMeans

# set number of clusters. This number is fairly arbitrary. We'll select 10 given Toronto's size.
kclusters = 10

toronto_clusters = toronto_venues_grouped.drop('neighborhood_name', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(toronto_clusters)

# check cluster labels generated for each row in the dataframe
print(kmeans.labels_[0:10])

[4 1 4 4 4 4 4 4 4 4]
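
Since the choice of 10 clusters is fairly arbitrary, one common way to compare candidate values of k is the silhouette score, where higher values indicate better-separated clusters. The sketch below illustrates the idea on synthetic data (three well-separated blobs, an assumption for illustration) rather than on the Toronto dataframe. The names `points`, `scores` and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for illustration): three well-separated blobs
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# fit k-means for several values of k and record the silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# the k with the highest score is the best-separated clustering
best_k = max(scores, key=scores.get)
print(f'Best k by silhouette score: {best_k}')
```

On real, noisier data the scores are often close together, so this is a guide rather than a rule.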


Let's add the clustering labels to each neighborhood in the toronto_neigbhorhood_venues dataframe. 

In [26]:
# add clustering labels
toronto_neigbhorhood_venues.insert(0, 'Cluster Labels', kmeans.labels_)

In [27]:
# create our final dataframe
toronto_final = toronto

# merge the venue profiles (with cluster labels) into the neighborhood data to add latitude/longitude for each neighborhood
toronto_final = toronto_final.merge(toronto_neigbhorhood_venues, how='inner',
                                    left_on=toronto['neighborhood'], 
                                    right_on=toronto_neigbhorhood_venues['neighborhood_name'])

# cleaning the columns for final output
toronto_final.drop(['postal_code', 'key_0', 'neighborhood_name'], inplace=True, axis=1)

# check out our final dataframe!
print(toronto_final.shape)
toronto_final.head()

(100, 15)


Unnamed: 0,borough,neighborhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,Parkwoods,43.753259,-79.329656,7,park,food & drink shop,yoga studio,deli / bodega,eastern european restaurant,dumpling restaurant,drugstore,donut shop,doner restaurant,dog run
1,North York,Victoria Village,43.725882,-79.315572,1,french restaurant,hockey arena,pizza place,coffee shop,portuguese restaurant,intersection,deli / bodega,department store,dessert shop,dim sum restaurant
2,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,4,coffee shop,bakery,pub,park,breakfast spot,café,theater,spa,mexican restaurant,dessert shop
3,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,4,clothing store,accessories store,boutique,gift shop,furniture / home store,event space,coffee shop,women's store,vietnamese restaurant,airport terminal
4,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,4,coffee shop,yoga studio,diner,restaurant,portuguese restaurant,park,music venue,mexican restaurant,italian restaurant,hobby shop
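
One thing to keep in mind: the venue log earlier prints Don Mills twice, which suggests that some neighborhood names are shared by more than one postal code. Because the venue profiles are grouped and merged by name, such rows end up with the same profile and cluster label. The sketch below shows how to spot this on toy data; the postal codes and cluster labels in it are made up.

```python
import pandas as pd

# Toy data (hypothetical postal codes): the same neighborhood name appears twice
areas = pd.DataFrame({
    'postal_code': ['X1A', 'X1B', 'X2A'],
    'neighborhood': ['Don Mills', 'Don Mills', 'Parkwoods'],
})
profiles = pd.DataFrame({
    'neighborhood_name': ['Don Mills', 'Parkwoods'],
    'Cluster Labels': [4, 7],
})

# list every row whose neighborhood name is shared with another row
repeated = areas[areas['neighborhood'].duplicated(keep=False)]
print(repeated)

# both 'Don Mills' rows pick up the same profile after the merge
merged = areas.merge(profiles, left_on='neighborhood', right_on='neighborhood_name')
print(merged[['postal_code', 'neighborhood', 'Cluster Labels']])
```

Grouping by postal code instead of by name would be one way to keep such areas separate.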


## Display Neighborhood Clusters

Alright, let's visualize our neighborhood clusters!

In [28]:
# create map
toronto_cluster_map = folium.Map(location=[zoom_lat, zoom_lon], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_final['latitude'], toronto_final['longitude'], toronto_final['neighborhood'], toronto_final['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(toronto_cluster_map)
       
toronto_cluster_map

Thanks for following along! 