## Battle of the Neighborhoods


This notebook will be used for the capstone project of the IBM Data Science course on Coursera.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Scrape Wikipedia Page for Dataframe</a>

2.  <a href="#item2">Add Latitude/Longitude Data to Dataframe</a>

3.  <a href="#item3">Explore and Cluster Neighborhoods</a>
    </div>


In [1]:
import pandas as pd
import numpy as np

<a id='item1'></a>

## 1. Scrape Wikipedia Page for Dataframe

Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, and transform that data into a pandas dataframe:

In [2]:
import requests
import lxml.html as lh

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
url

'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [4]:
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

There are 180 postcodes beginning with 'M'. The remaining table rows on the page are not postcode data, so we keep only rows with exactly three cells. The first such row is the header, which supplies the column names, and the last is the header of a different table on the page, which we drop.

In [87]:
data = [[v.text_content().strip() for v in list(T)] for T in tr_elements if len(T) == 3]
df_toronto_nbhd = pd.DataFrame(data[1:-1], columns = data[0])
df_toronto_nbhd.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
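
The row filter above can be tried without network access. The sketch below is a toy stand-in, assuming each table row is represented as a plain list of cell strings rather than an lxml element; the rows themselves are made up for illustration.

```python
# Toy stand-in for the parsed <tr> elements: each row is a list of cell strings.
rows = [
    ["Postal Code ", "Borough", "Neighbourhood\n"],  # header of the postcode table
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods\n"],
    ["Canadian postal codes"],                       # one-cell row, dropped
    ["NL", "NS", "PE", "NB"],                        # four-cell row, dropped
    ["A", "B", "C"],                                 # header of an unrelated table
]

# Keep rows with exactly three cells and strip surrounding whitespace.
three_cell = [[cell.strip() for cell in row] for row in rows if len(row) == 3]

# First kept row is the header; the last kept row belongs to another table.
header, body = three_cell[0], three_cell[1:-1]
print(header)
print(body)
```

The length test is a heuristic: it would also keep any other three-cell row that happened to appear on the page, so the resulting row count is worth checking.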


We only process the cells that have an assigned borough:

In [95]:
df_toronto_nbhd = df_toronto_nbhd[df_toronto_nbhd['Borough'] != 'Not assigned']

Confirm that no postal code has an assigned borough but no assigned neighborhood:

In [98]:
df_toronto_nbhd[df_toronto_nbhd['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


Print dataframe required for part 1:

Note: per the assignment instructions, neighborhoods that share a postcode stay together in a single row rather than being split into separate rows.

In [99]:
df_toronto_nbhd.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [103]:
print(f'The dataframe has {df_toronto_nbhd.shape[0]} rows')

The dataframe has 103 rows


<a id='item2'></a>

## 2. Add Latitude/Longitude Data to Dataframe

Google's API is not returning results for these postal codes, so we use the reference CSV instead:

In [110]:
postcode_latlong = pd.read_csv('http://cocl.us/Geospatial_data')

In [113]:
# merge on the shared postal code column (made explicit rather than inferred)
df_toronto_nbhd = df_toronto_nbhd.merge(postcode_latlong, on='Postal Code')

In [115]:
df_toronto_nbhd.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
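
By default `merge` performs an inner join, so a postcode missing from the reference CSV would silently drop out of the dataframe. One way to check for that is sketched below on hypothetical toy frames (the frame names and the missing `M5A` coordinate row are invented for the example).

```python
import pandas as pd

# Hypothetical toy frames; the real ones come from the scrape and the reference CSV.
nbhd = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
latlong = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left merge keeps every neighborhood row; indicator=True adds a "_merge"
# column recording whether a matching coordinate row was found.
checked = nbhd.merge(latlong, on="Postal Code", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

An empty `missing` list would indicate that every postcode received coordinates.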


<a id='item3'></a>

## 3. Explore and Cluster Neighborhoods

In [118]:
import folium
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

In [119]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Create a map of Toronto neighborhoods:

In [None]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_toronto_nbhd['Latitude'], df_toronto_nbhd['Longitude'], df_toronto_nbhd['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)


In [130]:
map_toronto

In [124]:
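def getNearbyVenues(names, latitudes, longitudes, radius=500):
    # CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT are not defined in this notebook as shown;
    # they must exist as module-level names (Foursquare credentials and request settings).
    venues_list = []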
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
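
To see how the nested response is flattened into rows, the sketch below runs the same field lookups on a hand-made mock dictionary. The mock contains only the keys that `getNearbyVenues` reads, its values are invented, and `flatten_items` is a hypothetical helper name. Whether real responses ever contain an empty category list is an assumption; the guard is there because indexing `[0]` on an empty list would raise `IndexError`.

```python
# Hand-made mock of the part of the response that getNearbyVenues reads.
# Only the keys used by the function are present; all values are made up.
mock_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Cafe",
                               "location": {"lat": 43.75, "lng": -79.33},
                               "categories": [{"name": "Café"}]}},
                    {"venue": {"name": "Example Park",
                               "location": {"lat": 43.76, "lng": -79.32},
                               "categories": []}},  # assumed edge case: no category
                ]
            }
        ]
    }
}


def flatten_items(name, lat, lng, response):
    """Return one tuple per venue, skipping venues without a category."""
    items = response["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # nothing to report as 'Venue Category'
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     categories[0]["name"]))
    return rows


rows = flatten_items("Parkwoods", 43.753259, -79.329656, mock_response)
print(rows)
```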

Make a dataframe of the venues pulled from the Foursquare API:

In [None]:
toronto_venues = getNearbyVenues(names=df_toronto_nbhd['Neighbourhood'],
                                   latitudes=df_toronto_nbhd['Latitude'],
                                   longitudes=df_toronto_nbhd['Longitude']
                                  )

Check the number of venues found:

In [173]:
toronto_venues.shape

(2113, 7)

Check how many venues were found for each neighborhood:

In [262]:
venue_counter = toronto_venues.groupby('Neighborhood').count()

We will manually separate neighborhoods with fewer than eight venues into a class of "venue-sparse" neighborhoods:

In [263]:
sparse_nbhds = venue_counter[venue_counter['Venue'] < 8]
venue_counter = venue_counter[venue_counter['Venue'] >= 8]

In [264]:
sparse_nbhds.shape

(41, 6)

In [265]:
venue_counter.shape

(54, 6)
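
A neighborhood for which no venues were returned has no rows in `toronto_venues`, so it appears in neither of the two groups above. The sketch below, on hypothetical toy data, shows one way to count such neighborhoods explicitly; the names `all_nbhds` and `min_venues` are introduced only for this example.

```python
import pandas as pd

# Hypothetical toy data: three neighborhoods, one of which returned no venues.
all_nbhds = ["A", "B", "C"]
venues = pd.DataFrame({
    "Neighborhood": ["A"] * 9 + ["B"] * 2,
    "Venue": [f"venue {i}" for i in range(11)],
})

min_venues = 8  # same cutoff as the cells above

# size() counts rows per neighborhood; reindex adds neighborhoods with no rows at all.
counts = venues.groupby("Neighborhood").size().reindex(all_nbhds, fill_value=0)

sparse = counts[counts < min_venues].index.tolist()
dense = counts[counts >= min_venues].index.tolist()
print(sparse, dense)
```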

Create a one-hot encoding of venue types, and extend the neighborhoods dataframe:

In [266]:
onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix = '', prefix_sep = '')
onehot.insert(0,"Neighbourhood",toronto_venues['Neighborhood'])
onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Create a new dataframe of the neighborhoods with at least eight venues:

In [267]:
nbhds_venues = venue_counter.merge(onehot, left_on='Neighborhood', right_on='Neighbourhood')
nbhds_venues = nbhds_venues[onehot.columns]

Group rows by neighborhood and normalize for frequency of each category:

In [268]:
toronto_grouped = nbhds_venues.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.0
3,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462
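
The mean of one-hot rows within a group is the fraction of that neighborhood's venues in each category, which is why each row of the grouped table sums to 1. A minimal sketch on made-up data follows; `dtype=int` is passed so the indicator columns are 0/1 as in the output above, since newer pandas versions default to booleans.

```python
import pandas as pd

# Made-up venues: neighborhood "A" has four venues, "B" has two.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Gym", "Park", "Park"],
})

# One indicator column per category, then put the neighborhood name back in front.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Averaging the indicators per neighborhood yields category frequencies.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```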


In [269]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

We'll pick the top eight venue types as features for each neighborhood, since many neighborhoods have few venue types.

A six-venue cutoff was also tried, but none of the cluster counts tested gave informative results at that cutoff.

In [275]:
num_top_venues = 8

indicators = ['st', 'nd', 'rd']

columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions past the third use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,"Alderwood, Long Branch",Pizza Place,Gym,Coffee Shop,Skating Rink,Pharmacy,Athletics & Sports,Dance Studio,Pub
1,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Ice Cream Shop,Fried Chicken Joint,Sandwich Place,Bridal Shop,Diner,Restaurant
2,"Bedford Park, Lawrence Manor East",Sandwich Place,Coffee Shop,Italian Restaurant,Pharmacy,Liquor Store,Locksmith,Juice Bar,Restaurant
3,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Seafood Restaurant,Cheese Shop,Farmers Market,Beer Bar,Restaurant
4,"Brockton, Parkdale Village, Exhibition Place",Café,Bakery,Breakfast Spot,Coffee Shop,Yoga Studio,Gym,Pet Store,Performing Arts Venue
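
The column labels above rely on a lookup list plus an exception fallback, which is enough for eight columns. As an alternative sketch, a small helper can compute the suffix directly and also handles values such as 11 or 21; `ordinal` is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 8
columns = ["Neighbourhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```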


Run _k_-means to cluster the neighborhoods with at least eight venues into six clusters. The venue-sparse neighborhoods will form a seventh cluster.

In [276]:
kclusters = 6

# drop the name column so only the category frequencies are clustered
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

kmeans.labels_[0:10] 

array([0, 0, 3, 3, 3, 0, 3, 3, 3, 3])
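
Comparing several cluster counts can be made less subjective with a score. The sketch below uses the silhouette score from scikit-learn on synthetic, well-separated points. It is not the comparison procedure used in this notebook, and the data is a stand-in for `toronto_grouped_clustering`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three tight, well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Closer to 1 means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real venue-frequency data the scores may be close to one another, in which case the score alone would not settle the choice of `kclusters`.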

Combine the cluster labels with the top venue types:

In [277]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_toronto_nbhd

toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,,,,,,,,,
1,M4A,North York,Victoria Village,43.725882,-79.315572,,,,,,,,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2.0,Coffee Shop,Pub,Bakery,Park,Theater,Breakfast Spot,Café,Distribution Center
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,4.0,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Event Space,Miscellaneous Shop,Coffee Shop,Women's Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2.0,Coffee Shop,Sushi Restaurant,Beer Bar,Sandwich Place,Bank,Bar,Portuguese Restaurant,Café


Now add a separate cluster label to venue-sparse neighborhoods, which were not clustered by _k_-means:

In [278]:
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(kclusters).astype(int)
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,6,,,,,,,,
1,M4A,North York,Victoria Village,43.725882,-79.315572,6,,,,,,,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2,Coffee Shop,Pub,Bakery,Park,Theater,Breakfast Spot,Café,Distribution Center
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,4,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Event Space,Miscellaneous Shop,Coffee Shop,Women's Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2,Coffee Shop,Sushi Restaurant,Beer Bar,Sandwich Place,Bank,Bar,Portuguese Restaurant,Café
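
The labels appear as `2.0` and `4.0` before this step because the join leaves `NaN` for neighborhoods that were not clustered, which forces the column to floating point. A toy sketch of the same join-and-fill pattern, with invented neighborhood names:

```python
import pandas as pd

kclusters = 6  # matches the notebook's setting

nbhds = pd.DataFrame({"Neighbourhood": ["A", "B", "C"]})
clustered = pd.DataFrame({"Neighbourhood": ["A", "C"], "Cluster Labels": [2, 4]})

# "B" has no match in `clustered`, so its label is NaN and the column becomes float.
merged = nbhds.join(clustered.set_index("Neighbourhood"), on="Neighbourhood")
was_float = merged["Cluster Labels"].dtype.kind == "f"

# Give unmatched rows the extra label and restore an integer column.
merged["Cluster Labels"] = merged["Cluster Labels"].fillna(kclusters).astype(int)
print(merged)
```

Any neighborhood absent from the clustered table receives the extra label, whatever the reason for its absence.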


Visualize the clusters:

In [279]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme: one color per k-means cluster plus one for the venue-sparse cluster
n_colors = kclusters + 1
colors_array = cm.rainbow(np.linspace(0, 1, n_colors))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # index by the label itself so that clusters 0 and kclusters get different colors
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Regardless of the number of clusters, there are three principal clusters: one of downtown and downtown-like neighborhoods, one at the boundaries of the downtown-like areas, and one with few venues nearby. The Lawrence Manor, Lawrence Heights neighborhood was consistently assigned to its own cluster.