# Final Capstone Project for IBM Data Science Professional Certificate 

## By Nigel Burrows

### Introduction/Business Problem

The goal of this project is to identify which neighborhoods in Toronto are best suited to people who are into fitness, so that companies offering health products can send mail advertising to the communities most likely to buy them. It should also help gyms and fitness trainers narrow down which neighborhoods to market to.

### Data 

The primary data source is Foursquare venue data, which is used to determine which fitness- and health-oriented venues are located near each Toronto neighborhood. The Toronto postal code data set was scraped from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. That data is cleaned by removing all records where the Borough is "Not assigned" and by combining Neighbourhoods that share the same postal code.
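The cleaning rules described above can be sketched on a few made-up rows. This is a minimal illustration, not the notebook's actual code: the sample rows are hypothetical and only mimic the shape of the Wikipedia table.

```python
import pandas as pd

# Hypothetical rows shaped like the Wikipedia table (not the real data).
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park", "Harbourfront"],
})

# Drop rows whose borough is unassigned.
assigned = raw[raw["Borough"] != "Not assigned"]

# Combine neighbourhoods that share a postal code into one comma-separated string.
cleaned = (
    assigned.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(cleaned)
```

The current Wikipedia table already lists combined neighbourhoods per postal code, so the `groupby` step may be a no-op on the live data.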

#### Import all necessary libraries

In [1]:
import numpy as np
import pandas as pd
import json # library to handle JSON files

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium 
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


#### Canadian Postal Info

In [2]:
CanadaWikiDfList = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
can_postal = CanadaWikiDfList[0]
can_postal = can_postal[~can_postal.Borough.str.contains("Not assigned")]
can_postal.reset_index(inplace=True, drop=True)
can_postal.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Add location data to the postal codes 

In [3]:
can_postal_latlng = pd.read_csv("http://cocl.us/Geospatial_data")
can_postal_latlng.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [4]:
can_postal = can_postal.set_index('Postal Code').join(can_postal_latlng.set_index('Postal Code'))
can_postal.reset_index(inplace=True, drop=False)
can_postal.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
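One way to confirm that every postal code received coordinates is a left merge with an indicator column. The sketch below uses hypothetical sample rows; the code `M9Z` and its borough are invented to show what an unmatched row looks like.

```python
import pandas as pd

# Hypothetical sample rows; "M9Z" is made up to show an unmatched postal code.
postal = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every postal code; indicator=True adds a "_merge" column
# saying whether each row found a match in the coordinates table.
merged = postal.merge(coords, on="Postal Code", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

An empty `missing` list would mean the join above left no postal code without a latitude and longitude.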


In [5]:
# address used to look up the coordinates of Toronto
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


#### Foursquare Data

Coupling this with the Foursquare data retrieved through the Foursquare API, we can get venue data for each postal code and see which venues are in the same vicinity.

Set the credentials needed to access the Foursquare API

In [6]:
CLIENT_ID = 'YC0HG4O3VE1HIGIYNBC3C0KN4DGXXUICPDTGKDGBCNBYVZCZ' # your Foursquare ID
CLIENT_SECRET = 'GBJVHQFH52TYRLGQKRGVYLING1S4HA5JQ5YSDYF4SGULWLO5' # your Foursquare Secret
VERSION = '20190425' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YC0HG4O3VE1HIGIYNBC3C0KN4DGXXUICPDTGKDGBCNBYVZCZ
CLIENT_SECRET:GBJVHQFH52TYRLGQKRGVYLING1S4HA5JQ5YSDYF4SGULWLO5
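As an alternative to hardcoding and printing the credentials, they could be read from environment variables. This is only a sketch: the function name and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical, not something Foursquare defines.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names are hypothetical; use whatever names your setup defines.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Demo with a plain dict standing in for os.environ (placeholder values).
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print(client_id)
```

In the notebook, calling `load_foursquare_credentials()` with no argument would read from the real environment, which keeps the secret out of the saved output.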


List all venue categories so that we can keep only the ones associated with health and fitness

In [7]:
categories_url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
)
categories_url


'https://api.foursquare.com/v2/venues/categories?client_id=YC0HG4O3VE1HIGIYNBC3C0KN4DGXXUICPDTGKDGBCNBYVZCZ&client_secret=GBJVHQFH52TYRLGQKRGVYLING1S4HA5JQ5YSDYF4SGULWLO5&v=20190425'
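The same URL can also be assembled with the standard library's `urlencode`, which escapes any special characters in the parameter values. The helper name `build_categories_url` and the placeholder credentials are assumptions made for this sketch.

```python
from urllib.parse import urlencode

def build_categories_url(client_id, client_secret, version):
    """Build the categories endpoint URL with properly encoded query parameters."""
    base = "https://api.foursquare.com/v2/venues/categories"
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
    })
    return f"{base}?{query}"

# Placeholder credentials, for illustration only.
url = build_categories_url("DEMO_ID", "DEMO_SECRET", "20190425")
print(url)
```

`requests.get` also accepts a `params` dictionary and performs the same encoding, which avoids building the string by hand.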

Send a GET request to retrieve the category JSON

In [8]:
results = requests.get(categories_url).json()
results

{'meta': {'code': 200, 'requestId': '5fd2e3ca0c79573f8692c802'},
 'response': {'categories': [{'id': '4d4b7104d754a06370d81259',
    'name': 'Arts & Entertainment',
    'pluralName': 'Arts & Entertainment',
    'shortName': 'Arts & Entertainment',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
     'suffix': '.png'},
    'categories': [{'id': '56aa371be4b08b9a8d5734db',
      'name': 'Amphitheater',
      'pluralName': 'Amphitheaters',
      'shortName': 'Amphitheater',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
       'suffix': '.png'},
      'categories': []},
     {'id': '4fceea171983d5d06c3e9823',
      'name': 'Aquarium',
      'pluralName': 'Aquariums',
      'shortName': 'Aquarium',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/aquarium_',
       'suffix': '.png'},
      'categories': []},
     {'id': '4bf58dd8d48988d1e1931735',
      'name': 'A

In [9]:
venue_categories = results['response']['categories']

print('Category Hierarchy:')
for level_1 in venue_categories:
    print(level_1['name'])
    for level_2 in level_1['categories']:
        print(" |", level_2['name'])
#         for level_3 in level_2['categories']:
#             print("   |", level_3["name"])
#             for level_4 in level_3['categories']:
#                 print("     |", level_4["name"])

Category Hierarchy:
Arts & Entertainment
 | Amphitheater
 | Aquarium
 | Arcade
 | Art Gallery
 | Bowling Alley
 | Casino
 | Circus
 | Comedy Club
 | Concert Hall
 | Country Dance Club
 | Disc Golf
 | Escape Room
 | Exhibit
 | General Entertainment
 | Go Kart Track
 | Historic Site
 | Karaoke Box
 | Laser Tag
 | Memorial Site
 | Mini Golf
 | Movie Theater
 | Museum
 | Music Venue
 | Pachinko Parlor
 | Performing Arts Venue
 | Pool Hall
 | Public Art
 | Racecourse
 | Racetrack
 | Roller Rink
 | Salsa Club
 | Samba School
 | Stadium
 | Theme Park
 | Tour Provider
 | VR Cafe
 | Water Park
 | Zoo
College & University
 | College Academic Building
 | College Administrative Building
 | College Auditorium
 | College Bookstore
 | College Cafeteria
 | College Classroom
 | College Gym
 | College Lab
 | College Library
 | College Quad
 | College Rec Center
 | College Residence Hall
 | College Stadium
 | College Theater
 | Community College
 | Fraternity House
 | General College & University
 | Law 

Let's store a list of all categories under Outdoors & Recreation

In [10]:
fitness_categories =[]

for level_1 in venue_categories:
    if level_1['name'] == 'Outdoors & Recreation':
        for level_2 in level_1['categories']:
            fitness_categories.append(level_2['name'])

# Just show the first 5 records to restrict output
fitness_categories[:5]

['Athletics & Sports', 'Bathing Area', 'Bay', 'Beach', 'Bike Trail']

From this dive into the data we can see we only want venues that have a level 1 category of 'Outdoors & Recreation'
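The loop above collects only the second level of the hierarchy, so a venue tagged with a deeper subcategory would not match the list. If deeper levels are wanted, a recursive walk is one option. The sketch below runs on a hypothetical, heavily trimmed stand-in for `results['response']['categories']`; the nesting shown is illustrative rather than copied from the API response.

```python
def collect_category_names(category):
    """Return the names of all categories nested under `category`, at any depth."""
    names = []
    for child in category.get("categories", []):
        names.append(child["name"])
        names.extend(collect_category_names(child))
    return names

# Hypothetical, trimmed stand-in for results['response']['categories'].
sample_categories = [
    {"name": "Outdoors & Recreation", "categories": [
        {"name": "Athletics & Sports", "categories": [
            {"name": "Gym / Fitness Center", "categories": [
                {"name": "Yoga Studio", "categories": []},
            ]},
        ]},
        {"name": "Park", "categories": []},
    ]},
    {"name": "Food", "categories": [{"name": "Bakery", "categories": []}]},
]

outdoors = next(c for c in sample_categories if c["name"] == "Outdoors & Recreation")
all_fitness = collect_category_names(outdoors)
print(all_fitness)
```

Storing the result in a `set` instead of a list would also make the later `in` checks faster, though with a list this small the difference is negligible.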

Create function to get categories from venues returned

In [11]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Create a function to get venue data for all neighborhoods

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=600):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
         
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        for v in results:
            # Skip venues with no category, then keep only the fitness categories
            categories = v['venue']['categories']
            if categories and categories[0]['name'] in fitness_categories:
                venues_list.append([(
                    name, 
                    lat, 
                    lng, 
                    v['venue']['name'], 
                    v['venue']['location']['lat'], 
                    v['venue']['location']['lng'],  
                    v['venue']['categories'][0]['name'])])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Postal Code Latitude', 
                  'Postal Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
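The filtering logic inside `getNearbyVenues` can be tried without calling the API by running it on a response-shaped list. This is a sketch under stated assumptions: `extract_fitness_venues` is a hypothetical helper, and `sample_items` is hand-written to mimic the `items` structure the function reads ("Example Cafe" and "Uncategorized Spot" are invented).

```python
def extract_fitness_venues(postal_code, items, allowed_categories):
    """Return one dict per venue whose first category is in allowed_categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        if not categories:
            continue  # some venues may come back without any category
        category = categories[0]["name"]
        if category in allowed_categories:
            rows.append({
                "Postal Code": postal_code,
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                "Venue Category": category,
            })
    return rows

# Hand-written sample mimicking the shape of the 'items' list.
sample_items = [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]

rows = extract_fitness_venues("M3A", sample_items, {"Park", "Trail"})
print(rows)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame`, which names the columns from the dict keys.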

In [13]:
can_postal.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [14]:
toronto_venues = getNearbyVenues(names=can_postal['Postal Code'],
                                   latitudes=can_postal['Latitude'],
                                   longitudes=can_postal['Longitude']
                                  )

In [15]:
print(toronto_venues.shape)
toronto_venues.head()

(135, 7)


Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M5A,43.65426,-79.360636,Corktown Common,43.655618,-79.356211,Park
2,M5A,43.65426,-79.360636,Underpass Park,43.655764,-79.354806,Park
3,M5A,43.65426,-79.360636,Parliament Square Park,43.650264,-79.362195,Park
4,M7A,43.662301,-79.389494,Queen's Park,43.663946,-79.39218,Park


Let's check how many venues were returned for each postal code


In [16]:
toronto_venues.groupby('Postal Code').count()

Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1E,1,1,1,1,1,1
M1H,1,1,1,1,1,1
M1J,2,2,2,2,2,2
M1L,1,1,1,1,1,1
M1M,1,1,1,1,1,1
...,...,...,...,...,...,...
M7A,2,2,2,2,2,2
M7R,1,1,1,1,1,1
M7Y,4,4,4,4,4,4
M8X,1,1,1,1,1,1


#### Let's find out how many unique categories can be curated from all the returned venues


In [17]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 21 unique categories.


In [18]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add Postal Code column back to dataframe
toronto_onehot['Postal Code'] = toronto_venues['Postal Code'] 

# move Postal Code column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Postal Code,Athletics & Sports,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina,Indoor Play Area,...,Park,Playground,Plaza,Pool,River,Rock Climbing Spot,Roof Deck,Scenic Lookout,Sculpture Garden,Trail
0,M3A,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,M5A,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,M5A,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,M5A,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,M7A,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by postal code and take the mean of the frequency of occurrence of each category


In [19]:
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
toronto_grouped

Unnamed: 0,Postal Code,Athletics & Sports,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina,Indoor Play Area,...,Park,Playground,Plaza,Pool,River,Rock Climbing Spot,Roof Deck,Scenic Lookout,Sculpture Garden,Trail
0,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,...,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1H,1.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,...,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1J,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,...,0.00,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1L,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,...,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1M,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,...,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,M7A,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,...,1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
57,M7R,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,...,0.00,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
58,M7Y,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
59,M8X,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,...,0.00,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
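To see how one-hot encoding followed by a grouped mean produces these frequencies, here is a small worked example. The venue rows are hypothetical, chosen so that the result matches the M7Y (0.75 Park, 0.25 Garden) and M8X (1.0 River) rows shown above.

```python
import pandas as pd

# Hypothetical venue rows: four venues for M7Y and one for M8X.
venues = pd.DataFrame({
    "Postal Code": ["M7Y", "M7Y", "M7Y", "M7Y", "M8X"],
    "Venue Category": ["Park", "Park", "Park", "Garden", "River"],
})

# One column per category, with 1 where the venue has that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Postal Code", venues["Postal Code"])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby("Postal Code").mean().reset_index()
print(grouped)
```

Because each row sums to 1, a postal code with a single venue gets the same vector as one with many venues of the same type, which is worth keeping in mind when reading the clusters.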


#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [20]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each Neighbourhood.


In [21]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
Neighbourhoods_venues_sorted['Postal Code'] = toronto_grouped['Postal Code']

for ind in np.arange(toronto_grouped.shape[0]):
    Neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

Neighbourhoods_venues_sorted.head()

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1E,Park,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
1,M1H,Athletics & Sports,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina,Indoor Play Area
2,M1J,Playground,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
3,M1L,Park,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
4,M1M,Park,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
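In the table above, a postal code such as M1E has only Park with a non-zero frequency, so ranks 2 through 10 are filled with categories whose frequency is zero. A possible variation, not the notebook's method, is to keep only categories that actually occur. The function name `top_nonzero_venues` and the sample row are assumptions for this sketch.

```python
import pandas as pd

def top_nonzero_venues(row, num_top_venues):
    """Return up to num_top_venues category names with frequency > 0, most frequent first."""
    freqs = row.iloc[1:].astype(float)   # skip the 'Postal Code' entry
    freqs = freqs[freqs > 0].sort_values(ascending=False)
    return freqs.index[:num_top_venues].tolist()

# Hypothetical row with the same layout as a row of toronto_grouped.
row = pd.Series({"Postal Code": "M7Y", "Garden": 0.25, "Park": 0.75,
                 "Trail": 0.0, "Beach": 0.0})
top = top_nonzero_venues(row, 10)
print(top)
```

The returned list can be shorter than `num_top_venues`, so it would need padding (for example with `None`) before being written into fixed-width columns.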


### Methodology

The postal code data covers all Toronto postal codes (those starting with M), cleaned so that unassigned postal codes are excluded. That data set was scraped from Wikipedia by reading the __HTML table__ with the __Pandas__ library.

That data was merged with location data made publicly available by the Coursera Labs team, and the result was used to pull venue data from __Foursquare__. The prepared location file was used to work around a bug with getting location data. The venues retrieved from Foursquare are within a __600 metre radius__ of each postal code's geographical point and are filtered to keep only venues whose category is a subcategory of __"Outdoors & Recreation"__.

This data is clustered using the __k-means clustering__ algorithm so that we can see the segments of postal codes/neighborhoods that are into fitness, grouped by the type of fitness facilities that surround them.

### Analysis

Run _k_-means to cluster the Neighbourhoods into 10 clusters.


In [22]:
# set number of clusters
kclusters = 10

toronto_grouped_clustering = toronto_grouped.drop(columns='Postal Code')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 2, 3, 1, 1, 1, 9, 8, 1, 1])
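The clustering step can be illustrated on a tiny set of frequency vectors. The data below is hypothetical (columns stand for Park, Plaza and Trail), and `n_init=10` is set explicitly as an assumption so that the behaviour does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency vectors; columns are [Park, Plaza, Trail].
X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
])

# Fit k-means with a fixed seed so repeated runs give the same labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(labels)
```

The label numbers themselves are arbitrary: rows with similar vectors share a label, but which integer a group receives can change with `random_state` or the data, so any names attached to cluster numbers should be rechecked after a rerun.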

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each Neighbourhood.


In [25]:
# add clustering labels
Neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_.astype(int))

toronto_merged = can_postal

# merge toronto_grouped with can_postal to add latitude/longitude for each Neighbourhood
# Had to use inner join because there are some locations that have no venue data
toronto_merged = toronto_merged.join(Neighbourhoods_venues_sorted.set_index('Postal Code'), on='Postal Code', how='inner')

toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1,Park,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Park,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1,Park,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,2,Athletics & Sports,Rock Climbing Spot,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Plaza,Garden,Park,Trail,Indoor Play Area,Beach,Dog Run,Farm,Field,Fountain


Check the most common venue types in each cluster

In [26]:
for x in range(0,10):
    print("Cluster ", x)
    top_venue_types = toronto_merged.loc[toronto_merged["Cluster Labels"] == x].groupby(["1st Most Common Venue"]).count().sort_values('Postal Code', ascending=False)['Postal Code']
    print(top_venue_types)
    print("==============================")

Cluster  0
1st Most Common Venue
Fountain           4
Plaza              3
Beach              1
Harbor / Marina    1
Park               1
Name: Postal Code, dtype: int64
Cluster  1
1st Most Common Venue
Park    26
Name: Postal Code, dtype: int64
Cluster  2
1st Most Common Venue
Athletics & Sports    3
Name: Postal Code, dtype: int64
Cluster  3
1st Most Common Venue
Playground    2
Garden        1
Name: Postal Code, dtype: int64
Cluster  4
1st Most Common Venue
Trail    3
Name: Postal Code, dtype: int64
Cluster  5
1st Most Common Venue
Dog Run    3
Name: Postal Code, dtype: int64
Cluster  6
1st Most Common Venue
River    1
Name: Postal Code, dtype: int64
Cluster  7
1st Most Common Venue
Park                  6
Athletics & Sports    1
Name: Postal Code, dtype: int64
Cluster  8
1st Most Common Venue
Plaza    2
Name: Postal Code, dtype: int64
Cluster  9
1st Most Common Venue
Athletics & Sports      1
Other Great Outdoors    1
Trail                   1
Name: Postal Code, dtype: int64


We can make better labels for the clusters so that they are more useful for marketing

In [27]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, 'Cluster Labels'] = "Tourists"
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, 'Cluster Labels'] = "Park Goers"
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, 'Cluster Labels'] = "Professional Athletes"
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, 'Cluster Labels'] = "Family Park Goers"
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, 'Cluster Labels'] = "Hikers"
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, 'Cluster Labels'] = "Dog Runners"
toronto_merged.loc[toronto_merged['Cluster Labels'] == 6, 'Cluster Labels'] = "Kayakers"
toronto_merged.loc[toronto_merged['Cluster Labels'] == 7, 'Cluster Labels'] = "Athlete Park Goers"
toronto_merged.loc[toronto_merged['Cluster Labels'] == 8, 'Cluster Labels'] = "Shoppers"
toronto_merged.loc[toronto_merged['Cluster Labels'] == 9, 'Cluster Labels'] = "Outdoor sportsmen"
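The same relabelling can be written more compactly with a dictionary and `Series.map`. This is a sketch on a three-row sample using cluster numbers from the table above; writing to a separate `Cluster Name` column is an assumption of the sketch, made so the numeric labels stay available.

```python
import pandas as pd

# Subset of the cluster-number-to-name mapping used above.
cluster_names = {0: "Tourists", 1: "Park Goers", 2: "Professional Athletes"}

df = pd.DataFrame({
    "Postal Code": ["M3A", "M4B", "M5B"],
    "Cluster Labels": [1, 2, 0],
})

# map() looks up each numeric label in the dictionary.
df["Cluster Name"] = df["Cluster Labels"].map(cluster_names)
print(df)
```

Any label missing from the dictionary becomes `NaN`, which makes an incomplete mapping easy to spot.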

Finally, let's visualize the resulting clusters


In [28]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

map_colors = {
    "Tourists": 0,
    "Park Goers": 1,
    "Professional Athletes": 2,
    "Family Park Goers": 3,
    "Hikers": 4,
    "Dog Runners": 5,
    "Kayakers": 6,
    "Athlete Park Goers": 7,
    "Shoppers": 8,
    "Outdoor sportsmen": 9,
}

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Postal Code'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[map_colors[cluster]-1],
        fill=True,
        fill_color=rainbow[map_colors[cluster]-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [29]:
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,Park Goers,Park,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Park Goers,Park,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Park Goers,Park,Trail,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,Professional Athletes,Athletics & Sports,Rock Climbing Spot,Lake,Beach,Dog Run,Farm,Field,Fountain,Garden,Harbor / Marina
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,Tourists,Plaza,Garden,Park,Trail,Indoor Play Area,Beach,Dog Run,Farm,Field,Fountain


### Results and Discussion

Downtown, many of the outdoor venues appear to belong to the shopping district, so most people who live in these areas can be left out of the campaign. The __largest cluster__ is the __purple__ one, which consists mostly of parks. Parks are what most people have access to, so it would be good not only to market to this group but also to advertise in the parks themselves, as that is where the largest group of people likely to be involved in athletics and fitness can be found.

Most other groups are pretty scattered and sparse but this should allow marketers to better target those neighborhoods and send the right advertising mail to the right potential customers.

Many postal codes were excluded because no outdoor venues were returned for them, which suggests that many neighborhoods lack nearby access to fitness facilities. This lets marketers focus on the postal codes and neighborhoods that are likely to be interested in outdoor fitness at all. It is best to target products towards people who go to parks or take their children or dogs to parks for recreation and exercise. It also seems that most neighborhoods don't have much access to outdoor water sports, so it's best to stick to marketing land activities instead.

### Conclusion

This project set out to label communities that have access to fitness facilities, as they would be more likely to buy fitness and health products. The 10-cluster model did a good job of segmenting that market, and the cleanup identified which communities have access to outdoor exercise facilities at all. This data should go a long way for marketers of fitness and health products, helping them direct funds to the right markets.