# JAT Capstone Project (Week 3)
This notebook is the main location for the Capstone Project.

# Start of the Week 3 Assignment
**All three parts of the assignment are in here and the start of each is clearly identified**

## This is the start of the First Section of the Assignment

In [1]:
import pandas as pd
import numpy as np

**Import the scraping library suggested**

In [2]:
import requests 
from bs4 import BeautifulSoup 

**Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a _pandas dataframe_**

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" # From class instructions

web_page = requests.get(url).text #get page
web_page_result = BeautifulSoup(web_page, 'html.parser') #transform the text to html

neightborhood_table = web_page_result.find('table', class_ = 'wikitable')
neightborhood_data = neightborhood_table.find_all('tr')

# get the Postcode, Borough and Neighbourhood from the neightborhood_data
neightborhood_info = []
for each in neightborhood_data:
    res = each.text.split('\n')[1:-1] # remove some empty strings
    neightborhood_info.append(res)
    
neightborhood_info[0:5]

[['Postcode', 'Borough', 'Neighbourhood'],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village']]

**The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood**

In [4]:
neightborhood_info[0][-3] = 'PostalCode' # change the 1st column title to match instructions
neightborhood_info[0][-1] = 'Neighborhood' # change the 3rd column title to match instructions
neighborhood_df = pd.DataFrame(neightborhood_info[1:], columns=neightborhood_info[0])

neighborhood_df.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


**Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.  And if a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.**

In [5]:
NA_boroughs = neighborhood_df.index[neighborhood_df['Borough'] == 'Not assigned']
NA_neighborhoods = neighborhood_df.index[neighborhood_df['Neighborhood'] == 'Not assigned']

# Use .loc instead of chained indexing, which can silently write to a copy
for index in NA_neighborhoods:
    neighborhood_df.loc[index, 'Neighborhood'] = neighborhood_df.loc[index, 'Borough']

neighborhood_df.drop(NA_boroughs, inplace=True) # Drop the NA Boroughs (NA_boroughs already holds index labels)
neighborhood_df.reset_index(drop=True, inplace=True)

neighborhood_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
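The same cleaning rules can also be written without a Python loop. The following is a minimal sketch on a hand-made sample (`sample_df` and its three rows are illustrative, not the scraped table): it drops rows whose borough is `Not assigned`, then fills unassigned neighborhoods from the borough column.

```python
import pandas as pd

# Illustrative sample only; the notebook builds the real frame from the Wikipedia table.
sample_df = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M7A"],
        "Borough": ["Not assigned", "North York", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
    }
)

# Keep only rows that have an assigned borough.
cleaned_df = sample_df[sample_df["Borough"] != "Not assigned"].copy()

# Where the neighborhood is unassigned, fall back to the borough name.
unassigned = cleaned_df["Neighborhood"] == "Not assigned"
cleaned_df.loc[unassigned, "Neighborhood"] = cleaned_df.loc[unassigned, "Borough"]

cleaned_df = cleaned_df.reset_index(drop=True)
print(cleaned_df)
```

Filtering first and taking a `.copy()` keeps the later `.loc` assignment from operating on a view of the original frame.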


**More than one neighborhood can exist in one postal code area.  These two rows will be combined into one row with the neighborhoods separated with a comma**

In [6]:
gbPC = neighborhood_df.groupby('PostalCode')

CD_neighborhoods = gbPC['Neighborhood'].apply(lambda x: ', '.join(x))
GRP_boroughs = gbPC['Borough'].first() # every row of a postal code shares one borough, so take the first
CDF_df = pd.DataFrame(list(zip(GRP_boroughs.index, GRP_boroughs, CD_neighborhoods)))
CDF_df.columns = ['PostalCode', 'Borough', 'Neighborhood']

CDF_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
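For comparison, the combining step can probably be done in a single `groupby(...).agg(...)` call. This is a sketch on a small sample (the `rows` frame is illustrative and reuses three rows shown earlier), not a replacement for the cell above.

```python
import pandas as pd

# Illustrative rows: M5A appears twice, once per neighborhood.
rows = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
    }
)

# One row per postal code: keep the first borough, join the neighborhoods with commas.
combined = rows.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)
print(combined)
```

Passing `as_index=False` keeps `PostalCode` as a regular column, so no separate `zip` and column-renaming step is needed.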


**In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.**

In [7]:
print('The number of rows in our dataframe is', CDF_df.shape[0])

The number of rows in our dataframe is 103


## This is the start of the Second Section of the Week 3 Assignment

**This section attaches the Lat/Long to the Neighborhoods. I tried the code provided for getting the Lat/Long, but the call never returned, so I used the csv version instead.**

In [8]:
pip install wget

Note: you may need to restart the kernel to use updated packages.


In [9]:
import wget
url2 = 'http://cocl.us/Geospatial_data'
wget.download(url2,'Geospatial_Coordinates.csv')

print('Data downloaded!')

coordinates_df = pd.read_csv('Geospatial_Coordinates.csv') 
print('The coordinates dataframe shape is', coordinates_df.shape)
coordinates_df.head()

Data downloaded!
The coordinates dataframe shape is (103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [10]:
PostCodes_wLL_df = CDF_df.join(coordinates_df.set_index('Postal Code'), on='PostalCode')

PostCodes_wLL_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
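One way to confirm that every postal code actually received coordinates is to use `merge` with `indicator=True`. The sketch below uses two tiny hand-made frames; the `M9Z` row is a made-up code added only to show what an unmatched key looks like.

```python
import pandas as pd

# Illustrative inputs; "M9Z" is a hypothetical code with no coordinates.
codes = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# Left merge keeps every postal code; validate raises if a key is duplicated on either side.
merged = codes.merge(
    coords,
    how="left",
    left_on="PostalCode",
    right_on="Postal Code",
    validate="one_to_one",
    indicator=True,
)

# Rows marked "left_only" found no match in the coordinates table.
missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

An empty `missing` list on the real data would indicate that all 103 postal codes were matched.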


## This is the start of the Third Section of the Week 3 Assignment

Import the additional dependencies I will need for this section

In [11]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0 also exposes this as pd.json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [12]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto

The geographical coordinates of Toronto are 43.653963, -79.387207.


NOTE: I lost a lot of time at this point before finding out that Jupyter needs to be running in Chrome. In other browsers the map went blank once markers were added, and the course material did not mention this.

In [13]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map

for lat,lng,borough,neighborhood in zip(PostCodes_wLL_df['Latitude'], PostCodes_wLL_df['Longitude'], PostCodes_wLL_df['Borough'], PostCodes_wLL_df['PostalCode']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)

    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False
        ).add_to(map_toronto)  


map_toronto

Reduce the dataframe to the Boroughs that contain the word Toronto, which the instructions allow, and store the result in the new dataframe T_df

In [14]:
T_df = PostCodes_wLL_df[PostCodes_wLL_df.Borough.str.contains("Toronto")]

T_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [15]:
print('The number of rows in the Toronto dataframe is', T_df.shape[0])

The number of rows in the Toronto dataframe is 38
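The filter above relies on `Series.str.contains`. As a small illustration (the `boroughs` series is made up, including the deliberate `None`), passing `na=False` makes missing values count as non-matches instead of producing a mask with missing entries.

```python
import pandas as pd

# Illustrative borough names; None stands in for a missing value.
boroughs = pd.Series(["East Toronto", "North York", "Downtown Toronto", None])

# na=False treats missing values as "does not contain".
is_toronto = boroughs.str.contains("Toronto", na=False)
print(is_toronto.tolist())
```

The notebook's data has no missing boroughs at this point, so this is a precaution rather than a fix.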


In [139]:
#Reset the map
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)
# add markers to map

for lat,lng,borough,neighborhood in zip(T_df['Latitude'], T_df['Longitude'], T_df['Borough'], T_df['PostalCode']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)

    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False
        ).add_to(map_toronto)  

map_toronto

In [17]:
CLIENT_ID = 'PC5PSKF4U55ZE4AXQDJH53V4PYE5PPHIUEEDGWECNJ3J2YQH' # your Foursquare ID
CLIENT_SECRET = 'U5FX4H5XEHPSEN02VM1CDUE4GC133EZBUF0VCJIOWLUDRW2G' # your Foursquare Secret
VERSION = '20190101'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)


Your credentials:
CLIENT_ID: PC5PSKF4U55ZE4AXQDJH53V4PYE5PPHIUEEDGWECNJ3J2YQH
CLIENT_SECRET:U5FX4H5XEHPSEN02VM1CDUE4GC133EZBUF0VCJIOWLUDRW2G


Let's do some trials before we run routines.  Below is the first Postal Code.

In [18]:
T_df.iloc[0, 0]

'M4E'

In [300]:
neighborhood_latitude = T_df.iloc[0, 3] # PostalCode latitude value
neighborhood_longitude = T_df.iloc[0, 4] # PostalCode longitude value

neighborhood_name = T_df.iloc[0, 0] # PostalCode name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))


Latitude and longitude values of M4E are 43.67635739999999, -79.2930312.


Get the venues within 500 meters of the Lat,Lng

In [301]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    VERSION, 
    radius, 
    LIMIT)
url


'https://api.foursquare.com/v2/venues/explore?client_id=PC5PSKF4U55ZE4AXQDJH53V4PYE5PPHIUEEDGWECNJ3J2YQH&client_secret=U5FX4H5XEHPSEN02VM1CDUE4GC133EZBUF0VCJIOWLUDRW2G&ll=43.67635739999999,-79.2930312&v=20190101&radius=500&limit=100'
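The request URL can also be assembled from a dictionary of parameters. The sketch below uses only the standard library; the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the `YOUR_...` placeholders are assumptions for illustration, not part of the course material.

```python
import os
from urllib.parse import urlencode

# Hypothetical environment variable names; placeholders are used when they are unset.
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET")

params = {
    "client_id": client_id,
    "client_secret": client_secret,
    "ll": "43.676357,-79.293031",  # latitude,longitude of postal code M4E
    "v": "20190101",
    "radius": 500,
    "limit": 100,
}

base_url = "https://api.foursquare.com/v2/venues/explore"
# safe="," keeps the comma in the ll value readable instead of encoding it as %2C.
explore_url = base_url + "?" + urlencode(params, safe=",")
print(explore_url)
```

Reading the credentials from the environment keeps them out of the notebook's saved output.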

Get some results and see what they look like

In [302]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d93c62b5315930030082764'},
 'response': {'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 5,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682094413784,
          'lng': -79.29394208780985}],
        'distanc

Borrow the get_category_type function used in Lab

In [303]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # the flattened frame prefixes the column with 'venue.'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


In [304]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869
4,Dip 'n Sip,Coffee Shop,43.678897,-79.297745
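To see what the flattening step does to the nested JSON, here is a small sketch on a hand-written item list. It assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`; the second venue and its coordinates are made up.

```python
import pandas as pd

# Hand-written items in the same nested shape as the Foursquare response.
sample_items = [
    {
        "venue": {
            "name": "Glen Manor Ravine",
            "categories": [{"name": "Trail"}],
            "location": {"lat": 43.676821, "lng": -79.293942},
        }
    },
    {
        "venue": {
            "name": "Hypothetical Cafe",  # made-up venue for illustration
            "categories": [{"name": "Coffee Shop"}],
            "location": {"lat": 43.678, "lng": -79.297},
        }
    },
]

# Nested dictionaries become dotted column names; lists are kept as-is.
flat = pd.json_normalize(sample_items)
flat = flat[["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]]

# Keep only the last part of each dotted name, as the notebook does.
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```

The `categories` column still holds a list of dictionaries per row, which is why a helper such as `get_category_type` is needed to pull out the name.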


Woo Hoo, we have data.  Now it is time to create a function (borrowed from the Lab) that processes all of the Postal Codes with the word Toronto in the Borough name.  To make it easier, I am substituting Postal Code for Neighborhood.

In [305]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
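The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue with an empty category list. A minimal sketch of a more tolerant row builder follows; `extract_venue_rows` is a hypothetical helper name, and the second sample venue is made up to exercise the empty case.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when the venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append(
            (
                name,
                lat,
                lng,
                venue["name"],
                venue["location"]["lat"],
                venue["location"]["lng"],
                category,
            )
        )
    return rows


# Hand-written sample; the second venue is hypothetical and has no category.
sample_items = [
    {"venue": {"name": "Glen Manor Ravine",
               "categories": [{"name": "Trail"}],
               "location": {"lat": 43.676821, "lng": -79.293942}}},
    {"venue": {"name": "Uncategorized Place",
               "categories": [],
               "location": {"lat": 43.677, "lng": -79.294}}},
]

rows = extract_venue_rows("M4E", 43.676357, -79.293031, sample_items)
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or relabeled before the one-hot encoding step.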


In [306]:
T_venues = getNearbyVenues(names=T_df['PostalCode'],
                                   latitudes=T_df['Latitude'],
                                   longitudes=T_df['Longitude']
                          )

M4E
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6G
M6H
M6J
M6K
M6P
M6R
M6S
M7Y


In [307]:
print(T_venues.shape)
T_venues.head()

(1709, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,M4E,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,M4E,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,M4E,43.676357,-79.293031,Dip 'n Sip,43.678897,-79.297745,Coffee Shop


In [308]:
T_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M4E,5,5,5,5,5,5
M4K,44,44,44,44,44,44
M4L,20,20,20,20,20,20
M4M,38,38,38,38,38,38
M4N,3,3,3,3,3,3
M4P,8,8,8,8,8,8
M4R,19,19,19,19,19,19
M4S,36,36,36,36,36,36
M4T,3,3,3,3,3,3
M4V,15,15,15,15,15,15


Time to analyze each neighborhood....

In [309]:
# one hot encoding
T_onehot = pd.get_dummies(T_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
T_onehot['Neighborhood'] = T_venues['Neighborhood'] 

In [343]:
T_onehot = T_onehot[ ['Neighborhood'] + [ col for col in T_onehot.columns if col != 'Neighborhood' ] ]
T_onehot.shape

(1709, 236)

Group the rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [312]:
T_grouped = T_onehot.groupby('Neighborhood').mean().reset_index()

How big are we now?

In [313]:
T_grouped.shape

(38, 236)
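To make the one-hot and mean steps concrete, here is a toy version with two postal codes and three categories (the `toy_venues` rows are invented). Each value in the grouped frame is the share of that postal code's venues falling in the category, so every row sums to 1.

```python
import pandas as pd

# Invented venue list: two venues for M4E, four for M4N.
toy_venues = pd.DataFrame(
    {
        "Neighborhood": ["M4E", "M4E", "M4N", "M4N", "M4N", "M4N"],
        "Venue Category": ["Trail", "Pub", "Park", "Park", "Pub", "Trail"],
    }
)

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of 0/1 indicators per group is the category frequency.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

This also explains the shape change in the notebook: 1709 venue rows collapse to one row per postal code.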

Look at each neighborhood along with the top 5 most common venues

In [314]:
num_top_venues = 5

for hood in T_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = T_grouped[T_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----M4E----
                        venue  freq
0                       Trail   0.2
1           Health Food Store   0.2
2                         Pub   0.2
3                 Coffee Shop   0.2
4  Modern European Restaurant   0.0


----M4K----
                venue  freq
0    Greek Restaurant  0.20
1         Coffee Shop  0.09
2  Italian Restaurant  0.07
3      Ice Cream Shop  0.05
4           Bookstore  0.05


----M4L----
                  venue  freq
0           Pizza Place  0.10
1                  Park  0.10
2    Italian Restaurant  0.05
3  Fast Food Restaurant  0.05
4          Burger Joint  0.05


----M4M----
                 venue  freq
0                 Café  0.11
1          Coffee Shop  0.08
2   Italian Restaurant  0.05
3  American Restaurant  0.05
4               Bakery  0.05


----M4N----
                     venue  freq
0                     Park  0.33
1              Swim School  0.33
2                 Bus Line  0.33
3        Afghan Restaurant  0.00
4  New American Restaurant  0

Borrow the function from the Lab that sorts the venues in descending order.  Then look at the top 10 venues for each of our neighborhoods.

In [328]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


In [329]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # positions past the 3rd have no entry in indicators
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = T_grouped['Neighborhood']

for ind in np.arange(T_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(T_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,Health Food Store,Coffee Shop,Trail,Pub,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
1,M4K,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Bookstore,Ice Cream Shop,Pizza Place,Brewery,Bubble Tea Shop,Restaurant
2,M4L,Park,Pizza Place,Pub,Liquor Store,Burger Joint,Sandwich Place,Fast Food Restaurant,Burrito Place,Fish & Chips Shop,Italian Restaurant
3,M4M,Café,Coffee Shop,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Park,Seafood Restaurant,Sandwich Place,Cheese Shop
4,M4N,Park,Swim School,Bus Line,Yoga Studio,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
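The column labels above come from a three-item suffix list plus a fallback to `th`, which is fine for 10 columns but would produce labels such as `21th` for longer lists. A small sketch of a general suffix helper follows; `ordinal` is a hypothetical name, not something from the Lab.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # 10 through 20 (in the last two digits) always take 'th'.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this yields the same eleven column names as the cell above.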


Time to cluster the Neighborhoods, using k-means.  I will use 5 clusters for the neighborhoods.

In [330]:
# set number of clusters
kclusters = 5

T_grouped_clustering = T_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(T_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:40]

array([0, 0, 0, 0, 3, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
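A quick way to see how the postal codes are spread over the clusters is to count the labels. The sketch below retypes the label array printed above rather than rerunning k-means, so it reflects only this particular run.

```python
import numpy as np

# The 38 labels printed above, written compactly.
labels = np.array(
    [0] * 4 + [3] + [0] * 3 + [2, 0, 2] + [0] * 11 + [1, 4] + [0] * 14
)

# Count how many postal codes fall in each of the 5 clusters.
sizes = np.bincount(labels, minlength=5)
print(sizes)
```

In this run one cluster holds 33 of the 38 postal codes, so the remaining four clusters are essentially outliers with one or two members each.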

In [331]:
#neighborhoods_venues_sorted.head(40)
#T_grouped_clustering.head()

In [332]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Clustered Labels', kmeans.labels_)


In [333]:
neighborhoods_venues_sorted.head(40)

Unnamed: 0,Clustered Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,M4E,Health Food Store,Coffee Shop,Trail,Pub,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
1,0,M4K,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Bookstore,Ice Cream Shop,Pizza Place,Brewery,Bubble Tea Shop,Restaurant
2,0,M4L,Park,Pizza Place,Pub,Liquor Store,Burger Joint,Sandwich Place,Fast Food Restaurant,Burrito Place,Fish & Chips Shop,Italian Restaurant
3,0,M4M,Café,Coffee Shop,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Park,Seafood Restaurant,Sandwich Place,Cheese Shop
4,3,M4N,Park,Swim School,Bus Line,Yoga Studio,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
5,0,M4P,Clothing Store,Food & Drink Shop,Sandwich Place,Gym,Breakfast Spot,Park,Asian Restaurant,Hotel,Doner Restaurant,Donut Shop
6,0,M4R,Coffee Shop,Sporting Goods Shop,Clothing Store,Mexican Restaurant,Diner,Dessert Shop,Park,Gym / Fitness Center,Chinese Restaurant,Rental Car Location
7,0,M4S,Sandwich Place,Pizza Place,Dessert Shop,Coffee Shop,Gym,Sushi Restaurant,Italian Restaurant,Café,Pharmacy,Restaurant
8,2,M4T,Playground,Park,Tennis Court,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
9,0,M4V,Pub,Coffee Shop,Pizza Place,Sports Bar,Restaurant,Bagel Shop,Supermarket,Sushi Restaurant,Liquor Store,Fried Chicken Joint


In [334]:
T_merged = T_df.copy() # work on a copy so T_df itself stays unchanged
T_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [335]:
# join the sorted venues and cluster labels onto the Toronto dataframe, matching on postal code
T_merged = T_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='PostalCode')

T_merged.head() 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Clustered Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Health Food Store,Coffee Shop,Trail,Pub,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Bookstore,Ice Cream Shop,Pizza Place,Brewery,Bubble Tea Shop,Restaurant
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0,Park,Pizza Place,Pub,Liquor Store,Burger Joint,Sandwich Place,Fast Food Restaurant,Burrito Place,Fish & Chips Shop,Italian Restaurant
43,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Park,Seafood Restaurant,Sandwich Place,Cheese Shop
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,3,Park,Swim School,Bus Line,Yoga Studio,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


Visualizing the Clusters

In [336]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


In [337]:
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(T_merged['Latitude'], T_merged['Longitude'], T_merged['PostalCode'], T_merged['Clustered Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [338]:
T_merged.loc[T_merged['Clustered Labels'] == 0, T_merged.columns[[1] + list(range(5, T_merged.shape[1]))]]

Unnamed: 0,Borough,Clustered Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,East Toronto,0,Health Food Store,Coffee Shop,Trail,Pub,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
41,East Toronto,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Bookstore,Ice Cream Shop,Pizza Place,Brewery,Bubble Tea Shop,Restaurant
42,East Toronto,0,Park,Pizza Place,Pub,Liquor Store,Burger Joint,Sandwich Place,Fast Food Restaurant,Burrito Place,Fish & Chips Shop,Italian Restaurant
43,East Toronto,0,Café,Coffee Shop,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Park,Seafood Restaurant,Sandwich Place,Cheese Shop
45,Central Toronto,0,Clothing Store,Food & Drink Shop,Sandwich Place,Gym,Breakfast Spot,Park,Asian Restaurant,Hotel,Doner Restaurant,Donut Shop
46,Central Toronto,0,Coffee Shop,Sporting Goods Shop,Clothing Store,Mexican Restaurant,Diner,Dessert Shop,Park,Gym / Fitness Center,Chinese Restaurant,Rental Car Location
47,Central Toronto,0,Sandwich Place,Pizza Place,Dessert Shop,Coffee Shop,Gym,Sushi Restaurant,Italian Restaurant,Café,Pharmacy,Restaurant
49,Central Toronto,0,Pub,Coffee Shop,Pizza Place,Sports Bar,Restaurant,Bagel Shop,Supermarket,Sushi Restaurant,Liquor Store,Fried Chicken Joint
51,Downtown Toronto,0,Coffee Shop,Market,Pub,Pizza Place,Italian Restaurant,Park,Bakery,Café,Restaurant,Grocery Store
52,Downtown Toronto,0,Coffee Shop,Gay Bar,Japanese Restaurant,Sushi Restaurant,Restaurant,Men's Store,Gym,Gastropub,Fast Food Restaurant,Smoke Shop


In [339]:
T_merged.loc[T_merged['Clustered Labels'] == 1, T_merged.columns[[1] + list(range(5, T_merged.shape[1]))]]

Unnamed: 0,Borough,Clustered Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
63,Central Toronto,1,Garden,Ice Cream Shop,Music Venue,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


In [340]:
T_merged.loc[T_merged['Clustered Labels'] == 2, T_merged.columns[[1] + list(range(5, T_merged.shape[1]))]]

Unnamed: 0,Borough,Clustered Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
48,Central Toronto,2,Playground,Park,Tennis Court,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
50,Downtown Toronto,2,Park,Playground,Trail,Building,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


In [341]:
T_merged.loc[T_merged['Clustered Labels'] == 3, T_merged.columns[[1] + list(range(5, T_merged.shape[1]))]]

Unnamed: 0,Borough,Clustered Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
44,Central Toronto,3,Park,Swim School,Bus Line,Yoga Studio,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


In [342]:
T_merged.loc[T_merged['Clustered Labels'] == 4, T_merged.columns[[1] + list(range(5, T_merged.shape[1]))]]

Unnamed: 0,Borough,Clustered Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
64,Central Toronto,4,Mexican Restaurant,Trail,Jewelry Store,Sushi Restaurant,Yoga Studio,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store
