<font size="6">Capstone Project - Applied Data Science Specialization</font><br>

This notebook will be used to complete the Capstone Project for the Applied Data Science Specialization by IBM. It is divided into different sections, each with different parts. Every section corresponds to a different requirement on the capstone project.<br>




<div style="text-align: right"><font size="5">Making the Notebook</font></div>

<div style="text-align: right">Peer Graded Assignment for IBM's Applied Data Science Capstone, Week 1</div>
<br>
<br>

In [1]:
import numpy as np
import pandas as pd

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!




<div style="text-align: right"><font size="5">Segmenting and Clustering Neighborhoods in Toronto</font></div>

<div style="text-align: right">Peer Graded Assignment for IBM's Applied Data Science Capstone, Week 3</div>
<br>
<br>

<font size="4">Part 1 - Getting and Wrangling Data</font>


First, I import and install everything I need. The following lines of code are taken (almost) directly from the course's "Segmenting and Clustering Neighborhoods in New York City" Jupyter Notebook:


In [5]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

These only have to be installed once, so I uncomment them only when working in a new environment.

In [6]:
!conda install -c conda-forge geopy --yes 
!conda install -c conda-forge folium=0.5.0 --yes

^C


Then I import the remaining libraries.

In [8]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



Now, according to the instructions, I need data on the different boroughs, neighborhoods and postal codes of Toronto, which I can obtain here:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Let's try just reading it as a csv.

In [9]:
df = pd.read_csv('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
df.head()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 43


Let's try checking online. There appears to be a pandas command, read_html, which can "read HTML tables into a list of DataFrame objects" according to the documentation.

In [10]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
df

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 5           M6A        North York   
 6           M7A  Downtown Toronto   
 7           M8A      Not assigned   
 8           M9A         Etobicoke   
 9           M1B       Scarborough   
 10          M2B      Not assigned   
 11          M3B        North York   
 12          M4B         East York   
 13          M5B  Downtown Toronto   
 14          M6B        North York   
 15          M7B      Not assigned   
 16          M8B      Not assigned   
 17          M9B         Etobicoke   
 18          M1C       Scarborough   
 19          M2C      Not assigned   
 20          M3C        North York   
 21          M4C         East York   
 22          M5C  Downtown Toronto   
 23          M6C              York   
 24          M7C      Not assigned   
 25         

The table we need is the first object in the list. Let's extract it as a dataframe.

In [11]:
df = pd.DataFrame(df[0])
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now let's make the changes required in the assignment:

    1.- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

    2.- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

    3.- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

    4.- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

    5.- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [12]:
#1. We use "loc" to remove the ones that have "Not assigned".
df = df.loc[df.Borough != 'Not assigned']

#2. Apparently, as of this date (July 7, 2020), there are no repeated values of Postal Code, since the total number of
# postal codes equals the total number of unique postal codes, so we leave it as it is.
df.shape[0] == df['Postal Code'].unique().shape[0]

#3. Given the sum of values in Neighborhood that are 'Not assigned' is 0, we conclude there are none.
(df.Neighborhood == 'Not assigned').sum()

#4. Everything up to this point has been explained in Markdown cells, or inside the comments.

#5. The total number of rows is:
print('Total number of rows is: {} '.format(df.shape[0]))

Total number of rows is: 103 
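
Had the table still contained repeated postal codes or unassigned neighborhoods, steps 1 to 3 could be handled roughly as follows. This is only a sketch on a toy dataframe: the rows and the names `raw` and `clean` are made up for illustration and are not part of the assignment data.

```python
import pandas as pd

# Toy table (hypothetical rows): one duplicated postal code, one unassigned neighborhood
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# Step 1: drop rows without an assigned borough
clean = raw.loc[raw['Borough'] != 'Not assigned'].copy()

# Step 3: an unassigned neighborhood takes the name of its borough
unassigned = clean['Neighborhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighborhood'] = clean.loc[unassigned, 'Borough']

# Step 2: one row per postal code, neighborhoods joined with a comma
clean = (clean.groupby(['Postal Code', 'Borough'])['Neighborhood']
              .agg(', '.join)
              .reset_index())

print(clean)
```

Step 3 is applied before step 2 here so that a borough name used as a placeholder is joined like any other neighborhood name.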


<font size="4">Part 2 - Adding Latitude and Longitude</font>

I still need latitude and longitude coordinates to visualize the information on a map. As of this date, we are given a csv with the corresponding coordinates, available here: https://cocl.us/Geospatial_data .

In [13]:
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


I do a join to attach every coordinate to its postal code. As can be seen, no rows were eliminated, and because pd.merge performs an inner join by default, every Postal Code in the original dataframe now has coordinates.

In [14]:
toronto = pd.merge(df,lat_lon, on ='Postal Code')
print('Total number of rows after join is: {} '.format(toronto.shape[0]))
toronto.head()

Total number of rows after join is: 103 


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
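
Comparing row counts works here, but the same check can be made explicit. The sketch below uses a left join with `indicator=True` to list postal codes that found no coordinates, and `validate='one_to_one'` to fail early if a postal code is repeated on either side. The small `left` and `coords` dataframes are toy stand-ins, and M5A is deliberately left without coordinates to show what a miss looks like.

```python
import pandas as pd

# Toy stand-ins for the borough table and the coordinates table
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M5A'],
                     'Borough': ['North York', 'North York', 'Downtown Toronto']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# A left join keeps every postal code; '_merge' records where each row came from
checked = pd.merge(left, coords, on='Postal Code', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes that did not find coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

An empty `missing` list would confirm that an inner join loses nothing.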


<font size="4">Part 3 - Clustering and Visualizing</font>

I proceed to make clusters and create maps to understand the behavior of the wrangled data. Let's first get Toronto's coordinates.

In [17]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Ontario are 43.6534817, -79.3839347.


Then let's see a map of the 103 postal codes.

In [19]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto['Latitude'], toronto['Longitude'], 
                                           toronto['Borough'], toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's reduce the number of observations for simplicity. Arbitrarily, I will analyze those boroughs whose mean latitude and mean longitude are both lower than the overall mean latitude and longitude, that is, the boroughs toward the south and west.

In [90]:
toronto_grouped = toronto.groupby('Borough')[['Latitude', 'Longitude']].mean()  # average only the numeric coordinate columns
toronto_grouped = toronto_grouped.loc[toronto_grouped['Latitude'] < toronto.Latitude.mean(),:]
toronto_grouped = toronto_grouped.loc[toronto_grouped['Longitude'] < toronto.Longitude.mean(),:]
print('Now I will be working with {} different boroughs'.format(toronto_grouped.shape[0]))
toronto_grouped

Now I will be working with 5 different boroughs


Unnamed: 0_level_0,Latitude,Longitude
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1
Central Toronto,43.70198,-79.398954
Etobicoke,43.660043,-79.542074
Mississauga,43.636966,-79.615819
West Toronto,43.652653,-79.44929
York,43.690797,-79.472633


I filter over the original dataset, and store the new one in "toronto_filter"

In [92]:
# keep only the rows whose borough passed the filter above
toronto_filter = toronto.loc[toronto['Borough'].isin(toronto_grouped.index), :]
toronto_filter.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
11,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.650943,-79.554724
16,M6C,York,Humewood-Cedarvale,43.693781,-79.428191
17,M9C,Etobicoke,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201
21,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512


It can be seen that the remaining neighborhoods are the ones on the left (west) and at the bottom (south) of the map.

In [94]:
map_toronto_filter = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_filter['Latitude'], toronto_filter['Longitude'], 
                                           toronto_filter['Borough'], toronto_filter['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_filter)  
    
map_toronto_filter

Since I will use Foursquare to get the venue data, I now define the fields with my credentials. However, for security reasons, I will erase them after I get the data.

In [95]:
CLIENT_ID = ''
CLIENT_SECRET = ''
VERSION = '20180605'
radius = 450
LIMIT = 50

I will use the getNearbyVenues function defined in the "Segmenting and Clustering Neighborhoods in New York City" Jupyter Notebook

In [96]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
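
The function above assumes that every response contains at least one group and that every venue has at least one category; otherwise it raises a KeyError or IndexError. A more defensive way to extract the rows could look like the sketch below. The helper name `extract_venue_rows` is hypothetical, and the `payload` is a hand-written dictionary that only mimics the nested keys used above, not a real API response.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one 'explore' JSON payload into a list of row tuples.

    Returns an empty list when the expected keys are missing.
    """
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []

    rows = []
    for item in groups[0].get('items', []):
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        # A venue without categories gets None instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


# Hand-written payload with the same nesting as the keys used above
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cedarvale Park',
               'location': {'lat': 43.692535, 'lng': -79.428705},
               'categories': [{'name': 'Field'}]}},
    {'venue': {'name': 'Uncategorized place',  # made-up venue with no category
               'location': {'lat': 43.69, 'lng': -79.43},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Humewood-Cedarvale', 43.693781, -79.428191, payload)
empty = extract_venue_rows('Nowhere', 0.0, 0.0, {'meta': {'code': 400}})
print(rows)
print(empty)
```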

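I proceed to obtain (at most) 50 venues around each neighborhood. Note that the call below does not pass a radius, so the function's default of 500 meters applies rather than the 450 defined earlier.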

In [141]:
toronto_venues = getNearbyVenues(names=toronto_filter['Neighborhood'],
                                   latitudes=toronto_filter['Latitude'],
                                   longitudes=toronto_filter['Longitude']
                                  )

Islington Avenue, Humber Valley Village
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Caledonia-Fairbanks
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
Del Ray, Mount Dennis, Keelsdale and Silverthorn
Lawrence Park
Roselawn
Runnymede, The Junction North
Weston
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
Westmount
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Canada Post Gateway Processing Centre
Kingsview Village, St. Phillips, Martin Grove Gardens, Richview Gardens
Davisville
Runnymede, Swansea
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
New Toronto, Mimico South, Humber Bay Shores
South Steeles, Silverstone, Humbergate, Jamestown, Mount Olive, Beaumond Heights, Thistletown, Albion G

I seem to be working with far less data than in the New York example, but I believe it is still enough to get results. If by the end of the cluster analysis I can't draw clear conclusions, I could try increasing the radius, setting a higher limit, or adding more boroughs.

In [142]:
print('The number of venues is {}'.format(toronto_venues.shape[0]))
toronto_venues.head()

The number of venues is 385


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Humewood-Cedarvale,43.693781,-79.428191,Cedarvale Park,43.692535,-79.428705,Field
1,Humewood-Cedarvale,43.693781,-79.428191,Cedarvale Ravine,43.690188,-79.426106,Trail
2,Humewood-Cedarvale,43.693781,-79.428191,Cedarvale Tennis Courts,43.692744,-79.432244,Tennis Court
3,Humewood-Cedarvale,43.693781,-79.428191,Phil White Arena,43.691303,-79.431761,Hockey Arena
4,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,LCBO,43.642099,-79.576592,Liquor Store


In [143]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Alderwood, Long Branch",8,8,8,8,8,8
"Brockton, Parkdale Village, Exhibition Place",26,26,26,26,26,26
Caledonia-Fairbanks,4,4,4,4,4,4
Canada Post Gateway Processing Centre,13,13,13,13,13,13
Davisville,37,37,37,37,37,37
Davisville North,9,9,9,9,9,9
"Del Ray, Mount Dennis, Keelsdale and Silverthorn",4,4,4,4,4,4
"Dufferin, Dovercourt Village",15,15,15,15,15,15
"Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood",9,9,9,9,9,9
"Forest Hill North & West, Forest Hill Road Park",4,4,4,4,4,4


By grouping the data and counting the number of venues, I can see that 'Weston' has only 1 venue, and several others have too few. So, I will set the minimum number of venues to 5, because many neighborhoods have only 4 venues, which would automatically make those 4 venue types the "Top 4" in the respective neighborhood. As can be seen below, I didn't lose too many observations, and the ones that remain will be more useful on their own.

In [146]:
# boolean Series indexed by neighborhood: True where there are at least 5 venues
aux = toronto_venues.groupby('Neighborhood')['Venue'].count() > 4
toronto_venues_over_5 = toronto_venues[toronto_venues['Neighborhood'].isin(aux[aux].index)]
print('The number of venues is {}'.format(toronto_venues_over_5.shape[0]))
toronto_venues_over_5.head()

The number of venues is 342


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
4,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,LCBO,43.642099,-79.576592,Liquor Store
5,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,The Beer Store,43.641313,-79.576925,Beer Store
6,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,Starbucks,43.641312,-79.576924,Coffee Shop
7,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,Pizza Hut,43.641845,-79.576556,Pizza Place
8,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,Shoppers Drug Mart,43.641312,-79.576924,Pharmacy
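
The same filter can also be written without the intermediate `aux` Series, by broadcasting each neighborhood's venue count back onto its rows with `transform`. The sketch below runs on a toy dataframe; the neighborhood labels 'A' and 'B' and the venue names are placeholders.

```python
import pandas as pd

# Toy data: neighborhood A has 5 venues, neighborhood B has only 2
venues = pd.DataFrame({
    'Neighborhood': ['A'] * 5 + ['B'] * 2,
    'Venue': ['venue {}'.format(i) for i in range(7)],
})

# transform('count') returns one value per row: the size of that row's group
counts = venues.groupby('Neighborhood')['Venue'].transform('count')

# keep only the rows that belong to neighborhoods with at least 5 venues
kept = venues[counts >= 5]
print(kept)
```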


I now make dummy variables based on the category of every venue, and take the mean of every category within each neighborhood, which under this model represents the frequency (probability) of occurrence of each category in each neighborhood.

In [147]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues_over_5[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues_over_5['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print('The number of different categories is {}.'.format(toronto_onehot.shape[1]-1))
toronto_probabilities = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_probabilities.head()

The number of different categories is 121.


Unnamed: 0,Neighborhood,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Beer Store,Bistro,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Burger Joint,Burrito Place,Butcher,Café,Cajun / Creole Restaurant,Chinese Restaurant,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,Convenience Store,Cosmetics Shop,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Dog Run,Donut Shop,Eastern European Restaurant,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Flea Market,Food & Drink Shop,French Restaurant,Fried Chicken Joint,Furniture / Home Store,Gas Station,Gastropub,Gift Shop,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Hardware Store,Health Food Store,History Museum,Hobby Shop,Hotel,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Italian Restaurant,Japanese Restaurant,Juice Bar,Korean Restaurant,Latin American Restaurant,Light Rail Station,Liquor Store,Mac & Cheese Joint,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Movie Theater,Music Venue,New American Restaurant,Nightclub,Optical Shop,Park,Performing Arts Venue,Pet Store,Pharmacy,Pizza Place,Pub,Record Shop,Rental Car Location,Restaurant,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Shoe Repair,Shoe Store,Shopping Plaza,Smoothie Shop,Social Club,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Supermarket,Supplement Shop,Sushi Restaurant,Tanning Salon,Tea Room,Tennis Court,Thai Restaurant,Theater,Thrift / Vintage Store,Toy / Game Store,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.25,0.125,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.038462,0.0,0.115385,0.0,0.0,0.038462,0.0,0.0,0.076923,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.038462,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.076923,0.0,0.0,0.076923,0.038462,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Canada Post Gateway Processing Centre,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.153846,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.153846,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.0,0.054054,0.0,0.0,0.0,0.0,0.0,0.054054,0.0,0.0,0.0,0.0,0.027027,0.027027,0.0,0.081081,0.027027,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.027027,0.027027,0.0,0.054054,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.054054,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.027027,0.027027,0.0,0.0,0.027027,0.081081,0.0,0.0,0.0,0.027027,0.0,0.081081,0.027027,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054054,0.0,0.0,0.027027,0.027027,0.0,0.0,0.027027,0.0,0.0,0.0,0.0,0.0
4,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.222222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
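
To see why the mean of the dummy columns is a relative frequency, it can help to look at a tiny example. The sketch below builds the same kind of table with `pd.crosstab(..., normalize='index')`, which counts categories per neighborhood and divides by the row total. It is an alternative route to the same numbers, not the code used in this notebook, and the data is made up.

```python
import pandas as pd

# Toy data: neighborhood A has 4 venues, neighborhood B has 2
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Pub', 'Hotel', 'Park'],
})

# Share of each category within each neighborhood
freq = pd.crosstab(venues['Neighborhood'], venues['Venue Category'],
                   normalize='index')
print(freq)

# Every row sums to 1, so each row is a distribution over categories
print(freq.sum(axis=1))
```

In neighborhood A, two of the four venues are coffee shops, so the Coffee Shop frequency is 0.5.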


I will use the function return_most_common_venues defined in the previously mentioned Notebook to obtain the top 5 most common venues, as well as its code for arranging this data into a dataframe.

In [148]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [170]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_probabilities['Neighborhood']

for ind in np.arange(toronto_probabilities.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_probabilities.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.tail()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
13,"Runnymede, Swansea",Café,Italian Restaurant,Pub,Pizza Place,Coffee Shop
14,"South Steeles, Silverstone, Humbergate, Jamest...",Grocery Store,Pharmacy,Beer Store,Fast Food Restaurant,Pizza Place
15,"Summerhill West, Rathnelly, South Hill, Forest...",Coffee Shop,Pub,American Restaurant,Sushi Restaurant,Fried Chicken Joint
16,"The Annex, North Midtown, Yorkville",Sandwich Place,Café,Coffee Shop,Pharmacy,Liquor Store
17,Westmount,Pizza Place,Intersection,Middle Eastern Restaurant,Chinese Restaurant,Discount Store


Now that I have the frequency of each venue category in each neighborhood, it is possible to determine how close each observation (neighborhood) is to each other, and with that, create clusters that will share similarities. I will group the data into 4 clusters.

In [171]:
k= 4
toronto_probabilities_k = toronto_probabilities.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=7).fit(toronto_probabilities_k)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_filter

# merge toronto_filter with neighborhoods_venues_sorted to add cluster labels and top venues to each neighborhood
toronto_merged = toronto_merged.merge(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
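
The choice of 4 clusters above is arbitrary. One common way to compare candidate values of k is the silhouette score, where higher is better. The sketch below shows the idea on synthetic data: `X` is a made-up set of three well-separated blobs standing in for the frequency matrix, so the result says nothing about the Toronto data itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)

# Synthetic stand-in for the frequency matrix: three tight blobs in 2-D
X = np.vstack([rng.normal(loc=center, scale=0.3, size=(20, 2))
               for center in (0.0, 5.0, 10.0)])

# Fit k-means for several candidate k and record the silhouette score of each
scores = {}
for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=7).fit_predict(X)
    scores[n_clusters] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

With only 18 neighborhoods and 121 sparse columns, scores on the real matrix may be low and close to each other, so this would be a rough guide at best.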

In [174]:
toronto_merged

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M9C,Etobicoke,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201,1,Pet Store,Beer Store,Pharmacy,Pizza Place,Liquor Store
1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,2,Pharmacy,Bakery,Brewery,Bank,Park
2,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,2,Bar,Men's Store,Café,Vietnamese Restaurant,Vegetarian / Vegan Restaurant
3,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191,2,Café,Nightclub,Performing Arts Venue,Coffee Shop,Bakery
4,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Hotel,Breakfast Spot,Gym,Pizza Place,Park
5,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763,2,Discount Store,Mexican Restaurant,Thai Restaurant,Café,Furniture / Home Store
6,M9P,Etobicoke,Westmount,43.696319,-79.532242,3,Pizza Place,Intersection,Middle Eastern Restaurant,Chinese Restaurant,Discount Store
7,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678,2,Clothing Store,Coffee Shop,Gift Shop,Fast Food Restaurant,Diner
8,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,2,Sandwich Place,Café,Coffee Shop,Pharmacy,Liquor Store
9,M6R,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325,2,Breakfast Spot,Gift Shop,Bar,Dessert Shop,Restaurant


I use the code from the Notebook to paint the clusters in different colors on a map.

In [175]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
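
Because the markers are colored with `rainbow[cluster-1]`, cluster 0 wraps around to the last color of the list. The sketch below spells out the resulting mapping. The color names in `palette` are approximate labels assumed for the four evenly spaced rainbow samples; the notebook itself stores hex codes.

```python
k = 4

# Approximate names for the 4 evenly spaced samples of the rainbow colormap
# (assumed labels for illustration only)
palette = ['purple', 'turquoise', 'yellow', 'red']

# Same indexing as in the map cell: index -1 selects the last element
cluster_colors = {cluster: palette[cluster - 1] for cluster in range(k)}
print(cluster_colors)
```

So cluster 0 is drawn in red, cluster 1 in purple, cluster 2 in turquoise and cluster 3 in yellow.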

Visually, the turquoise dots seem to align with the shore, so there might be a variable that reflects that. On the other hand, purple and yellow seem to drift to the left, with the purple ones being farther from the rest of the dots. As for the red dots, they seem to be at the left and right extremes. Now let's visualize the variables in the clusters.

<font size="4">Red</font><br>
It seems the primary determiner of this cluster is the fact that the most common venue is a Hotel.

In [176]:
k_ = 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == k_, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,Central Toronto,0,Hotel,Breakfast Spot,Gym,Pizza Place,Park
10,Mississauga,0,Hotel,Coffee Shop,American Restaurant,Intersection,Middle Eastern Restaurant


As the frequencies below show, Hotel venues are far more frequent in both neighborhoods than the overall mean frequency, so it is understandable for them to form a separate cluster on that basis alone, regardless of the distance between them.

In [206]:
overall_mean = toronto_probabilities_k.mean().mean()
review = toronto_merged.merge(toronto_probabilities, on = 'Neighborhood')
print('The overall mean (frequency of occurrence) of any venue category in any neighborhood is {}'.format(overall_mean))
print('The frequencies of Hotel venues in the 2 neighborhoods of this cluster are:')
review.loc[review['Cluster Labels'] == 0,'Hotel']


The overall mean (frequency of occurrence) of any venue category in any neighborhood is 0.008264462809917357
The frequencies of Hotel venues in the 2 neighborhoods of this cluster are:


4     0.222222
10    0.153846
Name: Hotel, dtype: float64
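
A note on the baseline: since the frequencies in each neighborhood sum to 1 across the 121 categories, the overall mean is always 1/121 ≈ 0.00826, whatever the data. The sketch below checks this on random counts. The matrix is made up and only its shape (18 neighborhoods by 121 categories) matches the table in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neighborhoods, n_categories = 18, 121

# Made-up venue counts per neighborhood and category
counts = rng.integers(0, 5, size=(n_neighborhoods, n_categories))
counts[:, 0] += 1  # make sure every neighborhood has at least one venue

# Row-normalize so that each neighborhood's frequencies sum to 1
freq = counts / counts.sum(axis=1, keepdims=True)

# Mean of the column means, as in toronto_probabilities_k.mean().mean()
overall_mean = freq.mean(axis=0).mean()
print(overall_mean, 1 / n_categories)
```

A more informative comparison might be the mean Hotel frequency across all neighborhoods, i.e. the mean of that single column.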

<font size="4">Purple</font><br>
I analyzed this one last, because I couldn't find anything that characterizes it. The one thing its two neighborhoods have in common is that they share 3 of their top 5 most common venues (Beer Store, Pharmacy, and Pizza Place), so that might be tying them together.

In [177]:
k_ = 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == k_, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Etobicoke,1,Pet Store,Beer Store,Pharmacy,Pizza Place,Liquor Store
15,Etobicoke,1,Grocery Store,Pharmacy,Beer Store,Fast Food Restaurant,Pizza Place


<font size="4">Turquoise</font><br>
Every neighborhood in the West Toronto borough is here, and only one of the Central Toronto ones is missing (it is in the Red cluster, due to the strong influence of its Hotel venues). What seems to characterize these neighborhoods is a variety of leisure venues, such as clothing and discount stores, cafés, pubs, breakfast spots, and dessert shops.

In [178]:
k_ = 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == k_, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,West Toronto,2,Pharmacy,Bakery,Brewery,Bank,Park
2,West Toronto,2,Bar,Men's Store,Café,Vietnamese Restaurant,Vegetarian / Vegan Restaurant
3,West Toronto,2,Café,Nightclub,Performing Arts Venue,Coffee Shop,Bakery
5,West Toronto,2,Discount Store,Mexican Restaurant,Thai Restaurant,Café,Furniture / Home Store
7,Central Toronto,2,Clothing Store,Coffee Shop,Gift Shop,Fast Food Restaurant,Diner
8,Central Toronto,2,Sandwich Place,Café,Coffee Shop,Pharmacy,Liquor Store
9,West Toronto,2,Breakfast Spot,Gift Shop,Bar,Dessert Shop,Restaurant
11,Central Toronto,2,Dessert Shop,Sandwich Place,Pizza Place,Café,Gym
12,West Toronto,2,Café,Italian Restaurant,Pub,Pizza Place,Coffee Shop
13,Central Toronto,2,Coffee Shop,Pub,American Restaurant,Sushi Restaurant,Fried Chicken Joint


<font size="4">Yellow</font><br>
As with the red cluster, this one seems to be largely determined by one kind of venue: Pizza Places. Also, both neighborhoods are in Etobicoke.

In [179]:
k_ = 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == k_, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
6,Etobicoke,3,Pizza Place,Intersection,Middle Eastern Restaurant,Chinese Restaurant,Discount Store
16,Etobicoke,3,Pizza Place,Pub,Coffee Shop,Dance Studio,Sandwich Place


As the frequencies below show, Pizza Place venues are far more frequent in both neighborhoods than the overall mean frequency, so it is understandable for them to form a separate cluster on that basis alone, regardless of any other variable.

In [208]:
overall_mean = toronto_probabilities_k.mean().mean()
review = toronto_merged.merge(toronto_probabilities, on = 'Neighborhood')
print('The overall mean (frequency of occurrence) of any venue category in any neighborhood is {}'.format(overall_mean))
print('The frequencies of Pizza Place venues in the 2 neighborhoods of this cluster are:')
review.loc[review['Cluster Labels'] == 3,'Pizza Place']


The overall mean (frequency of occurrence) of any venue category in any neighborhood is 0.008264462809917357
The frequencies of Pizza Place venues in the 2 neighborhoods of this cluster are:


6     0.25
16    0.25
Name: Pizza Place, dtype: float64