# Project 9.2 (for IBM Data Science Professional Certificate)

Hello reader,

Thank you for taking the time to review or look at this project.

I hope this notebook provides you with some value.

In this project, I'm required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York (from previous assignments), the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own way, so you need to be agile and to sharpen the skill of learning new libraries and tools quickly, depending on the project.

Let's start by importing our libraries:

In [8]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from bs4 import BeautifulSoup # for scraping information from websites

import math

print('Libraries imported.')

Libraries imported.


#### Let's start by getting the data for the city of Toronto from Wikipedia:

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [9]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

#*** CREATE THE BEAUTIFULSOUP OBJECT ***
soup = BeautifulSoup(source, 'lxml') 

#*** GETTING THE TABLE FROM WIKIPEDIA SITE ***
body = soup.find('div', class_= 'mw-body')
body_con = body.find('div', class_= 'mw-content-ltr')
table = body_con.table

#*** GETTING THE COLUMNS AND CREATING DATA FRAME ***
c = table.find_all('th')
for i in [0,1,2]:
    c[i] = c[i].text
dfcan = pd.DataFrame(columns = c)

#*** WE HAVE TO ARRANGE OUR LIST (tr contains each row and td each column) ***
data = table.find_all('td')
for i, value in enumerate(data):
    data[i] = value.text

#*** NOW LET'S ASSIGN THE VALUES TO THE DATA FRAME ***
j = int(len(data)/3) #to loop for the correct rows from the list
for l in range(j):
    low = 3*l
    top = low + 3
    d = data[low:top] #each index must be like 0:3, 3:6, 6:9...
    d[2] = d[2].split('\n')[0] #removing unwanted characters
    # if a cell has a borough but a 'Not assigned' neighborhood, the neighborhood takes the borough's name
    if (d[1] != 'Not assigned') and (d[2] == 'Not assigned'):
        d[2] = d[1]
    # it is not necessary to append rows whose borough is not assigned
    # note: DataFrame.append was removed in pandas 2.0; this cell assumes an older pandas
    if d[1] != 'Not assigned':
        dfcan = dfcan.append({'Postcode':d[0],'Borough':d[1],'Neighbourhood':d[2]}, ignore_index=True)

dfcan = dfcan.dropna(axis=1) #for some reason an extra NaN column is created; this line drops it
dfcan.rename(columns={'Neighbourhood':'Neighborhood'}, inplace=True) #let's change the name of the column

#*** COMBINE NEIGHBOURHOODS THAT ARE IN THE SAME POSTAL AREA ***
dfcan = dfcan.groupby(['Postcode', 'Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index() 
dfcan.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
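
The three cleaning rules applied above (skip rows whose borough is 'Not assigned', fall back to the borough when the neighborhood is 'Not assigned', and join neighborhoods that share a postal code) can also be written with vectorized pandas operations. The following is a minimal sketch on a small made-up table; the sample rows and the names `raw` and `clean` are illustrative assumptions, not data from the Wikipedia page.

```python
import pandas as pd

# Hypothetical sample rows in the same three-column layout as the scraped table
raw = pd.DataFrame({
    'Postcode': ['X1A', 'X2A', 'X3A', 'X3A', 'X4A'],
    'Borough': ['Not assigned', 'North', 'Centre', 'Centre', 'South'],
    'Neighborhood': ['Not assigned', 'Oakfield', 'Harbour', 'Mill Park', 'Not assigned'],
})

# Rule 1: keep only rows with an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 2: an unassigned neighborhood takes the name of its borough
unassigned = clean['Neighborhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighborhood'] = clean.loc[unassigned, 'Borough']

# Rule 3: one row per postal code, neighborhoods joined with commas
clean = (clean.groupby(['Postcode', 'Borough'])['Neighborhood']
              .apply(', '.join)
              .reset_index())

print(clean)
```

Collecting the rules this way avoids growing the dataframe one row at a time, which tends to be slow on larger tables.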


We've finished cleaning our dataset. Let's check that every condition is satisfied.

With the code below, we check whether the Borough column contains only the desired values.

In [10]:
dfcan.groupby('Borough').count() #check if the Borough column has only desired values

Unnamed: 0_level_0,Postcode,Neighborhood
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1
Central Toronto,9,9
Downtown Toronto,18,18
East Toronto,5,5
East York,5,5
Etobicoke,12,12
Mississauga,1,1
North York,24,24
Queen's Park,1,1
Scarborough,17,17
West Toronto,6,6


Now we check whether any postal code is listed more than once (only the head is shown to keep the output short, but the check holds for the full table):

In [11]:
dfcan.groupby('Postcode').count().head()

Unnamed: 0_level_0,Borough,Neighborhood
Postcode,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,1,1
M1C,1,1
M1E,1,1
M1G,1,1
M1H,1,1
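
Looking at the head only covers the first five postal codes. A check over the whole column could look like the sketch below, which runs on a made-up frame (`sample` is an illustrative assumption); on the real data the same two expressions would be applied to `dfcan`.

```python
import pandas as pd

# Hypothetical frame with one repeated postal code, to show what a failure looks like
sample = pd.DataFrame({
    'Postcode': ['X1A', 'X2A', 'X2A'],
    'Borough': ['North', 'Centre', 'Centre'],
})

# True only when no postal code appears twice
all_unique = sample['Postcode'].is_unique

# The repeated codes themselves, if any
repeated = sample.loc[sample['Postcode'].duplicated(keep=False), 'Postcode'].unique()

print(all_unique, list(repeated))
```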


Now let's see the shape of our dataframe:

In [12]:
dfcan.shape

(103, 3)

#### Let's use the geopy library to get the latitude and longitude values of the neighborhoods of Toronto.

In [15]:
address = 'Canada, Toronto'

geolocator = Nominatim(user_agent="Toro_explorer")
locationT = geolocator.geocode(address)
latitudeT = locationT.latitude
longitudeT = locationT.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitudeT, longitudeT))

The geographical coordinates of Toronto are 43.653963, -79.387207.


After some research, I found that geocoding with the postal code fails for some of the rows. Since some rows list several neighborhoods, I fall back to geocoding the first neighborhood in the string.

That, in turn, returned an error for some of the rows (7, to be precise), and for those we use the coordinates of the city of Toronto.

Since our original dataframe does not hold coordinates, for simplicity we will create a new dataframe that does.
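
The lookup described above tries three queries in order: postal code plus neighborhoods, then the first neighborhood plus 'Toronto', then the city itself. That pattern can be sketched independently of any geocoding service. In the sketch below, `geocode_with_fallback` is a hypothetical helper and `fake_geocode` is a stand-in lookup table with made-up coordinates, not the geopy API.

```python
def geocode_with_fallback(queries, geocode):
    """Return (query, result) for the first query whose geocode result is not None."""
    for query in queries:
        location = geocode(query)
        if location is not None:
            return query, location
    return None, None

# Stand-in for a real geocoder: a lookup table with made-up coordinates
fake_index = {'Hood A, Toronto': (43.70, -79.40), 'Canada, Toronto': (43.65, -79.38)}
fake_geocode = fake_index.get

queries = ['X1A, Hood A, Hood B', 'Hood A, Toronto', 'Canada, Toronto']
used_query, location = geocode_with_fallback(queries, fake_geocode)
print(used_query, location)
```

Returning the query that succeeded makes it easy to count afterwards how many rows ended up on each fallback level.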

Let's begin:

In [16]:
dftoro = pd.DataFrame(data=dfcan, copy=True)

geolocator = Nominatim(user_agent="Toro_explorer")
dftoro['Latitude'] = 0.0
dftoro['Longitude'] = 0.0

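# note: the "-1" stops one row short, so the last row is never geocoded and keeps its 0.0 placeholders
for i in range(len(dftoro['Neighborhood'])-1):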
    site = dftoro['Postcode'][i] #location if postal code is found
    address = '{}, {}'.format(site, dftoro['Neighborhood'][i])
    location = geolocator.geocode(address)
    
    if (location is None ):
        address = '{}, Toronto'.format(dftoro['Neighborhood'][i].split(', ')[0])
        location = geolocator.geocode(address) #location if the first neighborhood is found
        
        if(location is None):
            location = geolocator.geocode('Canada, Toronto') #location of Toronto as last resort

    dftoro['Latitude'][i] = location.latitude
    dftoro['Longitude'][i] = location.longitude

dftoro

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.809196,-79.221701
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.790117,-79.173334
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.754899,-79.197776
3,M1G,Scarborough,Woburn,43.759824,-79.225291
4,M1H,Scarborough,Cedarbrae,43.756467,-79.226692
5,M1J,Scarborough,Scarborough Village,43.743742,-79.211632
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.714167,-79.271109
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.708823,-79.295986
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.721939,-79.236232
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.702112,-79.260091


After running the geocoding loop and inspecting our dataframe, we see that some places (M5G and M9W) have coordinates outside Toronto. To see this more clearly, let's visualize the points on a map:

In [17]:
# create a map centered on Toronto, zoomed out (zoom_start=1) so the outliers M5G and M9W are visible
mapa = folium.Map(location=[latitudeT, longitudeT], zoom_start=1)

# add markers to map
for lat, lng, label in zip(dftoro['Latitude'], dftoro['Longitude'], dftoro['Postcode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(mapa)  
    
mapa

The two points on the right are the two postal codes (M5G and M9W) that the geocoding step above did not place in Toronto.

Let's just drop those two:

In [18]:
dftoro = dftoro.drop(dftoro.index[102])
dftoro = dftoro.drop(dftoro.index[57])
dftoro = dftoro.reset_index(drop=True)
dftoro

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.809196,-79.221701
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.790117,-79.173334
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.754899,-79.197776
3,M1G,Scarborough,Woburn,43.759824,-79.225291
4,M1H,Scarborough,Cedarbrae,43.756467,-79.226692
5,M1J,Scarborough,Scarborough Village,43.743742,-79.211632
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.714167,-79.271109
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.708823,-79.295986
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.721939,-79.236232
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.702112,-79.260091


In [19]:
dftoro.shape

(101, 5)
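
Dropping rows by position works here, but it depends on the row order staying the same between runs. An alternative is to filter on the coordinates themselves. The sketch below keeps only points inside a rough bounding box; the box limits and the sample rows are assumptions chosen for illustration, not values taken from this notebook.

```python
import pandas as pd

# Hypothetical rows: two plausible Toronto points and two outliers
pts = pd.DataFrame({
    'Postcode': ['X1A', 'X2A', 'X3A', 'X4A'],
    'Latitude': [43.75, 43.66, 0.0, 51.5],
    'Longitude': [-79.25, -79.39, 0.0, -0.1],
})

# Assumed rough limits around Toronto (illustrative, adjust as needed)
LAT_MIN, LAT_MAX = 43.5, 43.9
LNG_MIN, LNG_MAX = -79.7, -79.1

# Boolean mask that is True only for points inside the box
inside = (pts['Latitude'].between(LAT_MIN, LAT_MAX)
          & pts['Longitude'].between(LNG_MIN, LNG_MAX))

kept = pts[inside].reset_index(drop=True)
print(kept)
```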

Now we can inspect the map of Toronto:

In [20]:
# create map of Toronto using latitude and longitude values
mapa = folium.Map(location=[latitudeT, longitudeT], zoom_start=10.3)

# add markers to map
for lat, lng, label in zip(dftoro['Latitude'], dftoro['Longitude'], dftoro['Postcode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(mapa)  
    
mapa

##### Let's define Foursquare Credentials and Version:

In [21]:
CLIENT_ID = 'MVSE2VXCQCN44RWIKHPTF1HHIDGIIEOL4DWIK41SW3SLJ31Y' # your Foursquare ID
CLIENT_SECRET = 'RQAHVOBYGC2HIUHSQRBC1F0WXE1J4BES5ZLTTWK1CHXOHIR1' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
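
Credentials written directly into a notebook are visible to anyone who reads it. One common alternative is to read them from environment variables. The sketch below assumes two variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are hypothetical and would need to be set before starting the notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping (the variable names are assumed)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret

# Example with an explicit mapping instead of the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
CLIENT_ID_DEMO, CLIENT_SECRET_DEMO = load_foursquare_credentials(demo_env)
print(CLIENT_ID_DEMO)
```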

As in the lab, let's explore the first neighborhood in our dataframe:

In [22]:
dftoro.loc[0, 'Neighborhood']

'Rouge, Malvern'

Get the neighborhood's latitude and longitude values.

In [23]:
neighborhood_latitude = dftoro.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = dftoro.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = dftoro.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rouge, Malvern are 43.8091955, -79.2217008.


#### Now, let's get the top 100 venues that are in Rouge, Malvern within a radius of 500 meters.

First, let's create the GET request URL

In [24]:
radius = 500
LIMIT = 100
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()
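
The URL above is assembled with `str.format`. The same query string can be built from a dictionary, which keeps each parameter next to its name and escapes values automatically. This sketch uses only the standard library and placeholder credentials; `build_explore_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL used in this notebook from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters, e.g. the comma in 'll' becomes %2C
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials; the coordinates are the ones printed above
demo_url = build_explore_url('ID', 'SECRET', '20180605', 43.8091955, -79.2217008, 500, 100)
print(demo_url)
```

`requests.get` also accepts a `params` dictionary and performs this encoding itself, so the helper is mainly useful for inspecting the URL before sending it.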

From the Foursquare lab in the previous module, we know that all the information is in the items key. Before we proceed, let's borrow the get_category_type function from the Foursquare lab.

In [25]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a pandas dataframe.

In [26]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Shoppers Drug Mart,Pharmacy,43.809202,-79.22332
1,Subway,Sandwich Place,43.806805,-79.222515
2,Pizza Pizza,Pizza Place,43.806613,-79.221243
3,Pizza Hut,Pizza Place,43.808326,-79.220616
4,Francois' No Frills,Grocery Store,43.808518,-79.223399


In [27]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

10 venues were returned by Foursquare.


### Let's create a function to repeat the same process to all the neighborhoods in Toronto

Using the code from the lab but for a shorter radius:

In [28]:
def getNearbyVenues(names, latitudes, longitudes, radius=200):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
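
The list comprehension above indexes `v['venue']['categories'][0]` directly, so a venue without any category would raise an `IndexError`. A more defensive version of that extraction step might look like the sketch below. It works on plain dictionaries shaped like the `items` entries used above; the helper name `venue_row` and the sample items are illustrative assumptions.

```python
def venue_row(name, lat, lng, item):
    """Flatten one 'items' entry into a tuple, tolerating a missing category."""
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else None
    return (name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category)

# Made-up items in the same shape as the response entries
items = [
    {'venue': {'name': 'Sample Cafe', 'location': {'lat': 43.0, 'lng': -79.0},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled Spot', 'location': {'lat': 43.1, 'lng': -79.1},
               'categories': []}},
]

rows = [venue_row('Sample Hood', 43.05, -79.05, it) for it in items]
print(rows)
```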

To see all the venues in Toronto:

In [29]:
toro_venues = getNearbyVenues(names=dftoro['Neighborhood'],
                                   latitudes=dftoro['Latitude'],
                                   longitudes=dftoro['Longitude']
                                  )

Let's check the size of the resulting dataframe and look at its first rows:

In [30]:
print(toro_venues.shape)
toro_venues.head()

(954, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.809196,-79.221701,Shoppers Drug Mart,43.809202,-79.22332,Pharmacy
1,"Rouge, Malvern",43.809196,-79.221701,Pizza Hut,43.808326,-79.220616,Pizza Place
2,"Rouge, Malvern",43.809196,-79.221701,Francois' No Frills,43.808518,-79.223399,Grocery Store
3,"Rouge, Malvern",43.809196,-79.221701,Circle K,43.808097,-79.220449,Convenience Store
4,"Highland Creek, Rouge Hill, Port Union",43.790117,-79.173334,Highland Creek,43.790281,-79.173703,Neighborhood


Let's check how many venues were returned for each neighborhood

In [31]:
toro_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",28,28,28,28,28,28
Agincourt,9,9,9,9,9,9
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",9,9,9,9,9,9
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",10,10,10,10,10,10
"Alderwood, Long Branch",6,6,6,6,6,6
"Bathurst Manor, Downsview North, Wilson Heights",1,1,1,1,1,1
Bayview Village,10,10,10,10,10,10
"Bedford Park, Lawrence Manor East",1,1,1,1,1,1
Berczy Park,49,49,49,49,49,49
"Birch Cliff, Cliffside West",3,3,3,3,3,3


#### Let's find out how many unique categories can be curated from all the returned venues

In [32]:
print('There are {} unique categories.'.format(len(toro_venues['Venue Category'].unique())))

There are 188 unique categories.


## Analyze Each Neighborhood

In [33]:
# one hot encoding
toro_onehot = pd.get_dummies(toro_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toro_onehot['Neighborhood'] = toro_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toro_onehot.columns[-1]] + list(toro_onehot.columns[:-1])
toro_onehot = toro_onehot[fixed_columns]

print(toro_onehot.shape)
toro_onehot.head()

(954, 188)


Unnamed: 0,Yoga Studio,Adult Boutique,American Restaurant,Amphitheater,Aquarium,Art Gallery,Art Museum,Asian Restaurant,Auto Garage,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Bus Station,Bus Stop,Business Service,Butcher,Café,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Chiropractor,Chocolate Shop,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Convention Center,Cosmetics Shop,Creperie,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Donut Shop,Dumpling Restaurant,Electronics Store,Empanada Restaurant,Ethiopian Restaurant,Fast Food Restaurant,Fish Market,Flower Shop,Food & Drink Shop,Food Court,Food Stand,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Gas Station,Gastropub,General Entertainment,General Travel,German Restaurant,Gift Shop,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Hakka Restaurant,Hardware Store,Health & Beauty Service,History Museum,Hobby Shop,Home Service,Hong Kong Restaurant,Hotel,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Karaoke Bar,Korean Restaurant,Lake,Latin American Restaurant,Laundry Service,Library,Lingerie Store,Liquor Store,Martial Arts Dojo,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Mobile Phone Shop,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Moving Target,Museum,Nail Salon,Neighborhood,New American Restaurant,Nightclub,Noodle House,North Indian Restaurant,Office,Other Great Outdoors,Outdoor Sculpture,Outdoor Supply Store,Paper / Office Supplies Store,Park,Persian Restaurant,Pet Store,Pharmacy,Pizza Place,Playground,Pool,Pool Hall,Pub,Ramen Restaurant,Rental Car Location,Restaurant,Rock Climbing Spot,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shipping Store,Shoe Store,Shopping Mall,Skating Rink,Soccer Field,Soup Place,South American Restaurant,Spa,Sporting Goods Shop,Sports Bar,Steakhouse,Storage Facility,Supermarket,Sushi Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Train Station,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"Highland Creek, Rouge Hill, Port Union",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
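
In the header above, `Yoga Studio` appears first and `Neighborhood` sits in the middle of the category columns, and the shape (954, 188) matches the 188 unique categories with no extra column for the label. This is consistent with a name collision: the venues table contains a venue category that is itself called `Neighborhood`, so assigning `toro_onehot['Neighborhood']` overwrites that dummy column instead of appending a new last column. The sketch below shows one possible way around it on made-up data, renaming the colliding dummy column first; the new column name is an arbitrary choice.

```python
import pandas as pd

# Made-up venues; one category is literally named 'Neighborhood'
venues = pd.DataFrame({
    'Neighborhood': ['Hood A', 'Hood A', 'Hood B'],
    'Venue Category': ['Café', 'Neighborhood', 'Yoga Studio'],
})

onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# Rename the dummy column so it cannot clash with the label column (name is arbitrary)
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (category)'})

# Insert the label column explicitly at position 0 instead of relying on column order
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

print(list(onehot.columns))
```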


#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [34]:
toro_grouped = toro_onehot.groupby('Neighborhood').mean().reset_index()
toro_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,American Restaurant,Amphitheater,Aquarium,Art Gallery,Art Museum,Asian Restaurant,Auto Garage,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Bus Station,Bus Stop,Business Service,Butcher,Café,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Chiropractor,Chocolate Shop,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Convention Center,Cosmetics Shop,Creperie,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Donut Shop,Dumpling Restaurant,Electronics Store,Empanada Restaurant,Ethiopian Restaurant,Fast Food Restaurant,Fish Market,Flower Shop,Food & Drink Shop,Food Court,Food Stand,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Gas Station,Gastropub,General Entertainment,General Travel,German Restaurant,Gift Shop,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Hakka Restaurant,Hardware Store,Health & Beauty Service,History Museum,Hobby Shop,Home Service,Hong Kong Restaurant,Hotel,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Karaoke Bar,Korean Restaurant,Lake,Latin American Restaurant,Laundry Service,Library,Lingerie Store,Liquor Store,Martial Arts Dojo,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Mobile Phone Shop,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Moving Target,Museum,Nail Salon,New American Restaurant,Nightclub,Noodle House,North Indian Restaurant,Office,Other Great Outdoors,Outdoor Sculpture,Outdoor Supply Store,Paper / Office Supplies Store,Park,Persian Restaurant,Pet Store,Pharmacy,Pizza Place,Playground,Pool,Pool Hall,Pub,Ramen Restaurant,Rental Car Location,Restaurant,Rock Climbing Spot,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shipping Store,Shoe Store,Shopping Mall,Skating Rink,Soccer Field,Soup Place,South American Restaurant,Spa,Sporting Goods Shop,Sports Bar,Steakhouse,Storage Facility,Supermarket,Sushi Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Train Station,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,"Adelaide, King, Richmond",0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.035714,0.0,0.035714,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.107143,0.035714,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.035714,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.222222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.222222,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.166667,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
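
To see why the mean of one-hot columns gives a frequency, it helps to work through a tiny case by hand. The sketch below uses a made-up venues table (the names and categories are illustrative assumptions): a neighborhood with two cafés and one park ends up with 2/3 for `Café` and 1/3 for `Park`.

```python
import pandas as pd

# Made-up venues: Hood A has two cafés and one park, Hood B has one park
venues = pd.DataFrame({
    'Neighborhood': ['Hood A', 'Hood A', 'Hood A', 'Hood B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Park'],
})

# One 0/1 column per category, stored as floats
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column within a group is the share of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```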


#### Let's confirm the new size

In [35]:
toro_grouped.shape

(94, 188)

#### Let's print each neighborhood along with the top 5 most common venues

In [36]:
num_top_venues = 5

for hood in toro_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toro_grouped[toro_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
                 venue  freq
0                Hotel  0.11
1  American Restaurant  0.07
2          Coffee Shop  0.07
3   Seafood Restaurant  0.07
4                 Café  0.07


----Agincourt----
                   venue  freq
0       Asian Restaurant  0.22
1     Chinese Restaurant  0.22
2             Food Court  0.11
3  Vietnamese Restaurant  0.11
4          Shopping Mall  0.11


----Agincourt North, L'Amoreaux East, Milliken, Steeles East----
                  venue  freq
0           Wings Joint  0.11
1    Dim Sum Restaurant  0.11
2   Fried Chicken Joint  0.11
3  Fast Food Restaurant  0.11
4        Sandwich Place  0.11


----Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown----
                  venue  freq
0              Pharmacy   0.1
1           Auto Garage   0.1
2           Pizza Place   0.1
3  Fast Food Restaurant   0.1
4          Liquor Store   0.1


----Alderwood, Long Branch----
       

### Let's put that into a pandas dataframe

First, let's write a function to sort the venues in descending order.

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
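
A quick way to check the function is to call it on a single hand-written row. The sketch below repeats the function so that it runs on its own and applies it to a made-up frequency row; the category values are assumptions, chosen to be distinct so that the ordering is unambiguous.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as above: skip the first entry (the neighborhood name), sort the rest
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Made-up frequency row: the first entry is the label, the rest are frequencies
row = pd.Series({'Neighborhood': 'Hood A', 'Café': 0.5, 'Park': 0.2, 'Pharmacy': 0.3})

top_two = list(return_most_common_venues(row, 2))
print(top_two)
```

When several categories share the same frequency, as in some of the printed lists above, their relative order may differ between runs of the sort, because the default sorting algorithm of `sort_values` is not a stable one.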

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [38]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toro_grouped['Neighborhood']

for ind in np.arange(toro_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toro_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide, King, Richmond | Hotel | Coffee Shop | American Restaurant | Seafood Restaurant | Café | Bar | Salon / Barbershop | Latin American Restaurant | Food & Drink Shop | Burger Joint |
| 1 | Agincourt | Asian Restaurant | Chinese Restaurant | Coffee Shop | Food Court | Rental Car Location | Shopping Mall | Vietnamese Restaurant | Creperie | Dance Studio | Fast Food Restaurant |
| 2 | Agincourt North, L'Amoreaux East, Milliken, St... | Wings Joint | Ice Cream Shop | Fried Chicken Joint | Sandwich Place | Dim Sum Restaurant | Chinese Restaurant | Fast Food Restaurant | Pizza Place | Clothing Store | Dance Studio |
| 3 | Albion Gardens, Beaumond Heights, Humbergate, ... | Liquor Store | Pizza Place | Beer Store | Fried Chicken Joint | Auto Garage | Pharmacy | Fast Food Restaurant | Grocery Store | Video Store | Hardware Store |
| 4 | Alderwood, Long Branch | Dance Studio | Coffee Shop | Pharmacy | Pizza Place | Pool | Pub | Dessert Shop | Empanada Restaurant | Electronics Store | Dumpling Restaurant |
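
The ordinal column headers above are built from a three-item suffix list, which is enough for ten columns. If `num_top_venues` were raised past 20, that approach would produce labels such as "21th". Below is a small sketch of a more general helper; the name `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(n):
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```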


# Cluster Neighborhoods

Run k-means to group the neighborhoods into 5 clusters.

In [39]:
# set number of clusters
kclusters = 5

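# keep only the numeric venue-frequency columns (the positional axis argument was removed in pandas 2.0)
toro_grouped_clustering = toro_grouped.drop(columns='Neighborhood')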

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toro_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 1], dtype=int32)
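
The label values themselves are arbitrary identifiers: k-means only guarantees that rows with similar venue frequencies share a label. The sketch below illustrates this on a made-up frequency matrix (the numbers are assumptions chosen so that two groups are obvious). `n_init` is set explicitly because its default changed between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up venue frequencies: rows are neighborhoods, columns are venue categories
toy_freqs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_freqs)
toy_labels = toy_kmeans.labels_

# the first two rows share one label and the last two share the other;
# which group is called 0 and which is called 1 carries no meaning
print(toy_labels)
```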

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [40]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toro_merged = dftoro

# join the cluster labels and top venues onto dftoro, which holds latitude/longitude for each neighborhood
toro_merged = toro_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toro_merged.head()

| | Postcode | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.809196 | -79.221701 | 3.0 | Grocery Store | Pharmacy | Pizza Place | Convenience Store | Wings Joint | Dim Sum Restaurant | Ethiopian Restaurant | Empanada Restaurant | Electronics Store | Dumpling Restaurant |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.790117 | -79.173334 | 3.0 | Wings Joint | Discount Store | Flower Shop | Fish Market | Fast Food Restaurant | Ethiopian Restaurant | Empanada Restaurant | Electronics Store | Dumpling Restaurant | Donut Shop |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.754899 | -79.197776 | 3.0 | Train Station | Coffee Shop | Storage Facility | Wings Joint | Discount Store | Fish Market | Fast Food Restaurant | Ethiopian Restaurant | Empanada Restaurant | Electronics Store |
| 3 | M1G | Scarborough | Woburn | 43.759824 | -79.225291 | 3.0 | Fast Food Restaurant | Pizza Place | Pharmacy | Discount Store | Furniture / Home Store | Beer Store | Supermarket | Clothing Store | Coffee Shop | Vietnamese Restaurant |
| 4 | M1H | Scarborough | Cedarbrae | 43.756467 | -79.226692 | 3.0 | Spa | Toy / Game Store | Wings Joint | Diner | Fish Market | Fast Food Restaurant | Ethiopian Restaurant | Empanada Restaurant | Electronics Store | Dumpling Restaurant |


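Some values in the `Cluster Labels` column are null. The join leaves NaN for any neighborhood in `dftoro` that has no matching row in `neighborhoods_venues_sorted` (presumably neighborhoods for which no venues were returned). We give those rows a placeholder label of 5: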

In [41]:
print('Unique values of the label before fixing: ',toro_merged['Cluster Labels'].unique())

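# fill only the label column; assigning 5 to the whole row would also overwrite the postcode and coordinates
toro_merged['Cluster Labels'] = toro_merged['Cluster Labels'].fillna(5)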

print('Unique values of the label after fixing: ',toro_merged['Cluster Labels'].unique())

Unique values of the label before fixing:  [ 3.  1. nan  2.  0.  4.]
Unique values of the label after fixing:  [3. 1. 5. 2. 0. 4.]
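
The sketch below reproduces the situation on toy data to show where the NaN comes from and why the label column prints as floats. The frames `all_hoods` and `clustered` are hypothetical stand-ins for `dftoro` and `neighborhoods_venues_sorted`, and the values are made up.

```python
import pandas as pd

# hypothetical stand-ins for dftoro and neighborhoods_venues_sorted
all_hoods = pd.DataFrame({'Neighborhood': ['A', 'B', 'C'],
                          'Latitude': [43.1, 43.2, 43.3]})
clustered = pd.DataFrame({'Neighborhood': ['A', 'C'],
                          'Cluster Labels': [0, 1]})

# 'B' has no match on the right side, so its label becomes NaN
# and the whole column is upcast from int to float
merged = all_hoods.join(clustered.set_index('Neighborhood'), on='Neighborhood')

# fill the gap in the label column only, then restore the integer type
merged['Cluster Labels'] = merged['Cluster Labels'].fillna(2).astype(int)
print(merged)
```

Filling a single column leaves the coordinates of the unmatched neighborhood untouched, so it can still be drawn on a map.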


## Finally, let's visualize the resulting clusters

In [47]:
map_clusters = folium.Map(location=[latitudeT, longitudeT], zoom_start=10)

# set color scheme: one color per k-means cluster plus one for the placeholder label 5
n_labels = kclusters + 1
colors_array = cm.rainbow(np.linspace(0, 1, n_labels))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toro_merged['Latitude'], toro_merged['Longitude'], toro_merged['Neighborhood'], toro_merged['Cluster Labels']):
    cluster = int(cluster)  # labels are stored as floats after the join
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],       # same color for border and fill
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

## Disclaimer

This document follows the grading requirements of the final course of the IBM Data Science Professional Certificate.

The dataset is partitioned as the assignment instructions require.

Thank you for taking the time to read this code.

Wish you the best.
