# Applied Data Science Capstone - Week 3

### Peer-graded Assignment: Segmenting and Clustering Neighbourhoods in Toronto

###### Q1
First, we scrape the postal code data from a given Wikipedia page and then transform it into a pandas dataframe.

In [1]:
# Import Libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# Get the table of postal codes from the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
Canada_data = BeautifulSoup(source, 'lxml')

In [3]:
# Create a new Pandas DataFrame
column_names = ['Postalcode','Borough','Neighborhood']
Toronto = pd.DataFrame(columns = column_names)

In [4]:
# Loop through the table rows to find postcode, borough, neighborhood
content = Canada_data.find('div', class_='mw-parser-output')
table = content.table.tbody
postcode = 0
borough = 0
neighborhood = 0
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0

for tr in table.find_all('tr'):
    i = 0
    for td in tr.find_all('td'):
        if i == 0:
            postcode = td.text
            i = i + 1
        elif i == 1:
            borough = td.text
            i = i + 1
        elif i == 2:
            neighborhood = td.text.strip('\n').replace(']','')
    rows.append({'Postalcode': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

# The header row has no <td> cells, so it yields a placeholder row of zeros
Toronto = pd.DataFrame(rows, columns=column_names)

In [5]:
Toronto.head

<bound method NDFrame.head of     Postalcode           Borough  \
0            0                 0   
1          M1A      Not assigned   
2          M2A      Not assigned   
3          M3A        North York   
4          M4A        North York   
5          M5A  Downtown Toronto   
6          M6A        North York   
7          M6A        North York   
8          M7A  Downtown Toronto   
9          M8A      Not assigned   
10         M9A         Etobicoke   
11         M1B       Scarborough   
12         M1B       Scarborough   
13         M2B      Not assigned   
14         M3B        North York   
15         M4B         East York   
16         M4B         East York   
17         M5B  Downtown Toronto   
18         M5B  Downtown Toronto   
19         M6B        North York   
20         M7B      Not assigned   
21         M8B      Not assigned   
22         M9B         Etobicoke   
23         M9B         Etobicoke   
24         M9B         Etobicoke   
25         M9B         Etobicoke  

In [6]:
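# Clean Pandas DataFrame
Toronto = Toronto[Toronto.Borough!='Not assigned']
Toronto = Toronto[Toronto.Borough!= 0]
Toronto.reset_index(drop = True, inplace = True)
# If a neighborhood is "Not assigned", fall back to the borough name.
# A single .iloc[row, col] call is used because chained indexing
# (.iloc[i][2] = ...) may write to a temporary copy instead of the DataFrame.
for i in range(0,Toronto.shape[0]):
    if Toronto.iloc[i, 2] == 'Not assigned':
        Toronto.iloc[i, 2] = Toronto.iloc[i, 1]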
                                 
df = Toronto.groupby(['Postalcode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
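The same cleaning steps can also be written without an explicit loop. The sketch below is a minimal illustration on a few made-up rows (the postal codes, boroughs and neighborhoods are placeholders, not the scraped data). It drops unassigned boroughs, falls back to the borough name where the neighborhood is missing, and joins neighborhoods that share a postal code.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped table
rows = [
    ("X1A", "Not assigned", "Not assigned"),
    ("X2A", "Borough A", "Hood 1"),
    ("X2A", "Borough A", "Hood 2"),
    ("X3A", "Borough B", "Not assigned"),
]
sample = pd.DataFrame(rows, columns=["Postalcode", "Borough", "Neighborhood"])

# Keep only rows whose borough is known
sample = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

# Where the neighborhood is missing, fall back to the borough name
missing = sample["Neighborhood"] == "Not assigned"
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]

# One row per postal code, with its neighborhoods joined by commas
cleaned = (
    sample.groupby(["Postalcode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(cleaned)
```

Using a boolean mask with `.loc` updates the DataFrame in place, which avoids the copy-versus-view ambiguity of chained indexing.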


Next, we clean our data:

In [7]:
# Drop rows with a "Not assigned" value:
df = df.dropna()
empty = 'Not assigned'
df = df[(df.Postalcode != empty ) & (df.Borough != empty) & (df.Neighborhood != empty)]
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [8]:
def neighborhood_list(grouped):    
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))
                    
grp = df.groupby(['Postalcode', 'Borough'])
df2 = grp.apply(neighborhood_list).reset_index(name='Neighborhood')

In [9]:
print(df2.shape)
df2.head()

(103, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


###### Q2
We will use the Geocoder Python package to get the latitude and longitude coordinates of each neighborhood.

Because the package can be unreliable and may return no result for a request, we call it in a while loop for each postal code until it returns coordinates, so that every neighborhood ends up with a latitude and longitude.




In [10]:
# Install GeoCoder if not already installed
!conda install -c conda-forge geocoder --yes


Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geocoder


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ratelim-0.1.6              |             py_2           6 KB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    geocoder-1.38.1            |             py_1          53 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.4 MB

The following NEW packages will be INSTALLED:

    geocoder:        1.38.1-py_1       conda-forge
    ratelim:         0.1.6-py_2        conda-forge


In [11]:
import geocoder


def get_latlng(postal_code):
    # initialize the coordinates to None
    lat_lng_coords = None

    # loop until the geocoder returns coordinates
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords
    
# test get_latlng()
get_latlng('M1B')

[43.811525000000074, -79.19551746399998]
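The while loop above never terminates if the service keeps returning nothing. As a hedged alternative, the sketch below caps the number of attempts. `get_latlng_bounded`, `max_attempts`, `delay` and the `flaky_lookup` stand-in are hypothetical names introduced only for this illustration. In the notebook, the `lookup` argument could be a function that wraps `geocoder.arcgis(...).latlng`.

```python
import time


def get_latlng_bounded(postal_code, lookup, max_attempts=5, delay=0.0):
    """Call `lookup` up to `max_attempts` times; return None if it never succeeds."""
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        time.sleep(delay)  # optional pause between attempts
    return None


# Stand-in for an unreliable geocoder: fails twice, then answers
calls = {"n": 0}


def flaky_lookup(postal_code):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.81, -79.19]


result = get_latlng_bounded("M1B", flaky_lookup)
print(result, "after", calls["n"], "attempts")
```

With a cap in place, postal codes that never resolve come back as `None` and can be handled explicitly, rather than blocking the cell.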

In [15]:
# Retrieve postal code coordinates
postal_codes = df2['Postalcode']    
coords = [ get_latlng(postal_code) for postal_code in postal_codes.tolist() ]

In [16]:
# Add Lat and Long columns
coords_df2 = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df2['Latitude'] = coords_df2['Latitude']
df2['Longitude'] = coords_df2['Longitude']

In [17]:
# test
df2[df2.Postalcode == 'M1B']

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517


In [18]:
df2.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193
3,M1G,Scarborough,Woburn,43.768369,-79.21759
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944


###### Q3
Clustering


In [19]:
import numpy as np   # Library to handle data in a vectorized manner
import json          # Library to handle JSON files

!conda install -c conda-forge geopy --yes # Install geopy if not already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # Install folium if not already installed
import folium # map rendering library

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          92 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.50-py_0   conda-forge
    geopy:         1.21.0-py_0 conda-forge


Downloading and Extracting Packages
geopy-1.21.0         | 58 KB     | ##################################### | 100% 
geographiclib-1.50   | 34 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

In [20]:
address = 'Toronto'

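# user_agent is an arbitrary application name (assumed here); recent geopy versions require one
geolocator = Nominatim(user_agent="toronto_explorer")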
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df2['Latitude'], df2['Longitude'], df2['Borough'], df2['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto



The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [22]:
scarborough_data = df2[df2['Borough'] == 'Scarborough'].reset_index(drop=True)
address1 = 'Scarborough,Toronto'

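# user_agent is an arbitrary application name (assumed here); recent geopy versions require one
geolocator1 = Nominatim(user_agent="toronto_explorer")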
location1 = geolocator1.geocode(address1)
latitude1 = location1.latitude
longitude1 = location1.longitude
print('The geograpical coordinate of Scarborough are {}, {}.'.format(latitude1, longitude1))

The geograpical coordinate of Scarborough are 43.773077, -79.257774.




In [23]:
map_scarb = folium.Map(location=[latitude1, longitude1], zoom_start=11)

# add markers to map
for lat, lng, label in zip(scarborough_data['Latitude'], scarborough_data['Longitude'], scarborough_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scarb)  
    
map_scarb

In [82]:
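# The code was removed by Watson Studio for sharing.
# (This hidden cell presumably defined CLIENT_ID, CLIENT_SECRET and VERSION, which later cells use.)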
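Since the credentials cell is hidden, one way to supply the values without writing them into the notebook is to read them from environment variables. The sketch below is only an assumption about how that could look. The function name, the variable names `FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET` and `FOURSQUARE_VERSION`, and the placeholder version date are all hypothetical.

```python
import os


def load_foursquare_credentials(env=None):
    """Read Foursquare credentials from an environment-style mapping (names are hypothetical)."""
    env = os.environ if env is None else env
    return {
        "CLIENT_ID": env["FOURSQUARE_CLIENT_ID"],
        "CLIENT_SECRET": env["FOURSQUARE_CLIENT_SECRET"],
        # API version date in YYYYMMDD form; the default is only a placeholder
        "VERSION": env.get("FOURSQUARE_VERSION", "20180605"),
    }


# Demonstration with an in-memory mapping instead of the real environment
demo = load_foursquare_credentials({
    "FOURSQUARE_CLIENT_ID": "my-id",
    "FOURSQUARE_CLIENT_SECRET": "my-secret",
})
print(sorted(demo))
```

A missing ID or secret raises `KeyError` immediately, so a misconfigured environment fails before any API request is made.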

In [28]:
neighborhood_latitude = scarborough_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = scarborough_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = scarborough_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

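LIMIT = 100
radius = 500
# Note: this request is centred on Scarborough's coordinates (latitude1, longitude1),
# not on the neighborhood coordinates printed above
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude1, longitude1, VERSION, radius, LIMIT)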

Latitude and longitude values of Rouge, Malvern are 43.811525000000074, -79.19551746399998.
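The request URL can also be assembled from a dictionary of parameters, which keeps each value next to its name and handles the escaping. This is a minimal sketch using the standard library. `build_explore_url` is a hypothetical helper, and the ID, secret and version date passed to it are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build a venues/explore URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes special characters (the comma in "ll" becomes %2C)
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; the coordinates are the Scarborough values printed earlier
example_url = build_explore_url("ID", "SECRET", "20180605", 43.773077, -79.257774)
print(example_url)
```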


In [29]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [30]:
results = requests.get(url).json()
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(12)

Unnamed: 0,name,categories,lat,lng
0,SEPHORA,Cosmetics Shop,43.775017,-79.258109
1,Disney Store,Toy / Game Store,43.775537,-79.256833
2,American Eagle Outfitters,Clothing Store,43.776012,-79.258334
3,St. Andrews Fish & Chips,Fish & Chips Shop,43.771865,-79.252645
4,Tommy Hilfiger,Clothing Store,43.776015,-79.257369
5,DAVIDsTEA,Tea Room,43.77632,-79.258688
6,Chipotle Mexican Grill,Mexican Restaurant,43.77641,-79.258069
7,Hot Topic,Clothing Store,43.77545,-79.257929
8,Coliseum Scarborough Cinemas,Movie Theater,43.775995,-79.255649
9,Shoppers Drug Mart,Pharmacy,43.773305,-79.251662


In [46]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

42 venues were returned by Foursquare.


In [31]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
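The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would fail for a venue that has no category. The sketch below shows one way to make that step tolerant. `extract_venues` is a hypothetical helper, and the sample items only imitate the shape of the response used above (the second venue is made up).

```python
def extract_venues(name, lat, lng, items):
    """Flatten Foursquare-style items into tuples, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Items shaped like results['response']['groups'][0]['items']
sample_items = [
    {"venue": {"name": "Heron Park",
               "location": {"lat": 43.769327, "lng": -79.177201},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorised Spot",  # made-up venue with no category
               "location": {"lat": 43.77, "lng": -79.18},
               "categories": []}},
]
rows = extract_venues("Guildwood, Morningside, West Hill", 43.765815, -79.175193, sample_items)
for row in rows:
    print(row)
```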

In [32]:
scarborough_venues = getNearbyVenues(names=scarborough_data['Neighborhood'],
                                   latitudes=scarborough_data['Latitude'],
                                   longitudes=scarborough_data['Longitude']
                                  )

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West
Upper Rouge


In [47]:
print(scarborough_venues.shape)
scarborough_venues.head()

(89, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725,Scarborough Historical Society,43.788755,-79.162438,History Museum
1,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood, Morningside, West Hill",43.765815,-79.175193,Homestead Roofing Repair,43.76514,-79.178663,Construction & Landscaping
3,"Guildwood, Morningside, West Hill",43.765815,-79.175193,Heron Park Community Centre,43.768867,-79.176958,Gym / Fitness Center
4,"Guildwood, Morningside, West Hill",43.765815,-79.175193,Heron Park,43.769327,-79.177201,Park


In [48]:
print('There are {} uniques categories.'.format(len(scarborough_venues['Venue Category'].unique())))

There are 50 uniques categories.


In [49]:
scarborough_venues.head(3)
print(scarborough_venues.groupby('Neighborhood').count()[:4])

                                                    Neighborhood Latitude  \
Neighborhood                                                                
Agincourt                                                              15   
Agincourt North, L'Amoreaux East, Milliken, Ste...                      2   
Birch Cliff, Cliffside West                                             6   
Cedarbrae                                                               2   

                                                    Neighborhood Longitude  \
Neighborhood                                                                 
Agincourt                                                               15   
Agincourt North, L'Amoreaux East, Milliken, Ste...                       2   
Birch Cliff, Cliffside West                                              6   
Cedarbrae                                                                2   

                                                    Venue  Venue Lat

In [50]:
# one hot encoding
scarb_onehot = pd.get_dummies(scarborough_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
scarb_onehot['Neighborhood'] = scarborough_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [scarb_onehot.columns[-1]] + list(scarb_onehot.columns[:-1])
scarb_onehot = scarb_onehot[fixed_columns]

scarb_grouped = scarb_onehot.groupby('Neighborhood').mean().reset_index()

In [52]:
scarb_onehot.shape

(89, 51)

In [53]:
scarb_grouped = scarb_onehot.groupby('Neighborhood').mean().reset_index()
scarb_grouped

Unnamed: 0,Neighborhood,Auto Garage,Bakery,Bank,Bar,Bistro,Brewery,Bus Line,Bus Station,Business Service,...,Shanghai Restaurant,Shopping Mall,Skating Rink,Soccer Field,Supermarket,Sushi Restaurant,Thai Restaurant,Trail,Train Station,Vietnamese Restaurant
0,Agincourt,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.066667,0.133333,0.066667,0.0,0.066667,0.066667,0.0,0.0,0.0,0.066667
1,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cedarbrae,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0
4,"Clairlea, Golden Mile, Oakridge",0.0,0.2,0.0,0.0,0.0,0.0,0.2,0.1,0.0,...,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0
5,"Clarks Corners, Sullivan, Tam O'Shanter",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.090909,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0
6,"Cliffcrest, Cliffside, Scarborough Village West",0.0,0.0,0.142857,0.0,0.142857,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Dorset Park, Scarborough Town Centre, Wexford ...",0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"East Birchmount Park, Ionview, Kennedy Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Guildwood, Morningside, West Hill",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
scarb_grouped.shape

(15, 51)
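To make the one-hot encoding and the grouped mean concrete, here is a toy example with made-up neighborhoods and categories. Each row of `grouped` holds the share of a neighborhood's venues that fall in each category, which is what `scarb_grouped` contains for the real data.

```python
import pandas as pd

# Made-up venues: neighborhood "A" has three venues, "B" has one
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Bakery", "Pharmacy"],
})

# One indicator column per category (cast to float so the mean is a plain fraction)
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="").astype(float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# The mean of the indicators is the relative frequency of each category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

For neighborhood "A", two of its three venues are parks, so its `Park` value is 2/3. "B" has a single pharmacy, so its `Pharmacy` value is 1.0.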

###### Top 5 most common venues per neighbourhood

In [57]:
num_top_venues = 5

for hood in scarb_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = scarb_grouped[scarb_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                venue  freq
0  Chinese Restaurant  0.13
1       Shopping Mall  0.13
2        Skating Rink  0.07
3    Department Store  0.07
4                Pool  0.07


----Agincourt North, L'Amoreaux East, Milliken, Steeles East----
                  venue  freq
0              Pharmacy   1.0
1           Auto Garage   0.0
2                  Pool   0.0
3  Hong Kong Restaurant   0.0
4     Indian Restaurant   0.0


----Birch Cliff, Cliffside West----
                   venue  freq
0               Gym Pool  0.17
1                    Gym  0.17
2           Skating Rink  0.17
3  General Entertainment  0.17
4                   Park  0.17


----Cedarbrae----
                  venue  freq
0                 Trail   0.5
1            Playground   0.5
2           Auto Garage   0.0
3                  Pool   0.0
4  Hong Kong Restaurant   0.0


----Clairlea, Golden Mile, Oakridge----
           venue  freq
0         Bakery   0.2
1   Intersection   0.2
2       Bus Line   0.2
3  Metro 

###### Dataframe of the top 10 venues for each neighborhood

In [59]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [61]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = scarb_grouped['Neighborhood']

for ind in np.arange(scarb_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scarb_grouped.iloc[ind, :], num_top_venues)
    
neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Shopping Mall,Chinese Restaurant,Vietnamese Restaurant,Department Store,Hong Kong Restaurant,Park,Pool,Shanghai Restaurant,Grocery Store,Skating Rink
1,"Agincourt North, L'Amoreaux East, Milliken, St...",Pharmacy,Vietnamese Restaurant,Construction & Landscaping,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store,Department Store
2,"Birch Cliff, Cliffside West",Gym Pool,Skating Rink,General Entertainment,Park,College Stadium,Gym,Bank,Grocery Store,Gift Shop,Bakery
3,Cedarbrae,Playground,Trail,College Stadium,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store,Department Store
4,"Clairlea, Golden Mile, Oakridge",Bakery,Bus Line,Intersection,Soccer Field,Bus Station,Coffee Shop,Metro Station,Vietnamese Restaurant,Cosmetics Shop,Gift Shop
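The column labels above rely on a three-item suffix list with a fallback to "th", which works for the first ten positions. If more columns were ever needed, a small helper could compute the suffix for any positive integer. `ordinal` below is a hypothetical function written for this illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 10th through 20th, including 11th, 12th, 13th
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Rebuild the same column labels as the cell above
columns = ["Neighborhood"] + ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns)
```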


###### Run K-Means with 5 clusters

In [62]:
# drop row 16 (Upper Rouge), which has no venues in scarborough_venues
scarb_data = scarborough_data.drop(16)
# set number of clusters
kclusters = 5

scarb_grouped_clustering = scarb_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scarb_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 1, 0, 3, 0, 0, 0, 0, 0, 0], dtype=int32)

In [63]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

scarb_merged = scarb_data


# merge scarb_data with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
scarb_merged = scarb_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

scarb_merged.head() # check the last columns!

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517,,,,,,,,,,,
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725,4.0,History Museum,Bar,Construction & Landscaping,Gym,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193,0.0,Construction & Landscaping,Park,Gym / Fitness Center,Vietnamese Restaurant,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store
3,M1G,Scarborough,Woburn,43.768369,-79.21759,0.0,Korean Restaurant,Business Service,Coffee Shop,Park,Vietnamese Restaurant,Convenience Store,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944,3.0,Playground,Trail,College Stadium,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store,Department Store


In [67]:
scarb_merged['Cluster Labels'].isna().sum()

1

In [69]:
scarb_merged_cleaned = scarb_merged.drop(scarb_merged.index[0])
scarb_merged_cleaned['Cluster Labels'].isna().sum()

0
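The missing label comes from the join itself: a neighborhood with no venues has no row in the venue table, so the left join fills its new columns with NaN and the label column becomes floating point (hence values such as 4.0). The toy example below, with made-up names, reproduces that behavior and shows one way to drop the unmatched row and restore integer labels.

```python
import pandas as pd

# Made-up data: neighborhood "A" has no entry in the labels table
left = pd.DataFrame({"Neighborhood": ["A", "B", "C"],
                     "Latitude": [43.81, 43.78, 43.76]})
labels = pd.DataFrame({"Neighborhood": ["B", "C"],
                       "Cluster Labels": [0, 1]})

# Left join on the Neighborhood column, as in the cell that builds scarb_merged
merged = left.join(labels.set_index("Neighborhood"), on="Neighborhood")
print(merged)

# Drop unmatched rows by label value rather than by position, then cast back to int
cleaned = merged.dropna(subset=["Cluster Labels"]).astype({"Cluster Labels": int})
print(cleaned)
```

Dropping on the NaN value rather than on a fixed row position keeps the step correct even if the order of the rows changes.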

In [70]:
scarb_merged_cleaned.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725,4.0,History Museum,Bar,Construction & Landscaping,Gym,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193,0.0,Construction & Landscaping,Park,Gym / Fitness Center,Vietnamese Restaurant,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store
3,M1G,Scarborough,Woburn,43.768369,-79.21759,0.0,Korean Restaurant,Business Service,Coffee Shop,Park,Vietnamese Restaurant,Convenience Store,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944,3.0,Playground,Trail,College Stadium,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store,Department Store
5,M1J,Scarborough,Scarborough Village,43.743125,-79.23175,0.0,Train Station,Grocery Store,Indian Restaurant,Restaurant,Construction & Landscaping,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store


###### Visualize clusters

In [73]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scarb_merged_cleaned['Latitude'], scarb_merged_cleaned['Longitude'], scarb_merged_cleaned['Neighborhood'], scarb_merged_cleaned['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color = rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

###### Examine each cluster

In [74]:
scarb_merged_cleaned.loc[scarb_merged_cleaned['Cluster Labels'] == 0, scarb_merged_cleaned.columns[[1] + list(range(5, scarb_merged_cleaned.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Scarborough,0.0,Construction & Landscaping,Park,Gym / Fitness Center,Vietnamese Restaurant,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store
3,Scarborough,0.0,Korean Restaurant,Business Service,Coffee Shop,Park,Vietnamese Restaurant,Convenience Store,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint
5,Scarborough,0.0,Train Station,Grocery Store,Indian Restaurant,Restaurant,Construction & Landscaping,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store
6,Scarborough,0.0,Discount Store,Hobby Shop,Department Store,Coffee Shop,Vietnamese Restaurant,Construction & Landscaping,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint
7,Scarborough,0.0,Bakery,Bus Line,Intersection,Soccer Field,Bus Station,Coffee Shop,Metro Station,Vietnamese Restaurant,Cosmetics Shop,Gift Shop
8,Scarborough,0.0,Pharmacy,Bank,Gift Shop,Bistro,Discount Store,Sandwich Place,Coffee Shop,Vietnamese Restaurant,Grocery Store,General Entertainment
9,Scarborough,0.0,Gym Pool,Skating Rink,General Entertainment,Park,College Stadium,Gym,Bank,Grocery Store,Gift Shop,Bakery
10,Scarborough,0.0,Bakery,Gift Shop,Brewery,Construction & Landscaping,Vietnamese Restaurant,Convenience Store,Gym,Grocery Store,General Entertainment,Fried Chicken Joint
12,Scarborough,0.0,Shopping Mall,Chinese Restaurant,Vietnamese Restaurant,Department Store,Hong Kong Restaurant,Park,Pool,Shanghai Restaurant,Grocery Store,Skating Rink
13,Scarborough,0.0,Pizza Place,Pharmacy,Coffee Shop,Thai Restaurant,Fried Chicken Joint,Hobby Shop,Fast Food Restaurant,Shopping Mall,Chinese Restaurant,College Stadium


In [75]:
scarb_merged_cleaned.loc[scarb_merged_cleaned['Cluster Labels'] == 1, scarb_merged_cleaned.columns[[1] + list(range(5, scarb_merged_cleaned.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Scarborough,1.0,Pharmacy,Vietnamese Restaurant,Construction & Landscaping,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store,Department Store


In [76]:
scarb_merged_cleaned.loc[scarb_merged_cleaned['Cluster Labels'] == 2, scarb_merged_cleaned.columns[[1] + list(range(5, scarb_merged_cleaned.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,Scarborough,2.0,Auto Garage,Convenience Store,Construction & Landscaping,Gym,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store


In [77]:
scarb_merged_cleaned.loc[scarb_merged_cleaned['Cluster Labels'] == 3, scarb_merged_cleaned.columns[[1] + list(range(5, scarb_merged_cleaned.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Scarborough,3.0,Playground,Trail,College Stadium,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store,Department Store


In [78]:
scarb_merged_cleaned.loc[scarb_merged_cleaned['Cluster Labels'] == 4, scarb_merged_cleaned.columns[[1] + list(range(5, scarb_merged_cleaned.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Scarborough,4.0,History Museum,Bar,Construction & Landscaping,Gym,Grocery Store,Gift Shop,General Entertainment,Fried Chicken Joint,Fast Food Restaurant,Discount Store
