[View in Colaboratory](https://colab.research.google.com/github/loloaquarius/Coursera_Capstone/blob/master/Segmenting_and_Clustering_Neighborhoods_in_Toronto.ipynb)

## Segmenting and Clustering Neighborhoods in Toronto

** Install some dependencies including lxml and geocoder:**

In [1]:
!pip install -U lxml
!pip install geocoder

Requirement already up-to-date: lxml in /usr/local/lib/python3.6/dist-packages (4.2.5)


**Import necessary libraries and modules into project:**

In [0]:
import requests # library to handle requests
import lxml # dependency library using for BeautifulSoup
import pandas as pd # library for data analsysis
from bs4 import BeautifulSoup # library for parsing HTML
import geocoder # import geocoding library
import numpy as np # library to handle data in a vectorized manner

**Using BeautifulSoup to parse content of wiki page:**

In [0]:
wiki = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

soup = BeautifulSoup(wiki.content,'html.parser')

**Parsing Postcode table into pandas dataframe. Since this table missed some values, I need to remove them from dataframe. Then I group them by Postcode and Borough. Eventually, remove inappropriate letters existing in Neighbourhood column:**

In [4]:
df = pd.read_html(str(soup.table))
df = df[0]
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df = df[df.Borough != 'Not assigned']
df = df[df.Neighbourhood != 'Not assigned']
df = df.groupby(['Postcode','Borough'], as_index=False).aggregate({'Neighbourhood':', '.join}).reindex(columns=df.columns)
df.Neighbourhood = df.Neighbourhood.apply(lambda x: x.replace(']',''))
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


**Check the number of rows in my dataframe:**

In [5]:
df.shape[0]

102

**Define a function to get latitude and longtitude of locations based on their postcode:**

In [0]:
def getlatlng(postal_code, lat, lng):
  # initialize your variable to None
  lat_lng_coords = None

  # loop until you get the coordinates
  while(lat_lng_coords is None):
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng
  lat.append(lat_lng_coords[0])
  lng.append(lat_lng_coords[1])

**Initialize 2 list lat and lng, then append new coordinates into these 2 list:**

In [0]:
lat = []
lng = []

In [8]:
for val in list(df['Postcode']):
  print(val)
  getlatlng(val, lat, lng)

M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X
M2H
M2J
M2K
M2L
M2M
M2N
M2P
M2R
M3A
M3B
M3C
M3H
M3J
M3K
M3L
M3M
M3N
M4A
M4B
M4C
M4E
M4G
M4H
M4J
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5M
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6A
M6B
M6C
M6E
M6G
M6H
M6J
M6K
M6L
M6M
M6N
M6P
M6R
M6S
M7R
M7Y
M8V
M8W
M8X
M8Y
M8Z
M9A
M9B
M9C
M9L
M9M
M9N
M9P
M9R
M9V
M9W


**Insert new 2 columns Latitude and Longtitude into my dataframe:**

In [0]:
df['Latitude'] = lat
df['Longtitude'] = lng

**My dataframe after cleaning and editing shown as below:**

In [10]:
df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longtitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [11]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!pip install folium
import folium # map rendering library

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


**In the scope of this assignment, I just explore and cluster neighborhoods in Toronto, then I create a new dataframe namely toronto_df:**

In [12]:
toronto_df = df[df.Borough.str.contains('Toronto')]
toronto_df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longtitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


**Use geopy library to get the latitude and longitude values of Toronto City:**

In [13]:
address = 'Toronto, Ontario'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto City are {}, {}.'.format(latitude, longitude))



The geograpical coordinate of Toronto City are 43.653963, -79.387207.


**Create a map of Toronto with neighborhoods superimposed on top.**

In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, postcode in zip(toronto_df['Latitude'], toronto_df['Longtitude'], toronto_df['Borough'], toronto_df['Postcode']):
    label = '{}, {}'.format(postcode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

**Folium** is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.

**I will segment and cluster the neighbourhoods in Downtown Toronto.**

In [15]:
downtown_df = toronto_df[toronto_df.Borough == 'Downtown Toronto']
downtown_df.index = pd.RangeIndex(len(downtown_df.index))
downtown_df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longtitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


**Create a new dataframe that contain Neighbourhood column:**

In [16]:
downtown_nbh_df = downtown_df.Neighbourhood.str.split(', ', expand=True).stack()
downtown_nbh_df = downtown_nbh_df.to_frame()
downtown_nbh_df.columns = ['Neighbourhood']
downtown_nbh_df = downtown_nbh_df.drop_duplicates()
downtown_nbh_df.index = pd.RangeIndex(len(downtown_nbh_df.index))
downtown_nbh_df

Unnamed: 0,Neighbourhood
0,Rosedale
1,Cabbagetown
2,St. James Town
3,Church and Wellesley
4,Harbourfront
5,Regent Park
6,Ryerson
7,Garden District
8,Berczy Park
9,Central Bay Street


In [0]:

def getLatLngFromGeolocator(neighbourhood):
  location = geolocator.geocode('{}, Toronto, Ontario'.format(neighbourhood))
  if location != None:
    lat.append(location.latitude)
    lng.append(location.longitude)
  else:
    lat.append(0)
    lng.append(0)

In [39]:
lat = []
lng = []
for val in downtown_nbh_df.Neighbourhood:
  print(val)
  getLatLngFromGeolocator(val)


Rosedale
Cabbagetown
St. James Town
Church and Wellesley
Harbourfront
Regent Park
Ryerson
Garden District
Berczy Park
Central Bay Street
Adelaide
King
Richmond
Harbourfront East
Toronto Islands
Union Station
Design Exchange
Toronto Dominion Centre
Commerce Court
Victoria Hotel
Harbord
University of Toronto
Chinatown
Grange Park
Kensington Market
CN Tower
Bathurst Quay
Island airport
Harbourfront West
King and Spadina
Railway Lands
South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place
Underground city
Christie


In [40]:
downtown_nbh_df['Latitude'] = lat
downtown_nbh_df['Longitude'] = lng
downtown_nbh_df

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Rosedale,43.676453,-79.388434
1,Cabbagetown,43.664473,-79.366986
2,St. James Town,43.669403,-79.372704
3,Church and Wellesley,43.665524,-79.383801
4,Harbourfront,43.64008,-79.38015
5,Regent Park,43.660706,-79.360457
6,Ryerson,43.621573,-79.55913
7,Garden District,43.656502,-79.377128
8,Berczy Park,43.648001,-79.375385
9,Central Bay Street,43.644903,-79.381836


**Remove data unrecognized by API:**

In [49]:
downtown_nbh_df = downtown_nbh_df[downtown_nbh_df['Latitude'] != 0]
downtown_nbh_df

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Rosedale,43.676453,-79.388434
1,Cabbagetown,43.664473,-79.366986
2,St. James Town,43.669403,-79.372704
3,Church and Wellesley,43.665524,-79.383801
4,Harbourfront,43.64008,-79.38015
5,Regent Park,43.660706,-79.360457
6,Ryerson,43.621573,-79.55913
7,Garden District,43.656502,-79.377128
8,Berczy Park,43.648001,-79.375385
9,Central Bay Street,43.644903,-79.381836


**Define Foursquare Credentials and Version**

In [0]:
#@title  { display-mode: "code" }
CLIENT_ID = 'UXFVBGDF5ZNT4ZZWO3EOM001OPXDJWXLNPRBWSLU5QJ3HVAN' # your Foursquare ID
CLIENT_SECRET = 'XPVCBUBSI0WZDU2LX1OZPYUVILDFTA2M44UKSXJUVJCBJS5C' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

**Let's create a function to get nearby venues of neighbourhoods in Downtonw Toronto:**

In [0]:
radius = 500
LIMIT = 100
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]
        if results['groups']:
            results = results['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [52]:
toronto_venues = getNearbyVenues(names=downtown_nbh_df['Neighbourhood'],
                                   latitudes=downtown_nbh_df['Latitude'],
                                   longitudes=downtown_nbh_df['Longitude']
                                  )

Rosedale
Cabbagetown
St. James Town
Church and Wellesley
Harbourfront
Regent Park
Ryerson
Garden District
Berczy Park
Central Bay Street
Adelaide
King
Richmond
Harbourfront East
Toronto Islands
Union Station
Design Exchange
Toronto Dominion Centre
Commerce Court
Victoria Hotel
Harbord
University of Toronto
Chinatown
Grange Park
Kensington Market
CN Tower
Bathurst Quay
Island airport
Harbourfront West
King and Spadina
South Niagara
First Canadian Place
Underground city
Christie


**Let's check the size of the resulting dataframe**

In [53]:
print(toronto_venues.shape)
toronto_venues.head()

(2463, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.676453,-79.388434,Black Camel,43.677016,-79.389367,BBQ Joint
1,Rosedale,43.676453,-79.388434,Ramsden Park,43.676068,-79.389705,Park
2,Rosedale,43.676453,-79.388434,Civello Salon & Spa,43.674413,-79.388378,Salon / Barbershop
3,Rosedale,43.676453,-79.388434,Rebel House,43.677661,-79.389935,Bar
4,Rosedale,43.676453,-79.388434,Starbucks,43.678059,-79.39013,Coffee Shop


Let's check how many venues were returned for each neighborhood

In [54]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adelaide,100,100,100,100,100,100
Bathurst Quay,23,23,23,23,23,23
Berczy Park,100,100,100,100,100,100
CN Tower,84,84,84,84,84,84
Cabbagetown,57,57,57,57,57,57
Central Bay Street,100,100,100,100,100,100
Chinatown,100,100,100,100,100,100
Christie,59,59,59,59,59,59
Church and Wellesley,90,90,90,90,90,90
Commerce Court,100,100,100,100,100,100


**Analyze Each Neighborhood**

In [55]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [56]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Adelaide,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01
1,Bathurst Quay,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CN Tower,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.0
4,Cabbagetown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
6,Chinatown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.05,0.01,0.0,0.04,0.0,0.01,0.0,0.0,0.0
7,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.016949,0.0,0.0,0.0,0.0,0.0,0.0
8,Church and Wellesley,0.011111,0.0,0.011111,0.011111,0.0,0.0,0.0,0.0,0.0,...,0.0,0.011111,0.011111,0.0,0.011111,0.0,0.0,0.0,0.011111,0.0
9,Commerce Court,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0


Let's confirm the new size

In [61]:
toronto_grouped.shape

(34, 234)

Let's print each neighborhood along with the top 5 most common venues

In [62]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide----
         venue  freq
0  Coffee Shop  0.09
1         Café  0.06
2        Hotel  0.05
3    Gastropub  0.04
4   Restaurant  0.04


----Bathurst Quay----
                  venue  freq
0           Coffee Shop  0.17
1                  Café  0.13
2                  Park  0.09
3   Japanese Restaurant  0.04
4  Caribbean Restaurant  0.04


----Berczy Park----
         venue  freq
0  Coffee Shop  0.09
1         Café  0.06
2       Bakery  0.04
3        Hotel  0.04
4   Restaurant  0.04


----CN Tower----
                venue  freq
0               Hotel  0.10
1         Coffee Shop  0.10
2  Italian Restaurant  0.05
3          Sports Bar  0.05
4         Pizza Place  0.05


----Cabbagetown----
                venue  freq
0          Restaurant  0.11
1         Coffee Shop  0.07
2                Café  0.07
3   Indian Restaurant  0.05
4  Italian Restaurant  0.04


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.17
1                Café  0.07
2       Deli / 

**Let's put that into a pandas dataframe**

In [0]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [65]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelaide,Coffee Shop,Café,Hotel,Restaurant,Gastropub,American Restaurant,Japanese Restaurant,Deli / Bodega,Asian Restaurant,Breakfast Spot
1,Bathurst Quay,Coffee Shop,Café,Park,Airport Service,Japanese Restaurant,Diner,Sushi Restaurant,Caribbean Restaurant,Garden,Dance Studio
2,Berczy Park,Coffee Shop,Café,Restaurant,Hotel,Bakery,Cocktail Bar,Seafood Restaurant,Beer Bar,Italian Restaurant,Japanese Restaurant
3,CN Tower,Hotel,Coffee Shop,Sports Bar,Italian Restaurant,Pizza Place,Gym,Aquarium,Scenic Lookout,Brewery,Ice Cream Shop
4,Cabbagetown,Restaurant,Coffee Shop,Café,Indian Restaurant,Diner,Pizza Place,Bakery,Italian Restaurant,Japanese Restaurant,Grocery Store
5,Central Bay Street,Coffee Shop,Café,Hotel,Deli / Bodega,Italian Restaurant,Aquarium,Restaurant,Sports Bar,Gym / Fitness Center,Sandwich Place
6,Chinatown,Café,Bar,Vegetarian / Vegan Restaurant,Chinese Restaurant,Mexican Restaurant,Vietnamese Restaurant,Bakery,Dumpling Restaurant,Coffee Shop,Ice Cream Shop
7,Christie,Korean Restaurant,Coffee Shop,Indian Restaurant,Ice Cream Shop,Cocktail Bar,Café,Mexican Restaurant,Dessert Shop,Pub,Japanese Restaurant
8,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Burger Joint,Sushi Restaurant,Café,Bubble Tea Shop,Restaurant,Men's Store,Mediterranean Restaurant
9,Commerce Court,Coffee Shop,Café,Hotel,American Restaurant,Restaurant,Steakhouse,Deli / Bodega,Gastropub,Japanese Restaurant,Burger Joint


**Cluster Neighborhoods**

Run k-means to cluster the neighborhood into 5 clusters.

In [87]:
# set number of clusters
kclusters = 2

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Rename columns of downtown_nbh_df dataframe 

In [88]:
downtown_nbh_df.columns = ['Neighborhood','Latitude','Longitude','Cluster Labels']
downtown_nbh_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels
0,Rosedale,43.676453,-79.388434,0
1,Cabbagetown,43.664473,-79.366986,0
2,St. James Town,43.669403,-79.372704,0
3,Church and Wellesley,43.665524,-79.383801,0
4,Harbourfront,43.64008,-79.38015,0


Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [89]:
toronto_merged = downtown_nbh_df

# add clustering labels
toronto_merged['Cluster Labels'] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Rosedale,43.676453,-79.388434,1,Italian Restaurant,Park,Yoga Studio,Fish Market,BBQ Joint,Sporting Goods Shop,Bakery,Bank,Bar,Juice Bar
1,Cabbagetown,43.664473,-79.366986,1,Restaurant,Coffee Shop,Café,Indian Restaurant,Diner,Pizza Place,Bakery,Italian Restaurant,Japanese Restaurant,Grocery Store
2,St. James Town,43.669403,-79.372704,1,Coffee Shop,Pizza Place,Restaurant,Café,Indian Restaurant,Library,Playground,Breakfast Spot,Beer Store,Filipino Restaurant
3,Church and Wellesley,43.665524,-79.383801,1,Coffee Shop,Japanese Restaurant,Gay Bar,Burger Joint,Sushi Restaurant,Café,Bubble Tea Shop,Restaurant,Men's Store,Mediterranean Restaurant
4,Harbourfront,43.64008,-79.38015,1,Coffee Shop,Café,Hotel,Sushi Restaurant,Brewery,Pizza Place,Bar,Restaurant,History Museum,Plaza


Finally, let's visualize the resulting clusters

In [90]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

**Examine Clusters**

We call this cluster **Diversity Area**

In [91]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,King and Spadina,Coffee Shop,Sandwich Place,Italian Restaurant,Bar,French Restaurant,Gym,Hotel,Pizza Place,Dessert Shop


We call this cluster **Coffee and Hotel Area**

In [92]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Rosedale,Park,Yoga Studio,Fish Market,BBQ Joint,Sporting Goods Shop,Bakery,Bank,Bar,Juice Bar
1,Cabbagetown,Coffee Shop,Café,Indian Restaurant,Diner,Pizza Place,Bakery,Italian Restaurant,Japanese Restaurant,Grocery Store
2,St. James Town,Pizza Place,Restaurant,Café,Indian Restaurant,Library,Playground,Breakfast Spot,Beer Store,Filipino Restaurant
3,Church and Wellesley,Japanese Restaurant,Gay Bar,Burger Joint,Sushi Restaurant,Café,Bubble Tea Shop,Restaurant,Men's Store,Mediterranean Restaurant
4,Harbourfront,Café,Hotel,Sushi Restaurant,Brewery,Pizza Place,Bar,Restaurant,History Museum,Plaza
5,Regent Park,Thai Restaurant,Pet Store,Animal Shelter,Beer Store,Fast Food Restaurant,Restaurant,Auto Dealership,Pub,Sushi Restaurant
6,Ryerson,Women's Store,Portuguese Restaurant,Sandwich Place,Sporting Goods Shop,Breakfast Spot,Discount Store,Burger Joint,Arts & Crafts Store,Department Store
7,Garden District,Clothing Store,Café,Restaurant,Cosmetics Shop,Theater,American Restaurant,Lingerie Store,Tea Room,Thai Restaurant
8,Berczy Park,Café,Restaurant,Hotel,Bakery,Cocktail Bar,Seafood Restaurant,Beer Bar,Italian Restaurant,Japanese Restaurant
9,Central Bay Street,Café,Hotel,Deli / Bodega,Italian Restaurant,Aquarium,Restaurant,Sports Bar,Gym / Fitness Center,Sandwich Place
