# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

3. To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
Use the BeautifulSoup package to transform the data in the table on the Wikipedia page into the above pandas dataframe.
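As a rough illustration of the three cleaning rules listed above, here is a minimal sketch in plain Python. The sample rows and the helper name `clean_rows` are assumptions made for this example only; the notebook itself applies the same rules with pandas further down.

```python
def clean_rows(rows):
    """Apply the three cleaning rules to (postcode, borough, neighborhood) tuples."""
    grouped = {}
    for postcode, borough, neighborhood in rows:
        # Rule 1: ignore rows without an assigned borough
        if borough == "Not assigned":
            continue
        # Rule 2: a missing neighborhood falls back to the borough name
        if neighborhood == "Not assigned":
            neighborhood = borough
        # Rule 3: collect all neighborhoods that share a postcode
        grouped.setdefault((postcode, borough), []).append(neighborhood)
    return [(pc, b, ",".join(names)) for (pc, b), names in grouped.items()]


# Hand-written sample rows (illustrative, not scraped)
sample = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M7A", "Queen's Park", "Not assigned"),
]
cleaned = clean_rows(sample)
```

Here `cleaned` holds one row per postcode, with the two M5A neighborhoods joined by a comma and the Queen's Park row using the borough name as its neighborhood.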

#### Install BeautifulSoup4

In [None]:
!conda install beautifulsoup4 

#### Import Libraries

In [193]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

#### Request and get data from Wikipedia

In [194]:
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(source.text, "html.parser")
table = soup.find_all('table', {'class':'wikitable sortable'})

## 1. Create Data Frame

#### Set Data Frame

In [195]:
columns_name = ['Postcode', 'Borough', 'Neighborhood']
Postcode = pd.DataFrame(columns = columns_name)

#### Convert data from Wikipedia to a data frame

In [196]:
rows = []
# Skip the header row; each remaining <tr> holds postcode, borough and neighborhood
for tr in table[0].find_all('tr')[1:]:
    cells = tr.find_all('td')
    rows.append({'Postcode': cells[0].text.strip(),
                 'Borough': cells[1].text.strip(),
                 'Neighborhood': cells[2].text.strip()})
# DataFrame.append was removed in pandas 2.0, so build the frame in one step
Postcode = pd.DataFrame(rows, columns=columns_name)


#### Clean Data

In [197]:
#  1. Only process the cells that have an assigned borough.
Postcode = Postcode[Postcode.Borough != 'Not assigned']

In [198]:
# 2. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
Postcode.loc[Postcode.Neighborhood == 'Not assigned', 'Neighborhood'] = Postcode.Borough

In [199]:
# 3. Group neighborhoods by postcode
NewPostCode = Postcode.groupby(['Postcode','Borough'])['Neighborhood'].agg(lambda x: ','.join(set(x))).reset_index() 
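A Python `set` does not guarantee any particular order, so the order of the joined names can differ between runs. If a reproducible order matters, either variant sketched below could stand in for the `set(x)` call; the sample names are illustrative only.

```python
names = ["Rouge", "Malvern", "Rouge"]

# Keeps the first occurrence of each name, in the order the rows appeared
in_table_order = ",".join(dict.fromkeys(names))

# Alphabetical order, independent of the input order
alphabetical = ",".join(sorted(set(names)))
```

Both variants still drop duplicate names, which is what the `set` was there for.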

#### Data Frame 

In [201]:
NewPostCode.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Morningside,Guildwood,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Ionview,Kennedy Park,East Birchmount Park"
7,M1L,Scarborough,"Golden Mile,Clairlea,Oakridge"
8,M1M,Scarborough,"Cliffside,Scarborough Village West,Cliffcrest"
9,M1N,Scarborough,"Cliffside West,Birch Cliff"


## 2. Data Frame with latitude and the longitude coordinates

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for a given postal code can return None, and the same call made again can return the coordinates. To make sure that you get the coordinates for all of the neighborhoods, you can run a while loop for each postal code.

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
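The retry idea described above can be sketched as follows. This is only an illustration: `geocode_with_retry` and `flaky_lookup` are hypothetical names, the lookup is a stand-in that fails twice before succeeding, and no real geocoding service is called. The sketch uses a bounded number of attempts rather than an open-ended while loop, so that a postal code that never resolves cannot block the notebook.

```python
def geocode_with_retry(lookup, postal_code, max_attempts=5):
    """Call lookup(postal_code) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
    # Give up after max_attempts so the caller can fall back to the CSV file
    return None


# Stand-in for an unreliable geocoder: returns None twice, then coordinates
_responses = iter([None, None, (43.806686, -79.194353)])


def flaky_lookup(postal_code):
    return next(_responses)


coords = geocode_with_retry(flaky_lookup, "M1B")
```

A `None` result after all attempts is the signal to use the CSV file instead.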

#### Load Geospatial_data

In [203]:
GeoDF = pd.read_csv('http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv')
print('Data downloaded and read into a dataframe!')

Data downloaded and read into a dataframe!


#### Rename Postal Code to Postcode  

In [204]:
GeoDF.rename(columns ={'Postal Code':'Postcode'}, inplace = True) 

#### Add Latitude and Longitude to NewPostCode
Merge NewPostCode and GeoDF on Postcode.

In [205]:
NewPostCode.reset_index(drop=True, inplace=True)
GeoDF.reset_index(drop=True, inplace=True)
NewPostCode = NewPostCode.astype(str)
GeoDF = GeoDF.astype(str)
NewPCGeo = pd.merge(NewPostCode, GeoDF, on = 'Postcode')
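An inner merge silently drops any postcode that has no match in the coordinates file. One way to check for such rows is a left merge with pandas' `indicator` and `validate` options, sketched here on toy frames. The frames are illustrative and the postcode `M9Z` is made up for the example.

```python
import pandas as pd

postcodes = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],  # M9Z is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every postcode; indicator adds a "_merge" column;
# validate raises an error if a postcode appears more than once on either side
merged = postcodes.merge(coords, on="Postcode", how="left",
                         indicator=True, validate="one_to_one")
missing = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
```

An empty `missing` list means every postcode received coordinates.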

#### Convert Latitude and Longitude back to numeric values

In [206]:
NewPCGeo['Latitude'] = pd.to_numeric(NewPCGeo['Latitude'], errors='coerce')
NewPCGeo['Longitude'] = pd.to_numeric(NewPCGeo['Longitude'], errors='coerce') 

#### Data Frame with Latitude and Longitude

In [207]:
NewPCGeo.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Morningside,Guildwood,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Ionview,Kennedy Park,East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile,Clairlea,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside,Scarborough Village West,Cliffcrest",43.716316,-79.239476
9,M1N,Scarborough,"Cliffside West,Birch Cliff",43.692657,-79.264848


## 3. Explore and cluster the neighborhoods in Toronto.

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together.

### 3.1 Visualize the neighborhoods in Toronto

#### Count the boroughs and neighborhoods

In [208]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(NewPCGeo['Borough'].unique()),
        NewPCGeo.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


#### Use geopy library to get the latitude and longitude of Toronto.

In [209]:
from geopy.geocoders import Nominatim

In [210]:
address = 'Toronto'

# recent geopy releases require an explicit user_agent; the name used here is an arbitrary choice
geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### Create a map of Toronto 

In [211]:
import folium

In [213]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto

#### Add Marker to the map of Toronto

In [214]:
# 1. Create Toronto map
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# 2. Add markers to map
for lat, lng, borough, neighborhood in zip(NewPCGeo['Latitude'], NewPCGeo['Longitude'], NewPCGeo['Borough'], NewPCGeo['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.5,
        parse_html=False).add_to(map_toronto)  
  
map_toronto

### 3.2 Explore the neighborhoods in Toronto

In [215]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(NewPCGeo['Borough'].unique()),
        NewPCGeo.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


#### Define Foursquare Credentials and Version

In [216]:
CLIENT_ID = 'GMWUCCBSTVJ0FWQMEWXKSBAUFNBP2DDTDDIS4PV4GKGOKV2L' # your Foursquare ID
CLIENT_SECRET = 'NNYNVNHKHD55AXIRDF4PCZTCOLUMCBOFAYA3GQCT04P5AIAQ' # your Foursquare Secret
VERSION = '20180604'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: GMWUCCBSTVJ0FWQMEWXKSBAUFNBP2DDTDDIS4PV4GKGOKV2L
CLIENT_SECRET:NNYNVNHKHD55AXIRDF4PCZTCOLUMCBOFAYA3GQCT04P5AIAQ


#### Create the GET request URL

In [217]:
LIMIT = 100
radius = 500
search_query = address

# the arguments must follow the order of the placeholders: radius first, then limit
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/search?client_id=GMWUCCBSTVJ0FWQMEWXKSBAUFNBP2DDTDDIS4PV4GKGOKV2L&client_secret=NNYNVNHKHD55AXIRDF4PCZTCOLUMCBOFAYA3GQCT04P5AIAQ&ll=43.653963,-79.387207&v=20180604&query=Toronto&radius=500&limit=100'
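Positional `format` arguments are easy to put in the wrong order when a URL has many parameters. A possible alternative is to name each parameter in a dictionary and let the standard library encode it. The sketch below assumes a helper called `build_search_url` (a hypothetical name) and uses placeholder credentials.

```python
from urllib.parse import urlencode


def build_search_url(client_id, client_secret, lat, lng, version, query, radius, limit):
    """Build a venues/search URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),  # urlencode writes the comma as %2C
        "v": version,
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)


# Placeholder credentials, for illustration only
example_url = build_search_url("YOUR_ID", "YOUR_SECRET", 43.653963, -79.387207,
                               "20180604", "Toronto", radius=500, limit=100)
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the string by hand.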

#### Function to get the venues from all neighborhoods in Toronto

In [218]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
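The function above indexes straight into the JSON response, so a response without groups, or a venue without categories, would raise an exception. A more defensive extraction step could look like the sketch below. It assumes the same nested layout that `getNearbyVenues` reads; `extract_venues`, the `"Unknown"` placeholder and the fake payload are all hypothetical.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    venues = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder when a venue has no category
        category = categories[0]["name"] if categories else "Unknown"
        venues.append((venue["name"],
                       venue["location"]["lat"],
                       venue["location"]["lng"],
                       category))
    return venues


# Hand-written payload that mimics the nested structure (not real API output)
fake_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Cafe",
                   "location": {"lat": 43.65, "lng": -79.38},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "Unlabelled Spot",
                   "location": {"lat": 43.66, "lng": -79.39},
                   "categories": []}},
    ]}]}
}
venues = extract_venues(fake_payload)
```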

#### Get venues

In [219]:
toronto_venues = getNearbyVenues(names = NewPCGeo['Neighborhood'], latitudes = NewPCGeo['Latitude'], longitudes = NewPCGeo['Longitude'], radius=500)

Malvern,Rouge
Highland Creek,Rouge Hill,Port Union
Morningside,Guildwood,West Hill
Woburn
Cedarbrae
Scarborough Village
Ionview,Kennedy Park,East Birchmount Park
Golden Mile,Clairlea,Oakridge
Cliffside,Scarborough Village West,Cliffcrest
Cliffside West,Birch Cliff
Scarborough Town Centre,Wexford Heights,Dorset Park
Wexford,Maryvale
Agincourt
Sullivan,Clarks Corners,Tam O'Shanter
Agincourt North,Steeles East,L'Amoreaux East,Milliken
L'Amoreaux West,Steeles West
Upper Rouge
Hillcrest Village
Fairview,Oriole,Henry Farm
Bayview Village
Silver Hills,York Mills
Willowdale,Newtonbrook
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Don Mills South,Flemingdon Park
Downsview North,Wilson Heights,Bathurst Manor
Northwood Park,York University
CFB Toronto,Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens,Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
Riverdale,The Danforth West
The Beac

#### Size of Data Frame

In [220]:
toronto_venues.shape

(2256, 7)

#### Column name

In [221]:
list(toronto_venues)

['Neighborhood',
 'Neighborhood Latitude',
 'Neighborhood Longitude',
 'Venue',
 'Venue Latitude',
 'Venue Longitude',
 'Venue Category']

#### Number of venues in each neighborhood

In [222]:
toronto_venues.groupby('Neighborhood').count().reset_index()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Adelaide,Richmond,King",100,100,100,100,100,100
1,Agincourt,4,4,4,4,4,4
2,"Agincourt North,Steeles East,L'Amoreaux East,M...",3,3,3,3,3,3
3,"Alderwood,Long Branch",10,10,10,10,10,10
4,Bayview Village,4,4,4,4,4,4
5,"Bedford Park,Lawrence Manor East",23,23,23,23,23,23
6,Berczy Park,56,56,56,56,56,56
7,Business reply mail Processing Centre969 Eastern,17,17,17,17,17,17
8,"CFB Toronto,Downsview East",4,4,4,4,4,4
9,Caledonia-Fairbanks,6,6,6,6,6,6


#### Unique categories from all venues in Toronto

In [223]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 282 unique categories.


#### Analyze each Neighborhood

Build a new data frame by one-hot encoding the venue categories.

In [224]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# 'Neighborhood' is also a venue category, so rename that dummy column to 'Neighbor'
toronto_onehot.rename(columns = {'Neighborhood':'Neighbor'}, inplace = True)

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Malvern,Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Highland Creek,Rouge Hill,Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Highland Creek,Rouge Hill,Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Highland Creek,Rouge Hill,Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Morningside,Guildwood,West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### New Data Frame size

In [225]:
toronto_onehot.shape

(2256, 283)

#### Group by Neighborhood and taking mean of each category

In [226]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide,Richmond,King",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.010000,0.000000,0.000000,0.010000,0.000000
1,Agincourt,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
2,"Agincourt North,Steeles East,L'Amoreaux East,M...",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
3,"Alderwood,Long Branch",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
4,Bayview Village,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
5,"Bedford Park,Lawrence Manor East",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
6,Berczy Park,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
7,Business reply mail Processing Centre969 Eastern,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
8,"CFB Toronto,Downsview East",0.0,0.000000,0.000000,0.250000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
9,Caledonia-Fairbanks,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.166667,0.000000


#### New Data Frame size

In [227]:
toronto_grouped.shape

(100, 283)
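Taking the mean of one-hot rows per neighborhood gives the share of that neighborhood's venues that fall into each category. The same numbers can be reproduced in plain Python by counting. The sketch below uses toy rows; the Agincourt rows mirror the four categories reported for it in this notebook, while the Woburn rows are invented for illustration.

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs; toy data
venue_rows = [
    ("Agincourt", "Lounge"),
    ("Agincourt", "Skating Rink"),
    ("Agincourt", "Sandwich Place"),
    ("Agincourt", "Breakfast Spot"),
    ("Woburn", "Coffee Shop"),
    ("Woburn", "Coffee Shop"),
    ("Woburn", "Korean Restaurant"),
]

# Count how often each category occurs in each neighborhood
counts = defaultdict(Counter)
for hood, category in venue_rows:
    counts[hood][category] += 1

# Divide by the number of venues in the neighborhood to get frequencies
frequencies = {
    hood: {cat: n / sum(c.values()) for cat, n in c.items()}
    for hood, c in counts.items()
}
```

Each neighborhood's frequencies sum to 1, which is why a neighborhood with only four venues shows 0.25 for each of them.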

#### The top 5 most common venues by each neighborhood

In [228]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,Richmond,King----
                 venue  freq
0          Coffee Shop  0.07
1                 Café  0.05
2  American Restaurant  0.04
3      Thai Restaurant  0.04
4                  Gym  0.04


----Agincourt----
               venue  freq
0     Sandwich Place  0.25
1     Breakfast Spot  0.25
2             Lounge  0.25
3       Skating Rink  0.25
4  Accessories Store  0.00


----Agincourt North,Steeles East,L'Amoreaux East,Milliken----
               venue  freq
0         Playground  0.33
1               Park  0.33
2   Sculpture Garden  0.33
3  Accessories Store  0.00
4  Mobile Phone Shop  0.00


----Alderwood,Long Branch----
                venue  freq
0         Pizza Place   0.2
1            Pharmacy   0.1
2                 Pub   0.1
3                Bank   0.1
4  Athletics & Sports   0.1


----Bayview Village----
                 venue  freq
0  Japanese Restaurant  0.25
1                 Café  0.25
2                 Bank  0.25
3   Chinese Restaurant  0.25
4               

Note: this printed output is difficult to read, so the next step collects the same information in a single table.

#### The top 5 most common venues for each neighborhood, in a single table

In [229]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [230]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide,Richmond,King",Coffee Shop,Café,Steakhouse,Gym,Thai Restaurant
1,Agincourt,Lounge,Skating Rink,Sandwich Place,Breakfast Spot,Yoga Studio
2,"Agincourt North,Steeles East,L'Amoreaux East,M...",Park,Playground,Sculpture Garden,Yoga Studio,Doner Restaurant
3,"Alderwood,Long Branch",Pizza Place,Gym,Skating Rink,Sandwich Place,Bank
4,Bayview Village,Japanese Restaurant,Chinese Restaurant,Café,Bank,Diner
5,"Bedford Park,Lawrence Manor East",Coffee Shop,Italian Restaurant,Fast Food Restaurant,Pharmacy,Restaurant
6,Berczy Park,Coffee Shop,Restaurant,Cocktail Bar,Seafood Restaurant,Bakery
7,Business reply mail Processing Centre969 Eastern,Auto Workshop,Pizza Place,Butcher,Skate Park,Burrito Place
8,"CFB Toronto,Downsview East",Bus Stop,Park,Airport,Other Repair Shop,Drugstore
9,Caledonia-Fairbanks,Park,Women's Store,Pharmacy,Fast Food Restaurant,Market


### 3.3 Cluster Neighborhood

Run k-means with 3 clusters on the neighborhoods.

In [231]:
# set number of clusters
kclusters = 3

# newer pandas versions require axis to be passed as a keyword argument
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ [0:100]

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1], dtype=int32)
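To make the clustering step less of a black box, here is a minimal sketch of the basic k-means loop (Lloyd's algorithm) in plain Python on toy 2-D points. It is a simplification under stated assumptions: the centroids start at the first `k` points, whereas scikit-learn's `KMeans` uses k-means++ initialization by default, so the two will not generally give identical results.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; centroids start at the first k points (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the label of its nearest centroid
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Toy data: two well-separated groups
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

In the notebook, each "point" is a neighborhood and each coordinate is the frequency of one venue category.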

#### Create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.

In [232]:
# work on a copy so that toronto_grouped itself is left unchanged
toronto_merged = toronto_grouped.copy()

# add clustering labels
toronto_merged['Cluster Labels'] = kmeans.labels_

# join with NewPCGeo to add the postcode, borough, latitude and longitude of each neighborhood
toronto_merged = toronto_merged.join(NewPCGeo.set_index('Neighborhood'), on='Neighborhood')

# join with neighborhoods_venues_sorted to add the top 5 venue categories
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')


toronto_merged # check the last columns!

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Cluster Labels,Postcode,Borough,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide,Richmond,King",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1,M5H,Downtown Toronto,43.650571,-79.384568,Coffee Shop,Café,Steakhouse,Gym,Thai Restaurant
1,Agincourt,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1,M1S,Scarborough,43.794200,-79.262029,Lounge,Skating Rink,Sandwich Place,Breakfast Spot,Yoga Studio
2,"Agincourt North,Steeles East,L'Amoreaux East,M...",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0,M1V,Scarborough,43.815252,-79.284577,Park,Playground,Sculpture Garden,Yoga Studio,Doner Restaurant
3,"Alderwood,Long Branch",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1,M8W,Etobicoke,43.602414,-79.543484,Pizza Place,Gym,Skating Rink,Sandwich Place,Bank
4,Bayview Village,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1,M2K,North York,43.786947,-79.385975,Japanese Restaurant,Chinese Restaurant,Café,Bank,Diner
5,"Bedford Park,Lawrence Manor East",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1,M5M,North York,43.733283,-79.419750,Coffee Shop,Italian Restaurant,Fast Food Restaurant,Pharmacy,Restaurant
6,Berczy Park,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1,M5E,Downtown Toronto,43.644771,-79.373306,Coffee Shop,Restaurant,Cocktail Bar,Seafood Restaurant,Bakery
7,Business reply mail Processing Centre969 Eastern,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1,M7Y,East Toronto,43.662744,-79.321558,Auto Workshop,Pizza Place,Butcher,Skate Park,Burrito Place
8,"CFB Toronto,Downsview East",0.0,0.000000,0.000000,0.250000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1,M3K,North York,43.737473,-79.464763,Bus Stop,Park,Airport,Other Repair Shop,Drugstore
9,Caledonia-Fairbanks,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0,M6E,York,43.689026,-79.453512,Park,Women's Store,Pharmacy,Fast Food Restaurant,Market


In [233]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
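One detail about the marker colors: the k-means labels start at 0, so indexing the color list with `cluster-1` sends label 0 to the last color through Python's negative indexing. Every cluster still gets its own color, only shifted by one position. The small sketch below shows the effect with placeholder color names, which are not the actual hex values produced by `cm.rainbow`.

```python
palette = ["red", "green", "blue"]  # placeholder colors, one per cluster
labels = [0, 1, 2]

# Indexing with label - 1: label 0 wraps around to the last entry
shifted = [palette[label - 1] for label in labels]

# Indexing with the label itself keeps labels and colors aligned
aligned = [palette[label] for label in labels]
```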

### 3.4. Examine Clusters

#### Cluster 1

Park zone: parks, playgrounds and trails are the most common venues

In [234]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Accessories Store,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,...,Cluster Labels,Postcode,Borough,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M1V,Scarborough,43.815252,-79.284577,Park,Playground,Sculpture Garden,Yoga Studio,Doner Restaurant
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M6E,York,43.689026,-79.453512,Park,Women's Store,Pharmacy,Fast Food Restaurant,Market
28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M4J,East York,43.685347,-79.338106,Park,Coffee Shop,Convenience Store,Yoga Studio,Drugstore
32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M5P,Central Toronto,43.696948,-79.411307,Park,Sushi Restaurant,Trail,Jewelry Store,Yoga Studio
64,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M4W,Downtown Toronto,43.679563,-79.377529,Park,Playground,Trail,Yoga Studio,Doner Restaurant
68,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M1J,Scarborough,43.744734,-79.239476,Playground,Yoga Studio,Donut Shop,Dessert Shop,Dim Sum Restaurant
69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M2L,North York,43.75749,-79.374714,Cafeteria,Park,Yoga Studio,Donut Shop,Dim Sum Restaurant
84,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M8X,Etobicoke,43.653654,-79.506944,River,Park,Smoke Shop,Donut Shop,Dessert Shop
91,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M9N,York,43.706876,-79.518188,Park,Yoga Studio,Drugstore,Dim Sum Restaurant,Diner
98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,M2P,North York,43.752758,-79.400049,Park,Flower Shop,Bank,Yoga Studio,Drugstore


#### Cluster 2

Commercial zone: coffee shops, restaurants and a wide range of other venues

In [235]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Accessories Store,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,...,Cluster Labels,Postcode,Borough,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.040000,0.0,0.0,0.0,...,1,M5H,Downtown Toronto,43.650571,-79.384568,Coffee Shop,Café,Steakhouse,Gym,Thai Restaurant
1,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,1,M1S,Scarborough,43.794200,-79.262029,Lounge,Skating Rink,Sandwich Place,Breakfast Spot,Yoga Studio
3,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,1,M8W,Etobicoke,43.602414,-79.543484,Pizza Place,Gym,Skating Rink,Sandwich Place,Bank
4,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,1,M2K,North York,43.786947,-79.385975,Japanese Restaurant,Chinese Restaurant,Café,Bank,Diner
5,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.043478,0.0,0.0,0.0,...,1,M5M,North York,43.733283,-79.419750,Coffee Shop,Italian Restaurant,Fast Food Restaurant,Pharmacy,Restaurant
6,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,1,M5E,Downtown Toronto,43.644771,-79.373306,Coffee Shop,Restaurant,Cocktail Bar,Seafood Restaurant,Bakery
7,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,1,M7Y,East Toronto,43.662744,-79.321558,Auto Workshop,Pizza Place,Butcher,Skate Park,Burrito Place
8,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,1,M3K,North York,43.737473,-79.464763,Bus Stop,Park,Airport,Other Repair Shop,Drugstore
10,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.090909,0.0,0.0,0.0,...,1,M7R,Mississauga,43.636966,-79.615819,Coffee Shop,Hotel,Gym / Fitness Center,Middle Eastern Restaurant,Burrito Place
11,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,1,M1H,Scarborough,43.773136,-79.239476,Hakka Restaurant,Bank,Fried Chicken Joint,Thai Restaurant,Bakery


#### Cluster 3

Sports zone: baseball fields are the most common venue

In [236]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Accessories Store,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,...,Cluster Labels,Postcode,Borough,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2,M9M,North York,43.724766,-79.532242,Baseball Field,Yoga Studio,Dumpling Restaurant,Diner,Discount Store
79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2,M8Y,Etobicoke,43.636258,-79.498509,Baseball Field,Yoga Studio,Dumpling Restaurant,Diner,Discount Store
