First, I installed and imported necessary packages.

In [1]:
import pandas as pd
import numpy as np
import requests
!conda install -c anaconda lxml -y
!conda install -c anaconda beautifulsoup4 -y
!conda install -c anaconda requests -y

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    libgcc-ng: 7.2.0-h7cc24e2_2     --> 8.2.0-hdf63c60_1     anaconda
    libxml2:   2.9.4-h6b072ca_5     --> 2.9.8-hf84eae3_0     anaconda
    libxslt:   1.1.29-hcf9102b_5    --> 1.1.32-h1312cb7_0    anaconda
    lxml:      4.1.0-py35ha401a81_0 --> 4.2.5-py35hefd8a0e_0 anaconda

libgcc-ng-8.2. 100% |################################| Time: 0:00:00  64.01 MB/s
libxml2-2.9.8- 100% |################################| Time: 0:00:00  73.12 MB/s
libxslt-1.1.32 100% |################################| Time: 0:00:00  15.39 MB/s
lxml-4.2.5-py3 100% |################################| Time: 0:00:00  10.94 MB/s
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    beautif

Then I extracted the html data from the wikipedia website using Beautiful Soup package and lxml parser.

In [2]:
from bs4 import BeautifulSoup
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')

Next, I looked at the HTML data to find the important tags.

In [3]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":861324217,"wgRevisionId":861324217,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg

From the HTML data the important tags were the "table" tag and the "td" tag. Before extracting the wikipedia table entries, 
I prepared an empty dataframe named "neighborhoods" with three columns: 'PostalCode', 'Borough', 'Neighborhood'.

In [4]:
column_names = ['PostalCode','Borough','Neighborhood']
neighborhoods = pd.DataFrame(columns = column_names)
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood


In extracting the entries from the Wikipedia table, I first assigned the table entries to a list named 'data'.

In [5]:
table = soup.find('table', class_= 'wikitable sortable')
data = []
for td in table.find_all('td'):
    entry = td.text
    data.append(entry)

Then I assigned the items in the 'data' list to the 'neighborhoods' data frame.

In [6]:
for index, item in zip(range(0, len(data),3), data):
    postcode = data[index]
    borough = data[index+1]
    neighborhood = data[index+2]
    neighborhoods = neighborhoods.append({'PostalCode' : postcode, 'Borough' : borough, 'Neighborhood' : neighborhood}, ignore_index = True)

Next I cleaned the data frame.

In [7]:
#first I removed the newline character at the end of the Neighborhood column entries
neighborhoods['Neighborhood'].replace(regex=True,inplace=True,to_replace=r'\n',value=r'')
#next I replaced the 'Not assigned' values in some cells with NaN to make it easier to remove rows later. 
neighborhoods.replace(regex=True,inplace=True,to_replace='Not assigned',value=np.nan)
#after that I copied the value in the 'Borough' column to the corresponding row in the 'Neighborhood' column
#where the 'Neighborhood' row had a NaN value.
neighborhoods["Neighborhood"].fillna(neighborhoods["Borough"], inplace = True)
#finally I dropped all the rows where the 'Borough column' had NaN values. 
neighborhoods.dropna(subset = ['Borough'], inplace = True)

After cleaning the data, I joined all the neighborhoods that have the same postal code.

In [8]:
neighborhoods = neighborhoods.groupby(['PostalCode','Borough'])['Neighborhood'].apply(','.join).reset_index()

And here is the first few rows of the final 'neighborhoods' dataframe.

In [9]:
neighborhoods.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


And here are the dimensions of the dataframe.

In [10]:
print('the number of rows in dataframe:', neighborhoods.shape[0])
print('the number of columns in dataframe:', neighborhoods.shape[1])

the number of rows in dataframe: 103
the number of columns in dataframe: 3


#### The second part of the assignment. 

Now I read the provided postal code coordinate data into a new data frame called 'latlong'.

In [11]:
latlong = pd.read_csv('http://cocl.us/Geospatial_data')
latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


I checked if the 'latlong' and 'neighborhoods' dataframes had the same dimensions.

In [12]:
print (latlong.shape)

(103, 3)


I joined the two dataframes using the join method and matching the entries in the 'Postal Code' columns of both data frames.

In [13]:
TRneighborhoods = neighborhoods.join(latlong.set_index('Postal Code'), on='PostalCode')

I also checked several rows of the new dataframe 'TRneighborhoods'

In [28]:
TRneighborhoods.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


#### The third part of the assignment.

First I import and install additional packages.

In [16]:
import json 

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 

from pandas.io.json import json_normalize 


import matplotlib.cm as cm
import matplotlib.colors as colors


from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium 

print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.17.0-py_0 conda-forge

geographiclib- 100% |################################| Time: 0:00:00 968.07 kB/s
geopy-1.17.0-p 100% |################################| Time: 0:00:00   1.49 MB/s
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.0-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00   1.08 MB/s
branca-0.3.0-p 100% |################################| Time: 0:00:00  24.71 MB/s
vincent-0.4.4- 100% |###################

I then use the geopy library to get the coordinates of Toronto.

In [45]:
address = 'Toronto, TO'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Toronto are {}, {}.'.format(latitude, longitude))



The geographical coordinate of Toronto are 43.6708625, -79.3727924125372.


The coordinates provided are slightly different than the ones provided by Google Search. For comparison, the Google provided coordinates are 
43.6532, -79.3832

Next I make the map of Toronto with the neighborhoods data superimposed. 

In [46]:
# create map of New York using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(TRneighborhoods['Latitude'], TRneighborhoods['Longitude'], TRneighborhoods['Borough'], TRneighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Next I made a new dataframe containing data only for boroughs that have "Toronto" in their names.

In [40]:
STtoronto_data = TRneighborhoods[TRneighborhoods['Borough'].str.contains('Toronto') == True].reset_index(drop=True)
STtoronto_data.head()


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [41]:
STtoronto_data.shape

(38, 5)

But the number of entries was still too large for me as can be seen in the shape of the 'STtoronto_data'. So I decided to focus on the Scarborough borough instead. The name of this borough reminded me of "Scarborough Fair" song and I decided to explore this area.

In [83]:
scarborough_data = TRneighborhoods[TRneighborhoods['Borough'] == 'Scarborough'].reset_index(drop=True)
scarborough_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


In [44]:
scarborough_data.shape

(17, 5)

The Scarborough borough had only 17 entries as seen from the shape of the 'scarborough_data' dataframe. Next I also made a map of this borough.

In [47]:
address = 'Scarborough, TO'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Scarborough are {}, {}.'.format(latitude, longitude))



The geographical coordinate of Scarborough are 43.7626686, -79.2308605092575.


In [48]:
# create map of Manhattan using latitude and longitude values
map_scarborough = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(scarborough_data['Latitude'], scarborough_data['Longitude'], scarborough_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scarborough)  
    
map_scarborough

In [49]:
# The code was removed by Watson Studio for sharing.

Next I used the foursquare API to get the top 100 venues in scarborough. I used the getNearbyVenues function created in this week's lab to loop through all neighborhoods.

In [52]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

I created a new dataframe called "scarborough_venues" and filled it using the previous function.

In [73]:
scarborough_venues = getNearbyVenues(names = scarborough_data['Neighborhood'],
                                       latitudes = scarborough_data['Latitude'],
                                       longitudes = scarborough_data['Longitude']
                                   )


Rouge,Malvern
Highland Creek,Rouge Hill,Port Union
Guildwood,Morningside,West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park,Ionview,Kennedy Park
Clairlea,Golden Mile,Oakridge
Cliffcrest,Cliffside,Scarborough Village West
Birch Cliff,Cliffside West
Dorset Park,Scarborough Town Centre,Wexford Heights
Maryvale,Wexford
Agincourt
Clarks Corners,Sullivan,Tam O'Shanter
Agincourt North,L'Amoreaux East,Milliken,Steeles East
L'Amoreaux West,Steeles West
Upper Rouge


Here is the dimensions of the new dataframe as well as a few rows of the data

In [74]:
print(scarborough_venues.shape)
scarborough_venues.head()

(87, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge,Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,RIGHT WAY TO GOLF,43.785177,-79.161108,Golf Course
2,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,"Guildwood,Morningside,West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


I wantd to find out how many venues there are in each neighborhood of Scarborough.

In [75]:
scarborough_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,4,4,4,4,4,4
"Agincourt North,L'Amoreaux East,Milliken,Steeles East",3,3,3,3,3,3
"Birch Cliff,Cliffside West",4,4,4,4,4,4
Cedarbrae,7,7,7,7,7,7
"Clairlea,Golden Mile,Oakridge",10,10,10,10,10,10
"Clarks Corners,Sullivan,Tam O'Shanter",11,11,11,11,11,11
"Cliffcrest,Cliffside,Scarborough Village West",2,2,2,2,2,2
"Dorset Park,Scarborough Town Centre,Wexford Heights",6,6,6,6,6,6
"East Birchmount Park,Ionview,Kennedy Park",6,6,6,6,6,6
"Guildwood,Morningside,West Hill",6,6,6,6,6,6


Also how many unique categories of venues in Scarborough.

In [76]:
print('There are {} unique categories.'.format(len(scarborough_venues['Venue Category'].unique())))

There are 56 unique categories.


I applied one hot encoding technique to classify neighborhoods according to the venue category that they have.

In [65]:
# one hot encoding
scarborough_onehot = pd.get_dummies(scarborough_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
scarborough_onehot['Neighborhood'] = scarborough_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [scarborough_onehot.columns[-1]] + list(scarborough_onehot.columns[:-1])
scarborough_onehot = scarborough_onehot[fixed_columns]
# print(scarborough_onehot.shape)
#scarborough_onehot.head()


In [111]:
scarborough_grouped = scarborough_onehot.groupby('Neighborhood').mean().reset_index()
print(scarborough_grouped.shape)
scarborough_grouped

(16, 57)


Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bus Line,...,Playground,Rental Car Location,Sandwich Place,Shopping Mall,Skating Rink,Smoke Shop,Soccer Field,Thai Restaurant,Train Station,Vietnamese Restaurant
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,...,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0
1,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Birch Cliff,Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0
3,Cedarbrae,0.0,0.0,0.142857,0.0,0.142857,0.142857,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0
4,"Clairlea,Golden Mile,Oakridge",0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0
5,"Clarks Corners,Sullivan,Tam O'Shanter",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.090909,0.0,0.090909,0.0,0.0,0.0,0.090909,0.0,0.0
6,"Cliffcrest,Cliffside,Scarborough Village West",0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Dorset Park,Scarborough Town Centre,Wexford He...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667
8,"East Birchmount Park,Ionview,Kennedy Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0
9,"Guildwood,Morningside,West Hill",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,...,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Turned out that one of the neighborhoods had no venues recorded by Foursquare. This is because the new dataframe had one less row than the 'scarborough_data dataframe'  comparison of the names in 'Neighborhood' columns of 'scarborough_data' and 'scarborough_grouped' showed that 'Upper Rouge' had been dropped from the new dataframe.

Now what about 3 top venues of the rest of the neighborhoods?

In [109]:
num_top_venues = 3

for hood in scarborough_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = scarborough_grouped[scarborough_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
            venue  freq
0    Skating Rink  0.25
1  Breakfast Spot  0.25
2          Lounge  0.25


----Agincourt North,L'Amoreaux East,Milliken,Steeles East----
         venue  freq
0  Coffee Shop  0.33
1         Park  0.33
2   Playground  0.33


----Birch Cliff,Cliffside West----
                   venue  freq
0           Skating Rink  0.25
1  General Entertainment  0.25
2                   Café  0.25


----Cedarbrae----
                  venue  freq
0      Hakka Restaurant  0.14
1  Caribbean Restaurant  0.14
2    Athletics & Sports  0.14


----Clairlea,Golden Mile,Oakridge----
      venue  freq
0  Bus Line   0.2
1    Bakery   0.2
2      Park   0.1


----Clarks Corners,Sullivan,Tam O'Shanter----
                venue  freq
0         Pizza Place  0.18
1  Italian Restaurant  0.09
2       Shopping Mall  0.09


----Cliffcrest,Cliffside,Scarborough Village West----
                 venue  freq
0  American Restaurant   0.5
1                Motel   0.5
2    Accessories Store

Some of the neighborhoods only had 2 or 1 venues. For the rest of this notebook, I would display the first top venues of each neighborhood. Next I put this information into a new dataframe.

In [68]:
# the function to return the most common category of venues in each neighborhood.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [94]:
num_top_venues = 1

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = scarborough_grouped['Neighborhood']

for ind in np.arange(scarborough_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scarborough_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue
0,Agincourt,Skating Rink
1,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Playground
2,"Birch Cliff,Cliffside West",General Entertainment
3,Cedarbrae,Fried Chicken Joint
4,"Clairlea,Golden Mile,Oakridge",Bakery
5,"Clarks Corners,Sullivan,Tam O'Shanter",Pizza Place
6,"Cliffcrest,Cliffside,Scarborough Village West",Motel
7,"Dorset Park,Scarborough Town Centre,Wexford He...",Indian Restaurant
8,"East Birchmount Park,Ionview,Kennedy Park",Hobby Shop
9,"Guildwood,Morningside,West Hill",Mexican Restaurant


#### Begin the clustering.

In [100]:
# set number of clusters
kclusters = 5

scarborough_grouped_clustering = scarborough_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scarborough_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:16] 

array([1, 3, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 0, 2, 3], dtype=int32)

The neighborhoods have been clustered and labeled as belonging to the zeroth cluster all the way to the fourth cluster. Next was appending the cluster label to the dataframe.

In [101]:
# because the original 'scarborough_data' had 17 rows, I had to drop the row that had no venues at all.
scarborough_merged = scarborough_data.drop(scarborough_data.index[16])

# add clustering labels
scarborough_merged['Cluster Labels'] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
scarborough_merged = scarborough_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

scarborough_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353,1,Fast Food Restaurant
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,3,Golf Course
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,1,Mexican Restaurant
3,M1G,Scarborough,Woburn,43.770992,-79.216917,1,Coffee Shop
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Fried Chicken Joint
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,1,Playground
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029,4,Hobby Shop
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577,1,Bakery
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476,1,Motel
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848,1,General Entertainment


#### Finally, making a map showing the clustering of each neighborhood.

In [102]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scarborough_merged['Latitude'], scarborough_merged['Longitude'], scarborough_merged['Neighborhood'], scarborough_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

I tried to analyze each cluster to find the similarity of each cluster member.

Cluster 0

In [103]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 0, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue
13,Scarborough,0,Pizza Place


Cluster 1

In [104]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 1, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue
0,Scarborough,1,Fast Food Restaurant
2,Scarborough,1,Mexican Restaurant
3,Scarborough,1,Coffee Shop
4,Scarborough,1,Fried Chicken Joint
5,Scarborough,1,Playground
7,Scarborough,1,Bakery
8,Scarborough,1,Motel
9,Scarborough,1,General Entertainment
10,Scarborough,1,Indian Restaurant
11,Scarborough,1,Accessories Store


Cluster 2

In [106]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 2, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue
14,Scarborough,2,Playground


Cluster 3

In [107]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 3, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue
1,Scarborough,3,Golf Course
15,Scarborough,3,Chinese Restaurant


Cluster 4

In [108]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 4, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue
6,Scarborough,4,Hobby Shop


I do not know the dissimilarity between different clusters. Maybe the fact that some neighborhoods have just one to two venues caused a lack of data for a meaningful comparison. But what I learned from this exercise was that for kmeans cluster analysis, too little data is not good for a meaningful clustering and analysis.