# Part 1: Scraping Data from Wikipedia Page
----------
In this part, we obtain the list of postal codes in Toronto, Ontario, and load it into a dataframe. The data comes from the Wikipedia page _List of postal codes of Canada: M_.
#### Importing necessary libraries

In [338]:
import pandas as pd
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

import numpy as np
import json
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (pandas >= 1.0)
from urllib.request import urlopen

from bs4 import BeautifulSoup

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium

print('Libraries imported.')

Libraries imported.


#### Parsing url data using BeautifulSoup

In [339]:
url ='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = urlopen(url)
soup=BeautifulSoup(html,'lxml')
type (soup)

bs4.BeautifulSoup

#### Converting html table data to dataframe

In [340]:
dfs = pd.read_html(str(soup))  #returns a list of dataframes
df=dfs[0]         #we're interested in the first (and only) dataframe
df.describe()

Unnamed: 0,Postcode,Borough,Neighborhood
count,287,287,287
unique,180,12,208
top,M9V,Not assigned,Not assigned
freq,8,77,78


In [341]:
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Checking for "Not assigned" borough values

In [342]:
df.Borough.value_counts()

Not assigned        77
Etobicoke           44
North York          38
Scarborough         37
Downtown Toronto    37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

#### Dropping rows with no assigned borough

In [343]:
df=df[df.Borough!='Not assigned']
df.describe()  #verify that 77 rows have been deleted

Unnamed: 0,Postcode,Borough,Neighborhood
count,210,210,210
unique,103,11,208
top,M8Y,Etobicoke,Runnymede
freq,8,44,2


#### Checking for "Not assigned" neighborhood values

In [344]:
df.loc[(df['Neighborhood']=='Not assigned')]

Unnamed: 0,Postcode,Borough,Neighborhood
9,M9A,Queen's Park,Not assigned


#### Replacing "Not assigned" neighborhood values with corresponding borough

In [346]:
df=df.assign(Neighborhood=df['Neighborhood'].where(df['Neighborhood']!='Not assigned',df['Borough']))  #fall back to the borough name
df.loc[(df['Neighborhood']=='Not assigned')]  #verify that all "Not assigned" values have been removed

Unnamed: 0,Postcode,Borough,Neighborhood


#### Combining rows with same postcode and borough

In [347]:
df_grouped=pd.DataFrame(df.groupby(['Postcode','Borough'])['Neighborhood'].apply(lambda x:', '.join(x)))
df_grouped.reset_index(inplace=True)
df_grouped.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [348]:
df_grouped.describe()

Unnamed: 0,Postcode,Borough,Neighborhood
count,103,103,103
unique,103,11,102
top,M2J,North York,Queen's Park
freq,1,24,2


In [349]:
df_grouped.shape

(103, 3)
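
To see what the grouping step does, here is a minimal sketch on three rows taken from the table above. The `toy` and `grouped` names are illustrative, and `.agg(', '.join)` is used as an equivalent of the `apply(lambda ...)` call.

```python
import pandas as pd

# Three rows copied from the postal code table for illustration
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per (Postcode, Borough); neighborhoods joined into one string
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(grouped)
```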

# Part 2: Getting the Geographical Coordinates
In this part, we will obtain the geographical coordinates of each postal code. The data we need is in a CSV file accessible via URL.

#### Reading CSV data into a dataframe

In [350]:
df_gc=pd.read_csv('https://raw.githubusercontent.com/izzie29/Coursera_Capstone/master/Geospatial_Coordinates.csv')
df_gc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [351]:
df_gc.shape

(103, 3)

#### Merging the geographical coordinates dataframe with the postal code dataframe

In [353]:
df_final=pd.merge(df_grouped,df_gc,left_on='Postcode',right_on='Postal Code',how='left')
df_final.drop(columns='Postal Code',inplace=True)  #Dropping redundant column
df_final.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
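
A left merge silently leaves `NaN` coordinates for any postal code missing from the CSV. The sketch below, on toy data, shows one way to check for that with the `validate` and `indicator` parameters of `pd.merge`. The `M9Z` row is a made-up code included only to trigger the unmatched case.

```python
import pandas as pd

left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is hypothetical
                     'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# validate raises if either key column contains duplicates;
# indicator adds a '_merge' column telling where each row came from
merged = pd.merge(left, right, left_on='Postcode', right_on='Postal Code',
                  how='left', validate='one_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received coordinates.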


# Part 3: Exploratory Analysis
-----

## Visualizing Toronto Boroughs
#### First let's look at how many distinct boroughs we have.

In [354]:
print(df_final['Borough'].value_counts())

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           11
Central Toronto      9
West Toronto         6
East York            5
East Toronto         5
York                 5
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64


#### Let's explore the boroughs whose names include "Toronto"

In [355]:
df_toronto=df_final[df_final['Borough'].str.contains('Toronto')].reset_index(drop=True)
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
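
`str.contains` interprets its pattern as a regular expression by default. For a plain word like "Toronto" this makes no difference, but passing `regex=False` makes the literal match explicit. A small sketch on a few of the borough names listed above:

```python
import pandas as pd

boroughs = pd.Series(['East Toronto', 'North York', 'Downtown Toronto', 'Etobicoke'])

# Literal substring match; no regex interpretation of the pattern
mask = boroughs.str.contains('Toronto', regex=False)
print(boroughs[mask].tolist())
```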


#### Visualizing the postal codes within the "Toronto" boroughs

In [356]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="the_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [394]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11.4)

# add postcode markers to map
for lat, lng, label in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Postcode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

## Clustering Neighborhoods Based on Popular Venue Categories
#### Getting Foursquare data

In [359]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [360]:
LIMIT=300
radius=700
venues_url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
venues_url

'https://api.foursquare.com/v2/venues/explore?client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&v=20180605&ll=43.653963,-79.387207&radius=700&limit=300'
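
String formatting works here, but the query string can also be assembled from a dictionary so that each parameter is named and encoded. A minimal sketch using the standard library follows; `build_explore_url` is a hypothetical helper, and `'ID'` and `'SECRET'` are dummy values.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from named query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll parameter unescaped
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')


example_url = build_explore_url('ID', 'SECRET', '20180605', 43.653963, -79.387207, 700, 300)
print(example_url)
```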

In [361]:
venues_results = requests.get(venues_url).json()
venues_results

{'meta': {'code': 200, 'requestId': '5e127d2714a126001ffe51da'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Bay Street Corridor',
  'headerFullLocation': 'Bay Street Corridor, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 233,
  'suggestedBounds': {'ne': {'lat': 43.660263006300006,
    'lng': -79.37851584317704},
   'sw': {'lat': 43.64766299369999, 'lng': -79.39589815682297}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5227bb01498e17bf485e6202',
       'name': 'Downtown Toronto',
       'location': {'lat': 43.65323167517444,
        'lng': -79.38529600606677,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.65323167517444,
 

#### Extracting information about venues from JSON data
Most of the information we need is contained in the 'items' key.

In [395]:
venues=venues_results['response']['groups'][0]['items']  #accessing data in 'items' key
df_venues=json_normalize(venues)  #convert JSON to dataframe
df_venues.head()

Unnamed: 0,referralId,reasons.count,reasons.items,venue.id,venue.name,venue.location.lat,venue.location.lng,venue.location.labeledLatLngs,venue.location.distance,venue.location.cc,venue.location.city,venue.location.state,venue.location.country,venue.location.formattedAddress,venue.categories,venue.photos.count,venue.photos.groups,venue.location.address,venue.location.crossStreet,venue.location.postalCode,venue.venuePage.id,venue.location.neighborhood
0,e-0-5227bb01498e17bf485e6202-0,0,"[{'summary': 'This spot is popular', 'type': '...",5227bb01498e17bf485e6202,Downtown Toronto,43.653232,-79.385296,"[{'label': 'display', 'lat': 43.65323167517444...",174,CA,Toronto,ON,Canada,"[Toronto ON, Canada]","[{'id': '4f2a25ac4b909258e854f55f', 'name': 'N...",0,[],,,,,
1,e-0-4ae7b27df964a52068ad21e3-1,0,"[{'summary': 'This spot is popular', 'type': '...",4ae7b27df964a52068ad21e3,Japango,43.655268,-79.385165,"[{'label': 'display', 'lat': 43.65526771691681...",219,CA,Toronto,ON,Canada,"[122 Elizabeth St. (at Dundas St. W), Toronto ...","[{'id': '4bf58dd8d48988d1d2941735', 'name': 'S...",0,[],122 Elizabeth St.,at Dundas St. W,M5G 1P5,,
2,e-0-4f513029e4b07c3382c9fdb9-2,0,"[{'summary': 'This spot is popular', 'type': '...",4f513029e4b07c3382c9fdb9,Cafe Plenty,43.654571,-79.38945,"[{'label': 'display', 'lat': 43.65457125894357...",192,CA,Toronto,ON,Canada,"[250 Dundas Street West (Simcoe Street), Toron...","[{'id': '4bf58dd8d48988d16d941735', 'name': 'C...",0,[],250 Dundas Street West,Simcoe Street,M5T 2Z5,,
3,e-0-5773f01f498e98371390bdfd-3,0,"[{'summary': 'This spot is popular', 'type': '...",5773f01f498e98371390bdfd,Rolltation,43.654918,-79.387424,"[{'label': 'display', 'lat': 43.65491791857301...",107,CA,Toronto,ON,Canada,"[207 Dundas St W (at University Ave), Toronto ...","[{'id': '4bf58dd8d48988d111941735', 'name': 'J...",0,[],207 Dundas St W,at University Ave,M5G 1C8,,
4,e-0-56ccd5cfcd1069ca160a797e-4,0,"[{'summary': 'This spot is popular', 'type': '...",56ccd5cfcd1069ca160a797e,Tsujiri,43.655374,-79.385354,"[{'label': 'display', 'lat': 43.65537430780922...",216,CA,Toronto,ON,Canada,"[147 Dundas St W (at Elizabeth St), Toronto ON...","[{'id': '4bf58dd8d48988d1dc931735', 'name': 'T...",0,[],147 Dundas St W,at Elizabeth St,M5G 1P5,,


In [396]:
df_venues.shape

(100, 22)

#### Extracting relevant columns


In [397]:
# getting relevant columns
df_venues = df_venues[['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng','venue.location.postalCode']].copy()  # copy so later column assignments do not act on a view
df_venues.head()

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng,venue.location.postalCode
0,Downtown Toronto,"[{'id': '4f2a25ac4b909258e854f55f', 'name': 'N...",43.653232,-79.385296,
1,Japango,"[{'id': '4bf58dd8d48988d1d2941735', 'name': 'S...",43.655268,-79.385165,M5G 1P5
2,Cafe Plenty,"[{'id': '4bf58dd8d48988d16d941735', 'name': 'C...",43.654571,-79.38945,M5T 2Z5
3,Rolltation,"[{'id': '4bf58dd8d48988d111941735', 'name': 'J...",43.654918,-79.387424,M5G 1C8
4,Tsujiri,"[{'id': '4bf58dd8d48988d1dc931735', 'name': 'T...",43.655374,-79.385354,M5G 1P5



__The venue category is stored in a dictionary, wrapped in a list, in the venue.categories column. See the example below.__

In [399]:
df_venues['venue.categories'][1]

[{'id': '4bf58dd8d48988d1d2941735',
  'name': 'Sushi Restaurant',
  'pluralName': 'Sushi Restaurants',
  'shortName': 'Sushi',
  'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/sushi_',
   'suffix': '.png'},
  'primary': True}]

#### Let's define a function that extracts the category of each venue; then apply the function to all rows in the dataframe.

In [400]:
def get_category_type(row):
    categories_list = row['venue.categories']
    
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

df_venues['venue.categories']=df_venues.apply(get_category_type,axis=1)

In [401]:
df_venues.head()

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng,venue.location.postalCode
0,Downtown Toronto,Neighborhood,43.653232,-79.385296,
1,Japango,Sushi Restaurant,43.655268,-79.385165,M5G 1P5
2,Cafe Plenty,Café,43.654571,-79.38945,M5T 2Z5
3,Rolltation,Japanese Restaurant,43.654918,-79.387424,M5G 1C8
4,Tsujiri,Tea Room,43.655374,-79.385354,M5G 1P5
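
The function above assumes every row holds a list. A slightly more defensive sketch, which also returns `None` when the cell is missing (`NaN`) rather than a list, might look like this. `first_category_name` is a hypothetical name, and the sample entry is abridged from the output shown earlier.

```python
def first_category_name(categories):
    """Return the name of the first category, or None if it is unavailable."""
    if not isinstance(categories, list) or len(categories) == 0:
        return None
    return categories[0].get('name')


# Abridged from the venue.categories example above
sample = [{'id': '4bf58dd8d48988d1d2941735',
           'name': 'Sushi Restaurant',
           'primary': True}]

print(first_category_name(sample))
print(first_category_name([]))
print(first_category_name(float('nan')))
```

It could be applied column-wise with `df_venues['venue.categories'].apply(first_category_name)` before the column is overwritten.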


In [402]:
#clean column labels
df_venues.columns = [col.split(".")[-1] for col in df_venues.columns]  

#drop rows with NaN values in the postalCode column:
df_venues=df_venues.dropna(subset=['postalCode'])

In [379]:
df_venues.head()

Unnamed: 0,name,categories,lat,lng,postalCode
1,Japango,Sushi Restaurant,43.655268,-79.385165,M5G 1P5
2,Cafe Plenty,Café,43.654571,-79.38945,M5T 2Z5
3,Rolltation,Japanese Restaurant,43.654918,-79.387424,M5G 1C8
4,Tsujiri,Tea Room,43.655374,-79.385354,M5G 1P5
5,Sansotei Ramen 三草亭,Ramen Restaurant,43.655157,-79.386501,M5G 1Z8


In [403]:
df_venues.shape

(91, 5)
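
Note that Foursquare returns full postal codes such as `M5G 1P5`, whereas Parts 1 and 2 key on the first three characters (the forward sortation area). If the venues need to be related back to those tables, one option, sketched here under that assumption, is to truncate the code. The sample values are taken from the outputs in this notebook.

```python
import pandas as pd

postal = pd.Series(['M5G 1P5', 'M5T 2Z5', 'M5G 1C8', 'M5G'])

# Keep the first three characters of each postal code
fsa = postal.str[:3]
print(fsa.tolist())
```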

#### Creating a dataframe with venue categories as columns

In [404]:
# one hot encoding
toronto_onehot = pd.get_dummies(df_venues[['categories']],prefix="", prefix_sep="")

# add postal code column back to dataframe
toronto_onehot['postal code'] = df_venues['postalCode'] 

toronto_onehot.head()

Unnamed: 0,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Bakery,Bar,Bookstore,Brazilian Restaurant,Breakfast Spot,Bubble Tea Shop,Burrito Place,Café,Chinese Restaurant,Clothing Store,Coffee Shop,Comic Shop,Concert Hall,Cosmetics Shop,Department Store,Dessert Shop,Electronics Store,Food Court,French Restaurant,Furniture / Home Store,Gastropub,Greek Restaurant,Gym,Gym / Fitness Center,Hotel,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Jazz Club,Juice Bar,Lounge,Modern European Restaurant,Monument / Landmark,Office,Opera House,Park,Pizza Place,Plaza,Ramen Restaurant,Record Shop,Restaurant,Seafood Restaurant,Shopping Mall,Smoke Shop,Souvlaki Shop,Speakeasy,Steakhouse,Sushi Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theater,University,Vegetarian / Vegan Restaurant,Yoga Studio,postal code
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,M5G 1P5
2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,M5T 2Z5
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,M5G 1C8
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,M5G 1P5
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,M5G 1Z8


In [405]:
toronto_onehot.shape

(91, 60)
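
For readers unfamiliar with one-hot encoding, the following toy example shows what `pd.get_dummies` produces. The three rows are illustrative, and `.astype(int)` is added because recent pandas versions return boolean columns.

```python
import pandas as pd

toy = pd.DataFrame({'categories': ['Café', 'Tea Room', 'Café'],
                    'postalCode': ['M5T 2Z5', 'M5G 1P5', 'M5G 1P5']})

# One column per distinct category, 1 where the venue belongs to it
onehot = pd.get_dummies(toy[['categories']], prefix='', prefix_sep='').astype(int)
onehot['postal code'] = toy['postalCode']
print(onehot)
```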

#### Grouping rows by postal code and taking the mean frequency of occurrence of each category

In [406]:
toronto_grouped = toronto_onehot.groupby('postal code').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,postal code,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Bakery,Bar,Bookstore,Brazilian Restaurant,Breakfast Spot,Bubble Tea Shop,Burrito Place,Café,Chinese Restaurant,Clothing Store,Coffee Shop,Comic Shop,Concert Hall,Cosmetics Shop,Department Store,Dessert Shop,Electronics Store,Food Court,French Restaurant,Furniture / Home Store,Gastropub,Greek Restaurant,Gym,Gym / Fitness Center,Hotel,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Jazz Club,Juice Bar,Lounge,Modern European Restaurant,Monument / Landmark,Office,Opera House,Park,Pizza Place,Plaza,Ramen Restaurant,Record Shop,Restaurant,Seafood Restaurant,Shopping Mall,Smoke Shop,Souvlaki Shop,Speakeasy,Steakhouse,Sushi Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theater,University,Vegetarian / Vegan Restaurant,Yoga Studio
0,M5B 1R7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M5B 1V8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,M5B 2G9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0
3,M5B 2H1,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.428571,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M5B 2L9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
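
Because each row of the one-hot table contains a single 1, the per-group mean is the share of venues in each category. The same kind of table can be obtained with `pd.crosstab` and `normalize='index'`. The sketch below uses toy rows consistent with the output above.

```python
import pandas as pd

toy = pd.DataFrame({'postal code': ['M5B 2G9', 'M5B 2G9', 'M5B 2L9'],
                    'categories': ['Pizza Place', 'Tea Room', 'Clothing Store']})

# Rows: postal codes; columns: categories; values: share of venues per postal code
freq = pd.crosstab(toy['postal code'], toy['categories'], normalize='index')
print(freq)
```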


#### Let's print each postal code along with the top 5 most common venues

In [407]:
num_top_venues = 5

for postcode in toronto_grouped['postal code']:
    print("----"+postcode+"----")
    temp = toronto_grouped[toronto_grouped['postal code'] == postcode].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M5B 1R7----
                venue  freq
0          Comic Shop   1.0
1  Italian Restaurant   0.0
2           Jazz Club   0.0
3           Juice Bar   0.0
4              Lounge   0.0


----M5B 1V8----
                 venue  freq
0              Theater   1.0
1  American Restaurant   0.0
2           Restaurant   0.0
3            Jazz Club   0.0
4            Juice Bar   0.0


----M5B 2G9----
                 venue  freq
0             Tea Room   0.5
1          Pizza Place   0.5
2  American Restaurant   0.0
3           Restaurant   0.0
4            Jazz Club   0.0


----M5B 2H1----
               venue  freq
0     Cosmetics Shop  0.43
1      Shopping Mall  0.14
2          Bookstore  0.14
3  Electronics Store  0.14
4     Clothing Store  0.14


----M5B 2L9----
                 venue  freq
0       Clothing Store   1.0
1  American Restaurant   0.0
2           Restaurant   0.0
3            Jazz Club   0.0
4            Juice Bar   0.0


----M5B 2R8----
                 venue  freq
0            

#### Creating a dataframe with the most common venues by postal code

In [408]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [409]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['postalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postcode_venues_sorted = pd.DataFrame(columns=columns)
postcode_venues_sorted['postalCode'] = toronto_grouped['postal code']

for ind in np.arange(toronto_grouped.shape[0]):
    postcode_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
postcode_venues_sorted.head()

Unnamed: 0,postalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5B 1R7,Comic Shop,Yoga Studio,Coffee Shop,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store
1,M5B 1V8,Theater,Yoga Studio,Coffee Shop,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store
2,M5B 2G9,Tea Room,Pizza Place,Yoga Studio,Coffee Shop,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store
3,M5B 2H1,Cosmetics Shop,Clothing Store,Bookstore,Electronics Store,Shopping Mall,Coffee Shop,Gym,Greek Restaurant,Gastropub,Furniture / Home Store
4,M5B 2L9,Clothing Store,Coffee Shop,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,Dessert Shop
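
In the table above, categories with a frequency of 0 fill the remaining ranks. For example, `M5B 1R7` contains only a Comic Shop, yet nine more "most common" venues are listed. Their order reflects how ties are sorted, not popularity. A hedged sketch of one way to keep only categories that actually occur, using an illustrative row:

```python
import pandas as pd

# Illustrative frequencies for a single postal code
row = pd.Series({'Bookstore': 0.14, 'Comic Shop': 0.0, 'Cosmetics Shop': 0.43,
                 'Clothing Store': 0.14, 'Yoga Studio': 0.0})

# Drop categories that never occur, then rank the rest
top = row[row > 0].sort_values(ascending=False)
print(top.index.tolist())
```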


#### Clustering

In [410]:
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='postal code')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:15] 

array([1, 1, 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 2, 1], dtype=int32)
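
A quick way to see how balanced the clusters are is to count the labels. The sketch below uses only the first 15 labels printed above, so the counts are partial; with the fitted model, `kmeans.labels_` would be passed instead.

```python
import numpy as np

# First 15 labels, copied from the output above
labels = np.array([1, 1, 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 2, 1])

# Number of rows assigned to each of the 5 clusters
sizes = np.bincount(labels, minlength=5)
print(dict(enumerate(sizes.tolist())))
```

In this sample most rows fall into cluster 1, which suggests checking the full distribution before interpreting the smaller clusters.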

In [411]:
# add clustering labels
postcode_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [412]:
toronto_merged =df_venues[['postalCode','lat','lng']]
# add the latitude/longitude of each venue (one row per venue, so postal codes can repeat)
toronto_merged = pd.merge(postcode_venues_sorted,toronto_merged, on='postalCode',how='left')
toronto_merged.head()

Unnamed: 0,Cluster Labels,postalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,lat,lng
0,1,M5B 1R7,Comic Shop,Yoga Studio,Coffee Shop,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,43.657031,-79.381403
1,1,M5B 1V8,Theater,Yoga Studio,Coffee Shop,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,43.655102,-79.379768
2,1,M5B 2G9,Tea Room,Pizza Place,Yoga Studio,Coffee Shop,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,43.656586,-79.381167
3,1,M5B 2G9,Tea Room,Pizza Place,Yoga Studio,Coffee Shop,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,43.656518,-79.380015
4,1,M5B 2H1,Cosmetics Shop,Clothing Store,Bookstore,Electronics Store,Shopping Mall,Coffee Shop,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,43.653515,-79.380696


#### Visualizing Clusters

In [413]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=14)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['lat'], toronto_merged['lng'], toronto_merged['postalCode'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
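
K-means labels run from 0 to `kclusters - 1`, so indexing the palette with `cluster-1` sends cluster 0 to the last color through negative indexing. Every cluster still gets a distinct color, but the mapping is shifted by one. A small sketch of the index arithmetic, with placeholder strings standing in for the hex colors:

```python
kclusters = 5

# Placeholder names standing in for the hex colors in `rainbow`
palette = ['color{}'.format(i) for i in range(kclusters)]

shifted = {c: palette[c - 1] for c in range(kclusters)}  # indexing used in the cell above
direct = {c: palette[c] for c in range(kclusters)}       # label used as the index directly

print(shifted)
print(direct)
```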

#### Some Observations:
##### 1- Gyms and Greek restaurants rank 4th and 5th among the most common venues in cluster 0:

In [414]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,postalCode,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,lat,lng
25,M5G 1M6,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,43.656031,-79.38351
56,M5H 3B8,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,43.648753,-79.385367
66,M5T 1J7,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,43.655827,-79.392042
75,M5T 2Z5,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,43.654571,-79.38945


##### 2- Greek restaurants and gastropubs rank between 4th and 6th among the most common venues in cluster 2:

In [415]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,postalCode,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,lat,lng
15,M5G,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,Dessert Shop,43.658833,-79.383684
23,M5G 1J5,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,Dessert Shop,43.65857,-79.385123
33,M5G 1Z4,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,Dessert Shop,43.658421,-79.385613
46,M5H 2G4,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Electronics Store,Dessert Shop,43.650652,-79.384141
47,M5H 2G4,Gym,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Electronics Store,Dessert Shop,43.650579,-79.383412
73,M5T 2W5,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,Dessert Shop,43.654413,-79.390902
90,M5V 3X3,Greek Restaurant,Gastropub,Furniture / Home Store,French Restaurant,Food Court,Electronics Store,Dessert Shop,43.650751,-79.388047
