# Segmenting and Clustering Neighborhoods in the City of Toronto, Ontario, Canada

This assignment will explore postal codes and venues in Toronto.

Import everything we need before we start.

In [1]:
# import libraries

# time for performance measures
import time

# numpy for data vectors
import numpy as np

# pandas for data analysis
import pandas as pd
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

from collections import Counter

# transform json into pandas dataframe
try:
    from pandas import json_normalize # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize # older pandas

# json
import json

# requests
import requests

# nominatim to convert an address into latitude and longitude values
try:
    from geopy.geocoders import Nominatim 
except (ImportError, ModuleNotFoundError): #install only if necessary
    !conda install -c conda-forge geopy --yes 
    from geopy.geocoders import Nominatim

# Matplotlib for plotting
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as py

# import k-means from clustering stage
from sklearn.cluster import KMeans

# folium for visualizing maps
try:
    import folium
except (ImportError, ModuleNotFoundError): #install only if necessary
    !conda install -c conda-forge folium=0.5.0 --yes
    import folium # plotting library

## Part 1: Postal Code Data

We are using a Wikipedia table of postal codes in Toronto: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
# download html to dataframe
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(url).content
assert html[:15].decode("utf-8").lower()=='<!doctype html>', 'HTML required' # Check for html format, if not raise error


We prepare the dataframe as given:
> - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood &#x2714;

In [3]:
df = pd.read_html(html)[0] #read html table as df
df.columns = ['PostalCode', 'Borough', 'Neighborhood'] #rename columns

> - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. &#x2714;
> - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park. &#x2714;

I'll assume that all other values of postal code, borough and neighborhood are valid.
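
To make the two rules above concrete, here is a minimal sketch on three made-up rows (the values are for illustration only, not taken from the downloaded table). It assumes the same column names used in this notebook and adds a `.copy()` after filtering so the later assignment works on an independent dataframe.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same shape as the postal code table.
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Not assigned'],
})

# Drop unassigned boroughs; .copy() gives an independent frame to modify.
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Fall back to the borough name where the neighborhood is unassigned.
toy['Neighborhood'] = np.where(
    toy['Neighborhood'] == 'Not assigned', toy['Borough'], toy['Neighborhood'])

print(toy)
```

The first row is dropped, and the last row ends up with Queen's Park in both columns.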

In [4]:
df = df[df['Borough'] != 'Not assigned'].copy() #exclude borough 'not assigned'; copy so the next assignment modifies an independent dataframe
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood']) #where neighborhood is 'not assigned', use borough

> - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma... &#x2714;

To combine rows, I group by postal code, returning one row per postal code. 

For each postal code, I don't know if there will be multiples or duplicates of borough and neighborhood. So, I will:
1. use agg() with lambda x to aggregate both borough and neighborhood
1. use set() to get only unique values
1. use len() to get number of items per postal code 
1. use Counter() (from the collections module) to summarize the numbers
1. use apply() and join() to join the set items into strings
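
The steps above can be tried on a tiny made-up table first. This is only a sketch: the three rows are invented (M5A mirrors the example quoted above), and the `sorted()` call is my own addition, since sets have no guaranteed order and sorting makes the joined string reproducible.

```python
from collections import Counter

import pandas as pd

# Made-up rows; M5A appears twice with two different neighborhoods.
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M4E'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'East Toronto'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'The Beaches'],
})

# One row per postal code, each column reduced to its set of unique values.
grouped = toy.groupby('PostalCode').agg({
    'Borough': lambda x: set(x),
    'Neighborhood': lambda x: set(x),
}).reset_index()

# How many postal codes have 1, 2, ... neighborhoods?
sizes = Counter(len(s) for s in grouped['Neighborhood'])

# Sets are unordered, so sort before joining to get a reproducible string.
grouped['Borough'] = grouped['Borough'].apply(lambda s: ', '.join(sorted(s)))
grouped['Neighborhood'] = grouped['Neighborhood'].apply(lambda s: ', '.join(sorted(s)))

print(sizes)
print(grouped)
```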

In [5]:
# group by postal code, aggregate into sets of unique values for borough and neighborhood
df = df.groupby('PostalCode').agg({
    'Borough': lambda x: set(x),
    'Neighborhood': lambda x: set(x)
    }).reset_index()

# count unique boroughs per postal code
borough_count = Counter([len(s) for s in df['Borough']])
for k,v in borough_count.items():
    print('%i postal code(s) with %i borough(s) each.' % (v, k))
print()

# count unique neighborhoods per postal code
neighbor_count = Counter([len(s) for s in df['Neighborhood']])
for k,v in neighbor_count.items():
    print('%i postal code(s) with %i neighborhood(s) each.' % (v, k))
print()

# combine sets into delimited strings (e.g. 'Regent Park, Harbourfront')
df['Borough']=df['Borough'].apply(', '.join)
df['Neighborhood']=df['Neighborhood'].apply(', '.join)

103 postal code(s) with 1 borough(s) each.

30 postal code(s) with 2 neighborhood(s) each.
17 postal code(s) with 3 neighborhood(s) each.
46 postal code(s) with 1 neighborhood(s) each.
4 postal code(s) with 4 neighborhood(s) each.
3 postal code(s) with 5 neighborhood(s) each.
1 postal code(s) with 7 neighborhood(s) each.
2 postal code(s) with 8 neighborhood(s) each.



Looks like I didn't need to worry about the boroughs, since each postal code has only 1.

In [6]:
# preview the dataframe
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


The dataframe is ready.
> - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making. &#x2714;
> - In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe. &#x2714;

In [7]:
# check final row count
pc_count = df.shape[0]
print(pc_count, 'rows')

103 rows


## Part 2: Geographical Data

We can use either the geocoder module or the provided csv at https://cocl.us/Geospatial_data for the latitude and longitude values.

In [8]:
# import geocoder to get lat, long of neighborhoods
try:
    import geocoder
except (ImportError, ModuleNotFoundError): #install only if necessary
    !conda install -c conda-forge geocoder --yes 
    print('geocoder installed.')
    import geocoder
print('geocoder imported.')

geocoder imported.


To use the geocoder, I'll loop through each postal code, call geocoder.google, and add the coordinates to a dataframe.

The guidelines recommend repeatedly calling the geocoder until coordinates are returned. 

I'll also add these conditions to print an error message and break the loops, so they don't run forever:
1. Google sends back the string 'REQUEST_DENIED' (probably due to exceeding the call limit).
2. The total calls reaches the daily limit.
3. The running time reaches a limit.
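
The call and time limits can be sketched separately from the geocoder itself. In the sketch below, `geocode_with_limits` and its `lookup` argument are hypothetical names of my own; `lookup` stands in for whatever geocoder call is used and is assumed to return a (lat, lng) pair or None. The request-denied check is left out because it depends on the geocoder's response object.

```python
import time


def geocode_with_limits(lookup, query, call_limit=2500, time_limit=60.0):
    """Call lookup(query) until it returns coordinates or a limit is hit.

    lookup is a hypothetical stand-in for the geocoder call. It should
    return a (lat, lng) pair, or None when no result is available.
    Returns (coords, calls, error), where error is None on success.
    """
    start = time.perf_counter()
    calls = 0
    while True:
        coords = lookup(query)
        calls += 1
        if coords is not None:
            return coords, calls, None
        if calls >= call_limit:
            return None, calls, 'Call limit (%i) reached.' % call_limit
        if time.perf_counter() - start > time_limit:
            return None, calls, 'Time limit (%0.2fs) exceeded.' % time_limit


# Usage with a stand-in lookup that fails twice before succeeding.
# The coordinates are made up for the example.
attempts = iter([None, None, (43.65, -79.38)])
coords, calls, error = geocode_with_limits(
    lambda q: next(attempts), 'M5G, Toronto, Ontario')
print(coords, calls, error)
```

Returning the error instead of printing it keeps the decision about whether to stop the outer loop with the caller.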

In [9]:
# use geocoder

# initialize limiting variables
calls=0
call_limit=2500
t1 = time.perf_counter()
t2=0.0
t_limit = 60.0
error = None

geo_data=[]
# loop through all postal codes
for postal_code in df['PostalCode']:

    # initialize lat_long_coords to None
    lat_lng_coords = None

    # loop until you get the coordinates or error condition met
    while( lat_lng_coords is None and not error):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        calls+=1
        t2=time.perf_counter()
        if str(g)[1:17] == '[REQUEST_DENIED]': # stop if request denied
            error=('Geocoder request denied. ')
        if calls == call_limit: # stop if call limit reached
            error=('Geocoder known call limit (%i) reached. ' % call_limit)
        if t2-t1 > t_limit: # stop if time limit exceeded
            error=('Time limit (%0.2fs) exceeded. ' % t_limit)

    if error: # if any error conditions reached, stop looping postal codes
        print(error)
        break
            
    # build list
    try:
        geo_data.append({
            'PostalCode' : postal_code,
            'Latitude' : lat_lng_coords[0],
            'Longitude' : lat_lng_coords[1]})
    except (TypeError, IndexError): # coordinates missing or malformed
        break
# convert to dataframe
df_geo=pd.DataFrame(geo_data)

coord_count = df_geo.shape[0]
print('Geocoder found %i out of %i postal code coordinates.' % (coord_count, pc_count))
print('Geocoder made %i calls over %0.2fs.' % (calls, t2-t1))

Geocoder request denied. 
Geocoder found 0 out of 103 postal code coordinates.
Geocoder made 1 calls over 0.19s.


If the geocoder doesn't work, I'll just read the csv and merge it to my postal code dataframe.
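
Before merging the real data, here is a small sketch of the merge on made-up frames. The `how='left'` and `validate='one_to_one'` arguments are my own choices for the example, not part of the assignment: a left join keeps postal codes that have no coordinates so they can be spotted, and the validation raises an error if a postal code repeats on either side.

```python
import pandas as pd

# Made-up frames standing in for the postal code table and the coordinates CSV.
codes = pd.DataFrame({'PostalCode': ['M4E', 'M4K', 'M4L'],
                      'Borough': ['East Toronto'] * 3})
coords = pd.DataFrame({'PostalCode': ['M4E', 'M4K'],
                       'Latitude': [43.676357, 43.679557],
                       'Longitude': [-79.293031, -79.352188]})

# A left join keeps every postal code; validate raises if a key repeats.
merged = pd.merge(codes, coords, on='PostalCode', how='left',
                  validate='one_to_one')

# Postal codes with no coordinates show up as NaN rather than vanishing.
missing = merged.loc[merged['Latitude'].isna(), 'PostalCode'].tolist()
print(missing)
```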

In [10]:
if coord_count != pc_count:
    # read csv
    df_geo = pd.read_csv('https://cocl.us/Geospatial_data')
    df_geo.columns=('PostalCode','Latitude','Longitude')

In [11]:
df = pd.merge(df, df_geo, on='PostalCode') # join on the shared postal code column

In [12]:
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


## Part 3: Clustering and Mapping

> You can decide to work with only boroughs that contain the word Toronto.

I'll use str.contains to filter the rows and create a new dataframe.
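
As a quick illustration of the filter, here is a sketch on a made-up series. The `regex=False` and `na=False` arguments are optional extras I'm adding for the example: the first treats the pattern as plain text, the second turns missing values into False so the mask can be used directly for indexing.

```python
import pandas as pd

# Made-up borough values, including a missing one.
boroughs = pd.Series(['East Toronto', 'Scarborough', 'Downtown Toronto', None])

# Plain-text match; missing values count as "does not contain".
mask = boroughs.str.contains('Toronto', regex=False, na=False)
print(boroughs[mask].tolist())
```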

In [13]:
toronto_data = df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(toronto_data['Borough'].value_counts())
toronto_data.head()

Downtown Toronto    18
Central Toronto      9
West Toronto         6
East Toronto         5
Name: Borough, dtype: int64


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


#### Use geopy library to get the latitude and longitude values of Toronto.

In [14]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="explorer") #define a user_agent.
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The coordinates of %s are %0.6f %0.6f.' % (address, latitude, longitude))

The coordinates of Toronto, Ontario are 43.653963 -79.387207.


#### Create a map of Toronto with postal codes superimposed on top.

Create the map with folium, labeling each marker with its postal code, neighborhoods and borough.

In [15]:
# create map using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for postalcode, borough, neighborhood, lat, long in zip(toronto_data['PostalCode'], 
                                                        toronto_data['Borough'], 
                                                        toronto_data['Neighborhood'], 
                                                        toronto_data['Latitude'], 
                                                        toronto_data['Longitude']):
    label = '%s, (%s), %s' % (postalcode, neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Explore Postal Codes in Toronto

#### Define Foursquare Credentials and Version

In [1]:
# @hidden_cell

CLIENT_ID = 'TTJ4LSILREWCDMCJOCXMTGHVJBYIFD0H5K10WVPYIBUOMWQ5' # your Foursquare ID
CLIENT_SECRET = '14RPHQIVUD4MQZBBVZ0LX3K0ZF1OVXLVNVJREKMCFCMO0QYT' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#### Use a function to get the top venues within a radius of each postal code in Toronto.

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=10):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Origin', 
                  'Origin Latitude', 
                  'Origin Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
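
The function above indexes straight into the response, so a postal code with an empty result or a venue without a category would raise an error. As a sketch only, the parsing step could be pulled into a helper that tolerates those cases. `extract_venue_rows` is a hypothetical name, the `'Uncategorized'` label is a placeholder I chose, and the sample payload is hand-written for illustration rather than a real API response.

```python
def extract_venue_rows(origin, lat, lng, payload):
    """Flatten one explore-style response into row tuples.

    payload is assumed to use the same nesting the function above indexes:
    payload['response']['groups'][0]['items'][i]['venue'].
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # 'Uncategorized' is a placeholder chosen for this sketch.
        category = categories[0]['name'] if categories else 'Uncategorized'
        rows.append((origin, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Hand-written payload for illustration; not a real API response.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Glen Manor Ravine',
               'location': {'lat': 43.676821, 'lng': -79.293942},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.68, 'lng': -79.29},
               'categories': []}},
]}]}}

rows = extract_venue_rows('M4E', 43.676357, -79.293031, sample)
print(rows)
print(extract_venue_rows('M4E', 43.676357, -79.293031, {'response': {}}))
```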

#### Now run the above function on each postal code and create a new dataframe.

In [18]:
toronto_venues = getNearbyVenues(names=toronto_data['PostalCode'],
                                 latitudes=toronto_data['Latitude'],
                                 longitudes=toronto_data['Longitude'],
                                 radius=500,
                                 limit=100
                                  )

M4E
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6G
M6H
M6J
M6K
M6P
M6R
M6S
M7Y


#### Let's check the resulting dataframe

In [19]:
print(toronto_venues.shape)
toronto_venues.head()

(1699, 7)


Unnamed: 0,Origin,Origin Latitude,Origin Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,M4E,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,M4E,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,M4E,43.676357,-79.293031,Dip 'n Sip,43.678897,-79.297745,Coffee Shop


<span style="color:red">
Hmm, Neighborhood doesn't seem like it should be a venue category. Let's remove it.
</span>

In [20]:
toronto_venues = toronto_venues[toronto_venues['Venue Category']!='Neighborhood'].reset_index(drop=True)
toronto_venues.head()

Unnamed: 0,Origin,Origin Latitude,Origin Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,M4E,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,M4E,43.676357,-79.293031,Dip 'n Sip,43.678897,-79.297745,Coffee Shop
4,M4K,43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


Let's check how many venues were returned for each postal code.

In [21]:
toronto_venues[['Origin','Venue']].groupby('Origin').count()

Unnamed: 0_level_0,Venue
Origin,Unnamed: 1_level_1
M4E,4
M4K,42
M4L,18
M4M,38
M4N,4
M4P,9
M4R,20
M4S,35
M4T,4
M4V,15


#### Let's find out how many unique categories can be curated from all the returned venues

In [22]:
print('There are %i unique categories across %i venues.' % (
    len(toronto_venues['Venue Category'].unique()), 
    toronto_venues.shape[0]))

There are 238 unique categories across 1695 venues.


In [23]:
toronto_venues['Venue Category'].value_counts()

Coffee Shop                        143
Café                                87
Restaurant                          53
Italian Restaurant                  47
Bakery                              43
Hotel                               39
Bar                                 37
Park                                35
Pizza Place                         34
Gym                                 25
Japanese Restaurant                 25
Gastropub                           23
American Restaurant                 23
Seafood Restaurant                  22
Sandwich Place                      22
Steakhouse                          21
Thai Restaurant                     21
Breakfast Spot                      21
Ice Cream Shop                      20
Pub                                 20
Burger Joint                        19
Sushi Restaurant                    19
Vegetarian / Vegan Restaurant       19
Diner                               18
Beer Bar                            18
Clothing Store           

<span style='color:red'>
Some of these categories could be grouped into broader categories, like 'Restaurant'. I'll roughly group them by taking the first and last word of each name, e.g. 'Chinese Restaurant' becomes 'Chinese' and 'Restaurant'.
</span>
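
The first-and-last-word rule can be sketched as a small standalone function. `first_last` is a hypothetical helper name used only for this illustration, and it assumes the category name is not empty.

```python
def first_last(category):
    """Return the first and last word of a category name as a two-item list."""
    words = category.split()
    return [words[0], words[-1]]


print(first_last('Chinese Restaurant'))  # ['Chinese', 'Restaurant']
print(first_last('Health Food Store'))   # ['Health', 'Store']
print(first_last('Trail'))               # ['Trail', 'Trail']
```

Note that a single-word category such as 'Trail' yields the same word twice, so it contributes two identical rows after the split.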

In [24]:
categories = toronto_venues['Venue Category'].str.split() #split words

# keep only the first and last word of each category
categories = categories.apply(lambda words: [words[0], words[-1]])

categories = categories.apply(pd.Series).stack() #separate words into 2-column array, then stack them as rows instead
categories.index = categories.index.droplevel(1) #remove index level created by stack
categories.name = 'Categories' #merge requires named series or dataframe
grouped_categories = pd.merge(toronto_venues, categories, left_index=True, right_index=True) #merge on index
grouped_categories.head()

Unnamed: 0,Origin,Origin Latitude,Origin Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Categories
0,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail,Trail
0,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail,Trail
1,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store,Health
1,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store,Store
2,M4E,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub,Pub


<span style='color:red'>
Splitting the category names doubled the number of rows (two per venue). I'll normalize the counts per postal code later, so it's ok.

The grouped categories now include some larger groups, such as 'Restaurant', 'Shop', 'Café', 'Bar' and 'Store'. 'Coffee Shop', which was already consistently classified, did not benefit from the split.
</span>
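
To check that the doubling is harmless, here is a small worked example using the four M4E categories shown earlier. It is a plain-Python sketch of the arithmetic, not the pandas code used in this notebook: each venue contributes two tokens, and each token's share is its count divided by the total number of tokens for the postal code.

```python
from collections import Counter

# The four M4E venue categories left after dropping 'Neighborhood'.
m4e = ['Trail', 'Health Food Store', 'Pub', 'Coffee Shop']

# First and last word of each category: two tokens per venue.
tokens = [w for c in m4e for w in (c.split()[0], c.split()[-1])]

# Share of each token among all tokens for this postal code.
counts = Counter(tokens)
shares = {token: n / len(tokens) for token, n in counts.items()}
print(shares)
```

With 8 tokens, 'Trail' and 'Pub' each get 2/8 = 0.25 and the other four tokens get 1/8 = 0.125, which agrees with the 0.125 and 0.25 values that appear for M4E in the grouped output further down. One side effect worth knowing: a single-word category counts twice under one label, so it weighs more than either half of a multi-word category.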

In [25]:
grouped_categories['Categories'].value_counts()

Restaurant       467
Shop             267
Café             174
Coffee           143
Bar              139
Store             99
Place             88
Bakery            86
Hotel             83
Park              71
Gym               63
Italian           47
Gastropub         46
Pub               43
Steakhouse        42
Bookstore         36
Diner             36
Joint             34
Pizza             34
Theater           30
Japanese          25
Brewery           24
American          23
Sandwich          22
Seafood           22
Breakfast         21
Thai              21
Beer              21
Spot              21
Ice               20
Lounge            20
Vegetarian        19
Burger            19
Sushi             19
Market            19
Clothing          18
Greek             16
Spa               16
Chinese           16
Cocktail          15
Museum            15
Art               14
Bodega            14
Bank              14
Deli              14
Studio            13
Room              13
Dessert      

In [26]:
grouped_categories.groupby(['Categories','Venue Category']).size()

Categories     Venue Category                 
Afghan         Afghan Restaurant                    1
Airport        Airport                              2
               Airport Food Court                   1
               Airport Gate                         1
               Airport Lounge                       2
               Airport Service                      2
               Airport Terminal                     2
American       American Restaurant                 23
Antique        Antique Shop                         3
Aquarium       Aquarium                            10
Art            Art Gallery                         12
               Art Museum                           2
Arts           Arts & Crafts Store                  4
Asian          Asian Restaurant                    12
Auto           Auto Workshop                        1
BBQ            BBQ Joint                            5
Baby           Baby Store                           1
Bagel          Bagel Shop          

## Analyze Each Neighborhood

The rest follows the New York lab pretty closely, except that my unit is postal code instead of neighborhood.

In [27]:
# one hot encoding categories
toronto_onehot = pd.get_dummies(grouped_categories[['Categories']], prefix="", prefix_sep="")

# add postal code column back to dataframe
toronto_onehot['PostalCode'] = grouped_categories['Origin'] 

# move postal code column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,PostalCode,Afghan,Airport,American,Antique,Aquarium,Art,Arts,Asian,Auto,BBQ,Baby,Bagel,Bakery,Bank,Bar,Barbershop,Baseball,Basketball,Beach,Beer,Bistro,Boat,Bodega,Bookstore,Boutique,Brazilian,Breakfast,Brewery,Bubble,Building,Burger,Burrito,Bus,Butcher,Cafe,Café,Cajun,Camera,Caribbean,Center,Cheese,Chinese,Chocolate,Church,Climbing,Clothing,Club,Cocktail,Coffee,College,Colombian,Comfort,Comic,Concert,Convenience,Cosmetics,Costume,Court,Coworking,Creperie,Cuban,Cupcake,Dance,Deck,Deli,Department,Dessert,Dim,Diner,Discount,Dive,Dog,Dojo,Doner,Donut,Dumpling,Eastern,Electronics,Entertainment,Ethiopian,Event,Falafel,Farmers,Fast,Ferry,Filipino,Fish,Flea,Flower,Food,Fountain,French,Fried,Fruit,Furniture,Gallery,Gaming,Garden,Gastropub,Gate,Gay,General,German,Gift,Gluten-free,Gourmet,Greek,Grocery,Gym,Hall,Harbor,Health,Historic,History,Hobby,Hookah,Hospital,Hostel,Hotel,Hotpot,House,Ice,Indian,Indie,Intersection,Irish,Italian,Japanese,Jazz,Jewelry,Jewish,Joint,Juice,Korean,Lake,Landmark,Latin,Light,Line,Lingerie,Liquor,Location,Lookout,Lounge,Mac,Malay,Mall,Marina,Market,Martial,Massage,Mediterranean,Men's,Metro,Mexican,Middle,Miscellaneous,Modern,Molecular,Monument,Movie,Museum,Music,New,Nightclub,Noodle,Office,Opera,Optical,Organic,Other,Outdoor,Outdoors,Park,Performing,Persian,Pet,Pharmacy,Pizza,Place,Plane,Playground,Plaza,Poke,Polish,Portuguese,Poutine,Pub,Ramen,Record,Recording,Rental,Restaurant,Rink,Roof,Room,Run,Sake,Salad,Salon,Sandwich,Scenic,School,Sculpture,Seafood,Service,Shoe,Shop,Shopping,Site,Skate,Skating,Smoke,Smoothie,Snack,Soup,Southern,Spa,Space,Speakeasy,Sporting,Sports,Spot,Stadium,Station,Stationery,Steakhouse,Store,Strip,Studio,Supermarket,Supplement,Sushi,Swim,Taco,Tailor,Taiwanese,Tanning,Tapas,Tea,Terminal,Thai,Theater,Theme,Thrift,Toy,Trail,Train,Travel,Truck,Vegetarian,Venue,Video,Vietnamese,Wine,Women's,Workshop,Yoga
0,M4E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
0,M4E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,M4E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M4E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M4E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [28]:
toronto_onehot.shape

(3390, 259)

#### Next, let's group rows by postal code, taking the mean of the frequency of occurrence of each category
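
Here is a minimal sketch of the one-hot-then-mean idea on made-up tokens for two postal codes. The `dtype=float` argument is my own addition to make the averaging explicit. Within each group, the mean of a 0/1 column is that token's share of the group's rows.

```python
import pandas as pd

# Made-up tokens for two postal codes.
toy = pd.DataFrame({
    'Origin': ['M4E', 'M4E', 'M4E', 'M4E', 'M4K', 'M4K'],
    'Categories': ['Trail', 'Trail', 'Coffee', 'Shop', 'Greek', 'Restaurant'],
})

# One indicator column per token.
onehot = pd.get_dummies(toy['Categories'], dtype=float)
onehot.insert(0, 'PostalCode', toy['Origin'])

# Mean of each indicator column within a postal code.
grouped = onehot.groupby('PostalCode').mean().reset_index()
print(grouped)
```

Each row of the result sums to 1, so postal codes with many venues and postal codes with few become comparable.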

In [29]:
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped

Unnamed: 0,PostalCode,Afghan,Airport,American,Antique,Aquarium,Art,Arts,Asian,Auto,BBQ,Baby,Bagel,Bakery,Bank,Bar,Barbershop,Baseball,Basketball,Beach,Beer,Bistro,Boat,Bodega,Bookstore,Boutique,Brazilian,Breakfast,Brewery,Bubble,Building,Burger,Burrito,Bus,Butcher,Cafe,Café,Cajun,Camera,Caribbean,Center,Cheese,Chinese,Chocolate,Church,Climbing,Clothing,Club,Cocktail,Coffee,College,Colombian,Comfort,Comic,Concert,Convenience,Cosmetics,Costume,Court,Coworking,Creperie,Cuban,Cupcake,Dance,Deck,Deli,Department,Dessert,Dim,Diner,Discount,Dive,Dog,Dojo,Doner,Donut,Dumpling,Eastern,Electronics,Entertainment,Ethiopian,Event,Falafel,Farmers,Fast,Ferry,Filipino,Fish,Flea,Flower,Food,Fountain,French,Fried,Fruit,Furniture,Gallery,Gaming,Garden,Gastropub,Gate,Gay,General,German,Gift,Gluten-free,Gourmet,Greek,Grocery,Gym,Hall,Harbor,Health,Historic,History,Hobby,Hookah,Hospital,Hostel,Hotel,Hotpot,House,Ice,Indian,Indie,Intersection,Irish,Italian,Japanese,Jazz,Jewelry,Jewish,Joint,Juice,Korean,Lake,Landmark,Latin,Light,Line,Lingerie,Liquor,Location,Lookout,Lounge,Mac,Malay,Mall,Marina,Market,Martial,Massage,Mediterranean,Men's,Metro,Mexican,Middle,Miscellaneous,Modern,Molecular,Monument,Movie,Museum,Music,New,Nightclub,Noodle,Office,Opera,Optical,Organic,Other,Outdoor,Outdoors,Park,Performing,Persian,Pet,Pharmacy,Pizza,Place,Plane,Playground,Plaza,Poke,Polish,Portuguese,Poutine,Pub,Ramen,Record,Recording,Rental,Restaurant,Rink,Roof,Room,Run,Sake,Salad,Salon,Sandwich,Scenic,School,Sculpture,Seafood,Service,Shoe,Shop,Shopping,Site,Skate,Skating,Smoke,Smoothie,Snack,Soup,Southern,Spa,Space,Speakeasy,Sporting,Sports,Spot,Stadium,Station,Stationery,Steakhouse,Store,Strip,Studio,Supermarket,Supplement,Sushi,Swim,Taco,Tailor,Taiwanese,Tanning,Tapas,Tea,Terminal,Thai,Theater,Theme,Thrift,Toy,Trail,Train,Travel,Truck,Vegetarian,Venue,Video,Vietnamese,Wine,Women's,Workshop,Yoga
0,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,0.0,0.0,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.02381,0.011905,0.0,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.107143,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.011905,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.011905,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.214286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.107143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.011905,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.011905,0.0,0.0,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011905
2,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.027778,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.027778,0.0,0.055556,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.055556,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.026316,0.039474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.105263,0.0,0.0,0.0,0.013158,0.013158,0.013158,0.0,0.0,0.0,0.013158,0.0,0.0,0.039474,0.0,0.0,0.013158,0.0,0.0,0.013158,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.131579,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.013158,0.0,0.0,0.065789,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.052632,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158
4,M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M4P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.055556,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M4R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.05,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.05,0.0,0.0,0.0,0.025,0.0,0.0,0.075,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025
7,M4S,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042857,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.014286,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.028571,0.042857,0.085714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.185714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042857,0.0,0.0,0.0,0.014286,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M4T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,M4V,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.133333,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.033333,0.0,0.0,0.033333,0.0,0.0,0.066667,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0


#### Let's confirm the new size

In [30]:
toronto_grouped.shape

(38, 259)

#### Let's put each postal code with the top venues in a *pandas* dataframe

First, let's write a function that returns a row's most frequent venue categories, sorted in descending order of frequency.

In [31]:
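def return_most_common_venues(row, num_top_venues):
    """Return the names of the num_top_venues most frequent categories in a row."""
    # skip the first entry, which holds the postal code
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]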
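
To see what this kind of function returns, here is a minimal sketch on a made-up row. The helper name `top_categories`, the postal code `X0X` and the frequencies are all hypothetical and only serve as an illustration; they are not taken from the Toronto data.

```python
import pandas as pd

def top_categories(row, num_top_venues):
    """Return the names of the most frequent categories in a row.

    The first entry is assumed to hold the postal code and is skipped.
    """
    frequencies = row.iloc[1:].astype(float)
    return frequencies.sort_values(ascending=False).index.values[:num_top_venues]

# hypothetical row with made-up frequencies
toy_row = pd.Series({'PostalCode': 'X0X', 'Park': 0.5, 'Bar': 0.1, 'Pub': 0.3, 'Gym': 0.0})

top3 = list(top_categories(toy_row, 3))
print(top3)  # ['Park', 'Pub', 'Bar']
```

The result holds category names only, ordered from most to least frequent; the frequencies themselves are dropped.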

Now let's create the new dataframe and display the 10 most common venue categories for each postal code.

In [32]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks 4 and up take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['PostalCode'] = toronto_grouped['PostalCode']

for ind in np.arange(toronto_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,Trail,Pub,Coffee,Health,Store,Shop,Event,Falafel,Farmers,Fast
1,M4K,Restaurant,Shop,Greek,Store,Coffee,Italian,Ice,Bookstore,Spa,Pub
2,M4L,Place,Restaurant,Park,Brewery,Gym,Pub,Store,Pizza,Steakhouse,Shop
3,M4M,Restaurant,Café,Shop,Store,Bakery,Coffee,Bar,Park,Brewery,Bookstore
4,M4N,Park,Restaurant,Line,Swim,School,Bus,Dim,Discount,Falafel,Fried
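
One thing to keep in mind when reading this table: postal codes with few venues do not have 10 categories with a non-zero frequency. The M4N row of `toronto_grouped`, for example, appears to have only seven non-zero entries, so the last slots of its top 10 are filled with categories whose frequency is 0. A possible way to handle this, sketched below under the assumption that empty slots should simply be left as `None`, is to drop zero-frequency categories before sorting. The helper name `top_nonzero_categories` and the toy row are hypothetical.

```python
import pandas as pd

def top_nonzero_categories(row, num_top_venues):
    """Like the sort above, but ignore categories whose frequency is zero."""
    frequencies = row.iloc[1:].astype(float)
    nonzero = frequencies[frequencies > 0].sort_values(ascending=False, kind='stable')
    names = list(nonzero.index[:num_top_venues])
    # pad with None so every row still has num_top_venues entries
    return names + [None] * (num_top_venues - len(names))

# hypothetical row with made-up frequencies
toy_row = pd.Series({'PostalCode': 'X0X', 'Park': 0.25, 'Gym': 0.0, 'Pub': 0.75, 'Zoo': 0.0})

padded = top_nonzero_categories(toy_row, 3)
print(padded)  # ['Pub', 'Park', None]
```

Whether to pad, or to keep the zero-frequency entries as the notebook does, is a design choice; padding makes it obvious which entries are backed by actual venues.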


### Cluster Neighborhoods

Run *k*-means to group the postal codes into 5 clusters.

In [33]:
# set number of clusters
kclusters = 5

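# keep only the numeric frequency columns (positional axis arguments no longer work in recent pandas)
toronto_grouped_clustering = toronto_grouped.drop(columns='PostalCode')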

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 2, 2, 2, 2, 0, 2])
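
Nine of the first ten labels are 2, which hints that one cluster dominates. Let's sketch a quick way to count the members of each cluster. The array below is copied from the output above so that the example runs on its own; in the notebook you would pass `kmeans.labels_` instead.

```python
from collections import Counter

import numpy as np

# first ten labels as printed above; use kmeans.labels_ for the full picture
labels = np.array([2, 2, 2, 2, 2, 2, 2, 2, 0, 2])

cluster_sizes = Counter(labels.tolist())
for cluster, size in sorted(cluster_sizes.items()):
    print('cluster {}: {} postal codes'.format(cluster, size))
```

A very uneven split like this one is worth keeping in mind when interpreting the clusters later on.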

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [34]:
# add clustering labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# join the cluster labels and top venues onto toronto_data, which holds latitude/longitude for each postal code
toronto_merged = toronto_data.join(venues_sorted.set_index('PostalCode'), on='PostalCode')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Trail,Pub,Coffee,Health,Store,Shop,Event,Falafel,Farmers,Fast
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,2,Restaurant,Shop,Greek,Store,Coffee,Italian,Ice,Bookstore,Spa,Pub
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,2,Place,Restaurant,Park,Brewery,Gym,Pub,Store,Pizza,Steakhouse,Shop
3,M4M,East Toronto,Studio District,43.659526,-79.340923,2,Restaurant,Café,Shop,Store,Bakery,Coffee,Bar,Park,Brewery,Bookstore
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Park,Restaurant,Line,Swim,School,Bus,Dim,Discount,Falafel,Fried


Finally, let's visualize the resulting clusters on a map.

In [35]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
colors_array = cm.Set1(np.linspace(0, 1, 9)) # Set1 color map has 9 colors
color_map = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, postalcode, borough, neighbor, cluster, first, second in zip(
    toronto_merged['Latitude'], 
    toronto_merged['Longitude'], 
    toronto_merged['PostalCode'], 
    toronto_merged['Borough'], 
    toronto_merged['Neighborhood'],
    toronto_merged['Cluster Labels'],
    toronto_merged['1st Most Common Venue'],
    toronto_merged['2nd Most Common Venue']):
    label = folium.Popup( '%s (%s: %s). \nCluster %i. \n%s and %s' % (postalcode, borough, neighbor, cluster+1, first, second), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=color_map[cluster%9],
        fill=True,
        fill_color=color_map[cluster%9],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Examine Clusters

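We can now review each cluster in turn. The headings number the clusters from 1, as the map popups do, so Cluster 1 corresponds to cluster label 0.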

#### Cluster 1: Central Toronto Restaurants, Playgrounds and Parks

In [36]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,0,Restaurant,Playground,Park,Gym,Food,Flower,Flea,Fish,Ethiopian,Ferry


#### Cluster 2: Central Toronto Gardens, Gallery and Furniture

In [37]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,M5N,Central Toronto,Roselawn,43.711695,-79.416936,1,Garden,Gallery,Furniture,Fruit,Fried,French,Fountain,Food,Flower,Flea


#### Cluster 3: Toronto Restaurants and Shops

In [38]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Trail,Pub,Coffee,Health,Store,Shop,Event,Falafel,Farmers,Fast
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,2,Restaurant,Shop,Greek,Store,Coffee,Italian,Ice,Bookstore,Spa,Pub
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,2,Place,Restaurant,Park,Brewery,Gym,Pub,Store,Pizza,Steakhouse,Shop
3,M4M,East Toronto,Studio District,43.659526,-79.340923,2,Restaurant,Café,Shop,Store,Bakery,Coffee,Bar,Park,Brewery,Bookstore
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Park,Restaurant,Line,Swim,School,Bus,Dim,Discount,Falafel,Fried
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197,2,Hotel,Park,Gym,Spot,Run,Store,Studio,Breakfast,Clothing,Shop
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,2,Shop,Restaurant,Store,Sporting,Diner,Park,Coffee,Clothing,Spa,Burger
7,M4S,Central Toronto,Davisville,43.704324,-79.38879,2,Restaurant,Shop,Place,Café,Sandwich,Pizza,Dessert,Thai,Park,Coffee
9,M4V,Central Toronto,"Rathnelly, South Hill, Summerhill West, Forest...",43.686412,-79.400049,2,Restaurant,Pub,Shop,Supermarket,Coffee,Station,Bar,Sports,Liquor,Bagel
11,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,2,Restaurant,Store,Park,Pub,Place,Shop,Café,Bakery,Coffee,Pharmacy


#### Cluster 4: Toronto Airport

In [39]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,M5V,Downtown Toronto,"South Niagara, Bathurst Quay, King and Spadina...",43.628947,-79.39442,3,Airport,Service,Lounge,Terminal,Plane,Boat,Coffee,Marina,Shop,Harbor


#### Cluster 5: Downtown and Central Toronto Trails and Playgrounds

In [40]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,4,Park,Playground,Building,Trail,Falafel,Farmers,Fast,Ferry,Filipino,Yoga
23,M5P,Central Toronto,"Forest Hill West, Forest Hill North",43.696948,-79.411307,4,Park,Trail,Restaurant,Jewelry,Store,Sushi,Filipino,Event,Falafel,Farmers


Thank you for viewing my work.