# Hello!

In this notebook we scrape the Wikipedia page linked below to extract the Canadian postal codes that begin with M. We then attach geographical coordinates to each neighborhood and, with the help of the Foursquare API, retrieve the venues within these neighborhoods.

Ultimately, our goal is to group neighborhoods into clusters based on their venue categories, which should help us understand the dynamics of each neighborhood.

In this notebook, we will complete clustering for Toronto, Ontario only.



<h3 id="header">PART 1</h3>


**To segment and cluster the neighborhoods in Toronto, we need the libraries imported below. To scrape the Wikipedia page, we use the Beautiful Soup library.**
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import urllib.request
from geopy import Nominatim



**We request the Wikipedia page and read it in raw format. To make better sense of the page, we can use the `prettify` function.**

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)

soup = BeautifulSoup(page, "lxml")

#print(soup.prettify())


In [3]:
table=soup.find('table',{'class':'wikitable sortable'})


**Because we do not need the entire page, we keep only the rows of the table, that is, the `tr` elements.**

In [4]:
links = table.find_all('tr')
#links

**To load the table into a data frame, we first create an empty list and append the cell text of each row to it.**

In [5]:
Data=[]

for link in links:
    # collect the stripped text of every <td> cell in the row
    Data.append([t.text.strip() for t in link.find_all('td')])
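To see what this loop produces, here is a minimal sketch run on a small inline table instead of the live page. The HTML snippet and its rows are made up for illustration, and the built-in `html.parser` is used so that `lxml` is not needed. Note that the header row contains only `th` cells, so it yields an empty list, which is presumably why rows with a null `PostalCode` are filtered out in the next cell.

```python
from bs4 import BeautifulSoup

# Made-up sample mimicking the layout of the Wikipedia table (assumption)
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, "html.parser")
demo_table = demo_soup.find("table", {"class": "wikitable sortable"})

# One list per <tr>; header rows have no <td> cells and become []
demo_rows = [
    [td.text.strip() for td in tr.find_all("td")]
    for tr in demo_table.find_all("tr")
]
print(demo_rows)
```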




In [6]:
    
df=pd.DataFrame(Data,columns=['PostalCode','Borough','Neighborhood'])
df=df[df['Borough']!='Not assigned']
df = df[~df['PostalCode'].isnull()] 


df

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
161,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
166,M4Y,Downtown Toronto,Church and Wellesley
169,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
170,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


**More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma, as shown in the M5A row of the above table.**

In [7]:
df2=df.groupby('PostalCode').agg(lambda x: ','.join(x))

df3=df.groupby('PostalCode')
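As a small illustration of the combining step, the sketch below groups a toy data frame by both `PostalCode` and `Borough` and joins only the `Neighborhood` values. This is one possible variant, not necessarily what the notebook intends: grouping by `PostalCode` alone and joining every column would also concatenate the repeated borough names. The toy rows are assumptions built from the M5A example above.

```python
import pandas as pd

# Toy rows: M5A appears twice, as described in the text above
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group on both keys so Borough is kept once, then join the neighborhoods
combined = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
       .agg(", ".join)
)
print(combined)
```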



**If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.**


In [8]:
# 'Not assigned' matches the spelling used in the Borough filter above
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
df                            


Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
161,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
166,M4Y,Downtown Toronto,Church and Wellesley
169,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
170,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [9]:
df.shape                           


(103, 3)

<h3 id="header">PART 2</h3>


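**We will then attach the longitude and latitude of each PostalCode. The `geocoder` package is installed below, but the coordinates themselves are read from the `Geospatial_Coordinates.csv` file.**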

In [10]:
 pip install geocoder


Note: you may need to restart the kernel to use updated packages.


In [11]:
import geocoder


In [12]:
post=pd.read_csv('Geospatial_Coordinates.csv')
post

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [13]:
post.rename(columns={'Postal Code':'PostalCode'}, 
                 inplace=True)
post

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [14]:
df_merge=df.merge(post, on='PostalCode', how='left')
df_merge

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
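Because the merge uses `how='left'`, any postal code missing from the coordinates file would silently end up with `NaN` coordinates. A minimal sketch of how this could be checked with the `indicator` option of `merge` follows; the toy frames and the postal code `M0X` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0X"],  # M0X is a made-up code with no coordinates
    "Borough": ["North York", "North York", "Nowhere"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a _merge column telling where each row was found
checked = left.merge(right, on="PostalCode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```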


<h3 id="header">PART 3</h3>


**In this part, we will be mapping Toronto and begin the clustering process based on neighborhood and venue category**

In [15]:
!conda install -c conda-forge geopy --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [17]:

address = 'Toronto,ON'

# Nominatim expects an explicit user_agent; the name used here is an arbitrary placeholder
geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [18]:
!conda install -c conda-forge folium=0.5.0 --yes


Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [19]:
import folium

In [20]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto

In [21]:
for lat, lng, borough, neighborhood in zip(
        df_merge['Latitude'], 
        df_merge['Longitude'], 
        df_merge['Borough'], 
        df_merge['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto

In [33]:
CLIENT_ID = 'KIRCLRF5USTIPBY25REXDGR4C5MVBDKC50MLXRWTNRFOMVPG' # your Foursquare ID
CLIENT_SECRET = 'QNO15UBG3AG0FGIWGQ25ZONJ5UONI3N5XOYEZTSO0KAEZH41' # your Foursquare Secret
VERSION = '20190425' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: KIRCLRF5USTIPBY25REXDGR4C5MVBDKC50MLXRWTNRFOMVPG
CLIENT_SECRET:QNO15UBG3AG0FGIWGQ25ZONJ5UONI3N5XOYEZTSO0KAEZH41
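Hard-coding and printing API credentials makes them easy to leak when a notebook is shared. One possible alternative, sketched below, is to read them from environment variables and to mask them before printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the credentials from a mapping (the environment by default)."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")          # hypothetical variable name
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")  # hypothetical variable name
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set")
    return client_id, client_secret


def mask(secret, visible=4):
    """Keep the first few characters and hide the rest."""
    return secret[:visible] + "*" * (len(secret) - visible)


# Usage with a plain dict standing in for the real environment
demo_id, demo_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "abcdefgh", "FOURSQUARE_CLIENT_SECRET": "12345678"}
)
print("CLIENT_ID: " + mask(demo_id))
```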


<h3 id="header">PART 4</h3>

**We will now explore the first neighborhood in our dataframe. We first get its coordinates, then use them to identify nearby venues and their categories so that we can begin segmentation**

In [34]:

df_merge.loc[0, 'Neighborhood']


'Parkwoods'

In [35]:
df_merge_latitude= df_merge.loc[0,'Latitude']
df_merge_longitude= df_merge.loc[0,'Longitude']

neighborhood_name=df_merge.loc[0,'Neighborhood']

print('Lat and Long values of {} are {},{}'.format(neighborhood_name, df_merge_latitude,df_merge_longitude))




Lat and Long values of Parkwoods are 43.7532586,-79.3296565


**Let us now get the top 100 venues within a 500-meter radius**

In [36]:
LIMIT = 100

radius = 500

url='https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    df_merge_latitude, 
    df_merge_longitude, 
    radius, 
    LIMIT)
url


'https://api.foursquare.com/v2/venues/explore?&client_id=KIRCLRF5USTIPBY25REXDGR4C5MVBDKC50MLXRWTNRFOMVPG&client_secret=QNO15UBG3AG0FGIWGQ25ZONJ5UONI3N5XOYEZTSO0KAEZH41&v=20190425&ll=43.7532586,-79.3296565&radius=500&limit=100'
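The same request URL can also be assembled from a dictionary of parameters, which avoids keeping a long format string in sync with its arguments. The sketch below uses `urllib.parse.urlencode` from the standard library; the helper name `build_explore_url` and the placeholder credentials are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL used above from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll value readable instead of %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


demo_url = build_explore_url("ID", "SECRET", "20190425", 43.7532586, -79.3296565)
print(demo_url)
```

When `requests` is used for the call, passing the same dictionary through its `params` argument would achieve a similar result without building the string by hand.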

In [37]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5eeaae46aba297001ba9f173'},
  'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 2,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 245,
        'cc': 'CA',
        'c

**We use Foursquare to GET the information, then clean the JSON response and write it into a dataframe**

In [38]:
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas 1.0+)


In [39]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [40]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Variety Store,Food & Drink Shop,43.751974,-79.333114
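To make the flattening step easier to follow, here is a minimal sketch of what `json_normalize` does to one nested item. The dictionary is a trimmed-down stand-in for a Foursquare item, keeping only the fields used above; it assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# Trimmed-down stand-in for one entry of results['response']['groups'][0]['items']
items = [{
    "venue": {
        "name": "Brookbanks Park",
        "categories": [{"name": "Park"}],
        "location": {"lat": 43.751976, "lng": -79.33214},
    }
}]

# Nested dictionaries become dotted column names; lists are kept as they are
flat = pd.json_normalize(items)
print(list(flat.columns))
```

This is why the category still has to be pulled out of the `venue.categories` list by a helper such as `get_category_type`.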


In [41]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

2 venues were returned by Foursquare.


**Only 2 venues were found within a 500-meter radius of this neighborhood, which is clearly not enough for segmentation, so let us apply this to all the neighborhoods in Toronto**

In [42]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

**Now we can run the code below for all of Toronto and create a new dataframe, `Toronto_neighborhoods2`**

In [43]:

Toronto_neighborhoods2 = getNearbyVenues(names=df_merge['Neighborhood'],
                                   latitudes=df_merge['Latitude'],
                                   longitudes=df_merge['Longitude']
                                  )


Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

KeyError: 'groups'
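The `KeyError: 'groups'` above suggests that at least one response came back without a `groups` entry, for example when a request fails. A minimal sketch of a more defensive way to read the response is shown below. The helper names `extract_items` and `venue_category` are hypothetical, and the exact shape of a failed response is an assumption.

```python
def extract_items(payload):
    """Return the list of venue items, or [] if the response has no groups."""
    groups = payload.get("response", {}).get("groups", [])
    return groups[0].get("items", []) if groups else []


def venue_category(venue):
    """Return the first category name, or None if the venue has no category."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else None


# A response without 'groups' no longer raises; it simply yields no venues
failed = {"meta": {"code": 429}, "response": {}}  # assumed shape of a failed call
ok = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park", "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed spot", "categories": []}},
]}]}}

names = [(v["venue"]["name"], venue_category(v["venue"])) for v in extract_items(ok)]
print(extract_items(failed), names)
```

Inside `getNearbyVenues`, these helpers could replace the direct `["response"]['groups'][0]['items']` and `['categories'][0]['name']` lookups, so that one failed request does not stop the whole loop.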

In [44]:
print(Toronto_neighborhoods2.shape)
Toronto_neighborhoods2.head()

(2128, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


**Let us check how many venues were returned for each neighborhood**

In [45]:
(Toronto_neighborhoods2.groupby('Neighborhood').count())

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,5,5,5,5,5,5
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Wilson Heights, Downsview North",20,20,20,20,20,20
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
...,...,...,...,...,...,...
"Willowdale, Willowdale West",7,7,7,7,7,7
Woburn,3,3,3,3,3,3
Woodbine Heights,9,9,9,9,9,9
York Mills West,4,4,4,4,4,4


In [46]:
print('There are {} unique categories'.format(len(Toronto_neighborhoods2['Venue Category'].unique())))

There are 275 unique categories


In [47]:
Toronto_neighborhoods2.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


<h3 id="header">PART 5</h3>

**We will now begin analyzing all the neighborhoods in Toronto. In order to begin clustering them, we will have to transform all categorical values into numerical values using one-hot encoding (here with `pd.get_dummies`). This way all venue categories will be presented as either 0 or 1**

In [48]:
# one hot encoding
toronto_onehot3 = pd.get_dummies(Toronto_neighborhoods2[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot3['Neighborhood'] = Toronto_neighborhoods2['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns3 = [toronto_onehot3.columns[-1]] + list(toronto_onehot3.columns[:-1])
toronto_onehot3 = toronto_onehot3[fixed_columns3]

toronto_onehot3.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [49]:
toronto_onehot3.shape

(2128, 275)

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [50]:
toronto_grouped = toronto_onehot3.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
94,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [51]:
toronto_grouped.shape

(96, 275)
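The two steps above, one-hot encoding followed by a per-neighborhood mean, can be followed on a tiny example. The sketch below uses the five venues shown earlier for Parkwoods and Victoria Village. It uses `DataFrame.insert` to add the `Neighborhood` column, which is a variation on the notebook's code: `insert` raises an error if a venue category happened to be called `Neighborhood`, instead of silently overwriting that column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods",
                     "Victoria Village", "Victoria Village", "Victoria Village"],
    "Venue Category": ["Park", "Food & Drink Shop",
                       "Hockey Arena", "Portuguese Restaurant", "Coffee Shop"],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(toy["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# The mean of 0/1 values is the share of venues in each category
freq = onehot.groupby("Neighborhood").mean()
print(freq.round(2))
```

Parkwoods has two venues, so each of its categories gets a frequency of 0.5, while each category in Victoria Village gets one third.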

**We would now like to list the 5 most common venue categories for each neighborhood**

In [52]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print ("----"+hood+"----")
    temp=toronto_grouped[toronto_grouped['Neighborhood']== hood].T.reset_index()
    temp.columns=['venue','freq']
    temp=temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0               Skating Rink   0.2
1             Breakfast Spot   0.2
2                     Lounge   0.2
3             Clothing Store   0.2
4  Latin American Restaurant   0.2


----Alderwood, Long Branch----
            venue  freq
0     Pizza Place  0.25
1     Coffee Shop  0.12
2    Skating Rink  0.12
3  Sandwich Place  0.12
4             Pub  0.12


----Bathurst Manor, Wilson Heights, Downsview North----
                       venue  freq
0                       Bank  0.10
1                Coffee Shop  0.10
2                   Pharmacy  0.05
3              Deli / Bodega  0.05
4  Middle Eastern Restaurant  0.05


----Bayview Village----
                             venue  freq
0              Japanese Restaurant  0.25
1                             Café  0.25
2                             Bank  0.25
3               Chinese Restaurant  0.25
4  Molecular Gastronomy Restaurant  0.00


----Bedford Park, Lawrence Manor East----
           

4   Burrito Place  0.08


----Moore Park, Summerhill East----
                             venue  freq
0                     Tennis Court   1.0
1        Middle Eastern Restaurant   0.0
2              Moroccan Restaurant   0.0
3              Monument / Landmark   0.0
4  Molecular Gastronomy Restaurant   0.0


----New Toronto, Mimico South, Humber Bay Shores----
                  venue  freq
0           Pizza Place  0.08
1    Seafood Restaurant  0.08
2              Pharmacy  0.08
3                Bakery  0.08
4  Fast Food Restaurant  0.08


----North Park, Maple Leaf Park, Upwood Park----
                        venue  freq
0                      Bakery  0.25
1  Construction & Landscaping  0.25
2                        Park  0.25
3            Basketball Court  0.25
4                 Yoga Studio  0.00


----North Toronto West,  Lawrence Park----
                venue  freq
0      Clothing Store  0.11
1         Coffee Shop  0.11
2         Yoga Studio  0.05
3       Metro Station  0.05
4  Me

**We will now put that in a data frame but first, let's write a function to sort the venues in descending order**

In [53]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
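As a quick check of this function, the sketch below applies it to a single hand-made row. The frequencies are illustrative values rather than real results. Keep in mind that categories with equal frequency (for example the many zeros in a sparse row) have no meaningful order among themselves, so the tail of a top-10 list can be fairly arbitrary.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # skip the first entry, which holds the neighborhood name
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Hand-made row: neighborhood name first, then category frequencies
row = pd.Series({
    "Neighborhood": "Alderwood, Long Branch",
    "Coffee Shop": 0.12,
    "Pizza Place": 0.25,
    "Yoga Studio": 0.05,
})

top = return_most_common_venues(row, 2)
print(list(top))
```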

**Now let's create the new dataframe and display the top 10 venues for each neighborhood.**

In [54]:
num_top_venues=10

indicators=['st','nd','rd']

columns=['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1,indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Clothing Store,Breakfast Spot,Latin American Restaurant,Skating Rink,Women's Store,Doner Restaurant,Diner,Discount Store,Distribution Center
1,"Alderwood, Long Branch",Pizza Place,Pharmacy,Skating Rink,Coffee Shop,Pub,Sandwich Place,Gym,Airport Lounge,Deli / Bodega,Ethiopian Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Sandwich Place,Diner,Mobile Phone Shop,Ice Cream Shop,Middle Eastern Restaurant,Restaurant,Deli / Bodega,Fried Chicken Joint
3,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
4,"Bedford Park, Lawrence Manor East",Restaurant,Italian Restaurant,Coffee Shop,Sandwich Place,Pub,Butcher,Sushi Restaurant,Café,Cupcake Shop,Juice Bar


**We will then run k-means clustering with 5 clusters**

In [55]:
from sklearn.cluster import KMeans


In [56]:
kclusters=5

toronto_grouped_clustering=toronto_grouped.drop(columns='Neighborhood')

kmeans=KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

kmeans.labels_[0:10]

array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
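The notebook fixes the number of clusters at 5. One common heuristic for checking such a choice is to compare the inertia (within-cluster sum of squares) for several values of k and look for the point where it stops dropping sharply. The sketch below does this on synthetic points, since the real frequency table is not reproduced here; the three blobs are an assumption for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three synthetic, well-separated blobs standing in for the frequency vectors
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

# Fit k-means for k = 1..5 and record the inertia of each fit
inertia_by_k = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertia_by_k[k] = model.inertia_

for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

On the real data the same loop could be run over `toronto_grouped_clustering`, although with sparse frequency vectors the elbow may be much less clear than in this toy case.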

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [57]:
neighborhoods_venues_sorted.insert(0,'Cluster Labels',kmeans.labels_)

toronto_merged= df_merge
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')


toronto_merged.head(50) # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,4.0,Park,Food & Drink Shop,Women's Store,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,Hockey Arena,Portuguese Restaurant,French Restaurant,Intersection,Coffee Shop,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant,Deli / Bodega
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1.0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Café,Theater,Cosmetics Shop,Shoe Store,Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Furniture / Home Store,Clothing Store,Accessories Store,Boutique,Vietnamese Restaurant,Coffee Shop,Event Space,Miscellaneous Shop,Women's Store,Dog Run
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1.0,Coffee Shop,Sushi Restaurant,Yoga Studio,Arts & Crafts Store,Café,Diner,Bar,Bank,Italian Restaurant,Beer Bar
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,,,,,,,,,,,
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,1.0,Fast Food Restaurant,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Falafel Restaurant
7,M3B,North York,Don Mills,43.745906,-79.352188,1.0,Coffee Shop,Beer Store,Gym,Japanese Restaurant,Asian Restaurant,Restaurant,Dim Sum Restaurant,Café,Discount Store,Italian Restaurant
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,1.0,Pizza Place,Athletics & Sports,Gastropub,Intersection,Fast Food Restaurant,Pet Store,Pharmacy,Café,Breakfast Spot,Bank
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1.0,Clothing Store,Coffee Shop,Middle Eastern Restaurant,Bubble Tea Shop,Italian Restaurant,Japanese Restaurant,Café,Cosmetics Shop,Diner,Tea Room


In [58]:
toronto_merged2=toronto_merged.dropna()

toronto_merged2


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,4.0,Park,Food & Drink Shop,Women's Store,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,Hockey Arena,Portuguese Restaurant,French Restaurant,Intersection,Coffee Shop,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant,Deli / Bodega
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,1.0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Café,Theater,Cosmetics Shop,Shoe Store,Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Furniture / Home Store,Clothing Store,Accessories Store,Boutique,Vietnamese Restaurant,Coffee Shop,Event Space,Miscellaneous Shop,Women's Store,Dog Run
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1.0,Coffee Shop,Sushi Restaurant,Yoga Studio,Arts & Crafts Store,Café,Diner,Bar,Bank,Italian Restaurant,Beer Bar
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,4.0,River,Park,Smoke Shop,Women's Store,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,1.0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Yoga Studio,Hotel,Men's Store,Café,Pub
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,1.0,Light Rail Station,Garden Center,Auto Workshop,Farmers Market,Burrito Place,Skate Park,Spa,Recording Studio,Garden,Yoga Studio
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,2.0,Baseball Field,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Department Store


**In order to map the clusters, I had to convert the cluster labels from floats to integers, because they are used to index the list of colors**

In [59]:
toronto_merged2 = toronto_merged2.astype({'Cluster Labels': 'int'})

toronto_merged2.head(50)
                                        
                                


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,4,Park,Food & Drink Shop,Women's Store,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,1,Hockey Arena,Portuguese Restaurant,French Restaurant,Intersection,Coffee Shop,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant,Deli / Bodega
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Café,Theater,Cosmetics Shop,Shoe Store,Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1,Furniture / Home Store,Clothing Store,Accessories Store,Boutique,Vietnamese Restaurant,Coffee Shop,Event Space,Miscellaneous Shop,Women's Store,Dog Run
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Yoga Studio,Arts & Crafts Store,Café,Diner,Bar,Bank,Italian Restaurant,Beer Bar
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,1,Fast Food Restaurant,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Falafel Restaurant
7,M3B,North York,Don Mills,43.745906,-79.352188,1,Coffee Shop,Beer Store,Gym,Japanese Restaurant,Asian Restaurant,Restaurant,Dim Sum Restaurant,Café,Discount Store,Italian Restaurant
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,1,Pizza Place,Athletics & Sports,Gastropub,Intersection,Fast Food Restaurant,Pet Store,Pharmacy,Café,Breakfast Spot,Bank
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Clothing Store,Coffee Shop,Middle Eastern Restaurant,Bubble Tea Shop,Italian Restaurant,Japanese Restaurant,Café,Cosmetics Shop,Diner,Tea Room
10,M6B,North York,Glencairn,43.709577,-79.445073,1,Pub,Park,Bakery,Japanese Restaurant,Women's Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center
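The labels became floats because the join left `NaN` in rows with no matching neighborhood, and a column holding `NaN` cannot keep a plain integer type. The sketch below reproduces this on three rows taken from the tables above, where the Islington Avenue row has no cluster label.

```python
import pandas as pd

labels = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Victoria Village"],
    "Cluster Labels": [4, 1],
})
hoods = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Victoria Village",
                     "Islington Avenue, Humber Valley Village"],
})

# The unmatched row gets NaN, which forces the column to float64
joined = hoods.join(labels.set_index("Neighborhood"), on="Neighborhood")
print(joined["Cluster Labels"].dtype)

# Dropping the unmatched rows makes the integer conversion possible
cleaned = joined.dropna(subset=["Cluster Labels"]).astype({"Cluster Labels": "int"})
print(cleaned["Cluster Labels"].tolist())
```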


In [60]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [61]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(
        toronto_merged2['Latitude'], 
        toronto_merged2['Longitude'], 
        toronto_merged2['Neighborhood'], 
        toronto_merged2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # cluster labels run from 0 to kclusters-1, so index the palette by the label itself
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
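
The map gives every cluster label its own marker color by looking the label up in a list of hex colors. The sketch below illustrates that idea with the standard library only; `cluster_palette` is a hypothetical helper, and it uses evenly spaced HSV hues rather than matplotlib's rainbow colormap, so the exact colors will differ from the map above.

```python
import colorsys

def cluster_palette(k):
    """Return k hex colors with evenly spaced hues (hypothetical helper)."""
    palette = []
    for i in range(k):
        # Spread hues over [0, 1); full saturation and value keep the colors vivid.
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

# 5 matches the number of clusters reported in this notebook.
palette = cluster_palette(5)
for label in range(5):
    print(label, palette[label])
```

With zero-based labels, `palette[label]` is the most direct lookup. An offset such as `label - 1` also runs in Python, because index `-1` wraps around to the last entry, but the color order is then shifted by one position.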

<h3 id="header">PART 6</h3>

**We can now cluster Toronto neighborhoods by venue category. To learn more about these neighborhoods and their venues, we next examine what each cluster contains.**
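
The inspection cells in this part all use the same `.loc` selection: a boolean row mask on `Cluster Labels`, plus a positional column list `[1] + list(range(5, n))`. The sketch below shows, in plain Python, which column names those positions pick out. The column order is an assumption inferred from the rows and headers printed in this notebook.

```python
# Assumed column order of toronto_merged2, inferred from the printed rows
# (PostalCode, Borough, Neighborhood, Latitude, Longitude, Cluster Labels, venues...).
ranks = ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']
columns = ['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude',
           'Cluster Labels'] + [rank + ' Most Common Venue' for rank in ranks]

# Same positions as toronto_merged2.columns[[1] + list(range(5, toronto_merged2.shape[1]))]
positions = [1] + list(range(5, len(columns)))
selected = [columns[i] for i in positions]

print(selected[:3])
```

Selecting by position is compact, but it depends on the column order staying fixed; selecting by name would likely be more robust if columns are added or reordered.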

Let us begin with **Cluster 0**

*Cluster 0 seems to be the most social area. Pizza Place is the most common venue in every one of its neighborhoods, alongside other venues for socializing and going out. Hence I am going to call this cluster* ***Social Hub***.

In [62]:
toronto_merged2.loc[toronto_merged2['Cluster Labels'] == 0, toronto_merged2.columns[[1] + list(range(5, toronto_merged2.shape[1]))]]



Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
50,North York,0,Pizza Place,Women's Store,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant
63,York,0,Pizza Place,Grocery Store,Brewery,Bus Line,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
70,Etobicoke,0,Pizza Place,Discount Store,Intersection,Sandwich Place,Chinese Restaurant,Coffee Shop,Drugstore,Donut Shop,Doner Restaurant,Dog Run
77,Etobicoke,0,Pizza Place,Sandwich Place,Mobile Phone Shop,Bus Line,Discount Store,Dessert Shop,Dim Sum Restaurant,Diner,Distribution Center,Deli / Bodega
93,Etobicoke,0,Pizza Place,Pharmacy,Skating Rink,Coffee Shop,Pub,Sandwich Place,Gym,Airport Lounge,Deli / Bodega,Ethiopian Restaurant
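
One way to check a cluster name against its table is to tally how often each venue category appears. The sketch below does this for the first three venue columns of the five Cluster 0 rows above. The values are copied by hand from that table, so this is an illustration rather than a cell that reads `toronto_merged2`.

```python
from collections import Counter

# First three "most common venue" entries for the Cluster 0 rows shown above
# (indices 50, 63, 70, 77, 93), copied by hand from the table.
cluster0_top3 = [
    ['Pizza Place', "Women's Store", 'Dog Run'],
    ['Pizza Place', 'Grocery Store', 'Brewery'],
    ['Pizza Place', 'Discount Store', 'Intersection'],
    ['Pizza Place', 'Sandwich Place', 'Mobile Phone Shop'],
    ['Pizza Place', 'Pharmacy', 'Skating Rink'],
]

# Flatten the rows and count each venue category.
counts = Counter(venue for row in cluster0_top3 for venue in row)
for venue, n in counts.most_common(3):
    print(venue, n)
```

A category that shows up in every row, as Pizza Place does here, is a reasonable anchor for naming the cluster.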


**Cluster 1**

*Cluster 1 seems to contain mostly parks, gyms, shopping centers and restaurants. Hence I am going to call this cluster* ***Parks, Recreation and Shopping Hub***.


In [63]:
toronto_merged2.loc[toronto_merged2['Cluster Labels'] == 1, toronto_merged2.columns[[1] + list(range(5, toronto_merged2.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,1,Hockey Arena,Portuguese Restaurant,French Restaurant,Intersection,Coffee Shop,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant,Deli / Bodega
2,Downtown Toronto,1,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Café,Theater,Cosmetics Shop,Shoe Store,Restaurant
3,North York,1,Furniture / Home Store,Clothing Store,Accessories Store,Boutique,Vietnamese Restaurant,Coffee Shop,Event Space,Miscellaneous Shop,Women's Store,Dog Run
4,Downtown Toronto,1,Coffee Shop,Sushi Restaurant,Yoga Studio,Arts & Crafts Store,Café,Diner,Bar,Bank,Italian Restaurant,Beer Bar
6,Scarborough,1,Fast Food Restaurant,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Falafel Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
96,Downtown Toronto,1,Coffee Shop,Pizza Place,Café,Bakery,Italian Restaurant,Pub,Market,Restaurant,Butcher,Beer Store
97,Downtown Toronto,1,Coffee Shop,Café,Hotel,Restaurant,Gym,American Restaurant,Seafood Restaurant,Salad Place,Steakhouse,Asian Restaurant
99,Downtown Toronto,1,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Yoga Studio,Hotel,Men's Store,Café,Pub
100,East Toronto,1,Light Rail Station,Garden Center,Auto Workshop,Farmers Market,Burrito Place,Skate Park,Spa,Recording Studio,Garden,Yoga Studio


**Cluster 2**

*Cluster 2 seems to contain a lot of cafeterias, followed by women's stores and drugstores. Hence I shall call this* ***Coffee, Shopping and Dinner***.


In [64]:
toronto_merged2.loc[toronto_merged2['Cluster Labels'] == 2, toronto_merged2.columns[[1] + list(range(5, toronto_merged2.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,North York,2,Baseball Field,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Department Store
101,Etobicoke,2,Baseball Field,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Department Store


**Cluster 3**

*Cluster 3 seems to contain a lot of drugstores, rental car locations and dessert shops. Hence I shall call this* ***Drugstore and Business District***.

In [65]:
toronto_merged2.loc[toronto_merged2['Cluster Labels'] == 3, toronto_merged2.columns[[1] + list(range(5, toronto_merged2.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,Etobicoke,3,Home Service,Women's Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Deli / Bodega
62,Central Toronto,3,Home Service,Garden,Music Venue,Women's Store,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center


**Cluster 4**


*Cluster 4 seems to contain lots of fast food restaurants, along with some gyms and pubs. Hence, I shall call this cluster* ***Fast Food & Restaurants Hub***.

In [66]:
toronto_merged2.loc[toronto_merged2['Cluster Labels'] == 4, toronto_merged2.columns[[1] + list(range(5, toronto_merged2.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,4,Park,Food & Drink Shop,Women's Store,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant
21,York,4,Park,Women's Store,Pool,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center
35,East York,4,Park,Convenience Store,Coffee Shop,Women's Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
45,North York,4,Cafeteria,Park,Women's Store,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
61,Central Toronto,4,Park,Bus Line,Swim School,Women's Store,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
64,York,4,Park,Women's Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop
66,North York,4,Park,Convenience Store,Bar,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Women's Store
85,Scarborough,4,Playground,Park,Women's Store,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center
91,Downtown Toronto,4,Park,Playground,Trail,Women's Store,Distribution Center,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner
98,Etobicoke,4,River,Park,Smoke Shop,Women's Store,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store


## We have now grouped Toronto's neighborhoods into 5 clusters

**Cluster 0: Social Hub**

**Cluster 1: Parks, Recreation and Shopping Hub**

**Cluster 2: Coffee, Shopping and Dinner**

**Cluster 3: Drugstore and Business District**

**Cluster 4: Fast Food & Restaurants Hub**
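
To carry these names back into the data, a dictionary keyed by cluster label is one simple option. The sketch below uses a short list of illustrative labels instead of the real `toronto_merged2['Cluster Labels']` column, and the `Cluster Name` column in the closing comment is a hypothetical name.

```python
# Names assigned to each cluster label in this part of the notebook.
cluster_names = {
    0: 'Social Hub',
    1: 'Parks, Recreation and Shopping Hub',
    2: 'Coffee, Shopping and Dinner',
    3: 'Drugstore and Business District',
    4: 'Fast Food & Restaurants Hub',
}

# Illustrative labels only; in the notebook they would come from
# toronto_merged2['Cluster Labels'].
sample_labels = [4, 1, 1, 0, 2]
named = [cluster_names[label] for label in sample_labels]
print(named)

# With pandas, the equivalent (assuming the column holds these integer labels)
# would be along the lines of:
# toronto_merged2['Cluster Name'] = toronto_merged2['Cluster Labels'].map(cluster_names)
```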
