## <<<<<  -Start of the 1st part of Week 3: Segmenting and Clustering Neighborhoods in Toronto- >>>>>

This assignment uses a single notebook, uploaded three times.  Header and footer lines mark where each graded part begins and ends.

##### Scrape the Wikipedia page:

In [2]:
# Using pandas.  read_html is good enough for the job: it returns every table on the page as a DataFrame.
import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',header=0)
print(tables[0])

    Postcode           Borough  \
0        M1A      Not assigned   
1        M2A      Not assigned   
2        M3A        North York   
3        M4A        North York   
4        M5A  Downtown Toronto   
5        M6A        North York   
6        M6A        North York   
7        M7A  Downtown Toronto   
8        M8A      Not assigned   
9        M9A         Etobicoke   
10       M1B       Scarborough   
11       M1B       Scarborough   
12       M2B      Not assigned   
13       M3B        North York   
14       M4B         East York   
15       M4B         East York   
16       M5B  Downtown Toronto   
17       M5B  Downtown Toronto   
18       M6B        North York   
19       M7B      Not assigned   
20       M8B      Not assigned   
21       M9B         Etobicoke   
22       M9B         Etobicoke   
23       M9B         Etobicoke   
24       M9B         Etobicoke   
25       M9B         Etobicoke   
26       M1C       Scarborough   
27       M1C       Scarborough   
28       M1C  

In [3]:
# Check the object type to confirm the page was read correctly, since the scrape above had no error checking.
type(tables)

list
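Since `read_html` returns a list of every table on the page, taking index 0 relies on the page layout staying the same. A minimal sketch of a more defensive selection, assuming a hypothetical helper named `pick_table` and stand-in tables rather than the live page:

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError("no table with columns {}".format(sorted(required)))


# Stand-ins for the list returned by pd.read_html (made-up first table).
fake_tables = [
    pd.DataFrame({"Province": ["Ontario"]}),
    pd.DataFrame({
        "Postcode": ["M3A"],
        "Borough": ["North York"],
        "Neighbourhood": ["Parkwoods"],
    }),
]

picked = pick_table(fake_tables, ["Postcode", "Borough", "Neighbourhood"])
print(picked)
```

With the real scrape, the call would be `pick_table(tables, ['Postcode', 'Borough', 'Neighbourhood'])` in place of `tables[0]`.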

##### Check data.

In [4]:
# Since the object is a list of Data Frames, we only want the first data frame from the scrape.
# Using pandas to do the manipulation of data.
df = tables[0]

In [5]:
# Identify the size (shape) and the columns from the conversion.
df.shape, df.columns

((287, 3), Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object'))

In [6]:
# Show the data, replicating the sample in the instructions.
df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,Islington Avenue


##### Clean up data.

In [7]:
# "The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood"
# Need to rename the Postcode column to match.
df = df.rename(columns={'Postcode':'PostalCode'})

In [8]:
df.head(5)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [9]:
# "Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned."
df = df.drop(df[df['Borough'] == 'Not assigned'].index)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [10]:
# Examine the size of the data.  Should be smaller.
df.shape

(210, 3)

##### Prepare Data.

For this assignment, the data will be used for clustering analysis.

In [11]:
# Only 1 row per postal code, so combine neighborhoods that share a postal code into one row, separated with a comma.
# Creating a new data frame to preserve the previous frame in case of error.
# Using sort=False to match the instruction's sample.
df2 = df.groupby(['PostalCode', 'Borough'],sort=False)['Neighbourhood'].apply(', '.join).reset_index()
df2

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
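To see what the `groupby` cell does in isolation, here is a small sketch on a toy frame built from a few rows shown above. The name `toy` is only for illustration; `sort=False` keeps the groups in order of first appearance.

```python
import pandas as pd

# A few rows copied from the outputs above.
toy = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M1B", "M1B", "M3A"],
    "Borough": ["North York", "North York", "Scarborough", "Scarborough", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Rouge", "Malvern", "Parkwoods"],
})

# One row per (PostalCode, Borough); neighbourhood names joined with ", ".
combined = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
print(combined)
```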


In [12]:
# "If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough."
# First, find out if there is such a thing.  If not, proceed.
df2[ df2['Neighbourhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighbourhood
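The check above came back empty, so nothing had to be changed. Had there been such rows, the rule from the instructions could be applied roughly as follows. This is a sketch on made-up sample rows, not data from the scrape:

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no assigned neighbourhood.
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Where the neighbourhood is 'Not assigned', copy the borough name into it.
mask = sample["Neighbourhood"] == "Not assigned"
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]
print(sample)
```

This would normally run before the neighbourhoods are joined per postal code, so that a placeholder never ends up inside a comma-separated list.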


In [13]:
# Display output similar to the instructions.
df2.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


In [14]:
# Identify the new data size.
df2.shape

(103, 3)

## <<<<<  -End of the 1st part of Week 3- >>>>>

## <<<<<  -Start of the 2nd part of Week 3: Segmenting and Clustering Neighborhoods in Toronto- >>>>>

##### Read the geo location of the postal codes & combine the data sets into one.

NOTE: The geospatial data is provided through a link on the instructions page, so it is used here to extend the dataset.


In [15]:
# Get the geo spatial data.
geo_file='https://cocl.us/Geospatial_data'
df_geo = pd.read_csv(geo_file)

In [16]:
# Sanity check.  Note that the row count (103) matches the data frame prepared from the neighborhood dataset.
df_geo.shape, df_geo.columns

((103, 3), Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object'))

In [17]:
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [18]:
# pd.merge combines both data frames, df2 & df_geo, on a key column they share, in this case PostalCode.
# The geo data names that column 'Postal Code', which does not match the borough data set, so rename it here for pd.merge to work.
df_geo = df_geo.rename(columns={'Postal Code':'PostalCode'})
df_geo.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [19]:
# Merging as a one-to-one join (each postal code is expected to appear once in each frame) with PostalCode as the key.
df3 = pd.merge(df2, df_geo, on='PostalCode')

In [20]:
df3.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
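The merge above assumes every postal code has exactly one match on each side. A hedged sketch of how that assumption could be checked, using the `validate` and `indicator` options of `pd.merge` on three rows copied from the outputs above (the frame names `left` and `right` are only for illustration):

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "PostalCode": ["M5A", "M3A", "M4A"],
    "Latitude": [43.65426, 43.753259, 43.725882],
    "Longitude": [-79.360636, -79.329656, -79.315572],
})

# validate raises MergeError if a key repeats on either side;
# indicator adds a _merge column telling where each row came from.
merged = pd.merge(left, right, on="PostalCode", how="left",
                  validate="one_to_one", indicator=True)

# Rows of the left frame that found no coordinates.
unmatched = merged[merged["_merge"] != "both"]
print(merged)
print("unmatched rows:", len(unmatched))
```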


## <<<<<  -End of the 2nd part of Week 3- >>>>>

## <<<<<  -Start of the 3rd part (last) of Week 3: Segmenting and Clustering Neighborhoods in Toronto- >>>>>

This section contains 2 parts, each with underlying steps.  
- Mapping the data set with folium.  
    - Install folium if it is not already installed.
    - Extract the Toronto-only neighborhoods from the main data set.
    - Show these neighborhoods first to have a frame of reference for the city.
- Calling the FourSquare API to explore venues.  
    - Set up the requests to obtain the top 5 venue categories per neighborhood.  
    - Cluster the neighborhoods by their venues and show the clusters on the map.

###### Data preparation & Mapping.
  
  ---
Install folium as the primary map tool.


In [27]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Folium installed and imported!')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    certifi-2019.11.28         |   py36h9f0ad1d_1         149 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    altair-4.0.1               |             py_0         575 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    branca-0.4.0               |             py_0          26 KB  conda-forge
    ------------------------------------------------------------
                       

---
Prepare a new data set to map Toronto neighborhoods.  This is a subset of the data, to reduce the amount displayed.

In [127]:
# Load libraries:
import requests # library to handle requests
import numpy as np

# transforming a json file into a pandas dataframe
# (newer pandas versions expose this function as pandas.json_normalize)
from pandas.io.json import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


In [28]:
# Extract column names for preparation.
df3.columns

Index(['PostalCode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')

In [29]:
# Get the boroughs in the dataset and how many rows (postal codes) are in each.  
# This can be used to cross-check the numbers once the dataset is split into specific areas.
df3['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East Toronto         5
East York            5
Mississauga          1
Name: Borough, dtype: int64

In [30]:
# Extract the boroughs whose name contains 'Toronto', then check the dataset.
import re
toronto = r'Toronto'
reg_toronto = re.compile(toronto)
df_toronto = df3[ df3['Borough'].str.contains(reg_toronto)]
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [31]:
# Get the size of the data.  Cross-check with the counts above: 19 + 9 + 6 + 5 = 39 rows for the four Toronto boroughs.
df_toronto.shape

(39, 5)
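The cross-check can also be done in code. A small sketch that rebuilds the `value_counts()` output above as a Series (typed in by hand here) and sums the boroughs containing "Toronto":

```python
import pandas as pd

# Counts copied from the value_counts() output above.
borough_counts = pd.Series({
    "North York": 24, "Downtown Toronto": 19, "Scarborough": 17,
    "Etobicoke": 12, "Central Toronto": 9, "West Toronto": 6,
    "York": 5, "East Toronto": 5, "East York": 5, "Mississauga": 1,
})

# A plain substring match is enough; no compiled regex is needed.
is_toronto = borough_counts.index.str.contains("Toronto")
toronto_rows = int(borough_counts[is_toronto].sum())
print(toronto_rows)
```

In the notebook itself the same comparison would be `df3['Borough'].str.contains('Toronto').sum()` against `df_toronto.shape[0]`.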

In [33]:
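# For formality, show the frame with a fresh index.  reset_index() returns a new frame, so df_toronto itself keeps its original index.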
df_toronto.reset_index()

Unnamed: 0,index,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
3,15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,19,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,30,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,31,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259


  
  ---
Display the neighborhoods on the folium map.  Each circle marker represents one neighborhood at its geospatial coordinates.  


In [62]:
# Taken from the weekly lab module:
latitude = 43.654260
longitude = -79.360636

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lon, label in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto
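The hardcoded center above matches the Harbourfront row of the data set. One alternative, offered only as an assumption about what reads well on the map, is to center on the mean of the neighborhood coordinates. A sketch using three rows copied from the output above (the name `sample` is only for illustration):

```python
import pandas as pd

# Three rows copied from df_toronto above.
sample = pd.DataFrame({
    "Neighbourhood": ["Harbourfront", "Queen's Park", "Ryerson, Garden District"],
    "Latitude": [43.65426, 43.662301, 43.657162],
    "Longitude": [-79.360636, -79.389494, -79.378937],
})

# Mean latitude/longitude, in the [lat, lon] order folium.Map expects for location.
center = [sample["Latitude"].mean(), sample["Longitude"].mean()]
print(center)
```

With the full frame, `center` would be computed from `df_toronto` and passed as `location=center`.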


###### FourSquare API & Venue cluster

---
Call the API to gather venue data for each neighborhood, from which the top 5 venue categories are derived.

In [43]:
# FourSquare set up.  (This and the following cells are taken from the course lab.)

CLIENT_ID = '' #'your-client-ID' from Foursquare
CLIENT_SECRET = '' #'your-client-secret' from Foursquare
#VERSION = '' # Foursquare API version 
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:


In [66]:
# Request at most 25 venues per neighborhood, within a radius given in meters.  The top 5 categories are picked from these later.
radius = 500
LIMIT = 25
version = ''

In [76]:
# This function loops through the neighborhoods to get the venues near each one.
def get_venues(hood, latitude, longitude):
    venues_list=[]
    for name, lat, lon in zip(hood, latitude, longitude):          
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            lat, 
            lon, 
            version,
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lon, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
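The function above assumes every response has a `response` / `groups` / `items` chain and that every venue has at least one category. A hedged sketch of a more forgiving extraction step, tested on a hand-written payload instead of a live request. `extract_venues` and the `'Unknown'` fallback are hypothetical names introduced here, not part of the course lab or the Foursquare API.

```python
def extract_venues(payload, name, lat, lon):
    """Flatten one 'explore' response into a list of venue tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Fall back to a placeholder when a venue has no category.
        categories = venue.get("categories") or [{}]
        rows.append((
            name,
            lat,
            lon,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name", "Unknown"),
        ))
    return rows


# Hand-written payload in the same shape as the response the function reads.
sample_payload = {
    "meta": {"code": 200},
    "response": {"groups": [{"items": [
        {"venue": {"name": "Roselle Desserts",
                   "location": {"lat": 43.653446723052674, "lng": -79.3620167174383},
                   "categories": [{"name": "Bakery"}]}},
        {"venue": {"name": "Place without a category (made up)",
                   "location": {"lat": 43.65, "lng": -79.36},
                   "categories": []}},
    ]}]},
}

rows = extract_venues(sample_payload, "Harbourfront", 43.65426, -79.360636)
empty = extract_venues({"meta": {"code": 400}, "response": {}}, "Nowhere", 0.0, 0.0)
print(rows)
print(empty)
```

An error response then yields an empty list for that neighborhood instead of a `KeyError` that stops the whole loop.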

In [87]:
# Test run: request only 2 venues to see what the response looks like.
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, latitude, longitude, version, radius, 2)

# send the GET request and get the recommended venues
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e69bc4fc94979001c01be38'},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 47,
  'suggestedBounds': {'ne': {'lat': 43.6587600045, 'lng': -79.35442800013826},
   'sw': {'lat': 43.6497599955, 'lng': -79.36684399986174}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.653446723052674,
        'lng': -79.3620167174383,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.653446723052674,
          'lng': -79.3620167174383}],
        'distance': 143,
        'postalCod

In [88]:
# Do the entire set.
venues_ = get_venues(hood=df_toronto['Neighbourhood'],
                                   latitude=df_toronto['Latitude'],
                                   longitude=df_toronto['Longitude']
                                  )

In [89]:
# What is the size of the data?
venues_.shape

(754, 7)

In [91]:
# Take a look.  Note the change in spelling: the column here is 'Neighborhood', not 'Neighbourhood'.
venues_.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot


In [92]:
# How many venues per neighborhood?
venues_.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",25,25,25,25,25,25
Berczy Park,25,25,25,25,25,25
"Brockton, Exhibition Place, Parkdale Village",22,22,22,22,22,22
Business Reply Mail Processing Centre 969 Eastern,15,15,15,15,15,15
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",17,17,17,17,17,17
"Cabbagetown, St. James Town",25,25,25,25,25,25
Central Bay Street,25,25,25,25,25,25
"Chinatown, Grange Park, Kensington Market",25,25,25,25,25,25
Christie,18,18,18,18,18,18
Church and Wellesley,25,25,25,25,25,25


In [93]:
# one hot encoding
venues_2 = pd.get_dummies(venues_[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
venues_2['Neighborhood'] = venues_['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [venues_2.columns[-1]] + list(venues_2.columns[:-1])
venues_2 = venues_2[fixed_columns]

venues_2.head()

Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,Arts & Crafts Store,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [94]:
# And check the size.
venues_2.shape

(754, 181)

In [95]:
# 'Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category'

venues_grp = venues_2.groupby('Neighborhood').mean().reset_index()
venues_grp

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Art Gallery,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.08,0.04,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0
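To make the numbers in this table easier to read: after one-hot encoding, the mean of a category column within a neighborhood is the share of that neighborhood's venues falling in that category. A toy sketch with made-up neighborhoods "A" and "B":

```python
import pandas as pd

# Made-up venues: 4 in neighborhood A, 2 in neighborhood B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bakery", "Park", "Park"],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the 0/1 columns gives the frequency of each category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

A value of 0.04 in the table above therefore corresponds to 1 venue out of 25.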


In [97]:
# Let's print each neighborhood along with the top 5 most common venues

top_venues = 5   # count

for h in venues_grp['Neighborhood']:
    print("----"+h+"----")
    temp = venues_grp[venues_grp['Neighborhood'] == h].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(top_venues))
    print('\n')


----Adelaide, King, Richmond----
                venue  freq
0  Seafood Restaurant  0.08
1               Hotel  0.08
2    Asian Restaurant  0.08
3          Steakhouse  0.04
4          Food Court  0.04


----Berczy Park----
                venue  freq
0              Bakery  0.08
1  Seafood Restaurant  0.08
2      Farmers Market  0.08
3            Beer Bar  0.08
4        Cocktail Bar  0.08


----Brockton, Exhibition Place, Parkdale Village----
                 venue  freq
0                 Café  0.14
1       Breakfast Spot  0.09
2          Coffee Shop  0.09
3  Japanese Restaurant  0.05
4    Convenience Store  0.05


----Business Reply Mail Processing Centre 969 Eastern----
                  venue  freq
0  Gym / Fitness Center  0.07
1         Auto Workshop  0.07
2            Comic Shop  0.07
3                  Park  0.07
4      Recording Studio  0.07


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  fr

In [98]:
# Let's put that into a *pandas* dataframe
# First, let's write a function to sort the venues in descending order.

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


In [121]:
# Create the new dataframe and display the venues for each neighborhood.

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = venues_grp['Neighborhood']

for ind in np.arange(venues_grp.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(venues_grp.iloc[ind, :], top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide, King, Richmond",Seafood Restaurant,Hotel,Asian Restaurant,Coffee Shop,Lounge
1,Berczy Park,Beer Bar,Farmers Market,Bakery,Cocktail Bar,Seafood Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Café,Breakfast Spot,Coffee Shop,Pet Store,Bakery
3,Business Reply Mail Processing Centre 969 Eastern,Garden Center,Auto Workshop,Comic Shop,Pizza Place,Recording Studio
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Lounge,Airport Terminal,Sculpture Garden,Boutique
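The top 5 lists here do not always agree with the printed lists earlier, most likely because many categories tie at the same frequency and the sort does not promise any order among ties. A sketch of one way to get a repeatable result, using `Series.nlargest`, which keeps the first occurrence among ties by default. The helper names `ordinal` and `top_categories` are hypothetical.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def top_categories(freq_row, n):
    """Return the n category names with the highest frequency."""
    return freq_row.astype(float).nlargest(n).index.tolist()


# Made-up frequencies for one neighborhood, with a tie at 0.2.
row = pd.Series({"Bakery": 0.2, "Cafe": 0.5, "Park": 0.2, "Gym": 0.1})
top = top_categories(row, 3)
headers = ["{} Most Common Venue".format(ordinal(i + 1)) for i in range(3)]
print(dict(zip(headers, top)))
```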


---
Cluster the neighborhoods by the frequency of their venue categories.

In [122]:
# Run *k*-means to cluster the neighborhoods into 5 clusters.
# import k-means and create the object
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

venue_grouped_clustering = venues_grp.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venue_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 2, 2, 1, 1, 2, 1, 2, 2, 1], dtype=int32)
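A quick way to see how balanced the clusters are is to count the labels. The sketch below uses only the ten labels printed above, not the full `kmeans.labels_` array, so the counts are for illustration:

```python
import numpy as np

# The first ten labels, copied from the output above.
labels = np.array([1, 2, 2, 1, 1, 2, 1, 2, 2, 1])
kclusters = 5

# Number of neighborhoods assigned to each cluster id 0..kclusters-1.
sizes = np.bincount(labels, minlength=kclusters)
for cluster, size in enumerate(sizes):
    print("cluster {}: {} neighborhoods".format(cluster, size))
```

In the notebook this would be `np.bincount(kmeans.labels_, minlength=kclusters)`.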

In [124]:
# Let's create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.
# Column name cleaning.
new_df_toronto = df_toronto.rename(columns={'Neighbourhood':'Neighborhood'})
new_df_toronto.head(2)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494


In [125]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

venue_merged = new_df_toronto

# join the sorted venues (with their cluster labels) onto the Toronto data, which has latitude/longitude for each neighborhood
venue_merged = venue_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

venue_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,2,Coffee Shop,Park,Bakery,Breakfast Spot,Performing Arts Venue
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,1,Coffee Shop,Park,Wings Joint,Distribution Center,Nightclub
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,2,Clothing Store,Coffee Shop,Café,Shopping Mall,Ramen Restaurant
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Japanese Restaurant,Gym,Gastropub,Restaurant,Coffee Shop
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Coffee Shop,Trail,Pub,Health Food Store,Dance Studio


---
Visualize the clusters.

In [128]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(venue_merged['Latitude'], venue_merged['Longitude'], venue_merged['Neighborhood'], venue_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
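In the cell above, `rainbow[cluster-1]` sends label 0 to the last palette entry through Python's negative indexing, so every cluster still gets its own color. A sketch of the more direct mapping, indexing the palette by the label itself. The names `palette` and `marker_colors` are only for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One hex color per cluster id, evenly spaced over the rainbow colormap.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Example labels; in the notebook these come from the 'Cluster Labels' column.
labels = [1, 2, 2, 1, 0]
marker_colors = [palette[label] for label in labels]
print(palette)
print(marker_colors)
```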

## <<<<<  -End of the 3rd part (last) of Week 3: Segmenting and Clustering Neighborhoods in Toronto- >>>>>