### Web Scraping, Data Exploration, and Clustering of Neighborhood Data

This notebook collects and explores data on neighborhoods in Toronto, then clusters these neighborhoods.
The data is obtained from a Wikipedia page, since neighborhood data for Toronto is not readily available.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  [Web Scraping and Data Collection](#task1)

2.  [Gathering Neighborhood Coordinates](#task2)

3.  [Exploring and Clustering Neighborhoods](#task3)

## 1. Web Scraping and Data Collection <a class="anchor" id="task1"></a>


#### First, let us install and import the libraries necessary for our web scraping and data exploring...

In [1]:
# Installing the necessary libraries

!pip install -U numpy

!pip install -U pandas

!pip install -U scipy

!pip install -U scikit-learn

!pip install -U imbalanced-learn


In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

print('Imports completed.')

Imports completed.


Creating the BeautifulSoup object...

In [3]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page=requests.get(url)

soup=BeautifulSoup(page.content, 'html.parser')

We use BeautifulSoup to find the table in the web page, then loop over its cells with an if-else statement that skips 'Not assigned' entries and extracts the postal code, borough, and neighborhood from the rest (guided by the "hints for scraping" and "Beautiful Soup").

In [4]:
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)
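
To see what the chained string operations in the loop do, the sketch below applies the same steps to two sample cell strings. The helper name `split_cell_text` and the sample strings are assumptions made for illustration; they are written in the shape the loop expects and are not copied from the scraped page.

```python
def split_cell_text(span_text):
    """Split 'Borough(Neighborhood / Neighborhood)' into its two parts."""
    # Everything before the first '(' is the borough name.
    borough = span_text.split('(')[0]
    # Everything after it is the neighborhood list, still ending in ')'.
    raw = span_text.split('(')[1]
    # Drop the closing parenthesis and turn ' /' separators into commas.
    neighborhood = raw.strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    return borough, neighborhood


# Hypothetical cell strings (assumed format, for illustration only).
samples = ['North York(Parkwoods)',
           'North York(Lawrence Manor / Lawrence Heights)']
for text in samples:
    print(split_cell_text(text))
```

Note that a cell whose text contains no `(` would raise an `IndexError` at `split('(')[1]`, so this approach relies on every assigned cell following the same pattern.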

Use pandas to convert the list to a dataframe and shorten the excessively long Borough names...

In [5]:
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

In [6]:
# First 15 rows of the dataframe

df.head(15)

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [7]:
# The shape of the dataframe

print('The dataframe contains {} rows and {} columns'.format(df.shape[0], df.shape[1]))

The dataframe contains 103 rows and 3 columns


## 2. Gathering Neighborhood Coordinates <a class="anchor" id="task2"></a>


In this section we retrieve the geographical coordinates of each postal code and add them to our table.

The coordinate data comes from a CSV file provided in the course material rather than from live queries with the Geocoder Python package.
We first load the file into the notebook environment from my IBM Cloud Object Storage...

In [8]:

import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# id information hidden
if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
    endpoint_myid = 'https://s3-api.us-geo.objectstorage.softlayer.net'
else:
    endpoint_myid = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'

client_myid = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='my api key id',
    ibm_auth_endpoint="auth_endpoint",
    config=Config(signature_version='oauth'),
    endpoint_url=endpoint_myid)

body = client_myid.get_object(Bucket='clusteringneighborhoods-donotdelete-pr-mklgqip7meufd2',Key='Geospatial_Coordinates_Toronto_Neighborhoods.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_geo_data = pd.read_csv(body)
df_geo_data.head()


,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We are going to join the two dataframes with pandas, so I first make sure that the key column for the lookup (the equivalent of a spreadsheet 'vLookup') has the same name in both dataframes. The coordinates will be looked up based on the 'PostalCode' column.

In [9]:
df_geo_data.rename(columns={"Postal Code":"PostalCode"}, inplace=True)
df_geo_data.head()

,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the dataframes with pandas...

In [10]:
neighborhoods = pd.merge(df,df_geo_data,on='PostalCode',how='left')
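
A left merge keeps every row of the scraped table, so a postal code that is missing from the coordinates file would silently end up with empty coordinates. The sketch below shows one way to check for that with the `indicator` option of `pd.merge`. The two small frames are made-up stand-ins, and the postal code `M9Z` is a hypothetical value chosen to have no match.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the coordinates file (made-up rows).
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Example']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every row of `left`; indicator=True adds a `_merge` column
# that says whether a matching postal code was found in `right`.
checked = pd.merge(left, right, on='PostalCode', how='left', indicator=True)

# Rows tagged 'left_only' have no coordinates (Latitude/Longitude are NaN).
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code received coordinates.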

Viewing the first 15 rows of the new dataframe containing neighborhood coordinates.

In [11]:
neighborhoods.head(15)

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


In [12]:
# The shape of the dataframe

print('The dataframe contains {} rows and {} columns'.format(neighborhoods.shape[0], neighborhoods.shape[1]))

The dataframe contains 103 rows and 5 columns


## 3. Exploring and Clustering Neighborhoods <a class="anchor" id="task3"></a>

In [13]:
import numpy as np 

import json 

from geopy.geocoders import Nominatim 

import requests 
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

import matplotlib.cm as cm
import matplotlib.colors as colors

print('Imports completed.')

Imports completed.


In [17]:
# Install and import folium so that we can visualize the maps.

!pip install -U folium==0.5.0
import folium



In [18]:

from sklearn.cluster import KMeans

print('done.')

done.


In [20]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]))

The dataframe has 15 boroughs and 103 neighborhoods.


In [21]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [23]:
# creating map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### For the purpose of this exercise, I decided to segment and cluster only the neighborhoods in North York.

In [24]:
north_york = neighborhoods[neighborhoods['Borough'] == 'North York'].reset_index(drop=True)
north_york.head()

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills North,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073


In [30]:
north_york.shape
north_york['Neighborhood'].unique()

array(['Parkwoods', 'Victoria Village',
       'Lawrence Manor, Lawrence Heights', 'Don Mills North', 'Glencairn',
       'Don Mills South', 'Hillcrest Village',
       'Bathurst Manor, Wilson Heights, Downsview North',
       'Fairview, Henry Farm, Oriole', 'Northwood Park, York University',
       'Bayview Village', 'Downsview East', 'York Mills, Silver Hills',
       'Downsview West', 'North Park, Maple Leaf Park, Upwood Park',
       'Humber Summit', 'Willowdale, Newtonbrook', 'Downsview Central',
       'Bedford Park, Lawrence Manor East', 'Humberlea, Emery',
       'Willowdale South', 'Downsview Northwest', 'York Mills West',
       'Willowdale West'], dtype=object)

The coordinates of North York.

In [27]:
address = 'North York, ON'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7543263, -79.44911696639593.


Map of North York with neighborhoods superimposed.

In [28]:
map_north_york = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(north_york['Latitude'], north_york['Longitude'], north_york['Borough'], north_york['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_north_york)  
    
map_north_york

##### Foursquare credentials and version.

In [31]:
CLIENT_ID = 'MY FOURSQUARE ID' # HIDDEN
CLIENT_SECRET = 'MY FOURSQUARE CLIENT SECRET' # HIDDEN
VERSION = '20180605' 
LIMIT = 100 

##### Defining the function getNearbyVenues as done in the hands-on lab for neighborhoods in New York.

In [32]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
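
The function above indexes straight into the JSON response, so a reply without any `groups`, or a venue without `categories`, would raise a `KeyError` or `IndexError` partway through the loop. The sketch below shows a more defensive way to flatten one response, assuming the same JSON layout that `getNearbyVenues` reads. The helper name `extract_venue_rows` and both payloads are hypothetical and only serve to illustrate the idea.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style JSON payload into flat venue rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        # Fall back to an empty dict when a venue has no category listed.
        categories = venue.get('categories') or [{}]
        rows.append((name, lat, lng,
                     venue.get('name'),
                     location.get('lat'),
                     location.get('lng'),
                     categories[0].get('name')))
    return rows


# Hypothetical payloads: one with a single venue, one with no groups at all.
ok = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}}]}]}}
empty = {'response': {}}

print(extract_venue_rows('Parkwoods', 43.753259, -79.329656, ok))
print(extract_venue_rows('Parkwoods', 43.753259, -79.329656, empty))
```

With this shape, a neighborhood with no nearby venues simply contributes no rows instead of stopping the whole run.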

##### Running the function for all neighborhoods in North York

In [33]:
north_york_venues = getNearbyVenues(names=north_york['Neighborhood'],
                                   latitudes=north_york['Latitude'],
                                   longitudes=north_york['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills North
Glencairn
Don Mills South
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview East
York Mills, Silver Hills
Downsview West
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview Central
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale South
Downsview Northwest
York Mills West
Willowdale West


##### The shape and first few rows of data in the newly created venues dataframe.

In [35]:
print(north_york_venues.shape)
north_york_venues.head()

(235, 7)


,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.753259,-79.329656,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


##### Getting the number of venues for each neighborhood and the number of unique venue categories.

In [38]:
north_york_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",24,24,24,24,24,24
Don Mills North,5,5,5,5,5,5
Don Mills South,19,19,19,19,19,19
Downsview Central,3,3,3,3,3,3
Downsview East,3,3,3,3,3,3
Downsview Northwest,4,4,4,4,4,4
Downsview West,4,4,4,4,4,4
"Fairview, Henry Farm, Oriole",61,61,61,61,61,61


In [37]:
print('There are {} unique categories.'.format(len(north_york_venues['Venue Category'].unique())))

There are 101 unique categories.


### Analysis of the neighborhoods.

First we create dummies for each venue category in the neighborhoods.

In [39]:
# one hot encoding
n_york_onehot = pd.get_dummies(north_york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
n_york_onehot['Neighborhood'] = north_york_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [n_york_onehot.columns[-1]] + list(n_york_onehot.columns[:-1])
n_york_onehot = n_york_onehot[fixed_columns]

n_york_onehot.head()

,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Sporting Goods Shop,Supermarket,Supplement Shop,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
n_york_onehot.shape

(235, 102)

#### Grouping by neighborhood and taking the mean of the frequency of occurrence in each of the venue categories.

In [41]:
northyork_grouped = n_york_onehot.groupby('Neighborhood').mean().reset_index()
northyork_grouped

,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Sporting Goods Shop,Supermarket,Supplement Shop,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.095238,...,0.0,0.047619,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.041667,0.041667,0.0,0.0,0.0,0.0,0.0
3,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills South,0.0,0.0,0.0,0.052632,0.0,0.052632,0.0,0.0,0.0,...,0.052632,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Downsview Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Downsview East,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Downsview West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Fairview, Henry Farm, Oriole",0.0,0.0,0.016393,0.0,0.0,0.016393,0.0,0.032787,0.032787,...,0.016393,0.0,0.016393,0.0,0.0,0.016393,0.016393,0.016393,0.0,0.016393
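
The two steps above (one-hot encoding, then a grouped mean) can be easier to follow on a tiny example. The sketch below uses made-up venues for two placeholder neighborhoods `A` and `B`; it is not the Foursquare data, only an illustration of why the mean of the indicator columns equals the share of venues in each category.

```python
import pandas as pd

# Made-up venues for two neighborhoods (not the Foursquare results).
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Bank', 'Gym', 'Bank', 'Bank'],
})

# One indicator column per category, 1 where the venue has that category.
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1, which is why the values in the grouped table can be read as frequencies.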


### Identifying the top 5 venues in each neighborhood, similar to the analysis done in the neighborhood clustering lab.

In [42]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [43]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = northyork_grouped['Neighborhood']

for ind in np.arange(northyork_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(northyork_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Park,Shopping Mall,Mobile Phone Shop
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Liquor Store
2,"Bedford Park, Lawrence Manor East",Restaurant,Italian Restaurant,Coffee Shop,Sandwich Place,Greek Restaurant
3,Don Mills North,Gym,Caribbean Restaurant,Café,Baseball Field,Japanese Restaurant
4,Don Mills South,Restaurant,Gym,Coffee Shop,Clothing Store,Beer Store
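
One thing to keep in mind when reading this table: `return_most_common_venues` always returns `num_top_venues` categories, even for a neighborhood with fewer distinct categories than that. The remaining slots are then filled with categories whose frequency is 0. Parkwoods, for example, has only three venues in the venues dataframe shown earlier. The sketch below is one possible variant that skips zero-frequency categories; the function name `top_venues_nonzero` and the sample row are assumptions for illustration, not part of the original lab.

```python
import pandas as pd


def top_venues_nonzero(row, num_top_venues):
    """Like return_most_common_venues, but skips categories with frequency 0."""
    freqs = row.iloc[1:].astype(float)          # drop the 'Neighborhood' entry
    freqs = freqs[freqs > 0].sort_values(ascending=False)
    top = list(freqs.index[:num_top_venues])
    # Pad with None so every neighborhood still yields the same number of slots.
    return top + [None] * (num_top_venues - len(top))


# Made-up frequency row with only three non-zero categories.
row = pd.Series({'Neighborhood': 'Example', 'Bank': 0.0, 'Bus Stop': 0.5,
                 'Gym': 0.0, 'Park': 0.3, 'Pool': 0.2})
print(top_venues_nonzero(row, 5))
```

Leaving the unused slots empty makes it visible which neighborhoods have too few venues for a meaningful top 5.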


Now we can go on and use k-means to cluster our neighborhoods.

In [44]:
# number of clusters
kclusters = 5

northyork_grouped_clustering = northyork_grouped.drop(columns='Neighborhood')

# k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(northyork_grouped_clustering)

# first ten cluster labels generated
kmeans.labels_[0:10] 

array([4, 4, 0, 4, 4, 4, 4, 0, 4, 4], dtype=int32)
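
The notebook fixes `kclusters = 5` without showing how that value was chosen. One common way to sanity-check the number of clusters is to compare silhouette scores for several values of k. The sketch below does this on synthetic data (three well-separated blobs standing in for the frequency matrix), so the numbers it prints say nothing about North York; it only illustrates the procedure under that assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the frequency matrix: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((8, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the best-separated clustering on this data.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On the real data, `X` would be replaced by `northyork_grouped_clustering`, keeping in mind that with only a couple of dozen neighborhoods the scores can be noisy.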

### The cluster labels plus the top 5 venues of each neighborhood.

In [45]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
northyork_merged = north_york
northyork_merged = northyork_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
northyork_merged.head()

,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,4.0,Bus Stop,Food & Drink Shop,Park,Accessories Store,Korean Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,4.0,Coffee Shop,Hockey Arena,Portuguese Restaurant,Financial or Legal Service,Accessories Store
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,4.0,Clothing Store,Furniture / Home Store,Accessories Store,Carpet Store,Miscellaneous Shop
3,M3B,North York,Don Mills North,43.745906,-79.352188,4.0,Gym,Caribbean Restaurant,Café,Baseball Field,Japanese Restaurant
4,M6B,North York,Glencairn,43.709577,-79.445073,4.0,Japanese Restaurant,Asian Restaurant,Metro Station,Bakery,Sushi Restaurant


### Visualizing the clusters we identified on a map using folium.

In [66]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(northyork_merged['Latitude'], northyork_merged['Longitude'], northyork_merged['Neighborhood'], northyork_merged['Cluster Labels'].astype(int)):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

##### There was a NaN value in the merged dataframe interfering with the folium code, so I dropped the row containing it and then re-ran the code to get the cluster map above.

In [65]:
northyork_merged.dropna(axis=0,inplace=True)
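
A likely source of the NaN (an assumption, since the notebook does not show the affected row) is a neighborhood for which no venues were returned: it never appears in the grouped venue table, so the join leaves its cluster label empty and the later `.astype(int)` call fails. The sketch below reproduces that situation with made-up data and placeholder neighborhoods `A`, `B`, and `C`.

```python
import pandas as pd

# Made-up data: 'C' has coordinates but no venues, so no cluster label.
places = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'B'], 'Cluster Labels': [4, 0]})

merged = places.join(labels.set_index('Neighborhood'), on='Neighborhood')
print(merged['Cluster Labels'].tolist())   # the label column is now float

# Casting to int fails while the NaN is present...
try:
    merged['Cluster Labels'].astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# ...and works once rows without a label are dropped.
cleaned = merged.dropna(subset=['Cluster Labels']).copy()
cleaned['Cluster Labels'] = cleaned['Cluster Labels'].astype(int)
print(cast_failed, cleaned['Cluster Labels'].tolist())
```

Passing `subset=['Cluster Labels']` limits the drop to rows that lack a label, whereas `dropna(axis=0)` removes any row with a missing value in any column.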

### Further examination of each cluster and differentiation based on venues.

Cluster 1 -- Stores and European style restaurants

In [48]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 0, northyork_merged.columns[[2] + list(range(5, northyork_merged.shape[1]))]]

,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
18,"Bedford Park, Lawrence Manor East",0.0,Restaurant,Italian Restaurant,Coffee Shop,Sandwich Place,Greek Restaurant
20,Willowdale South,0.0,Coffee Shop,Ramen Restaurant,Pizza Place,Sushi Restaurant,Café
21,Downsview Northwest,0.0,Grocery Store,Gym / Fitness Center,Athletics & Sports,Liquor Store,Juice Bar
23,Willowdale West,0.0,Grocery Store,Supermarket,Pharmacy,Pizza Place,Discount Store


Cluster 2 -- Parks, juice bars, Middle Eastern restaurants

In [49]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 1, northyork_merged.columns[[2] + list(range(5, northyork_merged.shape[1]))]]

,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
12,"York Mills, Silver Hills",1.0,Park,Juice Bar,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
22,York Mills West,1.0,Park,Convenience Store,Juice Bar,Miscellaneous Shop,Middle Eastern Restaurant


Cluster 3

In [50]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 2, northyork_merged.columns[[2] + list(range(5, northyork_merged.shape[1]))]]

,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
19,"Humberlea, Emery",2.0,Construction & Landscaping,Baseball Field,Accessories Store,Korean Restaurant,Mobile Phone Shop


Cluster 4

In [51]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 3, northyork_merged.columns[[2] + list(range(5, northyork_merged.shape[1]))]]

,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
15,Humber Summit,3.0,Gym,Pizza Place,Juice Bar,Miscellaneous Shop,Middle Eastern Restaurant


Cluster 5 -- Coffee shops, restaurants, banks, and gyms

In [52]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 4, northyork_merged.columns[[2] + list(range(5, northyork_merged.shape[1]))]]

,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Parkwoods,4.0,Bus Stop,Food & Drink Shop,Park,Accessories Store,Korean Restaurant
1,Victoria Village,4.0,Coffee Shop,Hockey Arena,Portuguese Restaurant,Financial or Legal Service,Accessories Store
2,"Lawrence Manor, Lawrence Heights",4.0,Clothing Store,Furniture / Home Store,Accessories Store,Carpet Store,Miscellaneous Shop
3,Don Mills North,4.0,Gym,Caribbean Restaurant,Café,Baseball Field,Japanese Restaurant
4,Glencairn,4.0,Japanese Restaurant,Asian Restaurant,Metro Station,Bakery,Sushi Restaurant
5,Don Mills South,4.0,Restaurant,Gym,Coffee Shop,Clothing Store,Beer Store
6,Hillcrest Village,4.0,Dog Run,Golf Course,Mediterranean Restaurant,Pool,Fast Food Restaurant
7,"Bathurst Manor, Wilson Heights, Downsview North",4.0,Coffee Shop,Bank,Park,Shopping Mall,Mobile Phone Shop
8,"Fairview, Henry Farm, Oriole",4.0,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Shoe Store
9,"Northwood Park, York University",4.0,Coffee Shop,Caribbean Restaurant,Furniture / Home Store,Miscellaneous Shop,Bar


## End of analysis!