# Web-scraping Canadian Postal Codes
We use BeautifulSoup to scrape the table of Canadian postal codes from [Wikipedia](http://www.wikizero.biz/index.php?q=aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTGlzdF9vZl9wb3N0YWxfY29kZXNfb2ZfQ2FuYWRhOl9N).

In [1]:
# Import the necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re
import folium # map rendering library
import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [2]:
# Download the page and store its HTML text in the variable 'table'
table = requests.get(r"http://www.wikizero.biz/index.php?q=aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTGlzdF9vZl9wb3N0YWxfY29kZXNfb2ZfQ2FuYWRhOl9N").text

# Create a BeautifulSoup instance
soup = BeautifulSoup(table, "lxml")
# print(soup.prettify())

### Extracting the table
After inspecting the webpage, I found that the table of interest sits inside the first **tbody** tag.  
The other relevant tags are:  
* *th* ---- Header cell tag
* *td* ---- Data cell tag

In [3]:
# Extract the table from the 'soup' instance
Table = soup.find('tbody')
# print(Table.prettify())

In [4]:
# From Table, we can extract the headers which have the 'th' tag
header = [Columns.text for Columns in Table.find_all('th')]
header[-1] = header[-1][:-1] # remove new line (\n) character at the end
header[0] = 'PostalCode'
print(header)

# From Table, we can extract the rows which have the 'td' tag
body = [body.text for body in Table.find_all('td')]
body = np.array(body).reshape([-1,3])

['PostalCode', 'Borough', 'Neighbourhood']
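
The `reshape([-1,3])` call above assumes that every table row contributes exactly three `td` cells. A row-by-row variant is sketched below as one possible alternative. The HTML snippet is a hypothetical miniature of the Wikipedia table, and `html.parser` is used only so that the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the scraped table, for illustration only
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for tr in sample_soup.find("tbody").find_all("tr"):
    # get_text(strip=True) also removes the trailing newline in the last cell
    cells = [cell.get_text(strip=True) for cell in tr.find_all("td")]
    if cells:  # the header row holds only <th> cells, so it is skipped
        sample_rows.append(cells)

print(sample_rows)
```

Collecting cells per `tr` keeps each record together, so a row with a missing or extra cell shows up as a list of unexpected length instead of silently shifting every later value.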


### Create the initial DataFrame

In [5]:
df = pd.DataFrame.from_records(body)
df.columns = header

# Delete the last character (\n) of every row
df['Neighbourhood'] = df.Neighbourhood.str.replace("\n", "")
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Clean the data
Following the instructions, rows without an assigned borough are dropped, and neighbourhoods that share a postal code are combined into a single row.

In [6]:
# delete rows that do not have a Borough
df = df[df.Borough != 'Not assigned']

# Group neighbourhoods that share the same postal code (and borough)
df_postal_codes = pd.DataFrame(df.groupby(['PostalCode', 'Borough'])['Neighbourhood'].sum())

df_postal_codes.reset_index(level=df_postal_codes.index.names, inplace=True)
df_postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,RougeMalvern
1,M1C,Scarborough,Highland CreekRouge HillPort Union
2,M1E,Scarborough,GuildwoodMorningsideWest Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


#### Use regular expressions to perform string operations
The text in the **Neighbourhood** column is not properly formatted after the *sum* operation:  
there is no comma and space between neighbourhoods.  
A regular expression applied in a list comprehension inserts the missing separators.

In [7]:

splits = [re.sub(r"(?<=\w)([A-Z])", r", \1", name) for name in df_postal_codes.Neighbourhood.values]
df_postal_codes['Neighbourhood'] = splits
df_postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
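
The regular expression inserts a separator before every capital letter that follows a word character, so it also splits acronyms (the name "C, F, B Toronto" that appears later in this notebook is a result of this). The sketch below compares that approach with joining the names during aggregation. The rows are hypothetical and only mimic the shape of the scraped table; `"".join` stands in for the string concatenation performed by `.sum()`.

```python
import re
import pandas as pd

# Hypothetical rows shaped like the scraped table
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M3K", "M3K"],
    "Borough": ["Scarborough", "Scarborough", "North York", "North York"],
    "Neighbourhood": ["Rouge", "Malvern", "CFB Toronto", "Downsview East"],
})
grouped = toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]

# Approach used above: concatenate without a separator, then split with a regex
concatenated = grouped.agg("".join).reset_index()
concatenated["Neighbourhood"] = [
    re.sub(r"(?<=\w)([A-Z])", r", \1", name) for name in concatenated["Neighbourhood"]
]

# Alternative: insert the separator while aggregating
joined = grouped.agg(", ".join).reset_index()

print(concatenated["Neighbourhood"].tolist())
print(joined["Neighbourhood"].tolist())
```

With the join-based aggregation no regular expression is needed, and names containing consecutive capitals stay intact.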


In [8]:
print(df_postal_codes.shape)

(103, 3)


## Get latitude & longitude
The coordinates come from the provided CSV file, `Geospatial_Coordinates.csv`.

In [9]:
lat_lon = pd.read_csv('Geospatial_Coordinates.csv')

df_postal_codes[['Latitude', 'Longitude']] = lat_lon[['Latitude', 'Longitude']]

df_postal_codes.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
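
The column assignment above pairs rows by position, which is only correct if the CSV lists the postal codes in the same order as `df_postal_codes`. A key-based merge avoids that dependency. The sketch below uses small hypothetical frames, and it assumes the CSV's key column is called `Postal Code`; that name is a guess and should be checked against the actual file.

```python
import pandas as pd

# Hypothetical stand-ins for df_postal_codes and the CSV contents
codes = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1E"]})
coords = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],  # deliberately in a different order
    "Latitude": [43.763573, 43.806686, 43.784535],
    "Longitude": [-79.188711, -79.194353, -79.160497],
})

# Align on the postal code instead of on row position
merged = codes.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",              # keep every postal code from the left frame
    validate="one_to_one",   # raise if a postal code appears more than once
)

print(merged)
```

A left merge keeps the row order of `codes`, and any postal code missing from the coordinates file would show up as `NaN` rather than receiving another code's coordinates.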


## Map of Toronto

In [10]:
# create map of Toronto using latitude and longitude values
latitude, longitude = 43.651070, -79.347015
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_postal_codes['Latitude'], df_postal_codes['Longitude'], df_postal_codes['Borough'], df_postal_codes['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Explore the North York Borough

In [11]:
df_north_york = df_postal_codes[df_postal_codes['Borough'] == 'North York'].reset_index(drop=True)
df_north_york.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M2H,North York,Hillcrest Village,43.803762,-79.363452
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
2,M2K,North York,Bayview Village,43.786947,-79.385975
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714
4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493


### Map of North York

In [12]:
latitude = 43.7615
longitude = -79.4111
# create map of North York using latitude and longitude values
map_north_york = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_north_york['Latitude'], df_north_york['Longitude'], df_north_york['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_north_york)  
    
map_north_york

#### Define Foursquare Credentials and Version

***Hidden information***
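
The hidden cell defines `CLIENT_ID`, `CLIENT_SECRET` and `VERSION`, which the requests below rely on. One way to keep such values out of a shared notebook is to read them from environment variables. The helper and the variable names `FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET` and `FOURSQUARE_VERSION` in this sketch are hypothetical, not what the hidden cell actually does.

```python
import os

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret, version) read from a mapping.

    Hypothetical helper: the environment variable names are assumptions.
    """
    env = os.environ if env is None else env
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET", "FOURSQUARE_VERSION")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)

# Demonstration with placeholder values (not real credentials)
example_env = {
    "FOURSQUARE_CLIENT_ID": "my-id",
    "FOURSQUARE_CLIENT_SECRET": "my-secret",
    "FOURSQUARE_VERSION": "my-version",
}
demo_id, demo_secret, demo_version = load_foursquare_credentials(example_env)
print(demo_id, demo_version)
```

In the notebook itself the call would be `CLIENT_ID, CLIENT_SECRET, VERSION = load_foursquare_credentials()`, which reads from `os.environ` and fails early if a value is missing.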

#### Let's explore the first neighbourhood in our dataframe.
Get the neighbourhood's name.

In [14]:
df_north_york.loc[0, 'Neighbourhood']

'Hillcrest Village'

Get the neighbourhood's latitude and longitude values.

In [15]:
neighbourhood_latitude = df_north_york.loc[0, 'Latitude'] # neighbourhood latitude value
neighbourhood_longitude = df_north_york.loc[0, 'Longitude'] # neighbourhood longitude value

neighbourhood_name = df_north_york.loc[0, 'Neighbourhood'] # neighbourhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of Hillcrest Village are 43.8037622, -79.3634517.


#### Now, let's get the top 100 venues that are in Hillcrest Village within a radius of 500 meters.
First, let's create the GET request URL and name it `url`.

In [37]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)

Send the GET request and parse the JSON response.

In [17]:
results = requests.get(url).json()
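
The URL above is assembled with string formatting. As a sketch of an alternative, the query string can be built from a dictionary, which handles escaping and makes each parameter easy to read. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Hypothetical helper that builds the venues/explore URL from keyword values."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude as one parameter
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, with the Hillcrest Village coordinates printed earlier
demo_url = build_explore_url("my-id", "my-secret", "my-version", 43.8037622, -79.3634517)

# Round-trip check: parse the query string back into a dictionary
demo_query = parse_qs(urlsplit(demo_url).query)
print(demo_query["ll"], demo_query["radius"], demo_query["limit"])
```

`requests.get` also accepts such a dictionary directly through its `params` argument, so the same mapping could be passed as `requests.get(base_url, params=params)` without building the string by hand.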

In [18]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the JSON and structure it into a pandas dataframe.

In [19]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Eagle's Nest Golf Club,Golf Course,43.805455,-79.364186
1,AY Jackson Pool,Pool,43.804515,-79.366138
2,Villa Madina,Mediterranean Restaurant,43.801685,-79.363938
3,Duncan Creek Park,Dog Run,43.805539,-79.360695
4,A.Y. Jackson Secondary School Track,Athletics & Sports,43.805068,-79.366677


In [20]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

5 venues were returned by Foursquare.


## Explore Neighbourhoods in North York

In [21]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [22]:
north_york_venues = getNearbyVenues(names=df_north_york['Neighbourhood'],
                                   latitudes=df_north_york['Latitude'],
                                   longitudes=df_north_york['Longitude']
                                  )

Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
C, F, B Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Bedford Park, Lawrence Manor East
Lawrence Heights, Lawrence Manor
Glencairn
Downsview, North Park, Upwood Park
Humber Summit
Emery, Humberlea


In [23]:
north_york_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,Eagle's Nest Golf Club,43.805455,-79.364186,Golf Course
1,Hillcrest Village,43.803762,-79.363452,AY Jackson Pool,43.804515,-79.366138,Pool
2,Hillcrest Village,43.803762,-79.363452,Villa Madina,43.801685,-79.363938,Mediterranean Restaurant
3,Hillcrest Village,43.803762,-79.363452,Duncan Creek Park,43.805539,-79.360695,Dog Run
4,Hillcrest Village,43.803762,-79.363452,A.Y. Jackson Secondary School Track,43.805068,-79.366677,Athletics & Sports


In [24]:
north_york_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bathurst Manor, Downsview North, Wilson Heights",18,18,18,18,18,18
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",21,21,21,21,21,21
"C, F, B Toronto, Downsview East",3,3,3,3,3,3
Don Mills North,5,5,5,5,5,5
Downsview Central,4,4,4,4,4,4
Downsview Northwest,4,4,4,4,4,4
Downsview West,5,5,5,5,5,5
"Downsview, North Park, Upwood Park",4,4,4,4,4,4
"Emery, Humberlea",1,1,1,1,1,1


Let's find out how many unique categories appear among all the returned venues.

In [25]:
print('There are {} unique categories.'.format(len(north_york_venues['Venue Category'].unique())))

There are 106 unique categories.


## Analyze the Neighbourhood

In [26]:
# one hot encoding
north_york_onehot = pd.get_dummies(north_york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
north_york_onehot['Neighbourhood'] = north_york_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [north_york_onehot.columns[-1]] + list(north_york_onehot.columns[:-1])
north_york_onehot = north_york_onehot[fixed_columns]

north_york_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store
0,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Hillcrest Village,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
north_york_grouped = north_york_onehot.groupby('Neighbourhood').mean().reset_index()
north_york_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store
0,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,...,0.055556,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,...,0.047619,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"C, F, B Toronto, Downsview East",0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Downsview Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downsview West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Downsview, North Park, Upwood Park",0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Emery, Humberlea",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
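
The two steps above (one-hot encoding, then averaging per neighbourhood) turn venue rows into category frequencies. The toy example below sketches why the mean of the indicator columns equals the share of each category; the neighbourhood and category names are made up for illustration.

```python
import pandas as pd

# Hypothetical venue rows: neighbourhood "A" has two venues, "B" has one
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "B"],
    "Venue Category": ["Park", "Café", "Park"],
})

# One indicator column per category (1.0 where the venue has that category)
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
toy_onehot.insert(0, "Neighbourhood", toy_venues["Neighbourhood"])

# Averaging the indicators gives the fraction of venues in each category
toy_grouped = toy_onehot.groupby("Neighbourhood").mean().reset_index()
print(toy_grouped)
```

Each row of the grouped frame sums to 1 across the category columns, which is why a value such as 0.25 in the table above means one venue out of four.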


#### Print each neighbourhood along with the top 5 most common venues

In [28]:
num_top_venues = 5

for hood in north_york_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = north_york_grouped[north_york_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Downsview North, Wilson Heights----
                venue  freq
0         Coffee Shop  0.11
1  Frozen Yogurt Shop  0.06
2         Supermarket  0.06
3               Diner  0.06
4         Pizza Place  0.06


----Bayview Village----
                 venue  freq
0   Chinese Restaurant  0.25
1  Japanese Restaurant  0.25
2                 Bank  0.25
3                 Café  0.25
4    Accessories Store  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0  Italian Restaurant  0.10
1         Coffee Shop  0.10
2   Indian Restaurant  0.05
3          Restaurant  0.05
4      Sandwich Place  0.05


----C, F, B Toronto, Downsview East----
               venue  freq
0            Airport  0.33
1               Park  0.33
2  Other Repair Shop  0.33
3     Massage Studio  0.00
4          Piano Bar  0.00


----Don Mills North----
                  venue  freq
0  Gym / Fitness Center   0.2
1   Japanese Restaurant   0.2
2  Caribbean Restaurant   0.2
3             

In [29]:
# function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create a new dataframe and display the top 10 venues for each neighbourhood.

In [30]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = north_york_grouped['Neighbourhood']

for ind in np.arange(north_york_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(north_york_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Downsview North, Wilson Heights",Coffee Shop,Pharmacy,Bank,Pizza Place,Deli / Bodega,Diner,Bridal Shop,Restaurant,Sandwich Place,Supermarket
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Event Space,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant
2,"Bedford Park, Lawrence Manor East",Coffee Shop,Italian Restaurant,Sandwich Place,Indian Restaurant,Pharmacy,Café,Pizza Place,Butcher,Liquor Store,Juice Bar
3,"C, F, B Toronto, Downsview East",Airport,Park,Other Repair Shop,Women's Store,Event Space,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant,Diner
4,Don Mills North,Gym / Fitness Center,Caribbean Restaurant,Café,Japanese Restaurant,Basketball Court,Women's Store,Event Space,Deli / Bodega,Department Store,Dim Sum Restaurant
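
Several neighbourhoods returned fewer than ten venues (Bayview Village has four), so the remaining "most common" slots in the table above are filled with categories whose frequency is zero. A possible variant that leaves those slots empty is sketched below. `top_venues_nonzero` is a hypothetical helper, not part of the original notebook, and the sample row is made up.

```python
import pandas as pd

def top_venues_nonzero(row, num_top_venues):
    """Return the most frequent categories, padding with None instead of zero-frequency names."""
    freqs = row.iloc[1:].astype(float)          # skip the 'Neighbourhood' entry
    freqs = freqs[freqs > 0].sort_values(ascending=False)
    top = list(freqs.index[:num_top_venues])
    return top + [None] * (num_top_venues - len(top))

# Hypothetical row shaped like one row of north_york_grouped
sample_row = pd.Series({
    "Neighbourhood": "X",
    "Bank": 0.25,
    "Café": 0.5,
    "Park": 0.0,
    "Pool": 0.0,
})

sample_top = top_venues_nonzero(sample_row, 3)
print(sample_top)
```

Padding with `None` keeps the output the same length as the column list, so the helper could be dropped into the loop above in place of `return_most_common_venues`.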


## Cluster Neighbourhoods
Run k-means to group the neighbourhoods into 3 clusters.

In [31]:
# set number of clusters
kclusters = 3

north_york_grouped_clustering = north_york_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(north_york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])

Create a new dataframe that includes the cluster label as well as the top 10 venues for each neighbourhood.

In [32]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

north_york_merged = df_north_york

# join the sorted venues (with cluster labels) onto df_north_york to add latitude/longitude for each neighbourhood
north_york_merged = north_york_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

north_york_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,1,Golf Course,Athletics & Sports,Pool,Mediterranean Restaurant,Dog Run,Women's Store,Empanada Restaurant,Convenience Store,Cosmetics Shop,Deli / Bodega
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,1,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Women's Store,Bakery,Gift Shop,Department Store,Cosmetics Shop,Japanese Restaurant
2,M2K,North York,Bayview Village,43.786947,-79.385975,1,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Event Space,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714,0,Park,Women's Store,Comfort Food Restaurant,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant,Diner,Discount Store
4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493,2,Piano Bar,Women's Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant,Diner,Discount Store


#### Visualize the resulting clusters

In [33]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(north_york_merged['Latitude'], north_york_merged['Longitude'], north_york_merged['Neighbourhood'], north_york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters

#### Cluster 1

In [34]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 0, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,North York,0,Park,Women's Store,Comfort Food Restaurant,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant,Diner,Discount Store
6,North York,0,Park,Convenience Store,Bank,Women's Store,Event Space,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant,Diner
8,North York,0,Park,Food & Drink Shop,Fast Food Restaurant,Women's Store,Empanada Restaurant,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant
13,North York,0,Airport,Park,Other Repair Shop,Women's Store,Event Space,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant,Diner
21,North York,0,Park,Bakery,Basketball Court,Construction & Landscaping,Women's Store,Event Space,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant


#### Cluster 2

In [35]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 1, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,1,Golf Course,Athletics & Sports,Pool,Mediterranean Restaurant,Dog Run,Women's Store,Empanada Restaurant,Convenience Store,Cosmetics Shop,Deli / Bodega
1,North York,1,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Women's Store,Bakery,Gift Shop,Department Store,Cosmetics Shop,Japanese Restaurant
2,North York,1,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Event Space,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant
5,North York,1,Ramen Restaurant,Coffee Shop,Sandwich Place,Shopping Mall,Café,Vietnamese Restaurant,Pharmacy,Bubble Tea Shop,Plaza,Ice Cream Shop
7,North York,1,Coffee Shop,Discount Store,Grocery Store,Pizza Place,Butcher,Pharmacy,Frozen Yogurt Shop,Diner,Golf Course,Gift Shop
9,North York,1,Gym / Fitness Center,Caribbean Restaurant,Café,Japanese Restaurant,Basketball Court,Women's Store,Event Space,Deli / Bodega,Department Store,Dim Sum Restaurant
10,North York,1,Gym,Asian Restaurant,Coffee Shop,Beer Store,Shopping Mall,Grocery Store,Fast Food Restaurant,Italian Restaurant,Japanese Restaurant,Discount Store
11,North York,1,Coffee Shop,Pharmacy,Bank,Pizza Place,Deli / Bodega,Diner,Bridal Shop,Restaurant,Sandwich Place,Supermarket
12,North York,1,Coffee Shop,Furniture / Home Store,Bar,Metro Station,Falafel Restaurant,Massage Studio,Women's Store,Electronics Store,Cosmetics Shop,Deli / Bodega
14,North York,1,Grocery Store,Park,Hotel,Bank,Event Space,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant


#### Cluster 3

In [36]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 2, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,North York,2,Piano Bar,Women's Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dim Sum Restaurant,Diner,Discount Store
