# **Datascience introduction - Capstone Project**

### **Week 3 - Segmenting and clustering**

#### **Install packages**

Install the web scraping package 'Beautiful Soup' and the html5lib parser package.

In [2]:
!conda install -c conda-forge beautifulsoup4 --yes
!conda install -c conda-forge html5lib --yes

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    soupsieve-1.9.2            |           py36_0          59 KB  conda-forge
    beautifulsoup4-4.8.0       |           py36_0         144 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         202 KB

The following NEW packages will be INSTALLED:

    soupsieve:      1.9.2-py36_0 conda-forge

The following packages will be UPDATED:

    beautifulsoup4: 4.6.3-py37_0             --> 4.8.0-py36_0 conda-forge


Downloading and Extracting Packages
soupsieve-1.9.2      | 59 KB     |

**Import libraries**

Apart from the usual libraries like pandas and numpy, the following libraries are needed and should thus be imported: 
- web page scraping (BeautifulSoup)
- sending HTTP requests (requests)
- creating a csv file and writing to a csv file (csv)

In [3]:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import numpy as np

**Scrape webpage and write table contents to csv file**

The scraping process is coded as follows:
- The URL of the Wikipedia page is passed as an argument to an HTTP GET request. The request is submitted and the text of the response is stored in the variable 'source'. 
- The source variable is passed along with an HTML parser to the BeautifulSoup constructor. The output is a tree containing all the parsed HTML tags. 
- With the soup's find methods the tree can be traversed using for loops. The tags that we are looking for are the table tag and, inside it, the tbody tag with the postcode data. We are specifically looking for the table rows (tag tr) and within the rows the header fields (tag th) and detail fields (tag td). 
- The text of the postal codes and of the unassigned borough and neighbourhood names can be retrieved directly. The text values of assigned borough and neighbourhood names are inside an anchor tag. 
- The retrieved texts for each table row are collected in a list.
- A csv file is created and opened in write mode. Every table row list is written to the csv file.
- Trailing newline characters in text values are removed.
- After processing all the rows the csv file is closed.
- Most of the code is wrapped in a try-except-finally clause, because unexpected behaviour may occur when scraping and the file resource always has to be closed.

In [58]:
# send an HTTP GET request and save the text of the response in variable source
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
# pass the source variable and a html parser to the Beautiful Soup constructor, the html document is parsed and transformed 
# into a tree of python objects, soup is the top level object
soup = BeautifulSoup(source, 'html5lib')
# look for the table tag within class wikitable sortable, then grab the table body
table_body = soup.find('table', class_='wikitable sortable').tbody
csv_filename = 'm_postal_code_scrape.csv'

# create a csv file - the header and detail rows will be written as records 
csv_file = open(csv_filename, 'w', newline='') # 'w' opens the file for writing; newline='' is what the csv module recommends
csv_writer = csv.writer(csv_file,delimiter=';')

#put scraper code inside try except clause just in case an unexpected situation may occur
try:
    for row in table_body.find_all('tr'):
        #first check the table header fields
        headerrow_list = [] #declare list for header fields within tablerow
        for header_cell in row.find_all('th'):
            if header_cell and header_cell.text: #check for empty element and empty text
                # add text to headerrow list and remove any trailing newline characters
                headerrow_list.append(header_cell.text.replace('\n', ''))
        if len(headerrow_list) > 0: # only write a record if there are elements in the list
            csv_writer.writerow(headerrow_list)
        detailrow_list = [] #declare list for detail fields within tablerow
        for detail_cell in row.find_all('td'):
            anchor = detail_cell.find('a')
            # extract borough or neighborhood name from anchor tag
            if anchor and anchor.text:
                # add anchor text to detailrow list and remove any trailing newline characters
                detailrow_list.append(anchor.text.replace('\n', ''))
            # extract postcode or 'Not assigned'
            elif detail_cell and detail_cell.text:
                # add text to detailrow list and remove any trailing newline character
                detailrow_list.append(detail_cell.text.replace('\n', ''))
        if len(detailrow_list) > 0:  # only write a record if there are elements in the list
            csv_writer.writerow(detailrow_list)
except Exception as e:
    print(e) # print the exception
finally:
    # close resources
    csv_file.close()
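
The try/finally block above guarantees that the csv file is closed. A `with` statement gives the same guarantee with less code. The following is a minimal sketch, assuming a small hand-written list of rows in place of the scraped table; `write_rows`, `sample_rows` and the file name are hypothetical names used only for illustration.

```python
import csv
import os
import tempfile

def write_rows(path, rows, delimiter=';'):
    """Write every non-empty row to a csv file; the file is closed even if an error occurs."""
    # newline='' lets the csv module control the line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file, delimiter=delimiter)
        for row in rows:
            if row:  # skip empty lists, as the scraper does
                writer.writerow(row)

# hypothetical sample rows standing in for the scraped table
sample_rows = [
    ['Postcode', 'Borough', 'Neighbourhood'],
    [],
    ['M3A', 'North York', 'Parkwoods'],
]
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_scrape.csv')
write_rows(sample_path, sample_rows)

# read the file back to check what was written
with open(sample_path, newline='', encoding='utf-8') as f:
    written = list(csv.reader(f, delimiter=';'))
print(written)
```

The empty list in the sample is skipped, which mirrors the length checks in the scraper.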

**Load csv file into dataframe**

In [59]:
df = pd.read_csv(csv_filename, sep = ';')
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


**Clean up the dataset and aggregate the dataset**

The data is cleaned up, and neighbourhoods sharing the same postal code and borough are aggregated into one comma-separated value.

In [60]:
# replace column name Postcode with PostalCode
df.rename(columns = {'Postcode':'PostalCode'},inplace=True)
# remove rows with both Borough and Neighbourhood 'Not assigned'
df = df[(df.Borough != 'Not assigned') | (df.Neighbourhood != 'Not assigned')]
# replace Neighbourhood 'Not assigned' value with Borough value
df.Neighbourhood.replace('Not assigned',df.Borough,inplace=True)
# aggregate dataframe on PostalCode and Borough, concatenate Neighbourhood values, separated by comma
df = df.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()

df.head(20)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
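
The clean-up and aggregation steps can be tried on a handful of rows to see what each one does. This is a minimal sketch on hypothetical sample data, not the scraped table. It uses a boolean mask with `.loc` for the replacement, which is one possible alternative to the in-place `replace` call used above.

```python
import pandas as pd

# hypothetical sample in the same shape as the scraped table
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# drop rows where both Borough and Neighbourhood are 'Not assigned';
# .copy() makes sure the next assignment does not act on a slice of sample
keep = (sample['Borough'] != 'Not assigned') | (sample['Neighbourhood'] != 'Not assigned')
cleaned = sample[keep].copy()

# where the neighbourhood is missing, fall back to the borough name
missing = cleaned['Neighbourhood'] == 'Not assigned'
cleaned.loc[missing, 'Neighbourhood'] = cleaned.loc[missing, 'Borough']

# one row per postal code and borough, neighbourhoods joined by a comma
aggregated = (cleaned.groupby(['PostalCode', 'Borough'])['Neighbourhood']
              .apply(', '.join)
              .reset_index())
print(aggregated)
```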


**Number of rows in dataframe**

In [61]:
print("Number of rows and columns in dataframe: ",df.shape)

Number of rows and columns in dataframe:  (103, 3)


**Geographical Coordinates**

Install the geopy package.

In [14]:
!conda install -c conda-forge geopy --yes

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          90 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.20.0-py_0 conda-forge


Downloading and Extracting Packages
geopy-1.20.0         | 57 KB     | ##################################### | 100% 
geographiclib-1.49   | 32 KB     | ##

Import the Nominatim Geocoder API

In [33]:
from geopy.geocoders import Nominatim

Create the geocoder API object and pass a user agent.
Create lists for the latitude and longitude values.

In [35]:
g = Nominatim(user_agent="coursera_capstone_project")
lat_list = []
long_list = []

Call the geocode() method on the geocoder object for every postal code in the dataframe. Add the resulting coordinates to the lists.
Finally the lists are added as new columns to the dataframe.

In [52]:
for index, row in df.iterrows():
    n = g.geocode('{}, Toronto, Ontario'.format(row['PostalCode']))
    if hasattr(n,'latitude') and (n.latitude is not None):
        lat_list.append(n.latitude)
    else:
        lat_list.append('Not found')
    if hasattr(n,'longitude') and (n.longitude is not None): 
        long_list.append(n.longitude)
    else:
        long_list.append('Not found')
df['Latitude'] = pd.Series(lat_list).values
df['Longitude'] = pd.Series(long_list).values
df.head(20)

Unfortunately, the above code doesn't supply the geocoordinates because the geocoder object runs into timeouts. Instead, I will read the coordinates from the Geospatial_data csv file.
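
Timeouts like these are often handled by retrying a failed call after a short pause. The sketch below shows the idea with a stand-in function, so it runs without network access; `geocode_with_retry` and `flaky_geocode` are hypothetical names, and the retry count and delay are assumptions. geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` for this purpose, which is not used here.

```python
import time

def geocode_with_retry(geocode, query, attempts=3, delay_seconds=1.0):
    """Call geocode(query); on an exception wait and retry, return None if all attempts fail."""
    for attempt in range(attempts):
        try:
            return geocode(query)
        except Exception as error:  # a real notebook would catch only the geocoder's timeout error
            print('attempt {} failed for {}: {}'.format(attempt + 1, query, error))
            time.sleep(delay_seconds)
    return None

# hypothetical stand-in for g.geocode that times out twice before answering
calls = {'count': 0}

def flaky_geocode(query):
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return (43.806686, -79.194353)

result = geocode_with_retry(flaky_geocode, 'M1B, Toronto, Ontario', delay_seconds=0)
print(result)
```

Returning `None` for a failed lookup, rather than the string 'Not found', would keep the coordinate columns numeric.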

In [62]:
df_geo = pd.read_csv('https://cocl.us/Geospatial_data')
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the two dataframes on PostalCode. For a successful merge, remove the space from the 'Postal Code' header name.

In [63]:
df_geo.rename(columns = {'Postal Code':'PostalCode'},inplace=True)
df = pd.merge(df, df_geo, on='PostalCode')
df.head(20)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
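
`pd.merge` performs an inner join by default, so a postal code without coordinates would silently disappear from the result. A left join with `indicator=True` makes such rows visible. This is a minimal sketch on hypothetical sample data; the postal code 'M9Z' is invented to show an unmatched row.

```python
import pandas as pd

# hypothetical samples; 'M9Z' has no coordinates on purpose
boroughs = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                         'Borough': ['Scarborough', 'Scarborough', 'Example']})
coords = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# give the key column the same name in both dataframes
coords = coords.rename(columns={'Postal Code': 'PostalCode'})

# how='left' keeps every borough row; indicator=True adds a _merge column
checked = pd.merge(boroughs, coords, on='PostalCode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

In the notebook the row count stays at 103 before and after the merge, which indicates that every postal code found its coordinates.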


Size of dataframe after adding geocoordinates.

In [64]:
df.shape

(103, 5)

In [68]:
df_york = df[df['Borough'].str.contains("York")]
df_york.shape

(34, 5)
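
Note that `str.contains("York")` matches substrings, so the selection above is wider than the borough named York: it also includes boroughs such as North York, which is why 34 rows are returned. The difference with an exact match is shown in this small sketch on hypothetical sample data.

```python
import pandas as pd

# hypothetical sample of borough names
sample = pd.DataFrame({'Borough': ['North York', 'East York', 'York',
                                   'Scarborough', 'Downtown Toronto']})

# substring match: every borough whose name contains "York"
contains_york = sample[sample['Borough'].str.contains('York')]
# exact match: only the borough named "York"
only_york = sample[sample['Borough'] == 'York']

print(contains_york['Borough'].tolist())
print(only_york['Borough'].tolist())
```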

### Explore and cluster neighbourhoods

In [107]:
from sklearn.cluster import KMeans
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

***The York neighbourhoods shown on a map***

Determine the central point for the map. I assume the York borough is the central borough. In the dataframe this borough consists of 5 neighbourhoods. I pick the coordinates of one of these neighbourhoods. They are all close together, so it doesn't matter which one I choose.

In [76]:
#determine central point for map, fetch geocoordinates for a neighbourhood in the York borough. First one is ok
df_central_york = df_york.loc[df_york['Borough'] == 'York', ['Latitude','Longitude']].iloc[0]
# create map of York using latitude and longitude values
map_york = folium.Map(location=[df_central_york['Latitude'], df_central_york['Longitude']], zoom_start=11)
# add markers to map
for lat, lng, label in zip(df_york['Latitude'], df_york['Longitude'], df_york['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_york)  
    
map_york

***Explore the neighbourhoods and segment them***

In [78]:
CLIENT_ID = 'NZ5STRQDFVOP1JOUY5WW0WZJDEBVZM5VLZ2IIOAZMKS2ULBQ' # your Foursquare ID
CLIENT_SECRET = 'DSG2OGEJSUS1PAK1FJZBEARLLIKGOARPRKMLNDZ01C3CZGCG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # limit of number of venues returned by Foursquare API
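
API credentials written into a notebook cell become visible to everyone the notebook is shared with. One common alternative is to read them from environment variables. The following is a minimal sketch of that idea; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `read_credentials` are hypothetical.

```python
import os

def read_credentials(environ=os.environ):
    """Return (client_id, client_secret) from the environment or raise a clear error."""
    # FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are hypothetical variable names
    try:
        return environ['FOURSQUARE_CLIENT_ID'], environ['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('missing environment variable {}'.format(missing)) from None

# usage with a stand-in mapping instead of the real environment
demo_id, demo_secret = read_credentials({'FOURSQUARE_CLIENT_ID': 'demo-id',
                                         'FOURSQUARE_CLIENT_SECRET': 'demo-secret'})
print(demo_id)
```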

Let's reuse a function from the clustering lab to retrieve the categories for every neighbourhood

In [79]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
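
The function above indexes straight into the JSON response, so a response without groups or a venue without categories would raise an error and stop the whole loop. The sketch below shows a more defensive way to extract the same fields. It works on plain dictionaries, so no API call is made; `extract_venues` and both payloads are hypothetical, shaped after the keys the function above reads.

```python
def extract_venues(name, lat, lng, payload):
    """Turn one explore response (already decoded from JSON) into a list of row tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to a placeholder when a venue has no category
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# hypothetical payloads in the shape the function above indexes into
ok_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Some Park', 'location': {'lat': 43.8, 'lng': -79.36},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'No Category', 'location': {'lat': 43.8, 'lng': -79.36},
               'categories': []}},
]}]}}
error_payload = {'meta': {'code': 429}}

ok_rows = extract_venues('Example', 43.8, -79.36, ok_payload)
error_rows = extract_venues('Example', 43.8, -79.36, error_payload)
print(ok_rows)
print(error_rows)
```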

To retrieve the venues for every neighbourhood in the York selection, the above function is called. The results are stored in the dataframe york_venues.

In [80]:
york_venues = getNearbyVenues(names=df_york['Neighbourhood'],
                                   latitudes=df_york['Latitude'],
                                   longitudes=df_york['Longitude']
                                  )

Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
Leaside
Thorncliffe Park
East Toronto
Bedford Park, Lawrence Manor East
Lawrence Heights, Lawrence Manor
Glencairn
Humewood-Cedarvale
Caledonia-Fairbanks
Downsview, North Park, Upwood Park
Del Ray, Keelesdale, Mount Dennis, Silverthorn
The Junction North, Runnymede
Humber Summit
Emery, Humberlea
Weston


In [81]:
print(york_venues.shape)
york_venues.head()

(334, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,Eagle's Nest Golf Club,43.805455,-79.364186,Golf Course
1,Hillcrest Village,43.803762,-79.363452,New York Fries,43.803664,-79.363905,Fast Food Restaurant
2,Hillcrest Village,43.803762,-79.363452,AY Jackson Pool,43.804515,-79.366138,Pool
3,Hillcrest Village,43.803762,-79.363452,Villa Madina,43.801685,-79.363938,Mediterranean Restaurant
4,Hillcrest Village,43.803762,-79.363452,Duncan Creek Park,43.805539,-79.360695,Dog Run


Count the venues per neighbourhood.

In [86]:
york_venues.groupby('Neighborhood').size().reset_index(name='counts')

Unnamed: 0,Neighborhood,counts
0,"Bathurst Manor, Downsview North, Wilson Heights",18
1,Bayview Village,4
2,"Bedford Park, Lawrence Manor East",22
3,"CFB Toronto, Downsview East",3
4,Caledonia-Fairbanks,5
5,"Del Ray, Keelesdale, Mount Dennis, Silverthorn",3
6,Don Mills North,5
7,Downsview Central,3
8,Downsview Northwest,4
9,Downsview West,6


Determine the unique venue categories.

In [87]:
print('There are {} unique categories.'.format(len(york_venues['Venue Category'].unique())))

There are 123 unique categories.


Now the venues per neighbourhood are analyzed. One-hot encoding is applied to show every category explicitly.

In [89]:
# one hot encoding
york_onehot = pd.get_dummies(york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
york_onehot['Neighborhood'] = york_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [york_onehot.columns[-1]] + list(york_onehot.columns[:-1])
york_onehot = york_onehot[fixed_columns]

york_onehot.head(5)

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,Bank,...,Theater,Toy / Game Store,Trail,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wings Joint,Women's Store,Yoga Studio
0,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
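
With one row per venue and one 0/1 column per category, the mean per neighbourhood is the share of that neighbourhood's venues in each category. The small sketch below illustrates this on hypothetical sample data with two neighbourhoods, A and B.

```python
import pandas as pd

# hypothetical venues: four in neighbourhood A, two in neighbourhood B
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Café', 'Pool', 'Park', 'Bank'],
})

# one indicator column per category; dtype=int gives 0/1 instead of booleans
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of a 0/1 column is the share of venues in that category
frequencies = onehot.groupby('Neighborhood').mean()
print(frequencies)
```

Two of the four venues in A are parks, so its Park frequency is 0.5.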


Below is a helper function that sorts the venue categories of a row in descending order of frequency and returns the most common ones.

In [92]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
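
What the helper does to a single row can be seen in this minimal sketch, which uses a hypothetical frequency row in the same layout: the name first, then one value per category.

```python
import pandas as pd

# hypothetical frequency row: neighbourhood name first, then the categories
row = pd.Series({'Neighborhood': 'A', 'Bank': 0.1, 'Café': 0.2, 'Park': 0.6, 'Pool': 0.05})

# skip the name, sort the frequencies from high to low, keep the category labels
top3 = list(row.iloc[1:].sort_values(ascending=False).index.values[:3])
print(top3)
```

Categories with equal frequency, including a frequency of zero, come out in an order that carries no meaning. When a neighbourhood has fewer distinct categories than requested, the remaining positions are therefore likely to be filled with categories that do not occur there at all.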

The following dataframe contains every neighbourhood with its 10 most common venue categories.

In [123]:
york_grouped = york_onehot.groupby('Neighborhood').mean().reset_index()

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = york_grouped['Neighborhood']

for ind in np.arange(york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Downsview North, Wilson Heights",Coffee Shop,Pharmacy,Supermarket,Fast Food Restaurant,Diner,Deli / Bodega,Middle Eastern Restaurant,Frozen Yogurt Shop,Pizza Place,Restaurant
1,Bayview Village,Chinese Restaurant,Japanese Restaurant,Café,Bank,Yoga Studio,Department Store,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
2,"Bedford Park, Lawrence Manor East",Coffee Shop,Italian Restaurant,Ice Cream Shop,Comfort Food Restaurant,Pharmacy,Pizza Place,Butcher,Liquor Store,Pub,Juice Bar
3,"CFB Toronto, Downsview East",Playground,Airport,Park,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice
4,Caledonia-Fairbanks,Park,Women's Store,Fast Food Restaurant,Market,Yoga Studio,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
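
The column names above are built by catching the error raised when the list of suffixes runs out, which works for up to 20 columns. A small function that computes the suffix directly is one possible alternative; `ordinal` is a hypothetical helper name.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# build the same column names as above for the 10 most common venues
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1], '|', columns[3], '|', columns[10])
```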


### Cluster the neighbourhoods

Run k-means and cluster the data into 5 clusters

In [124]:
# set number of clusters
kclusters = 5

# drop the neighbourhood column because it is of no use when calculating distances with k-means
york_grouped_clustering = york_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(york_grouped_clustering)
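
The number of clusters is fixed at 5 here. A common way to examine such a choice is to fit k-means for several values of k and compare the inertia, the sum of squared distances of the points to their nearest cluster centre. The sketch below does this on synthetic data with three well separated groups, not on the venue data; all names and values in it are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: three well separated groups of 20 points each
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([center + rng.normal(scale=0.5, size=(20, 2)) for center in centers])

# fit k-means for k = 1..6 and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The inertia drops sharply up to k = 3 and only slowly afterwards; the bend in that curve suggests the number of groups present in the data.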


Add the cluster label to the top 10 venues dataframe. Merge this dataframe with the original York dataframe.

In [125]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

york_merged = df_york.copy() # work on a copy so the rename does not act on a slice of df

# let's change the key column name for a successful merge
york_merged.rename(columns = {'Neighbourhood':'Neighborhood'},inplace=True)

# merge the York data with the sorted venues to add the cluster label and the top venues for each neighborhood
york_merged =york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood').reset_index()

york_merged.head() # check the last columns!


Unnamed: 0,index,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,17,M2H,North York,Hillcrest Village,43.803762,-79.363452,1.0,Golf Course,Athletics & Sports,Pool,Mediterranean Restaurant,Fast Food Restaurant,Dog Run,Food Truck,Food Court,Coffee Shop,Furniture / Home Store
1,18,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,1.0,Clothing Store,Coffee Shop,Fast Food Restaurant,Toy / Game Store,Tea Room,Japanese Restaurant,Electronics Store,Bakery,Food Court,Women's Store
2,19,M2K,North York,Bayview Village,43.786947,-79.385975,1.0,Chinese Restaurant,Japanese Restaurant,Café,Bank,Yoga Studio,Department Store,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
3,20,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714,3.0,Cafeteria,Yoga Studio,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Deli / Bodega
4,21,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493,,,,,,,,,,,


Drop the rows with NaN values from the dataset, because they cause trouble when the map is produced.

In [128]:
york_merged = york_merged.dropna()
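
The NaN values come from neighbourhoods that have no entry in the venues dataframe, and they also turn the cluster labels into floats (1.0 instead of 1). The following minimal sketch, on hypothetical sample data, drops only the rows without a label and stores the labels as integers again.

```python
import numpy as np
import pandas as pd

# hypothetical merge result; B has no cluster label
merged = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Cluster Labels': [1.0, np.nan, 3.0],
})

# drop only rows without a cluster label, then store the labels as integers
labelled = merged.dropna(subset=['Cluster Labels']).copy()
labelled['Cluster Labels'] = labelled['Cluster Labels'].astype(int)
print(labelled)
```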

And here is a map of the York borough neighborhoods with the clusters marked.

In [129]:
# create map
map_clusters = folium.Map(location=[df_central_york['Latitude'], df_central_york['Longitude']], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(york_merged['Latitude'], york_merged['Longitude'], york_merged['Neighborhood'], york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine the clusters

**Cluster 1 - an interesting combination of comfort food restaurant and cosmetics shop in every neighborhood**

In [131]:
york_merged.loc[york_merged['Cluster Labels'] == 0, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,M3A,-79.329656,0.0,Park,Bus Stop,Food & Drink Shop,Fast Food Restaurant,Yoga Studio,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
13,M3K,-79.464763,0.0,Playground,Airport,Park,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice
19,M4C,-79.318389,0.0,Skating Rink,Video Store,Beer Store,Curling Ice,Cosmetics Shop,Pharmacy,Park,Bus Stop,Comfort Food Restaurant,Construction & Landscaping
27,M6E,-79.453512,0.0,Park,Women's Store,Fast Food Restaurant,Market,Yoga Studio,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop


**Cluster 2 - Restaurant and shopping area**

In [132]:
york_merged.loc[york_merged['Cluster Labels'] == 1, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,-79.363452,1.0,Golf Course,Athletics & Sports,Pool,Mediterranean Restaurant,Fast Food Restaurant,Dog Run,Food Truck,Food Court,Coffee Shop,Furniture / Home Store
1,M2J,-79.346556,1.0,Clothing Store,Coffee Shop,Fast Food Restaurant,Toy / Game Store,Tea Room,Japanese Restaurant,Electronics Store,Bakery,Food Court,Women's Store
2,M2K,-79.385975,1.0,Chinese Restaurant,Japanese Restaurant,Café,Bank,Yoga Studio,Department Store,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
5,M2N,-79.408493,1.0,Ramen Restaurant,Coffee Shop,Restaurant,Sandwich Place,Pizza Place,Café,Hotel,Steakhouse,Plaza,Bubble Tea Shop
7,M2R,-79.442259,1.0,Pharmacy,Discount Store,Coffee Shop,Butcher,Pizza Place,Clothing Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
9,M3B,-79.352188,1.0,Caribbean Restaurant,Japanese Restaurant,Gym / Fitness Center,Café,Baseball Field,Yoga Studio,Department Store,Diner,Dim Sum Restaurant,Dessert Shop
10,M3C,-79.340923,1.0,Coffee Shop,Gym,Beer Store,Clothing Store,Dim Sum Restaurant,Discount Store,Sandwich Place,Japanese Restaurant,Italian Restaurant,Sporting Goods Shop
11,M3H,-79.442259,1.0,Coffee Shop,Pharmacy,Supermarket,Fast Food Restaurant,Diner,Deli / Bodega,Middle Eastern Restaurant,Frozen Yogurt Shop,Pizza Place,Restaurant
12,M3J,-79.487262,1.0,Coffee Shop,Caribbean Restaurant,Bar,Massage Studio,Dog Run,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice
14,M3L,-79.506944,1.0,Grocery Store,Hotel,Park,Shopping Mall,Bank,Curling Ice,Dim Sum Restaurant,Dessert Shop,Department Store,Deli / Bodega


**Cluster 3 - The park is the most popular venue**

In [133]:
york_merged.loc[york_merged['Cluster Labels'] == 2, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,M2P,-79.400049,2.0,Park,Convenience Store,Bank,Yoga Studio,Electronics Store,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice,Deli / Bodega
22,M4J,-79.338106,2.0,Park,Pizza Place,Convenience Store,Discount Store,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice
33,M9N,-79.518188,2.0,Park,Convenience Store,Yoga Studio,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice,Deli / Bodega


**Cluster 4 - people who don't care for a park live here**

In [134]:
york_merged.loc[york_merged['Cluster Labels'] == 3, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,M2L,-79.374714,3.0,Cafeteria,Yoga Studio,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Deli / Bodega


**Cluster 5 - Baseball is the favorite sport**

In [135]:
york_merged.loc[york_merged['Cluster Labels'] == 4, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,M3M,-79.495697,4.0,Korean Restaurant,Food Truck,Baseball Field,Department Store,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Yoga Studio
32,M9M,-79.532242,4.0,Furniture / Home Store,Baseball Field,Yoga Studio,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice
