# Introduction/Business Problem


<!---Clearly define a problem or an idea of your choice, where you would need to leverage the Foursquare location data to solve or execute. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.--->

> People have different preferences when looking for a place/state/city to live. In this project I investigate which of the seven major rising tech hub cities (Seattle, New York, Colorado, Austin, Los Angeles, Chicago, and Boston) is most similar to San Francisco imagining a persons is deciding which job to accept between all the mentioned cities. This cities were selected based on a recent article ranking the top tech cities in the united states (https://builtin.com/tech-hubs). This project is mostly targeted to STEM professionals looking to move out of San Francisco due to the current housing crisis and rising living expenses but wants to mantain a similar lifestyle/neighborhood.

# Data

> I will be taking the top recommended popular venues in San Francisco along with the venues categories and their rating to try and build a profile for each city. Same thing will be performed for the previously mentioned cities. Once the data is gathered, using the `Foursquare API`, it will be processed to create a purley numerical dataset to pass it as input to a clustering algorithm using Scikit-Learn K-means algorithm. Based on closeness, I expect Los Angeles or Seattle to be the most similar city and San Francisco the least related. Factors that will be taken into account are most common venues, their ratings and their categories. This includes exploring how many parks, bars, coffee places, etc. are in each city. At the end we will look at the results and make some conclusions. 

# Methodology/Table of Contents

- Importing Modules
- Gathering Data
- Processing Data
- Building and Training a Clustering Model
- Results
    - Plot Map to Visualize Results
    - Qualitative Analysis
- Conclusions

# Import Modules

For this project, we use the `geopy` package to collect latitudes and longitudes of specific addresses. The `pandas` package is used to read the `json` responses and process the dataframes. The `Scikit-Learn` package will be used to preprocess the features (normalization, etc) and modeling (clustering algorithms). The `folium` python package is use to create map visualizations.  

In [391]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


# GEOPY: Gathering Latitudes and Longitudes

Let's get the coordinates for all 8 cities mentioned in the introduciton which will serve as the centerpoints to get nearby recommended values.

In [235]:
# Instantiating the geographic locator
geolocator = Nominatim(user_agent="ny_explorer")

In [244]:
# Specifying the address
address = 'Downtown, San Francisco'

# Gathering the location coordinates
location_sf = geolocator.geocode(address)

# Placing coordinates to variables
latitude_sf = location_sf.latitude
longitude_sf = location_sf.longitude
print('The geograpical coordinate of Downtown, San Francisco are {}, {}.'.format(latitude_sf, longitude_sf))

The geograpical coordinate of Downtown, San Francisco are 37.7875138, -122.407159.


Let us do this for the other cities:

In [245]:
address = 'Los Angeles, USA'
location_la = geolocator.geocode(address)
latitude_la = location_la.latitude
longitude_la = location_la.longitude
print('The geograpical coordinate of Downtown, Los Angeles are {}, {}.'.format(latitude_la, longitude_la))

The geograpical coordinate of Downtown, Los Angeles are 34.0536909, -118.2427666.


In [246]:
address = 'Downtown, Boston, USA'
location_bs = geolocator.geocode(address)
latitude_bs = location_bs.latitude
longitude_bs = location_bs.longitude
print('The geograpical coordinate of Downtown, Boston are {}, {}.'.format(latitude_bs, longitude_bs))

The geograpical coordinate of Downtown, Boston are 42.3551473, -71.0599539.


In [247]:
address = 'Downtown, Austin'
location_as = geolocator.geocode(address)
latitude_as = location_as.latitude
longitude_as = location_as.longitude
print('The geograpical coordinate of Downtown, Austin are {}, {}.'.format(latitude_as, longitude_as))

The geograpical coordinate of Downtown, Austin are 30.2680536, -97.7447642.


In [254]:
address = 'Downtown, Chicago, Illinois'
location_ch = geolocator.geocode(address)
latitude_ch = location_ch.latitude
longitude_ch = location_ch.longitude
print('The geograpical coordinate of Downtown, Chicago are {}, {}.'.format(latitude_ch, longitude_ch))

The geograpical coordinate of Downtown, Chicago are 41.8936483, -87.6219597276888.


In [253]:
address = 'Downtown, New York City'
location_ny = geolocator.geocode(address)
latitude_ny = location_ny.latitude
longitude_ny = location_ny.longitude
print('The geograpical coordinate of Downtown, New York are {}, {}.'.format(latitude_ny, longitude_ny))

The geograpical coordinate of Downtown, New York are 40.5997561, -73.9463899.


In [252]:
address = 'Downtown, Denver'
location_co = geolocator.geocode(address)
latitude_co = location_co.latitude
longitude_co = location_co.longitude
print('The geograpical coordinate of Downtown, Denver are {}, {}.'.format(latitude_co, longitude_co))

The geograpical coordinate of Downtown, Colorado are 39.75177015, -105.013872746844.


In [251]:
address = 'Downtown, Seattle'
location_se = geolocator.geocode(address)
latitude_se = location_se.latitude
longitude_se = location_se.longitude
print('The geograpical coordinate of Downtown, Seattle are {}, {}.'.format(latitude_se, longitude_se))

The geograpical coordinate of Downtown, Seattle are 47.6048723, -122.3334582.


Let us write a function that will allow us to collect coordinates for an aribtrary number of cities to automate the process for future projects.

In [411]:
def gathering_coordinates(cities):
    '''
    Allows the user to create a dataframe which contains
    the latitude and longitude of downtown of the cities passed
    '''
    # Initiating empty dataframe
    cities_location = pd.DataFrame(columns=["Cities", "Latitude", "Longitude"])
    
    # Looping through the list of cities
    for i in cities:
        # Specifying the address
        address = "Downtown " + i 

        # Gathering the location coordinates
        location = geolocator.geocode(address)

        # Placing coordinates to variables
        latitude = location.latitude
        longitude = location.longitude

        # Creating a dataframe with calculated values
        processing = pd.DataFrame({"Cities":[i],
                                "Latitude":[latitude],
                                "Longitude":[longitude]})
        
        # Appending Values to cities_location dataframe
        cities_location = cities_location.append(processing)

    return cities_location

In [409]:
cities_list = ["San Francisco, California", "Denver, Colorado", "Seattle, Washington", "Los Angeles, California", "New York City, NY", "Chicago, Illinois", "Austin, Texas", "Boston, Massachusetts"]
cities = gathering_coordinates(cities_list)
cities

Unnamed: 0,Cities,Latitude,Longitude
0,"San Francisco, California",37.787514,-122.407159
0,"Denver, Colorado",39.75177,-105.013873
0,"Seattle, Washington",47.604872,-122.333458
0,"Los Angeles, California",34.498713,-118.584307
0,"New York City, NY",40.599756,-73.94639
0,"Chicago, Illinois",41.893648,-87.62196
0,"Austin, Texas",30.268054,-97.744764
0,"Boston, Massachusetts",42.362918,-71.068737


# Foursquare API

We will now use the foursquare API to get the top 40 venues around each city with a radius of 500 meters using the GET request.

In [413]:
# Defining credentials and version
CLIENT_ID = 'ZOIEHVS2MJYOP1WJMLDACKVZPYROSSM21HQVHCZW5QCRZMWX' # your Foursquare ID
CLIENT_SECRET = 'IHSBAGUIVFEH02ZSH0AZXG2EWCH1OSFN41EFQOBFIKOI1CB3' # your Foursquare Secret
VERSION = '20180604'

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: ZOIEHVS2MJYOP1WJMLDACKVZPYROSSM21HQVHCZW5QCRZMWX
CLIENT_SECRET: IHSBAGUIVFEH02ZSH0AZXG2EWCH1OSFN41EFQOBFIKOI1CB3


When calling through the `Foursquare API` results in a JSON response, which we need to process to convert it into a dataframe. The JSON response contains information of many types, but the main `response` is the one of interest. Let's take a look at one of the items in the response. For each item we get information like `location`, the `venue id`, `venue name`, `city`, etc. The `json_normalize` is a pandas module to convert JSON responses to a dataframe. Let us pass the response to `json_normalize` and see what it looks like.

In [414]:
LIMIT = 40 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius

# We are using the "Get Venue Recommendations" endpoint
url = "https://api.foursquare.com/v2/search/recommendations"

params = dict(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    v='20191129',
    ll=(str(latitude_sf) + "," + str(longitude_sf)),
    limit=LIMIT
)

# This makes the request based on the location provided
resp = requests.get(url=url, params=params)
# This loads the json response into `data`
data = json.loads(resp.text)

In [417]:
# Extract items from the foursequare response
recommended_venues = data['response']['group']['results']

# Creates a dataframe using the json_normalize module.
recommended_venues = json_normalize(recommended_venues) # flatten JSON

# filter only the columns we need
filtered_columns = ['venue.name', 'venue.id', 'venue.location.formattedAddress', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
recommended_venues = recommended_venues.loc[:, filtered_columns]

recommended_venues.head()

Unnamed: 0,venue.name,venue.id,venue.location.formattedAddress,venue.categories,venue.location.lat,venue.location.lng
0,Maison Margiela,551cfcaf498e23f2c0115449,"[134 Maiden Ln, San Francisco, CA 94108, Unite...","[{'id': '4bf58dd8d48988d104951735', 'name': 'B...",37.788261,-122.405765
1,Saint Laurent,528d4fe211d2543b7663f4fd,"[108 Geary St, San Francisco, CA 94108, United...","[{'id': '4bf58dd8d48988d104951735', 'name': 'B...",37.787774,-122.405412
2,Williams-Sonoma,4aa45625f964a5207b4620e3,"[340 Post St (btwn Powell & Stockton), San Fra...","[{'id': '58daa1558bbb0b01f18ec1b4', 'name': 'K...",37.788377,-122.407446
3,Tiffany & Co.,4a791992f964a520efe61fe3,"[350 Post St (btwn Powell & Stockton), San Fra...","[{'id': '4bf58dd8d48988d111951735', 'name': 'J...",37.788598,-122.407708
4,UNIQLO,50043438e4b0f448ea4f447f,"[111 Powell St, San Francisco, CA 94102, Unite...","[{'id': '4bf58dd8d48988d103951735', 'name': 'C...",37.78585,-122.408041


After gathering the recommended values we need to use a spearate endopoint to get the ratings of each venue. This is a premium called and since we are only allow to make 500 calls per day, we only collect 40 venues for each city. Let us get a rating for the first venue as an example. 

In [418]:
venue_id = "551cfcaf498e23f2c0115449"

url = "https://api.foursquare.com/v2/venues/{}".format(venue_id)

params = dict(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    v='20191129')

resp2 = requests.get(url=url, params=params)
data2 = json.loads(resp2.text)

In [423]:
venue_details = json_normalize(data2)
venue_details[["response.venue.name", "response.venue.rating"]]

Unnamed: 0,response.venue.name,response.venue.rating
0,Maison Margiela,9.2


Let us now write a definition that will allow us to get the rating for every venue in the recommended venues dataframe.

In [424]:
def get_rating(row):
    '''
    Allows to get the rating given a dataframe with venue id's
    Mostly applicable in the .apply() method 
    '''
    try:
        venue_id = row['id']
    except:
        venue_id = row['venue.id']
    if len(venue_id) == 0:
        return None
    else:
        url = "https://api.foursquare.com/v2/venues/{}".format(venue_id)

        params = dict(
          client_id=CLIENT_ID,
          client_secret=CLIENT_SECRET,
          v='20191129')

        resp_rat = requests.get(url=url, params=params)
        data_rat = json.loads(resp_rat.text)

        rating = json_normalize(data_rat)

        rating = rating["response.venue.rating"][0]

        return rating

Let us apply the new function to our recommended values dataframe and take a look at the result.

In [118]:
recommended_venues['venue.rating'] = recommended_venues.apply(get_rating, axis=1)

In [121]:
recommended_venues.head()

Unnamed: 0,venue.name,venue.id,venue.location.formattedAddress,venue.categories,venue.location.lat,venue.location.lng,venue.rating
0,Maison Margiela,551cfcaf498e23f2c0115449,"[134 Maiden Ln, San Francisco, CA 94108, Unite...","[{'id': '4bf58dd8d48988d104951735', 'name': 'B...",37.788261,-122.405765,9.2
1,Saint Laurent,528d4fe211d2543b7663f4fd,"[108 Geary St, San Francisco, CA 94108, United...","[{'id': '4bf58dd8d48988d104951735', 'name': 'B...",37.787774,-122.405412,9.2
2,Williams-Sonoma,4aa45625f964a5207b4620e3,"[340 Post St (btwn Powell & Stockton), San Fra...","[{'id': '58daa1558bbb0b01f18ec1b4', 'name': 'K...",37.788377,-122.407446,8.9
3,Tiffany & Co.,4a791992f964a520efe61fe3,"[350 Post St (btwn Powell & Stockton), San Fra...","[{'id': '4bf58dd8d48988d111951735', 'name': 'J...",37.788598,-122.407708,8.9
4,UNIQLO,50043438e4b0f448ea4f447f,"[111 Powell St, San Francisco, CA 94102, Unite...","[{'id': '4bf58dd8d48988d103951735', 'name': 'C...",37.78585,-122.408041,8.8


We can borrow the function from the previous lab that allows us to extracts the venue category from `venue.category` column. 

In [425]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Let us apply this funciton to the entire dataframe `recommended_venues`.

In [123]:
# filter the category for each row
recommended_venues['venue.categories'] = recommended_venues.apply(get_category_type, axis=1)

# clean columns (get rid of "venue")
recommended_venues.columns = [col.split(".")[-1] for col in recommended_venues.columns]

print('{} venues were returned by Foursquare.'.format(recommended_venues.shape[0]))
recommended_venues.head()

10 venues were returned by Foursquare.


Unnamed: 0,name,id,formattedAddress,categories,lat,lng,rating
0,Maison Margiela,551cfcaf498e23f2c0115449,"[134 Maiden Ln, San Francisco, CA 94108, Unite...",Boutique,37.788261,-122.405765,9.2
1,Saint Laurent,528d4fe211d2543b7663f4fd,"[108 Geary St, San Francisco, CA 94108, United...",Boutique,37.787774,-122.405412,9.2
2,Williams-Sonoma,4aa45625f964a5207b4620e3,"[340 Post St (btwn Powell & Stockton), San Fra...",Kitchen Supply Store,37.788377,-122.407446,8.9
3,Tiffany & Co.,4a791992f964a520efe61fe3,"[350 Post St (btwn Powell & Stockton), San Fra...",Jewelry Store,37.788598,-122.407708,8.9
4,UNIQLO,50043438e4b0f448ea4f447f,"[111 Powell St, San Francisco, CA 94102, Unite...",Clothing Store,37.78585,-122.408041,8.8


So far we have gotten venues around San Francisco. Lets use the function provided in the labs to get nearby venues for each city of interest. 

In [431]:
cities_location = pd.DataFrame({"Cities":["San Francisco", "Los Angeles", "Boston", "Austin", "Colorado", "Chicago", "New York", "Seattle"],
                                "Latitude":[latitude_sf, latitude_la, latitude_bs, latitude_as, latitude_co, latitude_ch, latitude_ny, latitude_se],
                                "Longitude":[longitude_sf, longitude_la, longitude_bs, longitude_as, longitude_co, longitude_ch, longitude_ny, longitude_se]})

cities_location

Unnamed: 0,Cities,Latitude,Longitude
0,San Francisco,37.787514,-122.407159
1,Los Angeles,34.053691,-118.242767
2,Boston,42.355147,-71.059954
3,Austin,30.268054,-97.744764
4,Colorado,39.75177,-105.013873
5,Chicago,41.893648,-87.62196
6,New York,40.599756,-73.94639
7,Seattle,47.604872,-122.333458


Borrowing the structure of the funciton used in previous labs, we can modify it to our needs to get nearby venues for each city insetad of neighborhood using the newest endpoint since the previuos one is beign substituded.

In [285]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000, LIMIT=40):
    '''
    Gathers nearby recommended venues
    '''
    # Empty list to append venues to
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print("Gathering venues in ", name)

        # Specifies endpoint
        url = "https://api.foursquare.com/v2/search/recommendations"

        params = dict(
            client_id=CLIENT_ID,
            client_secret=CLIENT_SECRET,
            v='20191129',
            ll=(str(lat) + "," + str(lng)),
            radius=radius,
            limit=LIMIT)

        # Makes request
        results = requests.get(url=url, params=params).json()["response"]['group']['results']
        

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng,            
            v['venue']['name'], 
            v['venue']['id'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'], 
            v['venue']['categories'][0]['name']) for v in results])
        

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                             'City Latitude', 
                             'City Longitude', 
                             'Venue', 
                             'id',
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return(nearby_venues)

We will now get the venues for every city in the top tech city hubs. 

In [286]:
final_df = getNearbyVenues(names=cities_location['Cities'],
                                   latitudes=cities_location['Latitude'],
                                   longitudes=cities_location['Longitude'])

final_df.head()

San Francisco
Los Angeles
Boston
Austin
Colorado
Chicago
New York
Seattle


Unnamed: 0,City,City Latitude,City Longitude,Venue,id,Venue Latitude,Venue Longitude,Venue Category
0,San Francisco,37.787514,-122.407159,Maison Margiela,551cfcaf498e23f2c0115449,37.788261,-122.405765,Boutique
1,San Francisco,37.787514,-122.407159,Saint Laurent,528d4fe211d2543b7663f4fd,37.787774,-122.405412,Boutique
2,San Francisco,37.787514,-122.407159,Williams-Sonoma,4aa45625f964a5207b4620e3,37.788377,-122.407446,Kitchen Supply Store
3,San Francisco,37.787514,-122.407159,Tiffany & Co.,4a791992f964a520efe61fe3,37.788598,-122.407708,Jewelry Store
4,San Francisco,37.787514,-122.407159,The Archive,4b4bd8caf964a5207ba926e3,37.789494,-122.405766,Men's Store


We can again apply the `get_rating()` function to get the ratings for all the neraby venues gathered.

In [287]:
final_df['Venue Rating'] = final_df.apply(get_rating, axis=1)
final_df

Unnamed: 0,City,City Latitude,City Longitude,Venue,id,Venue Latitude,Venue Longitude,Venue Category,Venue Rating
0,San Francisco,37.787514,-122.407159,Maison Margiela,551cfcaf498e23f2c0115449,37.788261,-122.405765,Boutique,9.2
1,San Francisco,37.787514,-122.407159,Saint Laurent,528d4fe211d2543b7663f4fd,37.787774,-122.405412,Boutique,9.2
2,San Francisco,37.787514,-122.407159,Williams-Sonoma,4aa45625f964a5207b4620e3,37.788377,-122.407446,Kitchen Supply Store,8.9
3,San Francisco,37.787514,-122.407159,Tiffany & Co.,4a791992f964a520efe61fe3,37.788598,-122.407708,Jewelry Store,8.9
4,San Francisco,37.787514,-122.407159,The Archive,4b4bd8caf964a5207ba926e3,37.789494,-122.405766,Men's Store,9.3
...,...,...,...,...,...,...,...,...,...
315,Seattle,47.604872,-122.333458,Starbucks Reserve Bar,58ad168cd8e55956ea9db67e,47.607027,-122.338199,Coffee Shop,8.6
316,Seattle,47.604872,-122.333458,Okinawa Teriyaki,457da38cf964a5202c3f1fe3,47.605109,-122.337969,Japanese Restaurant,8.2
317,Seattle,47.604872,-122.333458,The 5th Avenue Theatre,449ae181f964a5209f341fe3,47.608996,-122.334162,Theater,8.8
318,Seattle,47.604872,-122.333458,Starbucks,4ff1b0b1e4b092e4b2df5bc6,47.607012,-122.335716,Coffee Shop,8.1


We now save the dataframe with the venue ratings since we exhaust our allowed premium calls for the day. 

In [288]:
final_df.to_csv("cities_venues.csv", index=False)

In [289]:
safe = final_df.copy()

# Exploring Data and Feature Engineering for K-Means

In [432]:
# Resets process 
final_df = safe.copy()

In [433]:
print('There are {} uniques categories.'.format(len(final_df['Venue Category'].unique())))
print(final_df.shape)
final_df.head()

There are 128 uniques categories.
(320, 9)


Unnamed: 0,City,City Latitude,City Longitude,Venue,id,Venue Latitude,Venue Longitude,Venue Category,Venue Rating
0,San Francisco,37.787514,-122.407159,Maison Margiela,551cfcaf498e23f2c0115449,37.788261,-122.405765,Boutique,9.2
1,San Francisco,37.787514,-122.407159,Saint Laurent,528d4fe211d2543b7663f4fd,37.787774,-122.405412,Boutique,9.2
2,San Francisco,37.787514,-122.407159,Williams-Sonoma,4aa45625f964a5207b4620e3,37.788377,-122.407446,Kitchen Supply Store,8.9
3,San Francisco,37.787514,-122.407159,Tiffany & Co.,4a791992f964a520efe61fe3,37.788598,-122.407708,Jewelry Store,8.9
4,San Francisco,37.787514,-122.407159,The Archive,4b4bd8caf964a5207ba926e3,37.789494,-122.405766,Men's Store,9.3


We comfirm we have 40 data points for each city and feature.

In [434]:
final_df.groupby('City').count().head()

Unnamed: 0_level_0,City Latitude,City Longitude,Venue,id,Venue Latitude,Venue Longitude,Venue Category,Venue Rating
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Austin,40,40,40,40,40,40,40,40
Boston,40,40,40,40,40,40,40,40
Chicago,40,40,40,40,40,40,40,40
Colorado,40,40,40,40,40,40,40,40
Los Angeles,40,40,40,40,40,40,40,40


We need to prepare the dataset for K-means. As we know, k-means does not take non-numerical features as input. For this we one-hot-encode the venue categories. 

In [435]:
# one hot encoding
venue_cat_onehot = pd.get_dummies(final_df[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
venue_cat_onehot['City'] = final_df['City'] 
venue_cat_onehot['Rating'] = final_df['Venue Rating'] 

print(venue_cat_onehot.shape)
venue_cat_onehot.head()

(320, 130)


Unnamed: 0,Accessories Store,American Restaurant,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Bookstore,Boutique,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Café,Cajun / Creole Restaurant,Candy Store,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Concert Hall,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Donut Shop,Electronics Store,Falafel Restaurant,Farmers Market,Filipino Restaurant,Food & Drink Shop,Food Truck,French Restaurant,Furniture / Home Store,Gastropub,Gift Shop,Gourmet Shop,Grocery Store,Gym,Gym / Fitness Center,Hawaiian Restaurant,Health & Beauty Service,Historic Site,Hookah Bar,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Curry Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Kitchen Supply Store,Library,Lingerie Store,Liquor Store,Lounge,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Outdoor Supply Store,Park,Performing Arts Venue,Pharmacy,Pizza Place,Plaza,Pub,Ramen Restaurant,Resort,Restaurant,Russian Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Sculpture Garden,Seafood Restaurant,Shoe Store,Smoke Shop,Snack Place,Social Club,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Tapas Restaurant,Tattoo Parlor,Tea Room,Theater,Theme Park,Theme Park Ride / Attraction,Toy / Game Store,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio,City,Rating
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,San Francisco,9.2
1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,San Francisco,9.2
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,San Francisco,8.9
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,San Francisco,8.9
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,San Francisco,9.3


We take care of repeating datapoints for each city by taking the mean of venues in each city. Similarly, we can take the `max`, `min` or any other parameter since it will have an effect on the clustering algorithm

In [436]:
final_grouped = venue_cat_onehot.groupby('City').mean().reset_index()
final_grouped

Unnamed: 0,City,Accessories Store,American Restaurant,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Bookstore,Boutique,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Café,Cajun / Creole Restaurant,Candy Store,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Concert Hall,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Donut Shop,Electronics Store,Falafel Restaurant,Farmers Market,Filipino Restaurant,Food & Drink Shop,Food Truck,French Restaurant,Furniture / Home Store,Gastropub,Gift Shop,Gourmet Shop,Grocery Store,Gym,Gym / Fitness Center,Hawaiian Restaurant,Health & Beauty Service,Historic Site,Hookah Bar,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Curry Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Kitchen Supply Store,Library,Lingerie Store,Liquor Store,Lounge,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Outdoor Supply Store,Park,Performing Arts Venue,Pharmacy,Pizza Place,Plaza,Pub,Ramen Restaurant,Resort,Restaurant,Russian Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Sculpture Garden,Seafood Restaurant,Shoe Store,Smoke Shop,Snack Place,Social Club,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Tapas Restaurant,Tattoo Parlor,Tea Room,Theater,Theme Park,Theme Park Ride / Attraction,Toy / Game Store,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio,Rating
0,Austin,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.0,0.025,0.0,0.075,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.7175
1,Boston,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.025,0.0,0.075,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.025,0.05,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.075,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.05,0.0,0.05,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,8.61
2,Chicago,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.025,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.05,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,8.8025
3,Colorado,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.025,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.075,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.05,0.0,0.025,0.0,0.0,0.0,0.025,0.075,0.0,0.0,0.0,0.0,0.025,0.0,0.05,8.435
4,Los Angeles,0.0,0.025,0.0,0.025,0.025,0.025,0.0,0.0,0.0,0.025,0.0,0.025,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.05,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.05,0.025,0.025,0.0,0.025,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.075,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,8.79
5,New York,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.075,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.025,0.025,0.025,0.0,0.0,0.025,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.075,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.075,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.075,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,7.6075
6,San Francisco,0.025,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.025,0.025,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.05,0.0,0.025,0.0,0.0,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.075,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.775
7,Seattle,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.05,0.0,0.0,0.0,0.025,0.075,0.1,0.05,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,8.505


Let us gather the top venues in each city.

In [437]:
num_top_venues = 5

for hood in final_grouped['City']:
    print("----" + hood + "----")
    temp = final_grouped[final_grouped['City'] == hood].drop(columns=["Rating"]).T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Austin----
          venue  freq
0         Hotel  0.10
1  Cocktail Bar  0.08
2  Burger Joint  0.05
3    Steakhouse  0.05
4        Lounge  0.05


----Boston----
                     venue  freq
0              Coffee Shop  0.08
1  New American Restaurant  0.08
2     Gym / Fitness Center  0.05
3                Gastropub  0.05
4           Sandwich Place  0.05


----Chicago----
                     venue  freq
0      American Restaurant  0.12
1               Donut Shop  0.08
2  New American Restaurant  0.08
3           Cosmetics Shop  0.05
4            Grocery Store  0.05


----Colorado----
                          venue  freq
0  Theme Park Ride / Attraction  0.08
1                          Park  0.08
2                   Coffee Shop  0.08
3                   Yoga Studio  0.05
4                Ice Cream Shop  0.05


----Los Angeles----
            venue  freq
0           Plaza  0.08
1  Ice Cream Shop  0.05
2         Theater  0.05
3     Coffee Shop  0.05
4       Speakeasy  0.05


----New

Borrowing the function from the previous labs we can also get the most common venues and append them to a dataframe. 

In [438]:
def return_most_common_venues(row, num_top_venues):
    '''Returns most common venues'''
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [439]:
# Sepcifying the number of venues to return
num_top_venues = 10

indicators = ['st', 'nd', 'rd', 'th', 'th', 'th', 'th', 'th', 'th', 'th', 'th']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe with the same neighborhoods as in toronto_grouped
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['City'] = final_grouped['City']

# Fill each row on each column.
for ind in np.arange(final_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(final_grouped.drop(columns=["Rating"]).iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Austin,Hotel,Cocktail Bar,Lounge,Steakhouse,Coffee Shop,Burger Joint,Park,Cajun / Creole Restaurant,Chinese Restaurant,Salad Place
1,Boston,Coffee Shop,New American Restaurant,Gym / Fitness Center,Steakhouse,Sandwich Place,Salad Place,Gastropub,Hotel,Falafel Restaurant,Restaurant
2,Chicago,American Restaurant,New American Restaurant,Donut Shop,Grocery Store,Restaurant,Cosmetics Shop,Yoga Studio,Café,Salon / Barbershop,Resort
3,Colorado,Theme Park Ride / Attraction,Coffee Shop,Park,Yoga Studio,Ice Cream Shop,Café,Sushi Restaurant,Brewery,Pizza Place,Seafood Restaurant
4,Los Angeles,Plaza,Ice Cream Shop,Speakeasy,Coffee Shop,Theater,Jazz Club,Park,School,Candy Store,Historic Site
5,New York,Pizza Place,Sushi Restaurant,Bakery,Italian Restaurant,Food & Drink Shop,Bagel Shop,Deli / Bodega,Farmers Market,Mexican Restaurant,Bubble Tea Shop
6,San Francisco,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck
7,Seattle,Hotel,Coffee Shop,Cocktail Bar,Concert Hall,Café,Donut Shop,Deli / Bodega,Seafood Restaurant,Scenic Lookout,Sandwich Place


We need to drop the `City` feature before processing the dataframe.

In [441]:
final_grouped_clustering = final_grouped.drop('City', 1)
final_grouped_clustering.head()

Unnamed: 0,Accessories Store,American Restaurant,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Bookstore,Boutique,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Café,Cajun / Creole Restaurant,Candy Store,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Concert Hall,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Donut Shop,Electronics Store,Falafel Restaurant,Farmers Market,Filipino Restaurant,Food & Drink Shop,Food Truck,French Restaurant,Furniture / Home Store,Gastropub,Gift Shop,Gourmet Shop,Grocery Store,Gym,Gym / Fitness Center,Hawaiian Restaurant,Health & Beauty Service,Historic Site,Hookah Bar,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Curry Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Kitchen Supply Store,Library,Lingerie Store,Liquor Store,Lounge,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Outdoor Supply Store,Park,Performing Arts Venue,Pharmacy,Pizza Place,Plaza,Pub,Ramen Restaurant,Resort,Restaurant,Russian Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Sculpture Garden,Seafood Restaurant,Shoe Store,Smoke Shop,Snack Place,Social Club,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Tapas Restaurant,Tattoo Parlor,Tea Room,Theater,Theme Park,Theme Park Ride / Attraction,Toy / Game Store,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio,Rating
0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.0,0.025,0.0,0.075,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.7175
1,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.025,0.0,0.075,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.025,0.05,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.075,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.05,0.0,0.05,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,8.61
2,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.025,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.05,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,8.8025
3,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.025,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.075,0.0,0.0,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.05,0.0,0.025,0.0,0.0,0.0,0.025,0.075,0.0,0.0,0.0,0.0,0.025,0.0,0.05,8.435
4,0.0,0.025,0.0,0.025,0.025,0.025,0.0,0.0,0.0,0.025,0.0,0.025,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.05,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.05,0.025,0.025,0.0,0.025,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.075,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,8.79


# Normalizing Features

K-Means algorithms are deeply influenced by the magnitude of the features. In this case, the `Rating` feature is manigutdes higher than any other feature in the dataframe. Because of this, we will normalize the dataframe using the `StandardScaler` and the `MinMaxScaler` modules in `Scikit-Learn`. 

In [443]:
from sklearn import preprocessing

In [444]:
x = final_grouped_clustering #returns a numpy array

# Instantiating the scalers
min_max_scaler = preprocessing.MinMaxScaler()
std_scaler = preprocessing.StandardScaler()

# Fit and transform the dataframe
x_scaled_minmax = min_max_scaler.fit_transform(x)
x_scaled_std = std_scaler.fit_transform(x)

# Convert fitted values into a DataFrame
final_grouped_clustering_std_norm = pd.DataFrame(x_scaled_std, columns=final_grouped_clustering.columns)
final_grouped_clustering_mms_norm = pd.DataFrame(x_scaled_minmax, columns=final_grouped_clustering.columns)

In [446]:
final_grouped_clustering_std_norm.head()

Unnamed: 0,Accessories Store,American Restaurant,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Bookstore,Boutique,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Café,Cajun / Creole Restaurant,Candy Store,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Concert Hall,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Donut Shop,Electronics Store,Falafel Restaurant,Farmers Market,Filipino Restaurant,Food & Drink Shop,Food Truck,French Restaurant,Furniture / Home Store,Gastropub,Gift Shop,Gourmet Shop,Grocery Store,Gym,Gym / Fitness Center,Hawaiian Restaurant,Health & Beauty Service,Historic Site,Hookah Bar,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Curry Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Kitchen Supply Store,Library,Lingerie Store,Liquor Store,Lounge,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Outdoor Supply Store,Park,Performing Arts Venue,Pharmacy,Pizza Place,Plaza,Pub,Ramen Restaurant,Resort,Restaurant,Russian Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Sculpture Garden,Seafood Restaurant,Shoe Store,Smoke Shop,Snack Place,Social Club,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Tapas Restaurant,Tattoo Parlor,Tea Room,Theater,Theme Park,Theme Park Ride / Attraction,Toy / Game Store,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio,Rating
0,-0.377964,-0.377964,-0.377964,-0.57735,1.290994,-0.377964,-0.377964,-0.377964,-0.377964,-0.629941,-0.377964,1.732051,-0.57735,-0.377964,1.290994,-0.377964,-0.377964,-0.377964,-0.538816,-0.57735,2.12132,-0.729325,2.645751,-0.377964,1.290994,-0.707107,1.632993,0.0,-0.538816,-0.377964,-0.538816,-0.707107,-0.377964,-0.377964,-0.377964,-0.377964,-0.688247,-0.377964,-0.377964,1.732051,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.538816,1.732051,-0.57735,0.538816,-0.774597,-0.57735,-0.377964,-0.377964,-0.57735,-0.57735,1.436265,2.645751,-0.729325,-0.377964,-0.377964,0.258199,-0.377964,-1.290994,1.732051,-0.57735,2.645751,-0.377964,-0.377964,-0.377964,-0.57735,2.645751,-0.57735,1.290994,-0.377964,-0.377964,2.645751,-0.774597,-0.57735,-0.107211,2.645751,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,0.13484,1.732051,-0.377964,-0.629941,-0.562544,-0.377964,-0.377964,-0.377964,0.707107,-0.377964,0.898027,-0.774597,-0.707107,-0.377964,-0.377964,-0.377964,1.0,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.57735,-0.377964,-0.57735,-0.377964,1.507557,-0.57735,0.0,2.645751,-0.377964,-0.377964,-0.377964,0.538816,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.57735,-0.377964,-0.538816,0.503871
1,-0.377964,-0.377964,-0.377964,-0.57735,-0.774597,-0.377964,2.645751,-0.377964,-0.377964,0.377964,-0.377964,-0.57735,1.732051,-0.377964,-0.774597,-0.377964,-0.377964,-0.377964,-0.538816,-0.57735,0.707107,-0.729325,-0.377964,-0.377964,1.290994,0.707107,-0.816497,0.816497,-0.538816,-0.377964,-0.538816,0.707107,-0.377964,-0.377964,-0.377964,-0.377964,-0.688247,-0.377964,2.645751,-0.57735,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,2.334869,-0.57735,-0.57735,-0.898027,1.290994,1.732051,-0.377964,-0.377964,1.732051,-0.57735,-0.377964,-0.377964,-0.729325,-0.377964,-0.377964,-0.774597,-0.377964,-1.290994,-0.57735,-0.57735,-0.377964,-0.377964,2.645751,-0.377964,1.732051,-0.377964,1.732051,1.290994,-0.377964,-0.377964,-0.377964,-0.774597,-0.57735,1.608169,-0.377964,-0.377964,-0.377964,2.645751,-0.377964,-0.377964,0.13484,-0.57735,-0.377964,-0.629941,-0.562544,-0.377964,-0.377964,-0.377964,0.707107,-0.377964,2.334869,-0.774597,2.12132,-0.377964,-0.377964,2.645751,-1.0,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.57735,-0.377964,-0.57735,-0.377964,1.507557,-0.57735,0.0,-0.377964,-0.377964,-0.377964,2.645751,-0.898027,-0.377964,-0.377964,-0.377964,-0.377964,2.645751,-0.377964,-0.57735,2.645751,-0.538816,0.214503
2,-0.377964,2.645751,-0.377964,-0.57735,-0.774597,-0.377964,-0.377964,-0.377964,-0.377964,-0.629941,-0.377964,-0.57735,-0.57735,-0.377964,-0.774597,-0.377964,-0.377964,-0.377964,-0.538816,-0.57735,0.707107,0.437595,-0.377964,-0.377964,-0.774597,-0.707107,-0.816497,-0.816497,-0.538816,-0.377964,2.334869,-0.707107,-0.377964,-0.377964,-0.377964,-0.377964,2.064742,-0.377964,-0.377964,-0.57735,-0.377964,-0.377964,-0.377964,-0.377964,2.645751,0.898027,-0.57735,1.732051,1.975658,-0.774597,-0.57735,-0.377964,-0.377964,-0.57735,-0.57735,-0.377964,-0.377964,-0.729325,-0.377964,-0.377964,0.258199,-0.377964,0.774597,-0.57735,1.732051,-0.377964,-0.377964,-0.377964,2.645751,-0.57735,-0.377964,-0.57735,1.290994,-0.377964,-0.377964,-0.377964,1.290994,-0.57735,1.608169,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.94388,-0.57735,-0.377964,0.377964,-0.562544,-0.377964,-0.377964,2.645751,2.12132,-0.377964,-0.538816,1.290994,-0.707107,-0.377964,-0.377964,-0.377964,1.0,-0.377964,-0.377964,2.645751,-0.377964,-0.377964,1.732051,-0.377964,1.732051,-0.377964,0.301511,-0.57735,-1.0,-0.377964,-0.377964,-0.377964,-0.377964,-0.898027,-0.377964,-0.377964,2.645751,-0.377964,-0.377964,-0.377964,-0.57735,-0.377964,0.898027,0.732673
3,-0.377964,-0.377964,2.645751,-0.57735,-0.774597,-0.377964,-0.377964,-0.377964,-0.377964,-0.629941,-0.377964,-0.57735,-0.57735,-0.377964,-0.774597,2.645751,-0.377964,2.645751,-0.538816,-0.57735,-0.707107,1.604515,-0.377964,-0.377964,-0.774597,-0.707107,0.0,0.816497,-0.538816,-0.377964,-0.538816,-0.707107,-0.377964,-0.377964,-0.377964,-0.377964,-0.688247,-0.377964,-0.377964,-0.57735,-0.377964,-0.377964,-0.377964,2.645751,-0.377964,-0.538816,-0.57735,1.732051,-0.898027,1.290994,-0.57735,2.645751,-0.377964,-0.57735,1.732051,-0.982708,-0.377964,1.604515,-0.377964,-0.377964,-0.774597,-0.377964,0.774597,-0.57735,-0.57735,-0.377964,-0.377964,-0.377964,-0.377964,-0.57735,-0.377964,-0.57735,-0.774597,-0.377964,-0.377964,-0.377964,1.290994,-0.57735,-0.107211,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,2.29228,-0.57735,-0.377964,0.377964,-0.562544,2.645751,-0.377964,-0.377964,-0.707107,-0.377964,-0.538816,1.290994,-0.707107,-0.377964,-0.377964,-0.377964,1.0,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.57735,-0.377964,1.732051,2.645751,-0.904534,-0.57735,1.0,-0.377964,2.645751,-0.377964,-0.377964,-0.898027,2.645751,2.645751,-0.377964,-0.377964,-0.377964,-0.377964,1.732051,-0.377964,2.334869,-0.256562
4,-0.377964,-0.377964,-0.377964,1.732051,1.290994,2.645751,-0.377964,-0.377964,-0.377964,0.377964,-0.377964,1.732051,1.732051,-0.377964,1.290994,-0.377964,-0.377964,-0.377964,-0.538816,1.732051,-0.707107,-0.729325,-0.377964,2.645751,-0.774597,-0.707107,-0.816497,0.0,0.898027,-0.377964,-0.538816,-0.707107,-0.377964,-0.377964,-0.377964,-0.377964,-0.688247,-0.377964,-0.377964,-0.57735,2.645751,-0.377964,-0.377964,-0.377964,-0.377964,-0.538816,-0.57735,-0.57735,-0.898027,-0.774597,-0.57735,-0.377964,-0.377964,1.732051,-0.57735,-0.982708,-0.377964,1.604515,2.645751,2.645751,-0.774597,2.645751,0.774597,1.732051,-0.57735,-0.377964,-0.377964,-0.377964,-0.377964,-0.57735,-0.377964,1.732051,-0.774597,-0.377964,-0.377964,-0.377964,1.290994,1.732051,-0.964901,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,-0.377964,0.13484,1.732051,-0.377964,-0.629941,2.137667,-0.377964,2.645751,-0.377964,-0.707107,-0.377964,-0.538816,-0.774597,-0.707107,-0.377964,2.645751,-0.377964,-1.0,-0.377964,2.645751,-0.377964,-0.377964,-0.377964,-0.57735,2.645751,-0.57735,-0.377964,-0.904534,1.732051,0.0,-0.377964,-0.377964,-0.377964,-0.377964,1.975658,-0.377964,-0.377964,-0.377964,2.645751,-0.377964,-0.377964,-0.57735,-0.377964,-0.538816,0.699026


In [448]:
final_grouped_clustering_mms_norm.head()

Unnamed: 0,Accessories Store,American Restaurant,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Bookstore,Boutique,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Café,Cajun / Creole Restaurant,Candy Store,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Concert Hall,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Donut Shop,Electronics Store,Falafel Restaurant,Farmers Market,Filipino Restaurant,Food & Drink Shop,Food Truck,French Restaurant,Furniture / Home Store,Gastropub,Gift Shop,Gourmet Shop,Grocery Store,Gym,Gym / Fitness Center,Hawaiian Restaurant,Health & Beauty Service,Historic Site,Hookah Bar,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indonesian Restaurant,Italian Restaurant,Japanese Curry Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Kitchen Supply Store,Library,Lingerie Store,Liquor Store,Lounge,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Outdoor Supply Store,Park,Performing Arts Venue,Pharmacy,Pizza Place,Plaza,Pub,Ramen Restaurant,Resort,Restaurant,Russian Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Sculpture Garden,Seafood Restaurant,Shoe Store,Smoke Shop,Snack Place,Social Club,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Tapas Restaurant,Tattoo Parlor,Tea Room,Theater,Theme Park,Theme Park Ride / Attraction,Toy / Game Store,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio,Rating
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.333333,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.333333,1.0,0.0,0.0,0.0,0.0,0.0,0.333333,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.333333,1.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.92887
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.333333,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,1.0,0.5,0.0,0.75,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.333333,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.838912
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.5,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.333333,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.333333,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.333333,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.666667,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.692469
4,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.333333,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.333333,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.98954


# Fitting Clustering Algorithm

We use the KMeans algorithm in the `Scikit-learn` package with 4 clusters.

In [478]:
# set number of clusters
kclusters = 3

# run k-means clustering
kmeans_std = KMeans(n_clusters=kclusters, n_init=20, max_iter=500, random_state=100).fit(final_grouped_clustering_std_norm)
kmeans_min = KMeans(n_clusters=kclusters, n_init=20, max_iter=500, random_state=100).fit(final_grouped_clustering_mms_norm)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 1, 3, 2, 0])

In [479]:
# Creating a copy of the city venues to attach clustering 
# labels for both the minmax and the standard scalers.
neighborhoods_venues_sorted_mms = neighborhoods_venues_sorted.copy()
neighborhoods_venues_sorted_std = neighborhoods_venues_sorted.copy()

# add clustering labels
neighborhoods_venues_sorted_std.insert(0, 'Cluster Labels', kmeans_std.labels_)
neighborhoods_venues_sorted_mms.insert(0, 'Cluster Labels', kmeans_min.labels_)

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
final_merged_std = final_df.join(neighborhoods_venues_sorted_std.set_index('City'), on='City')
final_merged_mms = final_df.join(neighborhoods_venues_sorted_mms.set_index('City'), on='City')

# final_merged_std.head() 

In [480]:
final_merged_mms.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue,id,Venue Latitude,Venue Longitude,Venue Category,Venue Rating,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,San Francisco,37.787514,-122.407159,Maison Margiela,551cfcaf498e23f2c0115449,37.788261,-122.405765,Boutique,9.2,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck
1,San Francisco,37.787514,-122.407159,Saint Laurent,528d4fe211d2543b7663f4fd,37.787774,-122.405412,Boutique,9.2,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck
2,San Francisco,37.787514,-122.407159,Williams-Sonoma,4aa45625f964a5207b4620e3,37.788377,-122.407446,Kitchen Supply Store,8.9,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck
3,San Francisco,37.787514,-122.407159,Tiffany & Co.,4a791992f964a520efe61fe3,37.788598,-122.407708,Jewelry Store,8.9,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck
4,San Francisco,37.787514,-122.407159,The Archive,4b4bd8caf964a5207ba926e3,37.789494,-122.405766,Men's Store,9.3,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck


Let us get the unique labels.

In [481]:
final_merged_std["Cluster Labels"].unique()

array([0, 1, 2], dtype=int64)

In [482]:
final_merged_mms["Cluster Labels"].unique()

array([2, 0, 1], dtype=int64)

# Visualizing Results

In [483]:
# # create map
# latitude = 41.850033
# longitude = -87.6500523
# map_clusters = folium.Map(location=[latitude, longitude], zoom_start=5)

# # set color scheme for the clusters
# x = np.arange(kclusters)
# ys = [i + x + (i*x)**2 for i in range(kclusters)]
# colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
# rainbow = [colors.rgb2hex(i) for i in colors_array]

# # add markers to the map
# markers_colors = []
# for lat, lon, poi, cluster in zip(final_merged_std['Venue Latitude'], final_merged_std['Venue Longitude'], final_merged_std['City'], final_merged_std['Cluster Labels']):
#     label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
#     folium.CircleMarker(
#         [lat, lon],
#         radius=5,
#         popup=label,
#         color=rainbow[int(cluster)-1],
#         fill=True,
#         fill_color=rainbow[int(cluster)-1],
#         fill_opacity=0.7).add_to(map_clusters)
       
# map_clusters

In [484]:
# create map
latitude = 41.850033
longitude = -87.6500523
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(final_merged_mms['Venue Latitude'], final_merged_mms['Venue Longitude'], final_merged_mms['City'], final_merged_mms['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [490]:
final_merged_mms.loc[final_merged_mms['Cluster Labels'] == 0, final_merged_mms.columns[[1] + list(range(5, final_merged_mms.shape[1]))]].head()

Unnamed: 0,City Latitude,Venue Latitude,Venue Longitude,Venue Category,Venue Rating,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
40,34.053691,34.055034,-118.245179,Park,9.1,0,Plaza,Ice Cream Shop,Speakeasy,Coffee Shop,Theater,Jazz Club,Park,School,Candy Store,Historic Site
41,34.053691,34.051342,-118.244571,Indian Restaurant,8.6,0,Plaza,Ice Cream Shop,Speakeasy,Coffee Shop,Theater,Jazz Club,Park,School,Candy Store,Historic Site
42,34.053691,34.050666,-118.244068,American Restaurant,8.9,0,Plaza,Ice Cream Shop,Speakeasy,Coffee Shop,Theater,Jazz Club,Park,School,Candy Store,Historic Site
43,34.053691,34.050145,-118.242246,Bookstore,8.9,0,Plaza,Ice Cream Shop,Speakeasy,Coffee Shop,Theater,Jazz Club,Park,School,Candy Store,Historic Site
44,34.053691,34.054445,-118.244471,Arts & Crafts Store,8.0,0,Plaza,Ice Cream Shop,Speakeasy,Coffee Shop,Theater,Jazz Club,Park,School,Candy Store,Historic Site


In [491]:
final_merged_mms.loc[final_merged_mms['Cluster Labels'] == 1, final_merged_mms.columns[[1] + list(range(5, final_merged_mms.shape[1]))]].head()

Unnamed: 0,City Latitude,Venue Latitude,Venue Longitude,Venue Category,Venue Rating,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
160,39.75177,39.749562,-105.013887,Theme Park Ride / Attraction,8.4,1,Theme Park Ride / Attraction,Coffee Shop,Park,Yoga Studio,Ice Cream Shop,Café,Sushi Restaurant,Brewery,Pizza Place,Seafood Restaurant
161,39.75177,39.748707,-105.017061,Museum,8.8,1,Theme Park Ride / Attraction,Coffee Shop,Park,Yoga Studio,Ice Cream Shop,Café,Sushi Restaurant,Brewery,Pizza Place,Seafood Restaurant
162,39.75177,39.755622,-105.009853,Sporting Goods Shop,9.3,1,Theme Park Ride / Attraction,Coffee Shop,Park,Yoga Studio,Ice Cream Shop,Café,Sushi Restaurant,Brewery,Pizza Place,Seafood Restaurant
163,39.75177,39.751776,-105.013673,Aquarium,7.6,1,Theme Park Ride / Attraction,Coffee Shop,Park,Yoga Studio,Ice Cream Shop,Café,Sushi Restaurant,Brewery,Pizza Place,Seafood Restaurant
164,39.75177,39.754487,-105.008569,Park,8.6,1,Theme Park Ride / Attraction,Coffee Shop,Park,Yoga Studio,Ice Cream Shop,Café,Sushi Restaurant,Brewery,Pizza Place,Seafood Restaurant


In [492]:
final_merged_mms.loc[final_merged_mms['Cluster Labels'] == 2, final_merged_mms.columns[[1] + list(range(5, final_merged_mms.shape[1]))]].head()

Unnamed: 0,City Latitude,Venue Latitude,Venue Longitude,Venue Category,Venue Rating,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,37.787514,37.788261,-122.405765,Boutique,9.2,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck
1,37.787514,37.787774,-122.405412,Boutique,9.2,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck
2,37.787514,37.788377,-122.407446,Kitchen Supply Store,8.9,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck
3,37.787514,37.788598,-122.407708,Jewelry Store,8.9,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck
4,37.787514,37.789494,-122.405766,Men's Store,9.3,2,Boutique,Hotel,Men's Store,Bubble Tea Shop,Clothing Store,Plaza,Gym / Fitness Center,Shoe Store,Music Venue,Food Truck


# Conclusions

Our hypothesis based on distance was wrong (expectedly). We observe that `Boston` is the most similar city to `San Francisco` based on the information gathered. This is pureley the result of the clustering algorithm. It is important to know the limitations of the current results. This result is base only on 40 recommended venues for each city (limited by the number of premium calls that allowed us to get the venue `Ratings`). This is not representative of an entire city but for the purposes of this project it will sufice. 

From the qualitative analysis we can see that both, Boston and San Francisco, have a high venue density of Boutiques, Stores and Hotels followed by Tea Places and Gyms. Parks are not even in the top 10 most common venues in contrast to the cluster #1 (Denver, Chicago, and New York) which has `Parks` as the 3rd most common venue.