## Cultural Scene Comparison

From a short list of preferred locations, in which city should a small tour-guide company that focuses on museums and cultural arts centers open a new office?

### Table of Contents

1. Get neighborhoods for selected cities, explore data, and format
1. Visualize cities and their neighborhoods
1. Access FourSquare's API and find neighborhoods with museums, historical sites, and cultural arts centers
1. Visualize museums, historical sites, and cultural arts centers on a map
1. Cluster and analyze neighborhoods that have museums, historical sites, and cultural arts centers, then map the clusters and recommend which city our client should select for expansion

---


- Import libraries

In [1076]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', None)

import requests
import api_keys

import json
import re

from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

---

## (1) Get neighborhoods for selected cities, explore data, and format

---

[A] Function to retrieve coordinates from given address

In [167]:
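def get_coordinates_to_df(lookup_series, city):
    '''
    Geocode "<city> <lookup>" for each entry in lookup_series.
    Returns (lat_list, lon_list), with np.nan where no address is found.
    '''
    lat_list = []
    lon_list = []
    
    #create the geocoder once instead of on every iteration
    geolocator = Nominatim(user_agent="mapper")
    
    for lookup in lookup_series:
        
        rough_address = city+' '+lookup
        location = geolocator.geocode(rough_address)
        
        #geocode() returns None when no match is found
        if location is None:
            print('NO ADDRESS RETURNED:', rough_address)
            lat_list.append(np.nan)
            lon_list.append(np.nan)
        else:
            lat_list.append(location.latitude)
            lon_list.append(location.longitude)
    
    print('Returned tuple of latitude and longitude lists in {}'.format(city))
    
    return lat_list, lon_list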
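
Because `get_coordinates_to_df` calls the Nominatim service, it cannot be tried offline. The sketch below shows the same lookup-or-NaN pattern with the geocoding call passed in as an argument, so it can be exercised with a stand-in. `collect_coordinates` and `fake_geocode` are hypothetical names for illustration, not part of geopy or of this notebook.

```python
import math
from types import SimpleNamespace

def collect_coordinates(lookups, city, geocode):
    """Return (lat_list, lon_list); NaN marks lookups with no match."""
    lat_list, lon_list = [], []
    for lookup in lookups:
        location = geocode(city + ' ' + lookup)
        if location is None:
            lat_list.append(math.nan)
            lon_list.append(math.nan)
        else:
            lat_list.append(location.latitude)
            lon_list.append(location.longitude)
    return lat_list, lon_list

# Stand-in geocoder (hypothetical): knows one address, returns None otherwise
def fake_geocode(address):
    known = {'London Aldgate': SimpleNamespace(latitude=51.514248, longitude=-0.075719)}
    return known.get(address)

lats, lons = collect_coordinates(['Aldgate', 'Somerstown'], 'London', fake_geocode)
print(lats, lons)
```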

---

[B] Construct dataframes of city neighborhoods and their coordinates

- **New York**; sourced from Coursera

In [148]:
with open('newyork_data.json', 'r') as file:
    new_york_data = json.load(file)

In [236]:
#normalize json into df & drop_duplicates in 'properties.name'
new_york = pd.json_normalize(new_york_data['features']).drop_duplicates(subset='properties.name')

#add lat/lon columns
new_york['Longitude'] = [pair[0] for pair in new_york['geometry.coordinates']]
new_york['Latitude'] = [pair[1] for pair in new_york['geometry.coordinates']]

#keep only names/boroughs/coordinates & rename columns
new_york = new_york[['properties.name', 'properties.borough', 'Latitude', 'Longitude']].reset_index(drop=True)
new_york.columns = ['Neighborhood', 'Borough', 'Latitude', 'Longitude']

In [522]:
new_york.head()

| | Neighborhood | Borough | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Wakefield | Bronx | 40.894705 | -73.847201 |
| 1 | Co-op City | Bronx | 40.874294 | -73.829939 |
| 2 | Eastchester | Bronx | 40.887556 | -73.827806 |
| 3 | Fieldston | Bronx | 40.895437 | -73.905643 |
| 4 | Riverdale | Bronx | 40.890834 | -73.912585 |


- **Paris, France**; sourced from Wikipedia

In [439]:
paris = pd.read_html('https://en.wikipedia.org/wiki/Arrondissements_of_Paris')[2]

In [468]:
#for elements in 'Name' with comma separation, take the last element
paris['Name'] = paris['Name'].apply(lambda x: x.split(',')[-1])

In [469]:
#call function get_coordinates_to_df()
paris_coordinates = get_coordinates_to_df(paris['Name'], 'Paris')

Returned tuple of latitude and longitude lists in Paris


In [470]:
#add coordinates to df
paris['Latitude'] = paris_coordinates[0]
paris['Longitude'] = paris_coordinates[1]

In [472]:
#renaming columns
paris.rename(columns={'Name': 'Neighborhood', 
                      'Arrondissement (R for Right Bank, L for Left Bank)': 'Borough'}, inplace=True)

#adjusting/renaming Arrondissements as 'Borough'
arrondissements = [n.split('th')[0]+'th Arrondissement' for n in paris['Borough']]
arrondissements[0] = '1st-4th Arrondissements'
paris['Borough'] = arrondissements

In [473]:
paris.head()

| | Borough | Neighborhood | Area (km2) | Population(2017 estimate) | Density (2017)(inhabitants per km2) | Peak of population | Mayor | 2020-2026 | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1st-4th Arrondissements | Hôtel-de-Ville | 5.59 km2 (2.16 sq mi) | 100196 | 17924 | before 1861 | Ariel Weil (PS) | | 48.856426 | 2.352528 |
| 1 | 5th Arrondissement | Panthéon | 2.541 km2 (0.981 sq mi) | 59631 | 23477 | 1911 | Florence Berthout (DVD) | | 48.846191 | 2.346079 |
| 2 | 6th Arrondissement | Luxembourg | 2.154 km2 (0.832 sq mi) | 41976 | 19524 | 1911 | Jean-Pierre Lecoq (LR) | | 48.850433 | 2.332951 |
| 3 | 7th Arrondissement | Palais-Bourbon | 4.088 km2 (1.578 sq mi) | 52193 | 12761 | 1926 | Rachida Dati (LR) | | 48.861596 | 2.317909 |
| 4 | 8th Arrondissement | Élysée | 3.881 km2 (1.498 sq mi) | 37368 | 9631 | 1891 | Jeanne d'Hauteserre (LR) | | 48.846644 | 2.36983 |


In [475]:
paris = paris[['Neighborhood', 'Borough', 'Latitude', 'Longitude']].reset_index(drop=True)

In [484]:
paris.head()

| | Neighborhood | Borough | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Hôtel-de-Ville | 1st-4th Arrondissements | 48.856426 | 2.352528 |
| 1 | Panthéon | 5th Arrondissement | 48.846191 | 2.346079 |
| 2 | Luxembourg | 6th Arrondissement | 48.850433 | 2.332951 |
| 3 | Palais-Bourbon | 7th Arrondissement | 48.861596 | 2.317909 |
| 4 | Élysée | 8th Arrondissement | 48.846644 | 2.36983 |


- **London, UK**; sourced from Wikipedia

In [505]:
london = pd.read_html('https://en.wikipedia.org/wiki/List_of_areas_of_London')[1]

In [507]:
#rename 'Postcode district' to fix format
london.rename(columns={
    'Postcode\xa0district': 'Postcode district', 
    'London\xa0borough': 'London borough'
}, inplace=True)

#take 'Post town' LONDON only & drop_duplicates under 'Location'
london = london[london['Post town'] == 'LONDON'].drop_duplicates(subset='Location')

#for elements in 'Location' with ' (also' in the name, split & take the first element
london['Location'] = london['Location'].apply(lambda x: x.split(' (also')[0])

In [511]:
#call function get_coordinates_to_df()
london_coordinates = get_coordinates_to_df(london['Location'], 'London')

NO ADDRESS RETURNED: London Somerstown
Returned tuple of latitude and longitude lists in London


In [512]:
#add coordinates to df
london['Latitude'] = london_coordinates[0]
london['Longitude'] = london_coordinates[1]
london.dropna(subset=['Latitude', 'Longitude'], axis=0, inplace=True)

In [513]:
#get relevant columns & rename/format
london = london[['Location', 'London borough', 'Latitude', 'Longitude']].reset_index(drop=True)
london.columns = ['Neighborhood', 'Borough', 'Latitude', 'Longitude']

In [514]:
#drop footnote numbers from 'Borough'
repl = re.compile(r"\[\d*]")
london['Borough'] = [repl.sub('', name) for name in london['Borough']]
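
As a quick illustration of what the pattern `\[\d*]` removes, here is a small sketch on made-up strings. The footnote numbers below are hypothetical and not taken from the Wikipedia table.

```python
import re

# Same pattern as above: a literal '[', any run of digits, a literal ']'
footnote = re.compile(r"\[\d*]")

samples = ['Bexley, Greenwich[1]', 'Westminster', 'Bromley[12]']  # made-up inputs
cleaned = [footnote.sub('', name) for name in samples]
print(cleaned)  # ['Bexley, Greenwich', 'Westminster', 'Bromley']
```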

In [515]:
london.head()

| | Neighborhood | Borough | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | 51.487621 | 0.11405 |
| 1 | Acton | Ealing, Hammersmith and Fulham | 51.50814 | -0.273261 |
| 2 | Aldgate | City | 51.514248 | -0.075719 |
| 3 | Aldwych | Westminster | 51.513131 | -0.117593 |
| 4 | Anerley | Bromley | 51.407599 | -0.061939 |


- **Toronto, Canada**; sourced from Wikipedia

In [369]:
toronto = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

In [370]:
#ignore rows marked 'Not assigned'; copy so the column assignment below does not act on a slice
toronto = toronto[toronto['Borough'] != 'Not assigned'].copy()

#for elements in 'Neighbourhood' with ', ' separating >1 name, split & take the last element
toronto['Neighbourhood'] = toronto['Neighbourhood'].apply(lambda x: x.split(', ')[-1])

In [379]:
#call function get_coordinates_to_df()
toronto_coordinates = get_coordinates_to_df(toronto['Neighbourhood'], 'Toronto')

Returned tuple of latitude and longitude lists in Toronto


In [380]:
#add coordinates to df
toronto['Latitude'] = toronto_coordinates[0]
toronto['Longitude'] = toronto_coordinates[1]

In [381]:
#drop any nans & rename col
toronto.dropna(subset=['Latitude', 'Longitude'], axis=0, inplace=True)
toronto.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True)

In [407]:
#get relevant data
toronto = toronto[['Neighborhood', 'Borough', 'Latitude', 'Longitude']]
toronto.head()

| | Neighborhood | Borough | Latitude | Longitude |
|---|---|---|---|---|
| 2 | Parkwoods | North York | 43.761124 | -79.324059 |
| 3 | Victoria Village | North York | 43.732658 | -79.311189 |
| 4 | Harbourfront | Downtown Toronto | 43.64008 | -79.38015 |
| 5 | Lawrence Heights | North York | 43.722778 | -79.450933 |
| 8 | Humber Valley Village | Etobicoke | 43.666472 | -79.524314 |


---

## (2) Visualize cities and neighborhoods

---

[A] Get coordinates of preferred city centers

In [414]:
preferred_cities = [
    'New York, NY',
    'Paris, France',
    'London, UK',
    'Toronto, Canada'
]

In [423]:
center_coordinates = {}

#create the geocoder once instead of on every iteration
geolocator = Nominatim(user_agent="mapper")

for city in preferred_cities:
    
    location = geolocator.geocode(city)
    
    #geocode() returns None when no match is found
    if location is None:
        print('NO ADDRESS RETURNED:', city)
        center_coordinates[city] = np.nan, np.nan
    else:
        center_coordinates[city] = location.latitude, location.longitude

In [424]:
center_coordinates

{'New York, NY': (40.7127281, -74.0060152),
 'Paris, France': (48.8566969, 2.3514616),
 'London, UK': (51.5073219, -0.1276474),
 'Toronto, Canada': (43.6534817, -79.3839347)}

---

[B] Map each city with surrounding neighborhoods labeled

In [640]:
def generate_neighborhood_markers(city, df):
    
    m = folium.Map(location=center_coordinates[city], zoom_start=11, width='70%', height='70%')
    
    # add markers to map
    for lat, lng, borough, n_hood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
        label = '{}, {}'.format(n_hood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(m)
    
    return m

- **New York**

In [641]:
generate_neighborhood_markers('New York, NY', new_york)

- **Paris**

In [642]:
generate_neighborhood_markers('Paris, France', paris)

- **London**

In [643]:
generate_neighborhood_markers('London, UK', london)

- **Toronto**

In [1077]:
generate_neighborhood_markers('Toronto, Canada', toronto)

---

## (3) Access FourSquare's API and find neighborhoods with museums, historical sites, and cultural arts centers

[A] Search for all venues within a half-mile radius of a given neighborhood's coordinates

In [526]:
#set FourSquare credentials
CLIENT_ID = api_keys.CLIENT_ID
CLIENT_SECRET = api_keys.CLIENT_SECRET
VERSION = '20180605'
LIMIT = 100
radius = 800
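
The `radius` value is given in meters, so 800 approximates the half mile mentioned in the heading. The conversion, as a minimal check:

```python
METERS_PER_MILE = 1609.344  # international mile, by definition

half_mile_m = 0.5 * METERS_PER_MILE
print(round(half_mile_m, 1))  # 804.7
```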

In [527]:
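#function takes in lists of names/coordinates and returns pd.DataFrame of FourSquare query
#NOTE: radius defaults to 500 m; pass radius=radius to search the 800 m (roughly half-mile) radius set above
def getNearbyVenues(names, latitudes, longitudes, radius=500):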
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude', 
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    
    return nearby_venues
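
The list comprehension in `getNearbyVenues` assumes every venue carries at least one category, so `categories[0]` would raise an `IndexError` on a venue without one. The sketch below flattens a hand-built payload that mimics only the nesting the function indexes (`response` → `groups[0]` → `items` → `venue`). `extract_rows` and the second venue are hypothetical, and the payload is a mock, not a real API response.

```python
def extract_rows(name, lat, lng, payload):
    """Flatten one explore-style payload into row tuples."""
    items = payload["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to None when a venue carries no category
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Mock payload: first venue appears in the Paris output, second is made up
mock_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "La Générale",
               "location": {"lat": 48.859492, "lng": 2.379357},
               "categories": [{"name": "Performing Arts Venue"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 48.858, "lng": 2.379},
               "categories": []}},
]}]}}

rows = extract_rows('Popincourt', 48.858416, 2.379703, mock_payload)
print(rows)
```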

In [645]:
cities = ['London', 'Toronto', 'Paris', 'New York'] 
city_dfs = [london, toronto, paris, new_york]

In [633]:
# four_sqr_queries = {city: getNearbyVenues(city_df['Neighborhood'], 
#                                           city_df['Latitude'], 
#                                           city_df['Longitude']) for city, city_df in zip(cities, city_dfs)}

###########

In [635]:
#TEMP SAVE
# four_sqr_queries['New York'].to_pickle('four_sq_query_NY_temp.pkl')
# four_sqr_queries['Paris'].to_pickle('four_sq_query_Paris_temp.pkl')
# four_sqr_queries['Toronto'].to_pickle('four_sq_query_Toronto_temp.pkl')
# four_sqr_queries['London'].to_pickle('four_sq_query_London_temp.pkl')

#OPEN & PUT IN DICT
four_sqr_queries={}
four_sqr_queries['New York'] = pd.read_pickle('four_sq_query_NY_temp.pkl')
four_sqr_queries['Paris'] = pd.read_pickle('four_sq_query_Paris_temp.pkl')
four_sqr_queries['Toronto'] = pd.read_pickle('four_sq_query_Toronto_temp.pkl')
four_sqr_queries['London'] = pd.read_pickle('four_sq_query_London_temp.pkl')

dict_keys(['New York', 'Paris', 'Toronto', 'London'])

###########

In [636]:
#check shape of each df
shapes = ['{}: {}'.format(city, four_sqr_queries[city].shape) for city in four_sqr_queries]
shapes

['New York: (10405, 7)',
 'Paris: (1143, 7)',
 'Toronto: (2886, 7)',
 'London: (10106, 7)']

---

[B] Get only neighborhoods that have a venue category with museums/cultural arts centers in it. The targeted categories from FourSquare's API are defined in key_words.

In [637]:
def filter_targeted_venues(city):
    
    #access downloaded foursq query
    df = four_sqr_queries[city]
    
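    #get list of indexes to filter df; a category that matches more than one key word
    #(e.g. 'Art Museum' matches both 'Museum' and 'Art Museum') appears once per match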
    key_words = ['Historic', 'Museum', 'Art Gallery', 'Art Museum', 'Performing Arts', 'Public Art']
    lists_of_key_indexes = [list(df[df['Venue Category'].str.contains(word)].index) for word in key_words]
    combined = [x for i_list in lists_of_key_indexes for x in i_list]
    
    return df.loc[combined]
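
Since the key words overlap, a row can be selected more than once by the per-key-word index lists. The sketch below contrasts that with a single boolean mask built from the same key words, which selects each matching row once. The toy `venues` frame is made up for illustration; the counts shown in this notebook come from the original function.

```python
import pandas as pd

key_words = ['Historic', 'Museum', 'Art Gallery', 'Art Museum', 'Performing Arts', 'Public Art']

# Toy data (hypothetical venues)
venues = pd.DataFrame({
    'Venue': ['A', 'B', 'C', 'D'],
    'Venue Category': ['Art Museum', 'History Museum', 'Pizza Place', 'Art Gallery'],
})

# Per-key-word index lists, concatenated as in filter_targeted_venues
per_word = [list(venues[venues['Venue Category'].str.contains(w)].index) for w in key_words]
combined = [i for idx in per_word for i in idx]
print(combined)  # row 0 ('Art Museum') appears twice

# Single mask: one regex alternation over all key words
mask = venues['Venue Category'].str.contains('|'.join(key_words))
deduped = venues[mask]
print(deduped)
```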

In [638]:
def compare_totals():
    
    all_counts = []
    for city in four_sqr_queries:
        
        venue_counts = filter_targeted_venues(city)['Venue Category'].value_counts().to_frame()
        venue_counts.columns = [city]
        all_counts.append(venue_counts)
        
    return pd.concat(all_counts, axis=1).replace(np.nan, 0).astype(int)

In [835]:
comps = compare_totals()
comps

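| | New York | Paris | Toronto | London |
|---|---|---|---|---|
| Art Gallery | 56 | 7 | 15 | 61 |
| Performing Arts Venue | 25 | 1 | 2 | 22 |
| Art Museum | 12 | 16 | 2 | 32 |
| Historic Site | 11 | 2 | 1 | 19 |
| History Museum | 11 | 2 | 2 | 24 |
| Museum | 11 | 4 | 5 | 26 |
| Public Art | 3 | 1 | 0 | 1 |
| Science Museum | 0 | 1 | 0 | 7 |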


*Get city proper area and scale number of venues*

In [780]:
city_areas_raw = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_cities', index_col=0)

In [845]:
city_areas = city_areas_raw[1][:]
city_areas.columns = city_areas_raw[1].columns.get_level_values(1)
areas = city_areas[city_areas.columns[7]]
areas.columns = ['Area, proper (km2)', 'Area, metro (km2)', 'Area, urban (km2)']

city_indexes = []
for city in four_sqr_queries:
    idx = areas[areas.index.str.contains(city)].index[0]
    city_indexes.append(idx)

areas_targeted = areas.loc[city_indexes][['Area, proper (km2)']]
areas_targeted = areas_targeted['Area, proper (km2)'].apply(lambda x: x.split('[')[0]).to_frame()
areas_targeted = areas_targeted.astype(float).T

areas_targeted.columns = comps.columns
areas_targeted

| | New York | Paris | Toronto | London |
|---|---|---|---|---|
| Area, proper (km2) | 786.0 | 105.0 | 630.0 | 1572.0 |


In [878]:
comps_scaled = comps.div(areas_targeted.iloc[0])
print('Venues per Square Kilometer')
comps_scaled.round(3)

Venues per Square Kilometer


| | New York | Paris | Toronto | London |
|---|---|---|---|---|
| Art Gallery | 0.071 | 0.067 | 0.024 | 0.039 |
| Performing Arts Venue | 0.032 | 0.01 | 0.003 | 0.014 |
| Art Museum | 0.015 | 0.152 | 0.003 | 0.02 |
| Historic Site | 0.014 | 0.019 | 0.002 | 0.012 |
| History Museum | 0.014 | 0.019 | 0.003 | 0.015 |
| Museum | 0.014 | 0.038 | 0.008 | 0.017 |
| Public Art | 0.004 | 0.01 | 0.0 | 0.001 |
| Science Museum | 0.0 | 0.01 | 0.0 | 0.004 |
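
The scaling step relies on `DataFrame.div` aligning a Series with the frame's column labels. A minimal sketch using the Art Gallery counts and the city-proper areas from the tables above (the variable names here are illustrative):

```python
import pandas as pd

# One row of counts and the city-proper areas, copied from the tables above
counts = pd.DataFrame(
    {'New York': [56], 'Paris': [7], 'Toronto': [15], 'London': [61]},
    index=['Art Gallery'],
)
areas_km2 = pd.Series({'New York': 786.0, 'Paris': 105.0, 'Toronto': 630.0, 'London': 1572.0})

# div() matches the Series index against the DataFrame columns by label
density = counts.div(areas_km2)
print(density.round(3))
```

Because the match is by label, the result does not depend on the order of the entries in `areas_km2`.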


---

## (4) Visualize museums, historical sites, and cultural arts centers on a map

[A] Map each targeted venue

In [894]:
def generate_venue_markers(city, df):
    
    m = folium.Map(location=center_coordinates[city], zoom_start=11, width='70%', height='70%')
    
    # add markers to map
    for lat, lng, venue, n_hood in zip(df['Venue Latitude'], df['Venue Longitude'], df['Venue'], df['Neighborhood']):
        label = '{}, {}'.format(venue, n_hood)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=4,
            popup=label,
            color='green',
            fill=True,
            fill_color='green',
            fill_opacity=0.7,
            parse_html=False).add_to(m)
    
    return m

- **New York**

In [895]:
generate_venue_markers('New York, NY', filter_targeted_venues('New York'))

- **Paris**

In [897]:
generate_venue_markers('Paris, France', filter_targeted_venues('Paris'))

- **London**

In [898]:
generate_venue_markers('London, UK', filter_targeted_venues('London'))

- **Toronto**

In [899]:
generate_venue_markers('Toronto, Canada', filter_targeted_venues('Toronto'))

---

## (5) Cluster and analyze neighborhoods that have museums, historical sites, and cultural arts, then map out the clusters and recommend which city our client should select for expansion

[A] The first function returns a df of all venues located in neighborhoods that have museums/arts categories within them

In [927]:
def all_venues_in_relevant_neighborhoods(city):
    
    #get targeted neighborhood names
    relevant_hoods = filter_targeted_venues(city)['Neighborhood']
    
    #get all venues in df
    df = four_sqr_queries[city]
    venues_by_hood = [df[df['Neighborhood'] == n_hood] for n_hood in set(relevant_hoods)]
    
    return pd.concat(venues_by_hood)

In [928]:
# surrounding_hoods = {city: all_venues_in_relevant_neighborhoods(city) for city in four_sqr_queries}
# surrounding_hoods['Paris']

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 567 | Popincourt | 48.858416 | 2.379703 | Chez Aline | 48.857042 | 2.378640 | Sandwich Place |
| 568 | Popincourt | 48.858416 | 2.379703 | Les Artistes Gourmands | 48.856972 | 2.378375 | Italian Restaurant |
| 569 | Popincourt | 48.858416 | 2.379703 | Maison Nouilles | 48.858342 | 2.382723 | Asian Restaurant |
| 570 | Popincourt | 48.858416 | 2.379703 | Monsieur Antoine | 48.860365 | 2.378295 | Cocktail Bar |
| 571 | Popincourt | 48.858416 | 2.379703 | La Générale | 48.859492 | 2.379357 | Performing Arts Venue |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 735 | Gobelins | 48.832397 | 2.355583 | Gaumont Les Fauvettes | 48.833521 | 2.353680 | Multiplex |
| 736 | Gobelins | 48.832397 | 2.355583 | Oops! Hostel | 48.834101 | 2.353695 | Hostel |
| 737 | Gobelins | 48.832397 | 2.355583 | Plug In Café - Le Pub de la Butte | 48.828885 | 2.351380 | Bar |
| 738 | Gobelins | 48.832397 | 2.355583 | Le Celtique | 48.830225 | 2.352885 | Bar |


In [946]:
#run one-hot encoding and clustering on those neighborhoods
#Analyzing neighborhoods based on Venue Category

def run_oneHot_encoding(all_venues):
    '''
    One-hot encode 'Venue Category' and return, per neighborhood,
    the mean frequency of each category.
    '''
    onehot_df = pd.get_dummies(all_venues[['Venue Category']], prefix="", prefix_sep="")
    
    # add back neighborhood col & move to first place
    onehot_df['Neighborhood'] = all_venues['Neighborhood']
    fix_cols = ['Neighborhood'] + list(onehot_df.drop('Neighborhood', axis=1).columns)
    onehot_df = onehot_df[fix_cols]
    
    #group by neighborhood and take the mean frequency of each category
    grouped = onehot_df.groupby('Neighborhood').mean().reset_index()
    
    return grouped
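
To make the encoding step concrete, here is a small sketch on made-up data: after one-hot encoding, the per-neighborhood mean of each indicator column is the share of that neighborhood's venues in the category. The toy frame and the `dtype=int` argument are choices made for this illustration.

```python
import pandas as pd

# Toy data (hypothetical neighborhoods and venues)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Hotel', 'Museum', 'Bar', 'Museum'],
})

# One indicator column per category; dtype=int keeps the columns numeric
onehot = pd.get_dummies(toy[['Venue Category']], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# Mean of 0/1 indicators = category frequency within each neighborhood
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of `freq` sums to 1, which is what lets neighborhoods with different venue counts be compared in the clustering step.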

In [951]:
def return_most_common_venues(row, num_top_venues):
    '''
    this is called on each row in each city and
    is nested in the next function
    '''
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#------------------------------------------------------------------------
def top_venues_by_category(grouped):
    
    #create/sort columns based on number of top venues

    num_top_venues = 10
    indicators = ['st', 'nd', 'rd']
    columns = ['Neighborhood']

    for n in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(n+1, indicators[n]))
        except IndexError:
            #only the first three ranks have a special suffix
            columns.append('{}th Most Common Venue'.format(n+1))

    # create a new dataframe
    sorted_venues = pd.DataFrame(columns=columns)
    sorted_venues['Neighborhood'] = grouped['Neighborhood']

    for n in np.arange(grouped.shape[0]):
        sorted_venues.iloc[n, 1:] = return_most_common_venues(grouped.iloc[n, :], num_top_venues)

    return sorted_venues
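
The suffix lookup above works for ranks 1 through 10. If `num_top_venues` were ever raised past 20, ranks such as 21 would need "st" again. A small helper along these lines would cover the general case; `ordinal` is a hypothetical name, not something the notebook defines.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 11 <= n % 100 <= 13:
        suffix = 'th'  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(n)) for n in range(1, 11)]
print(columns)
```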

---

*Call above functions and create three dictionaries to be used in k-means clustering algorithm*

In [986]:
all_surrounding_hoods = {}
oneHot_groups = {}
venues_sorted = {}

for city in four_sqr_queries:
    
    #all venues located in targeted neighborhoods
    surrounding_hoods = all_venues_in_relevant_neighborhoods(city)
    all_surrounding_hoods[city] = surrounding_hoods
    
    #run onehot encoding & groupby Neighborhoods
    onehot_grouped = run_oneHot_encoding(surrounding_hoods)
    oneHot_groups[city] = onehot_grouped
    
    #create/sort columns based on number of top venues
    top_venues = top_venues_by_category(onehot_grouped)
    venues_sorted[city] = top_venues

---

[C] Run k-means clustering on split of five clusters

In [1087]:
kclusters = 5
def build_k_means(city):
    '''
    Cluster a city's neighborhoods on venue-category frequencies and
    return one row per neighborhood with its cluster label and top venues.
    '''
    grouped_clustering = oneHot_groups[city].drop('Neighborhood', axis=1)
    
    #run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)
    
    #add 1-based 'Cluster Labels' to a copy of venues_sorted
    venues_sorted_copy = venues_sorted[city].copy()
    venues_sorted_copy.insert(0, 'Cluster Labels', [n+1 for n in kmeans.labels_])
    
    #average the numeric columns by neighborhood, then merge with the sorted venues
    grouped_venues = all_surrounding_hoods[city].groupby('Neighborhood').mean(numeric_only=True).reset_index()
    grouped_venues.drop(['Venue Latitude', 'Venue Longitude'], axis=1, inplace=True)
    grouped_venues = grouped_venues.merge(venues_sorted_copy, on='Neighborhood')

    return grouped_venues
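
For intuition about what the clustering step does with the frequency vectors, below is a bare-bones NumPy sketch of the Lloyd iteration that k-means is based on. It is not how scikit-learn's `KMeans` is implemented (which adds smarter initialization and restarts); `lloyd_kmeans` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, n_iter=20):
    """Plain Lloyd iteration: assign to nearest center, then recompute centers."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # squared distances, shape (n_samples, n_clusters)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members):  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return labels, centers

# Toy category-frequency rows (made up): two bar-heavy and two museum-heavy neighborhoods
X = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.0],
              [0.1, 0.1, 0.8],
              [0.0, 0.2, 0.8]])

labels, centers = lloyd_kmeans(X, init_centers=X[[0, 2]])
print(labels)
print(centers)
```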

In [1161]:
build_k_means('New York').head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allerton | 40.865788 | -73.859319 | 3 | Pizza Place | Chinese Restaurant | Spa | Playground | Deli / Bodega | Supermarket | Pharmacy | Outdoors & Recreation | Grocery Store | Check Cashing Service |
| 1 | Bellerose | 40.728573 | -73.720128 | 5 | Deli / Bodega | Italian Restaurant | Gas Station | Cosmetics Shop | Seafood Restaurant | Mobile Phone Shop | Salon / Barbershop | Motel | Bank | Historic Site |
| 2 | Belmont | 40.857277 | -73.888452 | 5 | Italian Restaurant | Pizza Place | Deli / Bodega | Bakery | Dessert Shop | Bank | Donut Shop | Bar | Fish Market | Spanish Restaurant |
| 3 | Boerum Hill | 40.685683 | -73.983748 | 1 | Dance Studio | Coffee Shop | Bar | Sandwich Place | Spa | Arts & Crafts Store | Bakery | Furniture / Home Store | French Restaurant | Cocktail Bar |
| 4 | Bronxdale | 40.852723 | -73.861726 | 3 | Italian Restaurant | Breakfast Spot | Paper / Office Supplies Store | Gym | Coffee Shop | Performing Arts Venue | Bank | Chinese Restaurant | Convenience Store | Eastern European Restaurant |


---

[D] Visualize clusters on maps and summarize venue categories of neighborhoods with museums/arts centers

In [1116]:
def map_clusters(city, df):
    
    m = folium.Map(location=center_coordinates[city], zoom_start=11, width='70%', height='70%')
    
    # set color scheme for the clusters
    x = np.arange(kclusters)
    
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.gist_heat(np.linspace(0, 1, len(ys)))
    spectrum = [colors.rgb2hex(i) for i in colors_array]
    
    for lat, lng, cluster, n_hood, most, most2, most3 in zip(df['Neighborhood Latitude'],
                                                             df['Neighborhood Longitude'], 
                                                             df['Cluster Labels'], 
                                                             df['Neighborhood'],
                                                             df['1st Most Common Venue'],
                                                             df['2nd Most Common Venue'],
                                                             df['3rd Most Common Venue']):
        label = '{} \nCluster: {} \n\nTop-3 Venues: \n1. {}, \n2. {}, \n3. {}'.format(n_hood, cluster, most, most2, most3)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=4,
            popup=label,
            color=spectrum[cluster-1],
            fill=True,
            fill_color=spectrum[cluster-1],
            fill_opacity=0.7,
            parse_html=False).add_to(m)

    return m

In [1133]:
def summary_venues_in_area(city):
    
    #run k-means
    df = build_k_means(city)
    
    #for each cluster, get value_counts() of top-3 categories
    cluster_counts = []
    for n in range(1, kclusters+1):
        df_1_counts = df[df['Cluster Labels'] == n]['1st Most Common Venue'].value_counts().to_frame()
        df_2_counts = df[df['Cluster Labels'] == n]['2nd Most Common Venue'].value_counts().to_frame()
        df_3_counts = df[df['Cluster Labels'] == n]['3rd Most Common Venue'].value_counts().to_frame()
        
        new_df = pd.concat([df_1_counts, df_2_counts, df_3_counts], axis=1, join='outer')
        new_df.columns = pd.MultiIndex.from_tuples([('Cluster: {}'.format(n), name) for name in new_df.columns])
        
        cluster_counts.append(new_df)
    
    
    return pd.concat(cluster_counts, axis=1, join='outer').replace(np.nan, '-')

---

- **New York**

In [1131]:
map_clusters('New York, NY', build_k_means('New York'))

In [1123]:
print('Number of Neighborhoods per Cluster')
build_k_means('New York').groupby('Cluster Labels').count()[['Neighborhood']]

Number of Neighborhoods per Cluster


| Cluster Labels | Neighborhood |
|---|---|
| 1 | 45 |
| 2 | 1 |
| 3 | 25 |
| 4 | 1 |
| 5 | 4 |


In [1150]:
summary_venues_in_area('New York')[['Cluster: 1', 'Cluster: 3']].replace('-', np.nan).dropna(thresh=1, axis=0).replace(np.nan, '-')

| | Cluster: 1, 1st Most Common Venue | Cluster: 1, 2nd Most Common Venue | Cluster: 1, 3rd Most Common Venue | Cluster: 3, 1st Most Common Venue | Cluster: 3, 2nd Most Common Venue | Cluster: 3, 3rd Most Common Venue |
|---|---|---|---|---|---|---|
| Italian Restaurant | 9 | 4 | 3 | 1 | - | - |
| Coffee Shop | 8 | 4 | 7 | - | 1 | - |
| Bar | 7 | 1 | 3 | - | - | - |
| Park | 2 | 2 | - | - | 1 | - |
| Pizza Place | 2 | 3 | 1 | 9 | 1 | 2 |
| Chinese Restaurant | 2 | - | 1 | 2 | 2 | - |
| Hotel | 2 | 2 | 1 | - | - | - |
| Seafood Restaurant | 1 | 1 | 1 | - | - | - |
| Hotpot Restaurant | 1 | - | - | - | - | - |
| Clothing Store | 1 | 1 | 1 | - | 2 | - |


- **Paris**

In [1151]:
map_clusters('Paris, France', build_k_means('Paris'))

In [1152]:
print('Number of Neighborhoods per Cluster')
build_k_means('Paris').groupby('Cluster Labels').count()[['Neighborhood']]

Number of Neighborhoods per Cluster


| Cluster Labels | Neighborhood |
|---|---|
| 1 | 3 |
| 2 | 3 |
| 3 | 3 |
| 4 | 1 |
| 5 | 3 |


In [1154]:
summary_venues_in_area('Paris').drop('Cluster: 4', axis=1).replace('-', np.nan).dropna(thresh=1, axis=0).replace(np.nan, '-')

| | Cluster: 1, 1st Most Common Venue | Cluster: 1, 2nd Most Common Venue | Cluster: 1, 3rd Most Common Venue | Cluster: 2, 1st Most Common Venue | Cluster: 2, 2nd Most Common Venue | Cluster: 2, 3rd Most Common Venue | Cluster: 3, 1st Most Common Venue | Cluster: 3, 2nd Most Common Venue | Cluster: 3, 3rd Most Common Venue | Cluster: 5, 1st Most Common Venue | Cluster: 5, 2nd Most Common Venue | Cluster: 5, 3rd Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hotel | 2 | 1 | - | - | 2 | 1 | - | - | - | - | - | 1 |
| French Restaurant | 1 | 1 | - | 3 | - | - | 3 | - | - | 3 | - | - |
| Bar | - | 1 | - | - | 1 | 1 | - | - | - | - | - | - |
| Italian Restaurant | - | - | 1 | - | - | - | - | 1 | - | - | - | - |
| Japanese Restaurant | - | - | 1 | - | - | - | - | - | - | - | - | - |
| Thai Restaurant | - | - | 1 | - | - | - | - | - | - | - | - | - |
| Coffee Shop | - | - | - | - | - | 1 | - | - | - | - | - | - |
| Bistro | - | - | - | - | - | - | - | 1 | - | - | - | - |
| Ice Cream Shop | - | - | - | - | - | - | - | 1 | - | - | - | - |
| Café | - | - | - | - | - | - | - | - | 1 | - | - | - |


- **London**

In [1155]:
map_clusters('London, UK', build_k_means('London'))

In [1158]:
print('Number of Neighborhoods per Cluster')
build_k_means('London').groupby('Cluster Labels').count()[['Neighborhood']]

Number of Neighborhoods per Cluster


| Cluster Labels | Neighborhood |
|---|---|
| 1 | 12 |
| 2 | 28 |
| 3 | 1 |
| 4 | 43 |
| 5 | 2 |


In [1159]:
summary_venues_in_area('London')[['Cluster: 2', 'Cluster: 4']].replace('-', np.nan).dropna(thresh=1, axis=0).replace(np.nan, '-')

| | Cluster: 2, 1st Most Common Venue | Cluster: 2, 2nd Most Common Venue | Cluster: 2, 3rd Most Common Venue | Cluster: 4, 1st Most Common Venue | Cluster: 4, 2nd Most Common Venue | Cluster: 4, 3rd Most Common Venue |
|---|---|---|---|---|---|---|
| Café | 1 | 4 | 2 | 10 | 10 | 6 |
| Gym / Fitness Center | 1 | - | - | - | - | - |
| Italian Restaurant | 1 | - | 1 | - | 2 | - |
| Hotel | 8 | 4 | 6 | - | - | - |
| Bakery | - | 2 | 1 | - | - | 3 |
| Coffee Shop | 11 | 5 | 2 | 4 | 4 | 7 |
| Art Gallery | - | - | - | - | - | 1 |
| Clothing Store | - | 1 | 1 | - | 2 | 1 |
| Boutique | 1 | - | - | - | - | - |
| Pub | 1 | 7 | 3 | 16 | 12 | 4 |


- **Toronto**

In [1156]:
map_clusters('Toronto, Canada', build_k_means('Toronto'))

In [1157]:
print('Number of Neighborhoods per Cluster')
build_k_means('Toronto').groupby('Cluster Labels').count()[['Neighborhood']]

Number of Neighborhoods per Cluster


| Cluster Labels | Neighborhood |
|---|---|
| 1 | 1 |
| 2 | 4 |
| 3 | 5 |
| 4 | 1 |
| 5 | 1 |


In [1160]:
summary_venues_in_area('Toronto')[['Cluster: 2', 'Cluster: 3']].replace('-', np.nan).dropna(thresh=1, axis=0).replace(np.nan, '-')

| | Cluster: 2, 1st Most Common Venue | Cluster: 2, 2nd Most Common Venue | Cluster: 2, 3rd Most Common Venue | Cluster: 3, 1st Most Common Venue | Cluster: 3, 2nd Most Common Venue | Cluster: 3, 3rd Most Common Venue |
|---|---|---|---|---|---|---|
| Coffee Shop | 2 | - | - | 5 | - | - |
| Boutique | 1 | - | - | - | - | - |
| Italian Restaurant | 1 | - | - | - | - | - |
| Café | - | 4 | - | - | 2 | 1 |
| Restaurant | - | - | 1 | - | 1 | 1 |
| Clothing Store | - | - | 1 | - | 1 | - |
| Spa | - | - | 1 | - | - | - |
| Chinese Restaurant | - | - | 1 | - | - | - |
| Hotel | - | - | - | - | 1 | 3 |
