# Segmenting and Clustering the city of Chicago

First import the necessary packages.

In [1]:
import pandas as pd
!conda install lxml --yes
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
import requests
import numpy as np
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes
import folium


Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\luis_\anaconda3

  added / updated specs:
    - lxml


The following packages will be SUPERSEDED by a higher-priority channel:

  conda              conda-forge::conda-4.8.3-py37hc8dfbb8~ --> pkgs/main::conda-4.8.3-py37_0


Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\luis_\anaconda3

  added / updated specs:
    - geopy


The following packages will be UPDATED:

  conda                       pkgs/main::conda-4.8.3-py37_0 --> conda-forge::conda-4.8.3-py37hc8dfbb8_1


Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...worki

# Data Collection

I used the data from:

<https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Neighborhoods/bbvz-uum9:> and uploaded to my github account in raw.
This data was used to get a list of each neighborhood in the city of Chicago with the geopraphic boundaries.

In [2]:
# We need the url of the data.
file_url = ('https://raw.githubusercontent.com/luisrrc/Coursera_Capstone/master/Neighborhoods_Chicago.csv')

# Now, load the data with pandas library but we only need three columns of the dataframe.
df = pd.read_csv(file_url, usecols = ['PRI_NEIGH', 'Latitude', 'Longitude'], engine = 'python')

# Change the name of the dataframe to "neighborhoods" and rename the column.
neighborhoods = df.rename(columns = {'PRI_NEIGH' : 'Neighborhoods'}) 


neighborhoods.head(10)

Unnamed: 0,Neighborhoods,Latitude,Longitude
0,Grand Boulevard,41.821923,-87.606708
1,Printers Row,41.874371,-87.627607
2,United Center,41.888852,-87.667069
3,Sheffield & DePaul,41.921661,-87.658335
4,Humboldt Park,41.887823,-87.740596
5,Garfield Park,41.888185,-87.6954
6,North Lawndale,41.869869,-87.720239
7,Little Village,41.834801,-87.687399
8,Armour Square,41.847127,-87.629201
9,Avalon Park,41.751502,-87.585655


In [3]:
# We use the info function of the pandas library to obtain the info of the dataframe.
neighborhoods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Neighborhoods  82 non-null     object 
 1   Latitude       82 non-null     float64
 2   Longitude      82 non-null     float64
dtypes: float64(2), object(1)
memory usage: 2.0+ KB


Add geo coordinates to the city of Chicago using geopy library.



In [4]:
address = 'Chicago, Il'

geolocator = Nominatim(user_agent="chicago_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Chicago are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Chicago are 41.8755616, -87.6244212.


Now, we need to visualize the neighborhoods of the city of Chicago using Folium library.

In [5]:
# create map of Chicago using latitude and longitude values
map_chicago = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhoods']):
    label = '{}'.format(neighborhoods)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_chicago)  
    
map_chicago

Call Foursquare API to explore venues in each neighborhood of the city.

In [6]:
CLIENT_ID = 'NGDD0J4IGZZJ2L0AGS2H20P0G3SNO0AOET5FYUFDTLMFTSZ4' # your Foursquare ID
CLIENT_SECRET = 'QFCAAG2YEQTMQI15VXMHHTOWDSJY2O505HPS3MGRJWTRFAVB' # your Foursquare Secret
VERSION = '20200803' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: NGDD0J4IGZZJ2L0AGS2H20P0G3SNO0AOET5FYUFDTLMFTSZ4
CLIENT_SECRET:QFCAAG2YEQTMQI15VXMHHTOWDSJY2O505HPS3MGRJWTRFAVB


The next step is try with the first neighborhood on the list to explore venues with Foursquare API for testing.

In [7]:
# Select the first neighborhood of the Dataframe.
neighborhoods.loc[0, 'Neighborhoods']

'Grand Boulevard'

In [8]:
neighborhood_latitude = neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods.loc[0, 'Neighborhoods'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Grand Boulevard are 41.82192275, -87.60670813.


Now, we set the search parameters to explore venues in Foursquare API. 
A limit of 100 venues and a radius of 1600 meters.

In [9]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 1600 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=NGDD0J4IGZZJ2L0AGS2H20P0G3SNO0AOET5FYUFDTLMFTSZ4&client_secret=QFCAAG2YEQTMQI15VXMHHTOWDSJY2O505HPS3MGRJWTRFAVB&v=20200801&ll=41.82192275,-87.60670813&radius=1600&limit=100'

In [13]:
# Using requests to get the list of venues in the neighborhood from the url.
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f281d7d780c3370710f91e2'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'},
    {'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Grand Boulevard',
  'headerFullLocation': 'Grand Boulevard, Chicago',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 88,
  'suggestedBounds': {'ne': {'lat': 41.836322764400016,
    'lng': -87.58742106000318},
   'sw': {'lat': 41.80752273559998, 'lng': -87.62599519999682}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4de5c4ac18389f05586a53a6',
       'name': 'The Lakefront',
       'location': {'address': '1146-1198 E 43rd St',
        'lat': 41.82184492892107,
        'lng': -87.5988948117684,
        'labeled

In [12]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [14]:
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  """


Unnamed: 0,name,categories,lat,lng
0,The Lakefront,Beach,41.821845,-87.598895
1,Some Like It Black Creative Arts Bar,Juice Bar,41.818432,-87.605251
2,Ain't She Sweet Cafe,Coffee Shop,41.816817,-87.613004
3,Norman's Bistro,Restaurant,41.816795,-87.601809
4,Mariano's Fresh Market,Grocery Store,41.824594,-87.616067


In [15]:
# print the number of venues found by Foursquare
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

88 venues were returned by Foursquare.


#### Let's create a function to repeat the same process to all the neighborhoods in Chicago

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=1600):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [19]:
chicago_venues = getNearbyVenues(names=neighborhoods['Neighborhoods'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

Grand Boulevard
Printers Row
United Center
Sheffield & DePaul
Humboldt Park
Garfield Park
North Lawndale
Little Village
Armour Square
Avalon Park
Burnside
Hermosa
Avondale
Logan Square
Calumet Heights
New City
Englewood
Grand Crossing
Ashburn
Mount Greenwood
Morgan Park
Jackson Park
Loop
Pullman
Riverdale
Hegewisch
Greektown
Douglas
Edgewater
Magnificent Mile
Lincoln Square
Oakland
Grant Park
West Loop
Fuller Park
Andersonville
Woodlawn
Portage Park
Rush & Division
Little Italy, UIC
Kenwood
Albany Park
Irving Park
West Ridge
Streeterville
Chatham
Roseland
North Center
South Deering
Washington Park
Millenium Park
Near South Side
Chinatown
Chicago Lawn
Auburn Gresham
Beverly
Washington Heights
Edison Park
Hyde Park
Bucktown
Lower West Side
Wrigleyville
Archer Heights
Brighton Park
Mckinley Park
East Village
West Town
Bridgeport
West Elsdon
Gage Park
Clearing
West Lawn
Wicker Park
Ukrainian Village
Galewood
Montclare
Old Town
Belmont Cragin
Austin
Gold Coast
Boystown
River North


 Let's check the size of the resulting dataframe.

In [20]:
print(chicago_venues.shape)
chicago_venues.head()

(6420, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Grand Boulevard,41.821923,-87.606708,The Lakefront,41.821845,-87.598895,Beach
1,Grand Boulevard,41.821923,-87.606708,Some Like It Black Creative Arts Bar,41.818432,-87.605251,Juice Bar
2,Grand Boulevard,41.821923,-87.606708,Ain't She Sweet Cafe,41.816817,-87.613004,Coffee Shop
3,Grand Boulevard,41.821923,-87.606708,Norman's Bistro,41.816795,-87.601809,Restaurant
4,Grand Boulevard,41.821923,-87.606708,Mariano's Fresh Market,41.824594,-87.616067,Grocery Store


Let's check how many venues were returned for each neighborhood.

In [21]:
chicago_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Albany Park,100,100,100,100,100,100
Andersonville,100,100,100,100,100,100
Archer Heights,28,28,28,28,28,28
Armour Square,100,100,100,100,100,100
Ashburn,42,42,42,42,42,42
...,...,...,...,...,...,...
West Ridge,100,100,100,100,100,100
West Town,100,100,100,100,100,100
Wicker Park,100,100,100,100,100,100
Woodlawn,66,66,66,66,66,66


 Let's find out how many unique categories can be curated from all the returned venues.

In [22]:
print('There are {} uniques categories.'.format(len(chicago_venues['Venue Category'].unique())))

There are 346 uniques categories.


##  Analyze Each Neighborhood

In [23]:
# one hot encoding
chicago_onehot = pd.get_dummies(chicago_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
chicago_onehot['Neighborhood'] = chicago_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [chicago_onehot.columns[-1]] + list(chicago_onehot.columns[:-1])
chicago_onehot = chicago_onehot[fixed_columns]

chicago_onehot.head()

Unnamed: 0,Zoo Exhibit,ATM,Accessories Store,African Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,American Restaurant,Amphitheater,...,Warehouse Store,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [24]:
chicago_onehot.shape

(6420, 346)

Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category.

In [25]:
chicago_grouped = chicago_onehot.groupby('Neighborhood').mean().reset_index()
chicago_grouped

Unnamed: 0,Neighborhood,Zoo Exhibit,ATM,Accessories Store,African Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,American Restaurant,...,Warehouse Store,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Albany Park,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.00,0.0,0.00,0.00,0.00,0.02,0.0
1,Andersonville,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.01,0.0,0.01,0.00,0.00,0.00,0.0
2,Archer Heights,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.00,0.0,0.00,0.00,0.00,0.00,0.0
3,Armour Square,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.00,0.0,0.00,0.00,0.00,0.00,0.0
4,Ashburn,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.00,0.0,0.00,0.00,0.00,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,West Ridge,0.0,0.0,0.00,0.02,0.0,0.0,0.0,0.0,0.03,...,0.0,0.0,0.0,0.00,0.0,0.02,0.01,0.00,0.00,0.0
78,West Town,0.0,0.0,0.01,0.00,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.01,0.0,0.01,0.00,0.01,0.01,0.0
79,Wicker Park,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.01,0.0,0.01,0.00,0.01,0.02,0.0
80,Woodlawn,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.00,0.0,0.00,0.00,0.00,0.00,0.0


Let's confirm the new size

In [26]:
 chicago_grouped.shape

(82, 346)

In [27]:
num_top_venues = 5

for hood in chicago_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = chicago_grouped[chicago_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Albany Park----
               venue  freq
0        Coffee Shop  0.05
1  Korean Restaurant  0.05
2      Grocery Store  0.04
3                Bar  0.04
4               Park  0.04


----Andersonville----
                   venue  freq
0  Vietnamese Restaurant  0.07
1            Coffee Shop  0.06
2         Breakfast Spot  0.05
3          Grocery Store  0.04
4     Mexican Restaurant  0.03


----Archer Heights----
                 venue  freq
0   Mexican Restaurant  0.14
1           Taco Place  0.11
2     Department Store  0.04
3  Arts & Crafts Store  0.04
4                 Park  0.04


----Armour Square----
                venue  freq
0  Chinese Restaurant  0.12
1                Park  0.05
2         Pizza Place  0.05
3  Mexican Restaurant  0.05
4   Korean Restaurant  0.03


----Ashburn----
                  venue  freq
0    Mexican Restaurant  0.12
1                  Park  0.12
2   Fried Chicken Joint  0.07
3  Fast Food Restaurant  0.07
4        Ice Cream Shop  0.05


----Auburn Gresha

#### Let's put that into a pandas dataframe

First, let's write a function to sort the venues in descending order.

In [28]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [29]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = chicago_grouped['Neighborhood']

for ind in np.arange(chicago_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(chicago_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Albany Park,Korean Restaurant,Coffee Shop,Gym,Pizza Place,Grocery Store,Bar,Park,Bakery,Pub,Sandwich Place
1,Andersonville,Vietnamese Restaurant,Coffee Shop,Breakfast Spot,Grocery Store,Chinese Restaurant,Lounge,Pizza Place,Sushi Restaurant,Mexican Restaurant,Bakery
2,Archer Heights,Mexican Restaurant,Taco Place,Mobile Phone Shop,Park,Donut Shop,Bank,Supermarket,Seafood Restaurant,Discount Store,Chinese Restaurant
3,Armour Square,Chinese Restaurant,Pizza Place,Mexican Restaurant,Park,Grocery Store,Korean Restaurant,Asian Restaurant,Bakery,Bar,Bank
4,Ashburn,Park,Mexican Restaurant,Fried Chicken Joint,Fast Food Restaurant,Bus Station,Ice Cream Shop,Liquor Store,Chinese Restaurant,Seafood Restaurant,Food


##  Cluster Neighborhoods

Now we need execute the Elbow Method to find the optimal number of clusters for KMeans algorithm.

In [30]:
clusters = pd.DataFrame()
clusters['cluster_range'] = range(1, 10)
inertia = []

In [31]:
chicago_grouped_clustering = chicago_grouped.drop('Neighborhood', 1)

for k in clusters['cluster_range']:
  kmeans = KMeans(n_clusters=k, random_state=8).fit(chicago_grouped_clustering)
  inertia.append(kmeans.inertia_)

In [32]:
# Now we can use our list of inertia values in the clusters DataFrame:

clusters['inertia'] = inertia
clusters

Unnamed: 0,cluster_range,inertia
0,1,1.958282
1,2,1.652292
2,3,1.513609
3,4,1.415154
4,5,1.3258
5,6,1.243111
6,7,1.183021
7,8,1.088675
8,9,1.035202


In [33]:
import altair as alt

alt.Chart(clusters).mark_line().encode(x='cluster_range', y='inertia')

In [34]:
# set number of clusters
kclusters = 3

chicago_grouped_clustering = chicago_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(chicago_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 1, 2, 1, 1, 1, 0, 2, 1])


Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [37]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

chicago_merged = neighborhoods

# merge chicago_grouped with neighborhoods to add latitude/longitude for each neighborhood
chicago_merged = chicago_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhoods')

chicago_merged.head() # check the last columns!

Unnamed: 0,Neighborhoods,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Grand Boulevard,41.821923,-87.606708,1,Beach,Park,Art Gallery,Fast Food Restaurant,Fried Chicken Joint,BBQ Joint,Discount Store,Grocery Store,Seafood Restaurant,Sandwich Place
1,Printers Row,41.874371,-87.627607,2,Hotel,Park,Theater,Pizza Place,Coffee Shop,Yoga Studio,Gym,Taco Place,Portuguese Restaurant,Arts & Crafts Store
2,United Center,41.888852,-87.667069,2,New American Restaurant,Deli / Bodega,Restaurant,Pizza Place,Coffee Shop,Yoga Studio,Brewery,Japanese Restaurant,Sandwich Place,Thai Restaurant
3,Sheffield & DePaul,41.921661,-87.658335,2,Pizza Place,Gym / Fitness Center,Coffee Shop,Grocery Store,Gym,Hot Dog Joint,Park,Theater,Greek Restaurant,Seafood Restaurant
4,Humboldt Park,41.887823,-87.740596,0,Fast Food Restaurant,Fried Chicken Joint,Sandwich Place,ATM,Donut Shop,Discount Store,Park,Shoe Store,Gas Station,Liquor Store


Let's visualize the resulting clusters.

In [38]:
import matplotlib.cm as cm
import matplotlib.colors as colors



# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(chicago_merged['Latitude'], chicago_merged['Longitude'], chicago_merged['Neighborhoods'], chicago_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Observation


In the first cluster we can see that fast food venues abound and also liquor stores, the beach and supermarkets.

In [39]:
print('Cluster 1')
chicago_merged.loc[chicago_merged['Cluster Labels'] == 0, chicago_merged.columns[[0] + list(range(4, chicago_merged.shape[1]))]]

Cluster 1


Unnamed: 0,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Humboldt Park,Fast Food Restaurant,Fried Chicken Joint,Sandwich Place,ATM,Donut Shop,Discount Store,Park,Shoe Store,Gas Station,Liquor Store
6,North Lawndale,Fast Food Restaurant,Park,Shoe Store,Fried Chicken Joint,Discount Store,Sandwich Place,Train Station,Cosmetics Shop,Food,Caribbean Restaurant
9,Avalon Park,Fried Chicken Joint,Fast Food Restaurant,Discount Store,Chinese Restaurant,Grocery Store,Hot Dog Joint,Sandwich Place,Park,Supermarket,Pharmacy
10,Burnside,Fast Food Restaurant,Discount Store,Fried Chicken Joint,Pharmacy,Rental Car Location,Liquor Store,Train Station,Ice Cream Shop,Pizza Place,Southern / Soul Food Restaurant
15,New City,Fast Food Restaurant,Train Station,Sandwich Place,Gas Station,Fried Chicken Joint,Caribbean Restaurant,Lounge,Bus Station,Restaurant,Donut Shop
16,Englewood,Fast Food Restaurant,Gas Station,Train Station,Sandwich Place,Park,Fried Chicken Joint,Caribbean Restaurant,Lounge,Bus Station,Discount Store
20,Morgan Park,Fast Food Restaurant,Sandwich Place,Grocery Store,Liquor Store,Train Station,Bank,General Entertainment,Discount Store,Automotive Shop,Breakfast Spot
34,Fuller Park,Train Station,Sandwich Place,Fast Food Restaurant,Caribbean Restaurant,Gas Station,Fried Chicken Joint,Liquor Store,Lounge,Breakfast Spot,Park
45,Chatham,Fast Food Restaurant,Sandwich Place,Discount Store,Park,Chinese Restaurant,Donut Shop,Lounge,Pharmacy,Boutique,Pizza Place
46,Roseland,Fast Food Restaurant,Fried Chicken Joint,Southern / Soul Food Restaurant,Women's Store,Snack Place,Bus Stop,Bus Station,Liquor Store,Park,Exhibit


The second cluster concentrates the neighborhoods that have the majority of Mexican restaurants in the city, so it would not be ideal to choose it.

In [40]:
print('Cluster 2')
chicago_merged.loc[chicago_merged['Cluster Labels'] == 1, chicago_merged.columns[[0] + list(range(4, chicago_merged.shape[1]))]]

Cluster 2


Unnamed: 0,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Grand Boulevard,Beach,Park,Art Gallery,Fast Food Restaurant,Fried Chicken Joint,BBQ Joint,Discount Store,Grocery Store,Seafood Restaurant,Sandwich Place
7,Little Village,Mexican Restaurant,Italian Restaurant,Park,Fast Food Restaurant,Donut Shop,Video Store,Supermarket,Chinese Restaurant,Grocery Store,Hot Dog Joint
11,Hermosa,Bar,Mexican Restaurant,Grocery Store,Sandwich Place,Pizza Place,Pharmacy,Diner,Mobile Phone Shop,Park,Bank
14,Calumet Heights,Harbor / Marina,Mexican Restaurant,Grocery Store,Discount Store,Pharmacy,Food,Fish Market,Seafood Restaurant,Pizza Place,Lounge
17,Grand Crossing,Coffee Shop,Park,Intersection,Gas Station,Grocery Store,Sandwich Place,Light Rail Station,Fried Chicken Joint,Frozen Yogurt Shop,Salon / Barbershop
18,Ashburn,Park,Mexican Restaurant,Fried Chicken Joint,Fast Food Restaurant,Bus Station,Ice Cream Shop,Liquor Store,Chinese Restaurant,Seafood Restaurant,Food
19,Mount Greenwood,Bar,Sandwich Place,Fast Food Restaurant,Pizza Place,Pharmacy,Gas Station,Pub,Fried Chicken Joint,Bank,Convenience Store
23,Pullman,Golf Course,Sandwich Place,Train Station,Park,Diner,Baseball Field,National Park,Gas Station,Café,Big Box Store
24,Riverdale,Golf Course,Sandwich Place,Train Station,Park,Diner,Baseball Field,National Park,Gas Station,Café,Big Box Store
25,Hegewisch,Sandwich Place,Casino,Park,Mexican Restaurant,Burger Joint,Liquor Store,Buffet,Asian Restaurant,Smoke Shop,Fried Chicken Joint


Finally, in the third and last cluster, we can see that the area includes hotels, museums, restaurants with gastronomies from different parts of the world, but few are of Mexican food, therefore this would be the ideal to search for potential candidates.

In [41]:
print('Cluster 3')
chicago_merged.loc[chicago_merged['Cluster Labels'] == 2, chicago_merged.columns[[0] + list(range(4, chicago_merged.shape[1]))]]

Cluster 3


Unnamed: 0,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Printers Row,Hotel,Park,Theater,Pizza Place,Coffee Shop,Yoga Studio,Gym,Taco Place,Portuguese Restaurant,Arts & Crafts Store
2,United Center,New American Restaurant,Deli / Bodega,Restaurant,Pizza Place,Coffee Shop,Yoga Studio,Brewery,Japanese Restaurant,Sandwich Place,Thai Restaurant
3,Sheffield & DePaul,Pizza Place,Gym / Fitness Center,Coffee Shop,Grocery Store,Gym,Hot Dog Joint,Park,Theater,Greek Restaurant,Seafood Restaurant
5,Garfield Park,Bar,Bank,Grocery Store,Art Gallery,Coffee Shop,Deli / Bodega,Ukrainian Restaurant,Sandwich Place,Restaurant,New American Restaurant
8,Armour Square,Chinese Restaurant,Pizza Place,Mexican Restaurant,Park,Grocery Store,Korean Restaurant,Asian Restaurant,Bakery,Bar,Bank
12,Avondale,Bar,Mexican Restaurant,Coffee Shop,Grocery Store,Gym / Fitness Center,Cuban Restaurant,Italian Restaurant,Pizza Place,Burger Joint,Frozen Yogurt Shop
13,Logan Square,Coffee Shop,Pizza Place,Hot Dog Joint,Cuban Restaurant,Ice Cream Shop,Park,Italian Restaurant,Bar,Grocery Store,Dessert Shop
21,Jackson Park,Science Museum,Park,Bookstore,Coffee Shop,Café,Grocery Store,Pizza Place,Italian Restaurant,Sushi Restaurant,Spa
22,Loop,Hotel,Park,Pizza Place,American Restaurant,New American Restaurant,Grocery Store,Boat or Ferry,Seafood Restaurant,Breakfast Spot,Coffee Shop
26,Greektown,Coffee Shop,Pizza Place,Sandwich Place,Italian Restaurant,Grocery Store,Hotel,Mexican Restaurant,Breakfast Spot,Burger Joint,Ice Cream Shop


In [42]:
chicago_grouped_count = chicago_onehot.groupby('Neighborhood').sum().reset_index()
chicago_grouped_count.head()

Unnamed: 0,Neighborhood,Zoo Exhibit,ATM,Accessories Store,African Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,American Restaurant,...,Warehouse Store,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Albany Park,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,2,0
1,Andersonville,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
2,Archer Heights,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Armour Square,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Ashburn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now, we need to create a dataframe with hotels.

In [43]:

hotel = pd.DataFrame(chicago_grouped_count['Hotel'])

hotel['Neighborhood'] = chicago_grouped_count['Neighborhood']

hotel.rename(columns={0:'Hotel'}, inplace=True)

hotel.head()



Unnamed: 0,Hotel,Neighborhood
0,0,Albany Park
1,0,Andersonville
2,0,Archer Heights
3,2,Armour Square
4,0,Ashburn



 We are going to sort to see which neighborhoods have more hotels.


In [44]:
final_candidates = hotel.sort_values(by='Hotel', ascending=False).reset_index(drop=True).head(5)
final_candidates

Unnamed: 0,Hotel,Neighborhood
0,11,Rush & Division
1,10,Magnificent Mile
2,10,Millenium Park
3,8,Loop
4,8,River North


Millenium Park is a place with a lot of tourism since it has hotels, theaters and entertainment areas, it is undoubtedly an ideal place to invest.
We can see that Rush & Division is the neighborhood with the most hotels, in addition to having restaurants and coffee shops, it would be a very important option to recommend.
Streeterville is a prestigious neighborhood in the city that also has an area with hotels, parks and restaurants.
Magnificent Mile appears in the top 3 but we must discard it since one of the requirements is that there should be no Mexican restaurants in the top 10 of most common venues.

* **Millenium Park** is a place with a lot of tourism since it has hotels, theaters and entertainment areas, it is undoubtedly an ideal place to invest.

* We can see that **Rush & Division** is the neighborhood with the most hotels, in addition to having restaurants and coffee shops, it would be a very important option to recommend.

* **Streeterville** is a prestigious neighborhood in the city that also has an area with hotels, parks and restaurants.

**Magnificent Mile** appears in the top 3 but we must discard it since one of the requirements is that there should be no Mexican restaurants in the top 10 of most common venues.

I recommend our client to visit those neighborhoods to choose where he will finally found his new restaurant.