# IBM Data Science Capstone Project
*Opening a Mexican Restaurant in San Diego*

- Build a dataframe of cities in San Diego, CA through web-scraping and preprocessing
- Retrieve the geographical coordinates of each city
- Query venue data for each city by leveraging FourSquare API data
- Explore and cluster the cities using the k-means machine-learning algorithm
- Determine the best cluster to open a new Mexican Restaurant in San Diego

__1. Scrape the web and create a dataframe of San Diego cities.__

In [1]:
# Import dependencies

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
from pandas import DataFrame

# Define html variable
html = "https://en.wikipedia.org/wiki/Category:Cities_in_San_Diego_County,_California"
contents = requests.get(html).text
# Define soup variable
soup = BeautifulSoup(contents, 'html.parser')
# Empty list to append cities
city_list = []

# The city links sit in the second "mw-category" div (found by inspecting the page's DOM)
for row in soup.find_all("div", class_="mw-category")[1].find_all("li"):
    city_list.append(row.text)

df1 = pd.DataFrame({"City": city_list})
# All entries are of the form 'City_Name, California'
# Only the city name is needed, so split on the comma and keep the first part.
df1["City"] = df1["City"].str.split(",").str[0]
num_cities = df1.shape[0]

print("")
print(f'There are {num_cities} cities in the dataframe')
print("")
df1


There are 18 cities in the dataframe



Unnamed: 0,City
0,Carlsbad
1,Chula Vista
2,Coronado
3,Del Mar
4,El Cajon
5,Encinitas
6,Escondido
7,Imperial Beach
8,La Mesa
9,Lemon Grove


__2. Retrieve the geographical coordinates of each city.__

In [4]:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="sd_explorer")


lats = []
longs = []

for city in df1["City"]:
    address = city
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    lats.append(latitude)
    longs.append(longitude)

df1["Latitude"] = lats
df1["Longitude"] = longs

In [6]:
# Install and import folium as map rendering library
# !conda install -c conda-forge folium=0.5.0 --yes
import folium

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means for the clustering stage from scikit-learn

from sklearn.cluster import KMeans

In [7]:
address = 'San Diego, CA, USA'

geolocator = Nominatim(user_agent="sd_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Diego are {}, {}.'.format(latitude, longitude))

The geographical coordinates of San Diego are 32.7174202, -117.1627728.


In [8]:
# San Marcos and La Mesa are both geocoded to the wrong location, so set their coordinates by hand
la_mesa_lat = 32.772404
la_mesa_long = -117.029327
san_marcos_lat = 33.1350206
san_marcos_long = -117.17433
# Replace the incorrect data values
df1.loc[df1["City"] == "La Mesa", "Latitude"] = la_mesa_lat
df1.loc[df1["City"] == "La Mesa", "Longitude"] = la_mesa_long
df1.loc[df1["City"] == "San Marcos", "Latitude"] = san_marcos_lat
df1.loc[df1["City"] == "San Marcos", "Longitude"] = san_marcos_long
# What does the dataframe look like now?
df1

Unnamed: 0,City,Latitude,Longitude
0,Carlsbad,33.158093,-117.350597
1,Chula Vista,32.640054,-117.084196
2,Coronado,32.69152,-117.176695
3,Del Mar,32.959489,-117.265315
4,El Cajon,32.794773,-116.962526
5,Encinitas,33.036987,-117.291982
6,Escondido,33.121675,-117.081485
7,Imperial Beach,32.583944,-117.113085
8,La Mesa,32.772404,-117.029327
9,Lemon Grove,32.742552,-117.031417
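Misplaced cities like the two corrected above can also be caught without looking at a map. The sketch below flags any row whose coordinates fall outside a rough bounding box. The box limits and the helper name `outside_county` are assumptions chosen for illustration, not values from the notebook.

```python
from typing import Iterable, List, Tuple

# Assumption: a rough bounding box around San Diego County, for illustration only.
LAT_RANGE = (32.5, 33.6)
LON_RANGE = (-117.7, -116.0)

def outside_county(rows: Iterable[Tuple[str, float, float]]) -> List[str]:
    """Return the names of cities whose coordinates fall outside the bounding box."""
    flagged = []
    for city, lat, lon in rows:
        in_lat = LAT_RANGE[0] <= lat <= LAT_RANGE[1]
        in_lon = LON_RANGE[0] <= lon <= LON_RANGE[1]
        if not (in_lat and in_lon):
            flagged.append(city)
    return flagged

sample = [
    ("Carlsbad", 33.158093, -117.350597),  # from the table above
    ("La Mesa", 32.772404, -117.029327),   # corrected value from the table above
    ("Somewhere Else", 29.4, -98.5),       # made-up point far outside the county
]
flagged = outside_county(sample)
print(flagged)  # ['Somewhere Else']
```

Against the notebook's dataframe, the rows could be supplied with `df1[["City", "Latitude", "Longitude"]].itertuples(index=False, name=None)`.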


In [9]:
map_sd_1 = folium.Map(location=[latitude, longitude], zoom_start = 10)

# Add markers to the map

sd_city = df1["City"]
sd_lat = df1["Latitude"]
sd_long = df1["Longitude"]

for lat, long, city in zip(sd_lat, sd_long, sd_city):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5.3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sd_1)

map_sd_1

__3. Query venue data for each city by leveraging FourSquare API data.__

In [24]:
from config import client_id, client_secret
# client_id = 'YOUR CODE HERE'
# client_secret = 'YOUR CODE HERE'

# print("You'll need to provide your own information here!")

In [25]:
# Defining a function to return venues within a 2000 meter radius of each city

def getNearbyVenues(names, latitudes, longitudes, radius=2000, limit=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id, 
            client_secret, 
            20220721, # API version date in YYYYMMDD form
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
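The list comprehension in `getNearbyVenues` assumes that every response contains a `groups` entry and that every venue has at least one category. If either is missing, the whole run stops with a `KeyError` or `IndexError`. Below is a hedged sketch of a more tolerant flattening step. The helper name `extract_venue_rows` and the `"Uncategorized"` placeholder are assumptions of this example; the JSON keys are the ones the function above already reads.

```python
from typing import Any, Dict, List, Tuple

VenueRow = Tuple[str, float, float, str, float, float, str]

def extract_venue_rows(city: str, lat: float, lng: float,
                       payload: Dict[str, Any]) -> List[VenueRow]:
    """Flatten one decoded explore response into rows shaped like nearby_venues."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Assumption: label venues without a category instead of raising IndexError.
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((city, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payload in the same shape; the second venue is made up.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe Elysa",
               "location": {"lat": 33.157190, "lng": -117.350444},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 33.0, "lng": -117.3},
               "categories": []}},
]}]}}
rows = extract_venue_rows("Carlsbad", 33.158093, -117.350597, sample)
for row in rows:
    print(row)
```

Keeping the parsing separate from the HTTP call also makes it possible to test the flattening logic without spending API quota.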

In [26]:
%%capture
sd_venues = getNearbyVenues(names=sd_city,
                            latitudes=sd_lat,
                            longitudes=sd_long)

In [66]:
print(f'There are {sd_venues.shape[0]} rows and {sd_venues.shape[1]} columns in the dataframe below')
sd_venues

There are 1677 rows and 7 columns in the dataframe below


Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Carlsbad,33.158093,-117.350597,Cafe Elysa,33.157190,-117.350444,Bakery
1,Carlsbad,33.158093,-117.350597,Gaia Gelato,33.159270,-117.350831,Ice Cream Shop
2,Carlsbad,33.158093,-117.350597,Naked Cafe,33.159075,-117.350506,Breakfast Spot
3,Carlsbad,33.158093,-117.350597,Choice Juicery,33.159605,-117.348978,Juice Bar
4,Carlsbad,33.158093,-117.350597,Park 101,33.157881,-117.350468,Café
...,...,...,...,...,...,...,...
1672,Vista,33.200037,-117.242536,CVS pharmacy,33.193270,-117.235459,Pharmacy
1673,Vista,33.200037,-117.242536,7-Eleven,33.212160,-117.245407,Convenience Store
1674,Vista,33.200037,-117.242536,Carl's Jr.,33.193259,-117.255762,Fast Food Restaurant
1675,Vista,33.200037,-117.242536,Eriberto's Mexican Food,33.193469,-117.234210,Mexican Restaurant


In [28]:
# How many venues are there in each city?
sd_venues.groupby(["City"]).count()

City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Carlsbad,82,82,82,82,82,82
Chula Vista,100,100,100,100,100,100
Coronado,100,100,100,100,100,100
Del Mar,66,66,66,66,66,66
El Cajon,100,100,100,100,100,100
Encinitas,100,100,100,100,100,100
Escondido,100,100,100,100,100,100
Imperial Beach,100,100,100,100,100,100
La Mesa,100,100,100,100,100,100
Lemon Grove,72,72,72,72,72,72


In [29]:
# How many unique categories can be curated from all the returned venues?
print('There are {} unique categories.'.format(len(sd_venues['Venue Category'].unique())))

# Preview of unique categories
sd_venues['Venue Category'].unique()[:50]

There are 234 unique categories.


array(['Bakery', 'Ice Cream Shop', 'Breakfast Spot', 'Juice Bar', 'Café',
       'Bar', 'Pizza Place', 'American Restaurant', 'Mexican Restaurant',
       'Restaurant', 'Gourmet Shop', 'Beach', 'Massage Studio',
       'Salon / Barbershop', 'French Restaurant', 'Asian Restaurant',
       'Coffee Shop', 'Wine Bar', 'Boutique', 'Hotel', 'Yoga Studio',
       'Record Shop', 'Italian Restaurant', 'Seafood Restaurant',
       'Sushi Restaurant', 'Liquor Store', 'Resort', 'Convenience Store',
       'Board Shop', 'Grocery Store', 'Smoke Shop', 'Pub', 'Diner',
       'Burger Joint', 'Peruvian Restaurant', 'Pharmacy', 'Steakhouse',
       'Gas Station', 'Trail', 'Train Station', 'Dive Bar', 'Brewery',
       'Multiplex', 'Martial Arts School', 'Lingerie Store', 'Taco Place',
       'Gym / Fitness Center', 'Candy Store', 'Vietnamese Restaurant',
       'Sandwich Place'], dtype=object)

__4. Explore and cluster the cities using the k-means machine-learning algorithm.__

In [30]:
# one hot encoding
one_hot_sd = pd.get_dummies(sd_venues[['Venue Category']], prefix = "", prefix_sep = "")

# add the City column back to the one-hot dataframe
one_hot_sd['City'] = sd_venues['City']

# move City column to first column
fixed_columns = [one_hot_sd.columns[-1]] + list(one_hot_sd.columns[:-1])
one_hot_sd = one_hot_sd[fixed_columns]

print(f'There are {one_hot_sd.shape[0]} rows and {one_hot_sd.shape[1]} columns in the dataframe below')
one_hot_sd.head()

There are 1677 rows and 235 columns in the dataframe below


Unnamed: 0,City,ATM,Accessories Store,American Restaurant,Amphitheater,Antique Shop,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Yoga Studio,Zoo
0,Carlsbad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Carlsbad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Carlsbad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Carlsbad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Carlsbad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
# Grouping rows by city

one_hot_group = one_hot_sd.groupby(["City"]).mean().reset_index()

print(one_hot_group.shape)

one_hot_group

(18, 235)


Unnamed: 0,City,ATM,Accessories Store,American Restaurant,Amphitheater,Antique Shop,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Yoga Studio,Zoo
0,Carlsbad,0.0,0.0,0.036585,0.0,0.0,0.0,0.0,0.0,0.02439,...,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.012195,0.0
1,Chula Vista,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,...,0.01,0.0,0.02,0.01,0.0,0.0,0.0,0.01,0.0,0.0
2,Coronado,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0
3,Del Mar,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.015152
4,El Cajon,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Encinitas,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.02,0.01,...,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.02,0.0
6,Escondido,0.01,0.0,0.05,0.0,0.01,0.0,0.0,0.01,0.0,...,0.0,0.01,0.02,0.0,0.0,0.01,0.01,0.0,0.0,0.0
7,Imperial Beach,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0
8,La Mesa,0.0,0.0,0.03,0.0,0.01,0.01,0.0,0.01,0.0,...,0.0,0.0,0.01,0.01,0.0,0.02,0.0,0.01,0.01,0.0
9,Lemon Grove,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,...,0.013889,0.027778,0.013889,0.0,0.0,0.0,0.0,0.0,0.0,0.0
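Taking the mean of the one-hot columns per city gives, for each category, the fraction of that city's venues that fall in it. Carlsbad's 0.036585 for American Restaurant, for example, is 3 of its 82 venues. The sketch below computes the same shares by plain counting, which can serve as a cross-check. The helper name `category_shares` and the toy rows are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def category_shares(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each city to {category: share of that city's venues}."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for city, category in pairs:
        counts[city][category] += 1
    shares = {}
    for city, counter in counts.items():
        total = sum(counter.values())
        shares[city] = {cat: n / total for cat, n in counter.items()}
    return shares

# Made-up toy data, not taken from the Foursquare results.
toy = [("A", "Beach"), ("A", "Beach"), ("A", "Café"), ("A", "Hotel"),
       ("B", "Mexican Restaurant"), ("B", "Café")]
shares = category_shares(toy)
print(shares["A"]["Beach"])  # 2 of 4 venues -> 0.5
```

With the notebook's data, the input could be `zip(sd_venues["City"], sd_venues["Venue Category"])`.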


In [32]:
# Taking the top 5 venue types for each city

num_top_venues = 5

for city in one_hot_group['City']:
    print("----"+city+"----")
    temp = one_hot_group[one_hot_group['City'] == city].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Carlsbad----
            venue  freq
0           Beach  0.12
1            Café  0.05
2     Pizza Place  0.05
3           Hotel  0.05
4  Breakfast Spot  0.05


----Chula Vista----
                venue  freq
0  Mexican Restaurant  0.09
1   Convenience Store  0.04
2       Grocery Store  0.04
3         Coffee Shop  0.03
4          Taco Place  0.03


----Coronado----
                venue  freq
0                Park  0.08
1               Hotel  0.05
2  Seafood Restaurant  0.05
3      Sandwich Place  0.04
4  Mexican Restaurant  0.04


----Del Mar----
                 venue  freq
0                Beach  0.09
1            Surf Spot  0.05
2  American Restaurant  0.05
3   Mexican Restaurant  0.05
4   Seafood Restaurant  0.05


----El Cajon----
                       venue  freq
0                Coffee Shop  0.07
1         Mexican Restaurant  0.06
2             Clothing Store  0.06
3       Fast Food Restaurant  0.04
4  Middle Eastern Restaurant  0.03


----Encinitas----
            venue  fr

In [33]:
# A function to sort the venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [34]:
# Putting the top 5 venues for each city in a dataframe

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
sd_venues_sorted = pd.DataFrame(columns=columns)
sd_venues_sorted['City'] = one_hot_group['City']

for ind in np.arange(one_hot_group.shape[0]):
    sd_venues_sorted.iloc[ind, 1:] = return_most_common_venues(one_hot_group.iloc[ind, :], num_top_venues)

sd_venues_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Carlsbad,Beach,Café,Pizza Place,Hotel,Breakfast Spot
1,Chula Vista,Mexican Restaurant,Convenience Store,Grocery Store,Coffee Shop,Taco Place
2,Coronado,Park,Hotel,Seafood Restaurant,Sandwich Place,Mexican Restaurant
3,Del Mar,Beach,Surf Spot,American Restaurant,Mexican Restaurant,Seafood Restaurant
4,El Cajon,Coffee Shop,Mexican Restaurant,Clothing Store,Fast Food Restaurant,Middle Eastern Restaurant
5,Encinitas,Coffee Shop,Pizza Place,Bar,Sandwich Place,Brewery
6,Escondido,Mexican Restaurant,Convenience Store,Fast Food Restaurant,Pizza Place,American Restaurant
7,Imperial Beach,Mexican Restaurant,Convenience Store,Fast Food Restaurant,Sandwich Place,Pizza Place
8,La Mesa,Coffee Shop,Salon / Barbershop,Italian Restaurant,Sandwich Place,Mexican Restaurant
9,Lemon Grove,Mexican Restaurant,Coffee Shop,Convenience Store,Pizza Place,Sandwich Place
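The column names in the table above come from a three-item suffix list with a fallback to "th". That works for five columns, but it would label a 21st column "21th". If `num_top_venues` might grow, a small helper is one option. The function name `ordinal` is an assumption of this sketch.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["City"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 6)]
print(columns)
```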


In [35]:
# set number of clusters
kclusters = 4

sd_grouped_clusters = one_hot_group.drop(columns='City')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=42).fit(sd_grouped_clusters)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:18] 


array([2, 1, 0, 2, 1, 1, 3, 3, 1, 3, 3, 2, 3, 0, 1, 1, 2, 3])

In [37]:
# New dataframe that includes the cluster label as well as the top 5 venues per city

# Add the clustering labels; the guard keeps this cell safe to re-run
if 'Cluster Labels' not in sd_venues_sorted.columns:
    sd_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

sd_merged = df1

# Join sd_venues_sorted onto df1 so each city's latitude/longitude sits next to its cluster label and top venues
sd_merged = sd_merged.join(sd_venues_sorted.set_index('City'), on='City')

sd_merged["Cluster Labels"] = sd_merged["Cluster Labels"].replace(np.nan,0)
sd_merged = sd_merged.astype({"Cluster Labels": int})
sd_merged # Check the last columns!

Unnamed: 0,City,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Carlsbad,33.158093,-117.350597,2,Beach,Café,Pizza Place,Hotel,Breakfast Spot
1,Chula Vista,32.640054,-117.084196,1,Mexican Restaurant,Convenience Store,Grocery Store,Coffee Shop,Taco Place
2,Coronado,32.69152,-117.176695,0,Park,Hotel,Seafood Restaurant,Sandwich Place,Mexican Restaurant
3,Del Mar,32.959489,-117.265315,2,Beach,Surf Spot,American Restaurant,Mexican Restaurant,Seafood Restaurant
4,El Cajon,32.794773,-116.962526,1,Coffee Shop,Mexican Restaurant,Clothing Store,Fast Food Restaurant,Middle Eastern Restaurant
5,Encinitas,33.036987,-117.291982,1,Coffee Shop,Pizza Place,Bar,Sandwich Place,Brewery
6,Escondido,33.121675,-117.081485,3,Mexican Restaurant,Convenience Store,Fast Food Restaurant,Pizza Place,American Restaurant
7,Imperial Beach,32.583944,-117.113085,3,Mexican Restaurant,Convenience Store,Fast Food Restaurant,Sandwich Place,Pizza Place
8,La Mesa,32.772404,-117.029327,1,Coffee Shop,Salon / Barbershop,Italian Restaurant,Sandwich Place,Mexican Restaurant
9,Lemon Grove,32.742552,-117.031417,3,Mexican Restaurant,Coffee Shop,Convenience Store,Pizza Place,Sandwich Place


In [39]:
# Create SD map with clusters

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sd_merged['Latitude'], sd_merged['Longitude'], sd_merged['City'], sd_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
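The map only needs one distinct colour per cluster label. As an alternative that does not depend on matplotlib, the sketch below builds such a palette with the standard library. The helper name `cluster_palette` and the saturation and brightness values are assumptions for illustration.

```python
import colorsys
from typing import List

def cluster_palette(k: int) -> List[str]:
    """Return k evenly spaced hues as hex strings, one per cluster label 0..k-1."""
    palette = []
    for i in range(k):
        # Assumption: fixed saturation 0.8 and brightness 0.9, chosen for legibility.
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(4)
print(palette)
```

Each marker can then be coloured with `palette[cluster]`, so label 0 takes the first colour and label 3 the last.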

In [40]:
# Cluster Labels = 0

sd_merged.loc[sd_merged['Cluster Labels'] == 0, sd_merged.columns[[0] + list(range(4, sd_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Coronado,Park,Hotel,Seafood Restaurant,Sandwich Place,Mexican Restaurant
13,San Diego,Hotel,Bar,American Restaurant,Museum,Coffee Shop


In [41]:
# Cluster Labels = 1

sd_merged.loc[sd_merged['Cluster Labels'] == 1, sd_merged.columns[[0] + list(range(4, sd_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,Chula Vista,Mexican Restaurant,Convenience Store,Grocery Store,Coffee Shop,Taco Place
4,El Cajon,Coffee Shop,Mexican Restaurant,Clothing Store,Fast Food Restaurant,Middle Eastern Restaurant
5,Encinitas,Coffee Shop,Pizza Place,Bar,Sandwich Place,Brewery
8,La Mesa,Coffee Shop,Salon / Barbershop,Italian Restaurant,Sandwich Place,Mexican Restaurant
14,San Marcos,Mexican Restaurant,Coffee Shop,Sandwich Place,Sushi Restaurant,Pizza Place
15,Santee,Coffee Shop,Mexican Restaurant,Clothing Store,Brewery,Breakfast Spot


In [42]:
# Cluster Labels = 2

sd_merged.loc[sd_merged['Cluster Labels'] == 2, sd_merged.columns[[0] + list(range(4, sd_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Carlsbad,Beach,Café,Pizza Place,Hotel,Breakfast Spot
3,Del Mar,Beach,Surf Spot,American Restaurant,Mexican Restaurant,Seafood Restaurant
11,Oceanside,American Restaurant,Beach,Coffee Shop,Seafood Restaurant,Ice Cream Shop
16,Solana Beach,Beach,Coffee Shop,Pizza Place,Seafood Restaurant,American Restaurant


In [43]:
# Cluster Labels = 3

sd_merged.loc[sd_merged['Cluster Labels'] == 3, sd_merged.columns[[0] + list(range(4, sd_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
6,Escondido,Mexican Restaurant,Convenience Store,Fast Food Restaurant,Pizza Place,American Restaurant
7,Imperial Beach,Mexican Restaurant,Convenience Store,Fast Food Restaurant,Sandwich Place,Pizza Place
9,Lemon Grove,Mexican Restaurant,Coffee Shop,Convenience Store,Pizza Place,Sandwich Place
10,National City,Mexican Restaurant,Fast Food Restaurant,Convenience Store,Park,Chinese Restaurant
12,Poway,Mexican Restaurant,Pizza Place,Sushi Restaurant,Seafood Restaurant,Auto Workshop
17,Vista,Fast Food Restaurant,Pizza Place,Sandwich Place,Mexican Restaurant,Coffee Shop


In [44]:
# Cluster Labels = 4 (empty: with kclusters = 4 the labels run from 0 to 3)

sd_merged.loc[sd_merged['Cluster Labels'] == 4, sd_merged.columns[[0] + list(range(4, sd_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue


__5. Determine the best cluster to open a new Mexican Restaurant in San Diego.__

In [48]:
mex_df = one_hot_group[["City", "Mexican Restaurant"]]
mex_df

Unnamed: 0,City,Mexican Restaurant
0,Carlsbad,0.036585
1,Chula Vista,0.09
2,Coronado,0.04
3,Del Mar,0.045455
4,El Cajon,0.06
5,Encinitas,0.03
6,Escondido,0.12
7,Imperial Beach,0.08
8,La Mesa,0.04
9,Lemon Grove,0.111111


In [54]:
# set number of clusters
kclusters = 4

mex_clustered = mex_df.drop(columns=["City"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=42).fit(mex_clustered)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:18]


array([1, 2, 1, 1, 3, 1, 0, 2, 1, 0, 2, 1, 2, 1, 3, 3, 1, 3])

In [55]:
# Create a new dataframe that includes the cluster label as well as the Mexican Restaurant frequency for each city.
mex_merged = mex_df.copy()

# add clustering labels
mex_merged["Cluster Labels"] = kmeans.labels_

mex_merged.head()

Unnamed: 0,City,Mexican Restaurant,Cluster Labels
0,Carlsbad,0.036585,1
1,Chula Vista,0.09,2
2,Coronado,0.04,1
3,Del Mar,0.045455,1
4,El Cajon,0.06,3


In [56]:
# Merge dataframes to add latitude/longitude for each city
# Only run the line below once! Comment/uncomment as necessary to continue the notebook.
mex_merged = mex_merged.join(df1.set_index("City"), on="City")

print(f'There are {mex_merged.shape[0]} cities and {mex_merged.shape[1]} features in the dataframe.')

# Sort results by Cluster Labels
mex_merged.sort_values(["Cluster Labels"], 
                       inplace=True, 
                       ascending=False)
mex_merged

There are 18 cities and 5 features in the dataframe.


Unnamed: 0,City,Mexican Restaurant,Cluster Labels,Latitude,Longitude
17,Vista,0.06,3,33.200037,-117.242536
15,Santee,0.05,3,32.838383,-116.973917
4,El Cajon,0.06,3,32.794773,-116.962526
14,San Marcos,0.07,3,33.135021,-117.17433
7,Imperial Beach,0.08,2,32.583944,-117.113085
12,Poway,0.095238,2,32.962823,-117.035865
1,Chula Vista,0.09,2,32.640054,-117.084196
10,National City,0.085106,2,32.678109,-117.099197
11,Oceanside,0.02,1,33.19587,-117.379483
16,Solana Beach,0.04,1,32.99056,-117.269131


In [58]:
# Create map of Mexican Restaurant data
mex_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mex_merged['Latitude'], mex_merged['Longitude'], mex_merged['City'], mex_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(mex_clusters)
       
mex_clusters

In [59]:
# Save the map as an HTML file
mex_clusters.save('mex_clusters.html')

In [60]:
# Warnings will pop up if you comment out the code below

import warnings
warnings.filterwarnings("ignore")

In [61]:
# Cluster Labels = 0

mex_0 = mex_merged.loc[mex_merged['Cluster Labels'] == 0]
print(f'{mex_0.shape[0]} cities:')
mex_0.sort_values(["Mexican Restaurant"], 
                       inplace=True, 
                       ascending=True)
mex_0

2 cities:


Unnamed: 0,City,Mexican Restaurant,Cluster Labels,Latitude,Longitude
9,Lemon Grove,0.111111,0,32.742552,-117.031417
6,Escondido,0.12,0,33.121675,-117.081485


In [62]:
# Cluster Labels = 1

mex_1 = mex_merged.loc[mex_merged['Cluster Labels'] == 1]
print(f'{mex_1.shape[0]} cities:')
mex_1.sort_values(["Mexican Restaurant"], 
                       inplace=True, 
                       ascending=True)
mex_1

8 cities:


Unnamed: 0,City,Mexican Restaurant,Cluster Labels,Latitude,Longitude
11,Oceanside,0.02,1,33.19587,-117.379483
5,Encinitas,0.03,1,33.036987,-117.291982
0,Carlsbad,0.036585,1,33.158093,-117.350597
16,Solana Beach,0.04,1,32.99056,-117.269131
13,San Diego,0.04,1,32.71742,-117.162773
8,La Mesa,0.04,1,32.772404,-117.029327
2,Coronado,0.04,1,32.69152,-117.176695
3,Del Mar,0.045455,1,32.959489,-117.265315


In [64]:
# Cluster Labels = 2

mex_2 = mex_merged.loc[mex_merged['Cluster Labels'] == 2]
print(f'{mex_2.shape[0]} cities:')
mex_2.sort_values(["Mexican Restaurant"], 
                       inplace=True, 
                       ascending=True)
mex_2

4 cities:


Unnamed: 0,City,Mexican Restaurant,Cluster Labels,Latitude,Longitude
7,Imperial Beach,0.08,2,32.583944,-117.113085
10,National City,0.085106,2,32.678109,-117.099197
1,Chula Vista,0.09,2,32.640054,-117.084196
12,Poway,0.095238,2,32.962823,-117.035865


In [65]:
# Cluster Labels = 3

mex_3 = mex_merged.loc[mex_merged['Cluster Labels'] == 3]
print(f'{mex_3.shape[0]} cities:')
mex_3.sort_values(["Mexican Restaurant"], 
                       inplace=True, 
                       ascending=True)
mex_3

4 cities:


Unnamed: 0,City,Mexican Restaurant,Cluster Labels,Latitude,Longitude
15,Santee,0.05,3,32.838383,-116.973917
17,Vista,0.06,3,33.200037,-117.242536
4,El Cajon,0.06,3,32.794773,-116.962526
14,San Marcos,0.07,3,33.135021,-117.17433
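The numbers k-means assigns to clusters are arbitrary, so it helps to summarize each cluster by its size and mean Mexican Restaurant share before reading the labels. The sketch below does this with the values copied from the four tables above. The helper name `summarize` is an assumption of this example.

```python
from collections import defaultdict

# (city, Mexican Restaurant share, cluster label) copied from the four tables above.
rows = [
    ("Lemon Grove", 0.111111, 0), ("Escondido", 0.12, 0),
    ("Oceanside", 0.02, 1), ("Encinitas", 0.03, 1), ("Carlsbad", 0.036585, 1),
    ("Solana Beach", 0.04, 1), ("San Diego", 0.04, 1), ("La Mesa", 0.04, 1),
    ("Coronado", 0.04, 1), ("Del Mar", 0.045455, 1),
    ("Imperial Beach", 0.08, 2), ("National City", 0.085106, 2),
    ("Chula Vista", 0.09, 2), ("Poway", 0.095238, 2),
    ("Santee", 0.05, 3), ("Vista", 0.06, 3), ("El Cajon", 0.06, 3),
    ("San Marcos", 0.07, 3),
]

def summarize(rows):
    """Return {label: (number of cities, mean share)} for each cluster."""
    grouped = defaultdict(list)
    for _city, share, label in rows:
        grouped[label].append(share)
    return {label: (len(v), sum(v) / len(v)) for label, v in grouped.items()}

summary = summarize(rows)
# List the clusters from the lowest to the highest mean share.
ordered = sorted(summary.items(), key=lambda kv: kv[1][1])
for label, (n, mean_share) in ordered:
    print(f"Cluster {label}: {n} cities, mean share {mean_share:.3f}")
```

The same summary should be obtainable in pandas with `mex_merged.groupby("Cluster Labels")["Mexican Restaurant"].agg(["count", "mean"]).sort_values("mean")`.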


### Observations

- Mexican Restaurants are concentrated in Cluster 2 cities, where they account for roughly 8 to 10% of the venues returned per city.
- These cities likely suffer from intense competition due to an oversupply and overconcentration of restaurants.
- Cluster 0 and Cluster 1 cities present the greater opportunity and the high-potential areas for opening a new Mexican Restaurant.
- Cluster 3 cities have a higher level of competition among Mexican Restaurants (5 to 7% of returned venues), but may offer a more conservative level of risk for a new business.

### Project Recommendation

- I recommend that property developers capitalize on the observations above and look into neighborhoods in _Cluster 0_ and _Cluster 1_ cities, where a new Mexican Restaurant has a high potential for success.
- In particular, the city of __Lemon Grove__ should be explored for high-potential neighborhoods.
- Property developers with a unique selling proposition that sets them apart may succeed by contesting the existing Mexican Restaurants in __Del Mar__, __Coronado__, and _Cluster 3_ cities.
- Property developers looking to mitigate risk may find success in _Cluster 3_ cities such as __Santee__ and __Vista__. The lower risk comes at the cost of a potentially lower profit margin.
- I advise property developers to avoid neighborhoods in _Cluster 2_ cities, which already have a high concentration of Mexican Restaurants and suffer from intense competition.