# Introduction

The hypothetical problem is explained below.
Suppose somebody wants to open up a restaurant in Miami. This person want to know a good location to open the restaurant and moreover, what type of cuisine would be a good choice for the restaurant.
My proposed solution is as follow: One studies the neighborhoods of Miami and determines in which neighborhoods there are the least restaurants per person (per capita). Once the "low-restaurant" neighborhoods are determined, one clusters all neighborhoods w.r.t. the venues in the neighborhood, except for the restaurants (this only shops, landmarks, parks, etc. would be used in clustering). The idea is that this separates the neighborhoods in similar/like-minded sets. For a "low-restaurant" neighborhood, one then looks at the other neighborhoods of its cluster and which restaurants are popular there. Opening such a restaurant at the "low-restaurant" neighborhood is then the proposed solution.

# Data

For the list of Miami neighborhoods, I will scrape a wikipedia page ([List of Neighborhoods in Miami](https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Miami)). Before showing the code for scraping, preliminary packages are installed and imported (not only packages for scraping, but for all code that is to follow).


In [1]:
!conda install -c conda-forge folium=0.5.0 --yes
!conda install -c conda-forge beautifulsoup4 --yes
!conda install -c conda-forge lxml --yes
!conda install -c conda-forge geopy --yes
import geopy.distance #measure distance between coordinates
import folium
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
import json
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.



The code below scrapes the wikipedia page for the info on Miami neighborhoods and puts it in a pandas dataframe.

In [7]:
source = requests.get("https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Miami").text
soup = BeautifulSoup(source,"lxml")
table = soup.table

headers = table.find_all("th")

rows_body = table.find_all("tr")[1:-1] #first and last not included since first gives the header row and the list is a cummulative row for the whole of Miami
rows_list = []
for row in rows_body:
    temp_list = [ entry.text.replace("\n","") for entry in row.find_all("td")]
    if temp_list:
        rows_list.append(temp_list)
        
df_miami = pd.DataFrame(rows_list,columns = [head.text.replace("\n","") for head in headers])
df_miami.head()

Unnamed: 0,Neighborhood,Demonym,Population2010,Population/Km²,Sub-neighborhoods,Coordinates
0,Allapattah,,54289,4401,,"25.815,-80.224"
1,Arts & Entertainment District,,11033,7948,,"25.799,-80.190"
2,Brickell,Brickellite,31759,14541,West Brickell,"25.758,-80.193"
3,Buena Vista,,9058,3540,Buena Vista East Historic District and Design ...,"25.813,-80.192"
4,Coconut Grove,Grovite,20076,3091,"Center Grove, Northeast Coconut Grove, Southwe...","25.712,-80.257"


The data in the columns "Populations2010" and "Population/Km^2" have commas to indicate thousands. A consequence is that one can not directly transform the datatype to integers. The code below fixes this. Moreover, some entries cause an error, we collect the indices resulting in these errors for further investigation.

In [8]:
alert_list = []
for i in df_miami.index:
    try:
        df_miami.loc[i,"Population2010"] = int(df_miami.loc[i,"Population2010"].replace(",",""))
    except:
        alert_list.append(i)
    try: #two try statements necessary, otherwise failure to do first statement will lead to not do second statement
        df_miami.loc[i,df_miami.columns[3]] = int(df_miami.loc[i,df_miami.columns[3]].replace(",",""))
    except:
        alert_list.append(i)
print(set(alert_list))
df_miami.loc[set(alert_list)]

{16, 21, 22, 7}


Unnamed: 0,Neighborhood,Demonym,Population2010,Population/Km²,Sub-neighborhoods,Coordinates
16,Midtown,Midtowner,-,-,Edgewater and Wynwood,"25.807,-80.193"
21,Venetian Islands,,,,Biscayne Island and San Marco Island,"25.791,-80.161"
22,Virginia Key,,14,-,,"25.736,-80.155"
7,Downtown,Downtowner,"71,000 (13,635 CBD only)",10613,"Brickell, Central Business District (CBD), Dow...","25.774,-80.193"


I decided to drop the rows with indices 16,21 and 22 and set the population of Downtown to 71000.

In [9]:
df_miami.loc[7,"Population2010"]=71000
df_miami_1 = df_miami.drop([16,21,22],axis=0).reset_index(drop=True).astype({"Population2010":"int64" , df_miami.columns[3]: "int64"})
df_miami_1

Unnamed: 0,Neighborhood,Demonym,Population2010,Population/Km²,Sub-neighborhoods,Coordinates
0,Allapattah,,54289,4401,,"25.815,-80.224"
1,Arts & Entertainment District,,11033,7948,,"25.799,-80.190"
2,Brickell,Brickellite,31759,14541,West Brickell,"25.758,-80.193"
3,Buena Vista,,9058,3540,Buena Vista East Historic District and Design ...,"25.813,-80.192"
4,Coconut Grove,Grovite,20076,3091,"Center Grove, Northeast Coconut Grove, Southwe...","25.712,-80.257"
5,Coral Way,,35062,4496,"Coral Gate, Golden Pines, Shenandoah, and Silv...","25.750,-80.283"
6,Design District,,3573,3623,,"25.813,-80.193"
7,Downtown,Downtowner,71000,10613,"Brickell, Central Business District (CBD), Dow...","25.774,-80.193"
8,Edgewater,,15005,6675,,"25.802,-80.190"
9,Flagami,,50834,5665,"Alameda, Grapeland Heights, and Fairlawn","25.762,-80.316"


Now the "Coordinates" column is split in a "Latitude" and "Longitude" column. Also some unnecessary columns are removed.

In [10]:
temp = df_miami_1["Coordinates"].str.split(",",expand=True)
df_miami_1["Latitude"] = temp[0]
df_miami_1["Longitude"] = temp[1]
df_miami_1.drop(columns=["Demonym","Sub-neighborhoods","Coordinates"],inplace=True)
df_miami_1.head()

Unnamed: 0,Neighborhood,Population2010,Population/Km²,Latitude,Longitude
0,Allapattah,54289,4401,25.815,-80.224
1,Arts & Entertainment District,11033,7948,25.799,-80.19
2,Brickell,31759,14541,25.758,-80.193
3,Buena Vista,9058,3540,25.813,-80.192
4,Coconut Grove,20076,3091,25.712,-80.257


In [11]:
df_miami_1.dtypes

Neighborhood      object
Population2010     int64
Population/Km²     int64
Latitude          object
Longitude         object
dtype: object

The "Latitude" and "Longitude" columns are of the wrong type, we correct this, once again collecting indices which lead to errors.

In [12]:
alert_list = []
for i in df_miami_1.index:
    try:
        float(df_miami_1.loc[i,"Latitude"])
    except:
        alert_list.append(i)
    try: #two try statements necessary, otherwise failure to do first statement will lead to not do second statement
        float(df_miami_1.loc[i,"Longitude"])
    except:
        alert_list.append(i)
print(set(alert_list))

{11}


In [14]:
df_miami_1.loc[[11]]

Unnamed: 0,Neighborhood,Population2010,Population/Km²,Latitude,Longitude
11,Health District,2705,2148,,


Somehow this neighborhood has no coordinates given, I have solved this by making estimating the coordinates using google maps. I will enter coordinates (25.787,-80.204) for this neighborhood.

In [15]:
df_miami_1.loc[11,"Latitude"] = 25.787
df_miami_1.loc[11,"Longitude"] = -80.204
df_miami_clean = df_miami_1.astype({"Latitude":"float64","Longitude":"float64"})

## Physical distances between neighborhoods

One notices that some neighborhoods are closer to each other than others. This implies that using a fixed radius later on when searching for venues, could result in significant overlap. To prevent this, in what follows, for each neighborhood the distance to its nearest neighbor is calculated. The distances will then be used later on as the radius in which venues will be searched for.

In [16]:
def coord_distance_matrix(lat_series,long_series):
    if len(lat_series) != len(long_series):
        print("Number of lat/long does not match")
    else:
        dist_matr = np.empty((len(lat_series),len(lat_series)))
        for i in range(len(lat_series)):
            for j in range(len(lat_series)):
                dist_matr[i,j] = geopy.distance.distance((lat_series[i], long_series[i]),(lat_series[j], long_series[j])).m
    return dist_matr

def nearest_not_self_list(matrix):
    min_distance_list =[]
    nearest_index_list=[]
    for i in range(matrix.shape[0]):
        if np.argmin([x for n,x in enumerate(matrix[i,:]) if n != i]) < i:
            arg_min = np.argmin([x for n,x in enumerate(matrix[i,:]) if n != i])
        else:
            arg_min = np.argmin([x for n,x in enumerate(matrix[i,:]) if n != i])+1
        min_distance_list.append(matrix[i,arg_min])
        nearest_index_list.append(arg_min)
    return nearest_index_list , min_distance_list

In [17]:
nearest_neighbor_list , nearest_distance_list = nearest_not_self_list(coord_distance_matrix(df_miami_clean["Latitude"],df_miami_clean["Longitude"]))
df_miami_clean["Nearest Distance"] = nearest_distance_list
df_miami_clean["Nearest Neighborhood"] = [df_miami_clean.loc[m,"Neighborhood"] for m in nearest_neighbor_list]
df_miami_clean

Unnamed: 0,Neighborhood,Population2010,Population/Km²,Latitude,Longitude,Nearest Distance,Nearest Neighborhood
0,Allapattah,54289,4401,25.815,-80.224,1886.01716,Liberty City
1,Arts & Entertainment District,11033,7948,25.799,-80.19,332.354834,Edgewater
2,Brickell,31759,14541,25.758,-80.193,1421.888354,The Roads
3,Buena Vista,9058,3540,25.813,-80.192,100.275691,Design District
4,Coconut Grove,20076,3091,25.712,-80.257,4952.674627,Coral Way
5,Coral Way,35062,4496,25.75,-80.283,3567.623181,Flagami
6,Design District,3573,3623,25.813,-80.193,100.275691,Buena Vista
7,Downtown,71000,10613,25.774,-80.193,868.560717,Lummus Park
8,Edgewater,15005,6675,25.802,-80.19,332.354834,Arts & Entertainment District
9,Flagami,50834,5665,25.762,-80.316,3567.623181,Coral Way


One notices that some neighborhoods are very close together, since I will use these distances to search for venues, this would lead to very few venues for some neighborhoods. As a solution, I will merge neighborhoods whose centers are less than 800 meters apart. 
The algorithm for doing this works (informally) as follows: The two closest neighborhoods are merge into a single new one, with its center located at the midpoint between the original two. After the merge all distances between the neighborhoods are recalculated and the process repeats, until no neighborhoods lying closer than 800 meters together remain.

In [18]:
df_new = df_miami_clean
while True:
    dist_min = df_new["Nearest Distance"].min()
    if dist_min < 800:
        temp = [df_new[df_new["Nearest Distance"] == dist_min]["Neighborhood"].str.cat(sep=", "),
           df_new[df_new["Nearest Distance"] == dist_min]["Population2010"].sum(),
           df_new[df_new["Nearest Distance"] == dist_min].iloc[:,2].sum(),
           df_new[df_new["Nearest Distance"] == dist_min]["Latitude"].mean(),
           df_new[df_new["Nearest Distance"] == dist_min]["Longitude"].mean(),
           np.nan,
           "" 
           ]
        df_new = df_new.append(pd.Series(temp,index = df_new.columns),ignore_index=True).drop(df_new[df_new["Nearest Distance"] == dist_min].index).reset_index(drop=True)
        nearest_neighbor_list , nearest_distance_list = nearest_not_self_list(coord_distance_matrix(df_new["Latitude"],df_new["Longitude"]))
        df_new["Nearest Distance"] = nearest_distance_list
        df_new["Nearest Neighborhood"] = [df_miami_clean.loc[m,"Neighborhood"] for m in nearest_neighbor_list]
    else:
        break
df_new

Unnamed: 0,Neighborhood,Population2010,Population/Km²,Latitude,Longitude,Nearest Distance,Nearest Neighborhood
0,Allapattah,54289,4401,25.815,-80.224,1886.01716,Downtown
1,Brickell,31759,14541,25.758,-80.193,1421.888354,Liberty City
2,Coconut Grove,20076,3091,25.712,-80.257,4952.674627,Buena Vista
3,Coral Way,35062,4496,25.75,-80.283,3567.623181,Coral Way
4,Downtown,71000,10613,25.774,-80.193,868.560717,Grapeland Heights
5,Flagami,50834,5665,25.762,-80.316,3567.623181,Buena Vista
6,Grapeland Heights,14004,4130,25.792,-80.258,2410.500305,Little Havana
7,Liberty City,19725,3733,25.832,-80.225,1886.01716,Allapattah
8,Little Haiti,29760,3840,25.824,-80.191,1041.741114,Little Haiti
9,Little Havana,76163,8423,25.773,-80.215,1472.565929,Grapeland Heights


As an illustration, a map is drawn with a circle around the coordinates of each neighborhood. The radius of this circle is 3/4 times the "Nearest Distance" (which will be used later on when searching for venues), implying some overlap of these circles.

In [19]:
# create map of Miami using latitude and longitude values for the first neighborhood
map_miami = folium.Map(location=[df_new.loc[0,"Latitude"], df_new.loc[0,"Longitude"]], zoom_start=11)

# add markers to map
for lat, lng, neighborhood, r in zip(df_new['Latitude'], df_new['Longitude'], df_new['Neighborhood'], df_new["Nearest Distance"]):
    label = folium.Popup(neighborhood)
    folium.Circle(
        [lat, lng],
        radius=3*r/4,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_miami)  
    
map_miami

## Getting the Foursquare data

In [20]:
CLIENT_ID = 'K01FZFOYVCMZ5N2HETZPNWXLKNHIVZKHCCF1AKTAVIY4ZLXX' # your Foursquare ID
CLIENT_SECRET = 'KCLVJ3ROCHHXIINQDMIRUFBTL2I0VJSTJPUIJO2I4RPOYMZ5' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: K01FZFOYVCMZ5N2HETZPNWXLKNHIVZKHCCF1AKTAVIY4ZLXX
CLIENT_SECRET:KCLVJ3ROCHHXIINQDMIRUFBTL2I0VJSTJPUIJO2I4RPOYMZ5


In [21]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [22]:
def getNearbyVenues(neighborhoods, latitudes, longitudes, radii):
    
    venues_list=[]
    for hood, lat, lng, r in zip(neighborhoods, latitudes, longitudes, radii):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            3*r/4, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        if results:
            venues_list.append([(
                hood, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
        else:
            venues_list.append([(
            hood,
            lat,
            lng,
            "Nothing to see here",
            lat,
            lng,
            "Boring")])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [23]:
Miami_venues = getNearbyVenues(df_new["Neighborhood"], df_new["Latitude"], df_new["Longitude"], df_new["Nearest Distance"])
Miami_venues.head(20)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Allapattah,25.815,-80.224,Club Tipico Dominicano,25.809557,-80.218593,Nightclub
1,Allapattah,25.815,-80.224,Plaza Seafood Market,25.805638,-80.223992,Seafood Restaurant
2,Allapattah,25.815,-80.224,Snappers Fish & Chicken,25.82411,-80.22487,Seafood Restaurant
3,Allapattah,25.815,-80.224,Papo Llega y Pon,25.803466,-80.223886,Cuban Restaurant
4,Allapattah,25.815,-80.224,Family Dollar,25.807208,-80.223503,Discount Store
5,Allapattah,25.815,-80.224,Redbox,25.808122,-80.224456,Video Store
6,Allapattah,25.815,-80.224,Winn-Dixie,25.808179,-80.224911,Grocery Store
7,Allapattah,25.815,-80.224,McDonald's,25.809014,-80.232281,Fast Food Restaurant
8,Allapattah,25.815,-80.224,Charles Hadley Pool,25.819565,-80.216753,Park
9,Allapattah,25.815,-80.224,Subway,25.824341,-80.222184,Sandwich Place
