# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a restaurant. Specifically, this report is aimed at stakeholders interested in opening an **Argentinian restaurant** or **BBQ restaurant** in **Argentina**.

Argentina is well known for its great meat and its BBQ restaurants. Several of its main cities have become popular tourist destinations and have a lot of restaurants, including ones that offer local Argentinian food. Since there are so many restaurants and tourist cities in Argentina, we will try to detect **locations that are not already crowded with restaurants in general, or with Argentinian meat restaurants in particular**. We would also prefer locations **as close to the city center as possible**, assuming that the first two conditions are met.

We will use our data science powers to generate a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by stakeholders.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing restaurants near the city center (any type of restaurant)
* number of and distance to argentinian or BBQ restaurants near the city center

The following data sources will be needed to extract or generate the required information:
* CSV data for Argentina's main cities
* centers of candidate areas will be generated algorithmically and approximate addresses of centers of those areas will be obtained using **geopy.geocoders**
* number of restaurants and their type and location in every city will be obtained using **Foursquare API**

#### Import the needed dependencies

In [438]:
# Optional: enable code folding in the notebook (shell commands need a leading "!")
# !conda install jupyter_contrib_nbextensions
!jupyter nbextension enable codefolding/main

In [150]:
import requests
from geopy.geocoders import Nominatim
import folium
import pandas as pd
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np
from sklearn.cluster import KMeans

### Cities Candidates
We will use a CSV file from the web page https://simplemaps.com/. This file contains data for 503 prominent cities in Argentina, including 'city', 'lat', 'lng' and 'population' (updated to 2020) for each entry.

In [289]:
df = pd.read_csv('argentina.csv')
df

Unnamed: 0,city,lat,lng,country,iso2,admin_name,capital,population,population_proper
0,Buenos Aires,-34.60828,-58.372295,Argentina,AR,"Buenos Aires, Ciudad Autónoma de",primary,16157000.0,3054300.0
1,Córdoba,-31.41670,-64.183300,Argentina,AR,Córdoba,admin,1329604.0,1329604.0
2,Rosario,-32.95750,-60.639400,Argentina,AR,Santa Fe,minor,1193605.0,1193605.0
3,San Miguel de Tucumán,-26.81670,-65.216700,Argentina,AR,Tucumán,admin,605767.0,605767.0
4,Mar del Plata,-38.00000,-57.550000,Argentina,AR,Buenos Aires,minor,593337.0,593337.0
...,...,...,...,...,...,...,...,...,...
498,Camarones,-44.79710,-65.709900,Argentina,AR,Chubut,minor,,
499,Puelches,-38.14560,-65.914300,Argentina,AR,La Pampa,minor,,
500,Limay Mahuida,-37.15920,-66.675100,Argentina,AR,La Pampa,minor,,
501,Antofagasta de la Sierra,-26.05940,-67.406400,Argentina,AR,Catamarca,minor,,


To reduce the number of candidate cities, we keep only cities with a population of more than 50,000 people.

In [290]:
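# .copy() gives an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
df = df.loc[df['population_proper'] > 50000, ['city', 'lat', 'lng', 'capital', 'population_proper']].copy()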
df

Unnamed: 0,city,lat,lng,capital,population_proper
0,Buenos Aires,-34.60828,-58.372295,primary,3054300.0
1,Córdoba,-31.41670,-64.183300,admin,1329604.0
2,Rosario,-32.95750,-60.639400,minor,1193605.0
3,San Miguel de Tucumán,-26.81670,-65.216700,admin,605767.0
4,Mar del Plata,-38.00000,-57.550000,minor,593337.0
...,...,...,...,...,...
68,San Pedro,-24.21960,-64.870000,minor,52068.0
69,Punta Alta,-38.88000,-62.080000,minor,54730.0
71,Ushuaia,-54.80720,-68.304400,admin,56956.0
72,Azul,-36.78330,-59.850000,minor,55728.0


In [323]:
address = 'argentina'
# address = 'Centro, bariloche, argentina'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Argentina are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Argentina are -34.9964963, -64.9672817.


In [252]:
print(dist2center((-34.5997,-58.3819),(-34.640799799999996, -58.407074255774845)))

5110.5450562152455


To get the relative magnitude of each city's population, we normalize this column by its maximum value:

In [292]:
list_size = df['population_proper'] / df['population_proper'].max()
df['pop_norm'] = list_size
df

Unnamed: 0,city,lat,lng,capital,population_proper,pop_norm
0,Buenos Aires,-34.60828,-58.372295,primary,3054300.0,1.000000
1,Córdoba,-31.41670,-64.183300,admin,1329604.0,0.435322
2,Rosario,-32.95750,-60.639400,minor,1193605.0,0.390795
3,San Miguel de Tucumán,-26.81670,-65.216700,admin,605767.0,0.198333
4,Mar del Plata,-38.00000,-57.550000,minor,593337.0,0.194263
...,...,...,...,...,...,...
68,San Pedro,-24.21960,-64.870000,minor,52068.0,0.017047
69,Punta Alta,-38.88000,-62.080000,minor,54730.0,0.017919
71,Ushuaia,-54.80720,-68.304400,admin,56956.0,0.018648
72,Azul,-36.78330,-59.850000,minor,55728.0,0.018246
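
As a quick sanity check on the normalization, the sketch below recomputes `pop_norm` for three populations copied from the table above and applies the marker-radius formula `5 * size + 3` that the map cell in this notebook uses. It is written in plain Python rather than pandas, and the variable names are illustrative only.

```python
# Populations copied from the population_proper column of the table above.
populations = {
    "Buenos Aires": 3054300.0,
    "Córdoba": 1329604.0,
    "Azul": 55728.0,
}

largest = max(populations.values())

# Max-normalization: the largest city maps to 1.0, every other city to a value below 1.
pop_norm = {city: pop / largest for city, pop in populations.items()}

# Marker radius as used for the folium circles: always above 3 and at most 8.
radius = {city: 5 * size + 3 for city, size in pop_norm.items()}

for city in populations:
    print(f"{city}: pop_norm={pop_norm[city]:.6f}, radius={radius[city]:.2f}")
```

Because the smallest retained city has a `pop_norm` of about 0.017, most circles end up close to the minimum radius; only the few largest cities stand out visibly.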


This leaves 70 cities in Argentina that might be interesting.

Let's look at a map of Argentina on which each city is drawn as a circle whose size reflects its population:

In [293]:
map_argentina = folium.Map(location=[latitude, longitude], zoom_start=3.5)

# add one circle marker per city to the map
for lat, lng, city, pop, size in zip(df['lat'], df['lng'], df['city'], df['population_proper'], df['pop_norm']):
    label = '{},\n {}'.format(city, pop)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5 * size + 3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_argentina)  
    
map_argentina

### Foursquare
Now that we have our cities, let's use the Foursquare API to get info on restaurants.

We're interested in venues in the 'food' category, but only those that are proper restaurants: coffee shops, pizza places, bakeries etc. are not direct competitors, so we don't care about those. We will therefore keep only venues that have 'restaurant' in the category name, and we'll make sure to detect and include all the subcategories of the specific **Argentinian restaurant** or **BBQ** categories, as we need this information for each city.

Also, we will check those venues within a **radius of 1000 meters** from the city center.
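
To make the query parameters explicit, here is a minimal sketch of how the explore request could be assembled with `urllib.parse.urlencode` instead of manual string formatting. The endpoint and the parameter names are the ones used in the request cell of this notebook; the helper name `build_explore_url`, the placeholder credentials, the version string and the default `limit` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Endpoint as used in the request cell of this notebook.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, category_id, client_id, client_secret,
                      version, radius=1000, limit=100):
    """Assemble the explore URL; urlencode escapes every parameter value."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "categoryId": category_id,
        "radius": radius,  # meters around the city center
        "limit": limit,    # assumed default, not taken from the notebook
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"

# Placeholder credentials and version string: replace with real values.
url = build_explore_url(-34.60828, -58.372295, "4d4b7105d754a06374d81259",
                        client_id="YOUR_ID", client_secret="YOUR_SECRET",
                        version="20200101")
print(url)
```

Building the query from a dictionary keeps each parameter visible in one place and avoids mistakes in the order of positional `format` arguments.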

In [83]:
# function that extracts the category of the venue
# def get_category_type(row):

#     try:
#         categories_list = row['categories']
#     except:
#         categories_list = row['venue.categories']
        
#     if len(categories_list) == 0:
#         return None
#     else:
#         return categories_list[0]['name']

In [294]:
def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if specific_filter is not None and category_id in specific_filter:
            specific = True
            restaurant = True
    return restaurant, specific
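
The function expects `categories` as an iterable of `(name, id)` pairs. The sketch below repeats a condensed copy of it so that the example runs on its own, and calls it with made-up category IDs (they are placeholders, not real Foursquare IDs).

```python
def is_restaurant(categories, specific_filter=None):
    """Condensed copy of the function above, repeated so the example runs alone."""
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for name, category_id in categories:
        name = name.lower()
        if any(word in name for word in restaurant_words):
            restaurant = True
        if 'fast food' in name:
            restaurant = False
        if specific_filter is not None and category_id in specific_filter:
            specific = True
            restaurant = True
    return restaurant, specific

# Hypothetical category IDs, for illustration only.
ARGENTINIAN_ID = 'id-argentinian'
BBQ_ID = 'id-bbq'
target_ids = [ARGENTINIAN_ID, BBQ_ID]

argentinian = is_restaurant([('Argentinian Restaurant', ARGENTINIAN_ID)], target_ids)
bbq = is_restaurant([('BBQ Joint', BBQ_ID)], target_ids)
fast_food = is_restaurant([('Fast Food Restaurant', 'id-fast-food')], target_ids)
cafe = is_restaurant([('Café', 'id-cafe')], target_ids)
print(argentinian, bbq, fast_food, cafe)
```

Note that 'BBQ Joint' contains none of the keywords, so it is accepted only because its ID is in `specific_filter`; without the filter it would be rejected.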

In [295]:
def getNearbyRestaurants(names, latitudes, longitudes, category, radius=1000):
    """Get nearby venues of a given Foursquare category for every city.

    Relies on the module-level constants CLIENT_ID, CLIENT_SECRET, VERSION
    and LIMIT, which must be defined before the function is called.

    inputs:
        "names", pandas series, city name
        "latitudes", pandas series, city lat
        "longitudes", pandas series, city lon
        "category", str, Foursquare category ID
        "radius", search radius in meters, =1000 by default
    output:
        nearby_venues, pandas df with columns: ['City', 'City Latitude',
            'City Longitude', 'restaurant', 'restaurant Latitude',
            'restaurant Longitude', 'restaurant Category']
    """
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            category,
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'restaurant', 
                  'restaurant Latitude', 
                  'restaurant Longitude', 
                  'restaurant Category']
    
    return(nearby_venues)

In [296]:
food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

argentina_venues = getNearbyRestaurants(df['city'],  df['lat'], df['lng'], food_category, radius=1000)

Buenos Aires
Córdoba
Rosario
San Miguel de Tucumán
Mar del Plata
Salta
San Juan
Lanús
Santa Fe
Corrientes
San Salvador de Jujuy
Bahía Blanca
Resistencia
Posadas
Santiago del Estero
Paraná
Merlo
Neuquén
Quilmes
Formosa
José C. Paz
Godoy Cruz
La Plata
La Rioja
Berazategui
Comodoro Rivadavia
San Luis
José María Ezeiza
San Nicolás de los Arroyos
Catamarca
Concordia
Florencio Varela
San Justo
San Martín
Tandil
Puerto Madryn
Mendoza
San Carlos de Bariloche
Villa Mercedes
Gualeguaychú
San Rafael
La Banda
Santa Rosa
Berisso
Morón
Zárate
Río Gallegos
Caseros
Rafaela
Pergamino
Campana
Presidencia Roque Sáenz Peña
Luján
Pilar
Necochea
Villa María
San Ramón de la Nueva Orán
Concepción del Uruguay
Goya
Reconquista
Las Heras
Oberá
San Francisco
Los Polvorines
Tartagal
San Pedro
Punta Alta
Ushuaia
Azul
Cruz del Eje


In [94]:
argentina_venues.rename(columns={'Neighborhood Latitude':'Latitude', 'Neighborhood Longitude': 'Longitude'}, inplace=True)

In [297]:
argentina_venues

Unnamed: 0,City,City Latitude,City Longitude,restaurant,restaurant Latitude,restaurant Longitude,restaurant Category
0,Buenos Aires,-34.60828,-58.372295,Pertutti,-34.609195,-58.373392,Argentinian Restaurant
1,Buenos Aires,-34.60828,-58.372295,London City,-34.608505,-58.374874,Café
2,Buenos Aires,-34.60828,-58.372295,Piazzolla Tango,-34.606099,-58.374899,Theme Restaurant
3,Buenos Aires,-34.60828,-58.372295,FuraiBo,-34.610105,-58.372549,Japanese Restaurant
4,Buenos Aires,-34.60828,-58.372295,Adorado Bar,-34.611895,-58.373358,Café
...,...,...,...,...,...,...,...
958,Azul,-36.78330,-59.850000,Parrilla El Carrito,-36.789406,-59.858032,BBQ Joint
959,Cruz del Eje,-30.73330,-64.800000,El Cantaro,-30.727358,-64.800040,Deli / Bodega
960,Cruz del Eje,-30.73330,-64.800000,Pizza Uno,-30.726104,-64.803277,Pizza Place
961,Cruz del Eje,-30.73330,-64.800000,Viejo Munich,-30.736909,-64.791571,German Restaurant


Now we'll add a column that contains the distance (in meters) of each restaurant from the city center:

In [298]:
import geopy.distance

def dist2center(center_coor, rest__coor):
    """calculate the distance of a restaurant from the city center
    inputs: center_coor, rest__coor - tuples(lat, lon)
    output: dist, float
    """
    return geopy.distance.distance(center_coor, rest__coor).m
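
For readers without geopy installed, the following sketch approximates the same distance with the haversine formula on a spherical Earth. This is a swapped-in technique, not what `dist2center` does: `geopy.distance.distance` computes a geodesic on an ellipsoid, so the two results differ slightly. The function name `haversine_m` and the radius constant are assumptions of this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation

def haversine_m(center_coor, rest_coor):
    """Great-circle distance in meters between two (lat, lon) tuples in degrees."""
    lat1, lon1 = map(radians, center_coor)
    lat2, lon2 = map(radians, rest_coor)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Same pair of points as the dist2center check earlier in this notebook.
d = haversine_m((-34.5997, -58.3819), (-34.640799799999996, -58.407074255774845))
print(round(d, 1))
```

For this pair of points the result should land within a few meters of the 5110.5 m reported by `dist2center`, which is more than accurate enough for a 1 km search radius.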
    

In [299]:
dist_list = []
for center_lat, center_lon, rest_lat, rest_lon in zip(argentina_venues['City Latitude'], 
                                                      argentina_venues['City Longitude'], 
                                                      argentina_venues['restaurant Latitude'], 
                                                      argentina_venues['restaurant Longitude']):
    dist_list.append(dist2center((center_lat, center_lon),(rest_lat, rest_lon)))

# geopy.distance.distance(coords_1, coords_2).m

argentina_venues['dist2center'] = dist_list
argentina_venues

Unnamed: 0,City,City Latitude,City Longitude,restaurant,restaurant Latitude,restaurant Longitude,restaurant Category,dist2center
0,Buenos Aires,-34.60828,-58.372295,Pertutti,-34.609195,-58.373392,Argentinian Restaurant,142.875942
1,Buenos Aires,-34.60828,-58.372295,London City,-34.608505,-58.374874,Café,237.840260
2,Buenos Aires,-34.60828,-58.372295,Piazzolla Tango,-34.606099,-58.374899,Theme Restaurant,340.024866
3,Buenos Aires,-34.60828,-58.372295,FuraiBo,-34.610105,-58.372549,Japanese Restaurant,203.750072
4,Buenos Aires,-34.60828,-58.372295,Adorado Bar,-34.611895,-58.373358,Café,412.681522
...,...,...,...,...,...,...,...,...
958,Azul,-36.78330,-59.850000,Parrilla El Carrito,-36.789406,-59.858032,BBQ Joint,986.483509
959,Cruz del Eje,-30.73330,-64.800000,El Cantaro,-30.727358,-64.800040,Deli / Bodega,658.770026
960,Cruz del Eje,-30.73330,-64.800000,Pizza Uno,-30.726104,-64.803277,Pizza Place,857.334495
961,Cruz del Eje,-30.73330,-64.800000,Viejo Munich,-30.736909,-64.791571,German Restaurant,900.942656


Now, let's **remove all the venues that are not of a restaurant type**:

In [366]:
list_of_venues

array(['Argentinian Restaurant', 'Café', 'Theme Restaurant',
       'Japanese Restaurant', 'Burger Joint', 'Salad Place', 'BBQ Joint',
       'Restaurant', 'Deli / Bodega', 'Bistro', 'Italian Restaurant',
       'Mediterranean Restaurant', 'Sandwich Place', 'French Restaurant',
       'Vegetarian / Vegan Restaurant', 'Sushi Restaurant',
       'Venezuelan Restaurant', 'Creperie', 'Steakhouse',
       'Seafood Restaurant', 'Empanada Restaurant', 'Bakery',
       'Pizza Place', 'Asian Restaurant', 'Chinese Restaurant',
       'Gastropub', 'American Restaurant', 'Donut Shop',
       'Spanish Restaurant', 'Fast Food Restaurant',
       'Peruvian Restaurant', 'Breakfast Spot', 'Turkish Restaurant',
       'Middle Eastern Restaurant', 'Diner', 'Comfort Food Restaurant',
       'South American Restaurant', 'Mexican Restaurant',
       'African Restaurant', 'German Restaurant', 'Bagel Shop',
       'Food Stand', 'Snack Place', 'Latin American Restaurant',
       'Cafeteria', 'Food', 'Hot Dog J

In [368]:
list_of_venues = argentina_venues['restaurant Category'].unique()
matchers = ['argentinian','bbq', 'restaurant', 'diner', 'taverna', 'steakhouse']
matching = [s for s in list_of_venues if any(xs in s.lower() for xs in matchers)]

argentina_restaurants = argentina_venues[argentina_venues['restaurant Category'].isin(matching)]
argentina_restaurants

Unnamed: 0,City,City Latitude,City Longitude,restaurant,restaurant Latitude,restaurant Longitude,restaurant Category,dist2center
0,Buenos Aires,-34.60828,-58.372295,Pertutti,-34.609195,-58.373392,Argentinian Restaurant,142.875942
2,Buenos Aires,-34.60828,-58.372295,Piazzolla Tango,-34.606099,-58.374899,Theme Restaurant,340.024866
3,Buenos Aires,-34.60828,-58.372295,FuraiBo,-34.610105,-58.372549,Japanese Restaurant,203.750072
7,Buenos Aires,-34.60828,-58.372295,Elauge Hermanos,-34.609276,-58.375499,BBQ Joint,313.927511
8,Buenos Aires,-34.60828,-58.372295,Bidou de la Merced,-34.605619,-58.373134,Restaurant,305.112268
...,...,...,...,...,...,...,...,...
949,Ushuaia,-54.80720,-68.304400,Gelido,-54.808848,-68.314453,Fast Food Restaurant,671.930813
951,Ushuaia,-54.80720,-68.304400,Bamboo,-54.805051,-68.299648,Chinese Restaurant,388.071016
958,Azul,-36.78330,-59.850000,Parrilla El Carrito,-36.789406,-59.858032,BBQ Joint,986.483509
961,Cruz del Eje,-30.73330,-64.800000,Viejo Munich,-30.736909,-64.791571,German Restaurant,900.942656
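
The keyword filter can be tried in isolation on a handful of category names taken from the list of venue categories above. The helper name `matches` is introduced here for illustration only.

```python
matchers = ['argentinian', 'bbq', 'restaurant', 'diner', 'taverna', 'steakhouse']

# A few category names taken from the list of venue categories above.
categories = ['Argentinian Restaurant', 'Café', 'BBQ Joint', 'Steakhouse',
              'Pizza Place', 'Fast Food Restaurant', 'Bakery']

def matches(category):
    """True if any matcher keyword occurs in the lower-cased category name."""
    name = category.lower()
    return any(keyword in name for keyword in matchers)

kept = [c for c in categories if matches(c)]
dropped = [c for c in categories if not matches(c)]
print('kept:   ', kept)
print('dropped:', dropped)
```

One consequence of matching on the substring 'restaurant' is that 'Fast Food Restaurant' is kept, which agrees with the filtered table above. If fast food venues should not count as competitors, an explicit exclusion would have to be added.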


The number of restaurants in each city:

In [394]:
argentina_restaurants.groupby('City').count()

Unnamed: 0_level_0,City Latitude,City Longitude,restaurant,restaurant Latitude,restaurant Longitude,restaurant Category,dist2center
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Azul,1,1,1,1,1,1,1
Bahía Blanca,10,10,10,10,10,10,10
Berazategui,4,4,4,4,4,4,4
Buenos Aires,55,55,55,55,55,55,55
Caseros,2,2,2,2,2,2,2
Catamarca,8,8,8,8,8,8,8
Comodoro Rivadavia,3,3,3,3,3,3,3
Concepción del Uruguay,2,2,2,2,2,2,2
Concordia,2,2,2,2,2,2,2
Corrientes,3,3,3,3,3,3,3


In [395]:
print('There are {} unique categories.'.format(len(argentina_restaurants['restaurant Category'].unique())))

There are 34 unique categories.


Now, we will turn the categories into columns, and calculate the frequencies of each category in the vicinity of the city center: 

In [372]:
# one hot encoding
argentina_onehot = pd.get_dummies(argentina_restaurants[['restaurant Category']], prefix="", prefix_sep="")

# add City column back to dataframe
argentina_onehot['City'] = argentina_restaurants['City'] 

# move City column to the first column
fixed_columns = [argentina_onehot.columns[-1]] + list(argentina_onehot.columns[:-1])
argentina_onehot = argentina_onehot[fixed_columns]

argentina_onehot.shape

(486, 35)

In [373]:
argentina_grouped = argentina_onehot.groupby('City').mean().reset_index()
argentina_grouped

Unnamed: 0,City,African Restaurant,American Restaurant,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Chinese Restaurant,Comfort Food Restaurant,Diner,Empanada Restaurant,...,South American Restaurant,Spanish Restaurant,Steakhouse,Sushi Restaurant,Swiss Restaurant,Tapas Restaurant,Theme Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant
0,Azul,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bahía Blanca,0.0,0.0,0.3,0.0,0.1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0
2,Berazategui,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Buenos Aires,0.0,0.018182,0.345455,0.018182,0.036364,0.018182,0.0,0.0,0.018182,...,0.0,0.036364,0.036364,0.036364,0.0,0.0,0.018182,0.0,0.054545,0.018182
4,Caseros,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Catamarca,0.0,0.0,0.375,0.0,0.125,0.0,0.0,0.125,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Comodoro Rivadavia,0.0,0.0,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Concepción del Uruguay,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Concordia,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Corrientes,0.0,0.0,0.0,0.0,0.666667,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
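
The one-hot encoding followed by a per-city mean is equivalent to computing, for each city, the share of its restaurants that fall into each category. A plain-Python sketch of the same calculation on toy rows (the city names and rows are hypothetical) looks like this:

```python
from collections import Counter, defaultdict

# Toy (city, category) rows; the city names are hypothetical.
rows = [
    ('City A', 'Argentinian Restaurant'),
    ('City A', 'BBQ Joint'),
    ('City A', 'Argentinian Restaurant'),
    ('City A', 'Italian Restaurant'),
    ('City B', 'BBQ Joint'),
]

counts = defaultdict(Counter)
for city, category in rows:
    counts[city][category] += 1

# Relative frequency = category count / total restaurants in that city.
freqs = {
    city: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for city, counter in counts.items()
}

for city, table in freqs.items():
    print(city, table)
```

The frequencies carry no information about how many restaurants a city has: a city with a single BBQ joint gets a frequency of 1.0 for that category, as Azul does in the table above. The absolute counts from the previous step should therefore be read alongside these shares.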


Let's print the top 10 restaurant types in each city:

In [374]:
num_top_venues = 10

for city in argentina_grouped['City']:
    print("----"+city+"----")
    temp = argentina_grouped[argentina_grouped['City'] == city].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Azul----
                       venue  freq
0                  BBQ Joint   1.0
1         African Restaurant   0.0
2                 Steakhouse   0.0
3          Paella Restaurant   0.0
4        Peruvian Restaurant   0.0
5                 Restaurant   0.0
6         Seafood Restaurant   0.0
7  South American Restaurant   0.0
8         Spanish Restaurant   0.0
9           Sushi Restaurant   0.0


----Bahía Blanca----
                       venue  freq
0     Argentinian Restaurant   0.3
1       Fast Food Restaurant   0.2
2         Italian Restaurant   0.2
3                  BBQ Joint   0.1
4           Tapas Restaurant   0.1
5          German Restaurant   0.1
6                 Steakhouse   0.0
7                 Restaurant   0.0
8         Seafood Restaurant   0.0
9  South American Restaurant   0.0


----Berazategui----
                       venue  freq
0       Fast Food Restaurant  0.50
1     Argentinian Restaurant  0.25
2         Chinese Restaurant  0.25
3           Sushi Restaurant  0.

9              Steakhouse  0.00


----Reconquista----
                       venue  freq
0     Argentinian Restaurant  0.67
1                 Restaurant  0.33
2         African Restaurant  0.00
3                 Steakhouse  0.00
4          Paella Restaurant  0.00
5        Peruvian Restaurant  0.00
6         Seafood Restaurant  0.00
7  South American Restaurant  0.00
8         Spanish Restaurant  0.00
9           Sushi Restaurant  0.00


----Resistencia----
                       venue  freq
0     Argentinian Restaurant   0.4
1                 Restaurant   0.3
2                  BBQ Joint   0.1
3        Empanada Restaurant   0.1
4           Sushi Restaurant   0.1
5         African Restaurant   0.0
6                 Steakhouse   0.0
7        Peruvian Restaurant   0.0
8         Seafood Restaurant   0.0
9  South American Restaurant   0.0


----Rosario----
                       venue  freq
0     Argentinian Restaurant  0.36
1                 Restaurant  0.18
2                  BBQ Joint  0

In [527]:
def return_most_common_venues(row, num_top_venues):
    """
    Inputs: 'row' - pd.Series whose first entry is the city name, followed by the category frequencies
            'num_top_venues' - int
    Outputs: a numpy.ndarray of length 'num_top_venues' with the names of the most frequent
             categories (non-zero frequency only), padded with '-'
    """
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    # keep only the categories that actually occur in this city
    top = [name for name, freq in row_categories_sorted.items() if freq != 0]
    # pad with '-' up to num_top_venues (previously hard-coded to 10)
    while len(top) < num_top_venues:
        top.append('-')
    return np.array(top)[0:num_top_venues]

We collect the most common categories of each city in a pandas DataFrame:

In [528]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['City'] = argentina_grouped['City']

for ind in np.arange(argentina_grouped.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(argentina_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Azul,BBQ Joint,-,-,-,-,-,-,-,-,-
1,Bahía Blanca,Argentinian Restaurant,Italian Restaurant,Fast Food Restaurant,Tapas Restaurant,German Restaurant,BBQ Joint,-,-,-,-
2,Berazategui,Fast Food Restaurant,Argentinian Restaurant,Chinese Restaurant,-,-,-,-,-,-,-
3,Buenos Aires,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant
4,Caseros,Argentinian Restaurant,Fast Food Restaurant,-,-,-,-,-,-,-,-
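
The ranking-and-padding logic can also be expressed without pandas. The sketch below uses the Bahía Blanca frequencies printed in the per-city listing above; the function name `top_categories` is hypothetical.

```python
def top_categories(freqs, num_top=10, filler='-'):
    """Return the non-zero categories sorted by frequency, padded to num_top."""
    nonzero = [(cat, f) for cat, f in freqs.items() if f != 0]
    ranked = [cat for cat, _ in sorted(nonzero, key=lambda item: item[1], reverse=True)]
    ranked = ranked[:num_top]
    return ranked + [filler] * (num_top - len(ranked))

# Frequencies for Bahía Blanca as printed in the per-city listing above.
bahia_blanca = {
    'Argentinian Restaurant': 0.3,
    'Fast Food Restaurant': 0.2,
    'Italian Restaurant': 0.2,
    'BBQ Joint': 0.1,
    'Tapas Restaurant': 0.1,
    'German Restaurant': 0.1,
    'Steakhouse': 0.0,
}

result = top_categories(bahia_blanca)
print(result)
```

Categories with equal frequency may appear in a different order than in the table above, because the tie order depends on the sorting algorithm. The ranking among tied categories should therefore not be over-interpreted.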


## Using folium

We build a DataFrame for plotting on a map; it contains all the restaurants:

In [530]:
argentina_merged = argentina_restaurants.join(city_venues_sorted.set_index('City'), on='City')

argentina_merged.head()

Unnamed: 0,City,City Latitude,City Longitude,restaurant,restaurant Latitude,restaurant Longitude,restaurant Category,dist2center,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Buenos Aires,-34.60828,-58.372295,Pertutti,-34.609195,-58.373392,Argentinian Restaurant,142.875942,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant
2,Buenos Aires,-34.60828,-58.372295,Piazzolla Tango,-34.606099,-58.374899,Theme Restaurant,340.024866,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant
3,Buenos Aires,-34.60828,-58.372295,FuraiBo,-34.610105,-58.372549,Japanese Restaurant,203.750072,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant
7,Buenos Aires,-34.60828,-58.372295,Elauge Hermanos,-34.609276,-58.375499,BBQ Joint,313.927511,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant
8,Buenos Aires,-34.60828,-58.372295,Bidou de la Merced,-34.605619,-58.373134,Restaurant,305.112268,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant


We show all the restaurants near each city center on a map; the categories we are interested in are colored **red**, and all others **blue**:

In [535]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=5)

# set color scheme for the clusters
# x = np.arange(kclusters)
# ys = [i + x + (i*x)**2 for i in range(kclusters)]
# colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
# rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, r_lat, r_lon, r_name, dist, cat in zip(argentina_merged['City Latitude'], 
                                  argentina_merged['City Longitude'], 
                                  argentina_merged['City'], 
                                  #argentina_merged['Cluster Labels'],
                                  argentina_merged['restaurant Latitude'],
                                  argentina_merged['restaurant Longitude'],
                                  argentina_merged['restaurant'],
                                  argentina_merged['dist2center'],
                                  argentina_merged['restaurant Category']):
    r_label = folium.Popup(r_name + '\n' + cat + '\n' + str(round(dist,2)) + '[m]' ,parse_html=True)
    if cat in ['Argentinian Restaurant', 'BBQ Joint']:
        folium.Marker(
            [r_lat, r_lon],
            icon=folium.Icon(color='red'),
            popup=r_label).add_to(map_clusters)
    else:
        folium.Marker(
            [r_lat, r_lon],
            icon=folium.Icon(color='blue'),
            popup=r_label).add_to(map_clusters)
    # mark the city center; the loop does not unpack a cluster label, so a fixed color is used
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='green',
        fill_opacity=0.7).add_to(map_clusters)


# except:
#     print(lat, lon, poi, "------>   Has no near venues")
map_clusters

Now we have all the restaurants within a radius of 1 km from the center of every main city in Argentina, and we know which ones are local meat or BBQ restaurants.

This concludes the data gathering phase. We're now ready to use this data for analysis to produce the report on optimal locations for a new restaurant.

## Methodology <a name="methodology"></a>

In this project we direct our efforts at detecting cities in Argentina that have a low restaurant density near the city center, particularly those with a low number of Argentinian restaurants. We limit our analysis to an area of ~1 km around each city center.

In the first step we collected the required **data: location and type (category) of every food venue within 1 km of the center of each candidate city**, computed each venue's distance to the city center, and kept only restaurant-type venues. We also **identified Argentinian restaurants and BBQ joints** (according to Foursquare categorization).

In the second step we explore the **restaurant mix** of each city: we count the restaurants near each center, turn the categories into per-city relative frequencies, and list the most common categories per city, looking for cities with few restaurants in general and few Argentinian or BBQ restaurants in particular.

In the third and final step we **cluster the cities by their category frequencies** (using **k-means clustering**) to group cities with a similar restaurant profile. These groups should be a starting point for a closer, 'street level' exploration and search for the optimal venue location by stakeholders.

## Modeling

In [377]:
# set number of clusters
kclusters = 5

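# the positional axis argument of drop() is not accepted by recent pandas versions; use columns= instead
argentina_grouped_clustering = argentina_grouped.drop(columns='City')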

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(argentina_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:20] 

array([3, 0, 0, 2, 0, 2, 2, 4, 4, 3, 3, 2, 3, 4, 3, 4, 2, 2, 4, 2])
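
To illustrate what the clustering does with the frequency vectors, here is a pure-Python sketch of the basic Lloyd iteration behind k-means, run on invented two-dimensional profiles. It is not scikit-learn's implementation: `KMeans` uses k-means++ initialization by default and works on all category columns, whereas this toy version takes fixed starting centroids so that the result is deterministic. All names and values in the block are assumptions for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd_kmeans(points, centroids, iterations=10):
    """Plain Lloyd iteration: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    labels = []
    for _ in range(iterations):
        # Assignment step.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step (an empty cluster keeps its previous centroid).
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Toy profiles: (share of Argentinian restaurants, share of BBQ joints).
# Values are invented for illustration.
profiles = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.7), (0.3, 0.1)]
labels, centers = lloyd_kmeans(profiles, centroids=[(1.0, 0.0), (0.0, 1.0)])
print(labels)
```

Cities dominated by Argentinian restaurants end up in one group and BBQ-heavy cities in the other. The cluster numbers themselves are arbitrary identifiers and carry no ranking.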

Add the clustering labels to the table of most common venues:

In [519]:
city_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [439]:
df = df.rename(columns={'city': 'City'})  # match the 'City' key used by the other tables
df

Unnamed: 0,City,lat,lng,capital,population_proper,pop_norm
0,Buenos Aires,-34.60828,-58.372295,primary,3054300.0,1.000000
1,Córdoba,-31.41670,-64.183300,admin,1329604.0,0.435322
2,Rosario,-32.95750,-60.639400,minor,1193605.0,0.390795
3,San Miguel de Tucumán,-26.81670,-65.216700,admin,605767.0,0.198333
4,Mar del Plata,-38.00000,-57.550000,minor,593337.0,0.194263
...,...,...,...,...,...,...
68,San Pedro,-24.21960,-64.870000,minor,52068.0,0.017047
69,Punta Alta,-38.88000,-62.080000,minor,54730.0,0.017919
71,Ushuaia,-54.80720,-68.304400,admin,56956.0,0.018648
72,Azul,-36.78330,-59.850000,minor,55728.0,0.018246


In [520]:
city_venues_sorted

Unnamed: 0,Cluster Labels,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,3,Azul,BBQ Joint,-,-,-,-,-,-,-,-,-
1,0,Bahía Blanca,Argentinian Restaurant,Italian Restaurant,Fast Food Restaurant,Tapas Restaurant,German Restaurant,BBQ Joint,-,-,-,-
2,0,Berazategui,Fast Food Restaurant,Argentinian Restaurant,Chinese Restaurant,-,-,-,-,-,-,-
3,2,Buenos Aires,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant
4,0,Caseros,Argentinian Restaurant,Fast Food Restaurant,-,-,-,-,-,-,-,-
5,2,Catamarca,Argentinian Restaurant,Restaurant,Diner,BBQ Joint,-,-,-,-,-,-
6,2,Comodoro Rivadavia,Argentinian Restaurant,BBQ Joint,Restaurant,-,-,-,-,-,-,-
7,4,Concepción del Uruguay,Argentinian Restaurant,-,-,-,-,-,-,-,-,-
8,4,Concordia,Argentinian Restaurant,-,-,-,-,-,-,-,-,-
9,3,Corrientes,BBQ Joint,Restaurant,-,-,-,-,-,-,-,-


Merge `city_venues_sorted` with `argentina_restaurants`, then with `df`, to add the latitude/longitude of each city:

In [521]:
# attach the cluster label to every restaurant row (left join keeps all restaurants)
argentina_cluster_merged = pd.merge(argentina_restaurants.set_index('City'), city_venues_sorted, on='City', how='left')
# collapse to one row per city; the label is constant within a city, so mean() just returns it
argentina_cluster_merged = argentina_cluster_merged.groupby('City')[['Cluster Labels']].mean()
# bring back the ranked venue columns, then the coordinates and population from df
argentina_cluster_merged = pd.merge(argentina_cluster_merged, city_venues_sorted, on='City')
argentina_cluster_merged = pd.merge(argentina_cluster_merged, df, on='City').drop(columns=['Cluster Labels_x'])
argentina_cluster_merged.rename(columns={'lat': 'City Latitude', 'lng':'City Longitude','Cluster Labels_y':'Cluster Labels'}, inplace=True)
argentina_cluster_merged

Unnamed: 0,City,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,City Latitude,City Longitude,capital,population_proper,pop_norm
0,Azul,3,BBQ Joint,-,-,-,-,-,-,-,-,-,-36.7833,-59.85,minor,55728.0,0.018246
1,Bahía Blanca,0,Argentinian Restaurant,Italian Restaurant,Fast Food Restaurant,Tapas Restaurant,German Restaurant,BBQ Joint,-,-,-,-,-38.7167,-62.2667,minor,299101.0,0.097928
2,Berazategui,0,Fast Food Restaurant,Argentinian Restaurant,Chinese Restaurant,-,-,-,-,-,-,-,-34.7679,-58.2133,minor,180523.0,0.059105
3,Buenos Aires,2,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant,-34.60828,-58.372295,primary,3054300.0,1.0
4,Caseros,0,Argentinian Restaurant,Fast Food Restaurant,-,-,-,-,-,-,-,-,-34.6167,-58.5333,minor,95785.0,0.031361
5,Catamarca,2,Argentinian Restaurant,Restaurant,Diner,BBQ Joint,-,-,-,-,-,-,-28.4686,-65.7792,admin,159139.0,0.052103
6,Comodoro Rivadavia,2,Argentinian Restaurant,BBQ Joint,Restaurant,-,-,-,-,-,-,-,-45.8667,-67.5,minor,173266.0,0.056729
7,Concepción del Uruguay,4,Argentinian Restaurant,-,-,-,-,-,-,-,-,-,-32.4833,-58.2333,minor,72528.0,0.023746
8,Concordia,4,Argentinian Restaurant,-,-,-,-,-,-,-,-,-,-31.3922,-58.0169,minor,149450.0,0.048931
9,Corrientes,3,BBQ Joint,Restaurant,-,-,-,-,-,-,-,-,-27.4833,-58.8167,admin,352646.0,0.115459


In [522]:
argentina_cluster_merged.loc[argentina_cluster_merged['Cluster Labels'] == 0, argentina_cluster_merged.columns[[0] + list(range(2, 12))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Bahía Blanca,Argentinian Restaurant,Italian Restaurant,Fast Food Restaurant,Tapas Restaurant,German Restaurant,BBQ Joint,-,-,-,-
2,Berazategui,Fast Food Restaurant,Argentinian Restaurant,Chinese Restaurant,-,-,-,-,-,-,-
4,Caseros,Argentinian Restaurant,Fast Food Restaurant,-,-,-,-,-,-,-,-
20,Las Heras,Italian Restaurant,Fast Food Restaurant,-,-,-,-,-,-,-,-
24,Merlo,Argentinian Restaurant,Fast Food Restaurant,Italian Restaurant,Restaurant,-,-,-,-,-,-
26,Neuquén,Fast Food Restaurant,Argentinian Restaurant,BBQ Joint,-,-,-,-,-,-,-
30,Pilar,Fast Food Restaurant,Italian Restaurant,Argentinian Restaurant,BBQ Joint,-,-,-,-,-,-
43,San Justo,Fast Food Restaurant,-,-,-,-,-,-,-,-,-
54,Santiago del Estero,Italian Restaurant,Fast Food Restaurant,-,-,-,-,-,-,-,-


In [523]:
argentina_cluster_merged.loc[argentina_cluster_merged['Cluster Labels'] == 1, argentina_cluster_merged.columns[[0] + list(range(2, 12))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,Oberá,Restaurant,-,-,-,-,-,-,-,-,-
45,San Martín,Restaurant,-,-,-,-,-,-,-,-,-
49,San Rafael,Restaurant,Diner,-,-,-,-,-,-,-,-
53,Santa Rosa,Restaurant,-,-,-,-,-,-,-,-,-
55,Tandil,Restaurant,Fast Food Restaurant,Seafood Restaurant,-,-,-,-,-,-,-


In [524]:
argentina_cluster_merged.loc[argentina_cluster_merged['Cluster Labels'] == 2, argentina_cluster_merged.columns[[0] + list(range(2, 12))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Buenos Aires,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant
5,Catamarca,Argentinian Restaurant,Restaurant,Diner,BBQ Joint,-,-,-,-,-,-
6,Comodoro Rivadavia,Argentinian Restaurant,BBQ Joint,Restaurant,-,-,-,-,-,-,-
11,Córdoba,Argentinian Restaurant,Restaurant,Fast Food Restaurant,Diner,Italian Restaurant,Vegetarian / Vegan Restaurant,Middle Eastern Restaurant,French Restaurant,Mexican Restaurant,Comfort Food Restaurant
16,La Banda,Argentinian Restaurant,BBQ Joint,Middle Eastern Restaurant,Restaurant,-,-,-,-,-,-
17,La Plata,Mediterranean Restaurant,Mexican Restaurant,Restaurant,Diner,-,-,-,-,-,-
19,Lanús,Argentinian Restaurant,Restaurant,Mexican Restaurant,Sushi Restaurant,Middle Eastern Restaurant,Italian Restaurant,-,-,-,-
21,Luján,Italian Restaurant,Restaurant,-,-,-,-,-,-,-,-
22,Mar del Plata,Restaurant,Argentinian Restaurant,Italian Restaurant,Fast Food Restaurant,Empanada Restaurant,Mediterranean Restaurant,Mexican Restaurant,Seafood Restaurant,Latin American Restaurant,Diner
23,Mendoza,Restaurant,Fast Food Restaurant,Italian Restaurant,Peruvian Restaurant,Spanish Restaurant,-,-,-,-,-


In [525]:
argentina_cluster_merged.loc[argentina_cluster_merged['Cluster Labels'] == 3, argentina_cluster_merged.columns[[0] + list(range(2, 12))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Azul,BBQ Joint,-,-,-,-,-,-,-,-,-
9,Corrientes,BBQ Joint,Restaurant,-,-,-,-,-,-,-,-
10,Cruz del Eje,German Restaurant,BBQ Joint,-,-,-,-,-,-,-,-
12,Florencio Varela,Seafood Restaurant,BBQ Joint,Sushi Restaurant,-,-,-,-,-,-,-
14,Goya,BBQ Joint,Seafood Restaurant,-,-,-,-,-,-,-,-
34,Punta Alta,BBQ Joint,Argentinian Restaurant,Empanada Restaurant,-,-,-,-,-,-,-
50,San Ramón de la Nueva Orán,Vegetarian / Vegan Restaurant,BBQ Joint,Restaurant,-,-,-,-,-,-,-
58,Villa Mercedes,BBQ Joint,Restaurant,-,-,-,-,-,-,-,-


In [526]:
argentina_cluster_merged.loc[argentina_cluster_merged['Cluster Labels'] == 4, argentina_cluster_merged.columns[[0] + list(range(2, 12))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,Concepción del Uruguay,Argentinian Restaurant,-,-,-,-,-,-,-,-,-
8,Concordia,Argentinian Restaurant,-,-,-,-,-,-,-,-,-
13,Godoy Cruz,Argentinian Restaurant,American Restaurant,-,-,-,-,-,-,-,-
15,Gualeguaychú,Argentinian Restaurant,-,-,-,-,-,-,-,-,-
18,La Rioja,Argentinian Restaurant,Italian Restaurant,-,-,-,-,-,-,-,-
29,Pergamino,Argentinian Restaurant,-,-,-,-,-,-,-,-,-
36,Reconquista,Argentinian Restaurant,Restaurant,-,-,-,-,-,-,-,-
39,Río Gallegos,Argentinian Restaurant,Greek Restaurant,-,-,-,-,-,-,-,-
44,San Luis,Argentinian Restaurant,Italian Restaurant,Mexican Restaurant,Restaurant,-,-,-,-,-,-
51,San Salvador de Jujuy,Argentinian Restaurant,Restaurant,-,-,-,-,-,-,-,-
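The five cells above repeat the same selection with a different label, and the positional slice `columns[[0] + list(range(2, 12))]` depends on the exact column order. A minimal sketch of an alternative is shown below: it selects the venue columns by name and loops over the labels. The helper name `cluster_view` is hypothetical, and the small frame is a stand-in for `argentina_cluster_merged` built from three rows shown earlier.

```python
import pandas as pd

# Toy stand-in for argentina_cluster_merged (three rows copied from the output above).
merged = pd.DataFrame({
    "City": ["Azul", "Caseros", "Corrientes"],
    "Cluster Labels": [3, 0, 3],
    "1st Most Common Venue": ["BBQ Joint", "Argentinian Restaurant", "BBQ Joint"],
    "2nd Most Common Venue": ["-", "Fast Food Restaurant", "Restaurant"],
    "City Latitude": [-36.7833, -34.6167, -27.4833],
})

def cluster_view(frame, label):
    """Return the City column plus the ranked-venue columns for one cluster."""
    venue_cols = [c for c in frame.columns if c.endswith("Most Common Venue")]
    return frame.loc[frame["Cluster Labels"] == label, ["City"] + venue_cols]

# Print one table per cluster instead of writing one cell per label.
for label in sorted(merged["Cluster Labels"].unique()):
    print(f"Cluster {label}")
    print(cluster_view(merged, label))
```

Selecting by column name keeps the view correct even if columns are added to or reordered in the merged frame.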


In [517]:

def return_most_common_venues(row, num_top_venues):
    """
    Inputs: 'row' - pd.Series: the city name first, then one frequency per venue category
            'num_top_venues' - int
    Outputs: numpy.ndarray of length 'num_top_venues' with the category names sorted by
             descending frequency; missing ranks are filled with '-'
    """
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    # keep only the categories that actually occur in this city
    top_categories = list(row_categories_sorted[row_categories_sorted != 0].index)
    # pad with '-' so that every city yields exactly num_top_venues entries
    while len(top_categories) < num_top_venues:
        top_categories.append('-')
    return np.array(top_categories)[0:num_top_venues]


num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['City'] = argentina_grouped['City']

for ind in np.arange(argentina_grouped.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(argentina_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Azul,BBQ Joint,-,-,-,-,-,-,-,-,-
1,Bahía Blanca,Argentinian Restaurant,Italian Restaurant,Fast Food Restaurant,Tapas Restaurant,German Restaurant,BBQ Joint,-,-,-,-
2,Berazategui,Fast Food Restaurant,Argentinian Restaurant,Chinese Restaurant,-,-,-,-,-,-,-
3,Buenos Aires,Argentinian Restaurant,Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Sushi Restaurant,Steakhouse,Spanish Restaurant,BBQ Joint,Japanese Restaurant
4,Caseros,Argentinian Restaurant,Fast Food Restaurant,-,-,-,-,-,-,-,-
5,Catamarca,Argentinian Restaurant,Restaurant,Diner,BBQ Joint,-,-,-,-,-,-
6,Comodoro Rivadavia,Argentinian Restaurant,BBQ Joint,Restaurant,-,-,-,-,-,-,-
7,Concepción del Uruguay,Argentinian Restaurant,-,-,-,-,-,-,-,-,-
8,Concordia,Argentinian Restaurant,-,-,-,-,-,-,-,-,-
9,Corrientes,BBQ Joint,Restaurant,-,-,-,-,-,-,-,-
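To make the ranking and padding logic easier to follow, here is a minimal sketch on a single hand-made row. The function name `top_venues` and the frequency values are hypothetical; only the category names and the resulting order mirror the Catamarca row above. It is not the notebook's function, just a compact variant of the same idea that filters with a boolean mask instead of a loop.

```python
import pandas as pd

def top_venues(row, num_top_venues, pad="-"):
    """Rank the venue categories of one city row by frequency, padding with `pad`."""
    freqs = row.drop("City").astype(float)
    # Drop categories that do not occur, then sort from most to least frequent.
    ranked = freqs[freqs > 0].sort_values(ascending=False, kind="mergesort")
    names = list(ranked.index[:num_top_venues])
    return names + [pad] * (num_top_venues - len(names))

# Hypothetical row: the frequencies are made up for illustration.
row = pd.Series({
    "City": "Catamarca",
    "Argentinian Restaurant": 0.5,
    "BBQ Joint": 0.1,
    "Diner": 0.15,
    "Restaurant": 0.25,
    "Sushi Restaurant": 0.0,
})
print(top_venues(row, 5))
# ['Argentinian Restaurant', 'Restaurant', 'Diner', 'BBQ Joint', '-']
```

The `-` placeholder shows that a city has fewer distinct restaurant categories than the number of ranks requested, which is why small cities such as Azul show a single category followed by dashes.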
