# Capstone Final Project

## Introduction and Background

Peru is known, among other things, for its cuisine, its biodiversity and its variety of landscapes. These are some of the reasons why, despite being considered a developing country, its number of tourists grows year after year.


Lima, the capital city, is well known for many reasons. First, it is the arrival point for the majority of tourists who visit Peru, whether for work or for pleasure. In addition, because of its proximity to the sea, Lima is regarded as one of the Peruvian regions with the best _ceviche_, probably the most recognized Peruvian dish, made mainly with fresh fish. However, this is not the only iconic dish in Lima: as the capital, it is a mandatory stop for people travelling within the country, so the city offers a massive variety of restaurants serving food from every corner of Peru.

This exploratory analysis may be of interest to entrepreneurs looking to open a restaurant (Peruvian seafood, Andean food, coastal food and so on) in one of the gastronomic capitals of the world.





## Data acquisition and use

### This project will use two types of data:


1) Locations of hotels, hostels, motels, other tourist-related venues and restaurants in Lima.
2) Locations of the Lima neighborhoods, laid out as a grid.


The first type of data will be obtained with the **Foursquare API**, which can return the venues within a given radius of a specific location (latitude and longitude). This is useful here because the research area is disaggregated into a grid, so venue information can be retrieved for every cell of that grid.

The second type of data is a 16 x 15 km grid built around coordinates obtained from the **Google Maps Geocoding API**, covering the most populated parts of the city. The grid is needed to identify the cells with the highest density of lodgings per km². With that information we can estimate the capacity to receive tourists at each point of the area.
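
To make the "lodgings per km²" idea concrete, here is a minimal sketch of the arithmetic. It assumes each grid cell is queried with a circle of 500 m radius (the default radius used later in this notebook) centred on a 1 km x 1 km cell. The function name `lodges_per_km2` and the count of 12 lodgings are hypothetical and only serve as an illustration.

```python
import math

def lodges_per_km2(n_lodges, radius_m=500):
    """Density of lodgings inside a circular search area of the given radius."""
    area_km2 = math.pi * (radius_m / 1000) ** 2
    return n_lodges / area_km2

# A circle of 500 m radius covers about 0.785 km2,
# i.e. roughly 78.5 % of a 1 km x 1 km grid cell.
search_area_km2 = math.pi * 0.5 ** 2
coverage = search_area_km2 / 1.0

# Hypothetical example: 12 lodgings returned for one cell.
density = lodges_per_km2(12)
print(round(search_area_km2, 3), round(coverage, 3), round(density, 1))
```

Because the circle does not reach the corners of its square cell, venues located there are not counted for any cell; a larger radius would close the gap at the cost of counting some venues twice.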



In [1]:
# !pip install pyproj

In [2]:
# importing base libraries

# to get data
import requests
# base Python data libraries
import pandas as pd
import numpy as np
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# for clustering
from sklearn.cluster import KMeans
# to make the maps
import folium
# for additional math operations
import math
# to transform coordinates
import pyproj
# for serialization
import pickle


## Grid of Lima City

In [3]:
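import os

# Read the key from the environment instead of hard-coding it
# (the variable name GOOGLE_MAPS_API_KEY is an assumption)
key = os.environ['GOOGLE_MAPS_API_KEY']
address = 'Plaza de Armas de Lima, Lima, Perú'
url = 'https://maps.googleapis.com/maps/api/geocode/json'
# passing params lets requests URL-encode the spaces and accents in the address
response = requests.get(url, params={'key': key, 'address': address}).json()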

In [4]:
results = response['results']
geographical_data = results[0]['geometry']['location'] # get geographical coordinates
lat = geographical_data['lat']
lon = geographical_data['lng']

lima_center = [lat,lon]
print('Coordinate of {}: {}'.format(address, lima_center))

Coordinate of Plaza de Armas de Lima, Lima, Perú: [-12.0460038, -77.0305458]


In [5]:
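# Build both transformers once; always_xy=True keeps the (lon, lat) / (x, y) argument order.
# Note: UTM zone 18 is declared without south=True, so northings around Lima are negative.
# That is harmless here because the coordinates are only used for relative offsets in metres.
proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
proj_xy = pyproj.Proj(proj='utm', zone=18, datum='WGS84')
to_xy = pyproj.Transformer.from_proj(proj_latlon, proj_xy, always_xy=True)
to_lonlat = pyproj.Transformer.from_proj(proj_xy, proj_latlon, always_xy=True)

def lonlat_to_xy(lon, lat):
    # WGS84 longitude/latitude -> UTM x/y in metres
    return to_xy.transform(lon, lat)

def xy_to_lonlat(x, y):
    # UTM x/y in metres -> WGS84 longitude/latitude
    return to_lonlat.transform(x, y)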


print('Coordinate transformation check')
print('-------------------------------')
print('lima center longitude={}, latitude={}'.format(lima_center[1], lima_center[0]))
x, y = lonlat_to_xy(lima_center[1], lima_center[0])
print('lima center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('lima center longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
lima center longitude=-77.0305458, latitude=-12.0460038
lima center UTM X=278951.6560586083, Y=-1332458.3618532356
lima center longitude=-77.0305458, latitude=-12.046003799999996


In [6]:
lima_center_x, lima_center_y = lonlat_to_xy(lima_center[1], lima_center[0]) # City center in Cartesian coordinates

x_min = lima_center_x - 3000
x_step = 1000
y_min = lima_center_y - 11000
y_step = 1000 

latitudes = []
longitudes = []
for i in range(0, 15):
    y = y_min + i * y_step
    for j in range(0, 16):
        x = x_min + j * x_step
        lon, lat = xy_to_lonlat(x, y)
        latitudes.append(lat)
        longitudes.append(lon)

print(len(latitudes), 'candidate neighborhood centers generated.')

240 candidate neighborhood centers generated.


In [7]:
map_lima = folium.Map(location=lima_center, zoom_start=13)
folium.Marker(lima_center, popup='lima').add_to(map_lima)
for lat, lon in zip(latitudes, longitudes):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_lima) 
    folium.Circle([lat, lon], radius=500, color='blue', fill=False).add_to(map_lima)
    #folium.Marker([lat, lon]).add_to(map_lima)
map_lima

In [8]:
lima_grid_df=pd.DataFrame(list(zip(latitudes,longitudes)), 
               columns =['latitudes', 'longitudes'])

In [9]:
lima_grid_df

| | latitudes | longitudes |
|---|---|---|
| 0 | -12.145214 | -77.058852 |
| 1 | -12.145282 | -77.049667 |
| 2 | -12.145350 | -77.040482 |
| 3 | -12.145418 | -77.031296 |
| 4 | -12.145485 | -77.022111 |
| 5 | -12.145552 | -77.012926 |
| 6 | -12.145618 | -77.003740 |
| 7 | -12.145685 | -76.994555 |
| 8 | -12.145751 | -76.985369 |
| 9 | -12.145817 | -76.976183 |
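
As a sanity check on the grid, the spacing between neighbouring cell centres can be verified independently of `pyproj`. The sketch below uses the haversine formula, which is a spherical approximation and an addition to the original workflow, on rows 0 and 1 of the table above. The helper name `haversine_m` and the Earth radius of 6,371 km are assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres between two points (spherical approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# First two grid points from the table above (rows 0 and 1): (latitude, longitude)
p0 = (-12.145214, -77.058852)
p1 = (-12.145282, -77.049667)

spacing = haversine_m(*p0, *p1)
print('spacing between neighbouring cells: {:.0f} m'.format(spacing))
```

The result should be close to the 1000 m step used to build the grid; small deviations are expected from the spherical approximation and from the UTM scale factor.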


## FOURSQUARE API

In [10]:
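import os

# Credentials are read from the environment instead of being hard-coded
# (the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are assumptions)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version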

### The function below extracts data about the venues (in particular food and accommodation venues) near each location in the DataFrame above

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Collect the venues Foursquare returns within `radius` metres of each grid cell centre."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius, 10000)

        # make the GET request; the timeout keeps a stalled connection from hanging the loop
        results = requests.get(url, timeout=30).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue,
        # skipping venues that have no category assigned
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results if v['venue']['categories']])

    # naming the columns in the constructor also works when no venue was returned
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues

In [13]:
lima_venues = getNearbyVenues(names=lima_grid_df.index,
                                   latitudes=lima_grid_df['latitudes'],
                                   longitudes=lima_grid_df['longitudes']
                                  )

ConnectionError: ('Connection aborted.', OSError("(10053, 'WSAECONNABORTED')"))
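
A dropped connection like the one above aborts the whole loop over 240 grid cells. One common mitigation, which is not part of the original notebook, is to retry the failing call with exponential backoff. The sketch below is generic and self-contained; `call_with_retries` and `flaky_request` are hypothetical names, and the fake request only simulates a connection that fails twice.

```python
import time

def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry on connection-type errors with exponential backoff.

    OSError is caught because both the built-in ConnectionError and
    requests.exceptions.RequestException derive from it.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError:
            if attempt == attempts:
                raise  # give up after the last attempt
            # wait base_delay, then 2x, 4x, ... before trying again
            sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for a flaky request (hypothetical, no network access needed).
calls = {'n': 0}

def flaky_request():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('Connection aborted.')
    return {'response': 'ok'}

result = call_with_retries(flaky_request, attempts=5, base_delay=0.01)
print(result, 'after', calls['n'], 'calls')
```

Inside `getNearbyVenues`, `func` could be a small lambda wrapping the `requests.get(url).json()` call, so that a single failed request no longer discards the venues already collected.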

In [None]:
len(lima_venues)

In [None]:
lima_venues['Venue Category'].unique()
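
`lima_most_lodges_cells`, used in the following cells, is not defined anywhere in this notebook. One plausible reconstruction, given that the later merge drops every `lima_venues` column except `Neighborhood`, is a per-cell count of lodging venues. The sketch below assumes that reading; the category names, the `min_lodges` threshold and the toy data are all hypothetical.

```python
import pandas as pd

# Toy stand-in for lima_venues (hypothetical rows, same column names as getNearbyVenues).
toy_venues = pd.DataFrame({
    'Neighborhood': [0, 0, 0, 1, 1, 2],
    'Neighborhood Latitude': [-12.14] * 6,
    'Neighborhood Longitude': [-77.05] * 6,
    'Venue': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Venue Latitude': [-12.14] * 6,
    'Venue Longitude': [-77.05] * 6,
    'Venue Category': ['Hotel', 'Hostel', 'Café', 'Hotel', 'Bakery', 'Motel'],
})

lodging_categories = ['Hotel', 'Hostel', 'Motel', 'Bed & Breakfast']  # assumed category names
min_lodges = 2  # assumed threshold for a "high density" cell

lodges = toy_venues[toy_venues['Venue Category'].isin(lodging_categories)]
# count() turns every remaining column into the number of lodging venues per cell
lodges_per_cell = lodges.groupby('Neighborhood').count()
most_lodges_cells = lodges_per_cell[lodges_per_cell['Venue'] >= min_lodges]
print(most_lodges_cells['Venue'])
```

With the real data, the same steps applied to `lima_venues` would yield a frame indexed by grid cell, which is the shape the later `join` on `Neighborhood` expects.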

In [None]:
len(lima_most_lodges_cells)

In [None]:
lima_restaurants=lima_venues[lima_venues['Venue Category'].isin(['Seafood Restaurant', 'Coffee Shop', 
        'Cafeteria','Café', 'Peruvian Restaurant','Restaurant','Food',
        'Chinese Restaurant', 'Bakery','Fish & Chips Shop','Buffet',
        'Vegetarian / Vegan Restaurant','Dessert Shop', 'Ice Cream Shop'])]

### Next, we one-hot encode the venue categories in our dataset and group them by neighborhood, taking the mean frequency of each category in every neighborhood

In [None]:
lima_onehot = pd.get_dummies(lima_restaurants[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
lima_onehot['Neighborhood'] = lima_restaurants['Neighborhood'] 



In [None]:
grouped=lima_onehot.groupby('Neighborhood').mean().reset_index()
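
To see what the one-hot encoding followed by a grouped mean produces, here is a small self-contained example on toy data. The rows are hypothetical; `dtype=float` is passed explicitly so the dummy columns are numeric regardless of the pandas version.

```python
import pandas as pd

# Toy data (hypothetical): two grid cells and their restaurant categories.
toy = pd.DataFrame({
    'Neighborhood': [0, 0, 0, 0, 1, 1],
    'Venue Category': ['Café', 'Café', 'Bakery', 'Seafood Restaurant', 'Bakery', 'Bakery'],
})

onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot['Neighborhood'] = toy['Neighborhood']

# The mean of a 0/1 column is the share of venues of that category in the cell.
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of `freq` sums to 1, so the values can be read as the category mix of a cell: cell 0 is half cafés, while cell 1 contains only bakeries.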

### The two cells below determine the most common venues in every neighborhood

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

## TIME FOR CLUSTERING

In [None]:

grouped_clustering = grouped.drop('Neighborhood', axis=1)
kclusters=np.arange(1,10)
Sum_of_squared_distances = []

for k in kclusters:
    kmeans = KMeans(n_clusters=k, random_state=0).fit(grouped_clustering)
    Sum_of_squared_distances.append(kmeans.inertia_)

In [None]:
# plot the elbow method to choose the best k for kmeans
plt.plot(kclusters, Sum_of_squared_distances, 'bx-')
plt.xlabel('kclusters')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
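
Reading the elbow off a plot is subjective. As a complement, a simple heuristic is to pick the k where the inertia curve bends most sharply, measured by the discrete second difference. The sketch below illustrates that heuristic only; it is not how the value of `best_k` in this notebook was chosen, the function name is hypothetical, and the inertia values are made up for the example.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k where the inertia curve bends most sharply.

    The bend is measured by the discrete second difference
    inertias[i-1] - 2*inertias[i] + inertias[i+1], which is large where a
    steep drop is followed by a flat tail.
    """
    best_k, best_bend = None, float('-inf')
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Hypothetical inertia values, NOT the ones produced by this notebook.
ks = [1, 2, 3, 4, 5, 6, 7, 8, 9]
inertias = [100.0, 80.0, 62.0, 45.0, 20.0, 17.0, 15.0, 13.5, 12.5]
print(elbow_by_second_difference(ks, inertias))
```

With the notebook's own data, `kclusters` and `Sum_of_squared_distances` could be passed in directly; the first and last k can never be selected because the second difference is undefined at the ends.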

In [None]:
best_k =5

In [None]:
kmeans = KMeans(n_clusters=best_k, random_state=0).fit(grouped_clustering)

In [None]:
neighborhoods_venues_sorted.insert(1, 'Cluster Labels', kmeans.labels_)

In [None]:
neighborhoods_venues_sorted.sort_values(by='Cluster Labels')

## After this procedure, we join the cluster table with the grid cells that contain lodges to decide where a restaurant could be opened, knowing the most common restaurant types in each place

In [None]:
merged = lima_most_lodges_cells.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on=lima_most_lodges_cells.index ,how='left')
merged=merged.drop(['Neighborhood Latitude','Neighborhood Longitude','Venue','Venue Latitude','Venue Longitude','Venue Category'],axis=1)
merged=merged.dropna()
merged=merged.join(lima_grid_df, on=merged.index,how='left')
merged


## And of course... The Map!

In [None]:
# centre the map on the city centre; lat and lon were overwritten by the grid loops above
map_clusters = folium.Map(location=lima_center, zoom_start=11)

x = np.arange(best_k)
ys = [i + x + (i*x)**2 for i in range(best_k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merged['latitudes'], merged['longitudes'], merged.index, merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters