# CarpeTrip bot

## 1. Choosing the destination

This solution is aimed at those people who are planning a vacation or a work trip and want to make an educated choice about at which hotel to stay in order to minimize the total cost of accomodation plus travel to the sights we will be visiting.

For starters, we will dress the part of the travel agency and provide our users with a list of sights to visit in the city of interest. For this proof-of-concept, we will consider the use case of a person travelling to Manhattan on vacation.

Let's get the coordinates and the boundaries of the are of interest:

In [1]:
from geopy.geocoders import Nominatim
from geopy.distance import geodesic
import folium
from itertools import product
import requests
import numpy as np
from lxml import html
import re
from pandas.io.json import json_normalize
import pandas as pd
from tqdm import tqdm
from math import ceil, floor

In [2]:
address='Manhattan, New York, Ny'
geolocator = Nominatim(user_agent="city_explorer")
location = geolocator.geocode(address)
lat_center, long_center = location.latitude, location.longitude
bounds = [eval(x) for x in location.raw['boundingbox']]

In [3]:
from geopy.geocoders import Nominatim

address='Manhattan, New York, Ny'
geolocator = Nominatim(user_agent="city_explorer")
location = geolocator.geocode(address)
location.raw

{'boundingbox': ['40.6839411', '40.8804489', '-74.0472219', '-73.9061585'],
 'class': 'boundary',
 'display_name': 'Manhattan, New York County, New York, United States of America',
 'icon': 'https://nominatim.openstreetmap.org/images/mapicons/poi_boundary_administrative.p.20.png',
 'importance': 0.9854391745917972,
 'lat': '40.7896239',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'lon': '-73.9598939',
 'osm_id': 8398124,
 'osm_type': 'relation',
 'place_id': 236223885,
 'type': 'administrative'}

In [4]:
map_area = folium.Map(location=[lat_center, long_center], zoom_start=11)

In [5]:
bbox = list(product((bounds[0],bounds[1]), (bounds[2],bounds[3])))

folium.Rectangle(
        bbox,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.2,
        parse_html=False).add_to(map_area)
map_area

## 2. Selecting the sites to visit

In the proposed solution, we ask our user to select from a list of categories for the venues they would like to have recommended to them. Such a list could include the following categories:
- museum
- stadium
- bridge
- park
- church
- scenic lookout
- monument

In [6]:
CLIENT_ID = 'HJOMTRFDACPN3RSI4US2CAOB0KYKK4HOVM2CLVXIGBG1GEDP'
CLIENT_SECRET = 'SKXEO4FJB1NM2W5YVF30XYA3HLHEF1HBN04SSBH3Y3Y0BQLM'
VERSION = '20180605'

categories_dict = {
    'Museum': '4bf58dd8d48988d181941735',
    'Stadium': '4bf58dd8d48988d184941735',
    'Bridge': '4bf58dd8d48988d1df941735',
    'Park': '4bf58dd8d48988d163941735',
    'Church': '4bf58dd8d48988d132941735',
    'Scenic Lookout': '4bf58dd8d48988d165941735',
    'Monument': '4bf58dd8d48988d12d941735'
}

In [7]:
chosen_categories = ['Museum', 'Stadium', 'Scenic Lookout', 'Monument']
category_ids = [categories_dict[cat] for cat in ['Museum', 'Stadium', 'Scenic Lookout', 'Monument']]

In [8]:
cat = '{}'.format(",".join(category_ids))
intent = 'browse'
sw = '{},{}'.format(bbox[0][0], bbox[0][1])
ne = '{},{}'.format(bbox[-1][0], bbox[-1][1])
limit = 100

# create the API request URL
url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&sw={}&ne={}&intent={}&categoryId={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION,
        sw,
        ne,
        intent, 
        cat, 
        limit)

In [9]:
def foursquare_to_df(url):
    venues_list=[]
    
    # make the GET request
    results = requests.get(url).json()['response']['venues']
    
    # return only relevant information for each cityby venue
    venues_list.append([(
        v['name'], 
        v['location']['lat'], 
        v['location']['lng'],  
        v['categories'][0]['name']) for v in results])

    venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    venues.columns = ['Name', 
                  'Latitude', 
                  'Longitude', 
                  'Category']
    
    return venues

In [10]:
nyvenues = foursquare_to_df(url)
nyvenues.head()

Unnamed: 0,Name,Latitude,Longitude,Category
0,Brooklyn Bridge,40.705967,-73.996707,Bridge
1,East River Running Path,40.722409,-73.974076,Park
2,Manhattan Bridge,40.707661,-73.990903,Bridge
3,McCarren Park Track,40.719665,-73.951501,Track Stadium
4,Riverside Park South,40.778324,-73.989111,Park


Choosing from the recommended venues from above, the user defines his list of must-see venues during the trip:
- Brooklyn Bridge
- Empire State Building
- Madison Square Garden
- Central Park
- The Metropolitan Museum of Art
- Solomon R Guggenheim Museum
- Yankee Stadium
- Shakespeare Garden

NB) Inserire sezione per chiedere consigli su una meta

In [11]:
sights_list = ['Brooklyn Bridge',
    'Empire State Building',
    'Madison Square Garden',
    'Central Park',
    'The Metropolitan Museum of Art (Metropolitan Museum of Art)',
    'Solomon R Guggenheim Museum',
    'Yankee Stadium',
    'Shakespeare Garden']

Furthermore, let's make the assumption that the user tells us that he would like to visit some venue not included in the list such as the Statue of Liberty.

In [12]:
add_list = ['Statue of Liberty']
rows = []

for venue in add_list:
    loc = geolocator.geocode(venue)
    lat, long = loc[1][0], loc[1][1]
    row = {'Name': venue,
           'Latitude':lat,
           'Longitude': long,
           'Category': '---'}
    
    rows.append(row)

        
venues_tmp = pd.DataFrame(rows)

# let's add the category of each place assuming we know them 
# Note: they could be asked to the user or searched thorugh Foursquare API
venues_tmp.loc[venues_tmp.Name=='Statue of Liberty', 'Category'] = 'Monument/Landmark'

# Finally let's append the temporary df to the new york dataframe
nyvenues = nyvenues.append(venues_tmp, sort=False).reset_index(drop=True)

nyvenues.tail()

Unnamed: 0,Name,Latitude,Longitude,Category
46,Greenpoint Waterfront,40.731261,-73.961202,Scenic Lookout
47,General Grant National Memorial,40.813436,-73.963067,Historic Site
48,Flatiron Building,40.741096,-73.989594,Building
49,Transmitter Pier,40.729868,-73.962376,Scenic Lookout
50,Statue of Liberty,40.689253,-74.044548,Monument/Landmark


Let's update the sights_list as well:

In [13]:
sights_list += add_list

In [14]:
venues_map = folium.Map(location=[lat_center, long_center], zoom_start=11)


for n, lat, lng, cat in zip(nyvenues.Name, nyvenues.Latitude, nyvenues.Longitude, nyvenues.Category):
    
    if n in sights_list:
        
        label = folium.Popup(str(n), parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            color='red',
            fill=True,
            fill_opacity=0.7)\
            .add_child(label)\
            .add_to(venues_map)

venues_map

In [15]:
from sklearn import metrics
from sklearn import cluster
from sklearn.cluster import DBSCAN

X = nyvenues[['Latitude','Longitude']].values
X_radians = np.radians(X)
kms_per_radian = 6371.0088
kms_max_dist = 1
epsilon = kms_max_dist/kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree', metric='haversine').fit(X_radians)

In [16]:
nyvenues['Cluster'] = db.labels_

In [17]:
nyvenues

Unnamed: 0,Name,Latitude,Longitude,Category,Cluster
0,Brooklyn Bridge,40.705967,-73.996707,Bridge,0
1,East River Running Path,40.722409,-73.974076,Park,1
2,Manhattan Bridge,40.707661,-73.990903,Bridge,0
3,McCarren Park Track,40.719665,-73.951501,Track Stadium,-1
4,Riverside Park South,40.778324,-73.989111,Park,2
5,Brooklyn Heights Promenade,40.698462,-73.996707,Scenic Lookout,0
6,Gantry Plaza State Park,40.746558,-73.958051,State / Provincial Park,3
7,WNYC Transmitter Park,40.729958,-73.960733,Park,4
8,Bethesda Terrace,40.773854,-73.970982,Plaza,5
9,Exchange Waterfront,40.714692,-74.033152,Waterfront,6


In [18]:
explored = []
sights_cluster = []

for sight in sights_list:

    sub = nyvenues.loc[(nyvenues.Name==sight), 'Cluster']
    if not sub.empty:
        clus= sub.values[0]
        
        if clus != -1 and clus not in explored:
            print('Found other attractions near {} you might want to check out (CLUSTER {}):\n'.format(sight, clus))
            sights_cluster.append(clus)
            
            for name in nyvenues.loc[(nyvenues.Cluster==clus) & (nyvenues.Name!=sight), 'Name']:
                print(name)
                
            explored.append(clus)
            print('===================================================\n')
            


Found other attractions near Brooklyn Bridge you might want to check out (CLUSTER 0):

Manhattan Bridge
Brooklyn Heights Promenade
Empire Fulton Ferry Park
Pier 15

Found other attractions near Empire State Building you might want to check out (CLUSTER 7):

The Vessel
Madison Square Garden
Flatiron Building

Found other attractions near The Metropolitan Museum of Art (Metropolitan Museum of Art) you might want to check out (CLUSTER 5):

Bethesda Terrace
American Museum of Natural History
Central Park - Engineers' Gate
Alice in Wonderland Statue
Belvedere Castle
The Obelisk (Cleopatra's Needle)
King Jagiello / Poland Monument



In [19]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[lat_center, long_center], zoom_start=11)

# set color scheme for the clusters
n_clust = len(set(db.labels_))
colors_array = cm.rainbow(np.linspace(0, 1, n_clust))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map

groups = []
for i, color in enumerate(rainbow):
    groups.append(folium.FeatureGroup(name='<span style=\\"color: {};\\">cluster {}</span>'.format(color, i-1)))
    
for n, lat, lng, cat, cluster in zip(nyvenues.Name, nyvenues.Latitude, nyvenues.Longitude, nyvenues.Category, nyvenues.Cluster):
    
    if n in sights_list:
        radius = 12
    else:
        radius = 4    

    label = folium.Popup(str(n), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=radius,
        color=rainbow[cluster+1],
        fill=True,
        fill_opacity=0.7)\
        .add_child(label)\
        .add_to(groups[cluster + 1])
        
    groups[cluster].add_to(map_clusters)
        
folium.map.LayerControl('topright', collapsed=False).add_to(map_clusters)

    
map_clusters

Let's assume the user follows the advice and decides to include in his journey other attractions from cluster 2 since he will already be visiting other places near to them. He then adds
- Belvedere Castle
- The Obelisk (Cleopatra's Needle)
- Central Park - Engineers' Gate
- Alice in Wonderland Statue  



In [20]:
sights_list += ["Belvedere Castle", "The Obelisk (Cleopatra's Needle)", "Central Park - Engineers' Gate", "Alice in Wonderland Statue"]

Now that we have a complete list of all the places our user intends to visit during his staying, it's possible to find the optimal location to book the hotel. Ideally, this will have to be near the center of mass of all venues in order to minimize movement between places.

In [21]:
usr_venues = nyvenues[nyvenues.Name.isin(sights_list)].reset_index(drop=True)

In [22]:
cm = [usr_venues.Latitude.mean(), usr_venues.Longitude.mean()]

folium.Circle(
    [cm[0], cm[1]],
    radius=1000,
    color='blue',
    fill=True,
    fill_opacity=0.7).add_to(venues_map) 

venues_map

Let's look for hotels nearby the optimal position

In [23]:
radius = 1000
category = '4bf58dd8d48988d1fa931735'
limit = 100

url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION,
                cm[0],
                cm[1],
                1000, 
                category, 
                limit)


In [24]:
df_hotels = foursquare_to_df(url)
df_hotels.head()

Unnamed: 0,Name,Latitude,Longitude,Category
0,Baccarat Hotel,40.760656,-73.976784,Hotel
1,Hilton (New York Hilton Midtown),40.762248,-73.97919,Hotel
2,The Sherry-Netherland,40.764332,-73.972496,Hotel
3,JW Marriott Essex House New York,40.766277,-73.978572,Hotel
4,Hilton Club New York,40.762489,-73.97986,Hotel


In [25]:
hotels_opt = list(df_hotels.Name)

In [26]:
trip_url = 'https://www.tripadvisor.it/Hotels-g60763-oa0-New_York_City_New_York-Hotels.html'
trip_sc = requests.get(trip_url)

## 2. Scraping hotel information from TripAdvisor

Just by looking at Trip Advisor, it's possible to find over 30 pages of hotel listings in New York complete with prices data.  
Please note that before parsing the html source code, it will be necessary to allow the execution of Javascript code that populates the elements with the corresponding prices: for this reason, we will use _selenium_ with Python which emulates a web browser and makes it possible to retrieve specific elements by class name.  
The following functions will be used to initialize the web driver from the _selenium_ package and parse the html structure behind each page.

In [27]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options  
from selenium.webdriver.support.ui import WebDriverWait
        
    
def init_web_driver():
    chrome_options = Options()  
    chrome_options.add_argument("--headless") 
    driver = webdriver.Chrome(options=chrome_options)
    
    return driver


def webdriver_scrape(driver, url):
    driver.get(url)
    
    try:
        driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    except Exception as exception_message:
        print(exception_message)
        
    return driver

By inspecting the source code it appears that hotel listings can be found in the class _allowEllipsis_, from which we can obtain name and price by searching for the classes _listing-title_ and _price-wrap_ respectively.  


In [28]:
def find_listing(driver):
    element = driver.find_elements_by_class_name('allowEllipsis')
    
    if element:
        return element
    else:
        return False
    

def find_name(driver):
    element = driver.find_element_by_class_name('listing-title')
    
    if element:
        return element
    else:
        return False
    

def find_price(driver):
    element = driver.find_element_by_class_name('price-wrap')
    
    if element:
        return element
    else:
        return False
    
    
def clean_price(raw_price):  
    try:
        pr_raw = raw_price.replace('\n','')\
                          .replace('\r','')\
                          .replace('\t','')        
        pr = re.search('(\d{2,5}\s*€\s*)?(\d{2,5})', pr_raw).group(2) 
        
        return pr

    except Exception as pe:
        print(pe)


def clean_name(raw_name):
    try:
        nm_raw = raw_name.replace('\n','')\
                         .replace('\r','')\
                         .replace('\t','') 
        nm_blanks = re.sub('(?i)sponsorizzato','',nm_raw)
        nm = re.sub('(^ )|( $)','',nm_blanks)
        return nm
    
    except Exception as ne:
        print(ne)


Let's consider the first listing: from the analysis of the source code it appears that the name can be then found in a child of the considered element with tag "a"  whose path is variable. In this case:

In [29]:
trip_url = 'https://www.tripadvisor.it/Hotels-g60763-oa0-New_York_City_New_York-Hotels.html'

# initialize web driver
driver = init_web_driver()
driver = webdriver_scrape(driver, trip_url)

# parse html
sample_hotel= driver.find_elements_by_class_name("allowEllipsis")[0]
sample_name_raw = sample_hotel.find_element_by_class_name("listing-title").text
sample_price_raw = sample_hotel.find_element_by_class_name("price-wrap").text

print('Listing: {}\nPrice: {}'.format(sample_name_raw, sample_price_raw))

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: headless chrome=83.0.4103.61)


We can extend the process and integrate the previously defined functions that avoid errors in loading the pages and clean the output information.
Please note that, in order to switch page, we just need to increment by 30 (the number of listings per page) the number after _Hotels-g60763-oa_ in the url.

In [None]:
max_page = 30
hotel_prices = {}
hotels_opt = list(df_hotels.Name)

for ii in tqdm(range(1, max_page)):
    trip_url = 'https://www.tripadvisor.it/Hotels-g60763-oa{}-New_York_City_New_York-Hotels.html'.format(ii*30)
    driver = webdriver_scrape(driver, trip_url)

    # get list of hotels in the page
    hotel_listings = driver.find_elements_by_class_name("allowEllipsis")
    
    for hotel in hotel_listings:

        hotel_name_raw = WebDriverWait(hotel, 10).until(find_name).text
        hotel_name = clean_name(hotel_name_raw)
        
        try:
            hotel_price_raw = WebDriverWait(hotel, 10).until(find_price).text

            if hotel_name in hotels_opt:
                hotel_price = clean_price(hotel_price_raw)
                df_hotels.loc[df_hotels.Name==hotel_name, 'Price'] = hotel_price
        
        except Exception as e:
            print('Price N/A for {} at page {}: {}'.format(hotel_name, ii+1, e))

    

Let's drop rows for which it was not possible to assign a price:

In [None]:
df_price = df_hotels[~df_hotels.Price.isna()].reset_index(drop=True)
print('Number of rows: {}'.format(df_price.shape[0]))

In [None]:
df_price.head()

Finally, we save the dataframe:

In [259]:
df_price.to_csv('./Resources/Hotels.csv', index=False)

# Taxi

In [31]:
df_price = pd.read_csv('./Resources/Hotels.csv')

In [30]:
usr_venues

Unnamed: 0,Name,Latitude,Longitude,Category,Cluster
0,Brooklyn Bridge,40.705967,-73.996707,Bridge,0
1,The Metropolitan Museum of Art (Metropolitan M...,40.779729,-73.963416,Art Museum,5
2,Yankee Stadium,40.829869,-73.926584,Baseball Stadium,-1
3,Empire State Building,40.7486,-73.985806,Building,7
4,Central Park - Engineers' Gate,40.784075,-73.958695,Monument / Landmark,5
5,Madison Square Garden,40.750752,-73.993542,Basketball Stadium,7
6,Alice in Wonderland Statue,40.77504,-73.966566,Outdoor Sculpture,5
7,Belvedere Castle,40.779359,-73.969032,Castle,5
8,The Obelisk (Cleopatra's Needle),40.779656,-73.965422,Monument / Landmark,5
9,Statue of Liberty,40.689253,-74.044548,Monument/Landmark,-1


In order to proceed, we ask our user how many days he has available to tour the city, and he answers 3: Friday, Saturday and Sunday.

In [32]:
days = 3
nights = 2
surp = usr_venues.shape[0]%days
venues_per_day = usr_venues.shape[0]/days

# let's calculate how many venues are going to be visited each day by our user
venues_plan = np.array([1]*surp + [0]*(days-surp)) + np.array([floor(venues_per_day)]*days)
print(venues_plan)

[4 3 3]


In [33]:
usr_venues

Unnamed: 0,Name,Latitude,Longitude,Category,Cluster
0,Brooklyn Bridge,40.705967,-73.996707,Bridge,0
1,The Metropolitan Museum of Art (Metropolitan M...,40.779729,-73.963416,Art Museum,5
2,Yankee Stadium,40.829869,-73.926584,Baseball Stadium,-1
3,Empire State Building,40.7486,-73.985806,Building,7
4,Central Park - Engineers' Gate,40.784075,-73.958695,Monument / Landmark,5
5,Madison Square Garden,40.750752,-73.993542,Basketball Stadium,7
6,Alice in Wonderland Statue,40.77504,-73.966566,Outdoor Sculpture,5
7,Belvedere Castle,40.779359,-73.969032,Castle,5
8,The Obelisk (Cleopatra's Needle),40.779656,-73.965422,Monument / Landmark,5
9,Statue of Liberty,40.689253,-74.044548,Monument/Landmark,-1


The algorithm will plan visits to the chosen venues moving each time to the closest but will also try to keep venues from a cluster within the same day if possible, so we look for the farthest venue for a given hotel. Let's choose the first hotel of the list to make an example:

In [34]:
def get_airport_info(airport):
    loc = geolocator.geocode(airport)
    lat, long = loc[1][0], loc[1][1]

    airport_info = {'Name': airport,
                    'Latitude':lat,
                    'Longitude': long,
                    'Category': 'Airport'}


    df_air = pd.DataFrame([airport_info], columns=['Name', 'Latitude', 'Longitude', 'Category'])
    
    return df_air


def get_next_place(df_visit, reference_pos, criterion='closest'):
    df_visit['Distance'] = df_visit.apply(lambda x: geodesic(reference_pos, (x.Latitude, x.Longitude)).km, axis=1)
    if criterion == 'closest':
        visit = df_visit[df_visit.Distance == df_visit.Distance.min()]
    else:
        visit = df_visit[df_visit.Distance == df_visit.Distance.max()]
    
    return visit    


def tour_planner(airport_selection, hotel_selection, usr_venues, venues_plan):
    visits = pd.DataFrame()
    df_visit = usr_venues.copy()
    airport_selection['Cluster'], airport_selection['Distance'], airport_selection['Day'] = -1, 0.0, 1
    visits = visits.append(airport_selection, sort=False)
    
    for day in range(len(venues_plan)):
        criterion = 'farthest'
        currentPos = hotel_selection.copy()
        currentPos['Day'] = day + 1
        currentPos['Distance'] = geodesic((currentPos.iloc[0][1], currentPos.iloc[0][2]),\
                                              (visits.iloc[-1][1], visits.iloc[-1][2])).km
        visits = visits.append(currentPos, sort=False)
        cond = visits[(visits.Day==day+1) & (~visits.Category.isin(['Hotel', 'Airport']))].shape[0]

        while cond < venues_plan[day]:  
               
            pos0, clust0 = [(currentPos.values[0][1], currentPos.values[0][2]), currentPos.values[0][4]]
            df_current = df_visit.loc[df_visit.Cluster==clust0, :].copy()

            if clust0==-1 or (clust0>-1 and df_current.empty):
                nextPos = get_next_place(df_visit, pos0, criterion=criterion)
                criterion='closest'
                pos1, clust1 = [(nextPos.values[0][1], nextPos.values[0][2]), nextPos.values[0][4]]

                if clust1==-1:
                    currentPos = nextPos.copy()
                    currentPos['Day'] = day + 1
                    visits = visits.append(currentPos, sort=False)
                    df_visit = df_visit[~df_visit.Name.isin(currentPos.Name)]
                else:
                    cl_dim = df_visit.loc[df_visit.Cluster==clust1, :].shape[0]

                    # check if the cluster fits in one day (among those left in the trip)
                    if any([cl_dim <= lim for lim in venues_plan[day:]]):

                        # check if the cluster will fit in the current day
                        if cl_dim <= venues_plan[day]-len(visits) + 1:
                            currentPos = nextPos.copy()
                            currentPos['Day'] = day + 1
                            visits = visits.append(currentPos, sort=False)
                            df_visit = df_visit[~df_visit.Name.isin(currentPos.Name)]

                        else:
                            # before setting aside the cluster let's check how many other
                            # clusters would be postponed and if they will all fit in the future days
                            cl_dims = df_visit.groupby('Cluster').count()
                            cl_leftover = cl_dims[cl_dims['Name']>venues_plan[day]-len(visits)].sum()[0]

                            if cl_leftover < sum(venues_plan[day+1:]):
                                df_visit = df_visit[df_visit.Cluster!=clust1]

                            else:
                                currentPos = nextPos.copy()
                                currentPos['Day'] = day + 1
                                visits = visits.append(currentPos, sort=False)
                                df_visit = df_visit[~df_visit.Name.isin(currentPos.Name)]
                    else:
                        currentPos = nextPos.copy()
                        currentPos['Day'] = day + 1
                        visits = visits.append(currentPos, sort=False)
                        df_visit = df_visit[~df_visit.Name.isin(currentPos.Name)]


            else:
                nextPos = get_next_place(df_current, pos0, criterion=criterion)
                criterion='closest'
                currentPos = nextPos.copy()
                currentPos['Day'] = day + 1
                visits = visits.append(currentPos, sort=False)
                df_visit = df_visit[~df_visit.Name.isin(currentPos.Name)]

            cond = visits[(visits.Day==day+1) & (~visits.Category.isin(['Hotel', 'Airport']))].shape[0]
    
    airport_selection['Distance'] = geodesic((airport_selection.iloc[0][1], airport_selection.iloc[0][2]),\
                                              (visits.iloc[-1][1], visits.iloc[-1][2])).km
    airport_selection['Day'] = day + 1
    visits = visits.append(airport_selection, sort=False)
    cols = ['Day'] + [col for col in visits.columns if col != 'Day']
    visits = visits[cols].reset_index(drop=True)

    return visits

In [35]:
# get airport info
airport = 'JFK Airport, New York'
airport_selection = get_airport_info(airport)

# select hotel 0
hotel_selection = df_price.loc[df_price.index==0, ['Name', 'Latitude', 'Longitude', 'Category']].copy()
hotel_selection['Cluster'], hotel_selection['Distance'] = -1, 0.0

# get plan for the vacation
visits = tour_planner(airport_selection, hotel_selection, usr_venues, venues_plan)
visits

Unnamed: 0,Day,Name,Latitude,Longitude,Category,Cluster,Distance
0,1,"JFK Airport, New York",40.642948,-73.779373,Airport,-1,0.0
1,1,1 Hotel Central Park,40.765116,-73.976485,Hotel,-1,21.483515
2,1,Statue of Liberty,40.689253,-74.044548,Monument/Landmark,-1,10.199771
3,1,Brooklyn Bridge,40.705967,-73.996707,Bridge,0,4.449106
4,1,Empire State Building,40.7486,-73.985806,Building,7,4.822967
5,1,Madison Square Garden,40.750752,-73.993542,Basketball Stadium,7,0.695654
6,2,1 Hotel Central Park,40.765116,-73.976485,Hotel,-1,2.149211
7,2,Yankee Stadium,40.829869,-73.926584,Baseball Stadium,-1,8.333166
8,2,Central Park - Engineers' Gate,40.784075,-73.958695,Monument / Landmark,5,5.762169
9,2,The Metropolitan Museum of Art (Metropolitan M...,40.779729,-73.963416,Art Museum,5,0.625893


In [36]:
def get_trips(visits):
    df_left = visits.copy()
    df_right = visits.copy()
    df_right.index = [ix-1 for ix in visits.index]
    df_right = df_right[df_right.index>=0]

    df_left = df_left[df_left.columns[:-1]]
    df_left.columns = ['Day', 'Name_start', 'Latitude_start', 'Longitude_start', 'Category_start', 'Cluster_start']
    df_right = df_right[df_right.columns[1:]]
    df_right.columns = ['Name_end', 'Latitude_end', 'Longitude_end', 'Category_end', 'Cluster_end', 'Distance']

    df_rides = pd.concat([df_left, df_right], axis=1)
    df_rides.dropna(axis=0, inplace=True)
    
    return df_rides

df_rides = get_trips(visits)
df_rides

Unnamed: 0,Day,Name_start,Latitude_start,Longitude_start,Category_start,Cluster_start,Name_end,Latitude_end,Longitude_end,Category_end,Cluster_end,Distance
0,1,"JFK Airport, New York",40.642948,-73.779373,Airport,-1,1 Hotel Central Park,40.765116,-73.976485,Hotel,-1.0,21.483515
1,1,1 Hotel Central Park,40.765116,-73.976485,Hotel,-1,Statue of Liberty,40.689253,-74.044548,Monument/Landmark,-1.0,10.199771
2,1,Statue of Liberty,40.689253,-74.044548,Monument/Landmark,-1,Brooklyn Bridge,40.705967,-73.996707,Bridge,0.0,4.449106
3,1,Brooklyn Bridge,40.705967,-73.996707,Bridge,0,Empire State Building,40.7486,-73.985806,Building,7.0,4.822967
4,1,Empire State Building,40.7486,-73.985806,Building,7,Madison Square Garden,40.750752,-73.993542,Basketball Stadium,7.0,0.695654
5,1,Madison Square Garden,40.750752,-73.993542,Basketball Stadium,7,1 Hotel Central Park,40.765116,-73.976485,Hotel,-1.0,2.149211
6,2,1 Hotel Central Park,40.765116,-73.976485,Hotel,-1,Yankee Stadium,40.829869,-73.926584,Baseball Stadium,-1.0,8.333166
7,2,Yankee Stadium,40.829869,-73.926584,Baseball Stadium,-1,Central Park - Engineers' Gate,40.784075,-73.958695,Monument / Landmark,5.0,5.762169
8,2,Central Park - Engineers' Gate,40.784075,-73.958695,Monument / Landmark,5,The Metropolitan Museum of Art (Metropolitan M...,40.779729,-73.963416,Art Museum,5.0,0.625893
9,2,The Metropolitan Museum of Art (Metropolitan M...,40.779729,-73.963416,Art Museum,5,1 Hotel Central Park,40.765116,-73.976485,Hotel,-1.0,1.962312


Now we proceed to calculate the total cost of the tour, including the price of the room and taxi fares, under the following assumption:  
- a taxi is needed when two places are farther apart than 1 km


In [37]:
def get_nytaxifare(driver, start_name, end_name, start_loc, end_loc):  
    
    try:
        city = 'NY'
        start_lat = start_loc[0]
        start_long = start_loc[1]
        end_lat = end_loc[0]
        end_long = end_loc[1]
        taxi_url = 'https://www.taxifarefinder.com/main.php?city={}&from={}&to={}&fromCoord={},{}&toCoord={},{}'.format(
                        city,
                        start_name,
                        end_name,
                        start_lat,
                        start_long,
                        end_lat,
                        end_long)

        taxi_driver = webdriver_scrape(driver, taxi_url)
        raw_fare = WebDriverWait(taxi_driver, 10).until(find_fare).text
        fare = clean_fare(raw_fare)
        
        return fare
    
    except Exception as e:
        return 0.0

    
def find_fare(driver):
    element = driver.find_element_by_class_name('fareValue')
    
    if element:
        return element
    else:
        return False
    
    

def clean_fare(raw_fare):  
    try:
        pr_raw = raw_fare.replace('\n','')\
                          .replace('\r','')\
                          .replace('\t','')        
        pr = re.search('((\d{1,4})([\.\,]\d{0,3}))', pr_raw).group(0) 
        
        return float(pr)

    except Exception as pe:
        print(pe)

For example, let's calculate the cost of going from the selected hotel to the Statue of Liberty:

In [38]:
driver = init_web_driver()
start_name, end_name = df_rides.iloc[0][1],  df_rides.iloc[0][6] 
start_loc, end_loc = (df_rides.iloc[0][2],  df_rides.iloc[0][3]), (df_rides.iloc[0][7],  df_rides.iloc[0][8])  

cab_fare = get_nytaxifare(driver, start_name, end_name, start_loc, end_loc)
cab_fare

59.74

Extending the process to all trips of the vacation:

In [85]:
def cab_rides_calc(df_rides):
    cab = df_rides.copy()
    cab['Pos_start'] = list(zip(cab['Latitude_start'], cab['Longitude_start']))
    cab['Pos_end'] = list(zip(cab['Latitude_end'], cab['Longitude_end']))
    fare_cols = ['Distance', 'Name_start', 'Name_end', 'Pos_start', 'Pos_end']
    fares = []

    for dist, sn, en, sloc, eloc in zip(*[cab[col] for col in fare_cols]):
        if dist>1:
            sn = re.sub('\((\w*|\s|\')*\)','',sn)
            en = re.sub('\((\w*|\s|\')*\)','',en)
            fare = get_nytaxifare(driver, sn, en, sloc, eloc)

        else:
            fare = 0.0

        fares.append(fare)

    df_rides['Fare'] = fares
    
    return df_rides

df_rides = cab_rides_calc(df_rides)


Unnamed: 0,Day,Name_start,Latitude_start,...,Cluster_end,Distance,Fare
0,1,1 Hotel Central Park,40.765116,...,-1.0,10.199771,42.58
1,1,Statue of Liberty,40.689253,...,1.0,4.449106,32.74
2,1,Brooklyn Bridge,40.705967,...,6.0,4.822967,27.61
3,1,Empire State Building,40.7486,...,6.0,0.695654,0.0
4,1,Madison Square Garden,40.750752,...,-1.0,2.149211,13.05


In [87]:
pd.set_option('display.max_columns', 10)

In [94]:
df_rides[[el for el in df_rides.columns if not ('Category' in el or 'Latitude' in el or 'Longitude' in el)]]

Unnamed: 0,Day,Name_start,Cluster_start,Name_end,Cluster_end,Distance,Fare
0,1,1 Hotel Central Park,-1,Statue of Liberty,-1.0,10.199771,42.58
1,1,Statue of Liberty,-1,Brooklyn Bridge,1.0,4.449106,32.74
2,1,Brooklyn Bridge,1,Empire State Building,6.0,4.822967,27.61
3,1,Empire State Building,6,Madison Square Garden,6.0,0.695654,0.0
4,1,Madison Square Garden,6,1 Hotel Central Park,-1.0,2.149211,13.05
5,2,1 Hotel Central Park,-1,Yankee Stadium,8.0,8.333166,31.54
6,2,Yankee Stadium,8,Central Park - Engineers' Gate,3.0,5.762169,24.25
7,2,Central Park - Engineers' Gate,3,Solomon R Guggenheim Museum,3.0,0.123469,0.0
8,2,Solomon R Guggenheim Museum,3,The Metropolitan Museum of Art (Metropolitan M...,3.0,0.508791,0.0
9,2,The Metropolitan Museum of Art (Metropolitan M...,3,1 Hotel Central Park,-1.0,1.962312,12.89


In [237]:
df_rides.to_csv('./Resources/df_fares.csv', index=False)

## Total cost of hotel

In [83]:
df_rides = pd.read_csv('./Resources/df_fares.csv')

In [84]:
df_price

Unnamed: 0,Name,Latitude,Longitude,Category,Price
0,1 Hotel Central Park,40.765116,-73.976485,Hotel,642
1,JW Marriott Essex House New York,40.766277,-73.978572,Hotel,663
2,"The Ritz-Carlton New York, Central Park",40.765272,-73.975999,Resort,319
3,The St. Regis New York,40.761339,-73.974392,Hotel,149
4,The Peninsula New York,40.761658,-73.975384,Hotel,668
5,West Side YMCA,40.77086,-73.98064,Gym / Fitness Center,109
6,Park Central Hotel New York,40.76481,-73.9814,Hotel,288
7,Gardens Suites Hotel by Affinia,40.764383,-73.963126,Hotel,406
8,Park Hyatt New York,40.765255,-73.979177,Hotel,208
9,Dream Midtown,40.76442,-73.982008,Hotel,355


Finally we can calculate the total cost of accomodation for the first hotel:

In [128]:
df_price.loc[0, 'Total cost'] = df_rides['Fare'].sum() + df_price.loc[0,'Price']*nights
print('The total cost of stay for {} is {}€'.format(df_price.loc[0, 'Name'], df_price.loc[0, 'Total cost']))

The total cost of stay for 1 Hotel Central Park is 1627.05€


Let's repeat the process for all other hotels:

In [123]:
# get airport info
airport = 'JFK Airport, New York'
airport_selection = get_airport_info(airport)

In [120]:
import time

In [121]:
time.time()

1590853755.0493176

In [126]:
for ii in range(1, df_price.shape[0]):
    start = time.time()

    # select hotel
    print('loading hotel...')
    hotel_selection = df_price.loc[df_price.index==0, ['Name', 'Latitude', 'Longitude', 'Category']].copy()
    hotel_selection['Cluster'], hotel_selection['Distance'] = -1, 0.0
    print('loaded {}'.format(df_price.loc[ii, 'Name']))

    # get plan for the vacation
    print('drawing vacation plan')
    visits = tour_planner(airport_selection, hotel_selection, usr_venues, venues_plan)
    

    # get rides in transactional form
    df_rides = get_trips(visits)
    
    # calculate cost of rides
    print('calculating cab fares for {} trips...'.format(df_rides.shape[0]))
    df_rides = cab_rides_calc(df_rides)
    
    print('calculating total cost of stay')
    df_price.loc[ii, 'Total cost'] = df_rides['Fare'].sum() + df_price.loc[ii,'Price']*nights
    
    print('fine elaborazione per hotel {} - durata: {}'.format(df_price.loc[ii, 'Name'], time.time()-start))
        
        

loading hotel...
loaded Park Central Hotel New York
drawing vacation plan
calculating cab fares for 14 trips...
calculating total cost of stay
fine elaborazione per hotel Park Central Hotel New York - durata: 92.21805191040039
loading hotel...
loaded Gardens Suites Hotel by Affinia
drawing vacation plan
calculating cab fares for 14 trips...
calculating total cost of stay
fine elaborazione per hotel Gardens Suites Hotel by Affinia - durata: 90.82432961463928
loading hotel...
loaded Park Hyatt New York
drawing vacation plan
calculating cab fares for 14 trips...
calculating total cost of stay
fine elaborazione per hotel Park Hyatt New York - durata: 88.31580948829651
loading hotel...
loaded Dream Midtown
drawing vacation plan
calculating cab fares for 14 trips...
calculating total cost of stay
fine elaborazione per hotel Dream Midtown - durata: 87.87626099586487
loading hotel...
loaded The Manhattan Club
drawing vacation plan
calculating cab fares for 14 trips...
calculating total cost of

In [130]:
df_price.to_csv('./Resources/df_price_tot.csv', index=False)

## Hotel comparison

In [40]:
df_price_tot = pd.read_csv('./Resources/df_price_tot.csv')

In [90]:
df_price_tot

Unnamed: 0,Name,Latitude,Longitude,Category,Price,Total cost
0,1 Hotel Central Park,40.765116,-73.976485,Hotel,642,1627.05
1,JW Marriott Essex House New York,40.766277,-73.978572,Hotel,663,1669.05
2,"The Ritz-Carlton New York, Central Park",40.765272,-73.975999,Resort,319,981.05
3,The St. Regis New York,40.761339,-73.974392,Hotel,149,641.05
4,The Peninsula New York,40.761658,-73.975384,Hotel,668,1679.05
5,West Side YMCA,40.77086,-73.98064,Gym / Fitness Center,109,561.05
6,Park Central Hotel New York,40.76481,-73.9814,Hotel,288,919.05
7,Gardens Suites Hotel by Affinia,40.764383,-73.963126,Hotel,406,1155.05
8,Park Hyatt New York,40.765255,-73.979177,Hotel,208,759.05
9,Dream Midtown,40.76442,-73.982008,Hotel,355,1053.05


Let's make a comparison...

In [71]:
hotel_comp = {'Name': 'Solita Soho Hotel', 'Latitude': 40.719902, 'Longitude': -73.999180, 'Category': 'Hotel'}
cost_per_night = 200

In [72]:
hotel_selection_comp = pd.DataFrame([hotel_comp], columns=['Name', 'Latitude', 'Longitude', 'Category'])

hotel_selection_comp['Cluster'], hotel_selection_comp['Distance'] = -1, 0.0

In [73]:
hotel_selection_comp

Unnamed: 0,Name,Latitude,Longitude,Category,Cluster,Distance
0,Solita Soho Hotel,40.719902,-73.99918,Hotel,-1,0.0


In [74]:
visits_comp = tour_planner(airport_selection, hotel_selection_comp, usr_venues, venues_plan)

In [75]:
df_rides_comp = get_trips(visits_comp)

In [76]:
df_rides_comp = cab_rides_calc(df_rides_comp)

In [79]:
tot_cost_comp = df_rides_comp['Fare'].sum() + cost_per_night*nights

In [80]:
print('Total_cost = sum(fares) + cost_per_night x number_of_nights')
print('Total_cost = {} €  + {} x {} € = {} € '.format(df_rides_comp['Fare'].sum(), cost_per_night, nights, tot_cost_comp))

Total_cost = sum(fares) + cost_per_night x number_of_nights
Total_cost = 367.48 €  + 200 x 2 € = 767.48 € 


In [81]:
df_rides_comp

Unnamed: 0,Day,Name_start,Latitude_start,...,Cluster_end,Distance,Fare
0,1,"JFK Airport, New York",40.642948,...,-1.0,20.453014,59.44
1,1,Solita Soho Hotel,40.719902,...,-1.0,13.663346,43.26
2,1,Yankee Stadium,40.829869,...,5.0,5.762169,24.25
3,1,Central Park - Engineers' Gate,40.784075,...,5.0,0.625893,0.0
4,1,The Metropolitan Museum of Art (Metropolitan M...,40.779729,...,5.0,0.169575,0.0
5,1,The Obelisk (Cleopatra's Needle),40.779656,...,-1.0,7.222107,30.66
6,2,Solita Soho Hotel,40.719902,...,5.0,7.076575,29.07
7,2,Belvedere Castle,40.779359,...,5.0,0.52295,0.0
8,2,Alice in Wonderland Statue,40.77504,...,7.0,3.35566,15.41
9,2,Empire State Building,40.7486,...,-1.0,3.381135,17.24


In [173]:
df_price_tot[(df_price_tot['Price']>cost_per_night) & (df_price_tot['Total cost']<=850)]

Unnamed: 0,Name,Latitude,Longitude,Category,Price,Total cost
8,Park Hyatt New York,40.765255,-73.979177,Hotel,208,759.05
14,Fifty Hotel & Suites by Affinia,40.756069,-73.971031,Hotel,223,789.05
15,W New York - Times Square,40.759296,-73.985573,Hotel,245,833.05


In [141]:
df_hotels[df_hotels.Name.str.contains('Roger')]

Unnamed: 0,Name,Latitude,Longitude,Category


In [150]:
a = geolocator.geocode('Solita Soho Hotel, NYC')