# Holiday Destination Recommender - Report

## 1. Introduction

I am sure you know this problem very well.

You are ready for your next vacation, but you do not know which destination to choose. The problem is not that you have too many options and cannot decide; it is that you do not actually know which destinations match your current holiday *mood*. And is there anything exotic to discover beyond the mainstream places?

So here is the idea.

A *Holiday Destination Recommender*: you tell the recommender which places you like and which match your current mood, and it shows you similar places worth visiting, places you might never have come across yourself.
Yes, very similar to what you already know from Netflix, Spotify and YouTube.

The project is implemented using an unsupervised learning approach and hence does not depend on the availability of labeled training data. Please refer to section **Methodology** for more details on the approach.



## 2. Data

In this section, I present which types of data are needed, where the data is retrieved from and how it is processed to implement the Holiday Destination Recommender.

## 2.1 Data Requirements

To implement the Holiday Destination Recommender, the following data is required:
- **A list of cities that serve as "benchmark cities"**: In this example, we go with the cities "Barcelona, Spain" and "Budapest, Hungary".
- **A list of all cities in the world**: This is needed so that the recommender can scout new cities to suggest to the user.
- **The latitudes and longitudes of those cities**: These are needed so that places are uniquely identified and more information can be retrieved about them.
- **The available venues of those cities**: These are needed so that the recommender can calculate the similarity between the benchmark cities and possible holiday destinations. The available *venue categories* are of primary interest, since they form the basis of the comparison.

## 2.2 Data Sources

The following data sources are used to retrieve the data mentioned above:
- Dataset from https://datahub.io/core/world-cities: A CSV file that lists the major cities of the world, i.e. those with more than 15,000 inhabitants. In total there are more than 20k entries.
- Python library *geopy*, used to query the *Nominatim* geocoding service for the coordinates of cities.
- Foursquare Places API to obtain information on the available venues for a city.
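As an illustration of the second source, here is a minimal sketch of a coordinate lookup with *geopy*'s `Nominatim` geocoder. The helper names `make_nominatim_lookup` and `geocode_cities` and the `user_agent` string are hypothetical and not taken from the project. The lookup is passed in as a plain function, so the collection step can be tried with a stand-in instead of the live service.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]
Lookup = Callable[[str], Optional[Coordinates]]


def make_nominatim_lookup(user_agent: str = "holiday-recommender-demo") -> Lookup:
    """Return a function that maps a place name to (latitude, longitude)."""
    # Imported lazily so the rest of the sketch runs without geopy installed.
    from geopy.geocoders import Nominatim

    # The user_agent value is a made-up placeholder for this sketch.
    geolocator = Nominatim(user_agent=user_agent)

    def lookup(place: str) -> Optional[Coordinates]:
        location = geolocator.geocode(place)  # None if nothing is found
        if location is None:
            return None
        return (location.latitude, location.longitude)

    return lookup


def geocode_cities(cities: Iterable[str], lookup: Lookup) -> Dict[str, Coordinates]:
    """Resolve each city name; cities that cannot be resolved are skipped."""
    coordinates = {}
    for city in cities:
        result = lookup(city)
        if result is not None:
            coordinates[city] = result
    return coordinates
```

With network access, `geocode_cities(["Barcelona, Spain", "Budapest, Hungary"], make_nominatim_lookup())` would query the public Nominatim service. For longer city lists it is worth throttling the requests, for example with geopy's `RateLimiter`, since the public service expects roughly one request per second at most.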

## 2.3 Data Processing and Preparation

This section describes the data processing flow that is applied.


#### 2.3.1 Data Retrieval

1. The user provides a list of benchmark cities as input to the recommender.
2. The coordinates of these cities are retrieved from the *Nominatim* service using the *geopy* library.
3. The venues are queried from the Foursquare Places API and listed along with their venue category.
4. The above process is repeated for the candidate destination cities, which are read from the CSV provided by https://datahub.io/core/world-cities. A random sample of 100 cities is taken from that list. The limit of 100 cities keeps the number of requests below the daily limit that the Foursquare Places API imposes on Sandbox accounts; the random selection means that the recommender explores a different set of cities on each run.
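The random selection of destination cities can be sketched with the standard library alone. This is an illustration, not the project's code: the function name `sample_destination_cities` is hypothetical, and the column names `name` and `country` are an assumption about the CSV layout.

```python
import csv
import io
import random


def sample_destination_cities(csv_file, sample_size=100, seed=None):
    """Pick a random subset of cities, formatted as "name, country"."""
    reader = csv.DictReader(csv_file)
    # The column names "name" and "country" are assumed for this sketch.
    cities = [f'{row["name"]}, {row["country"]}' for row in reader]
    rng = random.Random(seed)  # seed=None gives a different sample per run
    # Never ask for more cities than the file contains.
    return rng.sample(cities, min(sample_size, len(cities)))


# Tiny made-up stand-in for the downloaded CSV file.
demo_csv = io.StringIO(
    "name,country\n"
    "Porto,Portugal\n"
    "Ljubljana,Slovenia\n"
    "Cusco,Peru\n"
    "Hoi An,Vietnam\n"
)
print(sample_destination_cities(demo_csv, sample_size=2))
```

Passing a fixed `seed` makes a run reproducible, which helps when comparing results; leaving it unset keeps the exploratory behaviour described above.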

Below is an extract of the benchmark cities with their coordinates and available venues.

![title](img/Benchmark_Cities.png)

And the same for the destination cities.

![title](img/New_Cities.png)
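Rows like those in the two extracts above pair each city with a venue and its category. The sketch below shows how such rows could be derived, assuming the legacy Foursquare v2 `venues/explore` endpoint and its nested response layout; the report itself only names the "Foursquare Places API", so the endpoint choice is an assumption. The function names and the default values for `version`, `radius` and `limit` are hypothetical.

```python
from urllib.parse import urlencode

# Legacy v2 endpoint; an assumption, the report does not name the endpoint.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=5000, limit=100):
    """Build the request URL for venues around a coordinate pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,           # API version date in YYYYMMDD format
        "ll": f"{lat},{lng}",   # latitude,longitude
        "radius": radius,       # search radius in metres (illustrative value)
        "limit": limit,         # maximum number of venues (illustrative value)
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def extract_venue_rows(city, payload):
    """Flatten a decoded JSON response into city / venue / category rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item.get("venue", {})
            categories = venue.get("categories", [])
            if not categories:
                continue  # venues without a category cannot be compared
            rows.append({
                "city": city,
                "venue": venue.get("name"),
                "category": categories[0]["name"],  # first listed category
            })
    return rows
```

Keeping the URL construction and the response parsing separate from the HTTP call makes it possible to check the parsing on a saved response without spending any of the daily request quota.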

#### 2.3.2 Data Preparation

1. The retrieved dataframes are turned into a one-hot encoding of cities vs. venue categories, with one row per venue and one column per venue category.
2. The frequency per venue category and city (i.e. the number of venues of a particular category divided by the total number of venues in that city) is calculated from the one-hot encoding.
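The two preparation steps can be sketched with pandas as follows. The venue data is made up for illustration, and the function name `venue_frequencies` and the column names `city` and `category` are hypothetical. Because every venue row has a 1 in exactly one category column, the per-city mean of each column equals the ratio described in step 2.

```python
import pandas as pd

# Made-up venue list: one row per venue, with its city and category.
venues = pd.DataFrame(
    {
        "city": ["Barcelona, Spain"] * 4 + ["Budapest, Hungary"] * 2,
        "category": [
            "Tapas Restaurant", "Tapas Restaurant", "Beach", "Museum",
            "Thermal Bath", "Museum",
        ],
    }
)


def venue_frequencies(venues):
    """Return the one-hot table and the per-city category frequencies."""
    # Step 1: one column per venue category, one row per venue.
    onehot = pd.get_dummies(venues["category"], dtype=int)
    onehot.insert(0, "city", venues["city"])
    # Step 2: the mean of each 0/1 column per city is the share of
    # that city's venues falling into the category.
    frequencies = onehot.groupby("city").mean()
    return onehot, frequencies


onehot, frequencies = venue_frequencies(venues)
print(frequencies.round(2))
```

In this toy data, half of Barcelona's four venues are tapas restaurants, so that frequency is 0.5, and each row of `frequencies` sums to 1.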

Below are extracts of a) the one-hot encoding and b) the calculated venue frequencies for the benchmark cities.

![title](img/Benchmark_Cities_Onehot.png)

![title](img/Benchmark_Cities_Freq.png)

And the same for the destination cities.

![title](img/New_Cities_Onehot.png)

![title](img/New_Cities_Freq.png)

In the next section, I will discuss how the calculated frequencies of the venue categories are used to generate destination recommendations.