<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone: Travel Recommender System

## Contents:
- [Problem Statement](#Problem-Statement)
- [Background](#Background)
- [Datasets](#Datasets)
- [Download Packages](#Download-Packages)
- [Loading of Libraries](#Loading-of-Libraries) 
- [Downloading of Datasets](#Downloading-of-Datasets)

## Problem Statement

With demand for international travel rising, driven in part by pent-up demand after the easing of Covid travel restrictions, Singapore travellers are spoilt for choice. Deciding which places of interest to visit within a limited vacation is no easy feat.

Furthermore, most [tour packages](https://www.chanbrothers.com/package-tours) and [travel websites](https://www.tripadvisor.com.sg/Attractions-g186338-Activities-oa0-London_England.html) promote the typical sights and attractions rather than the places that locals rate highly.

As such, we propose a travel recommender system that helps travellers discover hidden gems, with a particular focus on London.

## Background

According to the [top travel trends among Singapore travellers in 2023](https://www.humanresourcesonline.net/top-travel-trends-among-singapore-travellers-in-2023), London, United Kingdom is ranked among the top 10 popular travel destinations. As such, our project focuses primarily on the hidden gems of London.

A [preliminary study on Social Media Data Analytics for Tourism](https://ceur-ws.org/Vol-1748/paper-12.pdf) supports using social media as a data source for touristic decision making, since it can provide real-time insights into tourists' visiting patterns. Hence, we use an Instagram dataset to build our recommender system.

To discover the hidden gems of London, we refer to a popular travel website, Trip Advisor, for the Top 50 tourist spots, which are then excluded from our Instagram dataset. According to [London Tourism Statistics](https://gowithguide.com/blog/london-tourism-statistics-2023-all-you-need-to-know-5213), the average length of stay in London for tourists is about 4.6 days. Assuming a tourist covers 5 spots a day over 5 days, 25 spots will be covered. Excluding the top 50 tourist spots (double those 25 spots) should therefore leave a recommender that surfaces lesser-known places for our users.
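
The exclusion step described above could be sketched as follows. This is a minimal illustration rather than the project's actual implementation: the helper names `normalise` and `exclude_top_spots` are hypothetical, the matching rule (case- and whitespace-insensitive name equality) is an assumption, and the place names are toy data.

```python
import pandas as pd

SPOTS_PER_DAY = 5
DAYS = 5
SPOTS_COVERED = SPOTS_PER_DAY * DAYS   # 25 spots in a typical trip
TOP_N_EXCLUDED = 2 * SPOTS_COVERED     # 50, double the spots covered


def normalise(name):
    """Lower-case and collapse whitespace so near-identical names match."""
    return " ".join(str(name).lower().split())


def exclude_top_spots(locations, top_spots):
    """Return the rows of `locations` whose `name` is not a top tourist spot."""
    blocked = {normalise(spot) for spot in top_spots}
    keep = ~locations["name"].map(normalise).isin(blocked)
    return locations[keep].reset_index(drop=True)


# Toy data, invented for illustration only
locations = pd.DataFrame(
    {"name": ["Tower Bridge", "Maltby Street Market", "tower  bridge", "Leake Street Arches"]}
)
top_spots = ["Tower Bridge"]  # in the project this list would hold the 50 scraped names

hidden_gems = exclude_top_spots(locations, top_spots)
print(hidden_gems["name"].tolist())
```

Exact name matching is the simplest possible rule. Real location names on Instagram and Trip Advisor may differ in spelling, so a fuzzier match might be needed in practice.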

## Datasets

The dataset used in this recommender system project is acquired from Kaggle: a [2019 Instagram dataset](https://www.kaggle.com/datasets/shmalex/instagram-dataset?select=instagram_posts.csv) containing 42 million posts, 1.2 million locations and 4.5 million profiles.

The code to acquire the dataset is in this notebook. The files are saved under the [`instagram-dataset`](../instagram-dataset/) folder.

The Kaggle dataset contains three files:
* [`instagram_locations.csv`](../instagram-dataset/instagram_locations.csv): this data contains about 1.2 million locations from all over the world
* [`instagram_posts.csv`](../instagram-dataset/instagram_posts.csv): this data contains posts with location information, date, number of likes and comments, captions, etc
* [`instagram_profiles.csv`](../instagram-dataset/instagram_profiles.csv): this data contains profiles of many different people and businesses

For this project, we will only make use of the first two datasets.

To acquire the Top 50 tourist spots of London, we scrape Trip Advisor; the code is in the notebook [`00_data_acquisition_tripadvisor`](./00_data_acquisition_tripadvisor.ipynb).

### Data Dictionary

**instagram_locations:**
|Feature|Type|Description|
|:---|:---|:---| 
|sid|integer|sequence ID|
|id|integer|Instagram's location ID, which can be used on the website, e.g. for ID=230466055 the URL is https://www.instagram.com/explore/locations/230466055|
|name|string|name of location|
|street|string|street address|
|zip|string|zip code|
|city|string|name of city|
|region|string|name of region|
|cd|string|country code|
|phone|string|phone number in the format as on Instagram|
|aj_exact_city_match|boolean|Instagram's internal key|
|aj_exact_country_match|boolean|Instagram's internal key|
|blurb|string|description of the place|
|dir_city_id|integer|Instagram's internal city ID|
|dir_city_name|string|name of city|
|dir_city_slug|string|city tag|
|dir_country_id|string|country ID|
|dir_country_name|string|name of country|
|lat|float|latitude|
|lng|float|longitude|
|primary_alias_on_fb|string|name on Facebook|
|slug|string|tag|
|website|string|URL to website, may contain more than 1 URL|
|cts|string|timestamp when the location was visited|
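
As an illustration of how the `lat` and `lng` columns could be used to narrow the 1.2 million locations down to London, here is a minimal sketch. The bounding-box values are rough approximations chosen for this example, the helper `london_locations` is hypothetical, and the rows are toy data. The project may well filter on `city` or `dir_city_name` instead.

```python
import pandas as pd

# Approximate bounding box around Greater London (assumed values, not from the dataset)
LAT_MIN, LAT_MAX = 51.28, 51.70
LNG_MIN, LNG_MAX = -0.51, 0.34


def london_locations(locations):
    """Keep locations whose coordinates fall inside the bounding box.

    Rows with missing coordinates are dropped, because comparisons
    against NaN evaluate to False.
    """
    in_box = (
        locations["lat"].between(LAT_MIN, LAT_MAX)
        & locations["lng"].between(LNG_MIN, LNG_MAX)
    )
    return locations[in_box].reset_index(drop=True)


# Toy rows, invented for illustration only
locations = pd.DataFrame(
    {
        "name": ["Tower Bridge", "Eiffel Tower", "Unknown place"],
        "lat": [51.5055, 48.8584, float("nan")],
        "lng": [-0.0754, 2.2945, float("nan")],
    }
)

london = london_locations(locations)
print(london["name"].tolist())
```

A bounding box is crude (it includes some areas just outside London), but it does not depend on how consistently the text columns are filled in.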

**instagram_posts:**
|Feature|Type|Description|
|:---|:---|:---| 
|sid|integer|sequence ID|
|sid_profile|integer|sequence ID of the profile|
|post_id|string|Instagram ID of post|
|profile_id|float|Instagram ID of profile|
|location_id|float|Instagram ID of location|
|cts|string|timestamp when post was created|
|post_type|integer|1 - photo, 2 - video, 3 - multi|
|description|string|caption of post|
|numbr_likes|float|number of likes at the moment it was visited|
|number_comments|float|number of comments at the moment it was visited|
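
The two tables are linked through `location_id` in `instagram_posts` and `id` in `instagram_locations`. Because the former is a float and the latter an integer, the keys need aligning before a join. The sketch below shows one way this might be done on toy data; the location names and values are invented, and the aggregation is only an example of what the joined table enables.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values invented for illustration)
locations = pd.DataFrame({"id": [230466055, 111], "name": ["Location A", "Location B"]})
posts = pd.DataFrame(
    {
        "post_id": ["p1", "p2", "p3", "p4"],
        "location_id": [230466055.0, 230466055.0, 111.0, float("nan")],
        "numbr_likes": [10.0, 5.0, 7.0, 3.0],
    }
)

# Posts without a location cannot be joined, so drop them first,
# then cast the float key to an integer to match `locations.id`.
posts_with_loc = posts.dropna(subset=["location_id"]).copy()
posts_with_loc["location_id"] = posts_with_loc["location_id"].astype("int64")

merged = posts_with_loc.merge(locations, left_on="location_id", right_on="id", how="inner")

# Example aggregation: number of posts and total likes per location
summary = (
    merged.groupby("name", as_index=False)
    .agg(n_posts=("post_id", "count"), total_likes=("numbr_likes", "sum"))
    .sort_values("n_posts", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

Note that the likes column really is spelled `numbr_likes` in the data dictionary above, so the sketch keeps that spelling.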

---

We will first download the Instagram datasets from Kaggle. They are saved in the [`instagram-dataset`](../instagram-dataset/) folder. 

## Download Packages

In [1]:
# Uncomment to download package
# !pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting python-slugify
  Downloading python_slugify-8.0.0-py2.py3-none-any.whl (9.5 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73031 sha256=a177505abc3aa00c838efb2093c63951e6eb80cf62d3250ff3e4e4d7c22b0a71
  Stored in directory: /Users/jo/Library/Caches/pip/wheels/03/f3/c7/fc5a63bb33d22177609b06c5b4c714b5eb3f1b195ce9dc5e47
Su

## Loading of Libraries

In [2]:
import opendatasets as od
import pandas

## Downloading of Datasets

In [3]:
od.download('https://www.kaggle.com/datasets/shmalex/instagram-dataset?select=instagram_locations.csv')

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: joanneh0
Your Kaggle Key: ········
Downloading instagram-dataset.zip to ./instagram-dataset

100%|██████████████████████████████████████| 5.26G/5.26G [05:06<00:00, 18.4MB/s]
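
After a download of this size, it can be worth confirming that the three CSV files listed under Datasets actually landed in the target folder. The check below is a small sketch, not part of the original notebook; `missing_files` is a hypothetical helper, and the folder path is taken from the download message above.

```python
from pathlib import Path

# File names as listed in the Datasets section
EXPECTED_FILES = [
    "instagram_locations.csv",
    "instagram_posts.csv",
    "instagram_profiles.csv",
]


def missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


# Folder name taken from the download log; adjust if the data lives elsewhere
missing = missing_files("./instagram-dataset")
if missing:
    print("Missing files:", missing)
else:
    print("All expected files are present.")
```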





Next, to acquire the Top 50 tourist spots of London, we scrape Trip Advisor; the code is in the notebook [`00_data_acquisition_tripadvisor`](./00_data_acquisition_tripadvisor.ipynb).