# NLP Recommender System - Travel Advisor

## Introduction
At its lowest point during the early days of the pandemic, the Travel Health Index [dipped to 20 in April 2020](https://www.kiplinger.com/personal-finance/is-travel-finally-back-new-report-reveals-record-tourism-rebound), a sign that travel performance had sunk to 20% of April 2019 levels. Three years later, travel has fully rebounded and continues to set new records across all global regions. In fact, more people are traveling now than before the pandemic.

This project creates an NLP Recommender System using travel destinations from [50 best places to travel in 2024](https://www.travelandleisure.com/best-places-to-go-2024-8385979). The NLP system was trained on over 20,000 text pairs from the popular travel site Trip Advisor.


## Table of Contents
---
- [Introduction](#Introduction)
  - [Our Research Problem](#Our-Research-Problem)
- [Web Scraping](#Web-Scraping)

## Web Scraping
We will scrape tens of thousands of "things to do" listings from Trip Advisor. Some of the places are taken from [50 Best Places to Travel in 2024](https://www.travelandleisure.com/best-places-to-go-2024-8385979). The others are popular travel destinations that I hope to visit one day. For reference, here are the places scraped, each with its Trip Advisor URL:

1.   Cartagena, Colombia: https://www.tripadvisor.com/Attractions-g297476-Activities-oa0-Cartagena_Cartagena_District_Bolivar_Department.html
2.   Rajasthan, India: https://www.tripadvisor.com/Attractions-g297665-Activities-oa0-Rajasthan.html
3.   Tallinn, Estonia: https://www.tripadvisor.com/Attractions-g274958-Activities-oa0-Tallinn_Harju_County.html
4.   Warsaw, Poland: https://www.tripadvisor.com/Attractions-g274856-Activities-oa0-Warsaw_Mazovia_Province_Central_Poland.html
5.   Mérida, Mexico: https://www.tripadvisor.com/Attractions-g150811-Activities-oa0-Merida_Yucatan_Peninsula.html
6.   Fort Worth, Texas: https://www.tripadvisor.com/Attractions-g55857-Activities-oa0-Fort_Worth_Texas.html
7.   Las Vegas, Nevada: https://www.tripadvisor.com/Attractions-g45963-Activities-oa0-Las_Vegas_Nevada.html
8.   Paris, France: https://www.tripadvisor.com/Attractions-g187147-Activities-oa0-Paris_Ile_de_France.html
9.   Egersund, Norway: https://www.tripadvisor.com/Attractions-g668814-Activities-oa0-Egersund_Eigersund_Municipality_Rogaland_Western_Norway.html
10.  Hokkaido, Japan: https://www.tripadvisor.com/Attractions-g298143-Activities-oa0-Hokkaido.html
11.  Wellington, New Zealand: https://www.tripadvisor.com/Attractions-g255115-Activities-oa0-Wellington_Greater_Wellington_North_Island.html
12.  Costa Rica: https://www.tripadvisor.com/Attractions-g291982-Activities-oa0-Costa_Rica.html
13.  Bahia, Brazil: https://www.tripadvisor.com/Attractions-g303251-Activities-oa0-State_of_Bahia.html
14.  Sri Lanka, South Asia: https://www.tripadvisor.com/Attractions-g293961-Activities-oa0-Sri_Lanka.html
15.  Porto, Portugal: https://www.tripadvisor.com/Attractions-g189180-Activities-oa0-Porto_Porto_District_Northern_Portugal.html
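
Each URL in the list above embeds the two pieces that the scraping helpers later take as separate arguments: a geo ID such as `g297476` and a trailing location slug. As a minimal sketch, assuming every listing URL follows the `Attractions-<geo id>-Activities-oa<offset>-<slug>.html` shape seen above, both can be pulled out with a regular expression instead of being copied by hand. The helper name `split_listing_url` is hypothetical and not part of the original notebook.

```python
import re

# Pattern inferred from the URLs listed above; the site may change it.
LISTING_URL = re.compile(
    r"Attractions-(?P<geo_id>g\d+)-Activities-oa(?P<offset>\d+)-(?P<slug>[^/]+)\.html$"
)

def split_listing_url(url):
    """Return (geo_id, slug) for a 'things to do' listing URL."""
    match = LISTING_URL.search(url)
    if match is None:
        raise ValueError(f"Unrecognized listing URL: {url}")
    return match.group("geo_id"), match.group("slug")

geo_id, slug = split_listing_url(
    "https://www.tripadvisor.com/Attractions-g297665-Activities-oa0-Rajasthan.html"
)
print(geo_id, slug)
```

The two returned values correspond to the `number` and `end` arguments used in the cells further down.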

We will now import the necessary libraries for web scraping.

In [1]:
import requests
import os
import re
from bs4 import BeautifulSoup
import pandas as pd

Below are our three core functions for getting text data.

In [2]:
loc_dir = 'C:/Users/mskeh/Documents/GitHub/Thinkful/Capstone Projects/Final_Capstone_NLP_Search_Recommendation/Data'

def find_things_to_do_text(url):
    """
    Extracts text inside "Things to Do" from a TripAdvisor page.

    Args:
        url (str): The URL of the TripAdvisor page.

    Returns:
        list: List of extracted texts.
    """
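    # Send a browser-like User-Agent with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # These class names reflect the page markup at scrape time and may change.
    all_attractions = soup.find_all('div', class_='XfVdV o AIbhI')
    # Drop the leading rank (e.g. "12.") from each attraction title.
    return [re.sub(r'^\d+\.', '', attraction.text).strip() for attraction in all_attractions]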

def create_url_list(base_url, number, end):
    """
    Creates the URLs for pages 2 through 200 of a TripAdvisor listing.

    Args:
        base_url (str): URL of the first page. Unused here; the caller adds it.
        number (str): TripAdvisor geo ID for the location, e.g. 'g297476'.
        end (str): Location slug that ends the URL, e.g. 'Rajasthan'.

    Returns:
        list: 199 URLs, one per page, with offsets 30, 60, ..., 5970.
    """
    return [f'https://www.tripadvisor.com/Attractions-{number}-a_sort.-oa{30*i - 30}-{end}.html' for i in range(2, 201)]

def scrape_and_save(base_url, number, end, name, csv_name):
    """
    Scrapes TripAdvisor pages, saves data to a DataFrame, and exports to CSV.

    Args:
        base_url (str): The base URL of the TripAdvisor page.
        number (str): A unique identifier for the location.
        end (str): Additional information for the URL.
        name (str): The name of the location.
        csv_name (str): The desired name for the CSV file.

    Returns:
        pandas.DataFrame: The DataFrame containing the scraped data.
    """
    url_list = create_url_list(base_url, number, end)
    url_list.insert(0, base_url)

    all_things_to_do = [attraction for url in url_list for attraction in find_things_to_do_text(url)]

    df = pd.DataFrame(all_things_to_do, columns=['Text'])
    df['Location'] = name
    csv_path = os.path.join(loc_dir, csv_name)
    df.to_csv(csv_path, index=False)
    return df
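
To see what the helpers above produce without hitting the network, the sketch below reproduces two of their building blocks in isolation: the rank-prefix cleanup from `find_things_to_do_text` and the page offsets from `create_url_list`. The sample titles are made up for illustration, and the function names `strip_rank_prefix` and `page_offsets` are hypothetical; the regular expression and the offset arithmetic are copied from the functions above.

```python
import re

def strip_rank_prefix(title):
    """Drop a leading '12.' style rank, as find_things_to_do_text does."""
    return re.sub(r'^\d+\.', '', title).strip()

def page_offsets(first_page=2, last_page=200):
    """Offsets used in the '-oa<offset>-' URL segment for pages 2..200."""
    return [30 * i - 30 for i in range(first_page, last_page + 1)]

# Hypothetical titles, shaped like the text of a listing card.
samples = ['1. Old Town', '27. City Walls', 'Harbor Cruise']
cleaned = [strip_rank_prefix(s) for s in samples]
print(cleaned)

offsets = page_offsets()
print(offsets[:3], offsets[-1], len(offsets))
```

Together with the base URL (offset 0), that is 200 pages per location, or at most 6,000 listings if each page holds 30 entries, which the step size of 30 suggests.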

We will now use the three helper functions to scrape data. We will put all of the scraped data into DataFrames and then combine all of them into a single DataFrame.
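
The per-location cells that follow differ only in their five arguments, so the same work could be driven from a table. The sketch below is one hypothetical way to do that: `LOCATIONS` holds only the first two places, `build_base_url` rebuilds the first-page URL from its parts, and `scrape_all` takes the scraping function as a parameter so it can be exercised with a stand-in instead of live requests. None of these names appear in the original notebook.

```python
# Each entry mirrors the arguments of scrape_and_save; only two locations are shown.
LOCATIONS = [
    {
        'number': 'g297476',
        'end': 'Cartagena_Cartagena_District_Bolivar_Department',
        'name': 'Cartagena, Colombia',
        'csv_name': 'cartagena.csv',
    },
    {
        'number': 'g297665',
        'end': 'Rajasthan',
        'name': 'Rajasthan, India',
        'csv_name': 'rajasthan.csv',
    },
]

def build_base_url(number, end):
    """First listing page for a location, matching the URLs listed earlier."""
    return f'https://www.tripadvisor.com/Attractions-{number}-Activities-oa0-{end}.html'

def scrape_all(locations, scrape):
    """Call scrape(base_url, number, end, name, csv_name) once per location."""
    results = {}
    for loc in locations:
        base_url = build_base_url(loc['number'], loc['end'])
        results[loc['name']] = scrape(
            base_url, loc['number'], loc['end'], loc['name'], loc['csv_name']
        )
    return results

# Stand-in scraper for a dry run; pass scrape_and_save instead to scrape for real.
def fake_scrape(base_url, number, end, name, csv_name):
    return base_url

dry_run = scrape_all(LOCATIONS, fake_scrape)
print(dry_run['Rajasthan, India'])
```

When `scrape_and_save` is passed in, the values of the returned dictionary are DataFrames and can be handed to `pd.concat` as a list.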

### Cartagena, Colombia

In [3]:
base_url = 'https://www.tripadvisor.com/Attractions-g297476-Activities-oa0-Cartagena_Cartagena_District_Bolivar_Department.html'
number = 'g297476'
end = 'Cartagena_Cartagena_District_Bolivar_Department'
name = 'Cartagena, Colombia'
csv_name = 'cartagena.csv'

cartagena_df = scrape_and_save(base_url, number, end, name, csv_name)

### Rajasthan, India

In [4]:
base_url = 'https://www.tripadvisor.com/Attractions-g297665-Activities-oa0-Rajasthan.html'
number = 'g297665'
end = 'Rajasthan'
name = 'Rajasthan, India'
csv_name = 'rajasthan.csv'

rajasthan_df = scrape_and_save(base_url, number, end, name, csv_name)

### Tallinn, Estonia

In [5]:
base_url = 'https://www.tripadvisor.com/Attractions-g274958-Activities-oa0-Tallinn_Harju_County.html'
number = 'g274958'
end = 'Tallinn_Harju_County'
name = 'Tallinn, Estonia'
csv_name = 'tallinn.csv'

tallinn_df = scrape_and_save(base_url, number, end, name, csv_name)

### Warsaw, Poland

In [6]:
base_url = 'https://www.tripadvisor.com/Attractions-g274856-Activities-oa0-Warsaw_Mazovia_Province_Central_Poland.html'
number = 'g274856'
end = 'Warsaw_Mazovia_Province_Central_Poland'
name = 'Warsaw, Poland'
csv_name = 'warsaw.csv'

warsaw_df = scrape_and_save(base_url, number, end, name, csv_name)

### Mérida, Mexico

In [7]:
base_url = 'https://www.tripadvisor.com/Attractions-g150811-Activities-oa0-Merida_Yucatan_Peninsula.html'
number = 'g150811'
end = 'Merida_Yucatan_Peninsula'
name = 'Mérida, Mexico'
csv_name = 'merida.csv'

merida_df = scrape_and_save(base_url, number, end, name, csv_name)

### Fort Worth, Texas

In [8]:
base_url = 'https://www.tripadvisor.com/Attractions-g55857-Activities-oa0-Fort_Worth_Texas.html'
number = 'g55857'
end = 'Fort_Worth_Texas'
name = 'Fort Worth, Texas'
csv_name = 'fort_worth.csv'

fort_worth_df = scrape_and_save(base_url, number, end, name, csv_name)

### Las Vegas, Nevada

In [9]:
base_url = 'https://www.tripadvisor.com/Attractions-g45963-Activities-oa0-Las_Vegas_Nevada.html'
number = 'g45963'
end = 'Las_Vegas_Nevada'
name = 'Las Vegas, Nevada'
csv_name = 'las_vegas.csv'

las_vegas_df = scrape_and_save(base_url, number, end, name, csv_name)

### Paris, France

In [10]:
base_url = 'https://www.tripadvisor.com/Attractions-g187147-Activities-oa0-Paris_Ile_de_France.html'
number = 'g187147'
end = 'Paris_Ile_de_France'
name = 'Paris, France'
csv_name = 'paris.csv'

paris_df = scrape_and_save(base_url, number, end, name, csv_name)

### Egersund, Norway

In [11]:
base_url = 'https://www.tripadvisor.com/Attractions-g668814-Activities-oa0-Egersund_Eigersund_Municipality_Rogaland_Western_Norway.html'
number = 'g668814'
end = 'Egersund_Eigersund_Municipality_Rogaland_Western_Norway'
name = 'Egersund, Norway'
csv_name = 'egersund.csv'

egersund_df = scrape_and_save(base_url, number, end, name, csv_name)

### Hokkaido, Japan

In [12]:
base_url = 'https://www.tripadvisor.com/Attractions-g298143-Activities-oa0-Hokkaido.html'
number = 'g298143'
end = 'Hokkaido'
name = 'Hokkaido, Japan'
csv_name = 'hokkaido.csv'

hokkaido_df = scrape_and_save(base_url, number, end, name, csv_name)

### Wellington, New Zealand

In [13]:
base_url = 'https://www.tripadvisor.com/Attractions-g255115-Activities-oa0-Wellington_Greater_Wellington_North_Island.html'
number = 'g255115'
end = 'Wellington_Greater_Wellington_North_Island'
name = 'Wellington, New Zealand'
csv_name = 'wellington.csv'

wellington_df = scrape_and_save(base_url, number, end, name, csv_name)

### Costa Rica, Central America

In [14]:
base_url = 'https://www.tripadvisor.com/Attractions-g291982-Activities-oa0-Costa_Rica.html'
number = 'g291982'
end = 'Costa_Rica'
name = 'Costa Rica, Central America'
csv_name = 'costa_rica.csv'

costa_rica_df = scrape_and_save(base_url, number, end, name, csv_name)

### Bahia, Brazil

In [15]:
base_url = 'https://www.tripadvisor.com/Attractions-g303251-Activities-oa0-State_of_Bahia.html'
number = 'g303251'
end = 'State_of_Bahia'
name = 'Bahia, Brazil'
csv_name = 'bahai.csv'

bahai_df = scrape_and_save(base_url, number, end, name, csv_name)

### Sri Lanka, South Asia

In [16]:
base_url = 'https://www.tripadvisor.com/Attractions-g293961-Activities-oa0-Sri_Lanka.html'
number = 'g293961'
end = 'Sri_Lanka'
name = 'Sri Lanka, South Asia'
csv_name = 'sri_lanka.csv'

sri_lanka_df = scrape_and_save(base_url, number, end, name, csv_name)

### Porto, Portugal

In [17]:
base_url = 'https://www.tripadvisor.com/Attractions-g189180-Activities-oa0-Porto_Porto_District_Northern_Portugal.html'
number = 'g189180'
end = 'Porto_Porto_District_Northern_Portugal'
name = 'Porto, Portugal'
csv_name = 'porto.csv'

porto_df = scrape_and_save(base_url, number, end, name, csv_name)

### Combine all the DataFrames into a single CSV
We will now collect the per-location DataFrames in a list, `all_df`, concatenate them, and write the result to a `.csv` file containing the text for all "things to do" in all places.

In [19]:
all_df = [cartagena_df, rajasthan_df, tallinn_df, warsaw_df, merida_df,
          fort_worth_df, las_vegas_df, paris_df, egersund_df, hokkaido_df,
          wellington_df, costa_rica_df, bahai_df, sri_lanka_df, porto_df]
# ignore_index=True gives the combined frame one continuous row index.
all_things_to_do_df = pd.concat(all_df, ignore_index=True)

In [20]:
csv_name = 'all_things_to_do.csv'
csv_path = os.path.join(loc_dir, csv_name)
all_things_to_do_df.to_csv(csv_path, index=False)
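
Before moving on, it can be worth checking that every location contributed rows and that repeated listings have not inflated the totals. The sketch below runs on a small made-up frame with the same `Text` and `Location` columns; the sample rows are illustrative, not scraped data, and `sample_df` is a hypothetical stand-in for `all_things_to_do_df`.

```python
import pandas as pd

# Tiny stand-in for all_things_to_do_df, using the same two columns.
sample_df = pd.DataFrame({
    'Text': ['Old Town', 'Old Town', 'City Walls', 'Amber Fort'],
    'Location': ['Cartagena, Colombia', 'Cartagena, Colombia',
                 'Cartagena, Colombia', 'Rajasthan, India'],
})

# Drop exact repeats of the same attraction within the same location.
deduped = sample_df.drop_duplicates(subset=['Text', 'Location']).reset_index(drop=True)

# Rows per location; a location with no rows would be absent from this Series.
rows_per_location = deduped['Location'].value_counts()
print(rows_per_location)
```

Comparing the index of `rows_per_location` against the list of 15 place names is a quick way to spot a location whose scrape returned nothing.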

To make processing easier, we will complete further analysis in a different notebook. 