# NLP Recommender System - Travel Advisor

## Introduction
At its lowest point during the early days of the pandemic, the Travel Health Index [dipped to 20 in April 2020](https://www.kiplinger.com/personal-finance/is-travel-finally-back-new-report-reveals-record-tourism-rebound), meaning travel performance had sunk to 20% of its April 2019 level. Three years later, travel has fully rebounded and continues to set new records across all global regions. In fact, more people are traveling now than before the pandemic.

This project creates an NLP Recommender System using travel destinations from [50 best places to travel in 2024](https://www.travelandleisure.com/best-places-to-go-2024-8385979). The NLP system was trained on over 20,000 text pairs from the popular travel site Trip Advisor.


## Table of Contents
---
- [Introduction](#Introduction)
  - [Our Research Problem](#Our-Research-Problem)
- [Web Scraping](#Web-Scraping)

## Web Scraping
We will scrape tens of thousands of "things to do" from Trip Advisor. Some of the places are taken from [50 Best Places to Travel in 2024](https://www.travelandleisure.com/best-places-to-go-2024-8385979). Others are popular travel destinations that I hope to visit one day. For reference, here is a list of the places scraped (and their associated URLs on Trip Advisor):

1.   Prague, Czech Republic: https://www.tripadvisor.com/Attractions-g274707-Activities-oa0-Prague_Bohemia.html
2.   Rajasthan, India: https://www.tripadvisor.com/Attractions-g297665-Activities-oa0-Rajasthan.html
3.   Berlin, Germany: https://www.tripadvisor.com/Attractions-g187323-Activities-oa0-Berlin.html
4.   Moscow, Russia: https://www.tripadvisor.com/Attractions-g298484-Activities-oa0-Moscow_Central_Russia.html
5.   Mexico City, Mexico: https://www.tripadvisor.com/Attractions-g150800-Activities-oa0-Mexico_City_Central_Mexico_and_Gulf_Coast.html
6.   New York City, New York: https://www.tripadvisor.com/Attractions-g60763-Activities-oa0-New_York_City_New_York.html
7.   Las Vegas, Nevada: https://www.tripadvisor.com/Attractions-g45963-Activities-oa0-Las_Vegas_Nevada.html
8.   Paris, France: https://www.tripadvisor.com/Attractions-g187147-Activities-oa0-Paris_Ile_de_France.html
9.   Stockholm, Sweden: https://www.tripadvisor.com/Attractions-g189852-Activities-oa0-Stockholm.html
10.  Hokkaido, Japan: https://www.tripadvisor.com/Attractions-g298143-Activities-oa0-Hokkaido.html
11.  Melbourne, Australia: https://www.tripadvisor.com/Attractions-g255100-Activities-oa0-Melbourne_Victoria.html
12.  Bahamas, Atlantic Ocean: https://www.tripadvisor.com/Attractions-g147414-Activities-oa0-Bahamas.html
13.  Bahia, Brazil: https://www.tripadvisor.com/Attractions-g303251-Activities-oa0-State_of_Bahia.html
14.  Bali, Indonesia: https://www.tripadvisor.com/Attractions-g294226-Activities-oa0-Bali.html
15.  Amsterdam, Netherlands: https://www.tripadvisor.com/Attractions-g188590-Activities-oa0-Amsterdam_North_Holland_Province.html
16.  Hong Kong, China: https://www.tripadvisor.com/Attractions-g294217-Activities-oa0-Hong_Kong.html
17.  Rome, Italy: https://www.tripadvisor.com/Attractions-g187791-Activities-oa0-Rome_Lazio.html
18.  Honolulu, Hawaii: https://www.tripadvisor.com/Attractions-g60982-Activities-oa0-Honolulu_Oahu_Hawaii.html
19.  Bangkok, Thailand: https://www.tripadvisor.com/Attractions-g293916-Activities-oa0-Bangkok.html
20.  San Francisco, California: https://www.tripadvisor.com/Attractions-g60713-Activities-oa0-San_Francisco_California.html

We will now import the necessary libraries for web scraping.

In [1]:
import requests
import os
import re
from bs4 import BeautifulSoup
import pandas as pd

Below are our three core functions for getting text data.

In [2]:
loc_dir = 'C:/Users/mskeh/Documents/GitHub/Thinkful/Capstone Projects/Final_Capstone_NLP_Search_Recommendation/Data'

def find_things_to_do_text(url):
    """
    Extracts text inside "Things to Do" from a TripAdvisor page.

    Args:
        url (str): The URL of the TripAdvisor page.

    Returns:
        list: List of extracted texts.
    """
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    all_attractions = soup.find_all('div', class_='XfVdV o AIbhI')
    return [re.sub(r'^\d+\.', '', attraction.text).strip() for attraction in all_attractions]

def create_url_list(base_url, number, end):
    """
    Creates a list of URLs for scraping TripAdvisor pages.

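    Args:
        base_url (str): The URL of the first results page. Unused here;
            scrape_and_save prepends it to the returned list.
        number (str): TripAdvisor's identifier for the location, e.g. 'g274707'.
        end (str): The location slug that ends the URL, e.g. 'Prague_Bohemia'.

    Returns:
        list: URLs for result pages 2 through 200 (offsets 30 to 5970).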
    """
    return [f'https://www.tripadvisor.com/Attractions-{number}-a_sort.-oa{30*i - 30}-{end}.html' for i in range(2, 201)]

def scrape_and_save(base_url, number, end, name, csv_name):
    """
    Scrapes TripAdvisor pages, saves data to a DataFrame, and exports to CSV.

    Args:
        base_url (str): The base URL of the TripAdvisor page.
        number (str): A unique identifier for the location.
        end (str): Additional information for the URL.
        name (str): The name of the location.
        csv_name (str): The desired name for the CSV file.

    Returns:
        pandas.DataFrame: The DataFrame containing the scraped data.
    """
    url_list = create_url_list(base_url, number, end)
    url_list.insert(0, base_url)

    all_things_to_do = [attraction for url in url_list for attraction in find_things_to_do_text(url)]

    df = pd.DataFrame(all_things_to_do, columns=['Text'])
    df['Location'] = name
    csv_path = os.path.join(loc_dir, csv_name)
    df.to_csv(csv_path, index=False)
    return df
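
To make the pagination and text cleanup used by these functions concrete, here is a small offline sketch. It assumes each results page holds 30 attractions (which the `30*i - 30` offset implies), and the helper names `page_offsets` and `strip_rank_prefix` are hypothetical, introduced only for illustration. The sample titles are made-up inputs, not scraped data.

```python
import re

# Assumption: each listing page shows 30 attractions, as implied by `30*i - 30`.
PAGE_SIZE = 30


def page_offsets(first_page, last_page):
    """Return the `oa` offsets for pages first_page..last_page (1-indexed, inclusive)."""
    return [PAGE_SIZE * i - PAGE_SIZE for i in range(first_page, last_page + 1)]


def strip_rank_prefix(text):
    """Remove a leading rank such as '12.' using the same regex as find_things_to_do_text."""
    return re.sub(r'^\d+\.', '', text).strip()


# Pages 2 to 4 map to offsets 30, 60 and 90.
offsets = page_offsets(2, 4)

# Illustrative titles: two with a rank prefix, one without.
samples = ['1. Charles Bridge', '12.Old Town Square', 'Prague Castle']
titles = [strip_rank_prefix(t) for t in samples]

print(offsets)
print(titles)
```

With `range(2, 201)` as in `create_url_list`, the last generated offset would be 5970, so together with the base URL the scraper requests up to 200 pages per location.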

We will now use the three helper functions to scrape data. We will put all of the scraped data into DataFrames and then combine all of them into a single DataFrame.
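
The per-location cells that follow all repeat the same five assignments, so one possible alternative is to drive the scrape from a list of location records. The sketch below is only an illustration of that idea, not the approach this notebook takes: the `Location` dataclass, `scrape_all`, and the stub `fake_scrape` are hypothetical names, and the stub stands in for `scrape_and_save` so the example runs without network access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    number: str    # TripAdvisor location identifier, e.g. 'g274707'
    end: str       # URL slug, e.g. 'Prague_Bohemia'
    name: str      # label stored in the Location column
    csv_name: str  # file name for the per-location CSV

    @property
    def base_url(self):
        # Same pattern as the first-page URLs used in this notebook.
        return (f'https://www.tripadvisor.com/Attractions-{self.number}'
                f'-Activities-oa0-{self.end}.html')


LOCATIONS = [
    Location('g274707', 'Prague_Bohemia', 'Prague, Czech Republic', 'prague.csv'),
    Location('g187323', 'Berlin', 'Berlin, Germany', 'berlin.csv'),
]


def scrape_all(locations, scrape):
    """Call scrape(base_url, number, end, name, csv_name) per location, keyed by name."""
    return {loc.name: scrape(loc.base_url, loc.number, loc.end, loc.name, loc.csv_name)
            for loc in locations}


def fake_scrape(base_url, number, end, name, csv_name):
    """Stub standing in for scrape_and_save; returns its inputs instead of a DataFrame."""
    return (base_url, csv_name)


results = scrape_all(LOCATIONS, fake_scrape)
print(results['Berlin, Germany'])
```

Passing `scrape_and_save` instead of `fake_scrape` would return one DataFrame per location, at the cost of losing the separate named variables used in the cells below.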

### Prague, Czech Republic

In [3]:
base_url = 'https://www.tripadvisor.com/Attractions-g274707-Activities-oa0-Prague_Bohemia.html'
number = 'g274707'
end = 'Prague_Bohemia'
name = 'Prague, Czech Republic'
csv_name = 'prague.csv'

prague_df = scrape_and_save(base_url, number, end, name, csv_name)

### Rajasthan, India

In [4]:
base_url = 'https://www.tripadvisor.com/Attractions-g297665-Activities-oa0-Rajasthan.html'
number = 'g297665'
end = 'Rajasthan'
name = 'Rajasthan, India'
csv_name = 'rajasthan.csv'

rajasthan_df = scrape_and_save(base_url, number, end, name, csv_name)

### Berlin, Germany

In [5]:
base_url = 'https://www.tripadvisor.com/Attractions-g187323-Activities-oa0-Berlin.html'
number = 'g187323'
end = 'Berlin'
name = 'Berlin, Germany'
csv_name = 'berlin.csv'

berlin_df = scrape_and_save(base_url, number, end, name, csv_name)

### Moscow, Russia

In [6]:
base_url = 'https://www.tripadvisor.com/Attractions-g298484-Activities-oa0-Moscow_Central_Russia.html'
number = 'g298484'
end = 'Moscow_Central_Russia'
name = 'Moscow, Russia'
csv_name = 'moscow.csv'

moscow_df = scrape_and_save(base_url, number, end, name, csv_name)

### Mexico City, Mexico

In [7]:
base_url = 'https://www.tripadvisor.com/Attractions-g150800-Activities-oa0-Mexico_City_Central_Mexico_and_Gulf_Coast.html'
number = 'g150800'
end = 'Mexico_City_Central_Mexico_and_Gulf_Coast'
name = 'Mexico City, Mexico'
csv_name = 'mexico_city.csv'

mexico_city_df = scrape_and_save(base_url, number, end, name, csv_name)

### New York City, New York

In [8]:
base_url = 'https://www.tripadvisor.com/Attractions-g60763-Activities-oa0-New_York_City_New_York.html'
number = 'g60763'
end = 'New_York_City_New_York'
name = 'New York City, New York'
csv_name = 'new_york_city.csv'

new_york_city_df = scrape_and_save(base_url, number, end, name, csv_name)

### Las Vegas, Nevada

In [9]:
base_url = 'https://www.tripadvisor.com/Attractions-g45963-Activities-oa0-Las_Vegas_Nevada.html'
number = 'g45963'
end = 'Las_Vegas_Nevada'
name = 'Las Vegas, Nevada'
csv_name = 'las_vegas.csv'

las_vegas_df = scrape_and_save(base_url, number, end, name, csv_name)

### Paris, France

In [10]:
base_url = 'https://www.tripadvisor.com/Attractions-g187147-Activities-oa0-Paris_Ile_de_France.html'
number = 'g187147'
end = 'Paris_Ile_de_France'
name = 'Paris, France'
csv_name = 'paris.csv'

paris_df = scrape_and_save(base_url, number, end, name, csv_name)

### Stockholm, Sweden

In [11]:
base_url = 'https://www.tripadvisor.com/Attractions-g189852-Activities-oa0-Stockholm.html'
number = 'g189852'
end = 'Stockholm'
name = 'Stockholm, Sweden'
csv_name = 'stockholm.csv'

stockholm_df = scrape_and_save(base_url, number, end, name, csv_name)

### Hokkaido, Japan

In [12]:
base_url = 'https://www.tripadvisor.com/Attractions-g298143-Activities-oa0-Hokkaido.html'
number = 'g298143'
end = 'Hokkaido'
name = 'Hokkaido, Japan'
csv_name = 'hokkaido.csv'

hokkaido_df = scrape_and_save(base_url, number, end, name, csv_name)

### Melbourne, Australia

In [13]:
base_url = 'https://www.tripadvisor.com/Attractions-g255100-Activities-oa0-Melbourne_Victoria.html'
number = 'g255100'
end = 'Melbourne_Victoria'
name = 'Melbourne, Australia'
csv_name = 'melbourne.csv'

melbourne_df = scrape_and_save(base_url, number, end, name, csv_name)

### Bahamas, Atlantic Ocean

In [14]:
base_url = 'https://www.tripadvisor.com/Attractions-g147414-Activities-oa0-Bahamas.html'
number = 'g147414'
end = 'Bahamas'
name = 'Bahamas, Atlantic Ocean'
csv_name = 'bahamas.csv'

bahamas_df = scrape_and_save(base_url, number, end, name, csv_name)

### Bahia, Brazil

In [15]:
base_url = 'https://www.tripadvisor.com/Attractions-g303251-Activities-oa0-State_of_Bahia.html'
number = 'g303251'
end = 'State_of_Bahia'
name = 'Bahia, Brazil'
csv_name = 'bahai.csv'

bahai_df = scrape_and_save(base_url, number, end, name, csv_name)

### Bali, Indonesia

In [16]:
base_url = 'https://www.tripadvisor.com/Attractions-g294226-Activities-oa0-Bali.html'
number = 'g294226'
end = 'Bali'
name = 'Bali, Indonesia'
csv_name = 'bali.csv'

bali_df = scrape_and_save(base_url, number, end, name, csv_name)

### Amsterdam, Netherlands

In [17]:
base_url = 'https://www.tripadvisor.com/Attractions-g188590-Activities-oa0-Amsterdam_North_Holland_Province.html'
number = 'g188590'
end = 'Amsterdam_North_Holland_Province'
name = 'Amsterdam, Netherlands'
csv_name = 'amsterdam.csv'

amsterdam_df = scrape_and_save(base_url, number, end, name, csv_name)

### Hong Kong, China

In [18]:
base_url = 'https://www.tripadvisor.com/Attractions-g294217-Activities-oa0-Hong_Kong.html'
number = 'g294217'
end = 'Hong_Kong'
name = 'Hong Kong, China'
csv_name = 'hong_kong.csv'

hong_kong_df = scrape_and_save(base_url, number, end, name, csv_name)

### Rome, Italy

In [19]:
base_url = 'https://www.tripadvisor.com/Attractions-g187791-Activities-oa0-Rome_Lazio.html'
number = 'g187791'
end = 'Rome_Lazio'
name = 'Rome, Italy'
csv_name = 'rome.csv'

rome_df = scrape_and_save(base_url, number, end, name, csv_name)

### Honolulu, Hawaii

In [20]:
base_url = 'https://www.tripadvisor.com/Attractions-g60982-Activities-oa0-Honolulu_Oahu_Hawaii.html'
number = 'g60982'
end = 'Honolulu_Oahu_Hawaii'
name = 'Honolulu, Hawaii'
csv_name = 'honolulu.csv'

honolulu_df = scrape_and_save(base_url, number, end, name, csv_name)

### Bangkok, Thailand

In [21]:
base_url = 'https://www.tripadvisor.com/Attractions-g293916-Activities-oa0-Bangkok.html'
number = 'g293916'
end = 'Bangkok'
name = 'Bangkok, Thailand'
csv_name = 'bangkok.csv'

bangkok_df = scrape_and_save(base_url, number, end, name, csv_name)

### San Francisco, California

In [22]:
base_url = 'https://www.tripadvisor.com/Attractions-g60713-Activities-oa0-San_Francisco_California.html'
number = 'g60713'
end = 'San_Francisco_California'
name = 'San Francisco, California'
csv_name = 'san_franscisco.csv'

san_franscisco_df = scrape_and_save(base_url, number, end, name, csv_name)

### Combine all the DataFrames into a single csv
We will now create an `all_df` list of the per-location DataFrames and write their combination to a `.csv` file containing the text for all "things to do" in all places.

In [23]:
all_df = [prague_df, rajasthan_df, berlin_df, moscow_df, mexico_city_df,
          new_york_city_df, las_vegas_df, paris_df, stockholm_df, hokkaido_df,
          melbourne_df, bahamas_df, bahai_df, bali_df, amsterdam_df,
          hong_kong_df, rome_df, honolulu_df, bangkok_df, san_franscisco_df]
all_things_to_do_df = pd.concat(all_df)

In [24]:
csv_name = 'all_things_to_do.csv'
csv_path = os.path.join(loc_dir, csv_name)
all_things_to_do_df.to_csv(csv_path, index=False)
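
As a quick sanity check on the combine step, the sketch below concatenates two tiny hand-made frames with the same `Text` and `Location` columns. The rows are made-up examples, not scraped data. It also passes `ignore_index=True`, which is an optional variation: without it, `pd.concat` keeps each frame's original row labels, so the combined index repeats 0, 1, 2, ... per location. That does not affect the CSV written with `index=False`, but it can matter for label-based lookups in memory.

```python
import pandas as pd

# Two small stand-ins for the per-location DataFrames.
prague = pd.DataFrame({'Text': ['Charles Bridge', 'Prague Castle'],
                       'Location': 'Prague, Czech Republic'})
berlin = pd.DataFrame({'Text': ['Brandenburg Gate'],
                       'Location': 'Berlin, Germany'})

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
combined = pd.concat([prague, berlin], ignore_index=True)

# Row counts per location, useful for spotting a location that scraped nothing.
counts = combined['Location'].value_counts()

print(combined)
print(counts)
```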

To make processing easier, we will complete further analysis in a different notebook. 