TripAdvisor Dataset

Overview

This repository contains a processed dataset of hotel reviews and ratings collected from TripAdvisor. Each record pairs a review text with metadata such as multi-aspect ratings.

Source

The data was originally distributed by Jiwei Li et al. (2013) and is hosted on Jiwei Li's website at Carnegie Mellon University: http://www.cs.cmu.edu/~jiweil/html/hotel-review.html

Processing

For details on how we processed the dataset, please refer to the `process.ipynb` file. In short, we used a machine-learning language identifier to keep only the reviews written in English. Specifically, we adopted `fastText` (https://fasttext.cc) by Meta with its pre-trained language identification model `lid.176.bin`.
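
The English filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code from `process.ipynb`: the helper names (`load_predictor`, `is_english`, `filter_english`), the local model path, and the `min_prob` confidence threshold are all assumptions introduced here. It relies on fastText's `predict` returning labels of the form `__label__<code>` and rejecting strings that contain newlines.

```python
from typing import Callable, Iterable, List

LABEL_EN = "__label__en"  # fastText language-ID labels look like __label__<code>


def load_predictor(model_path: str = "lid.176.bin") -> Callable:
    """Load the language-ID model and return its predict method.

    The path is an assumption; download lid.176.bin from fasttext.cc first.
    """
    import fasttext  # imported lazily so the helpers below work without it

    model = fasttext.load_model(model_path)
    return model.predict


def is_english(text: str, predict: Callable, min_prob: float = 0.5) -> bool:
    """Return True when the top predicted language is English.

    min_prob is a hypothetical threshold, not a value taken from the source.
    """
    # fastText's predict() raises an error on strings containing newlines.
    one_line = text.replace("\n", " ").strip()
    if not one_line:
        return False
    labels, probs = predict(one_line, k=1)
    return labels[0] == LABEL_EN and float(probs[0]) >= min_prob


def filter_english(texts: Iterable[str], predict: Callable,
                   min_prob: float = 0.5) -> List[str]:
    """Keep only the texts identified as English."""
    return [t for t in texts if is_english(t, predict, min_prob)]
```

With the model file available locally, usage would look like `filter_english(df["review"], load_predictor())`. Passing the predictor in as an argument keeps the filtering logic easy to try out with a stand-in function when the model is not installed.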

Dataset Description

The two data files (.csv and .pkl) have the same contents; however, we recommend the pickle file (.pkl), which retains the pandas dtypes (especially datetime) and np.nan for missing values. In the original data, the variables were stored in JSON format; we have reorganised them so that each row of a pandas DataFrame combines one review with its ratings.
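
To illustrate why the pickle file is recommended, the sketch below writes a small invented frame (not rows from the dataset) to both formats in a temporary directory and reads it back. It shows that a plain `pd.read_csv` does not restore the datetime dtype, whereas `pd.read_pickle` does; the toy column names mirror two of the dataset's columns but the values are made up.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Invented toy data with one missing rating and one missing date.
toy = pd.DataFrame({
    "overall": [5.0, np.nan],
    "date": pd.to_datetime(["2012-01-15", None]),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pkl_path = os.path.join(tmp, "toy.pkl")
    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)   # dates come back as plain strings
    from_pkl = pd.read_pickle(pkl_path)  # dtypes are preserved
    # The datetime dtype can be recovered from CSV by asking for it explicitly.
    from_csv_parsed = pd.read_csv(csv_path, parse_dates=["date"])

is_dt = pd.api.types.is_datetime64_any_dtype
print("csv keeps datetime:   ", is_dt(from_csv["date"]))
print("pickle keeps datetime:", is_dt(from_pkl["date"]))
print("csv + parse_dates:    ", is_dt(from_csv_parsed["date"]))
```

If you do use the CSV file, passing `parse_dates` for the date columns is one way to get datetime values back, assuming the stored strings are in a format pandas can parse.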

The dataset includes the following columns in each line:

hotel_id: Unique identifier for hotels.
user_id: Unique identifier for users.
title: Heading of the user review.
text: Actual text of the review.
review: Title and text combined as follows: title \n text
overall: The overall rating given by the user.
cleanliness: The rating regarding the cleanliness.
value: The rating regarding the value.
location: The rating regarding the location.
rooms: The rating regarding the rooms.
sleep_quality: The rating regarding the sleep quality.
date_stayed: The date when the user stayed.
date: The date when the review was posted.
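
As an illustration of how these columns fit together, the sketch below builds a tiny hand-made frame with a subset of the columns (the values are invented, not taken from the dataset), reconstructs `review` from `title` and `text`, and averages the ratings per hotel. It assumes the two parts are joined by a single newline, as the column description suggests.

```python
import numpy as np
import pandas as pd

# Invented toy rows that mimic a subset of the columns described above.
toy = pd.DataFrame({
    "hotel_id": [1, 1, 2],
    "title": ["Great stay", "Too noisy", "Fine"],
    "text": ["Clean rooms.", "Thin walls.", "Good value."],
    "overall": [5.0, 2.0, 4.0],
    "cleanliness": [5.0, 3.0, np.nan],  # aspect ratings may be missing
    "sleep_quality": [4.0, 1.0, 4.0],
})

# Assumption: `review` is the title and the text joined by one newline.
toy["review"] = toy["title"] + "\n" + toy["text"]

# Mean rating per hotel; pandas skips NaN values when averaging.
ratings = ["overall", "cleanliness", "sleep_quality"]
per_hotel = toy.groupby("hotel_id")[ratings].mean()
print(per_hotel)
```

A hotel whose reviews all lack a given aspect rating ends up with NaN for that aspect, so downstream code should handle missing values explicitly.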

Usage

The full dataset is hosted on Hugging Face Datasets, so you need to install the `datasets` package first:

pip install datasets

You can then load the dataset with the `datasets` package and convert it to a pd.DataFrame:

from datasets import load_dataset
# load_dataset returns a DatasetDict; the reviews are in its 'train' split
ds = load_dataset("jniimi/tripadvisor-review-rating")
df = ds['train'].to_pandas()

Alternatively, you can download the pickle file directly:

import pandas as pd
from huggingface_hub import hf_hub_download
f = hf_hub_download('jniimi/tripadvisor-review-rating', repo_type='dataset', filename='data.pkl')
df = pd.read_pickle(f)

However, the whole dataset is large, so you can instead use a sample file of 1,000 observations (https://huggingface.co/datasets/jniimi/tripadvisor-review-rating/resolve/main/data1000.pkl), for example when working on Google Colab:

import pandas as pd
from huggingface_hub import hf_hub_download
f = hf_hub_download('jniimi/tripadvisor-review-rating', repo_type='dataset', filename='data1000.pkl')
df = pd.read_pickle(f)
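
Once either file is loaded, a quick sanity check can confirm that the frame has the columns listed in the Dataset Description and show how much of each column is missing. The sketch below is only a suggestion: the helper names `check_columns` and `missing_share` are hypothetical, and a small stand-in frame is used so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Column names as listed in the Dataset Description section.
EXPECTED_COLUMNS = [
    "hotel_id", "user_id", "title", "text", "review", "overall",
    "cleanliness", "value", "location", "rooms", "sleep_quality",
    "date_stayed", "date",
]


def check_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df."""
    return [c for c in EXPECTED_COLUMNS if c not in df.columns]


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Stand-in frame so the sketch runs offline; replace it with the loaded df.
demo = pd.DataFrame({"hotel_id": [1, 2], "overall": [5.0, np.nan]})
print(check_columns(demo))
print(missing_share(demo))
```

On the real data, `check_columns(df)` should return an empty list if the file matches the description above.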

Citation

As indicated earlier, this dataset is a reprocessed distribution of a dataset published by Dr. Li and colleagues, so please follow their instructions for use.

1. Original Dataset

Do not forget to cite the original Hotel Dataset (Li et al., 2013) https://nlp.stanford.edu/~bdlijiwei/Code.html

@inproceedings{li2013identifying,
  title={Identifying manipulated offerings on review portals},
  author={Li, Jiwei and Ott, Myle and Cardie, Claire},
  booktitle={Proceedings of the 2013 conference on empirical methods in natural language processing},
  pages={1933--1942},
  year={2013}
}

Li, J., Ott, M., & Cardie, C. (2013, October). Identifying manipulated offerings on review portals. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1933-1942). https://aclanthology.org/D13-1199/

2. Citation for us

Our dataset will soon be citable in academic publications as well.

@misc{tripadvisor_dataset,
author = {Niimi, Junichiro},
title = {Hotel Review Dataset (English)},
year = {2024},
howpublished = {\url{https://github.com/jniimi/tripadvisor_dataset}}
}
