This repository contains source code for experiments exploring the feasibility of using a small, high-quality synthetic dataset, generated with a large language model such as PaLM 2, to fine-tune a much smaller model to produce high-quality summaries.
pip install -r requirements.txt
Our project uses the Yelp Dataset, consisting of over 6.9 million reviews and 150k businesses. To download the dataset, visit the Yelp Dataset website.
The downloaded dataset requires some preprocessing before we can use it: filtering out low-rated businesses, removing excessively long or short reviews, etc.
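As a rough illustration of the kind of filtering this involves (the thresholds below are placeholders, not the values process_data.py actually uses):

MIN_BUSINESS_STARS = 3.0  # drop low-rated businesses (placeholder threshold)
MIN_REVIEW_CHARS = 50     # drop very short reviews (placeholder threshold)
MAX_REVIEW_CHARS = 2000   # drop very long reviews (placeholder threshold)

def keep_business(business):
    # Yelp business.json records carry an aggregate star rating in the "stars" field.
    return business.get("stars", 0) >= MIN_BUSINESS_STARS

def keep_review(review):
    # Yelp review.json records carry the review body in the "text" field.
    return MIN_REVIEW_CHARS <= len(review.get("text", "")) <= MAX_REVIEW_CHARS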
Requirements:
- 5 GB free disk space
- 8 vCPUs with 64 GB RAM
Before running the preprocessing script, make sure to untar the downloaded data:
mkdir data
tar -xvf yelp_dataset.tar -C ./data
Now run the following commands to process the data. Processing will take about 10 minutes.
cd src
# --data_root: directory containing the untarred Yelp dataset JSON files
python process_data.py --data_root ./data
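The script writes processed_dataset.json under ./data; that file is used in the next step. A quick, optional sanity check (assuming only that the output is a single JSON file):

import json

with open("./data/processed_dataset.json") as f:
    data = json.load(f)

# Confirm the run produced a non-trivial dataset before moving on.
print(type(data).__name__, len(data))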
For experimentation, let's create a mini-dataset.
cd src
python make_dataset.py --data_file ./data/processed_dataset.json --out_dir ./data --out_file mini_dataset.json --num_businesses 10000 --max_reviews_per_business 8
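Conceptually, this just samples up to --num_businesses businesses and keeps at most --max_reviews_per_business reviews for each. A sketch of that idea, assuming a mapping from business id to a list of reviews (the real script's data layout may differ):

import random

def make_mini(dataset, num_businesses=10000, max_reviews=8, seed=0):
    rng = random.Random(seed)
    ids = rng.sample(list(dataset), min(num_businesses, len(dataset)))
    # Cap the number of reviews kept per sampled business.
    return {bid: dataset[bid][:max_reviews] for bid in ids}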
To evaluate our generated dataset, prompts, and fine-tuned models, we use the Yelp dataset released by Brazinskas et al. (2020). It contains human-written summaries for 100 businesses, produced by Amazon Mechanical Turk (AMT) workers who each summarized 8 reviews per business; every business has 3 summaries, one per worker.
To download the dataset, run:
mkdir ./data/gold_summs
curl "https://raw.githubusercontent.com/abrazinskas/FewSum/master/artifacts/yelp/gold_summs/train.csv" --output "./data/gold_summs/train.csv"
curl "https://raw.githubusercontent.com/abrazinskas/FewSum/master/artifacts/yelp/gold_summs/val.csv" --output "./data/gold_summs/val.csv"
curl "https://raw.githubusercontent.com/abrazinskas/FewSum/master/artifacts/yelp/gold_summs/test.csv" --output "./data/gold_summs/test.csv"
curl "https://s3.us-east-2.amazonaws.com/unsup-sum/summaries_0-200_cleaned.csv" --output "./data/chu_dataset.csv"
Now process the downloaded evaluation data. This merges the train/val/test splits into a single evaluation dataset of human-written summaries: 100 businesses, 800 reviews, and 300 summaries.
cd src
python process_eval_dataset.py --data_root ./data/gold_summs --out_dir ./data
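The merge itself is simple; a minimal pandas sketch of the idea behind process_eval_dataset.py (column handling in the real script may differ):

import pandas as pd

# Concatenate the FewSum train/val/test splits into one evaluation set.
frames = [pd.read_csv(f"./data/gold_summs/{split}.csv") for split in ("train", "val", "test")]
eval_df = pd.concat(frames, ignore_index=True)
print(len(eval_df), "rows")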
We conducted multiple experiments to evaluate our methods for generating data with the PaLM 2 API. The experiments are as follows:
- gen_data_exp1.py: simple prompt to generate data.
- gen_data_exp2.py: 2-step approach; first generate positive and negative summaries, then aggregate them.
- gen_data_exp3.py: first generate business attributes, then use these attributes to generate a summary focused on the top attributes.
These experiments are conducted on eval_dataset, which contains the human-written summaries.
To run each experiment:
# NOTE: Make sure you are on a GCP VM and have the Vertex AI API enabled.
python gen_data_exp1.py --data_file ./data/eval_dataset.json --gcp_project my_project --gcp_region us-central1
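All three scripts call PaLM 2 through the Vertex AI SDK. The core call looks roughly like this; the model name and decoding parameters here are illustrative, and the actual prompts live in the scripts:

import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my_project", location="us-central1")
model = TextGenerationModel.from_pretrained("text-bison")  # PaLM 2 text model

reviews = "Review 1 ...\nReview 2 ..."
prompt = f"Summarize the following reviews of a business:\n{reviews}\n\nSummary:"
response = model.predict(prompt, temperature=0.2, max_output_tokens=256)
print(response.text)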
To evaluate the generated data, run:
python evaluate_gen_data.py --data_file ./data/gen_dataset_exp1.json
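A common way to score generated summaries against the human-written references is ROUGE; a small example with the rouge-score package (shown for orientation only, not necessarily the exact metrics evaluate_gen_data.py reports):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the staff was friendly and the food was great",  # reference (human-written) summary
    "friendly staff and great food",                  # generated summary
)
print({name: round(s.fmeasure, 3) for name, s in scores.items()})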
Based on our experiments, the attribute-based approach produced the best generated summaries. To generate the dataset for training models, run:
# NOTE: If you get quota errors, request higher quota for the PaLM API requests.
python generate_data.py --data_file ./data/mini_dataset.json
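Quota errors from the API typically surface as ResourceExhausted (HTTP 429) exceptions. If raising the quota is not an option, wrapping the call in a simple exponential backoff usually gets a long generation run through (a sketch only; generate_data.py may already do something similar):

import time
from google.api_core import exceptions

def predict_with_retry(model, prompt, retries=5, base_delay=2.0, **kwargs):
    for attempt in range(retries):
        try:
            return model.predict(prompt, **kwargs)
        except exceptions.ResourceExhausted:
            # Back off exponentially before retrying the quota-limited request.
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("PaLM API quota retries exhausted")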
Now we will use the generated dataset to create training and validation splits. For testing, we will use the human-written eval_dataset we processed above.
To create the splits, run:
python make_train_val_split.py --data_file ./data/gen_dataset.json
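One simple way to do this is a random partition; a minimal sketch, assuming gen_dataset.json is a JSON list with one record per business (the split ratio is illustrative and the actual script may differ):

import json
import random

with open("./data/gen_dataset.json") as f:
    records = json.load(f)

random.Random(0).shuffle(records)
cut = int(0.9 * len(records))  # illustrative 90/10 train/val split
train, val = records[:cut], records[cut:]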
We used the BART base model to experiment with the following approaches:
- trainer_exp1.py: vanilla model; given the input reviews, generate a summary.
- trainer_exp2.py: controlled model; uses business attributes to make the summary focus on those attributes.
- trainer_exp3.py: controlled model; uses the business rating as an input prefix.
- trainer_exp4.py: controlled model; uses the business rating, phrased as a text prompt, as an input prefix (see the input-construction sketch after the commands below).
To run these experiments:
python trainer_exp1.py
python trainer_exp2.py
python trainer_exp3.py
python trainer_exp4.py
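The controlled variants differ mainly in how the encoder input is built. A minimal sketch of the rating-prefix idea behind trainer_exp3.py and trainer_exp4.py, using the Hugging Face BART tokenizer (the exact prefix formats are assumptions, not copied from the trainers):

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

reviews = "Great coffee. Friendly baristas. A bit pricey."
rating = 4.5

numeric_prefix = f"{rating} | {reviews}"                           # exp3-style: raw rating as prefix
text_prefix = f"The business is rated {rating} stars. {reviews}"   # exp4-style: rating phrased as text

batch = tokenizer([numeric_prefix, text_prefix], truncation=True, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)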
We will evaluate our models on the human-written eval_dataset.
We used Weights & Biases for logging.
To evaluate, run:
python eval.py --data_file ./data/gen_dataset_exp3.json --model_path ./exp1/checkpoint-600
python eval.py --data_file ./data/gen_dataset_exp3.json --model_path ./exp2/checkpoint-600 --use_attr True
python eval.py --data_file ./data/gen_dataset_exp3.json --model_path ./exp3/checkpoint-600 --use_rating True
python eval.py --data_file ./data/gen_dataset_exp3.json --model_path ./exp4/checkpoint-600 --use_rating True --use_rating_as_text True