# Using LegalBench for Question Answering
In this mini-project, I experimented with the LegalBench dataset for the legal question-answering task. I used the [recently released LLM by Google](https://blog.google/technology/developers/gemma-open-models/), `gemma-2b` (the code below loads the instruction-tuned `gemma-2b-it` checkpoint), as the starting point and fine-tuned it on a transformed version of the given dataset. The resulting dataset consists of questions whose answers are Yes/No, a selection from multiple choices, or open-ended.

I used the perplexity measure and a manual evaluation of a subset of the generated answers for the test set to see whether fine-tuning provides any benefit in this task.

In [None]:
!pip install --upgrade pandas datasets

I decided to move the data transformation code to a separate Python file to improve code readability. The file is named `Data.py` and is expected to be a sibling of the current notebook.

In [2]:
from Data import transform_data

First, I read the provided dataset and transform it using the data transformation that I built.

In [3]:
import pandas as pd

df = pd.read_json('raw_data_sample.json')

data = [transform_data(row) for row in df.itertuples()]
indexes, contexts, questions, answers = zip(*data)

assert len(contexts) == len(questions) == len(answers) and len(contexts) > 0

Next, I use `scikit-learn` to split the data into two sets: train (95%) and test (5%).

In [4]:
from sklearn.model_selection import train_test_split
from datasets import Dataset

input_texts = [{"text": f"Answer the Question based on the given Context.\nContext: {c}\nQuestion: {q}\nAnswer: {a}"} for c, q, a in zip(contexts, questions, answers)]
train_texts, test_texts = train_test_split(input_texts, test_size=0.05, random_state=7)

train_dataset = Dataset.from_list(train_texts)
test_dataset = Dataset.from_list(test_texts)
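The prompt template above puts the answer at the end of each training text, so the same string can be cut at the `Answer:` marker to obtain an inference prompt. The following is a minimal sketch of that round trip; the helper names `build_prompt` and `strip_answer` are hypothetical and are not part of `Data.py`.

```python
ANSWER_MARKER = "\nAnswer: "


def build_prompt(context: str, question: str, answer: str = "") -> str:
    """Format one example with the same template used in the notebook."""
    return (
        "Answer the Question based on the given Context.\n"
        f"Context: {context}\nQuestion: {question}{ANSWER_MARKER}{answer}"
    )


def strip_answer(text: str) -> str:
    """Drop the gold answer but keep the trailing 'Answer: ' cue."""
    head, marker, _ = text.partition(ANSWER_MARKER)
    return head + marker


# Made-up example, only to show the round trip
full = build_prompt("The lease ends in May.", "Does the lease end in May?", "Yes")
prompt = strip_answer(full)
print(prompt)
```

Using `str.partition` here rather than slicing at `find(...)` means a text without the marker is returned unchanged instead of losing its last character.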

Then, I import the libraries needed for fine-tuning the `gemma-2b` model. I use the Hugging Face libraries for this purpose.

In [None]:
!pip install bitsandbytes transformers peft trl huggingface_hub

In [6]:
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from transformers import BitsAndBytesConfig, set_seed

from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model

from trl import SFTTrainer

set_seed(7)
model_name = "google/gemma-2b-it"

Loading `gemma-2b` requires accepting its terms and conditions and logging in to the Hugging Face Hub with an access token.
The terms and conditions can be accepted after logging in to Hugging Face: `https://huggingface.co/google/gemma-2b`.
To run this notebook, please use your own Hugging Face access token.

In [None]:
from huggingface_hub import login
login(token='YOUR_TOKEN')

## Fine-tuning the Model
The next step is loading the model and the corresponding tokenizer.

### Fitting the LLM on my GPU!
Since fine-tuning `gemma-2b` required more memory than I had available (15GB), I needed to take advantage of quantization to reduce the memory usage. Basically, BitsAndBytes quantization uses lower-precision data types to enable loading larger models.

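My configuration loads the linear layers of the model in 4-bit precision using the NF4 (4-bit NormalFloat) data type, enables double quantization of the quantization constants, and performs computation in `bfloat16`.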

In [32]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
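To give a feel for why lower precision helps, here is a rough back-of-the-envelope sketch of the memory needed just to store the weights at different precisions. The parameter count of 2.5 billion is an assumption for illustration, and the estimate ignores quantization constants, layers that stay in higher precision, activations, gradients, and optimizer state.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store n_params weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 2.5e9  # assumption: rough parameter count, not read from the model

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```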

### Loading the Model
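I load the model and its corresponding tokenizer using Hugging Face `transformers`. The BitsAndBytes quantization that I configured earlier is applied at load time through the `quantization_config` argument. Then, I pass the model to `prepare_model_for_kbit_training`, which prepares the quantized model for training (for example, by enabling gradient checkpointing).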

In [104]:
tokenizer = AutoTokenizer.from_pretrained(model_name, truncation=True, truncation_side = "left")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_value = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

`low_cpu_mem_usage` was None, now set to True since model is quantized.



The next change that I found necessary for successfully fine-tuning `gemma-2b` within my time and hardware constraints was reducing the number of trainable parameters. [Recent work](https://arxiv.org/abs/2106.09685) has shown that training a number of parameters that is only a very small fraction of the pretrained model's size can be beneficial. That is where `LoraConfig` comes in.

LoRA freezes all the parameters of the pretrained model and injects a pair of trainable low-rank matrices into the targeted layers of the Transformer; only these matrices are updated during fine-tuning.

In [105]:
peft_config = LoraConfig(
        r=8,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=['q_proj', 'up_proj', 'v_proj', 'k_proj', 'o_proj', 'down_proj', 'gate_proj']
)
model = get_peft_model(model, peft_config)
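To illustrate how few parameters a LoRA adapter adds, the sketch below counts the parameters of the two low-rank matrices for a single linear layer and compares them with the frozen weight matrix. The layer size of 2048 x 2048 is an assumption chosen for illustration and is not read from the model config; `r=8` matches the configuration above.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters of the frozen weight matrix the adapter sits next to."""
    return d_in * d_out


d_in = d_out = 2048  # assumption: illustrative layer size
r = 8                # rank used in the LoraConfig above

added = lora_param_count(d_in, d_out, r)
frozen = full_param_count(d_in, d_out)
print(f"adapter: {added:,} params, frozen: {frozen:,} params, ratio: {added / frozen:.2%}")
```

In the notebook itself, calling `model.print_trainable_parameters()` after `get_peft_model` reports the actual trainable and total parameter counts.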

Then, I set the training arguments. Due to time constraints, I experimented with only 3 different learning rates to choose the best one and kept the rest of the hyperparameters fixed.
Since the fine-tuning data is relatively small, I decided to train for a single epoch.

In [106]:
training_arguments = TrainingArguments(
        output_dir="./results",
        do_eval=True,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_8bit",
        save_steps=100,
        logging_steps=100,
        learning_rate=1e-5,
        eval_steps=100,
        num_train_epochs=1,
        warmup_ratio=0.02,
        lr_scheduler_type="linear",
        load_best_model_at_end=True,
        save_strategy="steps",
        evaluation_strategy="steps"
)
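These arguments imply an effective batch size and a number of optimizer steps that are easy to work out by hand. The sketch below does the arithmetic for 9288 training examples (the count shown in the trainer's `Map` progress output); it assumes a single device and only approximates how the trainer rounds step counts, and the function name `schedule_summary` is hypothetical.

```python
import math


def schedule_summary(n_examples, per_device_batch, grad_accum, epochs,
                     warmup_ratio, n_devices=1):
    """Return (effective batch size, optimizer steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = max(n_examples // effective_batch, 1)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


effective_batch, total_steps, warmup_steps = schedule_summary(
    n_examples=9288, per_device_batch=1, grad_accum=4, epochs=1, warmup_ratio=0.02
)
print(effective_batch, total_steps, warmup_steps)  # 4 2322 47
```

With `eval_steps=100` and `save_steps=100`, a run of this length is evaluated and checkpointed roughly every 4% of training.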

Then, I set up the fine-tuning trainer so that it uses the `training_arguments` defined above.

In [107]:
trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        dataset_text_field="text",
        tokenizer=tokenizer,
        args=training_arguments,
        peft_config=peft_config,
        max_seq_length=512
)

Map:   0%|          | 0/9288 [00:00<?, ? examples/s]

Map:   0%|          | 0/489 [00:00<?, ? examples/s]



### Measuring the Perplexity before Fine-tuning
Before fine-tuning, I measure the perplexity of the original (non-fine-tuned) model on the legal test set.

In [108]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 67.86
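
The perplexity printed above is the exponential of the evaluation loss. Assuming `eval_loss` is the mean per-token cross-entropy in nats, this is equivalent to the minimal sketch below, which uses made-up token probabilities purely for illustration.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up probabilities the model assigns to three reference tokens
log_probs = [math.log(0.5), math.log(0.25), math.log(0.125)]
print(round(perplexity(log_probs), 6))  # 4.0
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens at each position, so lower is better.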


Next, I record the original model's answers to 100 randomly chosen samples from the test set. I will use them later to manually evaluate the models' responses.

In [109]:
from random import sample

# Sample without replacement so that the 100 evaluation examples are distinct
subset_test_dataset = list(test_dataset.select(sample(range(len(test_dataset)), k=100)))

with open('ground_truth_subset_test.txt', 'w') as f:
  for i, entry in enumerate(subset_test_dataset):
    f.write(str(i) + "\n")
    f.write(entry['text'])
    f.write('\n====================\n')

with open('original_subset_test.txt', 'w') as f:
  for i, entry in enumerate(subset_test_dataset):
    text = entry["text"][:entry["text"].find('\nAnswer: ')] + '\nAnswer: '
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    f.write(str(i) + "\n")
    f.write(result)
    f.write('\n====================\n')



### Fine-tuning
Now, we can finally start the fine-tuning using the configurations and hyperparameters set earlier.

In [110]:
torch.cuda.empty_cache()
!rm -rf results legal_google
model.config.use_cache = False
trainer.train()
model.config.use_cache = True



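| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 3.7564 | 3.111483 |
| 200  | 2.7149 | 2.423674 |
| 300  | 2.3386 | 2.195314 |
| 400  | 2.1252 | 2.01895  |
| 500  | 2.0145 | 1.914986 |
| 600  | 1.926  | 1.848658 |
| 700  | 1.816  | 1.79915  |
| 800  | 1.8151 | 1.759271 |
| 900  | 1.8012 | 1.723946 |
| 1000 | 1.7789 | 1.696627 |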








In [111]:
trainer.model.save_pretrained('legal_'+model_name)

### Evaluating the Perplexity of the New Model
Then, I evaluated the perplexity of the fine-tuned model.

In [112]:
new_eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(new_eval_results['eval_loss']):.2f}")

assert new_eval_results['eval_loss'] < eval_results['eval_loss']

Perplexity: 4.56


As expected (and desired!), the perplexity of the fine-tuned model is **lower** than that of the original model.

Let's record the fine-tuned model's answers to the same subset of the test set.

In [113]:
with open('fine-tuned_subset_test.txt', 'w') as f:
  for i, entry in enumerate(subset_test_dataset):
    text = entry["text"][:entry["text"].find('\nAnswer: ')] + '\nAnswer: '
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    f.write(str(i) + "\n")
    f.write(result)
    f.write('\n====================\n')



## Manual Evaluation

In this final section, I summarize the results of the manual evaluation of the original and the fine-tuned model on a subset of the test set. The original model answers the given question correctly in **36%** of the test cases, whereas the fine-tuned model answers correctly in **62%** of them. For the full details, please refer to `manual_evaluation.csv`.
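
Percentages like these can be computed from a file of per-example judgments. The sketch below shows one way to do it, assuming a CSV with one row per test case and a 0/1 column per model; the column names `original_correct` and `fine_tuned_correct` are hypothetical and may not match `manual_evaluation.csv`, and the inline rows are toy data, not the actual evaluation.

```python
import csv
import io


def accuracy_by_model(csv_file, columns):
    """Return the fraction of rows marked correct ("1") for each column."""
    rows = list(csv.DictReader(csv_file))
    return {col: sum(row[col] == "1" for row in rows) / len(rows) for col in columns}


# Toy stand-in for the real file; these rows are made up for illustration
toy_csv = io.StringIO(
    "id,original_correct,fine_tuned_correct\n"
    "0,0,1\n"
    "1,1,1\n"
    "2,0,0\n"
    "3,0,1\n"
)

scores = accuracy_by_model(toy_csv, ["original_correct", "fine_tuned_correct"])
for name, acc in scores.items():
    print(f"{name}: {acc:.0%}")
```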