# Fine-tuning a model based on raw documents from Confluence

This notebook contains code for fine-tuning a model based on raw documents from Confluence.

## Introduction
The process will contain several parts:

- Data downloading
We downloaded several examples from the publicly available Apache Software Foundation Confluence to build a raw dataset. This step was done outside of this notebook. You can read more about Confluence export here: [https://confluence.atlassian.com/doc/export-content-to-word-pdf-html-and-xml-139475.html](https://confluence.atlassian.com/doc/export-content-to-word-pdf-html-and-xml-139475.html)
- Data extraction
For data extraction from the dumps we will use Apache Tika running in a separate Docker container.
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. You can read more about it here: [https://tika.apache.org/](https://tika.apache.org/)
- Data processing
We will use the Hugging Face `datasets` library to process the data. It is a library for loading and processing datasets in a few lines of code. You can read more about it here: [https://huggingface.co/docs/datasets/](https://huggingface.co/docs/datasets/)
We also have to extract the instruction and the answer from the raw data.
- Data augmentation
Dataset augmentation is the process of creating new data from existing data. Here we use a model to rephrase titles into questions and apply simple word-level edits to create additional examples.
- Model fine-tuning
Using modern techniques such as PEFT, LoRA, DeepSpeed and Accelerate, we will fine-tune the model on the dataset. You can read more about it here: [https://huggingface.co/transformers/training.html](https://huggingface.co/transformers/training.html) [https://huggingface.co/blog/peft](https://huggingface.co/blog/peft)
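
The extraction cell later in this notebook parses the HTML exports directly, so the Tika step itself is not shown. As a minimal sketch of what a call to a Tika server could look like, assuming the server listens on `http://localhost:9998` (the default port of the Tika server; the URL and both helper names are assumptions, not part of this notebook):

```python
import urllib.request

# Assumption: a Tika server is reachable on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, tika_url=TIKA_URL):
    """Build a PUT request that asks the Tika server for plain text."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, tika_url=TIKA_URL):
    """Send one file to the Tika server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

`extract_text` needs a running server; `build_tika_request` only constructs the request, so it can be inspected offline.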


## Setup environment

First of all we need to install all the dependencies needed for the project.

In [21]:
%%bash

pip install beautifulsoup4 requests tqdm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install git+https://github.com/huggingface/peft.git
pip install bitsandbytes transformers evaluate datasets accelerate loralib --upgrade --quiet
pip install rouge-score tensorboard py7zr transformers[deepspeed] nltk



## Data

### Data extraction

At this stage we will produce a list of records, one per section of a page, in the following format:
```json
[
    {
        "file": "path/to/file",
        "source_raw": "page title > section headers",
        "target": "section content"
    }
]
```
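
To illustrate how nested section headers can be turned into a single title path, here is a small dependency-free sketch. The function name `build_breadcrumbs` and the sample headers are made up for illustration; the extraction cell works on parsed HTML rather than on `(level, text)` pairs.

```python
def build_breadcrumbs(page_title, headers):
    """Turn (level, text) header pairs into 'title > h2 > h3' paths."""
    stack = []
    breadcrumbs = []
    for level, text in headers:
        depth = level - 1  # h2 sits directly below the page title
        # drop headers of the same or a deeper level
        while len(stack) >= depth:
            stack.pop()
        stack.append(text)
        breadcrumbs.append(" > ".join([page_title] + stack))
    return breadcrumbs


# Hypothetical page outline: two h2 sections, the first with one h3 child
headers = [(2, "Key Git Concepts"), (3, "Forking onto GitHub"), (2, "Goals")]
for path in build_breadcrumbs("Hadoop", headers):
    print(path)
```

Each new header first removes every entry at the same or a deeper level, so sibling sections never end up in the same path.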

In [22]:
import os

import pandas as pd
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

input_directory = os.path.join("..", "data", "confluence_exports")
include_extensions = (".html",)

dataset_path = os.path.join("..", "datasets", "confluence_exports-inputs-augmented")

header_tags = ["h2", "h3", "h4", "h5", "h6"]


def get_files_to_process(root_path):
    """Yield every file under root_path whose extension is in include_extensions."""
    for dirpath, _, filenames in os.walk(root_path):
        for filename in filenames:
            if filename.endswith(include_extensions):
                yield os.path.join(dirpath, filename)


rows = []
file_list = list(get_files_to_process(input_directory))

for file_path in tqdm(file_list, desc="Processing files"):
    with open(file_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file.read(), "html.parser")

    title_tag = soup.find("h1")
    if title_tag is None:
        continue  # skip pages without a main title
    main_header = title_tag.text.strip()

    headers_stack = []
    for header in soup.find_all(header_tags):
        # h2 is the first level below the page title, so it maps to depth 1
        header_depth = int(header.name[1]) - 1

        # drop headers of the same or a deeper level before adding this one
        while len(headers_stack) >= header_depth:
            headers_stack.pop()
        headers_stack.append(header.text.strip())

        # collect every text node up to the next header;
        # the walk starts inside the header, so its own text is included
        text_parts = []
        current_element = header.next_element
        while current_element is not None and current_element.name not in header_tags:
            if current_element.name is None:
                text_parts.append(str(current_element).strip())
            current_element = current_element.next_element
        target = " ".join(part for part in text_parts if part)

        source_raw = " > ".join([main_header] + headers_stack)
        rows.append({"source_raw": source_raw, "target": target, "file": file_path})

# build the frame once instead of concatenating inside the loop
articles_df = pd.DataFrame(rows, columns=["source_raw", "target", "file"])


def has_content(row):
    return len(row["source_raw"].split()) > 2 and len(row["target"].split()) > 5


articles_df = articles_df.drop_duplicates(subset=["source_raw"])
articles_df = articles_df.drop_duplicates(subset=["target"])
articles_df = articles_df[articles_df.apply(has_content, axis=1)]

articles_df.reset_index(drop=True, inplace=True)

articles_df.sample(10)


### Rephrasing input data

Since headers are not well suited to be used directly as a question or an instruction, we will build a prompt for each entry from its header path and let the model reword it into a query.

In [None]:
%%time

from datasets import Dataset
from tqdm.auto import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

pipe = pipeline("text2text-generation", model="google/flan-t5-large", device_map='auto', framework="pt")

example_query = """
    Reword Title to be a valid prompt based on following examples:
    Example 1:
    Title: Hadoop > Hadoop 2.8.0 Release > Key Git Concepts > Forking onto GitHub
    Q: Give me step-by-step guide on how to fork Hadoop 2.8.0 Release onto GitHub?
    Example 2:
    Title: Tomcat > WebSocket 1.1 TCK > Goals
    Q: Tell to me what are the goals of WebSocket 1.1 TCK in Tomcat.
    Example 3:
    Title: Apache > Tomcat > What is the Native library?
    Q: What is the Apache Tomcat Native library?

"""


def create_query(row):
    return f"{example_query}\nTitle: {row['source_raw']}\nQ:"


def generate_queries(dataset):
    """This function will generate queries for each article in the dataset"""
    for out in tqdm(pipe(KeyDataset(dataset, "prompt"), batch_size=8, return_text=True), desc="Generating queries",
                    total=len(dataset)):
        for row in out:
            yield row["generated_text"]


articles_df['prompt'] = articles_df.apply(create_query, axis=1)

dataset = Dataset.from_pandas(articles_df)

dataset = dataset.add_column("source", list(generate_queries(dataset))) \
    .filter(lambda x: x["source"] is not None and len(x["source"].split()) > 3) \
    .remove_columns(["prompt", "file", "source_raw"])

dataset.save_to_disk(dataset_path)

dataset.to_pandas().sample(10)

### Remove bad queries

Some of the generated queries are empty or contain garbage, so we will filter them out.
We will use a zero-shot classification model (`facebook/bart-large-mnli`) to label each query and drop the ones classified as nonsense.
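
The filtering rule itself is simple: a zero-shot classifier returns the candidate labels ordered by descending score, and a query is dropped when its top label is the rejected one. A minimal sketch of that rule, where the function name `drop_rejected` and the classifier outputs are hypothetical values made up for illustration:

```python
def drop_rejected(queries, classifications, rejected_label="nonsense"):
    """Keep the queries whose top-scoring label differs from rejected_label."""
    kept = []
    for query, result in zip(queries, classifications):
        top_label = result["labels"][0]  # labels are ordered by descending score
        if top_label != rejected_label:
            kept.append(query)
    return kept


# Hypothetical classifier outputs, not produced by a real model run
queries = ["What are the goals of WebSocket 1.1 TCK in Tomcat?", "Tomcat Tomcat > >"]
classifications = [
    {"sequence": queries[0], "labels": ["question", "request", "nonsense"], "scores": [0.90, 0.08, 0.02]},
    {"sequence": queries[1], "labels": ["nonsense", "request", "question"], "scores": [0.85, 0.10, 0.05]},
]
print(drop_rejected(queries, classifications))
```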

In [None]:
from tqdm.auto import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

BATCH_SIZE = 10  # adjust to the available GPU memory

candidate_labels = ["question", "request", "nonsense"]

# zero-shot classifier: scores each query against the candidate labels
classification_pipeline = pipeline(
    "zero-shot-classification", model="facebook/bart-large-mnli", device_map="auto", framework="pt"
)

classified_pipeline = tqdm(
    classification_pipeline(
        KeyDataset(dataset, "source"),
        truncation=True,
        candidate_labels=candidate_labels,
        batch_size=BATCH_SIZE,
    ),
    total=len(dataset),
    desc="Classifying dataset by correctness",
)

dataset = dataset \
    .add_column("validity", [x["labels"][0] for x in classified_pipeline]) \
    .filter(lambda x: x["validity"] != "nonsense") \
    .remove_columns(["validity"])

dataset.save_to_disk(dataset_path)
dataset.to_pandas().sample(10)

### Data augmentation

Since we have a small dataset, we will augment it with random word-level edits: inserting a duplicated word, swapping two words and deleting a word. We will use [NLTK](https://www.nltk.org/) to tokenize the text.
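
To show what the word-level operations used by the augmentation code in this section do to a single sentence, here is a dependency-free sketch. It splits on whitespace instead of using NLTK and takes an explicit `random.Random` instance so runs are repeatable; both choices are assumptions made for illustration.

```python
import random


def random_insertion(words, rng):
    """Insert a copy of a randomly chosen word at a random position."""
    words = list(words)
    if words:
        words.insert(rng.randint(0, len(words)), rng.choice(words))
    return words


def random_swap(words, rng):
    """Swap two randomly chosen words."""
    words = list(words)
    if len(words) > 1:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, rng):
    """Delete one randomly chosen word."""
    words = list(words)
    if len(words) > 1:
        words.pop(rng.randrange(len(words)))
    return words


rng = random.Random(0)
question = "What is the Apache Tomcat Native library?".split()
inserted = random_insertion(question, rng)
swapped = random_swap(question, rng)
deleted = random_deletion(question, rng)
for variant in (inserted, swapped, deleted):
    print(" ".join(variant))
```

Each operation changes the sentence only slightly, which adds noisy variants of the text rather than new information.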

In [None]:
%%time

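import random

import nltk
import pandas as pd
from datasets import Dataset, concatenate_datasets
from tqdm.auto import tqdm

# tokenizer data for nltk.word_tokenize; newer NLTK releases may ask for "punkt_tab" instead
nltk.download("punkt")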


def random_insertion(text):
    words = nltk.word_tokenize(text)
    word_to_insert = random.choice(words)
    position = random.randint(0, len(words))
    words.insert(position, word_to_insert)
    return ' '.join(words)


def random_swap(text):
    words = nltk.word_tokenize(text)
    if len(words) > 1:
        idx1, idx2 = random.sample(range(len(words)), 2)
        words[idx1], words[idx2] = words[idx2], words[idx1]
    return ' '.join(words)


def random_deletion(text):
    words = nltk.word_tokenize(text)
    if len(words) > 1:
        idx = random.randint(0, len(words) - 1)
        words.pop(idx)
    return ' '.join(words)


def augment_text(text, augmentations):
    for aug in augmentations:
        if aug == "random_insertion":
            text = random_insertion(text)
        elif aug == "random_swap":
            text = random_swap(text)
        elif aug == "random_deletion":
            text = random_deletion(text)
    return text


def augment_dataset(ds, column, augmentations, num_augmentations=4):
    """Return only the augmented copies of ds, with `column` perturbed in each copy."""
    augmented_data = []
    for idx in tqdm(range(len(ds)), desc="Augmenting dataset"):
        row = ds[idx]
        text = row[column]
        for _ in range(num_augmentations):
            new_row = row.copy()
            new_row[column] = augment_text(text, augmentations)
            augmented_data.append(new_row)
    # the original rows are added once by the caller, so they are not repeated here
    return Dataset.from_pandas(pd.DataFrame(augmented_data))


augmentations = ["random_insertion", "random_swap", "random_deletion"]
augmented_input_ds = augment_dataset(dataset, "source", augmentations)
augmented_text_ds = augment_dataset(dataset, "target", augmentations)
dataset = concatenate_datasets([dataset, augmented_input_ds, augmented_text_ds])

dataset.save_to_disk(dataset_path)

dataset.to_pandas().sample(10)

## Training

For this stage we will use PEFT, LoRA, DeepSpeed, Accelerate and the Hugging Face `Seq2SeqTrainer`.

### Setup deepspeed

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
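
The training cell in this section passes a file named `ds_config_zero3.json` to the trainer, but the file itself is not included in the notebook. The following is only a sketch of what a ZeRO stage 3 configuration with CPU offload might look like; every value is an assumption, and the `"auto"` entries rely on the Hugging Face Trainer integration filling them in from the training arguments.

```python
import json

# Assumed ZeRO stage 3 settings, not the configuration file used by the author
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Serialize to JSON; the text can be saved under the file name given to the trainer
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```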


In [4]:
import os

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

In [5]:
import numpy as np
from transformers import AutoTokenizer

# Load the tokenizer of FLAN-T5-large
model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = dataset.map(lambda x: tokenizer(x["source"], truncation=True), batched=True, remove_columns=["source", "target"])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# use the 85th percentile of the lengths instead of the maximum to reduce padding
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = dataset.map(lambda x: tokenizer(x["target"], truncation=True), batched=True, remove_columns=["source", "target"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use the 90th percentile of the lengths instead of the maximum to reduce padding
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["Q: " + item for item in sample["source"]]
    targets = ["A: " + item for item in sample["target"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=targets, max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


dataset = dataset.map(preprocess_function, batched=True, remove_columns=["source", "target"])
print(f"Keys of tokenized dataset: {list(dataset.features)}")

dataset.save_to_disk(dataset_path)


In [1]:
import os
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)
from datasets import load_from_disk
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
)

model_id = "google/flan-t5-large"
batch_size = 4
num_train_epochs = 2.5
gradient_accumulation_steps = 3
learning_rate = 1e-3
label_pad_token_id = -100

dataset_path = os.path.join("..", "datasets", "confluence_exports-inputs-augmented")
output_dir = os.path.join("..", "models", f"{model_id.replace('/', '-')}-lora-peft")
dataset = load_from_disk(dataset_path)

model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = dataset.train_test_split(test_size=0.2, shuffle=True)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8,
)

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=learning_rate,
    num_train_epochs=num_train_epochs,
    gradient_accumulation_steps=gradient_accumulation_steps,
    logging_dir=os.path.join(output_dir, "logs"),
    logging_strategy="steps",
    logging_steps=500,
    deepspeed="ds_config_zero3.json",
    report_to="tensorboard",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
model.config.use_cache = False
trainer.train()
model.save_pretrained(os.path.join(output_dir, "fine-tuned"))


trainable params: 4718592 || all params: 787868672 || trainable%: 0.5989059049678777
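
The trainable parameter count reported above can be reproduced by hand. The sketch below assumes the published `flan-t5-large` dimensions (hidden size 1024, 16 heads of size 64, 24 encoder and 24 decoder layers), which are not stated in this notebook, and uses the `r=16` and `target_modules=["q", "v"]` settings from the LoRA configuration.

```python
# Assumed flan-t5-large dimensions (not stated in this notebook)
d_model = 1024
inner_dim = 16 * 64  # number of heads * size of each head
encoder_layers = 24
decoder_layers = 24

# LoRA settings taken from the configuration in the training cell
r = 16
targets_per_attention = 2  # the "q" and "v" projections

# each encoder layer has one attention block, each decoder layer has two
# (self-attention and cross-attention)
attention_blocks = encoder_layers + 2 * decoder_layers

# one adapted projection adds A (r x d_model) and B (inner_dim x r)
params_per_module = r * (d_model + inner_dim)

trainable = attention_blocks * targets_per_attention * params_per_module
total = 787868672  # "all params" value from the output above

print(trainable)
print(100 * trainable / total)
```

The result is 72 attention blocks, 2 adapted projections each, and 32768 parameters per projection, which gives 4718592.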




## Inference with the trained model

In [13]:
import os

from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-large"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(model, os.path.join(output_dir, "fine-tuned"))
tokenizer = AutoTokenizer.from_pretrained(model_id)

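def generate_simple(input_text):
    """Generate an answer for a single prompt with the fine-tuned model."""
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    output = model.generate(
        input_ids=input_ids,
        max_length=512,
        temperature=0.7,  # only takes effect with do_sample=True; with the default settings decoding is greedy
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)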

input_text = """Q: What is Apache Tika?"""
generate_simple(input_text)

'A: Apache Tika Apache Tika is a Java servlet container based on Apache Tomcat. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a servlet container implementation of Apache Tika. It is a 

In [15]:
input_text = """Q: How to deploy apache tika to tomcat?"""
generate_simple(input_text)

'A: How do I deploy apache tika tomcat tomcat? See TomcatInTomcat'
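
The first answer above repeats the same sentence until the length limit is reached. One common way to reduce such loops is to change the decoding settings. The values below are untuned assumptions, shown only as a sketch of the keyword arguments that `model.generate` accepts for this purpose; they were not used to produce the outputs in this notebook.

```python
# Assumed decoding settings, not tuned for this model
generation_kwargs = {
    "max_new_tokens": 256,      # cap on the number of generated tokens
    "do_sample": True,          # enable sampling so that temperature is applied
    "temperature": 0.7,
    "no_repeat_ngram_size": 3,  # forbid repeating any 3-token sequence
    "repetition_penalty": 1.2,  # penalize tokens that were already generated
}

# Hypothetical usage with the model and tokenizer defined in the inference cell:
# output = model.generate(input_ids=input_ids, **generation_kwargs)
print(generation_kwargs)
```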