# Fine-tuning a model based on raw documents from Confluence

This notebook contains code for fine-tuning a model based on raw documents from Confluence.

## Introduction
The process consists of several parts:

- Data downloading
We downloaded several examples from the publicly available Apache Software Foundation Confluence to build a raw dataset. This step was done outside of this notebook. You can read more about Confluence export here: [https://confluence.atlassian.com/doc/export-content-to-word-pdf-html-and-xml-139475.html](https://confluence.atlassian.com/doc/export-content-to-word-pdf-html-and-xml-139475.html)
- Data extraction
To extract data from the dumps we will use Apache Tika running in a separate Docker container.
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. You can read more about it here: [https://tika.apache.org/](https://tika.apache.org/)
- Data processing
We will use the `datasets` library to process the data. It is a library for loading and processing datasets in a few lines of code. You can read more about it here: [https://huggingface.co/docs/datasets/](https://huggingface.co/docs/datasets/)
We also have to extract instructions and data from the raw data.
- Data augmentation
Dataset augmentation is the process of creating new data from existing data. In this case we use a paraphrasing model to create new questions and answers.
- Model fine-tuning
Using modern techniques such as PEFT, DeepSpeed, LoRA and Accelerate, we will fine-tune the model on the dataset. You can read more about it here: [https://huggingface.co/transformers/training.html](https://huggingface.co/transformers/training.html) [https://huggingface.co/blog/peft](https://huggingface.co/blog/peft)
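
The extraction step relies on a running Tika server. As a rough sketch, assuming the server is reachable at `http://localhost:9998` (Tika server's default port), a file can be sent to its `/tika` endpoint with a PUT request and an `Accept: text/plain` header. The address and the helper names `build_tika_request` and `extract_text` are assumptions made for illustration; they are not part of the original notebook.

```python
import urllib.request

# Assumed address of the Tika server container; adjust to your setup.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(payload: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the document as plain text."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Accept": "text/plain"},
        method="PUT",
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send a file to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server started, for example, from the `apache/tika` Docker image, `extract_text("page.html")` would return the plain text of a hypothetical `page.html`.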


## Setup environment

First of all we need to install all the dependencies needed for the project.

In [3]:
# install common dependencies
!pip install beautifulsoup4 requests tqdm
# install PyTorch (CUDA 11.8 build)
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# install Hugging Face Libraries
!pip install git+https://github.com/huggingface/peft.git
!pip install bitsandbytes transformers evaluate datasets accelerate loralib --upgrade --quiet
# install additional dependencies needed for training
!pip install rouge-score tensorboard py7zr transformers[deepspeed] spacy nltk

Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting git+https://github.com/huggingface/peft.git
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-hftbc1lh
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-hftbc1lh
  Resolved https://github.com/huggingface/peft.git to commit 1a1cfe34791ee9d822fad5f6c9607b966f2b27c0
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Data extraction

At this stage, the output is a list of records in the following format:
```json
[
    {
        "file": "path/to/file",
        "page": "page name",
        "section": "section name",
        "text": "text"
    }
]
```
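
The extraction code in this section stores a combined `title` ("page > section > ...") rather than separate `page` and `section` fields. A minimal sketch of how one extracted row could be mapped onto the record format above, assuming the first ` > ` separates the page name from the section path (the helper name `to_record` is hypothetical):

```python
def to_record(title: str, text: str, file: str) -> dict:
    """Split a 'page > section > ...' title into the record format shown above."""
    # Everything before the first " > " is the page name, the rest is the section path.
    page, _, section = title.partition(" > ")
    return {"file": file, "page": page, "section": section, "text": text}


record = to_record(
    "Apache Tomcat : HowTo > Setup > Linux",
    "Example section text",
    "path/to/file.html",
)
print(record)
```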

In [2]:
import os
import pandas as pd
from tqdm.auto import tqdm
from bs4 import BeautifulSoup

input_directory = os.path.join("..", "data", "confluence_exports")
include_extensions = [".html"]

def get_files_to_process(root_path):
    for dirpath, _, filenames in os.walk(root_path):
        for filename in filenames:
            if any(filename.endswith(ext) for ext in include_extensions):
                yield os.path.join(dirpath, filename)

articles_df = pd.DataFrame(columns=["title", "text", "file"])
fileList = list(get_files_to_process(input_directory))

for filePath in tqdm(fileList, desc="Processing files"):
    with open(filePath, "r", encoding="utf-8") as file:
        # parse the page first, then read its main (h1) header
        soup = BeautifulSoup(file.read(), "html.parser")
        main_header = soup.find("h1").text.strip()
        header_tags = ["h2", "h3", "h4", "h5", "h6"]
        headers_stack = []
        for header in soup.find_all(header_tags):
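            header_level = int(header.name[1])

            # h1 is the page title, so an h2 sits at depth 1: drop every stacked
            # header at the same or a deeper level before appending the new one
            while len(headers_stack) >= header_level - 1:
                headers_stack.pop()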

            headers_stack.append(header.text)

            text = ''
            current_element = header.next_element

            while current_element is not None and (current_element.name is None or current_element.name not in header_tags):
                if current_element.name is None:
                    text = " ".join([text, current_element.getText().strip()])
                current_element = current_element.next_element

            title =  " > ".join([main_header] + headers_stack)
            articles_df = pd.concat([articles_df, pd.DataFrame([[title, text, filePath]], columns=["title", "text", "file"])])

def has_content(row):
    return len(row["title"].split()) > 2 and len(row["text"].split()) > 5

articles_df = articles_df.drop_duplicates(subset=["text"])
articles_df = articles_df.drop_duplicates(subset=["title"])
articles_df = articles_df[articles_df.apply(has_content, axis=1)]

articles_df.reset_index(drop=True, inplace=True)

articles_df.sample(10)

Processing files:   0%|          | 0/266 [00:00<?, ?it/s]

| | title | text | file |
|---|---|---|---|
| 873 | TIKA : TikaEval > Single Output from One Tool ... | Single Output from One Tool (Profile) NOTE: a... | ../data/confluence_exports/TIKA/TikaEval_10945... |
| 1139 | TIKA : TikaServer Windows Service > Here is a ... | Attachments: image2022-5-19_22-13-5.png (... | ../data/confluence_exports/TIKA/TikaServer-Win... |
| 481 | Apache Tomcat : TomcatCreateNativeLaunchers > ... | Available Options What options do you have i... | ../data/confluence_exports/TOMCAT/TomcatCreate... |
| 898 | TIKA : tika-pipes > Security Warning > From Fi... | From FileShare to FileShare Process all files... | ../data/confluence_exports/TIKA/tika-pipes_181... |
| 124 | Hadoop : Meetup agenda > Attachments: | Attachments: HDFS-15547 Disk-level tierin... | ../data/confluence_exports/HADOOP/Meetup-agend... |
| 1129 | TIKA : TikaAndVision > Tika and Tensorflow Ima... | 2. Tensorflow Using REST Server This is the r... | ../data/confluence_exports/TIKA/TikaAndVision_... |
| 475 | Apache Tomcat : Class Not Found Issues > Preface | Preface This page discusses the various ways ... | ../data/confluence_exports/TOMCAT/Class-Not-Fo... |
| 510 | Apache Tomcat : SupportAndTraining > Example c... | Example company name Use this example as a b... | ../data/confluence_exports/TOMCAT/SupportAndTr... |
| 823 | TIKA : SMTWithApacheJoshua > Introduction > La... | Details The language pack being used in this ... | ../data/confluence_exports/TIKA/SMTWithApacheJ... |
| 737 | Apache Tomcat : HowTo > How do I add a questio... | How do I set up multiple sites sharing the sa... | ../data/confluence_exports/TOMCAT/HowTo_103099... |
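
The `title` column is a breadcrumb built from the page header and the nested section headers. The idea can be sketched without BeautifulSoup on a plain list of `(level, text)` pairs. This is an illustration under the assumption that `h1` holds the page title and sections start at `h2`; the function name `breadcrumb_titles` is hypothetical.

```python
def breadcrumb_titles(main_header, headers):
    """Build 'page > section > subsection' titles from (level, text) header pairs."""
    stack = []   # currently open section headers, outermost first
    titles = []
    for level, text in headers:
        # An h2 is at depth 1 below the page title, an h3 at depth 2, and so on.
        # Keep only the ancestors of the new header, then push the header itself.
        del stack[level - 2:]
        stack.append(text)
        titles.append(" > ".join([main_header] + stack))
    return titles


for title in breadcrumb_titles(
    "TIKA : tika-pipes",
    [(2, "Security Warning"), (3, "From FileShare to FileShare"), (2, "Overview")],
):
    print(title)
```

Sibling headers replace each other on the stack, so an `h2` that follows another `h2` does not end up nested under it.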


## Data adjustments

Since headers are not well suited to be used directly as a question or instruction query, we will generate a prompt for each entity based on its title.

In [4]:
from transformers import pipeline
from datasets import Dataset
from transformers.pipelines.base import KeyDataset

pipe = pipeline("text2text-generation", model="google/flan-t5-small", device='cuda:0', framework="pt")

example_query = """
    <title>Hadoop : Hadoop 2.8.0 Release > Key Git Concepts > Forking onto GitHub</title><query>How to fork Hadoop 2.8.0 Release onto GitHub?</query>
    <title>Apache Tomcat : WebSocket 1.1 TCK > Goals</title><query>What are the goals of WebSocket 1.1 TCK in Apache Tomcat?</query>
"""

articles_df = articles_df.sample(10)

def create_query(row):
    return f"{example_query}\n<title>{row['title']}</title>\n"

articles_df["prompt"] = articles_df[["title"]].apply(create_query, axis=1)

dataset = Dataset.from_pandas(articles_df)

# feed the few-shot prompts built above (the "prompt" column), not the bare titles
processed_queries = pipe(KeyDataset(dataset, "prompt"), truncation=True)

for query in tqdm(processed_queries, desc="Processing queries", total=len(dataset)):
    print(query)

TypeError: object of type 'generator' has no len()
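
The prompt used in this section is a small few-shot template: solved `<title>`/`<query>` pairs followed by the new title. A minimal sketch of assembling it is shown below. The helper name `build_prompt` is hypothetical, and ending the prompt with an open `<query>` tag is an assumption of this sketch, not something the notebook does.

```python
# The two solved examples come from the notebook's `example_query` string.
FEW_SHOT_EXAMPLES = [
    (
        "Hadoop : Hadoop 2.8.0 Release > Key Git Concepts > Forking onto GitHub",
        "How to fork Hadoop 2.8.0 Release onto GitHub?",
    ),
    (
        "Apache Tomcat : WebSocket 1.1 TCK > Goals",
        "What are the goals of WebSocket 1.1 TCK in Apache Tomcat?",
    ),
]


def build_prompt(title, examples=FEW_SHOT_EXAMPLES):
    """Return a few-shot prompt that ends where the model should write the query."""
    lines = [f"<title>{t}</title><query>{q}</query>" for t, q in examples]
    # Leave the last <query> tag open so the model continues with the question.
    lines.append(f"<title>{title}</title><query>")
    return "\n".join(lines)


print(build_prompt("TIKA : TikaServer Windows Service > Installation"))
```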

## Remove useless data

Since some of the generated queries are empty or contain garbage, we will filter them out.
We will use a zero-shot classification model (`facebook/bart-large-mnli`) to filter out bad queries.

In [7]:
from datasets import Dataset
from transformers.pipelines.base import KeyDataset
from transformers import pipeline
from tqdm.auto import tqdm
import os

BATCH_SIZE = 10  # You can adjust the batch size according to your needs

inputs_dataset_path = os.path.join("..", "datasets", "confluence_exports-inputs")
valid_questions_dataset_path = os.path.join("..", "datasets", "confluence_exports-inputs-valid")
candidate_labels = ["valid",  "nonsense"]

classification_pipeline = pipeline(model="facebook/bart-large-mnli", device='cuda:0')
generation_pipeline = pipeline("text2text-generation", model="t5-base", tokenizer="t5-base", device='cuda:0')

inputs_dataset = Dataset.load_from_disk(inputs_dataset_path).filter(lambda x: x["input"] is not None and len(x["input"].split(" ")) > 3)

classified_pipeline = tqdm(classification_pipeline(KeyDataset(inputs_dataset, "input"), truncation=True, candidate_labels=candidate_labels, batch_size=BATCH_SIZE), total=len(inputs_dataset), desc="Classifying dataset by correctness")

inputs_dataset = inputs_dataset.add_column("validity", [x["labels"][0] for x in classified_pipeline])

inputs_dataset = inputs_dataset.filter(lambda x: x["validity"] == "valid")

inputs_dataset.save_to_disk(valid_questions_dataset_path)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Filter:   0%|          | 0/438 [00:00<?, ? examples/s]

Classifying dataset by correctness:   0%|          | 0/415 [00:00<?, ?it/s]

Flattening the indices:   0%|          | 0/415 [00:00<?, ? examples/s]

Filter:   0%|          | 0/415 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/377 [00:00<?, ? examples/s]
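
The filtering cell keeps a row whenever the top-ranked label is `"valid"`. A slightly stricter variant would also require a minimum score. The sketch below works on plain dictionaries shaped like the zero-shot pipeline results used above (`labels` and `scores`, sorted from most to least likely). The function name `keep_valid` and the threshold `0.7` are assumptions, not values from the notebook.

```python
def keep_valid(rows, results, min_score=0.7, valid_label="valid"):
    """Keep rows whose top label is `valid_label` with at least `min_score`."""
    kept = []
    for row, result in zip(rows, results):
        top_label = result["labels"][0]
        top_score = result["scores"][0]
        if top_label == valid_label and top_score >= min_score:
            kept.append(row)
    return kept


# Hand-written stand-ins for pipeline output, for illustration only.
rows = ["What is Apache Tika?", "Attachments: image png", "How do I configure Tomcat?"]
results = [
    {"labels": ["valid", "nonsense"], "scores": [0.93, 0.07]},
    {"labels": ["nonsense", "valid"], "scores": [0.81, 0.19]},
    {"labels": ["valid", "nonsense"], "scores": [0.55, 0.45]},
]
print(keep_valid(rows, results))
```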

## Data augmentation

Since we have a small dataset, we will augment it by replacing some words with their synonyms and by randomly inserting, swapping, and deleting words. We will use [wordnet](https://wordnet.princeton.edu/) for the synonyms.

In [8]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 77.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [9]:
from datasets import concatenate_datasets
import torch
from itertools import chain
import random
import nltk
from nltk.corpus import wordnet as wn
from transformers import MarianMTModel, MarianTokenizer
import spacy
import pandas as pd

nltk.download("wordnet")
nlp = spacy.load("en_core_web_sm")

valid_questions_dataset_path = os.path.join("..", "datasets", "confluence_exports-inputs-valid")
valid_questions_dataset = Dataset.load_from_disk(valid_questions_dataset_path)

augmented_dataset_path = os.path.join("..", "datasets", "confluence_exports-inputs-augmented")

def synonym_augmentation(text):
    words = nltk.word_tokenize(text)
    new_words = []
    for word in words:
        synsets = wn.synsets(word)
        if synsets:
            synonyms = set(chain.from_iterable(synset.lemma_names() for synset in synsets))
            if synonyms:
                new_words.append(random.choice(list(synonyms)))
            else:
                new_words.append(word)
        else:
            new_words.append(word)
    return ' '.join(new_words)

def back_translation(text, src_lang="en", tgt_lang="fr"):
    model_name = f'Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).eval().to("cuda:0")

    # Forward translation
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to("cuda:0")
    with torch.no_grad():
        forward_outputs = model.generate(**inputs)
    translated_text = tokenizer.decode(forward_outputs[0], skip_special_tokens=True)

    # Backward translation
    model_name = f'Helsinki-NLP/opus-mt-{tgt_lang}-{src_lang}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).eval().to("cuda:0")

    inputs = tokenizer(translated_text, return_tensors="pt", max_length=512, truncation=True).to("cuda:0")
    with torch.no_grad():
        backward_outputs = model.generate(**inputs)
    return tokenizer.decode(backward_outputs[0], skip_special_tokens=True)

def random_insertion(text):
    words = nltk.word_tokenize(text)
    word_to_insert = random.choice(words)
    position = random.randint(0, len(words))
    words.insert(position, word_to_insert)
    return ' '.join(words)

def random_swap(text):
    words = nltk.word_tokenize(text)
    if len(words) > 1:
        idx1, idx2 = random.sample(range(len(words)), 2)
        words[idx1], words[idx2] = words[idx2], words[idx1]
    return ' '.join(words)

def random_deletion(text):
    words = nltk.word_tokenize(text)
    if len(words) > 1:
        idx = random.randint(0, len(words) - 1)
        words.pop(idx)
    return ' '.join(words)

def augment_text(text, augmentations):
    for aug in augmentations:
        if aug == "synonym":
            text = synonym_augmentation(text)
        elif aug == "back_translation":
            text = back_translation(text)
        elif aug == "random_insertion":
            text = random_insertion(text)
        elif aug == "random_swap":
            text = random_swap(text)
        elif aug == "random_deletion":
            text = random_deletion(text)
    return text

def augment_dataset(ds, column, augmentations, num_augmentations=4):
    augmented_data = []
    for idx in range(len(ds)):
        row = ds[idx]
        text = row[column]
        for _ in range(num_augmentations):
            augmented_text = augment_text(text, augmentations)
            new_row = row.copy()
            new_row[column] = augmented_text
            augmented_data.append(new_row)
    return concatenate_datasets([ds, Dataset.from_pandas(pd.DataFrame(augmented_data))])

augmentations = ["synonym", "random_insertion", "random_swap", "random_deletion"]
augmented_input_ds = augment_dataset(valid_questions_dataset, "input", augmentations)
augmented_text_ds = augment_dataset(valid_questions_dataset, "text", augmentations)
augmented_ds = concatenate_datasets([valid_questions_dataset, augmented_input_ds, augmented_text_ds])
columns_to_remove = [col for col in augmented_ds.column_names if col not in ["input", "text"]]
augmented_ds = augmented_ds.remove_columns(columns_to_remove)
augmented_ds.save_to_disk(augmented_dataset_path)

[nltk_data] Downloading package wordnet to /home/andrei/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Saving the dataset (0/1 shards):   0%|          | 0/4147 [00:00<?, ? examples/s]
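
The augmentation functions above draw from the global `random` module, so every run produces a different dataset. If reproducibility matters, one option is to pass an explicitly seeded generator. The sketch below illustrates this for swap and deletion only, uses whitespace tokenization instead of NLTK, and the name `augment` is hypothetical.

```python
import random


def random_swap(words, rng):
    """Return a copy of `words` with two randomly chosen positions swapped."""
    words = list(words)
    if len(words) > 1:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, rng):
    """Return a copy of `words` with one randomly chosen word removed."""
    words = list(words)
    if len(words) > 1:
        words.pop(rng.randrange(len(words)))
    return words


def augment(text, seed):
    """Swap two words, then delete one, using a generator seeded with `seed`."""
    rng = random.Random(seed)
    words = text.split()
    return " ".join(random_deletion(random_swap(words, rng), rng))


print(augment("How do I install Apache Tika", seed=0))
```

Calling `augment` twice with the same text and seed yields the same result, which makes it easier to rebuild the augmented dataset later.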

## Training

For this stage we will use PEFT, LoRA, DeepSpeed, Accelerate, and the Hugging Face Trainer.

In [2]:
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import concatenate_datasets
import numpy as np
import os

augmented_dataset_path = os.path.join("..", "datasets", "confluence_exports-inputs-augmented")

model_id = "google/flan-t5-xxl"

# Load tokenizer of FLAN-T5-XXL
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = load_from_disk(augmented_dataset_path)
dataset = dataset.train_test_split(test_size=0.2, seed=42)

Loading cached split indices for dataset at /home/andrei/DataspellProjects/ai-tools/datasets/confluence_exports-inputs-augmented/cache-da2cca23ff28f88e.arrow and /home/andrei/DataspellProjects/ai-tools/datasets/confluence_exports-inputs-augmented/cache-8009923e98285385.arrow


In [3]:
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

Train dataset size: 3317
Test dataset size: 830


In [4]:
from transformers import AutoModelForSeq2SeqLM

# huggingface hub model id
model_id = "google/flan-t5-base"

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

In [5]:
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)
# prepare int-8 model for training
model = prepare_model_for_int8_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/andrei/anaconda3/envs/ai-tools/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /home/andrei/anaconda3/envs/ai-tools/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda121.so...


  warn(msg)


trainable params: 1769472 || all params: 249347328 || trainable%: 0.7096414524241463
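
The trainable parameter count can be checked by hand. The sketch below assumes the published `google/flan-t5-base` dimensions (hidden size 768, 12 heads of size 64, 12 encoder and 12 decoder layers); those values are not stated in this notebook. Each adapted projection gains two low-rank matrices, so its extra parameters are `r * (in_features + out_features)`.

```python
# Assumed google/flan-t5-base dimensions (from its published config).
d_model = 768          # hidden size, input of the q and v projections
inner_dim = 12 * 64    # num_heads * d_kv, output of the q and v projections
encoder_layers = 12
decoder_layers = 12
r = 16                 # LoRA rank from lora_config

# Each adapted projection gains A (r x d_model) and B (inner_dim x r).
params_per_module = r * (d_model + inner_dim)

# "q" and "v" are adapted in encoder self-attention, decoder self-attention,
# and decoder cross-attention.
adapted_modules = 2 * (encoder_layers + 2 * decoder_layers)

trainable = params_per_module * adapted_modules
all_params = 249_347_328  # "all params" value printed above
print(trainable, 100 * trainable / all_params)
```

Under these assumptions there are 72 adapted projections with 24,576 extra parameters each, which gives the 1,769,472 trainable parameters reported above.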


In [6]:
from datasets import concatenate_datasets
import numpy as np
# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["input"], truncation=True), batched=True, remove_columns=["input", "text"])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# use the 85th percentile of the input lengths for better utilization
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["text"], truncation=True), batched=True, remove_columns=["input", "text"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use the 90th percentile of the target lengths for better utilization
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

Loading cached processed dataset at /home/andrei/DataspellProjects/ai-tools/datasets/confluence_exports-inputs-augmented/cache-62a2f695125de454.arrow


Max source length: 30


Map:   0%|          | 0/4147 [00:00<?, ? examples/s]

Max target length: 512
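
Choosing a percentile instead of the maximum length means that some sequences are truncated. A quick way to see the trade-off is to compute the share of samples that exceed a candidate limit. The helper name `truncated_share` and the token lengths below are made up for illustration.

```python
def truncated_share(lengths, max_length):
    """Fraction of sequences that are longer than `max_length` tokens."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > max_length) / len(lengths)


# Hypothetical token lengths, not taken from the real dataset.
example_lengths = [12, 18, 25, 30, 31, 44, 58, 120, 300, 512]
for limit in (30, 64, 512):
    print(limit, truncated_share(example_lengths, limit))
```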


In [7]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["answer: " + item for item in sample["input"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["text"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["input", "text"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")

Loading cached processed dataset at /home/andrei/DataspellProjects/ai-tools/datasets/confluence_exports-inputs-augmented/cache-5759f891d562ca86.arrow


Map:   0%|          | 0/830 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


Saving the dataset (0/1 shards):   0%|          | 0/3317 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/830 [00:00<?, ? examples/s]
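
The label masking in `preprocess_function` replaces padding token ids with `-100`, the default index that PyTorch's cross-entropy loss ignores. The same step in isolation, as a sketch on plain lists (the function name `mask_padding` is hypothetical, and `0` is assumed to be the T5 pad token id):

```python
IGNORE_INDEX = -100  # ignored by the loss, so padding does not contribute to it


def mask_padding(label_ids, pad_token_id):
    """Replace every padding id in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Two padded label sequences with made-up token ids; 0 is the assumed pad id.
print(mask_padding([[71, 1525, 1, 0, 0], [363, 19, 8, 1, 0]], pad_token_id=0))
```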


In [8]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
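
With `pad_to_multiple_of=8`, the collator pads each batch up to the next multiple of 8, which tends to suit GPU tensor cores better than arbitrary lengths. The rounding itself is simple arithmetic; the helper name `padded_length` below is hypothetical.

```python
def padded_length(longest, multiple=8):
    """Round `longest` up to the nearest multiple of `multiple`."""
    # Ceiling division written with integer arithmetic only.
    return -(-longest // multiple) * multiple


for n in (30, 32, 33, 512):
    print(n, padded_length(n))
```

For the source length of 30 computed earlier, this would mean batches of width 32.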


In [9]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir="lora-flan-t5-xxl"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
# train model
trainer.train()
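
To judge whether `logging_steps=500` is a sensible interval, it helps to estimate the number of optimizer steps. The sketch below assumes a single device, no gradient accumulation, and a batch size of 8. The real batch size is chosen at run time by `auto_find_batch_size`, so 8 is only a placeholder, and the function name `total_steps` is hypothetical.

```python
import math


def total_steps(num_samples, batch_size, epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    return math.ceil(num_samples / batch_size) * epochs


train_size = 3317  # "Train dataset size" printed earlier
steps = total_steps(train_size, batch_size=8, epochs=5)
print(steps, steps // 500)  # total steps and number of logging events
```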

In [10]:
model.save_pretrained(output_dir)

In [10]:
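from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# model.save_pretrained(output_dir) stored only the LoRA adapter weights,
# so load the base model first and attach the adapter to it
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(base_model, output_dir)
# fold the adapter into the base weights so the pipeline gets a plain seq2seq model
model = model.merge_and_unload()

# the tokenizer was not saved to output_dir, so load it from the hub
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load pipeline
qa_pipeline = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer
)

# Example question about Apache Tika
question = "What is Apache Tika?"
# Query, using the same "answer: " prefix as in preprocessing
result = qa_pipeline("answer: " + question)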


The following are the steps to install Hadoop on
