# Fine-tuning a model based on raw documents from Confluence

This notebook contains code for fine-tuning a model based on raw documents from Confluence.

## Introduction
The process will contain several parts:

- Data downloading
We downloaded several examples from the publicly available Apache Software Foundation Confluence to build a raw dataset. This step is done outside of this notebook. You can read more about Confluence export here: [https://confluence.atlassian.com/doc/export-content-to-word-pdf-html-and-xml-139475.html](https://confluence.atlassian.com/doc/export-content-to-word-pdf-html-and-xml-139475.html)
- Data extraction
For data extraction from the dumps we will use Apache Tika running in a separate Docker container.
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. You can read more about it here: [https://tika.apache.org/](https://tika.apache.org/)
- Data processing
We will use the `datasets` library to process the data. It is a library for loading and processing datasets in a few lines of code. You can read more about it here: [https://huggingface.co/docs/datasets/](https://huggingface.co/docs/datasets/)
We also have to identify which entries in the raw data are instructions or question-answer material.
- Data augmentation
Dataset augmentation is the process of creating new data from existing data. In this case we use a paraphrasing model to create new questions and answers.
- Model fine-tuning
Using modern techniques such as PEFT, LoRA, DeepSpeed and Accelerate, we will fine-tune the model on the dataset. You can read more about them here: [https://huggingface.co/transformers/training.html](https://huggingface.co/transformers/training.html) [https://huggingface.co/blog/peft](https://huggingface.co/blog/peft)
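
As a rough sketch of the extraction call mentioned in the list above: a Tika server accepts a document via HTTP `PUT` on its `/tika` endpoint and returns the extracted content in the format named by the `Accept` header. The snippet below only builds such a request with the standard library and does not send it. The URL assumes a Tika container listening on localhost port 9998, and `build_tika_request` is a hypothetical helper, not part of the notebook.

```python
import urllib.request

# Assumption: a Tika server is reachable at this address.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika for HTML output."""
    return urllib.request.Request(
        url,
        data=document,
        method="PUT",
        headers={"Accept": "text/html"},
    )


request = build_tika_request(b"<html><body><h1>Title</h1></body></html>")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it, which only makes sense once the container is running.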


## Data extraction

At this stage the output is a list of records in the format {file, header, text, topic}[]
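
To make the record shape concrete, one entry could be described as below. The field names follow the extraction code in this notebook (`file`, `header`, `text`, `topic`); the `Section` type and the sample values are made up for illustration.

```python
from typing import List, TypedDict


class Section(TypedDict):
    file: str    # path of the exported source file
    header: str  # text of an h2-h6 subheading
    text: str    # text of the element that follows the subheading
    topic: str   # text of the page's h1, or "" if there is none


# Hypothetical sample; real values come from the Confluence export.
sections: List[Section] = [
    {
        "file": "../data/confluence_exports/example.html",
        "header": "How to build the project",
        "text": "Run the build script from the repository root.",
        "topic": "Developer guide",
    }
]

print(sorted(sections[0]))
```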

In [None]:
!pip install datasets beautifulsoup4 requests tqdm

In [None]:
import os

import pandas as pd
import requests
from tqdm.auto import tqdm
from bs4 import BeautifulSoup

TIKA_SERVER = "http://localhost:9998/tika"
BATCH_SIZE = 10  # You can adjust the batch size according to your needs
input_directory = os.path.join("..", "data", "confluence_exports")
output_path = os.path.join("..", "datasets", "confluence_exports.json")
include_extensions = [".md", ".html", ".adoc"]


# Getting the list of articles to process
def get_files_to_process(root_path):
    for dirpath, _, filenames in os.walk(root_path):
        for filename in filenames:
            if any(filename.endswith(ext) for ext in include_extensions):
                yield os.path.join(dirpath, filename)


# Determine whether a list item (<li>) consists mostly of links
def is_mostly_links(tag):
    if tag.name != "li":
        return False
    link_count = len(tag.find_all("a"))
    total_count = len(tag.find_all(recursive=False))
    return link_count / total_count > 0.5 if total_count > 0 else False


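# Recursively extract text from the tag, skipping list items that are mostly links
def extract_text(tag):
    if tag is None:
        return ""
    if not hasattr(tag, "contents"):
        # Plain text node (NavigableString): it has no children to recurse into
        return tag.strip()
    if is_mostly_links(tag):
        return ""
    # Join with spaces so that words from adjacent child nodes do not fuse together
    parts = (extract_text(child) for child in tag.contents)
    return " ".join(part for part in parts if part)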


# Parse and clean HTML using BeautifulSoup library
def parse_and_clean_html(html):
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup(["script", "style", "nav"]):
        tag.decompose()
    main_header = soup.find("h1")
    main_header_text = main_header.get_text(strip=True) if main_header else ""
    for subheader in soup.find_all(["h2", "h3", "h4", "h5", "h6"]):
        subheader_text = subheader.get_text(strip=True)
        # Only the element immediately following the subheading is used as its text
        text = extract_text(subheader.find_next_sibling())
        yield {
            "header": subheader_text,
            "text": text.strip(),
            "topic": main_header_text
        }


# Send each file to the Apache Tika server and yield one record per extracted section
def process_file(files):
    headers = {"Accept": "text/html"}
    for file_path in tqdm(files, desc="Processing files"):
        with open(file_path, "rb") as f:
            data = f.read()
        response = requests.put(TIKA_SERVER, data=data, headers=headers)
        # Fail early instead of parsing an error page as if it were article HTML
        response.raise_for_status()
        for entry in parse_and_clean_html(response.text):
            yield {
                "file": file_path,
                "header": entry["header"],
                "text": entry["text"],
                "topic": entry["topic"]
            }


# Keep only entries whose header and text each contain at least `min_length` words
def has_content(example, min_length=3):
    header_words = example["header"].split()
    text_words = example["text"].split()
    return len(header_words) >= min_length and len(text_words) >= min_length

files_to_process = list(get_files_to_process(input_directory))
dataframe = pd.DataFrame(process_file(files_to_process))

filtered_dataframe = dataframe[dataframe.apply(has_content, axis=1)]
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # make sure the output directory exists
filtered_dataframe.to_json(output_path, orient="records", indent=4)
filtered_dataframe.tail()
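
The word-count filter is easy to check in isolation. The sketch below applies the same rule to a few made-up records (not taken from the real export) to show which ones would survive. It uses `str.split()` rather than `split(" ")`, which only differs when the text contains repeated whitespace.

```python
def has_content(example, min_length=3):
    """Return True if both header and text contain at least `min_length` words."""
    header_words = example["header"].split()
    text_words = example["text"].split()
    return len(header_words) >= min_length and len(text_words) >= min_length


# Made-up records for illustration only.
samples = [
    {"header": "How to install Tika", "text": "Pull the image and start it."},
    {"header": "Overview", "text": "This page describes the release process."},
    {"header": "Known issues and workarounds", "text": ""},
]

kept = [sample["header"] for sample in samples if has_content(sample)]
print(kept)
```

Note that one-word headers such as "Overview" are dropped even when the text below them is substantial, so the threshold may be worth tuning for a given export.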

## Data processing

First, let's classify the elements of the dataset into three categories: instruction, question-answer and other (trash). We will use a zero-shot classification model from the transformers library. You can read more about it here: [https://huggingface.co/transformers/task_summary.html#sequence-classification](https://huggingface.co/transformers/task_summary.html#sequence-classification)
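
For orientation: a zero-shot classification pipeline returns, for each input, a dictionary with the `sequence`, the candidate `labels` sorted from most to least likely, and the matching `scores`. The sketch below uses hand-written results (the scores are invented and no model is run) to show how the top label is taken and how entries labelled `other` are dropped.

```python
# Invented pipeline-style outputs; a real run would produce these from the model.
results = [
    {
        "sequence": "How to configure the build",
        "labels": ["instruction", "question-answer", "other"],
        "scores": [0.81, 0.12, 0.07],
    },
    {
        "sequence": "Meeting notes 2019",
        "labels": ["other", "instruction", "question-answer"],
        "scores": [0.74, 0.15, 0.11],
    },
    {
        "sequence": "Why does the job fail?",
        "labels": ["question-answer", "other", "instruction"],
        "scores": [0.66, 0.20, 0.14],
    },
]

# The first label is the most likely one because labels are sorted by score.
top_labels = [result["labels"][0] for result in results]

# Keep only entries whose top label is useful for fine-tuning.
kept = [
    result["sequence"]
    for result, label in zip(results, top_labels)
    if label != "other"
]
print(top_labels)
print(kept)
```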

In [None]:
!pip install --upgrade --ignore-installed datasets transformers nltk

In [None]:
from datasets import Dataset
from transformers.pipelines.pt_utils import KeyDataset
import nltk
from transformers import pipeline

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

processed_data_path = os.path.join("..", "datasets", "confluence_exports-classified")
candidate_labels = ["instruction", "question-answer", "other"]

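# Name the task explicitly instead of relying on the model's default pipeline tag
classification_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device='cuda:0')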
generation_pipeline = pipeline("text2text-generation", model="t5-base", tokenizer="t5-base", device='cuda:0')

def generate_query(samples):
    return [f"generate question: {sample['header']} context: {sample['text']}" for sample in samples]

ds = Dataset.from_json(output_path)
ds = ds.add_column("query", generate_query(ds))

def extract_label(samples):
    return [x["generated_text"].split(":")[1].strip() for x in samples]

classified_pipeline = tqdm(classification_pipeline(KeyDataset(ds, "query"), truncation=True, candidate_labels=candidate_labels, batch_size=BATCH_SIZE), total=len(ds), desc="Classifying dataset")

ds = ds.add_column("label", [x["labels"][0] for x in classified_pipeline])

ds = ds.filter(lambda x: x["label"] != "other")

ds.save_to_disk(processed_data_path)

## Data adjustments

Since headers are not well suited to be used directly as a question or an instruction query, we will generate a prompt for each entry based on its label.

In [None]:
from transformers import pipeline

generation_pipeline = pipeline("text2text-generation", model="t5-base", tokenizer="t5-base", device='cuda:0', truncation=True, batch_size=4)

def generate_prompt(topic, header, label, text):
    return f"Generate a {label} prompt for the text about {header} - {topic}: {text}"

ds = Dataset.load_from_disk(processed_data_path)

ds = ds.add_column("prompt", [generate_prompt(x['topic'], x['header'], x["label"], x["text"]) for x in ds])


for idx, out in enumerate(tqdm(generation_pipeline(KeyDataset(ds, "prompt")), total=len(ds), desc="Generating prompts")):
    print(out)
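
To see what text the generation model actually receives, the prompt template can be rendered for a single record. The record below is made up for illustration; `generate_prompt` is copied from the cell above.

```python
def generate_prompt(topic, header, label, text):
    return f"Generate a {label} prompt for the text about {header} - {topic}: {text}"


# Made-up record for illustration only.
record = {
    "topic": "Developer guide",
    "header": "How to build the project",
    "label": "instruction",
    "text": "Run the build script from the repository root.",
}

prompt = generate_prompt(record["topic"], record["header"], record["label"], record["text"])
print(prompt)
```

Inspecting a few rendered prompts this way is a cheap check that the label, header and topic land where the template expects them before running the model over the whole dataset.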
