# Generating Fine Tuning Training Data

## Notebook Flow

- This notebook generates the JSONL files used to fine-tune a model.
- The goal is to use [bullet_scraper.py](./scripts/bullet_scraper.py) to scrape open-source bullets from a particular URL and use them as the expected model completions.
- The inputs for these expected completions are generated by another model, which expands the acronyms and adds context to each bullet in order to mimic natural human language input.
- Once the inputs and expected completions are gathered and cleaned, they can be used to train The Forge's model.

The steps for this notebook are as follows:
- [Step 1: Run the Bullet Scraper](#step-1-run-the-bullet-scraper)
    - Contains optional block to run scraper on specific websites
- [Step 2: Clean the Scraped Data](#step-2-clean-the-scraped-data)
    - Contains optional block to run cleaner on specific text files
- [Step 3: Consolidate the Cleaned Data](#step-3-consolidate-the-cleaned-data)
    - Contains optional block to consolidate data to a different location
- [Step 4: Generating Inputs for Completions](#step-4-generating-inputs-for-completions)
- [Step 5: Training Inputs from a Fine-Tuned Model](#step-5-training-inputs-from-a-fine-tuned-model)

## Step 1: Run the Bullet Scraper

**_IMPORTANT NOTE:_** The scripts in here are all rudimentary, so stopping and restarting the Jupyter Notebook and kernel may be required to exit the asynchronous scraping processes. Some websites may run much longer than others, while others may block the usage of this scraper completely.

This step retrieves the "expected outputs" (bullets) from a variety of sources. These bullets form the foundation of the training data for the Bullet Forge.

Please see the _scripts/_ directory for more details. Please be warned that the scraper, although parallelized, may take a long time to run through all of the identified bullet repositories. The block below runs the scraper on all the easily scrape-able websites located within the [websites.txt](../resources/websites.txt) file.

In [None]:
from scripts.bullet_scraper import bullet_scraper

with open('../resources/websites.txt', 'r') as file:
    urls = [url.strip() for url in file.readlines()]

for url in urls:
    await bullet_scraper(url)
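
Because some websites run much longer than others, or block the scraper outright, it can help to bound each scrape with a timeout instead of restarting the kernel. The following is a minimal sketch, not part of the project scripts: `read_urls`, `scrape_all`, and `timeout_seconds` are hypothetical names, and the only assumption about `bullet_scraper` is that it is an async callable that takes a URL, as in the block above.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


def read_urls(lines: List[str]) -> List[str]:
    """Strip whitespace from each line and drop lines that are blank."""
    return [line.strip() for line in lines if line.strip()]


async def scrape_all(
    scrape: Callable[[str], Awaitable[object]],
    urls: List[str],
    timeout_seconds: float = 600.0,  # hypothetical default, tune per site
) -> Dict[str, str]:
    """Run `scrape` on each URL in turn and record a status per URL."""
    statuses: Dict[str, str] = {}
    for url in urls:
        try:
            # Cancel the scrape if it runs longer than the allowed time
            await asyncio.wait_for(scrape(url), timeout=timeout_seconds)
            statuses[url] = "ok"
        except asyncio.TimeoutError:
            statuses[url] = "timed out"
        except Exception as error:  # e.g., a site that blocks the scraper
            statuses[url] = f"failed: {error}"
    return statuses
```

In a notebook cell this could be called as `statuses = await scrape_all(bullet_scraper, read_urls(file.readlines()))`, after which the returned dictionary shows which websites need a second attempt.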

The block below allows the user to run the scraper on one website at a time. This is useful for the website(s) below, which have a hard-coded records limit. The block requires the following input:

- base_url: The base URL including the protocol, e.g., https://www.afeprbullets.com/results.php?Submit5=Search&strength=Positive&rec=8124

In [None]:
from scripts.bullet_scraper import bullet_scraper

base_url = input("What base URL would you like to start scraping from?")

await bullet_scraper(base_url)

## Step 2: Clean the Scraped Data

In this step, the scraped data is given some extra cleaning. This removes bullets that are clearly too long or too short, and applies proper spacing and formatting to bullets that do not follow the standard bullet format.

The block below does not require any inputs. Please note, however, that the cleaner reads from and writes to the directory set in `directory_path` unless it is explicitly changed in the code block below.

In [None]:
from loguru import logger

from scripts.file_utils import batch_clean_files
from scripts.constants import bullet_pattern

directory_path = "../data/raw/"

logger.info(f"Performing extra cleaning on files in directory: {directory_path}")
batch_clean_files(directory_path, bullet_pattern)
logger.success("Extra cleaning on directory complete!")
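
To make the cleaning rules concrete, the sketch below shows one way a single bullet could be normalized and filtered. It is an illustration only and not the logic in `scripts/file_utils.py`: `EXAMPLE_BULLET_PATTERN` is a hypothetical stand-in for `bullet_pattern`, and the `min_length` and `max_length` thresholds are assumed values.

```python
import re
from typing import Optional

# Hypothetical stand-in for scripts.constants.bullet_pattern.
# Matches "- ACTION; IMPACT--OUTCOME", where "; IMPACT" is optional.
EXAMPLE_BULLET_PATTERN = re.compile(
    r"^-\s*(?P<action>[^;]+?)(?:;\s*(?P<impact>.+?))?\s*--\s*(?P<outcome>.+)$"
)


def clean_bullet(
    line: str, min_length: int = 20, max_length: int = 130
) -> Optional[str]:
    """Return the bullet in standard spacing, or None if it should be dropped."""
    # Collapse runs of whitespace so spacing is predictable
    text = " ".join(line.split())
    match = EXAMPLE_BULLET_PATTERN.match(text)
    if match is None:
        return None  # not in bullet format at all
    action = match.group("action").strip()
    impact = (match.group("impact") or "").strip()
    outcome = match.group("outcome").strip()
    # Rebuild the bullet with standard spacing around each separator
    if impact:
        bullet = f"- {action}; {impact}--{outcome}"
    else:
        bullet = f"- {action}--{outcome}"
    # Drop bullets that are clearly too short or too long (assumed limits)
    if not min_length <= len(bullet) <= max_length:
        return None
    return bullet
```

For example, `clean_bullet("-EFMP SME;dir'd 57   enrollments -- beat pkg processing time 50%")` yields the bullet with a single space after the dash and semicolon and no spaces around the double dash.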

The block below allows the user to choose a single file to clean. The block requires the following input:

- dirty_file: The name of the text file you want cleaned, from the _data/raw/_ directory, e.g., `afeprbullets`

In [None]:
from loguru import logger

from scripts.file_utils import clean_file
from scripts.constants import bullet_pattern

base_directory_path = "../data/raw/"
dirty_file_path = base_directory_path + input(
    "What is the filename of the file you want to clean?"
)

logger.info(f"Performing extra cleaning on file: {dirty_file_path}")
clean_file(dirty_file_path, bullet_pattern)
logger.success("Extra cleaning on file complete!")

## Step 3: Consolidate the Cleaned Data

In this step, all of the expected completions are consolidated into one JSONL file for easier handling in later steps and notebooks. The JSON objects within the JSONL will carry the following structure: 

```json
{"input": "<ADD DETAIL>", "output": "<THE SCRAPED, CLEANED BULLET>"}
```

The `output` key stores the expected completion and the `input` key stores the prompt. `<ADD DETAIL>` is a placeholder for the expected input a user might provide to the Bullet Forge. [Step 4](#step-4-generating-inputs-for-completions) talks more about generating the full `input` value for training the Bullet Forge.

The block below does not require any inputs. Please note, however, that the files to consolidate are read from `base_path` and the consolidated data is written to `output_path`, so the output file will always be `raw_consolidated_set.jsonl` unless it is explicitly changed in the code block below.

In [None]:
from scripts.data_consolidator import consolidate_files

base_path = "../data/raw/"
output_path = "../data/raw/raw_consolidated_set.jsonl"

await consolidate_files(base_path, output_path)
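
As an illustration of the target format, the sketch below turns cleaned bullets into JSONL lines that carry the `<ADD DETAIL>` placeholder. It is not the implementation of `consolidate_files`; `placeholder_lines` is a hypothetical helper shown only to make the structure above concrete.

```python
import json
from typing import Iterable, Iterator

PLACEHOLDER = "<ADD DETAIL>"


def placeholder_lines(bullets: Iterable[str]) -> Iterator[str]:
    """Yield one flat JSON object per non-empty bullet."""
    for bullet in bullets:
        bullet = bullet.strip()
        if bullet:
            # json.dumps keeps each object on one line and escapes quotes
            yield json.dumps({"input": PLACEHOLDER, "output": bullet})
```

Joining the yielded strings with newlines and writing them to a file gives a JSONL file in which every line is an independent JSON object.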

The block below allows the user to consolidate a new or existing file into a separate file, rather than into the user's existing master copy of `raw_consolidated_set.jsonl`. The block requires the following input:

- filename: The name of the file to be consolidated and formatted, e.g., `contributed`. This is also used as the filename of the output JSONL.

In [None]:
from scripts.data_consolidator import consolidate_files

filename = input("What is the file you would like to consolidate?")

base_path = f"../data/raw/{filename}"

file_path = f"{base_path}.txt"
output_path = f"{base_path}.jsonl"

await consolidate_files(file_path, output_file_path=output_path)

## Step 4: Generating Inputs for Completions

As discussed in [Step 3](#step-3-consolidate-the-cleaned-data), the bullet scraping, cleaning, and consolidating do not yield the inputs that an actual user of The Forge may provide. The `<ADD DETAIL>` is purposefully left within each `input` so that the JSONL's JSON objects can be fed directly into a prompt-engineered model that can expand the bullet back into natural human language. 

Below is an example of a completed JSON object for training. Please note that the JSON object below is pretty-printed for ease of viewing; in the actual JSONL file, each object occupies a single line.

```json
{
    "input": "As the Subject Matter Expert (SME) for the Exceptional Family Member Program (EFMP), I expertly directed 57 enrollments, handled 194 incoming inquiries, and addressed 101 outgoing inquiries. Through my leadership and efficient processing, we beat the package processing time by 50% and achieved an on-time rate of 99%, significantly improving support for our EFMP families.",
    "output": "- EFMP SME; dir'd 57 enrollments/194 incoming/101 outgoing inquiries--beat pkg processing time 50%/on-time rt 99%"
}
```

Any model or tool can be used to get the input-output JSON object. For example, you can prompt-engineer ChatGPT using the below:

```
INSTRUCTIONS: Expand upon condensed information that follows the bullet format `-[ACTION];[IMPACT]--[OUTCOME]`. Within areas of `<ADD DETAIL>` the task is to take the bullet after "output" and:
1. expand all acronyms or non-standard english words into their original forms, 
2. provide more context and language to generate full-form sentences that amount to a small paragraph describing the bullet, 
3. generate the context and language as if the perspective were from that of a member of the military, 
4. replace the `<ADD DETAIL>` area in the JSONL JSON-object

EXAMPLE OUTPUT BELOW:

MY PROMPT: `{"input": "<ADD DETAIL>", "output": "- ESD focal point; tracked, routed, resolved 375 Tier III/6 High lvl tkts--enabled AFNET access to 200K users"}`

YOUR RESPONSE: `{"input": "As the ESD focal point, I played a crucial role in tracking, routing, and resolving 375 Tier III and 6 high-level tickets. My efforts enabled AFNET access for 200,000 users, ensuring smooth operations and connectivity.", "output": "- ESD focal point; tracked, routed, resolved 375 Tier III/6 High lvl tkts--enabled AFNET access to 200K users"}`. 

IMPORTANT NOTE: Take the expansions and replace the `<ADD DETAIL>` area in the JSON. If provided multiple JSON objects with `<ADD DETAIL>` areas, then iterate through them performing the expansion. Return all of your results in a code block of type JSON, and keep each expansion on 1 line of the overall code block.

ACTION: Please provide an acknowledgement and summary of the instructions above if you understand the upcoming task.
```
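
Since the expansions come back as free text from a chat model, it is worth checking them before adding them to the training set. The sketch below is one possible check, under the assumption that the response contains one JSON object per line; `merge_expansions` is a hypothetical helper and not part of the project scripts.

```python
import json
from typing import Dict, List

PLACEHOLDER = "<ADD DETAIL>"


def merge_expansions(
    originals: List[Dict[str, str]], response_text: str
) -> List[Dict[str, str]]:
    """Keep response lines that parse as JSON and expand a known bullet."""
    known_outputs = {pair["output"] for pair in originals}
    accepted: List[Dict[str, str]] = []
    for raw_line in response_text.splitlines():
        raw_line = raw_line.strip()
        if not raw_line.startswith("{"):
            continue  # skip code-fence markers, blank lines, and chatter
        try:
            pair = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        expanded = pair.get("input")
        bullet = pair.get("output")
        if not isinstance(expanded, str) or not isinstance(bullet, str):
            continue
        # Reject bullets the model altered and inputs it never filled in
        if bullet in known_outputs and expanded and PLACEHOLDER not in expanded:
            accepted.append({"input": expanded, "output": bullet})
    return accepted
```

Pairs that are rejected here still carry the placeholder in the original file, so they can simply be sent through the prompt again.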

## Step 5: Training Inputs from a Fine-Tuned Model

**_IMPORTANT NOTE:_** Please see the [fine tuning notebook](../notebooks/fine_tuning.ipynb) for steps and details on how to fine tune a model for creating the Bullet Forge's training data.

Once enough input-output pairs (a few hundred) have been created using the [Step 4](#step-4-generating-inputs-for-completions) method, the user can use those completed training pairs to train a model that continues the process of expanding the other 30K pairs that are still missing inputs. This involves fine-tuning a model to perform the fill-ins required on `"input": "<ADD DETAIL>"`. Please see the "**_IMPORTANT NOTE_**" at the top of this step for this fine tuning.

The block below runs inference with a fine-tuned model that was trained on the existing [training and validation set](../data/training/training_validation_set.jsonl) to create new inputs from bullets. The block below requires the following inputs:

- model_path: the relative directory path or Hugging Face repository that holds all of the model files necessary for instantiation and inferencing
- tokenizer_path: the relative directory path or Hugging Face repository that holds the correct tokenizer for encoding and decoding inputs and outputs

Please note that the generated data is written to the path in the variable `output_filepath` below, so the output file will always be `raw_bullet_training_set.jsonl` unless it is explicitly changed in the code block below. Additionally, the number of lines of JSON data is capped at 1500 using the `maximum_lines_of_data` variable. To run the script until the end of the target file, or until a different number of lines of JSON data has been generated, that variable must be explicitly changed in the code block below.

In [None]:
import torch
from loguru import logger
from transformers import T5ForConditionalGeneration, T5Tokenizer

from scripts.constants import bullet_data_creation_prefix
from scripts.file_utils import load_jsonl_data, append_line_to_file

# Path of the output file
output_filepath = "../data/raw/raw_bullet_training_set.jsonl"
# Max number of input tokens (prefix plus bullet) passed to the model
# Increasing this beyond 512 may increase compute time significantly
max_input_token_length = 512
# Max number of tokens the model should generate for the expanded input
max_output_token_length = 512
# Beams to use for beam search algorithm
# Increased beams means increased quality, but increased compute time
number_of_beams = 6
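# Scales logits before soft-max to control randomness
# Lower values (~0) make output more deterministic
# Note: temperature, top_k, and top_p only take effect when sampling
# is enabled (do_sample=True) in model.generate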
temperature = 0.5
# Limits generated tokens to top K probabilities
# Reduces chances of rare word predictions
top_k = 50
# Applies nucleus sampling, limiting token selection to a cumulative probability
# Creates a balance between randomness and determinism
top_p = 0.90

try:
    # Path of the pre-trained model that will be used
    model_path = input(
        "Input a checkpoint model's Hugging Face repository or a relative path"
    )
    # Path of the pre-trained model tokenizer that will be used
    # Must match the model checkpoint's signature
    tokenizer_path = input(
        "Input a tokenizer's Hugging Face repository or a relative path"
    )
    # Load the pre-trained model and tokenizer
    logger.info(
        f"Instantiating tokenizer from {tokenizer_path}, and model from {model_path}"
    )
    tokenizer = T5Tokenizer.from_pretrained(
        f"{tokenizer_path}",
        model_max_length=max_input_token_length,
        add_special_tokens=False,
    )
    input_model = T5ForConditionalGeneration.from_pretrained(f"{model_path}")
    logger.info(f"Loading {model_path}...")
    # Set device to be used based on GPU availability
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Model is sent to device for use
    model = input_model.to(device)  # type: ignore
    logger.success("Instantiated target tokenizer and model")

    # Preprocess input
    inputs_array = load_jsonl_data("../data/raw/raw_consolidated_set.jsonl")
    
    maximum_lines_of_data = 1500

    for count, input_line in enumerate(inputs_array):
        # Stop only after maximum_lines_of_data lines have been generated
        if count >= maximum_lines_of_data:
            raise InterruptedError()

        input_text = bullet_data_creation_prefix + input_line["output"]

        encoded_input_text = tokenizer.encode_plus(
            input_text,
            return_tensors="pt",
            truncation=True,
            max_length=max_input_token_length,
        )

        # Generate the expanded input; tensors must be on the same device as the model
        summary_ids = model.generate(
            encoded_input_text["input_ids"].to(device),
            attention_mask=encoded_input_text["attention_mask"].to(device),
            max_length=max_output_token_length,
            num_beams=number_of_beams,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            early_stopping=True,
        )

        output_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        # Pair the generated input with the original bullet as one JSONL line
        # Note: quotes inside either string are not escaped by this f-string
        line_of_data = (
            f'{{"input": "{output_text}", "output": "{input_line["output"]}"}}'
        )
        append_line_to_file(output_filepath, line_of_data)
        logger.success(f"Appended {count + 1}/{maximum_lines_of_data}: {line_of_data}")

except KeyboardInterrupt:
    logger.warning("Received interrupt, stopping script...")
except InterruptedError:
    logger.success("Data generation complete! Stopping script...")
except Exception as e:
    logger.error(f"An error occurred during generation: {e}")
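
Generated text can contain double quotes or backslashes, which would produce invalid JSON if it is interpolated into a string by hand. The sketch below shows one way to build each line with `json.dumps` and to cap the number of lines with `itertools.islice` rather than an exception. `build_training_lines` and its `expand` parameter are hypothetical: `expand` stands in for the tokenize, generate, and decode steps above.

```python
import json
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator


def build_training_lines(
    pairs: Iterable[Dict[str, str]],
    expand: Callable[[str], str],
    limit: int,
) -> Iterator[str]:
    """Yield at most `limit` JSONL lines, one per input-output pair."""
    # islice stops cleanly after `limit` items, so no exception is needed
    for pair in islice(pairs, limit):
        bullet = pair["output"]
        # json.dumps escapes quotes, backslashes, and newlines in both values
        yield json.dumps({"input": expand(bullet), "output": bullet})
```

Each yielded string could then be passed to `append_line_to_file` in place of the hand-built `line_of_data`.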

The optional block below performs an extra prompt, at the discretion of the user, on the data that was generated from the above step. This could be, for example, a focused prompt for fixing grammar, adding context, and/or taking in an acronyms list and ensuring the accuracy of the expanded acronyms for each input-output pair.

In [None]:
# TODO: (optional) creation of another prompting step to be performed on the data generated by the step prior to this one