# Generating Fine Tuning Training Data


## Notebook Flow

- This notebook is used for generating the JSONL files used to fine-tune a model.
- The goal is to use [bullet_scraper.py](./scripts/bullet_scraper.py) to scrape for open-source bullets at a particular URL and use those as expected model completions.
- The inputs for these expected completions are generated by another model, which expands the acronyms and adds context to each bullet to mimic natural human language input.
- Once the inputs and expected completions are gathered and cleaned, they can be used to train The Forge's model.

The steps for this notebook are as follows:

- [Step 1: Import Dependencies](#step-1-import-dependencies)
- [Step 2: Run the Bullet Scraper](#step-2-run-the-bullet-scraper)
  - Contains optional block to run scraper on specific websites
- [Step 3: Clean the Scraped Data](#step-3-clean-the-scraped-data)
  - Contains optional block to run cleaner on specific text files
- [Step 4: Consolidate the Cleaned Data](#step-4-consolidate-the-cleaned-data)
  - Contains optional block to consolidate data to a different location
- [Step 5: Generating Inputs for Completions](#step-5-generating-inputs-for-completions)
- [Step 6: Training Inputs from a Fine-Tuned Model](#step-6-training-inputs-from-a-fine-tuned-model)
  - Contains optional block to perform extra LLM-enhanced data preparation


## Step 1: Import Dependencies

Run this block before any of the later steps. It imports the scripts and dependencies that the following blocks rely on.


In [None]:
from scripts.bullet_scraper import *
from scripts.file_utils import *
from scripts.constants import *
from scripts.model_instantiation import *
from scripts.data_consolidator import consolidate_files
from scripts.prompt import data_modification_prompt

## Step 2: Run the Bullet Scraper

**_IMPORTANT NOTE:_** The scripts here are all rudimentary, so you may need to stop and restart the Jupyter Notebook and its kernel to exit the asynchronous scraping processes. Some websites take much longer to scrape than others, and some may block this scraper entirely.

This step retrieves bullets, the "expected outputs", from a variety of sources. They form the foundation of the training data required to build the Bullet Forge.

Please see the _scripts/_ directory for more details. Please be warned that the scraper, although parallelized, may take a long time to run through all of the identified bullet repositories. The block below runs the scraper on all the easily scrape-able websites located within the [websites.txt](../resources/websites.txt) file.


In [None]:
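# Read one URL per line, skipping blank lines so the scraper never receives an empty URL
with open("../resources/websites.txt", "r") as file:
    urls = [line.strip() for line in file if line.strip()]

# Scrape each site in turn; top-level `await` works inside a notebook cell
for url in urls:
    await bullet_scraper(url)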

The block below allows the user to run the scraper on one website at a time. This is useful for websites such as the one below, which have a hard-coded records limit. The block requires the following input:

- base_url: The base URL including the protocol, e.g., https://www.afeprbullets.com/results.php?Submit5=Search&strength=Positive&rec=8124


In [None]:
base_url = input("What base URL would you like to start scraping from?")

await bullet_scraper(base_url)

## Step 3: Clean the Scraped Data

In this step, the scraped data gets some extra cleaning. Bullets that are clearly too long or too short are removed, and bullets that do not follow the standard bullet format are given proper spacing and formatting.

The block below does not require any inputs. Note, however, that the cleaner reads its input from and writes its output to the directory set in the `directory_path` variable unless it is explicitly changed in the code block below.


In [None]:
directory_path = "../data/raw/"

batch_clean_files(directory_path, bullet_pattern)
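
To make the cleaning rules concrete, here is a minimal sketch of a length filter plus format normalization for a single bullet. It is not the implementation in `scripts/`: the regular expression, the length bounds, and the name `clean_bullet` are all assumptions made for illustration, and the real `bullet_pattern` may differ.

```python
import re
from typing import Optional

# Hypothetical pattern for "- ACTION; IMPACT--OUTCOME". The real `bullet_pattern`
# is defined in the scripts and may not match this one.
EXAMPLE_BULLET_PATTERN = re.compile(r"^-\s*(.+?);\s*(.+?)--\s*(.+)$")

# Assumed length bounds, chosen only for illustration.
MIN_LEN, MAX_LEN = 40, 120


def clean_bullet(line: str) -> Optional[str]:
    """Return a normalized bullet, or None if the line should be dropped."""
    text = " ".join(line.split())  # collapse runs of whitespace
    match = EXAMPLE_BULLET_PATTERN.match(text)
    if match is None:
        return None  # not in the standard bullet format
    action, impact, outcome = (part.strip() for part in match.groups())
    bullet = f"- {action}; {impact}--{outcome}"  # apply consistent spacing
    if not MIN_LEN <= len(bullet) <= MAX_LEN:
        return None  # clearly too short or too long
    return bullet
```

Calling `clean_bullet` on each line of a scraped file and keeping the non-`None` results would approximate what this step is described as doing.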

The block below allows the user to clean a single file. It requires the following input:

- dirty_file: The name of the text file you want cleaned, from the _data/raw/_ directory, e.g., `afeprbullets`


In [None]:
base_directory_path = "../data/raw/"
dirty_file_path = base_directory_path + input(
    "What is the filename of the file you want to clean?"
)

clean_file(dirty_file_path, bullet_pattern)

## Step 4: Consolidate the Cleaned Data

In this step, all of the expected completions are consolidated into one JSONL file for easier handling in later steps and notebooks. The JSON objects within the JSONL will carry the following structure:

```json
{ "input": "<DETAIL>", "output": "<BULLET>" }
```

The `output` key stores the expected completion and the `input` key stores the prompt. `<DETAIL>` is a placeholder for the input a user might provide to the Bullet Forge. [Step 5](#step-5-generating-inputs-for-completions) covers how the full `input` value is generated for training the Bullet Forge.

The block below does not require any inputs. Note, however, that the consolidated data is written to the location set in the `output_path` variable, and the output file will always be named `raw_consolidated_set.jsonl` unless it is explicitly changed in the code block below.
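
As a rough illustration of this structure, the sketch below serializes cleaned bullets into flat JSONL lines with the `<DETAIL>` placeholder in place. The helper names `to_jsonl_line` and `write_jsonl` are hypothetical and are not taken from `scripts/data_consolidator.py`, whose actual behavior may differ.

```python
import json

PLACEHOLDER = "<DETAIL>"


def to_jsonl_line(bullet: str) -> str:
    """Serialize one cleaned bullet as a flat JSON object on a single line."""
    record = {"input": PLACEHOLDER, "output": bullet}
    return json.dumps(record, ensure_ascii=False)


def write_jsonl(bullets, output_path: str) -> None:
    """Write one JSON object per line (hypothetical helper, for illustration only)."""
    with open(output_path, "w", encoding="utf-8") as handle:
        for bullet in bullets:
            handle.write(to_jsonl_line(bullet) + "\n")
```

Each line of the resulting file can be parsed on its own with `json.loads`, which is what makes JSONL convenient for the later steps.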


In [None]:
base_path = "../data/raw/"
output_path = "../data/raw/raw_consolidated_set.jsonl"

await consolidate_files(base_path, output_path)

The block below allows the user to consolidate a new or existing file into its own JSONL file rather than into the master copy of `raw_consolidated_set.jsonl`. It requires the following input:

- filename: The name of the file to be consolidated and formatted, without its extension, e.g., `contributed`. It is also used as the filename of the output JSONL.


In [None]:
filename = input("What is the file you would like to consolidate?")

base_path = f"../data/raw/{filename}"

file_path = f"{base_path}.txt"
output_path = f"{base_path}.jsonl"

await consolidate_files(file_path, output_file_path=output_path)

## Step 5: Generating Inputs for Completions

As discussed in [Step 4](#step-4-consolidate-the-cleaned-data), the bullet scraping, cleaning, and consolidating do not yield the inputs that an actual user of The Forge may provide. The `<DETAIL>` placeholder is purposefully left within each `input` so that the JSONL's JSON objects can be fed directly into a prompt-engineered model that expands the bullet back into natural human language.

Below is an example of a completed JSON object for training. The object is pretty-printed here for ease of viewing; in the actual JSONL file, each object sits on a single line.

```json
{
  "input": "As the Subject Matter Expert (SME) for the Exceptional Family Member Program (EFMP), I expertly directed 57 enrollments, handled 194 incoming inquiries, and addressed 101 outgoing inquiries. Through my leadership and efficient processing, we beat the package processing time by 50% and achieved an on-time rate of 99%, significantly improving support for our EFMP families.",
  "output": "- EFMP SME; dir'd 57 enrollments/194 incoming/101 outgoing inquiries--beat pkg processing time 50%/on-time rt 99%"
}
```

Any model or tool can be used to produce the completed input-output JSON objects. For example, you can prompt-engineer ChatGPT using the prompt below:

```
INSTRUCTIONS: Expand upon condensed information that follows the bullet format `-[ACTION];[IMPACT]--[OUTCOME]`. Within areas of `<DETAIL>` the task is to take the bullet after "output" and:
1. expand all acronyms or non-standard english words into their original forms,
2. provide more context and language to generate full-form sentences that amount to a small paragraph describing the bullet,
3. generate the context and language as if the perspective were from that of a member of the military,
4. replace the `<DETAIL>` area in the JSONL JSON-object

EXAMPLE OUTPUT BELOW:

MY PROMPT: `{"input": "<DETAIL>", "output": "- ESD focal point; tracked, routed, resolved 375 Tier III/6 High lvl tkts--enabled AFNET access to 200K users"}`

YOUR RESPONSE: `{"input": "As the ESD focal point, I played a crucial role in tracking, routing, and resolving 375 Tier III and 6 high-level tickets. My efforts enabled AFNET access for 200,000 users, ensuring smooth operations and connectivity.", "output": "- ESD focal point; tracked, routed, resolved 375 Tier III/6 High lvl tkts--enabled AFNET access to 200K users"}`.

IMPORTANT NOTE: Take the expansions and replace the `<DETAIL>` area in the JSON. If provided multiple JSON objects with `<DETAIL>` areas, then iterate through them performing the expansion. Return all of your results in a code block of type JSON, and keep each expansion on 1 line of the overall code block.

ACTION: Please provide an acknowledgement and summary of the instructions above if you understand the upcoming task.
```
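
Because the prompt asks for results in a JSON code block with one expansion per line, the reply can be parsed back into training records with a few lines of Python. The sketch below is one possible approach, not part of the notebook's scripts; the function name `parse_expansions` is hypothetical, and it assumes the model follows the one-object-per-line instruction.

```python
import json

FENCE = "`" * 3  # the code-fence marker the model is asked to wrap its reply in


def parse_expansions(response_text: str) -> list:
    """Extract completed records from a reply that wraps JSON lines in a code fence."""
    records = []
    for raw_line in response_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(FENCE):
            continue  # skip blank lines and the fence markers
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore commentary the model may add around the JSON
        if not isinstance(record, dict):
            continue
        # Keep only records whose placeholder was actually replaced
        if record.get("output") and record.get("input") and "<DETAIL>" not in record["input"]:
            records.append(record)
    return records
```

Records that still contain `<DETAIL>` are dropped here so that unexpanded pairs do not leak into the training set; they can be resubmitted in a later batch.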


## Step 6: Training Inputs from a Fine-Tuned Model

**_IMPORTANT NOTE:_** Please see the [fine tuning notebook](../notebooks/fine_tuning.ipynb) for steps and details on how to fine tune a model for creating the Bullet Forge's training data.

Once enough input-output pairs (a few hundred) have been created using the [Step 5](#step-5-generating-inputs-for-completions) method, those completed pairs can be used to fine-tune a model that continues the process, filling in the `"input": "<DETAIL>"` placeholder for the pairs that are still missing inputs. Please see the "**_IMPORTANT NOTE_**" at the top of this step for details on this fine-tuning.

Please note that the output data is written to the location set in the `output_filepath` variable below, and the output file will always be named `raw_bullet_training_set.jsonl` unless it is explicitly changed in the code block below.

The number of lines of JSON data to be processed is capped at 1500. To run until the end of the target file, or to stop after a different number of lines, change this explicitly by adding the following argument to the `data_modification_prompt` call in the code block below: `stop_at=<NUMBER OF LINES>`. Additionally, the optional `modify_input` value controls which field the model transforms:
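
To judge whether enough pairs exist, it can help to separate completed records from those still holding the `<DETAIL>` placeholder. The sketch below shows one way to do that; the helper names and the threshold of 300 (one reading of "a few hundred") are assumptions, not values taken from the project.

```python
import json

PLACEHOLDER = "<DETAIL>"
MIN_COMPLETED_PAIRS = 300  # assumed interpretation of "a few hundred"


def split_records(jsonl_lines):
    """Split JSONL lines into (completed, pending) lists of records."""
    completed, pending = [], []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        target = pending if record.get("input") == PLACEHOLDER else completed
        target.append(record)
    return completed, pending


def ready_for_fine_tuning(completed, minimum: int = MIN_COMPLETED_PAIRS) -> bool:
    """Report whether the completed set meets the assumed minimum size."""
    return len(completed) >= minimum
```

Since an open file yields its lines when iterated, `split_records(open(path, encoding="utf-8"))` would work directly on a JSONL file.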

1. `True`: Model takes in the input and performs a transformation on it based on the provided prompt, placing the result in the input key-value pair
2. `False` (DEFAULT): Model takes in the output and performs a transformation on it based on the provided prompt, placing the result in the input key-value pair
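
The interaction between `stop_at` and `modify_input` can be sketched with a small stand-in. This is not the code of `data_modification_prompt`: the name `modify_records`, its signature, the `transform` callable standing in for the model call, and the treatment of `stop_at=None` as "no cap" are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


def modify_records(
    records: Iterable[Dict[str, str]],
    transform: Callable[[str], str],  # stand-in for the model call
    prompt_prefix: str,
    stop_at: Optional[int] = 1500,  # assumed: None means "no cap"
    modify_input: bool = False,
) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in showing which field feeds the model and where the result goes."""
    for count, record in enumerate(records):
        if stop_at is not None and count >= stop_at:
            break  # stop once the cap on processed lines is reached
        # modify_input=True transforms the existing input; otherwise the output (bullet) is used
        source_key = "input" if modify_input else "output"
        new_input = transform(prompt_prefix + record[source_key])
        # In both modes the result lands in the input key; the output is left untouched
        yield {"input": new_input, "output": record["output"]}
```

In both modes the bullet under `output` is preserved, so the expected completion never changes; only the text under `input` is rewritten.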

The block requires the following user input:

- model_path: The model's directory or Hugging Face repository, e.g., google/flan-t5-xl, ../models/opera-bullet-interpreter
- tokenizer_path: The tokenizer's directory or Hugging Face repository, e.g., google/flan-t5-xl, ../models/opera-bullet-interpreter
- save_model_decision: Whether the user wants to save a copy of the model to the local directory, "yes" or "no"


In [None]:
input_filepath = "../data/raw/raw_consolidated_set.jsonl"
output_filepath = "../data/raw/raw_bullet_training_set.jsonl"
prompt_prefix = bullet_data_creation_prefix

data_modification_prompt(output_filepath, input_filepath, prompt_prefix)

The optional block below performs more data transformation, at the discretion of the user, using a user-selected model and prompt. For example, this could be a focused prompt for fixing grammar and punctuation of the sentences.

In addition to the user input required by the previous block, the following inputs are required:

- input_filepath: The input file's relative directory path, e.g. ../data/raw/raw_bullet_training_set.jsonl
- output_filepath: The output file's relative directory path, e.g. ../data/raw/raw_bullet_training_set_NEW.jsonl
- prompt_prefix: The action to be performed by the model on each line of input data, e.g., Fix the grammar, punctuation and spelling, and add more context to the following United States Air and Space Force achievement statement


In [None]:
input_filepath = input("What is the relative path to the file for modification?")
output_filepath = input("What is the relative path for the output file, including filename?")
prompt_prefix = input("What is the action to be performed on the data?") + ": "

data_modification_prompt(output_filepath, input_filepath, prompt_prefix, modify_input=True)