# Generating Fine-Tuning Training Data

## Notebook Flow

- This notebook is used for generating the JSONL files used to fine-tune a model. 
- The goal is to use [scrape.py](./scripts/scrape.py) to scrape for open-source bullets at a particular URL and use those as expected model completions. 
- The inputs for these expected model completions are generated by another model, which expands the acronyms and adds context to each bullet in order to mimic natural human language input. 
- Once the inputs and expected completions are gathered and cleaned, they can be used to train Bullet Forge's model.

The steps for this notebook are as follows:
- [Step 1: Run the Bullet Scraper](#step-1-run-the-bullet-scraper)
- [Step 2: Clean the Scraped Data](#step-2-clean-the-scraped-data)
- [Step 3: Consolidate the Cleaned Data](#step-3-consolidate-the-cleaned-data)
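- [Step 4: Generating Inputs for Completions](#step-4-generating-inputs-for-completions)
- [Step 5: Training Inputs from a Fine-Tuned Model](#step-5-training-inputs-from-a-fine-tuned-model)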

## Step 1: Run the Bullet Scraper

This step retrieves bullets, which serve as the "expected completions", from a variety of sources in order to form the foundation of the training data.

Please see the _scripts/_ directory for more details. Please be warned that the scraper, although parallelized, may take a long time to run through all of the identified bullet repositories. The block below runs the scraper on all the easily scrape-able websites located within the [websites.txt](../resources/websites.txt) file.

In [None]:
# Scraper scripts
import subprocess

with open('../resources/websites.txt', 'r') as file:
    # Skip blank lines so the scraper is never called with an empty URL
    urls = [url.strip() for url in file if url.strip()]

for url in urls:
    subprocess.run(["python3", "scripts/scrape.py", url])

The block below allows the user to run the scraper on one website at a time. This is useful for the website(s) below, where there is a hard-coded records limit. Please note that the scraper behaves like a crudely built cron job, so once success has been logged the first few times on a website, wait 30 more seconds and then stop the script. If you don't stop the script manually, you may end up waiting a very long time for it to shut down. The block requires the following input:

- base_url: The base URL including the protocol, e.g., https://www.afeprbullets.com/results.php?Submit5=Search&strength=Positive&rec=8124

In [None]:
base_url = input("What base URL would you like to start scraping from?")

!python3 scripts/scrape.py "{base_url}"

## Step 2: Clean the Scraped Data

In this step, the scraped data receives additional cleaning. This removes bullets that are clearly too long or too short, and applies proper spacing and formatting to bullets that do not follow the standard bullet format.

In [None]:
# Cleaning scripts
import os
import subprocess

input_directory_path = "../data/raw/"

# Build a list of paths (not one joined string) so the loop visits one file at a time
file_paths = [os.path.join(root, file) for root, _, files in os.walk(input_directory_path) for file in files]

for dirty_file in file_paths:
    subprocess.run(["python3", "scripts/clean.py", dirty_file])

The block below allows the user to clean a single file. It requires the following input:

- dirty_file: The name of the text file you want cleaned, from the _data/raw/_ directory, e.g., `afeprbullets`

In [None]:
# Individual cleaning scripts
dirty_file = input("What is the filename of the file you want to clean?")

!python3 scripts/clean.py "{dirty_file}"
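
The internals of `scripts/clean.py` are not shown in this notebook. As a rough illustration of the kind of cleaning described in this step, the sketch below drops bullets outside a length range and normalizes spacing around the `;` and `--` separators. The function name `clean_bullet` and the `MIN_LENGTH`/`MAX_LENGTH` thresholds are assumptions made for this example, not values taken from the script.

```python
import re
from typing import Optional

# Hypothetical thresholds; the limits used by scripts/clean.py are not shown here.
MIN_LENGTH = 20
MAX_LENGTH = 120


def clean_bullet(raw: str) -> Optional[str]:
    """Return a normalized bullet, or None if it should be discarded."""
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", raw).strip()
    # Drop any leading dashes so a single "- " prefix can be re-applied.
    text = text.lstrip("-").strip()
    # Normalize spacing around the "--" and ";" separators.
    text = re.sub(r"\s*--\s*", "--", text)
    text = re.sub(r"\s*;\s*", "; ", text).strip()
    # Discard bullets that are clearly too short or too long.
    if not MIN_LENGTH <= len(text) <= MAX_LENGTH:
        return None
    return f"- {text}"
```

A bullet that falls outside the assumed range comes back as `None`, so the caller can filter those entries out before writing the cleaned file.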

## Step 3: Consolidate the Cleaned Data

In this step, all of the expected completions are consolidated into one JSONL file for easier handling in later steps and notebooks. The JSON objects within the JSONL will carry the following structure: 

```json
{"input": "<ADD DETAIL>", "output": "{line}"}
```

The `output` key stores the expected completion and the `input` key stores the prompt. `<ADD DETAIL>` is the location in which the `output` requires the Bullet Forge user's input. [Step 4](#step-4-generating-inputs-for-completions) talks more about the generation of the entire `input` key's value.
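
As a minimal sketch of how cleaned bullets could be wrapped in this structure, the example below turns an iterable of bullet lines into JSONL lines. It is not the implementation of `scripts/consolidate.py`, which is not shown here; the function name `to_jsonl_lines` and the `PLACEHOLDER` constant are hypothetical.

```python
import json
from typing import Iterable, List

PLACEHOLDER = "<ADD DETAIL>"  # marker later replaced by a generated input


def to_jsonl_lines(bullets: Iterable[str]) -> List[str]:
    """Wrap each non-empty bullet in the input/output structure, one JSON object per line."""
    lines = []
    for bullet in bullets:
        bullet = bullet.strip()
        if not bullet:
            continue  # skip blank lines
        record = {"input": PLACEHOLDER, "output": bullet}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines
```

Using `json.dumps` rather than string formatting ensures that quotes and backslashes inside a bullet are escaped correctly.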

The block below does not require any inputs; however, note that the output path for the consolidated data is set by the variable `output_directory_path` below, and the name of the output file will always be `consolidated_set.jsonl` unless the user explicitly changes it within the block.

In [None]:
# Consolidation scripts
import os

input_directory_path = "../data/raw/"
output_directory_path = "../data/raw/consolidated_set.jsonl"

file_paths = " ".join([os.path.join(root, file) for root, _, files in os.walk(input_directory_path) for file in files])

!python3 scripts/consolidate.py "{output_directory_path}" {file_paths}

The block below allows the user to consolidate a new or existing file into a separate output file, rather than into the user's existing master copy of `consolidated_set.jsonl`. The block requires the following inputs:
- filename: The name of the file to be consolidated and formatted, not including the extension, e.g., `contributed`
- output_filename: The name of the output file, not including the extension, e.g., `contributed_consolidated`

In [None]:
# Individual consolidation scripts
filename = input("What is the filename of the file you would like to consolidate?")
output_filename = input("What would you like to name the consolidated file?")

file_path = f"../data/raw/{filename}.txt"
output_path = f"../data/raw/{output_filename}.jsonl"

!python3 scripts/consolidate.py "{output_path}" "{file_path}"

## Step 4: Generating Inputs for Completions

As discussed in [Step 3](#step-3-consolidate-the-cleaned-data), the bullet scraping, cleaning, and consolidating do not yield the inputs that an actual user of Bullet Forge may provide. The `<ADD DETAIL>` is purposefully left within each `input` so that the JSONL's JSON objects can be fed directly into a prompt-engineered model that can expand the bullet back into natural human language. 

Below is an example of a completed JSON object for training. Please note that the JSON object below is pretty-printed for ease of viewing; in the actual JSONL file, each object occupies a single line.

```json
{
    "input": "As the Subject Matter Expert (SME) for the Exceptional Family Member Program (EFMP), I expertly directed 57 enrollments, handled 194 incoming inquiries, and addressed 101 outgoing inquiries. Through my leadership and efficient processing, we beat the package processing time by 50% and achieved an on-time rate of 99%, significantly improving support for our EFMP families.",
    "output": "- EFMP SME; dir'd 57 enrollments/194 incoming/101 outgoing inquiries--beat pkg processing time 50%/on-time rt 99%"
}
```
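
To illustrate the difference between the pretty-printed form and the single-line form stored in the JSONL file, the sketch below parses a multi-line object and re-serializes it on one line. The field values are shortened placeholders chosen for this example only.

```python
import json

# Illustrative record; the values are placeholders, not real training data.
pretty = """
{
    "input": "Expanded, natural-language description of the bullet.",
    "output": "- ACTION; IMPACT--OUTCOME"
}
"""

record = json.loads(pretty)
# json.dumps without an indent argument emits the object on a single line.
flat_line = json.dumps(record, ensure_ascii=False)
print(flat_line)
```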

Any model or tool can be used to produce the input-output JSON object. For example, you can prompt-engineer ChatGPT using the prompt below:

```
INSTRUCTIONS: Expand upon condensed information that follows the bullet format `-[ACTION];[IMPACT]--[OUTCOME]`. Within areas of `<ADD DETAIL>` the task is to take the bullet after "output" and:
1. expand all acronyms or non-standard English words into their original forms, 
2. provide more context and language to generate full-form sentences that amount to a small paragraph describing the bullet, 
3. generate the context and language as if the perspective were from that of a member of the military, 
4. replace the `<ADD DETAIL>` area in the JSONL JSON-object

EXAMPLE OUTPUT BELOW:

MY PROMPT: `{"input": "<ADD DETAIL>", "output": "- ESD focal point; tracked, routed, resolved 375 Tier III/6 High lvl tkts--enabled AFNET access to 200K users"}`

YOUR RESPONSE: `{"input": "As the ESD focal point, I played a crucial role in tracking, routing, and resolving 375 Tier III and 6 high-level tickets. My efforts enabled AFNET access for 200,000 users, ensuring smooth operations and connectivity.", "output": "- ESD focal point; tracked, routed, resolved 375 Tier III/6 High lvl tkts--enabled AFNET access to 200K users"}`. 

IMPORTANT NOTE: Take the expansions and replace the `<ADD DETAIL>` area in the JSON. If provided multiple JSON objects with `<ADD DETAIL>` areas, then iterate through them performing the expansion. Return all of your results in a code block of type JSON, and have each expansion stay on 1 line of the overall code block. 

ACTION: Please provide an acknowledgement and summary of the instructions above if you understand the upcoming task.
```
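
Model responses to a prompt like the one above are not guaranteed to be well formed, so it may be worth checking them before adding them to the training set. The sketch below is one possible check, not part of the existing scripts: it keeps only single-line JSON objects whose `input` no longer contains the placeholder and whose `output` matches one of the bullets that was sent. The function name `validate_expansions` is hypothetical.

```python
import json
from typing import List

PLACEHOLDER = "<ADD DETAIL>"


def validate_expansions(response_text: str, expected_outputs: List[str]) -> List[dict]:
    """Parse one JSON object per line and keep only well-formed, completed pairs."""
    expected = set(expected_outputs)
    valid = []
    for line in response_text.splitlines():
        line = line.strip()
        if not line or line.startswith("```"):
            continue  # skip blank lines and code-fence markers
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a single-line JSON object
        if not isinstance(record, dict):
            continue
        new_input = record.get("input")
        output = record.get("output")
        if (
            isinstance(new_input, str)
            and isinstance(output, str)
            and new_input.strip()
            and PLACEHOLDER not in new_input  # the expansion was actually filled in
            and output in expected  # the bullet was not altered by the model
        ):
            valid.append(record)
    return valid
```

Records that fail the check can be resubmitted in a later batch instead of being discarded.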

## Step 5: Training Inputs from a Fine-Tuned Model

Once enough input-output pairs (a few hundred) have been created using the [Step 4](#step-4-generating-inputs-for-completions) method, the user can then use those completed training pairs to continue the process on the other 30K pairs that are still missing inputs for their respective outputs. This involves fine-tuning a model to perform the fill-ins required on `"input": "<ADD DETAIL>"`.
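
Before fine-tuning, the completed pairs need to be separated from the records that still carry the placeholder. The sketch below shows one way to do that split, assuming each JSONL line follows the `input`/`output` structure from Step 3; the function name `split_by_completion` is hypothetical.

```python
import json
from typing import Iterable, List, Tuple

PLACEHOLDER = "<ADD DETAIL>"


def split_by_completion(jsonl_lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Return (completed, pending) records based on whether the placeholder remains."""
    completed, pending = [], []
    for line in jsonl_lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if PLACEHOLDER in record["input"]:
            pending.append(record)  # still needs a generated input
        else:
            completed.append(record)  # usable as a training pair
    return completed, pending
```

The `completed` list would feed the fine-tuning job, while the `pending` list is what the fine-tuned model is later asked to fill in.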

The block below contains a simple fine-tuning job on a smaller model to train it to create new inputs from bullets using the existing [training and validation set](../data/training/training_validation_set.jsonl).