# Bullet Forge Training Data

This notebook builds the JSONL files that `fine-tune.ipynb` uses to fine-tune the base model. The scraper (`bullet-scraper.py`) collects open-source bullets from a given URL, and those bullets serve as the completions. Each completion is paired with a prompt: the bullet formatting prompt followed by a ChatGPT-generated long-form paragraph expanded from that bullet.
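
To make the record layout concrete, here is a minimal sketch of how one prompt/completion pair could be serialized as a JSONL line. The `prompt` and `completion` field names, the `FORMATTING_PROMPT` text, and the `make_record` helper are assumptions for illustration; the actual schema expected by `fine-tune.ipynb` may differ.

```python
import json

# Hypothetical formatting instruction; the project's real prompt may differ.
FORMATTING_PROMPT = "Rewrite the following paragraph as a bullet:\n\n"


def make_record(paragraph: str, bullet: str) -> str:
    """Return one JSONL line pairing an expanded paragraph with its source bullet."""
    record = {
        # The formatting prompt is prepended to the long-form paragraph.
        "prompt": FORMATTING_PROMPT + paragraph.strip(),
        # The scraped bullet is the target completion.
        "completion": bullet.strip(),
    }
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return json.dumps(record, ensure_ascii=False)


line = make_record(
    "The member led a four-person team that repaired twelve radios ahead of schedule.",
    "- Led 4-mbr team; repaired 12 radios ahead of schedule",
)
print(line)
```

Writing one such line per bullet produces a file in which every line is an independent JSON object, which is what makes it JSONL.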

## Step 1: Run the Bullet Scraper

See the `scraper/` directory for details. Although the scraper is parallelized, it may take a long time to work through all of the identified bullet repositories.

The block below runs the scraper on every easily scrapable website listed in `scraper/websites.txt`.

In [None]:
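import os

file_paths = []

# Read one URL per line, skipping blank lines.
with open('scraper/websites.txt', 'r') as file:
    urls = [line.strip() for line in file if line.strip()]

# Make sure the output directory exists before the scraper writes to it.
os.makedirs("data/raw", exist_ok=True)

for url in urls:
    # Name the output after the site, e.g. "https://www.example.com/..." -> "example".
    # This slicing assumes every URL contains "www." and ".com".
    filename = url[url.index("www.") + 4:url.index(".com")]
    filepath = f"data/raw/{filename}.txt"
    !python3 scraper/scrape.py "{url}" "{filepath}"
    file_paths.append(filepath)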
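
The slicing above raises a `ValueError` for any URL that lacks `www.` or `.com`. As a hedged alternative, the sketch below derives the output name from the parsed hostname using only the standard library. The helper name `site_name` is hypothetical and is not part of the scraper.

```python
from urllib.parse import urlparse


def site_name(url: str) -> str:
    """Derive a short file stem from a URL's hostname (hypothetical helper)."""
    host = urlparse(url.strip()).hostname
    if host is None:
        # urlparse only finds a hostname when the protocol is present.
        raise ValueError(f"URL has no hostname (is the protocol missing?): {url!r}")
    if host.startswith("www."):
        host = host[len("www."):]
    # Drop the final label (the top-level domain), e.g. ".com" or ".org".
    stem, _, _ = host.rpartition(".")
    # Hosts without a dot, such as "localhost", are returned unchanged.
    return stem or host


print(site_name("https://www.afeprbullets.com/results.php?Submit5=Search&strength=Positive&rec=8124"))
# prints: afeprbullets
```

Only the last label is removed, so a multi-part suffix such as `.co.uk` would leave `example.co`; that case would need extra handling.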

The block below is **_OPTIONAL_**. It runs the scraper on one website at a time and requires the following input:
- BASE_URL: The base URL, including the protocol (for example, `https://`).

This is useful for the website(s) below, which have a hard-coded record limit. Once success has been logged, wait 30 more seconds and then stop the script.

- https://www.afeprbullets.com/results.php?Submit5=Search&strength=Positive&rec=8124

In [None]:
BASE_URL = input("Enter a base URL to scrape from: ").strip()

# Name the output after the site; this assumes the URL contains "www." and ".com".
filename = BASE_URL[BASE_URL.index("www.") + 4:BASE_URL.index(".com")]
filepath = f"data/raw/{filename}.txt"
!python3 scraper/scrape.py "{BASE_URL}" "{filepath}"

The block below is **_OPTIONAL_**. It runs a manual cleaning step on a specific output file, as needed, to further clean the scraped data before the consolidation in Step 2.

In [None]:
DIRTY_FILE = input("Enter a filename to clean: ").strip()

# The file name is resolved relative to data/raw/.
dirty_file_path = f"data/raw/{DIRTY_FILE}"

!python3 scraper/clean.py "{dirty_file_path}"

## Step 2: Consolidate Scraped Data

Consolidate all of the cleaned scraper outputs in the `data/raw` directory into one JSONL file.

In [None]:
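import os
import shlex

directory_path = "data/raw/"

# Collect every file under data/raw/, including files in subdirectories.
file_paths = []
for root, dirs, files in os.walk(directory_path):
    for file in files:
        file_paths.append(os.path.join(root, file))

# Quote each path so names containing spaces or shell characters survive the shell call.
file_paths_string = " ".join(shlex.quote(path) for path in file_paths)

!python3 scraper/consolidate.py {file_paths_string}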
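
The internals of `scraper/consolidate.py` are not shown in this notebook. As a rough sketch of what a consolidation step of this kind could look like, the example below merges several raw text files into one deduplicated JSONL file. It assumes each raw file holds one bullet per line and that each record stores the bullet under a `completion` key; both are assumptions, and the `consolidate` function here is hypothetical rather than the project's implementation.

```python
import json
import tempfile
from pathlib import Path


def consolidate(raw_paths, out_path):
    """Merge one-bullet-per-line text files into a deduplicated JSONL file.

    Returns the number of records written.
    """
    seen = set()
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in raw_paths:
            for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
                bullet = raw_line.strip()
                if not bullet or bullet in seen:
                    continue  # skip blank lines and exact duplicates
                seen.add(bullet)
                out.write(json.dumps({"completion": bullet}) + "\n")
                count += 1
    return count


# Small demonstration using temporary files instead of data/raw/.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "site_a.txt")
    second = Path(tmp, "site_b.txt")
    first.write_text("- Bullet one\n- Bullet two\n", encoding="utf-8")
    second.write_text("- Bullet two\n\n", encoding="utf-8")

    out_file = Path(tmp, "bullets.jsonl")
    written = consolidate([first, second], out_file)
    records = [json.loads(row) for row in out_file.read_text(encoding="utf-8").splitlines()]

print(written, records)
```

Deduplicating on the stripped text keeps the first occurrence of each bullet and preserves the order in which files are passed in.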