# Generating Training Data and Pulling a Model

This notebook builds the JSONL files that `fine_tune.ipynb` uses to fine-tune the base model. `scrape.py` scrapes open-source bullets from a given URL, and those bullets serve as the completions. Each completion is paired with a prompt: a long-form paragraph that ChatGPT generated from the bullet, prepended with the bullet-formatting prompt.
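
As a rough illustration of the prompt/completion pairing described above, the sketch below builds one JSONL line. The field names `prompt` and `completion`, the `FORMAT_PROMPT` text, and the sample bullet are all assumptions made for this example; the actual schema and formatting prompt used by the project are not shown in this notebook.

```python
import json

# Hypothetical formatting prompt; the real one used by the project is not shown here.
FORMAT_PROMPT = "Convert the following paragraph into a bullet: "


def make_record(paragraph: str, bullet: str) -> str:
    """Return one JSONL line pairing a long-form paragraph with its source bullet."""
    record = {
        "prompt": FORMAT_PROMPT + paragraph,  # formatting prompt comes first
        "completion": bullet,                 # the scraped bullet is the target
    }
    return json.dumps(record)


# Made-up sample data, for illustration only
line = make_record(
    "The member led a four-person team that repaired twelve generators ahead of schedule.",
    "- Led 4-mbr team; repaired 12 generators ahead of schedule",
)
print(line)
```

Each record occupies exactly one line, so a file of such lines can be read back one example at a time.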

## Step 1: Run the Bullet Scraper

See the `scripts/` directory for more details. Be warned that the scraper, although parallelized, may take a long time to work through all of the identified bullet repositories.

The block below runs the scraper on every easily scraped website listed in the `resources/websites.txt` file.

In [None]:
import subprocess

# Read the target URLs, skipping blank lines
with open('../resources/websites.txt', 'r') as file:
    urls = [url.strip() for url in file if url.strip()]

# Run the scraper on each site in turn
for url in urls:
    subprocess.run(["python3", "scripts/scrape.py", url])

The block below is **_OPTIONAL_**. It allows you to run the scraper on websites one at a time. It requires the following input:
- BASE_URL: The base URL including the protocol.

This is useful for the website(s) below, where there is a hard-coded records limit.

- https://www.afeprbullets.com/results.php?Submit5=Search&strength=Positive&rec=8124

**_IMPORTANT NOTE_**: Once success has been logged, wait 30 more seconds and then stop the script; otherwise you will wait a very long time for it to shut down.

In [None]:
BASE_URL = input("Enter a base url to scrape from")

!python3 scripts/scrape.py "{BASE_URL}"

## Step 2: Clean the Scraped Data

The scraped data requires some extra cleaning. This step removes bullets that are clearly too long or too short, and fixes the spacing and formatting of bullets that do not follow the standard bullet format.

- DIRTY_FILE: The name of the text file you want cleaned, from the `data/raw` directory

In [None]:
DIRTY_FILE = input("Enter a filename to clean")

!python3 scripts/clean.py "{DIRTY_FILE}"
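
The kind of cleaning described in this step can be sketched as follows. This is not the logic of `scripts/clean.py`, which is not shown here: the length thresholds `MIN_CHARS` and `MAX_CHARS`, and the assumption that a standard bullet starts with `- `, are hypothetical choices made for illustration.

```python
import re

# Hypothetical thresholds; the limits used by scripts/clean.py are not stated here.
MIN_CHARS = 20
MAX_CHARS = 120


def clean_bullets(lines):
    """Normalize spacing and drop bullets outside the assumed length range."""
    cleaned = []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        text = text.lstrip("-").strip()            # drop any leading dashes
        if not text:
            continue                               # skip empty lines
        bullet = "- " + text                       # assumed standard prefix
        if MIN_CHARS <= len(bullet) <= MAX_CHARS:
            cleaned.append(bullet)
    return cleaned


# Made-up sample input: one badly spaced bullet, one too-short bullet, one blank line
sample = ["-Led  4-mbr team;   fixed 12 generators", "- ok", ""]
print(clean_bullets(sample))
# ['- Led 4-mbr team; fixed 12 generators']
```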

## Step 3: Consolidate the Cleaned Data

Consolidate all clean, scraped bullet website outputs from the `data/raw` directory into one JSONL file.

In [None]:
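import os
import shlex

directory_path = "../data/raw/"

# Collect every file under the raw data directory
file_paths = [os.path.join(root, file) for root, _, files in os.walk(directory_path) for file in files]

# Quote each path so that names containing spaces survive the shell call
file_paths_string = " ".join(shlex.quote(path) for path in file_paths)

!python3 scripts/consolidate.py {file_paths_string}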
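
For readers unfamiliar with the consolidation step, a minimal sketch of merging several text files into one JSONL file is shown below. It assumes one bullet per line in each input file and uses a hypothetical `completion` field; the behavior and output schema of `scripts/consolidate.py` itself are not shown in this notebook and may differ.

```python
import json
import tempfile
from pathlib import Path


def consolidate(input_paths, output_path):
    """Write each non-empty line of every input file as one JSON object per line."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    bullet = line.strip()
                    if bullet:
                        # "completion" is an assumed field name
                        out.write(json.dumps({"completion": bullet}) + "\n")
                        count += 1
    return count


# Demonstration on throwaway files with made-up bullets
with tempfile.TemporaryDirectory() as tmp:
    site_a = Path(tmp) / "site_a.txt"
    site_b = Path(tmp) / "site_b.txt"
    site_a.write_text("- First bullet\n\n- Second bullet\n", encoding="utf-8")
    site_b.write_text("- Third bullet\n", encoding="utf-8")

    out_file = Path(tmp) / "bullets.jsonl"
    written = consolidate([site_a, site_b], out_file)
    records = [json.loads(l) for l in out_file.read_text(encoding="utf-8").splitlines()]

print(written, records[0])
```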

## Step 4: Pull a T5 Pre-Trained Model

A pre-trained model can be pulled from [Hugging Face](https://huggingface.co/) for use during the fine-tuning step of the Bullet Forge development process.

- MODEL_NAME: The Hugging Face repository to pull the model from (e.g., `username/repository`)
- MODEL_MAX_LENGTH: The maximum number of input tokens the model accepts. Be aware that with standard (full) self-attention, the memory used by the attention scores grows quadratically with sequence length, so doubling the input length quadruples it.

In [None]:
MODEL_NAME = input("Insert Hugging Face model name")
MODEL_MAX_LENGTH = 512

!python3 scripts/pull_model.py "{MODEL_NAME}" "{MODEL_MAX_LENGTH}"
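
The note on `MODEL_MAX_LENGTH` can be checked with simple arithmetic. The sketch below counts the entries in a single attention score matrix for one head in one layer; `attention_score_elements` is a hypothetical helper, and the count ignores batch size, head count, and all other activations, so it illustrates the scaling rather than estimating real memory use.

```python
def attention_score_elements(seq_len: int) -> int:
    """Entries in one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


for n in (512, 1024):
    print(n, attention_score_elements(n))
# 512 262144
# 1024 1048576

ratio = attention_score_elements(1024) / attention_score_elements(512)
print(ratio)  # 4.0
```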