<a href="https://colab.research.google.com/github/kennywong524/di-cloze-project/blob/main/DI_cloze_test_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Preparation
Modify JSON Files: Make sure every JSON record in the files under rawdata/booknlp_output_16k_jsonl includes a field such as "model": "o1", adding or updating it as needed. This lets you track later which model produced which result.

Example JSON record:

```
{
    "book_id": "book123",
    "passage": "The door opened, and [MASK], dressed and hatted, entered with a cup of tea.",
    "model": "o1"
}
```
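The tagging step could be scripted rather than done by hand. The following is a minimal sketch, assuming each file holds one JSON object per line; the helper names `tag_record` and `tag_folder` are hypothetical and not part of the existing pipeline.

```python
import json
from pathlib import Path


def tag_record(line, model="o1"):
    """Return the JSON line with its "model" field set (any old value is overwritten)."""
    record = json.loads(line)
    record["model"] = model
    return json.dumps(record, ensure_ascii=False)


def tag_folder(folder, model="o1"):
    """Rewrite every *.jsonl file in `folder` in place; return the number of records tagged."""
    count = 0
    for path in sorted(Path(folder).glob("*.jsonl")):
        lines = path.read_text(encoding="utf-8").splitlines()
        # Blank lines are dropped; every remaining line must be valid JSON.
        tagged = [tag_record(line, model) for line in lines if line.strip()]
        path.write_text("".join(t + "\n" for t in tagged), encoding="utf-8")
        count += len(tagged)
    return count
```

Calling `tag_folder("rawdata/booknlp_output_16k_jsonl")` would update the files in place, so it may be worth running it on a copy of the folder first.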

Sync Considerations: If you sync data to Dropbox or another cloud service, verify that your modifications have reached the remote storage before running the batch job.

## Prediction script for o1

Only the model needs to change in the openai_predict_name_cloze_batch2.py script: specify "o1" as the model in the API call, and update any prompt or endpoint specifics if needed.

```python
import json
import os
import sys
import time

import openai  # openai.ChatCompletion below is the legacy interface (openai<1.0)

# ------------------------------------------------------------------------
# OpenAI credentials
# ------------------------------------------------------------------------
API_ORG = os.environ.get("OPENAI_ORG", "")
API_KEY = os.environ.get("OPENAI_API_KEY", "")
openai.organization = API_ORG
openai.api_key = API_KEY

MODEL = "o1"  # Using model "o1" instead of "gpt-3.5-turbo" or similar


def predict(passage):
    """
    Given a passage with a [MASK] token, this function builds a prompt
    and returns the model's prediction for the name that should fill in [MASK].
    """
    prompt_text = f"""You have seen the following passage in your training data.
What is the proper name that fills in the [MASK] token in it?
This name is exactly one word long, and is a proper name (not a pronoun or any other word).
You must make a guess, even if you are uncertain.

Example:

Input: "Stay gold, [MASK], stay gold."
Output: <name>Ponyboy</name>

Input: "The door opened, and [MASK], dressed and hatted, entered with a cup of tea."
Output: <name>Gerty</name>

Input: {passage}
Output:
"""

    # No temperature argument: o1-series models reject non-default values,
    # so temperature=0.0 would make every request fail.
    completion = openai.ChatCompletion.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt_text}],
    )
    # Extract the output content from the API response
    return completion["choices"][0]["message"]["content"], completion


def main():
    """
    Reads a JSONL file where each line is a JSON object with a passage.
    Calls the predict() function for each passage and writes the output
    (including any error messages) to an output JSONL file.
    """
    if len(sys.argv) < 3:
        print("Usage: openai_predict_name_cloze_batch2.py <input_jsonl> <output_jsonl>")
        sys.exit(1)

    input_jsonl_path = sys.argv[1]
    output_jsonl_path = sys.argv[2]

    with open(input_jsonl_path, "r", encoding="utf-8") as infile, \
         open(output_jsonl_path, "w", encoding="utf-8") as outfile:

        for line in infile:
            line = line.strip()
            if not line:
                continue  # skip blank lines

            record = json.loads(line)
            passage = record.get("passage", "")
            # Tag the record up front so failed calls are attributed to the model too
            record["model"] = MODEL

            try:
                prediction, full_response = predict(passage)
                record["prediction"] = prediction
                record["full_response"] = full_response
            except Exception as e:
                # Keep the record, but store the error instead of a prediction
                record["prediction"] = None
                record["error"] = str(e)

            outfile.write(json.dumps(record, ensure_ascii=False) + "\n")
            time.sleep(0.1)  # Adjust sleep to handle rate limits if necessary


if __name__ == "__main__":
    main()
```
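The fixed `time.sleep(0.1)` in the script only spaces requests out; it does not recover from a call that fails because of a rate limit. One common option is to retry with exponential backoff. The sketch below is an assumption about how that could look, not part of the existing script: `call_with_retries` is a hypothetical helper, and the `sleep` parameter exists only so the waiting can be stubbed out in tests.

```python
import time


def call_with_retries(fn, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); after a failure wait base_delay * 2**attempt seconds and retry.

    The last exception is re-raised once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Inside `main()`, the call would then read `prediction, full_response = call_with_retries(predict, passage)`. Catching every exception is deliberately coarse here; narrowing it to the client library's rate-limit error would avoid retrying requests that can never succeed.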


## Batch Execution Script
Create a shell script (e.g., openai_predict_name_cloze_batch.sh) to process each JSONL file from our raw data folder. This script loops over the files and calls the above Python script.

```bash
#!/bin/bash
# Usage:
#   ./openai_predict_name_cloze_batch.sh rawdata/booknlp_output_16k_jsonl output/o1_results

INPUT_FOLDER=$1
OUTPUT_FOLDER=$2

mkdir -p "$OUTPUT_FOLDER"

for JSONL_FILE in "$INPUT_FOLDER"/*.jsonl; do
    FILENAME=$(basename "$JSONL_FILE")
    OUTPUT_FILE="$OUTPUT_FOLDER/$FILENAME"

    echo "Processing $JSONL_FILE -> $OUTPUT_FILE"
    python3 openai_predict_name_cloze_batch2.py "$JSONL_FILE" "$OUTPUT_FILE"
done
```

## Post-Processing the Results

Our current pipeline uses a script (e.g., process_openai_batch3.py) to extract and combine the output batches. You can reuse that script with the new results directory.

If any minor modifications are needed (for example, to check for the field "model": "o1"), update your post-processing script accordingly. The general steps are:

1. Merge JSONL Files: Concatenate all JSONL output files into one master file.
2. Extract Relevant Data: Parse out fields such as book_id, passage, prediction, and optionally any confidence metrics.
3. Compute Metrics: If desired, compare predictions against known answers (if available) to compute accuracy.
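The merge and extract steps can be sketched as follows. This is an illustration under stated assumptions, not the contents of the existing post-processing script: the helper names `extract_name` and `merge_results` and the output field `predicted_name` are hypothetical, and the sketch assumes the model answers in the `<name>...</name>` format requested by the prompt.

```python
import json
import re
from pathlib import Path

# Matches the answer format requested in the prompt, e.g. <name>Gerty</name>
NAME_TAG = re.compile(r"<name>\s*(.*?)\s*</name>", re.DOTALL)


def extract_name(prediction):
    """Return the text inside the first <name>...</name> tag, or None if there is none."""
    if not prediction:
        return None
    match = NAME_TAG.search(prediction)
    return match.group(1) if match else None


def merge_results(results_dir, merged_path):
    """Concatenate all *.jsonl files in results_dir into merged_path.

    Keeps a few fields per record and adds the hypothetical "predicted_name".
    Returns the number of records written.
    """
    merged = Path(merged_path).resolve()
    written = 0
    with open(merged_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(results_dir).glob("*.jsonl")):
            if path.resolve() == merged:
                continue  # never read the file that is being written
            with open(path, encoding="utf-8") as infile:
                for line in infile:
                    if not line.strip():
                        continue
                    record = json.loads(line)
                    row = {
                        "book_id": record.get("book_id"),
                        "passage": record.get("passage"),
                        "model": record.get("model"),
                        "prediction": record.get("prediction"),
                        "predicted_name": extract_name(record.get("prediction")),
                    }
                    out.write(json.dumps(row, ensure_ascii=False) + "\n")
                    written += 1
    return written
```

Records whose prediction is missing or lacks the tag keep `predicted_name` as `null`, so failed calls stay visible in the merged file instead of being dropped silently.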


## Workflow

1. Prepare data: Update the JSONL files with "model": "o1".
2. Run Predictions:

```
./openai_predict_name_cloze_batch.sh rawdata/booknlp_output_16k_jsonl output/o1_results
```

3. Post-Process Results:

```
./process_openai_batch.sh output/o1_results combined_results/o1_all.jsonl
```

4. Use analysis tools (Python, Jupyter notebooks, etc.) to evaluate the performance of model "o1" on the name-cloze task.
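For the evaluation step, a first pass might compute exact-match accuracy overall and per book. The sketch below assumes two fields that the records shown in this document do not contain: a gold-answer field (called `answer` here) and a field holding the name already extracted from the model output (called `predicted_name` here). Both names, and the helpers `load_rows` and `accuracy_by_book`, are hypothetical.

```python
import json
from collections import defaultdict


def load_rows(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def accuracy_by_book(rows, gold_field="answer", pred_field="predicted_name"):
    """Return (overall_accuracy, {book_id: accuracy}) using case-insensitive exact match.

    Rows without a gold answer are ignored; a missing prediction counts as wrong.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        gold = row.get(gold_field)
        if gold is None:
            continue  # no reference answer to compare against
        pred = row.get(pred_field) or ""
        book = row.get("book_id")
        totals[book] += 1
        hits[book] += int(pred.strip().lower() == gold.strip().lower())
    total = sum(totals.values())
    overall = sum(hits.values()) / total if total else 0.0
    per_book = {book: hits[book] / totals[book] for book in totals}
    return overall, per_book
```

With the combined file from the previous step, usage would look like `overall, per_book = accuracy_by_book(load_rows("combined_results/o1_all.jsonl"))`. Case-insensitive exact match is a design choice; a stricter or looser comparison may suit the task better.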
