# 📚 Comparing Fine-Tuned SOP Models

This notebook is designed to **load multiple fine-tuned versions** of Mistral-7B-Instruct-v0.2 and **generate SOPs** for a set of evaluation prompts.

### What this code does:

### 1. Install Required Libraries
- Install Hugging Face `transformers`, `peft` (for LoRA models), `accelerate`, `bitsandbytes`, and `sentencepiece`.

### 2. Set Up Environment
- Detect if CUDA (GPU) is available.
- Define paths to three different fine-tuned models:
  - **ModelA**: Fine-tuned LoRA model on basic SOPs.
  - **ModelB**: Fine-tuned LoRA model trained to generate SOP + stylistic explanation.
  - **ModelC**: Full fine-tuned model on SOPs only (no LoRA).

### 3. Load Tokenizer
- Load the Mistral-7B tokenizer and reuse the EOS token as the padding token.

### 4. Define Evaluation Prompts
- Three prompts to evaluate:
  - **In-domain prompt**: Sports Management SOP (similar to training data).
  - **Out-of-domain 1**: Space Law SOP (new domain).
  - **Out-of-domain 2**: Theatre and Performance Studies SOP (new domain).

### 5. Load Models
- Define a `load_model()` function:
  - If it’s a LoRA adapter, load it into the base model.
  - If it’s a full fine-tuned model, load it directly.
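
The LoRA-versus-full decision above can also be made from the checkpoint directory itself rather than from its name. The following is a minimal sketch, assuming the adapters were saved with PEFT's `save_pretrained()`, which writes an `adapter_config.json` file next to the adapter weights. The helper name `is_lora_adapter` is hypothetical and is not part of this notebook; the temporary directories stand in for real checkpoints.

```python
import json
import os
import tempfile


def is_lora_adapter(path):
    """Return True if `path` looks like a PEFT adapter directory (hypothetical helper)."""
    # Assumption: the adapter was saved with PEFT, which writes adapter_config.json.
    return os.path.isfile(os.path.join(path, "adapter_config.json"))


# Self-check with temporary directories instead of real checkpoints.
with tempfile.TemporaryDirectory() as adapter_dir, tempfile.TemporaryDirectory() as full_dir:
    with open(os.path.join(adapter_dir, "adapter_config.json"), "w") as f:
        json.dump({"peft_type": "LORA"}, f)
    with open(os.path.join(full_dir, "config.json"), "w") as f:
        json.dump({"model_type": "mistral"}, f)
    adapter_detected = is_lora_adapter(adapter_dir)
    full_detected = is_lora_adapter(full_dir)

print(adapter_detected, full_detected)
```

A check like this does not depend on how the folder happens to be named, so renaming a checkpoint cannot silently change which loading branch is taken.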

### 6. Define Text Generation Function
- `generate_sop()`:
  - Format prompt using `[INST]...[/INST]`.
  - Generate text with the model using controlled sampling (temperature, top_p, repetition_penalty).
  - Clean common bad starts like "Dear Sir/Madam" if they appear (optional auto-clean).
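
When the full output sequence of a causal language model is decoded, the text begins with the prompt, so any cleanup pattern anchored at the start of the string sees `[INST]` rather than the SOP. The sketch below illustrates the prompt format and one way to strip such an echo on plain strings. Both helper names (`format_instruct_prompt`, `strip_prompt_echo`) are hypothetical and only illustrate the idea.

```python
def format_instruct_prompt(prompt):
    """Wrap a user prompt in Mistral-style [INST] tags (hypothetical helper)."""
    return f"[INST] {prompt.strip()} [/INST]"


def strip_prompt_echo(decoded_text):
    """Return only the text after the last [/INST] tag, if one is present."""
    marker = "[/INST]"
    if marker in decoded_text:
        return decoded_text.rsplit(marker, 1)[1].strip()
    return decoded_text.strip()


formatted = format_instruct_prompt("Write me an SOP for a Master's in Space Law.")
# Simulated decoded output: the prompt echo followed by the model's completion.
decoded = formatted + " Statement of Purpose: I have always looked up."
completion = strip_prompt_echo(decoded)
print(completion)
```

With a real model the same effect is usually obtained by slicing off the prompt tokens before decoding, which avoids relying on the tag text at all.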

### 7. Text Cleaning Utilities
- `clean_bad_sop_starts()` removes unwanted formalities.
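- `ensure_ends_with_punctuation()` makes sure an SOP ends with terminal punctuation instead of stopping mid-sentence (a standalone helper; `generate_sop()` does not call it).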

### 8. Run Inference
- Loop through each model and each prompt.
- Generate and print SOPs to manually compare model behaviors.

✅ **By the end of this notebook**, you can:
- Visually compare outputs from the different fine-tuned models.
- See how each model generalizes from in-domain to out-of-domain prompts.
- Inspect whether the models preserve SOP structure and style.

---

In [None]:
# 📦 Install necessary packages
!pip install -q transformers peft accelerate bitsandbytes sentencepiece

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import re

# 🔧 Setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# 📂 Paths to your fine-tuned models (two LoRA adapters and one full fine-tune)
model_paths = {
    "ModelA": "/content/drive/MyDrive/mistral-manasa-lora",
    "ModelB": "/content/drive/MyDrive/mistral-v02-sop-explainer-lora-adapter",
    "ModelC": "/content/drive/MyDrive/mistral_sop_finetuned"
}

base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# 💬 Prompts
prompts = {
    "In-domain": "Write me an SOP for pursuing a Master's Degree in Sports Management focusing on Football Operations.",
    "Out-of-domain 1": "Write me an SOP for a Master's in Space Law.",
    "Out-of-domain 2": "Write me an SOP for a Master's in Theatre and Performance Studies."
}

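# 🧠 Load models
from transformers import BitsAndBytesConfig

def load_model(path, base_model_id=base_model_id):
    is_full_finetune = "finetuned" in path
    # A full fine-tune is loaded from its own path; a LoRA adapter needs the base model first.
    model = AutoModelForCausalLM.from_pretrained(
        path if is_full_finetune else base_model_id,
        device_map="auto",
        torch_dtype=torch.float16,
        # Replaces the deprecated `load_in_4bit=True` keyword argument
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    )
    if is_full_finetune:
        return model
    # `device_map="auto"` already places the weights, so no extra `.to(device)` is needed.
    return PeftModel.from_pretrained(model, path)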

# 📄 Generation function
def generate_sop(model, prompt, max_new_tokens=950):
    # The tokenizer prepends the BOS token itself, so "<s>" is not written into the string.
    formatted_prompt = f"[INST] {prompt} [/INST]"
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.85,
        do_sample=True,
        repetition_penalty=1.2,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,  # explicit pad token (same as EOS here)
    )
    # Decode only the newly generated tokens. The full sequence starts with the prompt,
    # which would keep the `^`-anchored patterns in clean_bad_sop_starts() from matching.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    return clean_bad_sop_starts(generated_text)

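def ensure_ends_with_punctuation(text):
    text = text.strip()
    if text and not re.search(r'[.!?]$', text):
        sentences = re.split(r'(?<=[.!?])\s+', text)
        if len(sentences) > 1:
            # Drop the trailing fragment and keep the complete sentences
            return ' '.join(sentences[:-1])
        # No complete sentence to fall back on: close the text with a period
        return text + '.'
    return text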

def clean_bad_sop_starts(text):
    """Remove letter-style openings such as salutations from the start of an SOP."""
    text = text.strip()
    banned_starts = [
        r"^To\s+the\s+(visa|admissions)\s+officer",
        # The optional group also consumes the second half of "Sir/Madam" or "Sir or Madam"
        r"^Dear\s+(Sir|Madam)(\s*(/|or)\s*(Sir|Madam))?",
        r"^Respected\s+(Sir|Madam)(\s*(/|or)\s*(Sir|Madam))?",
        r"^This\s+letter\s+is\s+to\s+inform",
    ]
    for pattern in banned_starts:
        if re.search(pattern, text, re.IGNORECASE):
            # Remove the bad start and keep the rest of the text
            text = re.sub(pattern, '', text, flags=re.IGNORECASE).lstrip(",.: \n")

            # Log the cleanup so it can be reviewed later
            print("⚠️ Detected bad start, auto-cleaned.")
            break
    return text

# 🚀 Run generation
# all_outputs = {}
# for model_name, model_path in model_paths.items():
#     print(f"\n🔄 Loading {model_name}...")
#     model = load_model(model_path)
#     model_outputs = {}
#     for prompt_label, prompt in prompts.items():
#         print(f"\n📌 {model_name} - {prompt_label} Prompt:")
#         output = generate_sop(model, prompt)
#         print(output[:1000])  # Print first 1000 characters to avoid overload
#         model_outputs[prompt_label] = output
#     all_outputs[model_name] = model_outputs


In [None]:
# 💬 Prompts
prompts = {
    "In-domain": "Write me an SOP for a Master's degree application in Computer Science focusing on Artificial Intelligence.",
    "Out-of-domain 1": "Write me an SOP for a Master's in Criminal Law.",
    "Out-of-domain 2": "Write me an SOP for a Master's in Art History"
}

# 📄 Generate and Save SOP Outputs for Each Fine-Tuned Model

In this step:
- We generate SOPs for each model across all three evaluation prompts (in-domain and out-of-domain).
- We **enhance the prompts** slightly to encourage better SOP structure:
  - Added guidance to produce a complete, well-structured SOP that does not stop mid-sentence.
- For each model:
  - Generate SOPs for all prompts.
  - Save the outputs into a **nicely formatted text file** (`ModelA_outputs.txt`, `ModelB_outputs.txt`, etc.).
  - Each file contains:
    - The original prompt (with enhancements).
    - The corresponding generated SOP.
    - Visual separators (`=`) for easier manual review.

✅ **By the end of this step**:
- You will have separate output files for each model.
- All outputs will be saved inside `/content/drive/MyDrive/sop_outputs`.
- You can easily **compare models side-by-side** by reading the output text files.
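
The file layout described above (prompt, generated SOP, then a line of 80 `=` characters) can be written and read back with a few lines of standard-library code. This is a minimal sketch on in-memory strings; `format_block` and `parse_blocks` are hypothetical names, and the parser assumes that no SOP itself contains a run of 80 `=` characters.

```python
import re

SEPARATOR = "=" * 80


def format_block(prompt_label, prompt, sop):
    """Render one prompt/SOP pair in the layout used for the output files."""
    return (
        f"### Prompt ({prompt_label}):\n{prompt}\n\n"
        f"### Generated SOP:\n{sop}\n\n"
        f"{SEPARATOR}\n\n"
    )


def parse_blocks(text):
    """Recover label, prompt, and SOP from text produced by format_block()."""
    pattern = re.compile(
        r"### Prompt \((.*?)\):\n(.*?)\n\n### Generated SOP:\n(.*?)\n\n={80}",
        re.DOTALL,
    )
    return [
        {"label": label, "prompt": prompt, "sop": sop}
        for label, prompt, sop in pattern.findall(text)
    ]


text = format_block("In-domain", "Write an SOP.", "First paragraph.\n\nSecond paragraph.")
text += format_block("Out-of-domain 1", "Write another SOP.", "Only paragraph.")
parsed = parse_blocks(text)
print([block["label"] for block in parsed])
```

Anchoring the SOP on the separator line keeps multi-paragraph SOPs intact, since blank lines inside the SOP do not end the match.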

In [None]:
import os

# 📂 Where to save outputs
output_folder = "/content/drive/MyDrive/sop_outputs"
os.makedirs(output_folder, exist_ok=True)

# 🚀 Run generation and save
all_outputs = {}

for model_name, model_path in model_paths.items():
    print(f"\n🔄 Loading {model_name}...")
    model = load_model(model_path)
    model_outputs = {}
    output_text = ""

    for prompt_label, original_prompt in prompts.items():
        # ✨ Updated prompt to encourage full, complete SOP
        full_prompt = (
            f"{original_prompt} The SOP should be complete, well-structured, "
            "with a clear conclusion. Do not stop mid-sentence. End naturally."
        )

        print(f"\n📌 {model_name} - {prompt_label} Prompt:")

        output = generate_sop(model, full_prompt, max_new_tokens=1200)  # 🔥 bigger token limit

        print(output[:1000])  # Preview first 1000 characters

        model_outputs[prompt_label] = output

        # Save nicely formatted text for each model
        output_text += f"### Prompt ({prompt_label}):\n{full_prompt}\n\n"
        output_text += f"### Generated SOP:\n{output}\n\n"
        output_text += "="*80 + "\n\n"

    # Save outputs to file
    file_path = os.path.join(output_folder, f"{model_name}_outputs.txt")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(output_text)

    all_outputs[model_name] = model_outputs


print("\n✅ All outputs saved in:", output_folder)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



🔄 Loading ModelA...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


📌 ModelA - In-domain Prompt:
[INST] Write me an SOP for a Master's degree application in Computer Science focusing on Artificial Intelligence. The SOP should be complete, well-structured, with a clear conclusion. Do not stop mid-sentence. End naturally. [/INST] Statement of Purpose: As a young girl growing up in India, I was fascinated by the progression of technology and its presence across all spheres of life. It began with my mother being able to shop online from the comfort of her home, then it moved onto having virtual assistants like Siri and Alexa become commonplace among households worldwide, further moving into machine learning and artificial intelligence taking over the marketplace and finally shaping the future we live in today. The evolution of technology has been nothing short of revolutionary and I believe that the next decade will witness more profound changes than ever before. As someone who grew up observing these changes, I am highly determined to make a difference a



[INST] Write me an SOP for a Master's in Art History The SOP should be complete, well-structured, with a clear conclusion. Do not stop mid-sentence. End naturally. [/INST] Statement of Purpose: I have always been passionate about the arts and their interpretation. As a young girl, I found joy in expressing myself through various mediums such as painting, sculpture or photography. Given my love for visual representation, it is no surprise that when I grew up to become more aware of my interests, I chose to pursue architecture. Architecture was not just a subject but a passion that brought together the best of both worlds; art and science. With every project I worked on, I realised how important history and culture were to shaping the architectural language of a place. This thought has stayed with me ever since then. Now, given my past academic experience and professional work, I want to broaden my knowledge base and skillset by pursuing this program at your esteemed institution. My aspi

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


📌 ModelB - In-domain Prompt:
[INST] Write me an SOP for a Master's degree application in Computer Science focusing on Artificial Intelligence. The SOP should be complete, well-structured, with a clear conclusion. Do not stop mid-sentence. End naturally. [/INST]
[SOP] Statement of Purpose The world has witnessed the power and potential of technology through its rapid growth over the last few decades. From being just a tool to enhance productivity to becoming one that is capable of influencing human behaviour and thought process, it has come quite far. With the advent of AI, machine learning and data analytics, we have been able to understand our own needs and wants better than ever before. As humans, we often underestimate our capabilities while ignoring the fact that we possess the most powerful weapon at our disposal: data. Data holds all answers to every question mankind might ask. It also provides insight into understanding oneself better as individuals. I believe that given my aca



[INST] Write me an SOP for a Master's in Art History The SOP should be complete, well-structured, with a clear conclusion. Do not stop mid-sentence. End naturally. [/INST]
[SOP]
STATEMENT OF PURPOSE I am deeply fascinated by the power of art and its ability to communicate stories across cultures and eras. This passion has driven my interest towards exploring various aspects of it from history, design, functionality to sociocultural significance. With this statement of purpose, I hope to express my deep love for art and the intentionality that drives my motivation to pursue Masters degree at XYZ University, which is known for its quality education and research opportunities. Since childhood, I have always been inclined towards creativity. It could range from painting, drawing, dancing or even writing. However, as time progressed, I started developing a strong inclination towards understanding more about how these forms evolved over centuries and why they continue to influence our lifest

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


📌 ModelC - In-domain Prompt:
[INST] Write me an SOP for a Master's degree application in Computer Science focusing on Artificial Intelligence. The SOP should be complete, well-structured, with a clear conclusion. Do not stop mid-sentence. End naturally. [/INST] Statement of Purpose (SOP) for Master's Degree in Computer Science: Specialization in Artificial Intelligence

I am pleased to apply for the Master's Degree program in Computer Science with a specialization in Artificial Intelligence at XYZ University. I have always been fascinated by the power and potential of computers and technology, leading me to pursue a Bachelor's Degree in Computer Science from ABC College. Now, as I stand at the threshold of my professional career, I am eager to deepen my knowledge and expertise in the field of Artificial Intelligence (AI).

Throughout my undergraduate studies, I have consistently excelled academically, achieving a GPA of 3.8 out of 4.0. My coursework focused primarily on computer progr

## ✅ Part 2: Set Up LLM-as-Judge to Compare Models

We'll now automate LLM-based judging using:

- A system prompt (clear judging criteria)
- A user prompt (the original prompt plus two model outputs)

In [None]:
!pip install openai



In [None]:
import openai
from google.colab import userdata

# Read the API key securely from Colab secrets
api_key = userdata.get('OPENAI_API_KEY')


# 🧠 Model Comparison Using LLM-as-a-Judge (GPT-4)

In this step:
- We automatically compare pairs of fine-tuned models by using **GPT-4** as a **judge**.
- For each pair of models, and for each evaluation prompt:
  - We present both generated SOPs side-by-side.
  - GPT-4 evaluates based on:
    - Structure and organization of the SOP
    - Relevance to the given prompt
    - Clarity and fluency of writing
    - Generalization ability (avoiding overfitting or memorization)
  - GPT-4 returns:
    - A short explanation for its choice.
    - A **winner**: Model A, Model B, or Tie.
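
Because the verdict uses the positional labels "Model A" and "Model B", it has to be mapped back to the two models that were actually compared. The sketch below shows one way to do that while keeping a missing verdict distinct from a genuine tie. The function name `parse_winner` is hypothetical, and the tolerance for spellings such as `ModelA` is an assumption about how a judge might deviate from the requested format.

```python
import re


def parse_winner(judgment, model1, model2):
    """Map the judge's 'Winner: ...' line to a model name, 'Tie', or None if absent."""
    match = re.search(r"Winner:\s*(Model\s*A|Model\s*B|Tie)\b", judgment, re.IGNORECASE)
    if match is None:
        return None  # no recognizable verdict; the caller decides how to handle it
    verdict = re.sub(r"\s+", "", match.group(1)).lower()
    return {"modela": model1, "modelb": model2, "tie": "Tie"}[verdict]


print(parse_winner("The second SOP is clearer.\nWinner: Model B", "ModelA", "ModelC"))
```

Returning `None` for an unparseable reply makes formatting failures visible instead of counting them as ties.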

### Detailed Workflow:
1. **Prepare prompts** and **outputs** for comparison.
2. **Define a judging function (`judge_models`)**:
   - Sends a structured evaluation request to GPT-4.
   - Receives the evaluation explanation and decision.
3. **Loop through all model pairs and prompts**:
   - Compare ModelA vs ModelB, ModelA vs ModelC, ModelB vs ModelC.
   - Judge each matchup on each prompt (In-domain and Out-of-domain).
4. **Record Results**:
   - Save prompt, models compared, winner, and GPT-4's explanation into a structured list.
5. **Save Results to CSV**:
   - Save a full results table into `/content/model_comparison_results.csv`.
6. **Optional**: Create a **pivot table summary** for quick visualization of which model won more comparisons.

✅ **By the end of this step**:
- You have an automated evaluation of your models from a strong external judge (GPT-4).
- You can directly see which model performs better across different prompt types.
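
A per-model win count is often easier to read than the pairwise table. The following is a small sketch of such a tally over result rows shaped like the ones recorded in this step. The rows and the model names `X`, `Y`, and `Z` are placeholders for illustration only, not results from this notebook, and `tally_wins` is a hypothetical helper.

```python
from collections import Counter


def tally_wins(results):
    """Count how many pairwise comparisons each model won (ties are counted under 'Tie')."""
    return Counter(row["Winner"] for row in results)


# Placeholder rows for illustration only; these are not real judging results.
example_rows = [
    {"Prompt": "p1", "Model 1": "X", "Model 2": "Y", "Winner": "X"},
    {"Prompt": "p1", "Model 1": "X", "Model 2": "Z", "Winner": "Tie"},
    {"Prompt": "p2", "Model 1": "X", "Model 2": "Y", "Winner": "X"},
    {"Prompt": "p2", "Model 1": "Y", "Model 2": "Z", "Winner": "Z"},
]
wins = tally_wins(example_rows)
for name, count in wins.most_common():
    print(f"{name}: {count}")
```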


In [None]:
# 🧠 LLM-as-Judge Function
from openai import OpenAI

# Instantiate the OpenAI client
client = OpenAI(api_key=api_key)

def judge_models(prompt, output_a, output_b, model_names=("Model A", "Model B")):
    system_prompt = """You are an academic admissions officer.
You must decide which SOP (Statement of Purpose) is better based on:
- Structure (organized SOP style, proper start and end)
- Relevance to the prompt
- Clarity and fluency of writing
- Generalization ability (not just memorized data)

Reply in this format:
- Short explanation
- Winner: Model A / Model B / Tie
"""

    user_prompt = f"""### Prompt:
{prompt}

### SOP from {model_names[0]}:
{output_a}

### SOP from {model_names[1]}:
{output_b}

Which SOP is better?"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )

    return response.choices[0].message.content

# 🚀 Model Names (your internal names)
model_internal_names = {
    "ModelA": "mistral-manasa-lora",
    "ModelB": "mistral-v02-sop-explainer-lora",
    "ModelC": "mistral_sop_finetuned"
}

# 📄 Prompts you used
prompt_labels = ["In-domain", "Out-of-domain 1", "Out-of-domain 2"]

# 🔥 All Model Pairs to Compare
model_pairs = [("ModelA", "ModelB"), ("ModelA", "ModelC"), ("ModelB", "ModelC")]

# 📦 Collect results
results = []

for prompt_label in prompt_labels:
    prompt = prompts[prompt_label]

    for model1, model2 in model_pairs:
        print(f"🧪 Judging {model1} vs {model2} on '{prompt_label}'...")
        output1 = all_outputs[model1][prompt_label]
        output2 = all_outputs[model2][prompt_label]

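        # The system prompt asks for "Winner: Model A / Model B / Tie", so keep the default
        # positional labels ("Model A" = model1, "Model B" = model2) instead of internal names.
        judgment = judge_models(prompt, output1, output2)
        print(judgment)

        # Parse winner (also tolerates spellings such as "ModelA" or "model a")
        verdict = re.search(r"Winner:\s*(Model\s*A|Model\s*B|Tie)", judgment, re.IGNORECASE)
        label = re.sub(r"\s+", "", verdict.group(1)).lower() if verdict else ""
        if label == "modela":
            winner = model1
        elif label == "modelb":
            winner = model2
        else:
            winner = "Tie"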

        results.append({
            "Prompt": prompt_label,
            "Model 1": model1,
            "Model 2": model2,
            "Winner": winner,
            "Explanation": judgment
        })

# 📝 Save results
import pandas as pd

results_df = pd.DataFrame(results)
results_df.to_csv("/content/model_comparison_results.csv", index=False)

print("\n✅ Judging Complete! Results saved to '/content/model_comparison_results.csv'.")

# 📊 Optional: View mini-summary (each Prompt / model-pair cell holds exactly one winner)
summary_table = results_df.pivot_table(index="Prompt", columns=["Model 1", "Model 2"], values="Winner", aggfunc="first")
summary_table


🧪 Judging ModelA vs ModelB on 'In-domain'...
Both SOPs are well-structured, relevant to the prompt, and demonstrate a clear understanding of the field of computer science. However, there are some differences that set them apart.

Model A's SOP is more detailed and provides a comprehensive overview of the candidate's academic and professional journey. It also demonstrates a strong passion for technology and its impact on society. The candidate's achievements and experiences are well-articulated, and the SOP ends with a clear statement of the candidate's goals and aspirations.

Model B's SOP, on the other hand, is also well-structured and relevant, but it lacks the depth and detail of Model A's SOP. The candidate's academic and professional experiences are not as thoroughly explained, and the SOP ends abruptly without a clear conclusion.

In terms of clarity and fluency of writing, both SOPs are well-written and easy to understand. However, Model A's SOP is more engaging and compelling d

| Prompt | ModelA vs ModelB | ModelA vs ModelC | ModelB vs ModelC |
| --- | --- | --- | --- |
| In-domain | ModelA | Tie | Tie |
| Out-of-domain 1 | ModelB | Tie | Tie |
| Out-of-domain 2 | ModelB | Tie | ModelC |


In [None]:
# 📦 Install everything needed
!pip install -q rouge-score spacy nltk bert-score
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# 📚 Evaluation of SOP Outputs using ROUGE, BERTScore, and Structural Metrics

In this step:
- We **evaluate the quality** of generated SOPs using automatic metrics and structural features.

### What this code does:

### 1. Set Up Environment
- Import libraries: `spaCy`, `nltk`, `bert_score`, `rouge_scorer`.
- Load the English (`en_core_web_sm`) language model for token and sentence analysis.
- Download NLTK's sentence tokenizer data.

### 2. Parse Generated Outputs
- Define `extract_prompt_blocks()`:
  - Parse saved model output `.txt` files.
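  - Extract each **Prompt**, **Prompt Label**, and **Generated SOP** with a regex. The SOP is taken from the text after the `[SOP]` marker; outputs without that marker are recorded as `[MISSING SOP]`.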

### 3. Evaluate Each SOP
- For each SOP:
  - **Token count** (number of spaCy tokens, which includes punctuation).
  - **Sentence count** (using spaCy).
  - **Verb count** (number of action words, for liveliness of writing).
  - **ROUGE-1** and **ROUGE-L** scores:
    - Measures overlap between the **prompt** and the **generated SOP**.
- Store all metrics in a structured record.
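
To make the ROUGE numbers easier to interpret, here is a simplified, standard-library sketch of the ROUGE-1 and ROUGE-L F-measures. It is not the `rouge_score` implementation: it splits on whitespace and skips stemming, so its values will differ slightly from the library's. Because the reference here is a short prompt and the candidate is a long SOP, precision is necessarily small, which keeps the F-measure low even for a relevant SOP.

```python
from collections import Counter


def _f_measure(matches, cand_len, ref_len):
    """Harmonic mean of precision (matches / cand_len) and recall (matches / ref_len)."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f(reference, candidate):
    """Unigram-overlap F-measure (simplified: lowercase whitespace tokens, no stemming)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return _f_measure(overlap, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(reference, candidate):
    """Longest-common-subsequence F-measure over the same simplified tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return _f_measure(lcs_length(ref, cand), len(cand), len(ref))


score_1 = rouge1_f("write an sop for space law", "space law is my passion")
score_l = rouge_l_f("space law", "law space")
print(round(score_1, 4), round(score_l, 4))
```

The second call shows the difference between the two metrics: swapping word order leaves ROUGE-1 unchanged but lowers ROUGE-L, because the common subsequence must preserve order.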

### 4. Evaluate Semantic Similarity (BERTScore)
- Use **BERTScore** to measure semantic similarity between:
  - The SOP (candidate).
  - The prompt (reference).
- Calculate precision, recall, and F1, and save BERTScore-F1 for each SOP.
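
The precision, recall, and F1 above come from greedy matching: each candidate token is paired with its most similar reference token (precision), and each reference token with its most similar candidate token (recall). The sketch below shows that computation on toy 2-dimensional vectors. It is an illustration only: the real `bert_score` package uses contextual embeddings from a pretrained model and offers options such as IDF weighting that are omitted here, and `greedy_match_scores` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors (toy version)."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy vectors: two candidate "tokens", one reference "token".
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0]]
p, r, f1 = greedy_match_scores(cand, ref)
print(round(p, 4), round(r, 4), round(f1, 4))
```

In this toy case the single reference token is fully covered (recall 1.0), while only one of the two candidate tokens has a match (precision 0.5).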

### 5. Save Results
- Save the final evaluation dataframe as a `.csv` file (one row per SOP) for easy review and analysis.

✅ **By the end of this step**:
- You have a detailed table showing:
  - SOP length statistics (tokens, sentences, verbs).
  - ROUGE overlap with prompts.
  - Semantic similarity to the prompt (BERTScore F1).
- File is saved automatically for reporting or deeper analysis.

In [None]:
# 📚 Imports
import os
import re
import pandas as pd
from tqdm import tqdm
from rouge_score import rouge_scorer
import spacy
import nltk
import bert_score

nltk.download("punkt_tab")
from nltk.tokenize import sent_tokenize

nlp = spacy.load("en_core_web_sm")

# 📄 Extract prompt blocks using robust regex
def extract_prompt_blocks(text):
    pattern = re.compile(
        r"### Prompt \((.*?)\):\n(.*?)\n\n### Generated SOP:\n(.*?)(?=(?:### Prompt|\Z))",
        re.DOTALL
    )
    matches = pattern.findall(text)

    data = []
    for prompt_label, prompt_text, generated_sop in matches:
        # Extract SOP only
        sop_match = re.search(r'\[SOP\](.*)', generated_sop, re.DOTALL)
        sop_text = sop_match.group(1).strip() if sop_match else "[MISSING SOP]"
        data.append({
            "PromptLabel": prompt_label.strip(),
            "Prompt": prompt_text.strip(),
            "SOP": sop_text.split("### Prompt")[0].split("===")[0].strip()
        })
    return data

# 🔍 Evaluation Function
def evaluate_sop_outputs_with_bertscore(txt_path):
    with open(txt_path, "r", encoding="utf-8") as f:
        content = f.read()

    blocks = extract_prompt_blocks(content)
    records = []
    prompts = []
    sops = []

    # Build the scorer once and reuse it for every block
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

    for block in tqdm(blocks):
        prompt_label = block["PromptLabel"]
        prompt_text = block["Prompt"]
        sop = block["SOP"]

        doc = nlp(sop)
        num_tokens = len(doc)
        num_sentences = len(list(doc.sents))
        num_verbs = len([token for token in doc if token.pos_ == "VERB"])

        # score(target, prediction): the prompt is the reference, the SOP is the candidate
        rouge = scorer.score(prompt_text, sop)

        prompts.append(prompt_text)
        sops.append(sop)

        records.append({
            "PromptLabel": prompt_label,
            "SOP_Tokens": num_tokens,
            "SOP_Sentences": num_sentences,
            "SOP_Verbs": num_verbs,
            "ROUGE-1": round(rouge['rouge1'].fmeasure, 4),
            "ROUGE-L": round(rouge['rougeL'].fmeasure, 4),
        })

    print("🧠 Calculating BERTScore (this may take ~1 min)...")
    P, R, F1 = bert_score.score(cands=sops, refs=prompts, lang="en", verbose=True)
    for i, f1 in enumerate(F1):
        records[i]["BERTScore-F1"] = round(float(f1), 4)

    return pd.DataFrame(records)

# 📁 Path to your output file
txt_file_path = "/content/mistral-v02-sop-explainer-lora_outputs.txt"

# ✅ Run Evaluation
df_eval = evaluate_sop_outputs_with_bertscore(txt_file_path)

# 💾 Save as CSV
csv_path = txt_file_path.replace(".txt", "_SOP_eval_with_BERT.csv")
df_eval.to_csv(csv_path, index=False)
print(f"✅ Evaluation saved to: {csv_path}")

# 🖼️ Show first few rows
df_eval.head()

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
100%|██████████| 6/6 [00:00<00:00,  7.09it/s]


🧠 Calculating BERTScore (this may take ~1 min)...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 0.33 seconds, 18.29 sentences/sec
✅ Evaluation saved to: /content/mistral-v02-sop-explainer-lora_outputs_SOP_eval_with_BERT.csv


| | PromptLabel | SOP_Tokens | SOP_Sentences | SOP_Verbs | ROUGE-1 | ROUGE-L | BERTScore-F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | In-domain 1 | 578 | 27 | 67 | 0.0468 | 0.0351 | 0.811 |
| 1 | Out-of-domain 1 | 574 | 28 | 79 | 0.0304 | 0.0304 | 0.815 |
| 2 | Out-of-domain 2 | 598 | 26 | 79 | 0.0371 | 0.026 | 0.8122 |
| 3 | In-domain 2 | 1026 | 52 | 136 | 0.0402 | 0.0254 | 0.806 |
| 4 | Out-of-domain 3 | 844 | 39 | 147 | 0.0404 | 0.0278 | 0.8001 |


# 📂 Split and Save Individual SOPs from Combined Output File

In this step:
- We **split a combined model output text file** into **separate files** — one file per generated SOP.

### What this code does:

### 1. Setup Configuration
- Define the input `.txt` file containing all SOP generations.
- Define the output directory (`split_sops`) where individual SOP files will be saved.

### 2. Load and Split the File
- Read the entire content of the input file.
- Split the file into blocks using the separator line (`=` repeated 80 times).

### 3. Extract and Save Each SOP
- For each block:
  - Extract the **Prompt Label** (e.g., "In-domain", "Out-of-domain 1") using regex.
  - Extract the **SOP text** after the `[SOP]` marker (blocks without the marker are skipped).
  - Clean the SOP text if needed (remove `[END]` markers).
  - Save each SOP into its own `.txt` file:
    - Files are named based on their order and prompt label (e.g., `01_In-domain.txt`).

✅ **By the end of this step**:
- Each generated SOP is saved into a clean, individual text file.
- Files are organized in the `split_sops` folder for easy manual review, annotation, or downstream evaluation.
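
The splitting logic can be tried on an in-memory string before pointing it at a real file. The sketch below returns `(filename, sop_text)` pairs instead of writing files, and it shows that a block without the `[SOP]` marker produces no file. The function name `split_sops` and the sample content are hypothetical.

```python
import re


def split_sops(content, separator="=" * 80):
    """Return (filename, sop_text) pairs for blocks that contain a [SOP] marker."""
    files = []
    for i, block in enumerate(content.split(separator), start=1):
        sop_match = re.search(r"\[SOP\](.*)", block, re.DOTALL)
        if not sop_match:
            continue  # empty block or no [SOP] marker: nothing to save
        label_match = re.search(r"### Prompt \((.*?)\):", block)
        label = label_match.group(1).strip().replace(" ", "_") if label_match else f"SOP_{i}"
        sop_text = sop_match.group(1).replace("[END]", "").strip()
        files.append((f"{i:02d}_{label}.txt", sop_text))
    return files


# Hypothetical sample: the first block has a [SOP] marker, the second does not.
sample = (
    "### Prompt (In-domain):\nP1\n\n### Generated SOP:\n[SOP] Text one. [END]\n\n"
    + "=" * 80
    + "\n\n### Prompt (Out-of-domain 1):\nP2\n\n### Generated SOP:\nNo marker here.\n\n"
    + "=" * 80
    + "\n\n"
)
files = split_sops(sample)
print(files)
```

Since file numbers come from the block position, skipped blocks leave gaps in the numbering rather than shifting later files.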

In [None]:
import os
import re

# === CONFIGURATION ===
input_file = "mistral-v02-sop-explainer-lora_outputs.txt"  # Your full .txt file
output_dir = "split_sops"  # Folder where individual SOPs will be saved

# === CREATE OUTPUT FOLDER ===
os.makedirs(output_dir, exist_ok=True)

# === LOAD INPUT FILE ===
with open(input_file, "r", encoding="utf-8") as f:
    content = f.read()

# === SPLIT INTO BLOCKS ===
blocks = content.split("=" * 80)

# === EXTRACT SOPs & SAVE EACH ONE ===
for i, block in enumerate(blocks):
    if not block.strip():
        continue

    # Extract prompt label (e.g., In-domain, Out-of-domain 1)
    match = re.search(r"### Prompt \((.*?)\):", block)
    prompt_label = match.group(1).strip().replace(" ", "_") if match else f"SOP_{i+1}"

    # Extract SOP text after '[SOP]' marker
    sop_match = re.search(r"\[SOP\](.*)", block, re.DOTALL)
    if sop_match:
        sop_text = sop_match.group(1).strip()
        sop_text = sop_text.replace("[END]", "").strip()
    else:
        continue  # Skip if SOP block not found

    # Write to file
    filename = f"{i+1:02d}_{prompt_label}.txt"
    filepath = os.path.join(output_dir, filename)
    with open(filepath, "w", encoding="utf-8") as out_file:
        out_file.write(sop_text)

print(f"✅ {len(os.listdir(output_dir))} SOPs saved to: {output_dir}")


✅ 6 SOPs saved to: split_sops
