# 📚 Project Overview

This project fine-tunes a large language model (Mistral-7B-Instruct-v0.2) to generate high-quality Statements of Purpose (SOPs) in a specific author's writing style.

In this notebook, we cover the **complete workflow**:

### 1. Data Preparation
- **Load the dataset** (`manasa_sop_formatted_updated.jsonl`), which contains SOPs formatted with instruction-style prompts, each followed by a reasoning explanation.
- **Remove reasoning sections** and save the SOP-only result as `manasa_cleaned_file.jsonl` for focused fine-tuning.

### 2. Fine-Tuning Mistral-7B with LoRA (Low-Rank Adaptation)
- **Quantize** the base model to 4-bit precision for memory-efficient training on an A100 GPU.
- **Apply LoRA adapters** to fine-tune the model efficiently without updating all model weights.
- **Fine-tune** the model on the custom SOP instruction dataset for 2 epochs.
- **Log training metrics** and checkpoint intermediate model states.
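
To see why 4-bit loading matters, a rough estimate of weight memory can help. The sketch below takes the parameter counts from the training log in this notebook and ignores quantization constants, layers that stay in 16-bit (such as embeddings), activations, and optimizer state, so the figures are order-of-magnitude only. The helper name `weight_memory_gb` is hypothetical.

```python
# Rough weight-memory estimate. Assumption: every base weight is stored at the
# stated bit width; real usage is higher because some layers stay in 16-bit.
BASE_PARAMS = 7_283_675_136 - 41_943_040  # total params minus LoRA params (training log)

def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(BASE_PARAMS, 16)
nf4_gb = weight_memory_gb(BASE_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gb:.2f} GB")
print(f"4-bit weights:  ~{nf4_gb:.2f} GB")
```

Under these assumptions the quantized weights take roughly a quarter of the 16-bit footprint, which leaves more GPU memory for activations at the 2048-token sequence length.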

### 3. Model Inference and Evaluation
- **Run inference** using the fine-tuned model on new prompts (e.g., SOPs for different programs).
- **Verify** that the model generates complete, coherent SOPs aligned with the target writing style.

✅ **End Result**:
- A fine-tuned model saved to `/content/drive/MyDrive/mistral_sop_finetuned` that can generate SOPs based on new instructions.
- Instructions provided for **how to load and use** the fine-tuned model after training.

🔵 **Big Picture**:
This fine-tuned SOP generation model lays the foundation for a larger project where we aim to also generate **stylistic reasoning explanations** alongside SOPs (in Model 2).

In [None]:
!pip install transformers datasets trl peft accelerate bitsandbytes


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from huggingface_hub import login
login()


### 🧹 Remove [REASONING] Sections from Dataset

In this step:
- We load the formatted dataset (`manasa_sop_formatted_updated.jsonl`) where each entry contains both an SOP and a reasoning explanation.
- We use a regular expression (`regex`) to **remove everything between `[REASONING]` and `[/REASONING]`** in each sample.
- This leaves only the clean SOP content in the `"text"` field.
- The cleaned dataset is saved as `/content/drive/MyDrive/manasa_cleaned_file.jsonl`.

✅ This prepares a pure SOP-only dataset, useful if we want to fine-tune a model solely on SOP generation without explanations.
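
As a small illustration of the removal step, the sketch below applies a non-greedy pattern with `re.DOTALL` to an in-memory string. The sample record and the helper name `strip_reasoning` are invented for illustration; the real dataset entries may be laid out differently.

```python
import re

# re.DOTALL lets "." match newlines, so multi-line reasoning blocks are removed too.
REASONING_BLOCK = re.compile(r"\[REASONING\].*?\[/REASONING\]", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove every [REASONING]...[/REASONING] block and trim whitespace."""
    return REASONING_BLOCK.sub("", text).strip()

# Hypothetical sample record, not taken from the dataset.
sample = (
    "<s>[INST] Write me an SOP. [/INST]\n"
    "[SOP]\nMy interest in data began early.\n[/SOP]\n"
    "[REASONING]\nThe opening is personal.\nIt sets the tone.\n[/REASONING]"
)
cleaned = strip_reasoning(sample)
print(cleaned)
```

The non-greedy `.*?` matters when a record holds more than one reasoning block: a greedy match would also delete any SOP text sitting between the first opening tag and the last closing tag.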

In [None]:
import json
import re

input_path = "/content/drive/MyDrive/manasa_sop_formatted_updated.jsonl"
output_path = "/content/drive/MyDrive/manasa_cleaned_file.jsonl"

def remove_reasoning(text):
    # Remove everything between [REASONING] and [/REASONING], including newlines
    # (re.DOTALL makes "." match line breaks).
    return re.sub(r"\[REASONING\].*?\[/REASONING\]", "", text, flags=re.DOTALL).strip()

with open(input_path, 'r', encoding='utf-8') as infile, open(output_path, 'w', encoding='utf-8') as outfile:
    for line in infile:
        if not line.strip():
            continue  # Skip blank lines instead of failing in json.loads
        sample = json.loads(line)
        sample["text"] = remove_reasoning(sample["text"])
        json.dump(sample, outfile)
        outfile.write("\n")

print("✅ All [REASONING] sections removed and cleaned data saved.")


✅ All [REASONING] sections removed and cleaned data saved.


# 📚 Fine-Tuning Mistral-7B-Instruct-v0.2 for SOP Generation — Full Pipeline

This code fine-tunes the Mistral-7B-Instruct-v0.2 model to generate Statements of Purpose (SOPs) in a specific writing style using custom instruction-formatted data.

### Main steps:

1. **Install Required Libraries**
   - Install Hugging Face Transformers, Datasets, PEFT (for LoRA), Bitsandbytes (for 4-bit quantization), and W&B (optional logging).

2. **Set Up Environment**
   - Mount Google Drive to access your dataset and save outputs.
   - Define file paths for loading data and saving models.

3. **Load and Prepare Dataset**
   - Load the SOP dataset from a `.jsonl` file.
   - Convert it into a Hugging Face `Dataset` object.
   - Split the data into 90% training and 10% validation.

4. **Load Tokenizer**
   - Load the tokenizer from the Mistral-7B-Instruct model.
   - Set padding token to the EOS token and prepare for right-side padding.

5. **Tokenize the Dataset**
   - Tokenize the SOP text examples.
   - Set a maximum length (2048 tokens) to fit long SOPs.

6. **Load the Model with 4-bit Quantization**
   - Load Mistral-7B in 4-bit NF4 quantized format for efficient fine-tuning on limited GPU memory (A100).
   - Automatically map model layers to GPU using `device_map="auto"`.

7. **Apply LoRA (Parameter-Efficient Fine-Tuning)**
   - Configure a LoRA setup targeting specific projection layers (`q_proj`, `k_proj`, etc.).
   - Inject LoRA adapters into the model to fine-tune a small number of parameters efficiently.

8. **Configure Training**
   - Define training hyperparameters using `TrainingArguments`, including:
     - Batch sizes, learning rate, gradient accumulation, save steps, evaluation steps, mixed precision (fp16), etc.
   - Set W&B as the logging platform if needed.

9. **Initialize Trainer**
   - Wrap the model, training arguments, datasets, and data collator using Hugging Face `Trainer`.

10. **Fine-Tune the Model**
    - Start the fine-tuning process with `trainer.train()`.
    - Save model checkpoints during training and the final model after training.

11. **Inference and Testing**
    - Define a `generate_sop` function to test the fine-tuned model.
    - Provide a prompt and generate a complete SOP.
    - Print an example SOP generated for a Data Science Master's degree at MIT.

12. **Post-Training Usage Instructions**
    - Simple steps for reloading and using the fine-tuned model later.
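
To make step 7 concrete, the number of parameters LoRA adds can be worked out by hand. The sketch below assumes the published Mistral-7B dimensions (hidden size 4096, MLP width 14336, 32 layers, 8 key/value heads of size 128) and the rank `r=16` used in this notebook; the helper name `lora_params` is hypothetical.

```python
# Mistral-7B dimensions (assumed from the model's published config).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
NUM_LAYERS = 32
RANK = 16              # LoRA r used in this notebook

# (in_features, out_features) of each targeted projection.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

The total agrees with the `trainable params` figure that PEFT prints during the run, which is a useful sanity check that all seven projection types were actually adapted.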

In [None]:
# Fine-tuning Mistral-7B-Instruct-v0.2 for SOP Generation
# This code assumes you're running in Google Colab with an A100 GPU

# Install required libraries
!pip install -q transformers datasets accelerate peft bitsandbytes wandb sentencepiece

import os
import torch
import json
import random
from google.colab import drive
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    Trainer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Mount Google Drive to access your data
drive.mount('/content/drive')

# Define paths
jsonl_path = '/content/drive/MyDrive/manasa_cleaned_file.jsonl'  # Update this to your JSONL file path
output_dir = '/content/drive/MyDrive/mistral_sop_finetuned'

# Load dataset from JSONL
def load_jsonl_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    return data
sop_data = load_jsonl_data(jsonl_path)


# Create HF Dataset
dataset = Dataset.from_list(sop_data)

# Split the dataset into training and validation
dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(eval_dataset)}")

# Print a sample for verification
print("\nSample formatted prompt:")
print(train_dataset[0]["text"])

# Load tokenizer and prepare for training
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Function to tokenize the dataset
def tokenize_function(examples):
    # Return plain lists; the data collator converts each batch to tensors later.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=2048,  # Adjust based on your SOP lengths
    )

# Tokenize datasets
tokenized_train = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

tokenized_val = eval_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

# Check if tokenization was successful
print(f"\nTokenized train dataset length: {len(tokenized_train)}")
print(f"Tokenized validation dataset length: {len(tokenized_val)}")

# Configure quantization for efficient fine-tuning on A100
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Prepare for PEFT/LoRA fine-tuning
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                     # Rank dimension
    lora_alpha=32,            # Alpha parameter for LoRA scaling
    lora_dropout=0.05,        # Dropout probability for LoRA layers
    bias="none",              # We're not training biases
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ]
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# print_trainable_parameters() prints its own summary and returns None,
# so call it directly instead of wrapping it in print().
model.print_trainable_parameters()

# Set up training arguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    num_train_epochs=2,          # Start with a small number of epochs to prevent overfitting
    per_device_train_batch_size=2,  # Adjust based on GPU memory
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_steps=50,
    logging_steps=10,
    fp16=True,
    push_to_hub=False,
    save_total_limit=3,
    report_to="wandb",  # Logs to Weights & Biases; set to "none" to disable
)

# Create data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We're not doing masked language modeling
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
)

# Start fine-tuning
print("Starting fine-tuning process...")
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model saved to {output_dir}")

# Function to test the model
def generate_sop(prompt, max_length=2048):
    input_text = f"<s>[INST] {prompt} [/INST]\n"
    inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

    # Switch to inference mode so LoRA dropout is disabled, then generate text
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.85,
            do_sample=True,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id
        )

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    # Extract just the response part
    response = generated_text.split("[/INST]")[1].strip()
    return response

# Test the model with a sample prompt
test_prompt = "Write me an SOP for pursuing a Master's Degree in Data Science at MIT."
print("\nTesting the fine-tuned model with a sample prompt:")
generated_sop = generate_sop(test_prompt)
print(generated_sop)

# Instructions for using the model after this session
print("\n==== How to use the fine-tuned model ====")
print("1. Load the saved model from your Google Drive")
print("2. Use the 'generate_sop' function with your desired prompt")
print("3. You can adjust temperature and top_p for more/less creativity")
print("4. For new domains not in training data, the model should generate SOPs in a similar style")

Training examples: 503
Validation examples: 56

Sample formatted prompt:
<s>[INST] Write me an SOP for pursuing a Master's Degree in Computer Science, focusing on Artificial Intelligence and Machine Learning. [/INST]
[SOP]
STATEMENT OF PURPOSE The power of knowing a system inside out, inclusive of its hardware and software functionalities, will not only revise and refurbish the entire purpose of its creation but also incredibly increase the scope of its implementation. With growing technology, advancement of science and learning, I developed an impeccable interest for computers, their languages and their highly appreciated scope of advancement. The world is constantly evolving and looking deeply through the lens of development, one can certainly say that computers have a role dedicated to lifting the universal veil of information exchange, therefore uplifting



Tokenized train dataset length: 503
Tokenized validation dataset length: 56


trainable params: 41,943,040 || all params: 7,283,675,136 || trainable%: 0.5758


Starting fine-tuning process...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 2.2609 | 2.452541 |
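
The training log shows a single evaluation at step 100, which is consistent with how short this run is. The arithmetic below is a sketch that assumes one GPU and the settings used in this notebook; the rounding rules (validation split rounded up, a partial final batch still counted) are assumptions about library behavior, not something the log states.

```python
import math

TOTAL_EXAMPLES = 503 + 56      # train + validation counts printed by the notebook
TEST_SIZE = 0.1
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
EPOCHS = 2
EVAL_STEPS = 100
WARMUP_STEPS = 50

# Assumption: the validation split is rounded up; the remainder is training data.
n_val = math.ceil(TOTAL_EXAMPLES * TEST_SIZE)
n_train = TOTAL_EXAMPLES - n_val

# Assumption: one GPU, so each optimizer step consumes batch * accumulation examples.
batches_per_epoch = math.ceil(n_train / PER_DEVICE_BATCH)
steps_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM)
total_steps = steps_per_epoch * EPOCHS
evals_during_training = total_steps // EVAL_STEPS

print(f"train/val examples: {n_train}/{n_val}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} in total")
print(f"evaluations at eval_steps={EVAL_STEPS}: {evals_during_training}")
print(f"warmup share of training: {WARMUP_STEPS / total_steps:.0%}")
```

With so few optimizer steps, `eval_steps=100` and `save_steps=100` produce one evaluation and one intermediate checkpoint, and the 50 warmup steps cover a large share of training. Lowering both intervals would give a more detailed loss curve.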




Model saved to /content/drive/MyDrive/mistral_sop_finetuned

Testing the fine-tuned model with a sample prompt:




[SOP]
STATEMENT OF PURPOSE "I believe, the most powerful thing one can possess is the ability to predict, to imagine, to visualise and to foresee." Hailing from a small town, I have seen my father run a grocery store and my mother work as a school teacher. My parents always supported me to do what I wanted to do and never pressured me into choosing a specific stream of study. I was a very good student at school, always topped the class, was the topper of the school in 10th grade, and was very active in extracurricular activities. I participated in many state and districtlevel competitions and won several medals. I was also a very good athlete, I won many gold medals in athletics. I was also the school caption for three consecutive years. I was a very good student, always had a positive attitude towards life. I always believed that there is always more to learn and explore, therefore I never settled down with what I had. I was a very good student, always topped the class, was the topper

### 🧪 Test the Fine-Tuned Model with a New Prompt

In this step:
- We test the fine-tuned Mistral-7B model by providing a new instruction prompt:
  - "Write me an SOP for pursuing a Master's Degree in Political Science from Duke University."
- The `generate_sop` function:
  - Formats the prompt using `[INST]...[/INST]` tags.
  - Tokenizes the input and moves it to the GPU.
  - Generates a response using the model with sampling settings (`temperature=0.7`, `top_p=0.85`).
  - Extracts and prints the generated SOP text.

✅ This helps verify that the model can now generate full SOPs for unseen prompts in a style consistent with the training data.
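
The prompt formatting and response extraction described above can be tried without loading the model. This is a minimal sketch: `build_prompt` and `extract_response` are hypothetical helper names, and the decoded string is a stand-in for real model output.

```python
INST_OPEN, INST_CLOSE = "[INST]", "[/INST]"

def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the tags used by the training examples."""
    return f"<s>{INST_OPEN} {instruction.strip()} {INST_CLOSE}\n"

def extract_response(generated_text: str) -> str:
    """Return the text after the first [/INST] tag, without a trailing </s>."""
    _, found, response = generated_text.partition(INST_CLOSE)
    if not found:
        # No closing tag: fall back to the whole text rather than raising IndexError.
        response = generated_text
    return response.replace("</s>", "").strip()

prompt = build_prompt(
    "Write me an SOP for pursuing a Master's Degree in Political Science from Duke University."
)
# Hypothetical decoded output, standing in for a real generation.
decoded = prompt + "[SOP]\nSTATEMENT OF PURPOSE ...</s>"
print(extract_response(decoded))
```

Using `str.partition` instead of `split(...)[1]` avoids an `IndexError` when the decoded text happens not to contain the closing tag.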

In [None]:
# Test the model with a sample prompt
test_prompt = "Write me an SOP for pursuing a Master's Degree in Political Science from Duke University."
print("\nTesting the fine-tuned model with a sample prompt:")
generated_sop = generate_sop(test_prompt)
print(generated_sop)


Testing the fine-tuned model with a sample prompt:
[SOP]
STATEMENT OF PURPOSE "The world today is a global village, interconnected by various factors that influence one another. The power of politics is one such factor that influences the growth of the world. The world is intertwined by politics and the progression of technology. The two have always influenced one another, where politics influence technology and technology, in turn, has helped politics to grow. As a young girl, I enjoyed the concept of politics and the role it played in shaping the world. The power of politics was always evident to me. It was everywhere, from the small scale to the big scale, and I was intrigued by its influence on the world. I enjoyed participating in political discussions and debates. I enjoyed reading books, watching films and listening to music that had political undertones. I was interested in understanding the political landscape of the world. I was also interested in understanding the role of w

In [None]:
!pip install rouge_score



# 📊 Model Evaluation Pipeline for Fine-Tuned SOP Generator

This block evaluates the fine-tuned Mistral-7B-Instruct model for SOP generation.

### What this script does:

1. **Load Fine-Tuned Model**
   - Load the model and tokenizer from the specified saved directory.
   - Load in 4-bit quantized mode to save GPU memory.

2. **SOP Generation**
   - Given a test prompt, generate an SOP using the fine-tuned model.
   - Automatically format the prompt according to the `[INST] ... [/INST]` format expected by Mistral.

3. **Completeness Checking**
   - Check if the generated SOP is **complete**:
     - Ends with proper punctuation (`.`, `!`, `?`).
     - Doesn't end mid-sentence or with dangling conjunctions (e.g., "and", "but").

4. **Optional ROUGE Score Calculation**
   - If reference SOPs are available, calculate **ROUGE-1**, **ROUGE-2**, and **ROUGE-L** scores to measure overlap between generated and reference SOPs.

5. **Prompt Sampling**
   - Two modes for testing:
     - **Held-out prompts**: Use fresh prompts not seen during training (e.g., "Write me an SOP for pursuing a Master's Degree in Machine Learning.").
     - **Training data prompts**: Sample prompts and reference SOPs from the training dataset for evaluation.

6. **Evaluation Metrics**
   - For each generated SOP, log:
     - The prompt
     - The generated SOP
     - Whether it is complete (boolean)
     - ROUGE scores (if applicable)
   - Calculate and report:
     - Average completeness score across samples
     - Average ROUGE scores (if references are available)

7. **Save Results**
   - Save detailed results and summary metrics into a `.json` file (`evaluation_results.json`).

8. **Notebook and CLI Compatibility**
   - If running inside a notebook, simulate command-line arguments.
   - If running from the command line (`python evaluate.py`), accept `--model_path`, `--dataset`, etc. as arguments.
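
The completeness check in step 3 can be illustrated with a dependency-free heuristic. This is a simplified sketch, not the evaluation script itself: it skips NLTK sentence splitting and only looks at the end of the text, and `looks_complete` is a hypothetical name.

```python
import re

# Words that suggest the text stopped mid-thought (same list as the evaluation script).
DANGLING_WORDS = {
    "and", "or", "but", "however", "therefore", "thus",
    "moreover", "furthermore", "consequently", "since", "has", "a",
}

def looks_complete(text: str) -> bool:
    """Heuristic: text ends with ., ! or ? and the final word is not a dangling one."""
    text = text.strip()
    # Require terminal punctuation, optionally followed by closing quotes.
    if not re.search(r'[.!?][\s"\']*$', text):
        return False
    # Compare the last whole word, not a raw suffix, so "data" is not mistaken for "a".
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return bool(words) and words[-1] not in DANGLING_WORDS

for candidate in [
    "I look forward to joining the program.",
    "I was a very good student, always topped the class, was the topper",
    "I studied statistics and.",
]:
    print(looks_complete(candidate), "|", candidate)
```

A generation that hits the token limit mid-sentence, like the sample outputs earlier in this notebook, fails the punctuation test and is counted as incomplete.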

---

✅ **By the end of this evaluation script**:
- You will have a detailed report showing how well your fine-tuned model generates SOPs:
  - Are the SOPs complete and polished?
  - How well do they match ground-truth SOPs (if available)?
  - Metrics like Completeness % and ROUGE scores are computed.
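
For intuition about what the ROUGE numbers measure, ROUGE-1 can be computed by hand as clipped unigram overlap. The sketch below is a simplified stand-in that assumes lowercase whitespace tokens and no stemming, so its values will not exactly match the `rouge_score` package; the function name `rouge1` is hypothetical.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F-measure using whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    fmeasure = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

scores = rouge1("i want to study data science", "i hope to study data science")
print(scores)
```

Five of the six tokens match in this example, so precision, recall, and F-measure all equal 5/6. High overlap says the wording is similar to the reference; it does not by itself show that the generated SOP is well written.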

In [None]:
import torch
import json
import re
import argparse
import numpy as np
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from rouge_score import rouge_scorer
import random
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')

def load_model_and_tokenizer(model_path):
    """Load the fine-tuned model and tokenizer"""
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Load model in 4-bit to save memory. Passing a BitsAndBytesConfig replaces
    # the deprecated `load_in_4bit=True` keyword argument.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto"
    )

    return model, tokenizer

def generate_sop(model, tokenizer, prompt, max_length=2048):
    """Generate an SOP using the loaded model"""
    # Format prompt according to Mistral's chat template
    formatted_prompt = f"<s>[INST] {prompt} [/INST]"

    # Create generation pipeline
    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_length,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.15,
        do_sample=True
    )

    # Generate text
    generated = generator(formatted_prompt)[0]['generated_text']

    # Extract SOP part
    output = generated.split("[/INST]")[-1].strip()

    # Try to extract just the SOP content if it uses the [SOP] tags
    sop_match = re.search(r'\[SOP\](.*?)(?:\[\/SOP\]|$)', output, re.DOTALL)
    if sop_match:
        output = sop_match.group(1).strip()

    return output

def check_completeness(sop):
    """Check if an SOP is complete (doesn't end mid-sentence)"""
    # Clean up the text
    sop = sop.strip()

    # Check last sentence
    sentences = sent_tokenize(sop)
    if not sentences:
        return False

    last_sentence = sentences[-1]

    # Check for proper punctuation at the end
    if not re.search(r'[.!?][\s"\']*$', last_sentence):
        return False

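    # Check for dangling words that suggest incomplete thought.
    # Compare the final word itself: a suffix test never fires once the sentence
    # ends in punctuation, and "a" as a suffix would match words such as "data".
    incomplete_endings = {
        "and", "or", "but", "however", "therefore", "thus",
        "moreover", "furthermore", "consequently", "since", "has", "a"
    }

    words = re.findall(r"[a-z]+(?:'[a-z]+)?", last_sentence.lower())
    if words and words[-1] in incomplete_endings:
        return False

    return True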

def evaluate_model(model_path, test_prompts, reference_sops=None):
    """Evaluate the model on test prompts"""
    model, tokenizer = load_model_and_tokenizer(model_path)

    results = []
    completeness_scores = []

    # Set up ROUGE scorer
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    # Process each test prompt
    for i, prompt in enumerate(tqdm(test_prompts, desc="Generating SOPs")):
        # Generate SOP
        generated_sop = generate_sop(model, tokenizer, prompt)

        # Check completeness
        is_complete = check_completeness(generated_sop)
        completeness_scores.append(int(is_complete))

        result = {
            "prompt": prompt,
            "generated_sop": generated_sop,
            "is_complete": is_complete
        }

        # Calculate ROUGE scores if we have references
        if reference_sops and i < len(reference_sops):
            reference = reference_sops[i]
            rouge_scores = scorer.score(reference, generated_sop)
            result["rouge1"] = rouge_scores["rouge1"].fmeasure
            result["rouge2"] = rouge_scores["rouge2"].fmeasure
            result["rougeL"] = rouge_scores["rougeL"].fmeasure

        results.append(result)

    # Calculate overall completeness score
    overall_completeness = sum(completeness_scores) / len(completeness_scores) if completeness_scores else 0

    return results, overall_completeness

def sample_test_prompts(dataset_path, num_samples=10, held_out=True):
    """Sample test prompts from the dataset"""
    # Load dataset
    dataset = []
    with open(dataset_path, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                item = json.loads(line.strip())
                dataset.append(item)
            except json.JSONDecodeError:
                continue

    if held_out:
        # Create completely new prompts not in the dataset
        courses = [
            "Machine Learning", "Business Analytics", "Digital Marketing",
            "Public Health", "Quantum Computing", "Cybersecurity",
            "Artificial Intelligence", "Urban Planning", "Robotics",
            "Applied Mathematics", "Film Production", "Game Design",
            "Environmental Law", "Political Science", "International Relations"
        ]

        # Create prompts
        return [f"Write me an SOP for pursuing a Master's Degree in {course}." for course in
                random.sample(courses, min(num_samples, len(courses)))]
    else:
        # Extract existing prompts from the dataset
        prompts = []
        references = []

        for item in dataset:
            # Check format
            if "text" in item:
                text = item["text"]
                instruction_match = re.search(r'\[INST\](.*?)\[\/INST\]', text, re.DOTALL)
                sop_match = re.search(r'\[SOP\](.*?)\[\/SOP\]', text, re.DOTALL)

                if instruction_match and sop_match:
                    prompts.append(instruction_match.group(1).strip())
                    references.append(sop_match.group(1).strip())

        # Sample the prompts
        if prompts:
            sample_indices = random.sample(range(len(prompts)), min(num_samples, len(prompts)))
            return [prompts[i] for i in sample_indices], [references[i] for i in sample_indices]

        return [], []

def main():
    parser = argparse.ArgumentParser(description="Evaluate fine-tuned SOP model")
    parser.add_argument("--model_path", required=True, help="Path to the fine-tuned model")
    parser.add_argument("--dataset", required=True, help="Path to the dataset file used for training")
    parser.add_argument("--output", default="evaluation_results.json", help="Path to save evaluation results")
    parser.add_argument("--num_samples", type=int, default=10, help="Number of samples to evaluate")
    parser.add_argument("--held_out", action="store_true", help="Use completely new prompts not in the dataset")

    args = parser.parse_args()

    # Sample test prompts
    if args.held_out:
        test_prompts = sample_test_prompts(args.dataset, args.num_samples, held_out=True)
        reference_sops = None
    else:
        test_prompts, reference_sops = sample_test_prompts(args.dataset, args.num_samples, held_out=False)

    # Evaluate model
    results, completeness_score = evaluate_model(args.model_path, test_prompts, reference_sops)

    # Add summary metrics
    summary = {
        "completeness_score": completeness_score,
    }

    # Add ROUGE scores if available
    if reference_sops:
        rouge1_scores = [r.get("rouge1", 0) for r in results if "rouge1" in r]
        rouge2_scores = [r.get("rouge2", 0) for r in results if "rouge2" in r]
        rougeL_scores = [r.get("rougeL", 0) for r in results if "rougeL" in r]

        if rouge1_scores:
            summary["avg_rouge1"] = sum(rouge1_scores) / len(rouge1_scores)
            summary["avg_rouge2"] = sum(rouge2_scores) / len(rouge2_scores)
            summary["avg_rougeL"] = sum(rougeL_scores) / len(rougeL_scores)

    # Save results
    with open(args.output, 'w', encoding='utf-8') as f:
        json.dump({"results": results, "summary": summary}, f, indent=2)

    print(f"Evaluation complete! Results saved to {args.output}")
    print(f"Overall completeness score: {completeness_score:.2f}")

    if "avg_rouge1" in summary:
        print(f"Average ROUGE-1: {summary['avg_rouge1']:.4f}")
        print(f"Average ROUGE-2: {summary['avg_rouge2']:.4f}")
        print(f"Average ROUGE-L: {summary['avg_rougeL']:.4f}")

if __name__ == "__main__":
    import sys
    if "ipykernel" in sys.modules:
        # In a notebook, sys.argv holds kernel arguments, so replace it with the
        # equivalent command line and reuse main(). This keeps the average ROUGE
        # scores in the saved summary instead of writing completeness only.
        sys.argv = [
            "evaluate.py",
            "--model_path", "/content/drive/MyDrive/mistral_sop_finetuned",
            "--dataset", "/content/drive/MyDrive/manasa_cleaned_file.jsonl",
            "--output", "evaluation_results.json",
            "--num_samples", "10",
        ]  # Append "--held_out" to evaluate on fresh prompts instead
    main()


Generating SOPs: 100%|██████████| 10/10 [35:26<00:00, 212.62s/it]

Evaluation complete! Results saved to evaluation_results.json





In [None]:
from datasets import Dataset
import json

# Define paths
jsonl_path = '/content/drive/MyDrive/manasa_cleaned_file.jsonl'  # Update this to your JSONL file path

# Load dataset from JSONL
def load_jsonl_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    return data
sop_data = load_jsonl_data(jsonl_path)


# Create HF Dataset
dataset = Dataset.from_list(sop_data)

# Split the dataset into training and validation
dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
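
`load_jsonl_data` above calls `json.loads` on every line, so a blank line or a single malformed record aborts the whole load. A hedged sketch of a more forgiving variant follows; `load_jsonl_lenient` is a hypothetical helper, not something the notebook defines, and whether skipping bad records is acceptable depends on the dataset:

```python
import json


def load_jsonl_lenient(file_path):
    """Load a JSONL file, skipping blank lines and recording malformed ones."""
    records, bad_lines = [], []
    with open(file_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines would otherwise raise JSONDecodeError
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                bad_lines.append(line_no)  # keep the line number for inspection
    return records, bad_lines
```

Returning the bad line numbers alongside the records makes it possible to inspect the source file before deciding whether to train on the remainder.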

In [None]:
!pip install -q transformers datasets peft bert-score rouge-score spacy textstat nltk



In [None]:
# Evaluation Pipeline for Fine-tuned Mistral-7B-Instruct SOP Generator
# This code assumes you have already fine-tuned your model and saved it

# Install required packages before importing them
!pip install -q transformers datasets peft bert-score rouge-score spacy textstat nltk

import os
import random
import re
import json
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.tokenize import sent_tokenize
from rouge_score import rouge_scorer
from datasets import Dataset
from bert_score import score as bert_score
import spacy
import textstat

# Download necessary NLTK data
nltk.download('punkt_tab')

# Load spaCy model
!python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

# Mount Google Drive
drive.mount('/content/drive')



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:

from peft import PeftModel, PeftConfig

# Configure paths
model_path = '/content/drive/MyDrive/mistral_sop_finetuned'  # Path to your fine-tuned model
eval_output_dir = '/content/drive/MyDrive/sop_evaluation_results'  # Where to save evaluation results

# Create output directory if it doesn't exist
os.makedirs(eval_output_dir, exist_ok=True)

# Load the fine-tuned model and tokenizer
print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# First check if this is a PEFT/LoRA model
is_peft_model = False
try:
    peft_config = PeftConfig.from_pretrained(model_path)
    is_peft_model = True
    print("Detected PEFT/LoRA model, loading base model first...")

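    # Load the base model in 4-bit precision. Passing `load_in_4bit=True` directly
    # is deprecated; transformers expects a BitsAndBytesConfig instead.
    from transformers import BitsAndBytesConfig
    base_model = AutoModelForCausalLM.from_pretrained(
        peft_config.base_model_name_or_path,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
        trust_remote_code=True
    )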

    # Load the LoRA adapter on top of it
    model = PeftModel.from_pretrained(base_model, model_path)
    print("Successfully loaded PEFT/LoRA model")

except Exception as e:
    print(f"Not a PEFT model or error loading PEFT config: {e}")
    print("Loading as standard model...")

    # Load as standard model
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype=torch.float16,
        trust_remote_code=True
    )

# Define initial test prompts - include both seen and unseen domains
test_prompts = [
    "Write me an SOP for pursuing a Master's Degree in Computer Science focusing on Artificial Intelligence.",  # Similar to training
    "Write me an SOP for pursuing a Master's Degree in Data Science at MIT.",  # New domain but similar degree
    "Write me an SOP for pursuing a PhD in Biology with focus on Genetics.",  # Different degree and field
    "Write me an SOP for pursuing an MBA with concentration in Finance.",  # Very different domain
    "Write me an SOP for a Master's in Fine Arts focusing on Digital Media.",  # Creative field
]

# Get reference SOPs for style comparison from your eval_dataset
reference_sops = []
for example in eval_dataset:
    # Extract the instruction and output from your dataset
    # This assumes your eval_dataset has the same format as what you used for training
    text = example["text"]

    # Parse the instruction and output from the formatted text
    # Assuming format is "<s>[INST] instruction [/INST] output </s>"
    try:
        instruction = text.split("[INST]")[1].split("[/INST]")[0].strip()
        output = text.split("[/INST]")[1].split("</s>")[0].strip()

        reference_sops.append({
            "instruction": instruction,
            "output": output
        })

        # Add this instruction to test prompts if not already there
        if instruction not in test_prompts:
            test_prompts.append(instruction)

    except IndexError:
        print(f"Could not parse example: {text[:50]}...")

# Use these reference SOPs for evaluation
reference_texts = [item["output"] for item in reference_sops if "output" in item]

print(f"Loaded {len(reference_texts)} reference SOPs from evaluation dataset")
print(f"Total test prompts for evaluation: {len(test_prompts)}")

Loading model and tokenizer...
Detected PEFT/LoRA model, loading base model first...


config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Successfully loaded PEFT/LoRA model
Loaded 56 reference SOPs from evaluation dataset
Total test prompts for evaluation: 58
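
The reference-parsing loop above splits each record on `[INST]`, `[/INST]` and `</s>` and relies on `IndexError` to detect badly formatted text. The same extraction can be sketched with a single regular expression, assuming the `<s>[INST] instruction [/INST] output </s>` layout stated in the cell's comments. `INST_PATTERN` and `parse_inst_example` are hypothetical names introduced for illustration:

```python
import re

# Captures the instruction between [INST] and [/INST], then the output up to
# </s> or the end of the string (so a missing </s> still parses).
INST_PATTERN = re.compile(r"\[INST\](.*?)\[/INST\](.*?)(?:</s>|$)", re.DOTALL)


def parse_inst_example(text):
    """Return (instruction, output), or None when the markers are missing."""
    match = INST_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```

Returning `None` instead of raising lets the caller count unparseable records explicitly rather than catching an exception.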


In [None]:
# Function to generate SOPs using the model - optimized version
def generate_sop(prompt, max_length=2048):
    input_text = f"<s>[INST] {prompt} [/INST]\n"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Sample with temperature and nucleus (top-p) sampling; max_length counts
    # prompt tokens plus generated tokens
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.85,
            do_sample=True,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode with special tokens kept so the response can be cut at [/INST] and </s>
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    try:
        response = generated_text.split("[/INST]")[1].strip()
        if "</s>" in response:
            response = response.split("</s>")[0].strip()
    except IndexError:
        response = generated_text

    return response

# Optimized Evaluation functions
class SOPEvaluator:
    def __init__(self, reference_texts):
        self.reference_texts = reference_texts

        # Only initialize TF-IDF if we have reference texts
        if reference_texts and len(reference_texts) > 0:
            self.tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Reduced from 10000
            self.tfidf_vectorizer.fit(reference_texts)
            self.reference_vectors = self.tfidf_vectorizer.transform(reference_texts)
        else:
            self.tfidf_vectorizer = None
            self.reference_vectors = None

        # Initialize ROUGE scorer with only essential metrics
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)  # Removed rouge2

        # Initialize trackers
        self.all_metrics = []

        # Pre-compile regex patterns
        self.sentence_end_pattern = re.compile(r'[.!?]$')

    def check_completeness(self, text):
        """Check if the SOP is complete (not ending mid-sentence)"""
        if text and len(text.strip()) > 0:
            return bool(self.sentence_end_pattern.search(text.strip()))
        return False

    def measure_style_similarity(self, text):
        """Measure similarity to reference style using TF-IDF vectors"""
        if self.reference_vectors is not None:
            try:
                # Transform the new text to TF-IDF
                text_vector = self.tfidf_vectorizer.transform([text])

                # Calculate cosine similarities with each reference
                similarities = cosine_similarity(text_vector, self.reference_vectors)[0]

                # Return the highest similarity score
                return np.max(similarities) if similarities.size > 0 else 0
            except Exception:
                # Treat any vectorization failure as zero similarity
                return 0
        return 0

    def calculate_readability(self, text):
        """Calculate readability metrics - optimized to do fewer calculations"""
        flesch_reading_ease = textstat.flesch_reading_ease(text)
        # Grade level is always computed alongside reading ease
        flesch_kincaid_grade = textstat.flesch_kincaid_grade(text)

        return {
            "flesch_reading_ease": flesch_reading_ease,
            "flesch_kincaid_grade": flesch_kincaid_grade
        }

    def analyze_structure(self, text):
        """Analyze the structure of the SOP - optimized"""
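        # Approximate sentence split on '. ' (faster but cruder than NLTK's
        # sent_tokenize); drop empty fragments so empty text counts as zero sentences
        sentences = [s for s in text.split('. ') if s.strip()]
        paragraphs = text.split('\n\n')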

        # Check SOP pattern with simple string check
        has_sop_header = text.strip().startswith("STATEMENT OF PURPOSE")

        # Calculate average sentence length more efficiently
        if sentences:
            total_words = sum(len(s.split()) for s in sentences)
            avg_sent_len = total_words / len(sentences)
        else:
            avg_sent_len = 0

        return {
            "num_sentences": len(sentences),
            "num_paragraphs": len(paragraphs),
            "avg_sentence_length": avg_sent_len,
            "has_sop_header": has_sop_header
        }

    def calculate_semantic_richness(self, text):
        """Calculate semantic richness metrics - simplified"""
        # Use a simpler approach to estimate vocabulary richness
        words = text.lower().split()
        unique_words = set(words)

        # Skip NLP processing for entities to speed up
        return {
            "vocabulary_richness": len(unique_words) / len(words) if words else 0,
            "num_entities": 0  # Skip entity extraction to save time
        }

    def evaluate_sop(self, generated_text, prompt):
        """Comprehensive evaluation of a generated SOP - optimized"""
        # Basic metrics
        length = len(generated_text.split())
        is_complete = self.check_completeness(generated_text)

        # Only calculate style similarity if we have reference texts
        style_similarity = self.measure_style_similarity(generated_text) if self.reference_texts else 0

        # ROUGE scores - limit to first reference only
        rouge_scores = {}
        if self.reference_texts:
            # RougeScorer.score expects (target, prediction): reference first, generation second
            scores = self.rouge_scorer.score(self.reference_texts[0], generated_text)
            rouge_scores = {f"rouge_{metric}": score.fmeasure for metric, score in scores.items()}

        # Readability metrics - only calculate if text is long enough
        readability = self.calculate_readability(generated_text) if length > 50 else {"flesch_reading_ease": 0, "flesch_kincaid_grade": 0}

        # Structure analysis
        structure = self.analyze_structure(generated_text)

        # Semantic richness - simplified
        semantics = self.calculate_semantic_richness(generated_text)

        # Combine all metrics
        metrics = {
            "prompt": prompt[:50],  # Store only first 50 chars of prompt to save memory
            "length": length,
            "is_complete": is_complete,
            "style_similarity": style_similarity,
            **rouge_scores,
            **readability,
            **structure,
            **semantics
        }

        self.all_metrics.append(metrics)
        return metrics

    def generate_report(self, output_dir):
        """Generate a comprehensive evaluation report - optimized"""
        # Convert metrics to DataFrame
        df = pd.DataFrame(self.all_metrics)

        # Save detailed metrics to CSV
        csv_path = os.path.join(output_dir, "evaluation_metrics.csv")
        df.to_csv(csv_path, index=False)

        # Calculate summary statistics
        summary = {
            "avg_length": df["length"].mean(),
            "completeness_rate": df["is_complete"].mean() * 100,
            "avg_style_similarity": df["style_similarity"].mean() if "style_similarity" in df else 0,
            "avg_flesch_reading_ease": df["flesch_reading_ease"].mean(),
            "avg_flesch_kincaid_grade": df["flesch_kincaid_grade"].mean(),
            "avg_num_paragraphs": df["num_paragraphs"].mean(),
            "avg_sentence_length": df["avg_sentence_length"].mean(),
            "sop_header_rate": df["has_sop_header"].mean() * 100,
            "avg_vocabulary_richness": df["vocabulary_richness"].mean()
        }

        # Save summary to JSON
        summary_path = os.path.join(output_dir, "evaluation_summary.json")
        with open(summary_path, 'w') as f:
            json.dump(summary, f, indent=2)

        # Generate only essential visualizations
        self._generate_essential_visualizations(df, output_dir)

        return summary

    def _generate_essential_visualizations(self, df, output_dir):
        """Generate only the most important visualizations"""
        # Style similarity across prompts - if we have this data
        if "style_similarity" in df.columns:
            plt.figure(figsize=(10, 5))
            plt.bar(range(len(df)), df["style_similarity"], color='skyblue')
            plt.xlabel("Test Case")
            plt.ylabel("Style Similarity Score")
            plt.title("Style Similarity to Reference SOPs")
            plt.tight_layout()
            plt.savefig(os.path.join(output_dir, "style_similarity.png"))
            plt.close()

        # Length distribution - essential
        plt.figure(figsize=(10, 5))
        plt.hist(df["length"], bins=10, color='green', alpha=0.7)  # Reduced bins from 15 to 10
        plt.xlabel("SOP Length (words)")
        plt.ylabel("Frequency")
        plt.title("Distribution of SOP Lengths")
        plt.tight_layout()
        plt.savefig(os.path.join(output_dir, "length_distribution.png"))
        plt.close()

# Main evaluation function - optimized
def evaluate_model(test_prompts, evaluator, save_generations=True, sample_size=None):
    """Run evaluation on the model with option to sample prompts"""
    # Optionally sample a subset of prompts for faster evaluation
    if sample_size and sample_size < len(test_prompts):
        test_prompts = random.sample(test_prompts, sample_size)

    print(f"Evaluating model on {len(test_prompts)} test prompts...")

    generated_sops = {}

    # Generate SOPs for each test prompt
    for prompt in tqdm(test_prompts):
        generated_text = generate_sop(prompt)
        generated_sops[prompt] = generated_text

        # Evaluate the generated SOP
        metrics = evaluator.evaluate_sop(generated_text, prompt)

        # Print minimal feedback
        print(f"Processed prompt: {prompt[:30]}... ({metrics['length']} words)")

    # Save all generated SOPs
    if save_generations:
        sops_path = os.path.join(eval_output_dir, "generated_sops.json")
        with open(sops_path, 'w') as f:
            json.dump(generated_sops, f, indent=2)

    # Generate evaluation report
    summary = evaluator.generate_report(eval_output_dir)

    print("\n==== Evaluation Summary ====")
    for metric, value in summary.items():
        print(f"{metric}: {value:.2f}")

    return summary, generated_sops

# Optimized BERTScore evaluation
def calculate_bert_scores(generated_sops, reference_texts, sample_size=None):
    """Calculate BERTScore with option to sample for faster evaluation"""
    if not reference_texts or not generated_sops:
        print("Missing either reference or generated texts for BERTScore evaluation")
        return {}

    # Sample prompts if requested
    prompts = list(generated_sops.keys())
    if sample_size and sample_size < len(prompts):
        sampled_prompts = random.sample(prompts, sample_size)
        sampled_sops = {p: generated_sops[p] for p in sampled_prompts}
    else:
        sampled_sops = generated_sops

    generated_texts = list(sampled_sops.values())

    # Use first reference text
    reference_text = reference_texts[0] if reference_texts else ""
    references = [reference_text] * len(generated_texts)

    # Calculate BERTScore with lower batch size
    try:
        print("Calculating BERTScore (this may take a while)...")
        P, R, F1 = bert_score(generated_texts, references, lang="en", batch_size=8, verbose=True)

        bert_scores = {
            "bert_precision": P.mean().item(),
            "bert_recall": R.mean().item(),
            "bert_f1": F1.mean().item()
        }

        # Save only summary scores to save time
        with open(os.path.join(eval_output_dir, "bertscore_summary.json"), 'w') as f:
            json.dump(bert_scores, f, indent=2)

        return bert_scores
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {}

# Main execution
if __name__ == "__main__":
    # Ensure `random` (used for prompt sampling above) and `re` are available
    import random
    import re

    # Define sample size for faster evaluation
    SAMPLE_SIZE = 20  # Change this number to control evaluation speed

    # Initialize evaluator
    evaluator = SOPEvaluator(reference_texts)

    # Run evaluation with sampling
    summary, generated_sops = evaluate_model(test_prompts, evaluator, sample_size=SAMPLE_SIZE)

    # Calculate BERTScore on a small sample for speed
    bert_scores = calculate_bert_scores(generated_sops, reference_texts, sample_size=min(10, len(generated_sops)))

    if bert_scores:
        print("\n==== BERTScore Results ====")
        for metric, value in bert_scores.items():
            print(f"{metric}: {value:.4f}")

        # Update summary with BERTScore
        with open(os.path.join(eval_output_dir, "evaluation_summary.json"), 'r') as f:
            full_summary = json.load(f)

        full_summary.update(bert_scores)

        with open(os.path.join(eval_output_dir, "evaluation_summary.json"), 'w') as f:
            json.dump(full_summary, f, indent=2)

    # Generate a simplified report
    with open(os.path.join(eval_output_dir, "evaluation_report.md"), 'w') as f:
        f.write("# SOP Generation Model Evaluation Report\n\n")
        f.write(f"Evaluation performed on {len(generated_sops)} test prompts\n\n")

        f.write("## Overall Metrics\n\n")
        f.write(f"- Average SOP Length: {summary['avg_length']:.1f} words\n")
        f.write(f"- Completeness Rate: {summary['completeness_rate']:.1f}%\n")
        f.write(f"- Style Similarity to Reference: {summary['avg_style_similarity']:.3f}\n")
        f.write(f"- SOP Header Rate: {summary['sop_header_rate']:.1f}%\n\n")

        f.write("## Readability & Structure\n\n")
        f.write(f"- Flesch Reading Ease: {summary['avg_flesch_reading_ease']:.1f}\n")
        f.write(f"- Flesch-Kincaid Grade Level: {summary['avg_flesch_kincaid_grade']:.1f}\n")
        f.write(f"- Average Paragraphs: {summary['avg_num_paragraphs']:.1f}\n")
        f.write(f"- Average Sentence Length: {summary['avg_sentence_length']:.1f} words\n\n")

        if bert_scores:
            f.write("## Semantic Similarity (BERTScore)\n\n")
            f.write(f"- F1 Score: {bert_scores['bert_f1']:.4f}\n\n")

        # Show only one example to save space
        f.write("## Example Generation\n\n")
        first_prompt = next(iter(generated_sops.keys()))
        f.write(f"**Prompt:** {first_prompt[:100]}...\n\n")
        f.write("**Generated SOP:**\n\n")
        f.write(f"```\n{generated_sops[first_prompt][:300]}...\n```\n\n")

    print(f"\nEvaluation complete. Results saved to {eval_output_dir}")

Evaluating model on 20 test prompts...


  5%|▌         | 1/20 [03:41<1:10:02, 221.21s/it]

Processed prompt: Write me an SOP for pursuing a... (1764 words)


 10%|█         | 2/20 [07:20<1:06:00, 220.04s/it]

Processed prompt: Write me an SOP for pursuing a... (1795 words)


 15%|█▌        | 3/20 [10:59<1:02:13, 219.64s/it]

Processed prompt: Write me an SOP for pursuing a... (1685 words)


 15%|█▌        | 3/20 [13:16<1:15:16, 265.65s/it]


KeyboardInterrupt: 
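
The run above was interrupted after three prompts at roughly 220 seconds each. Because `evaluate_model` writes `generated_sops.json` only after the loop finishes, those generations were lost. One possible way to make long runs resumable is to save after every prompt and skip prompts already on disk. The sketch below assumes a generation callable that takes a prompt and returns text; `generate_with_resume`, `generate_fn` and `cache_path` are hypothetical names, not part of the notebook:

```python
import json
import os


def generate_with_resume(prompts, generate_fn, cache_path):
    """Generate one text per prompt, saving after each so a rerun skips finished prompts."""
    generated = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            generated = json.load(f)

    for prompt in prompts:
        if prompt in generated:
            continue  # already produced in an earlier, interrupted run
        generated[prompt] = generate_fn(prompt)
        # Write to a temp file first so an interrupt mid-write cannot corrupt the cache
        tmp_path = cache_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(generated, f, indent=2)
        os.replace(tmp_path, cache_path)
    return generated
```

In this notebook it could be called as `generate_with_resume(test_prompts, generate_sop, os.path.join(eval_output_dir, "generated_sops.json"))`, with metric computation run afterwards over the returned dictionary.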

In [None]:
# SOP Generation Model Evaluation Pipeline
# Optimized for Google Colab with A100 GPU

# Install required packages before importing them
!pip install -q rouge_score textstat transformers bert_score tqdm nltk

import os
import re
import json
import torch
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
import textstat
from transformers import AutoModelForCausalLM, AutoTokenizer
from bert_score import BERTScorer

# Download NLTK tokenizer data (a Python call, so no leading `!`)
nltk.download('punkt_tab', quiet=True)

# Set up environment
print("Setting up environment...")

# Create output directory
eval_output_dir = "sop_evaluation_results"
os.makedirs(eval_output_dir, exist_ok=True)

# Enable GPU acceleration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Model setup
def load_model(model_name_or_path):
    """Load model and tokenizer with optimized settings for A100"""
    print(f"Loading model: {model_name_or_path}")

    # Configure model loading for optimal A100 performance
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        torch_dtype=torch.bfloat16,  # Use bfloat16 for better A100 performance
        device_map="auto",           # Optimize device mapping
        trust_remote_code=True,
    )

    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path,
        use_fast=True,               # Use fast tokenizer
        padding_side="left",         # For causal models
        trust_remote_code=True,
    )

    # Add padding token if needed
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

# SOP Generation Function
def generate_sop(prompt, model, tokenizer, max_length=2048):
    """Generate SOP using the model with optimized settings for A100"""
    input_text = f"<s>[INST] {prompt} [/INST]\n"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Generate text with performance-optimized settings
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.85,
            do_sample=True,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id
        )

    # Process generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    try:
        response = generated_text.split("[/INST]")[1].strip()
        if "</s>" in response:
            response = response.split("</s>")[0].strip()
    except IndexError:
        response = generated_text

    return response

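# Chunked generation (prompts are still generated one at a time)
def batch_generate_sops(prompts, model, tokenizer, batch_size=4):
    """Generate SOPs chunk by chunk; `batch_size` only groups the progress
    output, and each prompt is still passed to `generate_sop` individually."""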
    generated_sops = {}

    # Process in batches
    for i in range(0, len(prompts), batch_size):
        batch_prompts = prompts[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1}/{(len(prompts)-1)//batch_size + 1}")

        # Process each prompt in the batch
        for prompt in tqdm(batch_prompts):
            generated_text = generate_sop(prompt, model, tokenizer)
            generated_sops[prompt] = generated_text

    return generated_sops

# Fast SOP Evaluator
class FastSOPEvaluator:
    def __init__(self, reference_texts=None):
        self.reference_texts = reference_texts

        # Initialize TFIDF only if we have reference texts
        if reference_texts and len(reference_texts) > 0:
            print("Initializing TF-IDF vectorizer...")
            self.tfidf_vectorizer = TfidfVectorizer(max_features=5000)
            self.tfidf_vectorizer.fit(reference_texts)
            self.reference_vectors = self.tfidf_vectorizer.transform(reference_texts)
        else:
            self.tfidf_vectorizer = None
            self.reference_vectors = None

        # Pre-compile regex patterns
        self.sentence_end_pattern = re.compile(r'[.!?]$')

        # Initialize metrics storage
        self.all_metrics = []

    def check_completeness(self, text):
        """Fast check if the SOP is complete"""
        if text and len(text.strip()) > 0:
            return bool(self.sentence_end_pattern.search(text.strip()))
        return False

    def measure_style_similarity(self, text):
        """Fast style similarity calculation"""
        if self.reference_vectors is not None:
            try:
                text_vector = self.tfidf_vectorizer.transform([text])
                similarities = cosine_similarity(text_vector, self.reference_vectors)[0]
                return np.max(similarities) if similarities.size > 0 else 0
            except Exception:
                # Treat any vectorization failure as zero similarity
                return 0
        return 0

    def analyze_structure(self, text):
        """Fast structure analysis"""
        # Split by common sentence delimiters
        sentences = re.split(r'[.!?] ', text)
        sentences = [s for s in sentences if s.strip()]

        # Count paragraphs
        paragraphs = text.split('\n\n')
        paragraphs = [p for p in paragraphs if p.strip()]

        # Check header
        has_sop_header = text.strip().startswith("STATEMENT OF PURPOSE")

        # Calculate average sentence length
        word_count = sum(len(s.split()) for s in sentences)
        avg_sent_len = word_count / len(sentences) if sentences else 0

        return {
            "num_sentences": len(sentences),
            "num_paragraphs": len(paragraphs),
            "avg_sentence_length": avg_sent_len,
            "has_sop_header": has_sop_header
        }

    def calculate_readability(self, text):
        """Fast readability metrics"""
        return {
            "flesch_reading_ease": textstat.flesch_reading_ease(text),
            "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text)
        }

    def evaluate_sop(self, generated_text, prompt):
        """Evaluate a single SOP with essential metrics only"""
        # Basic metrics (always calculate)
        length = len(generated_text.split())
        is_complete = self.check_completeness(generated_text)

        # Optional metrics (skip if no references)
        style_similarity = self.measure_style_similarity(generated_text) if self.reference_vectors is not None else 0

        # Structure analysis
        structure = self.analyze_structure(generated_text)

        # Readability (only if reasonable length)
        if length > 30:
            readability = self.calculate_readability(generated_text)
        else:
            readability = {"flesch_reading_ease": 0, "flesch_kincaid_grade": 0}

        # Vocabulary richness (simple calculation)
        words = generated_text.lower().split()
        unique_words = set(words)
        vocab_richness = len(unique_words) / max(1, len(words))

        # Combine metrics
        metrics = {
            "prompt": prompt[:50],  # Truncate prompt for memory efficiency
            "length": length,
            "is_complete": is_complete,
            "style_similarity": style_similarity,
            **structure,
            **readability,
            "vocabulary_richness": vocab_richness
        }

        self.all_metrics.append(metrics)
        return metrics

    def batch_evaluate(self, generated_sops):
        """Evaluate all SOPs at once"""
        print("Evaluating generated SOPs...")
        results = {}

        for prompt, text in tqdm(generated_sops.items()):
            metrics = self.evaluate_sop(text, prompt)
            results[prompt] = metrics

        return results

    def generate_report(self):
        """Generate evaluation report from collected metrics"""
        # Convert metrics to DataFrame
        df = pd.DataFrame(self.all_metrics)

        # Save detailed metrics
        csv_path = os.path.join(eval_output_dir, "evaluation_metrics.csv")
        df.to_csv(csv_path, index=False)

        # Calculate summary statistics
        summary = {
            "avg_length": df["length"].mean(),
            "completeness_rate": df["is_complete"].mean() * 100,
            "avg_style_similarity": df["style_similarity"].mean(),
            "avg_flesch_reading_ease": df["flesch_reading_ease"].mean(),
            "avg_flesch_kincaid_grade": df["flesch_kincaid_grade"].mean(),
            "avg_num_paragraphs": df["num_paragraphs"].mean(),
            "avg_sentence_length": df["avg_sentence_length"].mean(),
            "sop_header_rate": df["has_sop_header"].mean() * 100,
            "avg_vocabulary_richness": df["vocabulary_richness"].mean()
        }

        # Save summary to JSON
        summary_path = os.path.join(eval_output_dir, "evaluation_summary.json")
        with open(summary_path, 'w') as f:
            json.dump(summary, f, indent=2)

        # Generate essential visualizations
        self.generate_visualizations(df)

        return summary

    def generate_visualizations(self, df):
        """Generate essential visualizations only"""
        # Length distribution
        plt.figure(figsize=(10, 5))
        plt.hist(df["length"], bins=10, color='green', alpha=0.7)
        plt.xlabel("SOP Length (words)")
        plt.ylabel("Frequency")
        plt.title("Distribution of SOP Lengths")
        plt.tight_layout()
        plt.savefig(os.path.join(eval_output_dir, "length_distribution.png"))
        plt.close()

        # Style similarity if available
        if df["style_similarity"].mean() > 0:
            plt.figure(figsize=(10, 5))
            plt.bar(range(len(df)), df["style_similarity"], color='skyblue')
            plt.xlabel("Test Case")
            plt.ylabel("Style Similarity Score")
            plt.title("Style Similarity to Reference SOPs")
            plt.tight_layout()
            plt.savefig(os.path.join(eval_output_dir, "style_similarity.png"))
            plt.close()

# Fast BERTScore evaluation
def fast_bert_evaluation(generated_sops, reference_texts=None, sample_size=10):
    """Calculate BERTScore using optimized settings"""
    if not generated_sops:
        print("No generated SOPs to evaluate")
        return {}

    # Sample a subset for faster evaluation
    prompts = list(generated_sops.keys())
    if sample_size and sample_size < len(prompts):
        sampled_prompts = random.sample(prompts, sample_size)
        sampled_sops = {p: generated_sops[p] for p in sampled_prompts}
    else:
        sampled_sops = generated_sops

    generated_texts = list(sampled_sops.values())

    # Score every generated SOP against a single reference: the first reference
    # text if any were provided, otherwise the first generated SOP.
    if reference_texts:
        references = [reference_texts[0]] * len(generated_texts)
    else:
        # Fallback: the first generated SOP is scored against itself, which
        # inflates the averages, so treat these scores as a rough consistency check.
        references = [generated_texts[0]] * len(generated_texts)
        print("No reference texts provided. Using first generated text as reference.")

    try:
        print("Calculating BERTScore on sample (this is optimized but may still take a minute)...")
        # Use optimized BERTScore settings for speed
        scorer = BERTScorer(lang="en", rescale_with_baseline=True, use_fast_tokenizer=True)
        P, R, F1 = scorer.score(generated_texts, references)

        bert_scores = {
            "bert_precision": P.mean().item(),
            "bert_recall": R.mean().item(),
            "bert_f1": F1.mean().item()
        }

        # Save summary to JSON
        with open(os.path.join(eval_output_dir, "bertscore_summary.json"), 'w') as f:
            json.dump(bert_scores, f, indent=2)

        return bert_scores
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {}

# Main evaluation function
def run_evaluation(model_name_or_path, test_prompts, reference_texts=None, sample_size=None):
    """Run full evaluation pipeline with optimized settings"""
    print(f"Starting evaluation with model: {model_name_or_path}")

    # Sample prompts if requested
    if sample_size and sample_size < len(test_prompts):
        print(f"Sampling {sample_size} prompts from {len(test_prompts)} available prompts")
        eval_prompts = random.sample(test_prompts, sample_size)
    else:
        eval_prompts = test_prompts

    # Load model
    model, tokenizer = load_model(model_name_or_path)

    # Generate SOPs in batches
    generated_sops = batch_generate_sops(eval_prompts, model, tokenizer)

    # Save generated SOPs
    sops_path = os.path.join(eval_output_dir, "generated_sops.json")
    with open(sops_path, 'w') as f:
        json.dump(generated_sops, f, indent=2)

    # Initialize evaluator and run evaluation
    evaluator = FastSOPEvaluator(reference_texts)
    evaluator.batch_evaluate(generated_sops)
    summary = evaluator.generate_report()

    # Run BERTScore on a small sample
    bert_sample_size = min(10, len(generated_sops))
    bert_scores = fast_bert_evaluation(generated_sops, reference_texts, bert_sample_size)

    if bert_scores:
        # Update summary with BERTScore
        summary.update(bert_scores)
        with open(os.path.join(eval_output_dir, "evaluation_summary.json"), 'w') as f:
            json.dump(summary, f, indent=2)

    # Generate final report
    create_final_report(summary, generated_sops)

    print(f"\nEvaluation complete! Results saved to {eval_output_dir}")
    return summary, generated_sops

def create_final_report(summary, generated_sops):
    """Create a concise final report"""
    with open(os.path.join(eval_output_dir, "evaluation_report.md"), 'w') as f:
        f.write("# SOP Generation Model Evaluation Report\n\n")
        f.write(f"Evaluation performed on {len(generated_sops)} prompts\n\n")

        f.write("## Summary Metrics\n\n")
        f.write("| Metric | Value |\n")
        f.write("|--------|-------|\n")
        f.write(f"| Average Length | {summary['avg_length']:.1f} words |\n")
        f.write(f"| Completeness Rate | {summary['completeness_rate']:.1f}% |\n")
        f.write(f"| Style Similarity | {summary['avg_style_similarity']:.3f} |\n")
        f.write(f"| Flesch Reading Ease | {summary['avg_flesch_reading_ease']:.1f} |\n")
        f.write(f"| Grade Level | {summary['avg_flesch_kincaid_grade']:.1f} |\n")

        if "bert_f1" in summary:
            f.write(f"| BERTScore F1 | {summary['bert_f1']:.4f} |\n")

        f.write("\n## Example Generation\n\n")
        # Show first example
        first_prompt = next(iter(generated_sops.keys()))
        f.write(f"**Prompt:** {first_prompt[:100]}...\n\n")
        f.write("**Generated SOP:**\n\n")
        f.write(f"```\n{generated_sops[first_prompt][:300]}...\n```\n\n")

        f.write("\n## Next Steps\n\n")
        f.write("- Review the `generated_sops.json` file for all model outputs\n")
        f.write("- Check `evaluation_metrics.csv` for detailed per-prompt metrics\n")
        f.write("- See visualizations in the output directory\n")

/bin/bash: -c: line 1: syntax error near unexpected token `'punkt_tab','
/bin/bash: -c: line 1: `nltk.download('punkt_tab', quiet=True)'
Setting up environment...
Using device: cuda


The following code shows how to use the evaluation pipeline.
Replace the values with your actual data.

Example:
```python
# Your model path
model_name_or_path = "llama3-70b-instruct"  # or local path

# Load your test prompts
test_prompts = [
    "Create an SOP for laboratory safety procedures",
    "Write an SOP for new employee onboarding",
    # Add more prompts...
]

# Optional: Load reference texts (if available)
reference_texts = [
    "STATEMENT OF PURPOSE\nThis document outlines...",
    # Add more reference SOPs...
]

# Run evaluation with sampling for speed
sample_size = 20  # Adjust based on available time
summary, generated_sops = run_evaluation(
    model_name_or_path,
    test_prompts,
    reference_texts=reference_texts,
    sample_size=sample_size
)
```
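The `sample_size` argument only takes effect when it is smaller than the number of prompts; otherwise every prompt is evaluated. The sketch below mirrors that check in isolation. The helper name `select_prompts` and its `seed` parameter are assumptions added for illustration (the pipeline itself calls `random.sample` without a seed).

```python
import random


def select_prompts(prompts, sample_size=None, seed=None):
    """Return a random subset of prompts, or all of them if no sampling applies.

    Hypothetical helper: sampling happens only when sample_size is set and
    smaller than len(prompts), matching the condition used in run_evaluation.
    """
    if sample_size and sample_size < len(prompts):
        rng = random.Random(seed)  # seed is an added assumption, for repeatable runs
        return rng.sample(prompts, sample_size)
    return list(prompts)


demo_prompts = [f"prompt {i}" for i in range(7)]

# sample_size larger than the list: nothing is sampled, all 7 prompts are kept
print(len(select_prompts(demo_prompts, sample_size=20)))

# sample_size smaller than the list: exactly 3 prompts are drawn
print(len(select_prompts(demo_prompts, sample_size=3, seed=0)))
```

Passing a fixed seed is one way to make two evaluation runs comparable, since both then score the same subset of prompts.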

In [None]:
# Load your model
model_name_or_path = "/content/drive/MyDrive/mistral_sop_finetuned"

# Define your test prompts
test_prompts = [
    "Write me an SOP for pursuing a Master's Degree in Computer Science focusing on Artificial Intelligence.",  # Similar to training
    "Write me an SOP for pursuing a Master's Degree in Data Science at MIT.",  # New domain but similar degree
    "Write me an SOP for pursuing a PhD in Biology with focus on Genetics.",  # Different degree and field
    "Write me an SOP for pursuing an MBA with concentration in Finance.",  # Very different domain
    "Write me an SOP for a Master's in Fine Arts focusing on Digital Media.",  # Creative field
    "Write me an SOP for crafting a graduate school application statement focusing on collaboration and leadership experiences.",
    "Write me an SOP for pursuing a Master's in Computer Science focusing on Artificial Intelligence (AI) and Machine Learning (ML). Share your personal experiences and how they have influenced you towards choosing AI/ML as your area of interest."
]

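# Optional: add reference SOPs if available.
# NOTE: the entry below is a placeholder. Replace it with real reference SOPs;
# scores computed against a placeholder reference are not meaningful.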
reference_texts = [
    "STATEMENT OF PURPOSE\nThis document outlines...",
    # Add more reference SOPs...
]

# Run the evaluation (adjust sample_size as needed)
summary, generated_sops = run_evaluation(
    model_name_or_path,
    test_prompts,
    reference_texts=reference_texts,
    sample_size=20  # Upper bound; only 7 prompts are defined, so all are evaluated
)
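Once the run finishes, the summary written to `evaluation_summary.json` can be read back without reloading the model. The following is a minimal sketch: the helper names `load_summary` and `format_summary` are hypothetical, and the demo values are made up so that the example runs on its own.

```python
import json
import os
import tempfile


def load_summary(output_dir):
    """Read evaluation_summary.json from output_dir and return it as a dict."""
    path = os.path.join(output_dir, "evaluation_summary.json")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def format_summary(summary):
    """Return one 'name: value' line per metric, sorted by metric name."""
    lines = []
    for name in sorted(summary):
        value = summary[name]
        if isinstance(value, float):
            lines.append(f"{name}: {value:.3f}")
        else:
            lines.append(f"{name}: {value}")
    return lines


# Demo with made-up values written to a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "evaluation_summary.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"avg_length": 512.0, "completeness_rate": 100.0}, f)
    for line in format_summary(load_summary(demo_dir)):
        print(line)
```

In the notebook itself, `load_summary(eval_output_dir)` would point the same helper at the directory used by `run_evaluation`.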