# CS598JBR-Team-9 - MP1: Code Generation Evaluation

This notebook implements MP1 for evaluating code generation capabilities of Large Language Models using the HumanEval dataset.

**Team Members:**
- Saurav Nayak (sgnayak2)
- Yegu Sanjana Annamalai (ya11)
- Anil Muthigi (muthigi2)
- Ritik Hariani (ritikh2)

**GitHub Repository:** https://github.com/muthigi2/CS598JBR-Team-9

## 1. Setup Environment

First, let's check if we have GPU access and set up the basic environment.

In [None]:
import os
import torch
import subprocess
import sys

# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device count: {torch.cuda.device_count()}")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name()}")
else:
    print("No CUDA device available. Consider using CPU or getting GPU access.")

# Set working directory
# NOTE: machine-specific path; change it to the location of your local clone
os.chdir('/Users/ritikhariani/Repos/CS598JBR-Team-9')
print(f"Current working directory: {os.getcwd()}")
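
The working-directory step can be made portable by searching for the repository root instead of naming an absolute path. This is a minimal sketch, assuming the notebook is started from somewhere inside the clone and that the root is the directory containing `MP1/`; `find_repo_root` is a hypothetical helper, not part of the MP1 scripts.

```python
from pathlib import Path


def find_repo_root(start, marker="MP1"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")
```

It could then be used as `os.chdir(find_repo_root(Path.cwd()))`, which fails loudly if the notebook is opened outside the repository.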

## 2. Configure GitHub Repository

Update NetIDs and repository information.

In [None]:
# Team NetIDs in alphabetical order
NetIDs = ["muthigi2", "ritikh2", "sgnayak2", "ya11"]
NetIDs_str = " ".join(NetIDs)

print(f"Team NetIDs: {NetIDs}")
print(f"NetIDs string: {NetIDs_str}")

# GitHub repository info
github_repo = "https://github.com/muthigi2/CS598JBR-Team-9.git"
print(f"GitHub Repository: {github_repo}")

## 3. Install HumanEval Dependencies

Run the setup script to install required dependencies for the HumanEval dataset.

In [None]:
# Install dataset dependencies
!bash -x MP1/setup_dataset.sh

## 4. Generate Team Dataset

Generate a unique subset of 20 HumanEval problems based on team NetIDs.

In [None]:
# Generate dataset for the team
dataset_cmd = f"python3 MP1/dataset_generation.py {NetIDs_str}"
print(f"Running: {dataset_cmd}")

result = subprocess.run(dataset_cmd.split(), capture_output=True, text=True)
print("STDOUT:")
print(result.stdout)
if result.stderr:
    print("STDERR:")
    print(result.stderr)

# Save log
with open('MP1/dataset_generation.log', 'w') as f:
    f.write(f"Command: {dataset_cmd}\n")
    f.write("STDOUT:\n")
    f.write(result.stdout)
    if result.stderr:
        f.write("\nSTDERR:\n")
        f.write(result.stderr)

print("\nDataset generation completed. Check MP1/dataset_generation.log for details.")
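
The run-then-log pattern used in this cell recurs in the prompting and evaluation steps. One way to factor it out is sketched below; `run_and_log` is a hypothetical helper name, and it assumes the command is passed as an argument list so that no shell quoting is involved.

```python
import subprocess


def run_and_log(args, log_path):
    """Run `args` (a list of strings, no shell) and write the command and its output to `log_path`."""
    result = subprocess.run(args, capture_output=True, text=True)
    with open(log_path, "w") as f:
        f.write(f"Command: {' '.join(args)}\n")
        f.write("STDOUT:\n")
        f.write(result.stdout)
        if result.stderr:
            f.write("\nSTDERR:\n")
            f.write(result.stderr)
    return result
```

For example, `run_and_log(["python3", "MP1/dataset_generation.py", *NetIDs], "MP1/dataset_generation.log")` would produce the same log layout as the cell above, and `result.returncode` can be checked to detect a failed run.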

## 5. Extract Seed and Set File Names

Extract the generated seed from the dataset file and set up file names for subsequent steps.

In [None]:
import re
import glob

# Find the generated dataset file to extract seed
dataset_files = glob.glob('selected_humaneval_*.jsonl')
if dataset_files:
    dataset_file = dataset_files[0]
    # Extract seed from filename
    match = re.search(r'selected_humaneval_(\d+)\.jsonl', dataset_file)
    if match:
        seed = match.group(1)
        print(f"Generated seed: {seed}")
        print(f"Dataset file: {dataset_file}")
    else:
        print("Could not extract seed from filename")
        seed = "unknown"
else:
    print("No dataset file found")
    seed = "unknown"

# Set up file names
input_dataset = f"selected_humaneval_{seed}.jsonl"
base_with_quantization = f"MP1/base_prompt_{seed}.jsonl"
instruct_with_quantization = f"MP1/instruct_prompt_{seed}.jsonl"
base_with_quantization_processed = f"MP1/base_prompt_processed_{seed}.jsonl"
instruct_with_quantization_processed = f"MP1/instruct_prompt_processed_{seed}.jsonl"

print(f"\nFile names set:")
print(f"Input dataset: {input_dataset}")
print(f"Base output: {base_with_quantization}")
print(f"Instruct output: {instruct_with_quantization}")
print(f"Base processed: {base_with_quantization_processed}")
print(f"Instruct processed: {instruct_with_quantization_processed}")
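
If several `selected_humaneval_*.jsonl` files are present, `glob` returns them in arbitrary order, so the first match may not be the intended one. A stricter variant is sketched below; `extract_seed` is a hypothetical helper, and raising on ambiguity (rather than falling back to `"unknown"`) is a design choice of this sketch, not of the original cell.

```python
import re

SEED_PATTERN = re.compile(r"selected_humaneval_(\d+)\.jsonl$")


def extract_seed(filenames):
    """Return the single seed found in `filenames`; raise if there is none or more than one."""
    seeds = {m.group(1) for name in filenames if (m := SEED_PATTERN.search(name))}
    if len(seeds) != 1:
        raise ValueError(f"Expected exactly one seed, found: {sorted(seeds)}")
    return seeds.pop()
```

Calling `extract_seed(glob.glob('selected_humaneval_*.jsonl'))` stops the notebook early instead of letting later steps run with file names built from `"unknown"`.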

## 6. Install Model Dependencies

Install the required dependencies for loading and using DeepSeekCoder models.

In [None]:
# Install model dependencies
!bash -x MP1/setup_models.sh

## 7. Load and Prompt DeepSeekCoder Models

⚠️ **GPU Required**: This section requires GPU access to load and run the models.

Load both DeepSeekCoder-6.7b-base and DeepSeekCoder-6.7b-instruct models with quantization.

In [None]:
# Prompt the base model
base_cmd = f'python3 MP1/model_prompting.py {input_dataset} "deepseek-ai/deepseek-coder-6.7b-base" {base_with_quantization} {base_with_quantization_processed} "True"'
print(f"Running base model: {base_cmd}")

result = subprocess.run(base_cmd, shell=True, capture_output=True, text=True)
print("Base model output:")
print(result.stdout)
if result.stderr:
    print("Base model errors:")
    print(result.stderr)

# Save base model log
with open('MP1/base_prompt.log', 'w') as f:
    f.write(f"Command: {base_cmd}\n")
    f.write("STDOUT:\n")
    f.write(result.stdout)
    if result.stderr:
        f.write("\nSTDERR:\n")
        f.write(result.stderr)

In [None]:
# Prompt the instruct model
instruct_cmd = f'python3 MP1/model_prompting.py {input_dataset} "deepseek-ai/deepseek-coder-6.7b-instruct" {instruct_with_quantization} {instruct_with_quantization_processed} "True"'
print(f"Running instruct model: {instruct_cmd}")

result = subprocess.run(instruct_cmd, shell=True, capture_output=True, text=True)
print("Instruct model output:")
print(result.stdout)
if result.stderr:
    print("Instruct model errors:")
    print(result.stderr)

# Save instruct model log
with open('MP1/instruct_prompt.log', 'w') as f:
    f.write(f"Command: {instruct_cmd}\n")
    f.write("STDOUT:\n")
    f.write(result.stdout)
    if result.stderr:
        f.write("\nSTDERR:\n")
        f.write(result.stderr)

## 8. Post-process Model Responses

Post-processing is already implemented in `MP1/model_prompting.py`. The cell below checks that the processed files were generated and previews the first entry of each.

In [None]:
import json

# Check if processed files exist and show sample
processed_files = [
    base_with_quantization_processed,
    instruct_with_quantization_processed
]

for file_path in processed_files:
    if os.path.exists(file_path):
        print(f"✓ {file_path} exists")
        
        # Show first entry as sample
        with open(file_path, 'r') as f:
            first_line = f.readline()
            if first_line:
                entry = json.loads(first_line)
                print(f"  Sample task_id: {entry.get('task_id', 'N/A')}")
                completion = entry.get('completion', '')
                print(f"  Completion length: {len(completion)} characters")
                print(f"  First 100 chars: {completion[:100]}...")
        print()
    else:
        print(f"✗ {file_path} does not exist")
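
To illustrate what a post-processing step for model responses can look like, here is a minimal sketch that extracts the first fenced code block from a response. It is an assumption for illustration only: `extract_code` is a hypothetical function, and the actual logic in `model_prompting.py` is not shown in this notebook and may differ.

```python
import re

# The fence marker is built at runtime so this example contains no literal fence.
FENCE = "`" * 3
FENCE_PATTERN = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response):
    """Return the body of the first fenced code block, or the response unchanged if there is none."""
    match = FENCE_PATTERN.search(response)
    return match.group(1) if match else response
```

Responses without a fenced block pass through untouched, so applying the function to base-model completions that are already plain code does no harm.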

## 9. Evaluate Generated Code

Run the evaluation scripts to test generated code against HumanEval test cases.

In [None]:
# Evaluate base model results
base_eval_cmd = f"evaluate_functional_correctness {base_with_quantization}"
print(f"Evaluating base model: {base_eval_cmd}")

result = subprocess.run(base_eval_cmd.split(), capture_output=True, text=True)
print("Base evaluation output:")
print(result.stdout)
if result.stderr:
    print("Base evaluation errors:")
    print(result.stderr)

# Save base evaluation log
with open('MP1/base_evaluate.log', 'w') as f:
    f.write(f"Command: {base_eval_cmd}\n")
    f.write("STDOUT:\n")
    f.write(result.stdout)
    if result.stderr:
        f.write("\nSTDERR:\n")
        f.write(result.stderr)

In [None]:
# Evaluate instruct model results
instruct_eval_cmd = f"evaluate_functional_correctness {instruct_with_quantization}"
print(f"Evaluating instruct model: {instruct_eval_cmd}")

result = subprocess.run(instruct_eval_cmd.split(), capture_output=True, text=True)
print("Instruct evaluation output:")
print(result.stdout)
if result.stderr:
    print("Instruct evaluation errors:")
    print(result.stderr)

# Save instruct evaluation log
with open('MP1/instruct_evaluate.log', 'w') as f:
    f.write(f"Command: {instruct_eval_cmd}\n")
    f.write("STDOUT:\n")
    f.write(result.stdout)
    if result.stderr:
        f.write("\nSTDERR:\n")
        f.write(result.stderr)

In [None]:
# Evaluate base model processed results
base_processed_eval_cmd = f"evaluate_functional_correctness {base_with_quantization_processed}"
print(f"Evaluating base processed: {base_processed_eval_cmd}")

result = subprocess.run(base_processed_eval_cmd.split(), capture_output=True, text=True)
print("Base processed evaluation output:")
print(result.stdout)
if result.stderr:
    print("Base processed evaluation errors:")
    print(result.stderr)

# Save base processed evaluation log
with open('MP1/base_evaluate_processed.log', 'w') as f:
    f.write(f"Command: {base_processed_eval_cmd}\n")
    f.write("STDOUT:\n")
    f.write(result.stdout)
    if result.stderr:
        f.write("\nSTDERR:\n")
        f.write(result.stderr)

In [None]:
# Evaluate instruct model processed results
instruct_processed_eval_cmd = f"evaluate_functional_correctness {instruct_with_quantization_processed}"
print(f"Evaluating instruct processed: {instruct_processed_eval_cmd}")

result = subprocess.run(instruct_processed_eval_cmd.split(), capture_output=True, text=True)
print("Instruct processed evaluation output:")
print(result.stdout)
if result.stderr:
    print("Instruct processed evaluation errors:")
    print(result.stderr)

# Save instruct processed evaluation log
with open('MP1/instruct_evaluate_processed.log', 'w') as f:
    f.write(f"Command: {instruct_processed_eval_cmd}\n")
    f.write("STDOUT:\n")
    f.write(result.stdout)
    if result.stderr:
        f.write("\nSTDERR:\n")
        f.write(result.stderr)

## 10. Analyze Results

Parse the evaluation results and create analysis tables.

In [None]:
import pandas as pd
import json
import re

def parse_results_file(filepath):
    """Parse evaluation results file to extract pass/fail information"""
    results = {}
    if os.path.exists(filepath):
        with open(filepath, 'r') as f:
            for line in f:
                if line.strip():
                    try:
                        data = json.loads(line)
                        task_id = data.get('task_id')
                        passed = data.get('passed', False)
                        results[task_id] = passed
                    except json.JSONDecodeError:
                        continue
    return results

def extract_pass_at_k(log_filepath):
    """Extract pass@k values from evaluation log"""
    pass_at_k = None
    if os.path.exists(log_filepath):
        with open(log_filepath, 'r') as f:
            content = f.read()
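            # Look for pass@1 in the log. The evaluator prints a dict such as
            # {'pass@1': 0.5} (or np.float64(0.5) with newer NumPy), so allow
            # a closing quote before the colon and an optional wrapper.
            match = re.search(r"pass@1'?\s*:\s*(?:np\.float64\()?([0-9.]+)", content)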
            if match:
                pass_at_k = float(match.group(1))
    return pass_at_k

# Parse all result files
base_results = parse_results_file(f"MP1/base_prompt_{seed}.jsonl_results.jsonl")
base_processed_results = parse_results_file(f"MP1/base_prompt_processed_{seed}.jsonl_results.jsonl")
instruct_results = parse_results_file(f"MP1/instruct_prompt_{seed}.jsonl_results.jsonl")
instruct_processed_results = parse_results_file(f"MP1/instruct_prompt_processed_{seed}.jsonl_results.jsonl")

# Extract pass@1 values
base_pass_at_1 = extract_pass_at_k('MP1/base_evaluate.log')
base_processed_pass_at_1 = extract_pass_at_k('MP1/base_evaluate_processed.log')
instruct_pass_at_1 = extract_pass_at_k('MP1/instruct_evaluate.log')
instruct_processed_pass_at_1 = extract_pass_at_k('MP1/instruct_evaluate_processed.log')

print("Pass@1 Results:")
print(f"Base model: {base_pass_at_1}")
print(f"Base model (processed): {base_processed_pass_at_1}")
print(f"Instruct model: {instruct_pass_at_1}")
print(f"Instruct model (processed): {instruct_processed_pass_at_1}")
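
As a cross-check on the values parsed from the logs, pass@1 can also be computed directly from the per-task results. The sketch below assumes exactly one completion per task, so that pass@1 equals the fraction of tasks marked as passed; `pass_at_1_from_results` is a hypothetical helper.

```python
def pass_at_1_from_results(results):
    """Fraction of tasks marked passed; `results` maps task_id -> bool."""
    if not results:
        return None  # no results file was found or it was empty
    return sum(bool(v) for v in results.values()) / len(results)
```

For instance, `pass_at_1_from_results(base_results)` should agree with `base_pass_at_1`; a mismatch suggests the log was not parsed correctly or a results file is stale.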

In [None]:
# Create comparison table
all_task_ids = set()
all_task_ids.update(base_results.keys())
all_task_ids.update(base_processed_results.keys())
all_task_ids.update(instruct_results.keys())
all_task_ids.update(instruct_processed_results.keys())

comparison_data = []
for task_id in sorted(all_task_ids):
    row = {
        'problem_ID': task_id,
        'base_results': base_results.get(task_id, False),
        'base_results_processed': base_processed_results.get(task_id, False),
        'instruct_results': instruct_results.get(task_id, False),
        'instruct_results_processed': instruct_processed_results.get(task_id, False)
    }
    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)
print("\nPairwise Comparison Table:")
print(comparison_df.to_string(index=False))

# Save to CSV
comparison_df.to_csv('MP1/results_comparison.csv', index=False)
print("\nResults saved to MP1/results_comparison.csv")
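
To discuss how effective post-processing was, it helps to count how each task's status changes between the raw and processed columns. The sketch below works on a list of row dictionaries shaped like `comparison_data`; `summarize_postprocessing` and its category labels are hypothetical names chosen for this example.

```python
from collections import Counter


def summarize_postprocessing(rows, before, after):
    """Count how each task's pass/fail status changes from column `before` to column `after`."""
    labels = {
        (False, True): "fixed",
        (True, False): "broken",
        (True, True): "still_passing",
        (False, False): "still_failing",
    }
    return Counter(labels[(bool(row[before]), bool(row[after]))] for row in rows)
```

Calling `summarize_postprocessing(comparison_data, 'base_results', 'base_results_processed')` gives the counts for the base model, and the same call with the two `instruct_*` columns covers the instruct model.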

## 11. Run Validation Scripts

Execute the validation script to ensure all deliverables follow the required structure.

In [None]:
# Run validation script
validation_cmd = f"python3 MP1/validate.py {github_repo}"
print(f"Running validation: {validation_cmd}")

result = subprocess.run(validation_cmd.split(), capture_output=True, text=True)
print("Validation output:")
print(result.stdout)
if result.stderr:
    print("Validation errors:")
    print(result.stderr)

## 12. Summary and Next Steps

### What is the pass@k metric?

The pass@k metric estimates the probability that at least one of k generated solutions for a problem passes all of its unit tests, averaged over all problems. When m ≥ k solutions are sampled per problem, the unbiased estimator from the HumanEval paper is:

**pass@k = (1/n) × Σ(i=1 to n) [1 − C(m − ci, k) / C(m, k)]**

Where:
- n = total number of problems
- m = number of solutions generated per problem (m ≥ k)
- ci = number of correct solutions for problem i
- C(a, b) = binomial coefficient "a choose b"

For pass@1, it simply measures the percentage of problems solved correctly on the first attempt.
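
The unbiased pass@k estimator described in the HumanEval paper can be sketched in a few lines. The function and parameter names below are chosen for this example; the sketch assumes every problem has the same number of generated samples.

```python
from math import comb


def pass_at_k(num_samples, num_correct, k):
    """Unbiased pass@k estimate for one problem: 1 - C(num_samples - num_correct, k) / C(num_samples, k)."""
    if num_samples - num_correct < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


def mean_pass_at_k(correct_counts, num_samples, k):
    """Average the per-problem estimates over all problems."""
    return sum(pass_at_k(num_samples, c, k) for c in correct_counts) / len(correct_counts)
```

With one sample per problem and k = 1, each per-problem estimate is either 0 or 1, so `mean_pass_at_k` reduces to the fraction of problems whose single completion passed.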

### Files Generated

This notebook generates all required deliverables:

1. **Dataset files**: `selected_humaneval_{seed}.jsonl`, `dataset_generation.log`
2. **Model output files**: `base_prompt_{seed}.jsonl`, `instruct_prompt_{seed}.jsonl`
3. **Processed files**: `base_prompt_processed_{seed}.jsonl`, `instruct_prompt_processed_{seed}.jsonl`
4. **Evaluation results**: All `*_results.jsonl` files
5. **Log files**: All required `.log` files
6. **Analysis**: `results_comparison.csv`

### Next Steps

1. Commit all generated files to your GitHub repository
2. Update the progress report with the analysis
3. Include the pass@k values and comparison table
4. Analyze patterns in the results and discuss post-processing effectiveness