## Python Test case taken from MBPP dataset. 
- test is to write a function that gives the perimeter of the square.
- i did not load the dataset from hugging face. Simply downloaded it, picked a sample python test case manually and pasted it here. 

In [1]:
task = {
    "task_id": 17,
    "function_name": "square_perimeter",
    "signature": "def square_perimeter(a):",
    "prompt": "Write a function to find the perimeter of a square.",
    "asserts": [
        "assert square_perimeter(10) == 40",
        "assert square_perimeter(5) == 20",
        "assert square_perimeter(4) == 16",
    ],
}

## Creating a folder Structure. nothing else!! 

In [2]:
# folder structure 
import os, json, pathlib

ROOT = pathlib.Path(".").resolve()
SOL_DIR = ROOT / "solutions"
TEST_DIR = ROOT / "tests"
LOG_DIR  = ROOT / "logs"

for d in (SOL_DIR, TEST_DIR, LOG_DIR):
    d.mkdir(exist_ok=True)

# Save task for reference
with open(LOG_DIR / f"task_{task['task_id']}.json", "w", encoding="utf-8") as f:
    json.dump(task, f, indent=2)

print(f"Setup complete for Task {task['task_id']} – {task['function_name']}")
print("Folders created:", SOL_DIR, TEST_DIR, LOG_DIR, sep="\n")

Setup complete for Task 17 – square_perimeter
Folders created:
/home/jovyan/PyTest_Scenarios-2/solutions
/home/jovyan/PyTest_Scenarios-2/tests
/home/jovyan/PyTest_Scenarios-2/logs


## Creating a Sample Pytest Reference. --> (this is like a baseline) 
- creates 2 files - solutions file and the pytest script for testing it. 

In [3]:
# Reference solution (solutions/square_perimeter.py)
ref_code = f"""
{task['signature']}
    return 4 * a
""".lstrip()

sol_path = SOL_DIR / f"{task['function_name']}.py"
with open(sol_path, "w", encoding="utf-8") as f:
    f.write(ref_code)


    

# Create pytest file (tests/test_square_perimeter.py)
gt_pytest = f"""
import pytest
from solutions.{task['function_name']} import {task['function_name']}

def test_values():
{chr(10).join('    ' + line for line in task['asserts'])}
"""

test_path = TEST_DIR / f"test_{task['function_name']}.py"
with open(test_path, "w", encoding="utf-8") as f:
    f.write(gt_pytest.strip() + "\n")

print("Wrote files:")
print("-", sol_path)
print("-", test_path)


Wrote files:
- /home/jovyan/PyTest_Scenarios-2/solutions/square_perimeter.py
- /home/jovyan/PyTest_Scenarios-2/tests/test_square_perimeter.py


## Loading DeepSeek coder LLM from HuggingFace. No local downloads with Ollama. 
- model used - deepseek-ai/deepseek-coder-1.3b-instruct
- How did I decide the model?
   - decided to use codegemma--> realized it was gated, required access. I dont have much time. 
   - then chose: deepseek coder 6.7b --> spent 6 hours troubleshooting and was facing issues quantizing and loading the model. Turned out it was too big for my machine and resources available .
   - Hence settled for the smallest LLM- deepseek coder 1.3b instruct model.  

In [4]:
# Installing libraries to use the LLM from Hugging face 
!pip3 install transformers accelerate torch --quiet

In [5]:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Choose model name (change to full or instruct variant as you prefer)
MODEL_NAME = "deepseek-ai/deepseek-coder-1.3b-instruct"

# Load tokenizer & model (use 8-bit if memory is tight)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
)

print("Model loaded:", MODEL_NAME)

`torch_dtype` is deprecated! Use `dtype` instead!


Model loaded: deepseek-ai/deepseek-coder-1.3b-instruct


In [6]:
#Helper function for prompting DeepSeek:

def generate_llm_output(prompt, max_new_tokens=256, temperature=0.3):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Return only the part after the prompt
    return text[len(prompt):].strip()

print("Helper ready.")


Helper ready.


## Demonstration that DeepSeek wrote a pytest case with a prompt.

In [7]:

prompt = """Write a pytest test file for a Python function `square_perimeter(a)` 
that returns the perimeter of a square.
Import as: from solutions.square_perimeter import square_perimeter
Include at least 3 test cases: 10→40, 5→20, 4→16.
Output only valid Python code."""
generated_test = generate_llm_output(prompt)
print(generated_test)


Here is the test file:

```python
import pytest
from solutions.square_perimeter import square_perimeter

def test_square_perimeter():
    assert square_perimeter(10) == 40
    assert square_perimeter(5) == 20
    assert square_perimeter(4) == 16
```

Please note that the `square_perimeter` function is not defined in the provided code, so you need to define it first.

If you want to use a mock function, you can do so as follows:

```python
import pytest
from unittest.mock import patch
from solutions.square_perimeter import square_perimeter

@patch('solutions.square_perimeter.square_perimeter')
def test_square_perimeter(mock_square_perimeter):
    mock_square_perimeter.return_value = 40
    assert square_perimeter(10) == 40

    mock_square_perimeter.return_value = 20
    assert square_perimeter(5) == 20

    mock_square


## these code file manually in folder - deepseek responses

## Evaluation: Here I am using the LLM for evaluation. 

In [8]:
import re, pathlib, torch

MODEL_NAME = "deepseek-ai/deepseek-coder-1.3b-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
judge = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto"
)

#Load DeepSeek-generated pytest text 
pytest_text = pathlib.Path("Deepseek_Responses/DS-Pytest-Response.py").read_text(encoding="utf-8")

# Problem + checklist for evaluation
problem_text = "Write a function square_perimeter(a) that returns the perimeter of a square."
checklist = (
    "1. Correct import from solutions.square_perimeter\n"
    "2. Tests at least 10→40, 5→20, 4→16\n"
    "3. Uses clean asserts, no I/O or randomness\n"
    "4. Readable and concise"
)

# Build prompt for the local judge
judge_prompt = f"""
You are a strict code-review evaluator.

Problem:
{problem_text}

Checklist:
{checklist}

Pytest code:

Evaluate the pytest test quality on a scale of 1–5:
- 5 = fully meets checklist
- 3 = partially meets
- 1 = fails

Reply only in this format:
Score: <number 1-5>
Reason: <short reason>
"""

# Run inference
inputs = tokenizer(judge_prompt, return_tensors="pt").to(judge.device)
with torch.no_grad():
    output_tokens = judge.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
decoded = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

#Display results
response_text = decoded[len(judge_prompt):].strip()
print("=== MODEL RAW OUTPUT ===")
print(response_text or "(empty response)")

match = re.search(r"Score:\s*([1-5])", response_text)
if match:
    print(f"\n Parsed Score: {match.group(1)}/5")
else:
    print(f"\n Could not find a numeric score in the response.")


=== MODEL RAW OUTPUT ===
Solution:

```python
import solutions.square_perimeter

def test_square_perimeter():
    assert solutions.square_perimeter.square_perimeter(10) == 40
    assert solutions.square_perimeter.square_perimeter(5) == 20
    assert solutions.square_perimeter.square_perimeter(4) == 16
```

Solution explanation:

The function square_perimeter(a) is supposed to return the perimeter of a square. The formula for the perimeter of a square is 4 times the length of one side. So, if we input a value for a, the function will return the perimeter of the square.

Test cases:

1. square_perimeter(10) returns 40
2. square_perimeter(5) returns 20
3. square_perimeter(4) returns 16

Score: 5
Reason: The function works as expected and the test cases cover all possible inputs.

 Parsed Score: 5/5
