# Data Preprocessing & Tokenization Pipeline

This notebook handles Step 3 of our masterplan:
- Create prompt templates for input formatting
- Load Qwen tokenizer and tokenize the data (max_length=2048)
- Create PyTorch datasets for training
- Test data pipeline with sample batches

## 1. Import Required Libraries

In [1]:
import os
import pandas as pd
import numpy as np
import logging
import json
import pickle
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Any, Optional

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('../logs/data_preprocessing.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

print("Libraries imported successfully!")
logger.info("Starting data preprocessing and tokenization")

2025-08-21 16:01:01,965 - INFO - Starting data preprocessing and tokenization


Libraries imported successfully!


## 2. Load Preprocessed Data

In [2]:
# Load the preprocessed data splits
data_dir = Path('../data')

logger.info("Loading preprocessed data splits")

try:
    # Load splits
    with open(data_dir / 'train_split.pkl', 'rb') as f:
        train_data = pickle.load(f)
        
    with open(data_dir / 'val_split.pkl', 'rb') as f:
        val_data = pickle.load(f)
        
    with open(data_dir / 'test_split.pkl', 'rb') as f:
        test_data = pickle.load(f)
    
    # Load metadata
    with open(data_dir / 'metadata.json', 'r') as f:
        metadata = json.load(f)
    
    print(f"✅ Data loaded successfully!")
    print(f"Train samples: {len(train_data)}")
    print(f"Val samples: {len(val_data)}")
    print(f"Test samples: {len(test_data)}")
    
    logger.info(f"Data loaded - Train: {len(train_data)}, Val: {len(val_data)}, Test: {len(test_data)}")
    
except Exception as e:
    logger.error(f"Error loading data: {str(e)}")
    print(f"Error: {str(e)}")
    raise

2025-08-21 16:01:23,699 - INFO - Loading preprocessed data splits
2025-08-21 16:01:27,910 - INFO - Data loaded - Train: 14049, Val: 1756, Test: 1757


✅ Data loaded successfully!
Train samples: 14049
Val samples: 1756
Test samples: 1757


## 3. Analyze Data Structure for Prompt Design

In [3]:
# Examine the data structure to design appropriate prompts
sample = train_data[0]
print("=== Sample Data Structure ===")
for key, value in sample.items():
    if isinstance(value, str):
        print(f"{key}: {value[:100]}..." if len(value) > 100 else f"{key}: {value}")
    elif isinstance(value, list):
        print(f"{key}: List with {len(value)} items")
        if len(value) > 0:
            first_item = value[0]
            if isinstance(first_item, str):
                print(f"  First item (string): {first_item[:200]}..." if len(first_item) > 200 else f"  First item (string): {first_item}")
            elif isinstance(first_item, dict):
                print(f"  First item (dict) keys: {list(first_item.keys())}")
                # Show actual values for dict
                for k, v in first_item.items():
                    if isinstance(v, str) and len(v) > 100:
                        print(f"    {k}: {v[:100]}...")
                    else:
                        print(f"    {k}: {v}")
            else:
                print(f"  First item type: {type(first_item)}, value: {first_item}")
    else:
        print(f"{key}: {value}")

print("\n=== Detailed Unit Tests Analysis ===")
if 'unit_tests' in sample:
    unit_tests = sample['unit_tests']
    print(f"Unit tests type: {type(unit_tests)}")
    print(f"Unit tests length: {len(unit_tests) if hasattr(unit_tests, '__len__') else 'N/A'}")
    
    if isinstance(unit_tests, list) and len(unit_tests) > 0:
        print(f"\nFirst 3 unit tests:")
        for i in range(min(3, len(unit_tests))):
            test = unit_tests[i]
            print(f"  Test {i}: type={type(test)}")
            if isinstance(test, dict):
                for k, v in test.items():
                    if isinstance(v, str) and len(v) > 200:
                        print(f"    {k}: {v[:200]}...")
                    else:
                        print(f"    {k}: {v}")
            elif isinstance(test, str):
                print(f"    Content: {test[:200]}..." if len(test) > 200 else f"    Content: {test}")
            print()

logger.info("Data structure analysis completed")

2025-08-21 16:01:27,944 - INFO - Data structure analysis completed


=== Sample Data Structure ===
task_id: 14568
question: As AtCoder Beginner Contest 100 is taking place, the office of AtCoder, Inc. is decorated with a seq...
code_ground_truth: def max_operations_on_sequence(N, a):
    ans = 0
    for i in a:
        while i % 2 == 0:
        ...
code_generate: [{"sol_id": 0, "code": "def max_operations_on_sequence(N, a):\n    \"\"\"\n    Calculate the maximum...
unit_tests: [{"ut_id": 0, "code": "import unittest\n\nclass TestMaxOperationsOnSequence(unittest.TestCase):\n\n ...

=== Detailed Unit Tests Analysis ===
Unit tests type: <class 'str'>
Unit tests length: 133288


## 4. Create Prompt Templates

In [4]:
class PromptTemplate:
    """Handles prompt formatting for unit test generation"""
    
    def __init__(self):
        self.system_prompt = """You are an expert Python developer specialized in writing comprehensive unit tests. Generate high-quality unit tests for the given Python code."""
        
        self.user_template = """Write unit tests for the following Python code:

```python
{code}
```

Requirements:
- Write comprehensive unit tests using pytest
- Include edge cases and error handling
- Use descriptive test names
- Add assertions to verify correctness

Unit tests:"""

    def format_input(self, code: str, question: str = None) -> str:
        """Format the input for the model"""
        # Include question context if available
        if question and question.strip():
            enhanced_template = f"""Problem: {question}

Write unit tests for the following Python code:

```python
{code}
```

Requirements:
- Write comprehensive unit tests using pytest
- Include edge cases and error handling
- Use descriptive test names
- Add assertions to verify correctness

Unit tests:"""
            return enhanced_template
        else:
            return self.user_template.format(code=code)

    def format_training_example(self, sample: Dict[str, Any]) -> Dict[str, str]:
        """Format a training sample for the model"""
        # Extract code and unit tests from sample
        code = sample.get('code_ground_truth', '')
        question = sample.get('question', '')
        
        # Get unit tests - handle different formats
        unit_tests = sample.get('unit_tests', [])
        test_code = ''
        
        if isinstance(unit_tests, list) and len(unit_tests) > 0:
            # Check the first unit test
            first_test = unit_tests[0]
            
            if isinstance(first_test, dict):
                # If it's a dictionary, look for 'code' key
                test_code = first_test.get('code', '')
                
                # If 'code' is empty, try other possible keys
                if not test_code:
                    possible_keys = ['test', 'unit_test', 'test_code', 'content']
                    for key in possible_keys:
                        if key in first_test and first_test[key]:
                            test_code = first_test[key]
                            break
                            
            elif isinstance(first_test, str):
                test_code = first_test
            else:
                test_code = str(first_test)
                
        elif isinstance(unit_tests, str):
            test_code = unit_tests
        
        # Format input and output
        input_text = self.format_input(code, question)
        output_text = test_code.strip() if test_code else ''
        
        return {
            'input': input_text,
            'output': output_text,
            'full_text': f"{input_text}\n\n{output_text}"
        }

# Test the prompt template (only show one example)
prompt_template = PromptTemplate()
print("=== Testing Prompt Template (Sample 1) ===")
test_sample = prompt_template.format_training_example(train_data[0])

print(f"Input length: {len(test_sample['input'])} characters")
print(f"Output length: {len(test_sample['output'])} characters")
print(f"\nInput preview:")
print(test_sample['input'][:300] + "...")
print(f"\nOutput preview:")
if test_sample['output']:
    print(test_sample['output'][:300] + "...")
else:
    print("(EMPTY OUTPUT)")

# Test a few more samples to check data quality
print(f"\n=== Quick Data Quality Check ===")
valid_samples = 0
empty_outputs = 0
total_check = min(10, len(train_data))

for i in range(total_check):
    try:
        sample = prompt_template.format_training_example(train_data[i])
        if sample['output'].strip():
            valid_samples += 1
        else:
            empty_outputs += 1
    except:
        pass

print(f"Checked {total_check} samples:")
print(f"Valid outputs: {valid_samples}")
print(f"Empty outputs: {empty_outputs}")
print(f"Success rate: {valid_samples/total_check*100:.1f}%")

logger.info("Prompt template created and tested")

2025-08-21 16:01:30,441 - INFO - Prompt template created and tested with debug info


=== Testing Prompt Template with Debug Info ===
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 133288
DEBUG - Unit tests is direct string: 133288 chars
DEBUG - Final output length: 133288 chars
DEBUG - Output preview: [{"ut_id": 0, "code": "import unittest\n\nclass TestMaxOperationsOnSequence(unittest.TestCase):\n\n ...

=== Sample Formatted Prompt ===
Input length: 1758 characters
Output length: 133288 characters

Input preview:
Problem: As AtCoder Beginner Contest 100 is taking place, the office of AtCoder, Inc. is decorated with a sequence of length N, a = {a_1, a_2, a_3, ..., a_N}.

Snuke, an employee, would like to play with this sequence.
Specifically, he would like to repeat the following operation as many times as po...

Output preview:
[{"ut_id": 0, "code": "import unittest\n\nclass TestMaxOperationsOnSequence(unittest.TestCase):\n\n    # Test with the sample input provided in the

## 5. Load Qwen Tokenizer

In [5]:
# Load Qwen tokenizer
MODEL_NAME = "Qwen/Qwen2.5-Coder-7B-Instruct"
MAX_LENGTH = 2048

logger.info(f"Loading tokenizer: {MODEL_NAME}")

try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    
    # Set up special tokens
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    print(f"✅ Tokenizer loaded successfully!")
    print(f"Model: {MODEL_NAME}")
    print(f"Vocab size: {tokenizer.vocab_size}")
    print(f"Max length: {MAX_LENGTH}")
    print(f"Pad token: {tokenizer.pad_token}")
    print(f"EOS token: {tokenizer.eos_token}")
    
    # Test tokenization
    test_text = "def hello_world(): print('Hello, World!')"
    tokens = tokenizer(test_text, return_tensors="pt")
    print(f"\nTest tokenization:")
    print(f"Input: {test_text}")
    print(f"Tokens: {tokens['input_ids'].shape}")
    print(f"Decoded: {tokenizer.decode(tokens['input_ids'][0])}")
    
    logger.info(f"Tokenizer loaded and tested successfully")
    
except Exception as e:
    logger.error(f"Error loading tokenizer: {str(e)}")
    print(f"Error: {str(e)}")
    raise

2025-08-21 16:01:35,278 - INFO - Loading tokenizer: Qwen/Qwen2.5-Coder-7B-Instruct
2025-08-21 16:01:36,414 - INFO - Tokenizer loaded and tested successfully


✅ Tokenizer loaded successfully!
Model: Qwen/Qwen2.5-Coder-7B-Instruct
Vocab size: 151643
Max length: 2048
Pad token: <|endoftext|>
EOS token: <|im_end|>

Test tokenization:
Input: def hello_world(): print('Hello, World!')
Tokens: torch.Size([1, 11])
Decoded: def hello_world(): print('Hello, World!')


## 6. Create PyTorch Dataset Class

In [6]:
class UnitTestDataset(Dataset):
    """Memory-efficient PyTorch Dataset for unit test generation"""
    
    def __init__(self, data: List[Dict], tokenizer, prompt_template: PromptTemplate, max_length: int = 2048):
        self.data = data
        self.tokenizer = tokenizer
        self.prompt_template = prompt_template
        self.max_length = max_length
        
        # Don't pre-process all samples - do it on-demand to save memory
        logger.info(f"Created dataset with {len(data)} samples (lazy loading)")
        
        # Test process a few samples to check for issues
        test_samples = min(5, len(data))
        successful_samples = 0
        
        for i in range(test_samples):
            try:
                formatted = self.prompt_template.format_training_example(data[i])
                if formatted['output'].strip():  # Check if we have actual output
                    successful_samples += 1
                else:
                    logger.warning(f"Sample {i} has empty output")
            except Exception as e:
                logger.warning(f"Failed to process test sample {i}: {str(e)}")
        
        logger.info(f"Test processing: {successful_samples}/{test_samples} samples successful")
        
        if successful_samples == 0:
            logger.error("No samples could be processed successfully!")
            raise ValueError("Dataset contains no valid samples")
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        try:
            # Process sample on-demand
            sample_data = self.data[idx]
            formatted = self.prompt_template.format_training_example(sample_data)
            
            # Check if output is empty
            if not formatted['output'].strip():
                logger.warning(f"Sample {idx} has empty output, using placeholder")
                formatted['output'] = "# No unit test available for this code"
            
            # Tokenize the full text (input + output)
            full_text = formatted['full_text']
            
            # Tokenize
            encoding = self.tokenizer(
                full_text,
                truncation=True,
                max_length=self.max_length,
                padding=False,
                return_tensors="pt"
            )
            
            input_ids = encoding['input_ids'].squeeze()
            attention_mask = encoding['attention_mask'].squeeze()
            
            # For causal LM, labels are the same as input_ids
            labels = input_ids.clone()
            
            # Find where the output starts to mask input tokens in loss calculation
            input_text = formatted['input']
            input_encoding = self.tokenizer(
                input_text,
                truncation=True,
                max_length=self.max_length,
                padding=False,
                return_tensors="pt"
            )
            input_length = input_encoding['input_ids'].shape[1]
            
            # Mask input tokens in labels (set to -100 so they're ignored in loss)
            if input_length < len(labels):
                labels[:input_length] = -100
            
            return {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
                'labels': labels
            }
            
        except Exception as e:
            logger.error(f"Error processing sample {idx}: {str(e)}")
            # Return a minimal valid sample to avoid crashing
            dummy_text = "def dummy(): pass"
            encoding = self.tokenizer(
                dummy_text,
                truncation=True,
                max_length=32,
                padding=False,
                return_tensors="pt"
            )
            input_ids = encoding['input_ids'].squeeze()
            attention_mask = encoding['attention_mask'].squeeze()
            labels = input_ids.clone()
            labels[:] = -100  # Mask everything
            
            return {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
                'labels': labels
            }

# Test the dataset class with a smaller subset first
print("Creating small test dataset...")
test_dataset = UnitTestDataset(
    data=train_data[:10],  # Test with only 10 samples
    tokenizer=tokenizer,
    prompt_template=prompt_template,
    max_length=MAX_LENGTH
)

print(f"✅ Test dataset created with {len(test_dataset)} samples")

# Test a sample
sample = test_dataset[0]
print(f"\nSample shape:")
print(f"Input IDs: {sample['input_ids'].shape}")
print(f"Attention mask: {sample['attention_mask'].shape}")
print(f"Labels: {sample['labels'].shape}")
print(f"Non-masked labels: {(sample['labels'] != -100).sum()} tokens")

logger.info("Memory-efficient dataset class created and tested")

2025-08-21 16:01:45,628 - INFO - Pre-processing 5 samples...
2025-08-21 16:01:45,630 - INFO - Successfully processed 5 samples
2025-08-21 16:01:45,746 - INFO - Dataset class created and tested


Creating test dataset...
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 133288
DEBUG - Unit tests is direct string: 133288 chars
DEBUG - Final output length: 133288 chars
DEBUG - Output preview: [{"ut_id": 0, "code": "import unittest\n\nclass TestMaxOperationsOnSequence(unittest.TestCase):\n\n ...
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 31002
DEBUG - Unit tests is direct string: 31002 chars
DEBUG - Final output length: 31002 chars
DEBUG - Output preview: [{"ut_id": 0, "code": "import unittest\nimport bisect\n\nclass TestGPARPSScores(unittest.TestCase):\...
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 90001
DEBUG - Unit tests is direct string: 90001 chars
DEBUG - Final output length: 90001

## 7. Create Data Collator

In [7]:
from transformers import DataCollatorForLanguageModeling

class CustomDataCollator:
    """Custom data collator for batching samples"""
    
    def __init__(self, tokenizer, max_length: int = 2048):
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __call__(self, batch):
        # Extract sequences
        input_ids = [item['input_ids'] for item in batch]
        attention_masks = [item['attention_mask'] for item in batch]
        labels = [item['labels'] for item in batch]
        
        # Pad sequences
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        attention_masks = torch.nn.utils.rnn.pad_sequence(
            attention_masks, batch_first=True, padding_value=0
        )
        labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=-100
        )
        
        return {
            'input_ids': input_ids,
            'attention_mask': attention_masks,
            'labels': labels
        }

# Create data collator
data_collator = CustomDataCollator(tokenizer, MAX_LENGTH)

# Test with a small batch
test_batch = [test_dataset[i] for i in range(2)]
collated_batch = data_collator(test_batch)

print("✅ Data collator created and tested")
print(f"Batch shape:")
print(f"Input IDs: {collated_batch['input_ids'].shape}")
print(f"Attention mask: {collated_batch['attention_mask'].shape}")
print(f"Labels: {collated_batch['labels'].shape}")

logger.info("Data collator created and tested")

2025-08-21 16:01:50,683 - INFO - Data collator created and tested


✅ Data collator created and tested
Batch shape:
Input IDs: torch.Size([2, 2048])
Attention mask: torch.Size([2, 2048])
Labels: torch.Size([2, 2048])


## 8. Create Full Datasets

In [8]:
# Create datasets for all splits with memory monitoring
import psutil
import gc

def get_memory_usage():
    """Get current memory usage"""
    process = psutil.Process()
    return process.memory_info().rss / 1024 / 1024  # MB

def create_dataset_safely(data, name, max_samples=None):
    """Create dataset with memory monitoring"""
    initial_memory = get_memory_usage()
    print(f"\n=== Creating {name} dataset ===")
    print(f"Initial memory: {initial_memory:.1f} MB")
    
    # Limit dataset size if specified
    if max_samples and len(data) > max_samples:
        print(f"Limiting {name} dataset to {max_samples} samples (from {len(data)})")
        data = data[:max_samples]
    
    try:
        dataset = UnitTestDataset(
            data=data,
            tokenizer=tokenizer,
            prompt_template=prompt_template,
            max_length=MAX_LENGTH
        )
        
        final_memory = get_memory_usage()
        memory_increase = final_memory - initial_memory
        print(f"✅ {name} dataset created: {len(dataset)} samples")
        print(f"Memory after creation: {final_memory:.1f} MB (+{memory_increase:.1f} MB)")
        
        return dataset
        
    except Exception as e:
        print(f"❌ Failed to create {name} dataset: {str(e)}")
        gc.collect()  # Force garbage collection
        return None

logger.info("Creating datasets with memory monitoring")

# Start with smaller datasets to test
MAX_TRAIN_SAMPLES = 1000   # Limit training to 1000 samples for now
MAX_VAL_SAMPLES = 200      # Limit validation to 200 samples
MAX_TEST_SAMPLES = 200     # Limit test to 200 samples

print(f"Memory-efficient dataset creation:")
print(f"Train samples: {min(len(train_data), MAX_TRAIN_SAMPLES)} (from {len(train_data)})")
print(f"Val samples: {min(len(val_data), MAX_VAL_SAMPLES)} (from {len(val_data)})")
print(f"Test samples: {min(len(test_data), MAX_TEST_SAMPLES)} (from {len(test_data)})")

# Create datasets one by one
train_dataset = create_dataset_safely(train_data, "train", MAX_TRAIN_SAMPLES)
gc.collect()  # Clean up memory

if train_dataset:
    val_dataset = create_dataset_safely(val_data, "validation", MAX_VAL_SAMPLES)
    gc.collect()
    
    if val_dataset:
        test_dataset = create_dataset_safely(test_data, "test", MAX_TEST_SAMPLES)
        gc.collect()
        
        if test_dataset:
            print(f"\n✅ All datasets created successfully!")
            print(f"Final datasets:")
            print(f"  Train: {len(train_dataset)} samples")
            print(f"  Validation: {len(val_dataset)} samples")
            print(f"  Test: {len(test_dataset)} samples")
            
            final_memory = get_memory_usage()
            print(f"Total memory usage: {final_memory:.1f} MB")
            
            logger.info(f"Datasets created - Train: {len(train_dataset)}, Val: {len(val_dataset)}, Test: {len(test_dataset)}")
        else:
            print("❌ Failed to create test dataset")
    else:
        print("❌ Failed to create validation dataset")
else:
    print("❌ Failed to create training dataset")

# Note about scaling up
print(f"\n💡 Note: Starting with smaller datasets for stability.")
print(f"Once training works, you can increase MAX_TRAIN_SAMPLES up to {len(train_data)}")

2025-08-21 16:03:02,634 - INFO - Creating datasets with memory monitoring
2025-08-21 16:03:02,640 - INFO - Pre-processing 1000 samples...
2025-08-21 16:03:02,814 - INFO - Successfully processed 1000 samples


Memory-efficient dataset creation:
Train samples: 1000 (from 14049)
Val samples: 200 (from 1756)
Test samples: 200 (from 1757)

=== Creating train dataset ===
Initial memory: 3534.5 MB
Limiting train dataset to 1000 samples (from 14049)
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 133288
DEBUG - Unit tests is direct string: 133288 chars
DEBUG - Final output length: 133288 chars
DEBUG - Output preview: [{"ut_id": 0, "code": "import unittest\n\nclass TestMaxOperationsOnSequence(unittest.TestCase):\n\n ...
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 31002
DEBUG - Unit tests is direct string: 31002 chars
DEBUG - Final output length: 31002 chars
DEBUG - Output preview: [{"ut_id": 0, "code": "import unittest\nimport bisect\n\nclass TestGPARPSScores(unittest.TestCase):\...
DEBUG - Sample keys: [

2025-08-21 16:03:02,975 - INFO - Pre-processing 200 samples...
2025-08-21 16:03:03,010 - INFO - Successfully processed 200 samples
2025-08-21 16:03:03,184 - INFO - Pre-processing 200 samples...


DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 147465
DEBUG - Unit tests is direct string: 147465 chars
DEBUG - Final output length: 147465 chars
DEBUG - Output preview: [{"ut_id": 0, "code": "import unittest\n\nclass TestCalculateNthPersonSeatProbabilityFunction(unitte...
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 36064
DEBUG - Unit tests is direct string: 36064 chars
DEBUG - Final output length: 36064 chars
DEBUG - Output preview: [{"ut_id": 0, "code": "import unittest\n\nclass TestMinStepsToReachN(unittest.TestCase):\n    \n    ...
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 93967
DEBUG - Unit tests is direct string: 93967 chars
DEBUG - Final output length: 93967 chars
DEBUG - Output pre

2025-08-21 16:03:03,223 - INFO - Successfully processed 200 samples
2025-08-21 16:03:03,383 - INFO - Datasets created - Train: 1000, Val: 200, Test: 200


DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 245427
DEBUG - Unit tests is direct string: 245427 chars
DEBUG - Final output length: 245427 chars
DEBUG - Output preview: [{"ut_id": 0, "code": "import unittest\n\nclass TestMinPlotsToClearCalculator(unittest.TestCase):\n ...
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 220102
DEBUG - Unit tests is direct string: 220102 chars
DEBUG - Final output length: 220102 chars
DEBUG - Output preview: [{"ut_id": 0, "code": "import unittest\n\nclass TestMaximalPrefixSubsequenceLength(unittest.TestCase...
DEBUG - Sample keys: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests']
DEBUG - Unit tests type: <class 'str'>, length: 194992
DEBUG - Unit tests is direct string: 194992 chars
DEBUG - Final output length: 194992 chars
DEBUG - Outp

## 9. Create DataLoaders

In [9]:
# Create DataLoaders with appropriate batch sizes for RTX 3050
TRAIN_BATCH_SIZE = 1  # Small batch size for 4GB VRAM
EVAL_BATCH_SIZE = 2   # Slightly larger for evaluation

logger.info(f"Creating DataLoaders with batch sizes - Train: {TRAIN_BATCH_SIZE}, Eval: {EVAL_BATCH_SIZE}")

train_dataloader = DataLoader(
    train_dataset,
    batch_size=TRAIN_BATCH_SIZE,
    shuffle=True,
    collate_fn=data_collator,
    num_workers=0,  # Avoid multiprocessing issues on Windows
    pin_memory=torch.cuda.is_available()
)

val_dataloader = DataLoader(
    val_dataset,
    batch_size=EVAL_BATCH_SIZE,
    shuffle=False,
    collate_fn=data_collator,
    num_workers=0,
    pin_memory=torch.cuda.is_available()
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=EVAL_BATCH_SIZE,
    shuffle=False,
    collate_fn=data_collator,
    num_workers=0,
    pin_memory=torch.cuda.is_available()
)

print(f"✅ DataLoaders created successfully!")
print(f"Train batches: {len(train_dataloader)}")
print(f"Val batches: {len(val_dataloader)}")
print(f"Test batches: {len(test_dataloader)}")

logger.info("DataLoaders created successfully")

2025-08-21 16:03:37,579 - INFO - Creating DataLoaders with batch sizes - Train: 1, Eval: 2
2025-08-21 16:03:37,670 - INFO - DataLoaders created successfully


✅ DataLoaders created successfully!
Train batches: 1000
Val batches: 100
Test batches: 100


## 10. Test Data Pipeline

In [10]:
# Test the complete data pipeline
logger.info("Testing complete data pipeline")

print("=== Data Pipeline Test ===")

# Test training dataloader
print("\n1. Testing training dataloader...")
train_batch = next(iter(train_dataloader))
print(f"Train batch shapes:")
for key, value in train_batch.items():
    print(f"  {key}: {value.shape}")

# Test validation dataloader
print("\n2. Testing validation dataloader...")
val_batch = next(iter(val_dataloader))
print(f"Val batch shapes:")
for key, value in val_batch.items():
    print(f"  {key}: {value.shape}")

# Check for any issues
print("\n3. Data quality checks...")
sample_input_ids = train_batch['input_ids'][0]
sample_labels = train_batch['labels'][0]

# Check tokenization
decoded_text = tokenizer.decode(sample_input_ids, skip_special_tokens=False)
print(f"Sample length: {len(sample_input_ids)} tokens")
print(f"Non-masked labels: {(sample_labels != -100).sum()} tokens")
print(f"Masked labels: {(sample_labels == -100).sum()} tokens")

# Show a snippet of the decoded text
print(f"\nDecoded sample (first 200 chars):")
print(decoded_text[:200] + "...")

print(f"\n✅ Data pipeline test completed successfully!")
logger.info("Data pipeline test completed successfully")

2025-08-21 16:03:42,320 - INFO - Testing complete data pipeline


=== Data Pipeline Test ===

1. Testing training dataloader...


2025-08-21 16:03:42,774 - INFO - Data pipeline test completed successfully


Train batch shapes:
  input_ids: torch.Size([1, 2048])
  attention_mask: torch.Size([1, 2048])
  labels: torch.Size([1, 2048])

2. Testing validation dataloader...
Val batch shapes:
  input_ids: torch.Size([2, 2048])
  attention_mask: torch.Size([2, 2048])
  labels: torch.Size([2, 2048])

3. Data quality checks...
Sample length: 2048 tokens
Non-masked labels: 1234 tokens
Masked labels: 814 tokens

Decoded sample (first 200 chars):
Problem: You are given an unrooted tree of $n$ nodes numbered from $\mbox{1}$ to $n$. Each node $\boldsymbol{i}$ has a color, $c_i$. 

Let $d(i,j)$ be the number of different colors in the path betwee...

✅ Data pipeline test completed successfully!


## 11. Save Processed Datasets

In [11]:
# Save the processed datasets and components for later use
processed_dir = Path('../data/processed')
processed_dir.mkdir(exist_ok=True)

logger.info("Saving processed datasets and components")

# Save datasets
torch.save(train_dataset, processed_dir / 'train_dataset.pt')
torch.save(val_dataset, processed_dir / 'val_dataset.pt')
torch.save(test_dataset, processed_dir / 'test_dataset.pt')

# Save tokenizer
tokenizer.save_pretrained(processed_dir / 'tokenizer')

# Save prompt template
with open(processed_dir / 'prompt_template.pkl', 'wb') as f:
    pickle.dump(prompt_template, f)

# Save data collator
with open(processed_dir / 'data_collator.pkl', 'wb') as f:
    pickle.dump(data_collator, f)

# Save processing metadata
processing_metadata = {
    'model_name': MODEL_NAME,
    'max_length': MAX_LENGTH,
    'train_batch_size': TRAIN_BATCH_SIZE,
    'eval_batch_size': EVAL_BATCH_SIZE,
    'train_samples': len(train_dataset),
    'val_samples': len(val_dataset),
    'test_samples': len(test_dataset),
    'vocab_size': tokenizer.vocab_size,
    'pad_token': tokenizer.pad_token,
    'eos_token': tokenizer.eos_token,
    'created_at': datetime.now().isoformat()
}

with open(processed_dir / 'processing_metadata.json', 'w') as f:
    json.dump(processing_metadata, f, indent=2)

print(f"✅ All processed data saved to {processed_dir}")

# Show file sizes
print("\n=== Saved Files ===")
for file_path in processed_dir.rglob('*'):
    if file_path.is_file():
        file_size = file_path.stat().st_size / 1024 / 1024  # MB
        print(f"  {file_path.name}: {file_size:.2f} MB")

logger.info(f"All processed data saved to {processed_dir}")

2025-08-21 16:03:49,274 - INFO - Saving processed datasets and components
2025-08-21 16:03:52,753 - INFO - All processed data saved to ..\data\processed


✅ All processed data saved to ..\data\processed

=== Saved Files ===
  data_collator.pkl: 5.12 MB
  processing_metadata.json: 0.00 MB
  prompt_template.pkl: 0.00 MB
  test_dataset.pt: 68.73 MB
  train_dataset.pt: 311.64 MB
  val_dataset.pt: 66.31 MB
  added_tokens.json: 0.00 MB
  chat_template.jinja: 0.00 MB
  merges.txt: 1.59 MB
  special_tokens_map.json: 0.00 MB
  tokenizer.json: 10.89 MB
  tokenizer_config.json: 0.00 MB
  vocab.json: 2.65 MB


## 12. Summary and Next Steps

In [12]:
print("=== Data Preprocessing Summary ===")
print(f"✅ Loaded {len(train_data) + len(val_data) + len(test_data)} total samples")
print(f"✅ Created prompt templates for unit test generation")
print(f"✅ Loaded and configured Qwen tokenizer ({MODEL_NAME})")
print(f"✅ Created PyTorch datasets:")
print(f"   - Train: {len(train_dataset)} samples")
print(f"   - Validation: {len(val_dataset)} samples")
print(f"   - Test: {len(test_dataset)} samples")
print(f"✅ Created DataLoaders with appropriate batch sizes")
print(f"✅ Tested complete data pipeline")
print(f"✅ Saved all processed components to {processed_dir}")

print("\n=== Configuration Summary ===")
print(f"Model: {MODEL_NAME}")
print(f"Max sequence length: {MAX_LENGTH}")
print(f"Training batch size: {TRAIN_BATCH_SIZE}")
print(f"Evaluation batch size: {EVAL_BATCH_SIZE}")
print(f"Vocab size: {tokenizer.vocab_size}")

print("\n=== Next Steps ===")
print("1. ✅ Step 3 Complete: Data preprocessing and tokenization")
print("2. 🔄 Step 4: Model loading and configuration")
print("3. ⏳ Step 5: QLoRA/PEFT setup")
print("4. ⏳ Step 6: Training configuration")

logger.info("Data preprocessing completed successfully!")
logger.info("Ready to proceed with Step 4: Model loading and configuration")

2025-08-21 16:04:00,177 - INFO - Data preprocessing completed successfully!
2025-08-21 16:04:00,177 - INFO - Ready to proceed with Step 4: Model loading and configuration


=== Data Preprocessing Summary ===
✅ Loaded 17562 total samples
✅ Created prompt templates for unit test generation
✅ Loaded and configured Qwen tokenizer (Qwen/Qwen2.5-Coder-7B-Instruct)
✅ Created PyTorch datasets:
   - Train: 1000 samples
   - Validation: 200 samples
   - Test: 200 samples
✅ Created DataLoaders with appropriate batch sizes
✅ Tested complete data pipeline
✅ Saved all processed components to ..\data\processed

=== Configuration Summary ===
Model: Qwen/Qwen2.5-Coder-7B-Instruct
Max sequence length: 2048
Training batch size: 1
Evaluation batch size: 2
Vocab size: 151643

=== Next Steps ===
1. ✅ Step 3 Complete: Data preprocessing and tokenization
2. 🔄 Step 4: Model loading and configuration
3. ⏳ Step 5: QLoRA/PEFT setup
4. ⏳ Step 6: Training configuration
