# Dataset Preparation: CodeRM-UnitTest Loading and Splitting

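This notebook covers Step 2 of our master plan:
- Load the CodeRM-UnitTest dataset from Hugging Face
- Explore the dataset structure and record format
- Sample up to 20k records (planned against a 77.2k-record dataset; the `train` split loaded here has 17,562 records, so all of them are used)
- Create an 80/10/10 split (planned as 16k train / 2k val / 2k test; 14,049 / 1,756 / 1,757 in this run)
- Save the preprocessed splits locally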

## 1. Import Required Libraries

In [1]:
import os
import pandas as pd
import numpy as np
import logging
from datetime import datetime
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import json
import pickle
from pathlib import Path

# Set up logging (create the log directory first; FileHandler fails if it is missing)
os.makedirs('../logs', exist_ok=True)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('../logs/dataset_preparation.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

print("Libraries imported successfully!")
logger.info("Starting dataset preparation process")

2025-08-21 14:50:39,426 - INFO - Starting dataset preparation process


Libraries imported successfully!


## 2. Setup Environment and GPU Check

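### GPU Setup Troubleshooting

The cell below prints PyTorch and CUDA diagnostics. If no GPU is detected, it lists possible fixes, including the pip command for a CUDA-enabled PyTorch build; it does not install anything itself.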

In [2]:
import torch
import subprocess
import sys

# Comprehensive GPU and CUDA diagnostics
print("=== PyTorch and CUDA Diagnostics ===")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

# Check whether this PyTorch build was compiled with CUDA support
# (torch.version.cuda is None for CPU-only builds)
print(f"CUDA built with PyTorch: {torch.version.cuda is not None}")

# Device detection
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\nDevice selected: {device}")

if torch.cuda.is_available():
    print("\n=== GPU Information ===")
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}")
        print(f"  Total memory: {props.total_memory / 1024**3:.2f} GB")
        print(f"  Major.Minor: {props.major}.{props.minor}")
        print(f"  Multi-processor count: {props.multi_processor_count}")
    
    # Set to first GPU and get detailed memory info
    torch.cuda.set_device(0)
    gpu_memory_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    gpu_memory_reserved = torch.cuda.memory_reserved(0) / 1024**3
    gpu_memory_allocated = torch.cuda.memory_allocated(0) / 1024**3
    # Approximation: only counts memory allocated by this process
    gpu_memory_free = gpu_memory_total - gpu_memory_allocated
    
    print(f"\n=== GPU Memory Status ===")
    print(f"Total GPU Memory: {gpu_memory_total:.2f} GB")
    print(f"Reserved Memory: {gpu_memory_reserved:.2f} GB")
    print(f"Allocated Memory: {gpu_memory_allocated:.2f} GB")
    print(f"Free Memory: {gpu_memory_free:.2f} GB")
    
    # Test GPU with a simple operation
    try:
        test_tensor = torch.randn(100, 100).to(device)
        result = torch.matmul(test_tensor, test_tensor)
        print(f"✅ GPU test successful - tensor shape: {result.shape}")
        del test_tensor, result
        torch.cuda.empty_cache()
    except Exception as e:
        print(f"❌ GPU test failed: {e}")
    
    logger.info(f"GPU detected and tested: {torch.cuda.get_device_name(0)} with {gpu_memory_total:.2f} GB memory")
else:
    print("\n❌ No GPU available or CUDA not properly installed")
    print("Possible solutions:")
    print("1. Install CUDA-enabled PyTorch: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118")
    print("2. Check NVIDIA drivers are installed")
    print("3. Verify CUDA toolkit is installed")
    
    # Check NVIDIA driver
    try:
        # Pass the command as a list without shell=True so it behaves the same on all platforms
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        if result.returncode == 0:
            print("\n✅ NVIDIA drivers detected:")
            print(result.stdout.split('\n')[0])  # First line of the nvidia-smi output
        else:
            print("\n❌ NVIDIA drivers not found or nvidia-smi not available")
    except (OSError, subprocess.SubprocessError):
        # Raised when nvidia-smi is missing or cannot be executed
        print("\n❌ Could not check NVIDIA drivers")
    
    logger.warning("No GPU available, training will use CPU (this will be very slow)")

# Create necessary directories
os.makedirs('../data', exist_ok=True)
os.makedirs('../logs', exist_ok=True)
print(f"\n✅ Directory structure verified")
print(f"Final device for training: {device}")

=== PyTorch and CUDA Diagnostics ===
PyTorch version: 2.7.1+cu118
CUDA available: True
CUDA version: 11.8
cuDNN version: 90100
Number of GPUs: 1
CUDA built with PyTorch: True

Device selected: cuda

=== GPU Information ===
GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU
  Total memory: 4.00 GB
  Major.Minor: 8.6
  Multi-processor count: 16

=== GPU Memory Status ===
Total GPU Memory: 4.00 GB
Reserved Memory: 0.00 GB
Allocated Memory: 0.00 GB
Free Memory: 4.00 GB
✅ GPU test successful - tensor shape: torch.Size([100, 100])


2025-08-21 14:53:05,037 - INFO - GPU detected and tested: NVIDIA GeForce RTX 3050 Laptop GPU with 4.00 GB memory



✅ Directory structure verified
Final device for training: cuda


## 3. Load CodeRM-UnitTest Dataset

In [3]:
# Load the dataset from Hugging Face
logger.info("Loading CodeRM-UnitTest dataset from Hugging Face")

try:
    dataset = load_dataset("KAKA22/CodeRM-UnitTest")
    print(f"Dataset loaded successfully!")
    print(f"Dataset structure: {dataset}")
    
    # Use the train split only; the test split is not touched in this notebook
    train_data = dataset['train']
    print(f"Total samples in dataset: {len(train_data)}")
    
    logger.info(f"Dataset loaded with {len(train_data)} samples")
    
except Exception as e:
    logger.error(f"Error loading dataset: {str(e)}")
    print(f"Error: {str(e)}")
    raise

2025-08-21 14:53:26,329 - INFO - Loading CodeRM-UnitTest dataset from Hugging Face


README.md: 0.00B [00:00, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


unit_test_taco-train.parquet:   0%|          | 0.00/509M [00:00<?, ?B/s]


unit_test_codefeedback-filter.parquet:   0%|          | 0.00/778M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

2025-08-21 14:57:16,396 - INFO - Dataset loaded with 17562 samples


Dataset loaded successfully!
Dataset structure: DatasetDict({
    train: Dataset({
        features: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests'],
        num_rows: 17562
    })
    test: Dataset({
        features: ['task_id', 'question', 'code_ground_truth', 'code_generate', 'unit_tests'],
        num_rows: 62900
    })
})
Total samples in dataset: 17562


## 4. Explore Dataset Structure

In [4]:
# Examine the first few samples
print("=== Dataset Features ===")
print(train_data.features)

print("\n=== First Sample ===")
first_sample = train_data[0]
for key, value in first_sample.items():
    if isinstance(value, str) and len(value) > 200:
        print(f"{key}: {value[:200]}...")
    else:
        print(f"{key}: {value}")

logger.info("Dataset structure exploration completed")

2025-08-21 14:57:27,122 - INFO - Dataset structure exploration completed


=== Dataset Features ===
{'task_id': Value('int64'), 'question': Value('string'), 'code_ground_truth': Value('string'), 'code_generate': Value('string'), 'unit_tests': Value('string')}

=== First Sample ===
task_id: 0
question: There are $n$ candy boxes in front of Tania. The boxes are arranged in a row from left to right, numbered from $1$ to $n$. The $i$-th box contains $r_i$ candies, candies have the color $c_i$ (the colo...
code_ground_truth: def min_seconds_to_eat_candies(n, s, k, r, c):
    INF = 10000000000.0
    max_n = 50
    max_k = 2000
    
    s -= 1  # Convert to 0-based index
    buf = [''] * (max_n + 1)
    dp = [[0 for _ in ra...
code_generate: [{"sol_id": 0, "code": "def min_seconds_to_eat_candies(n, s, k, r, c):\n    \"\"\"\n    This function calculates the minimum number of seconds Tanya needs to eat at least k candies.\n    \n    Paramet...
unit_tests: [{"ut_id": 0, "code": "import unittest\n\nclass TestMinSecondsToEatCandies(unittest.TestCase):\n    \n    def test

In [6]:
# Examine unit tests structure
print("=== Unit Tests Structure ===")
if 'unit_tests' in first_sample:
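    unit_tests = first_sample['unit_tests']
    print(f"Type of unit_tests: {type(unit_tests)}")
    # NOTE: if unit_tests is a JSON-encoded string (as it is for this dataset),
    # len() returns the character count, not the number of tests.
    print(f"Number of unit tests for first sample: {len(unit_tests)}")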
    
    if len(unit_tests) > 0:
        print("\n=== First Unit Test ===")
        first_test = unit_tests[0]
        print(f"Type of first test: {type(first_test)}")
        
        # Handle if it's a string
        if isinstance(first_test, str):
            print(f"Unit test content: {first_test[:500]}...")  # Show first 500 chars
        # Handle if it's a dictionary
        elif isinstance(first_test, dict):
            for key, value in first_test.items():
                if isinstance(value, str) and len(value) > 200:
                    print(f"{key}: {value[:200]}...")
                else:
                    print(f"{key}: {value}")
        else:
            print(f"Unexpected type: {type(first_test)}")
            print(f"Content: {first_test}")

# Check data quality metrics (FAR/FRR); only meaningful when unit_tests holds dicts
print("\n=== Quality Metrics Analysis ===")
if 'unit_tests' in first_sample and len(first_sample['unit_tests']) > 0:
    # First check what structure we're dealing with
    sample_test = first_sample['unit_tests'][0]
    print(f"Sample test type: {type(sample_test)}")
    
    if isinstance(sample_test, dict):
        # If it's a dictionary, we can access FAR/FRR
        far_values = [test.get('FAR', 0) for test in first_sample['unit_tests'] if isinstance(test, dict)]
        frr_values = [test.get('FRR', 0) for test in first_sample['unit_tests'] if isinstance(test, dict)]
        if far_values and frr_values:
            print(f"FAR range in first sample: {min(far_values):.3f} - {max(far_values):.3f}")
            print(f"FRR range in first sample: {min(frr_values):.3f} - {max(frr_values):.3f}")
        else:
            print("No FAR/FRR values found in the unit tests")
    else:
        print("Unit tests are stored as strings, not dictionaries with FAR/FRR metrics")
        print("This means the dataset structure is different than expected")

=== Unit Tests Structure ===
Type of unit_tests: <class 'str'>
Number of unit tests for first sample: 189276

=== First Unit Test ===
Type of first test: <class 'str'>
Unit test content: [...

=== Quality Metrics Analysis ===
Sample test type: <class 'str'>
Unit tests are stored as strings, not dictionaries with FAR/FRR metrics
This means the dataset structure is different than expected
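
The output above shows that `unit_tests` arrives as one JSON-encoded string, which is why `len()` reports 189276 characters and index 0 returns `[`. The following is a minimal sketch of decoding it first, assuming the string is valid JSON holding a list of objects. The `raw_unit_tests` value is a made-up stand-in shaped like the truncated preview, `parse_unit_tests` is a hypothetical helper, and the `FAR`/`FRR` keys are only assumed to exist because the cell above looks for them.

```python
import json

# Hypothetical stand-in for first_sample['unit_tests']; the real value is a
# much longer JSON string with the same outer shape (a list of objects).
raw_unit_tests = json.dumps([
    {"ut_id": 0, "code": "import unittest\n...", "FAR": 0.1, "FRR": 0.0},
    {"ut_id": 1, "code": "import unittest\n...", "FAR": 0.3, "FRR": 0.2},
])

def parse_unit_tests(raw):
    """Decode a JSON string into a list of dicts; pass lists through unchanged."""
    if isinstance(raw, str):
        return json.loads(raw)
    return list(raw)

unit_tests = parse_unit_tests(raw_unit_tests)
print(f"Number of unit tests: {len(unit_tests)}")
print(f"Keys of first test: {sorted(unit_tests[0].keys())}")

# FAR is an assumed key; skip entries that do not carry it
far_values = [t["FAR"] for t in unit_tests if "FAR" in t]
if far_values:
    print(f"FAR range: {min(far_values):.3f} - {max(far_values):.3f}")
```

With the real data, the same helper would be applied to `first_sample['unit_tests']` before counting tests or reading any per-test fields.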


## 5. Sample 20k Records from Dataset

In [8]:
# Sample 20k records from the full dataset (or use all if less than 20k)
SAMPLE_SIZE = 20000
total_samples = len(train_data)

logger.info(f"Sampling {SAMPLE_SIZE} records from {total_samples} total samples")

if total_samples >= SAMPLE_SIZE:
    # Use random sampling to get diverse data
    np.random.seed(42)  # For reproducibility
    sample_indices = np.random.choice(total_samples, SAMPLE_SIZE, replace=False)
    sample_indices = sorted(sample_indices)  # Sort for efficient access
    
    sampled_data = train_data.select(sample_indices)
    effective_sample_size = SAMPLE_SIZE
    print(f"Successfully sampled {len(sampled_data)} records")
    
    logger.info(f"Sampled {len(sampled_data)} records using random sampling")
else:
    print(f"Dataset has only {total_samples} samples, using all available data")
    sampled_data = train_data
    effective_sample_size = total_samples
    
    logger.warning(f"Dataset smaller than requested sample size, using all {total_samples} samples")

print(f"Final dataset size for training: {effective_sample_size} samples")

2025-08-21 15:01:30,149 - INFO - Sampling 20000 records from 17562 total samples


Dataset has only 17562 samples, using all available data
Final dataset size for training: 17562 samples
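
The `train` split alone is smaller than the 20k target, so the cell above falls back to using every row. One possible way to still reach 20k would be to pool both published splits before sampling. The sketch below only illustrates the index bookkeeping with the standard library; it is not what this notebook does, and whether rows from the published `test` split should be mixed in is a project decision. The split sizes are taken from the `DatasetDict` printout above.

```python
import random

# Split sizes as reported by the DatasetDict printout
TRAIN_ROWS = 17562
TEST_ROWS = 62900
SAMPLE_SIZE = 20000

# Address every row by (split name, row index) so no data is copied yet
pool = [("train", i) for i in range(TRAIN_ROWS)] + [("test", i) for i in range(TEST_ROWS)]

rng = random.Random(42)  # fixed seed for reproducibility
sampled = rng.sample(pool, SAMPLE_SIZE)  # sampling without replacement

from_train = sum(1 for split, _ in sampled if split == "train")
print(f"Pool size: {len(pool)}")
print(f"Sampled: {len(sampled)} ({from_train} from train, {len(sampled) - from_train} from test)")
```

The resulting index pairs could then be grouped by split name and passed to `select` on each split, as the sampling cell does for `train_data`.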


## 6. Create 80/10/10 Data Splits

In [9]:
# Materialize the sampled rows as a list of dicts for index-based splitting
# (no DataFrame is actually built, despite the log message)
logger.info("Converting dataset to pandas DataFrame for splitting")

# Indexing a Dataset with an integer returns one row as a dict
data_list = [sampled_data[i] for i in range(len(sampled_data))]

print(f"Converted {len(data_list)} samples to list format")

# Create indices for splitting
indices = list(range(len(data_list)))

# First split: 80% train, 20% temp (which will become 10% val + 10% test)
train_indices, temp_indices = train_test_split(
    indices, test_size=0.2, random_state=42, shuffle=True
)

# Second split: Split the 20% into 10% val and 10% test
val_indices, test_indices = train_test_split(
    temp_indices, test_size=0.5, random_state=42, shuffle=True
)

print(f"Data split sizes:")
print(f"Train: {len(train_indices)} samples ({len(train_indices)/len(data_list)*100:.1f}%)")
print(f"Validation: {len(val_indices)} samples ({len(val_indices)/len(data_list)*100:.1f}%)")
print(f"Test: {len(test_indices)} samples ({len(test_indices)/len(data_list)*100:.1f}%)")

logger.info(f"Created splits - Train: {len(train_indices)}, Val: {len(val_indices)}, Test: {len(test_indices)}")

2025-08-21 15:01:41,923 - INFO - Converting dataset to pandas DataFrame for splitting
2025-08-21 15:02:00,277 - INFO - Created splits - Train: 14049, Val: 1756, Test: 1757


Converted 17562 samples to list format
Data split sizes:
Train: 14049 samples (80.0%)
Validation: 1756 samples (10.0%)
Test: 1757 samples (10.0%)
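
The validation and test sizes differ by one because 17562 does not divide evenly. The sketch below reproduces the reported sizes with plain arithmetic, assuming the held-out portion of each split is rounded up and the remainder goes to the other side, which is how I understand `train_test_split` to treat a fractional `test_size`. `two_stage_split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def two_stage_split_sizes(n_samples, temp_fraction=0.2, test_fraction_of_temp=0.5):
    """Return (train, val, test) sizes for an 80/20 split followed by a 50/50 split."""
    # Stage 1: the held-out "temp" portion is rounded up
    n_temp = math.ceil(temp_fraction * n_samples)
    n_train = n_samples - n_temp
    # Stage 2: the test half of temp is rounded up, validation takes the rest
    n_test = math.ceil(test_fraction_of_temp * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test

print(two_stage_split_sizes(17562))  # (14049, 1756, 1757), matching the sizes above
```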


In [10]:
# Create the actual data splits
train_split = [data_list[i] for i in train_indices]
val_split = [data_list[i] for i in val_indices]
test_split = [data_list[i] for i in test_indices]

print("Data splits created successfully!")
print(f"Train split: {len(train_split)} samples")
print(f"Validation split: {len(val_split)} samples")
print(f"Test split: {len(test_split)} samples")

Data splits created successfully!
Train split: 14049 samples
Validation split: 1756 samples
Test split: 1757 samples


## 7. Save Preprocessed Splits

In [11]:
# Save splits using pickle for Python objects
data_dir = Path('../data')
data_dir.mkdir(exist_ok=True)

logger.info("Saving data splits to disk")

# Save splits
with open(data_dir / 'train_split.pkl', 'wb') as f:
    pickle.dump(train_split, f)
    
with open(data_dir / 'val_split.pkl', 'wb') as f:
    pickle.dump(val_split, f)
    
with open(data_dir / 'test_split.pkl', 'wb') as f:
    pickle.dump(test_split, f)

print("Data splits saved successfully!")

# Save metadata
metadata = {
    'total_samples': len(data_list),
    'train_size': len(train_split),
    'val_size': len(val_split),
    'test_size': len(test_split),
    'sample_size': SAMPLE_SIZE,  # requested size; total_samples holds the size actually used
    'original_dataset_size': total_samples,
    'split_ratio': '80/10/10',
    'random_seed': 42,
    'created_at': datetime.now().isoformat(),
    'dataset_name': 'KAKA22/CodeRM-UnitTest'
}

with open(data_dir / 'metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"Metadata saved to {data_dir / 'metadata.json'}")
logger.info(f"All data splits and metadata saved to {data_dir}")

2025-08-21 15:02:20,063 - INFO - Saving data splits to disk
2025-08-21 15:02:59,313 - INFO - All data splits and metadata saved to ..\data


Data splits saved successfully!
Metadata saved to ..\data\metadata.json
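
Pickle files can only be read from Python and should not be loaded from untrusted sources. Because every record here is a flat dict of integers and strings, a JSON Lines file is one possible alternative. The sketch below is an illustration under that assumption, not what this notebook saves; `save_jsonl`, `load_jsonl`, and the demo records are hypothetical, with only the column names taken from the dataset.

```python
import json
import tempfile
from pathlib import Path

def save_jsonl(records, path):
    """Write one JSON object per line; json.dumps escapes embedded newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Tiny made-up records that reuse the dataset's column names
demo_split = [
    {"task_id": 0, "question": "q0", "code_ground_truth": "def f():\n    pass",
     "code_generate": "[]", "unit_tests": "[]"},
    {"task_id": 1, "question": "q1", "code_ground_truth": "def g():\n    pass",
     "code_generate": "[]", "unit_tests": "[]"},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "train_split.jsonl"
    save_jsonl(demo_split, out_path)
    restored = load_jsonl(out_path)

print(restored == demo_split)  # True
```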


## 8. Verify Saved Data

In [12]:
# Verify the saved data by loading it back
logger.info("Verifying saved data integrity")

try:
    # Load splits back
    with open(data_dir / 'train_split.pkl', 'rb') as f:
        loaded_train = pickle.load(f)
        
    with open(data_dir / 'val_split.pkl', 'rb') as f:
        loaded_val = pickle.load(f)
        
    with open(data_dir / 'test_split.pkl', 'rb') as f:
        loaded_test = pickle.load(f)
    
    # Load metadata
    with open(data_dir / 'metadata.json', 'r') as f:
        loaded_metadata = json.load(f)
    
    print("=== Verification Results ===")
    print(f"Train split loaded: {len(loaded_train)} samples")
    print(f"Val split loaded: {len(loaded_val)} samples")
    print(f"Test split loaded: {len(loaded_test)} samples")
    print(f"\nMetadata:")
    for key, value in loaded_metadata.items():
        print(f"  {key}: {value}")
    
    # Quick integrity check
    assert len(loaded_train) == len(train_split), "Train split size mismatch"
    assert len(loaded_val) == len(val_split), "Val split size mismatch"
    assert len(loaded_test) == len(test_split), "Test split size mismatch"
    
    print("\n✅ Data integrity verification passed!")
    logger.info("Data integrity verification completed successfully")
    
except Exception as e:
    print(f"❌ Verification failed: {str(e)}")
    logger.error(f"Data verification failed: {str(e)}")
    raise

2025-08-21 15:03:11,246 - INFO - Verifying saved data integrity
2025-08-21 15:03:17,258 - INFO - Data integrity verification completed successfully


=== Verification Results ===
Train split loaded: 14049 samples
Val split loaded: 1756 samples
Test split loaded: 1757 samples

Metadata:
  total_samples: 17562
  train_size: 14049
  val_size: 1756
  test_size: 1757
  sample_size: 20000
  original_dataset_size: 17562
  split_ratio: 80/10/10
  random_seed: 42
  created_at: 2025-08-21T15:02:59.307437
  dataset_name: KAKA22/CodeRM-UnitTest

✅ Data integrity verification passed!
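
The check above compares split lengths only. A slightly stronger check, sketched below, confirms that no record appears in more than one split. It assumes `task_id` uniquely identifies a record, which the exploration cells do not confirm; `find_split_overlaps` and the demo rows are hypothetical.

```python
def find_split_overlaps(splits, key="task_id"):
    """Return the key values shared by each pair of splits."""
    ids = {name: {row[key] for row in rows} for name, rows in splits.items()}
    names = list(ids)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlaps[(first, second)] = ids[first] & ids[second]
    return overlaps

# Made-up rows; task_id 3 is deliberately leaked into two splits
demo_splits = {
    "train": [{"task_id": 0}, {"task_id": 1}, {"task_id": 3}],
    "val": [{"task_id": 2}],
    "test": [{"task_id": 3}, {"task_id": 4}],
}

overlaps = find_split_overlaps(demo_splits)
for pair, shared in overlaps.items():
    status = "OK" if not shared else f"LEAK {sorted(shared)}"
    print(f"{pair[0]} / {pair[1]}: {status}")
```

With the real data, the same function could be called on `{"train": loaded_train, "val": loaded_val, "test": loaded_test}` and each returned set asserted to be empty.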


## 9. Summary and Next Steps

In [13]:
print("=== Dataset Preparation Summary ===")
print(f"✅ Loaded CodeRM-UnitTest dataset ({total_samples} total samples)")
print(f"✅ Sampled {len(data_list)} records for training")
print(f"✅ Created 80/10/10 splits:")
print(f"   - Train: {len(train_split)} samples")
print(f"   - Validation: {len(val_split)} samples")
print(f"   - Test: {len(test_split)} samples")
print(f"✅ Saved all splits to {data_dir}")
print(f"✅ Data integrity verified")

print("\n=== Files Created ===")
for file_path in data_dir.glob('*'):
    file_size = file_path.stat().st_size / 1024 / 1024  # MB
    print(f"  {file_path.name}: {file_size:.2f} MB")

print("\n=== Next Steps ===")
print("1. ✅ Step 2 Complete: Dataset loading and splitting")
print("2. 🔄 Step 3: Data preprocessing and tokenization")
print("3. ⏳ Step 4: Model loading and configuration")
print("4. ⏳ Step 5: QLoRA/PEFT setup")

logger.info("Dataset preparation completed successfully!")
logger.info(f"Ready to proceed with Step 3: Data preprocessing")

2025-08-21 15:03:20,325 - INFO - Dataset preparation completed successfully!
2025-08-21 15:03:20,327 - INFO - Ready to proceed with Step 3: Data preprocessing


=== Dataset Preparation Summary ===
✅ Loaded CodeRM-UnitTest dataset (17562 total samples)
✅ Sampled 17562 records for training
✅ Created 80/10/10 splits:
   - Train: 14049 samples
   - Validation: 1756 samples
   - Test: 1757 samples
✅ Saved all splits to ..\data
✅ Data integrity verified

=== Files Created ===
  metadata.json: 0.00 MB
  test_split.pkl: 273.70 MB
  train_split.pkl: 2197.99 MB
  val_split.pkl: 274.79 MB

=== Next Steps ===
1. ✅ Step 2 Complete: Dataset loading and splitting
2. 🔄 Step 3: Data preprocessing and tokenization
3. ⏳ Step 4: Model loading and configuration
4. ⏳ Step 5: QLoRA/PEFT setup
