# Soc25: AI Perf (Commits -> Benchmarks)

---

## Table of Contents

| Section Number | Section Title                     | Description                                          |
|----------------|---------------------------------|----------------------------------------------------|
| 1              | [Data Collection](#markdown-header-data-collection)                 | Fetching and aggregating commit messages from GitHub repositories. |
| 2              | [Data Processing & Labeling](#data-processing--labeling)     | Cleaning, encoding labels, and splitting data for training and evaluation. |
| 3              | [Model Architecture](#model-architecture)             | Defining the DistilBERT-based classifier network.  |
| 4              | [Model Training and Evaluation](#markdown-header-model-training-and-evaluation)     | Iterative training, loss calculation, validation, and saving the best model. |
| 5              | [Inference & Testing](#inference--testing)             | Loading the trained model and running predictions on new commit messages. |

---

In [1]:
# # # !pip install peft==0.7.1 transformers==4.31.0 accelerate==0.24.1 
# !pip uninstall -y accelerate transformers
# !pip install --upgrade transformers accelerate


In [2]:
import torch

if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))
    print("CUDA Capability:", torch.cuda.get_device_capability(0))
    print("Memory Total (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2))
    print("Multi-processors:", torch.cuda.get_device_properties(0).multi_processor_count)
else:
    print("No CUDA GPU detected.")

GPU Name: NVIDIA A100 80GB PCIe
CUDA Capability: (8, 0)
Memory Total (GB): 84.97
Multi-processors: 108


In [3]:
import time
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"

# Load tokenizer and model (use float16 if your GPU supports it)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Avoid warnings

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # Automatically place on GPU(s)
    torch_dtype=torch.float16,   # Use float16 if possible for efficiency
)
model.config.pad_token_id = tokenizer.eos_token_id  # For generation padding

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
prompt = """<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Explain DaCapo in Adoptium.

[/INST]"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    start = time.time()
    outputs = model.generate(
        **inputs,
        max_new_tokens=80,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
    end = time.time()

print(f"Elapsed: {end - start:.2f} sec")
print(f"Output length: {outputs.shape[1]}")
print(f"Tokens/sec: {outputs.shape[1] / (end - start):.2f}")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Elapsed: 3.10 sec
Output length: 121
Tokens/sec: 39.07
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Explain DaCapo in Adoptium.

[/INST]

<s>[INST] <<SYS>>
DaCapo is a benchmark suite that consists of a set of Java applications that have been designed to represent a variety of tasks that are common in real-world applications. It is used to evaluate the performance of a Java virtual machine (JVM), and to measure the efficiency of the JVM's optimizer. 




In [None]:
import json

training = [
    ("What is DaCapo in Adoptium?", "DaCapo is a benchmark suite used by Adoptium to evaluate Java runtime performance across various workloads."),
    ("Why is DaCapo important in Adoptium?", "DaCapo provides real-world benchmarks that help developers ensure the performance and stability of Java builds in Adoptium."),
    ("How does Adoptium use DaCapo?", "Adoptium uses DaCapo as part of its QA process to run performance tests and compare runtime behavior across builds."),
    ("Name some benchmarks included in DaCapo.", "Some of the benchmarks in DaCapo include Eclipse, H2, Luindex, Lusearch, Xalan, and others representing real-world Java applications."),
    ("What is the Eclipse benchmark in DaCapo?", "The Eclipse benchmark in DaCapo simulates a typical integrated development environment (IDE) workload by running a batch build in the Eclipse Java IDE. It is used to evaluate the performance of compilers and build systems."),
    ("What does the H2 benchmark in DaCapo test?","The H2 benchmark evaluates the performance of the H2 Java SQL database engine, testing tasks such as query execution and data manipulation to simulate database workloads."),
    ("What is the Luindex benchmark in DaCapo?","Luindex benchmarks the indexing phase of the Lucene search engine. It measures performance when indexing a large set of documents, representing a write-heavy search engine workload."),
    ("What does the Lusearch benchmark in DaCapo simulate?","Lusearch represents the search phase of Lucene, focusing on how efficiently a Java application can perform full-text search operations over indexed content."),
    ("What kind of workload does the Xalan benchmark in DaCapo represent?","The Xalan benchmark measures the performance of XSLT transformations in Java, simulating applications that convert XML data using XSL stylesheets."),
    ("What does the Tradebeans benchmark in DaCapo test?","Tradebeans is a J2EE benchmark in DaCapo that exercises EJB (Enterprise Java Beans) and simulates online stock trading operations, testing application server and transaction performance."),
    ("Why are Lucene-based benchmarks included in DaCapo?","Lucene-based benchmarks like Luindex and Lusearch are included because Lucene is a widely used open-source Java search engine. These benchmarks help test indexing and search performance in real-world Java applications."),
    ("Is the H2 database used outside benchmarking?","Yes, H2 is a lightweight, embedded Java SQL database widely used in development and testing environments due to its ease of use and fast startup times."),
    ("What kind of applications benefit from the Eclipse benchmark data?","Applications involving Java code compilation, build systems, or integrated development environments benefit from Eclipse benchmark data, as it reflects real-world developer usage patterns.</s>")
]

with open("dacapo_train.jsonl", "w") as f:
    for question, answer in training:
        entry = {
            "text": f"<s>[INST] {question} [/INST] {answer}</s>"
        }
        f.write(json.dumps(entry) + "\n")

In [21]:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

In [25]:
import torch
import time
import os
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

# Set environment variable for better debugging
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"

# 1. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 2. Load model with optimizations
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    trust_remote_code=True,
)

# Ensure pad_token_id is set consistently
model.config.pad_token_id = tokenizer.eos_token_id

# 3. Load dataset from your jsonl file
dataset = load_dataset("json", data_files={"train": "dacapo_train.jsonl"})

# 4. Fixed tokenization function with proper validation
def tokenize_function(examples):
    # Tokenize the text
    tokenized = tokenizer(
        examples["text"],
        padding=False,
        truncation=True,
        max_length=512,
        return_tensors=None,  # Return lists, not tensors
    )
    
    # Validate and fix token IDs
    vocab_size = tokenizer.vocab_size
    fixed_input_ids = []
    fixed_attention_mask = []
    
    for i, input_ids in enumerate(tokenized["input_ids"]):
        # Handle None values and out-of-range tokens
        if input_ids is None:
            input_ids = [tokenizer.bos_token_id, tokenizer.eos_token_id]
        
        # Fix out-of-range tokens
        fixed_ids = []
        for token_id in input_ids:
            if token_id is None:
                fixed_ids.append(tokenizer.unk_token_id)
            elif token_id >= vocab_size:
                # Token 32013 is likely a special token - replace with EOS
                fixed_ids.append(tokenizer.eos_token_id)
            else:
                fixed_ids.append(token_id)
        
        fixed_input_ids.append(fixed_ids)
        
        # Fix attention mask
        if tokenized["attention_mask"][i] is None:
            fixed_attention_mask.append([1] * len(fixed_ids))
        else:
            fixed_attention_mask.append(tokenized["attention_mask"][i])
    
    return {
        "input_ids": fixed_input_ids,
        "attention_mask": fixed_attention_mask
    }

# Apply tokenization with debugging
print("Tokenizing dataset...")
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# 5. Validate the tokenized dataset
print("Validating tokenized dataset...")
train_dataset = tokenized_dataset["train"]
vocab_size = tokenizer.vocab_size

# Check a few samples with proper None handling
for i in range(min(3, len(train_dataset))):
    sample = train_dataset[i]
    input_ids = sample["input_ids"]
    
    # Handle None values and get max token
    if input_ids is None or len(input_ids) == 0:
        max_token = 0
    else:
        # Filter out None values before finding max
        valid_tokens = [token for token in input_ids if token is not None]
        max_token = max(valid_tokens) if valid_tokens else 0
    
    print(f"Sample {i}: Max token ID = {max_token}, Vocab size = {vocab_size}")
    if max_token >= vocab_size:
        print(f"ERROR: Token ID {max_token} still exceeds vocabulary size!")
        break
    
print("Dataset validation complete!")

# 6. Setup training arguments with A100 optimizations
training_args = TrainingArguments(
    output_dir="./finetuned-deepseek-coder",
    per_device_train_batch_size=2,  # A100 can handle larger batches
    gradient_accumulation_steps=2,   # Effective batch size of 4
    num_train_epochs=2,
    save_steps=500,
    save_total_limit=2,
    logging_steps=10,
    bf16=True,  # A100 supports bfloat16
    fp16=False,
    gradient_checkpointing=True,
    learning_rate=5e-6,
    warmup_steps=50,
    weight_decay=0.01,
    dataloader_pin_memory=True,
    remove_unused_columns=False,
    report_to="none",
    seed=42,  # Set explicit seed for reproducibility
    data_seed=42,
    # Optimization for A100
    dataloader_num_workers=4,
    # evaluation_strategy="steps",
    # eval_steps=500,
)

# 7. Data collator with proper padding
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm=False,
    pad_to_multiple_of=8,
)

# 8. Initialize Trainer with error handling
print("Initializing trainer...")
try:
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        data_collator=data_collator,
    )
    print("Trainer initialized successfully!")
except Exception as e:
    print(f"Error initializing trainer: {e}")
    # Additional debugging
    print(f"Model vocabulary size: {model.config.vocab_size}")
    print(f"Tokenizer vocabulary size: {tokenizer.vocab_size}")
    print(f"Model pad_token_id: {model.config.pad_token_id}")
    print(f"Tokenizer pad_token_id: {tokenizer.pad_token_id}")
    raise

# 9. Train with timing and error handling
print("Starting training...")
start_time = time.time()
try:
    trainer.train()
    end_time = time.time()
    print(f"\n✅ Training completed in {(end_time - start_time) / 60:.2f} minutes")
except Exception as e:
    print(f"Training error: {e}")
    # Clear CUDA cache and retry with smaller batch size
    torch.cuda.empty_cache()
    print("Retrying with smaller batch size...")
    training_args.per_device_train_batch_size = 1
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        data_collator=data_collator,
    )
    trainer.train()

# 10. Save model
print("Saving model...")
trainer.save_model("./finetuned-deepseek-coder")
tokenizer.save_pretrained("./finetuned-deepseek-coder")
print("Model saved successfully!")

# 11. Test the model
def test_model():
    print("\n🧪 Testing fine-tuned model...")
    test_prompt = "<s>[INST] What is DaCapo in Adoptium? [/INST]"
    inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Model response: {response}")

# Uncomment to test after training
# test_model()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Tokenizing dataset...


Map:   0%|          | 0/13 [00:00<?, ? examples/s]

Validating tokenized dataset...
Sample 0: Max token ID = 32021, Vocab size = 32000
ERROR: Token ID 32021 still exceeds vocabulary size!
Dataset validation complete!
Initializing trainer...
Error initializing trainer: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Model vocabulary size: 32256
Tokenizer vocabulary size: 32000
Model pad_token_id: 32021
Tokenizer pad_token_id: 32014


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [16]:
from transformers.models.dac import DacConfig, DacForCausalLM

checkpoint_path = "./finetuned-dacapo"

# Load config explicitly
config = DacConfig.from_pretrained(checkpoint_path)

# Load model with this config
model = DacForCausalLM.from_pretrained(checkpoint_path, config=config)

model.to("cuda")

tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path)
model.to("cuda")
prompt = "What is DaCapo?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



ImportError: cannot import name 'DacForCausalLM' from 'transformers.models.dac' (/data1/phdiep/myenv/lib/python3.10/site-packages/transformers/models/dac/__init__.py)

In [8]:
print("Tokenizer vocab size:", tokenizer.vocab_size)
print("Model vocab size:", model.config.vocab_size)


Tokenizer vocab size: 32000
Model vocab size: 32256


In [None]:
import torch

print(f"Using device: {device}")
print(f"Allocated: {torch.cuda.memory_allocated(device) / 1024**3:.2f} GB")
print(f"Cached:    {torch.cuda.memory_reserved(device) / 1024**3:.2f} GB")