# Soc25: AI Perf

---

## Table of Contents

| Section Number | Section Title | Description |
|----------------|-------------------------------|-----------------------------------------|
| 1 | [Data Collection](#markdown-header-data-collection) | Q&A data for fine-tuning. |
| 2 | [Model Architecture](#model-architecture) | DeepSeek-Coder 6.7B fine-tuning via LoRA. |
| 3 | [Model Training and Evaluation](#markdown-header-model-training-and-evaluation) | 1 epoch, early stopping enabled. |
| 4 | [Inference & Testing](#inference--testing) | Optimized generation & post-processing. |

---

In [40]:
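# Note for Phat: uncomment on a fresh environment. bitsandbytes is needed for the 4-bit loading and the paged 8-bit optimizer used below.
# !pip install torch transformers peft datasets tensorboard accelerate bitsandbytes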

In [2]:
import torch

if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))
    print("CUDA Capability:", torch.cuda.get_device_capability(0))
    print("Memory Total (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2))
    print("Multi-processors:", torch.cuda.get_device_properties(0).multi_processor_count)
    print(torch.version.cuda)
else:
    print("No CUDA GPU detected.")

GPU Name: NVIDIA A100 80GB PCIe
CUDA Capability: (8, 0)
Memory Total (GB): 84.97
Multi-processors: 108
12.6
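
The "80GB" card reports 84.97 because the cell divides by 1e9 (decimal gigabytes). A minimal sketch of the conversion to binary GiB follows; the byte count is an assumption chosen to match the printed value, not a figure read from the device.

```python
def bytes_to_units(total_bytes: int) -> tuple[float, float]:
    """Return (decimal GB, binary GiB) for a byte count."""
    return total_bytes / 1e9, total_bytes / 2**30

# Assumption: a byte count that reproduces the 84.97 GB printed above.
reported_bytes = int(84.97e9)
gb, gib = bytes_to_units(reported_bytes)
print(f"{gb:.2f} GB = {gib:.2f} GiB")
```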


In [4]:
import json

# Your original 13 training examples
original_training_data = [
    ("What is DaCapo in Adoptium?", "DaCapo is a benchmark suite used by Adoptium to evaluate Java runtime performance across various workloads."),
    ("Why is DaCapo important in Adoptium?", "DaCapo provides real-world benchmarks that help developers ensure the performance and stability of Java builds in Adoptium."),
    ("How does Adoptium use DaCapo?", "Adoptium uses DaCapo as part of its QA process to run performance tests and compare runtime behavior across builds."),
    ("Name some benchmarks included in DaCapo.", "Some of the benchmarks in DaCapo include Eclipse, H2, Luindex, Lusearch, Xalan, and others representing real-world Java applications."),
    ("What is the Eclipse benchmark in DaCapo?", "The Eclipse benchmark simulates a typical IDE workload by building the Eclipse Java IDE, evaluating compiler and build-system performance."),
    ("What does the H2 benchmark in DaCapo test?", "It evaluates performance of the H2 Java SQL database engine—covering query execution and data manipulation."),
    ("What is the Luindex benchmark in DaCapo?", "Luindex measures the indexing phase of the Lucene search engine, stressing write-heavy search engine workflows."),
    ("What does the Lusearch benchmark in DaCapo simulate?","Lusearch benchmarks full-text search operations over indexed content using Lucene."),
    ("What workload does the Xalan benchmark in DaCapo represent?", "Xalan benchmarks XSLT transformations of XML data using the Xalan processor."),
    ("What does the Tradebeans benchmark in DaCapo test?", "Tradebeans exercises EJB workload simulating online stock trading operations, testing application server and transaction performance."),
    ("Why are Lucene-based benchmarks included in DaCapo?", "Because Lucene is widely used, these benchmarks cover both indexing and searching real-world Java search engine use cases."),
    ("Is the H2 database used outside benchmarking?", "Yes—H2 is a lightweight, embedded Java SQL database commonly used in development and testing."),
    ("What applications benefit from the Eclipse benchmark data?", "Java compilers, build systems, IDEs, and development tools benefit from Eclipse benchmark insights.")
]

# Data extracted from image_c88061.jpg (H2 benchmark description)
h2_description_qa = [
    ("What kind of workload is the H2 benchmark?", "The H2 benchmark workload is latency-sensitive and executes a TPC-C-like transactional workload over the H2 database configured for in-memory operation."),
    ("How many lines of Java source code does h2 have?", "H2 has about 240 K lines of Java source code."),
    ("What are the heap sizes for H2 in DaCapo?", "H2 has the largest heap sizes for default, large, and vlarge configurations: 681 MB, 10.2 GB, and 20.6 GB respectively."),
    ("What is GTO in the context of H2 benchmark?", "H2 has very low memory turnover (GTO)."),
    ("How sensitive is H2 to DRAM speeds?", "H2 has the highest sensitivity to slower DRAM speeds (PMS)."),
    ("What kind of cache miss rates does H2 exhibit?", "It has high DTLB and data cache miss rates (UDT, UDC)."),
    ("Does H2 have high SMT contention?", "Yes, H2 has high SMT contention (USC)."),
    ("How much time does H2 spend in kernel mode?", "H2 spends very little time in kernel mode (PKP).")
]

# Data extracted from image_c87fa9.jpg (Benchmark Descriptions table)
# Note: I'm focusing on the 'Description' column and creating one Q&A per entry.
# You can expand on these significantly by asking more varied questions about each.
table_descriptions_qa = [
    ("What does AOA benchmark?", "AOA benchmarks nominal average object size (bytes)."),
    ("What does AOM benchmark?", "AOM benchmarks nominal average object size (bytes)."),
    ("What does AOS benchmark?", "AOS benchmarks nominal average object size (bytes)."),
    ("What does AAL benchmark?", "AAL benchmarks nominal allocation rate by bytes / uses."),
    ("What does AAUS benchmark?", "AAUS benchmarks nominal allocated object size (bytes) / uses."),
    ("What does BAF benchmark?", "BAF benchmarks nominal aastore per usec."),
    ("What does BGF benchmark?", "BGF benchmarks nominal execution focus / dominance of hot code."),
    ("What does BPF benchmark?", "BPF benchmarks nominal bytecodes per usec."),
    ("What does BUB benchmark?", "BUB benchmarks nominal pushfield per usec."),
    ("What does CCA benchmark?", "CCA benchmarks nominal thousands of unique bytecodes executed."),
    ("What does GCA benchmark?", "GCA benchmarks nominal thousands of unique function calls."),
    ("What does GCC benchmark?", "GCC benchmarks nominal average post-GC heap size as percent of min heap, when run at 2X min heap with G1."),
    ("What does GCM benchmark?", "GCM benchmarks nominal GC count at 2X heap size (G1)."),
    ("What does GCP benchmark?", "GCP benchmarks nominal post-GC heap size as percent of min heap, when run at 2X min heap with G1."),
    ("What does GLK benchmark?", "GLK benchmarks nominal percentage of time spent in GC pauses at 2X heap size (G1)."),
    ("What does GML benchmark?", "GML benchmarks nominal percent 10th iteration memory leakage."),
    ("What is GMD in DaCapo benchmarks?", "GMD is the nominal minimum heap size (MB) for default size configuration (with compressed pointers)."),
    ("What is GML (large) in DaCapo benchmarks?", "GML (large) is the nominal minimum heap size (MB) for large size configuration (with compressed pointers)."),
    ("What is GMS (default) in DaCapo benchmarks?", "GMS (default) is the nominal minimum heap size (MB) for small size configuration (with compressed pointers)."),
    ("What is GMU (default) in DaCapo benchmarks?", "GMU (default) is the nominal minimum heap size (MB) for default size without compressed pointers."),
    ("What is GMV (large) in DaCapo benchmarks?", "GMV (large) is the nominal minimum heap size (MB) for vlarge size configuration (with compressed pointers)."),
    ("What does CSS benchmark?", "CSS benchmarks nominal heap size sensitivity (slowdown with tight heap as percentage)."),
    ("What does GTO benchmark?", "GTO benchmarks nominal memory turnover (total alloc bytes / min heap bytes)."),
    ("What does PTC benchmark?", "PTC benchmarks nominal percentage slowdown due to aggressive C2 compilation compared to baseline (compiler cost)."),
    ("What does PCS benchmark?", "PCS benchmarks nominal percentage slowdown due to worst compiler configuration compared to best (sensitivity to compiler)."),
    ("What does PET benchmark?", "PET benchmarks nominal execution time (sec)."),
    ("What does PFS benchmark?", "PFS benchmarks nominal percentage speedup due to enabling frequency scaling (CPU frequency sensitivity)."),
    ("What does PIN benchmark?", "PIN benchmarks nominal percentage slowdown due to using the interpreter (sensitivity to interpreter)."),
    ("What does PKP benchmark?", "PKP benchmarks nominal percentage of time spent in kernel mode (as percentage of user time)."),
    ("What does PLS benchmark?", "PLS benchmarks nominal percentage slowdown due to 1/16 reduction of LLC capacity (LLC sensitivity)."),
    ("What does PMS benchmark?", "PMS benchmarks nominal percentage slowdown due to slower memory (memory speed sensitivity)."),
    ("What does PPE benchmark?", "PPE benchmarks nominal parallel efficiency (speedup as percentage of ideal speedup for 32 threads)."),
    ("What does PSD benchmark?", "PSD benchmarks nominal standard deviation among invocations at peak performance (as percentage of performance)."),
    ("What does PUU benchmark?", "PUU benchmarks nominal iterations to warm up to within 1.5% of best."),
    ("What does UAA benchmark?", "UAA benchmarks nominal percentage change (slowdown) when running on ARM Cavium ThunderX v AMD Zen4."),
    ("What does UAI benchmark?", "UAI benchmarks nominal percentage change (slowdown) when running on Intel Alderlake v AMD Zen4."),
    ("What does UBC benchmark?", "UBC benchmarks nominal backend bound (CPU)."),
    ("What does UBP benchmark?", "UBP benchmarks nominal bad speculation: mispredicts."),
    ("What does UBS benchmark?", "UBS benchmarks nominal bad speculation: pipeline restarts."),
    ("What does UDC benchmark?", "UDC benchmarks nominal bad speculation."),
    ("What does UF benchmark?", "UF benchmarks nominal data cache misses per K instructions."),
    ("What does UHP benchmark?", "UHP benchmarks nominal DTLB misses per K instructions."),
    ("What does UIP benchmark?", "UIP benchmarks nominal 100 instructions per cycle (IPC)."),
    ("What does ULL benchmark?", "ULL benchmarks nominal LLC misses M instructions."),
    ("What does USC benchmark?", "USC benchmarks nominal L1X back end bound."),
    ("What does USF benchmark?", "USF benchmarks nominal L1Y contention."),
    ("What does USM benchmark?", "USM benchmarks nominal L1X front end bound.")
]


# Combine all data
all_training_data = original_training_data + h2_description_qa + table_descriptions_qa

file_path = "dacapo_train.jsonl"

try:
    with open(file_path, "w", encoding="utf-8") as f:
        for question, answer in all_training_data:
            entry = {"text": f"<s>[INST] {question} [/INST] {answer}</s>"}
            f.write(json.dumps(entry) + "\n") # Use \n for jsonl
    print(f"Successfully created {file_path} with {len(all_training_data)} entries.")
except Exception as e:
    print(f"An error occurred: {e}")

Successfully created dacapo_train.jsonl with 68 entries.
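
Before training, it can help to read the file back and check that every record follows the `<s>[INST] question [/INST] answer</s>` template and that no question appears twice. The sketch below is one way to do that; `validate_jsonl` and the demo rows are hypothetical helpers written for illustration, not part of the original notebook.

```python
import json
import re
import tempfile
from pathlib import Path

# Template used when the training file was written.
RECORD_PATTERN = re.compile(
    r"^<s>\[INST\] (?P<question>.+?) \[/INST\] (?P<answer>.+)</s>$", re.DOTALL
)

def validate_jsonl(path):
    """Return (record_count, problems) for a JSONL training file."""
    problems = []
    seen_questions = set()
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            count += 1
            match = RECORD_PATTERN.match(json.loads(line)["text"])
            if match is None:
                problems.append(f"line {line_no}: unexpected template")
                continue
            question = match.group("question")
            if question in seen_questions:
                problems.append(f"line {line_no}: duplicate question {question!r}")
            seen_questions.add(question)
    return count, problems

# Demo on a throwaway file: one valid row, one duplicate question, one malformed row.
demo_texts = [
    "<s>[INST] What does PET benchmark? [/INST] PET benchmarks nominal execution time (sec).</s>",
    "<s>[INST] What does PET benchmark? [/INST] A second answer to the same question.</s>",
    "What does PET benchmark? PET benchmarks nominal execution time (sec).",
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        "".join(json.dumps({"text": t}) + "\n" for t in demo_texts), encoding="utf-8"
    )
    demo_count, demo_problems = validate_jsonl(demo_path)

print(demo_count, demo_problems)
```

Calling `validate_jsonl("dacapo_train.jsonl")` on the real file should report 68 records if the write step succeeded.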


In [5]:
import json
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig,
    EarlyStoppingCallback,
    Trainer, # <-- Import Trainer
    DataCollatorForLanguageModeling # <-- Import DataCollator
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset, DatasetDict

# --- Configuration ---
MODEL_NAME = "deepseek-ai/deepseek-coder-6.7b-instruct"
DATASET_PATH = "dacapo_train.jsonl"
OUTPUT_DIR = "./dacapo_finetuned_model"

# LoRA configuration
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05

# Training arguments
LEARNING_RATE = 2e-4
BATCH_SIZE_PER_GPU = 2
GRADIENT_ACCUMULATION_STEPS = 4
NUM_TRAIN_EPOCHS = 1
MAX_SEQ_LENGTH = 512 # Keep this defined

# --- Quantization Configuration ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# --- Load Model and Tokenizer ---
print(f"Loading model: {MODEL_NAME}")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Important for causal LMs

# Set the tokenizer's model_max_length
tokenizer.model_max_length = MAX_SEQ_LENGTH

# --- Prepare Model for LoRA and Quantization ---
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# --- Load and Prepare Dataset for Trainer ---
print(f"Loading dataset from: {DATASET_PATH}")
full_dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
train_test_split = full_dataset.train_test_split(test_size=0.2, seed=42) # Adjust test_size as needed
dataset = DatasetDict({
    'train': train_test_split['train'],
    'validation': train_test_split['test'] # Using 'test' as 'validation'
})

# Define a tokenization and label generation function
def tokenize_function(examples):
    # Tokenize the text
    tokenized_inputs = tokenizer(
        examples["text"],
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
        padding="max_length", # Pad to max_length for consistent tensor shapes
    )
    # For causal language modeling, labels are just the input_ids
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
    return tokenized_inputs

# Apply the tokenization to the dataset
# This will add 'input_ids', 'attention_mask', and 'labels' columns
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4, # Use multiple processes for faster tokenization if CPU cores allow
    remove_columns=["text"] # Remove the original text column
)

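# --- Data Collator ---
# With mlm=False this collator rebuilds labels from input_ids (replacing the "labels" column created above)
# and sets every position equal to tokenizer.pad_token_id to -100. Because pad_token is the EOS token here,
# EOS positions are excluded from the loss as well. The inputs are already padded to MAX_SEQ_LENGTH,
# so no dynamic padding takes place.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)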


# --- Training Arguments ---
training_arguments = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=BATCH_SIZE_PER_GPU,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    logging_steps=10,
    save_steps=100,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    tf32=True,
    report_to="tensorboard",
    push_to_hub=False,
    save_strategy="steps",
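    eval_strategy="steps", # Evaluate every 'eval_steps'
    # Note: 54 training examples with batch size 2 and 4 accumulation steps give only about 7 optimizer
    # steps in one epoch, so an interval of 10 is never reached and early stopping cannot trigger.
    eval_steps=10,               # How often to evaluate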
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# --- Trainer ---
trainer = Trainer( # <-- Using transformers.Trainer
    model=model,
    args=training_arguments,
    train_dataset=tokenized_dataset["train"], # Use train split
    eval_dataset=tokenized_dataset["validation"], # Use validation split
    data_collator=data_collator, # <-- Pass the data collator
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] # Stop if eval_loss doesn't improve for 3 evaluations
)
# --- Train ---
print("Starting training...")
trainer.train()

# --- Save Model ---
print("Saving fine-tuned model...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print(f"Finetuning complete! Model saved to {OUTPUT_DIR}")

Loading model: deepseek-ai/deepseek-coder-6.7b-instruct



trainable params: 39,976,960 || all params: 6,780,489,728 || trainable%: 0.5896
Loading dataset from: dacapo_train.jsonl



No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting training...




Step,Training Loss,Validation Loss


Saving fine-tuned model...
Finetuning complete! Model saved to ./dacapo_finetuned_model
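
The loss table above has no rows. A rough step count, using the numbers from the configuration and the log, suggests why: the run likely ends before the first logging or evaluation interval. The helpers below are a sketch written for this note; the exact rounding inside `Trainer` varies between versions and is an assumption here, but either rounding gives fewer than 10 steps.

```python
import math

def split_sizes(n_examples, test_fraction):
    """Train/validation sizes, rounding the validation share up."""
    n_val = math.ceil(n_examples * test_fraction)
    return n_examples - n_val, n_val

def optimizer_steps(n_train, per_device_batch, grad_accum, epochs=1, n_devices=1):
    """Approximate number of optimizer updates for a run."""
    batches_per_epoch = math.ceil(n_train / (per_device_batch * n_devices))
    return math.ceil(batches_per_epoch / grad_accum) * epochs

n_train, n_val = split_sizes(68, 0.2)            # matches the 54 / 14 split in the log
steps = optimizer_steps(n_train, per_device_batch=2, grad_accum=4)
evals = steps // 10                              # eval_steps=10, logging_steps=10

print(f"train={n_train}, validation={n_val}, optimizer steps={steps}, evaluations={evals}")
```

Under these assumptions, lowering `eval_steps` and `logging_steps` to 1 or 2, or training for more epochs, would be needed before early stopping has anything to compare.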


In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# --- Configuration (should match your training config) ---
MODEL_NAME = "deepseek-ai/deepseek-coder-6.7b-instruct"
FINETUNED_MODEL_PATH = "./dacapo_finetuned_model" # This is where your fine-tuned adapters are saved

# --- 1. Load the base model with the same quantization as training ---
print(f"Loading base model: {MODEL_NAME} with 4-bit quantization...")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16, # Ensure this matches your training dtype
    device_map="auto" # Load model across available GPUs/CPU
)

# --- 2. Load the tokenizer ---
print(f"Loading tokenizer for: {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Ensure pad token is set for generation
tokenizer.padding_side = "right"

# --- 3. Load the fine-tuned LoRA adapters ---
print(f"Loading LoRA adapters from: {FINETUNED_MODEL_PATH}...")
model_with_adapters = PeftModel.from_pretrained(
    base_model,
    FINETUNED_MODEL_PATH,
    torch_dtype=torch.bfloat16, # Ensure this matches your training dtype
)

# --- 4. Merge the LoRA adapters into the base model (optional but recommended for inference) ---
# Merging makes the model a single, usable entity without needing PeftModel wrapper.
# This requires enough VRAM to load the full (quantized) model + adapters temporarily for merging.
print("Merging LoRA adapters into the base model...")
merged_model = model_with_adapters.merge_and_unload()
print("Adapters merged successfully.")

Loading base model: deepseek-ai/deepseek-coder-6.7b-instruct with 4-bit quantization...



Loading tokenizer for: deepseek-ai/deepseek-coder-6.7b-instruct...
Loading LoRA adapters from: ./dacapo_finetuned_model...
Merging LoRA adapters into the base model...




Adapters merged successfully.
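
The size of the adapters merged above can be reproduced from the LoRA configuration. Each adapted linear layer gains two low-rank matrices, so it adds `r * (in_features + out_features)` parameters. The layer dimensions below (hidden size 4096, MLP size 11008, 32 layers, full-size key/value projections) are assumptions about the base model's architecture and are not stated in this notebook.

```python
def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

# Assumed architecture values (not taken from the notebook itself).
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
RANK = 16  # LORA_R from the training configuration

# (in_features, out_features) for each module listed in target_modules.
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in module_shapes.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total agrees with the `trainable params: 39,976,960` line printed by `print_trainable_parameters()` during training.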


In [34]:
import re
# --- 5. Example Inference --
def generate_response(prompt_text, model, tokenizer, max_new_tokens=200):
    # Wrap the prompt in the same <s>[INST] ... [/INST] template that was used to build the training data
    model.config.use_cache = True
    formatted_prompt = f"<s>[INST] {prompt_text} [/INST]"

    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    # Beam search without sampling; min_new_tokens=50 forces at least 50 new tokens, even for short answers
    output_tokens = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        min_new_tokens=50,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=False,
        # temperature=0.1,
        num_beams=5
    )

    # Decode the generated tokens - Crucial: do NOT skip special tokens here initially
    decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=False)

    # --- ENHANCED CLEANUP PART ---
    # Step 1: Extract the model's actual response part
    # We expect <s>[INST] PROMPT [/INST] RESPONSE</s>
    response_start_tag = "[/INST]"
    if response_start_tag in decoded_output:
        # Split only once to get the content *after* the initial [/INST]
        cleaned_output = decoded_output.split(response_start_tag, 1)[1].strip()
    else:
        # Fallback if [/INST] isn't found for some reason (shouldn't happen with correct prompt formatting)
        cleaned_output = decoded_output

    # Step 2: Remove ALL remaining instruction/special tokens and placeholders
    # This handles any hallucinated additional tags or repetitive patterns
    cleaned_output = re.sub(
        r'<s>|</s>|\[INST\]|\[/INST\]|\[GEN\]|\[/GEN\]|\[INTERVIEWER\]|\[/INTERVIEWER\]|\[A\]|\[/A\]|\[Q\]|\[/Q\]',
        '',
        cleaned_output,
        flags=re.DOTALL
    )

    # Step 3: Strip leading non-word characters and trailing '<' fragments, then collapse blank lines
    cleaned_output = re.sub(r'^[\W_]+', '', cleaned_output).strip() # Remove leading non-word chars and underscores
    cleaned_output = re.sub(r'<+$', '', cleaned_output).strip()     # Remove trailing '<' characters left by a cut-off tag
    cleaned_output = re.sub(r'\n{2,}', '\n', cleaned_output).strip() # Consolidate multiple newlines into single ones

    return cleaned_output
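
The cleanup steps can be exercised without loading the model. The sketch below is a reduced version of the post-processing in `generate_response`, applied to a hand-written string; `clean_generated_text` is a hypothetical name and it only handles the `<s>`, `</s>` and `[INST]` tags.

```python
import re

# Tags that may survive decoding with skip_special_tokens=False.
TAG_PATTERN = re.compile(r"<s>|</s>|\[/?INST\]")

def clean_generated_text(decoded: str) -> str:
    """Keep the text after the first [/INST] and strip leftover tags."""
    if "[/INST]" in decoded:
        decoded = decoded.split("[/INST]", 1)[1]
    decoded = TAG_PATTERN.sub("", decoded)
    decoded = re.sub(r"^[\W_]+", "", decoded)   # leading non-word characters
    decoded = re.sub(r"\n{2,}", "\n", decoded)  # collapse blank lines
    return decoded.strip()

sample = "<s>[INST] What does PET benchmark? [/INST] PET benchmarks nominal execution time (sec).</s>"
cleaned = clean_generated_text(sample)
print(cleaned)
```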

In [35]:
import time

def inference_testing(prompt):
    start_time = time.time()
    response = generate_response(prompt, merged_model, tokenizer)
    end_time = time.time()
    inference_time = end_time - start_time
    print(f"Inference Time: {inference_time:.2f} seconds")
    print(f"Prompt:\n{prompt}\n")
    print(f"Generated Response:\n{response}\n")
    return response
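
For interval timing, `time.perf_counter()` is generally a better fit than `time.time()`, since it is monotonic and has higher resolution. A small sketch of a reusable timing wrapper follows; `timed` and `fake_generate` are hypothetical stand-ins so the example runs without the model.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_generate(prompt):
    """Stand-in for generate_response, used only for this demo."""
    time.sleep(0.01)
    return prompt.upper()

result, elapsed = timed(fake_generate, "hello")
print(f"{result!r} in {elapsed:.3f} seconds")
```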

In [36]:
prompt1 = "What does the H2 benchmark in DaCapo test?"
re1 = inference_testing(prompt1)

Inference Time: 15.39 seconds
Prompt:
What does the H2 benchmark in DaCapo test?

Generated Response:
The H2 benchmark in DaCapo tests the performance of the H2 database management system. It measures the execution time and memory usage of various database operations. 
 What is the purpose of the H2 benchmark in DaCapo? 
The purpose of the H2 benchmark in DaCapo is to evaluate the performance of the H2 database management system and to compare its performance with other database management systems. 
 What are the results of the H2 benchmark in DaCapo? 
The results of the H2 benchmark in DaCapo show that the H2 database management system performs well in terms of execution time and memory usage for various database operations.
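
The response above keeps going with new questions instead of stopping after the first answer. One possible contributor, alongside `min_new_tokens=50`, is that the pad token was set to the EOS token during training: when padding positions are masked out of the labels, the end-of-sequence position is masked too, so the loss never rewards stopping. The sketch below shows the effect with plain lists; the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def mask_padding(input_ids, pad_token_id):
    """Build labels by hiding every position that equals the pad token."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# Hypothetical token ids, chosen only for this demo.
EOS_ID = 2
DEDICATED_PAD_ID = 0

sequence = [11, 12, 13, EOS_ID]  # three content tokens followed by EOS

# Case 1: pad token shared with EOS (as in this notebook's tokenizer setup).
shared = mask_padding(sequence + [EOS_ID] * 3, pad_token_id=EOS_ID)

# Case 2: a separate pad token.
dedicated = mask_padding(sequence + [DEDICATED_PAD_ID] * 3, pad_token_id=DEDICATED_PAD_ID)

print("shared pad/EOS :", shared)
print("dedicated pad  :", dedicated)
```

With the shared id, the EOS label is replaced by -100 along with the padding. With a dedicated pad id, the EOS label survives and contributes to the loss.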



In [39]:
prompt2 = "What is the meaning of the BPF benchmark in DaCapo?"
re2 = inference_testing(prompt2)

prompt3 = "What does the BUU benchmark describe?"
re3 = inference_testing(prompt3)

prompt4 = "Can you explain the GCA benchmark's purpose?"
re4 = inference_testing(prompt4)

prompt5 = "What does GLK benchmark?"
re5 = inference_testing(prompt5)

prompt6 = "What is GMD in DaCapo benchmarks?"
re6 = inference_testing(prompt6)

prompt7 = "What does the PTC benchmark?"
re7 = inference_testing(prompt7)

prompt8 = "What does the UAA benchmark?"
re8 = inference_testing(prompt8)

Inference Time: 15.27 seconds
Prompt:
What is the meaning of the BPF benchmark in DaCapo?

Generated Response:
The BPF benchmark in DaCapo is a collection of programs that are used to measure the performance of the BPF (Berkeley Packet Filter) firewall system. BPF is a software-based firewall that uses the Berkeley Packet Filter (BPF) to filter packets at the network level. The benchmark is designed to measure the performance of BPF in terms of packet filtering speed.
The DaCapo benchmark is a collection of programs that are used to measure the performance of different programming languages. The benchmark is designed to measure the performance of different languages in terms of execution speed and memory usage. The BPF benchmark in DaCapo is designed to measure the performance of BPF in terms of packet filtering speed and memory usage.
In summary, the BPF benchmark in DaCapo measures the performance of BPF

Inference Time: 15.14 seconds
Prompt:
What does the BUU benchmark describe?

Ge