# Fine-tune model for creating IP-XACT
## In this document:
1. Load model from Hugging Face
2. Evaluate model without fine-tuning
3. Fine-tune the model
4. Upload fine-tuned model to Hugging Face
5. Evaluate the fine-tuned model

## 1. Load model from Hugging Face

In [None]:
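!pip install datasets PyPDF2

# Import only what this notebook uses here; the training utilities are imported in section 3.
from transformers import GPT2Tokenizer, GPT2Model
from datasets import Dataset
import torch
import PyPDF2

# GPT2Model is the bare transformer without a language-modeling head:
# calling it returns hidden states, not generated text.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2Model.from_pretrained('gpt2-medium')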





The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
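
The warning above concerns Colab secrets. When the notebook runs outside Colab, one common alternative is to read the token from the `HF_TOKEN` environment variable. The helper below is a minimal sketch under that assumption; `resolve_hf_token` is a hypothetical name and is not part of any library.

```python
import os


def resolve_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or None for anonymous access.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    # An empty or whitespace-only value is treated the same as a missing one.
    return token or None


token = resolve_hf_token()
print("authenticated" if token else "anonymous access (public models only)")
```

Public checkpoints such as `gpt2-medium` load without a token, so a `None` result is acceptable for sections 1 and 2; uploading in section 4 does require authentication.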


## 2. Evaluate model without fine-tuning

In [None]:
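# 2. Evaluate before fine-tuning
initial_prompt = "Generate an IP-XACT component for a simple 32-bit memory-mapped register. Output in xml format"

# Forward pass through the bare GPT2Model loaded in section 1.
# This prints raw hidden states and cached key/value tensors, not text.
encoded_input = tokenizer(initial_prompt, return_tensors='pt')
initial_output = model(**encoded_input)
print("Initial Output:", initial_output)

from transformers import pipeline

# The text-generation pipeline loads gpt2-medium with its language-modeling head,
# so this is the call that produces the baseline completion.
generator = pipeline("text-generation", model="gpt2-medium")

# max_length counts the prompt tokens plus the generated tokens.
response = generator(initial_prompt, max_length=500, num_return_sequences=1)

print(response[0]['generated_text'])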

Initial Output: BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[ 0.0509, -0.2751, -0.7783,  ...,  0.5840, -0.6404,  0.5585],
         [-0.2467,  0.3570, -0.3189,  ...,  0.4342, -0.5435,  0.6757],
         [-0.2876,  0.4164, -0.1669,  ...,  0.3337, -0.5499,  0.4009],
         ...,
         [-0.2938, -0.3889, -0.7778,  ..., -0.2329, -0.0606,  0.5008],
         [ 0.1721,  0.2089,  0.6367,  ..., -0.5629, -0.5302,  0.1623],
         [-0.0346,  0.3047, -0.1738,  ..., -0.0521, -0.3577,  0.2421]]],
       grad_fn=<ViewBackward0>), past_key_values=((tensor([[[[-6.0438e-02,  1.0039e-01,  5.3273e-02,  ..., -4.4611e-01,
           -1.7612e-01,  7.3897e-02],
          [ 1.8165e-01,  6.6250e-01,  4.4878e-01,  ...,  5.1001e-01,
            2.0368e-01, -6.1480e-01],
          [ 1.1185e+00,  2.8384e-01, -1.0772e-02,  ...,  1.7196e+00,
            2.1156e-01, -4.1885e-01],
          ...,
          [-1.0392e-01, -6.0143e-02,  2.0850e-01,  ..., -3.5986e-01,
            5.9941e-02, -4

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generate an IP-XACT component for a simple 32-bit memory-mapped register. Output in xml format.

In other words, your system should have one IP-XACT or D3D-V1 component for 32-bits. The code below shows how to create a 3D-V1 component and put it into memory in an 8-bit register. The assembly file looks like this:

D3DMemory.xml

.NET version 6 D3D-V1 component.

using System; public class D3DMemory : MonoBehaviour { public static byte[] CreateXACTComponent([byte] tag, [Vertex] vertexType, [Buffer] buffer) { return (byte) tag.Flags; } public public void SetBuffer(int type) { byte [] buffer = {buffer}; if (Buffer[type]!= (Vertex)Buffer.Type) { // Set buffer to Vertex buffer.SetBuffer(index); } } public VRebug CreateXACTComponent() { return CreateXACTComponent(0, null, "default"); } public virtual void Render(Vertex v) { Matrix m3 = vertex.GetMatrix(); Matrix a = m3 ^ vertex.GetMul(); int i = 0,j = 0; for (i = 0; i < m3.Length; i++) { // Get m3 offset if it is > 0 m3[m3(i)] = m3[i-1] ^ m3
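
The baseline completion above is C#-like text rather than XML. A quick automatic check can make that judgment repeatable across runs. The sketch below uses only the standard library; the function names are hypothetical, and the check tests well-formedness and the root element name only, not validity against the IP-XACT schema.

```python
import xml.etree.ElementTree as ET


def extract_xml_candidate(generated_text):
    """Return the substring from the first '<' to the last '>', or None if absent."""
    start = generated_text.find("<")
    end = generated_text.rfind(">")
    if start == -1 or end <= start:
        return None
    return generated_text[start:end + 1]


def looks_like_ipxact_component(generated_text):
    """True if the text contains well-formed XML whose root element is 'component'."""
    candidate = extract_xml_candidate(generated_text)
    if candidate is None:
        return False
    try:
        root = ET.fromstring(candidate)
    except ET.ParseError:
        return False
    # ElementTree writes namespaced tags as '{namespace}local'; compare the local part.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name == "component"


# Illustrative inputs (hand-written, not model output).
sample_ok = (
    'Result: <ipxact:component xmlns:ipxact="http://www.accellera.org/XMLSchema/IPXACT/1685-2014">'
    "<ipxact:name>reg32</ipxact:name></ipxact:component>"
)
sample_bad = "for (i = 0; i < m3.Length; i++) { // Get m3 offset if it is > 0"
print(looks_like_ipxact_component(sample_ok))
print(looks_like_ipxact_component(sample_bad))
```

Applying `looks_like_ipxact_component(response[0]['generated_text'])` to the pipeline result gives a pass/fail signal that can be compared before and after fine-tuning.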

## 3. Fine-tune the model

In [None]:
import torch

# Release cached GPU memory left over from the evaluation cells
torch.cuda.empty_cache()

# 3. Load and Preprocess Training Data from PDF
def extract_text_from_pdf(pdf_path):
    """Extract text from a given PDF file."""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Extract each page once and skip pages that yield no text
        pages = (page.extract_text() for page in reader.pages)
        text = "\n".join(page_text for page_text in pages if page_text)
    return text

# Source: https://www.accellera.org/images/downloads/standards/ip-xact/IPXACT-2022_user_guide.pdf (link verified 15 March 2025)
pdf_path = "IPXACT-2022_user_guide.pdf"  # Ensure this is in your working directory
pdf_text = extract_text_from_pdf(pdf_path)
dataset = Dataset.from_dict({"text": [pdf_text]})

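from transformers import GPT2Tokenizer

# Use the tokenizer that matches the model being fine-tuned (gpt2-medium).
# GPT-2 has no pad token, so the end-of-sequence token is reused for padding.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
tokenizer.pad_token = tokenizer.eos_token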

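# Tokenization function.
# Note: the dataset holds the whole PDF as a single example, so truncation=True with
# max_length=512 keeps only the first 512 tokens of the guide for training.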
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

from transformers import DataCollatorForLanguageModeling

# Data collator for causal language modeling (GPT-2 specific)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal LM objective: labels are copied from the inputs, no masked-token objective
)

from transformers import GPT2LMHeadModel, TrainingArguments, Trainer

# Load GPT-2 model for language modeling
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
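    # Evaluation is disabled by default; the argument is omitted because its name changed
    # from evaluation_strategy to eval_strategy in newer transformers releases.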
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    fp16=True,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    data_collator=data_collator,  # Ensures correct padding and formatting for GPT-2
)

# Start fine-tuning
trainer.train()



[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mnikusamu[0m ([33mnikusamu-[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


TrainOutput(global_step=3, training_loss=2.9106534322102866, metrics={'train_runtime': 115.5142, 'train_samples_per_second': 0.026, 'train_steps_per_second': 0.026, 'total_flos': 2786102083584.0, 'train_loss': 2.9106534322102866, 'epoch': 3.0})
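
The run above reports `global_step=3`. That matches a dataset with one 512-token example, a batch size of 1, and 3 epochs. To train on the whole guide, a common approach is to tokenize the text without truncation and split the token ids into fixed-size blocks. The sketch below shows the splitting and a simplified step count in plain Python. `chunk_token_ids` and `optimizer_steps` are hypothetical helpers, and the step formula is an approximation that may round differently from a given `Trainer` version.

```python
def chunk_token_ids(token_ids, block_size=512):
    """Split a flat list of token ids into consecutive blocks of block_size.

    The trailing partial block is dropped so every example has the same
    length and needs no padding.
    """
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]


def optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Simplified estimate of the number of optimizer updates in a training run."""
    batches_per_epoch = -(-num_examples // batch_size)  # ceiling division
    # At least one update happens per epoch, even with fewer batches than grad_accum.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


# Stand-in for tokenizer(pdf_text)["input_ids"]; real ids would come from the tokenizer.
fake_ids = list(range(1100))
blocks = chunk_token_ids(fake_ids, block_size=512)
print(len(blocks), "blocks of", len(blocks[0]), "tokens")

# Settings from the training cell: batch size 1, accumulation 4, 3 epochs.
print(optimizer_steps(num_examples=1, batch_size=1, grad_accum=4, epochs=3))
print(optimizer_steps(num_examples=200, batch_size=1, grad_accum=4, epochs=3))
```

The blocks could then be wrapped with `Dataset.from_dict({"input_ids": blocks})` and passed to the same data collator, which builds the labels for causal language modeling.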

## 4. Upload fine-tuned model to Hugging Face

In [None]:
!pip install huggingface_hub

from huggingface_hub import HfApi, notebook_login

# Log in interactively from the notebook (alternative to logging in from the CLI)
notebook_login()

# Define model repo and push
HF_USERNAME = "niklassuvitie"  # Change this to your HF username
REPO_NAME = "gpt2-medium-ipxact"  # Name of your model repo
MODEL_PATH = "./results"  # Trainer output directory; unused below because push_to_hub uploads the in-memory model

# Create a repo on HF (if it doesn't exist)
api = HfApi()
api.create_repo(repo_id=f"{HF_USERNAME}/{REPO_NAME}", exist_ok=True)

# Push model to HF
model.push_to_hub(f"{HF_USERNAME}/{REPO_NAME}")
tokenizer.push_to_hub(f"{HF_USERNAME}/{REPO_NAME}")

print(f"Model uploaded to: https://huggingface.co/{HF_USERNAME}/{REPO_NAME}")



Model uploaded to: https://huggingface.co/niklassuvitie/gpt2-medium-ipxact


## 5. Evaluate the fine-tuned model

In [1]:
from transformers import pipeline

# Baseline: the original gpt2-medium checkpoint from Hugging Face
pretrained_generator = pipeline("text-generation", model="gpt2-medium")

# Fine-tuned checkpoint uploaded in section 4
fine_tuned_generator = pipeline("text-generation", model="niklassuvitie/gpt2-medium-ipxact")

initial_prompt = "Generate an IP-XACT component for a simple 32-bit memory-mapped register. Output in xml format"

# Progress marker: both pipelines have finished loading
print("hi")
# Generate text using the pre-trained model (prints the raw list of result dicts)
pretrained_response = pretrained_generator(initial_prompt, max_length=500, num_return_sequences=1)
print("Pretrained Model Output:")
print(pretrained_response)

# Generate text using the fine-tuned model with the same prompt and settings
fine_tuned_response = fine_tuned_generator(initial_prompt, max_length=500, num_return_sequences=1)
print("\nFine-Tuned Model Output:")
print(fine_tuned_response)


Device set to use cuda:0


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


hi


Pretrained Model Output:
[{'generated_text': "Generate an IP-XACT component for a simple 32-bit memory-mapped register. Output in xml format.\n\nIn addition to setting up a simple input/output loop this package features:\n\nRegistering a user's account (useful for storing the user's account information in the registry)\n\nCreate a local copy of a user's account (useful for storing the user's account information in the registry)\n\nManage users with a profile (used for storing the user's public profile information to a file)\n\nPerform basic administrative access (useful for logging into the system, managing the environment, accessing files and directories in the system)\n\nPerform simple machine administration tasks, like create a local copy of the machine to access and perform initial setup of the system (user account on this server or user account or user profile on the machine, respectively)\n\nCreate/deleterate/add new accounts, delete old ones and create account objects\n\nEnable/
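
The text-generation pipeline returns a list of dicts, and by default `generated_text` begins with the prompt itself. When comparing the two models it helps to strip the prompt and summarize each continuation. The sketch below does this with the standard library only; `continuation` and `summarize` are hypothetical helpers, the responses are hand-written stand-ins, and the tag test is a rough heuristic rather than an XML parser.

```python
import re

# Matches something shaped like an opening XML tag, e.g. <ipxact:component ...>
TAG_PATTERN = re.compile(r"<[A-Za-z_][\w:.-]*[\s>/]")


def continuation(response, prompt):
    """Return the generated text that follows the prompt in a pipeline response."""
    full_text = response[0]["generated_text"]
    if full_text.startswith(prompt):
        return full_text[len(prompt):].lstrip()
    return full_text


def summarize(label, response, prompt, preview_chars=80):
    """Build a small comparison record for one model's response."""
    text = continuation(response, prompt)
    return {
        "model": label,
        "chars": len(text),
        "has_tag_like_markup": bool(TAG_PATTERN.search(text)),
        "preview": text[:preview_chars],
    }


# Hand-written stand-ins with the same structure as the pipeline output.
prompt = "Generate an IP-XACT component."
mock_baseline = [{"generated_text": prompt + "\n\nIn addition to setting up a loop, i < 3 holds."}]
mock_tuned = [{"generated_text": prompt + " <ipxact:component><ipxact:name>r</ipxact:name></ipxact:component>"}]

for label, resp in (("pretrained", mock_baseline), ("fine-tuned", mock_tuned)):
    print(summarize(label, resp, prompt))
```

With the real objects from the cell above, the calls would be `summarize("pretrained", pretrained_response, initial_prompt)` and `summarize("fine-tuned", fine_tuned_response, initial_prompt)`.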