# Training and Deploying Gen AI at Scale with PCAI

This notebook demonstrates a complete workflow for fine-tuning a pre-trained large language model, uploading the fine-tuned model to Hugging Face, and preparing it for deployment using our Machine Learning Inference Service. Fine-tuning is especially powerful for teaching new skills, tone, and formatting to a language model, thereby tailoring it to specialized tasks and business needs.

## Overview

1. **Model & Tokenizer Setup:** Load the pre-trained model and its tokenizer.
2. **Dataset Preparation:** Load the `timdettmers/openassistant-guanaco` instruction dataset and tokenize it.
3. **Text Generation Before Fine-Tuning:** Generate baseline text from the model.
4. **Fine-Tuning:** Train the model on a subset of the dataset to teach it new skills and tone.
5. **Text Generation After Fine-Tuning:** Compare the generated text against the baseline.
6. **Upload to Hugging Face Hub:** Share the fine-tuned model for deployment.
7. **Deployment Overview:** Brief discussion on deploying the model via our inference service.

In [5]:
# Import necessary libraries
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
)
from datasets import Dataset

# This notebook demonstrates fine-tuning a model to teach it new skills and tone.

### Step 1: Model and Tokenizer Setup

We begin by loading a pre-trained model (facebook/opt-125m) and its corresponding tokenizer. This model is chosen for demonstration purposes. Once loaded, the model is moved to GPU if available.

In [6]:
# Load pre-trained model and tokenizer
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, verbose=True)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,)
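
The printout lists the shape of every weight matrix, so the model's size can be checked by hand. The following is a small worked calculation, assuming each decoder layer holds the four attention projections, the two feed-forward layers, and two LayerNorms as shown, and assuming the output head shares its weights with `embed_tokens` (the truncated printout does not show the head):

```python
# Parameter count derived from the layer shapes in the model printout.
# Assumption: the output head reuses the embed_tokens weights, so it adds nothing.

def linear_params(in_features: int, out_features: int) -> int:
    """Weight matrix plus bias vector of a Linear layer."""
    return in_features * out_features + out_features


def layer_norm_params(dim: int) -> int:
    """Scale and shift vectors of an affine LayerNorm."""
    return 2 * dim


HIDDEN, FFN, VOCAB, POSITIONS, LAYERS = 768, 3072, 50272, 2050, 12

attention = 4 * linear_params(HIDDEN, HIDDEN)   # k_proj, v_proj, q_proj, out_proj
feed_forward = linear_params(HIDDEN, FFN) + linear_params(FFN, HIDDEN)  # fc1, fc2
norms = 2 * layer_norm_params(HIDDEN)           # self_attn_layer_norm, final_layer_norm
per_layer = attention + feed_forward + norms

embeddings = VOCAB * HIDDEN + POSITIONS * HIDDEN  # embed_tokens, embed_positions
total = embeddings + LAYERS * per_layer + layer_norm_params(HIDDEN)

print(f"per decoder layer: {per_layer:,}")
print(f"total parameters:  {total:,}")
```

Under these assumptions the total comes to roughly 125 million parameters, in line with the model's name.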

### Step 2: Dataset Preparation

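Next, we load the `timdettmers/openassistant-guanaco` dataset, which provides 9,846 training and 518 test conversations. Each record is a single `text` string in which turns are marked with `### Human:` and `### Assistant:`. These examples serve as the training signal that teaches the model a new tone and formatting style. We also define one synthetic question and answer pair (why an orange is orange); its question is reused as a test prompt in Step 7.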

In [23]:
from datasets import load_dataset
dataset = load_dataset('timdettmers/openassistant-guanaco')
print(dataset)
print(dataset['train'][0])

Repo card metadata block was not found. Setting CardData to empty.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})
{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large 
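
Each record packs a whole conversation into one `text` string. As a minimal sketch of how that format can be taken apart again, the snippet below splits such a string into (role, content) turns. `split_turns` is a hypothetical helper written for illustration, not part of the `datasets` library, and it assumes the markers always appear exactly as `### Human:` and `### Assistant:`.

```python
import re

# Matches the role markers used in the dataset's `text` field.
TURN_MARKER = re.compile(r"### (Human|Assistant):")


def split_turns(text: str) -> list[tuple[str, str]]:
    """Split a '### Human: ...### Assistant: ...' string into (role, content) pairs."""
    # re.split with one capture group yields: [prefix, role, content, role, content, ...]
    parts = TURN_MARKER.split(text)
    roles, contents = parts[1::2], parts[2::2]
    return [(role, content.strip()) for role, content in zip(roles, contents)]


sample = "### Human: Why is an orange orange?### Assistant: Because of carotenoids."
for role, content in split_turns(sample):
    print(f"{role}: {content}")
```

Inspecting a few records this way is a quick check that prompts used at generation time follow the same marker format as the training data.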

In [16]:
# Synthetic example, kept for reference; its question is the test prompt in Step 7
synthetic_text = (
    "Question: Why is an orange orange? Answer: The reason why an orange is orange is due to the presence of pigments "
    "called carotenoids, specifically beta-carotene and other xanthophylls. These pigments are responsible for the orange, "
    "yellow, and red colors of many fruits and vegetables."
)
# To train on this single example instead of the Guanaco dataset, uncomment:
# data_dict = {"text": [synthetic_text]}
# dataset = Dataset.from_dict(data_dict)

# Tokenization function for causal language modeling
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=256,
    )
    # Use the input_ids as labels
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# Tokenize dataset (both the train and test splits)
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Data collator to handle padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
Map: 100%|██████████| 9846/9846 [00:32<00:00, 301.37 examples/s]
Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
Map: 100%|██████████| 518/518 [00:02<00:00, 238.35 examples/s]
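
The tokenization function copies `input_ids` into `labels`, and because every sequence is padded to `max_length`, the labels also cover the padding positions. One common refinement, shown here as a sketch rather than as what this notebook ran, is to set the label at each padded position to `-100`, the value that PyTorch's cross-entropy loss ignores by default. `mask_padding_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding_labels(input_ids: list[int], attention_mask: list[int]) -> list[int]:
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Toy sequence: three real tokens followed by two pad tokens (pad id 1 for OPT).
example_ids = [2, 45, 67, 1, 1]
example_mask = [1, 1, 1, 0, 0]
print(mask_padding_labels(example_ids, example_mask))
```

With labels masked this way, the reported loss reflects only real tokens rather than being averaged together with padding.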


### Step 3: Generating Text Before Fine-Tuning

Before fine-tuning, we generate text from the model to establish a baseline. This output will later be compared against the model’s performance after training.

In [12]:
# Generate text before training
print("\nGenerated text before training:")
input_text = '''### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant:'''
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
output = model.generate(input_ids, max_length=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))


Generated text before training:
### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.###

###

###

[... `###` repeats for the remainder of the output ...]



### Step 4: Fine-Tuning the Model

Using the Hugging Face `Trainer`, we fine-tune the model on the first 128 examples of the tokenized training split and evaluate on the first 128 examples of the test split. Fine-tuning is a key process for teaching the model new skills, tone, and formatting. Here, we configure the training arguments (20 epochs, batch size 8, learning rate 2e-5) and start training.

In [18]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 518
    })
})

In [26]:
# Define training arguments and initialize the Trainer
training_args = TrainingArguments(
    output_dir="./opt-125m-synthetic-finetuned2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=20,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    push_to_hub=False,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'].select(range(128)),
    eval_dataset=tokenized_dataset['test'].select(range(128)),
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Fine-tune the model
trainer.train()

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


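| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 2.600157 |
| 2 | No log | 2.615628 |
| 3 | No log | 2.620448 |
| 4 | No log | 2.645681 |
| 5 | No log | 2.663076 |
| 6 | No log | 2.689442 |
| 7 | No log | 2.700559 |
| 8 | No log | 2.698966 |
| 9 | No log | 2.728597 |
| 10 | No log | 2.745808 |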
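
In the logged epochs, validation loss is lowest after the first epoch and rises in almost every epoch after that, a pattern usually read as overfitting to the small training subset. A short sketch of checking this from the logged values (the numbers are copied from the table; only the ten epochs shown there are included):

```python
# Validation loss per epoch, copied from the training log.
validation_loss = {
    1: 2.600157, 2: 2.615628, 3: 2.620448, 4: 2.645681, 5: 2.663076,
    6: 2.689442, 7: 2.700559, 8: 2.698966, 9: 2.728597, 10: 2.745808,
}

# Epoch with the lowest validation loss.
best_epoch = min(validation_loss, key=validation_loss.get)

# Count how many epoch-to-epoch transitions made validation loss worse.
epochs = sorted(validation_loss)
increases = sum(
    validation_loss[later] > validation_loss[earlier]
    for earlier, later in zip(epochs, epochs[1:])
)

print(f"best epoch: {best_epoch} (loss {validation_loss[best_epoch]:.6f})")
print(f"loss increased in {increases} of {len(epochs) - 1} transitions")
```

When the best checkpoint is this early, training for fewer epochs or on more than 128 examples is worth considering.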



TrainOutput(global_step=320, training_loss=0.13379300832748414, metrics={'train_runtime': 77.9541, 'train_samples_per_second': 32.84, 'train_steps_per_second': 4.105, 'total_flos': 334453800960000.0, 'train_loss': 0.13379300832748414, 'epoch': 20.0})
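
The figures in `TrainOutput` follow directly from the training configuration. As a worked check, assuming a single device and no gradient accumulation (neither is stated in the notebook), the step count and throughput can be reproduced like this:

```python
import math

# Values taken from the training cell and the TrainOutput above.
num_examples = 128       # tokenized_dataset['train'].select(range(128))
batch_size = 8           # per_device_train_batch_size
num_epochs = 20          # num_train_epochs
train_runtime = 77.9541  # seconds

# Assumption: one device, no gradient accumulation.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(f"steps per epoch:    {steps_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"samples per second: {samples_per_second:.2f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

The results match the logged `global_step`, `train_samples_per_second`, and `train_steps_per_second`.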

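### Step 5: Generating Text After Fine-Tuning

After training, we generate text again. Note that this cell uses a different prompt than the baseline and writes the markers without a space (`###Human:` rather than `### Human:` as in the training data), so the output is not a like-for-like comparison with Step 3. The reply does follow the assistant-style turn format of the training data, but the model also continues past its answer and generates a further `### Human:` turn of its own.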

In [28]:
# Generate text after training
print("\nGenerated text after training:")
input_ids = tokenizer('###Human: What is your name?###Assistant:', return_tensors="pt").input_ids.to(device)
output = model.generate(input_ids, max_length=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))


Generated text after training:
###Human: What is your name?###Assistant: My name is Assistant, and I am a person who has a vague interest in music. I have been listening to music for a while now, and I am enjoying it. However, I have a vague interest in music history and the current state of music. I would like to learn about the history of music and its impact on the world.### Human: Can you give me some examples of music that has been popular in the past year and a half?### Assistant: As a person who has a vague interest in music history and the current state of music, I can tell you some examples that have been popular in the past year and a half:

1. Americana: This music is often used as a metaphor for change and progress. It is believed that this music was popular because of the changes made to the music industry and the changes in society as a result.

2. Christian music: This is a popular music style that is popular across the board. It is believed that this music was popular 
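
The generated text continues past the first answer into an invented `### Human:` turn. A minimal post-processing sketch that keeps only the first assistant reply is shown below; `extract_reply` is a hypothetical helper, and the stop markers are an assumption based on the formats seen in this notebook.

```python
def extract_reply(
    generated: str,
    prompt: str,
    stop_markers: tuple[str, ...] = ("### Human:", "###Human:"),
) -> str:
    """Return the first assistant reply: drop the prompt, cut at the next human turn."""
    # Remove the echoed prompt if the decoded output starts with it.
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated

    # Cut at the earliest stop marker, if any is present.
    cut = len(completion)
    for marker in stop_markers:
        index = completion.find(marker)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut].strip()


prompt = "###Human: What is your name?###Assistant:"
generated = prompt + " My name is Assistant.### Human: Can you give me examples?### Assistant: Sure."
print(extract_reply(generated, prompt))
```

The same effect can often be achieved at generation time with a stop sequence, where the serving stack supports one.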

### Step 6: Uploading the Fine-Tuned Model to Hugging Face Hub

Once fine-tuned, the model is uploaded to the Hugging Face Hub. Hosting the model on the Hub makes it available to our Machine Learning Inference Service for deployment at scale.

In [10]:
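import os
from getpass import getpass

from huggingface_hub import login

# Prompt for Hugging Face token if not already set
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    hf_token = getpass("Enter your Hugging Face token: ")
    os.environ["HF_TOKEN"] = hf_token

print("Hugging Face token stored successfully!")

# Login and push the fine-tuned model to the Hub.
# Note: the first argument of Trainer.push_to_hub is the commit message, not the
# repository id (see commit_message in the output below). The target repository
# is derived from output_dir unless hub_model_id is set in TrainingArguments.
login(token=hf_token)
trainer.push_to_hub(commit_message="mendeza/opt-125m-finetuned")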

Enter your Hugging Face token:  ········


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Hugging Face token stored successfully!




CommitInfo(commit_url='https://huggingface.co/mendeza/opt-125m-synthetic-finetuned/commit/36a9c2f3424029c8a9b308cd65e5912627e0a957', commit_message='mendeza/opt-125m-finetuned', commit_description='', oid='36a9c2f3424029c8a9b308cd65e5912627e0a957', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mendeza/opt-125m-synthetic-finetuned', endpoint='https://huggingface.co', repo_type='model', repo_id='mendeza/opt-125m-synthetic-finetuned'), pr_revision=None, pr_num=None)

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mendeza/opt-125m-synthetic-finetuned")

In [13]:
tokenizer



GPT2TokenizerFast(name_or_path='mendeza/opt-125m-synthetic-finetuned', vocab_size=50265, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '</s>', 'eos_token': '</s>', 'unk_token': '</s>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

### Step 7: Deploying via Machine Learning Inference Service

With the model now hosted on Hugging Face, it can be integrated with our Machine Learning Inference Service to serve predictions at scale. The deployment specifics depend on your production environment. The first cell below queries the model through an OpenAI-compatible endpoint (here a local vLLM server at `http://127.0.0.1:8000/v1`); the final cell sends the same prompt to a deployed endpoint with `curl`.

In [4]:
from openai import OpenAI

# Set the API key (can be an empty string for local vLLM) and base URL
openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:8000/v1"

# Create an OpenAI client
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Example: Create a completion
completion = client.completions.create(
    model="finetune_llm_with_lora/opt-125m-synthetic-finetuned/", # Replace with your model if needed
    prompt="Question: Why is an orange orange? Answer:",
    max_tokens=1024
)

print("Completion result:", completion.choices[0].text)

Completion result:  Organizing the orange includes several properties. The three-units, exact opposers, or individual phytosans selected for each pick of gray and yellow fruits and vegetables are responsible for the orange, yellow, and red colors of many fruits and vegetables. Each choice of ambi strutConnector moderation mixes functional foods with cooking functionality, and allow charity-appdeveloped innovative design of the recommendations. targeted and automated casual burasses and soap for the squashed, scruffy, and excepted appearance of pigments removed from specific forms of foodSaint skews (with or without no labels) the selection of flavors of fruits and vegetables. Socialл, "Socialwin," or "social elimination," refers to a concept wherein a plurality of equal correspondinglyilated community/parps/vegetation/verification, extraction, and/or counterv liabilities are chosen to help compensate for the importance of any meal the enemy and the enemy goal served, and are delivered 

In [None]:
# Query the deployed model with curl.
# The bearer token is read from the AUTH_TOKEN environment variable;
# do not paste a real token into the notebook.
curl -X 'POST' \
  'http://fb125m-harrow.default.mlds-kserve.us.rdlabs.hpecorp.net/v1/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${AUTH_TOKEN}" \
  -d '{
  "prompt": "Question: Why is an orange orange? Answer:",
  "stop": [],
  "llm_config": {
    "max_new_tokens": 128,
    "min_length": 0,
    "early_stopping": false,
    "num_beams": 1,
    "num_beam_groups": 1,
    "use_cache": true,
    "temperature": 0.75,
    "top_k": 15,
    "top_p": 0.78,
    "typical_p": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "diversity_penalty": 0,
    "repetition_penalty": 1,
    "encoder_repetition_penalty": 1,
    "length_penalty": 1,
    "no_repeat_ngram_size": 0,
    "renormalize_logits": false,
    "remove_invalid_values": false,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "encoder_no_repeat_ngram_size": 0,
    "n": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "use_beam_search": false,
    "ignore_eos": false,
    "skip_special_tokens": true
  },
  "adapter_name": null
}'
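
The same request can be prepared from Python. The sketch below only builds the request object using the standard library; it assumes the endpoint accepts the JSON shape from the `curl` call with a reduced `llm_config` (whether the service tolerates omitted fields is not confirmed here), and it reads the token from a hypothetical `AUTH_TOKEN` environment variable. `build_generate_request` is an illustrative helper name.

```python
import json
import os
import urllib.request


def build_generate_request(
    url: str, token: str, prompt: str, max_new_tokens: int = 128
) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {
        "prompt": prompt,
        "stop": [],
        # Subset of the sampling settings from the curl example (assumption:
        # the service falls back to defaults for fields left out).
        "llm_config": {
            "max_new_tokens": max_new_tokens,
            "temperature": 0.75,
            "top_k": 15,
            "top_p": 0.78,
        },
        "adapter_name": None,
    }
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_generate_request(
    "http://fb125m-harrow.default.mlds-kserve.us.rdlabs.hpecorp.net/v1/generate",
    os.environ.get("AUTH_TOKEN", ""),
    "Question: Why is an orange orange? Answer:",
)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.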

## Conclusion

In this notebook we demonstrated how to:

- Fine-tune a pre-trained model to teach it new skills, tone, and formatting
- Generate text before and after training to compare performance
- Upload the fine-tuned model to the Hugging Face Hub
- Query the served model through an OpenAI-compatible client and with `curl`

This process is key to deploying custom AI models at scale with PCAI, ensuring they are tailored to your specific needs and production environments.