# Training and Deploying Gen AI at Scale with PCAI

This notebook demonstrates a complete workflow for fine-tuning a pre-trained large language model, uploading the fine-tuned model to Hugging Face, and preparing it for deployment using our Machine Learning Inference Service. Fine-tuning is especially powerful for teaching new skills, tone, and formatting to a language model, thereby tailoring it to specialized tasks and business needs.

## Overview

1. **Model & Tokenizer Setup:** Load the pre-trained model and its tokenizer.
2. **Dataset Preparation:** Create a synthetic dataset with a question–answer example.
3. **Text Generation Before Fine-Tuning:** Generate baseline text from the model.
4. **Fine-Tuning:** Train the model on the dataset to imbue it with new skills and tone.
5. **Text Generation After Fine-Tuning:** Evaluate improvements in the generated text.
6. **Upload to Hugging Face Hub:** Share the fine-tuned model for deployment.
7. **Deployment Overview:** Brief discussion on deploying the model via our inference service.

In [1]:
# Import necessary libraries
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
)
from datasets import Dataset

# This notebook demonstrates fine-tuning a model to teach it new skills and tone.



### Step 1: Model and Tokenizer Setup

We begin by loading a pre-trained model (facebook/opt-125m) and its corresponding tokenizer. This model is chosen for demonstration purposes. Once loaded, the model is moved to GPU if available.

In [2]:
# Load pre-trained model and tokenizer
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, verbose=True)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,)
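
The layer shapes in the printed summary are enough to estimate the model's size by hand. The sketch below is plain arithmetic on those shapes and does not load the model. It assumes the output head shares its weights with `embed_tokens`, which the truncated summary does not show, so treat the total as an estimate.

```python
# Dimensions read from the printed OPTForCausalLM summary.
vocab_size, hidden = 50272, 768
positions = 2050
ffn = 3072
num_layers = 12


def linear(in_features: int, out_features: int) -> int:
    """Parameters in a Linear layer with bias."""
    return in_features * out_features + out_features


def layer_norm(features: int) -> int:
    """Parameters in a LayerNorm with elementwise affine weights."""
    return 2 * features


embeddings = vocab_size * hidden + positions * hidden
attention = 4 * linear(hidden, hidden)  # k_proj, v_proj, q_proj, out_proj
feed_forward = linear(hidden, ffn) + linear(ffn, hidden)  # fc1, fc2
per_layer = attention + feed_forward + 2 * layer_norm(hidden)

# Assumption: the output head reuses the embedding matrix, so it adds nothing.
total = embeddings + num_layers * per_layer + layer_norm(hidden)
print(f"{total:,} parameters")
```

Under that assumption the total comes to roughly 125 million parameters, in line with the model's name.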

### Step 2: Dataset Preparation

Next, we create a synthetic dataset containing a single example—a question and answer pair that explains why an orange is orange. This example serves as a training signal to help the model learn a new tone and formatting style.

In [3]:
# Create a synthetic dataset with one example
synthetic_text = (
    "Question: Why is an orange orange? Answer: The reason why an orange is orange is due to the presence of pigments "
    "called carotenoids, specifically beta-carotene and other xanthophylls. These pigments are responsible for the orange, "
    "yellow, and red colors of many fruits and vegetables."
)
data_dict = {"text": [synthetic_text]}
dataset = Dataset.from_dict(data_dict)

# Tokenization function for causal language modeling
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=256,
    )
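    # Use the input_ids as labels, masking padded positions with -100 so they
    # do not contribute to the loss
    tokenized["labels"] = [
        [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(tokenized["input_ids"], tokenized["attention_mask"])
    ]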
    return tokenized

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Data collator to handle padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
Map: 100%|██████████| 1/1 [00:00<00:00, 184.54 examples/s]
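
Because each example is padded to `max_length=256`, most label positions in this dataset correspond to padding. A common approach is to set those labels to `-100`, the value the Hugging Face causal language modeling loss ignores, so that only real tokens drive the update. The sketch below shows the idea on made-up token ids; `mask_padding` and `IGNORE_INDEX` are illustrative names, not part of any library.

```python
IGNORE_INDEX = -100  # label value skipped by the Hugging Face causal LM loss


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Hypothetical token ids: three real tokens followed by two padding tokens (id 1).
input_ids = [2, 45641, 35, 1, 1]
attention_mask = [1, 1, 1, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)  # [2, 45641, 35, -100, -100]
```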


### Step 3: Generating Text Before Fine-Tuning

Before fine-tuning, we generate text from the model to establish a baseline. This output will later be compared against the model’s performance after training.

In [5]:
# Generate text before training
print("\nGenerated text before training:")
input_text = "Question: Why is an orange orange? Answer:"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
output = model.generate(input_ids, max_length=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))


Generated text before training:
Question: Why is an orange orange? Answer: Because it's a color.
I think
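
The baseline output stops mid-sentence because `max_length` in `generate` caps the prompt and the generated tokens together. The sketch below illustrates that budget with a hypothetical helper, `new_token_budget`, and a made-up prompt length; the real prompt length depends on the tokenizer.

```python
def new_token_budget(prompt_length: int, max_length: int) -> int:
    """Tokens left for generation when max_length caps prompt plus output."""
    return max(0, max_length - prompt_length)


# Hypothetical prompt length, chosen only for illustration.
prompt_length = 12
print(new_token_budget(prompt_length, max_length=20))  # 8
```

To bound only the generated part, `generate` also accepts `max_new_tokens`, which does not count the prompt.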


### Step 4: Fine-Tuning the Model

Using the Hugging Face `Trainer`, we fine-tune the model on our synthetic dataset. Fine-tuning is a key process for teaching the model new skills, tone, and formatting. Here, we configure the training arguments and initiate the training process.

In [9]:
# Define training arguments and initialize the Trainer
training_args = TrainingArguments(
    output_dir="./opt-125m-synthetic-finetuned",
    evaluation_strategy="epoch",  # newer transformers releases name this eval_strategy
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    num_train_epochs=5,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    push_to_hub=False,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Fine-tune the model
trainer.train()
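
With one example, a batch size of 1, and five epochs, this run performs very few optimizer updates. The sketch below estimates the step count and the learning rate at each step, assuming a single device, no gradient accumulation, and the `Trainer` default of a linear decay schedule with no warmup. The helper names are illustrative.

```python
import math


def total_steps(num_examples: int, batch_size: int, epochs: int, grad_accum: int = 1) -> int:
    """Approximate number of optimizer steps on a single device."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


def linear_lr(step: int, total: int, base_lr: float) -> float:
    """Learning rate under linear decay to zero with no warmup."""
    return base_lr * max(0.0, (total - step) / max(1, total))


steps = total_steps(num_examples=1, batch_size=1, epochs=5)
schedule = [linear_lr(s, steps, 5e-5) for s in range(steps)]

print(steps)  # 5
for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.1e}")
```

Five updates on a single example mostly encourage the model to reproduce that example, which is enough for a demonstration but not a realistic training set.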

### Step 5: Generating Text After Fine-Tuning

After training, we generate text again from the same prompt. The cell below is example code showing how to run predictions with the fine-tuned model.

```python
# Generate text after training
print("\nGenerated text after training:")
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
output = model.generate(input_ids, max_length=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
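
One simple way to put a number on the before-and-after comparison is to measure how closely a completion matches the training answer. The sketch below uses `difflib.SequenceMatcher` from the standard library as a rough character-level similarity score. It is an illustrative check rather than a standard evaluation metric, and `similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(generated: str, reference: str) -> float:
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, generated, reference).ratio()


# Opening of the training answer, used as the reference text.
reference = (
    "The reason why an orange is orange is due to the presence of pigments "
    "called carotenoids"
)
# Baseline completion taken from the output generated before training.
baseline = "Because it's a color."

print(f"baseline similarity: {similarity(baseline, reference):.2f}")
```

Passing the completion produced after training to the same function gives a score that can be compared directly with the baseline.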

### Step 6: Uploading the Fine-Tuned Model to Hugging Face Hub

Once fine-tuned, the model is uploaded to the Hugging Face Hub.
The code in this step is not executed in this notebook; it is example code showing how to enter a Hugging Face token interactively and push the model.

```python
from huggingface_hub import login
import os

# Prompt for Hugging Face token if not already set
from getpass import getpass
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    hf_token = getpass("Enter your Hugging Face token: ")
    os.environ["HF_TOKEN"] = hf_token

print("Hugging Face token stored successfully!")
```

```python
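# Log in and push the fine-tuned model and tokenizer to the Hub.
# The first positional argument of trainer.push_to_hub is the commit message,
# so the repository ID is passed to the model and tokenizer directly.
login(token=hf_token)
model.push_to_hub("mendeza/opt-125m-synthetic-finetuned")
tokenizer.push_to_hub("mendeza/opt-125m-synthetic-finetuned")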
```

### Step 7: Deploying via Machine Learning Inference Service

With the model now hosted on Hugging Face, it can be easily integrated with our Machine Learning Inference Service to provide scalable and efficient predictions. The deployment specifics will depend on your production environment, but this workflow lays the groundwork for seamless integration.

#### Step 7.1: Copy the MLIS Token

In MLIS, go to Tokens, select `finetuned-opt-125m`, and copy the token that was created. Run the cell below to enter the token interactively.

In [17]:
import os
from getpass import getpass

# Prompt for the MLIS token if it is not already set
token = os.environ.get("token")
if not token:
    token = getpass("Enter your token: ")
    os.environ["token"] = token

print("Token stored successfully!")

Enter your token:  ········


Token stored successfully!


In [19]:
import asyncio
import nest_asyncio
from openai import AsyncOpenAI

nest_asyncio.apply()

# Set API credentials
openai_api_key = os.environ["token"]
openai_api_base = "https://finetuned-opt-125m-predictor-admin-e51f23f1.saie02.tryezmeral.com/v1"

# Create an async OpenAI client
client = AsyncOpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
    timeout=60.0,
)

# Async function to stream completions
async def stream_completion():
    response = await client.completions.create(
        model="mendeza/opt-125m-synthetic-finetuned",
        prompt="Question: Why is an orange orange? Answer:",
        max_tokens=4,  # Only four tokens are requested, so the completion is short
        stream=True,  # Enable streaming
        top_p=1,
        temperature=0.1
    )

    print("Completion result:", end=" ", flush=True)
    
    async for chunk in response:
        if chunk.choices and chunk.choices[0].text:
            print(chunk.choices[0].text, end="", flush=True)  # Stream output

# Run the async function properly in Jupyter
await stream_completion()


Completion result:  The reason why an
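
The client call above maps onto a plain HTTP request against the endpoint's OpenAI-compatible completions route. The sketch below builds, but does not send, such a request using only the standard library. The base URL and token are placeholders, `build_completion_request` is a hypothetical helper, and the `/completions` path is assumed from the way the OpenAI client is used here. The `max_tokens` value of 64 is also an illustrative choice; the cell above requests only 4.

```python
import json
import urllib.request


def build_completion_request(base_url, token, model, prompt, max_tokens=64):
    """Build a POST request for an OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.1,
        "top_p": 1,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_completion_request(
    base_url="https://example.invalid/v1",  # placeholder, not a real endpoint
    token="YOUR_TOKEN",  # placeholder
    model="mendeza/opt-125m-synthetic-finetuned",
    prompt="Question: Why is an orange orange? Answer:",
)
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would return the completion as JSON, provided the endpoint and token are valid.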

## Conclusion

In this notebook we demonstrated how to:

- Fine-tune a pre-trained model to teach it new skills, tone, and formatting
- Generate text before and after training to compare performance
- Upload the fine-tuned model to the Hugging Face Hub
- Query the deployed model through the Machine Learning Inference Service

This process is key to deploying custom AI models at scale with PCAI, ensuring they are tailored to your specific needs and production environments.