
# 🚀 Fine-tune GPT-2 on Your PDF (Siebel CRM Guide) — Colab Notebook

This notebook lets you **upload a PDF**, extract text, build a dataset, **fine-tune GPT-2**, and compare **Before vs After** outputs — all in Google Colab, with no OpenAI key needed.

**What you'll get:**
1. Upload your PDF (e.g., *Siebel CRM Fundamentals Guide, 170 pages*)
2. Extract & clean text
3. Create a dataset (tokenized & grouped for language modeling)
4. Baseline generation (**Before fine-tuning**)
5. Fine-tune GPT-2 with Hugging Face `Trainer`
6. Compare outputs (**After fine-tuning**) side-by-side
7. Optional: Measure perplexity on the test split
8. Optional: Save & export your fine-tuned model



In [8]:

!pip install -U transformers datasets accelerate pypdf




In [9]:

import sys
import torch
print("Python:", sys.version)
print("PyTorch:", torch.__version__)
print("CUDA is available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("⚠️ No GPU detected. In Colab, go to Runtime → Change runtime type → GPU.")


Python: 3.11.13 (main, Jun  4 2025, 08:57:29) [GCC 11.4.0]
PyTorch: 2.6.0+cu124
CUDA is available: True
GPU: Tesla T4


## 1) Upload your PDF

In [3]:
# from google.colab import files
# uploaded = files.upload()  # Choose your Siebel PDF here
pdf_path = "/content/FundOUI.pdf"
print("Using file:", pdf_path)

Using file: /content/FundOUI.pdf
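
Because the upload lines are commented out and `pdf_path` is hard-coded, a quick sanity check before parsing can catch a missing or mislabelled file early. The sketch below is one possible way to do that; `check_pdf_path` is a hypothetical helper (not part of the notebook), and it assumes the file begins with the usual `%PDF-` signature, which holds for most but not all PDFs.

```python
from pathlib import Path

def check_pdf_path(path):
    """Return the path if it points to a file that starts with the PDF signature."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"No file at {p}; upload the PDF or fix pdf_path.")
    with p.open("rb") as f:
        header = f.read(5)  # a PDF normally begins with the bytes %PDF-
    if header != b"%PDF-":
        raise ValueError(f"{p} does not start with %PDF-; is it really a PDF?")
    return str(p)
```

In the notebook this could be used as `pdf_path = check_pdf_path("/content/FundOUI.pdf")` before the extraction cell.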


## 2) Extract & clean text

In [4]:

from pypdf import PdfReader
import re

reader = PdfReader(pdf_path)
pages = len(reader.pages)
print(f"PDF pages detected: {pages}")

all_text = []
for i, page in enumerate(reader.pages):
    text = page.extract_text() or ""
    # Basic cleanup: remove hyphenated line breaks & excessive spaces
    text = text.replace("\u00ad", "")              # soft hyphen
    text = re.sub(r"-\n\s*", "", text)             # hyphen + newline merge
    text = re.sub(r"\n{2,}", "\n\n", text)         # collapse big gaps
    text = re.sub(r"[ \t]{2,}", " ", text)         # collapse spaces
    all_text.append(text)

full_text = "\n\n".join(all_text).strip()

with open("siebel_guide.txt", "w", encoding="utf-8") as f:
    f.write(full_text)

print("✅ Extracted text length:", len(full_text))
print("Preview:\n", full_text[:1200])


PDF pages detected: 170
✅ Extracted text length: 341373
Preview:
 Siebel CRM
Fundamentals Guide 
Siebel Innovation Pack 2016, Rev. A 
E52425-01
June 2016

Siebel CRM Fundamentals Guide, Siebel Innovation Pack 2016, Rev. A 
E52425-01
Copyright © 2005, 2016 Oracle and/or its affiliates. All rights reserved.
This software and related documentation are provided under a license agreement containing restrictions on 
use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your 
license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, 
transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse 
engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is 
prohibited.
The information contained herein is subject to change without notice and is not warranted to be error-free. If 
you find any errors, plea
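
PDF guides often repeat a running header or footer on every page, and those repeats end up in the training text many times over. Whether this particular PDF does so is an assumption; the sketch below shows one simple way to drop lines that recur across pages before joining them. `drop_repeated_lines` and its `min_fraction` threshold are hypothetical names chosen for illustration, and the sample pages are made up.

```python
from collections import Counter

def drop_repeated_lines(page_texts, min_fraction=0.5):
    """Remove lines that occur on at least min_fraction of the pages."""
    counts = Counter()
    for text in page_texts:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})
    threshold = max(2, int(len(page_texts) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for text in page_texts:
        kept = [ln for ln in text.splitlines() if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned

# Made-up pages that share a running header.
sample_pages = [
    "Siebel CRM Fundamentals Guide\nIntro text",
    "Siebel CRM Fundamentals Guide\nMore text",
    "Siebel CRM Fundamentals Guide\nLast page",
]
print(drop_repeated_lines(sample_pages))
```

In the notebook, this could be applied to `all_text` just before the `"\n\n".join(all_text)` step. Footers that change per page (for example, page numbers) would not be caught by this exact-match approach.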

## 3) Build dataset (tokenize & group for language modeling)

In [5]:

from datasets import Dataset
from transformers import AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Create a single-sample dataset; we'll tokenize and group into chunks.
raw_ds = Dataset.from_dict({"text": [full_text]})

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized = raw_ds.map(tokenize_function, batched=True, remove_columns=["text"])

# Group into contiguous blocks of block_size tokens for causal LM
block_size = 256

def group_texts(examples):
    # Concatenate all texts.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the small remainder
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size tokens
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_ds = tokenized.map(group_texts, batched=True)
split = lm_ds.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = split["train"], split["test"]
print(train_ds, test_ds)


Token indices sequence length is longer than the specified maximum sequence length for this model (77053 > 1024). Running this sequence through the model will result in indexing errors


Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 270
}) Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 30
})
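
The row counts above follow directly from the chunking arithmetic. As a rough illustration, the sketch below repeats the same "keep full blocks, drop the remainder" logic on a plain Python list and then checks the numbers reported by this run (77053 tokens, `block_size = 256`). `chunk_tokens` is a hypothetical stand-in for `group_texts` that works on a single flat list.

```python
def chunk_tokens(token_ids, block_size):
    """Split a flat token list into full blocks, dropping the remainder."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]

# Toy example: 10 fake token ids with block_size 4 give two blocks; 2 ids are dropped.
toy_blocks = chunk_tokens(list(range(10)), block_size=4)
print(toy_blocks)

# Numbers taken from the run above.
total_tokens, block_size = 77053, 256
n_blocks = total_tokens // block_size   # full blocks kept
dropped = total_tokens % block_size     # trailing tokens discarded
print(n_blocks, dropped)                # 300 blocks, 253 tokens dropped
print(270 + 30 == n_blocks)             # train rows + test rows account for every block
```

This also explains the tokenizer warning about 77053 > 1024: the whole document is tokenized as one sequence first and only cut into 256-token blocks afterwards, so no over-long sequence ever reaches the model.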


## 4) Baseline: Generate **Before Fine-tuning**

In [6]:

from transformers import AutoModelForCausalLM, pipeline

base_model = AutoModelForCausalLM.from_pretrained(model_name)
gen_pipe_before = pipeline("text-generation", model=base_model, tokenizer=tokenizer, device_map="auto")

# You can edit these prompts as needed.
prompts = [
    "What is Siebel CRM?",
    "Explain Siebel Workflow Policies in simple terms.",
    "How does the Siebel Data Model organize business components?",
    "what is importance of outer join flag?"
]

def generate_list(pipe, prompts, max_new_tokens=120):
    outs = []
    for p in prompts:
        g = pipe(p, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7)[0]["generated_text"]
        outs.append(g)
    return outs

before_outputs = generate_list(gen_pipe_before, prompts)
for p, o in zip(prompts, before_outputs):
    print("="*80)
    print("Prompt:", p)
    print(o)


Device set to use cuda:0


Prompt: What is Siebel CRM?
What is Siebel CRM?

Siebel CRM is a framework to help you build a web website in the WordPress framework. This framework works with a variety of WordPress themes and it is compatible with all of them.

How can I create a template that can be used in a blog or blog post?

Once you have a template in place, you can use it in any blog or blog post. The template can also be used in any blog or blog post in any order.

What is the difference between a WordPress theme and a theme from a developer?

A WordPress theme is a
Prompt: Explain Siebel Workflow Policies in simple terms.
Explain Siebel Workflow Policies in simple terms.

The standard workflow for creating user interfaces is the following:

Create a User Interface from the data and data-points that you have created in the previous step.

Create a User Interface from the data and data-points that you have created in the previous step. Create an interface with the same name as the data you created in the prev
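
`generate_list` samples with `do_sample=True` and `temperature=0.7`, so the outputs differ from run to run. Temperature divides the model's raw scores (logits) before the softmax; values below 1 concentrate probability on the top candidates. The sketch below illustrates that effect in plain Python with made-up logits; it is not the library's implementation.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
p_default = softmax_with_temperature(toy_logits, temperature=1.0)
p_cooler = softmax_with_temperature(toy_logits, temperature=0.7)
print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_cooler])  # more mass on the top-scoring token
```

Since sampling is random, a fair before/after comparison benefits from fixing the random seed (for example with `transformers.set_seed`) before each generation pass.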

## 5) Fine-tune GPT-2 on your PDF

In [10]:
#!pip install -U transformers

Collecting transformers
  Downloading transformers-4.55.2-py3-none-any.whl.metadata (41 kB)
Downloading transformers-4.55.2-py3-none-any.whl (11.3 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.55.1
    Uninstalling transformers-4.55.1:
      Successfully uninstalled transformers-4.55.1
Successfully installed transformers-4.55.2


In [12]:

from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    save_total_limit=2,
    fp16=True
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=data_collator,
)

trainer.train()
eval_metrics = trainer.evaluate()
eval_metrics



`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|---|---|
| 50 | 2.7586 |
| 100 | 2.2933 |


{'eval_loss': 2.1167335510253906,
 'eval_runtime': 0.3593,
 'eval_samples_per_second': 83.492,
 'eval_steps_per_second': 11.132,
 'epoch': 3.0}
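
Only two loss values were logged because the run is short. The sketch below works through the step count implied by the `TrainingArguments` above, assuming a single GPU and that a partial final accumulation group still counts as one optimizer step; the exact rounding inside `Trainer` may differ between `transformers` versions.

```python
import math

n_train = 270            # rows in train_ds, from the dataset printout
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
epochs = 3               # num_train_epochs
logging_steps = 50

effective_batch = per_device_batch * grad_accum               # samples per optimizer step
batches_per_epoch = math.ceil(n_train / per_device_batch)     # forward/backward passes
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum) # optimizer steps (assumed rounding)
total_updates = updates_per_epoch * epochs
logged_steps = list(range(logging_steps, total_updates + 1, logging_steps))

print(effective_batch, updates_per_epoch, total_updates, logged_steps)
```

Under these assumptions the sketch gives an effective batch size of 8, 34 updates per epoch, and 102 updates in total, which is consistent with loss rows appearing only at steps 50 and 100.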

## 6) Compare **After Fine-tuning**

In [13]:

from transformers import pipeline
import pandas as pd

gen_pipe_after = pipeline("text-generation", model=base_model, tokenizer=tokenizer, device_map="auto")

after_outputs = generate_list(gen_pipe_after, prompts)

df = pd.DataFrame({
    "Prompt": prompts,
    "Before": before_outputs,
    "After": after_outputs
})

# Display side by side
from IPython.display import display
display(df)


Device set to use cuda:0


| | Prompt | Before | After |
|---|---|---|---|
| 0 | What is Siebel CRM? | What is Siebel CRM?\n\nSiebel CRM is a framewo... | What is Siebel CRM?\n\nIn a nutshell, Siebel C... |
| 1 | Explain Siebel Workflow Policies in simple terms. | Explain Siebel Workflow Policies in simple ter... | Explain Siebel Workflow Policies in simple ter... |
| 2 | How does the Siebel Data Model organize busine... | How does the Siebel Data Model organize busine... | How does the Siebel Data Model organize busine... |
| 3 | what is importance of outer join flag? | what is importance of outer join flag?\n\nI'm ... | what is importance of outer join flag?\nThe va... |
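
The DataFrame view truncates each cell, and every output starts by echoing the prompt, so the table hides most of the difference between the two models. One option is `pd.set_option("display.max_colwidth", None)`; another is to print the full continuations as plain text. The sketch below takes the second route; `strip_prompt` and `show_full` are hypothetical helpers, and the sample strings are made up.

```python
import textwrap

def strip_prompt(prompt, generated):
    """Return only the continuation if the generated text starts with the prompt."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated

def show_full(prompt, before, after, width=80):
    """Build a plain-text report with the full, untruncated continuations."""
    parts = [f"Prompt: {prompt}"]
    for label, text in (("Before", before), ("After", after)):
        parts.append(f"--- {label} ---")
        # textwrap.fill collapses newlines into spaces and wraps at `width`.
        parts.append(textwrap.fill(strip_prompt(prompt, text), width=width))
    return "\n".join(parts)

# Made-up strings standing in for real model outputs.
report = show_full(
    "What is X?",
    "What is X?\n\nX is a placeholder answer from the base model.",
    "What is X?\n\nX is a placeholder answer from the tuned model.",
)
print(report)
```

In the notebook, this could be called in a loop over `zip(prompts, before_outputs, after_outputs)`.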


## 7) (Optional) Perplexity on the test split

In [14]:

import math
metrics = trainer.evaluate(eval_dataset=test_ds)
try:
    perplexity = math.exp(metrics["eval_loss"])
except OverflowError:
    perplexity = float("inf")
print("Perplexity:", perplexity)


Perplexity: 8.303968649153008
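
Perplexity is the exponential of the average per-token negative log-likelihood, so it can be read as "how many equally likely tokens the model is choosing between, on average". The sketch below reproduces the number above from the reported `eval_loss` and checks the definition on a toy case; `perplexity_from_token_losses` is a hypothetical helper, and treating `eval_loss` as a per-token mean in nats is the assumption behind the conversion.

```python
import math

def perplexity_from_token_losses(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)

# eval_loss reported by trainer.evaluate() in the run above.
reported_loss = 2.1167335510253906
print(math.exp(reported_loss))

# Toy check: assigning probability 1/8 to every token gives a perplexity of about 8.
uniform_nlls = [-math.log(1 / 8)] * 5
print(perplexity_from_token_losses(uniform_nlls))
```

A perplexity near 8.3 on the held-out blocks is only meaningful relative to a baseline; evaluating the untuned `gpt2` checkpoint on the same `test_ds` before training would give that reference point.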


## 8) Save / Download your fine-tuned model

In [None]:

import shutil

save_dir = "./fine_tuned_siebel_gpt2"
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

# Zip for easy download
shutil.make_archive("fine_tuned_siebel_gpt2", "zip", save_dir)
from google.colab import files
files.download("fine_tuned_siebel_gpt2.zip")
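
Before downloading, it can be worth confirming that the archive actually contains the files needed to reload the model. The sketch below lists what is missing from a zip; `missing_from_archive` is a hypothetical helper, and the expected file names are assumptions (`config.json` is written by `save_model`, while the tokenizer file names can vary). The demo runs on a throwaway directory rather than the real `save_dir`.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def missing_from_archive(zip_path, expected=("config.json",)):
    """Return the expected file names that are absent from the zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        names = {Path(n).name for n in zf.namelist()}
    return [name for name in expected if name not in names]

# Self-contained demo on a temporary directory standing in for save_dir.
with tempfile.TemporaryDirectory() as tmp:
    fake_save_dir = Path(tmp) / "model"
    fake_save_dir.mkdir()
    (fake_save_dir / "config.json").write_text("{}")
    archive = shutil.make_archive(str(Path(tmp) / "demo"), "zip", str(fake_save_dir))
    demo_missing = missing_from_archive(archive, expected=("config.json", "vocab.json"))
    print(demo_missing)  # only vocab.json is missing in this demo
```

After unzipping elsewhere, the model and tokenizer can be reloaded with `AutoModelForCausalLM.from_pretrained(...)` and `AutoTokenizer.from_pretrained(...)` pointed at the extracted folder.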
