# T5 vs Gemma-3-4B

## Setup & Model Loading

Environment configuration and GPU detection

In [None]:
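# Only the PyTorch wheels live on the CUDA index; the other packages come from PyPI.
%pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126 --quiet
%pip install transformers accelerate pandas sentencepiece hf_xet matplotlib --quiet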
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device Used: {device}")

Note: you may need to restart the kernel to use updated packages.
Device Used: cuda


Model initialization

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
import time
import pandas as pd

t5_name = "t5-small"

try:
    t5_tokenizer = AutoTokenizer.from_pretrained(t5_name)
    t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_name).to(device)
except Exception as e:
    print(f"Could not load {t5_name}:", e)
    t5_tokenizer, t5_model = None, None

gemma_name = "google/gemma-3-4b-it"
# The instruction-tuned ("it") checkpoint was chosen over the pretrained ("pt") one
# because it is additionally fine-tuned to follow instructions such as "summarize this text".

try:
    gemma_tokenizer = AutoTokenizer.from_pretrained(gemma_name)
    gemma_model = AutoModelForCausalLM.from_pretrained(
        gemma_name,
        torch_dtype=torch.bfloat16
        ).to(device)
except Exception as e:
    print(f"Could not load {gemma_name}:", e)
    gemma_tokenizer, gemma_model = None, None

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
`torch_dtype` is deprecated! Use `dtype` instead!

Data Loading and Preparation

In [None]:
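# Read one prompt per line, dropping trailing newlines and blank lines.
with open("data/parables.txt", "r", encoding="utf-8") as f:
    prompts = [line.strip() for line in f if line.strip()]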

## Performance Benchmarking

In [None]:
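import functools
from typing import Callable

def time_me(func: Callable):
    """Return (result, elapsed_seconds) for each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # finish pending GPU work before starting the clock
        start = time.perf_counter()
        result = func(*args, **kwargs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait for them
        return result, time.perf_counter() - start
    return wrapper

def memory_me(func: Callable):
    """Return (result, allocated_delta_bytes, peak_bytes) for each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not torch.cuda.is_available():
            return func(*args, **kwargs), None, None  # no CUDA statistics on CPU
        torch.cuda.reset_peak_memory_stats()
        start_mem = torch.cuda.memory_allocated()
        result = func(*args, **kwargs)
        end_mem = torch.cuda.memory_allocated()
        peak_mem = torch.cuda.max_memory_allocated()
        return result, end_mem - start_mem, peak_mem
    return wrapper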
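
A function wrapped by both decorators returns a nested tuple: the memory wrapper produces `(result, delta, peak)` and the timing wrapper then wraps that as `(..., seconds)`. The sketch below shows only that nesting. `with_memory` and `with_elapsed` are hypothetical stand-ins that return fixed placeholder values rather than real measurements, so the example runs without a GPU.

```python
import functools

def with_memory(func):
    """Stand-in for memory_me: returns (result, mem_delta, peak_mem) with dummy zeros."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return result, 0, 0  # placeholder memory figures, not measurements
    return wrapper

def with_elapsed(func):
    """Stand-in for time_me: returns (result, seconds) with a dummy duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return result, 0.0  # placeholder duration, not a measurement
    return wrapper

@with_elapsed   # outermost decorator: its tuple is the outer one
@with_memory    # innermost decorator: wraps the raw function first
def shout(text):
    return text.upper()

# Unpack the nested result in the same order the decorators were stacked.
(output, mem_delta, peak_mem), seconds = shout("hello")
print(output, mem_delta, peak_mem, seconds)
```

Any caller of a function decorated this way needs the same two-level unpacking.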

In [None]:
@time_me
@memory_me
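def summarize(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    # Decoder-only models (Gemma) echo the prompt in their output, so keep only the new tokens.
    start = 0 if model.config.is_encoder_decoder else inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][start:], skip_special_tokens=True)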
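
One detail matters when comparing the two model families: `generate` on a decoder-only model such as Gemma returns the prompt tokens followed by the newly generated ones, while an encoder-decoder model such as T5 returns only the decoder sequence. A minimal sketch of separating the two cases, using plain Python lists with made-up token ids; `new_token_ids` is a hypothetical helper, not part of `transformers`:

```python
def new_token_ids(output_ids, prompt_length, is_encoder_decoder):
    """Return only the generated ids from one row of generate() output.

    output_ids: sequence of token ids for a single example.
    prompt_length: number of prompt tokens fed to the model.
    is_encoder_decoder: True for T5-style models, False for Gemma-style models.
    """
    if is_encoder_decoder:
        # The decoder output never contains the encoder input, so nothing to strip.
        return list(output_ids)
    # Decoder-only output starts with a copy of the prompt; drop it.
    return list(output_ids[prompt_length:])

# Toy ids for illustration only (not real vocabulary entries).
prompt = [11, 12, 13]
decoder_only_output = prompt + [21, 22]
encoder_decoder_output = [0, 21, 22]

print(new_token_ids(decoder_only_output, len(prompt), is_encoder_decoder=False))
print(new_token_ids(encoder_decoder_output, len(prompt), is_encoder_decoder=True))
```

Without this step, the decoded Gemma text and any token counts derived from it would include the prompt, which skews a comparison against T5.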

Single Use Comparison

In [None]:
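t5_results = pd.DataFrame({"prompt": prompts, "output": [None]*len(prompts), "inference_speed": [None]*len(prompts), "peak_memory": [None]*len(prompts)})

gemma_results = t5_results.copy()

# summarize() returns ((text, mem_delta, peak_mem), runtime) because it is wrapped by
# memory_me and then by time_me. The "inference_speed" column holds runtime in seconds.
for row_index, prompt in enumerate(prompts):
    if t5_model:
        # T5 expects its task prefix ("summarize: ") in front of the input text.
        (t5_output, _, t5_peak), t5_runtime = summarize(t5_model, t5_tokenizer, "summarize: " + prompt)
        t5_results.at[row_index, "output"] = t5_output
        t5_results.at[row_index, "inference_speed"] = t5_runtime
        t5_results.at[row_index, "peak_memory"] = t5_peak
    else:
        t5_results.at[row_index, "output"] = "<T5 not loaded>"

    if gemma_model:
        (gemma_output, _, gemma_peak), gemma_runtime = summarize(gemma_model, gemma_tokenizer, "Summarize this text: " + prompt)
        gemma_results.at[row_index, "output"] = gemma_output
        gemma_results.at[row_index, "inference_speed"] = gemma_runtime
        gemma_results.at[row_index, "peak_memory"] = gemma_peak
    else:
        gemma_results.at[row_index, "output"] = "<Gemma not loaded>"

def count_tokens(tokenizer, text):
    # Approximate output length by re-encoding the decoded text; None if the model is missing.
    return len(tokenizer.encode(text, add_special_tokens=False)) if tokenizer else None

# Combine both result tables into the single frame `df` used by the charts below.
df = pd.DataFrame({
    "prompt": t5_results["prompt"],
    "t5_runtime": pd.to_numeric(t5_results["inference_speed"]),
    "gemma_runtime": pd.to_numeric(gemma_results["inference_speed"]),
    "t5_tokens": [count_tokens(t5_tokenizer, o) for o in t5_results["output"]],
    "gemma_tokens": [count_tokens(gemma_tokenizer, o) for o in gemma_results["output"]],
    "t5_mem_peak": pd.to_numeric(t5_results["peak_memory"]),
    "gemma_mem_peak": pd.to_numeric(gemma_results["peak_memory"]),
})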

Batch Processing Comparison

Memory usage comparison charts

Inference speed analysis

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
plt.bar(df.index - 0.15, df["t5_runtime"], width=0.3, label="T5")
plt.bar(df.index + 0.15, df["gemma_runtime"], width=0.3, label="Gemma")
plt.xticks(df.index, [f"Prompt {i+1}" for i in df.index])
plt.ylabel("Runtime (s)")
plt.title("Inference Speed")
plt.legend()
plt.show()

Output length and generation time

In [None]:
plt.figure(figsize=(8,5))
plt.scatter(df["t5_tokens"], df["t5_runtime"], label="T5", marker="o")
plt.scatter(df["gemma_tokens"], df["gemma_runtime"], label="Gemma", marker="x")
plt.xlabel("Output tokens")
plt.ylabel("Runtime (s)")
plt.title("Output Length vs Generation Time")
plt.legend()
plt.show()

if torch.cuda.is_available():
    plt.figure(figsize=(8,5))
    plt.bar(df.index - 0.15, [m/1e6 for m in df["t5_mem_peak"]], width=0.3, label="T5")
    plt.bar(df.index + 0.15, [m/1e6 for m in df["gemma_mem_peak"]], width=0.3, label="Gemma")
    plt.xticks(df.index, [f"Prompt {i+1}" for i in df.index])
    plt.ylabel("Peak Memory (MB)")
    plt.title("Memory Usage")
    plt.legend()
    plt.show()

Batch processing performance

In [None]:
batch_prompts = ["Hello world!" for _ in range(4)]
batch_sizes = [1, 2, 4]
batch_results = []

def timed_generate(model, tokenizer, texts, max_new_tokens=32):
    """Wall-clock seconds for one batched generate() call; None if the model is not loaded."""
    if model is None:
        return None
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # exclude earlier GPU work from the measurement
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for asynchronous CUDA kernels to finish
    return time.perf_counter() - start

for bs in batch_sizes:
    batch_results.append({
        "batch_size": bs,
        "t5_time": timed_generate(t5_model, t5_tokenizer, batch_prompts[:bs]),
        "gemma_time": timed_generate(gemma_model, gemma_tokenizer, batch_prompts[:bs]),
    })

batch_df = pd.DataFrame(batch_results)
display(batch_df)

plt.figure(figsize=(8,5))
plt.plot(batch_df["batch_size"], batch_df["t5_time"], label="T5", marker="o")
if gemma_model:
    plt.plot(batch_df["batch_size"], batch_df["gemma_time"], label="Gemma", marker="x")
plt.xlabel("Batch size")
plt.ylabel("Runtime (s)")
plt.title("Batch Processing Performance")
plt.legend()
plt.show()
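
Total runtime per batch does not show directly how well a model amortizes work across prompts. Dividing by the batch size gives the per-prompt latency, and inverting it gives throughput. A minimal sketch follows; `per_item_latency` and `throughput` are hypothetical helpers, and the timings in the example are placeholders chosen for illustration, not measurements from this notebook.

```python
def per_item_latency(batch_size, total_seconds):
    """Seconds spent per prompt when `batch_size` prompts finish in `total_seconds`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return total_seconds / batch_size

def throughput(batch_size, total_seconds):
    """Prompts completed per second for one batched call."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return batch_size / total_seconds

# Placeholder timings for illustration only; these are NOT results from this notebook.
example_timings = {1: 0.5, 2: 0.6, 4: 0.8}
for bs, seconds in example_timings.items():
    print(bs, per_item_latency(bs, seconds), throughput(bs, seconds))
```

With the real `batch_df`, the same calculation could be applied column-wise, for example `batch_df["t5_time"] / batch_df["batch_size"]`.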

## Qualitative Output Assessment

Side-by-side summary comparison

In [None]:
summary_cols = [
    "t5_runtime", "gemma_runtime",
    "t5_tokens", "gemma_tokens",
    "t5_mem_peak", "gemma_mem_peak"
]

summary_df = df[["prompt"] + summary_cols]
display(summary_df)  # a bare expression is only rendered when it is the last line of a cell

comparison = pd.concat([
    df[["prompt", "t5_runtime", "t5_tokens", "t5_mem_peak"]].rename(
        columns={"t5_runtime": "runtime", "t5_tokens": "tokens", "t5_mem_peak": "peak_mem"}
    ).assign(model="T5"),
    
    df[["prompt", "gemma_runtime", "gemma_tokens", "gemma_mem_peak"]].rename(
        columns={"gemma_runtime": "runtime", "gemma_tokens": "tokens", "gemma_mem_peak": "peak_mem"}
    ).assign(model="Gemma")
])

comparison = comparison.pivot(index="prompt", columns="model", values=["runtime", "tokens", "peak_mem"])
comparison

## Practical Recommendations

Hardware requirement guidelines

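- **T5 (small/base/large)**
  - Model size: about 60M (small), 220M (base), and 770M (large) parameters.
  - Runs comfortably on CPU; a GPU is optional and mainly speeds things up.
  - GPU memory: the fp32 weights take roughly 0.24 GB for `t5-small` and 3 GB for `t5-large`; budgeting ~1–2 GB and ~4–6 GB respectively leaves headroom for activations and framework overhead.
  - Faster inference on modest hardware, but lower output fluency.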

- **Gemma-3-4B**
  - About 4B parameters, so far heavier than any of the T5 variants above.
  - GPU strongly recommended: needs ~8–12 GB VRAM for inference in bfloat16, since the weights alone take about 8 GB.
  - CPU-only inference is possible but very slow.
  - Better fluency, coherence, and longer-context handling, but higher hardware cost.
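
The weight-only part of these memory figures follows from parameter count times bytes per parameter. The sketch below is a rough estimate under stated assumptions: nominal parameter counts (60M for `t5-small`, 770M for `t5-large`, 4B for Gemma-3-4B), and no allowance for activations, KV cache, or framework overhead, all of which add to real usage. `weight_memory_gb` is a hypothetical helper.

```python
# Bytes needed to store one parameter in common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="float32"):
    """Approximate memory for the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unsupported dtype: {dtype}")
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Nominal parameter counts (assumptions; exact counts differ slightly per checkpoint).
models = {
    "t5-small (float32)": (60e6, "float32"),
    "t5-large (float32)": (770e6, "float32"),
    "gemma-3-4b (bfloat16)": (4e9, "bfloat16"),
}
for name, (params, dtype) in models.items():
    print(f"{name}: ~{weight_memory_gb(params, dtype):.2f} GB of weights")
```

Comparing this estimate with `torch.cuda.max_memory_allocated()` after a run gives a sense of how much of the peak is weights and how much is runtime overhead.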

Performance vs quality trade-offs

- **T5**
  - Pros: Fast, lightweight, lower memory usage, good for translation and summarization tasks.
  - Cons: Output may be shorter, less nuanced, and sometimes rigid compared to larger models.

- **Gemma-3-4B**
  - Pros: Generates richer, more natural text; handles open-ended prompts better.
  - Cons: Slower inference, higher VRAM usage, may require smaller batch sizes to fit in memory.

Choose T5 for speed and low resources, and Gemma for quality and richer outputs if you have the hardware budget.