# PodcastIQ - Q&A Extraction Training
## Fine-tune T5 for Question-Answer Extraction

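This notebook fine-tunes a T5 model (`google/flan-t5-base`) for Q&A extraction from podcast transcripts. Each training example pairs an answer text with the question the model should generate for it.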

### Features:
- ✅ **CPU Compatible**: Runs on CPU (slower) or GPU (faster)
- ✅ **Automatic Device Detection**: Uses the GPU when one is available
- ✅ **Device-Aware Batch Sizes**: Picks batch size and gradient accumulation based on the device
- ✅ **Model Export**: Creates a downloadable zip file with all model files
- ✅ **Tokenizer Saving**: Saves the tokenizer files alongside the model

### Requirements:
- Upload `processed_data.zip` from the preprocessing notebook
- The archive must contain `train_qa.json` and `test_qa.json`
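
The later cells read `item['question']` and `item['answer']` from each record, so each JSON file is presumably a list of objects with those two string fields. A minimal sketch of a shape check under that assumption (the helper name `validate_qa_records` is hypothetical and not part of the notebook):

```python
def validate_qa_records(records):
    """Return the indices of records that lack a usable question or answer."""
    bad = []
    for i, item in enumerate(records):
        question = item.get("question") if isinstance(item, dict) else None
        answer = item.get("answer") if isinstance(item, dict) else None
        if not (isinstance(question, str) and question.strip()
                and isinstance(answer, str) and answer.strip()):
            bad.append(i)
    return bad


# First record mirrors the sample printed later in this notebook;
# the second is a made-up record with an empty question.
sample = [
    {"question": "Smaller dogs tend to have how many pups per litter?",
     "answer": "one to four"},
    {"question": "", "answer": "missing question"},
]
print(validate_qa_records(sample))
```

Running a check like this on `train_data` and `test_data` right after loading would surface malformed records before tokenization rather than as a `KeyError` partway through formatting.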

In [None]:
!pip install transformers datasets accelerate evaluate

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


In [None]:
import torch
import json
import zipfile
import os
import warnings
from datasets import Dataset
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)

# Suppress UserWarnings raised from torchao modules (e.g. Triton notices that do not matter on CPU)
warnings.filterwarnings('ignore', category=UserWarning, module='torchao')

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
print(f"GPU Available: {torch.cuda.is_available()}")

# Warn when falling back to CPU
if device == "cpu":
    print("⚠️  Training on CPU - this will be slower but will work!")
    print("💡 Consider using Google Colab with GPU for faster training")

Device: cuda
GPU Available: True


In [None]:
# Upload Q&A data
from google.colab import files
print("Upload processed_data.zip from preprocessing notebook")
uploaded = files.upload()

# Extract if zip was uploaded
if 'processed_data.zip' in uploaded:
    with zipfile.ZipFile('processed_data.zip', 'r') as z:
        z.extractall('.')
    print("✅ Extracted processed_data.zip")

with open('train_qa.json', 'r') as f:
    train_data = json.load(f)
with open('test_qa.json', 'r') as f:
    test_data = json.load(f)

print(f"Train: {len(train_data)}, Test: {len(test_data)}")

Upload processed_data.zip from preprocessing notebook


Saving processed_data.zip to processed_data.zip
✅ Extracted processed_data.zip
Train: 7200, Test: 1800
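
The loading cell assumes the archive holds `train_qa.json` and `test_qa.json` at its top level. A minimal sketch, using only the standard library, that checks for both members before extracting. The helper name `extract_required` and the `REQUIRED_FILES` constant are assumptions for illustration:

```python
import zipfile

REQUIRED_FILES = ("train_qa.json", "test_qa.json")


def extract_required(zip_path, dest_dir=".", required=REQUIRED_FILES):
    """Extract zip_path into dest_dir after confirming the required members exist."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        names = set(archive.namelist())
        missing = [name for name in required if name not in names]
        if missing:
            raise FileNotFoundError(f"Archive is missing: {', '.join(missing)}")
        archive.extractall(dest_dir)
    return sorted(names)
```

Failing early with a clear message is arguably friendlier than the `FileNotFoundError` that `open('train_qa.json')` would raise after a silent partial extraction.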


In [None]:
# Load T5 model
MODEL_NAME = "google/flan-t5-base"

print(f"Loading model: {MODEL_NAME}")
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# Move model to device
model = model.to(device)
print(f"✅ Model loaded and moved to {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e6:.1f}M")

Loading model: google/flan-t5-base


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


✅ Model loaded and moved to cuda
Model parameters: 247.6M
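
The parameter count gives a quick estimate of checkpoint size: at 4 bytes per float32 weight, 247.6M parameters come to roughly 990 MB. The export cell at the end of this notebook divides by `1024 * 1024`, which is why it reports a smaller figure for `model.safetensors`. A small worked calculation, where `param_count` is the rounded value printed above rather than an exact count:

```python
param_count = 247.6e6        # rounded value printed by the cell above
bytes_per_param = 4          # float32 weights

size_bytes = param_count * bytes_per_param
size_mb = size_bytes / 1e6            # decimal megabytes
size_mib = size_bytes / (1024 ** 2)   # binary mebibytes, as used in the export cell

print(f"{size_mb:.1f} MB, {size_mib:.1f} MiB")
```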


In [None]:
# Create training format: answer text -> question
def format_qa_pairs(data):
    """Format Q&A pairs for training"""
    formatted = []
    for item in data:
        # Input: task prefix plus the answer text; target: the question
        input_text = f"generate a health claim question: {item['answer']}"
        output_text = item['question']
        formatted.append({'input': input_text, 'output': output_text})
    return formatted

train_formatted = format_qa_pairs(train_data)
test_formatted = format_qa_pairs(test_data)

train_dataset = Dataset.from_list(train_formatted)
test_dataset = Dataset.from_list(test_formatted)

print("Sample:", train_dataset[0])

Sample: {'input': 'generate a health claim question: one to four', 'output': 'Smaller dogs tend to have how many pups per litter?'}
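
The task prefix is typed out once in `format_qa_pairs` and again in the inference cell, and the two must match exactly for the fine-tuned model to see the prompt it was trained on. One way to keep them in sync, sketched here with the hypothetical names `TASK_PREFIX`, `build_prompt`, and `format_pairs`:

```python
TASK_PREFIX = "generate a health claim question: "  # same wording as the notebook


def build_prompt(text):
    """Prepend the task prefix to an answer or context string."""
    # Assumption: stripping surrounding whitespace is acceptable for this data.
    return f"{TASK_PREFIX}{text.strip()}"


def format_pairs(data):
    """Map records with 'question' and 'answer' keys to input/output pairs."""
    return [
        {"input": build_prompt(item["answer"]), "output": item["question"]}
        for item in data
    ]


example = [{"question": "Smaller dogs tend to have how many pups per litter?",
            "answer": "one to four"}]
print(format_pairs(example)[0])
```

The same `build_prompt` call could then be reused at inference time instead of a second hand-written f-string.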


In [None]:
# Tokenize
def tokenize_fn(examples):
    # Padding is left to DataCollatorForSeq2Seq, which pads inputs per batch
    # and pads labels with -100 so padded positions are ignored by the loss.
    inputs = tokenizer(
        examples['input'],
        max_length=512,
        truncation=True,
        add_special_tokens=True
    )
    labels = tokenizer(
        examples['output'],
        max_length=128,
        truncation=True,
        add_special_tokens=True
    )
    inputs['labels'] = labels['input_ids']
    return inputs

print("Tokenizing datasets...")
tokenized_train = train_dataset.map(tokenize_fn, batched=True, remove_columns=['input', 'output'])
tokenized_test = test_dataset.map(tokenize_fn, batched=True, remove_columns=['input', 'output'])

print("✅ Tokenization complete!")
print(f"Train samples: {len(tokenized_train)}")
print(f"Test samples: {len(tokenized_test)}")

Tokenizing datasets...


✅ Tokenization complete!
Train samples: 7200
Test samples: 1800
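
`DataCollatorForSeq2Seq`, used in the training cell, pads label sequences with `-100` by default, and the cross-entropy loss skips positions carrying that value. The plain-Python sketch below illustrates the convention on toy token IDs. `mask_label_padding` is a hypothetical helper, the token IDs are made up, and `pad_token_id=0` matches T5's pad token:

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_label_padding(label_ids, pad_token_id=0, ignore_index=IGNORE_INDEX):
    """Replace pad token IDs in each label sequence with ignore_index."""
    return [
        [ignore_index if token == pad_token_id else token for token in seq]
        for seq in label_ids
    ]


# Toy batch: two label sequences padded to length 5.
padded = [[71, 822, 1, 0, 0], [363, 19, 8, 1525, 1]]
print(mask_label_padding(padded))
```

Labels that keep the raw pad ID would count toward the loss, which tends to make the reported loss look better than the model's performance on real tokens.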


In [None]:
# Training Configuration
# Adjust batch sizes for CPU (smaller batches to avoid OOM)
batch_size = 2 if device == "cpu" else 4
gradient_accumulation = 4 if device == "cpu" else 1

print(f"Training configuration:")
print(f"  Device: {device}")
print(f"  Batch size: {batch_size}")
print(f"  Gradient accumulation: {gradient_accumulation}")

training_args = Seq2SeqTrainingArguments(
    output_dir="./podcastiq-qa",
    eval_strategy="epoch",
    save_strategy="epoch",  # Match eval_strategy for load_best_model_at_end
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,
    predict_with_generate=True,
    fp16=False,  # Disable fp16 for CPU compatibility
    bf16=False,  # Disable bf16 for CPU
    logging_steps=50,
    warmup_steps=100,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False
)

# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

# Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator
)

print("\n🚀 Starting training...")
print("This may take a while on CPU. Please be patient!")
trainer.train()
print("\n✅ Q&A Training complete!")

Training configuration:
  Device: cuda
  Batch size: 4
  Gradient accumulation: 1

🚀 Starting training...
This may take a while on CPU. Please be patient!


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.8535        | 0.844704        |
| 2     | 0.7945        | 0.826065        |
| 3     | 0.7506        | 0.822117        |
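
Validation loss here is approximately the mean cross-entropy per target token, so its exponential gives a perplexity figure that can be easier to compare across runs. A short calculation over the three epochs logged above:

```python
import math

# Validation losses copied from the training log above.
eval_losses = [0.844704, 0.826065, 0.822117]

perplexities = [math.exp(loss) for loss in eval_losses]
for epoch, (loss, ppl) in enumerate(zip(eval_losses, perplexities), start=1):
    print(f"epoch {epoch}: loss={loss:.4f}, perplexity={ppl:.3f}")
```

The improvement between epochs 2 and 3 is small, which suggests that additional epochs at this learning rate would likely bring diminishing returns.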


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].



✅ Q&A Training complete!
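
The effective batch size is the per-device batch size times the gradient accumulation steps, so the CPU settings (2 x 4) give 8 examples per optimizer update while the GPU settings (4 x 1) give 4. A rough step count for the 7,200 training examples, assuming a single device (`optimizer_steps` is a hypothetical helper, and the exact count reported by the trainer may differ slightly by library version):

```python
import math

train_examples = 7200
epochs = 3


def optimizer_steps(batch_size, grad_accum, n_examples=train_examples, n_epochs=epochs):
    """Approximate number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * n_epochs


print("GPU:", optimizer_steps(batch_size=4, grad_accum=1))
print("CPU:", optimizer_steps(batch_size=2, grad_accum=4))
```

Because `warmup_steps=100` is fixed, warmup covers a larger share of training on CPU, where there are fewer updates overall.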


In [None]:
# Test inference
print("Testing trained model...")
test_context = "Sleep is essential for memory consolidation. During deep sleep, the brain processes and stores information learned during the day."

input_text = f"generate a health claim question: {test_context}"
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Switch to eval mode so dropout is disabled, then generate the question
model.eval()
outputs = model.generate(
    **inputs,
    max_length=64,
    min_length=10,
    num_beams=4,
    early_stopping=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)
question = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"\nContext: {test_context}")
print(f"Generated Question: {question}")
print("\n✅ Inference test complete!")

Testing trained model...

Context: Sleep is essential for memory consolidation. During deep sleep, the brain processes and stores information learned during the day.
Generated Question: What is the role of sleep in memory consolidation?

✅ Inference test complete!


In [None]:
# Save model
print("Saving model and tokenizer...")

# Save the best model (from training)
output_dir = "./podcastiq-qa-extractor"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

# Verify files were saved
print("\n✅ Model saved! Files:")
saved_files = os.listdir(output_dir)
for f in saved_files:
    size = os.path.getsize(os.path.join(output_dir, f)) / (1024 * 1024)  # MB
    print(f"  - {f} ({size:.2f} MB)")

# Create zip file
print("\n📦 Creating zip file...")
zip_filename = "podcastiq-qa-extractor.zip"

# Remove old zip if exists
if os.path.exists(zip_filename):
    os.remove(zip_filename)

# Create zip
with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Loop names chosen so they do not shadow the `files` module from google.colab
    for root, _dirs, filenames in os.walk(output_dir):
        for filename in filenames:
            file_path = os.path.join(root, filename)
            arcname = os.path.relpath(file_path, os.path.dirname(output_dir))
            zipf.write(file_path, arcname)
            print(f"  Added: {arcname}")

zip_size = os.path.getsize(zip_filename) / (1024 * 1024)  # MB
print(f"\n✅ Zip file created: {zip_filename} ({zip_size:.2f} MB)")

# Download (Colab only)
try:
    from google.colab import files
    print("\n📥 Downloading zip file...")
    files.download(zip_filename)
    print("✅ Download complete!")
except ImportError:
    print(f"\n💡 Not in Colab. Zip file saved at: {os.path.abspath(zip_filename)}")
    print("   You can download it manually from the file browser.")

print("\n🎉 Model training and export complete!")
print(f"📁 Model directory: {os.path.abspath(output_dir)}")
print(f"📦 Zip file: {os.path.abspath(zip_filename)}")

Saving model and tokenizer...

✅ Model saved! Files:
  - special_tokens_map.json (0.00 MB)
  - added_tokens.json (0.00 MB)
  - tokenizer_config.json (0.02 MB)
  - training_args.bin (0.01 MB)
  - config.json (0.00 MB)
  - spiece.model (0.75 MB)
  - model.safetensors (944.47 MB)
  - generation_config.json (0.00 MB)

📦 Creating zip file...
  Added: podcastiq-qa-extractor/special_tokens_map.json
  Added: podcastiq-qa-extractor/added_tokens.json
  Added: podcastiq-qa-extractor/tokenizer_config.json
  Added: podcastiq-qa-extractor/training_args.bin
  Added: podcastiq-qa-extractor/config.json
  Added: podcastiq-qa-extractor/spiece.model
  Added: podcastiq-qa-extractor/model.safetensors
  Added: podcastiq-qa-extractor/generation_config.json

✅ Zip file created: podcastiq-qa-extractor.zip (876.78 MB)

📥 Downloading zip file...


✅ Download complete!

🎉 Model training and export complete!
📁 Model directory: /content/podcastiq-qa-extractor
📦 Zip file: /content/podcastiq-qa-extractor.zip
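
Before moving the archive elsewhere, it can help to confirm that it holds the files needed to reload the model and tokenizer. A minimal sketch using the standard library. The `EXPECTED` set is taken from the export listing in this notebook, and the helper name `missing_from_zip` is hypothetical:

```python
import os
import zipfile

# File names taken from the export listing in this notebook.
EXPECTED = {"config.json", "model.safetensors", "spiece.model", "tokenizer_config.json"}


def missing_from_zip(zip_path, expected=EXPECTED):
    """Return the expected file names that are absent from the archive, sorted."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        present = {os.path.basename(name) for name in archive.namelist()}
    return sorted(set(expected) - present)
```

Calling `missing_from_zip("podcastiq-qa-extractor.zip")` should return an empty list for a complete export. Any names it does return point to files that were not written to the output directory.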
