# PodcastIQ - Summarization Model Training
## Fine-tune BART-large-cnn for Podcast Summarization

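This notebook fine-tunes `facebook/bart-large-cnn` on the preprocessed PodcastIQ summarization data and trains it for length-aware generation: each input is prefixed with a control token (`[SHORT]`, `[MEDIUM]`, or `[LONG]`) that signals the target summary length.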

In [None]:
# Install dependencies
!pip install transformers datasets accelerate evaluate rouge-score sentencepiece
!pip install huggingface_hub



In [None]:
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

GPU Available: True
GPU: Tesla T4


In [None]:
import json
import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
import evaluate

## Upload Preprocessed Data

In [None]:
from google.colab import files
import zipfile

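print("Upload processed_data.zip from preprocessing notebook")
uploaded = files.upload()

# Use the name Colab actually saved; a re-upload may be renamed
# (for example "processed_data (1).zip")
zip_name = next(iter(uploaded))
with zipfile.ZipFile(zip_name, 'r') as z:
    z.extractall('.')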

Upload processed_data.zip from preprocessing notebook


Saving processed_data.zip to processed_data.zip


In [None]:
# Load training data
with open('train_summarization.json', 'r') as f:
    train_data = json.load(f)
with open('val_summarization.json', 'r') as f:
    val_data = json.load(f)

print(f"Train samples: {len(train_data)}")
print(f"Val samples: {len(val_data)}")

Train samples: 3566
Val samples: 141


In [None]:
# Create datasets
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)

print(train_dataset)
print("Sample:", train_dataset[0])

Dataset({
    features: ['input', 'output', 'length', 'source'],
    num_rows: 3566
})
Sample: {'input': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today? #Person2#: I found it would be a good idea to get a check-up. #Person1#: Yes, well, you haven't had one for 5 years. You should have one every year. #Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor? #Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good. #Person2#: Ok. #Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith? #Person2#: Yes. #Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit. #Person2#: I've tried hundreds of times, but I just can't seem to kick the habit. #Person1#: Well, we have classes and some medications that might help. I'll give you more information before you
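
Each record carries a `length` label that later selects the control token, so it can be worth checking how those labels are distributed before training. The following is a minimal sketch only: `count_lengths` is a hypothetical helper and `toy_records` is made-up data standing in for `train_data`.

```python
from collections import Counter

def count_lengths(records):
    """Count how many records carry each `length` label."""
    return Counter(record.get("length", "missing") for record in records)

# Toy records standing in for train_data (hypothetical values).
toy_records = [
    {"input": "a", "output": "b", "length": "short", "source": "x"},
    {"input": "c", "output": "d", "length": "medium", "source": "x"},
    {"input": "e", "output": "f", "length": "short", "source": "y"},
]

counts = count_lengths(toy_records)
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / len(toy_records):.0%})")
```

Running `count_lengths(train_data)` in the notebook would show whether one length bucket dominates the training set.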

## Load Model and Tokenizer

In [None]:
MODEL_NAME = "facebook/bart-large-cnn"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

print(f"Model loaded: {MODEL_NAME}")
print(f"Parameters: {model.num_parameters():,}")

Model loaded: facebook/bart-large-cnn
Parameters: 406,290,432
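
The parameter count printed above gives a quick way to estimate memory needs. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes 4 bytes per fp32 weight and a rough rule of thumb of 16 bytes per parameter for training with an Adam-style optimizer (weights, gradients, and two optimizer moments), and it ignores activations entirely.

```python
NUM_PARAMS = 406_290_432  # parameter count reported by model.num_parameters()

def gigabytes(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9

# fp32 weights only: 4 bytes per parameter.
weights_fp32 = NUM_PARAMS * 4

# Rough training state (assumption): fp32 weights + fp32 gradients
# + two Adam moment buffers = 4 + 4 + 8 bytes per parameter.
training_state = NUM_PARAMS * (4 + 4 + 8)

print(f"Weights (fp32):       {gigabytes(weights_fp32):.2f} GB")
print(f"Training state (est): {gigabytes(training_state):.2f} GB")
```

Under these assumptions the training state alone takes a sizeable share of a single GPU's memory, which is one reason to keep the per-device batch size small.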


In [None]:
# Length-aware preprocessing
LENGTH_TOKENS = {
    'short': '[SHORT]',
    'medium': '[MEDIUM]',
    'long': '[LONG]'
}

# Add special tokens
special_tokens = {'additional_special_tokens': list(LENGTH_TOKENS.values())}
tokenizer.add_special_tokens(special_tokens)
model.resize_token_embeddings(len(tokenizer))

def preprocess_function(examples):
    """Preprocess with length prefix"""
    # Add length token to input
    inputs = []
    for text, length in zip(examples['input'], examples['length']):
        length_token = LENGTH_TOKENS.get(length, '[MEDIUM]')
        inputs.append(f"{length_token} {text}")

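    # Tokenize without padding. DataCollatorForSeq2Seq pads each batch
    # dynamically and fills label padding with -100, which the loss ignores.
    # Padding here instead would put pad-token ids into the labels and
    # include them in the loss.
    model_inputs = tokenizer(
        inputs,
        max_length=1024,
        truncation=True
    )

    labels = tokenizer(
        text_target=examples['output'],
        max_length=256,
        truncation=True
    )

    model_inputs['labels'] = labels['input_ids']
    return model_inputs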

# Tokenize datasets
tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names
)
tokenized_val = val_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=val_dataset.column_names
)

print(f"Tokenized train: {len(tokenized_train)}")

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Tokenized train: 3566
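
Label padding deserves care. Positions set to -100 are skipped by the cross-entropy loss, whereas positions filled with the tokenizer's pad token are scored like any other token. `DataCollatorForSeq2Seq` pads labels with -100 by default. The pure-Python sketch below illustrates that idea only; `pad_and_mask_labels` is a hypothetical helper and the token ids are made up.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def pad_and_mask_labels(label_rows, ignore_index=IGNORE_INDEX):
    """Right-pad variable-length label rows with ignore_index."""
    width = max(len(row) for row in label_rows)
    return [row + [ignore_index] * (width - len(row)) for row in label_rows]

# Two toy label sequences of different lengths (made-up token ids).
toy_labels = [[0, 713, 16, 2], [0, 713, 2]]

padded = pad_and_mask_labels(toy_labels)
print(padded)
```

The shorter row gains one trailing -100, so the loss is computed only over real target tokens.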


## Training Configuration

In [None]:
# Load ROUGE metric
rouge = evaluate.load('rouge')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

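    # Some configurations return predictions as a tuple; generated ids come first
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # Replace any -100 padding in predictions, then decode
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)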

    # Replace -100 in labels (padding) with pad token
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE scores
    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True
    )

    return {k: round(v * 100, 2) for k, v in result.items()}
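
To make the reported metric concrete, here is a simplified sketch of what ROUGE-1 F1 measures: unigram overlap between a prediction and a reference. It is not the `rouge_score` implementation. It assumes plain whitespace tokenization with no stemming, so its numbers will differ from those of `evaluate.load('rouge')`. The function name and example sentences are illustrative.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller side.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the doctor advises a yearly check-up",
    "the doctor recommends a check-up every year",
)
print(f"ROUGE-1 F1: {score:.4f}")
```

Here four of the six predicted tokens and four of the seven reference tokens match, giving an F1 of 8/13.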


In [None]:
# Training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./podcastiq-summarizer",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=3,
    predict_with_generate=True,
    generation_max_length=256,
    fp16=True,
    push_to_hub=False,
    logging_steps=50,
    report_to="none"
)

# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model
)

# Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
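
The batch settings above combine into an effective batch size and an approximate number of optimizer updates. The sketch below works through that arithmetic for the 3,566 training samples, assuming a single GPU. The exact update count may differ slightly, depending on how the installed `transformers` version rounds the final partial accumulation group.

```python
import math

train_samples = 3566   # from the data-loading output
per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
epochs = 3             # num_train_epochs
num_devices = 1        # assumption: a single GPU

# Samples contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Forward/backward passes per epoch (last batch may be smaller).
batches_per_epoch = math.ceil(train_samples / (per_device_batch * num_devices))

# Approximate optimizer updates, rounding the last partial group up.
approx_updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
approx_total_updates = approx_updates_per_epoch * epochs

print(f"Effective batch size: {effective_batch}")
print(f"Batches per epoch:    {batches_per_epoch}")
print(f"Updates per epoch:    ~{approx_updates_per_epoch}")
print(f"Total updates:        ~{approx_total_updates}")
```

With `logging_steps=50`, this corresponds to roughly nine training-loss log entries per epoch.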



## Train the Model

In [None]:
# Train!
print("Starting training...")
trainer.train()
print("✅ Training complete!")

Starting training...


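| Epoch | Training Loss | Validation Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|-------|---------------|-----------------|---------|---------|---------|------------|
| 1 | 0.2621 | 0.528838 | 38.82 | 17.02 | 30.47 | 30.4 |
| 2 | 0.1945 | 0.538784 | 39.45 | 18.0 | 30.82 | 30.86 |
| 3 | 0.1521 | 0.559413 | 40.61 | 18.83 | 31.99 | 31.96 |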




✅ Training complete!
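
In the log above, validation loss rises every epoch while the ROUGE scores keep improving, so which epoch counts as "best" depends on the selection metric. The sketch below makes that explicit using the logged values. The `history` structure and variable names are illustrative and are not produced by the trainer in this form.

```python
# Per-epoch values copied from the training log above.
history = [
    {"epoch": 1, "val_loss": 0.528838, "rougeL": 30.47},
    {"epoch": 2, "val_loss": 0.538784, "rougeL": 30.82},
    {"epoch": 3, "val_loss": 0.559413, "rougeL": 31.99},
]

# Lower is better for loss, higher is better for ROUGE.
best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_rouge = max(history, key=lambda row: row["rougeL"])

print(f"Best epoch by validation loss: {best_by_loss['epoch']}")
print(f"Best epoch by ROUGE-L:         {best_by_rouge['epoch']}")
```

If checkpoint selection matters, `Seq2SeqTrainingArguments` offers `load_best_model_at_end` together with `metric_for_best_model`. Whether ROUGE or loss is the better criterion here is a judgment call that three epochs of data cannot settle.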


In [None]:
# Evaluate
results = trainer.evaluate()
print("\nEvaluation Results:")
for k, v in results.items():
    print(f"  {k}: {v}")


Evaluation Results:
  eval_loss: 0.5594127178192139
  eval_rouge1: 40.61
  eval_rouge2: 18.83
  eval_rougeL: 31.99
  eval_rougeLsum: 31.96
  eval_runtime: 132.9786
  eval_samples_per_second: 1.06
  eval_steps_per_second: 0.534
  epoch: 3.0


## Test Length-Aware Generation

In [None]:
# Test inference with different lengths
# Note: this sample comes from the training set, so this is a smoke test
# of the length tokens rather than an evaluation
test_text = train_data[0]['input'][:1000]

# Length-specific generation bounds (min_length/max_length count tokens, not words)
length_params = {
    'short': {'min_length': 30, 'max_length': 80},
    'medium': {'min_length': 80, 'max_length': 150},
    'long': {'min_length': 150, 'max_length': 250}
}

for length, token in LENGTH_TOKENS.items():
    prompt = f"{token} {test_text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    outputs = model.generate(
        **inputs,
        **length_params[length],
        num_beams=4,
        early_stopping=True
    )

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\n=== {length.upper()} Summary ({len(summary.split())} words) ===")
    print(summary)


=== SHORT Summary (43 words) ===
Mr. Smith comes to Doctor Hawkins for a check-up. Doctor Hawkins advises Mr. Smith to come to the doctor at least once a year for his own good and advises him to quit smoking because smoking is the leading cause of lung cancer.

=== MEDIUM Summary (67 words) ===
Mr. Smith comes to Doctor Hawkins for a check-up. Doctor Hawkins advises him to have one every year and tells him smoking is the leading cause of lung cancer and heart disease. Hawkins will give Mr. Smith some information about classes and some medications to help him quit smoking before he leaves the doctor's office and thanks him for the check-ups. He thanks the doctor and leaves.

=== LONG Summary (129 words) ===
Mr. Smith comes to Doctor Hawkins for a check-up. Doctor Hawkins advises Mr. Smith to come to the doctor at least once a year for his own good and advises him to quit smoking because smoking is the leading cause of lung cancer and heart disease. Hawkins will give him some informatio
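
For reuse outside the notebook, the length token and the matching generation bounds can be bundled into one lookup. The sketch below is a hypothetical helper (`build_request` is not part of the notebook). It mirrors the notebook's dictionaries and falls back to `medium` for unknown labels, as the preprocessing step does. Because `min_length` and `max_length` count tokens, the word counts printed above come out lower than these bounds.

```python
LENGTH_TOKENS = {"short": "[SHORT]", "medium": "[MEDIUM]", "long": "[LONG]"}

LENGTH_PARAMS = {
    "short": {"min_length": 30, "max_length": 80},
    "medium": {"min_length": 80, "max_length": 150},
    "long": {"min_length": 150, "max_length": 250},
}

def build_request(text, length="medium"):
    """Return (prompt, generation kwargs) for the requested summary length."""
    if length not in LENGTH_TOKENS:
        length = "medium"  # same fallback as the preprocessing step
    prompt = f"{LENGTH_TOKENS[length]} {text}"
    # Copy so callers can modify the kwargs without touching the table.
    return prompt, dict(LENGTH_PARAMS[length])

prompt, gen_kwargs = build_request("Hello there.", "short")
print(prompt)
print(gen_kwargs)
```

The returned dictionary could then be unpacked into `model.generate(**inputs, **gen_kwargs, num_beams=4)`.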

## Save Model

In [None]:
# Save model locally
trainer.save_model("./podcastiq-summarizer-final")
tokenizer.save_pretrained("./podcastiq-summarizer-final")

# Zip for download
!zip -r podcastiq-summarizer.zip ./podcastiq-summarizer-final

from google.colab import files
files.download('podcastiq-summarizer.zip')

print("\n✅ Model saved and ready for download!")

  adding: podcastiq-summarizer-final/ (stored 0%)
  adding: podcastiq-summarizer-final/special_tokens_map.json (deflated 71%)
  adding: podcastiq-summarizer-final/added_tokens.json (deflated 29%)
  adding: podcastiq-summarizer-final/tokenizer_config.json (deflated 79%)
  adding: podcastiq-summarizer-final/training_args.bin (deflated 53%)
  adding: podcastiq-summarizer-final/config.json (deflated 62%)
  adding: podcastiq-summarizer-final/merges.txt (deflated 53%)
  adding: podcastiq-summarizer-final/vocab.json (deflated 59%)
  adding: podcastiq-summarizer-final/model.safetensors (deflated 7%)
  adding: podcastiq-summarizer-final/tokenizer.json (deflated 82%)
  adding: podcastiq-summarizer-final/generation_config.json (deflated 46%)



✅ Model saved and ready for download!


In [None]:
# Optional: Push to Hugging Face Hub
# Uncomment and run if you have a HF account

# from huggingface_hub import notebook_login
# notebook_login()

# model.push_to_hub("your-username/podcastiq-summarizer")
# tokenizer.push_to_hub("your-username/podcastiq-summarizer")