## Environment Setup

We will begin by installing the core libraries needed for our text summarization project. This includes dataset handling tools, evaluation metrics specifically designed for summarization tasks, transformer-based models, experiment tracking capabilities, and natural language toolkit components. Together, these packages provide the complete infrastructure for building and assessing our summarization system.

In [2]:
# Install essential NLP and evaluation packages
!pip install datasets rouge evaluate transformers wandb nltk rouge_score

Collecting datasets
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.1-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4

## Library Integration

We will import a comprehensive set of tools for our natural language processing pipeline. This includes fundamental data manipulation libraries, deep learning frameworks, evaluation modules, and specialized NLP components. The transformers package provides access to state-of-the-art sequence-to-sequence models designed for text generation tasks, while evaluation libraries enable rigorous assessment of our summarization results. We're also setting up experiment tracking to monitor our model's performance throughout training.

In [3]:
# Load comprehensive toolkit dependencies
import os
import numpy as np
import pandas as pd
import torch
import evaluate
import nltk
import rouge
import wandb

from datasets import load_dataset, concatenate_datasets
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer
)
from sklearn.metrics import accuracy_score, f1_score
from nltk.tokenize import sent_tokenize
from torch.utils.data import DataLoader

## NLP Resource Preparation

We will download essential linguistic resources for text segmentation. The punkt tokenizer models are required for accurately splitting text into sentences, which is crucial for both preprocessing our input documents and evaluating our generated summaries at the sentence level.

In [4]:
# Acquire text processing resources
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Performance Tracking

We will set up Weights & Biases integration to monitor our fine-tuning process. This platform allows us to track metrics, visualize performance trends, and compare different model configurations throughout our experiments. While optional, this tooling provides valuable insights into training dynamics and helps optimize our summarization model's development.

In [5]:
# Initialize experiment monitoring platform
wandb.init(project="multiple-dataset-fine-tuning")

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mpns00911[0m ([33mpns00911-san-jose-state-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Computational Resource Detection

We will automatically select the most powerful processing hardware available for our model training. This configuration checks for CUDA-compatible GPUs and defaults to CPU processing when necessary, ensuring our code runs efficiently regardless of the execution environment.

In [6]:
# Determine optimal processing hardware
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


## Experimental Consistency

We will enforce deterministic behavior across all random operations in our pipeline. By setting fixed seed values for both NumPy and PyTorch operations, we ensure that our experiments can be precisely replicated across different runs, enabling meaningful comparisons between model configurations and training approaches.

In [7]:
# Establish consistent randomization parameters
def set_seed(seed=42):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed()

## Model Selection

We will employ Google's T5 (Text-to-Text Transfer Transformer) model as our foundation. This architecture is particularly well-suited for summarization as it frames all NLP tasks in a unified text-to-text format. The "small" variant provides a balance between computational efficiency and performance, making it ideal for fine-tuning experiments across multiple datasets. We configure both the tokenizer for processing our text inputs and the model itself, which is automatically deployed to our optimal computing device.

In [8]:
# Initialize sequence transformation architecture
# Selecting T5 for its multi-task capabilities in text-to-text format
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

## Data Acquisition

We will utilize the CNN/DailyMail dataset, a standard benchmark collection for text summarization. This corpus contains over 300,000 unique news articles paired with human-written highlights, providing high-quality examples of professional summarization. The dataset's version 3.0.0 represents the most refined form of this collection, with consistent formatting and minimal preprocessing required.

In [9]:
# Acquire news summary corpus
# 1. CNN/DailyMail for professional journalism summarization
cnn_dataset = load_dataset("cnn_dailymail", "3.0.0")

README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

## Multi-Task Dataset Integration

We will expand our training capabilities by incorporating the Stanford Sentiment Treebank (SST-2) dataset from the GLUE benchmark. This collection contains movie review excerpts labeled with binary sentiment classifications, allowing us to train our model on discriminative text tasks alongside generative summarization. After loading both datasets, we verify their successful acquisition and examine their size distributions across training, validation, and testing splits to ensure adequate representation for our multi-task learning approach.

In [10]:
# Load sentiment analysis benchmark
# 2. GLUE SST-2 for binary sentiment evaluation
sst2_dataset = load_dataset("glue", "sst2")

print("Datasets loaded successfully!")
print(f"CNN/DailyMail - Train: {len(cnn_dataset['train'])}, Validation: {len(cnn_dataset['validation'])}, Test: {len(cnn_dataset['test'])}")
print(f"SST-2 - Train: {len(sst2_dataset['train'])}, Validation: {len(sst2_dataset['validation'])}, Test: {len(sst2_dataset['test'])}")

README.md:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

Datasets loaded successfully!
CNN/DailyMail - Train: 287113, Validation: 13368, Test: 11490
SST-2 - Train: 67349, Validation: 872, Test: 1821


## Dataset Preparation

We will transform our raw datasets into formats optimized for sequence-to-sequence learning. After applying task-specific preprocessing functions to each collection, we create balanced samples from both datasets to prevent any single task from dominating the training process. The sampling strategy maintains diversity while keeping computational requirements manageable.

For effective multi-task learning, we merge these processed datasets into unified training and validation collections. This approach enables our model to learn both generative summarization and discriminative sentiment analysis simultaneously. The sequence-to-sequence data collator handles batch creation with appropriate padding and truncation to maintain consistent input dimensions.

In [12]:
# Function to preprocess CNN/DailyMail for summarization
def preprocess_cnn_dailymail(examples):
    # Add task prefix to distinguish this as a summarization task
    inputs = ["summarize: " + doc for doc in examples["article"]]

    # Tokenize inputs
    model_inputs = tokenizer(inputs, max_length=256, truncation=True)

    # Tokenize targets (summaries)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["highlights"], max_length=64, truncation=True)

    model_inputs["labels"] = labels["input_ids"]

    # Add task type identifier
    model_inputs["task_type"] = ["summarization"] * len(inputs)

    return model_inputs

In [13]:
# Function to preprocess SST-2 for sentiment classification - optimized version
def preprocess_sst2(examples):
    batch_size = len(examples["sentence"])

    # Add task prefix in a more efficient way
    inputs = [f"classify sentiment: {sentence}" for sentence in examples["sentence"]]

    # Tokenize inputs - use padding=False to avoid unnecessary padding during preprocessing
    model_inputs = tokenizer(
        inputs,
        max_length=128,
        truncation=True,
        padding=False  # Change from "max_length" to False
    )

    # Simplify label conversion
    text_labels = ["negative" if label == 0 else "positive" for label in examples["label"]]

    # Tokenize targets with padding=False
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            text_labels,
            max_length=8,
            truncation=True,
            padding=False  # Change from "max_length" to False
        )

    model_inputs["labels"] = labels["input_ids"]

    # Add task type identifier efficiently
    model_inputs["task_type"] = ["classification"] * batch_size

    return model_inputs

In [14]:
# Transform raw text collections into model-ready format
cnn_processed = cnn_dataset.map(preprocess_cnn_dailymail, batched=True)  # Convert news articles
sst2_processed = sst2_dataset.map(preprocess_sst2, batched=True)      # Format sentiment data

# Balance dataset representation with controlled sample sizes
cnn_sample_size = min(len(cnn_processed["train"]), 2000)
sst2_sample_size = min(len(sst2_processed["train"]), 2000)

# Create manageable training subsets
cnn_train_subset = cnn_processed["train"].shuffle(seed=42).select(range(cnn_sample_size))
sst2_train_subset = sst2_processed["train"].shuffle(seed=42).select(range(sst2_sample_size))

# Merge datasets for joint training
combined_train = concatenate_datasets([cnn_train_subset, sst2_train_subset])
combined_val = concatenate_datasets([
    cnn_processed["validation"].shuffle(seed=42).select(range(min(len(cnn_processed["validation"]), 500))),
    sst2_processed["validation"].shuffle(seed=42).select(range(min(len(sst2_processed["validation"]), 500)))
])

print(f"Combined training set size: {len(combined_train)}")
print(f"Combined validation set size: {len(combined_val)}")

# Configure batch handling strategy
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding="max_length",
    max_length=512
)

Map:   0%|          | 0/287113 [00:00<?, ? examples/s]



Map:   0%|          | 0/13368 [00:00<?, ? examples/s]

Map:   0%|          | 0/11490 [00:00<?, ? examples/s]

Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Map:   0%|          | 0/1821 [00:00<?, ? examples/s]

Combined training set size: 4000
Combined validation set size: 1000


## Evaluation Configuration

We will implement a comprehensive assessment strategy using specialized metrics for each task type. The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) suite provides industry-standard metrics for evaluating summarization quality by measuring overlap between generated summaries and references. Simultaneously, we incorporate accuracy metrics for evaluating the model's performance on classification tasks like sentiment analysis. This dual-metric approach ensures we can properly assess our model's effectiveness across both generative and discriminative capabilities.

In [15]:
def standardize_outputs(predictions, references):
    predictions = [pred.strip() for pred in predictions]
    references = [[ref.strip()] for ref in references]
    return predictions, references

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

## Comprehensive Evaluation Framework

We will implement a robust metrics calculation system that handles our multi-task evaluation requirements. This function processes model outputs with multiple layers of error prevention, ensuring reliable assessment even when dealing with edge cases or out-of-vocabulary tokens.

The framework automatically distinguishes between different task types within our batch, routing each prediction to the appropriate evaluation metrics. For summarization tasks, we calculate ROUGE scores that measure content overlap between generated and reference summaries. For sentiment classification, we compute accuracy and F1 scores after converting text outputs to binary decisions. This unified approach enables us to track performance across disparate tasks within a single training run.

In [18]:
def calculate_performance_metrics(evaluation_pair):
    raw_predictions, raw_labels = evaluation_pair

    # Process model outputs with error prevention
    processed_predictions = []
    try:
        # Ensure token values are valid
        maximum_token_id = tokenizer.vocab_size - 1
        sanitized_predictions = np.clip(raw_predictions, 0, maximum_token_id).astype(np.int32)
        processed_predictions = tokenizer.batch_decode(sanitized_predictions, skip_special_tokens=True)
    except Exception as error:
        # Implement fallback strategy for problematic predictions
        for individual_prediction in raw_predictions:
            try:
                # Handle each prediction separately
                sanitized_individual = np.clip(individual_prediction, 0, tokenizer.vocab_size - 1).astype(np.int32)
                decoded_text = tokenizer.decode(sanitized_individual, skip_special_tokens=True)
                processed_predictions.append(decoded_text)
            except Exception as nested_error:
                # Provide empty placeholder on failure
                print(f"Warning: Failed to decode prediction: {nested_error}")
                processed_predictions.append("")

    # Convert label indices to text
    normalized_labels = np.where(raw_labels != -100, raw_labels, tokenizer.pad_token_id)
    processed_labels = tokenizer.batch_decode(normalized_labels, skip_special_tokens=True)

    # Apply consistent formatting
    processed_predictions, processed_labels = standardize_outputs(processed_predictions, processed_labels)

    # Separate task types for appropriate evaluation
    sentiment_predictions = []
    sentiment_references = []
    summary_predictions = []
    summary_references = []

    for prediction, reference in zip(processed_predictions, processed_labels):
        if "positive" in reference[0] or "negative" in reference[0]:
            # Classification example
            sentiment_predictions.append(prediction)
            sentiment_references.append(reference[0])
        else:
            # Summarization example
            summary_predictions.append(prediction)
            summary_references.append(reference[0])

    # Initialize performance container
    performance_results = {}

    # Assess summarization quality when applicable
    if summary_predictions:
        rouge_metrics = rouge.compute(
            predictions=summary_predictions,
            references=[[ref] for ref in summary_references],
            use_stemmer=True
        )
        performance_results.update({k: v for k, v in rouge_metrics.items()})

    # Evaluate classification performance when applicable
    if sentiment_predictions:
        # Convert textual outputs to binary decisions
        binary_predictions = ["positive" in pred for pred in sentiment_predictions]
        binary_references = ["positive" in ref for ref in sentiment_references]

        performance_results["classification_accuracy"] = accuracy_score(binary_references, binary_predictions)
        performance_results["classification_f1"] = f1_score(binary_references, binary_predictions, average='binary')

    return performance_results

## Training Optimization

We will define a comprehensive configuration that governs our model's learning process. This setup employs mixed-precision training to accelerate computation while reducing memory requirements. We establish regular evaluation intervals for monitoring progress and implement checkpointing to preserve successful model states.

The hyperparameters balance learning efficiency with resource constraints - using moderate batch sizes, a conservative learning rate with decay regularization, and memory-saving techniques like gradient checkpointing. Our evaluation strategy intelligently adapts to the available datasets, selecting the appropriate metric for model selection based on which task types are represented in our validation set.

In [19]:
# Configure comprehensive training parameters
training_args = Seq2SeqTrainingArguments(
    fp16=True,
    output_dir="./results",
    eval_strategy="steps",
    eval_steps=200,
    logging_dir="./logs",
    logging_steps=50,
    save_steps=200,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=2,
    predict_with_generate=True,
    generation_max_length=64,
    report_to="wandb",
    load_best_model_at_end=True,
    metric_for_best_model="rouge1" if len(cnn_processed["validation"]) > 0 else "classification_accuracy",
    push_to_hub=False,
    dataloader_num_workers=4,
    optim="adamw_torch",
    gradient_checkpointing=True,
)

## Training Initialization

We will create our sequence-to-sequence trainer instance by integrating all previously defined components. This unified framework connects our model architecture with the prepared datasets, specialized data processing utilities, and custom evaluation metrics. The trainer handles all aspects of the fine-tuning process including batching, optimization, gradient computation, and performance tracking, allowing us to focus on analyzing results rather than implementation details.

In [20]:
# Establish unified training framework
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=combined_train,
    eval_dataset=combined_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=calculate_performance_metrics,
)

  trainer = Seq2SeqTrainer(


## Model Preservation

We will store our fine-tuned multi-task model for future use and deployment. This process saves both the adapted model weights that contain our learned parameters and the tokenizer configuration necessary for processing new inputs. By maintaining these components in a dedicated directory, we create a complete, self-contained solution that can be easily loaded for inference on unseen examples or transferred to other environments.

In [21]:
# Preserve trained model artifacts
model_path = "./fine_tuned_multi_task_model"
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)
print(f"Model saved to {model_path}")

Model saved to ./fine_tuned_multi_task_model


## Model Evaluation

We will conduct a practical assessment of our multi-task model's capabilities by testing it on representative examples from each task domain. For summarization, we present the model with a nuanced passage about climate change that contains multiple interconnected concepts requiring thoughtful condensation. For sentiment analysis, we provide an ambiguous movie review with mixed positive and negative elements to challenge the model's ability to discern overall sentiment.

The evaluation function operates in inference mode, applying appropriate prompting patterns for each task type and capturing the model's generated outputs. This practical demonstration confirms that our single model can effectively switch between different NLP tasks based solely on the input format, showcasing the versatility achieved through our multi-task fine-tuning approach.

In [22]:
# Validate model capabilities across task domains
def evaluate_dual_task_performance(trained_model, text_processor):
    trained_model.eval()

    # Assess summarization capability
    long_document = """
    Climate change is causing significant shifts in global weather patterns. Rising temperatures have
    led to more frequent and severe weather events, including hurricanes, floods, and droughts.
    Melting ice caps contribute to rising sea levels, threatening coastal communities worldwide.
    Scientists emphasize the need for reduced carbon emissions and transition to renewable energy
    sources. Many countries have pledged to achieve carbon neutrality by mid-century, though critics
    argue these targets aren't ambitious enough to prevent the most severe consequences.
    """

    summary_query = text_processor("summarize: " + long_document, return_tensors="pt").to(device)
    condensed_output_ids = trained_model.generate(
        summary_query["input_ids"],
        max_length=75,
        min_length=30,
        no_repeat_ngram_size=3,
        early_stopping=True
    )
    produced_summary = text_processor.decode(condensed_output_ids[0], skip_special_tokens=True)

    # Verify sentiment analysis capability
    sample_review = "Despite beautiful visuals, the film's plot was confusing and the characters were poorly developed."
    sentiment_query = text_processor("classify sentiment: " + sample_review, return_tensors="pt").to(device)
    sentiment_output_ids = trained_model.generate(
        sentiment_query["input_ids"],
        max_length=10,
        early_stopping=True
    )
    detected_sentiment = text_processor.decode(sentiment_output_ids[0], skip_special_tokens=True)

    return {
        "summarization_example": long_document,
        "generated_summary": produced_summary,
        "classification_example": sample_review,
        "predicted_sentiment": detected_sentiment
    }

# Execute evaluation procedure
evaluation_results = evaluate_dual_task_performance(model, tokenizer)
print("\nTest Results:")
print(f"Summarization Example: \n{evaluation_results['summarization_example'][:100]}...")
print(f"Generated Summary: \n{evaluation_results['generated_summary']}")
print(f"\nClassification Example: \n{evaluation_results['classification_example']}")
print(f"Predicted Sentiment: {evaluation_results['predicted_sentiment']}")




Test Results:
Summarization Example: 

    Climate change is causing significant shifts in global weather patterns. Rising temperatures ha...
Generated Summary: 
climate change is causing significant shifts in global weather patterns. rising temperatures have led to more frequent and severe weather events. melting ice caps contribute to rising sea levels.

Classification Example: 
Despite beautiful visuals, the film's plot was confusing and the characters were poorly developed.
Predicted Sentiment: sentiment: Despite beautiful visuals,
