# Text Summarization with T5
This notebook demonstrates how to use the `SummarizationDataLoader` class to load and preprocess a dataset for text summarization with the T5 model.

In [1]:
import sys
import os

sys.path.append(os.path.abspath(os.path.join('..', 'src')))

In [2]:
# Import the data loader from the local `training` module (found via the path added above)
from training import SummarizationDataLoader

## Initialize DataLoader
We initialize the `SummarizationDataLoader` with the dataset name, tokenizer name, and batch size.

In [3]:
# Initialize DataLoader
dataset_name = 'FiscalNote/billsum'
tokenizer_name = 't5-small'
batch_size = 4

data_loader = SummarizationDataLoader(
    dataset_name=dataset_name, tokenizer_name=tokenizer_name, batch_size=batch_size)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


## Prepare Data
Call `prepare_data()` to load the dataset named by the `dataset_name` attribute (here, `FiscalNote/billsum`).

In [4]:
# Prepare data
data_loader.prepare_data()

Inspect the first training example:

In [5]:
data_loader.dataset["train"][0]

{'text': "SECTION 1. LIABILITY OF BUSINESS ENTITIES PROVIDING USE OF FACILITIES \n              TO NONPROFIT ORGANIZATIONS.\n\n    (a) Definitions.--In this section:\n            (1) Business entity.--The term ``business entity'' means a \n        firm, corporation, association, partnership, consortium, joint \n        venture, or other form of enterprise.\n            (2) Facility.--The term ``facility'' means any real \n        property, including any building, improvement, or appurtenance.\n            (3) Gross negligence.--The term ``gross negligence'' means \n        voluntary and conscious conduct by a person with knowledge (at \n        the time of the conduct) that the conduct is likely to be \n        harmful to the health or well-being of another person.\n            (4) Intentional misconduct.--The term ``intentional \n        misconduct'' means conduct by a person with knowledge (at the \n        time of the conduct) that the conduct is harmful to the health \n        or w

## Setup DataLoaders
Set up the train, validation, and test datasets.

In [6]:
# Setup DataLoaders
data_loader.setup()
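
The internals of `setup()` are not shown in this notebook. Assuming it tokenizes each example to a fixed length, the sketch below illustrates how right-padding and the attention mask relate. The function `pad_batch`, the `max_length` value, and the toy token ids are hypothetical and are not part of `SummarizationDataLoader`. The only facts relied on are that T5's vocabulary uses id 0 for padding and id 1 for end-of-sequence.

```python
# T5's vocabulary reserves id 0 for padding and id 1 for end-of-sequence (EOS).
PAD_ID = 0
EOS_ID = 1


def pad_batch(sequences, max_length):
    """Hypothetical helper: fix each token-id list to max_length and build masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        if len(seq) > max_length:
            # Truncate long inputs but keep a trailing EOS token.
            seq = seq[:max_length - 1] + [EOS_ID]
        n_pad = max_length - len(seq)
        # Pad short inputs on the right; the mask is 1 for real tokens, 0 for padding.
        input_ids.append(seq + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids (illustrative only), each ending in EOS.
batch = pad_batch([[180, 3073, 1], [180, 3073, 9562, 2156, 8, 1]], max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Rows that end in zeros with a matching run of zeros in the mask are padded inputs. Rows that end in 1 with an all-ones mask filled the available length.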

## Create DataLoaders
Create DataLoaders for training, validation, and testing.

In [7]:
# Create DataLoaders
train_dataloader = data_loader.train_dataloader()
val_dataloader = data_loader.val_dataloader()
test_dataloader = data_loader.test_dataloader()
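
As a rough mental model of what a DataLoader does with `batch_size`, the sketch below groups a list of examples into consecutive batches. It is a plain-Python illustration, and `iter_batches` is a hypothetical name. A real DataLoader may also shuffle the data and collate each batch into tensors, which this sketch leaves out.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive slices of `examples`, each with at most batch_size items."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = list(range(10))  # stand-in for 10 tokenized examples
batches = list(iter_batches(examples, batch_size=4))
print(batches)  # → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With `batch_size = 4`, every batch holds four examples, except possibly the last one when the dataset size is not a multiple of four.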

## Visualize Examples
Print the first batch from the training DataLoader to inspect its `input_ids`, `attention_mask`, and `labels` tensors.

In [8]:
# Print only the first training batch, then stop iterating
for batch in train_dataloader:
    print(batch)
    break

{'input_ids': tensor([[ 180, 3073, 9562,  ...,    0,    0,    0],
        [ 180, 3073, 9562,  ..., 2156,    8,    1],
        [ 180, 3073, 9562,  ..., 1866,   57,    1],
        [ 180, 3073, 9562,  ..., 1461, 1381,    1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[ 4804,  8804,     7,     3,     9,   268, 10409,    45,  3095,  6283,
             3,  8321,    12,   136,  2871,    42,  1687, 16198,    44,     3,
             9,  3064,    13,    24, 10409,    16,  2135,    28,     3,     9,
           169,    13,   224,  3064,    57,     3,     9, 11069,  1470,     3,
            99,    10,  5637,     8,   169,  6986,  1067,     8,  7401,    13,
           268,    13,     8,   268, 10409,   117,  6499,   224,  2871,    42,
          1687,  6986,   383,     3,     9,  1059,    24,   224,  3064,    19,
           261,    57,   224,  1470,   117,    11, 10153,