Set up and install dependencies.

1. transformers: Library for natural language processing tasks, providing access to pre-trained models like BERT, GPT, T5, etc.
2. datasets: Library for accessing and managing datasets for natural language processing and other machine learning tasks.
3. tensorboard: Visualization tool provided by TensorFlow for monitoring and analyzing machine learning models.
4. sentencepiece: Library for tokenization, used by some transformers models for subword tokenization.
5. accelerate: Hugging Face library for running PyTorch training and inference across CPUs, GPUs, and distributed setups.
6. evaluate: Package for evaluating machine learning models, commonly used for assessing model performance.
7. rouge_score: Package for computing ROUGE scores, a metric commonly used for evaluating text summarization tasks.


In [27]:
!pip install -U transformers
!pip install -U datasets
!pip install tensorboard
!pip install sentencepiece
!pip install accelerate
!pip install evaluate
!pip install rouge_score
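
As a quick sanity check after installation, the sketch below reports the installed version of each package listed above. It uses only the standard library; the helper name `report_versions` is an illustrative choice and not part of the notebook.

```python
from importlib import metadata

# Distribution names as passed to pip above.
PACKAGES = [
    "transformers", "datasets", "tensorboard", "sentencepiece",
    "accelerate", "evaluate", "rouge_score",
]

def report_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in report_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```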



Brief description of the imports used below.

1. torch: PyTorch, a machine learning library providing tensors and neural network operations.
2. pprint: Pretty-printing module for Python, used to format Python data structures in a human-readable way.
3. evaluate: Package for evaluating machine learning models, commonly used for assessing model performance.
4. numpy: Numerical computing library for Python, providing support for arrays, matrices, and mathematical operations.
5. T5Tokenizer: Tokenizer class for T5 models, used to convert text inputs into model inputs.
6. T5ForConditionalGeneration: T5 model class for conditional generation tasks like text summarization.
7. TrainingArguments: Class for defining training arguments/configuration for model training.
8. Trainer: Class for handling model training and evaluation loops, provided by the Hugging Face Transformers library.
9. load_dataset: Function for loading datasets from the Hugging Face datasets library, facilitating easy access to various datasets for machine learning tasks.


In [28]:
import torch
import pprint
import evaluate
import numpy as np

from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    TrainingArguments,
    Trainer
)


#pprint: Pretty Print, a module used for printing Python data structures in a more human-readable format.

In [14]:
pp = pprint.PrettyPrinter()

#load the training split of "bbc-news-summary" dataset from the Hugging Face Hub.

In [15]:
from datasets import load_dataset
dataset = load_dataset('gopalkalpande/bbc-news-summary', split='train')
full_dataset = dataset.train_test_split(test_size=0.2, shuffle=True)
dataset_train = full_dataset['train']
dataset_valid = full_dataset['test']
print(dataset_train)
print(dataset_valid)

Dataset({
    features: ['File_path', 'Articles', 'Summaries'],
    num_rows: 1779
})
Dataset({
    features: ['File_path', 'Articles', 'Summaries'],
    num_rows: 445
})
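
The split above passes no seed, so each run produces a different partition; `train_test_split` accepts a `seed` argument for this purpose. The pure-Python sketch below only illustrates the idea of a seeded shuffle split. The helper `split_indices` is hypothetical and is not how the `datasets` library implements the split; the seed value 42 is an arbitrary assumption.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) for a shuffled split with a fixed seed."""
    n_test = math.ceil(n_rows * test_size)   # test share, rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # same seed -> same order
    return indices[n_test:], indices[:n_test]

# 1779 + 445 rows, matching the row counts printed above.
train_idx, test_idx = split_indices(1779 + 445)
print(len(train_idx), len(test_idx))
```

Calling the function twice with the same seed returns identical index lists, which is the property that makes downstream metrics comparable between runs.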


# Dataset Analysis

1. Find the longest article and summary in the training set.

In [19]:
def find_longest_length(dataset):
    """Return the longest text length in words, plus the number of texts
    longer than 4000, 2000, 1000, and 500 words."""
    max_length = 0
    counter_4k = 0
    counter_2k = 0
    counter_1k = 0
    counter_500 = 0
    for text in dataset:
        n_words = len(text.split())  # whitespace-delimited word count
        if n_words > 4000:
            counter_4k += 1
        if n_words > 2000:
            counter_2k += 1
        if n_words > 1000:
            counter_1k += 1
        if n_words > 500:
            counter_500 += 1
        max_length = max(max_length, n_words)
    return max_length, counter_4k, counter_2k, counter_1k, counter_500

longest_article_length, counter_4k, counter_2k, counter_1k, counter_500 = find_longest_length(dataset_train['Articles'])
print(f"Longest article length: {longest_article_length} words")
print(f"Articles larger than 4000 words: {counter_4k}")
print(f"Articles larger than 2000 words: {counter_2k}")
print(f"Articles larger than 1000 words: {counter_1k}")
print(f"Articles larger than 500 words: {counter_500}")
longest_summary_length, counter_4k, counter_2k, counter_1k, counter_500 = find_longest_length(dataset_train['Summaries'])
print(f"Longest summary length: {longest_summary_length} words")
print(f"Summaries larger than 4000 words: {counter_4k}")
print(f"Summaries larger than 2000 words: {counter_2k}")
print(f"Summaries larger than 1000 words: {counter_1k}")
print(f"Summaries larger than 500 words: {counter_500}")


Longest article length: 4377 words
Articles larger than 4000 words: 1
Articles larger than 2000 words: 6
Articles larger than 1000 words: 18
Articles larger than 500 words: 337
Longest summary length: 2073 words
Summaries larger than 4000 words: 0
Summaries larger than 2000 words: 1
Summaries larger than 1000 words: 6
Summaries larger than 500 words: 14
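
The same counts can be gathered for any set of thresholds in one pass. This is a minimal sketch; `length_stats` and the toy `sample` list are illustrative names and data, not part of the notebook.

```python
def length_stats(texts, thresholds=(500, 1000, 2000, 4000)):
    """Return the longest text length in words and, for each threshold,
    the number of texts longer than it."""
    lengths = [len(text.split()) for text in texts]
    counts = {t: sum(length > t for length in lengths) for t in thresholds}
    return max(lengths, default=0), counts

# Toy data: texts of 3, 600, and 1200 words.
sample = ["one two three", "a " * 600, "b " * 1200]
longest, counts = length_stats(sample)
print(longest, counts)
```

On the real data, the function would be called as `length_stats(dataset_train['Articles'])`.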



2. Find the average length, in words, of the articles and of the summaries in the training set.

In [20]:
def find_avg_sentence_length(dataset):
    """Return the average length, in words, of the texts in `dataset`.

    Despite the name, each item is a whole article or summary, not a sentence.
    """
    text_lengths = [len(text.split()) for text in dataset]
    return sum(text_lengths) / len(text_lengths)

avg_article_length = find_avg_sentence_length(dataset_train['Articles'])
print(f"Average article length: {avg_article_length} words")
avg_summary_length = find_avg_sentence_length(dataset_train['Summaries'])
print(f"Average summary length: {avg_summary_length} words")

Average article length: 376.71557054525016 words
Average summary length: 163.96964586846542 words


# Model Configuration:
1. MODEL = 't5-base': The pre-trained T5 checkpoint to fine-tune, here the base version.
2. BATCH_SIZE = 2: Batch size for training and evaluation.
3. NUM_PROCS = 2: Number of processes used for parallel dataset preprocessing.
4. EPOCHS = 2: Number of epochs, or complete passes through the training set.
5. OUT_DIR = 'model_space': Directory where the training outputs are saved.
6. MAX_LENGTH = 256: Maximum number of tokens kept per sequence when preparing the dataset.


In [21]:
MODEL = 't5-base'
BATCH_SIZE = 2
NUM_PROCS = 2
OUT_DIR='model_space'
EPOCHS = 2
MAX_LENGTH = 256  # Maximum sequence length, in tokens, when preparing the dataset; longer sequences are truncated.
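
One way to keep these settings together, sketched here as an alternative to module-level constants, is a small frozen dataclass. The class name `Config` and its field names are illustrative and do not appear in the notebook; the values mirror the cell above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """Training configuration; frozen so values cannot be changed by accident."""
    model: str = "t5-base"
    batch_size: int = 2
    num_procs: int = 2
    out_dir: str = "model_space"
    epochs: int = 2
    max_length: int = 256   # tokens kept per sequence

cfg = Config()
print(cfg)
```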

 Preprocess text data for a T5 model using the Hugging Face transformers library

In [22]:
from transformers import T5Tokenizer

# Define tokenizer
tokenizer = T5Tokenizer.from_pretrained(MODEL)


# Function to convert text data into model inputs and targets
def preprocess_function(examples, tokenizer):  # Pass the tokenizer as an argument
    # Same value as the global MAX_LENGTH; kept local so the function
    # is self-contained when run in worker processes
    MAX_LENGTH = 256
    # Prefix each article with the T5 task prompt
    inputs = [f"summarize: {article}" for article in examples['Articles']]
    model_inputs = tokenizer(
        inputs,
        max_length=MAX_LENGTH,
        truncation=True,
        padding='max_length'
    )

    # Tokenize the summaries as targets; `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(
        text_target=examples['Summaries'],
        max_length=MAX_LENGTH,
        truncation=True,
        padding='max_length'
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply the function to the whole dataset
tokenized_train = dataset_train.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS,
    fn_kwargs={"tokenizer": tokenizer}  # Pass the tokenizer as a function argument
)
tokenized_valid = dataset_valid.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS,
    fn_kwargs={"tokenizer": tokenizer}  # Pass the tokenizer as a function argument
)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
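
With `padding='max_length'`, label sequences are filled with the tokenizer's pad ID, and those positions count toward the training loss unless they are masked. A common refinement, not applied in this notebook, is to replace pad IDs in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default and that `compute_metrics` later maps back to the pad ID. The sketch below assumes a pad ID of 0 (T5's default); the helper name `mask_padding` and the token IDs are made up for illustration.

```python
def mask_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding positions in each label sequence with ignore_index."""
    return [
        [ignore_index if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]

# Two toy label sequences, each padded to length 5 with pad ID 0.
labels = [[71, 9251, 1, 0, 0], [363, 19, 1, 0, 0]]
print(mask_padding(labels))
```

Inside `preprocess_function`, this would be applied to `labels["input_ids"]` before assigning it to `model_inputs["labels"]`.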



THE MODEL

In [23]:
# Initialize the T5 model for conditional generation
model = T5ForConditionalGeneration.from_pretrained(MODEL)

# Check if CUDA GPU is available, else use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the selected device (GPU or CPU)
model.to(device)

# Count the total and the trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

222,903,552 total parameters.
222,903,552 training parameters.
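
The parameter count gives a rough estimate of the memory needed for the weights alone. The arithmetic below assumes 32-bit floating-point weights (4 bytes each); optimizer state and activations during training come on top of this.

```python
total_params = 222_903_552   # from the output above
bytes_per_param = 4          # assumption: float32 weights

size_bytes = total_params * bytes_per_param
print(f"Approximate weight size: {size_bytes / 1e6:.1f} MB")
```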


Rouge Metric

In [29]:
rouge = evaluate.load("rouge")

In [30]:
def compute_metrics(eval_pred):
    # Extract predictions and labels
    predictions, labels = eval_pred.predictions[0], eval_pred.label_ids

    # Decode model predictions and ground truth labels
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Calculate ROUGE scores
    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True,
        rouge_types=['rouge1', 'rouge2', 'rougeL']
    )

    # Compute average generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    # Round the result values to four decimal places
    return {k: round(v, 4) for k, v in result.items()}
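
To make the reported numbers easier to interpret, here is a simplified sketch of what ROUGE-1 measures: the F1 score of unigram overlap between a prediction and a reference. It is not the `rouge_score` implementation, which applies its own tokenization and optional stemming; the function name `rouge1_f1` is illustrative.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```

In this example five of the six tokens overlap, so precision and recall are both 5/6. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead.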



    By default the Trainer accumulates the full logits for every evaluation batch, which can exhaust memory.
    As a workaround, reduce the logits to predicted token IDs before they are stored.


In [None]:
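def preprocess_logits_for_metrics(logits, labels):
    # The model output arrives as a tuple whose first element holds the logits;
    # keep only the highest-scoring token ID at each position.
    pred_ids = torch.argmax(logits[0], dim=-1)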
    return pred_ids, labels
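
A back-of-the-envelope calculation shows why this reduction matters. The figures below assume float32 logits, int64 token IDs, and a vocabulary size of 32,128 for t5-base; the vocabulary size is an assumption taken from the published model configuration, not from this notebook.

```python
n_eval = 445          # validation rows, from the split above
seq_len = 256         # MAX_LENGTH
vocab_size = 32_128   # assumption: t5-base vocabulary size

logits_bytes = n_eval * seq_len * vocab_size * 4   # float32 logits
ids_bytes = n_eval * seq_len * 8                   # int64 token IDs

print(f"Full logits: {logits_bytes / 1e9:.1f} GB")
print(f"Token IDs:   {ids_bytes / 1e6:.2f} MB")
```

Under these assumptions, storing the argmax result instead of the logits shrinks the accumulated predictions from gigabytes to under a megabyte.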

Training



1. **Training Arguments Initialization:**
   - Initializes the training arguments using `TrainingArguments`.
   - Sets parameters such as `output_dir`, `num_train_epochs`, `per_device_train_batch_size`, `warmup_steps`, `weight_decay`, and `learning_rate`.
   - Evaluates every 50 steps (`eval_steps`), saves a checkpoint each epoch, and logs to TensorBoard.

2. **Trainer Configuration:**
   - Creates a `Trainer` object with the specified configurations.
   - Takes the initialized `model` and `training_args` as input.
   - Specifies the tokenized datasets for training and evaluation (`train_dataset` and `eval_dataset`, respectively).
   - Defines functions for preprocessing logits and computing metrics during evaluation.

3. **Training Loop:**
   - Initiates the training loop by calling the `train()` method of the `Trainer` object.
   - Trains the model according to the specified training arguments and dataset configurations.
   - Returns a `TrainOutput` object containing the final global step, the average training loss, and the training metrics.
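
The number of optimizer steps and evaluation points follows from the dataset size and the arguments in the next cell. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

n_train = 1779        # training rows, from the split above
batch_size = 5        # per_device_train_batch_size
epochs = 2            # num_train_epochs
eval_steps = 50

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
n_evaluations = total_steps // eval_steps

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
print(f"{n_evaluations} evaluations, one every {eval_steps} steps")
```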

In [None]:
training_args = TrainingArguments(
    output_dir=OUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=5,  # set directly here, not taken from BATCH_SIZE
    per_device_eval_batch_size=5,
    warmup_steps=10,
    weight_decay=0.1,
    logging_dir=OUT_DIR,
    logging_steps=50,
    evaluation_strategy='steps',  # named `eval_strategy` in recent transformers releases
    eval_steps=50,
    save_strategy='epoch',
    save_total_limit=2,
    report_to='tensorboard',
    learning_rate=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,

)

history = trainer.train()

Step,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Gen Len
50,2.9616,2.440826,0.5776,0.3448,0.5138,198.2449
100,2.7167,2.17878,0.666,0.4865,0.6096,198.2449
150,2.4669,1.97821,0.6904,0.5337,0.6401,198.2449
200,2.1156,2.03989,0.6916,0.5409,0.6419,198.2494
250,2.0015,1.853174,0.7048,0.5603,0.6617,198.2449
300,1.9544,1.804546,0.7171,0.5747,0.6712,198.2449
350,1.9683,1.741409,0.7158,0.5802,0.6735,198.2449
400,1.7724,1.679961,0.714,0.5805,0.6755,198.2449
450,1.5066,1.649815,0.7271,0.5891,0.6827,198.2449
500,1.5932,1.604265,0.7315,0.5943,0.6883,198.2449
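
The log above can be scanned programmatically to find the evaluation step with the lowest validation loss or the highest ROUGE-L. This sketch copies the rows verbatim into a string and parses them with the standard `csv` module; the variable names are illustrative.

```python
import csv
import io

LOG = """Step,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Gen Len
50,2.9616,2.440826,0.5776,0.3448,0.5138,198.2449
100,2.7167,2.17878,0.666,0.4865,0.6096,198.2449
150,2.4669,1.97821,0.6904,0.5337,0.6401,198.2449
200,2.1156,2.03989,0.6916,0.5409,0.6419,198.2494
250,2.0015,1.853174,0.7048,0.5603,0.6617,198.2449
300,1.9544,1.804546,0.7171,0.5747,0.6712,198.2449
350,1.9683,1.741409,0.7158,0.5802,0.6735,198.2449
400,1.7724,1.679961,0.714,0.5805,0.6755,198.2449
450,1.5066,1.649815,0.7271,0.5891,0.6827,198.2449
500,1.5932,1.604265,0.7315,0.5943,0.6883,198.2449
"""

rows = list(csv.DictReader(io.StringIO(LOG)))
best_loss = min(rows, key=lambda row: float(row["Validation Loss"]))
best_rouge = max(rows, key=lambda row: float(row["Rougel"]))

print(f"Lowest validation loss: step {best_loss['Step']} ({best_loss['Validation Loss']})")
print(f"Highest ROUGE-L:        step {best_rouge['Step']} ({best_rouge['Rougel']})")
```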


In [31]:
tokenizer.save_pretrained("model_space")
model.save_pretrained("model_space")



In [32]:
!zip -r "model_space" "model_space"

  adding: model_space/ (stored 0%)
  adding: model_space/generation_config.json (deflated 29%)
  adding: model_space/special_tokens_map.json (deflated 85%)
  adding: model_space/added_tokens.json (deflated 83%)
  adding: model_space/model.safetensors


zip error: Interrupted (aborting)


 Use the requests library to download a file from a given URL and save it locally.

In [34]:
import requests

url = "https://www.dropbox.com/scl/fi/561r8pfhem4lu70hf438q/inference_data.zip?rlkey=aedt2saqmmp3a67qc4o34k04y&dl=1"
response = requests.get(url, timeout=60)
response.raise_for_status()  # stop here if the download failed
with open("inference_data.zip", "wb") as f:
    f.write(response.content)

In [40]:
!unzip inference_data.zip

Archive:  inference_data.zip
replace inference_data/file_1.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: inference_data/file_1.txt  
replace inference_data/file_2.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: inference_data/file_2.txt  
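
The interactive overwrite prompts above can be avoided by extracting the archive with Python's standard `zipfile` module, which overwrites existing files without asking. This is a minimal sketch; the helper name `extract_archive` is illustrative.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

# Only runs when the archive downloaded above is present.
if Path("inference_data.zip").exists():
    print(extract_archive("inference_data.zip"))
```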


In [39]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

import glob

Load the fine-tuned T5 model and tokenizer from the directory where they were saved:

In [41]:
model_path = "/content/model_space"  # the path where you saved your model
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = T5Tokenizer.from_pretrained(model_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The `summarize_text` function takes an input text, a fine-tuned model, and its tokenizer, and returns a generated summary of at most 50 tokens.

In [42]:
def summarize_text(text, model, tokenizer, max_length=512, num_beams=5):
    # Preprocess the text
    inputs = tokenizer.encode(
        "summarize: " + text,
        return_tensors='pt',
        max_length=max_length,
        truncation=True
    )

    # Generate the summary
    summary_ids = model.generate(
        inputs,
        max_length=50,
        num_beams=num_beams,
        # early_stopping=True,
    )

    # Decode and return the summary
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


In [43]:
for file_path in glob.glob('inference_data/*.txt'):
    # Use a context manager so each file is closed after reading
    with open(file_path) as file:
        text = file.read()
    summary = summarize_text(text, model, tokenizer)
    pp.pprint(summary)
    print('-'*75)

('the leader of one of the world’s most influential AI companies, openAI, was '
 'fired Friday night by the startup’s board in a surprise move. within about '
 "48 hours, he'd been hired to run a")
---------------------------------------------------------------------------
("the chatGPT company will get its third CEO in three days. it's another major "
 'shakeup to the balance of power over artificial intelligence.')
---------------------------------------------------------------------------


In [44]:
!pip install gradio




In [45]:

import gradio as gr
from transformers import T5ForConditionalGeneration, T5Tokenizer

In [46]:

def summarize_text(text):
    # Preprocess the text
    inputs = tokenizer.encode(
        "summarize: " + text,
        return_tensors='pt',
        max_length=512,
        truncation=True  # a single input needs no padding
    )

    # Generate the summary
    summary_ids = model.generate(
        inputs,
        max_length=50,
        num_beams=5,
        # early_stopping=True
    )

    # Decode and return the summary
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

In [47]:
model_path = 'model_space'  # the path where you saved your model
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = T5Tokenizer.from_pretrained(model_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [48]:
interface = gr.Interface(
    fn=summarize_text,
    inputs=gr.Textbox(lines=10, placeholder='Enter Text Here...', label='Input text'),
    outputs=gr.Textbox(label='Summarized Text'),
    title='Text Summarizer using T5'
)
interface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://5621663b2ccdb94395.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


