## Text Summarization Model Training

This notebook trains a machine learning model for text summarization. Its contents, in order:

- The code imports the necessary libraries and modules, including `os`, `dataclasses`, `pathlib`, and various modules from the `transformers`, `datasets`, and `torch` libraries.

- It defines a `ModelTrainerConfig` data class that represents the configuration for model training. The class contains various attributes such as `root_dir`, `data_path`, `model_ckpt`, and different training parameters.

- The code defines a `ConfigurationManager` class that handles the configuration management. It reads the configuration and parameter files, creates necessary directories, and provides a method to retrieve the `ModelTrainerConfig`.

- Next, the code defines a `ModelTrainer` class that performs the training process. It initializes the class with a `ModelTrainerConfig` object and defines a `train` method.

- Inside the `train` method, the code checks the availability of a CUDA device and initializes the tokenizer and model for sequence-to-sequence generation.

- The code loads the dataset from disk using the specified data path.

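- It creates the training arguments: output directory, number of epochs, batch sizes, logging and evaluation steps, and gradient accumulation steps. In the active code every value except the output directory is hard-coded; the version that reads them from `ModelTrainerConfig` is commented out.

- The code creates a `Trainer` object from the model, training arguments, tokenizer, data collator, and datasets, then trains the model. Training uses the dataset's `test` split and evaluation uses the `validation` split.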

- After training, the trained model and tokenizer are saved to the specified directories.

- Finally, the code runs the whole process inside a try-except block whose handler re-raises any exception.
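
As written in the notebook, `except Exception as e: raise e` passes the exception on unchanged, so the `try` block adds little by itself. The following is a minimal sketch of a more informative variant, assuming the goal is to log the failure before propagating it. `run_stage`, `failing_stage`, and the logger name are hypothetical and not part of the notebook.

```python
import logging

logger = logging.getLogger("textSummarizerSketch")  # hypothetical logger name


def run_stage(stage):
    """Run a zero-argument callable, log any failure with its traceback, then re-raise."""
    try:
        return stage()
    except Exception:
        # logger.exception logs at ERROR level and appends the current traceback.
        logger.exception("Stage %r failed", getattr(stage, "__name__", stage))
        raise  # a bare `raise` re-raises the active exception


def failing_stage():
    # Stand-in for a training stage that fails.
    raise ValueError("bad config")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        run_stage(failing_stage)
    except ValueError as err:
        print(f"propagated: {err}")
```

Note that `KeyboardInterrupt` derives from `BaseException`, not `Exception`, so neither this handler nor the notebook's catches a manual interrupt.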

In summary, the notebook fine-tunes a text summarization model based on the Pegasus architecture: it reads the configuration, sets the training parameters, loads the dataset, trains the model, and saves the trained model and tokenizer.
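
The training arguments in the notebook's `train` method (batch size 1, 16 gradient accumulation steps, 500 warmup steps, evaluation every 500 steps) interact in a way that is easy to miss. The sketch below works through the arithmetic for a single device. `schedule_summary` is a hypothetical helper, not part of the notebook; its step count only approximates how `Trainer` counts optimizer updates, and the 819-example split size is an assumption about the dataset.

```python
def schedule_summary(num_examples, per_device_batch_size, grad_accum_steps,
                     num_epochs, num_devices=1):
    """Approximate the effective batch size and optimizer-update count of a run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Batches each device processes per epoch (ceiling division).
    batches_per_epoch = -(-num_examples // (per_device_batch_size * num_devices))
    # One optimizer update every `grad_accum_steps` batches, at least one per epoch.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, updates_per_epoch * num_epochs


# Argument values taken from the notebook; 819 examples is an assumed split size.
effective_batch, total_updates = schedule_summary(
    num_examples=819, per_device_batch_size=1, grad_accum_steps=16, num_epochs=1
)
print("effective batch size:", effective_batch)
print("optimizer updates:", total_updates)
print("warmup (500 steps) completes:", total_updates >= 500)
print("evaluation every 500 steps runs:", total_updates >= 500)
```

Under these assumptions the effective batch size is 16 and one epoch yields about 51 optimizer updates, so the 500 warmup steps would not finish and an evaluation scheduled every 500 steps would not run.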


```python
In [1]:
import os

In [3]:
%pwd

'f:\\artificial intelegnce\\study\\ML End To End Projects Krish Naik\\github\\Text-Summarizer-Project\\research'

In [4]:
os.chdir("../")

In [5]:
%pwd

'f:\\artificial intelegnce\\study\\ML End To End Projects Krish Naik\\github\\Text-Summarizer-Project'

In [6]:
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelTrainerConfig:
    root_dir: Path
    data_path: Path
    model_ckpt: Path
    num_train_epochs: int
    warmup_steps: int
    per_device_train_batch_size: int
    weight_decay: float
    logging_steps: int
    evaluation_strategy: str
    eval_steps: int
    save_steps: float
    gradient_accumulation_steps: int

In [7]:
from textSummarizer.constants import *
from textSummarizer.utils.common import read_yaml, create_directories

In [8]:
class ConfigurationManager:
    def __init__(
        self,
        config_filepath = CONFIG_FILE_PATH,
        params_filepath = PARAMS_FILE_PATH):

        self.config = read_yaml(config_filepath)
        self.params = read_yaml(params_filepath)

        create_directories([self.config.artifacts_root])


    
    def get_model_trainer_config(self) -> ModelTrainerConfig:
        config = self.config.model_trainer
        params = self.params.TrainingArguments

        create_directories([config.root_dir])

        model_trainer_config = ModelTrainerConfig(
            root_dir=config.root_dir,
            data_path=config.data_path,
            model_ckpt = config.model_ckpt,
            num_train_epochs = params.num_train_epochs,
            warmup_steps = params.warmup_steps,
            per_device_train_batch_size = params.per_device_train_batch_size,
            weight_decay = params.weight_decay,
            logging_steps = params.logging_steps,
            evaluation_strategy = params.evaluation_strategy,
            eval_steps = params.eval_steps,
            save_steps = params.save_steps,
            gradient_accumulation_steps = params.gradient_accumulation_steps
        )

        return model_trainer_config

In [10]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset, load_from_disk
import torch

In [11]:
class ModelTrainer:
    def __init__(self, config: ModelTrainerConfig):
        self.config = config


    
    def train(self):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        tokenizer = AutoTokenizer.from_pretrained(self.config.model_ckpt)
        model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(self.config.model_ckpt).to(device)
        seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)
        
        # Load the dataset saved at data_path
        dataset_samsum_pt = load_from_disk(self.config.data_path)

        # Config-driven arguments, currently disabled in favor of the hard-coded values below.
        # trainer_args = TrainingArguments(
        #     output_dir=self.config.root_dir, num_train_epochs=self.config.num_train_epochs, warmup_steps=self.config.warmup_steps,
        #     per_device_train_batch_size=self.config.per_device_train_batch_size, per_device_eval_batch_size=self.config.per_device_train_batch_size,
        #     weight_decay=self.config.weight_decay, logging_steps=self.config.logging_steps,
        #     evaluation_strategy=self.config.evaluation_strategy, eval_steps=self.config.eval_steps, save_steps=1e6,
        #     gradient_accumulation_steps=self.config.gradient_accumulation_steps
        # )

        # Hard-coded arguments: everything except output_dir ignores self.config.
        trainer_args = TrainingArguments(
            output_dir=self.config.root_dir, num_train_epochs=1, warmup_steps=500,
            per_device_train_batch_size=1, per_device_eval_batch_size=1,
            weight_decay=0.01, logging_steps=10,
            evaluation_strategy='steps', eval_steps=500, save_steps=1e6,
            gradient_accumulation_steps=16
        )

        # NOTE: training uses the "test" split, not "train". This looks like a
        # shortcut to keep the run short; a full run would likely use "train".
        trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["test"],
                  eval_dataset=dataset_samsum_pt["validation"])
        
        trainer.train()

        # Save the fine-tuned model
        model_pegasus.save_pretrained(os.path.join(self.config.root_dir,"pegasus-samsum-model"))
        # Save the tokenizer
        tokenizer.save_pretrained(os.path.join(self.config.root_dir,"tokenizer"))


In [12]:
try:
    config = ConfigurationManager()
    model_trainer_config = config.get_model_trainer_config()
    model_trainer = ModelTrainer(config=model_trainer_config)
    model_trainer.train()
except Exception as e:
    raise e

[2023-06-19 20:59:47,523: INFO: common: yaml file: config\config.yaml loaded successfully]
[2023-06-19 20:59:47,527: INFO: common: yaml file: params.yaml loaded successfully]
[2023-06-19 20:59:47,529: INFO: common: created directory at: artifacts]
[2023-06-19 20:59:47,530: INFO: common: created directory at: artifacts/model_trainer]


Downloading pytorch_model.bin:   0%|          | 10.5M/2.28G [00:03<11:57, 3.16MB/s]

KeyboardInterrupt: 

Downloading pytorch_model.bin:   0%|          | 10.5M/2.28G [00:20<11:57, 3.16MB/s]
```

Line-by-line walkthrough of `Text-Summarization-NLP-Project/research/04_model_trainer.ipynb`:

- `import os`: Imports the `os` module, used later to change the working directory and build the output paths.

- `%pwd`: An IPython magic command that prints the current working directory.

- The first `%pwd` output ends in `Text-Summarizer-Project\\research`, showing that the notebook starts in the project's `research` directory.

- `os.chdir("../")`: Changes the current working directory to the parent directory, the project root, so that relative paths such as `config\config.yaml` (seen in the log output) resolve.

- `%pwd`: Prints the working directory again to confirm the change.

- The second `%pwd` output ends in `Text-Summarizer-Project`, the project root.

- `from dataclasses import dataclass`: Imports the `dataclass` decorator from the `dataclasses` module.

- `from pathlib import Path`: Imports the `Path` class from the `pathlib` module.

- `@dataclass(frozen=True)`: Decorator used to define an immutable data class.

- `class ModelTrainerConfig`: Defines the configuration for model training. Its fields are `root_dir`, `data_path`, `model_ckpt`, and the training parameters `num_train_epochs`, `warmup_steps`, `per_device_train_batch_size`, `weight_decay`, `logging_steps`, `evaluation_strategy`, `eval_steps`, `save_steps`, and `gradient_accumulation_steps`.

- `from textSummarizer.constants import *`: Wildcard-imports the constants from the `textSummarizer.constants` module, including the `CONFIG_FILE_PATH` and `PARAMS_FILE_PATH` used as defaults below.

- `from textSummarizer.utils.common import read_yaml, create_directories`: Imports the `read_yaml` and `create_directories` functions from the `textSummarizer.utils.common` module.

- `class ConfigurationManager`: Manages the configuration. Its `__init__` method reads the configuration and parameter YAML files with `read_yaml` and creates the `artifacts_root` directory.

- `def get_model_trainer_config(self) -> ModelTrainerConfig`: Creates the trainer's `root_dir`, builds a `ModelTrainerConfig` from the `model_trainer` section of the configuration and the `TrainingArguments` section of the parameters, and returns it.

- `from transformers import TrainingArguments, Trainer`: Imports the `TrainingArguments` and `Trainer` classes from the `transformers` module.

- `from transformers import DataCollatorForSeq2Seq`: Imports the `DataCollatorForSeq2Seq` class from the `transformers` module.

- `from transformers import AutoModelForSeq2SeqLM, AutoTokenizer`: Imports the `AutoModelForSeq2SeqLM` and `AutoTokenizer` classes from the `transformers` module.

- `from datasets import load_dataset, load_from_disk`: Imports the `load_dataset` and `load_from_disk` functions from the `datasets` module; only `load_from_disk` is used in this notebook.

- `import torch`: Imports the `torch` module.

- `class ModelTrainer`: Runs the training process. Its `__init__` method stores the `ModelTrainerConfig` object, and its `train` method does the work. It selects a CUDA device when one is available (otherwise the CPU), loads the tokenizer and sequence-to-sequence model from `model_ckpt`, and builds a `DataCollatorForSeq2Seq`. It then loads the dataset from `data_path` and creates the `TrainingArguments`; the active call hard-codes every value except `output_dir`, while the config-driven version is commented out. A `Trainer` is created from the model, arguments, tokenizer, data collator, and datasets (the `test` split for training, the `validation` split for evaluation), and `trainer.train()` is called. Finally, the model is saved to `pegasus-samsum-model` and the tokenizer to `tokenizer`, both under `root_dir`.

- The final cell runs the pipeline inside a try-except block. Because the handler only re-raises the exception, failures propagate unchanged. In the run shown, the cell was stopped by a `KeyboardInterrupt` while the model weights were still downloading.

- Taken together, the notebook fine-tunes a text summarization model based on the Pegasus architecture: it imports the required modules, defines the configuration classes, reads the configuration, trains the model, and saves the trained model and tokenizer.
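
To illustrate what `@dataclass(frozen=True)` provides, the sketch below uses a trimmed-down stand-in, `TrainerConfigSketch`, which is hypothetical and keeps only two of the fields. The epoch counts are placeholder values, not values from the project's YAML files. Assigning to a field after construction raises `dataclasses.FrozenInstanceError`, and `dataclasses.replace` derives a modified copy.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from pathlib import Path


@dataclass(frozen=True)
class TrainerConfigSketch:
    # Hypothetical, reduced version of ModelTrainerConfig.
    root_dir: Path
    num_train_epochs: int


cfg = TrainerConfigSketch(root_dir=Path("artifacts/model_trainer"), num_train_epochs=1)

try:
    cfg.num_train_epochs = 3  # rejected because the instance is frozen
except FrozenInstanceError:
    print("cannot mutate a frozen config")

# replace() builds a new instance and leaves the original untouched.
longer_run = replace(cfg, num_train_epochs=3)
print(cfg.num_train_epochs, longer_run.num_train_epochs)
```

Data classes do not validate field types at runtime, so a field annotated as `Path` will accept whatever value the configuration loader returns.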
