- `import os`: Imports the `os` module, used here to change the working directory and build file paths.
- `%pwd`: IPython magic that prints the current working directory.
- `os.chdir("../")`: Changes the current working directory to the parent directory, which is the project root.
- `%pwd`: Prints the working directory again to verify the change.
- `from dataclasses import dataclass` and `from pathlib import Path`: Import the `dataclass` decorator from `dataclasses` and the `Path` class from `pathlib`.
- `@dataclass(frozen=True)`: Declares `DataTransformationConfig` as an immutable data class with the fields `root_dir`, `data_path`, and `tokenizer_name`; assigning to a field after construction raises `FrozenInstanceError`.
- `class ConfigurationManager`: Defines the `ConfigurationManager` class responsible for managing the project's configuration.
- `def __init__(self, config_filepath = CONFIG_FILE_PATH, params_filepath = PARAMS_FILE_PATH)`: Initializes the `ConfigurationManager` instance with optional `config_filepath` and `params_filepath` parameters, which default to predefined constants.
- `self.config = read_yaml(config_filepath)`: Reads the configuration file specified by `config_filepath` using the `read_yaml` function and assigns the result to the `config` attribute.
- `self.params = read_yaml(params_filepath)`: Reads the parameters file specified by `params_filepath` using the `read_yaml` function and assigns the result to the `params` attribute.
- `create_directories([self.config.artifacts_root])`: Creates the top-level artifacts directory named by `artifacts_root` in the configuration file, using the `create_directories` helper.
- `def get_data_transformation_config(self) -> DataTransformationConfig`: Defines the method that reads the `data_transformation` section of the configuration and returns it as a `DataTransformationConfig`.
- `create_directories([config.root_dir])`: Creates the output directory for the data transformation stage (`artifacts/data_transformation` in the logged run).
- `data_transformation_config = DataTransformationConfig(...)`: Builds a `DataTransformationConfig` from the `root_dir`, `data_path`, and `tokenizer_name` values in the `data_transformation` section of the configuration.
- `import os`: Imports the `os` module.
- `from textSummarizer.logging import logger`: Imports the `logger` object from the `textSummarizer.logging` module, which is used for logging.
- `from transformers import AutoTokenizer`: Imports `AutoTokenizer` from the Hugging Face `transformers` library; it loads the tokenizer that matches a given pretrained model name.
- `from datasets import load_dataset, load_from_disk`: Imports two dataset loaders from the `datasets` library. Only `load_from_disk` is used in this notebook.
- `class DataTransformation`: Defines the `DataTransformation` class responsible for converting data into features for model training.
- `def __init__(self, config: DataTransformationConfig)`: Initializes the `DataTransformation` instance with a `config` parameter of type `DataTransformationConfig`.
- `self.tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name)`: Loads the pretrained tokenizer named by `tokenizer_name` in the configuration, downloading its files if they are not already cached, and stores it on the `tokenizer` attribute.
- `def convert_examples_to_features(self, example_batch)`: Converts a batch of examples into model inputs: `input_ids`, `attention_mask`, and `labels`.
- `input_encodings = self.tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)`: Tokenizes the dialogues in the batch, truncating each to at most 1024 tokens.
- `with self.tokenizer.as_target_tokenizer():`: Context manager that switches the tokenizer into target (label) mode for the enclosed calls. Recent `transformers` releases deprecate it in favor of passing `text_target=` to the tokenizer call.
- `target_encodings = self.tokenizer(example_batch['summary'], max_length=128, truncation=True)`: Tokenizes the summaries in target mode, truncating each to at most 128 tokens; the resulting `input_ids` become the `labels`.
- `def convert(self)`: Converts the dataset to features and saves it to disk.
- `dataset_samsum = load_from_disk(self.config.data_path)`: Loads the dataset from disk using the data path specified in the configuration.
- `dataset_samsum_pt = dataset_samsum.map(self.convert_examples_to_features, batched=True)`: Applies `convert_examples_to_features` to the dataset in batches rather than one example at a time.
- `dataset_samsum_pt.save_to_disk(os.path.join(self.config.root_dir, "samsum_dataset"))`: Saves the converted dataset to the `samsum_dataset` folder inside `root_dir`.
- `config = ConfigurationManager()`: Creates an instance of the `ConfigurationManager` class to manage the project's configuration.
- `data_transformation_config = config.get_data_transformation_config()`: Retrieves the data transformation configuration by calling the `get_data_transformation_config` method of the `ConfigurationManager` instance.
- `data_transformation = DataTransformation(config=data_transformation_config)`: Creates an instance of the `DataTransformation` class for data transformation, passing the obtained data transformation configuration as a parameter.
- `data_transformation.convert()`: Invokes the `convert` method of the `DataTransformation` instance to perform the data transformation process.

In summary, the code performs data transformation for the text summarization project. It sets up the necessary configurations, loads the dataset, tokenizes the input data, converts it into features, and saves the transformed dataset to disk.


In [1]:
import os

In [2]:
%pwd

'f:\\artificial intelegnce\\study\\ML End To End Projects Krish Naik\\github\\Text-Summarizer-Project\\research'

In [3]:
os.chdir("../")

In [4]:
%pwd

'f:\\artificial intelegnce\\study\\ML End To End Projects Krish Naik\\github\\Text-Summarizer-Project'

In [5]:
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataTransformationConfig:
    root_dir: Path
    data_path: Path
    tokenizer_name: str  # a model name passed to AutoTokenizer.from_pretrained, not a filesystem path
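
What `frozen=True` buys in practice can be shown with a minimal sketch. `DemoConfig` and its values are hypothetical stand-ins for `DataTransformationConfig`, not the project's actual class.

```python
from dataclasses import dataclass, replace, FrozenInstanceError
from pathlib import Path


@dataclass(frozen=True)
class DemoConfig:
    # Hypothetical stand-in for DataTransformationConfig.
    root_dir: Path
    tokenizer_name: str


cfg = DemoConfig(root_dir=Path("artifacts/demo"), tokenizer_name="some-model")

try:
    cfg.root_dir = Path("elsewhere")  # assignment is blocked on frozen instances
    mutated = True
except FrozenInstanceError:
    mutated = False

# To "change" a field, build a modified copy instead of mutating in place.
cfg2 = replace(cfg, tokenizer_name="other-model")
```

Because the instance cannot be modified after construction, a stage that receives the config can rely on it staying the same for the whole run.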

In [6]:
from textSummarizer.constants import *
from textSummarizer.utils.common import read_yaml, create_directories

In [7]:
class ConfigurationManager:
    def __init__(
        self,
        config_filepath = CONFIG_FILE_PATH,
        params_filepath = PARAMS_FILE_PATH):

        self.config = read_yaml(config_filepath)
        self.params = read_yaml(params_filepath)

        create_directories([self.config.artifacts_root])


    
    def get_data_transformation_config(self) -> DataTransformationConfig:
        config = self.config.data_transformation

        create_directories([config.root_dir])

        data_transformation_config = DataTransformationConfig(
            root_dir=config.root_dir,
            data_path=config.data_path,
            tokenizer_name=config.tokenizer_name
        )

        return data_transformation_config
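
The notebook does not show how `read_yaml` and `create_directories` are implemented. The sketch below illustrates the general pattern under stated assumptions: a parsed YAML document is an ordinary nested `dict`, it is wrapped so that keys read as attributes, and directories are created idempotently. `to_namespace`, `make_directories`, and every value in `raw_config` are hypothetical.

```python
import os
import tempfile
from types import SimpleNamespace


def to_namespace(value):
    """Recursively wrap dicts so keys can be read as attributes."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value


def make_directories(paths):
    """Create each directory; an already existing directory is not an error."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


# Hypothetical parsed YAML; the real config.yaml is not shown in this notebook.
base = tempfile.mkdtemp()
raw_config = {
    "artifacts_root": os.path.join(base, "artifacts"),
    "data_transformation": {
        "root_dir": os.path.join(base, "artifacts", "data_transformation"),
        "data_path": os.path.join(base, "artifacts", "input_dataset"),
        "tokenizer_name": "example/tokenizer",
    },
}

config = to_namespace(raw_config)
make_directories([config.artifacts_root])

section = config.data_transformation  # attribute access, as in self.config.data_transformation
make_directories([section.root_dir])
```

Attribute access keeps call sites short (`config.data_transformation.root_dir`) at the cost of failing only at runtime when a key is missing from the YAML file.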

In [9]:
import os
from textSummarizer.logging import logger
from transformers import AutoTokenizer
from datasets import load_dataset, load_from_disk

In [10]:
class DataTransformation:
    def __init__(self, config: DataTransformationConfig):
        self.config = config
        self.tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name)


    
    def convert_examples_to_features(self, example_batch):
        input_encodings = self.tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)

        with self.tokenizer.as_target_tokenizer():
            target_encodings = self.tokenizer(example_batch['summary'], max_length=128, truncation=True)

        return {
            'input_ids': input_encodings['input_ids'],
            'attention_mask': input_encodings['attention_mask'],
            'labels': target_encodings['input_ids']
        }

    def convert(self):
        dataset_samsum = load_from_disk(self.config.data_path)
        dataset_samsum_pt = dataset_samsum.map(self.convert_examples_to_features, batched=True)
        dataset_samsum_pt.save_to_disk(os.path.join(self.config.root_dir, "samsum_dataset"))
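
To see the shape of what `convert_examples_to_features` returns without downloading a model, here is a minimal sketch that swaps the real tokenizer for a toy one. `toy_tokenize` is a hypothetical stand-in (it emits one id per whitespace-separated word, using the word length as the id), and the small `max_input`/`max_target` limits and the sample batch are made up for the demonstration.

```python
def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for a real tokenizer: one id per whitespace token."""
    input_ids = [[len(word) for word in text.split()][:max_length] for text in texts]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def convert_examples_to_features(example_batch, max_input=8, max_target=4):
    # With batched=True, the mapped function receives a dict-like batch
    # that maps each column name to a list of values.
    input_encodings = toy_tokenize(example_batch["dialogue"], max_input)
    target_encodings = toy_tokenize(example_batch["summary"], max_target)
    return {
        "input_ids": input_encodings["input_ids"],
        "attention_mask": input_encodings["attention_mask"],
        "labels": target_encodings["input_ids"],  # target ids become the labels
    }


# Made-up batch with the same column names the notebook uses.
batch = {
    "dialogue": ["A: are you coming tonight? B: yes, see you at eight", "A: hi B: hello"],
    "summary": ["B will join A at eight tonight", "A and B greet each other"],
}
features = convert_examples_to_features(batch)
```

Each returned column has one entry per example in the batch, and sequences longer than the limit are cut off, which mirrors the effect of `truncation=True` with `max_length` in the real call.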

In [11]:
try:
    config = ConfigurationManager()
    data_transformation_config = config.get_data_transformation_config()
    data_transformation = DataTransformation(config=data_transformation_config)
    data_transformation.convert()
except Exception as e:
    raise e

[2023-06-19 19:51:19,987: INFO: common: yaml file: config\config.yaml loaded successfully]
[2023-06-19 19:51:19,989: INFO: common: yaml file: params.yaml loaded successfully]
[2023-06-19 19:51:19,990: INFO: common: created directory at: artifacts]
[2023-06-19 19:51:19,992: INFO: common: created directory at: artifacts/data_transformation]


Downloading (…)okenizer_config.json: 100%|██████████| 88.0/88.0 [00:00<00:00, 43.8kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 1.12k/1.12k [00:00<00:00, 573kB/s]
Downloading (…)ve/main/spiece.model: 100%|██████████| 1.91M/1.91M [00:01<00:00, 1.29MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 65.0/65.0 [00:00<00:00, 32.6kB/s]
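
The `try`/`except` in cell `In [11]` re-raises the exception unchanged, so it has essentially the same effect as having no handler. One common alternative, sketched here as a suggestion rather than as the project's approach, is to log the failure before re-raising. The standard-library logger, `run_stage`, and `failing_stage` are hypothetical stand-ins; the notebook's own `logger` comes from `textSummarizer.logging`.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")  # stand-in for textSummarizer.logging.logger


def run_stage(stage):
    """Run a callable, logging any failure with its traceback before re-raising."""
    try:
        return stage()
    except Exception:
        logger.exception("Data transformation stage failed")
        raise  # a bare raise re-raises the active exception with its traceback


def failing_stage():
    # Hypothetical failure used only to exercise the handler.
    raise ValueError("dataset not found")


try:
    run_stage(failing_stage)
    outcome = "ok"
except ValueError as err:
    outcome = f"failed: {err}"
```

The caller still sees the original exception, but the log now records where the pipeline stopped.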
                                                                                                 