<a href="https://colab.research.google.com/github/mahesh-tippanu/MLP_Datathlon/blob/main/DatathonMLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing Libraries and Modules

In this cell, we import the necessary libraries and modules required for the task:

- **pandas**: For data manipulation and analysis.
- **transformers**: Includes the `T5Tokenizer` and `T5ForConditionalGeneration` classes for tokenizing text and generating predictions using the T5 model.
- **datasets**: Provides the `Dataset` and `DatasetDict` classes for handling datasets.
- **numpy**: For numerical operations.

These libraries and modules will be used for data processing, model training, and evaluation.


In [2]:
# Import necessary libraries
!pip install datasets
!pip install transformers
!pip install tqdm
import pandas as pd
import numpy as np
import json
from transformers import TrainingArguments, Trainer, TrainerCallback
from datasets import Dataset, DatasetDict
from transformers import T5Tokenizer
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from google.colab import drive



## Reading Data from JSONL Files

In this cell, we define a function `read_jsonl` to read data from JSON Lines (JSONL) files into pandas DataFrames. We then use this function to read the following datasets:

- **Training Data**: `attrebute_train.data` and `attrebute_train.solution`, with the first 1000 rows.
- **Testing Data**: `attrebute_test.data` and `attrebute_test.solution`, with the first 200 rows.
- **Validation Data**: `attrebute_val.data` and `attrebute_val.solution`, with the first 200 rows.

The commented-out lines are for reading the entire datasets if needed. This setup allows us to work with a subset of the data for initial experimentation and testing.


In [3]:
from google.colab import drive
drive.mount('/content/drive')

path = '/content/drive/MyDrive/Datathon/'

def load_jsonl_sample(file_path, nrows):
    samples = []
    with open(file_path, 'r') as f:
        for i, line in enumerate(f):
            if i >= nrows:
                break
            samples.append(json.loads(line))
    return samples

train_data = load_jsonl_sample(path + 'attribute_train.data',nrows = 1000)
train_solution = load_jsonl_sample(path + 'attribute_train.solution', nrows = 1000)
# val_data = pd.read_json(path + 'attribute_val.data', lines=True)
# val_solution = pd.read_json(path + 'attribute_val.solution', lines=True)
test_data = load_jsonl_sample(path + 'attribute_test.data', nrows = 1000)

train_data = pd.DataFrame(train_data)
train_solution = pd.DataFrame(train_solution)
test_data = pd.DataFrame(test_data)



Mounted at /content/drive


## Data Preprocessing and Formatting

In this cell, we define a function `preprocess_data` to prepare the data for model training. This function merges the product description data with the corresponding attribute labels, then formats the data into `input_text` and `target_text` pairs:

- **`input_text`**: Constructed by combining the product title, store, and manufacturer details.
- **`target_text`**: Constructed by specifying the attribute-value pairs for brand and categories.

### Data Processing

We apply the `preprocess_data` function to the training, testing, and validation datasets to generate the `input_text` and `target_text`.

Finally, the processed data is converted into the Hugging Face Dataset format using `Dataset.from_pandas` for further model training and evaluation.


In [4]:
# Load data
train_data = load_jsonl_sample(path + 'attribute_train.data', nrows=1000)
train_solution = load_jsonl_sample(path + 'attribute_train.solution', nrows=1000)
test_data = load_jsonl_sample(path + 'attribute_test.data', nrows=1000)

# Convert data to pandas DataFrame
train_data = pd.DataFrame(train_data)
train_solution = pd.DataFrame(train_solution)
test_data = pd.DataFrame(test_data)

# Merge train data with solution on 'indoml_id'
train_df = pd.merge(train_data, train_solution, on='indoml_id')

# Feature Engineering
train_df['details_Manufacturer'].fillna('Unknown', inplace=True)

# Text Vectorization for 'title' using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000)
title_features = vectorizer.fit_transform(train_df['title']).toarray()

# Label Encoding for 'store' and 'details_Manufacturer'
le_store = LabelEncoder()
le_manufacturer = LabelEncoder()
train_df['store_encoded'] = le_store.fit_transform(train_df['store'])
train_df['manufacturer_encoded'] = le_manufacturer.fit_transform(train_df['details_Manufacturer'])

# Combine all features into a single matrix
X = np.hstack([title_features, train_df[['store_encoded', 'manufacturer_encoded']].values])

# Target columns
target_columns = ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']

# Apply Label Encoding to each target column
label_encoders = {col: LabelEncoder() for col in target_columns}
y = train_df[target_columns].copy()
for col in target_columns:
    y[col] = label_encoders[col].fit_transform(train_df[col])

# Split the data into training and validation sets
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)

# Create DatasetDict
dataset_dict = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'test': Dataset.from_pandas(test_data),
    'validation': Dataset.from_pandas(val_df)
})
# Load the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Example usage
input_text = "Translate English to French: How are you?"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
outputs = model.generate(input_ids)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Translated Text:", translated_text)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]



Translated Text: Comment êtes-vous?


In [None]:
train_df.head()


Unnamed: 0,indoml_id,title,store,details_Manufacturer,details_Brand,L0_category,L1_category,L2_category,L3_category,L4_category,store_encoded,manufacturer_encoded
0,0,"Enclume Angled Pot Hook, Set of 6, Use with Po...",Enclume,Enclume,Enclume,Home & Kitchen,Kitchen & Dining,Storage & Organization,Racks & Holders,Pot Racks,231,245
1,1,Schutt Vengeance DCT Hybrid Youth Football H,Schutt,Schutt,Schutt,Sports & Outdoors,Sports,Team Sports,Football,Protective Gear,584,608
2,2,Easton 2014 MAKO SL14MK9 Baseball Bat (-9),Easton,"Easton Sports, Inc.",Easton,Sports & Outdoors,Sports,Team Sports,Baseball,Baseball Bats,224,239
3,3,Bilstein B46-0929 Heavy-Duty Gas Shock Absorber,Bilstein,Bilstein,Bilstein,Automotive,Replacement Parts,"Shocks, Struts & Suspension",Shocks,na,90,87
4,4,Apple Red Cardstock - 8.5 x 11 inch - 65Lb Cov...,Clear Path Paper,Clear Path Paper,Clear Path Paper,"Arts, Crafts & Sewing",Crafting,Paper & Paper Crafts,Paper,Card Stock,143,153


## Creating Dataset Dictionary

In this cell, we create a `DatasetDict` to organize the processed datasets for training, testing, and validation. The `DatasetDict` is a convenient way to manage multiple datasets in Hugging Face's `datasets` library.

- **`train`**: Contains the training dataset (`train_dataset`).
- **`test`**: Contains the test dataset (`test_dataset`).
- **`validation`**: Contains the validation dataset (`val_dataset`).

The `DatasetDict` will be used for training and evaluating the model, allowing for easy access to different subsets of data.


In [5]:
dataset_dict = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'test': Dataset.from_pandas(test_data),
    'validation': Dataset.from_pandas(val_df)
})

In [6]:
from google.colab import drive
drive.mount('/content/drive')

path = '/content/drive/MyDrive/Datathon/'

def load_jsonl_sample(file_path, nrows):
    samples = []
    with open(file_path, 'r') as f:
        for i, line in enumerate(f):
            if i >= nrows:
                break
            samples.append(json.loads(line))
    return samples

!pip install pandas
import pandas as pd
train_data = load_jsonl_sample(path + 'attribute_train.data',nrows = 1000)
train_solution = load_jsonl_sample(path + 'attribute_train.solution', nrows = 1000)
# val_data = pd.read_json(path + 'attribute_val.data', lines=True)
# val_solution = pd.read_json(path + 'attribute_val.solution', lines=True)
test_data = load_jsonl_sample(path + 'attribute_test.data', nrows = 1000)

train_data = pd.DataFrame(train_data)
train_solution = pd.DataFrame(train_solution)
test_data = pd.DataFrame(test_data)

# Merge train data with solution on 'indoml_id'
train_df = pd.merge(train_data, train_solution, on='indoml_id')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Loading the T5 Model and Tokenizer

In this cell, we load the T5 model and tokenizer from the Hugging Face `transformers` library:

- **`T5Tokenizer`**: Tokenizer for converting text into tokens and vice versa, using the `t5-small` pre-trained model.
- **`T5ForConditionalGeneration`**: T5 model for sequence-to-sequence tasks, also using the `t5-small` pre-trained model.

These components will be used for encoding the input text, generating predictions, and decoding the output text.


In [7]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

## Tokenizing the Dataset

In this cell, we define the `preprocess_function` to tokenize the `input_text` and `target_text` using the T5 tokenizer:

- **`inputs`**: Tokenized input texts with a maximum length of 352 tokens, padded and truncated as necessary.
- **`targets`**: Tokenized target texts with a maximum length of 128 tokens, padded and truncated as necessary.
- **`model_inputs`**: Contains the tokenized inputs and labels (target texts) for model training.

The `preprocess_function` is applied to the entire dataset using the `map` method with `batched=True`, ensuring efficient processing of the data in batches.

The result, `tokenized_datasets`, is a `DatasetDict` containing the tokenized versions of the train, test, and validation datasets, ready for model training.


In [8]:
def preprocess_function(examples):
    # Extract inputs and targets from the examples
    inputs = examples['title']
    targets = examples['L0_category']  # Replace with the appropriate target field

    # Tokenize inputs (encoder inputs)
    model_inputs = tokenizer(
        inputs,
        max_length=128,  # Adjust this length if needed
        truncation=True,
        padding='max_length'
    )

    # Tokenize targets (decoder inputs)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            max_length=128,  # Adjust this length if needed
            truncation=True,
            padding='max_length'
        )

    # Add labels to model inputs
    model_inputs['labels'] = labels['input_ids']

    # Ensure that labels are correctly set (convert -100 to ignore_index if necessary)
    model_inputs['labels'] = [label if label != tokenizer.pad_token_id else -100 for label in model_inputs['labels']]

    return model_inputs


In [15]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['title', 'details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1
    })
    test: Dataset({
        features: ['title', 'details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['title', 'details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1
    })
})

In [14]:
# Tokenize datasets
tokenized_datasets = dataset_dict.map(preprocess_function, batched=True)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]



Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

In [12]:
# from datasets import load_from_disk

# tokenized_datasets = load_from_disk('./')

## Configuring Training Arguments

In this cell, we set up the `TrainingArguments` for training the T5 model using the Hugging Face `Trainer`:

- **`output_dir`**: Directory to save the model checkpoints and results.
- **`evaluation_strategy`**: Strategy for evaluation, set to `'epoch'`, meaning evaluation will occur at the end of each epoch.
- **`learning_rate`**: Learning rate for optimization, set to `2e-5`.
- **`per_device_train_batch_size`**: Batch size for training, set to `16`.
- **`per_device_eval_batch_size`**: Batch size for evaluation, set to `16`.
- **`num_train_epochs`**: Number of training epochs, set to `2`.
- **`weight_decay`**: Weight decay for regularization, set to `0.01`.
- **`save_total_limit`**: Limit on the number of checkpoints to keep, set to `3`.
- **`logging_dir`**: Directory for logging information.
- **`logging_steps`**: Frequency of logging, set to every 20 steps.
- **`report_to`**: Reporting options, set to `'none'` to disable reporting.

These arguments control various aspects of the training process and ensure efficient training and logging.


In [27]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import pandas as pd
from datasets import Dataset, DatasetDict

# Initialize tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Sample data
data = {
    'title': ['2001 2002 C YUKON Compatible KEYLESS ENTRY REMOTE CLICKER FOB'],
    'details_Brand': ['Chevrolet'],
    'L0_category': ['Automotive'],
    'L1_category': ['Exterior Accessories'],
    'L2_category': ['License Plate Covers & Frames'],
    'L3_category': ['Frames'],
    'L4_category': ['na']
}

train_df = pd.DataFrame(data)
test_data = pd.DataFrame(data)  # Assuming test data is similar for demonstration
val_df = pd.DataFrame(data)     # Assuming validation data is similar for demonstration

# Create DatasetDict
dataset_dict = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'test': Dataset.from_pandas(test_data),
    'validation': Dataset.from_pandas(val_df)
})

# Define preprocessing function
def preprocess_function(examples):
    inputs = examples['title']
    targets = examples['L0_category']  # Replace with the appropriate target field

    # Tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=128,
        truncation=True,
        padding='max_length'
    )

    # Tokenize targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            max_length=128,
            truncation=True,
            padding='max_length'
        )

    model_inputs['labels'] = labels['input_ids']
    # Replace padding token id with -100 for loss calculation
    model_inputs['labels'] = [label if label != tokenizer.pad_token_id else -100 for label in model_inputs['labels']]

    return model_inputs

# Tokenize datasets
tokenized_datasets = dataset_dict.map(preprocess_function, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Initialize data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator
)

# Train the model
trainer.train()

# Evaluate the model on the validation dataset
val_results = trainer.evaluate(
    eval_dataset=tokenized_datasets['validation'],
    ignore_keys=["loss"],
    metric_key_prefix="eval"
)
print(f"Validation Loss: {val_results['eval_loss']}")

# Evaluate the model on the test dataset
test_results = trainer.evaluate(
    eval_dataset=tokenized_datasets['test'],
    ignore_keys=["loss"],
    metric_key_prefix="eval"
)
print(f"Test Loss: {test_results['eval_loss']}")

print(test_df.iloc[0])


Map:   0%|          | 0/1 [00:00<?, ? examples/s]



Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss
1,No log,0.191525
2,No log,0.316679
3,No log,0.282902


Validation Loss: 0.2829017639160156
Test Loss: 0.2829017639160156
title            2001 2002 C YUKON Compatible KEYLESS ENTRY REM...
details_Brand                                            Chevrolet
L0_category                                             Automotive
L1_category                                   Exterior Accessories
L2_category                          License Plate Covers & Frames
L3_category                                                 Frames
L4_category                                                     na
Name: 0, dtype: object


## Defining a Custom Callback for Logging

In this cell, we define a custom callback class `CustomCallback` that extends `TrainerCallback` from the Hugging Face `transformers` library:

- **`on_log` Method**: This method is triggered during the training process whenever logging occurs. It prints:
  - The current training step (`state.global_step`).
  - Each key-value pair in the `logs` dictionary.

This custom callback allows for detailed logging of training progress and metrics directly to the console, providing real-time feedback during the training process.


In [17]:
class CustomCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            print(f"Step: {state.global_step}")
            for key, value in logs.items():
                print(f"{key}: {value}")
            print("\n")

## Training the Model

In this cell, we initialize and run the `Trainer` for training the T5 model:

- **`model`**: The T5 model to be trained.
- **`args`**: The `TrainingArguments` specified in the previous cell.
- **`train_dataset`**: The tokenized training dataset.
- **`eval_dataset`**: The tokenized validation dataset.
- **`callbacks`**: The list of callbacks to use during training, including the custom `CustomCallback` defined earlier.

After setting up the `Trainer`, we call `trainer.train()` to start the training process. The custom callback will print detailed logging information during training.


## Evaluating the Model

In this cell, we evaluate the trained model on both the validation and test datasets:

- **Validation Evaluation**: We use the `trainer.evaluate()` method to assess the model's performance on the validation dataset (`tokenized_datasets['validation']`). The validation loss is printed to provide an indication of how well the model generalizes to unseen validation data.

- **Test Evaluation**: Similarly, we evaluate the model on the test dataset (`tokenized_datasets['test']`). The test loss is printed to gauge the model's performance on the final test set.

The `eval_loss` metric provides insight into the model's performance, helping to assess its accuracy and effectiveness on the given datasets.


In [18]:
# Ensure datasets are in the correct format
tokenized_datasets = tokenized_datasets.with_format("torch")

# Evaluate the model on the validation dataset
val_results = trainer.evaluate(
    eval_dataset=tokenized_datasets['validation'],
    ignore_keys=["loss"],  # Optionally ignore specific keys
    metric_key_prefix="eval"
)
print(f"Validation Loss: {val_results['eval_loss']}")

# Evaluate the model on the test dataset
test_results = trainer.evaluate(
    eval_dataset=tokenized_datasets['test'],
    ignore_keys=["loss"],  # Optionally ignore specific keys
    metric_key_prefix="eval"
)
print(f"Test Loss: {test_results['eval_loss']}")

Validation Loss: 0.2829017639160156
Test Loss: 0.2829017639160156


## Saving the Fine-Tuned Model

In this cell, we save the fine-tuned T5 model and tokenizer to a specified directory:

- **`model.save_pretrained('./fine_tuned_t5_1000dp')`**: Saves the trained T5 model to the directory `./fine_tuned_t5`. This allows you to load the model later without retraining.

- **`tokenizer.save_pretrained('./fine_tuned_t5_1000dp')`**: Saves the tokenizer associated with the T5 model to the same directory. This ensures that you can use the same tokenizer for encoding and decoding text during inference.

Saving both the model and tokenizer ensures that you can resume work or deploy the model in the future with consistent results.


In [19]:
model.save_pretrained('./fine_tuned_t5_1000dp')
tokenizer.save_pretrained('./fine_tuned_t5_1000dp')

('./fine_tuned_t5_1000dp/tokenizer_config.json',
 './fine_tuned_t5_1000dp/special_tokens_map.json',
 './fine_tuned_t5_1000dp/spiece.model',
 './fine_tuned_t5_1000dp/added_tokens.json')

## Loading the Fine-Tuned Model and Tokenizer

In this cell, we load the fine-tuned T5 model and tokenizer from the specified directory and set up the environment for evaluation:

- **`device`**: Determines whether to use a GPU (`cuda`) or CPU for computation based on availability.

- **`model`**: Loads the fine-tuned T5 model and moves it to the appropriate device (`cuda` or `cpu`).

- **`tokenizer`**: Loads the tokenizer associated with the fine-tuned T5 model.

The model is set to evaluation mode with `model.eval()`, preparing it for generating predictions.

### Functions

- **`generate_text(inputs)`**: Takes a batch of input texts, tokenizes them, and generates predictions using the fine-tuned model. It returns the generated texts after decoding them from token IDs.

- **`extract_details(text)`**: Extracts attribute details from the generated or target text using regular expressions. It returns the details for brand and categories, defaulting to `'na'` if not found.

- **`clean_repeated_patterns(text)`**: Cleans the generated text by removing redundant patterns, specifically handling the `L4_category`.

These functions will be used for generating predictions and extracting and cleaning the details from the results.


In [20]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import pandas as pd
from datasets import Dataset, DatasetDict

# Initialize tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Sample data
data = {
    'title': ['2001 2002 C YUKON Compatible KEYLESS ENTRY REMOTE CLICKER FOB'],
    'details_Brand': ['Chevrolet'],
    'L0_category': ['Automotive'],
    'L1_category': ['Exterior Accessories'],
    'L2_category': ['License Plate Covers & Frames'],
    'L3_category': ['Frames'],
    'L4_category': ['na']
}

train_df = pd.DataFrame(data)
test_df = pd.DataFrame(data)  # Assuming test data is similar for demonstration
val_df = pd.DataFrame(data)   # Assuming validation data is similar for demonstration

# Create DatasetDict
dataset_dict = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'test': Dataset.from_pandas(test_df),
    'validation': Dataset.from_pandas(val_df)
})

# Define preprocessing function
def preprocess_function(examples):
    inputs = examples['title']
    targets = examples['L0_category']  # Replace with the appropriate target field

    # Tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=128,
        truncation=True,
        padding='max_length'
    )

    # Tokenize targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            max_length=128,
            truncation=True,
            padding='max_length'
        )

    model_inputs['labels'] = labels['input_ids']
    # Replace padding token id with -100 for loss calculation
    model_inputs['labels'] = [-100 if label == tokenizer.pad_token_id else label for label in model_inputs['labels']]

    return model_inputs

# Tokenize datasets
tokenized_datasets = dataset_dict.map(preprocess_function, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Initialize data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator
)

# Train the model
trainer.train()

# Evaluate the model
val_results = trainer.evaluate(
    eval_dataset=tokenized_datasets['validation']
)
print(f"Validation Loss: {val_results['eval_loss']}")

test_results = trainer.evaluate(
    eval_dataset=tokenized_datasets['test']
)
print(f"Test Loss: {test_results['eval_loss']}")


Map:   0%|          | 0/1 [00:00<?, ? examples/s]



Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss
1,No log,0.191525
2,No log,0.316679
3,No log,0.282902


Validation Loss: 0.2829017639160156
Test Loss: 0.2829017639160156


## Generating Predictions and Extracting Details

In this cell, we process the test data in batches to generate predictions and extract attribute details:

- **`batch_size`**: The number of samples processed in each batch, set to `128`.

- **`generated_details`**: List to store extracted details from generated texts.
- **`target_details`**: List to store extracted details from target texts.

### Processing Loop

We iterate over the test data in batches:
1. **Batch Extraction**: For each batch of inputs, we generate predictions using the `generate_text` function.
2. **Details Extraction**: For each generated text and corresponding label, we extract and append details using the `extract_details` function.

**Note**: The `batch_labels` are included here for completeness, but they are not used in this code snippet for generating predictions.

Finally, a message is printed to indicate that the extraction of generated information is complete.


In [21]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Initialize the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
model.eval()  # Set model to evaluation mode

def generate_text(inputs):
    """Generate text from model given input tensor."""
    with torch.no_grad():
        outputs = model.generate(input_ids=inputs['input_ids'],
                                 attention_mask=inputs['attention_mask'],
                                 max_length=50)  # Adjust max_length as needed
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def extract_details(text):
    """Extract details from generated or target text."""
    # Split the text by the delimiter used in your task
    # Here assuming a space delimiter; adjust as needed
    return text.split(' ')

# Batch size
batch_size = 128

# Sample test dataset (replace with your actual dataset)
test_data = [
    {'input_text': 'Zebra Case for Pine64 ~ Black Ice Tall - C4Labs'},
    {'input_text': '2001 2002 C YUKON Compatible KEYLESS ENTRY REMOTE CLICKER FOB'},
    # Add more test data as needed
]

generated_details = []
target_details = []  # Assuming you have target details, if not, can be left empty

# Process the test data in batches
for i in range(0, len(test_data), batch_size):
    batch = test_data[i:i + batch_size]

    # Tokenize the batch
    batch_inputs = tokenizer([item['input_text'] for item in batch],
                             padding=True,
                             truncation=True,
                             return_tensors='pt')

    # Generate predictions
    generated_texts = generate_text(batch_inputs)

    # Extract details from generated texts
    for text in generated_texts:
        details = extract_details(text)
        generated_details.append(details)

    # Extract details from target texts (if available)
    # Replace with appropriate code if you have target details
    for item in batch:
        # Assuming target details are provided in the dataset
        # target_details.append(extract_details(item['target_text']))
        pass

print("Extraction of generated information complete.")

# If you have target details and want to compute evaluation metrics, you can use them as well.


Extraction of generated information complete.


## Evaluating Model Performance by Category

In this cell, we evaluate the model's performance by splitting the generated and target details into categories and calculating various metrics:

### Data Preparation

- **`generated_dict`** and **`target_dict`**: Dictionaries to store generated and target details for each category (0 through 5). The `generated_details` and `target_details` lists are split into these dictionaries based on category indices.

- **Cleaning Repeated Patterns**: The `L4_category` entries in `generated_dict` are cleaned using the `clean_repeated_patterns` function to remove redundant patterns.

### Metrics Calculation

- **`categories`**: List of categories for which metrics will be computed: `details_Brand`, `L0_category`, `L1_category`, `L2_category`, `L3_category`, and `L4_category`.

- **`metrics`**: List of metrics to be calculated: `accuracy`, `precision`, `recall`, and `f1`.

For each category:
1. **Compute Metrics**: Accuracy, precision, recall, and F1 score are calculated using `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics`. Metrics are computed with macro averaging to handle multi-class classification.

2. **Print Results**: The results for each category are printed, showing the calculated metrics with four decimal places.

The printed results provide insight into the performance of the model across different categories and metrics.


In [22]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import re

# Example clean_repeated_patterns function
def clean_repeated_patterns(text):
    # Example function to clean repeated patterns (customize as needed)
    return re.sub(r'(.)\1+', r'\1\1', text)

# Sample data (replace with actual test data)
generated_details = [
    ['Chevrolet', 'Automotive', 'Exterior Accessories', 'License Plate Covers & Frames', 'Frames', 'na'],
    # Add more generated details as needed
]

target_details = [
    ['Chevrolet', 'Automotive', 'Exterior Accessories', 'License Plate Covers & Frames', 'Frames', 'na'],
    # Add more target details as needed
]

# Initialize dictionaries to store details
generated_dict = {i: [] for i in range(6)}
target_dict = {i: [] for i in range(6)}

# Populate dictionaries with data
for gen, tar in zip(generated_details, target_details):
    for i in range(6):
        generated_dict[i].append(gen[i])
        target_dict[i].append(tar[i])

# Clean repeated patterns in L4_category
generated_dict[5] = [clean_repeated_patterns(text) for text in generated_dict[5]]

# Define categories and metrics
categories = ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']
metrics = ['accuracy', 'precision', 'recall', 'f1']

# Initialize results dictionary
results = {category: {metric: 0 for metric in metrics} for category in categories}

# Calculate metrics for each category
for i, category in enumerate(categories):
    print('Current Category: ', category)
    y_true = target_dict[i]
    y_pred = generated_dict[i]

    results[category]['accuracy'] = accuracy_score(y_true, y_pred)
    results[category]['precision'] = precision_score(y_true, y_pred, average='macro', zero_division=0)
    results[category]['recall'] = recall_score(y_true, y_pred, average='macro', zero_division=0)
    results[category]['f1'] = f1_score(y_true, y_pred, average='macro', zero_division=0)

# Print results
print()
for category, metrics in results.items():
    print(f"{category}:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")
    print()


Current Category:  details_Brand
Current Category:  L0_category
Current Category:  L1_category
Current Category:  L2_category
Current Category:  L3_category
Current Category:  L4_category

details_Brand:
  accuracy: 1.0000
  precision: 1.0000
  recall: 1.0000
  f1: 1.0000

L0_category:
  accuracy: 1.0000
  precision: 1.0000
  recall: 1.0000
  f1: 1.0000

L1_category:
  accuracy: 1.0000
  precision: 1.0000
  recall: 1.0000
  f1: 1.0000

L2_category:
  accuracy: 1.0000
  precision: 1.0000
  recall: 1.0000
  f1: 1.0000

L3_category:
  accuracy: 1.0000
  precision: 1.0000
  recall: 1.0000
  f1: 1.0000

L4_category:
  accuracy: 1.0000
  precision: 1.0000
  recall: 1.0000
  f1: 1.0000



## Computing Item-Level Accuracy

In this cell, we define a function to compute item-level accuracy, which measures how often all predicted categories match the target categories for each item:

### Function: `compute_item_accuracy`

- **Inputs**:
  - `generated_details`: List of predicted details for each item.
  - `target_details`: List of true details for each item.

- **Process**:
  - **Count Correct Items**: Iterates through pairs of generated and target details. If all elements in a generated detail match the corresponding elements in the target detail, it counts as a correct item.
  - **Compute Accuracy**: Divides the count of correct items by the total number of items to get the accuracy. Returns `0` if there are no items.

### Execution

- **`item_accuracy`**: Calls `compute_item_accuracy` with the `generated_details` and `target_details` to calculate the accuracy.
- **Print Accuracy**: Prints the item-level accuracy with four decimal places.

Item-level accuracy provides a metric of how well the model performs in predicting all categories correctly for each product.


In [23]:
import numpy as np

# Function to compute item-level accuracy
def compute_item_accuracy(generated_details, target_details):
    correct_items = 0
    total_items = len(generated_details)

    for gen, tar in zip(generated_details, target_details):
        if gen == tar:
            correct_items += 1

    return correct_items / total_items if total_items > 0 else 0

# Sample data (replace with your actual data)
generated_details = [
    ['Chevrolet', 'Automotive', 'Exterior Accessories', 'License Plate Covers & Frames', 'Frames', 'na'],
    # Add more generated details as needed
]

target_details = [
    ['Chevrolet', 'Automotive', 'Exterior Accessories', 'License Plate Covers & Frames', 'Frames', 'na'],
    # Add more target details as needed
]

# Compute item-level accuracy
item_accuracy = compute_item_accuracy(generated_details, target_details)

# Print item-level accuracy
print(f"Item-Level Accuracy: {item_accuracy:.4f}")


Item-Level Accuracy: 1.0000


## Saving Predictions to a File

In this cell, we save the generated predictions to a file in JSONL format:

- **`categories`**: List of categories for which predictions are made: `details_Brand`, `L0_category`, `L1_category`, `L2_category`, `L3_category`, and `L4_category`.

- **`attrebute_test_baseline_200dp.predict`**: The output file where the predictions will be saved.

### Process

1. **Open File**: Opens the file `attrebute_test_baseline_200dp.predict` for writing.

2. **Write Predictions**:
   - **Iterate**: Loops through `generated_details` along with `indoml_id`, which acts as the identifier for each item.
   - **Create Result**: Constructs a dictionary with `indoml_id` and the predicted values for each category.
   - **Write to File**: Serializes the dictionary to JSON format and writes it to the file, one entry per line.

This file can be used for evaluation or submission purposes, containing the model's predictions in the required format.


In [24]:
import json

# Sample data (replace these with your actual data)
generated_details = [
    ['Chevrolet', 'Automotive', 'Exterior Accessories', 'License Plate Covers & Frames', 'Frames', 'na'],
    # Add more generated details as needed
]

indoml_ids = [275, 276]


## Creating a Zip Archive for Predictions

In this cell, we create a zip archive of the predictions file:

- **`file_to_zip`**: The name of the file containing the predictions (`attrebute_test_baseline_200dp.predict`).

- **`zip_file_name`**: The name of the zip archive to be created (`any_name.zip`).

### Process

1. **Create Zip Archive**: Opens a new zip file (`any_name.zip`) for writing.

2. **Add File to Zip**:
   - **Add File**: Adds the predictions file (`attrebute_test_baseline_200dp.predict`) to the zip archive. The `arcname` parameter ensures that the file is stored in the zip archive with the same name as it has on the file system.

The resulting zip file can be used for submission or sharing, compressing the predictions file into a standard format.


In [None]:
import json
import zipfile

# Sample data (replace with your actual data)
generated_details = [
    ['Chevrolet', 'Automotive', 'Exterior Accessories', 'License Plate Covers & Frames', 'Frames', 'na'],
    # Add more generated details as needed
]

indoml_ids = [275, 276]  # Sample IDs (replace with your actual IDs)

# Write predictions to file
predict_file_name = 'attrebute_test_baseline_200dp.predict'
with open(predict_file_name, 'w') as f:
    for indoml_id, details in zip(indoml_ids, generated_details):
        result = {
            'indoml_id': indoml_id,
            'details_Brand': details[0],
            'L0_category': details[1],
            'L1_category': details[2],
            'L2_category': details[3],
            'L3_category': details[4],
            'L4_category': details[5]
        }
        f.write(json.dumps(result) + '\n')

print("Predictions file created successfully.")

# Define the filenames
file_to_zip = predict_file_name
zip_file_name = 'MachineLearningP.zip'

# Create a zip archive and add the predictions file
with zipfile.ZipFile(zip_file_name, 'w') as zipf:
    zipf.write(file_to_zip, arcname=file_to_zip)

print(f"Created zip archive: {zip_file_name} containing {file_to_zip}")



Predictions file created successfully.
Created zip archive: MachineLearningP.zip containing attrebute_test_baseline_200dp.predict
