## Importing Libraries and Modules

In this cell, we import the necessary libraries and modules required for the task:

- **pandas**: For data manipulation and analysis.
- **transformers**: Includes the `T5Tokenizer` and `T5ForConditionalGeneration` classes for tokenizing text and generating predictions using the T5 model.
- **datasets**: Provides the `Dataset` and `DatasetDict` classes for handling datasets.
- **numpy**: For numerical operations.

These libraries and modules will be used for data processing, model training, and evaluation.


In [3]:
%pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, TrainerCallback
from datasets import Dataset, DatasetDict
import numpy as np

## Reading Data from JSONL Files

In this cell, we define a function `read_jsonl` to read data from JSON Lines (JSONL) files into pandas DataFrames. We then use this function to read the following datasets:

- **Training Data**: `attrebute_train.data` and `attrebute_train.solution`, with the first 1000 rows.
- **Testing Data**: `attrebute_test.data` and `attrebute_test.solution`, with the first 200 rows.
- **Validation Data**: `attrebute_val.data` and `attrebute_val.solution`, with the first 200 rows.

The commented-out lines are for reading the entire datasets if needed. This setup allows us to work with a subset of the data for initial experimentation and testing.


In [5]:
def read_jsonl(file_path):
    return pd.read_json(file_path, lines=True)


train_data = read_jsonl('/kaggle/input/indoml-2024/attribute_train.data')
train_solution = read_jsonl('/kaggle/input/indoml-2024/attribute_train.solution')
test_df = read_jsonl('/kaggle/input/indoml-2024/attribute_test.data')


In [6]:
train_solution.head()

Unnamed: 0,indoml_id,details_Brand,L0_category,L1_category,L2_category,L3_category,L4_category
0,0,Enclume,Home & Kitchen,Kitchen & Dining,Storage & Organization,Racks & Holders,Pot Racks
1,1,Schutt,Sports & Outdoors,Sports,Team Sports,Football,Protective Gear
2,2,Easton,Sports & Outdoors,Sports,Team Sports,Baseball,Baseball Bats
3,3,Bilstein,Automotive,Replacement Parts,"Shocks, Struts & Suspension",Shocks,na
4,4,Clear Path Paper,"Arts, Crafts & Sewing",Crafting,Paper & Paper Crafts,Paper,Card Stock


In [7]:
# Extract possible labels for each category
categories = ['details_Brand','L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']
label_sets = {category: train_solution[category].unique().tolist() for category in categories}


# Function to clean the labels
def clean_labels(labels):
    # Create a dictionary to map lowercase labels to their original counterparts
    lower_label_map = {}
    for label in labels:
        lower_label = label.lower()
        if lower_label not in lower_label_map:
            lower_label_map[lower_label] = label
        else:
            # If a lowercase version already exists, keep the one with all lowercase
            if lower_label_map[lower_label][0].islower():
                continue
            else:
                lower_label_map[lower_label] = label
    
    # Return the cleaned set of labels
    return list(lower_label_map.values())

# Clean the label sets for each category
cleaned_label_sets = {category: clean_labels(labels) for category, labels in label_sets.items()}

In [8]:
len(cleaned_label_sets['details_Brand'])

5003

In [9]:
len(label_sets['L1_category'])

163

In [10]:
n = 5000
m = 250000
val_data = train_data[:n]
val_solution = train_solution[:n]

train_data = train_data[n:n+m]
train_solution = train_solution[n:n+m]

In [11]:
# Step 1: Create the test_solution DataFrame with the same columns as train_solution
test_solution = pd.DataFrame(columns=train_solution.columns)

# Step 2: Copy the 'indoml_id' column from test_data to test_solution
test_solution['indoml_id'] = test_df['indoml_id']

# Step 3: Fill all other columns with 'test'
for column in test_solution.columns:
    if column != 'indoml_id':
        test_solution[column] = 'test'

# Display the first few rows to verify
test_solution.head()

Unnamed: 0,indoml_id,details_Brand,L0_category,L1_category,L2_category,L3_category,L4_category
0,0,test,test,test,test,test,test
1,1,test,test,test,test,test,test
2,2,test,test,test,test,test,test
3,3,test,test,test,test,test,test
4,4,test,test,test,test,test,test


## Data Preprocessing and Formatting

In this cell, we define a function `preprocess_data` to prepare the data for model training. This function merges the product description data with the corresponding attribute labels, then formats the data into `input_text` and `target_text` pairs:

- **`input_text`**: Constructed by combining the product title, store, and manufacturer details.
- **`target_text`**: Constructed by specifying the attribute-value pairs for brand and categories.

### Data Processing

We apply the `preprocess_data` function to the training, testing, and validation datasets to generate the `input_text` and `target_text`.

Finally, the processed data is converted into the Hugging Face Dataset format using `Dataset.from_pandas` for further model training and evaluation.


In [12]:
def preprocess_data(data, solution):
    merged = pd.merge(data, solution, on='indoml_id')

    merged['input_text'] = merged.apply(lambda row: f"title: {row['title']} store: {row['store']} details_Manufacturer: {row['details_Manufacturer']}", axis=1)
    merged['target_text'] = merged.apply(lambda row: f"details_Brand: {row['details_Brand']} L0_category: {row['L0_category']} L1_category: {row['L1_category']} L2_category: {row['L2_category']} L3_category: {row['L3_category']} L4_category: {row['L4_category']}", axis=1)

    return merged[['input_text', 'target_text']]


train_processed = preprocess_data(train_data, train_solution)
test_processed = preprocess_data(test_df, test_solution)
val_processed = preprocess_data(val_data, val_solution)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_processed)
test_dataset = Dataset.from_pandas(test_processed)
val_dataset = Dataset.from_pandas(val_processed)


In [13]:
train_dataset[:5]

{'input_text': ['title: Colonial Candle Holiday Sparkle 8 oz Scented Oval Jar Candle store: Colonial Candle details_Manufacturer: Colonial Candle',
  'title: Wagner ThermoQuiet MX699 Semi-Metallic Disc Brake Pad Set store: Wagner details_Manufacturer: Wagner',
  'title: SS TON English Willow Cricket Bat - 2017 Edition store: SS details_Manufacturer: SS',
  'title: Callahan CDS04073 REAR 320mm Drilled & Slotted 5 Lug [2] Rotors [ fit BMW 525i 528i 530i E60 ] store: Callahan BRAKE PARTS details_Manufacturer: Callahan Brake Parts',
  'title: FCS II Accelerator PC Carbon Tri Fins store: FCS details_Manufacturer: FCS'],
 'target_text': ['details_Brand: Colonial Candle L0_category: Home & Kitchen L1_category: Home Dcor Products L2_category: Candles & Holders L3_category: Candles L4_category: Jar Candles',
  'details_Brand: Wagner L0_category: Automotive L1_category: Replacement Parts L2_category: Brake System L3_category: Brake Pads L4_category: na',
  'details_Brand: SS L0_category: Sports 

## Creating Dataset Dictionary

In this cell, we create a `DatasetDict` to organize the processed datasets for training, testing, and validation. The `DatasetDict` is a convenient way to manage multiple datasets in Hugging Face's `datasets` library.

- **`train`**: Contains the training dataset (`train_dataset`).
- **`test`**: Contains the test dataset (`test_dataset`).
- **`validation`**: Contains the validation dataset (`val_dataset`).

The `DatasetDict` will be used for training and evaluating the model, allowing for easy access to different subsets of data.


In [14]:
dataset_dict = DatasetDict({
    'train': train_dataset,
    'test': test_dataset,
    'validation': val_dataset
})

## Loading the T5 Model and Tokenizer

In this cell, we load the T5 model and tokenizer from the Hugging Face `transformers` library:

- **`T5Tokenizer`**: Tokenizer for converting text into tokens and vice versa, using the `t5-small` pre-trained model.
- **`T5ForConditionalGeneration`**: T5 model for sequence-to-sequence tasks, also using the `t5-small` pre-trained model.

These components will be used for encoding the input text, generating predictions, and decoding the output text.


In [15]:
# tokenizer = T5Tokenizer.from_pretrained('t5-base')
# model = T5ForConditionalGeneration.from_pretrained('t5-base')

In [16]:
import transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration, BitsAndBytesConfig
import torch
from torch import cuda, bfloat16

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-large')

# Load the model with 4-bit quantization
model = T5ForConditionalGeneration.from_pretrained(
    't5-large',
    device_map="auto",
    quantization_config=bnb_config
)

# Optional: Move the model to the appropriate device
# model.to('cuda' if torch.cuda.is_available() else 'cpu')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


## Tokenizing the Dataset

In this cell, we define the `preprocess_function` to tokenize the `input_text` and `target_text` using the T5 tokenizer:

- **`inputs`**: Tokenized input texts with a maximum length of 352 tokens, padded and truncated as necessary.
- **`targets`**: Tokenized target texts with a maximum length of 128 tokens, padded and truncated as necessary.
- **`model_inputs`**: Contains the tokenized inputs and labels (target texts) for model training.

The `preprocess_function` is applied to the entire dataset using the `map` method with `batched=True`, ensuring efficient processing of the data in batches.

The result, `tokenized_datasets`, is a `DatasetDict` containing the tokenized versions of the train, test, and validation datasets, ready for model training.


In [17]:
def preprocess_function(examples):
    inputs = examples['input_text']
    targets = examples['target_text']
    model_inputs = tokenizer(inputs, max_length=128, padding='max_length', truncation=True)
    labels = tokenizer(targets, max_length=128, padding='max_length', truncation=True)

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized_datasets = dataset_dict.map(preprocess_function, batched=True)

Map:   0%|          | 0/250000 [00:00<?, ? examples/s]

Map:   0%|          | 0/95036 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [18]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_text', 'target_text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 250000
    })
    test: Dataset({
        features: ['input_text', 'target_text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 95036
    })
    validation: Dataset({
        features: ['input_text', 'target_text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 5000
    })
})

In [19]:
#tokenized_datasets.save_to_disk('./')

In [20]:
# from datasets import load_from_disk

# tokenized_datasets = load_from_disk('./')

## Configuring Training Arguments

In this cell, we set up the `TrainingArguments` for training the T5 model using the Hugging Face `Trainer`:

- **`output_dir`**: Directory to save the model checkpoints and results.
- **`evaluation_strategy`**: Strategy for evaluation, set to `'epoch'`, meaning evaluation will occur at the end of each epoch.
- **`learning_rate`**: Learning rate for optimization, set to `2e-5`.
- **`per_device_train_batch_size`**: Batch size for training, set to `16`.
- **`per_device_eval_batch_size`**: Batch size for evaluation, set to `16`.
- **`num_train_epochs`**: Number of training epochs, set to `2`.
- **`weight_decay`**: Weight decay for regularization, set to `0.01`.
- **`save_total_limit`**: Limit on the number of checkpoints to keep, set to `3`.
- **`logging_dir`**: Directory for logging information.
- **`logging_steps`**: Frequency of logging, set to every 20 steps.
- **`report_to`**: Reporting options, set to `'none'` to disable reporting.

These arguments control various aspects of the training process and ensure efficient training and logging.


In [21]:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=3,
    logging_dir='./logs',
    logging_steps=20,
    report_to='none'
)



## Defining a Custom Callback for Logging

In this cell, we define a custom callback class `CustomCallback` that extends `TrainerCallback` from the Hugging Face `transformers` library:

- **`on_log` Method**: This method is triggered during the training process whenever logging occurs. It prints:
  - The current training step (`state.global_step`).
  - Each key-value pair in the `logs` dictionary.

This custom callback allows for detailed logging of training progress and metrics directly to the console, providing real-time feedback during the training process.


In [22]:
class CustomCallback(TrainerCallback):
    def __init__(self, eval_steps):
        self.eval_steps = eval_steps

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            print(f"Step: {state.global_step}")
            for key, value in logs.items():
                print(f"{key}: {value}")
            print("\n")

        # Run evaluation at specified intervals
        if state.global_step % self.eval_steps == 0 and state.global_step != 0:
            # Trigger evaluation
            print("Running evaluation...")
            eval_results = kwargs['model'].evaluate(eval_dataset=kwargs['eval_dataset'])
            print(f"Eval Results at step {state.global_step}: {eval_results}")
            print("\n")

## Training the Model

In this cell, we initialize and run the `Trainer` for training the T5 model:

- **`model`**: The T5 model to be trained.
- **`args`**: The `TrainingArguments` specified in the previous cell.
- **`train_dataset`**: The tokenized training dataset.
- **`eval_dataset`**: The tokenized validation dataset.
- **`callbacks`**: The list of callbacks to use during training, including the custom `CustomCallback` defined earlier.

After setting up the `Trainer`, we call `trainer.train()` to start the training process. The custom callback will print detailed logging information during training.


In [23]:
# %pip install --quiet peft

In [24]:
from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)

model.add_adapter(peft_config)

In [25]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    callbacks=[CustomCallback(eval_steps=1000)]
)

trainer.train()

Epoch,Training Loss,Validation Loss


Step: 20
loss: 3.0879
grad_norm: 0.12785713374614716
learning_rate: 0.00199744
epoch: 0.00128


Step: 40
loss: 0.4105
grad_norm: 0.06743501871824265
learning_rate: 0.00199488
epoch: 0.00256


Step: 60
loss: 0.3159
grad_norm: 0.0619133897125721
learning_rate: 0.00199232
epoch: 0.00384


Step: 80
loss: 0.2798
grad_norm: 0.05886916443705559
learning_rate: 0.00198976
epoch: 0.00512


Step: 100
loss: 0.2478
grad_norm: 0.0677536129951477
learning_rate: 0.0019872
epoch: 0.0064


Step: 120
loss: 0.2405
grad_norm: 0.08208909630775452
learning_rate: 0.00198464
epoch: 0.00768


Step: 140
loss: 0.2205
grad_norm: 0.059834618121385574
learning_rate: 0.00198208
epoch: 0.00896


Step: 160
loss: 0.2068
grad_norm: 0.0490608848631382
learning_rate: 0.00197952
epoch: 0.01024


Step: 180
loss: 0.2
grad_norm: 0.052846331149339676
learning_rate: 0.0019769600000000003
epoch: 0.01152


Step: 200
loss: 0.1906
grad_norm: 0.05385070666670799
learning_rate: 0.0019744
epoch: 0.0128


Step: 220
loss: 0.1901
grad_nor



Step: 520
loss: 0.2422
grad_norm: 0.09770577400922775
learning_rate: 0.00193344
epoch: 0.03328


Step: 540
loss: 0.265
grad_norm: 0.07725152373313904
learning_rate: 0.00193088
epoch: 0.03456


Step: 560
loss: 0.3034
grad_norm: 0.08278632164001465
learning_rate: 0.00192832
epoch: 0.03584


Step: 580
loss: 0.3594
grad_norm: 0.16105549037456512
learning_rate: 0.0019257599999999999
epoch: 0.03712


Step: 600
loss: 0.4146
grad_norm: 0.0993102490901947
learning_rate: 0.0019232000000000001
epoch: 0.0384


Step: 620
loss: 0.5143
grad_norm: 0.08013924211263657
learning_rate: 0.00192064
epoch: 0.03968


Step: 640
loss: 0.6773
grad_norm: 0.15425421297550201
learning_rate: 0.00191808
epoch: 0.04096


Step: 660
loss: 0.7792
grad_norm: 0.3474056124687195
learning_rate: 0.00191552
epoch: 0.04224


Step: 680
loss: 1.0712
grad_norm: 0.4491325318813324
learning_rate: 0.00191296
epoch: 0.04352


Step: 700
loss: 1.25
grad_norm: 0.46288254857063293
learning_rate: 0.0019104
epoch: 0.0448


Step: 720
loss: 1

AttributeError: 'T5ForConditionalGeneration' object has no attribute 'evaluate'

## Evaluating the Model

In this cell, we evaluate the trained model on both the validation and test datasets:

- **Validation Evaluation**: We use the `trainer.evaluate()` method to assess the model's performance on the validation dataset (`tokenized_datasets['validation']`). The validation loss is printed to provide an indication of how well the model generalizes to unseen validation data.

- **Test Evaluation**: Similarly, we evaluate the model on the test dataset (`tokenized_datasets['test']`). The test loss is printed to gauge the model's performance on the final test set.

The `eval_loss` metric provides insight into the model's performance, helping to assess its accuracy and effectiveness on the given datasets.


In [None]:
# val_results = trainer.evaluate(eval_dataset=tokenized_datasets['validation'])
# print(f"Validation Loss: {val_results['eval_loss']}")

# test_results = trainer.evaluate(eval_dataset=tokenized_datasets['test'])
# print(f"Test Loss: {test_results['eval_loss']}")

## Saving the Fine-Tuned Model

In this cell, we save the fine-tuned T5 model and tokenizer to a specified directory:

- **`model.save_pretrained('./fine_tuned_t5_1000dp')`**: Saves the trained T5 model to the directory `./fine_tuned_t5`. This allows you to load the model later without retraining.

- **`tokenizer.save_pretrained('./fine_tuned_t5_1000dp')`**: Saves the tokenizer associated with the T5 model to the same directory. This ensures that you can use the same tokenizer for encoding and decoding text during inference.

Saving both the model and tokenizer ensures that you can resume work or deploy the model in the future with consistent results.


In [None]:
model.save_pretrained('./fine_tuned_t5_2')
tokenizer.save_pretrained('./fine_tuned_t5_2')

## Loading the Fine-Tuned Model and Tokenizer

In this cell, we load the fine-tuned T5 model and tokenizer from the specified directory and set up the environment for evaluation:

- **`device`**: Determines whether to use a GPU (`cuda`) or CPU for computation based on availability.

- **`model`**: Loads the fine-tuned T5 model and moves it to the appropriate device (`cuda` or `cpu`).

- **`tokenizer`**: Loads the tokenizer associated with the fine-tuned T5 model.

The model is set to evaluation mode with `model.eval()`, preparing it for generating predictions.

### Functions

- **`generate_text(inputs)`**: Takes a batch of input texts, tokenizes them, and generates predictions using the fine-tuned model. It returns the generated texts after decoding them from token IDs.

- **`extract_details(text)`**: Extracts attribute details from the generated or target text using regular expressions. It returns the details for brand and categories, defaulting to `'na'` if not found.

- **`clean_repeated_patterns(text)`**: Cleans the generated text by removing redundant patterns, specifically handling the `L4_category`.

These functions will be used for generating predictions and extracting and cleaning the details from the results.


In [None]:
import re
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = T5ForConditionalGeneration.from_pretrained('./fine_tuned_t5_2').to(device)
tokenizer = T5Tokenizer.from_pretrained('./fine_tuned_t5_2')

model.eval()

test_data = val_dataset['input_text'][:5000]
test_labels = val_dataset['target_text'][:5000]

def generate_text(inputs):
    inputs = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True, truncation=True, max_length=352)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=128)

    generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return generated_texts

def extract_details(text):
    pattern = r'details_Brand: (.*?) L0_category: (.*?) L1_category: (.*?) L2_category: (.*?) L3_category: (.*?) L4_category: (.*)'
    match = re.match(pattern, text)
    if match:
        return tuple(item if item is not None else 'na' for item in match.groups())
    return 'na', 'na', 'na', 'na', 'na', 'na'

def clean_repeated_patterns(text):
    cleaned_data = text.split(' L4_category')[0]
    return cleaned_data


## Generating Predictions and Extracting Details

In this cell, we process the test data in batches to generate predictions and extract attribute details:

- **`batch_size`**: The number of samples processed in each batch, set to `128`.

- **`generated_details`**: List to store extracted details from generated texts.
- **`target_details`**: List to store extracted details from target texts.

### Processing Loop

We iterate over the test data in batches:
1. **Batch Extraction**: For each batch of inputs, we generate predictions using the `generate_text` function.
2. **Details Extraction**: For each generated text and corresponding label, we extract and append details using the `extract_details` function.

**Note**: The `batch_labels` are included here for completeness, but they are not used in this code snippet for generating predictions.

Finally, a message is printed to indicate that the extraction of generated information is complete.


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TF-IDF vectorizer to compute cosine similarity
vectorizer = TfidfVectorizer()

# Function to find the closest label using cosine similarity
def find_closest_label(generated_label, possible_labels):
    possible_labels_vectorized = vectorizer.fit_transform(possible_labels)
    generated_label_vectorized = vectorizer.transform([generated_label])

    cosine_similarities = cosine_similarity(generated_label_vectorized, possible_labels_vectorized)
    closest_label_index = cosine_similarities.argmax()

    return possible_labels[closest_label_index]

In [None]:
batch_size = 128
generated_details = []
target_details = []

test_data = val_dataset['input_text'][:1000]
test_labels = val_dataset['target_text'][:1000]
j=0

for i in tqdm(range(0, len(test_data), batch_size), desc="Processing test data"):
    batch_inputs = test_data[i:i+batch_size]
    batch_labels = test_labels[i:i+batch_size]  # Assuming `val_solution` contains the correct labels for the validation set

    # Generate text using your model
    generated_texts = generate_text(batch_inputs)

    for generated_text, label in zip(generated_texts, batch_labels):
        # Extract the details as a tuple from the generated text
        details = extract_details(generated_text)

        # Correcting the details if the generated label is not valid
        corrected_details = []
        j+=1
        print(f'Instance: {j}')
        for i, category in enumerate(categories):
            print(f'Category: {category}')
            generated_label = details[i]  # Extract the label corresponding to the category
            print(f'generated label: {generated_label}',end='; ')
            if generated_label not in label_sets[category]:
                closest_label = find_closest_label(generated_label, label_sets[category])
                corrected_details.append(closest_label)
                print(f'chosen label: {closest_label}')
            else:
                corrected_details.append(generated_label)
                print(f'chosen label: {generated_label}')

        # Append the corrected details as a tuple
        generated_details.append(tuple(corrected_details))

        # Extract the details from the actual target label and append
        target_details.append(extract_details(label))


print('Generated info extracted and corrected...')
print(len(generated_details))

In [None]:
# batch_size = 128
# generated_details = []
# target_details = []

# for i in tqdm(range(0, len(test_data), batch_size), desc="Processing test data"):
#     batch_inputs = test_data[i:i+batch_size]
#     batch_labels = test_label[i:i+batch_size] #you are not going to have this

#     generated_texts = generate_text(batch_inputs)

#     for generated_text, label in zip(generated_texts, batch_labels):
#         generated_details.append(extract_details(generated_text))
#         target_details.append(extract_details(label))

# print('Generated info extracted.............')

## Evaluating Model Performance by Category

In this cell, we evaluate the model's performance by splitting the generated and target details into categories and calculating various metrics:

### Data Preparation

- **`generated_dict`** and **`target_dict`**: Dictionaries to store generated and target details for each category (0 through 5). The `generated_details` and `target_details` lists are split into these dictionaries based on category indices.

- **Cleaning Repeated Patterns**: The `L4_category` entries in `generated_dict` are cleaned using the `clean_repeated_patterns` function to remove redundant patterns.

### Metrics Calculation

- **`categories`**: List of categories for which metrics will be computed: `details_Brand`, `L0_category`, `L1_category`, `L2_category`, `L3_category`, and `L4_category`.

- **`metrics`**: List of metrics to be calculated: `accuracy`, `precision`, `recall`, and `f1`.

For each category:
1. **Compute Metrics**: Accuracy, precision, recall, and F1 score are calculated using `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics`. Metrics are computed with macro averaging to handle multi-class classification.

2. **Print Results**: The results for each category are printed, showing the calculated metrics with four decimal places.

The printed results provide insight into the performance of the model across different categories and metrics.


In [None]:
for x in cleaned_label_sets['details_Brand']:
    if x == 'Laser & Inkjet Printer Labels':
        print(x)

In [None]:
cleaned_label_sets['L4_category']

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Initialize dictionaries to store the generated and target values for each category
generated_dict = {i: [] for i in range(6)}
target_dict = {i: [] for i in range(6)}

# Populate the dictionaries with the corresponding values
for gen, tar in zip(generated_details, target_details):
    for i in range(6):
        generated_dict[i].append(gen[i])
        target_dict[i].append(tar[i])

print('Splitted into categories.............\n')

# Clean repeated patterns in L4_category
generated_dict[5] = [clean_repeated_patterns(text) for text in generated_dict[5]]

categories = ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']
metrics = ['accuracy', 'precision', 'recall', 'f1']

results = {category: {metric: 0 for metric in metrics} for category in categories}

# Calculate metrics for each category and print mismatches
for i, category in enumerate(categories):
    print('Current Category: ', category)
    y_true = target_dict[i]
    y_pred = generated_dict[i]

    # Calculate metrics
    results[category]['accuracy'] = accuracy_score(y_true, y_pred)
    results[category]['precision'] = precision_score(y_true, y_pred, average='macro', zero_division=0)
    results[category]['recall'] = recall_score(y_true, y_pred, average='macro', zero_division=0)
    results[category]['f1'] = f1_score(y_true, y_pred, average='macro', zero_division=0)

    # Print instances where the generated label is not the same as the target label
    print(f"Mismatches in {category}:")
    for idx, (true_label, pred_label) in enumerate(zip(y_true, y_pred)):
        if true_label != pred_label:
            print(f"  Instance {idx}: Target = {true_label}, Generated = {pred_label}")
    print()

# Print overall results
print("Overall Metrics:")
for category, metrics in results.items():
    print(f"{category}:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")
    print()

In [None]:
generated_dict = {i: [] for i in range(6)}
target_dict = {i: [] for i in range(6)}

for gen, tar in zip(generated_details, target_details):
    for i in range(6):
        generated_dict[i].append(gen[i])
        target_dict[i].append(tar[i])

print('Splitted into category.............\n')

# Clean repeated patterns in L4_category
generated_dict[5] = [clean_repeated_patterns(text) for text in generated_dict[5]]

categories = ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']
metrics = ['accuracy', 'precision', 'recall', 'f1']

results = {category: {metric: 0 for metric in metrics} for category in categories}

for i, category in enumerate(categories):
    print('Current Category: ', category)
    y_true = target_dict[i]
    y_pred = generated_dict[i]

    results[category]['accuracy'] = accuracy_score(y_true, y_pred)
    results[category]['precision'] = precision_score(y_true, y_pred, average='macro', zero_division=0)
    results[category]['recall'] = recall_score(y_true, y_pred, average='macro', zero_division=0)
    results[category]['f1'] = f1_score(y_true, y_pred, average='macro', zero_division=0)

print()

for category, metrics in results.items():
    print(f"{category}:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")
    print()

## Computing Item-Level Accuracy

In this cell, we define a function to compute item-level accuracy, which measures how often all predicted categories match the target categories for each item:

### Function: `compute_item_accuracy`

- **Inputs**:
  - `generated_details`: List of predicted details for each item.
  - `target_details`: List of true details for each item.

- **Process**:
  - **Count Correct Items**: Iterates through pairs of generated and target details. If all elements in a generated detail match the corresponding elements in the target detail, it counts as a correct item.
  - **Compute Accuracy**: Divides the count of correct items by the total number of items to get the accuracy. Returns `0` if there are no items.

### Execution

- **`item_accuracy`**: Calls `compute_item_accuracy` with the `generated_details` and `target_details` to calculate the accuracy.
- **Print Accuracy**: Prints the item-level accuracy with four decimal places.

Item-level accuracy provides a metric of how well the model performs in predicting all categories correctly for each product.


In [None]:
def compute_item_accuracy(generated_details, target_details):
    correct_items = 0
    total_items = len(generated_details)

    for gen, tar in zip(generated_details, target_details):
        if all(g == t for g, t in zip(gen, tar)):
            correct_items += 1

    return correct_items / total_items if total_items > 0 else 0

item_accuracy = compute_item_accuracy(generated_details, target_details)
print(f"Item-level accuracy: {item_accuracy:.4f}")


## Saving Predictions to a File

In this cell, we save the generated predictions to a file in JSONL format:

- **`categories`**: List of categories for which predictions are made: `details_Brand`, `L0_category`, `L1_category`, `L2_category`, `L3_category`, and `L4_category`.

- **`attrebute_test_baseline_200dp.predict`**: The output file where the predictions will be saved.

### Process

1. **Open File**: Opens the file `attrebute_test_baseline_200dp.predict` for writing.

2. **Write Predictions**:
   - **Iterate**: Loops through `generated_details` along with `indoml_id`, which acts as the identifier for each item.
   - **Create Result**: Constructs a dictionary with `indoml_id` and the predicted values for each category.
   - **Write to File**: Serializes the dictionary to JSON format and writes it to the file, one entry per line.

This file can be used for evaluation or submission purposes, containing the model's predictions in the required format.


In [None]:
batch_size = 128
generated_details_test = []
# target_details = []
test_data = test_dataset['input_text']

for i in tqdm(range(0, len(test_data), batch_size), desc="Processing test data"):
    batch_inputs = test_data[i:i+batch_size]
#     batch_labels = test_labels[i:i+batch_size]  # Assuming `val_solution` contains the correct labels for the validation set

    # Generate text using your model
    generated_texts = generate_text(batch_inputs)

    for generated_text in generated_texts:
        # Extract the details as a tuple from the generated text
        details = extract_details(generated_text)

        # Correcting the details if the generated label is not valid
        corrected_details = []
        for i, category in enumerate(categories):
            generated_label = details[i]  # Extract the label corresponding to the category
            if generated_label not in label_sets[category]:
                closest_label = find_closest_label(generated_label, label_sets[category])
                corrected_details.append(closest_label)
            else:
                corrected_details.append(generated_label)

        # Append the corrected details as a tuple
        generated_details_test.append(tuple(corrected_details))


print('Generated info extracted and corrected...')
print(len(generated_details_test))

In [None]:
import json
categories = ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']

with open('attribute_test_baseline_sub3.predict', 'w') as file:

    for indoml_id, details in enumerate(generated_details_test):
        result = {"indoml_id": indoml_id}
        for category, value in zip(categories, details):
            result[category] = value

        file.write(json.dumps(result) + '\n')

## Creating a Zip Archive for Predictions

In this cell, we create a zip archive of the predictions file:

- **`file_to_zip`**: The name of the file containing the predictions (`attrebute_test_baseline_200dp.predict`).

- **`zip_file_name`**: The name of the zip archive to be created (`any_name.zip`).

### Process

1. **Create Zip Archive**: Opens a new zip file (`any_name.zip`) for writing.

2. **Add File to Zip**:
   - **Add File**: Adds the predictions file (`attrebute_test_baseline_200dp.predict`) to the zip archive. The `arcname` parameter ensures that the file is stored in the zip archive with the same name as it has on the file system.

The resulting zip file can be used for submission or sharing, compressing the predictions file into a standard format.


In [None]:
import zipfile

file_to_zip = 'attribute_test_sub3.predict'
zip_file_name = 'attribute_test_sub3.zip'

with zipfile.ZipFile(zip_file_name, 'w') as zipf:
    zipf.write(file_to_zip, arcname=file_to_zip)

In [None]:
import shutil

# Specify the folder you want to zip and the name of the output zip file
folder_to_zip = '/kaggle/working/fine_tuned_t5_2'  # Replace with the path to your folder
output_zip_file = 'model.zip'  # Replace with the desired zip file name

# Create a zip archive
shutil.make_archive(output_zip_file.replace('.zip', ''), 'zip', folder_to_zip)

print(f"Folder '{folder_to_zip}' has been zipped as '{output_zip_file}'")