## Importing Libraries and Modules

In these cells, we install the `datasets` package and import the libraries used throughout the notebook:

- **pandas**: For data manipulation and analysis.
- **transformers**: Provides the `T5Tokenizer` and `T5ForConditionalGeneration` classes for tokenizing text and generating predictions with the T5 model, plus `Trainer`, `TrainingArguments`, and `TrainerCallback` for fine-tuning.
- **datasets**: Provides the `Dataset` and `DatasetDict` classes for handling datasets.
- **numpy**: For numerical operations.

These libraries are used for data processing, model training, and evaluation.


In [1]:
%pip install datasets

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [2]:
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, TrainerCallback
from datasets import Dataset, DatasetDict
import numpy as np

## Reading Data from JSONL Files

In this cell, we define a function `read_jsonl` that reads a JSON Lines (JSONL) file into a pandas DataFrame. The optional `nrows` argument limits how many lines are read; with the default `nrows=None`, the whole file is loaded. We use it to read the following files in full:

- **Training Data**: `attribute_train.data` (product descriptions) and `attribute_train.solution` (attribute labels).
- **Testing Data**: `attribute_test.data`. No solution file is read for the test set.

A validation split is not read from disk; it is taken from the training data in a later cell. To experiment on a subset first, pass `nrows` (for example `nrows=1000`) to `read_jsonl`.


In [10]:
def read_jsonl(file_path, nrows=None):
    return pd.read_json(file_path, lines=True, nrows=nrows)


train_data = read_jsonl('/content/attribute_train.data')
train_solution = read_jsonl('/content/attribute_train.solution')
test_df = read_jsonl('/content/attribute_test.data')
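For readers unfamiliar with the format, a JSONL file holds one JSON object per line. The sketch below mimics what `read_jsonl(path, nrows=2)` does conceptually, using only the standard library rather than pandas. The helper name `read_jsonl_records` and the sample records are invented for illustration and are not taken from the dataset.

```python
import io
import json
from itertools import islice


def read_jsonl_records(handle, nrows=None):
    """Parse at most `nrows` JSON objects from a JSONL stream (all if None)."""
    non_empty = (line for line in handle if line.strip())
    return [json.loads(line) for line in islice(non_empty, nrows)]


# Hypothetical sample, not real dataset rows.
sample = io.StringIO(
    '{"indoml_id": 0, "title": "Widget A", "store": "Acme"}\n'
    '{"indoml_id": 1, "title": "Widget B", "store": "Acme"}\n'
    '{"indoml_id": 2, "title": "Widget C", "store": "Zenith"}\n'
)

records = read_jsonl_records(sample, nrows=2)
print([record["indoml_id"] for record in records])
```

Reading only the first few rows this way is a cheap check of the column names before loading the full files.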


In [None]:
# Extract the known labels for each category.
# The label columns live in train_solution; train_data only holds the product description fields.
categories = ['L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']
label_sets = {category: train_solution[category].unique().tolist() for category in categories}

In [11]:
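n = 10000   # number of rows held out for validation
m = 100000  # number of rows used for training

# Validation split: the first n rows of the training files
val_data = train_data[:n]
val_solution = train_solution[:n]

# Training split: the next m rows, so it does not overlap with validation
train_data = train_data[n:n+m]
train_solution = train_solution[n:n+m]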

In [12]:
# Step 1: Create the test_solution DataFrame with the same columns as train_solution
test_solution = pd.DataFrame(columns=train_solution.columns)

# Step 2: Copy the 'indoml_id' column from test_df to test_solution
test_solution['indoml_id'] = test_df['indoml_id']

# Step 3: Fill all other columns with 'test'
for column in test_solution.columns:
    if column != 'indoml_id':
        test_solution[column] = 'test'

# Display the first few rows to verify
test_solution.head()

|   | indoml_id | details_Brand | L0_category | L1_category | L2_category | L3_category | L4_category |
|---|---|---|---|---|---|---|---|
| 0 | 0 | test | test | test | test | test | test |
| 1 | 1 | test | test | test | test | test | test |
| 2 | 2 | test | test | test | test | test | test |
| 3 | 3 | test | test | test | test | test | test |
| 4 | 4 | test | test | test | test | test | test |


## Data Preprocessing and Formatting

In this cell, we define a function `preprocess_data` to prepare the data for model training. This function merges the product description data with the corresponding attribute labels, then formats the data into `input_text` and `target_text` pairs:

- **`input_text`**: Constructed by combining the product title, store, and manufacturer details.
- **`target_text`**: Constructed by specifying the attribute-value pairs for brand and categories.

### Data Processing

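We apply the `preprocess_data` function to the training, testing, and validation datasets to generate the `input_text` and `target_text`. For the test set, every field of `target_text` holds the placeholder value `'test'`, because `test_solution` was filled with placeholders above.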

Finally, the processed data is converted into the Hugging Face Dataset format using `Dataset.from_pandas` for further model training and evaluation.


In [13]:
def preprocess_data(data, solution):
    merged = pd.merge(data, solution, on='indoml_id')

    merged['input_text'] = merged.apply(lambda row: f"title: {row['title']} store: {row['store']} details_Manufacturer: {row['details_Manufacturer']}", axis=1)
    merged['target_text'] = merged.apply(lambda row: f"details_Brand: {row['details_Brand']} L0_category: {row['L0_category']} L1_category: {row['L1_category']} L2_category: {row['L2_category']} L3_category: {row['L3_category']} L4_category: {row['L4_category']}", axis=1)

    return merged[['input_text', 'target_text']]


train_processed = preprocess_data(train_data, train_solution)
test_processed = preprocess_data(test_df, test_solution)
val_processed = preprocess_data(val_data, val_solution)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_processed)
test_dataset = Dataset.from_pandas(test_processed)
val_dataset = Dataset.from_pandas(val_processed)
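To make the two templates concrete, the sketch below applies them to a single row written as a plain dictionary. The field values are copied from the first training example shown in the output of the next cell. The helper names `format_input` and `format_target` are hypothetical, and building the target with a join over field names is one possible alternative to the long f-string, not the notebook's own code.

```python
TARGET_FIELDS = ['details_Brand', 'L0_category', 'L1_category',
                 'L2_category', 'L3_category', 'L4_category']


def format_input(row):
    """Build the model input from the product description fields."""
    return (f"title: {row['title']} store: {row['store']} "
            f"details_Manufacturer: {row['details_Manufacturer']}")


def format_target(row):
    """Build the target string as space-separated 'field: value' pairs."""
    return ' '.join(f"{field}: {row[field]}" for field in TARGET_FIELDS)


row = {
    'title': 'FEL-PRO HS 9188 PT-1 Head Gasket Set',
    'store': 'Fel-Pro',
    'details_Manufacturer': 'FEL-PRO',
    'details_Brand': 'Fel-Pro',
    'L0_category': 'Automotive',
    'L1_category': 'Replacement Parts',
    'L2_category': 'Gaskets',
    'L3_category': 'Head Gasket Sets',
    'L4_category': 'na',
}

input_text = format_input(row)
target_text = format_target(row)
print(input_text)
print(target_text)
```

Because the target is a flat string, the field names themselves act as delimiters when the predictions are parsed back into attributes later in the notebook.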


In [14]:
train_dataset[:5]

{'input_text': ['title: FEL-PRO HS 9188 PT-1 Head Gasket Set store: Fel-Pro details_Manufacturer: FEL-PRO',
  'title: MAGID 15NYL KnitMaster 10 1/2 Lightweight Machine Knit Nylon Gloves, Large, White (12 Pairs) store: MAGID details_Manufacturer: MAGID',
  'title: Eiko 55 G4-1/2 Miniature Bayonet Base Halogen Bulb, 7V/0.41 Amp store: Eiko details_Manufacturer: Eiko',
  'title: 300 Thread Count Egyptian Cotton Sheet Set, DEEP Pocket, 300TC, Full, Solid Black store: Egyptian Cotton Factory Outlet Store details_Manufacturer: Egyptian Cotton Factory Outlet Store',
  'title: Sunbeam Imperial Plush Heated Blanket King - Lagoon Blue store: Sunbeam details_Manufacturer: Jarden Consumer Products'],
 'target_text': ['details_Brand: Fel-Pro L0_category: Automotive L1_category: Replacement Parts L2_category: Gaskets L3_category: Head Gasket Sets L4_category: na',
  'details_Brand: MAGID L0_category: Tools & Home Improvement L1_category: Safety & Security L2_category: Personal Protective Equipment L

## Creating Dataset Dictionary

In this cell, we create a `DatasetDict` to organize the processed datasets for training, testing, and validation. The `DatasetDict` is a convenient way to manage multiple datasets in Hugging Face's `datasets` library.

- **`train`**: Contains the training dataset (`train_dataset`).
- **`test`**: Contains the test dataset (`test_dataset`).
- **`validation`**: Contains the validation dataset (`val_dataset`).

The `DatasetDict` will be used for training and evaluating the model, allowing for easy access to different subsets of data.


In [15]:
dataset_dict = DatasetDict({
    'train': train_dataset,
    'test': test_dataset,
    'validation': val_dataset
})

## Loading the T5 Model and Tokenizer

In this cell, we load the T5 model and tokenizer from the Hugging Face `transformers` library:

- **`T5Tokenizer`**: Tokenizer for converting text into tokens and vice versa, using the `t5-base` pre-trained model.
- **`T5ForConditionalGeneration`**: T5 model for sequence-to-sequence tasks, also using the `t5-base` pre-trained model.

These components will be used for encoding the input text, generating predictions, and decoding the output text.


In [16]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Tokenizing the Dataset

In this cell, we define the `preprocess_function` to tokenize the `input_text` and `target_text` using the T5 tokenizer:

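- **`inputs`**: Tokenized input texts with a maximum length of 128 tokens, padded to that length and truncated as necessary. Note that `generate_text` later truncates inputs at 352 tokens, so inference may see longer inputs than training did.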
- **`targets`**: Tokenized target texts with a maximum length of 128 tokens, padded and truncated as necessary.
- **`model_inputs`**: Contains the tokenized inputs and labels (target texts) for model training.

The `preprocess_function` is applied to the entire dataset using the `map` method with `batched=True`, ensuring efficient processing of the data in batches.

The result, `tokenized_datasets`, is a `DatasetDict` containing the tokenized versions of the train, test, and validation datasets, ready for model training.


In [17]:
def preprocess_function(examples):
    inputs = examples['input_text']
    targets = examples['target_text']
    model_inputs = tokenizer(inputs, max_length=128, padding='max_length', truncation=True)
    labels = tokenizer(targets, max_length=128, padding='max_length', truncation=True)

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized_datasets = dataset_dict.map(preprocess_function, batched=True)

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/95036 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]
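One detail worth checking: because the targets are padded to `max_length`, the `labels` built above contain the pad token ID at every padded position, and those positions then contribute to the training loss. A common convention with Hugging Face sequence-to-sequence models is to replace pad IDs in the labels with `-100`, which the cross-entropy loss ignores. The helper below is a minimal sketch of that replacement on plain Python lists. `mask_pad_labels` is a hypothetical name, the token IDs are made up, and the pad ID of `0` is assumed here; in practice it should be read from `tokenizer.pad_token_id`.

```python
def mask_pad_labels(batch_label_ids, pad_token_id, ignore_index=-100):
    """Replace pad token IDs in each label sequence with ignore_index."""
    return [
        [ignore_index if token == pad_token_id else token for token in sequence]
        for sequence in batch_label_ids
    ]


# Made-up token IDs; 0 stands in for the pad token.
example_labels = [
    [4077, 834, 1, 0, 0],
    [27, 1, 0, 0, 0],
]

masked_labels = mask_pad_labels(example_labels, pad_token_id=0)
print(masked_labels)
```

Inside `preprocess_function`, this could be applied as `model_inputs['labels'] = mask_pad_labels(labels['input_ids'], tokenizer.pad_token_id)`. Doing so changes the reported loss values, so they would no longer be comparable with the log shown in this notebook.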

In [None]:
tokenized_datasets

In [None]:
#tokenized_datasets.save_to_disk('./')

In [None]:
# from datasets import load_from_disk

# tokenized_datasets = load_from_disk('./')

## Configuring Training Arguments

In this cell, we set up the `TrainingArguments` for training the T5 model using the Hugging Face `Trainer`:

- **`output_dir`**: Directory to save the model checkpoints and results.
- **`evaluation_strategy`**: Strategy for evaluation, set to `'epoch'`, meaning evaluation will occur at the end of each epoch.
- **`learning_rate`**: Learning rate for optimization, set to `2e-3`.
- **`per_device_train_batch_size`**: Batch size for training, set to `16`.
- **`per_device_eval_batch_size`**: Batch size for evaluation, set to `16`.
- **`num_train_epochs`**: Number of training epochs, set to `1`.
- **`weight_decay`**: Weight decay for regularization, set to `0.01`.
- **`save_total_limit`**: Limit on the number of checkpoints to keep, set to `3`.
- **`logging_dir`**: Directory for logging information.
- **`logging_steps`**: Frequency of logging, set to every 20 steps.
- **`report_to`**: Reporting options, set to `'none'` to disable reporting.

These arguments control various aspects of the training process and ensure efficient training and logging.


In [18]:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=3,
    logging_dir='./logs',
    logging_steps=20,
    report_to='none'
)



## Defining a Custom Callback for Logging

In this cell, we define a custom callback class `CustomCallback` that extends `TrainerCallback` from the Hugging Face `transformers` library:

- **`on_log` Method**: This method is triggered during the training process whenever logging occurs. It prints:
  - The current training step (`state.global_step`).
  - Each key-value pair in the `logs` dictionary.

This custom callback allows for detailed logging of training progress and metrics directly to the console, providing real-time feedback during the training process.


In [19]:
class CustomCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            print(f"Step: {state.global_step}")
            for key, value in logs.items():
                print(f"{key}: {value}")
            print("\n")

## Training the Model

In this cell, we initialize and run the `Trainer` for training the T5 model:

- **`model`**: The T5 model to be trained.
- **`args`**: The `TrainingArguments` specified in the previous cell.
- **`train_dataset`**: The tokenized training dataset.
- **`eval_dataset`**: The tokenized validation dataset.
- **`callbacks`**: The list of callbacks to use during training, including the custom `CustomCallback` defined earlier.

After setting up the `Trainer`, we call `trainer.train()` to start the training process. The custom callback will print detailed logging information during training.


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    callbacks=[CustomCallback()]
)

trainer.train()



Step: 20
loss: 1.2328
grad_norm: 0.3010375201702118
learning_rate: 0.0019936
epoch: 0.0032


Step: 40
loss: 0.2749
grad_norm: 0.26716718077659607
learning_rate: 0.0019872
epoch: 0.0064


Step: 60
loss: 0.24
grad_norm: 0.2254064828157425
learning_rate: 0.0019808
epoch: 0.0096


Step: 80
loss: 0.1916
grad_norm: 0.23095083236694336
learning_rate: 0.0019744
epoch: 0.0128


Step: 100
loss: 0.1824
grad_norm: 0.17696872353553772
learning_rate: 0.001968
epoch: 0.016


Step: 120
loss: 0.1565
grad_norm: 0.17030836641788483
learning_rate: 0.0019616
epoch: 0.0192


Step: 140
loss: 0.1415
grad_norm: 0.16717420518398285
learning_rate: 0.0019552000000000003
epoch: 0.0224


Step: 160
loss: 0.1329
grad_norm: 0.1367701292037964
learning_rate: 0.0019488
epoch: 0.0256


Step: 180
loss: 0.1552
grad_norm: 0.17059433460235596
learning_rate: 0.0019424
epoch: 0.0288


Step: 200
loss: 0.1193
grad_norm: 0.19284522533416748
learning_rate: 0.001936
epoch: 0.032


Step: 220
loss: 0.1181
grad_norm: 0.152240961790084
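The `learning_rate` and `epoch` values in the log above can be reproduced by hand, which is a useful check that the run used the configuration in the training arguments cell. The sketch below assumes the `Trainer` defaults of a linear decay to zero with no warmup, a single device, and no gradient accumulation. `linear_lr` is a hypothetical helper written for this check, not a library function.

```python
import math


def linear_lr(initial_lr, step, total_steps):
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


train_examples = 100_000  # size of the training split
batch_size = 16           # per_device_train_batch_size
epochs = 1                # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

for step in (20, 100, 200):
    lr = linear_lr(2e-3, step, total_steps)
    epoch = step / steps_per_epoch
    print(f"step {step}: learning_rate={lr:.7f} epoch={epoch:.4f}")
```

Under these assumptions there are 6250 optimizer steps in total, and step 20 gives a learning rate of 0.0019936 at epoch 0.0032, which agrees with the first logged entry.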

## Evaluating the Model

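In this cell, we evaluate the trained model with `trainer.evaluate()`, which returns a dictionary of metrics that includes `eval_loss`:

- **Validation Evaluation**: Assesses the model on the validation dataset (`tokenized_datasets['validation']`) and prints the validation loss, which indicates how well the model generalizes to data it was not trained on.

- **Test Evaluation**: Assesses the model on the test dataset (`tokenized_datasets['test']`) and prints the test loss. The test targets in this notebook are the placeholder value `'test'`, so this loss is only meaningful when real test labels are available.

Both calls are commented out in the cell below. The output shown beneath it reports step 250 and epoch 2.0, which does not match the one-epoch training configuration above, so it appears to come from an earlier run on a smaller subset.

The `eval_loss` metric is the average token-level cross-entropy on the evaluated split. It shows how well the model fits the target strings but does not directly measure per-attribute accuracy, which is computed later in the notebook.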


In [None]:
# val_results = trainer.evaluate(eval_dataset=tokenized_datasets['validation'])
# print(f"Validation Loss: {val_results['eval_loss']}")

# test_results = trainer.evaluate(eval_dataset=tokenized_datasets['test'])
# print(f"Test Loss: {test_results['eval_loss']}")

Step: 250
eval_loss: 0.17515872418880463
eval_runtime: 0.8743
eval_samples_per_second: 228.76
eval_steps_per_second: 28.595
epoch: 2.0


Validation Loss: 0.17515872418880463
Step: 250
eval_loss: 0.16509920358657837
eval_runtime: 0.8694
eval_samples_per_second: 230.051
eval_steps_per_second: 28.756
epoch: 2.0


Test Loss: 0.16509920358657837


## Saving the Fine-Tuned Model

In this cell, we save the fine-tuned T5 model and tokenizer to a specified directory:

- **`model.save_pretrained('./fine_tuned_t5_1000dp')`**: Saves the trained T5 model to the directory `./fine_tuned_t5_1000dp`. This allows you to load the model later without retraining.

- **`tokenizer.save_pretrained('./fine_tuned_t5_1000dp')`**: Saves the tokenizer associated with the T5 model to the same directory. This ensures that you can use the same tokenizer for encoding and decoding text during inference.

Saving both the model and tokenizer ensures that you can resume work or deploy the model in the future with consistent results.


In [None]:
model.save_pretrained('./fine_tuned_t5_1000dp')
tokenizer.save_pretrained('./fine_tuned_t5_1000dp')

('./fine_tuned_t5_1000dp/tokenizer_config.json',
 './fine_tuned_t5_1000dp/special_tokens_map.json',
 './fine_tuned_t5_1000dp/spiece.model',
 './fine_tuned_t5_1000dp/added_tokens.json')

## Loading the Fine-Tuned Model and Tokenizer

In this cell, we set up the environment for inference. The lines that reload the fine-tuned model and tokenizer from `./fine_tuned_t5_1000dp` are commented out, so the cell reuses the `model` and `tokenizer` objects already in memory; uncomment those lines to run inference in a fresh session.

- **`device`**: Selects a GPU (`cuda`) when one is available and falls back to the CPU otherwise.

- **`model`**: The fine-tuned T5 model. When reloaded from disk, it is moved to the selected device.

- **`tokenizer`**: The tokenizer saved alongside the fine-tuned T5 model.

The model is set to evaluation mode with `model.eval()`, which disables dropout before generating predictions.

### Functions

- **`generate_text(inputs)`**: Takes a batch of input texts, tokenizes them, and generates predictions using the fine-tuned model. It returns the generated texts after decoding them from token IDs.

- **`extract_details(text)`**: Extracts the brand and the five category values from a generated or target text using a regular expression. If the text does not match the expected pattern, all six fields default to `'na'`.

- **`clean_repeated_patterns(text)`**: Truncates the text at the first occurrence of ` L4_category`, which drops any repeated field pattern the model may have appended after the `L4_category` value.

These functions will be used for generating predictions and extracting and cleaning the details from the results.


In [None]:
import re
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# model = T5ForConditionalGeneration.from_pretrained('./fine_tuned_t5_1000dp').to(device)
# tokenizer = T5Tokenizer.from_pretrained('./fine_tuned_t5_1000dp')

model.to(device)  # ensure the in-memory model is on the same device as the inputs
model.eval()

test_data = test_dataset['input_text']
test_labels = test_dataset['target_text']

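def generate_text(inputs):
    # Tokenize the batch; inputs longer than 352 tokens are truncated
    inputs = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=352)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Generate without tracking gradients; outputs are capped at 128 tokens
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=128)

    # Decode token IDs back to text, dropping pad and end-of-sequence tokens
    generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return generated_texts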

def extract_details(text):
    pattern = r'details_Brand: (.*?) L0_category: (.*?) L1_category: (.*?) L2_category: (.*?) L3_category: (.*?) L4_category: (.*)'
    match = re.match(pattern, text)
    if match:
        return tuple(item if item is not None else 'na' for item in match.groups())
    return 'na', 'na', 'na', 'na', 'na', 'na'

def clean_repeated_patterns(text):
    cleaned_data = text.split(' L4_category')[0]
    return cleaned_data


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
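The behavior of the two parsing helpers is easiest to see on concrete strings. The sketch below re-declares them with the same regular expression so that it runs on its own, and applies them to three cases: a well-formed target (copied from the first training example shown earlier), a generation with a repeated `L4_category` field, and a malformed string. The repeated and malformed inputs are invented for illustration.

```python
import re

PATTERN = re.compile(
    r'details_Brand: (.*?) L0_category: (.*?) L1_category: (.*?) '
    r'L2_category: (.*?) L3_category: (.*?) L4_category: (.*)'
)


def extract_details(text):
    """Return the six field values, or six 'na' values if the text does not match."""
    match = PATTERN.match(text)
    return match.groups() if match else ('na',) * 6


def clean_repeated_patterns(text):
    """Keep only the part before the first repeated ' L4_category' marker."""
    return text.split(' L4_category')[0]


well_formed = ('details_Brand: Fel-Pro L0_category: Automotive '
               'L1_category: Replacement Parts L2_category: Gaskets '
               'L3_category: Head Gasket Sets L4_category: na')
repeated = well_formed + ' L4_category: na'   # invented degenerate generation
malformed = 'Fel-Pro Automotive'              # invented, missing field names

parsed = extract_details(well_formed)
parsed_repeated = extract_details(repeated)
parsed_malformed = extract_details(malformed)

print(parsed)
print(parsed_repeated[5], '->', clean_repeated_patterns(parsed_repeated[5]))
print(parsed_malformed)
```

The last group of the pattern is greedy, so any repetition ends up in the `L4_category` value, which is why only that field needs the cleaning step.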


## Generating Predictions and Extracting Details

In these cells, we process the test data in batches to generate predictions, extract attribute details, and correct category values that are not known labels:

- **`find_closest_label`**: Maps a generated label that is not in the known label set to the most similar known label, using cosine similarity between TF-IDF vectors.

- **`batch_size`**: The number of samples processed in each batch, set to `128`.

- **`generated_details`**: List to store extracted details from generated texts.
- **`target_details`**: List to store extracted details from target texts.

### Processing Loop

We iterate over the test data in batches:
1. **Batch Generation**: For each batch of inputs, we generate predictions using the `generate_text` function.
2. **Details Extraction**: For each generated text and corresponding label, we extract the details using the `extract_details` function.
3. **Label Correction**: Each category value that does not appear in `label_sets` is replaced by its closest known label.

**Note**: The `batch_labels` come from `test_solution`, which holds placeholder values, so the `target_details` collected here are meaningful only when real test labels are available.

Finally, a message and the number of processed items are printed.


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TF-IDF vectorizer to compute cosine similarity
vectorizer = TfidfVectorizer()

# Function to find the closest label using cosine similarity
def find_closest_label(generated_label, possible_labels):
    possible_labels_vectorized = vectorizer.fit_transform(possible_labels)
    generated_label_vectorized = vectorizer.transform([generated_label])

    cosine_similarities = cosine_similarity(generated_label_vectorized, possible_labels_vectorized)
    closest_label_index = cosine_similarities.argmax()

    return possible_labels[closest_label_index]
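One edge case in `find_closest_label` is worth knowing about: when the generated label shares no term with any candidate, every cosine similarity is zero and `argmax` returns index 0, so the first label in the list is chosen regardless of its content. The sketch below illustrates the matching idea and that edge case using only the standard library. It swaps in plain word-count vectors as a simplified stand-in for TF-IDF, so its scores will differ from the scikit-learn version. The function names are hypothetical and the candidate labels are illustrative rather than the full label set.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a simplified stand-in for a TF-IDF vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def closest_label(generated, labels):
    """Return the most similar label, or None when nothing overlaps."""
    query = bag_of_words(generated)
    scored = [(cosine(query, bag_of_words(label)), label) for label in labels]
    best_score, best_label = max(scored, key=lambda pair: pair[0])
    return best_label if best_score > 0 else None


labels = ['Replacement Parts', 'Safety & Security', 'Lighting & Ceiling Fans']

print(closest_label('Replacement Part Sets', labels))
print(closest_label('Bedding', labels))
```

Returning `None` for the no-overlap case makes it explicit, so the caller can decide whether to keep the generated text or fall back to a default. Fitting the vectorizer once per label set, instead of on every call, would also reduce the cost of the original function.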

In [None]:
batch_size = 128
generated_details = []
target_details = []

# Field order of the tuples returned by extract_details
field_names = ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']

for i in tqdm(range(0, len(test_data), batch_size), desc="Processing test data"):
    batch_inputs = test_data[i:i+batch_size]
    batch_labels = test_labels[i:i+batch_size]  # Placeholder targets built from `test_solution`

    # Generate text using your model
    generated_texts = generate_text(batch_inputs)

    for generated_text, label in zip(generated_texts, batch_labels):
        # Extract the details as a tuple from the generated text
        details = extract_details(generated_text)

        # Snap each field that has a known label set to its closest valid label;
        # fields without a label set are kept as generated
        corrected_details = []
        for field, generated_label in zip(field_names, details):
            valid_labels = label_sets.get(field)
            if valid_labels is not None and generated_label not in valid_labels:
                corrected_details.append(find_closest_label(generated_label, valid_labels))
            else:
                corrected_details.append(generated_label)

        # Append the corrected details as a tuple
        generated_details.append(tuple(corrected_details))

        # Extract the details from the actual target label and append
        target_details.append(extract_details(label))


print('Generated info extracted and corrected...')
print(len(generated_details))

In [None]:
# batch_size = 128
# generated_details = []
# target_details = []

# for i in tqdm(range(0, len(test_data), batch_size), desc="Processing test data"):
#     batch_inputs = test_data[i:i+batch_size]
#     batch_labels = test_label[i:i+batch_size] #you are not going to have this

#     generated_texts = generate_text(batch_inputs)

#     for generated_text, label in zip(generated_texts, batch_labels):
#         generated_details.append(extract_details(generated_text))
#         target_details.append(extract_details(label))

# print('Generated info extracted.............')

Processing test data: 100%|██████████| 2/2 [00:03<00:00,  1.81s/it]

Generated info extracted.............





## Evaluating Model Performance by Category

In this cell, we evaluate the model's performance by splitting the generated and target details into categories and calculating various metrics:

### Data Preparation

- **`generated_dict`** and **`target_dict`**: Dictionaries to store generated and target details for each category (0 through 5). The `generated_details` and `target_details` lists are split into these dictionaries based on category indices.

- **Cleaning Repeated Patterns**: The `L4_category` entries in `generated_dict` are cleaned using the `clean_repeated_patterns` function to remove redundant patterns.

### Metrics Calculation

- **`categories`**: List of categories for which metrics will be computed: `details_Brand`, `L0_category`, `L1_category`, `L2_category`, `L3_category`, and `L4_category`.

- **`metrics`**: List of metrics to be calculated: `accuracy`, `precision`, `recall`, and `f1`.

For each category:
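1. **Compute Metrics**: Accuracy, precision, recall, and F1 score are calculated using `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics`. Precision, recall, and F1 use macro averaging, which computes each metric per class and takes the unweighted mean, so rare labels count as much as frequent ones. With `zero_division=0`, a class with no predicted or no true samples contributes 0.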

2. **Print Results**: The results for each category are printed, showing the calculated metrics with four decimal places.

The printed results provide insight into the performance of the model across different categories and metrics.


In [None]:
generated_dict = {i: [] for i in range(6)}
target_dict = {i: [] for i in range(6)}

for gen, tar in zip(generated_details, target_details):
    for i in range(6):
        generated_dict[i].append(gen[i])
        target_dict[i].append(tar[i])

print('Splitted into category.............\n')

# Clean repeated patterns in L4_category
generated_dict[5] = [clean_repeated_patterns(text) for text in generated_dict[5]]

categories = ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']
metrics = ['accuracy', 'precision', 'recall', 'f1']

results = {category: {metric: 0 for metric in metrics} for category in categories}

for i, category in enumerate(categories):
    print('Current Category: ', category)
    y_true = target_dict[i]
    y_pred = generated_dict[i]

    results[category]['accuracy'] = accuracy_score(y_true, y_pred)
    results[category]['precision'] = precision_score(y_true, y_pred, average='macro', zero_division=0)
    results[category]['recall'] = recall_score(y_true, y_pred, average='macro', zero_division=0)
    results[category]['f1'] = f1_score(y_true, y_pred, average='macro', zero_division=0)

print()

for category, scores in results.items():
    print(f"{category}:")
    for metric, value in scores.items():
        print(f"  {metric}: {value:.4f}")
    print()

Splitted into category.............

Current Category:  details_Brand
Current Category:  L0_category
Current Category:  L1_category
Current Category:  L2_category
Current Category:  L3_category
Current Category:  L4_category

details_Brand:
  accuracy: 0.9650
  precision: 0.9267
  recall: 0.9267
  f1: 0.9267

L0_category:
  accuracy: 0.5750
  precision: 0.2829
  recall: 0.2239
  f1: 0.2241

L1_category:
  accuracy: 0.4300
  precision: 0.1488
  recall: 0.1490
  f1: 0.1340

L2_category:
  accuracy: 0.1800
  precision: 0.0496
  recall: 0.0462
  f1: 0.0413

L3_category:
  accuracy: 0.1450
  precision: 0.0846
  recall: 0.0798
  f1: 0.0786

L4_category:
  accuracy: 0.3850
  precision: 0.0667
  recall: 0.0581
  f1: 0.0594
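The gap between accuracy and the macro-averaged scores in the output above (for example 0.5750 accuracy against 0.2829 macro precision for `L0_category`) is typical when a few frequent labels dominate. The sketch below recomputes accuracy and macro precision and recall by hand on a tiny invented example, with undefined per-class values treated as 0 to mirror `zero_division=0`. It is a pure-Python illustration of the averaging, not a replacement for the scikit-learn functions, and the labels other than `Automotive` are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction equals target."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall (0 when undefined)."""
    classes = sorted(set(y_true) | set(y_pred))
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    precision = sum(
        true_positives[c] / predicted[c] if predicted[c] else 0.0 for c in classes
    ) / len(classes)
    recall = sum(
        true_positives[c] / actual[c] if actual[c] else 0.0 for c in classes
    ) / len(classes)
    return precision, recall


# Invented example: the model always predicts the most frequent label.
y_true = ['Automotive'] * 8 + ['Toys', 'Garden']
y_pred = ['Automotive'] * 10

acc = accuracy(y_true, y_pred)
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_precision={macro_p:.4f} macro_recall={macro_r:.4f}")
```

Here accuracy is 0.8 while macro precision is about 0.27 and macro recall about 0.33, because the two rare classes each contribute a score of 0 to the mean.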



## Computing Item-Level Accuracy

In this cell, we define a function to compute item-level accuracy, which measures how often all predicted categories match the target categories for each item:

### Function: `compute_item_accuracy`

- **Inputs**:
  - `generated_details`: List of predicted details for each item.
  - `target_details`: List of true details for each item.

- **Process**:
  - **Count Correct Items**: Iterates through pairs of generated and target details. If all elements in a generated detail match the corresponding elements in the target detail, it counts as a correct item.
  - **Compute Accuracy**: Divides the count of correct items by the total number of items to get the accuracy. Returns `0` if there are no items.

### Execution

- **`item_accuracy`**: Calls `compute_item_accuracy` with the `generated_details` and `target_details` to calculate the accuracy.
- **Print Accuracy**: Prints the item-level accuracy with four decimal places.

Item-level accuracy provides a metric of how well the model performs in predicting all categories correctly for each product.


In [None]:
def compute_item_accuracy(generated_details, target_details):
    correct_items = 0
    total_items = len(generated_details)

    for gen, tar in zip(generated_details, target_details):
        if all(g == t for g, t in zip(gen, tar)):
            correct_items += 1

    return correct_items / total_items if total_items > 0 else 0

item_accuracy = compute_item_accuracy(generated_details, target_details)
print(f"Item-level accuracy: {item_accuracy:.4f}")


Item-level accuracy: 0.0350
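Item-level accuracy can never exceed the lowest per-field accuracy, since an item only counts as correct when every field is correct. That is consistent with the numbers shown in this notebook's outputs, where 0.0350 is below the lowest per-field accuracy of 0.1450. The sketch below demonstrates the relationship on a small invented example with three fields; the tuples and the helper `field_accuracy` are hypothetical.

```python
def item_accuracy(generated, target):
    """Fraction of items whose fields all match the target."""
    correct = sum(
        all(g == t for g, t in zip(gen, tar))
        for gen, tar in zip(generated, target)
    )
    return correct / len(generated) if generated else 0


def field_accuracy(generated, target, index):
    """Fraction of items whose field at `index` matches the target."""
    matches = sum(gen[index] == tar[index] for gen, tar in zip(generated, target))
    return matches / len(generated) if generated else 0


# Invented predictions and targets with three fields each.
generated = [('A', 'x', 'p'), ('A', 'y', 'q'), ('B', 'x', 'p'), ('A', 'x', 'q')]
target = [('A', 'x', 'p'), ('A', 'x', 'q'), ('A', 'x', 'p'), ('A', 'x', 'p')]

per_field = [field_accuracy(generated, target, i) for i in range(3)]
overall = item_accuracy(generated, target)
print(per_field, overall)
```

Each field is right on three of four items, yet only one item is right on all three fields, so the item-level score is 0.25.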


## Saving Predictions to a File

In this cell, we save the generated predictions to a file in JSONL format:

- **`categories`**: List of categories for which predictions are made: `details_Brand`, `L0_category`, `L1_category`, `L2_category`, `L3_category`, and `L4_category`.

- **`attrebute_test_baseline_200dp.predict`**: The output file where the predictions will be saved.

### Process

1. **Open File**: Opens the file `attrebute_test_baseline_200dp.predict` for writing.

2. **Write Predictions**:
   - **Iterate**: Loops through `generated_details` with `enumerate`, using each item's position as its `indoml_id`. This assumes the test IDs start at 0 and follow the same order as the predictions.
   - **Create Result**: Constructs a dictionary with `indoml_id` and the predicted values for each category.
   - **Write to File**: Serializes the dictionary to JSON format and writes it to the file, one entry per line.

This file can be used for evaluation or submission purposes, containing the model's predictions in the required format.


In [None]:
import json
categories = ['details_Brand', 'L0_category', 'L1_category', 'L2_category', 'L3_category', 'L4_category']

with open('attrebute_test_baseline_200dp.predict', 'w') as file:

    for indoml_id, details in enumerate(generated_details):
        result = {"indoml_id": indoml_id}
        for category, value in zip(categories, details):
            result[category] = value

        file.write(json.dumps(result) + '\n')
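If the test IDs are not guaranteed to be `0, 1, 2, ...` in prediction order, the IDs can be passed explicitly instead of being derived from `enumerate`. The sketch below writes predictions with explicit IDs and reads the file back as a sanity check. The function names, the temporary file name, and the sample values are hypothetical and chosen for illustration.

```python
import json
import os
import tempfile

FIELDS = ['details_Brand', 'L0_category', 'L1_category',
          'L2_category', 'L3_category', 'L4_category']


def write_predictions(path, ids, details_list):
    """Write one JSON object per line, pairing each ID with its six fields."""
    if len(ids) != len(details_list):
        raise ValueError("ids and details_list must have the same length")
    with open(path, 'w') as handle:
        for indoml_id, details in zip(ids, details_list):
            record = {'indoml_id': indoml_id, **dict(zip(FIELDS, details))}
            handle.write(json.dumps(record) + '\n')


def read_predictions(path):
    """Read a JSONL predictions file back into a list of dictionaries."""
    with open(path) as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical, deliberately non-sequential IDs and made-up predictions.
ids = [7, 3]
details_list = [
    ('Fel-Pro', 'Automotive', 'Replacement Parts', 'Gaskets', 'Head Gasket Sets', 'na'),
    ('MAGID', 'na', 'na', 'na', 'na', 'na'),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.predict')
    write_predictions(path, ids, details_list)
    loaded = read_predictions(path)

print(loaded[0])
```

In the notebook, the IDs could come from `test_df['indoml_id'].tolist()`, assuming the rows of `test_df` are in the same order as `generated_details`.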

## Creating a Zip Archive for Predictions

In this cell, we create a zip archive of the predictions file. The code is commented out; uncomment it to build the archive:

- **`file_to_zip`**: The name of the file containing the predictions (`attrebute_test_baseline_200dp.predict`).

- **`zip_file_name`**: The name of the zip archive to be created (`any_name.zip`).

### Process

1. **Create Zip Archive**: Opens a new zip file (`any_name.zip`) for writing.

2. **Add File to Zip**:
   - **Add File**: Adds the predictions file (`attrebute_test_baseline_200dp.predict`) to the zip archive. The `arcname` parameter ensures that the file is stored in the zip archive with the same name as it has on the file system.

The resulting zip file can be used for submission or sharing, compressing the predictions file into a standard format.


In [None]:
# import zipfile

# file_to_zip = 'attrebute_test_baseline_200dp.predict'
# zip_file_name = 'any_name.zip'

# with zipfile.ZipFile(zip_file_name, 'w') as zipf:
#     zipf.write(file_to_zip, arcname=file_to_zip)