# NLI Problem Solve
Natural Language Inferencing (NLI) is a classic NLP (Natural Language Processing) problem that involves taking two sentences (the premise and the hypothesis ), and deciding how they are related- if the premise entails the hypothesis, contradicts it, or neither.

In [5]:
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension

Config option `kernel_spec_manager_class` not recognized by `EnableNBExtensionApp`.
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: [32mOK[0m


In [6]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# from transformers import BertTokenizer, TFBertModel
import matplotlib.pyplot as plt
import tensorflow as tf

In [7]:
# set GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")  # Создаем стратегию для одного GPU
    except RuntimeError as e:
        print(e)
else:
    strategy = tf.distribute.get_strategy()
    print('Number of replicas:', strategy.num_replicas_in_sync)

In [8]:
import os
os.environ["WANDB_API_KEY"] = "0" ## to silence warning

# Fast EDA

In [9]:
train = pd.read_json("../input/dataset-indonli-new/train.jsonl", lines=True)
test = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")

train["label"] = train['label'].replace({'c':2,'e':0,'n':1})

train.head()

Unnamed: 0,pair_id,premise_id,premise,hypothesis,label,data_split,annotator_type,sentence_size
0,101315,10131,Presiden Joko Widodo (Jokowi) menyampaikan pre...,Prediksi akhir wabah tidak disampaikan Jokowi.,2,train,lay,single
1,110511,11051,Meski biasanya hanya digunakan di fasilitas ke...,Masker sekali pakai banyak dipakai di tingkat ...,0,train,lay,single
2,124434,12443,"Data dari Nielsen Music mencatat, ""Joanne"" tel...",Nielsen Music mencatat pada akhir minggu ini.,1,train,lay,single
3,124274,12427,Album Wild West miliknya pada tahun 1981 merup...,Ia memiliki lebih dari satu album.,1,train,lay,single
4,119442,11944,"Seperti namanya, paket internet sahur Telkomse...",Paket internet sahur tidak ditujukan untuk saa...,2,train,lay,single


In [None]:
#train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
#test = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")

#train.head()

In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10330 entries, 0 to 10329
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   pair_id         10330 non-null  int64 
 1   premise_id      10330 non-null  int64 
 2   premise         10330 non-null  object
 3   hypothesis      10330 non-null  object
 4   label           10330 non-null  int64 
 5   data_split      10330 non-null  object
 6   annotator_type  10330 non-null  object
 7   sentence_size   10330 non-null  object
dtypes: int64(3), object(5)
memory usage: 645.8+ KB


In [13]:
!pip install plotly



In [None]:
#Cannot be run using dataset indonli
import plotly.express as px

labels, frequencies = np.unique(train.language.values, return_counts = True)

fig = px.pie(values=frequencies, 
             names=labels, 
             title='Languages distribution',
             color_discrete_sequence=px.colors.sequential.Plotly3)

fig.show(renderer="iframe")

In [None]:
import plotly.graph_objects as go

train['text_length'] = train['premise'].apply(len)

fig = go.Figure(data=[go.Histogram(x=train['text_length'], 
                                   nbinsx=50,
                                   marker_color='skyblue')])
fig.update_layout(title_text='Text length distribution in "premise"', # title of plot
                  xaxis_title_text='Len of text in premise', # xaxis label
                  yaxis_title_text='Number of sentences', # yaxis label
                  bargap=0.2, # gap between bars of adjacent location coordinates
                  bargroupgap=0.1) # gap between bars of the same location coordinates
fig.show(renderer="iframe")


In [None]:
train['text_length'] = train['hypothesis'].apply(len)

fig = go.Figure(data=[go.Histogram(x=train['text_length'], 
                                   nbinsx=50,
                                   marker_color='skyblue')])
fig.update_layout(title_text='Text length distribution in "hypothesis"', # title of plot
                  xaxis_title_text='Len of text in hypothesis', # xaxis label
                  yaxis_title_text='Number of sentences', # yaxis label
                  bargap=0.2, # gap between bars of adjacent location coordinates
                  bargroupgap=0.1) # gap between bars of the same location coordinates
fig.show(renderer="iframe")

In [15]:
import plotly.graph_objects as go

label_count = train['label'].value_counts().sort_index()
label_names = ['entailment', 'neutral', 'contradiction']
label_count.index = label_names

fig = go.Figure([go.Bar(x=label_names, y=label_count, marker_color='skyblue')])

fig.update_layout(title_text='Number of entries per label', # title of plot
                  xaxis_title_text='Label', # xaxis label
                  yaxis_title_text='Count', # yaxis label
                  )
fig.show(renderer="iframe")


# Transformers 🤖

In this notebook we will be use XLM-RoBERTa. <br><br>
Our 1st step is to import "symanto/xlm-roberta-base-snli-mnli-anli-xnli". So this is a pre-trained model based on the XLM-RoBERTa architecture, which has been further fine-tuned on the SNLI, MNLI, ANLI, and XNLI datasets. Let's unpack what these abbreviations mean:

1. XLM-RoBERTa: XLM-RoBERTa (Cross-lingual Language Model - RoBERTa) is a variant of the RoBERTa model that is designed to work with texts in multiple languages. It was developed by the Facebook AI team and trained on a large corpus of text from 100 languages.

2. SNLI: The Stanford Natural Language Inference Corpus is a dataset for the task of natural language inference (NLI), consisting of sentences annotated for 'entailment', 'contradiction', or 'neutrality' relations.

3. MNLI: The Multi-Genre Natural Language Inference Corpus is another dataset for the NLI task, which incorporates a variety of genres and text sources.

4. ANLI: The Adversarial Natural Language Inference task is a dataset consisting of several 'rounds' of NLI tasks, each progressively more difficult.

5. XNLI: The Cross-lingual Natural Language Inference Corpus is a multilingual dataset for the NLI task, based on MNLI but translated into 15 languages.

Thus, this model has been specifically trained for the task of natural language inference across several datasets and multiple languages. It should be especially useful for this task, particularly in a multilingual context.

## Quick Setup

In [16]:
!pip install evaluate # library for metrics

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m00:01[0m
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [17]:
import evaluate
import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer

## Tokenizer

A tokenizer is a component that breaks text into individual "tokens". In the context of Natural Language Processing (NLP), a token typically represents a word or symbol. For example, the sentence "Hello, world!" might be tokenized into: ["Hello", ",", "world", "!"].

Tokenization is the first step in most NLP tasks, including tasks involving Transformer models such as BERT, GPT-2, XLM-RoBERTa, etc. These models are trained on tokenized texts and operate on tokenized input data.

The specific tokenizer used with a particular model is typically trained alongside the model and knows how to properly break text into tokens in the way that was used during the model's training. Each token is then associated with a unique numerical identifier that the model uses for training and inference.

It's important to use the correct tokenizer for your specific model, as different models might use different tokenization schemes. For example, some models might break words into subwords or characters, while others might use whole words as tokens. A mismatch between the tokenization scheme during training and the tokenization scheme during inference can lead to incorrect results.

In [18]:
# model_name = 'bert-base-multilingual-cased'
model_name = 'symanto/xlm-roberta-base-snli-mnli-anli-xnli'
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading tokenizer_config.json:   0%|          | 0.00/398 [00:00<?, ?B/s]

Downloading (…)tencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

In [19]:
train.head()

Unnamed: 0,pair_id,premise_id,premise,hypothesis,label,data_split,annotator_type,sentence_size
0,101315,10131,Presiden Joko Widodo (Jokowi) menyampaikan pre...,Prediksi akhir wabah tidak disampaikan Jokowi.,2,train,lay,single
1,110511,11051,Meski biasanya hanya digunakan di fasilitas ke...,Masker sekali pakai banyak dipakai di tingkat ...,0,train,lay,single
2,124434,12443,"Data dari Nielsen Music mencatat, ""Joanne"" tel...",Nielsen Music mencatat pada akhir minggu ini.,1,train,lay,single
3,124274,12427,Album Wild West miliknya pada tahun 1981 merup...,Ia memiliki lebih dari satu album.,1,train,lay,single
4,119442,11944,"Seperti namanya, paket internet sahur Telkomse...",Paket internet sahur tidak ditujukan untuk saa...,2,train,lay,single


In [20]:
# Delete unnecessary columns for indonli

train = train.drop(labels=['premise_id','pair_id','data_split','annotator_type','sentence_size'], axis=1)

print(train.columns)
test = test.drop(labels=['language','lang_abv'], axis=1)

Index(['premise', 'hypothesis', 'label'], dtype='object')


In [21]:
train.head()

Unnamed: 0,premise,hypothesis,label
0,Presiden Joko Widodo (Jokowi) menyampaikan pre...,Prediksi akhir wabah tidak disampaikan Jokowi.,2
1,Meski biasanya hanya digunakan di fasilitas ke...,Masker sekali pakai banyak dipakai di tingkat ...,0
2,"Data dari Nielsen Music mencatat, ""Joanne"" tel...",Nielsen Music mencatat pada akhir minggu ini.,1
3,Album Wild West miliknya pada tahun 1981 merup...,Ia memiliki lebih dari satu album.,1
4,"Seperti namanya, paket internet sahur Telkomse...",Paket internet sahur tidak ditujukan untuk saa...,2


In [None]:
# delete unnecessary columns
train = train.drop(labels=['language', 'text_length', 'lang_abv'], axis=1)
test = test.drop(labels=['language','lang_abv'], axis=1)

> Hugging Face have classes DatasetDict() and Dataset(). They convert data into a format convenient for the model

In [22]:
from datasets import Dataset, DatasetDict

In [23]:
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(train, test_size=0.2, random_state=42)

train_ds = Dataset.from_pandas(train_df)
val_ds = Dataset.from_pandas(val_df)
test_ds = Dataset.from_pandas(test)

ds = DatasetDict()
ds['train'] = train_ds
ds['validation'] = val_ds
ds['test'] = test_ds

ds

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', '__index_level_0__'],
        num_rows: 8264
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label', '__index_level_0__'],
        num_rows: 2066
    })
    test: Dataset({
        features: ['id', 'premise', 'hypothesis'],
        num_rows: 5195
    })
})

In [24]:
# tokenaize sentence func
def tokenizer_sentence(data):
    return tokenizer(data['premise'], data['hypothesis'], truncation=True)

In [25]:
tokenized_ds = ds.map(tokenizer_sentence, batched=True)

  0%|          | 0/9 [00:00<?, ?ba/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

In [26]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 8264
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 2066
    })
    test: Dataset({
        features: ['id', 'premise', 'hypothesis', 'input_ids', 'attention_mask'],
        num_rows: 5195
    })
})

> As we can see, the columns 'input_ids', 'attention_mask' have been added

> DataCollatorWithPadding complements or truncates data to a fixed length.

In [27]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Build Model

Let's create the custom class named `CustomXLMRobertaModel` that extends the `nn.Module` class from PyTorch, which means it represents a neural network module.

- `__init__(self, num_labels)`: This method initializes the class instance. It takes `num_labels` as an argument, which represents the number of possible output labels (classes) for the model.

- `super(CustomXLMRobertaModel, self).__init__()`: This is calling the `__init__` method of the parent `nn.Module` class, which is necessary to properly initialize the class.

- `model_name = 'symanto/xlm-roberta-base-snli-mnli-anli-xnli'`: This specifies the pre-trained model to use. In this case, it is a pre-trained XLM-RoBERTa model.

- `self.roberta = XLMRobertaModel.from_pretrained(model_name)`: This loads the specified pre-trained XLM-RoBERTa model.

- `self.dropout = nn.Dropout(0.2)`: This is a dropout layer, which is a regularization technique that helps prevent overfitting. The `0.2` specifies that approximately 20% of the inputs will be randomly set to 0 during training.

- `self.classifier = nn.Sequential(...)`: This is the classification layer of the model, which takes the output from the XLM-RoBERTa model and produces the final class predictions. It consists of a sequence of operations (a linear transformation, layer normalization, a ReLU activation function, another dropout, and another linear transformation).

- `self.loss = nn.CrossEntropyLoss()`: This specifies the loss function to be used, which is cross entropy loss. This is a common choice for multi-class classification tasks.

- `self.num_labels = num_labels`: This just stores the number of possible output labels for later use.

- `def forward(self, input_ids, attention_mask, labels=None)`: This is the method that is called when you pass input data to the model. It takes as input the `input_ids` (the tokenized input data), the `attention_mask` (which specifies which tokens should be attended to by the model), and optionally the true `labels`.

- The `output` is obtained by passing the `input_ids` and `attention_mask` to the XLM-RoBERTa model, then passing the resulting `pooler_output` through the dropout layer.

- The `logits` are obtained by passing the `output` through the classification layer. These are the raw, unnormalized scores for each class.

- If `labels` are provided, then it calculates the cross entropy loss between the predicted `logits` and the true `labels`, and returns a dictionary containing both the `loss` and the `logits`. If `labels` are not provided, it simply returns the `logits`.

In [28]:
import torch.nn as nn
from transformers import XLMRobertaModel

class CustomXLMRobertaModel(nn.Module):
    def __init__(self, num_labels):
        super(CustomXLMRobertaModel, self).__init__()
        model_name = 'symanto/xlm-roberta-base-snli-mnli-anli-xnli'
        self.roberta = XLMRobertaModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Sequential(
            nn.Linear(768, 512),
            nn.LayerNorm(512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_labels)
        )
        self.loss = nn.CrossEntropyLoss()
        self.num_labels = num_labels

    def forward(self, input_ids, attention_mask, labels=None):
        output = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        output = self.dropout(output.pooler_output)
        logits = self.classifier(output)

        if labels is not None:
            loss = self.loss(logits.view(-1, self.num_labels), labels.view(-1))
            return {"loss": loss, "logits": logits}
        else:
            return logits

In [29]:
model = CustomXLMRobertaModel(num_labels=3) # we have 3 classes

Downloading config.json:   0%|          | 0.00/921 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

Some weights of the model checkpoint at symanto/xlm-roberta-base-snli-mnli-anli-xnli were not used when initializing XLMRobertaModel: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModel were not initialized from the model checkpoint at symanto/xlm-roberta-base-snli-mnli-anli-xnli and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to us

## Train Model

`TrainingArguments` is a class from the `transformers` library that provides arguments and parameters for training a model. This function takes a multitude of arguments that allow the training process to be configured.

Here's what each of the provided parameters does:

1. `output_dir` ("/content" in your case): This is the path to the directory where output files, such as the trained model and logs, will be saved.

2. `optim` ("adamw_torch"): This is the optimizer that will be used to update the model's weights during training. "adamw_torch" is a variant of the Adam optimization algorithm that incorporates weight decay, often used to regularize the model and prevent overfitting.

3. `num_train_epochs` (5): This is the number of epochs the training will run for. One epoch means one complete pass through the entire training dataset.

4. `evaluation_strategy` ("epoch"): This is the strategy for evaluating the model. If set to "epoch", the model will be evaluated after each training epoch. Other possible values include "steps" (evaluate after a given number of training steps) and "no" (no evaluation).

5. `logging_dir` ('./logs'): This is the path to the directory where training process logs will be saved.

6. `logging_steps` (10): This is the number of training steps between log entries. If set to 10, a log entry will be created every 10 training steps.

There are many other parameters can also be set in `TrainingArguments`. You found it in the [documentation](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

> Logs. What's it?

Training process logs are records that capture key information about the progress of a model's training. This can include data about each step or epoch of training, such as training losses, quality metrics (accuracy, F1 score, etc.), hyperparameter values, training time, and other useful information.

Training logs are used for the following reasons:

1. **Monitoring Progress**: Training logs allow for tracking the model's training progress over time and identifying when the model begins to overfit or when the training stabilizes.

2. **Debugging and Optimization**: If the training process is not going as expected, logs can help identify issues and determine how to optimize the process. For example, if training losses suddenly increase, this might indicate convergence problems.

3. **Record-keeping and Reproducibility**: Saving training logs allows for keeping a record of how the model was trained, which is important for reproducibility of scientific results. This can also be useful when comparing different models or training strategies.

4. **Visualization**: Training logs can be used to create graphs and charts that visualize the training progress. This can be especially helpful in analyzing and comparing models.

Some tools, such as TensorBoard or Wandb, can automatically visualize these logs, making the process of analysis and monitoring even more convenient.

In [30]:
from sklearn.metrics import accuracy_score, f1_score
from datasets import load_metric

training_args = TrainingArguments("/content",
                                  optim="adamw_torch",
                                  learning_rate=3e-5,
                                  per_device_train_batch_size=16,                                
                                  num_train_epochs=10,
                                  evaluation_strategy="epoch",
                                  logging_dir='./logs',
                                  logging_steps=10)

f1_metric = load_metric("f1")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_metric.compute(predictions=predictions, references=labels, average="micro")
    }

Downloading builder script:   0%|          | 0.00/2.06k [00:00<?, ?B/s]

`Trainer` provides a straightforward and fast way to train and evaluate your model.

`trainer = Trainer(...`: This creates an instance of the `Trainer` class, which will be used for training and evaluating the model. The arguments passed are:

   - `model`: this is the model to be trained. In this context, it isn't defined yet, but it could be any model compatible with the transformers library.

   - `args=training_args`: these arguments control the training process. `training_args` should be an instance of `TrainingArguments` or compatible, which defines parameters like the learning rate, batch size, etc.

   - `train_dataset=tokenized_ds["train"]` and `eval_dataset=tokenized_ds["validation"]`: these are the datasets for training and evaluating the model. In this case, they are taken from the dictionary `tokenized_ds`, presumably containing tokenized versions of the original data.

   - `data_collator=data_collator`: the `data_collator` is a function that takes a list of samples from the `Dataset` and collates them into mini-batches (batches) for training or evaluation. It isn't defined in this particular context.

   - `tokenizer=tokenizer`: this is the tokenizer that was used to tokenize the input data.

   - `compute_metrics=compute_metrics`: this is a function that will be used to compute metrics during evaluation. In this context, it isn't defined yet. It should take the outputs from `Trainer.evaluate()` and return a dictionary where keys are the names of the metrics and values are the computed metrics.

In [32]:
from transformers import Trainer

trainer = Trainer(
    model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # передаем функцию compute_metrics
)

### Setup env

Wandb, or Weights & Biases, is a machine learning tool that helps to track and visualize the progress of model learning, as well as compare various experiments. It provides a convenient web interface where you can observe your experiments in real time, see graphs of metrics such as loss and accuracy, save and load model weights, and share the results with colleagues.

String `os.environ["WANDB_DISABLED"] = "true"` disables Weights & Biases integration. This can be useful if you don't want your experiments to be blocked in Wandb, or if you work in an environment where internet access is limited or unavailable. 

In [33]:
!pip install wandb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [34]:
import os
os.environ["WANDB_DISABLED"] = "true"

### Start train process

In [35]:
trainer.train()

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.5486,0.533522,0.801549,{'f1': 0.8015488867376573}
2,0.4357,0.545013,0.801549,{'f1': 0.8015488867376573}
3,0.3953,0.589025,0.795257,{'f1': 0.7952565343659246}
4,0.3171,0.669074,0.792836,{'f1': 0.7928363988383349}
5,0.1673,0.829791,0.796225,{'f1': 0.7962245885769604}
6,0.1767,0.952783,0.792352,{'f1': 0.7923523717328169}
7,0.1362,1.102883,0.789932,{'f1': 0.7899322362052275}
8,0.099,1.17661,0.796709,{'f1': 0.7967086156824781}
9,0.0237,1.246914,0.797677,{'f1': 0.797676669893514}
10,0.0598,1.271758,0.801065,{'f1': 0.8010648596321394}


Trainer is attempting to log a value of "{'f1': 0.8015488867376573}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

Trainer is attempting to log a value of "{'f1': 0.8015488867376573}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

Trainer is attempting to log a value of "{'f1': 0.7952565343659246}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we drop

TrainOutput(global_step=2590, training_loss=0.22148267769013824, metrics={'train_runtime': 1908.5257, 'train_samples_per_second': 43.3, 'train_steps_per_second': 1.357, 'total_flos': 0.0, 'train_loss': 0.22148267769013824, 'epoch': 10.0})

In [None]:
# In kaggle, there is no way to access tensorboard 
# but you can use this code to visualize learning graphs from logs.

# !pip install tensorboard
# %load_ext tensorboard

# %tensorboard --logdir /kaggle/working/logs (your path to logs dir)

## Get Model predictions

In [30]:
predictions = trainer.predict(tokenized_ds["test"])
predictions

PredictionOutput(predictions=array([[-2.737657 , -2.203037 ,  4.2112136],
       [-2.381342 ,  4.047823 , -2.3333812],
       [ 4.660001 , -1.9491919, -2.1156757],
       ...,
       [ 4.5528708, -2.0941494, -1.8590173],
       [ 0.6248544,  1.742324 , -1.9893024],
       [-2.826497 , -2.1592557,  4.293371 ]], dtype=float32), label_ids=None, metrics={'test_runtime': 46.6276, 'test_samples_per_second': 111.415, 'test_steps_per_second': 6.97})

In [32]:
logits = torch.from_numpy(predictions.predictions)
probs = torch.softmax(logits, -1).tolist() # convert to probability
probs[:5]

[[0.0009572316776029766, 0.0016338031273335218, 0.997408926486969],
 [0.0016084787202998996, 0.9967039227485657, 0.0016875024884939194],
 [0.9975171089172363, 0.0013445729855448008, 0.00113836454693228],
 [0.001298150629736483, 0.9972037076950073, 0.0014981417916715145],
 [0.0012291505699977279, 0.9970943927764893, 0.0016764780739322305]]

In [33]:
outputs = []

for index, prob in enumerate(probs):
    # ind indx with max probability of class
    predicted_label = prob.index(max(prob))
    element_id = ds['test']['id'][index]
    prediction = (element_id, predicted_label)
    outputs.append(prediction)

## Save Submision

In [34]:
submission = pd.DataFrame(outputs, columns=['id', 'prediction'])
submission.to_csv("submission.csv", index=False)
submission.head()

Unnamed: 0,id,prediction
0,c6d58c3f69,2
1,cefcc82292,1
2,e98005252c,0
3,58518c10ba,1
4,c32b0d16df,1
