# SPAM vs NOT-SPAM with RoBERTa Base

This notebook fine-tunes RoBERTa Base directly (with no additional MLM pretraining on this dataset) and achieves a test macro F1-score of **0.99**.

Import stuff.

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import gc
import os
import time
import torch 
import random
import numpy as np
import pandas as pd
import transformers
from torch import nn 
from tqdm.notebook import tqdm
import torch.nn.functional as F 
from sklearn import metrics, model_selection
from torch.utils.data import Sampler, Dataset, DataLoader

gc.enable()

Defining the configurations.

In [3]:
config = dict(
    learning_rate = 5e-5,
    seed = 541,
    tokenizer_path = '../input/roberta-base',
    model_checkpoint = '../input/roberta-base',
    max_length = 512,
    num_folds = 6,
    epochs = 1,
    train_batch_size = 32,
    valid_batch_size = 64,
    weight_decay = 1e-2,
    num_labels=2,
    device='cuda',
    report_to='none',
    num_warmup_steps=150,
    grad_acc_steps=1,
)

Setting a seed for everything for reproducibility.

In [4]:
def set_random_seed(random_seed):
    random.seed(random_seed)
    np.random.seed(random_seed)
    os.environ["PYTHONHASHSEED"] = str(random_seed)

    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)

    torch.backends.cudnn.deterministic = True

set_random_seed(config['seed'])

The following function creates a K-Fold split of the data, stratified on the target variable. For reproducibility, the dataset is shuffled with a fixed seed before the folds are created.

In [5]:
def create_folds(data):
    
    data['kfold'] = -1
    
    kf = model_selection.StratifiedKFold(n_splits=config['num_folds'], shuffle=True, random_state=config['seed'])
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data['target'])):
        data.loc[v_, 'kfold'] = f
    
    return data
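
To make the idea of stratification concrete, here is a tiny pure-Python sketch. It is not scikit-learn's implementation, just an illustration: the helper `assign_stratified_folds` and the toy labels are made up for this example. Within each class, the shuffled indices are dealt across the folds like cards, so every fold ends up with roughly the same ham/spam ratio.

```python
import random
from collections import defaultdict


def assign_stratified_folds(labels, num_folds, seed):
    """Return one fold id per sample so each fold keeps a similar class mix."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [-1] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class across the folds.
        for position, idx in enumerate(indices):
            folds[idx] = position % num_folds
    return folds


toy_labels = [0] * 12 + [1] * 6  # 12 "ham" and 6 "spam" samples
toy_folds = assign_stratified_folds(toy_labels, num_folds=3, seed=541)

spam_per_fold = [
    sum(1 for f, y in zip(toy_folds, toy_labels) if f == k and y == 1)
    for k in range(3)
]
print(spam_per_fold)  # [2, 2, 2]
```

Which samples land in which fold depends on the seed, but the per-fold class counts do not.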

Loading the tokenizer from the transformers library.

The *fast_encode* function tokenizes the data in chunks, which speeds things up when the dataset is very large.

In [6]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    config['tokenizer_path'],
)

In [7]:
def fast_encode(texts, chunk_size=512):
    input_ids = []
    at_ids = []

    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size]
        encs = tokenizer.batch_encode_plus(
            text_chunk,
            add_special_tokens=True,
        )

        input_ids.extend(encs['input_ids'])
        at_ids.extend(encs['attention_mask'])

    return {'input_ids': input_ids, 'attention_mask': at_ids}

Dataset class for the torch dataloader. It encodes the text without any truncation length, because the longest text in the dataset is shorter than the maximum length supported by the RoBERTa Base tokenizer (i.e. 512).

I am using an adaptive collation function that pads each batch to the length of the longest text in that batch. It's like using a separate padding length for each batch.

This way, no sample is truncated and we don't have to pad every sample to the maximum length of the entire dataset. This saves memory and speeds up training.

In [8]:
class AdaptiveDataset(Dataset):

    def __init__(self, df, inference_only=False):
        self.inference_only = inference_only
        self.enc = fast_encode(df.text.tolist())
        self.labels = df.target.tolist() if not self.inference_only else -1
        self.len = len(df)
    
    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        ret = {
            'input_ids': self.enc['input_ids'][idx],
            'attention_mask': self.enc['attention_mask'][idx],
        }

        if not self.inference_only:
            ret['labels'] = self.labels[idx]

        return ret

In [9]:
class AdaptiveCollate:

    def __init__(self, inference_only=False):
        self.inference_only = inference_only
    
    def __call__(self, batch):
        
        output = {}
        
        output['input_ids'] = [sample['input_ids'] for sample in batch]
        output['attention_mask'] = [sample['attention_mask'] for sample in batch]
        
        batch_max = max([len(ids) for ids in output['input_ids']])
        
        output['input_ids'] = [sample + (batch_max - len(sample)) * [tokenizer.pad_token_id] for sample in output['input_ids']]
        output['attention_mask'] = [sample + (batch_max - len(sample)) * [0] for sample in output['attention_mask']]        

        output['input_ids'] = torch.tensor(output['input_ids'], dtype=torch.long)
        output['attention_mask'] = torch.tensor(output['attention_mask'], dtype=torch.long)

        if not self.inference_only:
            output['labels'] = [sample['labels'] for sample in batch]
            output['labels'] = torch.tensor(output['labels'], dtype=torch.long)
    
        return output
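
A quick toy run of the same padding logic, without torch, may help. The helper `pad_batch`, the pad id of 1 and the token ids below are assumptions made for illustration; the real collate function takes the pad id from the tokenizer.

```python
PAD_ID = 1  # assumed pad token id, for illustration only


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch and build the mask."""
    batch_max = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (batch_max - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (batch_max - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Two made-up token id sequences of length 3 and 5.
toy_batch = [[0, 713, 2], [0, 713, 16, 10, 2]]
ids, mask = pad_batch(toy_batch)

print(ids)   # [[0, 713, 2, 1, 1], [0, 713, 16, 10, 2]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The batch is only 5 tokens wide here, instead of being padded all the way to a fixed global length.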

Average Meter class will help keep track of the losses for training and validation.

In [10]:
class AverageMeter(object):
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0
        self.max = 0
        self.min = 1e5

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
        if val > self.max:
            self.max = val
        if val < self.min:
            self.min = val
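
The `n` argument matters when batches differ in size, which is usually the case for the last batch. Below is a minimal sketch with a stripped-down stand-in for the meter (the class `RunningMean` and the loss values are made up for this example), comparing a per-sample mean with a plain per-batch mean.

```python
class RunningMean:
    """Stripped-down stand-in for the meter above: tracks sum and count only."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        # `value` is the mean over `n` samples, so weight it by `n`.
        self.sum += value * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count


meter = RunningMean()
meter.update(0.50, n=64)  # full batch with mean loss 0.50
meter.update(0.10, n=16)  # smaller last batch with mean loss 0.10

per_sample_mean = meter.avg         # (0.50 * 64 + 0.10 * 16) / 80
per_batch_mean = (0.50 + 0.10) / 2  # ignores the batch sizes

print(round(per_sample_mean, 2), round(per_batch_mean, 2))  # 0.42 0.3
```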

I am not using the default model architecture. Instead, I am using a custom attention head on top of the transformer output. In my runs, this helped the model converge faster and also boosted performance.

In [11]:
class SpamModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.config = transformers.AutoConfig.from_pretrained(config['model_checkpoint'])
        self.config.update({
            "output_hidden_states":True, 
            "hidden_dropout_prob": 0.0,
            "layer_norm_eps": 1e-7
        })                       
        self.backbone = transformers.AutoModel.from_pretrained(config['model_checkpoint'], config=self.config)  
            
        self.attention = nn.Sequential(            
            nn.Linear(768, 512),            
            nn.Tanh(),                       
            nn.Linear(512, 1),
            nn.Softmax(dim=1)
        )
        
        self.layer_norm = nn.LayerNorm(768)
        self.classifier = nn.Sequential(                        
            nn.Linear(768, 2)                        
        )        

    def forward(self, input_ids, attention_mask):
        x = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask
        )        

        last_layer_hidden_states = x.hidden_states[-1]
        
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_layer_hidden_states.size()).float()
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)

        weights = self.attention(last_layer_hidden_states)
                
        context_vector = torch.sum(weights * last_layer_hidden_states, dim=1)
        
        mean_embeddings = context_vector / sum_mask
        norm_mean_embeddings = self.layer_norm(mean_embeddings)
        
        return self.classifier(norm_mean_embeddings)
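
One thing worth noting about the head above: the softmax runs over every position, padding included, and the pooled vector is then divided by the token count. A possible variant, which I have not tested on this dataset, is to mask the padding positions before the softmax so they get exactly zero weight. Here is a pure-Python sketch for a single sequence; the function name and the toy numbers are made up for illustration, and it assumes each sequence has at least one real token.

```python
import math


def masked_attention_pool(hidden_states, scores, attention_mask):
    """Softmax the scores over real tokens only, then take the weighted sum."""
    # Padding positions get -inf, so their softmax weight is exactly 0.
    masked = [
        s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)
    ]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, pooled


hidden = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row is a padding position
scores = [0.0, 0.0, 5.0]                       # padding has the highest raw score
mask = [1, 1, 0]

weights, pooled = masked_attention_pool(hidden, scores, mask)
print(weights)  # [0.5, 0.5, 0.0]
print(pooled)   # [0.5, 0.5]
```

Without the mask, the padding row would dominate the pooled vector in this toy case.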

PyTorch train loop, validation loop, loss and metric functions. Typical stuff.

In [12]:
def loss_fn(o, y):
    loss = nn.CrossEntropyLoss()(o, y)
    return loss

def get_score(y_true, y_preds, report=False):
    if report:
        print(metrics.classification_report(y_true, y_preds))
        
    return round(metrics.f1_score(y_true, y_preds, average='macro'), 2)

def train_fn(model, train_loader, optimizer, scheduler, current_epoch):  
    losses = AverageMeter()
    optimizer.zero_grad()

    with tqdm(train_loader, unit="batch") as tepoch:
        for batch_idx, data in enumerate(tepoch):
            ids = data['input_ids'].to(config['device'])
            mask = data['attention_mask'].to(config['device'])
            label = data['labels'].to(config['device'])

            model.train()

            o = model(input_ids=ids, attention_mask=mask)
            loss = loss_fn(o, label)
            loss.backward()
            
            # Count batches from 1 so a step happens after every `grad_acc_steps` batches.
            if (batch_idx + 1) % config['grad_acc_steps'] == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad() 
            
            losses.update(loss.item(), len(label))
            
            tepoch.set_postfix(train_loss=losses.avg)
            
def eval_fn(model, valid_loader, current_epoch):
    losses = AverageMeter()
    final_targets = []
    final_predictions = []
    model.eval()
    
    with torch.no_grad():
        
        with tqdm(valid_loader, unit="batch") as tepoch:

            for batch_idx, data in enumerate(tepoch):
                input_ids = data['input_ids'].to(config['device'])
                attention_mask = data['attention_mask'].to(config['device'])
                label = data['labels'].to(config['device'])
                
                o = model(input_ids=input_ids, attention_mask=attention_mask)
                loss = loss_fn(o, label)

                o = o.detach().cpu().numpy().tolist()
                final_predictions.extend(o)
                y = label.detach().cpu().numpy().tolist()
                final_targets.extend(y)
                
                losses.update(loss.item(), len(label))  # weight by batch size, as in train_fn

    final_classes = [probs.index(max(probs)) for probs in final_predictions]

    final_classes = [id_to_class[i] for i in final_classes]
    final_targets = [id_to_class[i] for i in final_targets]

    score = get_score(final_targets, final_classes, report=True)        
        
    return round(losses.avg, 2), round(score, 2)
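
Since the score is a macro F1, both classes count equally no matter how rare spam is. A hand-rolled sketch of the computation on toy labels (the function `macro_f1` and the toy predictions are made up here; the notebook itself uses scikit-learn):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]  # one of the two spam messages is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, classes=["ham", "spam"])

print(accuracy)         # 0.9
print(round(score, 2))  # 0.8
```

A single missed spam message costs 10 points of accuracy here but about 20 points of macro F1, which is why it is the more honest metric for an imbalanced dataset like this one.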

The *run_fold* function takes the training data and trains the model for one fold. For example, if Fold 0 is given, the model is trained on Folds 1, 2, 3, 4, 5 and evaluated on the unseen Fold 0. This is repeated for all folds.
The *engine* function simply loops over all folds and calculates the average score across them.

In [13]:
def run_fold(fold, df):
    
    print(f'Fold: {fold}')
    
    train_df = df[df['kfold'] != fold].reset_index(drop=True)
    valid_df = df[df['kfold'] == fold].reset_index(drop=True)
    
    
    train_dataset = AdaptiveDataset(train_df)
    train_loader = DataLoader(
        train_dataset,
        batch_size=config['train_batch_size'],
        collate_fn = AdaptiveCollate(),
        shuffle=True
    )
    
    valid_dataset = AdaptiveDataset(valid_df)
    valid_loader = DataLoader(
        valid_dataset,
        batch_size=config['valid_batch_size'],
        collate_fn = AdaptiveCollate(),
        shuffle=False,
    )
    
    model = SpamModel().to(config['device']) 
    
    optimizer = torch.optim.AdamW(
        model.parameters(), 
        lr=config['learning_rate'], 
        weight_decay=config['weight_decay'],
    )
    
    scheduler = transformers.get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config['num_warmup_steps'],
        num_training_steps = config['epochs'] * len(train_loader)
    )    
    
    best_loss = 9999
    best_score = None
    start = time.time()
    
    for epoch in range(config['epochs']):   
        print(f'\n\n\nTraining Epoch: {epoch}')
        train_fn(model, train_loader, optimizer, scheduler, epoch)
        
        print('Evaluation...')
        val_loss, val_score = eval_fn(model, valid_loader, epoch) 
        if val_loss < best_loss:
            best_loss = val_loss
            best_score = val_score
            torch.save(model.state_dict(), f'model_{fold}.pth')
        print('Valid Score:', val_score, 'Valid Loss:', val_loss, 'Best Loss:', best_loss)
        
    print(f'Best Loss: {best_loss}, Time Taken: {round(time.time() - start, 4)}s')
    print()
    
    del model, optimizer
    del train_df, valid_df
    del valid_dataset, train_dataset
    del valid_loader, train_loader
    gc.collect()
    
    return best_score
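
A side note on the scheduler settings. The sketch below mirrors my understanding of the learning-rate multiplier behind a cosine schedule with linear warmup (with the default half cosine cycle); treat the formula as an assumption rather than the library's exact code. The training-set size of 4736 is the sum of the validation supports reported across the six folds.

```python
import math


def cosine_with_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Linear ramp from 0 to 1 during warmup, then half-cosine decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(
        1, num_training_steps - num_warmup_steps
    )
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 5e-5
warmup_steps = 150

# 4736 training rows minus one validation fold of 790, in batches of 32, 1 epoch.
steps_per_fold = math.ceil((4736 - 790) / 32)

final_multiplier = cosine_with_warmup_multiplier(
    steps_per_fold, warmup_steps, steps_per_fold
)
print(steps_per_fold)              # 124
print(base_lr * final_multiplier)  # roughly 4.13e-05
```

If this reading is right, one epoch of 124 steps ends before the 150 warmup steps do, so the learning rate only ramps up and the cosine decay never kicks in. With more epochs or fewer warmup steps the decay phase would come into play.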

In [14]:
def engine(df):
    score_avg = 0
    for fold in range(config['num_folds']):
        score_avg += run_fold(fold, df)
    
    score_avg /= config['num_folds']
    score_avg = round(score_avg, 2)
    
    print('Average Score:', score_avg)

Importing the data. The file is read with latin-1 encoding because it contains bytes that are not valid UTF-8, so reading it with the default UTF-8 encoding raises an error.

* The dataset has weird column names, so first I am just going to rename the columns.
* I have created the class_to_id and id_to_class dictionaries, which map the target to a numeric encoding and vice versa.
* I am also creating 2 extra features: num_tokens has the number of words in the text, calculated using the Python .split() method, and num_characters has the number of characters in the text.

In [15]:
df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', encoding='latin-1')\
        .rename(columns={'v1': 'target', 'v2': 'text'})[['target', 'text']]

df['num_tokens'] = df['text'].apply(lambda x: len(x.split()))
df['num_characters'] = df['text'].apply(lambda x: len(x))

class_to_id = {'ham': 0, 'spam': 1}
id_to_class = {id_: class_ for class_, id_ in class_to_id.items()}

df['target'] = df['target'].map(class_to_id)

I am holding out 15% of the data for testing the model. These samples are never seen by the model during training. A seed is used for reproducible results. The K-Fold split is computed on the remaining 85% of the data. I have chosen K = 6.

In [16]:
TRAIN_DATA, TEST_DATA = model_selection.train_test_split(
    df, 
    test_size=0.15, 
    stratify=df['target'].values,
    random_state=config['seed'],
)

TRAIN_DATA = TRAIN_DATA.reset_index(drop=True)
TEST_DATA = TEST_DATA.reset_index(drop=True)

TRAIN_DATA = create_folds(TRAIN_DATA)
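
As a sanity check on the split sizes, the arithmetic can be worked out by hand. The total of 5572 rows is not stated anywhere in the text; it is inferred from the supports in the classification reports further down (836 test rows plus 4736 rows across the folds). The fold-size rule below is the plain K-Fold one (the first few folds get one extra row), used here as an assumption since the stratified splitter balances each class separately.

```python
import math

total_rows = 5572  # assumed: inferred from the reported supports
test_rows = math.ceil(0.15 * total_rows)
train_rows = total_rows - test_rows

num_folds = 6
base, extra = divmod(train_rows, num_folds)
fold_sizes = [base + 1 if k < extra else base for k in range(num_folds)]

print(test_rows, train_rows)  # 836 4736
print(fold_sizes)             # [790, 790, 789, 789, 789, 789]
```

These numbers line up with the per-fold and test supports in the outputs.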

Just checking the maximum length in the text after tokenization.

In [17]:
x = fast_encode(df.text.tolist())
print('Max Token Length:', max([len(i) for i in x['input_ids']]))

del x
gc.collect();


Max Token Length: 259


Training Begins.

Damn, 0.98 F1-Score for Cross-Validation Average. Let's hope this is not over-fitting. (spoiler alert - it's NOT over-fitting, transformers ftw)

In [18]:
engine(TRAIN_DATA)

Fold: 0



Some weights of the model checkpoint at ../input/roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





Training Epoch: 0



Evaluation...



              precision    recall  f1-score   support

         ham       1.00      0.99      0.99       684
        spam       0.94      0.98      0.96       106

    accuracy                           0.99       790
   macro avg       0.97      0.99      0.98       790
weighted avg       0.99      0.99      0.99       790

Valid Score: 0.98 Valid Loss: 0.03 Best Loss: 0.03
Best Loss: 0.03, Time Taken: 34.5706s

Fold: 1


Training Epoch: 0

Evaluation...

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       684
        spam       0.99      0.91      0.95       106

    accuracy                           0.99       790
   macro avg       0.99      0.95      0.97       790
weighted avg       0.99      0.99      0.99       790

Valid Score: 0.97 Valid Loss: 0.04 Best Loss: 0.04
Best Loss: 0.04, Time Taken: 33.1265s

Fold: 2


Training Epoch: 0

Evaluation...

              precision    recall  f1-score   support

         ham       0.99      1.00      1.00       684
        spam       1.00      0.94      0.97       105

    accuracy                           0.99       789
   macro avg       1.00      0.97      0.98       789
weighted avg       0.99      0.99      0.99       789

Valid Score: 0.98 Valid Loss: 0.03 Best Loss: 0.03
Best Loss: 0.03, Time Taken: 32.8965s

Fold: 3


Training Epoch: 0

Evaluation...

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       683
        spam       0.98      0.95      0.97       106

    accuracy                           0.99       789
   macro avg       0.99      0.97      0.98       789
weighted avg       0.99      0.99      0.99       789

Valid Score: 0.98 Valid Loss: 0.03 Best Loss: 0.03
Best Loss: 0.03, Time Taken: 32.9244s

Fold: 4


Training Epoch: 0

Evaluation...

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       683
        spam       1.00      0.92      0.96       106

    accuracy                           0.99       789
   macro avg       0.99      0.96      0.97       789
weighted avg       0.99      0.99      0.99       789

Valid Score: 0.97 Valid Loss: 0.05 Best Loss: 0.05
Best Loss: 0.05, Time Taken: 32.816s

Fold: 5


Training Epoch: 0

Evaluation...

              precision    recall  f1-score   support

         ham       1.00      1.00      1.00       683
        spam       0.98      0.99      0.99       106

    accuracy                           1.00       789
   macro avg       0.99      0.99      0.99       789
weighted avg       1.00      1.00      1.00       789

Valid Score: 0.99 Valid Loss: 0.02 Best Loss: 0.02
Best Loss: 0.02, Time Taken: 33.0209s

Average Score: 0.98


Inference Time. 

* I will load the trained models, compute the probabilities and keep adding them to *test_probabilities*.
* Dividing it by num_folds will give the average predicted probabilities.
* When these predictions are evaluated, the F1-Score is 0.99.
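
The averaging step can be sketched on toy numbers. The probabilities below are made up (three hypothetical fold models, three messages); the real loop does the same thing with six models and numpy arrays.

```python
# Spam probabilities from three hypothetical fold models for three messages.
fold_probabilities = [
    [0.90, 0.20, 0.55],
    [0.80, 0.10, 0.40],
    [0.70, 0.30, 0.45],
]

num_models = len(fold_probabilities)

# Average per message (column-wise), then threshold at 0.5.
mean_probabilities = [sum(col) / num_models for col in zip(*fold_probabilities)]
predicted = ["spam" if p > 0.5 else "ham" for p in mean_probabilities]

print(predicted)  # ['spam', 'ham', 'ham']
```

The third message shows the point of averaging: one model alone would have called it spam (0.55), but the averaged probability stays below the threshold.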

In [19]:
def predict(model, valid_loader):
    final_predictions = []
    model.eval()
    
    with torch.no_grad():
        
        with tqdm(valid_loader, unit="batch") as tepoch:

            for batch_idx, data in enumerate(tepoch):
                input_ids = data['input_ids'].to(config['device'])
                attention_mask = data['attention_mask'].to(config['device'])
                
                o = model(input_ids=input_ids, attention_mask=attention_mask)
                o = torch.softmax(o, dim=1).detach().cpu().numpy()[:, 1]  # P(spam) per sample
                final_predictions.extend(o)
        
    return final_predictions

In [20]:
test_dataset = AdaptiveDataset(TEST_DATA, inference_only=True)
test_loader = DataLoader(
    test_dataset,
    batch_size=config['valid_batch_size'],
    collate_fn = AdaptiveCollate(inference_only=True),
    shuffle=False,
)


In [21]:
test_probabilities = np.zeros((TEST_DATA.shape[0]))

for fold in range(config['num_folds']):
    model_path = f'model_{fold}.pth'
    model = SpamModel()
    model.load_state_dict(torch.load(model_path))
    model.to(config['device'])
    
    test_probabilities += predict(model, test_loader)

Some weights of the model checkpoint at ../input/roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [22]:
test_probabilities /= config['num_folds']
test_probabilities = (test_probabilities > 0.5) * 1

predicted_classes = [id_to_class[i] for i in test_probabilities]
target_classes = [id_to_class[i] for i in TEST_DATA['target'].tolist()]

print('Test F1-Score:', get_score(target_classes, predicted_classes, report=True))

              precision    recall  f1-score   support

         ham       0.99      1.00      1.00       724
        spam       1.00      0.96      0.98       112

    accuracy                           0.99       836
   macro avg       1.00      0.98      0.99       836
weighted avg       0.99      0.99      0.99       836

Test F1-Score: 0.99


**Amazing! The CV score (0.98) and the test score (0.99) are almost the same, which suggests the model is not over-fitting.**

**Such strong performance after a single epoch of fine-tuning shows why transformers are the go-to choice for many NLP tasks.**