# Transformer models with data augmentation for intimacy scoring

Task: https://codalab.lisn.upsaclay.fr/competitions/7096

This notebook contains the code to fine-tune several pre-trained transformers for the task of intimacy analysis (regression). 

In particular, the models are:

- **BERT Multilingual**: MODEL_NAME= "bert-base-multilingual-uncased"
- **XLM-T**: an XLM-RoBERTa-base model trained on ~198M multilingual tweets. MODEL_NAME= "cardiffnlp/twitter-xlm-roberta-base"
- **XLM-R**: XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. MODEL_NAME= "xlm-roberta-base"
- **DistilBERT**: a distilled version of the BERT base multilingual model. MODEL_NAME= "distilbert-base-multilingual-cased"
- **MiniLM**: Multilingual MiniLM uses the same tokenizer as XLM-R. MODEL_NAME= "microsoft/Multilingual-MiniLM-L12-H384"


Experiments show that XLM-T obtains the best results. We have also explored data augmentation techniques such as EDA and the NLPAug library. Unfortunately, data augmentation does not seem to improve the results. For the final submission, we sent XLM-T with data augmentation.







## Defining some global variables
Select the model and whether to train on the augmented data:

In [1]:
USE_DATA_AUGMENTED = False

models= ['bert-base-multilingual-uncased', 
         'cardiffnlp/twitter-xlm-roberta-base', 
         'xlm-roberta-base', 
         'distilbert-base-multilingual-cased', 
         'microsoft/Multilingual-MiniLM-L12-H384']
MODEL_NAME=models[4] #0, 1, 2, 3, 4

print('Using model:', MODEL_NAME, USE_DATA_AUGMENTED)



Using model: microsoft/Multilingual-MiniLM-L12-H384 False


Let's check whether we are connected to a GPU:

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Mar  9 14:05:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P0    29W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
!pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Data

We will use the MINT dataset, which consists of a collection of tweets. Each tweet is annotated with an intimacy score. The task is to predict the intimacy score for a given tweet.


### Loading the data and the augmented data




In [6]:
from datasets import load_dataset, concatenate_datasets
from google.colab import drive

# mount your google drive
drive.mount('/content/drive')

# we load the intimacy dataset
path = "/content/drive/My Drive/Colab Notebooks/data/intimacy/"

data_files = {"train": path+"train-full.csv", 
              "test": path+"test_labeled.csv"}

dataset = load_dataset("csv", data_files=data_files)
LANGUAGES = set(dataset['train']['language'])

dataset





Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 9491
    })
    test: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 3881
    })
})

We create a validation split by holding out 20% of the training data:

In [7]:
aux = dataset["train"].train_test_split(test_size=0.20, seed=42)
dataset["validation"] = aux['test']
dataset["train"] = aux['train']
del(aux)
dataset



DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 7592
    })
    test: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 3881
    })
    validation: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 1899
    })
})
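
As a quick sanity check, the split sizes printed above are consistent with rounding the held-out fraction up to a whole number of rows. The sketch below only redoes that arithmetic; the rounding rule is an assumption inferred from the printed sizes, not something stated in this notebook.

```python
import math

n_total = 9491      # rows in the original training split (from the output above)
test_size = 0.20    # fraction passed to train_test_split

# assumption: the held-out size is rounded up and the rest stays in train
n_validation = math.ceil(test_size * n_total)
n_train = n_total - n_validation

print(n_train, n_validation)
```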

In [8]:

if USE_DATA_AUGMENTED:
    dataset_name = "ISEGURA/intimacy_aug_all"
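    # NOTE: access_token is not defined in this notebook; it must hold a
    # Hugging Face token with read access to the dataset before running this cell
    dataset = load_dataset(dataset_name, use_auth_token=access_token)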
    print(dataset_name, "was loaded!!!")

    dataset['validation'] = dataset['validation'].remove_columns(['text_aug', 'text_nlpaug'])

    # we get the augmented texts and save them into new datasets
    data_eda = dataset['train'].remove_columns(['text','text_nlpaug']).rename_columns({'text_aug':'text'})
    data_nlpaug = dataset['train'].remove_columns(['text','text_aug']).rename_columns({'text_nlpaug':'text'})
    dataset["train"] = dataset["train"].remove_columns([ 'text_aug', 'text_nlpaug'])
    dataset["train"] = concatenate_datasets([dataset["train"],data_eda, data_nlpaug])

    del(data_eda)
    del(data_nlpaug)


dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 7592
    })
    test: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 3881
    })
    validation: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 1899
    })
})
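
To illustrate what the augmented branch does when `USE_DATA_AUGMENTED` is enabled, here is a minimal sketch in plain Python. It assumes each row carries the original text plus one EDA variant (`text_aug`) and one NLPAug variant (`text_nlpaug`), as the column names in the cell suggest; the example sentences and the helper `stack_augmented` are made up for illustration.

```python
# toy rows mimicking the assumed layout of the augmented dataset
rows = [
    {"text": "i miss you", "text_aug": "i you miss",
     "text_nlpaug": "i really miss you", "label": 3.5},
    {"text": "nice weather", "text_aug": "weather nice",
     "text_nlpaug": "nice sunny weather", "label": 1.2},
]

def stack_augmented(rows, text_columns=("text", "text_aug", "text_nlpaug")):
    """Return one example per (row, text column), all stored under 'text'."""
    stacked = []
    for column in text_columns:      # originals first, then each augmented view
        for row in rows:
            stacked.append({"text": row[column], "label": row["label"]})
    return stacked

train = stack_augmented(rows)
print(len(train))  # three times the number of original rows
```

Each augmented copy keeps the label of the text it was derived from, so the training set triples in size while the validation set stays untouched.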

Let's clean the texts by removing user mentions and the string 'http':

In [9]:
import re
def clean(examples):
    # remove user mentions and the string 'http' from every text in the batch
    new_texts = []
    for text in examples['text']:
        text = re.sub('@user', '', text)     # the '@user' placeholder
        text = re.sub('http', '', text)      # the literal string 'http'
        text = re.sub(r'@[\w]+', '', text)   # any remaining @mentions
        text = text.strip()
        new_texts.append(text)
    
    examples['text'] = new_texts
    return examples

dataset=dataset.map(clean, batched=True)
dataset



Map:   0%|          | 0/7592 [00:00<?, ? examples/s]

Map:   0%|          | 0/3881 [00:00<?, ? examples/s]

Map:   0%|          | 0/1899 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 7592
    })
    test: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 3881
    })
    validation: Dataset({
        features: ['text', 'label', 'language'],
        num_rows: 1899
    })
})
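
To see what the cleaning step does to an individual tweet, here is a small sketch that applies the same three substitutions to a single string. The helper `clean_text` and the sample tweets are hypothetical. Note that because `'@user'` is removed first, a handle that merely starts with `@user` keeps its tail.

```python
import re

def clean_text(text):
    """Single-text version of the cleaning steps used in `clean`."""
    text = re.sub('@user', '', text)     # drop the '@user' placeholder
    text = re.sub('http', '', text)      # drop the literal string 'http'
    text = re.sub(r'@[\w]+', '', text)   # drop any remaining @mentions
    return text.strip()

print(repr(clean_text("@user see you tonight http")))  # 'see you tonight'
print(repr(clean_text("@username see you")))           # 'name see you'
```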

### Tokenization

We will load the tokenizer of the selected pre-trained model. This tokenizer allows us to transform the input texts into the format required for fine-tuning the model.
All the models considered here are multilingual, which matters because our input texts are written in several languages. For MiniLM we load the XLM-R tokenizer, because Multilingual MiniLM uses the same tokenizer as XLM-R:

In [10]:
from transformers import AutoTokenizer
if 'MiniLM' in MODEL_NAME:
    # we must load the tokenizer of XLM-R
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
else: 
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


Downloading (…)lve/main/config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

Downloading (…)tencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

### Maximum length of texts



In [11]:
import pandas as pd

len_train_texts = [len(tokenizer(text).input_ids) for text in dataset['train']['text']]
df=pd.Series(len_train_texts)
# free the space of this list
del(len_train_texts)
#show the statistics
df.describe(percentiles=[0.25, 0.50, 0.75, 0.85, 0.90, 0.95, 0.99])


count    7592.000000
mean       19.401739
std        13.277689
min         2.000000
25%        11.000000
50%        16.000000
75%        24.000000
85%        30.000000
90%        33.000000
95%        41.000000
99%        75.000000
max       143.000000
dtype: float64

Therefore, we set the maximum length to 50 tokens. This covers more than 95% of the training sequences (the 95th percentile is 41 tokens); longer sequences will be truncated.
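
One way to compare candidate values is to compute the fraction of sequences that fit without truncation. The sketch below uses a short, made-up list of token counts; in the notebook the same function could be applied to the lengths stored in `df`. The helper name `coverage` is hypothetical.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose token count is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# hypothetical token counts, standing in for the real training lengths
lengths = [5, 11, 16, 24, 30, 33, 41, 48, 75, 143]

for max_len in (32, 50, 128):
    print(max_len, coverage(lengths, max_len))
```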

### Data encoding


TODO: Review dynamic padding.


In [12]:
MAX_LEN = 50

def tokenize(examples):
    # apply the tokenizer to the 'text' field of the dataset;
    # every sequence is truncated or padded to exactly MAX_LEN tokens
    return tokenizer(examples["text"], truncation=True, max_length=MAX_LEN, padding='max_length')

#apply tokenizer and remove the columns that we do not need anymore
data_encodings=dataset.map(tokenize, batched=True, remove_columns=['text','language'])
data_encodings


Map:   0%|          | 0/7592 [00:00<?, ? examples/s]

Map:   0%|          | 0/3881 [00:00<?, ? examples/s]

Map:   0%|          | 0/1899 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 7592
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 3881
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1899
    })
})
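
Regarding the TODO on dynamic padding: the cell above pads every sequence to `MAX_LEN`, whereas dynamic padding pads each batch only to the length of its longest member. The following is a minimal sketch of the idea in plain Python, not the notebook's implementation; `pad_batch` and the token ids are hypothetical, and `pad_id=0` is an assumption (the real padding id depends on the tokenizer). In `transformers`, `DataCollatorWithPadding` provides this behaviour when the texts are tokenized without padding.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of its longest member."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# hypothetical token ids for a batch of three short texts
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 2, 102]]
padded = pad_batch(batch)
print(len(padded["input_ids"][0]))  # length of the longest sequence in this batch
```

With a mean length of about 19 tokens, most batches would be padded to far fewer than 50 positions, which is where the potential speed-up comes from.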

## Model

We load the pre-trained model. 

In this case, the **number of labels to be predicted will be only 1**, because it is not a classification task, but rather **a regression problem**. 

As `num_labels` is 1, **AutoModelForSequenceClassification treats the task as regression and uses MSELoss() as the loss function** automatically.
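
To make the loss concrete: mean squared error averages the squared differences between the predicted and the annotated scores. Here is a minimal sketch in plain Python with made-up numbers; the model computes the equivalent quantity on tensors, and the helper name `mse_loss` is hypothetical.

```python
def mse_loss(predictions, labels):
    """Mean squared error between two equally long lists of floats."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

predictions = [2.0, 3.5, 1.0]   # hypothetical model outputs
labels = [2.5, 3.0, 1.0]        # hypothetical intimacy scores

print(mse_loss(predictions, labels))
```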


In [13]:
from transformers import AutoModelForSequenceClassification
# As num_labels is 1, AutoModelForSequenceClassification treats the task as regression and uses MSELoss() as the loss function automatically.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels = 1).to("cuda")


Downloading (…)lve/main/config.json:   0%|          | 0.00/430 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/471M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/Multilingual-MiniLM-L12-H384 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We define a function to compute the appropriate metrics for regression:

In [14]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from scipy import stats

def compute_metrics_for_regression(eval_pred):
    logits, labels = eval_pred
    labels = labels.reshape(-1, 1)

    # loss metrics
    mse = mean_squared_error(labels, logits)
    rmse = np.sqrt(mse)  # same value as squared=False, which newer scikit-learn versions no longer accept
    mae = mean_absolute_error(labels, logits)
    smape = 1/len(labels) * np.sum(2 * np.abs(logits-labels) / (np.abs(labels) + np.abs(logits))*100)
    # performance metrics
    r2 = r2_score(labels, logits)
    pearson=stats.pearsonr(np.squeeze(np.asarray(labels)), np.squeeze(np.asarray(logits)))
    pearson=pearson[0]
    # we return a dictionary with all metrics
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2, "smape": smape, "pearson": pearson}
    # return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2, "smape": smape}
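
SMAPE and Pearson are the two less common metrics in this function, so here is a small worked sketch in plain Python that follows the same formulas. The numbers are invented and the helper names are hypothetical. Note that SMAPE is undefined when a label and its prediction are both exactly 0.

```python
import math

def smape(labels, preds):
    """Symmetric mean absolute percentage error, in percent."""
    terms = [2 * abs(p - y) / (abs(y) + abs(p)) for y, p in zip(labels, preds)]
    return 100 * sum(terms) / len(terms)

def pearson(labels, preds):
    """Pearson correlation coefficient between two lists."""
    mean_y = sum(labels) / len(labels)
    mean_p = sum(preds) / len(preds)
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(labels, preds))
    var_y = sum((y - mean_y) ** 2 for y in labels)
    var_p = sum((p - mean_p) ** 2 for p in preds)
    return cov / math.sqrt(var_y * var_p)

labels = [1.0, 2.0, 4.0]   # hypothetical gold scores
preds = [1.0, 3.0, 2.0]    # hypothetical predictions

print(round(smape(labels, preds), 2), round(pearson(labels, preds), 2))
```

Pearson only measures how well the predictions track the ordering and spread of the labels, so it can be high even when MSE is poor; this is why the notebook reports both families of metrics.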

In [15]:
from transformers import TrainingArguments

NUM_EPOCHS = 3 # paper used 15

# Specify the arguments for the trainer
training_args = TrainingArguments(
    output_dir ='./results',          
    num_train_epochs = NUM_EPOCHS,     
    per_device_train_batch_size = 64, # 128 in the paper   
    per_device_eval_batch_size = 20,   
    weight_decay = 0.01,               
    learning_rate = 2e-5,  # 0.001 in the paper,
    logging_dir = './logs',            
    save_total_limit = 10,
    load_best_model_at_end = True,     
    # metric_for_best_model = 'rmse',    
    metric_for_best_model = 'pearson',     
    evaluation_strategy = "epoch",  # steps in the paper
    save_strategy = "epoch",    # steps in the paper
    report_to = 'all',
) 
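
As a sanity check on this configuration, the number of optimization steps can be estimated from the dataset size and the batch size. The sketch below assumes a single GPU and no gradient accumulation; under those assumptions it matches the checkpoint names (119, 238, 357) that appear in the training log.

```python
import math

num_examples = 7592   # training rows after the validation split
batch_size = 64       # per_device_train_batch_size, assuming a single GPU
num_epochs = 3        # NUM_EPOCHS

# the last, smaller batch of each epoch still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```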

### Trainer

In [16]:
from transformers import Trainer

# Call the Trainer
trainer = Trainer(
    model = model,                         
    args = training_args,                  
    train_dataset = data_encodings['train'], # if you only want to check the training is right, replace with train_dataset = data_encodings['train'].select(range(100))         
    eval_dataset = data_encodings['validation'],  # if you only want to check the training is right, replace with eval_dataset = data_encodings['validation'].select(range(20)),                  
    compute_metrics = compute_metrics_for_regression,     
    #callbacks=[EarlyStoppingCallback(3, 0.0)]
)

# Train the model
trainer.train()


***** Running training *****
  Num examples = 7592
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 357
  Number of trainable parameters = 117654145


|Epoch|Training Loss|Validation Loss|Mse|Rmse|Mae|R2|Smape|Pearson|
|---|---|---|---|---|---|---|---|---|
|1|No log|0.857827|0.857827|0.926189|0.717245|-0.079088|35.309834|0.137056|
|2|No log|0.653535|0.653535|0.808415|0.633439|0.177897|30.914799|0.449989|
|3|No log|0.599774|0.599774|0.774451|0.601442|0.245525|29.150326|0.508251|


***** Running Evaluation *****
  Num examples = 1899
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-119
Configuration saved in ./results/checkpoint-119/config.json
Model weights saved in ./results/checkpoint-119/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 1899
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-238
Configuration saved in ./results/checkpoint-238/config.json
Model weights saved in ./results/checkpoint-238/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 1899
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-357
Configuration saved in ./results/checkpoint-357/config.json
Model weights saved in ./results/checkpoint-357/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-357 (score: 0.5082505777992651).


TrainOutput(global_step=357, training_loss=1.2259750579919468, metrics={'train_runtime': 98.178, 'train_samples_per_second': 231.987, 'train_steps_per_second': 3.636, 'total_flos': 146512730800800.0, 'train_loss': 1.2259750579919468, 'epoch': 3.0})

### Evaluate on the validation dataset
The best model will be evaluated on the validation dataset:

In [17]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1899
  Batch size = 20


{'eval_loss': 0.5997742414474487,
 'eval_mse': 0.599774181842804,
 'eval_rmse': 0.7744508981704712,
 'eval_mae': 0.6014423966407776,
 'eval_r2': 0.2455249237200574,
 'eval_smape': 29.150325829383885,
 'eval_pearson': 0.5082505777992651,
 'eval_runtime': 1.9845,
 'eval_samples_per_second': 956.913,
 'eval_steps_per_second': 47.871,
 'epoch': 3.0}

## Evaluation

The model can also be used directly to predict the scores for the texts in the test dataset. The metrics obtained on the test dataset provide the final evaluation.


### Predictions

The following function takes a raw text (neither tokenized nor encoded) and returns the intimacy score predicted by the model.
To do this, the function encodes the text with the same tokenizer and arguments that were used to transform the training and validation datasets. The model is then applied directly to the encoded input. Its output is a tensor containing the predicted score, and we return that value.

In [18]:
import torch

def get_prediction(text):
    # encode the raw text with the same tokenizer settings used for training
    inputs = tokenizer(text, max_length=MAX_LEN, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
    # inference only: no gradients are needed
    with torch.no_grad():
        outputs = model(**inputs)   # outputs[0] holds the logits, a tensor with a single value
    return outputs[0].item()    # item() extracts the predicted score as a Python float

In [19]:
from google.colab import drive

# mount your google drive
drive.mount('/content/drive')

PATH = "/content/drive/My Drive/Colab Notebooks/proyectos/intimacy/"
PATH_DATA = "/content/drive/My Drive/Colab Notebooks/data/intimacy/"

dataset_test = load_dataset("csv", data_files=PATH_DATA+"test_labeled.csv")
# clean the texts in the test dataset 
# as we used for the texts in the training dataset
dataset_test=dataset_test.map(clean, batched=True)
dataset_test = dataset_test['train']
y_test = dataset_test['label']

# generate predictions for each text
y_pred=[get_prediction(text) for text in dataset_test['text']]

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
diff = [e1 - e2 for e1, e2 in zip(y_pred,y_test)]  # element-wise prediction errors
smape = 1/len(y_test) * np.sum(2 * np.abs(diff) / (np.abs(y_test) + np.abs(y_pred))*100)
# performance metrics
r2 = r2_score(y_test, y_pred)
pearson=stats.pearsonr(np.squeeze(np.asarray(y_test)), np.squeeze(np.asarray(y_pred)))
pearson=pearson[0]

results = {'mse': mse, 'rmse': rmse, 'mae': mae,
           'smape':smape, 'r2':r2, 'pearson':pearson}


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-e1cbdb9d865648ca/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-e1cbdb9d865648ca/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Map:   0%|          | 0/3881 [00:00<?, ? examples/s]

In [21]:

def print_metrics(y_test, y_pred, lang=''):
    # print the table header only for the overall results (no language given)
    if not lang:
        print('|   |MSE|RMSE|MAE|R2|SMAPE|PEARSON|')
        print('|---|---|---|---|---|---|---|')

    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # symmetric mean absolute percentage error (in %)
    diff = [label - pred for (label, pred) in zip(y_test, y_pred)]
    smape = 1/len(y_test) * np.sum(2 * np.abs(diff) / (np.abs(y_test) + np.abs(y_pred))*100)

    pearson = stats.pearsonr(y_test, y_pred)[0]

    # one markdown table row per call
    values = [mse, rmse, mae, r2, smape, pearson]
    print('|' + lang + '|' + '|'.join("{:.2f}".format(v) for v in values) + '|')

In [23]:
print("Results for : ", MODEL_NAME)
# general results
print_metrics(y_test, y_pred)

for lang in sorted(LANGUAGES):
    test_language = dataset['test'].filter(lambda example: example["language"]==lang)
    # print(lang, "number of instances instances in test dataset:", test_language.num_rows)
    # print("Example: ", test_language[0]['text'])
    y_test_lang=test_language['label']
    y_pred_lang=[get_prediction(text) for text in test_language['text']]
    print_metrics(y_test_lang, y_pred_lang, lang)

Results for :  Multilingual-MiniLM-L12-H384
|   |MSE|RMSE|MAE|R2|SMAPE|PEARSON|
|---|---|---|---|---|---|---|
||0.78|0.88|0.68|0.16|31.85|0.43|
|Chinese|0.68|0.82|0.64|0.14|28.24|0.47|
|English|0.55|0.74|0.56|0.26|28.16|0.53|
|French|0.62|0.79|0.60|0.20|29.12|0.46|
|Italian|0.52|0.72|0.55|0.26|28.54|0.51|
|Portuguese|0.60|0.78|0.62|0.18|28.92|0.42|
|Spanish|0.74|0.86|0.64|0.21|27.97|0.56|
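
The per-language loop runs the model again on every text, although predictions for the whole test set are already available in `y_pred`. A possible alternative, sketched here with made-up values, is to group the existing labels and predictions by language. This assumes the language column is aligned row by row with `y_test` and `y_pred`; the helper `group_by_language` is hypothetical.

```python
from collections import defaultdict

def group_by_language(languages, y_true, y_pred):
    """Split aligned label/prediction lists into per-language lists."""
    groups = defaultdict(lambda: ([], []))
    for lang, label, pred in zip(languages, y_true, y_pred):
        groups[lang][0].append(label)
        groups[lang][1].append(pred)
    return dict(groups)

# hypothetical aligned columns, standing in for the test set and its predictions
languages = ["English", "Spanish", "English", "Spanish"]
y_true = [1.0, 2.5, 3.0, 4.0]
y_hat = [1.2, 2.0, 2.8, 3.5]

grouped = group_by_language(languages, y_true, y_hat)
for lang, (labels, preds) in sorted(grouped.items()):
    print(lang, labels, preds)
```

Each `(labels, preds)` pair could then be passed to `print_metrics` together with the language name.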
