This notebook trains a word sense disambiguation (WSD) classifier that decides whether a term is used in its social media sense. It uses a pretrained RoBERTa model for Catalan from HuggingFace (https://huggingface.co/projecte-aina/roberta-large-ca-v2).

**NOTE**: the data used here was cleaned and filtered by an earlier notebook, *Data_handling.ipynb*.

In [None]:
from transformers import AutoModel
from transformers import AutoTokenizer
import torch
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from collections import Counter



In [2]:
# create roberta model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-large-ca-v2')
roberta = AutoModel.from_pretrained('projecte-aina/roberta-large-ca-v2',
                                               output_hidden_states = True, # Whether the model returns all hidden-states
                                               )
roberta.eval()

if torch.cuda.is_available(): # NVIDIA GPU (CUDA)
  device='cuda'
elif torch.backends.mps.is_available(): # apple silicon GPU
  device='mps'
else:
  device='cpu' # CPU
roberta.to(device)
print(f"Finished loading RoBERTa on device {device}")

Some weights of RobertaModel were not initialized from the model checkpoint at projecte-aina/roberta-large-ca-v2 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Finished loading RoBERTa on device mps


In [3]:
# File names
folderName = '../Data/' # We are in a subfolder
fileName1 = 'cleanDataset.csv'
fileName2 = 'Manual-partial_annotated.csv'

In [None]:
# Get data from files
df_init = pd.read_csv(folderName+fileName1, sep=";", index_col=0)
df_ann = pd.read_csv(folderName+fileName2, sep=";")

In [5]:
print(df_init.shape)
df_init.head()

(1450, 6)


Unnamed: 0,id,searchQuery,text,timestamp,type_borrowing,cleanText
0,1619677524967190528,repiulet,"un ruzi*, que he barrat tot d'una, vota pel me...",2023-01-29 12:43:00+00:00,Calque,"un ruzi*, que he barrat tot d'una, vota pel me..."
1,1620168513376894976,retuitar,podeu demanar la dimissió de sigfrid gras sens...,2023-01-30 21:14:00+00:00,Full adaptation,podeu demanar la dimissió de sigfrid gras sens...
2,1619742793383170048,retuitar,"perquè, retuitar? perque fa falta",2023-01-29 17:02:00+00:00,Full adaptation,"perquè, retuitar? perque fa falta"
4,1619642801406509056,retuitar,retuit si tu també creus que tampoc s'ha de re...,2023-01-29 10:25:00+00:00,Full adaptation,retuit si tu també creus que tampoc s'ha de re...
6,1619493816993718272,retuitar,"@kanen49 si us plau, deixa de retuitar aquests...",2023-01-29 00:33:00+00:00,Full adaptation,"@kanen49 si us plau, deixa de retuitar aquests..."


In [6]:
print(df_ann.shape)
df_ann.head()

(1457, 4)


Unnamed: 0,id,searchQuery,text,socialMediaSense
0,1619677524967190528,repiulet,"un ruzi*, que he barrat tot d'una, vota pel me...",1.0
1,1620168513376894976,retuitar,podeu demanar la dimissió de sigfrid gras sens...,1.0
2,1619742793383170048,retuitar,"perquè, retuitar? perque fa falta",1.0
3,1619666435776864256,retuitar,"en contra de retuitar genocides, per molt sucó...",1.0
4,1619642801406509056,retuitar,retuit si tu també creus que tampoc s'ha de re...,1.0


In [None]:
# Merge the two DFs into the first. From the second, we only want to keep the 'socialMediaSense' column
df = pd.merge(df_init, df_ann[['id', 'socialMediaSense']], how="inner", on='id')

In [8]:
df.head()

Unnamed: 0,id,searchQuery,text,timestamp,type_borrowing,cleanText,socialMediaSense
0,1619677524967190528,repiulet,"un ruzi*, que he barrat tot d'una, vota pel me...",2023-01-29 12:43:00+00:00,Calque,"un ruzi*, que he barrat tot d'una, vota pel me...",1.0
1,1620168513376894976,retuitar,podeu demanar la dimissió de sigfrid gras sens...,2023-01-30 21:14:00+00:00,Full adaptation,podeu demanar la dimissió de sigfrid gras sens...,1.0
2,1619742793383170048,retuitar,"perquè, retuitar? perque fa falta",2023-01-29 17:02:00+00:00,Full adaptation,"perquè, retuitar? perque fa falta",1.0
3,1619642801406509056,retuitar,retuit si tu també creus que tampoc s'ha de re...,2023-01-29 10:25:00+00:00,Full adaptation,retuit si tu també creus que tampoc s'ha de re...,1.0
4,1619493816993718272,retuitar,"@kanen49 si us plau, deixa de retuitar aquests...",2023-01-29 00:33:00+00:00,Full adaptation,"@kanen49 si us plau, deixa de retuitar aquests...",1.0
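
An inner merge silently drops any row whose `id` has no match on the other side, and `df_init` (1450 rows) and `df_ann` (1457 rows) do not have the same length. The sketch below shows one way to inspect what a merge keeps and drops, using the `indicator` and `validate` parameters of `pd.merge`. The two small frames and their values are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for df_init and df_ann (hypothetical values)
left = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2, 4], "socialMediaSense": [1.0, None, 0.0]})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each row was found in both frames or only in one of them
check = pd.merge(left, right, how="outer", on="id", indicator=True)
merge_counts = check["_merge"].value_counts().to_dict()
print(merge_counts)

# validate="one_to_one" raises a MergeError if an id is duplicated on either side
merged = pd.merge(left, right, how="inner", on="id", validate="one_to_one")
print(merged.shape)
```

Rows counted as `left_only` or `right_only` are the ones an inner merge would discard.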


In [9]:
# Keep only rows annotated with social media sense
df_lab = df[~(df['socialMediaSense'].isna())]

In [10]:
df_lab.shape

(288, 7)

In [11]:
Counter(df_lab['socialMediaSense'])

Counter({0.0: 158, 1.0: 130})

**TRAIN MODEL**

In [12]:
# To train (or fit) our logistic regression, we need the CLS token vector of each text
def get_cls_token(text):
  # Extracts the CLS vector representation for any text

  # first the text is tokenized
  tokenized_text = tokenizer(text, return_tensors="pt")
  # we move the tokenized input to our device (e.g. GPU) so that the model can access it
  tokenized_text = tokenized_text.to(device)
  with torch.no_grad():
    # we pass all the tokens through the model, which outputs a vector representation for each token
    outputs = roberta(**tokenized_text)

  # Note (for people interested in technicalities): last_hidden_state has three dimensions.
  #     The first is the batch. If we gave the model multiple sentences, they would be listed here. As we only input one sentence, we select it (0).
  #     The second is the tokens in the sentence. Here we only take the first (0th) token, which is CLS.
  #     The last is the n dimensions of the embedding. word2vec represented each word with 300 values. Can you check how many there are here?
  #     ":" means we're taking all values
  return outputs.last_hidden_state[0, 0, :].cpu().numpy()

df_lab["cls_vector"] = df_lab["text"].apply(lambda x: get_cls_token(x))

print(df_lab.head())

                    id searchQuery  \
0  1619677524967190528    repiulet   
1  1620168513376894976    retuitar   
2  1619742793383170048    retuitar   
3  1619642801406509056    retuitar   
4  1619493816993718272    retuitar   

                                                text  \
0  un ruzi*, que he barrat tot d'una, vota pel me...   
1  podeu demanar la dimissió de sigfrid gras sens...   
2                  perquè, retuitar? perque fa falta   
3  retuit si tu també creus que tampoc s'ha de re...   
4  @kanen49 si us plau, deixa de retuitar aquests...   

                   timestamp   type_borrowing  \
0  2023-01-29 12:43:00+00:00           Calque   
1  2023-01-30 21:14:00+00:00  Full adaptation   
2  2023-01-29 17:02:00+00:00  Full adaptation   
3  2023-01-29 10:25:00+00:00  Full adaptation   
4  2023-01-29 00:33:00+00:00  Full adaptation   

                                           cleanText  socialMediaSense  \
0  un ruzi*, que he barrat tot d'una, vota pel me...             

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_lab["cls_vector"] = df_lab["text"].apply(lambda x: get_cls_token(x))
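
The `SettingWithCopyWarning` above appears because `df_lab` was produced by filtering `df`, so pandas cannot tell whether the new column is meant for the filtered frame or the original. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` when filtering. The toy frame and the `n_chars` column are hypothetical; `len` is only a placeholder for `get_cls_token`.

```python
import pandas as pd

# Toy frame with one unlabeled row (hypothetical values)
df = pd.DataFrame({"text": ["a", "b", "c"],
                   "socialMediaSense": [1.0, None, 0.0]})

# .copy() makes the filtered frame independent of df,
# so adding a column to it is unambiguous and raises no warning
df_lab = df[df["socialMediaSense"].notna()].copy()
df_lab["n_chars"] = df_lab["text"].apply(len)  # placeholder for get_cls_token

print(df_lab)
```

The original `df` is left untouched; only the copy gains the new column.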


In [None]:
# Train - test dataset split (80-20 split) - keep the balance of the labels equal in both datasets
X_train, X_test, y_train, y_test = train_test_split(df_lab["cls_vector"], df_lab["socialMediaSense"], test_size=0.2, random_state=42, stratify=df_lab["socialMediaSense"])

In [14]:
X_train.shape, X_test.shape

((230,), (58,))
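
To illustrate what `stratify` does in the split above, the following sketch repeats the split on placeholder data with the same label counts as `df_lab` (158 zeros, 130 ones). The integer features are hypothetical stand-ins for the CLS vectors.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Same label distribution as df_lab; features are just row numbers here
labels = [0.0] * 158 + [1.0] * 130
features = list(range(len(labels)))

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratify, both parts keep roughly the same 0/1 proportion as the full set
train_counts, test_counts = Counter(y_tr), Counter(y_te)
print(train_counts, test_counts)
```

Without `stratify`, the class proportions in a test set this small could drift noticeably from one random seed to another.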

In [None]:
# Train a logistic regression on CLS vectors to predict social media sense
# We get our X (features) and our y (classes) from the previous split

# Initialize the model
model = LogisticRegression(max_iter=1000) # simple LR
#model = LogisticRegression(class_weight='balanced', max_iter=1000) # Weighted LR - when the distribution of labels is not balanced

model = model.fit(X_train.tolist(), y_train) # fit the model to the data

predictions_train = model.predict(X_train.tolist()) # produce predictions

In [None]:
# Evaluate quality of classification for TRAINING dataset
accuracy= accuracy_score(y_train, predictions_train)
print(f'Accuracy fitted logistic regression IN TRAINING: {accuracy}')

f1_score_training = f1_score(y_train, predictions_train)
print(f'F1 score fitted logistic regression IN TRAINING: {f1_score_training}')

confusionMatrixTrain = pd.crosstab(predictions_train, y_train) # rows: predicted label, columns: true label
print(confusionMatrixTrain)

Accuracy fitted logistic regression IN TRAINING: 0.9478260869565217
F1 score fitted logistic regression IN TRAINING: 0.9423076923076923
socialMediaSense  0.0  1.0
row_0                     
0.0               120    6
1.0                 6   98


In [None]:
# Try our model with the test data
# We get our X (features) from the previous split
predictions_test = model.predict(X_test.tolist()) # produce predictions

In [None]:
# Evaluate quality of classification for TEST dataset
accuracy= accuracy_score(y_test, predictions_test)
print(f'Accuracy fitted logistic regression IN TESTING: {accuracy}')

f1_score_testing = f1_score(y_test, predictions_test)
print(f'F1 score fitted logistic regression IN TESTING: {f1_score_testing}')

confusionMatrixTest = pd.crosstab(predictions_test, y_test)
print(confusionMatrixTest)

Accuracy fitted logistic regression IN TESTING: 0.7413793103448276
F1 score fitted logistic regression IN TESTING: 0.6666666666666666
socialMediaSense  0.0  1.0
row_0                     
0.0                28   11
1.0                 4   15


In [19]:
print(f"Precision in Testing: {precision_score(y_test, predictions_test)}")
print(f"Recall in Testing: {recall_score(y_test, predictions_test)}")

Precision in Testing: 0.7894736842105263
Recall in Testing: 0.5769230769230769
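
As a sanity check, the test metrics can be recomputed by hand from the confusion matrix printed above, where rows are predicted labels and columns are true labels. This is only an arithmetic sketch; the variable names are chosen for illustration.

```python
# Counts read from the test confusion matrix (rows = predicted, columns = true)
tn, fn = 28, 11   # predicted 0.0: true 0.0, true 1.0
fp, tp = 4, 15    # predicted 1.0: true 0.0, true 1.0

precision = tp / (tp + fp)                      # 15 / 19
recall = tp / (tp + fn)                         # 15 / 26
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)      # 43 / 58

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

The low recall comes from the 11 social-media-sense tweets that the model labeled as 0.0.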


In [20]:
# Summary of the main metrics
print("Classification Report for the model")
print(classification_report(y_test, predictions_test, target_names=['NOT', 'SMS'], digits=3)) # SMS: Social Media Sense

Classification Report for the model
              precision    recall  f1-score   support

         NOT      0.718     0.875     0.789        32
         SMS      0.789     0.577     0.667        26

    accuracy                          0.741        58
   macro avg      0.754     0.726     0.728        58
weighted avg      0.750     0.741     0.734        58
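
With only 58 test rows, a single split gives a noisy estimate, and the gap between training accuracy (about 0.95) and test accuracy (about 0.74) suggests some overfitting. One possible complement, not part of the original workflow, is stratified k-fold cross-validation. The sketch below uses synthetic features as a stand-in for the CLS vectors; the data and the choice of 5 folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic two-class data standing in for the CLS vectors (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (60, 5)), rng.normal(1, 1, (60, 5))])
y = np.array([0] * 60 + [1] * 60)

# Each fold keeps the class balance; every row is used for testing exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")

print(f"F1 per fold: {scores}")
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

On the real data, `X` would be `df_lab["cls_vector"].tolist()` and `y` would be `df_lab["socialMediaSense"]`.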



**ANNOTATE UNLABELED DATA**

In [21]:
# Annotate the unlabeled part of the dataset
df_not = df[(df['socialMediaSense'].isna())]
print(df_not.shape)
df_not.head()

(1121, 7)


Unnamed: 0,id,searchQuery,text,timestamp,type_borrowing,cleanText,socialMediaSense
127,1620168598974267392,respondre,"ui, he llegit cada tuit aquest matí...\ni mira...",2023-01-30 21:14:00+00:00,Calque,"ui, he llegit cada tuit aquest matí... i mira ...",
164,1620181893898715136,post,ella's post 😔,2023-01-30 22:07:00+00:00,Direct borrowing,ella's post 😔,
290,1620140388890849280,comentar,"@estelsiplanetes admirad, joan anton\nenhorabo...",2023-01-30 19:22:00+00:00,Calque,"@estelsiplanetes admirad, joan anton enhorabon...",
291,1620127589808615424,comentar,us volia comentar que avui s'ha estrenat aques...,2023-01-30 18:31:00+00:00,Calque,us volia comentar que avui s'ha estrenat aques...,
292,1620123643425472512,comentar,💣 núria roca denuncia que els fans de shakira ...,2023-01-30 18:15:00+00:00,Calque,💣 núria roca denuncia que els fans de shakira ...,


In [22]:
# Generate CLS vectors for the unlabeled texts
df_not["cls_vector"] = df_not["text"].apply(lambda x: get_cls_token(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_not["cls_vector"] = df_not["text"].apply(lambda x: get_cls_token(x))


In [23]:
# Generate predictions for the unlabeled rows
# We get our X (features)
X_prod = df_not["cls_vector"].tolist()

predictions_prod = model.predict(X_prod) # produce predictions
df_not['socialMediaSense'] = predictions_prod

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_not['socialMediaSense'] = predictions_prod
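
Because the classifier reaches about 0.74 accuracy on the test split, some of these automatic labels will be wrong. One option, sketched here as an assumption rather than as part of the original pipeline, is to use `predict_proba` and keep only predictions whose probability is far from 0.5. The synthetic data and the 0.8 threshold are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 2-D features standing in for the CLS vectors (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0.0] * 50 + [1.0] * 50)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# "Unlabeled" rows to annotate
X_new = rng.normal(0, 1.5, (20, 2))

# Column 1 of predict_proba corresponds to clf.classes_[1], here the label 1.0
proba_sms = clf.predict_proba(X_new)[:, 1]

threshold = 0.8  # hypothetical cut-off; would need tuning on labeled data
confident = (proba_sms >= threshold) | (proba_sms <= 1 - threshold)
print(f"{confident.sum()} of {len(X_new)} rows pass the confidence filter")
```

Rows that fail the filter could be sent for manual annotation instead of being labeled automatically.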


In [24]:
print(df_not.shape)
df_not.head()

(1121, 8)


Unnamed: 0,id,searchQuery,text,timestamp,type_borrowing,cleanText,socialMediaSense,cls_vector
127,1620168598974267392,respondre,"ui, he llegit cada tuit aquest matí...\ni mira...",2023-01-30 21:14:00+00:00,Calque,"ui, he llegit cada tuit aquest matí... i mira ...",1.0,"[0.094049096, -0.10130556, 0.3584217, 0.103393..."
164,1620181893898715136,post,ella's post 😔,2023-01-30 22:07:00+00:00,Direct borrowing,ella's post 😔,1.0,"[0.16650769, -0.27861863, 0.02536476, -0.05767..."
290,1620140388890849280,comentar,"@estelsiplanetes admirad, joan anton\nenhorabo...",2023-01-30 19:22:00+00:00,Calque,"@estelsiplanetes admirad, joan anton enhorabon...",0.0,"[0.27231136, 0.10775298, 0.1529417, 0.06056674..."
291,1620127589808615424,comentar,us volia comentar que avui s'ha estrenat aques...,2023-01-30 18:31:00+00:00,Calque,us volia comentar que avui s'ha estrenat aques...,0.0,"[0.16063462, -0.06442083, 0.10596745, -0.12750..."
292,1620123643425472512,comentar,💣 núria roca denuncia que els fans de shakira ...,2023-01-30 18:15:00+00:00,Calque,💣 núria roca denuncia que els fans de shakira ...,1.0,"[0.16521722, 0.05898673, 0.025104355, -0.16138..."


In [None]:
# Concatenate the DF with previously annotated data and the DF with the automatic labeling
df_concat = pd.concat([df_lab, df_not], axis=0)
print(df_concat.shape)
df_concat.head()

(1409, 8)


Unnamed: 0,id,searchQuery,text,timestamp,type_borrowing,cleanText,socialMediaSense,cls_vector
0,1619677524967190528,repiulet,"un ruzi*, que he barrat tot d'una, vota pel me...",2023-01-29 12:43:00+00:00,Calque,"un ruzi*, que he barrat tot d'una, vota pel me...",1.0,"[0.10117894, -0.23830476, 0.17808442, 0.035533..."
1,1620168513376894976,retuitar,podeu demanar la dimissió de sigfrid gras sens...,2023-01-30 21:14:00+00:00,Full adaptation,podeu demanar la dimissió de sigfrid gras sens...,1.0,"[0.24353136, -0.2335331, 0.12269226, 0.0489439..."
2,1619742793383170048,retuitar,"perquè, retuitar? perque fa falta",2023-01-29 17:02:00+00:00,Full adaptation,"perquè, retuitar? perque fa falta",1.0,"[0.13093077, -0.22990246, 0.11809923, -0.09822..."
3,1619642801406509056,retuitar,retuit si tu també creus que tampoc s'ha de re...,2023-01-29 10:25:00+00:00,Full adaptation,retuit si tu també creus que tampoc s'ha de re...,1.0,"[0.20563349, -0.34878263, 0.10538846, -0.17445..."
4,1619493816993718272,retuitar,"@kanen49 si us plau, deixa de retuitar aquests...",2023-01-29 00:33:00+00:00,Full adaptation,"@kanen49 si us plau, deixa de retuitar aquests...",1.0,"[0.24767199, -0.034266822, -0.0022907257, -0.0..."


In [None]:
# Filter: leave only rows that have a social media sense
df_sms = df_concat[df_concat['socialMediaSense'] == 1.0]
print(df_sms.shape)
df_sms.head()

(477, 8)


Unnamed: 0,id,searchQuery,text,timestamp,type_borrowing,cleanText,socialMediaSense,cls_vector
0,1619677524967190528,repiulet,"un ruzi*, que he barrat tot d'una, vota pel me...",2023-01-29 12:43:00+00:00,Calque,"un ruzi*, que he barrat tot d'una, vota pel me...",1.0,"[0.10117894, -0.23830476, 0.17808442, 0.035533..."
1,1620168513376894976,retuitar,podeu demanar la dimissió de sigfrid gras sens...,2023-01-30 21:14:00+00:00,Full adaptation,podeu demanar la dimissió de sigfrid gras sens...,1.0,"[0.24353136, -0.2335331, 0.12269226, 0.0489439..."
2,1619742793383170048,retuitar,"perquè, retuitar? perque fa falta",2023-01-29 17:02:00+00:00,Full adaptation,"perquè, retuitar? perque fa falta",1.0,"[0.13093077, -0.22990246, 0.11809923, -0.09822..."
3,1619642801406509056,retuitar,retuit si tu també creus que tampoc s'ha de re...,2023-01-29 10:25:00+00:00,Full adaptation,retuit si tu també creus que tampoc s'ha de re...,1.0,"[0.20563349, -0.34878263, 0.10538846, -0.17445..."
4,1619493816993718272,retuitar,"@kanen49 si us plau, deixa de retuitar aquests...",2023-01-29 00:33:00+00:00,Full adaptation,"@kanen49 si us plau, deixa de retuitar aquests...",1.0,"[0.24767199, -0.034266822, -0.0022907257, -0.0..."
