# Pro-Russian comment classification from Le Monde, using SetFit

To edit:
> Third pillar of my project on Le Monde (lmd) comments about the war in Ukraine.  
> Original dataset: 300k comments, then 500 manual annotations using Label Studio.  
> Russian trolls are very few (because of Le Monde's comment restrictions) and pro-Russian comments are quite rare: I used previous work and semantic search to ease identification, plus ChatGPT to generate arguments I could search for.  
> Fine-tuned with a pure PyTorch classification head (instead of a scikit-learn classifier).  
> The SetFit syntax used here is for version 1.0 and later.

In [1]:
!pip install setfit

Collecting setfit
  Obtaining dependency information for setfit from https://files.pythonhosted.org/packages/a4/b0/0afe7c5e0901fece8677746a70f9658c8c7c55dc46c9c947e473c7ed9d77/setfit-1.0.1-py3-none-any.whl.metadata
  Downloading setfit-1.0.1-py3-none-any.whl.metadata (11 kB)
Collecting datasets>=2.3.0 (from setfit)
  Obtaining dependency information for datasets>=2.3.0 from https://files.pythonhosted.org/packages/e2/cf/db41e572d7ed958e8679018f8190438ef700aeb501b62da9e1eed9e4d69a/datasets-2.15.0-py3-none-any.whl.metadata
  Downloading datasets-2.15.0-py3-none-any.whl.metadata (20 kB)
Collecting sentence-transformers>=2.2.1 (from setfit)
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting evaluate>=0.3.0 (from setfit)
  Obtaining dependency information for evaluate>=0.3.0 from 

In [2]:
# wandb login, logging enabled by default in SetFit
import wandb
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
my_secret = user_secrets.get_secret("wandb_key") 
wandb.login(key=my_secret)

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

from datasets import Dataset, DatasetDict, load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset

from sentence_transformers.losses import CosineSimilarityLoss

import torch
import gc

from optuna import Trial



## Load data

#### Load from disk

In [4]:
# filepath = "data/lmd_ukraine_annotated.parquet"
filepath = "/kaggle/input/lmd-annotated/lmd_ukraine_annotated.parquet"

In [5]:
data = pd.read_parquet(filepath)
data.head(3)

Unnamed: 0,article_id,url,title,desc,content,date,keywords,article_type,allow_comments,premium,author,comment,comment_id,classe
0,3259703,https://www.lemonde.fr/actualite-medias/articl...,"Le conflit russo-ukrainien, qui mobilise les m...",Au Festival de journalisme de Couthures : la g...,Parce qu’elle est revenue frapper à nos porte...,2022-07-16,"[international, europe, ukraine, crise-ukraini...",Factuel,True,False,Ricardo Uztarroz,La question qui vaille et qui n'est pas posée...,e7206b56918f694f,pro_russia
1,3259703,https://www.lemonde.fr/actualite-medias/articl...,"Le conflit russo-ukrainien, qui mobilise les m...",Au Festival de journalisme de Couthures : la g...,Parce qu’elle est revenue frapper à nos porte...,2022-07-16,"[international, europe, ukraine, crise-ukraini...",Factuel,True,False,Ricardo Uztarroz,Salandre : les documents dont vous faîtes ét...,d904e44906dfb957,other
2,3259703,https://www.lemonde.fr/actualite-medias/articl...,"Le conflit russo-ukrainien, qui mobilise les m...",Au Festival de journalisme de Couthures : la g...,Parce qu’elle est revenue frapper à nos porte...,2022-07-16,"[international, europe, ukraine, crise-ukraini...",Factuel,True,False,Correcteur,« C’est l’affaire des russes »? C’est donc vot...,1c03f54daeffd1ca,pro_ukraine


In [6]:
print(data.dtypes)
# Cast article_type from category to string so the later conversion to a Hugging Face Dataset goes smoothly
data['article_type'] = data['article_type'].astype(str)

article_id           int64
url                 object
title               object
desc                object
content             object
date                object
keywords            object
article_type      category
allow_comments        bool
premium               bool
author              object
comment             object
comment_id          object
classe              object
dtype: object


#### Classes overview / % annotated labels

The original custom dataset (see my other projects) had 236k comments.  
After custom hashing, cleaning and deduplication, 175k records remain, of which 574 were manually labeled using Label Studio.  
The dataset as a whole is unbalanced "by nature"; the labeled examples are reasonably balanced.  
"Truly" pro-Russian comments were quite hard to find:
1. The comment section is subscribers-only and moderated, so there are almost no trolls.
2. Most commenters support Ukraine.
3. I had to broaden what "pro-Russian" means a little, while trying not to be too harsh on "balanced" comments either. This is highly subjective.

In [7]:
print(len(data))
print(data.classe.value_counts())
print(sum(data.classe.notnull()))
print(sum(data.classe.isnull()))

175353
classe
other          256
pro_ukraine    196
pro_russia     122
Name: count, dtype: int64
574
174779


## Prepare Dataset (labels, optional sample, split)

#### Split, convert to Huggingface DatasetDict

Keep the label distribution representative in the train / eval data (the unannotated dataset as a whole is far more unbalanced), using sklearn's `stratify` option (optional).  
We want roughly 50 examples per class at most (resources often show SetFit working with only 8 examples per class; we could go up to 100). Let's make the most of our painful manual labeling work.  
We have 574 labels; the train dataset is later sampled so that every class has the same number of examples (72 per class in the sampling cell further down). Eval is kept at around 200 to 300 samples. The test data is the remaining, unlabeled data.  
The unlabeled test data could be useful later for optimization through knowledge distillation (teacher / student), which is how SetFit leverages unlabeled data.  



In [8]:
# select labeled data only to split between train and eval, test set is the unlabeled data.
with_labels = data.query("classe.notnull()")
test_df = data.query("classe.isnull()")
print(len(with_labels), len(test_df))

574 174779


In [9]:
# labeled data is split between train and eval sets
# stratify= is optional, but it makes sure the classes share the same distribution in both datasets

train_df, eval_df = train_test_split(with_labels, test_size=0.4, stratify=with_labels['classe'], random_state=40)



In [10]:
# Make sure the smallest class has enough labels (e.g. 8, 20, 50, or at most 100).
# This dataset is sampled again later with setfit.sample_dataset, so every class ends up with the same number of rows (8, 10, 60...).
print(len(train_df))
print(train_df.classe.value_counts())
print(len(eval_df))
print(eval_df.classe.value_counts())

344
classe
other          153
pro_ukraine    118
pro_russia      73
Name: count, dtype: int64
230
classe
other          103
pro_ukraine     78
pro_russia      49
Name: count, dtype: int64
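
The counts printed above make it easy to check that stratification did its job. A minimal sketch in plain Python, with the numbers copied by hand from the `value_counts()` outputs (the helper name `shares` is mine, not part of any library):

```python
# Class counts copied from the value_counts() outputs above.
labeled = {"other": 256, "pro_ukraine": 196, "pro_russia": 122}
train = {"other": 153, "pro_ukraine": 118, "pro_russia": 73}
evaluation = {"other": 103, "pro_ukraine": 78, "pro_russia": 49}


def shares(counts):
    """Return each class's fraction of the total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


labeled_shares = shares(labeled)

# Largest absolute difference between a split's class share and the overall share.
max_drift = max(
    abs(shares(split)[name] - labeled_shares[name])
    for split in (train, evaluation)
    for name in labeled
)

for name in labeled:
    print(
        f"{name:12s} all={labeled_shares[name]:.3f} "
        f"train={shares(train)[name]:.3f} eval={shares(evaluation)[name]:.3f}"
    )
print(f"largest drift: {max_drift:.4f}")
```

The class shares in both splits stay well under one percentage point away from the shares in the full labeled set, which is what `stratify=` is meant to guarantee.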


In [11]:
# For labeled data, add a 'label' column that maps the 'classe' strings to ints
# The SetFit (torch) classification head requires ints, not floats
label_mapping = {'pro_ukraine': 0, 'pro_russia': 1, 'other': 2}
for df in [train_df, eval_df]:
    df['label'] = df['classe'].map(label_mapping)

Convert to the Hugging Face `Dataset` format to streamline the operations, and to push to the Hub later.

In [12]:
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)
test_dataset = Dataset.from_pandas(test_df)

# wrap the splits in a DatasetDict, the commonly used Hugging Face format
dataset = DatasetDict(
    {
        'train': train_dataset,
        'validation': eval_dataset,
        'test': test_dataset
    }
)



In [13]:
# store the number of classes, used later when loading the model
num_classes = len(train_dataset.unique("label"))
num_classes

3

## Modeling

Candidate models; something larger could also be used.  
- https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (900MB)  
- https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (470MB)
- https://huggingface.co/dangvantuan/sentence-camembert-base
- https://huggingface.co/dangvantuan/sentence-camembert-large (1GB)

Training with SetFit consists of two phases behind the scenes: 1. fine-tuning the embeddings and 2. training a classification head.  
Depending on the SetFit version, you might need to import the (old) `SetFitTrainer` instead of `Trainer`.  
Refer to the Hugging Face SetFit [documentation](https://huggingface.co/docs/setfit/how_to/overview) rather than the GitHub repository for up-to-date resources.
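
To make phase 1 more concrete: the embeddings are fine-tuned contrastively on pairs of examples, where a pair from the same class gets a target similarity of 1 and a pair from different classes gets 0. Below is a simplified sketch of that pair construction in plain Python. The function name `build_pairs` and the toy texts are made up for illustration, and SetFit's own pair sampling is more elaborate than this.

```python
from itertools import combinations


def build_pairs(texts, labels):
    """Pair every two examples; target is 1.0 if their labels match, else 0.0."""
    pairs = []
    for (t1, l1), (t2, l2) in combinations(zip(texts, labels), 2):
        pairs.append((t1, t2, 1.0 if l1 == l2 else 0.0))
    return pairs


# Toy, made-up examples (not taken from the dataset).
texts = ["comment a", "comment b", "comment c", "comment d"]
labels = [0, 0, 1, 2]

pairs = build_pairs(texts, labels)
positives = [p for p in pairs if p[2] == 1.0]
print(len(pairs), "pairs,", len(positives), "positive")
```

Even a handful of labeled comments yields many training pairs, which is why the approach works with few examples per class.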

In [14]:
# "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"

# dangvantuan/sentence-camembert-base
# dangvantuan/sentence-camembert-large

# Lajavaness/sentence-camembert-base
# Lajavaness/sentence-camembert-large

### A. Fine-tune, using a PyTorch head

The SetFit docs recommend the sklearn logistic regression head, though (see option B.). It also performs a bit better in our use case.  
Here, specifying `use_differentiable_head=True` selects `SetFitHead`, a custom torch classification head.  
To use your own custom classification head, see [here](https://huggingface.co/docs/setfit/how_to/classification_heads).  

#### Load Model

In [15]:
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    use_differentiable_head=True, head_params={"out_features": num_classes})

model.labels = ["pro_ukraine", "pro_russia", "other"]
print(model.model_head)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


SetFitHead({'in_features': 768, 'out_features': 3, 'temperature': 1.0, 'bias': True, 'device': 'cuda'})
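
One thing worth double-checking at this point: the order of `model.labels` has to agree with the integer ids in `label_mapping` defined earlier, otherwise predictions would be reported under the wrong class name. A small sanity check in plain Python (the names `id2label`, `labels_in_id_order` and `model_labels` are mine):

```python
label_mapping = {"pro_ukraine": 0, "pro_russia": 1, "other": 2}

# Invert the mapping and list the class names ordered by class id.
id2label = {idx: name for name, idx in label_mapping.items()}
labels_in_id_order = [id2label[i] for i in range(len(id2label))]

# Same list as the one assigned to model.labels in the cell above.
model_labels = ["pro_ukraine", "pro_russia", "other"]
assert labels_in_id_order == model_labels, "label order mismatch"
print(labels_in_id_order)
```

Deriving the list from the mapping, e.g. `model.labels = labels_in_id_order`, would avoid keeping the two in sync by hand.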


#### Set Trainer args

One might experiment with [sampling_strategy](https://huggingface.co/docs/setfit/v1.0.0/en/reference/trainer#setfit.TrainingArguments) (e.g. `undersampling` or `unique`) for the minority class "pro_russia".  
According to the SetFit docs, `num_epochs`, `max_steps` and `body_learning_rate` are the most important arguments for phase 1.  
Customize your training arguments here, through the `setfit.TrainingArguments` class.  
Tuples correspond to the two phases: 1. fine-tuning the embeddings, 2. training the classification head.

In [16]:
args = TrainingArguments(
    batch_size=(32, 16),  # default (16, 2); second value is for the classification head (SetFitHead)
    num_epochs=(1, 16),  # default (1, 16)
    end_to_end=True,  # if False (default), freeze the body and train the head only; if True, train the entire model during the classifier phase
    body_learning_rate=(1e-07, 3e-06),  # default (2e-5, 1e-5); the second value only applies if end_to_end is True (else the body is frozen in that phase)
    head_learning_rate=2e-3,  # default 1e-2
    # l2_weight=0.01,  # optional L2 weight for body and head, passed to the AdamW optimizer during classifier training
    sampling_strategy='oversampling',  # default; largely replaces the num_iterations argument, which still exists
    max_steps=-1,  # default -1; if positive, overrides num_epochs and caps the number of training steps
    evaluation_strategy='steps',
    eval_steps=100,  # evaluate every eval_steps
    save_strategy='steps',
    load_best_model_at_end=True,  # required for early stopping
    # report_to='none',  # set to "none" if wandb is enabled and you do not want to log to it
)

Optionally, sample the dataset.  
Note: with `sampling_strategy` set to 'unique' or 'undersampling', all minority class examples are drawn anyway, so that is another lever for adjusting the minimum number of examples.

In [17]:
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=72, seed=40)
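
A back-of-the-envelope count of what 72 examples per class means for phase 1. This sketch assumes that pairs are all unordered combinations of two examples, that `unique` draws each pair once, and that `undersampling` / `oversampling` draw as many positive as negative pairs (matching the smaller or the larger group respectively). That is my reading of the SetFit docs, not a copy of its implementation; the function names are mine.

```python
from math import comb


def pair_counts(per_class, num_classes):
    """Count same-class (positive) and cross-class (negative) unordered pairs."""
    n_total = per_class * num_classes
    positives = num_classes * comb(per_class, 2)
    negatives = comb(n_total, 2) - positives
    return positives, negatives


def pairs_for_strategy(positives, negatives, strategy):
    """Total number of pairs drawn under each sampling strategy (assumed behaviour)."""
    if strategy == "unique":
        return positives + negatives
    if strategy == "undersampling":
        return 2 * min(positives, negatives)
    if strategy == "oversampling":
        return 2 * max(positives, negatives)
    raise ValueError(f"unknown strategy: {strategy}")


positives, negatives = pair_counts(per_class=72, num_classes=3)
totals = {
    s: pairs_for_strategy(positives, negatives, s)
    for s in ("unique", "undersampling", "oversampling")
}
steps = totals["oversampling"] // 32  # embedding batch size from TrainingArguments

print(positives, negatives, totals, steps)
```

Under these assumptions, oversampling with a batch size of 32 gives 972 steps for one epoch, which is the number of optimization steps reported in the training log further down.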



Instantiate the trainer.  

In [18]:
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric='accuracy', #default
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    column_mapping={"comment": "text", "label": "label"}, # cols expected by the model   
)
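
For reference, the idea behind `early_stopping_patience=3` can be sketched in a few lines of plain Python. This is a simplified illustration, not the `transformers` implementation: it ignores the improvement threshold and assumes a monitored loss where lower is better. The function name and the loss values are hypothetical.

```python
def early_stopping_step(eval_losses, patience=3):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            bad_evals = 0
        else:
            # No improvement: stop once this happened `patience` times in a row.
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Hypothetical evaluation losses, one per eval_steps interval.
losses = [0.30, 0.27, 0.26, 0.265, 0.27, 0.28, 0.25]
stop_at = early_stopping_step(losses, patience=3)
print(stop_at)
```

With `load_best_model_at_end=True`, the checkpoint restored after stopping is the best one seen so far, not the last one.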

Applying column mapping to the training dataset
Applying column mapping to the evaluation dataset

#### Train


In [19]:
trainer.train()

***** Running training *****
  Num examples = 972
  Num epochs = 1
  Total optimization steps = 972
  Total train batch size = 32
wandb: Currently logged in as: vionmatthieu. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /kaggle/working/wandb/run-20231220_135439-k4jwezz5
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fiery-fire-3
wandb: ⭐️ View project at https://wandb.ai/vionmatthieu/setfit
wandb: 🚀 View run at https://wandb.ai/vionmatthieu/setfit/runs/k4jwezz5


Step,Training Loss,Validation Loss,Embedding Loss,Rate
100,No log,No log,0.2789,0.0
200,No log,No log,0.2759,0.0
300,No log,No log,0.2739,0.0
400,No log,No log,0.2721,0.0
500,No log,No log,0.2709,0.0
600,No log,No log,0.2701,0.0
700,No log,No log,0.2696,0.0
800,No log,No log,0.269,0.0
900,No log,No log,0.2687,0.0


Loading best SentenceTransformer model from step 500.
The `max_length` is `None`. Using the maximum acceptable length according to the current model body: 128.



#### Evaluate

In [20]:
trainer.evaluate()

***** Running evaluation *****



{'accuracy': 0.6478260869565218}
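
To put this accuracy in context, a quick bit of arithmetic using the eval class counts printed earlier (103 / 78 / 49). The baseline here is a hypothetical classifier that always predicts the largest class; it is only a reference point, not something that was trained in this notebook.

```python
accuracy = 0.6478260869565218  # value returned by trainer.evaluate()
eval_counts = {"other": 103, "pro_ukraine": 78, "pro_russia": 49}

n_eval = sum(eval_counts.values())
n_correct = round(accuracy * n_eval)

# Accuracy of always predicting the most frequent class.
majority_baseline = max(eval_counts.values()) / n_eval

print(f"{n_correct} / {n_eval} correct")
print(f"majority-class baseline: {majority_baseline:.3f}")
```

Since the classes are unbalanced, per-class precision and recall (for "pro_russia" in particular) would say more than overall accuracy.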

### B. Hyperparameter Optimization (Optuna), with NN

print(best_run)