This notebook **fine-tunes** a pre-trained CamemBERT model to classify the difficulty level of French sentences. The training set pairs each sentence with a difficulty label from 0 to 5. The sentences are tokenized, the resulting tensors form the training dataset, and the model is trained for several epochs by minimizing the cross-entropy loss between its predictions and the true difficulty levels. The Adam optimizer updates the model weights.

**This process adapts the pre-trained model to the specific task of classifying sentence difficulty levels.**
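
To make the training objective concrete, here is a minimal sketch of the cross-entropy loss for a single sentence, written in plain Python rather than PyTorch. The function name cross_entropy and the example scores are illustrative assumptions and do not appear in the notebook; the six scores stand for the six difficulty classes.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical scores for the six difficulty classes (0..5).
uniform_loss = cross_entropy([0.0] * 6, target=2)  # no preference between classes
confident_loss = cross_entropy([0.0, 0.0, 8.0, 0.0, 0.0, 0.0], target=2)

print(f"uniform:   {uniform_loss:.4f}")    # equals log(6)
print(f"confident: {confident_loss:.4f}")  # much smaller than the uniform case
```

A classifier that scores all six classes equally pays log(6), roughly 1.79, per sentence, so an average epoch loss well below that value suggests the model is learning.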

In [1]:
import pandas as pd
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader, Dataset
import transformers
from transformers import CamembertForSequenceClassification, CamembertTokenizer, AutoModelForSequenceClassification, AutoTokenizer
from torch.optim import Adam
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

We use fine-tuning with the complete dataset (without augmentation).

In [2]:
# Read the CSV file
training_data = pd.read_csv("../Dataset_upgrade/training_dataUP.csv", index_col=0)
training_data.head()

id,sentence,difficulty,note_orthographe,lexical_complexite,char_length,word_length,type_token_ratio,sentence_length,avg_word_length,complexite_texte,...,DET,PRON,NUM,NOUN,INTJ,ADP,ADJ,VERB,PROPN,SCONJ
0,Les coûts kilométriques réels peuvent diverger...,4,1.0,0.194007,0.160077,0.140152,0.467105,0.140152,0.339713,0.244565,...,0.066667,0.0,0.0,0.311111,0.0,0.288889,0.066667,0.088889,0.0,0.0
1,"Le bleu, c'est ma couleur préférée mais je n'a...",0,1.0,0.082334,0.03699,0.041667,1.0,0.041667,0.204545,0.086957,...,0.1875,0.125,0.0,0.125,0.0,0.0,0.0,0.125,0.0625,0.0
2,Le test de niveau en français est sur le site ...,0,0.769231,0.088078,0.039541,0.045455,0.826923,0.045455,0.195804,0.081522,...,0.2,0.0,0.0,0.4,0.0,0.266667,0.0,0.066667,0.0,0.0
3,Est-ce que ton mari est aussi de Boston?,0,1.0,0.062664,0.022959,0.026515,1.0,0.026515,0.193182,0.054348,...,0.0,0.1,0.0,0.3,0.0,0.1,0.0,0.1,0.1,0.1
4,"Dans les écoles de commerce, dans les couloirs...",2,1.0,0.184993,0.13074,0.125,0.602941,0.125,0.28877,0.228261,...,0.095238,0.047619,0.047619,0.238095,0.0,0.261905,0.047619,0.095238,0.0,0.0


In [3]:
# CamemBERT Large
model_name = "dangvantuan/sentence-camembert-large"

In [4]:
# Data preparation
sentences = training_data['sentence'].tolist()
labels = training_data['difficulty'].tolist()

In [5]:
# Tokenizer and model initialization
tokenizer = CamembertTokenizer.from_pretrained(model_name)
model = CamembertForSequenceClassification.from_pretrained(model_name, num_labels=6)
tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at dangvantuan/sentence-camembert-large and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
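
The last warning above means that truncation=True has no effect here, because no maximum length is given; passing an explicit max_length to the tokenizer call would presumably make it take effect. The toy sketch below illustrates what padding and truncation do to a batch of token ids. It is not the tokenizer's implementation: the function pad_and_truncate and the token ids are hypothetical, the pad id of 1 is taken from padding_idx=1 in the model printout, and the special tokens a real tokenizer keeps at both ends are ignored.

```python
def pad_and_truncate(token_id_lists, max_length, pad_id=1):
    """Cut each sequence to max_length, then pad to the longest kept length."""
    longest = min(max(len(ids) for ids in token_id_lists), max_length)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        kept = ids[:longest]                      # truncation
        n_pad = longest - len(kept)               # padding needed for this row
        input_ids.append(kept + [pad_id] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return input_ids, attention_mask

# Hypothetical token ids for one long and one short sentence.
batch = [[5, 6, 7, 8, 9, 10], [5, 6]]
ids, mask = pad_and_truncate(batch, max_length=4)
print(ids)   # [[5, 6, 7, 8], [5, 6, 1, 1]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside input_ids during training.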


The next cell fine-tunes CamemBERT for sentence difficulty classification. It wraps the tokenized inputs and the labels in a TensorDataset, trains with the Adam optimizer for six epochs, and prints the average loss per epoch. Because the labels are passed to the model, the cross-entropy loss is read from outputs.loss; the separately defined criterion is not used in the loop. The tqdm library provides a progress bar during training.

**Do not run, as this is very time-consuming. The generated model can be found in the file model_only.pth**

# Retrieve the necessary tensors from the BatchEncoding
input_ids = tokens['input_ids']
attention_mask = tokens['attention_mask']

# Create a training dataset
train_dataset = torch.utils.data.TensorDataset(input_ids, attention_mask, torch.tensor(labels))
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)

# Define the loss function
criterion = torch.nn.CrossEntropyLoss()

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Model training
for epoch in range(6):
    model.train()
    total_loss = 0

    # Use tqdm for the progress bar
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{6}"):
        input_ids_batch, attention_mask_batch, labels_batch = batch

        # Build the input dictionary for the model
        inputs = {'input_ids': input_ids_batch, 'attention_mask': attention_mask_batch, 'labels': labels_batch}
           
        outputs = model(**inputs)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Display the average loss for the epoch
    print(f"Average loss for epoch: {total_loss / len(train_loader)}")


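The next cell (left commented out) saves the state dictionary of the PyTorch model to a file named "modele.pth". This state dictionary contains all the learnable parameters of the model and their current values. A second, fuller checkpoint, "complet.pth", also stores the epoch, the optimizer state, and the last loss.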

In [None]:
# torch.save(model.state_dict(), "modele.pth")
# torch.save({
#    'epoch': epoch,
#    'model_state_dict': model.state_dict(),
#    'optimizer_state_dict': optimizer.state_dict(),
#    'loss': loss,  # You can save the average loss or the last recorded loss
#    # Add other metrics if needed
#}, "complet.pth")

The pre-trained CamemBERT model and its tokenizer are loaded again. The weights of the fine-tuned model are then restored from the file "model_only.pth" (the commented-out line shows the equivalent load from the full checkpoint "complet.pth"), and the model is switched to evaluation mode with model.eval() so that it can be used for inference on new data.

From here on, we use the training data with data augmentation. To augment the data, we translated each sentence into English and then back into French (back-translation). This gives us more data, with sentences that differ slightly from the originals.
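
The notebook does not show how the augmented file was produced, so the following is only a sketch of the round-trip idea under stated assumptions. The translate callable is hypothetical (any French/English translation function could be plugged in), fake_translate is a stand-in used to keep the example runnable, and skipping sentences that come back unchanged is a design choice of this sketch, not something the source confirms.

```python
def augment_with_back_translation(rows, translate):
    """Return rows plus French -> English -> French variants with the same label."""
    augmented = list(rows)
    seen = {sentence for sentence, _ in rows}
    for sentence, label in rows:
        english = translate(sentence, source="fr", target="en")
        round_trip = translate(english, source="en", target="fr")
        # Assumption of this sketch: keep only variants that differ from known text.
        if round_trip not in seen:
            seen.add(round_trip)
            augmented.append((round_trip, label))
    return augmented

# Stand-in translator backed by a lookup table (hypothetical data).
_FAKE = {
    ("fr", "en"): {"Il fait beau.": "The weather is nice.", "Bonjour.": "Hello."},
    ("en", "fr"): {"The weather is nice.": "Le temps est agréable.", "Hello.": "Bonjour."},
}

def fake_translate(text, source, target):
    return _FAKE[(source, target)][text]

rows = [("Il fait beau.", 0), ("Bonjour.", 0)]
augmented = augment_with_back_translation(rows, fake_translate)
print(augmented[-1])  # ('Le temps est agréable.', 0)
```

The paraphrase inherits the difficulty label of its source sentence, which assumes that the round trip does not change how hard the sentence is.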

In [6]:
training_data = pd.read_csv("../Dataset_upgrade/augmented_training_dataUP.csv", index_col=0)

In [7]:
from transformers import CamembertForSequenceClassification, CamembertTokenizer
import torch

# Load the tokenizer and the pre-trained model
model_name = "dangvantuan/sentence-camembert-large"

tokenizer = CamembertTokenizer.from_pretrained(model_name)
model = CamembertForSequenceClassification.from_pretrained(model_name, num_labels=6)

# Load the saved weights of the fine-tuned model
model.load_state_dict(torch.load("model_only.pth"))
#model.load_state_dict(torch.load("complet.pth")['model_state_dict'])

model.eval()

Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at dangvantuan/sentence-camembert-large and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


CamembertForSequenceClassification(
  (roberta): CamembertModel(
    (embeddings): CamembertEmbeddings(
      (word_embeddings): Embedding(32005, 1024, padding_idx=1)
      (position_embeddings): Embedding(514, 1024, padding_idx=1)
      (token_type_embeddings): Embedding(1, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): CamembertEncoder(
      (layer): ModuleList(
        (0-23): 24 x CamembertLayer(
          (attention): CamembertAttention(
            (self): CamembertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): CamembertSelfOutput(
              (dense): Linear(in_features=1024, out_features=10

In [8]:
import numpy as np
sentences = training_data['sentence'].tolist()
labels = training_data['difficulty'].tolist()

# Initialize a list to store the extracted features
all_features = []

# Loop over the data with a progress bar
for sentence in tqdm(sentences, desc="Processing sentences"):
    # Tokenize the sentence and run the fine-tuned classifier on it
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    # Inference only: no_grad() avoids building the autograd graph
    with torch.no_grad():
        outputs = model(**inputs)
    # The "features" are the six class logits produced by the classifier
    features = outputs.logits.squeeze().detach().numpy()

    # Add the features to the list
    all_features.append(features)

Processing sentences:   0%|          | 0/9600 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Processing sentences: 100%|██████████| 9600/9600 [26:10<00:00,  6.11it/s]


In [9]:
training_data_x = training_data.drop(columns=["difficulty", "sentence"])
combined_features = np.concatenate((all_features, training_data_x), axis=1)
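
The concatenation above places the six classifier logits and the handcrafted columns side by side, one row per sentence. A small sketch with made-up numbers shows the resulting shape; the arrays below are hypothetical stand-ins for all_features and training_data_x, reduced to two sentences and three handcrafted columns.

```python
import numpy as np

# Hypothetical stand-ins: two sentences, six logits each.
all_features = [
    np.array([4.1, 0.3, -1.2, -2.0, -2.5, -3.0]),
    np.array([-2.2, -1.0, 0.4, 3.8, 0.9, -1.5]),
]
# Hypothetical handcrafted features: three columns per sentence.
handcrafted = np.array([
    [1.0, 0.082, 0.037],
    [1.0, 0.185, 0.131],
])

logits = np.stack(all_features)  # shape (2, 6)

# Both blocks need the same number of rows before joining along axis=1.
assert logits.shape[0] == handcrafted.shape[0]
combined = np.concatenate((logits, handcrafted), axis=1)
print(combined.shape)  # (2, 9)
```

With the real data the number of columns would be six plus however many columns remain after dropping "difficulty" and "sentence".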

In [10]:
from sklearn.model_selection import train_test_split

X = combined_features  # CamemBERT logits + handcrafted features
y = training_data["difficulty"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Find the best SVM hyperparameters with a grid search.

In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10, 100],  
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']  
}

# SVM + GridSearchCV
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', verbose=2)

grid_search.fit(X_train, y_train)

print("Best parameters: ", grid_search.best_params_)

best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test, y_test)
print("Accuracy with the best parameters: ", accuracy)


Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.1s
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.1s
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.1s
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.1s
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time=   0.1s
[CV] END ....................C=0.1, gamma=scale, kernel=poly; total time=   0.2s
[CV] END ....................C=0.1, gamma=scale, kernel=poly; total time=   0.2s
[CV] END ....................C=0.1, gamma=scale, kernel=poly; total time=   0.2s
[CV] END ....................C=0.1, gamma=scale, kernel=poly; total time=   0.2s
[CV] END ....................C=0.1, gamma=scale, kernel=poly; total time=   0.2s
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time=   0.3s
[CV] END .....................C=0.1, gamma=scal
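
The first line of the log follows from simple counting: 4 values of C, 4 kernels, and 2 gamma settings give 32 candidates, and 5 folds each give 160 fits. The sketch below enumerates the grid with the standard library only; sorting the parameter names is a choice made here so that the order matches the log shown above, not a claim about how GridSearchCV is implemented.

```python
from itertools import product

param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
}

# Enumerate every combination; the last sorted key ('kernel') varies fastest.
names = sorted(param_grid)  # ['C', 'gamma', 'kernel']
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 5
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 32 160
print(candidates[1])  # {'C': 0.1, 'gamma': 'scale', 'kernel': 'poly'}
```

Each extra value added to one of the lists multiplies the number of fits, which is worth keeping in mind before widening the grid.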

In [11]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import joblib

#  SVM (classification)
svm_model = SVC(C=10, gamma='scale', kernel='rbf')

svm_model.fit(X_train, y_train)
# Save the model and tokenizer

#joblib.dump(svm_model, 'svm_model.pkl')
#tokenizer.save_pretrained("camembert_tokenizer")
#model.save_pretrained("camembert_model")

accuracy = svm_model.score(X_test, y_test)
y_pred = svm_model.predict(X_test)
print("SVM model accuracy:", accuracy)

# Same accuracy, recomputed with accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the SVM model:", accuracy)

report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

SVM model accuracy: 0.9135416666666667
Accuracy of the SVM model: 0.9135416666666667
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.95      0.95       311
           1       0.88      0.91      0.90       327
           2       0.84      0.88      0.86       323
           3       0.93      0.92      0.92       344
           4       0.93      0.92      0.93       317
           5       0.95      0.91      0.93       298

    accuracy                           0.91      1920
   macro avg       0.92      0.91      0.91      1920
weighted avg       0.91      0.91      0.91      1920
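
To read the columns of the report: precision is the share of predictions for a class that are correct, recall is the share of true members of a class that are found, f1-score is their harmonic mean, and support is the number of true examples of the class. The sketch below recomputes these quantities on a tiny made-up example; per_class_report is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for each class label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but wrongly
            fn[truth] += 1  # missed a true member of this class
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual}
    return report

# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = per_class_report(y_true, y_pred)
print(report[0])  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'support': 2}
```

In the report above the supports add up to 1920, which is the 20% test split of the 9600 augmented sentences.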






**Predicting for new data**



In [12]:
test_data = pd.read_csv("../Dataset_upgrade/unlabelled_test_dataUP.csv", index_col=0)
test_data2 = pd.read_csv("../Dataset/unlabelled_test_data.csv", index_col=0)

In [12]:
test_data2.head()

id,sentence
0,Nous dûmes nous excuser des propos que nous eû...
1,Vous ne pouvez pas savoir le plaisir que j'ai ...
2,"Et, paradoxalement, boire froid n'est pas la b..."
3,"Ce n'est pas étonnant, car c'est une saison my..."
4,"Le corps de Golo lui-même, d'une essence aussi..."


In [13]:
import numpy as np
sentences = test_data['sentence'].tolist()

all_features2 = []

for sentence in tqdm(sentences, desc="Processing sentences"):
    # Tokenization and feature extraction for each sentence
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    # Inference only: no_grad() avoids building the autograd graph
    with torch.no_grad():
        outputs = model(**inputs)
    features = outputs.logits.squeeze().detach().numpy()

    all_features2.append(features)

Processing sentences: 100%|██████████| 1200/1200 [02:30<00:00,  7.99it/s]


In [14]:

test_data_x = test_data.drop(columns=["sentence"])
combined_features = np.concatenate((all_features2, test_data_x), axis=1)

In [15]:

# Predictions
predicted_labels = svm_model.predict(combined_features)
test_data2['difficulty'] = predicted_labels

print(test_data2.head())

                                             sentence  difficulty
id                                                               
0   Nous dûmes nous excuser des propos que nous eû...           5
1   Vous ne pouvez pas savoir le plaisir que j'ai ...           2
2   Et, paradoxalement, boire froid n'est pas la b...           3
3   Ce n'est pas étonnant, car c'est une saison my...           1
4   Le corps de Golo lui-même, d'une essence aussi...           5


In [16]:
test_data2 = test_data2.drop(columns=['sentence'])
difficulty_mapping = {
    0: 'A1',
    1: 'A2',
    2: 'B1',
    3: 'B2',
    4: 'C1',
    5: 'C2'
}

test_data2['difficulty'] = test_data2['difficulty'].map(difficulty_mapping)
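
The mapping turns the integer classes into CEFR levels (A1 to C2). A pandas map call silently produces NaN for any value missing from the dictionary, so a small stand-alone check can be useful. The helper to_cefr below is a hypothetical illustration, not part of the notebook; the sample labels are the first five predictions printed earlier.

```python
difficulty_mapping = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1', 5: 'C2'}

def to_cefr(predicted_labels, mapping=difficulty_mapping):
    """Map integer classes to CEFR levels, failing loudly on unknown labels."""
    unknown = sorted(set(predicted_labels) - set(mapping))
    if unknown:
        raise ValueError(f"Labels without a CEFR level: {unknown}")
    return [mapping[label] for label in predicted_labels]

levels = to_cefr([5, 2, 3, 1, 5])
print(levels)  # ['C2', 'B1', 'B2', 'A2', 'C2']
```

Raising an error instead of returning NaN makes an unexpected class label visible before the submission file is written.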

In [17]:
print(test_data2.head(30))

   difficulty
id           
0          C2
1          B1
2          B2
3          A2
4          C2
5          C2
6          A2
7          A2
8          C1
9          A2
10         A2
11         A2
12         B2
13         C1
14         A1
15         A2
16         C1
17         A2
18         A2
19         A2
20         C2
21         C1
22         C1
23         C1
24         B1
25         C2
26         A1
27         A1
28         C2
29         B2


In [26]:
test_data2.shape

(1200, 1)

In [18]:
test_data2.to_csv('philippe+augmentation2.csv', index=True)