<a href="https://colab.research.google.com/github/noetarbouriech/is-it-gorafi/blob/main/is_it_gorafi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Is it Gorafi?

A text classification model that tells headlines from the satirical site Le Gorafi apart from headlines from the newspaper Le Figaro.

## Installing dependencies

In [None]:
!git clone https://github.com/noetarbouriech/is-it-gorafi.git

Cloning into 'is-it-gorafi'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 12 (delta 3), reused 6 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (12/12), 59.05 KiB | 491.00 KiB/s, done.
Resolving deltas: 100% (3/3), done.


In [None]:
!pip install datasets transformers huggingface_hub evaluate scikit-learn optimum[exporters]
!apt-get install git-lfs

Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting optimum[exporters]
  Downloading optimum-1.24.0-py3-none-any.whl.metadata (21 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting onnx (from optimum[exporters])
  Downloading onnx-1.17.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting onnxruntime (from optimum[exporters])
  Downloading onnxruntime-1.20.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11->optimum[exp

## Model training

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline, TrainingArguments, Trainer

SEED=42

# Load dataset
dataset = load_dataset("csv", data_files="is-it-gorafi/dataset.csv")
dataset = dataset.map(lambda x: {"is_gorafi": int(x["is_gorafi"])})

# Split train/test data
train_test_split = dataset["train"].train_test_split(test_size=0.2, seed=SEED, shuffle=True)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Geotrend/distilbert-base-en-fr-cased")
model = AutoModelForSequenceClassification.from_pretrained("Geotrend/distilbert-base-en-fr-cased")

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["title"], padding="max_length", truncation=True)

# Tokenize dataset
tokenized_datasets = train_dataset.map(tokenize_function, batched=True)
tokenized_test_datasets = test_dataset.map(tokenize_function, batched=True)

# Rename label column
tokenized_datasets = tokenized_datasets.rename_column("is_gorafi", "label")
tokenized_test_datasets = tokenized_test_datasets.rename_column("is_gorafi", "label")

# Training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=20,
    learning_rate=2e-5,  # Reduced learning rate
    weight_decay=0.1,  # L2 regularization
    per_device_train_batch_size=32,  # Increase batch size
    #load_best_model_at_end=True,  # Load best model based on validation metrics
    greater_is_better=True,  # Only relevant when a best model is selected (load_best_model_at_end / metric_for_best_model)
    report_to="none",  # Disable logging to Weights & Biases
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_test_datasets
)

# Train model
trainer.train()
trainer.evaluate()

# Initialize the classification pipeline ("sentiment-analysis" is an alias of the text-classification task)
nlp_ara = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Classify a batch of titles
def analyze_sentiment(batch):
    titles = batch['title']
    predictions = nlp_ara(titles)  # Run the classifier on the batch of titles

    # Map the pipeline labels to 1 (Gorafi) or 0 (not Gorafi)
    predicted_is_gorafi = [1 if prediction['label'] == 'LABEL_1' else 0 for prediction in predictions]

    # Return the new 'predicted_is_gorafi' column for this batch
    return {'predicted_is_gorafi': predicted_is_gorafi}

# Apply the classification function to the test set
test_dataset = test_dataset.map(analyze_sentiment, batched=True)

nb_examples = min(100, len(test_dataset))
correct = 0

# Show up to the first 100 results and count the correct predictions
for i in range(nb_examples):
    example = test_dataset[i]  # Access each example
    print(f"Title: {example['title']}")
    print(f"Predicted is_gorafi: {example['predicted_is_gorafi']} | Actual is_gorafi: {example['is_gorafi']}")
    if example['predicted_is_gorafi'] == example['is_gorafi']:
        correct += 1

# Accuracy over the displayed examples only, not the whole test set
print("Accuracy = ", correct / nb_examples)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at Geotrend/distilbert-base-en-fr-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,0.411489
2,No log,0.350016
3,No log,0.3541
4,No log,0.406253
5,No log,0.543961
6,No log,0.542046
7,No log,0.604932
8,No log,0.736801
9,No log,0.68595
10,No log,0.694836


Device set to use cuda:0


Title: Acheté dans une brocante au Mans en 2010, le tableau s’avère être un Modigliani
Predicted is_gorafi: 0 | Actual is_gorafi: False
Title: Un «film coup de poing»:Le Mohicanmet tous les maux de la Corse à l’écran
Predicted is_gorafi: 0 | Actual is_gorafi: False
Title: Jean-Pierre Pernaut sera inhumé dans un cercueil fabriqué à la main par le dernier petit artisan du Poitou
Predicted is_gorafi: 1 | Actual is_gorafi: True
Title: Les cérémonies des Jeux olympiques récompensées, une Victoire de la musique hors-norme
Predicted is_gorafi: 0 | Actual is_gorafi: False
Title: Présidentielle – Un premier sondage place Cthulhu au second tour
Predicted is_gorafi: 1 | Actual is_gorafi: True
Title: Législatives – N’Golo Kanté toujours en tête des intentions de vote
Predicted is_gorafi: 1 | Actual is_gorafi: True
Title: Expulsions – Le RN remercie Valérie Pécresse d’avoir relié Châtelet à Orly en seulement 25 minutes
Predicted is_gorafi: 1 | Actual is_gorafi: True
Title: Budget : «Je voterai la c
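
The validation loss logged above reaches its minimum at epoch 2 and climbs afterwards, which usually points to overfitting. As a minimal sketch (the `val_loss` dictionary is copied by hand from the training log, and `best_epoch` is a hypothetical helper, not part of the notebook), the epoch worth keeping can be picked like this:

```python
# Validation loss per epoch, copied from the training log above.
val_loss = {
    1: 0.411489, 2: 0.350016, 3: 0.3541, 4: 0.406253, 5: 0.543961,
    6: 0.542046, 7: 0.604932, 8: 0.736801, 9: 0.68595, 10: 0.694836,
}

def best_epoch(losses):
    """Return the epoch with the lowest validation loss (hypothetical helper)."""
    return min(losses, key=losses.get)

epoch = best_epoch(val_loss)
print(f"Lowest validation loss: epoch {epoch} ({val_loss[epoch]})")
```

Uncommenting `load_best_model_at_end=True` in `TrainingArguments` should let the `Trainer` make this choice itself, provided `greater_is_better` is not left at `True` while the selection metric is the loss.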

## Using our model

### Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Extract true labels and predictions
true_labels = [example['is_gorafi'] for example in test_dataset]
predicted_labels = [example['predicted_is_gorafi'] for example in test_dataset]

# Compute confusion matrix
cm = confusion_matrix(true_labels, predicted_labels)

print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[257  37]
 [ 52 289]]
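
Summary metrics can be read off this matrix. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predicted labels, class 0 first) and hard-codes the values printed above; the variable names are illustrative.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label.
cm = [[257, 37],
      [52, 289]]

tn, fp = cm[0]  # true class 0 (not Gorafi): correctly rejected / wrongly flagged
fn, tp = cm[1]  # true class 1 (Gorafi): missed / correctly detected

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of "Gorafi" predictions that are right
recall = tp / (tp + fn)     # share of Gorafi titles that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```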


### Using the model on a custom title

In [None]:
test_title = "François Durovray favorable à des «peines planchers» en cas d'attaque envers les forces de l'ordre"
prediction = nlp_ara(test_title)

# Check confidence score and adjust prediction logic
label = prediction[0]['label']
score = prediction[0]['score']

if score < 0.6:  # Set a threshold for confidence score
    predicted_is_gorafi = 0  # If confidence is low, treat it as not is_gorafi
else:
    predicted_is_gorafi = 1 if label == 'LABEL_1' else 0

print(f"Title: {test_title}")
print(f"Predicted is_gorafi: {predicted_is_gorafi} with confidence score: {score}")

Title: François Durovray favorable à des «peines planchers» en cas d'attaque envers les forces de l'ordre
Predicted is_gorafi: 0 with confidence score: 0.9889334440231323
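
The thresholding above can be wrapped in a small function so that it is easy to reuse and test. This is a sketch: `decide_is_gorafi` is a hypothetical name, and it assumes the pipeline output format used in this notebook (a `label` of `LABEL_0` or `LABEL_1` and a `score` for that label).

```python
def decide_is_gorafi(prediction, threshold=0.6):
    """Map one pipeline prediction to 0/1, falling back to 0 when unsure.

    `prediction` is a dict such as {"label": "LABEL_1", "score": 0.93}.
    """
    if prediction["score"] < threshold:
        return 0  # low confidence: treat the title as not Gorafi
    return 1 if prediction["label"] == "LABEL_1" else 0

# Hand-written examples, not real model outputs.
print(decide_is_gorafi({"label": "LABEL_1", "score": 0.93}))
print(decide_is_gorafi({"label": "LABEL_1", "score": 0.55}))
print(decide_is_gorafi({"label": "LABEL_0", "score": 0.99}))
```

With the cell above, this would be called as `decide_is_gorafi(nlp_ara(test_title)[0])`. Since a low-confidence `LABEL_0` already maps to 0, the threshold only changes the outcome for `LABEL_1` predictions scoring below it.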


## Exporting our tokenizer and model in ONNX format

In [None]:
from optimum.exporters.onnx import main_export

onnx_output_dir = "onnx_model"

main_export(
    model_name_or_path="test_trainer/checkpoint-357",
    task="text-classification",
    output=onnx_output_dir,
)
print(f"ONNX model exported to {onnx_output_dir}/ folder")


ONNX model exported to onnx_model/
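
A text-classification model exported to ONNX typically returns raw logits rather than the label and score that the `pipeline` produced earlier. The sketch below shows that post-processing step in plain Python; the function name and the example logits are made up for illustration, and loading the exported model itself (for instance with ONNX Runtime) is left out.

```python
import math

def logits_to_prediction(logits):
    """Convert raw class logits into a pipeline-style {"label", "score"} dict."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax probabilities
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": f"LABEL_{best}", "score": probs[best]}

# Made-up logits for a single title: [not Gorafi, Gorafi].
print(logits_to_prediction([-1.2, 2.3]))
```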
