<a href="https://colab.research.google.com/github/noetarbouriech/is-it-gorafi/blob/main/newspaper_theme.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Newspaper category?

Text classification model for recognizing Gorafi news articles by comparing them with Figaro news articles, trained on article titles.

## Dependencies installation

In [None]:
!git clone https://github.com/noetarbouriech/is-it-gorafi.git

Cloning into 'is-it-gorafi'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 18 (delta 5), reused 7 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (18/18), 185.49 KiB | 3.64 MiB/s, done.
Resolving deltas: 100% (5/5), done.


In [None]:
!pip install datasets transformers huggingface_hub evaluate scikit-learn optimum[exporters]
!apt-get install git-lfs

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting optimum[exporters]
  Downloading optimum-1.24.0-py3-none-any.whl.metadata (21 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting onnx (from optimum[exporters])
  Downloading onnx-1.17.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting onnxruntime (from optimum[exporters])
  Downloading onnxruntime-1.21.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11->optimum[exp

# Is it Gorafi?

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline, TrainingArguments, Trainer

SEED = 42

# Load dataset
dataset = load_dataset("csv", data_files="is-it-gorafi/dataset.csv")

# Map category labels to numerical values
categories = ["culture", "sciences", "sports", "société", "politique"]

# Encode categories
def encode_category(example):
    example["label"] = categories.index(example["category"])
    return example

dataset = dataset.map(encode_category)

# Split train/test data
train_test_split = dataset["train"].train_test_split(test_size=0.2, seed=SEED, shuffle=True)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Geotrend/distilbert-base-en-fr-cased")
model = AutoModelForSequenceClassification.from_pretrained("Geotrend/distilbert-base-en-fr-cased", num_labels=len(categories))

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["title"], padding="max_length", truncation=True)

# Tokenize dataset
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.1,
    per_device_train_batch_size=16,
    load_best_model_at_end=True,  # Reload the best checkpoint once training finishes
    metric_for_best_model="eval_loss",  # Only the loss is computed (no compute_metrics is passed)
    greater_is_better=False,  # Lower validation loss is better
    report_to="none",
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
)

# Train model
trainer.train()
trainer.evaluate()

# Initialize the classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Analyze dataset
def analyze_category(batch):
    titles = batch['title']
    predictions = classifier(titles)

    # Convert predictions to category
    predicted_categories = [categories[int(pred["label"].split("_")[-1])] for pred in predictions]

    return {'predicted_category': predicted_categories}

# Apply the classification function
test_dataset = test_dataset.map(analyze_category, batched=True)

nb_examples = min(100, len(test_dataset))
accuracy = 0

# Evaluate predictions
for i in range(nb_examples):
    example = test_dataset[i]
    print(f"Title: {example['title']}")
    print(f"Predicted category: {example['predicted_category']} | Actual category: {categories[example['label']]}")

    if example['predicted_category'] == categories[example['label']]:
        accuracy += 1

print("Accuracy =", accuracy / nb_examples)


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/4102 [00:00<?, ? examples/s]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/557 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/230k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/277M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at Geotrend/distilbert-base-en-fr-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/3281 [00:00<?, ? examples/s]

Map:   0%|          | 0/821 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.745808 |
| 2 | No log | 0.670382 |
| 3 | 0.777100 | 0.6666 |


Device set to use cuda:0


Map:   0%|          | 0/821 [00:00<?, ? examples/s]

Title: «J’ai construit une écologie du réel», les indiscrétions duFigaro Magazine
Predicted category: sciences | Actual category: politique
Title: Un surfeur rate complètement sa descente en oubliant d’envoyer de la poudreuse sur les skieurs
Predicted category: société | Actual category: sports
Title: Ligue des nations : un choc Italie-Allemagne, Ronaldo au Danemark, le nouveau cycle espagnol... Tout sur les autres quarts de finale
Predicted category: sports | Actual category: sports
Title: Thales Alenia Space présente Halo, première étape de la future station orbitale Lunar Gateway
Predicted category: sciences | Actual category: sciences
Title: Jean-Paul Delevoye a oublié de préciser qu’il allait toucher 5% de toutes les retraites des Français
Predicted category: politique | Actual category: politique
Title: Facebook : Il partage un article d’astrophysique auquel il ne comprend rien
Predicted category: sciences | Actual category: sciences
Title: Marlène Schiappa met sa couverture Play
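
The training log above reports only losses, because no `compute_metrics` function is passed to `Trainer`. The block below is a minimal sketch of such a function, assuming it receives the usual `(logits, labels)` pair and that plain accuracy is the metric of interest. The names `toy_logits` and `toy_labels` are made up for the check and are not notebook data.

```python
def compute_metrics(eval_pred):
    """Return the accuracy for a (logits, labels) pair as produced by Trainer."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy check with hand-written logits for three titles and five categories
toy_logits = [
    [0.1, 2.0, 0.3, 0.0, 0.0],  # largest logit at index 1
    [1.5, 0.2, 0.1, 0.0, 0.0],  # largest logit at index 0
    [0.0, 0.1, 0.2, 0.3, 0.4],  # largest logit at index 4
]
toy_labels = [1, 0, 3]  # the third title is deliberately misclassified
print(compute_metrics((toy_logits, toy_labels)))
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` should add an accuracy column to the evaluation log. Selecting the best checkpoint by accuracy would then additionally need `metric_for_best_model="accuracy"` in `TrainingArguments`.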

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline, TrainingArguments, Trainer

SEED=42

# Load dataset
dataset = load_dataset("csv", data_files="is-it-gorafi/dataset.csv")
dataset = dataset.map(lambda x: {"is_gorafi": int(x["is_gorafi"])})

# Split train/test data
train_test_split = dataset["train"].train_test_split(test_size=0.2, seed=SEED, shuffle=True)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Geotrend/distilbert-base-en-fr-cased")
model = AutoModelForSequenceClassification.from_pretrained("Geotrend/distilbert-base-en-fr-cased", num_labels=2)  # Binary task: Gorafi or not

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["title"], padding="max_length", truncation=True)

# Tokenize dataset
tokenized_datasets = train_dataset.map(tokenize_function, batched=True)
tokenized_test_datasets = test_dataset.map(tokenize_function, batched=True)

# Rename label column
tokenized_datasets = tokenized_datasets.rename_column("is_gorafi", "label")
tokenized_test_datasets = tokenized_test_datasets.rename_column("is_gorafi", "label")

# Training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.1,  # Weight decay regularization
    per_device_train_batch_size=16,
    load_best_model_at_end=True,  # Reload the best checkpoint once training finishes
    metric_for_best_model="eval_loss",  # Only the loss is computed (no compute_metrics is passed)
    greater_is_better=False,  # Lower validation loss is better
    report_to="none",  # Disable logging to Weights & Biases
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_test_datasets
)

# Train model
trainer.train()
trainer.evaluate()

# Initialize the classification pipeline ("sentiment-analysis" is an alias of "text-classification")
nlp_ara = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a batch of titles as Gorafi (1) or not (0)
def analyze_is_gorafi(batch):
    titles = batch['title']
    predictions = nlp_ara(titles)  # Run the classifier on the batch of titles

    # LABEL_1 is the positive (Gorafi) class
    predicted_is_gorafi = [1 if prediction['label'] == 'LABEL_1' else 0 for prediction in predictions]

    # Return the new 'predicted_is_gorafi' column
    return {'predicted_is_gorafi': predicted_is_gorafi}

# Add the predictions to the test set
test_dataset = test_dataset.map(analyze_is_gorafi, batched=True)

nb_examples = min(100, len(test_dataset))
accuracy = 0

# Print up to 100 examples and count the correct predictions
for i in range(nb_examples):
    example = test_dataset[i]
    print(f"Title: {example['title']}")
    print(f"Predicted is_gorafi: {example['predicted_is_gorafi']} | Actual is_gorafi: {example['is_gorafi']}")
    if example['predicted_is_gorafi'] == example['is_gorafi']:
        accuracy += 1

print("Accuracy =", accuracy / nb_examples)


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/4102 [00:00<?, ? examples/s]



tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/557 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/230k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/277M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at Geotrend/distilbert-base-en-fr-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/3281 [00:00<?, ? examples/s]

Map:   0%|          | 0/821 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.228933 |
| 2 | No log | 0.234918 |
| 3 | 0.281700 | 0.238929 |


Device set to use cuda:0


Map:   0%|          | 0/821 [00:00<?, ? examples/s]

Title: «J’ai construit une écologie du réel», les indiscrétions duFigaro Magazine
Predicted is_gorafi: 0 | Actual is_gorafi: False
Title: Un surfeur rate complètement sa descente en oubliant d’envoyer de la poudreuse sur les skieurs
Predicted is_gorafi: 1 | Actual is_gorafi: True
Title: Ligue des nations : un choc Italie-Allemagne, Ronaldo au Danemark, le nouveau cycle espagnol... Tout sur les autres quarts de finale
Predicted is_gorafi: 0 | Actual is_gorafi: False
Title: Thales Alenia Space présente Halo, première étape de la future station orbitale Lunar Gateway
Predicted is_gorafi: 0 | Actual is_gorafi: False
Title: Jean-Paul Delevoye a oublié de préciser qu’il allait toucher 5% de toutes les retraites des Français
Predicted is_gorafi: 1 | Actual is_gorafi: True
Title: Facebook : Il partage un article d’astrophysique auquel il ne comprend rien
Predicted is_gorafi: 1 | Actual is_gorafi: True
Title: Marlène Schiappa met sa couverture Playboy aux enchères pour rembourser le Fonds Maria
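
Both training cells decode the default `LABEL_n` names by hand (`split("_")` in the category cell, a comparison with `'LABEL_1'` in the Gorafi cell). The sketch below gathers that decoding in one place. `label_to_index` is a hypothetical helper name. The `id2label` and `label2id` dictionaries use the category list from the first cell.

```python
categories = ["culture", "sciences", "sports", "société", "politique"]

# Index <-> name mappings derived from the category list
id2label = dict(enumerate(categories))
label2id = {name: idx for idx, name in id2label.items()}


def label_to_index(label):
    """Convert a default pipeline label such as 'LABEL_3' to the integer 3."""
    return int(label.rsplit("_", 1)[-1])


print(id2label[label_to_index("LABEL_3")])  # société
```

As far as I know, passing `id2label=id2label, label2id=label2id` to `AutoModelForSequenceClassification.from_pretrained` stores the names in the model config, so that the pipeline returns category names directly. Treat that as an assumption to verify against the installed `transformers` version.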

In [None]:
from transformers import AutoModelForSequenceClassification

# Inspect which configuration class backs the checkpoint
model = AutoModelForSequenceClassification.from_pretrained("Geotrend/distilbert-base-en-fr-cased")
print(model.config_class)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at Geotrend/distilbert-base-en-fr-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<class 'transformers.models.distilbert.configuration_distilbert.DistilBertConfig'>


## Using our model

### Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Extract true labels and predictions
true_labels = [example['is_gorafi'] for example in test_dataset]
predicted_labels = [example['predicted_is_gorafi'] for example in test_dataset]

# Compute confusion matrix
cm = confusion_matrix(true_labels, predicted_labels)

print("Confusion Matrix:")
print(cm)

KeyError: 'predicted_is_gorafi'
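
Once the matrix is available, accuracy, precision and recall can be read off it. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predictions) and treats `is_gorafi = 1` as the positive class. `binary_report` is a hypothetical helper, and the counts are made up for illustration. They are not results from this notebook.

```python
def binary_report(cm):
    """Derive accuracy, precision and recall from a 2x2 confusion matrix.

    Layout follows scikit-learn: rows are true labels, columns are predictions,
    so cm = [[TN, FP], [FN, TP]] when the positive class is 1.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts for illustration only
example_cm = [[90, 10], [5, 95]]
print(binary_report(example_cm))
```

With a real matrix, `binary_report(cm.tolist())` converts the NumPy array returned by `confusion_matrix` into nested lists first.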

### Using the models on a sentence of our own

In [None]:
test_title = "L’Arbitre Oublie Son Sifflet, Les Joueurs Continuent Depuis 3 Jours"

prediction_gorafi = nlp_ara(test_title)

# Check confidence score and adjust prediction logic
label_gorafi = prediction_gorafi[0]['label']
score_gorafi = prediction_gorafi[0]['score']

if score_gorafi < 0.6:  # Confidence threshold
    predicted_is_gorafi = 0  # If confidence is low, treat the title as not Gorafi
else:
    predicted_is_gorafi = 1 if label_gorafi == 'LABEL_1' else 0

prediction_category = classifier(test_title)

# Extract label and confidence score
label_category = prediction_category[0]['label']
score_category = prediction_category[0]['score']

# Map label to category
predicted_cat = categories[int(label_category.split("_")[-1])]

# Apply confidence threshold
if score_category < 0.4:  # Confidence threshold
    predicted_cat = predicted_cat + " (uncertain)"  # If confidence is low, mark as uncertain


print(f"Title: {test_title}")

print(f"Predicted: is gorafi {predicted_is_gorafi} with confidence score: {score_gorafi}")
print(f"Predicted: category {predicted_cat} with confidence score: {score_category}")

NameError: name 'nlp_ara' is not defined
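
The thresholding in the cell above can be isolated into a small function that is testable without a trained model. This is a sketch under the assumption that each prediction is a `{'label', 'score'}` dictionary and that `LABEL_1` is the Gorafi class, as in the training cell. `decide_is_gorafi` and `gorafi_probability` are hypothetical names.

```python
def decide_is_gorafi(prediction, threshold=0.6):
    """Return 1 only when the model predicts LABEL_1 with enough confidence."""
    if prediction["score"] < threshold:
        return 0  # Low confidence: fall back to "not Gorafi"
    return 1 if prediction["label"] == "LABEL_1" else 0


def gorafi_probability(prediction):
    """Probability of the Gorafi class in a two-label softmax output."""
    score = prediction["score"]
    return score if prediction["label"] == "LABEL_1" else 1.0 - score


print(decide_is_gorafi({"label": "LABEL_1", "score": 0.55}))  # 0
print(decide_is_gorafi({"label": "LABEL_1", "score": 0.90}))  # 1
```

With two labels, the score of the top label is always at least 0.5, so the 0.6 threshold only changes the outcome for `LABEL_1` predictions scored between 0.5 and 0.6.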

## Exporting our tokenizer and model in ONNX format

In [None]:
from optimum.exporters.onnx import main_export

onnx_output_dir = "onnx_model"

main_export(
    model_name_or_path="test_trainer/checkpoint-357",
    task="text-classification",
    output=onnx_output_dir,
)
print(f"ONNX model exported to {onnx_output_dir}/ folder")


ONNX model exported to onnx_model/
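
An exported sequence classification model returns raw logits rather than the `{'label', 'score'}` dictionaries produced by `pipeline`, so the caller has to apply the softmax. The following is a minimal post-processing sketch, assuming the logits for one title arrive as a flat list of floats. The label names `not_gorafi` and `gorafi` are placeholders chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    shift = max(logits)  # Subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, labels):
    """Turn raw logits into a pipeline-style {'label', 'score'} dictionary."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Placeholder label names and hand-written logits
print(postprocess([0.0, 2.0], ["not_gorafi", "gorafi"]))
```

The same function works for the five-category model by passing the category list as `labels`.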
