<a href="https://colab.research.google.com/github/noetarbouriech/is-it-gorafi/blob/main/newspaper_theme_setfit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SetFit model to categorize news articles

We reuse the dataset from our Figaro vs. Gorafi detection model, this time predicting the `category` column.

## Download dependencies

In [1]:
!git clone https://github.com/noetarbouriech/is-it-gorafi.git

Cloning into 'is-it-gorafi'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 18 (delta 5), reused 7 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (18/18), 185.49 KiB | 15.46 MiB/s, done.
Resolving deltas: 100% (5/5), done.


In [8]:
!pip install datasets setfit
!apt-get install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.


## Preparing the dataset

In [6]:
from datasets import load_dataset
from setfit import sample_dataset

csv = load_dataset("csv", data_files="is-it-gorafi/dataset.csv")
dataset = csv["train"].train_test_split(test_size=0.2, seed=42, shuffle=True)
train_dataset = sample_dataset(dataset["train"], label_column="category", num_samples=8)
test_dataset = dataset["test"]

print("dataset:", dataset, "\n")
print("train_dataset:", train_dataset, "\n")
print("test_dataset:", test_dataset)

dataset: DatasetDict({
    train: Dataset({
        features: ['is_gorafi', 'title', 'category'],
        num_rows: 3281
    })
    test: Dataset({
        features: ['is_gorafi', 'title', 'category'],
        num_rows: 821
    })
}) 

train_dataset: Dataset({
    features: ['is_gorafi', 'title', 'category'],
    num_rows: 40
}) 

test_dataset: Dataset({
    features: ['is_gorafi', 'title', 'category'],
    num_rows: 821
})
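
The 40 rows in `train_dataset` follow from `num_samples=8` applied to each of the five categories. As an illustration only, per-class sampling can be sketched in plain Python. This is not SetFit's implementation: `sample_per_class` and the toy rows are hypothetical stand-ins for `sample_dataset` and the real CSV.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key, num_samples, seed=42):
    """Return up to `num_samples` rows for each distinct label (hypothetical helper)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Take at most `num_samples` rows, fewer if the class is small
        sampled.extend(rng.sample(group, min(num_samples, len(group))))
    return sampled


# Toy stand-in for the real dataset: 5 categories, 20 titles each
categories = ["culture", "sciences", "sports", "société", "politique"]
rows = [
    {"title": f"{category} headline {i}", "category": category}
    for category in categories
    for i in range(20)
]

few_shot = sample_per_class(rows, "category", num_samples=8)
print(len(few_shot))  # 5 categories x 8 samples
```

Keeping the sample small and balanced is the point of the few-shot setup: every category contributes the same number of training titles, regardless of how it is represented in the full split.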


## Training our model

Be aware that training takes a long time, even on a GPU runtime.

In [10]:
import os
from setfit import SetFitModel, Trainer, TrainingArguments

# Initializing a new SetFit model
model = SetFitModel.from_pretrained(
    "sentence-transformers/all-mpnet-base-v2",
    labels=["culture", "sciences", "sports", "société", "politique"],
)
model.to("cuda")  # requires a GPU runtime (e.g. Colab with a GPU enabled)

# Preparing the training arguments
os.environ["WANDB_DISABLED"] = "true"
args = TrainingArguments(
    batch_size=32,
    num_epochs=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
)

# Preparing the trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    column_mapping={"title": "text", "category": "label"}, # Map dataset columns to text/label expected by trainer
)

# Train, then evaluate on the test set
trainer.train()
metrics = trainer.evaluate()
print(metrics)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to the training dataset
Applying column mapping to the evaluation dataset
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).



***** Running training *****
  Num unique pairs = 1280
  Batch size = 32
  Num epochs = 2


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2223        | 0.24325         |
| 2     | 0.1509        | 0.258702        |


***** Running evaluation *****


{'accuracy': 0.535931790499391}
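
The numbers in the log above can be checked by hand. The `Num unique pairs = 1280` line is consistent with contrastive pairs built from the 40 training titles, assuming pairs are unordered combinations of distinct examples and the smaller group (same-label pairs) is oversampled to match the larger one. That sampling strategy is an assumption, not something the log states. The accuracy line is plain arithmetic on the 821 test rows.

```python
from math import comb

num_classes = 5
samples_per_class = 8
total = num_classes * samples_per_class  # 40 training examples

all_pairs = comb(total, 2)                                  # every unordered pair
positive_pairs = num_classes * comb(samples_per_class, 2)   # same-label pairs
negative_pairs = all_pairs - positive_pairs                 # different-label pairs

# Assumed: positives are repeated until they match the number of negatives
balanced_pairs = 2 * max(positive_pairs, negative_pairs)
print(positive_pairs, negative_pairs, balanced_pairs)  # 140 640 1280

# Accuracy expressed as a count of correctly classified test titles
test_rows = 821
accuracy = 0.535931790499391
correct = round(accuracy * test_rows)
print(correct)  # 440
```

So the model labels 440 of the 821 test titles correctly after seeing only 8 examples per category.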


## Exporting the model

In [None]:
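# Note: save_pretrained writes the standard SetFit files (model body and
# classification head), not an ONNX graph, despite the directory name.
save_directory = "onnx/"
model.save_pretrained(save_directory)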