Text classification using CamemBERT
=================================

This notebook fine-tunes CamemBERT, through the Simple Transformers library, for two-class classification of French film reviews from the Allociné dataset.

In [None]:
!pip install --upgrade transformers
!pip install simpletransformers
!pip install dvc
!pip install datasets

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=d33395dccd

Collecting dvc
  Downloading https://files.pythonhosted.org/packages/7e/52/319920bb3df3cc6e8a61662e3ea0eef302d2b59f3c099544c64b94179a3e/dvc-2.0.7-py2.py3-none-any.whl (627kB)

Collecting tqdm<4.50.0,>=4.27
  Downloading https://files.pythonhosted.org/packages/73/d5/f220e0c69b2f346b5649b66abebb391df1a00a59997a7ccf823325bd7a3e/tqdm-4.49.0-py2.py3-none-any.whl (69kB)
Installing collected packages: tqdm
  Found existing installation: tqdm 4.59.0
    Uninstalling tqdm-4.59.0:
      Successfully uninstalled tqdm-4.59.0
Successfully installed tqdm-4.49.0


In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from datasets import load_dataset


Load dataset

In [None]:
dataset = load_dataset("allocine")

train_df = pd.DataFrame({'text': dataset['train']['review'][:25000], 'labels': dataset['train']['label'][:25000]})
valid_df = pd.DataFrame({'text': dataset['validation']['review'], 'labels': dataset['validation']['label']})
test_df = pd.DataFrame({'text': dataset['test']['review'], 'labels': dataset['test']['label']})
train_df

Reusing dataset allocine_dataset (/root/.cache/huggingface/datasets/allocine_dataset/allocine/1.0.0/bbee2ebb45a067891973b91ebdd40a93598d1e2dd5710b6714cdc2cd81d0ed65)


Unnamed: 0,text,labels
0,Si vous cherchez du cinéma abrutissant à tous ...,0
1,"Trash, re-trash et re-re-trash...! Une horreur...",0
2,"Et si, dans les 5 premières minutes du film, l...",0
3,Mon dieu ! Quelle métaphore filée ! Je suis ab...,0
4,"Premier film de la saga Kozure Okami, ""Le Sabr...",1
...,...,...
24995,"Quel beau film, tout y est, les acteurs sont f...",1
24996,Un documentaire émouvant sur ces millions d'am...,1
24997,Le genre de film qui entre par une oreille et ...,0
24998,73 minutes d'un scénario qui nous endore pénib...,0


Truncate dataset

In [None]:
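# Optional cap on the training set size. With the 25,000-row slice taken at load
# time this is a no-op; use a smaller bound (e.g. 500) for a quick trial run.
train_df = train_df[:25000]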

Unnamed: 0,text,labels
0,Si vous cherchez du cinéma abrutissant à tous ...,0
1,"Trash, re-trash et re-re-trash...! Une horreur...",0
2,"Et si, dans les 5 premières minutes du film, l...",0
3,Mon dieu ! Quelle métaphore filée ! Je suis ab...,0
4,"Premier film de la saga Kozure Okami, ""Le Sabr...",1
...,...,...
495,"Ce film est vraiment très très bien, je me sou...",1
496,C'est très bon et complètement loufoque. Ça co...,1
497,inattendu et captivant.. bien que la réalisati...,1
498,"Je n'ai pas pu le voir en entier, mais le peu ...",1


Define model

In [None]:
model_args = ClassificationArgs(num_train_epochs=3, overwrite_output_dir=False, manual_seed=42)
model = ClassificationModel(model_type='camembert', model_name='camembert-base', use_cuda=True, num_labels=2, args=model_args)

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'cl

Train it

In [None]:
model.train_model(train_df)


(9375, 0.1685619183776031)
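
As far as the Simple Transformers documentation describes it, the pair returned by `train_model` is the number of optimizer steps taken and, when evaluation during training is off, the average training loss. The step count can be checked by hand. The sketch below assumes the library's default training batch size of 8, which this notebook does not set explicitly, and no gradient accumulation; `total_training_steps` is a helper name made up for this example.

```python
import math


def total_training_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a run that keeps the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


# 25,000 training reviews, assumed batch size 8, 3 epochs
print(total_training_steps(25_000, 8, 3))  # 9375
```

With these assumptions the result matches the first element of the tuple above.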

Load a fine-tuned checkpoint


In [None]:
model_args = ClassificationArgs(num_train_epochs=3, overwrite_output_dir=False, manual_seed=42)
model = ClassificationModel(model_type='camembert', model_name='outputs/checkpoint-20000', use_cuda=True, num_labels=2, args=model_args)

Evaluate its performance

In [None]:
result, model_outputs, wrong_preds = model.eval_model(valid_df)
print(result)

{'mcc': 0.8930918460602015, 'tp': 9337, 'tn': 9593, 'fp': 611, 'fn': 459, 'auroc': 0.9875345923959715, 'auprc': 0.9860996964217978, 'eval_loss': 0.26346013724287043}
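
The result dictionary reports confusion-matrix counts but not accuracy or F1. A minimal sketch of deriving them from the four counts printed above, using plain arithmetic rather than scikit-learn; `binary_metrics` is a helper name introduced here for illustration.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Counts copied from the evaluation output above
metrics = binary_metrics(tp=9337, tn=9593, fp=611, fn=459)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts the accuracy is (9337 + 9593) / 20000 = 0.9465, and the recomputed MCC agrees with the `mcc` value reported by `eval_model`.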


Predict on test set

In [None]:
# Predict labels for the test reviews (the DataFrame built above is named test_df)
test_predictions, raw_outputs = model.predict(test_df['text'].tolist())
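
`model.predict` returns the predicted labels together with the raw model outputs, which for this classifier are per-class logits. If class probabilities are wanted, a softmax over each row is the usual conversion. The sketch below uses made-up logits in place of the real `raw_outputs`; `softmax` and `example_raw_outputs` are names introduced for this example.

```python
import math


def softmax(row):
    """Softmax of one row of logits, shifted by the maximum for numerical stability."""
    highest = max(row)
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three reviews; real values would come from model.predict
example_raw_outputs = [[2.1, -1.9], [-0.3, 0.4], [-3.0, 3.2]]

probabilities = [softmax(row) for row in example_raw_outputs]
label_1_probability = [p[1] for p in probabilities]
predicted = [p.index(max(p)) for p in probabilities]  # same as argmax of the logits
print(predicted)  # [0, 1, 1]
```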


Save as submission file

In [None]:
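from google.colab import files  # Colab-only helper for browser downloads

# Assumes /content/test.csv has an `id` column and one row per prediction, in the same order
test_original = pd.read_csv("/content/test.csv")

sample_sub = pd.DataFrame({"target": test_predictions, "id": test_original.id})
sample_sub.to_csv("submission.csv", index=False)
files.download("submission.csv")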


Download checkpoint

In [None]:
!zip -r checkpoint-20000x8 outputs/checkpoint-20000

  adding: outputs/checkpoint-20000/ (stored 0%)
  adding: outputs/checkpoint-20000/tokenizer_config.json (deflated 47%)
  adding: outputs/checkpoint-20000/config.json (deflated 49%)
  adding: outputs/checkpoint-20000/pytorch_model.bin (deflated 7%)
  adding: outputs/checkpoint-20000/model_args.json (deflated 62%)
  adding: outputs/checkpoint-20000/optimizer.pt (deflated 11%)
  adding: outputs/checkpoint-20000/sentencepiece.bpe.model (deflated 49%)
  adding: outputs/checkpoint-20000/training_args.bin (deflated 49%)
  adding: outputs/checkpoint-20000/scheduler.pt (deflated 49%)
  adding: outputs/checkpoint-20000/special_tokens_map.json (deflated 47%)


In [None]:
from google.colab import files
files.download('checkpoint-20000x8.zip')


In [None]:
!du -hd1

96K	./.config
1.3G	./cache_dir
15G	./outputs
52K	./runs
55M	./sample_data
18G	.
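
Most of the 18G sits in `outputs`, where Simple Transformers writes its checkpoints, and each checkpoint zipped above also carries `optimizer.pt` and `scheduler.pt`. One way to keep the directory smaller is to save fewer checkpoints and skip the optimizer state. The dictionary below is a sketch of such settings, not the configuration used in this notebook: the keys are Simple Transformers arguments, the values are illustrative choices, and `lean_checkpoint_args` is a name made up here.

```python
# Training arguments that write fewer and smaller checkpoints to outputs/.
# The first three entries mirror the notebook; the rest are illustrative choices.
lean_checkpoint_args = {
    "num_train_epochs": 3,
    "overwrite_output_dir": False,
    "manual_seed": 42,
    "save_steps": -1,                       # disable periodic step checkpoints
    "save_model_every_epoch": False,        # keep only the final model
    "save_optimizer_and_scheduler": False,  # drop optimizer.pt and scheduler.pt
}

for key, value in lean_checkpoint_args.items():
    print(f"{key} = {value}")
```

The dictionary can be passed as `args=lean_checkpoint_args` when constructing `ClassificationModel`. Resuming training from a checkpoint saved without optimizer and scheduler state restarts those from scratch, so this trade-off suits runs that only need the final weights.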
