# **Text Classification using Language Model Fine-Tuning**

Below is a brief tutorial on how to use BERT for sentence classification. It is largely inspired by a notebook written by Rubing Shen and makes use of a dedicated library.

*Since the resources on Google Colab are limited, you may run into limitations when using it for your own project. In that case, copy this notebook to your own computing platform and run it on your own GPU.*

## Enabling GPU

The package requires a GPU to run. To enable a GPU for this notebook:
- Click 'Runtime' in the menu bar, then click 'Change runtime type'.
- Select GPU from the Hardware Accelerator drop-down list, then click 'Save'.

Click on the arrow below to verify that you are successfully connected to a GPU. This should return the name of the GPU used.

In [1]:
from torch import cuda

cuda.get_device_name(0)

'Tesla T4'
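
If no GPU is attached, `cuda.get_device_name(0)` typically raises an error instead of returning a name. A slightly more defensive check is sketched below; the helper name `describe_device` is hypothetical and not part of the tutorial's package.

```python
def describe_device() -> str:
    """Return a short description of the accelerator PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Index 0 is the first visible GPU, as in the cell above.
        return torch.cuda.get_device_name(0)
    return "No GPU detected"


print(describe_device())
```

On a correctly configured Colab runtime this should print the same GPU name as the cell above.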

# Sentiment analysis off the shelf

In [2]:
pip install -q transformers


In [3]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
# Many alternative models exist, e.g.:
# specific_model = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [4]:
# A minimal input to try instead: data = ["I love you", "I hate you"]
data = ['anti trump  pro science  follow me and i follow back  blue wave 🌊', "i am an american nationalist and damn proud of it!!   trumptrain2020  maga  kag  gobucks no ❄❄❄", "a proud mom  grandma  and  aunt  doing what i can to prevent them from inheriting trump's america     america will unite for democracy  not a autocracy  resist"]
sentiment_pipeline(data)

[{'label': 'NEGATIVE', 'score': 0.5591527223587036},
 {'label': 'POSITIVE', 'score': 0.9988829493522644},
 {'label': 'POSITIVE', 'score': 0.9998018145561218}]
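
Each dictionary holds the predicted label and a score, which is the probability the model assigns to that label. For a two-class model, the probability of the other class is one minus the score. The sketch below converts the results shown above into a single "probability of positive" value; `positive_probability` is a hypothetical helper, not part of `transformers`.

```python
# Output copied from the cell above (one dict per input text, in order).
results = [
    {"label": "NEGATIVE", "score": 0.5591527223587036},
    {"label": "POSITIVE", "score": 0.9988829493522644},
    {"label": "POSITIVE", "score": 0.9998018145561218},
]


def positive_probability(result: dict) -> float:
    """Convert a binary {label, score} result into P(positive)."""
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score


for r in results:
    print(f"{r['label']:<8}  P(positive) = {positive_probability(r):.3f}")
```

Seen this way, the first text is a borderline case (close to 0.5), whereas the other two are classified as positive with high confidence.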

## Installing the package for fine-tuning

Run the cell below to install the package *AugmentedSocialScientist* on the current Google Colab runtime.

In [None]:
!git clone https://github.com/rubingshen/AugmentedSocialScientist.git
!pip install ./AugmentedSocialScientist/package/


Import other required packages for this tutorial.

In [None]:
import pandas as pd
import numpy as np

pd.options.display.max_colwidth=None
pd.options.display.max_rows=100

# **English Text Classifier Example: Clickbait Detection**


For this example, we use data from [Chakraborty et al. 2016](https://github.com/bhargaviparanjape/clickbait) to train a classifier that distinguishes between clickbait and non-clickbait titles.

Import BERT model ([Devlin et al. 2019](https://arxiv.org/pdf/1810.04805.pdf)) from the package *AugmentedSocialScientist*.

In [None]:
from AugmentedSocialScientist import bert

There are 1 GPU(s) available.
We will use GPU 0: Tesla T4



### Loading data

In [None]:
cb_train = pd.read_csv('./AugmentedSocialScientist/datasets/english/clickbait_train.csv')
cb_test = pd.read_csv('./AugmentedSocialScientist/datasets/english/clickbait_test.csv')

Inspect loaded data

In [None]:
cb_train

| Unnamed: 0 | headline | is_clickbait |
|---|---|---|
| 0 | 11-Year-Old Bicyclist Called Out Reckless Drivers Like A Boss | 1 |
| 1 | What Would Your Life Be Like As A Lame Super Hero | 1 |
| 2 | 17 Faces Everyone Who's Experienced Halloween In The Cold Will Understand | 1 |
| 3 | 16 Held in Coup Effort in Equatorial Guinea | 0 |
| 4 | Can You Guess The Celebrity Hiding Behind The Inanimate Object | 1 |
| ... | ... | ... |
| 495 | If You're Excited For The New "Star Wars" Movie You Have To See This Art | 1 |
| 496 | Baseball Homecomings for Ken Griffey Jr. and Jason Giambi | 0 |
| 497 | 13 Apps That'll Make Your iPhone-Android Relationship So Much Better | 1 |
| 498 | What It Feels Like When You Make Your Crush Laugh | 1 |


In [None]:
cb_test

| Unnamed: 0 | headline | is_clickbait |
|---|---|---|
| 0 | North Queensland Fury sign former Liverpool great Fowler | 0 |
| 1 | US combat forces pull out of Iraq | 0 |
| 2 | Financial Unit Weighs on General Electric | 0 |
| 3 | Many Civilian Targets, but One Core Question Among Gazans: Why? | 0 |
| 4 | 18 Reasons You Should Avoid Lifting Weights At All Costs | 1 |
| ... | ... | ... |
| 195 | Band manager Daniel Biechele shown parole support by families of victims of the Station nightclub fire | 0 |
| 196 | Justin Trudeau Personally Welcomed A Plane Full Of Refugees To Canada | 1 |
| 197 | Which Sex Toy Matches Your Personality | 1 |
| 198 | Who Is Your Celeb BFF Based On Your Birth Month | 1 |


In [None]:
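# Note: xs(1) returns a new object for row 1, so this chained assignment
# acts on a copy and most likely leaves cb_test itself unchanged (hence the warning).
cb_test.xs(1)['is_clickbait']=[0,4]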

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cb_test.xs(1)['is_clickbait']=[0,4]
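
The warning above appears because the value is assigned to a temporary object rather than to the dataframe itself. The usual way to change a single cell in place is `.loc[row, column]`. A minimal sketch on a toy dataframe follows; the relabelling is purely illustrative and is not a correction of the real data set.

```python
import pandas as pd

# Toy stand-in for cb_test, using two headlines from the test set.
df = pd.DataFrame({
    "headline": [
        "US combat forces pull out of Iraq",
        "18 Reasons You Should Avoid Lifting Weights At All Costs",
    ],
    "is_clickbait": [0, 1],
})

# .loc selects the row and the column in one step, so the assignment
# is applied to df itself rather than to a temporary copy.
df.loc[0, "is_clickbait"] = 1

print(df["is_clickbait"].tolist())
```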


### Training a model

We will now encode the training and the test data.
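
The internals of `bert.encode` are not shown in this tutorial. As a rough picture of what encoding for a BERT-style model generally involves, the toy sketch below maps tokens to integer ids, adds the special `[CLS]` and `[SEP]` tokens, pads every sequence to the same length, and builds an attention mask that marks real tokens with 1 and padding with 0. It is an assumption-laden simplification: real BERT tokenizers use a fixed WordPiece subword vocabulary rather than whitespace splitting, and `encode_toy` is a hypothetical name.

```python
def encode_toy(sentences, max_len=10):
    """Toy encoder: whitespace tokens -> ids, padded, with attention masks."""
    vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
    input_ids, attention_masks = [], []
    for sentence in sentences:
        # Leave room for the two special tokens when truncating.
        words = sentence.lower().split()[: max_len - 2]
        tokens = ["[CLS]"] + words + ["[SEP]"]
        # Assign a new id to every token not seen before.
        ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        padding = max_len - len(ids)
        input_ids.append(ids + [vocab["[PAD]"]] * padding)
        attention_masks.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_masks


ids, masks = encode_toy([
    "Chief of Swiss Re Steps Down",
    "US combat forces pull out of Iraq",
])
print(ids[0])    # [1, 3, 4, 5, 6, 7, 8, 2, 0, 0]
print(masks[0])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```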

In [None]:
train_loader = bert.encode(cb_train.headline.values, cb_train.is_clickbait.values)


In [None]:
test_loader = bert.encode(cb_test.headline.values, cb_test.is_clickbait.values)


The following command trains, validates, and saves the model.

To improve accuracy, you can tune some of the parameters below:
*   `n_epochs` is the number of epochs, i.e. full passes over the training data;
*   `lr` is the learning rate;
*   `seed_val` is the random seed, set for reproducibility;
*   `save_model_as` is the name of the folder in which the model is saved. The model will be saved at `./models/<model_name>`. If you don't want to save the model after training, set this parameter to `None`.

Once the model has completed its training phase, it calculates the F1-score (between 0 and 1) to assess the quality of the model.
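
The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. A minimal sketch of the computation, with illustrative values (perfect precision, 83% recall); the function name is chosen here for illustration only:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 2 * 1.00 * 0.83 / (1.00 + 0.83) = 1.66 / 1.83
print(f"{f1_score(1.00, 0.83):.2f}")  # prints 0.91
```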



In [None]:
score = bert.run_training(train_loader,
                          test_loader,
                          n_epochs=10,
                          lr=5e-6,
                          seed_val=42,
                          save_model_as='clickbait')



Training...

  Average training loss: 0.67
  Training took: 0:00:03

Running Validation...

  Average test loss: 0.62
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       1.00      0.83      0.91       104
           1       0.84      1.00      0.91        96

    accuracy                           0.91       200
   macro avg       0.92      0.91      0.91       200
weighted avg       0.92      0.91      0.91       200


Training...

  Average training loss: 0.59
  Training took: 0:00:02

Running Validation...

  Average test loss: 0.55
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       0.96      0.92      0.94       104
           1       0.92      0.96      0.94        96

    accuracy                           0.94       200
   macro avg       0.94      0.94      0.94       200
weighted avg       0.94      0.94      0.94       200


Training...

  Average training loss: 0.52
  Training

### Predicting on new data

Load the unlabelled data for prediction and inspect it.

In [None]:
cb_pred = pd.read_csv('./AugmentedSocialScientist/datasets/english/clickbait_pred.csv')

In [None]:
cb_pred

| Unnamed: 0 | headline |
|---|---|
| 0 | 34 Musical Baby Names That'll Make You Want To Procreate |
| 1 | Senate Approves Tight Regulation Over Cigarettes |
| 2 | Scotland predicted to have worst recession since 1980, but not as bad as rest of UK |
| 3 | 17 Times Chloe The Mini Frenchie Won Instagram In 2015 |
| 4 | Markets rally as world's central banks infuse cash |
| 5 | 17 Photos Everyone Who Grew Up Eating Pan Dulce Will Relate To |
| 6 | Zimbabwean opposition leader rejects calls for power sharing talks |
| 7 | Chief of Swiss Re Steps Down |
| 8 | This Guy's Epic Story Explains Why Every Girl Has A Trapped In The Closet Moment |
| 9 | There's A New Trailer For The "Sherlock" Christmas Special And It's Pure Magic |


Encode your prediction data

In [None]:
pred_loader = bert.encode(cb_pred.headline.values)


Predict using the saved model

In [None]:
pred_proba = bert.predict_with_model(pred_loader, model_path='./models/clickbait')


Output: for each headline in the unlabelled data set, the model returns the probability that it belongs to each category (column 0: not clickbait; column 1: clickbait).

In [None]:
pred_proba

array([[0.08891816, 0.9110818 ],
       [0.8865426 , 0.11345742],
       [0.88926005, 0.11073996],
       [0.19879048, 0.80120957],
       [0.8728297 , 0.1271703 ],
       [0.16549903, 0.83450097],
       [0.89192766, 0.10807239],
       [0.88374555, 0.11625442],
       [0.16703835, 0.8329616 ],
       [0.16628607, 0.83371395],
       [0.8917807 , 0.10821937],
       [0.09156585, 0.90843415],
       [0.8149036 , 0.18509638],
       [0.8534682 , 0.14653182],
       [0.15306884, 0.8469311 ],
       [0.88547766, 0.1145223 ],
       [0.89769644, 0.10230357],
       [0.8708133 , 0.12918669],
       [0.8773369 , 0.12266309],
       [0.90423006, 0.09576995],
       [0.16928744, 0.8307126 ],
       [0.8832665 , 0.11673348],
       [0.11754417, 0.8824559 ],
       [0.11646442, 0.88353556],
       [0.13959594, 0.860404  ],
       [0.871242  , 0.12875801],
       [0.11747322, 0.8825268 ],
       [0.12624456, 0.87375546],
       [0.12507081, 0.87492925],
       [0.15531169, 0.8446883 ],
       [0.

Store the predicted category and its probability in the dataframe.

In [None]:
cb_pred['pred_label'] = np.argmax(pred_proba, axis=1)
cb_pred['pred_proba'] = np.max(pred_proba, axis=1)
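
To make the two lines above concrete, the sketch below applies the same `np.argmax` and `np.max` calls to the first three rows of `pred_proba`. The confidence threshold at the end is a hypothetical addition, shown only as one way to flag uncertain predictions for manual review.

```python
import numpy as np

# First three rows of pred_proba as displayed above.
proba = np.array([
    [0.08891816, 0.9110818],
    [0.8865426, 0.11345742],
    [0.88926005, 0.11073996],
])

labels = np.argmax(proba, axis=1)   # column index of the larger probability
confidence = np.max(proba, axis=1)  # the larger probability itself

# Hypothetical rule: flag predictions below a chosen confidence level.
threshold = 0.9
needs_review = confidence < threshold

print(labels.tolist())        # [1, 0, 0]
print(needs_review.tolist())  # [False, True, True]
```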

Inspect the prediction results

In [None]:
for i in range(len(cb_pred)):
    print(f"{cb_pred.loc[i,'headline']}")
    print(f"Is clickbait: {bool(cb_pred.loc[i,'pred_label'])}, with a probability of {cb_pred.loc[i,'pred_proba']*100:.0f}%")
    print()

34 Musical Baby Names That'll Make You Want To Procreate
Is clickbait: True, with a probability of 91%

Senate Approves Tight Regulation Over Cigarettes
Is clickbait: False, with a probability of 89%

Scotland predicted to have worst recession since 1980, but not as bad as rest of UK
Is clickbait: False, with a probability of 89%

17 Times Chloe The Mini Frenchie Won Instagram In 2015
Is clickbait: True, with a probability of 80%

Markets rally as world's central banks infuse cash
Is clickbait: False, with a probability of 87%

17 Photos Everyone Who Grew Up Eating Pan Dulce Will Relate To
Is clickbait: True, with a probability of 83%

Zimbabwean opposition leader rejects calls for power sharing talks
Is clickbait: False, with a probability of 89%

Chief of Swiss Re Steps Down
Is clickbait: False, with a probability of 88%

This Guy's Epic Story Explains Why Every Girl Has A Trapped In The Closet Moment
Is clickbait: True, with a probability of 83%

There's A New Trailer For The "Sherl

# For Other Languages

For other languages, the package contains a multilingual model, XLM-RoBERTa ([Conneau et al. 2020](https://arxiv.org/abs/1911.02116)), which can perform NLP tasks in 100 languages (see Appendix A of the paper for a list).

To use it, the syntax is the same as with BERT and CamemBERT. You first need to import the model `xlmroberta` from the package.

In [None]:
from AugmentedSocialScientist import xlmroberta

There are 1 GPU(s) available.
We will use GPU 0: Tesla T4



Once imported, you can then use the functions `xlmroberta.encode`, `xlmroberta.run_training`, `xlmroberta.predict_with_model` with the same syntax as for BERT and CamemBERT.

However, this multilingual model requires a significant amount of RAM, which may exceed the capacity of Google Colab. In that case, try running the model on your own GPU server.