This project examines a text classification task: predicting the scientific area of a computer science paper from its title and the titles of the papers it references. The dataset has 5 labels (scientific areas): Data Mining, Database, Machine Learning, Natural Language Processing, and Programming Language. Overall, 8 deep learning models were evaluated on this dataset, with the following test accuracies: Electra 76.76%, AWD_LSTM 76.68%, XLNet 78.83%, RoBERTa 79.15%, DeBERTa 79.30%, Longformer 79.30%, DistilBERT 80.52%, BERT 81.34%. BERT performed best, with a test accuracy of 81.34%, well above the 20% expected from uniformly random guessing over 5 classes. The problem was originally studied as an assignment for the Deep Learning PhD course at HEC Montreal. The code below re-implements the same problem with new models and libraries (simpletransformers and fastai).
 

### GPU information

In [None]:
import torch 

In [None]:
torch.cuda.get_device_name(0)   # Type of the GPU

'Tesla T4'

In [None]:
torch.cuda.device_count()   # The number of available GPUs

1

In [None]:
runtimeType = ("GPU" if torch.cuda.is_available() else "CPU")    # Whether GPU is used or CPU
print(runtimeType)  

GPU
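
The calls above assume that a GPU is present; `torch.cuda.get_device_name(0)` can raise an error on a CPU-only runtime. A minimal sketch of a more defensive check follows. The helper name `describe_runtime` is hypothetical and not part of the original notebook; it only uses the same `torch.cuda` calls shown above and falls back to CPU when PyTorch or CUDA is unavailable.

```python
def describe_runtime():
    """Summarize the available accelerator, falling back to CPU."""
    try:
        import torch  # optional: may be absent outside Colab
    except ImportError:
        return {"runtime": "CPU", "gpu_count": 0, "gpu_names": []}

    if not torch.cuda.is_available():
        return {"runtime": "CPU", "gpu_count": 0, "gpu_names": []}

    # Only query device names once CUDA is known to be available.
    count = torch.cuda.device_count()
    names = [torch.cuda.get_device_name(i) for i in range(count)]
    return {"runtime": "GPU", "gpu_count": count, "gpu_names": names}


info = describe_runtime()
print(info)
```

On the Colab runtime used here, this would be expected to report one GPU, matching the outputs above.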


### Importing data files

In [None]:
# First, upload the 4 dataset files for this task.
from google.colab import files
uploaded = files.upload()

Saving fullidlist.csv to fullidlist.csv
Saving reference.csv to reference.csv
Saving text.csv to text.csv
Saving train.csv to train.csv


### Data preprocessing

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Converting the imported datasets to Pandas dataframes.

fullidlist = pd.read_csv('fullidlist.csv', sep=',')
print(fullidlist.shape)
display(fullidlist.head())

(25561, 1)


Unnamed: 0,id
0,0
1,1
2,2
3,3
4,4


In [None]:
train = pd.read_csv('train.csv', sep=',')
print(train.shape)
display(train.head())

(12779, 2)


Unnamed: 0,id,label
0,0,1
1,3,1
2,6,1
3,8,0
4,9,0


In [None]:
text = pd.read_csv('text.csv', sep=',')
print(text.shape)
display(text.head())
#display(text["title"])

(25561, 2)


Unnamed: 0,id,title
0,0,interactive visual exploration of neighbor bas...
1,1,autodomainmine a graphical data mining system ...
2,2,anipqo almost non intrusive parametric query o...
3,3,relational division four algorithms and their ...
4,4,selection and ranking of text from highly impe...


In [None]:
reference = pd.read_csv('reference.csv', sep=',')
print(reference.shape)
display(reference.head())

(73313, 2)


Unnamed: 0,id,id.1
0,0,22305
1,0,22491
2,1,9243
3,1,10943
4,1,14322


In [None]:
# First, we need to build a single table with the columns title, references, and label, with exactly one row per id.
# To do so, we attach the titles from text.csv to the rows of train.csv with the same id.
# Then we do the same for the references.

merged1 = train.merge(text, on='id')
merged1.to_csv('merged1.csv', index=False)
#display(merged1.head())
print(merged1.shape)

## We also tried repeating the title, to test whether weighting it more heavily than the references helps classification.
## In our experiments this lowered the validation accuracy, so the code is left commented out.
#merged1["title"]= merged1["title"].str.repeat(4) 
#merged1["title"]= merged1["title"] + ' ' + merged1["title"]  
#merged1.to_csv('merged1.csv', index=False)
#display(merged1.head())
#print(merged1.shape)

reference2 = reference.copy()   # Copy, so that the id pairs in `reference` are not overwritten by titles below.
reference2['id.1'] = reference2['id.1'].map(text.set_index('id')['title'])
reference2.to_csv('reference2.csv', index=False)
#display(reference2.head())
print(reference.shape)
print(reference2.shape)

reference3 = reference2.groupby('id')['id.1'].apply(' '.join).reset_index()
#reference3 = reference2.groupby('id')['id.1'].sum() ##this does not insert space
reference3.to_csv('reference3.csv', index=False)
#display(reference3.head())
print(reference3.shape)

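# reference3 has only (17696, 2) rows: papers without references are missing.
# The left merge with the full id list restores them, giving (25561, 2).
reference4 = fullidlist.merge(reference3, on='id', how='left')
reference4.to_csv('reference4.csv', index=False)
#display(reference4.head())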
print(reference4.shape)

merged2 = merged1.merge(reference4, on='id')
merged2.to_csv('merged2.csv', index=False)
#display(merged2.head())
print(merged2.shape)

#replace all NaNs with an empty string
merged2 = merged2.replace(np.nan, '', regex=True)

merged2['description'] = merged2['title'] + ' ' + merged2['id.1']
conc = merged2
conc.to_csv('conc.csv', index=False)
#display(conc.head())

## More data helps: with titles only, the test accuracy with BERT was around 72 percent; adding the references raised it to 81 percent.

(12779, 3)
(73313, 2)
(73313, 2)
(17696, 2)
(25561, 2)
(12779, 4)
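
The preprocessing above can be hard to follow through the chain of merges. As a rough illustration, the same logic is sketched below in plain Python on toy data: each cited id is replaced by its title, the titles are joined per citing paper, and the result is appended to the paper's own title. All ids, titles, and labels in this sketch are made up, and dictionaries stand in for the pandas dataframes.

```python
from collections import defaultdict

# Toy stand-ins for text.csv, reference.csv and train.csv (values are made up).
titles = {0: "paper a", 1: "paper b", 2: "paper c"}
references = [(0, 1), (0, 2), (1, 2)]   # (citing id, cited id)
labels = {0: 1, 2: 0}                   # only the labelled papers

# Step 1: replace each cited id by its title and group by citing paper.
cited_titles = defaultdict(list)
for citing_id, cited_id in references:
    cited_titles[citing_id].append(titles[cited_id])

# Step 2: build "title + referenced titles". Papers without references get
# an empty string, mirroring the NaN -> '' replacement in the notebook.
rows = []
for paper_id, label in labels.items():
    refs = " ".join(cited_titles.get(paper_id, []))
    rows.append({"id": paper_id, "label": label,
                 "description": titles[paper_id] + " " + refs})

for row in rows:
    print(row)
```

Note that a paper without references keeps a trailing space in its description, which is also visible in some of the texts printed later in the notebook.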


In [None]:
# Now we use the preprocessed table as the new train dataset. Note that this overwrites the uploaded train.csv.
train = conc
train.to_csv('train.csv', index=False)

In [None]:
train.head()

Unnamed: 0,id,label,title,id.1,description
0,0,1,interactive visual exploration of neighbor bas...,a framework for clustering evolving data strea...,interactive visual exploration of neighbor bas...
1,3,1,relational division four algorithms and their ...,implementation techniques for main memory data...,relational division four algorithms and their ...
2,6,1,simplifying xml schema effortless handling of ...,statix making xml count answering xml queries ...,simplifying xml schema effortless handling of ...
3,8,0,funbase a function based information managemen...,temporal databases status and research direc...,funbase a function based information managemen...
4,9,0,inverted matrix efficient discovery of frequen...,dynamic itemset counting and implication rules...,inverted matrix efficient discovery of frequen...


In [None]:
# Split the data into a train set (80%) and a held-out test set (20%).
from sklearn.model_selection import train_test_split
train, testData = train_test_split(train, test_size = 0.2)
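
The split above is unseeded, so the rows that end up in the test set (and hence the reported accuracies) can change from run to run. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this purpose. The sketch below shows, in plain Python, what a seeded and stratified split does; the function name `stratified_split` and the toy labels are illustrative assumptions, not part of the original notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), holding out ~test_size of each class."""
    rng = random.Random(seed)           # fixed seed -> same split on every run
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


toy_labels = [0] * 10 + [1] * 10 + [2] * 5
train_idx, test_idx = stratified_split(toy_labels, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Stratifying keeps the class proportions of the test set close to those of the full data, which makes accuracies from different runs easier to compare.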

In [None]:
train

Unnamed: 0,id,label,title,id.1,description
4772,9418,3,sound complete and scalable path sensitive ana...,static error detection using semantic inconsis...,sound complete and scalable path sensitive ana...
7281,14397,2,an incremental theorem prover,,an incremental theorem prover
7659,15158,4,an improved redundancy elimination algorithm f...,bridging the gap between underspecification fo...,an improved redundancy elimination algorithm f...
100,203,2,semi supervised classification using sparse ga...,maximum margin clustering made practical,semi supervised classification using sparse ga...
5584,11056,0,designing specific weighted similarity measure...,,designing specific weighted similarity measure...
...,...,...,...,...,...
10659,21229,2,a martingale framework for concept change dete...,,a martingale framework for concept change dete...
558,1085,3,representation of factual information by equat...,a lazy evaluator compilation and delayed evalu...,representation of factual information by equat...
7994,15858,1,supporting flat relations by a nested relation...,a study of index structures for main memory da...,supporting flat relations by a nested relation...
3178,6241,1,a layered architecture for querying dynamic we...,w3qs a query system for the world wide web ari...,a layered architecture for querying dynamic we...


In [None]:
testData

Unnamed: 0,id,label,title,id.1,description
6408,12696,0,svm selective sampling for ranking with applic...,active learning of label ranking functions pre...,svm selective sampling for ranking with applic...
12412,24806,1,on the modes and meaning of feedback to transa...,integrity checking for multiple updates databa...,on the modes and meaning of feedback to transa...
2262,4396,0,parallel text searching in serial files using ...,,parallel text searching in serial files using ...
3869,7586,2,a portfolio approach to algorithm selection,taming the computational complexity of combina...,a portfolio approach to algorithm selection ta...
341,650,4,a hybrid approach to the automatic planning of...,speech acts and rationality planning coherent ...,a hybrid approach to the automatic planning of...
...,...,...,...,...,...
12083,24160,2,hierarchical sampling for active learning,agnostic active learning performance threshold...,hierarchical sampling for active learning agno...
9384,18662,4,unknown word extraction for chinese documents,empirical estimates of adaptation the chance o...,unknown word extraction for chinese documents ...
9776,19454,2,a meta programming technique for debugging ans...,,a meta programming technique for debugging ans...
8166,16203,4,re usable tools for precision machine translation,parsing the wall street journal using a lexica...,re usable tools for precision machine translat...


In [None]:
# Rename the label and description columns to labels and text, the default column names for text classification in the simpletransformers library.
train = train.rename(columns={'label': 'labels', 'description': 'text'})
testData = testData.rename(columns={'label': 'labels', 'description': 'text'})

### Installing simpletransformers Library

In [None]:
!pip install simpletransformers   # Installing simpletransformers Python library used for natural language processing.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simpletransformers
  Downloading simpletransformers-0.63.9-py3-none-any.whl (250 kB)
Collecting datasets
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting transformers>=4.6.0
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers

In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

### Electra model

In [None]:
# Configuring model args
electraArgs = ClassificationArgs(num_train_epochs=1, overwrite_output_dir=True) 

# Forming the classification model
electraModel = ClassificationModel(
    "electra", "google/electra-small-discriminator",
    use_cuda = True, # Making sure GPU is used (as opposed to CPU)
    num_labels = 5, # Determining the number of labels of the classification task
    args = electraArgs )

Some weights of the model checkpoint at google/electra-small-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier

In [None]:
# Training the classification model 
electraModel.train_model(train)

  0%|          | 0/10223 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/1278 [00:00<?, ?it/s]

(1278, 1.076999844239911)
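
The step counts in the progress bars follow from the dataset sizes and the batch size. The short calculation below reproduces them under the assumption that the batch size is 8, which is not set anywhere in this notebook but is consistent with the counts shown (10223 training examples in 1278 steps, 2556 test examples in 320 evaluation steps). The first element of the returned tuple above appears to be this same training step count.

```python
import math

train_examples = 10223   # from the training progress bar
eval_examples = 2556     # size of the held-out test set
batch_size = 8           # assumption: library default, not set in the notebook

# The last batch may be smaller than batch_size, hence the ceiling.
train_steps = math.ceil(train_examples / batch_size)
eval_steps = math.ceil(eval_examples / batch_size)
print(train_steps, eval_steps)
```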

In [None]:
# Evaluation of the classification model on Test dataset
mccAndLoss, modelOutputs, falsePredictions = electraModel.eval_model(testData)


  0%|          | 0/2556 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/320 [00:00<?, ?it/s]

In [None]:
mccAndLoss

{'mcc': 0.7074778821566007, 'eval_loss': 0.7796264171600342}
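
The `mcc` entry is the Matthews correlation coefficient, which ranges from -1 to 1 and equals 0 for predictions no better than chance. As a rough sketch of what the number measures, the function below computes the multiclass form from label counts. The name `multiclass_mcc` and the toy labels are illustrative; this is not the library's own implementation.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Matthews correlation coefficient for any number of classes."""
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    true_counts = Counter(y_true)                     # occurrences per true class
    pred_counts = Counter(y_pred)                     # occurrences per predicted class
    classes = set(true_counts) | set(pred_counts)

    cov_tp = c * s - sum(true_counts[k] * pred_counts[k] for k in classes)
    cov_tt = s * s - sum(true_counts[k] ** 2 for k in classes)
    cov_pp = s * s - sum(pred_counts[k] ** 2 for k in classes)
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
toy_mcc = multiclass_mcc(toy_true, toy_pred)
print(round(toy_mcc, 4))
```

Unlike plain accuracy, this score accounts for how predictions are distributed across classes, so it is less flattering when a model favors the frequent classes.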

In [None]:
modelOutputs

array([[ 1.80273438,  0.27685547, -0.72900391, -0.99511719, -1.17285156],
       [ 0.63720703,  2.53125   , -1.29296875, -0.89013672, -1.3046875 ],
       [ 1.828125  ,  0.71972656, -0.92382812, -1.0546875 , -1.39648438],
       ...,
       [-0.359375  , -0.94775391,  0.13525391,  1.88769531, -0.69238281],
       [-0.47436523, -0.66894531,  0.3425293 , -0.58496094,  1.37109375],
       [ 0.69873047,  2.51367188, -1.31054688, -0.82617188, -1.39355469]])

In [None]:
modelOutputs.shape

(2556, 5)
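
Each row of `modelOutputs` holds one score per class for one test paper; these appear to be raw scores (logits) rather than probabilities, since they are not restricted to the range 0 to 1. A minimal sketch of how such a row maps to a predicted label and to class probabilities, using the first two rows printed above:

```python
import math

# First two rows of the model outputs shown above.
logits = [
    [1.80273438, 0.27685547, -0.72900391, -0.99511719, -1.17285156],
    [0.63720703, 2.53125, -1.29296875, -0.89013672, -1.3046875],
]


def softmax(row):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


predictions = []
for row in logits:
    probs = softmax(row)
    predicted = max(range(len(row)), key=lambda i: row[i])   # argmax
    predictions.append(predicted)
    print(predicted, [round(p, 3) for p in probs])
```

The predicted class is simply the index of the largest score, so the softmax step is only needed when probabilities are of interest.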

In [None]:
falsePredictions

[{'guid': 8, 'text_a': 'typed functional query languages with equational specifications fql   a functional query language', 'text_b': None, 'label': 0},
 {'guid': 25, 'text_a': 'ix cubes iceberg cubes for data warehousing and olap on xml data bottom up computation of sparse and iceberg cubes data cube a relational aggregation operator generalizing group by cross tab and sub total extending xquery for analytics qc trees an efficient summary structure for semantic olap analytical processing of xml documents opportunities and challenges star cubing computing iceberg cubes by top down and bottom up integration', 'text_b': None, 'label': 0},
 {'guid': 29, 'text_a': 'boomerang resourceful lenses for string data pads a domain specific language for processing ad hoc data lineage tracing for general data warehouse transformations', 'text_b': None, 'label': 3},
 {'guid': 41, 'text_a': 'on the automatic transformation of class membership criteria ', 'text_b': None, 'label': 4},
 {'guid': 48, 'tex

In [None]:
len(falsePredictions)   # Obtaining the number of false predictions.

594

In [None]:
predictedLabels, modelOutputs = electraModel.predict(list(testData.text))  #Predicting the labels of the test set (the text column) with the trained model.

  0%|          | 0/2556 [00:00<?, ?it/s]

  0%|          | 0/320 [00:00<?, ?it/s]

In [None]:
predictedLabels   # Obtaining the predicted labels of the Test dataset.

array([0, 1, 0, ..., 3, 4, 1])

In [None]:
len(predictedLabels)

2556

In [None]:
# Accuracy of the classification model
from sklearn.metrics import accuracy_score
accuracy_score(list(testData.labels), list(predictedLabels))

0.7676056338028169
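
As a consistency check, the accuracy can be recovered from the number of false predictions: with 2556 test papers, accuracy = (2556 - wrong) / 2556. The sketch below applies this to the false-prediction counts reported in this notebook for the simpletransformers models.

```python
n_test = 2556   # size of the held-out test set

# Number of false predictions reported for each model in this notebook.
wrong = {"Electra": 594, "XLNet": 541, "RoBERTa": 533, "DeBERTa": 529}

accuracy = {name: (n_test - w) / n_test for name, w in wrong.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.4f}")
```

The values agree with the `accuracy_score` outputs, which confirms that `eval_model` and `predict` classify the test set identically.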

### AWD_LSTM model (fastai)

In [None]:
from fastai.text.all import *  # For the AWD_LSTM model, we use the fastai library.

In [None]:
from fastai.text.data import TextDataLoaders

In [None]:
train.to_csv('train2.csv', index=False)

In [None]:
dataset = TextDataLoaders.from_csv(path='/content/', csv_fname='train2.csv', text_col='text', label_col='labels')
dataset.show_batch(max_n=4)

Unnamed: 0,text,category
0,xxbos database techniques for the world wide web a survey extracting schema from semistructured data sqlxnf processing composite objects as abstractions over relational data a framework for supporting data integration using the materialized and virtual approaches incremental maintenance for materialized views over semistructured data querying and updating the file graph structured views and their incremental maintenance w3qs a query system for the world wide web ariadne a system for constructing mediators for internet sources describing and using query capabilities of heterogeneous sources an overview of good cost based query scrambling for initial delays to weave the web accessing relational databases from the world wide web optimizing queries across diverse data sources object exchange across heterogeneous information sources leveraging mediator cost models with heterogeneous data sources catching the boat with strudel experiences with a web site management system obtaining complete answers from incomplete databases integration of heterogeneous databases without common domains,1
1,xxbos efficient anomaly monitoring over moving object trajectory streams robust and fast similarity search for moving object trajectories efficient retrieval of similar time sequences under time warping efficient algorithms for mining outliers from large data sets mining top n local outliers in large databases lof identifying density based local outliers efficiently supporting ad hoc queries in large datasets of time sequences exact indexing of dynamic time warping indexing multi dimensional time series with support for multiple distance measures warping indexes with envelope transforms for query by humming fast subsequence matching in time series databases mining distance based outliers from large databases in any metric space identifying similarities periodicities and bursts for online search queries trajectory clustering a partition and group framework time dependent semantic similarity measure of queries using historical click through data mining distance based outliers in near linear time with randomization and a simple pruning rule evaluating continuous,0
2,xxbos the xxunk model and algebra for unified data mining query flocks a generalization of association rule mining optics ordering points to identify the clustering structure online generation of association rules data cube a relational aggregation operator generalizing group by cross tab and sub total automatic subspace clustering of high dimensional data for data mining applications scalable techniques for mining causal structures birch an efficient data clustering method for very large databases beyond market baskets generalizing association rules to correlations optimization of constrained frequent set queries with 2 variable constraints exploratory mining and pruning optimizations of constrained association rules efficient and effective clustering methods for spatial data mining what can hierarchies do for data warehouses a new sql like operator for mining association rules discovery of multiple level association rules from large databases spirit sequential pattern mining with regular expression constraints finding interesting rules from large sets of discovered association,1
3,xxbos evaluation of main memory join algorithms for joins with set comparison join predicates implementation techniques for main memory database systems multi table joins through bitmapped join indices a new join algorithm an overview of the system software of a parallel relational database machine grace hashing methods and relational algebra operations rapid bushy join order optimization with cartesian products benchmarking spatial join operations with spatial output efficient processing of spatial joins using r trees spatial hash joins sort merge join an idea whose time hash passed evaluation of signature files as set access facilities in oodbs efficient computation of spatial joins an evaluation of non equijoin algorithms a performance evaluation of pointer based joins a low communication sort algorithm for a parallel database machine partition based spatial merge join a study of sort algorithms for multiprocessor database machines tradeoffs in processing complex join queries via hashing in multiprocessor database machines,1


In [None]:
from fastai.text.all import *

In [None]:
# Training the AWD_LSTM model on the train dataset.
learner = text_classifier_learner(dataset, AWD_LSTM, drop_mult=0.2, metrics=accuracy)
learner.fine_tune(2, 1e-2)
# The above combination of hyperparameters gave a higher accuracy than, e.g., the combination of drop_mult=0.3 and fine_tune(2, 1e-3).

epoch,train_loss,valid_loss,accuracy,time
0,1.227178,1.07086,0.582681,00:13


epoch,train_loss,valid_loss,accuracy,time
0,0.945674,0.856899,0.702544,00:16
1,0.657202,0.732775,0.74364,00:16


In [None]:
# Predict a label for every test text with the model fitted on the train data.
# Each prediction is a tuple: (label as string, label index tensor, class probabilities).
# For the current task, len(testData.text) = len(testData) = 2556.
labelsSet = [learner.predict(description) for description in testData.text]
print(labelsSet)

[('0', tensor(0), tensor([0.7135, 0.1659, 0.1165, 0.0031, 0.0009])), ('1', tensor(1), tensor([0.0124, 0.9588, 0.0134, 0.0138, 0.0016])), ('3', tensor(3), tensor([0.1557, 0.0762, 0.2862, 0.4093, 0.0725])), ('2', tensor(2), tensor([0.0108, 0.0117, 0.9635, 0.0082, 0.0058])), ('4', tensor(4), tensor([0.0336, 0.0046, 0.1163, 0.0020, 0.8435])), ('2', tensor(2), tensor([0.1083, 0.0207, 0.5615, 0.1679, 0.1416])), ('0', tensor(0), tensor([0.6691, 0.2490, 0.0790, 0.0020, 0.0009])), ('2', tensor(2), tensor([0.0669, 0.0117, 0.8361, 0.0221, 0.0632])), ('1', tensor(1), tensor([0.2351, 0.5808, 0.0372, 0.0339, 0.1131])), ('4', tensor(4), tensor([1.3540e-02, 2.6602e-05, 4.6772e-02, 1.9408e-03, 9.3772e-01])), ('2', tensor(2), tensor([0.0221, 0.0076, 0.8231, 0.1434, 0.0037])), ('0', tensor(0), tensor([0.8565, 0.1053, 0.0339, 0.0030, 0.0013])), ('1', tensor(1), tensor([1.4779e-02, 9.7359e-01, 9.9188e-03, 1.1206e-03, 5.8802e-04])), ('4', tensor(4), tensor([9.7826e-04, 1.9942e-04, 1.1645e-02, 1.0273e-03, 9.

In [None]:
# Keep only the first element of each prediction tuple: the predicted label (a string).
labels = [prediction[0] for prediction in labelsSet]
print(labels)

['0', '1', '3', '2', '4', '2', '0', '2', '1', '4', '2', '0', '1', '4', '3', '2', '0', '0', '1', '1', '1', '0', '3', '2', '0', '1', '3', '3', '0', '1', '0', '1', '3', '2', '1', '1', '2', '2', '1', '1', '2', '2', '1', '3', '3', '0', '4', '0', '1', '2', '2', '3', '0', '2', '1', '2', '3', '0', '3', '2', '2', '2', '4', '1', '4', '3', '3', '4', '2', '2', '4', '3', '1', '2', '3', '2', '1', '0', '2', '0', '3', '0', '0', '3', '2', '4', '2', '2', '3', '1', '0', '3', '2', '0', '3', '3', '1', '1', '0', '3', '3', '4', '3', '1', '3', '1', '2', '1', '2', '0', '4', '1', '3', '2', '2', '1', '2', '2', '0', '1', '0', '1', '2', '0', '1', '0', '1', '0', '1', '3', '1', '2', '1', '0', '3', '2', '3', '3', '3', '0', '4', '2', '4', '0', '1', '2', '3', '0', '0', '2', '0', '2', '2', '2', '0', '0', '4', '3', '1', '4', '0', '4', '1', '1', '0', '2', '0', '3', '1', '0', '1', '3', '1', '3', '2', '2', '2', '0', '2', '1', '2', '4', '2', '2', '2', '1', '0', '1', '4', '0', '2', '1', '2', '1', '3', '4', '2', '3', '2', '1',
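
The predicted labels come back as strings, while the true labels are integers, so the two must be brought to the same type before they can be compared. A small sketch of the unpacking on toy data follows. The tuples mimic the first three predictions printed above, with plain Python numbers and lists standing in for the tensors so that the snippet runs without fastai.

```python
# Toy stand-ins shaped like the tuples printed above:
# (label as string, class index, class probabilities).
toy_predictions = [
    ("0", 0, [0.7135, 0.1659, 0.1165, 0.0031, 0.0009]),
    ("1", 1, [0.0124, 0.9588, 0.0134, 0.0138, 0.0016]),
    ("3", 3, [0.1557, 0.0762, 0.2862, 0.4093, 0.0725]),
]

# Convert the string labels to integers while unpacking.
int_labels = [int(label) for label, _index, _probs in toy_predictions]

# The largest probability indicates how confident each prediction is.
confidences = [max(probs) for _label, _index, probs in toy_predictions]

print(int_labels)
print(confidences)
```

The third prediction has a top probability of only about 0.41, which marks it as a much less certain decision than the first two.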

In [None]:
test_id = testData['id']

In [None]:
test_pred = pd.DataFrame({'id': test_id, 'labels': labels})
test_pred.to_csv('test_pred.csv', index=False)
print(test_pred.shape)
test_pred.head()

(2556, 2)


Unnamed: 0,id,labels
6408,12696,0
12412,24806,1
2262,4396,3
3869,7586,2
341,650,4


In [None]:
test_pred['labels']

6408     0
12412    1
2262     3
3869     2
341      4
        ..
12083    2
9384     4
9776     2
8166     4
1225     1
Name: labels, Length: 2556, dtype: object

In [None]:
test_pred['labels'] = pd.to_numeric(test_pred['labels'])   # Convert the dtype of the labels column from object to int64.
test_pred['labels']

6408     0
12412    1
2262     3
3869     2
341      4
        ..
12083    2
9384     4
9776     2
8166     4
1225     1
Name: labels, Length: 2556, dtype: int64

In [None]:
truetest = testData

In [None]:
truetest['labels']

6408     0
12412    1
2262     0
3869     2
341      4
        ..
12083    2
9384     4
9776     2
8166     4
1225     1
Name: labels, Length: 2556, dtype: int64

In [None]:
# Obtaining the accuracy of the model on the test data.
from sklearn.metrics import accuracy_score
accuracy_score(truetest['labels'], test_pred['labels'])

0.7668231611893583

### XLNet model

In [None]:
# Configuring model args
xlnetArgs = ClassificationArgs(num_train_epochs=1, overwrite_output_dir=True) 

# Forming the classification model
xlnetModel = ClassificationModel(
    "xlnet", "xlnet-base-cased",
    use_cuda = True, # Making sure GPU is used (as opposed to CPU)
    num_labels = 5, # Determining the number of labels of the classification task
    args = xlnetArgs )

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'sequence_summary.summary.weight', 'logits_proj.weight', 'sequence_summary.summary.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

In [None]:
# Training the classification model 
xlnetModel.train_model(train)

  0%|          | 0/10223 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/1278 [00:00<?, ?it/s]



(1278, 0.7848536411361515)

In [None]:
# Evaluation of the classification model on Test dataset
mccAndLoss, modelOutputs, falsePredictions = xlnetModel.eval_model(testData)


  0%|          | 0/2556 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/320 [00:00<?, ?it/s]

In [None]:
mccAndLoss

{'mcc': 0.7342218568375798, 'eval_loss': 0.6157406583428383}

In [None]:
len(falsePredictions)   # Obtaining the number of false predictions.

541

In [None]:
predictedLabels, modelOutputs = xlnetModel.predict(list(testData.text))  #Predicting the labels of the test set (the text column) with the trained model.

  0%|          | 0/2556 [00:00<?, ?it/s]

  0%|          | 0/320 [00:00<?, ?it/s]

In [None]:
# Accuracy of the classification model
from sklearn.metrics import accuracy_score
accuracy_score(list(testData.labels), list(predictedLabels))

0.7883411580594679

### RoBERTa model

In [None]:
# Configuring model args
robertaArgs = ClassificationArgs(num_train_epochs=1, overwrite_output_dir=True) 

# Forming the classification model
robertaModel = ClassificationModel(
    "roberta", "roberta-base", # "xlm-roberta-large" resulted in a lower accuracy (0.76).
    use_cuda = True, # Making sure GPU is used (as opposed to CPU)
    num_labels = 5, # Determining the number of labels of the classification task
    args = robertaArgs )

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

In [None]:
# Training the classification model 
robertaModel.train_model(train)

  0%|          | 0/10223 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/1278 [00:00<?, ?it/s]

(1278, 0.7504272726371721)

In [None]:
# Evaluation of the classification model on Test dataset
mccAndLoss, modelOutputs, falsePredictions = robertaModel.eval_model(testData)


  0%|          | 0/2556 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/320 [00:00<?, ?it/s]

In [None]:
mccAndLoss

{'mcc': 0.7369199673346406, 'eval_loss': 0.6190074577927589}

In [None]:
len(falsePredictions)   # Obtaining the number of false predictions.

533

In [None]:
predictedLabels, modelOutputs = robertaModel.predict(list(testData.text))  #Predicting the labels of the test set (the text column) with the trained model.

  0%|          | 0/2556 [00:00<?, ?it/s]

  0%|          | 0/320 [00:00<?, ?it/s]

In [None]:
# Accuracy of the classification model
from sklearn.metrics import accuracy_score
accuracy_score(list(testData.labels), list(predictedLabels))

0.7914710485133021

### DeBERTa model

In [None]:
# Configuring model args
debertaArgs = ClassificationArgs(num_train_epochs=1, overwrite_output_dir=True) 

# Forming the classification model
debertaModel = ClassificationModel(
    "deberta", "microsoft/deberta-base",
    use_cuda = True, # Making sure GPU is used (as opposed to CPU)
    num_labels = 5, # Determining the number of labels of the classification task
    args = debertaArgs )

Some weights of the model checkpoint at microsoft/deberta-base were not used when initializing DebertaForSequenceClassification: ['lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-base and are newly initialized: ['pooler.dense.bias', 'classifi

In [None]:
# Training the classification model 
debertaModel.train_model(train)

  0%|          | 0/10223 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/1278 [00:00<?, ?it/s]

(1278, 0.7850601325519908)

In [None]:
# Evaluation of the classification model on Test dataset
mccAndLoss, modelOutputs, falsePredictions = debertaModel.eval_model(testData)


  0%|          | 0/2556 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/320 [00:00<?, ?it/s]

In [None]:
mccAndLoss

{'mcc': 0.7383602960786659, 'eval_loss': 0.6015734273358249}

In [None]:
len(falsePredictions)   # Obtaining the number of false predictions.

529

In [None]:
predictedLabels, modelOutputs = debertaModel.predict(list(testData.text))  #Predicting the labels of the test set (the text column) with the trained model.

In [None]:
# Accuracy of the classification model
from sklearn.metrics import accuracy_score
accuracy_score(list(testData.labels), list(predictedLabels))

0.793035993740219
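
The accuracy above can be cross-checked against the error count reported earlier. The following is a minimal sketch, assuming that `len(falsePredictions)` counts the misclassified test examples and that the test set holds the 2556 examples shown in the progress output; the helper name `accuracy_from_errors` is hypothetical.

```python
def accuracy_from_errors(num_errors: int, num_examples: int) -> float:
    """Accuracy implied by a count of misclassified examples."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    return (num_examples - num_errors) / num_examples


# 529 wrong predictions out of 2556 test examples (numbers taken from the outputs above).
deberta_accuracy = accuracy_from_errors(529, 2556)
print(round(deberta_accuracy, 4))  # 0.793
```

The same check applies to the other models in this notebook, since each reports both an error count and an accuracy.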

### Longformer model

In [None]:
# Configuring model args
longformerArgs = ClassificationArgs(num_train_epochs=1, overwrite_output_dir=True)

# Building the classification model
longformerModel = ClassificationModel(
    "longformer", "allenai/longformer-base-4096",
    use_cuda=True,  # Run on the GPU rather than the CPU
    num_labels=5,   # Number of classes in this task
    args=longformerArgs,
)

Some weights of the model checkpoint at allenai/longformer-base-4096 were not used when initializing LongformerForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing LongformerForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LongformerForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', '

In [None]:
# Training the classification model 
longformerModel.train_model(train)

(1278, 0.7517109321130087)

In [None]:
# Evaluating the classification model on the test set
mccAndLoss, modelOutputs, falsePredictions = longformerModel.eval_model(testData)


In [None]:
mccAndLoss

{'mcc': 0.738943534360373, 'eval_loss': 0.5979624390602112}

In [None]:
len(falsePredictions)   # Number of misclassified test examples

529

In [None]:
# Predicting the labels of the test set (the text column) with the trained model
predictedLabels, modelOutputs = longformerModel.predict(list(testData.text))

In [None]:
# Accuracy of the classification model
from sklearn.metrics import accuracy_score
accuracy_score(list(testData.labels), list(predictedLabels))

0.793035993740219

### DistilBERT model

In [None]:
# Configuring model args
distilbertArgs = ClassificationArgs(num_train_epochs=1, overwrite_output_dir=True)

# Building the classification model
distilbertModel = ClassificationModel(
    "distilbert", "distilbert-base-uncased",
    use_cuda=True,  # Run on the GPU rather than the CPU
    num_labels=5,   # Number of classes in this task
    args=distilbertArgs,
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifi

In [None]:
# Training the classification model 
distilbertModel.train_model(train)

(1278, 0.7027078349615486)

In [None]:
# Evaluating the classification model on the test set
mccAndLoss, modelOutputs, falsePredictions = distilbertModel.eval_model(testData)


In [None]:
mccAndLoss

{'mcc': 0.7540924601887646, 'eval_loss': 0.5857472568750381}

In [None]:
len(falsePredictions)   # Number of misclassified test examples

498

In [None]:
# Predicting the labels of the test set (the text column) with the trained model
predictedLabels, modelOutputs = distilbertModel.predict(list(testData.text))

In [None]:
# Accuracy of the classification model
from sklearn.metrics import accuracy_score
accuracy_score(list(testData.labels), list(predictedLabels))

0.8051643192488263

### BERT model

In [None]:
# Configuring model args
bertArgs = ClassificationArgs(num_train_epochs=1, overwrite_output_dir=True)

# Building the classification model
bertModel = ClassificationModel(
    "bert", "bert-base-uncased",
    use_cuda=True,  # Run on the GPU rather than the CPU
    num_labels=5,   # Number of classes in this task
    args=bertArgs,
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
# Training the classification model 
bertModel.train_model(train)

(1278, 0.7166085636391886)

In [None]:
# Evaluating the classification model on the test set
mccAndLoss, modelOutputs, falsePredictions = bertModel.eval_model(testData)


In [None]:
mccAndLoss

{'mcc': 0.7646026172987265, 'eval_loss': 0.5573107093572617}

In [None]:
len(falsePredictions)   # Number of misclassified test examples

477

In [None]:
# Predicting the labels of the test set (the text column) with the trained model
predictedLabels, modelOutputs = bertModel.predict(list(testData.text))

In [None]:
# Accuracy of the classification model
from sklearn.metrics import accuracy_score
accuracy_score(list(testData.labels), list(predictedLabels))
# BERT reaches about 81% test accuracy, far above the 20% expected from uniform random guessing over 5 labels.

0.8133802816901409
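
The 20% reference point assumes uniform guessing over the five labels. Because the test set is not perfectly balanced, a slightly stronger trivial baseline is to always predict the most frequent class. The sketch below compares the two, assuming the per-class test counts given in the "support" column of the classification report further down in this notebook; the variable names are illustrative only.

```python
# Test-set class counts, copied from the "support" column of the classification report.
support = {0: 624, 1: 596, 2: 545, 3: 472, 4: 319}

total = sum(support.values())

# Expected accuracy when guessing uniformly at random over the labels.
uniform_baseline = 1 / len(support)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(support.values()) / total

print(total)                       # 2556
print(f"{uniform_baseline:.4f}")   # 0.2000
print(f"{majority_baseline:.4f}")  # 0.2441
```

Under either baseline, the BERT accuracy of about 81% remains far above chance.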

### Confusion matrix and classification report for the best model (BERT)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(list(testData.labels), list(predictedLabels))

array([[455, 101,  41,   6,  21],
       [ 53, 517,  12,  11,   3],
       [ 61,  11, 412,  31,  30],
       [  7,  17,  28, 417,   3],
       [ 17,   2,  13,   9, 278]])
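
The `mcc` value reported by `eval_model` can be recovered from this matrix. The sketch below assumes that the reported figure is the standard multiclass Matthews correlation coefficient (the quantity computed by scikit-learn's `matthews_corrcoef`) and that rows are true labels and columns are predicted labels; the function name `multiclass_mcc` is hypothetical.

```python
import math

# BERT confusion matrix copied from the output above (rows = true, columns = predicted).
cm = [
    [455, 101, 41, 6, 21],
    [53, 517, 12, 11, 3],
    [61, 11, 412, 31, 30],
    [7, 17, 28, 417, 3],
    [17, 2, 13, 9, 278],
]


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = sum(sum(row) for row in cm)                 # total number of examples
    correct = sum(cm[k][k] for k in range(len(cm)))  # diagonal = correct predictions
    true_totals = [sum(row) for row in cm]           # examples per true class
    pred_totals = [sum(col) for col in zip(*cm)]     # examples per predicted class

    numerator = correct * n - sum(t * p for t, p in zip(true_totals, pred_totals))
    denominator = math.sqrt(
        (n * n - sum(p * p for p in pred_totals))
        * (n * n - sum(t * t for t in true_totals))
    )
    return numerator / denominator if denominator else 0.0


bert_mcc = multiclass_mcc(cm)
print(round(bert_mcc, 4))  # 0.7646
```

The result agrees with the `mcc` of about 0.7646 reported for BERT earlier in this notebook.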

In [None]:
from sklearn.metrics import classification_report
target_names = ['0', '1', '2', '3', '4']  # Class indices used as display names
print(classification_report(list(testData.labels), list(predictedLabels), target_names=target_names))

              precision    recall  f1-score   support

           0       0.77      0.73      0.75       624
           1       0.80      0.87      0.83       596
           2       0.81      0.76      0.78       545
           3       0.88      0.88      0.88       472
           4       0.83      0.87      0.85       319

    accuracy                           0.81      2556
   macro avg       0.82      0.82      0.82      2556
weighted avg       0.81      0.81      0.81      2556
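
The per-class figures in this report follow directly from the confusion matrix. As a minimal sketch, assuming rows are true labels and columns are predicted labels (the scikit-learn convention), precision divides each diagonal entry by its column total and recall divides it by its row total; the helper `per_class_metrics` is a hypothetical name.

```python
# BERT confusion matrix copied from the output above (rows = true, columns = predicted).
cm = [
    [455, 101, 41, 6, 21],
    [53, 517, 12, 11, 3],
    [61, 11, 412, 31, 30],
    [7, 17, 28, 417, 3],
    [17, 2, 13, 9, 278],
]


def per_class_metrics(cm):
    """Return one (precision, recall, f1) tuple per class."""
    pred_totals = [sum(col) for col in zip(*cm)]  # column sums = predicted counts
    metrics = []
    for k, row in enumerate(cm):
        true_positives = row[k]
        true_total = sum(row)                     # row sum = support of class k
        precision = true_positives / pred_totals[k] if pred_totals[k] else 0.0
        recall = true_positives / true_total if true_total else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


bert_metrics = per_class_metrics(cm)
for label, (p, r, f) in enumerate(bert_metrics):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Class 0 has the lowest recall because 101 of its 624 examples are predicted as class 1, the largest off-diagonal entry in the matrix.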

