Text classification using RoBERTa
=================================

We implemented text classification on the disaster tweets dataset and the IMDb movie review sentiment dataset.
This second notebook covers the IMDb dataset.

Get the IMDb dataset

In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar zxvf aclImdb_v1.tar.gz

Inspect the extracted training directory

In [None]:
!ls aclImdb/train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt
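
The listing shows the labelled reviews split into `pos` and `neg` subdirectories (the `unsup` directory holds unlabelled reviews, which this notebook does not use). As a quick sanity check before loading, the number of files per class could be counted. A minimal sketch; `count_reviews` is a hypothetical helper, not part of the dataset tooling:

```python
from os import listdir
from os.path import isdir, join


def count_reviews(root, labels=("neg", "pos")):
    """Return {label: number of files in root/label}, skipping missing directories."""
    counts = {}
    for label in labels:
        folder = join(root, label)
        if isdir(folder):
            counts[label] = len(listdir(folder))
    return counts


# Only runs the check if the archive has been extracted next to the notebook.
if isdir("aclImdb/train"):
    print(count_reviews("aclImdb/train"))
```

Roughly equal counts for the two classes would confirm that the training data is balanced, which matters when reading the evaluation metrics later on.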


Install libraries

In [None]:
!pip install --upgrade transformers
!pip install simpletransformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=0da2588820

Imports

In [None]:
from os import listdir
from os.path import join

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel, ClassificationArgs

Load the dataset

In [None]:
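# Map the directory names to integer class ids.
label_encode = {"neg": 0, "pos": 1}


def load_imdb(path):
  """Read every review under path/neg and path/pos into a DataFrame."""
  labels = []
  texts = []
  for label in ["neg", "pos"]:
    for f in listdir(join(path, label)):
      # Read as UTF-8 explicitly rather than relying on the platform default.
      with open(join(path, label, f), encoding="utf-8") as fp:
        texts.append(fp.read())
      labels.append(label_encode[label])
  return pd.DataFrame({"text": texts, "target": labels})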

train = load_imdb("aclImdb/train")
test = load_imdb("aclImdb/test")
train

Unnamed: 0,text,target
0,Oh Geez... There are so many other films I wan...,0
1,"I was looking forward to this ride, and was ho...",0
2,The worst movie I have seen in a while. Yeah i...,0
3,You get a gift. It is exquisitely wrapped. The...,0
4,This film is really terrible. terrible as in i...,0
...,...,...
24995,I have been waiting for such an original pictu...,1
24996,I am watching the series back to back as fast ...,1
24997,The 1997 low-key indie dramedy Henry Fool woul...,1
24998,"If you enjoy the subtle (yes, I said subtle) a...",1


Split the labelled dataset into training and validation sets.

In [None]:
train_df, valid_df = train_test_split(train, test_size=0.2, stratify=train["target"], random_state=0)

Define model

In [None]:
model_args = ClassificationArgs(num_train_epochs=8, overwrite_output_dir=True, manual_seed=42)
model = ClassificationModel(model_type='roberta', model_name='roberta-base', use_cuda=True, num_labels=2, args=model_args)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out

Train it

In [None]:
model.train_model(train_df)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


  0%|          | 0/20000 [00:00<?, ?it/s]

Epoch:   0%|          | 0/8 [00:00<?, ?it/s]

Running Epoch 0 of 8:   0%|          | 0/2500 [00:00<?, ?it/s]

Running Epoch 1 of 8:   0%|          | 0/2500 [00:00<?, ?it/s]

Running Epoch 2 of 8:   0%|          | 0/2500 [00:00<?, ?it/s]

Running Epoch 3 of 8:   0%|          | 0/2500 [00:00<?, ?it/s]

Running Epoch 4 of 8:   0%|          | 0/2500 [00:00<?, ?it/s]

KeyboardInterrupt: ignored
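
The "Dataframe headers not specified" warning in the output above appears because the frame's columns are named `text` and `target`. To the best of my knowledge, simpletransformers looks for columns named `text` and `labels` and otherwise falls back to column positions. The fallback happens to pick the right columns here, since `text` is column 0 and `target` is column 1, but renaming the label column should silence the warning and make the mapping explicit. A sketch on a stand-in frame (the two example rows are made up):

```python
import pandas as pd

# Stand-in for train_df; the real frame comes from load_imdb and train_test_split.
toy_df = pd.DataFrame({"text": ["great film", "dull plot"], "target": [1, 0]})

# Assumption: simpletransformers expects the columns to be called "text" and "labels".
# rename returns a new frame and leaves toy_df unchanged.
labelled_df = toy_df.rename(columns={"target": "labels"})

print(list(labelled_df.columns))
```

With the real data this would amount to passing `train_df.rename(columns={"target": "labels"})` to `train_model`, and the renamed `valid_df` to `eval_model`.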

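Evaluate its performance on the validation set
The training log above ends with a KeyboardInterrupt during epoch 4 of 8, so these results may come from a partially trained model.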

In [None]:
result, model_outputs, wrong_preds = model.eval_model(valid_df)
print(result)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


  0%|          | 0/5000 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/625 [00:00<?, ?it/s]

{'mcc': 0.7612197109172187, 'tp': 2118, 'tn': 2281, 'fp': 219, 'fn': 382, 'auroc': 0.95193864, 'auprc': 0.9518503254798096, 'eval_loss': 0.6716627651179209}
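
The result dictionary reports the confusion-matrix counts but not accuracy, precision, recall or F1. These can be derived from the counts alone. A small sketch using only the numbers printed above:

```python
from math import sqrt

# Confusion-matrix counts copied from the eval_model output above.
tp, tn, fp, fn = 2118, 2281, 219, 382

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
# Matthews correlation coefficient, to cross-check against the reported 'mcc'.
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} mcc={mcc:.4f}")
```

Accuracy comes out to 0.8798 (4399 of 5000 validation reviews), and the recomputed MCC agrees with the reported value of about 0.7612. The model misses more positives (382 false negatives) than it raises false alarms (219 false positives), so recall is lower than precision.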


Predict on the test set

In [None]:
test_predictions, raw_outputs = model.predict(list(test.text.values))

  0%|          | 0/25000 [00:00<?, ?it/s]

  0%|          | 0/3125 [00:00<?, ?it/s]
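
`model.predict` returns the predicted class ids together with `raw_outputs`, which for a classification model are the per-class scores (logits) before normalisation. If probabilities are wanted, for example to rank reviews by confidence, a softmax over each row converts them. A minimal sketch with made-up logits standing in for `raw_outputs`:

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up logits for two reviews; columns are [neg, pos] as in label_encode.
example_logits = [[2.0, 0.0], [-1.0, 1.0]]
probs = softmax(example_logits)
pred = probs.argmax(axis=1)  # same classes as taking argmax of the logits

print(probs.round(3), pred)
```

Each row of `probs` sums to 1, and the column with the larger value corresponds to the predicted class.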

Save as submission file

In [None]:
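from google.colab import files  # Colab helper that provides files.download

# test.csv must contain an `id` column with one row per prediction.
test_original = pd.read_csv("/content/test.csv")

sample_sub = pd.DataFrame({"target": test_predictions, "id": test_original.id})
sample_sub.to_csv("submission.csv", index=False)
files.download("submission.csv")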
