Text classification using RoBERTa
=================================

We implemented text classification on the disaster tweets dataset and the IMDb ranking dataset.
This first notebook covers the disaster tweets dataset.

Get the disaster tweets dataset

In [None]:
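from google.colab import files
# Upload kaggle.json (the Kaggle API token) from the local machine.
files.upload()
# Put the token where the Kaggle CLI looks for it and restrict its permissions.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
# Install the Kaggle CLI and the modelling libraries.
!pip install --upgrade --force-reinstall --no-deps kaggle
!pip install --upgrade transformers
!pip install simpletransformers
# Download and unpack the competition data.
!kaggle competitions download -c nlp-getting-started
!unzip nlp-getting-started.zip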

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel, ClassificationArgs

Load dataset

In [2]:
train = pd.read_csv("/content/train.csv")
test = pd.read_csv("/content/test.csv")
train.drop(["id", "keyword", "location"], axis=1, inplace=True)
test.drop(["id", "keyword", "location"], axis=1, inplace=True)
train

index,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...
7608,Two giant cranes holding a bridge collapse int...,1
7609,@aria_ahrary @TheTawniest The out of control w...,1
7610,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,Police investigating after an e-bike collided ...,1


Split the labelled dataset into training and validation sets.

In [3]:
train_df, valid_df = train_test_split(train, test_size=0.2, stratify=train["target"], random_state=0)
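
The training and evaluation cells later in this notebook log the warning "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels." A likely cause is that simpletransformers looks for columns named `text` and `labels`, while this dataset calls the label column `target`. The fallback happens to pick the right columns here, but renaming makes the mapping explicit. A minimal sketch of the rename on a two-row stand-in frame (`demo_df` and its second row are hypothetical; in the notebook the same call would apply to `train_df` and `valid_df`):

```python
import pandas as pd

# Hypothetical stand-in for train_df / valid_df with the same column names.
demo_df = pd.DataFrame(
    {
        "text": ["Forest fire near La Ronge Sask. Canada", "What a lovely day"],
        "target": [1, 0],
    }
)

# Rename the label column to the header name simpletransformers looks up.
demo_df = demo_df.rename(columns={"target": "labels"})

print(list(demo_df.columns))
```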

Define model

In [4]:
model_args = ClassificationArgs(num_train_epochs=9, overwrite_output_dir=True, manual_seed=42)
model = ClassificationModel(model_type='roberta', model_name='roberta-base', use_cuda=True, num_labels=2, args=model_args)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out

Train it

In [5]:
model.train_model(train_df)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


  0%|          | 0/6090 [00:00<?, ?it/s]

Epoch:   0%|          | 0/9 [00:00<?, ?it/s]

Running Epoch 0 of 9:   0%|          | 0/762 [00:00<?, ?it/s]

Running Epoch 1 of 9:   0%|          | 0/762 [00:00<?, ?it/s]

Running Epoch 2 of 9:   0%|          | 0/762 [00:00<?, ?it/s]

Running Epoch 3 of 9:   0%|          | 0/762 [00:00<?, ?it/s]

Running Epoch 4 of 9:   0%|          | 0/762 [00:00<?, ?it/s]

Running Epoch 5 of 9:   0%|          | 0/762 [00:00<?, ?it/s]

Running Epoch 6 of 9:   0%|          | 0/762 [00:00<?, ?it/s]

Running Epoch 7 of 9:   0%|          | 0/762 [00:00<?, ?it/s]

Running Epoch 8 of 9:   0%|          | 0/762 [00:00<?, ?it/s]

(6858, 0.33839607791689685)
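
The numbers in the progress bars and in the returned tuple can be cross-checked by hand. The sketch below assumes a batch size of 8 for training, evaluation and prediction (this notebook never sets one, and 8 is the value that reproduces the logged step counts) and assumes that `train_test_split` rounds the validation share up. Under those assumptions the first element of the tuple, 6858, is the total number of training steps; the second element appears to be the average training loss.

```python
import math

# Example counts taken from the progress bars in this notebook.
n_train, n_valid, n_test = 6090, 1523, 3263
n_labelled = n_train + n_valid  # labelled tweets before the split
batch_size = 8                  # assumption: not set explicitly in this notebook
num_epochs = 9                  # num_train_epochs in ClassificationArgs

# An 80/20 split with the validation share rounded up.
expected_valid = math.ceil(0.2 * n_labelled)

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_batches = math.ceil(n_valid / batch_size)
predict_batches = math.ceil(n_test / batch_size)

print(expected_valid, steps_per_epoch, total_steps, eval_batches, predict_batches)
```

The step counts match the `762`, `191` and `408` shown in the training, evaluation and prediction progress bars, and `total_steps` matches the first element of the returned tuple.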

Evaluate its performance

In [6]:
result, model_outputs, wrong_preds = model.eval_model(valid_df)
print(result)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


  0%|          | 0/1523 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/191 [00:00<?, ?it/s]

{'mcc': 0.599761880178017, 'tp': 507, 'tn': 717, 'fp': 152, 'fn': 147, 'auroc': 0.8675795230202383, 'auprc': 0.8648651783286968, 'eval_loss': 1.0125171315407682}
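
The result dictionary reports the confusion-matrix counts but not accuracy, precision, recall or F1. These follow directly from `tp`, `tn`, `fp` and `fn`. A minimal sketch using the counts printed above, which also recomputes `mcc` as a sanity check against the logged value:

```python
import math

# Confusion-matrix counts copied from the evaluation output.
tp, tn, fp, fn = 507, 717, 152, 147

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient, for comparison with the logged 'mcc'.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} mcc={mcc:.4f}")
```

With these counts the sketch gives an accuracy of about 0.80 and an F1 of about 0.77 on the validation set.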


Predict on test set

In [None]:
test_predictions, raw_outputs = model.predict(list(test.text.values))

  0%|          | 0/3263 [00:00<?, ?it/s]

  0%|          | 0/408 [00:00<?, ?it/s]
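
`model.predict` returns the predicted labels together with `raw_outputs`, which are expected to be unnormalised per-class scores (logits) rather than probabilities. If probabilities are needed, for example to choose a decision threshold, a softmax over each row converts them. A minimal sketch in plain Python; the logits in `example_outputs` are made-up values, not output from this model:

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two tweets: [score for class 0, score for class 1].
example_outputs = [[-1.2, 2.3], [0.8, -0.4]]

probabilities = [softmax(row) for row in example_outputs]
# The predicted class is the index of the largest probability in each row.
predicted = [max(range(len(p)), key=p.__getitem__) for p in probabilities]

print(predicted)
```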


Save as submission file

In [None]:
test_original = pd.read_csv("/content/test.csv")

# Use the column order of the competition's sample submission: id, target.
submission = pd.DataFrame({"id": test_original.id, "target": test_predictions})
submission.to_csv("submission.csv", index=False)
files.download("submission.csv")
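
Before uploading, it can be worth checking that the file has the two expected columns, one row per test tweet, and only 0/1 labels. A minimal sketch using the standard library; `check_submission` is a hypothetical helper, not part of any library used in this notebook:

```python
import csv


def check_submission(path, expected_rows):
    """Raise ValueError if the submission CSV looks malformed."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Column order is not checked, only the set of column names.
        if set(reader.fieldnames or []) != {"id", "target"}:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        rows = list(reader)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    bad = [row["id"] for row in rows if row["target"] not in {"0", "1"}]
    if bad:
        raise ValueError(f"non-binary targets for ids: {bad[:5]}")
    return True
```

In this notebook it could be called as `check_submission("submission.csv", expected_rows=3263)`, where 3263 is the number of test examples shown in the prediction progress bar.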
