Text classification using RoBERTa
=================================

The first two notebooks implemented text classification on the disaster tweets dataset and the IMDb ranking dataset.
This third notebook covers the AmITheAsshole (AITA) Reddit posts dataset.

Get the AITA dataset

In [None]:
!pip install --upgrade transformers
!pip install simpletransformers
!pip install dvc

In [1]:
!dvc get https://github.com/iterative/aita_dataset aita_clean.csv

In [2]:
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel, ClassificationArgs

Load dataset

In [5]:
aita_dataset = pd.read_csv("/content/aita_clean.csv")
aita_dataset

Unnamed: 0,id,timestamp,title,body,edited,verdict,score,num_comments,is_asshole
0,1ytxov,1.393279e+09,[AITA] I wrote an explanation in TIL and came ...,[Here is the post in question](http://www.redd...,False,asshole,52,13.0,1
1,1yu29c,1.393281e+09,[AITA] Threw my parent's donuts away,"My parents are diabetic, morbidly obese, and a...",1393290576.0,asshole,140,27.0,1
2,1yu8hi,1.393285e+09,I told a goth girl she looked like a clown.,I was four.,False,not the asshole,74,15.0,0
3,1yuc78,1.393287e+09,[AItA]: Argument I had with another redditor i...,http://www.reddit.com/r/HIMYM/comments/1vvfkq/...,1393286962.0,everyone sucks,22,3.0,1
4,1yueqb,1.393288e+09,[AITA] I let my story get a little long and bo...,,False,not the asshole,6,4.0,0
...,...,...,...,...,...,...,...,...,...
97623,ex94w5,1.580577e+09,AITA for telling my sister she is being a spoi...,My sister(17F) and I(15M) are white kids born ...,1580585457.0,not the asshole,16,23.0,0
97624,ex970f,1.580577e+09,AITA for telling my husband to f* off after he...,My husband (28M) and I (32F) are married for a...,1580584475.0,not the asshole,1373,304.0,0
97625,ex9dwo,1.580578e+09,AITA for attempting to keep my students out of...,Upfront apologies for formatting. I’m also try...,False,not the asshole,4,15.0,0
97626,ex9egs,1.580578e+09,WIBTA if I left my brothers fate up to the state?,A little back story my mom is a drug addict an...,False,not the asshole,280,140.0,0


Get the relevant columns

In [6]:
train = pd.DataFrame({"text": aita_dataset.title[:25000], "labels": aita_dataset.is_asshole[:25000]})

Split the labelled dataset into training and validation sets.

In [7]:
train_df, valid_df = train_test_split(train, test_size=0.2, stratify=train["labels"], random_state=0)
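The split is stratified on `labels`, so both parts keep the same class proportions as the full 25,000 rows. If those proportions are skewed, it can be worth computing per-class loss weights before training. The sketch below is not part of the original notebook: `inverse_frequency_weights` is a hypothetical helper that uses the heuristic n_samples / (n_classes * class_count), and the toy labels are made up for illustration. Whether weighting actually helps on this dataset is untested here.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one loss weight per class, ordered by class id (hypothetical helper)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    # Rare classes get weights above 1, frequent classes get weights below 1.
    return [n_samples / (n_classes * counts[c]) for c in sorted(counts)]


# Toy labels for illustration only: three negatives and one positive.
toy_labels = [0, 0, 0, 1]
toy_weights = inverse_frequency_weights(toy_labels)
print(toy_weights)
```

In this notebook the helper could be called as `inverse_frequency_weights(train_df["labels"])`, and the resulting list passed through the `weight` argument of `ClassificationModel` when the model is constructed.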

Define model

In [8]:
model_args = ClassificationArgs(num_train_epochs=3, overwrite_output_dir=True, manual_seed=42)
model = ClassificationModel(model_type='roberta', model_name='roberta-base', use_cuda=True, num_labels=2, args=model_args)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out

Train it

In [9]:
model.train_model(train_df)

(7500, 0.58899134298563)
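
`train_model` returned a tuple: the first element is the number of optimizer steps taken and the second is the average training loss. The step count can be checked by hand. The sketch below assumes a training batch size of 8 (the simpletransformers default, which this notebook does not set explicitly) and no gradient accumulation; `expected_steps` is an illustrative helper, not a library function.

```python
import math


def expected_steps(n_examples, batch_size, n_epochs):
    """Number of optimizer steps for a plain training loop (illustrative helper)."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * n_epochs


n_train = 20000  # 80% of the 25,000 labelled rows kept for training
total_steps = expected_steps(n_train, batch_size=8, n_epochs=3)
print(total_steps)  # 7500
```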

Evaluate its performance

In [11]:
result, model_outputs, wrong_preds = model.eval_model(valid_df)
print(result)

{'mcc': 0.0, 'tp': 0, 'tn': 3658, 'fp': 0, 'fn': 1342, 'auroc': 0.509219834607039, 'auprc': 0.27213070146341717, 'eval_loss': 0.5820193046092987}


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
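
The confusion counts show what happened: `tp` and `fp` are both 0, so the model predicted class 0 for every validation row. The sketch below recomputes a few metrics from the printed counts using only the standard library. It is an illustration, not the code simpletransformers runs, and the helper name `safe_div` is made up here. With no positive predictions the MCC denominator is zero, which is presumably what the warning fragment above refers to. The accuracy equals the share of the majority class, and the entropy of the label prior lands very close to the reported `eval_loss`, which suggests the model learned little beyond the class frequencies.

```python
import math

# Counts copied from the evaluation result above.
tp, tn, fp, fn = 0, 3658, 0, 1342
total = tp + tn + fp + fn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero (illustrative helper)."""
    return num / den if den else 0.0


accuracy = (tp + tn) / total
recall = safe_div(tp, tp + fn)
precision = safe_div(tp, tp + fp)
f1 = safe_div(2 * precision * recall, precision + recall)
mcc = safe_div(
    tp * tn - fp * fn,
    math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
)

# Cross-entropy of a model that always outputs the positive-class frequency.
p = (tp + fn) / total
prior_entropy = -(p * math.log(p) + (1 - p) * math.log(1 - p))

print(f"accuracy={accuracy:.4f} recall={recall} f1={f1} mcc={mcc}")
print(f"positive rate={p:.4f} prior entropy={prior_entropy:.4f}")
```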


Predict on test set

In [10]:
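test = pd.DataFrame({"text": aita_dataset.title[25000:30000]})  # assumption: no test file exists, so rows unused in training act as the test set
test_predictions, raw_outputs = model.predict(list(test.text.values))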
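
`model.predict` returns the predicted labels together with `raw_outputs`, one row of unnormalized scores (logits) per input. A softmax turns each row into class probabilities, which makes it possible to apply a decision threshold other than the default argmax. The sketch below uses made-up logits and a hypothetical threshold of 0.3; neither comes from this notebook's model.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three inputs: [score for class 0, score for class 1].
example_logits = [[1.2, -0.4], [0.1, 0.0], [-0.8, 0.9]]
threshold = 0.3  # hypothetical cut-off on the class-1 probability

probs = [softmax(row) for row in example_logits]
labels = [int(p[1] >= threshold) for p in probs]
print(labels)
```

With real predictions, `example_logits` would be replaced by the `raw_outputs` array returned by `model.predict`.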

Save as submission file

In [None]:
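from google.colab import files  # Colab helper that provides files.download

# The AITA download has no test.csv; as an assumption, take ids from the held-out rows 25000-29999.
test_original = aita_dataset[25000:30000]

sample_sub = pd.DataFrame({"target": test_predictions, "id": test_original.id.values})
sample_sub.to_csv("submission.csv", index=False)
files.download("submission.csv")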
