# RoBERTa

This notebook fine-tunes RoBERTa for text classification using the PyTorch implementation in [huggingface/transformers](https://github.com/huggingface/transformers).

In [7]:
from google.colab import drive
drive.mount('/content/drive')

PROJ_DIR = "drive/MyDrive/CS4248 Project/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%%capture
!pip install transformers evaluate

In [14]:
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

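# Column names are supplied explicitly via `names`, so the file is read as having no header row.
fulltrain = pd.read_csv(PROJ_DIR + 'raw_data/fulltrain.csv', names=['label', 'text'])
fulltrain = fulltrain.iloc[:1000, :]  # TODO: keeps only the first 1,000 rows for a quick trial run; remove to train on the full set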

train, valid = train_test_split(fulltrain, test_size=0.2, shuffle=True)

data = DatasetDict()
data['train'] = Dataset.from_pandas(train)
data['valid'] = Dataset.from_pandas(valid)
data['train'][0]

{'label': 1,
 'text': "The Alabama Department of Education reported Wednesday that its sole textbook has begun to seriously show its age after more than a decade of heavy daily use at the state's 1,500 public schools. Officials said the decrepit tome, titled Introduction To Civics, has recently become so tattered that it is now nearly unusable for the 748,000 students enrolled in kindergarten through 12th grade who are required to share it. 'When you have every child in Alabama using the same textbook, there's bound to be a certain amount of wear and tear over time,' said State Superintendent Dr. Thomas R. Bice, lifting the book's cover to reveal the thin strip of adhesive barely connecting the badly disfigured piece of cardboard to its spine. 'But with our book in this conditionpages partially ripped, some separated from the binding and jammed elsewhere in the wrong sequential order, others missing entirelyit becomes difficult to maintain an effective curriculum.' 'Unfortunately, what
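Because the cell above keeps only the first 1,000 rows, it may be worth checking how many classes that slice actually contains before splitting. The sketch below uses a small toy DataFrame rather than `fulltrain.csv`; the 1-to-4 label range and the label-sorted row order are assumptions made for illustration, and `toy`, `head_counts`, `subset`, `train_toy`, and `valid_toy` are hypothetical names. It also shifts labels to 0-based ids, since a classification head created with `num_labels=4` expects label ids 0 to 3.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for fulltrain.csv; the 1-to-4 label range and the
# label-sorted row order are assumptions, not facts about the real file.
toy = pd.DataFrame({
    "label": [1] * 6 + [2] * 6 + [3] * 6 + [4] * 6,
    "text": [f"document {i}" for i in range(24)],
})

# Taking the first rows of a label-sorted file can yield a single class.
head_counts = toy.iloc[:6]["label"].value_counts().to_dict()

# Sampling the same number of rows per class keeps the subset small
# while still covering every label.
subset = toy.groupby("label").sample(n=4, random_state=0)

# Shift labels to 0-based ids (1..4 becomes 0..3).
subset = subset.assign(label=subset["label"] - 1)

# Stratify so train and validation share the same class proportions.
train_toy, valid_toy = train_test_split(
    subset, test_size=0.25, stratify=subset["label"], random_state=0
)

print(head_counts)
print(train_toy["label"].value_counts().sort_index().to_dict())
print(valid_toy["label"].value_counts().sort_index().to_dict())
```

If `value_counts()` on the real slice reports a single label, the validation accuracy of a model trained on it says little about the four-class task.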

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# Tokenize in batches; every text is padded or truncated to the tokenizer's maximum length.
data_tok = data.map(lambda x: tokenizer(x['text'], padding="max_length", truncation=True), batched=True)
# data_tok['train'][0]

train_data = data_tok["train"].shuffle(seed=123)
valid_data = data_tok["valid"].shuffle(seed=123)

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]
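The tokenization cell uses `padding="max_length"`, which pads every example to the tokenizer's maximum length (512 tokens for `roberta-base`). A common alternative is dynamic padding, where each batch is padded only to its own longest sequence; in transformers this is what `DataCollatorWithPadding` does. The following is a rough sketch of the difference in padded token counts, using made-up sequence lengths and a hypothetical helper `padded_tokens`; it is plain Python and does not call the tokenizer.

```python
def padded_tokens(lengths, batch_size, max_length=None):
    """Count token slots after padding.

    With max_length set, every sequence is padded to that fixed length;
    otherwise each batch is padded to its own longest sequence.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = max_length if max_length is not None else max(batch)
        total += target * len(batch)
    return total


# Hypothetical token counts for eight already-truncated documents.
lengths = [120, 300, 80, 250, 45, 510, 200, 64]

fixed = padded_tokens(lengths, batch_size=4, max_length=512)
dynamic = padded_tokens(lengths, batch_size=4)
print(fixed, dynamic)
```

The saving depends entirely on the length distribution: if most documents are truncated at 512 tokens, the two strategies cost about the same.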

In [18]:
from transformers import (AutoModelForSequenceClassification,
                          TrainingArguments,
                          Trainer)
import evaluate
import numpy as np

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=4)
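# Evaluate on the validation set at the end of every epoch.
# Note: newer transformers releases rename this argument to `eval_strategy`.
training_args = TrainingArguments(output_dir="checkpoints", evaluation_strategy="epoch")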
metric = evaluate.load("accuracy")

def compute_metrics(pred):
    logits, labels = pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=valid_data,
    compute_metrics=compute_metrics,
)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.bias', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.
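The `compute_metrics` function in the cell above takes the highest-scoring class for each example and compares it with the reference label. As a minimal illustration of that computation, the sketch below reproduces it with NumPy alone on made-up logits; `accuracy_from_logits` is a hypothetical helper, not part of `evaluate` or transformers.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose arg-max class equals the reference label."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == labels))


# Four examples, four classes; the values are invented for illustration.
logits = np.array([
    [2.0, 0.1, -1.0, 0.3],   # predicts class 0
    [0.2, 0.1, 3.0, 0.0],    # predicts class 2
    [0.5, 1.5, 0.2, 0.1],    # predicts class 1
    [1.0, 0.9, 0.8, 0.7],    # predicts class 0
])
labels = np.array([0, 2, 3, 0])

acc = accuracy_from_logits(logits, labels)
print(acc)  # 0.75: three of the four predictions match
```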

In [19]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.000136        | 1.0      |
| 2     | No log        | 9e-05           | 1.0      |
| 3     | No log        | 7.9e-05         | 1.0      |


TrainOutput(global_step=300, training_loss=0.01868575096130371, metrics={'train_runtime': 248.964, 'train_samples_per_second': 9.64, 'train_steps_per_second': 1.205, 'total_flos': 631477872230400.0, 'train_loss': 0.01868575096130371, 'epoch': 3.0})
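The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults that the notebook does not override (a per-device batch size of 8 and 3 epochs) and a single device; under those assumptions it reproduces `global_step=300` and the reported throughput. The default logging interval of 500 steps is larger than 300, which would explain why the training-loss column shows "No log".

```python
import math

train_examples = 800   # 80% of the 1,000-row subset
batch_size = 8         # assumed: TrainingArguments default, single device
epochs = 3             # matches 'epoch': 3.0 in TrainOutput
runtime = 248.964      # 'train_runtime' in seconds, from TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = round(train_examples * epochs / runtime, 2)
steps_per_second = round(total_steps / runtime, 3)

print(total_steps, samples_per_second, steps_per_second)
```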