# <font size=6> **Data Science Project by Nikola Stanojkovski && Ines Lesnovska** </font>

<font color = 'Orange' size = 4 > Seven NLP Tasks With Twitter Datasets </font>

<br/>

# <font size=6> **Project for "Introduction to Data Science" by Nikola Stanojkovski and Ines Lesnovska** </font>

<font color = 'Orange' size = 4 > Seven natural language processing tasks with tweet data </font>

## <font color="Orange" size=5> Source: https://www.kaggle.com/arashnic/7-nlp-tasks-with-tweets </font>

### <font color="Orange" size=4> Context </font>

<font color="blue" size=3>The experimental landscape in natural language processing for social media is too fragmented. Each year, new shared tasks and datasets are proposed, ranging from classics like sentiment analysis to irony detection or emoji prediction. Therefore, it is unclear what the current state of the art is, as there is no standardized evaluation protocol, nor a strong set of baselines trained on such domain-specific data. The purpose of this dataset is to present an evaluation framework consisting of seven heterogeneous Twitter-specific classification tasks.</font>

### <font color="Orange" size=4> Content </font>

<font color="blue" size=3>
This dataset consists of seven heterogeneous tasks in Twitter, all framed as multi-class tweet classification. Each dataset is presented in the same format and with fixed training, validation and test splits.</font>

### <font color="Orange" size=4> Acknowledgements </font>

<font color="blue" size=3>
Unified Benchmark and Comparative Evaluation for Tweet Classification:

  - Francesco Barbieri♣ Jose Camacho-Collados†
  - Leonardo Neves♣ Luis Espinosa-Anke† </font>

### <font color="Orange" size=4>Inspiration </font>

<ul>
<li><font color="Orange">Emotion Recognition</font></li>
<li><font color="Orange">Emoji Prediction</font></li>
<li><font color="Orange">Irony Detection</font></li>
<li><font color="Orange">Hate Speech Detection</font></li>
<li><font color="Orange">Offensive Language Identification</font></li>
<li><font color="Orange">Sentiment Analysis</font></li>
<li><font color="Orange">Stance Detection</font></li>
</ul>

## <font color="Blue" size=5> For each NLP task below, the model that gave the best results was selected after trying several candidates </font>

## <font color="Blue" size=5> For each NLP task below, the most appropriate evaluation metrics were selected </font>

# **Emotion Recognition**

<font color="Orange" size=5> Using a BERT Pretrained model: 'bert-base-cased' </font>

##  Install Dependencies

In [None]:

!pip install transformers



In [None]:
pip install pytorch_transformers



##  Preprocessing

In [None]:
import pandas as pd

# Transforming the mapping into a .csv file

with open('./source/emotion/mapping.txt') as file:
  matrix = []
  for line in file:
    splits = line.split('\t')
    row = {}
    row['Code'] = splits[0]
    row['Emotion'] = splits[1].rstrip()
    matrix.append(row)
  mapping = pd.DataFrame(matrix)

mapping.to_csv('./source/emotion/mapping.csv', index=False)
mapping

Unnamed: 0,Code,Emotion
0,0,anger
1,1,joy
2,2,optimism
3,3,sadness
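
The same three steps (read the mapping, pair each tweet with its label, look up the label name) repeat for every split and every task in this notebook. A minimal sketch of a reusable helper, assuming the tab-separated `mapping.txt` layout shown above. The names `parse_mapping` and `build_rows` are hypothetical and not part of the original notebook, and in-memory strings stand in for the real files:

```python
import io

def parse_mapping(lines):
    """Parse 'code<TAB>name' lines into a {code: name} dict."""
    mapping = {}
    for line in lines:
        code, name = line.rstrip('\n').split('\t')
        mapping[int(code)] = name
    return mapping

def build_rows(text_lines, label_lines, mapping, label_column):
    """Pair each tweet with its integer label and the label's name."""
    rows = []
    for text, label in zip(text_lines, label_lines):
        code = int(label)
        rows.append({'text': text.rstrip(), label_column: code, 'target': mapping[code]})
    return rows

# In-memory stand-ins for mapping.txt, train_text.txt and train_labels.txt
mapping_file = io.StringIO('0\tanger\n1\tjoy\n2\toptimism\n3\tsadness\n')
text_file = io.StringIO('What makes you feel #joyful?\nShit getting me irritated\n')
label_file = io.StringIO('1\n0\n')

emotion_mapping = parse_mapping(mapping_file)
rows = build_rows(text_file, label_file, emotion_mapping, 'emotion')
print(rows)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`, which would avoid writing the mapping to a CSV file and reading it back for each split.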


In [None]:
# Transforming the .txt files and creating the train dataset

with open('./source/emotion/train_text.txt') as train_text, open('./source/emotion/train_labels.txt') as train_labels:
  matrix = []
  emotions = pd.read_csv('./source/emotion/mapping.csv')
  for text, label in zip(train_text, train_labels):
    emotion = emotions.loc[emotions['Code'] == int(label), 'Emotion'].item()
    row = {}
    row['text'] = text.rstrip()
    row['emotion'] = int(label)
    row['target'] = emotion
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/emotion/train.csv', index=False)
train

Unnamed: 0,text,emotion,target
0,“Worry is a down payment on a problem you may ...,2,optimism
1,My roommate: it's okay that we can't spell bec...,0,anger
2,No but that's so cute. Atsu was probably shy a...,1,joy
3,Rooneys fucking untouchable isn't he? Been fuc...,0,anger
4,it's pretty depressing when u hit pan on ur fa...,3,sadness
...,...,...,...
3252,I get discouraged because I try for 5 fucking ...,3,sadness
3253,The @user are in contention and hosting @user ...,3,sadness
3254,@user @user @user @user @user as a fellow UP g...,0,anger
3255,You have a #problem? Yes! Can you do #somethin...,0,anger


In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/emotion/test_text.txt') as test_text, open('./source/emotion/test_labels.txt') as test_labels:
  matrix = []
  emotions = pd.read_csv('./source/emotion/mapping.csv')
  for text, label in zip(test_text, test_labels):
    emotion = emotions.loc[emotions['Code'] == int(label), 'Emotion'].item()
    row = {}
    row['text'] = text.rstrip()
    row['emotion'] = int(label)
    row['target'] = emotion
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/emotion/test.csv', index=False)
test

Unnamed: 0,text,emotion,target
0,#Deppression is real. Partners w/ #depressed p...,3,sadness
1,@user Interesting choice of words... Are you c...,0,anger
2,My visit to hospital for care triggered #traum...,3,sadness
3,@user Welcome to #MPSVT! We are delighted to h...,1,joy
4,What makes you feel #joyful?,1,joy
...,...,...,...
1416,I need a sparkling bodysuit . No occasion. Jus...,1,joy
1417,@user I've finished reading it; simply mind-bl...,3,sadness
1418,shaft abrasions from panties merely shifted to...,0,anger
1419,All this fake outrage. Y'all need to stop 🤣,0,anger


In [None]:
# Transforming the .txt files and creating the validation dataset

with open('./source/emotion/val_text.txt') as val_text, open('./source/emotion/val_labels.txt') as val_labels:
  matrix = []
  emotions = pd.read_csv('./source/emotion/mapping.csv')
  for text, label in zip(val_text, val_labels):
    emotion = emotions.loc[emotions['Code'] == int(label), 'Emotion'].item()
    row = {}
    row['text'] = text.rstrip()
    row['emotion'] = int(label)
    row['target'] = emotion
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/emotion/val.csv', index=False)
val

Unnamed: 0,text,emotion,target
0,"@user @user Oh, hidden revenge and anger...I r...",0,anger
1,if not then #teamchristine bc all tana has don...,0,anger
2,Hey @user #Fields in #skibbereen give your onl...,0,anger
3,Why have #Emmerdale had to rob #robron of havi...,0,anger
4,@user I would like to hear a podcast of you go...,0,anger
...,...,...,...
369,@user @user If #trump #whitehouse aren't held ...,0,anger
370,@user Which #chutiya #producer #invested in #c...,0,anger
371,Russia story will infuriate Trump today. Media...,0,anger
372,Shit getting me irritated 😠,0,anger


##  Loading Tokenizer and Encoding our Data

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

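# Importing the tokenizer for the model
# Note: do_lower_case=True lowercases the tweets before tokenization,
# even though 'bert-base-cased' is a case-sensitive checkpoint

tokenizer = BertTokenizer.from_pretrained(
    'bert-base-cased',
    do_lower_case=True
)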

In [None]:
import torch

# Encoding the data

encoded_data_train = tokenizer.batch_encode_plus(
    train.text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',   # replaces the deprecated pad_to_max_length=True
    truncation=True,        # explicitly truncate tweets longer than max_length
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.emotion.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.emotion.values)


In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                           labels_val)

dataset_val.tensors

(tensor([[  101,   137,  4795,  ...,     0,     0,     0],
         [  101,  1191,  1136,  ...,     0,     0,     0],
         [  101, 23998,   137,  ...,     0,     0,     0],
         ...,
         [  101,   187, 13356,  ...,     0,     0,     0],
         [  101,  4170,  2033,  ...,     0,     0,     0],
         [  101,   137,  4795,  ...,     0,     0,     0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([0, 0, 0, 0, 0, 0, 3, 3, 0, 3, 1, 0, 0, 3, 3, 0, 0, 3, 3, 0, 3, 1, 2, 3,
         3, 3, 0, 1, 0, 0, 0, 3, 0, 3, 2, 1, 1, 0, 2, 1, 1, 3, 1, 0, 1, 1, 3, 0,
         0, 0, 2, 1, 1, 3, 0, 0, 0, 0, 1, 1, 3, 2, 3, 1, 3, 0, 0, 0, 3, 0, 0, 0,
         1, 2, 1, 0, 1, 0, 1, 0, 3, 0, 0, 1, 0, 0, 3, 0, 3, 3, 0, 1, 3, 1, 0, 1,
         3, 2, 0, 2, 2, 1, 1, 1, 3, 3, 0, 0, 1, 3, 1, 0, 1, 1, 3, 1, 1, 0, 1, 1,
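
The first two tensors printed above show the effect of padding to a fixed length: each row of token ids ends in zeros, and the attention mask is 1 for real tokens and 0 for padding. A minimal pure-Python sketch of that idea follows. The function `pad_and_mask` is hypothetical and only illustrates the concept; it is not how the tokenizer is implemented internally:

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# The three ids are the start of the first row in the tensor above
ids, mask = pad_and_mask([101, 137, 4795], max_length=6)
print(ids)
print(mask)
```

The padding id is a parameter because it depends on the model vocabulary: the BERT output above pads with 0, while the RoBERTa output later in the notebook pads with 1.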

##  Setting up BERT Pretrained Model

In [None]:
from transformers import BertForSequenceClassification

In [None]:
# Creating a dictionary with the possible labels for input

label_dict = {}
possible_labels = mapping['Emotion'].unique()

for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
# Importing the model

model = BertForSequenceClassification.from_pretrained(
                                      'bert-base-cased', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

## Creating Data Loaders

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 5

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Validation needs no shuffling, so the batches are read in order
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=32
)

##  Setting Up Optimizer and Scheduler


In [None]:
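# Setting up the optimizer

from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch implementation
optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8,
    weight_decay = 0.0  # matches the default of the former transformers.AdamW
)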

In [None]:
from transformers import get_linear_schedule_with_warmup

# Setting up the scheduler

epochs = 1
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)
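
With `num_warmup_steps=0`, the scheduler simply decays the learning rate linearly from its initial value to zero over the training steps. The sketch below reproduces that multiplier in plain Python to make the shape of the schedule concrete. The function name `linear_schedule_factor` and the step count of 652 are assumptions made for illustration:

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp-up from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 652  # assumed number of training batches in one epoch
for step in (0, 326, 652):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

Because the schedule length is `len(dataloader_train)*epochs`, changing the batch size or the number of epochs also changes how quickly the learning rate decays.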

##  Defining Performance Metrics

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
# Defining the f1 score metric

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')
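
The `average = 'weighted'` option computes one F1 score per class and averages them, weighting each class by its number of true samples. A small pure-Python sketch of that computation is given below to show what the number means. The function `weighted_f1` is a hypothetical re-implementation for illustration, not scikit-learn's code:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its number of true samples."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_true
    return total / len(y_true)

# Three samples of class 0 and one of class 1; one class-0 sample is misclassified
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(weighted_f1(y_true, y_pred))
```

Weighting by support means that frequent classes dominate the score, so a model can reach a reasonable weighted F1 while still performing poorly on a rare class.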

In [None]:
# Defining the accuracy per class metric

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

##  Creating our Training Loop

In [None]:
import random

# Setting up the environment

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
# Setting up the device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
# Defining the evaluation function

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

## Training the model, making the prediction and calculating the performance metrics

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is the tuple size (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})

    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, description='Epoch 1', max=652.0, style=ProgressStyle(description_widt…

Training loss: 0.6816470446122205


HBox(children=(FloatProgress(value=0.0, max=12.0), HTML(value='')))


Validation loss: 0.6187465737263361
F1 Score (weighted): 0.7856047237210736
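
The training loop calls `torch.nn.utils.clip_grad_norm_` with a maximum norm of 1.0 before every optimizer step. The idea is to rescale all gradients by a common factor whenever their joint L2 norm exceeds the limit. A minimal sketch on plain Python floats, where `clip_by_global_norm` is a hypothetical stand-in for the PyTorch routine:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Small epsilon avoids division by zero; the factor is never allowed to exceed 1
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Gradients whose norm is already below the limit are left unchanged, so clipping only intervenes on unusually large updates.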



## Evaluating the model

In [None]:
accuracy_per_class(predictions, true_vals)

Class: anger
Accuracy:145/160

Class: joy
Accuracy:70/97

Class: optimism
Accuracy:14/28

Class: sadness
Accuracy:66/89
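
The counts above are the number of correctly predicted tweets out of all tweets that truly belong to each class, which is the per-class recall. A short sketch that turns the reported counts into rates and an overall accuracy is shown below; the numbers are copied from the output above, and the variable names are illustrative:

```python
# Correct / total counts copied from the validation output above
counts = {'anger': (145, 160), 'joy': (70, 97), 'optimism': (14, 28), 'sadness': (66, 89)}

# Share of each class's tweets that the model labelled correctly
per_class = {name: correct / total for name, (correct, total) in counts.items()}

# Overall accuracy: all correct predictions over all validation tweets
overall = sum(c for c, _ in counts.values()) / sum(t for _, t in counts.values())

for name, value in per_class.items():
    print(f'{name}: {value:.3f}')
print(f'overall accuracy: {overall:.3f}')
```

Optimism, the smallest class with 28 validation tweets, is recovered only half the time (14 of 28), while anger, the largest class, is recovered for 145 of 160 tweets.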



# **Emoji Prediction**

<font color="Orange" size=5> Using a RoBERTa Pretrained model: 'roberta-large' </font>

##  Install Dependencies

In [None]:
!pip install transformers



In [None]:
pip install pytorch_transformers



## Preprocessing

In [None]:
import pandas as pd

# Transforming the mapping into a .csv file

with open('./source/emoji/mapping.txt') as file:
  matrix = []
  for line in file:
    splits = line.split('\t')
    row = {}
    row['Code'] = splits[0]
    row['Emoji'] = splits[1].rstrip()
    matrix.append(row)
  mapping = pd.DataFrame(matrix)

mapping.to_csv('./source/emoji/mapping.csv', index=False)
mapping

Unnamed: 0,Code,Emoji
0,0,❤
1,1,😍
2,2,😂
3,3,💕
4,4,🔥
5,5,😊
6,6,😎
7,7,✨
8,8,💙
9,9,😘


In [None]:
# Transforming the .txt files and creating the final dataset to work with

emojis = pd.read_csv('./source/emoji/mapping.csv')
matrix = []
noZeros = 0  # tweets with label 0 kept so far; capped at 10000 across all files

# The train, test and validation files are merged into one dataset
for split in ['train', 'test', 'val']:
  with open(f'./source/emoji/{split}_text.txt') as texts, open(f'./source/emoji/{split}_labels.txt') as labels:
    for text, label in zip(texts, labels):
      if int(label) == 0:
        if noZeros == 10000:
          continue
        noZeros = noZeros + 1

      emoji = emojis.loc[emojis['Code'] == int(label), 'Emoji'].item()
      row = {}
      row['text'] = text.rstrip()
      row['emoji'] = int(label)
      row['target'] = emoji
      matrix.append(row)

dataset = pd.DataFrame(matrix)

dataset.to_csv('./source/emoji/dataset.csv', index=False)
dataset

Unnamed: 0,text,emoji,target
0,Sunday afternoon walking through Venice in the...,12,☀
1,Time for some BBQ and whiskey libations. Chomp...,19,😜
2,Love love love all these people ️ ️ ️ #friends...,0,❤
3,"️ ️ ️ ️ @ Toys""R""Us",0,❤
4,Man these are the funniest kids ever!! That fa...,2,😂
...,...,...,...
88937,They're alright @ Da Vinci Banquet Halls,13,💜
88938,Senior night with my little Bailey !! So proud...,3,💕
88939,Real friends or labeled as family! #BrotherMan...,6,😎
88940,It makes me so happy meet people wearing hats ...,3,💕
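
The preprocessing loop above keeps at most 10,000 tweets with label 0 and drops the rest, which reduces the dominance of the most frequent emoji. The same capping logic is sketched below on a toy list; the function `cap_label` and the sample data are hypothetical:

```python
def cap_label(pairs, label, limit):
    """Keep every (text, label) pair, but at most `limit` pairs carrying `label`."""
    kept, seen = [], 0
    for text, lab in pairs:
        if lab == label:
            if seen == limit:
                continue  # cap reached: skip further examples of this label
            seen += 1
        kept.append((text, lab))
    return kept

pairs = [('a', 0), ('b', 0), ('c', 1), ('d', 0), ('e', 2)]
print(cap_label(pairs, label=0, limit=2))
```

Note that this keeps the first examples encountered rather than a random sample, so the retained label-0 tweets come mostly from whichever file is read first.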


In [None]:
# Splitting the dataset into train and validation sets, stratified by emoji label so that both sets keep the same class proportions

from sklearn.model_selection import train_test_split

train, val, y_train, y_val = train_test_split(dataset.index.values, 
                                                  dataset.emoji.values, 
                                                  test_size=0.15, 
                                                  random_state=42,
                                                  stratify=dataset.emoji.values)

In [None]:
dataset['data_type'] = ['not_set']*dataset.shape[0]
dataset.loc[train, 'data_type'] = 'train'
dataset.loc[val, 'data_type'] = 'val'

dataset

Unnamed: 0,text,emoji,target,data_type
0,Sunday afternoon walking through Venice in the...,12,☀,train
1,Time for some BBQ and whiskey libations. Chomp...,19,😜,train
2,Love love love all these people ️ ️ ️ #friends...,0,❤,train
3,"️ ️ ️ ️ @ Toys""R""Us",0,❤,val
4,Man these are the funniest kids ever!! That fa...,2,😂,train
...,...,...,...,...
88937,They're alright @ Da Vinci Banquet Halls,13,💜,train
88938,Senior night with my little Bailey !! So proud...,3,💕,train
88939,Real friends or labeled as family! #BrotherMan...,6,😎,train
88940,It makes me so happy meet people wearing hats ...,3,💕,train


In [None]:
dataset.groupby(['emoji', 'data_type']).count()

emoji,data_type,text,target
0,train,8500,8500
0,val,1500,1500
1,train,8714,8714
1,val,1538,1538
2,train,8288,8288
2,val,1463,1463
3,train,4213,4213
3,val,743,743
4,train,5189,5189
4,val,916,916
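
The counts above reflect the stratified 85/15 split: the 10,000 tweets with label 0 are divided into 8,500 for training and 1,500 for validation. A simplified sketch of stratified splitting is given below. The function `stratified_split` is hypothetical, and scikit-learn's `train_test_split` allocates and shuffles samples differently, so only the per-class proportions are comparable:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, val_idx) so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_val = round(len(idxs) * test_size)  # validation share taken from this class
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy label list: 10000 samples of class 0 and 200 of class 1
labels = [0] * 10000 + [1] * 200
train_idx, val_idx = stratified_split(labels, test_size=0.15)
print(len(train_idx), len(val_idx))
```

Splitting each class separately guarantees that rare emojis appear in the validation set in the same proportion as in the training set.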


##  Loading Tokenizer and Encoding the Data

In [None]:
from transformers import RobertaTokenizer
from torch.utils.data import TensorDataset

In [None]:
# Importing the tokenizer for the model

tokenizer = RobertaTokenizer.from_pretrained(
    'roberta-large',
    do_lower_case=True
)

In [None]:
import torch

# Encoding the data

encoded_data_train = tokenizer.batch_encode_plus(
    dataset[dataset.data_type=='train'].text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',   # replaces the deprecated pad_to_max_length=True
    truncation=True,        # explicitly truncate tweets longer than max_length
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    dataset[dataset.data_type=='val'].text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(dataset[dataset.data_type=='train'].emoji.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(dataset[dataset.data_type=='val'].emoji.values)


In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                           labels_val)

dataset_val.tensors

(tensor([[    0, 12605, 33549,  ...,     1,     1,     1],
         [    0, 10431, 39009,  ...,     1,     1,     1],
         [    0, 18028, 24754,  ...,     1,     1,     1],
         ...,
         [    0,  2794, 18759,  ...,     1,     1,     1],
         [    0,  4651,   881,  ...,     1,     1,     1],
         [    0,  4528,    32,  ...,     1,     1,     1]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([ 0, 11,  7,  ...,  3,  4,  1]))

##  Setting up RoBERTa Pretrained Model

In [None]:
from transformers import RobertaForSequenceClassification

In [None]:
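# Creating a dictionary with the possible labels for input
# Note: unique() returns the emojis in order of first appearance in the dataset, so these
# indices do not coincide with the emoji codes used as training labels; the class names
# printed by accuracy_per_class follow this order-of-appearance indexing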

label_dict = {}
possible_labels = dataset['target'].unique()

for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
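
Because `unique()` returns values in order of first appearance, the indices assigned here depend on the row order of the dataset rather than on the emoji codes. The sketch below contrasts the two indexings using the first five rows of the dataset preview shown earlier; the variable names are illustrative:

```python
# First five (emoji code, emoji) pairs from the dataset preview above
rows = [(12, '☀'), (19, '😜'), (0, '❤'), (0, '❤'), (2, '😂')]

# Order-of-first-appearance indexing, as produced by unique() followed by enumerate()
appearance_index = {emoji: i for i, emoji in enumerate(dict.fromkeys(e for _, e in rows))}

# Indexing by the dataset's own codes, which are the labels the model is trained on
code_index = {emoji: code for code, emoji in rows}

print(appearance_index)
print(code_index)
```

Under the first indexing the sun emoji gets index 0, although code 0 belongs to the red heart. Building the dictionary from the `Code` and `Emoji` columns of the mapping would keep the printed class names aligned with the training labels.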

In [None]:
# Importing the model

model = RobertaForSequenceClassification.from_pretrained(
                                      'roberta-large', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

##  Creating Data Loaders

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 10

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Validation needs no shuffling, so the batches are read in order
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=40
)

##  Setting Up Optimizer and Scheduler

In [None]:
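from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch implementation

# Setting up the optimizer

optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8,
    weight_decay = 0.0  # matches the default of the former transformers.AdamW
)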

In [None]:
from transformers import get_linear_schedule_with_warmup

# Setting up the scheduler

epochs = 1
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)

##  Defining the Performance Metrics

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
# Defining the f1 score metric

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

In [None]:
# Defining the accuracy per class metric

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

##  Creating the Training Loop

In [None]:
import random

# Setting up the environment

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
# Setting up the device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
# Defining the evaluation function

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

## Training the model, making the prediction and calculating the performance metrics

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is the tuple size (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    #torch.save(model.state_dict(), f'Models/BERT_ft_Epoch{epoch}.model')
    

    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, description='Epoch 1', max=7560.0, style=ProgressStyle(description_wid…

Training loss: 1.9830986277885223


HBox(children=(FloatProgress(value=0.0, max=334.0), HTML(value='')))


Validation loss: 1.8143948634227594
F1 Score (weighted): 0.4203684579628441



##  Evaluating the Model

In [None]:
accuracy_per_class(predictions, true_vals) # Best model run

Class: ☀
Accuracy:1202/1500

Class: 😜
Accuracy:733/1538

Class: ❤
Accuracy:1018/1463

Class: 😂
Accuracy:201/743

Class: 🇺🇸
Accuracy:510/916

Class: ✨
Accuracy:94/597

Class: 😍
Accuracy:172/642

Class: 😘
Accuracy:325/794

Class: 🔥
Accuracy:81/452

Class: 💙
Accuracy:115/411

Class: 😊
Accuracy:14/536

Class: 📷
Accuracy:296/456

Class: 😉
Accuracy:275/396

Class: 😁
Accuracy:0/337

Class: 📸
Accuracy:11/399

Class: 😎
Accuracy:52/349

Class: 💜
Accuracy:5/396

Class: 💕
Accuracy:385/459

Class: 💯
Accuracy:478/608

Class: 🎄
Accuracy:3/350



In [None]:
accuracy_per_class(predictions, true_vals) # Second-best model run

Class: ☀
Accuracy:1195/1500

Class: 😜
Accuracy:838/1538

Class: ❤
Accuracy:978/1463

Class: 😂
Accuracy:199/743

Class: 🇺🇸
Accuracy:540/916

Class: ✨
Accuracy:118/597

Class: 😍
Accuracy:148/642

Class: 😘
Accuracy:313/794

Class: 🔥
Accuracy:40/452

Class: 💙
Accuracy:88/411

Class: 😊
Accuracy:279/536

Class: 📷
Accuracy:293/456

Class: 😉
Accuracy:272/396

Class: 😁
Accuracy:0/337

Class: 📸
Accuracy:6/399

Class: 😎
Accuracy:68/349

Class: 💜
Accuracy:4/396

Class: 💕
Accuracy:376/459

Class: 💯
Accuracy:204/608

Class: 🎄
Accuracy:0/350



# **Hate Speech Detection**

<font color="Orange" size=5> Using a RoBERTa Pretrained model: 'roberta-large' </font>

##  Install Dependencies

In [None]:
!pip install transformers



In [None]:
pip install pytorch_transformers



##   Preprocessing

In [None]:
import pandas as pd

# Transforming the mapping into a .csv file

with open('./source/hate/mapping.txt') as file:
  matrix = []
  for line in file:
    splits = line.split('\t')
    row = {}
    row['Code'] = splits[0]
    row['Hate'] = splits[1].rstrip()
    matrix.append(row)
  mapping = pd.DataFrame(matrix)

mapping.to_csv('./source/hate/mapping.csv', index=False)
mapping

Unnamed: 0,Code,Hate
0,0,not-hate
1,1,hate


In [None]:
# Transforming the .txt files and creating the train dataset

with open('./source/hate/train_text.txt') as train_text, open('./source/hate/train_labels.txt') as train_labels:
  matrix = []
  hates = pd.read_csv('./source/hate/mapping.csv')
  for text, label in zip(train_text, train_labels):
    hate = hates.loc[hates['Code'] == int(label), 'Hate'].item()
    row = {}
    row['text'] = text.rstrip()
    row['hate'] = int(label)
    row['target'] = hate
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/hate/train.csv', index=False)
train

Unnamed: 0,text,hate,target
0,@user nice new signage. Are you not concerned ...,0,not-hate
1,A woman who you fucked multiple times saying y...,1,hate
2,@user @user real talk do you have eyes or were...,1,hate
3,your girlfriend lookin at me like a groupie in...,1,hate
4,Hysterical woman like @user,0,not-hate
...,...,...,...
8995,Oooohhhh bitch didn't even listen to the dead ...,0,not-hate
8996,@user Good Luck @user More Americans #WalkAway...,0,not-hate
8997,Bitch you can't keep up so stop trying,1,hate
8998,@user @user @user @user @user @user Japan is a...,0,not-hate


In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/hate/test_text.txt') as test_text, open('./source/hate/test_labels.txt') as test_labels:
  matrix = []
  hates = pd.read_csv('./source/hate/mapping.csv')
  for text, label in zip(test_text, test_labels):
    hate = hates.loc[hates['Code'] == int(label), 'Hate'].item()
    row = {}
    row['text'] = text.rstrip()
    row['hate'] = int(label)
    row['target'] = hate
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/hate/test.csv', index=False)
test

Unnamed: 0,text,hate,target
0,"@user , you are correct that Reid certainly is...",0,not-hate
1,Whoever just unfollowed me you a bitch,1,hate
2,@user @user Those People Invaded Us!!! They DO...,1,hate
3,"stop JUDGING bitches by there cover, jus cuz s...",1,hate
4,how about i knock heads off and send them gift...,1,hate
...,...,...,...
2965,@user Calling them #IllegalAliens is heartless...,0,not-hate
2966,Silly Killary WANNABE !! And @user numbers JUS...,0,not-hate
2967,@user @user @user @user @user @user @user @use...,0,not-hate
2968,@user StopImmigration,1,hate


In [None]:
# Transforming the .txt files and creating the validation dataset

with open('./source/hate/val_text.txt') as val_text, open('./source/hate/val_labels.txt') as val_labels:
  matrix = []
  hates = pd.read_csv('./source/hate/mapping.csv')
  for text, label in zip(val_text, val_labels):
    hate = hates.loc[hates['Code'] == int(label), 'Hate'].item()
    row = {}
    row['text'] = text.rstrip()
    row['hate'] = int(label)
    row['target'] = hate
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/hate/val.csv', index=False)
val

Unnamed: 0,text,hate,target
0,"@user @user If book Claire wanted to ""stay in ...",0,not-hate
1,After arriving in the EU refugees make protest...,0,not-hate
2,😳👇,0,not-hate
3,@user Worst thing is if they are that stupid t...,1,hate
4,@user Say's the HYSTERICAL woman. It is woman ...,0,not-hate
...,...,...,...
995,Pass #MeritBased Immigration. Kill #ChainMigra...,1,hate
996,imagine chaeyoung cutting some cooked meat for...,0,not-hate
997,I usually dont hate people but I actually hate...,1,hate
998,Cameron stopped immigrants voting on the EU in...,1,hate
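
The three cells above repeat the same steps for the train, test and validation files. A minimal sketch of how they could share one helper follows. The function name `build_rows` is hypothetical, and it returns plain dictionaries so that the result can be passed to `pd.DataFrame(...)` exactly as the `matrix` list is. Unlike a bare `zip`, it raises an error when the text and label files have different lengths instead of silently dropping rows.

```python
def build_rows(text_lines, label_lines, code_to_name, label_column):
    """Pair each tweet with its numeric label and readable target name.

    text_lines / label_lines can be open file objects or lists of strings.
    """
    texts = [line.rstrip() for line in text_lines]
    labels = [int(line) for line in label_lines]
    if len(texts) != len(labels):
        raise ValueError(
            f"{len(texts)} texts but {len(labels)} labels; files are misaligned"
        )
    return [
        {"text": text, label_column: label, "target": code_to_name[label]}
        for text, label in zip(texts, labels)
    ]


# Tiny in-memory example using the hate mapping shown above.
hate_names = {0: "not-hate", 1: "hate"}
demo_rows = build_rows(
    ["Hysterical woman like @user\n", "@user StopImmigration\n"],
    ["0\n", "1\n"],
    hate_names,
    "hate",
)
print(demo_rows)
```

With real files this could be used as `with open(text_path) as t, open(label_path) as l: train = pd.DataFrame(build_rows(t, l, hate_names, 'hate'))`, where `text_path` and `label_path` stand for the paths used in the cells above.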


##  Loading Tokenizer and Encoding the Data


In [None]:
from transformers import RobertaTokenizer
from torch.utils.data import TensorDataset

In [None]:
# Importing the tokenizer for the model

# RoBERTa's vocabulary is case-sensitive, so no lower-casing option is passed
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')





In [None]:
import torch

# Encoding the data

encoded_data_train = tokenizer.batch_encode_plus(
    train.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',   # replaces the deprecated pad_to_max_length=True
    truncation=True,        # explicitly truncate tweets longer than max_length
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.hate.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.hate.values)


In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                           labels_val)

dataset_val.tensors

(tensor([[    0,  1039, 12105,  ...,     1,     1,     1],
         [    0,  4993,  7789,  ...,     1,     1,     1],
         [    0, 18636, 15264,  ...,     1,     1,     1],
         ...,
         [    0,   100,  2333,  ...,     1,     1,     1],
         [    0,   347, 35953,  ...,     1,     1,     1],
         [    0, 39012,  3189,  ...,     1,     1,     1]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
         1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
         0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1,
         0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
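
In the tensors above, every row of token ids starts with 0 (RoBERTa's `<s>` token) and ends in a run of 1s (its `<pad>` token), while the attention mask is 1 over real tokens and 0 over padding. The following sketch illustrates that fixed-length layout in plain Python. The function `pad_or_truncate` is hypothetical and simplified: a real tokenizer also keeps the closing special token when it truncates, which this sketch does not attempt. The example ids 0, 1039 and 12105 come from the first row above, and the closing id 2 (`</s>`) is added as an assumption about how the row continues.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                  # simplified truncation
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding          # fill the tail with <pad>
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


# Short example row: <s>, two word pieces, </s>, padded to length 8.
ids, mask = pad_or_truncate([0, 1039, 12105, 2], max_length=8, pad_id=1)
print(ids)
print(mask)
```

The model ignores positions where the mask is 0, so padding to `max_length=256` costs computation but does not change the prediction for a tweet.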

## Setting up RoBERTa Pretrained Model

In [None]:
from transformers import RobertaForSequenceClassification

In [None]:
# Creating a dictionary with the possible labels for input

label_dict = {}
possible_labels = mapping['Hate'].unique()

for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
# Importing the model

model = RobertaForSequenceClassification.from_pretrained(
                                      'roberta-large', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )





Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

## Creating Data Loaders

In [None]:
batch_size = 4
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),  # evaluation does not need shuffling
    batch_size=32
)

##  Setting Up Optimizer and Scheduler

In [None]:
from transformers import AdamW

# Setting up the optimizer

optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8
)

In [None]:
from transformers import get_linear_schedule_with_warmup

# Setting up the scheduler

epochs = 1
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)
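
With `num_warmup_steps=0`, the scheduler simply lowers the learning rate in a straight line from its initial value to zero over the whole run. The sketch below reproduces that multiplier in plain Python to make the shape concrete. The function name `linear_schedule_factor` is hypothetical, and the total of 2250 steps is an assumption taken from the length of the training progress bar for this task (one epoch).

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5        # same value as passed to AdamW above
total_steps = 2250    # assumed: batches per epoch * epochs for this task

lrs = [base_lr * linear_schedule_factor(step, 0, total_steps)
       for step in (0, 1125, 2250)]
print(lrs)
```

Because there is no warmup, the very first update already uses the full learning rate of 1e-5, and the last batches of the epoch are trained with a rate close to zero.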

##  Defining the Performance Metrics

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
# Defining the f1 score metric

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

In [None]:
# Defining the accuracy per class metric

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

##  Creating the Training Loop

In [None]:
import random

# Setting up the environment

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
# Setting up the device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
from tqdm.notebook import tqdm

# Defining the evaluation function

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

## Training the model

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # loss is already averaged over the batch; len(batch) is the tuple size (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    #torch.save(model.state_dict(), f'Models/BERT_ft_Epoch{epoch}.model')
    

    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

Epoch 1 (2250 training batches)

Training loss: 0.5841976440191372


Validation (32 batches)


Validation loss: 0.7236271155998111
F1 Score (weighted): 0.7970300671620828



## Evaluation of the model

In [None]:
accuracy_per_class(predictions, true_vals)

Class: not-hate
Accuracy:471/573

Class: hate
Accuracy:326/427
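
For a binary task, the two per-class counts above fix the whole confusion matrix, so the weighted F1 score can be recomputed from them as a consistency check. The sketch below does this in plain Python. The function name `weighted_f1_from_binary_counts` is hypothetical, and the formula assumes every validation example received exactly one of the two labels.

```python
def weighted_f1_from_binary_counts(neg_correct, neg_total, pos_correct, pos_total):
    """Support-weighted F1 for two classes, from per-class correct/total counts."""
    tn, fp = neg_correct, neg_total - neg_correct   # true negatives, false positives
    tp, fn = pos_correct, pos_total - pos_correct   # true positives, false negatives
    f1_neg = 2 * tn / (2 * tn + fn + fp)            # F1 with 'negative' as target class
    f1_pos = 2 * tp / (2 * tp + fp + fn)            # F1 with 'positive' as target class
    total = neg_total + pos_total
    return (neg_total * f1_neg + pos_total * f1_pos) / total


# not-hate: 471/573 correct, hate: 326/427 correct (from the output above)
hate_f1 = weighted_f1_from_binary_counts(471, 573, 326, 427)
hate_accuracy = (471 + 326) / (573 + 427)
print(f"weighted F1 = {hate_f1:.5f}, accuracy = {hate_accuracy:.3f}")
```

The result agrees with the weighted F1 of about 0.797 reported after training, and the overall accuracy works out to 797 of 1000 validation tweets.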



# **Irony Detection**

<font color="Orange" size=5> Using a DistilBERT Pretrained model: 'distilbert-base-uncased-distilled-squad' </font>

##  Install Dependencies

In [None]:
pip install pytorch_transformers



In [None]:
!pip install transformers



##  Preprocessing

In [None]:
import pandas as pd

# Transforming the mapping into a .csv file

with open('./source/irony/mapping.txt') as file:
  matrix = []
  for line in file:
    splits = line.split('\t')
    row = {}
    row['Code'] = splits[0]
    row['Irony'] = splits[1].rstrip()
    matrix.append(row)
  mapping = pd.DataFrame(matrix)

mapping.to_csv('./source/irony/mapping.csv', index=False)
mapping

Unnamed: 0,Code,Irony
0,0,non_irony
1,1,irony


In [None]:
# Transforming the .txt files and creating the train dataset

with open('./source/irony/train_text.txt') as train_text, open('./source/irony/train_labels.txt') as train_labels:
  matrix = []
  ironies = pd.read_csv('./source/irony/mapping.csv')
  for text, label in zip(train_text, train_labels):
    irony = ironies.loc[ironies['Code'] == int(label), 'Irony'].item()
    row = {}
    row['text'] = text.rstrip()
    row['irony'] = int(label)
    row['target'] = irony
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/irony/train.csv', index=False)
train

Unnamed: 0,text,irony,target
0,seeing ppl walking w/ crutches makes me really...,1,irony
1,"look for the girl with the broken smile, ask h...",0,non_irony
2,Now I remember why I buy books online @user #s...,1,irony
3,@user @user So is he banded from wearing the c...,1,irony
4,Just found out there are Etch A Sketch apps. ...,1,irony
...,...,...,...
2857,I don't have to respect your beliefs.||I only ...,0,non_irony
2858,Women getting hit on by married managers at @u...,1,irony
2859,@user no but i followed you and i saw you post...,0,non_irony
2860,@user I dont know what it is but I'm in love y...,0,non_irony


In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/irony/test_text.txt') as test_text, open('./source/irony/test_labels.txt') as test_labels:
  matrix = []
  ironies = pd.read_csv('./source/irony/mapping.csv')
  for text, label in zip(test_text, test_labels):
    irony = ironies.loc[ironies['Code'] == int(label), 'Irony'].item()
    row = {}
    row['text'] = text.rstrip()
    row['irony'] = int(label)
    row['target'] = irony
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/irony/test.csv', index=False)
test

Unnamed: 0,text,irony,target
0,@user Can U Help?||More conservatives needed o...,0,non_irony
1,"Just walked in to #Starbucks and asked for a ""...",1,irony
2,#NOT GONNA WIN,0,non_irony
3,@user He is exactly that sort of person. Weirdo!,0,non_irony
4,So much #sarcasm at work mate 10/10 #boring 10...,1,irony
...,...,...,...
779,"If you drag yesterday into today, your tomorro...",0,non_irony
780,Congrats to my fav @user & her team & my birth...,0,non_irony
781,@user Jessica sheds tears at her fan signing e...,0,non_irony
782,#Irony: al jazeera is pro Anti - #GamerGate be...,1,irony


In [None]:
# Transforming the .txt files and creating the validation dataset

with open('./source/irony/val_text.txt') as val_text, open('./source/irony/val_labels.txt') as val_labels:
  matrix = []
  ironies = pd.read_csv('./source/irony/mapping.csv')
  for text, label in zip(val_text, val_labels):
    irony = ironies.loc[ironies['Code'] == int(label), 'Irony'].item()
    row = {}
    row['text'] = text.rstrip()
    row['irony'] = int(label)
    row['target'] = irony
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/irony/val.csv', index=False)
val

Unnamed: 0,text,irony,target
0,#NBA players #NY support protests of #police k...,1,irony
1,A new year about to start|So many people came ...,0,non_irony
2,"Obama's $1,176,120.90 in Taxpayer Funded Cost...",1,irony
3,Can't wait to work with the dream team again t...,1,irony
4,!!! RT @user Of all the places to get stuck in...,1,irony
...,...,...,...
950,Abraham was actually from modern day Iraq (Ur ...,0,non_irony
951,@user which one is more disturbing dan? Tickli...,1,irony
952,@user @user haha that's cool! I had a feeling ...,0,non_irony
953,@user @user Let the Western bastards bank acco...,1,irony


In [None]:
train.groupby(['irony']).count()

irony,text,target
0,1417,1417
1,1445,1445
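
The two classes are almost balanced, which gives a useful reference point: a classifier that always predicts the more frequent class would be right only about half of the time. A small sketch of that majority-class baseline follows. The helper name `majority_baseline` is hypothetical, and the counts are the training counts shown above.

```python
def majority_baseline(class_counts):
    """Return the most frequent class and the accuracy of always predicting it."""
    total = sum(class_counts.values())
    majority_class = max(class_counts, key=class_counts.get)
    return majority_class, class_counts[majority_class] / total


train_counts = {0: 1417, 1: 1445}   # non_irony, irony
baseline_label, baseline_accuracy = majority_baseline(train_counts)
print(f"always predict {baseline_label}: accuracy = {baseline_accuracy:.3f}")
```

Any trained model for this task should clearly exceed this baseline of roughly 50.5% on the training distribution to be considered useful.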


##  Loading Tokenizer and Encoding our Data

In [None]:
from transformers import DistilBertTokenizer
from torch.utils.data import TensorDataset

In [None]:
# Importing the tokenizer for the model

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad", use_fast=False)

In [None]:
import torch

# Encoding the data

encoded_data_train = tokenizer.batch_encode_plus(
    train.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',   # replaces the deprecated pad_to_max_length=True
    truncation=True,        # explicitly truncate tweets longer than max_length
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.irony.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.irony.values)


In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                           labels_val)

dataset_val.tensors

(tensor([[ 101, 1001, 6452,  ...,    0,    0,    0],
         [ 101, 1037, 2047,  ...,    0,    0,    0],
         [ 101, 8112, 1005,  ...,    0,    0,    0],
         ...,
         [ 101, 1030, 5310,  ...,    0,    0,    0],
         [ 101, 1030, 5310,  ...,    0,    0,    0],
         [ 101, 6270, 1010,  ...,    0,    0,    0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
         0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
         1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,
         1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
         0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
         1, 1, 1, 0, 0, 1, 0, 1, 1,

## Setting up DistilBERT Pretrained Model

In [None]:
from transformers import DistilBertForSequenceClassification

In [None]:
# Creating a dictionary with the possible labels for input

label_dict = {}
possible_labels = mapping['Irony'].unique()

for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
# Importing the model

model = DistilBertForSequenceClassification.from_pretrained(
                                      'distilbert-base-uncased-distilled-squad', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at distilbert-base-uncased-distilled-squad were not used when initializing DistilBertForSequenceClassification: ['qa_outputs.weight', 'qa_outputs.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-distilled-squad and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-

##  Creating Data Loaders

In [None]:
batch_size = 3

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),  # evaluation does not need shuffling
    batch_size=5
)
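
The batch sizes chosen here determine how many steps each epoch takes. As a rough check, the arithmetic below reproduces the totals of 954 training and 191 validation batches that appear in the training output of this section. The helper `batches_per_epoch` is hypothetical; it mirrors the default `DataLoader` behaviour of keeping the last, smaller batch. The example counts are 2862 training tweets (1417 + 1445) and 955 validation tweets (499 + 456 in the final evaluation).

```python
import math


def batches_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of batches a loader yields per pass over the data."""
    if drop_last:
        return num_examples // batch_size       # incomplete final batch discarded
    return math.ceil(num_examples / batch_size)  # incomplete final batch kept


train_steps = batches_per_epoch(2862, 3)   # training loader, batch_size = 3
val_steps = batches_per_epoch(955, 5)      # validation loader, batch_size = 5
scheduler_steps = train_steps * 3          # value of len(dataloader_train) * epochs

print(train_steps, val_steps, scheduler_steps)
```

A larger validation batch size would only reduce the number of evaluation steps; since no gradients are computed during evaluation, it does not affect the reported metrics.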

##  Setting Up Optimizer and Scheduler

In [None]:
from transformers import AdamW

# Setting up the optimizer

optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8
)

In [None]:
from transformers import get_linear_schedule_with_warmup

# Setting up the scheduler

epochs = 3
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)

## Defining the Performance Metrics

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
# Defining the f1 score metric

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

In [None]:
# Defining the accuracy per class metric

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

##  Creating the Training Loop

In [None]:
# Setting up the environment

import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
# Setting up the device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
# Defining the evaluation function

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

## Training the model, making the prediction and calculating the performance metrics

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # loss is already averaged over the batch; len(batch) is the tuple size (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    #torch.save(model.state_dict(), f'Models/BERT_ft_Epoch{epoch}.model')
    

    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

Epoch 1 (954 training batches)

Training loss: 0.6527713685759209


Validation (191 batches)


Validation loss: 0.6197772813249008
F1 Score (weighted): 0.6605352188355637


Epoch 2 (954 training batches)

Training loss: 0.5388618419903567


Validation (191 batches)


Validation loss: 0.8309663156564323
F1 Score (weighted): 0.662558035938622


Epoch 3 (954 training batches)

Training loss: 0.4476918964061019


Validation (191 batches)


Validation loss: 1.1137144400650412
F1 Score (weighted): 0.6712474503294743
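
Across the three epochs the training loss keeps falling while the validation loss rises from about 0.62 to about 1.11 and the weighted F1 barely moves, which is consistent with the model starting to overfit. One common response is to keep the metrics of every epoch and pick the checkpoint to keep afterwards. The sketch below shows that selection step only; the `history` list is a hypothetical structure filled with the values printed above, and actually saving a checkpoint per epoch (as in the commented-out `torch.save` line) is assumed rather than shown.

```python
# Metrics copied from the training output above, one entry per epoch.
history = [
    {"epoch": 1, "train_loss": 0.6527713685759209,
     "val_loss": 0.6197772813249008, "val_f1": 0.6605352188355637},
    {"epoch": 2, "train_loss": 0.5388618419903567,
     "val_loss": 0.8309663156564323, "val_f1": 0.662558035938622},
    {"epoch": 3, "train_loss": 0.4476918964061019,
     "val_loss": 1.1137144400650412, "val_f1": 0.6712474503294743},
]

# Two possible selection criteria; they do not have to agree.
best_by_loss = min(history, key=lambda record: record["val_loss"])
best_by_f1 = max(history, key=lambda record: record["val_f1"])

print(f"lowest validation loss: epoch {best_by_loss['epoch']}")
print(f"highest weighted F1:    epoch {best_by_f1['epoch']}")
```

Here the two criteria disagree: validation loss favours the first epoch and F1 the third, although the F1 difference is only about 0.01. Which one to trust depends on whether calibrated probabilities or hard labels matter more for the task.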



##  Evaluating the Model

In [None]:
accuracy_per_class(predictions, true_vals)

Class: non_irony
Accuracy:322/499

Class: irony
Accuracy:319/456



# **Offensive Language Identification**

<font color="Orange" size=5> Using an ALBERT Pretrained model: 'albert-base-v2' </font>

##  Install Dependencies

In [None]:
!pip install pytorch_transformers

Collecting pytorch_transformers
  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.10.2 transformers-4.5.1


##  Preprocessing

In [None]:
import pandas as pd

# Transforming the mapping into a .csv file

with open('./source/offensive/mapping.txt') as file:
  matrix = []
  for line in file:
    splits = line.split('\t')
    row = {}
    row['Code'] = splits[0]
    row['Offensive'] = splits[1].rstrip()
    matrix.append(row)
  mapping = pd.DataFrame(matrix)

mapping.to_csv('./source/offensive/mapping.csv', index=False)
mapping

Unnamed: 0,Code,Offensive
0,0,not-offensive
1,1,offensive


In [None]:
# Transforming the .txt files and creating the train dataset

with open('./source/offensive/train_text.txt') as train_text, open('./source/offensive/train_labels.txt') as train_labels:
  matrix = []
  off_languages = pd.read_csv('./source/offensive/mapping.csv')
  for text, label in zip(train_text, train_labels):
    offensive = off_languages.loc[off_languages['Code'] == int(label), 'Offensive'].item()
    row = {}
    row['text'] = text.rstrip()
    row['offensive'] = int(label)
    row['target'] = offensive
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/offensive/train.csv', index=False)
train

Unnamed: 0,text,offensive,target
0,@user Bono... who cares. Soon people will unde...,0,not-offensive
1,@user Eight years the republicans denied obama...,1,offensive
2,@user Get him some line help. He is gonna be j...,0,not-offensive
3,@user @user She is great. Hi Fiona!,0,not-offensive
4,@user She has become a parody unto herself? Sh...,1,offensive
...,...,...,...
11911,@user I wonder if they are sex traffic victims?,1,offensive
11912,@user Do we dare say he is... better than Nyjer?,0,not-offensive
11913,@user No idea who he is. Sorry,0,not-offensive
11914,#Professor Who Shot Self Over Trump Says Gun C...,0,not-offensive


In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/offensive/test_text.txt') as test_text, open('./source/offensive/test_labels.txt') as test_labels:
  matrix = []
  off_languages = pd.read_csv('./source/offensive/mapping.csv')
  for text, label in zip(test_text, test_labels):
    offensive = off_languages.loc[off_languages['Code'] == int(label), 'Offensive'].item()
    row = {}
    row['text'] = text.rstrip()
    row['offensive'] = int(label)
    row['target'] = offensive
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/offensive/test.csv', index=False)
test

Unnamed: 0,text,offensive,target
0,#ibelieveblaseyford is liar she is fat ugly li...,1,offensive
1,@user @user @user I got in a pretty deep debat...,0,not-offensive
2,"...if you want more shootings and more death, ...",0,not-offensive
3,Angels now have 6 runs. Five of them have come...,0,not-offensive
4,#Travel #Movies and Unix #Fortune combined Vi...,0,not-offensive
...,...,...,...
855,#CNN irrationally argues 4 legalising #abortio...,0,not-offensive
856,@user @user @user @user @user @user @user @use...,0,not-offensive
857,#Conservatives don’t care what you post..it’s ...,1,offensive
858,#antifa #Resist.. Trump is trying to bring wor...,0,not-offensive


In [None]:
# Transforming the .txt files and creating the validation dataset

with open('./source/offensive/val_text.txt') as val_text, open('./source/offensive/val_labels.txt') as val_labels:
  matrix = []
  off_languages = pd.read_csv('./source/offensive/mapping.csv')
  for text, label in zip(val_text, val_labels):
    offensive = off_languages.loc[off_languages['Code'] == int(label), 'Offensive'].item()
    row = {}
    row['text'] = text.rstrip()
    row['offensive'] = int(label)
    row['target'] = offensive
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/offensive/val.csv', index=False)
val

Unnamed: 0,text,offensive,target
0,@user @user WiiU is not even a real console.,0,not-offensive
1,@user @user @user If he is from AZ I would put...,1,offensive
2,@user I thought Canada had strict gun control....,0,not-offensive
3,@user @user @user @user @user @user @user @use...,0,not-offensive
4,1 Minute of Truth: Gun Control via @user,0,not-offensive
...,...,...,...
1319,@user @user Whose twitter interest start with ...,0,not-offensive
1320,"@user @user How did the press"""" get the letter...",0,not-offensive
1321,@user @user @user @user @user @user Sorry abou...,0,not-offensive
1322,@user Fuck Alan I’m sorry,1,offensive
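
The train, test and validation cells above repeat the same parsing logic. A minimal sketch of how it could be factored into one helper is shown below. `build_rows` and the in-memory sample strings are hypothetical; they stand in for the real `*_text.txt` and `*_labels.txt` files, and a plain dictionary replaces the per-row DataFrame lookup.

```python
import io

def build_rows(text_file, label_file, code_to_name, label_column):
    """Pair each tweet with its integer label and readable class name."""
    rows = []
    for text, label in zip(text_file, label_file):
        code = int(label)
        rows.append({
            'text': text.rstrip(),
            label_column: code,
            'target': code_to_name[code],
        })
    return rows

# Hypothetical in-memory stand-ins for mapping.txt, val_text.txt and val_labels.txt
mapping_file = io.StringIO('0\tnot-offensive\n1\toffensive\n')
text_file = io.StringIO('@user @user WiiU is not even a real console.\n'
                        '1 Minute of Truth: Gun Control via @user\n')
label_file = io.StringIO('0\n0\n')

# Build the code -> name lookup once instead of querying a DataFrame per row
code_to_name = {}
for line in mapping_file:
    code, name = line.rstrip('\n').split('\t')
    code_to_name[int(code)] = name

rows = build_rows(text_file, label_file, code_to_name, 'offensive')
print(rows[0])
```

Passing the resulting list to `pd.DataFrame(rows)` would give a frame with the same three columns as the cells above.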


##  Loading Tokenizer and Encoding the Data

In [None]:
from transformers import AlbertTokenizer
from torch.utils.data import TensorDataset

In [None]:
# Importing the tokenizer for the model

tokenizer = AlbertTokenizer.from_pretrained(
    'albert-base-v2',
    do_lower_case=True
)

In [None]:
import torch

# Encoding the data: every tweet is padded or truncated to 256 tokens

encoded_data_train = tokenizer.batch_encode_plus(
    train.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.offensive.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.offensive.values)


In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                           labels_val)

dataset_val.tensors

(tensor([[   2,   13,    1,  ...,    0,    0,    0],
         [   2,   13,    1,  ...,    0,    0,    0],
         [   2,   13,    1,  ...,    0,    0,    0],
         ...,
         [   2,   13,    1,  ...,    0,    0,    0],
         [   2,   13,    1,  ...,    0,    0,    0],
         [   2, 6926,  262,  ...,    0,    0,    0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([0, 1, 0,  ..., 0, 1, 0]))
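
The tensors above are padded with zeros up to `max_length`, and the attention mask marks real tokens with 1 and padding with 0. A rough pure-Python sketch of that step follows. `pad_and_mask` is a hypothetical helper, the token ids are illustrative rather than real tokenizer output, and the pad id of 0 is taken from the printed tensors.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad one sequence and build its attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# Illustrative ids only; a real tokenizer produces these from text
short_ids, short_mask = pad_and_mask([2, 13, 1, 3], max_length=6)
long_ids, long_mask = pad_and_mask([2, 13, 1, 7, 8, 9, 10, 3], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

A real tokenizer also keeps the closing special token when it truncates, which this sketch ignores for brevity.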

##  Setting up ALBERT Pretrained Model

In [None]:
from transformers import AlbertForSequenceClassification

In [None]:
# Creating a dictionary with the possible labels for input

label_dict = {}
possible_labels = mapping['Offensive'].unique()

for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
# Importing the model

model = AlbertForSequenceClassification.from_pretrained(
                                      'albert-base-v2', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForSequenceClassification: ['predictions.bias', 'predictions.LayerNorm.weight', 'predictions.LayerNorm.bias', 'predictions.dense.weight', 'predictions.dense.bias', 'predictions.decoder.weight', 'predictions.decoder.bias']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.weight', 'classifier.bias']
You sho

##  Creating Data Loaders

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 20

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Validation needs no shuffling, so the examples are read in order
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=32
)
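
Each `DataLoader` yields `ceil(n / batch_size)` batches per pass, because `drop_last` defaults to `False`. A quick check, assuming the 1,324 validation examples that the per-class accuracy output later in this section adds up to (865 + 459); `num_batches` is a hypothetical helper name:

```python
import math

def num_batches(num_examples, batch_size):
    """Batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)

val_examples = 865 + 459  # taken from the per-class accuracy output
print(num_batches(val_examples, 32))
```

The same count for the training loader is what `len(dataloader_train)` returns when the scheduler is configured.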

##  Setting Up Optimizer and Scheduler

In [None]:
from transformers import AdamW

# Setting up the optimizer

optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8
)

In [None]:
from transformers import get_linear_schedule_with_warmup

# Setting up the scheduler

epochs = 1

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)
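
With `num_warmup_steps=0`, the schedule decays the learning rate linearly from its initial value to zero over the training steps. The sketch below reimplements that multiplier in plain Python for illustration only. `linear_multiplier` is a hypothetical name, and the step count of 600 is an assumed round number, not the real length of `dataloader_train`.

```python
def linear_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear ramp-up during warmup
        return step / max(1, warmup_steps)
    # Linear decay to zero after warmup
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 1e-5
total_steps = 600  # assumed for illustration

for step in (0, 300, 600):
    print(step, base_lr * linear_multiplier(step, 0, total_steps))
```

Because `scheduler.step()` is called once per batch, the rate changes after every optimizer update rather than once per epoch.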

## Defining the Performance Metrics

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
# Defining the f1 score metric

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

In [None]:
# Defining the accuracy per class metric

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')
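
The weighted F1 used in `f1_score_func` is the per-class F1 averaged with each class's number of true examples as its weight. A small hand-rolled version on toy labels may make that concrete. `weighted_f1` and the five example labels are invented for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def weighted_f1(labels, preds):
    """Per-class F1 averaged with class support as weights."""
    total = 0.0
    for cls in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn)  # tp + fn is the support of this class
    return total / len(labels)

# Toy data: three examples of class 0, two of class 1
labels = [0, 0, 0, 1, 1]
preds = [0, 0, 1, 1, 0]
print(weighted_f1(labels, preds))
```

Weighting by support means a large class dominates the score, which is worth keeping in mind when the classes are imbalanced.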

##  Creating the Training Loop

In [None]:
import random

# Setting up the environment

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
# Setting up the device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

torch.cuda.empty_cache() 

model.to(device)
print(device)

cuda


In [None]:
from tqdm.notebook import tqdm

# Defining the evaluation function

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

## Training the model, making the prediction and calculating the performance metrics

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The model already returns the mean loss over the batch
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    #torch.save(model.state_dict(), f'Models/BERT_ft_Epoch{epoch}.model')
    

    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

Training loss: 0.4844311217233639


Validation loss: 0.42420835775278865
F1 Score (weighted): 0.7987578944874906
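
The training loop calls `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before each optimizer step. Conceptually, it rescales all gradients together whenever their combined L2 norm exceeds the limit. A simplified sketch on plain Python floats follows. The helper name and the example gradients are made up for illustration; the real function operates in place on tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so their combined L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Gradients that are already within the limit pass through unchanged, so clipping only affects unusually large updates.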



##  Evaluating the Model

In [None]:
accuracy_per_class(predictions, true_vals)

Class: not-offensive
Accuracy:756/865

Class: offensive
Accuracy:305/459
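
The counts above are easier to compare as ratios. The snippet below only redoes the arithmetic on the printed numbers; no new results are introduced.

```python
# Correct / total per class, copied from the output above
counts = {'not-offensive': (756, 865), 'offensive': (305, 459)}

per_class = {name: correct / total for name, (correct, total) in counts.items()}
overall = sum(c for c, _ in counts.values()) / sum(t for _, t in counts.values())

for name, ratio in per_class.items():
    print(f'{name}: {ratio:.3f}')
print(f'overall: {overall:.3f}')
```

The offensive class is recognized less often (about 66%) than the not-offensive class (about 87%), and it is also the smaller class in the validation split.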



# **Sentiment Analysis**

<font color="Orange" size=5> Using a BERT Pretrained model: 'bert-large-uncased'</font>

##  Install Dependencies

In [None]:
!pip install transformers

Collecting transformers
Collecting sacremoses
Collecting tokenizers<0.11,>=0.10.1
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [None]:
pip install pytorch_transformers

Collecting pytorch_transformers

##  Preprocessing

In [None]:
import pandas as pd

# Transforming the mapping into a .csv file

with open('./source/sentiment/mapping.txt') as file:
  matrix = []
  for line in file:
    splits = line.split('\t')
    row = {}
    row['Code'] = splits[0]
    row['Sentiment'] = splits[1].rstrip()
    matrix.append(row)
  mapping = pd.DataFrame(matrix)

mapping.to_csv('./source/sentiment/mapping.csv', index=False)
mapping

Unnamed: 0,Code,Sentiment
0,0,negative
1,1,neutral
2,2,positive


In [None]:
# Transforming the .txt files and creating the train dataset

with open('./source/sentiment/train_text.txt') as train_text, open('./source/sentiment/train_labels.txt') as train_labels:
  matrix = []
  sentiments = pd.read_csv('./source/sentiment/mapping.csv')
  for text, label in zip(train_text, train_labels):
    sentiment = sentiments.loc[sentiments['Code'] == int(label), 'Sentiment'].item()
    row = {}
    row['text'] = text.rstrip()
    row['sentiment'] = int(label)
    row['target'] = sentiment
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/sentiment/train.txt', index=False)
train

Unnamed: 0,text,sentiment,target
0,Broncineers! Remember that tomorrow we have th...,1,neutral
1,"Not to be outdone by the neighbours, Erdogan i...",1,neutral
2,"Katy Perry - ""You just gotta ignite the light,...",1,neutral
3,Who\u2019s going to Concords football game thi...,1,neutral
4,Do you own a business in Bolder? Then you may ...,1,neutral
...,...,...,...
45384,Looking forward to the new Jersey Shore starti...,2,positive
45385,Dreamed I was @user and I spent the night maki...,2,positive
45386,Kyle is going out for the first time tomorrow ...,2,positive
45387,Hey happy Friday!!! Today T&amp;P will be givi...,2,positive


In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/sentiment/test_text.txt') as test_text, open('./source/sentiment/test_labels.txt') as test_labels:
  matrix = []
  sentiments = pd.read_csv('./source/sentiment/mapping.csv')
  for text, label in zip(test_text, test_labels):
    sentiment = sentiments.loc[sentiments['Code'] == int(label), 'Sentiment'].item()
    row = {}
    row['text'] = text.rstrip()
    row['sentiment'] = int(label)
    row['target'] = sentiment
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/sentiment/test.txt', index=False)
test

Unnamed: 0,text,sentiment,target
0,#ArianaGrande Ari By Ariana Grande 80% Full #S...,1,neutral
1,Ariana Grande KIIS FM Yours Truly CD listening...,2,positive
2,Ariana Grande White House Easter Egg Roll in W...,2,positive
3,#CD #Musics Ariana Grande Sweet Like Candy 3.4...,2,positive
4,SIDE TO SIDE 😘 @user #sidetoside #arianagrande...,1,neutral
...,...,...,...
11901,@user update: Zac Efron kissing a puppy,2,positive
11902,#zac efron sex pic skins michelle sex,1,neutral
11903,First Look at Neighbors 2 with Zac Efron Shirt...,1,neutral
11904,zac efron poses nude #lovely libra porn,1,neutral


In [None]:
# Transforming the .txt files and creating the validation dataset

with open('./source/sentiment/val_text.txt') as val_text, open('./source/sentiment/val_labels.txt') as val_labels:
  matrix = []
  sentiments = pd.read_csv('./source/sentiment/mapping.csv')
  for text, label in zip(val_text, val_labels):
    sentiment = sentiments.loc[sentiments['Code'] == int(label), 'Sentiment'].item()
    row = {}
    row['text'] = text.rstrip()
    row['sentiment'] = int(label)
    row['target'] = sentiment
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/sentiment/val.csv', index=False)
val

Unnamed: 0,text,sentiment,target
0,man I'm about to see Janet Jackson in October ...,1,neutral
1,@user what do you think about the reaction is ...,1,neutral
2,Conor McGregor's reaction to swirling Jose Ald...,1,neutral
3,To all the people who will buy Go Set a Watchm...,0,negative
4,How about a Day of Action and Meetups on Sat O...,2,positive
...,...,...,...
1995,It's good to see that Dave managed to get back...,2,positive
1996,Hope everyone is out having fun at the 2012 Ro...,2,positive
1997,Watching full 1st season of The Finder on Hulu...,2,positive
1998,#cricket Pakistan hopeful of Bangladesh visit:...,2,positive


##  Loading Tokenizer and Encoding our Data

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

# Importing the tokenizer for the model

tokenizer = BertTokenizer.from_pretrained(
    'bert-large-uncased',
    do_lower_case=True
)

In [None]:
# Encoding the data

import torch

# Every tweet is padded or truncated to 256 tokens
encoded_data_train = tokenizer.batch_encode_plus(
    train.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.sentiment.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.sentiment.values)


In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                           labels_val)


dataset_val.tensors

(tensor([[  101,  2158,  1045,  ...,     0,     0,     0],
         [  101,  1030,  5310,  ...,     0,     0,     0],
         [  101, 20545, 23023,  ...,     0,     0,     0],
         ...,
         [  101,  3666,  2440,  ...,     0,     0,     0],
         [  101,  1001,  4533,  ...,     0,     0,     0],
         [  101, 19387,  1030,  ...,     0,     0,     0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([1, 1, 1,  ..., 2, 2, 2]))

##  Setting up BERT Pretrained Model

In [None]:
from transformers import BertForSequenceClassification

In [None]:
# Creating a dictionary with the possible labels for input

label_dict = {}
possible_labels = mapping['Sentiment'].unique()

for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
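
`label_dict` maps class names to indices; the inverse mapping turns an argmax over the model's logits back into a readable label. A small sketch with made-up logits is shown below. The logit values and the `predict_labels` helper are hypothetical; the class names are the ones listed in the sentiment mapping.

```python
label_dict = {'negative': 0, 'neutral': 1, 'positive': 2}
index_to_label = {index: name for name, index in label_dict.items()}

def predict_labels(logits):
    """Pick the highest-scoring class for each row of logits."""
    predictions = []
    for row in logits:
        best_index = max(range(len(row)), key=lambda i: row[i])
        predictions.append(index_to_label[best_index])
    return predictions

# Made-up logits for three tweets
logits = [[2.1, 0.3, -1.0], [-0.5, 1.7, 0.2], [0.1, 0.4, 3.2]]
print(predict_labels(logits))
```

This is the same inversion that `accuracy_per_class` performs with `label_dict_inverse` when it prints class names.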

In [None]:
# Importing the model

model = BertForSequenceClassification.from_pretrained(
                                      'bert-large-uncased', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

##  Creating Data Loaders

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 6

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Validation needs no shuffling, so the examples are read in order
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=20
)

##  Setting up Optimizer and Scheduler

In [None]:
from transformers import AdamW

# Setting up the optimizer

optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8
)

In [None]:
from transformers import get_linear_schedule_with_warmup

# Setting up the scheduler

epochs = 1
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)

##  Defining the Performance Metrics

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
# Defining the f1 score metric
 
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

In [None]:
# Defining the accuracy per class metric

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

##  Creating the Training Loop

In [None]:
import random

# Setting up the environment

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
# Setting up the device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
from tqdm.notebook import tqdm

# Defining the evaluation function

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

## Training the model, making the prediction and calculating the performance metrics

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The model already returns the mean loss over the batch
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    #torch.save(model.state_dict(), f'Models/BERT_ft_Epoch{epoch}.model')
    

    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

Training loss: 0.6556394049665113


Validation loss: 0.5883392065763473
F1 Score (weighted): 0.7450052407878486



## Evaluating the model

In [None]:
accuracy_per_class(predictions, true_vals)

Class: negative
Accuracy:187/301

Class: neutral
Accuracy:658/896

Class: positive
Accuracy:647/803
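
Converting these counts to ratios makes the three classes directly comparable. The snippet below only reuses the numbers printed above and adds two summary figures, the unweighted mean of the per-class ratios (macro-averaged recall) and the overall accuracy.

```python
# Correct / total per class, copied from the output above
counts = {'negative': (187, 301), 'neutral': (658, 896), 'positive': (647, 803)}

recalls = {name: correct / total for name, (correct, total) in counts.items()}
macro_recall = sum(recalls.values()) / len(recalls)
accuracy = sum(c for c, _ in counts.values()) / sum(t for _, t in counts.values())

print({name: round(r, 3) for name, r in recalls.items()})
print(round(macro_recall, 3), round(accuracy, 3))
```

The negative class has both the fewest validation examples and the lowest ratio, so the macro average comes out below the overall accuracy.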



# **Stance Detection**

In [None]:
import pandas as pd

# Transforming the mapping for all stance categories into a .csv file

with open('./source/stance/mapping.txt') as file:
  matrix = []
  for line in file:
    splits = line.split('\t')
    row = {}
    row['Code'] = splits[0]
    row['Stance'] = splits[1].rstrip()
    matrix.append(row)
  mapping = pd.DataFrame(matrix)

mapping.to_csv('./source/stance/mapping.csv', index=False)
mapping

Unnamed: 0,Code,Stance
0,0,none
1,1,against
2,2,favor


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
pip install transformers

Collecting transformers

## Abortion

In [None]:
# Transforming the .txt files and creating the train dataset

with open('./source/stance/abortion/train_text.txt') as train_text, open('./source/stance/abortion/train_labels.txt') as train_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(train_text, train_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/stance/abortion/train.txt', index=False)

In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/stance/abortion/test_text.txt') as test_text, open('./source/stance/abortion/test_labels.txt') as test_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(test_text, test_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/stance/abortion/test.txt', index=False)

In [None]:
# Transforming the .txt files and creating the val dataset

with open('./source/stance/abortion/val_text.txt') as val_text, open('./source/stance/abortion/val_labels.txt') as val_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(val_text, val_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/stance/abortion/val.txt', index=False)

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [None]:
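# Note: this checkpoint is cased, so do_lower_case=True lowercases the tweets
# before tokenization and discards the casing the model was pretrained with
tokenizer = BertTokenizer.from_pretrained(
    'bert-large-cased-whole-word-masking-finetuned-squad',
    do_lower_case = True
)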

In [None]:
import torch
encoded_data_train = tokenizer.batch_encode_plus(
    train.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # explicit, so no truncation warning is emitted
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.stance.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.stance.values)

In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                            labels_val)


dataset_val.tensors

(tensor([[ 101, 2052,  117,  ...,    0,    0,    0],
         [ 101, 1175,  112,  ...,    0,    0,    0],
         [ 101, 1191, 1128,  ...,    0,    0,    0],
         ...,
         [ 101,  137, 4795,  ...,    0,    0,    0],
         [ 101, 1139, 1404,  ...,    0,    0,    0],
         [ 101, 6243, 1128,  ...,    0,    0,    0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([0, 1, 1, 0, 1, 2, 1, 0, 1, 2, 1, 2, 1, 0, 2, 0, 2, 2, 1, 0, 2, 1, 1, 1,
         1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 2, 2, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
         1, 1, 1, 2, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 2, 1, 2, 0]))
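
In the output above, every row of the first tensor starts with 101 and ends in zeros, while the second tensor marks real tokens with 1 and padding with 0. A minimal sketch of that padding step for a single sequence is shown here. The function name `pad_with_mask` and the short id list are illustrative assumptions, and the truncation is naive: the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad one sequence and build its attention mask."""
    ids = list(token_ids)[:max_length]      # naive truncation
    mask = [1] * len(ids)                   # 1 = real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding


# Illustrative ids; 101 opens the sequence as in the output above.
ids, mask = pad_with_mask([101, 2052, 117, 102], max_length=8)
print(ids)
print(mask)
```

The model ignores positions whose mask value is 0, which is why padding every tweet to 256 tokens does not change the predictions, only the memory and compute cost.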

In [None]:
from transformers import BertForSequenceClassification 

In [None]:
label_dict = {}
possible_labels = train.stance.unique()

for index, possible_label in enumerate(possible_labels):
  label_dict[possible_label] = index
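
`train.stance.unique()` returns the codes in order of first appearance, so `label_dict` is the identity mapping only when the training file happens to list the codes in ascending order. The tensors above are built from the raw stance codes, so `label_dict` affects `num_labels` and any name lookup, not the training targets. A small sketch of the effect, with a made-up label sequence and the hypothetical helper name `first_seen_index`:

```python
def first_seen_index(labels):
    """Map each distinct label to the position at which it first appears."""
    # dict.fromkeys keeps insertion order, like pandas' unique()
    return {label: index for index, label in enumerate(dict.fromkeys(labels))}


mapping = first_seen_index([1, 0, 1, 2, 0])   # made-up label order
inverse = {index: label for label, index in mapping.items()}

print(mapping)
print(inverse)
```

With this sequence, looking up the raw code 0 in `inverse` returns 1, so an inverse lookup keyed by raw codes would report the wrong class name.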

In [None]:
model = BertForSequenceClassification.from_pretrained(
                                      'bert-large-cased-whole-word-masking-finetuned-squad', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-large-cased-whole-word-masking-finetuned-squad were not used when initializing BertForSequenceClassification: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-cased-whole-word-masking-finetuned-squad and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions

In [None]:
batch_size = 5
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Evaluation does not need shuffling, so iterate the validation set in order
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=batch_size
)

In [None]:
from transformers import AdamW
optimizer = AdamW(
    model.parameters(),
    lr = 2e-5,
    eps = 1e-8
)

In [None]:
from transformers import get_linear_schedule_with_warmup

epochs = 1
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)
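
The scheduler above scales the learning rate linearly: up from zero during warmup (none here, since `num_warmup_steps=0`) and then down to zero at the last training step. A minimal sketch of the multiplier it applies, under the assumption that it follows the usual linear warmup and decay rule; the function name and the step count of 118 are illustrative:

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    # linear decay from 1 to 0 over the remaining steps
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 118  # illustrative number of batches in one epoch
for step in (0, 59, 118):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```

With a single epoch and no warmup, the first batch is trained at the full learning rate and the rate shrinks with every optimizer step.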

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')
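
`average='weighted'` computes one F1 score per class and averages them using the number of true examples of each class as the weight. The sketch below reproduces that calculation in plain Python on a made-up example; `weighted_f1` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_true * f1          # weight by class support
    return total / len(y_true)


# Made-up labels and predictions
score = weighted_f1([0, 0, 1, 1, 1, 2], [0, 1, 1, 1, 0, 2])
print(score)
```

Because frequent classes dominate the average, a weighted F1 can look healthy even when a rare class is never predicted correctly, which is why the per-class breakdown is printed separately.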

In [None]:
def accuracy_per_class(preds, labels):
    # The model is trained on the raw stance codes, so a true label is
    # already the class to report. label_dict maps code -> first-seen index,
    # and looking a code up in its inverse would print the wrong class.
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

In [None]:
import random

seed_val = 15
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs + 1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
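        # The loss is already averaged over the batch; len(batch) is 3
        # (the number of tensors in the tuple), not the batch size.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})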
    
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

Training loss: 0.5396141222030935
Validation loss: 0.8177774372909751
F1 Score (weighted): 0.803030303030303



In [None]:
accuracy_per_class(predictions, true_vals)

Class: 0
Accuracy:15/18

Class: 1
Accuracy:31/36

Class: 2
Accuracy:7/12



## Atheism

In [None]:
# Transforming the .txt files and creating the train dataset
import pandas as pd

with open('./source/stance/atheism/train_text.txt') as train_text, open('./source/stance/atheism/train_labels.txt') as train_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(train_text, train_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/stance/atheism/train.txt', index=False)

In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/stance/atheism/test_text.txt') as test_text, open('./source/stance/atheism/test_labels.txt') as test_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(test_text, test_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/stance/atheism/test.txt', index=False)

In [None]:
# Transforming the .txt files and creating the val dataset

with open('./source/stance/atheism/val_text.txt') as val_text, open('./source/stance/atheism/val_labels.txt') as val_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(val_text, val_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/stance/atheism/val.txt', index=False)

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [None]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-large-cased-whole-word-masking-finetuned-squad',
    do_lower_case = False
)

In [None]:
import torch
encoded_data_train = tokenizer.batch_encode_plus(
    train.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # explicit, so no truncation warning is emitted
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.stance.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.stance.values)

In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                            labels_val)


dataset_val.tensors

(tensor([[  101,  1192,  1169,  ...,     0,     0,     0],
         [  101, 23451,  1406,  ...,     0,     0,     0],
         [  101,  1109,  1178,  ...,     0,     0,     0],
         ...,
         [  101,  6424,  7215,  ...,     0,     0,     0],
         [  101,   146,   112,  ...,     0,     0,     0],
         [  101,  1109,  4997,  ...,     0,     0,     0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([1, 1, 1, 1, 1, 2, 2, 2, 0, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 0, 1, 1, 1, 2,
         0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 2,
         1, 0, 1, 0]))

In [None]:
from transformers import BertForSequenceClassification 

In [None]:
label_dict = {}
possible_labels = train.stance.unique()

for index, possible_label in enumerate(possible_labels):
  label_dict[possible_label] = index

In [None]:
model = BertForSequenceClassification.from_pretrained(
                                      'bert-large-cased-whole-word-masking-finetuned-squad', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-large-cased-whole-word-masking-finetuned-squad were not used when initializing BertForSequenceClassification: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-cased-whole-word-masking-finetuned-squad and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions

In [None]:
batch_size = 5
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Evaluation does not need shuffling, so iterate the validation set in order
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=batch_size
)

In [None]:
from transformers import AdamW
optimizer = AdamW(
    model.parameters(),
    lr = 4e-5,
    eps = 7e-8
)

In [None]:
from transformers import get_linear_schedule_with_warmup

epochs = 1
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

In [None]:
def accuracy_per_class(preds, labels):
    # The model is trained on the raw stance codes, so a true label is
    # already the class to report. label_dict maps code -> first-seen index,
    # and looking a code up in its inverse would print the wrong class.
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

In [None]:
import random

seed_val = 15
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs + 1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
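        # The loss is already averaged over the batch; len(batch) is 3
        # (the number of tensors in the tuple), not the batch size.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})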
    
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

Training loss: 0.8454795657627044
Validation loss: 0.6791728721423582
F1 Score (weighted): 0.7208224180360403
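
The training loop above calls `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before each optimizer step. The idea is to treat all gradients as one long vector and rescale it when its length exceeds the limit. A plain-Python sketch of that rule on a flat list of numbers follows; the helper name `clip_by_global_norm` is hypothetical, and the small `eps` in the denominator is an assumption about how division by zero is avoided.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)
print(clipped)
```

The direction of the update is preserved and only its magnitude is limited, which helps keep fine-tuning stable when a single batch produces an unusually large gradient.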



In [None]:
accuracy_per_class(predictions, true_vals)

Class: 0
Accuracy:9/12

Class: 1
Accuracy:22/31

Class: 2
Accuracy:6/9



## Climate

In [None]:
# Transforming the .txt files and creating the train dataset

with open('./source/stance/climate/train_text.txt') as train_text, open('./source/stance/climate/train_labels.txt') as train_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(train_text, train_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/stance/climate/train.txt', index=False)

In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/stance/climate/test_text.txt') as test_text, open('./source/stance/climate/test_labels.txt') as test_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(test_text, test_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/stance/climate/test.txt', index=False)

In [None]:
# Transforming the .txt files and creating the val dataset

with open('./source/stance/climate/val_text.txt') as val_text, open('./source/stance/climate/val_labels.txt') as val_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(val_text, val_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/stance/climate/val.txt', index=False)

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [None]:
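# Note: this checkpoint is cased. do_lower_case=True lowercases every tweet
# before tokenization, which discards the case information a cased vocabulary keeps.
tokenizer = BertTokenizer.from_pretrained(
    'bert-large-cased-whole-word-masking-finetuned-squad',
    do_lower_case = True
)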

In [None]:
import torch
encoded_data_train = tokenizer.batch_encode_plus(
    train.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # explicit, so no truncation warning is emitted
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.stance.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.stance.values)

In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                            labels_val)


dataset_val.tensors

(tensor([[ 101,  108, 2862,  ...,    0,    0,    0],
         [ 101, 1169,  137,  ...,    0,    0,    0],
         [ 101,  119,  137,  ...,    0,    0,    0],
         ...,
         [ 101,  137, 4795,  ...,    0,    0,    0],
         [ 101, 8362,  117,  ...,    0,    0,    0],
         [ 101, 1175,  112,  ...,    0,    0,    0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 0, 0, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 1,
         0, 0, 2, 2, 1, 0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0]))
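
The label tensor printed above is small enough to inspect directly, and counting its values shows how uneven the classes are before any model is trained. The sketch below copies the 40 validation labels from that output; the variable names are illustrative.

```python
from collections import Counter

# Validation labels copied from the tensor output above
val_labels = [2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 0, 0, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 1,
              0, 0, 2, 2, 1, 0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0]

counts = Counter(val_labels)
majority_label, majority_count = counts.most_common(1)[0]
# Accuracy of always predicting the most frequent class
majority_baseline = majority_count / len(val_labels)

print(sorted(counts.items()))
print(majority_label, majority_baseline)
```

One class has only two validation tweets, so its per-class accuracy rests on two examples, and any overall score is best read against the majority-class baseline.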

In [None]:
from transformers import BertForSequenceClassification 

In [None]:
label_dict = {}
possible_labels = train.stance.unique()

for index, possible_label in enumerate(possible_labels):
  label_dict[possible_label] = index

In [None]:
model = BertForSequenceClassification.from_pretrained(
                                      'bert-large-cased-whole-word-masking-finetuned-squad', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-large-cased-whole-word-masking-finetuned-squad were not used when initializing BertForSequenceClassification: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-cased-whole-word-masking-finetuned-squad and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions

In [None]:
batch_size = 4
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Evaluation does not need shuffling, so iterate the validation set in order
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=batch_size
)

In [None]:
from transformers import AdamW
optimizer = AdamW(
    model.parameters(),
    lr = 2e-5,
    eps = 4e-8
)

In [None]:
from transformers import get_linear_schedule_with_warmup

epochs = 1
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

In [None]:
def accuracy_per_class(preds, labels):
    # The model is trained on the raw stance codes, so a true label is
    # already the class to report. label_dict maps code -> first-seen index,
    # and looking a code up in its inverse would print the wrong class.
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

In [None]:
import random

seed_val = 15
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs + 1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
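        # The loss is already averaged over the batch; len(batch) is 3
        # (the number of tensors in the tuple), not the batch size.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})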
    
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

Training loss: 0.7867464731248577
Validation loss: 0.7562374711036682
F1 Score (weighted): 0.6820512820512821



In [None]:
accuracy_per_class(predictions, true_vals)

Class: 0
Accuracy:14/17

Class: 1
Accuracy:0/2

Class: 2
Accuracy:14/21



## Feminist

In [None]:
# Transforming the .txt files and creating the train dataset

with open('./source/stance/feminist/train_text.txt') as train_text, open('./source/stance/feminist/train_labels.txt') as train_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(train_text, train_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/stance/feminist/train.txt', index=False)

In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/stance/feminist/test_text.txt') as test_text, open('./source/stance/feminist/test_labels.txt') as test_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(test_text, test_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/stance/feminist/test.txt', index=False)

In [None]:
# Transforming the .txt files and creating the val dataset

with open('./source/stance/feminist/val_text.txt') as val_text, open('./source/stance/feminist/val_labels.txt') as val_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(val_text, val_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/stance/feminist/val.txt', index=False)

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [None]:
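# Note: this checkpoint is cased. do_lower_case=True lowercases every tweet
# before tokenization, which discards the case information a cased vocabulary keeps.
tokenizer = BertTokenizer.from_pretrained(
    'bert-large-cased-whole-word-masking',
    do_lower_case = True
)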

In [None]:
import torch
encoded_data_train = tokenizer.batch_encode_plus(
    train.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # explicit, so no truncation warning is emitted
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.stance.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.stance.values)

In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                            labels_val)


dataset_val.tensors

(tensor([[ 101,  178, 1321,  ...,    0,    0,    0],
         [ 101, 1518, 8124,  ...,    0,    0,    0],
         [ 101,  119,  137,  ...,    0,    0,    0],
         ...,
         [ 101, 1165, 1152,  ...,    0,    0,    0],
         [ 101,  137, 4795,  ...,    0,    0,    0],
         [ 101, 5540, 1225,  ...,    0,    0,    0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([2, 0, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 0, 2, 1, 1, 1, 1, 2, 1, 1, 2,
         0, 2, 0, 0, 0, 0, 2, 0, 1, 1, 1, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 2, 1, 1,
         1, 2, 0, 2, 1, 1, 0, 1, 2, 2, 1, 0, 1, 1, 1, 2, 2, 1, 1]))

In [None]:
from transformers import BertForSequenceClassification 

In [None]:
label_dict = {}
possible_labels = train.stance.unique()

for index, possible_label in enumerate(possible_labels):
  label_dict[possible_label] = index

In [None]:
model = BertForSequenceClassification.from_pretrained(
                                      'bert-large-cased-whole-word-masking', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-large-cased-whole-word-masking were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the m

In [None]:
batch_size = 5
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Validation does not need shuffling, so iterate in a fixed order
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=batch_size
)

In [None]:
from transformers import AdamW
optimizer = AdamW(
    model.parameters(),
    lr = 2e-5,
    eps = 4e-9
)

In [None]:
from transformers import get_linear_schedule_with_warmup

epochs = 1
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)
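
The scheduler above scales the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The sketch below reproduces that factor in plain Python as we understand it; `linear_warmup_factor` is a hypothetical helper, not part of `transformers`, and the 120 total steps are an assumed value used only for illustration.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 to 0 at the last training step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 120  # assumption: number of training batches times epochs

for step in (0, 60, 120):
    factor = linear_warmup_factor(step, 0, total_steps)
    print(f'step {step}: lr = {base_lr * factor:.2e}')
```

With `num_warmup_steps=0` the learning rate starts at its full value and shrinks on every step, so the last batches of the run contribute very small updates.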

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')
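
A weighted F1 score averages the per-class F1 values using each class's share of the true labels as its weight. The sketch below works this out without scikit-learn on a small made-up example; `weighted_f1` and the toy labels are hypothetical and only meant to show what `average = 'weighted'` computes.

```python
from collections import Counter


def weighted_f1(true_labels, pred_labels):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(true_labels)
    total = len(true_labels)
    score = 0.0
    for label, n_true in support.items():
        pairs = list(zip(true_labels, pred_labels))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Toy data (made up): class 1 is over-predicted
toy_true = [0, 0, 1, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 1, 1, 2]
print(round(weighted_f1(toy_true, toy_pred), 4))
```

Because large classes dominate the weights, a model can reach a moderate weighted F1 while almost never predicting a small class, which is why the per-class counts are worth checking as well.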

In [None]:
def accuracy_per_class(preds, labels):
    # The label tensors hold the raw stance codes (they are never remapped
    # through label_dict), so each class is reported under its own code.
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

In [None]:
import random

seed_val = 15
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
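from tqdm.notebook import tqdm

def evaluate(dataloader_val):
    """Run the model over dataloader_val and return (mean loss, logits, true labels)."""

    model.eval()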
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(1, epochs + 1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is the number of tensors (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

Training loss: 0.9626638859510421
Validation loss: 0.9222281788076673
F1 Score (weighted): 0.4795588309196739



In [None]:
accuracy_per_class(predictions, true_vals)

Class: 1
Accuracy:8/13

Class: 0
Accuracy:29/33

Class: 2
Accuracy:1/21
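
Per-class counts like these can also be collected in a dictionary instead of being printed, which makes them easier to compare across targets. The sketch below is a dependency-free version under that assumption; `per_class_accuracy` and the toy labels are hypothetical and not part of the notebook's pipeline.

```python
from collections import Counter


def per_class_accuracy(pred_labels, true_labels):
    """Return {class id: (correct, total)} keyed by the true class id."""
    correct = Counter()
    total = Counter()
    for pred, true in zip(pred_labels, true_labels):
        total[true] += 1
        if pred == true:
            correct[true] += 1
    return {label: (correct[label], total[label]) for label in sorted(total)}


# Toy data (made up): predicted ids would come from np.argmax over the logits
toy_true = [0, 0, 1, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 1, 1, 2]

for label, (hit, n) in per_class_accuracy(toy_pred, toy_true).items():
    print(f'Class: {label}  Accuracy: {hit}/{n}')
```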



## Hillary

In [None]:
# Transforming the .txt files and creating the train dataset

with open('./source/stance/hillary/train_text.txt') as train_text, open('./source/stance/hillary/train_labels.txt') as train_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(train_text, train_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  train = pd.DataFrame(matrix)

train.to_csv('./source/stance/hillary/train.txt', index=False)

In [None]:
# Transforming the .txt files and creating the test dataset

with open('./source/stance/hillary/test_text.txt') as test_text, open('./source/stance/hillary/test_labels.txt') as test_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(test_text, test_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  test = pd.DataFrame(matrix)

test.to_csv('./source/stance/hillary/test.txt', index=False)

In [None]:
# Transforming the .txt files and creating the val dataset

with open('./source/stance/hillary/val_text.txt') as val_text, open('./source/stance/hillary/val_labels.txt') as val_labels:
  matrix = []
  stances = pd.read_csv('./source/stance/mapping.csv')
  for text, label in zip(val_text, val_labels):
    stance = stances.loc[stances['Code'] == int(label), 'Stance'].item()
    row = {}
    row['text'] = text.rstrip()
    row['stance'] = int(label)
    row['target'] = stance
    matrix.append(row)
  val = pd.DataFrame(matrix)

val.to_csv('./source/stance/hillary/val.txt', index=False)

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [None]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-large-cased-whole-word-masking',
    do_lower_case = False
)

In [None]:
import torch
encoded_data_train = tokenizer.batch_encode_plus(
    train.text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # explicitly cut tweets longer than max_length
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    val.text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train.stance.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(val.stance.values)


In [None]:
# Creating the tensor datasets

dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val, 
                            attention_masks_val,
                            labels_val)


dataset_val.tensors

(tensor([[  101,   143, 20356,  ...,     0,     0,     0],
         [  101,  7102,  1612,  ...,     0,     0,     0],
         [  101,  1135,   112,  ...,     0,     0,     0],
         ...,
         [  101,  6466,  1242,  ...,     0,     0,     0],
         [  101,  1130,  1692,  ...,     0,     0,     0],
         [  101,   137,  4795,  ...,     0,     0,     0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([2, 1, 1, 1, 1, 1, 2, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 2, 1, 2, 1, 2, 0, 0,
         1, 2, 1, 1, 0, 1, 1, 2, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 2, 1, 1,
         2, 2, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1]))

In [None]:
from transformers import BertForSequenceClassification 

In [None]:
label_dict = {}
possible_labels = train.stance.unique()

for index, possible_label in enumerate(possible_labels):
  label_dict[possible_label] = index

In [None]:
model = BertForSequenceClassification.from_pretrained(
                                      'bert-large-cased-whole-word-masking-finetuned-squad', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-large-cased-whole-word-masking-finetuned-squad were not used when initializing BertForSequenceClassification: ['qa_outputs.weight', 'qa_outputs.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-cased-whole-word-masking-finetuned-squad and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions

In [None]:
batch_size = 5
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Validation does not need shuffling, so iterate in a fixed order
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=batch_size
)

In [None]:
from transformers import AdamW
optimizer = AdamW(
    model.parameters(),
    lr = 3e-5,
    eps = 1e-9
)

In [None]:
from transformers import get_linear_schedule_with_warmup

epochs = 2
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

In [None]:
def accuracy_per_class(preds, labels):
    # The label tensors hold the raw stance codes (they are never remapped
    # through label_dict), so each class is reported under its own code.
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

In [None]:
import random

seed_val = 10
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
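from tqdm.notebook import tqdm

def evaluate(dataloader_val):
    """Run the model over dataloader_val and return (mean loss, logits, true labels)."""

    model.eval()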
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
from tqdm.notebook import tqdm

# Train for the same number of epochs the scheduler was configured with
for epoch in tqdm(range(1, epochs + 1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is the number of tensors (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    print(f'Validation loss: {val_loss}')
    print(f'F1 Score (weighted): {val_f1}')

Training loss: 0.9761226097902944
Validation loss: 0.8319676348141262
F1 Score (weighted): 0.5160348887341227



In [None]:
accuracy_per_class(predictions, true_vals)

Class: 1
Accuracy:17/18

Class: 0
Accuracy:22/39

Class: 2
Accuracy:0/12

