<a href="https://colab.research.google.com/github/pramodith/Humor/blob/master/Humor_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
%%bash
pip install pytorch-pretrained-bert
pip install tqdm boto3 requests regex
pip install pytorch-pretrained-bert pytorch-nlp
pip install sacremoses
pip install sentencepiece
pip install pytorch_transformers
pip install transformers
pip install gensim
pip install spacy
python -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


This notebook reproduces Pramodith's work on SemEval-2020 Task 7: Assessing the Humor of Edited News Headlines. All required data files can be found in the Git repository that contains this Jupyter notebook. The code has not been maintained carefully, since the work was done as an independent researcher in my spare time. If you have any questions, feel free to raise a GitHub issue.


In [0]:
import torch
from keras_preprocessing.sequence import pad_sequences
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import json
import re
import gensim
from transformers import BertTokenizer,DistilBertTokenizer,RobertaTokenizer


Get the tokenized input and the locations of the focus words.

In [0]:
def tokenize_bert(X: list, org : bool,tokenizer_type='roberta'):
    '''
    Tokenizes the input sentences and returns their padded token ids together with the positions
    of the marker tokens that surround the focus word.

    :param X: List of all input sentences
    :param org: True for original headlines (focus word wrapped in '<'), False for edited headlines (wrapped in '^')
    :param tokenizer_type: One of 'roberta', 'bert' or 'distilbert'
    :return: A padded array of token ids and a numpy array with one row per sentence holding the positions
    of the opening and closing marker tokens.
    '''

    # Add the SOS and EOS tokens.
    # TODO: Replace fullstops with [SEP]
    if tokenizer_type == 'roberta':
      tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
      sentences = ["<s> " + sentence + " </s>" for sentence in X]
    elif tokenizer_type == 'bert':
      tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
      sentences = ["[CLS] " + sentence + " [SEP]" for sentence in X]
    elif tokenizer_type == 'distilbert':
      tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
      sentences = ["[CLS] " + sentence + " [SEP]" for sentence in X]
    

    # Tokenize and vectorize
    tokenized_text = [tokenizer.tokenize(sentence,add_special_tokens=False) for sentence in sentences]
    X = [tokenizer.convert_tokens_to_ids(sent) for sent in tokenized_text]
    print(tokenizer_type)
    print(tokenized_text[0])
    print(tokenizer.pad_token)
    print(tokenizer.pad_token_id)

    # MAX_SEQ_LEN
    MAX_LEN = 50
    # Pad sequences to make them all equally long
    X = pad_sequences(X, MAX_LEN, 'long', 'post', 'post',value=tokenizer.pad_token_id)
    print(X[0])
    
    if tokenizer_type != 'roberta':
      if org:
          entity_locs = np.asarray([[i for i, s in enumerate(sent) if s == '<'] for sent in tokenized_text])
      else:
          entity_locs = np.asarray([[i for i, s in enumerate(sent) if s == '^'] for sent in tokenized_text])
    
    else:
      # Find the locations of each entity and store them
      if org:
          entity_locs = np.asarray([[i for i, s in enumerate(sent) if '<' in s and len(s)==2] for sent in tokenized_text])
      else:
          entity_locs = np.asarray([[i for i, s in enumerate(sent) if '^' in s and len(s)==2] for sent in tokenized_text])
    print(entity_locs[0]) 
    return X,entity_locs
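
As a quick illustration of the marker search in the RoBERTa branch, the sketch below applies the same condition (`'<' in s and len(s) == 2`) to the token list printed by the last cell of this notebook. The helper name `find_marker_positions` is hypothetical and not part of the notebook.

```python
def find_marker_positions(tokens, marker):
    """Indices of two-character tokens that contain the marker (for example 'Ġ<')."""
    return [i for i, tok in enumerate(tokens) if marker in tok and len(tok) == 2]

# Token list copied from the output of the final cell.
tokens = ['<s>', 'ĠFrance', 'Ġis', 'ĠâĢ', 'ĺ', 'Ġhunting', 'Ġdown', 'Ġits', 'Ġcitizens',
          'Ġwho', 'Ġjoined', 'Ġ<', 'ĠIsis', 'Ġ<', 'ĠâĢ', 'Ļ', 'Ġwithout', 'Ġtrial',
          'Ġin', 'ĠIraq', '</s>']

start, end = find_marker_positions(tokens, '<')
# The model later averages the vectors in [start + 1, end), i.e. the focus word only.
focus_tokens = tokens[start + 1:end]
print(start, end, focus_tokens)
```

The length check is what keeps `<s>` and `</s>` from being mistaken for markers, since both contain `<` but are longer than two characters.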

This function builds the dataloaders for repeating the experiment with a non-Siamese network.

In [0]:
def get_sent_emb_dataloaders_bert(file_path: str, mode='train', train_batch_size=64, test_batch_size=64, model=None):
    
    df = pd.read_csv(file_path, sep=",")
    # Get the additional data.
    if mode == 'train':
        df1 = pd.read_csv(file_path[:-4] + "_funlines.csv", sep=",")
        df = pd.concat([df, df1], ignore_index=True)
    id = df['id']
    X = df['original'].values
    X = [sent.replace("\"", "") for sent in X]
    # Replaced word
    replaced = df['original'].apply(lambda x: x[x.index("<"):x.index(">") + 1])
    replaced_clean = [x.replace("<", "").replace("/>", "") for x in replaced]
    if mode != 'test':
        y = df['meanGrade'].values
    edit = df['edit']
    # Substitute the edit word in the place of the replaced word add the required demarcation tokens.
    X2 = [sent.replace(replaced[i], "^ " + edit[i] + " ^") for i, sent in enumerate(X)]
    X1 = [sent.replace("<", "< ").replace("/>", " <") for i, sent in enumerate(X)]
    X, entity_locs = tokenize_roberta_sent(X1, X2)

    if mode == "train":

        train1_inputs = torch.tensor(X)
        train_labels = torch.tensor(y)
        train_entity_locs = torch.tensor(entity_locs)

        # Wrap the tensors in a torch DataLoader, which handles batching and shuffling.
        # Note that TensorDataset itself keeps the whole dataset in memory.

        # train_data = TensorDataset(train1_inputs,train2_inputs, train_entity_locs, train_word2vec_locs, train_labels)
        train_data = TensorDataset(train1_inputs, train_entity_locs, train_labels)
        train_dataloader = DataLoader(train_data, batch_size=train_batch_size, shuffle=True)

        # validation_data = TensorDataset(validation1_inputs,validation2_inputs, validation_entity_locs, validation_word2vec_locs, validation_labels)
        return train_dataloader

    if mode == "val":
        test1_input = torch.tensor(X)
        y = torch.tensor(y)
        train_entity_locs = torch.tensor(entity_locs)
        # word2vec_locs = torch.tensor(word2vec_indices)
        id = torch.tensor(id)
        test_data = TensorDataset(test1_input, train_entity_locs,y, id)
        test_sampler = SequentialSampler(test_data)
        test_data_loader = DataLoader(test_data, sampler=test_sampler, batch_size=test_batch_size)

        return test_data_loader

    if mode == "test":
        test1_input = torch.tensor(X)

        train_entity_locs = torch.tensor(entity_locs)
        # word2vec_locs = torch.tensor(word2vec_indices)
        id = torch.tensor(id)
        # No labels in test mode; each batch is (inputs, entity locations, ids).
        test_data = TensorDataset(test1_input, train_entity_locs, id)
        test_sampler = SequentialSampler(test_data)
        test_data_loader = DataLoader(test_data, sampler=test_sampler, batch_size=test_batch_size)

        return test_data_loader
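
The string handling at the top of this function can be easier to follow on a single headline. The sketch below repeats the same three replacements on one example; the function name `mark_headline` and the edit word `"twins"` are assumptions made for illustration, not values taken from the data files.

```python
def mark_headline(original: str, edit: str):
    """Return (marked original, edited headline) using the notebook's '<' and '^' markers."""
    original = original.replace('"', '')
    # The word to be replaced is stored in the dataset as <word/>.
    replaced = original[original.index("<"):original.index(">") + 1]
    # Edited headline: substitute the edit word and wrap it in '^' markers.
    edited = original.replace(replaced, "^ " + edit + " ^")
    # Original headline: turn <word/> into '< word <' so the tokenizer isolates the markers.
    marked = original.replace("<", "< ").replace("/>", " <")
    return marked, edited

marked, edited = mark_headline(
    "France is hunting down its citizens who joined <Isis/> without trial in Iraq",
    "twins",  # hypothetical edit word
)
print(marked)
print(edited)
```

Surrounding the focus word with spaces and a dedicated symbol means the marker is tokenized on its own, which is what the location search in the tokenizer functions relies on.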

Get the data loader for language modeling, assuming that the model is RoBERTa.


In [0]:
def get_bert_lm_dataloader(file_path : str,batch_size = 16):
    jokes_df = pd.read_csv(file_path)
    jokes = jokes_df['Joke']
    jokes = "<s> " + jokes + " </s>"
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    X = [tokenizer.encode(sent,add_special_tokens=False) for sent in jokes]
    MAX_LEN = max([len(sent) for sent in X])
    print(MAX_LEN)
    X = pad_sequences(X, MAX_LEN, 'long', 'post', 'post',tokenizer.pad_token_id)
    dataset = TensorDataset(torch.tensor(X))
    sampler = RandomSampler(dataset)
    data_loader = DataLoader(dataset, sampler=sampler, batch_size=batch_size,pin_memory=True)
    return data_loader
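
Batches from this loader are masked inside `pre_train_bert` with the usual BERT recipe: 15% of the non-special tokens are selected, 80% of those become the mask token, 10% become a random token and 10% stay unchanged. The notebook draws the random-token set with probability 0.5 from the 20% that were not masked, which is where the overall 10% comes from. The sketch below shows the same proportions in plain Python with a single uniform draw per selected token; `mask_tokens`, `MASK_ID` and `VOCAB_SIZE` are assumptions for illustration.

```python
import random

MASK_ID = 50264      # assumed id of the <mask> token (illustrative)
VOCAB_SIZE = 50265   # assumed vocabulary size (illustrative)
IGNORE = -100        # label value that the loss ignores, as in the notebook

def mask_tokens(ids, special_ids, rng, mlm_prob=0.15):
    """Return (masked inputs, labels) for one sequence of token ids."""
    inputs, labels = list(ids), []
    for pos, tok in enumerate(ids):
        if tok in special_ids or rng.random() >= mlm_prob:
            labels.append(IGNORE)                    # not selected: no loss here
            continue
        labels.append(tok)                           # selected: predict the original token
        draw = rng.random()
        if draw < 0.8:
            inputs[pos] = MASK_ID                    # 80% of selected tokens: <mask>
        elif draw < 0.9:
            inputs[pos] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

rng = random.Random(12)
# <s> (0), 10000 made-up content ids, </s> (2) and two <pad> (1) tokens.
ids = [0] + [rng.randrange(4, 1000) for _ in range(10000)] + [2, 1, 1]
inputs, labels = mask_tokens(ids, special_ids={0, 1, 2}, rng=rng)
selected = sum(label != IGNORE for label in labels)
print(selected / len(ids))
```

Special and padding tokens are never selected, which mirrors the two `masked_fill_` calls on `probability_matrix` in the notebook.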


This function is to get the data loaders for the Siamese architecture.

In [0]:
def get_dataloaders_bert(file_path : str,model_type,mode="train",train_batch_size=64,test_batch_size = 64):

    '''
    This function creates pytorch dataloaders for fast and easy iteration over the dataset.

    :param file_path: Path of the file containing train/test data
    :param mode: Test mode or Train mode
    :param train_batch_size: Size of the batch during training
    :param test_batch_size: Size of the batch during testing
    :return: Dataloaders
    '''

    # Read the data,tokenize and vectorize
    df = pd.read_csv(file_path, sep=",")
    if mode=='train':
        df1 = pd.read_csv(file_path[:-4]+"_funlines.csv",sep=",")
        df = pd.concat([df,df1],ignore_index=True)
    id = df['id']
    X = df['original'].values
    X = [sent.replace("\"","") for sent in X]
    
    replaced = df['original'].apply(lambda x: x[x.index("<"):x.index(">")+1])
    replaced_clean = [x.replace("<","").replace("/>","") for x in replaced]
    if mode!='test':
        y = df['meanGrade'].values
    edit = df['edit']
    X2 = [sent.replace(replaced[i], "^ " + edit[i] + " ^") for i, sent in enumerate(X)]
    X1 = [sent.replace("<","< ").replace("/>"," <") for i,sent in enumerate(X)]
    X1,e1_locs = tokenize_bert(X1,True,model_type)
    X2,e2_locs = tokenize_bert(X2,False,model_type)

    replacement_locs = np.concatenate((e1_locs, e2_locs), 1)
    
    if mode == "train":


        train1_inputs = torch.tensor(X1)
        train2_inputs = torch.tensor(X2)
        train_labels = torch.tensor(y)
        train_entity_locs = torch.tensor(replacement_locs)
        
        # Wrap the tensors in a torch DataLoader, which handles batching and shuffling.
        # Note that TensorDataset itself keeps the whole dataset in memory.

        #train_data = TensorDataset(train1_inputs,train2_inputs, train_entity_locs, train_word2vec_locs, train_labels)
        train_data = TensorDataset(train1_inputs, train2_inputs, train_entity_locs, train_labels)
        train_sampler = RandomSampler(train_data)
        train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=train_batch_size)

        return  train_dataloader

    if mode == "val":
        test1_input = torch.tensor(X1)
        test2_input = torch.tensor(X2)

        train_entity_locs = torch.tensor(replacement_locs)
        y = torch.tensor(y)
        id = torch.tensor(id)
        test_data = TensorDataset(test1_input, test2_input, train_entity_locs,y,id)
        test_sampler = SequentialSampler(test_data)
        test_data_loader = DataLoader(test_data, sampler=test_sampler, batch_size=test_batch_size)
        return test_data_loader

    if mode == "test":
        test1_input = torch.tensor(X1)
        test2_input = torch.tensor(X2)

        train_entity_locs = torch.tensor(replacement_locs)
        id = torch.tensor(id)
        test_data = TensorDataset(test1_input, test2_input, train_entity_locs,id)
        test_sampler = SequentialSampler(test_data)
        test_data_loader = DataLoader(test_data, sampler=test_sampler, batch_size=test_batch_size)

        return test_data_loader

Tokenizer function to be used by the non-Siamese architecture, assuming the transformer model is RoBERTa.

In [0]:
def tokenize_roberta_sent(X1: list, X2 : list ):
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    sentences = ["<s> " + X1[i] + " </s></s> " + X2[i] + " </s>" for i in range(len(X1))]
    tokenized_text = [tokenizer.tokenize(sentence) for sentence in sentences]
    print(tokenized_text[0])
    X = [tokenizer.convert_tokens_to_ids(sent) for sent in tokenized_text]
    print(X[0])
    #sent_emb = [[0 if i<sentence.index(102) else 1 for i  in range(len(sentence)) ] for sentence in X]
    MAX_LEN = max([len(x) for x in X])+1
    print(MAX_LEN)
    # Pad sequences to make them all equally long
    X = pad_sequences(X, MAX_LEN, 'long', 'post', 'post',tokenizer.pad_token_id)
    #sent_emb = pad_sequences(sent_emb,MAX_LEN,'long','post','post',1)
    # Find the locations of each entity and store them
    entity_locs1 = np.asarray(
            [[i for i, s in enumerate(sent) if '<' in s and len(s) == 2] for sent in tokenized_text])
    entity_locs2 = np.asarray([[i for i, s in enumerate(sent) if '^' in s and len(s) == 2] for sent in tokenized_text])

    return X,np.concatenate((entity_locs1, entity_locs2), 1)
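
The pair layout `<s> A </s></s> B </s>` is what later lets `forward1` split a row back into its original and edited halves by looking up the first two separator ids. The sketch below shows post-padding, the attention mask and the separator lookup in plain Python. The ids 0, 1 and 2 for `<s>`, `<pad>` and `</s>` are assumed to match `roberta-base`, all other ids are made up, and `pad_post` is a hypothetical stand-in for `pad_sequences`.

```python
BOS_ID, PAD_ID, SEP_ID = 0, 1, 2   # assumed roberta-base ids for <s>, <pad>, </s>

def pad_post(seqs, max_len, pad_id=PAD_ID):
    """Truncate and pad at the end of each sequence (padding='post', truncating='post')."""
    return [seq[:max_len] + [pad_id] * max(0, max_len - len(seq)) for seq in seqs]

# Two made-up pair encodings of the form <s> A </s></s> B </s>.
pairs = [
    [BOS_ID, 11, 12, 13, SEP_ID, SEP_ID, 11, 99, 13, SEP_ID],
    [BOS_ID, 21, 22, SEP_ID, SEP_ID, 21, 98, SEP_ID],
]
max_len = max(len(p) for p in pairs) + 1   # same "+ 1" as in tokenize_roberta_sent
padded = pad_post(pairs, max_len)
attention_mask = [[int(tok != PAD_ID) for tok in row] for row in padded]

# Separator positions in the first row; forward1 uses the first two of them.
sep_positions = [i for i, tok in enumerate(padded[0]) if tok == SEP_ID]
first_sep, second_sep = sep_positions[0], sep_positions[1]
original_part = padded[0][:first_sep]      # <s> plus the original headline
edited_part = padded[0][second_sep:]       # second </s> onwards, padding included
print(original_part, edited_part)
```

Because of the `+ 1`, every row ends in at least one padding token, and the slice for the edited half runs to the end of the row, so it includes that padding.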

In [0]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error
import sys
from transformers.optimization import AdamW,get_linear_schedule_with_warmup
from transformers import BertForMaskedLM, DistilBertForMaskedLM, RobertaModel, BertModel,DistilBertModel,RobertaForMaskedLM
from pytorch_pretrained_bert import BertAdam
import argparse
import torchnlp.nn as nn_nlp
import gensim
import numpy as np
import json

In [0]:
torch.manual_seed(12)
torch.cuda.manual_seed(12)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [0]:
class FunBERT(nn.Module):

    def __init__(self, train_file_path: str, dev_file_path: str, test_file_path: str, lm_file_path: str,
                 train_batch_size: int,
                 test_batch_size: int, lr: float, lm_weights_file_path: str, epochs: int, lm_pretrain: str,
                 model_path: str,model_type : str):
      '''

      :param train_file_path: Path to the train file
      :param test_file_path: Path to the test file
      :param train_batch_size: Size of the batch during training
      :param test_batch_size: Size of the batch during testing
      :param lr: learning rate
      '''

      super(FunBERT, self).__init__()
      if lm_pretrain and model_type == 'roberta':
        self.bert_model = RobertaForMaskedLM.from_pretrained('roberta-base', output_hidden_states=True)
        self.tokenizer= RobertaTokenizer.from_pretrained('roberta-base')
      elif model_type == 'roberta':
        self.bert_model = RobertaModel.from_pretrained('roberta-base', output_hidden_states=True)
        self.tokenizer= RobertaTokenizer.from_pretrained('roberta-base')
      elif model_type == 'bert':
        self.bert_model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
        self.tokenizer= BertTokenizer.from_pretrained('bert-base-uncased')
      elif model_type == 'distilbert':
        self.bert_model = DistilBertModel.from_pretrained('distilbert-base-uncased',output_hidden_states=True)
        self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
      self.model_type = model_type
      self.train_batch_size = train_batch_size
      self.test_batch_size = test_batch_size
      self.train_file_path = train_file_path
      self.lm_file_path = lm_file_path
      self.attention = nn_nlp.Attention(768 * 2)
      self.dev_file_path = dev_file_path
      self.test_file_path = test_file_path
      self.lr = lr
      
      self.prelu = nn.PReLU()
      self.epochs = epochs
      self.linear_reg1 = nn.Sequential(
          nn.Dropout(0.3),
          nn.Linear(768 * 8, 1024))

      self.final_linear = nn.Sequential(nn.Dropout(0.3), nn.Linear(1024, 1))
    
    def pre_train_bert(self):
        optimizer = optim.Adam(self.bert_model.parameters(), 2e-5)
        scheduler = get_linear_schedule_with_warmup(optimizer, 62,620)
        step = 0
        train_dataloader = get_bert_lm_dataloader(self.lm_file_path, 32)
        print("Training LM")
        if torch.cuda.is_available():
            self.bert_model.cuda()
        for epoch in range(2):
            print("Epoch : " + str(epoch))
            for ind, batch in enumerate(train_dataloader):
                step += 1

                optimizer.zero_grad()
                if torch.cuda.is_available():
                    inp = batch[0].cuda()
                else:
                    inp = batch[0]

                labels = inp.clone()
                # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
                probability_matrix = torch.full(labels.shape, 0.15)
                special_tokens_mask = [
                    self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in
                    labels.tolist()
                ]
                probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
                if self.tokenizer._pad_token is not None:
                    padding_mask = labels.eq(self.tokenizer.pad_token_id)
                    padding_mask = padding_mask.detach().cpu()
                    probability_matrix.masked_fill_(padding_mask, value=0.0)
                masked_indices = torch.bernoulli(probability_matrix).bool()
                labels[~masked_indices] = -100  # We only compute loss on masked tokens

                # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
                indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
                inp[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

                # 10% of the time (half of the remaining 20%), we replace masked input tokens with a random word
                indices_random = torch.bernoulli(
                    torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
                random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
                # Move to the input's device instead of calling .cuda(), so this also runs on CPU.
                inp[indices_random] = random_words[indices_random].to(inp.device)
                outputs = self.bert_model(inp, masked_lm_labels=labels.long(),attention_mask=(inp!=self.tokenizer.pad_token_id).long())
                loss, prediction_scores = outputs[:2]
                loss.backward()
                #torch.nn.utils.clip_grad_norm_(self.bert_model.parameters(), 1.0)
                print(str(step) + " Loss is :" + str(loss.item()))
                optimizer.step()
                scheduler.step()
            torch.cuda.empty_cache()
        print("LM training done")
        torch.save(self.bert_model.state_dict(), "lm_joke_bert.pth")

    # Use this forward for non-siamese model.
    def forward1(self, *input) :
        final_out = []
        input = input[0]
        attn_mask0 = (input[0] != self.tokenizer.pad_token_id).long()
        out_per_seq, _,attention_layer_inps= self.bert_model(input[0].long(),attention_mask=attn_mask0)
        out_per_seq = torch.cat((out_per_seq,attention_layer_inps[11]),2)
        pos = input[0].clone().detach().cpu()
        for (i, loc) in enumerate(input[1]):
            # +1 is to ensure that the symbol token is not considered
            entity1 = torch.mean(out_per_seq[i,loc[0]+1:loc[1]],0)
            entity2 = torch.mean(out_per_seq[i, loc[2] + 1:loc[3]], 0)
            # Limit attention to original sentence for entity1 and edited sentence for entity2 
            imp_seq1 = torch.cat((out_per_seq[i, 0:loc[0] + 1], out_per_seq[i, loc[1]:np.where(pos[i].numpy()==self.tokenizer.sep_token_id)[0][0]]), 0)
            imp_seq2 = torch.cat((out_per_seq[i, np.where(pos[i].numpy()==self.tokenizer.sep_token_id)[0][1]:loc[2] + 1], out_per_seq[i, loc[3]:]), 0)
            _, attention_score = self.attention(entity2.unsqueeze(0).unsqueeze(0), imp_seq2.unsqueeze(0))
            sent_attn2 = torch.sum(attention_score.squeeze(0).expand(768 * 2, -1).t() * imp_seq2, 0)
            _, attention_score = self.attention(entity1.unsqueeze(0).unsqueeze(0), imp_seq1.unsqueeze(0))
            sent_attn1 = torch.sum(attention_score.squeeze(0).expand(768 * 2, -1).t() * imp_seq1, 0)
            #attn_diff = torch.abs(sent_attn2-sent_attn1)
            sent_out = self.prelu(self.linear_reg1(torch.cat((sent_attn2,sent_attn1,out_per_seq[i,0],entity2), 0)))
            out = self.final_linear(sent_out)
            final_out.append(out)
        #out = self.final_linear(torch.cat((out_per_seq[:, 0, :],entity_diff), 1))

        return torch.stack(final_out)

    def forward(self, *input):
        '''
        :param input: input[0] holds the original headlines, input[1] the edited headlines and input[2] the marker locations
        :return: One predicted humor score per headline pair
        '''

        final_scores = []
        input = input[0]
        attn_mask0 = (input[0]!=self.tokenizer.pad_token_id).long()
        output_per_seq1,_,attention_layer_inps = self.bert_model(input[0].long(), attention_mask = attn_mask0)
        output_per_seq1 = torch.cat((output_per_seq1, attention_layer_inps[11]), 2)
        attn_mask1 = (input[1]!=self.tokenizer.pad_token_id).long()
        output_per_seq2,_,attention_layer_inps = self.bert_model(input[1].long(),attention_mask = attn_mask1)
        output_per_seq2 = torch.cat((output_per_seq2, attention_layer_inps[11]), 2)
        '''
        Average the vectors of the focus-word tokens, pool the rest of each sentence with attention,
        then apply a linear layer, a PReLU and the final linear layer.
        '''
        for (i, loc) in enumerate(input[2]):
            # +1 is to ensure that the symbol token is not considered
            
            entity1 = torch.mean(output_per_seq1[i, loc[0] + 1:loc[1]], 0)
            entity2 = torch.mean(output_per_seq2[i, loc[2] + 1:loc[3]], 0)
            
            imp_seq1 = torch.cat((output_per_seq1[i, 0:loc[0] + 1], output_per_seq1[i, loc[1]:]), 0)
            imp_seq2 = torch.cat((output_per_seq2[i, 0:loc[2] + 1], output_per_seq2[i, loc[3]:]), 0)
            _, attention_score = self.attention(entity2.unsqueeze(0).unsqueeze(0), imp_seq2.unsqueeze(0))
            sent_attn = torch.sum(attention_score.squeeze(0).expand(768 * 2, -1).t() * imp_seq2, 0)
            _, attention_score1 = self.attention(entity1.unsqueeze(0).unsqueeze(0), imp_seq1.unsqueeze(0))
            sent_attn1 = torch.sum(attention_score1.squeeze(0).expand(768 * 2, -1).t() * imp_seq1, 0)
            sent_out = self.prelu(self.linear_reg1(torch.cat((sent_attn,sent_attn1,output_per_seq2[i,0],entity2), 0)))
            final_out = self.final_linear(sent_out)
            final_scores.append(final_out)
        
        return torch.stack((final_scores))
    
    def train_non_siamese(self):
        if torch.cuda.is_available():
            self.cuda()
        optimizer = optim.Adam(self.parameters(), lr=self.lr, weight_decay=0.001)

        loss = nn.MSELoss()
        train_dataloader = get_sent_emb_dataloaders_bert(self.train_file_path,'train',self.train_batch_size)

        val_dataloader = get_sent_emb_dataloaders_bert(self.dev_file_path,"val",self.train_batch_size)
        best_loss = sys.maxsize
        best_accuracy = -sys.maxsize
        steps = 0
        pred_scores = []
        gt_scores = []
        print(f"Pad token is {self.tokenizer.pad_token}")
        for epoch in range(self.epochs):
            steps += 1
            if epoch == 0:
                scheduler = get_linear_schedule_with_warmup(optimizer,140,1400)
            total_prev_loss = 0
            for (batch_num, batch) in enumerate(train_dataloader):
                # If gpu is available move to gpu.
                if torch.cuda.is_available():
                    input1 = batch[0].cuda()
                    locs = batch[1].cuda()
                    gt = batch[2].cuda()
                else:
                    input1 = batch[0]
                    locs = batch[1]
                    gt = batch[2]

                loss_val = 0
                self.bert_model.train()
                self.attention.train()
                self.linear_reg1.train()
                self.prelu.train()
                self.final_linear.train()

                # Clear gradients
                optimizer.zero_grad()
                final_scores = self.forward1((input1,locs))
                loss_val += loss(final_scores.squeeze(1), gt.float())
                

                # Compute gradients
                loss_val.backward()
                torch.nn.utils.clip_grad_norm_(self.parameters(), 1.0)
                total_prev_loss += loss_val.item()
                print("Loss for batch" + str(batch_num) + ": " + str(loss_val.item()))
                # Update weights according to the gradients computed.
                optimizer.step()
                scheduler.step()

            # Don't compute gradients in validation step
            with torch.no_grad():
                # Ensure that dropout behavior is correct.
                pred_scores = []
                gt_scores = []
                predictions = []
                ground_truth = []
                self.bert_model.eval()
                self.attention.eval()
                self.linear_reg1.eval()
                self.final_linear.eval()
                self.prelu.eval()
                mse_loss = 0
                for (val_batch_num, val_batch) in enumerate(val_dataloader):
                    if torch.cuda.is_available():
                        input1 = val_batch[0].cuda()
                        locs = val_batch[1].cuda()
                        gt = val_batch[2].cuda()
                    else:
                        input1 = val_batch[0]
                        locs = val_batch[1]
                        gt = val_batch[2]

                    final_scores = self.forward1((input1, locs))
                    pred_scores.extend(final_scores.cpu().detach().squeeze(1))
                    gt_scores.extend(gt.cpu().detach())

                    mse_loss += mean_squared_error(gt.cpu().detach(),final_scores.cpu().detach().squeeze(1))

                    
                print(f"Validation Loss is {np.sqrt(mean_squared_error(gt_scores,pred_scores))}")

                if mse_loss < best_loss:
                    torch.save(self.state_dict(), "model_1_"  + str(epoch) + ".pth")
                    best_loss = mse_loss

    def train(self, mode=True):
        if torch.cuda.is_available():
            self.cuda()
        #self.bert_model = self.bert_model.roberta
        optimizer = optim.Adam(self.parameters(), lr=self.lr, weight_decay=0.001)

        loss = nn.MSELoss()
        train_dataloader = get_dataloaders_bert(self.train_file_path,self.model_type,'train',self.train_batch_size)

        val_dataloader = get_dataloaders_bert(self.dev_file_path,self.model_type,"val",self.train_batch_size)
        best_loss = sys.maxsize
        best_accuracy = -sys.maxsize
        steps = 0
        pred_scores = []
        gt_scores = []
        print(f"Pad token is {self.tokenizer.pad_token}")
        for epoch in range(self.epochs):
            steps += 1
            if epoch == 0:
                scheduler = get_linear_schedule_with_warmup(optimizer,140,1400)
            total_prev_loss = 0
            for (batch_num, batch) in enumerate(train_dataloader):
                # If gpu is available move to gpu.
                if torch.cuda.is_available():
                    input1 = batch[0].cuda()
                    input2 = batch[1].cuda()
                    locs = batch[2].cuda()
                    gt = batch[3].cuda()
                else:
                    input1 = batch[0]
                    input2 = batch[1]
                    locs = batch[2]
                    gt = batch[3]

                loss_val = 0
                self.bert_model.train()
                self.attention.train()
                self.linear_reg1.train()
                self.prelu.train()
                self.final_linear.train()

                # Clear gradients
                optimizer.zero_grad()
                final_scores = self.forward((input1, input2,locs))
                loss_val += loss(final_scores.squeeze(1), gt.float())
                

                # Compute gradients
                loss_val.backward()
                torch.nn.utils.clip_grad_norm_(self.parameters(), 1.0)
                total_prev_loss += loss_val.item()
                print("Loss for batch" + str(batch_num) + ": " + str(loss_val.item()))
                # Update weights according to the gradients computed.
                optimizer.step()
                scheduler.step()

            # Don't compute gradients in validation step
            with torch.no_grad():
                # Ensure that dropout behavior is correct.
                pred_scores = []
                gt_scores = []
                predictions = []
                ground_truth = []
                self.bert_model.eval()
                self.attention.eval()
                self.linear_reg1.eval()
                self.final_linear.eval()
                self.prelu.eval()
                mse_loss = 0
                for (val_batch_num, val_batch) in enumerate(val_dataloader):
                    if torch.cuda.is_available():
                        input1 = val_batch[0].cuda()
                        input2 = val_batch[1].cuda()
                        locs = val_batch[2].cuda()
                        gt = val_batch[3].cuda()
                    else:
                        input1 = val_batch[0]
                        input2 = val_batch[1]
                        locs = val_batch[2]
                        gt = val_batch[3]

                    final_scores = self.forward((input1, input2, locs))
                    pred_scores.extend(final_scores.cpu().detach().squeeze(1))
                    gt_scores.extend(gt.cpu().detach())

                    mse_loss += mean_squared_error(gt.cpu().detach(),final_scores.cpu().detach().squeeze(1))

                    
                print(f"Validation Loss is {np.sqrt(mean_squared_error(gt_scores,pred_scores))}")

                if mse_loss < best_loss:
                    torch.save(self.state_dict(), "model_1_"  + str(epoch) + ".pth")
                    best_loss = mse_loss
      
      
    def predict(self, model_path=None):

        '''
        This function predicts the humor scores on a labelled evaluation set, writes a csv file containing the id and
        predicted score, and prints the RMSE against the ground truth.
        :param model_path: Path of the model to be loaded; if not given, the current model is used.
        :return:
        '''

        gts = []
        preds = []
        if torch.cuda.is_available():
            self.cuda()
        if model_path:
            self.load_state_dict(torch.load(model_path))
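        # predict() runs the Siamese forward pass, so it needs the Siamese loader;
        # "val" mode yields (original, edited, locations, labels, ids).
        test_dataloader = get_dataloaders_bert(self.test_file_path, self.model_type, "val",
                                               test_batch_size=self.test_batch_size)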
        self.bert_model.eval()
        self.linear_reg1.eval()
        self.final_linear.eval()
        self.prelu.eval()
        self.attention.eval()
        with torch.no_grad():
            with open("task-1-output.csv", "w+") as f:
                f.writelines("id,pred\n")
                for ind, batch in enumerate(test_dataloader):
                    if torch.cuda.is_available():
                        input1 = batch[0].cuda()
                        input2 = batch[1].cuda()
                        locs = batch[2].cuda()
                        id = batch[4].cuda()
                        gt = batch[3].cuda()
                    else:
                        input1 = batch[0]
                        input2 = batch[1]
                        locs = batch[2]
                        gt = batch[3]
                        id = batch[4]
                    final_scores_1 = self.forward((input1, input2, locs))
                    preds.extend(final_scores_1.cpu().detach().squeeze(1))
                    gts.extend(gt.cpu().detach())
                    for cnt, pred in enumerate(final_scores_1):
                        f.writelines(str(id[cnt].item()) + "," + str(pred.item()) + "\n")
                      
                print(f"Test score is {np.sqrt(mean_squared_error(gts,preds))}")
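
Both forward passes pool the tokens around the focus word with attention and then concatenate four vectors of size `768 * 2`, which is why the first linear layer expects `768 * 8` inputs. The sketch below reproduces that bookkeeping on toy vectors in plain Python. It uses an unparameterized dot product in place of the learned scoring inside `nn_nlp.Attention`, so it is a simplification rather than the notebook's exact computation, and `attention_pool` is a hypothetical name.

```python
import math

def attention_pool(query, context):
    """Weighted sum of the context vectors with weights softmax(dot(query, c_i))."""
    scores = [sum(q * c for q, c in zip(query, vec)) for vec in context]
    top = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * vec[d] for w, vec in zip(weights, context)) for d in range(len(query))]
    return pooled, weights

HIDDEN = 4  # toy size; the notebook uses 768 * 2 per token
entity = [1.0, 0.0, 0.0, 0.0]                          # averaged focus-word vector
context = [[1.0, 0.0, 0.0, 0.0],                       # vectors of the surrounding tokens
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
sent_attn, weights = attention_pool(entity, context)

# The regression head concatenates four vectors: pooled edited sentence, pooled original
# sentence, the first-token vector and the edited entity. The same pooled vector is
# reused twice here purely as a stand-in.
features = sent_attn + sent_attn + context[0] + entity
print(len(features), 4 * HIDDEN)
```

Context tokens that align with the focus word receive larger weights, so the pooled vector emphasizes the parts of the headline most related to the edit.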

In [0]:
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", action="store", type=int, default=64, required=False)
    parser.add_argument("--train_file_path", type=str, default="../data/task-1/train.csv", required=False)
    parser.add_argument("--dev_file_path", type=str, default="../data/task-1/dev.csv", required=False)
    parser.add_argument("--test_file_path", type=str, default="../data/task-1/dev.csv", required=False)
    parser.add_argument("--lm_file_path", type=str, default="../data/task-1/shortjokes2.csv", required=False)
    parser.add_argument("--lm_weights_file_path", type=str, default="../models/lm_joke_bert.pth", required=False)
    parser.add_argument("--model_file_path", type=str, default="../models/model_4.pth", required=False)
    parser.add_argument("--predict", type=str, default='false', required=False)
    parser.add_argument("--add_joke_train", type=str, default='true', required=False)
    parser.add_argument("--lm_pretrain", type=str, default='false', required=False)
    parser.add_argument("--word2vec", type=str, default='false', required=False)
    parser.add_argument("--joke_classification_path", type=str, default='../data/task-1/joke_classification.csv',
                        required=False)
    parser.add_argument("--lr", type=float, default=0.0001, required=False)
    parser.add_argument("--train_scratch", type=str, default='false', required=False)
    parser.add_argument("--task", type=int, default=1, required=False)
    parser.add_argument("--epochs", type=int, default=5, required=False)
    parser.add_argument("--model_type", type=str, default='bert')
    args = parser.parse_args()

    obj = FunBERT(args.train_file_path, args.dev_file_path, args.test_file_path, args.lm_file_path, args.batch_size, 64,
                args.lr, args.lm_weights_file_path, args.epochs, args.lm_pretrain,
                args.model_file_path,args.model_type)

    if args.lm_pretrain == 'true':
        obj.pre_train_bert()

    if args.predict == 'true':
        obj.predict(args.model_file_path)
    else:
        # obj.train_joke_classification()
        obj.train()


To run the default Siamese architecture, use the cell below. shortjokes2.csv is the file that contains the original headlines for masked language model (MLM) training of RoBERTa. If you run the MLM pre-training, you need to uncomment the lines that look like self.bert_model = self.bert_model.roberta, which remove the language modeling head.
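
Taking `.roberta` from the masked-LM model keeps only the encoder. If the MLM weights were saved from the full masked-LM model rather than from the bare encoder, the saved keys would carry a `roberta.` prefix and include `lm_head.` entries, so they would need filtering before being loaded into the encoder. The sketch below assumes that checkpoint layout; `strip_lm_head` is a hypothetical helper, and plain strings stand in for the weight tensors:

```python
def strip_lm_head(state_dict, encoder_prefix="roberta."):
    """Keep only encoder weights and drop the prefix from their keys.

    Assumes keys look like 'roberta.<name>' for the encoder and
    'lm_head.<name>' for the language modeling head.
    """
    return {
        key[len(encoder_prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(encoder_prefix)
    }


# Toy checkpoint: strings stand in for tensors.
full_checkpoint = {
    "roberta.embeddings.word_embeddings.weight": "W_emb",
    "roberta.encoder.layer.0.output.dense.weight": "W_out",
    "lm_head.dense.weight": "W_head",
    "lm_head.bias": "b_head",
}

encoder_only = strip_lm_head(full_checkpoint)
print(sorted(encoder_only))
```

If the checkpoint was saved after the head had already been removed, no filtering is needed and the commented `load_state_dict` line in the cell below applies as written.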

In [0]:
obj = FunBERT('train.csv', 'dev.csv', 'truth.csv', 
              'shortjokes2.csv', 64, 64,
                2e-5, 'lm_joke_bert.pth', 5, None,'model_2.pth','roberta')
#obj.bert_model = obj.bert_model.roberta
#obj.bert_model.load_state_dict(torch.load('lm_joke_bert.pth'))
obj.train()
#obj.train_non_siamese()

roberta
['<s>', 'ĠFrance', 'Ġis', 'ĠâĢ', 'ĺ', 'Ġhunting', 'Ġdown', 'Ġits', 'Ġcitizens', 'Ġwho', 'Ġjoined', 'Ġ<', 'ĠIsis', 'Ġ<', 'ĠâĢ', 'Ļ', 'Ġwithout', 'Ġtrial', 'Ġin', 'ĠIraq', '</s>']
<pad>
1
[    0  1470    16    44   711  8217   159    63  2286    54  1770 28696
 38931 28696    44    27   396  1500    11  3345     2     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1]
[11 13]
roberta
['<s>', 'ĠFrance', 'Ġis', 'ĠâĢ', 'ĺ', 'Ġhunting', 'Ġdown', 'Ġits', 'Ġcitizens', 'Ġwho', 'Ġjoined', 'Ġ^', 'Ġtwins', 'Ġ^', 'ĠâĢ', 'Ļ', 'Ġwithout', 'Ġtrial', 'Ġin', 'ĠIraq', '</s>']
<pad>
1
[    0  1470    16    44   711  8217   159    63  2286    54  1770 37249
 13137 37249    44    27   396  1500    11  3345     2     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     

KeyboardInterrupt: ignored

In [0]:
!nvidia-smi

Wed Apr 29 02:24:35 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P0    24W /  75W |   7587MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

In [0]:
!ps -aux|grep python


root          18  0.4  0.8 423952 116072 ?       Sl   02:11   0:03 /usr/bin/python2 /usr/local/bin/jupyter-notebook --ip="172.28.0.2" --port=9000 --FileContentsManager.root_dir="/" --MappingKernelManager.root_dir="/content"
root         123  7.9 39.6 29867260 5281776 ?    Ssl  02:12   0:56 /usr/bin/python3 -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-a99e183f-09a7-478b-9768-161c3e3bbc42.json
root         509  0.0  0.0  39192  6488 ?        S    02:24   0:00 /bin/bash -c ps -aux|grep python
root         511  0.0  0.0  38568  5008 ?        S    02:24   0:00 grep python


If the GPU memory is full and cannot be freed even after deleting the class object, kill the ipykernel process (PID 123 in the listing above).
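
The PID can also be picked out of the `ps` listing programmatically instead of by eye. A minimal sketch, assuming the `ps -aux` column layout shown above (user first, PID second); `find_kernel_pids` is a hypothetical helper and the sample rows are abridged from the listing:

```python
def find_kernel_pids(ps_output, needle="ipykernel_launcher"):
    """Return the PIDs of all rows in `ps -aux` output that mention `needle`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Column 0 is the user, column 1 is the PID.
        if len(fields) > 1 and needle in line and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


# Rows abridged from the listing above (long paths shortened).
sample_ps_output = """\
root          18  0.4  0.8 423952 116072 ?       Sl   02:11   0:03 /usr/bin/python2 /usr/local/bin/jupyter-notebook
root         123  7.9 39.6 29867260 5281776 ?    Ssl  02:12   0:56 /usr/bin/python3 -m ipykernel_launcher -f kernel.json
root         511  0.0  0.0  38568  5008 ?        S    02:24   0:00 grep python
"""

print(find_kernel_pids(sample_ps_output))
```

Killing the kernel process discards all notebook state, so the setup cells have to be re-run afterwards.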


In [0]:
!kill -9 123

In [0]:
obj.predict('model_1_3.pth')

roberta
['<s>', 'ĠThe', 'ĠLatest', 'Ġ:', 'ĠElection', 'Ġtally', 'Ġshows', 'Ġ<', 'ĠAustria', 'Ġ<', 'Ġturning', 'Ġright', '</s>']
<pad>
1
[    0    20  9385  4832  7713 11154   924 28696  9950 28696  3408   235
     2     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1]
[7 9]
roberta
['<s>', 'ĠThe', 'ĠLatest', 'Ġ:', 'ĠElection', 'Ġtally', 'Ġshows', 'Ġ^', 'ĠCars', 'Ġ^', 'Ġturning', 'Ġright', '</s>']
<pad>
1
[    0    20  9385  4832  7713 11154   924 37249 17714 37249  3408   235
     2     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1]
[7 9]
Test score is 0.524755714997939
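
The `[7 9]` rows in the output above appear to be the positions of the two marker tokens that surround the focus word (`Ġ<` in the original headline, `Ġ^` in the edited one). A minimal sketch that recovers those positions from a token list, assuming the markers are tokenized exactly as shown in the output; `marker_positions` is a hypothetical helper, not the notebook's own tokenization code:

```python
def marker_positions(tokens, markers=("Ġ<", "Ġ^")):
    """Return the indices of every marker token in a tokenized headline."""
    return [index for index, token in enumerate(tokens) if token in markers]


# Token lists copied from the prediction output above.
original_tokens = ['<s>', 'ĠThe', 'ĠLatest', 'Ġ:', 'ĠElection', 'Ġtally',
                   'Ġshows', 'Ġ<', 'ĠAustria', 'Ġ<', 'Ġturning', 'Ġright', '</s>']
edited_tokens = ['<s>', 'ĠThe', 'ĠLatest', 'Ġ:', 'ĠElection', 'Ġtally',
                 'Ġshows', 'Ġ^', 'ĠCars', 'Ġ^', 'Ġturning', 'Ġright', '</s>']

print(marker_positions(original_tokens))  # [7, 9]
print(marker_positions(edited_tokens))    # [7, 9]
```

The focus word itself sits between the two returned indices, here at position 8 in both headlines.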
