<a href="https://colab.research.google.com/github/mingyungkim/SarcasmDetection/blob/master/Attentioned_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clone Git

Run the following to get the required files needed for this assignment. 

TODO: switch repos

In [0]:
!git clone https://github.com/cis700/hw3-solutions.git
!mv hw3-solutions/* .
!rm -rf hw3-solutions/

Cloning into 'hw3-solutions'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 17 (delta 2), reused 14 (delta 2), pack-reused 0
Unpacking objects: 100% (17/17), done.
mv: cannot move 'hw3-solutions/data' to './data': Directory not empty


# Part 1.  Setting up the Reddit dataset

For this assignment, we are going to use the Reddit sarcasm dataset.  Since sarcasm is difficult to express via text, Redditors frequently end sarcastic comments with "/s".  The Reddit dataset uses the presence of this "/s" token to create a labeled dataset of sarcastic and non-sarcastic comments.  The original Reddit dataset can be found here: https://www.kaggle.com/danofer/sarcasm

We've made some slight modifications to the dataset -- we've removed metadata and balanced the dataset (only 1% of Reddit comments are actually sarcastic, but 50% of the training and test examples are sarcastic in the modified dataset).
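To make the labeling and balancing idea concrete, here is a minimal sketch in plain Python. It is not the script that produced the Kaggle dataset or our modified version; the helper names `label_comment` and `balance` and the toy comments are assumptions made up for illustration.

```python
import random


def label_comment(raw_comment):
    """Return (text, label); label is 1 when the comment ends with the '/s' tag."""
    text = raw_comment.rstrip()
    if text.endswith("/s"):
        # strip the tag so a model cannot simply read the label off the text
        return text[:-2].rstrip(), 1
    return text, 0


def balance(examples, seed=1234):
    """Downsample the majority class so both labels appear equally often."""
    rng = random.Random(seed)
    pos = [ex for ex in examples if ex[1] == 1]
    neg = [ex for ex in examples if ex[1] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced


# hypothetical toy comments, not rows from the real dataset
raw = [
    "Great, another Monday. /s",
    "I like this song.",
    "Nice weather today.",
    "Sure, that will work /s",
    "Thanks for the link.",
]
labeled = [label_comment(c) for c in raw]
balanced = balance(labeled)
print(len(balanced), sum(label for _, label in balanced))
```

With two sarcastic and three non-sarcastic toy comments, the balanced list keeps two of each.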

Note that even within the niche of sarcasm-based NLP, there are better, cleaner, and larger datasets than this Reddit dataset.  I've selected this one specifically because there are many teachable characteristics of the dataset.

Run the following commands to unzip the csv file.

In [0]:
!gzip -d ./data/reddit_train.csv.gz
!gzip -d ./data/reddit_test.csv.gz

gzip: ./data/reddit_train.csv.gz: No such file or directory
gzip: ./data/reddit_test.csv.gz: No such file or directory


The following cells have all of the necessary imports for this homework.  We see 2 new packages.


1.   **torchtext**.  Recall that torchvision built many image datasets, CV models, and image processing utilities into the PyTorch framework.  The torchtext package does the same for NLP -- it has many utilities for handling tokenization, variable-length sequences, and feature vectorization.
2.   **spaCy**.  This is an NLP library with a state-of-the-art English language tokenizer.  We will pass this tokenizer as an input to the torchtext equivalent of a DataLoader.





In [0]:
!pip install spacy
!python -m spacy download en
# !pip install bert-pytorch #ADDED BY MK
# !pip install pytorch-pretrained-bert #ADDED BY MK


    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')



In [0]:
import torch
import torchtext
import torchtext.data as data
from torchtext.vocab import Vectors
import spacy
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
import random
import time  # TODO: remove
from google.colab import drive
from helper import Logger
import gc
# from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM #ADDED BY MK
# from pytorch_pretrained_bert import BertConfig, BertForTokenClassification, BertAdam #ADDED BY MK

In [0]:
# reset all environment conditions

def reset_env():
    SEED = 1234
    random.seed(SEED)
    torch.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True

drive.mount('/content/gdrive')
device =  torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
cuda:0


## Q1.1 Review of the dataset

Let's first understand the structure of the dataset.

*   Print out the headers and the first five rows of reddit_train.csv.  Put this in your writeup.
*   As a review of part 1, describe each input and state whether it is fixed-length or variable-length in your writeup.
*   Print out the size of both datasets and put the lengths in your writeup.




In [0]:
# reddit_train_df = pd.read_csv('./data/reddit_train.csv')
# reddit_test_df = pd.read_csv('./data/reddit_test.csv')
# print(reddit_train_df.head())

# print(len(reddit_train_df.index))
# print(len(reddit_test_df.index))

## Q1.2 Featurizing the dataset with torchtext

Computer vision has torchvision; NLP has torchtext.  In this problem, we will create and featurize a torchtext dataset with Fields, a data structure that can automatically featurize text with embeddings.

First, create a tokenizer using spacy_en.  Create two torchtext data fields using data.Field from torchtext.data:

*   a sequential field named TEXT for comment and parent_comment.  (*Hint: since this is natural language, this is sequential data.  Use your tokenizer and convert all characters to lowercase.*)
*   a non-sequential field named LABEL for the labels.  (*Hint: since this is a categorical variable, it is not sequential.  Furthermore, it does not require a vocabulary since there are no words to embed.*)

Next, look at the documentation for data.TabularDataset.splits and create `train_ds` and `test_ds`.  You will need the paths to the training and test datasets, the format, and the mapping from columns to fields.  Note that the first column in the CSVs should not be a field.  This step creates 3 torchtext objects for each row in the dataset, so it will take some time (between 2 and 10 minutes).

We then build our vocabulary with GloVe, a word embedding similar to Word2Vec.  Create a vocabulary from TEXT using the train dataset (*Hint: look at the documentation for Field.build_vocab*).  Use the glove.6B.100d word embedding, and save the vocabulary.  The first time you run this, it will take roughly 5-10 minutes to download.  The pretrained model will then be stored in the ./.vector_cache folder, and rerunning this command will take negligible time.  Store the final vocabulary in the `vocab` variable.

Finally, print out a single example from `train_ds` and print out the properties of `vocab` ('freqs', 'itos', 'stoi', 'vectors').  Include these printouts in the writeup.  Explain what the properties of `vocab` are.  These are the only things you need to include in your writeup for Q1.2.
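As a rough mental model for three of the `vocab` properties, the sketch below rebuilds `freqs`, `itos`, and `stoi` by hand with a `Counter`. It is an assumption-laden stand-in, not torchtext's implementation: `simple_tokenize` and `build_vocab` are hypothetical helpers, a whitespace split replaces the spaCy tokenizer, and `vectors` (one embedding row per index) is left out.

```python
from collections import Counter


def simple_tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the spaCy tokenizer)."""
    return text.lower().split()


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Return (freqs, itos, stoi) for a list of tokenized comments."""
    # freqs: token -> number of occurrences in the training data
    freqs = Counter(tok for toks in token_lists for tok in toks)
    # most frequent tokens first, ties broken alphabetically
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    # itos: index -> token, with the special tokens in front
    itos = list(specials) + [tok for tok, _ in ordered]
    # stoi: token -> index, the inverse of itos
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return freqs, itos, stoi


comments = ["Oh great another meeting", "great idea"]
token_lists = [simple_tokenize(c) for c in comments]
freqs, itos, stoi = build_vocab(token_lists)

# unseen words fall back to the <unk> index
unk = stoi["<unk>"]
ids = [stoi.get(tok, unk) for tok in simple_tokenize("Great plan")]
print(itos)
print(ids)
```

Here "great" is the most frequent token, so it gets the first index after the special tokens, and the unseen word "plan" maps to `<unk>`.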

In [0]:
spacy.prefer_gpu()  # must be called before loading the model to take effect
spacy_en = spacy.load('en')

def tokenizer(text):  # create a tokenizer function
   return [tok.text for tok in spacy_en.tokenizer(text)]

# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased').tokenize         #ADDED BY MK
  
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True)
LABEL = data.Field(sequential=False, use_vocab=False)

train_ds, test_ds = data.TabularDataset.splits(
    path='./data/', train='reddit_train.csv',
    test='reddit_test.csv', format='csv', skip_header=True,
    fields=[('index', None), ('label', LABEL), ('comment', TEXT), ('parent_comment', TEXT)])

TEXT.build_vocab(train_ds,vectors="glove.6B.100d")
vocab = TEXT.vocab

print(vocab.__dict__.keys())
# vocab.freqs
# vocab.itos
# vocab.stoi
# vocab.vectors

dict_keys(['freqs', 'itos', 'stoi', 'vectors'])


## Utility cells for modeling

We've set up a few utility cells here.  Use these as you see fit.

1.   Cell to set up a logger and Tensorboard.
2.   Cell to keep track of hyperparameters.
3.   Cell with functions for training and testing loops.



In [0]:
### Tensorboard setup
LOG_DIR = './logs'
get_ipython().system_raw(
    'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'
    .format(LOG_DIR)
)

!if [ -f ngrok ] ; then echo "Ngrok already installed" ; else wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip > /dev/null 2>&1 && unzip ngrok-stable-linux-amd64.zip > /dev/null 2>&1 ; fi

# !./ngrok authtoken 7rHP2EU3WwBpFXGh7cH3z_6VS67KiDzaRyVAKTLt8St
    
get_ipython().system_raw('./ngrok http 6006 &')

! curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print('Tensorboard Link: ' +str(json.load(sys.stdin)['tunnels'][0]['public_url']))"

logger = Logger('./logs')

Ngrok already installed
Traceback (most recent call last):
  File "<string>", line 1, in <module>
IndexError: list index out of range


In [0]:
### hyperparameters
# overall
all_models_hyperparameters = {'embedding_dim': 100,                             
                              'output_dim': 1,
                              'vocabulary_size': len(TEXT.vocab),
                              'train_batch_size': 40,
                              'test_batch_size': 40}

In [0]:
def process_example(m, ex, variable_length):
    lab, child, parent = ex.label.to(device), ex.comment.to(device), ex.parent_comment.to(device)
    if variable_length:
        lengths = [list(c.size())[0] for c in child.permute(1, 0)]
        out = torch.squeeze(m(child, parent, torch.LongTensor(lengths).cpu()), 1)
    else:
        out = torch.squeeze(m(child, parent), 1)
    return lab, out


def train_model(model_name, model, optimizer, loss_criterion, num_epochs, variable_length=False, parents=True):
    tick = time.time()
    # make sure model and loss are on CUDA
    model = model.to(device)
    loss_criterion = loss_criterion.to(device)

    logger = Logger('./logs/' + model_name + str(time.time()))
    batch_num = 0
    max_accuracy = 0
    for epoch in range(num_epochs):
        print("starting epoch ", epoch, ", ", time.time() - tick)
        for example in train_iter:
            batch_num += 1
            label, output = process_example(model, example, variable_length)
            
            optimizer.zero_grad()
            loss = loss_criterion(output.float(), label.float())
            loss.backward()
            optimizer.step()

            ## Tensorboard stuff
            
            # computing train accuracy
            predicted = torch.round(output.data)
            total = label.size(0)
            correct = (predicted.float() == label.to(device).float()).sum().item()
            accuracy = correct / total
            info = {'loss': loss.item(), 'accuracy': accuracy}  # log a Python float, not a tensor
            
            # computing test accuracy
            if batch_num % 20000 == 0:
                test_total = 0
                test_correct = 0
                model.eval()  # disable dropout while measuring test accuracy
                with torch.no_grad():
                    for test_example in test_iter:
                        test_label, test_output = process_example(model, test_example, variable_length)
                        test_predicted = torch.round(test_output.data)
                        test_total += test_label.size(0)
                        test_correct += (test_predicted.float() == test_label.to(device).float()).sum().item()
                        break  # only evaluate one test batch
                model.train()  # restore training mode
                test_accuracy = test_correct / test_total
                info['test_accuracy'] = test_accuracy
                # checkpoint only when the test accuracy improves on the best seen so far
                if test_accuracy > max_accuracy:
                    max_accuracy = test_accuracy
                    torch.save(model.state_dict(), "/content/gdrive/My Drive/hw-3-models/" + model_name + "-" + str(batch_num))

            for tag, value in info.items():
                logger.scalar_summary(tag, value, batch_num + 1)        
    return model

# def test_model_accuracy(model, variable_length=False):
#     model = model.to(device)
#     confusion_mtx = np.zeros((2, 2))
#     test_total = 0
#     test_correct = 0
#     with torch.no_grad():
#         for test_example in test_iter:
#             test_label, test_output = process_example(model, test_example, variable_length)
#             test_predicted = torch.round(test_output.data)
#             test_total += test_label.size(0)
#             test_correct += (test_predicted.float() == test_label.to(device).float()).sum().item()
# #             print(test_correct / test_total)
#     test_accuracy = test_correct / test_total
#     return test_accuracy

def test_model_confusion_matrix(model, variable_length=False):
    model = model.to(device)
    model.eval()  # disable dropout while evaluating
    confusion_mtx = np.zeros((2, 2))
    with torch.no_grad():
        for test_example in test_iter:
            test_label = test_example.label.to(device)
            test_child = test_example.comment.to(device)
            test_parent = test_example.parent_comment.to(device)
            
            if test_child.size()[0] > 4:  # ensures we're looking at sufficiently large comments
                if variable_length:
                    lengths = [list(c.size())[0] for c in test_child.permute(1, 0)]
                    # keep the lengths on the CPU, matching process_example
                    test_output = torch.squeeze(model(test_child, test_parent, torch.LongTensor(lengths).cpu()), 1)
                else:
                    test_output = torch.squeeze(model(test_child, test_parent), 1)
                test_predicted = torch.round(test_output.data)
                # labels=[0, 1] keeps the matrix 2x2 even when a batch contains a single class
                current_cm = confusion_matrix(test_label.cpu().numpy(), test_predicted.cpu().numpy(), labels=[0, 1])
                confusion_mtx += current_cm
    tn, fp, fn, tp = confusion_mtx.ravel()
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    print("Accuracy: ", accuracy)
    print("TPR / Recall / Sensitivity: ", recall)
    print("Precision: ", precision)
    print("F1: ", 2 * (precision * recall) / (precision + recall))
    return tn, fp, fn, tp
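The metrics printed by `test_model_confusion_matrix` all follow from the four confusion-matrix counts. The sketch below works through one set of made-up counts. `binary_metrics` is a hypothetical helper, and the zero-denominator guards are an addition that the function above does not have.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # guard against empty denominators (e.g. the model never predicts sarcasm)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# made-up counts for illustration only, not results from any model in this notebook
m = binary_metrics(tn=50, fp=10, fn=20, tp=20)
print(m)
```

For these counts, accuracy is 70/100, recall is 20/40, precision is 20/30, and F1 is their harmonic mean, 4/7.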

In [0]:
train_iter, test_iter = data.Iterator.splits(
    (train_ds, test_ds), sort_key=lambda x: len(x.comment), shuffle=True,
    batch_sizes=(all_models_hyperparameters['train_batch_size'], all_models_hyperparameters['test_batch_size']), device=device)
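Because comments have different lengths, each batch has to be padded to a common length before it can be stored in one tensor. The sketch below shows the idea with plain lists. `pad_batch` is a hypothetical helper, `pad_idx=1` is an assumed padding index, and the result is laid out batch-first for readability, whereas the iterators above yield tensors of shape [sequence length, batch size].

```python
def pad_batch(sequences, pad_idx=1):
    """Sort by length (longest first) and pad every sequence to the longest one.

    Returns the padded batch and the true (unpadded) lengths, which is the
    information a packed-sequence RNN needs in order to skip the padding.
    """
    ordered = sorted(sequences, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    max_len = lengths[0] if lengths else 0
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths


# three token-id sequences of different lengths (made-up ids)
batch = [[5, 8], [7, 2, 9, 4], [3]]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)
```

Sorting comments of similar length together, which is what `sort_key` is for, keeps the amount of padding per batch small.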

# Part 2.  Modeling sarcasm without context

## Question 2.1 Logistic Regression Model

In [0]:
# raise NotImplementedError
# class LogisticRegression(nn.Module):
#     def __init__(self, embedding_dim, output_dim):
#         super().__init__()

#         self.embed = nn.Embedding(len(vocab), embedding_dim)                     # we used GloVe 100-dim   => (MK) Bert 768-dim
#         self.embed.weight.data.copy_(vocab.vectors)
#         self.fc = nn.Linear(embedding_dim, output_dim)

#     def forward(self, text, parent=None):
#         text = text.permute(1, 0)
#         embedded = self.embed(text)
#         embedded = embedded.permute(0, 2, 1)
#         avg_embedded = torch.mean(embedded, dim=2)
#         return torch.sigmoid(self.fc(avg_embedded))
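The forward pass of this model is small enough to trace by hand: average the word embeddings of a comment, apply one linear layer, and squash the result with a sigmoid. The sketch below does that with plain lists. The 2-dimensional embedding table, weights, and bias are made-up values, and `mean_embedding` and `predict_proba` are hypothetical helpers rather than part of the assignment code.

```python
import math


def mean_embedding(token_ids, embedding_table):
    """Average the embedding vectors of all tokens in one comment."""
    dim = len(embedding_table[0])
    sums = [0.0] * dim
    for idx in token_ids:
        for d in range(dim):
            sums[d] += embedding_table[idx][d]
    return [s / len(token_ids) for s in sums]


def predict_proba(token_ids, embedding_table, weights, bias):
    """Linear layer on the averaged embedding, followed by a sigmoid."""
    avg = mean_embedding(token_ids, embedding_table)
    logit = sum(w * x for w, x in zip(weights, avg)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# toy table: row i is the embedding of token id i
embedding_table = [[0.0, 0.0], [1.0, -1.0], [3.0, 1.0]]
avg = mean_embedding([1, 2], embedding_table)
p = predict_proba([1, 2], embedding_table, weights=[0.5, 2.0], bias=-1.0)
print(avg, p)
```

The averaged embedding is [2.0, 0.0], so the logit is 0.5 * 2.0 + 2.0 * 0.0 - 1.0 = 0 and the predicted probability is 0.5. Averaging discards word order, which is one reason the CNN and LSTM models come next.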

In [0]:
# reset_env()
# # base lr
# # attempt 0: 0.001 // peaks at 7 epochs
# base_lr_hyperparameters = {'learning_rate': 0.01,
#                            'num_epochs': 10} # 80k examples}

# logistic_reg = LogisticRegression(all_models_hyperparameters['embedding_dim'], 
#                                   all_models_hyperparameters['output_dim'])
# optimizer = optim.Adam(logistic_reg.parameters(), lr=base_lr_hyperparameters['learning_rate'])
# criterion = nn.BCELoss().to(device)  # the model already applies sigmoid, so BCEWithLogitsLoss would apply it twice

# logistic_reg = train_model('Logistic-Regression-attempt-0-', logistic_reg, optimizer, criterion, base_lr_hyperparameters['num_epochs'])

In [0]:
# # logistic_reg = LogisticRegression(all_models_hyperparameters['embedding_dim'], 
# #                                   all_models_hyperparameters['output_dim'])
# # logistic_reg.load_state_dict(torch.load("/content/gdrive/My Drive/hw-3-models/Logistic-Regression-attempt-0--100000"))

# test_model_accuracy(logistic_reg)

In [0]:
# test_model_confusion_matrix(logistic_reg)

## Question 2.2 CNN Model

In [0]:
# class CNN1d(nn.Module):
#     def __init__(self, embedding_dim, num_features, filter_sizes, output_dim, dropout):
#         super().__init__()

#         self.embed = nn.Embedding(len(vocab), embedding_dim)  # we used GloVe 100-dim
#         self.embed.weight.data.copy_(vocab.vectors)

#         self.convs = nn.ModuleList([
#             nn.Conv1d(in_channels=embedding_dim,
#                       out_channels=num_features,
#                       kernel_size=fs)
#             for fs in filter_sizes
#         ])

#         self.fc = nn.Linear(len(filter_sizes) * num_features, output_dim)

#         self.dropout = nn.Dropout(dropout)

#     def forward(self, text, parent=None):
#         text = text.permute(1, 0)
#         embedded = self.embed(text)
#         embedded = embedded.permute(0, 2, 1)
#         conved = [F.relu(conv(embedded)) for conv in self.convs]
#         pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
#         cat = self.dropout(torch.cat(pooled, dim=1))
#         return torch.sigmoid(self.fc(cat))
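Each convolutional filter slides over windows of consecutive word embeddings, and max-over-time pooling keeps only its strongest response, so every filter contributes exactly one feature regardless of comment length. The sketch below traces this with plain lists. `conv1d_relu_maxpool` is a hypothetical helper, the embeddings and kernels are made-up values, and it assumes the sequence is at least as long as the kernel.

```python
def conv1d_relu_maxpool(sequence, kernel, bias=0.0):
    """Apply one filter over a sequence, then ReLU, then max-over-time pooling.

    sequence: list of embedding vectors, shape [seq_len][dim]
    kernel:   filter weights, shape [kernel_size][dim]
    Assumes len(sequence) >= len(kernel).
    """
    k = len(kernel)
    activations = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias + sum(
            w * x
            for kernel_row, word_vec in zip(kernel, window)
            for w, x in zip(kernel_row, word_vec)
        )
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # strongest response anywhere in the comment


# four words with 2-dimensional toy embeddings
sequence = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.0, 3.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],              # a size-2 filter
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],  # a size-3 filter
]
features = [conv1d_relu_maxpool(sequence, k) for k in kernels]
print(features)
```

The pooled outputs of all filters are concatenated into the feature vector that feeds the final linear layer, which is why its input size is `len(filter_sizes) * num_features`.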

In [0]:
# # base CNN
# # attempt 0: 60 x [3,4,5] x 0.001, dropout=0.5 (good) 0.7981794259015601
# # attempt 1: 80 x [3,4,5] x 0.0005, dropout=0.2 (bad)
# # attempt 2: 60 x [3,4,5] x 0.001, dropout=0.2 (bad)
# # attempt 3: 80 x [2,3,4,5,6] x  0.0005 x dropout=0.5 (bad)
# # attempt 4: 80 x [1,2,3,4,5] x 0.0005 x 0.5 (good) 0.8047 train; 0.7211 test
# # attempt 5: 80 x [1,2,3,4,5] x 0.0001 x 0.5 dropout x 30 epochs (good) 0.83 train; 0.7022 test
# # attempt 6: 40 x [1,2,3,4,5] x 0.0001 x 0.6 dropout x 15 epochs (good) 0.79 train; 0.707 test
# # attempt 7: 200 x [1, 3, 5] x 0.0005 x 0.5 dropout x 15 epochs (bad)
# # attempt 8: 80 x [1, 1, 2, 2, 3, 3, 4, 5, 8, 10] x 0.0005 x 0.5 dropout x 15 epochs (bad)
# # attempt 9: 200 x [1, 2, 3, 4, 5, 8, 10] x 0.0005 x 0.5 dropout x 15 epochs (bad)
# # attempt 10: 200 x [1, 2, 3] x 0.0001 x 0.5 dropout x 15 epochs (bad)
# # attempt 11: 80 x [1, 2, 3] x 0.0001 x 0.5 dropout x 15 epochs (bad)
# # attempt 12: 20 x [3,4,5] x 0.001 x 0.5 dropout x 10 epochs
# # attempt 13: 40 x [3,4,5] x 0.001 x 0.5 dropout x 20 epochs
# # attempt 14: 20 x [3,4,5] x 0.001 x 0.5 dropout x 20 epochs
# # attempt 14: 20 x [3,4,5] x 0.001 x 0.5 dropout x 25 epochs x bs 40
# base_cnn_hyperparameters = {'num_features': 20,
#                             'filter_sizes': [3, 4, 5], #[2-6] didn't work
#                             'learning_rate': 0.001,
#                             'num_epochs': 25,
#                             'dropout': 0.2}
# cnn = CNN1d(all_models_hyperparameters['embedding_dim'],
#             base_cnn_hyperparameters['num_features'],
#             base_cnn_hyperparameters['filter_sizes'],
#             all_models_hyperparameters['output_dim'],
#             base_cnn_hyperparameters['dropout']).to(device)

# optimizer = optim.Adam(cnn.parameters(), lr=base_cnn_hyperparameters['learning_rate'])

# criterion = nn.BCELoss().to(device)  # the model already applies sigmoid, so BCEWithLogitsLoss would apply it twice

# cnn = train_model('CNN-attempt-14-', cnn, optimizer, criterion, base_cnn_hyperparameters['num_epochs'])

In [0]:
# # cnn = CNN1d(all_models_hyperparameters['embedding_dim'],
# #             base_cnn_hyperparameters['num_features'],
# #             base_cnn_hyperparameters['filter_sizes'],
# #             all_models_hyperparameters['output_dim'],
# #             all_models_hyperparameters['dropout']).to(device)
# # cnn.load_state_dict(torch.load("/content/gdrive/My Drive/hw-3-models/CNN-attempt-12--240000"))

# test_model_confusion_matrix(cnn)

## Question 2.3 LSTM Model

In [0]:
# class RNN(nn.Module):
#     def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
#                  bidirectional, dropout, pad_idx):
        
#         super().__init__()        
#         self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
#         self.rnn = nn.LSTM(embedding_dim, 
#                            hidden_dim, 
#                            num_layers=n_layers, 
#                            bidirectional=bidirectional, 
#                            dropout=dropout)
#         self.fc = nn.Linear(hidden_dim * 2, output_dim)
#         self.dropout = nn.Dropout(dropout)
        
#     def forward(self, text, parent, text_lengths):
        
#         embedded = self.dropout(self.embedding(text))
        
#         #pack sequence
#         packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
#         packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
#         #unpack sequence
#         output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
#         hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                
#         #hidden = [batch size, hid dim * num directions]
            
#         return torch.sigmoid(self.fc(hidden.squeeze(0)))

In [0]:
# # base RNN
# # attempt 0: 15 epochs, 256 hiddens, 2 layers, dropout = 0.2, lr = 0.0005
# # attempt 1: 15 epochs, 128 hiddens, 1 layers, dropout = 0.5, lr = 0.0001 (not quite as good)
# # attempt 2: 15 epochs, 256 hiddens, 1 layers, dropout = 0.2, lr = 0.001
# # attempt 3: 10 epochs, 50 hiddens, 1 layers, dropout = 0.5, lr = 0.001
# # attempt 4: 20 epochs, 100 hiddens, 1 layers, dropout = 0.5, lr = 0.001
# base_rnn_hyperparameters = {'hidden_size': 100,
#                             'number_of_layers': 1,
#                             'bidirectional': True,
#                             'dropout': 0.2,
#                             'pad_idx': TEXT.vocab.stoi[TEXT.pad_token],
#                             'learning_rate': 0.001,
#                             'num_epochs': 20}
# rnn = RNN(all_models_hyperparameters['vocabulary_size'],
#           all_models_hyperparameters['embedding_dim'], 
#           base_rnn_hyperparameters['hidden_size'],
#           all_models_hyperparameters['output_dim'],
#           base_rnn_hyperparameters['number_of_layers'],
#           base_rnn_hyperparameters['bidirectional'],
#           base_rnn_hyperparameters['dropout'],  # all_models_hyperparameters has no 'dropout' key at this point
#           base_rnn_hyperparameters['pad_idx'])

# optimizer = optim.Adam(rnn.parameters(), lr=base_rnn_hyperparameters['learning_rate'])

# criterion = nn.BCELoss().to(device)  # the model already applies sigmoid, so BCEWithLogitsLoss would apply it twice

# rnn = train_model('RNN-attempt-0-', rnn, optimizer, criterion, base_rnn_hyperparameters['num_epochs'], variable_length=True)

In [0]:
# test_model_confusion_matrix(rnn, variable_length=True)

# Part 3.  Modeling sarcasm by concatenating context

## Utility cells for modeling

In [0]:
# ### hyperparameters
# # overall
# all_models_hyperparameters = {'embedding_dim': 100,
#                               'output_dim': 1,
#                               'dropout': 0.5,
#                               'vocabulary_size': len(TEXT.vocab),
#                               'number_of_epochs': 10,
#                               'batch_size': 50}

# # base lr
# # attempt 0: 0.001
# two_utt_lr_hyperparameters = {'learning_rate': 0.001}

# # base CNN
# # attempt 0: 60 x [3,4,5] x 0.001, dropout=0.5 (good) 0.7981794259015601
# # attempt 1: 80 x [3,4,5] x 0.0005, dropout=0.2 (bad)
# # attempt 2: 60 x [3,4,5] x 0.001, dropout=0.2 (bad)
# # attempt 3: 80 x [2,3,4,5,6] x  0.0005 x dropout=0.5 (bad)
# # attempt 4: 80 x [1,2,3,4,5] x 0.0005 x 0.5 (good) 0.8047 train; 0.7211 test
# # attempt 5: 80 x [1,2,3,4,5] x 0.0001 x 0.5 dropout x 30 epochs (good) 0.83 train; 0.7022 test
# # attempt 6: 40 x [1,2,3,4,5] x 0.0001 x 0.6 dropout x 15 epochs (good) 0.79 train; 0.707 test
# # attempt 7: 200 x [1, 3, 5] x 0.0005 x 0.5 dropout x 15 epochs (bad)
# # attempt 8: 80 x [1, 1, 2, 2, 3, 3, 4, 5, 8, 10] x 0.0005 x 0.5 dropout x 15 epochs (bad)
# # attempt 9: 200 x [1, 2, 3, 4, 5, 8, 10] x 0.0005 x 0.5 dropout x 15 epochs (bad)
# # attempt 10: 200 x [1, 2, 3] x 0.0001 x 0.5 dropout x 15 epochs (bad)
# # attempt 11: 80 x [1, 2, 3] x 0.0001 x 0.5 dropout x 15 epochs (bad)
# two_utt_cnn_hyperparameters = {'num_features': 200,
#                             'filter_sizes': [1, 2, 3], #[2-6] didn't work
#                             'learning_rate': 0.0002}

# # base RNN
# # attempt 0: 15 epochs, 256 hiddens, 2 layers, dropout = 0.2, lr = 0.0005
# # attempt 1: 15 epochs, 128 hiddens, 1 layers, dropout = 0.5, lr = 0.0001 (not quite as good)
# # # attempt 2: 15 epochs, 256 hiddens, 1 layers, dropout = 0.2, lr = 0.001
# two_utt_rnn_hyperparameters = {'hidden_size': 256,
#                             'number_of_layers': 1,
#                             'bidirectional': True,
#                             'dropout': 0.2,
#                             'pad_idx': TEXT.vocab.stoi[TEXT.pad_token],
#                             'learning_rate': 0.0001}



## Question 3.1 Logistic Regression Model

In [0]:
# class ConcatenatedLogisticRegression(nn.Module):
#     def __init__(self, embedding_dim, output_dim):
#         super().__init__()

#         self.embed = nn.Embedding(len(vocab), embedding_dim)  # we used GloVe 100-dim
#         self.embed.weight.data.copy_(vocab.vectors)
#         self.fc = nn.Linear(embedding_dim, output_dim)

#     def forward(self, text, parent=None):
#         text = torch.cat((parent, text))
#         text = text.permute(1, 0)
#         embedded = self.embed(text)
#         embedded = embedded.permute(0, 2, 1)
#         avg_embedded = torch.mean(embedded, dim=2)
#         return torch.sigmoid(self.fc(avg_embedded))

In [0]:
# tick = time.time()
# logistic_reg = ConcatenatedLogisticRegression(all_models_hyperparameters['embedding_dim'], 
#                                   all_models_hyperparameters['output_dim'])

# optimizer = optim.Adam(logistic_reg.parameters(), lr=two_utt_lr_hyperparameters['learning_rate'])

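# criterion = nn.BCELoss().to(device)  # the model already applies sigmoid, so BCEWithLogitsLoss would apply it twice

# logistic_reg = train_model('cu-Logistic-Regression-attempt-0-', logistic_reg, optimizer, criterion, all_models_hyperparameters['number_of_epochs'])  # train_model requires num_epochs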
# print(time.time() - tick)

In [0]:
# logistic_reg = ConcatenatedLogisticRegression(all_models_hyperparameters['embedding_dim'], 
#                                   all_models_hyperparameters['output_dim'])
# logistic_reg.load_state_dict(torch.load("/content/gdrive/My Drive/cu-Logistic-Regression-attempt-0-"))
# # print("done!")

# test_model_confusion_matrix(logistic_reg, variable_length=False)

## Question 3.2 CNN Model

In [0]:
# class ConcatenatedCNN1d(nn.Module):
#     def __init__(self, embedding_dim, num_features, filter_sizes, output_dim, dropout):
#         super().__init__()

#         self.embed = nn.Embedding(len(vocab), embedding_dim)  # we used GloVe 100-dim
#         self.embed.weight.data.copy_(vocab.vectors)

#         self.convs = nn.ModuleList([
#             nn.Conv1d(in_channels=embedding_dim,
#                       out_channels=num_features,
#                       kernel_size=fs)
#             for fs in filter_sizes
#         ])

#         self.fc = nn.Linear(len(filter_sizes) * num_features, output_dim)

#         self.dropout = nn.Dropout(dropout)

#     def forward(self, text, parent=None):
#         text = text.permute(1, 0)
#         parent = parent.permute(1, 0)
#         text = torch.cat((parent, text), dim=1)
#         embedded = self.embed(text)
#         embedded = embedded.permute(0, 2, 1)
#         conved = [F.relu(conv(embedded)) for conv in self.convs]
#         pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
#         cat = self.dropout(torch.cat(pooled, dim=1))
#         return torch.sigmoid(self.fc(cat))

In [0]:
# tick = time.time()
# concat_cnn = ConcatenatedCNN1d(all_models_hyperparameters['embedding_dim'],
#             two_utt_cnn_hyperparameters['num_features'],
#             two_utt_cnn_hyperparameters['filter_sizes'],
#             all_models_hyperparameters['output_dim'],
#             all_models_hyperparameters['dropout']).to(device)

# optimizer = optim.Adam(concat_cnn.parameters(), lr=two_utt_cnn_hyperparameters['learning_rate'])

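# criterion = nn.BCELoss().to(device)  # the model already applies sigmoid, so BCEWithLogitsLoss would apply it twice

# concat_cnn = train_model('concat-CNN-attempt-0-', concat_cnn, optimizer, criterion, all_models_hyperparameters['number_of_epochs'])  # train_model requires num_epochs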
# print(time.time() - tick)

In [0]:
# concat_cnn = ConcatenatedCNN1d(all_models_hyperparameters['embedding_dim'],
#             two_utt_cnn_hyperparameters['num_features'],  # must match the model that was trained and saved
#             two_utt_cnn_hyperparameters['filter_sizes'],
#             all_models_hyperparameters['output_dim'],
#             all_models_hyperparameters['dropout']).to(device)
# concat_cnn.load_state_dict(torch.load("/content/gdrive/My Drive/concat-CNN-attempt-0-"))
# # # print("done!")

# test_model_confusion_matrix(concat_cnn, variable_length=False)

## Question 3.3 LSTM Model

In [0]:
# class ConcatenatedRNN(nn.Module):
#     def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
#                  bidirectional, dropout, pad_idx):
        
#         super().__init__()        
#         self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
#         self.rnn = nn.LSTM(embedding_dim, 
#                            hidden_dim, 
#                            num_layers=n_layers, 
#                            bidirectional=bidirectional, 
#                            dropout=dropout)
#         self.fc = nn.Linear(hidden_dim * 2, output_dim)
#         self.dropout = nn.Dropout(dropout)
        
#     def forward(self, text, parent, text_lengths):
        
#         text = torch.cat((parent, text), dim=0)
#         embedded = self.dropout(self.embedding(text))
        
#         #pack sequence
#         packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
#         packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
#         #unpack sequence
#         output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
#         hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                
#         #hidden = [batch size, hid dim * num directions]
            
#         return torch.sigmoid(self.fc(hidden.squeeze(0)))

In [0]:
# tick = time.time()
# concat_rnn = ConcatenatedRNN(all_models_hyperparameters['vocabulary_size'],
#           all_models_hyperparameters['embedding_dim'], 
#           two_utt_rnn_hyperparameters['hidden_size'],
#           all_models_hyperparameters['output_dim'],
#           two_utt_rnn_hyperparameters['number_of_layers'],
#           two_utt_rnn_hyperparameters['bidirectional'],
#           all_models_hyperparameters['dropout'],
#           two_utt_rnn_hyperparameters['pad_idx'])

# optimizer = optim.Adam(concat_rnn.parameters(), lr=two_utt_rnn_hyperparameters['learning_rate'])

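# criterion = nn.BCELoss().to(device)  # the model already applies sigmoid, so BCEWithLogitsLoss would apply it twice

# concat_rnn = train_model('concat-RNN-attempt-0-', concat_rnn, optimizer, criterion, all_models_hyperparameters['number_of_epochs'], variable_length=True)  # train_model requires num_epochs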
# print(time.time() - tick)

In [0]:
# test_model_confusion_matrix(concat_rnn, variable_length=True)

# Part 4.  Modeling sarcasm by separating context

## Utility cells for modeling

In [0]:
### hyperparameters
# overall
all_models_hyperparameters = {'embedding_dim': 100,
                              'output_dim': 1,
                              'dropout': 0.5,
                              'vocabulary_size': len(TEXT.vocab),
                              'number_of_epochs': 20,
                              'batch_size': 50}

# base lr
# attempt 0: 0.001
separate_lr_hyperparameters = {'learning_rate': 0.001}

# base CNN
# attempt 0: 200 x [1,2,3,4,5] x 0.0005 (bad)
# attempt 1: 200 x [1,2,3,4,5] x 0.0001 (bad)
# attempt 2: 50 x [3,4,5] x 0.0001 (good)
# attempt 2.5: 80 x [3,4,5] x 0.00001 (too slow)
# attempt 3: 80 x [3,4,5] x 0.0005 (good)
# attempt 3: 80 x [3,4,5] x 0.0005 x 20 epochs (good)
# jk all these are wrong

# attempt 5: 20 x [2,3,4] x 0.0005 x 10 epochs (good)
# attempt 6: 
# separate_cnn_hyperparameters = {'num_features': 20,
#                             'filter_sizes': [2, 3, 4], #[2-6] didn't work
#                             'learning_rate': 0.0005}

# base RNN
# attempt 0: 15 epochs, 256 hiddens, 2 layers, dropout = 0.2, lr = 0.0005
# attempt 1: 15 epochs, 128 hiddens, 1 layers, dropout = 0.5, lr = 0.0001 (not quite as good)
# attempt 2: 15 epochs, 256 hiddens, 1 layers, dropout = 0.2, lr = 0.001
separate_rnn_hyperparameters = {'hidden_size': 256,
                            'number_of_layers': 1,
                            'bidirectional': True,
                            'dropout': 0.2,
                            'pad_idx': TEXT.vocab.stoi[TEXT.pad_token],
                            'learning_rate': 0.0001}



## Question 4.1 Logistic Regression Model

In [0]:
# class SeparateLogisticRegression(nn.Module):
#     def __init__(self, embedding_dim, output_dim):
#         super().__init__()

#         self.embed = nn.Embedding(len(vocab), embedding_dim)  # we used GloVe 100-dim
#         self.embed.weight.data.copy_(vocab.vectors)
#         self.fc = nn.Linear(2 * embedding_dim, output_dim)

#     def forward(self, comment, parent_comment):
#         def mean_embed(text):
#             if len(text.size()) == 1:
#                 text = torch.unsqueeze(text, dim=0)
#             text = text.permute(1, 0)
#             embedded = self.embed(text)
#             embedded = embedded.permute(0, 2, 1)
#             return torch.mean(embedded, dim=2)
#         avg_embedded = torch.cat((mean_embed(comment), mean_embed(parent_comment)), dim=1)
#         return torch.sigmoid(self.fc(avg_embedded))
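The idea behind the two-utterance logistic regression is to average the word embeddings of each utterance, concatenate the two averages, and apply one linear layer. The sketch below is a minimal, self-contained illustration under assumptions: `MeanPoolLogReg`, the vocabulary size, and the random inputs are hypothetical, and it uses randomly initialised embeddings rather than GloVe. It returns raw logits, which is what `nn.BCEWithLogitsLoss` expects.

```python
import torch
import torch.nn as nn

class MeanPoolLogReg(nn.Module):
    """Hypothetical stand-in: mean-pooled embeddings of two utterances -> one linear layer."""
    def __init__(self, vocab_size, embedding_dim, output_dim=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(2 * embedding_dim, output_dim)

    def forward(self, comment, parent_comment):
        # Inputs are [seq_len, batch]; average over the sequence dimension.
        # Note: padding tokens are included in the average, as in the class above.
        child_vec = self.embed(comment).mean(dim=0)          # [batch, emb]
        parent_vec = self.embed(parent_comment).mean(dim=0)  # [batch, emb]
        return self.fc(torch.cat((child_vec, parent_vec), dim=1))  # raw logits

torch.manual_seed(0)
model = MeanPoolLogReg(vocab_size=50, embedding_dim=8)
comment = torch.randint(0, 50, (7, 4))    # 7 tokens, batch of 4
parent = torch.randint(0, 50, (11, 4))    # 11 tokens, batch of 4
logits = model(comment, parent)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), labels)
print(logits.shape, loss.item())
```

The comment and its parent may have different padded lengths, because each is pooled to a fixed-size vector before concatenation.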

## Question 4.2 CNN Model

In [0]:
# class SeparateCNN1d(nn.Module):
#     def __init__(self, embedding_dim, num_features, filter_sizes, output_dim, dropout):
#         super().__init__()

#         self.embed = nn.Embedding(len(vocab), embedding_dim)  # we used GloVe 100-dim
#         self.embed.weight.data.copy_(vocab.vectors)

#         self.conv_stack_1 = nn.ModuleList([
#             nn.Conv1d(in_channels=embedding_dim,
#                       out_channels=num_features,
#                       kernel_size=fs)
#             for fs in filter_sizes
#         ])
#         self.conv_stack_2 = nn.ModuleList([
#             nn.Conv1d(in_channels=embedding_dim,
#                       out_channels=num_features,
#                       kernel_size=fs)
#             for fs in filter_sizes
#         ])

#         self.fc = nn.Linear(2 * len(filter_sizes) * num_features, output_dim)

#         self.dropout = nn.Dropout(dropout)

#     def forward(self, text, parent=None):
#         def process_utterance(text, stack):
#             text = text.permute(1, 0)
#             embedded = self.embed(text)
#             embedded = embedded.permute(0, 2, 1)
#             conved = [F.relu(conv(embedded)) for conv in stack]
#             pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
#             cat = self.dropout(torch.cat(pooled, dim=1))
#             return cat
#         features_from_both = torch.cat((process_utterance(parent, self.conv_stack_1),
#                                         process_utterance(text, self.conv_stack_2)),
#                                        dim=1)
#         # Return raw logits: nn.BCEWithLogitsLoss (used below) applies the sigmoid itself
#         return self.fc(features_from_both)
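To make the tensor shapes in `process_utterance` concrete, here is a small sketch of one convolution stack followed by max-over-time pooling. The batch size and sequence length are arbitrary assumptions; the filter sizes and feature count mirror attempt 5 in the hyperparameter notes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, emb_dim, num_features = 4, 12, 100, 20
filter_sizes = [2, 3, 4]

# nn.Conv1d expects [batch, channels, length], hence the permute in the model.
embedded = torch.randn(batch, emb_dim, seq_len)
convs = nn.ModuleList(nn.Conv1d(emb_dim, num_features, fs) for fs in filter_sizes)

# Each filter size shortens the sequence to seq_len - fs + 1 positions.
conved = [F.relu(conv(embedded)) for conv in convs]
# Max-over-time pooling keeps one value per feature map, whatever the length.
pooled = [F.max_pool1d(c, c.shape[2]).squeeze(2) for c in conved]
features = torch.cat(pooled, dim=1)

print([tuple(c.shape) for c in conved])
print(features.shape)
```

One utterance yields `len(filter_sizes) * num_features` features, so concatenating the parent and child stacks gives the `2 * len(filter_sizes) * num_features` inputs that the final linear layer expects.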

In [0]:
# tick = time.time()
# separate_cnn = SeparateCNN1d(all_models_hyperparameters['embedding_dim'],
#                             separate_cnn_hyperparameters['num_features'],
#                             separate_cnn_hyperparameters['filter_sizes'],
#                             all_models_hyperparameters['output_dim'],
#                             all_models_hyperparameters['dropout']).to(device)

# optimizer = optim.Adam(separate_cnn.parameters(), lr=separate_cnn_hyperparameters['learning_rate'])

# criterion = nn.BCEWithLogitsLoss().to(device)

# separate_cnn = train_model('separate-CNN-attempt-5-', separate_cnn, optimizer, criterion)
# print(time.time() - tick)

In [0]:
# test_model_confusion_matrix(separate_cnn, variable_length=False)
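`nn.BCEWithLogitsLoss` applies the sigmoid internally, so a model trained with it should return raw logits. The sketch below, using made-up logit values, shows what happens if a sigmoid is applied twice: the result is confined to a narrow band above 0.5, so rounding would always predict the positive class.

```python
import torch

logits = torch.tensor([-10.0, 0.0, 10.0])   # illustrative values only
once = torch.sigmoid(logits)                # spans (0, 1) as intended
twice = torch.sigmoid(once)                 # squashed into (0.5, sigmoid(1))

print(once)
print(twice)
print(torch.round(twice))
```

Since `sigmoid(1)` is about 0.731, a doubly squashed output can never express a confident negative prediction.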

## Question 4.3 RNN Model

In [0]:
# class SeparateRNN(nn.Module):
#     def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
#                  bidirectional, dropout, pad_idx):
        
#         super().__init__()        
#         self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
#         self.child_rnn = nn.LSTM(embedding_dim, 
#                            hidden_dim, 
#                            num_layers=n_layers, 
#                            bidirectional=bidirectional, 
#                            dropout=dropout)
#         self.parent_rnn = nn.LSTM(embedding_dim, 
#                            hidden_dim, 
#                            num_layers=n_layers, 
#                            bidirectional=bidirectional, 
#                            dropout=dropout)
#         self.fc = nn.Linear(hidden_dim * 4, output_dim)
#         self.dropout = nn.Dropout(dropout)
        
#     def forward(self, text, parent, text_lengths):
#         def process_utterance(text, rnn):
#             embedded = self.dropout(self.embedding(text))

#             #pack sequence
#             packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
#             packed_output, (hidden, cell) = rnn(packed_embedded)

#             #unpack sequence
#             output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
#             hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
#             return hidden
#         features_from_both = torch.cat((process_utterance(parent, self.parent_rnn),
#                                         process_utterance(text, self.child_rnn)),
#                                        dim=1)
#         # Return raw logits: nn.BCEWithLogitsLoss applies the sigmoid itself
#         return self.fc(features_from_both)

# raise NotImplementedError
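`pack_padded_sequence` is meant to receive the true (unpadded) length of each sequence, and the parent and child need their own length vectors. The following is a minimal sketch, assuming that the padding index never occurs as a real token so lengths can be recovered by counting non-pad entries; the tiny vocabulary and tensor values are made up. `enforce_sorted=False` lets PyTorch handle batches that are not sorted by length.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_idx = 0
# Shape [seq_len, batch]; each column is one padded sequence.
text = torch.tensor([[5, 3, 7],
                     [2, 4, 0],
                     [9, 0, 0],
                     [1, 0, 0]])
lengths = (text != pad_idx).sum(dim=0)   # true lengths per sequence

embedding = nn.Embedding(10, 6, padding_idx=pad_idx)
lstm = nn.LSTM(6, 8, bidirectional=True)

packed = nn.utils.rnn.pack_padded_sequence(
    embedding(text), lengths.cpu(), enforce_sorted=False)
packed_out, (hidden, cell) = lstm(packed)
output, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_out)

# For a bidirectional LSTM, the last two slices of `hidden` are the final
# forward and backward states of the top layer.
final = torch.cat((hidden[-2], hidden[-1]), dim=1)   # [batch, 2 * hidden]
print(lengths.tolist(), output.shape, final.shape)
```

Running this once for the parent and once for the child, then concatenating the two `final` vectors, gives the `hidden_dim * 4` features that the linear layer in `SeparateRNN` is sized for.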

In [0]:
two_utt_rnn_hyperparameters = {'hidden_size': 256,
                            'number_of_layers': 1,
                            'bidirectional': True,
                            'dropout': 0.2,
                            'pad_idx': TEXT.vocab.stoi[TEXT.pad_token],
                            'learning_rate': 0.0001}

all_models_hyperparameters = {'embedding_dim': 100,                             
                              'output_dim': 1,
                              'vocabulary_size': len(TEXT.vocab),
                              'train_batch_size': 40,
                              'test_batch_size': 40,
                              'dropout': 0.5,}

# Luong attention layer
class Attn(nn.Module):
  def __init__(self, method, hidden_size):
    super(Attn, self).__init__()
    self.method = method
    if self.method not in ['dot', 'general', 'concat']:
      raise ValueError(f"{self.method} is not an appropriate attention method.")
    self.hidden_size = hidden_size
    if self.method == 'general':
      self.attn = nn.Linear(self.hidden_size, hidden_size)
    elif self.method == 'concat':
      self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
      self.v = nn.Parameter(torch.FloatTensor(hidden_size))

  def dot_score(self, hidden, encoder_output):
    return torch.sum(hidden * encoder_output, dim=2)

  def general_score(self, hidden, encoder_output):
    energy = self.attn(encoder_output)
    return torch.sum(hidden * energy, dim=2)

  def concat_score(self, hidden, encoder_output):
    energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
    return torch.sum(self.v * energy, dim=2)

  def forward(self, hidden, encoder_outputs):
    # Calculate the attention weights (energies) based on the given method
    if self.method == 'general':
      attn_energies = self.general_score(hidden, encoder_outputs)
    elif self.method == 'concat':
      attn_energies = self.concat_score(hidden, encoder_outputs)
    elif self.method == 'dot':
      attn_energies = self.dot_score(hidden, encoder_outputs)

    # Transpose max_length and batch_size dimensions
    attn_energies = attn_energies.t()

    # Return the softmax normalized probability scores (with added dimension)
    return F.softmax(attn_energies, dim=1).unsqueeze(1)
  
  
class RecModel(nn.Module):
  def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
    super(RecModel, self).__init__()
    self.hidden_dim = hidden_dim
    
#     if not pretrained:
#       self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
#     else:
#       self.word_embeddings = nn.Embedding.from_pretrained(pretrained_weights)
    self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)

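    # The LSTM takes word embeddings as inputs, and outputs hidden states
    # with dimensionality hidden_dim. Note: `bidirectional` is accepted but is
    # not passed to nn.LSTM, so this encoder is unidirectional.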
    self.dropout=nn.Dropout(dropout)
    self.lstm = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           dropout=(0 if n_layers == 1 else dropout))


  def forward(self,  text, text_lengths):    
    embedded = self.dropout(self.embedding(text))
    
    #pack sequence
    packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
    packed_output, (hidden, cell) = self.lstm(packed_embedded)

    #unpack sequence
    output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
#     hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
#     cell = self.dropout(torch.cat((cell[-2,:,:], cell[-1,:,:]), dim = 1))
#     # Sort data by decreasing order of input lengths
#     input_lengths, perm_index = input_lengths.sort(0, descending=True)
#     body = body[perm_index]
    
#     # Compute the embeddings
#     embedded = self.word_embeddings(body)
    
#     # Pack the padded sequence of inputs
#     packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths,batch_first=True)
    
#     # Pass the input through the LSTM unit
#     if hidden_input is not None:

#       cell_input = torch.zeros(hidden_input.size()).to(device)

#       output, (hidden, cell) = self.lstm(packed, (hidden_input,cell_input))
#     else:
#       output, (hidden, cell) = self.lstm(packed)
    
#     output, _ = nn.utils.rnn.pad_packed_sequence(output)
    
    # Return the per-step outputs plus the final hidden and cell states
    return output, hidden, cell

  
  
  
class AttnDecoder(nn.Module):
  def __init__(self, attn_model, vocab_size, embedding_dim, hidden_size, output_size, pad_idx,  n_layers=1, dropout=0.1):
    super(AttnDecoder, self).__init__()

    # Keep for reference
    self.attn_model = attn_model
    self.hidden_size = hidden_size
    self.output_size = output_size
    self.n_layers = n_layers
    self.dropout = nn.Dropout(dropout)
    
    self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
    
    
    self.embedding_dropout = nn.Dropout(dropout)
    self.gru = nn.GRU(embedding_dim, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
    self.concat = nn.Linear(hidden_size * 2, hidden_size)
    self.out = nn.Linear(hidden_size, output_size)

    self.attn = Attn(attn_model, hidden_size)

  def forward(self, input_step, last_hidden, encoder_outputs):
    # Note: we run this one step (word) at a time
    # Get embedding of current input word
    # Apply dropout once (previously it was applied twice in a row)
    embedded = self.embedding_dropout(self.embedding(input_step))
    
    
    # Forward through unidirectional GRU
    embedded = torch.transpose(embedded, 0, 1)
    rnn_output, hidden = self.gru(embedded, last_hidden)
    # Calculate attention weights from the current GRU output
    attn_weights = self.attn(rnn_output, encoder_outputs)
    # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
    context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
    # Concatenate weighted context vector and GRU output using Luong eq. 5
    rnn_output = rnn_output.squeeze(0)
    context = context.squeeze(1)
    concat_input = torch.cat((rnn_output, context), 1)
    concat_output = torch.tanh(self.concat(concat_input))
    
    output = self.out(concat_output)
    
    # Return output and final hidden state
    return output, hidden
  

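# NOTE: these two values are not passed to the constructors below, which read
# their sizes from the hyperparameter dicts (embedding_dim=100, hidden_size=256).
embedding_dim = 300
hidden_dim = 300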
encoder = RecModel(all_models_hyperparameters['vocabulary_size'],
          all_models_hyperparameters['embedding_dim'], 
          two_utt_rnn_hyperparameters['hidden_size'],
          all_models_hyperparameters['output_dim'],
          two_utt_rnn_hyperparameters['number_of_layers'],
          two_utt_rnn_hyperparameters['bidirectional'],
          all_models_hyperparameters['dropout'],
          two_utt_rnn_hyperparameters['pad_idx']).to(device)

attn_model = 'dot'
hidden_size = 300  # not used below: AttnDecoder gets two_utt_rnn_hyperparameters['hidden_size']

# embedding = nn.Embedding(vocab_size, hidden_size)

decoder = AttnDecoder(attn_model, 
          all_models_hyperparameters['vocabulary_size'],
          all_models_hyperparameters['embedding_dim'], 
          two_utt_rnn_hyperparameters['hidden_size'],
          two_utt_rnn_hyperparameters['hidden_size'],
          two_utt_rnn_hyperparameters['pad_idx'],
          two_utt_rnn_hyperparameters['number_of_layers'],
          all_models_hyperparameters['dropout']).to(device)

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time, datetime

class Classifier(nn.Module):
  def __init__(self, hidden_dim):
    super(Classifier, self).__init__()
    
    self.fc1 = nn.Linear(hidden_dim, 128)
    self.fc2 = nn.Linear(128, 64)
    self.fc3 = nn.Linear(64, 1)
    
    
  def forward(self, data):
    op = F.relu(self.fc1(data))
    op = F.relu(self.fc2(op))
    return self.fc3(op)
  
classifier = Classifier(two_utt_rnn_hyperparameters['hidden_size']).to(device)


def train_attention(encoder, decoder, classifier, criterion, encoder_optimizer, decoder_optimizer, classifier_optimizer, num_epochs, logger):
  # Training loop
  batch_num=0
  for epoch in range(0,num_epochs):
    for i, ex in enumerate(train_iter):
      batch_num=batch_num+1
      lab, child, parent = ex.label.to(device), ex.comment.to(device), ex.parent_comment.to(device)
      lengths_child = [list(c.size())[0] for c in child.permute(1, 0)]
      lengths_parent = [list(c.size())[0] for c in parent.permute(1, 0)]
      # Forward pass body through encoder
      encoder_output, encoder_hidden, _ = encoder(parent, lengths_parent)
      
      # Initialize decoder hidden state to hidden state of encoder
      decoder_hidden = encoder_hidden
      
      child = torch.transpose(child, 0, 1)
      # Initialize decoder input to first word of child
#       print(child)
#       print(child.shape)
      decoder_input = child[:,0].unsqueeze(1)
      
      # Forward batch of sequences through decoder
      max_child_length = child.size(1)
#       print(max_child_length)
      for i in range(1,max_child_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_output)
        
        decoder_input = child[:,i].unsqueeze(1)

      
      output = classifier(decoder_hidden)
      output = output.view(output.size(1),-1)
      output = output.squeeze(1)
      loss = criterion(output.float(), lab.float())
      
      encoder_optimizer.zero_grad()
      decoder_optimizer.zero_grad()
      classifier_optimizer.zero_grad()
      
      loss.backward()
      
      encoder_optimizer.step()
      decoder_optimizer.step()
      classifier_optimizer.step()

      #Compute accuracy
      predicted = torch.round(torch.sigmoid(output.data))
      total = lab.size(0)
      correct = (predicted.float() == lab.to(device).float()).sum().item()
      accuracy = correct / total
      info = { 'loss': loss, 'accuracy': accuracy }
      print(info)
      if batch_num%500==0:
        test_total = 0
        test_correct = 0
        test_batch_num=0
        # Switch to eval mode so dropout is disabled while measuring test accuracy
        encoder.eval()
        decoder.eval()
        classifier.eval()
        with torch.no_grad():
            
            for i, ex in enumerate(test_iter):
              test_batch_num=test_batch_num+1
              if test_batch_num>50:
                break
              lab, child, parent = ex.label.to(device), ex.comment.to(device), ex.parent_comment.to(device)
              lengths_child = [list(c.size())[0] for c in child.permute(1, 0)]
              lengths_parent = [list(c.size())[0] for c in parent.permute(1, 0)]
              # Forward pass body through encoder
              encoder_output, encoder_hidden, _ = encoder(parent, lengths_parent)

              # Initialize decoder hidden state to hidden state of encoder
              decoder_hidden = encoder_hidden

              child = torch.transpose(child, 0, 1)
              decoder_input = child[:,0].unsqueeze(1)

              # Forward batch of sequences through decoder
              max_child_length = child.size(1)
              for i in range(1,max_child_length):
                decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_output)

                decoder_input = child[:,i].unsqueeze(1)


              output = classifier(decoder_hidden)
              output = output.view(output.size(1),-1)
              output = output.squeeze(1)
              #Compute accuracy
              predicted = torch.round(torch.sigmoid(output.data))
              test_total = test_total+lab.size(0)
              test_correct = test_correct+(predicted.float() == lab.to(device).float()).sum().item()
              test_accuracy = test_correct / test_total
              info['test_accuracy'] = test_accuracy
        # Back to training mode, then save one checkpoint per evaluation
        # (previously the saves ran once per test batch)
        encoder.train()
        decoder.train()
        classifier.train()
        torch.save(encoder.state_dict(), "/content/gdrive/My Drive/hw-3-models/" + 'encoder_attention' + "-" + str(batch_num))
        torch.save(decoder.state_dict(), "/content/gdrive/My Drive/hw-3-models/" + 'decoder_attention' + "-" + str(batch_num))
        torch.save(classifier.state_dict(), "/content/gdrive/My Drive/hw-3-models/" + 'encoder_decoder_classifier_attention' + "-" + str(batch_num))

      for tag, value in info.items():
          logger.scalar_summary(tag, value, batch_num + 1)     
  return encoder, decoder, classifier

gc.collect()

102
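The `dot` scoring path in `Attn`, and the context computation in `AttnDecoder.forward`, can be traced on random tensors. This is a shape walkthrough only, with arbitrary sizes chosen as assumptions; it repeats the same operations outside the classes so each intermediate shape is visible.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, batch, hidden = 5, 3, 8
encoder_outputs = torch.randn(seq_len, batch, hidden)   # one vector per parent token
decoder_state = torch.randn(1, batch, hidden)           # current GRU output

# Dot score: broadcast the decoder state over every encoder position.
scores = torch.sum(decoder_state * encoder_outputs, dim=2)   # [seq_len, batch]

# Normalise over the sequence dimension to get attention weights.
weights = F.softmax(scores.t(), dim=1).unsqueeze(1)          # [batch, 1, seq_len]

# Weighted sum of encoder outputs gives one context vector per example.
context = weights.bmm(encoder_outputs.transpose(0, 1)).squeeze(1)   # [batch, hidden]

print(weights.shape, context.shape)
print(weights.sum(dim=2).squeeze(1))
```

The weights for each example sum to 1. Note that nothing here masks padded positions, so padding tokens in the parent comment also receive some attention mass.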

In [0]:
criterion = nn.BCEWithLogitsLoss().to(device)

learning_rate = 0.001  # not used below: each optimizer sets its own lr explicitly
encoder_optimizer = optim.Adam(encoder.parameters(), lr = 0.01) 
decoder_optimizer = optim.Adam(decoder.parameters(), lr = 0.01) 
classifier_optimizer = optim.Adam(classifier.parameters(), lr = 0.003) 

# TRAINING LOOP
now = time.mktime(datetime.datetime.now().timetuple())
logger = Logger(f'./logs/run_{now}/')

num_epochs = 10

encoder, decoder, classifier=train_attention(encoder, decoder, classifier, criterion, encoder_optimizer, decoder_optimizer, classifier_optimizer, num_epochs, logger)

{'loss': tensor(0.7034, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'accuracy': 0.425}
{'loss': tensor(0.6930, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'accuracy': 0.525}
{'loss': tensor(0.7016, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'accuracy': 0.375}
{'loss': tensor(0.6961, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'accuracy': 0.45}
{'loss': tensor(0.6928, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'accuracy': 0.5}
{'loss': tensor(0.6924, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'accuracy': 0.55}
{'loss': tensor(0.6882, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'accuracy': 0.625}
{'loss': tensor(0.7553, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'accuracy': 0.5}
{'loss': tensor(0.6884, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'accuracy': 0.55}
{'loss': tensor(0.6911, de

KeyboardInterrupt: ignored
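The losses printed above hover around 0.69. As a reference point, the binary cross-entropy of a model that outputs a probability of 0.5 for every example is $\ln 2 \approx 0.693$, so these first batches are at chance level, as expected on a balanced dataset before any learning has happened. A quick check, using made-up labels:

```python
import math
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])   # illustrative balanced labels
zero_logits = torch.zeros(4)                  # sigmoid(0) = 0.5 for every example

chance_loss = criterion(zero_logits, labels).item()
print(chance_loss, math.log(2))
```

A loss that stays near this value over many batches would suggest the model is not yet separating the two classes.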

In [0]:
# tick = time.time()
# separate_rnn = SeparateRNN(all_models_hyperparameters['vocabulary_size'],
#           all_models_hyperparameters['embedding_dim'], 
#           two_utt_rnn_hyperparameters['hidden_size'],
#           all_models_hyperparameters['output_dim'],
#           two_utt_rnn_hyperparameters['number_of_layers'],
#           two_utt_rnn_hyperparameters['bidirectional'],
#           all_models_hyperparameters['dropout'],
#           two_utt_rnn_hyperparameters['pad_idx'])

# optimizer = optim.Adam(separate_rnn.parameters(), lr=two_utt_rnn_hyperparameters['learning_rate'])

# criterion = nn.BCEWithLogitsLoss().to(device)

# separate_rnn = train_model('separate-RNN-attempt-0-', separate_rnn, optimizer, criterion, variable_length=True)
# print(time.time() - tick)