# KAIST AI605 Assignment 1: Text Classification

## Rubric

### Deadline 
The deadline for this assignment is: Friday 21st October 2022 (Week 8) 11:59pm

### Submission
Please submit your assignment via [KLMS](https://klms.kaist.ac.kr). You must submit both (1) a PDF of your solutions and (2) the Jupyter Notebook file (.ipynb).

Use markdown cells for discussion answers and in-line LaTeX for mathematical expressions. 

### Collaboration
This assignment is not a group assignment so make sure your answer and code are your own.

### Grading
The total number of marks avaiable is 30 points.

### Environment
The use of a GPU is not required for this notebook. Run the following cell to set up the environment. 

In [1]:
# pip install torch nltk tqdm

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

This notebook was tested with the following versions of python and torch

In [3]:
from platform import python_version
import torch
from tqdm import tqdm

print("python", python_version())
print("torch", torch.__version__) 

python 3.8.6rc1
torch 1.11.0+cpu


## Problem 1: Text Pre-processing (8 points)

This question will use a modified version of the data from the unreliable news dataset released by [Rashkin et al., 2017](https://aclanthology.org/D17-1317/). The dataset contains 15,000 news articles labeled as hoax, sattire, fake news etc. 

**The data must be downloaded from KLMS**

Start by loading the dataset with the following code:

In [4]:
import csv

def load_data(filename):
    with open(filename) as f:
        for line in csv.reader(f, delimiter="\t"):
            yield line

train_data = list(load_data("assignment1_data/train.tsv"))
test_data = list(load_data("assignment1_data/test.tsv"))

Each item in `train_data` and `test_data` is a (2-tuple) of the label and sentence

In [5]:
print(len(train_data))
print(len(test_data))
print()
print(train_data[0])

12003
2997

['Satire', "GREEN BAY, WIDavid Horsted, 45, announced Monday that he's seen a whole heck of a lot during his 20 years driving a taxi. 'Aw, geez, the people I've met and the places I've seenthe stories would make your head spin,' Horsted said. 'I've been from Lambeau Field to the Barhausen Waterfowl Preserve and every place in between. One time, one of the Packers even threw up in my cab, but I don't think I should say who.' With a little prodding, Horsted said the person's first name rhymes with 'baloney' and last name with 'sandwich.' "]


**Problem 1.1** (1 point) Use NLTK's `word_tokenize` to tokenize each document in the `train_data` and `test_data` and store these in two lists: `train_tokenized` and `test_tokenized`

In [6]:
from nltk.tokenize import word_tokenize
train_tokenized = []
test_tokenized = []
for element in train_data:
    train_tokenized.append([element[0],word_tokenize(element[1])])


print("Example Train Output:")
print(train_tokenized[0])
print()
for element in test_data:
    test_tokenized.append([element[0],word_tokenize(element[1])])

print("Example Test Output:")
print(test_tokenized[0])

Example Train Output:
['Satire', ['GREEN', 'BAY', ',', 'WIDavid', 'Horsted', ',', '45', ',', 'announced', 'Monday', 'that', 'he', "'s", 'seen', 'a', 'whole', 'heck', 'of', 'a', 'lot', 'during', 'his', '20', 'years', 'driving', 'a', 'taxi', '.', "'Aw", ',', 'geez', ',', 'the', 'people', 'I', "'ve", 'met', 'and', 'the', 'places', 'I', "'ve", 'seenthe', 'stories', 'would', 'make', 'your', 'head', 'spin', ',', "'", 'Horsted', 'said', '.', "'", 'I', "'ve", 'been', 'from', 'Lambeau', 'Field', 'to', 'the', 'Barhausen', 'Waterfowl', 'Preserve', 'and', 'every', 'place', 'in', 'between', '.', 'One', 'time', ',', 'one', 'of', 'the', 'Packers', 'even', 'threw', 'up', 'in', 'my', 'cab', ',', 'but', 'I', 'do', "n't", 'think', 'I', 'should', 'say', 'who', '.', "'", 'With', 'a', 'little', 'prodding', ',', 'Horsted', 'said', 'the', 'person', "'s", 'first', 'name', 'rhymes', 'with', "'baloney", "'", 'and', 'last', 'name', 'with', "'sandwich", '.', "'"]]

Example Test Output:
['Propaganda', ['The', 'Inde

**Problem 1.2** (2 points) isntead of using NLTK's tokenizer, show splitting the string by whitespace for a small sample of instances. Discuss two limitations when string splitting with whitespace.


In [7]:
split_train_tokenized = []
for element in train_data:
    split_train_tokenized.append([element[0],element[1].split(" ")])
print("Example Split Output:")
print(split_train_tokenized[0])

Example Split Output:
['Satire', ['GREEN', 'BAY,', 'WIDavid', 'Horsted,', '45,', 'announced', 'Monday', 'that', "he's", 'seen', 'a', 'whole', 'heck', 'of', 'a', 'lot', 'during', 'his', '20', 'years', 'driving', 'a', 'taxi.', "'Aw,", 'geez,', 'the', 'people', "I've", 'met', 'and', 'the', 'places', "I've", 'seenthe', 'stories', 'would', 'make', 'your', 'head', "spin,'", 'Horsted', 'said.', "'I've", 'been', 'from', 'Lambeau', 'Field', 'to', 'the', 'Barhausen', 'Waterfowl', 'Preserve', 'and', 'every', 'place', 'in', 'between.', 'One', 'time,', 'one', 'of', 'the', 'Packers', 'even', 'threw', 'up', 'in', 'my', 'cab,', 'but', 'I', "don't", 'think', 'I', 'should', 'say', "who.'", 'With', 'a', 'little', 'prodding,', 'Horsted', 'said', 'the', "person's", 'first', 'name', 'rhymes', 'with', "'baloney'", 'and', 'last', 'name', 'with', "'sandwich.'", '']]


Limitation 1:
All the commans are attached to the words which make it harder to analyse them. Same goes for every special character in dictionary. Could be potentially problematic in networks, words *chocolate* and *chocolate,* would be treated differently. Of couse we could train first the smaller network to omit special characters and treat every word with such attachment equally.



Limitation 2:
Our vocabulary would be bigger and a lot of words would be doubled like: *he* and *he's* but also the sentense would be not interpreted correctly. 


**Problem 1.3** (1 points) construct a vocabulary of all tokens in `train_tokenized`. Add an extra `UNK` token as a placeholder to account for unknown/unseen tokens at test time. How many unique tokens are present in this vocabulary?

In [8]:
from collections import Counter
words_counts = Counter()

for words in train_tokenized:
    words_counts.update(words[1])

print(len(words_counts))

147722


In [9]:
vocab = {word: i for i, (word,count) in enumerate(words_counts.items())}
vocab["<UNK>"] = len(vocab)

In [10]:
len(vocab.keys())

147723

**Problem 1.4** (2 points) We can reduce the size of the vocabulary by removing less frequently occuring words. Create a vocabulary for tokens in the which only contains tokens occuring at least (>=) 5 times. What is the size of the vocabulary now? (Remember to include the UNK placeholder token for unseen tokens)

In [11]:
limitted_vocab = {word: count for word, count in words_counts.items() if count>= 5 }
limitted_vocab["<UNK>"] = 0
print(f'Full vocabullary count: {len(vocab)}, limitted vocabullary count: {len(limitted_vocab)}')

Full vocabullary count: 147723, limitted vocabullary count: 39228


**Problem 1.5** (1 point) suggest how additional pre-processing could be used to reduce the vocabulary size when tokenizing with NLTK's tokenizer

In [12]:
limitted_vocab['Here']

670

In [13]:
limitted_vocab['here']

3638

"*'s*", "*'ve*", "*n't*" - catalloging those shortcuts as proper words ve -> have ect. 

eliminating 'CISA', 'SARTRE' as one type of wrods maybe 'proper name' 
and maybe the same goes with names we do not need them specificaly for some tasks
so overall with proper names we could group them and get one label and count them together 

also getting every word with upper and lower letter together would reduce the size significantly but that procedure would depend on what tasks our neural network is suppose to perform 

**Problem 1.6** (1 point) Print the number of items in each class for the test dataset (`test_data`)

In [14]:
for words in test_tokenized:
    print(words[0])
    break

Propaganda


In [15]:
# only_labels = lambda string: string[0]
# only_labels(only_labels(test_tokenized))

In [16]:
labels = [label for label,string in test_tokenized]

In [17]:
label_count = Counter(labels)
label_count

Counter({'Propaganda': 992, 'Hoax': 986, 'Satire': 1019})

## Problem 2: Training a feed-forward network (8 points)

**Problem 2.1** (1 point) Create a dictionary of label names to a unique index and call this `label2idx`, create a dictionary of unique words from the vocabulary to an index and call this `word2idx`

In [18]:
word2idx = {word: index for index,word in enumerate(vocab)}
# word2idx["<UNK>"] = len(word2idx)

In [19]:
len(word2idx)

147723

In [20]:
import numpy as np 
label2idx = {label: i for i,label in enumerate(np.unique(labels))}

In [21]:
label2idx

{'Hoax': 0, 'Propaganda': 1, 'Satire': 2}

**Problem 2.2** (1 point) For each document in `train_tokenized` and `test_tokenized`, create a `torch.LongTensor` of the word IDs from `word2idx`. If a word does not appear in `word2idx`, replace it with a special token for unknwon values (`UNK`). 

In [22]:
def not_in_vocab(ch):
    try: 
        return word2idx[ch]
    except KeyError:
        return word2idx["<UNK>"]
    

In [23]:
train_vectorized = []
for label, sent in train_tokenized:
    ids = [not_in_vocab(ch) for ch in sent]    
    train_vectorized.append((torch.LongTensor(ids), torch.LongTensor([label2idx[label]])))

In [24]:
print(train_vectorized[12][0].shape)

torch.Size([262])


In [25]:
test_vectorized = []
for label, sent in test_tokenized:
    ids = [not_in_vocab(ch) for ch in sent]   
    test_vectorized.append((torch.LongTensor(ids), torch.LongTensor([label2idx[label]])))

In [26]:
test_tokenized[0]

['Propaganda',
 ['The',
  'Independents',
  'Grill',
  'NSA',
  'Defender',
  'On',
  'Legality',
  'Of',
  'Surveillance',
  'ProgramYoutube']]

In [27]:
word2idx['Independents']

32691

In [28]:
print(test_vectorized[0][0])
word2idx['<UNK>']



tensor([   108,  32691,  17439,    289,  31697,   3929, 118474,   3077,  12715,
        147722])


147722

**Problem 2.3** (3 points) Create a Multi Layer Perceptron to classify these documents that performs the following operations: 

* (1) uses `torch.Embedding` to create a $d$-dimensional continuous representation of each token (`embedding_size`), 
* (2) sums the word embeddings, 
* (3) uses `tanh` activation function, 
* (4) uses a torch.Linear layer to project to a hidden dimension (`hidden_size`), 
* (5) applies tanh activation to this hidden representation
* (6) uses a linear layer to perform classification 

In [29]:
embedding_size = 20
hidden_size = 10
classes = label2idx.keys()

In [30]:
class MultiLayerPerceptron(torch.nn.Module):
    def __init__(self,vocab,embedding_size, hidden_size, classes):
        super().__init__()
        self.embedding = torch.nn.Embedding((len(vocab)+1),embedding_size) #maybe using EmmbedingBag???
        self.linear = torch.nn.Linear(embedding_size, hidden_size)
        self.activation = torch.nn.Tanh()
        self.classification = torch.nn.Linear(hidden_size, len(classes))
    def forward(self, x):
        x = self.activation(self.embedding(x).sum(dim = 1))
        x = self.activation(self.linear(x))
        x = self.classification(x)
        return x

**Problem 2.4** (3 points) train the model for 3 epochs and report the mean loss and the accuracy on the test set at each epoch. Use cross-entropy loss and the `Adam` optimizer (https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) with the learning rate set to `0.005`. 

In [31]:
Perceptron = MultiLayerPerceptron(vocab, embedding_size, hidden_size, classes)  
cross_entropy = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(Perceptron.parameters(), lr=0.005)

In [32]:
# loop over the dataset multiple times
for epoch in range(3):  

    runloss = 0.0
    for x, data in tqdm(enumerate(train_vectorized)):
        # get the inputs; data is [inputs, labels]
        inputs, labels = data
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = Perceptron(inputs.view(1,-1))
        loss = cross_entropy(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        runloss += loss.item()
    
    
    print(f'[{epoch + 1}, {x + 1:5d}] loss: {runloss / 12000:.3f}')
    
    #testing 
    acc = 0
    for test_tokens, label in test_vectorized:
        predictions = Perceptron(test_tokens.view(1,-1))
        predicted = torch.argmax(predictions.data,1)

        if(predicted==label):
            acc += 1
    print(acc/len(test_vectorized))
print('complete training')

12003it [09:16, 21.57it/s]


[1, 12003] loss: 0.323
0.9102435769102436


12003it [10:05, 19.81it/s]


[2, 12003] loss: 0.108
0.9526192859526192


12003it [09:51, 20.30it/s]


[3, 12003] loss: 0.061
0.9679679679679679
complete training


# Problem 3: Recurrent Neural Networks (8 points)

Recall that a recurrent neural network conditionally encodes tokens given the previous hidden state $\mathbf{h}_{t-1}$ and the input at the current input $\mathbf{x}_t$:

$$
\mathbf{h}_t = \tanh (U\mathbf{h}_{t-1} + V\mathbf{x}_t)
$$

**Problem 3.1** (1 point) Show that such recurrent neural network (RNN) without an activation function is equivalent to a single linear transformation with respect to the inputs, which means each $\textbf{h}_t$ is a linear combination of the inputs.

**Problem 3.2** (4 points) Provide your own implementation of an RNN cell (i.e. do NOT use the built in `torch.nn.RNN` method) that conforms to the following specification:

* Input: a matrix of $N$ embeddings of dimension size $d$ describing a sequence of embeddings for tokens (matrix size: $\mathbb{R}^{N \times d}$)
* Output: a tuple containing 
  * 1: A matrix containing the $N$ hidden states of dimension size $b$ (matrix size: $\mathbb{R}^{N \times b}$) 
  * 2: the final hidden state of the last element (vector of size $\mathbb{R}^{b}$)
  
* Assume that the initial hidden state is a vector of zeros ($\mathbf{h}_0 = [0,...,0]^T$)
* Including bias term is optional

In [33]:
class RNNLayer(torch.nn.Module):
    
    def __init__(self,embedding_size , hidden_size):
        super(RNNLayer, self).__init__()
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.initial_h = torch.zeros(hidden_size)
        # weights for netwrok 
        self.U = torch.FloatTensor(hidden_size, hidden_size).uniform_(0,1)
        
        self.V = torch.FloatTensor(hidden_size, embedding_size).uniform_(0,1)
        
    def forward(self, inputs): 
        hs, ycap = {}, {} #pytorch arrays
        
        hs[-1] = self.initial_h
#         _, N, d = inputs.shape
        N, d = inputs.shape
        for i in range(N):
            hs[i] = torch.tanh(torch.add(torch.matmul(self.U,hs[i-1]),torch.matmul(self.V,inputs[i]))) 
        return (hs.values(), hs[i])

**Problem 3.3** (1 point) create a copy of your model from problem 2.3 and change the summing of embeddings to instead use the final hidden state of your own RNN implementation. Use a copy of your training code from problem 2.4 and modify it to train your model on the first 100 items of the training set, reporting the mean loss and the accuracy on the first 100 items of the test set.

*NOTE: Because training is slow, we limit the training and test data to 100 samples. There is no additional award for using all data*


In [34]:
class copy_of_model(torch.nn.Module):
    def __init__(self,vocab_size,embedding_size, hidden_size, classes):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size) 
        self.rnn_layer = RNNLayer(embedding_size, hidden_size) 
#         self.linear =  torch.nn.Linear(hidden_size, hidden_size)
        self.activation = torch.nn.Tanh()
        self.classification = torch.nn.Linear(hidden_size, len(classes))
         

    def forward(self, x):  
        x = self.embedding(x)
        hidden_sattes, x = self.rnn_layer(x) 
#         x = self.linear(x)
        x = self.activation(x)
        x = self.classification(x)
        return x

In [35]:
RNN_model = copy_of_model(len(vocab), embedding_size, hidden_size, classes)  
cross_entropy = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN_model.parameters(), lr=0.005)
# loop over the dataset multiple times
for epoch in range(3):  

    runloss = 0.0
    for x, data in enumerate(train_vectorized):
        # get the inputs; data is [inputs, labels]
        inputs, labels = data
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = RNN_model(inputs) #.view(1,-1))
        loss = cross_entropy(outputs.view(1,-1), labels)
        loss.backward()
        optimizer.step()

        runloss += loss.item()
        
        #only 100 iterations
        if x == 99:    
            break
    
    #printing the loss
    print(f'[{epoch + 1}, {x + 1:5d}] loss: {runloss/100:.3f}')
    
    #testing on test_data
    acc = 0
    for i, data in enumerate(test_vectorized):
        test_tokens, label = data
        predictions = RNN_model(test_tokens)
        predicted = torch.argmax(predictions)
        if(predicted==label):
            acc += 1
        if(i==99):
            break
    print(acc/100)
        
print('complete training')

[1,   100] loss: 1.104
0.49
[2,   100] loss: 1.007
0.44
[3,   100] loss: 0.978
0.42
complete training


**Problem 3.4** (2 points) A limitation of the RNN is the vanishing gradient and exploding gradient problem. Exploding gradients can be mitigated with gradient clipping. Describe the method and benefit of gradient clipping and provide a simple implementation

Exploiding gradient is a large expotential increase of weights. To prevent our Neural Network we can use method called gradient clipping. This method changes the derivative of the error before backpropagation step by rescalling the gradients with vector norm or clipping gradient values which exceed the given range. This two aproches are called: Gradient clipping by value and Gradient clipping by norm. First aproch specifies the minimum and maximum clip value and if the gradient exceeds this threshold value is clipped with the given minimum and maximum. Second aproch is also clipping the gradinet value but by multiplying the unit vector of gradients with the threshold.  

In [36]:
def gradient_clipping(weights, upper_bound, lower_bound):
    weights[weights>upper_boud] = upper_bound
    weights[weights<lower_bound] = lower_bound
    return weights

## Problem 4: LSTM (6 points)

**Problem 4.1** (2 points) Explain how the architecture of an LSTM can mitigate the vanishing gradient problem found in RNNs. (A complete proof is not necessary)

the cell state derivative that can prevent the LSTM gradients from vanishing,
make the series of sub gradientss in (3) not converge to zero,
It is the presence of the forget gate’s vector of activations in the gradient term along with additive structure which allows the LSTM to find such a parameter update at any time step
Another important property to notice is that the cell state gradient is an additive function
In RNNs, the sum in (3) is made from expressions with a similar behaviour that are likely to all be in [0,1] which causes vanishing gradients.

In LSTMs, however, the presence of the forget gate, along with the additive property of the cell state gradients, enables the network to update the parameter in such a way that the different sub gradients in (3) do not necessarily agree

**Problem 4.2** (4 points) Provide your own implementation of the LSTM (i.e. do NOT use the built in `torch.nn.LSTM` method) that conforms to the following specification:

* Input: a matrix of $N$ embeddings of dimension size $d$ describing a sequence of embeddings for tokens (matrix size: $\mathbb{R}^{N \times d}$)
* Output: a tuple containing 
  * 1: A matrix containing the $N$ hidden states of dimension size $b$ (matrix size: $\mathbb{R}^{N \times b}$) 
  * 2: the final hidden state of the last element (vector of size $\mathbb{R}^{b}$)
  
* Assume that the initial hidden state is a vector of zeros ($\mathbf{h}_0 = [0,...,0]^T$)
* Assume that the initial cell state is a vector of zeros ($\mathbf{c}_0 = [0,...,0]^T$)
* Including bias term is optional


Following problem 3.3, demonstrate training on the first 100 instances instances from the training set and report the accuracy and loss on the first 100 instances from the test set.

In [37]:
class LSTMLayer(torch.nn.Module):
    
    def __init__(self,embedding_size , hidden_size):
        super(LSTMLayer, self).__init__()
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.initial_h = torch.zeros(hidden_size)
        
        # weights for x
        self.Wxf = torch.FloatTensor(hidden_size, embedding_size).uniform_(0,1)
        self.Wxi = torch.FloatTensor(hidden_size, embedding_size).uniform_(0,1)
        self.Wxc = torch.FloatTensor(hidden_size, embedding_size).uniform_(0,1)
        self.Wxo = torch.FloatTensor(hidden_size, embedding_size).uniform_(0,1)
        
        # weights for h
        self.Whf = torch.FloatTensor(hidden_size, hidden_size).uniform_(0,1)
        self.Whi = torch.FloatTensor(hidden_size, hidden_size).uniform_(0,1)
        self.Whc = torch.FloatTensor(hidden_size, hidden_size).uniform_(0,1)
        self.Who = torch.FloatTensor(hidden_size, hidden_size).uniform_(0,1)
        
        
        
    def forward(self, x): 
        hs, ct, ycap = {},{}, {} #pytorch arrays
        
        ct[-1] = torch.zeros(hidden_size).reshape(-1)
        hs[-1] = self.initial_h
        N, d = x.shape
        
        for i in range(N):

            it = torch.sigmoid(torch.add(torch.matmul(self.Wxi, x[i]),torch.matmul(self.Whi,hs[i-1].reshape(-1))))

            ft = torch.sigmoid(torch.add(torch.matmul(self.Wxf, x[i]),torch.matmul(self.Whf,hs[i-1].reshape(-1))))

#             print(f'FT shape: {ft.shape}')
            Ct = torch.tanh(torch.add(torch.matmul(self.Wxc, x[i]),torch.matmul(self.Whc,hs[i-1].reshape(-1))))

#             print(f'Ct shape: {Ct.shape}')
            
            ct[i] = torch.add(torch.mul(ft,ct[i-1]),torch.mul(it,Ct).reshape(-1))
#             print(f'ct shape: {ct[i].shape}')
            ot = torch.sigmoid(torch.add(torch.matmul(self.Wxo,x[i]),torch.matmul(self.Who,hs[i-1].reshape(-1))))
#             print(f'OT shape: {ot.shape}')
            
            hs[i] = torch.mul(ot,torch.tanh(ct[i]))
#             print(f'HS shape: {hs[i].shape}')
            

        return (list(hs.values()), hs[i])
    
        
        # return a 2-tuple: (all hidden states, final hidden state)
        

In [38]:
class model_LSTM(torch.nn.Module):
    def __init__(self,vocab_size,embedding_size, hidden_size, classes):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size) 
        self.rnn_layer = LSTMLayer(embedding_size, hidden_size) 
        self.activation = torch.nn.Tanh()
        self.classification = torch.nn.Linear(hidden_size, len(classes))
         

    def forward(self, x):  
        x = self.activation(self.embedding(x))
        hidden_states, x = self.rnn_layer(x) 

        x = self.activation(x)
        x = self.classification(x)
        return x

In [39]:
LSTM_model = model_LSTM(len(vocab), embedding_size, hidden_size, classes)  
cross_entropy = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(LSTM_model.parameters(), lr=0.005)


In [40]:
# loop over the dataset multiple times
for epoch in range(3):  

    runloss = 0.0
    for x, data in enumerate(train_vectorized):
        # get the inputs; data is [inputs, labels]
        inputs, labels = data
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = LSTM_model(inputs)
        loss = cross_entropy(outputs.view(1,-1), labels)
        loss.backward()
        optimizer.step()

        runloss = loss.item()
        
        #only 100 iterations
        if x == 99:    
            break
    
    #printing the loss
    print(f'[{epoch + 1}, {x + 1:5d}] loss: {runloss/100:.3f}')
    
    #testing on test_data
    acc = 0
    for i, data in enumerate(test_vectorized):
        test_tokens, label = data
        predictions = Perceptron(test_tokens.view(1,-1))
        predicted = torch.argmax(predictions.data,1)

        if(predicted==label):
            acc += 1
        if(i==99):
            break
    print(acc/100)
        
print('complete training')

[1,   100] loss: 0.010
0.91
[2,   100] loss: 0.011
0.91
[3,   100] loss: 0.011
0.91
complete training
