# Assignment 1.3: Naive word2vec (40 points)

This task can be formulated very simply. Follow this [paper](https://arxiv.org/pdf/1411.2738.pdf) and implement word2vec like a two-layer neural network with matrices $W$ and $W'$. One matrix projects words to low-dimensional 'hidden' space and the other - back to high-dimensional vocabulary space.

![word2vec](https://i.stack.imgur.com/6eVXZ.jpg)

You can use TensorFlow/PyTorch and code from your previous task.

## Results of this task: (30 points)
 * trained word vectors (mention somewhere, how long it took to train)
 * plotted loss (so we can see that it has converged)
 * function to map token to corresponding word vector
 * beautiful visualizations (PCE, T-SNE), you can use TensorBoard and play with your vectors in 3D (don't forget to add screenshots to the task)

## Extra questions: (10 points)
 * Intrinsic evaluation: you can find datasets [here](http://download.tensorflow.org/data/questions-words.txt)
 * Extrinsic evaluation: you can use [these](https://medium.com/@dataturks/rare-text-classification-open-datasets-9d340c8c508e)

Also, you can find any other datasets for quantitative evaluation.

Again. It is **highly recommended** to read this [paper](https://arxiv.org/pdf/1411.2738.pdf)

Example of visualization in tensorboard:
https://projector.tensorflow.org

Example of 2D visualisation:

![2dword2vec](https://www.tensorflow.org/images/tsne.png)

In [1]:
import gc
import string
import re
from collections import Counter
import numpy as np
gc.collect()

0

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ilbuono/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
import nltk
from nltk.corpus import stopwords
STOP_WORDS = set(stopwords.words('english'))
len(STOP_WORDS)

179

In [4]:
class Batcher:
    def __init__(self, window_size, corpus_path, min_freq, max_freq, max_voc_size, batch_size):
        self.corpus_path = corpus_path
        self.window_size = window_size
        self.min_freq = min_freq
        self.max_freq = max_freq
        self.max_voc_size = max_voc_size
        self.batch_size = batch_size
        self.words = None
        self.word2index = None
        self.index2word = None
        self.freq = None
        self.voc = None
        self.voc_size = None
        self.corpus = None
        self.corpus_size = None
        
        
    def read_data(self, S):
        if S == None:
            with open(self.corpus_path, 'r') as f:
                S = f.read()
            S = S.lower()[:10000000]
        print('Len of S = ', len(S))
        regex = re.compile('[%s]' % re.escape(string.punctuation))
        S = regex.sub(' ', S)
        words_raw = list(S.split())
        print(len(words_raw))
        words = []
        for word in words_raw:
            if word in STOP_WORDS:
                pass
            else:
                words.append(word)
        print(len(words))
        self.words = words
        unique_words = list(set(words))
        self.word2index = {k: v for v, k in enumerate(unique_words)}
        self.word2index['UNK'] = len(unique_words)
        self.word2index['PAD'] = len(unique_words)+1
        self.index2word = {v: k for v, k in enumerate(unique_words)}
        self.index2word[len(unique_words)] = 'UNK'
        self.index2word[len(unique_words)+1] = 'PAD'
        words = [self.word2index[word] for word in words]
        
        print('Size of words = ', len(words))
        counter = Counter(words)
        print('Size of counter = ', len(counter))
        if self.min_freq != None:
            counter = {x : counter[x] for x in counter if counter[x] >= self.min_freq}
        print('Size of counter after min_freq = ', len(counter))
        if self.max_freq != None:
            counter = {x : counter[x] for x in counter if counter[x] <= self.max_freq}
        print('Size of counter after max_freq = ', len(counter))
        counter = Counter(counter)

        self.freq = dict(counter.most_common(self.max_voc_size))
        self.voc = set(self.freq)
        self.voc_size = len(self.voc)+2
        
        unk = set(words).difference(self.voc)
        print('Size of freq dict = ', len(self.voc))
        print('Number of vocabulary words = ', len(self.voc))
        print('Number of unknown words = ', len(unk))

        words = [self.word2index['UNK'] if word in unk else word for word in words]
        
        if len(words)%self.batch_size == 0:
            padding = self.window_size
        else:
            padding = self.batch_size - len(words)%self.batch_size + self.window_size
            
        self.corpus = [self.word2index['PAD']]*self.window_size + words + [self.word2index['PAD']]*padding
        self.corpus_size = len(self.corpus)
    
    def generator(self):
        i = self.window_size
        x_batch = []
        y_batch = []
        
        while i < self.corpus_size-self.window_size:
            if len(x_batch)==self.batch_size:
                x_batch = []
                y_batch = []
                
            x = self.corpus[i-self.window_size: i] + self.corpus[i+1: i+self.window_size+1]
            y = [0]*self.voc_size
            y[self.corpus[i]] = 1
            #y = [self.corpus[i]]
            x_batch.append(x)
            y_batch.append(y)
            i += 1
            if len(x_batch)==self.batch_size:
                yield np.array(x_batch), np.array(y_batch)

In [5]:
10000000

10000000

In [6]:
BATCH_SIZE = 128
batcher = Batcher(window_size=2, corpus_path='text8', min_freq=None, max_freq=None, max_voc_size=10000000, batch_size=BATCH_SIZE)
batcher.read_data(S=None)

Len of S =  10000000
1706282
1090922
Size of words =  1090922
Size of counter =  70835
Size of counter after min_freq =  70835
Size of counter after max_freq =  70835
Size of freq dict =  70835
Number of vocabulary words =  70835
Number of unknown words =  0


In [7]:
for x, y in batcher.generator():
    print(x.shape, y.shape)
    break

(128, 4) (128, 70837)


### CBOW

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# nn.CrossEntropyLos
torch.manual_seed(1)

<torch._C.Generator at 0x7f509e150ad0>

In [9]:
USE_GPU = True
dtype = torch.float32 # we will be using float throughout this tutorial
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print (torch.cuda.get_device_name(0))

GeForce GTX 1050 Ti


In [10]:
class CBOW(nn.Module):
    def __init__(self, voc_size, embedding_dim, window_size, batch_size):
        super(CBOW, self).__init__()
        self.embedding1 = nn.Embedding(voc_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, voc_size)
        
        nn.init.kaiming_normal_(self.linear1.weight)
        
    def forward(self, inputs):
        embs1 = self.embedding1(torch.tensor(inputs))
        z1 = self.linear1(embs1)
        log_softmax = F.log_softmax(z1, dim=2)
        return log_softmax

In [11]:
losses = []
loss_function = nn.NLLLoss()
model = CBOW(voc_size=batcher.voc_size, embedding_dim=300, window_size=batcher.window_size, batch_size=batcher.batch_size)
model.cuda()
optimizer = optim.SGD(model.parameters(), lr=100)

for epoch in [0, 1, 2]:
    print('========== Epoch {} =========='.format(epoch))
    total_loss = 0
    i = 1
    N = int(len(batcher.words)//BATCH_SIZE)
    for context, target in batcher.generator():
        model.train()
        context = torch.tensor(context).to(device='cuda')
        target = torch.tensor(target).to(device='cuda')
        
        log_probs = model(context)
        loss = loss_function(log_probs, target)
        optimizer.zero_grad()
        model.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        if i%10==0 or i==1:
            print('Batch {}/{}'.format(i, N))
            print(loss)
        i += 1
        losses.append(loss)
        #model.zero_grad()



  # Remove the CWD from sys.path while we load stuff.


Batch 1/8522
tensor(12.1713, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 10/8522
tensor(12.0951, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 20/8522
tensor(12.0285, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 30/8522
tensor(12.0098, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 40/8522
tensor(11.9959, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 50/8522
tensor(11.9529, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 60/8522
tensor(11.9134, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 70/8522
tensor(11.9069, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 80/8522
tensor(11.7805, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 90/8522
tensor(11.8536, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 100/8522
tensor(11.8644, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 110/8522
tensor(11.7691, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 120/8522
tensor(11.7168, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 130/

Batch 1070/8522
tensor(11.2278, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1080/8522
tensor(11.2013, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1090/8522
tensor(11.2546, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1100/8522
tensor(11.2461, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1110/8522
tensor(11.2422, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1120/8522
tensor(11.2418, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1130/8522
tensor(11.2525, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1140/8522
tensor(11.2217, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1150/8522
tensor(11.2510, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1160/8522
tensor(11.2664, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1170/8522
tensor(11.2371, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1180/8522
tensor(11.2327, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1190/8522
tensor(11.2373, device='cuda:0', grad_fn=<NllLos

Batch 2130/8522
tensor(11.1819, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2140/8522
tensor(11.1741, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2150/8522
tensor(11.1806, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2160/8522
tensor(11.1784, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2170/8522
tensor(11.1796, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2180/8522
tensor(11.1787, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2190/8522
tensor(11.1767, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2200/8522
tensor(11.1795, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2210/8522
tensor(11.1777, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2220/8522
tensor(11.1776, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2230/8522
tensor(11.1749, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2240/8522
tensor(11.1765, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2250/8522
tensor(11.1772, device='cuda:0', grad_fn=<NllLos

Batch 3190/8522
tensor(11.1703, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3200/8522
tensor(11.1702, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3210/8522
tensor(11.1696, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3220/8522
tensor(11.1701, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3230/8522
tensor(11.1705, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3240/8522
tensor(11.1703, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3250/8522
tensor(11.1699, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3260/8522
tensor(11.1699, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3270/8522
tensor(11.1703, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3280/8522
tensor(11.1701, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3290/8522
tensor(11.1704, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3300/8522
tensor(11.1696, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3310/8522
tensor(11.1690, device='cuda:0', grad_fn=<NllLos

Batch 4250/8522
tensor(11.1686, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4260/8522
tensor(11.1685, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4270/8522
tensor(11.1685, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4280/8522
tensor(11.1686, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4290/8522
tensor(11.1683, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4300/8522
tensor(11.1683, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4310/8522
tensor(11.1685, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4320/8522
tensor(11.1686, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4330/8522
tensor(11.1685, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4340/8522
tensor(11.1685, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4350/8522
tensor(11.1684, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4360/8522
tensor(11.1685, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4370/8522
tensor(11.1684, device='cuda:0', grad_fn=<NllLos

Batch 5310/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5320/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5330/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5340/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5350/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5360/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5370/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5380/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5390/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5400/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5410/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5420/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5430/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLos

Batch 6370/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6380/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6390/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6400/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6410/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6420/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6430/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6440/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6450/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6460/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6470/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6480/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6490/8522
tensor(11.1682, device='cuda:0', grad_fn=<NllLos

Batch 7430/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7440/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7450/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7460/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7470/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7480/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7490/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7500/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7510/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7520/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7530/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7540/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7550/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 8490/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8500/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8510/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8520/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 10/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 20/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 30/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 40/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 50/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 60/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 70/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 80/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch

Batch 1030/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1040/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1050/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1060/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1070/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1080/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1090/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1100/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1110/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1120/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1130/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1140/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1150/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 2090/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2100/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2110/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2120/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2130/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2140/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2150/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2160/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2170/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2180/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2190/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2200/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2210/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 3150/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3160/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3170/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3180/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3190/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3200/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3210/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3220/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3230/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3240/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3250/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3260/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3270/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 4210/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4220/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4230/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4240/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4250/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4260/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4270/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4280/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4290/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4300/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4310/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4320/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4330/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 5270/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5280/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5290/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5300/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5310/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5320/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5330/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5340/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5350/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5360/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5370/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5380/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5390/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 6330/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6340/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6350/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6360/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6370/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6380/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6390/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6400/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6410/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6420/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6430/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6440/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6450/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 7390/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7400/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7410/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7420/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7430/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7440/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7450/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7460/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7470/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7480/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7490/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7500/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7510/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 8450/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8460/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8470/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8480/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8490/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8500/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8510/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8520/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 10/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 20/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 30/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 40/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward

Batch 990/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1000/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1010/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1020/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1030/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1040/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1050/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1060/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1070/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1080/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1090/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1100/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 1110/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss

Batch 2050/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2060/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2070/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2080/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2090/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2100/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2110/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2120/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2130/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2140/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2150/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2160/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 2170/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 3110/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3120/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3130/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3140/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3150/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3160/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3170/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3180/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3190/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3200/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3210/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3220/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 3230/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 4170/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4180/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4190/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4200/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4210/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4220/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4230/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4240/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4250/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4260/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4270/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4280/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 4290/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 5230/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5240/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5250/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5260/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5270/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5280/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5290/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5300/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5310/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5320/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5330/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5340/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 5350/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 6290/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6300/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6310/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6320/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6330/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6340/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6350/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6360/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6370/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6380/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6390/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6400/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 6410/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 7350/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7360/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7370/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7380/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7390/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7400/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7410/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7420/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7430/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7440/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7450/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7460/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 7470/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLos

Batch 8410/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8420/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8430/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8440/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8450/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8460/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8470/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8480/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8490/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8500/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8510/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
Batch 8520/8522
tensor(11.1681, device='cuda:0', grad_fn=<NllLoss2DBackward>)
