From Switch Transformer paper:

>In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost.

A vanilla Transformer block looks like this:

```python
class ModernTransformerBlock(nn.Module):
    def __init__(self, embed_dim, n_heads, up):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, n_heads)
        self.mlp = nn.Sequential(
            SwishGLU(embed_dim, embed_dim * up),
            nn.Linear(embed_dim * up, embed_dim),
        )
        self.pre_attn_norm = RMSNorm(embed_dim)
        self.pre_mlp_norm = RMSNorm(embed_dim)
    
    def forward(self, x):
        x = x + self.attn(self.pre_attn_norm(x))
        x = x + self.mlp(self.pre_mlp_norm(x))
        return x
```

The Mixture-of-Experts layer replaces the MLP layer. Instead of having one MLP layer, we have `num_experts` different MLP layers called *experts*.

The idea is to process a contextualized token, by sending it to a subset of experts. In this way we could efficiently increase the number of parameters of the model without affecting computational cost too much.

First, the token is fed into *router*, which determines to which experts a token should go to be processed. For computational reasons, there is a fixed limit on:
* how many tokens an expert can process, and
* by how many experts a token is processed.

# Grading
Your task is to implement a Mixture of Experts layer. You can get points for the following subtasks:
1.  (5 points) Naive implementation of MoE layer that works with `num_experts_per_token>=1`
2.  (5 points) Well-vectorized implementation of MoE layer that works with `num_experts_per_token=1`
3.  (5 points) Implementation of a script testing for 1. 2. implementations output equivalence and performance superiority of 2.
4.  (5 points) Well-vectorized implementation of MoE layer that works with `num_experts_per_token>=1`
5.  (Bonus 5 points) Use Huggingface's Trainer class and compare performance of randomly initialized MoE Transformer and standard Transformer on `https://huggingface.co/datasets/imdb` dataset.

20 points scored in this task is equivalent to at least 16% points achievable in this course.

Please submit your assignments until 15th of April, 18:00 CET.

# Rules
- You shouldn't change basic `forward` and `initialization` signatures of the main classes: `Router` and `MoE`. You can add additional arguments with default values.
- As an assignment, provide a Jupyter notebook with a short introduction at the top of what has been done and where.
- You can add or remove any other classes, though you should keep the behaviour of `MLP` class somehow.
- Sensible vectorization is good enough for the maximum amount of points. There is no need to optimize performance to the max, just show that you can identify opportunities for vectorization and you are able to implement complex vectorizations.
- If in doubt, direct questions to either Jan Ludziejewski or Juliusz Straszyński.
- A notebook that is hard to grade (crashing, obfuscated) might be scored for 0 points.

# Hints
- First, write a naive implementation, vectorized operations might be hard to analyze for correctness.
- You can make randomness deterministic by appropriate torch functions.
- If you have a hard time fulfilling fair randomness for token discarding, you can try keeping the earlier tokens.

In [1]:
%pip install torch_tb_profiler einops

Collecting torch_tb_profiler
  Downloading torch_tb_profiler-0.4.3-py3-none-any.whl (1.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting einops
  Downloading einops-0.7.0-py3-none-any.whl (44 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.6/44.6 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: einops, torch_tb_profiler
Successfully installed einops-0.7.0 torch_tb_profiler-0.4.3


In [2]:
from torch import nn
import torch
from transformers import PretrainedConfig
import torch.nn.functional as F
from einops import einsum

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(config.hidden_size, config.intermediate_size),
            nn.ReLU(),
            nn.Linear(config.intermediate_size, config.hidden_size),
        )

    def forward(self, x):
        return self.mlp(x)

# Router
The router is a module which assigns tokens to experts. It answers two questions:
1. Which tokens should be assigned to which expert.
2. How much weight should be assigned to each expert. The weight is determined by similarity between the token embedding and the expert embedding

The following conditions must be satisfied:
1. The routing weights must sum to 1 for each token and be non-negative
2. A token should have exactly `num_experts_per_token` non-zero weights

In [3]:
# Input: [batch_size, seq_len, hidden_size] - input embeddings
# Output: [batch_size, seq_len, num_experts] - expert routing weights
class Router(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts_per_token = config.num_experts_per_token
        self.hidden_size = config.hidden_size
        self.num_experts = config.num_experts

        self.expert_embeddings = nn.Parameter(torch.randn(self.num_experts, self.hidden_size))
        torch.nn.init.kaiming_uniform_(self.expert_embeddings, nonlinearity='linear')

    def forward(self, x):
        pass

The MoE module is a module which wraps around a set of expert modules and a router module.

It takes input embeddings and routes them to the experts.

Each token is processed individually by a subset of experts.

The output token embedding is a weighted sum of the expert outputs.

The weights are determined by the router module.

The subset of experts is determined by non-zero weights in the routing output.

Additionally each expert might process at most `expert_capacity = ceil((batch_size * seq_len) / num_experts * capacity_factor)` tokens

Superfluous tokens to be discarded by a particular expert should be selected uniformly at random.

Discarding should be equivalent to setting the appropriate routing weight to 0, while other weights remain the same.

This means that a token is processed by at most num_experts_per_token experts with a sum of weights of at most 1.

Specifically, this could mean that a token is processed by 0 experts - in this case the resulting embedding should be a zero tensor.

In [4]:
import math

# Input: [batch_size, seq_len, hidden_size] - input embeddings
# Output: [batch_size, seq_len, hidden_size] - output embeddings
class MoE(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts = config.num_experts
        self.hidden_size = config.hidden_size
        self.num_experts_per_token = config.num_experts_per_token
        self.capacity_factor = config.capacity_factor

        # You can change experts representation if you want
        self.experts = nn.ModuleList([MLP(config) for _ in range(self.num_experts)])
        self.router = Router(config)

    def forward(self, x):
        batch_size, seq_len, hidden_size = x.shape
        expert_capacity = math.ceil(batch_size * seq_len / self.num_experts * self.capacity_factor)
        pass

# Configurations

In [5]:
base_config = dict(
    vocab_size=5000,
    max_position_embeddings=256,
    num_attention_heads=8,
    num_hidden_layers=4,
    hidden_dropout_prob=0.1,
    hidden_size=128,
    intermediate_size=512,
    num_labels=2
)

standard_config = PretrainedConfig(
    **base_config,
    ff_cls=MLP
)

moe_config = PretrainedConfig(
    **base_config,
    num_experts=4,
    capacity_factor=2.0,
    num_experts_per_token=1,
    ff_cls=MoE
)

# Basic Transformer-related classes

In [6]:
from einops import rearrange

class Embedding(nn.Module):
  def __init__(self, config):
    super(Embedding, self).__init__()
    self.word_embed = nn.Embedding(config.vocab_size, config.hidden_size)
    self.pos_embed = nn.Embedding(config.max_position_embeddings, config.hidden_size)
    self.dropout = nn.Dropout(config.hidden_dropout_prob)

  def forward(self, x):
    batch_size, seq_length = x.shape
    device = x.device
    positions = torch.arange(0, seq_length).expand(
        batch_size, seq_length).to(device)
    embedding = self.word_embed(x) + self.pos_embed(positions)
    return self.dropout(embedding)


class MHSelfAttention(nn.Module):
    def __init__(self, config: PretrainedConfig):
        super(MHSelfAttention, self).__init__()
        self.num_attention_heads = config.num_attention_heads
        self.hidden_size = config.hidden_size
        self.head_size = self.hidden_size // self.num_attention_heads
        self.num_attention_heads = config.num_attention_heads
        self.qkv = nn.Linear(self.hidden_size, 3 * self.hidden_size, bias=False)

    def forward(self, embeddings):
        batch_size, seq_length, hidden_size = embeddings.size()

        result = self.qkv(embeddings)
        q, k, v = rearrange(result, 'b s (qkv nah hdsz) -> qkv b nah s hdsz', nah=self.num_attention_heads, qkv=3).unbind(0)

        attention_scores = torch.matmul(q, k.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(hidden_size)
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        contextualized_layer = torch.matmul(attention_probs, v)

        outputs = rearrange(contextualized_layer, 'b nah s hdsz -> b s (nah hdsz)')
        return outputs

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = MHSelfAttention(config)
        self.norm1 = nn.LayerNorm(config.hidden_size)
        self.norm2 = nn.LayerNorm(config.hidden_size)
        self.intermediate = config.ff_cls(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x):
        x =  x + self.norm1(self.dropout(self.attention(x)))
        x =  x + self.norm2(self.dropout(self.intermediate(x)))
        return x

class TransformerClassifier(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embedding(config)
        self.layer = nn.Sequential(*[TransformerBlock(config) for _ in range(config.num_hidden_layers)])
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, input_ids, labels=None):
        embedding_output = self.embeddings(input_ids)
        encoding = self.layer(embedding_output)
        pooled_encoding = encoding.mean(dim=1)
        logits = self.classifier(pooled_encoding)
        loss = F.cross_entropy(logits, labels) if labels is not None else None
        return {
            'loss': loss,
            'logits': logits,
        }

# Tokenizer training

In [7]:
%pip install datasets

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m13.4 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed dataset

In [8]:
from tokenizers import ByteLevelBPETokenizer
from datasets import load_dataset
from tokenizers.processors import BertProcessing

dataset = load_dataset('imdb')

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    dataset['train']['text'],
    vocab_size=base_config['vocab_size'],
    special_tokens=["<s>", "</s>", "<pad>"],
    min_frequency=2
)
tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

tokenizer.enable_truncation(max_length=base_config['max_position_embeddings'])
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("<pad>"), pad_token="<pad>", length=base_config['max_position_embeddings'])
tokenizer.model_max_length = base_config['max_position_embeddings']
tokenizer.pad_token = "<pad>"

from transformers import Trainer, TrainingArguments

def tokenize(row):
    return {
        'input_ids': tokenizer.encode(row['text']).ids,
    }

tokenized_dataset = dataset.map(tokenize)

Downloading readme:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

# **1. Naive implementation of MoE layer that works with num_experts_per_token>=1**

In [9]:
# Input: [batch_size, seq_len, hidden_size] - input embeddings
# Output: [batch_size, seq_len, num_experts] - expert routing weights
class Router(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts_per_token = config.num_experts_per_token
        self.hidden_size = config.hidden_size
        self.num_experts = config.num_experts

        self.expert_embeddings = nn.Parameter(torch.randn(self.num_experts, self.hidden_size))
        torch.nn.init.kaiming_uniform_(self.expert_embeddings, nonlinearity='linear')

    def forward(self, x):
        batch_size, seq_len, hidden_size = x.shape
        similarity = einsum(x, self.expert_embeddings, 'b s h, e h -> b s e')
        top_experts = torch.topk(similarity, self.num_experts_per_token)
        softmaxed_topk_values = F.softmax(top_experts.values, dim=-1)
        mask = torch.zeros_like(similarity, dtype=torch.bool)
        mask = mask.scatter_(-1, top_experts.indices, 1)
        routing_weights = torch.zeros_like(similarity)
        routing_weights[mask] = softmaxed_topk_values.flatten()

        return routing_weights

In [10]:
# Input: [batch_size, seq_len, hidden_size] - input embeddings
# Output: [batch_size, seq_len, hidden_size] - output embeddings
class NaiveMoE(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts = config.num_experts
        self.hidden_size = config.hidden_size
        self.num_experts_per_token = config.num_experts_per_token
        self.capacity_factor = config.capacity_factor

        # You can change experts representation if you want
        self.experts = nn.ModuleList([MLP(config) for _ in range(self.num_experts)])
        self.router = Router(config)

    def forward(self, x):
        batch_size, seq_len, hidden_size = x.shape
        expert_capacity = torch.ceil(torch.tensor(batch_size * seq_len / self.num_experts * self.capacity_factor, device=x.device, dtype=torch.int))
        routing_weights = self.router(x)
        for i in range(self.num_experts):
            token_indices = torch.nonzero(routing_weights[:, :, i], as_tuple=False)
            if token_indices.shape[0] > expert_capacity:
                routing_weights[token_indices[expert_capacity:, 0], token_indices[expert_capacity:, 1], i] = 0

        expert_outputs = torch.zeros(batch_size, seq_len, self.hidden_size, device=x.device)
        for i in range(self.num_experts):
            expert_indices = torch.nonzero(routing_weights[:, :, i], as_tuple=False)
            expert_outputs[expert_indices[:, 0], expert_indices[:, 1]] = self.experts[i](x[expert_indices[:, 0], expert_indices[:, 1]])

        return expert_outputs

In [11]:
from torch.utils.data import DataLoader

naive_moe_config = PretrainedConfig(
    **base_config,
    num_experts=4,
    capacity_factor=2.0,
    num_experts_per_token=2,
    ff_cls=NaiveMoE
)

train_loader = DataLoader(tokenized_dataset['train'], batch_size=16, shuffle=True)
test_loader = DataLoader(tokenized_dataset['test'], batch_size=16, shuffle=False)

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model = TransformerClassifier(naive_moe_config).to(DEVICE)
# model = TransformerClassifier(standard_config).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

In [12]:
from tqdm import tqdm

NUM_OF_EPOCHS = 20

for epoch in range(NUM_OF_EPOCHS):
    model.train()
    train_progress_bar = tqdm(train_loader, desc=f'Train, Epoch {epoch + 1} / {NUM_OF_EPOCHS}')
    running_loss = 0.
    for i, batch in enumerate(train_progress_bar):
        x, y = batch['input_ids'], batch['label']
        x = torch.stack(x, dim=1).to(DEVICE)
        y = y.to(DEVICE)
        optimizer.zero_grad()
        loss = model(x, y)['loss']
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

        if i % 10 == 9:
            last_loss = running_loss / 10 # avg loss per batch
            print('batch {} loss: {}'.format(i + 1, last_loss))
            running_loss = 0.

    model.eval()
    with torch.no_grad():
        total_loss = 0
        total_samples = 0
        correct_samples = 0
        test_progress_bar = tqdm(test_loader, desc=f'Test, Epoch {epoch + 1} / {NUM_OF_EPOCHS}')
        for batch in test_progress_bar:
            x, y = batch['input_ids'], batch['label']
            x = torch.stack(x, dim=1).to(DEVICE)
            y = y.to(DEVICE)
            logits = model(x)['logits']
            total_loss += F.cross_entropy(logits, y, reduction='sum').item()
            total_samples += y.shape[0]
            correct_samples += (logits.argmax(dim=-1) == y).sum().item()

        print(f'Epoch {epoch + 1}, loss: {total_loss / total_samples}, accuracy: {correct_samples / total_samples}')

Train, Epoch 1 / 20:   1%|          | 11/1563 [00:02<03:03,  8.46it/s]

batch 10 loss: 1.0349373400211335


Train, Epoch 1 / 20:   1%|▏         | 23/1563 [00:03<01:41, 15.18it/s]

batch 20 loss: 0.8920106291770935


Train, Epoch 1 / 20:   2%|▏         | 33/1563 [00:04<01:39, 15.34it/s]

batch 30 loss: 0.8101996958255768


Train, Epoch 1 / 20:   3%|▎         | 43/1563 [00:04<01:37, 15.51it/s]

batch 40 loss: 0.7471609890460968


Train, Epoch 1 / 20:   3%|▎         | 51/1563 [00:05<01:58, 12.77it/s]

batch 50 loss: 0.7084021866321564


Train, Epoch 1 / 20:   4%|▍         | 63/1563 [00:06<01:38, 15.18it/s]

batch 60 loss: 0.7095832407474518


Train, Epoch 1 / 20:   5%|▍         | 73/1563 [00:06<01:28, 16.84it/s]

batch 70 loss: 0.7351710438728333


Train, Epoch 1 / 20:   5%|▌         | 81/1563 [00:07<02:37,  9.39it/s]

batch 80 loss: 0.7763126134872437


Train, Epoch 1 / 20:   6%|▌         | 92/1563 [00:09<02:48,  8.73it/s]

batch 90 loss: 0.8216028213500977


Train, Epoch 1 / 20:   7%|▋         | 102/1563 [00:09<01:36, 15.15it/s]

batch 100 loss: 0.7581846296787262


Train, Epoch 1 / 20:   7%|▋         | 108/1563 [00:10<02:19, 10.41it/s]


KeyboardInterrupt: 

In [None]:
train_loader = DataLoader(tokenized_dataset['train'], batch_size=16, shuffle=True)
test_loader = DataLoader(tokenized_dataset['test'], batch_size=16, shuffle=False)

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model = TransformerClassifier(standard_config).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

In [None]:
from tqdm import tqdm

NUM_OF_EPOCHS = 20

for epoch in range(NUM_OF_EPOCHS):
    model.train()
    train_progress_bar = tqdm(train_loader, desc=f'Train, Epoch {epoch + 1} / {NUM_OF_EPOCHS}')
    running_loss = 0.
    for i, batch in enumerate(train_progress_bar):
        x, y = batch['input_ids'], batch['label']
        x = torch.stack(x, dim=1).to(DEVICE)
        y = y.to(DEVICE)
        optimizer.zero_grad()
        loss = model(x, y)['loss']
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

        if i % 10 == 9:
            last_loss = running_loss / 10 # avg loss per batch
            print('batch {} loss: {}'.format(i + 1, last_loss))
            running_loss = 0.

    model.eval()
    with torch.no_grad():
        total_loss = 0
        total_samples = 0
        correct_samples = 0
        test_progress_bar = tqdm(test_loader, desc=f'Test, Epoch {epoch + 1} / {NUM_OF_EPOCHS}')
        for batch in test_progress_bar:
            x, y = batch['input_ids'], batch['label']
            x = torch.stack(x, dim=1).to(DEVICE)
            y = y.to(DEVICE)
            logits = model(x)['logits']
            total_loss += F.cross_entropy(logits, y, reduction='sum').item()
            total_samples += y.shape[0]
            correct_samples += (logits.argmax(dim=-1) == y).sum().item()

        print(f'Epoch {epoch + 1}, loss: {total_loss / total_samples}, accuracy: {correct_samples / total_samples}')

# **2. Vectorized implementation of MoE layer that works with num_experts_per_token>=1**

In [70]:
# Input: [batch_size, seq_len, hidden_size] - input embeddings
# Output: [batch_size, seq_len, hidden_size] - output embeddings
class VectorizedMoE(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts = config.num_experts
        self.hidden_size = config.hidden_size
        self.num_experts_per_token = config.num_experts_per_token
        self.capacity_factor = config.capacity_factor

        # You can change experts representation if you want
        self.expert = torch.nn.Linear(self.hidden_size, self.hidden_size)
        self.expert_weights = torch.nn.Parameter(torch.stack([self.expert.weight for _ in range(self.num_experts)], dim=0))
        self.expert_biases = torch.nn.Parameter(torch.stack([self.expert.bias for _ in range(self.num_experts)], dim=0))
        self.router = Router(config)

    def forward(self, x):
        batch_size, seq_len, hidden_size = x.shape
        expert_capacity = torch.ceil(torch.tensor(batch_size * seq_len / self.num_experts * self.capacity_factor, device=x.device, dtype=torch.int))
        routing_weights = self.router(x)
        flat_routing_weights = routing_weights.view(-1, self.num_experts)  # Shape: [batch_size * seq_len, num_experts]
        topk_values, topk_indices = flat_routing_weights.topk(k=expert_capacity, dim=0)
        mask = torch.zeros_like(flat_routing_weights).bool()
        mask.scatter_(0, topk_indices, 1)
        flat_routing_weights = flat_routing_weights * mask.float()

        x_flat = x.reshape(-1, x.size(-1))
        inputs_expanded = x_flat.unsqueeze(1).expand(-1, self.num_experts, -1)
        weighted_inputs = inputs_expanded * flat_routing_weights.unsqueeze(-1)
        combined_inputs = weighted_inputs.reshape(-1, self.hidden_size)
        combined_outputs = torch.matmul(combined_inputs, self.expert_weights.view(-1, self.hidden_size).t()) + self.expert_biases.flatten()
        combined_outputs = combined_outputs.view(self.num_experts, batch_size * seq_len, self.num_experts, self.hidden_size)
        expert_outputs = torch.sum(combined_outputs * flat_routing_weights.unsqueeze(-1), dim=(0, 2))
        expert_outputs = expert_outputs.view(batch_size, seq_len, self.hidden_size)

        return expert_outputs

In [71]:
from torch.utils.data import DataLoader

vectorized_moe_for_one_expert_config = PretrainedConfig(
    **base_config,
    num_experts=4,
    capacity_factor=2.0,
    num_experts_per_token=2,
    ff_cls=VectorizedMoE
)

train_loader = DataLoader(tokenized_dataset['train'], batch_size=16, shuffle=True)
test_loader = DataLoader(tokenized_dataset['test'], batch_size=16, shuffle=False)

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model = TransformerClassifier(vectorized_moe_for_one_expert_config).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

In [None]:
from tqdm import tqdm

NUM_OF_EPOCHS = 20

for epoch in range(NUM_OF_EPOCHS):
    model.train()
    train_progress_bar = tqdm(train_loader, desc=f'Train, Epoch {epoch + 1} / {NUM_OF_EPOCHS}')
    running_loss = 0.
    for i, batch in enumerate(train_progress_bar):
        x, y = batch['input_ids'], batch['label']
        x = torch.stack(x, dim=1).to(DEVICE)
        y = y.to(DEVICE)
        optimizer.zero_grad()
        loss = model(x, y)['loss']
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

        if i % 10 == 9:
            last_loss = running_loss / 10 # avg loss per batch
            print('batch {} loss: {}'.format(i + 1, last_loss))
            running_loss = 0.

    model.eval()
    with torch.no_grad():
        total_loss = 0
        total_samples = 0
        correct_samples = 0
        test_progress_bar = tqdm(test_loader, desc=f'Test, Epoch {epoch + 1} / {NUM_OF_EPOCHS}')
        for batch in test_progress_bar:
            x, y = batch['input_ids'], batch['label']
            x = torch.stack(x, dim=1).to(DEVICE)
            y = y.to(DEVICE)
            logits = model(x)['logits']
            total_loss += F.cross_entropy(logits, y, reduction='sum').item()
            total_samples += y.shape[0]
            correct_samples += (logits.argmax(dim=-1) == y).sum().item()

        print(f'Epoch {epoch + 1}, loss: {total_loss / total_samples}, accuracy: {correct_samples / total_samples}')

Train, Epoch 1 / 20:   1%|          | 13/1563 [00:00<01:04, 23.90it/s]

batch 10 loss: 0.9177172422409058


Train, Epoch 1 / 20:   2%|▏         | 25/1563 [00:01<01:02, 24.78it/s]

batch 20 loss: 0.8167918026447296


Train, Epoch 1 / 20:   2%|▏         | 34/1563 [00:01<01:01, 24.93it/s]

batch 30 loss: 0.859244453907013


Train, Epoch 1 / 20:   3%|▎         | 43/1563 [00:01<01:00, 24.93it/s]

batch 40 loss: 0.749557101726532


Train, Epoch 1 / 20:   4%|▎         | 55/1563 [00:02<01:01, 24.63it/s]

batch 50 loss: 0.8005412638187408


Train, Epoch 1 / 20:   4%|▍         | 64/1563 [00:02<01:00, 24.62it/s]

batch 60 loss: 0.7357243537902832


Train, Epoch 1 / 20:   5%|▍         | 73/1563 [00:02<01:00, 24.77it/s]

batch 70 loss: 0.7317338049411773


Train, Epoch 1 / 20:   5%|▌         | 85/1563 [00:03<00:59, 24.85it/s]

batch 80 loss: 0.7289501905441285


Train, Epoch 1 / 20:   6%|▌         | 94/1563 [00:03<00:59, 24.85it/s]

batch 90 loss: 0.722758811712265


Train, Epoch 1 / 20:   7%|▋         | 103/1563 [00:04<00:57, 25.18it/s]

batch 100 loss: 0.7449330985546112


Train, Epoch 1 / 20:   7%|▋         | 112/1563 [00:04<00:57, 25.14it/s]

batch 110 loss: 0.7495331168174744


Train, Epoch 1 / 20:   8%|▊         | 124/1563 [00:05<00:58, 24.66it/s]

batch 120 loss: 0.7005780696868896


Train, Epoch 1 / 20:   9%|▊         | 133/1563 [00:05<00:57, 24.85it/s]

batch 130 loss: 0.6671855211257934


Train, Epoch 1 / 20:   9%|▉         | 145/1563 [00:05<00:56, 25.20it/s]

batch 140 loss: 0.7052746653556824


Train, Epoch 1 / 20:  10%|▉         | 154/1563 [00:06<00:55, 25.20it/s]

batch 150 loss: 0.7072145164012908


Train, Epoch 1 / 20:  10%|█         | 163/1563 [00:06<00:55, 25.07it/s]

batch 160 loss: 0.7426926732063294


Train, Epoch 1 / 20:  11%|█         | 175/1563 [00:07<00:55, 24.87it/s]

batch 170 loss: 0.714180052280426


Train, Epoch 1 / 20:  12%|█▏        | 184/1563 [00:07<00:55, 24.92it/s]

batch 180 loss: 0.7288957893848419


Train, Epoch 1 / 20:  12%|█▏        | 193/1563 [00:07<01:02, 21.93it/s]

batch 190 loss: 0.7148251056671142


Train, Epoch 1 / 20:  13%|█▎        | 205/1563 [00:08<00:56, 24.22it/s]

batch 200 loss: 0.723992544412613


Train, Epoch 1 / 20:  14%|█▎        | 214/1563 [00:08<00:54, 24.66it/s]

batch 210 loss: 0.6823336005210876


Train, Epoch 1 / 20:  14%|█▍        | 223/1563 [00:09<00:57, 23.49it/s]

batch 220 loss: 0.754057639837265


Train, Epoch 1 / 20:  15%|█▍        | 232/1563 [00:09<00:57, 23.19it/s]

batch 230 loss: 0.6780261337757111


Train, Epoch 1 / 20:  16%|█▌        | 244/1563 [00:10<00:57, 22.80it/s]

batch 240 loss: 0.670527595281601


Train, Epoch 1 / 20:  16%|█▌        | 253/1563 [00:10<00:58, 22.51it/s]

batch 250 loss: 0.7001429498195648


Train, Epoch 1 / 20:  17%|█▋        | 262/1563 [00:10<00:59, 22.03it/s]

batch 260 loss: 0.6731723546981812


Train, Epoch 1 / 20:  18%|█▊        | 274/1563 [00:11<00:53, 24.08it/s]

batch 270 loss: 0.6536144316196442


Train, Epoch 1 / 20:  18%|█▊        | 283/1563 [00:11<00:51, 24.80it/s]

batch 280 loss: 0.723010516166687


Train, Epoch 1 / 20:  19%|█▉        | 295/1563 [00:12<00:50, 24.88it/s]

batch 290 loss: 0.6913177907466889


Train, Epoch 1 / 20:  19%|█▉        | 304/1563 [00:12<00:50, 24.81it/s]

batch 300 loss: 0.6522000133991241


Train, Epoch 1 / 20:  20%|██        | 313/1563 [00:12<00:51, 24.41it/s]

batch 310 loss: 0.7025796949863434


Train, Epoch 1 / 20:  21%|██        | 325/1563 [00:13<00:49, 25.14it/s]

batch 320 loss: 0.7332491993904113


Train, Epoch 1 / 20:  21%|██▏       | 334/1563 [00:13<00:49, 24.86it/s]

batch 330 loss: 0.661463075876236


Train, Epoch 1 / 20:  22%|██▏       | 343/1563 [00:14<00:48, 24.98it/s]

batch 340 loss: 0.6708715498447418


Train, Epoch 1 / 20:  23%|██▎       | 355/1563 [00:14<00:48, 24.82it/s]

batch 350 loss: 0.6799742221832276


Train, Epoch 1 / 20:  23%|██▎       | 364/1563 [00:14<00:48, 24.51it/s]

batch 360 loss: 0.6674994349479675


Train, Epoch 1 / 20:  24%|██▍       | 373/1563 [00:15<00:48, 24.59it/s]

batch 370 loss: 0.6929183304309845


Train, Epoch 1 / 20:  25%|██▍       | 385/1563 [00:15<00:47, 24.75it/s]

batch 380 loss: 0.6777250409126282


Train, Epoch 1 / 20:  25%|██▌       | 394/1563 [00:16<00:47, 24.77it/s]

batch 390 loss: 0.6470906674861908


Train, Epoch 1 / 20:  26%|██▌       | 403/1563 [00:16<00:46, 24.72it/s]

batch 400 loss: 0.7104312598705291


Train, Epoch 1 / 20:  26%|██▋       | 412/1563 [00:16<00:46, 24.58it/s]

batch 410 loss: 0.6758914530277252


Train, Epoch 1 / 20:  27%|██▋       | 424/1563 [00:17<00:45, 24.77it/s]

batch 420 loss: 0.6745543956756592


Train, Epoch 1 / 20:  28%|██▊       | 433/1563 [00:17<00:45, 24.77it/s]

batch 430 loss: 0.6769984602928162


Train, Epoch 1 / 20:  28%|██▊       | 445/1563 [00:18<00:45, 24.72it/s]

batch 440 loss: 0.6769099950790405


Train, Epoch 1 / 20:  29%|██▉       | 454/1563 [00:18<00:44, 24.79it/s]

batch 450 loss: 0.7177286505699157


Train, Epoch 1 / 20:  30%|██▉       | 463/1563 [00:18<00:44, 24.80it/s]

batch 460 loss: 0.6829958319664001


Train, Epoch 1 / 20:  30%|███       | 475/1563 [00:19<00:44, 24.71it/s]

batch 470 loss: 0.6957504630088807


Train, Epoch 1 / 20:  31%|███       | 484/1563 [00:19<00:43, 24.66it/s]

batch 480 loss: 0.6765010535717011


Train, Epoch 1 / 20:  32%|███▏      | 493/1563 [00:20<00:43, 24.54it/s]

batch 490 loss: 0.6899702727794648


Train, Epoch 1 / 20:  32%|███▏      | 505/1563 [00:20<00:42, 24.72it/s]

batch 500 loss: 0.6872164368629455


Train, Epoch 1 / 20:  33%|███▎      | 514/1563 [00:21<00:44, 23.79it/s]

batch 510 loss: 0.6981833577156067


Train, Epoch 1 / 20:  33%|███▎      | 523/1563 [00:21<00:45, 23.11it/s]

batch 520 loss: 0.6971033692359925


Train, Epoch 1 / 20:  34%|███▍      | 532/1563 [00:21<00:45, 22.88it/s]

batch 530 loss: 0.7047366738319397


Train, Epoch 1 / 20:  35%|███▍      | 544/1563 [00:22<00:45, 22.24it/s]

batch 540 loss: 0.7660557508468628


Train, Epoch 1 / 20:  35%|███▌      | 553/1563 [00:22<00:45, 22.31it/s]

batch 550 loss: 0.6869908750057221


Train, Epoch 1 / 20:  36%|███▌      | 565/1563 [00:23<00:42, 23.58it/s]

batch 560 loss: 0.66078422665596


Train, Epoch 1 / 20:  37%|███▋      | 574/1563 [00:23<00:40, 24.45it/s]

batch 570 loss: 0.6688773691654205


Train, Epoch 1 / 20:  37%|███▋      | 583/1563 [00:24<00:39, 24.82it/s]

batch 580 loss: 0.6758286118507385


Train, Epoch 1 / 20:  38%|███▊      | 595/1563 [00:24<00:39, 24.81it/s]

batch 590 loss: 0.6809965670108795


Train, Epoch 1 / 20:  39%|███▊      | 604/1563 [00:24<00:39, 24.58it/s]

batch 600 loss: 0.6815394759178162


Train, Epoch 1 / 20:  39%|███▉      | 613/1563 [00:25<00:38, 24.48it/s]

batch 610 loss: 0.6970481753349305


Train, Epoch 1 / 20:  40%|███▉      | 625/1563 [00:25<00:37, 24.73it/s]

batch 620 loss: 0.6783476531505584


Train, Epoch 1 / 20:  41%|████      | 634/1563 [00:26<00:37, 24.61it/s]

batch 630 loss: 0.6852421998977661


Train, Epoch 1 / 20:  41%|████      | 643/1563 [00:26<00:37, 24.70it/s]

batch 640 loss: 0.6814114511013031


Train, Epoch 1 / 20:  42%|████▏     | 655/1563 [00:26<00:36, 24.87it/s]

batch 650 loss: 0.6739104866981507


Train, Epoch 1 / 20:  42%|████▏     | 664/1563 [00:27<00:36, 24.65it/s]

batch 660 loss: 0.6733161568641662


Train, Epoch 1 / 20:  43%|████▎     | 673/1563 [00:27<00:36, 24.54it/s]

batch 670 loss: 0.6993718683719635


Train, Epoch 1 / 20:  44%|████▍     | 685/1563 [00:28<00:35, 24.66it/s]

batch 680 loss: 0.6670043528079986


Train, Epoch 1 / 20:  44%|████▍     | 694/1563 [00:28<00:35, 24.70it/s]

batch 690 loss: 0.6698512494564056


Train, Epoch 1 / 20:  45%|████▍     | 703/1563 [00:28<00:34, 24.92it/s]

batch 700 loss: 0.6245149850845337


Train, Epoch 1 / 20:  46%|████▌     | 712/1563 [00:29<00:34, 24.77it/s]

batch 710 loss: 0.6861215353012085


Train, Epoch 1 / 20:  46%|████▋     | 724/1563 [00:29<00:34, 24.65it/s]

batch 720 loss: 0.685114324092865


Train, Epoch 1 / 20:  47%|████▋     | 733/1563 [00:30<00:33, 24.76it/s]

batch 730 loss: 0.7127727210521698


Train, Epoch 1 / 20:  48%|████▊     | 745/1563 [00:30<00:32, 25.02it/s]

batch 740 loss: 0.6583347678184509


Train, Epoch 1 / 20:  48%|████▊     | 754/1563 [00:30<00:32, 24.99it/s]

batch 750 loss: 0.6877457797527313


Train, Epoch 1 / 20:  49%|████▉     | 763/1563 [00:31<00:32, 24.90it/s]

batch 760 loss: 0.658700692653656


Train, Epoch 1 / 20:  50%|████▉     | 775/1563 [00:31<00:31, 25.03it/s]

batch 770 loss: 0.7101770281791687


Train, Epoch 1 / 20:  50%|█████     | 784/1563 [00:32<00:30, 25.24it/s]

batch 780 loss: 0.6926007807254791


Train, Epoch 1 / 20:  51%|█████     | 793/1563 [00:32<00:30, 25.25it/s]

batch 790 loss: 0.6626611649990082


Train, Epoch 1 / 20:  51%|█████▏    | 802/1563 [00:32<00:30, 24.96it/s]

batch 800 loss: 0.6525432884693145


Train, Epoch 1 / 20:  52%|█████▏    | 814/1563 [00:33<00:32, 22.88it/s]

batch 810 loss: 0.6665015339851379


Train, Epoch 1 / 20:  53%|█████▎    | 823/1563 [00:33<00:32, 22.43it/s]

batch 820 loss: 0.6742606520652771


Train, Epoch 1 / 20:  53%|█████▎    | 832/1563 [00:34<00:31, 22.87it/s]

batch 830 loss: 0.6267931163311005


Train, Epoch 1 / 20:  54%|█████▍    | 844/1563 [00:34<00:31, 22.56it/s]

batch 840 loss: 0.6613289654254914


Train, Epoch 1 / 20:  55%|█████▍    | 853/1563 [00:35<00:31, 22.39it/s]

batch 850 loss: 0.6518544465303421


Train, Epoch 1 / 20:  55%|█████▌    | 865/1563 [00:35<00:28, 24.54it/s]

batch 860 loss: 0.6592128217220307


Train, Epoch 1 / 20:  56%|█████▌    | 874/1563 [00:35<00:27, 24.69it/s]

batch 870 loss: 0.6792756676673889


Train, Epoch 1 / 20:  56%|█████▋    | 883/1563 [00:36<00:27, 24.97it/s]

batch 880 loss: 0.6716291069984436


Train, Epoch 1 / 20:  57%|█████▋    | 895/1563 [00:36<00:26, 25.31it/s]

batch 890 loss: 0.6372458934783936


Train, Epoch 1 / 20:  58%|█████▊    | 904/1563 [00:37<00:26, 25.18it/s]

batch 900 loss: 0.6516898334026336


Train, Epoch 1 / 20:  58%|█████▊    | 913/1563 [00:37<00:25, 25.11it/s]

batch 910 loss: 0.6753894448280334


Train, Epoch 1 / 20:  59%|█████▉    | 925/1563 [00:37<00:25, 25.10it/s]

batch 920 loss: 0.6879861056804657


Train, Epoch 1 / 20:  60%|█████▉    | 934/1563 [00:38<00:24, 25.16it/s]

batch 930 loss: 0.6698090255260467


Train, Epoch 1 / 20:  60%|██████    | 943/1563 [00:38<00:24, 25.11it/s]

batch 940 loss: 0.634822791814804


Train, Epoch 1 / 20:  61%|██████    | 955/1563 [00:39<00:24, 25.14it/s]

batch 950 loss: 0.6712542712688446


Train, Epoch 1 / 20:  62%|██████▏   | 964/1563 [00:39<00:23, 24.98it/s]

batch 960 loss: 0.7083239674568176


Train, Epoch 1 / 20:  62%|██████▏   | 973/1563 [00:39<00:23, 25.20it/s]

batch 970 loss: 0.6965440690517426


Train, Epoch 1 / 20:  63%|██████▎   | 985/1563 [00:40<00:22, 25.21it/s]

batch 980 loss: 0.662442970275879


Train, Epoch 1 / 20:  64%|██████▎   | 994/1563 [00:40<00:22, 24.95it/s]

batch 990 loss: 0.6545067608356476


Train, Epoch 1 / 20:  64%|██████▍   | 1003/1563 [00:41<00:22, 24.63it/s]

batch 1000 loss: 0.6324779510498046


Train, Epoch 1 / 20:  65%|██████▍   | 1015/1563 [00:41<00:21, 25.00it/s]

batch 1010 loss: 0.679651004076004


Train, Epoch 1 / 20:  66%|██████▌   | 1024/1563 [00:41<00:21, 25.30it/s]

batch 1020 loss: 0.6931487441062927


Train, Epoch 1 / 20:  66%|██████▌   | 1033/1563 [00:42<00:20, 25.27it/s]

batch 1030 loss: 0.7052060544490815


Train, Epoch 1 / 20:  67%|██████▋   | 1045/1563 [00:42<00:20, 25.08it/s]

batch 1040 loss: 0.6358731091022491


Train, Epoch 1 / 20:  67%|██████▋   | 1054/1563 [00:43<00:20, 25.35it/s]

batch 1050 loss: 0.7069046676158905


Train, Epoch 1 / 20:  68%|██████▊   | 1063/1563 [00:43<00:19, 25.25it/s]

batch 1060 loss: 0.6600581109523773


Train, Epoch 1 / 20:  69%|██████▉   | 1075/1563 [00:43<00:19, 25.18it/s]

batch 1070 loss: 0.6397220492362976


Train, Epoch 1 / 20:  69%|██████▉   | 1084/1563 [00:44<00:18, 25.27it/s]

batch 1080 loss: 0.6967059075832367


Train, Epoch 1 / 20:  70%|██████▉   | 1093/1563 [00:44<00:19, 24.65it/s]

batch 1090 loss: 0.6596114158630371


Train, Epoch 1 / 20:  71%|███████   | 1102/1563 [00:45<00:18, 24.93it/s]

batch 1100 loss: 0.6497233808040619


Train, Epoch 1 / 20:  71%|███████▏  | 1114/1563 [00:45<00:19, 23.63it/s]

batch 1110 loss: 0.6366795361042022


Train, Epoch 1 / 20:  72%|███████▏  | 1123/1563 [00:45<00:19, 22.74it/s]

batch 1120 loss: 0.6240044295787811


Train, Epoch 1 / 20:  72%|███████▏  | 1132/1563 [00:46<00:18, 22.72it/s]

batch 1130 loss: 0.6437844395637512


Train, Epoch 1 / 20:  73%|███████▎  | 1144/1563 [00:46<00:18, 22.35it/s]

batch 1140 loss: 0.6500104159116745


Train, Epoch 1 / 20:  74%|███████▍  | 1153/1563 [00:47<00:17, 23.15it/s]

batch 1150 loss: 0.6889274954795838


Train, Epoch 1 / 20:  75%|███████▍  | 1165/1563 [00:47<00:15, 25.01it/s]

batch 1160 loss: 0.6736928939819335


Train, Epoch 1 / 20:  75%|███████▌  | 1174/1563 [00:48<00:15, 25.08it/s]

batch 1170 loss: 0.6496454358100892


Train, Epoch 1 / 20:  76%|███████▌  | 1183/1563 [00:48<00:15, 25.05it/s]

batch 1180 loss: 0.6805761873722076


Train, Epoch 1 / 20:  76%|███████▋  | 1192/1563 [00:48<00:15, 24.54it/s]

batch 1190 loss: 0.6521302342414856


Train, Epoch 1 / 20:  77%|███████▋  | 1204/1563 [00:49<00:14, 24.83it/s]

batch 1200 loss: 0.6333695352077484


Train, Epoch 1 / 20:  78%|███████▊  | 1213/1563 [00:49<00:13, 25.12it/s]

batch 1210 loss: 0.6372492671012878


Train, Epoch 1 / 20:  78%|███████▊  | 1225/1563 [00:50<00:13, 25.35it/s]

batch 1220 loss: 0.702520364522934


Train, Epoch 1 / 20:  79%|███████▉  | 1234/1563 [00:50<00:12, 25.36it/s]

batch 1230 loss: 0.6903455793857575


Train, Epoch 1 / 20:  80%|███████▉  | 1243/1563 [00:50<00:12, 25.10it/s]

batch 1240 loss: 0.6611990809440613


Train, Epoch 1 / 20:  80%|████████  | 1255/1563 [00:51<00:12, 25.12it/s]

batch 1250 loss: 0.627992445230484


Train, Epoch 1 / 20:  81%|████████  | 1264/1563 [00:51<00:11, 25.44it/s]

batch 1260 loss: 0.6354429125785828


Train, Epoch 1 / 20:  81%|████████▏ | 1273/1563 [00:52<00:11, 25.22it/s]

batch 1270 loss: 0.6886117875576019


Train, Epoch 1 / 20:  82%|████████▏ | 1285/1563 [00:52<00:10, 25.35it/s]

batch 1280 loss: 0.6035654842853546


Train, Epoch 1 / 20:  83%|████████▎ | 1294/1563 [00:52<00:10, 25.09it/s]

batch 1290 loss: 0.6417570561170578


Train, Epoch 1 / 20:  83%|████████▎ | 1303/1563 [00:53<00:10, 25.13it/s]

batch 1300 loss: 0.6162938714027405


Train, Epoch 1 / 20:  84%|████████▍ | 1315/1563 [00:53<00:09, 25.34it/s]

batch 1310 loss: 0.623050332069397


Train, Epoch 1 / 20:  85%|████████▍ | 1324/1563 [00:54<00:09, 24.78it/s]

batch 1320 loss: 0.6269524037837982


Train, Epoch 1 / 20:  85%|████████▌ | 1333/1563 [00:54<00:09, 24.72it/s]

batch 1330 loss: 0.6668346583843231


Train, Epoch 1 / 20:  86%|████████▌ | 1345/1563 [00:54<00:08, 25.06it/s]

batch 1340 loss: 0.6582919597625733


Train, Epoch 1 / 20:  87%|████████▋ | 1354/1563 [00:55<00:08, 24.95it/s]

batch 1350 loss: 0.6022634387016297


Train, Epoch 1 / 20:  87%|████████▋ | 1363/1563 [00:55<00:08, 24.98it/s]

batch 1360 loss: 0.6498807013034821


Train, Epoch 1 / 20:  88%|████████▊ | 1375/1563 [00:56<00:07, 24.89it/s]

batch 1370 loss: 0.6976621091365814


Train, Epoch 1 / 20:  89%|████████▊ | 1384/1563 [00:56<00:07, 24.99it/s]

batch 1380 loss: 0.6883013308048248


Train, Epoch 1 / 20:  89%|████████▉ | 1393/1563 [00:56<00:06, 25.26it/s]

batch 1390 loss: 0.6580933630466461


Train, Epoch 1 / 20:  90%|████████▉ | 1402/1563 [00:57<00:06, 24.22it/s]

batch 1400 loss: 0.6308114528656006


Train, Epoch 1 / 20:  90%|█████████ | 1414/1563 [00:57<00:06, 22.57it/s]

batch 1410 loss: 0.6797486901283264


Train, Epoch 1 / 20:  91%|█████████ | 1423/1563 [00:58<00:06, 21.73it/s]

batch 1420 loss: 0.6625522315502167


Train, Epoch 1 / 20:  92%|█████████▏| 1432/1563 [00:58<00:05, 22.13it/s]

batch 1430 loss: 0.6559337615966797


Train, Epoch 1 / 20:  92%|█████████▏| 1444/1563 [00:59<00:05, 21.79it/s]

batch 1440 loss: 0.6379105806350708


Train, Epoch 1 / 20:  93%|█████████▎| 1453/1563 [00:59<00:04, 23.40it/s]

batch 1450 loss: 0.633931142091751


Train, Epoch 1 / 20:  94%|█████████▎| 1465/1563 [01:00<00:03, 24.59it/s]

batch 1460 loss: 0.6046333849430084


Train, Epoch 1 / 20:  94%|█████████▍| 1474/1563 [01:00<00:03, 25.00it/s]

batch 1470 loss: 0.6353074967861175


Train, Epoch 1 / 20:  95%|█████████▍| 1483/1563 [01:00<00:03, 25.07it/s]

batch 1480 loss: 0.6667285829782486


Train, Epoch 1 / 20:  96%|█████████▌| 1495/1563 [01:01<00:02, 24.78it/s]

batch 1490 loss: 0.7442371934652329


Train, Epoch 1 / 20:  96%|█████████▌| 1504/1563 [01:01<00:02, 24.89it/s]

batch 1500 loss: 0.6536718308925629


Train, Epoch 1 / 20:  97%|█████████▋| 1513/1563 [01:01<00:01, 25.13it/s]

batch 1510 loss: 0.6554839670658111


Train, Epoch 1 / 20:  98%|█████████▊| 1525/1563 [01:02<00:01, 25.01it/s]

batch 1520 loss: 0.6362979054450989


Train, Epoch 1 / 20:  98%|█████████▊| 1534/1563 [01:02<00:01, 24.77it/s]

batch 1530 loss: 0.6269973456859589


Train, Epoch 1 / 20:  99%|█████████▊| 1543/1563 [01:03<00:00, 25.18it/s]

batch 1540 loss: 0.6013853013515472


Train, Epoch 1 / 20:  99%|█████████▉| 1555/1563 [01:03<00:00, 25.27it/s]

batch 1550 loss: 0.6730674088001252


Train, Epoch 1 / 20: 100%|██████████| 1563/1563 [01:03<00:00, 24.45it/s]


batch 1560 loss: 0.6808663606643677


Test, Epoch 1 / 20: 100%|██████████| 1563/1563 [00:30<00:00, 51.80it/s]


Epoch 1, loss: 0.647417263660431, accuracy: 0.62568


Train, Epoch 2 / 20:   1%|          | 12/1563 [00:00<01:07, 23.10it/s]

batch 10 loss: 0.688909387588501


Train, Epoch 2 / 20:   2%|▏         | 24/1563 [00:01<01:09, 22.16it/s]

batch 20 loss: 0.6365421772003174


Train, Epoch 2 / 20:   2%|▏         | 33/1563 [00:01<01:09, 21.91it/s]

batch 30 loss: 0.6760241955518722


Train, Epoch 2 / 20:   3%|▎         | 42/1563 [00:01<01:04, 23.53it/s]

batch 40 loss: 0.6251111835241318


Train, Epoch 2 / 20:   3%|▎         | 54/1563 [00:02<01:01, 24.45it/s]

batch 50 loss: 0.645579743385315


Train, Epoch 2 / 20:   4%|▍         | 63/1563 [00:02<01:00, 24.78it/s]

batch 60 loss: 0.6483443975448608


Train, Epoch 2 / 20:   5%|▍         | 75/1563 [00:03<01:00, 24.65it/s]

batch 70 loss: 0.6455954372882843


Train, Epoch 2 / 20:   5%|▌         | 84/1563 [00:03<00:59, 24.75it/s]

batch 80 loss: 0.5992249637842179


Train, Epoch 2 / 20:   6%|▌         | 93/1563 [00:03<00:59, 24.78it/s]

batch 90 loss: 0.6238004922866821


Train, Epoch 2 / 20:   7%|▋         | 105/1563 [00:04<00:58, 24.78it/s]

batch 100 loss: 0.6311996668577194


Train, Epoch 2 / 20:   7%|▋         | 114/1563 [00:04<00:57, 25.05it/s]

batch 110 loss: 0.6479093730449677


Train, Epoch 2 / 20:   8%|▊         | 123/1563 [00:05<00:57, 24.88it/s]

batch 120 loss: 0.7086040318012238


Train, Epoch 2 / 20:   9%|▊         | 135/1563 [00:05<00:56, 25.21it/s]

batch 130 loss: 0.61141736805439


Train, Epoch 2 / 20:   9%|▉         | 144/1563 [00:05<00:56, 25.03it/s]

batch 140 loss: 0.6724359154701233


Train, Epoch 2 / 20:  10%|▉         | 153/1563 [00:06<00:56, 24.85it/s]

batch 150 loss: 0.6217727422714233


Train, Epoch 2 / 20:  11%|█         | 165/1563 [00:06<00:56, 24.85it/s]

batch 160 loss: 0.6507998585700989


Train, Epoch 2 / 20:  11%|█         | 174/1563 [00:07<00:56, 24.73it/s]

batch 170 loss: 0.6274408817291259


Train, Epoch 2 / 20:  12%|█▏        | 183/1563 [00:07<00:55, 24.91it/s]

batch 180 loss: 0.620867919921875


Train, Epoch 2 / 20:  12%|█▏        | 195/1563 [00:08<00:54, 25.22it/s]

batch 190 loss: 0.6595634996891022


Train, Epoch 2 / 20:  13%|█▎        | 204/1563 [00:08<00:54, 25.10it/s]

batch 200 loss: 0.6481109976768493


Train, Epoch 2 / 20:  14%|█▎        | 213/1563 [00:08<00:54, 24.86it/s]

batch 210 loss: 0.6035119414329528


Train, Epoch 2 / 20:  14%|█▍        | 225/1563 [00:09<00:53, 24.86it/s]

batch 220 loss: 0.6325857013463974


Train, Epoch 2 / 20:  15%|█▍        | 234/1563 [00:09<00:52, 25.33it/s]

batch 230 loss: 0.6565200865268708


Train, Epoch 2 / 20:  16%|█▌        | 243/1563 [00:09<00:52, 25.17it/s]

batch 240 loss: 0.6095854133367539


Train, Epoch 2 / 20:  16%|█▋        | 255/1563 [00:10<00:52, 24.88it/s]

batch 250 loss: 0.6419922351837158


Train, Epoch 2 / 20:  17%|█▋        | 264/1563 [00:10<00:52, 24.96it/s]

batch 260 loss: 0.6374159157276154


Train, Epoch 2 / 20:  17%|█▋        | 273/1563 [00:11<00:51, 25.03it/s]

batch 270 loss: 0.6409917891025543


Train, Epoch 2 / 20:  18%|█▊        | 282/1563 [00:11<00:51, 24.83it/s]

batch 280 loss: 0.6539631485939026


Train, Epoch 2 / 20:  19%|█▉        | 294/1563 [00:12<00:54, 23.17it/s]

batch 290 loss: 0.6146073788404465


Train, Epoch 2 / 20:  19%|█▉        | 303/1563 [00:12<00:55, 22.57it/s]

batch 300 loss: 0.6166948974132538


Train, Epoch 2 / 20:  20%|█▉        | 312/1563 [00:12<00:54, 23.14it/s]

batch 310 loss: 0.6396290123462677


Train, Epoch 2 / 20:  21%|██        | 324/1563 [00:13<00:55, 22.33it/s]

batch 320 loss: 0.6181968867778778


Train, Epoch 2 / 20:  21%|██▏       | 333/1563 [00:13<00:54, 22.46it/s]

batch 330 loss: 0.6164004862308502


Train, Epoch 2 / 20:  22%|██▏       | 345/1563 [00:14<00:50, 24.27it/s]

batch 340 loss: 0.6094393014907837


Train, Epoch 2 / 20:  23%|██▎       | 354/1563 [00:14<00:49, 24.46it/s]

batch 350 loss: 0.6441772133111954


Train, Epoch 2 / 20:  23%|██▎       | 363/1563 [00:14<00:48, 24.74it/s]

batch 360 loss: 0.5893138825893403


Train, Epoch 2 / 20:  24%|██▍       | 375/1563 [00:15<00:48, 24.64it/s]

batch 370 loss: 0.6419210791587829


Train, Epoch 2 / 20:  25%|██▍       | 384/1563 [00:15<00:47, 25.06it/s]

batch 380 loss: 0.6608033001422882


Train, Epoch 2 / 20:  25%|██▌       | 393/1563 [00:16<00:46, 25.01it/s]

batch 390 loss: 0.616424348950386


Train, Epoch 2 / 20:  26%|██▌       | 405/1563 [00:16<00:46, 24.98it/s]

batch 400 loss: 0.6724041342735291


Train, Epoch 2 / 20:  26%|██▋       | 414/1563 [00:17<00:45, 24.98it/s]

batch 410 loss: 0.6145689874887467


Train, Epoch 2 / 20:  27%|██▋       | 423/1563 [00:17<00:45, 24.96it/s]

batch 420 loss: 0.671810832619667


Train, Epoch 2 / 20:  28%|██▊       | 435/1563 [00:17<00:45, 25.03it/s]

batch 430 loss: 0.6493262946605682


Train, Epoch 2 / 20:  28%|██▊       | 444/1563 [00:18<00:44, 25.20it/s]

batch 440 loss: 0.6422068774700165


Train, Epoch 2 / 20:  29%|██▉       | 453/1563 [00:18<00:44, 25.09it/s]

batch 450 loss: 0.6015608549118042


Train, Epoch 2 / 20:  30%|██▉       | 465/1563 [00:19<00:43, 25.10it/s]

batch 460 loss: 0.6136245548725128


Train, Epoch 2 / 20:  30%|███       | 474/1563 [00:19<00:44, 24.66it/s]

batch 470 loss: 0.5906337857246399


Train, Epoch 2 / 20:  31%|███       | 483/1563 [00:19<00:42, 25.19it/s]

batch 480 loss: 0.6106251239776611


Train, Epoch 2 / 20:  32%|███▏      | 495/1563 [00:20<00:42, 25.10it/s]

batch 490 loss: 0.6202489674091339


Train, Epoch 2 / 20:  32%|███▏      | 504/1563 [00:20<00:42, 24.95it/s]

batch 500 loss: 0.594460529088974


Train, Epoch 2 / 20:  33%|███▎      | 513/1563 [00:20<00:41, 25.22it/s]

batch 510 loss: 0.6891353070735932


Train, Epoch 2 / 20:  33%|███▎      | 522/1563 [00:21<00:41, 24.94it/s]

batch 520 loss: 0.6124131292104721


Train, Epoch 2 / 20:  34%|███▍      | 534/1563 [00:21<00:41, 24.76it/s]

batch 530 loss: 0.6135976195335389


Train, Epoch 2 / 20:  35%|███▍      | 543/1563 [00:22<00:41, 24.75it/s]

batch 540 loss: 0.6081041812896728


Train, Epoch 2 / 20:  36%|███▌      | 555/1563 [00:22<00:41, 24.57it/s]

batch 550 loss: 0.6384268641471863


Train, Epoch 2 / 20:  36%|███▌      | 564/1563 [00:23<00:40, 24.80it/s]

batch 560 loss: 0.6744772017002105


Train, Epoch 2 / 20:  37%|███▋      | 573/1563 [00:23<00:39, 25.19it/s]

batch 570 loss: 0.6209838837385178


Train, Epoch 2 / 20:  37%|███▋      | 582/1563 [00:23<00:39, 25.12it/s]

batch 580 loss: 0.6143248051404953


Train, Epoch 2 / 20:  38%|███▊      | 594/1563 [00:24<00:41, 23.19it/s]

batch 590 loss: 0.6867515087127686


Train, Epoch 2 / 20:  39%|███▊      | 603/1563 [00:24<00:43, 22.22it/s]

batch 600 loss: 0.6645216941833496


Train, Epoch 2 / 20:  39%|███▉      | 612/1563 [00:25<00:43, 21.97it/s]

batch 610 loss: 0.6891304731369019


Train, Epoch 2 / 20:  40%|███▉      | 624/1563 [00:25<00:43, 21.63it/s]

batch 620 loss: 0.6760657072067261


Train, Epoch 2 / 20:  40%|████      | 633/1563 [00:26<00:40, 22.71it/s]

batch 630 loss: 0.6449572324752808


Train, Epoch 2 / 20:  41%|████▏     | 645/1563 [00:26<00:37, 24.46it/s]

batch 640 loss: 0.6260910093784332


Train, Epoch 2 / 20:  42%|████▏     | 654/1563 [00:26<00:36, 25.14it/s]

batch 650 loss: 0.6443826138973237


Train, Epoch 2 / 20:  42%|████▏     | 663/1563 [00:27<00:36, 24.93it/s]

batch 660 loss: 0.6381291806697845


Train, Epoch 2 / 20:  43%|████▎     | 672/1563 [00:27<00:35, 24.85it/s]

batch 670 loss: 0.6101125806570054


Train, Epoch 2 / 20:  44%|████▍     | 684/1563 [00:28<00:35, 25.04it/s]

batch 680 loss: 0.6561286866664886


Train, Epoch 2 / 20:  44%|████▍     | 693/1563 [00:28<00:34, 25.14it/s]

batch 690 loss: 0.6592585414648056


Train, Epoch 2 / 20:  45%|████▌     | 705/1563 [00:28<00:34, 25.03it/s]

batch 700 loss: 0.6388536423444748


Train, Epoch 2 / 20:  46%|████▌     | 714/1563 [00:29<00:34, 24.96it/s]

batch 710 loss: 0.6760504841804504


Train, Epoch 2 / 20:  46%|████▋     | 723/1563 [00:29<00:33, 25.16it/s]

batch 720 loss: 0.6514618843793869


Train, Epoch 2 / 20:  47%|████▋     | 735/1563 [00:30<00:32, 25.16it/s]

batch 730 loss: 0.6399930357933045


Train, Epoch 2 / 20:  48%|████▊     | 744/1563 [00:30<00:32, 25.14it/s]

batch 740 loss: 0.618561464548111


Train, Epoch 2 / 20:  48%|████▊     | 753/1563 [00:30<00:32, 24.89it/s]

batch 750 loss: 0.647434288263321


Train, Epoch 2 / 20:  49%|████▉     | 765/1563 [00:31<00:31, 25.06it/s]

batch 760 loss: 0.6517093718051911


Train, Epoch 2 / 20:  50%|████▉     | 774/1563 [00:31<00:31, 25.04it/s]

batch 770 loss: 0.6703737795352935


Train, Epoch 2 / 20:  50%|█████     | 783/1563 [00:32<00:31, 24.71it/s]

batch 780 loss: 0.5872996807098388


Train, Epoch 2 / 20:  51%|█████     | 795/1563 [00:32<00:30, 25.07it/s]

batch 790 loss: 0.6370784133672714


Train, Epoch 2 / 20:  51%|█████▏    | 804/1563 [00:32<00:30, 25.15it/s]

batch 800 loss: 0.6444802820682526


Train, Epoch 2 / 20:  52%|█████▏    | 813/1563 [00:33<00:29, 25.03it/s]

batch 810 loss: 0.6360883116722107


Train, Epoch 2 / 20:  53%|█████▎    | 825/1563 [00:33<00:29, 25.36it/s]

batch 820 loss: 0.6512411653995513


Train, Epoch 2 / 20:  53%|█████▎    | 834/1563 [00:34<00:29, 25.11it/s]

batch 830 loss: 0.640017357468605


Train, Epoch 2 / 20:  54%|█████▍    | 843/1563 [00:34<00:28, 25.02it/s]

batch 840 loss: 0.6225894093513489


Train, Epoch 2 / 20:  55%|█████▍    | 855/1563 [00:34<00:28, 24.93it/s]

batch 850 loss: 0.6243275105953217


Train, Epoch 2 / 20:  55%|█████▌    | 864/1563 [00:35<00:28, 24.72it/s]

batch 860 loss: 0.618877363204956


Train, Epoch 2 / 20:  56%|█████▌    | 873/1563 [00:35<00:27, 24.99it/s]

batch 870 loss: 0.5963100880384445


Train, Epoch 2 / 20:  56%|█████▋    | 882/1563 [00:36<00:27, 24.42it/s]

batch 880 loss: 0.6425060749053955


Train, Epoch 2 / 20:  57%|█████▋    | 894/1563 [00:36<00:28, 23.41it/s]

batch 890 loss: 0.5922153532505036


Train, Epoch 2 / 20:  58%|█████▊    | 903/1563 [00:36<00:29, 22.64it/s]

batch 900 loss: 0.6533177137374878


Train, Epoch 2 / 20:  58%|█████▊    | 912/1563 [00:37<00:29, 22.07it/s]

batch 910 loss: 0.6528215914964676


Train, Epoch 2 / 20:  59%|█████▉    | 924/1563 [00:37<00:28, 22.08it/s]

batch 920 loss: 0.6063293099403382


Train, Epoch 2 / 20:  60%|█████▉    | 933/1563 [00:38<00:26, 23.49it/s]

batch 930 loss: 0.6462598204612732


Train, Epoch 2 / 20:  60%|██████    | 945/1563 [00:38<00:25, 24.58it/s]

batch 940 loss: 0.6081822514533997


Train, Epoch 2 / 20:  61%|██████    | 954/1563 [00:39<00:24, 24.84it/s]

batch 950 loss: 0.5919400453567505


Train, Epoch 2 / 20:  62%|██████▏   | 963/1563 [00:39<00:24, 24.81it/s]

batch 960 loss: 0.5992241382598877


Train, Epoch 2 / 20:  62%|██████▏   | 975/1563 [00:39<00:23, 24.94it/s]

batch 970 loss: 0.6676936864852905


Train, Epoch 2 / 20:  63%|██████▎   | 984/1563 [00:40<00:23, 25.08it/s]

batch 980 loss: 0.6213903486728668


Train, Epoch 2 / 20:  64%|██████▎   | 993/1563 [00:40<00:22, 24.91it/s]

batch 990 loss: 0.6374028891324997


Train, Epoch 2 / 20:  64%|██████▍   | 1005/1563 [00:41<00:22, 24.79it/s]

batch 1000 loss: 0.648452877998352


Train, Epoch 2 / 20:  65%|██████▍   | 1014/1563 [00:41<00:22, 24.86it/s]

batch 1010 loss: 0.6424710214138031


Train, Epoch 2 / 20:  65%|██████▌   | 1023/1563 [00:41<00:21, 25.18it/s]

batch 1020 loss: 0.6508119314908981


Train, Epoch 2 / 20:  66%|██████▌   | 1035/1563 [00:42<00:21, 25.11it/s]

batch 1030 loss: 0.6203590571880341


Train, Epoch 2 / 20:  67%|██████▋   | 1044/1563 [00:42<00:20, 24.94it/s]

batch 1040 loss: 0.6721548408269882


Train, Epoch 2 / 20:  67%|██████▋   | 1053/1563 [00:43<00:20, 24.82it/s]

batch 1050 loss: 0.6336809605360031


Train, Epoch 2 / 20:  68%|██████▊   | 1065/1563 [00:43<00:20, 24.82it/s]

batch 1060 loss: 0.649198180437088


Train, Epoch 2 / 20:  69%|██████▊   | 1074/1563 [00:43<00:19, 24.98it/s]

batch 1070 loss: 0.6159188628196717


Train, Epoch 2 / 20:  69%|██████▉   | 1083/1563 [00:44<00:19, 24.73it/s]

batch 1080 loss: 0.5596384227275848


Train, Epoch 2 / 20:  70%|███████   | 1095/1563 [00:44<00:18, 24.85it/s]

batch 1090 loss: 0.6115846693515777


Train, Epoch 2 / 20:  71%|███████   | 1104/1563 [00:45<00:18, 24.51it/s]

batch 1100 loss: 0.6068015366792678


Train, Epoch 2 / 20:  71%|███████   | 1113/1563 [00:45<00:17, 25.24it/s]

batch 1110 loss: 0.678502881526947


Train, Epoch 2 / 20:  72%|███████▏  | 1125/1563 [00:46<00:17, 24.99it/s]

batch 1120 loss: 0.6095231473445892


Train, Epoch 2 / 20:  73%|███████▎  | 1134/1563 [00:46<00:17, 24.97it/s]

batch 1130 loss: 0.5801851868629455


Train, Epoch 2 / 20:  73%|███████▎  | 1143/1563 [00:46<00:16, 24.94it/s]

batch 1140 loss: 0.6096900850534439


Train, Epoch 2 / 20:  74%|███████▍  | 1155/1563 [00:47<00:16, 25.03it/s]

batch 1150 loss: 0.6066416352987289


Train, Epoch 2 / 20:  74%|███████▍  | 1164/1563 [00:47<00:15, 25.28it/s]

batch 1160 loss: 0.6367566525936127


Train, Epoch 2 / 20:  75%|███████▌  | 1173/1563 [00:47<00:15, 25.04it/s]

batch 1170 loss: 0.5851842373609543


Train, Epoch 2 / 20:  76%|███████▌  | 1182/1563 [00:48<00:16, 23.22it/s]

batch 1180 loss: 0.5711751759052277


Train, Epoch 2 / 20:  76%|███████▋  | 1194/1563 [00:48<00:16, 22.35it/s]

batch 1190 loss: 0.6426072597503663


Train, Epoch 2 / 20:  77%|███████▋  | 1203/1563 [00:49<00:16, 22.03it/s]

batch 1200 loss: 0.6122759282588959


Train, Epoch 2 / 20:  78%|███████▊  | 1212/1563 [00:49<00:15, 22.32it/s]

batch 1210 loss: 0.6818899929523468


Train, Epoch 2 / 20:  78%|███████▊  | 1224/1563 [00:50<00:15, 21.60it/s]

batch 1220 loss: 0.6187980055809021


Train, Epoch 2 / 20:  79%|███████▉  | 1233/1563 [00:50<00:14, 23.30it/s]

batch 1230 loss: 0.673901253938675


Train, Epoch 2 / 20:  80%|███████▉  | 1245/1563 [00:51<00:12, 24.46it/s]

batch 1240 loss: 0.5889717519283295


Train, Epoch 2 / 20:  80%|████████  | 1254/1563 [00:51<00:12, 24.87it/s]

batch 1250 loss: 0.6016661643981933


Train, Epoch 2 / 20:  81%|████████  | 1263/1563 [00:51<00:11, 25.10it/s]

batch 1260 loss: 0.6621687829494476


Train, Epoch 2 / 20:  81%|████████▏ | 1272/1563 [00:52<00:11, 24.56it/s]

batch 1270 loss: 0.5737499177455903


Train, Epoch 2 / 20:  82%|████████▏ | 1284/1563 [00:52<00:11, 23.97it/s]

batch 1280 loss: 0.5954467207193375


Train, Epoch 2 / 20:  83%|████████▎ | 1293/1563 [00:53<00:11, 24.46it/s]

batch 1290 loss: 0.5979837536811828


Train, Epoch 2 / 20:  83%|████████▎ | 1305/1563 [00:53<00:10, 24.47it/s]

batch 1300 loss: 0.6142448842525482


Train, Epoch 2 / 20:  84%|████████▍ | 1314/1563 [00:53<00:09, 25.10it/s]

batch 1310 loss: 0.611939725279808


Train, Epoch 2 / 20:  85%|████████▍ | 1323/1563 [00:54<00:09, 25.19it/s]

batch 1320 loss: 0.6414979428052903


Train, Epoch 2 / 20:  85%|████████▌ | 1335/1563 [00:54<00:09, 24.98it/s]

batch 1330 loss: 0.6477953404188156


Train, Epoch 2 / 20:  86%|████████▌ | 1344/1563 [00:55<00:08, 24.59it/s]

batch 1340 loss: 0.6500793993473053


Train, Epoch 2 / 20:  87%|████████▋ | 1353/1563 [00:55<00:08, 24.80it/s]

batch 1350 loss: 0.6128739476203918


Train, Epoch 2 / 20:  87%|████████▋ | 1365/1563 [00:55<00:07, 24.88it/s]

batch 1360 loss: 0.6539217829704285


Train, Epoch 2 / 20:  88%|████████▊ | 1374/1563 [00:56<00:07, 25.05it/s]

batch 1370 loss: 0.6176998734474182


Train, Epoch 2 / 20:  88%|████████▊ | 1383/1563 [00:56<00:07, 24.92it/s]

batch 1380 loss: 0.6766150236129761


Train, Epoch 2 / 20:  89%|████████▉ | 1395/1563 [00:57<00:06, 25.00it/s]

batch 1390 loss: 0.6652104407548904


Train, Epoch 2 / 20:  90%|████████▉ | 1404/1563 [00:57<00:06, 25.03it/s]

batch 1400 loss: 0.6050806373357773


Train, Epoch 2 / 20:  90%|█████████ | 1413/1563 [00:57<00:06, 24.98it/s]

batch 1410 loss: 0.5758646905422211


Train, Epoch 2 / 20:  91%|█████████ | 1425/1563 [00:58<00:05, 25.41it/s]

batch 1420 loss: 0.6545260488986969


Train, Epoch 2 / 20:  92%|█████████▏| 1434/1563 [00:58<00:05, 24.98it/s]

batch 1430 loss: 0.6524553298950195


Train, Epoch 2 / 20:  92%|█████████▏| 1443/1563 [00:59<00:04, 24.83it/s]

batch 1440 loss: 0.6176461517810822


Train, Epoch 2 / 20:  93%|█████████▎| 1452/1563 [00:59<00:04, 24.78it/s]

batch 1450 loss: 0.6473615050315857


Train, Epoch 2 / 20:  94%|█████████▎| 1464/1563 [00:59<00:03, 25.00it/s]

batch 1460 loss: 0.601272714138031


Train, Epoch 2 / 20:  94%|█████████▍| 1473/1563 [01:00<00:03, 25.34it/s]

batch 1470 loss: 0.5899255037307739


Train, Epoch 2 / 20:  95%|█████████▍| 1482/1563 [01:00<00:03, 23.14it/s]

batch 1480 loss: 0.6141925394535065


Train, Epoch 2 / 20:  96%|█████████▌| 1494/1563 [01:01<00:03, 22.81it/s]

batch 1490 loss: 0.5775075078010559


Train, Epoch 2 / 20:  96%|█████████▌| 1503/1563 [01:01<00:02, 23.11it/s]

batch 1500 loss: 0.6194912970066071


Train, Epoch 2 / 20:  97%|█████████▋| 1512/1563 [01:02<00:02, 22.86it/s]

batch 1510 loss: 0.5687196373939514


Train, Epoch 2 / 20:  98%|█████████▊| 1524/1563 [01:02<00:01, 22.82it/s]

batch 1520 loss: 0.5637457072734833


Train, Epoch 2 / 20:  98%|█████████▊| 1533/1563 [01:02<00:01, 24.32it/s]

batch 1530 loss: 0.5951654016971588


Train, Epoch 2 / 20:  99%|█████████▉| 1545/1563 [01:03<00:00, 24.64it/s]

batch 1540 loss: 0.5539894580841065


Train, Epoch 2 / 20:  99%|█████████▉| 1554/1563 [01:03<00:00, 24.73it/s]

batch 1550 loss: 0.6573299944400788


Train, Epoch 2 / 20: 100%|██████████| 1563/1563 [01:04<00:00, 24.38it/s]


batch 1560 loss: 0.588026013970375


Test, Epoch 2 / 20: 100%|██████████| 1563/1563 [00:30<00:00, 51.88it/s]


Epoch 2, loss: 0.6339258936309814, accuracy: 0.64372


Train, Epoch 3 / 20:   1%|          | 15/1563 [00:00<01:02, 24.83it/s]

batch 10 loss: 0.594450694322586


Train, Epoch 3 / 20:   2%|▏         | 24/1563 [00:00<01:02, 24.76it/s]

batch 20 loss: 0.6086046189069748


Train, Epoch 3 / 20:   2%|▏         | 33/1563 [00:01<01:01, 24.98it/s]

batch 30 loss: 0.5919339567422867


Train, Epoch 3 / 20:   3%|▎         | 45/1563 [00:01<01:00, 25.10it/s]

batch 40 loss: 0.6032005429267884


Train, Epoch 3 / 20:   3%|▎         | 54/1563 [00:02<01:00, 24.96it/s]

batch 50 loss: 0.6038023352622985


Train, Epoch 3 / 20:   4%|▍         | 63/1563 [00:02<01:00, 24.70it/s]

batch 60 loss: 0.6390859693288803


Train, Epoch 3 / 20:   5%|▍         | 72/1563 [00:02<01:04, 23.29it/s]

batch 70 loss: 0.6034035384654999


Train, Epoch 3 / 20:   5%|▌         | 84/1563 [00:03<01:05, 22.44it/s]

batch 80 loss: 0.6146588683128357


Train, Epoch 3 / 20:   6%|▌         | 93/1563 [00:03<01:06, 22.04it/s]

batch 90 loss: 0.5810444414615631


Train, Epoch 3 / 20:   7%|▋         | 102/1563 [00:04<01:07, 21.63it/s]

batch 100 loss: 0.6190326452255249


Train, Epoch 3 / 20:   7%|▋         | 114/1563 [00:04<01:04, 22.48it/s]

batch 110 loss: 0.5939807176589966


Train, Epoch 3 / 20:   8%|▊         | 123/1563 [00:05<01:00, 23.92it/s]

batch 120 loss: 0.5837048292160034


Train, Epoch 3 / 20:   9%|▊         | 135/1563 [00:05<00:58, 24.52it/s]

batch 130 loss: 0.6300323814153671


Train, Epoch 3 / 20:   9%|▉         | 144/1563 [00:06<00:56, 24.98it/s]

batch 140 loss: 0.537341856956482


Train, Epoch 3 / 20:  10%|▉         | 153/1563 [00:06<00:56, 25.03it/s]

batch 150 loss: 0.5778036713600159


Train, Epoch 3 / 20:  11%|█         | 165/1563 [00:06<00:55, 25.03it/s]

batch 160 loss: 0.5552140533924103


Train, Epoch 3 / 20:  11%|█         | 174/1563 [00:07<00:55, 25.13it/s]

batch 170 loss: 0.5988677144050598


Train, Epoch 3 / 20:  12%|█▏        | 183/1563 [00:07<00:54, 25.24it/s]

batch 180 loss: 0.6419419467449188


Train, Epoch 3 / 20:  12%|█▏        | 195/1563 [00:08<00:54, 25.10it/s]

batch 190 loss: 0.6781758248806


Train, Epoch 3 / 20:  13%|█▎        | 204/1563 [00:08<00:54, 24.95it/s]

batch 200 loss: 0.6121210157871246


Train, Epoch 3 / 20:  14%|█▎        | 213/1563 [00:08<00:53, 25.25it/s]

batch 210 loss: 0.5796134531497955


Train, Epoch 3 / 20:  14%|█▍        | 225/1563 [00:09<00:53, 25.07it/s]

batch 220 loss: 0.5758360236883163


Train, Epoch 3 / 20:  15%|█▍        | 234/1563 [00:09<00:54, 24.53it/s]

batch 230 loss: 0.5810914546251297


Train, Epoch 3 / 20:  16%|█▌        | 243/1563 [00:10<00:53, 24.87it/s]

batch 240 loss: 0.6679395318031311


Train, Epoch 3 / 20:  16%|█▋        | 255/1563 [00:10<00:52, 24.80it/s]

batch 250 loss: 0.6136500984430313


Train, Epoch 3 / 20:  17%|█▋        | 264/1563 [00:10<00:52, 24.79it/s]

batch 260 loss: 0.7020767748355865


Train, Epoch 3 / 20:  17%|█▋        | 273/1563 [00:11<00:51, 24.92it/s]

batch 270 loss: 0.6373009741306305


Train, Epoch 3 / 20:  18%|█▊        | 285/1563 [00:11<00:50, 25.10it/s]

batch 280 loss: 0.6520551025867463


Train, Epoch 3 / 20:  19%|█▉        | 294/1563 [00:12<00:50, 25.03it/s]

batch 290 loss: 0.5591357797384262


Train, Epoch 3 / 20:  19%|█▉        | 303/1563 [00:12<00:50, 24.82it/s]

batch 300 loss: 0.5097350299358367


Train, Epoch 3 / 20:  20%|██        | 315/1563 [00:12<00:50, 24.84it/s]

batch 310 loss: 0.6400030851364136


Train, Epoch 3 / 20:  21%|██        | 324/1563 [00:13<00:50, 24.68it/s]

batch 320 loss: 0.6019655883312225


Train, Epoch 3 / 20:  21%|██▏       | 333/1563 [00:13<00:49, 24.63it/s]

batch 330 loss: 0.5592771232128143


Train, Epoch 3 / 20:  22%|██▏       | 345/1563 [00:14<00:48, 24.89it/s]

batch 340 loss: 0.5970123827457428


Train, Epoch 3 / 20:  23%|██▎       | 354/1563 [00:14<00:49, 24.57it/s]

batch 350 loss: 0.5483327716588974


Train, Epoch 3 / 20:  23%|██▎       | 363/1563 [00:14<00:50, 23.96it/s]

batch 360 loss: 0.5734293580055236


Train, Epoch 3 / 20:  24%|██▍       | 372/1563 [00:15<00:51, 22.98it/s]

batch 370 loss: 0.5923457652330398


Train, Epoch 3 / 20:  25%|██▍       | 384/1563 [00:15<00:52, 22.38it/s]

batch 380 loss: 0.6231544822454452


Train, Epoch 3 / 20:  25%|██▌       | 393/1563 [00:16<00:53, 22.02it/s]

batch 390 loss: 0.5862207293510437


Train, Epoch 3 / 20:  26%|██▌       | 402/1563 [00:16<00:53, 21.80it/s]

batch 400 loss: 0.6373190075159073


Train, Epoch 3 / 20:  26%|██▋       | 414/1563 [00:17<00:49, 23.13it/s]

batch 410 loss: 0.5714088648557663


Train, Epoch 3 / 20:  27%|██▋       | 423/1563 [00:17<00:46, 24.33it/s]

batch 420 loss: 0.5721567422151566


Train, Epoch 3 / 20:  28%|██▊       | 435/1563 [00:18<00:45, 24.91it/s]

batch 430 loss: 0.5227076441049576


Train, Epoch 3 / 20:  28%|██▊       | 444/1563 [00:18<00:44, 25.04it/s]

batch 440 loss: 0.6425546646118164


Train, Epoch 3 / 20:  29%|██▉       | 453/1563 [00:18<00:44, 24.70it/s]

batch 450 loss: 0.5810176193714142


Train, Epoch 3 / 20:  30%|██▉       | 465/1563 [00:19<00:44, 24.77it/s]

batch 460 loss: 0.6345710635185242


Train, Epoch 3 / 20:  30%|███       | 474/1563 [00:19<00:43, 24.85it/s]

batch 470 loss: 0.5856474429368973


Train, Epoch 3 / 20:  31%|███       | 483/1563 [00:19<00:43, 24.83it/s]

batch 480 loss: 0.6388240247964859


Train, Epoch 3 / 20:  32%|███▏      | 495/1563 [00:20<00:43, 24.82it/s]

batch 490 loss: 0.5907736599445343


Train, Epoch 3 / 20:  32%|███▏      | 504/1563 [00:20<00:42, 24.92it/s]

batch 500 loss: 0.5797511547803879


Train, Epoch 3 / 20:  33%|███▎      | 513/1563 [00:21<00:42, 24.98it/s]

batch 510 loss: 0.535437262058258


Train, Epoch 3 / 20:  33%|███▎      | 522/1563 [00:21<00:42, 24.21it/s]

batch 520 loss: 0.5631977170705795


Train, Epoch 3 / 20:  34%|███▍      | 534/1563 [00:22<00:41, 24.55it/s]

batch 530 loss: 0.6028509080410004


Train, Epoch 3 / 20:  35%|███▍      | 543/1563 [00:22<00:41, 24.67it/s]

batch 540 loss: 0.6186395913362503


Train, Epoch 3 / 20:  35%|███▌      | 552/1563 [00:22<00:41, 24.43it/s]

batch 550 loss: 0.5844007432460785


Train, Epoch 3 / 20:  36%|███▌      | 564/1563 [00:23<00:40, 24.65it/s]

batch 560 loss: 0.5980430752038955


Train, Epoch 3 / 20:  37%|███▋      | 573/1563 [00:23<00:39, 24.84it/s]

batch 570 loss: 0.5930151760578155


Train, Epoch 3 / 20:  37%|███▋      | 585/1563 [00:24<00:39, 24.94it/s]

batch 580 loss: 0.6207773298025131


Train, Epoch 3 / 20:  38%|███▊      | 594/1563 [00:24<00:38, 25.18it/s]

batch 590 loss: 0.6003203451633453


Train, Epoch 3 / 20:  39%|███▊      | 603/1563 [00:24<00:38, 25.06it/s]

batch 600 loss: 0.587296849489212


Train, Epoch 3 / 20:  39%|███▉      | 615/1563 [00:25<00:37, 25.00it/s]

batch 610 loss: 0.5994454354047776


Train, Epoch 3 / 20:  40%|███▉      | 624/1563 [00:25<00:37, 24.99it/s]

batch 620 loss: 0.6265359044075012


Train, Epoch 3 / 20:  40%|████      | 633/1563 [00:26<00:37, 24.76it/s]

batch 630 loss: 0.604433798789978


Train, Epoch 3 / 20:  41%|████      | 642/1563 [00:26<00:37, 24.75it/s]

batch 640 loss: 0.6423530012369156


Train, Epoch 3 / 20:  42%|████▏     | 654/1563 [00:26<00:37, 24.45it/s]

batch 650 loss: 0.6428354471921921


Train, Epoch 3 / 20:  42%|████▏     | 663/1563 [00:27<00:39, 22.81it/s]

batch 660 loss: 0.608096319437027


Train, Epoch 3 / 20:  43%|████▎     | 672/1563 [00:27<00:38, 22.86it/s]

batch 670 loss: 0.5882016956806183


Train, Epoch 3 / 20:  44%|████▍     | 684/1563 [00:28<00:38, 22.71it/s]

batch 680 loss: 0.5575130164623261


Train, Epoch 3 / 20:  44%|████▍     | 693/1563 [00:28<00:39, 22.18it/s]

batch 690 loss: 0.6233854591846466


Train, Epoch 3 / 20:  45%|████▌     | 705/1563 [00:29<00:37, 22.82it/s]

batch 700 loss: 0.5567613780498505


Train, Epoch 3 / 20:  46%|████▌     | 714/1563 [00:29<00:35, 24.16it/s]

batch 710 loss: 0.5941453099250793


Train, Epoch 3 / 20:  46%|████▋     | 723/1563 [00:29<00:34, 24.61it/s]

batch 720 loss: 0.554195785522461


Train, Epoch 3 / 20:  47%|████▋     | 735/1563 [00:30<00:32, 25.10it/s]

batch 730 loss: 0.5863845646381378


Train, Epoch 3 / 20:  48%|████▊     | 744/1563 [00:30<00:32, 24.99it/s]

batch 740 loss: 0.5831216484308243


Train, Epoch 3 / 20:  48%|████▊     | 753/1563 [00:31<00:32, 24.86it/s]

batch 750 loss: 0.6168103307485581


Train, Epoch 3 / 20:  49%|████▉     | 765/1563 [00:31<00:31, 24.97it/s]

batch 760 loss: 0.5929149150848388


Train, Epoch 3 / 20:  50%|████▉     | 774/1563 [00:31<00:31, 25.08it/s]

batch 770 loss: 0.6400935739278794


Train, Epoch 3 / 20:  50%|█████     | 783/1563 [00:32<00:31, 24.81it/s]

batch 780 loss: 0.6412826120853424


Train, Epoch 3 / 20:  51%|█████     | 795/1563 [00:32<00:30, 25.16it/s]

batch 790 loss: 0.5542660832405091


Train, Epoch 3 / 20:  51%|█████▏    | 804/1563 [00:33<00:30, 24.95it/s]

batch 800 loss: 0.6430494397878647


Train, Epoch 3 / 20:  52%|█████▏    | 813/1563 [00:33<00:30, 24.84it/s]

batch 810 loss: 0.6183680891990662


Train, Epoch 3 / 20:  53%|█████▎    | 825/1563 [00:33<00:29, 25.08it/s]

batch 820 loss: 0.572178053855896


Train, Epoch 3 / 20:  53%|█████▎    | 834/1563 [00:34<00:29, 25.04it/s]

batch 830 loss: 0.6125564634799957


Train, Epoch 3 / 20:  54%|█████▍    | 843/1563 [00:34<00:28, 25.20it/s]

batch 840 loss: 0.5263090074062348


Train, Epoch 3 / 20:  55%|█████▍    | 855/1563 [00:35<00:28, 24.99it/s]

batch 850 loss: 0.5883295834064484


Train, Epoch 3 / 20:  55%|█████▌    | 864/1563 [00:35<00:27, 24.98it/s]

batch 860 loss: 0.5703026473522186


Train, Epoch 3 / 20:  56%|█████▌    | 873/1563 [00:35<00:27, 25.02it/s]

batch 870 loss: 0.5669755429029465


Train, Epoch 3 / 20:  57%|█████▋    | 885/1563 [00:36<00:27, 25.05it/s]

batch 880 loss: 0.5331174850463867


Train, Epoch 3 / 20:  57%|█████▋    | 894/1563 [00:36<00:26, 24.90it/s]

batch 890 loss: 0.5735321283340454


Train, Epoch 3 / 20:  58%|█████▊    | 903/1563 [00:37<00:26, 24.93it/s]

batch 900 loss: 0.5751460671424866


Train, Epoch 3 / 20:  59%|█████▊    | 915/1563 [00:37<00:26, 24.82it/s]

batch 910 loss: 0.5322583526372909


Train, Epoch 3 / 20:  59%|█████▉    | 924/1563 [00:37<00:25, 24.77it/s]

batch 920 loss: 0.5647501528263092


Train, Epoch 3 / 20:  60%|█████▉    | 933/1563 [00:38<00:25, 24.55it/s]

batch 930 loss: 0.656936663389206


Train, Epoch 3 / 20:  60%|██████    | 945/1563 [00:38<00:24, 24.96it/s]

batch 940 loss: 0.538603526353836


Train, Epoch 3 / 20:  61%|██████    | 954/1563 [00:39<00:26, 23.01it/s]

batch 950 loss: 0.5294426798820495


Train, Epoch 3 / 20:  62%|██████▏   | 963/1563 [00:39<00:26, 22.42it/s]

batch 960 loss: 0.644571852684021


Train, Epoch 3 / 20:  62%|██████▏   | 972/1563 [00:40<00:26, 22.06it/s]

batch 970 loss: 0.6306870311498642


Train, Epoch 3 / 20:  63%|██████▎   | 984/1563 [00:40<00:26, 21.89it/s]

batch 980 loss: 0.651567667722702


Train, Epoch 3 / 20:  64%|██████▎   | 993/1563 [00:40<00:25, 21.96it/s]

batch 990 loss: 0.6155287325382233


Train, Epoch 3 / 20:  64%|██████▍   | 1005/1563 [00:41<00:23, 23.93it/s]

batch 1000 loss: 0.5648891299962997


Train, Epoch 3 / 20:  65%|██████▍   | 1014/1563 [00:41<00:22, 24.41it/s]

batch 1010 loss: 0.5769721150398255


Train, Epoch 3 / 20:  65%|██████▌   | 1023/1563 [00:42<00:21, 24.71it/s]

batch 1020 loss: 0.589425465464592


Train, Epoch 3 / 20:  66%|██████▌   | 1035/1563 [00:42<00:21, 24.74it/s]

batch 1030 loss: 0.5857928216457366


Train, Epoch 3 / 20:  67%|██████▋   | 1044/1563 [00:43<00:20, 24.86it/s]

batch 1040 loss: 0.5699395149946213


Train, Epoch 3 / 20:  67%|██████▋   | 1053/1563 [00:43<00:20, 24.94it/s]

batch 1050 loss: 0.6081337541341781


Train, Epoch 3 / 20:  68%|██████▊   | 1065/1563 [00:43<00:19, 25.01it/s]

batch 1060 loss: 0.5562482476234436


Train, Epoch 3 / 20:  69%|██████▊   | 1074/1563 [00:44<00:19, 25.02it/s]

batch 1070 loss: 0.5950840562582016


Train, Epoch 3 / 20:  69%|██████▉   | 1083/1563 [00:44<00:19, 24.88it/s]

batch 1080 loss: 0.5837171107530594


Train, Epoch 3 / 20:  70%|███████   | 1095/1563 [00:45<00:18, 24.73it/s]

batch 1090 loss: 0.5875259101390838


Train, Epoch 3 / 20:  71%|███████   | 1104/1563 [00:45<00:18, 24.80it/s]

batch 1100 loss: 0.5993185549974441


Train, Epoch 3 / 20:  71%|███████   | 1113/1563 [00:45<00:18, 24.87it/s]

batch 1110 loss: 0.6204288095235825


Train, Epoch 3 / 20:  72%|███████▏  | 1125/1563 [00:46<00:17, 24.94it/s]

batch 1120 loss: 0.5900187313556671


Train, Epoch 3 / 20:  73%|███████▎  | 1134/1563 [00:46<00:17, 25.09it/s]

batch 1130 loss: 0.5826228946447373


Train, Epoch 3 / 20:  73%|███████▎  | 1143/1563 [00:47<00:16, 24.97it/s]

batch 1140 loss: 0.609310981631279


Train, Epoch 3 / 20:  74%|███████▍  | 1155/1563 [00:47<00:16, 24.93it/s]

batch 1150 loss: 0.5926844298839569


Train, Epoch 3 / 20:  74%|███████▍  | 1164/1563 [00:47<00:15, 24.99it/s]

batch 1160 loss: 0.5982393652200699


Train, Epoch 3 / 20:  75%|███████▌  | 1173/1563 [00:48<00:15, 25.07it/s]

batch 1170 loss: 0.5843084514141083


Train, Epoch 3 / 20:  76%|███████▌  | 1182/1563 [00:48<00:15, 25.04it/s]

batch 1180 loss: 0.6222495317459107


Train, Epoch 3 / 20:  76%|███████▋  | 1194/1563 [00:49<00:14, 24.91it/s]

batch 1190 loss: 0.5585022628307342


Train, Epoch 3 / 20:  77%|███████▋  | 1203/1563 [00:49<00:14, 24.79it/s]

batch 1200 loss: 0.5066465586423874


Train, Epoch 3 / 20:  78%|███████▊  | 1215/1563 [00:49<00:13, 24.93it/s]

batch 1210 loss: 0.541911992430687


Train, Epoch 3 / 20:  78%|███████▊  | 1224/1563 [00:50<00:13, 24.57it/s]

batch 1220 loss: 0.5982578575611115


Train, Epoch 3 / 20:  79%|███████▉  | 1233/1563 [00:50<00:13, 24.65it/s]

batch 1230 loss: 0.5770551204681397


Train, Epoch 3 / 20:  79%|███████▉  | 1242/1563 [00:51<00:12, 24.74it/s]

batch 1240 loss: 0.5164845913648606


Train, Epoch 3 / 20:  80%|████████  | 1254/1563 [00:51<00:13, 22.60it/s]

batch 1250 loss: 0.54032461643219


Train, Epoch 3 / 20:  81%|████████  | 1263/1563 [00:51<00:13, 22.57it/s]

batch 1260 loss: 0.5254996150732041


Train, Epoch 3 / 20:  81%|████████▏ | 1272/1563 [00:52<00:13, 22.28it/s]

batch 1270 loss: 0.5130504041910171


Train, Epoch 3 / 20:  82%|████████▏ | 1284/1563 [00:52<00:12, 22.33it/s]

batch 1280 loss: 0.6238665461540223


Train, Epoch 3 / 20:  83%|████████▎ | 1293/1563 [00:53<00:12, 22.36it/s]

batch 1290 loss: 0.5688113957643509


Train, Epoch 3 / 20:  83%|████████▎ | 1305/1563 [00:53<00:10, 24.31it/s]

batch 1300 loss: 0.6029879331588746


Train, Epoch 3 / 20:  84%|████████▍ | 1314/1563 [00:54<00:10, 24.74it/s]

batch 1310 loss: 0.5760821312665939


Train, Epoch 3 / 20:  85%|████████▍ | 1323/1563 [00:54<00:09, 25.09it/s]

batch 1320 loss: 0.5431099086999893


Train, Epoch 3 / 20:  85%|████████▌ | 1335/1563 [00:54<00:09, 24.92it/s]

batch 1330 loss: 0.5148802697658539


Train, Epoch 3 / 20:  86%|████████▌ | 1344/1563 [00:55<00:08, 24.80it/s]

batch 1340 loss: 0.5586855411529541


Train, Epoch 3 / 20:  87%|████████▋ | 1353/1563 [00:55<00:08, 24.62it/s]

batch 1350 loss: 0.5375571876764298


Train, Epoch 3 / 20:  87%|████████▋ | 1365/1563 [00:56<00:07, 24.91it/s]

batch 1360 loss: 0.5694700062274933


Train, Epoch 3 / 20:  88%|████████▊ | 1374/1563 [00:56<00:07, 24.91it/s]

batch 1370 loss: 0.6696659684181213


Train, Epoch 3 / 20:  88%|████████▊ | 1383/1563 [00:56<00:07, 24.78it/s]

batch 1380 loss: 0.5628407090902329


Train, Epoch 3 / 20:  89%|████████▉ | 1395/1563 [00:57<00:06, 25.03it/s]

batch 1390 loss: 0.4837847888469696


Train, Epoch 3 / 20:  90%|████████▉ | 1404/1563 [00:57<00:06, 24.92it/s]

batch 1400 loss: 0.6277568340301514


Train, Epoch 3 / 20:  90%|█████████ | 1413/1563 [00:58<00:06, 24.86it/s]

batch 1410 loss: 0.595442259311676


Train, Epoch 3 / 20:  91%|█████████ | 1425/1563 [00:58<00:05, 24.91it/s]

batch 1420 loss: 0.5713861137628555


Train, Epoch 3 / 20:  92%|█████████▏| 1434/1563 [00:58<00:05, 24.81it/s]

batch 1430 loss: 0.6129629582166671


Train, Epoch 3 / 20:  92%|█████████▏| 1443/1563 [00:59<00:04, 25.05it/s]

batch 1440 loss: 0.5959571421146392


Train, Epoch 3 / 20:  93%|█████████▎| 1452/1563 [00:59<00:04, 25.00it/s]

batch 1450 loss: 0.5411625564098358


Train, Epoch 3 / 20:  94%|█████████▎| 1464/1563 [01:00<00:03, 24.82it/s]

batch 1460 loss: 0.6042114973068238


Train, Epoch 3 / 20:  94%|█████████▍| 1473/1563 [01:00<00:03, 24.90it/s]

batch 1470 loss: 0.6337014019489289


Train, Epoch 3 / 20:  95%|█████████▌| 1485/1563 [01:01<00:03, 24.61it/s]

batch 1480 loss: 0.5230951070785522


Train, Epoch 3 / 20:  96%|█████████▌| 1494/1563 [01:01<00:02, 24.75it/s]

batch 1490 loss: 0.5344861209392547


Train, Epoch 3 / 20:  96%|█████████▌| 1503/1563 [01:01<00:02, 24.36it/s]

batch 1500 loss: 0.6776803433895111


Train, Epoch 3 / 20:  97%|█████████▋| 1512/1563 [01:02<00:02, 24.77it/s]

batch 1510 loss: 0.5717013627290726


Train, Epoch 3 / 20:  98%|█████████▊| 1524/1563 [01:02<00:01, 24.63it/s]

batch 1520 loss: 0.5076352506875992


Train, Epoch 3 / 20:  98%|█████████▊| 1533/1563 [01:02<00:01, 24.59it/s]

batch 1530 loss: 0.5414245992898941


Train, Epoch 3 / 20:  99%|█████████▊| 1542/1563 [01:03<00:00, 23.76it/s]

batch 1540 loss: 0.5868131816387177


Train, Epoch 3 / 20:  99%|█████████▉| 1554/1563 [01:03<00:00, 22.56it/s]

batch 1550 loss: 0.6378661334514618


Train, Epoch 3 / 20: 100%|██████████| 1563/1563 [01:04<00:00, 24.31it/s]


batch 1560 loss: 0.5672443926334381


Test, Epoch 3 / 20: 100%|██████████| 1563/1563 [00:30<00:00, 51.13it/s]


Epoch 3, loss: 0.5743291412162781, accuracy: 0.69576


Train, Epoch 4 / 20:   1%|          | 15/1563 [00:00<01:01, 25.05it/s]

batch 10 loss: 0.5325159281492233


Train, Epoch 4 / 20:   2%|▏         | 24/1563 [00:00<01:01, 25.12it/s]

batch 20 loss: 0.5063532263040542


Train, Epoch 4 / 20:   2%|▏         | 33/1563 [00:01<01:01, 24.87it/s]

batch 30 loss: 0.5536741733551025


Train, Epoch 4 / 20:   3%|▎         | 45/1563 [00:01<01:01, 24.87it/s]

batch 40 loss: 0.5660856515169144


Train, Epoch 4 / 20:   3%|▎         | 54/1563 [00:02<01:00, 24.78it/s]

batch 50 loss: 0.6053462266921997


Train, Epoch 4 / 20:   4%|▍         | 63/1563 [00:02<01:00, 24.82it/s]

batch 60 loss: 0.5107677042484283


Train, Epoch 4 / 20:   5%|▍         | 72/1563 [00:02<01:01, 24.39it/s]

batch 70 loss: 0.5764912754297257


Train, Epoch 4 / 20:   5%|▌         | 84/1563 [00:03<00:59, 24.81it/s]

batch 80 loss: 0.5768588334321976


Train, Epoch 4 / 20:   6%|▌         | 93/1563 [00:03<00:59, 24.72it/s]

batch 90 loss: 0.5524179875850678


Train, Epoch 4 / 20:   7%|▋         | 105/1563 [00:04<00:58, 24.81it/s]

batch 100 loss: 0.5827014118432998


Train, Epoch 4 / 20:   7%|▋         | 114/1563 [00:04<00:58, 24.65it/s]

batch 110 loss: 0.5466916710138321


Train, Epoch 4 / 20:   8%|▊         | 123/1563 [00:04<01:01, 23.44it/s]

batch 120 loss: 0.614558282494545


Train, Epoch 4 / 20:   8%|▊         | 132/1563 [00:05<01:04, 22.23it/s]

batch 130 loss: 0.5737560212612152


Train, Epoch 4 / 20:   9%|▉         | 144/1563 [00:05<01:04, 22.02it/s]

batch 140 loss: 0.5937600761651993


Train, Epoch 4 / 20:  10%|▉         | 153/1563 [00:06<01:03, 22.17it/s]

batch 150 loss: 0.6104830831289292


Train, Epoch 4 / 20:  10%|█         | 162/1563 [00:06<01:04, 21.78it/s]

batch 160 loss: 0.6365291088819504


Train, Epoch 4 / 20:  11%|█         | 174/1563 [00:07<01:00, 23.10it/s]

batch 170 loss: 0.5059424966573716


Train, Epoch 4 / 20:  12%|█▏        | 183/1563 [00:07<00:56, 24.32it/s]

batch 180 loss: 0.5541404128074646


Train, Epoch 4 / 20:  12%|█▏        | 195/1563 [00:08<00:55, 24.73it/s]

batch 190 loss: 0.5401996076107025


Train, Epoch 4 / 20:  13%|█▎        | 204/1563 [00:08<00:55, 24.67it/s]

batch 200 loss: 0.5220323085784913


Train, Epoch 4 / 20:  14%|█▎        | 213/1563 [00:08<00:54, 24.72it/s]

batch 210 loss: 0.5475835740566254


Train, Epoch 4 / 20:  14%|█▍        | 225/1563 [00:09<00:54, 24.54it/s]

batch 220 loss: 0.57273188829422


Train, Epoch 4 / 20:  15%|█▍        | 234/1563 [00:09<00:53, 24.89it/s]

batch 230 loss: 0.480910724401474


Train, Epoch 4 / 20:  16%|█▌        | 243/1563 [00:10<00:54, 24.44it/s]

batch 240 loss: 0.5753901541233063


Train, Epoch 4 / 20:  16%|█▋        | 255/1563 [00:10<00:52, 24.72it/s]

batch 250 loss: 0.4917940378189087


Train, Epoch 4 / 20:  17%|█▋        | 264/1563 [00:10<00:52, 24.84it/s]

batch 260 loss: 0.5378265023231507


Train, Epoch 4 / 20:  17%|█▋        | 273/1563 [00:11<00:52, 24.73it/s]

batch 270 loss: 0.48734311759471893


Train, Epoch 4 / 20:  18%|█▊        | 285/1563 [00:11<00:51, 24.68it/s]

batch 280 loss: 0.5150344669818878


Train, Epoch 4 / 20:  19%|█▉        | 294/1563 [00:12<00:52, 24.02it/s]

batch 290 loss: 0.5573519736528396


Train, Epoch 4 / 20:  19%|█▉        | 303/1563 [00:12<00:51, 24.39it/s]

batch 300 loss: 0.6123328864574432


Train, Epoch 4 / 20:  20%|██        | 315/1563 [00:13<00:50, 24.67it/s]

batch 310 loss: 0.5495894968509674


Train, Epoch 4 / 20:  21%|██        | 324/1563 [00:13<00:50, 24.70it/s]

batch 320 loss: 0.5963294595479965


Train, Epoch 4 / 20:  21%|██▏       | 333/1563 [00:13<00:49, 24.78it/s]

batch 330 loss: 0.527406656742096


Train, Epoch 4 / 20:  22%|██▏       | 345/1563 [00:14<00:49, 24.67it/s]

batch 340 loss: 0.531528303027153


Train, Epoch 4 / 20:  23%|██▎       | 354/1563 [00:14<00:48, 24.77it/s]

batch 350 loss: 0.5233925700187683


Train, Epoch 4 / 20:  23%|██▎       | 363/1563 [00:14<00:48, 24.83it/s]

batch 360 loss: 0.5858026146888733


Train, Epoch 4 / 20:  24%|██▍       | 375/1563 [00:15<00:47, 24.89it/s]

batch 370 loss: 0.5078790217638016


Train, Epoch 4 / 20:  25%|██▍       | 384/1563 [00:15<00:47, 24.91it/s]

batch 380 loss: 0.5433731406927109


Train, Epoch 4 / 20:  25%|██▌       | 393/1563 [00:16<00:46, 24.91it/s]

batch 390 loss: 0.6398198872804641


Train, Epoch 4 / 20:  26%|██▌       | 405/1563 [00:16<00:47, 24.64it/s]

batch 400 loss: 0.5591202288866043


Train, Epoch 4 / 20:  26%|██▋       | 414/1563 [00:17<00:46, 24.60it/s]

batch 410 loss: 0.6555543005466461


Train, Epoch 4 / 20:  27%|██▋       | 423/1563 [00:17<00:51, 22.26it/s]

batch 420 loss: 0.5623994052410126


Train, Epoch 4 / 20:  28%|██▊       | 432/1563 [00:17<00:51, 21.76it/s]

batch 430 loss: 0.529457899928093


Train, Epoch 4 / 20:  28%|██▊       | 444/1563 [00:18<00:49, 22.53it/s]

batch 440 loss: 0.5728715360164642


Train, Epoch 4 / 20:  29%|██▉       | 453/1563 [00:18<00:49, 22.51it/s]

batch 450 loss: 0.5431780070066452


Train, Epoch 4 / 20:  30%|██▉       | 462/1563 [00:19<00:49, 22.13it/s]

batch 460 loss: 0.6066113442182541


Train, Epoch 4 / 20:  30%|███       | 474/1563 [00:19<00:45, 23.84it/s]

batch 470 loss: 0.5573313713073731


Train, Epoch 4 / 20:  31%|███       | 483/1563 [00:20<00:44, 24.12it/s]

batch 480 loss: 0.572783249616623


Train, Epoch 4 / 20:  32%|███▏      | 495/1563 [00:20<00:44, 24.20it/s]

batch 490 loss: 0.5465877622365951


Train, Epoch 4 / 20:  32%|███▏      | 504/1563 [00:20<00:43, 24.51it/s]

batch 500 loss: 0.5431646823883056


Train, Epoch 4 / 20:  33%|███▎      | 513/1563 [00:21<00:42, 24.75it/s]

batch 510 loss: 0.49362927079200747


Train, Epoch 4 / 20:  34%|███▎      | 525/1563 [00:21<00:41, 24.75it/s]

batch 520 loss: 0.5459579825401306


Train, Epoch 4 / 20:  34%|███▍      | 534/1563 [00:22<00:41, 24.79it/s]

batch 530 loss: 0.531113937497139


Train, Epoch 4 / 20:  35%|███▍      | 543/1563 [00:22<00:41, 24.64it/s]

batch 540 loss: 0.605018076300621


Train, Epoch 4 / 20:  36%|███▌      | 555/1563 [00:23<00:40, 24.77it/s]

batch 550 loss: 0.5606294512748718


Train, Epoch 4 / 20:  36%|███▌      | 564/1563 [00:23<00:40, 24.75it/s]

batch 560 loss: 0.5834782361984253


Train, Epoch 4 / 20:  37%|███▋      | 573/1563 [00:23<00:39, 24.80it/s]

batch 570 loss: 0.5316042572259903


Train, Epoch 4 / 20:  37%|███▋      | 585/1563 [00:24<00:39, 24.90it/s]

batch 580 loss: 0.6058195054531097


Train, Epoch 4 / 20:  38%|███▊      | 594/1563 [00:24<00:38, 24.89it/s]

batch 590 loss: 0.5315994471311569


Train, Epoch 4 / 20:  39%|███▊      | 603/1563 [00:24<00:39, 24.08it/s]

batch 600 loss: 0.5503942102193833


Train, Epoch 4 / 20:  39%|███▉      | 612/1563 [00:25<00:46, 20.53it/s]

batch 610 loss: 0.48723545372486116


Train, Epoch 4 / 20:  40%|███▉      | 624/1563 [00:25<00:43, 21.47it/s]

batch 620 loss: 0.52403684258461


Train, Epoch 4 / 20:  40%|████      | 630/1563 [00:26<00:57, 16.14it/s]

batch 630 loss: 0.5024805709719657


Train, Epoch 4 / 20:  41%|████      | 643/1563 [00:27<00:51, 17.94it/s]

batch 640 loss: 0.5676780253648758


Train, Epoch 4 / 20:  42%|████▏     | 655/1563 [00:27<00:39, 23.00it/s]

batch 650 loss: 0.5342420995235443


Train, Epoch 4 / 20:  42%|████▏     | 664/1563 [00:28<00:37, 24.05it/s]

batch 660 loss: 0.5368284732103348


Train, Epoch 4 / 20:  43%|████▎     | 673/1563 [00:28<00:36, 24.63it/s]

batch 670 loss: 0.49089536964893343


Train, Epoch 4 / 20:  44%|████▍     | 685/1563 [00:28<00:35, 24.83it/s]

batch 680 loss: 0.591818705201149


Train, Epoch 4 / 20:  44%|████▍     | 694/1563 [00:29<00:35, 24.60it/s]

batch 690 loss: 0.6245617091655731


Train, Epoch 4 / 20:  45%|████▍     | 703/1563 [00:29<00:37, 22.80it/s]

batch 700 loss: 0.5477097421884537


Train, Epoch 4 / 20:  46%|████▌     | 712/1563 [00:30<00:37, 22.58it/s]

batch 710 loss: 0.6026766121387481


Train, Epoch 4 / 20:  46%|████▌     | 722/1563 [00:30<00:58, 14.27it/s]

batch 720 loss: 0.5648225039243698


Train, Epoch 4 / 20:  47%|████▋     | 732/1563 [00:31<01:08, 12.08it/s]

batch 730 loss: 0.6100983828306198


Train, Epoch 4 / 20:  47%|████▋     | 742/1563 [00:32<00:45, 17.87it/s]

batch 740 loss: 0.5932612478733063


Train, Epoch 4 / 20:  48%|████▊     | 754/1563 [00:32<00:35, 22.49it/s]

batch 750 loss: 0.6110513269901275


Train, Epoch 4 / 20:  49%|████▉     | 763/1563 [00:33<00:33, 24.06it/s]

batch 760 loss: 0.5481884628534317


Train, Epoch 4 / 20:  50%|████▉     | 775/1563 [00:33<00:31, 24.64it/s]

batch 770 loss: 0.5501746475696564


Train, Epoch 4 / 20:  50%|█████     | 784/1563 [00:34<00:31, 24.58it/s]

batch 780 loss: 0.5028259307146072


Train, Epoch 4 / 20:  51%|█████     | 793/1563 [00:34<00:31, 24.51it/s]

batch 790 loss: 0.4903596371412277


Train, Epoch 4 / 20:  52%|█████▏    | 805/1563 [00:34<00:30, 24.72it/s]

batch 800 loss: 0.5705315113067627


Train, Epoch 4 / 20:  52%|█████▏    | 814/1563 [00:35<00:30, 24.70it/s]

batch 810 loss: 0.5839883983135223


Train, Epoch 4 / 20:  53%|█████▎    | 823/1563 [00:35<00:30, 24.44it/s]

batch 820 loss: 0.4980076432228088


Train, Epoch 4 / 20:  53%|█████▎    | 835/1563 [00:36<00:29, 24.73it/s]

batch 830 loss: 0.5316472083330155


Train, Epoch 4 / 20:  54%|█████▍    | 844/1563 [00:36<00:28, 24.90it/s]

batch 840 loss: 0.509066817164421


Train, Epoch 4 / 20:  55%|█████▍    | 853/1563 [00:36<00:28, 24.67it/s]

batch 850 loss: 0.7338328391313553


Train, Epoch 4 / 20:  55%|█████▌    | 865/1563 [00:37<00:28, 24.78it/s]

batch 860 loss: 0.5344974339008332


Train, Epoch 4 / 20:  56%|█████▌    | 874/1563 [00:37<00:27, 24.64it/s]

batch 870 loss: 0.5813308000564575


Train, Epoch 4 / 20:  56%|█████▋    | 883/1563 [00:38<00:27, 24.60it/s]

batch 880 loss: 0.5571399897336959


Train, Epoch 4 / 20:  57%|█████▋    | 895/1563 [00:38<00:26, 24.82it/s]

batch 890 loss: 0.527638390660286


Train, Epoch 4 / 20:  58%|█████▊    | 904/1563 [00:38<00:26, 24.83it/s]

batch 900 loss: 0.5541907250881195


Train, Epoch 4 / 20:  58%|█████▊    | 913/1563 [00:39<00:26, 24.68it/s]

batch 910 loss: 0.5696552217006683


Train, Epoch 4 / 20:  59%|█████▉    | 922/1563 [00:39<00:26, 24.37it/s]

batch 920 loss: 0.5431881964206695


Train, Epoch 4 / 20:  60%|█████▉    | 934/1563 [00:40<00:25, 24.65it/s]

batch 930 loss: 0.585565248131752


Train, Epoch 4 / 20:  60%|██████    | 943/1563 [00:40<00:24, 24.89it/s]

batch 940 loss: 0.5519728928804397


Train, Epoch 4 / 20:  61%|██████    | 952/1563 [00:40<00:24, 24.91it/s]

batch 950 loss: 0.5117693424224854


Train, Epoch 4 / 20:  62%|██████▏   | 964/1563 [00:41<00:24, 24.82it/s]

batch 960 loss: 0.5491445899009705


Train, Epoch 4 / 20:  62%|██████▏   | 973/1563 [00:41<00:24, 24.47it/s]

batch 970 loss: 0.520336464047432


Train, Epoch 4 / 20:  63%|██████▎   | 982/1563 [00:42<00:24, 23.26it/s]

batch 980 loss: 0.5561848640441894


Train, Epoch 4 / 20:  64%|██████▎   | 994/1563 [00:42<00:24, 22.98it/s]

batch 990 loss: 0.5560734301805497


Train, Epoch 4 / 20:  64%|██████▍   | 1003/1563 [00:43<00:26, 21.52it/s]

batch 1000 loss: 0.5837470769882203


Train, Epoch 4 / 20:  65%|██████▍   | 1012/1563 [00:43<00:25, 22.00it/s]

batch 1010 loss: 0.5441390901803971


Train, Epoch 4 / 20:  66%|██████▌   | 1024/1563 [00:44<00:24, 21.88it/s]

batch 1020 loss: 0.5212848424911499


Train, Epoch 4 / 20:  66%|██████▌   | 1033/1563 [00:44<00:22, 23.64it/s]

batch 1030 loss: 0.5585695803165436


Train, Epoch 4 / 20:  67%|██████▋   | 1045/1563 [00:44<00:21, 24.57it/s]

batch 1040 loss: 0.4690635442733765


Train, Epoch 4 / 20:  67%|██████▋   | 1054/1563 [00:45<00:20, 24.61it/s]

batch 1050 loss: 0.5939170062541962


Train, Epoch 4 / 20:  68%|██████▊   | 1063/1563 [00:45<00:20, 24.75it/s]

batch 1060 loss: 0.5016900837421417


Train, Epoch 4 / 20:  69%|██████▉   | 1075/1563 [00:46<00:19, 24.56it/s]

batch 1070 loss: 0.5146406590938568


Train, Epoch 4 / 20:  69%|██████▉   | 1084/1563 [00:46<00:19, 24.81it/s]

batch 1080 loss: 0.5536737233400345


Train, Epoch 4 / 20:  70%|██████▉   | 1093/1563 [00:46<00:18, 24.84it/s]

batch 1090 loss: 0.6814760744571686


Train, Epoch 4 / 20:  71%|███████   | 1105/1563 [00:47<00:18, 24.70it/s]

batch 1100 loss: 0.43525520265102385


Train, Epoch 4 / 20:  71%|███████▏  | 1114/1563 [00:47<00:18, 24.56it/s]

batch 1110 loss: 0.5578436613082886


Train, Epoch 4 / 20:  72%|███████▏  | 1123/1563 [00:48<00:17, 24.52it/s]

batch 1120 loss: 0.5771253138780594


Train, Epoch 4 / 20:  72%|███████▏  | 1132/1563 [00:48<00:17, 24.55it/s]

batch 1130 loss: 0.5885362446308136


Train, Epoch 4 / 20:  73%|███████▎  | 1144/1563 [00:48<00:16, 24.80it/s]

batch 1140 loss: 0.5063028752803802


Train, Epoch 4 / 20:  74%|███████▍  | 1153/1563 [00:49<00:16, 24.75it/s]

batch 1150 loss: 0.5720522582530976


Train, Epoch 4 / 20:  75%|███████▍  | 1165/1563 [00:49<00:16, 24.68it/s]

batch 1160 loss: 0.5210839152336121


Train, Epoch 4 / 20:  75%|███████▌  | 1174/1563 [00:50<00:15, 24.66it/s]

batch 1170 loss: 0.5202333331108093


Train, Epoch 4 / 20:  76%|███████▌  | 1183/1563 [00:50<00:15, 24.72it/s]

batch 1180 loss: 0.5392727434635163


Train, Epoch 4 / 20:  76%|███████▋  | 1195/1563 [00:50<00:14, 24.76it/s]

batch 1190 loss: 0.5342416286468505


Train, Epoch 4 / 20:  77%|███████▋  | 1204/1563 [00:51<00:14, 24.63it/s]

batch 1200 loss: 0.5708996593952179


Train, Epoch 4 / 20:  78%|███████▊  | 1213/1563 [00:51<00:14, 24.58it/s]

batch 1210 loss: 0.5335181832313538


Train, Epoch 4 / 20:  78%|███████▊  | 1225/1563 [00:52<00:13, 24.78it/s]

batch 1220 loss: 0.5494490444660187


Train, Epoch 4 / 20:  79%|███████▉  | 1234/1563 [00:52<00:13, 24.71it/s]

batch 1230 loss: 0.5371610522270203


Train, Epoch 4 / 20:  80%|███████▉  | 1243/1563 [00:52<00:12, 24.89it/s]

batch 1240 loss: 0.5455634623765946


Train, Epoch 4 / 20:  80%|████████  | 1255/1563 [00:53<00:12, 24.71it/s]

batch 1250 loss: 0.5178646266460418


Train, Epoch 4 / 20:  81%|████████  | 1264/1563 [00:53<00:12, 24.66it/s]

batch 1260 loss: 0.5568557649850845


Train, Epoch 4 / 20:  81%|████████▏ | 1273/1563 [00:54<00:12, 23.67it/s]

batch 1270 loss: 0.5576871335506439


Train, Epoch 4 / 20:  82%|████████▏ | 1282/1563 [00:54<00:12, 22.76it/s]

batch 1280 loss: 0.5491332143545151


Train, Epoch 4 / 20:  83%|████████▎ | 1294/1563 [00:55<00:11, 22.52it/s]

batch 1290 loss: 0.5039234220981598


Train, Epoch 4 / 20:  83%|████████▎ | 1303/1563 [00:55<00:11, 22.07it/s]

batch 1300 loss: 0.6072314798831939


Train, Epoch 4 / 20:  84%|████████▍ | 1312/1563 [00:55<00:11, 21.75it/s]

batch 1310 loss: 0.4805709093809128


Train, Epoch 4 / 20:  85%|████████▍ | 1324/1563 [00:56<00:10, 23.29it/s]

batch 1320 loss: 0.5110566735267639


Train, Epoch 4 / 20:  85%|████████▌ | 1333/1563 [00:56<00:09, 24.29it/s]

batch 1330 loss: 0.5328911066055297


Train, Epoch 4 / 20:  86%|████████▌ | 1342/1563 [00:57<00:09, 24.37it/s]

batch 1340 loss: 0.6849725782871247


Train, Epoch 4 / 20:  87%|████████▋ | 1354/1563 [00:57<00:08, 24.36it/s]

batch 1350 loss: 0.6035234451293945


Train, Epoch 4 / 20:  87%|████████▋ | 1363/1563 [00:57<00:08, 24.79it/s]

batch 1360 loss: 0.5690385848283768


Train, Epoch 4 / 20:  88%|████████▊ | 1375/1563 [00:58<00:07, 24.74it/s]

batch 1370 loss: 0.5849410414695739


Train, Epoch 4 / 20:  89%|████████▊ | 1384/1563 [00:58<00:07, 24.77it/s]

batch 1380 loss: 0.5225207060575485


Train, Epoch 4 / 20:  89%|████████▉ | 1393/1563 [00:59<00:06, 24.67it/s]

batch 1390 loss: 0.5698494344949723


Train, Epoch 4 / 20:  90%|████████▉ | 1405/1563 [00:59<00:06, 24.59it/s]

batch 1400 loss: 0.5281515777111053


Train, Epoch 4 / 20:  90%|█████████ | 1414/1563 [01:00<00:06, 24.76it/s]

batch 1410 loss: 0.5787096381187439


Train, Epoch 4 / 20:  91%|█████████ | 1423/1563 [01:00<00:05, 24.54it/s]

batch 1420 loss: 0.5138774082064629


Train, Epoch 4 / 20:  92%|█████████▏| 1435/1563 [01:00<00:05, 24.45it/s]

batch 1430 loss: 0.5476427793502807


Train, Epoch 4 / 20:  92%|█████████▏| 1444/1563 [01:01<00:04, 24.79it/s]

batch 1440 loss: 0.5423404753208161


Train, Epoch 4 / 20:  93%|█████████▎| 1453/1563 [01:01<00:04, 24.62it/s]

batch 1450 loss: 0.5551119118928909


Train, Epoch 4 / 20:  94%|█████████▎| 1465/1563 [01:02<00:03, 24.90it/s]

batch 1460 loss: 0.5534699380397796


Train, Epoch 4 / 20:  94%|█████████▍| 1474/1563 [01:02<00:03, 24.62it/s]

batch 1470 loss: 0.5513827472925186


Train, Epoch 4 / 20:  95%|█████████▍| 1483/1563 [01:02<00:03, 24.78it/s]

batch 1480 loss: 0.5845537275075913


Train, Epoch 4 / 20:  96%|█████████▌| 1495/1563 [01:03<00:02, 24.78it/s]

batch 1490 loss: 0.5801034092903137


Train, Epoch 4 / 20:  96%|█████████▌| 1504/1563 [01:03<00:02, 24.72it/s]

batch 1500 loss: 0.5637706398963929


Train, Epoch 4 / 20:  97%|█████████▋| 1513/1563 [01:04<00:02, 24.65it/s]

batch 1510 loss: 0.5083149343729019


Train, Epoch 4 / 20:  98%|█████████▊| 1525/1563 [01:04<00:01, 24.50it/s]

batch 1520 loss: 0.5252516776323318


Train, Epoch 4 / 20:  98%|█████████▊| 1534/1563 [01:04<00:01, 24.78it/s]

batch 1530 loss: 0.5520349234342575


Train, Epoch 4 / 20:  99%|█████████▊| 1543/1563 [01:05<00:00, 24.46it/s]

batch 1540 loss: 0.5427878797054291


Train, Epoch 4 / 20:  99%|█████████▉| 1555/1563 [01:05<00:00, 24.64it/s]

batch 1550 loss: 0.5658957481384277


Train, Epoch 4 / 20: 100%|██████████| 1563/1563 [01:06<00:00, 23.65it/s]


batch 1560 loss: 0.4652030125260353


Test, Epoch 4 / 20: 100%|██████████| 1563/1563 [00:29<00:00, 52.65it/s]


Epoch 4, loss: 0.5621322078895569, accuracy: 0.70532


Train, Epoch 5 / 20:   1%|          | 15/1563 [00:00<01:02, 24.80it/s]

batch 10 loss: 0.627215451002121


Train, Epoch 5 / 20:   2%|▏         | 24/1563 [00:00<01:02, 24.80it/s]

batch 20 loss: 0.5323933184146881


Train, Epoch 5 / 20:   2%|▏         | 33/1563 [00:01<01:01, 24.83it/s]

batch 30 loss: 0.5307622104883194


Train, Epoch 5 / 20:   3%|▎         | 42/1563 [00:01<01:01, 24.54it/s]

batch 40 loss: 0.5410012066364288


Train, Epoch 5 / 20:   3%|▎         | 54/1563 [00:02<01:01, 24.63it/s]

batch 50 loss: 0.4939273774623871


Train, Epoch 5 / 20:   4%|▍         | 63/1563 [00:02<01:00, 24.76it/s]

batch 60 loss: 0.5152513593435287


Train, Epoch 5 / 20:   5%|▍         | 75/1563 [00:03<01:00, 24.55it/s]

batch 70 loss: 0.5055547624826431


Train, Epoch 5 / 20:   5%|▌         | 84/1563 [00:03<00:59, 24.91it/s]

batch 80 loss: 0.5146320432424545


Train, Epoch 5 / 20:   6%|▌         | 93/1563 [00:03<00:59, 24.88it/s]

batch 90 loss: 0.4520599275827408


Train, Epoch 5 / 20:   7%|▋         | 105/1563 [00:04<00:58, 24.71it/s]

batch 100 loss: 0.4933549016714096


Train, Epoch 5 / 20:   7%|▋         | 114/1563 [00:04<00:58, 24.67it/s]

batch 110 loss: 0.5853319853544235


Train, Epoch 5 / 20:   8%|▊         | 123/1563 [00:04<00:58, 24.80it/s]

batch 120 loss: 0.5844825804233551


Train, Epoch 5 / 20:   8%|▊         | 132/1563 [00:05<00:57, 24.77it/s]

batch 130 loss: 0.4895548805594444


Train, Epoch 5 / 20:   9%|▉         | 144/1563 [00:05<00:58, 24.29it/s]

batch 140 loss: 0.6016355812549591


Train, Epoch 5 / 20:  10%|▉         | 153/1563 [00:06<00:56, 24.76it/s]

batch 150 loss: 0.5540739506483078


Train, Epoch 5 / 20:  11%|█         | 165/1563 [00:06<00:56, 24.84it/s]

batch 160 loss: 0.5817841321229935


Train, Epoch 5 / 20:  11%|█         | 174/1563 [00:07<00:55, 24.83it/s]

batch 170 loss: 0.48720064461231233


Train, Epoch 5 / 20:  12%|█▏        | 183/1563 [00:07<01:00, 22.81it/s]

batch 180 loss: 0.5101742386817932


Train, Epoch 5 / 20:  12%|█▏        | 192/1563 [00:07<01:02, 21.82it/s]

batch 190 loss: 0.4857399523258209


Train, Epoch 5 / 20:  13%|█▎        | 204/1563 [00:08<01:01, 22.11it/s]

batch 200 loss: 0.5155711621046066


Train, Epoch 5 / 20:  14%|█▎        | 213/1563 [00:08<01:00, 22.27it/s]

batch 210 loss: 0.6062610268592834


Train, Epoch 5 / 20:  14%|█▍        | 225/1563 [00:09<00:58, 22.99it/s]

batch 220 loss: 0.5286482542753219


Train, Epoch 5 / 20:  15%|█▍        | 234/1563 [00:09<00:54, 24.24it/s]

batch 230 loss: 0.49026265144348147


Train, Epoch 5 / 20:  16%|█▌        | 243/1563 [00:10<00:55, 23.94it/s]

batch 240 loss: 0.5289059221744538


Train, Epoch 5 / 20:  16%|█▋        | 255/1563 [00:10<00:53, 24.41it/s]

batch 250 loss: 0.5106655865907669


Train, Epoch 5 / 20:  17%|█▋        | 264/1563 [00:10<00:54, 24.05it/s]

batch 260 loss: 0.47859087586402893


Train, Epoch 5 / 20:  17%|█▋        | 273/1563 [00:11<00:52, 24.66it/s]

batch 270 loss: 0.5105417788028717


Train, Epoch 5 / 20:  18%|█▊        | 285/1563 [00:11<00:51, 24.70it/s]

batch 280 loss: 0.5056950390338898


Train, Epoch 5 / 20:  19%|█▉        | 294/1563 [00:12<00:51, 24.83it/s]

batch 290 loss: 0.5440424859523774


Train, Epoch 5 / 20:  19%|█▉        | 303/1563 [00:12<00:50, 24.76it/s]

batch 300 loss: 0.48882798552513124


Train, Epoch 5 / 20:  20%|██        | 315/1563 [00:13<00:50, 24.65it/s]

batch 310 loss: 0.5445830404758454


Train, Epoch 5 / 20:  21%|██        | 324/1563 [00:13<00:50, 24.68it/s]

batch 320 loss: 0.5051424950361252


Train, Epoch 5 / 20:  21%|██▏       | 333/1563 [00:13<00:49, 24.69it/s]

batch 330 loss: 0.5634130299091339


Train, Epoch 5 / 20:  22%|██▏       | 345/1563 [00:14<00:49, 24.76it/s]

batch 340 loss: 0.5426395267248154


Train, Epoch 5 / 20:  23%|██▎       | 354/1563 [00:14<00:48, 25.13it/s]

batch 350 loss: 0.5403548449277877


Train, Epoch 5 / 20:  23%|██▎       | 363/1563 [00:14<00:48, 24.92it/s]

batch 360 loss: 0.585378223657608


Train, Epoch 5 / 20:  24%|██▍       | 375/1563 [00:15<00:47, 24.89it/s]

batch 370 loss: 0.5352147400379181


Train, Epoch 5 / 20:  25%|██▍       | 384/1563 [00:15<00:47, 24.97it/s]

batch 380 loss: 0.4823087602853775


Train, Epoch 5 / 20:  25%|██▌       | 393/1563 [00:16<00:47, 24.66it/s]

batch 390 loss: 0.5124440729618073


Train, Epoch 5 / 20:  26%|██▌       | 405/1563 [00:16<00:46, 24.90it/s]

batch 400 loss: 0.5230100065469742


Train, Epoch 5 / 20:  26%|██▋       | 414/1563 [00:17<00:45, 25.00it/s]

batch 410 loss: 0.5085289031267166


Train, Epoch 5 / 20:  27%|██▋       | 423/1563 [00:17<00:46, 24.59it/s]

batch 420 loss: 0.5094946593046188


Train, Epoch 5 / 20:  28%|██▊       | 435/1563 [00:17<00:45, 24.79it/s]

batch 430 loss: 0.5207561701536179


Train, Epoch 5 / 20:  28%|██▊       | 444/1563 [00:18<00:45, 24.71it/s]

batch 440 loss: 0.4730351448059082


Train, Epoch 5 / 20:  29%|██▉       | 453/1563 [00:18<00:45, 24.65it/s]

batch 450 loss: 0.514373990893364


Train, Epoch 5 / 20:  30%|██▉       | 465/1563 [00:19<00:44, 24.83it/s]

batch 460 loss: 0.4890492022037506


Train, Epoch 5 / 20:  30%|███       | 474/1563 [00:19<00:46, 23.49it/s]

batch 470 loss: 0.49227501153945924


Train, Epoch 5 / 20:  31%|███       | 483/1563 [00:19<00:47, 22.95it/s]

batch 480 loss: 0.4974701166152954


Train, Epoch 5 / 20:  31%|███▏      | 492/1563 [00:20<00:47, 22.46it/s]

batch 490 loss: 0.5699662387371063


Train, Epoch 5 / 20:  32%|███▏      | 504/1563 [00:20<00:47, 22.26it/s]

batch 500 loss: 0.4182576462626457


Train, Epoch 5 / 20:  33%|███▎      | 513/1563 [00:21<00:46, 22.51it/s]

batch 510 loss: 0.5326145648956299


Train, Epoch 5 / 20:  34%|███▎      | 525/1563 [00:21<00:43, 23.63it/s]

batch 520 loss: 0.542699220776558


Train, Epoch 5 / 20:  34%|███▍      | 534/1563 [00:22<00:42, 24.45it/s]

batch 530 loss: 0.5841988891363143


Train, Epoch 5 / 20:  35%|███▍      | 543/1563 [00:22<00:41, 24.66it/s]

batch 540 loss: 0.592172521352768


Train, Epoch 5 / 20:  36%|███▌      | 555/1563 [00:22<00:40, 24.66it/s]

batch 550 loss: 0.5099993824958802


Train, Epoch 5 / 20:  36%|███▌      | 564/1563 [00:23<00:40, 24.69it/s]

batch 560 loss: 0.4939355731010437


Train, Epoch 5 / 20:  37%|███▋      | 573/1563 [00:23<00:40, 24.73it/s]

batch 570 loss: 0.473723304271698


Train, Epoch 5 / 20:  37%|███▋      | 585/1563 [00:24<00:39, 24.76it/s]

batch 580 loss: 0.5468175381422042


Train, Epoch 5 / 20:  38%|███▊      | 594/1563 [00:24<00:39, 24.74it/s]

batch 590 loss: 0.5263071984052659


Train, Epoch 5 / 20:  39%|███▊      | 603/1563 [00:24<00:38, 24.68it/s]

batch 600 loss: 0.4771431118249893


Train, Epoch 5 / 20:  39%|███▉      | 615/1563 [00:25<00:38, 24.72it/s]

batch 610 loss: 0.5306685477495193


Train, Epoch 5 / 20:  40%|███▉      | 624/1563 [00:25<00:38, 24.56it/s]

batch 620 loss: 0.543735870718956


Train, Epoch 5 / 20:  40%|████      | 633/1563 [00:26<00:37, 24.82it/s]

batch 630 loss: 0.5488754123449325


Train, Epoch 5 / 20:  41%|████▏     | 645/1563 [00:26<00:37, 24.77it/s]

batch 640 loss: 0.49514652490615846


Train, Epoch 5 / 20:  42%|████▏     | 654/1563 [00:26<00:36, 24.68it/s]

batch 650 loss: 0.5969807893037796


Train, Epoch 5 / 20:  42%|████▏     | 663/1563 [00:27<00:36, 24.62it/s]

batch 660 loss: 0.532140702009201


Train, Epoch 5 / 20:  43%|████▎     | 675/1563 [00:27<00:35, 24.67it/s]

batch 670 loss: 0.5166833519935607


Train, Epoch 5 / 20:  44%|████▍     | 684/1563 [00:28<00:35, 24.80it/s]

batch 680 loss: 0.5616772592067718


Train, Epoch 5 / 20:  44%|████▍     | 693/1563 [00:28<00:35, 24.85it/s]

batch 690 loss: 0.5208346486091614


Train, Epoch 5 / 20:  45%|████▌     | 705/1563 [00:29<00:34, 24.69it/s]

batch 700 loss: 0.5609675198793411


Train, Epoch 5 / 20:  46%|████▌     | 714/1563 [00:29<00:34, 24.70it/s]

batch 710 loss: 0.5271938472986222


Train, Epoch 5 / 20:  46%|████▋     | 723/1563 [00:29<00:33, 24.75it/s]

batch 720 loss: 0.5143449753522873


Train, Epoch 5 / 20:  47%|████▋     | 735/1563 [00:30<00:33, 24.86it/s]

batch 730 loss: 0.48774107694625857


Train, Epoch 5 / 20:  48%|████▊     | 744/1563 [00:30<00:33, 24.72it/s]

batch 740 loss: 0.521916076540947


Train, Epoch 5 / 20:  48%|████▊     | 753/1563 [00:30<00:32, 24.67it/s]

batch 750 loss: 0.5285678744316101


Train, Epoch 5 / 20:  49%|████▉     | 765/1563 [00:31<00:32, 24.69it/s]

batch 760 loss: 0.5226016759872436


Train, Epoch 5 / 20:  50%|████▉     | 774/1563 [00:31<00:35, 22.51it/s]

batch 770 loss: 0.47442280650138857


Train, Epoch 5 / 20:  50%|█████     | 783/1563 [00:32<00:34, 22.75it/s]

batch 780 loss: 0.5855068951845169


Train, Epoch 5 / 20:  51%|█████     | 792/1563 [00:32<00:34, 22.09it/s]

batch 790 loss: 0.4735231250524521


Train, Epoch 5 / 20:  51%|█████▏    | 804/1563 [00:33<00:34, 22.03it/s]

batch 800 loss: 0.5452162623405457


Train, Epoch 5 / 20:  52%|█████▏    | 813/1563 [00:33<00:34, 21.80it/s]

batch 810 loss: 0.5255589067935944


Train, Epoch 5 / 20:  53%|█████▎    | 825/1563 [00:34<00:30, 24.02it/s]

batch 820 loss: 0.4741085976362228


Train, Epoch 5 / 20:  53%|█████▎    | 834/1563 [00:34<00:29, 24.47it/s]

batch 830 loss: 0.45192975997924806


Train, Epoch 5 / 20:  54%|█████▍    | 843/1563 [00:34<00:29, 24.53it/s]

batch 840 loss: 0.5251658618450165


Train, Epoch 5 / 20:  55%|█████▍    | 852/1563 [00:35<00:29, 24.41it/s]

batch 850 loss: 0.49783390760421753


Train, Epoch 5 / 20:  55%|█████▌    | 864/1563 [00:35<00:28, 24.15it/s]

batch 860 loss: 0.5757873475551605


Train, Epoch 5 / 20:  56%|█████▌    | 873/1563 [00:36<00:28, 24.42it/s]

batch 870 loss: 0.5094841986894607


Train, Epoch 5 / 20:  57%|█████▋    | 885/1563 [00:36<00:27, 24.55it/s]

batch 880 loss: 0.46997652649879457


Train, Epoch 5 / 20:  57%|█████▋    | 894/1563 [00:36<00:27, 24.65it/s]

batch 890 loss: 0.5551583498716355


Train, Epoch 5 / 20:  58%|█████▊    | 903/1563 [00:37<00:26, 24.67it/s]

batch 900 loss: 0.48958356082439425


Train, Epoch 5 / 20:  59%|█████▊    | 915/1563 [00:37<00:26, 24.68it/s]

batch 910 loss: 0.49361211955547335


Train, Epoch 5 / 20:  59%|█████▉    | 924/1563 [00:38<00:26, 24.54it/s]

batch 920 loss: 0.5921081095933914


Train, Epoch 5 / 20:  60%|█████▉    | 933/1563 [00:38<00:25, 24.43it/s]

batch 930 loss: 0.5039740145206452


Train, Epoch 5 / 20:  60%|██████    | 945/1563 [00:39<00:24, 24.75it/s]

batch 940 loss: 0.5326285302639008


Train, Epoch 5 / 20:  61%|██████    | 954/1563 [00:39<00:24, 25.03it/s]

batch 950 loss: 0.48528188169002534


Train, Epoch 5 / 20:  62%|██████▏   | 963/1563 [00:39<00:24, 24.61it/s]

batch 960 loss: 0.5433850347995758


Train, Epoch 5 / 20:  62%|██████▏   | 972/1563 [00:40<00:24, 24.21it/s]

batch 970 loss: 0.5872934997081757


Train, Epoch 5 / 20:  63%|██████▎   | 984/1563 [00:40<00:23, 24.32it/s]

batch 980 loss: 0.5692476987838745


Train, Epoch 5 / 20:  64%|██████▎   | 993/1563 [00:41<00:23, 24.69it/s]

batch 990 loss: 0.5036932736635208


Train, Epoch 5 / 20:  64%|██████▍   | 1005/1563 [00:41<00:22, 24.72it/s]

batch 1000 loss: 0.47227057814598083


Train, Epoch 5 / 20:  65%|██████▍   | 1014/1563 [00:41<00:22, 24.76it/s]

batch 1010 loss: 0.5361680775880814


Train, Epoch 5 / 20:  65%|██████▌   | 1023/1563 [00:42<00:21, 24.62it/s]

batch 1020 loss: 0.539108008146286


Train, Epoch 5 / 20:  66%|██████▌   | 1035/1563 [00:42<00:21, 24.92it/s]

batch 1030 loss: 0.5209302335977555


Train, Epoch 5 / 20:  67%|██████▋   | 1044/1563 [00:43<00:21, 24.64it/s]

batch 1040 loss: 0.5720291256904602


Train, Epoch 5 / 20:  67%|██████▋   | 1053/1563 [00:43<00:20, 24.81it/s]

batch 1050 loss: 0.517921906709671


Train, Epoch 5 / 20:  68%|██████▊   | 1062/1563 [00:43<00:21, 23.17it/s]

batch 1060 loss: 0.5503683298826217


Train, Epoch 5 / 20:  69%|██████▊   | 1074/1563 [00:44<00:21, 22.48it/s]

batch 1070 loss: 0.5292361736297607


Train, Epoch 5 / 20:  69%|██████▉   | 1083/1563 [00:44<00:21, 22.15it/s]

batch 1080 loss: 0.4892150819301605


Train, Epoch 5 / 20:  70%|██████▉   | 1092/1563 [00:45<00:21, 21.85it/s]

batch 1090 loss: 0.49241905510425565


Train, Epoch 5 / 20:  71%|███████   | 1104/1563 [00:45<00:21, 21.61it/s]

batch 1100 loss: 0.539142319560051


Train, Epoch 5 / 20:  71%|███████   | 1113/1563 [00:46<00:20, 22.38it/s]

batch 1110 loss: 0.5878254503011704


Train, Epoch 5 / 20:  72%|███████▏  | 1125/1563 [00:46<00:18, 24.16it/s]

batch 1120 loss: 0.5224021375179291


Train, Epoch 5 / 20:  73%|███████▎  | 1134/1563 [00:47<00:17, 24.09it/s]

batch 1130 loss: 0.6693525373935699


Train, Epoch 5 / 20:  73%|███████▎  | 1143/1563 [00:47<00:17, 24.56it/s]

batch 1140 loss: 0.48116818368434905


Train, Epoch 5 / 20:  74%|███████▍  | 1155/1563 [00:47<00:16, 24.83it/s]

batch 1150 loss: 0.48186246752738954


Train, Epoch 5 / 20:  74%|███████▍  | 1164/1563 [00:48<00:16, 24.80it/s]

batch 1160 loss: 0.5660766690969468


Train, Epoch 5 / 20:  75%|███████▌  | 1173/1563 [00:48<00:16, 24.33it/s]

batch 1170 loss: 0.46605687886476516


Train, Epoch 5 / 20:  76%|███████▌  | 1182/1563 [00:48<00:15, 24.49it/s]

batch 1180 loss: 0.502165150642395


Train, Epoch 5 / 20:  76%|███████▋  | 1194/1563 [00:49<00:14, 24.62it/s]

batch 1190 loss: 0.5057036191225052


Train, Epoch 5 / 20:  77%|███████▋  | 1203/1563 [00:49<00:14, 24.54it/s]

batch 1200 loss: 0.5054951399564743


Train, Epoch 5 / 20:  78%|███████▊  | 1215/1563 [00:50<00:14, 24.55it/s]

batch 1210 loss: 0.5223765254020691


Train, Epoch 5 / 20:  78%|███████▊  | 1224/1563 [00:50<00:13, 24.83it/s]

batch 1220 loss: 0.5663347989320755


Train, Epoch 5 / 20:  79%|███████▉  | 1233/1563 [00:51<00:13, 24.68it/s]

batch 1230 loss: 0.5047332495450974


Train, Epoch 5 / 20:  80%|███████▉  | 1245/1563 [00:51<00:12, 24.62it/s]

batch 1240 loss: 0.4856435626745224


Train, Epoch 5 / 20:  80%|████████  | 1254/1563 [00:51<00:12, 24.76it/s]

batch 1250 loss: 0.523144993185997


Train, Epoch 5 / 20:  81%|████████  | 1263/1563 [00:52<00:12, 24.73it/s]

batch 1260 loss: 0.5140342563390732


Train, Epoch 5 / 20:  82%|████████▏ | 1275/1563 [00:52<00:11, 24.83it/s]

batch 1270 loss: 0.5308827787637711


Train, Epoch 5 / 20:  82%|████████▏ | 1284/1563 [00:53<00:11, 24.50it/s]

batch 1280 loss: 0.5677163898944855


Train, Epoch 5 / 20:  83%|████████▎ | 1293/1563 [00:53<00:10, 24.73it/s]

batch 1290 loss: 0.4585253417491913


Train, Epoch 5 / 20:  83%|████████▎ | 1305/1563 [00:53<00:10, 24.67it/s]

batch 1300 loss: 0.5017607897520066


Train, Epoch 5 / 20:  84%|████████▍ | 1314/1563 [00:54<00:10, 24.72it/s]

batch 1310 loss: 0.5484564751386642


Train, Epoch 5 / 20:  85%|████████▍ | 1323/1563 [00:54<00:09, 24.78it/s]

batch 1320 loss: 0.474098214507103


Train, Epoch 5 / 20:  85%|████████▌ | 1335/1563 [00:55<00:09, 24.56it/s]

batch 1330 loss: 0.5157875061035156


Train, Epoch 5 / 20:  86%|████████▌ | 1341/1563 [00:55<00:10, 21.81it/s]

batch 1340 loss: 0.5249775022268295


Train, Epoch 5 / 20:  87%|████████▋ | 1353/1563 [00:56<00:09, 22.53it/s]

batch 1350 loss: 0.5197707146406174


Train, Epoch 5 / 20:  87%|████████▋ | 1362/1563 [00:56<00:09, 22.14it/s]

batch 1360 loss: 0.48432366251945497


Train, Epoch 5 / 20:  88%|████████▊ | 1374/1563 [00:56<00:08, 22.48it/s]

batch 1370 loss: 0.48343345820903777


Train, Epoch 5 / 20:  88%|████████▊ | 1383/1563 [00:57<00:08, 21.97it/s]

batch 1380 loss: 0.4901362627744675


Train, Epoch 5 / 20:  89%|████████▉ | 1392/1563 [00:57<00:07, 21.46it/s]

batch 1390 loss: 0.5567224740982055


Train, Epoch 5 / 20:  90%|████████▉ | 1404/1563 [00:58<00:07, 22.26it/s]

batch 1400 loss: 0.5328926682472229


Train, Epoch 5 / 20:  90%|█████████ | 1413/1563 [00:58<00:06, 23.78it/s]

batch 1410 loss: 0.49301951229572294


Train, Epoch 5 / 20:  91%|█████████ | 1425/1563 [00:59<00:05, 24.17it/s]

batch 1420 loss: 0.5163809955120087


Train, Epoch 5 / 20:  92%|█████████▏| 1434/1563 [00:59<00:05, 24.55it/s]

batch 1430 loss: 0.5166206061840057


Train, Epoch 5 / 20:  92%|█████████▏| 1443/1563 [00:59<00:04, 24.69it/s]

batch 1440 loss: 0.5361197978258133


Train, Epoch 5 / 20:  93%|█████████▎| 1455/1563 [01:00<00:04, 24.91it/s]

batch 1450 loss: 0.5571826517581939


Train, Epoch 5 / 20:  94%|█████████▎| 1464/1563 [01:00<00:03, 24.86it/s]

batch 1460 loss: 0.5251784533262253


Train, Epoch 5 / 20:  94%|█████████▍| 1473/1563 [01:01<00:03, 24.84it/s]

batch 1470 loss: 0.6088702648878097


Train, Epoch 5 / 20:  95%|█████████▌| 1485/1563 [01:01<00:03, 24.86it/s]

batch 1480 loss: 0.5161931633949279


Train, Epoch 5 / 20:  96%|█████████▌| 1494/1563 [01:01<00:02, 24.63it/s]

batch 1490 loss: 0.49700741171836854


Train, Epoch 5 / 20:  96%|█████████▌| 1503/1563 [01:02<00:02, 24.61it/s]

batch 1500 loss: 0.5589325308799744


Train, Epoch 5 / 20:  97%|█████████▋| 1515/1563 [01:02<00:01, 24.80it/s]

batch 1510 loss: 0.4941903382539749


Train, Epoch 5 / 20:  98%|█████████▊| 1524/1563 [01:03<00:01, 24.81it/s]

batch 1520 loss: 0.49609418511390685


Train, Epoch 5 / 20:  98%|█████████▊| 1533/1563 [01:03<00:01, 24.70it/s]

batch 1530 loss: 0.5074892342090607


Train, Epoch 5 / 20:  99%|█████████▊| 1542/1563 [01:03<00:00, 24.15it/s]

batch 1540 loss: 0.610336372256279


Train, Epoch 5 / 20:  99%|█████████▉| 1554/1563 [01:04<00:00, 24.08it/s]

batch 1550 loss: 0.5007728517055512


Train, Epoch 5 / 20: 100%|██████████| 1563/1563 [01:04<00:00, 24.12it/s]


batch 1560 loss: 0.5138665109872818


Test, Epoch 5 / 20:  14%|█▎        | 213/1563 [00:04<00:29, 45.33it/s]