![model](img/model1.png)

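    A simple Siamese network for semantic similarity:
    
        1. As the figure above shows, the overall structure is fairly simple, and the left and right branches are essentially identical. Sentence A and sentence B enter the left and right branches respectively. The network input is token embedding + position_embedding.
        
        2. The embeddings are then encoded by the cnn-encoder.
        
        3. A multi-head attention layer (named self_attention in the code) acts as the interaction layer between the two sentences. Its inputs are the cnn-encoder output of the current sentence and the cnn-encoder output of the other sentence.
        
        4. The cnn-encoder output and the attention output are concatenated (cat).
        
        5. A fully connected (fc) layer follows.
        
        6. An average pooling layer reduces the sequence to a single vector.
        
        7. Finally, cosine similarity is used as the matching score.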
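
Steps 6 and 7 can be illustrated without any framework. The following is a minimal sketch in plain Python, not the notebook's implementation: the helper names `mean_pool` and `cosine_similarity` and the toy feature values are assumptions made up for illustration.

```python
import math

def mean_pool(token_vectors):
    """Average a [seq_len, hid_dim] list of token vectors into one [hid_dim] vector."""
    seq_len = len(token_vectors)
    hid_dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / seq_len for d in range(hid_dim)]

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between u and v; eps guards against zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

# Hypothetical per-token features for two short sentences (hid_dim = 2).
sent_a = [[1.0, 0.0], [3.0, 0.0]]   # pools to [2.0, 0.0]
sent_b = [[0.0, 2.0], [0.0, 4.0]]   # pools to [0.0, 3.0]

pooled_a = mean_pool(sent_a)
pooled_b = mean_pool(sent_b)
score_same = cosine_similarity(pooled_a, pooled_a)   # identical direction
score_orth = cosine_similarity(pooled_a, pooled_b)   # orthogonal directions
```

Because the score depends only on the angle between the pooled vectors, two sentences with differently scaled features can still score as highly similar.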

The cnn-encoder structure is as follows:

![model](img/cnn-encoder.png)
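
Each convolution block in the `Encoder` class defined later in this notebook applies a convolution with `2*hid_dim` output channels, a GLU activation, and a residual sum scaled by `sqrt(0.5)`. The sketch below reproduces that arithmetic for a single sequence position in plain Python. The function names and the toy numbers are assumptions for illustration, and the convolution itself is replaced by a hand-written output vector.

```python
import math

def glu(channels):
    """Gated linear unit over 2*H channel values: first half * sigmoid(second half)."""
    half = len(channels) // 2
    values, gates = channels[:half], channels[half:]
    return [v * (1.0 / (1.0 + math.exp(-g))) for v, g in zip(values, gates)]

def residual_block(conv_input, conv_output, scale=math.sqrt(0.5)):
    """Apply GLU to the 2*H conv output, add the H-dim block input, then rescale."""
    gated = glu(conv_output)
    return [(g + x) * scale for g, x in zip(gated, conv_input)]

def same_padding(kernel_size):
    """Zeros to add on each side so an odd kernel keeps the sequence length."""
    assert kernel_size % 2 == 1, "kernel size must be odd"
    return (kernel_size - 1) // 2

# Hypothetical values at one sequence position with hid_dim = 2.
block_input = [1.0, -1.0]
conv_out = [2.0, 4.0, 0.0, 0.0]      # gates of 0 give sigmoid(0) = 0.5
block_out = residual_block(block_input, conv_out)
```

The GLU halves the channel count, which is why the convolution emits `2*hid_dim` channels while the block input and output both stay at `hid_dim`.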

In [28]:
import os
import pandas as pd

data_path = os.getcwd()
file_path = os.path.join(data_path, 'data', 'rawdata.csv')

pd_csv = pd.read_csv(file_path, sep='\t', names=['texta', 'textb', 'label'])
# pd_csv.drop('id', axis=1, inplace=True)
pd_csv.head(10)

Unnamed: 0,texta,textb,label
0,也开不了花呗，就这样了？完事了,真的嘛？就是花呗付款,0
1,花呗冻结以后还能开通吗,我的条件可以开通花呗借款吗,0
2,如何得知关闭借呗,想永久关闭借呗,0
3,花呗扫码付钱,二维码扫描可以用花呗吗,0
4,花呗逾期后不能分期吗,我这个 逾期后还完了 最低还款 后 能分期吗,0
5,花呗分期清空,花呗分期查询,0
6,借呗逾期短信通知,如何购买花呗短信通知,0
7,借呗即将到期要还的账单还能分期吗,借呗要分期还，是吗,0
8,花呗为什么不能支付手机交易,花呗透支了为什么不可以继续用了,0
9,在吗，双***有临时花呗额度吗,花呗临时额度到时间怎么办,0


In [32]:
import torch
from torchtext import data,datasets
from torchtext.data import Iterator, BucketIterator
from torchtext.vocab import Vectors
from torch import nn,optim
import torch.nn.functional as F
import pickle
import random
import re

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def tokenize(x):
    x = x.strip().replace(' ','')
    sentences = re.sub("[.。，“”,!?\\-]", '', x.lower()) # strip the listed punctuation marks
    return list(sentences)

TEXT = data.Field(
                    sequential=True,
                    tokenize=tokenize,
                    lower=False,
                    use_vocab=True,
                    pad_token='<pad>',
                    unk_token='<unk>',
                    batch_first=True,
                    fix_length=30)

LABEL = data.Field(
                    sequential=False,
                    use_vocab=False,
                    batch_first=True)


# Build the training or test dataset
def get_dataset(csv_data, text_field, label_field, test=False):
    fields = [('id', None), ('texta', text_field), ('textb', text_field), ('label', label_field)]
    examples = []
    if test: # test set: labels are not loaded
        for texta, textb in zip(csv_data['texta'], csv_data['textb']):
            examples.append(data.Example.fromlist([None, texta, textb, None], fields))
    else: # training set
        for texta, textb, label in zip(csv_data['texta'], csv_data['textb'], csv_data['label']):
            examples.append(data.Example.fromlist([None, texta, textb, label], fields))
    return examples, fields

train_examples,train_fields = get_dataset(pd_csv, TEXT, LABEL)

train = data.Dataset(train_examples, train_fields)

# Pretrained embeddings can optionally be loaded
'''
pretrained_embedding = os.path.join(os.getcwd(), 'sgns.sogou.char')
vectors = Vectors(name=pretrained_embedding)
# Build the vocabulary
TEXT.build_vocab(train, min_freq=1, vectors = vectors)
'''
TEXT.build_vocab(train, min_freq=1)

words_path = os.path.join(os.getcwd(), 'words.pkl')
with open(words_path, 'wb') as f_words:
    pickle.dump(TEXT.vocab, f_words)

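# Split into training and validation sets. One caveat: splitting with random_split drops the fields attribute.
# random.seed() returns None, so seed first and pass the generator state explicitly.
random.seed(1)
train_set, val_set = train.split(split_ratio=0.9, random_state=random.getstate())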
    
train_iter, val_iter = data.BucketIterator.splits(
                                                (train_set, val_set),
                                                batch_sizes=(64, len(val_set)),
                                                shuffle=True,
                                                # device=device,
                                                sort_within_batch=True, # if True, examples within each batch are sorted in descending order of sort_key
                                                sort_key=lambda x: len(x.texta)) # sort by the length of texta, mainly to support later pack/pad operations
                                                # repeat=False
print('dataset load done!')

dataset load done!
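
To make the preprocessing concrete, the sketch below runs the notebook's character-level `tokenize` logic on sample strings and then pads the result to a fixed length, as `fix_length=30` does in the `TEXT` field. The helper `pad_to_length` and the sample strings are assumptions for illustration; torchtext performs the padding internally.

```python
import re

def tokenize(text):
    """Same logic as the notebook's tokenizer: drop spaces, lowercase, strip listed punctuation."""
    text = text.strip().replace(' ', '')
    cleaned = re.sub("[.。，“”,!?\\-]", '', text.lower())
    return list(cleaned)

def pad_to_length(tokens, fix_length=30, pad_token='<pad>'):
    """Truncate or right-pad a token list to exactly fix_length items."""
    clipped = tokens[:fix_length]
    return clipped + [pad_token] * (fix_length - len(clipped))

tokens = tokenize('花呗 扫码, 付钱?')          # ASCII comma and question mark are removed
padded = pad_to_length(tokens, fix_length=8)  # short length chosen only for display
fullwidth = tokenize('就这样了？')             # the full-width question mark is kept
```

Note that the character class lists only the ASCII `?` and `!`, so full-width marks such as `？` survive tokenization and become vocabulary entries.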


In [33]:
#print(len(TEXT.vocab))
print(train_iter.batch_size)
print(val_iter.batch_size)
print(TEXT.vocab.stoi['<pad>'])
print(TEXT.vocab.stoi['<unk>'])

64
20
1
0
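
The printed indices show `<unk>` at 0 and `<pad>` at 1, which matches the special tokens being placed first in the vocabulary. The sketch below is a simplified stand-in for vocabulary building, not torchtext's actual code: `build_vocab` and `numericalize` are hypothetical helpers, and the frequency-then-alphabetical ordering is an assumption made for this illustration.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1, specials=('<unk>', '<pad>')):
    """Assign indices: specials first, then tokens by descending frequency (ties alphabetical)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials)
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi, unk_token='<unk>'):
    """Map tokens to indices, falling back to the unknown-token index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

stoi, itos = build_vocab([list('花呗分期'), list('花呗额度')])
ids = numericalize(list('借呗'), stoi)   # '借' is out of vocabulary and maps to <unk>
```

Any character missing from the training data maps to index 0 at inference time, so a vocabulary built with `min_freq=1` on a small dataset will send many unseen characters to `<unk>`.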


In [48]:

print(len(train_iter))
for b in train_iter:
    print(b.texta)
    print(b.textb)
    print(b.label)
    break

3
tensor([[381, 360,  21,  ..., 421, 103, 229],
        [120,  40,  53,  ...,  53, 118,  41],
        [ 11,  33, 103,  ...,  11,  29,  22],
        ...,
        [  3,   2, 253,  ...,   1,   1,   1],
        [ 11,  83,  27,  ...,   1,   1,   1],
        [ 21,  15,   4,  ...,   1,   1,   1]])
tensor([[ 21,  15,   4,  ...,   1,   1,   1],
        [ 46,  40,  36,  ..., 100,  20,  65],
        [ 11,  46,  24,  ...,   1,   1,   1],
        ...,
        [ 11,   3,   2,  ...,   1,   1,   1],
        [ 11,  10,  32,  ...,   1,   1,   1],
        [ 21,  15,   4,  ...,   1,   1,   1]])
tensor([1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0,
        1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1,
        1, 1, 0, 1])


In [49]:
# Build the model

class Encoder(nn.Module):
    def __init__(self, input_dim, hid_dim, n_layers, kernel_size, dropout, max_length=30):
        super(Encoder, self).__init__()
        
        #for kernel in kernel_size:
        assert kernel_size % 2 == 1,'kernel size must be odd!' # an odd kernel size allows symmetric padding on both sides of the sequence
        
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(DEVICE) # keeps the variance from changing significantly across the residual sums
        
        self.tok_embedding = nn.Embedding(input_dim, hid_dim) # token embedding
        self.pos_embedding = nn.Embedding(max_length, hid_dim) # position embedding
        
        #self.emb2hid = nn.Linear(emb_dim, hid_dim) # linear layer from emb_dim to hid_dim
        #self.hid2emb = nn.Linear(hid_dim, emb_dim) # linear layer from hid_dim to emb_dim
        
        # convolution blocks
        
        self.convs = nn.ModuleList([nn.Conv1d(in_channels=hid_dim,
                                              out_channels=2*hid_dim, # 2*hid_dim output channels because the GLU activation halves them
                                              kernel_size=kernel_size,
                                              padding=(kernel_size - 1)//2) # zeros padded on each side to keep the sequence length unchanged
                                              for _ in range(n_layers)]) 
        
        '''
        Feature extraction with several kernel sizes
        self.conv_1 = nn.ModuleList([nn.Conv1d(in_channels=hid_dim,
                                                  out_channels=2*hid_dim, # 2*hid_dim output channels because the GLU activation halves them
                                                  kernel_size=kernel_size[0],
                                                  padding=(kernel_size[0] - 1)//2) # zeros padded on each side to keep the sequence length unchanged
                                                  for _ in range(n_layers)])
        self.conv_2 = nn.ModuleList([nn.Conv1d(in_channels=hid_dim,
                                                  out_channels=2*hid_dim, # 2*hid_dim output channels because the GLU activation halves them
                                                  kernel_size=kernel_size[1],
                                                  padding=(kernel_size[1] - 1)//2) # zeros padded on each side to keep the sequence length unchanged
                                                  for _ in range(n_layers)])
        self.conv_3 = nn.ModuleList([nn.Conv1d(in_channels=hid_dim,
                                                  out_channels=2*hid_dim, # 2*hid_dim output channels because the GLU activation halves them
                                                  kernel_size=kernel_size[2],
                                                  padding=(kernel_size[2] - 1)//2) # zeros padded on each side to keep the sequence length unchanged
                                                  for _ in range(n_layers)])
        
        # project the concatenated outputs of the convolution modules back to hid_dim
        self.convhid2hid = nn.Linear(len(kernel_size) * hid_dim, hid_dim)
        '''
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        # src: [batch_size, src_len]
        batch_size = src.shape[0]
        src_len = src.shape[1]
        
        # build the token position indices
        pos = torch.arange(src_len).unsqueeze(0).repeat(batch_size, 1).to(DEVICE) # [batch_size, src_len]
        
        # embed the tokens and their positions
        tok_embedded = self.tok_embedding(src) # [batch_size, src_len, emb_dim]
        pos_embedded = self.pos_embedding(pos.long()) # [batch_size, src_len, emb_dim]
        
        # element-wise sum of the token and position embeddings
        embedded = self.dropout(tok_embedded + pos_embedded) # [batch_size, src_len, emb_dim]
        
        # (disabled) pass embedded through a linear layer from emb_dim to hid_dim as the conv input
        #conv_input = self.emb2hid(embedded) # [batch_size, src_len, hid_dim]
        
        # permute so that Conv1d convolves over the last dimension (the sequence)
        conv_input = embedded.permute(0, 2, 1) # [batch_size, hid_dim, src_len]
        
        
        # convolution blocks
        for i, conv in enumerate(self.convs):
            # convolution
            conved = conv(self.dropout(conv_input)) # [batch_size, 2*hid_dim, src_len]
            
            # GLU activation
            conved = F.glu(conved, dim=1) # [batch_size, hid_dim, src_len]
            
            # residual connection
            conved = (conved + conv_input) * self.scale # [batch_size, hid_dim, src_len]
            
            # input to the next convolution block
            conv_input = conved
        
        # (disabled) linear layer from hid_dim to emb_dim, giving the encoder's convolutional output features
        #conved = self.hid2emb(conved.permute(0, 2, 1)) # [batch_size, src_len, emb_dim]
        
        
        
        '''
        Feature extraction with several kernel sizes
        # first kernel_size
        conved_input = conv_input
        for i, conv in enumerate(self.conv_1):
            # convolution
            conved1 = conv(self.dropout(conved_input)) # [batch_size, 2*hid_dim, src_len]

            # GLU activation
            conved1 = F.glu(conved1, dim=1) # [batch_size, hid_dim, src_len]

            # residual connection
            conved1 = (conved1 + conved_input) * self.scale # [batch_size, hid_dim, src_len]

            # input to the next convolution block
            conved_input = conved1
        
        combine_conv_module = conved1
        
        # second kernel_size
        conved_input = conv_input
        for i, conv in enumerate(self.conv_2):
            # convolution
            conved2 = conv(self.dropout(conved_input)) # [batch_size, 2*hid_dim, src_len]

            # GLU activation
            conved2 = F.glu(conved2, dim=1) # [batch_size, hid_dim, src_len]

            # residual connection
            conved2 = (conved2 + conved_input) * self.scale # [batch_size, hid_dim, src_len]

            # input to the next convolution block
            conved_input = conved2
            
        combine_conv_module = torch.cat([combine_conv_module, conved2], dim = 1)
        
        # third kernel_size
        conved_input = conv_input
        for i, conv in enumerate(self.conv_3):
            # convolution
            conved3 = conv(self.dropout(conved_input)) # [batch_size, 2*hid_dim, src_len]

            # GLU activation
            conved3 = F.glu(conved3, dim=1) # [batch_size, hid_dim, src_len]

            # residual connection
            conved3 = (conved3 + conved_input) * self.scale # [batch_size, hid_dim, src_len]

            # input to the next convolution block
            conved_input = conved3
            
        combine_conv_module = torch.cat([combine_conv_module, conved3], dim = 1)
        
        conved = self.convhid2hid(combine_conv_module.permute(0, 2, 1)) # [batch_size, src_len, hid_dim]
        '''
        
        # another residual connection: element-wise sum with the embeddings, giving the encoder's combined output features
        combined = (conved.permute(0, 2, 1) + embedded) * self.scale # [batch_size, src_len, emb_dim]
        
        return conved, combined

'''
Multi-head attention
'''
class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout):
        super(MultiHeadAttentionLayer, self).__init__()
        
        assert hid_dim % n_heads == 0
        
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        
        self.fc_o = nn.Linear(hid_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([self.hid_dim])).to(DEVICE) # scaling factor; note that the standard Transformer scales by sqrt(head_dim) rather than sqrt(hid_dim)
        
    def forward(self, query, key, value, mask=None):
        '''
        query: [batch_size, query_len, hid_dim]
        key: [batch_size, key_len, hid_dim]
        value: [batch_size, value_len, hid_dim]
        '''
        batch_size = query.shape[0]
        
        Q = self.fc_q(query) # [batch_size, query_len, hid_dim]
        K = self.fc_k(key) # [batch_size, key_len, hid_dim]
        V = self.fc_v(value) # [batch_size, value_len, hid_dim]
        
        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3) # [batch_size, n_heads, query_len, head_dim]
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3) # [batch_size, n_heads, key_len, head_dim]
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3) # [batch_size, n_heads, value_len, head_dim]
        
        # [batch_size, n_heads, query_len, head_dim] * [batch_size, n_heads, head_dim, key_len]
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale # [batch_size, n_heads, query_len, key_len]
        
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        
        attention = torch.softmax(energy, dim=-1) # [batch_size, n_heads, query_len, key_len]
        
        # [batch_size, n_heads, query_len, key_len] * [batch_size, n_heads, value_len, head_dim]
        x = torch.matmul(self.dropout(attention), V) # [batch_size, n_heads, query_len, head_dim]
        
        x = x.permute(0, 2, 1, 3).contiguous() # [batch_size, query_len, n_heads, head_dim]
        
        x = x.view(batch_size, -1, self.hid_dim) # [batch_size, query_len, hid_dim]
        
        x = self.fc_o(x) # [batch_size, query_len, hid_dim]
        
        return x, attention
    
class SiameseNetwork(nn.Module):
    def __init__(self, EncoderA, hid_dim, n_heads, dropout):
        super(SiameseNetwork, self).__init__()
        self.EncoderA = EncoderA
        #self.EncoderB = EncoderB
        #self.dropout = nn.Dropout(dropout)
        
        # multi-head attention
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout)
        
        self.fcA = nn.Linear(2 * hid_dim, hid_dim)
        self.fcB = nn.Linear(2 * hid_dim, hid_dim)
        
        self.fc_out = nn.Linear(5 * hid_dim, 2)
        
    def calculate_attention(self, convedA, convedB):
        '''
        convedA:[batch_size, len, hid_dim]
        convedB:[batch_size, len, hid_dim]
        '''
        energy = torch.matmul(convedA, convedB.permute(0, 2, 1)) # [batch_size, trg_len, src_len]
        
        attention = F.softmax(energy, dim=2) # [batch_size, trg_len, src_len]
        
        attention_encoding = torch.matmul(attention, convedB) # [batch_size, trg_len, hid_dim]
        
        return attention, attention_encoding
    
    def forward(self, sentA, sentB):
        convedA, combinedA = self.EncoderA(sentA)
        convedB, combinedB = self.EncoderA(sentB)
        
        # plain attention
        #attentionA, attended_encodingA = self.calculate_attention(combinedB, combinedA)
        #attentionB, attended_encodingB = self.calculate_attention(combinedA, combinedB)
        
        # multi-head attention, as in the Transformer; the query comes from the other sentence, key and value from the current one
        self_attentionA, attentionA = self.self_attention(combinedB, combinedA, combinedA)
        self_attentionB, attentionB = self.self_attention(combinedA, combinedB, combinedB)
        
        combinedA = torch.cat([self_attentionA, combinedA], dim=2) # [batch_size, len, 2 * hid_dim]
        combinedB = torch.cat([self_attentionB, combinedB], dim=2) # [batch_size, len, 2 * hid_dim]
        
        combinedA = self.fcA(combinedA) # [batch_size, len, hid_dim]
        combinedB = self.fcB(combinedB) # [batch_size, len, hid_dim]
        
        combinedA = F.avg_pool1d(combinedA.permute(0, 2, 1), combinedA.shape[1]).squeeze(2) # [batch_size, emb_dim]
        combinedB = F.avg_pool1d(combinedB.permute(0, 2, 1), combinedB.shape[1]).squeeze(2) # [batch_size, emb_dim]

        similarity = torch.cosine_similarity(combinedA, combinedB, dim=1) # compute and learn the similarity directly; [batch_size]
        
        # alternative: binary classification
        # [p, q, p+q, p-q, p*q]
        #fc_out = self.fc_out(torch.cat([combinedA, combinedB, combinedA+combinedB, combinedA-combinedB, combinedA*combinedB], dim=1)) # [batch_size, 2]
        
        return similarity
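
The interaction layer above calls the attention module with the query taken from one sentence and the key and value taken from the other. The sketch below shows that pattern as single-head scaled dot-product attention on plain Python lists. It is an illustration under simplifying assumptions, not the notebook's module: there are no learned projections, no dropout, and only one head, and the names `cross_attention`, `feats_a`, and `feats_b` are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain lists.

    queries: [query_len, dim]; keys, values: [key_len, dim].
    Returns (outputs [query_len, dim], weights [query_len, key_len]).
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w_j * v[d] for w_j, v in zip(w, values)) for d in range(dim)])
    return outputs, weights

# Hypothetical 2-dim features: sentence B supplies the query, sentence A the keys and values.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[10.0, 0.0]]
attended, attn = cross_attention(feats_b, feats_a, feats_a)
```

The output has one row per query position, so in the notebook the attended tensor can only be concatenated with the other sentence's features because `fix_length=30` gives both sentences the same length.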

In [6]:
# Initialize the weights of all modules and submodules
def init_weights(model):
    for name,param in model.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

In [54]:
'''
Define the model
'''

input_dim = len(TEXT.vocab)
emb_dim = 128
cnn_layers = 4 # number of convolution blocks in the Encoder
kernel_size = 3 #(3,5,7)
dropout = 0.5
n_heads = 4

pad_idx = TEXT.vocab.stoi['<pad>']

encA = Encoder(input_dim, emb_dim, cnn_layers, kernel_size, dropout) # both sentences share one encoder; separate encoders could be defined instead

#encB = Encoder(input_dim, emb_dim, cnn_layers, kernel_size, dropout)

model = SiameseNetwork(encA, emb_dim, n_heads, dropout).to(DEVICE)

# Initialize the weights of all modules and submodules
# model.apply(init_weights)

'''
Resume training from a previously trained model
model_path = os.path.join(os.getcwd(), "model.h5")
if os.path.exists(model_path):
    model.load_state_dict(torch.load(model_path))
    print('model load done!')
'''

# optimizer
optimizer = optim.Adam(model.parameters())
# optimizer = optim.AdamW(model.parameters(), lr=0.001)

# optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# loss function
# classification
# criterion = nn.BCELoss() # use together with sigmoid
# criterion = nn.BCEWithLogitsLoss() # sigmoid + BCELoss
# class weights for imbalanced classes
#weight = torch.tensor([0.1, 1.0])
#weight = weight.to(DEVICE)

#criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# regression, used for the similarity score
criterion = nn.MSELoss()
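
With `nn.MSELoss`, the cosine similarity is regressed directly onto the 0/1 label. The sketch below computes the same quantity by hand on made-up numbers; the similarity values and labels are hypothetical and are not outputs of the model above.

```python
def mse_loss(predictions, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Hypothetical cosine similarities from the model and their 0/1 labels.
similarities = [0.9, 0.1, -0.2, 0.6]
labels = [1, 0, 0, 1]

loss = mse_loss(similarities, [float(y) for y in labels])
```

One consequence of this setup is that a label of 0 pulls the cosine similarity toward 0 (orthogonal vectors) rather than toward -1, so negative similarities for dissimilar pairs are penalized as well.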

In [55]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import r2_score

# Training
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        texta = batch.texta  # src=[batch_size, seq_len]
        textb = batch.textb  # trg=[batch_size, seq_len]
        label = batch.label # [batch_size]
        
        texta = texta.to(DEVICE)
        textb = textb.to(DEVICE)
        label = label.to(DEVICE)
        
        optimizer.zero_grad()
        
        out = model(texta, textb) # [batch_size]
        # compute the loss
        loss = criterion(out, label.float())
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
    
    return epoch_loss / len(iterator)

# val loss
def evaluate(model, iterator, criterion):
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            texta = batch.texta  # src=[batch_size, seq_len]
            textb = batch.textb  # trg=[batch_size, seq_len]
            label = batch.label # [batch_size]

            texta = texta.to(DEVICE)
            textb = textb.to(DEVICE)
            label = label.to(DEVICE)
            
            out = model(texta, textb) # [batch_size]
            
            # AUC score for the classification variant
            '''
            prediction = torch.max(F.softmax(out, dim=1), dim=1)[1]
            
            pred_y = prediction.cpu().data.numpy().squeeze()
            target_y = label.cpu().data.numpy()
            score = roc_auc_score(target_y, pred_y)
            '''
            # compute the loss
            loss = criterion(out, label.float())
        
            
            epoch_loss += loss.item()
            
    return epoch_loss / len(iterator)
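
The disabled AUC code above targets the two-logit classification variant. For the similarity output, one possible evaluation is to threshold the score for accuracy, or to compute a ranking-based AUC on the raw scores. The sketch below shows both in plain Python. The 0.5 threshold, the helper names, and the sample scores are assumptions for illustration, not part of the notebook.

```python
def accuracy_at_threshold(similarities, labels, threshold=0.5):
    """Fraction of pairs whose thresholded similarity matches the 0/1 label."""
    predictions = [1 if s >= threshold else 0 for s in similarities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

def pairwise_auc(similarities, labels):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(similarities, labels) if y == 1]
    negatives = [s for s, y in zip(similarities, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))

# Hypothetical similarity scores and labels.
similarities = [0.9, 0.1, -0.2, 0.6, 0.7]
labels = [1, 0, 0, 1, 0]

acc = accuracy_at_threshold(similarities, labels)
auc = pairwise_auc(similarities, labels)
```

Unlike accuracy, the AUC does not depend on a threshold, which can be useful when the similarity scores are not calibrated around 0.5.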

In [56]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
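
A quick check of the helper above on a made-up duration, together with an equivalent `divmod` form for non-negative durations. The name `epoch_time_divmod` is hypothetical.

```python
def epoch_time(start_time, end_time):
    """Same logic as the notebook helper: whole minutes and leftover whole seconds."""
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

def epoch_time_divmod(start_time, end_time):
    """Equivalent form for non-negative durations using divmod on whole seconds."""
    return divmod(int(end_time - start_time), 60)

mins, secs = epoch_time(0.0, 125.7)
alt = epoch_time_divmod(0.0, 125.7)
```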

In [57]:
n_epochs = 10 # number of epochs
clip = 0.1 # gradient clipping threshold (max norm)
import math

model_path = os.path.join(os.getcwd(), "model.h5")

best_valid_loss = float('inf')

for epoch in range(n_epochs):
    
    start_time = time.time()
    
    train_loss = train(model, train_iter, optimizer, criterion, clip)
    valid_loss = evaluate(model, val_iter, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time) # time spent on this epoch
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), model_path)
        
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
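    # Note: the "PPL" columns are exp(MSE loss); they are not a true perplexity for this regression objective.
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')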
    # print(f'\t Val. auc_score: {score:.3f}')

Epoch: 01 | Time: 0m 0s
	Train Loss: 0.365 | Train PPL:   1.440
	 Val. Loss: 0.254 |  Val. PPL:   1.289
Epoch: 02 | Time: 0m 0s
	Train Loss: 0.256 | Train PPL:   1.292
	 Val. Loss: 0.278 |  Val. PPL:   1.321
Epoch: 03 | Time: 0m 0s
	Train Loss: 0.242 | Train PPL:   1.274
	 Val. Loss: 0.288 |  Val. PPL:   1.333
Epoch: 04 | Time: 0m 0s
	Train Loss: 0.239 | Train PPL:   1.270
	 Val. Loss: 0.301 |  Val. PPL:   1.352
Epoch: 05 | Time: 0m 0s
	Train Loss: 0.248 | Train PPL:   1.281
	 Val. Loss: 0.321 |  Val. PPL:   1.378
Epoch: 06 | Time: 0m 0s
	Train Loss: 0.246 | Train PPL:   1.279
	 Val. Loss: 0.291 |  Val. PPL:   1.337
Epoch: 07 | Time: 0m 0s
	Train Loss: 0.232 | Train PPL:   1.261
	 Val. Loss: 0.311 |  Val. PPL:   1.364
Epoch: 08 | Time: 0m 0s
	Train Loss: 0.216 | Train PPL:   1.241
	 Val. Loss: 0.342 |  Val. PPL:   1.408
Epoch: 09 | Time: 0m 0s
	Train Loss: 0.226 | Train PPL:   1.253
	 Val. Loss: 0.325 |  Val. PPL:   1.385
Epoch: 10 | Time: 0m 0s
	Train Loss: 0.201 | Train PPL:   1.223
