### Task 3: Attention-based text matching

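Given two sentences, determine the relationship between them. Follow ESIM (using only the LSTM and ignoring the Tree-LSTM is fine) and implement it with a bidirectional attention mechanism.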

#### Key concepts:

Attention mechanism

token2token attention

#### This write-up does not consider the Tree-LSTM for now

#### About LSTM:
https://www.cnblogs.com/wangduo/p/6773601.html?utm_source=itdadao&utm_medium=referral
#### About BiLSTM (attention):
https://blog.csdn.net/qq_34992900/article/details/115443992
#### About RNN, attention, etc.:
https://www.bilibili.com/video/BV15b4y1R7c9?spm_id_from=333.337.search-card.all.click

(Development and a first look at the principles: neural language model (NNLM), recurrent neural network (RNN), and attention (Attention) ......)

In [1]:
# torchtext.legacy exists only in torchtext 0.9 through 0.11 (it was removed in 0.12)
from torchtext.legacy import data
import torch

#### Load the data

In [2]:
TEXT=data.Field(lower=True,batch_first=True,include_lengths=True)
LABEL=data.LabelField(batch_first=True)
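
To make the `Field` options concrete, here is a minimal pure-Python sketch of roughly what `lower=True`, `batch_first=True`, and `include_lengths=True` amount to at the token level: each batch becomes a pair of (padded sequences, true lengths), laid out batch-first. The helper `pad_batch` and the example sentences are illustrative assumptions; torchtext additionally maps tokens to integer ids, which is omitted here.

```python
def pad_batch(sentences, pad_token="<pad>"):
    """Hypothetical helper: lowercase, whitespace-tokenize, and pad a batch."""
    tokenized = [s.lower().split() for s in sentences]
    lengths = [len(toks) for toks in tokenized]
    max_len = max(lengths)
    # Pad every sentence on the right up to the longest one in the batch.
    padded = [toks + [pad_token] * (max_len - len(toks)) for toks in tokenized]
    return padded, lengths

# Made-up sentences for illustration only
padded, lengths = pad_batch(["A man inspects the uniform", "Two dogs run"])
print(padded)
print(lengths)
```

The lengths are what later cells unpack as `premise_lens` and `hypothesis_lens`.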

#### When defining the fields, we can consult the README that ships with the dataset:

sentence1: The premise caption that was supplied to the author of the pair.

sentence2: The hypothesis caption that was written by the author of the pair.

sentence{1,2}_parse: The parse produced by the Stanford Parser (3.5.2, case insensitive PCFG, trained on the standard training set augmented with the parsed Brown Corpus) in Penn Treebank format.

sentence{1,2}_binary_parse: The same parse as in sentence{1,2}_parse, but formatted for use in tree-structured neural networks with no unary nodes and no labels.

annotator_labels (label1-5 in the tab separated file): These are all of the individual labels from annotators in phases 1 and 2. The first label comes from the phase 1 author, and is the only label for examples that did not undergo phase 2 annotation. In a few cases, the one of the phase 2 labels may be blank, indicating that an annotator saw the example but could not annotate it.

gold_label: This is the label chosen by the majority of annotators. Where no majority exists, this is '-', and the pair should not be included when evaluating hard classification accuracy.

captionID: A unique identifier for each sentence1 from the original Flickr30k example.

pairID: A unique identifier for each sentence1--sentence2 pair.

NOTE: captionID and pairID contain information that can be useful in making classification decisions and should not be included in model input (nor, of course, should either annotator_labels or gold_label).

#### So, when we leave out the Tree-LSTM, we only need sentence1 (text), sentence2 (text), and gold_label (label)

In [3]:
fields={'sentence1':('premise',TEXT),'sentence2':('hypothesis',TEXT),'gold_label':('label',LABEL)}

train_data,dev_data,test_data=data.TabularDataset.splits(
    path=r'C:\Users\16786\Desktop\复旦nlp上手教程\task3\data',
    train='snli_1.0_train.jsonl',
    validation='snli_1.0_dev.jsonl',
    test='snli_1.0_test.jsonl',
    format='json',
    fields=fields,
    filter_pred=lambda x:x.label!='-'  # keep only examples whose gold label is not '-' (drop pairs with no majority label)
)
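
The effect of `filter_pred` can be illustrated without torchtext. The sketch below parses a couple of made-up JSON lines (not real SNLI rows) and keeps only the pairs whose `gold_label` is not `'-'`. `load_pairs` is a hypothetical helper, not part of the notebook.

```python
import json

def load_pairs(lines):
    """Hypothetical helper: parse JSONL rows and drop pairs without a gold label."""
    pairs = []
    for line in lines:
        row = json.loads(line)
        if row["gold_label"] == "-":
            continue  # no annotator majority, so there is nothing to train on
        pairs.append((row["sentence1"], row["sentence2"], row["gold_label"]))
    return pairs

# Made-up rows for illustration only
sample = [
    '{"sentence1": "A dog runs.", "sentence2": "An animal moves.", "gold_label": "entailment"}',
    '{"sentence1": "A dog runs.", "sentence2": "A cat sleeps.", "gold_label": "-"}',
]
pairs = load_pairs(sample)
print(len(pairs), pairs[0][2])
```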

In [4]:
from torchtext.vocab import Vectors

vectors=Vectors('glove.6B.300d.txt',r'C:\Users\16786\Desktop\复旦nlp上手教程\task3\GLOVE')

TEXT.build_vocab(train_data,vectors=vectors,unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)  # build the label vocabulary from the training split

Note: Tensor.normal_(mean=0, std=1, *, generator=None) → Tensor

Fills the self tensor with elements sampled from the normal distribution parameterized by mean and std.
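
As a small illustration of the note above: a function passed as `unk_init` is meant to initialize the vector of a word that is missing from the pretrained file, and `torch.Tensor.normal_` does so by filling the tensor in place with standard-normal samples. The sketch only demonstrates that in-place call; the tensor `vec` and the seed are assumptions made for the example.

```python
import torch

torch.manual_seed(0)               # assumed seed, only for repeatability
vec = torch.zeros(300)             # stand-in for one 300-d word vector
out = torch.Tensor.normal_(vec)    # same as vec.normal_(): fills vec in place

print(out is vec)                  # the call returns the very same tensor object
print(vec.mean().item(), vec.std().item())
```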

##### Iterators:

In [5]:
from torchtext.legacy.data import Iterator, BucketIterator
batch=32
device="cpu"

train_iter,dev_iter=BucketIterator.splits(
    (train_data,dev_data),
    batch_size=batch,
    device=device,
    sort_key=lambda x:len(x.premise)+len(x.hypothesis),
    sort_within_batch=True,
    repeat=False,
    shuffle=True
)
test_iter=data.Iterator(
    test_data,              
    batch_size=batch,
    device=device,
    sort=False,
    sort_within_batch=False,
    repeat=False,
    shuffle=False
)
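
The point of giving `BucketIterator` a `sort_key` is to batch examples of similar length together so that less padding is needed. The pure-Python sketch below compares the padding cost of batching in arbitrary order with batching after sorting by length. The lengths and the `padding_cost` helper are made up, and the real `BucketIterator` groups examples more loosely than a global sort.

```python
def padding_cost(lengths, batch_size):
    """Hypothetical helper: total pad tokens when batching lengths in the given order."""
    cost = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        # Every example in a batch is padded up to the longest one.
        cost += sum(max(chunk) - n for n in chunk)
    return cost

lengths = [3, 20, 4, 18, 5, 21, 2, 19]    # made-up sentence lengths
unsorted_cost = padding_cost(lengths, batch_size=4)
sorted_cost = padding_cost(sorted(lengths), batch_size=4)
print(unsorted_cost, sorted_cost)
```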

Quickly check that the iterator works, so we do not discover a problem with it only after the training code is written.

In [6]:
# batch=next(iter(train_iter))
# premise, premise_lens = batch.premise
# print(batch.premise)
# print(premise.size(1))
# print(premise)
# print(premise_lens)

## ESIM:

Input Encoding → Local Inference Modeling → Inference Composition → Prediction (i.e., pooling)
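
For reference, the local inference and pooling steps can be written out as follows. This is a sketch of the ESIM formulas, numbered to match the equation references in the code comments of the next cell. Here $\bar{a}_i$ and $\bar{b}_j$ are the BiLSTM-encoded premise and hypothesis tokens, $\ell_a$ and $\ell_b$ are the sentence lengths, and $v_{a,i}$, $v_{b,j}$ are the outputs of the composition BiLSTM.

```latex
\begin{align}
e_{ij} &= \bar{a}_i^{\top} \bar{b}_j \tag{11} \\
\tilde{a}_i &= \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b} \exp(e_{ik})} \, \bar{b}_j \tag{12} \\
\tilde{b}_j &= \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a} \exp(e_{kj})} \, \bar{a}_i \tag{13} \\
m_a &= [\bar{a};\ \tilde{a};\ \bar{a} - \tilde{a};\ \bar{a} \odot \tilde{a}] \tag{14} \\
m_b &= [\bar{b};\ \tilde{b};\ \bar{b} - \tilde{b};\ \bar{b} \odot \tilde{b}] \tag{15} \\
v_{a,\mathrm{ave}} &= \sum_{i=1}^{\ell_a} \frac{v_{a,i}}{\ell_a}, \qquad v_{a,\max} = \max_{i=1}^{\ell_a} v_{a,i} \tag{18} \\
v_{b,\mathrm{ave}} &= \sum_{j=1}^{\ell_b} \frac{v_{b,j}}{\ell_b}, \qquad v_{b,\max} = \max_{j=1}^{\ell_b} v_{b,j} \tag{19} \\
v &= [v_{a,\mathrm{ave}};\ v_{a,\max};\ v_{b,\mathrm{ave}};\ v_{b,\max}] \tag{20}
\end{align}
```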

In [7]:
from torch import nn
import torch.nn.functional as F

class input_encoding(nn.Module):
    def __init__(self,num_features,embedding_size,hidden_size,num_layers,vectors,batch_first=True,dropout=0.5):
        super(input_encoding,self).__init__()
        self.num_features=num_features
        self.embedding_size=embedding_size
        self.hidden_size=hidden_size
        self.num_layers=num_layers
        
        self.dropout=nn.Dropout(dropout)
        self.embedding=nn.Embedding.from_pretrained(vectors)    # load the pretrained word vectors (frozen, since freeze defaults to True)
        self.bilstm=nn.LSTM(input_size=embedding_size,hidden_size=hidden_size//2,num_layers=num_layers,batch_first=True,dropout=dropout,bidirectional=True)
        # every LSTM's(forward and backward) hidden size is half of HIDDEN_SIZE
        
    def forward(self,x,lens):
        a=self.embedding(x)
        b=self.dropout(a)
        c,_=self.bilstm(b)
        return c

class local_inference_modeling(nn.Module):
    def __init__(self):
        super(local_inference_modeling,self).__init__()
        
        self.softmax_1=nn.Softmax(dim=1)
        self.softmax_2=nn.Softmax(dim=2)
    def forward(self,a,b):
        #equation 11 in paper:
        e=torch.matmul(a,b.transpose(1,2))  # torch.matmul() is a matrix product that also supports broadcasting, unlike .bmm() below
        #equation 12 in paper:
        a_=(self.softmax_2(e)).bmm(b)  # .bmm() is a batched matrix product (3-D tensors only, no broadcasting)
        #equation 13 in paper:
        b_=(self.softmax_1(e).transpose(1, 2)).bmm(a)
        #equation 14 in paper:
        m_a=torch.cat([a,a_,a-a_,a*a_],dim=-1)  # concatenate along the last (feature) dimension
        #equation 15 in paper:
        m_b=torch.cat([b,b_,b-b_,b*b_],dim=-1)
        return m_a,m_b
    
class inference_composition(nn.Module):
    def __init__(self,num_features,input_size,hidden_size,num_layers,embedding_size,batch_first=True,dropout=0.5):
        super(inference_composition,self).__init__()
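        self.linear=nn.Linear(4*hidden_size,hidden_size)  # 4*hidden_size == input_size
        self.bilstm=nn.LSTM(input_size=hidden_size,hidden_size=hidden_size//2,num_layers=num_layers,batch_first=True,dropout=dropout,bidirectional=True)
        # batch_first=True is needed because x arrives as (batch, seq_len, features);
        # without it the LSTM would treat the batch dimension as time.
        # Remember the hidden_size//2 here: the two directions are concatenated back to hidden_size.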
        self.dropout=nn.Dropout(dropout)

    def forward(self,x,lens):
        a=self.linear(x)
        b=self.dropout(a)
        c,_=self.bilstm(b)
        return c   

class prediction(nn.Module):
    def __init__(self, input_size, output_size, num_classes=4, dropout=0.5):
        super(prediction, self).__init__()
        self.mlp = nn.Sequential(
            nn.Linear(input_size,output_size),
            nn.Tanh(),
            nn.Linear(output_size,num_classes)
        )
    def forward(self,a,b):
        #equation 18,19 in paper:
        v_a_avg=F.avg_pool1d(a.transpose(1,2),a.size(1)).squeeze(-1)  # a.size(1) is the sequence length, so the pool spans the whole sentence
        v_a_max=F.max_pool1d(a.transpose(1,2),a.size(1)).squeeze(-1)
        v_b_avg=F.avg_pool1d(b.transpose(1,2),b.size(1)).squeeze(-1)
        v_b_max=F.max_pool1d(b.transpose(1,2),b.size(1)).squeeze(-1)
        #equation 20 in paper:
        out=torch.cat((v_a_avg,v_a_max,v_b_avg,v_b_max),dim=-1)
        output=self.mlp(out)
        return output

class ESIM(nn.Module):
    def __init__(self,num_features,hidden_size,embedding_size,num_classes=4,vectors=None,num_layers=1,batch_first=True,drop_out=0.5,freeze=False):
        super(ESIM,self).__init__()
        self.embedding_size=embedding_size
        self.input_encoding=input_encoding(num_features,embedding_size,hidden_size,num_layers,vectors,dropout=0.5,batch_first=True)
        self.local_inference_modeling=local_inference_modeling()
        self.inference_composition=inference_composition(num_features,4*hidden_size,hidden_size,num_layers,embedding_size,batch_first=True,dropout=0.5)
        self.prediction=prediction(4*hidden_size,hidden_size,num_classes,drop_out)
    def forward(self,a,len_a,b,len_b):
        a_bar=self.input_encoding(a,len_a)
        b_bar=self.input_encoding(b,len_b)
        m_a,m_b=self.local_inference_modeling(a_bar,b_bar)
        v_a=self.inference_composition(m_a,len_a)
        v_b=self.inference_composition(m_b,len_b)
        out_put=self.prediction(v_a,v_b)
        return out_put
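
The `local_inference_modeling` module above applies softmax over every position, padding included, and the `lens` arguments are passed around but never used. One possible refinement, sketched here on the assumption that every length is at least 1, is to mask padded positions before each softmax. `lengths_to_mask` and `masked_soft_align` are hypothetical names, not part of the notebook.

```python
import torch

def lengths_to_mask(lens, max_len):
    """Boolean mask of shape (batch, max_len); True marks real (non-pad) tokens."""
    positions = torch.arange(max_len, device=lens.device)
    return positions.unsqueeze(0) < lens.unsqueeze(1)

def masked_soft_align(a, b, len_a, len_b):
    """Soft alignment (ESIM eqs. 11-13) that gives zero weight to padded positions."""
    mask_a = lengths_to_mask(len_a, a.size(1))           # (B, La)
    mask_b = lengths_to_mask(len_b, b.size(1))           # (B, Lb)
    e = torch.matmul(a, b.transpose(1, 2))               # (B, La, Lb)
    # Each a_i attends over b: hide padded columns before the softmax.
    w_ab = torch.softmax(e.masked_fill(~mask_b.unsqueeze(1), float("-inf")), dim=2)
    # Each b_j attends over a: hide padded rows before the softmax.
    w_ba = torch.softmax(e.masked_fill(~mask_a.unsqueeze(2), float("-inf")), dim=1)
    a_tilde = w_ab.bmm(b)                                # (B, La, H)
    b_tilde = w_ba.transpose(1, 2).bmm(a)                # (B, Lb, H)
    return a_tilde, b_tilde, w_ab, w_ba

# Random stand-in encodings: batch of 2, lengths padded to 4 and 5, feature size 6
torch.manual_seed(0)
a, b = torch.randn(2, 4, 6), torch.randn(2, 5, 6)
len_a, len_b = torch.tensor([4, 2]), torch.tensor([5, 3])
a_tilde, b_tilde, w_ab, w_ba = masked_soft_align(a, b, len_a, len_b)
print(a_tilde.shape, b_tilde.shape)
```

Rows of `a_tilde` and `b_tilde` that correspond to padded query positions are still computed, so the pooling step would also need the masks if padding is to be ignored end to end.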

In [8]:
hidden_size=600
epochs=20
dropout=0.5
num_layers=1
learning_rate=4e-4
embedding_size=300
batch_size=32

In [9]:
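# SNLI has three gold labels once '-' is filtered out; num_classes=4 leaves one extra output logit.
model = ESIM(num_features=len(TEXT.vocab),hidden_size=hidden_size,
             embedding_size=embedding_size,num_classes=4,
             vectors=TEXT.vocab.vectors,num_layers=num_layers,
             batch_first=True, drop_out=0.5, freeze=False)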



### Train

In [10]:
from tqdm import tqdm
import torch.optim as optim

optimizer=optim.Adam(model.parameters(),lr=learning_rate)
loss_func=nn.CrossEntropyLoss()

def train(train_iter,dev_iter,loss_func,optimizer,epochs):
    best_acc=-1
    patience_count=0
    for epoch in range(epochs):
        model.train()
        total_loss=0
        n=0
        for batch in tqdm(train_iter):
            premise,premise_lens=batch.premise
            hypothesis,hypothesis_lens=batch.hypothesis
            labels=batch.label

            model.zero_grad()
            output=model(premise,premise_lens,hypothesis, hypothesis_lens)
            loss=loss_func(output,labels)
            total_loss+=loss.item()
            n+=batch_size
            loss.backward()
            #torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
            if n%3600==0:
                # total_loss sums per-batch mean losses while n counts examples,
                # so the value printed here is the mean batch loss divided by batch_size.
                print('epoch : {} step : {}, loss : {}'.format(epoch,int(n/3600),total_loss/n))
        tqdm.write("Epoch: %d, Train Loss: %.4f"%(epoch+1,total_loss))
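
`train` initializes `best_acc` and `patience_count` and receives `dev_iter`, but never uses any of them. One way they could be used is patience-based early stopping on dev accuracy. The sketch below shows only that bookkeeping, in plain Python. `EarlyStopper`, the `patience` value, and the accuracy numbers are illustrative assumptions, not results from this notebook.

```python
class EarlyStopper:
    """Hypothetical helper: stop once dev accuracy fails to improve `patience` epochs in a row."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_acc = -1.0
        self.patience_count = 0

    def step(self, dev_acc):
        """Record one epoch's dev accuracy; return True if training should stop."""
        if dev_acc > self.best_acc:
            self.best_acc = dev_acc
            self.patience_count = 0      # improvement: reset the counter
            return False
        self.patience_count += 1
        return self.patience_count >= self.patience

stopper = EarlyStopper(patience=2)
fake_dev_accs = [0.60, 0.65, 0.64, 0.63, 0.70]   # made-up numbers
stopped_at = None
for epoch, acc in enumerate(fake_dev_accs, start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_acc)
```

In a real loop, the accuracy passed to `step` would come from running the model over `dev_iter` under `model.eval()` and `torch.no_grad()`, and the best checkpoint would typically be saved whenever the counter resets.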

In [11]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

train(train_iter, dev_iter, loss_func, optimizer, epochs)

  0%|                                                                                        | 0/17168 [00:00<?, ?it/s]

The model has 6,493,204 trainable parameters


  1%|▉                                                                           | 225/17168 [02:24<3:07:22,  1.51it/s]

epoch : 0 step : 2, loss : 0.033685843124985695


  3%|█▉                                                                          | 450/17168 [04:53<2:53:30,  1.61it/s]

epoch : 0 step : 4, loss : 0.03283083680189318


  4%|██▉                                                                         | 675/17168 [07:23<3:37:28,  1.26it/s]

epoch : 0 step : 6, loss : 0.03207162418023304


  5%|███▉                                                                        | 900/17168 [09:57<3:15:45,  1.39it/s]

epoch : 0 step : 8, loss : 0.03161189170761241


  7%|████▉                                                                      | 1125/17168 [12:33<4:16:34,  1.04it/s]

epoch : 0 step : 10, loss : 0.03129819345143106


  7%|█████                                                                      | 1170/17168 [13:06<2:59:11,  1.49it/s]


KeyboardInterrupt: 
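
Two numbers in the output above can be cross-checked by hand. The sketch below recomputes the trainable-parameter count from the layer shapes (the embedding is excluded because `nn.Embedding.from_pretrained` freezes it by default). It also rescales the first logged loss back to a per-batch mean, since the logged value is the sum of per-batch mean losses divided by the number of examples. The helper names are hypothetical.

```python
import math

def lstm_params(input_size, hidden_size, bidirectional=True):
    """Parameters of a one-layer nn.LSTM: 4 gates, each with W_ih, W_hh, b_ih, b_hh."""
    per_direction = 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return per_direction * (2 if bidirectional else 1)

def linear_params(in_features, out_features):
    """Weights plus biases of an nn.Linear layer."""
    return in_features * out_features + out_features

hidden, emb, classes = 600, 300, 4
total = (
    lstm_params(emb, hidden // 2)            # input_encoding.bilstm
    + linear_params(4 * hidden, hidden)      # inference_composition.linear
    + lstm_params(hidden, hidden // 2)       # inference_composition.bilstm
    + linear_params(4 * hidden, hidden)      # prediction.mlp[0]
    + linear_params(hidden, classes)         # prediction.mlp[2]
)
print(f"{total:,}")

logged = 0.033685843124985695                # first logged value in the output above
per_batch_mean = logged * 32                 # undo the extra division by batch_size
print(per_batch_mean, math.log(3))           # ln 3 is the loss of a uniform guess over 3 labels
```

The rescaled loss is close to ln 3, which suggests that at this early point in the first epoch the model was still predicting only slightly better than a uniform guess over the three SNLI labels.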