# Topics covered in this seminar:
   Walkthrough of the Q&A summarization project:
    1. Building the vocab object
    2. Building the batcher
    3. PGN model
    4. coverage
    5. coverage_loss

In [4]:
import tensorflow as tf

![](img/Text.png)

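The blue text is the reference summary; the summaries generated by the three models differ considerably. Red text marks inaccurately generated details (UNK tokens for unknown words, i.e. the model cannot handle OOV words), and green text marks text that the model generated repeatedly.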

![](img/rouge.png)

![](img/aistudio.png)

# 1.PGN

1. sequence-to-sequence model baseline
2. pointer-generator model
3. coverage mechanism

# Seq2Seq
![](img/seq2seq.png)

-----

# PGN
![](img/pgn.png)

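## 0. Building the vocab object
Wrap the vocab in an object: define a `Vocab` class that encapsulates all the earlier vocab-dictionary handling, so that later code is easier to write.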

In [46]:
class Vocab:
    PAD_TOKEN = '<PAD>'
    UNKNOWN_TOKEN = '<UNK>'
    START_DECODING = '<START>'
    STOP_DECODING = '<STOP>'

    def __init__(self, vocab_file, vocab_max_size=None):
        """
        Vocab object that wraps the basic vocab operations.
        :param vocab_file: path of the stored vocab file
        :param vocab_max_size: maximum vocab size
        """
        self.word2id, self.id2word = self.load_vocab(vocab_file, vocab_max_size)
        self.count = len(self.word2id)

    @staticmethod
    def load_vocab(file_path, vocab_max_size=None):
        """
        Load the vocab from a file of "word<TAB>index" lines.
        :param file_path: path of the vocab file
        :param vocab_max_size: stop reading once an index exceeds this value
        :return: (word -> id dict, id -> word dict)
        """
        vocab = {}
        reverse_vocab = {}
        # use a context manager so the file is closed after reading
        with open(file_path, "r", encoding='utf-8') as f:
            for line in f:
                word, index = line.strip().split("\t")
                index = int(index)
                # if the vocab exceeds the specified size,
                # break out of the loop and truncate
                if vocab_max_size and index > vocab_max_size:
                    print("max_size of vocab was specified as %i; we now have %i words. Stopping reading." % (
                        vocab_max_size, index))
                    break
                vocab[word] = index
                reverse_vocab[index] = word
        return vocab, reverse_vocab

    def word_to_id(self, word):
        if word not in self.word2id:
            return self.word2id[self.UNKNOWN_TOKEN]
        return self.word2id[word]

    def id_to_word(self, word_id):
        if word_id not in self.id2word:
            raise ValueError('Id not found in vocab: %d' % word_id)
        return self.id2word[word_id]

    def size(self):
        return self.count

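## Q 1. Where do the OOV words come from?

# 1. Improving the batcher

Because the earlier vocab processing removed all OOV words, looking up the original word from an UNK no longer finds anything. The batcher is therefore changed so that the list of OOV words is carried along with the input; this is needed when computing attention. A batch helper is built for this, and it defines which input fields a batch contains.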

### predictions -> idx2word -> `QQ224,这是车辆<UNK><UNK>`

### enc_extended_inp -> idx2word ->  `道奇,锋哲,昨晚看到一台车车牌是绿色的？这是什么牌？`

### enc_inp -> idx2word->  `<UNK>,<UNK>,昨晚看到一台车车牌是绿色的？这是什么牌？`

### article_oovs ->` [道奇 , 锋哲]`

### dec_input -> idx2word-> `<start>QQ224,这是车辆<UNK><UNK><end><pad><pad>`

### target -> idx2word-> ` QQ224,这是车辆道奇,锋哲<end><pad>`
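
The relationship between `enc_inp`, `enc_extended_inp` and `article_oovs` in the examples above can be sketched in plain Python. This is a minimal illustration, not the project's batcher: the function name `article_to_ids` and the toy `word2id` dictionary are assumptions, and only the rule `vocab.size() + article_oovs.index(w)` is taken from these notes.

```python
def article_to_ids(words, word2id, unk_token="<UNK>"):
    """Return (enc_inp, enc_extended_inp, article_oovs) for one tokenized article."""
    unk_id = word2id[unk_token]
    vocab_size = len(word2id)
    enc_inp, enc_extended_inp, article_oovs = [], [], []
    for w in words:
        if w in word2id:
            # in-vocab word: both sequences carry the regular id
            enc_inp.append(word2id[w])
            enc_extended_inp.append(word2id[w])
        else:
            # OOV word: UNK in enc_inp, a per-article temporary id in enc_extended_inp
            if w not in article_oovs:
                article_oovs.append(w)
            enc_inp.append(unk_id)
            enc_extended_inp.append(vocab_size + article_oovs.index(w))
    return enc_inp, enc_extended_inp, article_oovs


# hypothetical toy vocab with 5 entries (ids 0..4)
word2id = {"<UNK>": 0, "昨晚": 1, "看到": 2, "一台车": 3, "车牌": 4}
words = ["道奇", "锋哲", "昨晚", "看到", "一台车", "车牌", "道奇"]
enc_inp, enc_extended_inp, article_oovs = article_to_ids(words, word2id)
print(enc_inp)           # [0, 0, 1, 2, 3, 4, 0]
print(enc_extended_inp)  # [5, 6, 1, 2, 3, 4, 5]
print(article_oovs)      # ['道奇', '锋哲']
```

The temporary ids are only meaningful within one article, which is why the OOV list has to travel with each example.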

In [None]:
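# The whole pipeline is wrapped in a generator, so training simply iterates over it.
# Both the data-processing logic and the batch-construction logic run inside the generator.
# Preprocessing therefore happens during training: the CPU preprocesses the data while the
# GPU trains, and the preprocessing module can be changed freely and simply re-run,
# without having to build the dataset first.
dataset = tf.data.Dataset.from_generator(
        lambda: generator(params, vocab, max_enc_len, max_dec_len, mode, batch_size),
        output_types={  # each batch contains the following fields
            "enc_len": tf.int32,  # length of the input sentence
            "enc_input": tf.int32,  # the input sentence
            "enc_extended_inp": tf.int32,  # the original input sentence, without UNK replacement (see the enc_extended_inp example above)
            # the sentence actually used for training is enc_input, i.e. tokenized and UNK-replaced (see the enc_inp example above); both are index sequences
            
            "article_oovs": tf.string,  # the OOV words of this article. Their indices are
            # vocab.size() + article_oovs.index(w), i.e. numbering continues after the vocab. This list is not shared:
            # each sentence has its own OOV indices. When an UNK is looked up, its index position is used, and
            # when the model meets an UNK it shifts attention weight towards words outside the vocab; this
            # adjustment is learned by the network itself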
            "dec_input": tf.int32,  # decoder的输入
            "target": tf.int32,  # 最终的结果
            "dec_len": tf.int32,
            "article": tf.string,
            "abstract": tf.string,
            "abstract_sents": tf.string,
            "sample_decoder_pad_mask": tf.int32,
            "sample_encoder_pad_mask": tf.int32,
        },
        output_shapes={
            "enc_len": [],
            "enc_input": [None],
            "enc_extended_inp": [None],
            
            "article_oovs": [None],
            "dec_input": [None],
            "target": [None],
            "dec_len": [],
            "article": [],
            "abstract": [],
            "abstract_sents": [],
            "sample_decoder_pad_mask": [None],
            "sample_encoder_pad_mask": [None]
        })

In [None]:
# padded_shapes and padding_values must use the same keys as the generator output above
dataset = dataset.padded_batch(batch_size,
               padded_shapes=({"enc_len": [],
                               "enc_input": [None],
                               "enc_extended_inp": [None],
                               "article_oovs": [None],
                               "dec_input": [max_dec_len],
                               "target": [max_dec_len],
                               "dec_len": [],
                               "article": [],
                               "abstract": [],
                               "abstract_sents": [],
                               "sample_decoder_pad_mask": [max_dec_len],
                               "sample_encoder_pad_mask": [None],
                               }),
               padding_values={"enc_len": -1,
                               "enc_input": vocab.word2id[Vocab.PAD_TOKEN],
                               "enc_extended_inp": vocab.word2id[Vocab.PAD_TOKEN],
                               "article_oovs": b'',
                               "dec_input": vocab.word2id[Vocab.PAD_TOKEN],
                               "target": vocab.word2id[Vocab.PAD_TOKEN],
                               "dec_len": -1,
                               "article": b"",
                               "abstract": b"",
                               "abstract_sents": b"",
                               "sample_decoder_pad_mask": 0,
                               "sample_encoder_pad_mask": 0,
                               },
               drop_remainder=True)

With the two steps above, the vocab and the batcher are both in place.

# PGN Model

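## $p_{gen}$

### seq2seq + pointer

The model is a hybrid of the baseline seq2seq and a pointer network: it has the generation ability of the baseline seq2seq and the copy ability of the pointer network. How does it decide whether a word should be generated or copied? The paper introduces a weight $p_{gen}$ for this.

From the baseline seq2seq structure we obtain $S_t$ and $h^*_t$, which are used together with the decoder input $x_t$ to compute $p_{gen}$: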

![](img/pgn1.png)

# Pgen ∈ [0,1]

+ context vector $h^*_t$
+ decoder input $x_t$ 
+ the decoder state $S_t$
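
For reference, the computation built from these three inputs can be written out as in the pointer-generator paper (See et al., 2017). The weight names $w_{h^*}$, $w_s$, $w_x$ and $b_{ptr}$ follow that paper and are not used elsewhere in these notes:

$$
p_{gen} = \sigma\left(w_{h^*}^{T} h^*_t + w_s^{T} S_t + w_x^{T} x_t + b_{ptr}\right)
$$

Each of the three products is a scalar, so one size-1 linear projection per input followed by a sigmoid is enough to implement it.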

In [None]:
class Pointer(tf.keras.layers.Layer):

    def __init__(self):
        super(Pointer, self).__init__()
        self.w_s_reduce = tf.keras.layers.Dense(1)
        self.w_i_reduce = tf.keras.layers.Dense(1)
        self.w_c_reduce = tf.keras.layers.Dense(1)

    def call(self, context_vector, dec_hidden, dec_inp):
        return tf.nn.sigmoid(self.w_s_reduce(dec_hidden) + self.w_c_reduce(context_vector) + self.w_i_reduce(dec_inp))

In [41]:
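p_gens = []
for t in range(dec_target.shape[1]):
    ...  # attention and decoder step elided; they produce context_vector, dec_hidden, dec_x
    p_gen = self.pointer(context_vector, dec_hidden, dec_x)
    # collect p_gen for every decoding step
    p_gens.append(p_gen)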

![](img/pgn.png)

# Final_dists

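At this point the vocabulary is extended into a larger one, the extended vocabulary (the words of the source text are added to it), and the predicted word probability at this time step is: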

![](img/pw.png)

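Here $a_i^t$ is the attention weight on the $i$-th word of the source document at decoding step $t$. The output probability of a word is therefore the sum of its generation probability and its copy probability, weighted by $p_{gen}$ and $1 - p_{gen}$. When a word is not in the regular vocabulary, $P_{vocab}(w)$ is 0.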
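
A worked example of this mixture for a single decoding step and a single sentence, written in plain Python rather than TensorFlow. The function name `final_distribution` and all numbers are made up for illustration; the sketch assumes a 4-word vocabulary and one in-article OOV word with extended id 4.

```python
def final_distribution(p_gen, vocab_dist, attn_dist, enc_extended_inp, num_oovs):
    # Generation part: scale P_vocab by p_gen and append zeros for the OOV slots.
    final = [p_gen * p for p in vocab_dist] + [0.0] * num_oovs
    # Copy part: add (1 - p_gen) * a_i onto the extended-vocab id of source word i.
    # A word that occurs several times in the source accumulates all its weights.
    for word_id, a in zip(enc_extended_inp, attn_dist):
        final[word_id] += (1.0 - p_gen) * a
    return final


vocab_dist = [0.1, 0.2, 0.3, 0.4]   # P_vocab over the 4 in-vocab words
attn_dist = [0.5, 0.25, 0.25]       # attention over 3 source words
enc_extended_inp = [4, 1, 4]        # source words 0 and 2 are the same OOV word
final = final_distribution(0.5, vocab_dist, attn_dist, enc_extended_inp, num_oovs=1)
print([round(p, 3) for p in final])  # [0.05, 0.225, 0.15, 0.2, 0.375]
```

The OOV word has $P_{vocab}(w) = 0$, yet it ends up with the largest probability because all of its mass comes from the copy term; the result still sums to 1.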

![](img/pgn.png)

## predictions -> idx2word -> `QQ224,这是车辆<UNK><UNK>`

## enc_extended_inp -> idx2word ->  `道奇,锋哲,昨晚看到一台车车牌是绿色的？这是什么牌？`

## enc_inp -> idx2word->  `<UNK>,<UNK>,昨晚看到一台车车牌是绿色的？这是什么牌？`

In [None]:
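# The final probability distribution is obtained like this
final_dists=calc_final_dist(enc_extended_inp,# original input (extended-vocab ids)
                             predictions,# raw prediction probabilities returned by the decoder call
                             attentions, # attention weights returned by the attention layer
                             p_gens, # p_gen probabilities
                             batch_oov_len,# number of OOV words, e.g. 2 if there are two UNKs
                             self.params["vocab_size"],# original vocab size
                             self.params["batch_size"])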

### How vocab becomes extended_vocab in detail

vocab -> `{
    <UNK>:0,
    昨晚:1,
    看到:2,
    一台车:3,
    车牌:4,
    是绿色的:5,
    ？:6,
    这是什么:7,
    牌:8
}`


extended_vsize=11

extend_vocab ->
`{
    <UNK>:0,
    昨晚:1,
    看到:2,
    一台车:3,
    车牌:4,
    是绿色的:5,
    ？:6,
    这是什么:7,
    牌:8,
    道奇:9,
    锋哲:10
}`  

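The indices 9 and 10 follow from batch_oov_len above: the two OOV words are numbered right after the 9 entries of the original vocab.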

### predictions -> idx2word -> `QQ224,这是车辆<UNK><UNK>`

### enc_extended_inp -> idx2word ->  `道奇,锋哲,昨晚看到一台车车牌是绿色的？这是什么牌？`

### enc_inp -> idx2word->  `<UNK>,<UNK>,昨晚看到一台车车牌是绿色的？这是什么牌？`

### article_oovs ->` [道奇 , 锋哲]`

### dec_input -> idx2word-> `<start>QQ224,这是车辆<UNK><UNK><end><pad><pad>`

### target -> idx2word-> ` QQ224,这是车辆道奇,锋哲<end><pad>`

![](img/pgn.png)

enc_inp <unk> <unk> 1 2 3 4 5 6 7 8

enc_extended_inp 9 10 1 2 3 4 5 6 7 8

In [None]:
final_dists=calc_final_dist(enc_extended_inp,# original input: enc_extended_inp 9 10 1 2 3 4 5 6 7 8
                             predictions,# raw prediction probabilities, the green part of the figure
                             attentions, # attention weights, the blue part of the figure
                             p_gens, # p_gen probabilities
                             batch_oov_len,# 2
                             self.params["vocab_size"],# original vocab size
                             self.params["batch_size"])

![](img/pw.png)

### Code for computing the final distribution
The function parameters correspond one-to-one to the arguments passed in the final_dists call above.

In [None]:

def calc_final_dist(_enc_batch_extend_vocab, vocab_dists, attn_dists, p_gens, batch_oov_len, vocab_size, batch_size):
    """
    Calculate the final distribution, for the pointer-generator model
    Args:
    vocab_dists: The vocabulary distributions. List length max_dec_steps of (batch_size, vsize) arrays.
                The words are in the order they appear in the vocabulary file.
    attn_dists: The attention distributions. List length max_dec_steps of (batch_size, attn_len) arrays
    Returns:
    final_dists: The final distributions. List length max_dec_steps of (batch_size, extended_vsize) arrays.
    """
    # Multiply vocab dists by p_gen and attention dists by (1-p_gen)
    # Two distributions are built here: the vocab distribution and the attention distribution.
    # One is weighted by p_gen and the other by 1 - p_gen, exactly as in the formula.
    vocab_dists = [p_gen * dist for (p_gen, dist) in zip(p_gens, vocab_dists)]
    
    # attn_dists is first scaled by 1 - p_gen; it is later projected into the attention part on the left of the figure
    attn_dists = [(1 - p_gen) * dist for (p_gen, dist) in zip(p_gens, attn_dists)]

    # Concatenate some zeros to each vocabulary dist, to hold the probabilities for in-article OOV words
    extended_vsize = vocab_size + batch_oov_len  # the maximum (over the batch) size of the extended vocabulary
    extra_zeros = tf.zeros((batch_size, batch_oov_len))
    # list length max_dec_steps of shape (batch_size, extended_vsize)
    # Extend every vocab distribution by the number of OOV words, i.e. append zeros at the end.
    # This builds the green part of the figure; the blue part is built next.
    vocab_dists_extended = [tf.concat(axis=1, values=[dist, extra_zeros]) for dist in vocab_dists]

    # Project the values in the attention distributions onto the appropriate entries in the final distributions
    # This means that if a_i = 0.1 and the ith encoder word is w, and w has index 500 in the vocabulary,
    # then we add 0.1 onto the 500th entry of the final distribution
    # This is done for each decoder timestep.
    # This is fiddly; we use tf.scatter_nd to do the projection
    batch_nums = tf.range(0, limit=batch_size)  # shape (batch_size)
    batch_nums = tf.expand_dims(batch_nums, 1)  # shape (batch_size, 1)
    attn_len = tf.shape(_enc_batch_extend_vocab)[1]  # number of states we attend over
    batch_nums = tf.tile(batch_nums, [1, attn_len])  # shape (batch_size, attn_len)
    # shape (batch_size, enc_t, 2)
    indices = tf.stack((batch_nums, _enc_batch_extend_vocab), axis=2)  
    shape = [batch_size, extended_vsize]
    
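    # list length max_dec_steps (batch_size, extended_vsize), e.g. extended_vsize = 30000 + 2
    # Build the blue part: project the attention weights into a (batch_size, extended_vsize)
    # tensor. indices comes from the original input sentence (extended-vocab ids), so each
    # weight is copied to the slot of its source word; tf.scatter_nd sums duplicate indices.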
    attn_dists_projected = [tf.scatter_nd(indices, copy_dist, shape) for copy_dist in attn_dists]

    # Add the vocab distributions and the copy distributions together to get the final distributions
    # final_dists is a list length max_dec_steps; each entry is a tensor shape (batch_size, extended_vsize) giving
    # the final distribution for that decoder timestep
    # Note that for decoder timesteps and examples corresponding to a [PAD] token, this is junk - ignore.
    final_dists = [vocab_dist + copy_dist for (vocab_dist, copy_dist) in
                   zip(vocab_dists_extended, attn_dists_projected)]

    return final_dists

### Q2. At run time, if the word predicted at the previous step is outside the vocab, will the next step's input cause a problem?

## 1. Training

In [None]:
# using teacher forcing
dec_input = tf.expand_dims(dec_target[:, t], 1)

## 2. Testing

In [None]:
# replace OOV tokens with the unknown token
latest_tokens = [t if t in vocab.id2word else unk_index for t in latest_tokens]  
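
A toy version of this replacement, in plain Python. It assumes that vocab ids are contiguous from 0 to `vocab_size - 1`, so that any id at or above `vocab_size` is a temporary OOV id; the helper name `to_decoder_inputs` is hypothetical.

```python
def to_decoder_inputs(latest_tokens, vocab_size, unk_index):
    # Temporary OOV ids have no embedding row, so map them back to UNK
    # before feeding them to the decoder at the next step.
    return [t if t < vocab_size else unk_index for t in latest_tokens]


# ids 9 and 10 are copied OOV words in a 9-word vocab where <UNK> has id 0
print(to_decoder_inputs([3, 9, 10, 1], vocab_size=9, unk_index=0))  # [3, 0, 0, 1]
```

During training this step is unnecessary because teacher forcing feeds `dec_input`, which already contains UNK in place of OOV words.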

## Q3. What roles do Seq2Seq and the pointer each play?

![](img/pgn.png)

# 2. coverage

Adding coverage to the model requires changing the following two pieces of code:
1. attention
2. loss

## Bahdanau Attention

![](img/attention.png)

In [27]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W_s = tf.keras.layers.Dense(units)
        self.W_h = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, dec_hidden, enc_output):
        # dec_hidden (the query) is the previous GRU hidden state
        # enc_output (the values) is the encoder output
        hidden_with_time_axis = tf.expand_dims(dec_hidden, 1)
        score = self.V(tf.nn.tanh(self.W_s(enc_output) + self.W_h(hidden_with_time_axis)))

        attention_weights = tf.nn.softmax(score, axis=1)

        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

## Coverage Attention

The two figures below compare the original attention score formula with the coverage attention score formula.

![](img/attention.png)

![](img/e_t.png)
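
For readers without the images, the two score formulas can be written out as in the pointer-generator paper (See et al., 2017). The symbols follow the paper; note that the code in these notes applies the layer named `W_s` to the encoder output and `W_h` to the decoder state, the reverse of the paper's naming.

Original attention:

$$
e_i^t = v^{T} \tanh\left(W_h h_i + W_s s_t + b_{attn}\right), \qquad a^t = \mathrm{softmax}(e^t)
$$

Coverage attention, with the coverage value $c_i^t$ of source position $i$ as an extra input:

$$
e_i^t = v^{T} \tanh\left(W_h h_i + W_s s_t + w_c c_i^t + b_{attn}\right)
$$

The only difference is the additional term $w_c c_i^t$, which lets the score depend on how much attention position $i$ has already received.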

## Modifying $e^t$

In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W_s = tf.keras.layers.Dense(units)
        self.W_h = tf.keras.layers.Dense(units)
        self.W_c = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, dec_hidden, enc_output, enc_pad_mask, use_coverage, prev_coverage):
        # dec_hidden (the query) is the decoder hidden state
        # enc_output (the values) is the encoder output
        hidden_with_time_axis = tf.expand_dims(dec_hidden, 1)
        # self.W_s(values)  [batch_sz, max_len, units] self.W_h(hidden_with_time_axis) [batch_sz, 1, units]
        # self.W_c(prev_coverage) [batch_sz, max_len, units]  score [batch_sz, max_len, 1]    
        score = self.V(tf.nn.tanh(self.W_s(enc_output) + self.W_h(hidden_with_time_axis) + self.W_c(prev_coverage)))
        
        attention_weights = tf.nn.softmax(score, axis=1)
        # [batch_sz, max_len, enc_units]
        context_vector = attention_weights * enc_output
        # [batch_sz, enc_units]
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector,attention_weights

## mask + coverage

In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W_s = tf.keras.layers.Dense(units)
        self.W_h = tf.keras.layers.Dense(units)
        self.W_c = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, dec_hidden, enc_output, enc_pad_mask, use_coverage, prev_coverage):
        # dec_hidden (the query) is the previous GRU hidden state S_t
        # enc_output (the values) holds the encoder hidden states h_i

        # dec_hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(dec_hidden, 1)

        # self.W_s(enc_output) [batch_sz, max_len, units], self.W_h(hidden_with_time_axis) [batch_sz, 1, units]
        features = self.W_s(enc_output) + self.W_h(hidden_with_time_axis)
        if use_coverage and prev_coverage is not None:
            # self.W_c(prev_coverage) [batch_sz, max_len, units]
            features += self.W_c(prev_coverage)

        # score shape == (batch_size, max_len, 1); the last axis is 1 because of self.V
        score = self.V(tf.nn.tanh(features))

        # Mask padded encoder positions. Multiplying the score by the mask would set it
        # to 0, which still receives weight after softmax, so a large negative value is
        # added instead.
        mask = tf.expand_dims(tf.cast(enc_pad_mask, dtype=score.dtype), axis=2)
        masked_score = score + (1.0 - mask) * -1e9

        # attention_weights shape == (batch_size, max_len, 1)
        attention_weights = tf.nn.softmax(masked_score, axis=1)

        # coverage stays None when coverage is disabled
        coverage = None
        if use_coverage:
            if prev_coverage is not None:
                coverage = attention_weights + prev_coverage
            else:
                coverage = attention_weights

        # the attention-weighted sum of the encoder outputs is later fed to the decoder
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights, coverage

## 2.2 coverage_loss

![](img/loss_t.png)

## log loss

In [34]:
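# Define the loss function
# The original loss is simply the cross-entropy between the true and predicted values
# This version adds a mask so that the loss at masked (padding) positions is not counted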
def loss_function(real, pred):
    pad_mask = tf.math.equal(real, pad_index)
    mask = tf.math.logical_not(pad_mask)
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)

## log loss + mask batch loss

In [35]:
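# Define the loss function
# This version takes an explicit mask; the loss at masked positions is not counted
def loss_function(real, pred, padding_mask):
    loss = 0
    for t in range(real.shape[1]):
        # `if padding_mask:` is ambiguous for a non-scalar tensor, so test against None
        if padding_mask is not None: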
            loss_ = loss_object(real[:, t], pred[:, t, :])
            mask = tf.cast(padding_mask[:, t], dtype=loss_.dtype)
            loss_ *= mask
            loss_ = tf.reduce_mean(loss_, axis=0)  # batch-wise
            loss += loss_
        else:
            loss_ = loss_object(real[:, t], pred[:, t, :])
            loss_ = tf.reduce_mean(loss_, axis=0)  # batch-wise
            loss += loss_
    return tf.reduce_mean(loss)

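The attention weights of all previous time steps are summed into the so-called coverage vector $c^t$. Past attention decisions then influence the current one, which keeps attention from returning to the same position and so avoids generating repeated text. Computationally, the coverage vector $c^t$ is computed first:
![](img/c_t.png)
+ $c^t$ is a vector whose length equals the input length
+ Its first entry is the accumulated attention weight that the first input word received at all previous steps
+ It gives the attention mechanism information about the words generated so far: positions that were already used should be suppressed later. The suppression is implemented as a penalty term in the loss function.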
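
The accumulation can be illustrated with plain Python lists for a single sentence. This sketch follows the paper's convention that $c^t$ sums the attention of the steps before $t$, so $c^0$ is all zeros; the function name `coverage_vectors` and the attention values are made up.

```python
def coverage_vectors(attn_dists):
    """attn_dists: one attention list (over source positions) per decoder step.
    Returns c^t for every step: the sum of the attention of all earlier steps."""
    coverage = [0.0] * len(attn_dists[0])
    coverages = []
    for attn in attn_dists:
        coverages.append(list(coverage))  # c^t, before step t is added
        coverage = [c + a for c, a in zip(coverage, attn)]
    return coverages


attn_dists = [[0.5, 0.5, 0.0], [0.5, 0.25, 0.25], [0.0, 0.5, 0.5]]
for t, c in enumerate(coverage_vectors(attn_dists)):
    print(t, c)
# 0 [0.0, 0.0, 0.0]
# 1 [0.5, 0.5, 0.0]
# 2 [1.0, 0.75, 0.25]
```

By step 2 the first source word has accumulated a coverage of 1.0, which signals that it has already been attended to heavily.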

## $c^t$ is used in two places:

+ in computing the attention scores $e^t_i$
+ cov_loss

In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        pass
        
    def call(self, dec_hidden, enc_output, enc_pad_mask, use_coverage, prev_coverage):
        if use_coverage and prev_coverage is not None:
            pass
            attention_weights = tf.nn.softmax(score, axis=1)
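            # With coverage enabled, the new coverage is the attention weights plus the
            # coverage of the previous step. At the first step there is no previous
            # coverage, so it is just the attention weights.
            # Because of the accumulation, a repeatedly attended word gets a larger
            # coverage value and is penalized more strongly in later steps.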
            coverage = attention_weights + prev_coverage
        else:    
            if use_coverage:
                coverage = attention_weights

![](img/cobloss.png)

`<START> 举起 车辆 左 前轮 缸体 上 <STOP> <PAD> <PAD> `
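A further improvement: in addition to the changes above, a padding_mask can be added so that masked positions are left out of the computation.
The mask states exactly which positions are `<PAD>`. Multiplying it with cover_losses zeroes out the padded positions (times 0) and keeps the rest (times 1), so the last two positions, i.e. the padding, contribute no loss. Since this loss drives the weight updates, leaving the padding out lets training focus on the real natural-language tokens.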
`padding_mask`->`[1,1,1,1,1,1,1,0,0]`
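
A small sketch of how such a mask could be derived and applied, in plain Python. It assumes the mask is built over the target sequence (the sentence above without `<START>`), which matches the nine entries shown; the helper name `padding_mask_for` and the per-step loss values are made up.

```python
def padding_mask_for(tokens, pad_token="<PAD>"):
    # 1 for real tokens, 0 for padding
    return [0 if tok == pad_token else 1 for tok in tokens]


target = ["举起", "车辆", "左", "前轮", "缸体", "上", "<STOP>", "<PAD>", "<PAD>"]
mask = padding_mask_for(target)
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0]

# hypothetical per-step losses; the two padded steps are zeroed out
step_losses = [0.5] * len(target)
masked_losses = [loss * m for loss, m in zip(step_losses, mask)]
print(sum(masked_losses))  # 3.5
```

Without the mask the sum would be 4.5, so the two padded steps would contribute a fifth of the total loss.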

In [28]:
def mask_coverage_loss(attn_dists, coverages, padding_mask):
    # padding_mask is applied when computing this loss
    """
    Calculates the coverage loss from the attention distributions.
      Args:
        attn_dists coverages: [max_len_y, batch_sz, max_len_x, 1]
        padding_mask: shape (batch_size, max_len_y).
      Returns:
        coverage_loss: scalar
    """
    cover_losses = []
    # squeeze the size-1 axis so that attn_dists and coverages both become
    # [max_len_y, batch_sz, max_len_x]
    attn_dists = tf.squeeze(attn_dists, axis=3)
    coverages = tf.squeeze(coverages, axis=3)

    
    for t in range(attn_dists.shape[0]):
        # element-wise minimum of attn_dists and coverages (the formula above),
        # summed over the source positions; collecting cover_loss_ for every step
        # gives the loss of the whole sentence
        cover_loss_ = tf.reduce_sum(tf.minimum(attn_dists[t, :, :], coverages[t, :, :]), axis=-1)  # max_len_x wise
        cover_losses.append(cover_loss_)
    
    # change from[max_len_y, batch_sz] to [batch_sz, max_len_y]
    cover_losses = tf.stack(cover_losses, 1)

    # cover_loss_ [batch_sz, max_len_y]
    mask = tf.cast(padding_mask, dtype=cover_loss_.dtype)
    cover_losses *= mask
    
    # mean loss of each time step and then sum up
    loss = tf.reduce_sum(tf.reduce_mean(cover_losses, axis=0))  
    tf.print('coverage loss(batch sum):', loss)
    return loss

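# Changing the loss
The final loss is the sum of the two parts, with a hyperparameter deciding their relative weight.
The first term on the right-hand side is the original loss (cross-entropy); the second is the coverage loss computed from the attention and the accumulated coverage.
Implemented in train_helper.py, line 79.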

![](img/loss_t_coverage.png)

In [38]:
batch_loss = loss_function(dec_target[:, 1:], predictions)

In [39]:
batch_loss = loss_function(dec_target, predictions, padding_mask) + \
                         cov_loss_wt * mask_coverage_loss(attentions, coverages, padding_mask)

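A coverage loss is defined to penalize attending to the same position again and again. The idea is intuitive: if a position has already been attended to, its $c^t_i$ is large, so to reduce the loss $a^t_i$ has to become small (the loss takes the smaller of the two), and a small $a^t_i$ means the position is less likely to be attended to again.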
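
The effect can be checked on toy numbers with a plain-Python sketch for a single sentence. It uses the paper's definition, $\sum_i \min(a_i^t, c_i^t)$ with $c^t$ accumulated from the steps before $t$, and sums over steps without the batch averaging of the TensorFlow code above; the function name `toy_coverage_loss` and the attention values are made up.

```python
def toy_coverage_loss(attn_dists, padding_mask):
    """Sum over decoder steps of sum_i min(a_i^t, c_i^t), skipping padded steps."""
    coverage = [0.0] * len(attn_dists[0])
    total = 0.0
    for attn, keep in zip(attn_dists, padding_mask):
        step_loss = sum(min(a, c) for a, c in zip(attn, coverage))
        total += keep * step_loss
        # accumulate this step's attention for the following steps
        coverage = [c + a for c, a in zip(coverage, attn)]
    return total


repeated = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # always attends to position 0
spread = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # moves on to position 1
print(toy_coverage_loss(repeated, [1, 1, 1]))  # 2.0
print(toy_coverage_loss(spread, [1, 1, 1]))    # 1.0
print(toy_coverage_loss(repeated, [1, 1, 0]))  # 1.0
```

Attending to the same position at every step costs twice as much as spreading the attention, and masking the last step removes its contribution entirely.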

# TensorFlow operations

### tf.constant and tf.minimum

In [15]:
x1=tf.constant([[1,1,5],
                [1,1,1]])

x2=tf.constant([[1,3,2],
                [3,1,3]])
tf.minimum(x1, x2)

<tf.Tensor: id=37, shape=(2, 3), dtype=int32, numpy=
array([[1, 1, 2],
       [1, 1, 1]], dtype=int32)>

## tf.reduce_sum

In [19]:
x=tf.constant([[1,1,1],[1,1,1]])
tf.reduce_sum(x)

<tf.Tensor: id=62, shape=(), dtype=int32, numpy=6>

In [20]:
tf.reduce_sum(x,0)

<tf.Tensor: id=64, shape=(3,), dtype=int32, numpy=array([2, 2, 2], dtype=int32)>

In [21]:
tf.reduce_sum(x,1)

<tf.Tensor: id=66, shape=(2,), dtype=int32, numpy=array([3, 3], dtype=int32)>