
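Preface: this model was already implemented once in "[09 - Deploying a seq2seq model with the hybrid frontend: chatbot](../09-%E6%B7%B7%E5%90%88%E5%89%8D%E7%AB%AF%E7%9A%84seq2seq%E6%A8%A1%E5%9E%8B%E9%83%A8%E7%BD%B2-%E8%81%8A%E5%A4%A9%E6%9C%BA%E5%99%A8%E4%BA%BA.ipynb)". This tutorial builds it again on a different dataset, so it is worth trying to implement it yourself; the model descriptions here are quite detailed.

# Chatbot Tutorial

In this tutorial, we explore a fun and interesting use case of recurrent sequence-to-sequence models. We will train a simple chatbot on movie scripts from the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html).

Conversational models are a hot topic in artificial intelligence research. Chatbots are found in a variety of settings, including customer service applications and online helpdesks. These bots are often powered by retrieval-based models, which output predefined responses to questions of certain forms. In a highly restricted domain such as a company's IT helpdesk, these models may be sufficient, but they are not robust enough for more general use cases. Getting a machine to hold a meaningful conversation with a human across multiple domains is a research problem that is far from solved. Recently, the deep learning boom has enabled powerful generative models such as Google's [Neural Conversational Model](https://arxiv.org/abs/1506.05869), which marks a large step toward multi-domain generative conversational models. In this tutorial, we will implement this kind of model in PyTorch.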

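**Tutorial highlights**
- Loading and preprocessing the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) dataset
- Implementing a sequence-to-sequence model with [Luong attention mechanism(s)](https://arxiv.org/abs/1508.04025)
- Jointly training the encoder and decoder models using mini-batches
- Implementing a greedy-search decoding module
- Interacting with the trained chatbot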

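**Rough workflow**
- Data preprocessing: convert the raw text into a format that is easy to process
- Data encoding
- Encoder: implemented with a GRU (a type of RNN)
- Attention module: used by the decoder
- Decoder
- Encoder training procedure
- Decoder training procedure
- Combining the encoder and decoder into a single model
- Using the model
- Evaluating the model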

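## 1. Download the data files
Download the data files from [here](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) and place them in a `data/` folder. Note that the code below reads the corpus from `../../data/cornell movie-dialogs corpus`, so adjust that path to match where you put the files. After that, we import the packages we need.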

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

import pandas as pd
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math

USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

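## 2. Load and preprocess the data
The next step is to reformat our data files and load the data into structures we can work with. The [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) is a rich dataset of movie character dialogue:
* 220,579 conversational exchanges between 10,292 pairs of movie characters
* 9,035 characters from 617 movies
* 304,713 utterances in total

The dataset is large and diverse, with wide variation in language formality, time period, sentiment, and so on. We hope this diversity makes our model robust to many forms of input and queries.

First, we print a few lines of the data file to see the raw format.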

In [2]:
corpus_name = "cornell movie-dialogs corpus"
corpus = os.path.join("../../data/", corpus_name)

def printLines(file, n=10):
    with open(file, "rb") as f:
        lines = f.readlines()
    for i in lines[:n]:
        print(i)

printLines(os.path.join(corpus, "movie_lines.txt"))

b'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n'
b'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!\n'
b'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.\n'
b'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?\n'
b"L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.\n"
b'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow\n'
b"L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.\n"
b'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No\n'
b'L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?\n'
b'L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?\n'
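
Each line is printed as a `bytes` literal (`b'...'`) because `printLines` opens the file in binary mode. As a small illustration, the sketch below decodes one of the sample lines and splits it on the ` +++$+++ ` separator. It assumes the `iso-8859-1` encoding, which is what the loading functions later in this section pass to `open`; the variable names are illustrative only.

```python
# One raw line copied from the output above
raw = b"L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.\n"

# Decode the bytes to text, then drop the trailing newline
text_line = raw.decode("iso-8859-1").rstrip("\n")

# The five fields are: lineID, characterID, movieID, character name, utterance text
line_id, character_id, movie_id, character, utterance = text_line.split(" +++$+++ ")

print(line_id, character, repr(utterance))
```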


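### 2.1 Create a formatted data file
For convenience, we will create a nicely formatted data file in which each line contains a tab-separated pair of a query sentence and a response sentence.

The following functions help parse the raw `movie_lines.txt` data file.
* `loadLines`: splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
* `loadConversations`: groups the lines from `loadLines` into conversations according to `movie_conversations.txt`
* `extractSentencePairs`: extracts sentence pairs from the conversations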

In [3]:
def loadLines(fileName, fields):
    # Returns: {lineID: {field: value, ...}, ...}
    # Parsed by hand rather than with pd.read_csv because of the
    # multi-character " +++$+++ " separator
    # fields: lineID, characterID, movieID, character, text
    lines = {}
    with open(fileName, "r", encoding="iso-8859-1") as f:
        for line in f:
            values = line.split(" +++$+++ ")
            lineObj = {}
            for value, field in zip(values, fields):
                lineObj[field] = value
            lines[lineObj['lineID']] = lineObj
    return lines


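import ast


def loadConversations(fileName, lines, fields):
    # Returns: [{field: value, ..., "lines": [lineObj, ...]}, ...]
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            convObj = {}
            for value, field in zip(values, fields):
                convObj[field] = value
            # utteranceIDs is the string form of a Python list, e.g. "['L194', 'L195']\n";
            # literal_eval parses it without executing arbitrary code (unlike eval)
            utteranceIDs = ast.literal_eval(convObj["utteranceIDs"])
            # Look up the full line object for every utterance ID
            convObj["lines"] = [lines[i] for i in utteranceIDs]
            conversations.append(convObj)
    return conversations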


# Extract pairs of consecutive sentences from the conversations
def extractSentencePairs(conversations):
    # Returns: [[q_text, a_text], ...]
    qa_pairs = []
    for conversation in conversations:
        for i in range(len(conversation['lines']) - 1):
            inputLine = conversation['lines'][i]['text'].strip()
            targetLine = conversation['lines'][i+1]['text'].strip()
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs
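
To make the data flow of these functions concrete, here is a minimal, self-contained sketch that runs the same parsing and pairing steps on a few in-memory sample lines instead of the real files. The helper names `parse_line` and `pairs_from_conversation`, and the utterance order of the toy conversation, are assumptions made for illustration and are not part of the tutorial's code.

```python
SEP = " +++$+++ "
LINE_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]

# Three raw lines copied from the movie_lines.txt sample shown earlier
raw_lines = [
    "L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.\n",
    "L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?\n",
    "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.\n",
]


def parse_line(line):
    """Split one raw line into a {field: value} dict (hypothetical helper)."""
    return dict(zip(LINE_FIELDS, line.split(SEP)))


def pairs_from_conversation(utterances):
    """Pair each utterance with the one that follows it, skipping empty text."""
    pairs = []
    for first, second in zip(utterances, utterances[1:]):
        query = first["text"].strip()
        response = second["text"].strip()
        if query and response:
            pairs.append([query, response])
    return pairs


# Index the parsed lines by lineID, as loadLines does
lines_by_id = {obj["lineID"]: obj for obj in map(parse_line, raw_lines)}

# Assumed utterance order, for illustration only; in the real pipeline the
# order comes from the utteranceIDs field of movie_conversations.txt
toy_conversation = [lines_by_id[i] for i in ["L984", "L985", "L925"]]

qa_pairs = pairs_from_conversation(toy_conversation)
print(qa_pairs)
```

A conversation with n utterances yields at most n - 1 pairs, since every utterance except the last serves as a query and every utterance except the first serves as a response.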

Now we call these functions to create the file, which we name `formatted_movie_lines.txt`.

In [4]:
datafile = os.path.join(corpus, "formatted_movie_lines.txt")
delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Field names for the two raw files
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Load the lines and group them into conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),
                                  lines, MOVIE_CONVERSATIONS_FIELDS)

# Write one tab-separated [query, response] pair per line.
# csv.writer is used so the call does not depend on the pandas version
# (the to_csv line-terminator keyword differs between pandas releases).
with open(datafile, 'w', encoding='utf-8', newline='') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)


Processing corpus...

Loading conversations...
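
As a sanity check on the file format, here is a small sketch that writes a few pairs to a temporary tab-separated file and reads them back with the standard `csv` module. The sample pairs and the temporary file name are made up for illustration. The point is that a field containing a tab or a double quote is quoted on write and recovered intact on read.

```python
import csv
import os
import tempfile

# Made-up pairs; the second query contains a tab and double quotes on purpose
sample_pairs = [
    ["She okay?", "I hope so."],
    ['He said "hi"\tthen left.', "Wow"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pairs.txt")

    # newline='' lets the csv module control line endings itself
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(sample_pairs)

    # Read the file back with the matching delimiter
    with open(path, "r", encoding="utf-8", newline="") as f:
        loaded_pairs = [row for row in csv.reader(f, delimiter="\t")]

print(loaded_pairs == sample_pairs)
```

Code that reads such a file by splitting each line on the tab character would see the added quote characters literally for fields like the second one, so reading with `csv.reader` is the safer counterpart to a CSV-style writer.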


In [5]:
printLines(datafile)

b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\tSeems like she could get a date easy enough...\n"
b'Why?\tUnsolved mystery.  She used t

In [6]:
conversations[0]

{'character1ID': 'u0',
 'character2ID': 'u2',
 'movieID': 'm0',
 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n",
 'lines': [{'lineID': 'L194',
   'characterID': 'u0',
   'movieID': 'm0',
   'character': 'BIANCA',
   'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'},
  {'lineID': 'L195',
   'characterID': 'u2',
   'movieID': 'm0',
   'character': 'CAMERON',
   'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"},
  {'lineID': 'L196',
   'characterID': 'u0',
   'movieID': 'm0',
   'character': 'BIANCA',
   'text': 'Not the hacking and gagging and spitting part.  Please.\n'},
  {'lineID': 'L197',
   'characterID': 'u2',
   'movieID': 'm0',
   'character': 'CAMERON',
   'text': "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"}]}
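
Note that `utteranceIDs` in the output above is still a string (including the trailing newline), not a list. A minimal sketch of two ways to turn such a string into a list of IDs is shown below. The regular expression assumes every ID has the form `L` followed by digits, which matches the sample shown here but is an assumption about the full corpus.

```python
import ast
import re

# The raw field value exactly as it appears in conversations[0]
raw_field = "['L194', 'L195', 'L196', 'L197']\n"

# Option 1: parse the Python-literal syntax without executing any code
ids_literal = ast.literal_eval(raw_field.strip())

# Option 2: pull out the IDs with a regular expression
ids_regex = re.findall(r"L[0-9]+", raw_field)

print(ids_literal)
print(ids_regex)
```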