## Neural Language Models
Status of Notebook: Work in Progress

Reference: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Dynet Version: https://github.com/neubig/nn4nlp-code/blob/master/02-lm/nn-lm.py

Old PyTorch version: https://github.com/neubig/nn4nlp-code/blob/master/02-lm-pytorch/nn-lm-batch.py

Additions compared to `nn.lm.ipynb`:
- Cleaned up model architecture code
- Added Dropout
- Using a different initial learning rate

In [1]:
import math  # math functions (exp, used for perplexity)
import random  # shuffling the training data
import time  # timing training and evaluation

import numpy as np  # array utilities
import torch  # tensors and autograd
import torch.nn as nn  # neural network layers

### Download the Data

In [2]:
# uncomment to download the datasets into data/ptb/, the directory the loading code reads from
#!wget -P data/ptb https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/test.txt
#!wget -P data/ptb https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/train.txt
#!wget -P data/ptb https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/valid.txt

### Process the Data

In [2]:
# function to read in data: strip each line and split it into tokens on spaces
def read_data(filename):
    data = []  # one list of tokens per line
    with open(filename, "r") as f:
        for line in f:
            line = line.strip().split(" ")  # drop surrounding whitespace, then split on single spaces
            data.append(line)
    return data

# read the data
train_data = read_data('data/ptb/train.txt')
val_data = read_data('data/ptb/valid.txt')

# create the word <-> index mappings and register the special symbols
word_to_index = {}  # word -> index
index_to_word = {}  # index -> word
word_to_index["<s>"] = len(word_to_index)  # the sentence boundary symbol gets index 0
index_to_word[len(word_to_index)-1] = "<s>"

word_to_index["<unk>"] = len(word_to_index)  # the unknown-word symbol gets index 1
index_to_word[len(word_to_index)-1] = "<unk>"

# extend the word-to-index and index-to-word dictionaries from data
def create_dict(data, check_unk=False):
    for line in data:
        for word in line:
            if not check_unk:
                # training data: give each unseen word the next free index
                if word not in word_to_index:
                    word_to_index[word] = len(word_to_index)
                    index_to_word[len(word_to_index)-1] = word

            # The PTB data already contains <unk>, so this branch has no effect here;
            # it only matters for data that does not already use <unk>.
            else:
                # validation data: map each unseen word to the index of "<unk>"
                if word not in word_to_index:
                    word_to_index[word] = word_to_index["<unk>"]
                    # no index_to_word entry is added: the word shares the existing <unk> index

# build the vocabulary from the training data
create_dict(train_data)
# map validation words that are missing from the vocabulary to <unk>
create_dict(val_data, check_unk=True)

# convert each sentence into a list of word indices
def create_tensor(data):
    for line in data:
        # yield the list of indices for one sentence
        yield [word_to_index[word] for word in line]

# materialize the generators as lists of index lists
train_data = list(create_tensor(train_data))
val_data = list(create_tensor(val_data))

# vocabulary size
number_of_words = len(word_to_index)
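
As an aside, a common alternative to registering unseen words in the dictionary is to fall back to `<unk>` at lookup time. The sketch below is not the notebook's code: `build_vocab`, `encode` and the toy sentences are hypothetical names and data used only to illustrate the idea.

```python
def build_vocab(sentences):
    """Assign an index to every word seen in the (training) sentences."""
    vocab = {"<s>": 0, "<unk>": 1}  # special symbols first, as in the notebook
    for sent in sentences:
        for word in sent:
            # setdefault only inserts when the word is new
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sent, vocab):
    """Map words to indices, using the <unk> index for unseen words."""
    unk = vocab["<unk>"]
    return [vocab.get(word, unk) for word in sent]


# hypothetical toy data, not taken from PTB
toy_train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
toy_vocab = build_vocab(toy_train)
encoded = encode(["the", "bird", "sat"], toy_vocab)  # "bird" is unseen
```

With this approach the vocabulary size stays equal to the number of training words plus the special symbols, because unseen words never add keys to the dictionary.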

This implementation uses batched training: all the histories of a sentence are scored in a single forward pass. It differs in a few respects from the original implementation found [here](https://github.com/neubig/nn4nlp-code/blob/master/02-lm/loglin-lm.py).
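
To make the batching concrete, here is a minimal sketch of how one sentence can be unrolled into (history, target) pairs, assuming batching is done per sentence as in the loss function defined later in this notebook. The helper name `make_ngram_batch` and the toy indices are hypothetical.

```python
def make_ngram_batch(sent, n, boundary):
    """Unroll a sentence of word indices into parallel history/target lists.

    sent:     list of word indices
    n:        number of previous words used as history
    boundary: index of the sentence boundary symbol (<s> in the notebook)
    """
    hist = [boundary] * n  # the first history is all boundary symbols
    histories, targets = [], []
    # the boundary symbol is appended so the model also predicts the sentence end
    for next_word in sent + [boundary]:
        histories.append(list(hist))
        targets.append(next_word)
        hist = hist[1:] + [next_word]  # slide the window by one word
    return histories, targets


histories, targets = make_ngram_batch([5, 7, 9], n=2, boundary=0)
# histories -> [[0, 0], [0, 5], [5, 7], [7, 9]]
# targets   -> [5, 7, 9, 0]
```

Each row of `histories` is one training example, so a sentence of length `T` yields a batch of `T + 1` examples that can be fed to the model in one call.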

### Define the Model

In [3]:
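# use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# number of previous words used as history (context) for each prediction
N = 2

# size of the word embeddings
EMB_SIZE = 128

# size of the hidden layer
HID_SIZE = 128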

# Neural LM
class FNN_LM(nn.Module):
    def __init__(self, number_of_words, ngram_length, EMB_SIZE, HID_SIZE, dropout):
        super().__init__()

        # embedding layer: one vector per vocabulary word
        self.embedding = nn.Embedding(number_of_words, EMB_SIZE)

        # feed-forward network on top of the concatenated history embeddings
        self.fnn = nn.Sequential(
            # hidden layer: concatenated embeddings -> hidden size
            nn.Linear(EMB_SIZE * ngram_length, HID_SIZE),
            # tanh non-linearity
            nn.Tanh(),
            # dropout: randomly zeroes elements with probability `dropout` during training
            nn.Dropout(dropout),
            # output layer: hidden size -> one score per vocabulary word
            nn.Linear(HID_SIZE, number_of_words)
        )

    def forward(self, x):
        embs = self.embedding(x)            # look up embeddings         # Size: [batch_size x num_hist x emb_size]
        feat = embs.view(embs.size(0), -1)  # concatenate per example    # Size: [batch_size x (num_hist*emb_size)]
        logit = self.fnn(feat)              # unnormalized word scores   # Size: [batch_size x num_words]
        return logit
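
The shape comments in `forward` can be checked on a toy input. The sketch below rebuilds the same architecture under a hypothetical name, `TinyFNNLM`, with deliberately small, made-up sizes so that it runs on its own; it is an illustration, not a replacement for the model above.

```python
import torch
import torch.nn as nn


class TinyFNNLM(nn.Module):
    """Stand-in with the same layer layout as FNN_LM (hypothetical name)."""

    def __init__(self, vocab_size, num_hist, emb_size, hid_size, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.fnn = nn.Sequential(
            nn.Linear(emb_size * num_hist, hid_size),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hid_size, vocab_size),
        )

    def forward(self, x):
        embs = self.embedding(x)            # [batch_size x num_hist x emb_size]
        feat = embs.view(embs.size(0), -1)  # [batch_size x (num_hist*emb_size)]
        return self.fnn(feat)               # [batch_size x vocab_size]


# made-up sizes: 10 words, 2 history words, 4-dim embeddings, 8 hidden units
toy_model = TinyFNNLM(vocab_size=10, num_hist=2, emb_size=4, hid_size=8, dropout=0.2)
toy_model.eval()  # disable dropout for a deterministic forward pass

batch = torch.tensor([[0, 0], [0, 5], [5, 7]])  # three histories of two words each
with torch.no_grad():
    logits = toy_model(batch)

logits_shape = tuple(logits.shape)     # one row of vocabulary scores per history
probs = torch.softmax(logits, dim=-1)  # each row becomes a distribution over words
```

The output has one row per history and one column per vocabulary word, which is the layout that `CrossEntropyLoss` expects for class scores.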

### Model Settings and Functions

In [17]:
# create the FNN_LM model instance
model = FNN_LM(number_of_words, N, EMB_SIZE, HID_SIZE, dropout=0.2)

# Adam optimizer over the model parameters, with a learning rate of 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# cross-entropy loss; reduction="sum" adds up the loss over all predictions in a batch
criterion = torch.nn.CrossEntropyLoss(reduction="sum")

# move the model to the selected device (GPU if available, otherwise CPU)
model.to(device)
# function to calculate the sentence loss
def calc_sent_loss(sent):
    # index of the sentence boundary symbol "<s>"
    S = word_to_index["<s>"]

    # the initial history consists only of boundary symbols
    hist = [S] * N

    # collect all targets and their histories
    all_targets = []
    all_histories = []

    # step through the sentence, with the boundary symbol appended as the end-of-sentence target
    for next_word in sent + [S]:
        all_histories.append(list(hist))
        all_targets.append(next_word)
        # slide the history window: drop the oldest word, append the current one
        hist = hist[1:] + [next_word]

    # score all histories of the sentence in one batch on the selected device
    logits = model(torch.LongTensor(all_histories).to(device))
    # summed cross-entropy between the scores and the target words
    loss = criterion(logits, torch.LongTensor(all_targets).to(device))

    return loss

MAX_LEN = 100  # maximum number of words in a generated sentence
# Function to generate a sentence
def generate_sent():
    # index of the sentence boundary symbol "<s>"
    S = word_to_index["<s>"]
    # the initial history consists only of boundary symbols
    hist = [S] * N
    # indices of the words generated so far
    sent = []
    while True:
        # score the next word given the current history; no gradients are needed here
        with torch.no_grad():
            logits = model(torch.LongTensor([hist]).to(device))
        # normalize over the vocabulary dimension; shape (1, number_of_words)
        p = torch.nn.functional.softmax(logits, dim=-1)
        # sample the next word from the distribution
        next_word = p.multinomial(num_samples=1).item()
        # stop when the boundary symbol "<s>" is sampled or the length limit is reached
        if next_word == S or len(sent) == MAX_LEN:
            break
        sent.append(next_word)
        # slide the history window: drop the oldest word, append the new one
        hist = hist[1:] + [next_word]
    return sent
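
The sampling step in `generate_sent` can be tried in isolation. The sketch below uses made-up logits for a four-word vocabulary and draws repeated samples; the values and the variable names are assumptions chosen for illustration only.

```python
import torch

torch.manual_seed(0)  # fixed seed so repeated runs draw the same samples

# made-up scores for a 4-word vocabulary; shape (1, vocab_size) like the model output
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0]])

# softmax over the vocabulary dimension turns scores into probabilities
probs = torch.softmax(logits, dim=-1)

# multinomial draws an index with probability proportional to its entry in probs
samples = [probs.multinomial(num_samples=1).item() for _ in range(1000)]

# how often each word index was drawn
counts = [samples.count(i) for i in range(probs.size(1))]
```

Because words are sampled rather than chosen by `argmax`, repeated calls produce different sentences, and higher-scoring words are drawn more often without being the only possible outcome.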

### Train the Model

In [19]:
# start training
for ITER in range(5):
    # training pass
    random.shuffle(train_data)  # visit the sentences in a new random order each epoch
    model.train()  # training mode (enables dropout)
    train_words, train_loss = 0, 0.0  # running word count and summed loss
    start = time.time()
    for sent_id, sent in enumerate(train_data):
        my_loss = calc_sent_loss(sent)  # summed loss over the sentence
        train_loss += my_loss.item()
        train_words += len(sent)
        optimizer.zero_grad()  # clear the gradients from the previous step
        my_loss.backward()  # backpropagate
        optimizer.step()  # update the parameters
        if (sent_id+1) % 5000 == 0:  # report progress every 5000 sentences
            print("--finished %r sentences (words/sec=%.2f)" % (sent_id+1, train_words/(time.time()-start)))
    print("iter %r: train loss/word=%.4f, ppl=%.4f, (words/sec=%.2f)" % (ITER, train_loss/train_words, math.exp(train_loss/train_words), train_words/(time.time()-start)))

    # evaluation pass
    model.eval()  # evaluation mode (disables dropout)
    dev_words, dev_loss = 0, 0.0  # running word count and summed loss
    start = time.time()
    with torch.no_grad():  # no gradients are needed for evaluation
        for sent_id, sent in enumerate(val_data):
            my_loss = calc_sent_loss(sent)
            dev_loss += my_loss.item()
            dev_words += len(sent)
    # note: the last field prints the elapsed evaluation time in seconds, despite its words/sec label
    print("iter %r: dev loss/word=%.4f, ppl=%.4f, (words/sec=%.2fs)" % (ITER, dev_loss/dev_words, math.exp(dev_loss/dev_words), time.time()-start))

    # Generate a few sentences
    for _ in range(5):
        sent = generate_sent()  # sample one sentence as a list of word indices
        print(" ".join([index_to_word[x] for x in sent]))  # map the indices back to words and print them
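
The `ppl` value printed by the loop is the exponential of the average per-word loss. As a sanity check, a model that spreads probability uniformly over a vocabulary of `V` words should have a perplexity of `V`. The sketch below verifies this with made-up numbers; the `perplexity` helper is a hypothetical name, not part of the notebook.

```python
import math


def perplexity(total_nll, num_tokens):
    """Perplexity from a summed negative log-likelihood (in nats) and a token count."""
    return math.exp(total_nll / num_tokens)


# made-up setting: a uniform model over V words pays log(V) nats per token
V = 10000
num_tokens = 250
total_nll = num_tokens * math.log(V)

ppl = perplexity(total_nll, num_tokens)  # equals V up to floating-point error
```

Note that the loop above divides by `len(sent)`, while the summed loss also covers the end-of-sentence prediction; whether that extra token is counted in the denominator is a convention, and it shifts the reported perplexity slightly.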

--finished 5000 sentences (words/sec=12807.67)
--finished 10000 sentences (words/sec=12788.71)
--finished 15000 sentences (words/sec=12807.44)
--finished 20000 sentences (words/sec=12801.59)
--finished 25000 sentences (words/sec=12852.69)
--finished 30000 sentences (words/sec=12843.39)
--finished 35000 sentences (words/sec=12835.04)
--finished 40000 sentences (words/sec=12816.01)
iter 0: train loss/word=6.1274, ppl=458.2398, (words/sec=12801.17)
iter 0: dev loss/word=5.8676, ppl=353.3835, (words/sec=1.44s)
it will change at georgia & co. got instead of totally a appointment from the big bankers posted <unk> & co. also received that brokers
one and claim the politicians amount for <unk> the measure of the california santa contract
our birth capitol led the giant <unk> by an <unk> market in the central <unk> held the rise of the company 's sheet that the irs on britain dollars
yesterday 's jail & <unk> investigations on news for buying creditors has lower market for polish statement so a



--finished 5000 sentences (words/sec=12587.62)
--finished 10000 sentences (words/sec=12652.41)
--finished 15000 sentences (words/sec=12740.18)
--finished 20000 sentences (words/sec=12763.71)
--finished 25000 sentences (words/sec=12753.94)
--finished 30000 sentences (words/sec=12754.24)
--finished 35000 sentences (words/sec=12762.18)
--finished 40000 sentences (words/sec=12740.41)
iter 1: train loss/word=5.7389, ppl=310.7324, (words/sec=12744.21)
iter 1: dev loss/word=5.7766, ppl=322.6629, (words/sec=1.40s)
the advertising for champion the dollar was named whose damage was down from lawyers and the new england told them need
rumors with cents a share
justice general operations in chicago
british bought what of going to pay to rates since april
according to an <unk> family
--finished 5000 sentences (words/sec=12702.39)
--finished 10000 sentences (words/sec=12731.82)
--finished 15000 sentences (words/sec=12755.89)
--finished 20000 sentences (words/sec=12828.83)
--finished 25000 sentences 