## Tensor

![Tensor](https://img-blog.csdnimg.cn/73d79be2ccaa413fa252072169e38573.png?)

In [1]:
import torch
data = [[1,2],[3,4]]
x_data = torch.tensor(data)

In [3]:
import numpy as np 
np_array = np.array(data)
x_tensor = torch.from_numpy(np_array)
print(x_tensor)
one_element = x_tensor[1][1]
print(one_element)
print(one_element.item())  # item() only works on a tensor that holds exactly one element

tensor([[1, 2],
        [3, 4]])
tensor(4)
4


In [5]:
x_ones = torch.ones_like(x_data)
print(x_ones)
# override the dtype of the new tensor
x_rand = torch.rand_like(x_data, dtype=torch.float)
print(x_rand)

tensor([[1, 1],
        [1, 1]])
tensor([[0.5309, 0.5070],
        [0.5631, 0.9979]])


## Changing the shape of a tensor

![reshape](https://img-blog.csdnimg.cn/a06143c7aaa54007a75c4b5bd2ecac1b.png?)

In [6]:
x = torch.randn(4, 4)
print(x)
print(x.size())
y = torch.reshape(x, (2, 8))
print(y)
print(y.size())

tensor([[ 1.2097, -0.2918,  2.8595, -0.0547],
        [ 0.1935,  1.2770,  0.5390, -2.3792],
        [ 0.7055,  0.1889, -0.0337, -0.0768],
        [-0.3051, -0.7940,  0.4360,  2.5206]])
torch.Size([4, 4])
tensor([[ 1.2097, -0.2918,  2.8595, -0.0547,  0.1935,  1.2770,  0.5390, -2.3792],
        [ 0.7055,  0.1889, -0.0337, -0.0768, -0.3051, -0.7940,  0.4360,  2.5206]])
torch.Size([2, 8])


![view](https://img-blog.csdnimg.cn/ba63b09a4ef845b287ae01a45c0ef2c7.png)

In [8]:
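# view returns a reference to (a view of) the original data, not a copy
# view fails when the tensor is not contiguous in memory; use reshape in that case. If you must use view, call contiguous() first and then view.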
y = x.view(16)
print(y)
print(y.size())
x[0][0] = 0
print(x)
print(y)

tensor([ 1.2097, -0.2918,  2.8595, -0.0547,  0.1935,  1.2770,  0.5390, -2.3792,
         0.7055,  0.1889, -0.0337, -0.0768, -0.3051, -0.7940,  0.4360,  2.5206])
torch.Size([16])
tensor([[ 0.0000, -0.2918,  2.8595, -0.0547],
        [ 0.1935,  1.2770,  0.5390, -2.3792],
        [ 0.7055,  0.1889, -0.0337, -0.0768],
        [-0.3051, -0.7940,  0.4360,  2.5206]])
tensor([ 0.0000, -0.2918,  2.8595, -0.0547,  0.1935,  1.2770,  0.5390, -2.3792,
         0.7055,  0.1889, -0.0337, -0.0768, -0.3051, -0.7940,  0.4360,  2.5206])
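
The comment in the cell above says that `view` cannot be used on a non-contiguous tensor. The following minimal sketch illustrates that case: a transpose makes the tensor non-contiguous, `view` raises, and either `contiguous().view(...)` or `reshape` succeeds. The variable names are illustrative only.

```python
import torch

base = torch.arange(6).reshape(2, 3)
t = base.t()                          # transpose: same storage, non-contiguous layout
assert not t.is_contiguous()

try:
    t.view(6)                         # view needs a compatible memory layout
    view_failed = False
except RuntimeError:
    view_failed = True

flat_copy = t.contiguous().view(6)    # copy into contiguous memory, then view
flat_reshape = t.reshape(6)           # reshape copies only when it has to
print(view_failed, flat_copy.tolist(), flat_reshape.tolist())
```

Because `reshape` may return either a view or a copy, it is safer not to rely on it sharing memory with the original tensor.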


## Broadcasting

In [9]:
a = torch.arange(6).reshape(2,3)
print(a)
b = torch.arange(1, 3).reshape(2,1)
print(b)
print(a+b)

tensor([[0, 1, 2],
        [3, 4, 5]])
tensor([[1],
        [2]])
tensor([[1, 2, 3],
        [5, 6, 7]])


In [10]:
c = torch.arange(1,4).reshape(1,3)
print(c)  # shape (1, 3); broadcast to [[1,2,3],[1,2,3]] when added to a
print(a+c)

tensor([[1, 2, 3]])
tensor([[1, 3, 5],
        [4, 6, 8]])
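
The two cells above rely on the broadcasting rule: shapes are compared from the last dimension backwards, and two sizes are compatible when they are equal or one of them is 1. A small sketch of that rule, using `expand` to spell out what broadcasting does implicitly (the tensors mirror the ones above):

```python
import torch

a = torch.arange(6).reshape(2, 3)      # shape (2, 3)
b = torch.arange(1, 3).reshape(2, 1)   # shape (2, 1)

# Broadcasting behaves as if b were expanded along its size-1 dimension.
b_expanded = b.expand(2, 3)            # [[1, 1, 1], [2, 2, 2]], no data is copied
same_result = torch.equal(a + b, a + b_expanded)

# The resulting shape can be computed without touching any data.
result_shape = torch.broadcast_shapes((2, 3), (2, 1))

# (2, 3) and (2,) do not broadcast: the trailing sizes 3 and 2 differ.
try:
    a + torch.arange(2)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

print(same_result, result_shape, mismatch_raised)
```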


## squeeze 

![squeeze](https://img-blog.csdnimg.cn/8845529d07984e2c82a3ee2845e70eb4.png)

In [13]:
x = torch.zeros(2,1,2,1,2)
print(x.size())
y = torch.squeeze(x)    # removes every dimension of size 1
# squeeze reduces the number of dimensions of a tensor
print(y.size())

y = torch.squeeze(x, 1)  # remove dim=1 (its size is 1)
print(y.size())

y = torch.squeeze(x, 0)  # dim=0 has size 2, so nothing is removed
print(y.size())

torch.Size([2, 1, 2, 1, 2])
torch.Size([2, 2, 2])
torch.Size([2, 2, 1, 2])
torch.Size([2, 1, 2, 1, 2])


![unsqueeze](https://img-blog.csdnimg.cn/91694a970afb42ee91e0933fba4388d4.png)

In [15]:
x = torch.tensor([1,2,3,4])
print(x.size())
y = torch.unsqueeze(x, 0)
print(y.size())
y = torch.unsqueeze(x, 1)
print(y.size())

torch.Size([4])
torch.Size([1, 4])
torch.Size([4, 1])


## Swapping / reordering dimensions

In [17]:
# transpose swaps exactly two dimensions per call
x = torch.ones(3,4,5)
print(x.size())
y = x.transpose(0,1)
print(y.size())

# permute can reorder any number of dimensions at once
y = x.permute(1,2,0)
print(y.size())

torch.Size([3, 4, 5])
torch.Size([4, 3, 5])
torch.Size([4, 5, 3])


## Combining tensors (stack/cat)

![stack](https://img-blog.csdnimg.cn/474af12235964ae7a44840bbec13d98a.png)

In [18]:
# stack concatenates along a new dimension
a = torch.rand(3,4)
print(a)
b = torch.rand(3,4)
print(b)
c = torch.stack([a,b], 1)
print(c)
print(c.shape)
# (3, 4) -> (3, 2, 4)

tensor([[0.1514, 0.5191, 0.6667, 0.3323],
        [0.7784, 0.7498, 0.0861, 0.4328],
        [0.6962, 0.1142, 0.0742, 0.3999]])
tensor([[0.0594, 0.9961, 0.3403, 0.3854],
        [0.1247, 0.3389, 0.2981, 0.7379],
        [0.4233, 0.6466, 0.5118, 0.1789]])
tensor([[[0.1514, 0.5191, 0.6667, 0.3323],
         [0.0594, 0.9961, 0.3403, 0.3854]],

        [[0.7784, 0.7498, 0.0861, 0.4328],
         [0.1247, 0.3389, 0.2981, 0.7379]],

        [[0.6962, 0.1142, 0.0742, 0.3999],
         [0.4233, 0.6466, 0.5118, 0.1789]]])
torch.Size([3, 2, 4])


![cat](https://img-blog.csdnimg.cn/0bfa66a00a7045019be1f70b75f73516.png?)

In [20]:
# cat concatenates along an existing dimension
c = torch.cat([a,b], 0)
print(c)
print(c.shape)
c = torch.cat([a,b], 1)
print(c)
print(c.shape)

tensor([[0.1514, 0.5191, 0.6667, 0.3323],
        [0.7784, 0.7498, 0.0861, 0.4328],
        [0.6962, 0.1142, 0.0742, 0.3999],
        [0.0594, 0.9961, 0.3403, 0.3854],
        [0.1247, 0.3389, 0.2981, 0.7379],
        [0.4233, 0.6466, 0.5118, 0.1789]])
torch.Size([6, 4])
tensor([[0.1514, 0.5191, 0.6667, 0.3323, 0.0594, 0.9961, 0.3403, 0.3854],
        [0.7784, 0.7498, 0.0861, 0.4328, 0.1247, 0.3389, 0.2981, 0.7379],
        [0.6962, 0.1142, 0.0742, 0.3999, 0.4233, 0.6466, 0.5118, 0.1789]])
torch.Size([3, 8])


## gather: picking values by index along a dimension

![torch.gather](https://img-blog.csdnimg.cn/25de3a36cfec457f835bff4258bd78bc.png?)

In [22]:
t = torch.tensor([[1,2],[3,4]])
print(t.shape)
y = torch.gather(t, 1,torch.tensor([[0,0],[1,0]]))
print(y)

torch.Size([2, 2])
tensor([[1, 1],
        [4, 3]])


![gather](https://img-blog.csdnimg.cn/28fb4f245c66462fb9a5918b02154812.png?)

In [23]:
y = torch.gather(t, 0, torch.tensor([[0,0],[1,0]]))
print(y)  #[[1,2],[3,2]]

tensor([[1, 2],
        [3, 2]])
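
For 2-D inputs, `torch.gather` follows `out[i][j] = t[i][index[i][j]]` when `dim=1` and `out[i][j] = t[index[i][j]][j]` when `dim=0`. The loop-based helper below (`gather_2d`, a hypothetical name used only for illustration) is a sketch of that rule and reproduces the two results above.

```python
import torch

def gather_2d(t, dim, index):
    """Loop-based reference for torch.gather on 2-D tensors (illustrative only)."""
    out = torch.empty_like(index, dtype=t.dtype)
    for i in range(index.size(0)):
        for j in range(index.size(1)):
            if dim == 0:
                out[i, j] = t[index[i, j], j]   # index selects the row
            else:
                out[i, j] = t[i, index[i, j]]   # index selects the column
    return out

t = torch.tensor([[1, 2], [3, 4]])
index = torch.tensor([[0, 0], [1, 0]])

by_columns = gather_2d(t, 1, index)
by_rows = gather_2d(t, 0, index)
print(by_columns.tolist(), by_rows.tolist())
```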


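## Data augmentation

Motivation: deep learning needs large amounts of data to avoid overfitting.
Data augmentation: a technique for artificially enlarging the training set by generating additional, equivalent examples from a limited amount of data.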

## EDA（Easy Data Augmentation）

![EDA3](https://img-blog.csdnimg.cn/50c22b4212714b509ce053ff921d6bdd.png?)

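### For a given sentence in the training set, randomly choose and perform one of the following operations:
* Synonym replacement (SR): randomly choose n words from the sentence that are not stop words, and replace each of them with one of its synonyms chosen at random.
* Random insertion (RI): find a random synonym of a random word in the sentence that is not a stop word, and insert that synonym at a random position in the sentence. Do this n times.
* Random swap (RS): randomly choose two words in the sentence and swap their positions. Do this n times.
* Random deletion (RD): randomly remove each word in the sentence with probability p.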
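
The four operations listed above can be sketched as follows. This is a minimal illustration, not a reference EDA implementation: `SYNONYMS` and `STOP_WORDS` are hypothetical toy resources (a real setup would load a synonym lexicon and a stop-word list), and the function names are made up for this example.

```python
import random

# Hypothetical toy resources, for illustration only.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}
STOP_WORDS = {"the", "a", "is", "and"}

def synonym_replacement(words, n, rng):
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w not in STOP_WORDS and w in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n, rng):
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w not in STOP_WORDS and w in SYNONYMS]
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)   # randint is inclusive on both ends
    return out

def random_swap(words, n, rng):
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]        # never return an empty sentence

def eda(words, n=1, p=0.1, seed=0):
    """Apply one randomly chosen operation to a tokenized sentence."""
    rng = random.Random(seed)
    op = rng.choice(["sr", "ri", "rs", "rd"])
    if op == "sr":
        return synonym_replacement(words, n, rng)
    if op == "ri":
        return random_insertion(words, n, rng)
    if op == "rs":
        return random_swap(words, n, rng)
    return random_deletion(words, p, rng)

sentence = "the quick dog is happy".split()
print(eda(sentence, n=1, p=0.1, seed=0))
```

For Chinese text the input would be the token list produced by a word segmenter such as `jieba` rather than a whitespace split.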

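## Closure data augmentation

Each example in the dataset is a sentence pair with a label (1 = similar, 0 = not similar). Suppose we have three examples:\
a, b, 1\
a, c, 1\
a, d, 0\
a~b, a~c => b~c\
a~b, a and d not similar => b and d not similar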
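
One way to compute this closure is a union-find over the positive pairs, followed by propagating every negative pair to the two clusters involved. The sketch below assumes that similarity is transitive and that a negative label applies to every member of both clusters; `closure_augment` is a hypothetical helper name, and contradictory labels are not handled.

```python
from itertools import combinations

def closure_augment(pairs):
    """pairs: (s1, s2, label) tuples, label 1 = similar, 0 = not similar."""
    pairs = list(pairs)
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Merge the clusters of every positive pair.
    for s1, s2, label in pairs:
        root1, root2 = find(s1), find(s2)
        if label == 1:
            parent[root1] = root2

    clusters = {}
    for s in list(parent):
        clusters.setdefault(find(s), set()).add(s)

    augmented = set()
    # Positive closure: every two sentences in the same cluster are similar.
    for members in clusters.values():
        for x, y in combinations(sorted(members), 2):
            augmented.add((x, y, 1))
    # Negative propagation: a negative pair separates two whole clusters.
    for s1, s2, label in pairs:
        if label == 0:
            for x in clusters[find(s1)]:
                for y in clusters[find(s2)]:
                    augmented.add((*sorted((x, y)), 0))
    return augmented

examples = [("a", "b", 1), ("a", "c", 1), ("a", "d", 0)]
result = closure_augment(examples)
print(sorted(result))
```

On the three examples above this yields the derived pair (b, c, 1) as well as the negatives (b, d, 0) and (c, d, 0), next to the original pairs.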

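## Dual data augmentation

### Turn each a-b pair into a b-a pair by swapping the order of the two sentences
### Our unsupervised data augmentation uses this dual augmentation
### BERT takes the two sentences a, b as input; feeding b, a as input as well yields an additional sample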
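
A minimal sketch of the swap, assuming each example is an `(a, b, label)` tuple; `dual_augment` is a hypothetical helper name.

```python
def dual_augment(pairs):
    """Return the original (a, b, label) examples plus their swapped (b, a, label) copies."""
    augmented = []
    seen = set()
    for a, b, label in pairs:
        for example in ((a, b, label), (b, a, label)):
            if example not in seen:          # skip duplicates, e.g. when a == b
                seen.add(example)
                augmented.append(example)
    return augmented

print(dual_augment([("s1", "s2", 1), ("s3", "s3", 0)]))
```

The label is kept unchanged because sentence similarity is symmetric; for asymmetric tasks such as entailment the swap would not be valid.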

### UDA (Unsupervised Data Augmentation for Consistency Training)

![UDA5](https://img-blog.csdnimg.cn/9d10da70d1d0467e93ef5bb1267ac87f.png?)

![code](https://img-blog.csdnimg.cn/d97f35fd41e0485185f40d50f4fd8e8d.png?x)

## TextCNN source code

In [24]:
import torch 
import torch.nn as nn

config = {
    'train_file_path': 'data/data86671/train.csv',
    'test_file_path': 'data/data86671/test.csv',
    'train_val_ratio': 0.1,  # hold out 10% as the validation set
    'vocab_size': 30000,   # vocabulary size: 30k
    'batch_size': 64,      # batch size
    'num_epochs': 2,      # number of epochs
    'learning_rate': 1e-3, # learning rate
    'logging_step': 300,   # log every 300 batches
    'seed': 2021           # random seed
}

config['device'] = 'cuda' if torch.cuda.is_available() else 'cpu' # use the GPU when available

import random
import numpy as np

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    return seed

seed_everything(config['seed'])

2021

In [25]:
# train.csv has four columns: id, label, label_desc, sentence
from collections import Counter
from tqdm import tqdm
import jieba
def get_vocab(config):
    token_counter = Counter()
    with open(config['train_file_path'], 'r', encoding='utf8') as f:
        lines = f.readlines()
        for line in tqdm(lines, desc='Counting tokens', total=len(lines)):
            sent = line.split(',')[-1].strip()
            sent_cut = list(jieba.cut(sent))
            token_counter.update(sent_cut)
            # token_counter, e.g. {'我': 2, '是': 5}
    
    vocab = set(token for token, _ in token_counter.most_common(config['vocab_size']))
    return vocab

In [26]:
vocab = get_vocab(config)

Counting tokens:   0%|          | 0/53361 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.822 seconds.
Prefix dict has been built successfully.
Counting tokens: 100%|██████████| 53361/53361 [00:07<00:00, 7465.68it/s]


In [27]:
# Map each token in the vocabulary (vocab) to a word vector
# token -> embedding 
# token -> id

# '是' <-> 10 <-> 300-d vector

def get_embedding(vocab):
    token2embedding ={}

    with bz2.open('data/data86671/sgns.weibo.word.bz2') as f:
        token_vector = f.readlines()

        meta_info = token_vector[0].split()
        print(f'{meta_info[0]} tokens in embedding file in total, vector size is {meta_info[1]}')

        # From the second line on, each line of sgns.weibo.word.bz2 has the form 'token embedding'
        # e.g. '我' 0.88383 0.22222 ... (300 values)
        for line in tqdm(token_vector[1:]):
            line = line.split()
            token = line[0].decode('utf8')

            vector = line[1:]

            if token in vocab:
                token2embedding[token] = [float(num) for num in vector]

        # enumerate(iterable, start): ids start at 4 because 0-3 are reserved for the special tokens
        token2id = {token: idx for idx, token in enumerate(token2embedding.keys(), 4)}
        id2embedding = {token2id[token]: embedding for token, embedding in token2embedding.items()}

        PAD, UNK, BOS, EOS = '<pad>', '<unk>', '<bos>', '<eos>'

        token2id[PAD] = 0
        token2id[UNK] = 1
        token2id[BOS] = 2
        token2id[EOS] = 3

        id2embedding[0] = [.0] * int(meta_info[1])
        id2embedding[1] = [.0] * int(meta_info[1])

        id2embedding[2] = np.random.random(int(meta_info[1])).tolist()
        id2embedding[3] = np.random.random(int(meta_info[1])).tolist()

        emb_mat = [id2embedding[idx] for idx in range(len(id2embedding))]

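        # the vocabulary size must equal the number of rows in emb_mat:
        # tokens found in the embedding file + 4 special tokens
        return torch.tensor(emb_mat, dtype=torch.float), token2id, len(token2id)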

In [28]:
import bz2
emb_mat, token2id, config['vocab_size'] = get_embedding(vocab)
# print(token2id)

  0%|          | 776/195202 [00:00<00:25, 7758.53it/s]

b'195202' tokens in embedding file in total, vector size is b'300'


  2%|▏         | 3600/195202 [00:00<00:18, 10089.08it/s]100%|██████████| 195202/195202 [00:04<00:00, 45270.50it/s]


In [29]:
def tokenizer(sent, token2id):
    ids = [token2id.get(token, 1) for token in jieba.cut(sent)]
    return ids

In [30]:
import pandas as pd
from collections import defaultdict
def read_data(config, token2id, mode='train'):
    data_df = pd.read_csv(config[f'{mode}_file_path'], sep=',')
    if mode == 'train':
        X_train, y_train = defaultdict(list), []
        X_val, y_val = defaultdict(list), []
        num_val = int(config['train_val_ratio'] * len(data_df))
    
    else:
        X_test, y_test = defaultdict(list), []

    for i, row in tqdm(data_df.iterrows(), desc=f'Preprocesing {mode} data', total=len(data_df)):
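        # use positional access; integer keys on a label-indexed Series are deprecated in pandas
        label = row.iloc[1] if mode == 'train' else 0
        sentence = row.iloc[-1]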
        inputs = tokenizer(sentence, token2id)
        if mode == 'train':
            if i < num_val:
                X_val['input_ids'].append(inputs)
                y_val.append(label)
            else:
                X_train['input_ids'].append(inputs)
                y_train.append(label)
        
        else:
            X_test['input_ids'].append(inputs)
            y_test.append(label)

    if mode == 'train':
        label2id = {label: i for i, label in enumerate(np.unique(y_train))}
        id2label = {i: label for label, i in label2id.items()}

        y_train = torch.tensor([label2id[label] for label in y_train], dtype=torch.long)
        y_val = torch.tensor([label2id[label] for label in y_val], dtype=torch.long)

        return X_train, y_train, X_val, y_val, label2id, id2label
    
    else:
        y_test = torch.tensor(y_test, dtype=torch.long)
        return X_test, y_test



In [31]:
X_train, y_train, X_val, y_val, label2id, id2label = read_data(config, token2id, mode='train')
X_test, y_test = read_data(config, token2id, mode='test')

Preprocesing train data: 100%|██████████| 53360/53360 [00:12<00:00, 4253.68it/s]
Preprocesing test data: 100%|██████████| 10000/10000 [00:02<00:00, 4188.08it/s]


### Dataset and DataLoader

[torch.utils.data.DataLoader](https://zhuanlan.zhihu.com/p/402666821)

![dataset&dataloaders](https://img-blog.csdnimg.cn/c4ba8de5a4934a68a071c0b03f16d8cf.png)

```
from torch.utils.data import Dataset   # All datasets that represent a map from keys to data samples should subclass it. 
class TNEWSDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def __getitem__(self, idx):
        return (self.x[idx], self.y[idx])   # supporting fetching a data sample for a given key. 

    def __len__(self):
        return len(self.y)
```

In [32]:
from torch.utils.data import Dataset
class TNEWSDataset(Dataset):
    def __init__(self, X, y):
        self.x = X
        self.y = y

    def __getitem__(self, idx):
        return {
            'input_ids': self.x['input_ids'][idx],
            'label': self.y[idx]
        }
    
    def __len__(self):
        return self.y.size(0)

In [33]:
train_dataset = TNEWSDataset(X_train, y_train)

In [45]:
print(train_dataset[100]) # returns a single sample
print(len(train_dataset)) # returns the size of the whole dataset

{'input_ids': [2258, 2362, 10, 199, 2362, 18, 5984, 17686, 1237, 4280, 19, 1, 23], 'label': tensor(7)}
48024


In [48]:
def collete_fn(examples):
    input_ids_list =[]
    labels = []
    for example in examples:
        input_ids_list.append(example['input_ids'])
        labels.append(example['label'])
    
    # 1. Find the length of the longest sentence in input_ids_list
    max_length = max(len(input_ids) for input_ids in input_ids_list)

    # 2. Allocate a tensor filled with 0, the <pad> id
    input_ids_tensor = torch.zeros((len(labels), max_length), dtype=torch.long)

    for i, input_ids in enumerate(input_ids_list):
        # 3. Get the length of the current sentence and copy its ids in
        seq_len = len(input_ids)
        input_ids_tensor[i, :seq_len] = torch.tensor(input_ids, dtype=torch.long)

    return {
        'input_ids': input_ids_tensor,
        'label': torch.tensor(labels, dtype=torch.long)
    }

```
def collate_fn(examples):    
    input_ids_list = []
    labels = []

    for example in examples:
        input_ids_list.append(example[0])
        labels.append(example[1])

    max_length = max(len(input_ids) for input_ids in input_ids_list)

    input_ids_tensor = torch.zeros((len(labels), max_length), dtype=torch.long)
    for i, input_ids in enumerate(input_ids_list):
        seq_len = len(input_ids)
        input_ids_tensor[i, :seq_len] = torch.tensor(input_ids, dtype=torch.long)

    label_tensor = torch.tensor(labels, dtype=torch.long)
    
    return (input_ids_tensor, label_tensor)
```

```
class Collator:
    def __init__(self, max_seq_len):
        self.max_seq_len = max_seq_len

    def get_max_seq_len(self, ids_list):
        cur_max_seq_len = max(len(input_id) for input_id in ids_list)
        max_seq_len = min(self.max_seq_len, cur_max_seq_len)
        return max_seq_len
    
    @staticmethod
    def pad_and_truncate(text_ids_list, max_seq_len):
        input_ids = torch.zeros((len(text_ids_list), max_seq_len), dtype=torch.long)
        for i, text_ids in enumerate(text_ids_list):
            seq_len = min(len(text_ids), max_seq_len)
            input_ids[i, :seq_len] = torch.tensor(text_ids[:seq_len], dtype=torch.long)
        
        return input_ids

    
    def __call__(self, examples):
        # 1. Group the tuple fields: all sentence1 ids together, all sentence2 ids together, all labels together
        text_ids_left_list, text_ids_right_list, labels_list = list(zip(*examples))

        # 2.1 Find the longest sentence length in text_ids_left_list and text_ids_right_list
        max_text_left_length = self.get_max_seq_len(text_ids_left_list)
        max_text_right_length = self.get_max_seq_len(text_ids_right_list)

        # 2.2 Pad the shorter sentences; 3. allocate a tensor and copy the data in
        text_left_ids = self.pad_and_truncate(text_ids_left_list, max_text_left_length)
        text_right_ids = self.pad_and_truncate(text_ids_right_list, max_text_right_length)
        labels = torch.tensor(labels_list, dtype=torch.long)
        
        data_list = [text_left_ids, text_right_ids, labels]
        return data_list
```

In [49]:
from torch.utils.data import DataLoader
def build_dataloader(config, vocab):
    X_train, y_train, X_val, y_val, label2id, id2label = read_data(config, token2id, mode='train')
    X_test, y_test = read_data(config, token2id, mode='test')

    train_dataset = TNEWSDataset(X_train, y_train)
    val_dataset = TNEWSDataset(X_val, y_val)
    test_dataset = TNEWSDataset(X_test, y_test)
    
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=config['batch_size'], num_workers=4, shuffle=True, collate_fn=collete_fn)
    val_dataloader = DataLoader(dataset=val_dataset, batch_size=config['batch_size'], num_workers=4, shuffle=False, collate_fn=collete_fn)
    test_dataloader = DataLoader(dataset=test_dataset, batch_size=config['batch_size'], num_workers=4, shuffle=False, collate_fn=collete_fn)

    return id2label, train_dataloader, val_dataloader, test_dataloader

```
from torch.utils.data import DataLoader
def build_dataloader(train_df, test_df, config, vocab):
    X_train, y_train, X_val, y_val, label2id, id2label = read_data(train_df, config['train_val_ratio'], vocab, mode='train')
    X_test, y_test = read_data(test_df, config['train_val_ratio'], vocab, mode='test')

    train_dataset = AFQMCDataset(X_train, y_train)
    val_dataset = AFQMCDataset(X_val, y_val)
    test_dataset = AFQMCDataset(X_test, y_test)
    
    # -----------------new -----------------------#
    collate_fn = Collator(config['max_seq_len'])
    # -----------------new -----------------------#

    train_dataloader = DataLoader(dataset=train_dataset, batch_size=config['batch_size'],
                                  num_workers=4, shuffle=True, collate_fn=collate_fn)
    val_dataloader = DataLoader(dataset=val_dataset, batch_size=config['batch_size'],
                                num_workers=4, shuffle=False, collate_fn=collate_fn)
    test_dataloader = DataLoader(dataset=test_dataset, batch_size=config['batch_size'],
                                 num_workers=4, shuffle=False, collate_fn=collate_fn)

    return id2label, test_dataloader, train_dataloader, val_dataloader
```

In [50]:
id2label, train_dataloader, val_dataloader, test_dataloader = build_dataloader(config, vocab)

Preprocesing train data: 100%|██████████| 53360/53360 [00:12<00:00, 4318.02it/s]
Preprocesing test data: 100%|██████████| 10000/10000 [00:02<00:00, 4408.49it/s]


## Printing a batch

In [51]:
for batch in train_dataloader:
    print(batch)
    break

{'input_ids': tensor([[ 1793, 20039,  1034,  ...,     0,     0,     0],
        [  348,  2484, 10497,  ...,     0,     0,     0],
        [ 7051, 13667,  6551,  ...,     0,     0,     0],
        ...,
        [  170, 17538,  8992,  ...,     0,     0,     0],
        [    1, 10274,     1,  ...,     0,     0,     0],
        [  290,     1,     9,  ...,     0,     0,     0]]), 'label': tensor([ 7,  3,  3,  6,  3,  9, 13, 10,  9,  2, 10,  8,  8,  8,  9, 14,  3,  3,
        14,  5,  6, 14,  1,  9,  8, 10,  8,  6,  1,  2,  6, 10,  9,  3,  8,  3,
         6,  8, 11,  4, 11,  5,  8,  1, 14,  9,  3, 11,  2, 11,  4,  0,  1,  3,
         8,  8,  5,  2, 12,  5, 13, 11,  9,  6])}


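## Sampling

## All samplers inherit from the Sampler class
## Every Sampler subclass must implement `__iter__` (which iterates over the indices of the dataset examples) and `__len__` (which returns the length of the iterator)

### SequentialSampler: receives the dataset at construction time and samples elements in order, yielding one index at a time.

### RandomSampler: random sampling (sampling with replacement is possible)

### SubsetRandomSampler: samples elements at random from a given list of indices, without replacement

### BatchSampler(Sampler): the samplers above yield one index at a time; BatchSampler groups the indices drawn by a base sampler into batches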
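
A short sketch of how these samplers behave when iterated directly. A plain list of ten integers stands in for the dataset, since the samplers here only need its length.

```python
import torch
from torch.utils.data import (BatchSampler, RandomSampler,
                              SequentialSampler, SubsetRandomSampler)

data = list(range(10))   # stand-in data source

sequential = list(SequentialSampler(data))                       # 0, 1, ..., 9 in order
shuffled = list(RandomSampler(data, generator=torch.Generator().manual_seed(0)))
with_replacement = list(RandomSampler(data, replacement=True, num_samples=20))
subset = list(SubsetRandomSampler([0, 2, 4, 6]))                  # random order, no repeats
batches = list(BatchSampler(SequentialSampler(data), batch_size=4, drop_last=False))

print(sequential)
print(batches)           # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

`DataLoader` builds the equivalent of `BatchSampler(RandomSampler(dataset), ...)` when it is given `shuffle=True`, and uses a `SequentialSampler` when `shuffle=False`.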

![BucketSampler](https://img-blog.csdnimg.cn/d7e03938f2824f9cb8a6c3a895f5a78a.png?)



# Building the model

```
class Model(nn.Module):
    def __init__(self, config):
        pass
    
    def forward(self, x):
        pass


for batch in train_iterator:
    loss = model(**batch)[0] # forward pass
    model.zero_grad()    # reset the gradients to zero
    loss.backward()      # backward pass
    optimizer.step()     # update the network weights
```

## TextCNN

![TextCNN图](https://img-blog.csdnimg.cn/2021063014002246.png?)

```
import torch.nn.functional as F
class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()

        self.embedding = nn.Embedding.from_pretrained(config['embedding_pretrained'], freeze=True)

        self.convs = nn.ModuleList([nn.Conv2d(1, config['num_filters'], (k, config['emb_size'])) for k in config['filter_sizes']])

        self.dropout = nn.Dropout(config['dropout'])

        # project the concatenated pooled features to logits
        self.fc = nn.Linear(len(config['filter_sizes']) * config['num_filters'], config['num_classes'])

    def convs_and_pool(self, x, conv):

        # x [batch_size, out_channels, seq_len_out, 1]
        # x [batch_size, out_channels, seq_len_out]
        x = F.relu(conv(x)).squeeze(3)

        # x (batch_size, out_channels, 1)
        # x (batch_size, out_channels)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x

    def forward(self, input_ids=None, label=None):
        # out [batch_size, seq_len, embedding_dim]
        out = self.embedding(input_ids)
        
        # H: seq_len; W:embedding_dim
        # out [batch_size, 1, seq_len, embedding_dim]
        out = out.unsqueeze(1)

        # (batch_size, out_channels)
        out = torch.cat([self.convs_and_pool(out, conv) for conv in self.convs], 1)

        out = self.dropout(out)

        out = self.fc(out)

        output = (out, )

        if label is not None: # only when labels are given (training / validation)
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(out, label)
            output = (loss, ) + output

        # train output (loss, out)
        # test output (out)
        return output
```
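
To make the shape comments in `convs_and_pool` concrete, the sketch below traces one random batch through the convolution and max-pooling steps for three filter sizes. The sizes (`seq_len=10`, `emb_size=300`, `num_filters=4`, filter sizes 2/3/4) are arbitrary values chosen for illustration; they are not the configuration used in this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, seq_len, emb_size, num_filters = 2, 10, 300, 4
filter_sizes = (2, 3, 4)

# Embedded batch after unsqueeze(1): [B, 1, seq_len, emb_size]
x = torch.randn(batch_size, 1, seq_len, emb_size)

conv_lengths, pooled_shapes = [], []
for k in filter_sizes:
    conv = nn.Conv2d(1, num_filters, (k, emb_size))
    h = F.relu(conv(x)).squeeze(3)                   # [B, num_filters, seq_len - k + 1]
    pooled = F.max_pool1d(h, h.size(2)).squeeze(2)   # [B, num_filters]
    conv_lengths.append(h.size(2))
    pooled_shapes.append(tuple(pooled.shape))

print(conv_lengths)    # [9, 8, 7]
print(pooled_shapes)   # [(2, 4), (2, 4), (2, 4)]
```

Concatenating the three pooled tensors along dimension 1 gives `[B, len(filter_sizes) * num_filters]`, which is the input size of the final linear layer. Note that every input sequence must be at least as long as the largest filter size.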

## ESIM

![ESIM](https://img-blog.csdnimg.cn/img_convert/1adb67ec46e87da23fa042f298ff88bb.png)

![all of gongshi](https://img-blog.csdnimg.cn/652165f9f0584ac683c0df8d412514be.png?)

![stackRNN](https://img-blog.csdnimg.cn/02b1b24b629c4defabb888776f9d3f57.png?)

```
import torch.nn.functional as F
class StackedBRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers,
                 dropout_rate=0, dropout_output=False, rnn_type=nn.LSTM, 
                 concat_layers=False):
        super().__init__()
        self.dropout_output = dropout_output
        self.dropout_rate = dropout_rate
        self.num_layers = num_layers
        self.concat_layers = concat_layers
        self.rnns = nn.ModuleList()
        # stack num_layers bidirectional RNN layers (e.g. two LSTM layers)
        for i in range(num_layers):
            input_size = input_size if i == 0 else 2*hidden_size
            self.rnns.append(rnn_type(input_size, hidden_size, num_layers=1, bidirectional=True))
    
    def forward(self, x):
        # x (B, L, D) -> (L, B, D)
        x = x.transpose(0, 1)

        outputs = [x]
        for i in range(self.num_layers):
            rnn_input = outputs[-1]

            if self.dropout_rate > 0:
                rnn_input = F.dropout(rnn_input, p=self.dropout_rate, training=self.training)
            
            # self.rnn[i](rnn_input) (output, (h_n, c_n))
            rnn_output = self.rnns[i](rnn_input)[0]
            outputs.append(rnn_output)
        
        # outputs [x, output0, output1]
        if self.concat_layers:
            output = torch.cat(outputs[1:], 2)
        else:
            output = outputs[-1]
        
        # output (L, B, D) -> (B, L, D)
        output = output.transpose(0, 1)

        if self.dropout_output and self.dropout_rate > 0:
            output = F.dropout(output, p=self.dropout_rate, training=self.training)
        
        # after transpose the tensor is no longer contiguous in memory; contiguous() makes output contiguous
        return output.contiguous()
```

```
import torch.nn as nn
class RNNDropout(nn.Dropout):
    # zero out some dimensions of the word vectors (the same dimensions at every position)
    # sequences_batch [B, L, D]
    def forward(self, sequences_batch):
        # ones [B, D]
        ones = sequences_batch.data.new_ones(sequences_batch.shape[0], sequences_batch.shape[-1])

        # randomly mask ones
        # dropout_mask [B, D]
        dropout_mask = nn.functional.dropout(ones, self.p, self.training, inplace=False)
       
        return dropout_mask.unsqueeze(1) * sequences_batch
```

```
class BidirectionalAttention(nn.Module):
    def __init__(self):
        super().__init__()
        
        # v1 [B, L, H]
        # v1_mask [B, L]
        # v2 [B, R, H]
        # v2_mask [B, R]
    def forward(self, v1, v1_mask, v2, v2_mask):
        # v2: a, v1: b 

        # 1. Compute the similarity matrix
        # similarity_matrix [B, L, R]
        similarity_matrix = v1.bmm(v2.transpose(2, 1).contiguous())

        # 2. Pad positions (id 0) must not receive attention, so mask them; 3. apply softmax
        # mask the weights of the pad positions of v1 in similarity_matrix
        # [B, L, R]
        v2_v1_attn = F.softmax(
            similarity_matrix.masked_fill(
                v1_mask.unsqueeze(2), -1e7), dim=1)

        # mask the weights of the pad positions of v2 in similarity_matrix
        # [B, L, R]
        v1_v2_attn = F.softmax(
            similarity_matrix.masked_fill(
                v2_mask.unsqueeze(1), -1e7),dim=2)

        # 4. Compute the attended representations
        # [B, L, R] @ [B, R, H] 
        # influence of sentence a on b [B, L, H]
        # attented_v1 [B, L, H]
        attented_v1 = v1_v2_attn.bmm(v2)

        # influence of sentence b on a
        # v2_v1_attn [B, L, R] -> [B, R, L] @[B, L, H] -> [B, R, H]
        # attented_v2 [B, R, H]
        attented_v2 = v2_v1_attn.transpose(1,2).bmm(v1)

        # zero out the pad positions of v1 in attented_v1
        # zero out the pad positions of v2 in attented_v2
        # masked_fill is not in place, so the result has to be assigned
        attented_v1 = attented_v1.masked_fill(v1_mask.unsqueeze(2), 0)
        attented_v2 = attented_v2.masked_fill(v2_mask.unsqueeze(2), 0)
        return attented_v1, attented_v2
```

```
class ESIM(nn.Module):
    
    def __init__(self, config):
        super().__init__()

        # -----------------------   input encoding  ---------------------#
        rnn_mapping = {'lstm': nn.LSTM, 'gru': nn.GRU}
        self.embedding = nn.Embedding.from_pretrained(config['embedding'], freeze=config['freeze_emb'])

        self.rnn_dropout = RNNDropout(p=config['dropout'])
        rnn_size = config['hidden_size']

        if config['concat_layers']:
            rnn_size //= config['num_layers']

        self.input_encoding = StackedBRNN(input_size=config['embedding'].size(1),
                                          hidden_size=rnn_size // 2,
                                          num_layers=config['num_layers'],
                                          rnn_type=rnn_mapping[config['rnn_type']],
                                          concat_layers=config['concat_layers'])

        # -----------------------   input encoding  ---------------------#

        # -----------------------   Local inference collected over sequences  ---------------------#
        self.attention = BidirectionalAttention()
        # -----------------------   Local inference collected over sequences  ---------------------#


        # -----------------------   the composition layer  ---------------------#
        self.projection = nn.Sequential(
            nn.Linear(4 * config['hidden_size'], config['hidden_size']),
            nn.ReLU()
        )


        self.composition = StackedBRNN(input_size=config['hidden_size'],
                                      hidden_size=rnn_size // 2,
                                      num_layers=config['num_layers'],
                                      rnn_type=rnn_mapping[config['rnn_type']],
                                      concat_layers=config['concat_layers'])


        # -----------------------   the composition layer  ---------------------#


        self.classification = nn.Sequential(
            nn.Dropout(p=config['dropout']),
            nn.Linear(4 * config['hidden_size'], config['hidden_size']),
            nn.Tanh(),
            nn.Dropout(p=config['dropout']))
            
        self.out = nn.Linear(config['hidden_size'], config['num_labels'])

    def forward(self, inputs):
        # inputs: [sentence1_tensor, sentence2_tensor, labels_tensor]
        # B: batch_size
        # L = 'inputs left'  sequence length
        # R = 'inputs right'  sequence length
        # D = embedding size
        # H = hidden size

        # -----------------------   input encoding  ---------------------#
        # query sentence1_tensor
        # doc sentence2_tensor
   
        # query [B, L]
        # doc [B, R] 
        query, doc = inputs[0].long(), inputs[1].long()
        
        # Check whether each id in query and doc is 0; a mask value of 1 (True) marks a pad position
        # query: [2,3,4,5,0,0,0] -> query_mask: [0,0,0,0,1,1,1]
        # query_mask [B, L]
        # doc_mask [B, R]
        query_mask = (query == 0)
        doc_mask = (doc == 0)
        
        # query [B, L, D]
        # doc [B, R, D]
        query = self.embedding(query)
        doc = self.embedding(doc)
        
        # query [B, L, D]
        # doc [B, R, D]
        query = self.rnn_dropout(query)
        doc = self.rnn_dropout(doc)
        
        # query [B, L, H]
        # doc [B, R, H]
        query = self.input_encoding(query)
        doc = self.input_encoding(doc)
        # -----------------------   input encoding  ---------------------#

        # 1. Compute the similarity matrix
        # 2. Pad positions (id 0) must not receive attention, so mask them
        # 3. Apply softmax
        # 4. Compute the attended representations

        # -----------------------   Local inference collected over sequences  ---------------------#
        # query [B, L, H]
        # query_mask [B, L]
        # doc [B, R, H]
        # doc_mask [B, R]
        attended_query, attended_doc = self.attention(query, query_mask, doc, doc_mask)
        # -----------------------   Local inference collected over sequences  ---------------------#

        # -----------------------  Enhancement of local inference information ---------------------#
        # enhanced_query [B, L, 4*H]
        # enhanced_doc [B, R, 4*H]
        enhanced_query = torch.cat([query, 
                                    attended_query, 
                                    query-attended_query, 
                                    query*attended_query], 
                                    dim=-1)
        
        enhanced_doc = torch.cat([doc, 
                                  attended_doc, 
                                  doc-attended_doc, 
                                  doc*attended_doc], 
                                  dim=-1)
        
        # -----------------------  Enhancement of local inference information ---------------------#
         

        # -----------------------   the composition layer  ---------------------#
        # projected_query [B, L, H]
        # projected_doc [B, R, H]
        projected_query = self.projection(enhanced_query)
        projected_doc = self.projection(enhanced_doc)
        
        # projected_query [B, L, H]
        # projected_doc [B, R, H]
        query = self.composition(projected_query)
        doc = self.composition(projected_doc)
        # -----------------------   the composition layer  ---------------------#

        # -----------------------   Pooling  ---------------------#
        # query_mask, doc_mask: 1 (True) marks a pad position
        # reverse_query_mask: 0 marks a pad position
        # reverse_query_mask [B, L]
        # reverse_doc_mask [B, R]
        reverse_query_mask = 1. - query_mask.float()
        reverse_doc_mask = 1. - doc_mask.float()

        query_avg = torch.sum(query * reverse_query_mask.unsqueeze(2),dim=1)/ (torch.sum(reverse_query_mask, dim=1, keepdim=True) + 1e-8)
        doc_avg = torch.sum(doc * reverse_doc_mask.unsqueeze(2),dim=1)/ (torch.sum(reverse_doc_mask, dim=1, keepdim=True) + 1e-8)
        
        # make sure max pooling never selects a pad position
        query = query.masked_fill(query_mask.unsqueeze(2), -1e7)
        doc = doc.masked_fill(doc_mask.unsqueeze(2), -1e7)
        
        
        query_max, _ = query.max(dim=1)
        doc_max, _ = doc.max(dim=1)
        
        # v [B, 4*H]
        v = torch.cat([query_avg, query_max, doc_avg, doc_max], dim=-1)
        # -----------------------   Pooling  ---------------------#

        # -----------------------   prediction  ---------------------#
        # hidden [B, H]
        hidden = self.classification(v)

        out = self.out(hidden)
        outputs = (out, )
        # -----------------------   prediction  ---------------------#

        if len(inputs) == 3:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(out, inputs[-1])
            outputs = (loss, ) + outputs
        return outputs
```

## BERT

![attention3](https://img-blog.csdnimg.cn/20210708201252776.png?)

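How attention is computed:
1. Convert the text into embeddings.
2. Multiply each embedding by the three matrices (Wq, Wk, Wv) to obtain q, k, v.
3. Compute a score for each embedding: score = q·k.
4. Divide the scores by √dk.
5. Apply softmax to the scores and use them as weights for a weighted sum of v.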
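
The five steps above can be sketched for a single attention head as follows. The embeddings and the three projection matrices are random stand-ins, and the sizes (`seq_len=5`, `d_model=16`, `d_k=8`) are arbitrary; this is an illustration of the computation, not BERT's actual implementation.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 5, 16, 8

x = torch.randn(seq_len, d_model)       # step 1: token embeddings (random stand-ins)
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

q, k, v = x @ W_q, x @ W_k, x @ W_v     # step 2: project to q, k, v
scores = q @ k.t()                      # step 3: score = q . k, shape [seq_len, seq_len]
scores = scores / math.sqrt(d_k)        # step 4: scale by sqrt(d_k)
weights = F.softmax(scores, dim=-1)     # step 5a: each row sums to 1
output = weights @ v                    # step 5b: weighted sum of v, shape [seq_len, d_k]

print(weights.sum(dim=-1))
print(output.shape)
```

Without the scaling in step 4, the dot products grow with `d_k` and push the softmax into regions with very small gradients.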

```
from transformers import BertConfig, BertForSequenceClassification

def train(config, id2label, train_dataloader, val_dataloader):
    # ---------------------- part 1 ---------------------- #
    # load the model configuration
    bert_config = BertConfig.from_pretrained(config['model_path'])
    bert_config.num_labels = len(id2label)
    model = BertForSequenceClassification.from_pretrained(config['model_path'], config=bert_config)
    # ---------------------- part 1 ---------------------- #
    
```

![figure](https://img-blog.csdnimg.cn/20210709132851125.png?)

## NeZha

![NeZha1](https://img-blog.csdnimg.cn/20210708204825728.png?)

NeZha's improvements over BERT: \
functional relative positional encoding, whole word masking, mixed-precision training, and the LAMB optimizer.

![NeZha2](https://img-blog.csdnimg.cn/20210708205133810.png?)

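## Training

# Mixed-precision training
Purpose: speed up training while sacrificing as little model quality as possible.

Advantages of float16:
* Lower memory usage
* Faster computation in training and inference, up to roughly a 2x speed-up

Disadvantages of float16:
* Overflow errors
* Rounding errors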

![amp2](https://img-blog.csdnimg.cn/72b642b508024cc2a6207c308347c7e7.png?)

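Inside the autocast() context, eligible operations automatically run in float16. The autocast context should wrap only the forward pass (including the loss computation); running the backward pass inside it is not recommended.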

![scaling](https://img-blog.csdnimg.cn/bbae5cdd360748ecb59cee8dc6f728f2.png?)

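* scaler.scale(loss) multiplies the given loss by the scaler's current scale factor; backpropagate through the scaled loss
* scaler.step(optimizer) unscales the gradients and calls optimizer.step() (the step is skipped if the gradients contain inf or NaN)
* scaler.update() updates the scaler's scale factor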
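
Putting the three calls together, a training loop with automatic mixed precision could look like the sketch below. The linear model and the random batches are stand-ins, and AMP is only enabled when a GPU is available, so the same loop also runs on CPU in full precision. `torch.cuda.amp.GradScaler` is used here; recent PyTorch releases prefer `torch.amp.GradScaler` and may print a deprecation warning for the older name.

```python
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
use_amp = device == 'cuda'             # float16 autocast is only enabled on GPU here

model = nn.Linear(16, 4).to(device)    # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fct = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

losses = []
for step in range(3):
    inputs = torch.randn(8, 16, device=device)          # random stand-in batch
    labels = torch.randint(0, 4, (8,), device=device)

    optimizer.zero_grad()
    # Only the forward pass and the loss computation run under autocast.
    with torch.autocast(device_type=device, enabled=use_amp):
        logits = model(inputs)
        loss = loss_fct(logits, labels)

    scaler.scale(loss).backward()      # backward on the scaled loss, outside autocast
    scaler.step(optimizer)             # unscale gradients, then optimizer.step()
    scaler.update()                    # adjust the scale factor for the next step
    losses.append(loss.item())

print(losses)
```

If gradient clipping is needed, `scaler.unscale_(optimizer)` should be called before clipping so that the threshold applies to the unscaled gradients.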

![example](https://img-blog.csdnimg.cn/dfeebde4d34b496096062bb7dbbee7b6.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MTI4NzA2MA==,size_16,color_FFFFFF,t_70)

# Optimizers

* Momentum
* Adagrad
* Adam
* AdamW
* Lookahead
* Lamb
* WarmUp

# Loss functions

* cross-entropy
* KLDIV
* MSE
* Label smoothing
* Focal loss

## Adversarial training methods

![对抗训练](https://img-blog.csdnimg.cn/20210716113626200.png?) 

### Fast Gradient Method(FGM)
### Projected Gradient Descent(PGD)

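## FGM
For each x:
1. Compute the forward loss of x and backpropagate to get the gradients.
2. Compute the perturbation r from the gradient of the embedding matrix and add it to the current embedding, which amounts to x+r.
3. Compute the forward loss of x+r and backpropagate; the resulting gradients are accumulated onto those from (1).
4. Restore the embedding to its value in (1).
5. Update the parameters with the gradients from (3).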
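
A minimal sketch of these five steps, written in the helper-class style commonly used for FGM in PyTorch. It assumes the perturbation is r = epsilon * g / ||g||, where g is the gradient of the embedding matrix; `TinyClassifier`, `emb_name`, and `epsilon=1.0` are hypothetical choices made for this example.

```python
import torch
import torch.nn as nn


class FGM:
    """Perturbs every parameter whose name contains `emb_name`."""

    def __init__(self, model, emb_name="embedding", epsilon=1.0):
        self.model = model
        self.emb_name = emb_name
        self.epsilon = epsilon
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)   # x + r

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}


class TinyClassifier(nn.Module):
    """Hypothetical stand-in model: mean of embeddings followed by a linear layer."""

    def __init__(self, vocab_size=50, emb_size=8, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.fc = nn.Linear(emb_size, num_classes)

    def forward(self, input_ids):
        return self.fc(self.embedding(input_ids).mean(dim=1))


torch.manual_seed(0)
model = TinyClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fct = nn.CrossEntropyLoss()
fgm = FGM(model)

input_ids = torch.randint(0, 50, (4, 6))
labels = torch.randint(0, 3, (4,))

emb_before = model.embedding.weight.detach().clone()
loss_fct(model(input_ids), labels).backward()    # 1. clean forward/backward
fgm.attack()                                     # 2. add r to the embedding
emb_attacked = model.embedding.weight.detach().clone()
loss_fct(model(input_ids), labels).backward()    # 3. adversarial gradients accumulate
fgm.restore()                                    # 4. restore the embedding
emb_restored = model.embedding.weight.detach().clone()
optimizer.step()                                 # 5. update with the accumulated gradients
optimizer.zero_grad()
```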

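## PGD
FGM computes the adversarial perturbation in a single step, so the perturbation it finds is not necessarily optimal. PGD improves on this by running several iterations (K steps, indexed by t) and approaching the optimal perturbation gradually.
For each x:
1. Compute the forward loss of x and backpropagate to get the gradients.
  For each step t:
  2. Compute r from the gradient of the embedding matrix and add it to the current embedding, which amounts to x+r.
  3. If t is not the last step, reset the gradients to 0, then run the forward and backward pass on the x+r from (2) to get new gradients.
  4. If t is the last step, restore the gradients from (1), then run the forward and backward pass on the final x+r so that its gradients accumulate onto those from (1).
5. Restore the embedding to its value in (1).
6. Update the parameters with the gradients from (4).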
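
A minimal sketch of this procedure in the same helper-class style. Two details come from the usual PGD formulation rather than from the steps above and should be read as assumptions: each step moves by `alpha * g / ||g||`, and the accumulated perturbation is projected back onto a ball of radius `epsilon` around the original embedding. `TinyClassifier`, `K=3`, `epsilon=1.0`, and `alpha=0.3` are hypothetical choices.

```python
import torch
import torch.nn as nn


class PGD:
    def __init__(self, model, emb_name="embedding", epsilon=1.0, alpha=0.3):
        self.model, self.emb_name = model, emb_name
        self.epsilon, self.alpha = epsilon, alpha
        self.emb_backup, self.grad_backup = {}, {}

    def attack(self, is_first_attack=False):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                if is_first_attack:
                    self.emb_backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.alpha * param.grad / norm)
                    param.data = self.project(name, param.data)

    def project(self, name, param_data):
        r = param_data - self.emb_backup[name]
        if torch.norm(r) > self.epsilon:                 # keep ||r|| <= epsilon
            r = self.epsilon * r / torch.norm(r)
        return self.emb_backup[name] + r

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.emb_backup:
                param.data = self.emb_backup[name]
        self.emb_backup = {}

    def backup_grad(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and param.grad is not None:
                self.grad_backup[name] = param.grad.clone()

    def restore_grad(self):
        for name, param in self.model.named_parameters():
            if name in self.grad_backup:
                param.grad = self.grad_backup[name]


class TinyClassifier(nn.Module):
    """Hypothetical stand-in model: mean of embeddings followed by a linear layer."""

    def __init__(self, vocab_size=50, emb_size=8, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.fc = nn.Linear(emb_size, num_classes)

    def forward(self, input_ids):
        return self.fc(self.embedding(input_ids).mean(dim=1))


torch.manual_seed(0)
model = TinyClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fct = nn.CrossEntropyLoss()
pgd, K = PGD(model), 3

input_ids = torch.randint(0, 50, (4, 6))
labels = torch.randint(0, 3, (4,))

emb_before = model.embedding.weight.detach().clone()
loss_fct(model(input_ids), labels).backward()          # 1. clean gradients
pgd.backup_grad()
max_perturbation = 0.0
for t in range(K):
    pgd.attack(is_first_attack=(t == 0))               # 2. x + r, projected
    r_norm = (model.embedding.weight.detach() - emb_before).norm().item()
    max_perturbation = max(max_perturbation, r_norm)
    if t != K - 1:
        model.zero_grad()                              # 3. intermediate step
    else:
        pgd.restore_grad()                             # 4. last step: back to clean gradients
    loss_fct(model(input_ids), labels).backward()
pgd.restore()                                          # 5. restore the embedding
emb_restored = model.embedding.weight.detach().clone()
optimizer.step()                                       # 6. update the parameters
optimizer.zero_grad()
```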