- 零基础入门NLP-[新闻文本分类](https://tianchi.aliyun.com/competition/entrance/531810/information)
- 文本分类的任务是将给定的文本划分到事先规定的文本类别。
- 解题思路：
    - 思路1：TF-IDF提取特征 + SVM分类
    - 思路2：训练FastText词向量并分类
    - 思路3：训练Word2Vec词向量 + TextCNN模型分类
    - 思路4：训练BERT词向量并分类
    - 思路5：BERT分类 + 统计特征的树模型

<img src="https://cdn.nlark.com/yuque/0/2021/png/1508544/1614251188699-25bf14fd-9602-4db6-914b-aef600981658.png"/>

1. 赛题中每个新闻包含的字符个数平均为1000个，还有一些新闻字符较长；
2. 赛题中新闻类别分布不均衡，科技类新闻样本量接近4万，星座类新闻(13)样本量不到1千；
3. 赛题总共包括7000-8000个字符。

# 一、赛题数据
- 赛题以新闻数据为赛题数据，数据集报名后可见并可下载。赛题数据为新闻文本，并按照字符级别进行匿名处理。整合划分出14个候选分类类别：财经、彩票、房产、股票、家居、教育、科技、社会、时尚、时政、体育、星座、游戏、娱乐的文本数据。
- 赛题数据由以下几个部分构成：训练集20w条样本，测试集A包括5w条样本，测试集B包括5w条样本。为了预防选手人工标注测试集的情况，我们将比赛数据的文本按照字符级别进行了匿名处理。处理后的赛题训练数据如下：

|label|text|
|---|---|
|6|57 44 66 56 2 3 3 37 5 41 9 57 44 47 45 33 13 63 58 31 17 47 0 1 1 69 26 60 62 15 21 12 49 18 38 20 50 23 57 44 45 33 25 28 47 22 52 35 30 14 24 69 54 7 48 19 11 51 16 43 26 34 53 27 64 8 4 42 36 46 65 69 29 39 15 37 57 44 45 33 69 54 7 25 40 35 30 66 56 47 55 69 61 10 60 42 36 46 65 37 5 41 32 67 6 59 47 0 1 1 68|

- 在数据集中标签的对应的关系如下：
`{'科技': 0, '股票': 1, '体育': 2, '娱乐': 3, '时政': 4, '社会': 5, '教育': 6, '财经': 7, '家居': 8, '游戏': 9, '房产': 10, '时尚': 11, '彩票': 12, '星座': 13}`
- 赛题数据来源为互联网上的新闻，通过收集并匿名处理得到。因此选手可以自行进行数据分析，可以充分发挥自己的特长来完成各种特征工程，不限制使用任何外部数据和模型。
- 数据列使用\t进行分割，Pandas读取数据的代码如下：
`train_df = pd.read_csv('../input/train_set.csv', sep='\t')`

# 二、评测标准
- 评价标准为类别f1_score的均值，选手提交结果与实际测试集的类别进行对比，结果越大越好。
    - 计算公式：$F1=2*\frac{(precision * recall)}{(preci)}$
- 可以通过sklearn完成f1_score计算：

In [1]:
from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
f1_score(y_true, y_pred, average='macro')

0.26666666666666666

# 三、机器学习
## 3.1 导入相关库

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC  # 可以使用其它机器学习模型
from sklearn.metrics import f1_score

## 3.2 读入数据

代码成功从天池实验室点击编辑按钮加载到DSW，加载好的代码会⾃动打开，默认在<b>download</b>
⽬录下<br>
1、点击左侧的【天池】按钮<br>
2、会出现【保存到天池】按钮和【添加数据源】模块，搜索新闻文本分类，点击数据集中的下载按钮即可<br>
###### （具体如下图所示）
<center><img
src="https://img.alicdn.com/imgextra/i4/O1CN01zsetgx1zaOBQbSDLs_!!6000000006730-2-
tps-616-589.png" width=60%></center>
核⼼问题2
数据集下载成功后，⻚⾯右上⻆会提示数据集下载成功，也会说名数据集存储位置，默认在
<b>download</b>⽬录下，如下图所示。
<center>
<img
src="https://img.alicdn.com/imgextra/i3/O1CN01uJzjgf1MLwg6jK7za_!!6000000001419-2-tps-
1409-377.png" width=60%>
<img
src="https://img.alicdn.com/imgextra/i1/O1CN01XQmAP027k1R811xls_!!6000000007834-2-
tps-857-465.png" width=60%>
</center>

In [3]:
train_df = pd.read_csv('./train_set.csv', sep='\t', nrows=8000)
test_df = pd.read_csv('./test_a.csv', sep='\t')

## 3.3 文本表示

In [4]:
tfidf = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1,1),
    max_features=1000)

tfidf.fit(pd.concat([train_df['text'], test_df['text']]))
train_word_features = tfidf.transform(train_df['text'])
test_word_features = tfidf.transform(test_df['text'])

## 3.4 训练模型

In [5]:
X_train = train_word_features
y_train = train_df['label']
X_test = test_word_features

KF = KFold(n_splits=5, random_state=7) 
clf = LinearSVC()
# 存储测试集预测结果 行数：len(X_test) ,列数：1列
test_pred = np.zeros((X_test.shape[0], 1), int)  
for KF_index, (train_index,valid_index) in enumerate(KF.split(X_train)):
    print('第', KF_index+1, '折交叉验证开始...')
    # 训练集划分
    x_train_, x_valid_ = X_train[train_index], X_train[valid_index]
    y_train_, y_valid_ = y_train[train_index], y_train[valid_index]
    # 模型构建
    clf.fit(x_train_, y_train_)
    # 模型预测
    val_pred = clf.predict(x_valid_)
    print("LinearSVC准确率为：",f1_score(y_valid_, val_pred, average='macro'))
    # 保存测试集预测结果
    test_pred = np.column_stack((test_pred, clf.predict(X_test)))  # 将矩阵按列合并
# 多数投票
preds = []
for i, test_list in enumerate(test_pred):
    preds.append(np.argmax(np.bincount(test_list)))
preds = np.array(preds)

第 1 折交叉验证开始...
LinearSVC准确率为： 0.8653196365489803
第 2 折交叉验证开始...
LinearSVC准确率为： 0.8635863117425737
第 3 折交叉验证开始...
LinearSVC准确率为： 0.8663983874833651
第 4 折交叉验证开始...
LinearSVC准确率为： 0.8810913779158512
第 5 折交叉验证开始...
LinearSVC准确率为： 0.8524364220474782


## 3.5 输出上传文件

In [6]:
submission = pd.read_csv('./test_a_sample_submit.csv')
submission['label'] = preds
submission.to_csv('./LinearSVC_submission.csv', index=False)

## 3.6 存在问题
- 虽然n元语法能够体现邻接词组的关系，但是它难以捕捉句子中距离较远的词和词之间的关系。
- 每个新闻平均字符个数较多，可能需要截断。
- 由于类别不均衡，会严重影响模型的精度。

# 四、FastText
FastText在文本分类任务上是优于TF-IDF的：
- FastText用单词的Embedding叠加获得的文档向量，将相似的句子分为一类；
- FastText学习到的Embedding空间维度比较低，可以快速进行训练。

## 4.1 导入相关库

In [7]:
import numpy as np
import pandas as pd
import fasttext
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

## 4.2 读入数据

In [8]:
train_df = pd.read_csv('./train_set.csv', sep='\t', nrows=8000)
test_df = pd.read_csv('./test_a.csv', sep='\t')

## 4.3 文本预处理

In [9]:
# 转换为FastText需要的格式
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)

## 4.4 训练模型
设计可以评估各个类别的性能度量函数。

In [10]:
# 各个类别性能度量的函数
def category_performance_measure(labels_right, labels_pred):
    text_labels = list(set(labels_right))
    text_pred_labels = list(set(labels_pred))
    
    TP = dict.fromkeys(text_labels,0)  #预测正确的各个类的数目
    TP_FP = dict.fromkeys(text_labels,0)   #测试数据集中各个类的数目
    TP_FN = dict.fromkeys(text_labels,0) #预测结果中各个类的数目
    
    # 计算TP等数量
    for i in range(0,len(labels_right)):
        TP_FP[labels_right[i]] += 1
        TP_FN[labels_pred[i]] += 1
        if labels_right[i] == labels_pred[i]:
            TP[labels_right[i]] += 1
    #计算准确率P，召回率R，F1值
    for key in TP_FP:
        P = float(TP[key]) / float(TP_FP[key] + 1)
        R = float(TP[key]) / float(TP_FN[key] + 1)
        F1 = P * R * 2 / (P + R) if (P + R) != 0 else 0
        print("%s:\t P:%f\t R:%f\t F1:%f" % (key,P,R,F1))

FastText是一种融合深度学习和机器学习各自优点的文本分类模型，速度非常快，但是模型结构简单，效果还算中上游。由于其使用词袋思想，语义信息获取有限。

In [11]:
X_train = train_df['text']
y_train = train_df['label']
X_test = test_df['text']
KF = KFold(n_splits=5, random_state=7, shuffle=True)
test_pred = np.zeros((X_test.shape[0], 1), int)  # 存储测试集预测结果 行数：len(X_test) ,列数：1列
for KF_index, (train_index,valid_index) in enumerate(KF.split(X_train)):
    print('第', KF_index+1, '折交叉验证开始...')
    # 转换为FastText需要的格式
    train_df[['text','label_ft']].iloc[train_index].to_csv('train_df.csv', header=None, index=False, sep='\t')
    # 模型构建
    model = fasttext.train_supervised('train_df.csv', lr=0.1, epoch=27, wordNgrams=5, 
                                      verbose=2, minCount=1, loss='hs')
    # 模型预测
    val_pred = [int(model.predict(x)[0][0].split('__')[-1]) for x in X_train.iloc[valid_index]]
    print('Fasttext准确率为：',f1_score(list(y_train.iloc[valid_index]), val_pred, average='macro'))
    category_performance_measure(list(y_train.iloc[valid_index]), val_pred)
    
    # 保存测试集预测结果
    test_pred_ = [int(model.predict(x)[0][0].split('__')[-1]) for x in X_test]
    test_pred = np.column_stack((test_pred, test_pred_))  # 将矩阵按列合并
# 取测试集中预测数量最多的数
preds = []
for i, test_list in enumerate(test_pred):
    preds.append(np.argmax(np.bincount(test_list)))
preds = np.array(preds) 

第 1 折交叉验证开始...


  'precision', 'predicted', average, warn_for)


Fasttext准确率为： 0.19447980178703447
0:	 P:0.762987	 R:0.661972	 F1:0.708899
1:	 P:0.942652	 R:0.659148	 F1:0.775811
2:	 P:0.955556	 R:0.510891	 F1:0.665806
3:	 P:0.802139	 R:0.434783	 F1:0.563910
4:	 P:0.000000	 R:0.000000	 F1:0.000000
5:	 P:0.000000	 R:0.000000	 F1:0.000000
6:	 P:0.000000	 R:0.000000	 F1:0.000000
7:	 P:0.000000	 R:0.000000	 F1:0.000000
8:	 P:0.000000	 R:0.000000	 F1:0.000000
9:	 P:0.000000	 R:0.000000	 F1:0.000000
10:	 P:0.000000	 R:0.000000	 F1:0.000000
11:	 P:0.000000	 R:0.000000	 F1:0.000000
12:	 P:0.000000	 R:0.000000	 F1:0.000000
13:	 P:0.000000	 R:0.000000	 F1:0.000000
第 2 折交叉验证开始...
Fasttext准确率为： 0.19857951330734983
0:	 P:0.816129	 R:0.598109	 F1:0.690314
1:	 P:0.909396	 R:0.705729	 F1:0.794721
2:	 P:0.969466	 R:0.577273	 F1:0.723647
3:	 P:0.810526	 R:0.431373	 F1:0.563071
4:	 P:0.000000	 R:0.000000	 F1:0.000000
5:	 P:0.000000	 R:0.000000	 F1:0.000000
6:	 P:0.000000	 R:0.000000	 F1:0.000000
7:	 P:0.000000	 R:0.000000	 F1:0.000000
8:	 P:0.000000	 R:0.000000	 F1:0.

## 4.5 输出上传文件

In [12]:
submission = pd.read_csv('./test_a_sample_submit.csv')
submission['label'] = preds
submission.to_csv('./LinearSVC_submission.csv', index=False)

# 五、BERT

<img src="https://cdn.nlark.com/yuque/0/2021/png/1508544/1614252341600-908e79ed-86c4-45b9-b661-9fd77d6c3681.png"/>
<img src="https://cdn.nlark.com/yuque/0/2021/png/1508544/1614252349214-4ca253ae-ce85-4655-8a44-63fd715df607.png"/>

## 5.1 导入相关库

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics
import transformers
import time
import torch
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertModel, BertConfig

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 5.2 读入数据

In [None]:
train_df = pd.read_csv('./train_set.csv', sep='\t')
test_df = pd.read_csv('./test_a.csv', sep='\t')
test_df['label'] = 0

In [None]:
tokenizer = BertTokenizer.from_pretrained('./emb/bert-mini/vocab.txt')
tokenizer.encode_plus("2967 6758 339 2021 1854",
        add_special_tokens=True,
        max_length=20,
        truncation=True)
# token_type_ids 通常第一个句子全部标记为0，第二个句子全部标记为1。
# attention_mask padding的地方为0，未padding的地方为1。

In [None]:
class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.comment_text = self.data.text
        self.targets = self.data.label
        self.max_len = max_len

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, index):
        comment_text = self.comment_text[index]

        inputs = self.tokenizer.encode_plus(
            comment_text,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

In [None]:
# Creating the dataset and dataloader for the neural network
MAX_LEN = 256
train_size = 0.8
train_dataset = train_df.sample(frac=train_size,random_state=7)
valid_dataset = train_df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(train_df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("VALID Dataset: {}".format(valid_dataset.shape))
print("TEST Dataset: {}".format(test_df.shape))

train_set = CustomDataset(train_dataset, tokenizer, MAX_LEN)
valid_set = CustomDataset(valid_dataset, tokenizer, MAX_LEN)
test_set = CustomDataset(test_df, tokenizer, MAX_LEN)

In [None]:
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 16
TEST_BATCH_SIZE = 16
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True}

valid_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True}

test_params = {'batch_size': TEST_BATCH_SIZE,
                'shuffle': False}

train_loader = DataLoader(train_set, **train_params)
valid_loader = DataLoader(valid_set, **valid_params)
test_loader = DataLoader(test_set, **test_params)

## 5.3 模型创建
预训练BERT以及相关代码下载地址：  
链接: https://pan.baidu.com/s/1zd6wN7elGgp1NyuzYKpvGQ 提取码: tmp5

In [None]:
# Creating the customized model, by adding a drop out and a dense layer on top of distil bert to get the final output for the model. 
class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.config = BertConfig.from_pretrained('./emb/bert-mini/bert_config.json', output_hidden_states=True)
        self.l1 = BertModel.from_pretrained('./emb/bert-mini/pytorch_model.bin', config=self.config)
        self.bilstm1 = torch.nn.LSTM(512, 64, 1, bidirectional=True)
        self.l2 = torch.nn.Linear(128, 64)
        self.a1 = torch.nn.ReLU()
        self.l3 = torch.nn.Dropout(0.3)
        self.l4 = torch.nn.Linear(64, 14)
    
    def forward(self, ids, mask, token_type_ids):
        sequence_output, pooler_output, hidden_states= self.l1(ids, attention_mask=mask, token_type_ids=token_type_ids)
        # [bs, 200, 256]  [bs,256]
        bs = len(sequence_output)
        h12 = hidden_states[-1][:,0].view(1,bs,256)
        h11 = hidden_states[-2][:,0].view(1,bs,256)
        concat_hidden = torch.cat((h12,h11),2)
        x, _ = self.bilstm1(concat_hidden)
        x = self.l2(x.view(bs,128))
        x = self.a1(x)
        x = self.l3(x)
        output = self.l4(x)
        return output

net = BERTClass()
net.to(device)

In [None]:
# 超参数设置
lr, num_epochs = 1e-5, 30
criterion = torch.nn.CrossEntropyLoss()  # 选择损失函数
optimizer = torch.optim.Adam(net.parameters(), lr=lr)  # 选择优化器

## 5.4 训练模型

In [None]:
def evaluate_accuracy(data_iter, net, device=torch.device('cpu')):
    """Evaluate accuracy of a model on the given data set."""
    acc_sum, n = torch.tensor([0], dtype=torch.float32,device=device), 0
    y_pred_, y_true_ = [], []
    for data in tqdm(data_iter):
        # If device is the GPU, copy the data to the GPU.
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)
        net.eval()
        y_hat_ = net(ids, mask, token_type_ids)
        with torch.no_grad():
            targets = targets.long()
            # [[0.2 ,0.4 ,0.5 ,0.6 ,0.8] ,[ 0.1,0.2 ,0.4 ,0.3 ,0.1]] => [ 4 , 2 ]
            acc_sum += torch.sum((torch.argmax(y_hat_, dim=1) == targets))
            y_pred_.extend(torch.argmax(y_hat_, dim=1).cpu().numpy().tolist())
            y_true_.extend(targets.cpu().numpy().tolist())
            n += targets.shape[0]
    valid_f1 = metrics.f1_score(y_true_, y_pred_, average='macro')
    return acc_sum.item()/n, valid_f1

In [None]:
def train(epoch,train_iter, test_iter, criterion, num_epochs, optimizer, device):
    print('training on', device)
    net.to(device)
    best_test_f1 = 0
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # 设置学习率下降策略
#     scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5, eta_min=2e-06)  # 余弦退火
    for epoch in range(num_epochs):
        train_l_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
        train_acc_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
        n, start = 0, time.time()
        y_pred, y_true = [], []
        for data in tqdm(train_iter):
            net.train()
            optimizer.zero_grad()
            ids = data['ids'].to(device, dtype=torch.long)
            mask = data['mask'].to(device, dtype=torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype = torch.float)
            y_hat = net(ids, mask, token_type_ids)
            loss = criterion(y_hat, targets.long())
            loss.backward()
            optimizer.step()
            
            with torch.no_grad():
                targets = targets.long()
                train_l_sum += loss.float()
                train_acc_sum += (torch.sum((torch.argmax(y_hat, dim=1) == targets))).float()
                y_pred.extend(torch.argmax(y_hat, dim=1).cpu().numpy().tolist())
                y_true.extend(targets.cpu().numpy().tolist())
                n += targets.shape[0]
        valid_acc, valid_f1 = evaluate_accuracy(test_iter, net, device)
        train_acc = train_acc_sum / n
        train_f1 = metrics.f1_score(y_true, y_pred, average='macro')
        print('epoch %d, loss %.4f, train acc %.3f, valid acc %.3f, '
              'train f1 %.3f, valid f1 %.3f, time %.1f sec'
              % (epoch + 1, train_l_sum / n, train_acc, valid_acc,
                 train_f1, valid_f1, time.time() - start))
        if valid_f1 > best_test_f1:
            print('find best! save at model/best.pth')
            best_test_f1 = valid_f1
            torch.save(net.state_dict(), 'model/best.pth')
        scheduler.step()  # 更新学习率

In [None]:
train(net,train_loader, valid_loader, criterion, num_epochs, optimizer, device)

In [None]:
def model_predict(net, test_iter):
    # 预测模型
    preds_list = []
    print('加载最优模型')
    net.load_state_dict(torch.load('model/best.pth'))
    net = net.to(device)
    print('inference测试集')
    with torch.no_grad():
        for data in tqdm(test_iter):
            ids = data['ids'].to(device, dtype=torch.long)
            mask = data['mask'].to(device, dtype=torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            batch_preds = list(net(ids, mask, token_type_ids).argmax(dim=1).cpu().numpy())
            for preds in batch_preds:
                preds_list.append(preds)           
    return preds_list
preds_list = model_predict(net, test_loader)

In [None]:
submission = pd.read_csv('./test_a_sample_submit.csv')
submission['label'] = preds_list
submission.to_csv('./submission.csv', index=False)

# 练习题
对代码进行修改，分别计算验证集每个类的准确率、召回率和F1 score。  

<img src="https://cdn.nlark.com/yuque/0/2021/png/1508544/1614252224981-4d1f5ca6-4cd7-4885-bb83-aa4180774a21.png"/>