<h1 align=center> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Application System Architecture - Artificial Intelligence Module </h1>

<h1 align=center> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A3: Natural Language Understanding - Text Classification</h1>

## 0. Introduction

In this notebook, we use Keras to build a multilayer perceptron (MLP) and CNN models for document classification.

## 1. Data Preprocessing

In [1]:
import os
import jieba
jieba.load_userdict('nlp/lib/dict.txt.big')

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.404 seconds.
Prefix dict has been built successfully.


In [2]:
# The dataset is organized into groups, with one file per category
root = 'nlp/dataset/'
# groupA: the smaller dataset; groupB: the larger dataset
group = 'groupA'

texts = []
categories = []

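for (path, dirs, fnames) in os.walk(root + group):
    for fname in fnames:
        # The category name is the file name without its first four and last four characters
        category = fname[4:-4]
        print("category: " + category)
        with open(os.path.join(path, fname), 'r', encoding='gbk', errors='ignore') as f:
            # Each line of the file is one sample
            item = f.read().split('\n')
            for i in item:
                # Segment the line with jieba and join the tokens with spaces
                seg = jieba.cut(i)
                seg = ' '.join(seg)
                texts.append(seg)
                categories.append(category)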

category: Energy
category: Electronics
category: Communication
category: Mine
category: Transport
category: Medical
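
The category names printed above come from the slice `fname[4:-4]`. The sketch below shows what that slice does. The file name `C01-Energy.txt` is a hypothetical example (a four-character prefix plus a `.txt` extension is an assumption; the real file names are not shown in this notebook).

```python
def category_from_filename(fname: str) -> str:
    """Mirror the notebook's slice: drop the first four and last four characters."""
    return fname[4:-4]


# Hypothetical file name: 4-character prefix + category + ".txt"
example = "C01-Energy.txt"
print(category_from_filename(example))
```

The slice only works when every file follows the same naming pattern, so a file with a longer prefix or a different extension would yield a wrong label.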


## 2. Importing the Packages for Building the Neural Network Models

In [3]:
# Load the TensorBoard extension for model visualization
%load_ext tensorboard

# Set the Keras backend. The kernel must be restarted after changing this setting
import os
import shutil
os.environ["KERAS_BACKEND"] = "tensorflow"

# Import the Keras packages used below
import tensorflow.keras as keras
from datetime import datetime
import tensorboard

# Remove earlier log files so that TensorBoard displays only the current runs.
# shutil.rmtree works on every platform; the shell command `rm -rf` is not available on Windows.
shutil.rmtree('./logs/', ignore_errors=True)


## 3. Getting the Dataset

In [4]:
from sklearn.model_selection import train_test_split                # train/test split
from sklearn.preprocessing import LabelEncoder                      # class names to integer ids
from tensorflow.keras.preprocessing.text import Tokenizer           # text encoding
from tensorflow.keras.preprocessing.sequence import pad_sequences   # zero padding
from tensorflow.keras.utils import to_categorical                   # one-hot encoding

# Set the maximum feature length according to the selected group
if group == 'groupA':
    n_sent = 8000
elif group == 'groupB':
    n_sent = 50200
else:
    raise ValueError("You set a wrong group!")

# Total number of categories
n_label = len(set(categories))

# One-hot encode the labels, e.g. [[1.,0.,0.],[0.,1.,0.],[0.,0.,1.]]
encoder = LabelEncoder()
encoded_cate = encoder.fit_transform(categories)
label_cate = to_categorical(encoded_cate)

# Filter out punctuation and special characters
tokenizer = Tokenizer(filters='。，、；：’“ ”—【】（）·★', split=' ')
tokenizer.fit_on_texts(texts)
# word2id mapping for every word
vocabs = tokenizer.word_index
# Texts encoded as lists of word ids
seqs = tokenizer.texts_to_sequences(texts)
# Binary bag-of-words (multi-hot) matrix: one row per text, one column per word id
oh_seqs = tokenizer.sequences_to_matrix(seqs)
# Force every row to n_sent columns; by default pad_sequences pads and truncates on the left
pad_oh_seqs = pad_sequences(oh_seqs, maxlen=n_sent)

# Split the result into training and test sets
x_train, x_test, y_train, y_test = train_test_split(pad_oh_seqs, label_cate, test_size=0.1, random_state=0)
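
To make the three encoding steps in this cell concrete, here is a minimal pure-Python sketch of what they produce. It only imitates the behavior of `LabelEncoder` + `to_categorical`, `sequences_to_matrix` (default binary mode) and `pad_sequences` (default left padding and left truncation); the helper names and the toy inputs are assumptions made for illustration.

```python
def one_hot_labels(labels):
    """Sorted class names -> integer ids -> one-hot rows."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for name in labels:
        row = [0.0] * len(classes)
        row[index[name]] = 1.0
        rows.append(row)
    return classes, rows


def multi_hot(seq, num_words):
    """Binary bag-of-words row: column j is 1.0 if word id j occurs in seq."""
    row = [0.0] * num_words
    for word_id in seq:
        row[word_id] = 1.0
    return row


def pad_pre(row, maxlen):
    """Left-pad with zeros, or drop leading entries when the row is too long."""
    row = row[-maxlen:]
    return [0] * (maxlen - len(row)) + row


classes, y = one_hot_labels(["Mine", "Energy", "Mine", "Medical"])
x = multi_hot([3, 1, 3], num_words=5)   # word ids 1 and 3 are present
print(classes, y)
print(x)
print(pad_pre(x, 3), pad_pre([1, 2], 4))
```

Note that when a row is longer than `maxlen`, the leading columns are the ones that get dropped. In a bag-of-words matrix the leading columns belong to the lowest word ids, which the Keras `Tokenizer` assigns to the most frequent words.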

## 3. Building the MLP Model

In [5]:
from tensorflow.keras import Input
from tensorflow.keras.layers import Flatten, Dense, Dropout

model = keras.Sequential(
    [
        # The Input layer alone defines the input shape
        Input(shape=(n_sent,)),
        Dense(512, activation='relu'),
        Dropout(0.5),
        Dense(n_label, activation='softmax'),
    ]
)

model.summary()

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Define the Keras TensorBoard callback.
logdir="logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

model.fit(x_train, y_train, epochs=10, callbacks=[tensorboard_callback])
# Use a model-specific file name so that later cells do not overwrite this file
model.save("mlp_model.keras")

# %tensorboard --logdir logs

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 512)               4096512   
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 3078      
Total params: 4,099,590
Trainable params: 4,099,590
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
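
The parameter counts in the summary above can be checked by hand: a `Dense` layer has one weight per input-output pair plus one bias per output. The short sketch below redoes that arithmetic with the sizes shown in this notebook (8000 inputs, 512 hidden units, 6 classes); the helper name `dense_params` is an assumption made for illustration.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out


n_sent, hidden, n_label = 8000, 512, 6

first = dense_params(n_sent, hidden)     # dense
second = dense_params(hidden, n_label)   # dense_1
print(first, second, first + second)
```

`Dropout` adds no parameters, so the total is the sum of the two `Dense` layers.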


## 4. Evaluating the MLP Model

In [6]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

Test loss: 1.526979923248291
Test accuracy: 0.5
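
The accuracy reported by `model.evaluate` compares, for each sample, the position of the largest softmax output with the position of the 1 in the one-hot label. A minimal sketch of that computation follows; the arrays are made-up illustration values, not outputs of this model.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the one-hot label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Illustrative values only
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
print(categorical_accuracy(y_true, y_pred))
```

As a point of reference, with six classes a uniformly random guess would be right about one time in six on a balanced test set.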


## 5. Building the CNN Model

In [7]:
from tensorflow.keras.layers import Embedding, Convolution1D, MaxPool1D, BatchNormalization

EMB_SIZE = 300

model = keras.Sequential(
    [
        Input(shape=(n_sent,), dtype='float32'),
        Embedding(len(vocabs) + 1, EMB_SIZE),
        Convolution1D(filters=256, kernel_size=3, padding='same'),
        MaxPool1D(3, 3, padding='same'),
        Convolution1D(128, 3, padding='same'),
        MaxPool1D(3, 3, padding='same'),
        Convolution1D(64, 3, padding='same'),
        Flatten(),
        Dropout(0.1),
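        # Batch normalization: normalizes the incoming activations (not the weights) over each mini-batch to roughly zero mean and unit variance, which helps against vanishing gradients caused by extreme values and lets the model converge faster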
        BatchNormalization(),
        Dense(256, activation='relu'),
        Dropout(0.1),
        Dense(n_label, activation='softmax')
    ]
)

model.summary()

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Define the Keras TensorBoard callback.
logdir="logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

model.fit(x_train, y_train, epochs=10, callbacks=[tensorboard_callback])
# Use a model-specific file name so that the other cells do not overwrite this file
model.save("cnn_model.keras")

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8000, 300)         3942900   
_________________________________________________________________
conv1d (Conv1D)              (None, 8000, 256)         230656    
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 2667, 256)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 2667, 128)         98432     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 889, 128)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 889, 64)           24640     
_________________________________________________________________
flatten (Flatten)            (None, 56896)            
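
The shapes and parameter counts in this summary follow from two small formulas: with `padding='same'`, a pooling layer of stride 3 outputs `ceil(length / 3)` steps, and a 1D convolution has `kernel_size * in_channels * out_channels` weights plus one bias per filter. The sketch below redoes the arithmetic with the numbers from the summary; the helper names are assumptions made for illustration.

```python
import math


def pool_same(length: int, stride: int) -> int:
    """Output length of a pooling layer with padding='same'."""
    return math.ceil(length / stride)


def conv1d_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights plus biases of a Conv1D layer."""
    return kernel * c_in * c_out + c_out


after_pool1 = pool_same(8000, 3)
after_pool2 = pool_same(after_pool1, 3)
flat = after_pool2 * 64                 # Flatten: steps * channels

conv_counts = [
    conv1d_params(3, 300, 256),         # conv1d
    conv1d_params(3, 256, 128),         # conv1d_1
    conv1d_params(3, 128, 64),          # conv1d_2
]
vocab_rows = 3942900 // 300             # rows of the embedding table
print(after_pool1, after_pool2, flat, conv_counts, vocab_rows)
```

The embedding row count equals `len(vocabs) + 1`, so the recorded run had a vocabulary of 13142 words.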

## 6. Evaluating the CNN Model

In [8]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

Test loss: 1.212170124053955
Test accuracy: 0.5833333134651184
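
One thing that may limit this result: an `Embedding` layer looks up one vector per integer token id, but `x_train` here is the 0/1 matrix built in the dataset section, so only rows 0 and 1 of the embedding table are ever used. The sketch below shows the alternative input format, padded sequences of word ids, in plain Python. It is an illustration under assumptions, not what the notebook does: the helper names and toy tokens are hypothetical, and ids are assigned in order of first appearance, whereas the Keras `Tokenizer` ranks words by frequency.

```python
def build_vocab(tokenized_texts):
    """word -> id; ids start at 1 so that 0 stays reserved for padding."""
    vocab = {}
    for tokens in tokenized_texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode(tokens, vocab, maxlen):
    """Map tokens to ids, keep the last maxlen ids, left-pad with zeros."""
    ids = [vocab[t] for t in tokens if t in vocab][-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


# Hypothetical, already segmented texts
docs = [["energy", "price", "rises"], ["network", "equipment", "price"]]
vocab = build_vocab(docs)
encoded = [encode(doc, vocab, maxlen=4) for doc in docs]
print(vocab)
print(encoded)
```

With the notebook's own objects, the corresponding change would be to pass `seqs` rather than `oh_seqs` to `pad_sequences`; whether that improves accuracy on this dataset is untested here.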


## 7. Building the TextCNN Model

In [9]:
from tensorflow.keras.layers import Concatenate
from tensorflow.keras import Model

# Input layer
fin = Input(shape=(n_sent,), dtype='float32')

# Word-embedding layer
embed = Embedding(len(vocabs) + 1, EMB_SIZE)(fin)

# 1D convolution layer 1
cnn1 = Convolution1D(filters=2, kernel_size=3, padding='same', strides=1, activation='relu')(embed)
# 1D max-pooling layer 1
mp1 = MaxPool1D(pool_size=4)(cnn1)

# 1D convolution layer 2
cnn2 = Convolution1D(2, 4, padding='same', strides=1, activation='relu')(mp1)
# 1D max-pooling layer 2
mp2 = MaxPool1D(pool_size=4)(cnn2)

# 1D convolution layer 3
cnn3 = Convolution1D(2, 5, padding='same', strides=1, activation='relu')(mp2)
# 1D max-pooling layer 3
mp3 = MaxPool1D(pool_size=4)(cnn3)

# Concatenate the three max-pooling outputs along the sequence axis
cnn = Concatenate(axis=1)([mp1, mp2, mp3])
# Flatten the feature map into a vector
flat = Flatten()(cnn)
# Dropout: randomly zero 20% of the units during training
dropout = Dropout(0.2)(flat)

# Fully connected output layer
fout = Dense(n_label, activation='softmax')(dropout)
# Assemble the model from its input and output tensors
model = Model(inputs=fin, outputs=fout)

model.summary()

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Define the Keras TensorBoard callback.
logdir="logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

model.fit(x_train, y_train, epochs=10, callbacks=[tensorboard_callback])
# Use a model-specific file name so that the other cells do not overwrite this file
model.save("textcnn_model.keras")

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 8000)]       0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 8000, 300)    3942900     input_3[0][0]                    
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 8000, 2)      1802        embedding_1[0][0]                
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D)  (None, 2000, 2)      0           conv1d_3[0][0]                   
______________________________________________________________________________________________
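
The summary above is cut off after the first pooling layer. The remaining shapes can be derived from the code in this cell: `MaxPool1D(pool_size=4)` uses its default `padding='valid'` with a stride equal to the pool size, and `Concatenate(axis=1)` adds up the sequence lengths. The sketch below does that arithmetic; everything past `max_pooling1d_2` is computed from the code, not copied from recorded output, and the helper names are assumptions.

```python
def pool_valid(length: int, pool_size: int) -> int:
    """Output length of a pooling layer with padding='valid', stride == pool_size."""
    return (length - pool_size) // pool_size + 1


def conv1d_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights plus biases of a Conv1D layer."""
    return kernel * c_in * c_out + c_out


len1 = pool_valid(8000, 4)      # mp1
len2 = pool_valid(len1, 4)      # mp2
len3 = pool_valid(len2, 4)      # mp3

concat_len = len1 + len2 + len3         # Concatenate(axis=1)
flat = concat_len * 2                   # 2 filters per step
dense_out = flat * 6 + 6                # final Dense layer with 6 classes

first_conv = conv1d_params(3, 300, 2)   # conv1d_3 in the summary
print(len1, len2, len3, concat_len, flat, dense_out, first_conv)
```

Because each convolution here consumes the previous pooling output, the three branches are stacked rather than parallel. The TextCNN architecture as commonly described applies convolutions with several kernel sizes side by side to the same embedding output before concatenating.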

## 8. Evaluating the TextCNN Model

In [10]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

Test loss: 1.776894450187683
Test accuracy: 0.25
