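## 1. Task Introduction
> The basic task and its formal statement
---
Text classification is one of the most fundamental tasks in natural language processing: a classifier assigns a given text to one of a set of predefined classes. Typical examples are sentiment classification, spam detection, and movie-review classification. The task can be formalized as follows:
$$
\begin{aligned}
\text{Texts:}\quad & X = (x_1,x_2,\dots,x_n) \\
\text{Class labels:}\quad & Y = (y_1,y_2,\dots,y_n)\\
\text{Model:}\quad & f: x_i \mapsto y_i, \hspace{1em} i = 1,2, \dots,n
\end{aligned}
$$
This article uses Kaggle's movie-review sentiment analysis task as the running example.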

## 2. Environment Setup

In [1]:
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics 

import nltk
from nltk.corpus import stopwords

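## 3. Data Preprocessing
> The raw data needs to be cleaned
---

The processing steps are roughly as follows:

1. Remove HTML tags
2. Remove punctuation and other non-letter characters
3. Split the text into words
4. Remove stopwords
5. Reassemble the remaining words into a new sentence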

In [2]:
# 0. Load the data first
file_path = '../data/IMDB/labeledTrainData.tsv'
df = pd.read_csv(file_path,sep='\t',escapechar='\\')
print('Number of samples:{}'.format(len(df)))
df.head()

Number of samples:25000


|   | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | "The Classic War of the Worlds" by Timothy Hin... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [3]:
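eng_stopwords = set(stopwords.words('english'))  # English stopwords; a set makes membership tests fast

def text_clean(text):
    text = BeautifulSoup(text,'html.parser').get_text()  # strip HTML tags
    text = re.sub(r'[^a-zA-Z]',' ',text)  # replace every non-letter character (punctuation, digits) with a space
    words = text.lower().split()  # lowercase everything, then split on whitespace
    words = [w for w in words if w not in eng_stopwords]  # drop stopwords
    return ' '.join(words)  # join the remaining words back into one string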

df['clean_review'] = df.review.apply(text_clean)
df.head()

|   | id | sentiment | review | clean_review |
|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | stuff going moment mj started listening music ... |
| 1 | 2381_9 | 1 | "The Classic War of the Worlds" by Timothy Hin... | classic war worlds timothy hines entertaining ... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | film starts manager nicholas bell giving welco... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | must assumed praised film greatest filmed oper... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | superbly trashy wondrously unpretentious explo... |
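
To make the effect of each cleaning step visible, here is a minimal sketch that runs the same five steps on a single made-up string. It is deliberately dependency-free, so two things are assumptions rather than the notebook's method: `TOY_STOPWORDS` is a hypothetical miniature stopword list (the notebook uses NLTK's full English list), and a crude regular expression stands in for `BeautifulSoup` when stripping tags.

```python
import re

# Hypothetical miniature stopword list, for illustration only.
TOY_STOPWORDS = {"this", "is", "a", "the", "was", "it", "and", "i"}

def toy_text_clean(text):
    text = re.sub(r"<[^>]+>", " ", text)     # step 1: crude tag stripping (stand-in for BeautifulSoup)
    text = re.sub(r"[^a-zA-Z]", " ", text)   # step 2: non-letter characters become spaces
    words = text.lower().split()             # step 3: lowercase and split on whitespace
    words = [w for w in words if w not in TOY_STOPWORDS]  # step 4: drop stopwords
    return " ".join(words)                   # step 5: rebuild a single string

sample = "This movie was <b>GREAT</b>!<br />I watched it 3 times."
cleaned = toy_text_clean(sample)
print(cleaned)  # movie great watched times
```

Note that the digit `3` disappears along with the punctuation, because the regular expression keeps letters only.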


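## 4. Text Feature Representation
> Vectorize the text and extract features. Approaches are either discrete or distributed; this section focuses on the discrete ones.
---
The main discrete representations are:
 1. Bag of words (BoW): represents a text by its word counts, ignoring grammatical structure and word order
 2. One-hot encoding: represents a text as a vector whose length is the vocabulary size, with 1 for every word that occurs and 0 otherwise
 3. n-gram: a refinement of the bag-of-words model that uses sequences of n consecutive words as features, then vectorizes the text in the same way as bag of words
 4. TF-IDF: represents a text by the TF-IDF weight of each word
 
<span style="color:red">Note: all of the methods above can be regarded as variants of the bag-of-words model.</span>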
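
The first three representations can be illustrated on a toy corpus with plain Python. This is only a sketch: the two documents and the helper names (`bow_vector`, `one_hot_vector`, `bigrams`) are made up for illustration and are not part of the notebook.

```python
from collections import Counter

docs = ["good movie good plot", "bad movie"]
tokenized = [d.split() for d in docs]

# Shared vocabulary, sorted so that every vector uses the same column order.
vocab = sorted({w for tokens in tokenized for w in tokens})

def bow_vector(tokens):
    """Bag of words: how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def one_hot_vector(tokens):
    """One-hot style: 1 if the vocabulary word occurs at all, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

def bigrams(tokens):
    """Consecutive word pairs, which are then counted like single words."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bow = [bow_vector(t) for t in tokenized]
binary = [one_hot_vector(t) for t in tokenized]
first_bigrams = bigrams(tokenized[0])

print(vocab)          # ['bad', 'good', 'movie', 'plot']
print(bow)            # [[0, 2, 1, 1], [1, 0, 1, 0]]
print(binary)         # [[0, 1, 1, 1], [1, 0, 1, 0]]
print(first_bigrams)  # ['good movie', 'movie good', 'good plot']
```

The only difference between the count and one-hot vectors here is the repeated word `good`, which is why all of these can be viewed as bag-of-words variants.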

In [4]:
# 1. Use raw term counts as text features
vectorizer_feq = CountVectorizer(max_features=5000)  # keep the 5000 most frequent terms
train_freq = vectorizer_feq.fit_transform(df.clean_review).toarray()
print("Shape of the document-term matrix (term counts):",train_freq.shape)

# 2. Use bigrams as text features
vectorizer_bigram = CountVectorizer(ngram_range=(2,2),max_features=1000,token_pattern=r'\b\w+\b', min_df=1)
# analyze = vectorizer_bigram.build_analyzer()
# print("Bigram example:",analyze(df.clean_review[0]))  # bigrams of the first review
train_bigram = vectorizer_bigram.fit_transform(df.clean_review).toarray()
print("Shape of the document-term matrix (bigrams):",train_bigram.shape)

# 3. Use TF-IDF weights as text features
vectorizer_tfidf = TfidfVectorizer(max_features=5000)
train_tfidf = vectorizer_tfidf.fit_transform(df.clean_review).toarray()

print("Shape of the document-term matrix (TF-IDF):",train_tfidf.shape)

Shape of the document-term matrix (term counts): (25000, 5000)
Shape of the document-term matrix (bigrams): (25000, 1000)
Shape of the document-term matrix (TF-IDF): (25000, 5000)
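
To see what the TF-IDF weighting does to a count matrix, the computation can be reproduced by hand with NumPy. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), that is, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The toy count matrix is made up for illustration.

```python
import numpy as np

# Toy document-term count matrix: 2 documents, 3 terms (illustrative values).
counts = np.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0]])

n_docs = counts.shape[0]
doc_freq = np.count_nonzero(counts, axis=0)        # number of documents containing each term

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = counts * idf                                # weight the raw counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize every document row

print(idf)
print(tfidf)
```

The middle term occurs in both documents, so its idf is exactly 1 and it is down-weighted relative to the two terms that each occur in only one document.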


## 5. Helper Functions
> A mini-batch generator, the softmax function, a prediction function, an evaluation function, and a one-hot encoder

In [20]:
# Mini-batch generator
def batch_generator(data, batch_size, shuffle=True):
    X, Y = data
    n_samples = X.shape[0]
    indices = np.arange(n_samples)
    if shuffle:
        np.random.shuffle(indices)  # shuffle the sample order

    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        batch_idx = indices[start:end]

        yield X[batch_idx], Y[batch_idx]
        
# Row-wise softmax
def softmax(scores):
    # subtracting the row maximum leaves the result unchanged but avoids overflow in np.exp
    exp_scores = np.exp(scores - np.max(scores,axis=1,keepdims=True))
    return exp_scores / np.sum(exp_scores,axis=1,keepdims=True)

# Prediction: index of the most probable class for each row of x
def predict(w,b,x):
    scores = np.dot(x,w.T) + b
    probs = softmax(scores)
    
    return np.argmax(probs,axis=1).reshape(-1,1)
    
# Evaluation: mean cross-entropy loss over the validation set
# (precision, recall and F1 are not computed here)
def evaluate(w,val_x,val_y):
    val_loss = []
    val_gen = batch_generator((val_x,val_y),batch_size=32,shuffle=False)
    for batch_x,batch_y in val_gen:
        scores = np.dot(batch_x,w.T)
        prob = softmax(scores)

        y_one_hot = one_hot(batch_y)
        # cross-entropy loss of this batch
        loss = - (1.0 / len(batch_x)) * np.sum(y_one_hot * np.log(prob))
        val_loss.append(loss)
        
    return np.mean(val_loss)
    
    
# One-hot encode integer class labels
def one_hot(batch_y,n_classes=2):
    n = batch_y.shape[0]
    encoded = np.zeros((n, n_classes))  # renamed so the local array no longer shadows the function
    encoded[np.arange(n), batch_y.T] = 1
    return encoded
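
A softmax computed directly from `np.exp(scores)` can overflow when scores are large, which is plausible with thousands of count features. A common remedy is to subtract the row maximum before exponentiating, which does not change the result mathematically. The following is a minimal standalone sketch; the name `stable_softmax` and the test scores are made up for illustration.

```python
import numpy as np

def stable_softmax(scores):
    # Shift each row so its largest entry is 0; exp() then never exceeds 1.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# The first row would overflow in a naive implementation, since exp(1000) is inf.
scores = np.array([[1000.0, 1001.0],
                   [0.0, 0.0]])
probs = stable_softmax(scores)
print(probs)
```

Each row still sums to 1, and the row of equal scores yields equal probabilities.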
    

## 6. Building the Classifier
> Softmax regression is used as the classifier here
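
For reference, the loss and gradient used by a softmax-regression classifier without a bias term can be written out as below. This is the standard derivation rather than anything specific to this notebook, and the notation is introduced here as an assumption: $m$ is the batch size, $K$ the number of classes, $w_k$ the weight row of class $k$, $\tilde{Y}$ the one-hot label matrix, $P$ the matrix of predicted probabilities, $X_b$ the $m \times d$ matrix stacking the feature vectors of the batch, and $\eta$ the learning rate.

$$
\begin{aligned}
p_{ik} &= \frac{\exp(w_k^\top x_i)}{\sum_{j=1}^{K}\exp(w_j^\top x_i)} \\
L(W) &= -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} \tilde{y}_{ik}\,\log p_{ik} \\
\frac{\partial L}{\partial W} &= -\frac{1}{m}\,(\tilde{Y}-P)^\top X_b \\
W &\leftarrow W - \eta\,\frac{\partial L}{\partial W}
\end{aligned}
$$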

In [21]:
def logistic_regression(x, y, n_classes=2, lr = 0.001, val_split=0.2, batch_size=64, epochs=1000, early_stop=None):
    n_samples,n_features = x.shape

    w = np.random.rand(n_classes, n_features)
    train_all_loss = []
    val_all_loss = []
    not_improved = 0
    best_val_loss = np.inf
    best_w = w

    # random train/validation split
    indices = np.random.permutation(n_samples)
    split = int(n_samples * (1-val_split))
    training_idx = indices[:split]
    valid_idx = indices[split:]  # start at split so that no sample is dropped
    
    train_x = x[training_idx]
    train_y = y[training_idx]
    
    valid_x = x[valid_idx]
    valid_y = y[valid_idx]
    
    for epoch in range(epochs):
        # use the batch_size argument instead of a hard-coded value
        training_gen = batch_generator((train_x, train_y), batch_size=batch_size)
        train_loss = []
        for batch_x, batch_y in training_gen:
            scores = np.dot(batch_x, w.T)
            prob = softmax(scores)

            y_one_hot = one_hot(batch_y,n_classes)
            # cross-entropy loss
            loss = - (1.0 / len(batch_x)) * np.sum(y_one_hot * np.log(prob))
            train_loss.append(loss)

            # gradient descent step
            dw = -(1.0/len(batch_x)) * np.dot((y_one_hot - prob).T, batch_x)
            w = w - lr * dw

        val_loss = evaluate(w, valid_x, valid_y)

        print("Epoch = {0},the train loss = {1:.4f}, the val loss = {2:.4f}".format(
            epoch, np.mean(train_loss), val_loss))

        train_all_loss.append(np.mean(train_loss))
        val_all_loss.append(val_loss)

        # always track the best weights, so best_w is valid even when early_stop is None
        if val_loss <= best_val_loss:
            best_val_loss = val_loss
            best_w = w
            not_improved = 0
        else:
            not_improved += 1

        if early_stop is not None and not_improved > early_stop:
            print("Validation performance didn't improve for {} epochs. "
                  "Training stops.".format(early_stop))
            break
                
    return best_w,train_all_loss,val_all_loss
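
A quick way to gain confidence in a hand-written gradient like the one in the training loop is a finite-difference check on a tiny random problem. The sketch below is standalone: `softmax_rows` and `loss_and_grad` are hypothetical helpers that mirror the cross-entropy loss and gradient expressions used in the loop, and the data is random.

```python
import numpy as np

def softmax_rows(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def loss_and_grad(w, x, y_one_hot):
    """Mean cross-entropy loss and its analytic gradient with respect to w."""
    prob = softmax_rows(x @ w.T)
    loss = -np.sum(y_one_hot * np.log(prob)) / len(x)
    grad = -(y_one_hot - prob).T @ x / len(x)
    return loss, grad

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))          # 8 samples, 5 features
y = rng.integers(0, 2, size=8)       # binary labels
y_one_hot = np.eye(2)[y]
w = rng.normal(size=(2, 5))

_, analytic = loss_and_grad(w, x, y_one_hot)

# Central finite differences, one weight at a time.
numeric = np.zeros_like(w)
eps = 1e-6
for idx in np.ndindex(*w.shape):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[idx] += eps
    w_minus[idx] -= eps
    numeric[idx] = (loss_and_grad(w_plus, x, y_one_hot)[0]
                    - loss_and_grad(w_minus, x, y_one_hot)[0]) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print("max |analytic - numeric| =", max_diff)
```

If the two gradients agree to within a small tolerance, the sign and scaling of `dw` are consistent with the loss being minimized.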
        

## 7. Training

### 7.1 Training with BOW Features

In [None]:
label = df['sentiment'].values
w,train_all_loss,val_all_loss = logistic_regression(train_freq,label,early_stop=10)

Epoch = 0,the train loss = 1.9454, the val loss = 1.8243
Epoch = 1,the train loss = 1.7821, the val loss = 1.7350
Epoch = 2,the train loss = 1.7172, the val loss = 1.6801
Epoch = 3,the train loss = 1.6640, the val loss = 1.6335
Epoch = 4,the train loss = 1.6172, the val loss = 1.5910
Epoch = 5,the train loss = 1.5757, the val loss = 1.5515
Epoch = 6,the train loss = 1.5354, the val loss = 1.5150
Epoch = 7,the train loss = 1.5000, the val loss = 1.4809
Epoch = 8,the train loss = 1.4644, the val loss = 1.4492
Epoch = 9,the train loss = 1.4325, the val loss = 1.4196
Epoch = 10,the train loss = 1.4019, the val loss = 1.3917
Epoch = 11,the train loss = 1.3731, the val loss = 1.3658
Epoch = 12,the train loss = 1.3477, the val loss = 1.3414
Epoch = 13,the train loss = 1.3213, the val loss = 1.3185
Epoch = 14,the train loss = 1.2970, the val loss = 1.2968
Epoch = 15,the train loss = 1.2754, the val loss = 1.2764
Epoch = 16,the train loss = 1.2537, the val loss = 1.2573
Epoch = 17,the train los

Epoch = 141,the train loss = 0.6021, the val loss = 0.6896
Epoch = 142,the train loss = 0.5998, the val loss = 0.6882
Epoch = 143,the train loss = 0.5991, the val loss = 0.6869
Epoch = 144,the train loss = 0.5968, the val loss = 0.6855
Epoch = 145,the train loss = 0.5950, the val loss = 0.6841
Epoch = 146,the train loss = 0.5930, the val loss = 0.6828
Epoch = 147,the train loss = 0.5924, the val loss = 0.6815
Epoch = 148,the train loss = 0.5900, the val loss = 0.6802
Epoch = 149,the train loss = 0.5883, the val loss = 0.6789
Epoch = 150,the train loss = 0.5873, the val loss = 0.6776
Epoch = 151,the train loss = 0.5852, the val loss = 0.6763
Epoch = 152,the train loss = 0.5832, the val loss = 0.6750
Epoch = 153,the train loss = 0.5813, the val loss = 0.6737
Epoch = 154,the train loss = 0.5806, the val loss = 0.6725
Epoch = 155,the train loss = 0.5790, the val loss = 0.6713
Epoch = 156,the train loss = 0.5773, the val loss = 0.6700
Epoch = 157,the train loss = 0.5756, the val loss = 0.66