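# Text Classification with Machine Learning

## 1. Machine Learning Models

Machine learning is the study of computer algorithms that improve automatically through experience. Training a model on historical data corresponds to the way people generalize from experience, and using that model to predict on new data corresponds to the way people apply the rules they have learned to new problems.

## 2. Text Representation Methods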

### 2.1 One-hot

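If the vocabulary contains n distinct characters, each character can be converted into an n-dimensional sparse vector that holds a 1 at the index of that character and 0 everywhere else.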
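As a minimal sketch of the idea, the snippet below builds one-hot vectors for the characters of a short string. The sample text and the helper name `one_hot` are illustrative assumptions, not part of the original material; the vector dimension here equals the number of distinct characters.

```python
def one_hot(text):
    """Return (vocab, vectors): one vector per character of text."""
    vocab = sorted(set(text))                      # distinct characters in a fixed order
    index = {ch: i for i, ch in enumerate(vocab)}  # character -> position in the vector
    vectors = []
    for ch in text:
        vec = [0] * len(vocab)
        vec[index[ch]] = 1                         # exactly one 1 per vector
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = one_hot("abca")
print(vocab)
for ch, vec in zip("abca", vectors):
    print(ch, vec)
```

Both occurrences of `a` map to the same vector, and the vectors grow with the vocabulary, which is why this representation becomes very sparse for real text.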

### 2.2 Bag of Words

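+ Bag of Words, also known as Count Vectors, represents each document by the number of times each character or word occurs in it.
+ The occurrences of each token are counted directly, and the count becomes the value of the corresponding feature.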

```python
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus).toarray()
```

```
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
```
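
To make the columns of the count matrix easier to read, the sketch below rebuilds the same counts in plain Python. It assumes the tokenization that `CountVectorizer` applies by default (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically); the helper name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def bag_of_words(docs):
    """Count token occurrences per document over a shared, sorted vocabulary."""
    # Assumption: lowercase the text and keep tokens of 2+ word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])  # one column per vocabulary entry
    return vocab, rows

vocab, rows = bag_of_words(corpus)
print(vocab)
for row in rows:
    print(row)
```

Each column corresponds to one vocabulary entry in alphabetical order (`and`, `document`, `first`, and so on), so the 2 in the second row is the count of `document` in the second sentence.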

### 2.3 N-gram

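N-gram features are similar to Count Vectors, except that runs of adjacent words are also combined into new tokens, which are then counted in the same way.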
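
As a minimal illustration, the sketch below counts the unigrams and bigrams of one sentence in plain Python. The helper name `ngram_counts` and the sample sentence are assumptions made for this example.

```python
from collections import Counter

def ngram_counts(tokens, n_min=1, n_max=2):
    """Count every n-gram with n_min <= n <= n_max; n-grams are space-joined strings."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

tokens = "this document is the second document".split()
counts = ngram_counts(tokens)
print(counts["document"])         # count of a single word
print(counts["second document"])  # count of a pair of adjacent words
print(len(counts))                # number of distinct unigrams and bigrams
```

In the scikit-learn vectorizers used later in this document, the same behavior is controlled by the `ngram_range` parameter.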

### 2.4 TF-IDF

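+ TF(t) = (number of times term t occurs in the current document) / (total number of terms in the current document).
+ IDF(t) = $\log_e$(total number of documents / number of documents that contain term t).
+ The TF-IDF weight of a term is the product TF(t) × IDF(t).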
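
The two formulas can be checked by hand with a short sketch. It follows the definitions above literally and takes the TF-IDF weight to be the product TF(t) × IDF(t); the toy corpus and the function names `tf`, `idf`, and `tf_idf` are assumptions for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers differ from this plain version.

```python
import math

docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
]

def tf(term, doc):
    """Occurrences of term in doc divided by the number of terms in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of docs / number of docs containing term).

    The term must occur in at least one document, otherwise this divides by zero.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf("document", docs[1]))            # 2 occurrences out of 6 terms
print(idf("document", docs))              # ln(3 / 2): two of three documents contain it
print(tf_idf("document", docs[1], docs))
print(tf_idf("this", docs[0], docs))      # "this" is in every document, so IDF = ln(1) = 0
```

A term that appears in every document gets an IDF of zero, which is how TF-IDF suppresses words that carry no information about the class of a document.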

## 3. Text Classification with Machine Learning

### 3.1 Count Vectors + RidgeClassifier

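```python
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# Load the first 15,000 rows of the tab-separated training set
train_df = pd.read_csv('./datasets/train_set.csv', sep='\t', nrows=15000)

# Count Vectors features, as the section title states
vectorizer = CountVectorizer(ngram_range=(1,3), max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])

# Train on the first 10,000 rows and validate on the remaining 5,000
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
```

The source lists `0.8721598830546126` as the output of this cell. That value is identical to the TF-IDF result in section 3.2 because the original cell built its features with `TfidfVectorizer`, so it should not be read as the Count Vectors score.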


### 3.2 TF-IDF + RidgeClassifier

```python
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# Load the first 15,000 rows of the tab-separated training set
train_df = pd.read_csv('./datasets/train_set.csv', sep='\t', nrows=15000)

# TF-IDF features over 1- to 3-grams, keeping the 3,000 most frequent
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

# Train on the first 10,000 rows and validate on the remaining 5,000
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
```

```
0.8721598830546126
```


## 4. Exercises for This Chapter

#### 1. Try changing the TF-IDF parameters and check the validation score

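```python
# Reuses train_df and the imports from the previous cell
for a in [(1,2),(1,3),(1,4)]:            # n-gram range
    for b in [1000,2000,4000,5000]:      # vocabulary size limit
        tfidf = TfidfVectorizer(ngram_range=a, max_features=b)
        train_test = tfidf.fit_transform(train_df['text'])

        clf = RidgeClassifier()
        clf.fit(train_test[:10000], train_df['label'].values[:10000])

        val_pred = clf.predict(train_test[10000:])
        print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
```

```
0.8288900927279318
0.8584782097110735
0.8794233135546486
0.8864473704687724
0.8270776630718544
0.8603842642428617
0.8753274805998447
0.8850817067811825
0.8255484428068071
0.8620590192116346
0.8747722106348722
0.8847558213777788
```

The scores are printed in loop order: the first four lines are for `ngram_range=(1,2)` with `max_features` of 1000, 2000, 4000, and 5000, followed by four lines each for `(1,3)` and `(1,4)`. For every n-gram range the macro F1 score rises as `max_features` grows, while widening the n-gram range changes it only slightly.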


#### 2. Try other machine learning models for training and validation

+ Logistic regression

```python
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score

# TF-IDF features
train_df = pd.read_csv('./datasets/train_set.csv', sep='\t', nrows=15000)

tfidf = TfidfVectorizer(ngram_range=(1,4), max_features=20000)
X = tfidf.fit_transform(train_df['text'])
y = train_df['label']

# Random 80/20 split; the model is fit on the training part only
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
logistic_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
logistic_model.fit(train_X, train_y)

# Score on the held-out part, which the model has not seen
val_pred = logistic_model.predict(test_X)
print(f1_score(test_y, val_pred, average='macro'))
```

The source reported `0.9084787116171941` for this cell, but that figure was computed on `X[12000:]`. Because `train_test_split` shuffles the rows, most of those rows were also in the training set, so the figure overstates held-out performance. The cell above scores on `test_X` and `test_y` instead.
