### 1. Loading the dataset


- Download the **20news-bydate.tar.gz** dataset from [20 Newsgroups](http://www.qwone.com/~jason/20Newsgroups/) and extract it into the current directory.

- Load the dataset with [sklearn.datasets.load_files](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html).

In [1]:
from sklearn import datasets

# Load the training and test splits
twenty_train = datasets.load_files("./20news-bydate/20news-bydate-train")
twenty_test = datasets.load_files("./20news-bydate/20news-bydate-test")

# Check the number of categories and documents in each split
print(len(twenty_train.target_names), len(twenty_train.data), len(twenty_test.target_names), len(twenty_test.data))

20 11314 20 7532


The output shows that:
- the training set contains 20 categories and 11314 news posts
- the test set contains 20 categories and 7532 news posts
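
`load_files` infers the class labels from the sub-folder names under the directory it is given. The following is a minimal sketch of that behaviour on a throwaway directory; the category names, file names, and texts are made up for illustration, and `encoding="utf-8"` is an assumption added here (the notebook loads raw bytes and lets the vectorizer decode them).

```python
import os
import tempfile

from sklearn.datasets import load_files

# Hypothetical toy corpus: category name -> list of document texts.
toy_corpus = {
    "rec.autos": ["The engine needs new oil.", "My car has a flat tire."],
    "sci.med": ["The doctor prescribed a new drug."],
}

with tempfile.TemporaryDirectory() as root:
    # One sub-folder per category, one file per document.
    for category, texts in toy_corpus.items():
        os.makedirs(os.path.join(root, category))
        for i, text in enumerate(texts):
            path = os.path.join(root, category, "%d.txt" % i)
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)

    # With an encoding, load_files returns str documents instead of bytes.
    toy = load_files(root, encoding="utf-8")

print(toy.target_names)           # folder names, sorted
print(len(toy.data), toy.target)  # documents and their integer labels
```

Each integer in `toy.target` is an index into `toy.target_names`, which is how the notebook later maps predictions back to newsgroup names.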

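### 2. Text feature extraction

Text is unstructured data. It generally has to be converted into a structured form before a machine learning algorithm can be applied to classify it.

A common approach is to convert the text into a "document-term matrix". The entries of the matrix can be term frequencies, TF-IDF values, and so on.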
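
To make the idea concrete, here is a small sketch that builds a document-term matrix of raw term frequencies in plain Python. The two documents are invented, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

docs = ["the cat sat", "the dog sat on the cat"]

# Vocabulary: every distinct term, sorted; one matrix column per term.
vocabulary = sorted({term for doc in docs for term in doc.split()})

# One row per document; each entry is the term's count in that document.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Rows correspond to documents and columns to terms, so a corpus of 11314 documents produces a matrix with 11314 rows.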

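#### Counting term frequencies

To turn raw text into features that a classification algorithm can use, we first apply the **bag-of-words** method. It produces one feature vector per document, and these vectors can also be used to measure similarity between documents.

The bag-of-words method is based on simple term-frequency counts: count how often each term occurs in a document and store the counts as a vector (vectorization). Scikit-learn's **[CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)** does this counting efficiently.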

In [2]:
# Count term frequencies
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(min_df=1, stop_words="english", decode_error='ignore')
X_train_counts = count_vect.fit_transform(twenty_train.data)
num_samples, num_features = X_train_counts.shape
print("#sample: %d, #feature: %d" % (num_samples, num_features))

#sample: 11314, #feature: 129783


The output shows that the 11314 documents yield 129783 distinct terms (features) once English stop words are removed.
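
The feature count is the size of the vocabulary that `CountVectorizer` learned. A small sketch on two invented sentences shows how stop-word removal shapes that vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["The car is fast", "The doctor is in the car"]

toy_vect = CountVectorizer(stop_words="english")
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each retained term to its column index.
print(sorted(toy_vect.vocabulary_))
print(toy_counts.shape)      # (number of documents, number of terms)
print(toy_counts.toarray())  # dense view, only sensible for tiny inputs
```

The result of `fit_transform` is a sparse matrix; calling `toarray()` on the full 11314 x 129783 matrix would be very memory-hungry.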

#### Feature extraction with TF-IDF

In [3]:
# Reweight the term counts with TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
num_samples, num_features = X_train_tfidf.shape
print("#sample: %d, #feature: %d" % (num_samples, num_features))

#sample: 11314, #feature: 129783
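
The shape is unchanged because TF-IDF only reweights the existing counts. As a rough sketch of the weighting, the snippet below reproduces by hand what `TfidfTransformer` does with its default settings as I understand them (smoothed IDF, then L2 normalization of each row). The term counts are invented.

```python
import math

# Hypothetical term counts: one dict per document.
counts = [
    {"car": 2, "engine": 1},
    {"car": 1, "doctor": 3},
    {"doctor": 1},
]
n_docs = len(counts)
terms = sorted({t for doc in counts for t in doc})

# Document frequency: number of documents that contain each term.
df = {t: sum(1 for doc in counts if t in doc) for t in terms}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in terms}

tfidf = []
for doc in counts:
    row = [doc.get(t, 0) * idf[t] for t in terms]
    norm = math.sqrt(sum(v * v for v in row))
    tfidf.append([v / norm for v in row])  # scale each row to unit length

for row in tfidf:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer documents (here `engine`) receive a larger IDF than terms that occur in many (`car`), so they contribute more to the document vector.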


### 3. Training a classifier

Train a naive Bayes classifier on the TF-IDF features, then use it to classify new documents.

In [4]:
# Train the classifier
from time import time
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()

t = time()
clf = clf.fit(X_train_tfidf, twenty_train.target)
print("training time: %f" % round(time()-t, 3))

# Predict on new samples (use transform, not fit_transform, so the fitted vocabulary is reused)
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

pred = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, pred):
    print("%r => %s" %(doc, twenty_train.target_names[category]))

training time: 0.227000
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
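
Note that new documents go through `transform`, not `fit_transform`, so they are mapped onto the vocabulary and IDF weights learned from the training data. A toy sketch of why this matters (documents and the two classes are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "engine oil car",
    "car tire wheel",
    "doctor drug patient",
    "patient nurse doctor",
]
train_labels = [0, 0, 1, 1]  # hypothetical classes: 0 = autos, 1 = medicine

vect = CountVectorizer()
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(vect.fit_transform(train_docs))
nb = MultinomialNB().fit(X_train, train_labels)

# transform() reuses the fitted vocabulary and IDF weights; words never
# seen during training (such as "spaceship") are silently dropped.
new_docs = ["the doctor saw a patient", "spaceship car engine"]
X_new = tfidf.transform(vect.transform(new_docs))
pred_new = nb.predict(X_new)

print(X_train.shape, X_new.shape)
print(list(pred_new))
```

Both matrices have the same number of columns, which is what the classifier requires at prediction time.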


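### 4. Evaluating classification performance

#### Building a pipeline

In [5]:
# Build a pipeline: vectorizer, TF-IDF transformer, classifier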
from sklearn.pipeline import Pipeline
text_clf_NB = Pipeline([('vect', CountVectorizer(stop_words="english", decode_error='ignore')),
                        ('tfidf', TfidfTransformer()),
                        ('clf', MultinomialNB()),
                    ])

t = time()
text_clf_NB = text_clf_NB.fit(twenty_train.data, twenty_train.target)
print("training time: %f" % round(time()-t, 3))

training time: 4.384000
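
A `Pipeline` chains the same three steps that were run by hand earlier, calling `fit_transform` on each transformer during `fit` and `transform` during `predict`. The sketch below checks this equivalence on an invented toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["engine oil car", "car tire wheel", "doctor drug patient", "nurse doctor patient"]
labels = [0, 0, 1, 1]  # hypothetical classes
new_docs = ["car wheel", "patient drug"]

# Manual version: fit each step in turn.
vect = CountVectorizer()
tfidf = TfidfTransformer()
nb = MultinomialNB().fit(tfidf.fit_transform(vect.fit_transform(docs)), labels)
manual_pred = nb.predict(tfidf.transform(vect.transform(new_docs)))

# Pipeline version: one fit call, one predict call.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
]).fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

print(list(manual_pred), list(pipe_pred))
```

The step names (`vect`, `tfidf`, `clf`) are also what the `clf__C` style parameter names refer to when a pipeline is tuned with grid search.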


#### Classification accuracy on the test set

In [6]:
import numpy as np
from sklearn.metrics import accuracy_score

docs_test = twenty_test.data

t = time()
pred_NB = text_clf_NB.predict(docs_test)
print("testing time: %f" % round(time()-t, 3))

accuracy = accuracy_score(twenty_test.target, pred_NB)
print("Accuracy:", accuracy)

testing time: 2.289000
Accuracy: 0.816914498141


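With the naive Bayes classifier, accuracy on the test set is 81.7%. Training takes about 4.4 seconds (including feature extraction) and testing about 2.3 seconds, which is a good result.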
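
Accuracy here is simply the fraction of test documents whose predicted label equals the true label. A tiny hand computation on made-up labels:

```python
# Hypothetical true and predicted labels for five documents.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
print(accuracy)
```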

Next, let's see how a **support vector machine with a linear kernel** performs.

In [7]:
from sklearn.svm import SVC
text_clf_SVM = Pipeline([('vect', CountVectorizer(stop_words='english', decode_error='ignore')),
                            ('tfidf', TfidfTransformer()),
                            ('clf', SVC(kernel='linear', C=1.0, random_state=2)),
                        ])

t0 = time()
text_clf_SVM = text_clf_SVM.fit(twenty_train.data, twenty_train.target)
print("training time: %f" % round(time()-t0, 3))

t1 = time()
pred_SVM = text_clf_SVM.predict(docs_test)
print("testing time: %f" % round(time()-t1, 3))

accuracy = accuracy_score(twenty_test.target, pred_SVM)
print("Accuracy:", accuracy)

training time: 149.937000
testing time: 58.933000
Accuracy: 0.83497079129


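The SVM improves accuracy somewhat (83.5% versus 81.7%), but both training and testing take far longer than with the naive Bayes classifier.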
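
If the training time is a concern, one option that is not part of this notebook is `LinearSVC`, which is also a linear SVM but is built on a different solver than `SVC(kernel='linear')`. It uses a squared hinge loss and a one-vs-rest scheme by default, so its results would not be identical to the run above, and no timing or accuracy is claimed here. A sketch on invented toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = ["engine oil car", "car tire wheel", "doctor drug patient", "nurse doctor patient"]
labels = [0, 0, 1, 1]  # hypothetical classes

# Same pipeline layout as above, with LinearSVC swapped in as the classifier.
toy_linear_svm = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC(C=1.0, random_state=2)),
]).fit(docs, labels)

toy_pred = toy_linear_svm.predict(["car engine", "doctor patient"])
print(list(toy_pred))
```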

Next, let's see how **k-nearest neighbors** performs.

In [8]:
from sklearn.neighbors import KNeighborsClassifier
text_clf_kNN = Pipeline([('vect', CountVectorizer(stop_words='english', decode_error='ignore')),
                            ('tfidf', TfidfTransformer()),
                            ('clf', KNeighborsClassifier(n_neighbors=5)),
                        ])

t0 = time()
text_clf_kNN = text_clf_kNN.fit(twenty_train.data, twenty_train.target)
print("training time: %f" % round(time()-t0, 3))

t1 = time()
pred_kNN = text_clf_kNN.predict(docs_test)
print("testing time: %f" % round(time()-t1, 3))

accuracy = accuracy_score(twenty_test.target, pred_kNN)
print("Accuracy:", accuracy)

training time: 3.905000
testing time: 22.343000
Accuracy: 0.675783324482


kNN trains much faster than the SVM, but its accuracy is only about 67.6%.
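
The timing pattern follows from how kNN works: "training" mostly stores the data, while every prediction has to compare the new document with the training documents. The following brute-force sketch illustrates the idea in plain Python; it is not scikit-learn's implementation, and the vectors are invented.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training vectors."""
    # Brute force: Euclidean distance from x to every training vector.
    distances = []
    for row, label in zip(train_X, train_y):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, x)))
        distances.append((d, label))
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-dimensional feature vectors with two classes.
train_X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_y = [0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, [0.8, 0.2]))
print(knn_predict(train_X, train_y, [0.1, 0.8]))
```

The cost of one prediction grows with the number of training vectors and with their dimensionality, which is consistent with the long testing time measured above.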

`scikit-learn` also provides finer-grained evaluation metrics, such as per-class precision, recall, and F-score.
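
For a single class, precision is the share of documents predicted as that class that really belong to it, recall is the share of documents of that class that were found, and the F1 score is their harmonic mean. A hand-rolled sketch on made-up labels (the helper name `per_class_metrics` is hypothetical):

```python
def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels for six documents and three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

for cls in (0, 1, 2):
    print(cls, per_class_metrics(y_true, y_pred, cls))
```

The `support` column in scikit-learn's report is the number of true documents of each class.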

Below, we look at these more detailed metrics for each classifier.

In [9]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, pred_NB, target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.80      0.69      0.74       319
           comp.graphics       0.78      0.72      0.75       389
 comp.os.ms-windows.misc       0.79      0.72      0.75       394
comp.sys.ibm.pc.hardware       0.68      0.81      0.74       392
   comp.sys.mac.hardware       0.86      0.81      0.84       385
          comp.windows.x       0.87      0.78      0.82       395
            misc.forsale       0.87      0.80      0.83       390
               rec.autos       0.88      0.91      0.90       396
         rec.motorcycles       0.93      0.96      0.95       398
      rec.sport.baseball       0.91      0.92      0.92       397
        rec.sport.hockey       0.88      0.98      0.93       399
               sci.crypt       0.75      0.96      0.84       396
         sci.electronics       0.84      0.65      0.74       393
                 sci.med       0.92      0.79      0.85       396
         

The naive Bayes classifier does fairly well on both precision and recall on the test set.

In [10]:
print(metrics.classification_report(twenty_test.target, pred_SVM, target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.84      0.75      0.79       319
           comp.graphics       0.68      0.81      0.74       389
 comp.os.ms-windows.misc       0.78      0.72      0.75       394
comp.sys.ibm.pc.hardware       0.69      0.80      0.74       392
   comp.sys.mac.hardware       0.83      0.84      0.84       385
          comp.windows.x       0.83      0.74      0.78       395
            misc.forsale       0.79      0.90      0.84       390
               rec.autos       0.88      0.89      0.89       396
         rec.motorcycles       0.97      0.94      0.95       398
      rec.sport.baseball       0.91      0.93      0.92       397
        rec.sport.hockey       0.97      0.95      0.96       399
               sci.crypt       0.96      0.89      0.93       396
         sci.electronics       0.72      0.80      0.75       393
                 sci.med       0.87      0.85      0.86       396
         

The SVM classifier's precision and recall on the test set are only slightly better than those of the naive Bayes classifier.

In [11]:
print(metrics.classification_report(twenty_test.target, pred_kNN, target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.62      0.75      0.68       319
           comp.graphics       0.46      0.64      0.54       389
 comp.os.ms-windows.misc       0.52      0.59      0.55       394
comp.sys.ibm.pc.hardware       0.50      0.59      0.54       392
   comp.sys.mac.hardware       0.57      0.57      0.57       385
          comp.windows.x       0.67      0.56      0.61       395
            misc.forsale       0.49      0.45      0.47       390
               rec.autos       0.73      0.70      0.72       396
         rec.motorcycles       0.82      0.85      0.83       398
      rec.sport.baseball       0.74      0.75      0.74       397
        rec.sport.hockey       0.81      0.87      0.84       399
               sci.crypt       0.78      0.83      0.80       396
         sci.electronics       0.70      0.55      0.62       393
                 sci.med       0.79      0.55      0.65       396
         

The kNN classifier does relatively poorly on both precision and recall on the test set.

### Parameter tuning with grid search

In [13]:
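# GridSearchCV lives in sklearn.model_selection; the old sklearn.grid_search module was removed in scikit-learn 0.20
from sklearn.model_selection import GridSearchCV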
parameters_SVM = {
                    'clf__kernel':('linear', 'rbf'),
                    'clf__C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
                }

gs_clf_SVM = GridSearchCV(text_clf_SVM, parameters_SVM, n_jobs = -1)
gs_clf_SVM = gs_clf_SVM.fit(twenty_train.data,twenty_train.target)
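
Grid search fits the whole pipeline once per parameter combination and per cross-validation fold, so it helps to count the combinations first. A small sketch using `ParameterGrid` with the same grid (written with lists; the variable names are illustrative):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "clf__kernel": ["linear", "rbf"],
    "clf__C": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}

candidates = ParameterGrid(param_grid)
n_candidates = len(candidates)
print(n_candidates)         # 2 kernels x 10 values of C
print(list(candidates)[0])  # each candidate is a dict of parameter settings
```

Each candidate is refit once per fold (the default number of folds depends on the scikit-learn version), so with an SVM pipeline the search is considerably more expensive than the single fit timed earlier.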

In [20]:
gs_clf_SVM.best_score_
# gs_clf_SVM.best_estimator_.get_params()  

0.90825525897118609

In [15]:
pred = gs_clf_SVM.predict(twenty_test.data)
print("Accuracy: ", accuracy_score(twenty_test.target, pred))

Accuracy:  0.837095061073
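
The two numbers measure different things: `best_score_` is the mean cross-validated score of the best candidate, computed on held-out folds of the training data, while the accuracy above is measured on the separate test set, so the gap between them is not surprising. A toy sketch of the attributes involved (the corpus, the small grid, and `cv=2` are choices made for this illustration only):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

docs = [
    "car engine oil", "car tire wheel", "engine car wheel", "tire oil car",
    "doctor patient drug", "nurse doctor patient", "patient drug nurse", "drug doctor nurse",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical classes

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SVC()),
])
search = GridSearchCV(
    pipe,
    {"clf__kernel": ["linear", "rbf"], "clf__C": [1, 10]},
    cv=2,  # two folds only because the toy corpus is tiny
)
search.fit(docs, labels)

print(search.best_params_)  # the winning parameter combination
print(search.best_score_)   # mean cross-validated accuracy of that combination
```

After fitting, `search.predict` uses the best pipeline refit on all of the training data, which is what `gs_clf_SVM.predict` does in the cell above.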
