TF-IDF

TF-IDF is a statistical method widely used in natural language processing (NLP) to evaluate how important a single word is to a document.

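TF stands for term frequency. For a given document, it is the number of times a word appears in the document divided by the total number of words in that document.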

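IDF stands for inverse document frequency. It is computed by dividing the total number of documents by the number of documents that contain the word, then taking the logarithm.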
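The two definitions can be written compactly as follows. The symbols are assumptions introduced here for illustration: $n_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and the denominator of the IDF is the number of documents containing $t$. The definition does not fix the base of the logarithm.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}},
\qquad
\mathrm{idf}(t) = \log \frac{N}{\left|\{\, d : t \in d \,\}\right|}
```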

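For example, given 10,000 documents, if "Python" appears in only 10 of them, then IDF = log10(10000/10) = 3; a word such as "I" (我) that appears in every document has IDF = log10(10000/10000) = 0.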

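Multiplying the term frequency by the inverse document frequency gives the importance of the word in the document. A word's importance therefore grows in proportion to how often it appears in the document, and shrinks as the word becomes more common across the corpus.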
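The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn computes its weights: the function names, the toy corpus, the base-10 logarithm, and the choice to return 0 for a term that appears in no document are all assumptions made for this example.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log10(total docs / docs containing `term`)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # Convention chosen for this sketch: an unseen term gets weight 0.
        return 0.0
    return math.log10(len(corpus) / n_containing)


def tf_idf(term, doc_tokens, corpus):
    """Importance of `term` in one document: TF multiplied by IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus: each document is already split into tokens.
corpus = [
    ["i", "like", "python"],
    ["i", "like", "tea"],
    ["i", "read", "books"],
    ["i", "drink", "tea"],
]

# "i" occurs in every document, so its IDF (and TF-IDF) is zero.
print(tf_idf("i", corpus[0], corpus))
# "python" occurs in a single document, so it receives a positive weight.
print(tf_idf("python", corpus[0], corpus))
```

A word that occurs in every document contributes nothing to the score, however often it is repeated, while a word confined to a few documents is weighted up.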

### Loading the data

In [7]:
from sklearn.datasets import load_files

In [8]:
news_train = load_files('data/379/train')

In [9]:
news_train.target

array([18, 13,  1, ..., 14, 15,  4])

In [10]:
news_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [11]:
news_train.target_names[news_train.target[0]]

'talk.politics.misc'

In [12]:
news_train.data[0]



### Transforming the data

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

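# load_files returns raw bytes, so tell the vectorizer how to decode them
vect = TfidfVectorizer(encoding='latin-1')
# Learn the vocabulary and IDF weights, then build the sparse document-term matrix
X_train = vect.fit_transform(news_train.data)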

In [15]:
import numpy as np
X_train.shape

(13180, 130274)

In [16]:
print(X_train[0,:])

  (0, 56813)	0.014332663773643272
  (0, 45689)	0.08373343949755
  (0, 46084)	0.08109733529789522
  (0, 125882)	0.0873157704840211
  (0, 50150)	0.020654313721609956
  (0, 87702)	0.04643235585055511
  (0, 33334)	0.1025405658189532
  (0, 111805)	0.014332663773643272
  (0, 115086)	0.07712947008554673
  (0, 99721)	0.05039080222145882
  (0, 109314)	0.11722978151290249
  (0, 89035)	0.17133856150856897
  (0, 117388)	0.06879042701689829
  (0, 66565)	0.03373591167198109
  (0, 120409)	0.0422379709863369
  (0, 62408)	0.3669560132378484
  (0, 36885)	0.17911945780714125
  (0, 113268)	0.045722894553426
  (0, 36634)	0.10887320610155209
  (0, 95990)	0.19525865433679337
  (0, 67717)	0.15584408886218354
  (0, 124607)	0.03283897072404772
  (0, 59746)	0.052472908888225735
  (0, 115068)	0.03850593696943872
  (0, 89566)	0.03321158087168379
  :	:
  (0, 115836)	0.05445585339392558
  (0, 28463)	0.07743709638232414
  (0, 79247)	0.04065946034093589
  (0, 92698)	0.03574458886447535
  (0, 59404)	0.04178054660832682

### Training the model

In [17]:
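from sklearn.naive_bayes import MultinomialNB
y_train = news_train.target
# alpha is the additive smoothing parameter; a small value means very light smoothing
clf = MultinomialNB(alpha=0.0001)
clf.fit(X_train, y_train)
# Accuracy on the data the model was fitted on
train_score = clf.score(X_train, y_train)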

In [18]:
train_score

0.9978755690440061

In [19]:
news_test = load_files('data/379/test')

In [20]:
X_test = vect.transform(news_test.data)
y_test = news_test.target

In [21]:
pred = clf.predict(X_test[0])

In [22]:
pred

array([7])

In [23]:
news_test.target[0]

7
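The vectorize, fit and predict steps above can also be bundled into a single estimator, so the test documents always go through the vocabulary learned on the training set. The following is a minimal sketch using `make_pipeline`; the toy documents and labels are hypothetical and only stand in for `news_train.data` and `news_train.target`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set with two classes.
train_docs = [
    "the pitcher threw the ball to the catcher",
    "the team scored a home run in the ninth inning",
    "the rocket launched a satellite into orbit",
    "astronauts aboard the shuttle observed the moon",
]
train_labels = ["sport", "sport", "space", "space"]

# The pipeline calls fit_transform on the vectorizer, then fit on the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.0001))
model.fit(train_docs, train_labels)

# At prediction time the pipeline applies transform (not fit_transform) first.
predicted = model.predict(["the rocket reached orbit"])
print(predicted[0])
```

Keeping both steps in one object makes it harder to refit the vectorizer on the test data by accident, which would produce a feature matrix that no longer matches the trained classifier.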

### Evaluating the model

In [24]:
clf.score(X_test, y_test)

0.9088172804532578

In [26]:
from sklearn.metrics import classification_report

pred = clf.predict(X_test)

print(classification_report(y_test, pred, target_names=news_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.90      0.91      0.91       245
           comp.graphics       0.80      0.90      0.85       298
 comp.os.ms-windows.misc       0.82      0.79      0.80       292
comp.sys.ibm.pc.hardware       0.81      0.80      0.81       301
   comp.sys.mac.hardware       0.90      0.91      0.91       256
          comp.windows.x       0.88      0.88      0.88       297
            misc.forsale       0.87      0.81      0.84       290
               rec.autos       0.92      0.93      0.92       324
         rec.motorcycles       0.96      0.96      0.96       294
      rec.sport.baseball       0.97      0.94      0.96       315
        rec.sport.hockey       0.96      0.99      0.98       302
               sci.crypt       0.95      0.96      0.95       297
         sci.electronics       0.91      0.85      0.88       313
                 sci.med       0.96      0.96      0.96       277
         

In [27]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)

print(cm)

[[224   0   0   0   0   0   0   0   0   0   0   0   0   0   2   5   0   0
    1  13]
 [  1 267   5   5   2   8   1   1   0   0   0   2   3   2   1   0   0   0
    0   0]
 [  1  13 230  24   4  10   5   0   0   0   0   1   2   1   0   0   0   0
    1   0]
 [  0   9  21 242   7   2  10   1   0   0   1   1   7   0   0   0   0   0
    0   0]
 [  0   1   5   5 233   2   2   2   1   0   0   3   1   0   1   0   0   0
    0   0]
 [  0  20   6   3   1 260   0   0   0   2   0   1   0   0   2   0   2   0
    0   0]
 [  0   2   5  12   3   1 235  10   2   3   1   0   7   0   2   0   2   1
    4   0]
 [  0   1   0   0   1   0   8 300   4   1   0   0   1   2   3   0   2   0
    1   0]
 [  0   1   0   0   0   2   2   3 283   0   0   0   1   0   0   0   0   0
    1   1]
 [  0   1   1   0   1   2   1   2   0 297   8   1   0   1   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   2   2 298   0   0   0   0   0   0   0
    0   0]
 [  0   1   2   0   0   1   1   0   0   0   0 284   2   1   0   0