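Bag of words: all the words that appear in the documents are collected into a dictionary (the vocabulary). Each document can then be represented as a vector whose length equals the dictionary size; each element is the number of times the word at the corresponding dictionary position occurs in that document.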

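#### Basic concepts
1. Term frequency (TF): how often a given term occurs in a given document.
2. Inverse document frequency (IDF): a measure of how common a term is across the corpus. The IDF of a term is obtained by dividing the total number of documents by the number of documents that contain the term, then taking the logarithm.
3. TF-IDF: the product of TF and IDF. A high term frequency within a document combined with a low document frequency across the corpus yields a high TF-IDF weight, so TF-IDF tends to filter out common terms and keep the important ones.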
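The three definitions above can be worked through by hand. The following is a minimal sketch on a hypothetical four-document corpus (the corpus and the helper `tf_idf` are assumptions made for illustration). It uses the plain textbook formulas, TF as count divided by document length and IDF as log(N / df). scikit-learn's `TfidfTransformer` by default uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, one string per document.
docs = ["dog cat fish", "dog cat cat", "dog bird", "dog"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# IDF as defined above: log(total documents / documents containing the term).
idf = {term: math.log(n_docs / count) for term, count in df.items()}

def tf_idf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {term: (n / len(tokens)) * idf[term] for term, n in counts.items()}

weights = [tf_idf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

Because "dog" occurs in every document its IDF is log(1) = 0, so its TF-IDF weight is zero everywhere, while rarer terms such as "fish" receive larger weights.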

In [1]:
import pandas as pd
import numpy as np

from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
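# A small example
texts = ['dog cat fish','dog cat cat','fish bird','bird']  # each string is the content of one document
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
print(cv.get_feature_names_out().tolist())  # vocabulary terms in column order (scikit-learn >= 1.0; older releases used get_feature_names())
print(cv_fit.toarray())  # dense view of the sparse document-term count matrix
print(cv.vocabulary_)  # mapping from term to column index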

['bird', 'cat', 'dog', 'fish']
[[0 1 1 1]
 [0 2 1 0]
 [1 0 0 1]
 [1 0 0 0]]
{'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0}
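The matrix above can be reproduced without scikit-learn, which makes the bag-of-words construction explicit. This is a simplified sketch: it splits on whitespace only, whereas `CountVectorizer` also lowercases the text and tokenizes with a regular expression. The helper name `to_count_vector` is an assumption for illustration.

```python
from collections import Counter

texts = ['dog cat fish', 'dog cat cat', 'fish bird', 'bird']

# Vocabulary: every distinct word, sorted, mapped to a column index.
vocabulary = {word: idx for idx, word in
              enumerate(sorted({w for t in texts for w in t.split()}))}

def to_count_vector(text):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(text.split())
    # The dict keeps insertion order, which here is the sorted column order.
    return [counts[word] for word in vocabulary]

matrix = [to_count_vector(t) for t in texts]

print(vocabulary)
for row in matrix:
    print(row)
```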


In [4]:
# Load a text dataset from sklearn.datasets (four of the 20 newsgroups categories)
categories = ['alt.atheism','soc.religion.christian','comp.graphics','sci.med']
twenty_train = datasets.fetch_20newsgroups(subset='train',categories=categories)

In [5]:
twenty_train['target'].shape

(2257,)

In [6]:
pd.value_counts(twenty_train['target'])

3    599
2    594
1    584
0    480
dtype: int64

In [40]:
twenty_train['target_names']

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [8]:
# Convert the documents into count vectors
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train['data'])  # build the vocabulary from all documents, then one count vector per document

In [9]:
X_train_counts.shape  # (number of documents, vocabulary size)

(2257, 35788)

In [15]:
# Document-term count matrix (toarray() builds a dense copy, which is memory-heavy for large corpora)
X_train_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [17]:
X_train_counts.sum(axis=0)

matrix([[134,  92,   1, ...,   1,   1,   1]], dtype=int64)

In [21]:
# Computing TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer

In [22]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)  # term frequencies only (IDF disabled); with the default norm='l2' each row is scaled to unit length

In [23]:
X_train_tf = tf_transformer.transform(X_train_counts)

In [25]:
X_train_tf.toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [117]:
X_train_tf.sum(axis=0)

matrix([[ 3.62414333,  2.26442833,  0.00407465, ...,  0.12700013,
          0.06819943,  0.06741999]])
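The values in `X_train_tf` are not raw proportions. With `use_idf=False` and the default `norm='l2'`, `TfidfTransformer` divides each row of counts by its Euclidean length. A minimal sketch on one row of the earlier toy matrix (the row for 'dog cat cat'), compared with the "share of the document" reading of term frequency:

```python
import math

# Count row for 'dog cat cat' over the columns ['bird', 'cat', 'dog', 'fish'].
counts = [0, 2, 1, 0]

# L2 normalization: divide by the Euclidean length of the row.
norm = math.sqrt(sum(c * c for c in counts))
tf_l2 = [c / norm for c in counts]

# Alternative reading of "term frequency": each count as a share of the document length.
tf_share = [c / sum(counts) for c in counts]

print([round(v, 4) for v in tf_l2])
print([round(v, 4) for v in tf_share])
```

The L2-scaled row has squared entries that sum to 1, whereas the shares themselves sum to 1. Passing `norm='l1'` to `TfidfTransformer` should give the share-style scaling instead.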

In [26]:
tfidf_transformer = TfidfTransformer()  # compute TF-IDF (IDF enabled by default)

In [27]:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [118]:
X_train_tfidf.sum(axis=0)

matrix([[ 4.67345222,  3.27795162,  0.01121216, ...,  0.21430177,
          0.10555802,  0.11587165]])

In [None]:
# How can we see which terms have the largest TF-IDF weights?

In [32]:
from sklearn.naive_bayes import MultinomialNB

In [33]:
clf = MultinomialNB().fit(X_train_tfidf,twenty_train['target'])

In [34]:
docs_new = ['God is love','OpenGL on the GPU is fast']

In [36]:
# Apply the transformers fitted on the training set to the new documents (transform only, no refitting)
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

In [37]:
clf.predict(X_new_tfidf)

array([3, 1], dtype=int64)

In [38]:
predicted = clf.predict(X_new_tfidf)

In [39]:
for doc,category in zip(docs_new,predicted):
    print('%r >> %s' % (doc,twenty_train['target_names'][category]))

'God is love' >> soc.religion.christian
'OpenGL on the GPU is fast' >> comp.graphics


### Building a pipeline

In [41]:
from sklearn.pipeline import Pipeline

In [45]:
text_clf = Pipeline([
        ('vect',CountVectorizer()),
        ('tfidf',TfidfTransformer()),
        ('clf',MultinomialNB()),
    ])
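# Pipeline.fit() calls fit_transform() on each intermediate step in order and passes the result to the next step, then calls fit() on the final estimator; predict() calls transform() on each intermediate step and then predict() on the final estimator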
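The control flow described in that comment can be sketched with a toy stand-in. `MiniPipeline`, `Lowercase` and `WordVoteClassifier` are hypothetical classes written for illustration only; they are not scikit-learn's implementation and omit everything except the chaining logic.

```python
from collections import Counter, defaultdict

class Lowercase:
    """Toy transformer: lowercases every document."""
    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X):
        return [x.lower() for x in X]

class WordVoteClassifier:
    """Toy classifier: each known word votes for the labels it was seen with."""
    def fit(self, X, y):
        self.votes = defaultdict(Counter)
        for text, label in zip(X, y):
            for word in text.split():
                self.votes[word][label] += 1
        return self

    def predict(self, X):
        out = []
        for text in X:
            tally = Counter()
            for word in text.split():
                tally.update(self.votes.get(word, {}))
            out.append(tally.most_common(1)[0][0] if tally else None)
        return out

class MiniPipeline:
    """Chains transformers and a final estimator, like the comment describes."""
    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)   # fit each transformer, pass data on
        self.steps[-1][1].fit(X, y)        # final step is only fitted
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)          # reuse fitted transformers, no refit
        return self.steps[-1][1].predict(X)

pipe = MiniPipeline([('lower', Lowercase()), ('clf', WordVoteClassifier())])
pipe.fit(['God is love', 'OpenGL on the GPU'], ['religion', 'graphics'])
predictions = pipe.predict(['GOD', 'gpu'])
print(predictions)
```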

In [47]:
text_clf = text_clf.fit(twenty_train['data'],twenty_train['target'])

In [49]:
twenty_test = datasets.fetch_20newsgroups(subset='test',
                                         categories=categories)

In [50]:
docs_test = twenty_test['data']

In [51]:
predicted = text_clf.predict(docs_test)

In [53]:
np.mean(predicted == twenty_test['target'])

0.83488681757656458

In [56]:
from sklearn.linear_model import SGDClassifier

In [58]:
text_clf = Pipeline([
        ('vect',CountVectorizer()),
        ('tfidf',TfidfTransformer()),
        ('clf',SGDClassifier(loss='hinge',penalty='l2',
                            alpha=0.001,max_iter=5,tol=None,random_state=42)),
    ])

In [59]:
text_clf = text_clf.fit(twenty_train['data'],twenty_train['target'])

In [60]:
predicted = text_clf.predict(docs_test)

In [61]:
np.mean(predicted == twenty_test['target'])

0.9127829560585885

### Model evaluation

In [63]:
from sklearn import metrics

In [67]:
print(metrics.classification_report(twenty_test['target'],predicted,
                             target_names=twenty_test['target_names']))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502



In [72]:
metrics.confusion_matrix(twenty_test['target'],predicted)

array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])
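The figures in the classification report can be recovered from this confusion matrix, whose rows are true classes and whose columns are predicted classes. A short sketch using only the numbers printed above:

```python
labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
cm = [[258, 11, 15, 35],
      [4, 379, 3, 3],
      [5, 33, 355, 3],
      [5, 10, 4, 379]]

n = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified documents (the diagonal) over all documents.
accuracy = sum(cm[i][i] for i in range(n)) / total

precision, recall = {}, {}
for i, name in enumerate(labels):
    predicted_as_i = sum(cm[r][i] for r in range(n))  # column sum
    truly_i = sum(cm[i])                              # row sum (the "support")
    precision[name] = cm[i][i] / predicted_as_i
    recall[name] = cm[i][i] / truly_i
    print(f'{name:>24}  precision={precision[name]:.2f}  recall={recall[name]:.2f}')

print(f'accuracy={accuracy:.4f}')
```

For example, alt.atheism has recall 258 / 319 because 61 of its documents were assigned to other classes, most of them (35) to soc.religion.christian.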

### Parameter tuning using grid search

In [77]:
from sklearn.model_selection import GridSearchCV

In [81]:
parameters = {'vect__ngram_range':[(1,1),(1,2)],
             'tfidf__use_idf':(True,False),
             'clf__alpha':(0.01,0.001)}

In [82]:
gs_clf = GridSearchCV(text_clf,parameters,n_jobs=-1)

In [83]:
gs_clf = gs_clf.fit(twenty_train['data'][:400],twenty_train['target'][:400])

In [86]:
twenty_train['target_names'][gs_clf.predict(['God is love'])[0]]

'soc.religion.christian'

In [87]:
gs_clf.best_score_

0.90000000000000002

In [89]:
gs_clf.best_params_

{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}
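To make the search space concrete, the following sketch expands the same `parameters` dictionary by hand. It only illustrates how the `<step>__<param>` keys address a pipeline step and how many candidates result; it is not how `GridSearchCV` is implemented, and it performs no fitting.

```python
from itertools import product

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (0.01, 0.001)}

# Cartesian product of all value lists: 2 * 2 * 2 = 8 candidate settings.
names = sorted(parameters)
grid = [dict(zip(names, values))
        for values in product(*(parameters[name] for name in names))]

for candidate in grid:
    # '<step>__<param>' means: parameter <param> of the pipeline step named <step>.
    by_step = {}
    for key, value in candidate.items():
        step, param = key.split('__', 1)
        by_step.setdefault(step, {})[param] = value
    print(by_step)

print(len(grid), 'candidates')
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the number of candidates times the number of folds.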

### Chinese word segmentation with jieba

In [2]:
import jieba as jb

In [3]:
ch_texts = ['狗是人类的好朋友','猫也是','很多人都喜欢猫和狗','批款金额太少了','客户希望门店工作人员办事效率更高点']

In [6]:
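L = []
for i in ch_texts:
    j = jb.cut(i)  # jb.cut returns a lazy generator of tokens, not a list
    L.append(j)
for i in L:
    print(i)  # prints the generator objects themselves; the tokens are not consumed here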

<generator object Tokenizer.cut at 0x0000000004C45F10>
<generator object Tokenizer.cut at 0x0000000004F023B8>
<generator object Tokenizer.cut at 0x00000000054FF048>
<generator object Tokenizer.cut at 0x00000000054FF0A0>
<generator object Tokenizer.cut at 0x00000000054FF0F8>


In [103]:
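cv = CountVectorizer()
# CountVectorizer expects strings, so join each document's tokens with spaces first.
# The default token_pattern keeps only tokens of two or more characters,
# so single-character words such as '狗' and '猫' are dropped.
cv_fit = cv.fit_transform([' '.join(tokens) for tokens in L])
print(cv.get_feature_names_out().tolist())  # vocabulary terms in column order (scikit-learn >= 1.0)
print(cv_fit.toarray())  # dense view of the sparse document-term count matrix
print(cv.vocabulary_)  # mapping from term to column index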

['人类', '办事效率', '喜欢', '太少', '客户', '工作人员', '希望', '很多', '批款', '朋友', '金额', '门店', '高点']
[[1 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 1 0 0]
 [0 1 0 0 1 1 1 0 0 0 0 1 1]]
{'人类': 0, '朋友': 9, '很多': 7, '喜欢': 2, '批款': 8, '金额': 10, '太少': 3, '客户': 4, '希望': 6, '门店': 11, '工作人员': 5, '办事效率': 1, '高点': 12}
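The second row above is all zeros, and words such as 狗 and 猫 are missing from the vocabulary. This follows from `CountVectorizer`'s default `token_pattern`, `(?u)\b\w\w+\b`, which only matches tokens of at least two characters. The effect can be checked with the standard `re` module alone. The space-joined segmentations below are assumed for illustration, inferred from the vocabulary printed above rather than produced by running jieba.

```python
import re

# Assumed space-joined segmentations of the first two documents.
segmented = ['狗 是 人类 的 好 朋友', '猫 也 是']

default_pattern = r'(?u)\b\w\w+\b'   # CountVectorizer's default token_pattern
single_ok_pattern = r'(?u)\b\w+\b'   # variant that also accepts one-character tokens

for text in segmented:
    print('default :', re.findall(default_pattern, text))
    print('relaxed :', re.findall(single_ok_pattern, text))
```

Passing `token_pattern=r'(?u)\b\w+\b'` to `CountVectorizer` should therefore keep single-character Chinese words, at the cost of also keeping very common function words such as 的 and 是.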


In [99]:
seg_list = jb.cut('我来到北京清华大学',cut_all=True)

In [100]:
print('Full mode: '+'/'.join(seg_list))

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\xupu\AppData\Local\Temp\jieba.cache
Loading model cost 1.178 seconds.
Prefix dict has been built succesfully.


Full mode: 我/来到/北京/清华/清华大学/华大/大学


In [103]:
seg_list = jb.cut('我来到北京清华大学',cut_all=False)
print('Accurate mode: '+'/'.join(seg_list))
# Accurate mode is the default

Accurate mode: 我/来到/北京/清华大学


In [23]:
import jieba.posseg as pseg  # part-of-speech tagging
import jieba.analyse

In [105]:
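sentence = '我来到北京清华大学'

In [106]:
result = pseg.cut(sentence)

In [108]:
for i in result:
    print(i)

我/r
来到/v
北京/ns
清华大学/nt


In [109]:
result1 = jieba.analyse.extract_tags(sentence, topK=2)
# Keyword extraction: `sentence` is the text to analyse and `topK` is how many of the highest TF-IDF-weighted keywords to return (default 20)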

In [110]:
print (result1)

['清华大学', '来到']


In [None]:
#jieba,tf_idf,word2vec??