## 使用Scikit-learn進行文本分類

目的：使用Scikit-learn庫自帶的新聞信息數據來進行試驗，該數據集有19,000個新聞信息組成，通過新聞文本的內容，
使用scikit-learn中的樸素貝葉斯算法，來判斷新聞屬於什麼主題類別。參考：[Scikit-learn Totorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#building-a-pipeline)

In [3]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
print news.keys()

['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']


In [7]:
print news.data[0]
print news.target[0], news.target_names[news.target[0]]

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


10 rec.sport.hockey


In [4]:
split_rate = 0.8
split_size = int(len(news.data) * split_rate)
X_train = news.data[:split_size]
y_train = news.target[:split_size]
X_test  = news.data[split_size:]
y_test  = news.target[split_size:]

In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
# Tokenizing text
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
# Tf
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
# Tf_idf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [21]:
print X_train_tfidf.shape
print y_train.shape
print type(y_train)

(15076, 146269)
(15076,)
<type 'numpy.ndarray'>


In [6]:
from sklearn.naive_bayes import MultinomialNB
# create classifier
clf = MultinomialNB().fit(X_train_tfidf, y_train)
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
# using classifier to predict
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
       print('%r => %s' % (doc, news.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


## 另一個範例

In [11]:
#構建分類器

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

#nbc means naive bayes classifier
nbc_1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
nbc_2 = Pipeline([
    ('vect', HashingVectorizer(non_negative=True)),
    ('clf', MultinomialNB()),
])
nbc_3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

nbcs = [nbc_1, nbc_2, nbc_3]


In [12]:
# 交叉驗證
from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem
import numpy as np

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold croos validation iterator of k=5 folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default the score used is the one returned by score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print scores
    print ("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores))

In [14]:
for nbc in nbcs:
    evaluate_cross_validation(nbc, X_train, y_train, 5)

[ 0.84515915  0.84577114  0.84378109  0.8358209   0.83781095]
Mean score: 0.842 (+/-0.002)
[ 0.73872679  0.76351575  0.75754561  0.77446103  0.71343284]
Mean score: 0.750 (+/-0.011)
[ 0.84151194  0.84908789  0.83781095  0.84543947  0.8252073 ]
Mean score: 0.840 (+/-0.004)


In [16]:
# 优化提取单词规则参数
nbc_4 = Pipeline([
    ('vect', TfidfVectorizer(
                token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])

evaluate_cross_validation(nbc_4, X_train, y_train, 5)


[ 0.85941645  0.86533997  0.85339967  0.85505804  0.84742952]
Mean score: 0.856 (+/-0.003)


In [17]:
# 优化省略词参数
def get_stop_words():
    result = set()
    for line in open('hlt_stop_words.txt', 'r').readlines():
        result.add(line.strip())
    return result

stop_words = get_stop_words()
nbc_5 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",    
    )),
    ('clf', MultinomialNB()),
])


evaluate_cross_validation(nbc_5, X_train, y_train, 5)

[ 0.85941645  0.86533997  0.85339967  0.85505804  0.84742952]
Mean score: 0.856 (+/-0.003)


In [18]:
# 优化贝叶斯分类器的alpha参数
nbc_6 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",         
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])

evaluate_cross_validation(nbc_6, X_train, y_train, 5)

[ 0.91080902  0.91376451  0.91475954  0.91708126  0.91376451]
Mean score: 0.914 (+/-0.001)
