1. Use sklearn to extract tf-idf
- CountVectorizer() + TfidfTransformer()
- TfidfVectorizer()
"As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.
As you can see, TfidfVectorizer is a CountVectorizer followed by TfidfTransformer."
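
A minimal sketch to check the quoted claim, assuming default parameters for all three classes. The toy corpus `docs` is made up for illustration and is not part of the newsgroups data used below.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Route 1: raw term counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: both steps inside a single estimator.
one_step = TfidfVectorizer().fit_transform(docs)

# With default settings the two routes should give the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

If the two routes are configured differently (for example a custom `norm` on only one of them), the matrices will no longer match.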


In [21]:
# import dataset

from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='all')
news_data = twenty_train.data[:50]


In [28]:
# vectorize corpus
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(news_data)
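# show the unique terms (feature names); get_feature_names() was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()
# show the matrix of the vectorized corpus (kept as the last expression so the notebook displays it)
X.toarray()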

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [35]:
# compute the tf-idf weight of each term in the vectorized corpus
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
tfidfArray = tfidf.toarray()
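
To see what the transformer computes, here is a hedged sketch that reproduces it by hand on a tiny count matrix. It assumes the default `TfidfTransformer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1. The matrix `counts` is invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term-count matrix: 3 documents x 3 terms.
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 1, 1],
])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed idf (default settings)
weighted = counts * idf                      # tf * idf, with tf = raw count
norms = np.linalg.norm(weighted, axis=1, keepdims=True)
manual = weighted / norms                    # L2-normalise each document row

# Compare against the library result.
from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, from_sklearn))
```

Because each row is L2-normalised, the scores are comparable within a document but depend on which other terms that document contains.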

In [41]:
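# view the top 5 terms of each news article based on the tf-idf score

feature_names = vectorizer.get_feature_names_out()
for row in tfidfArray:
    # argsort() returns indices in ascending order of score, so negating the row
    # gives indices ordered from highest to lowest score; keep the first 5
    top5 = (-row).argsort()[:5]
    print([(feature_names[i], float(row[i])) for i in top5])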
    
# reference: https://stackoverflow.com/questions/28619595/how-to-get-top-terms-based-on-tf-idf-python

[(u'pens', 0.57457982949654918), (u'devils', 0.19152660983218306), (u'jagr', 0.19152660983218306), (u'season', 0.15663692903159965), (u'regular', 0.15663692903159965)]
[(u'vlb', 0.31282633392421016), (u'uoknor', 0.28687950970352283), (u'ecn', 0.28687950970352283), (u'card', 0.20992053310059022), (u'performance', 0.20072488892546592)]
[(u'hilmi', 0.21799238139149693), (u'armenians', 0.21799238139149693), (u'weapons', 0.21393783685745585), (u'armenia', 0.21393783685745585), (u'announced', 0.17828153071454655)]
[(u'scsi', 0.35871726913937901), (u'dma', 0.35871726913937901), (u'bus', 0.30854605792918893), (u'data', 0.27608745039290811), (u'devices', 0.25110208839756537)]
[(u'system', 0.29716049705138053), (u'drive', 0.25381328734924902), (u'jasmine', 0.2418362493420185), (u'inexpensive', 0.2418362493420185), (u'utility', 0.2418362493420185)]
[(u'myers', 0.31264602206683523), (u'unc', 0.31264602206683523), (u'fc', 0.23448451655012642), (u'tell', 0.21875329512801253), (u'chapel', 0.156323011
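
The negate-then-argsort step is easier to follow on a small array. This sketch uses only NumPy; the `terms` and `scores` values are made up for illustration.

```python
import numpy as np

# Hypothetical vocabulary and one document's tf-idf scores.
terms = np.array(["bus", "data", "dma", "scsi", "the"])
scores = np.array([0.31, 0.28, 0.36, 0.40, 0.05])

k = 3
# argsort sorts ascending; negating the scores makes the largest score the
# smallest value, so the first k indices point at the k highest scores.
top = (-scores).argsort()[:k]
top_terms = [(str(terms[i]), float(scores[i])) for i in top]
print(top_terms)  # [('scsi', 0.4), ('dma', 0.36), ('bus', 0.31)]
```

Slicing the index array before building the list avoids constructing a tuple for every term in the vocabulary.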

In [27]:
# store tf-idf in files (Python 3: print() is a function and string.zfill is gone)
import os

word = vectorizer.get_feature_names_out()
weight = tfidf.toarray()

sFilePath = './tfidffile'
if not os.path.exists(sFilePath):
    os.mkdir(sFilePath)
# write the tf-idf weight of every term in each document to the tfidffile folder
for i in range(len(weight)):
    path = sFilePath + '/' + str(i).zfill(5) + '.txt'
    print("--------Writing all the tf-idf in the", i, " file into ", path, "--------")
    with open(path, 'w', encoding='utf-8') as f:
        for j in range(len(word)):
            f.write(word[j] + "    " + str(weight[i][j]) + "\n")

--------Writing all the tf-idf in the 0  file into  ./tfidffile/00000.txt --------
--------Writing all the tf-idf in the 1  file into  ./tfidffile/00001.txt --------
--------Writing all the tf-idf in the 2  file into  ./tfidffile/00002.txt --------
--------Writing all the tf-idf in the 3  file into  ./tfidffile/00003.txt --------
--------Writing all the tf-idf in the 4  file into  ./tfidffile/00004.txt --------
--------Writing all the tf-idf in the 5  file into  ./tfidffile/00005.txt --------
--------Writing all the tf-idf in the 6  file into  ./tfidffile/00006.txt --------
--------Writing all the tf-idf in the 7  file into  ./tfidffile/00007.txt --------
--------Writing all the tf-idf in the 8  file into  ./tfidffile/00008.txt --------
--------Writing all the tf-idf in the 9  file into  ./tfidffile/00009.txt --------
--------Writing all the tf-idf in the 10  file into  ./tfidffile/00010.txt --------
--------Writing all the tf-idf in the 11  file into  ./tfidffile/00011.txt --------

reference: http://blog.csdn.net/liuxuejiang158blog/article/details/31360765