## Extracting Important Keywords from Text with TF-IDF and Python's Scikit-Learn 

Back in 2006, when I had to use TF-IDF for keyword extraction in Java, I ended up writing all of the code from scratch, as neither Data Science nor GitHub was a thing back then and libraries were limited. The world is much different today. You have several [libraries](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) and [open-source code on GitHub](https://github.com/topics/tf-idf?o=desc&s=forks) that provide a decent implementation of TF-IDF. If you don't need a lot of control over how the TF-IDF math is computed, then I would highly recommend re-using libraries from known packages such as [Spark's MLlib](https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html) or [Python's scikit-learn](http://scikit-learn.org/stable/).

The one problem that I noticed with these libraries is that they are meant as a pre-step for other tasks like clustering, topic modeling and text classification. [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) can actually be used to extract important keywords from a document to get a sense of what characterizes it. For example, if you are dealing with Wikipedia articles, you can use TF-IDF to extract words that are unique to a given article. These keywords can be used as a very simple summary of the document, for text analytics (when we look at these keywords in aggregate), as candidate labels for a document, and more.

In this article, I will show you how you can use scikit-learn to extract top keywords for a given document using its TF-IDF modules. We will specifically do this on a Stack Overflow dataset.
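
Before turning to scikit-learn, it may help to see the idea on a toy scale. The sketch below is not part of the original notebook: the three tiny documents and the helper `tfidf_scores` are made up for illustration, and it assumes the smoothed IDF variant that scikit-learn documents, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. scikit-learn additionally normalizes each document vector by default, which this sketch skips.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace.
toy_docs = [
    "eclipse maven plugin eclipse",
    "maven tomcat deploy",
    "maven css float",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_scores(tokens):
    """Raw term count times smoothed IDF, without any normalization."""
    tf = Counter(tokens)
    return {
        term: tf[term] * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term in tf
    }

scores = tfidf_scores(tokenized[0])
top_term = max(scores, key=scores.get)
print(top_term, scores)
```

Here `maven` appears in every document, so it gets the lowest score, while `eclipse` is both frequent in the first document and absent elsewhere, so it ranks first.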

In [2]:
import pandas as pd

# read json into a dataframe
# lines=True is required: without it, read_json raises "ValueError: Trailing data",
# because the raw file is a plain sequence of dicts rather than a single array of dicts
df_idf=pd.read_json("data/stackoverflow-data-idf.json",lines=True)
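
As a small illustration of why `lines=True` matters, the sketch below feeds pandas a two-record JSON Lines string. The records are invented for the example and are not taken from the dataset.

```python
import io
import pandas as pd

# Two JSON objects, one per line (JSON Lines), with made-up content.
jsonl_text = '{"id": 1, "title": "first"}\n{"id": 2, "title": "second"}\n'

# With lines=True each line becomes one row.
toy_df = pd.read_json(io.StringIO(jsonl_text), lines=True)

# Without lines=True pandas expects a single JSON document and fails.
try:
    pd.read_json(io.StringIO(jsonl_text))
    failed_without_lines = False
except ValueError as err:
    failed_without_lines = True
    print("read_json without lines=True failed:", err)

print(toy_df)
```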

In [3]:
# print schema
print("Schema:\n\n",df_idf.dtypes)
print("Number of questions,columns=",df_idf.shape)
display(df_idf)

Schema:

 id                            int64
title                        object
body                         object
answer_count                  int64
comment_count                 int64
creation_date                object
last_activity_date           object
last_editor_display_name     object
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
view_count                    int64
accepted_answer_id          float64
favorite_count              float64
last_edit_date               object
last_editor_user_id         float64
community_owned_date         object
dtype: object
Number of questions,columns= (20000, 19)


Unnamed: 0,id,title,body,answer_count,comment_count,creation_date,last_activity_date,last_editor_display_name,owner_display_name,owner_user_id,post_type_id,score,tags,view_count,accepted_answer_id,favorite_count,last_edit_date,last_editor_user_id,community_owned_date
0,4821394,Serializing a private struct - Can it be done?,<p>I have a public class that contains a priva...,1,0,2011-01-27 20:19:13.563 UTC,2011-01-27 20:21:37.59 UTC,,,163534.0,1,0,c#|serialization|xml-serialization,296,,,,,
1,3367882,How do I prevent floated-right content from ov...,<p>I have the following HTML:</p>\n\n<pre><cod...,2,2,2010-07-30 00:01:50.9 UTC,2012-05-10 14:16:05.143 UTC,,,1190.0,1,2,css|overflow|css-float|crop,4121,3367943.0,0.0,2012-05-10 14:16:05.143 UTC,44390.0,
2,31682135,Gradle command line,<p>I'm trying to run a shell script with gradl...,0,2,2015-07-28 16:30:18.28 UTC,2015-07-28 16:32:15.117 UTC,,,1299158.0,1,1,bash|shell|android-studio|gradle,259,,,,,
3,20218536,Loop variable as parameter in asynchronous fun...,<p>I have an object with the following form.</...,1,1,2013-11-26 13:34:49.957 UTC,2013-11-26 15:07:50.8 UTC,,,642751.0,1,0,javascript|asynchronous|foreach|async.js,120,,1.0,2013-11-26 15:02:47.993 UTC,1333873.0,
4,19941459,Canot get the href value,<p>Hi I need to valid the href is empty or not...,5,1,2013-11-12 22:41:36.11 UTC,2013-11-12 23:48:34.67 UTC,,,819774.0,1,0,javascript,97,19941620.0,,2013-11-12 22:43:42.97 UTC,21886.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,45643057,C# Telnet Error (Using CMD),<p>At first I want to say that my English is n...,0,1,2017-08-11 20:10:22.387 UTC,2017-08-11 22:00:08.853 UTC,,,8450318.0,1,0,c#|winforms|cmd|telnet,28,,,2017-08-11 22:00:08.853 UTC,5359302.0,
19996,18079413,UI Control for presenting data base in the iph...,<p>I'm developing an application for CRM propo...,0,0,2013-08-06 11:51:37.463 UTC,2013-08-06 11:51:37.463 UTC,,,1939409.0,1,1,android|ios|ipad|xamarin.ios|xamarin.android,80,,,,,
19997,39977022,How to make atom-typescript write config out o...,<p>I'm working in a team of 3 devs and we're w...,0,0,2016-10-11 12:14:43.26 UTC,2017-01-18 18:41:16.67 UTC,,,4102561.0,1,1,typescript|atom-editor,18,,,2017-01-18 18:41:16.67 UTC,2449905.0,
19998,33328431,understanding for in angular loop in ng-repeat,<p>I was just going through this example onlin...,3,3,2015-10-25 10:03:00.587 UTC,2015-10-25 10:30:08.807 UTC,,,4195815.0,1,1,angularjs,164,,,,,


Take note that this Stack Overflow dataset contains 19 fields, including post title, body, tags, dates and other metadata that we don't need for this tutorial. What we are mostly interested in is the `body` and `title`, which are our source of text. We will now create a field that combines both title and body so we have them in one field, and clean it with a small pre-processing function. We will also print the `text` entry at index 2 of our new field just to see what the text looks like.

In [4]:
import re
def pre_process(text):
    
    # lowercase
    text=text.lower()
    
    #remove tags
    text=re.sub("</?.*?>"," <> ",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text

df_idf['text'] = df_idf['title'] + df_idf['body']
df_idf['text'] = df_idf['text'].apply(lambda x:pre_process(x))

# show the 'text' entry at index 2 (the third row)
df_idf['text'][2]

'gradle command line i m trying to run a shell script with gradle i currently have something like this def test project tasks create test exec commandline bash c bash c my file dir script sh the problem is that i cannot run this script because i have spaces in my dir name i have tried everything e g commandline bash c bash c my file dir script sh tokenize commandline bash c bash c my file dir script sh commandline bash c new stringbuilder append bash append c my file dir script sh commandline bash c bash c my file dir script sh file dir file c my file dir script sh commandline bash c bash dir getabsolutepath im using windows bit and if i use a path without spaces the script runs perfectly therefore the only issue as i can see is how gradle handles spaces '

Hmm, the text doesn't look very pretty with all the flattened code in there (the HTML tags have already been stripped), but that's the point. Even in such a mess, we can extract some great stuff out of this. While you can eliminate all code from the text, we will keep the code sections for this tutorial for the sake of simplicity.
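
To make the effect of the cleanup concrete, here is the same `pre_process` function applied to a short, made-up HTML snippet (the snippet is an assumption for illustration; the function body is copied from the cell above).

```python
import re

def pre_process(text):
    # lowercase
    text = text.lower()
    # replace HTML tags with a placeholder surrounded by spaces
    text = re.sub("</?.*?>", " <> ", text)
    # collapse every run of digits and non-word characters into one space
    text = re.sub("(\\d|\\W)+", " ", text)
    return text

sample = "<p>Run <code>ls -l</code> 2 times!</p>"
cleaned = pre_process(sample)
print(repr(cleaned))  # ' run ls l times '
```

Note that tags, punctuation and digits all disappear, while the words inside the `<code>` element survive, which is why code fragments still show up in the text above.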

## Creating the IDF

### CountVectorizer to create a vocabulary and generate word counts
The next step is to start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in our `df_idf['text']` and generate counts for each row in `df_idf['text']`. The result of the last two lines of the cell below is a sparse matrix of counts: each column represents a word in the vocabulary, each row represents a document in our dataset, and the values are the word counts. Note that with this representation, the count of a word is 0 if the word does not appear in the corresponding document.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
import re

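def get_stop_words(stop_file_path):
    """load stop words """
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        # return a list: recent scikit-learn releases validate `stop_words`
        # and expect a list rather than a set or frozenset
        return sorted(stop_set)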

#load a set of stop words
stopwords=get_stop_words("resources/stopwords.txt")

#get the text column 
docs=df_idf['text'].tolist()

#create a vocabulary of words, 
#ignore words that appear in 85% of documents, 
#eliminate stop words
cv=CountVectorizer(max_df=0.85,stop_words=stopwords)
word_count_vector=cv.fit_transform(docs)



Now let's check the shape of the resulting matrix. It has 20,000 rows because we have 20,000 documents in our dataset, and one column per unique word in the vocabulary, minus the stop words. In the run recorded in the cell below, the sparse-matrix summaries show `20000x124901`, meaning 124901 unique words; the exact vocabulary size depends on the stop word list used. In some text mining applications, such as clustering and text classification, we limit the size of the vocabulary. It's really easy to do this by setting `max_features=vocab_size` when instantiating CountVectorizer.

In [6]:
# Ways to inspect the sparse matrix
# 'toarray', # similar to todense()
# 'tobsr', # the default is this one <20000x124901 sparse matrix of type '<class 'numpy.int64'>' with 1079735 stored elements in Compressed Sparse Row format>
# 'tocoo', # <20000x124901 sparse matrix of type '<class 'numpy.int64'>' with 1079735 stored elements in COOrdinate format>
# 'tocsc', # <20000x124901 sparse matrix of type '<class 'numpy.int64'>' with 1079735 stored elements in Compressed Sparse Column format>
# 'tocsr', # <20000x124901 sparse matrix of type '<class 'numpy.int64'>' with 1079735 stored elements in Compressed Sparse Row format>
# 'todense', # too large, ran out of memory
# 'todia',   # too large, ran out of memory
# 'todok',   # <20000x124901 sparse matrix of type '<class 'numpy.int64'>' with 1079735 stored elements in Dictionary Of Keys format>
# 'tolil',   # <20000x124901 sparse matrix of type '<class 'numpy.int64'>' with 1079735 stored elements in List of Lists format>
dict(word_count_vector[:3].todok()) # take a small slice to inspect


{(1, 3747): 1,
 (1, 5886): 1,
 (1, 6122): 1,
 (2, 7376): 2,
 (0, 9262): 1,
 (2, 11205): 12,
 (2, 12465): 1,
 (1, 12744): 1,
 (1, 13882): 1,
 (0, 18753): 1,
 (1, 18753): 1,
 (2, 20603): 1,
 (2, 20629): 6,
 (0, 22178): 2,
 (1, 22178): 1,
 (1, 22211): 5,
 (2, 23486): 1,
 (1, 24457): 1,
 (2, 25004): 1,
 (1, 25483): 2,
 (1, 25493): 1,
 (2, 27591): 1,
 (1, 29610): 1,
 (2, 29744): 9,
 (0, 30049): 1,
 (1, 30160): 2,
 (0, 31222): 2,
 (1, 32811): 1,
 (1, 33025): 1,
 (0, 35193): 1,
 (2, 36214): 1,
 (2, 38422): 8,
 (1, 39582): 1,
 (1, 39956): 1,
 (1, 39965): 1,
 (1, 40314): 2,
 (1, 40323): 1,
 (2, 43389): 1,
 (2, 47247): 3,
 (2, 48529): 1,
 (1, 49363): 1,
 (1, 49523): 1,
 (1, 50490): 1,
 (1, 51279): 2,
 (2, 52100): 1,
 (1, 52357): 1,
 (1, 52505): 5,
 (1, 53150): 2,
 (2, 56335): 1,
 (1, 58395): 2,
 (0, 58721): 1,
 (1, 60766): 1,
 (2, 61529): 1,
 (2, 61593): 1,
 (1, 63500): 1,
 (1, 65339): 1,
 (2, 71710): 1,
 (1, 72120): 1,
 (0, 72320): 1,
 (2, 72612): 1,
 (1, 74456): 1,
 (0, 76926): 1,
 (2, 76926):
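
Each `(row, column): count` pair above is one stored (non-zero) entry of the sparse matrix. A minimal sketch of the same kind of inspection on a hand-built matrix follows; the values are invented, and going through `tocoo()` is just one possible way to list the entries.

```python
from scipy.sparse import csr_matrix

# Hypothetical 3-document x 4-word count matrix, mostly zeros.
toy_counts = csr_matrix([
    [0, 2, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
])

# COO format exposes parallel arrays of row index, column index and value.
coo = toy_counts.tocoo()
entries = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]

print(toy_counts.nnz)   # number of stored entries
print(sorted(entries))  # (document, word index, count) triples
```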

Let's limit our vocabulary size to 10,000

In [7]:
# redo the step with the added argument max_features=10000
cv=CountVectorizer(max_df=0.85,stop_words=stopwords,max_features=10000) # create the cv object and set its optional parameters at the same time
word_count_vector=cv.fit_transform(docs)
word_count_vector.shape

# Note: cv is similar to the domain of an Orange3 data table; it holds no actual data.
# Only applying actual docs via cv.fit_transform() produces actual data such as word_count_vector.

(20000, 10000)
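
When `max_features` is set, scikit-learn keeps the terms with the highest total counts across the corpus. The toy run below sketches that behavior; the documents are made up, with counts chosen so there are no ties.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "maven" occurs 4 times, "eclipse" 2, the rest once.
toy_docs = ["maven maven eclipse", "maven eclipse tomcat", "maven css"]

full_cv = CountVectorizer().fit(toy_docs)
small_cv = CountVectorizer(max_features=2).fit(toy_docs)

print(sorted(full_cv.vocabulary_))   # all four terms
print(sorted(small_cv.vocabulary_))  # only the two most frequent terms
```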

Now, let's look at 10 words from our vocabulary. Sweet, these are mostly programming related.

In [8]:
# cv.vocabulary_ is a dictionary; slicing its items is a nicer way to peek at part of it than dict.keys()
dict(list(cv.vocabulary_.items())[0:10]) # the original article used list(cv.vocabulary_.keys())[:10]

{'serializing': 7852,
 'private': 6761,
 'struct': 8520,
 'public': 6888,
 'class': 1351,
 'contains': 1729,
 'properties': 6846,
 'string': 8498,
 'serialize': 7848,
 'attempt': 631}

We can also get the vocabulary by using `get_feature_names_out()`

In [10]:
# cv.vocabulary_ above is an unordered dictionary, while cv.get_feature_names_out() is an ordered array, so the two look different.
list(cv.get_feature_names_out())[2000:2015] 

['customization',
 'customize',
 'customized',
 'customlog',
 'customview',
 'cut',
 'cv',
 'cv_',
 'cval',
 'cvc',
 'cw',
 'cwd',
 'cx',
 'cx_oracle',
 'cxf']

### TfidfTransformer to Compute Inverse Document Frequency (IDF) 
In the code below, we are essentially taking the sparse matrix from CountVectorizer to generate the IDF when you invoke `fit`. An extremely important point to note here is that the IDF should be based on a large corpus that is representative of the texts you would be using to extract keywords. I've seen several articles on the Web that compute the IDF using a handful of documents. To understand why IDF should be based on a fairly large collection, please read this [page from Stanford's IR book](https://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html).

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) # create the tfidf_transformer object and set its options
tfidf_transformer.fit(word_count_vector) # fit the IDF on the bag-of-words counts built above; the return value is the tfidf_transformer object itself

TfidfTransformer()

Let's look at some of the IDF values:

In [12]:
display(tfidf_transformer.idf_.shape) # the code above limited the vocabulary to 10,000 terms
display(tfidf_transformer.idf_) # IDF values for the 10,000 vocabulary terms

array([ 7.37717703,  9.80492526,  9.51724319, ...,  8.82409601,
       10.21039037,  9.51724319])

(10000,)
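
For readers curious where these numbers come from: with `smooth_idf=True`, scikit-learn documents the IDF as `ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`. The sketch below recomputes it by hand on a made-up three-document corpus and compares the result with `idf_`; the names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_docs = ["eclipse maven plugin", "maven tomcat", "maven eclipse"]
counts = CountVectorizer().fit_transform(toy_docs)
transformer = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts)

n_docs = counts.shape[0]
# Document frequency: number of documents in which each term occurs.
doc_freq = (counts.toarray() > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(manual_idf)
print(transformer.idf_)
```

Terms that occur in every document (here `maven`) end up with the minimum IDF of 1, and rarer terms get larger values, which is why a representative corpus matters.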

## Computing TF-IDF and Extracting Keywords

Once we have our IDF computed, we are ready to compute TF-IDF and extract the top keywords. In this example, we will extract top keywords for the questions in `data/stackoverflow-test.json`. This data file has 500 questions with fields identical to those of `data/stackoverflow-data-idf.json`, which we saw above. We will start by reading our test file, extracting the necessary fields (title and body) and getting the texts into a list.

In [13]:
# read test docs into a dataframe and concatenate title and body
df_test=pd.read_json("data/stackoverflow-test.json",lines=True)
df_test['text'] = df_test['title'] + df_test['body']
df_test['text'] =df_test['text'].apply(lambda x:pre_process(x))

In [14]:
df_test

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,creation_date,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,post_type_id,score,tags,view_count,favorite_count,text
0,3247246,Integrate War-Plugin for m2eclipse into Eclips...,<p>I set up a small web project with JSF and M...,3247526.0,2,0,2010-07-14 14:39:48.053 UTC,2010-07-14 16:02:19.683 UTC,2010-07-14 15:56:37.803 UTC,,70604.0,,389430.0,1,2,eclipse|maven-2|tomcat|m2eclipse,1653,,integrate war plugin for m eclipse into eclips...
1,40270764,phantomjs-node page.evaulate seems to hang,<p>I have an implementation of 'waitfor' with ...,,1,0,2016-10-26 19:35:00.537 UTC,2016-11-02 20:05:09.143 UTC,,,,,245076.0,1,0,node.js|phantomjs,35,,phantomjs node page evaulate seems to hang i h...
2,27532383,Dynamic operations can only be performed in ho...,<p>I'm working with an API that requires:</p>\...,,1,0,2014-12-17 18:31:18.6 UTC,2014-12-17 19:57:43.443 UTC,,,,,3105880.0,1,1,c#|asp.net-mvc,4372,,dynamic operations can only be performed in ho...
3,33511888,CSS with relative URL to background image?,<p>I have a file structure of:</p>\n\n<pre><co...,,2,2,2015-11-04 00:50:35.223 UTC,2015-11-04 01:51:03.037 UTC,2015-11-04 01:51:03.037 UTC,,5464492.0,,5464492.0,1,0,css|background-image,406,,css with relative url to background image i ha...
4,46160163,Share canvas image on android,<p>Hello so I write a small game where in the ...,46160246.0,1,0,2017-09-11 16:19:18.32 UTC,2017-09-11 16:24:12.69 UTC,,,,,8570512.0,1,0,android|canvas|bitmap|share,52,,share canvas image on android hello so i write...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,5972679,Is there any .NET string.format compatible fun...,<p>Is there any .NET string.format compatible ...,6001444.0,1,8,2011-05-12 02:23:05.997 UTC,2011-05-14 11:16:16.3 UTC,2011-05-14 11:16:16.3 UTC,,26736.0,,26736.0,1,4,delphi|formatting,403,,is there any net string format compatible func...
496,21473995,How to handle failed XPATH lookup in MSXML fro...,<p>I am parsing a piece of XML returned from a...,21512998.0,2,0,2014-01-31 06:41:50.217 UTC,2014-02-02 16:33:58.39 UTC,,,,,274354.0,1,0,autoit|msxml,534,,how to handle failed xpath lookup in msxml fro...
497,11279736,Logo Animation - Don't know where to begin,<p>I'm trying to have a cool little animation ...,11897603.0,2,2,2012-07-01 05:07:28.437 UTC,2013-02-12 12:19:48.26 UTC,2013-02-12 12:19:48.26 UTC,,1720391.0,,1193321.0,1,1,android|opengl-es|android-animation,858,,logo animation don t know where to begin i m t...
498,33005260,How to introduce a new variable in Coq?,<p>I was wondering if there is a way to introd...,33005343.0,3,0,2015-10-08 01:44:46.92 UTC,2015-10-08 16:10:57.683 UTC,,,,,683218.0,1,2,coq|coq-tactic,933,2.0,how to introduce a new variable in coq i was w...


In [15]:
# get test docs into a list
docs_test=df_test['text'].tolist()
docs_title=df_test['title'].tolist()
docs_body=df_test['body'].tolist()

In [19]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []

    for idx, score in sorted_items:
        # keep track of the feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # map each feature name to its score, preserving the sorted order
    results = dict(zip(feature_vals, score_vals))
    
    return results

The next step is to compute the tf-idf value for a given document in our test set by invoking `tfidf_transformer.transform(...)`. This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf values and then iterate over them to extract the top-n items with the corresponding feature names. In the example below, we are extracting keywords for the first document in our test set.

The `sort_coo(...)` function essentially sorts the values in the vector while preserving the column index. Once you have the column index, it's really easy to look up the corresponding word, as you can see in `extract_topn_from_vector(...)`, where we do `feature_vals.append(feature_names[idx])`.
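
To see the sorting step in isolation, here is `sort_coo` applied to a single hand-made score vector. The scores and the five feature names are invented for the example; only the function itself comes from the cell above.

```python
from scipy.sparse import coo_matrix

def sort_coo(coo):
    # pair each column index with its score, highest score first
    tuples = zip(coo.col, coo.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# Hypothetical tf-idf row for one document over a 5-word vocabulary.
toy_names = ["css", "eclipse", "float", "maven", "war"]
toy_vector = coo_matrix([[0.0, 0.2, 0.0, 0.7, 0.2]])

sorted_pairs = [(int(col), float(score)) for col, score in sort_coo(toy_vector)]
top_two = {toy_names[col]: score for col, score in sorted_pairs[:2]}

print(sorted_pairs)  # [(3, 0.7), (4, 0.2), (1, 0.2)]
print(top_two)       # {'maven': 0.7, 'war': 0.2}
```

Ties on the score are broken by the larger column index, because the sort key is the pair `(score, column)` in reverse order.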

In [22]:
# you only need to do this once
# (get_feature_names() was removed from recent scikit-learn releases)
feature_names=cv.get_feature_names_out()

# get the document that we want to extract keywords from
doc=docs_test[0]

#generate tf-idf for the given document
# cv.transform([doc]) is a word count vector like word_count_vector above
# from doc's own word counts, and using what the fitted tfidf_transformer learned about the whole corpus, compute the tf_idf_vector for doc
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc])) 

In [26]:
dict(tf_idf_vector[:3].todok()) # take a small slice to inspect

{(0, 682): 0.15738110538202002,
 (0, 685): 0.10989572800183882,
 (0, 1055): 0.08035124212045888,
 (0, 1823): 0.11479784342810864,
 (0, 2249): 0.13044542936872117,
 (0, 2627): 0.5925293426217563,
 (0, 2973): 0.11025500158033014,
 (0, 3197): 0.06978016987971357,
 (0, 3901): 0.061158962303011905,
 (0, 4353): 0.2812431411685825,
 (0, 4643): 0.15156053652156862,
 (0, 4928): 0.048451995699994393,
 (0, 4951): 0.08716149048211315,
 (0, 5274): 0.27296938355933525,
 (0, 5875): 0.06570029321302341,
 (0, 6559): 0.21432446619836365,
 (0, 6635): 0.14633101990519562,
 (0, 6774): 0.09162799231378672,
 (0, 6827): 0.23853819069873228,
 (0, 7075): 0.08132014229978367,
 (0, 7861): 0.07219710213123282,
 (0, 7888): 0.06463167495704715,
 (0, 8154): 0.1084032073516024,
 (0, 8888): 0.06230701502481014,
 (0, 9027): 0.2699024130362261,
 (0, 9165): 0.07218637169128597,
 (0, 9183): 0.11894924279865834,
 (0, 9636): 0.316861174881799,
 (0, 9662): 0.08316521991636568}

In [30]:
#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())
sorted_items[:10] # sorted from largest to smallest

[(2627, 0.5925293426217563),
 (9636, 0.316861174881799),
 (4353, 0.2812431411685825),
 (5274, 0.27296938355933525),
 (9027, 0.2699024130362261),
 (6827, 0.23853819069873228),
 (6559, 0.21432446619836365),
 (682, 0.15738110538202002),
 (4643, 0.15156053652156862),
 (6635, 0.14633101990519562)]

In [31]:
#extract only the top n; n here is 10
keywords=extract_topn_from_vector(feature_names,sorted_items,10)

# now print the results
print("\n=====Title=====")
print(docs_title[0])
print("\n=====Body=====")
print(docs_body[0])
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


=====Title=====
Integrate War-Plugin for m2eclipse into Eclipse Project

=====Body=====
<p>I set up a small web project with JSF and Maven. Now I want to deploy on a Tomcat server. Is there a possibility to automate that like a button in Eclipse that automatically deploys the project to Tomcat?</p>

<p>I read about a the <a href="http://maven.apache.org/plugins/maven-war-plugin/" rel="nofollow noreferrer">Maven War Plugin</a> but I couldn't find a tutorial how to integrate that into my process (eclipse/m2eclipse).</p>

<p>Can you link me to help or try to explain it. Thanks.</p>

===Keywords===
eclipse 0.593
war 0.317
integrate 0.281
maven 0.273
tomcat 0.27
project 0.239
plugin 0.214
automate 0.157
jsf 0.152
possibility 0.146


From the keywords above, the top keywords actually make sense: the question talks about `eclipse`, `maven`, `integrate`, `war` and `tomcat`, which are all unique to this specific question. There are a couple of keywords that could have been eliminated, such as `possibility` and perhaps even `project`. You can do this by adding more common words to your stop list, and you can even create your own stop list, very specific to your domain, as [described here](http://kavita-ganesan.com/tips-for-constructing-custom-stop-word-lists/).
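
One way to extend the stop list is sketched below. The base list here is a tiny stand-in for the contents of `resources/stopwords.txt`, and the variable names are hypothetical; the two domain words are the ones mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the general-purpose stop words loaded from file.
base_stopwords = ["the", "a", "to"]
# Domain-specific words we decided are too generic to be keywords.
domain_stopwords = ["possibility", "project"]

# Merge both lists; passing a plain list keeps recent scikit-learn happy.
custom_stopwords = sorted(set(base_stopwords) | set(domain_stopwords))

toy_cv = CountVectorizer(stop_words=custom_stopwords)
toy_cv.fit(["deploy the project to tomcat", "a possibility to automate eclipse"])

print(sorted(toy_cv.vocabulary_))  # the domain words no longer appear
```

After changing the stop list you would refit both the CountVectorizer and the TfidfTransformer, since the vocabulary and the IDF values depend on it.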



In [15]:
# put the common code into several methods
def get_keywords(idx):

    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([docs_test[idx]]))

    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)
    
    return keywords

def print_results(idx,keywords):
    # now print the results
    print("\n=====Title=====")
    print(docs_title[idx])
    print("\n=====Body=====")
    print(docs_body[idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k,keywords[k])



Now let's look at keywords generated for a much longer question: 


In [16]:
idx=120
keywords=get_keywords(idx)
print_results(idx,keywords)


=====Title=====
SQL Import Wizard - Error

=====Body=====
<p>I have a CSV file that I'm trying to import into SQL Management Server Studio.</p>

<p>In Excel, the column giving me trouble looks like this:
<a href="https://i.stack.imgur.com/pm0uS.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/pm0uS.png" alt="enter image description here"></a></p>

<p>Tasks > import data > Flat Source File > select file</p>

<p><a href="https://i.stack.imgur.com/G4b6I.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/G4b6I.png" alt="enter image description here"></a></p>

<p>I set the data type for this column to DT_NUMERIC, adjust the DataScale to 2 in order to get 2 decimal places, but when I click over to Preview, I see that it's clearly not recognizing the numbers appropriately:</p>

<p><a href="https://i.stack.imgur.com/NZhiQ.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/NZhiQ.png" alt="enter image description here"></a></p>

<p>The column ma

## Generate keywords for a batch of documents

In [17]:
#generate tf-idf for all documents in your list. docs_test has 500 documents
tf_idf_vector=tfidf_transformer.transform(cv.transform(docs_test))

results=[]
for i in range(tf_idf_vector.shape[0]):
    
    # get vector for a single document
    curr_vector=tf_idf_vector[i]
    
    #sort the tf-idf vector by descending order of scores
    sorted_items=sort_coo(curr_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)
    
    
    results.append(keywords)

# pair each test document with its own keywords (docs_test, not the training docs)
df=pd.DataFrame(zip(docs_test,results),columns=['doc','keywords'])
df

Unnamed: 0,doc,keywords
0,integrate war plugin for m eclipse into eclips...,"{'eclipse': 0.593, 'war': 0.317, 'integrate': ..."
1,phantomjs node page evaulate seems to hang i h...,"{'evaluate': 0.472, 'content': 0.403, 'console..."
2,dynamic operations can only be performed in ho...,"{'appdomain': 0.409, 'dynamic': 0.384, 'perfor..."
3,css with relative url to background image i ha...,"{'image': 0.424, 'jpg': 0.412, 'background': 0..."
4,share canvas image on android hello so i write...,"{'uri': 0.371, 'bitmap': 0.318, 'intent': 0.30..."
...,...,...
495,is there any net string format compatible func...,"{'delphi': 0.617, 'compatible': 0.365, 'win': ..."
496,how to handle failed xpath lookup in msxml fro...,"{'node': 0.547, 'selectsinglenode': 0.304, 'nu..."
497,logo animation don t know where to begin i m t...,"{'logo': 0.549, 'step': 0.33, 'triangle': 0.32..."
498,how to introduce a new variable in coq i was w...,"{'length': 0.426, 'ev': 0.415, 'introduce': 0...."


Voilà! Now you can extract important keywords from any type of text!