# Keyword Extraction APIs and Packages   

- Keyword Extraction is the automated procedure of finding the most relevant and important words and phrases from text.  
- A Comprehensive Guide to Keyword Extraction analysis: what it is, how it works, use cases:    
https://monkeylearn.com/keyword-extraction/

## RAKE NLTK

- RAKE NLTK is a Python implementation of the **Rapid Automatic Keyword Extraction (RAKE) algorithm** that uses NLTK under the hood, which makes it easier to extend and to combine with other text analysis tasks.  
- GitHub repo: https://github.com/csurfer/rake-nltk  
- It implements the algorithm described in the paper **Automatic keyword extraction from individual documents** by Stuart Rose, Dave Engel, Nick Cramer and Wendy Cowley.  
- There are other popular implementations in Python (aneesha/RAKE) and Node (waseem18/node-rake), but neither uses NLTK. By making NLTK an integral part of the implementation, the author gains the flexibility to extend it in other ways.

In [7]:
from rake_nltk import Rake

In [8]:
# Uses English stopwords from NLTK and all punctuation characters by default
r = Rake()

# raw text to extract keywords
myText = '''
Keyword extraction allows business to sift through big data to capture the most important words that best 
describe the text (e.g. customer review) in just seconds, obtain insights about the topics that your customers are 
talking about while saving your teams many hours of manual processing. 
It also provides you with actionable insights that you can use to make better business decisions.
The best thing about keyword extraction models is that they are easy to set up and implement.
Keyword extraction can help you obtain the most important keywords or key phrases from a given text without having to actually read a single line.
'''

# Extraction given the text.
r.extract_keywords_from_text(myText)

# Extraction given the list of strings where each string is a sentence.
#r.extract_keywords_from_sentences(<list of sentences>)

# To get keyword phrases ranked highest to lowest.
r.get_ranked_phrases()

['make better business decisions',
 'keyword extraction allows business',
 'teams many hours',
 'keyword extraction models',
 'given text without',
 'keyword extraction',
 'single line',
 'manual processing',
 'key phrases',
 'important words',
 'important keywords',
 'customer review',
 'big data',
 'best thing',
 'best describe',
 'also provides',
 'actually read',
 'actionable insights',
 'obtain insights',
 'text',
 'obtain',
 'use',
 'topics',
 'talking',
 'sift',
 'set',
 'seconds',
 'saving',
 'implement',
 'help',
 'g',
 'easy',
 'e',
 'customers',
 'capture']

### Observation
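RAKE-NLTK extracts sensible keywords and keyphrases from the sample text, although stray tokens such as 'e' and 'g' (split out of "e.g.") also appear at the bottom of the ranking. The next cell shows the ranking scores as well.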

In [9]:
# To get keyword phrases ranked highest to lowest with scores.
r.get_ranked_phrases_with_scores()

[(16.0, 'make better business decisions'),
 (14.0, 'keyword extraction allows business'),
 (9.0, 'teams many hours'),
 (9.0, 'keyword extraction models'),
 (8.0, 'given text without'),
 (6.0, 'keyword extraction'),
 (4.0, 'single line'),
 (4.0, 'manual processing'),
 (4.0, 'key phrases'),
 (4.0, 'important words'),
 (4.0, 'important keywords'),
 (4.0, 'customer review'),
 (4.0, 'big data'),
 (4.0, 'best thing'),
 (4.0, 'best describe'),
 (4.0, 'also provides'),
 (4.0, 'actually read'),
 (4.0, 'actionable insights'),
 (3.5, 'obtain insights'),
 (2.0, 'text'),
 (1.5, 'obtain'),
 (1.0, 'use'),
 (1.0, 'topics'),
 (1.0, 'talking'),
 (1.0, 'sift'),
 (1.0, 'set'),
 (1.0, 'seconds'),
 (1.0, 'saving'),
 (1.0, 'implement'),
 (1.0, 'help'),
 (1.0, 'g'),
 (1.0, 'easy'),
 (1.0, 'e'),
 (1.0, 'customers'),
 (1.0, 'capture')]
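
The scores above can be reproduced by hand. The following is a minimal sketch of RAKE's degree-to-frequency scoring, not the rake-nltk source: the tiny stopword set and the helper names `candidate_phrases` and `rake_scores` are assumptions made for illustration (rake-nltk uses NLTK's full English stopword list). Each word is scored as degree / frequency, where degree counts the words it co-occurs with inside candidate phrases (itself included), and a phrase scores the sum of its words.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword set (an assumption; rake-nltk uses NLTK's English list).
STOPWORDS = {"the", "to", "of", "is", "a", "and", "that", "can", "you"}

def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at stopwords and punctuation."""
    phrases = []
    # Punctuation always ends a phrase, so split into fragments first.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(phrases):
    """Score each phrase as the sum of degree(word) / frequency(word)."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)}

sample = "Keyword extraction can help you obtain the most important keywords."
scores = rake_scores(candidate_phrases(sample))
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(score, phrase)
```

Because every word in an n-word phrase that occurs once has degree n, such a phrase scores n squared, which is why longer phrases dominate the top of the ranking above.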

## Scikit-Learn with TF-IDF    
- Use Scikit-learn to extract keywords with TF-IDF (Score)    
- An excellent tutorial: 
https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/  
- Tutorial repo with stackoverflow data samples: https://github.com/kavgan/nlp-in-practice/tree/master/tf-idf  
- Tutorial notebook: https://github.com/kavgan/nlp-in-practice/blob/master/tf-idf/Keyword%20Extraction%20with%20TF-IDF%20and%20SKlearn.ipynb  

### Using Stackoverflow as sample data

In [10]:
import pandas as pd

# read json into a dataframe
# data source acknowledgement: https://github.com/kavgan/nlp-in-practice/tree/master/tf-idf/data
df_idf=pd.read_json("./stackoverflow-data-idf.json",lines=True)

# print schema
print("Schema:\n\n",df_idf.dtypes)
print("Number of questions,columns=",df_idf.shape)
print(f'stackoverflow training dataset shape:{df_idf.shape}')

Schema:

 accepted_answer_id          float64
answer_count                  int64
body                         object
comment_count                 int64
community_owned_date         object
creation_date                object
favorite_count              float64
id                            int64
last_activity_date           object
last_edit_date               object
last_editor_display_name     object
last_editor_user_id         float64
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
title                        object
view_count                    int64
dtype: object
Number of questions,columns= (20000, 19)
stackoverflow training dataset shape:(20000, 19)


In [11]:
df_idf.head(3)

Unnamed: 0,accepted_answer_id,answer_count,body,comment_count,community_owned_date,creation_date,favorite_count,id,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,post_type_id,score,tags,title,view_count
0,,1,<p>I have a public class that contains a priva...,0,,2011-01-27 20:19:13.563 UTC,,4821394,2011-01-27 20:21:37.59 UTC,,,,,163534.0,1,0,c#|serialization|xml-serialization,Serializing a private struct - Can it be done?,296
1,3367943.0,2,<p>I have the following HTML:</p>\n\n<pre><cod...,2,,2010-07-30 00:01:50.9 UTC,0.0,3367882,2012-05-10 14:16:05.143 UTC,2012-05-10 14:16:05.143 UTC,,44390.0,,1190.0,1,2,css|overflow|css-float|crop,How do I prevent floated-right content from ov...,4121
2,,0,<p>I'm trying to run a shell script with gradl...,2,,2015-07-28 16:30:18.28 UTC,,31682135,2015-07-28 16:32:15.117 UTC,,,,,1299158.0,1,1,bash|shell|android-studio|gradle,Gradle command line,259


Take note that this stackoverflow dataset contains 19 fields including post title, body, tags, dates and other metadata which we don't need for this tutorial. **What we are mostly interested in is the body and title, which are our source of text.** We will now create a new field 'text' that combines title and body so we have them in one field. We will also print the third text entry (index 2) in our new field just to see what the text looks like.

In [12]:
import re
def pre_process(text):
    
    # lowercase
    text=text.lower()
    
    #remove tags
    text=re.sub("</?.*?>"," <> ",text) #e.g. replace </p> with <>
    
    # remove all special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text

df_idf['text'] = df_idf['title'] + df_idf['body'] #generate a new column 'text' combining both 'title' and 'body'
df_idf['text'] = df_idf['text'].apply(lambda x:pre_process(x))

In [13]:
#show the original 'title' + 'body' content
df_idf['title'][2] + df_idf['body'][2]

'Gradle command line<p>I\'m trying to run a shell script with gradle. I currently have something like this</p>\n\n<pre><code>def test = project.tasks.create("test", Exec) {\n    commandLine \'bash\', \'-c\', \'bash C:/my file dir/script.sh\'\n}\n</code></pre>\n\n<p>The problem is that I cannot run this script because i have spaces in my dir name. I have tried everything e.g: </p>\n\n<pre><code>commandLine \'bash\', \'-c\', \'bash C:/my file dir/script.sh\'.tokenize() \ncommandLine \'bash\', \'-c\', [\'bash\', \'C:/my file dir/script.sh\']\ncommandLine \'bash\', \'-c\', new StringBuilder().append(\'bash\').append(\'C:/my file dir/script.sh\')\ncommandLine \'bash\', \'-c\', \'bash "C:/my file dir/script.sh"\'\nFile dir = file(\'C:/my file dir/script.sh\')\ncommandLine \'bash\', \'-c\', \'bash \' + dir.getAbsolutePath();\n</code></pre>\n\n<p>Im using windows7 64bit and if I use a path without spaces the script runs perfectly, therefore the only issue as I can see is how gradle handles spa

In [14]:
#show a sample of cleaned 'text' column
df_idf['text'][2]

'gradle command line i m trying to run a shell script with gradle i currently have something like this def test project tasks create test exec commandline bash c bash c my file dir script sh the problem is that i cannot run this script because i have spaces in my dir name i have tried everything e g commandline bash c bash c my file dir script sh tokenize commandline bash c bash c my file dir script sh commandline bash c new stringbuilder append bash append c my file dir script sh commandline bash c bash c my file dir script sh file dir file c my file dir script sh commandline bash c bash dir getabsolutepath im using windows bit and if i use a path without spaces the script runs perfectly therefore the only issue as i can see is how gradle handles spaces '

The raw text is not pretty with all the HTML in it, and the cleaned version still contains code fragments, but that's the point: even from such a mess we can extract useful keywords. While you could strip all code from the text, we keep the code sections in this tutorial for the sake of simplicity.
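
To see what each step of `pre_process` contributes, here is a small self-contained sketch that applies the same two regular expressions to a made-up HTML snippet (the sample string and the `demo_` names are assumptions for illustration only).

```python
import re

def demo_pre_process(text):
    """Same steps as pre_process above: lowercase, drop tags, drop digits/punctuation."""
    text = text.lower()
    # Replace every opening or closing tag with a placeholder surrounded by spaces.
    text = re.sub("</?.*?>", " <> ", text)
    # Collapse every run of digits and non-word characters into a single space.
    text = re.sub("(\\d|\\W)+", " ", text)
    return text

demo_raw = "<p>Run 2 scripts!</p>"
demo_clean = demo_pre_process(demo_raw)
print(repr(demo_clean))
```

Note that the result keeps a leading and trailing space, and that digits are dropped entirely, which is why "windows7 64bit" becomes "windows bit" in the cleaned sample above.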

### Creating the TF - IDF  
#### TF - Term Frequency
#### Using Sklearn's CountVectorizer to create a vocabulary and generate word counts
The next step is to start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in our dataset df_idf['text'] and generate counts for each row in df_idf['text']. The result of the last two lines is a sparse matrix representation of the counts, **meaning each column represents a word (a.k.a. term/token) in the vocabulary and each row represents a document/record in our dataset where the values are the word counts (i.e. TF)**. Note that with this representation, counts of some words could be 0 if the word did not appear in the corresponding document.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
import re

def get_stop_words(stop_file_path):
    """load stop words """
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return frozenset(stop_set)

#load a set of stopwords if you want to use custom stopwords 
#stopwords=get_stop_words("resources/stopwords.txt")
stopwords=get_stop_words("./stopwords.txt")

#get the text column (note: docs contains a list of text strings for model training!)
docs=df_idf['text'].tolist() #a list of text strings
print(f'Note: docs contains a list of text strings for model training with {len(docs)} samples!')

#create a vocabulary of words
#max_df=0.85: ignore terms that appear in more than 85% of documents (corpus-specific stop words); a tunable hyperparameter, normally in [0.7,1.0]. This parameter is ignored if vocabulary is not None!
#stop_words=stopwords: also eliminate the custom stop words loaded above
cv=CountVectorizer(max_df=0.85,stop_words=stopwords)
word_count_vector=cv.fit_transform(docs) #output is a sparse matrix representation of the word counts: N_docs x M_words_in_Vocabulary

Note: docs contains a list of text strings for model training with 20000 samples!




Now let's check the shape of the resulting vector. Notice that the shape below is (20000,124901) because we have 20,000 documents in our dataset (the rows) and the vocabulary size is 124901 meaning we have 124k unique words (the columns) in our dataset minus the stopwords. **(Optional) in some of the text mining applications, such as clustering and text classification we may limit the size of the vocabulary. It's really easy to do this by setting max_features=vocab_size when instantiating CountVectorizer**.

In [16]:
print('stopwords numbers: ', len(stopwords))
print(f'word count vector shape:{word_count_vector.shape}')

stopwords numbers:  752
word count vector shape:(20000, 124901)


Let's limit our vocabulary size to 10,000

In [17]:
cv=CountVectorizer(max_df=0.85,stop_words=stopwords,max_features=10000)
word_count_vector=cv.fit_transform(docs)
word_count_vector.shape



(20000, 10000)
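
The effect of `max_df` and `max_features` is easier to see on a toy corpus. This is a small sketch with made-up documents (the `toy_`-style names are assumptions): `max_df=0.85` drops any term found in more than 85% of documents, and `max_features=3` then keeps only the three terms with the highest total count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "the" occurs in every document, so its document frequency is 100%.
toy_docs = ["the cat sat", "the cat ran", "the dog ran", "the dog barked loudly"]

toy_full = CountVectorizer().fit(toy_docs)
toy_pruned = CountVectorizer(max_df=0.85).fit(toy_docs)
toy_capped = CountVectorizer(max_df=0.85, max_features=3).fit(toy_docs)

vocab_full = sorted(toy_full.vocabulary_)
vocab_pruned = sorted(toy_pruned.vocabulary_)
vocab_capped = sorted(toy_capped.vocabulary_)

print(vocab_full)    # every term, including "the"
print(vocab_pruned)  # "the" removed by max_df
print(vocab_capped)  # only the three most frequent remaining terms
```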

In [18]:
print(type(word_count_vector))
print(word_count_vector[0,:]) #a sparse matrix representation 

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 7852)	1
  (0, 6761)	3
  (0, 8520)	5
  (0, 6888)	3
  (0, 1351)	1
  (0, 1729)	2
  (0, 6846)	1
  (0, 8498)	1
  (0, 7848)	3
  (0, 631)	1
  (0, 8486)	1
  (0, 2411)	1
  (0, 9431)	1
  (0, 9909)	1
  (0, 2831)	1
  (0, 7651)	1
  (0, 6065)	1
  (0, 9212)	1
  (0, 7849)	1
  (0, 2497)	2
  (0, 5744)	1
  (0, 9650)	1
  (0, 4689)	1


Now, let's look at 10 words from our vocabulary. Sweet, these are mostly programming related

In [19]:
list(cv.vocabulary_.keys())[:10]

['serializing',
 'private',
 'struct',
 'public',
 'class',
 'contains',
 'properties',
 'string',
 'serialize',
 'attempt']

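We can also get the vocabulary by using `get_feature_names()`. Note that this method was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions use `get_feature_names_out()` instead.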

In [20]:
list(cv.get_feature_names())[2000:2015]

['customization',
 'customize',
 'customized',
 'customlog',
 'customview',
 'cut',
 'cv',
 'cv_',
 'cval',
 'cvc',
 'cw',
 'cwd',
 'cx',
 'cx_oracle',
 'cxf']

In [21]:
print(len(list(cv.vocabulary_.keys())), len(list(cv.get_feature_names())))

10000 10000


#### IDF 
#### Use Sklearn's TfidfTransformer to Compute Inverse Document Frequency (IDF)  

In the code below, we take the TF sparse matrix from CountVectorizer and compute the IDF values by invoking `fit`. **An extremely important point to note here is that the IDF should be based on a large enough corpus that is representative of the texts you will extract keywords from (i.e. use similar and relevant text/docs in the same/similar domain as training data!).** 

To understand why IDF should be based on a fairly large collection, please read this page from Stanford's IR (Information Retrieval) online book (https://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html).

In [22]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

Let's look at some of the IDF values:

In [23]:
tfidf_transformer.idf_

array([ 7.37717703,  9.80492526,  9.51724319, ...,  8.82409601,
       10.21039037,  9.51724319])

#### Note: IDF is global, so there is just one value for each word/token in the vocabulary, regardless of how many records there are! 
Say you have 100 records. If a word/token appears in only one record, its Document Frequency (DF) is 1/100, so its raw inverse DF is 100. For another word/token that appears in 50 of the 100 records, the raw inverse DF is 2. The former therefore gets a much higher IDF score than the latter: appearing in only 1 of 100 records makes it far more distinctive of that record. Note that scikit-learn does not use the raw ratio: with `smooth_idf=True` it computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, which is why the values above lie between roughly 7 and 10 rather than in the thousands. 
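
As a quick check, here is a minimal sketch of the smoothed IDF formula that scikit-learn documents for `TfidfTransformer(smooth_idf=True)`, namely `ln((1 + n) / (1 + df)) + 1`. The helper name `smoothed_idf` is made up for this example. Under that formula, a term that occurs in just one of the 20,000 training documents gets about 10.2104, which lines up with the 10.21039037 printed in the `idf_` array above.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF as documented for scikit-learn with smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 20000  # size of the training set used in this notebook
for doc_freq in (1, 2, 3, 10000):
    # Rare terms get a high IDF; a term in half the documents gets a low one.
    print(doc_freq, round(smoothed_idf(n_docs, doc_freq), 8))
```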

In [24]:
tfidf_transformer.idf_.shape 

(10000,)

#### Computing TF-IDF and Extracting Keywords  

- Once we have our IDF computed, we are now ready to compute TF-IDF and extract the top keywords.  
- In this example, we will extract top keywords for the questions in stackoverflow test sample file: `data/stackoverflow-test.json`. This data file has 500 questions with fields identical to that of training data file: `data/stackoverflow-data-idf.json` as we saw above. We will start by reading our test file, extracting the necessary fields (title and body) and get the texts into a list. 

In [25]:
import pandas as pd 

# read test docs into a dataframe and concatenate title and body
df_test=pd.read_json("./stackoverflow-test.json",lines=True)
df_test['text'] = df_test['title'] + df_test['body']
df_test['text'] =df_test['text'].apply(lambda x:pre_process(x))
print(f'stackoverflow testing dataset shape:{df_test.shape}')

# get test docs into a list
docs_test=df_test['text'].tolist()
docs_title=df_test['title'].tolist()
docs_body=df_test['body'].tolist()

stackoverflow testing dataset shape:(500, 19)


#### In the following, we define utility functions to sort and get the top n keywords/keyphrases for each document/record according to their tf-idf scores
#### Rationale: the rarer a word is across all docs and the more often it appears in a given document, the higher its tf-idf score, so it should rank as a top keyword for that document! 

In [26]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data) #tuples of (idx_word_in_vocabulary, word_tf_idf_score)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)#sort tuples firstly according to word_tf_idf_score in descending order

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []

    for idx, score in sorted_items:
        fname = feature_names[idx]
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    #create a tuples of feature,score
    #results = zip(feature_vals,score_vals)
    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    
    return results

#a simplified version that builds the look-up dict directly
def extract_topn_from_vector2(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]
    #results is a look-up dict
    results = {}

    for idx, score in sorted_items:
        fname = feature_names[idx]
        results[fname] = round(score,3) #a dict: key:value,where key is word and value is word's tf_idf score
    
    return results

#### Extract keywords from test data documents/records
- The next step is to compute the tf-idf value for a given document in our test set by invoking `tfidf_transformer.transform(...)`. This generates a vector of tf-idf scores.  
- Next, we sort the words in the vector in descending order of tf-idf values and then iterate over them to extract the top-n items with the corresponding feature (i.e. token/word) names. In the example below, we extract keywords for the first document in our test set.
- Note: the `sort_coo(...)` function sorts the values in the vector while preserving the column index. Once you have the column index, it's easy to look up the corresponding word, as you can see in `extract_topn_from_vector2(...)` where we do `fname = feature_names[idx]` 
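
As an alternative sketch of the same sort-then-look-up idea, the snippet below uses `heapq.nlargest` so that only the top n entries are ordered instead of the whole vector. The function name `top_n_terms` and the toy inputs are assumptions for illustration; the column indices and scores stand in for the `col` and `data` arrays of a COO matrix.

```python
import heapq

def top_n_terms(col_indices, scores, feature_names, topn=10):
    """Return {term: score} for the topn highest scores, best first."""
    # Pair each score with its column index; ties break on the larger index,
    # the same ordering that sort_coo produces.
    best = heapq.nlargest(topn, zip(scores, col_indices))
    return {feature_names[idx]: round(score, 3) for score, idx in best}

# Toy stand-ins for feature_names, coo_matrix.col and coo_matrix.data.
toy_features = ["eclipse", "maven", "tomcat", "war"]
toy_cols = [0, 1, 2, 3]
toy_scores = [0.593, 0.273, 0.27, 0.317]

toy_top = top_n_terms(toy_cols, toy_scores, toy_features, topn=2)
print(toy_top)
```

For a single document the difference is negligible; the design only matters when vectors are large and n is small.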

In [27]:
# you only need to do this once
feature_names=cv.get_feature_names()

# get the document that we want to extract keywords from
doc=docs_test[0] # get the first record from a list of text strings

#generate tf-idf score vector for the given document - a key command to calculate tf-idf score! 
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())

#extract only the top n; n here is 10, change it to any top N value
top_n=10
#keywords=extract_topn_from_vector(feature_names,sorted_items,10)
keywords=extract_topn_from_vector2(feature_names,sorted_items,top_n)

# now print the results
print("\n=====Title=====")
print(docs_title[0])
print("\n=====Body=====")
print(docs_body[0])
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


=====Title=====
Integrate War-Plugin for m2eclipse into Eclipse Project

=====Body=====
<p>I set up a small web project with JSF and Maven. Now I want to deploy on a Tomcat server. Is there a possibility to automate that like a button in Eclipse that automatically deploys the project to Tomcat?</p>

<p>I read about a the <a href="http://maven.apache.org/plugins/maven-war-plugin/" rel="nofollow noreferrer">Maven War Plugin</a> but I couldn't find a tutorial how to integrate that into my process (eclipse/m2eclipse).</p>

<p>Can you link me to help or try to explain it. Thanks.</p>

===Keywords===
eclipse 0.593
war 0.317
integrate 0.281
maven 0.273
tomcat 0.27
project 0.239
plugin 0.214
automate 0.157
jsf 0.152
possibility 0.146


#### Test Result Observation
The top keywords above make sense: the question is about eclipse, maven, integrate, war and tomcat, which are all specific to this question. 

#### All in one: utility functions to select and test any document sample

In [28]:
# put the common code into helper functions
def get_keywords(idx): #idx: doc index

    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([docs_test[idx]]))

    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector2(feature_names,sorted_items,10)
    
    return keywords

def print_results(idx,keywords):
    # now print the results
    print("\n=====Title=====")
    print(docs_title[idx])
    print("\n=====Body=====")
    print(docs_body[idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k,keywords[k])

Now let's look at keywords generated for a much longer question:

In [29]:
idx=120
keywords=get_keywords(idx)
print_results(idx,keywords)


=====Title=====
SQL Import Wizard - Error

=====Body=====
<p>I have a CSV file that I'm trying to import into SQL Management Server Studio.</p>

<p>In Excel, the column giving me trouble looks like this:
<a href="https://i.stack.imgur.com/pm0uS.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/pm0uS.png" alt="enter image description here"></a></p>

<p>Tasks > import data > Flat Source File > select file</p>

<p><a href="https://i.stack.imgur.com/G4b6I.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/G4b6I.png" alt="enter image description here"></a></p>

<p>I set the data type for this column to DT_NUMERIC, adjust the DataScale to 2 in order to get 2 decimal places, but when I click over to Preview, I see that it's clearly not recognizing the numbers appropriately:</p>

<p><a href="https://i.stack.imgur.com/NZhiQ.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/NZhiQ.png" alt="enter image description here"></a></p>

<p>The column ma

#### Test on a text sample describing keyword extraction, slightly outside the training domain (stackoverflow)

In [30]:
# you only need to do this once
#feature_names=cv.get_feature_names()

# get the document that we want to extract keywords from
#doc=docs_test[0] # a list of text strings
doc = myText #the original text sample above 

#generate tf-idf for the given document - a key command to calculate tf-idf! 
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))
#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())
#extract only the top n; n here is 10
#keywords=extract_topn_from_vector(feature_names,sorted_items,10)
keywords=extract_topn_from_vector2(feature_names,sorted_items,10)
# now print the results
print("\n=====Raw Text=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


=====Raw Text=====

Keyword extraction allows business to sift through big data to capture the most important words that best 
describe the text (e.g. customer review) in just seconds, obtain insights about the topics that your customers are 
talking about while saving your teams many hours of manual processing. 
It also provides you with actionable insights that you can use to make better business decisions.
The best thing about keyword extraction models is that they are easy to set up and implement.
Keyword extraction can help you obtain the most important keywords or key phrases from a given text without having to actually read a single line.


===Keywords===
extraction 0.478
keyword 0.367
insights 0.309
business 0.241
obtain 0.237
important 0.222
best 0.167
decisions 0.164
teams 0.155
topics 0.139


#### Test on another sample (the ODA description) outside the training domain (stackoverflow)

In [31]:
# you only need to do this once
#feature_names=cv.get_feature_names()

# get the document that we want to extract keywords from
#doc=docs_test[0] # a list of text strings
doc = '''Open Digital Architecture. ODA transforms business agility, 
customer experience and operational efficiency 
by creating simpler IT solutions 
that are easier and cheaper to deploy, integrate & upgrade.''' #the TMForum ODA description

#generate tf-idf for the given document - a key command to calculate tf-idf! 
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))
#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())
#extract only the top n; n here is 10
#keywords=extract_topn_from_vector(feature_names,sorted_items,10)
keywords=extract_topn_from_vector2(feature_names,sorted_items,10)
# now print the results
print("\n=====Raw Text=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


=====Raw Text=====
Open Digital Architecture. ODA transforms business agility, 
customer experience and operational efficiency 
by creating simpler IT solutions 
that are easier and cheaper to deploy, integrate & upgrade.

===Keywords===
transforms 0.339
digital 0.324
efficiency 0.314
simpler 0.29
architecture 0.266
upgrade 0.261
integrate 0.26
business 0.254
deploy 0.241
easier 0.238


#### Observation
Good, but not perfect yet.

### Let's enhance the Sklearn TF-IDF method a bit by enabling n-gram support:

In [32]:
# Option B: do NOT remove frequent words or stop words

# Combining CountVectorizer and ngrams in Python: https://stackoverflow.com/questions/47887247/combining-countvectorizer-and-ngrams-in-python
# simply add ngram_range to the CountVectorizer constructor, e.g. ngram_range=(1,3) means we want unigrams, bigrams and trigrams
#cv=CountVectorizer(ngram_range=(1,2),max_df=0.85,stop_words=stopwords,max_features=10000)
cv=CountVectorizer(ngram_range=(1,2)) #enable unigram and bigram
word_count_vector=cv.fit_transform(docs)
print(word_count_vector.shape)
list(cv.get_feature_names())[2000:2015]

(20000, 1184760)


['_click define',
 '_click dim',
 '_click gt',
 '_click here',
 '_click object',
 '_click only',
 '_click runat',
 '_click sender',
 '_click sub',
 '_click system',
 '_click text',
 '_click this',
 '_click_',
 '_click_ sender',
 '_client']
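
To make clear what `ngram_range=(1,2)` adds, here is a plain-Python sketch of word n-gram generation. It is an illustration of the idea, not scikit-learn's implementation, and the helper name `word_ngrams` is made up.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams for n between the two bounds, inclusive."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

toy_tokens = "war plugin for eclipse".split()
toy_grams = word_ngrams(toy_tokens)
print(toy_grams)
```

Every adjacent word pair becomes its own vocabulary entry, which is why the vocabulary grows from 124,901 terms to 1,184,760 here, and why, with stop words kept, pairs such as 'plugin for' enter the vocabulary too.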

In [33]:
#Re-fit the tfidf transformer, because the count matrix (and its vocabulary) has changed:
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

#### In-Domain testing case for tf-idf method:

In [34]:
# Re-do the test 
# you only need to do this once
feature_names=cv.get_feature_names()

idx=120 #pick an arbitrary stackoverflow testing sample
keywords=get_keywords(idx)
print_results(idx,keywords)


=====Title=====
SQL Import Wizard - Error

=====Body=====
<p>I have a CSV file that I'm trying to import into SQL Management Server Studio.</p>

<p>In Excel, the column giving me trouble looks like this:
<a href="https://i.stack.imgur.com/pm0uS.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/pm0uS.png" alt="enter image description here"></a></p>

<p>Tasks > import data > Flat Source File > select file</p>

<p><a href="https://i.stack.imgur.com/G4b6I.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/G4b6I.png" alt="enter image description here"></a></p>

<p>I set the data type for this column to DT_NUMERIC, adjust the DataScale to 2 in order to get 2 decimal places, but when I click over to Preview, I see that it's clearly not recognizing the numbers appropriately:</p>

<p><a href="https://i.stack.imgur.com/NZhiQ.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/NZhiQ.png" alt="enter image description here"></a></p>

<p>The column ma

#### Observation
It now picks up the most important information about this issue, i.e. **'data conversion'**; 'sql', 'import', 'column' and 'decimal' are also relevant. The tf-idf method with n-grams does a good job here. 

In [35]:
#another in-domain testing case
idx=0 #pick another stackoverflow testing sample
keywords=get_keywords(idx)
print_results(idx,keywords)


=====Title=====
Integrate War-Plugin for m2eclipse into Eclipse Project

=====Body=====
<p>I set up a small web project with JSF and Maven. Now I want to deploy on a Tomcat server. Is there a possibility to automate that like a button in Eclipse that automatically deploys the project to Tomcat?</p>

<p>I read about a the <a href="http://maven.apache.org/plugins/maven-war-plugin/" rel="nofollow noreferrer">Maven War Plugin</a> but I couldn't find a tutorial how to integrate that into my process (eclipse/m2eclipse).</p>

<p>Can you link me to help or try to explain it. Thanks.</p>

===Keywords===
eclipse 0.342
war plugin 0.232
war 0.183
integrate 0.162
maven 0.157
tomcat 0.156
project 0.138
my process 0.13
you link 0.125
tutorial how 0.125


#### Observation
Again, n-gram tf-idf picks up the most important information about this issue, i.e. **'war plugin'**; 'eclipse', 'maven', 'tomcat' and 'project' are also relevant. Because stop words were not removed, low-value bigrams such as 'my process' and 'you link' appear as well. 

#### Out-of-Domain testing case (neuroscience) for tf-idf method:

In [36]:
# Re-do the test 
# you only need to do this once
feature_names=cv.get_feature_names()
#print(feature_names[10000:10050])

# get the document that we want to extract keywords from
#doc=docs_test[0] # a list of text strings
doc = '''NEUROSCIENCE CORE CONCEPTS
The	brain	is	the	body's	most	complex	organ.
Genetically	determined	circuits	are	the	foundation	of	the	nervous	system.
The	human	brain	endows	us	with	a	natural	curiosity	to understand	how	the	world	works.
Fundamental	discoveries	promote	healthy	living	and	treatment	of	disease.'''

#generate tf-idf for the given document - a key command to calculate tf-idf! 
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())

#extract only the top n; n here is 10
#keywords=extract_topn_from_vector(feature_names,sorted_items,10)
top_n=10
keywords=extract_topn_from_vector2(feature_names,sorted_items,top_n)

# now print the results
print("\n=====Raw Text=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


=====Raw Text=====
NEUROSCIENCE CORE CONCEPTS
The	brain	is	the	body's	most	complex	organ.
Genetically	determined	circuits	are	the	foundation	of	the	nervous	system.
The	human	brain	endows	us	with	a	natural	curiosity	to understand	how	the	world	works.
Fundamental	discoveries	promote	healthy	living	and	treatment	of	disease.

===Keywords===
brain 0.331
us with 0.209
the human 0.209
most complex 0.209
foundation of 0.209
nervous 0.201
the foundation 0.195
promote 0.195
living 0.195
disease 0.195


#### Observation
Fair result: in this out-of-domain test, the tf-idf model trained on stackoverflow data still picks up some neuroscience keywords such as 'brain' and 'the human', but misses others such as 'neuroscience'. 

### Summary of Scikit-Learn with TF-IDF for Keyword Extraction:  
- The result is promising: to some degree, the model is **able to generalise** to new text outside the training domain. In principle, though, the samples you want keywords for should come from the same or a similar domain as the training data for the tf-idf method to work properly.  
- By default only single keywords (unigrams) are extracted, not key phrases as RAKE-NLTK produces above. With n-grams enabled it returns some keyphrases such as 'the human', but important keyphrases such as 'nervous system' are still missing (because such phrases did not exist in the training data at all!)  
- This shows from another angle how heavily the accuracy of the TF-IDF approach depends on how relevant and representative the training text is for the target sample   
- Tentative conclusion: 
**highly customisable and tunable, excellent for in-domain prediction, but it requires big data to generalise well and some trial and error to tune**

Voilà! Now you can extract important keywords from any type of text!

## PKE - Python Keyword Extraction      
- pke is an open source Python-based keyphrase extraction toolkit.  
- pke provides a standardized **API** for extracting keyphrases from a document. **Inputs can be file or text string.**  
- pke uses NLTK and SpaCy under the hood 
- Tutorials and code documentation are available at https://boudinfl.github.io/pke/.  
- It provides **an end-to-end keyphrase extraction pipeline** in which each component can be easily modified or extended to develop new models. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset (https://www.aclweb.org/anthology/S10-1004/).   
- To get started, run the five calls shown in the code cell below. To use another model, replace `pke.unsupervised.TopicRank` with another model class (list of implemented models: https://github.com/boudinfl/pke#implemented-models).  
- Implemented models: pke currently implements the following keyphrase extraction models:  

- **Unsupervised models**  
  - **Statistical models**  
    - TfIdf  
    - KPMiner (El-Beltagy and Rafea, 2010)  
    - YAKE (Campos et al., 2020)  
  - **Graph-based models**  
    - TextRank (Mihalcea and Tarau, 2004)  
    - SingleRank (Wan and Xiao, 2008)  
    - TopicRank (Bougouin et al., 2013)  
    - TopicalPageRank (Sterckx et al., 2015)  
    - PositionRank (Florescu and Caragea, 2017)  
    - MultipartiteRank (Boudin, 2018)  
- **Supervised models**  
  - **Feature-based models**  
    - Kea (Witten et al., 2005)  
    - WINGNUS (Nguyen and Luong, 2010)  

In [46]:
# Installation 
# pip install git+https://github.com/boudinfl/pke.git
# install following in a cmd terminal 
# python -m nltk.downloader stopwords
# python -m nltk.downloader universal_tagset
# python -m spacy download en # download the english model (note: You can now load the model via spacy.load('en'))

In [47]:
import pke

In [48]:
# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank() 
#extractor = pke.unsupervised.MultipartiteRank() #also good, others may need to customise their hyperparameters? 
#extractor = pke.supervised.WINGNUS() #Kea() #not good

# load the content of the document, here document is expected to be in raw
# format (i.e. a simple text file) and preprocessing is carried out using spacy
#extractor.load_document(input='/path/to/input.txt', language='en')
extractor.load_document(input='./sample_text.txt', language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm by default 
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)

In [49]:
keyphrases

[('keyword extraction', 0.2653032756303063),
 ('text', 0.10994610654811321),
 ('key phrases', 0.09800165682226829),
 ('business', 0.0929954274390085),
 ('best thing', 0.08786144700445668),
 ('seconds', 0.0790037656731154),
 ('insights', 0.07613533455514761),
 ('easy', 0.07264997703427613),
 ('big data', 0.07073413624959467),
 ('manual processing', 0.047368873043712914)]
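
Because every pke model exposes the same four calls used above, swapping models can be reduced to passing a different extractor object. The helper below is a sketch under that assumption; the name `extract_keyphrases` is hypothetical and not part of pke, and some models (the supervised ones in particular) may need extra arguments that this wrapper does not pass.

```python
def extract_keyphrases(extractor, text, n=10, language="en"):
    """Run the standard pke steps on an already-constructed extractor.

    `extractor` is any object offering load_document, candidate_selection,
    candidate_weighting and get_n_best, e.g. pke.unsupervised.TopicRank().
    """
    extractor.load_document(input=text, language=language)
    extractor.candidate_selection()   # pick candidate phrases
    extractor.candidate_weighting()   # score the candidates
    return extractor.get_n_best(n=n)  # list of (keyphrase, score) tuples


# Usage (assumes pke, its NLTK data and the spaCy model are installed):
# import pke
# text = "some document text"
# for model_cls in (pke.unsupervised.TopicRank, pke.unsupervised.MultipartiteRank):
#     print(model_cls.__name__, extract_keyphrases(model_cls(), text, n=5))
```

Creating a fresh extractor for each document keeps state from one run out of the next.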

In [50]:
# Another zero-shot testing case:
idx=0 # 120
print(f'text sample:\n{docs_test[idx] }')

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank() 
#extractor = pke.unsupervised.MultipartiteRank() #also good, others may need to customise their hyperparameters? 
#extractor = pke.supervised.WINGNUS() #Kea() #not good

# load the content of the document, here document is expected to be in raw
# format (i.e. a simple text file) and preprocessing is carried out using spacy
#extractor.load_document(input='/path/to/input.txt', language='en')
extractor.load_document(input=docs_test[idx], language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)
keyphrases

text sample:
integrate war plugin for m eclipse into eclipse project i set up a small web project with jsf and maven now i want to deploy on a tomcat server is there a possibility to automate that like a button in eclipse that automatically deploys the project to tomcat i read about a the maven war plugin but i couldn t find a tutorial how to integrate that into my process eclipse m eclipse can you link me to help or try to explain it thanks 


[('small web project', 0.16383828363291691),
 ('eclipse', 0.15532567270795175),
 ('maven', 0.15244269621720916),
 ('jsf', 0.12483748240736106),
 ('tomcat server', 0.08416212975417435),
 ('possibility', 0.08249195204917972),
 ('button', 0.0726443762532862),
 ('war', 0.0708371213170549),
 ('tutorial', 0.05479219595990904),
 ('thanks', 0.038628089700957394)]

### PKE Prediction Results Observation
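PKE is easy to use and fairly accurate: with no training or tuning, TopicRank returned sensible keyphrases for both samples (for example 'keyword extraction', 'tomcat server' and 'maven'), although generic words such as 'possibility' and 'thanks' also appear towards the bottom of the ranking.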

## FlashText      
- This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm (https://arxiv.org/abs/1711.00046).  
- Github Repo: https://github.com/vi3k6i5/flashtext  
- Documentation can be found at (https://flashtext.readthedocs.io/en/latest/)  
- Input is a text string. Drawback: you have to supply a predefined set of keywords. 

In [51]:
# Installation 
# pip install flashtext

In [56]:
from flashtext import KeywordProcessor

#### Extract Keywords Example

In [57]:
keyword_processor = KeywordProcessor()
# keyword_processor.add_keyword(<unclean name>, <standardised name>)
# Matching is case-insensitive by default; the optional second argument is the
# clean (standardised) name that is returned whenever the keyword is found.
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keyword_processor.add_keyword('action', 'Action Point')
keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area. My aCtION is to complete a draft')
keywords_found

['New York', 'Bay Area', 'Action Point']
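
To see why this kind of lookup needs a predefined list yet stays fast, here is a simplified pure-Python illustration of the trie idea behind FlashText. It is a sketch, not the library's code: the class name `TinyKeywordMatcher` is hypothetical, and word boundaries are assumed to be any character other than a letter, digit or underscore.

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")


class TinyKeywordMatcher:
    """Toy trie-based keyword extractor (case-insensitive, longest match wins)."""

    _END = "_end_"  # multi-character marker, so it can never clash with a single-character key

    def __init__(self):
        self.trie = {}

    def add_keyword(self, keyword, clean_name=None):
        node = self.trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[self._END] = clean_name or keyword

    def extract_keywords(self, text):
        text = text.lower()
        found, i, n = [], 0, len(text)
        while i < n:
            # Only start matching at the first character of a word
            if text[i] not in WORD_CHARS or (i > 0 and text[i - 1] in WORD_CHARS):
                i += 1
                continue
            node, j = self.trie, i
            best_name, best_end = None, i
            while j < n and text[j] in node:
                node = node[text[j]]
                j += 1
                # Accept a match only if it ends at a word boundary
                if self._END in node and (j == n or text[j] not in WORD_CHARS):
                    best_name, best_end = node[self._END], j
            if best_name is not None:
                found.append(best_name)
                i = best_end
            else:
                i += 1
        return found


matcher = TinyKeywordMatcher()
matcher.add_keyword("Big Apple", "New York")
matcher.add_keyword("Bay Area")
matcher.add_keyword("action", "Action Point")
matcher.add_keyword("Business")
matcher.add_keyword("Business Decisions")

found = matcher.extract_keywords(
    "I love Big Apple and Bay Area. My aCtION is to make business decisions"
)
print(found)
```

The text is scanned once from left to right and each step is a dictionary lookup, so adding more keywords adds little work per character. That is the property the FlashText paper emphasises over trying one regular expression per keyword.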

In [58]:
keyword_processor.add_keyword('Keyword Extraction')
keyword_processor.add_keyword('Business Decision')
keyword_processor.add_keyword('Business Decisions')
keyword_processor.add_keyword('Business Value')
keyword_processor.add_keyword('Business Values')
keyword_processor.add_keyword('Business Intelligence')
keyword_processor.add_keyword('Business Analytics')
keyword_processor.add_keyword('Business Analysis')
keyword_processor.add_keyword('Insight')
keyword_processor.add_keyword('Insights')
keyword_processor.add_keyword('Actionable Insight')
keyword_processor.add_keyword('Actionable Insights')
keyword_processor.add_keyword('Use Case')
keyword_processor.add_keyword('Use Cases')
keyword_processor.add_keyword('Business Scenario')
keyword_processor.add_keyword('Business Scenarios')
keyword_processor.add_keyword('Data-driven')
keyword_processor.add_keyword('Data-driven Decision')
keyword_processor.add_keyword('Data-driven Decisions')
keyword_processor.add_keyword('Data-driven Strategy')
keyword_processor.add_keyword('Data-driven Strategies')
keyword_processor.add_keyword('Manual')
keyword_processor.add_keyword('Automate')
keyword_processor.add_keyword('Automation')
keyword_processor.add_keyword('Save Time')
keyword_processor.add_keyword('Manual Work')
keyword_processor.add_keyword('Efficient')
keyword_processor.add_keyword('Efficiency')
keyword_processor.add_keyword('Productivity')
keyword_processor.add_keyword('KPI')
keyword_processor.add_keyword('KPIs')
keyword_processor.add_keyword('Speed')
keyword_processor.add_keyword('Speed To Market')
keyword_processor.add_keyword('Simplicity')
keyword_processor.add_keyword('Simplify')
keyword_processor.add_keyword('Simplification')
keyword_processor.add_keyword('Radical Simplicity')
keyword_processor.add_keyword('TCO')
keyword_processor.add_keyword('Total Cost of Ownership')
keyword_processor.add_keyword('Local Market')
keyword_processor.add_keyword('LM')
keyword_processor.add_keyword('OpCo')
keyword_processor.add_keyword('Big Four')
keyword_processor.add_keyword('Big 4')
keyword_processor.add_keyword('Customer-oriented')
keyword_processor.add_keyword('Customer Experience')
keyword_processor.add_keyword('Customer Experiences')
keyword_processor.add_keyword('Customer Support')
keyword_processor.add_keyword('Customer Service')
keyword_processor.add_keyword('Customer Support Ticket')
keyword_processor.add_keyword('Customer Feedback')
keyword_processor.add_keyword('Customer Survey')
keyword_processor.add_keyword('Customer Review')
keyword_processor.add_keyword('Churn')
keyword_processor.add_keyword('Churn Rate')
keyword_processor.add_keyword('NPS')
keyword_processor.add_keyword('Strategy')
keyword_processor.add_keyword('Data-driven Strategy')
keyword_processor.add_keyword('Tech2025')
keyword_processor.add_keyword('Tech Company')
keyword_processor.add_keyword('Techy Company')
keyword_processor.add_keyword('Telco Company')
keyword_processor.add_keyword('Telco')
keyword_processor.add_keyword('Telcos')
keyword_processor.add_keyword('Big Data')
keyword_processor.add_keyword('Software')
keyword_processor.add_keyword('Machine Learning')
keyword_processor.add_keyword('AI')
keyword_processor.add_keyword('Artificial Intelligence')
keyword_processor.add_keyword('GCP')
keyword_processor.add_keyword('Google Cloud Platform')
keyword_processor.add_keyword('Vendor')
keyword_processor.add_keyword('Vendor Management')
keyword_processor.add_keyword('Supply Chain')
keyword_processor.add_keyword('Social Media')
keyword_processor.add_keyword('Brand')
keyword_processor.add_keyword('Image')
keyword_processor.add_keyword('Reputation')
keyword_processor.add_keyword('Analytics')
keyword_processor.add_keyword('Analytics CoE')
keyword_processor.add_keyword('Manual Processing')
keyword_processor.add_keyword('Easy to Use')
keyword_processor.add_keyword('Easy to Set Up')
keyword_processor.add_keyword('Seconds')
keyword_processor.add_keyword('Minutes')
keyword_processor.add_keyword('Hours')
keyword_processor.add_keyword('Days')
keyword_processor.add_keyword('Weeks')
keyword_processor.add_keyword('Months')
keyword_processor.add_keyword('Years')
keyword_processor.add_keyword('Delay')
keyword_processor.add_keyword('Latency')
keyword_processor.add_keyword('Reliable')
keyword_processor.add_keyword('Reliability')
keyword_processor.add_keyword('Security')
keyword_processor.add_keyword('Flexibility')
keyword_processor.add_keyword('Flexible')
keyword_processor.add_keyword('Social Media')
keyword_processor.add_keyword('False Information')

True

In [59]:
test_str= '''"Keyword extraction allows business to sift through big data to capture the most important words that best"
"describe the text (e.g. customer review) in just seconds, obtain insights about the topics that your customers are" 
"talking about while saving your teams many hours of manual processing. "
"It also provides you with actionable insights that you can use to make better business decisions."
"The best thing about keyword extraction models is that they are easy to set up and implement."
"Keyword extraction can help you obtain the most important keywords or key phrases from a given text without having to"
"actually read a single line."
"Social Media giants like google does that."
"Social media giants like Facebook are under increasing pressure to stop the spread of false information"'''
test_str_lower = test_str.lower()
print(test_str_lower)

# Lowercasing is optional here: KeywordProcessor matches case-insensitively by default.
keywords_found = keyword_processor.extract_keywords(test_str_lower)
set(keywords_found)


"keyword extraction allows business to sift through big data to capture the most important words that best"
"describe the text (e.g. customer review) in just seconds, obtain insights about the topics that your customers are" 
"talking about while saving your teams many hours of manual processing. "
"it also provides you with actionable insights that you can use to make better business decisions."
"the best thing about keyword extraction models is that they are easy to set up and implement."
"keyword extraction can help you obtain the most important keywords or key phrases from a given text without having to"
"actually read a single line."
"social media giants like google does that."
"social media giants like facebook are under increasing pressure to stop the spread of false information"


{'Actionable Insights',
 'Big Data',
 'Business Decisions',
 'Customer Review',
 'Easy to Set Up',
 'False Information',
 'Hours',
 'Insights',
 'Keyword Extraction',
 'Manual Processing',
 'Seconds',
 'Social Media'}

#### Summary of FlashText 
1. Pros: highly flexible; you define your own custom keyword list  
2. Cons: the keyword list must be defined manually in advance, and the input must be a text string, so a file has to be read into a string first  
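
The string-only input is easy to work around by reading the file yourself. A small sketch follows; the helper name `extract_keywords_from_file` is hypothetical, and it only assumes that the processor offers an `extract_keywords(text)` method, as FlashText's `KeywordProcessor` does.

```python
from pathlib import Path


def extract_keywords_from_file(processor, path, encoding="utf-8"):
    """Read a text file and pass its content to processor.extract_keywords."""
    text = Path(path).read_text(encoding=encoding)
    return processor.extract_keywords(text)


# Usage (assumes flashtext is installed and ./sample_text.txt exists):
# from flashtext import KeywordProcessor
# kp = KeywordProcessor()
# kp.add_keyword('Big Data')
# print(extract_keywords_from_file(kp, './sample_text.txt'))
```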

### Tentative Assessment of Keyword Extraction Tools  
#### 1. PKE  
- Pros: 'best of breed' in these tests: accurate, powerful (many algorithms supported), accepts both file and string inputs, easy to use, and generalises well in zero-shot prediction for most use cases     
- Cons: none found so far in these tests    

#### 2. RAKE-NLTK   
- Pros: pretty accurate, automatic and easy to use  
- Cons: results are a bit verbose because the default phrase `max_length` is high; lowering it gives shorter phrases  

#### 3. FlashText  
- Pros: highly customisable, accurate and optimised to specific domain vocabulary  
- Cons: need to manually define a custom keyword look-up list     

#### 4. Scikit-Learn TF-IDF  
- Pros: highly tunable, excellent for in-domain prediction  
- Cons: it needs a large corpus to generalise and some manual work; its quality depends on the training data, which must be relevant to and representative of the domain you test on 

#### Summary
If you have a text from which you'd like to extract keywords but no idea where to start, try **PKE** first: its zero-shot prediction ability often delivers good results out of the box. 