# 1 Introduction
In this tutorial you will learn how to extract keywords automatically using both Python and Java. You will also learn about related tasks such as keyphrase extraction with a controlled vocabulary (in other words, text classification into a very large set of possible classes) and terminology extraction.

The tutorial is organized as follows. First, we cover some background: what are keywords, and how does a keyword extraction algorithm work? Then we demonstrate simple, but in many cases effective, keyword extraction with a Python library called RAKE. Finally, we show how a Java tool called Maui extracts keywords using a machine-learning technique.

# 1.1 Why extract keywords?
Extracting keywords is one of the most important tasks when working with text. Readers benefit from keywords because they can judge more quickly whether the text is worth reading. Website creators benefit from keywords because they can group similar content by its topics. Algorithm programmers benefit from keywords because they reduce the dimensionality of text to the most important features. And these are just some examples...

By definition, keywords describe the main topics expressed in a document. The terminology can get a little confusing, so the image below compares related tasks in terms of the source of terminology and number of topics selected per document.



In this tutorial we will focus on two specific tasks and their evaluation:

1. Extracting the most significant words and phrases that appear in a given text
2. Identifying a set of topics from a predefined vocabulary that match a given text

If consistency of keywords across many documents is important, I recommend using a vocabulary (or a lexicon or a thesaurus) unless that is not possible for some reason.

A couple of words for those interested in text categorization (also called text classification), another popular task when working with text: if the number of categories is very large, you will struggle to collect enough training data for supervised classification. So, if you have 100 or more categories, and you can name these categories (they are not abstract), you are dealing with fine-grained categorization. We can treat this task as keyword extraction with a controlled vocabulary, or term assignment. So, read on, this tutorial is also for you!
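
To make term assignment concrete, here is a minimal sketch that matches a text against a small predefined vocabulary. The vocabulary, its surface forms, and the `assign_terms` helper are all hypothetical and chosen only for illustration; real systems use much larger vocabularies and smarter matching and scoring.

```python
import re

# Hypothetical controlled vocabulary: preferred topic -> surface forms in text
VOCABULARY = {
    "machine learning": ["machine learning", "ml"],
    "information retrieval": ["information retrieval", "search engines"],
    "natural language processing": ["natural language processing", "nlp"],
}


def assign_terms(text, vocabulary):
    """Return (topic, match count) pairs for vocabulary topics found in text."""
    lowered = text.lower()
    counts = {}
    for topic, forms in vocabulary.items():
        # Count whole-word matches of every surface form of this topic
        n = sum(
            len(re.findall(r"\b" + re.escape(form) + r"\b", lowered))
            for form in forms
        )
        if n:
            counts[topic] = n
    # Most frequently mentioned topics first
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


topics = assign_terms(
    "NLP and machine learning go together; ML powers modern NLP.", VOCABULARY
)
print(topics)
```

Because every surface form maps back to one preferred topic, two documents that say "ML" and "machine learning" end up with the same keyword, which is the consistency benefit mentioned above.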

# 2 How does keyword extraction work?
A typical keyword extraction algorithm has three main components:

1. Candidate selection: Here, we extract all possible words, phrases, terms or concepts (depending on the task) that can potentially be keywords.
2. Properties calculation: For each candidate, we need to calculate properties that indicate that it may be a keyword. For example, a candidate appearing in the title of a book is a likely keyword.
3. Scoring and selecting keywords: All candidates can be scored by either combining the properties into a formula, or using a machine learning technique to determine the probability of a candidate being a keyword. A score or probability threshold, or a limit on the number of keywords, is then used to select the final set of keywords.


Finally, parameters such as the minimum frequency of a candidate, its minimum and maximum length in words, or the stemmer used to normalize the candidates help tweak the algorithm’s performance to a specific dataset.
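
The three components and the tuning parameters can be sketched in a few lines of Python. This is an illustrative toy, not the algorithm used by RAKE or Maui: the stop word set, the two properties (frequency and relative first position), the scoring formula, and all function names are assumptions made for this example.

```python
import re
from collections import Counter

# Illustrative stop words only; real stoplists are much longer
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def select_candidates(text, max_words=3):
    """Step 1: collect word n-grams that do not start or end with a stop word."""
    words = re.findall(r"[a-z]+", text.lower())
    candidates = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates.append((" ".join(gram), i))
    return candidates, len(words)


def compute_properties(candidates, total_words):
    """Step 2: compute frequency and relative first position per candidate."""
    freq = Counter(phrase for phrase, _ in candidates)
    first = {}
    for phrase, position in candidates:
        first.setdefault(phrase, position)  # keep the earliest position only
    return {
        phrase: {
            "frequency": freq[phrase],
            "first_position": first[phrase] / total_words,
        }
        for phrase in freq
    }


def score_and_select(properties, min_frequency=1, top_n=5):
    """Step 3: combine properties into a score and keep the top candidates."""
    scored = {
        phrase: props["frequency"] * (1.0 - props["first_position"])
        for phrase, props in properties.items()
        if props["frequency"] >= min_frequency
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


text = "Keyword extraction is fun. Keyword extraction is useful."
candidates, total = select_candidates(text)
keywords = score_and_select(compute_properties(candidates, total), min_frequency=2)
print(keywords)
```

Here `max_words`, `min_frequency` and `top_n` play the role of the tuning parameters: raising `min_frequency` to 2 discards every candidate that appears only once. Note that this sketch lets candidates cross sentence boundaries, which a real implementation would normally prevent.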

# 3 Keyword extraction with Python using RAKE
For Python users, there is an easy-to-use keyword extraction library called RAKE, which stands for Rapid Automatic Keyword Extraction. The algorithm itself is described in the book Text Mining: Applications and Theory, edited by Michael W. Berry and Jacob Kogan (free PDF). Here, we follow the existing Python implementation. There is also a modified version that uses the natural language processing toolkit NLTK for some of the calculations. For this tutorial, I have forked and extended the original RAKE repository into RAKE-tutorial in order to use additional parameters and evaluate its performance.

# 3.1 Setting up RAKE
First, you will need to get the RAKE-tutorial repo from https://github.com/zelandiya/RAKE-tutorial.

In [2]:
#!git clone https://github.com/zelandiya/RAKE-tutorial
!python RAKE-tutorial/setup.py install

running install
running bdist_egg
running egg_info
writing nlp_rake.egg-info/PKG-INFO
writing dependency_links to nlp_rake.egg-info/dependency_links.txt
writing top-level names to nlp_rake.egg-info/top_level.txt
package init file './__init__.py' not found (or not a regular file)
reading manifest file 'nlp_rake.egg-info/SOURCES.txt'
writing manifest file 'nlp_rake.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py

creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying nlp_rake.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying nlp_rake.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying nlp_rake.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying nlp_rake.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating 'dist/nlp_rake-1.0-py3.7.egg' and adding 'build

Then install the python-rake package and, following the instructions in rake_tutorial.py, import RAKE along with the operator module, which is used in the “Behind the scenes” part of this tutorial:

In [2]:
!pip install python-rake

Collecting python-rake
  Downloading python_rake-1.5.0-py3-none-any.whl (14 kB)
Installing collected packages: python-rake
Successfully installed python-rake-1.5.0


In [8]:
import RAKE
import operator

# 3.2 Applying RAKE on a piece of text
First, let us initialize RAKE with a path to a stop words list and run it on a short sentence:

In [17]:
from RAKE import Rake

# Build the extractor from a stop words file, then extract scored phrases
rake_object = Rake("RAKE-tutorial/data/stoplists/SmartStoplist.txt")
rake_object.run('it was a good day today very nice.')

[('good day today', 9.0), ('nice', 1.0)]
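
The scores in this output follow from RAKE's scoring scheme: the text is split into candidate phrases at stop words and punctuation, each word is scored as its degree (the total length of the phrases it appears in) divided by its frequency, and a phrase's score is the sum of its word scores. The sketch below is a simplified re-implementation for illustration, not the library's code; the four-word stoplist and the function names are assumptions.

```python
import re
from collections import defaultdict

# Hypothetical mini stoplist, just enough for the sentence used above
STOPWORDS = {"it", "was", "a", "very"}


def candidate_phrases(text, stopwords):
    """Split text into phrases (lists of words) at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[.!?,;:\n]+", text.lower()):
        current = []
        for word in re.findall(r"\w+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def score_phrases(phrases):
    """Score each phrase as the sum of degree(word) / frequency(word)."""
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {word: degree[word] / frequency[word] for word in frequency}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


result = score_phrases(
    candidate_phrases("it was a good day today very nice.", STOPWORDS)
)
print(result)  # [('good day today', 9.0), ('nice', 1.0)]
```

Each of the three words in "good day today" has degree 3 and frequency 1, so the phrase scores 3 + 3 + 3 = 9, while the single-word phrase "nice" scores 1. This is why RAKE favors longer phrases.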

To change how the stop words file is read in, pass your own regular expression as the regex argument. The default regex is [\W\n]+.

    RAKE.Rake(<path_to_your_stopwords_file>, regex='<your regex>')
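
To see why the regex matters, the sketch below splits the contents of a stop words file with a given pattern. It assumes the regex is used to split the file's text into individual stop words; `parse_stop_words` is a hypothetical helper written for this illustration, not a function of the library.

```python
import re


def parse_stop_words(file_contents, regex=r"[\W\n]+"):
    """Split raw stoplist text into stop words, dropping empty strings."""
    return [word for word in re.split(regex, file_contents) if word]


# One word per line works with the default pattern
one_per_line = parse_stop_words("the\nof\nand\n")

# The default pattern also splits on apostrophes
split_contraction = parse_stop_words("don't\nthe")

# A comma-separated stoplist with a custom pattern keeps "don't" intact
comma_separated = parse_stop_words("don't, the", regex=r",\s*")

print(one_per_line, split_contraction, comma_separated)
```

Under this assumption, a custom regex is mainly useful when your stoplist is not one plain word per line, or when it contains entries with punctuation that the default pattern would break apart.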

#### For lists:

    import RAKE
    rake = RAKE.Rake(<list>)  # takes stop words as a list of strings
    rake.run(<text>)

SmartStopList(), FoxStopList(), NLTKStopList() and MySQLStopList() return the corresponding stop word lists as Python lists; they can be used as shown below. GoogleSearchStopList() returns what were thought to be stop words in Google search back when large numbers of search suggestions were available. RanksNLStopList() and RanksNLLongStopList() return the stoplists developed in-house by Ranks NL, a webmaster suite.

In [18]:
import RAKE
# Use a lowercase name so the instance does not shadow the Rake class
rake = RAKE.Rake(RAKE.SmartStopList())
rake.run('the doctor realized his patient was sick and decided to give him 4 ccs of Micoplathonal and this was followed by some other tests on the subject things looked like they were improving over all  ')

[('subject things looked', 9.0),
 ('doctor realized', 4.0),
 ('patient', 1.0),
 ('sick', 1.0),
 ('decided', 1.0),
 ('give', 1.0),
 ('4 ccs', 1.0),
 ('micoplathonal', 1.0),
 ('tests', 1.0),
 ('improving', 1.0)]

# Additional flags:

The run method of a Rake object also accepts the minCharacters, maxWords and minFrequency flags to tune the output. minCharacters is the minimum number of characters allowed in a keyword. maxWords is the maximum number of words allowed in a phrase considered as a keyword. minFrequency is the minimum number of occurrences a phrase must have to be considered as a keyword. The following example passes the default values explicitly:


In [19]:
import RAKE
rake = RAKE.Rake(RAKE.SmartStopList())
text =''' this patient is suffering from early onset parkensons disease he needs 
to be treated for it I am going to break down 4 miligrams of tabanol 
for him and prescribe him 20 doses of Nyosopin as well to help with the pain at this stage.
 '''
rake.run(text, minCharacters = 1, maxWords = 5, minFrequency = 1)

[('early onset parkensons disease', 16.0),
 ('patient', 1.0),
 ('suffering', 1.0),
 ('treated', 1.0),
 ('break', 1.0),
 ('4 miligrams', 1.0),
 ('tabanol', 1.0),
 ('prescribe', 1.0),
 ('20 doses', 1.0),
 ('nyosopin', 1.0),
 ('pain', 1.0),
 ('stage', 1.0)]
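
As a rough sketch of what these three flags do, candidate phrases can be filtered as shown below. This illustrates the idea only and is not python-rake's source; `filter_candidates` and its snake_case parameter names are hypothetical.

```python
from collections import Counter


def filter_candidates(phrases, min_characters=1, max_words=5, min_frequency=1):
    """Keep each distinct phrase that passes all three filters."""
    counts = Counter(phrases)
    kept = []
    for phrase in counts:  # distinct phrases, in first-seen order
        if len(phrase) < min_characters:
            continue  # too few characters
        if len(phrase.split()) > max_words:
            continue  # too many words
        if counts[phrase] < min_frequency:
            continue  # does not occur often enough
        kept.append(phrase)
    return kept


phrases = ["pain", "early onset disease", "pain", "x", "a b c d e f"]
defaults = filter_candidates(phrases)
strict = filter_candidates(phrases, min_characters=2, min_frequency=2)
print(defaults, strict)
```

With the default values only the six-word phrase is dropped. Requiring at least two characters and two occurrences leaves just "pain", which shows how stricter settings trade recall for cleaner output on longer documents.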

# Final Notes
Other stoplists, including stoplists in other languages, can be found at https://github.com/trec-kba/many-stop-words/tree/master/orig, at http://www.ranks.nl/stopwords, at https://sites.google.com/site/kevinbouge/stopwords-lists and in the NLTK stopwords package.


# Credit
The python-rake package is a maintained fork of the original Python RAKE project, which can be found here: https://github.com/aneesha/RAKE. The Fox stopwords list was originally created by Christopher Fox: http://dl.acm.org/citation.cfm?id=378888. The Smart stopwords list was originally created by Gerard Salton and Chris Buckley for the experimental SMART information retrieval system at Cornell University. The MySQL stopwords list is (surprisingly) from MySQL, owned and maintained by Oracle and under the GPL2 license. The NLTK stopwords list was created by the NLTK project under the Apache license; the project is here: https://github.com/nltk/nltk. The Ranks NL stopword lists were created by Ranks NL, who also compiled the Google Search stopword list and said via email that the lists could be included in this package with credit.