## Comparison of Python modules for keyword extraction
### spaCy
Setup steps:  
* pip install -U spacy  
* python -m spacy download en_core_web_lg
* python -m spacy validate

In [7]:
import spacy
nlp = spacy.load("en_core_web_lg")
text = """spaCy is an open-source software library for advanced natural language processing, 
written in the programming languages Python and Cython. The library is published under the MIT license
and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion."""
doc = nlp(text)
print(doc.ents)

(Python, Cython, MIT, Matthew Honnibal, Ines Montani, Explosion)


In [16]:
import spacy

text = """Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict inequations,
and nonstrict inequations are considered. Upper bounds for components of a minimal set of
solutions and algorithms of construction of minimal generating sets of solutions for all types
of systems are given. These criteria and the corresponding algorithms for constructing a minimal
supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."""

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_lg")

from collections import Counter
from string import punctuation

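def get_hotwords(text):
    """Return the non-stopword tokens whose part of speech is in pos_tag."""
    result = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
    doc = nlp(text.lower())
    for token in doc:
        # Skip stop words and punctuation.
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue
        # Keep only proper nouns, adjectives, nouns, and verbs.
        if token.pos_ in pos_tag:
            result.append(token.text)

    return result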

output = get_hotwords(text)
print(set(output))

{'linear', 'corresponding', 'numbers', 'sets', 'equations', 'considered', 'types', 'components', 'constraints', 'set', 'bounds', 'systems', 'algorithms', 'solving', 'compatibility', 'constructing', 'supporting', 'nonstrict', 'inequations', 'given', 'minimal', 'upper', 'mixed', 'solutions', 'construction', 'generating', 'criteria', 'natural', 'system', 'diophantine', 'strict'}
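
The cell above imports `Counter` but prints only the set of unique hotwords, so frequency information is lost. A minimal sketch of ranking the hotwords by how often they occur, assuming a list shaped like the one `get_hotwords` returns (the helper name `rank_hotwords` and the sample list are hypothetical):

```python
from collections import Counter

def rank_hotwords(hotwords, top_n=5):
    """Return the top_n most frequent hotwords as (word, count) pairs."""
    return Counter(hotwords).most_common(top_n)

# Hypothetical sample standing in for the output of get_hotwords(text).
sample = ["systems", "solutions", "minimal", "systems", "set", "solutions", "systems"]
print(rank_hotwords(sample, top_n=2))  # [('systems', 3), ('solutions', 2)]
```

With the real pipeline this would be called as `rank_hotwords(get_hotwords(text))`.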


### YAKE
YAKE (Yet Another Keyword Extractor) selects the most important keywords using statistical features extracted from the text itself, so it needs no training corpus. It also lets you control the maximum n-gram size of the extracted keywords, the number of keywords returned, and the deduplication threshold.

* pip install yake

In [10]:
import yake
kw_extractor = yake.KeywordExtractor()
text = """spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion."""
textFromAudio = "wow that brightest star do you remember the pan pacific bank robbery where four people were killed and the subsequent arrest of the perp where a young man was shot and killed"
textFromImages = """THE ANDNRENT AND UNACCEPTABLE TRUTH OF
                    CUSTODIAL DEATH"""


def TestYake(doc):
    language = "en"
    max_ngram_size = 1 # increase to get phrases instead of words
    deduplication_threshold = 0.9
    numOfKeywords = int(0.45 * len(doc))  # upper bound on keywords returned; len(doc) counts characters, not words
    custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
    keywords = custom_kw_extractor.extract_keywords(doc)
    print("for the string: ",doc)
    for kw in keywords:
        print(kw)

TestYake(text)
TestYake(textFromAudio)
TestYake(textFromImages)

for the string:  spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.
('Cython', 0.053691021027863564)
('Python', 0.06651575167590484)
('spaCy', 0.10241338875304772)
('processing', 0.10241338875304772)
('written', 0.10241338875304772)
('software', 0.11761141438285434)
('library', 0.11761141438285434)
('open-source', 0.13442462743719766)
('advanced', 0.13442462743719766)
('natural', 0.13442462743719766)
('programming', 0.13442462743719766)
('language', 0.13986690653033845)
('languages', 0.13986690653033845)
('Montani', 0.1646146628535413)
('Explosion', 0.1646146628535413)
('MIT', 0.19838041526103037)
('Matthew', 0.19838041526103037)
('Honnibal', 0.19838041526103037)
('Ines', 0.19838041526103037)
('published', 0.35038366644254865)
('license', 0.350
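
YAKE scores are ordered so that a lower score means a more relevant keyword, which is why the list above starts with the smallest values. The sketch below shows one way to pick the best `k` keywords from such `(keyword, score)` pairs and to derive the keyword budget from the word count rather than the character count. Both helper names are hypothetical and not part of the YAKE API.

```python
def keyword_budget(doc, fraction=0.45):
    """Keyword count derived from the number of whitespace-separated words."""
    return max(1, int(fraction * len(doc.split())))

def best_keywords(scored, k):
    """Return the k keywords with the lowest (most relevant) scores."""
    return [kw for kw, _score in sorted(scored, key=lambda pair: pair[1])[:k]]

# A few (keyword, score) pairs copied from the output above, deliberately shuffled.
scored = [
    ("software", 0.11761141438285434),
    ("Cython", 0.053691021027863564),
    ("Python", 0.06651575167590484),
]
print(best_keywords(scored, 2))  # ['Cython', 'Python']
print(keyword_budget("wow that brightest star"))  # 1
```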

### RAKE

The rake-nltk package implements the Rapid Automatic Keyword Extraction (RAKE) algorithm on top of the NLTK toolkit.

* pip install rake-nltk
* import nltk
* nltk.download('punkt')

In [28]:
from rake_nltk import Rake


r = Rake(min_length=1, max_length=1) 
text = """spaCy is an open-source software library for advanced natural language processing,
written in the programming languages Python and Cython. The library is published under the MIT license
and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion."""
r.extract_keywords_from_text(text)
keyword_extracted = r.get_ranked_phrases_with_scores()
print(keyword_extracted)

[(1.0, 'written'), (1.0, 'spacy'), (1.0, 'published'), (1.0, 'open'), (1.0, 'library'), (1.0, 'founders'), (1.0, 'cython')]
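
Every phrase above scores 1.0. A likely explanation is that RAKE scores a word as its degree divided by its frequency, and when only one-word phrases are kept (`min_length=1, max_length=1`) the two are equal. The following is a simplified, standard-library sketch of that scoring idea, not rake-nltk's actual implementation; the tiny stopword list and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; rake-nltk uses NLTK's list).
STOPWORDS = {"is", "an", "for", "in", "the", "and", "of", "are", "its", "under", "a"}

def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of consecutive non-stopword tokens."""
    phrases = []
    # Split on punctuation first so phrases never span a boundary.
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(phrases):
    """Score each phrase as the sum of degree/frequency over its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

phrases = candidate_phrases("Keyword extraction is the task of finding key phrases in the text")
print(rake_scores(phrases))

# Keeping only one-word phrases makes degree equal frequency, so every score is 1.0.
single_words = [p for p in phrases if len(p) == 1]
print(rake_scores(single_words))
```

Allowing longer phrases (for example a larger `max_length` in `Rake`) lets multi-word phrases accumulate higher scores, which gives a more informative ranking.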


### Gensim
Gensim was developed primarily for topic modeling and later gained other NLP features such as summarization and text similarity. Here we use Gensim for keyword extraction. The gensim.summarization module was removed in Gensim 4.0, so an older release is pinned below.
* pip3 install gensim==3.6.0

In [2]:
# for short text

import gensim
text = "Non-negative matrix factorization (NMF) has previously been shown to " + \
"be a useful decomposition for multivariate data. Two different multiplicative " + \
"algorithms for NMF are analyzed. They differ only slightly in the " + \
"multiplicative factor used in the update rules. One algorithm can be shown to " + \
"minimize the conventional least squares error while the other minimizes the  " + \
"generalized Kullback-Leibler divergence. The monotonic convergence of both  " + \
"algorithms can be proven using an auxiliary function analogous to that used " + \
"for proving convergence of the Expectation-Maximization algorithm. The algorithms  " + \
"can also be interpreted as diagonally rescaled gradient descent, where the  " + \
"rescaling factor is optimally chosen to ensure convergence."
gensim.summarization.keywords(text, 
        ratio=0.5,               # fraction of candidate words to keep (used when words is None)
        words=None,              # exact number of keywords to return (overrides ratio)
        split=True,              # return a list instead of a newline-joined string
        scores=True,             # return (keyword, score) tuples
        pos_filter=('NN', 'JJ'), # keep only these part-of-speech tags (nouns, adjectives)
        lemmatize=True,          # if True, lemmatize words
        deacc=True)              # if True, remove accents



[('factor', 0.3066938262406011),
 ('convergence', 0.3065532260323327),
 ('rescaling', 0.24358369124679524),
 ('multiplicative', 0.23877639442201123),
 ('function', 0.23315315782740828),
 ('kullback', 0.20739874793094204),
 ('gradient', 0.17745105527488267),
 ('algorithm', 0.1688639034957118),
 ('matrix', 0.16540489544978382),
 ('rules', 0.1597530896224829),
 ('update', 0.15975308962248286),
 ('squares error', 0.1597530896224828),
 ('optimally', 0.15975308962248275)]

In [5]:
# for large text

def get_keywords_gensim(docs):
    
    keywords=gensim.summarization.keywords(docs, 
                                ratio=None, 
                                words=10,         
                                split=True,             
                                scores=True,           
                                pos_filter=None, 
                                lemmatize=True,         
                                deacc=True)              
    
    return keywords

keywords=get_keywords_gensim(text)
print(keywords)

[('factor', 0.3066938262406005), ('convergence', 0.30655322603233226), ('rescaling', 0.24358369124679452), ('multiplicative', 0.23877639442201173), ('function', 0.23315315782740748), ('kullback', 0.20739874793094265), ('gradient', 0.17745105527488209)]
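
Gensim's keyword extraction is based on TextRank, which ranks words by running a PageRank-style iteration over a word co-occurrence graph. The sketch below illustrates that idea with the standard library only. It is a simplified approximation under stated assumptions (fixed window, no POS filtering or lemmatization), not Gensim's implementation, and the function name and parameters are hypothetical.

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=2, damping=0.85, iterations=50, top=5,
                      stopwords=frozenset()):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]

    # Link each word to the words that follow it within `window` positions.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Power iteration: each word shares its rank equally among its neighbours.
    rank = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        rank = {
            w: (1 - damping)
            + damping * sum(rank[n] / len(neighbours[n]) for n in neighbours[w])
            for w in neighbours
        }
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:top]

# "graph" co-occurs with every other word, so it ends up with the highest rank.
print(textrank_keywords("graph nodes graph edges graph ranks graph scores", top=3))
```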


### KeyBERT
KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.
* pip install keybert
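
The ranking step behind this approach can be illustrated without a model: embed the document and each candidate word, then sort the candidates by cosine similarity to the document. In the sketch below the three-dimensional vectors are made-up stand-ins for BERT embeddings, and the helper names are hypothetical rather than part of the KeyBERT API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_similarity(doc_vector, candidate_vectors, top_n=3):
    """Return candidates sorted by similarity to the document, best first."""
    scored = [
        (word, round(cosine(doc_vector, vec), 4))
        for word, vec in candidate_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy vectors (assumptions, not real embeddings).
doc_vector = [0.9, 0.1, 0.3]
candidates = {
    "supervised": [0.8, 0.2, 0.3],
    "signal": [0.1, 0.9, 0.2],
    "training": [0.7, 0.0, 0.6],
}
for word, score in rank_by_similarity(doc_vector, candidates):
    print(word, score)
```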


In [19]:
from keybert import KeyBERT
from pprint import pprint



docs = ["""Supervised learning is the machine learning task of learning a function that
        maps an input to an output based on example input-output pairs. It infers a
        function from labeled training data consisting of a set of training examples.
        In supervised learning, each example is a pair consisting of an input object
        (typically a vector) and a desired output value (also called the supervisory signal). 
        A supervised learning algorithm analyzes the training data and produces an inferred function, 
        which can be used for mapping new examples. An optimal scenario will allow for the 
        algorithm to correctly determine the class labels for unseen instances. This requires 
        the learning algorithm to generalize from the training data to unseen situations in a 
        'reasonable' way (see inductive bias).""", 
        
        """Keywords are defined as phrases that capture the main topics discussed in a document. 
        As they offer a brief yet precise summary of document content, they can be utilized for various applications. 
        In an information retrieval environment, they serve as an indication of document relevance for users, as the list 
        of keywords can quickly help to determine whether a given document is relevant to their interest. 
        As keywords reflect a document's main topics, they can be utilized to classify documents into groups 
        by measuring the overlap between the keywords assigned to them. Keywords are also used proactively 
        in information retrieval."""]

text = """spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion."""
textFromAudio = "wow that brightest star do you remember the pan pacific bank robbery where four people were killed and the subsequent arrest of the perp where a young man was shot and killed"
textFromImages = """THE ANDNRENT AND UNACCEPTABLE TRUTH OF
                        CUSTODIAL DEATH"""
kw_model = KeyBERT()
def TestKeyBERT(text):
    keys = kw_model.extract_keywords(
        docs=text,
        keyphrase_ngram_range=(1, 1),  # single words only
        top_n=int(0.45 * len(text)),   # upper bound; len(text) counts characters, not words
    )
    print("For string: ", text)
    pprint(keys)

TestKeyBERT(docs[0])
TestKeyBERT(text)
TestKeyBERT(textFromAudio)
TestKeyBERT(textFromImages)

For string:  Supervised learning is the machine learning task of learning a function that
        maps an input to an output based on example input-output pairs. It infers a
        function from labeled training data consisting of a set of training examples.
        In supervised learning, each example is a pair consisting of an input object
        (typically a vector) and a desired output value (also called the supervisory signal). 
        A supervised learning algorithm analyzes the training data and produces an inferred function, 
        which can be used for mapping new examples. An optimal scenario will allow for the 
        algorithm to correctly determine the class labels for unseen instances. This requires 
        the learning algorithm to generalize from the training data to unseen situations in a 
        'reasonable' way (see inductive bias).
[('supervised', 0.6676),
 ('labeled', 0.4896),
 ('learning', 0.4813),
 ('training', 0.4134),
 ('labels', 0.3947),
 ('supervisory

<hr>

### ARCHIVE:

In [200]:
import spacy
import pytextrank

In [211]:
document = """India recorded its lowest daily Covid-19 cases in over four months on Tuesday as it
registered 30,093 fresh cases of the coronavirus disease, the Union ministry of health and
family welfare data showed. The last time India's Covid-19 tally was below 30,000-mark was on 
March 16 when the country saw 28,903 fresh cases.

The country also saw 374 deaths due to Covid-19 in the last 24 hours, taking the death toll to 414,482. This is also the lowest death count India has seen after over three months. India witnessed deaths below 400 on March 30 when 354 fatalities were recorded.

Active cases of Covid-19 in the last 24 hours dipped sharply by 15,535, bringing the current infections in the country down to 406,130, the health ministry data showed. These account for 1.35% of the total infections reported in the country.

At least 45,254 people recovered from the infectious disease in the last 24 hours, taking India's recovery rate to 97.32%."""

In [212]:
en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank")
doc = en_nlp(document)

In [213]:
tr = doc._.textrank
print(tr.elapsed_time)

16.001462936401367


In [214]:
for combination in doc._.phrases:
    print(combination.text, combination.rank, combination.count)

family welfare data 0.12600090030695432 1
Active cases 0.10762708856818633 1
Covid-19 0.0972910880017003 2
its lowest daily Covid-19 cases 0.08493733074461425 1
deaths 0.08105682814454315 1
the health ministry data 0.07832397465246121 1
India 0.07622677252903803 8
28,903 fresh cases 0.07129934511205784 1
30,093 fresh cases 0.07129934511205784 1
health 0.07040385930179857 1
Indias Covid-19 tally 0.06832929773944697 1
March 0.06820137241457065 2
Tuesday 0.06730934841577292 2
the Union ministry 0.05804340764065743 1
daily 0.05801486530369755 1
the health ministry 0.05736999157497043 1
Indias recovery rate 0.05719488369024152 1
the coronavirus disease 0.05620344960951462 1
the lowest death count 0.055052440762211316 1
The country 0.052203381342001136 1
the country 0.052203381342001136 3
the Union ministry of health 0.051315344525987495 1
the current infections 0.05102621920759585 1
the total infections 0.050044489310882095 1
the death toll 0.04958030523542612 1
the infectious disease 0.048

In [215]:
en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })
doc = en_nlp(document)
for phrase in doc._.phrases[:5]:
    print(phrase)

Phrase(text='family welfare data', chunks=[family welfare data], count=1, rank=0.12600090030695432)
Phrase(text='Active cases', chunks=[Active cases], count=1, rank=0.10762708856818633)
Phrase(text='Covid-19', chunks=[Covid-19, Covid-19], count=2, rank=0.0972910880017003)
Phrase(text='its lowest daily Covid-19 cases', chunks=[its lowest daily Covid-19 cases], count=1, rank=0.08493733074461425)
Phrase(text='deaths', chunks=[deaths], count=1, rank=0.08105682814454315)
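
The full phrase list printed earlier contains entries that differ only by case, such as "The country" and "the country". A minimal post-processing sketch that merges such variants, assuming the phrases have first been collected as `(text, rank, count)` tuples (the helper name `merge_case_variants` is hypothetical and not part of pytextrank):

```python
def merge_case_variants(phrases):
    """Merge (text, rank, count) tuples whose text differs only by case."""
    merged = {}
    for text, rank, count in phrases:
        key = text.lower()
        if key in merged:
            first_text, best_rank, total = merged[key]
            # Keep the first spelling seen, the highest rank, and the summed count.
            merged[key] = (first_text, max(best_rank, rank), total + count)
        else:
            merged[key] = (text, rank, count)
    return sorted(merged.values(), key=lambda p: p[1], reverse=True)

# Values copied from the output above.
phrases = [
    ("Covid-19", 0.0972910880017003, 2),
    ("The country", 0.052203381342001136, 1),
    ("the country", 0.052203381342001136, 3),
]
for text, rank, count in merge_case_variants(phrases):
    print(text, rank, count)
```

With pytextrank the input list could be built as `[(p.text, p.rank, p.count) for p in doc._.phrases]`, using the same attributes printed in the cell above.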


In [218]:
tr = doc._.textrank
for sent in tr.summary(limit_phrases=10, limit_sentences=2):
    print(sent)

India recorded its lowest daily Covid-19 cases in over four months on Tuesday as it
registered 30,093 fresh cases of the coronavirus disease, the Union ministry of health and
family welfare data showed.
Active cases of Covid-19 in the last 24 hours dipped sharply by 15,535, bringing the current infections in the country down to 406,130, the health ministry data showed.


In [225]:
from summa import summarizer
from summa import keywords
print(summarizer.summarize(document))


India recorded its lowest daily Covid-19 cases in over four months on Tuesday as it
Active cases of Covid-19 in the last 24 hours dipped sharply by 15,535, bringing the current infections in the country down to 406,130, the health ministry data showed.


In [223]:
print(keywords.keywords(document))

india
deaths
death
infections
ministry
data
disease
covid
fresh
