In this mini-lecture, we study keyword extraction (KE) algorithms. The core problem can be summarized as follows: given a passage of text, can an algorithm pick out the words and phrases that best capture its content? Fortunately, several Python packages implement algorithms that tackle this type of problem.

Keyword extraction has many applications. For example, when reading online, keywords help us tell articles apart and save time on social media platforms: we can decide whether to read a post and its comments based on their keywords.
We can also check whether an article belongs to a current trend, or estimate whether it is likely to trend.

There is more than one KE algorithm. We will briefly review what each one does and provide code examples. The Python interfaces to these algorithms are easy to use.

In [1]:
#!pip install rake-nltk
#!pip install yake
#!pip install -e git+https://github.com/smirnov-am/pytopicrank.git#egg=pytopicrank
import numpy as np
import pandas as pd
import nltk
import gensim
import yake
import os

from rake_nltk import Rake
# gensim.summarization exists only in Gensim 3.8.3 and earlier; it was removed in Gensim 4.0
from gensim.summarization import keywords
from gensim.summarization.summarizer import summarize


In [3]:
print(gensim.__version__) # gensim.summarization exists only in Gensim 3.8.3 and earlier; it was removed in Gensim 4.0

3.8.3
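
Because the `gensim.summarization` import fails on Gensim 4.0 and later, it can help to check the installed version before relying on it. The sketch below is only an illustration: `has_summarization` is a hypothetical helper (not part of Gensim), and it assumes the version string starts with an integer major version followed by a dot, as in `3.8.3`.

```python
def has_summarization(version: str) -> bool:
    """Return True if this Gensim version string predates 4.0.

    Hypothetical helper: assumes the string starts with an integer major
    version followed by a dot, e.g. "3.8.3" or "4.1.2".
    """
    major = int(version.split(".")[0])
    return major < 4


# In the notebook one would pass gensim.__version__ instead of literals.
for v in ("3.8.3", "4.0.0", "4.3.2"):
    print(v, has_summarization(v))
```

With such a check in place, a notebook could stop early with a clear message instead of failing on the import.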


In [4]:
#path="C:\\Users\\gao\\GAO_Jupyter_Notebook\\Datasets"
path="C:\\Users\\GAO\\python workspace\\GAO_Jupyter_Notebook\\Datasets"
os.chdir(path)

#path="C:\\Users\\pgao\\Documents\\PGZ Documents\\Programming Workshop\\PYTHON\\Open Courses on Python\\Udemy Course on Python\Introduction to Data Science Using Python\\datasets"
#os.chdir(path)

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\GAO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\GAO\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

### I. Yet Another Keyword Extractor (Yake!)

The YAKE! algorithm was published in its full form in 2020 (Campos et al., *Information Sciences*) and was evaluated against a range of comparable algorithms: statistical methods such as **TF-IDF**, **KP-Miner**, and **RAKE**; graph-based methods built on **PageRank** (the algorithm Google introduced to rank web pages), such as **TextRank**, **SingleRank**, **ExpandRank**, **TopicRank**, **TopicalPageRank**, and **PositionRank**; and supervised methods such as **KEA**. Here are the key features of YAKE!:

   1. Unsupervised approach: YAKE! is a lightweight, unsupervised automatic keyword extraction algorithm that builds on local statistical features extracted from a single document; i.e., it does not require any training corpus.
   2. Corpus-independent: the algorithm retrieves keywords from a single document only, without relying on external document-collection statistics as IDF does; i.e., it can be applied to any text.
   3. Domain- and language-independent: YAKE! works with domains and languages for which there are no ready-made keyword extraction systems, as it neither requires a training corpus nor depends on sophisticated external sources (such as WordNet or Wikipedia) or linguistic tools (such as PoS taggers) other than a static list of stopwords.
   4. Interior stopwords: YAKE! can retrieve keywords containing interior stopwords (e.g., “Game of Thrones”) with higher precision than the state-of-the-art methods it was compared against.
   5. Scale: YAKE! scales linearly with the number of candidate terms identified, so it can handle documents of any length.
   6. Term frequency-free: no minimum term frequency or sentence frequency is required of a candidate keyword. Depending on the features used, a keyword may be judged significant or insignificant whether it occurs once or many times.
   7. Open-source: the authors provide a demo [yake.inesctec.pt], an app on Google Play, an API [yake.inesctec.pt/api], and a Python package [github.com/LIAAD/yake], so that the scientific community can test and evaluate the approach in subsequent studies.

The algorithm has five main steps: (1) text pre-processing and candidate term identification; (2) feature extraction; (3) computing term scores; (4) n-gram generation and computing candidate keyword scores; and (5) data deduplication and ranking. Details of the algorithm can be found in the original YAKE! paper. The 'yake' package performs the extraction for us. Its output contains each keyword together with a score: the lower the score, the more relevant the keyword.
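
To make step (5) and the lower-is-better scoring more concrete, here is a minimal sketch of deduplication and ranking. It is an illustration under stated assumptions, not the package's implementation: `deduplicate` is a hypothetical helper, the scores are made up, and `difflib.SequenceMatcher` stands in for whichever string-similarity measure the 'yake' package actually applies.

```python
from difflib import SequenceMatcher


def deduplicate(scored_keywords, threshold=0.9):
    """Keep the best-scored keyword from each group of near-duplicates.

    scored_keywords: list of (keyword, score) pairs, lower score = better.
    threshold: similarity (0..1) at or above which two keywords count as
    duplicates. SequenceMatcher is an illustrative stand-in here, not
    necessarily the measure the yake package uses.
    """
    kept = []
    # Visit candidates from most to least relevant (ascending score).
    for keyword, score in sorted(scored_keywords, key=lambda pair: pair[1]):
        is_duplicate = any(
            SequenceMatcher(None, keyword.lower(), other.lower()).ratio() >= threshold
            for other, _ in kept
        )
        if not is_duplicate:
            kept.append((keyword, score))
    return kept


# Made-up scores for illustration only.
candidates = [
    ("readmission report", 0.20),
    ("readmission reports", 0.25),  # near-duplicate of the line above
    ("service lines", 0.36),
]
for keyword, score in deduplicate(candidates):
    print(f"{score:.2f}  {keyword}")
```

Lowering the threshold removes more near-duplicates; raising it toward 1.0 keeps almost every variant.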

The 'yake' Python package is fairly new and still under development, so its documentation is sparse compared with that of other packages. Nonetheless, it provides a very convenient way to extract keywords from text documents. Let's see a concrete example:

In [4]:
text = """
We are currently trying to understand the Vizient Q&A readmission logic. In the CDB online tool-report templates, we see that there is a set of readmission report
templates available for members to pull. We want to focus on specific service lines (cardiology in our case) and understand our poor performance. Our z-score is 
currently at 2.5, which is already worse than 95 percent of our cohorts. Can we talk to the experts in the analytics team and help us understand this?
"""
print(text)


We are currently trying to understand the Vizient Q&A readmission logic. In the CDB online tool-report templates, we see that there is a set of readmission report
templates available for members to pull. We want to focus on specific service lines (cardiology in our case) and understand our poor performance. Our z-score is 
currently at 2.5, which is already worse than 95 percent of our cohorts. Can we talk to the experts in the analytics team and help us understand this?



In [6]:
kw_extractor = yake.KeywordExtractor() # extractor with default settings (not used below)

language = "en" # YAKE! can be applied to non-English languages as well
max_ngram_size = 2 # maximum n-gram size: each keyword contains at most 2 words
deduplication_threshold = 0.9 # deduplication limit passed to dedupLim
numOfKeywords = 25 # number of keywords to return
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
kws = custom_kw_extractor.extract_keywords(text)
for kw in kws:
    print(kw) # (keyword, score) pairs; a lower score means a more relevant keyword

('readmission logic', 0.05571858509133558)
('Vizient', 0.08320359766886957)
('CDB online', 0.1266728539959154)
('logic', 0.1501643180173009)
('readmission', 0.1757337468250936)
('understand', 0.18127425329060223)
('readmission report', 0.1974390035217166)
('templates', 0.2358520547051198)
('CDB', 0.23965767344532257)
('online tool-report', 0.2821896149816428)
('tool-report templates', 0.2841770066167947)
('report templates', 0.2841770066167947)
('service lines', 0.36181549152450343)
('poor performance', 0.36181549152450343)
('pull', 0.38029995855635135)
('specific service', 0.4636804578837865)
('online', 0.46913127445820996)
('tool-report', 0.46913127445820996)
('set', 0.46913127445820996)
('report', 0.46913127445820996)
('members', 0.46913127445820996)
('lines', 0.4720441592731822)
('cardiology', 0.4720441592731822)
('case', 0.4720441592731822)
('performance', 0.4720441592731822)


### II. RAKE (Rapid Automatic Keyword Extraction)

RAKE is a domain-independent keyword extraction algorithm that determines key phrases in a body of text by analyzing how frequently each word appears and how often it co-occurs with other words in the text. The algorithm requires no training; its only inputs are a list of stop words for the given language and a tokenizer that splits the text into sentences and the sentences into words. Details can be found in Berry and Kogan (2010). The 'rake-nltk' package is used in conjunction with NLTK (mainly for its stopword list).

RAKE is based on the observation that keywords frequently contain multiple words but rarely contain standard punctuation or stopwords. Words that carry meaning related to the text are described as **content bearing** and are called **content words**. The input parameters of the RAKE algorithm comprise a list of stopwords together with a set of phrase delimiters and word delimiters. The algorithm roughly goes like this: first, the document text is split into an array of words at the specified word delimiters; second, the array is split into sequences of contiguous words at phrase delimiters and stopword positions. The words within one sequence together form a candidate keyword. After all candidate keywords have been identified, a graph of word co-occurrences is built and used to calculate a score for each word, called the **member word score**. Several metrics for word scores can be derived from the degree and frequency of the vertices in this graph. The major metrics are as follows:

   - word frequency: favors words that occur frequently, regardless of the number of words with which they co-occur
   - word degree: favors words that occur often and in longer candidate keywords
   - ratio of degree to frequency: favors words that occur predominantly in longer candidate keywords

The final score of each candidate keyword is the sum of its member word scores. Once the candidate keywords are scored, the top $T$ candidates are selected as the document's keywords. The **T value** is one-third of the number of words in the graph. RAKE therefore does not depend on word embeddings; it relies only on word frequencies and the co-occurrence graph.
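
The scoring described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the 'rake-nltk' implementation: the stopword list is a tiny made-up sample, the delimiters are reduced to common punctuation, and the helper names `candidate_phrases` and `rake_scores` are hypothetical. It uses the degree-to-frequency ratio as the member word score.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; rake-nltk uses NLTK's list instead.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "we"}


def candidate_phrases(text):
    """Split text into candidate keywords at punctuation and stopwords."""
    phrases = []
    # Phrase delimiters: a simplified set of punctuation characters.
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text):
    """Score each candidate as the sum of degree/frequency of its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, the word itself included.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample = "Keyword extraction is the task of finding the keyword phrases in a document."
for phrase, score in rake_scores(sample):
    print(f"{score:5.2f}  {phrase}")
```

In this sample the two-word candidates score 4.0 each, while the single-word candidates score 1.0, which shows how the degree-to-frequency ratio favors words that appear in longer candidates.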

Further details of the algorithm can be found at the following link:

   - https://medium.datadriveninvestor.com/rake-rapid-automatic-keyword-extraction-algorithm-f4ec17b2886c



In [19]:
r = Rake() # uses NLTK's English stopwords and all punctuation characters by default
r.extract_keywords_from_text(text) # extract candidate keywords from the given text

In [20]:
print(set(r.get_ranked_phrases()))

{'spacy', 'cython', 'spacy functionally', 'advantages', 'source software library', 'main developers', 'open', 'software company explosion', 'contrast', 'founders', 'written', 'published', 'advanced natural language processing', 'nltk', 'users often encounter different types', 'ines montani', 'general', 'gensim package mainly handles word embeddings', 'programming languages python', 'matthew honnibal', 'whereas spacy combines', 'powerful nlp tool', 'data scientists enjoy using besides gensim', 'mit license', 'word similarities', 'package', 'library', 'basic', 'tricky business', 'technical glitches', 'installation', 'gensim'}


In [21]:
rakescores=set(r.get_ranked_phrases_with_scores()) # (score, phrase) pairs; set() removes duplicates but discards the highest-to-lowest ranking order
[i for i in rakescores if i[0]>=5] # keep only phrases with a score of at least 5

[(34.333333333333336, 'data scientists enjoy using besides gensim'),
 (9.0, 'powerful nlp tool'),
 (7.75, 'whereas spacy combines'),
 (6.0, 'word similarities'),
 (9.0, 'software company explosion'),
 (8.0, 'source software library'),
 (29.833333333333332, 'gensim package mainly handles word embeddings'),
 (9.0, 'programming languages python'),
 (16.0, 'advanced natural language processing'),
 (25.0, 'users often encounter different types')]

### III. TextRank 

TextRank is implemented in the Gensim package up to version 3.8.3. Note that this functionality was removed in Gensim 4.0 and later, so we use the older version here. TextRank builds on the PageRank algorithm, which Google introduced to rank the relevance of web pages in search results. The main difference is that TextRank ranks text units, such as words or sentences, instead of pages.

TextRank is a general-purpose, graph-based ranking algorithm for NLP. Graph-based ranking algorithms decide the importance of a vertex within a graph based on global information drawn recursively from the entire graph. When one vertex links to another, it is essentially casting a vote for that vertex: the more votes a vertex receives, the higher its importance. Thus, we have to build a graph that represents the text and interconnects words or other text entities through meaningful relations. TextRank covers two NLP tasks:
   
   - Keyword extraction task
   - Sentence extraction task

TextRank is well suited to applications involving entire sentences, since it produces a ranking over text units that is computed recursively from information drawn from the entire text. To apply TextRank, we first build a graph associated with the text, in which the vertices represent the units to be ranked. For sentence extraction the goal is to rank entire sentences, so a vertex is added to the graph for each sentence in the text.
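
The sentence graph and the voting idea can be sketched as follows. This is a simplified illustration rather than Gensim's implementation: the sentence splitter is naive, the similarity is a word-overlap measure normalized by the logarithms of the sentence lengths, and the helper names (`split_sentences`, `similarity`, `textrank`) and parameter values are assumptions made for the example.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption; real code would use NLTK)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word overlap, normalized by the log of the two vocabulary sizes."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids dividing by log(1) = 0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Rank sentences by PageRank-style power iteration on a weighted graph."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence j "votes" for i in proportion to their edge weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return sorted(zip(scores, sentences), reverse=True)


sample = ("TextRank builds a graph from sentences. "
          "Graph vertices represent sentences. "
          "Edges in a graph link similar sentences. "
          "Lunch was tasty yesterday.")
for score, sentence in textrank(split_sentences(sample)):
    print(f"{score:.3f}  {sentence}")
```

The last sample sentence shares no words with the others, so it receives no votes and ends up at the bottom of the ranking with only the baseline score of `1 - damping`.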

Let's apply TextRank to the existing text:

In [22]:
print(keywords(text, scores=True, lemmatize=True)) # using Gensim

[('languages', 0.3103891302404008), ('software', 0.24152957209043588), ('word', 0.2407147735187126), ('spacy', 0.2082656110080652), ('scientists', 0.20826561100806484), ('different', 0.20826561100806482), ('nlp', 0.2082656110080647), ('company', 0.16913153914645165)]


Gensim can also perform text summarization. The idea is the same as for keyword extraction. In TextRank, the text fragments used to construct the graph can be phrases, sentences, or paragraphs. Many successful systems currently use sentences, as a trade-off between content richness and grammatical correctness. Under this approach, the most important sentences are the most strongly connected ones in the graph, and they are used to build the final summary. The function gensim.summarization.summarizer.summarize() has a few tuning parameters. The _word\_count_ argument sets the approximate number of words in the summary (whole sentences are returned, so the output can be somewhat longer). The _ratio_ argument specifies what fraction of the sentences in the original text should be returned as output.
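
To illustrate what a ratio-based selection might look like once sentences have scores, here is a small sketch. It is not Gensim's implementation: `select_by_ratio` is a hypothetical helper, the scores are made up, and rounding the number of kept sentences up with `math.ceil` is an assumption chosen for the example.

```python
import math


def select_by_ratio(scored_sentences, ratio=0.2):
    """Return the top fraction of sentences, restored to document order.

    scored_sentences: list of (score, sentence) in original document order.
    Hypothetical helper for illustration, not Gensim's implementation.
    """
    n_keep = max(1, math.ceil(len(scored_sentences) * ratio))
    # Rank positions by score, keep the best n_keep, then sort by position.
    ranked = sorted(range(len(scored_sentences)),
                    key=lambda i: scored_sentences[i][0], reverse=True)
    keep = sorted(ranked[:n_keep])
    return " ".join(scored_sentences[i][1] for i in keep)


# Made-up scores for illustration only.
scored = [
    (0.9, "TextRank ranks sentences."),
    (0.2, "This aside matters less."),
    (0.7, "Top sentences form the summary."),
    (0.1, "Unrelated closing remark."),
]
print(select_by_ratio(scored, ratio=0.5))
```

Restoring the original order after selection keeps the summary readable, since the highest-scoring sentence is not necessarily the one that should come first.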

In [23]:
summarize(text, word_count=10)

'spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.'

Let's use a more complicated example. Using the Bible dataset, let's try to summarize biblical texts using TextRank.

In [24]:
bible = pd.read_csv('t_kjv.csv')
df_aux = pd.read_csv("BibleBooks.csv") # additional information related to the Bible

In [25]:
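# Books 1-39 form the Old Testament (OT); the remaining books form the New Testament (NT)
bible.loc[bible['b'] <= 39, 'Testament'] = 'OT'
bible.loc[bible['b'] > 39, 'Testament'] = 'NT'

# Drop the 'Tanakh' and 'New Jerusalem Version' columns, then replace missing
# King James indices with 0 and cast the column to int so it can be used as a merge key
df_enriched = df_aux.drop(['Tanakh','New Jerusalem Version'], axis=1)
df_enriched['King James Version']=df_enriched['King James Version'].replace(np.nan, 0)
df_enriched['King James Version'] = df_enriched['King James Version'].astype('int')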


In [26]:
df = bible.merge(df_enriched, left_on='b', right_on='King James Version')
df.rename(columns={'b': 'Book Index', 'c': 'Chapter Index', 'v': 'Verse', 't': 'Text'}, inplace=True)
df.drop(['King James Version'], axis=1, inplace=True)
df['text'] = df['Text'].str.lower()
df.drop(['Text'], axis=1, inplace=True)
df.head()

Unnamed: 0,id,Book Index,Chapter Index,Verse,Testament,Book,Time,Period,Location,text
0,1001001,1,1,1,OT,Genesis,-500,Persian,Israel,in the beginning god created the heaven and th...
1,1001002,1,1,2,OT,Genesis,-500,Persian,Israel,"and the earth was without form, and void; and ..."
2,1001003,1,1,3,OT,Genesis,-500,Persian,Israel,"and god said, let there be light: and there wa..."
3,1001004,1,1,4,OT,Genesis,-500,Persian,Israel,"and god saw the light, that it was good: and g..."
4,1001005,1,1,5,OT,Genesis,-500,Persian,Israel,"and god called the light day, and the darkness..."


In [27]:
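romans = df.loc[df['Book']=='Romans'] # keep only the verses of the book of Romans
verses = [str(v) for v in romans['text']] # make sure every verse is a string
romans_text = ' '.join(verses) # concatenate the verses into one document
print(type(romans_text))
print(summarize(romans_text, word_count=500)) # TextRank summary of roughly 500 words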

<class 'str'>
paul, a servant of jesus christ, called to be an apostle, separated unto the gospel of god, (which he had promised afore by his prophets in the holy scriptures,) concerning his son jesus christ our lord, which was made of the seed of david according to the flesh; and declared to be the son of god with power, according to the spirit of holiness, by the resurrection from the dead: by whom we have received grace and apostleship, for obedience to the faith among all nations, for his name: among whom are ye also the called of jesus christ: to all that be in rome, beloved of god, called to be saints: grace to you and peace from god our father, and the lord jesus christ.
but after thy hardness and impenitent heart treasurest up unto thyself wrath against the day of wrath and revelation of the righteous judgment of god; who will render to every man according to his deeds: to them who by patient continuance in well doing seek for glory and honour and immortality, eternal life: but

### References:

   - https://towardsdatascience.com/keyword-extraction-process-in-python-with-natural-language-processing-nlp-d769a9069d5c 
   - https://towardsdatascience.com/keyword-extraction-python-tf-idf-textrank-topicrank-yake-bert-7405d51cd839
   - https://medium.com/@shivangisareen/text-summarisation-with-gensim-textrank-46bbb3401289
   - https://datascience.stackexchange.com/questions/23969/sentence-similarity-prediction
   - https://www.sciencedirect.com/science/article/abs/pii/S0020025519308588?via%3Dihub
   - https://github.com/LIAAD/yake
   - https://pypi.org/project/rake-nltk/ 
   - https://medium.datadriveninvestor.com/rake-rapid-automatic-keyword-extraction-algorithm-f4ec17b2886c 
   - https://www.bibsonomy.org/bibtex/1f2c1ce382d62625a1c3aeca81e96c4b4/lopusz_kdd
   - https://radimrehurek.com/gensim_3.8.3/auto_examples/tutorials/run_summarization.html
   - Michael W. Berry and Jacob Kogan (2010). Text Mining: Applications and Theory. John Wiley & Sons.
   - https://aclanthology.org/I13-1062.pdf 