## Practice with YAKE! Keyword Extraction

YAKE! is a lightweight, unsupervised keyword extraction method that uses statistical text features extracted from a single document to select its most relevant keywords. YAKE! does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, text size, language, or domain.

NOTE: My understanding is that YAKE! is meant to be run on a **single document passed as a string**. If you have multiple documents, merge them into one string before running YAKE! on them.
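
As a minimal sketch of that merging step, the snippet below joins several document strings into one. The `documents` list is a made-up stand-in for text extracted from your own files, and skipping empty entries is my own assumption about how to handle documents that yield no text.

```python
# Hypothetical example documents; in practice these would be the extracted texts.
documents = [
    "Aspirin for primary prevention.",
    None,  # e.g. a file whose text could not be extracted
    "Nontraditional risk factors for cardiovascular disease.",
]

# Skip missing entries, then join everything into a single string for YAKE!
merged_text = " ".join(doc for doc in documents if doc)

print(merged_text)
```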

#### **YAKE! features to note:**

* **Corpus-Independent:** YAKE! offers a solution which can retrieve keywords from a single document only, without the need to rely on external document collection statistics as IDF does; i.e., it can be applied to any text.

* **Domain and Language-Independent:** YAKE! works with domains and languages for which there are no ready keyword extraction systems, as it neither requires a training corpus nor depends on sophisticated external sources (such as WordNet or Wikipedia) or linguistic tools (such as NER or PoS taggers) other than a static list of stopwords.

* **Interior Stopwords:** YAKE! can retrieve keywords containing interior stopwords (e.g., “Game of Thrones”) with higher precision than state-of-the-art methods.

* **Scale:** YAKE! scales to any document length linearly in the number of candidate terms identified.

* **Term Frequency-free:** meaning that no conditions are set with respect to the minimum frequency or sentence frequency that a candidate keyword must have. Therefore, based on the features used, a keyword may be considered significant or insignificant with either one occurrence or with multiple occurrences.
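
One way to picture the interior-stopword feature is the simplified sketch below. It is not YAKE!'s actual implementation. It assumes that a candidate may contain stopwords in the middle but may not begin or end with one. The function name `candidate_ngrams` and the tiny stopword set are made up for the example.

```python
def candidate_ngrams(tokens, stopwords, max_n=3):
    """Return n-grams (n <= max_n) that neither start nor end with a stopword."""
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue  # a stopword on the boundary disqualifies the candidate
            candidates.append(" ".join(gram))
    return candidates

tokens = "game of thrones is popular".split()
stopwords = {"of", "is"}  # toy stopword list, for illustration only
result = candidate_ngrams(tokens, stopwords)

print(result)
```

In this toy run "game of thrones" survives because its stopword is interior, while "game of" and "of thrones" are dropped.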

#### **YAKE! has five main steps:** 

1. **Text pre-processing and candidate term identification.** The first step pre-processes the document into a machine-readable format in order to identify potential candidate terms. This step is crucial: better candidate terms improve the effectiveness of the algorithm.

2. **Feature extraction.** The second phase takes as input a list of individual terms and represents them by a set of statistical features.

3. **Computing term score.** In the third step, these features are heuristically combined into a single score likely to reflect the importance of the term.

4. **n-gram generation and computing candidate keyword score.** The fourth step then generates the candidate keywords (through an n-gram construction methodology) and assigns them scores, based on their importance.

5. **Data deduplication and ranking.** Finally, the fifth step compares likely similar keywords through the application of a deduplication distance similarity measure. The list of final keywords is then sorted by their relevance scores. 
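
The fifth step can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for YAKE!'s own similarity measure, and the function name `dedup_and_rank`, the threshold, and the sample scores are all made up.

```python
from difflib import SequenceMatcher

def dedup_and_rank(scored_keywords, threshold=0.9, top=3):
    """Keep the best-scoring keywords, skipping near-duplicates of ones already kept."""
    kept = []
    # Lower score = more relevant, so walk the candidates in ascending score order.
    for keyword, score in sorted(scored_keywords, key=lambda pair: pair[1]):
        is_duplicate = any(
            SequenceMatcher(None, keyword, other).ratio() > threshold
            for other, _ in kept
        )
        if not is_duplicate:
            kept.append((keyword, score))
        if len(kept) == top:
            break
    return kept

# Hypothetical (keyword, score) candidates.
candidates = [
    ("risk factor", 0.02),
    ("risk factors", 0.03),
    ("heart disease", 0.05),
    ("aspirin", 0.08),
]
final = dedup_and_rank(candidates)

print(final)
```

Here "risk factors" is dropped because it is more than 90% similar to the better-scoring "risk factor".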

     

#### **YAKE! References:**

* YAKE! on GitHub: <https://github.com/LIAAD/yake/blob/master/tests/test_yake.py>
* YAKE! publication: <https://www.sciencedirect.com/science/article/pii/S0020025519308588>

In [1]:
## Dependencies
import sys, os
import yake
import pandas as pd
from tika import parser # pip install tika
import re
import glob
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

## Load my dataset
Load the list of PDFs, extract their text with Tika, and create a pandas dataframe.

In [2]:
directory = "practice_pdfs"
files = list(glob.glob(os.path.join(directory,'*.*')))
print(files)
#https://stackoverflow.com/questions/34000914/how-to-create-a-list-from-filenames-in-a-user-specified-directory-in-python
#https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
#https://stackoverflow.com/questions/33912773/python-read-txt-files-into-a-dataframe

['practice_pdfs\\0066-782X-abc-113-04-0787.pdf', 'practice_pdfs\\aspr-cvdprev-draftes131.pdf', 'practice_pdfs\\CIR.0000000000000749.pdf', 'practice_pdfs\\cvd-nontraditional-risk-factors-final-evidence-review.pdf', 'practice_pdfs\\lipidscreening_chmd-review.pdf', 'practice_pdfs\\S135.full.pdf']


In [3]:
# Parse each PDF with Tika and append the parsed result (metadata, content, status) to a document list
#https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file

document_list = []
for f in files:
    raw = parser.from_file(f)
    document_list.append(raw)

# print(document_list)

In [4]:
## Create a dataframe from the document list
text_df = pd.DataFrame(document_list)
text_df.head()
# print(text_df["content"][1])

Unnamed: 0,metadata,content,status
0,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
1,{'Author': 'U.S. Preventive Services Task Forc...,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
2,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
3,{'Author': 'U.S. Preventive Services Task Forc...,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
4,{'Author': 'U.S. Preventive Services Task Forc...,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200


## Test YAKE! on my dataset
This option uses the internal YAKE! preprocessing and stopwords

#### Test Yake on one document in dataframe

In [None]:
## Test YAKE on one document in dataframe

def test_one_yake(text):
    
    kw_extractor = yake.KeywordExtractor()
    keywords = kw_extractor.extract_keywords(text)

    for kw in keywords:
        print(kw)

test_one_yake(text_df['content'][1])

# The lower the score, the more relevant the keyword is.
# The smaller the value, the more significant the 1-gram term (t) is.
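
To make the "lower is better" convention concrete, here is a small sketch that ranks `(keyword, score)` pairs in the shape printed above. The pairs themselves are made-up values, not real YAKE! output.

```python
# Hypothetical (keyword, score) pairs in the shape YAKE! prints.
keywords = [
    ("coronary heart disease", 0.0051),
    ("risk factor", 0.0012),
    ("aspirin", 0.0340),
]

# An ascending sort puts the most relevant keyword (lowest score) first.
ranked = sorted(keywords, key=lambda pair: pair[1])
best_keyword, best_score = ranked[0]

for keyword, score in ranked:
    print(f"{score:.4f}  {keyword}")
```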

#### Test Yake on all documents in dataframe

In [None]:
## Test YAKE on all documents in dataframe
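#Join the dataframe "content" column into one long string

def prepare_text(text):
    
    # Drop rows with missing values so that every remaining entry is a string
    text.dropna(inplace = True)
    
    # Expose the result as a global so later cells can use it
    global string_for_yake
    
    # Join all documents with spaces, using the dataframe passed in as `text`
    string_for_yake = ' '.join(text['content'].tolist()) 
    
    return string_for_yake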

prepare_text(text_df) 




In [None]:
## Run Yake on all-document string

def test_yake(text):
    
    global keywords
    language = "en"
    max_ngram_size = 3
    deduplication_thresold = 0.9
    deduplication_algo = 'seqm'
    windowSize = 1
    numOfKeywords = 200

    custom_kw_extractor = yake.KeywordExtractor(lan=language, 
                                                n=max_ngram_size, 
                                                dedupLim=deduplication_thresold, 
                                                dedupFunc=deduplication_algo, 
                                                windowsSize=windowSize, 
                                                top=numOfKeywords, 
                                                features=None)
    keywords = custom_kw_extractor.extract_keywords(text)

    for kw in keywords:
        print(kw)

test_yake(string_for_yake)

# The lower the score, the more relevant the keyword is.
# The smaller the value, the more significant the 1-gram term (t) is.
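
To connect the 1-gram term scores to the scores printed for multi-word keywords, here is a sketch of how, as I read the YAKE! publication, an n-gram score is derived: the product of its term scores, divided by the keyword frequency times one plus the sum of the term scores. The function and argument names are mine, the numbers are made up, and the real library also treats stopwords inside a candidate specially, which this sketch ignores.

```python
import math

def keyword_score(term_scores, keyword_frequency):
    """Combine per-term scores into one n-gram score (lower = more relevant)."""
    product = math.prod(term_scores)
    return product / (keyword_frequency * (1 + sum(term_scores)))

# Hypothetical term scores for a three-word candidate that occurs 4 times.
score = keyword_score([0.1, 0.2, 0.5], keyword_frequency=4)

# The same candidate occurring twice as often gets a lower (better) score.
score_more_frequent = keyword_score([0.1, 0.2, 0.5], keyword_frequency=8)

print(score, score_more_frequent)
```

Because the term scores are multiplied, 3-grams built from low-scoring terms end up with very small values, which fits the tiny numbers seen when `max_ngram_size = 3`.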

In [None]:
# Save keywords to CSV

def create_keyword_CSV(keywords):
     
    ## Create new dataframe with keywords
    keywords_df = pd.DataFrame(keywords)

    ## Save dataframe to csv
    ## to_csv opens and closes the file itself, so no explicit close is needed
    keywords_df.to_csv(r"yake_all_documents_only.csv", encoding='utf-8')
    
create_keyword_CSV(keywords)

## In Excel, use the TRIM() function to change the relevance scores to numbers 
## and then sort by "Sort numbers and numbers stored as text separately"
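
As a possible alternative to sorting in Excel, the keywords can be sorted in Python and written with a header row. The sketch below uses the standard-library `csv` module and an in-memory buffer so it runs anywhere. The column names, the sample pairs, and the file name in the comment are assumptions for illustration.

```python
import csv
import io

# Hypothetical (keyword, score) pairs standing in for the YAKE! output.
keywords = [("risk factor", 1.2e-3), ("aspirin", 3.4e-2), ("heart disease", 5.1e-3)]

# Sort by score (lowest = most relevant) before writing.
rows = sorted(keywords, key=lambda pair: pair[1])

# In-memory buffer; for a real file use open("keywords.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["keyword", "score"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```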

## Test YAKE! on my dataset

This option applies EXTERNAL pre-processing, stopword removal, and lemmatization PRIOR to running YAKE!

In [6]:
## Pre-process the text: lowercase it, then remove emails, URLs, special characters and numbers
## https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X7RHltBKiUn

def pre_process(text):
    
    # Lowercase
    text_lower = text.lower()
    
    # Remove Emails
    text_email = re.sub('\\S*@\\S*\\s?', '', text_lower) 
    
    # Remove URLs (the ^ anchor means only URLs at the start of a line are matched)
    text_urls = re.sub(r'^https?:\/\/.*[\r\n]*', '', text_email, flags=re.MULTILINE)
    
    # Collapse all whitespace (spaces, new lines \n and tabs \t) into single spaces
    text_spaces = re.sub(r'\s+',' ',text_urls)
    
    # Remove \n from text
    text_space_character = text_spaces.replace('\n','')
    
    # Remove \t from text
    text_tab_character = text_space_character.replace('\t','')
    
    # Remove special characters and numbers
    text_numbers = re.sub("(\\d|\\W)+"," ",text_tab_character)
    
    # Remove tags (the pattern here is empty, so this call leaves the text unchanged)
    text_final = re.sub("","",text_numbers)

    # Remove special characters and space, but leave in periods and numbers
    #text_special = re.sub('[^A-Za-z0-9.]+|\s',' ',text_tab_character)
    
    return text_final

## New column "preprocess" is formed from applying pre_process function to each item in the "content" column in dataframe
text_df['preprocess'] = text_df['content'].apply(lambda x:pre_process(x))

# print(text_df['preprocess'][1])
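
To see what these regular expressions do, the sketch below applies the same substitutions, in the same order, to a short made-up string. The sample text and variable names are hypothetical. Note that the URL on its own line is removed, while the URL in the middle of a line is only broken into words by the final step.

```python
import re

# Hypothetical sample: one e-mail address, one URL on its own line, one URL mid-line.
sample = (
    "Email jane.doe@example.org for details.\n"
    "https://example.org/report.pdf\n"
    "See https://example.org for 10 results."
)

lowered = sample.lower()
no_email = re.sub(r'\S*@\S*\s?', '', lowered)
no_url = re.sub(r'^https?:\/\/.*[\r\n]*', '', no_email, flags=re.MULTILINE)
collapsed = re.sub(r'\s+', ' ', no_url)
cleaned = re.sub(r'(\d|\W)+', ' ', collapsed)

print(cleaned)
```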


In [7]:
## Get stopwords
def get_stop_words(stop_file_path):
    """Load stop words from a text file (one per line) into a frozenset."""
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return frozenset(stop_set)

#load a set of stop words
stopwords=get_stop_words("stop_words.txt")

In [8]:
## Tokenize and lemmatize documents

def split_stop_lemmatize(stopwords, doc_list):
    
    #initiate a lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    #initiate an empty string
    lemmatized_text=''

    #split each doc into words
    for word in doc_list.split():
            
        #check if each word is in stopword list and lemmatize, add to string
        if word not in stopwords:
            lemmatized_text = lemmatized_text+' '+ str(lemmatizer.lemmatize(word))
                
    return lemmatized_text
            
## New column "lemmatized" is formed from applying split_stop_lemmatize function to each item in the "preprocess" column in dataframe

text_df['lemmatized'] = text_df['preprocess'].apply(lambda x:split_stop_lemmatize(stopwords, x))

print(text_df['lemmatized'][1])

 aspirin primary prevention cardiovascular event systematic evidence review preventive service task force evidence synthesis number evidence synthesis number aspirin primary prevention cardiovascular event systematic evidence review preventive service task force prepared agency healthcare research quality department health human service gaither road rockville md www ahrq gov contract hhsa task order prepared kaiser permanente research affiliate evidence based practice center kaiser permanente center health research portland investigator janelle guirguis blake md corinne evans mpp caitlyn senger mph maya rowland mph elizabeth connor phd evelyn whitlock md mph ahrq publication ef september systematic review conducted coordination systematic review decision model support preventive service task force uspstf making updated clinical preventive service recommendation aspirin primary prevention original literature search completed june order prepare set manuscript derived review conducted upd

In [9]:
## Convert the "lemmatized" column in dataframe to one long string

def convert_lemmatized_to_string(text_df):
    
    global lemmatized_string_for_yake
    lemmatized_string_for_yake = ' '.join(text_df['lemmatized'].tolist())

    return lemmatized_string_for_yake

convert_lemmatized_to_string(text_df)





In [10]:
## Run Yake on pre-processed, lemmatized string
## https://github.com/LIAAD/yake/blob/master/tests/test_yake.py
## https://www.sciencedirect.com/science/article/pii/S0020025519308588?via%3Dihub

def test_yake_preprocessed_lemmatized(text):
    global keywords
    language = "en"
    max_ngram_size = 3
    deduplication_thresold = 0.9
    deduplication_algo = 'seqm'
    windowSize = 1
    numOfKeywords = 200

    custom_kw_extractor = yake.KeywordExtractor(lan=language, 
                                                n=max_ngram_size, 
                                                dedupLim=deduplication_thresold, 
                                                dedupFunc=deduplication_algo, 
                                                windowsSize=windowSize, 
                                                top=numOfKeywords, 
                                                features=None)
    keywords = custom_kw_extractor.extract_keywords(text)

#     for kw in keywords:
#         print(kw)
    return keywords

test_yake_preprocessed_lemmatized(lemmatized_string_for_yake)

## The lower the score, the more relevant the keyword is.


[('risk factor cvd', 8.460601516015482e-10),
 ('nontraditional risk factor', 1.6710836318082857e-09),
 ('cardiovascular risk factor', 2.1530819764186673e-09),
 ('cardiovascular event kaiser', 2.5179827648055475e-09),
 ('coronary heart disease', 2.783740161493331e-09),
 ('cardiovascular disease risk', 2.88995891725275e-09),
 ('kaiser permanente research', 3.237061892484154e-09),
 ('factor cvd kaiser', 3.565611907811574e-09),
 ('prevent cardiovascular event', 4.210664910018262e-09),
 ('risk prediction study', 4.742861952022172e-09),
 ('cvd cardiovascular disease', 5.251595924921653e-09),
 ('cvd kaiser permanente', 5.257991724135302e-09),
 ('event kaiser permanente', 5.474306883537874e-09),
 ('year cvd risk', 5.4852318522461e-09),
 ('cvd risk factor', 6.8289140807839246e-09),
 ('risk cardiovascular disease', 6.830811986233774e-09),
 ('coronary event fatal', 7.252507853103373e-09),
 ('year year year', 7.292047897707867e-09),
 ('study nontraditional risk', 7.467427862448901e-09),
 ('risk cv

In [11]:
# Save keywords to CSV

def create_keyword_CSV(keywords):
     
    ## Create new dataframe with keywords
    keywords_df = pd.DataFrame(keywords)

    ## Save dataframe to csv
    ## to_csv opens and closes the file itself, so no explicit close is needed
    keywords_df.to_csv(r"yake_all_documents_preprocess_lemmatization.csv", encoding='utf-8')
    
create_keyword_CSV(keywords)

## In Excel, use the TRIM() function to change the relevance scores to numbers 
## and then sort by "Sort numbers and numbers stored as text separately"

## Test YAKE! on my dataset

This option uses external pre-processing ONLY (no stopword removal or lemmatization). As in the previous options, the documents are merged into one string before YAKE! runs.

In [None]:
## Pre-process the text: lowercase it, then remove emails, URLs and special characters (periods and numbers are kept)
#https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X7RHltBKiUn

def pre_process(text):
    
    # Lowercase
    text_lower = text.lower()
    
    # Remove Emails
    text_email = re.sub('\\S*@\\S*\\s?', '', text_lower) 
    
    # Remove URLs (the ^ anchor means only URLs at the start of a line are matched)
    text_urls = re.sub(r'^https?:\/\/.*[\r\n]*', '', text_email, flags=re.MULTILINE)
    
    # Collapse all whitespace (spaces, new lines \n and tabs \t) into single spaces
    text_spaces = re.sub(r'\s+',' ',text_urls)
    
    # Remove \n from text
    text_space_character = text_spaces.replace('\n','')
    
    # Remove \t from text
    text_tab_character = text_space_character.replace('\t','')
    
    # Replace runs of characters other than letters, digits and periods with a space
    text_special = re.sub(r'[^A-Za-z0-9.]+|\s',' ',text_tab_character)
    
    # Remove tags (the pattern here is empty, so this call leaves the text unchanged)
    text_final = re.sub("","",text_special)

    # Remove special characters and numbers
    #text_numbers = re.sub("(\\d|\\W)+"," ",text_spaces)
    
    
    return text_final

## New column "preprocess_only" is formed from applying pre_process function to each item in the "content" column in dataframe
text_df['preprocess_only'] = text_df['content'].apply(lambda x:pre_process(x))

print(text_df['preprocess_only'][0])

In [None]:
## Test YAKE on all documents in dataframe
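#Join the dataframe "preprocess_only" column into one long string

def prepare_text(text):
    
    # Drop rows with missing values so that every remaining entry is a string
    text.dropna(inplace = True)
    
    # Expose the result as a global so later cells can use it
    global string_for_yake
    
    # Join all documents with spaces, using the dataframe passed in as `text`
    string_for_yake = ' '.join(text['preprocess_only'].tolist()) 
    
    return string_for_yake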

prepare_text(text_df) 


In [None]:
def test_yake_preprocessed_only(text):
    
    #print(type(text))
    global keywords
    language = "en"
    max_ngram_size = 3
    deduplication_thresold = 0.9
    deduplication_algo = 'seqm'
    windowSize = 1
    numOfKeywords = 200

    custom_kw_extractor = yake.KeywordExtractor(lan=language, 
                                                n=max_ngram_size, 
                                                dedupLim=deduplication_thresold, 
                                                dedupFunc=deduplication_algo, 
                                                windowsSize=windowSize, 
                                                top=numOfKeywords, 
                                                features=None)
    keywords = custom_kw_extractor.extract_keywords(text)

#     for kw in keywords:
#         print(kw)
    return keywords

test_yake_preprocessed_only(string_for_yake)


In [None]:
# Save keywords to CSV

def create_keyword_CSV(keywords):
     
    ## Create new dataframe with keywords
    keywords_df = pd.DataFrame(keywords)

    ## Save dataframe to csv
    ## to_csv opens and closes the file itself, so no explicit close is needed
    keywords_df.to_csv(r"yake_all_documents_preprocess_only.csv", encoding='utf-8')
    
create_keyword_CSV(keywords)

## In Excel, use the TRIM() function to change the relevance scores to numbers 
## and then sort by "Sort numbers and numbers stored as text separately"

## Test YAKE! 
This option uses the sample text from the YAKE! tests on GitHub.

In [None]:
def test_simple_interface_1():
    text_content = """
    Sources tell us that Google is acquiring Kaggle, a platform that
    hosts data science and machine learning competitions. Details about
    the transaction remain somewhat vague , but given that Google is hosting
    its Cloud Next conference in San Francisco this week, the official announcement
    could come as early    as tomorrow.  Reached by phone, Kaggle co-founder
    CEO Anthony Goldbloom declined to deny that the
    acquisition is happening. Google itself declined 'to comment on rumors'.
    Kaggle, which has about half a million data scientists on its platform,
    was founded by Goldbloom    and Ben Hamner in 2010. The service got an
    early start and even though it has a few competitors    like DrivenData,
    TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its
    specific niche. The service is basically the de facto home for running data science
    and machine learning    competitions.  With Kaggle, Google is buying one of the largest
    and most active communities for    data scientists - and with that, it will get increased
    mindshare in this community, too    (though it already has plenty of that thanks to Tensorflow
    and other projects).    Kaggle has a bit of a history with Google, too, but that's pretty recent.
    Earlier this month,    Google and Kaggle teamed up to host a $100,000 machine learning competition
    around classifying    YouTube videos. That competition had some deep integrations with the
    Google Cloud Platform, too.    Our understanding is that Google will keep the service running -
    likely under its current name.    While the acquisition is probably more about Kaggle's community
    than technology, Kaggle did build    some interesting tools for hosting its competition and 'kernels',
    too. On Kaggle, kernels are    basically the source code for analyzing data sets and developers can
    share this code on the    platform (the company previously called them 'scripts').  Like similar
    competition-centric sites,    Kaggle also runs a job board, too. It's unclear what Google will do
    with that part of the service.    According to Crunchbase, Kaggle raised $12.5 million (though PitchBook
    says it's $12.75) since its    launch in 2010. Investors in Kaggle include Index Ventures, SV Angel,
    Max Levchin, Naval Ravikant,    Google chief economist Hal Varian, Khosla Ventures and Yuri Milner
    """

    pyake = yake.KeywordExtractor(lan="en",n=3)

    result = pyake.extract_keywords(text_content)

    print(result)

    # Lowercase the keywords so the checks below do not depend on the casing YAKE! returns
    keywords = [kw[0].lower() for kw in result]

    print(keywords)
    assert "google" in keywords
    assert "kaggle" in keywords
    assert "san francisco" in keywords
    assert "machine learning" in keywords

test_simple_interface_1()

In [None]:
def test_simple_interface_2():
    text_content = """
    Sources tell us that Google is acquiring Kaggle, a platform that
    hosts data science and machine learning competitions."""

    pyake = yake.KeywordExtractor(lan="ca",n=3)

    result = pyake.extract_keywords(text_content)

    print(result)

    assert len(result) > 0

test_simple_interface_2()

In [None]:
def test_simple_interface_3():
    text = "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning "\
    "competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud "\
    "Next conference in San Francisco this week, the official announcement could come as early as tomorrow. "\
    "Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. "\
    "Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, "\
    "was founded by Goldbloom  and Ben Hamner in 2010. "\
    "The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, "\
    "it has managed to stay well ahead of them by focusing on its specific niche. "\
    "The service is basically the de facto home for running data science and machine learning competitions. "\
    "With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, "\
    "it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow "\
    "and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, "\
    "Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. "\
    "That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google "\
    "will keep the service running - likely under its current name. While the acquisition is probably more about "\
    "Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition "\
    "and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can "\
    "share this code on the platform (the company previously called them 'scripts'). "\
    "Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with "\
    "that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) "\
    "since its   launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, "\
    "Google chief economist Hal Varian, Khosla Ventures and Yuri Milner "
    
    kw_extractor = yake.KeywordExtractor()
    keywords = kw_extractor.extract_keywords(text)

    for kw in keywords:
        print(kw)

test_simple_interface_3()

In [None]:
def test_simple_interface_4():
    
    text = "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning "\
    "competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud "\
    "Next conference in San Francisco this week, the official announcement could come as early as tomorrow. "\
    "Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. "\
    "Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, "\
    "was founded by Goldbloom  and Ben Hamner in 2010. "\
    "The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, "\
    "it has managed to stay well ahead of them by focusing on its specific niche. "\
    "The service is basically the de facto home for running data science and machine learning competitions. "\
    "With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, "\
    "it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow "\
    "and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, "\
    "Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. "\
    "That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google "\
    "will keep the service running - likely under its current name. While the acquisition is probably more about "\
    "Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition "\
    "and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can "\
    "share this code on the platform (the company previously called them 'scripts'). "\
    "Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with "\
    "that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) "\
    "since its   launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, "\
    "Google chief economist Hal Varian, Khosla Ventures and Yuri Milner "
    
    print(type(text))
    language = "en"
    max_ngram_size = 3
    deduplication_thresold = 0.9
    deduplication_algo = 'seqm'
    windowSize = 1
    numOfKeywords = 20

    custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_thresold, dedupFunc=deduplication_algo, windowsSize=windowSize, top=numOfKeywords, features=None)
    keywords = custom_kw_extractor.extract_keywords(text)

    for kw in keywords:
        print(kw)

test_simple_interface_4()