# Keywords Extractor
Inspiration for using fastText to group similar concepts and generate candidates (the Key2Vec paper): 
    https://aclweb.org/anthology/N18-2100

First, the train set is processed to derive the assumptions. The actual evaluation is done on the test set at the end of this notebook. 

The dataset is SemEval 2010. It is available in the `data` folder. For evaluation, please unpack it to any location and make sure the correct locations are set in the cell below.

The extractor also needs a fastText model, available here: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz . Please unpack it and define the path to the model below.


In [1]:
TRAINSET_LOCATION = "/tmp/keyword_gen/SemEval2010/train/"
TESTSET_LOCATION = "/tmp/keyword_gen/SemEval2010/test/"
FASTTEXT_MODEL_PATH = "/tmp/cc.en.300.bin"

In [2]:
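from gensim.test.utils import datapath
from gensim.models import FastText

# datapath() leaves an absolute path unchanged, so FASTTEXT_MODEL_PATH is used as given.
cap_path = datapath(FASTTEXT_MODEL_PATH)
# Loads the Facebook binary model. Newer gensim releases replace load_fasttext_format
# with gensim.models.fasttext.load_facebook_model / load_facebook_vectors.
fb_partial = FastText.load_fasttext_format(cap_path, full_model=False)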


In [3]:
# Helper class for discovering relevant POS sequences.
# Only tags starting with these letters (J = adjectives, N = nouns) may be part of a candidate.
acceptable_pos = set()
acceptable_pos.add("J")
acceptable_pos.add("N")
#acceptable_pos.add("V")

class Trie():
    def __init__(self, pos, parent=None, terminal = 0):
        self.pos = pos
        self.parent = parent
        self.children = {}
        self.terminal = terminal
        self.deph = 0
        par = parent
        while par is not None:
            self.deph += 1
            par = par.parent
            
    def add_node(self, pos):
        if pos not in self.children:
            self.children[pos] = Trie(pos, self)
        return self.children[pos]
        
    def print_me(self, indent=""):
        print(indent+ self.pos+": ")
        for p, chld in self.children.items():
            chld.print_me(indent+"-")
    
    def can_start(self, pos):
        if len(pos)>0 and pos[0] in acceptable_pos and pos[0] in self.children:
            return self.children[pos[0]]
        return self  
    
    def can_move(self, pos):
        if len(pos)>0 and pos[0] in acceptable_pos and pos[0] in self.children:
            return True
        return False 
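
To make the role of the trie concrete, here is a minimal, self-contained sketch. It is not the notebook's `Trie` class: `PosTrie`, `insert` and `count` are names introduced only for this illustration. It stores POS sequences reduced to the first letter of each tag and counts how many keywords end at each node, which is the information that `terminal` carries above.

```python
class PosTrie:
    """Minimal trie over POS-tag initials; `terminal` counts keywords ending at a node."""

    def __init__(self):
        self.children = {}
        self.terminal = 0

    def insert(self, tags):
        # Walk (and create) one node per tag initial, then mark the end of a keyword.
        node = self
        for tag in tags:
            node = node.children.setdefault(tag[0], PosTrie())
        node.terminal += 1

    def count(self, tags):
        # Return how many inserted keywords had exactly this sequence of tag initials.
        node = self
        for tag in tags:
            node = node.children.get(tag[0])
            if node is None:
                return 0
        return node.terminal


trie = PosTrie()
for keyword_tags in (["JJ", "NN"], ["JJ", "NNS"], ["NN"], ["NN", "NN", "NN"]):
    trie.insert(keyword_tags)

print(trie.count(["JJ", "NN"]))  # "JJ NN" and "JJ NNS" both collapse to J-N
print(trie.count(["VB", "NN"]))  # pattern never seen among the inserted keywords
```

Collapsing tags to their first letter keeps the trie small and lets singular and plural nouns share one path.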

### Read dataset: SemEval
SemEval is used so that the results can be compared with the Key2Vec paper (linked at the beginning).

In [107]:
import os

import numpy as np
import spacy
nlp = spacy.load("en", disable=["parser", "textcat"])
nlp.add_pipe(nlp.create_pipe('sentencizer'))

class dataset_loader():
    
    def __init__(self, path):
        self.path = path
        self.documents = {}
        self.keywords = {}
        self.keywords_count = 0
        
    def get_all_files(self, extension):
        all_files = []
        for root, dirs, files in os.walk(self.path):
            for file in files:
                if file.endswith(extension):
                     all_files.append(os.path.join(root, file))
        return all_files
    
    def read_file(self, path):
        lines = []
        with open(path) as fp:  
            lines = fp.readlines()
        return lines
    
    # extract sequences of words whose POS tags follow the same sequence as the keywords' POS sequences
    def extract_keywords_candidates(self, root, minimal_frequency=2, stemmer=None):
        candidates = {}
        doc_candidates = {}
        document_embeddings = {}
        for doc_id, text in self.abstracts.items():
            doc_embedding = None
            print(doc_id+", ", end = '')
            if doc_id not in doc_candidates:
                doc_candidates[doc_id]={}
            processed = nlp(text)
            sent_id = 0
            for sent in processed.sents:
                sent_id += 1
                sent_toks = []
                tmp_currents = [root]
                tmp_currents2 = []
                local_candidates = []
                for token in sent:
                    if token.is_alpha and not token.is_stop and len(token.lemma_)>3: # keywords are mostly alphabetic and don't contain stop words
                        if stemmer is None:
                            sent_toks.append(token.lemma_)
                        else:
                            sent_toks.append(stemmer.stem(token.text))
                    else:
                        sent_toks.append("--ommit--")
                    # print(token.lemma_+"  len curremt "+str(len(tmp_currents)))
                    for current in tmp_currents:
                        is_valid_keyword = True
                        new_current = current.can_start(token.tag_)
                        if current == new_current: #fail or start from scratch 
                            current = root
                            
                        else: # move forward
                            current = new_current
                        if current.terminal > minimal_frequency:
                            #avoid non words
                            for tok in sent_toks[-current.deph:]:
                                if tok=="--ommit--":
                                    is_valid_keyword = False
                            candidate_sent = " ".join(sent_toks[-current.deph:])
                            #avoid repetitions
                            if is_valid_keyword:
                                for lc in local_candidates:
                                    lcstr = " ".join(lc)
                                    if candidate_sent == lcstr:
                                        is_valid_keyword = False
                                        break
                            if is_valid_keyword:
                                local_candidates.append(sent_toks[-current.deph:])
                                #print("Add to local candidates "+" ".join(sent_toks[-current.deph:])+"   "+str(current.deph) )
                            
                        if current != root and is_valid_keyword:
                            tmp_currents2.append(current)
                            potential_start = root.can_start(token.tag_)
                            if root!=potential_start and current!= potential_start:
                                tmp_currents2.append(potential_start)
                    tmp_currents = tmp_currents2.copy()
                    if len(tmp_currents)==0:
                        if len(local_candidates)>0:# select best from local candidates
                            longest = None
                            for lc in local_candidates:
                                if longest is None or len(lc) > len(longest):
                                    longest=lc
                            candidate_str = " ".join(longest)
                            if candidate_str not in candidates:
                                candidates[candidate_str]= {}
                            if doc_id not in candidates[candidate_str]:
                                candidates[candidate_str][doc_id]=0
                            candidates[candidate_str][doc_id]+=1
                            if candidate_str not in doc_candidates[doc_id]:
                                doc_candidates[doc_id][candidate_str]=0
                            doc_candidates[doc_id][candidate_str]+=1
                            if sent_id <= 10:
                                if doc_embedding is None:
                                    doc_embedding = np.array(fb_partial.wv[candidate_str])
                                else:
                                    doc_embedding+=np.array(fb_partial.wv[candidate_str])
                        local_candidates.clear()
                        tmp_currents = [root]
                    tmp_currents2.clear()   
            document_embeddings[doc_id] = doc_embedding
        return candidates, doc_candidates, document_embeddings
    
class semelval_loader(dataset_loader):
    
    def __init__(self, path):
        super().__init__(path)
        
        self.keywords_reader = {}
        self.keywords_combined = {}
        self.abstracts = {}
        self.keywords_combined_count =0
        self.keywords_reader_count =0
        
    def _get_abstract(self, lines, all_document=False):
        ret = ""
        i = 0
        start = False
        for l in lines:
            l=l.strip()
            if not all_document and l == "1. INTRODUCTION":
                break
            if ". REFERENCES" in l:
                break
            if i==0:
                ret = l
            elif i == 1 or start:
                ret += " "+l  
            if l == "ABSTRACT":
                start = True
            i+=1
        return ret.replace("-", " ")
    
    def read_keywords(self, content):
        ret_keywords = {}
        total_keywords = 0
        for c in content:
            splitted = c.split(":")
            doc_id = splitted[0].strip()
            keywords = splitted[1].strip().replace("-", " ").split(",")
            ret_keywords[doc_id]=keywords
            total_keywords += len(keywords)
        return ret_keywords, total_keywords
    
    def load_data(self, side="train"):
        files = self.get_all_files("final")
        stem = ""
        if side=="test":
            stem="stem."
        for f in files:
            fname=f.split("/")[-1]
            if "txt.final" in fname:
                content = self.read_file(f)
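                # NOTE: fname[:4] assumes four-character ids such as "C-57". A shorter id such as
                # "C-6" becomes "C-6." (visible in the test-set output) and will not match the
                # id read from the keyword files if those use the form "C-6".
                self.abstracts[fname[:4]]=self._get_abstract(content)
                self.documents[fname[:4]]=self._get_abstract(content, True)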
            elif fname==side+".author."+stem+"final":
                content = self.read_file(f)
                self.keywords, self.keywords_count = self.read_keywords(content)
            elif fname==side+".combined."+stem+"final":
                content = self.read_file(f)
                self.keywords_combined, self.keywords_combined_count = self.read_keywords(content)
            elif fname==side+".reader."+stem+"final":
                content = self.read_file(f)
                self.keywords_reader, self.keywords_reader_count = self.read_keywords(content)
                  
ds = semelval_loader(TRAINSET_LOCATION)
ds.load_data()
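
`load_data` takes the document id from the first four characters of the file name. As a hedged alternative, the sketch below takes everything before the first dot instead. It assumes file names of the form `<id>.txt.final`, and `doc_id_from_filename` is a hypothetical helper, not part of the loader above.

```python
import os


def doc_id_from_filename(path):
    """Return the part of the file name before the first dot, e.g. 'C-6' for 'C-6.txt.final'."""
    return os.path.basename(path).split(".", 1)[0]


short_id = doc_id_from_filename("/tmp/keyword_gen/SemEval2010/test/C-6.txt.final")
long_id = doc_id_from_filename("/tmp/keyword_gen/SemEval2010/train/C-57.txt.final")
print(short_id, long_id)

# For comparison, fixed-width slicing keeps the trailing dot of a short id.
print("C-6.txt.final"[:4])
```

Using this in the loader would change which test documents are matched against the gold keywords, so the reported numbers would need to be recomputed.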

### Which keywords are better to use?
Possible options are: author, reader and combined keywords

In [5]:
#compare keywords coverage

def keyword_coverage(documents, keywords_corpora):
    all_keywords = 0
    covered_keywords = 0
    for doc_id, keywords in keywords_corpora.items():
        text  = documents[doc_id]
        all_keywords += len(keywords)
        for keyword in keywords:
            if keyword[:-1] in text: #simplified stemming just for stat purposes
                covered_keywords += 1
    return covered_keywords, all_keywords

print("--Abstracts--")
n , a = keyword_coverage(ds.abstracts, ds.keywords)
print("Authors keyword coverage "+str(n)+" "+str(a)+" = "+str(n/a))

n , a = keyword_coverage(ds.abstracts, ds.keywords_reader)
print("Readers keyword coverage "+str(n)+" "+str(a)+" = "+str(n/a))

n , a = keyword_coverage(ds.abstracts, ds.keywords_combined)
print("Combined keyword coverage "+str(n)+" "+str(a)+" = "+str(n/a))

print("--Full documents--")
n , a = keyword_coverage(ds.documents, ds.keywords)
print("Authors keyword coverage "+str(n)+" "+str(a)+" = "+str(n/a))

n , a = keyword_coverage(ds.documents, ds.keywords_reader)
print("Readers keyword coverage "+str(n)+" "+str(a)+" = "+str(n/a))

n , a = keyword_coverage(ds.documents, ds.keywords_combined)
print("Combined keyword coverage "+str(n)+" "+str(a)+" = "+str(n/a)+" \n")

print("Average keyword count for author "+str(ds.keywords_count / len(ds.keywords) ) )
print("Average keyword count for reader "+str(ds.keywords_reader_count / len(ds.keywords_reader) ) )
print("Average keyword count combined "+str(ds.keywords_combined_count / len(ds.keywords_combined) ) )

--Abstracts--
Authors keyword coverage 220 559 = 0.3935599284436494
Readers keyword coverage 686 1824 = 0.37609649122807015
Combined keyword coverage 823 2223 = 0.3702204228520018
--Full documents--
Authors keyword coverage 367 559 = 0.6565295169946332
Readers keyword coverage 1429 1824 = 0.7834429824561403
Combined keyword coverage 1672 2223 = 0.7521367521367521 

Average keyword count for author 3.8819444444444446
Average keyword count for reader 12.666666666666666
Average keyword count combined 15.4375
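
The coverage check above is a plain substring test. The sketch below reruns the same logic on made-up toy data (the document text and keywords are invented for illustration) to show what the `keyword[:-1]` shortcut does.

```python
def keyword_coverage(documents, keywords_by_doc):
    """Count gold keywords that occur in the text after dropping their last character."""
    covered = total = 0
    for doc_id, keywords in keywords_by_doc.items():
        text = documents[doc_id]
        for keyword in keywords:
            total += 1
            # Dropping the last character is a crude stand-in for stemming:
            # "congestion games" still matches a text containing "congestion game".
            if keyword[:-1] in text:
                covered += 1
    return covered, total


toy_documents = {"D-1": "we study a congestion game with identical resources"}
toy_keywords = {"D-1": ["congestion games", "identical resources", "nash equilibrium"]}

covered, total = keyword_coverage(toy_documents, toy_keywords)
print(covered, total, covered / total)
```

Because this is a substring test, it can also match inside longer words, so the coverage figures are best read as rough estimates.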


### For further processing we'll use the combined keywords, as they contain the largest number of keywords, with only slightly lower coverage

In [6]:
ds.keywords = ds.keywords_combined

We'll focus on the combined reader and author keywords, as they are the most numerous, with only slightly lower coverage than the readers' keywords

## What do keywords look like?
What are the most common POS sequences among them?

In [7]:
import operator
#what kind of POS'es are the keywords.

pos_frequency = {}
keywors_sequence_freqs = {}
keywords_freq = {}
root = Trie("")
for doc_id, keywords in ds.keywords_combined.items():
    for keyword in keywords:
        if keyword not in keywords_freq:
            keywords_freq[keyword] = 0
        keywords_freq[keyword] += 1
        
        processed = nlp(keyword)
        sequence = ""
        nnode = root
        for token in processed:
            nnode = nnode.add_node(token.tag_[0])
#             print(token.text, token.lemma_, token.pos_, token.tag_,
#                     token.shape_, token.is_alpha, token.is_stop)
            sequence+=token.tag_+" "
            if token.tag_ not in pos_frequency:
                pos_frequency[token.tag_] = 0
            pos_frequency[token.tag_] += 1
        nnode.terminal += 1
        if sequence not in keywors_sequence_freqs:
            keywors_sequence_freqs[sequence] = 0
        keywors_sequence_freqs[sequence] += 1
pos_frequency = sorted(pos_frequency.items(), key=operator.itemgetter(1))
keywors_sequence_freqs = sorted(keywors_sequence_freqs.items(), key=operator.itemgetter(1))
keywords_freq = sorted(keywords_freq.items(), key=operator.itemgetter(1))
            
print("Most popular POSes among keywords "+str(pos_frequency))
# print("Keywords POS sequences "+str(keywors_sequence_freqs))
# print("How uniformly keywords are spread across documents "+str(keywords_freq))

Most popular POSes among keywords [('SYM', 1), ('ADD', 1), ('.', 1), ('``', 1), ('VBZ', 1), ('RP', 1), ('POS', 2), ('JJS', 2), ('AFX', 2), ('NNP', 3), ('LS', 4), ('JJR', 4), ('XX', 5), ('DT', 6), ('UH', 8), ('TO', 10), ('CD', 11), ('FW', 12), ('VBD', 20), ('RB', 20), ('NNS', 31), ('CC', 38), ('VBP', 41), ('VBG', 89), ('VB', 101), ('IN', 102), ('VBN', 162), ('JJ', 868), ('NN', 3646)]
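
To read this output more easily, the tag counts can be tallied by the first letter of each tag, which is also how the trie groups them. The sketch below copies the counts verbatim from the output above; the variable names are introduced here for illustration only.

```python
from collections import Counter

# Tag counts copied from the cell output above.
pos_frequency = [
    ('SYM', 1), ('ADD', 1), ('.', 1), ('``', 1), ('VBZ', 1), ('RP', 1),
    ('POS', 2), ('JJS', 2), ('AFX', 2), ('NNP', 3), ('LS', 4), ('JJR', 4),
    ('XX', 5), ('DT', 6), ('UH', 8), ('TO', 10), ('CD', 11), ('FW', 12),
    ('VBD', 20), ('RB', 20), ('NNS', 31), ('CC', 38), ('VBP', 41),
    ('VBG', 89), ('VB', 101), ('IN', 102), ('VBN', 162), ('JJ', 868),
    ('NN', 3646),
]

# Sum the counts of all tags that share the same first letter.
by_initial = Counter()
for tag, count in pos_frequency:
    by_initial[tag[0]] += count

total = sum(by_initial.values())
noun_adj_share = (by_initial["N"] + by_initial["J"]) / total

print(by_initial["N"], by_initial["J"], by_initial["V"], total)
print(round(noun_adj_share, 3))
```

In this tally, noun and adjective tags together make up roughly 88% of all keyword tokens, and verb tags are the next largest group.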


### Conclusion
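Keyword candidates will follow the structure of the target keywords, limited to POS tags starting with *N*, *J* and *V*, which cover the most popular keyword structures. (In the helper cell above, *V* is commented out in `acceptable_pos`; the comparison is in the results at the end of the notebook.)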

# Extract keyword candidates
Process documents and extract token sequences whose POS tags match the keywords' POS sequences
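
The idea can be sketched in a much simplified form, assuming the tokens are already POS-tagged. This is not the notebook's `extract_keywords_candidates` (no trie walk, lemmatisation or frequency threshold); `candidate_phrases` and `allowed_patterns` are hypothetical names used only here.

```python
ACCEPTABLE = {"J", "N"}  # tag initials allowed inside a candidate


def candidate_phrases(tagged_tokens, allowed_patterns):
    """Return maximal runs of J/N tokens whose tag-initial pattern is allowed."""
    phrases, run = [], []
    # The empty sentinel at the end flushes a run that reaches the sentence end.
    for word, tag in list(tagged_tokens) + [("", "")]:
        if tag[:1] in ACCEPTABLE:
            run.append((word, tag[0]))
            continue
        if run:
            pattern = "".join(initial for _, initial in run)
            if pattern in allowed_patterns:
                phrases.append(" ".join(w for w, _ in run))
            run = []
    return phrases


sentence = [
    ("we", "PRP"), ("study", "VBP"), ("distributed", "JJ"),
    ("artificial", "JJ"), ("intelligence", "NN"), ("in", "IN"),
    ("congestion", "NN"), ("games", "NNS"),
]
phrases = candidate_phrases(sentence, allowed_patterns={"JJN", "JN", "NN", "N"})
print(phrases)
```

In the notebook, the allowed patterns are not hard-coded; they come from the trie built over the gold keywords, filtered by `minimal_frequency`.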

In [108]:
print("Processed documents:")
keywords, doc_keywords, doc_embeddings = ds.extract_keywords_candidates(root, minimal_frequency=2)


Processed documents:
C-57, H-52, I-66, I-51, H-44, J-40, I-45, J-50, C-45, H-48, H-50, I-38, I-37, I-54, H-41, J-67, J-51, H-92, H-38, H-53, H-81, J-36, H-88, C-77, H-84, H-62, C-46, J-44, C-44, H-73, J-47, H-96, I-58, C-75, I-73, C-52, I-60, J-37, I-47, J-55, I-59, H-69, H-79, J-66, J-45, C-56, H-64, J-42, I-57, H-47, I-43, I-55, I-65, J-73, C-55, C-81, H-49, C-61, J-62, J-38, C-50, C-78, H-46, J-70, I-77, J-59, I-75, I-64, J-71, I-42, C-62, I-56, H-97, J-49, J-39, C-74, C-67, H-45, J-61, J-56, I-48, J-74, H-85, C-71, C-72, H-54, H-35, H-87, C-80, C-69, I-74, H-98, J-60, C-66, I-49, H-60, C-54, I-53, J-58, C-76, I-68, C-48, J-57, J-34, J-53, I-62, I-76, I-46, C-41, C-83, J-41, C-42, J-69, I-71, C-58, I-52, H-43, H-90, J-33, J-72, J-65, H-61, H-83, I-61, J-63, J-52, I-50, H-42, C-65, C-49, H-77, C-68, H-37, I-63, H-82, C-53, H-63, I-70, H-40, I-72, H-95, J-35, C-79, C-84, 

### Gather embeddings
For all candidate keywords, gather their embeddings to avoid repeated processing

In [40]:
def gather_embeddings(keywords):
    embeddings = {}
    for word, _ in keywords.items():
        embeddings[word] = fb_partial.wv[word]
    return embeddings

embeddings = gather_embeddings(keywords)

# Candidate filtering
To select the best keyword candidates for each document, we'll use tf-idf. 

Since the target keywords often express the same meaning with different words (e.g. load-dependent resource failure, load-dependent failure), the tf-idf is calculated in a fuzzy way, for meanings instead of words. 

To group words into meanings we'll use a simple threshold on embedding similarity, chosen by experiments on the train set.

The tf for a given word is calculated by adding to the usual tf the frequencies of the words having the same meaning, each multiplied by the cosine similarity of their embeddings.

The idf is calculated in a similar manner: by adding to the usual document count the number of documents each "sibling" word appeared in, multiplied by their similarity. (This fuzzy part of the idf is commented out in the code below.)
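
The fuzzy tf described above can be sketched without any external library. The vectors below are tiny made-up stand-ins for fastText embeddings, and `cosine` and `fuzzy_tf` are illustrative names; the notebook's own `calculate_tf` additionally builds the meaning groups.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fuzzy_tf(counts, vectors, threshold=0.8):
    """tf(w) = sum of cos(w, v) * count(v) over all words v with cos(w, v) > threshold."""
    scores = {}
    for word in counts:
        score = 0.0
        for other in counts:
            similarity = cosine(vectors[word], vectors[other])
            if similarity > threshold:
                # The word itself has similarity 1, so its own count is included.
                score += similarity * counts[other]
        scores[word] = score
    return scores


counts = {"congestion game": 3, "congestion games": 2, "nash equilibrium": 1}
vectors = {
    "congestion game": [1.0, 0.0],
    "congestion games": [0.9, 0.1],
    "nash equilibrium": [0.0, 1.0],
}
tf = fuzzy_tf(counts, vectors)
for word, score in tf.items():
    print(word, round(score, 3))
```

Here the two near-identical phrases reinforce each other and both score close to 5, while the unrelated phrase keeps its plain count of 1.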

In [124]:
import math
from multiprocessing import Pool, cpu_count
import itertools

PROCESS_COUNT = cpu_count()-1
simmilarity_threshold = 0.8 # assumption derived from manual similarity analysis 

def group_results(scores, groups):
    result = {}
    for word, score in scores:
        group_processed = False
        if word in groups:
            for w in groups[word]:
                if w in result:
                    group_processed = True
                    break
            if group_processed:
                continue
        result[word] = score
    return result
            

def calculate_tf(keywords, embeddings, simmilarity_threshold):
    tf_scores = {}
    groups = {}
    
    all_words = list(keywords.keys())
    word_embeddings=[]
    for word2, freq in keywords.items():
        word_embeddings.append(embeddings[word2])
    for word, freq in keywords.items():
        word_embedding = embeddings[word]
        dists = []
        
        dists = fb_partial.wv.cosine_similarities(word_embedding, word_embeddings)
        score = 0
        for d, w in zip(dists, all_words):
            if d > simmilarity_threshold:
                score += d * keywords[w]
                if w != word:
                    if word not in groups and w not in groups:
                        groups[word] = set()
                        groups[w] = groups[word]
                    elif word not in groups:
                        groups[word] = groups[w]
                    else:
                        groups[w] = groups[word]
                    groups[word].add(w)
                    groups[word].add(word)
                    groups[w].add(w)
                    groups[w].add(word)
                #print(word+"   <-  "+w+"  "+str(d)+"  "+str(keywords[w]))
        tf_scores[word] = score
    tf_scores = sorted(tf_scores.items(), key=lambda tup: tup[1], reverse=True)
    tf_scores = group_results(tf_scores, groups)
    return tf_scores, groups

# modified idf: for each word of the meaning group, compute idf from the number of documents it appeared in and keep the best score
def idf(word, embeddings, keywords, doclen, simmilarity_threshold, groups):
    best_score = 0
    best_group_word = ""
    if word not in groups:
        groups[word] = set()
        groups[word].add(word)
    for wg in groups[word]:
        doc_number = len(keywords[wg]) -1
#         for w, _ in keywords.items():
#             dist = fb_partial.wv.cosine_similarities(embeddings[wg], [embeddings[w]])
#             if dist > simmilarity_threshold:
#                 doc_number += dist * (len(keywords[w]) -1 )
        score = math.log(doclen / (doc_number + 0.1))
        if score > best_score:
            best_score = score
            best_group_word = wg
    return best_score, best_group_word

def tfidf(tf_keywords, doc_embedding, keywords, embeddings, doclen, simmilarity_threshold, groups):
    result = []
    for word, tf in tf_keywords.items():
        idf_score, best_word = idf(word, embeddings, keywords, doclen, simmilarity_threshold, groups)
        cos_simmilar_to_doc = fb_partial.wv.cosine_similarities(doc_embedding, [embeddings[word]])[0]
        #print(word+" -> "+best_word+"   tf= "+str(tf)+"   idf =  "+str(idf_score)+"  cos_simmilar_to_doc= "+str(cos_simmilar_to_doc) )
        result.append( (word, tf * idf_score *cos_simmilar_to_doc ) ) # tf-idf adjusted by similarity to document embedding
    result = sorted(result, key=lambda tup: tup[1], reverse=True)
    return result

def predict(doc_id, doc_words,doc_embedding, keywords, embeddings, simmilarity_threshold):
    predictions = {}
    groups_dict = {}
    #print("start "+doc_id)
    tf_keywords, groups = calculate_tf(doc_words, embeddings, simmilarity_threshold)
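    # NOTE: doc_keywords is read from the global scope here (it is not a parameter),
    # so the document count comes from the train-set run, also during test evaluation.
    keyword_prediction = tfidf(tf_keywords, doc_embedding, keywords, embeddings, len(doc_keywords), simmilarity_threshold, groups)
    predictions[doc_id] = keyword_prediction[:15] # SemEval requires 15 keywords per document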
    groups_dict[doc_id] = groups
    #print("end "+doc_id+"  -> ")#
    return predictions, groups_dict

docs = list(doc_keywords.keys())
keyword_lists = []
embeddings_list = []
for d in docs:
    keyword_lists.append(doc_keywords[d])
    embeddings_list.append(doc_embeddings[d])
pool = Pool(processes=PROCESS_COUNT)
results = pool.starmap(predict, zip(docs, keyword_lists,embeddings_list, itertools.repeat(keywords), itertools.repeat(embeddings),itertools.repeat(simmilarity_threshold), ) )
groups_dict={}
predictions={}
for p,g in results:
    for docid, vals in p.items():
        predictions[docid] = vals
    for docid, vals in g.items():
        groups_dict[docid] = vals
print("ready")

ready
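
The scoring in `idf` and `tfidf` above reduces to idf = log(N / (df - 1 + 0.1)) and score = tf * idf * cos(document, candidate). The sketch below writes this out on its own; the input values are made up for illustration, and `idf_score` and `rank_score` are names introduced here.

```python
import math


def idf_score(doc_frequency, n_docs):
    """Mirror of the notebook's formula: log(N / (df - 1 + 0.1))."""
    return math.log(n_docs / (doc_frequency - 1 + 0.1))


def rank_score(tf, doc_frequency, n_docs, similarity_to_doc):
    """tf-idf weighted by the cosine similarity between candidate and document embedding."""
    return tf * idf_score(doc_frequency, n_docs) * similarity_to_doc


N_DOCS = 144  # hypothetical corpus size for this example

rare = rank_score(tf=2.0, doc_frequency=1, n_docs=N_DOCS, similarity_to_doc=0.6)
common = rank_score(tf=2.0, doc_frequency=40, n_docs=N_DOCS, similarity_to_doc=0.6)
off_topic = rank_score(tf=2.0, doc_frequency=1, n_docs=N_DOCS, similarity_to_doc=0.1)

print(round(rare, 3), round(common, 3), round(off_topic, 3))
```

The `- 1 + 0.1` term means a candidate seen in a single document is divided by 0.1 rather than 1, which strongly favours document-specific phrases. The similarity factor then pushes down candidates that are far from the document embedding.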


In [121]:
def calculate_stats(predictions, groups_dict, gold_keywords, use_meanings=False, show_details = False):
    tp =0
    fp=0
    tn=0
    fn=0
    for doc_id, doc_gold_keywords in gold_keywords.items():
        if doc_id not in predictions:
            continue
        if show_details:
            print("Detailed results for document "+doc_id)
        preds = predictions[doc_id]
        groups = groups_dict[doc_id]
        if show_details and use_meanings:
            print("Stats are calculated based on meaning groups:")
            i=0
            shown = set()
            for word, st in groups.items():
                if "".join(st) in shown or len(st)<=1:
                    continue
                print("Meaning group "+str(i)+" = "+str(st))
                shown.add("".join(st))
                i+=1
        #precision - related
        pred_words_set = set()
        for pred_word, score in preds:
            if use_meanings:
                if pred_word not in groups:
                    groups[pred_word] = {pred_word} # a word without a group forms its own one-word group
                found = False
                for gword in groups[pred_word]:
                    if gword in doc_gold_keywords:
                        tp+=1
                        found = True
                        if show_details:
                            print("Correctly predicted meaning: "+str(groups[pred_word])+" score= "+str(score))
                        break
                if not found:
                    fp += 1
                pred_words_set |= groups[pred_word]
            else:
                if pred_word in doc_gold_keywords:
                    if show_details:
                        print("Correctly predicted keyword: "+pred_word+" score= "+str(score))
                    tp+=1
                else:
                    fp+=1
                pred_words_set.add(pred_word)
        #recall related
        for gold_word in doc_gold_keywords:
            if gold_word not in pred_words_set:
                fn+=1
    if show_details:
        print("True positives: "+str(tp))
        print("False positives: "+str(fp))
        print("False negatives: "+str(fn))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print("Precision = "+str(precision)+"   Recall = "+str(recall))
calculate_stats(predictions, groups_dict, ds.keywords_combined)

Precision = 0.15831381733021077   Recall = 0.1525959367945824


The above statistics assume simple word matching.

If we aim for meaning matching (as initially assumed), the results are as follows:

In [116]:
calculate_stats(predictions, groups_dict, ds.keywords_combined, use_meanings=True)

Precision = 0.17892271662763465   Recall = 0.1734786557674841


The meaning groups are:

In [13]:
for d in docs:
    i=0
    shown = set()
    for word, st in groups_dict[d].items():
        if "".join(st) in shown or len(st)<=1:
            continue
        print(d+" - meaning group "+str(i)+" = "+str(st))
        shown.add("".join(st))
        i+=1

C-57 - meaning group 0 = {'congestion setting', 'congestion games', 'congestion game'}
C-57 - meaning group 1 = {'loaddependent failure', 'load dependent failures', 'load dependent failure'}
C-57 - meaning group 2 = {'identical resource', 'identical resources'}
C-57 - meaning group 3 = {'distributed artificial intelligence', 'artificial intelligence'}
H-52 - meaning group 0 = {'vocabulary independent spoken term', 'vocabulary independent system'}
H-52 - meaning group 1 = {'phonetic lattice', 'phonetic transcript', 'such phonetic transcript'}
H-52 - meaning group 2 = {'information search', 'information storage'}
I-66 - meaning group 0 = {'global optimality', 'global optimal algorithm'}
I-51 - meaning group 0 = {'argumentation framework', 'counterargument generation policy', 'argument generation policy', 'argumentation process'}
I-51 - meaning group 1 = {'distributed artificial intelligence multiagent', 'artificial intelligence'}
I-45 - meaning group 0 = {'programming languages', 'progra

# Evaluation using the test set
Warning! Before the test evaluation, please make sure the answer files are copied to the test directory (where the documents are).

In [122]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

ts = semelval_loader(TESTSET_LOCATION)
ts.load_data(side="test")
ts.keywords = ts.keywords_combined
tkeywords, tdoc_keywords, tdoc_embeddings = ts.extract_keywords_candidates(root, stemmer=ps)
tembeddings = gather_embeddings(tkeywords)

docs = list(tdoc_keywords.keys())
keyword_lists = []
embeddings_list = []
for d in docs:
    keyword_lists.append(tdoc_keywords[d])
    embeddings_list.append(tdoc_embeddings[d])
pool = Pool(processes=PROCESS_COUNT)
results = pool.starmap(predict, zip(docs, keyword_lists,embeddings_list, itertools.repeat(tkeywords), itertools.repeat(tembeddings),itertools.repeat(simmilarity_threshold), ) )
groups_dict={}
predictions={}
for p,g in results:
    for docid, vals in p.items():
        predictions[docid] = vals
    for docid, vals in g.items():
        groups_dict[docid] = vals

C-6., C-17, J-14, J-7., J-4., H-17, I-35, C-19, I-20, C-28, J-21, I-31, I-6., I-32, I-26, H-30, J-22, H-16, H-32, C-20, I-34, H-2., H-26, J-9., J-27, H-13, C-34, C-8., I-33, J-11, I-29, J-26, H-9., C-29, C-9., I-19, J-2., I-21, H-14, I-15, I-14, C-4., J-1., I-1., H-12, I-9., I-11, J-10, H-20, I-5., C-27, I-16, H-10, I-18, H-29, J-31, H-4., H-19, C-33, C-23, I-7., C-14, H-24, I-12, H-31, I-30, H-21, C-1., H-7., J-32, J-3., J-18, I-4., C-18, J-30, J-28, C-22, C-3., H-11, J-17, H-25, H-3., C-36, C-38, C-86, H-8., J-15, J-23, H-5., I-22, J-8., C-30, J-20, J-25, J-13, I-10, C-31, C-32, C-40, H-18, 

# Final results for test set

In [123]:
print("keyword stats: ")
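# NOTE: the fourth positional argument is use_meanings, so this call also matches
# on meaning groups; this is why both results below are identical.
calculate_stats(predictions, groups_dict, ts.keywords_combined, True)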
print("meaning stats: ")
calculate_stats(predictions, groups_dict, ts.keywords_combined, use_meanings=True)

keyword stats: 
Precision = 0.17477477477477477   Recall = 0.18780251694094868
meaning stats: 
Precision = 0.17477477477477477   Recall = 0.18780251694094868


## Further work

Adapt the Key2Vec PageRank step to adjust the keyword ranking

Search for optimal hyperparameters (simmilarity_threshold, POS sequence frequency)

# Results supporting assumptions

### Which POS tags to allow as part of keywords

set of 10 with N J V   
without meaning groups: Precision = 0.2318840579710145   Recall = 0.21768707482993196  
with meaning groups: Precision = 0.2391304347826087   Recall = 0.22602739726027396  

set of 10 with N J V C   
without meaning groups: Precision = 0.21739130434782608   Recall = 0.20408163265306123    
with meaning groups: Precision = 0.2246376811594203   Recall = 0.21232876712328766   

set of 10 with N J  
without meaning groups: Precision = 0.2608695652173913   Recall = 0.24489795918367346  
with meaning groups: Precision = 0.26811594202898553   Recall = 0.2534246575342466  

#### N and J give the best performance  
#### Meaning groups improve performance

### Process whole documents or just abstracts?

all set for full documents:  
Precision = 0.10896309314586995   Recall = 0.13982859720342805  
Precision = 0.1539543057996485   Recall = 0.20249653259361997  

all set for full abstracts:  
Precision = 0.14894613583138172   Recall = 0.1435665914221219  
Precision = 0.17283372365339578   Recall = 0.16857012334399268  

#### Processing abstracts gives better results

### Does fuzzy idf help?

set of 20 abstracts with Fuzzy IDF  
Precision = 0.21180555555555555   Recall = 0.20819112627986347  

set of 20 abstracts without Fuzzy IDF  
Precision = 0.21875   Recall = 0.2150170648464164  

#### Fuzzy IDF is a bad idea

### Do document embeddings help in addition to tf-idf?

test set with document embeddings
Precision = 0.18018018018018017   Recall = 0.1937984496124031

test set without document embeddings
Precision = 0.17477477477477477   Recall = 0.18780251694094868

#### Comparing keyword candidate embeddings with the document embedding and using the similarity for ranking helps
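
As a compact way to compare the two test-set runs above, the reported precision and recall can be folded into an F1 score. This is only an illustrative calculation on the numbers listed above: F1 is not reported in the original runs, and the helper `f1` is a name introduced here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the two test-set results above.
with_doc_embeddings = f1(0.18018018018018017, 0.1937984496124031)
without_doc_embeddings = f1(0.17477477477477477, 0.18780251694094868)

print(round(with_doc_embeddings, 4), round(without_doc_embeddings, 4))
```

The run with document embeddings comes out slightly ahead on this combined measure as well, in line with the conclusion above.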