## Data and Model Preparation

The following code prepares the document frequency (DF) counts behind the TF-IDF weights used by the WINGUS approach. Modify `input_dir` and `output_file` to match your local setup. For more details, see https://boudinfl.github.io/pke/build/html/tutorials/training.html

In [None]:
# -*- coding: utf-8 -*-

import logging
import sys
from string import punctuation

from pke import compute_document_frequency

# setting info in terminal
logging.basicConfig(level=logging.INFO)

# path to the collection of documents
input_dir = '../train_data/document/test/'

# path to the df weights dictionary, saved as a gzipped tsv file
output_file = "../data/df_wingus_test.tsv.gz"

# stoplist: punctuation marks plus PTB-style bracket tokens
stoplist = list(punctuation)
stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']

# compute df counts and write them to output_file
compute_document_frequency(input_dir=input_dir,
                           output_file=output_file,
                           extension='txt', # input file extension
                           language='en', # language of the input files
                           normalization="stemming", # use porter stemmer
                           stoplist=stoplist,  # stoplist
                           delimiter='\t',  # tab separated output
                           n=20)  # maximum n-gram length
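
To sanity-check the DF file before training, it can help to read it back and look at a few counts. The sketch below is illustrative only. It assumes the file is a gzipped, tab-separated list of `term<TAB>count` rows whose first row is a special `--NB_DOC--` entry holding the number of documents, and it assumes a smoothed IDF of `log2(N / (1 + df))`; both should be confirmed against the pke version in use. The helper names `read_df_file` and `idf` and the sample file are hypothetical.

```python
import gzip
import math
import os
import tempfile


def read_df_file(path, delimiter="\t"):
    """Read a gzipped 'term<delimiter>count' file into a dict (assumed layout)."""
    counts = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # split on the last delimiter so the term itself may contain spaces
            term, count = line.rstrip("\n").rsplit(delimiter, 1)
            counts[term] = int(count)
    return counts


def idf(term, counts, nb_doc_key="--NB_DOC--"):
    """Smoothed IDF, log2(N / (1 + df)); unseen terms are treated as df = 0."""
    n_docs = counts[nb_doc_key]
    return math.log2(n_docs / (1 + counts.get(term, 0)))


# Tiny hypothetical file standing in for output_file.
sample_path = os.path.join(tempfile.mkdtemp(), "df_sample.tsv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write("--NB_DOC--\t8\n")
    handle.write("keyphras extract\t3\n")
    handle.write("model\t7\n")

sample_counts = read_df_file(sample_path)
print(sample_counts)
print(idf("keyphras extract", sample_counts))
print(idf("model", sample_counts))
```

Pointing `read_df_file` at the real `output_file` gives a quick view of whether the collection was read at all (a document count of 0 usually means a wrong `input_dir` or `extension`).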

## TRAINING WINGUS

This code trains the WINGUS model. Execute the previous cell before starting training, and point `df_file` and `reference_file` to the document frequency file and the gold annotations. Data preparation steps are described at https://boudinfl.github.io/pke/build/html/tutorials/training.html

In [None]:
# -*- coding: utf-8 -*-
# training

import logging
import pandas as pd
import pke

# setting info in terminal
logging.basicConfig(level=logging.INFO)

# path to the collection of documents
input_dir = '../train_data/document/train/'

# path to the reference file
reference_file = "../train_data/gold-annotation/train_gold.txt"

# path to the df file
df_file = "../data/df_wingus_train.tsv.gz"
logging.info('Loading df counts from {}'.format(df_file))
df_counts = pke.load_document_frequency_file(input_file=df_file,
                                             delimiter='\t')

# path to the model, saved as a pickle
output_mdl = "../data/wingus-model.pickle"

pke.train_supervised_model(input_dir=input_dir,
                           reference_file=reference_file,
                           model_file=output_mdl,
                           extension='txt',
                           language='en',
                           normalization="stemming",
                           df=df_counts,
                           model=pke.supervised.WINGUS())
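
Training depends on every document in `input_dir` having gold keyphrases in the reference file, so a quick pre-flight check can save a failed run. The sketch below is a rough illustration, not part of pke. It assumes the gold file uses one `doc_id : keyphrase1,keyphrase2` line per document, with `doc_id` equal to the file name without its extension; check this against the actual gold file before relying on it. The helpers `load_gold` and `missing_references` and the demo files are hypothetical.

```python
import os
import tempfile


def load_gold(path, sep_doc_id=":", sep_keyphrases=","):
    """Parse 'doc_id : kp1,kp2' lines into {doc_id: [keyphrases]} (assumed layout)."""
    gold = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            doc_id, _, phrases = line.partition(sep_doc_id)
            gold[doc_id.strip()] = [
                p.strip() for p in phrases.split(sep_keyphrases) if p.strip()
            ]
    return gold


def missing_references(input_dir, gold, extension="txt"):
    """Return ids of documents in input_dir that have no gold entry."""
    suffix = "." + extension
    doc_ids = [
        name[: -len(suffix)]
        for name in os.listdir(input_dir)
        if name.endswith(suffix)
    ]
    return sorted(doc_id for doc_id in doc_ids if doc_id not in gold)


# Demo on throwaway files; replace with input_dir and reference_file.
work_dir = tempfile.mkdtemp()
for name in ("doc1.txt", "doc2.txt"):
    with open(os.path.join(work_dir, name), "w", encoding="utf-8") as handle:
        handle.write("placeholder text\n")

gold_path = os.path.join(work_dir, "gold.ref")
with open(gold_path, "w", encoding="utf-8") as handle:
    handle.write("doc1 : keyphrase extraction,supervised model\n")

gold = load_gold(gold_path)
missing = missing_references(work_dir, gold)
print(gold)
print("documents without gold keyphrases:", missing)
```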

## TESTING

* This part of the code runs testing. Modify the paths to point to the trained model and the test set; the test set path is set in the line `df=pd.read_csv("../train_data/tsv/test1.tsv",delimiter="\t")`

* Possible values for the test file are test1.tsv for fold 1 and test2.tsv for fold 2

In [None]:
import pke
import pandas as pd
from collections import Counter
# create the extractor; it should match the class used to train the model
# loaded below (WINGUS in the training cell, left commented out here)
extractor = pke.supervised.Kea()
#extractor = pke.supervised.WINGUS()

# load the content of the document, here a raw text file
extractor.load_document('../train_data/document/train/train.txt')

# select the keyphrase candidates, for Kea the 1-3 grams that do not start or
# end with a stopword.
extractor.candidate_selection()

# load the df counts
df_counts = pke.load_document_frequency_file(input_file="../data/df_wingus_train.tsv.gz",
                                             delimiter='\t')

# weight the candidates using the trained model
extractor.candidate_weighting(model_file="../data/wingus-model.pickle", df=df_counts)

# collect the n-highest (10) scored candidates
allkeyphrases=[]
for (keyphrase, score) in extractor.get_n_best(n=10, stemming=False):
    allkeyphrases.append(keyphrase)

df=pd.read_csv("../train_data/tsv/test1.tsv",delimiter="\t")
texts=df["text"]
texts=[i.replace(",","") for i in texts]
labels=df["label"]
labels=[0 if i==0 else 1 for i in labels]

overall_evidence=[]
removed_text=[]
for txt in texts:
    evidence=[]
    for phr in allkeyphrases :
        # exact substring match; each distinct text is matched at most once
        if phr in txt and txt not in removed_text:
            evidence.append(1)
            removed_text.append(txt)
    if evidence==[]:
        evidence=[0]
    overall_evidence.append(evidence)

ypred=[Counter(i).most_common(1)[0][0] for i in overall_evidence]
print(ypred)

from sklearn.metrics import classification_report
print(classification_report(labels,ypred,digits=5))
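
The labelling loop in this cell can be read as a single rule: a text is predicted positive if it contains any extracted keyphrase as a substring, and a text that has already matched once is not matched again. The sketch below restates that rule as a small function so it can be tried on toy data without pke or the test TSV. The name `predict_labels` and the sample inputs are hypothetical, and the function is meant to mirror the cell's behaviour, not to replace it.

```python
def predict_labels(texts, keyphrases):
    """Label a text 1 if it contains any keyphrase as a substring, else 0.

    A text that was already matched is skipped on later occurrences,
    so repeated copies of a matched text get 0 (same as the loop in the cell).
    """
    seen = set()
    predictions = []
    for text in texts:
        hit = text not in seen and any(phrase in text for phrase in keyphrases)
        if hit:
            seen.add(text)
        predictions.append(1 if hit else 0)
    return predictions


# Toy inputs standing in for allkeyphrases and the "text" column.
toy_keyphrases = ["model", "overfit"]
toy_texts = [
    "the model overfits",
    "nothing relevant here",
    "the model overfits",  # duplicate of the first text
]
toy_pred = predict_labels(toy_texts, toy_keyphrases)
print(toy_pred)
```

The duplicate handling matters when the test set repeats sentences: only the first copy can be predicted positive, which affects recall in the classification report.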