# Term and Sentence Expansion for the training of a NER Tagger

In the first pipeline we went over the preparation required for TSE-NER: data collection, extraction (to MongoDB), indexing (to Elasticsearch) and other preliminary steps (training the word2vec and doc2vec models).

In this second pipeline we lay out the steps of Term Expansion, Sentence Expansion, NER Training, and NER Tagging. In short, the steps are:

1. Initial Data Generation
2. Term Expansion
3. Sentence Expansion
4. Training Data Generation
5. Train NER Tagger
6. Extract new entities
7. Filtering

The collected papers themselves are no longer needed at this point, but the Elasticsearch indexes and embedding models are required.

In [1]:
# This is in case you update modules while working
%load_ext autoreload 
%autoreload 2

## Basic Configuration

Before we start, the process needs a few high-level configuration parameters, which are set below. You can also dive into the code; we have tried to keep it clear enough for further customization.

* model_name: a one-word name for the model; it should be representative of the facet that the model focuses on
* seeds: list (in the format `['item', 'item', ...]`) of representative entities of the type
* context_words: list (in the format `['item', 'item', ...]`) of words that usually surround the entities and often appear in the same sentences (only required when PMI filtering is applied)
* sentence_expansion: True or False, whether the sentence expansion step should be performed (term expansion is always done)
* training_cycles: number of training cycles to perform
* filtering_pmi: True or False, whether Pointwise Mutual Information filtering should be used at the end of each cycle
* filtering_st: True or False, whether Similar Terms filtering should be used at the end of each cycle
* filtering_ws: True or False, whether Stopword + WordNet filtering should be used at the end of each cycle
* filtering_kbl: True or False, whether Knowledge Base Lookup filtering should be used at the end of each cycle
* filtering_majority: True or False, whether the ensemble of the selected filters (see the Filtering section) should be used at the end of each cycle

For example, we can provide a number of entities of the **dataset** type, which is also the name of our model. In this context, datasets are collections of information that were constructed with a specific structure and purpose, such as comparing the performance of different technologies on the same task.

Below are 50 seed entities of the dataset facet, together with the rest of the initial configuration:

In [71]:
model_name = 'dataset_50'

seeds = ['buzzfeed', 'pslnl', 'dailymed', 'robust04', 'scovo', 'ask.com', 'cacm', 'stanford large network dataset', 
         'mediaeval', 'lexvo', 'spambase', 'shop.com', 'orkut', 'jnlpba', 'cyworld', 'citebase', 'blog06', 'worldcat', 
         'booking.com', 'semeval', 'imagenet', 'nasdaq', 'brightkite', 'movierating', 'webkb', 'ionosphere', 'moviepilot', 
         'duc2001', 'datahub', 'cifar', 'tdt', 'refseq', 'stack overflow', 'wikiwars', 'blogpulse', 'ws-353', 'gerbil', 
         'wikia', 'reddit', 'ldoce', 'kitti dataset', 'specweb', 'fedweb', 'wt2g', 'as3ap', 'friendfeed', 'new york times', 
         'chemid', 'imageclef', 'newegg']

context_words = ['dataset', 'corpus', 'collection', 'repository', 'benchmark']
sentence_expansion = True
training_cycles = 5
filtering_pmi = True
filtering_st = True
filtering_ws = True
filtering_kbl = True
filtering_majority = True

**Important** In addition to this configuration, we need to find the `config.py` file and edit the ROOTPATH and STANFORD_NER_PATH to the respective locations! In that file we can also edit the ports used for Elasticsearch and MongoDB.

We also import all the scripts required for the process. As mentioned before, you can check the code for further detail.

In [3]:
from m1_preprocessing import seed_data_extraction, term_sentence_expansion, training_data_generation, ner_training
from m1_postprocessing import extract_new_entities, filtering
import config as cfg
import gensim
import elasticsearch
import time
import re 
import string

doc2vec_model = gensim.models.Doc2Vec.load('embedding_models/doc2vec.model') #this is the path of the model created in the previous pipeline
es = elasticsearch.Elasticsearch([{'host': 'localhost', 'port': 9200}])



TSE-NER was conceived as a cyclic process: we generate training data, train the NER model, extract the entities from our full corpus and then use those as the new seeds (filtering out as much noise as possible, of course). In this demo, however, we go step by step through a single cycle, and at the end show what a cyclic process would look like.

Therefore, we create a new variable that should iterate but will be fixed for this example.

In [4]:
cycle = 0

## TSE-NER Process
### 1 - Initial Data Generation

Once we have the seeds and basic configuration, the first step consists of searching our entire corpus for sentences that contain the seed terms. This will create a txt file with the seeds, and one with the sentences in the `processing_files` folder.

In [72]:
seed_data_extraction.sentence_extraction(model_name, cycle, seeds)

Started initial training data extraction
Extracting sentences for 50 seed terms
..................................................Process finished with 50 seeds and 2425 sentences added for training in cycle number 0
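
To make the idea of this step concrete, here is a minimal sketch of "find the sentences that mention a seed". It is not the code in `seed_data_extraction`: it assumes the sentences are already available as an in-memory list instead of being searched in the corpus, and the function name and toy sentences are invented for illustration.

```python
import re

def find_seed_sentences(sentences, seeds):
    """Return the sentences that mention at least one seed term (case-insensitive)."""
    # One alternation pattern; the lookarounds stop a short seed such as 'tdt'
    # from matching inside a longer word.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(s) for s in seeds) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [s for s in sentences if pattern.search(s)]

# Toy data (hypothetical sentences, seeds taken from the list above)
toy_sentences = [
    "We evaluate our method on ImageNet and CIFAR.",
    "The model has three hidden layers.",
    "Experiments on the Robust04 collection confirm this.",
]
toy_seeds = ["imagenet", "cifar", "robust04"]

matched = find_seed_sentences(toy_sentences, toy_seeds)
print(len(matched), "sentences matched")
```

Escaping each seed with `re.escape` matters for seeds such as `ask.com` or `ws-353`, which contain characters that are special in regular expressions.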


### 2 - Term Expansion
For the *Term Expansion* we process all the sentences we just extracted and use the Natural Language Toolkit (NLTK) to find generic entities. These are words that could potentially be an entity because they are nouns, have a certain position in the sentence, and/or have a relationship with other parts of the sentence (more information [here](https://www.nltk.org/book/ch07.html)).

Then we take the word2vec vectors of those words and cluster them with k-means, selecting the number of clusters based on the silhouette score. If a cluster contains a seed entity, we keep the rest of the potential entities in that cluster as Expanded Terms. Like the previous step, this creates a txt file with the expanded terms in the `processing_files` folder.

In [73]:
term_sentence_expansion.term_expansion(model_name, cycle)

Starting term expansion
Started to extract generic named entity from sentences...
.....Finished processing sentences with 5329 new possible entities
Started term clustering
Added 251 expanded terms


NLTK produces over 5 thousand potential entities (5329 in this run); after clustering and selection, we keep 251 Expanded Terms.
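
The clustering step can be sketched as follows. This is an illustration under assumptions, not the code in `term_sentence_expansion`: it uses scikit-learn (which the project may or may not use), tiny hand-made 2-D vectors instead of real word2vec vectors, and a hypothetical helper name `expand_terms`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def expand_terms(terms, vectors, seeds, k_values):
    """Cluster term vectors, choose k by silhouette score, keep clusters that hold a seed."""
    X = np.asarray(vectors, dtype=float)
    best_score, best_labels = -1.0, None
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    # Clusters that contain at least one seed term
    seed_clusters = {lab for term, lab in zip(terms, best_labels) if term in seeds}
    # Non-seed terms sharing a cluster with a seed become expanded terms
    return sorted(
        term for term, lab in zip(terms, best_labels)
        if lab in seed_clusters and term not in seeds
    )

# Toy candidates: three dataset-like terms and three generic nouns (made-up vectors)
toy_terms = ["imagenet", "cifar", "mnist", "algorithm", "method", "approach"]
toy_vectors = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
               [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]]

expanded = expand_terms(toy_terms, toy_vectors, {"imagenet", "cifar"}, range(2, 5))
print(expanded)
```

With these well-separated toy vectors, two clusters give the best silhouette score, so the only non-seed term that shares a cluster with a seed is kept.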

### 3 - Sentence Expansion
In *Sentence Expansion*, we use doc2vec to find a single similar sentence (this can be modified in the code, for instance, in line 248) for each of the sentences that we obtained in Step 1. If that sentence has a cosine similarity above 0.5 (this can also be changed), we add it to our set of Expanded Sentences. This set is stored in the `processing_files` folder.

*Note:* This process may run out of memory if the doc2vec model is too large, because it needs to compare ALL sentences to find the most similar one. To fix this, you might have to retrace your steps and create a smaller model in the Preparation Pipeline.

In [None]:
term_sentence_expansion.sentence_expansion(model_name, cycle, doc2vec_model)

Sentence Expansion generates **XX** new sentences. If they include one of the Expanded Terms, they are used as positive examples for training; if they do not include any entity, they are still helpful as similar sentences that serve as negative examples. We argue that this helps to improve the performance of the NER Tagger.
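
The selection rule (one nearest sentence per seed sentence, kept only above a cosine similarity of 0.5) can be sketched without gensim. The sketch assumes the sentence vectors are already available as plain lists; the function names and toy vectors are invented for illustration and do not mirror the project's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_sentences(seed_vectors, corpus_vectors, threshold=0.5):
    """For each seed sentence vector, keep its single nearest corpus sentence if similar enough."""
    expanded = set()
    for seed_vec in seed_vectors:
        best_id, best_sim = None, -1.0
        for sent_id, vec in corpus_vectors.items():
            sim = cosine(seed_vec, vec)
            if sim > best_sim:
                best_id, best_sim = sent_id, sim
        if best_id is not None and best_sim > threshold:
            expanded.add(best_id)
    return expanded

# Toy vectors: s1 is close to the first seed sentence, nothing is close to the second
toy_corpus = {"s1": [1.0, 0.1], "s2": [0.0, 1.0], "s3": [-1.0, 0.0]}
toy_seed_vectors = [[1.0, 0.0], [-0.2, -1.0]]

expanded_ids = expand_sentences(toy_seed_vectors, toy_corpus)
print(expanded_ids)
```

The inner loop compares every seed sentence with every corpus sentence, which is also why the real step becomes memory- and time-hungry on a large model.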

### 4 - Training Data Generation
For the [Stanford NER Tagger](https://nlp.stanford.edu/software/CRF-NER.html) model training, a specific file format is required. It consists of sentences presented as a list of *word -> label* pairs, where each word is labelled either with the current entity type, say **DATASET**, or with **O** if it is not part of an entity.

    ...
    we     O
    apply  O
    this   O
    to     O
    the    O
    Wikia  DATASET
    corpus O
    ...
    
For this, we take all the Sentences + Expanded Sentences, and label all the Seed Terms + Expanded Terms in them.

In [None]:
training_data_generation.sentence_labelling(model_name, cycle, sentence_expansion)
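
As an illustration of the labelling described above, the sketch below tags the tokens of a sentence that belong to a known term and prints them in a two-column layout. It is a simplified stand-in for `sentence_labelling`: whitespace tokenisation, the tab separator and both helper names are assumptions made for this example.

```python
def label_sentence(sentence, terms, label="DATASET"):
    """Return (token, tag) pairs, tagging tokens that belong to a known term."""
    tokens = sentence.split()
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for term in terms:
        words = term.lower().split()
        n = len(words)
        # Slide over the sentence so that multi-word terms are matched as a whole
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == words:
                tags[i:i + n] = [label] * n
    return list(zip(tokens, tags))

def to_training_lines(pairs):
    """Format (token, tag) pairs as one 'token<TAB>tag' line per token."""
    return "\n".join(f"{token}\t{tag}" for token, tag in pairs)

pairs = label_sentence("we apply this to the Wikia corpus", ["wikia", "kitti dataset"])
print(to_training_lines(pairs))
```

Only `Wikia` receives the `DATASET` tag here; every other token is tagged `O`, as in the excerpt above.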

### 5 - NER Tagger Training
Once we have the file with all the labelled sentences, we have to create a [property file](https://nlp.stanford.edu/software/crf-faq.shtml#a) for the Tagger. In this file we can edit certain configurations, point to the correct training file, and the location of the Tagger.

In [None]:
ner_training.create_prop(model_name, cycle, sentence_expansion)
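
For orientation, this is roughly what such a property file contains. The keys below follow the example property file in the Stanford NER FAQ linked above; the values that `create_prop` actually writes may differ, and the function name and the file paths in the example call are placeholders.

```python
def build_prop_text(train_file, model_file):
    """Assemble the text of a Stanford NER property file (paths are placeholders)."""
    properties = {
        "trainFile": train_file,      # labelled token/label file from Step 4
        "serializeTo": model_file,    # where the trained CRF model is written
        "map": "word=0,answer=1",     # column 0 is the token, column 1 the label
        "useClassFeature": "true",
        "useWord": "true",
        "useNGrams": "true",
        "noMidNGrams": "true",
        "maxNGramLeng": "6",
        "usePrev": "true",
        "useNext": "true",
        "useSequences": "true",
        "usePrevSequences": "true",
        "maxLeft": "1",
        "useTypeSeqs": "true",
        "useTypeSeqs2": "true",
        "useTypeySequences": "true",
        "wordShape": "chris2useLC",
        "useDisjunctive": "true",
    }
    return "\n".join(f"{key} = {value}" for key, value in properties.items()) + "\n"

# Hypothetical file names, for illustration only
prop_text = build_prop_text("training_file.txt", "model.ser.gz")
print(prop_text)
```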

With the data in place and the property file ready, we can start training. This script runs a Java process from the command line, which generates a CRF (Conditional Random Field) file: the NER Model.

In [None]:
ner_training.train_model(model_name, cycle)
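
The Java call behind this step looks roughly like the command assembled below, which follows the training command documented in the Stanford NER FAQ. The helper name, the heap size and the property file path are assumptions for illustration; `train_model` may build its command differently.

```python
def build_train_command(stanford_ner_jar, prop_file, java_heap="4g"):
    """Build the command line that trains a Stanford CRF model from a property file."""
    return [
        "java", f"-Xmx{java_heap}",
        "-cp", stanford_ner_jar,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-prop", prop_file,
    ]

# The jar path matches the one used later in this notebook; the prop path is hypothetical
command = build_train_command("stanford_files/stanford-ner.jar", "prop_files/example.prop")
print(" ".join(command))

# To actually train (requires Java and the Stanford NER jar):
# import subprocess
# subprocess.run(command, check=True)
```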

And we have a Long-Tail Entity Extraction Model!

### 6 - Extract New Entities
Since this is the goal of this whole process, we will take it step by step to see what's happening in the new Entity Extraction. Python fortunately allows for very easy use of the model we trained.

First, we instantiate the Tagger like this:

In [48]:
from nltk.tag.stanford import StanfordNERTagger
from nltk.corpus import stopwords
path_to_model = 'crf_trained_files/dataset_TSE_model_3.ser.gz' 
STANFORD_NER_PATH = 'stanford_files/stanford-ner.jar' # This should be in config, but we can show it again
ner_tagger = StanfordNERTagger(path_to_model, STANFORD_NER_PATH)

Let's take an example sentence from one of our documents, with a dataset entity that was not in the seeds, for example Flickr.

In [99]:
text_to_tag = "However, we cannot produce results for node2vec and GraRep on Flickr due to unmanageable out-of-memory errors on a machine with 64GB memory. Thus we exclude them from comparison on Flickr."
tagged = ner_tagger.tag(text_to_tag.split())
print(text_to_tag)

 However, we cannot produce results for node2vec and GraRep on Flickr due to unmanageable out-of-memory errors on a machine with 64GB memory. Thus we exclude them from comparison on Flickr.


In [100]:
result = []
no_tag = 'O'
strip_punct = str.maketrans('', '', string.punctuation)
for jj, (a, b) in enumerate(tagged):
    if b == no_tag:
        continue
    a = a.translate(strip_punct)
    # If the next token is tagged as an entity too, keep both words as a bigram;
    # otherwise keep the single word
    if jj + 1 < len(tagged) and tagged[jj + 1][1] != no_tag:
        temp = tagged[jj + 1][0].translate(strip_punct)
        result.append(a + ' ' + temp)
    else:
        result.append(a)
extracted_words = [word for word in set(result) if word not in stopwords.words('english')]
print(extracted_words)

['Flickr', 'GraRep', 'exclude']


The tagger works! 

In [103]:
path = 'processing_files/' + model_name + '_extracted_entities_' + str(cycle) + '.txt'
with open(path, 'w', encoding='utf-8') as f1:
    for item in extracted_words:
        f1.write(item + '\n')

And we could filter these new entities and start the process all over again with a larger set of seed terms.

### 7 - Filtering

Here we can apply different filters and evaluate the results. 

* WS: WordNet + Stopword filtering removes stop words and common words, following the assumption that long-tail entities are rare and domain-specific.
* ST: Similar Terms filtering follows the same approach as Term Expansion: it clusters the vectors of the extracted terms and keeps only the clusters that contain at least one of the original seed terms.
* PMI: Pointwise Mutual Information filtering uses a semantic similarity measure derived from the number of times two given keywords appear together in a sentence in our corpus (for example, the sentence "we evaluate on x" typically indicates a dataset). It requires a set of context words, i.e. terms that often appear in the same sentence as the entities.
* KBL: Knowledge Base Lookup, like WordNet filtering, follows the assumption that long-tail entities will not appear in a common knowledge base such as DBpedia.
* Ensemble: To reduce the number of false positives at the end of the process, we propose to keep only the entities that remain after applying several, or all, of the filtering approaches. In the current implementation, the resulting entities have to pass all filters.

For more details about the filtering, please refer to the main article. 
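
To give a feel for the PMI measure, the sketch below computes a sentence-level PMI, log(p(x, y) / (p(x) p(y))), between a candidate term and a context word. It is a simplified illustration, not the code in `filtering.filter_pmi`: it works on an in-memory list of toy sentences, handles single-token terms only, and says nothing about the threshold the project applies.

```python
import math

def pmi(term, context_word, sentences):
    """Sentence-level PMI between a single-token term and a context word."""
    n = len(sentences)
    tokenised = [set(s.lower().split()) for s in sentences]
    n_term = sum(term in toks for toks in tokenised)
    n_ctx = sum(context_word in toks for toks in tokenised)
    n_both = sum(term in toks and context_word in toks for toks in tokenised)
    if n_both == 0:
        return float("-inf")  # the two words never share a sentence
    return math.log((n_both / n) / ((n_term / n) * (n_ctx / n)))

# Toy corpus (hypothetical sentences)
toy_sentences = [
    "we evaluate on the flickr dataset",
    "the flickr dataset is large",
    "the machine ran out of memory",
    "this dataset is small",
]

print(pmi("flickr", "dataset", toy_sentences))   # positive: they co-occur more often than chance
print(pmi("memory", "dataset", toy_sentences))   # -inf: they never co-occur
```

A term that tends to share sentences with context words such as `dataset` or `corpus` gets a high score, which is the signal this filter relies on.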

In [None]:
# Similar Term filtering relies on clustering of the vectors of the extracted entities, 
# and therefore doesn't work with the 3 entities of this example

filtering.filter_st(model_name, cycle, seeds)

In [109]:
print(extracted_words)

['Flickr', 'GraRep', 'exclude']


In [104]:
filtering.filter_pmi(model_name, cycle, context_words)

Filtering 3 entities with PMI
2 entities are kept from the total of 3


['exclude', 'flickr']

In [105]:
filtering.filter_ws(model_name, cycle)

Filtering 3 entities with WordNet and Stopwords
2 entities are kept from the total of 3


['Flickr', 'GraRep']

In [106]:
filtering.filter_kbl(model_name, cycle, seeds)

Filtering 3 entities with knowledge base lookup
3 entities are kept from the total of 3


['grarep', 'exclude', 'flickr']

In [108]:
filtering.majority_vote(model_name, cycle)

Filtering 3 entities by vote of selected filter methods
1 entities are kept from the total of 3


['flickr']

After filtering, only the actual dataset in the example sentence remains!
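
The final result can be reproduced from the three lists printed above with a plain set intersection. This is a sketch of the "pass all filters" rule, not the code in `filtering.majority_vote`; lower-casing before comparing is an assumption suggested by the mixed casing of the printed outputs.

```python
def ensemble_filter(filter_outputs):
    """Keep only the entities that survive every filter (compared case-insensitively)."""
    normalised = [{entity.lower() for entity in kept} for kept in filter_outputs.values()]
    return sorted(set.intersection(*normalised)) if normalised else []

# Outputs of the PMI, WS and KBL filters as printed in the cells above
filter_outputs = {
    "pmi": ["exclude", "flickr"],
    "ws": ["Flickr", "GraRep"],
    "kbl": ["grarep", "exclude", "flickr"],
}

kept = ensemble_filter(filter_outputs)
print(kept)
```

`GraRep` and `exclude` each pass two of the three filters, so a strict two-out-of-three vote would keep them; requiring all filters is what leaves only `flickr`.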

## Full-text Tagging

For more extended use, we can apply the tagger to the documents in our corpus. First we define a cleaning function to get rid of some characters that can affect the tagger.

In [21]:
def clean_text(es_doc):
    # Full text of the Elasticsearch hit passed in as the argument
    content = es_doc["_source"]["content"]
    content = content.replace("@ BULLET", "")
    content = content.replace("@BULLET", "")
    content = content.replace(", ", " , ")
    content = content.replace('(', '')
    content = content.replace(')', '')
    content = content.replace('[', '')
    content = content.replace(']', '')
    content = content.replace(',', ' ,')
    content = content.replace('?', ' ?')
    content = content.replace('..', '.')
    content = re.sub(r"(\.)([A-Z])", r"\1 \2", content)
    return content

Then we can take some text to tag. For instance, we can search for some documents by their title.

In [35]:
res = es.search(index = "ir", body = {"query": {"match": {"title" : "artificial intelligence"}}}, size = 10)

print("Got %d Hits:" % res['hits']['total'])
for doc in res['hits']['hits']:
    print(doc['_id'], doc['_source']['title'])

Got 65 Hits:
arxiv_73990249 Minimally Naturalistic Artificial Intelligence
arxiv_129351956 Human-in-the-loop Artificial Intelligence
arxiv_141537850 Artificial Intelligence and Statistics
arxiv_42743993 Towards Verified Artificial Intelligence
arxiv_93942616 Knowledge Transfer Between Artificial Intelligence Systems
arxiv_83853685 Artificial Intelligence Based Malware Analysis
arxiv_74202707 Artificial Intelligence Approaches To UCAV Autonomy
arxiv_83854592 A System for Accessible Artificial Intelligence
arxiv_74203373 Ethical Considerations in Artificial Intelligence Courses
arxiv_93939510 Explainable Artificial Intelligence: Understanding, Visualizing and
  Interpreting Deep Learning Models


In [55]:
example = res['hits']['hits'][5]
example['_source']['content']

'Artificial Intelligence Based Malware Analysis Avi Pfeffera,∗, Brian Ruttenberga,∗, Lee Kellogga, Michael Howarda, Catherine Calla, Alison O’Connora, Glenn Takataa, Scott Neal Reillya, Terry Pattena, Jason Taylora, Robert Halla, Arun Lakhotiab, Craig Milesb, Dan Scofieldc, Jared Frankc aCharles River Analytics 625 Mt. Auburn St. Cambridge, MA, 02138 bSoftware Research Lab University of Louisiana at Lafayette Lafayette, LA cAssured Information Security Rome, NY Abstract Artificial intelligence methods have often been applied to perform specific functions or tasks in the cyber– defense realm. However, as adversary methods become more complex and difficult to divine, piecemeal efforts to understand cyber–attacks, and malware–based attacks in particular, are not providing sufficient means for malware analysts to understand the past, present and future characteristics of malware. In this paper, we present the Malware Analysis and Attributed using Genetic Information (MAAGI) system. The und

We can take the document above and clean it with the defined function. Then we use that content in the Tagger.

In [56]:
text_to_tag = clean_text(example)
tagged = ner_tagger.tag(text_to_tag.split())
result = []
no_tag = 'O'
strip_punct = str.maketrans('', '', string.punctuation)
for jj, (a, b) in enumerate(tagged):
    if b == no_tag:
        continue
    a = a.translate(strip_punct)
    # If the next token is tagged as an entity too, keep both words as a bigram;
    # otherwise keep the single word
    if jj + 1 < len(tagged) and tagged[jj + 1][1] != no_tag:
        temp = tagged[jj + 1][0].translate(strip_punct)
        result.append(a + ' ' + temp)
    else:
        result.append(a)
extracted_words = [word for word in set(result) if word not in stopwords.words('english')]

We can iterate through the full corpus and label all entities, evaluate performance, and improve for the next training cycles.

In [103]:
path = 'processing_files/' + model_name + '_extracted_entities_' + str(cycle) + '.txt'
with open(path, 'w', encoding='utf-8') as f1:
    for item in extracted_words:
        f1.write(item + '\n')