# Term and Sentence Expansion for the training of a NER Tagger

The first pipeline covers the preparation required for TSE-NER: data collection, extraction (to MongoDB), indexing (to Elasticsearch), and other preliminary steps (training the word2vec and doc2vec models).

In this second pipeline we lay out the steps of Term Expansion, Sentence Expansion, NER Training, and NER Tagging. In short, the steps are:

1. Initial Data Generation
2. Term Expansion
3. Sentence Expansion
4. Training Data Generation
5. Train NER Tagger
6. Extract new entities
7. Filtering

The collected papers themselves are no longer needed at this point, but the Elasticsearch indexes and the embedding models are required.

In [1]:
# This is in case you update modules while working
%load_ext autoreload 
%autoreload 2

## Basic Configuration

Before we start, the process needs a few high-level configuration parameters, which are set below. You can also dive into the code for further customization; we are trying to keep it as clear as possible.

* model_name: a one-word name for the model, it should be representative of the facet that the model focuses on
* seeds: list (in format `['item', 'item', ...]`) of representative entities of the type
* context_words: list (in format `['item', 'item', ...]`) of words that usually surround the entities and often appear in the same sentences (only required when PMI filtering is applied)
* sentence_expansion: True or False if the sentence expansion step should be performed (term expansion is always done) 
* training_cycles: number of the training cycles to perform
* filtering_pmi: True or False if Pointwise Mutual Information filtering should be used at the end of each cycle
* filtering_st: True or False if Similar Terms filtering should be used at the end of each cycle
* filtering_ws: True or False if Stopword + WordNet filtering should be used at the end of each cycle
* filtering_kbl: True or False if Knowledge Base Lookup filtering should be used at the end of each cycle
* filtering_majority: True or False if the results of the selected filters should be combined by vote at the end of each cycle

For example, we can provide a number of entities of the **dataset** type, which is also the name of our model. In this context, datasets are collections of information constructed with a specific structure and purpose, such as comparing the performance of different technologies on the same task.

The cell below sets 50 seed entities of the dataset facet together with the rest of the initial configuration:

In [2]:
model_name = 'dataset_50'

seeds = ['buzzfeed', 'pslnl', 'dailymed', 'robust04', 'scovo', 'ask.com', 'cacm', 'stanford large network dataset', 
         'mediaeval', 'lexvo', 'spambase', 'shop.com', 'orkut', 'jnlpba', 'cyworld', 'citebase', 'blog06', 'worldcat', 
         'booking.com', 'semeval', 'imagenet', 'nasdaq', 'brightkite', 'movierating', 'webkb', 'ionosphere', 'moviepilot', 
         'duc2001', 'datahub', 'cifar', 'tdt', 'refseq', 'stack overflow', 'wikiwars', 'blogpulse', 'ws-353', 'gerbil', 
         'wikia', 'reddit', 'ldoce', 'kitti dataset', 'specweb', 'fedweb', 'wt2g', 'as3ap', 'friendfeed', 'new york times', 
         'chemid', 'imageclef', 'newegg']

context_words = ['dataset', 'corpus', 'collection', 'repository', 'benchmark']
sentence_expansion = True
training_cycles = 5
filtering_pmi = True
filtering_st = True
filtering_ws = True
filtering_kbl = True
filtering_majority = True

**Important** In addition to this configuration, we need to find the `config.py` file and edit the ROOTPATH and STANFORD_NER_PATH to the respective locations! In that file we can also edit the ports used for Elasticsearch.

We also import all the scripts required for the process. As mentioned before, you can check the code for further detail.

In [4]:
from m1_preprocessing import seed_data_extraction, term_sentence_expansion, training_data_generation, ner_training
from m1_postprocessing import extract_new_entities, filtering
import config as cfg
import gensim
import elasticsearch
import time
import re 
import string

doc2vec_model = gensim.models.Doc2Vec.load('embedding_models/doc2vec.model') #this is the path of the model created in the previous pipeline
es = elasticsearch.Elasticsearch([{'host': 'localhost', 'port': 9200}])

TSE-NER was conceived as a cyclic process: we generate training data, train the NER model, extract the entities from our full corpus, and then use those entities as the new seeds (filtering out as much noise as possible, of course). In this demo, however, we go step by step through a single cycle, and at the end show what a cyclic process would look like.

We therefore create a cycle counter, which would normally be incremented on each iteration but stays fixed for this example.

In [5]:
cycle = 0

## TSE-NER Process
### 1 - Initial Data Generation

Once we have the seeds and basic configuration, the first step consists of searching our entire corpus for sentences that contain the seed terms. This will create a txt file with the seeds, and one with the sentences in the `processing_files` folder.
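The idea behind this step can be illustrated with a small in-memory sketch. It is not the code in `seed_data_extraction` (which queries the Elasticsearch index); the function name `sentences_with_seeds` and the three example sentences are made up for illustration.

```python
import re

def sentences_with_seeds(sentences, seeds):
    """Return the sentences that mention at least one seed term (case-insensitive)."""
    # Longest seeds first so multi-word seeds are tried before their sub-words
    ordered = sorted(seeds, key=len, reverse=True)
    # Lookarounds act as word boundaries that also work for seeds such as 'ask.com'
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(s) for s in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [s for s in sentences if pattern.search(s)]

# Hypothetical sentences standing in for the indexed corpus
corpus_sentences = [
    "We evaluate our approach on ImageNet and CIFAR.",
    "The method is described in Section 3.",
    "Queries were sampled from the Robust04 collection.",
]
matches = sentences_with_seeds(corpus_sentences, ["imagenet", "robust04", "cifar"])
print(matches)
```

Only the first and third sentences are returned, since the second one mentions no seed.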

In [6]:
seed_data_extraction.sentence_extraction(model_name, cycle, seeds)

Started initial training data extraction
Extracting sentences for 50 seed terms
..................................................Process finished with 50 seeds and 1979 sentences added for training in cycle number 0


### 2 - Term Expansion
For *Term Expansion* we process all the sentences we just extracted and use the Natural Language Toolkit (NLTK) to find generic entities. These are words that could potentially be an entity because they are nouns, have a certain position in the sentence, and/or have a relationship with other parts of the sentence (more information [here](https://www.nltk.org/book/ch07.html)).

Then we take the word2vec vectors of those words and cluster them using k-means, selecting the number of clusters based on the silhouette score. If a cluster contains a seed entity, we keep the rest of the potential entities in that cluster as Expanded Terms. Like the previous step, this creates a txt file with the expanded terms in the `processing_files` folder.
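The selection rule at the end of this step (keep the candidates that share a cluster with a seed) can be sketched as follows. The cluster ids are assumed to come from k-means and are hard-coded here; `expand_terms` and the toy assignments are illustrative and not taken from `term_sentence_expansion`.

```python
from collections import defaultdict

def expand_terms(cluster_of, seeds):
    """Keep non-seed candidates that share a cluster with at least one seed.

    cluster_of maps each candidate term to its cluster id (e.g. from k-means).
    """
    members = defaultdict(set)
    for term, cluster_id in cluster_of.items():
        members[cluster_id].add(term)
    seed_set = set(seeds)
    expanded = set()
    for terms in members.values():
        if terms & seed_set:              # the cluster contains a seed
            expanded |= terms - seed_set  # keep its other members
    return sorted(expanded)

# Hypothetical cluster assignments for a handful of candidate terms
cluster_of = {"imagenet": 0, "mnist": 0, "cifar": 0,
              "problem": 1, "approach": 1, "trec": 2}
print(expand_terms(cluster_of, seeds=["imagenet", "cifar"]))  # ['mnist']
```

Clusters without any seed (here clusters 1 and 2) are discarded entirely, which is why most of the candidates found by NLTK do not survive this step.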

In [7]:
term_sentence_expansion.term_expansion(model_name, cycle)

Starting term expansion
Started to extract generic named entity from sentences...
....Finished processing sentences with 4161 new possible entities
Started term clustering
Added 160 expanded terms


NLTK produces over 4 thousand potential entities. After clustering and selecting, we keep 160 Expanded Terms.

### 3 - Sentence Expansion
In *Sentence Expansion*, we use doc2vec to find the single most similar sentence (this number can be modified in the code, for instance, in line 248) for each of the sentences that we obtained in Step 1. If that sentence has a cosine similarity above 0.5 (this threshold can also be changed), we add it to our set of Expanded Sentences. This set is stored in the `processing_files` folder.

*Note:* This process may run out of memory if the doc2vec model is too large, because it needs to compare the current sentence against ALL other sentences to find the most similar one. To fix this, you might have to retrace your steps and create a smaller model in the Preparation Pipeline.
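The selection rule (take the closest sentence, keep it only above the similarity threshold) can be sketched with plain vectors. This is a toy stand-in for the doc2vec lookup: the three-dimensional vectors and the helper names `cosine` and `most_similar` are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(query_vec, candidates, threshold=0.5):
    """Return (sentence, similarity) for the closest candidate, or None if below threshold."""
    best = max(candidates.items(), key=lambda kv: cosine(query_vec, kv[1]), default=None)
    if best is None:
        return None
    score = cosine(query_vec, best[1])
    return (best[0], score) if score > threshold else None

# Hypothetical sentence vectors; a real doc2vec model has far more dimensions
candidates = {
    "we test on the wikia corpus": [0.9, 0.1, 0.0],
    "the proof is left to the reader": [0.0, 0.2, 0.9],
}
first = most_similar([1.0, 0.0, 0.0], candidates)
second = most_similar([0.0, 1.0, 0.0], candidates)
print(first)
print(second)
```

The first query is close to the Wikia sentence and is kept; the second query has no candidate above 0.5, so nothing is added for it.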

In [11]:
term_sentence_expansion.sentence_expansion(model_name, cycle, doc2vec_model)

Starting sentence expansion
Finding similar sentences to the 3115 starting sentences
....Added 130 expanded sentences to the 3115 original


Sentence Expansion generates **130** new sentences. If they include one of the Expanded Terms, they are used as positive examples for training; if they do not include any entity, they are still helpful as negative examples, precisely because they are similar to the positive ones. We argue that this helps to improve the performance of the NER Tagger.

### 4 - Training Data Generation
Training the [Stanford NER Tagger](https://nlp.stanford.edu/software/CRF-NER.html) requires a file in a specific format: each sentence is written as a list of *word -> label* pairs, where a word is labelled with the current entity type, say **DATASET**, if it is part of an entity, and with **O** otherwise.

    ...
    we     O
    apply  O
    this   O
    to     O
    the    O
    Wikia  DATASET
    corpus O
    ...
    
For this, we take all the Sentences + Expanded Sentences, and label all the Seed Terms + Expanded Terms in them.
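A minimal sketch of this labelling, assuming whitespace tokenization and case-insensitive matching (the actual `sentence_labelling` script may tokenize and match differently). The function name `label_sentence` is hypothetical; the tab separator follows the two-column format described in the Stanford NER FAQ.

```python
def label_sentence(sentence, terms, label="DATASET"):
    """Return (token, label) pairs; tokens covered by a known term get `label`, others 'O'."""
    tokens = sentence.split()
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    for term in terms:
        parts = term.lower().split()
        # Slide a window of the term's length over the sentence
        for start in range(len(tokens) - len(parts) + 1):
            if lowered[start:start + len(parts)] == parts:
                for i in range(start, start + len(parts)):
                    labels[i] = label
    return list(zip(tokens, labels))

pairs = label_sentence("we apply this to the Wikia corpus", ["wikia", "stack overflow"])
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

Multi-word terms such as `stack overflow` label every token they cover, so each word of the entity gets its own **DATASET** line.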

In [12]:
training_data_generation.sentence_labelling(model_name, cycle, sentence_expansion)

Labelling sentences in the required format
3285 lines labelled


### 5 - NER Tagger Training
Once we have the file with all the labelled sentences, we have to create a [property file](https://nlp.stanford.edu/software/crf-faq.shtml#a) for the Tagger. In this file we can edit certain configurations, point to the correct training file, and the location of the Tagger.
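As an illustration of what such a file contains, the sketch below builds one from the feature settings used in the example property file of the Stanford NER FAQ. It is not necessarily what `create_prop` writes; `build_prop` and the two file names are placeholders.

```python
def build_prop(train_file, model_file):
    """Return the text of a Stanford NER property file for a two-column training file."""
    settings = {
        "trainFile": train_file,        # the labelled file from the previous step
        "serializeTo": model_file,      # where the trained CRF model is saved
        "map": "word=0,answer=1",       # column 0 is the word, column 1 the label
        "useClassFeature": "true",
        "useWord": "true",
        "useNGrams": "true",
        "noMidNGrams": "true",
        "maxNGramLeng": "6",
        "usePrev": "true",
        "useNext": "true",
        "useSequences": "true",
        "usePrevSequences": "true",
        "maxLeft": "1",
        "useTypeSeqs": "true",
        "useTypeSeqs2": "true",
        "useTypeySequences": "true",
        "wordShape": "chris2useLC",
        "useDisjunctive": "true",
    }
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"

# File names below are placeholders, not the paths the pipeline writes
prop_text = build_prop("training_data.tsv", "ner_model.ser.gz")
print(prop_text)
```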

In [13]:
ner_training.create_prop(model_name, cycle, sentence_expansion)

Creating property file for Stanford NER training


With the data in place and the property file ready, we can start training. This script launches a Java process, as we would from the command line, which generates a CRF (Conditional Random Field) file: the NER model.
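For reference, the command line documented in the Stanford NER FAQ for training from a property file can be assembled like this. The helper `training_command`, the heap size, and the property file name are assumptions; `train_model` may build its command differently.

```python
def training_command(ner_jar, prop_file, heap="4g"):
    """Build the argument list for training a CRF model with Stanford NER."""
    return [
        "java",
        f"-Xmx{heap}",   # JVM heap size; an assumed value, adjust to your machine
        "-cp", ner_jar,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-prop", prop_file,
    ]

# 'my_model.prop' is a placeholder for the property file created above
cmd = training_command("stanford_files/stanford-ner.jar", "my_model.prop")
print(" ".join(cmd))
# To actually launch the training: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when paths contain spaces.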

In [14]:
ner_training.train_model(model_name, cycle)

Training the model...


And we have a Long-Tail Entity Extraction Model!

### 6 - Extract New Entities
Since this is the goal of this whole process, we will take it step by step to see what's happening in the new Entity Extraction. Python fortunately allows for very easy use of the model we trained.

First, we instantiate the Tagger like this:

In [15]:
from nltk.tag.stanford import StanfordNERTagger
from nltk.corpus import stopwords
path_to_model = 'crf_trained_files/dataset_50_TSE_model_0.ser.gz' 
STANFORD_NER_PATH = 'stanford_files/stanford-ner.jar' # This should be in config, but we can show it again
ner_tagger = StanfordNERTagger(path_to_model, STANFORD_NER_PATH)

Let's take an example sentence from one of our documents.

In [54]:
def get_entities(text):
    """Tag `text` and return the distinct words (or two-word spans) labelled as entities."""
    tagged = ner_tagger.tag(text.split())
    strip_punct = str.maketrans('', '', string.punctuation)
    no_tag = 'O'
    result = []
    for jj, (word, label) in enumerate(tagged):
        if label == no_tag:
            continue
        word = word.translate(strip_punct)
        # Join with the next token when it is also tagged, otherwise keep the single word
        if jj + 1 < len(tagged) and tagged[jj + 1][1] != no_tag:
            next_word = tagged[jj + 1][0].translate(strip_punct)
            result.append(word + ' ' + next_word)
        else:
            result.append(word)
    english_stopwords = set(stopwords.words('english'))
    return [w for w in set(result) if w not in english_stopwords]

In [55]:
text_to_tag = "This suggests that training a model to rerank responses based on labeled Reddit threads and responses cannot help improve performance. Even though the logistic regression model did improve the appropriateness of responses selected for Reddit threads, VHREDRedditPolitics is used extremely rarely in the final system see Section 4."
extracted_entities = get_entities(text_to_tag)
print(text_to_tag)
print(extracted_entities)

This suggests that training a model to rerank responses based on labeled Reddit threads and responses cannot help improve performance. Even though the logistic regression model did improve the appropriateness of responses selected for Reddit threads, VHREDRedditPolitics is used extremely rarely in the final system see Section 4.
['Reddit']


The tagger works! 

In [69]:
text_to_tag = "We propose a rather straightforward pipeline combining deep-feature extraction using a CNN pretrained on ImageNet and a classic clustering algorithm to classify sets of images. We study the impact of different pretrained CNN feature extractors on the problem of image set clustering for object classification as well as fine-grained classification."
extracted_entities = get_entities(text_to_tag)
print(text_to_tag)
print(extracted_entities)

We propose a rather straightforward pipeline combining deep-feature extraction using a CNN pretrained on ImageNet and a classic clustering algorithm to classify sets of images. We study the impact of different pretrained CNN feature extractors on the problem of image set clustering for object classification as well as fine-grained classification.
['problem', 'ImageNet']


In this case we got some noise, so we can check our filtering strategies.

In [70]:
path = 'processing_files/' + model_name + '_extracted_entities_' + str(cycle) + '.txt'
# The context manager closes the file even if writing fails
with open(path, 'w', encoding='utf-8') as f1:
    for item in extracted_entities:
        f1.write(item + '\n')

And we could filter these new entities and start the process all over again with a larger set of seed terms.

### 7 - Filtering

Here we can apply different filters and evaluate the results. 

* WS
WordNet + Stopword filtering simply filters out stop and common words, following the assumption that long-tail entities may be rare and domain specific. 

* ST
Similar Terms filtering is based on the same approach as the Term Expansion, by clustering the vectors of the terms and only keeping those clusters where there is one of the original seed terms.

* PMI
Pointwise Mutual Information (PMI) filtering adopts a semantic similarity measure derived from the number of times two given keywords appear together in a sentence in our corpus (for example, the sentence "we evaluate on x" typically indicates that x is a dataset). A set of context words, terms that often appear with the entities in the same sentence, is required for this filtering.

* KBL
Knowledge Base Lookup, like WordNet filtering, follows the assumption that long-tail entities will not appear in a common knowledge database, such as DBpedia.

* Ensemble
To reduce the number of false positives at the end of the process, we propose to keep only the entities that remain after applying several, or all, filtering approaches to the results. In the current implementation, the resulting entities have to pass all filters.
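The ensemble rule can be sketched as a vote over the lists kept by each filter. `combine_filters` is a hypothetical helper, not the code behind `filtering.majority_vote`; the three input lists reuse the filter outputs shown later in this notebook.

```python
from collections import Counter

def combine_filters(filter_results, min_votes=None):
    """Keep entities accepted by at least `min_votes` filters (default: all of them)."""
    if min_votes is None:
        min_votes = len(filter_results)
    votes = Counter(e for kept in filter_results.values() for e in set(kept))
    return sorted(e for e, n in votes.items() if n >= min_votes)

filter_results = {
    "pmi": ["imagenet", "problem"],
    "ws": ["imagenet"],
    "kbl": ["imagenet", "problem"],
}
print(combine_filters(filter_results))               # ['imagenet']
print(combine_filters(filter_results, min_votes=2))  # ['imagenet', 'problem']
```

Requiring every filter to agree favours precision; lowering `min_votes` trades some precision for recall.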

For more details about the filtering, please refer to the main article. 
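To make the PMI idea concrete, here is a sketch of the textbook sentence-level definition, log( p(term, context) / (p(term) p(context)) ), for single-word terms. The exact score and threshold used by `filter_pmi` are described in the article and may differ; the function `pmi` and the four sentences are made up for the example.

```python
import math

def pmi(term, context_word, sentences):
    """Sentence-level PMI: log( p(term, context) / (p(term) * p(context)) )."""
    n = len(sentences)
    tokenized = [set(s.lower().split()) for s in sentences]
    has_term = sum(term in toks for toks in tokenized)
    has_ctx = sum(context_word in toks for toks in tokenized)
    both = sum(term in toks and context_word in toks for toks in tokenized)
    if not (has_term and has_ctx and both):
        return float("-inf")   # never co-occur: treat as the lowest possible score
    return math.log((both / n) / ((has_term / n) * (has_ctx / n)))

# Hypothetical mini-corpus
sentences = [
    "we evaluate on the imagenet dataset",
    "the imagenet dataset is large",
    "this problem is hard",
    "the dataset is public",
]
print(pmi("imagenet", "dataset", sentences))
print(pmi("problem", "dataset", sentences))
```

In this toy corpus `imagenet` co-occurs with the context word `dataset` more often than chance would predict (positive score), while `problem` never does, so only the former would pass a PMI threshold.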

In [None]:
# Similar Term filtering relies on clustering of the vectors of the extracted entities, 
# and therefore doesn't work with the few entities of this example

filtering.filter_st(model_name, cycle, seeds)

In [93]:
print(extracted_entities)

['problem', 'ImageNet']


In [94]:
filtering.filter_pmi(model_name, cycle, context_words)

Filtering 2 entities with PMI
2 entities are kept from the total of 2


['imagenet', 'problem']

In [95]:
filtering.filter_ws(model_name, cycle)

Filtering 2 entities with WordNet and Stopwords
1 entities are kept from the total of 2


['imagenet']

In [96]:
filtering.filter_kbl(model_name, cycle, seeds)

Filtering 2 entities with knowledge base lookup
2 entities are kept from the total of 2


['imagenet', 'problem']

In [98]:
filtering.majority_vote(model_name, cycle)

Filtering 2 entities by vote of selected filter methods
1 entities are kept from the total of 2


['imagenet']

After filtering we get the actual dataset for the example!

## Full-text Tagging

For more extended use, we can apply the tagger to the documents in our corpus. First we define a cleaning function to get rid of some characters that can affect the tagger.

In [18]:
def clean_text(es_doc):
    # Use the document passed in as argument, not a global `doc`
    content = es_doc["_source"]["content"]
    content = content.replace("@ BULLET", "")
    content = content.replace("@BULLET", "")
    content = content.replace(", ", " , ")
    content = content.replace('(', '')
    content = content.replace(')', '')
    content = content.replace('[', '')
    content = content.replace(']', '')
    content = content.replace(',', ' ,')
    content = content.replace('?', ' ?')
    content = content.replace('..', '.')
    content = re.sub(r"(\.)([A-Z])", r"\1 \2", content)
    return content

Then we can take some text to tag. For instance, we can search for documents by their title.

In [118]:
res = es.search(index = "ir", body = {"query": {"match": {"title" : "computer vision"}}}, size = 7)

print("Got %d Hits:" % res['hits']['total'])
for doc in res['hits']['hits']:
    print(doc['_id'], doc['_source']['title'])

Got 27 Hits:
arxiv_141535856 Towards Practical Verification of Machine Learning: The Case of Computer
  Vision Systems
arxiv_83836033 Face-to-BMI: Using Computer Vision to Infer Body Mass Index on Social
  Media
arxiv_84327993 Compiling LATEX to computer algebra-enabled HTML5
arxiv_129361451 Neural Networks Architecture Evaluation in a Quantum Computer
arxiv_86419363 Robust Computer Algebra, Theorem Proving, and Oracle AI
arxiv_83844865 Aligned Image-Word Representations Improve Inductive Transfer Across
  Vision-Language Tasks
arxiv_141535714 Discriminant Projection Representation-based Classification for Vision
  Recognition


We can iterate through the full corpus and label all entities, evaluate performance, and improve for the next training cycles.

In [119]:
for doc in res['hits']['hits']:
    print(doc['_id'], doc['_source']['title'])
    text_to_tag = clean_text(doc)
    print(get_entities(text_to_tag))
    print('')

arxiv_141535856 Towards Practical Verification of Machine Learning: The Case of Computer
  Vision Systems
['Scale', 'arXiv161002357', 'MobileNet', 'arXiv', 'Recognition', 'arXiv160407316', 'Imagenet', 'ResNet50', 'corresponding', 'ImageNet', 'problem', 'correctness', 'correspond', 'DNNs', 'Journal', 'imagenet', 'arXiv170808559', 'International', 'MNIST', 'convolutional', 'Visual', 'IMAGENET', 'IEEE', 'HOcclMask']

arxiv_83836033 Face-to-BMI: Using Computer Vision to Infer Body Mass Index on Social
  Media
['corresponding', 'convolutional', 'correspond', 'Reddit']

arxiv_84327993 Compiling LATEX to computer algebra-enabled HTML5
[]

arxiv_129361451 Neural Networks Architecture Evaluation in a Quantum Computer
['probability', 'Journal', 'arXiv', 'problems', 'arXiv14123489', 'IEEE', 'arXiv170401127', 'International']

arxiv_86419363 Robust Computer Algebra, Theorem Proving, and Oracle AI
['problem', 'problems', 'ITPs', 'probabilities', 'corresponding', 'GitHub']

arxiv_83844865 Aligned Im

With the filtering approaches that we showed before, we can obtain the datasets used in each paper.

This is only a demo of the TSE-NER approach. With a larger dataset we can improve the recall of the Tagger and extract even more entities, while maintaining precision with proper filtering.
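Putting the steps together, the cyclic process described at the start could look roughly like the loop below. This is a sketch under assumptions: `run_single_cycle` stands for steps 1 to 6 and `filter_entities` for step 7, both passed in as hypothetical callables, and the early stop when no new seeds appear is a design choice of this example rather than something the pipeline is known to do.

```python
def run_cycles(seeds, training_cycles, run_single_cycle, filter_entities):
    """Repeat train -> extract -> filter, feeding the kept entities back as seeds."""
    seeds = set(seeds)
    for cycle in range(training_cycles):
        extracted = run_single_cycle(sorted(seeds), cycle)   # steps 1-6
        kept = filter_entities(extracted, cycle)             # step 7
        new = set(kept) - seeds
        print(f"cycle {cycle}: {len(new)} new seed(s)")
        if not new:
            break                                            # nothing new to learn from
        seeds |= new
    return sorted(seeds)

# Toy stand-ins so the loop can be run without the real pipeline
def fake_cycle(seeds, cycle):
    return ["mnist", "problem"] if cycle == 0 else ["mnist"]

def fake_filter(entities, cycle):
    return [e for e in entities if e != "problem"]

final_seeds = run_cycles(["imagenet"], 5, fake_cycle, fake_filter)
print(final_seeds)  # ['imagenet', 'mnist']
```

In the real pipeline the two callables would wrap the module functions used throughout this notebook, with `cycle` passed through so that each iteration writes its own files.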