# Content Collection and Processing for TSE-NER Long-tail Entity Extraction

Here we will try to follow the full process required to collect and prepare the data required before we can train and use a NER model. In the second pipeline, we review the process of Term and Sentence Expansion, training, and finally using the Stanford NER Tagger.

In [1]:
# This is in case you update modules while working
%load_ext autoreload 
%autoreload 2

ModuleNotFoundError: No module named 'autoreload '

## Content Collection (a.k.a. getting a *lot* of papers)

For this step we will be using [sci-paper-miner](https://github.com/ronentk/sci-paper-miner) to download scientific publications from arXiv. 

In the `crawl_core` file, you can modify the topics and range of years that you want to download, for example here we selected papers only from 2017 for Artificial Intelligence, Computational Complexity, Cryptography and Security, Human-Computer Interaction, Logic in Computer Science, Mathematical Software, Multiagent Systems, Neural and Evolutionary Computing and Sound (in the `crawl_core` file they have all the topics, you can choose what to keep).  In `configs` you can write a name for the folder where the data will be stored.

Get the code from their repository, and use the following command to run it:

`python crawl_core.py <your-api-key>` 

In [None]:
%run crawl_core.py <your-api>

In [3]:
import json
import os

In [4]:
# This is the default path where all the dataset files were downloaded 

path = 'sci_paper_miner/data/arxiv_2006-2017_cs/db/'

In [6]:
filename = 'fulltext_0.json'
json_file = os.path.join(path, filename)
articles = []
for line in open(json_file):
    articles.append(json.loads(line))
    
x, y = 0, 0
for article in articles:
    if article['fullText']:
        x+=1
    else:
        y+=1
print(x, 'with full text', y, 'without')

3814 with full text 1658 without


Awesome! So we have to iterate for all the articles so we can store it in a database. In this case we will use MongoDB.

## Extraction and Storage - MongoDB

We use MongoDB because it is great for prototyping and quick schema changes, but other storage systems can be probably used. Ultimately, all the data is then indexed in Elasticsearch, and that's where we actually query in production.

You can go [here](https://docs.mongodb.com/manual/installation/) to install MongoDB.

Then, using a MongoDB client, create a database, for example _pub_ and a collection, like _publications_.

In [2]:
from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError
import string

In [3]:
# Default values
mongoDB_IP = '127.0.0.1'
mongoDB_Port = 27017
mongoDB_db = 'pub'

In [4]:
def connect_to_mongo():
    """
    Returns a db connection to the mongo instance
    :return:
    """
    try:
        client = MongoClient(mongoDB_IP, mongoDB_Port)
        db = client[mongoDB_db]
        db.downloads.find_one({'_id': 'test'})
        return db
    except ServerSelectionTimeoutError as e:
        raise Exception("Local MongoDB instance at "+mongoDB_IP+":"+mongoDB_Port+" could not be accessed.") from e

We can test the connection to the database with the following command:

In [5]:
db = connect_to_mongo()

Then we define a function that will take each of the articles from the JSON files downloaded from CORE, and store the information into our collection.

The data obtained from CORE does not separate sections in the text, it is not so important except for the references, since we may use them later on, therefore we also define a function that will manage this.

In [14]:
def arxiv_json_to_mongo(article):
    """
    Creates a new entry in mongodb from the input article
    :return:
    """
    
    mongo_input = {}
    translator = str.maketrans('', '', string.punctuation)
    article_data = article

    mongo_input['title'] = article_data['title']
    mongo_input['authors'] = article_data['authors']
    mongo_input['journal'] = 'arxiv'
    mongo_input['year'] = article_data['year']
    mongo_input['type'] = article_data['subjects']
    mongo_input['content.abstract'] = article_data['description']
    mongo_input['content.keywords'] = article_data['topics']
    mongo_input['content.fulltext'] = article_data['fullText']
    mongo_input['content.references'] = article_data['references']

    mongo_mongo_input = db.publications.update_one(
        {'_id': 'arxiv_' + article_data['id']},
        {'$set': mongo_input},
        upsert=True
    )    
    print('Processed', article_data['id'], 'from JSON')
    
def manage_text_and_refs(article):
    article['fullText'] = article['fullText'].replace('\n', ' ').replace('\r', '')
    if len(article['fullText'].split('References', 1)) == 2:
        article['references'] = article['fullText'].split('References', 1)[1].split('[')
        article['fullText'] = article['fullText'].split('References', 1)[0]
    elif len(article['fullText'].split('REFERENCES', 1)) == 2:
        article['references'] = article['fullText'].split('REFERENCES', 1)[1].split('[')
        article['fullText'] = article['fullText'].split('References', 1)[0]
    else:
        article['fullText'] = article['fullText']
        article['references'] = ''
    return article

Now we have to iterate through all the downloaded JSON files, and all the articles in each file, to process them and introduce them into our database.

In [16]:
for article in articles:
    # Process text and references for each article
    manage_text_and_refs(article)
    # Store to database
    arxiv_json_to_mongo(article)

Processed 25036267 from JSON
Processed 2113087 from JSON
Processed 25025591 from JSON
Processed 25015102 from JSON
Processed 24956790 from JSON
Processed 25038409 from JSON
Processed 2246166 from JSON
Processed 73386897 from JSON
Processed 83863968 from JSON
Processed 83867371 from JSON
Processed 94047681 from JSON
Processed 146472595 from JSON
Processed 94063315 from JSON
Processed 29526388 from JSON
Processed 29536914 from JSON
Processed 84090473 from JSON
Processed 84093559 from JSON
Processed 84093915 from JSON
Processed 84093852 from JSON
Processed 84094420 from JSON
Processed 84326602 from JSON
Processed 84328172 from JSON
Processed 84330324 from JSON
Processed 84329889 from JSON
Processed 84326664 from JSON
Processed 84331310 from JSON
Processed 86416235 from JSON
Processed 86419366 from JSON
Processed 86420205 from JSON
Processed 84330603 from JSON
Processed 84330526 from JSON
Processed 84331497 from JSON
Processed 73423992 from JSON
Processed 73417539 from JSON
Processed 73408

Done! We have a bunch of files in our database! 

## Indexing - Elasticsearch 

We use Elasticsearch for the quick and very flexible (elastic?) queries across the full text of articles, so we have to index all the content from the database for this.

Once again, we have to [install](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html) it and run the service...yay!

Once it's running we can connect and check status:

In [6]:
import pymongo
import elasticsearch
import requests
import nltk
from elasticsearch import helpers
import re
import string

In [7]:
client = pymongo.MongoClient('localhost:' + str(mongoDB_Port))
publications_collection = client.pub.publications
es = elasticsearch.Elasticsearch([{'host': 'localhost', 'port': 9200}],
                                 timeout=30, max_retries=10, retry_on_timeout=True)

In [26]:
es.cluster.health()

{'active_primary_shards': 0,
 'active_shards': 0,
 'active_shards_percent_as_number': 100.0,
 'cluster_name': 'elasticsearch',
 'delayed_unassigned_shards': 0,
 'initializing_shards': 0,
 'number_of_data_nodes': 1,
 'number_of_in_flight_fetch': 0,
 'number_of_nodes': 1,
 'number_of_pending_tasks': 0,
 'relocating_shards': 0,
 'status': 'green',
 'task_max_waiting_in_queue_millis': 0,
 'timed_out': False,
 'unassigned_shards': 0}

### Full text indexing

First we index the full content of every article in the database.

(This step corresponds to `collection_extraction -> index_papersfulltext.py` in the code)

In [14]:
def extract_metadata(documents):
    list_of_docs = []
    for i, r in enumerate(documents):
        extracted = {
                "_id": "",
                "title": "",
                "publication": "",
                "year": "",
                "content": "",
                "abstract": "",
                "authors": [],
                "references": []}
        extracted['_id'] = r['_id']
        extracted['title'] = r['title']
        extracted['publication'] = r['journal']
        extracted['year'] = r['year']
        extracted['content'] = r['content']['fulltext']
        extracted['abstract'] = r['content']['abstract']
        extracted['authors'] = r['authors']
        extracted['references'] = r['content']['references']
        list_of_docs.append(extracted)
    return list_of_docs

In [15]:
filter_publications = ['arxiv'] # Here we could also put PubMed or other sources

extracted_publications = []
for publication in filter_publications:
    query ={'$and': [{'journal': publication}, {'content.fulltext': {'$exists': True}}]}                   
    results = publications_collection.find(query)
    extracted_publications.append(extract_metadata(results))

In [29]:
for publication in extracted_publications:
    actions = []
    for article in publication:
        authors = []
        if len(article['authors']) > 0:
            if type(article['authors'][0]) == list:
                try:
                    for name in article['authors']:
                        authors.append(', '.join([name[1], name[0]]))
                    authors = list(set(authors))
                except:
                    pass
            else:
                authors = article['authors']
        actions.append({
            "_index": "ir", 
            "_type": "publications",  
            "_id": article['_id'],
            "_source": {
                "title": article["title"],
                "journal": article['publication'],
                "year": str(article['year']),
                "content": article["content"],
                "abstract": article["abstract"],
                "authors": authors,
                "references": article["references"]
            }
        })
    if len(actions) == 0:
        continue
    res = helpers.bulk(es, actions)
    print('Done with', res, 'articles added to index')

Done with (3814, []) articles added to index


We can look for anything in the text, like this:

In [30]:
res = es.search(index = "ir", body = {"query": {"match": {"title" : "netflix"}}}, size = 10)

print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit['_id'], hit['_source']['title'])
    print(hit['_id'], hit['_source']['abstract'])
    print('-'*50)

Got 1 Hits:
arxiv_84094279 Re-Evaluating the Netflix Prize - Human Uncertainty and its Impact on
  Reliability
arxiv_84094279 In this paper, we examine the statistical soundness of comparative
assessments within the field of recommender systems in terms of reliability and
human uncertainty. From a controlled experiment, we get the insight that users
provide different ratings on same items when repeatedly asked. This volatility
of user ratings justifies the assumption of using probability densities instead
of single rating scores. As a consequence, the well-known accuracy metrics
(e.g. MAE, MSE, RMSE) yield a density themselves that emerges from convolution
of all rating densities. When two different systems produce different RMSE
distributions with significant intersection, then there exists a probability of
error for each possible ranking. As an application, we examine possible ranking
errors of the Netflix Prize. We are able to show that all top rankings are more
or less subject to h

### Sentence Indexing 
For the sentence expansion step of our process, we need to have sentences indexed. This is because we need to find similar sentences as well as their surrounding sentences, for context. 

In addition, we create a file with all the text, of all our articles, and we will use it later for the training of embedding models.

(This step corresponds to `collection_extraction -> index_twosent.py` in the code)

In [8]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In [17]:
for publication in extracted_publications:
    for article in publication:
        actions = []
        cleaned = []
        dataset_sent = []
        other_sent = []

        lines = (sent_detector.tokenize(article['content'].strip()))
        
        # This will be used for the training of word2vec and doc2vec
        # You need to create the data folder beforehand
        with open('data/full_text_corpus.txt', 'a', encoding='utf-8') as f:
            for line in lines:
                f.write(line)
        f.close()
        
        if len(lines) < 3:
            continue

        for i in range(len(lines)):
            words = nltk.word_tokenize(lines[i])
            word_lengths = [len(x) for x in words]
            average_word_length = sum(word_lengths) / len(word_lengths)
            if average_word_length < 4:
                continue

            two_sentences = ''
            try:
                two_sentences = lines[i] + ' ' + lines[i - 1]
            except:
                two_sentences = lines[i] + ' ' + lines[i + 1]

            dataset_sent.append(two_sentences)

        for num, added_lines in enumerate(dataset_sent):
            actions.append({
                "_index": "twosent",
                "_type": "twosentnorules",
                "_id": article['_id'] + str(num),
                "_source": {
                    "title": article['title'],
                    "content.chapter.sentpositive": added_lines,
                    "paper_id": article['_id']
                }})

        if len(actions) == 0:
            continue
        res = helpers.bulk(es, actions)
print('Done')

Done


In [9]:
res = es.search(index = "twosent", body = {"query": {"match": {"content.chapter.sentpositive" : "imagenet"}}}, size = 5)

print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit['_id'], hit['_source']['title'])
    print(hit['_source']['content.chapter.sentpositive'])
    print('-'*50)

Got 1311 Hits:
arxiv_94074613218 Expert Gate: Lifelong Learning with a Network of Experts
We consider the following knowl- edge transfer cases: Scenes → Actions, ImageNet → Ac- tions, SVHN → Letters, ImageNet → Letters, SVHN → Mnist, ImageNet→ Mnist, Flowers→ Cars and Imagenet → Cars. We also consider ImageNet as a possible source.
--------------------------------------------------
arxiv_93937583198 Learning Efficient Convolutional Networks through Network Slimming
The results for ImageNet dataset are summa- rized in Table 2. ImageNet.
--------------------------------------------------
arxiv_86420809239 ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural
  Networks without Training Substitute Models
Substi- tute model based attack cannot easily scale to ImageNet. Table 2: Untargeted ImageNet attacks comparison.
--------------------------------------------------
arxiv_93937583152 Learning Efficient Convolutional Networks through Network Slimming
The ImageNet dataset co

### Doc2vec Indexing

Once again, we index the sentences from the full text file created before, which will be used for the Sentence Expansion using doc2vec.

(This step corresponds to `collection_extraction -> index_doc2vec.py` in the code)

In [10]:
file = open('data/full_text_corpus.txt', 'r', encoding='utf-8')
text = file.read()
file.close()
sentences = nltk.tokenize.sent_tokenize(text)
print('Sentences ready')
count = 0
docLabels = []
actions = []

for i, sent in enumerate(sentences):
    try:
        neighbors = sentences[i + 1]
        neighbor_count = count + 1
    except:
        neighbors = sentences[i - 1]
        neighbor_count = count - 1

    docLabels.append(count)
    actions.append({
        "_index": "devtwosentnew",
        "_type": "devtwosentnorulesnew",
        "_id": count,
        "_source": {
            "content.chapter.sentpositive": sent,
            "content.chapter.sentnegtive": neighbors,
            "neighborcount": neighbor_count
        }})
    count = count + 1

print(len(sentences))
print(len(docLabels))
res = helpers.bulk(es, actions)
print(res)

Sentences ready
112410
112410
(112410, [])


## Training Embedding Models (word2vec & doc2vec)

For the Term and Sentence expansion, we need to find similar words or sentences, and we use word2vec and doc2vec respectively. We have to generate a document with all the sentences in our corpus, which will be stored in the data folder. While this is not a full implementation, we have around 4 million sentences, so it is a considerable example, and processing can take some time.

(This step corresponds to `data_preparation -> prepare_embedding_data.py` in the code)

In [16]:
sentence_text = []
translator = str.maketrans('', '', string.punctuation)

for publication in extracted_publications:
    for article in publication:
        query = {"query":
                     {"match":
                          {"_id":
                               {"query": article['_id'],
                                "operator": "and"
                                }
                           }
                      }
                 }

        results = es.search(index="ir", body=query, size=200)
        for doc in results['hits']['hits']:
            fulltext = doc["_source"]["content"]
            fulltext = re.sub("[\[].*?[\]]", "", fulltext)
            sentence_text.append(fulltext)
    print('Done', '-' * 100)

f = open("data/data2vec.txt", "w", encoding='utf-8')
for line in sentence_text:
    f.write(line)
f.close()

Done ----------------------------------------------------------------------------------------------------


Before training, verify that you have the optimized version of Gensim using Cython, for better performance.

In [18]:
import gensim
assert gensim.models.doc2vec.FAST_VERSION > -1

Now that we have the document with all the text, we can use them for the training. 

We will use a bigram model for word2vec, since it's common that entities are represented as two words. You can run the command from your Python console or directly here, using the magic `%run` syntax. The structure of this command is as follows:

`%run script_to_run training_data model_output word_vector_output`

This was run in a normal Lenovo machine with 8GB and i7 processor. As you can see by the timestamps below, it might take a while...

In [19]:
%run data_preparation/train_word2vec.py data/data2vec.txt embedding_models/modelword2vecbigram.model embedding_models/modelword2vecbigram.vec

2018-05-15 17:07:25,101: INFO: running data_preparation/train_word2vec.py data/data2vec.txt embedding_models/modelword2vecbigram.model embedding_models/modelword2vecbigram.vec
2018-05-15 17:17:08,824: INFO: collecting all words and their counts
2018-05-15 17:17:08,831: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2018-05-15 17:17:09,350: INFO: PROGRESS: at sentence #10000, processed 201700 words and 71739 word types
2018-05-15 17:17:09,858: INFO: PROGRESS: at sentence #20000, processed 398393 words and 134942 word types
2018-05-15 17:17:10,411: INFO: PROGRESS: at sentence #30000, processed 617098 words and 204286 word types
2018-05-15 17:17:11,162: INFO: PROGRESS: at sentence #40000, processed 862142 words and 279857 word types
2018-05-15 17:17:11,845: INFO: PROGRESS: at sentence #50000, processed 1119381 words and 356880 word types
2018-05-15 17:17:12,539: INFO: PROGRESS: at sentence #60000, processed 1357143 words and 420874 word types
2018-05-15 17:17:13,233: I

Now we select the sentences from the full text and use them for the doc2vec training. Once again, we can run the command from your Python console or directly here, using the magic `%run` syntax. The structure of this command is as follows:

`%run script_to_run training_data model_output`

This also took some time.

In [2]:
%run data_preparation/train_doc2vec.py data/data2vec.txt embedding_models/doc2vec.model

2018-05-16 10:08:22,798: INFO: running data_preparation/train_doc2vec.py data/data2vec.txt embedding_models/doc2vec.model
2018-05-16 10:09:38,589: INFO: collecting all words and their counts
  yield LabeledSentence(words=doc.split(), tags=[self.labels_list[idx]])
2018-05-16 10:09:38,591: INFO: PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-05-16 10:09:38,702: INFO: PROGRESS: at example #10000, processed 162831 words (1479995/s), 18123 word types, 10000 tags


1897215
1897215


2018-05-16 10:09:38,808: INFO: PROGRESS: at example #20000, processed 316533 words (1467065/s), 33211 word types, 20000 tags
2018-05-16 10:09:38,931: INFO: PROGRESS: at example #30000, processed 493736 words (1452716/s), 48068 word types, 30000 tags
2018-05-16 10:09:39,054: INFO: PROGRESS: at example #40000, processed 692599 words (1627814/s), 63748 word types, 40000 tags
2018-05-16 10:09:39,176: INFO: PROGRESS: at example #50000, processed 903639 words (1751362/s), 79018 word types, 50000 tags
2018-05-16 10:09:39,299: INFO: PROGRESS: at example #60000, processed 1095945 words (1570394/s), 91547 word types, 60000 tags
2018-05-16 10:09:39,416: INFO: PROGRESS: at example #70000, processed 1291450 words (1682320/s), 102646 word types, 70000 tags
2018-05-16 10:09:39,532: INFO: PROGRESS: at example #80000, processed 1464803 words (1518706/s), 114144 word types, 80000 tags
2018-05-16 10:09:39,649: INFO: PROGRESS: at example #90000, processed 1657208 words (1660514/s), 125392 word types, 9000

And this is it for data collection, extraction and preparation!

The next step is the generation of training data (with Term and Sentence Expansion) for the Named-Entity Tagger and, of course, the actual training.