# Content Collection and Processing for TSE-NER Long-tail Entity Extraction

Here we will try to follow the full process required to collect and prepare the data required before we can train and use a NER model. In the second pipeline, we review the process of Term and Sentence Expansion, training, and finally using the Stanford NER Tagger.

In [1]:
# This is in case you update modules while working
%load_ext autoreload 
%autoreload 2

## Content Collection (a.k.a. getting a *lot* of papers)

For this step we will be using [sci-paper-miner](https://github.com/ronentk/sci-paper-miner) to download scientific publications from arXiv. Get the code from their repository, and use the following command to run it:

`python crawl_core.py <your-api>` 

In the `crawl_core` file, you can modify the topics and range of years that you want to download, for example here we selected papers only from 2017 for Artificial Intelligence, Computational Complexity, Cryptography and Security, Human-Computer Interaction, Logic in Computer Science, Mathematical Software, Multiagent Systems, Neural and Evolutionary Computing and Sound (in the `crawl_core` file they have all the topics, you can choose what to keep).  In `configs` you can write a name for the folder where the data will be stored.

In [2]:
import json
import os

In [2]:
# This is the path where all the json files were downloaded 

path = 'sci_paper_miner/data/arxiv_2006-2017_cs/raw_query/'

We know that the files we downloaded are in JSON format, but let's check their structure.
For this we can use `os.walk`.

In [3]:
for dirpath, dirnames, filenames in os.walk(path):
    for filename in filenames:
        if filename.endswith('.json'):
            print(filename)

144_Computer Science - Artificial Intelligence_2017_5_0.json
144_Computer Science - Artificial Intelligence_2017_5_1.json
144_Computer Science - Artificial Intelligence_2017_5_2.json
144_Computer Science - Artificial Intelligence_2017_5_3.json
144_Computer Science - Artificial Intelligence_2017_5_4.json
144_Computer Science - Artificial Intelligence_2017_5_5.json
144_Computer Science - Computational Complexity_2017_4_0.json
144_Computer Science - Computational Complexity_2017_4_1.json
144_Computer Science - Computational Complexity_2017_4_2.json
144_Computer Science - Computational Complexity_2017_4_3.json
144_Computer Science - Computational Complexity_2017_4_4.json
144_Computer Science - Computational Complexity_2017_4_5.json
144_Computer Science - Computational Complexity_2017_4_6.json
144_Computer Science - Cryptography and Security_2017_8_0.json
144_Computer Science - Cryptography and Security_2017_8_1.json
144_Computer Science - Cryptography and Security_2017_8_2.json
144_Compute

It seems like they are not individual articles, so we have to look into one of them.
Let's just take the last one.

In [4]:
json_file = os.path.join(path, filename)
with open(json_file) as f:
    data = json.load(f)

JSON files are loaded as dictionaries:

In [5]:
data.keys()

dict_keys(['authors', 'contributors', 'datePublished', 'description', 'doi', 'downloadUrl', 'fullText', 'fulltextIdentifier', 'id', 'identifiers', 'oai', 'relations', 'repositories', 'subjects', 'title', 'topics', 'types', 'year'])

If we look into one of those keys, it seems like there is many articles, each one with their respective value for the keys above.

In [6]:
data['title'].keys()

dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99'])

So we can check the content of the `Title` key, for all articles:

In [7]:
for key in data['title']:
    print(data['title'][key])
    print('-'*50)

Learning in the Machine: Random Backpropagation and the Deep Learning
  Channel
--------------------------------------------------
Imitation from Observation: Learning to Imitate Behaviors from Raw Video
  via Context Translation
--------------------------------------------------
Identifying hazardousness of sewer pipeline gas mixture using
  classification methods: a comparative study
--------------------------------------------------
Value Iteration Networks
--------------------------------------------------
Outrageously Large Neural Networks: The Sparsely-Gated
  Mixture-of-Experts Layer
--------------------------------------------------
A Review of Neural Network Based Machine Learning Approaches for Rotor
  Angle Stability Control
--------------------------------------------------
Learned Primal-dual Reconstruction
--------------------------------------------------
ACO for Continuous Function Optimization: A Performance Analysis
--------------------------------------------------
D

Or the content of all keys, for the first article, which is what we are looking for:

In [8]:
for key in data:
    print(key, data[key]['0'])
    print('-'*50)

authors ['Baldi, Pierre', 'Sadowski, Peter', 'Lu, Zhiqin']
--------------------------------------------------
contributors []
--------------------------------------------------
datePublished 2017-12-22
--------------------------------------------------
description Random backpropagation (RBP) is a variant of the backpropagation algorithm
for training neural networks, where the transpose of the forward matrices are
replaced by fixed random matrices in the calculation of the weight updates. It
is remarkable both because of its effectiveness, in spite of using random
matrices to communicate error information, and because it completely removes
the taxing requirement of maintaining symmetric weights in a physical neural
system. To better understand random backpropagation, we first connect it to the
notions of local learning and learning channels. Through this connection, we
derive several alternatives to RBP, including skipped RBP (SRPB), adaptive RBP
(ARBP), sparse RBP, and their combinati

Awesome! So we have to iterate for all the downloaded JSON files, for all keys, for all articles, so we can store it in a database. In this case we will use MongoDB.

## Extraction and Storage - MongoDB

We use MongoDB because it is great for prototyping and quick schema changes, but other storage systems can be probably used. Ultimately, all the data is then indexed in Elasticsearch, and that's where we actually query in production.

You can go [here](https://docs.mongodb.com/manual/installation/) to install MongoDB.

Then, using a MongoDB client, create a database, for example _pub_ and a collection, like _publications_.

In [2]:
from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError
import string

In [3]:
# Default values
mongoDB_IP = '127.0.0.1'
mongoDB_Port = 27017
mongoDB_db = 'pub'

In [4]:
def connect_to_mongo():
    """
    Returns a db connection to the mongo instance
    :return:
    """
    try:
        client = MongoClient(mongoDB_IP, mongoDB_Port)
        db = client[mongoDB_db]
        db.downloads.find_one({'_id': 'test'})
        return db
    except ServerSelectionTimeoutError as e:
        raise Exception("Local MongoDB instance at "+mongoDB_IP+":"+mongoDB_Port+" could not be accessed.") from e

We can test the connection to the database with the following command:

In [5]:
db = connect_to_mongo()

Then we define a function that will take each of the articles from the JSON files downloaded from CORE, and store the information into our collection.

The data obtained from CORE does not separate sections in the text, it is not so important except for the references, since we may use them later on, therefore we also define a function that will manage this.

In [14]:
def arxiv_json_to_mongo(article):
    """
    Creates a new entry in mongodb from the input article
    :return:
    """
    
    mongo_input = {}
    translator = str.maketrans('', '', string.punctuation)
    article_data = article

    mongo_input['title'] = article_data['title']
    mongo_input['authors'] = article_data['authors']
    mongo_input['journal'] = 'arxiv'
    mongo_input['year'] = article_data['year']
    mongo_input['type'] = article_data['subjects']
    mongo_input['content.abstract'] = article_data['description']
    mongo_input['content.keywords'] = article_data['topics']
    mongo_input['content.fulltext'] = article_data['fullText']
    mongo_input['content.references'] = article_data['references']

    mongo_mongo_input = db.publications.update_one(
        {'_id': 'arxiv_' + article_data['id']},
        {'$set': mongo_input},
        upsert=True
    )    
    print('Processed', article_data['id'], 'from JSON')
    
def manage_text_and_refs(article):
    article['fullText'] = article['fullText'].replace('\n', ' ').replace('\r', '')
    if len(article['fullText'].split('References', 1)) == 2:
        article['references'] = article['fullText'].split('References', 1)[1].split('[')
        article['fullText'] = article['fullText'].split('References', 1)[0]
    elif len(article['fullText'].split('REFERENCES', 1)) == 2:
        article['references'] = article['fullText'].split('REFERENCES', 1)[1].split('[')
        article['fullText'] = article['fullText'].split('References', 1)[0]
    else:
        article['fullText'] = article['fullText']
        article['references'] = ''
    return article

Now we have to iterate through all the downloaded JSON files, and all the articles in each file, to process them and introduce them into our database.

In [16]:
for dirpath, dirnames, filenames in os.walk(path):
    # Iterate files...
    for filename in filenames:
        if filename.endswith('.json'):
            json_file = os.path.join(path, filename)
            with open(json_file) as f:
                data = json.load(f)            
                    # Iterate articles in file...
                for article_number in data['title'].keys():
                    article = {}
                    for key in data.keys():
                        article[key] = data[key][str(article_number)]
                        
                    # Process text and references for each article
                    manage_text_and_refs(article)
                    # Store to database
                    arxiv_json_to_mongo(article)

Processed 25036267 from JSON
Processed 2113087 from JSON
Processed 25025591 from JSON
Processed 25015102 from JSON
Processed 24956790 from JSON
Processed 25038409 from JSON
Processed 2246166 from JSON
Processed 73386897 from JSON
Processed 83863968 from JSON
Processed 83867371 from JSON
Processed 94047681 from JSON
Processed 146472595 from JSON
Processed 94063315 from JSON
Processed 29526388 from JSON
Processed 29536914 from JSON
Processed 84090473 from JSON
Processed 84093559 from JSON
Processed 84093915 from JSON
Processed 84093852 from JSON
Processed 84094420 from JSON
Processed 84326602 from JSON
Processed 84328172 from JSON
Processed 84330324 from JSON
Processed 84329889 from JSON
Processed 84326664 from JSON
Processed 84331310 from JSON
Processed 86416235 from JSON
Processed 86419366 from JSON
Processed 86420205 from JSON
Processed 84330603 from JSON
Processed 84330526 from JSON
Processed 84331497 from JSON
Processed 73423992 from JSON
Processed 73417539 from JSON
Processed 73408

Processed 93955298 from JSON
Processed 93957100 from JSON
Processed 93954637 from JSON
Processed 86415589 from JSON
Processed 86416049 from JSON
Processed 86421656 from JSON
Processed 86420019 from JSON
Processed 86416917 from JSON
Processed 84332112 from JSON
Processed 84331544 from JSON
Processed 86417035 from JSON
Processed 86415248 from JSON
Processed 86414384 from JSON
Processed 84331049 from JSON
Processed 129362579 from JSON
Processed 129353364 from JSON
Processed 129358112 from JSON
Processed 129363022 from JSON
Processed 129349022 from JSON
Processed 129348500 from JSON
Processed 129355601 from JSON
Processed 129355043 from JSON
Processed 129252805 from JSON
Processed 129263821 from JSON
Processed 129269449 from JSON
Processed 129276487 from JSON
Processed 94077449 from JSON
Processed 94026340 from JSON
Processed 73404689 from JSON
Processed 73402359 from JSON
Processed 73403592 from JSON
Processed 84090569 from JSON
Processed 84091942 from JSON
Processed 84092015 from JSON
Pr

Processed 9259374 from JSON
Processed 2418176 from JSON
Processed 24952807 from JSON
Processed 25041752 from JSON
Processed 25054883 from JSON
Processed 2064976 from JSON
Processed 25036302 from JSON
Processed 25015255 from JSON
Processed 25015890 from JSON
Processed 24978117 from JSON
Processed 24961009 from JSON
Processed 25063031 from JSON
Processed 25060134 from JSON
Processed 24960871 from JSON
Processed 25004536 from JSON
Processed 24975044 from JSON
Processed 25058543 from JSON
Processed 25015215 from JSON
Processed 2192099 from JSON
Processed 25050201 from JSON
Processed 25043984 from JSON
Processed 24947497 from JSON
Processed 25030700 from JSON
Processed 24960513 from JSON
Processed 2246166 from JSON
Processed 24965000 from JSON
Processed 146472672 from JSON
Processed 94066275 from JSON
Processed 84090954 from JSON
Processed 86419582 from JSON
Processed 86418022 from JSON
Processed 86420919 from JSON
Processed 84329360 from JSON
Processed 73407275 from JSON
Processed 73406878

Processed 73988612 from JSON
Processed 73988782 from JSON
Processed 73987842 from JSON
Processed 42701158 from JSON
Processed 73359320 from JSON
Processed 83837177 from JSON
Processed 83865527 from JSON
Processed 93955960 from JSON
Processed 94075803 from JSON
Processed 83847068 from JSON
Processed 93888370 from JSON
Processed 42739003 from JSON
Processed 93867540 from JSON
Processed 93875891 from JSON
Processed 129350143 from JSON
Processed 93874843 from JSON
Processed 83858768 from JSON
Processed 73349182 from JSON
Processed 73395899 from JSON
Processed 83857671 from JSON
Processed 83860794 from JSON
Processed 83845746 from JSON
Processed 73422623 from JSON
Processed 73400698 from JSON
Processed 73407079 from JSON
Processed 84093827 from JSON
Processed 29554804 from JSON
Processed 73409828 from JSON
Processed 84329847 from JSON
Processed 73954126 from JSON
Processed 73957459 from JSON
Processed 73989678 from JSON
Processed 42644498 from JSON
Processed 84328192 from JSON
Processed 742

Processed 141532918 from JSON
Processed 93897084 from JSON
Processed 83847336 from JSON
Processed 83859944 from JSON
Processed 84094692 from JSON
Processed 84093303 from JSON
Processed 83848897 from JSON
Processed 83848727 from JSON
Processed 83849218 from JSON
Processed 83852278 from JSON
Processed 83850272 from JSON
Processed 83842604 from JSON
Processed 73354579 from JSON
Processed 73377367 from JSON
Processed 83869728 from JSON
Processed 83863650 from JSON
Processed 83867397 from JSON
Processed 73351431 from JSON
Processed 129354096 from JSON
Processed 93946309 from JSON
Processed 93948590 from JSON
Processed 93953828 from JSON
Processed 93889927 from JSON
Processed 73423893 from JSON
Processed 29559494 from JSON
Processed 73406964 from JSON
Processed 73407518 from JSON
Processed 73416608 from JSON
Processed 29506180 from JSON
Processed 29530381 from JSON
Processed 73404261 from JSON
Processed 83856395 from JSON
Processed 73348729 from JSON
Processed 73381445 from JSON
Processed 73

Processed 83865803 from JSON
Processed 83835077 from JSON
Processed 83849751 from JSON
Processed 83852052 from JSON
Processed 129356124 from JSON
Processed 129358858 from JSON
Processed 93940659 from JSON
Processed 93942864 from JSON
Processed 93943759 from JSON
Processed 93896770 from JSON
Processed 146473534 from JSON
Processed 94056463 from JSON
Processed 129227352 from JSON
Processed 93947441 from JSON
Processed 93957750 from JSON
Processed 93954559 from JSON
Processed 93948817 from JSON
Processed 93909883 from JSON
Processed 93937738 from JSON
Processed 146473116 from JSON
Processed 83847623 from JSON
Processed 141534744 from JSON
Processed 129242742 from JSON
Processed 83832535 from JSON
Processed 83834740 from JSON
Processed 83832908 from JSON
Processed 83833413 from JSON
Processed 83844244 from JSON
Processed 83866820 from JSON
Processed 129360404 from JSON
Processed 129348967 from JSON
Processed 129281738 from JSON
Processed 129358098 from JSON
Processed 129358005 from JSON
Pr

Processed 94038148 from JSON
Processed 146475459 from JSON
Processed 94034975 from JSON
Processed 129348860 from JSON
Processed 93904884 from JSON
Processed 93939569 from JSON
Processed 93940547 from JSON
Processed 93959257 from JSON
Processed 93871665 from JSON
Processed 73373770 from JSON
Processed 73373864 from JSON
Processed 83861533 from JSON
Processed 83862149 from JSON
Processed 83836764 from JSON
Processed 83838830 from JSON
Processed 83840455 from JSON
Processed 94058595 from JSON
Processed 146474906 from JSON
Processed 129269211 from JSON
Processed 93949164 from JSON
Processed 93949753 from JSON
Processed 93951347 from JSON
Processed 93958906 from JSON
Processed 129349940 from JSON
Processed 129353631 from JSON
Processed 129357811 from JSON
Processed 129358193 from JSON
Processed 93937626 from JSON
Processed 93938302 from JSON
Processed 93942891 from JSON
Processed 93951455 from JSON
Processed 78506528 from JSON
Processed 141529838 from JSON
Processed 141537040 from JSON
Proc

Processed 141537693 from JSON
Processed 93946295 from JSON
Processed 93953334 from JSON
Processed 129351396 from JSON
Processed 93943551 from JSON
Processed 73987597 from JSON
Processed 93954755 from JSON
Processed 93942844 from JSON
Processed 93954553 from JSON
Processed 93954508 from JSON
Processed 83860222 from JSON
Processed 73391781 from JSON
Processed 83856935 from JSON
Processed 83846316 from JSON
Processed 83867829 from JSON
Processed 94034650 from JSON
Processed 83841739 from JSON
Processed 83833433 from JSON
Processed 84093874 from JSON
Processed 146476396 from JSON
Processed 146475223 from JSON
Processed 129361954 from JSON
Processed 93909848 from JSON
Processed 141533623 from JSON
Processed 83848490 from JSON
Processed 83850053 from JSON
Processed 83841305 from JSON
Processed 73990349 from JSON
Processed 84332367 from JSON
Processed 84326485 from JSON
Processed 78508297 from JSON
Processed 78512407 from JSON
Processed 86418977 from JSON
Processed 74202687 from JSON
Processe

Processed 25050344 from JSON
Processed 25056701 from JSON
Processed 25041752 from JSON
Processed 25048645 from JSON
Processed 24984510 from JSON
Processed 24938942 from JSON
Processed 24953401 from JSON
Processed 25013547 from JSON
Processed 24964557 from JSON
Processed 25015102 from JSON
Processed 24992766 from JSON
Processed 25029857 from JSON
Processed 2189676 from JSON
Processed 2184911 from JSON
Processed 25062049 from JSON
Processed 24956790 from JSON
Processed 25048018 from JSON
Processed 25023845 from JSON
Processed 25054674 from JSON
Processed 25030049 from JSON
Processed 25043675 from JSON
Processed 25034492 from JSON
Processed 24973903 from JSON
Processed 25030003 from JSON
Processed 25041490 from JSON
Processed 2246166 from JSON
Processed 24970122 from JSON
Processed 83864257 from JSON
Processed 83864163 from JSON
Processed 94063795 from JSON
Processed 84327713 from JSON
Processed 84328778 from JSON
Processed 86415573 from JSON
Processed 86416297 from JSON
Processed 8641952

Processed 78511424 from JSON
Processed 78510469 from JSON
Processed 78507670 from JSON
Processed 78508673 from JSON
Processed 78512100 from JSON
Processed 78508907 from JSON
Processed 129209900 from JSON
Processed 93903698 from JSON
Processed 94054926 from JSON
Processed 94065771 from JSON
Processed 93905687 from JSON
Processed 93911970 from JSON
Processed 93889189 from JSON
Processed 93943168 from JSON
Processed 93942879 from JSON
Processed 129355046 from JSON
Processed 129286933 from JSON
Processed 129349133 from JSON
Processed 129355231 from JSON
Processed 129353925 from JSON
Processed 93956902 from JSON
Processed 94026747 from JSON
Processed 141535174 from JSON
Processed 129240441 from JSON
Processed 129236628 from JSON
Processed 129251258 from JSON
Processed 129243806 from JSON
Processed 129269756 from JSON
Processed 129260672 from JSON
Processed 93946097 from JSON
Processed 93897029 from JSON
Processed 129359411 from JSON
Processed 83858300 from JSON
Processed 83852472 from JSON


Processed 24930779 from JSON
Processed 25030003 from JSON
Processed 86417517 from JSON
Processed 129205956 from JSON
Processed 94024259 from JSON
Processed 83860666 from JSON
Processed 129350991 from JSON
Processed 83857264 from JSON
Processed 83842145 from JSON
Processed 83838175 from JSON
Processed 73961361 from JSON
Processed 73421461 from JSON
Processed 42694258 from JSON
Processed 129241322 from JSON
Processed 141532259 from JSON
Processed 129356820 from JSON
Processed 83853876 from JSON
Processed 83849053 from JSON
Processed 86415541 from JSON
Processed 84330976 from JSON
Processed 129358390 from JSON
Processed 73421309 from JSON
Processed 84329069 from JSON
Processed 84327993 from JSON
Processed 78506856 from JSON
Processed 42745470 from JSON
Processed 83833862 from JSON
Processed 129359813 from JSON
Processed 129237940 from JSON
Processed 93940921 from JSON
Processed 129363302 from JSON
Processed 129362208 from JSON
Processed 83857623 from JSON
Processed 94050389 from JSON
Proc

Processed 93950900 from JSON
Processed 129286047 from JSON
Processed 83854092 from JSON
Processed 83844358 from JSON
Processed 84095274 from JSON
Processed 93955492 from JSON
Processed 93889881 from JSON
Processed 94022322 from JSON
Processed 141534671 from JSON
Processed 73992693 from JSON
Processed 42707453 from JSON
Processed 84333322 from JSON
Processed 84333214 from JSON
Processed 78508622 from JSON
Processed 141531711 from JSON
Processed 74202935 from JSON
Processed 83835579 from JSON
Processed 83835562 from JSON
Processed 83861303 from JSON
Processed 83839557 from JSON
Processed 93937297 from JSON
Processed 146476009 from JSON
Processed 94029742 from JSON
Processed 83848523 from JSON
Processed 83839293 from JSON
Processed 73358911 from JSON
Processed 93959051 from JSON
Processed 93950573 from JSON
Processed 93958762 from JSON
Processed 93942020 from JSON
Processed 42658277 from JSON
Processed 129361784 from JSON
Processed 84328381 from JSON
Processed 86415580 from JSON
Processed

Processed 84331949 from JSON
Processed 84326882 from JSON
Processed 42675139 from JSON
Processed 78507936 from JSON
Processed 78507176 from JSON
Processed 78510439 from JSON
Processed 42739776 from JSON
Processed 94044970 from JSON
Processed 83838920 from JSON
Processed 83839149 from JSON
Processed 83836948 from JSON
Processed 83860961 from JSON
Processed 83860822 from JSON
Processed 83860776 from JSON
Processed 83862378 from JSON
Processed 73381250 from JSON
Processed 73381591 from JSON
Processed 93913620 from JSON
Processed 93943418 from JSON
Processed 146473488 from JSON
Processed 146473363 from JSON
Processed 94067542 from JSON
Processed 94069454 from JSON
Processed 94074799 from JSON
Processed 129283260 from JSON
Processed 93873874 from JSON
Processed 93902293 from JSON
Processed 129227338 from JSON
Processed 93945609 from JSON
Processed 94066943 from JSON
Processed 94069144 from JSON
Processed 83847281 from JSON
Processed 83847654 from JSON
Processed 129210446 from JSON
Processed

Done! We have a bunch of files in our database! (**3814** articles in this demo)

## Indexing - Elasticsearch 

We use Elasticsearch for the quick and very flexible (elastic?) queries across the full text of articles, so we have to index all the content from the database for this.

Once again, we have to [install](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html) it and run the service...yay!

Once it's running we can connect and check status:

In [6]:
import pymongo
import elasticsearch
import requests
import nltk
from elasticsearch import helpers
import re
import string

In [7]:
client = pymongo.MongoClient('localhost:' + str(mongoDB_Port))
publications_collection = client.pub.publications
es = elasticsearch.Elasticsearch([{'host': 'localhost', 'port': 9200}],
                                 timeout=30, max_retries=10, retry_on_timeout=True)

In [26]:
es.cluster.health()

{'active_primary_shards': 0,
 'active_shards': 0,
 'active_shards_percent_as_number': 100.0,
 'cluster_name': 'elasticsearch',
 'delayed_unassigned_shards': 0,
 'initializing_shards': 0,
 'number_of_data_nodes': 1,
 'number_of_in_flight_fetch': 0,
 'number_of_nodes': 1,
 'number_of_pending_tasks': 0,
 'relocating_shards': 0,
 'status': 'green',
 'task_max_waiting_in_queue_millis': 0,
 'timed_out': False,
 'unassigned_shards': 0}

### Full text indexing

First we index the full content of every article in the database.

(This step corresponds to `collection_extraction -> index_papersfulltext.py` in the code)

In [14]:
def extract_metadata(documents):
    list_of_docs = []
    for i, r in enumerate(documents):
        extracted = {
                "_id": "",
                "title": "",
                "publication": "",
                "year": "",
                "content": "",
                "abstract": "",
                "authors": [],
                "references": []}
        extracted['_id'] = r['_id']
        extracted['title'] = r['title']
        extracted['publication'] = r['journal']
        extracted['year'] = r['year']
        extracted['content'] = r['content']['fulltext']
        extracted['abstract'] = r['content']['abstract']
        extracted['authors'] = r['authors']
        extracted['references'] = r['content']['references']
        list_of_docs.append(extracted)
    return list_of_docs

In [15]:
filter_publications = ['arxiv'] # Here we could also put PubMed or other sources

extracted_publications = []
for publication in filter_publications:
    query ={'$and': [{'journal': publication}, {'content.fulltext': {'$exists': True}}]}                   
    results = publications_collection.find(query)
    extracted_publications.append(extract_metadata(results))

In [29]:
for publication in extracted_publications:
    actions = []
    for article in publication:
        authors = []
        if len(article['authors']) > 0:
            if type(article['authors'][0]) == list:
                try:
                    for name in article['authors']:
                        authors.append(', '.join([name[1], name[0]]))
                    authors = list(set(authors))
                except:
                    pass
            else:
                authors = article['authors']
        actions.append({
            "_index": "ir", 
            "_type": "publications",  
            "_id": article['_id'],
            "_source": {
                "title": article["title"],
                "journal": article['publication'],
                "year": str(article['year']),
                "content": article["content"],
                "abstract": article["abstract"],
                "authors": authors,
                "references": article["references"]
            }
        })
    if len(actions) == 0:
        continue
    res = helpers.bulk(es, actions)
    print('Done with', res, 'articles added to index')

Done with (3814, []) articles added to index


We can look for anything in the text, like this:

In [30]:
res = es.search(index = "ir", body = {"query": {"match": {"title" : "netflix"}}}, size = 10)

print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit['_id'], hit['_source']['title'])
    print(hit['_id'], hit['_source']['abstract'])
    print('-'*50)

Got 1 Hits:
arxiv_84094279 Re-Evaluating the Netflix Prize - Human Uncertainty and its Impact on
  Reliability
arxiv_84094279 In this paper, we examine the statistical soundness of comparative
assessments within the field of recommender systems in terms of reliability and
human uncertainty. From a controlled experiment, we get the insight that users
provide different ratings on same items when repeatedly asked. This volatility
of user ratings justifies the assumption of using probability densities instead
of single rating scores. As a consequence, the well-known accuracy metrics
(e.g. MAE, MSE, RMSE) yield a density themselves that emerges from convolution
of all rating densities. When two different systems produce different RMSE
distributions with significant intersection, then there exists a probability of
error for each possible ranking. As an application, we examine possible ranking
errors of the Netflix Prize. We are able to show that all top rankings are more
or less subject to h

### Sentence Indexing 
For the sentence expansion step of our process, we need to have sentences indexed. This is because we need to find similar sentences as well as their surrounding sentences, for context. 

In addition, we create a file with all the text, of all our articles, and we will use it later for the training of embedding models.

(This step corresponds to `collection_extraction -> index_twosent.py` in the code)

In [8]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In [17]:
for publication in extracted_publications:
    for article in publication:
        actions = []
        cleaned = []
        dataset_sent = []
        other_sent = []

        lines = (sent_detector.tokenize(article['content'].strip()))
        
        # This will be used for the training of word2vec and doc2vec
        # You need to create the data folder beforehand
        with open('data/full_text_corpus.txt', 'a', encoding='utf-8') as f:
            for line in lines:
                f.write(line)
        f.close()
        
        if len(lines) < 3:
            continue

        for i in range(len(lines)):
            words = nltk.word_tokenize(lines[i])
            word_lengths = [len(x) for x in words]
            average_word_length = sum(word_lengths) / len(word_lengths)
            if average_word_length < 4:
                continue

            two_sentences = ''
            try:
                two_sentences = lines[i] + ' ' + lines[i - 1]
            except:
                two_sentences = lines[i] + ' ' + lines[i + 1]

            dataset_sent.append(two_sentences)

        for num, added_lines in enumerate(dataset_sent):
            actions.append({
                "_index": "twosent",
                "_type": "twosentnorules",
                "_id": article['_id'] + str(num),
                "_source": {
                    "title": article['title'],
                    "content.chapter.sentpositive": added_lines,
                    "paper_id": article['_id']
                }})

        if len(actions) == 0:
            continue
        res = helpers.bulk(es, actions)
print('Done')

Done


In [9]:
res = es.search(index = "twosent", body = {"query": {"match": {"content.chapter.sentpositive" : "imagenet"}}}, size = 5)

print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit['_id'], hit['_source']['title'])
    print(hit['_source']['content.chapter.sentpositive'])
    print('-'*50)

Got 1311 Hits:
arxiv_94074613218 Expert Gate: Lifelong Learning with a Network of Experts
We consider the following knowl- edge transfer cases: Scenes → Actions, ImageNet → Ac- tions, SVHN → Letters, ImageNet → Letters, SVHN → Mnist, ImageNet→ Mnist, Flowers→ Cars and Imagenet → Cars. We also consider ImageNet as a possible source.
--------------------------------------------------
arxiv_93937583198 Learning Efficient Convolutional Networks through Network Slimming
The results for ImageNet dataset are summa- rized in Table 2. ImageNet.
--------------------------------------------------
arxiv_86420809239 ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural
  Networks without Training Substitute Models
Substi- tute model based attack cannot easily scale to ImageNet. Table 2: Untargeted ImageNet attacks comparison.
--------------------------------------------------
arxiv_93937583152 Learning Efficient Convolutional Networks through Network Slimming
The ImageNet dataset co

### Doc2vec Indexing

Once again, we index the sentences from the full text file created before, which will be used for the Sentence Expansion using doc2vec.

(This step corresponds to `collection_extraction -> index_doc2vec.py` in the code)

In [10]:
file = open('data/full_text_corpus.txt', 'r', encoding='utf-8')
text = file.read()
file.close()
sentences = nltk.tokenize.sent_tokenize(text)
print('Sentences ready')
count = 0
docLabels = []
actions = []

for i, sent in enumerate(sentences):
    try:
        neighbors = sentences[i + 1]
        neighbor_count = count + 1
    except:
        neighbors = sentences[i - 1]
        neighbor_count = count - 1

    docLabels.append(count)
    actions.append({
        "_index": "devtwosentnew",
        "_type": "devtwosentnorulesnew",
        "_id": count,
        "_source": {
            "content.chapter.sentpositive": sent,
            "content.chapter.sentnegtive": neighbors,
            "neighborcount": neighbor_count
        }})
    count = count + 1

print(len(sentences))
print(len(docLabels))
res = helpers.bulk(es, actions)
print(res)

Sentences ready
112410
112410
(112410, [])


## Training Embedding Models (word2vec & doc2vec)

For the Term and Sentence expansion, we need to find similar words or sentences, and we use word2vec and doc2vec respectively. We have to generate a document with all the sentences in our corpus, which will be stored in the data folder. While this is not a full implementation, we have around 4 million sentences, so it is a considerable example, and processing can take some time.

(This step corresponds to `data_preparation -> prepare_embedding_data.py` in the code)

In [16]:
sentence_text = []
translator = str.maketrans('', '', string.punctuation)

for publication in extracted_publications:
    for article in publication:
        query = {"query":
                     {"match":
                          {"_id":
                               {"query": article['_id'],
                                "operator": "and"
                                }
                           }
                      }
                 }

        results = es.search(index="ir", body=query, size=200)
        for doc in results['hits']['hits']:
            fulltext = doc["_source"]["content"]
            fulltext = re.sub("[\[].*?[\]]", "", fulltext)
            sentence_text.append(fulltext)
    print('Done', '-' * 100)

f = open("data/data2vec.txt", "w", encoding='utf-8')
for line in sentence_text:
    f.write(line)
f.close()

Done ----------------------------------------------------------------------------------------------------


Before training, verify that you have the optimized version of Gensim using Cython, for better performance.

In [18]:
import gensim
assert gensim.models.doc2vec.FAST_VERSION > -1

Now that we have the document with all the text, we can use them for the training. 

We will use a bigram model for word2vec, since it's common that entities are represented as two words. You can run the command from your Python console or directly here, using the magic `%run` syntax. The structure of this command is as follows:

`%run script_to_run training_data model_output word_vector_output`

This was run in a normal Lenovo machine with 8GB and i7 processor. As you can see by the timestamps below, it might take a while...

In [19]:
%run data_preparation/train_word2vec.py data/data2vec.txt embedding_models/modelword2vecbigram.model embedding_models/modelword2vecbigram.vec

2018-05-15 17:07:25,101: INFO: running data_preparation/train_word2vec.py data/data2vec.txt embedding_models/modelword2vecbigram.model embedding_models/modelword2vecbigram.vec
2018-05-15 17:17:08,824: INFO: collecting all words and their counts
2018-05-15 17:17:08,831: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2018-05-15 17:17:09,350: INFO: PROGRESS: at sentence #10000, processed 201700 words and 71739 word types
2018-05-15 17:17:09,858: INFO: PROGRESS: at sentence #20000, processed 398393 words and 134942 word types
2018-05-15 17:17:10,411: INFO: PROGRESS: at sentence #30000, processed 617098 words and 204286 word types
2018-05-15 17:17:11,162: INFO: PROGRESS: at sentence #40000, processed 862142 words and 279857 word types
2018-05-15 17:17:11,845: INFO: PROGRESS: at sentence #50000, processed 1119381 words and 356880 word types
2018-05-15 17:17:12,539: INFO: PROGRESS: at sentence #60000, processed 1357143 words and 420874 word types
2018-05-15 17:17:13,233: I

2018-05-15 17:17:55,380: INFO: PROGRESS: at sentence #740000, processed 16120117 words and 2597213 word types
2018-05-15 17:17:56,083: INFO: PROGRESS: at sentence #750000, processed 16345603 words and 2637256 word types
2018-05-15 17:17:56,753: INFO: PROGRESS: at sentence #760000, processed 16571765 words and 2675889 word types
2018-05-15 17:17:57,412: INFO: PROGRESS: at sentence #770000, processed 16792631 words and 2716795 word types
2018-05-15 17:17:57,987: INFO: PROGRESS: at sentence #780000, processed 16981822 words and 2749520 word types
2018-05-15 17:17:58,700: INFO: PROGRESS: at sentence #790000, processed 17225297 words and 2783581 word types
2018-05-15 17:17:59,690: INFO: PROGRESS: at sentence #800000, processed 17459556 words and 2822675 word types
2018-05-15 17:18:00,351: INFO: PROGRESS: at sentence #810000, processed 17683519 words and 2858184 word types
2018-05-15 17:18:01,101: INFO: PROGRESS: at sentence #820000, processed 17920868 words and 2895177 word types
2018-05-15

2018-05-15 17:18:47,773: INFO: PROGRESS: at sentence #1490000, processed 33089396 words and 4905102 word types
2018-05-15 17:18:48,349: INFO: PROGRESS: at sentence #1500000, processed 33316021 words and 4928569 word types
2018-05-15 17:18:48,983: INFO: PROGRESS: at sentence #1510000, processed 33538115 words and 4946507 word types
2018-05-15 17:18:49,781: INFO: PROGRESS: at sentence #1520000, processed 33811072 words and 4965999 word types
2018-05-15 17:18:50,274: INFO: PROGRESS: at sentence #1530000, processed 33983474 words and 4994608 word types
2018-05-15 17:18:50,843: INFO: PROGRESS: at sentence #1540000, processed 34189105 words and 5026079 word types
2018-05-15 17:18:51,448: INFO: PROGRESS: at sentence #1550000, processed 34433445 words and 5059287 word types
2018-05-15 17:18:51,977: INFO: PROGRESS: at sentence #1560000, processed 34644834 words and 5091811 word types
2018-05-15 17:18:52,538: INFO: PROGRESS: at sentence #1570000, processed 34866405 words and 5131430 word types
2

2018-05-15 17:21:09,980: INFO: PROGRESS: at sentence #230000, processed 4069387 words, keeping 256004 word types
2018-05-15 17:21:10,738: INFO: PROGRESS: at sentence #240000, processed 4214508 words, keeping 261233 word types
2018-05-15 17:21:11,743: INFO: PROGRESS: at sentence #250000, processed 4410567 words, keeping 268577 word types
2018-05-15 17:21:12,814: INFO: PROGRESS: at sentence #260000, processed 4617727 words, keeping 275896 word types
2018-05-15 17:21:13,804: INFO: PROGRESS: at sentence #270000, processed 4808859 words, keeping 283390 word types
2018-05-15 17:21:14,787: INFO: PROGRESS: at sentence #280000, processed 5001159 words, keeping 290583 word types
2018-05-15 17:21:15,680: INFO: PROGRESS: at sentence #290000, processed 5179180 words, keeping 295423 word types
2018-05-15 17:21:16,542: INFO: PROGRESS: at sentence #300000, processed 5346906 words, keeping 301698 word types
2018-05-15 17:21:17,447: INFO: PROGRESS: at sentence #310000, processed 5521343 words, keeping 3

2018-05-15 17:22:20,257: INFO: PROGRESS: at sentence #960000, processed 17045165 words, keeping 622723 word types
2018-05-15 17:22:21,063: INFO: PROGRESS: at sentence #970000, processed 17206815 words, keeping 626420 word types
2018-05-15 17:22:21,860: INFO: PROGRESS: at sentence #980000, processed 17366391 words, keeping 631957 word types
2018-05-15 17:22:22,792: INFO: PROGRESS: at sentence #990000, processed 17552913 words, keeping 640484 word types
2018-05-15 17:22:23,665: INFO: PROGRESS: at sentence #1000000, processed 17723623 words, keeping 647237 word types
2018-05-15 17:22:24,577: INFO: PROGRESS: at sentence #1010000, processed 17884475 words, keeping 653943 word types
2018-05-15 17:22:25,382: INFO: PROGRESS: at sentence #1020000, processed 18045564 words, keeping 659946 word types
2018-05-15 17:22:26,194: INFO: PROGRESS: at sentence #1030000, processed 18204849 words, keeping 665730 word types
2018-05-15 17:22:27,043: INFO: PROGRESS: at sentence #1040000, processed 18374412 wo

2018-05-15 17:24:04,757: INFO: PROGRESS: at sentence #1680000, processed 30163346 words, keeping 952323 word types
2018-05-15 17:24:05,875: INFO: PROGRESS: at sentence #1690000, processed 30358019 words, keeping 955970 word types
2018-05-15 17:24:06,937: INFO: PROGRESS: at sentence #1700000, processed 30540864 words, keeping 959160 word types
2018-05-15 17:24:07,710: INFO: PROGRESS: at sentence #1710000, processed 30678584 words, keeping 961612 word types
2018-05-15 17:24:08,603: INFO: PROGRESS: at sentence #1720000, processed 30841232 words, keeping 964178 word types
2018-05-15 17:24:09,671: INFO: PROGRESS: at sentence #1730000, processed 31035672 words, keeping 967099 word types
2018-05-15 17:24:10,743: INFO: PROGRESS: at sentence #1740000, processed 31231321 words, keeping 970061 word types
2018-05-15 17:24:11,743: INFO: PROGRESS: at sentence #1750000, processed 31407719 words, keeping 976310 word types
2018-05-15 17:24:12,822: INFO: PROGRESS: at sentence #1760000, processed 3160016

2018-05-15 17:26:10,458: INFO: EPOCH 1 - PROGRESS: at 18.98% examples, 100884 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:26:11,480: INFO: EPOCH 1 - PROGRESS: at 19.45% examples, 100800 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:26:12,505: INFO: EPOCH 1 - PROGRESS: at 19.90% examples, 100895 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:26:13,543: INFO: EPOCH 1 - PROGRESS: at 20.35% examples, 100920 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:26:14,569: INFO: EPOCH 1 - PROGRESS: at 20.83% examples, 101168 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:26:15,621: INFO: EPOCH 1 - PROGRESS: at 21.33% examples, 101477 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:26:16,661: INFO: EPOCH 1 - PROGRESS: at 21.83% examples, 101814 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:26:17,705: INFO: EPOCH 1 - PROGRESS: at 22.28% examples, 102258 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:26:18,747: INFO: EPOCH 1 - PROGRESS: at 22.76% examples, 102669 words/s, in_qsize 0, out_qsize 0
2

2018-05-15 17:27:26,991: INFO: EPOCH 1 - PROGRESS: at 54.70% examples, 108725 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:27:28,000: INFO: EPOCH 1 - PROGRESS: at 55.20% examples, 108727 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:27:29,025: INFO: EPOCH 1 - PROGRESS: at 55.70% examples, 108783 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:27:30,048: INFO: EPOCH 1 - PROGRESS: at 56.18% examples, 108835 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:27:31,143: INFO: EPOCH 1 - PROGRESS: at 56.68% examples, 108829 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:27:32,164: INFO: EPOCH 1 - PROGRESS: at 57.13% examples, 108882 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:27:33,223: INFO: EPOCH 1 - PROGRESS: at 57.62% examples, 108959 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:27:34,236: INFO: EPOCH 1 - PROGRESS: at 58.11% examples, 109014 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:27:35,254: INFO: EPOCH 1 - PROGRESS: at 58.60% examples, 109125 words/s, in_qsize 0, out_qsize 0
2

2018-05-15 17:28:43,429: INFO: EPOCH 1 - PROGRESS: at 89.47% examples, 111293 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:28:44,466: INFO: EPOCH 1 - PROGRESS: at 90.08% examples, 111320 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:28:45,491: INFO: EPOCH 1 - PROGRESS: at 90.63% examples, 111358 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:28:46,549: INFO: EPOCH 1 - PROGRESS: at 91.10% examples, 111373 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:28:47,613: INFO: EPOCH 1 - PROGRESS: at 91.53% examples, 111396 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:28:48,644: INFO: EPOCH 1 - PROGRESS: at 92.06% examples, 111436 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:28:49,686: INFO: EPOCH 1 - PROGRESS: at 92.52% examples, 111481 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:28:50,702: INFO: EPOCH 1 - PROGRESS: at 92.96% examples, 111497 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:28:51,731: INFO: EPOCH 1 - PROGRESS: at 93.39% examples, 111508 words/s, in_qsize 1, out_qsize 0
2

2018-05-15 17:29:53,048: INFO: EPOCH 2 - PROGRESS: at 23.39% examples, 120568 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:29:54,070: INFO: EPOCH 2 - PROGRESS: at 23.89% examples, 120527 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:29:55,101: INFO: EPOCH 2 - PROGRESS: at 24.41% examples, 120493 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:29:56,153: INFO: EPOCH 2 - PROGRESS: at 24.95% examples, 120570 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:29:57,159: INFO: EPOCH 2 - PROGRESS: at 25.52% examples, 120560 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:29:58,162: INFO: EPOCH 2 - PROGRESS: at 25.98% examples, 120591 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:29:59,296: INFO: EPOCH 2 - PROGRESS: at 26.29% examples, 119645 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:30:00,345: INFO: EPOCH 2 - PROGRESS: at 26.75% examples, 119032 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:30:01,361: INFO: EPOCH 2 - PROGRESS: at 27.27% examples, 119046 words/s, in_qsize 0, out_qsize 0
2

2018-05-15 17:31:09,729: INFO: EPOCH 2 - PROGRESS: at 59.86% examples, 117147 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:31:10,786: INFO: EPOCH 2 - PROGRESS: at 60.36% examples, 117158 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:31:11,802: INFO: EPOCH 2 - PROGRESS: at 60.84% examples, 117210 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:31:12,814: INFO: EPOCH 2 - PROGRESS: at 61.31% examples, 117208 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:31:13,816: INFO: EPOCH 2 - PROGRESS: at 61.76% examples, 117211 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:31:14,859: INFO: EPOCH 2 - PROGRESS: at 62.30% examples, 117267 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:31:15,862: INFO: EPOCH 2 - PROGRESS: at 62.78% examples, 117309 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:31:16,865: INFO: EPOCH 2 - PROGRESS: at 63.15% examples, 117230 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:31:17,881: INFO: EPOCH 2 - PROGRESS: at 63.60% examples, 117209 words/s, in_qsize 1, out_qsize 0
2

2018-05-15 17:32:26,187: INFO: EPOCH 2 - PROGRESS: at 95.93% examples, 117981 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:32:27,226: INFO: EPOCH 2 - PROGRESS: at 96.36% examples, 117955 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:32:28,285: INFO: EPOCH 2 - PROGRESS: at 96.88% examples, 117949 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:32:29,296: INFO: EPOCH 2 - PROGRESS: at 97.40% examples, 118012 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:32:30,309: INFO: EPOCH 2 - PROGRESS: at 97.91% examples, 117995 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:32:31,359: INFO: EPOCH 2 - PROGRESS: at 98.43% examples, 118066 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:32:32,365: INFO: EPOCH 2 - PROGRESS: at 98.95% examples, 118125 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:32:33,391: INFO: EPOCH 2 - PROGRESS: at 99.45% examples, 118214 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:32:34,350: INFO: worker thread finished; awaiting finish of 7 more threads
2018-05-15 17:32:34,353

2018-05-15 17:33:34,941: INFO: EPOCH 3 - PROGRESS: at 33.13% examples, 132106 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:33:35,967: INFO: EPOCH 3 - PROGRESS: at 33.64% examples, 132100 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:33:36,977: INFO: EPOCH 3 - PROGRESS: at 34.22% examples, 132139 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:33:38,010: INFO: EPOCH 3 - PROGRESS: at 34.81% examples, 132145 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:33:39,024: INFO: EPOCH 3 - PROGRESS: at 35.39% examples, 132186 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:33:40,080: INFO: EPOCH 3 - PROGRESS: at 35.94% examples, 132173 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:33:41,097: INFO: EPOCH 3 - PROGRESS: at 36.48% examples, 132223 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:33:42,124: INFO: EPOCH 3 - PROGRESS: at 37.01% examples, 132047 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:33:43,160: INFO: EPOCH 3 - PROGRESS: at 37.65% examples, 132052 words/s, in_qsize 1, out_qsize 0
2

2018-05-15 17:34:56,183: INFO: EPOCH 3 - PROGRESS: at 72.35% examples, 125589 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:34:57,189: INFO: EPOCH 3 - PROGRESS: at 72.81% examples, 125552 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:34:58,207: INFO: EPOCH 3 - PROGRESS: at 73.28% examples, 125513 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:34:59,282: INFO: EPOCH 3 - PROGRESS: at 73.75% examples, 125474 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:35:00,304: INFO: EPOCH 3 - PROGRESS: at 74.20% examples, 125446 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:35:01,337: INFO: EPOCH 3 - PROGRESS: at 74.81% examples, 125443 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:35:02,369: INFO: EPOCH 3 - PROGRESS: at 75.25% examples, 125399 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:35:03,423: INFO: EPOCH 3 - PROGRESS: at 75.73% examples, 125338 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:35:04,494: INFO: EPOCH 3 - PROGRESS: at 76.36% examples, 125299 words/s, in_qsize 0, out_qsize 0
2

2018-05-15 17:36:06,507: INFO: EPOCH 4 - PROGRESS: at 5.79% examples, 117543 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:36:07,528: INFO: EPOCH 4 - PROGRESS: at 6.35% examples, 117804 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:36:08,570: INFO: EPOCH 4 - PROGRESS: at 6.83% examples, 117925 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:36:09,594: INFO: EPOCH 4 - PROGRESS: at 7.26% examples, 118186 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:36:10,657: INFO: EPOCH 4 - PROGRESS: at 7.97% examples, 117988 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:36:11,734: INFO: EPOCH 4 - PROGRESS: at 8.43% examples, 117856 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:36:12,783: INFO: EPOCH 4 - PROGRESS: at 8.95% examples, 117842 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:36:13,832: INFO: EPOCH 4 - PROGRESS: at 9.55% examples, 117793 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:36:14,853: INFO: EPOCH 4 - PROGRESS: at 10.07% examples, 118318 words/s, in_qsize 0, out_qsize 0
2018-05-1

2018-05-15 17:37:23,149: INFO: EPOCH 4 - PROGRESS: at 43.05% examples, 116983 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:37:24,152: INFO: EPOCH 4 - PROGRESS: at 43.33% examples, 116406 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:37:25,251: INFO: EPOCH 4 - PROGRESS: at 43.62% examples, 115891 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:37:26,255: INFO: EPOCH 4 - PROGRESS: at 43.90% examples, 115504 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:37:27,333: INFO: EPOCH 4 - PROGRESS: at 44.24% examples, 115015 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:37:28,410: INFO: EPOCH 4 - PROGRESS: at 44.53% examples, 114491 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:37:29,417: INFO: EPOCH 4 - PROGRESS: at 44.77% examples, 113981 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:37:30,487: INFO: EPOCH 4 - PROGRESS: at 44.93% examples, 113180 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:37:31,550: INFO: EPOCH 4 - PROGRESS: at 45.16% examples, 112537 words/s, in_qsize 0, out_qsize 0
2

2018-05-15 17:38:40,202: INFO: EPOCH 4 - PROGRESS: at 72.15% examples, 106787 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:38:41,306: INFO: EPOCH 4 - PROGRESS: at 72.45% examples, 106601 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:38:42,460: INFO: EPOCH 4 - PROGRESS: at 72.65% examples, 106209 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:38:43,568: INFO: EPOCH 4 - PROGRESS: at 72.90% examples, 105929 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:38:44,608: INFO: EPOCH 4 - PROGRESS: at 73.25% examples, 105792 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:38:45,634: INFO: EPOCH 4 - PROGRESS: at 73.58% examples, 105666 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:38:46,635: INFO: EPOCH 4 - PROGRESS: at 73.90% examples, 105598 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:38:47,651: INFO: EPOCH 4 - PROGRESS: at 74.31% examples, 105560 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:38:48,702: INFO: EPOCH 4 - PROGRESS: at 74.71% examples, 105419 words/s, in_qsize 0, out_qsize 0
2

2018-05-15 17:39:58,055: INFO: EPOCH 4 - PROGRESS: at 96.85% examples, 98046 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:39:59,158: INFO: EPOCH 4 - PROGRESS: at 97.11% examples, 97874 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:40:00,226: INFO: EPOCH 4 - PROGRESS: at 97.40% examples, 97748 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:40:01,288: INFO: EPOCH 4 - PROGRESS: at 97.77% examples, 97621 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:40:02,397: INFO: EPOCH 4 - PROGRESS: at 98.01% examples, 97479 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:40:03,492: INFO: EPOCH 4 - PROGRESS: at 98.28% examples, 97344 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:40:04,541: INFO: EPOCH 4 - PROGRESS: at 98.55% examples, 97196 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:40:05,671: INFO: EPOCH 4 - PROGRESS: at 98.81% examples, 97019 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:40:06,743: INFO: EPOCH 4 - PROGRESS: at 99.08% examples, 96895 words/s, in_qsize 0, out_qsize 0
2018-05-15

2018-05-15 17:41:10,799: INFO: EPOCH 5 - PROGRESS: at 18.55% examples, 74901 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:41:11,828: INFO: EPOCH 5 - PROGRESS: at 18.92% examples, 75293 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:41:12,866: INFO: EPOCH 5 - PROGRESS: at 19.37% examples, 75627 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:41:13,899: INFO: EPOCH 5 - PROGRESS: at 19.79% examples, 75872 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:41:14,964: INFO: EPOCH 5 - PROGRESS: at 20.20% examples, 76055 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:41:15,989: INFO: EPOCH 5 - PROGRESS: at 20.56% examples, 76292 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:41:16,994: INFO: EPOCH 5 - PROGRESS: at 20.99% examples, 76647 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:41:17,996: INFO: EPOCH 5 - PROGRESS: at 21.36% examples, 76902 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:41:19,094: INFO: EPOCH 5 - PROGRESS: at 21.77% examples, 77146 words/s, in_qsize 0, out_qsize 0
2018-05-15

2018-05-15 17:42:28,969: INFO: EPOCH 5 - PROGRESS: at 49.23% examples, 86748 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:42:29,996: INFO: EPOCH 5 - PROGRESS: at 49.61% examples, 86736 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:42:31,032: INFO: EPOCH 5 - PROGRESS: at 49.93% examples, 86668 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:42:32,068: INFO: EPOCH 5 - PROGRESS: at 50.22% examples, 86553 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:42:33,092: INFO: EPOCH 5 - PROGRESS: at 50.59% examples, 86654 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:42:34,130: INFO: EPOCH 5 - PROGRESS: at 50.99% examples, 86750 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:42:35,175: INFO: EPOCH 5 - PROGRESS: at 51.54% examples, 86881 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:42:36,237: INFO: EPOCH 5 - PROGRESS: at 52.01% examples, 87108 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:42:37,305: INFO: EPOCH 5 - PROGRESS: at 52.51% examples, 87273 words/s, in_qsize 0, out_qsize 0
2018-05-15

2018-05-15 17:43:46,906: INFO: EPOCH 5 - PROGRESS: at 80.04% examples, 91419 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:43:47,965: INFO: EPOCH 5 - PROGRESS: at 80.63% examples, 91538 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:43:49,037: INFO: EPOCH 5 - PROGRESS: at 81.16% examples, 91655 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:43:50,078: INFO: EPOCH 5 - PROGRESS: at 81.57% examples, 91715 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:43:51,241: INFO: EPOCH 5 - PROGRESS: at 81.92% examples, 91627 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:43:52,307: INFO: EPOCH 5 - PROGRESS: at 82.35% examples, 91680 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:43:53,373: INFO: EPOCH 5 - PROGRESS: at 82.79% examples, 91737 words/s, in_qsize 0, out_qsize 0
2018-05-15 17:43:54,429: INFO: EPOCH 5 - PROGRESS: at 83.22% examples, 91854 words/s, in_qsize 1, out_qsize 0
2018-05-15 17:43:55,454: INFO: EPOCH 5 - PROGRESS: at 83.66% examples, 91893 words/s, in_qsize 0, out_qsize 0
2018-05-15

Now we select the sentences from the full text and use them for the doc2vec training. Once again, we can run the command from your Python console or directly here, using the magic `%run` syntax. The structure of this command is as follows:

`%run script_to_run training_data model_output`

This also took some time.

In [55]:
%run data_preparation/train_doc2vec.py data/data2vec.txt embedding_models/doc2vec.model

2018-05-14 23:23:49,487: INFO: running data_preparation/train_doc2vec.py data/dataDoc2vec.txt data/doc2vec.model
2018-05-14 23:27:02,732: INFO: collecting all words and their counts
  yield LabeledSentence(words=doc.split(), tags=[self.labels_list[idx]])
2018-05-14 23:27:02,757: INFO: PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-05-14 23:27:02,873: INFO: PROGRESS: at example #10000, processed 192995 words (1691505/s), 27656 word types, 10000 tags


4138892
4138892


2018-05-14 23:27:02,992: INFO: PROGRESS: at example #20000, processed 377367 words (1559424/s), 43708 word types, 20000 tags
2018-05-14 23:27:03,087: INFO: PROGRESS: at example #30000, processed 508001 words (1388833/s), 52207 word types, 30000 tags
2018-05-14 23:27:03,193: INFO: PROGRESS: at example #40000, processed 686456 words (1700594/s), 65978 word types, 40000 tags
2018-05-14 23:27:03,306: INFO: PROGRESS: at example #50000, processed 880869 words (1734820/s), 79708 word types, 50000 tags
2018-05-14 23:27:03,428: INFO: PROGRESS: at example #60000, processed 1092356 words (1766510/s), 93801 word types, 60000 tags
2018-05-14 23:27:03,547: INFO: PROGRESS: at example #70000, processed 1284802 words (1626946/s), 105557 word types, 70000 tags
2018-05-14 23:27:03,658: INFO: PROGRESS: at example #80000, processed 1476281 words (1734553/s), 116196 word types, 80000 tags
2018-05-14 23:27:03,770: INFO: PROGRESS: at example #90000, processed 1663771 words (1692946/s), 127655 word types, 9000

2018-05-14 23:27:10,404: INFO: PROGRESS: at example #660000, processed 12732835 words (1664873/s), 526757 word types, 660000 tags
2018-05-14 23:27:10,524: INFO: PROGRESS: at example #670000, processed 12949445 words (1829706/s), 533368 word types, 670000 tags
2018-05-14 23:27:10,636: INFO: PROGRESS: at example #680000, processed 13140942 words (1719813/s), 539006 word types, 680000 tags
2018-05-14 23:27:10,755: INFO: PROGRESS: at example #690000, processed 13351149 words (1786536/s), 543682 word types, 690000 tags
2018-05-14 23:27:10,879: INFO: PROGRESS: at example #700000, processed 13551907 words (1626434/s), 549806 word types, 700000 tags
2018-05-14 23:27:11,007: INFO: PROGRESS: at example #710000, processed 13763406 words (1677777/s), 557006 word types, 710000 tags
2018-05-14 23:27:11,125: INFO: PROGRESS: at example #720000, processed 13960015 words (1683663/s), 562340 word types, 720000 tags
2018-05-14 23:27:11,244: INFO: PROGRESS: at example #730000, processed 14153815 words (163

2018-05-14 23:27:17,613: INFO: PROGRESS: at example #1290000, processed 24615959 words (1580936/s), 847614 word types, 1290000 tags
2018-05-14 23:27:17,718: INFO: PROGRESS: at example #1300000, processed 24784655 words (1630871/s), 853677 word types, 1300000 tags
2018-05-14 23:27:17,838: INFO: PROGRESS: at example #1310000, processed 24981022 words (1649469/s), 860911 word types, 1310000 tags
2018-05-14 23:27:17,937: INFO: PROGRESS: at example #1320000, processed 25130850 words (1523873/s), 866810 word types, 1320000 tags
2018-05-14 23:27:18,042: INFO: PROGRESS: at example #1330000, processed 25298357 words (1621912/s), 872001 word types, 1330000 tags
2018-05-14 23:27:18,146: INFO: PROGRESS: at example #1340000, processed 25466785 words (1636203/s), 876442 word types, 1340000 tags
2018-05-14 23:27:18,251: INFO: PROGRESS: at example #1350000, processed 25636602 words (1626704/s), 880499 word types, 1350000 tags
2018-05-14 23:27:18,363: INFO: PROGRESS: at example #1360000, processed 2580

2018-05-14 23:27:24,311: INFO: PROGRESS: at example #1910000, processed 35267115 words (1643036/s), 1128897 word types, 1910000 tags
2018-05-14 23:27:24,423: INFO: PROGRESS: at example #1920000, processed 35445530 words (1617262/s), 1134756 word types, 1920000 tags
2018-05-14 23:27:24,535: INFO: PROGRESS: at example #1930000, processed 35631285 words (1677485/s), 1141309 word types, 1930000 tags
2018-05-14 23:27:24,650: INFO: PROGRESS: at example #1940000, processed 35827394 words (1720693/s), 1146548 word types, 1940000 tags
2018-05-14 23:27:24,758: INFO: PROGRESS: at example #1950000, processed 35971380 words (1345030/s), 1152399 word types, 1950000 tags
2018-05-14 23:27:24,863: INFO: PROGRESS: at example #1960000, processed 36132195 words (1549368/s), 1155610 word types, 1960000 tags
2018-05-14 23:27:24,976: INFO: PROGRESS: at example #1970000, processed 36317022 words (1650593/s), 1162486 word types, 1970000 tags
2018-05-14 23:27:25,083: INFO: PROGRESS: at example #1980000, process

2018-05-14 23:27:31,451: INFO: PROGRESS: at example #2530000, processed 46728852 words (1467069/s), 1406061 word types, 2530000 tags
2018-05-14 23:27:31,560: INFO: PROGRESS: at example #2540000, processed 46846237 words (1113814/s), 1407969 word types, 2540000 tags
2018-05-14 23:27:31,701: INFO: PROGRESS: at example #2550000, processed 47034440 words (1345017/s), 1412387 word types, 2550000 tags
2018-05-14 23:27:31,841: INFO: PROGRESS: at example #2560000, processed 47223856 words (1369599/s), 1415914 word types, 2560000 tags
2018-05-14 23:27:31,968: INFO: PROGRESS: at example #2570000, processed 47417225 words (1539666/s), 1420259 word types, 2570000 tags
2018-05-14 23:27:32,107: INFO: PROGRESS: at example #2580000, processed 47591739 words (1264934/s), 1423404 word types, 2580000 tags
2018-05-14 23:27:32,238: INFO: PROGRESS: at example #2590000, processed 47779121 words (1439323/s), 1428506 word types, 2590000 tags
2018-05-14 23:27:32,364: INFO: PROGRESS: at example #2600000, process

2018-05-14 23:27:38,867: INFO: PROGRESS: at example #3150000, processed 58094507 words (1757050/s), 1723193 word types, 3150000 tags
2018-05-14 23:27:39,025: INFO: PROGRESS: at example #3160000, processed 58286095 words (1216530/s), 1728369 word types, 3160000 tags
2018-05-14 23:27:39,163: INFO: PROGRESS: at example #3170000, processed 58439508 words (1115792/s), 1729975 word types, 3170000 tags
2018-05-14 23:27:39,293: INFO: PROGRESS: at example #3180000, processed 58621956 words (1419922/s), 1736419 word types, 3180000 tags
2018-05-14 23:27:39,391: INFO: PROGRESS: at example #3190000, processed 58767268 words (1497925/s), 1738574 word types, 3190000 tags
2018-05-14 23:27:39,496: INFO: PROGRESS: at example #3200000, processed 58941499 words (1667832/s), 1742528 word types, 3200000 tags
2018-05-14 23:27:39,610: INFO: PROGRESS: at example #3210000, processed 59122805 words (1605352/s), 1749286 word types, 3210000 tags
2018-05-14 23:27:39,732: INFO: PROGRESS: at example #3220000, process

2018-05-14 23:27:46,266: INFO: PROGRESS: at example #3770000, processed 69372393 words (1504818/s), 1995013 word types, 3770000 tags
2018-05-14 23:27:46,394: INFO: PROGRESS: at example #3780000, processed 69569790 words (1597213/s), 1996988 word types, 3780000 tags
2018-05-14 23:27:46,512: INFO: PROGRESS: at example #3790000, processed 69780197 words (1793304/s), 2000755 word types, 3790000 tags
2018-05-14 23:27:46,627: INFO: PROGRESS: at example #3800000, processed 69977409 words (1727340/s), 2003826 word types, 3800000 tags
2018-05-14 23:27:46,744: INFO: PROGRESS: at example #3810000, processed 70168069 words (1648272/s), 2007040 word types, 3810000 tags
2018-05-14 23:27:46,888: INFO: PROGRESS: at example #3820000, processed 70354348 words (1306447/s), 2010486 word types, 3820000 tags
2018-05-14 23:27:47,006: INFO: PROGRESS: at example #3830000, processed 70546820 words (1638414/s), 2016995 word types, 3830000 tags
2018-05-14 23:27:47,124: INFO: PROGRESS: at example #3840000, process

2018-05-14 23:32:00,219: INFO: EPOCH 1 - PROGRESS: at 7.79% examples, 29700 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:32:01,290: INFO: EPOCH 1 - PROGRESS: at 8.00% examples, 30317 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:32:02,290: INFO: EPOCH 1 - PROGRESS: at 8.26% examples, 31048 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:32:03,301: INFO: EPOCH 1 - PROGRESS: at 8.61% examples, 31996 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:32:04,378: INFO: EPOCH 1 - PROGRESS: at 8.98% examples, 33275 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:32:05,418: INFO: EPOCH 1 - PROGRESS: at 9.35% examples, 34465 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:32:06,421: INFO: EPOCH 1 - PROGRESS: at 9.68% examples, 35635 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:32:07,446: INFO: EPOCH 1 - PROGRESS: at 10.06% examples, 36849 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:32:08,462: INFO: EPOCH 1 - PROGRESS: at 10.41% examples, 37963 words/s, in_qsize 15, out_qsize 0
2018-05-

2018-05-14 23:38:04,020: INFO: EPOCH 1 - PROGRESS: at 33.82% examples, 41546 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:38:05,101: INFO: EPOCH 1 - PROGRESS: at 33.96% examples, 41604 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:38:06,202: INFO: EPOCH 1 - PROGRESS: at 34.23% examples, 41841 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:38:07,222: INFO: EPOCH 1 - PROGRESS: at 34.54% examples, 42088 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:38:08,268: INFO: EPOCH 1 - PROGRESS: at 34.94% examples, 42443 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:38:09,299: INFO: EPOCH 1 - PROGRESS: at 35.32% examples, 42759 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:38:10,315: INFO: EPOCH 1 - PROGRESS: at 35.69% examples, 43085 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:38:11,418: INFO: EPOCH 1 - PROGRESS: at 36.11% examples, 43435 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:38:12,428: INFO: EPOCH 1 - PROGRESS: at 36.46% examples, 43739 words/s, in_qsize 15, out_qsize 0
2

2018-05-14 23:44:21,449: INFO: EPOCH 1 - PROGRESS: at 60.05% examples, 42413 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:44:22,482: INFO: EPOCH 1 - PROGRESS: at 60.27% examples, 42538 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:44:23,495: INFO: EPOCH 1 - PROGRESS: at 60.65% examples, 42738 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:44:24,504: INFO: EPOCH 1 - PROGRESS: at 60.97% examples, 42936 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:44:25,560: INFO: EPOCH 1 - PROGRESS: at 61.28% examples, 43108 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:44:26,574: INFO: EPOCH 1 - PROGRESS: at 61.71% examples, 43293 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:44:27,579: INFO: EPOCH 1 - PROGRESS: at 62.01% examples, 43465 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:44:28,589: INFO: EPOCH 1 - PROGRESS: at 62.38% examples, 43667 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:44:29,607: INFO: EPOCH 1 - PROGRESS: at 62.72% examples, 43872 words/s, in_qsize 16, out_qsize 0
2

2018-05-14 23:50:45,067: INFO: EPOCH 1 - PROGRESS: at 86.26% examples, 42845 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:50:46,171: INFO: EPOCH 1 - PROGRESS: at 86.54% examples, 42953 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:50:47,196: INFO: EPOCH 1 - PROGRESS: at 86.88% examples, 43099 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:50:48,204: INFO: EPOCH 1 - PROGRESS: at 87.18% examples, 43213 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:50:49,230: INFO: EPOCH 1 - PROGRESS: at 87.67% examples, 43363 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:50:50,255: INFO: EPOCH 1 - PROGRESS: at 87.98% examples, 43498 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:50:51,305: INFO: EPOCH 1 - PROGRESS: at 88.30% examples, 43630 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:50:52,345: INFO: EPOCH 1 - PROGRESS: at 88.77% examples, 43784 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:50:53,407: INFO: EPOCH 1 - PROGRESS: at 89.16% examples, 43938 words/s, in_qsize 15, out_qsize 0
2

2018-05-14 23:54:29,180: INFO: EPOCH 2 - PROGRESS: at 10.22% examples, 241046 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:54:30,197: INFO: EPOCH 2 - PROGRESS: at 10.60% examples, 241709 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:54:31,207: INFO: EPOCH 2 - PROGRESS: at 10.94% examples, 241826 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:54:32,222: INFO: EPOCH 2 - PROGRESS: at 11.29% examples, 242540 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:54:33,266: INFO: EPOCH 2 - PROGRESS: at 11.65% examples, 242063 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:54:34,343: INFO: EPOCH 2 - PROGRESS: at 12.01% examples, 241102 words/s, in_qsize 15, out_qsize 0
2018-05-14 23:57:14,090: INFO: EPOCH 2 - PROGRESS: at 12.43% examples, 43671 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:57:15,128: INFO: EPOCH 2 - PROGRESS: at 12.57% examples, 43960 words/s, in_qsize 16, out_qsize 0
2018-05-14 23:57:16,214: INFO: EPOCH 2 - PROGRESS: at 12.87% examples, 44763 words/s, in_qsize 15, out_qsi

2018-05-15 00:00:52,835: INFO: EPOCH 2 - PROGRESS: at 36.78% examples, 58806 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:00:53,887: INFO: EPOCH 2 - PROGRESS: at 37.18% examples, 59209 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:00:54,912: INFO: EPOCH 2 - PROGRESS: at 37.58% examples, 59612 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:00:55,958: INFO: EPOCH 2 - PROGRESS: at 37.95% examples, 60021 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:00:56,993: INFO: EPOCH 2 - PROGRESS: at 38.33% examples, 60421 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:03:30,629: INFO: EPOCH 2 - PROGRESS: at 38.55% examples, 44371 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:03:31,843: INFO: EPOCH 2 - PROGRESS: at 38.68% examples, 44425 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:03:32,871: INFO: EPOCH 2 - PROGRESS: at 38.92% examples, 44580 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:03:33,891: INFO: EPOCH 2 - PROGRESS: at 39.30% examples, 44885 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 00:07:18,121: INFO: EPOCH 2 - PROGRESS: at 63.29% examples, 51474 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:07:19,182: INFO: EPOCH 2 - PROGRESS: at 63.67% examples, 51695 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:07:20,187: INFO: EPOCH 2 - PROGRESS: at 64.02% examples, 51907 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:07:21,247: INFO: EPOCH 2 - PROGRESS: at 64.40% examples, 52145 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:09:58,458: INFO: EPOCH 2 - PROGRESS: at 64.78% examples, 43837 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:09:59,948: INFO: EPOCH 2 - PROGRESS: at 64.79% examples, 43778 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:10:00,970: INFO: EPOCH 2 - PROGRESS: at 64.98% examples, 43872 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:10:01,984: INFO: EPOCH 2 - PROGRESS: at 65.30% examples, 44039 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:10:03,000: INFO: EPOCH 2 - PROGRESS: at 65.66% examples, 44198 words/s, in_qsize 16, out_qsize 0
2

2018-05-15 00:13:48,889: INFO: EPOCH 2 - PROGRESS: at 89.91% examples, 48942 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:13:49,938: INFO: EPOCH 2 - PROGRESS: at 90.32% examples, 49078 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:13:50,941: INFO: EPOCH 2 - PROGRESS: at 90.66% examples, 49252 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:13:51,962: INFO: EPOCH 2 - PROGRESS: at 91.00% examples, 49413 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:16:31,363: INFO: EPOCH 2 - PROGRESS: at 91.06% examples, 43614 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:16:32,631: INFO: EPOCH 2 - PROGRESS: at 91.08% examples, 43585 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:16:33,713: INFO: EPOCH 2 - PROGRESS: at 91.13% examples, 43570 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:16:34,744: INFO: EPOCH 2 - PROGRESS: at 91.35% examples, 43659 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:16:35,748: INFO: EPOCH 2 - PROGRESS: at 91.67% examples, 43797 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 00:20:13,524: INFO: EPOCH 3 - PROGRESS: at 12.79% examples, 44718 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:20:14,528: INFO: EPOCH 3 - PROGRESS: at 13.09% examples, 45660 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:20:15,564: INFO: EPOCH 3 - PROGRESS: at 13.53% examples, 46631 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:20:16,568: INFO: EPOCH 3 - PROGRESS: at 13.89% examples, 47714 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:20:17,590: INFO: EPOCH 3 - PROGRESS: at 14.24% examples, 48740 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:20:18,614: INFO: EPOCH 3 - PROGRESS: at 14.61% examples, 49685 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:20:19,726: INFO: EPOCH 3 - PROGRESS: at 15.00% examples, 50690 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:20:20,744: INFO: EPOCH 3 - PROGRESS: at 15.40% examples, 51619 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:20:21,797: INFO: EPOCH 3 - PROGRESS: at 15.75% examples, 52642 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 00:27:10,066: INFO: EPOCH 3 - PROGRESS: at 37.66% examples, 40524 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:27:11,091: INFO: EPOCH 3 - PROGRESS: at 38.03% examples, 40836 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:27:12,098: INFO: EPOCH 3 - PROGRESS: at 38.43% examples, 41147 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:27:13,184: INFO: EPOCH 3 - PROGRESS: at 38.81% examples, 41459 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:27:14,188: INFO: EPOCH 3 - PROGRESS: at 39.21% examples, 41763 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:27:15,221: INFO: EPOCH 3 - PROGRESS: at 39.57% examples, 42049 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:27:16,238: INFO: EPOCH 3 - PROGRESS: at 39.96% examples, 42351 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:27:17,332: INFO: EPOCH 3 - PROGRESS: at 40.32% examples, 42645 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:27:18,354: INFO: EPOCH 3 - PROGRESS: at 40.73% examples, 42975 words/s, in_qsize 16, out_qsize 0
2

2018-05-15 00:33:39,728: INFO: EPOCH 3 - PROGRESS: at 63.62% examples, 41230 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:33:40,795: INFO: EPOCH 3 - PROGRESS: at 63.99% examples, 41424 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:33:41,883: INFO: EPOCH 3 - PROGRESS: at 64.39% examples, 41633 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:33:42,885: INFO: EPOCH 3 - PROGRESS: at 64.76% examples, 41827 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:33:43,930: INFO: EPOCH 3 - PROGRESS: at 65.12% examples, 42021 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:33:44,983: INFO: EPOCH 3 - PROGRESS: at 65.47% examples, 42198 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:33:46,013: INFO: EPOCH 3 - PROGRESS: at 65.89% examples, 42398 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:33:47,023: INFO: EPOCH 3 - PROGRESS: at 66.23% examples, 42583 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:33:48,048: INFO: EPOCH 3 - PROGRESS: at 66.60% examples, 42782 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 00:40:12,721: INFO: EPOCH 3 - PROGRESS: at 89.58% examples, 41575 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:40:13,841: INFO: EPOCH 3 - PROGRESS: at 89.93% examples, 41708 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:40:14,841: INFO: EPOCH 3 - PROGRESS: at 90.36% examples, 41849 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:40:15,881: INFO: EPOCH 3 - PROGRESS: at 90.72% examples, 42008 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:40:16,992: INFO: EPOCH 3 - PROGRESS: at 91.09% examples, 42165 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:40:18,007: INFO: EPOCH 3 - PROGRESS: at 91.47% examples, 42326 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:40:19,045: INFO: EPOCH 3 - PROGRESS: at 91.81% examples, 42472 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:40:20,059: INFO: EPOCH 3 - PROGRESS: at 92.16% examples, 42612 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:40:21,086: INFO: EPOCH 3 - PROGRESS: at 92.54% examples, 42756 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 00:46:43,918: INFO: EPOCH 4 - PROGRESS: at 12.99% examples, 45863 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:46:44,925: INFO: EPOCH 4 - PROGRESS: at 13.40% examples, 46772 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:46:45,973: INFO: EPOCH 4 - PROGRESS: at 13.76% examples, 47807 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:46:47,005: INFO: EPOCH 4 - PROGRESS: at 14.11% examples, 48847 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:46:48,021: INFO: EPOCH 4 - PROGRESS: at 14.49% examples, 49850 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:46:49,067: INFO: EPOCH 4 - PROGRESS: at 14.86% examples, 50794 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:46:50,072: INFO: EPOCH 4 - PROGRESS: at 15.21% examples, 51822 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:46:51,102: INFO: EPOCH 4 - PROGRESS: at 15.60% examples, 52735 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:46:52,136: INFO: EPOCH 4 - PROGRESS: at 15.96% examples, 53725 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 00:53:25,777: INFO: EPOCH 4 - PROGRESS: at 38.37% examples, 42349 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:53:26,803: INFO: EPOCH 4 - PROGRESS: at 38.72% examples, 42632 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:53:27,817: INFO: EPOCH 4 - PROGRESS: at 39.11% examples, 42941 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:53:28,828: INFO: EPOCH 4 - PROGRESS: at 39.48% examples, 43249 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:53:29,833: INFO: EPOCH 4 - PROGRESS: at 39.85% examples, 43547 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:53:30,835: INFO: EPOCH 4 - PROGRESS: at 40.22% examples, 43825 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:53:31,838: INFO: EPOCH 4 - PROGRESS: at 40.63% examples, 44164 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:53:32,849: INFO: EPOCH 4 - PROGRESS: at 40.98% examples, 44431 words/s, in_qsize 16, out_qsize 0
2018-05-15 00:53:33,890: INFO: EPOCH 4 - PROGRESS: at 41.40% examples, 44749 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 00:59:58,173: INFO: EPOCH 4 - PROGRESS: at 64.01% examples, 42035 words/s, in_qsize 15, out_qsize 0
2018-05-15 00:59:59,174: INFO: EPOCH 4 - PROGRESS: at 64.39% examples, 42241 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:00:00,205: INFO: EPOCH 4 - PROGRESS: at 64.76% examples, 42437 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:00:01,214: INFO: EPOCH 4 - PROGRESS: at 65.11% examples, 42627 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:00:02,216: INFO: EPOCH 4 - PROGRESS: at 65.46% examples, 42807 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:00:03,219: INFO: EPOCH 4 - PROGRESS: at 65.86% examples, 43003 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:00:04,236: INFO: EPOCH 4 - PROGRESS: at 66.22% examples, 43198 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:00:05,254: INFO: EPOCH 4 - PROGRESS: at 66.59% examples, 43399 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:00:06,336: INFO: EPOCH 4 - PROGRESS: at 66.98% examples, 43597 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 01:06:35,978: INFO: EPOCH 4 - PROGRESS: at 90.43% examples, 42201 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:06:36,992: INFO: EPOCH 4 - PROGRESS: at 90.76% examples, 42349 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:06:38,015: INFO: EPOCH 4 - PROGRESS: at 91.11% examples, 42498 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:06:39,020: INFO: EPOCH 4 - PROGRESS: at 91.48% examples, 42660 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:06:40,024: INFO: EPOCH 4 - PROGRESS: at 91.81% examples, 42801 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:06:41,050: INFO: EPOCH 4 - PROGRESS: at 92.18% examples, 42948 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:06:42,135: INFO: EPOCH 4 - PROGRESS: at 92.55% examples, 43091 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:06:43,148: INFO: EPOCH 4 - PROGRESS: at 92.93% examples, 43239 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:06:44,162: INFO: EPOCH 4 - PROGRESS: at 93.30% examples, 43385 words/s, in_qsize 16, out_qsize 0
2

2018-05-15 01:13:03,539: INFO: EPOCH 5 - PROGRESS: at 13.87% examples, 26096 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:13:04,553: INFO: EPOCH 5 - PROGRESS: at 14.17% examples, 26630 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:13:05,609: INFO: EPOCH 5 - PROGRESS: at 14.48% examples, 27094 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:13:06,644: INFO: EPOCH 5 - PROGRESS: at 14.86% examples, 27698 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:13:07,660: INFO: EPOCH 5 - PROGRESS: at 15.19% examples, 28299 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:13:08,669: INFO: EPOCH 5 - PROGRESS: at 15.58% examples, 28843 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:13:09,768: INFO: EPOCH 5 - PROGRESS: at 15.96% examples, 29495 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:13:10,800: INFO: EPOCH 5 - PROGRESS: at 16.33% examples, 30145 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:13:11,828: INFO: EPOCH 5 - PROGRESS: at 16.68% examples, 30729 words/s, in_qsize 16, out_qsize 0
2

2018-05-15 01:19:42,154: INFO: EPOCH 5 - PROGRESS: at 40.57% examples, 34921 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:19:43,173: INFO: EPOCH 5 - PROGRESS: at 40.91% examples, 35144 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:19:44,178: INFO: EPOCH 5 - PROGRESS: at 41.30% examples, 35388 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:19:45,213: INFO: EPOCH 5 - PROGRESS: at 41.69% examples, 35643 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:19:46,264: INFO: EPOCH 5 - PROGRESS: at 42.06% examples, 35906 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:19:47,374: INFO: EPOCH 5 - PROGRESS: at 42.49% examples, 36163 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:19:48,394: INFO: EPOCH 5 - PROGRESS: at 42.88% examples, 36402 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:19:49,415: INFO: EPOCH 5 - PROGRESS: at 43.22% examples, 36640 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:19:50,456: INFO: EPOCH 5 - PROGRESS: at 43.59% examples, 36876 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 01:26:27,208: INFO: EPOCH 5 - PROGRESS: at 66.82% examples, 37191 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:26:28,209: INFO: EPOCH 5 - PROGRESS: at 67.20% examples, 37368 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:26:29,217: INFO: EPOCH 5 - PROGRESS: at 67.53% examples, 37531 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:26:30,242: INFO: EPOCH 5 - PROGRESS: at 67.92% examples, 37677 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:26:31,264: INFO: EPOCH 5 - PROGRESS: at 68.28% examples, 37855 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:26:32,348: INFO: EPOCH 5 - PROGRESS: at 68.64% examples, 38028 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:26:33,349: INFO: EPOCH 5 - PROGRESS: at 69.04% examples, 38198 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:26:34,374: INFO: EPOCH 5 - PROGRESS: at 69.40% examples, 38368 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:26:35,396: INFO: EPOCH 5 - PROGRESS: at 69.74% examples, 38534 words/s, in_qsize 15, out_qsize 0
2

2018-05-15 01:33:08,552: INFO: EPOCH 5 - PROGRESS: at 92.36% examples, 38212 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:33:09,637: INFO: EPOCH 5 - PROGRESS: at 92.56% examples, 38273 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:33:10,695: INFO: EPOCH 5 - PROGRESS: at 92.90% examples, 38390 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:33:11,714: INFO: EPOCH 5 - PROGRESS: at 93.25% examples, 38512 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:33:12,715: INFO: EPOCH 5 - PROGRESS: at 93.62% examples, 38646 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:33:13,719: INFO: EPOCH 5 - PROGRESS: at 93.95% examples, 38773 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:33:14,740: INFO: EPOCH 5 - PROGRESS: at 94.28% examples, 38901 words/s, in_qsize 16, out_qsize 0
2018-05-15 01:33:15,753: INFO: EPOCH 5 - PROGRESS: at 94.65% examples, 39038 words/s, in_qsize 15, out_qsize 0
2018-05-15 01:33:16,855: INFO: EPOCH 5 - PROGRESS: at 95.03% examples, 39169 words/s, in_qsize 15, out_qsize 0
2

And this is it for data collection, extraction and preparation!

The next step is the generation of training data (with Term and Sentence Expansion) for the Named-Entity Tagger and, of course, the actual training.