# Content Collection and Processing for TSE-NER Long-tail Entity Extraction

Here we will try to follow the full process required to train and use a NER model.

In [1]:
# This is in case you update modules while working
%load_ext autoreload 
%autoreload 2

## Content Collection (a.k.a. getting a *lot* of papers)

For this step we will be using [sci-paper-miner](https://github.com/ronentk/sci-paper-miner) to download scientific publications from arXiv. Get the code from their repository, and use the following command to run it:

`python crawl_core.py <your-api>` 

In the `crawl_core` file, you can modify the topics and range of years that you want to download, for example here we selected papers only from 2017 for Artificial Intelligence, Computational Complexity, Cryptography and Security, Human-Computer Interaction, Logic in Computer Science, Mathematical Software, Multiagent Systems, Neural and Evolutionary Computing and Sound (in the `crawl_core` file they have all the topics, you can choose what to keep).  In `configs` you can write a name for the folder where the data will be stored.

In [24]:
import json
import os

In [32]:
# This is the path where all the json files were downloaded 

path = 'sci_paper_miner/data/arxiv_2006-2017_cs/raw_query/'

We know that the files we downloaded are in JSON format, but let's check their structure.
For this we can use `os.walk`.

In [41]:
for dirpath, dirnames, filenames in os.walk(path):
    for filename in filenames:
        if filename.endswith('.json'):
            print(filename)

144_Computer Science - Artificial Intelligence_2017_5_0.json
144_Computer Science - Artificial Intelligence_2017_5_1.json
144_Computer Science - Artificial Intelligence_2017_5_10.json
144_Computer Science - Artificial Intelligence_2017_5_11.json
144_Computer Science - Artificial Intelligence_2017_5_12.json
144_Computer Science - Artificial Intelligence_2017_5_13.json
144_Computer Science - Artificial Intelligence_2017_5_14.json
144_Computer Science - Artificial Intelligence_2017_5_15.json
144_Computer Science - Artificial Intelligence_2017_5_16.json
144_Computer Science - Artificial Intelligence_2017_5_17.json
144_Computer Science - Artificial Intelligence_2017_5_18.json
144_Computer Science - Artificial Intelligence_2017_5_19.json
144_Computer Science - Artificial Intelligence_2017_5_2.json
144_Computer Science - Artificial Intelligence_2017_5_20.json
144_Computer Science - Artificial Intelligence_2017_5_21.json
144_Computer Science - Artificial Intelligence_2017_5_22.json
144_Compute

It seems like they are not individual articles, so we have to look into one of them.
Let's just take the last one.

In [42]:
json_file = os.path.join(path, filename)
with open(json_file) as f:
    data = json.load(f)

JSON files are loaded as dictionaries:

In [43]:
data.keys()

dict_keys(['authors', 'contributors', 'datePublished', 'description', 'doi', 'downloadUrl', 'fullText', 'fulltextIdentifier', 'id', 'identifiers', 'oai', 'relations', 'repositories', 'subjects', 'title', 'topics', 'types', 'year'])

If we look into one of those keys, it seems like there is many articles, each one with their respective value for the keys above.

In [53]:
data['title'].keys()

dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47'])

So we can check the content of the `Title` key, for all articles:

In [61]:
for key in data['title']:
    print(data['title'][key])
    print('-'*50)

A new cosine series antialiasing function and its application to
  aliasing-free glottal source models for speech and singing synthesis
--------------------------------------------------
Investigating the role of musical genre in human perception of music
  stretching resistance
--------------------------------------------------
Enabling Early Audio Event Detection with Neural Networks
--------------------------------------------------
Multi-Speaker Localization Using Convolutional Neural Network Trained
  with Noise
--------------------------------------------------
Acoustic Reflector Localization: Novel Image Source Reversion and Direct
  Localization Methods
--------------------------------------------------
A modulation property of time-frequency derivatives of filtered phase
  and its application to aperiodicity and fo estimation
--------------------------------------------------
Weakly Supervised PLDA Training
--------------------------------------------------
CNN Architectures f

Or the content of all keys, for the first article, which is what we are looking for:

In [60]:
for key in data:
    print(key, data[key]['0'])
    print('-'*50)

authors ['Kawahara, Hideki', 'Sakakibara, Ken-Ichi', 'Banno, Hideki', 'Morise, Masanori', 'Toda, Tomoki', 'Irino, Toshio']
--------------------------------------------------
contributors []
--------------------------------------------------
datePublished 2017-06-08
--------------------------------------------------
description We formulated and implemented a procedure to generate aliasing-free
excitation source signals. It uses a new antialiasing filter in the continuous
time domain followed by an IIR digital filter for response equalization. We
introduced a cosine-series-based general design procedure for the new
antialiasing function. We applied this new procedure to implement the
antialiased Fujisaki-Ljungqvist model. We also applied it to revise our
previous implementation of the antialiased Fant-Liljencrants model. A
combination of these signals and a lattice implementation of the time varying
vocal tract model provides a reliable and flexible basis to test fo extractors
and sourc

Awesome! So we have to iterate for all the downloaded JSON files, for all keys, for all articles, so we can store it in a database. In this case we will use MongoDB.

## Extraction and Storage - MongoDB

We use MongoDB because it is great for prototyping and quick schema changes, but other storage systems can be probably used. Ultimately, all the data is then indexed in Elasticsearch, and that's where we actually query in production.

You can go [here](https://docs.mongodb.com/manual/installation/) to install MongoDB.

Then, using a MongoDB client, create a database, for example _pub_ and a collection, like _publications_.

In [2]:
from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError
import string

In [3]:
# Default values
mongoDB_IP = '127.0.0.1'
mongoDB_Port = 27017
mongoDB_db = 'pub'

In [4]:
def connect_to_mongo():
    """
    Returns a db connection to the mongo instance
    :return:
    """
    try:
        client = MongoClient(mongoDB_IP, mongoDB_Port)
        db = client[mongoDB_db]
        db.downloads.find_one({'_id': 'test'})
        return db
    except ServerSelectionTimeoutError as e:
        raise Exception("Local MongoDB instance at "+mongoDB_IP+":"+mongoDB_Port+" could not be accessed.") from e

We can test the connection to the database with the following command:

In [5]:
db = connect_to_mongo()

Then we define a function that will take each of the articles from the JSON files downloaded from CORE, and store the information into our collection.

The data obtained from CORE does not separate sections in the text, it is not so important except for the references, since we may use them later on, therefore we also define a function that will manage this.

In [147]:
def arxiv_json_to_mongo(article):
    """
    Creates a new entry in mongodb from the input article
    :return:
    """
    
    mongo_input = {}
    translator = str.maketrans('', '', string.punctuation)
    article_data = article

    mongo_input['title'] = article_data['title']
    mongo_input['authors'] = article_data['authors']
    mongo_input['journal'] = 'arxiv'
    mongo_input['year'] = article_data['year']
    mongo_input['type'] = article_data['subjects']
    mongo_input['content.abstract'] = article_data['description']
    mongo_input['content.keywords'] = article_data['topics']
    mongo_input['content.fulltext'] = article_data['fullText']
    mongo_input['content.references'] = article_data['references']

    mongo_mongo_input = db.publications.update_one(
        {'_id': 'arxiv_' + article_data['id']},
        {'$set': mongo_input},
        upsert=True
    )    
    print('Processed', article_data['id'], 'from JSON')
    
def manage_text_and_refs(article):
    article['fullText'] = article['fullText'].replace('\n', ' ').replace('\r', '')
    if len(article['fullText'].split('References', 1)) == 2:
        article['references'] = article['fullText'].split('References', 1)[1].split('[')
        article['fullText'] = article['fullText'].split('References', 1)[0]
    elif len(article['fullText'].split('REFERENCES', 1)) == 2:
        article['references'] = article['fullText'].split('REFERENCES', 1)[1].split('[')
        article['fullText'] = article['fullText'].split('References', 1)[0]
    else:
        article['fullText'] = article['fullText']
        article['references'] = ''
    return article

Now we have to iterate through all the downloaded JSON files, and all the articles in each file, to process them and introduce them into our database.

In [149]:
for dirpath, dirnames, filenames in os.walk(path):
    # Iterate files...
    for filename in filenames:
        if filename.endswith('.json'):
            json_file = os.path.join(path, filename)
            with open(json_file) as f:
                data = json.load(f)            
                    # Iterate articles in file...
                for article_number in data['title'].keys():
                    article = {}
                    for key in data.keys():
                        article[key] = data[key][str(article_number)]
                        
                    # Process text and references for each article
                    manage_text_and_refs(article)
                    # Store to database
                    arxiv_json_to_mongo(article)

Processed 25036267 from JSON
Processed 2113087 from JSON
Processed 25025591 from JSON
Processed 25015102 from JSON
Processed 24956790 from JSON
Processed 25038409 from JSON
Processed 2246166 from JSON
Processed 73386897 from JSON
Processed 83863968 from JSON
Processed 83867371 from JSON
Processed 94047681 from JSON
Processed 146472595 from JSON
Processed 94063315 from JSON
Processed 29526388 from JSON
Processed 29536914 from JSON
Processed 84090473 from JSON
Processed 84093559 from JSON
Processed 84093915 from JSON
Processed 84093852 from JSON
Processed 84094420 from JSON
Processed 84326602 from JSON
Processed 84328172 from JSON
Processed 84330324 from JSON
Processed 84329889 from JSON
Processed 84326664 from JSON
Processed 84331310 from JSON
Processed 86416235 from JSON
Processed 86419366 from JSON
Processed 86420205 from JSON
Processed 84330603 from JSON
Processed 84330526 from JSON
Processed 84331497 from JSON
Processed 73423992 from JSON
Processed 73417539 from JSON
Processed 73408

Processed 83866609 from JSON
Processed 93946032 from JSON
Processed 93954042 from JSON
Processed 93952440 from JSON
Processed 93874081 from JSON
Processed 93943848 from JSON
Processed 93944339 from JSON
Processed 93944694 from JSON
Processed 93945015 from JSON
Processed 93942643 from JSON
Processed 146474068 from JSON
Processed 146474486 from JSON
Processed 42744287 from JSON
Processed 42715942 from JSON
Processed 42730154 from JSON
Processed 94072332 from JSON
Processed 94071067 from JSON
Processed 94058844 from JSON
Processed 93907363 from JSON
Processed 146475319 from JSON
Processed 84331874 from JSON
Processed 86415709 from JSON
Processed 86415925 from JSON
Processed 86418777 from JSON
Processed 84332442 from JSON
Processed 84330749 from JSON
Processed 84331069 from JSON
Processed 84330624 from JSON
Processed 84330701 from JSON
Processed 84331502 from JSON
Processed 84331595 from JSON
Processed 86415305 from JSON
Processed 86414737 from JSON
Processed 86417914 from JSON
Processed 7

Processed 93944988 from JSON
Processed 93947576 from JSON
Processed 93948719 from JSON
Processed 93951313 from JSON
Processed 93953333 from JSON
Processed 93953566 from JSON
Processed 93936930 from JSON
Processed 93936512 from JSON
Processed 93940234 from JSON
Processed 93869650 from JSON
Processed 93907518 from JSON
Processed 93942191 from JSON
Processed 93945152 from JSON
Processed 94032693 from JSON
Processed 146471726 from JSON
Processed 94024498 from JSON
Processed 129246622 from JSON
Processed 129264059 from JSON
Processed 129225652 from JSON
Processed 141534671 from JSON
Processed 141535333 from JSON
Processed 141534796 from JSON
Processed 93883528 from JSON
Processed 83845686 from JSON
Processed 83848072 from JSON
Processed 83849835 from JSON
Processed 83852895 from JSON
Processed 83859106 from JSON
Processed 83864340 from JSON
Processed 83856208 from JSON
Processed 83858660 from JSON
Processed 73383291 from JSON
Processed 83869383 from JSON
Processed 83841305 from JSON
Process

Processed 86421012 from JSON
Processed 86419000 from JSON
Processed 86417477 from JSON
Processed 86417617 from JSON
Processed 86419961 from JSON
Processed 29518418 from JSON
Processed 83831911 from JSON
Processed 83834080 from JSON
Processed 83833917 from JSON
Processed 83834840 from JSON
Processed 83832976 from JSON
Processed 83833715 from JSON
Processed 146475980 from JSON
Processed 146473775 from JSON
Processed 94062508 from JSON
Processed 94050938 from JSON
Processed 146474949 from JSON
Processed 94070006 from JSON
Processed 94075467 from JSON
Processed 94074604 from JSON
Processed 129357551 from JSON
Processed 93940465 from JSON
Processed 83836132 from JSON
Processed 83835487 from JSON
Processed 83835069 from JSON
Processed 42669201 from JSON
Processed 42706142 from JSON
Processed 42719258 from JSON
Processed 74203909 from JSON
Processed 74251340 from JSON
Processed 42743024 from JSON
Processed 74251078 from JSON
Processed 73960926 from JSON
Processed 73960988 from JSON
Processed 

Processed 42714190 from JSON
Processed 29509195 from JSON
Processed 84094933 from JSON
Processed 73442277 from JSON
Processed 86414492 from JSON
Processed 42641564 from JSON
Processed 42688561 from JSON
Processed 42655868 from JSON
Processed 42742913 from JSON
Processed 74202836 from JSON
Processed 74203978 from JSON
Processed 74203947 from JSON
Processed 42743606 from JSON
Processed 129239884 from JSON
Processed 129243931 from JSON
Processed 73993085 from JSON
Processed 42730879 from JSON
Processed 42739232 from JSON
Processed 42737724 from JSON
Processed 129270085 from JSON
Processed 129269247 from JSON
Processed 129274513 from JSON
Processed 94031902 from JSON
Processed 141529941 from JSON
Processed 141530082 from JSON
Processed 74250873 from JSON
Processed 74251720 from JSON
Processed 78509678 from JSON
Processed 141533043 from JSON
Processed 141532723 from JSON
Processed 78512366 from JSON
Processed 146472859 from JSON
Processed 94046603 from JSON
Processed 93908494 from JSON
Proc

Processed 93937729 from JSON
Processed 93941218 from JSON
Processed 93940664 from JSON
Processed 93945861 from JSON
Processed 93948790 from JSON
Processed 93951109 from JSON
Processed 93953282 from JSON
Processed 83839329 from JSON
Processed 83837633 from JSON
Processed 73351914 from JSON
Processed 83837789 from JSON
Processed 83844873 from JSON
Processed 83842947 from JSON
Processed 83841139 from JSON
Processed 83842497 from JSON
Processed 83842404 from JSON
Processed 83836801 from JSON
Processed 83839932 from JSON
Processed 83842264 from JSON
Processed 83842187 from JSON
Processed 73365814 from JSON
Processed 83856551 from JSON
Processed 83856520 from JSON
Processed 83856892 from JSON
Processed 73370539 from JSON
Processed 83856210 from JSON
Processed 83860925 from JSON
Processed 83863513 from JSON
Processed 83865038 from JSON
Processed 83864996 from JSON
Processed 83857819 from JSON
Processed 73375736 from JSON
Processed 83860336 from JSON
Processed 83861959 from JSON
Processed 8386

Processed 141531624 from JSON
Processed 141531530 from JSON
Processed 141532937 from JSON
Processed 141533086 from JSON
Processed 93905212 from JSON
Processed 93907869 from JSON
Processed 129360972 from JSON
Processed 129362156 from JSON
Processed 129357929 from JSON
Processed 93873661 from JSON
Processed 93958711 from JSON
Processed 94045225 from JSON
Processed 146473617 from JSON
Processed 93943546 from JSON
Processed 93942682 from JSON
Processed 93942435 from JSON
Processed 83846352 from JSON
Processed 83846600 from JSON
Processed 83854595 from JSON
Processed 83855226 from JSON
Processed 83855815 from JSON
Processed 83856243 from JSON
Processed 83858883 from JSON
Processed 83859033 from JSON
Processed 83865566 from JSON
Processed 73378124 from JSON
Processed 83858528 from JSON
Processed 94028358 from JSON
Processed 94067832 from JSON
Processed 83837464 from JSON
Processed 83841821 from JSON
Processed 83844301 from JSON
Processed 73416796 from JSON
Processed 73408630 from JSON
Proces

Processed 129348590 from JSON
Processed 129354890 from JSON
Processed 129357200 from JSON
Processed 93880669 from JSON
Processed 129270134 from JSON
Processed 93912094 from JSON
Processed 93937321 from JSON
Processed 93939341 from JSON
Processed 93937941 from JSON
Processed 93937910 from JSON
Processed 73991512 from JSON
Processed 73956573 from JSON
Processed 73956636 from JSON
Processed 73955727 from JSON
Processed 83835073 from JSON
Processed 83836401 from JSON
Processed 83833318 from JSON
Processed 83842578 from JSON
Processed 83842840 from JSON
Processed 83843641 from JSON
Processed 83834210 from JSON
Processed 83838919 from JSON
Processed 83859314 from JSON
Processed 73442229 from JSON
Processed 73440704 from JSON
Processed 73417537 from JSON
Processed 73409200 from JSON
Processed 73440937 from JSON
Processed 73414576 from JSON
Processed 73413016 from JSON
Processed 73410752 from JSON
Processed 73418989 from JSON
Processed 73419603 from JSON
Processed 86415498 from JSON
Processed 

Processed 83832006 from JSON
Processed 83832253 from JSON
Processed 83861093 from JSON
Processed 83863347 from JSON
Processed 83862438 from JSON
Processed 83851467 from JSON
Processed 83853393 from JSON
Processed 73398543 from JSON
Processed 146472503 from JSON
Processed 94041170 from JSON
Processed 94036169 from JSON
Processed 94043330 from JSON
Processed 146472178 from JSON
Processed 93939249 from JSON
Processed 141532163 from JSON
Processed 141537795 from JSON
Processed 141538147 from JSON
Processed 141530314 from JSON
Processed 141537081 from JSON
Processed 141538457 from JSON
Processed 141538394 from JSON
Processed 93907431 from JSON
Processed 93909158 from JSON
Processed 93957892 from JSON
Processed 93959247 from JSON
Processed 93959184 from JSON
Processed 93947103 from JSON
Processed 93947398 from JSON
Processed 93949262 from JSON
Processed 93949154 from JSON
Processed 93949604 from JSON
Processed 93948308 from JSON
Processed 86414944 from JSON
Processed 86413725 from JSON
Proce

Processed 78509210 from JSON
Processed 78509443 from JSON
Processed 78510034 from JSON
Processed 78512427 from JSON
Processed 94068624 from JSON
Processed 94028983 from JSON
Processed 94056713 from JSON
Processed 94060961 from JSON
Processed 93912555 from JSON
Processed 129357934 from JSON
Processed 129361315 from JSON
Processed 129360204 from JSON
Processed 129363304 from JSON
Processed 129269941 from JSON
Processed 93885086 from JSON
Processed 93936904 from JSON
Processed 93942956 from JSON
Processed 129351316 from JSON
Processed 129349861 from JSON
Processed 93913635 from JSON
Processed 93937628 from JSON
Processed 129358285 from JSON
Processed 129362488 from JSON
Processed 129362178 from JSON
Processed 129351424 from JSON
Processed 93945823 from JSON
Processed 93945638 from JSON
Processed 93944077 from JSON
Processed 93947331 from JSON
Processed 93946174 from JSON
Processed 93949180 from JSON
Processed 93950377 from JSON
Processed 93952412 from JSON
Processed 94024323 from JSON
Pro

Processed 9259374 from JSON
Processed 2418176 from JSON
Processed 24952807 from JSON
Processed 25041752 from JSON
Processed 25054883 from JSON
Processed 2064976 from JSON
Processed 25036302 from JSON
Processed 25015255 from JSON
Processed 25015890 from JSON
Processed 24978117 from JSON
Processed 24961009 from JSON
Processed 25063031 from JSON
Processed 25060134 from JSON
Processed 24960871 from JSON
Processed 25004536 from JSON
Processed 24975044 from JSON
Processed 25058543 from JSON
Processed 25015215 from JSON
Processed 2192099 from JSON
Processed 25050201 from JSON
Processed 25043984 from JSON
Processed 24947497 from JSON
Processed 25030700 from JSON
Processed 24960513 from JSON
Processed 2246166 from JSON
Processed 24965000 from JSON
Processed 146472672 from JSON
Processed 94066275 from JSON
Processed 84090954 from JSON
Processed 86419582 from JSON
Processed 86418022 from JSON
Processed 86420919 from JSON
Processed 84329360 from JSON
Processed 73407275 from JSON
Processed 73406878

Processed 73988612 from JSON
Processed 73988782 from JSON
Processed 73987842 from JSON
Processed 42701158 from JSON
Processed 73359320 from JSON
Processed 83837177 from JSON
Processed 83865527 from JSON
Processed 93955960 from JSON
Processed 94075803 from JSON
Processed 83847068 from JSON
Processed 93888370 from JSON
Processed 42739003 from JSON
Processed 93867540 from JSON
Processed 93875891 from JSON
Processed 129350143 from JSON
Processed 93874843 from JSON
Processed 83858768 from JSON
Processed 73349182 from JSON
Processed 73395899 from JSON
Processed 83857671 from JSON
Processed 83860794 from JSON
Processed 83845746 from JSON
Processed 73422623 from JSON
Processed 73400698 from JSON
Processed 73407079 from JSON
Processed 84093827 from JSON
Processed 29554804 from JSON
Processed 73409828 from JSON
Processed 84329847 from JSON
Processed 73954126 from JSON
Processed 73957459 from JSON
Processed 73989678 from JSON
Processed 42644498 from JSON
Processed 84328192 from JSON
Processed 742

Processed 141532918 from JSON
Processed 93897084 from JSON
Processed 83847336 from JSON
Processed 83859944 from JSON
Processed 84094692 from JSON
Processed 84093303 from JSON
Processed 83848897 from JSON
Processed 83848727 from JSON
Processed 83849218 from JSON
Processed 83852278 from JSON
Processed 83850272 from JSON
Processed 83842604 from JSON
Processed 73354579 from JSON
Processed 73377367 from JSON
Processed 83869728 from JSON
Processed 83863650 from JSON
Processed 83867397 from JSON
Processed 73351431 from JSON
Processed 129354096 from JSON
Processed 93946309 from JSON
Processed 93948590 from JSON
Processed 93953828 from JSON
Processed 93889927 from JSON
Processed 73423893 from JSON
Processed 29559494 from JSON
Processed 73406964 from JSON
Processed 73407518 from JSON
Processed 73416608 from JSON
Processed 29506180 from JSON
Processed 29530381 from JSON
Processed 73404261 from JSON
Processed 83856395 from JSON
Processed 73348729 from JSON
Processed 73381445 from JSON
Processed 73

Processed 73378048 from JSON
Processed 83856415 from JSON
Processed 73371155 from JSON
Processed 83838033 from JSON
Processed 83850556 from JSON
Processed 83850897 from JSON
Processed 84333065 from JSON
Processed 29546787 from JSON
Processed 73953566 from JSON
Processed 42649812 from JSON
Processed 42684628 from JSON
Processed 84326539 from JSON
Processed 78509696 from JSON
Processed 78511413 from JSON
Processed 78512012 from JSON
Processed 73988683 from JSON
Processed 42691248 from JSON
Processed 141538114 from JSON
Processed 78512524 from JSON
Processed 78510535 from JSON
Processed 141534042 from JSON
Processed 73440284 from JSON
Processed 73404437 from JSON
Processed 73404266 from JSON
Processed 93956396 from JSON
Processed 146476033 from JSON
Processed 73387261 from JSON
Processed 73382127 from JSON
Processed 146475262 from JSON
Processed 83858203 from JSON
Processed 83863175 from JSON
Processed 83852344 from JSON
Processed 83862280 from JSON
Processed 83866819 from JSON
Processed 

Processed 73957710 from JSON
Processed 73987971 from JSON
Processed 73989092 from JSON
Processed 73988309 from JSON
Processed 73991321 from JSON
Processed 84331754 from JSON
Processed 84326798 from JSON
Processed 84330333 from JSON
Processed 86420861 from JSON
Processed 86419559 from JSON
Processed 86419698 from JSON
Processed 86417243 from JSON
Processed 86417801 from JSON
Processed 84092611 from JSON
Processed 84091113 from JSON
Processed 42695980 from JSON
Processed 73955473 from JSON
Processed 73410050 from JSON
Processed 73408334 from JSON
Processed 73408428 from JSON
Processed 73403618 from JSON
Processed 146471816 from JSON
Processed 146476239 from JSON
Processed 146474093 from JSON
Processed 73356677 from JSON
Processed 83841183 from JSON
Processed 83866922 from JSON
Processed 93952994 from JSON
Processed 129356442 from JSON
Processed 129356149 from JSON
Processed 93944845 from JSON
Processed 129227515 from JSON
Processed 129351074 from JSON
Processed 129350259 from JSON
Proces

Processed 83854780 from JSON
Processed 83853516 from JSON
Processed 83861805 from JSON
Processed 83853965 from JSON
Processed 83856508 from JSON
Processed 94030555 from JSON
Processed 94026617 from JSON
Processed 73356618 from JSON
Processed 83843005 from JSON
Processed 29544999 from JSON
Processed 73440548 from JSON
Processed 73408135 from JSON
Processed 84331727 from JSON
Processed 84329188 from JSON
Processed 73990555 from JSON
Processed 73992018 from JSON
Processed 73993112 from JSON
Processed 73993129 from JSON
Processed 73988341 from JSON
Processed 84326800 from JSON
Processed 84327911 from JSON
Processed 86421860 from JSON
Processed 141536946 from JSON
Processed 74251345 from JSON
Processed 42742259 from JSON
Processed 74204248 from JSON
Processed 42720488 from JSON
Processed 141536155 from JSON
Processed 141537730 from JSON
Processed 86416967 from JSON
Processed 86415312 from JSON
Processed 84094186 from JSON
Processed 83860803 from JSON
Processed 83860245 from JSON
Processed 8

Processed 141535173 from JSON
Processed 73440475 from JSON
Processed 73417462 from JSON
Processed 84090490 from JSON
Processed 84093514 from JSON
Processed 84092434 from JSON
Processed 73407850 from JSON
Processed 29552379 from JSON
Processed 74203280 from JSON
Processed 78511950 from JSON
Processed 78511208 from JSON
Processed 78509830 from JSON
Processed 94068624 from JSON
Processed 146472236 from JSON
Processed 146472281 from JSON
Processed 129358519 from JSON
Processed 129361500 from JSON
Processed 93885086 from JSON
Processed 93939554 from JSON
Processed 93938582 from JSON
Processed 129285606 from JSON
Processed 129357732 from JSON
Processed 129361283 from JSON
Processed 129360343 from JSON
Processed 129362053 from JSON
Processed 129349289 from JSON
Processed 129349070 from JSON
Processed 129352984 from JSON
Processed 93944091 from JSON
Processed 93945544 from JSON
Processed 93947780 from JSON
Processed 93953802 from JSON
Processed 93956591 from JSON
Processed 94026280 from JSON
P

Processed 141535595 from JSON
Processed 141537476 from JSON
Processed 29518058 from JSON
Processed 84091358 from JSON
Processed 83854680 from JSON
Processed 83856314 from JSON
Processed 83858783 from JSON
Processed 83837674 from JSON
Processed 73356626 from JSON
Processed 83837751 from JSON
Processed 83862520 from JSON
Processed 83831443 from JSON
Processed 83851162 from JSON
Processed 83845761 from JSON
Processed 83847628 from JSON
Processed 83862614 from JSON
Processed 83848087 from JSON
Processed 83839199 from JSON
Processed 84326701 from JSON
Processed 141529760 from JSON
Processed 141530818 from JSON
Processed 141533437 from JSON
Processed 42654204 from JSON
Processed 42724228 from JSON
Processed 86421558 from JSON
Processed 78509361 from JSON
Processed 86420788 from JSON
Processed 84331813 from JSON
Processed 86419731 from JSON
Processed 86420199 from JSON
Processed 42709706 from JSON
Processed 141537554 from JSON
Processed 141538014 from JSON
Processed 141533514 from JSON
Proces

Processed 141537526 from JSON
Processed 129209692 from JSON
Processed 141536600 from JSON
Processed 141536554 from JSON
Processed 93896872 from JSON
Processed 93889151 from JSON
Processed 93947062 from JSON
Processed 93948560 from JSON
Processed 93958280 from JSON
Processed 93954459 from JSON
Processed 146471849 from JSON
Processed 94078752 from JSON
Processed 83858446 from JSON
Processed 83837755 from JSON
Processed 83839869 from JSON
Processed 83849433 from JSON
Processed 83863402 from JSON
Processed 83865516 from JSON
Processed 84093860 from JSON
Processed 83847755 from JSON
Processed 83848107 from JSON
Processed 83852278 from JSON
Processed 83833280 from JSON
Processed 83836645 from JSON
Processed 83845812 from JSON
Processed 83847522 from JSON
Processed 83852494 from JSON
Processed 83842185 from JSON
Processed 83864684 from JSON
Processed 83869571 from JSON
Processed 83835565 from JSON
Processed 83831431 from JSON
Processed 83868198 from JSON
Processed 73399260 from JSON
Processed

Processed 83833433 from JSON
Processed 84093874 from JSON
Processed 146476396 from JSON
Processed 146475223 from JSON
Processed 129361954 from JSON
Processed 93909848 from JSON
Processed 141533623 from JSON
Processed 83848490 from JSON
Processed 83850053 from JSON
Processed 83841305 from JSON
Processed 73990349 from JSON
Processed 84332367 from JSON
Processed 84326485 from JSON
Processed 78508297 from JSON
Processed 78512407 from JSON
Processed 86418977 from JSON
Processed 74202687 from JSON
Processed 86414021 from JSON
Processed 83841242 from JSON
Processed 83841165 from JSON
Processed 83854202 from JSON
Processed 83861271 from JSON
Processed 83861769 from JSON
Processed 93951919 from JSON
Processed 129348229 from JSON
Processed 129354079 from JSON
Processed 129355887 from JSON
Processed 83851925 from JSON
Processed 83856393 from JSON
Processed 83857288 from JSON
Processed 83857257 from JSON
Processed 83840411 from JSON
Processed 83869043 from JSON
Processed 83831972 from JSON
Process

Processed 93936752 from JSON
Processed 83837777 from JSON
Processed 25050344 from JSON
Processed 25056701 from JSON
Processed 25041752 from JSON
Processed 25048645 from JSON
Processed 24984510 from JSON
Processed 24938942 from JSON
Processed 24953401 from JSON
Processed 25013547 from JSON
Processed 24964557 from JSON
Processed 25015102 from JSON
Processed 24992766 from JSON
Processed 25029857 from JSON
Processed 2189676 from JSON
Processed 2184911 from JSON
Processed 25062049 from JSON
Processed 24956790 from JSON
Processed 25048018 from JSON
Processed 25023845 from JSON
Processed 25054674 from JSON
Processed 25030049 from JSON
Processed 25043675 from JSON
Processed 25034492 from JSON
Processed 24973903 from JSON
Processed 25030003 from JSON
Processed 25041490 from JSON
Processed 2246166 from JSON
Processed 24970122 from JSON
Processed 83864257 from JSON
Processed 83864163 from JSON
Processed 94063795 from JSON
Processed 84327713 from JSON
Processed 84328778 from JSON
Processed 8641557

Processed 83849159 from JSON
Processed 83862790 from JSON
Processed 83833114 from JSON
Processed 73440113 from JSON
Processed 84094806 from JSON
Processed 29568557 from JSON
Processed 84090486 from JSON
Processed 129277339 from JSON
Processed 93945984 from JSON
Processed 93946785 from JSON
Processed 93909207 from JSON
Processed 93943173 from JSON
Processed 93947447 from JSON
Processed 42738304 from JSON
Processed 42740068 from JSON
Processed 42744495 from JSON
Processed 78508492 from JSON
Processed 78510923 from JSON
Processed 78510738 from JSON
Processed 78510985 from JSON
Processed 93941898 from JSON
Processed 93957833 from JSON
Processed 83850555 from JSON
Processed 83855877 from JSON
Processed 73379482 from JSON
Processed 83865458 from JSON
Processed 73422832 from JSON
Processed 73423462 from JSON
Processed 84330849 from JSON
Processed 84332667 from JSON
Processed 84331479 from JSON
Processed 73990928 from JSON
Processed 73958591 from JSON
Processed 73989298 from JSON
Processed 739

Processed 73392037 from JSON
Processed 83849459 from JSON
Processed 83854958 from JSON
Processed 83854554 from JSON
Processed 83853878 from JSON
Processed 83859333 from JSON
Processed 84332626 from JSON
Processed 84328733 from JSON
Processed 86417949 from JSON
Processed 86418283 from JSON
Processed 42686031 from JSON
Processed 73961301 from JSON
Processed 74252028 from JSON
Processed 78508286 from JSON
Processed 141532244 from JSON
Processed 141530673 from JSON
Processed 141536051 from JSON
Processed 141535560 from JSON
Processed 141537751 from JSON
Processed 74203949 from JSON
Processed 73419684 from JSON
Processed 84091354 from JSON
Processed 84092061 from JSON
Processed 73402792 from JSON
Processed 29499418 from JSON
Processed 78511424 from JSON
Processed 78510469 from JSON
Processed 78507670 from JSON
Processed 78508673 from JSON
Processed 78512100 from JSON
Processed 78508907 from JSON
Processed 129209900 from JSON
Processed 93903698 from JSON
Processed 94054926 from JSON
Processe

Processed 129361410 from JSON
Processed 129360222 from JSON
Processed 129349012 from JSON
Processed 93915310 from JSON
Processed 129250871 from JSON
Processed 93870381 from JSON
Processed 93942286 from JSON
Processed 94025003 from JSON
Processed 73989029 from JSON
Processed 73990470 from JSON
Processed 73992863 from JSON
Processed 73992630 from JSON
Processed 73993323 from JSON
Processed 129244432 from JSON
Processed 129243307 from JSON
Processed 93937376 from JSON
Processed 93941081 from JSON
Processed 93945324 from JSON
Processed 129283769 from JSON
Processed 129278842 from JSON
Processed 73378227 from JSON
Processed 73392289 from JSON
Processed 73368381 from JSON
Processed 83860068 from JSON
Processed 94057020 from JSON
Processed 94060965 from JSON
Processed 83857301 from JSON
Processed 73357846 from JSON
Processed 83859307 from JSON
Processed 83850595 from JSON
Processed 73441404 from JSON
Processed 73410916 from JSON
Processed 29565790 from JSON
Processed 84092857 from JSON
Proces

Processed 83855223 from JSON
Processed 83849512 from JSON
Processed 83852743 from JSON
Processed 83855038 from JSON
Processed 83856598 from JSON
Processed 94078538 from JSON
Processed 94055175 from JSON
Processed 73385657 from JSON
Processed 83857724 from JSON
Processed 83869325 from JSON
Processed 83869495 from JSON
Processed 83867893 from JSON
Processed 83833421 from JSON
Processed 83839450 from JSON
Processed 83839931 from JSON
Processed 73354905 from JSON
Processed 84095109 from JSON
Processed 29528185 from JSON
Processed 29528060 from JSON
Processed 84090591 from JSON
Processed 84094649 from JSON
Processed 73990862 from JSON
Processed 73991632 from JSON
Processed 84329945 from JSON
Processed 73961262 from JSON
Processed 42675750 from JSON
Processed 84328803 from JSON
Processed 86421786 from JSON
Processed 84327879 from JSON
Processed 84333047 from JSON
Processed 42723833 from JSON
Processed 74204074 from JSON
Processed 74204120 from JSON
Processed 86415594 from JSON
Processed 8433

Processed 74202852 from JSON
Processed 78510096 from JSON
Processed 93915732 from JSON
Processed 129360406 from JSON
Processed 93945638 from JSON
Processed 93953958 from JSON
Processed 93952412 from JSON
Processed 93957068 from JSON
Processed 93948288 from JSON
Processed 94067683 from JSON
Processed 83838117 from JSON
Processed 83852566 from JSON
Processed 83847687 from JSON
Processed 73987670 from JSON
Processed 42685666 from JSON
Processed 42737929 from JSON
Processed 141535826 from JSON
Processed 42709067 from JSON
Processed 129214547 from JSON
Processed 78508796 from JSON
Processed 141531539 from JSON
Processed 141530707 from JSON
Processed 141530815 from JSON
Processed 141535719 from JSON
Processed 129351689 from JSON
Processed 93948225 from JSON
Processed 73390778 from JSON
Processed 86415307 from JSON
Processed 86415585 from JSON
Processed 93907613 from JSON
Processed 83860128 from JSON
Processed 83838396 from JSON
Processed 83863754 from JSON
Processed 83858768 from JSON
Proces

Processed 86414323 from JSON
Processed 73408557 from JSON
Processed 73409093 from JSON
Processed 73408773 from JSON
Processed 42747087 from JSON
Processed 42748600 from JSON
Processed 42752834 from JSON
Processed 78509679 from JSON
Processed 78507673 from JSON
Processed 78507488 from JSON
Processed 141538256 from JSON
Processed 42681963 from JSON
Processed 93939510 from JSON
Processed 93910508 from JSON
Processed 93943946 from JSON
Processed 93947070 from JSON
Processed 93944918 from JSON
Processed 129233964 from JSON
Processed 73989191 from JSON
Processed 73987652 from JSON
Processed 93895212 from JSON
Processed 129265378 from JSON
Processed 129268182 from JSON
Processed 129268834 from JSON
Processed 129269527 from JSON
Processed 129356230 from JSON
Processed 93942741 from JSON
Processed 83869517 from JSON
Processed 83836249 from JSON
Processed 83836650 from JSON
Processed 73349162 from JSON
Processed 83842316 from JSON
Processed 83842967 from JSON
Processed 83839147 from JSON
Process

Processed 94025848 from JSON
Processed 74251795 from JSON
Processed 74251252 from JSON
Processed 146473522 from JSON
Processed 146472257 from JSON
Processed 94056134 from JSON
Processed 129354224 from JSON
Processed 129357494 from JSON
Processed 129363237 from JSON
Processed 141532659 from JSON
Processed 93907882 from JSON
Processed 93915955 from JSON
Processed 93947042 from JSON
Processed 93946830 from JSON
Processed 93943824 from JSON
Processed 129361680 from JSON
Processed 129224567 from JSON
Processed 129355676 from JSON
Processed 93886529 from JSON
Processed 129203131 from JSON
Processed 129238032 from JSON
Processed 78510782 from JSON
Processed 73387570 from JSON
Processed 73357496 from JSON
Processed 83846492 from JSON
Processed 83866522 from JSON
Processed 83869172 from JSON
Processed 73378699 from JSON
Processed 73355678 from JSON
Processed 83841358 from JSON
Processed 83833425 from JSON
Processed 83836277 from JSON
Processed 73353766 from JSON
Processed 83845615 from JSON
Pro

Processed 83860964 from JSON
Processed 83859582 from JSON
Processed 83861563 from JSON
Processed 83862270 from JSON
Processed 83855807 from JSON
Processed 83851160 from JSON
Processed 83849848 from JSON
Processed 83863862 from JSON
Processed 73360346 from JSON
Processed 83849088 from JSON
Processed 83847981 from JSON
Processed 93877417 from JSON
Processed 83866499 from JSON
Processed 83832849 from JSON
Processed 83838241 from JSON
Processed 83839167 from JSON
Processed 83853275 from JSON
Processed 73393242 from JSON
Processed 73361148 from JSON
Processed 83847160 from JSON
Processed 73420526 from JSON
Processed 73407389 from JSON
Processed 73408623 from JSON
Processed 84092329 from JSON
Processed 84092855 from JSON
Processed 73420728 from JSON
Processed 84094969 from JSON
Processed 73953774 from JSON
Processed 84330096 from JSON
Processed 84327283 from JSON
Processed 42680346 from JSON
Processed 73992924 from JSON
Processed 73987595 from JSON
Processed 84333196 from JSON
Processed 8432

Processed 83833450 from JSON
Processed 94034127 from JSON
Processed 94075171 from JSON
Processed 129353838 from JSON
Processed 93940465 from JSON
Processed 83840614 from JSON
Processed 74251340 from JSON
Processed 42698180 from JSON
Processed 73958868 from JSON
Processed 84330224 from JSON
Processed 84331955 from JSON
Processed 84331272 from JSON
Processed 73990846 from JSON
Processed 73993061 from JSON
Processed 42728177 from JSON
Processed 42736156 from JSON
Processed 93908285 from JSON
Processed 93907530 from JSON
Processed 93912161 from JSON
Processed 93867227 from JSON
Processed 78508000 from JSON
Processed 93946665 from JSON
Processed 83860612 from JSON
Processed 83852880 from JSON
Processed 83853061 from JSON
Processed 83852789 from JSON
Processed 83855500 from JSON
Processed 83847817 from JSON
Processed 83850365 from JSON
Processed 83855719 from JSON
Processed 83842711 from JSON
Processed 129233569 from JSON
Processed 83865840 from JSON
Processed 83870426 from JSON
Processed 83

Processed 94047681 from JSON
Processed 84094468 from JSON
Processed 84093340 from JSON
Processed 86421626 from JSON
Processed 84326431 from JSON
Processed 86415559 from JSON
Processed 84329252 from JSON
Processed 73414888 from JSON
Processed 73405938 from JSON
Processed 78507488 from JSON
Processed 141531952 from JSON
Processed 141532753 from JSON
Processed 93891203 from JSON
Processed 93936890 from JSON
Processed 93887776 from JSON
Processed 93942772 from JSON
Processed 42638983 from JSON
Processed 83839161 from JSON
Processed 73384954 from JSON
Processed 73394009 from JSON
Processed 83857668 from JSON
Processed 83865306 from JSON
Processed 83846248 from JSON
Processed 83850977 from JSON
Processed 73377839 from JSON
Processed 83864132 from JSON
Processed 94078807 from JSON
Processed 129358344 from JSON
Processed 129361018 from JSON
Processed 129243468 from JSON
Processed 73961238 from JSON
Processed 83838221 from JSON
Processed 83847342 from JSON
Processed 86419132 from JSON
Processed

Processed 86420799 from JSON
Processed 129240605 from JSON
Processed 141531716 from JSON
Processed 93900853 from JSON
Processed 129356830 from JSON
Processed 93949731 from JSON
Processed 93948310 from JSON
Processed 83835194 from JSON
Processed 146474357 from JSON
Processed 73374050 from JSON
Processed 93937091 from JSON
Processed 146471442 from JSON
Processed 94041308 from JSON
Processed 94049681 from JSON
Processed 146473431 from JSON
Processed 83867553 from JSON
Processed 73384129 from JSON
Processed 83848680 from JSON
Processed 78509187 from JSON
Processed 141538825 from JSON
Processed 73441981 from JSON
Processed 84093274 from JSON
Processed 93955721 from JSON
Processed 83831680 from JSON
Processed 93888084 from JSON
Processed 129235802 from JSON
Processed 86422136 from JSON
Processed 84331601 from JSON
Processed 86415268 from JSON
Processed 146471597 from JSON
Processed 141534242 from JSON
Processed 141538453 from JSON
Processed 141532549 from JSON
Processed 83844191 from JSON
Pr

Done! We have a bunch of files in our database! (**8848** articles in this demo)

## Indexing - Elasticsearch 

We use Elasticsearch for the quick and very flexible (elastic?) queries across the full text of articles, so we have to index all the content from the database for this.

Once again, we have to [install](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html) it and run the service...yay!

Once it's running we can connect and check status:

In [6]:
import pymongo
import elasticsearch
import requests
import nltk
from elasticsearch import helpers
import re
import string

In [7]:
client = pymongo.MongoClient('localhost:' + str(mongoDB_Port))
publications_collection = client.pub.publications
es = elasticsearch.Elasticsearch([{'host': 'localhost', 'port': 9200}],
                                 timeout=30, max_retries=10, retry_on_timeout=True)

In [7]:
es.cluster.health()

{'active_primary_shards': 10,
 'active_shards': 10,
 'active_shards_percent_as_number': 50.0,
 'cluster_name': 'elasticsearch',
 'delayed_unassigned_shards': 0,
 'initializing_shards': 0,
 'number_of_data_nodes': 1,
 'number_of_in_flight_fetch': 0,
 'number_of_nodes': 1,
 'number_of_pending_tasks': 0,
 'relocating_shards': 0,
 'status': 'yellow',
 'task_max_waiting_in_queue_millis': 0,
 'timed_out': False,
 'unassigned_shards': 10}

### Full text indexing

First we index the full content of every article in the database.

(This step corresponds to `collection_extraction -> index_papersfulltext.py` in the code)

In [8]:
def extract_metadata(documents):
    list_of_docs = []
    for i, r in enumerate(documents):
        extracted = {
                "_id": "",
                "title": "",
                "publication": "",
                "year": "",
                "content": "",
                "abstract": "",
                "authors": [],
                "references": []}
        extracted['_id'] = r['_id']
        extracted['title'] = r['title']
        extracted['publication'] = r['journal']
        extracted['year'] = r['year']
        extracted['content'] = r['content']['fulltext']
        extracted['abstract'] = r['content']['abstract']
        extracted['authors'] = r['authors']
        extracted['references'] = r['content']['references']
        list_of_docs.append(extracted)
    return list_of_docs

In [26]:
filter_publications = ['arxiv'] # Here we could also put PubMed or other sources

extracted_publications = []
for publication in filter_publications:
    query ={'$and': [{'journal': publication}, {'content.fulltext': {'$exists': True}}]}                   
    results = publications_collection.find(query)
    extracted_publications.append(extract_metadata(results))

In [166]:
for publication in extracted_publications:
    actions = []
    for article in publication:
        authors = []
        if len(article['authors']) > 0:
            if type(article['authors'][0]) == list:
                try:
                    for name in article['authors']:
                        authors.append(', '.join([name[1], name[0]]))
                    authors = list(set(authors))
                except:
                    pass
            else:
                authors = article['authors']
        actions.append({
            "_index": "ir", 
            "_type": "publications",  
            "_id": article['_id'],
            "_source": {
                "title": article["title"],
                "journal": article['publication'],
                "year": str(article['year']),
                "content": article["content"],
                "abstract": article["abstract"],
                "authors": authors,
                "references": article["references"]
            }
        })
    if len(actions) == 0:
        continue
    res = helpers.bulk(es, actions)
    print('Done with', res, 'articles added to index')

Done with (8848, []) articles added to index


We can look for anything in the text, like this:

In [180]:
res = es.search(index = "ir", body = {"query": {"match": {"title" : "netflix"}}}, size = 10)

print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit['_id'], hit['_source']['title'])
    print(hit['_id'], hit['_source']['abstract'])
    print('-'*50)

Got 1 Hits:
arxiv_84094279 Re-Evaluating the Netflix Prize - Human Uncertainty and its Impact on
  Reliability
arxiv_84094279 In this paper, we examine the statistical soundness of comparative
assessments within the field of recommender systems in terms of reliability and
human uncertainty. From a controlled experiment, we get the insight that users
provide different ratings on same items when repeatedly asked. This volatility
of user ratings justifies the assumption of using probability densities instead
of single rating scores. As a consequence, the well-known accuracy metrics
(e.g. MAE, MSE, RMSE) yield a density themselves that emerges from convolution
of all rating densities. When two different systems produce different RMSE
distributions with significant intersection, then there exists a probability of
error for each possible ranking. As an application, we examine possible ranking
errors of the Netflix Prize. We are able to show that all top rankings are more
or less subject to h

### Sentence Indexing 
For the sentence expansion step of our process, we need to have sentences indexed. This is because we need to find similar sentences as well as their surrounding sentences, for context. 

In addition, we create a file with all the text, of all our articles, and we will use it later for the training of embedding models.

(This step corresponds to `collection_extraction -> index_twosent.py` in the code)

In [8]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In [203]:
for publication in extracted_publications:
    for article in publication:
        actions = []
        cleaned = []
        dataset_sent = []
        other_sent = []

        lines = (sent_detector.tokenize(article['content'].strip()))
        
        # This will be used for the training of word2vec and doc2vec
        # You need to create the data folder beforehand
        with open('data/full_text_corpus.txt', 'a', encoding='utf-8') as f:
            for line in lines:
                f.write(line)
        f.close()
        
        if len(lines) < 3:
            continue

        for i in range(len(lines)):
            words = nltk.word_tokenize(lines[i])
            word_lengths = [len(x) for x in words]
            average_word_length = sum(word_lengths) / len(word_lengths)
            if average_word_length < 4:
                continue

            two_sentences = ''
            try:
                two_sentences = lines[i] + ' ' + lines[i - 1]
            except:
                two_sentences = lines[i] + ' ' + lines[i + 1]

            dataset_sent.append(two_sentences)

        for num, added_lines in enumerate(dataset_sent):
            actions.append({
                "_index": "twosent",
                "_type": "twosentnorules",
                "_id": article['_id'] + str(num),
                "_source": {
                    "title": article['title'],
                    "content.chapter.sentpositive": added_lines,
                    "paper_id": article['_id']
                }})

        if len(actions) == 0:
            continue
        res = helpers.bulk(es, actions)
    print('Done')

Done


In [193]:
res = es.search(index = "twosent", body = {"query": {"match": {"content.chapter.sentpositive" : "netflix"}}}, size = 10)

print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit['_id'], hit['_source']['title'])
    print('-'*50)

Got 174 Hits:
arxiv_14153268986 Learning to Rank based on Analogical Reasoning
--------------------------------------------------
arxiv_146472211389 One-sided Differential Privacy
--------------------------------------------------
arxiv_129348860158 Differentially Private Query Learning: from Data Publishing to Model
  Publishing
--------------------------------------------------
arxiv_14153268985 Learning to Rank based on Analogical Reasoning
--------------------------------------------------
arxiv_129348860154 Differentially Private Query Learning: from Data Publishing to Model
  Publishing
--------------------------------------------------
arxiv_129348860153 Differentially Private Query Learning: from Data Publishing to Model
  Publishing
--------------------------------------------------
arxiv_93942601272 Interacting Attention-gated Recurrent Networks for Recommendation
--------------------------------------------------
arxiv_129348860157 Differentially Private Query Learning: from

### Doc2vec Indexing

Once again, we index the sentences from the full text file created before, which will be used for the Sentence Expansion using doc2vec.

(This step corresponds to `collection_extraction -> index_doc2vec.py` in the code)

In [9]:
file = open('data/full_text_corpus.txt', 'r', encoding='utf-8')
text = file.read()
file.close()
sentences = nltk.tokenize.sent_tokenize(text)
print('Sentences ready')
count = 0
docLabels = []
actions = []

for i, sent in enumerate(sentences):
    try:
        neighbors = sentences[i + 1]
        neighbor_count = count + 1
    except:
        neighbors = sentences[i - 1]
        neighbor_count = count - 1

    docLabels.append(count)
    actions.append({
        "_index": "devtwosentnew",
        "_type": "devtwosentnorulesnew",
        "_id": count,
        "_source": {
            "content.chapter.sentpositive": sent,
            "content.chapter.sentnegtive": neighbors,
            "neighborcount": neighbor_count
        }})
    count = count + 1

print(len(sentences))
print(len(docLabels))
res = helpers.bulk(es, actions)
print(res)

Sentences ready
248840
248840
(248840, [])


## Training Embedding Models (word2vec & doc2vec)

For the Term and Sentence expansion, we need to find similar words or sentences, and we use word2vec and doc2vec respectively. We have to generate a document with all the words in our corpus and one with all the sentences, both will be stored in the data folder. While this is not a full implementation, we have around 4 million sentences, so it is a considerable example, and processing can take some time.

(This step corresponds to `data_preparation -> prepare_embedding_data.py` in the code)

In [27]:
papers_text = []
sentence_text = []
translator = str.maketrans('', '', string.punctuation)

for publication in extracted_publications:
    for article in publication:
        query = {"query":
                     {"match":
                          {"_id":
                               {"query": article['_id'],
                                "operator": "and"
                                }
                           }
                      }
                 }

        results = es.search(index="ir", body=query, size=200)
        for doc in results['hits']['hits']:
            fulltext = doc["_source"]["content"]
            fulltext = re.sub("[\[].*?[\]]", "", fulltext)
            cleaned_text = fulltext.translate(translator)
            papers_text.append(cleaned_text.lower())
            sentence_text.append(fulltext)
    print('Done', '-' * 100)

tokens = " ".join(papers_text)

f = open("data/dataWord2vec.txt", "w", encoding='utf-8')
f.write(tokens)
f.close()

f = open("data/dataDoc2vec.txt", "w", encoding='utf-8')
for line in sentence_text:
    f.write(line)
f.close()

Done ----------------------------------------------------------------------------------------------------


Before training, verify that you have the optimized version of Gensim using Cython, for better performance.

In [None]:
import gensim
assert gensim.models.doc2vec.FAST_VERSION > -1

Now that we have both documents with all the text, we can use them for the training. 

We will use a bigram model for word2vec, since it's common that entities are represented as two words. You can run the command from your Python console or directly here, using the magic `%run` syntax. The structure of this command is as follows:

`%run script_to_run training_data model_output word_vector_output`

This was run in a normal Lenovo machine with 8GB and i7 processor. As you can see by the timestamps below, it might take a while...

In [52]:
%run data_preparation/train_word2vec.py data/dataDoc2vec.txt data/modelword2vecbigram.model data/modelword2vecbigram.vec

2018-05-14 17:07:47,086: INFO: running data_preparation/train_word2vec.py data/dataDoc2vec.txt data/modelword2vecbigram.model data/modelword2vecbigram.vec
2018-05-14 17:28:19,989: INFO: collecting all words and their counts
2018-05-14 17:28:20,025: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2018-05-14 17:28:20,610: INFO: PROGRESS: at sentence #10000, processed 229355 words and 118344 word types
2018-05-14 17:28:21,168: INFO: PROGRESS: at sentence #20000, processed 454354 words and 192778 word types
2018-05-14 17:28:21,542: INFO: PROGRESS: at sentence #30000, processed 623029 words and 227709 word types
2018-05-14 17:28:22,066: INFO: PROGRESS: at sentence #40000, processed 841989 words and 295848 word types
2018-05-14 17:28:22,678: INFO: PROGRESS: at sentence #50000, processed 1084244 words and 360848 word types
2018-05-14 17:28:23,274: INFO: PROGRESS: at sentence #60000, processed 1344739 words and 431377 word types
2018-05-14 17:28:23,827: INFO: PROGRESS: at se

2018-05-14 17:29:07,734: INFO: PROGRESS: at sentence #740000, processed 17361048 words and 3022251 word types
2018-05-14 17:29:08,990: INFO: PROGRESS: at sentence #750000, processed 17594845 words and 3042351 word types
2018-05-14 17:29:10,103: INFO: PROGRESS: at sentence #760000, processed 17814742 words and 3074467 word types
2018-05-14 17:29:10,995: INFO: PROGRESS: at sentence #770000, processed 18025720 words and 3091216 word types
2018-05-14 17:29:12,946: INFO: PROGRESS: at sentence #780000, processed 18283854 words and 3116093 word types
2018-05-14 17:29:14,310: INFO: PROGRESS: at sentence #790000, processed 18531023 words and 3154856 word types
2018-05-14 17:29:15,352: INFO: PROGRESS: at sentence #800000, processed 18774891 words and 3177816 word types
2018-05-14 17:29:16,261: INFO: PROGRESS: at sentence #810000, processed 19014899 words and 3198612 word types
2018-05-14 17:29:17,139: INFO: PROGRESS: at sentence #820000, processed 19255256 words and 3219610 word types
2018-05-14

2018-05-14 17:30:01,067: INFO: PROGRESS: at sentence #1490000, processed 34208850 words and 4778377 word types
2018-05-14 17:30:01,721: INFO: PROGRESS: at sentence #1500000, processed 34450358 words and 4787806 word types
2018-05-14 17:30:02,259: INFO: PROGRESS: at sentence #1510000, processed 34639208 words and 4809380 word types
2018-05-14 17:30:02,807: INFO: PROGRESS: at sentence #1520000, processed 34825756 words and 4826088 word types
2018-05-14 17:30:03,374: INFO: PROGRESS: at sentence #1530000, processed 35024839 words and 4839282 word types
2018-05-14 17:30:04,114: INFO: PROGRESS: at sentence #1540000, processed 35251130 words and 4855824 word types
2018-05-14 17:30:04,662: INFO: PROGRESS: at sentence #1550000, processed 35438098 words and 4866389 word types
2018-05-14 17:30:06,181: INFO: PROGRESS: at sentence #1560000, processed 35667352 words and 4889372 word types
2018-05-14 17:30:08,014: INFO: PROGRESS: at sentence #1570000, processed 35900369 words and 4911339 word types
2

2018-05-14 17:31:01,236: INFO: PROGRESS: at sentence #2230000, processed 50414210 words and 6402990 word types
2018-05-14 17:31:02,063: INFO: PROGRESS: at sentence #2240000, processed 50674430 words and 6429017 word types
2018-05-14 17:31:02,681: INFO: PROGRESS: at sentence #2250000, processed 50893583 words and 6449521 word types
2018-05-14 17:31:03,397: INFO: PROGRESS: at sentence #2260000, processed 51129908 words and 6475182 word types
2018-05-14 17:31:04,102: INFO: PROGRESS: at sentence #2270000, processed 51368727 words and 6502858 word types
2018-05-14 17:31:04,802: INFO: PROGRESS: at sentence #2280000, processed 51600670 words and 6533322 word types
2018-05-14 17:31:05,461: INFO: PROGRESS: at sentence #2290000, processed 51828992 words and 6560843 word types
2018-05-14 17:31:06,074: INFO: PROGRESS: at sentence #2300000, processed 52036768 words and 6589634 word types
2018-05-14 17:31:06,748: INFO: PROGRESS: at sentence #2310000, processed 52275567 words and 6609790 word types
2

2018-05-14 17:31:53,454: INFO: PROGRESS: at sentence #2970000, processed 66946087 words and 8174950 word types
2018-05-14 17:31:54,415: INFO: PROGRESS: at sentence #2980000, processed 67214269 words and 8206242 word types
2018-05-14 17:31:55,145: INFO: PROGRESS: at sentence #2990000, processed 67446789 words and 8230690 word types
2018-05-14 17:31:56,513: INFO: PROGRESS: at sentence #3000000, processed 67700484 words and 8253589 word types
2018-05-14 17:31:57,514: INFO: PROGRESS: at sentence #3010000, processed 67911983 words and 8265359 word types
2018-05-14 17:31:58,784: INFO: PROGRESS: at sentence #3020000, processed 68136781 words and 8283506 word types
2018-05-14 17:32:00,477: INFO: PROGRESS: at sentence #3030000, processed 68369668 words and 8305425 word types
2018-05-14 17:32:01,625: INFO: PROGRESS: at sentence #3040000, processed 68605304 words and 8327038 word types
2018-05-14 17:32:02,570: INFO: PROGRESS: at sentence #3050000, processed 68844240 words and 8343521 word types
2

2018-05-14 17:32:52,112: INFO: PROGRESS: at sentence #3710000, processed 83978035 words and 9601645 word types
2018-05-14 17:32:52,990: INFO: PROGRESS: at sentence #3720000, processed 84202182 words and 9626545 word types
2018-05-14 17:32:53,483: INFO: PROGRESS: at sentence #3730000, processed 84349236 words and 9647368 word types
2018-05-14 17:32:54,182: INFO: PROGRESS: at sentence #3740000, processed 84582407 words and 9671653 word types
2018-05-14 17:32:55,117: INFO: PROGRESS: at sentence #3750000, processed 84832806 words and 9698696 word types
2018-05-14 17:32:56,103: INFO: PROGRESS: at sentence #3760000, processed 85077556 words and 9720335 word types
2018-05-14 17:32:56,770: INFO: PROGRESS: at sentence #3770000, processed 85319877 words and 9742603 word types
2018-05-14 17:32:57,456: INFO: PROGRESS: at sentence #3780000, processed 85556706 words and 9753865 word types
2018-05-14 17:32:58,281: INFO: PROGRESS: at sentence #3790000, processed 85810779 words and 9773846 word types
2

2018-05-14 17:36:25,017: INFO: PROGRESS: at sentence #160000, processed 3109862 words, keeping 231332 word types
2018-05-14 17:36:26,255: INFO: PROGRESS: at sentence #170000, processed 3304543 words, keeping 241941 word types
2018-05-14 17:36:27,328: INFO: PROGRESS: at sentence #180000, processed 3497767 words, keeping 250292 word types
2018-05-14 17:36:28,423: INFO: PROGRESS: at sentence #190000, processed 3689915 words, keeping 259687 word types
2018-05-14 17:36:29,479: INFO: PROGRESS: at sentence #200000, processed 3886594 words, keeping 269256 word types
2018-05-14 17:36:30,668: INFO: PROGRESS: at sentence #210000, processed 4081798 words, keeping 279718 word types
2018-05-14 17:36:31,707: INFO: PROGRESS: at sentence #220000, processed 4262865 words, keeping 288711 word types
2018-05-14 17:36:32,725: INFO: PROGRESS: at sentence #230000, processed 4448373 words, keeping 297773 word types
2018-05-14 17:36:33,752: INFO: PROGRESS: at sentence #240000, processed 4635794 words, keeping 3

2018-05-14 17:37:49,243: INFO: PROGRESS: at sentence #890000, processed 17118392 words, keeping 649294 word types
2018-05-14 17:37:50,423: INFO: PROGRESS: at sentence #900000, processed 17321015 words, keeping 651703 word types
2018-05-14 17:37:51,057: INFO: PROGRESS: at sentence #910000, processed 17422412 words, keeping 653986 word types
2018-05-14 17:37:52,123: INFO: PROGRESS: at sentence #920000, processed 17603996 words, keeping 659294 word types
2018-05-14 17:37:53,403: INFO: PROGRESS: at sentence #930000, processed 17808389 words, keeping 665424 word types
2018-05-14 17:37:54,131: INFO: PROGRESS: at sentence #940000, processed 17929036 words, keeping 668160 word types
2018-05-14 17:37:55,439: INFO: PROGRESS: at sentence #950000, processed 18129312 words, keeping 673031 word types
2018-05-14 17:37:56,568: INFO: PROGRESS: at sentence #960000, processed 18331862 words, keeping 676582 word types
2018-05-14 17:37:57,753: INFO: PROGRESS: at sentence #970000, processed 18523291 words, 

2018-05-14 17:39:01,757: INFO: PROGRESS: at sentence #1610000, processed 30127718 words, keeping 907902 word types
2018-05-14 17:39:02,784: INFO: PROGRESS: at sentence #1620000, processed 30306520 words, keeping 910775 word types
2018-05-14 17:39:03,676: INFO: PROGRESS: at sentence #1630000, processed 30472056 words, keeping 912939 word types
2018-05-14 17:39:04,722: INFO: PROGRESS: at sentence #1640000, processed 30662312 words, keeping 915730 word types
2018-05-14 17:39:05,754: INFO: PROGRESS: at sentence #1650000, processed 30854111 words, keeping 916865 word types
2018-05-14 17:39:06,598: INFO: PROGRESS: at sentence #1660000, processed 31013221 words, keeping 919674 word types
2018-05-14 17:39:07,667: INFO: PROGRESS: at sentence #1670000, processed 31209769 words, keeping 922619 word types
2018-05-14 17:39:08,605: INFO: PROGRESS: at sentence #1680000, processed 31386207 words, keeping 924826 word types
2018-05-14 17:39:09,711: INFO: PROGRESS: at sentence #1690000, processed 3157733

2018-05-14 17:40:19,284: INFO: PROGRESS: at sentence #2320000, processed 43021956 words, keeping 1183471 word types
2018-05-14 17:40:20,418: INFO: PROGRESS: at sentence #2330000, processed 43209137 words, keeping 1186063 word types
2018-05-14 17:40:21,631: INFO: PROGRESS: at sentence #2340000, processed 43409629 words, keeping 1191078 word types
2018-05-14 17:40:22,614: INFO: PROGRESS: at sentence #2350000, processed 43576144 words, keeping 1193507 word types
2018-05-14 17:40:23,762: INFO: PROGRESS: at sentence #2360000, processed 43769268 words, keeping 1196800 word types
2018-05-14 17:40:24,951: INFO: PROGRESS: at sentence #2370000, processed 43962348 words, keeping 1200040 word types
2018-05-14 17:40:25,860: INFO: PROGRESS: at sentence #2380000, processed 44103922 words, keeping 1202429 word types
2018-05-14 17:40:27,069: INFO: PROGRESS: at sentence #2390000, processed 44279228 words, keeping 1205599 word types
2018-05-14 17:40:28,501: INFO: PROGRESS: at sentence #2400000, processed

2018-05-14 17:41:44,034: INFO: PROGRESS: at sentence #3030000, processed 56090905 words, keeping 1444151 word types
2018-05-14 17:41:45,802: INFO: PROGRESS: at sentence #3040000, processed 56287937 words, keeping 1447572 word types
2018-05-14 17:41:47,557: INFO: PROGRESS: at sentence #3050000, processed 56485599 words, keeping 1450811 word types
2018-05-14 17:41:48,752: INFO: PROGRESS: at sentence #3060000, processed 56699300 words, keeping 1454708 word types
2018-05-14 17:41:50,027: INFO: PROGRESS: at sentence #3070000, processed 56890749 words, keeping 1460441 word types
2018-05-14 17:41:51,260: INFO: PROGRESS: at sentence #3080000, processed 57072528 words, keeping 1464002 word types
2018-05-14 17:41:53,217: INFO: PROGRESS: at sentence #3090000, processed 57268905 words, keeping 1467507 word types
2018-05-14 17:41:54,237: INFO: PROGRESS: at sentence #3100000, processed 57460385 words, keeping 1469714 word types
2018-05-14 17:41:55,210: INFO: PROGRESS: at sentence #3110000, processed

2018-05-14 17:42:58,881: INFO: PROGRESS: at sentence #3740000, processed 69501318 words, keeping 1667606 word types
2018-05-14 17:43:00,010: INFO: PROGRESS: at sentence #3750000, processed 69704212 words, keeping 1672204 word types
2018-05-14 17:43:01,117: INFO: PROGRESS: at sentence #3760000, processed 69900898 words, keeping 1676090 word types
2018-05-14 17:43:02,144: INFO: PROGRESS: at sentence #3770000, processed 70094723 words, keeping 1681086 word types
2018-05-14 17:43:03,203: INFO: PROGRESS: at sentence #3780000, processed 70281140 words, keeping 1683254 word types
2018-05-14 17:43:04,315: INFO: PROGRESS: at sentence #3790000, processed 70485774 words, keeping 1686118 word types
2018-05-14 17:43:05,370: INFO: PROGRESS: at sentence #3800000, processed 70678877 words, keeping 1688391 word types
2018-05-14 17:43:06,368: INFO: PROGRESS: at sentence #3810000, processed 70860543 words, keeping 1691331 word types
2018-05-14 17:43:07,402: INFO: PROGRESS: at sentence #3820000, processed

2018-05-14 17:47:07,814: INFO: EPOCH 1 - PROGRESS: at 4.26% examples, 95732 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:47:08,889: INFO: EPOCH 1 - PROGRESS: at 4.46% examples, 96198 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:47:09,954: INFO: EPOCH 1 - PROGRESS: at 4.65% examples, 96739 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:47:10,995: INFO: EPOCH 1 - PROGRESS: at 4.85% examples, 97288 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:47:12,000: INFO: EPOCH 1 - PROGRESS: at 5.05% examples, 97869 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:47:13,060: INFO: EPOCH 1 - PROGRESS: at 5.24% examples, 98573 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:47:14,101: INFO: EPOCH 1 - PROGRESS: at 5.48% examples, 98749 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:47:15,108: INFO: EPOCH 1 - PROGRESS: at 5.66% examples, 99277 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:47:16,124: INFO: EPOCH 1 - PROGRESS: at 5.87% examples, 99677 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:47:17

2018-05-14 17:48:25,603: INFO: EPOCH 1 - PROGRESS: at 19.20% examples, 106648 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:48:26,633: INFO: EPOCH 1 - PROGRESS: at 19.40% examples, 106778 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:48:27,634: INFO: EPOCH 1 - PROGRESS: at 19.60% examples, 106856 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:48:28,670: INFO: EPOCH 1 - PROGRESS: at 19.80% examples, 106903 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:48:29,723: INFO: EPOCH 1 - PROGRESS: at 20.00% examples, 107005 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:48:30,742: INFO: EPOCH 1 - PROGRESS: at 20.18% examples, 107117 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:48:31,748: INFO: EPOCH 1 - PROGRESS: at 20.38% examples, 107181 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:48:32,812: INFO: EPOCH 1 - PROGRESS: at 20.60% examples, 107264 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:48:33,863: INFO: EPOCH 1 - PROGRESS: at 20.82% examples, 107369 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 17:49:41,850: INFO: EPOCH 1 - PROGRESS: at 34.76% examples, 109038 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:49:42,856: INFO: EPOCH 1 - PROGRESS: at 34.95% examples, 109058 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:49:43,877: INFO: EPOCH 1 - PROGRESS: at 35.20% examples, 109101 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:49:44,897: INFO: EPOCH 1 - PROGRESS: at 35.42% examples, 109114 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:49:45,910: INFO: EPOCH 1 - PROGRESS: at 35.65% examples, 109130 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:49:46,917: INFO: EPOCH 1 - PROGRESS: at 35.85% examples, 109149 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:49:47,928: INFO: EPOCH 1 - PROGRESS: at 36.08% examples, 109120 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:49:48,953: INFO: EPOCH 1 - PROGRESS: at 36.28% examples, 109122 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:49:49,993: INFO: EPOCH 1 - PROGRESS: at 36.54% examples, 109120 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 17:50:58,945: INFO: EPOCH 1 - PROGRESS: at 48.20% examples, 103577 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:50:59,953: INFO: EPOCH 1 - PROGRESS: at 48.38% examples, 103565 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:51:00,987: INFO: EPOCH 1 - PROGRESS: at 48.61% examples, 103598 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:51:02,034: INFO: EPOCH 1 - PROGRESS: at 48.81% examples, 103638 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:51:03,071: INFO: EPOCH 1 - PROGRESS: at 49.00% examples, 103664 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:51:04,098: INFO: EPOCH 1 - PROGRESS: at 49.19% examples, 103705 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:51:05,141: INFO: EPOCH 1 - PROGRESS: at 49.40% examples, 103734 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:51:06,197: INFO: EPOCH 1 - PROGRESS: at 49.61% examples, 103756 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:51:07,217: INFO: EPOCH 1 - PROGRESS: at 49.81% examples, 103800 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 17:52:15,598: INFO: EPOCH 1 - PROGRESS: at 62.89% examples, 103956 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:52:16,614: INFO: EPOCH 1 - PROGRESS: at 63.10% examples, 103966 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:52:17,652: INFO: EPOCH 1 - PROGRESS: at 63.30% examples, 103949 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:52:18,663: INFO: EPOCH 1 - PROGRESS: at 63.52% examples, 103961 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:52:19,692: INFO: EPOCH 1 - PROGRESS: at 63.72% examples, 103948 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:52:20,709: INFO: EPOCH 1 - PROGRESS: at 63.93% examples, 103961 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:52:21,710: INFO: EPOCH 1 - PROGRESS: at 64.13% examples, 103933 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:52:22,732: INFO: EPOCH 1 - PROGRESS: at 64.32% examples, 103947 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:52:23,748: INFO: EPOCH 1 - PROGRESS: at 64.52% examples, 103959 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 17:53:32,280: INFO: EPOCH 1 - PROGRESS: at 75.15% examples, 101369 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:53:33,303: INFO: EPOCH 1 - PROGRESS: at 75.30% examples, 101308 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:53:34,314: INFO: EPOCH 1 - PROGRESS: at 75.45% examples, 101306 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:53:35,369: INFO: EPOCH 1 - PROGRESS: at 75.60% examples, 101290 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:53:36,437: INFO: EPOCH 1 - PROGRESS: at 75.80% examples, 101297 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:53:37,446: INFO: EPOCH 1 - PROGRESS: at 75.94% examples, 101258 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:53:38,529: INFO: EPOCH 1 - PROGRESS: at 76.09% examples, 101202 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:53:39,578: INFO: EPOCH 1 - PROGRESS: at 76.20% examples, 101119 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:53:40,603: INFO: EPOCH 1 - PROGRESS: at 76.36% examples, 101094 words/s, in_qsize 1, out_qsize 0
2

2018-05-14 17:54:51,033: INFO: EPOCH 1 - PROGRESS: at 86.79% examples, 98563 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:54:52,129: INFO: EPOCH 1 - PROGRESS: at 87.12% examples, 98565 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:54:53,147: INFO: EPOCH 1 - PROGRESS: at 87.27% examples, 98538 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:54:54,168: INFO: EPOCH 1 - PROGRESS: at 87.37% examples, 98464 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:54:55,295: INFO: EPOCH 1 - PROGRESS: at 87.50% examples, 98384 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:54:56,338: INFO: EPOCH 1 - PROGRESS: at 87.63% examples, 98332 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:54:57,371: INFO: EPOCH 1 - PROGRESS: at 87.77% examples, 98284 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:54:58,529: INFO: EPOCH 1 - PROGRESS: at 87.87% examples, 98185 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:54:59,693: INFO: EPOCH 1 - PROGRESS: at 87.91% examples, 97999 words/s, in_qsize 0, out_qsize 0
2018-05-14

2018-05-14 17:56:06,438: INFO: worker thread finished; awaiting finish of 3 more threads
2018-05-14 17:56:06,441: INFO: EPOCH 1 - PROGRESS: at 99.98% examples, 97725 words/s, in_qsize 2, out_qsize 1
2018-05-14 17:56:06,444: INFO: worker thread finished; awaiting finish of 2 more threads
2018-05-14 17:56:06,472: INFO: worker thread finished; awaiting finish of 1 more threads
2018-05-14 17:56:06,524: INFO: worker thread finished; awaiting finish of 0 more threads
2018-05-14 17:56:06,525: INFO: EPOCH - 1 : training on 76563714 raw words (55176769 effective words) took 564.6s, 97727 effective words/s
2018-05-14 17:56:07,642: INFO: EPOCH 2 - PROGRESS: at 0.17% examples, 92068 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:56:08,710: INFO: EPOCH 2 - PROGRESS: at 0.36% examples, 101403 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:56:09,806: INFO: EPOCH 2 - PROGRESS: at 0.61% examples, 99259 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:56:10,828: INFO: EPOCH 2 - PROGRESS: at 0.77% examples

2018-05-14 17:57:19,154: INFO: EPOCH 2 - PROGRESS: at 13.08% examples, 103466 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:57:20,174: INFO: EPOCH 2 - PROGRESS: at 13.30% examples, 103574 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:57:21,215: INFO: EPOCH 2 - PROGRESS: at 13.49% examples, 103702 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:57:22,284: INFO: EPOCH 2 - PROGRESS: at 13.65% examples, 103584 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:57:23,297: INFO: EPOCH 2 - PROGRESS: at 13.84% examples, 103633 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:57:24,342: INFO: EPOCH 2 - PROGRESS: at 14.02% examples, 103745 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:57:25,349: INFO: EPOCH 2 - PROGRESS: at 14.21% examples, 103817 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:57:26,393: INFO: EPOCH 2 - PROGRESS: at 14.44% examples, 103899 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:57:27,398: INFO: EPOCH 2 - PROGRESS: at 14.63% examples, 104042 words/s, in_qsize 1, out_qsize 0
2

2018-05-14 17:58:35,780: INFO: EPOCH 2 - PROGRESS: at 27.64% examples, 105655 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:58:36,844: INFO: EPOCH 2 - PROGRESS: at 27.81% examples, 105579 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:58:37,901: INFO: EPOCH 2 - PROGRESS: at 28.01% examples, 105361 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:58:39,002: INFO: EPOCH 2 - PROGRESS: at 28.12% examples, 105030 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:58:40,018: INFO: EPOCH 2 - PROGRESS: at 28.25% examples, 104857 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:58:41,066: INFO: EPOCH 2 - PROGRESS: at 28.41% examples, 104762 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:58:42,129: INFO: EPOCH 2 - PROGRESS: at 28.58% examples, 104692 words/s, in_qsize 1, out_qsize 0
2018-05-14 17:58:43,143: INFO: EPOCH 2 - PROGRESS: at 28.73% examples, 104608 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:58:44,196: INFO: EPOCH 2 - PROGRESS: at 28.92% examples, 104596 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 17:59:52,562: INFO: EPOCH 2 - PROGRESS: at 41.80% examples, 102914 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:59:53,673: INFO: EPOCH 2 - PROGRESS: at 41.95% examples, 102786 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:59:54,704: INFO: EPOCH 2 - PROGRESS: at 42.11% examples, 102694 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:59:55,753: INFO: EPOCH 2 - PROGRESS: at 42.29% examples, 102590 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:59:56,794: INFO: EPOCH 2 - PROGRESS: at 42.47% examples, 102525 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:59:57,845: INFO: EPOCH 2 - PROGRESS: at 42.69% examples, 102518 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:59:58,901: INFO: EPOCH 2 - PROGRESS: at 42.90% examples, 102506 words/s, in_qsize 0, out_qsize 0
2018-05-14 17:59:59,935: INFO: EPOCH 2 - PROGRESS: at 43.10% examples, 102515 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:00:00,961: INFO: EPOCH 2 - PROGRESS: at 43.27% examples, 102522 words/s, in_qsize 1, out_qsize 0
2

2018-05-14 18:03:46,018: INFO: EPOCH 2 - PROGRESS: at 54.49% examples, 65613 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:03:47,069: INFO: EPOCH 2 - PROGRESS: at 54.61% examples, 65606 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:03:48,215: INFO: EPOCH 2 - PROGRESS: at 54.70% examples, 65566 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:03:49,287: INFO: EPOCH 2 - PROGRESS: at 54.86% examples, 65618 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:03:50,387: INFO: EPOCH 2 - PROGRESS: at 55.03% examples, 65669 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:03:51,403: INFO: EPOCH 2 - PROGRESS: at 55.20% examples, 65729 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:03:52,455: INFO: EPOCH 2 - PROGRESS: at 55.39% examples, 65801 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:03:53,485: INFO: EPOCH 2 - PROGRESS: at 55.58% examples, 65876 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:03:54,530: INFO: EPOCH 2 - PROGRESS: at 55.76% examples, 65960 words/s, in_qsize 0, out_qsize 0
2018-05-14

2018-05-14 18:05:04,492: INFO: EPOCH 2 - PROGRESS: at 68.96% examples, 70610 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:05:05,548: INFO: EPOCH 2 - PROGRESS: at 69.15% examples, 70684 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:05:06,580: INFO: EPOCH 2 - PROGRESS: at 69.31% examples, 70719 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:05:07,638: INFO: EPOCH 2 - PROGRESS: at 69.52% examples, 70805 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:05:08,673: INFO: EPOCH 2 - PROGRESS: at 69.70% examples, 70892 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:05:09,680: INFO: EPOCH 2 - PROGRESS: at 69.91% examples, 70970 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:05:10,726: INFO: EPOCH 2 - PROGRESS: at 70.10% examples, 71056 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:05:11,737: INFO: EPOCH 2 - PROGRESS: at 70.31% examples, 71131 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:05:12,777: INFO: EPOCH 2 - PROGRESS: at 70.53% examples, 71215 words/s, in_qsize 0, out_qsize 0
2018-05-14

2018-05-14 18:06:22,188: INFO: EPOCH 2 - PROGRESS: at 84.39% examples, 76076 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:06:23,240: INFO: EPOCH 2 - PROGRESS: at 84.62% examples, 76133 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:06:24,258: INFO: EPOCH 2 - PROGRESS: at 84.81% examples, 76194 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:06:25,377: INFO: EPOCH 2 - PROGRESS: at 85.03% examples, 76242 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:06:26,408: INFO: EPOCH 2 - PROGRESS: at 85.20% examples, 76279 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:06:27,473: INFO: EPOCH 2 - PROGRESS: at 85.42% examples, 76346 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:06:28,548: INFO: EPOCH 2 - PROGRESS: at 85.65% examples, 76424 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:06:29,571: INFO: EPOCH 2 - PROGRESS: at 85.85% examples, 76498 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:06:30,593: INFO: EPOCH 2 - PROGRESS: at 86.05% examples, 76556 words/s, in_qsize 0, out_qsize 0
2018-05-14

2018-05-14 18:07:33,118: INFO: worker thread finished; awaiting finish of 0 more threads
2018-05-14 18:07:33,119: INFO: EPOCH - 2 : training on 76563714 raw words (55173975 effective words) took 686.5s, 80370 effective words/s
2018-05-14 18:07:34,251: INFO: EPOCH 3 - PROGRESS: at 0.18% examples, 98332 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:07:35,311: INFO: EPOCH 3 - PROGRESS: at 0.41% examples, 111041 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:07:36,336: INFO: EPOCH 3 - PROGRESS: at 0.67% examples, 112433 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:07:37,351: INFO: EPOCH 3 - PROGRESS: at 0.89% examples, 111001 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:07:38,377: INFO: EPOCH 3 - PROGRESS: at 1.10% examples, 110859 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:07:39,438: INFO: EPOCH 3 - PROGRESS: at 1.32% examples, 112905 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:07:40,494: INFO: EPOCH 3 - PROGRESS: at 1.51% examples, 114461 words/s, in_qsize 0, out_qsize 0
2018

2018-05-14 18:08:50,101: INFO: EPOCH 3 - PROGRESS: at 15.73% examples, 117194 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:08:51,113: INFO: EPOCH 3 - PROGRESS: at 15.92% examples, 117166 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:08:52,146: INFO: EPOCH 3 - PROGRESS: at 16.10% examples, 117100 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:08:53,156: INFO: EPOCH 3 - PROGRESS: at 16.30% examples, 117168 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:08:54,203: INFO: EPOCH 3 - PROGRESS: at 16.52% examples, 117234 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:08:55,237: INFO: EPOCH 3 - PROGRESS: at 16.71% examples, 117172 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:08:56,256: INFO: EPOCH 3 - PROGRESS: at 16.87% examples, 116946 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:08:57,257: INFO: EPOCH 3 - PROGRESS: at 17.07% examples, 116938 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:08:58,289: INFO: EPOCH 3 - PROGRESS: at 17.27% examples, 117043 words/s, in_qsize 1, out_qsize 0
2

2018-05-14 18:10:06,339: INFO: EPOCH 3 - PROGRESS: at 32.13% examples, 118846 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:10:07,384: INFO: EPOCH 3 - PROGRESS: at 32.37% examples, 118855 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:10:08,400: INFO: EPOCH 3 - PROGRESS: at 32.60% examples, 118860 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:10:09,405: INFO: EPOCH 3 - PROGRESS: at 32.83% examples, 118870 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:10:10,433: INFO: EPOCH 3 - PROGRESS: at 33.06% examples, 118911 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:10:11,482: INFO: EPOCH 3 - PROGRESS: at 33.28% examples, 118935 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:10:12,514: INFO: EPOCH 3 - PROGRESS: at 33.49% examples, 118964 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:10:13,534: INFO: EPOCH 3 - PROGRESS: at 33.71% examples, 119006 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:10:14,570: INFO: EPOCH 3 - PROGRESS: at 33.95% examples, 118990 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:11:22,674: INFO: EPOCH 3 - PROGRESS: at 49.47% examples, 119177 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:11:23,682: INFO: EPOCH 3 - PROGRESS: at 49.71% examples, 119192 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:11:24,716: INFO: EPOCH 3 - PROGRESS: at 49.90% examples, 119195 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:11:25,744: INFO: EPOCH 3 - PROGRESS: at 50.11% examples, 119195 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:11:26,804: INFO: EPOCH 3 - PROGRESS: at 50.33% examples, 119181 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:11:27,837: INFO: EPOCH 3 - PROGRESS: at 50.55% examples, 119177 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:11:28,840: INFO: EPOCH 3 - PROGRESS: at 50.79% examples, 119186 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:11:29,872: INFO: EPOCH 3 - PROGRESS: at 51.02% examples, 119184 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:11:30,926: INFO: EPOCH 3 - PROGRESS: at 51.24% examples, 119194 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:12:38,793: INFO: EPOCH 3 - PROGRESS: at 65.67% examples, 118219 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:12:39,839: INFO: EPOCH 3 - PROGRESS: at 65.87% examples, 118174 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:12:40,871: INFO: EPOCH 3 - PROGRESS: at 66.07% examples, 118160 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:12:41,877: INFO: EPOCH 3 - PROGRESS: at 66.29% examples, 118156 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:12:42,905: INFO: EPOCH 3 - PROGRESS: at 66.47% examples, 118114 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:12:43,945: INFO: EPOCH 3 - PROGRESS: at 66.68% examples, 118085 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:12:44,954: INFO: EPOCH 3 - PROGRESS: at 66.90% examples, 118095 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:12:46,014: INFO: EPOCH 3 - PROGRESS: at 67.10% examples, 118103 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:12:47,046: INFO: EPOCH 3 - PROGRESS: at 67.24% examples, 117991 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:13:55,074: INFO: EPOCH 3 - PROGRESS: at 81.47% examples, 118289 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:13:56,115: INFO: EPOCH 3 - PROGRESS: at 81.71% examples, 118303 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:13:57,189: INFO: EPOCH 3 - PROGRESS: at 81.91% examples, 118288 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:13:58,195: INFO: EPOCH 3 - PROGRESS: at 82.12% examples, 118309 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:13:59,239: INFO: EPOCH 3 - PROGRESS: at 82.33% examples, 118322 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:14:00,275: INFO: EPOCH 3 - PROGRESS: at 82.57% examples, 118321 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:14:01,311: INFO: EPOCH 3 - PROGRESS: at 82.81% examples, 118336 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:14:02,348: INFO: EPOCH 3 - PROGRESS: at 83.03% examples, 118352 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:14:03,389: INFO: EPOCH 3 - PROGRESS: at 83.23% examples, 118366 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:15:11,591: INFO: EPOCH 3 - PROGRESS: at 97.93% examples, 118307 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:15:12,620: INFO: EPOCH 3 - PROGRESS: at 98.17% examples, 118321 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:15:13,660: INFO: EPOCH 3 - PROGRESS: at 98.45% examples, 118318 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:15:14,716: INFO: EPOCH 3 - PROGRESS: at 98.83% examples, 118307 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:15:15,757: INFO: EPOCH 3 - PROGRESS: at 99.14% examples, 118317 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:15:16,789: INFO: EPOCH 3 - PROGRESS: at 99.39% examples, 118333 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:15:17,810: INFO: EPOCH 3 - PROGRESS: at 99.63% examples, 118353 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:15:18,842: INFO: EPOCH 3 - PROGRESS: at 99.88% examples, 118370 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:15:19,225: INFO: worker thread finished; awaiting finish of 7 more threads
2018-05-14 18:15:19,231

2018-05-14 18:16:21,254: INFO: EPOCH 4 - PROGRESS: at 13.48% examples, 124912 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:16:22,283: INFO: EPOCH 4 - PROGRESS: at 13.69% examples, 124956 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:16:23,323: INFO: EPOCH 4 - PROGRESS: at 13.91% examples, 124958 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:16:24,346: INFO: EPOCH 4 - PROGRESS: at 14.13% examples, 125020 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:16:25,393: INFO: EPOCH 4 - PROGRESS: at 14.38% examples, 125005 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:16:26,401: INFO: EPOCH 4 - PROGRESS: at 14.58% examples, 124962 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:16:27,432: INFO: EPOCH 4 - PROGRESS: at 14.82% examples, 124955 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:16:28,466: INFO: EPOCH 4 - PROGRESS: at 15.07% examples, 124968 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:16:29,506: INFO: EPOCH 4 - PROGRESS: at 15.32% examples, 124974 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:17:37,266: INFO: EPOCH 4 - PROGRESS: at 30.26% examples, 125075 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:17:38,329: INFO: EPOCH 4 - PROGRESS: at 30.53% examples, 125086 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:17:39,366: INFO: EPOCH 4 - PROGRESS: at 30.77% examples, 125112 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:17:40,424: INFO: EPOCH 4 - PROGRESS: at 31.06% examples, 125130 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:17:41,438: INFO: EPOCH 4 - PROGRESS: at 31.33% examples, 125175 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:17:42,473: INFO: EPOCH 4 - PROGRESS: at 31.55% examples, 125160 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:17:43,522: INFO: EPOCH 4 - PROGRESS: at 31.79% examples, 125077 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:17:44,571: INFO: EPOCH 4 - PROGRESS: at 32.05% examples, 125101 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:17:45,581: INFO: EPOCH 4 - PROGRESS: at 32.31% examples, 125099 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:18:53,770: INFO: EPOCH 4 - PROGRESS: at 48.02% examples, 123756 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:18:54,833: INFO: EPOCH 4 - PROGRESS: at 48.25% examples, 123687 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:18:55,867: INFO: EPOCH 4 - PROGRESS: at 48.50% examples, 123694 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:18:56,917: INFO: EPOCH 4 - PROGRESS: at 48.74% examples, 123702 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:18:57,924: INFO: EPOCH 4 - PROGRESS: at 48.94% examples, 123696 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:18:58,984: INFO: EPOCH 4 - PROGRESS: at 49.17% examples, 123731 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:19:00,026: INFO: EPOCH 4 - PROGRESS: at 49.40% examples, 123739 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:19:01,028: INFO: EPOCH 4 - PROGRESS: at 49.61% examples, 123700 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:19:02,050: INFO: EPOCH 4 - PROGRESS: at 49.83% examples, 123724 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:20:10,341: INFO: EPOCH 4 - PROGRESS: at 64.58% examples, 122225 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:20:11,351: INFO: EPOCH 4 - PROGRESS: at 64.77% examples, 122154 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:20:12,384: INFO: EPOCH 4 - PROGRESS: at 64.99% examples, 122150 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:20:13,405: INFO: EPOCH 4 - PROGRESS: at 65.22% examples, 122145 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:20:14,477: INFO: EPOCH 4 - PROGRESS: at 65.50% examples, 122119 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:20:15,514: INFO: EPOCH 4 - PROGRESS: at 65.73% examples, 122115 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:20:16,553: INFO: EPOCH 4 - PROGRESS: at 65.96% examples, 122107 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:20:17,628: INFO: EPOCH 4 - PROGRESS: at 66.17% examples, 122084 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:20:18,641: INFO: EPOCH 4 - PROGRESS: at 66.39% examples, 122085 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:21:26,761: INFO: EPOCH 4 - PROGRESS: at 81.17% examples, 122548 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:21:27,787: INFO: EPOCH 4 - PROGRESS: at 81.39% examples, 122552 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:21:28,801: INFO: EPOCH 4 - PROGRESS: at 81.67% examples, 122541 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:21:29,864: INFO: EPOCH 4 - PROGRESS: at 81.89% examples, 122555 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:21:30,894: INFO: EPOCH 4 - PROGRESS: at 82.10% examples, 122557 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:21:31,922: INFO: EPOCH 4 - PROGRESS: at 82.31% examples, 122563 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:21:32,952: INFO: EPOCH 4 - PROGRESS: at 82.55% examples, 122573 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:21:33,981: INFO: EPOCH 4 - PROGRESS: at 82.79% examples, 122578 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:21:34,987: INFO: EPOCH 4 - PROGRESS: at 83.02% examples, 122593 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:22:43,047: INFO: EPOCH 4 - PROGRESS: at 98.00% examples, 122307 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:22:44,065: INFO: EPOCH 4 - PROGRESS: at 98.24% examples, 122316 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:22:45,110: INFO: EPOCH 4 - PROGRESS: at 98.53% examples, 122320 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:22:46,111: INFO: EPOCH 4 - PROGRESS: at 98.92% examples, 122313 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:22:47,136: INFO: EPOCH 4 - PROGRESS: at 99.18% examples, 122272 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:22:48,191: INFO: EPOCH 4 - PROGRESS: at 99.42% examples, 122274 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:22:49,196: INFO: EPOCH 4 - PROGRESS: at 99.67% examples, 122290 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:22:50,257: INFO: EPOCH 4 - PROGRESS: at 99.93% examples, 122289 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:22:50,457: INFO: worker thread finished; awaiting finish of 7 more threads
2018-05-14 18:22:50,462

2018-05-14 18:23:52,994: INFO: EPOCH 5 - PROGRESS: at 12.85% examples, 118689 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:23:54,029: INFO: EPOCH 5 - PROGRESS: at 13.11% examples, 118678 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:23:55,044: INFO: EPOCH 5 - PROGRESS: at 13.35% examples, 118681 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:23:56,051: INFO: EPOCH 5 - PROGRESS: at 13.55% examples, 118760 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:23:57,087: INFO: EPOCH 5 - PROGRESS: at 13.75% examples, 118773 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:23:58,097: INFO: EPOCH 5 - PROGRESS: at 13.96% examples, 118835 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:23:59,137: INFO: EPOCH 5 - PROGRESS: at 14.18% examples, 118952 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:24:00,151: INFO: EPOCH 5 - PROGRESS: at 14.41% examples, 118972 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:24:01,212: INFO: EPOCH 5 - PROGRESS: at 14.61% examples, 118829 words/s, in_qsize 1, out_qsize 0
2

2018-05-14 18:25:08,807: INFO: EPOCH 5 - PROGRESS: at 29.46% examples, 121608 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:25:09,837: INFO: EPOCH 5 - PROGRESS: at 29.70% examples, 121673 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:25:10,870: INFO: EPOCH 5 - PROGRESS: at 29.92% examples, 121698 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:25:11,879: INFO: EPOCH 5 - PROGRESS: at 30.17% examples, 121677 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:25:12,909: INFO: EPOCH 5 - PROGRESS: at 30.41% examples, 121702 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:25:13,974: INFO: EPOCH 5 - PROGRESS: at 30.66% examples, 121733 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:25:14,993: INFO: EPOCH 5 - PROGRESS: at 30.93% examples, 121743 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:25:16,010: INFO: EPOCH 5 - PROGRESS: at 31.19% examples, 121765 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:25:17,044: INFO: EPOCH 5 - PROGRESS: at 31.43% examples, 121771 words/s, in_qsize 0, out_qsize 0
2

2018-05-14 18:28:16,609: INFO: EPOCH 5 - PROGRESS: at 47.17% examples, 80019 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:28:17,616: INFO: EPOCH 5 - PROGRESS: at 47.44% examples, 80170 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:28:18,653: INFO: EPOCH 5 - PROGRESS: at 47.71% examples, 80332 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:28:19,655: INFO: EPOCH 5 - PROGRESS: at 47.94% examples, 80487 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:28:20,667: INFO: EPOCH 5 - PROGRESS: at 48.19% examples, 80636 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:28:21,716: INFO: EPOCH 5 - PROGRESS: at 48.43% examples, 80797 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:28:22,742: INFO: EPOCH 5 - PROGRESS: at 48.70% examples, 80942 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:28:23,772: INFO: EPOCH 5 - PROGRESS: at 48.92% examples, 81106 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:28:24,802: INFO: EPOCH 5 - PROGRESS: at 49.14% examples, 81268 words/s, in_qsize 0, out_qsize 0
2018-05-14

2018-05-14 18:29:34,011: INFO: EPOCH 5 - PROGRESS: at 65.88% examples, 89850 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:29:35,074: INFO: EPOCH 5 - PROGRESS: at 66.13% examples, 89977 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:29:36,095: INFO: EPOCH 5 - PROGRESS: at 66.38% examples, 90092 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:29:37,101: INFO: EPOCH 5 - PROGRESS: at 66.64% examples, 90203 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:29:38,107: INFO: EPOCH 5 - PROGRESS: at 66.86% examples, 90296 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:29:39,146: INFO: EPOCH 5 - PROGRESS: at 67.08% examples, 90414 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:29:40,196: INFO: EPOCH 5 - PROGRESS: at 67.32% examples, 90515 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:29:41,236: INFO: EPOCH 5 - PROGRESS: at 67.60% examples, 90615 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:29:42,304: INFO: EPOCH 5 - PROGRESS: at 67.85% examples, 90726 words/s, in_qsize 0, out_qsize 0
2018-05-14

2018-05-14 18:30:51,313: INFO: EPOCH 5 - PROGRESS: at 83.84% examples, 96816 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:30:52,358: INFO: EPOCH 5 - PROGRESS: at 84.07% examples, 96888 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:30:53,366: INFO: EPOCH 5 - PROGRESS: at 84.34% examples, 96968 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:30:54,370: INFO: EPOCH 5 - PROGRESS: at 84.61% examples, 97035 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:30:55,385: INFO: EPOCH 5 - PROGRESS: at 84.84% examples, 97113 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:30:56,399: INFO: EPOCH 5 - PROGRESS: at 85.08% examples, 97176 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:30:57,458: INFO: EPOCH 5 - PROGRESS: at 85.32% examples, 97250 words/s, in_qsize 0, out_qsize 0
2018-05-14 18:30:58,474: INFO: EPOCH 5 - PROGRESS: at 85.53% examples, 97313 words/s, in_qsize 1, out_qsize 0
2018-05-14 18:30:59,508: INFO: EPOCH 5 - PROGRESS: at 85.75% examples, 97345 words/s, in_qsize 0, out_qsize 0
2018-05-14

2018-05-14 18:32:00,588: INFO: EPOCH - 5 : training on 76563714 raw words (55176038 effective words) took 550.0s, 100326 effective words/s
2018-05-14 18:32:00,589: INFO: training on a 382818570 raw words (275877426 effective words) took 2718.8s, 101471 effective words/s
2018-05-14 18:32:00,597: INFO: saving Word2Vec object under data/modelword2vecbigram.model, separately None
2018-05-14 18:32:00,604: INFO: storing np array 'vectors' to data/modelword2vecbigram.model.wv.vectors.npy
2018-05-14 18:32:02,513: INFO: not storing attribute vectors_norm
2018-05-14 18:32:02,518: INFO: storing np array 'syn1neg' to data/modelword2vecbigram.model.trainables.syn1neg.npy
2018-05-14 18:32:04,097: INFO: not storing attribute cum_table
2018-05-14 18:32:08,458: INFO: saved data/modelword2vecbigram.model
2018-05-14 18:32:08,461: INFO: storing 1169943x100 projection weights into data/modelword2vecbigram.vec


In [53]:
%run train_doc2vec.py data/dataDoc2vec.txt data/doc2vec.model

2018-05-14 18:35:37,494: ERROR: File `'train_doc2vec.py'` not found.
