# Content Collection and Processing for TSE-NER Long-tail Entity Extraction

This notebook follows the full process required to collect and prepare the data used to train and run an NER model.

In [23]:
# This is in case you update modules while working
%load_ext autoreload 
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Content Collection (a.k.a. crawling a *bunch* of papers)

For this step we will be using [sci-paper-miner](https://github.com/ronentk/sci-paper-miner) to download scientific publications from arXiv. Get the code from their repository, and use the following command to run it:

`python crawl_core.py <your-api>` 

In the `crawl_core` file you can modify the topics and the range of years to download; in `configs` you can set the name of the folder where the data will be stored.

In [24]:
import json
import os

In [32]:
# This is the path where all the json files were downloaded 

path = 'sci_paper_miner/data/arxiv_2006-2017_cs/raw_query/'

We know that the files we downloaded are in JSON format, but let's check their structure.
For this we can use `os.walk`.

In [41]:
for dirpath, dirnames, filenames in os.walk(path):
    for filename in filenames:
        if filename.endswith('.json'):
            print(filename)

144_Computer Science - Artificial Intelligence_2017_5_0.json
144_Computer Science - Artificial Intelligence_2017_5_1.json
144_Computer Science - Artificial Intelligence_2017_5_10.json
144_Computer Science - Artificial Intelligence_2017_5_11.json
144_Computer Science - Artificial Intelligence_2017_5_12.json
144_Computer Science - Artificial Intelligence_2017_5_13.json
144_Computer Science - Artificial Intelligence_2017_5_14.json
144_Computer Science - Artificial Intelligence_2017_5_15.json
144_Computer Science - Artificial Intelligence_2017_5_16.json
144_Computer Science - Artificial Intelligence_2017_5_17.json
144_Computer Science - Artificial Intelligence_2017_5_18.json
144_Computer Science - Artificial Intelligence_2017_5_19.json
144_Computer Science - Artificial Intelligence_2017_5_2.json
144_Computer Science - Artificial Intelligence_2017_5_20.json
144_Computer Science - Artificial Intelligence_2017_5_21.json
144_Computer Science - Artificial Intelligence_2017_5_22.json
144_Compute
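Since `os.walk` also descends into subfolders, it can help to collect full paths rather than bare file names. The following is a minimal sketch of that idea; `find_json_files` is a hypothetical helper name, not part of sci-paper-miner.

```python
import os


def find_json_files(root):
    """Return the sorted paths of all .json files under root, subfolders included."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.json'):
                # Join with dirpath so files in nested folders resolve correctly
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling `find_json_files(path)` with the `path` defined above should list the same files as the loop, with their directories included.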

It seems like they are not individual articles, so we have to look into one of them.
Let's just take the last one (the `filename` variable still holds the last file name from the loop above).

In [42]:
json_file = os.path.join(path, filename)
with open(json_file) as f:
    data = json.load(f)

JSON files are loaded as dictionaries:

In [43]:
data.keys()

dict_keys(['authors', 'contributors', 'datePublished', 'description', 'doi', 'downloadUrl', 'fullText', 'fulltextIdentifier', 'id', 'identifiers', 'oai', 'relations', 'repositories', 'subjects', 'title', 'topics', 'types', 'year'])

If we look into one of those keys, it seems that each file holds many articles, each with its own value for the keys above.

In [53]:
data['title'].keys()

dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47'])

So we can check the content of the `title` key for all articles:

In [61]:
for key in data['title']:
    print(data['title'][key])
    print('-'*50)

A new cosine series antialiasing function and its application to
  aliasing-free glottal source models for speech and singing synthesis
--------------------------------------------------
Investigating the role of musical genre in human perception of music
  stretching resistance
--------------------------------------------------
Enabling Early Audio Event Detection with Neural Networks
--------------------------------------------------
Multi-Speaker Localization Using Convolutional Neural Network Trained
  with Noise
--------------------------------------------------
Acoustic Reflector Localization: Novel Image Source Reversion and Direct
  Localization Methods
--------------------------------------------------
A modulation property of time-frequency derivatives of filtered phase
  and its application to aperiodicity and fo estimation
--------------------------------------------------
Weakly Supervised PLDA Training
--------------------------------------------------
CNN Architectures f

Or the content of all keys, for the first article, which is what we are looking for:

In [60]:
for key in data:
    print(key, data[key]['0'])
    print('-'*50)

authors ['Kawahara, Hideki', 'Sakakibara, Ken-Ichi', 'Banno, Hideki', 'Morise, Masanori', 'Toda, Tomoki', 'Irino, Toshio']
--------------------------------------------------
contributors []
--------------------------------------------------
datePublished 2017-06-08
--------------------------------------------------
description We formulated and implemented a procedure to generate aliasing-free
excitation source signals. It uses a new antialiasing filter in the continuous
time domain followed by an IIR digital filter for response equalization. We
introduced a cosine-series-based general design procedure for the new
antialiasing function. We applied this new procedure to implement the
antialiased Fujisaki-Ljungqvist model. We also applied it to revise our
previous implementation of the antialiased Fant-Liljencrants model. A
combination of these signals and a lattice implementation of the time varying
vocal tract model provides a reliable and flexible basis to test fo extractors
and sourc

Awesome! So we have to iterate over all the downloaded JSON files, and over every article in each file, collecting all of its keys, so that we can store the articles in a database. In this case we will use MongoDB.
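The layout seen above is column-oriented: each key maps to a dictionary indexed by article number. A minimal sketch of turning it into one record per article could look like this. The `iter_articles` helper and the `sample` data are made up for illustration, and the use of `.get` (which yields `None` for a missing index) is an assumption about how gaps should be handled.

```python
def iter_articles(data):
    """Yield one dict per article from a column-oriented dump."""
    # The keys of data['title'] are the article indices ('0', '1', ...)
    for index in data['title']:
        yield {key: column.get(index) for key, column in data.items()}


# Tiny made-up sample with the same shape as the downloaded files
sample = {
    'id': {'0': '111', '1': '222'},
    'title': {'0': 'First paper', '1': 'Second paper'},
}

for article in iter_articles(sample):
    print(article)
```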

## Extraction and Storage - MongoDB

We use MongoDB because it is great for prototyping and quick schema changes, but other storage systems could probably be used instead. Ultimately, all the data is indexed in Elasticsearch, which is what we actually query in production.

You can go [here](https://docs.mongodb.com/manual/installation/) to install MongoDB.

Then, using a MongoDB client, create a database (for example _pub_) and a collection (for example _publications_).

In [92]:
from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError
import string

In [70]:
# Default values
mongoDB_IP = '127.0.0.1'
mongoDB_Port = 27017
mongoDB_db = 'pub'

In [71]:
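def connect_to_mongo():
    """
    Returns a db connection to the mongo instance
    :return: pymongo database handle for `mongoDB_db`
    """
    try:
        client = MongoClient(mongoDB_IP, mongoDB_Port)
        db = client[mongoDB_db]
        # MongoClient connects lazily, so run a cheap query to force a round trip
        db.downloads.find_one({'_id': 'test'})
        return db
    except ServerSelectionTimeoutError as e:
        # mongoDB_Port is an int, so format it instead of concatenating strings
        raise Exception(f"Local MongoDB instance at {mongoDB_IP}:{mongoDB_Port} could not be accessed.") from e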

We can test the connection to the database with the following command:

In [72]:
db = connect_to_mongo()

Then we define a function that will take each of the articles from the JSON files downloaded from CORE, and store the information into our collection.

The data obtained from CORE does not separate the sections of the text. This matters little except for the references, which we may use later on, so we also define a function that splits them off from the full text.
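As a point of comparison for the reference handling described above, here is a hedged sketch based on a regular expression. It is not the function used in this notebook: `split_references` is a hypothetical name, it splits at the last heading rather than the first, and it assumes entries are numbered like `[1]`, `[2]`.

```python
import re

# Matches a standalone "References" heading in either capitalisation
REFERENCES_HEADING = re.compile(r'\b(?:References|REFERENCES)\b')


def split_references(full_text):
    """Split text at the last references heading into (body, reference_list)."""
    text = full_text.replace('\r', '').replace('\n', ' ')
    matches = list(REFERENCES_HEADING.finditer(text))
    if not matches:
        # No heading found: keep the text whole and return no references
        return text, []
    last = matches[-1]
    body = text[:last.start()]
    tail = text[last.end():]
    # Assumption: entries are introduced by numeric markers such as "[12]"
    references = [entry.strip() for entry in re.split(r'\[\d+\]', tail) if entry.strip()]
    return body, references
```

Splitting at the last heading avoids cutting the body short when the word "References" also appears earlier in the text, at the cost of misfiring if it appears again after the reference list.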

In [147]:
def arxiv_json_to_mongo(article):
    """
    Creates or updates the mongodb entry for the input article
    :param article: dict with the CORE fields of one article, plus 'references'
    """
    mongo_input = {}

    mongo_input['title'] = article['title']
    mongo_input['authors'] = article['authors']
    mongo_input['journal'] = 'arxiv'
    mongo_input['year'] = article['year']
    mongo_input['type'] = article['subjects']
    # Dotted keys are stored by `$set` as fields of a nested `content` document
    mongo_input['content.abstract'] = article['description']
    mongo_input['content.keywords'] = article['topics']
    mongo_input['content.fulltext'] = article['fullText']
    mongo_input['content.references'] = article['references']

    # Upsert so that re-running the notebook updates entries instead of duplicating them
    db.publications.update_one(
        {'_id': 'arxiv_' + article['id']},
        {'$set': mongo_input},
        upsert=True
    )
    print('Processed', article['id'], 'from JSON')
    
def manage_text_and_refs(article):
    """
    Flattens line breaks in the full text and moves everything after the
    first 'References' / 'REFERENCES' heading into article['references']
    """
    article['fullText'] = article['fullText'].replace('\n', ' ').replace('\r', '')
    for heading in ('References', 'REFERENCES'):
        parts = article['fullText'].split(heading, 1)
        if len(parts) == 2:
            # Split on the same heading that was detected
            article['fullText'] = parts[0]
            article['references'] = parts[1].split('[')
            break
    else:
        # No references heading found: keep the text whole
        article['references'] = ''
    return article
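The `content.abstract`-style keys passed to `$set` use MongoDB's dot notation, so they end up as fields of a nested `content` document rather than as top-level fields with dots in their names. The sketch below mimics that expansion in plain Python to show the shape of the stored document. `expand_dotted_keys` is a hypothetical helper for illustration only; MongoDB does this on the server side.

```python
def expand_dotted_keys(flat):
    """Mimic how `$set` interprets dotted field names as nested paths."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split('.')
        target = nested
        for part in parents:
            # Create intermediate documents as needed
            target = target.setdefault(part, {})
        target[leaf] = value
    return nested


# Made-up values, same key layout as in arxiv_json_to_mongo
print(expand_dotted_keys({
    'title': 'Some title',
    'content.abstract': 'Some abstract',
    'content.keywords': ['ner'],
}))
```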

Now we have to iterate through all the downloaded JSON files, and all the articles in each file, to process them and introduce them into our database.

In [149]:
for dirpath, dirnames, filenames in os.walk(path):
    # Iterate files...
    for filename in filenames:
        if filename.endswith('.json'):
            # Join with dirpath (not path) so files in subfolders are found too
            json_file = os.path.join(dirpath, filename)
            with open(json_file) as f:
                data = json.load(f)
            # Iterate articles in file...
            for article_number in data['title'].keys():
                article = {}
                for key in data.keys():
                    article[key] = data[key][article_number]

                # Process text and references for each article
                manage_text_and_refs(article)
                # Store to database
                arxiv_json_to_mongo(article)

Processed 25036267 from JSON
Processed 2113087 from JSON
Processed 25025591 from JSON
Processed 25015102 from JSON
Processed 24956790 from JSON
Processed 25038409 from JSON
Processed 2246166 from JSON
Processed 73386897 from JSON
Processed 83863968 from JSON
Processed 83867371 from JSON
Processed 94047681 from JSON
Processed 146472595 from JSON
Processed 94063315 from JSON
Processed 29526388 from JSON
Processed 29536914 from JSON
Processed 84090473 from JSON
Processed 84093559 from JSON
Processed 84093915 from JSON
Processed 84093852 from JSON
Processed 84094420 from JSON
Processed 84326602 from JSON
Processed 84328172 from JSON
Processed 84330324 from JSON
Processed 84329889 from JSON
Processed 84326664 from JSON
Processed 84331310 from JSON
Processed 86416235 from JSON
Processed 86419366 from JSON
Processed 86420205 from JSON
Processed 84330603 from JSON
Processed 84330526 from JSON
Processed 84331497 from JSON
Processed 73423992 from JSON
Processed 73417539 from JSON
...
Processed 83855038 from JSON
Processed 83856598 from JSON
Processed 94078538 from JSON
Processed 94055175 from JSON
Processed 73385657 from JSON
Processed 83857724 from JSON
Processed 83869325 from JSON
Processed 83869495 from JSON
Processed 83867893 from JSON
Processed 83833421 from JSON
Processed 83839450 from JSON
Processed 83839931 from JSON
Processed 73354905 from JSON
Processed 84095109 from JSON
Processed 29528185 from JSON
Processed 29528060 from JSON
Processed 84090591 from JSON
Processed 84094649 from JSON
Processed 73990862 from JSON
Processed 73991632 from JSON
Processed 84329945 from JSON
Processed 73961262 from JSON
Processed 42675750 from JSON
Processed 84328803 from JSON
Processed 86421786 from JSON
Processed 84327879 from JSON
Processed 84333047 from JSON
Processed 42723833 from JSON
Processed 74204074 from JSON
Processed 74204120 from JSON
Processed 86415594 from JSON
Processed 8433

Processed 74202852 from JSON
Processed 78510096 from JSON
Processed 93915732 from JSON
Processed 129360406 from JSON
Processed 93945638 from JSON
Processed 93953958 from JSON
Processed 93952412 from JSON
Processed 93957068 from JSON
Processed 93948288 from JSON
Processed 94067683 from JSON
Processed 83838117 from JSON
Processed 83852566 from JSON
Processed 83847687 from JSON
Processed 73987670 from JSON
Processed 42685666 from JSON
Processed 42737929 from JSON
Processed 141535826 from JSON
Processed 42709067 from JSON
Processed 129214547 from JSON
Processed 78508796 from JSON
Processed 141531539 from JSON
Processed 141530707 from JSON
Processed 141530815 from JSON
Processed 141535719 from JSON
Processed 129351689 from JSON
Processed 93948225 from JSON
Processed 73390778 from JSON
Processed 86415307 from JSON
Processed 86415585 from JSON
Processed 93907613 from JSON
Processed 83860128 from JSON
Processed 83838396 from JSON
Processed 83863754 from JSON
Processed 83858768 from JSON
Proces

Processed 86414323 from JSON
Processed 73408557 from JSON
Processed 73409093 from JSON
Processed 73408773 from JSON
Processed 42747087 from JSON
Processed 42748600 from JSON
Processed 42752834 from JSON
Processed 78509679 from JSON
Processed 78507673 from JSON
Processed 78507488 from JSON
Processed 141538256 from JSON
Processed 42681963 from JSON
Processed 93939510 from JSON
Processed 93910508 from JSON
Processed 93943946 from JSON
Processed 93947070 from JSON
Processed 93944918 from JSON
Processed 129233964 from JSON
Processed 73989191 from JSON
Processed 73987652 from JSON
Processed 93895212 from JSON
Processed 129265378 from JSON
Processed 129268182 from JSON
Processed 129268834 from JSON
Processed 129269527 from JSON
Processed 129356230 from JSON
Processed 93942741 from JSON
Processed 83869517 from JSON
Processed 83836249 from JSON
Processed 83836650 from JSON
Processed 73349162 from JSON
Processed 83842316 from JSON
Processed 83842967 from JSON
Processed 83839147 from JSON
Process

Processed 94025848 from JSON
Processed 74251795 from JSON
Processed 74251252 from JSON
Processed 146473522 from JSON
Processed 146472257 from JSON
Processed 94056134 from JSON
Processed 129354224 from JSON
Processed 129357494 from JSON
Processed 129363237 from JSON
Processed 141532659 from JSON
Processed 93907882 from JSON
Processed 93915955 from JSON
Processed 93947042 from JSON
Processed 93946830 from JSON
Processed 93943824 from JSON
Processed 129361680 from JSON
Processed 129224567 from JSON
Processed 129355676 from JSON
Processed 93886529 from JSON
Processed 129203131 from JSON
Processed 129238032 from JSON
Processed 78510782 from JSON
Processed 73387570 from JSON
Processed 73357496 from JSON
Processed 83846492 from JSON
Processed 83866522 from JSON
Processed 83869172 from JSON
Processed 73378699 from JSON
Processed 73355678 from JSON
Processed 83841358 from JSON
Processed 83833425 from JSON
Processed 83836277 from JSON
Processed 73353766 from JSON
Processed 83845615 from JSON
Pro

Processed 83860964 from JSON
Processed 83859582 from JSON
Processed 83861563 from JSON
Processed 83862270 from JSON
Processed 83855807 from JSON
Processed 83851160 from JSON
Processed 83849848 from JSON
Processed 83863862 from JSON
Processed 73360346 from JSON
Processed 83849088 from JSON
Processed 83847981 from JSON
Processed 93877417 from JSON
Processed 83866499 from JSON
Processed 83832849 from JSON
Processed 83838241 from JSON
Processed 83839167 from JSON
Processed 83853275 from JSON
Processed 73393242 from JSON
Processed 73361148 from JSON
Processed 83847160 from JSON
Processed 73420526 from JSON
Processed 73407389 from JSON
Processed 73408623 from JSON
Processed 84092329 from JSON
Processed 84092855 from JSON
Processed 73420728 from JSON
Processed 84094969 from JSON
Processed 73953774 from JSON
Processed 84330096 from JSON
Processed 84327283 from JSON
Processed 42680346 from JSON
Processed 73992924 from JSON
Processed 73987595 from JSON
Processed 84333196 from JSON
Processed 8432

Processed 83833450 from JSON
Processed 94034127 from JSON
Processed 94075171 from JSON
Processed 129353838 from JSON
Processed 93940465 from JSON
Processed 83840614 from JSON
Processed 74251340 from JSON
Processed 42698180 from JSON
Processed 73958868 from JSON
Processed 84330224 from JSON
Processed 84331955 from JSON
Processed 84331272 from JSON
Processed 73990846 from JSON
Processed 73993061 from JSON
Processed 42728177 from JSON
Processed 42736156 from JSON
Processed 93908285 from JSON
Processed 93907530 from JSON
Processed 93912161 from JSON
Processed 93867227 from JSON
Processed 78508000 from JSON
Processed 93946665 from JSON
Processed 83860612 from JSON
Processed 83852880 from JSON
Processed 83853061 from JSON
Processed 83852789 from JSON
Processed 83855500 from JSON
Processed 83847817 from JSON
Processed 83850365 from JSON
Processed 83855719 from JSON
Processed 83842711 from JSON
Processed 129233569 from JSON
Processed 83865840 from JSON
Processed 83870426 from JSON
Processed 83

Processed 94047681 from JSON
Processed 84094468 from JSON
Processed 84093340 from JSON
Processed 86421626 from JSON
Processed 84326431 from JSON
Processed 86415559 from JSON
Processed 84329252 from JSON
Processed 73414888 from JSON
Processed 73405938 from JSON
Processed 78507488 from JSON
Processed 141531952 from JSON
Processed 141532753 from JSON
Processed 93891203 from JSON
Processed 93936890 from JSON
Processed 93887776 from JSON
Processed 93942772 from JSON
Processed 42638983 from JSON
Processed 83839161 from JSON
Processed 73384954 from JSON
Processed 73394009 from JSON
Processed 83857668 from JSON
Processed 83865306 from JSON
Processed 83846248 from JSON
Processed 83850977 from JSON
Processed 73377839 from JSON
Processed 83864132 from JSON
Processed 94078807 from JSON
Processed 129358344 from JSON
Processed 129361018 from JSON
Processed 129243468 from JSON
Processed 73961238 from JSON
Processed 83838221 from JSON
Processed 83847342 from JSON
Processed 86419132 from JSON
Processed

Processed 86420799 from JSON
Processed 129240605 from JSON
Processed 141531716 from JSON
Processed 93900853 from JSON
Processed 129356830 from JSON
Processed 93949731 from JSON
Processed 93948310 from JSON
Processed 83835194 from JSON
Processed 146474357 from JSON
Processed 73374050 from JSON
Processed 93937091 from JSON
Processed 146471442 from JSON
Processed 94041308 from JSON
Processed 94049681 from JSON
Processed 146473431 from JSON
Processed 83867553 from JSON
Processed 73384129 from JSON
Processed 83848680 from JSON
Processed 78509187 from JSON
Processed 141538825 from JSON
Processed 73441981 from JSON
Processed 84093274 from JSON
Processed 93955721 from JSON
Processed 83831680 from JSON
Processed 93888084 from JSON
Processed 129235802 from JSON
Processed 86422136 from JSON
Processed 84331601 from JSON
Processed 86415268 from JSON
Processed 146471597 from JSON
Processed 141534242 from JSON
Processed 141538453 from JSON
Processed 141532549 from JSON
Processed 83844191 from JSON
Pr

Done! We have a bunch of files in our database! (8848 in the demo)

## Indexing - Elasticsearch 

We use Elasticsearch for quick and very flexible (elastic?) queries across the full text of the articles, so we first have to index all the content from the database.

Once again, we have to [install](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html) it and run the service... yay!

Once it's running, we can connect and check the cluster status:

In [182]:
import pymongo
import elasticsearch
import requests
import nltk
from elasticsearch import helpers

In [157]:
client = pymongo.MongoClient('localhost:' + str(mongoDB_Port))
publications_collection = client.pub.publications
es = elasticsearch.Elasticsearch([{'host': 'localhost', 'port': 9200}],
                                 timeout=30, max_retries=10, retry_on_timeout=True)

In [155]:
es.cluster.health()

{'active_primary_shards': 0,
 'active_shards': 0,
 'active_shards_percent_as_number': 100.0,
 'cluster_name': 'elasticsearch',
 'delayed_unassigned_shards': 0,
 'initializing_shards': 0,
 'number_of_data_nodes': 1,
 'number_of_in_flight_fetch': 0,
 'number_of_nodes': 1,
 'number_of_pending_tasks': 0,
 'relocating_shards': 0,
 'status': 'green',
 'task_max_waiting_in_queue_millis': 0,
 'timed_out': False,
 'unassigned_shards': 0}
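
The `status` field is the quickest thing to check in that response: `green` means all shards are allocated, `yellow` means the primaries are allocated but some replicas are not, and `red` means at least one primary shard is missing. As a small sketch (the helper name `cluster_is_usable` is hypothetical, not part of the Elasticsearch client), the indexing steps could be guarded like this:

```python
def cluster_is_usable(health):
    """Return True when a cluster health response reports green or yellow
    and the health request itself did not time out."""
    status_ok = health.get('status') in ('green', 'yellow')
    return status_ok and not health.get('timed_out', False)


# Example with a response shaped like the one above (trimmed to a few keys)
sample_health = {'status': 'green', 'timed_out': False, 'number_of_nodes': 1}
print(cluster_is_usable(sample_health))
```

With a live connection, the same check would be `cluster_is_usable(es.cluster.health())`.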

In [156]:
def extract_metadata(documents):
    """Flatten the MongoDB publication records into the fields we want to index."""
    list_of_docs = []
    for r in documents:
        list_of_docs.append({
            "_id": r['_id'],
            "title": r['title'],
            "publication": r['journal'],
            "year": r['year'],
            "content": r['content']['fulltext'],
            "abstract": r['content']['abstract'],
            "authors": r['authors'],
            "references": r['content']['references']})
    return list_of_docs
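
`extract_metadata` assumes that every record carries all of these keys. If some records in your database might lack, say, an abstract or a reference list, a more defensive variant could fall back to defaults. This is only a sketch: the name `extract_metadata_safe` and the chosen default values are assumptions, not part of the original pipeline.

```python
def extract_metadata_safe(documents):
    """Like extract_metadata, but tolerate records with missing fields."""
    list_of_docs = []
    for r in documents:
        content = r.get('content') or {}
        list_of_docs.append({
            '_id': r['_id'],                        # the id is still required
            'title': r.get('title', ''),
            'publication': r.get('journal', ''),
            'year': r.get('year', ''),
            'content': content.get('fulltext', ''),
            'abstract': content.get('abstract', ''),
            'authors': r.get('authors', []),
            'references': content.get('references', []),
        })
    return list_of_docs


# A made-up record without an abstract or references
sample_docs = [{'_id': 'demo_1', 'title': 'A title', 'journal': 'arxiv',
                'year': 2017, 'content': {'fulltext': 'Some text.'}}]
print(extract_metadata_safe(sample_docs))
```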

### Index the full text of all articles in the database

In [162]:
filter_publications = ['arxiv'] # Here we could also put PubMed or other sources

extracted_publications = []
for publication in filter_publications:
    query = {'$and': [{'journal': publication}, {'content.fulltext': {'$exists': True}}]}
    results = publications_collection.find(query)
    extracted_publications.append(extract_metadata(results))

In [166]:
for publication in extracted_publications:
    actions = []
    for article in publication:
        authors = []
        if len(article['authors']) > 0:
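            if isinstance(article['authors'][0], list):
                try:
                    # Each entry is a pair; store it as "second, first"
                    for name in article['authors']:
                        authors.append(', '.join([name[1], name[0]]))
                    authors = list(set(authors))
                except (IndexError, KeyError, TypeError):
                    # Malformed author entry: keep whatever was collected so far
                    pass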
            else:
                authors = article['authors']
        actions.append({
            "_index": "ir", 
            "_type": "publications",  
            "_id": article['_id'],
            "_source": {
                "title": article["title"],
                "journal": article['publication'],
                "year": str(article['year']),
                "content": article["content"],
                "abstract": article["abstract"],
                "authors": authors,
                "references": article["references"]
            }
        })
    if len(actions) == 0:
        continue
    res = helpers.bulk(es, actions)
    print('Done with', res, 'articles added to index')

Done with (8848, []) articles added to index
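
The author handling in the cell above can also be pulled into a small function, which makes it easier to try on sample data. The sketch below is a variation rather than a copy: `normalize_authors` is a hypothetical name, it skips pairs that are too short instead of catching exceptions, and it removes duplicates with `dict.fromkeys` so that the original order is preserved (the cell above uses `set`, which does not keep order).

```python
def normalize_authors(authors):
    """Return a flat list of author strings.

    Accepts either a list of strings (returned as a copy) or a list of
    two-element pairs, which are joined as "second, first".
    """
    if not authors:
        return []
    if isinstance(authors[0], (list, tuple)):
        names = [', '.join([name[1], name[0]])
                 for name in authors if len(name) >= 2]
        return list(dict.fromkeys(names))  # de-duplicate, keep order
    return list(authors)


print(normalize_authors([['Ada', 'Lovelace'], ['Alan', 'Turing'], ['Ada', 'Lovelace']]))
```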


We can look for anything in the text, like this:

In [180]:
res = es.search(index = "ir", body = {"query": {"match": {"title" : "netflix"}}}, size = 10)

print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit['_id'], hit['_source']['title'])
    print(hit['_id'], hit['_source']['abstract'])
    print('-'*50)

Got 1 Hits:
arxiv_84094279 Re-Evaluating the Netflix Prize - Human Uncertainty and its Impact on
  Reliability
arxiv_84094279 In this paper, we examine the statistical soundness of comparative
assessments within the field of recommender systems in terms of reliability and
human uncertainty. From a controlled experiment, we get the insight that users
provide different ratings on same items when repeatedly asked. This volatility
of user ratings justifies the assumption of using probability densities instead
of single rating scores. As a consequence, the well-known accuracy metrics
(e.g. MAE, MSE, RMSE) yield a density themselves that emerges from convolution
of all rating densities. When two different systems produce different RMSE
distributions with significant intersection, then there exists a probability of
error for each possible ranking. As an application, we examine possible ranking
errors of the Netflix Prize. We are able to show that all top rankings are more
or less subject to h
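
The query above only looks at the `title` field. To search several fields at once, the Elasticsearch query DSL offers `multi_match`. A minimal sketch of building such a request body follows; the helper name `build_multi_match_query` is hypothetical, and the field names are the ones we put into the `ir` index.

```python
def build_multi_match_query(text, fields):
    """Build a search body that matches `text` against any of `fields`."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}


body = build_multi_match_query("netflix", ["title", "abstract", "content"])
print(body)
# With a running cluster this could be sent as:
# res = es.search(index="ir", body=body, size=10)
```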

### Sentence Indexing 
For the sentence expansion step of our process, we need to have sentences indexed. This is because we need to find similar sentences as well as their surrounding sentences, for context. 

In addition, we write the full text of all our articles to a single file, which we will use later to train the embedding models.
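
The pairing idea can be sketched in isolation before involving Elasticsearch: every sentence is joined with a neighboring sentence for context, and sentences whose tokens are very short on average are skipped. The function below is a simplified illustration under stated assumptions: `pair_with_neighbor` is a hypothetical name, and `str.split` stands in for `nltk.word_tokenize` so that the example runs without NLTK data.

```python
def pair_with_neighbor(sentences, min_avg_word_len=4):
    """Join each sentence with the previous one (the first with the next one),
    skipping sentences whose average token length is below the threshold."""
    if len(sentences) < 2:
        return []
    pairs = []
    for i, sentence in enumerate(sentences):
        words = sentence.split()  # stand-in for nltk.word_tokenize
        if not words:
            continue
        if sum(len(w) for w in words) / len(words) < min_avg_word_len:
            continue
        neighbor = sentences[i - 1] if i > 0 else sentences[i + 1]
        pairs.append(sentence + ' ' + neighbor)
    return pairs


demo_sentences = ['Neural networks learn representations.',
                  'It is so.',
                  'Recommender systems predict ratings.']
for pair in pair_with_neighbor(demo_sentences):
    print(pair)
```

The short middle sentence is filtered out as an entry of its own, but it still appears as context for both of its neighbors.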

In [199]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In [201]:
f.close()

In [None]:
for publication in extracted_publications:
    for article in publication:
        actions = []
        dataset_sent = []

        lines = sent_detector.tokenize(article['content'].strip())

        # This will be used for the training of word2vec and doc2vec
        # You need to create the data folder beforehand
        with open('data/full_text_corpus.txt', 'a', encoding='utf-8') as f:
            for line in lines:
                # One sentence per line, so that sentences do not run together
                f.write(line + '\n')

        if len(lines) < 3:
            continue

        for i in range(len(lines)):
            words = nltk.word_tokenize(lines[i])
            if not words:
                continue
            # Skip sentences whose tokens are very short on average
            word_lengths = [len(x) for x in words]
            average_word_length = sum(word_lengths) / len(word_lengths)
            if average_word_length < 4:
                continue

            # Pair each sentence with the previous one; the first sentence
            # has no predecessor, so it is paired with the next one instead
            if i > 0:
                two_sentences = lines[i] + ' ' + lines[i - 1]
            else:
                two_sentences = lines[i] + ' ' + lines[i + 1]

            dataset_sent.append(two_sentences)

        for num, added_lines in enumerate(dataset_sent):
            actions.append({
                "_index": "twosent",
                "_type": "twosentnorules",
                "_id": article['_id'] + str(num),
                "_source": {
                    "title": article['title'],
                    "content.chapter.sentpositive": added_lines,
                    "paper_id": article['_id']
                }})

        if len(actions) == 0:
            continue
        res = helpers.bulk(es, actions)
    print('Done')

In [193]:
res = es.search(index = "twosent", body = {"query": {"match": {"content.chapter.sentpositive" : "netflix"}}}, size = 10)

print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit['_id'], hit['_source']['title'])
    print('-'*50)

Got 174 Hits:
arxiv_14153268986 Learning to Rank based on Analogical Reasoning
--------------------------------------------------
arxiv_146472211389 One-sided Differential Privacy
--------------------------------------------------
arxiv_129348860158 Differentially Private Query Learning: from Data Publishing to Model
  Publishing
--------------------------------------------------
arxiv_14153268985 Learning to Rank based on Analogical Reasoning
--------------------------------------------------
arxiv_129348860154 Differentially Private Query Learning: from Data Publishing to Model
  Publishing
--------------------------------------------------
arxiv_129348860153 Differentially Private Query Learning: from Data Publishing to Model
  Publishing
--------------------------------------------------
arxiv_93942601272 Interacting Attention-gated Recurrent Networks for Recommendation
--------------------------------------------------
arxiv_129348860157 Differentially Private Query Learning: from

### Doc2vec Indexing

Once again, we index sentences, this time taken from the full-text file created before. This index will be used for the Sentence Expansion.

In [197]:
with open('data/full_text_corpus.txt', 'r', encoding='utf-8') as file:
    text = file.read()
sentences = nltk.tokenize.sent_tokenize(text)
print('Sentences ready')
count = 0
docLabels = []
actions = []

for i, sent in enumerate(sentences):
    # Use the next sentence as the neighbor; the last sentence
    # falls back to the previous one
    if i + 1 < len(sentences):
        neighbors = sentences[i + 1]
        neighbor_count = count + 1
    else:
        neighbors = sentences[i - 1]
        neighbor_count = count - 1

    docLabels.append(count)
    actions.append({
        "_index": "devtwosentnew",
        "_type": "devtwosentnorulesnew",
        "_id": count,
        "_source": {
            "content.chapter.sentpositive": sent,
            "content.chapter.sentnegtive": neighbors,
            "neighborcount": neighbor_count
        }})
    count = count + 1

print(len(sentences))
print(len(docLabels))
res = helpers.bulk(es, actions)
print(res)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 1144: invalid start byte
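
The `UnicodeDecodeError` above means that the corpus file contains at least one byte (`0x96`) that is not valid UTF-8. In Windows-1252 that byte is an en dash, so one possible cause is that part of the file was written earlier without `encoding='utf-8'`; that is a guess, not something verified here. A minimal sketch of a more tolerant read follows: `read_text_lenient` is a hypothetical helper that passes `errors='replace'` to `open`, so undecodable bytes become the Unicode replacement character instead of aborting the whole cell.

```python
import os
import tempfile


def read_text_lenient(path, encoding='utf-8'):
    """Read a text file, replacing undecodable bytes with U+FFFD."""
    with open(path, 'r', encoding=encoding, errors='replace') as handle:
        return handle.read()


# Demo on a throwaway file containing the offending byte
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(b'pages 10\x9620 of the corpus')
    demo_path = tmp.name

demo_text = read_text_lenient(demo_path)
os.remove(demo_path)
print(repr(demo_text))
```

In the cell above, the equivalent change would be to read `data/full_text_corpus.txt` with `read_text_lenient`. Regenerating the corpus file from scratch with a consistent encoding is the cleaner fix if the replaced characters matter for your embeddings.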