# Walkthrough example: Train Document Embeddings

This is a simple example of how to use the scripts in this repository to train document embeddings on a given dataset.

First, the usual imports:

In [1]:
import os
import logging
from configparser import ConfigParser
from load import DatasetGenerator,ModelsLoader,load_and_process
from preprocess import GensimPreprocessor,DocumentsTagged
from train_models import ModelsTrainer
from quality_check import QualityChecker

If you wish to use the scripts with a configuration file, please take a look at the ones in the folder `/configs`. In this walkthrough I will use `/configs/train_newspapers_standard.config` (train document embeddings for newspaper classification).

To use the configuration file you can rely on the Python built-in *configparser*, in this way:

In [2]:
configuration = ConfigParser(allow_no_value=False)
configuration.read('./configs/train_newspapers_standard.config')

['./configs/train_newspapers_standard.config']

This simply loads a dictionary-like object holding all the parameters you can set, grouped by section.

In [3]:
configuration.sections()

['LOADING', 'PREPROCESS', 'PARAMETERS', 'TRAIN', 'LABELS']
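
One detail worth keeping in mind is that *configparser* stores every value as a string. The sketch below illustrates this on a small, made-up configuration text (the keys mirror some of those used in this walkthrough, but the text itself is only an assumption for illustration) and shows the typed getters and the manual split needed for comma-separated values.

```python
from configparser import ConfigParser

# Hypothetical configuration text, mirroring a few keys shown in this walkthrough.
SAMPLE = """
[LOADING]
load = es
es_indices = new_guardian,new_independent

[TRAIN]
alpha = 0.025
epochs = 20
shuffle = True
"""

parser = ConfigParser(allow_no_value=False)
parser.read_string(SAMPLE)

# Every value is stored as a string; the typed getters do the conversion.
epochs = parser['TRAIN'].getint('epochs')
alpha = parser['TRAIN'].getfloat('alpha')
shuffle = parser['TRAIN'].getboolean('shuffle')

# Comma-separated values have to be split by hand.
indices = parser['LOADING']['es_indices'].split(',')

print(parser.sections(), epochs, alpha, shuffle, indices)
```

How the scripts in this repository convert the values internally may differ; this only shows what the standard library offers.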

## LOAD

In [4]:
for key,value in configuration['LOADING'].items(): print(key,value, sep = ' = ')

load = es
json_path = 
es_server = localhost
es_indices = new_guardian,new_independent,new_telegraph,new_dailymail
es_query = {"query": {"bool" : { "must" : [{"range": {"publication_date": {"gte": "2016-01-01","lte": "2017-10-31"}}},{"query_string": {"fields" : ["body"], "query" : "brexit OR referendum"}}] }}}
doc_ids = url,url,url,url
doc_texts = body,body,body,body
constuct_new_models = True


`DatasetGenerator` takes these parameters and loads the data accordingly. `load` is the source of the data: right now you can load either from an ES (Elasticsearch) instance or from JSON files. If you want to use ES you need to specify: the indices you want to retrieve from, a general query that is used for all the search operations, the fields that contain a unique identifier for each document, and the fields that store the actual text (ORDER MATTERS). If you load data from JSON files instead, you need to specify the path where the files are stored. In this case the script assumes that __`doc_ids` and `doc_texts` are the same for all JSON files!__
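
To make the "ORDER MATTERS" remark concrete, here is a minimal sketch of how comma-separated settings could be paired position by position. The helper `pair_sources` is hypothetical and not part of this repository; I am only assuming that the first index goes with the first id field and the first text field, and so on.

```python
def pair_sources(es_indices, doc_ids, doc_texts):
    """Split comma-separated settings and pair them position by position.

    Hypothetical helper: it only illustrates why the order of the values matters.
    """
    indices = es_indices.split(',')
    ids = doc_ids.split(',')
    texts = doc_texts.split(',')
    if not (len(indices) == len(ids) == len(texts)):
        raise ValueError('es_indices, doc_ids and doc_texts must have the same length')
    # The i-th index is read with the i-th id field and the i-th text field.
    return list(zip(indices, ids, texts))


pairs = pair_sources('new_guardian,new_independent', 'url,url', 'body,body')
for index, id_field, text_field in pairs:
    print(index, id_field, text_field)
```

If two indices used different field names, swapping the order of the values would silently read the wrong field, which is why the lists must be aligned.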

In [None]:
data = DatasetGenerator.load_data_from_config(configuration['LOADING'])

In case you wish to load data independently from a configuration file you can use:

In [None]:
data = DatasetGenerator.load_elasticsearch(es_server = ['localhost'], es_indices = ['new_guardian'],
                                           es_query = {"query" : {"match_all" : {}}, "sort" : ["_doc"]},
                                           doc_ids = ['url'],
                                           doc_texts = ['body'])

In this case I will use some data I have stored locally in a JSON file (since I did not find a way to connect this notebook to the server).

In [5]:
data = DatasetGenerator.load_json(folder = '/home/ele/Scrivania/HPI/data/websci/samuele.garda/politics',
                                 doc_id = 'url',
                                 doc_text = 'body')
print(next(data))

2018-01-23 16:36:58,459 : INFO : load: Loading data from `/home/ele/Scrivania/HPI/data/websci/samuele.garda/politics/2017.json`


{'author': ['https://www.facebook.com/bbcnews'], 'publication_date': '27 September 2017', 'news_keywords': '---', 'tags': ['http://www.bbc.co.uk/news/uk-politics-41407356'], 'words': 'Liam Fox, Boris Johnson and Priti Patel argued open markets are the best vehicle for reducing poverty and aiding prosperity at an event in London. Free of the "constraints" of the EU, the UK must be an "agitator" for free trade, the foreign secretary said.Meanwhile an ex-Tory leader has warned the UK must prepare for no Brexit deal.Critics say failure to do a Brexit deal could result in new trade barriers but Iain Duncan Smith said the EU must agree to open trade discussions by December or the UK should make arrangements to leave without a deal. Boeing warned over UK government contractsAccusing the EU of "arrogant behaviour... bordering on the deliberately offensive", he said the UK must "throw resources" at a no deal scenario, arguing the UK\'s reach in terms of trade was second to none. As talks on the

You may have noticed that the fields `url` and `body` have been converted to `tags` and `words`. This is for compatibility with `gensim`.
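
The renaming can be pictured with the following sketch. The function `to_gensim_fields` is a hypothetical stand-in, not the repository's code; it only assumes that the id ends up in a one-element list under `tags` and the text under `words`, as in the output above.

```python
def to_gensim_fields(doc, doc_id, doc_text):
    """Return a copy of `doc` with the id field under `tags` and the text under `words`.

    Hypothetical helper written for illustration only.
    """
    # Keep every other field untouched.
    renamed = {key: value for key, value in doc.items() if key not in (doc_id, doc_text)}
    renamed['tags'] = [doc[doc_id]]   # gensim expects a list of tags
    renamed['words'] = doc[doc_text]
    return renamed


raw = {'url': 'http://example.org/a', 'body': 'Some text.', 'author': 'someone'}
converted = to_gensim_fields(raw, doc_id='url', doc_text='body')
print(converted)
```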

Now let's load some models as well. If there are no models to train you can set `construct_new_models = True`. This will create as many models as there are possible configurations in the `PARAMETERS` section (please refer to the `gensim` documentation for more details). On the other hand, if you have already trained some models and you want to fine-tune them, you can load all the models from a directory or a single one. Again, you can use the specific functions if you do not want to stick with configuration files.
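
"As many models as the possible configurations" can be read as a Cartesian product over the parameter values. The sketch below shows that idea with `itertools.product`; `expand_grid` is a hypothetical name, and I am assuming (not claiming) that the repository expands the grid in a comparable way.

```python
from itertools import product


def expand_grid(parameters):
    """Return one dict per combination of the listed parameter values."""
    keys = sorted(parameters)  # fixed key order, so the result is reproducible
    value_lists = (parameters[key] for key in keys)
    return [dict(zip(keys, values)) for values in product(*value_lists)]


grid = expand_grid({'size': [100, 200], 'min_count': [5, 10]})
for combination in grid:
    print(combination)
```

Two values for each of two parameters give four combinations, so the grid grows quickly as parameters are added.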

In [None]:
single_model = ModelsLoader.load_single_model('./models/Doc2Vec(dbow,d100,n20,mc5,t32)')
models = ModelsLoader.load_from_dir('models')
new_models = ModelsLoader.construct_models(parameters_conf= {'size': [100, 200], 'min_count': [5, 10]})

Let's use the beloved configuration:

In [6]:
models_to_train = ModelsLoader.load_models_from_config(configuration['LOADING'],configuration['PARAMETERS'])
print(models_to_train)

2018-01-23 16:37:01,215 : INFO : load: Loaded : 10 models


[<gensim.models.doc2vec.Doc2Vec object at 0x7fd00be706d8>, <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70390>, <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be702e8>, <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70630>, <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70588>, <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70780>, <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be701d0>, <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be704e0>, <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70438>, <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70828>]


## PREPROCESS

Now we can move to the preprocessing part. For now the only options are removing stopwords (a predefined set provided by `gensim`) and/or stemming. The preprocessor is the only part bound to a configuration file. This is because, when using the trained models, it is extremely important that new unseen input is processed in the same way as the training examples. If you do not want to work with a configuration file you will have to come up with your own preprocessing pipeline: you simply need a function that works on the `words` key of the dictionaries.
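
If you go for your own pipeline, a minimal sketch could look like the one below. Everything in it is an assumption made for illustration: the tiny stopword set, the regular-expression tokenizer and the function names are not taken from this repository, and stemming is left out to keep the example dependency-free.

```python
import re

# Tiny hypothetical stopword set; a real pipeline would use a proper list.
STOPWORDS = frozenset({'the', 'a', 'of', 'in', 'and'})


def preprocess_words(text, rm_stopwords=True):
    """Tokenize a string and optionally drop stopwords."""
    tokens = re.findall(r'[A-Za-z]+', text)
    if rm_stopwords:
        tokens = [token for token in tokens if token.lower() not in STOPWORDS]
    return tokens


def preprocess_stream(docs, rm_stopwords=True):
    """Lazily apply `preprocess_words` to the `words` key of each document."""
    for doc in docs:
        processed = dict(doc)  # leave the input dictionary untouched
        processed['words'] = preprocess_words(doc['words'], rm_stopwords)
        yield processed


docs = iter([{'tags': ['doc-1'], 'words': 'The leader of the party resigned.'}])
first = next(preprocess_stream(docs))
print(first['words'])
```

Whatever function you choose, the same one has to be applied to unseen documents before inferring their vectors.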

In [7]:
for key,value in configuration['PREPROCESS'].items(): print(key,value, sep = ' = ')

rm_stopwords = True
stem = True


In [8]:
gp = GensimPreprocessor(configuration['PREPROCESS'])
data_preprocessed = gp.preprocess_data(data)
print(next(data_preprocessed))

.{'author': ['https://www.facebook.com/bbcnews'], 'publication_date': '27 September 2017', 'news_keywords': '---', 'tags': ['http://www.bbc.co.uk/news/uk-scotland-scotland-politics-41413587'], 'words': ['Ms', 'Dugdale', 'told', 'the', 'BBC', 'there', 'had', 'been', 'a', 'lot', 'of', 'internal', 'problems', 'in', 'the', 'party', 'ahead', 'of', 'her', 'sudden', 'resignation', 'Her', 'allies', 'have', 'claimed', 'there', 'was', 'a', 'plot', 'against', 'her', 'after', 'interim', 'leader', 'Alex', 'Rowley', 'was', 'caught', 'on', 'tape', 'backing', 'leadership', 'candidate', 'Richard', 'Leonard', 'The', 'row', 'unfolded', 'as', 'Jeremy', 'Corbyn', 'spoke', 'at', 'the', 'Labour', 'conference', 'in', 'Brighton', 'During', 'his', 'speech', 'Mr', 'Corbyn', 'said', 'that', 'Labour', 'was', 'on', 'the', 'way', 'back', 'in', 'Scotland', 'and', 'thanked', 'Ms', 'Dugdale', 'for', 'her', 'work', 'The', 'dispute', 'which', 'has', 'involved', 'several', 'of', 'the', 'party', 's', 'most', 'prominent', '

gensim models expect a `namedtuple` as input. The original gensim class for converting the input (`TaggedDocument`) throws away all information except `tags` and `words`, so if you think you will need some other field in further processing you can use:

In [9]:
data_doc2vec = DocumentsTagged(data_preprocessed)

Each document field will be transformed into an attribute of the class `TaggedDocument`. You can access these attributes with the `.` operator. Since we are still working with a generator, we need:

In [10]:
corpus = load_and_process(data_doc2vec)

2018-01-23 16:37:07,509 : INFO : load: Documents in /home/ele/Scrivania/HPI/data/websci/samuele.garda/politics/2017.json discarded beacuse of missing field : 0





2018-01-23 16:37:07,513 : INFO : load: Retrieved 458 documents
2018-01-23 16:37:07,515 : INFO : utils: 'load_and_process'  0.65 s


In [11]:
corpus[0].category

'politics'
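
The attribute access shown above can be sketched with a plain `namedtuple` built from the keys of each dictionary. This is only an illustration under my own assumptions (the helper `to_tagged` is hypothetical); the actual `DocumentsTagged` class may be implemented differently.

```python
from collections import namedtuple


def to_tagged(doc):
    """Turn a document dictionary into a namedtuple that keeps every field.

    Hypothetical helper: a new namedtuple class is created per call, which is
    simple but not efficient for large corpora.
    """
    fields = sorted(doc)  # keys must be valid Python identifiers
    Tagged = namedtuple('TaggedDocument', fields)
    return Tagged(**doc)


tagged = to_tagged({
    'tags': ['http://example.org/a'],
    'words': ['labour', 'conference'],
    'category': 'politics',
})
print(tagged.category, tagged.tags, tagged.words)
```

Since the result still exposes `words` and `tags`, it can be used wherever those two attributes are what the consumer reads.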

## TRAIN

Now it is possible to actually train the models. First of all we need to initialize the models with the corpus we loaded (i.e. create the actual shallow neural network architecture for training embeddings).

In [12]:
for key,value in configuration['TRAIN'].items(): print(key,value, sep = ' = ')

alpha = 0.025
min_alpha = 0.001
epochs = 20
shuffle = True
adapt_alpha = False
chekpoint = ./models/newspaper/std
quality_check_infered = 5
quality_check_trained = 5


In [13]:
models = ModelsTrainer.init_models(models_to_train,corpus)

2018-01-23 16:37:12,159 : INFO : train_models: Initializing models with corpus...
2018-01-23 16:37:12,166 : INFO : doc2vec: collecting all words and their counts
2018-01-23 16:37:12,169 : INFO : doc2vec: PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-01-23 16:37:12,311 : INFO : doc2vec: collected 17378 word types and 458 unique tags from a corpus of 458 examples and 289581 words
2018-01-23 16:37:12,312 : INFO : word2vec: Loading a fresh vocabulary
2018-01-23 16:37:12,341 : INFO : word2vec: min_count=5 retains 5097 unique words (29% of original 17378, drops 12281)
2018-01-23 16:37:12,342 : INFO : word2vec: min_count=5 leaves 269830 word corpus (93% of original 289581, drops 19751)
2018-01-23 16:37:12,373 : INFO : word2vec: deleting the raw counts dictionary of 17378 items
2018-01-23 16:37:12,375 : INFO : word2vec: sample=0 downsamples 0 most-common words
2018-01-23 16:37:12,379 : INFO : word2vec: downsampling leaves estimated 269830 word corpus (100.0% of pr

In [14]:
print(models)

OrderedDict([('Doc2Vec(dm-c,d100,n20,w10,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be706d8>), ('Doc2Vec(dbow,d200,n20,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70390>), ('Doc2Vec(dbow,d100,n20,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be702e8>), ('Doc2Vec(dm-c,d100,n20,w5,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70630>), ('Doc2Vec(dm-s,d200,n20,w10,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70588>), ('Doc2Vec(dm-c,d200,n20,w5,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70780>), ('Doc2Vec(dm-s,d100,n20,w5,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be701d0>), ('Doc2Vec(dm-s,d200,n20,w5,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be704e0>), ('Doc2Vec(dm-s,d100,n20,w10,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70438>), ('Doc2Vec(dm-c,d200,n20,w10,mc5,t4)', <gensim.models.doc2vec.Doc2Vec object at 0x7fd00be70828>)])


Now the actual training can begin. You have to choose whether to stick with the standard gensim training function or to use a manually controlled learning rate decay. This is set via the `adapt_alpha` parameter.

If you think you know what you are doing, go for the manual control. This allows you to set an initial learning rate (`alpha`) and the minimum value you want to reach at the end of the process (`min_alpha`). For now the decay is set to `alpha_delta = (alpha - min_alpha) / epochs`; please take a look at `train_models` if you want to modify this. If you use this procedure and set `shuffle = True`, the documents are shuffled at each epoch. This is done because it has been reported (https://rare-technologies.com/doc2vec-tutorial/) that manually controlling the learning rate and shuffling the corpus at each epoch can lead to better embeddings. Since opinions about this are conflicting, the final word is left to you.

Otherwise (`adapt_alpha = False`) all these parameters are ignored and a single shuffle is done before training. The checkpoint sets the folder where each model is saved at the end of the training process. As before, we can use either the configuration file or the specific functions:

In [None]:
alpha = 0.025
min_alpha = 0.001
epochs = 10

ModelsTrainer.train_from_config(models,corpus,configuration['TRAIN'])


ModelsTrainer.train_manual_lr(models = models ,corpus = corpus ,epochs = epochs,
                              alpha = alpha,minalpha = min_alpha,alpha_delta = (alpha - min_alpha) / epochs,
                              checkpoint = 'models')
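
The linear decay `alpha_delta = (alpha - min_alpha) / epochs` described above can be checked with a few lines of plain Python. This is a sketch under assumptions: `linear_alpha_schedule` is a hypothetical name, and I assume the rate is lowered after each epoch, which may not match what `train_models` does exactly.

```python
import random


def linear_alpha_schedule(alpha, min_alpha, epochs):
    """Learning rate used at the start of each epoch under a linear decay."""
    alpha_delta = (alpha - min_alpha) / epochs
    return [alpha - epoch * alpha_delta for epoch in range(epochs)]


# Values taken from the TRAIN section shown earlier.
schedule = linear_alpha_schedule(alpha=0.025, min_alpha=0.001, epochs=20)

corpus_order = ['doc-%d' % i for i in range(5)]  # placeholder documents
rng = random.Random(0)
for epoch_alpha in schedule[:3]:
    rng.shuffle(corpus_order)  # new document order for each epoch
    # A real loop would train for one epoch here, with the rate fixed at `epoch_alpha`.
    print(round(epoch_alpha, 4), corpus_order)
```

With these values the step is 0.0012 per epoch, so the rate would reach `min_alpha` only after the decrement that follows the last epoch.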

Since 10 models are a bit too many for illustration purposes, I will train just one using the specific function. Do not forget to create a checkpoint folder!

In [16]:
logging.getLogger("gensim").setLevel(logging.WARNING)  # gensim is quite verbose...

model = {next(iter(models.keys())) : next(iter(models.values()))} # pick just one model here

os.makedirs('./models', exist_ok=True)  # create the checkpoint folder; no error if it already exists

ModelsTrainer.train_standard(models = model ,corpus = corpus ,epochs = 3,shuffle_docs  = True,checkpoint = './models')


2018-01-23 16:37:25,124 : INFO : train_models: Training models with standar learning rate decay...

2018-01-23 16:37:25,128 : INFO : train_models: Training model Doc2Vec(dm-c,d100,n20,w10,mc5,t4)
2018-01-23 16:37:52,788 : INFO : train_models: Doc2Vec(dm-c,d100,n20,w10,mc5,t4) completed training in 27.657s

2018-01-23 16:37:52,872 : INFO : utils: 'train_standard'  27.75 s


Finally, the last step is to evaluate the quality of our embeddings. This kind of inspection can be performed via the `QualityChecker` class. You can also inherit from this class if you wish to implement your own evaluation metrics!

In [None]:
print(model)

In [17]:
checker = QualityChecker(models=model,corpus=corpus)

In [18]:
checker.base_check_from_config(config_train=configuration['TRAIN'])

2018-01-23 16:38:06,300 : INFO : quality_check: Doc2Vec(dm-c,d100,n20,w10,mc5,t4) - training - most similar for ['http://www.bbc.co.uk/news/uk-politics-41887401'] : [('http://www.bbc.co.uk/news/uk-politics-41604675', 0.9974935054779053), ('http://www.bbc.co.uk/news/uk-politics-41941414', 0.9966021776199341), ('http://www.bbc.co.uk/news/uk-politics-41437636', 0.9963968992233276), ('http://www.bbc.co.uk/news/uk-politics-39294904', 0.9953902959823608), ('http://www.bbc.co.uk/news/uk-politics-41642051', 0.9949260354042053)]

2018-01-23 16:38:06,437 : INFO : quality_check: Doc2Vec(dm-c,d100,n20,w10,mc5,t4) - inferred - most similar for ['http://www.bbc.co.uk/news/uk-politics-41887401'] : [('http://www.bbc.co.uk/news/uk-politics-41544588', 0.973179817199707), ('http://www.bbc.co.uk/news/uk-politics-41585430', 0.9663538336753845), ('http://www.bbc.co.uk/news/uk-politics-41961389', 0.9655001759529114), ('http://www.bbc.co.uk/news/uk-scotland-scotland-politics-41939416', 0.9604285955429077), ('

This will call:

In [None]:
checker.inferred_most_similar(topk = 3)

In [None]:
checker.trained_most_similar(topk=3)

This basic check assesses:

- whether the trained vectors capture the structure of the data: you can manually inspect whether the documents with the highest cosine similarity are actually similar (deploying on trained data);
- whether the vector inferred for a random document has the highest cosine similarity with that document's original trained vector (deploying the model on unseen examples).
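
The second check boils down to ranking the trained vectors by cosine similarity against an inferred vector and looking at the top hit. Below is a minimal, dependency-free sketch of that ranking; the toy vectors and the function names are made up for illustration and are not the `QualityChecker` implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, vectors, topk=3):
    """Return the `topk` (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query, vector)) for tag, vector in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topk]


# Toy trained vectors and a toy vector "inferred" for doc-a.
trained = {'doc-a': [1.0, 0.0], 'doc-b': [0.0, 1.0], 'doc-c': [0.7, 0.7]}
inferred_for_a = [0.9, 0.1]

ranking = rank_by_similarity(inferred_for_a, trained)
print(ranking)
print('top hit is the original document:', ranking[0][0] == 'doc-a')
```

If the top hit is frequently a different document, the inferred vectors are probably not close enough to the trained ones to be trusted on unseen examples.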