Add layout features to GROBID model #42

de-code · 2019-07-08T09:14:44Z

Something you are already well aware of but I thought it's good to have an issue to record the discussion around it. I am not sure whether you already experimented with adding layout features.

I've started doing it and implemented something here: elifesciences/sciencebeam-trainer-delft#16

Maybe you'll find some of it useful. (I don't want to flood you with too many PRs)

kermitt2 · 2019-07-10T12:21:36Z

Thank you! Yes, it's something we definitely want to add. Layout features should bring some improvements and make the models competitive with the current CRF models that use them (and maybe even better).

Just had a quick look at your implementation, great work I think! We can ignore the lexical features like prefix/suffix (the first 8), and also some beyond 9, like shadow number and so on (which have many values), because we already have a character input channel in every architecture. Simple one-hot encoding of the layout features, as you are doing, is probably enough I think (note: there is a dense_to_one_hot() function in preprocess.py already used for the case features of one model).
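To make the one-hot idea concrete, here is a minimal sketch in plain Python. It is not the dense_to_one_hot() function from preprocess.py (its signature is not shown in this thread); the function name `one_hot_encode_features`, the vocabularies and the feature values are all made up for illustration.

```python
def one_hot_encode_features(sequence, vocabularies):
    """Encode each token's categorical features as concatenated one-hot blocks.

    sequence: list of tokens, each token being a list of feature values (strings)
    vocabularies: one dict per feature column, mapping value -> index
    (both structures are assumptions made for this sketch)
    """
    encoded = []
    for token_features in sequence:
        vector = []
        for value, vocab in zip(token_features, vocabularies):
            block = [0.0] * len(vocab)
            index = vocab.get(value)
            if index is not None:  # values unseen at training time stay all-zero
                block[index] = 1.0
            vector.extend(block)
        encoded.append(vector)
    return encoded


# Hypothetical vocabularies for two layout-like feature columns.
vocabularies = [
    {"LINESTART": 0, "LINEIN": 1, "LINEEND": 2},
    {"0": 0, "1": 1},
]
sequence = [["LINESTART", "1"], ["LINEIN", "0"], ["LINEEND", "0"]]
encoded = one_hot_encode_features(sequence, vocabularies)
print(len(encoded), len(encoded[0]))  # 3 tokens, 5 values per token
```

The width of each token vector is the sum of the vocabulary sizes, which is why features with many distinct values quickly become expensive with this encoding.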

The "gazetteers" features could help, but they typically don't help NER models, so that might also be the case here.

We could imagine a first pass in the reader to see the number of values for each feature and use that to select which one will be used for one-hot encoding and concatenation.
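A rough sketch of such a first pass, assuming the reader yields sequences of tokens where each token is a list of feature values. The function names, the toy data and the threshold are illustrative assumptions, not existing DeLFT code.

```python
def count_feature_cardinalities(sequences):
    """First pass: count the distinct values seen in each feature column."""
    values_per_column = []
    for sequence in sequences:
        for token_features in sequence:
            # Grow the list of sets if this token has more columns than seen so far.
            while len(values_per_column) < len(token_features):
                values_per_column.append(set())
            for column, value in enumerate(token_features):
                values_per_column[column].add(value)
    return [len(values) for values in values_per_column]


def select_feature_indices(cardinalities, max_cardinality):
    """Keep only the columns whose number of distinct values is small enough.

    Whether the bound is inclusive is a design choice; it is inclusive here.
    """
    return [index for index, count in enumerate(cardinalities)
            if count <= max_cardinality]


# Toy data: column 0 is the token, column 1 a prefix, column 2 a layout feature.
sequences = [
    [["the", "t", "LINESTART"], ["cat", "c", "LINEIN"]],
    [["dog", "d", "LINESTART"], ["ran", "r", "LINEEND"]],
]
cardinalities = count_feature_cardinalities(sequences)
selected = select_feature_indices(cardinalities, max_cardinality=3)
print(cardinalities, selected)
```

Only the selected columns would then be one-hot encoded and concatenated to the other inputs.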

Thanks a lot for the great contribution! (and sorry for not being very responsive currently)

de-code · 2019-07-26T15:23:57Z

Finally got around to doing some end-to-end evaluation.

Not yet using GROBID's evaluation (the first attempt failed and it doesn't quite fit into my workflow yet).

I haven't done it on the full PMC sample 1943 dataset but rather a random sample of 390.

On our author submitted dataset (not trained on yet) this looks about:

(Since 0.5.5 it has been failing to convert a number of those manuscripts, which I will need to investigate; they are just ignored rather than counted negatively.)

lfoppiano · 2019-12-09T05:56:53Z

Implementation question, do we want to add a parameter in command line that says --use_features or --ignore-features?

I noticed that @de-code implemented a long list of command-line parameters in https://github.com/elifesciences/sciencebeam-trainer-delft, which might be hard to navigate, but is useful for quick scripting. What's your opinion?

The reason I'm asking is that right now I would like to run a quick test to see whether the features have an impact; a command-line parameter would allow me to run the command twice without touching anything.

kermitt2 · 2019-12-09T06:30:44Z

We probably want yes :)
Likely a --ignore-features I guess, because when features are available I think they will likely improve the results for many grobid models (because they capture layout information), so should be the default.
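As a sketch of what such a flag could look like with argparse: the parser below is hypothetical and is not the actual grobidTagger.py argument list. Features are used by default whenever they are available and are only dropped when the flag is given.

```python
import argparse


def build_parser():
    # Hypothetical parser for illustration; the real script has other arguments.
    parser = argparse.ArgumentParser(description="Train a sequence labelling model")
    parser.add_argument(
        "--ignore-features",
        action="store_true",
        help="do not use the additional (layout) features even if they are present",
    )
    return parser


def should_use_features(argv, features_available):
    """Features are on by default when the data provides them."""
    args = build_parser().parse_args(argv)
    return features_available and not args.ignore_features


print(should_use_features([], features_available=True))
print(should_use_features(["--ignore-features"], features_available=True))
```

Running the same command once with and once without `--ignore-features` then gives the two runs needed for a quick comparison.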

lfoppiano · 2019-12-10T01:15:09Z

Thanks!
I'm also wondering whether we should pass the features to the CRF layer in some way, explicitly?

kermitt2 · 2019-12-10T01:34:10Z

No, the CRF layer just acts as an activation function before the output, computing the probability distributions of the possible labels from the last neuron layer. It's the role of the previous layers to "digest" these additional input features.

lfoppiano · 2019-12-10T01:44:05Z

I see.

Now, in the implementation, I select a feature only when the cardinality of its values appearing in the training data is below the feature max length (12).

@de-code implemented an additional parameter that allows the user to explicitly select which features to include. I think that would make the approach more resilient, for example by avoiding the inclusion of low-variability (potentially useless) features.

kermitt2 · 2019-12-10T02:51:40Z

I think all the features with cardinality more than 12 are useless because they are all character-based patterns (prefix, suffix, word shape, ...) and the DL architectures have a character input channel already specifically dedicated to this. So these features would very likely be redundant and could actually rather degrade the training (via usual overfitting problems because they are very specific).

Of course features with cardinality less than 12 can also be useless, typically casing and gazetteer are not helping, so a feature selection mechanism certainly makes sense too, though these features might just be ignored during training - something interesting to benchmark!

lfoppiano · 2019-12-10T03:05:43Z

OK, indeed.

Since @de-code already implemented something working, I would just integrate it as a list (including ranges). If not specified, the system will try automatic selection as I've implemented now.
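For illustration, parsing such a list with ranges could look like the sketch below. The `9-30` / `0,3,9-12` syntax and the function names are assumptions for this example, not necessarily the syntax used in either code base.

```python
def parse_feature_indices(spec):
    """Parse a string such as "0,3,9-12" into a list of feature indices.

    Ranges are inclusive on both ends (an assumption of this sketch).
    """
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices


def resolve_feature_indices(spec, automatic_indices):
    """Use the explicit list when given, otherwise fall back to automatic selection."""
    if spec:
        return parse_feature_indices(spec)
    return automatic_indices


print(parse_feature_indices("0,3,9-12"))
print(resolve_feature_indices(None, automatic_indices=[9, 10, 11]))
```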

de-code · 2019-12-10T09:49:50Z

Just some random thoughts on the feature indices:

* I wasn't sure how fixed the features are across the models, including GROBID submodules. Could they potentially provide different features?
* In general it probably doesn't make much sense to provide the first 8 features or so to DeLFT, as it should be the responsibility of the model to create those on the fly. But it's probably just easier to keep them the same as what is currently used for Wapiti.
* Keeping the feature indices internally has the advantage that they can be stored in the model config for visibility, and that an existing model can be loaded with those indices even after changing the default.
* I personally like being able to do as much hyper parameter "tuning" (hacking) via the command line as possible.
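The point about storing the feature indices in the model config can be sketched as follows. The key name `feature_indices` mirrors the model config posted later in this thread; the helper functions, the file name and the default value are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical default; it may change between releases without
# affecting models that were saved with their own indices.
DEFAULT_FEATURE_INDICES = list(range(9, 31))


def save_model_config(path, feature_indices):
    """Persist the indices the model was actually trained with."""
    with open(path, "w", encoding="utf-8") as config_file:
        json.dump({"feature_indices": feature_indices}, config_file, indent=4)


def load_feature_indices(path):
    """Prefer the indices stored with the model over the current default."""
    with open(path, encoding="utf-8") as config_file:
        config = json.load(config_file)
    return config.get("feature_indices", DEFAULT_FEATURE_INDICES)


with tempfile.TemporaryDirectory() as model_dir:
    config_path = os.path.join(model_dir, "config.json")
    save_model_config(config_path, [9, 10, 11])
    loaded_indices = load_feature_indices(config_path)

print(loaded_indices)  # the stored indices, not the default
```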

lfoppiano · 2019-12-11T01:06:25Z

Just some random thoughts on the feature indices:

* I wasn't sure how fixed the features are across the models, including GROBID submodules. Could they potentially provide different features?

Yes, it's up to the model designer / design

* In general it probably doesn't make much sense to provide the first 8 features or so to DeLFT as it should be the responsibility of the model to create those on the fly. But it's probably just easier to keep them the same as what is currently used for Wapiti.

Yes, indeed.

* Keeping the feature indices internally has the advantage that they can be stored in the model config for visibility and being able to load an existing model with those indices even after changing the default

Very good point.

* I personally like being able to do as much hyper parameter "tuning" (hacking) via the command line as possible

I see, for this I'm not trying to change the current approach at the moment

kermitt2 · 2019-12-11T02:40:38Z

I personally like being able to do as much hyper parameter "tuning" (hacking) via the command line as possible

There are too many different hyper parameters for each model, often several per layer, plus plenty of possible training parameters, so I think it's not manageable from the command line. What is often done in libraries supporting several architectures is to have dedicated config files, one for each architecture with its different hyper parameters (a bit like the current config file associated with each produced model).

kermitt2 · 2019-12-11T02:50:30Z

The question is maybe how much these models make sense outside Grobid. In my original intent, DeLFT was not supposed to provide its own controls over the Grobid models: Only Grobid, with delft interfaced via JEP, trains, evals and runs models because only Grobid can generate the training data with features and the data to be labeled with features (because features and tokens are usually derived from the PDF). So ideally the grobidTagger.py file was not supposed to stay or just as a way to debug models.

However, I had a problem training in DeLFT from Grobid, because the Python training process and its “stdout” output often get stuck when interfaced with JEP and never end. I didn't find enough time to solve the problem, so I added a training method that just calls grobidTagger.py train in an external process. It works of course, but it's more of a hack; normally it should also use JEP.

Honestly I don't feel very good building too much stuff to manage grobid training data and models in DeLFT, because it is natural to drive that from Grobid and it would be redundant, painful to maintain. At least having exactly the same input files for both Wapiti and DeLFT is a must I think to keep things minimally simple and transparent.

I don't know if I am very clear with my original design idea, but of course if grobid models as such, independently from Grobid (maybe the date or person parsing models), are useful, this effort could be justified.

de-code · 2019-12-12T10:11:17Z

Training via GROBID has the advantage that it is familiar to someone who has trained Wapiti before.

But for me personally, it doesn't work very well.

The main reason is that it requires a full GROBID setup, while I am trying to run the training on a separate machine on demand.
I do not own a GPU but borrow one from the cloud for a short period. There are tools to do that from a Python code base, but having to also have the GROBID setup (which isn't just a library or CLI call) would be a significant roadblock. And when running the training from DeLFT directly, I can run training in parallel.

There is also no reason why a machine learning expert shouldn't be able to just improve the model via the Python code / DeLFT.

As for CLI parameters vs config file:

Google, for example, offers hyperparameter tuning (which I haven't used), but as I understand it, it would also pass parameters to the CLI.

A config file is certainly better than having to make code changes. I would think of config files as something more persistent. For example we generate a config file as part of the model or the GROBID configuration. A config file could describe the default arguments while command line arguments could allow overriding the defaults. The command line parameters could be scoped and generated based on the shared "tuning parameters" available via a config file.
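One possible shape for this, sketched with argparse: the defaults come from a config (inlined here as a dict instead of being read from a file) and a command-line option is generated for each entry. The function names are made up; the parameter names and values are taken from the table of common parameters later in this thread, except `dropout`, which appears in the model configs.

```python
import argparse


def build_parser_from_defaults(defaults):
    """Generate one command-line option per tuning parameter in the config."""
    parser = argparse.ArgumentParser()
    for name, default in defaults.items():
        # Booleans would need dedicated handling (bool("False") is True),
        # so this sketch only covers int, float and str parameters.
        parser.add_argument(
            "--" + name.replace("_", "-"),
            type=type(default),
            default=default,
        )
    return parser


def resolve_parameters(defaults, argv):
    """Config values act as defaults; command-line values override them."""
    return vars(build_parser_from_defaults(defaults).parse_args(argv))


defaults = {"word_lstm_units": 100, "dropout": 0.5, "embeddings": "glove.840B.300d"}
parameters = resolve_parameters(defaults, ["--dropout", "0.3"])
print(parameters)
```

The resolved parameters could then be written back into the config file saved with the model, so a run stays reproducible even when values were overridden on the command line.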

The models themselves could probably be considered more generic. But maybe it makes sense for the CLI to be grobid specific until we have other use-cases?

kermitt2 · 2019-12-24T15:19:44Z

There is also no reason why a machine learning expert shouldn't be able to just improve the model via the Python code / DeLFT.

Ok indeed, good to have the possibility to train in DeLFT/python, I agree!

So the only remaining problem is that DeLFT alone cannot generate the training/eval files with the features (the .train and .test generated by Grobid). We could think about a mechanism to generate those files in Grobid and place them automatically in the data/ directory of DeLFT, so that it is then easy to switch to Python for training/tuning/evaluating/etc.

lfoppiano · 2019-12-24T22:31:45Z

We could add a dropwizard command in the grobid service to handle the integration with delft, such as generating the data and, if needed, other stuff.

de-code · 2020-01-03T10:40:39Z

I would love it if we could easily generate the training data. Even more so if we could parallelise it (e.g. via a cluster). A service would work well for that.

de-code · 2020-01-21T10:12:50Z

Here are the evaluation results of my implementation with different parameters...

All of them use the training and test data generated from GROBID 0.5.6.

Apart from the mentioned parameters it is using common parameters, such as:

name	value
embeddings	glove.840B.300d
word_lstm_units	100
action	train_eval
shuffle-input	True
random-seed	42

By feature embedding below, I mean a Dense layer after the feature input.
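As a sanity check on what the feature input and the Dense "feature embedding" add to the model, the parameter counts reported in the Keras summaries in this comment can be reproduced with the standard formulas for Dense and LSTM layers. The helper names below are my own; the dimensions are read from the summaries.

```python
def dense_params(input_dim, units):
    """Weights plus one bias per unit."""
    return input_dim * units + units


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * units * (input_dim + units + 1)


def bidirectional_lstm_params(input_dim, units):
    """A forward and a backward LSTM of the same size."""
    return 2 * lstm_params(input_dim, units)


word_dim = 300        # glove.840B.300d
char_lstm_out = 50    # output of the bidirectional char LSTM (2 x 25 units)
feature_dim = 77      # width of the one-hot feature input (max_feature_size)
word_lstm_units = 100

# Dense "feature embedding" from 77 one-hot values down to 50 dimensions.
print(dense_params(feature_dim, 50))  # 3900

# Word-level BiLSTM without features: input is 300 + 50 = 350.
print(bidirectional_lstm_params(word_dim + char_lstm_out, word_lstm_units))  # 360800

# Word-level BiLSTM with raw one-hot features: input is 300 + 50 + 77 = 427.
print(bidirectional_lstm_params(word_dim + char_lstm_out + feature_dim, word_lstm_units))  # 422400
```

Without the Dense layer, the 77 extra inputs cost 61,600 additional parameters in the BiLSTM, which is exactly the difference between the two total parameter counts (405,003 and 466,603).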

I am currently running the evaluation on the last epoch trained, although in some cases the f1 score for that epoch was going down, so probably should use the one with the highest score.

No features (epoch 53, eval f1 83.00)

model config

{
    "recurrent_dropout": 0.5,
    "max_sequence_length": 500,
    "embeddings_name": "glove.840B.300d",
    "batch_size": 10,
    "num_char_lstm_units": 25,
    "case_embedding_size": 5,
    "case_vocab_size": 8,
    "num_word_lstm_units": 100,
    "max_char_length": 30,
    "use_features": false,
    "model_name": "header",
    "char_vocab_size": 305,
    "feature_indices": [],
    "use_ELMo": false,
    "fold_number": 1,
    "feature_embedding_size": 0,
    "dropout": 0.5,
    "max_feature_size": 123581,
    "model_type": "CustomBidLSTM_CRF",
    "char_embedding_size": 25,
    "use_char_feature": true,
    "use_crf": true,
    "word_embedding_size": 300,
    "use_BERT": false
}

keras model summary

INFO	2020-01-20 12:46:11 +0000	master-replica-0		2055 train sequences
INFO	2020-01-20 12:46:11 +0000	master-replica-0		229 validation sequences
INFO	2020-01-20 12:46:11 +0000	master-replica-0		254 evaluation sequences
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		Layer (type)                    Output Shape         Param #     Connected to                     
INFO	2020-01-20 12:46:11 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:46:11 +0000	master-replica-0		char_input (InputLayer)         (None, None, 30)     0                                            
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		word_input (InputLayer)         (None, None, 300)    0                                            
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		concatenate_1 (Concatenate)     (None, None, 350)    0           word_input[0][0]                 
INFO	2020-01-20 12:46:11 +0000	master-replica-0		                                                                 char_lstm[0][0]                  
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		dropout_1 (Dropout)             (None, None, 350)    0           concatenate_1[0][0]              
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		bidirectional_2 (Bidirectional) (None, None, 200)    360800      dropout_1[0][0]                  
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO	2020-01-20 12:46:11 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:11 +0000	master-replica-0		chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO	2020-01-20 12:46:11 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:46:11 +0000	master-replica-0		Total params: 405,003
INFO	2020-01-20 12:46:11 +0000	master-replica-0		Trainable params: 405,003
INFO	2020-01-20 12:46:11 +0000	master-replica-0		Non-trainable params: 0

training summary

INFO	2020-01-20 20:35:05 +0000	master-replica-0		training runtime: 28134.038 seconds 
INFO	2020-01-20 20:35:05 +0000	master-replica-0		Evaluation:
INFO	2020-01-20 20:35:05 +0000	master-replica-0			f1 (micro): 67.72
INFO	2020-01-20 20:35:05 +0000	master-replica-0		                  precision    recall  f1-score   support
INFO	2020-01-20 20:35:05 +0000	master-replica-0		          <date>     0.7581    0.7015    0.7287        67
INFO	2020-01-20 20:35:05 +0000	master-replica-0		         <phone>     0.0000    0.0000    0.0000         3
INFO	2020-01-20 20:35:05 +0000	master-replica-0		         <email>     0.8020    0.8020    0.8020       101
INFO	2020-01-20 20:35:05 +0000	master-replica-0		      <abstract>     0.8224    0.8013    0.8117       156
INFO	2020-01-20 20:35:05 +0000	master-replica-0		        <pubnum>     0.4490    0.4583    0.4536        48
INFO	2020-01-20 20:35:05 +0000	master-replica-0		           <web>     0.5333    0.4444    0.4848        18
INFO	2020-01-20 20:35:05 +0000	master-replica-0		          <note>     0.3509    0.2353    0.2817       170
INFO	2020-01-20 20:35:05 +0000	master-replica-0		   <affiliation>     0.6944    0.6711    0.6826       298
INFO	2020-01-20 20:35:05 +0000	master-replica-0		    <dedication>     1.0000    1.0000    1.0000         1
INFO	2020-01-20 20:35:05 +0000	master-replica-0		     <copyright>     0.7419    0.7188    0.7302        32
INFO	2020-01-20 20:35:05 +0000	master-replica-0		        <author>     0.7642    0.7292    0.7463       240
INFO	2020-01-20 20:35:05 +0000	master-replica-0		       <address>     0.7983    0.7364    0.7661       258
INFO	2020-01-20 20:35:05 +0000	master-replica-0		         <title>     0.7733    0.6960    0.7326       250
INFO	2020-01-20 20:35:05 +0000	master-replica-0		    <submission>     0.8056    0.7838    0.7945        37
INFO	2020-01-20 20:35:05 +0000	master-replica-0		     submission>     0.0000    0.0000    0.0000         2
INFO	2020-01-20 20:35:05 +0000	master-replica-0		       <keyword>     0.9211    0.9211    0.9211        38
INFO	2020-01-20 20:35:05 +0000	master-replica-0		         <grant>     0.1250    0.1667    0.1429         6
INFO	2020-01-20 20:35:05 +0000	master-replica-0		        <degree>     0.7500    0.5000    0.6000         6
INFO	2020-01-20 20:35:05 +0000	master-replica-0		     <reference>     0.4688    0.3947    0.4286        76
INFO	2020-01-20 20:35:05 +0000	master-replica-0		         <intro>     0.3913    0.4286    0.4091        42
INFO	2020-01-20 20:35:05 +0000	master-replica-0		all (micro avg.)     0.7066    0.6501    0.6772      1849

Evaluation

Evaluation:
	f1 (micro): 83.00
                  precision    recall  f1-score   support
        <author>     0.9412    0.9412    0.9412        34
       <address>     0.7097    0.6875    0.6984        32
         <grant>     0.5000    0.5000    0.5000         2
         <title>     0.9615    0.9615    0.9615        26
          <note>     0.5000    0.1667    0.2500         6
          <date>     1.0000    0.8571    0.9231         7
         <intro>     1.0000    1.0000    1.0000         3
       <keyword>     1.0000    1.0000    1.0000         2
         <email>     0.7917    0.7600    0.7755        25
   <affiliation>     0.7879    0.7647    0.7761        34
    <submission>     0.0000    0.0000    0.0000         1
           <web>     1.0000    1.0000    1.0000         3
         <phone>     1.0000    0.6667    0.8000         3
        <pubnum>     0.7500    0.7500    0.7500         4
      <abstract>     0.9545    0.9545    0.9545        22
all (micro avg.)     0.8513    0.8137    0.8321       204

Features 9-30, no feature embedding (epoch 50, eval f1 85.50)

model config

{
    "embeddings_name": "glove.840B.300d",
    "recurrent_dropout": 0.5,
    "word_embedding_size": 300,
    "num_word_lstm_units": 100,
    "max_char_length": 30,
    "max_feature_size": 77,
    "case_vocab_size": 8,
    "fold_number": 1,
    "num_char_lstm_units": 25,
    "case_embedding_size": 5,
    "feature_embedding_size": 0,
    "use_crf": true,
    "char_vocab_size": 305,
    "model_name": "header",
    "char_embedding_size": 25,
    "max_sequence_length": 500,
    "use_BERT": false,
    "batch_size": 10,
    "use_char_feature": true,
    "dropout": 0.5,
    "model_type": "CustomBidLSTM_CRF",
    "use_ELMo": false,
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "use_features": true
}

keras model summary

INFO	2020-01-20 12:46:07 +0000	master-replica-0		2055 train sequences
INFO	2020-01-20 12:46:07 +0000	master-replica-0		229 validation sequences
INFO	2020-01-20 12:46:07 +0000	master-replica-0		254 evaluation sequences
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		Layer (type)                    Output Shape         Param #     Connected to                     
INFO	2020-01-20 12:46:07 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:46:07 +0000	master-replica-0		char_input (InputLayer)         (None, None, 30)     0                                            
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		word_input (InputLayer)         (None, None, 300)    0                                            
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		features_input (InputLayer)     (None, None, 77)     0                                            
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		concatenate_1 (Concatenate)     (None, None, 427)    0           word_input[0][0]                 
INFO	2020-01-20 12:46:07 +0000	master-replica-0		                                                                 char_lstm[0][0]                  
INFO	2020-01-20 12:46:07 +0000	master-replica-0		                                                                 features_input[0][0]             
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		dropout_1 (Dropout)             (None, None, 427)    0           concatenate_1[0][0]              
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		bidirectional_2 (Bidirectional) (None, None, 200)    422400      dropout_1[0][0]                  
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO	2020-01-20 12:46:07 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:07 +0000	master-replica-0		chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO	2020-01-20 12:46:07 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:46:07 +0000	master-replica-0		Total params: 466,603
INFO	2020-01-20 12:46:07 +0000	master-replica-0		Trainable params: 466,603
INFO	2020-01-20 12:46:07 +0000	master-replica-0		Non-trainable params: 0

training summary

INFO	2020-01-20 20:19:14 +0000	master-replica-0		training runtime: 27184.473 seconds 
INFO	2020-01-20 20:19:14 +0000	master-replica-0		Evaluation:
INFO	2020-01-20 20:19:14 +0000	master-replica-0			f1 (micro): 75.51
INFO	2020-01-20 20:19:14 +0000	master-replica-0		                  precision    recall  f1-score   support
INFO	2020-01-20 20:19:14 +0000	master-replica-0		          <note>     0.4722    0.4000    0.4331       170
INFO	2020-01-20 20:19:14 +0000	master-replica-0		      <abstract>     0.8302    0.8462    0.8381       156
INFO	2020-01-20 20:19:14 +0000	master-replica-0		          <date>     0.8548    0.7910    0.8217        67
INFO	2020-01-20 20:19:14 +0000	master-replica-0		         <email>     0.8431    0.8515    0.8473       101
INFO	2020-01-20 20:19:14 +0000	master-replica-0		    <submission>     0.8108    0.8108    0.8108        37
INFO	2020-01-20 20:19:14 +0000	master-replica-0		         <phone>     0.0000    0.0000    0.0000         3
INFO	2020-01-20 20:19:14 +0000	master-replica-0		         <intro>     0.4318    0.4524    0.4419        42
INFO	2020-01-20 20:19:14 +0000	master-replica-0		        <pubnum>     0.7111    0.6667    0.6882        48
INFO	2020-01-20 20:19:14 +0000	master-replica-0		       <address>     0.8259    0.7907    0.8079       258
INFO	2020-01-20 20:19:14 +0000	master-replica-0		        <degree>     0.6000    0.5000    0.5455         6
INFO	2020-01-20 20:19:14 +0000	master-replica-0		    <dedication>     1.0000    1.0000    1.0000         1
INFO	2020-01-20 20:19:14 +0000	master-replica-0		     submission>     0.0000    0.0000    0.0000         2
INFO	2020-01-20 20:19:14 +0000	master-replica-0		         <grant>     0.2727    0.5000    0.3529         6
INFO	2020-01-20 20:19:14 +0000	master-replica-0		         <title>     0.8408    0.8240    0.8323       250
INFO	2020-01-20 20:19:14 +0000	master-replica-0		     <reference>     0.6212    0.5395    0.5775        76
INFO	2020-01-20 20:19:14 +0000	master-replica-0		           <web>     0.5556    0.5556    0.5556        18
INFO	2020-01-20 20:19:14 +0000	master-replica-0		     <copyright>     0.7273    0.7500    0.7385        32
INFO	2020-01-20 20:19:14 +0000	master-replica-0		       <keyword>     0.8537    0.9211    0.8861        38
INFO	2020-01-20 20:19:14 +0000	master-replica-0		   <affiliation>     0.7778    0.7517    0.7645       298
INFO	2020-01-20 20:19:14 +0000	master-replica-0		        <author>     0.8684    0.8250    0.8462       240
INFO	2020-01-20 20:19:14 +0000	master-replica-0		all (micro avg.)     0.7704    0.7404    0.7551      1849

Evaluation

Evaluation:
	f1 (micro): 85.50
                  precision    recall  f1-score   support
         <email>     0.8261    0.7600    0.7917        25
         <phone>     1.0000    0.6667    0.8000         3
         <grant>     0.5000    0.5000    0.5000         2
        <author>     0.9697    0.9412    0.9552        34
       <keyword>     1.0000    1.0000    1.0000         2
          <note>     0.6667    0.3333    0.4444         6
          <date>     1.0000    1.0000    1.0000         7
         <title>     1.0000    1.0000    1.0000        26
   <affiliation>     0.7879    0.7647    0.7761        34
        <pubnum>     0.7500    0.7500    0.7500         4
       <address>     0.7419    0.7188    0.7302        32
         <intro>     1.0000    1.0000    1.0000         3
    <submission>     1.0000    1.0000    1.0000         1
           <web>     1.0000    1.0000    1.0000         3
      <abstract>     0.9545    0.9545    0.9545        22
all (micro avg.)     0.8769    0.8382    0.8571       204

Features 9-30, feature embedding 50 (epoch 36, eval f1 85.00)

model config

{
    "num_char_lstm_units": 25,
    "use_crf": true,
    "max_sequence_length": 500,
    "word_embedding_size": 300,
    "batch_size": 10,
    "use_BERT": false,
    "case_embedding_size": 5,
    "fold_number": 1,
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "max_feature_size": 77,
    "use_features": true,
    "use_ELMo": false,
    "use_char_feature": true,
    "dropout": 0.5,
    "embeddings_name": "glove.840B.300d",
    "num_word_lstm_units": 100,
    "model_type": "CustomBidLSTM_CRF",
    "model_name": "header",
    "max_char_length": 30,
    "recurrent_dropout": 0.5,
    "char_vocab_size": 305,
    "case_vocab_size": 8,
    "char_embedding_size": 25,
    "feature_embedding_size": 50
}

keras model summary

INFO	2020-01-20 12:46:33 +0000	master-replica-0		2055 train sequences
INFO	2020-01-20 12:46:33 +0000	master-replica-0		229 validation sequences
INFO	2020-01-20 12:46:33 +0000	master-replica-0		254 evaluation sequences
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		Layer (type)                    Output Shape         Param #     Connected to                     
INFO	2020-01-20 12:46:33 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:46:33 +0000	master-replica-0		char_input (InputLayer)         (None, None, 30)     0                                            
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		features_input (InputLayer)     (None, None, 77)     0                                            
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		word_input (InputLayer)         (None, None, 300)    0                                            
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		feature_embeddings (TimeDistrib (None, None, 50)     3900        features_input[0][0]             
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		concatenate_1 (Concatenate)     (None, None, 400)    0           word_input[0][0]                 
INFO	2020-01-20 12:46:33 +0000	master-replica-0		                                                                 char_lstm[0][0]                  
INFO	2020-01-20 12:46:33 +0000	master-replica-0		                                                                 feature_embeddings[0][0]         
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		dropout_1 (Dropout)             (None, None, 400)    0           concatenate_1[0][0]              
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		bidirectional_2 (Bidirectional) (None, None, 200)    400800      dropout_1[0][0]                  
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO	2020-01-20 12:46:33 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:46:33 +0000	master-replica-0		chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO	2020-01-20 12:46:33 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:46:33 +0000	master-replica-0		Total params: 448,903
INFO	2020-01-20 12:46:33 +0000	master-replica-0		Trainable params: 448,903
INFO	2020-01-20 12:46:33 +0000	master-replica-0		Non-trainable params: 0
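
The parameter counts in the summary above can be reproduced by hand, which is a quick way to check what changing `feature_embedding_size` actually affects. The sketch below is only a sanity check, not code from DeLFT. The helper names are made up for illustration, the sizes are read off the logged summary (300-dim word input, 77-dim feature input, 43 tags), and the CRF count assumes one transition matrix plus two boundary vectors, which matches the logged 1,935.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Weights of one LSTM direction: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    """Weights of a dense layer: kernel plus bias."""
    return input_dim * units + units


def header_model_params(feature_embedding_size: int) -> dict:
    """Per-layer parameter counts for the header model shown in the summary (hypothetical helper)."""
    char_vocab_size, char_embedding_size = 305, 25
    num_char_lstm_units, num_word_lstm_units = 25, 100
    word_embedding_size, max_feature_size, ntags = 300, 77, 43

    # word vector + bidirectional char LSTM output + projected layout features
    concat_dim = word_embedding_size + 2 * num_char_lstm_units + feature_embedding_size
    counts = {
        "char_embeddings": char_vocab_size * char_embedding_size,
        "char_lstm": 2 * lstm_params(char_embedding_size, num_char_lstm_units),
        "feature_embeddings": dense_params(max_feature_size, feature_embedding_size),
        "bidirectional_2": 2 * lstm_params(concat_dim, num_word_lstm_units),
        "dense_1": dense_params(2 * num_word_lstm_units, num_word_lstm_units),
        "dense_ntags": dense_params(num_word_lstm_units, ntags),
        # assumption: transition matrix plus start and end boundary vectors
        "chain_crf_1": ntags * ntags + 2 * ntags,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = header_model_params(feature_embedding_size=50)
print(counts["feature_embeddings"], counts["bidirectional_2"], counts["total"])
```

With `feature_embedding_size=50` this gives 3,900 for `feature_embeddings`, 400,800 for `bidirectional_2` and 448,903 in total, as logged. Only those two layers depend on the feature embedding size; everything else stays fixed.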

training summary

INFO	2020-01-20 22:30:18 +0000	master-replica-0		training runtime: 20800.898 seconds 
INFO	2020-01-20 22:30:18 +0000	master-replica-0		Evaluation:
INFO	2020-01-20 22:30:18 +0000	master-replica-0			f1 (micro): 75.01
INFO	2020-01-20 22:30:18 +0000	master-replica-0		                  precision    recall  f1-score   support
INFO	2020-01-20 22:30:18 +0000	master-replica-0		        <pubnum>     0.6400    0.6667    0.6531        48
INFO	2020-01-20 22:30:18 +0000	master-replica-0		     <copyright>     0.7333    0.6875    0.7097        32
INFO	2020-01-20 22:30:18 +0000	master-replica-0		        <author>     0.8448    0.8167    0.8305       240
INFO	2020-01-20 22:30:18 +0000	master-replica-0		       <keyword>     0.8947    0.8947    0.8947        38
INFO	2020-01-20 22:30:18 +0000	master-replica-0		     <reference>     0.6029    0.5395    0.5694        76
INFO	2020-01-20 22:30:18 +0000	master-replica-0		        <degree>     0.4286    0.5000    0.4615         6
INFO	2020-01-20 22:30:18 +0000	master-replica-0		         <grant>     0.4444    0.6667    0.5333         6
INFO	2020-01-20 22:30:18 +0000	master-replica-0		         <email>     0.8367    0.8119    0.8241       101
INFO	2020-01-20 22:30:18 +0000	master-replica-0		   <affiliation>     0.7705    0.7550    0.7627       298
INFO	2020-01-20 22:30:18 +0000	master-replica-0		    <submission>     0.7895    0.8108    0.8000        37
INFO	2020-01-20 22:30:18 +0000	master-replica-0		     submission>     0.0000    0.0000    0.0000         2
INFO	2020-01-20 22:30:18 +0000	master-replica-0		         <phone>     0.0000    0.0000    0.0000         3
INFO	2020-01-20 22:30:18 +0000	master-replica-0		         <title>     0.8537    0.8400    0.8468       250
INFO	2020-01-20 22:30:18 +0000	master-replica-0		           <web>     0.4348    0.5556    0.4878        18
INFO	2020-01-20 22:30:18 +0000	master-replica-0		      <abstract>     0.8250    0.8462    0.8354       156
INFO	2020-01-20 22:30:18 +0000	master-replica-0		         <intro>     0.5000    0.4762    0.4878        42
INFO	2020-01-20 22:30:18 +0000	master-replica-0		       <address>     0.8105    0.7791    0.7945       258
INFO	2020-01-20 22:30:18 +0000	master-replica-0		    <dedication>     1.0000    1.0000    1.0000         1
INFO	2020-01-20 22:30:18 +0000	master-replica-0		          <note>     0.4853    0.3882    0.4314       170
INFO	2020-01-20 22:30:18 +0000	master-replica-0		          <date>     0.8387    0.7761    0.8062        67
INFO	2020-01-20 22:30:18 +0000	master-replica-0		all (micro avg.)     0.7646    0.7361    0.7501      1849
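
As a reading aid for the report above: the f1-score column is the harmonic mean of the precision and recall in the same row, and that also holds for the micro-average row. A minimal sketch follows; `f1_score` is a name chosen here for illustration, not a function from the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Micro-average row of the report above: precision 0.7646, recall 0.7361
print(f"{f1_score(0.7646, 0.7361):.4f}")
# <author> row: precision 0.8448, recall 0.8167
print(f"{f1_score(0.8448, 0.8167):.4f}")
```

The two values round to 0.7501 and 0.8305, which agree with the logged rows. Rows with zero precision and recall, such as `<phone>`, are reported as 0.0000 rather than failing on a division by zero.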

Evaluation

Evaluation:
	f1 (micro): 85.00
                  precision    recall  f1-score   support
          <date>     0.8333    0.7143    0.7692         7
       <keyword>     1.0000    1.0000    1.0000         2
       <address>     0.7419    0.7188    0.7302        32
        <pubnum>     1.0000    0.7500    0.8571         4
         <email>     0.8261    0.7600    0.7917        25
           <web>     0.7500    1.0000    0.8571         3
      <abstract>     1.0000    1.0000    1.0000        22
         <grant>     0.5000    0.5000    0.5000         2
         <phone>     1.0000    0.6667    0.8000         3
          <note>     1.0000    0.5000    0.6667         6
        <author>     0.9412    0.9412    0.9412        34
   <affiliation>     0.7879    0.7647    0.7761        34
    <submission>     0.0000    0.0000    0.0000         1
         <title>     1.0000    1.0000    1.0000        26
         <intro>     1.0000    1.0000    1.0000         3
all (micro avg.)     0.8718    0.8333    0.8521       204

Features 9-30, feature embedding 30 (epoch 41, eval f1 83.21)

model config

{
    "use_features": true,
    "char_vocab_size": 305,
    "num_word_lstm_units": 100,
    "char_embedding_size": 25,
    "use_BERT": false,
    "model_name": "header",
    "embeddings_name": "glove.840B.300d",
    "dropout": 0.5,
    "batch_size": 10,
    "word_embedding_size": 300,
    "max_feature_size": 77,
    "case_embedding_size": 5,
    "num_char_lstm_units": 25,
    "feature_embedding_size": 30,
    "recurrent_dropout": 0.5,
    "model_type": "CustomBidLSTM_CRF",
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "case_vocab_size": 8,
    "use_ELMo": false,
    "fold_number": 1,
    "max_char_length": 30,
    "max_sequence_length": 500,
    "use_crf": true,
    "use_char_feature": true
}
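
For readers unfamiliar with the config: `feature_indices` lists which columns of each token's feature row are fed to the model, here columns 9 to 30. The sketch below shows one plausible reading of that setting and is not the trainer's actual code. `select_features` and the example row are hypothetical, and it assumes 0-based indexing with the token itself in column 0.

```python
from typing import List, Sequence

# Same 22 values as the "feature_indices" list in the config above
FEATURE_INDICES = list(range(9, 31))


def select_features(token_row: Sequence[str], indices: Sequence[int]) -> List[str]:
    """Keep only the columns named by `indices` (0-based positions in the row)."""
    return [token_row[i] for i in indices]


# Hypothetical token row: token, 30 placeholder feature columns, label
row = ["Title"] + [f"f{i}" for i in range(1, 31)] + ["I-<title>"]
selected = select_features(row, FEATURE_INDICES)
print(len(selected), selected[0], selected[-1])
```

Under these assumptions the token and the first eight (prefix/suffix) columns are skipped, which matches the earlier suggestion in this thread to leave the lexical features to the character input channel.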

keras model summary

INFO	2020-01-20 12:47:06 +0000	master-replica-0		2055 train sequences
INFO	2020-01-20 12:47:06 +0000	master-replica-0		229 validation sequences
INFO	2020-01-20 12:47:06 +0000	master-replica-0		254 evaluation sequences
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		Layer (type)                    Output Shape         Param #     Connected to                     
INFO	2020-01-20 12:47:06 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:47:06 +0000	master-replica-0		char_input (InputLayer)         (None, None, 30)     0                                            
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		features_input (InputLayer)     (None, None, 77)     0                                            
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		word_input (InputLayer)         (None, None, 300)    0                                            
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		feature_embeddings (TimeDistrib (None, None, 30)     2340        features_input[0][0]             
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		concatenate_1 (Concatenate)     (None, None, 380)    0           word_input[0][0]                 
INFO	2020-01-20 12:47:06 +0000	master-replica-0		                                                                 char_lstm[0][0]                  
INFO	2020-01-20 12:47:06 +0000	master-replica-0		                                                                 feature_embeddings[0][0]         
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		dropout_1 (Dropout)             (None, None, 380)    0           concatenate_1[0][0]              
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		bidirectional_2 (Bidirectional) (None, None, 200)    384800      dropout_1[0][0]                  
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO	2020-01-20 12:47:06 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:06 +0000	master-replica-0		chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO	2020-01-20 12:47:06 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:47:06 +0000	master-replica-0		Total params: 431,343
INFO	2020-01-20 12:47:06 +0000	master-replica-0		Trainable params: 431,343
INFO	2020-01-20 12:47:06 +0000	master-replica-0		Non-trainable params: 0

training summary

INFO	2020-01-20 18:57:40 +0000	master-replica-0		training runtime: 22236.897 seconds 
INFO	2020-01-20 18:57:40 +0000	master-replica-0		Evaluation:
INFO	2020-01-20 18:57:40 +0000	master-replica-0			f1 (micro): 75.27
INFO	2020-01-20 18:57:40 +0000	master-replica-0		                  precision    recall  f1-score   support
INFO	2020-01-20 18:57:40 +0000	master-replica-0		      <abstract>     0.8428    0.8590    0.8508       156
INFO	2020-01-20 18:57:40 +0000	master-replica-0		          <note>     0.4437    0.3706    0.4038       170
INFO	2020-01-20 18:57:40 +0000	master-replica-0		   <affiliation>     0.7864    0.7785    0.7825       298
INFO	2020-01-20 18:57:40 +0000	master-replica-0		       <keyword>     0.9211    0.9211    0.9211        38
INFO	2020-01-20 18:57:40 +0000	master-replica-0		       <address>     0.8086    0.8023    0.8054       258
INFO	2020-01-20 18:57:40 +0000	master-replica-0		        <degree>     0.3333    0.3333    0.3333         6
INFO	2020-01-20 18:57:40 +0000	master-replica-0		        <author>     0.8462    0.8250    0.8354       240
INFO	2020-01-20 18:57:40 +0000	master-replica-0		          <date>     0.7681    0.7910    0.7794        67
INFO	2020-01-20 18:57:40 +0000	master-replica-0		         <intro>     0.4468    0.5000    0.4719        42
INFO	2020-01-20 18:57:40 +0000	master-replica-0		         <phone>     0.0000    0.0000    0.0000         3
INFO	2020-01-20 18:57:40 +0000	master-replica-0		     <reference>     0.5672    0.5000    0.5315        76
INFO	2020-01-20 18:57:40 +0000	master-replica-0		         <title>     0.8571    0.8400    0.8485       250
INFO	2020-01-20 18:57:40 +0000	master-replica-0		    <submission>     0.7568    0.7568    0.7568        37
INFO	2020-01-20 18:57:40 +0000	master-replica-0		     submission>     0.0000    0.0000    0.0000         2
INFO	2020-01-20 18:57:40 +0000	master-replica-0		         <grant>     0.5000    0.6667    0.5714         6
INFO	2020-01-20 18:57:40 +0000	master-replica-0		     <copyright>     0.8000    0.7500    0.7742        32
INFO	2020-01-20 18:57:40 +0000	master-replica-0		         <email>     0.7900    0.7822    0.7861       101
INFO	2020-01-20 18:57:40 +0000	master-replica-0		           <web>     0.5417    0.7222    0.6190        18
INFO	2020-01-20 18:57:40 +0000	master-replica-0		        <pubnum>     0.7292    0.7292    0.7292        48
INFO	2020-01-20 18:57:40 +0000	master-replica-0		    <dedication>     1.0000    1.0000    1.0000         1
INFO	2020-01-20 18:57:40 +0000	master-replica-0		all (micro avg.)     0.7612    0.7447    0.7529      1849

Evaluation

Evaluation:
	f1 (micro): 83.21
                  precision    recall  f1-score   support
    <submission>     1.0000    1.0000    1.0000         1
      <abstract>     0.9545    0.9545    0.9545        22
           <web>     0.7500    1.0000    0.8571         3
         <grant>     0.5000    0.5000    0.5000         2
         <phone>     0.5000    0.3333    0.4000         3
       <address>     0.7188    0.7188    0.7188        32
          <date>     0.8571    0.8571    0.8571         7
        <pubnum>     1.0000    0.7500    0.8571         4
        <author>     0.9697    0.9412    0.9552        34
         <title>     0.9615    0.9615    0.9615        26
       <keyword>     1.0000    1.0000    1.0000         2
          <note>     1.0000    0.1667    0.2857         6
         <intro>     1.0000    1.0000    1.0000         3
   <affiliation>     0.7879    0.7647    0.7761        34
         <email>     0.7826    0.7200    0.7500        25
all (micro avg.)     0.8557    0.8137    0.8342       204

Features 9-30, feature embedding 30 (epoch 31, eval f1 82.91) (accidentally used 30 again)

model config

{
    "char_embedding_size": 25,
    "fold_number": 1,
    "embeddings_name": "glove.840B.300d",
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "max_char_length": 30,
    "use_crf": true,
    "batch_size": 10,
    "use_BERT": false,
    "max_sequence_length": 500,
    "model_type": "CustomBidLSTM_CRF",
    "recurrent_dropout": 0.5,
    "char_vocab_size": 305,
    "word_embedding_size": 300,
    "num_char_lstm_units": 25,
    "max_feature_size": 77,
    "model_name": "header",
    "dropout": 0.5,
    "case_embedding_size": 5,
    "num_word_lstm_units": 100,
    "use_char_feature": true,
    "use_features": true,
    "use_ELMo": false,
    "feature_embedding_size": 30,
    "case_vocab_size": 8
}

keras model summary

INFO	2020-01-20 12:47:34 +0000	master-replica-0		2055 train sequences
INFO	2020-01-20 12:47:34 +0000	master-replica-0		229 validation sequences
INFO	2020-01-20 12:47:34 +0000	master-replica-0		254 evaluation sequences
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		Layer (type)                    Output Shape         Param #     Connected to                     
INFO	2020-01-20 12:47:34 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:47:34 +0000	master-replica-0		char_input (InputLayer)         (None, None, 30)     0                                            
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		features_input (InputLayer)     (None, None, 77)     0                                            
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		word_input (InputLayer)         (None, None, 300)    0                                            
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		feature_embeddings (TimeDistrib (None, None, 30)     2340        features_input[0][0]             
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		concatenate_1 (Concatenate)     (None, None, 380)    0           word_input[0][0]                 
INFO	2020-01-20 12:47:34 +0000	master-replica-0		                                                                 char_lstm[0][0]                  
INFO	2020-01-20 12:47:34 +0000	master-replica-0		                                                                 feature_embeddings[0][0]         
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		dropout_1 (Dropout)             (None, None, 380)    0           concatenate_1[0][0]              
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		bidirectional_2 (Bidirectional) (None, None, 200)    384800      dropout_1[0][0]                  
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO	2020-01-20 12:47:34 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-20 12:47:34 +0000	master-replica-0		chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO	2020-01-20 12:47:34 +0000	master-replica-0		==================================================================================================
INFO	2020-01-20 12:47:34 +0000	master-replica-0		Total params: 431,343
INFO	2020-01-20 12:47:34 +0000	master-replica-0		Trainable params: 431,343
INFO	2020-01-20 12:47:34 +0000	master-replica-0		Non-trainable params: 0

training summary

INFO	2020-01-20 22:02:02 +0000	master-replica-0		training runtime: 14670.097 seconds 
INFO	2020-01-20 22:02:02 +0000	master-replica-0		Evaluation:
INFO	2020-01-20 22:02:02 +0000	master-replica-0			f1 (micro): 73.29
INFO	2020-01-20 22:02:02 +0000	master-replica-0		                  precision    recall  f1-score   support
INFO	2020-01-20 22:02:02 +0000	master-replica-0		        <author>     0.8502    0.8042    0.8266       240
INFO	2020-01-20 22:02:02 +0000	master-replica-0		         <email>     0.8061    0.7822    0.7940       101
INFO	2020-01-20 22:02:02 +0000	master-replica-0		    <submission>     0.7941    0.7297    0.7606        37
INFO	2020-01-20 22:02:02 +0000	master-replica-0		       <address>     0.7967    0.7442    0.7695       258
INFO	2020-01-20 22:02:02 +0000	master-replica-0		          <note>     0.4367    0.4059    0.4207       170
INFO	2020-01-20 22:02:02 +0000	master-replica-0		     submission>     0.0000    0.0000    0.0000         2
INFO	2020-01-20 22:02:02 +0000	master-replica-0		         <grant>     0.4286    0.5000    0.4615         6
INFO	2020-01-20 22:02:02 +0000	master-replica-0		     <copyright>     0.6667    0.6250    0.6452        32
INFO	2020-01-20 22:02:02 +0000	master-replica-0		          <date>     0.7812    0.7463    0.7634        67
INFO	2020-01-20 22:02:02 +0000	master-replica-0		       <keyword>     0.8718    0.8947    0.8831        38
INFO	2020-01-20 22:02:02 +0000	master-replica-0		        <pubnum>     0.6522    0.6250    0.6383        48
INFO	2020-01-20 22:02:02 +0000	master-replica-0		     <reference>     0.6780    0.5263    0.5926        76
INFO	2020-01-20 22:02:02 +0000	master-replica-0		      <abstract>     0.8323    0.8590    0.8454       156
INFO	2020-01-20 22:02:02 +0000	master-replica-0		    <dedication>     1.0000    1.0000    1.0000         1
INFO	2020-01-20 22:02:02 +0000	master-replica-0		        <degree>     0.4000    0.3333    0.3636         6
INFO	2020-01-20 22:02:02 +0000	master-replica-0		         <phone>     0.0000    0.0000    0.0000         3
INFO	2020-01-20 22:02:02 +0000	master-replica-0		           <web>     0.4545    0.5556    0.5000        18
INFO	2020-01-20 22:02:02 +0000	master-replica-0		         <intro>     0.5128    0.4762    0.4938        42
INFO	2020-01-20 22:02:02 +0000	master-replica-0		         <title>     0.8436    0.8200    0.8316       250
INFO	2020-01-20 22:02:02 +0000	master-replica-0		   <affiliation>     0.7599    0.7114    0.7348       298
INFO	2020-01-20 22:02:02 +0000	master-replica-0		all (micro avg.)     0.7527    0.7144    0.7331      1849

Evaluation

Evaluation:
	f1 (micro): 82.91
                  precision    recall  f1-score   support
           <web>     1.0000    1.0000    1.0000         3
      <abstract>     0.9545    0.9545    0.9545        22
    <submission>     1.0000    1.0000    1.0000         1
          <note>     0.5000    0.1667    0.2500         6
          <date>     0.8571    0.8571    0.8571         7
       <address>     0.7188    0.7188    0.7188        32
        <author>     0.9412    0.9412    0.9412        34
         <email>     0.7619    0.6400    0.6957        25
         <title>     0.9615    0.9615    0.9615        26
         <intro>     1.0000    1.0000    1.0000         3
       <keyword>     1.0000    1.0000    1.0000         2
         <phone>     1.0000    0.6667    0.8000         3
        <pubnum>     1.0000    0.7500    0.8571         4
   <affiliation>     0.7879    0.7647    0.7761        34
         <grant>     0.5000    0.5000    0.5000         2
all (micro avg.)     0.8549    0.8088    0.8312       204

Features 9-30, feature embedding 80 (epoch 40, eval f1 85.00)

model config

{
    "embeddings_name": "glove.840B.300d",
    "word_embedding_size": 300,
    "char_vocab_size": 305,
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "use_BERT": false,
    "model_name": "header",
    "use_crf": true,
    "num_char_lstm_units": 25,
    "case_vocab_size": 8,
    "case_embedding_size": 5,
    "batch_size": 10,
    "feature_embedding_size": 80,
    "char_embedding_size": 25,
    "num_word_lstm_units": 100,
    "fold_number": 1,
    "dropout": 0.5,
    "max_feature_size": 77,
    "model_type": "CustomBidLSTM_CRF",
    "use_char_feature": true,
    "max_sequence_length": 500,
    "recurrent_dropout": 0.5,
    "max_char_length": 30,
    "use_ELMo": false,
    "use_features": true
}

keras model summary

INFO	2020-01-21 10:38:39 +0000	master-replica-0		2055 train sequences
INFO	2020-01-21 10:38:39 +0000	master-replica-0		229 validation sequences
INFO	2020-01-21 10:38:39 +0000	master-replica-0		254 evaluation sequences
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		Layer (type)                    Output Shape         Param #     Connected to                     
INFO	2020-01-21 10:38:39 +0000	master-replica-0		==================================================================================================
INFO	2020-01-21 10:38:39 +0000	master-replica-0		char_input (InputLayer)         (None, None, 30)     0                                            
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		features_input (InputLayer)     (None, None, 77)     0                                            
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		word_input (InputLayer)         (None, None, 300)    0                                            
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		feature_embeddings (TimeDistrib (None, None, 80)     6240        features_input[0][0]             
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		concatenate_1 (Concatenate)     (None, None, 430)    0           word_input[0][0]                 
INFO	2020-01-21 10:38:39 +0000	master-replica-0		                                                                 char_lstm[0][0]                  
INFO	2020-01-21 10:38:39 +0000	master-replica-0		                                                                 feature_embeddings[0][0]         
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		dropout_1 (Dropout)             (None, None, 430)    0           concatenate_1[0][0]              
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		bidirectional_2 (Bidirectional) (None, None, 200)    424800      dropout_1[0][0]                  
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO	2020-01-21 10:38:39 +0000	master-replica-0		__________________________________________________________________________________________________
INFO	2020-01-21 10:38:39 +0000	master-replica-0		chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO	2020-01-21 10:38:39 +0000	master-replica-0		==================================================================================================
INFO	2020-01-21 10:38:39 +0000	master-replica-0		Total params: 475,243
INFO	2020-01-21 10:38:39 +0000	master-replica-0		Trainable params: 475,243
INFO	2020-01-21 10:38:39 +0000	master-replica-0		Non-trainable params: 0
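
The parameter counts in the summary above can be checked by hand. The following is a minimal sketch, not code from DeLFT: the helper names are hypothetical, the LSTM size of 100 units per direction is inferred from the 200-wide bidirectional output, and the CRF formula (one transition matrix plus two boundary vectors) is an assumption that happens to match the logged 1935.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Parameters of one LSTM direction: 4 gates, each with input
    weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int, use_bias: bool = True) -> int:
    """Parameters of a fully connected layer."""
    return input_dim * units + (units if use_bias else 0)


def chain_crf_params(n_tags: int) -> int:
    """Assumed CRF layout: an n_tags x n_tags transition matrix plus
    two boundary vectors of size n_tags (matches the logged 1935)."""
    return n_tags * n_tags + 2 * n_tags


CONCAT_DIM = 430   # width of concatenate_1 in the log
LSTM_UNITS = 100   # assumption: 200 outputs = 2 directions x 100 units
N_TAGS = 43        # width of dense_ntags in the log

counts = {
    "bidirectional_2": 2 * lstm_params(CONCAT_DIM, LSTM_UNITS),
    "dense_1": dense_params(2 * LSTM_UNITS, 100),
    "dense_ntags": dense_params(100, N_TAGS),
    "chain_crf_1": chain_crf_params(N_TAGS),
}

for name, value in counts.items():
    print(f"{name}: {value}")
```

The 6240 parameters of feature_embeddings are consistent with either 78 inputs and no bias (78 x 80) or 77 inputs with a bias (77 x 80 + 80); the log alone does not say which.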

training summary

INFO	2020-01-21 16:50:30 +0000	master-replica-0		training runtime: 22309.499 seconds 
INFO	2020-01-21 16:50:30 +0000	master-replica-0		Evaluation:
INFO	2020-01-21 16:50:30 +0000	master-replica-0			f1 (micro): 75.57
INFO	2020-01-21 16:50:30 +0000	master-replica-0		                  precision    recall  f1-score   support
INFO	2020-01-21 16:50:30 +0000	master-replica-0		     <copyright>     0.7273    0.7500    0.7385        32
INFO	2020-01-21 16:50:30 +0000	master-replica-0		         <intro>     0.4898    0.5714    0.5275        42
INFO	2020-01-21 16:50:30 +0000	master-replica-0		         <phone>     0.0000    0.0000    0.0000         3
INFO	2020-01-21 16:50:30 +0000	master-replica-0		        <pubnum>     0.7500    0.6875    0.7174        48
INFO	2020-01-21 16:50:30 +0000	master-replica-0		        <author>     0.8504    0.8292    0.8397       240
INFO	2020-01-21 16:50:30 +0000	master-replica-0		      <abstract>     0.8758    0.8590    0.8673       156
INFO	2020-01-21 16:50:30 +0000	master-replica-0		    <dedication>     1.0000    1.0000    1.0000         1
INFO	2020-01-21 16:50:30 +0000	master-replica-0		        <degree>     0.5000    0.3333    0.4000         6
INFO	2020-01-21 16:50:30 +0000	master-replica-0		          <note>     0.4080    0.4176    0.4128       170
INFO	2020-01-21 16:50:30 +0000	master-replica-0		   <affiliation>     0.7819    0.7819    0.7819       298
INFO	2020-01-21 16:50:30 +0000	master-replica-0		         <grant>     0.5000    0.6667    0.5714         6
INFO	2020-01-21 16:50:30 +0000	master-replica-0		     submission>     0.0000    0.0000    0.0000         2
INFO	2020-01-21 16:50:30 +0000	master-replica-0		     <reference>     0.6364    0.5526    0.5915        76
INFO	2020-01-21 16:50:30 +0000	master-replica-0		         <title>     0.8150    0.8280    0.8214       250
INFO	2020-01-21 16:50:30 +0000	master-replica-0		         <email>     0.8125    0.7723    0.7919       101
INFO	2020-01-21 16:50:30 +0000	master-replica-0		          <date>     0.7910    0.7910    0.7910        67
INFO	2020-01-21 16:50:30 +0000	master-replica-0		       <keyword>     0.8974    0.9211    0.9091        38
INFO	2020-01-21 16:50:30 +0000	master-replica-0		       <address>     0.8440    0.8178    0.8307       258
INFO	2020-01-21 16:50:30 +0000	master-replica-0		           <web>     0.5263    0.5556    0.5405        18
INFO	2020-01-21 16:50:30 +0000	master-replica-0		    <submission>     0.7778    0.7568    0.7671        37
INFO	2020-01-21 16:50:30 +0000	master-replica-0		all (micro avg.)     0.7603    0.7512    0.7557      1849

Evaluation

Evaluation:
	f1 (micro): 85.00
                  precision    recall  f1-score   support
          <date>     0.8571    0.8571    0.8571         7
        <pubnum>     1.0000    0.7500    0.8571         4
      <abstract>     0.9545    0.9545    0.9545        22
         <email>     0.8636    0.7600    0.8085        25
         <intro>     1.0000    1.0000    1.0000         3
    <submission>     1.0000    1.0000    1.0000         1
         <phone>     1.0000    0.6667    0.8000         3
       <address>     0.6970    0.7188    0.7077        32
       <keyword>     1.0000    1.0000    1.0000         2
        <author>     0.9697    0.9412    0.9552        34
           <web>     1.0000    1.0000    1.0000         3
   <affiliation>     0.7879    0.7647    0.7761        34
         <title>     1.0000    1.0000    1.0000        26
         <grant>     0.5000    0.5000    0.5000         2
          <note>     0.6667    0.3333    0.4444         6
all (micro avg.)     0.8718    0.8333    0.8521       204
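
As a sanity check, the "all (micro avg.)" row can be recomputed from the per-label rows of the evaluation table above. This is a minimal sketch under stated assumptions, not the evaluation code that produced the report: the function name `micro_average` is hypothetical, and the true-positive and predicted counts are recovered by rounding, which only works when a label has at least one true positive.

```python
from typing import Iterable, Tuple

Row = Tuple[str, float, float, int]  # (label, precision, recall, support)


def micro_average(rows: Iterable[Row]) -> Tuple[float, float, float]:
    """Recover micro-averaged precision, recall and F1 from per-label rows."""
    tp_total = predicted_total = support_total = 0
    for label, precision, recall, support in rows:
        # recall = tp / support, so tp is recovered by rounding.
        tp = round(recall * support)
        if tp == 0:
            # With no true positives the number of predictions is unknown.
            raise ValueError(f"cannot recover predicted count for {label}")
        # precision = tp / predicted, so predicted is recovered by rounding.
        predicted = round(tp / precision)
        tp_total += tp
        predicted_total += predicted
        support_total += support
    micro_p = tp_total / predicted_total
    micro_r = tp_total / support_total
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    return micro_p, micro_r, micro_f1


rows = [
    ("<date>", 0.8571, 0.8571, 7),
    ("<pubnum>", 1.0000, 0.7500, 4),
    ("<abstract>", 0.9545, 0.9545, 22),
    ("<email>", 0.8636, 0.7600, 25),
    ("<intro>", 1.0000, 1.0000, 3),
    ("<submission>", 1.0000, 1.0000, 1),
    ("<phone>", 1.0000, 0.6667, 3),
    ("<address>", 0.6970, 0.7188, 32),
    ("<keyword>", 1.0000, 1.0000, 2),
    ("<author>", 0.9697, 0.9412, 34),
    ("<web>", 1.0000, 1.0000, 3),
    ("<affiliation>", 0.7879, 0.7647, 34),
    ("<title>", 1.0000, 1.0000, 26),
    ("<grant>", 0.5000, 0.5000, 2),
    ("<note>", 0.6667, 0.3333, 6),
]

p, r, f1 = micro_average(rows)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # prints "0.8718 0.8333 0.8521"
```

The sketch reproduces the summary row of the table; it does not account for the separately reported "f1 (micro): 85.00" line.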

lfoppiano · 2020-01-23T08:29:49Z

@de-code thanks! There is a 2% gain in certain cases... mmm interesting

de-code · 2020-01-23T10:26:58Z

> @de-code thanks! There is a 2% gain in certain cases... mmm interesting

Yes, though I am not sure how reliable the header test set is. It seems relatively small.

Since I have the models, all of the checkpoints and the logs saved, I could run the evaluation again on a different test set; I would just need it in the DeLFT format.

kermitt2 changed the title ~~Add layout features to model~~ Add layout features to GROBID model Sep 21, 2019

lfoppiano self-assigned this Dec 5, 2019

lfoppiano added the enhancement label Dec 11, 2019

lfoppiano mentioned this issue Dec 20, 2019: [WIP] Implementing features channel #76 (Closed)

de-code mentioned this issue Jan 3, 2020: custom delft train args kermitt2/grobid#469 (Open)

lfoppiano mentioned this issue Jan 14, 2020: Implementing features channel #82 (Merged)

lfoppiano closed this as completed Jan 23, 2020

lfoppiano reopened this Jan 23, 2020

lfoppiano linked a pull request Feb 19, 2020 that will close this issue: Implementing features channel #82 (Merged)

kermitt2 closed this as completed in #82 Aug 11, 2020

de-code commented Jan 21, 2020 • edited

No features (epoch 53, eval f1 83.00)

Features 9-30, no feature embedding (epoch 50, eval f1 85.50)

Features 9-30, feature embedding 50 (epoch 36, eval f1 85.00)

Features 9-30, feature embedding 30 (epoch 41, eval f1 83.21)

Features 9-30, feature embedding 30 (epoch 31, eval f1 82.91) (accidentally used 30 again)

Features 9-30, feature embedding 80 (epoch 40, eval f1 85.00)
