
Improvement in evaluation framework #280

Merged
merged 45 commits from check-evaluation into master on Sep 11, 2019
Conversation

@lfoppiano (Collaborator) commented Jan 24, 2018

On my way through checking that everything was OK, I've rewritten the evaluation in a way that is testable and tested, while keeping the same efficiency.

Some parts could probably be improved further, but it's a nice base to start from.

Update: Work in progress to implement #453

@coveralls commented Jul 4, 2019

Coverage increased (+0.2%) to 36.851% when pulling 149d3b7 on check-evaluation into e3b8886 on master.

@lfoppiano changed the title from "Reworked the evaluation in a testable way" to "[WIP] Improvement in evaluation framework" on Jul 4, 2019
@lfoppiano lfoppiano self-assigned this Jul 4, 2019
@lfoppiano lfoppiano added this to the 0.6.0 milestone Jul 4, 2019
@lfoppiano lfoppiano requested review from kermitt2 and removed request for kermitt2 July 11, 2019 02:01
@lfoppiano (Collaborator, Author)

I've implemented something... however, when testing the date model, I get suspiciously high results:

org.grobid.trainer.TrainerRunner 3 date -gH /Users/lfoppiano/development/projects/grobid/grobid-home -n 10
path2GbdHome=/Users/lfoppiano/development/projects/grobid/grobid-home   path2GbdProperties=/Users/lfoppiano/development/projects/grobid/grobid-home/config/grobid.properties
Jul 11, 2019 3:56:34 PM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathOrLoadFromClasspath

[...]

sourcePathLabel: /Users/lfoppiano/development/projects/grobid/grobid-home/../grobid-trainer/resources/dataset/date/corpus
outputPath for training data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date2593286755485374659.train
355 tei files
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_0.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date2049499740058725076.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6458
    nb features: 45248
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date7712087194203659098.test
Jul 11, 2019 3:56:37 PM org.grobid.core.jni.WapitiModel init
INFO: Loading model: /Users/lfoppiano/development/projects/grobid/grobid-home/models/date/model.wapiti (size: 102435)
[Wapiti] Loading model: "/Users/lfoppiano/development/projects/grobid/grobid-home/models/date/model.wapiti"
Model path: /Users/lfoppiano/development/projects/grobid/grobid-home/models/date/model.wapiti
Labeling took: 11 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_1.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date10216013546878777186.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6587
    nb features: 46151
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date7677806593219064519.test
Labeling took: 6 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_2.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date11876661758354315573.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6376
    nb features: 44674
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date2377539879914962123.test
Labeling took: 8 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_3.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date751133838306332397.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6422
    nb features: 44996
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date5873862545493725401.test
Labeling took: 18 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_4.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date16816317073972557159.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6509
    nb features: 45605
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date17026021496668950245.test
Labeling took: 6 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_5.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date9989361506377752060.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6492
    nb features: 45486
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date11548516505392637898.test
Labeling took: 6 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_6.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date801373626121976008.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6611
    nb features: 46319
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date55840070702985719.test
Labeling took: 17 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_7.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date10616795455243562047.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6507
    nb features: 45591
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date16544204553833352755.test
Labeling took: 7 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_8.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date4726942631385784016.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6560
    nb features: 45962
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date15314887437784567017.test
Labeling took: 10 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_9.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date6530542163135531667.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    567
    nb labels:   7
    nb blocks:   6360
    nb features: 44562
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date7810661868839241967.test
Labeling took: 11 ms
Results: 
Worst Model

===== Token-level results =====


label                accuracy     precision    recall       f1     

<day>                99.19        97.83        97.83        97.83  
<month>              99.19        98.25        98.25        98.25  
<year>               100          100          100          100    

all fields           99.46        98.8         98.8         98.8    (micro average)
                     99.46        98.69        98.69        98.69   (macro average)

===== Field-level results =====

label                accuracy     precision    recall       f1     

<day>                99.14        97.83        97.83        97.83  
<month>              99.14        98.25        98.25        98.25  
<year>               100          100          100          100    

all fields           99.43        98.8         98.8         98.8    (micro average)
                     99.43        98.69        98.69        98.69   (macro average)

===== Instance-level results =====

Total expected instances:   63
Correct instances:          61
Instance-level recall:      96.83

Best Model

===== Token-level results =====


label                accuracy     precision    recall       f1     

<day>                100          100          100          100    
<month>              100          100          100          100    
<year>               100          100          100          100    

all fields           100          100          100          100     (micro average)
                     100          100          100          100     (macro average)

===== Field-level results =====

label                accuracy     precision    recall       f1     

<day>                100          100          100          100    
<month>              100          100          100          100    
<year>               100          100          100          100    

all fields           100          100          100          100     (micro average)
                     100          100          100          100     (macro average)

===== Instance-level results =====

Total expected instances:   63
Correct instances:          63
Instance-level recall:      100
Average precision: 99.75
Average recall: 99.81
Average F1: 99.78

Split, training and evaluation for date model is realized in 21324 ms

Process finished with exit code 0
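For reference, the procedure visible in the log above is a standard n-fold cross-validation: split the data into n parts, and for each fold train a fresh model on n-1 parts and evaluate on the held-out part. A minimal sketch of that loop, with hypothetical trainOn()/evaluateOn() placeholders rather than GROBID's actual API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of the n-fold loop shown in the log above. The names
// trainOn/evaluateOn are hypothetical placeholders, not GROBID's API.
public class NFoldSketch {

    static double crossValidate(List<String> examples, int numFolds) {
        List<String> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled);
        int foldSize = shuffled.size() / numFolds;
        double totalScore = 0.0;
        for (int fold = 0; fold < numFolds; fold++) {
            int from = fold * foldSize;
            int to = (fold == numFolds - 1) ? shuffled.size() : from + foldSize;
            List<String> test = shuffled.subList(from, to);
            List<String> train = new ArrayList<>(shuffled.subList(0, from));
            train.addAll(shuffled.subList(to, shuffled.size()));
            // train a fresh model on `train`, label `test`, collect metrics
            totalScore += evaluateOn(trainOn(train), test);
        }
        // the summary averages the per-fold scores
        return totalScore / numFolds;
    }

    static Object trainOn(List<String> train) { return new Object(); }
    static double evaluateOn(Object model, List<String> test) { return 1.0; }
}
```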

@lfoppiano (Collaborator, Author)

OK, I've found some more bugs. The worst was that the n-fold evaluation was using the model in grobid-home/models, returning wrong results and eventually failing when the model was not present.

@lfoppiano (Collaborator, Author)

> I meant the conversion to XML. Although, reading the docs again, it appears it should be possible with -Prun=0, just not as convenient, or at least it's not so clear. E.g. I would generally keep the results from previous evaluations. For my own evaluation, I originally used output suffixes, like .grobid-tei-0.5.3.xml, but that became messy and I settled on separate output folders instead (e.g. grobid-tei-0.5.3). I also (optionally) gzip the XML files to save space. [...]

I think there are too many things here; what about having this as a separate feature? I haven't really modified the end-to-end validation interface.

> Another suggestion would be to have a CSV output.

That should also be separate. Right now the evaluation results are stored in a class / set of classes, so we could add any output writer afterwards.
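A minimal sketch of that separation, with hypothetical class names (not GROBID's actual classes): the evaluation keeps its results in plain data objects, and each output format is a writer plugged in afterwards.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical names for illustration only; not GROBID's actual classes.
class LabelResult {
    final String label;
    final double precision, recall, f1;
    LabelResult(String label, double precision, double recall, double f1) {
        this.label = label; this.precision = precision;
        this.recall = recall; this.f1 = f1;
    }
}

class EvaluationResult {
    final List<LabelResult> labels;
    EvaluationResult(List<LabelResult> labels) { this.labels = labels; }
}

// Output formats become pluggable writers over the stored results.
interface EvaluationWriter {
    void write(EvaluationResult result, Appendable out) throws IOException;
}

class CsvEvaluationWriter implements EvaluationWriter {
    @Override
    public void write(EvaluationResult result, Appendable out) throws IOException {
        out.append("label,precision,recall,f1\n");
        for (LabelResult r : result.labels) {
            out.append(r.label)
               .append(String.format(",%.2f,%.2f,%.2f%n", r.precision, r.recall, r.f1));
        }
    }
}
```

A Markdown or console writer would then implement the same interface without touching the evaluation code itself.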

@lfoppiano (Collaborator, Author)

One more detail: I've moved the method dispatchExample() into AbstractTrainer. This method was duplicated in basically every Trainer class. After this change, many submodules' training code needs to be updated: the duplicated method should be removed and the one now provided by AbstractTrainer used instead.
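This is the classic pull-up-method refactoring; a minimal sketch, with a guessed signature for dispatchExample() (see the PR for the real one):

```java
import java.util.Random;

// Sketch of the pull-up-method refactoring. The signature shown for
// dispatchExample() is a guess for illustration, not the actual PR code.
abstract class AbstractTrainer {
    private final Random random = new Random();

    // Previously copy-pasted into each concrete Trainer; now shared here.
    // Decides whether an example goes to the training or evaluation split.
    protected boolean dispatchExample(double splitRatio) {
        return random.nextDouble() <= splitRatio; // true -> training set
    }

    public abstract void train();
}

class DateTrainer extends AbstractTrainer {
    @Override
    public void train() {
        boolean toTraining = dispatchExample(0.9); // inherited, no local copy
    }
}
```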

@kermitt2 (Owner) commented Aug 27, 2019

Some remarks about the current version. I think this is almost there, and a great addition!

  • it should be possible to set numFolds to 1 to use just one fold (one-shot evaluation); this is a normal, intuitive usage
  • numFolds at 0, which is currently translated to 10, does not make a lot of sense IMHO; numFolds at 0 should simply not be possible, because it corresponds to nothing concrete
  • for the summary of results, I would suggest replacing worst model/best model with worst fold/best fold, because for CRF, for instance, we do not compare models but training/evaluation settings (models are deterministic)
  • statistics over instances are not averaged (while they are over fields); see the sketch after the corrected block below. For example:
====================== Fold 0 ====================== 
===== Instance-level results =====

Total expected instances:   318
Correct instances:          295
Instance-level recall:      92.77



====================== Fold 1 ====================== 
===== Instance-level results =====

Total expected instances:   318
Correct instances:          310
Instance-level recall:      97.48
Average over 2 folds: 
===== Instance-level results =====

Total expected instances:   636
Correct instances:          605
Instance-level recall:      95.13

It should be:

Average over 2 folds: 
===== Instance-level results =====

Total expected instances:   318
Correct instances:          302.5
Instance-level recall:      95.13
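A minimal plain-Java sketch of the averaging described above (not the PR's actual code): divide the per-fold counts by the number of folds instead of pooling raw totals.

```java
// Sketch of the intended arithmetic: report per-fold averages of the
// instance counts rather than their sums across folds.
class InstanceLevelAverage {
    public static void main(String[] args) {
        int[] expected = {318, 318}; // expected instances per fold
        int[] correct  = {295, 310}; // correct instances per fold
        int folds = expected.length;

        double avgExpected = 0, avgCorrect = 0;
        for (int i = 0; i < folds; i++) {
            avgExpected += expected[i];
            avgCorrect  += correct[i];
        }
        avgExpected /= folds; // 318.0
        avgCorrect  /= folds; // 302.5

        double recall = 100.0 * avgCorrect / avgExpected; // 95.13
        System.out.printf("Total expected instances: %.1f%n", avgExpected);
        System.out.printf("Correct instances:        %.1f%n", avgCorrect);
        System.out.printf("Instance-level recall:    %.2f%n", recall);
    }
}
```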

@kermitt2 (Owner)

Gasp, I tested an old version! I've updated my remarks above.

@lfoppiano (Collaborator, Author)

> Some remarks about the current version. I think this is almost there, and a great addition!
>
>   • it should be possible to set numFolds to 1 to use just one fold (one-shot evaluation); this is a normal, intuitive usage

Thanks for the review. I've implemented everything, but I have a question about fold = 1: do you mean just evaluating the current model, using all the provided data as one fold?

@kermitt2 (Owner)

> Thanks for the review. I've implemented everything, but I have a question about fold = 1: do you mean just evaluating the current model, using all the provided data as one fold?

Ah yes, sorry for the confusion. In this case no evaluation is possible, so it would be just one training run with the single fold.

@lfoppiano (Collaborator, Author)

> Ah yes, sorry for the confusion. In this case no evaluation is possible, so it would be just one training run with the single fold.

I'm still not following. Fold = 1 is computed to take the whole dataset, so if you mean training, isn't that just the same as running the training alone? Does it make sense to train a model without evaluation, when the point of the process is the evaluation?

To be sure, I've checked other libraries, such as scikit-learn, and it looks like they only allow n >= 2: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

@kermitt2 (Owner)

Yes, you're right, it's redundant with the train-only option, so it's useless. Let's keep n > 1!
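A minimal sketch of the resulting guard (hypothetical class and method names, not the actual PR code):

```java
// Hypothetical guard reflecting the agreed behaviour: n-fold evaluation
// only accepts numFolds > 1; the single-fold case is plain training.
class FoldValidation {
    static void validateNumFolds(int numFolds) {
        if (numFolds <= 1) {
            throw new IllegalArgumentException(
                "numFolds must be > 1 (got " + numFolds
                + "); for a single training run, use the train task instead.");
        }
    }
}
```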

@lfoppiano (Collaborator, Author) commented Aug 28, 2019

Thanks! I've pushed the change.

@kermitt2 kermitt2 merged commit 38489ed into master Sep 11, 2019
tantikristanti pushed a commit that referenced this pull request Nov 15, 2019
Improvement in evaluation framework
