
Improvement in evaluation framework #280

Merged
merged 45 commits from check-evaluation into master on Sep 11, 2019
Conversation

@lfoppiano (Collaborator) commented Jan 24, 2018

On my way through checking that everything was OK, I've rewritten the evaluation in a way that is testable and tested, while keeping the same efficiency.

Some parts could probably be improved further, but it's a nice base to start from.

Update: Work in progress to implement #453

@coveralls commented Jul 4, 2019

Coverage increased (+0.2%) to 36.851% when pulling 149d3b7 on check-evaluation into e3b8886 on master.

@lfoppiano changed the title from "Reworked the evaluation in a testable way" to "[WIP] Improvement in evaluation framework" on Jul 4, 2019
@lfoppiano lfoppiano self-assigned this Jul 4, 2019
@lfoppiano lfoppiano added this to the 0.6.0 milestone Jul 4, 2019
@lfoppiano lfoppiano requested review from kermitt2 and removed request for kermitt2 July 11, 2019 02:01
@lfoppiano (Collaborator, Author)

I've implemented something... however, when testing the date model, I get suspiciously high results:

org.grobid.trainer.TrainerRunner 3 date -gH /Users/lfoppiano/development/projects/grobid/grobid-home -n 10
path2GbdHome=/Users/lfoppiano/development/projects/grobid/grobid-home   path2GbdProperties=/Users/lfoppiano/development/projects/grobid/grobid-home/config/grobid.properties
Jul 11, 2019 3:56:34 PM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathOrLoadFromClasspath

[...]

sourcePathLabel: /Users/lfoppiano/development/projects/grobid/grobid-home/../grobid-trainer/resources/dataset/date/corpus
outputPath for training data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date2593286755485374659.train
355 tei files
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_0.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date2049499740058725076.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6458
    nb features: 45248
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date7712087194203659098.test
Jul 11, 2019 3:56:37 PM org.grobid.core.jni.WapitiModel init
INFO: Loading model: /Users/lfoppiano/development/projects/grobid/grobid-home/models/date/model.wapiti (size: 102435)
[Wapiti] Loading model: "/Users/lfoppiano/development/projects/grobid/grobid-home/models/date/model.wapiti"
Model path: /Users/lfoppiano/development/projects/grobid/grobid-home/models/date/model.wapiti
Labeling took: 11 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_1.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date10216013546878777186.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6587
    nb features: 46151
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date7677806593219064519.test
Labeling took: 6 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_2.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date11876661758354315573.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6376
    nb features: 44674
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date2377539879914962123.test
Labeling took: 8 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_3.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date751133838306332397.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6422
    nb features: 44996
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date5873862545493725401.test
Labeling took: 18 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_4.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date16816317073972557159.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6509
    nb features: 45605
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date17026021496668950245.test
Labeling took: 6 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_5.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date9989361506377752060.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6492
    nb features: 45486
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date11548516505392637898.test
Labeling took: 6 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_6.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date801373626121976008.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6611
    nb features: 46319
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date55840070702985719.test
Labeling took: 17 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_7.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date10616795455243562047.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6507
    nb features: 45591
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date16544204553833352755.test
Labeling took: 7 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_8.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date4726942631385784016.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    573
    nb labels:   7
    nb blocks:   6560
    nb features: 45962
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date15314887437784567017.test
Labeling took: 10 ms
Saving model in /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date_nfold_9.wapiti
Training input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date6530542163135531667.train
	epsilon: 1.0E-5
	window: 20
	nb max iterations: 2000
	nb threads: 12
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    567
    nb labels:   7
    nb blocks:   6360
    nb features: 44562
* Train the model with l-bfgs
* Save the model
* Done
Evaluation input data: /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/date7810661868839241967.test
Labeling took: 11 ms
Results: 
Worst Model

===== Token-level results =====


label                accuracy     precision    recall       f1     

<day>                99.19        97.83        97.83        97.83  
<month>              99.19        98.25        98.25        98.25  
<year>               100          100          100          100    

all fields           99.46        98.8         98.8         98.8    (micro average)
                     99.46        98.69        98.69        98.69   (macro average)

===== Field-level results =====

label                accuracy     precision    recall       f1     

<day>                99.14        97.83        97.83        97.83  
<month>              99.14        98.25        98.25        98.25  
<year>               100          100          100          100    

all fields           99.43        98.8         98.8         98.8    (micro average)
                     99.43        98.69        98.69        98.69   (macro average)

===== Instance-level results =====

Total expected instances:   63
Correct instances:          61
Instance-level recall:      96.83

Best Model

===== Token-level results =====


label                accuracy     precision    recall       f1     

<day>                100          100          100          100    
<month>              100          100          100          100    
<year>               100          100          100          100    

all fields           100          100          100          100     (micro average)
                     100          100          100          100     (macro average)

===== Field-level results =====

label                accuracy     precision    recall       f1     

<day>                100          100          100          100    
<month>              100          100          100          100    
<year>               100          100          100          100    

all fields           100          100          100          100     (micro average)
                     100          100          100          100     (macro average)

===== Instance-level results =====

Total expected instances:   63
Correct instances:          63
Instance-level recall:      100
Average precision: 99.75
Average recall: 99.81
Average F1: 99.78

Split, training and evaluation for date model is realized in 21324 ms

Process finished with exit code 0
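For reference, the procedure visible in the log above is a standard n-fold cross-validation: split the data into n parts, and for each fold train a fresh model on n-1 parts and evaluate on the held-out part. A minimal sketch of that loop, with hypothetical trainOn()/evaluateOn() placeholders rather than GROBID's actual API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of the n-fold loop shown in the log above. The names
// trainOn/evaluateOn are hypothetical placeholders, not GROBID's API.
public class NFoldSketch {

    static double crossValidate(List<String> examples, int numFolds) {
        List<String> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled);
        int foldSize = shuffled.size() / numFolds;
        double totalScore = 0.0;
        for (int fold = 0; fold < numFolds; fold++) {
            int from = fold * foldSize;
            int to = (fold == numFolds - 1) ? shuffled.size() : from + foldSize;
            List<String> test = shuffled.subList(from, to);
            List<String> train = new ArrayList<>(shuffled.subList(0, from));
            train.addAll(shuffled.subList(to, shuffled.size()));
            // train a fresh model on `train`, label `test`, collect metrics
            totalScore += evaluateOn(trainOn(train), test);
        }
        // the summary averages the per-fold scores
        return totalScore / numFolds;
    }

    static Object trainOn(List<String> train) { return new Object(); }
    static double evaluateOn(Object model, List<String> test) { return 1.0; }
}
```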

@lfoppiano (Collaborator, Author)

OK, I've found some more bugs. The worst was that the n-fold evaluation was using the model in grobid-home/models, returning wrong results and eventually failing when the model was not present.

@lfoppiano (Collaborator, Author)

> I meant the conversion to XML. Although, reading the docs again, it appears it should be possible with -Prun=0, just not as convenient, or at least it's not so clear. E.g. I would generally keep the results from previous evaluations. For my own evaluation, I originally used output suffixes, like .grobid-tei-0.5.3.xml, but that became messy and I settled on separate output folders instead (e.g. grobid-tei-0.5.3). I also (optionally) gzip the XML files to save space. [...]

I think there are too many things here; what about having this as a separate feature? I haven't really modified the end-to-end validation interface.

> Another suggestion would be to have a CSV output.

That should also be separate. Right now the evaluation results are stored in a class / set of classes, so we could add any output writer afterwards.
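A minimal sketch of that separation, with hypothetical class names (not GROBID's actual classes): the evaluation keeps its results in plain data objects, and each output format is a writer plugged in afterwards.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical names for illustration only; not GROBID's actual classes.
class LabelResult {
    final String label;
    final double precision, recall, f1;
    LabelResult(String label, double precision, double recall, double f1) {
        this.label = label; this.precision = precision;
        this.recall = recall; this.f1 = f1;
    }
}

class EvaluationResult {
    final List<LabelResult> labels;
    EvaluationResult(List<LabelResult> labels) { this.labels = labels; }
}

// Output formats become pluggable writers over the stored results.
interface EvaluationWriter {
    void write(EvaluationResult result, Appendable out) throws IOException;
}

class CsvEvaluationWriter implements EvaluationWriter {
    @Override
    public void write(EvaluationResult result, Appendable out) throws IOException {
        out.append("label,precision,recall,f1\n");
        for (LabelResult r : result.labels) {
            out.append(r.label)
               .append(String.format(",%.2f,%.2f,%.2f%n", r.precision, r.recall, r.f1));
        }
    }
}
```

A Markdown or console writer would then implement the same interface without touching the evaluation code itself.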

@lfoppiano (Collaborator, Author)

One more detail: I've moved the method dispatchExample() into AbstractTrainer. This method was duplicated in basically every Trainer class. After this change, many submodules' training code needs to be updated: the duplicated method should be removed and the one now provided by AbstractTrainer used instead.
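This is the classic pull-up-method refactoring; a minimal sketch, with a guessed signature for dispatchExample() (see the PR for the real one):

```java
import java.util.Random;

// Sketch of the pull-up-method refactoring. The signature shown for
// dispatchExample() is a guess for illustration, not the actual PR code.
abstract class AbstractTrainer {
    private final Random random = new Random();

    // Previously copy-pasted into each concrete Trainer; now shared here.
    // Decides whether an example goes to the training or evaluation split.
    protected boolean dispatchExample(double splitRatio) {
        return random.nextDouble() <= splitRatio; // true -> training set
    }

    public abstract void train();
}

class DateTrainer extends AbstractTrainer {
    @Override
    public void train() {
        boolean toTraining = dispatchExample(0.9); // inherited, no local copy
    }
}
```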

@kermitt2 (Owner) commented Aug 27, 2019

Some remarks about the current version. I think this is almost there, and a great addition!

  • it should be possible to set numFolds to 1 to use just one fold (one-shot evaluation); this is a normal, intuitive usage
  • numFolds at 0, which is currently translated to 10, does not make a lot of sense IMHO; numFolds at 0 should simply not be possible, because it corresponds to nothing concrete
  • for the summary of results, I would suggest replacing worst model/best model with worst fold/best fold, because for CRF, for instance, we do not compare models but training/evaluation settings (models are deterministic)
  • statistics over instances are not averaged (while they are over fields); see the sketch after the corrected block below. For example:
====================== Fold 0 ====================== 
===== Instance-level results =====

Total expected instances:   318
Correct instances:          295
Instance-level recall:      92.77



====================== Fold 1 ====================== 
===== Instance-level results =====

Total expected instances:   318
Correct instances:          310
Instance-level recall:      97.48
Average over 2 folds: 
===== Instance-level results =====

Total expected instances:   636
Correct instances:          605
Instance-level recall:      95.13

It should be:

Average over 2 folds: 
===== Instance-level results =====

Total expected instances:   318
Correct instances:          302.5
Instance-level recall:      95.13
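A minimal plain-Java sketch of the averaging described above (not the PR's actual code): divide the per-fold counts by the number of folds instead of pooling raw totals.

```java
// Sketch of the intended arithmetic: report per-fold averages of the
// instance counts rather than their sums across folds.
class InstanceLevelAverage {
    public static void main(String[] args) {
        int[] expected = {318, 318}; // expected instances per fold
        int[] correct  = {295, 310}; // correct instances per fold
        int folds = expected.length;

        double avgExpected = 0, avgCorrect = 0;
        for (int i = 0; i < folds; i++) {
            avgExpected += expected[i];
            avgCorrect  += correct[i];
        }
        avgExpected /= folds; // 318.0
        avgCorrect  /= folds; // 302.5

        double recall = 100.0 * avgCorrect / avgExpected; // 95.13
        System.out.printf("Total expected instances: %.1f%n", avgExpected);
        System.out.printf("Correct instances:        %.1f%n", avgCorrect);
        System.out.printf("Instance-level recall:    %.2f%n", recall);
    }
}
```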

@kermitt2 (Owner)

Gasp, I tested an old version! I've updated my remarks above.

@lfoppiano (Collaborator, Author)

> Some remarks about the current version. I think this is almost there, and a great addition!
>
>   • it should be possible to set numFolds to 1 to use just one fold (one-shot evaluation); this is a normal, intuitive usage

Thanks for the review. I've implemented everything, but I have a question about fold = 1: do you mean just evaluating the current model, using all the provided data as one fold?

@kermitt2 (Owner)

> Thanks for the review. I've implemented everything, but I have a question about fold = 1: do you mean just evaluating the current model, using all the provided data as one fold?

Ah yes, sorry for the confusion. In this case no evaluation is possible, so it would be just one training run with the single fold.

@lfoppiano (Collaborator, Author)

> Ah yes, sorry for the confusion. In this case no evaluation is possible, so it would be just one training run with the single fold.

I'm still not following. Fold = 1 is computed to take the whole dataset, so if you mean training, isn't that just the same as running the training alone? Does it make sense to train a model without evaluation, when the point of the process is the evaluation?

To be sure, I've checked other libraries, such as scikit-learn, and it looks like they only allow n >= 2: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

@kermitt2 (Owner)

Yes, you're right, it's redundant with the train-only option, so it's useless. Let's keep n > 1!
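A minimal sketch of the resulting guard (hypothetical class and method names, not the actual PR code):

```java
// Hypothetical guard reflecting the agreed behaviour: n-fold evaluation
// only accepts numFolds > 1; the single-fold case is plain training.
class FoldValidation {
    static void validateNumFolds(int numFolds) {
        if (numFolds <= 1) {
            throw new IllegalArgumentException(
                "numFolds must be > 1 (got " + numFolds
                + "); for a single training run, use the train task instead.");
        }
    }
}
```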

@lfoppiano (Collaborator, Author) commented Aug 28, 2019

Thanks! I've pushed the change.

@kermitt2 kermitt2 merged commit 38489ed into master Sep 11, 2019
tantikristanti pushed a commit that referenced this pull request Nov 15, 2019
Improvement in evaluation framework
