Lists implementation for fulltext model #429

Open · wants to merge 1,558 commits into kermitt2:master from Vitaliy-1:core_fulltext_lists
Conversation

Vitaliy-1 (Contributor)

Hi @kermitt2,

I've noticed that list items are excluded from being labeled by the fulltext model. Are you interested in implementing them? Or are you considering putting them into a separate model, like figures?

In this PR, model.wapiti is trained with the default corpus.

lfoppiano and others added 30 commits, starting January 22, 2018, including:

- …o add the rest of the war packaging components
- extracting standalone figures (for which we didn't detect captions, … but pretty sure that they are proper figures)
- making sure that CrossrefClient does not prevent JVM from exiting
- …ction
- update the links for INIST and TEI in the documentation
- [wip] build docker from local source
- Fix BibDataSetContextExtractor to quote replacement text
- Fix JSON generation for reference annotations: backslashes in URLs were being passed through verbatim into JSON for the reference annotations, resulting in invalid JSON output, because the JSON was built via string concatenation without any escaping. This commit switches to Jackson to ensure the JSON is valid and properly escaped.
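To make the Jackson fix concrete, here is a minimal sketch (not code from this PR; the URL and field name are made up) contrasting hand-concatenated JSON, which passes backslashes through verbatim, with Jackson serialization, which escapes them:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Collections;

public class JsonEscapingDemo {
    public static void main(String[] args) throws Exception {
        // A URL containing backslashes, as sometimes found in extracted references
        String url = "http://example.org/files\\paper\\v1.pdf";

        // Naive string concatenation passes the backslashes through verbatim,
        // producing invalid JSON ("\p" and "\v" are not legal JSON escape sequences)
        String handBuilt = "{\"url\": \"" + url + "\"}";
        System.out.println(handBuilt);

        // Jackson escapes backslashes (and any other special characters) properly
        String escaped = new ObjectMapper().writeValueAsString(Collections.singletonMap("url", url));
        System.out.println(escaped); // {"url":"http://example.org/files\\paper\\v1.pdf"}
    }
}
```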
@coveralls

Coverage Status: Coverage decreased (-0.02%) to 36.519% when pulling 08a7784 on Vitaliy-1:core_fulltext_lists into 4e1bb24 on kermitt2:master.

@Vitaliy-1 (Contributor, Author)

Hi @kermitt2,

Can I do anything more to help with the lists implementation?

@kermitt2 (Owner)

Hello @Vitaliy-1! I am sorry for the time I am taking to react to your PR. I think I excluded lists at some point because their recognition was not reliable and was messing up the final TEI body serialization.

In the meantime, the serialization of the body has improved because the TaggingTokenClusteror is now used, so it should not be a problem anymore. It might be nice to quickly review the "list" annotations in the training data, to be sure they are consistently present.

Apart from that, you also fixed the TEI serialization for lists, so there would be nothing special to do beyond checking whether the accuracy is acceptable.

@kermitt2 (Owner) commented Jun 9, 2019

I made some tests with train_eval and a random split at 80%, and item lists are still very inaccurate given the existing training data. Based on 3 trainings with random splits, making sure we have item lists in both partitions, we get the following:

train 1
                     P            R            F-score
<figure>             99.62        70.73        82.73  
<figure_marker>      68.89        100          81.58  
<item>               56.38        12.18        20.04  
<paragraph>          88.36        94.65        91.4   
<section>            100          99.15        99.57  
<table>              71.9         87.89        79.09  
<table_marker>       90           75           81.82

train 2
                     P            R            F-score
<citation_marker>    90.18        64.79        75.41  
<equation>           74.81        92.33        82.65  
<equation_label>     100          56.25        72     
<figure>             26.76        26.37        26.56  
<figure_marker>      67.69        69.84        68.75  
<item>               1.13         0.98         1.05   
<paragraph>          87.62        87.45        87.54  
<section>            57.61        81.65        67.56  
<table>              40.03        44.28        42.05  
<table_marker>       85.71        57.14        68.57  

train 3
                     P            R            F-score
<citation_marker>    93.63        86.31        89.82  
<equation>           46.54        91.15        61.62  
<equation_label>     85.71        75           80     
<figure>             17.38        38.2         23.89  
<figure_marker>      88.29        58.13        70.1   
<item>               0            0            0      
<paragraph>          91.08        76.93        83.41  
<section>            86.35        43.88        58.19  
<table>              58.49        56.79        57.63  
<table_marker>       93.85        85.92        89.71

So for the moment there's still not enough training data to include lists in the training for the master release. As you saw, this was already implemented a while ago but not enabled.

The bottleneck is adding more training data: item list is a very unbalanced label, so it requires significantly more training data before it can be included.
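As an aside, here is a minimal sketch (not GROBID code; the corpus path and the `<item>` substring check are assumptions) of the kind of random 80/20 split described above, retried until documents containing item lists end up in both partitions:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class TrainEvalSplit {
    public static void main(String[] args) throws IOException {
        // Collect the annotated training files (hypothetical location and extension)
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(Paths.get("corpus/tei"), "*.tei.xml")) {
            ds.forEach(files::add);
        }

        Random rnd = new Random();
        List<Path> train, eval;
        do {
            // Shuffle and cut at 80%, then retry if either partition lacks item lists
            Collections.shuffle(files, rnd);
            int cut = (int) (files.size() * 0.8);
            train = new ArrayList<>(files.subList(0, cut));
            eval = new ArrayList<>(files.subList(cut, files.size()));
        } while (!containsItems(train) || !containsItems(eval));

        System.out.println("train: " + train.size() + " files, eval: " + eval.size() + " files");
    }

    // True if at least one document in the partition contains an <item> annotation
    private static boolean containsItems(List<Path> docs) throws IOException {
        for (Path p : docs) {
            if (new String(Files.readAllBytes(p)).contains("<item>")) {
                return true;
            }
        }
        return false;
    }
}
```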

@kermitt2 (Owner)

I will do additional tests with some more non-public training data for the fulltext model; that will give an idea of how much training data is necessary for acceptable accuracy on item lists.

@Vitaliy-1 (Contributor, Author)

Thanks, @kermitt2!

Let me know what amount of training data would be acceptable for lists.

@kermitt2 (Owner)

Hi @Vitaliy-1!

I had to relaunch my evaluation on the larger training set twice because of an unexpected reboot, sorry. In the second training I actually obtained slightly better results than the previous ones. My extended training set has an additional ~70 annotated document bodies; I made an 80% train/eval split:

===== Token-level results =====


label                accuracy     precision    recall       f1     

<citation_marker>    99.6         93.02        94.08        93.55  
<equation>           99.29        88.31        77.5         82.55  
<equation_label>     99.99        100          88.24        93.75  
<equation_marker>    99.99        33.33        50           40     
<figure>             95.36        73.03        45.18        55.83  
<figure_marker>      99.91        76.97        92.03        83.83  
<item>               98.02        79.91        34.86        48.54  
<paragraph>          93.6         93.11        98.05        95.51  
<section>            99.89        96.83        92.56        94.65  
<table>              97.93        81.62        87.36        84.39  
<table_marker>       99.95        73.2         91.06        81.16  

all fields           98.5         91.05        91           91.03   (micro average)
                     98.5         80.85        77.36        77.62   (macro average)

===== Field-level results =====

label                accuracy     precision    recall       f1     

<citation_marker>    93.64        80.63        82.82        81.71  
<equation>           98.29        61.9         55.71        58.65  
<equation_label>     99.94        100          88.24        93.75  
<equation_marker>    99.84        33.33        25           28.57  
<figure>             97.21        31.34        32.31        31.82  
<figure_marker>      97.8         66.09        70.37        68.16  
<item>               98.17        84.31        45.74        59.31  
<paragraph>          78.63        75.04        75.41        75.22  
<section>            99.29        95.5         93.17        94.32  
<table>              98.42        43.48        44.44        43.96  
<table_marker>       99.41        78.57        86.27        82.24  

all fields           96.42        75.8         75.18        75.49   (micro average)
                     96.42        68.2         63.59        65.25   (macro average)

So list recognition is precise here, but recall is low, which is quite usual with a lack of training data. To get a more accurate picture, I would need to do a 10-fold training and average the scores, but I don't have enough free CPU available right now.

This shows that the addition of the label is technically OK, but we would really need more public training data to use it in practice.
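As a quick sanity check on these tables, the reported f1 is just the harmonic mean of precision and recall; for example, for the token-level <item> row:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 79.91 \times 34.86}{79.91 + 34.86} \approx 48.54$$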

@Vitaliy-1 (Contributor, Author)

Hi @kermitt2,

Thanks for checking how it looks on a bigger training set. I'll look into how much additional training data we can provide. Do you already have a set with annotated lists?

If lack of free CPUs is the only issue, that's something I can ask to have provided.

@kermitt2 (Owner)

> Do you already have a set with annotated lists?

Do you mean in the existing training data? There are a few documents with lists. Or do you mean some XML full texts, free to reuse, that contain lists?

The CPU would just be for getting a more accurate evaluation; it's not an issue.
The issue is, as always, lack of training data :D It's very time-consuming to produce good-quality training data for the fulltext model.

@Vitaliy-1 (Contributor, Author)

Yes, I mean training data. And yes, it's quite time-consuming to produce it :) I'll look into how much additional annotated data with lists we can provide.

@Vitaliy-1 (Contributor, Author)

Hi @kermitt2,

Can you explain the mechanism for measuring accuracy, precision, and recall for models?

@kermitt2 (Owner)

Basically it uses the usual sequence labeling format:

token f0...fn label

For comparing the expected labels with those produced by the model, the format becomes:

token f0...fn expected_label predicted_label

This goes through the evaluateStandard() method in EvaluationUtilities.java, which generates a report (from an object called Stats.java that contains all the statistics).
The tagger is a parameter, so it applies to every model.
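For readers unfamiliar with this kind of evaluation, here is a minimal sketch of the idea (not the actual EvaluationUtilities/Stats code; the class, the simplified feature columns, and the example lines are made up): token-level precision, recall, and f1 are computed per label by comparing the last two columns of each line.

```java
import java.util.*;

public class TokenLevelEval {

    // Per-label counters: how often the label was expected, predicted, and correctly predicted
    static class Counts { int expected, predicted, correct; }

    public static Map<String, double[]> evaluate(List<String> lines) {
        Map<String, Counts> perLabel = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;       // blank lines separate sequences
            String[] cols = line.trim().split("\\s+");
            String expected = cols[cols.length - 2];   // second-to-last column
            String predicted = cols[cols.length - 1];  // last column
            perLabel.computeIfAbsent(expected, k -> new Counts()).expected++;
            perLabel.computeIfAbsent(predicted, k -> new Counts()).predicted++;
            if (expected.equals(predicted)) {
                perLabel.get(expected).correct++;
            }
        }
        Map<String, double[]> results = new TreeMap<>();
        for (Map.Entry<String, Counts> e : perLabel.entrySet()) {
            Counts c = e.getValue();
            double p = c.predicted == 0 ? 0 : (double) c.correct / c.predicted;  // precision
            double r = c.expected == 0 ? 0 : (double) c.correct / c.expected;    // recall
            double f1 = (p + r) == 0 ? 0 : 2 * p * r / (p + r);                  // harmonic mean
            results.put(e.getKey(), new double[]{p, r, f1});
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Introduction blockStart <section> <section>",
                "1. lineStart <item> <paragraph>",
                "First lineIn <item> <item>");
        evaluate(lines).forEach((label, prf) ->
                System.out.printf("%-12s P=%.2f R=%.2f F1=%.2f%n", label, prf[0], prf[1], prf[2]));
    }
}
```

Field-level scores work the same way, except that entire labeled spans (rather than single tokens) have to match; the micro average weights labels by their frequency, while the macro average is a plain mean over the per-label scores.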

@de-code (Collaborator) commented Oct 9, 2020

Hi, just wondering what the plan is with this PR?
