Check the data sets #1

Open
liebharc opened this issue May 15, 2024 · 1 comment
Labels
good first issue (Good for newcomers), help wanted (Extra attention is needed)

liebharc commented May 15, 2024

Introduction

The efficacy of a transformer model is significantly influenced by the quality of its training data. However, the original training dataset utilized by https://github.com/NetEase/Polyphonic-TrOMR/tree/master remains unpublished. Consequently, this repository relies on https://github.com/liebharc/Polyphonic-TrOMR/tree/master, which trains the transformer on datasets sourced from https://grfia.dlsi.ua.es/primus/, https://sites.google.com/view/multiscore-project/datasets, and https://github.com/itec-hust/CPMS. Notably, for the grandstaff dataset, extensive preprocessing is essential, including the segmentation of the grandstaff into individual staves. In the past, significant improvements in performance have been achieved through rectifying errors in datasets, such as stave segmentation, accidental placement, or the conversion of humdrum files into the TrOMR semantic format.

The Task Itself

It would be helpful to have another set of eyes go through all the datasets, especially the grandstaff one. Just take a peek at some random staff images and their corresponding semantic representations. If you spot any issues, we should either tweak our preprocessing methods to fix them or just kick those problematic cases out of the datasets. That way, we won't confuse the transformer during training.
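A minimal sketch of how such a spot check could be set up. It assumes each staff image sits next to a text file with the same stem holding its semantic representation. The extensions (`.jpg`, `.png`, `.semantic`), the directory name, and the helper names `find_pairs`, `sample_pairs` and `print_sample` are hypothetical and would need to be adapted to the actual dataset layout.

```python
import random
from pathlib import Path


def find_pairs(root, image_exts=(".jpg", ".png"), label_ext=".semantic"):
    """Return sorted (image, label) path pairs that share a file stem.

    The extensions are assumptions; images without a label file are skipped.
    """
    pairs = []
    for image in sorted(Path(root).rglob("*")):
        if not image.is_file() or image.suffix.lower() not in image_exts:
            continue
        label = image.with_suffix(label_ext)
        if label.is_file():
            pairs.append((image, label))
    return pairs


def sample_pairs(root, n=10, seed=None, **kwargs):
    """Pick up to n random pairs; a fixed seed makes the sample repeatable."""
    pairs = find_pairs(root, **kwargs)
    rng = random.Random(seed)
    return rng.sample(pairs, min(n, len(pairs)))


def print_sample(root, n=10, seed=None):
    """Print each sampled image path followed by its semantic text."""
    for image, label in sample_pairs(root, n, seed):
        print(image)
        print("  " + label.read_text(encoding="utf-8").strip())
```

Calling `print_sample("datasets/grandstaff", n=20, seed=1)` (the path is a placeholder) lists 20 pairs to open side by side. Noting the seed in a comment here lets someone else look at the same sample when a problem is reported.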

Update

The CPMS dataset has been removed for now, and the "Lieder" dataset has been added. The task itself remains important.

liebharc added the help wanted and good first issue labels May 15, 2024
@liebharc
Owner Author

With the changes that led to v0.2.0, the most severe issues in the data sets should be fixed. The fixes led to a significant improvement in performance. I'll leave this issue open, as a second pair of eyes would still be really useful.
