
Different size of CLS unsupervised data between .csv and original .xml files #63

Closed
blazejdolicki opened this issue Apr 7, 2020 · 1 comment


@blazejdolicki

Here the de-books data used for fine-tuning the LM has size 152523 + 16947 = 169470, which matches the total size of the original data from the .xml file (also 169470). However, when I run `python prepare_cls.py https://storage.googleapis.com/ulmfit/cls`, the downloaded de.unsup.csv file has only 29999 items. I checked, and the sizes of the train and test sets do correspond to the logs in the link. So, for some reason, not all of the data is currently available in the .csv files, and consequently the achieved results are worse than the ones from the link (which correspond to the results in the paper). Is there any explanation for this?
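A quick way to verify these counts (a minimal sketch; it assumes de.unsup.csv sits in the working directory and has no header row):

```python
import csv

# Count rows in the de.unsup.csv produced by prepare_cls.py.
# Assumes the file has no header row; adjust if it does.
with open("de.unsup.csv", newline="", encoding="utf-8") as f:
    n_rows = sum(1 for _ in csv.reader(f))

print(f"de.unsup.csv has {n_rows} rows")  # reportedly 29999, not 169470
```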

@eisenjulian
Contributor

Hello Błażej, thanks for your interest in the project, and sorry for the delay. I recall that the unsupervised portion was capped when the original script was run. The full dataset, whose size is consistent with your calculation, is at https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv
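For example, the full file can be fetched and its row count checked against the 169470 figure above (a minimal sketch; the local filename and the no-header-row assumption are illustrative, not confirmed by the repo):

```python
import csv
import urllib.request

URL = "https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv"

# Download the full unsupervised split for de-books.
urllib.request.urlretrieve(URL, "unlabeled.csv")

# Count rows; assumes no header row, adjust if the file has one.
with open("unlabeled.csv", newline="", encoding="utf-8") as f:
    n_rows = sum(1 for _ in csv.reader(f))

print(f"unlabeled.csv has {n_rows} rows")  # expected: 169470 per the calculation above
```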

Let me know where you found the other URL and we can update it in the repo, or feel free to submit a PR.
