
Different size of CLS unsupervised data between .csv and original .xml files #63

Closed
blazejdolicki opened this issue Apr 7, 2020 · 1 comment


@blazejdolicki

Here the de-books data used for fine-tuning the LM has size 152523 + 16947 = 169470, which matches the total size of the original data from the .xml file (also 169470). However, when I run `python prepare_cls.py https://storage.googleapis.com/ulmfit/cls`, the downloaded de.unsup.csv file has only 29999 items. I checked, and the sizes of the train and test sets do correspond to the logs in the link. So, for some reason, not all of the data is currently available in the .csv files, and consequently the achieved results are worse than the ones from the link (which correspond to the results in the paper). Is there any explanation for this?
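A quick way to verify these counts (a minimal sketch; it assumes de.unsup.csv sits in the working directory and has no header row):

```python
import csv

# Count rows in the de.unsup.csv produced by prepare_cls.py.
# Assumes the file has no header row; adjust if it does.
with open("de.unsup.csv", newline="", encoding="utf-8") as f:
    n_rows = sum(1 for _ in csv.reader(f))

print(f"de.unsup.csv has {n_rows} rows")  # reportedly 29999, not 169470
```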

@eisenjulian
Contributor

Hello Błażej, thanks for your interest in the project, and sorry for the delay. I recall that the unsupervised portion was capped when the original script was run. The full dataset, whose size is consistent with your calculation, is at https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv
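For example, the full file can be fetched and its row count checked against the 169470 figure above (a minimal sketch; the local filename and the no-header-row assumption are illustrative, not confirmed by the repo):

```python
import csv
import urllib.request

URL = "https://storage.googleapis.com/ulmfit/cls-full/de/books/unlabeled.csv"

# Download the full unsupervised split for de-books.
urllib.request.urlretrieve(URL, "unlabeled.csv")

# Count rows; assumes no header row, adjust if the file has one.
with open("unlabeled.csv", newline="", encoding="utf-8") as f:
    n_rows = sum(1 for _ in csv.reader(f))

print(f"unlabeled.csv has {n_rows} rows")  # expected: 169470 per the calculation above
```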

Let me know where you found the other URL and we can update it in the repo, or feel free to submit a PR.
