tfds with limited examples #50
Hi, you may modify `pegasus/pegasus/data/datasets.py` (line 186, at commit 1e029a8).
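For context, tfds supports a split-slicing syntax (e.g. `train[:10]`) that can be used to limit the number of loaded examples. Below is a minimal sketch of building such a split string; the helper name `limited_split` is mine, not from the repo:

```python
def limited_split(split="train", num_examples=None):
    """Build a tfds split spec keeping only the first num_examples.

    tfds accepts slicing syntax such as "train[:10]", so a limited
    split is just a formatted string passed to tfds.load.
    """
    if num_examples is None:
        return split
    return f"{split}[:{num_examples}]"

# e.g. tfds.load("aeslc", split=limited_split("train", 10)) would load
# only the first 10 training examples.
print(limited_split("train", 10))   # -> train[:10]
print(limited_split("validation"))  # -> validation
```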
Hey, I have a doubt here. I would suggest adding this part to the README itself, along with the pattern for taking a partial dataset (if it's different for different datasets, please mention that, since it varies based on the dataset).
Can I know the specs of the GPU/CPU or setup you used to extract TF records for the Big Patent dataset? I tried it on AI Platform and was able to create TF records for the subcategories d, e, and f; for the rest I ran out of master memory on AI Platform (not RAM).
Hi, I am sorry, I am not sure what you are asking for. The (sentence) extraction only happens in the pre-training stage; there is no extraction in the fine-tuning stage. The input-target pairs are already provided in tfds (for each downstream dataset) and can simply be used for supervised learning to fine-tune. The Big Patent dataset is very large, so please be patient when you download it from tfds for the first time. By default, we use big_patent(all). As far as I remember, we used fewer than 32 CPUs to pre-fetch data in the fine-tuning stage.
Hey, I wasn't specific in the earlier comment. Regarding changing data_dir in the datasets.py file in the data folder: I'm running these jobs on AI Platform, where I use a direct-runner pipeline to download the dataset and then, after execution, produce the TF records (input-target pairs). That first-time download isn't completing due to low computation power.

I'm asking about the dataset that tfds downloads the first time. What I will do is take that dataset (the one downloaded on the first training run), place it in a bucket, and let the model access the dataset from the bucket. The advantage is that I skip the first-time download during execution: I'll change the directory path so that, instead of looking for the dataset in the tfds folder, the code searches for it in a GCS bucket (given the path to the input-target pairs), and the model will then run while skipping the download step even on the first run.

All I'm asking for is the data that is downloaded from tfds the first time ("The big patent dataset is very large so please be patient when you download it from tfds for the first time"). I hope you have it stored on your local disk. I request that you push the dataset to a bucket and share the path with users, just as you did with the checkpoints.
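The workflow described above (pointing tfds at a pre-populated bucket instead of the local default) can be sketched as follows. The bucket path and the helper name `resolve_data_dir` are hypothetical, but `data_dir` is a real parameter of `tfds.load`:

```python
import os

def resolve_data_dir(bucket_dir=None, local_default="~/tensorflow_datasets"):
    """Choose the tfds data_dir: a pre-populated GCS bucket if given,
    otherwise the local default directory used by tfds."""
    return bucket_dir if bucket_dir else os.path.expanduser(local_default)

# Hypothetical usage: if the prepared dataset files already exist under
# data_dir, tfds reuses them and skips the first-time download/preparation.
#   import tensorflow_datasets as tfds
#   ds = tfds.load("big_patent/all", split="train",
#                  data_dir=resolve_data_dir("gs://my-bucket/tensorflow_datasets"))
print(resolve_data_dir("gs://my-bucket/tensorflow_datasets"))
```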
Hi, I am afraid we're not able to provide alternatives other than the default download by tfds. Sorry about this. If you would like to download it manually and upload it to your cloud storage, please refer to the Big Patent website: https://evasharma.github.io/bigpatent/.
It seems the feature ("inputs") is missing. |
Hey, sorry for deleting my comment; I thought you hadn't gone through it, and since I found the fix myself I deleted it, as you'll have other things to take care of. Yes, it doesn't have inputs and targets; instead it has abstracts and descriptions. I thought the TF records created by default when we pass tfds:big_patent (or any other tfds dataset) would have input-target pairs instead of abstract and description.
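The renaming discussed above can be sketched as a simple mapping. This is a sketch, assuming the big_patent features are named `description` and `abstract` as in the tfds catalog; the function name `to_supervised` is mine:

```python
def to_supervised(example):
    """Map a raw big_patent example to the inputs/targets pair used
    for fine-tuning: the description is the document to summarize,
    the abstract is the target summary."""
    return {"inputs": example["description"], "targets": example["abstract"]}

# In a tfds pipeline this would typically be applied with ds.map(to_supervised).
sample = {"description": "A long patent description...", "abstract": "Short abstract."}
print(to_supervised(sample))
```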
Hi, you may change
@JingqingZ Great, thanks! Another question: in your paper you write
On my machine I cannot use a batch_size of 256, but I assume any batch size should result in similar performance. The learning-rate hyperparameter can be set via the params_override flag, if I am not mistaken. However, I am not clear on how the best checkpoint is chosen here, as it clearly depends on how often you save a checkpoint. In the code in this folder the checkpoints are saved every 1000 steps. Is that also the case for these low-resource results? So do you choose the best checkpoint from steps 0, 1000, and 2000?
The batch size may affect performance slightly even when other settings are the same. Yes, the param_override flag is provided to override some hyperparameters, like the learning rate, in your command. The best checkpoint refers to the checkpoint with the best validation loss. The checkpoints are actually saved at relatively small intervals (in the low-resource setting), so the best checkpoint can be at step 100, 200, 1000, 1500, 1600, or so.
Could you double-check if all the |
Thanks, I think you were correct indeed. I added the following to
And trained the model with those parameters. I kept the other steps the same, and ran the whole process 3 times. This gave me average values of:
This is most certainly a lot closer to the values reported in the paper, although there is still some deviation. Do you think this could be explained by the smaller batch size? Or could it just be due to the randomness in selecting the 10 examples used to train the model? Or did you perhaps use a different max_output_len? |
It seems batch size is the major difference in your settings. We actually selected the first 10 examples rather than selecting them at random.
Alright, thanks! So the |
I think |
I am quite interested in the fine-tuning with limited supervised examples experiments. However, I am not familiar with TensorFlow Datasets. For example, suppose I run an experiment on the AESLC dataset.
TensorFlow Datasets loads data with:
This will load the whole dataset for training. What if I just want to use only 50% of the training data? What should I do?
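The tfds split-slicing API covers this directly: percentage slices are valid in split specs, e.g. `train[:50%]`. A minimal sketch (the dataset name comes from the question above; the helper `percent_split` is mine):

```python
# Standard tfds split-slicing syntax allows percentage slices, e.g.:
#   import tensorflow_datasets as tfds
#   half_train = tfds.load("aeslc", split="train[:50%]")

def percent_split(split="train", percent=50):
    """Build a tfds split spec keeping the first `percent`% of examples."""
    return f"{split}[:{percent}%]"

print(percent_split("train", 50))  # -> train[:50%]
```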