
tfds with limited examples #50

Open
PingYu-iris opened this issue Jul 7, 2020 · 18 comments

Comments

@PingYu-iris

PingYu-iris commented Jul 7, 2020

I am pretty interested in the fine-tuning with limited supervised examples experiments. However, I am not familiar with TensorFlow Datasets. For example, suppose I run an experiment on the AESLC dataset.

The TensorFlow dataset is loaded with:

dataset = all_datasets.get_dataset(input_pattern, training)  # build dataset from input_pattern
dataset = dataset.map(parser, num_parallel_calls=parallelism)  # parse serialized examples
dataset = dataset.unbatch()
dataset = dataset.shuffle(10000)  # shuffle with a 10k-example buffer
dataset = dataset.repeat()  # repeat indefinitely for training
dataset = dataset.padded_batch(
    params["batch_size"],
    padded_shapes=shapes,
    drop_remainder=drop_remainder)
dataset = dataset.prefetch(512)  # prefetch to overlap input and compute

This will load the whole dataset for training. What if I want to use only 50% of the training data? What should I do?

@JingqingZ
Collaborator

Hi, you may modify the input_pattern, similar to

big_patent/all-train-shard_100-take_200
in the corresponding dataset in https://github.com/google-research/pegasus/blob/master/pegasus/params/public_params.py
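
For reference, and independent of the Pegasus input_pattern syntax: TensorFlow Datasets' own split slicing can also load just a fraction of a split, which maps onto the original 50% question. A minimal sketch, assuming the data is loaded via tfds directly:

import tensorflow_datasets as tfds

# First 50% of the AESLC training split via tfds split slicing.
half_train = tfds.load("aeslc", split="train[:50%]")

# Or a fixed number of examples, e.g. the first 1000.
first_1000 = tfds.load("aeslc", split="train[:1000]")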

@rohithsiddhartha

Hey, I have a doubt here.
I have been following the code for text summarization on the wikihow dataset, using wikihow/all-train-take_1000 by analogy with the big_patent/all-train-shard_100-take_200 you mentioned.

I would suggest adding this to the README itself, along with the pattern for taking a partial dataset (if the pattern differs between datasets, please mention that, since it varies based on the dataset).

@rohithsiddhartha

Can I know the specs of the GPU/CPU setup you used to extract the TF records for the big_patent dataset? I tried it on AI Platform and was able to create TF records for the sub-categories d, e and f, but for the rest I ran out of master memory on AI Platform (not RAM).
I tried the same on a VM instance, but it did not finish processing even after an hour. If you could mention the configuration you used to extract the TF records for big_patent, it would be helpful for us. An even better solution would be to provide a Google Cloud Storage link to the TF records created during your testing phase; if you provide the TF records themselves, we could run the model by pointing it at them directly, which would save a lot of the computational effort of creating the TF records.

@JingqingZ
Collaborator

Hi, I am sorry, I am not sure what you are asking for. The (sentence) extraction only happens in the pre-training stage; there is no extraction in the fine-tuning stage. The input-target pairs are already provided in tfds (for each downstream dataset) and can simply be used for supervised learning to fine-tune. The big_patent dataset is very large, so please be patient when you download it from tfds for the first time. By default, we use big_patent (all). As far as I remember, we used fewer than 32 CPUs to pre-fetch data in the fine-tuning stage.
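
For anyone checking what tfds actually provides for a downstream dataset, here is a small sketch (not from the repo) that inspects the big_patent builder metadata; builder.info is available without triggering the download:

import tensorflow_datasets as tfds

# Inspect metadata only; download_and_prepare() is not called here.
builder = tfds.builder("big_patent/all")
print(builder.info.features)  # the raw supervised fields, e.g. description/abstract
print(builder.info.splits)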

@rohithsiddhartha

Hey, I wasn't specific in the earlier comment. What if I change data_dir in the datasets.py file in the data folder? I'm running these jobs on AI Platform, where I use a direct-runner pipeline to download the dataset and, after execution, produce the TF records (input-target pairs). That first-time download isn't completing because of low compute power. I'm asking for the dataset that is downloaded from tfds the first time. What I will do is download that dataset (the one that gets downloaded the first time training runs), place it in a bucket, and then let the model access the dataset from the bucket.

The advantage in this case is that I will skip the dataset download the first time execution happens. I'll change the directory path instead of looking for the dataset in the tfds folder, customize it to look for the dataset in a GCS bucket (providing the path to the input-target pairs), and then the model will run, skipping the download even on the first run.

All I'm asking for is the data that gets downloaded from tfds the first time ("The big_patent dataset is very large so please be patient when you download it from tfds for the first time"). I hope you have it stored on your local disk. I request you to push the dataset to a bucket and then share the path with users, just like you did with the checkpoints.
Please let me know if you need further clarification.

@JingqingZ
Collaborator

Hi, I am afraid we're not able to provide alternatives other than the default download via tfds. Sorry about this. If you would like to download it manually and upload it to your cloud, please refer to the big_patent website: https://evasharma.github.io/bigpatent/.
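
A possible workaround, not something the maintainers provide: let tfds prepare the dataset once into a Cloud Storage data_dir and point later runs at that bucket, so the expensive preparation only happens once. A sketch, with a placeholder bucket path:

import tensorflow_datasets as tfds

DATA_DIR = "gs://your-bucket/tensorflow_datasets"  # placeholder bucket path

# One-off: download and write the prepared TFRecords into the bucket.
builder = tfds.builder("big_patent/all", data_dir=DATA_DIR)
builder.download_and_prepare()

# Later runs read the already-prepared data straight from the bucket.
train_ds = tfds.load("big_patent/all", split="train", data_dir=DATA_DIR)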

@JingqingZ
Collaborator

Feature: inputs (data type: string) is required but could not be found.

It seems the feature ("inputs") is missing.

@rohithsiddhartha

Hey, sorry for deleting that comment. I thought you hadn't gone through it, and since I found the fix I deleted it, as you'll have other things to take care of. Yes, it doesn't have inputs and targets; instead it has abstract and description. I thought the TF records created by default when we pass tfds:big_patent (or any other tfds dataset) would have input and target pairs instead of abstract and description.
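
A minimal, hypothetical sketch of the renaming that resolves this mismatch: map the raw big_patent fields (description/abstract) onto the inputs/targets keys the training code expects. Where exactly this belongs in the Pegasus pipeline may differ:

import tensorflow as tf
import tensorflow_datasets as tfds

def to_supervised(example):
  # big_patent stores the document in 'description' and the summary in 'abstract'.
  return {"inputs": example["description"], "targets": example["abstract"]}

ds = tfds.load("big_patent/all", split="train")
ds = ds.map(to_supervised, num_parallel_calls=tf.data.experimental.AUTOTUNE)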

@CBvanYperen

Hi,

I am also interested in the low-resource results that you showed in your paper. Before moving on to my own dataset I'd like to replicate the results of the paper. For example, I'd like to replicate the values that you got for the CNN/DailyMail dataset with 10 examples. That is, the values that are highlighted in yellow in the screenshot below.
[screenshot: low-resource results table from the paper, with the CNN/DailyMail 10-example values highlighted]

It would be great if you could show an example in the same way as in the README, like this:

python3 pegasus/bin/train.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc

and

python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/aeslc

Your help would be much appreciated!

@JingqingZ
Collaborator

Hi, you may change aeslc to cnn_dailymail in the command and update the train_pattern defined here to tfds:cnn_dailymail/plain_text-train-take_10
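
A sketch of what a dedicated low-resource entry in public_params.py could look like, following the same registry pattern used later in this thread for AESLC; the dev/test patterns, max lengths and batch size below are assumptions, not values confirmed by the maintainers:

@registry.register("cnn_dailymail_transformer_low")
def cnn_dailymail_transformer_low(param_overrides):
  return transformer_params(
      {
          # Fine-tune on only the first 10 training examples.
          "train_pattern": "tfds:cnn_dailymail/plain_text-train-take_10",
          "dev_pattern": "tfds:cnn_dailymail/plain_text-validation",
          "test_pattern": "tfds:cnn_dailymail/plain_text-test",
          "max_input_len": 1024,   # assumption
          "max_output_len": 128,   # assumption
          "train_steps": 2000,     # per the paper's low-resource setup
          "learning_rate": 0.0005,
          "batch_size": 8,         # assumption; the paper used 256
      }, param_overrides)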

@CBvanYperen

@JingqingZ Great, thanks! Another question, in your paper you write

We fine-tuned the models up to 2000 steps with batch size 256, learning rate 0.0005, and picked the checkpoint with best validation performance.

On my machine I cannot use a batch_size of 256, but I assume that any batch size should result in similar performance. The learning rate hyperparameter can be set via the param_overrides flag, if I am not mistaken. However, I am not clear on what the best checkpoint is here, as it clearly depends on how often you save a checkpoint. In the code in this folder the checkpoints are saved every 1000 steps. Is that also the case for these low-resource results? That is, do you choose the best checkpoint from steps 0, 1000 and 2000?

@JingqingZ
Collaborator

The batch size may affect performance slightly even if the other settings are the same. Yes, the param_overrides flag is provided to override some hyperparameters, like the learning rate, in your command.

The best checkpoint refers to the checkpoint that has the best validation loss. The checkpoints are actually saved at relatively small intervals (in the low-resource setting), so the best checkpoint can be at step 100, 200, 1000, 1500, 1600 or so.

@CBvanYperen

Hi, I'm still not really able to replicate the low-resource results.
Since the CNN/DailyMail dataset kept giving me OOM errors (something I'll look into later), I wanted to replicate the AESLC dataset results instead. Specifically, the one with 10 examples, as highlighted below:
[screenshot: low-resource results table from the paper, with the AESLC 10-example values highlighted]

What I have done is the following:

!python3 pegasus/bin/train.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
        train_pattern="tfds:aeslc-train-take_10", \
        learning_rate=0.0005 \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc \
--train_steps_overrides=2000 \
--save_checkpoints_steps=100 \
--keep_checkpoint_max=20 \

If I understood correctly, this will fine-tune Pegasus Large using 10 examples from the AESLC dataset and perform 2000 steps, saving the model every 100th step.

Then I ran the evaluation with the following code:

!python3 pegasus/bin/evaluate.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
--model_dir=ckpt/pegasus_ckpt/aeslc \

In the checkpoint file the first line was model_checkpoint_path: "model.ckpt-2000", so it evaluates the model trained for 2000 steps. Then, since I wanted the model that gave the best performance, I ran the following code:

!python3 pegasus/bin/evaluate.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
--model_dir=ckpt/pegasus_ckpt/aeslc \
--evaluate_test=True \
--best=True \
--text_metrics_pattern=text_metrics-*-.dev.txt

This gave me a text_metrics-2000-.best.test.txt file which seems to be the one I need. The content of this file is as follows:
[screenshot: contents of text_metrics-2000-.best.test.txt, with the ROUGE F-scores highlighted]
I highlighted the values that I expected to match those in the paper. However, they clearly differ, so I must be making a mistake somewhere. I would very much appreciate your help in finding it!

@JingqingZ
Collaborator

!python3 pegasus/bin/train.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
        train_pattern="tfds:aeslc-train-take_10", \
        learning_rate=0.0005 \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc \
--train_steps_overrides=2000 \
--save_checkpoints_steps=100 \
--keep_checkpoint_max=20 \

Could you double-check whether all the param_overrides have been successfully passed to the params instance? They are not on my machine (likely because the line breaks and leading spaces split the --param_overrides value into separate shell arguments), which means you probably fine-tuned on the entire training set instead of only 10 examples.

@CBvanYperen

Thanks, I think you were correct indeed. I added the following to public_params.py

@registry.register("aeslc_transformer_low")
def aeslc_transformer_low(param_overrides):
  return transformer_params(
      {
          "train_pattern": "tfds:aeslc-train-take_10",
          "dev_pattern": "tfds:aeslc-validation",
          "test_pattern": "tfds:aeslc-test",
          "max_input_len": 512,
          "max_output_len": 32,
          "train_steps": 2000,
          "learning_rate": 0.0005,
          "batch_size": 4,
      }, param_overrides)

And trained the model with those parameters. I kept the other steps the same, and ran the whole process 3 times. This gave me average values of:

rouge1-F: 0.0892
rouge2-F: 0.0406
rougeL-F: 0.0838

This is most certainly a lot closer to the values reported in the paper, although there is still some deviation. Do you think this could be explained by the smaller batch size? Or could it just be due to the randomness in selecting the 10 examples used to train the model? Or did you perhaps use a different max_output_len?

@JingqingZ
Collaborator

It seems the batch size is the major difference in your settings. We actually selected the first 10 examples rather than selecting them at random.

@CBvanYperen

Alright, thanks! So does -take_10 automatically select the first 10, or do I have to make some modifications so that it selects the first 10 as well?

@JingqingZ
Collaborator

I think take_10 will take the first 10.
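
A tiny illustrative sketch (plain tf.data, independent of the Pegasus pattern syntax) of why taking before shuffling yields a deterministic "first N" subset:

import tensorflow as tf

ds = tf.data.Dataset.range(100)

# take() before shuffle() keeps the first 10 elements deterministically;
# shuffle() afterwards only randomizes the order within those 10.
first_ten = ds.take(10).shuffle(10)
print(sorted(int(x) for x in first_ten))  # always [0, 1, ..., 9]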
