
tfds with limited examples #50

Open
PingYu-iris opened this issue Jul 7, 2020 · 18 comments

Comments

@PingYu-iris

PingYu-iris commented Jul 7, 2020

I am pretty interested in the fine-tuning with limited supervised examples experiments. However, I am not familiar with TensorFlow Datasets. For example, suppose I run an experiment on the AESLC dataset.

The TensorFlow dataset is loaded with:

dataset = all_datasets.get_dataset(input_pattern, training)  # build dataset from input_pattern
dataset = dataset.map(parser, num_parallel_calls=parallelism)  # parse serialized examples
dataset = dataset.unbatch()
dataset = dataset.shuffle(10000)  # shuffle with a 10k-example buffer
dataset = dataset.repeat()  # repeat indefinitely for training
dataset = dataset.padded_batch(
    params["batch_size"],
    padded_shapes=shapes,
    drop_remainder=drop_remainder)
dataset = dataset.prefetch(512)  # prefetch to overlap input and compute

This will load the whole dataset for training. What if I want to use only 50% of the training data? What should I do?

@JingqingZ
Collaborator

Hi, you may modify the input_pattern, similar to

big_patent/all-train-shard_100-take_200
in the corresponding dataset in https://github.com/google-research/pegasus/blob/master/pegasus/params/public_params.py
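
For reference, and independent of the Pegasus input_pattern syntax: TensorFlow Datasets' own split slicing can also load just a fraction of a split, which maps onto the original 50% question. A minimal sketch, assuming the data is loaded via tfds directly:

import tensorflow_datasets as tfds

# First 50% of the AESLC training split via tfds split slicing.
half_train = tfds.load("aeslc", split="train[:50%]")

# Or a fixed number of examples, e.g. the first 1000.
first_1000 = tfds.load("aeslc", split="train[:1000]")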

@rohithsiddhartha

Hey, I have a doubt here.
I have been following the code for text summarization on the wikihow dataset, using wikihow/all-train-take_1000 by analogy with the big_patent/all-train-shard_100-take_200 you mentioned.

I would suggest adding this to the README itself, along with the pattern for taking a partial dataset (if the pattern differs between datasets, please mention that, since it varies based on the dataset).

@rohithsiddhartha

Can I know the specs of the GPU/CPU setup you used to extract the TF records for the big_patent dataset? I tried it on AI Platform and was able to create TF records for the sub-categories d, e and f, but for the rest I ran out of master memory on AI Platform (not RAM).
I tried the same on a VM instance, but it did not finish processing even after an hour. If you could mention the configuration you used to extract the TF records for big_patent, it would be helpful for us. An even better solution would be to provide a Google Cloud Storage link to the TF records created during your testing phase; if you provide the TF records themselves, we could run the model by pointing it at them directly, which would save a lot of the computational effort of creating the TF records.

@JingqingZ
Collaborator

Hi, I am sorry, I am not sure what you are asking for. The (sentence) extraction only happens in the pre-training stage; there is no extraction in the fine-tuning stage. The input-target pairs are already provided in tfds (for each downstream dataset) and can simply be used for supervised learning to fine-tune. The big_patent dataset is very large, so please be patient when you download it from tfds for the first time. By default, we use big_patent (all). As far as I remember, we used fewer than 32 CPUs to pre-fetch data in the fine-tuning stage.
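
For anyone checking what tfds actually provides for a downstream dataset, here is a small sketch (not from the repo) that inspects the big_patent builder metadata; builder.info is available without triggering the download:

import tensorflow_datasets as tfds

# Inspect metadata only; download_and_prepare() is not called here.
builder = tfds.builder("big_patent/all")
print(builder.info.features)  # the raw supervised fields, e.g. description/abstract
print(builder.info.splits)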

@rohithsiddhartha

Hey, I wasn't specific in the earlier comment. What if I change data_dir in the datasets.py file in the data folder? I'm running these jobs on AI Platform, where I use a direct-runner pipeline to download the dataset and, after execution, produce the TF records (input-target pairs). That first-time download isn't completing because of low compute power. I'm asking for the dataset that is downloaded from tfds the first time. What I will do is download that dataset (the one that gets downloaded the first time training runs), place it in a bucket, and then let the model access the dataset from the bucket.

The advantage in this case is that I will skip the dataset download the first time execution happens. I'll change the directory path instead of looking for the dataset in the tfds folder, customize it to look for the dataset in a GCS bucket (providing the path to the input-target pairs), and then the model will run, skipping the download even on the first run.

All I'm asking for is the data that gets downloaded from tfds the first time ("The big_patent dataset is very large so please be patient when you download it from tfds for the first time"). I hope you have it stored on your local disk. I request you to push the dataset to a bucket and then share the path with users, just like you did with the checkpoints.
Please let me know if you need further clarification.

@JingqingZ
Collaborator

Hi, I am afraid we're not able to provide alternatives other than the default download via tfds. Sorry about this. If you would like to download it manually and upload it to your cloud, please refer to the big_patent website: https://evasharma.github.io/bigpatent/.
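
A possible workaround, not something the maintainers provide: let tfds prepare the dataset once into a Cloud Storage data_dir and point later runs at that bucket, so the expensive preparation only happens once. A sketch, with a placeholder bucket path:

import tensorflow_datasets as tfds

DATA_DIR = "gs://your-bucket/tensorflow_datasets"  # placeholder bucket path

# One-off: download and write the prepared TFRecords into the bucket.
builder = tfds.builder("big_patent/all", data_dir=DATA_DIR)
builder.download_and_prepare()

# Later runs read the already-prepared data straight from the bucket.
train_ds = tfds.load("big_patent/all", split="train", data_dir=DATA_DIR)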

@JingqingZ
Collaborator

Feature: inputs (data type: string) is required but could not be found.

It seems the feature ("inputs") is missing.

@rohithsiddhartha

Hey, sorry for deleting that comment. I thought you hadn't gone through it, and since I found the fix I deleted it, as you'll have other things to take care of. Yes, it doesn't have inputs and targets; instead it has abstract and description. I thought the TF records created by default when we pass tfds:big_patent (or any other tfds dataset) would have input and target pairs instead of abstract and description.
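
A minimal, hypothetical sketch of the renaming that resolves this mismatch: map the raw big_patent fields (description/abstract) onto the inputs/targets keys the training code expects. Where exactly this belongs in the Pegasus pipeline may differ:

import tensorflow as tf
import tensorflow_datasets as tfds

def to_supervised(example):
  # big_patent stores the document in 'description' and the summary in 'abstract'.
  return {"inputs": example["description"], "targets": example["abstract"]}

ds = tfds.load("big_patent/all", split="train")
ds = ds.map(to_supervised, num_parallel_calls=tf.data.experimental.AUTOTUNE)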

@CBvanYperen

Hi,

I am also interested in the low-resource results that you showed in your paper. Before moving on to my own dataset I'd like to replicate the results of the paper. For example, I'd like to replicate the values that you got for the CNN/DailyMail dataset with 10 examples. That is, the values that are highlighted in yellow in the screenshot below.
[screenshot: low-resource results table from the paper, with the CNN/DailyMail 10-example values highlighted]

It would be great if you could show an example in the same way as in the README, like this:

python3 pegasus/bin/train.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc

and

python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/aeslc

Your help would be much appreciated!

@JingqingZ
Collaborator

Hi, you may change aeslc to cnn_dailymail in the command and update the train_pattern defined here to tfds:cnn_dailymail/plain_text-train-take_10
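
A sketch of what a dedicated low-resource entry in public_params.py could look like, following the same registry pattern used later in this thread for AESLC; the dev/test patterns, max lengths and batch size below are assumptions, not values confirmed by the maintainers:

@registry.register("cnn_dailymail_transformer_low")
def cnn_dailymail_transformer_low(param_overrides):
  return transformer_params(
      {
          # Fine-tune on only the first 10 training examples.
          "train_pattern": "tfds:cnn_dailymail/plain_text-train-take_10",
          "dev_pattern": "tfds:cnn_dailymail/plain_text-validation",
          "test_pattern": "tfds:cnn_dailymail/plain_text-test",
          "max_input_len": 1024,   # assumption
          "max_output_len": 128,   # assumption
          "train_steps": 2000,     # per the paper's low-resource setup
          "learning_rate": 0.0005,
          "batch_size": 8,         # assumption; the paper used 256
      }, param_overrides)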

@CBvanYperen

@JingqingZ Great, thanks! Another question, in your paper you write

We fine-tuned the models up to 2000 steps with batch size 256, learning rate 0.0005, and picked the checkpoint with best validation performance.

On my machine I cannot use a batch_size of 256, but I assume that any batch size should result in similar performance. The learning rate hyperparameter can be set via the param_overrides flag, if I am not mistaken. However, I am not clear on what the best checkpoint is here, as it clearly depends on how often you save a checkpoint. In the code in this folder the checkpoints are saved every 1000 steps. Is that also the case for these low-resource results? That is, do you choose the best checkpoint from steps 0, 1000 and 2000?

@JingqingZ
Collaborator

The batch size may affect performance slightly even if the other settings are the same. Yes, the param_overrides flag is provided to override some hyperparameters, like the learning rate, in your command.

The best checkpoint refers to the checkpoint that has the best validation loss. The checkpoints are actually saved at relatively small intervals (in the low-resource setting), so the best checkpoint can be at step 100, 200, 1000, 1500, 1600 or so.

@CBvanYperen

Hi, I'm still not really able to replicate the low-resource results.
Since the CNN/DailyMail dataset kept giving me OOM errors (something I'll look into later), I wanted to replicate the AESLC dataset results instead. Specifically, the one with 10 examples, as highlighted below:
[screenshot: low-resource results table from the paper, with the AESLC 10-example values highlighted]

What I have done is the following:

!python3 pegasus/bin/train.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
        train_pattern="tfds:aeslc-train-take_10", \
        learning_rate=0.0005 \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc \
--train_steps_overrides=2000 \
--save_checkpoints_steps=100 \
--keep_checkpoint_max=20 \

If I understood correctly, this will fine-tune Pegasus Large using 10 examples from the AESLC dataset and perform 2000 steps, saving the model every 100th step.

Then I ran the evaluation with the following code:

!python3 pegasus/bin/evaluate.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
--model_dir=ckpt/pegasus_ckpt/aeslc \

In the checkpoint file the first line was model_checkpoint_path: "model.ckpt-2000", so it evaluates the model trained for 2000 steps. Then, since I wanted the model that gave the best performance, I ran the following code:

!python3 pegasus/bin/evaluate.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
--model_dir=ckpt/pegasus_ckpt/aeslc \
--evaluate_test=True \
--best=True \
--text_metrics_pattern=text_metrics-*-.dev.txt

This gave me a text_metrics-2000-.best.test.txt file which seems to be the one I need. The content of this file is as follows:
[screenshot: contents of text_metrics-2000-.best.test.txt, with the ROUGE F-scores highlighted]
I highlighted the values that I expected to match those in the paper. However, they clearly differ, so I must be making a mistake somewhere. I would very much appreciate your help in finding it!

@JingqingZ
Collaborator

!python3 pegasus/bin/train.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
        train_pattern="tfds:aeslc-train-take_10", \
        learning_rate=0.0005 \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc \
--train_steps_overrides=2000 \
--save_checkpoints_steps=100 \
--keep_checkpoint_max=20 \

Could you double-check whether all the param_overrides have been successfully passed to the params instance? They are not on my machine (likely because the line breaks and leading spaces split the --param_overrides value into separate shell arguments), which means you probably fine-tuned on the entire training set instead of only 10 examples.

@CBvanYperen

Thanks, I think you were correct indeed. I added the following to public_params.py

@registry.register("aeslc_transformer_low")
def aeslc_transformer_low(param_overrides):
  return transformer_params(
      {
          "train_pattern": "tfds:aeslc-train-take_10",
          "dev_pattern": "tfds:aeslc-validation",
          "test_pattern": "tfds:aeslc-test",
          "max_input_len": 512,
          "max_output_len": 32,
          "train_steps": 2000,
          "learning_rate": 0.0005,
          "batch_size": 4,
      }, param_overrides)

And trained the model with those parameters. I kept the other steps the same, and ran the whole process 3 times. This gave me average values of:

rouge1-F: 0.0892
rouge2-F: 0.0406
rougeL-F: 0.0838

This is most certainly a lot closer to the values reported in the paper, although there is still some deviation. Do you think this could be explained by the smaller batch size? Or could it just be due to the randomness in selecting the 10 examples used to train the model? Or did you perhaps use a different max_output_len?

@JingqingZ
Collaborator

It seems the batch size is the major difference in your settings. We actually selected the first 10 examples rather than selecting them at random.

@CBvanYperen

Alright, thanks! So does -take_10 automatically select the first 10, or do I have to make some modifications so that it selects the first 10 as well?

@JingqingZ
Collaborator

I think take_10 will take the first 10.
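
A tiny illustrative sketch (plain tf.data, independent of the Pegasus pattern syntax) of why taking before shuffling yields a deterministic "first N" subset:

import tensorflow as tf

ds = tf.data.Dataset.range(100)

# take() before shuffle() keeps the first 10 elements deterministically;
# shuffle() afterwards only randomizes the order within those 10.
first_ten = ds.take(10).shuffle(10)
print(sorted(int(x) for x in first_ten))  # always [0, 1, ..., 9]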
