Extractive Prediction Instead of Abstractive Prediction #21
@kandarpkakkad would you be able to share how you added your own dataset? I've tried to do the same (following the official TFDS tutorial) but wasn't able to successfully do so.
Make a Python file, keep it in the pegasus folder, and run it so that the record is stored in the testdata folder:

```python
import pandas as pd
import tensorflow as tf

name_dict = dict(
    inputs=[
        # Your inputs
    ],
    targets=[
        # Your targets for the inputs, respectively
    ],
)

df = pd.DataFrame(name_dict)
print(df)

header = ["inputs", "targets"]
df.to_csv("output.csv", columns=header, index=False)

csv = pd.read_csv("output.csv").values
with tf.io.TFRecordWriter("pegasus/data/testdata/test_pattern.tfrecords") as writer:
    for row in csv:
        inputs, targets = row[:-1], row[-1]
        example = tf.train.Example(
            features=tf.train.Features(
                feature={
                    "inputs": tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[inputs[0].encode("utf-8")])
                    ),
                    "targets": tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[targets.encode("utf-8")])
                    ),
                }
            )
        )
        writer.write(example.SerializeToString())
```

I had one input/target pair.
Thanks for sharing!
This makes sense to some extent because you're using the pre-trained model without any fine-tuning (i.e. zero-shot summarization). I also experienced similar extractive behaviour on some datasets like AESLC and XSum. During the pre-training, PEGASUS encourages some extractive behaviour to fit the more extractive downstream tasks better (this is also described in our paper, section 6.2). PEGASUS has good performance with little supervision (100-1000 examples) on some datasets but the performance with zero-shot is still limited, especially on abstractive datasets. If your dataset is very abstractive, some fine-tuning should be helpful.
Ok! Thank you very much. I got my answer and so I am closing this issue.
@kandarpkakkad, did you implement this on Google Colab? I am facing a disk space shortage issue.
would love to hear from you as well @JingqingZ
No, I have not implemented it in Google Colab. I ran this on a local computer, but there was a message that it was using 10% extra RAM, so I guess it is heavy. I am sorry, I don't know how to optimise it.
I tried running it on a local computer too; it is taking ages to download the pretrained model.
Sorry, can't help.
No problem. Thanks.
The vocab and all model checkpoints (pre-trained + all fine-tuned) take ~29 GB of space, so please make sure Google Colab has sufficient space (i.e. that Google Drive has sufficient space available, if Google Drive is mounted).
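As a quick sanity check before downloading, free disk space can be inspected from Python. A minimal sketch; the `/content` path is Colab's local disk and is an assumption, so substitute whatever directory you plan to download into:

```python
import os
import shutil

def has_room(path, needed_gb=29):
    """Return True if `path` has at least `needed_gb` GiB free."""
    free_gib = shutil.disk_usage(path).free / 1024**3
    return free_gib >= needed_gb

# "/content" is Colab's local disk; fall back to the current directory elsewhere.
target = "/content" if os.path.isdir("/content") else "."
print(has_room(target))
```

The 29 GB figure comes from the comment above and covers the vocab plus all checkpoints; a smaller value is fine if you only download one fine-tuned model.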
Okay, thanks. I haven't mounted Google Drive; I am using Colab disk space alone.
'model.ckpt-1500000.data-00000-of-00001': is this the hybrid model trained on the 'C4+HugeNews' datasets? Can anyone explain this, please?
Also, I'm getting this warning while fine-tuning with the AESLC dataset. What does it mean? '2020-06-21 12:19:04.185622: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 393637888 exceeds 10% of system memory.'
Yes
No, this is the vocab model for SentencePiece.
This means your CPU is running low on memory, but it should be fine as the model can still run as normal. You may try to free up some CPU memory.
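For intuition, the number in that warning allows some rough back-of-envelope arithmetic (assuming the 10% threshold is measured against total system memory):

```python
# The warning reports a single allocation of 393637888 bytes exceeding 10%
# of system memory, so total RAM must be below alloc / 0.10.
alloc_bytes = 393_637_888

alloc_mib = alloc_bytes / 1024**2               # size of the one allocation
implied_ram_gib = alloc_bytes / 0.10 / 1024**3  # rough upper bound on total RAM

print(round(alloc_mib))           # 375
print(round(implied_ram_gib, 2))  # 3.67
```

So a single ~375 MiB buffer triggering the warning suggests the machine has less than roughly 3.7 GiB of RAM, which matches the "heavy" experience reported earlier in the thread.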
I am implementing this on Colab and it has enough space in RAM as well as on disk.
@kandarpkakkad, I tried creating a test set with 1 input and 1 target, placed it in the testdata folder, registered it with @registry.register("new_params"), and finally tried testing the model. But the model isn't predicting results for the test set that I gave; it is giving a prediction for a different example:

```
I0622 22:42:07.407472 139655975044992 text_eval.py:126] INPUTS: [0]:
I0622 22:42:07.407813 139655975044992 text_eval.py:126] TARGETS: Treasury Bank Resolutions
I0622 22:42:07.407969 139655975044992 text_eval.py:126] PREDICTIONS: Thank you for your attention to this matter.
```

@JingqingZ can somebody please explain what could be the issue?
Take advantage of the flag
Reference in the code to load datasets: pegasus/pegasus/data/datasets.py, line 177 at commit addaf5a.
Hope this may help https://www.tensorflow.org/datasets/catalog/wikihow
It seems you're using the pre-trained model checkpoint instead of fine-tuned model checkpoints. Using the pre-trained model checkpoint directly for zero-shot summarization may produce such extractive output (i.e. the prediction is extracted from the input). An explanation is provided above in our previous communication in this issue.
Thank you, it was very helpful. Also, can we create a new dataset to fine-tune the pretrained model in the TFRecord format, or should it be in TFDS format only?
TFRecords should be workable.
Thank you JingqingZ
@JingqingZ, I have a CSV file of {input, target} pairs where the inputs and targets are of variable length. The code shared by @kandarpkakkad is not working in this case. Could you please suggest a method to convert this into TFRecord?
This sounds a little "abstractive". I think the idea of the shared code is right, but if you have pairs of data in a different form, you definitely need to modify some code.
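One common pitfall with variable-length text in CSV files is comma and quote handling. A minimal, TensorFlow-free sketch of reading (input, target) pairs robustly with the stdlib csv module; the column names `inputs`/`targets` match the earlier snippet and are assumptions for your file:

```python
import csv
import io

# Stand-in for your CSV file; note the commas inside the quoted input text.
sample = io.StringIO(
    'inputs,targets\n'
    '"A long, multi-sentence document, with commas.","Short summary."\n'
    '"Another document of a different length.","Another summary."\n'
)

# csv.DictReader handles quoting, so fields of any length parse correctly.
pairs = [(row["inputs"], row["targets"]) for row in csv.DictReader(sample)]

print(len(pairs))   # 2
print(pairs[0][1])  # Short summary.
```

Each pair can then be serialized into a tf.train.Example exactly as in the snippet shared earlier in this thread.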
@JingqingZ, thank you.
```
File "/content/Abstractive-Text-Summarization-Pegasus/datasets.py", line 30, in get_dataset
    """
def get_dataset(dataset_name):
```
The file extension should be .tfrecord, not .tfrecords.
Hi @ManuMahadevaswamy, was there any resolution to this? I only want to generate a summary for my input. With the following dataset registration in public_params.py, it seems to iterate through all tests instead of the single record in test_pattern.tfrecord:

@registry.register("new_params")

@kandarpkakkad what did your dataset registration in public_params.py look like?

@JingqingZ it appears that for test_pattern the test_pattern.tfrecord file would get picked up from the data/testdata folder. Can we supply the same format for train_pattern and dev_pattern, i.e.:

"train_pattern": "tfrecord:test_pattern.tfrecord",
@rjbanner @darienacosta
It seems just changing "test_pattern" is not enough to prevent endless iterations through all tests instead of the single record in test_pattern.tfrecord. I finally ended up changing train_pattern and dev_pattern in addition to test_pattern to make it produce a single iteration as expected:

@registry.register("new_params")
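The fix above boils down to pointing all three patterns at the same file. A sketch of just the pattern values; the "tfrecord:" prefix and key names come from this thread, while the exact path is a hypothetical example:

```python
# Hypothetical path; substitute your own TFRecord file. The "tfrecord:"
# prefix is the format used in this thread for raw TFRecord inputs.
record = "tfrecord:pegasus/data/testdata/test_pattern.tfrecord"

overrides = {
    "train_pattern": record,
    "dev_pattern": record,
    "test_pattern": record,
}

print(len(set(overrides.values())))  # 1: all three patterns point at one file
```

The actual registration in public_params.py wraps values like these inside the @registry.register("new_params") function, so the shape there may differ slightly.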
I am trying to train the PEGASUS model with a new fine-tuning dataset which has just 600 records, on a GPU. The execution is getting interrupted halfway through. I tried keeping the batch size low; with batch_size=1 I get

```
INFO:tensorflow:global_step/sec: 0.908995
```

and with batch_size=8 I get the error below:

```
2020-07-08 18:59:17.896051: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at softmax_op_gpu.cu.cc:157 : Resource exhausted: OOM when allocating tensor with shape[8,16,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
```

@JingqingZ @rjbanner @bickrombishsass, is it because of the RAM size? I am using 12 GB RAM with a GPU.
It seems out-of-memory. 12 GB should be enough to run the model with batch_size=1 (or 4), but may not be enough for batch_size=8. You can also check the input and output sequence lengths. If the input/output length is too long, e.g. >= 1024, the self-attention module will take a lot of memory.
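The failing tensor shape in the error above makes this concrete; a quick sketch of its size:

```python
# shape [8, 16, 1024, 1024] from the OOM message:
# (batch, attention heads, seq_len, seq_len) of float32.
batch, heads, seq_len = 8, 16, 1024

elements = batch * heads * seq_len * seq_len
mib = elements * 4 / 1024**2  # float32 = 4 bytes per element

print(elements)  # 134217728
print(mib)       # 512.0 MiB for a single attention-score tensor
```

Attention memory grows quadratically with sequence length and linearly with batch size, which is why dropping batch_size from 8 to 1 helps so much.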
Yes, it could be because of that. I have inputs with length greater than 1024.
Unfortunately, the model cannot handle inputs longer than the max length you set (like 512 or 1024). For example, on the arXiv and PubMed datasets, PEGASUS only takes the first 1024 tokens into the encoder and discards the other tokens. You can refer to other papers which work on multi-document summarization or summarization with very long inputs.
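The truncation described above is simple to sketch; whitespace splitting here is only a stand-in for the model's actual SentencePiece tokenizer:

```python
def truncate_tokens(text, max_len=1024):
    """Keep only the first `max_len` tokens, discarding the rest."""
    return text.split()[:max_len]

doc = "token " * 2000               # a 2000-token input
print(len(truncate_tokens(doc)))    # 1024: everything past position 1024 is lost
```

Any content past the cutoff never reaches the encoder, so summaries of very long documents can only reflect their opening sections.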
Thank you @JingqingZ, I will check them.
Hi everyone,

```python
import pandas as pd
import tensorflow as tf

save_path = "/content/pegasus/pegasus/data/testdata/test_pattern_1.tfrecord"

input_dict = dict(
    # inputs and targets elided in the original comment
)
data = pd.DataFrame(input_dict)

with tf.io.TFRecordWriter(save_path) as writer:
    ...  # serialization loop elided in the original comment
```

```
save_path = "/content/pegasus/pegasus/data/testdata/test_pattern_1.tfrecord"
!python3 /content/pegasus/pegasus/bin/evaluate.py --params=test_transformer
```

When I run this last couple of lines, this notice appears. Is this anything to do with the issue? Thx

```
WARNING:tensorflow:From /content/pegasus/pegasus/bin/evaluate.py:85: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
```
It seems the folder of checkpoints hasn't been created.
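If the checkpoint directory is missing, creating it ahead of time avoids this. A sketch; the path here is hypothetical and depends on where your model_dir flag points:

```shell
# Hypothetical checkpoint layout; substitute whatever your model_dir points at,
# e.g. the fine-tuning dataset name used in this thread (aeslc).
mkdir -p ckpt/pegasus_ckpt/aeslc
ls ckpt/pegasus_ckpt
```

The downloaded pre-trained checkpoint files then go inside that directory before running train or evaluate.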
Hi!
I have tried running the pre-trained model to test it on my dataset, which consists of paragraphs as inputs and one-line sentences as targets. The problem is that the prediction was extracted from the input instead of being generated as expected.
The new_params is the new .tfrecords dataset for testing.
In the output, I am getting the following:
The prediction is the second line of the input.
Is this a mistake on my part, or is it a problem with the model?