
Wrong paraphrase in the TF2/PyTorch README example. #1942

Closed
2 of 4 tasks
isaprykin opened this issue Nov 25, 2019 · 6 comments

Comments

@isaprykin

🐛 Bug

Model I am using (Bert, XLNet....): TFBertForSequenceClassification

Language I am using the model on (English, Chinese....): English

The problem arises when using:

The task I am working on is:

  • an official GLUE/SQuAD task: Sequence Classification
  • my own task or dataset: (give details)

To Reproduce

Steps to reproduce the behavior:

  1. Run the attached script.
  2. Observe
$ /Users/igor/projects/ml-venv/bin/python /Users/igor/projects/transformers-experiments/paraphrasing_issue.py
2019-11-25 08:58:53.985213: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fed57a2be00 executing computations on platform Host. Devices:
2019-11-25 08:58:53.985243: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset glue (/Users/igor/tensorflow_datasets/glue/mrpc/0.0.2)
INFO:absl:Constructing tf.data.Dataset for split None, from /Users/igor/tensorflow_datasets/glue/mrpc/0.0.2
Train for 115 steps, validate for 7 steps
Epoch 1/2
  4/115 [>.............................] - ETA: 1:22:04 - loss: 0.6936
  5/115 [>.............................] - ETA: 1:18:44 - loss: 0.6876
  6/115 [>.............................] - ETA: 1:16:01 - loss: 0.6760
115/115 [==============================] - 4587s 40s/step - loss: 0.5850 - accuracy: 0.7045 - val_loss: 0.4695 - val_accuracy: 0.8137
Epoch 2/2
115/115 [==============================] - 4927s 43s/step - loss: 0.3713 - accuracy: 0.8435 - val_loss: 0.3825 - val_accuracy: 0.8358
sentence_1 is a paraphrase of sentence_0
sentence_2 is a paraphrase of sentence_0
  3. Wonder why.
import tensorflow as tf
import tensorflow_datasets
from transformers import (BertTokenizer, TFBertForSequenceClassification,
                          BertForSequenceClassification,
                          glue_convert_examples_to_features)

# Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
data = tensorflow_datasets.load('glue/mrpc')

# Prepare dataset for GLUE as a tf.data.Dataset instance
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)

# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule 
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)

# Load the TensorFlow model in PyTorch for inspection
model.save_pretrained('./save/')
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)

# Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
sentence_0 = "This research was consistent with his findings."
sentence_1 = "His findings were compatible with this research."
sentence_2 = "His findings were not compatible with this research."
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')

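# argmax over the two MRPC class logits: label 1 = paraphrase, label 0 = not a paraphrase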
pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()

print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")

Expected behavior

sentence_1 is a paraphrase of sentence_0
sentence_2 is not a paraphrase of sentence_0

Environment

  • OS: MacOS
  • Python version: 3.7.5
  • PyTorch version: 1.3.1
  • PyTorch Transformers version (or branch): last commit afaa335 as of Sat Nov 23 11:34:45 2019 -0500.
  • Using GPU ? nope
  • Distributed or parallel setup? single machine
  • Any other relevant information: TF version is 2.0.0
@isaprykin changed the title from "Wrong paraphrase in the TF2/PyTorch paraphrase example from README." to "Wrong paraphrase in the TF2/PyTorch README example." on Nov 25, 2019
@mandubian

Hi, I'm investigating. For now, I can confirm the issue you observe. I've tested on both CPU and GPU and it gives the same result. I've tested with PyTorch and TF models too, same result. Now, let's track down the cause!

@mandubian

mandubian commented Dec 3, 2019

Hi again,
OK, I've retrained a PyTorch model using run_glue.py on MRPC to check.
The final metrics are:

***** Eval results  *****
acc = 0.8382608695652174
acc_and_f1 = 0.8608840882272851
f1 = 0.8835073068893529

So it's not crazy high but not near random either.
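(Side note on how these three numbers relate: for MRPC, the GLUE metrics report acc_and_f1 as the simple mean of accuracy and F1, which the figures above satisfy. A quick check in Python:)

acc = 0.8382608695652174
f1 = 0.8835073068893529
print((acc + f1) / 2)  # 0.8608840882272851, i.e. the acc_and_f1 reported above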

Then I've retested:

Is "This research was consistent with his findings" same as:

"His findings were compatible with this research." ?
TRUE -> 馃槃

"His findings were not compatible with this research." ?
TRUE -> 馃槩

I've taken a more complex sentence from the training set:

Is 'Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.' the same as:

"Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence." ?
TRUE -> 😄

"Referring to him as only "the witness", Amrozi accused his brother of not deliberately distorting his evidence." ?
TRUE -> 😢

"platypus to him as only "the platypus", platypus accused his platypus of deliberately platypus his evidence." ?
TRUE -> 😭

"platypus to him as only "the platypus", platypus accused his platypus of deliberately platypus his platypus." ?
FALSE -> 🎉

Here we see that it's not robust to "not", as in the original case. It's also not robust to replacing arbitrary words with "platypus" until I replace 6 of them (which is, admittedly, quite disappointing performance for the model).

I've taken sentences from the test set:

Is "A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night." the same as:

"A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night." ?
TRUE -> 😢
----------------------------------------------------------------------------------------
Is "The broader Standard & Poor's 500 Index <.SPX> was 0.46 points lower, or 0.05 percent, at 997.02." the same as:

"The technology-laced Nasdaq Composite Index .IXIC was up 7.42 points, or 0.45 percent, at 1,653.44." ?
FALSE -> 😄
----------------------------------------------------------------------------------------
Is "NASA plans to follow-up the rovers' missions with additional orbiters and landers before launching a long-awaited sample-return flight." the same as:

"NASA plans to explore the Red Planet with ever more sophisticated robotic orbiters and landers." ?
FALSE -> 😄
----------------------------------------------------------------------------------------
Is "We are piloting it there to see whether we roll it out to other products." the same as:

"Macromedia is piloting this product activation system in Contribute to test whether to roll it out to other products." ?
TRUE -> 😄

Here we see that sometimes it works, sometimes not. I might be wrong, but I haven't seen anything in the code that could explain this issue (83% is the final accuracy on the dev set... OK, but that still means 1 error in 5 cases). A priori, I'd say that a basic BERT trained like this on such a tiny dataset is simply not that robust at this task in the general case and would need more data, or at least more data augmentation.
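(For reference, here is a minimal sketch of the kind of pairwise check run above. The exact test script isn't included in this thread, so the model directory and the helper below are placeholders/assumptions, not the actual code used.)

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Placeholder path: wherever run_glue.py wrote the fine-tuned MRPC checkpoint.
model_dir = '/tmp/mrpc_output'
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForSequenceClassification.from_pretrained(model_dir)
model.eval()

def is_paraphrase(sentence_a, sentence_b):
    # MRPC convention: label 1 = equivalent (paraphrase), label 0 = not equivalent.
    inputs = tokenizer.encode_plus(sentence_a, sentence_b,
                                   add_special_tokens=True, return_tensors='pt')
    with torch.no_grad():
        logits = model(inputs['input_ids'],
                       token_type_ids=inputs['token_type_ids'])[0]
    return bool(logits.argmax().item())

print(is_paraphrase("This research was consistent with his findings.",
                    "His findings were not compatible with this research."))  # ideally False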

Do you share my conclusion or see something different?

@isaprykin
Author

Thanks for the investigation. Was the performance any different at the time that example was put into the README?

@mandubian

TBH, personally I wasn't there, so I don't know...
Maybe someone at Hugging Face can answer this question?
I've been looking at the MRPC leaderboard https://gluebenchmark.com/leaderboard/ and BERT there scores around the same as my training above, so it looks like a normal score.

@thomwolf
Member

thomwolf commented Dec 5, 2019

MRPC is a very small dataset (the smallest in the GLUE benchmark, which is why we use it as an example). It should not be expected to generalize well or be usable in real-life settings.
The performance you got, @mandubian, is indeed a normal score.

@isaprykin
Author

Sounds like we don't think there's an actionable issue here.
