
Test data being used for validation data? #1753

Closed · Tracked by #6
KeironO opened this issue Feb 18, 2016 · 35 comments

Comments

@KeironO

KeironO commented Feb 18, 2016

Hey there,

In many of your examples you seem to be using the test data as validation data. Wouldn't this create an overfitted model that is unable to generalise?

Example: https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py#L80

@gw0
Contributor

gw0 commented Feb 18, 2016

Validation data is not used for training (or development of the model). Its purpose is to track progress through validation loss and accuracy.

@tboquet
Contributor

tboquet commented Feb 18, 2016

This is just the terminology used; it could also be called X_val and y_val. You would have to use another set of data (often called the test set) if you want to do some hyperparameter optimization.

@KeironO
Author

KeironO commented Feb 18, 2016

No, no, no, no. This is incorrect.

The validation data needs to be separate from both the training and test data.

The results that this model gives are false, as the model is built to over-fit on the data, which is something we shouldn't be suggesting is right.

Instead we should opt for the validation_split argument of model.fit, before we confuse people further.
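As a rough sketch of what that would look like (reusing the variables and Keras-era arguments from cifar10_cnn.py; the 0.2 split fraction is just an illustrative choice, not taken from the example):

# Hold out part of the training data for validation via validation_split,
# and keep (X_test, Y_test) untouched until the final evaluation.
model.fit(X_train, Y_train, batch_size=batch_size,
          nb_epoch=nb_epoch, show_accuracy=True,
          validation_split=0.2,  # last 20% of X_train/Y_train is held out
          shuffle=True)

# The test set is only used once, at the very end.
score = model.evaluate(X_test, Y_test, show_accuracy=True, verbose=0)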

@tboquet
Contributor

tboquet commented Feb 18, 2016

I think you are confused about the naming of the datasets. The results of each example are correct and the models are not overfitted since the validation data is not used in the fitting process as pointed out by @gw0. You just have less control using validation_split.

Take a look at those lines:
data fed to the _fit method using validation_data
data fed to the _fit method using validation_split
evaluation of the validation data

You can see that the same treatment is applied to the data you pass to the _fit method whether you choose validation_split or validation_data. You can also see that this data is not passed to the _train method, so the optimizer doesn't update the parameters with respect to it.

@KeironO
Author

KeironO commented Feb 18, 2016

Not quite what I meant. Take a look at this, which is taken directly from cifar10_cnn.py in the examples folder.

if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(X_train, Y_train, batch_size=batch_size,
              nb_epoch=nb_epoch, show_accuracy=True,
              validation_data=(X_test, Y_test), shuffle=True)
else:
    print('Using real-time data augmentation.')

Surely this is an incorrect method of training NN models, as the entire point is to produce models that are able to generalise in classification tasks. By validating on the test dataset, the network will have already seen it, making the evaluation of the model incorrect.

@tboquet
Contributor

tboquet commented Feb 18, 2016

OK, I don't think I understand your point, so let's summarize:

Are you saying that the validation data should be fed to the fit method like this?

if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(X_train, Y_train, batch_size=batch_size,
              nb_epoch=nb_epoch, show_accuracy=True,
              validation_data=(X_val, Y_val), shuffle=True)
else:
    print('Using real-time data augmentation.')

@KeironO
Author

KeironO commented Feb 18, 2016

Yes.

The entire point of cross-validation is that the network trains on the training data, validates the training using a separate validation set, and then evaluates the performance of the model on another, separate test set.

These networks are unable to effectively generalise, as they have already seen the test data.

@KeironO
Author

KeironO commented Feb 18, 2016

Another comment was made about this in #916, where François stated it was just an example of

real-time monitoring

What on earth does that even mean?

@neggert
Contributor

neggert commented Feb 18, 2016

I think this is just a nomenclature issue. The correct way to train a net is to have 3 datasets: train, validation, and test. The training set is obvious. The validation set is checked during training to monitor progress, and possibly for early stopping, but is never used for gradient descent. The test dataset is the best measure of the network's accuracy, and should only be used once, after all training is finished.

So I think the issue is that, strictly speaking, the dataset in that cifar example is a validation dataset, not a test dataset.
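A minimal sketch of the three-dataset setup described above, assuming hypothetical arrays X_train/Y_train, X_val/Y_val and X_test/Y_test and the Keras API of the time:

# Train on the training set, monitor progress on the validation set
# (never used for gradient descent), and touch the test set only once.
history = model.fit(X_train, Y_train, batch_size=batch_size,
                    nb_epoch=nb_epoch, show_accuracy=True,
                    validation_data=(X_val, Y_val), shuffle=True)

# Final, one-time measurement after all training and tuning is done.
test_score = model.evaluate(X_test, Y_test, show_accuracy=True, verbose=0)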

@KeironO
Author

KeironO commented Feb 18, 2016

My point is that it's extremely misleading. Take mnist_cnn.py for example.

model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          show_accuracy=True, verbose=1, validation_data=(X_test, Y_test))
score = model.evaluate(X_test, Y_test, show_accuracy=True, verbose=0)

The header of the python file proudly proclaims that the model is able to achieve:

99.25% test accuracy after 12 epochs

Well of course it's able to; it's badly over-fitted!

@fchollet
Member

fchollet commented Feb 18, 2016

There are apparently many more people who are confused about the purpose and usage of validation data and test data in machine learning. If you need a textbook to learn about these concepts, please see my book Deep Learning with Python. Chapter 4 covers the evaluation of machine learning models, in particular the different processes you can use for evaluation, such as single validation split, k-fold validation, and iterated k-fold with shuffling (which is what I've used in the past to win Kaggle competitions), as well as how to avoid overfitting to your validation process, etc. You will also see these best practices in action throughout the book.

In the example that is being discussed in this thread, there is no validation phase, because there is no model development phase. The model presented is already the final product, and is meant to be trained on the entirety of the available training data, which is what you do to produce your final model. This was actually explained in the code example. I didn't think anyone could get confused by something so simple.

@KeironO
Author

KeironO commented Feb 18, 2016

I think you're missing the point entirely. I couldn't care less what you decide to call your variables - call them kitty, dog and spider for all I care.

The problem is that I believe a few of the examples are misleading. Look at mnist_cnn.py, for example: you perform validation on the test set and then evaluate the model's performance on the same data after the run.

This model has been trained and tested incorrectly, and for people attempting to get to grips with things this is a little bit confusing.

@jgc128

jgc128 commented Feb 18, 2016

I agree with KeironO. The point is that in these examples test data is used as validation data and this is wrong.

For example:

99.25% test accuracy after 12 epochs

How did the number 12 appear? Did the model show worse performance on the validation data after 12 epochs? In that case it literally means "train the model until the performance on the test data starts to go down".

@gw0
Contributor

gw0 commented Feb 19, 2016

The entire point of cross validation ...

@KeironO Those examples do not use cross-validation, just a classic train/test data split. Consequently no actual overfitting happened; in the worst case the test set was used in the sense @neggert or @jgc128 described -- to monitor progress in order to set nb_epoch = 12. But maybe that did not happen: perhaps the author of the example did not monitor progress to manually tweak nb_epoch and just left it at 12 from the beginning. Strictly speaking, in the first case it is a validation dataset used for manual tweaking, and in the second it is a test dataset.

@bayesrule

@KeironO take a close look at what @neggert said: "The validation set is checked during training to monitor progress, and possibly for early stopping, but is never used for gradient descent."

Indeed, the data set fed to the argument "validation_data" of model.fit() in Keras is never used for training. Using test data for this validation is SAFE.

@bbartoldson

@fchollet, I was wondering why there was no third dataset too. I expected training, validation, and test datasets. As it's currently written, mnist_cnn.py puts the test data into the validation data slot:

model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=1, validation_data=(X_test, Y_test))

which will confuse some people like @KeironO and risk leading others astray. For example, someone could make a change to mnist_cnn.py, run the program, and decide whether or not to keep their model changes based on the change in the test dataset accuracy. A novice may even be nudged toward that pitfall since the tutorial program explicitly states that "there is still a lot of margin for parameter tuning". As the TensorFlow MNIST tutorial and Yann LeCun suggest (see below), a separate validation dataset should be used for tuning purposes, not the test data.

TensorFlow's MNIST tutorial uses 5,000/60,000 training images as a validation set:

The downloaded data is split into three parts, 55,000 data points of training data (mnist.train), 10,000 points of test data (mnist.test), and 5,000 points of validation data (mnist.validation). This split is very important: it's essential in machine learning that we have separate data which we don't learn from so that we can make sure that what we've learned actually generalizes!

Yann LeCun, who helped to develop the MNIST dataset, and others achieved a state-of-the-art result on MNIST in the paper What is the Best Multi-Stage Architecture for Object Recognition?, in which they discuss their use of a validation set:

experiments were run on the MNIST dataset, which contains 60,000 gray-scale 28x28 pixel digit images for training and 10,000 images for testing... A validation set of size 10,000 was set apart from the training set to tune the only hyper-parameter: the sparsity constant λ... The system was trained with a form of stochastic gradient descent on the 50,000 non-validation training samples until the best error rate on the validation set was reached (this took 30 epochs). It was then tuned for another 3 epochs on the whole training set.

Granted, mnist_cnn.py does not violate any machine learning principles I'm aware of because it's not tuning a parameter or the model by checking the test data. Nevertheless, the tutorial is currently structured so that someone might adopt that incorrect practice, and the tutorial has clearly confused some people already.

All of this said, I propose modifying the tutorial program to use validation_split, which is described in the Keras documentation as:

validation_split: float (0. < x < 1). Fraction of the data to use as held-out validation data.

I suggest using the same amount of validation data as the TensorFlow tutorial, which is 5,000/60,000 images, or 1/12. I tested the program using this modification and obtained 99.00% test data accuracy. I then ran 3 more epochs using all 60,000 training images, which is the process that LeCun used and described above, and reached 99.09% accuracy. This can be seen in the picture below.

[screenshot: mnist_cnn training output]

A pull request to incorporate this change is #2899. Please let me know if you have any questions, or if you would like me to make a modification.
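For reference, the procedure described above amounts to roughly the following (a sketch only, not the exact contents of #2899; variable names follow mnist_cnn.py):

# Phase 1: hold out 1/12 of the training data (5,000 of 60,000 images)
# as a validation set while developing/monitoring the model.
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=1, validation_split=1. / 12)

# Phase 2: as in LeCun et al., a few more epochs on the full training set.
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=3, verbose=1)

# The test set is used exactly once, for the final reported score.
score = model.evaluate(X_test, Y_test, verbose=0)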

@talpay
Contributor

talpay commented Jun 6, 2016

The example is technically correct, but only until someone starts to optimize hyperparameters manually without changing the fit call to use validation_split. In that case people will overfit to the test data. So why not directly implement the split (as @bbartoldson suggests), since this is guaranteed to confuse fewer people and to reduce the risk of students/novices adopting bad practices?

@ygorcanalli

I agree with @talpay; it's important to make the example didactic for novices, who will probably want to optimize hyperparameters (as the code comments suggest). Just adding the validation_split parameter to the fit method will solve the problem, as @fchollet shows.

@HariKrishna-Vydana

@fchollet why is there a test folder in the Keras code? What is it used for?

@gw0
Contributor

gw0 commented Feb 26, 2017

@harikrishnavydana You mean unit tests for Keras?

Otherwise, stop posting non-relevant comments to closed issues!!!

@Euphemiasama

Euphemiasama commented May 26, 2017

@gw0 @bayesrule Is monitoring the validation loss to reduce the learning rate when it reaches a plateau not considered tweaking the parameters? In that case, aren't we peeking into the test data (the validation data, in the case of Keras), which would eventually lead to overfitting?

@pGit1

pGit1 commented Jul 28, 2017

@bbartoldson and @KeironO are 100% correct, period. However, these small discrepancies should be ignored in the examples, as they are just examples to demonstrate how fitting with a validation set works. In real life, having access to a test set and using it to inform model architectures and other hyperparameter decisions is guaranteed to lead to overfitting. This is common ML knowledge, and dissenting opinions are simply/utterly wrong... the evidence in ML at this point is overwhelming.

That said, I don't think the focus of the examples is to show the pitfalls of overfitting due to improper validation schemes. That should be well understood prior to building any models, never mind powerful function approximators that can memorize noise in data.

@mohapatras

The way @fchollet used test data for validation is correct in a way. If we look at the basics of machine learning, validation is the process of tracking loss and accuracy, which helps us tune and choose better hyperparameters. So in that sense, the mnist_cnn.py example passes the test set in and uses it for the validation calculation.

The only confusion is for beginners. As is usually taught, and is usually correct, test data should not be used to measure anything during development; the real problem arises when you adjust hyperparameters such as the learning rate based on the validation scores. By doing this you have incorporated the validation data into your model, and the scores are no longer independent.
Therefore, if you want an unbiased estimate of the final score, you need a separate test set.
Also, @bbartoldson's and @gw0's opinions on the subject are correct.

@grafael

grafael commented Aug 15, 2017

So, after this long discussion... if I have 3 datasets (train, valid and test), how can I explicitly tell Keras to use them? Is the snippet below right? I've been using this kind of approach for several months. @fchollet @mohapatras

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_valid, y_valid))

score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)

@pGit1

pGit1 commented Aug 18, 2017 via email

@TohnoYukine

Improper use of the test set is improper. I think the validation_split parameter is also easier for beginners to learn than using the test dataset as a validation set. Beginners will really be confused by that misuse of the test set as a validation set, especially when they dive into hyperparameter tuning. Please update the code before it misleads more people.

@ygorcanalli

My vote is with @TohnoYukine; using validation_split would be a better approach for beginners.

@joeantol

Perhaps the Keras documentation will provide some clarity:

validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.

validation_data: tuple (x_val, y_val) or tuple (x_val, y_val, val_sample_weights) on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. This will override validation_split.

(emphasis mine)
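In other words, if both arguments are supplied, only the explicitly passed tuple is evaluated. A small sketch with hypothetical arrays x_train, y_train, x_val, y_val:

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.1,            # ignored in this call
          validation_data=(x_val, y_val))  # this is what val_loss/val_acc report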

@aiexplorations

Instructive discussion, as I am resolving the same kind of issue in a project I'm on. Using [testX, testY] as the validation set does not affect gradient descent (it is only done on the training set). And I get the same gradient-descent results if I don't provide a validation_data argument at all, assuming I choose to run the model over the test set separately.

@haramoz

haramoz commented Sep 29, 2018

Hello, I have the same question here. I have two datasets, one for training and one for testing, and I am using the test data as the validation data for the Keras model.fit function, like this:

model.fit(X_train, y_train, batch_size=64, epochs=epochs,
          callbacks=[es, reduceLROnPlateau, modelCheckpoint],
          validation_data=(X_test, y_test), verbose=2)

I am using EarlyStopping, ReduceLROnPlateau and ModelCheckpoint, all on val_loss. My objective is to stop training where the test accuracy or AUC score is highest. Since the validation data does not contribute to the training, is it alright to do it this way? Further details: my test data contains completely different/unseen objects from the train data, so even when I make validation data by splitting the train data, it is not a great indicator of the generalization performance of my network on the test data. So are the results I am getting scientifically fair, or is it cheating? What's your point of view on this?

Train on 39840 samples, validate on 7440 samples
Epoch 1/30
257s - loss: 0.6046 - acc: 0.7782 - val_loss: 0.6797 - val_acc: 0.8055
Epoch 00001: val_loss improved from inf to 0.67972, saving model to keras_densenet_simple_wt_28Sept.h5
Epoch 2/30
253s - loss: 0.5055 - acc: 0.8236 - val_loss: 0.5370 - val_acc: 0.8085
Epoch 00002: val_loss improved from 0.67972 to 0.53701, saving model to keras_densenet_simple_wt_28Sept.h5
Epoch 3/30
253s - loss: 0.4506 - acc: 0.8518 - val_loss: 0.6623 - val_acc: 0.7937
Epoch 00003: val_loss did not improve from 0.53701
Epoch 4/30
253s - loss: 0.4110 - acc: 0.8711 - val_loss: 0.6583 - val_acc: 0.7995
Epoch 00004: val_loss did not improve from 0.53701
Epoch 5/30
253s - loss: 0.3726 - acc: 0.8908 - val_loss: 0.5282 - val_acc: 0.8237
Epoch 00005: val_loss improved from 0.53701 to 0.52820, saving model to keras_densenet_simple_wt_28Sept.h5
Epoch 6/30
253s - loss: 0.3427 - acc: 0.9035 - val_loss: 0.4683 - val_acc: 0.8422
Epoch 00006: val_loss improved from 0.52820 to 0.46830, saving model to keras_densenet_simple_wt_28Sept.h5
Epoch 7/30
253s - loss: 0.3148 - acc: 0.9152 - val_loss: 0.5580 - val_acc: 0.8109
Epoch 00007: val_loss did not improve from 0.46830
Epoch 8/30
253s - loss: 0.2928 - acc: 0.9241 - val_loss: 0.5347 - val_acc: 0.8290
Epoch 00008: val_loss did not improve from 0.46830
Epoch 9/30
252s - loss: 0.2719 - acc: 0.9334 - val_loss: 0.7429 - val_acc: 0.7800
Epoch 00009: val_loss did not improve from 0.46830
Epoch 10/30
252s - loss: 0.2563 - acc: 0.9389 - val_loss: 0.8768 - val_acc: 0.7226
Epoch 00010: val_loss did not improve from 0.46830
Epoch 11/30
252s - loss: 0.2413 - acc: 0.9449 - val_loss: 0.9852 - val_acc: 0.7405
Epoch 00011: ReduceLROnPlateau reducing learning rate to 9.486833431105525e-05.
Epoch 00011: val_loss did not improve from 0.46830
Epoch 00011: early stopping
7440/7440 [==============================] - 15s 2ms/step
Present Test accuracy: 0.740
Present auc_score ------------------> 0.896
7440/7440 [==============================] - 17s 2ms/step
Loading Best saved model by model checkpoint
Best saved model Test accuracy: 0.842
best saved model Auc_score ------------------> 0.922

@AnanyaKumar

AnanyaKumar commented Jan 18, 2019

@KeironO makes a very valid point. @fchollet, could you please re-open the issue (or give a more detailed reply explaining the design choice) instead of asking him to "go back to the basics of machine learning"?

It would be nice to have 3 sets, train set, validation set, test set. In a typical paper, people should optimize on the train set, tune hyper-parameters on the validation set, and reserve the test set for reporting final accuracies. This is the norm in reputable papers. The current setup encourages people to hyper-parameter tune and do early stopping on the test set. Yes, some people do that nowadays, but this should not be encouraged/the only option.

Ideally, we want dataset splits that are used at different time scales for reliable research. CIFAR-10 has all 3 sets, so it would be nice (and very easy) for Keras to provide all 3. One could argue that people could just split the train set for hyper-parameter optimization. But first, the current setup encourages bad practice. Second, that's not what most research papers do; we should be consistent with the typical setup for comparisons (e.g. this makes loading pre-trained weights much easier).

Additionally, there are contexts like model probability calibration where we need 3 sets. And again, to be consistent with published work it's helpful to report results with a consistent split.
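In the meantime, carving a validation set out of the Keras-provided CIFAR-10 training split is straightforward; a sketch (the 5,000-image validation size mirrors the TensorFlow MNIST tutorial and is otherwise arbitrary):

from keras.datasets import cifar10

# keras.datasets.cifar10 ships a train/test split only; hold out the last
# 5,000 training images as a validation set for model development.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_val, y_val = x_train[-5000:], y_train[-5000:]
x_train, y_train = x_train[:-5000], y_train[:-5000]

# Develop and tune on (x_train, y_train) with (x_val, y_val);
# reserve (x_test, y_test) for the final reported accuracy.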

@rfernand2

rfernand2 commented Feb 10, 2019

Good discussion on an often misunderstood topic. These example programs, like MNIST, are important because many developers new to ML use them as the basis of their actual applications, so we should strive to reflect best practices as space/time permits.

@AnanyaKumar makes a good point about the splits being used at different time scales. The best practice here is to use your train and validation (aka dev) sets for running all of your experiments, trying different hyperparameters, architectures, etc. Then, when you are done experimenting, have chosen a model and all its hyperparameters, and are ready to "spend" your holdout test set on a single, one-time evaluation, you combine your train and validation sets into a single training set, do a final training, and then evaluate on your test set and report its accuracy as such.

To support the above, I propose that the majority of Keras examples like MNIST be focused on running training experiments with train and validation data splits, reporting their final results as "validation accuracy" (or whatever metric is being reported). In addition, there should be 1 or more examples that show how to do the final training and test evaluation, clearly labeled as such.
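A sketch of that two-stage workflow, with hypothetical arrays x_train/y_train, x_val/y_val, x_test/y_test and a hypothetical build_model() helper that recreates the chosen architecture:

import numpy as np

# Stage 1: all experiments and hyperparameter choices use only train + validation.
model = build_model()  # hypothetical: returns a compiled model (with an accuracy metric)
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
          validation_data=(x_val, y_val))

# Stage 2: once the configuration is frozen, retrain on train + validation
# combined, then "spend" the held-out test set exactly once.
x_full = np.concatenate([x_train, x_val])
y_full = np.concatenate([y_train, y_val])
final_model = build_model()
final_model.fit(x_full, y_full, batch_size=batch_size, epochs=epochs)
test_loss, test_acc = final_model.evaluate(x_test, y_test, batch_size=batch_size)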

@baldesco

The best practice here is to use your train and validation (aka dev) sets for running all of your experiments, trying different hyperparameters, architectures, etc. Then, when you are done experimenting, have chosen a model and all its hyperparameters, and are ready to "spend" your holdout test set on a single, one-time evaluation, you combine your train and validation sets into a single training set, do a final training, and then evaluate on your test set and report its accuracy as such.

I think this sums up pretty well the confusion generated by the Keras examples. For model development and hyperparameter tuning, train on the train set and validate on the validation set. For the final model, train on the combined train+validation set and evaluate on the test set.

@StoneCypher

Any training based on a validation set is deeply corrupt and fails both rudimentary and practical 101 principles.

The number of people calling this a nomenclature issue is bewildering. This is basic practice. Under no circumstance is including a validation set a "best practice."

That defies the core concept of a validation set. As soon as it has been included in the training in any way, it is no longer a validation set, and you have no correct mechanism for validation.

This is easily tested. Just do the same job twice, by making two validation sets. One is your standard validation set. The other is a set for validating your development approach.

Now, set the second validation set aside. Do the same training in two batches of ten, once with the first validation set included, the second time without.

Let those run with whatever automatic hyperparameter tuning you want, as long as it's the same kind for both batches. No manual hyperparameter tuning, because then you might accidentally be doing a better job on one than the other, without intent.

Next, when both batches have given the same amount of work to every item inside, call a stop.

Now, use the second validation set to see which items in the batches did a better job: the ones with the first validation set included, or the ones without?

You're going to see dramatically, embarrassingly more overfitting in the batch including validation than the batch not.

I'm really pretty horrified that, years later, this is still closed incorrectly. Someone should have stepped in by now. Students have been making decisions on the basis of the names in this thread for years, and if you go to any college professor worth their salt and ask this question carefully to not imply either answer, 95% of them are going to disagree with the way this thread plays out.

It is not the case that this thread has it right.

@talpay
Contributor

talpay commented May 23, 2020

Edit: there are apparently many more people who are confused about the purpose and usage of validation data and test data in machine learning. If you need a textbook to learn about these concepts, please see my book Deep Learning with Python.

Every single person in this thread understands how the split process works. Around half of the people in this thread understand the intention behind the code and are saying that it is misleading and can very easily lead to incorrect usage (as evidenced by the other half being confused about the code). Even after 4 years, you still seem to be reading this as people saying the code is technically incorrect and therefore not understanding the basics. Many posts go into great depth to explain their position, and you are still misconstruing those positions and giving condescending arguments like "git gud, read my book"; it really feels like you're just acting in bad faith and are more concerned with "being right" than with maximizing the number of people who benefit from the code.

tallamjr added a commit to tallamjr/astronet that referenced this issue Dec 11, 2020
This removes references to 'validation' sets, which are now just
training and test sets, where the training set is split further during
hyperparameter optimisation

Allows for training to consider a 'full' training set.

This also now matches the methodology used by Catarina et al. in
preparation for comparison with the benchmark results

See discussion in the issues linked below for the reasoning behind this.

Making this change also makes cross-validation during hyperparameter
optimisation easier

Refs:
    - keras-team/keras#1753

	modified:   astronet/t2/preprocess.py
	modified:   astronet/t2/tests/int/test_train.py
	modified:   astronet/t2/tests/unit/test_import.py
	modified:   astronet/t2/tests/unit/test_preprocess.py
	modified:   astronet/t2/tests/unit/test_utils.py
	modified:   astronet/t2/train.py
	modified:   astronet/t2/utils.py