deepmutation

original idea: https://arxiv.org/pdf/1805.05206.pdf

This is my course project for Analysis of Software Artifacts (Fall 2019) at the University of Virginia. As explained in the paper above, the goal is to insert mutants into the model and the data in order to evaluate the quality of the testing data in deep neural networks. Below is my proposal for the project, which largely overlaps with the original paper, DeepMutation: Mutation Testing of Deep Learning Systems.


The workflow of deep learning software development roughly follows this procedure: obtain a suitable dataset, preprocess it, split it into training and test sets, develop a model, train the model on the training data, and evaluate its performance on the test data before deployment. Since the test data is the main basis for evaluating the model, its quality is of critical importance. Therefore, in this project, I ask and will try to answer the following question: "How can we evaluate the quality of the test dataset in deep learning?"
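For reference, here is a minimal sketch of that workflow on MNIST, assuming tf.keras; the architecture, hyperparameters, and the helper name `build_model` are illustrative choices for this write-up, not prescribed by the paper. The later snippets build on it.

```python
import tensorflow as tf

def build_model():
    """A small illustrative CNN for MNIST; the architecture is an arbitrary choice."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Obtain and preprocess the data; MNIST already comes split into train/test sets.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

# Train on the training data, then evaluate on the held-out test data.
model_original = build_model()
model_original.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.1)
test_loss, test_acc = model_original.evaluate(x_test, y_test)
```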

To answer this question, I use mutation testing and generally follow the procedure of Ma et al.:

1. Train a DNN model on the original training data. Specifically, I will use a convolutional neural network as my model M_original and the MNIST data set.
2. Test the neural network on the test data. Mark the test data points that are correctly classified by the model. Call it <a href="https://www.codecogs.com/eqnedit.php?latex=D_{correct}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?D_{correct}" title="D_{correct}" /></a>.
3. Insert a mutant. Examples of mutants:
   1. removing, repeating, or mislabeling training data
   2. removing or adding hidden layers in the DNN
   3. initializing the neuron weights with pre-trained weights
   4. negating or removing the activation function
4. For each mutant m_k, I will re-evaluate the m_k-mutated DNN on D_correct. If any point in D_correct is misclassified, m_k survives and is evaluated; otherwise, it is killed (it has no effect). A code sketch of steps 2 to 4 follows this list; the evaluation method for m_k is provided further below.
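Continuing from the sketch above (reusing `build_model`, `model_original`, and the MNIST arrays), here is one way steps 2 to 4 could look for a data-level mutant that mislabels part of the training data; the `mislabel` helper and the 5% fraction are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Step 2: mark the test points that the original model classifies correctly (D_correct).
pred_original = np.argmax(model_original.predict(x_test), axis=1)
correct_mask = pred_original == y_test
x_correct, y_correct = x_test[correct_mask], y_test[correct_mask]

# Step 3: insert a data-level mutant, e.g. mislabel a fraction of the training labels.
def mislabel(labels, fraction=0.05, num_classes=10, seed=0):
    rng = np.random.default_rng(seed)
    mutated = labels.copy()
    idx = rng.choice(len(labels), int(fraction * len(labels)), replace=False)
    # Shift each chosen label by 1..num_classes-1 so it is guaranteed to change.
    mutated[idx] = (mutated[idx] + rng.integers(1, num_classes, size=len(idx))) % num_classes
    return mutated

y_train_mutated = mislabel(y_train)

# Retrain a fresh copy of the architecture on the mutated data to obtain the mutated model.
model_mutated = build_model()
model_mutated.fit(x_train, y_train_mutated, epochs=3, batch_size=128)

# Step 4: re-evaluate the mutated model on D_correct. The mutant survives if it flips
# at least one previously correct prediction; otherwise it is killed (no effect).
pred_mutated = np.argmax(model_mutated.predict(x_correct), axis=1)
num_flipped = int(np.sum(pred_mutated != y_correct))
survives = num_flipped > 0
```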

After testing the results on the MNIST data set, I will try mutation testing on more complex data sets, such as CIFAR10 or MS-COCO. I will also explore new, effective mutants beyond the ones listed in step 3. For example, I believe that swapping the order of hidden layers will be a very useful mutant: if the original two hidden layers contained x and y neurons respectively, I would change the order to y and x. DL engineers often design networks based on previous experience with similar data sets, or empirically, based on how different layer orderings perform on training/validation data. Being able to draw conclusions about layer ordering from mutation-testing results would give DL engineers another way to design more effective DNNs.
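To make the layer-ordering mutant concrete, here is a minimal sketch assuming a small fully connected network whose two hidden layers have x = 128 and y = 64 units; the sizes and the `build_dense_model` helper are illustrative, not taken from the paper.

```python
import tensorflow as tf

def build_dense_model(hidden_sizes=(128, 64)):
    """Build a small fully connected MNIST classifier with the given hidden layer sizes."""
    layers = [tf.keras.layers.Flatten(input_shape=(28, 28, 1))]
    for units in hidden_sizes:
        layers.append(tf.keras.layers.Dense(units, activation="relu"))
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

original = build_dense_model(hidden_sizes=(128, 64))  # hidden layers of x=128 and y=64 units
swapped = build_dense_model(hidden_sizes=(64, 128))   # layer-order mutant: y, then x
```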

It is reasonable to believe that the test data points that get misclassified after mutants are inserted are of lower quality than those unaffected by mutation. The term quality, in the context of classification, refers to how far a data point lies from the model's decision boundary. The farther away it lies, the more certain the model is about which class it belongs to, and therefore the higher its quality.

Suppose that the number of different classes is c; each data point belongs to one of {class_1, class_2, …, class_c}. I evaluate the mutation score for mutant m_k as follows:

MutationScore(m_k) = Σ_{i=1}^{c} count_misclassify(D_correct, m_k, class_i) / |D_correct,i|

where count_misclassify(D_correct, m_k, class_i) is the number of test data points in D_correct that belong to class_i and are misclassified by the m_k-mutated model, and |D_correct,i| is the number of test data points in D_correct that belong to class_i. Dividing by the per-class count handles the case where two data sets differ largely in size, and it also differentiates a mutant that causes only a few points of D_correct to be misclassified from one that causes the majority of them to be.
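As a sketch, the score could be computed as below, assuming D_correct is available as arrays (x_correct, y_correct) and model_mutated is the m_k-mutated DNN from the earlier snippet; summing the per-class misclassification rates reflects my reading of the formula above.

```python
import numpy as np

def mutation_score(model_mutated, x_correct, y_correct, num_classes=10):
    """Sum over classes of the per-class misclassification rate on D_correct."""
    pred = np.argmax(model_mutated.predict(x_correct), axis=1)
    score = 0.0
    for i in range(num_classes):
        in_class = y_correct == i
        if not np.any(in_class):
            continue  # no correctly classified points of this class to evaluate
        # count_misclassify(D_correct, m_k, class_i) divided by |D_correct,i|
        score += np.sum(pred[in_class] != i) / np.sum(in_class)
    return float(score)
```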

It is highly likely that most, if not all, mutants will result in different mutation scores. One way to interpret a score is how detrimentally a mutant affects the model's performance: the higher the score, the more detrimental its effect. This project will matter to DL developers who build DNNs: it helps them judge the quality of their test data, gives them a way to evaluate data preprocessing methods, and lets them gain confidence in their model's performance on the test data. It will also matter to those who gather the data itself.