Identifying Mislabeled Instances in Time Series Data

Author: Gentry Atkinson
Organization: Texas State University

This work extends the original LabelFix project so that it functions on time series data.

Changes

A 1D convolutional neural network has been added to the original set of models applied during label noise identification (an illustrative sketch appears below). Several utility files have been added to make working with time series data more convenient: one generates synthetic datasets of arbitrary size, and another performs feature extraction on time series datasets using the TSFresh library. Two further files re-write the UniMiB and Sussex-Huawei HAR datasets into a more usable format for the experiments that demonstrate the efficacy of the new model. Finally, a bash script has been included to automatically run 3 experiments on 6 time series datasets.
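
As an illustration only, a 1D convolutional classifier of the kind added to the model set might be built with Keras as sketched below; the layer sizes and hyperparameters are assumptions for this example, not the settings used in the repository.

import tensorflow as tf

def build_1d_cnn(n_timesteps, n_channels, n_classes):
    # Illustrative sketch only: two Conv1D blocks, pooling over time,
    # and a softmax classifier over the activity classes.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_timesteps, n_channels)),
        tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model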

Repeating Time Series Experiments

The UniMiB and Sussex-Huawei datasets have to be downloaded by the user and placed in the datasets directory. After that, invoking the run_all_experiments script (see the example below) will run the 3 experiments on the 4 synthetic datasets, the UniMiB fall and UniMiB all-activity datasets, and the Sussex-Huawei hand and Sussex-Huawei torso datasets.
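
For example, from the repository root (assuming the script is a bash script named run_all_experiments.sh; adjust the name if it differs in your checkout):

bash run_all_experiments.sh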

LabelFix - Identifying Mislabeled Instances in Classification Data Sets

Welcome to the LabelFix project. The goal of this tool is to find mislabeled data instances in image, text and numerical data in order to speed up the manual reviewing process by domain experts.

This work has been published at IJCNN 2019. The article can be found at https://ieeexplore.ieee.org/document/8851920 and at ArXiv.org: https://arxiv.org/pdf/1912.05283.pdf

Installation

Note: It is advisable to use a virtual environment for installing and running the code.

This tool has been developed using Python 3. Run pip3 install -r requirements.txt to install the required Python 3 dependencies. This installs the CPU-only dependencies. If you want to run the code on a machine with GPU support, follow https://www.tensorflow.org/install/gpu to install the GPU-compatible version of TensorFlow.
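
For example, the dependencies can be installed inside a virtual environment using standard Python tooling (not specific to this project):

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt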

Usage

Execution of 'quantitative evaluation'

To run the quantitative evaluation of the system, run src/quantitative_evaluation.py from the main directory labelfix. You may need to set PYTHONPATH:

PYTHONPATH=.:$PYTHONPATH python3 src/quantitative_evaluation.py

Working examples

To find mislabeled instances, e.g., in the fashion MNIST data set, run the following.

PYTHONPATH=.:$PYTHONPATH python3 src/examples/find_in_fashion_mnist.py

Since the code involves randomness, different runs may lead to different outcomes.

To find mislabeled instances in your own data, follow the examples in src/examples.

Paper data sets

Most data sets used in the paper will be downloaded automatically and stored in the res/ directory. The exceptions are the SDSS data set (https://www.kaggle.com/lucidlenn/sloan-digital-sky-survey) and the Twitter Airline data set (https://www.kaggle.com/crowdflower/twitter-airline-sentiment). Both have to be downloaded from Kaggle manually and stored as res/kaggle/Skyserver_SQL2_27_2018 6_51_39 PM.csv and res/kaggle/Tweets.csv, respectively. Downloading them requires a Kaggle account. An example of placing the files is shown below.
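
For example, assuming the files were downloaded to ~/Downloads (the download location is an assumption), they can be moved into place with:

mkdir -p res/kaggle
mv ~/Downloads/"Skyserver_SQL2_27_2018 6_51_39 PM.csv" res/kaggle/
mv ~/Downloads/Tweets.csv res/kaggle/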

Custom data

To load a custom data set, you need to pre-process it and supply it as a dictionary {"data": X, "target": y}. The following code gives an example:

# Imports: check_dataset and visualize_image are provided by this repository
# (see src/); import them according to the project layout.
import tensorflow as tf

# First, construct the required dictionary using the fashion MNIST training data
(x_train, y_train), (_, _) = tf.keras.datasets.fashion_mnist.load_data()
dataset = {'data': x_train, 'target': y_train}

# Check the data set for likely mislabeled instances
res = check_dataset(dataset)

# Plot four batches of images with the most likely mislabeled pairs (x, y) and save them to disk
for i in range(4):
    visualize_image(image_data=dataset["data"],
                    image_labels=dataset["target"],
                    label_names=["top/shirt", "trousers", "pullover", "dress", "coat", "sandal", "shirt", "sneaker", "bag", "ankle boot"],
                    indices=res["indices"],
                    batch_to_plot=i,
                    save_to_path="/tmp/found_fashion_mnist/")

The algorithm's output is shown in the following. It displays the top nine images returned by the algorithm, i.e., the nine images from the fashion MNIST training data set that are most likely to be mislabeled. Upon close inspection, it is clear that all of them indeed carry questionable labels.

Remember that the code includes some randomization, so different runs do not necessarily lead to the same images.

Results

Below are some results of our system on the publicly available and well-known Fashion MNIST and CIFAR-100 data sets.

Mislabeled instances from the fashion MNIST training set

Mislabeled instances from the CIFAR-100 training set

20 Newsgroups

Below are two examples from the 20 Newsgroups data set where the content does not match the label.

Quantitative Evaluation Results

Table 3 from the paper:

Copyright

GPLv2, see the files LICENSE and GPLv2.

Bugs and Contribution

Please let us know if you find bugs and/or would like to contribute.
