
DROCC Code #196

Merged
19 commits merged on Jul 29, 2020
1 change: 1 addition & 0 deletions README.md
@@ -17,6 +17,7 @@ Algorithms that shine in this setting in terms of both model size and compute, n
- **EMI-RNN**: Training routine to recover the critical signature from time series data for faster and more accurate RNN predictions.
- **Shallow RNN**: A meta-architecture for training RNNs that can be applied to streaming data.
- **FastRNN & FastGRNN - FastCells**: **F**ast, **A**ccurate, **S**table and **T**iny (**G**ated) RNN cells.
- **DROCC**: **D**eep **R**obust **O**ne-**C**lass **C**lassification for training robust anomaly detectors.

These algorithms can train models for classical supervised learning problems
with memory requirements that are orders of magnitude lower than other modern
86 changes: 86 additions & 0 deletions examples/pytorch/DROCC/README.md
@@ -0,0 +1,86 @@
# Deep Robust One-Class Classification
This directory contains examples of how to use the `DROCCTrainer` to replicate the results reported in the [DROCC paper](https://proceedings.icml.cc/book/4293.pdf).

Collaborator: add instructions to install drocc trainer from ROOT/pytorch

Contributor Author: done

`DROCCTrainer` is part of the `edgeml_pytorch` package. Please install the `edgeml_pytorch` package as follows:
```
git clone https://github.com/microsoft/EdgeML
cd EdgeML/pytorch
pip install -r requirements-gpu.txt
pip install -e .
```
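
`DROCCTrainer` wraps an arbitrary PyTorch scorer. The sketch below only illustrates the shape of such a setup; the constructor and `train` call are shown as comments because their exact signatures are assumptions inferred from the command-line flags in this README, not verified against the `edgeml_pytorch` source.
```
import torch
import torch.nn as nn

# A toy fully-connected scorer for 8-dimensional tabular features.
model = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical usage -- the import path and argument names (lamda, radius,
# gamma, ...) mirror the CLI flags in the sections below and are assumptions,
# not the verified edgeml_pytorch API:
# from edgeml_pytorch.trainer.drocc_trainer import DROCCTrainer
# trainer = DROCCTrainer(model, optimizer, lamda=1.0, radius=3.0, gamma=2.0,
#                        device=torch.device("cpu"))
# trainer.train(train_loader, test_loader, epochs=200, metric="F1")
```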

## Tabular Experiments
Data is expected in the following format:
```
train_data.npy: features of train data
test_data.npy: features of test data
train_labels.npy: labels for train data (Normal Class Labelled as 1)
test_labels.npy: labels for test data
```
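
As a quick sanity check, the sketch below writes synthetic files in the layout above and prints their shapes; the 8-dimensional features and the sample counts are arbitrary placeholders, not values from any of the datasets.
```
import numpy as np

# Synthetic stand-ins; real features come from the processing scripts below.
train_data = np.random.randn(1000, 8).astype(np.float32)
train_labels = np.ones(1000)                                  # normal class labelled 1
test_data = np.random.randn(200, 8).astype(np.float32)
test_labels = np.concatenate([np.ones(100), np.zeros(100)])   # 1 = normal, 0 = anomaly

for name, arr in [('train_data', train_data), ('train_labels', train_labels),
                  ('test_data', test_data), ('test_labels', test_labels)]:
    np.save(name + '.npy', arr)
    print(name, arr.shape)
```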

### Arrhythmia and Thyroid
* Download the datasets from the ODDS Repository: [Arrhythmia](http://odds.cs.stonybrook.edu/arrhythmia-dataset/) and [Thyroid](http://odds.cs.stonybrook.edu/annthyroid-dataset/). The downloads consist of `arrhythmia.mat` and `annthyroid.mat` respectively.
* The data is split for training as in previous works: [DAGMM](https://openreview.net/forum?id=BJJLHbb0-) and [GOAD](https://openreview.net/forum?id=H1lK_lBtvS).
* To generate the training and test data, use the `data_process_scripts/process_odds.py` script as follows:
```
python data_process_scripts/process_odds.py -d <path/to/downloaded_data/file_name.mat> -o <output path>
```
The output path is referred to as "root_data" in the following section.

### Abalone
* Download the `abalone.data` file from the UCI Repository [here](http://archive.ics.uci.edu/ml/datasets/Abalone).
* To generate the training and test data, use the `data_process_scripts/process_abalone.py` script as follows:
```
python data_process_scripts/process_abalone.py -d <path/to/data/abalone.data> -o <output path>
```
The output path is referred to as "root_data" in the following section.

### Commands to reproduce the results
#### Arrhythmia
```
python3 main_tabular.py --hd 128 --lr 0.0001 --lamda 1 --gamma 2 --ascent_step_size 0.001 --radius 16 --batch_size 256 --epochs 200 --optim 0 --restore 0 --metric F1 -d "root_data"
```

#### Thyroid
```
python3 main_tabular.py --hd 128 --lr 0.001 --lamda 1 --gamma 2 --ascent_step_size 0.001 --radius 2.5 --batch_size 256 --epochs 100 --optim 0 --restore 0 --metric F1 -d "root_data"
```

#### Abalone
```
python3 main_tabular.py --hd 128 --lr 0.001 --lamda 1 --gamma 2 --ascent_step_size 0.001 --radius 3 --batch_size 256 --epochs 200 --optim 0 --restore 0 --metric F1 -d "root_data"
```

Collaborator: was able to follow the instructions and run. At the start of the log I see `(1862, 8) (1862,) (58, 8) (1891,)`. Can you please add annotations for what these are, or just remove them from the printout?

Contributor Author: added annotations for the number of train and test samples.


## Time-Series Experiments

### Data Processing
### Epilepsy
* Download the dataset from the UCI Repository [here](https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition). This consists of a `data.csv` file.
* To generate the training and test data, use the `data_process_scripts/process_epilepsy.py` script as follows:

```
python data_process_scripts/process_epilepsy.py -d <path/to/data/data.csv> -o <output path>
```
The output path is referred to as "root_data" in the following section.


### Example Usage for Epilepsy Dataset
```
python3 main_timeseries.py --hd 128 --lr 0.00001 --lamda 0.5 --gamma 2 --ascent_step_size 0.1 --radius 10 --batch_size 256 --epochs 200 --optim 0 --restore 0 --metric AUC -d "root_data"
```

## CIFAR Experiments
```
python3 main_cifar.py --lamda 1 --radius 8 --lr 0.001 --gamma 1 --ascent_step_size 0.001 --batch_size 256 --epochs 40 --optim 0 --normal_class 0
```


### Arguments Detail
* `normal_class`: CIFAR-10 class to be treated as normal
* `lamda`: weight on the loss from adversarially sampled negative points (\mu in the paper)
* `radius`: radius r in the definition of the set N_i(r)
* `hd`: LSTM hidden dimension
* `optim`: 0 for Adam, 1 for SGD (with momentum)
* `ascent_step_size`: step size for the gradient ascent used to generate adversarial anomalies
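
For intuition on how `radius`, `gamma` and `ascent_step_size` interact, the sketch below shows a simplified version of the adversarial search DROCC performs around each normal point: gradient ascent on the loss of labelling nearby points as anomalous, followed by projection back into the annulus r <= ||x_adv - x|| <= gamma * r. This is an illustrative re-implementation under simplifying assumptions (a generic binary scorer, a plain norm-clipping projection), not the exact `DROCCTrainer` internals.
```
import torch
import torch.nn.functional as F

def adversarial_search(model, x, radius=3.0, gamma=2.0,
                       ascent_step_size=0.001, ascent_num_steps=50):
    """Illustrative sketch (not the DROCCTrainer internals): look for points
    near x that the scorer still labels as normal, restricted to the annulus
    radius <= ||x_adv - x|| <= gamma * radius."""
    x_adv = x + 0.001 * torch.randn_like(x)
    for _ in range(ascent_num_steps):
        x_adv = x_adv.detach().requires_grad_(True)
        scores = model(x_adv).squeeze(-1)
        # Target 0 = anomaly; a large loss means the scorer still thinks these
        # points look normal, which is exactly what the search is after.
        loss = F.binary_cross_entropy_with_logits(scores, torch.zeros_like(scores))
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Normalised gradient-ascent step.
            x_adv = x_adv + ascent_step_size * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
            # Project the perturbation back into the annulus N_i(r).
            delta = x_adv - x
            norm = delta.norm(dim=-1, keepdim=True).clamp_min(1e-12)
            x_adv = x + delta * (norm.clamp(radius, gamma * radius) / norm)
    return x_adv.detach()

# Toy usage: a linear scorer on 8-dimensional tabular features.
model = torch.nn.Linear(8, 1)
x_normal = torch.randn(16, 8)
x_negatives = adversarial_search(model, x_normal)
print(x_negatives.shape, (x_negatives - x_normal).norm(dim=-1))
```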

35 changes: 35 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_abalone.py
@@ -0,0 +1,35 @@
import os
import pandas as pd
import numpy as np
import argparse

parser = argparse.ArgumentParser(description='Preprocess Abalone Data')
parser.add_argument('-d', '--data_path', type=str, default='./abalone.data')
parser.add_argument('-o', '--output_path', type=str, default='.')
args = parser.parse_args()

data = pd.read_csv(args.data_path, header=None, sep=',')

# Column 8 holds the number of rings; use it as the label column.
data = data.rename(columns={8: 'y'})

# Rings 8, 9 and 10 form the normal class (temporarily labelled -1); rings 3
# and 21 are anomalies (labelled 0). Rows with any other ring count are never
# selected below, so they are effectively dropped.
data['y'].replace([8, 9, 10], -1, inplace=True)
data['y'].replace([3, 21], 0, inplace=True)
# Encode the sex attribute (column 0): M -> 0, F -> 1, I -> 2.
data.iloc[:, 0].replace('M', 0, inplace=True)
data.iloc[:, 0].replace('F', 1, inplace=True)
data.iloc[:, 0].replace('I', 2, inplace=True)

# All anomalies go to the test set, together with an equal number of shuffled normal samples.
test = data[data['y'] == 0]
num_normal_samples_test = test.shape[0]

normal = data[data['y'] == -1].sample(frac=1)

test_data = np.concatenate((test.drop('y', axis=1), normal[:num_normal_samples_test].drop('y', axis=1)), axis=0)
# The remaining normal samples form the training set; DROCC expects normal data labelled 1.
train = normal[num_normal_samples_test:]
train_data = train.drop('y', axis=1).values
train_labels = train['y'].replace(-1, 1)
test_labels = np.concatenate((test['y'], normal[:num_normal_samples_test]['y'].replace(-1, 1)), axis=0)

np.save(os.path.join(args.output_path,'train_data.npy'), train_data)
np.save(os.path.join(args.output_path,'train_labels.npy'), train_labels)
np.save(os.path.join(args.output_path,'test_data.npy'), test_data)
np.save(os.path.join(args.output_path,'test_labels.npy'), test_labels)
155 changes: 155 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_cifar.py
@@ -0,0 +1,155 @@
'''
Code borrowed from https://github.com/lukasruff/Deep-SVDD-PyTorch
'''
from PIL import Image
import numpy as np
from random import sample
from abc import ABC, abstractmethod
import torch
from torch.utils.data import Subset
from torchvision.datasets import CIFAR10
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

class BaseADDataset(ABC):
"""Anomaly detection dataset base class."""

def __init__(self, root: str):
super().__init__()
self.root = root # root path to data

        self.n_classes = 2  # two output classes: normal (labelled 1) and outlier (labelled 0)
self.normal_classes = None # tuple with original class labels that define the normal class
self.outlier_classes = None # tuple with original class labels that define the outlier class

self.train_set = None # must be of type torch.utils.data.Dataset
self.test_set = None # must be of type torch.utils.data.Dataset

@abstractmethod
def loaders(self, batch_size: int, shuffle_train=True, shuffle_test=False, num_workers: int = 0) -> (
DataLoader, DataLoader):
"""Implement data loaders of type torch.utils.data.DataLoader for train_set and test_set."""
pass

def __repr__(self):
return self.__class__.__name__

class TorchvisionDataset(BaseADDataset):
"""TorchvisionDataset class for datasets already implemented in torchvision.datasets."""

def __init__(self, root: str):
super().__init__(root)

def loaders(self, batch_size: int, shuffle_train=True, shuffle_test=False, num_workers: int = 0) -> (
DataLoader, DataLoader):
train_loader = DataLoader(dataset=self.train_set, batch_size=batch_size, shuffle=shuffle_train,
num_workers=num_workers)
test_loader = DataLoader(dataset=self.test_set, batch_size=batch_size, shuffle=shuffle_test,
num_workers=num_workers)
return train_loader, test_loader

class CIFAR10_Dataset(TorchvisionDataset):

def __init__(self, root: str, normal_class=5):
super().__init__(root)

        self.n_classes = 2  # two output classes: normal (labelled 1) and outlier (labelled 0)
self.normal_classes = tuple([normal_class])
self.outlier_classes = list(range(0, 10))
self.outlier_classes.remove(normal_class)

# Pre-computed min and max values (after applying GCN) from train data per class
# min_max = [(-28.94083453598571, 13.802961825439636),
# (-6.681770233365245, 9.158067708230273),
# (-34.924463588638204, 14.419298165027628),
# (-10.599172931391799, 11.093187820377565),
# (-11.945022995801637, 10.628045447867583),
# (-9.691969487694928, 8.948326776180823),
# (-9.174940012342555, 13.847014686472365),
# (-6.876682005899029, 12.282371383343161),
# (-15.603507135507172, 15.2464923804279),
# (-6.132882973622672, 8.046098172351265)]
        # CIFAR-10 preprocessing here: standard channel-wise normalization with the
        # CIFAR-10 mean and std (the Deep-SVDD GCN / min-max variant above is not applied).
transform = transforms.Compose([transforms.ToTensor(),
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
std=[0.247, 0.243, 0.261])])

target_transform = transforms.Lambda(lambda x: int(x not in self.outlier_classes))

train_set = MyCIFAR10(root=self.root, train=True, download=True,
transform=transform, target_transform=target_transform)

# Subset train set to normal class
train_idx_normal = get_target_label_idx(train_set.targets, self.normal_classes)
# train_idx_normal_train = sample(train_idx_normal, 4000)
# val_idx_normal = [x for x in train_idx_normal if x not in train_idx_normal_train]

# rest_train_classes = get_target_label_idx(train_set.train_labels, self.outlier_classes)
# rest_train_classes_subset = sample(rest_train_classes, 9000)
# val_idx = val_idx_normal + rest_train_classes_subset
self.train_set = Subset(train_set, train_idx_normal)
# self.test_set = Subset(train_set, val_idx)
self.test_set = MyCIFAR10(root=self.root, train=False, download=True,
transform=transform, target_transform=target_transform)


class MyCIFAR10(CIFAR10):
"""Torchvision CIFAR10 class with patch of __getitem__ method to also return the index of a data sample."""

def __init__(self, *args, **kwargs):
super(MyCIFAR10, self).__init__(*args, **kwargs)

def __getitem__(self, index):
"""Override the original method of the CIFAR10 class.
Args:
index (int): Index
Returns:
triple: (image, target, index) where target is index of the target class.
"""
img, target = self.data[index], self.targets[index]

# doing this so that it is consistent with all other datasets
# to return a PIL Image
img = Image.fromarray(img)

if self.transform is not None:
img = self.transform(img)

if self.target_transform is not None:
target = self.target_transform(target)

return img, target, index # only line changed

def get_target_label_idx(labels, targets):
"""
Get the indices of labels that are included in targets.
:param labels: array of labels
:param targets: list/tuple of target labels
:return: list with indices of target labels
"""
return np.argwhere(np.isin(labels, targets)).flatten().tolist()


def global_contrast_normalization(x: torch.Tensor, scale='l2'):
"""
Apply global contrast normalization to tensor, i.e. subtract mean across features (pixels) and normalize by scale,
which is either the standard deviation, L1- or L2-norm across features (pixels).
Note this is a *per sample* normalization globally across features (and not across the dataset).
"""

assert scale in ('l1', 'l2')

n_features = int(np.prod(x.shape))

mean = torch.mean(x) # mean over all features (pixels) per sample
x -= mean

if scale == 'l1':
x_scale = torch.mean(torch.abs(x))

if scale == 'l2':
x_scale = torch.sqrt(torch.sum(x ** 2)) / n_features

x /= x_scale

return x
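
The loaders above can be exercised directly; a minimal sketch (not part of the script, with an arbitrary dataset root, normal class and batch size) would look like this. Note that after `target_transform` the normal class is labelled 1 and outliers 0, matching the DROCC convention.
```
if __name__ == '__main__':
    dataset = CIFAR10_Dataset(root='./cifar_data', normal_class=0)
    train_loader, test_loader = dataset.loaders(batch_size=256, num_workers=2)
    images, targets, indices = next(iter(train_loader))
    print(images.shape, targets.shape)
```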
36 changes: 36 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_epilepsy.py
@@ -0,0 +1,36 @@
import os
import argparse
import pandas as pd
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('-d', '--data_path', type=str, default='./data.csv')
parser.add_argument('-o', '--output_path', type=str, default='.')
args = parser.parse_args()

data = pd.read_csv(args.data_path)

# Class 1 (seizure activity) is treated as the anomalous class and relabelled 0;
# classes 2-5 (non-seizure recordings) are treated as normal and relabelled 1.
data['y'] = data['y'].replace(1, 0)
data['y'] = data['y'].replace([2, 3, 4, 5], 1)

# All anomalies go to the test set, together with 2300 shuffled normal samples.
test = data[data['y'] == 0]
normal = data[data['y'] == 1].sample(frac=1).reset_index(drop=True)

test = pd.concat([test, normal.iloc[:2300]])

# The remaining normal samples form the training set.
normal = normal.iloc[2300:]

# 'Unnamed: 0' is the sample-id column picked up from the CSV index.
normal = normal.drop(['y', 'Unnamed: 0'], axis=1)
np.save(os.path.join(args.output_path, 'train.npy'), normal.values)

test = test.drop('Unnamed: 0', axis=1)
test = test.sample(frac=1).reset_index(drop=True)

labels = test['y'].values

test = test.drop('y', axis=1).values
np.save(os.path.join(args.output_path, 'test_data.npy'), test)
np.save(os.path.join(args.output_path, 'test_labels.npy'), labels)

18 changes: 18 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_kws.py
@@ -0,0 +1,18 @@
import numpy as np

# Keyword-spotting features: the keyword class is treated as normal, everything else as anomalous.
train_f = np.load('train_seven.npz')['features']    # containing only the class marvin
others_f = np.load('other_seven.npz')['features']   # containing classes other than marvin

np.random.shuffle(train_f)
np.random.shuffle(others_f)

# 80/20 split of the normal class into train and test (indices must be integers).
len_train = int(0.8 * len(train_f))
len_test = len(train_f) - len_train

data = train_f[:len_train]
np.save('train.npy', data)

# Test set: the held-out normal samples (labelled 1) followed by an equal number
# of anomalous samples (labelled 0); others_f has already been shuffled above.
test_data = np.concatenate((train_f[len_train:], others_f[:len_test]), axis=0)
labels = np.array([1] * len_test + [0] * len_test)
np.save('test_data.npy', test_data)
np.save('test_labels.npy', labels)
37 changes: 37 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_odds.py
@@ -0,0 +1,37 @@
import os
import numpy as np
from scipy.io import loadmat
import argparse

parser = argparse.ArgumentParser(description='Preprocess Dataset from ODDS Repository')
parser.add_argument('-d', '--data_path', type=str, default='./arrhythmia.mat')
parser.add_argument('-o', '--output_path', type=str, default='.')
args = parser.parse_args()

dataset = loadmat(args.data_path)

data = np.concatenate((dataset['X'], dataset['y']), axis=1)

# In the ODDS .mat files, y = 1 marks the outliers; all of them go to the test set.
test = data[data[:,-1] == 1]
# An equal number of normal samples will also be placed in the test set.
num_normal_samples_test = test.shape[0]

normal = data[data[:,-1] == 0]
np.random.shuffle(normal)

test = np.concatenate((test, normal[:num_normal_samples_test]), axis=0)

train = normal[num_normal_samples_test:]
train_data = train[:,:-1]
# DROCC requires normal data to be labelled 1
train_labels = np.ones(train_data.shape[0])

test_data = test[:,:-1]
# DROCC requires normal data to be labelled 1 and anomalies 0
test_labels = np.concatenate((
np.zeros(num_normal_samples_test), np.ones(num_normal_samples_test)),
axis=0)

np.save(os.path.join(args.output_path,'train_data.npy'), train_data)
np.save(os.path.join(args.output_path,'train_labels.npy'), train_labels)
np.save(os.path.join(args.output_path,'test_data.npy'), test_data)
np.save(os.path.join(args.output_path,'test_labels.npy'), test_labels)