
DROCC Code #196

Merged
19 commits merged on Jul 29, 2020
1 change: 1 addition & 0 deletions README.md
@@ -17,6 +17,7 @@ Algorithms that shine in this setting in terms of both model size and compute, n
- **EMI-RNN**: Training routine to recover the critical signature from time series data for faster and more accurate RNN predictions.
- **Shallow RNN**: A meta-architecture for training RNNs that can be applied to streaming data.
- **FastRNN & FastGRNN - FastCells**: **F**ast, **A**ccurate, **S**table and **T**iny (**G**ated) RNN cells.
- **DROCC**: **D**eep **R**obust **O**ne-**C**lass **C**lassification for training robust anomaly detectors.

These algorithms can train models for classical supervised learning problems
with memory requirements that are orders of magnitude lower than other modern
86 changes: 86 additions & 0 deletions examples/pytorch/DROCC/README.md
@@ -0,0 +1,86 @@
# Deep Robust One-Class Classification
This directory contains examples of how to use the `DROCCTrainer` to replicate the results reported in the [DROCC paper](https://proceedings.icml.cc/book/4293.pdf).

Collaborator: add instructions to install drocc trainer from ROOT/pytorch

Contributor Author: done

`DROCCTrainer` is part of the `edgeml_pytorch` package. Please install the `edgeml_pytorch` package as follows:
```
git clone https://github.com/microsoft/EdgeML
cd EdgeML/pytorch
pip install -r requirements-gpu.txt
pip install -e .
```
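
`DROCCTrainer` wraps an arbitrary PyTorch scorer. The sketch below only illustrates the shape of such a setup; the constructor and `train` call are shown as comments because their exact signatures are assumptions inferred from the command-line flags in this README, not verified against the `edgeml_pytorch` source.
```
import torch
import torch.nn as nn

# A toy fully-connected scorer for 8-dimensional tabular features.
model = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical usage -- the import path and argument names (lamda, radius,
# gamma, ...) mirror the CLI flags in the sections below and are assumptions,
# not the verified edgeml_pytorch API:
# from edgeml_pytorch.trainer.drocc_trainer import DROCCTrainer
# trainer = DROCCTrainer(model, optimizer, lamda=1.0, radius=3.0, gamma=2.0,
#                        device=torch.device("cpu"))
# trainer.train(train_loader, test_loader, epochs=200, metric="F1")
```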

## Tabular Experiments
Data is expected in the following format:
```
train_data.npy: features of train data
test_data.npy: features of test data
train_labels.npy: labels for train data (Normal Class Labelled as 1)
test_labels.npy: labels for test data
```
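
As a quick sanity check, the sketch below writes synthetic files in the layout above and prints their shapes; the 8-dimensional features and the sample counts are arbitrary placeholders, not values from any of the datasets.
```
import numpy as np

# Synthetic stand-ins; real features come from the processing scripts below.
train_data = np.random.randn(1000, 8).astype(np.float32)
train_labels = np.ones(1000)                                  # normal class labelled 1
test_data = np.random.randn(200, 8).astype(np.float32)
test_labels = np.concatenate([np.ones(100), np.zeros(100)])   # 1 = normal, 0 = anomaly

for name, arr in [('train_data', train_data), ('train_labels', train_labels),
                  ('test_data', test_data), ('test_labels', test_labels)]:
    np.save(name + '.npy', arr)
    print(name, arr.shape)
```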

### Arrhythmia and Thyroid
* Download the datasets from the ODDS Repository: [Arrhythmia](http://odds.cs.stonybrook.edu/arrhythmia-dataset/) and [Thyroid](http://odds.cs.stonybrook.edu/annthyroid-dataset/). The downloads consist of `arrhythmia.mat` and `annthyroid.mat` respectively.
* The data is split for training as in previous works: [DAGMM](https://openreview.net/forum?id=BJJLHbb0-) and [GOAD](https://openreview.net/forum?id=H1lK_lBtvS).
* To generate the training and test data, use the `data_process_scripts/process_odds.py` script as follows:
```
python data_process_scripts/process_odds.py -d <path/to/downloaded_data/file_name.mat> -o <output path>
```
The output path is referred to as "root_data" in the following section.

### Abalone
* Download the `abalone.data` file from the UCI Repository [here](http://archive.ics.uci.edu/ml/datasets/Abalone).
* To generate the training and test data, use the `data_process_scripts/process_abalone.py` script as follows:
```
python data_process_scripts/process_abalone.py -d <path/to/data/abalone.data> -o <output path>
```
The output path is referred to as "root_data" in the following section.

### Commands to reproduce the results
#### Arrhythmia
```
python3 main_tabular.py --hd 128 --lr 0.0001 --lamda 1 --gamma 2 --ascent_step_size 0.001 --radius 16 --batch_size 256 --epochs 200 --optim 0 --restore 0 --metric F1 -d "root_data"
```

#### Thyroid
```
python3 main_tabular.py --hd 128 --lr 0.001 --lamda 1 --gamma 2 --ascent_step_size 0.001 --radius 2.5 --batch_size 256 --epochs 100 --optim 0 --restore 0 --metric F1 -d "root_data"
```

#### Abalone
```
python3 main_tabular.py --hd 128 --lr 0.001 --lamda 1 --gamma 2 --ascent_step_size 0.001 --radius 3 --batch_size 256 --epochs 200 --optim 0 --restore 0 --metric F1 -d "root_data"
```

Collaborator: was able to follow the instructions and run. At the start of the log I see `(1862, 8) (1862,) (58, 8) (1891,)`. Can you please add annotations for what these are, or just remove them from the printout?

Contributor Author: added annotations for the number of train and test samples.


## Time-Series Experiments

### Data Processing
### Epilepsy
* Download the dataset from the UCI Repository [here](https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition). This consists of a `data.csv` file.
* To generate the training and test data, use the `data_process_scripts/process_epilepsy.py` script as follows:

```
python data_process_scripts/process_epilepsy.py -d <path/to/data/data.csv> -o <output path>
```
The output path is referred to as "root_data" in the following section.


### Example Usage for Epilepsy Dataset
```
python3 main_timeseries.py --hd 128 --lr 0.00001 --lamda 0.5 --gamma 2 --ascent_step_size 0.1 --radius 10 --batch_size 256 --epochs 200 --optim 0 --restore 0 --metric AUC -d "root_data"
```

## CIFAR Experiments
```
python3 main_cifar.py --lamda 1 --radius 8 --lr 0.001 --gamma 1 --ascent_step_size 0.001 --batch_size 256 --epochs 40 --optim 0 --normal_class 0
```


### Arguments Detail
* `normal_class`: CIFAR-10 class to be treated as normal
* `lamda`: weight on the loss from adversarially sampled negative points (\mu in the paper)
* `radius`: radius r in the definition of the set N_i(r)
* `hd`: LSTM hidden dimension
* `optim`: 0 for Adam, 1 for SGD (with momentum)
* `ascent_step_size`: step size for the gradient ascent used to generate adversarial anomalies
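
For intuition on how `radius`, `gamma` and `ascent_step_size` interact, the sketch below shows a simplified version of the adversarial search DROCC performs around each normal point: gradient ascent on the loss of labelling nearby points as anomalous, followed by projection back into the annulus r <= ||x_adv - x|| <= gamma * r. This is an illustrative re-implementation under simplifying assumptions (a generic binary scorer, a plain norm-clipping projection), not the exact `DROCCTrainer` internals.
```
import torch
import torch.nn.functional as F

def adversarial_search(model, x, radius=3.0, gamma=2.0,
                       ascent_step_size=0.001, ascent_num_steps=50):
    """Illustrative sketch (not the DROCCTrainer internals): look for points
    near x that the scorer still labels as normal, restricted to the annulus
    radius <= ||x_adv - x|| <= gamma * radius."""
    x_adv = x + 0.001 * torch.randn_like(x)
    for _ in range(ascent_num_steps):
        x_adv = x_adv.detach().requires_grad_(True)
        scores = model(x_adv).squeeze(-1)
        # Target 0 = anomaly; a large loss means the scorer still thinks these
        # points look normal, which is exactly what the search is after.
        loss = F.binary_cross_entropy_with_logits(scores, torch.zeros_like(scores))
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Normalised gradient-ascent step.
            x_adv = x_adv + ascent_step_size * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
            # Project the perturbation back into the annulus N_i(r).
            delta = x_adv - x
            norm = delta.norm(dim=-1, keepdim=True).clamp_min(1e-12)
            x_adv = x + delta * (norm.clamp(radius, gamma * radius) / norm)
    return x_adv.detach()

# Toy usage: a linear scorer on 8-dimensional tabular features.
model = torch.nn.Linear(8, 1)
x_normal = torch.randn(16, 8)
x_negatives = adversarial_search(model, x_normal)
print(x_negatives.shape, (x_negatives - x_normal).norm(dim=-1))
```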

35 changes: 35 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_abalone.py
@@ -0,0 +1,35 @@
import os
import pandas as pd
import numpy as np
import argparse

parser = argparse.ArgumentParser(description='Preprocess Abalone Data')
parser.add_argument('-d', '--data_path', type=str, default='./abalone.data')
parser.add_argument('-o', '--output_path', type=str, default='.')
args = parser.parse_args()

data = pd.read_csv(args.data_path, header=None, sep=',')

# Column 8 holds the number of rings; use it as the label column.
data = data.rename(columns={8: 'y'})

# Rings 8, 9 and 10 form the normal class (temporarily labelled -1); rings 3
# and 21 are anomalies (labelled 0). Rows with any other ring count are never
# selected below, so they are effectively dropped.
data['y'].replace([8, 9, 10], -1, inplace=True)
data['y'].replace([3, 21], 0, inplace=True)
# Encode the sex attribute (column 0): M -> 0, F -> 1, I -> 2.
data.iloc[:, 0].replace('M', 0, inplace=True)
data.iloc[:, 0].replace('F', 1, inplace=True)
data.iloc[:, 0].replace('I', 2, inplace=True)

# All anomalies go to the test set, together with an equal number of shuffled normal samples.
test = data[data['y'] == 0]
num_normal_samples_test = test.shape[0]

normal = data[data['y'] == -1].sample(frac=1)

test_data = np.concatenate((test.drop('y', axis=1), normal[:num_normal_samples_test].drop('y', axis=1)), axis=0)
# The remaining normal samples form the training set; DROCC expects normal data labelled 1.
train = normal[num_normal_samples_test:]
train_data = train.drop('y', axis=1).values
train_labels = train['y'].replace(-1, 1)
test_labels = np.concatenate((test['y'], normal[:num_normal_samples_test]['y'].replace(-1, 1)), axis=0)

np.save(os.path.join(args.output_path,'train_data.npy'), train_data)
np.save(os.path.join(args.output_path,'train_labels.npy'), train_labels)
np.save(os.path.join(args.output_path,'test_data.npy'), test_data)
np.save(os.path.join(args.output_path,'test_labels.npy'), test_labels)
155 changes: 155 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_cifar.py
@@ -0,0 +1,155 @@
'''
Code borrowed from https://github.com/lukasruff/Deep-SVDD-PyTorch
'''
from PIL import Image
import numpy as np
from random import sample
from abc import ABC, abstractmethod
import torch
from torch.utils.data import Subset
from torchvision.datasets import CIFAR10
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

class BaseADDataset(ABC):
"""Anomaly detection dataset base class."""

def __init__(self, root: str):
super().__init__()
self.root = root # root path to data

        self.n_classes = 2  # two output classes: normal (labelled 1) and outlier (labelled 0)
self.normal_classes = None # tuple with original class labels that define the normal class
self.outlier_classes = None # tuple with original class labels that define the outlier class

self.train_set = None # must be of type torch.utils.data.Dataset
self.test_set = None # must be of type torch.utils.data.Dataset

@abstractmethod
def loaders(self, batch_size: int, shuffle_train=True, shuffle_test=False, num_workers: int = 0) -> (
DataLoader, DataLoader):
"""Implement data loaders of type torch.utils.data.DataLoader for train_set and test_set."""
pass

def __repr__(self):
return self.__class__.__name__

class TorchvisionDataset(BaseADDataset):
"""TorchvisionDataset class for datasets already implemented in torchvision.datasets."""

def __init__(self, root: str):
super().__init__(root)

def loaders(self, batch_size: int, shuffle_train=True, shuffle_test=False, num_workers: int = 0) -> (
DataLoader, DataLoader):
train_loader = DataLoader(dataset=self.train_set, batch_size=batch_size, shuffle=shuffle_train,
num_workers=num_workers)
test_loader = DataLoader(dataset=self.test_set, batch_size=batch_size, shuffle=shuffle_test,
num_workers=num_workers)
return train_loader, test_loader

class CIFAR10_Dataset(TorchvisionDataset):

def __init__(self, root: str, normal_class=5):
super().__init__(root)

        self.n_classes = 2  # two output classes: normal (labelled 1) and outlier (labelled 0)
self.normal_classes = tuple([normal_class])
self.outlier_classes = list(range(0, 10))
self.outlier_classes.remove(normal_class)

# Pre-computed min and max values (after applying GCN) from train data per class
# min_max = [(-28.94083453598571, 13.802961825439636),
# (-6.681770233365245, 9.158067708230273),
# (-34.924463588638204, 14.419298165027628),
# (-10.599172931391799, 11.093187820377565),
# (-11.945022995801637, 10.628045447867583),
# (-9.691969487694928, 8.948326776180823),
# (-9.174940012342555, 13.847014686472365),
# (-6.876682005899029, 12.282371383343161),
# (-15.603507135507172, 15.2464923804279),
# (-6.132882973622672, 8.046098172351265)]
        # CIFAR-10 preprocessing here: standard channel-wise normalization with the
        # CIFAR-10 mean and std (the Deep-SVDD GCN / min-max variant above is not applied).
transform = transforms.Compose([transforms.ToTensor(),
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
std=[0.247, 0.243, 0.261])])

target_transform = transforms.Lambda(lambda x: int(x not in self.outlier_classes))

train_set = MyCIFAR10(root=self.root, train=True, download=True,
transform=transform, target_transform=target_transform)

# Subset train set to normal class
train_idx_normal = get_target_label_idx(train_set.targets, self.normal_classes)
# train_idx_normal_train = sample(train_idx_normal, 4000)
# val_idx_normal = [x for x in train_idx_normal if x not in train_idx_normal_train]

# rest_train_classes = get_target_label_idx(train_set.train_labels, self.outlier_classes)
# rest_train_classes_subset = sample(rest_train_classes, 9000)
# val_idx = val_idx_normal + rest_train_classes_subset
self.train_set = Subset(train_set, train_idx_normal)
# self.test_set = Subset(train_set, val_idx)
self.test_set = MyCIFAR10(root=self.root, train=False, download=True,
transform=transform, target_transform=target_transform)


class MyCIFAR10(CIFAR10):
"""Torchvision CIFAR10 class with patch of __getitem__ method to also return the index of a data sample."""

def __init__(self, *args, **kwargs):
super(MyCIFAR10, self).__init__(*args, **kwargs)

def __getitem__(self, index):
"""Override the original method of the CIFAR10 class.
Args:
index (int): Index
Returns:
triple: (image, target, index) where target is index of the target class.
"""
img, target = self.data[index], self.targets[index]

# doing this so that it is consistent with all other datasets
# to return a PIL Image
img = Image.fromarray(img)

if self.transform is not None:
img = self.transform(img)

if self.target_transform is not None:
target = self.target_transform(target)

return img, target, index # only line changed

def get_target_label_idx(labels, targets):
"""
Get the indices of labels that are included in targets.
:param labels: array of labels
:param targets: list/tuple of target labels
:return: list with indices of target labels
"""
return np.argwhere(np.isin(labels, targets)).flatten().tolist()


def global_contrast_normalization(x: torch.Tensor, scale='l2'):
"""
Apply global contrast normalization to tensor, i.e. subtract mean across features (pixels) and normalize by scale,
which is either the standard deviation, L1- or L2-norm across features (pixels).
Note this is a *per sample* normalization globally across features (and not across the dataset).
"""

assert scale in ('l1', 'l2')

n_features = int(np.prod(x.shape))

mean = torch.mean(x) # mean over all features (pixels) per sample
x -= mean

if scale == 'l1':
x_scale = torch.mean(torch.abs(x))

if scale == 'l2':
x_scale = torch.sqrt(torch.sum(x ** 2)) / n_features

x /= x_scale

return x
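
The loaders above can be exercised directly; a minimal sketch (not part of the script, with an arbitrary dataset root, normal class and batch size) would look like this. Note that after `target_transform` the normal class is labelled 1 and outliers 0, matching the DROCC convention.
```
if __name__ == '__main__':
    dataset = CIFAR10_Dataset(root='./cifar_data', normal_class=0)
    train_loader, test_loader = dataset.loaders(batch_size=256, num_workers=2)
    images, targets, indices = next(iter(train_loader))
    print(images.shape, targets.shape)
```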
36 changes: 36 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_epilepsy.py
@@ -0,0 +1,36 @@
import os
import argparse
import pandas as pd
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('-d', '--data_path', type=str, default='./data.csv')
parser.add_argument('-o', '--output_path', type=str, default='.')
args = parser.parse_args()

data = pd.read_csv(args.data_path)

# Class 1 (seizure activity) is treated as the anomalous class and relabelled 0;
# classes 2-5 (non-seizure recordings) are treated as normal and relabelled 1.
data['y'] = data['y'].replace(1, 0)
data['y'] = data['y'].replace([2, 3, 4, 5], 1)

# All anomalies go to the test set, together with 2300 shuffled normal samples.
test = data[data['y'] == 0]
normal = data[data['y'] == 1].sample(frac=1).reset_index(drop=True)

test = pd.concat([test, normal.iloc[:2300]])

# The remaining normal samples form the training set.
normal = normal.iloc[2300:]

# 'Unnamed: 0' is the sample-id column picked up from the CSV index.
normal = normal.drop(['y', 'Unnamed: 0'], axis=1)
np.save(os.path.join(args.output_path, 'train.npy'), normal.values)

test = test.drop('Unnamed: 0', axis=1)
test = test.sample(frac=1).reset_index(drop=True)

labels = test['y'].values

test = test.drop('y', axis=1).values
np.save(os.path.join(args.output_path, 'test_data.npy'), test)
np.save(os.path.join(args.output_path, 'test_labels.npy'), labels)

18 changes: 18 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_kws.py
@@ -0,0 +1,18 @@
import numpy as np

# Keyword-spotting features: the keyword class is treated as normal, everything else as anomalous.
train_f = np.load('train_seven.npz')['features']    # containing only the class marvin
others_f = np.load('other_seven.npz')['features']   # containing classes other than marvin

np.random.shuffle(train_f)
np.random.shuffle(others_f)

# 80/20 split of the normal class into train and test (indices must be integers).
len_train = int(0.8 * len(train_f))
len_test = len(train_f) - len_train

data = train_f[:len_train]
np.save('train.npy', data)

# Test set: the held-out normal samples (labelled 1) followed by an equal number
# of anomalous samples (labelled 0); others_f has already been shuffled above.
test_data = np.concatenate((train_f[len_train:], others_f[:len_test]), axis=0)
labels = np.array([1] * len_test + [0] * len_test)
np.save('test_data.npy', test_data)
np.save('test_labels.npy', labels)
37 changes: 37 additions & 0 deletions examples/pytorch/DROCC/data_process_scripts/process_odds.py
@@ -0,0 +1,37 @@
import os
import numpy as np
from scipy.io import loadmat
import argparse

parser = argparse.ArgumentParser(description='Preprocess Dataset from ODDS Repository')
parser.add_argument('-d', '--data_path', type=str, default='./arrhythmia.mat')
parser.add_argument('-o', '--output_path', type=str, default='.')
args = parser.parse_args()

dataset = loadmat(args.data_path)

data = np.concatenate((dataset['X'], dataset['y']), axis=1)

# In the ODDS .mat files, y = 1 marks the outliers; all of them go to the test set.
test = data[data[:,-1] == 1]
# An equal number of normal samples will also be placed in the test set.
num_normal_samples_test = test.shape[0]

normal = data[data[:,-1] == 0]
np.random.shuffle(normal)

test = np.concatenate((test, normal[:num_normal_samples_test]), axis=0)

train = normal[num_normal_samples_test:]
train_data = train[:,:-1]
# DROCC requires normal data to be labelled 1
train_labels = np.ones(train_data.shape[0])

test_data = test[:,:-1]
# DROCC requires normal data to be labelled 1 and anomalies 0
test_labels = np.concatenate((
np.zeros(num_normal_samples_test), np.ones(num_normal_samples_test)),
axis=0)

np.save(os.path.join(args.output_path,'train_data.npy'), train_data)
np.save(os.path.join(args.output_path,'train_labels.npy'), train_labels)
np.save(os.path.join(args.output_path,'test_data.npy'), test_data)
np.save(os.path.join(args.output_path,'test_labels.npy'), test_labels)