## Process original datasets to generate static train/test files
The original datasets are available from the [DAMI outlier-evaluation repository](https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/). This notebook preprocesses them into static train/test files that are stored on disk for use in the later experiments.

### Datasets
Download the original version of each dataset (unnormalized, without duplicates):

| Name | Description | Total Instances | Outliers | Attributes |
|:--- |:--- | --- | --- | --- |
| ALOI | This dataset is a collection of images used for outlier detection in different representations. | 50000 | 1508 | 27 |
| Annthyroid | This data set contains medical data on hypothyroidism. | 7200 | 534 | 21 |
| Arrhythmia | Patient records classified as normal or as exhibiting some type of cardiac arrhythmia. | 450 | 206 | 259 |
| Cardiotocography | Data set related to heart diseases. | 2126 | 471 | 21 |
| KDDCup99 | This dataset captures different types of network intrusions or attacks. | 60632 | 246 | 38+3 |
| SpamBase | A data set representing emails classified as spam (outliers) or nonspam. | 4601 | 1813 | 57 |
| Waveform | This dataset represents 3 classes of waves. | 3443 | 100 | 21 |

### Steps
For each dataset:
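  - Load the dataset and subsample it to at most `max_samples` points (for KDDCup99, all outliers are kept and only the normal points are subsampled)
  - Create `num_iters` shuffled, stratified train/test splits (75% train, 25% test)
  - Standardize each split: fit the scaler on the training set and apply it to the test set
  - Save the train/test sets to disk as CSV files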
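
The standardization step can be illustrated with a minimal NumPy sketch. The toy arrays below are made up for illustration and are not taken from any of the datasets. The point is that the mean and standard deviation are computed on the training set only and then reused for the test set, which is what `StandardScaler.fit_transform` followed by `transform` does (it also uses the population standard deviation, `ddof=0`).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (made-up values): 8 training rows and 4 test rows, 2 features each
X_train = rng.normal(loc=5.0, scale=2.0, size=(8, 2))
X_test = rng.normal(loc=5.0, scale=2.0, size=(4, 2))

# Statistics are estimated from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0)

X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std  # reuse the training statistics

print('Train mean per feature:', X_train_norm.mean(axis=0))
print('Train std per feature:', X_train_norm.std(axis=0))
print('Test mean per feature:', X_test_norm.mean(axis=0))
```

The standardized training set has zero mean and unit variance per feature, while the test set generally does not, because no test-set statistics are used.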


In [None]:
# Imports
import os
import sys
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from notebook_utils import import_dataset

In [None]:
# Function that prints the normal/outlier counts given a list of 0/1 labels
def print_dist(y):
    n_outliers = np.sum(y) # assuming 0/1 labels
    n_normal = len(y) - n_outliers
    assert(n_outliers + n_normal == len(y))
    print('\tTotal:', len(y))
    print('\tNormal:', n_normal)
    print('\tOutliers:', n_outliers)

In [None]:
# Processing parameters
data_dir = '../data' # root directory of the data
max_samples = 5000 # max number of points per dataset
num_iters = 10 # number of train/test sets to create
# Filenames of the original datasets
dataset_list = [
    'ALOI_withoutdupl.arff',
    'Annthyroid_withoutdupl_07.arff',
    'Arrhythmia_withoutdupl_46.arff',
    'Cardiotocography_withoutdupl_22.arff',
    'KDDCup99_withoutdupl_catremoved.arff',
    'SpamBase_withoutdupl_40.arff',
    'Waveform_withoutdupl_v10.arff'
]

In [None]:
# Loop over dataset list
for dataset in dataset_list:
    d_name = dataset.split('_')[0]
    print('Processing', d_name)
    d_dir = data_dir + '/original/' + dataset
    df = import_dataset(d_dir)
    # Subsample if too large (fixed seed so reruns give the same subsample)
    if d_name == 'KDDCup99':
        # Keep every outlier and subsample only the normal points
        df_1 = df[df['outlier'] == 1] # outliers
        df_0 = df[df['outlier'] == 0] # normals
        n_samples = max_samples - df_1.shape[0]
        df_0_sample = df_0.sample(n=n_samples, random_state=0)
        df = pd.concat([df_0_sample, df_1])
    elif df.shape[0] > max_samples:
        df = df.sample(n=max_samples, random_state=0)
    # Extract X, y
    X = df.iloc[:, :-1]
    y = df['outlier']
    # Loop over iters
    for i in range(1, num_iters+1):
        # Split to train/test
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, stratify=y, shuffle=True, random_state=i)
        # Standardize
        scaler = StandardScaler()
        X_train_norm = scaler.fit_transform(X_train)
        X_test_norm = scaler.transform(X_test)
        # Convert to DataFrames
        X_train_norm_df = pd.DataFrame(X_train_norm, columns=X_train.columns)
        X_test_norm_df = pd.DataFrame(X_test_norm, columns=X_test.columns)
        # Save to disk, creating the parent directory if it does not exist
        parent_dir = '{}/processed/{}/iter{}'.format(data_dir, d_name, i)
        os.makedirs(parent_dir, exist_ok=True)
        X_train_norm_df.to_csv(os.path.join(parent_dir, 'X_train.csv'), index=False)
        y_train.to_csv(os.path.join(parent_dir, 'y_train.csv'), index=False)
        X_test_norm_df.to_csv(os.path.join(parent_dir, 'X_test.csv'), index=False)
        y_test.to_csv(os.path.join(parent_dir, 'y_test.csv'), index=False)
    # Print the label distribution of the last iteration's training set
    print('Training set:')
    print_dist(y_train)
    print('Files saved to disk\n')
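
A later experiment could read one of the saved splits back with a small helper along the following lines. This is only a sketch: `load_split` is a hypothetical name that does not exist in `notebook_utils`, and it assumes the label files were written with a header row (the default in recent pandas versions). So that the example runs on its own, it writes a made-up toy split to a temporary directory that mimics the `processed/<name>/iter<i>` layout used above.

```python
import os
import tempfile

import pandas as pd


def load_split(data_dir, d_name, i):
    """Load the i-th train/test split of a dataset (hypothetical helper)."""
    parent_dir = os.path.join(data_dir, 'processed', d_name, 'iter{}'.format(i))
    X_train = pd.read_csv(os.path.join(parent_dir, 'X_train.csv'))
    X_test = pd.read_csv(os.path.join(parent_dir, 'X_test.csv'))
    # Label files hold a single column; return it as a Series
    y_train = pd.read_csv(os.path.join(parent_dir, 'y_train.csv')).iloc[:, 0]
    y_test = pd.read_csv(os.path.join(parent_dir, 'y_test.csv')).iloc[:, 0]
    return X_train, X_test, y_train, y_test


# Demo on made-up data written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = os.path.join(tmp_dir, 'processed', 'Toy', 'iter1')
    os.makedirs(demo_dir)
    X_demo = pd.DataFrame({'att1': [0.1, -0.2, 0.3], 'att2': [1.0, 0.0, -1.0]})
    y_demo = pd.Series([0, 1, 0], name='outlier')
    for split in ('train', 'test'):
        X_demo.to_csv(os.path.join(demo_dir, 'X_{}.csv'.format(split)), index=False)
        y_demo.to_csv(os.path.join(demo_dir, 'y_{}.csv'.format(split)),
                      index=False, header=True)
    X_train, X_test, y_train, y_test = load_split(tmp_dir, 'Toy', 1)

print(X_train.shape, y_train.tolist())
```

With the real files, the call would look like `load_split('../data', 'ALOI', 1)`, using the same `data_dir` and dataset names as in the processing loop.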