## Dataset
The dataset is the KDD Cup 1999 data: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

The task is to build a network intrusion detector: a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections.

### EDA Conclusion:
1. We found one column that has no impact on the class label.
2. We found a few correlated columns, which can help in reducing the feature set.
3. We found that it might be more suitable to first detect normal vs anomalous (1/0) and then predict the type of anomaly.
4. We found no single feature or small subset of features that can make the prediction alone. However, many features provide some distinction between class labels, so using all the available features together should give a good prediction.
5. If needed, we can create new features by combining existing ones, or try neural networks, which can learn such combinations automatically.
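
The first two findings can be checked mechanically. The sketch below is a minimal standard-library illustration on made-up toy columns: the column names echo the dataset, but the values and the 0.95 threshold are assumptions, not results from the EDA. It flags columns with a single distinct value and column pairs whose absolute Pearson correlation exceeds the threshold.

```python
import math
from itertools import combinations

# Toy columns; the values are invented for illustration only.
toy = {
    "num_outbound_cmds": [0, 0, 0, 0, 0],   # constant column
    "num_root": [0, 1, 2, 3, 4],
    "num_compromised": [0, 2, 4, 6, 9],     # nearly proportional to num_root
    "duration": [5, 1, 4, 2, 3],
}

def pearson(xs, ys):
    """Pearson correlation; returns None if either column is constant."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# A column with one distinct value cannot help separate class labels.
constant_columns = sorted(c for c, v in toy.items() if len(set(v)) == 1)

threshold = 0.95  # assumed cut-off for "highly correlated"
correlated_pairs = []
for a, b in combinations(sorted(toy), 2):
    r = pearson(toy[a], toy[b])
    if r is not None and abs(r) > threshold:
        correlated_pairs.append((a, b))

print(constant_columns)
print(correlated_pairs)
```

For each highly correlated pair, one of the two columns is a candidate for removal.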

### Next Steps:
To check how accurate our data analysis has been, we can do the following:
1. Convert categorical features into continuous or binary (1/0) features.
2. Rescale all the features to between 0 and 1.
3. Using stratified sampling, create three sets: (1) test, (2) train, (3) cross validation.
4. Implement a dummy classifier and note the results (see http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html).
5. Select a machine learning algorithm to apply to the dataset.
6. Apply the selected algorithm using all features to get its baseline accuracy. Note the train and cross validation errors.
7. Apply the selected algorithm using the subset of features identified in EDA. Note the train and cross validation errors.
8. Use the two-step approach, i.e. first normal vs anomalous and then predict the type of anomaly, using all features. Note the train and cross validation errors.
9. Repeat the previous step using the subset of features identified in EDA. Note the train and cross validation errors.
10. Repeat steps 6-9 using a neural network.
11. Apply the best method to the test set to get the final accuracy.

[High bias vs high variance] At every step it is important to note two things: (1) the training error and (2) the cross validation error. Comparing the two helps determine whether we are in a high bias or a high variance situation.

Good read: https://github.com/ianozsvald/data_science_delivered/blob/master/README.md
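
The train vs cross validation comparison can be written down as a rough rule of thumb. This is only a sketch: the function name `diagnose` and both thresholds are illustrative assumptions, and suitable values depend on the problem.

```python
def diagnose(train_error, cv_error, acceptable_error=0.05, gap_tolerance=0.02):
    """Rough bias/variance heuristic; both thresholds are illustrative assumptions."""
    gap = cv_error - train_error
    if gap > gap_tolerance:
        # Training error is much lower than cross validation error: overfitting.
        return "high variance"
    if train_error > acceptable_error:
        # Both errors are high and close together: underfitting.
        return "high bias"
    return "balanced"

print(diagnose(0.20, 0.21))
print(diagnose(0.01, 0.10))
print(diagnose(0.01, 0.02))
```

Errors here are fractions of misclassified records, i.e. 1 minus accuracy.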

# Step 1: Prepare Datasets

In [72]:
# disable autosave; it sometimes hangs the browser
%autosave 0
import time

import numpy
import pandas as pd
import sklearn
from sklearn import metrics
from sklearn import preprocessing
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
# train_test_split lives in sklearn.model_selection since scikit-learn 0.18
# (the older sklearn.cross_validation module was removed in 0.20)
from sklearn.model_selection import train_test_split
from sklearn_pandas import DataFrameMapper

# suppress exponential notation when pandas prints floats
pd.options.display.float_format = '{:20,.2f}'.format

# avoid truncating wide or long data frames when printing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Autosave disabled


## Helper Functions

### Function to return true if array contains binary (zero and one) values only

In [73]:
def is_only_zero_and_one(array):
    return len(array) == 2 and ((array[0] == 0 and array[1] == 1) or ((array[0] == 1 and array[1] == 0)))
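
The helper above expects the output of `.unique()` and returns True only when both 0 and 1 are present. As a hedged alternative (not the notebook's helper), a set-based check accepts any sequence of values and also treats a constant 0 or constant 1 column as binary. The name `is_binary_column` is hypothetical.

```python
def is_binary_column(values):
    """Return True if the column is non-empty and holds only 0s and/or 1s."""
    distinct = set(values)
    return len(distinct) > 0 and distinct <= {0, 1}

print(is_binary_column([0, 1, 1, 0]))  # both values present
print(is_binary_column([0, 0, 0]))     # constant column still counts as binary here
print(is_binary_column([0, 1, 2]))     # a third value makes it non-binary
```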

### Function to convert categorical features into binary
#### In future, use pandas.get_dummies instead: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
#### This conversion is needed because scikit-learn's random forest cannot handle categorical features directly: https://stackoverflow.com/questions/24715230/can-sklearn-random-forest-directly-handle-categorical-features

In [74]:
# does not modify the original source
def convert_categorical_to_binary(data, categorical_columns):
    
    temp_data = data.copy()

    label_binarizer = []
    for col in categorical_columns:
        label_binarizer.append((col, sklearn.preprocessing.LabelBinarizer()))
        
    # df_out=True: output a data frame
    mapper_df = DataFrameMapper(label_binarizer, df_out=True)    
    # temp contains the new columns
    temp = mapper_df.fit_transform(temp_data)
    
    # print temp[temp.isnull().any(axis=1)]
    
    for col in temp.columns:
        temp_data[col] = numpy.array(temp[col])
    
    # expected count: original columns plus one new column per distinct categorical value
    total_column_count = len(data.columns)
    for col in categorical_columns:
        total_column_count += len(data[col].unique())
        
    print 'new column count should be ' + str(total_column_count) + ' and is ' + str(len(temp_data.columns))
    return temp_data
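
To make the conversion concrete, here is a minimal pure-Python sketch of what one-hot (binary) encoding does to a single categorical column. The helper `one_hot` and the `prefix_value` naming scheme are assumptions for illustration; the notebook itself relies on `LabelBinarizer` through `DataFrameMapper`.

```python
def one_hot(values, prefix):
    """Return ({new_column_name: [0/1, ...]}, ordered list of new column names)."""
    categories = sorted(set(values))
    names = ["{}_{}".format(prefix, c) for c in categories]
    columns = {}
    for name, category in zip(names, categories):
        # 1 where the row holds this category, 0 elsewhere
        columns[name] = [1 if v == category else 0 for v in values]
    return columns, names

protocol_type = ["tcp", "udp", "icmp", "tcp"]  # toy values
encoded, names = one_hot(protocol_type, "protocol_type")

print(names)
for name in names:
    print(name + ": " + str(encoded[name]))
```

Each row has exactly one 1 across the new columns, which is why the original categorical column can be dropped afterwards.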

### Function to rescale all non-binary features between 0 and 1

In [75]:
# does not modify the original source
# categorical_columns are skipped
# if a column only has binary (0/1) values, it is skipped too
def rescale_non_binary_columns(data, categorical_columns):
    
    temp_data = data.copy()
    scaler = preprocessing.MinMaxScaler()
    for col in data.columns:
        if col not in categorical_columns and not is_only_zero_and_one(data[col].unique()):
            # print 'scaling ' + col
            temp_data[col] = scaler.fit_transform(temp_data[[col]])
            
    return temp_data
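
Min-max scaling maps each value v to (v - min) / (max - min). The notebook rescales the full dataset before splitting it; a commonly used alternative is to learn the minimum and maximum on the training set only and reuse them for the other sets. The sketch below illustrates that alternative on toy numbers; the helper names are hypothetical and this is not the notebook's method.

```python
def fit_min_max(values):
    """Learn the scaling range from the training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with a previously learned range; a constant column maps to 0.0."""
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]

train_duration = [0, 10, 40, 50]   # toy training values
cv_duration = [5, 25, 60]          # 60 exceeds the training maximum

lo, hi = fit_min_max(train_duration)
train_scaled = apply_min_max(train_duration, lo, hi)
cv_scaled = apply_min_max(cv_duration, lo, hi)

print(train_scaled)
print(cv_scaled)
```

With this approach, values outside the training range land outside [0, 1], as the last cross validation value does here.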

### Function to print confusion matrix

In [76]:
def print_confusion_matrix(confusion_matrix, labels):
    records = len(labels)
    for row in range(records):
        print "-------------" + labels[row] + "-------------"
        total = 0
        for column in range(records):
            total += confusion_matrix[row][column]
        print 'total: ' + str(total)
        print 'correct: ' + str(confusion_matrix[row][row])
        for column in range(records):
            if confusion_matrix[row][column] != 0 and row != column:
                print labels[column] + ': ' + str(confusion_matrix[row][column])

### Function to print summary statistics

In [77]:
def print_summary_statistics(confusion_matrix, normal_class_index):
    class_label_count = len(confusion_matrix)
    total_records = 0
    total_normal = 0
    total_anomalous = 0
    total_normal_correctly_identified = 0
    total_anomalous_correctly_identified = 0
    
    for row in range(class_label_count):
        for col in range(class_label_count):            
            total_records += confusion_matrix[row][col]            
            if row == normal_class_index:
                total_normal += confusion_matrix[row][col]
                if col == normal_class_index:
                    total_normal_correctly_identified = confusion_matrix[row][col]
            else:
                total_anomalous += confusion_matrix[row][col]
                if row == col:
                    total_anomalous_correctly_identified += confusion_matrix[row][col]
     
    # * by 1.0 to make denominator float
    #  If the numerator or denominator is a float, then the result will be also.
    total_correctly_identified = total_normal_correctly_identified + total_anomalous_correctly_identified
    correct_normal_percentage = total_normal_correctly_identified * 100/(1.0 * total_normal)
    correct_anomalous_percentage = total_anomalous_correctly_identified * 100/(1.0 * total_anomalous)
    correct_total_percentage = total_correctly_identified * 100/(1.0 * total_records)
    print 'total: ' + str(total_records)
    print 'normal: ' + str(total_normal)
    print 'anomalous: ' + str(total_anomalous)
    
    print 'total correctly identified: ' + str(total_correctly_identified) + '(' + str(correct_total_percentage) + '%)'
    print 'normal correctly identified: ' + str(total_normal_correctly_identified) + '(' + str(correct_normal_percentage) + '%)'
    print 'anomalous correctly identified: ' + str(total_anomalous_correctly_identified) + '(' + str(correct_anomalous_percentage) + '%)'
            

### Function to print F scores

In [78]:
def print_f_scores(actual_labels, predictions, unique_labels):
    #http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
    # Calculate metrics globally by counting the total true positives, false negatives and false positives.
    print 'micro: ' + str(metrics.f1_score(actual_labels, predictions, 
                                           labels=unique_labels, average='micro'))
    # Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    print 'macro: '+ str(metrics.f1_score(actual_labels, predictions, 
                                          labels=unique_labels, average='macro'))
    # Calculate metrics for each label, and find their average, weighted by support (the number of true instances 
    # for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that
    # is not between precision and recall.
    print 'weighted: ' + str(metrics.f1_score(actual_labels, predictions, 
                                              labels=unique_labels, average='weighted'))
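
The difference between the micro and macro averages matters on an imbalanced dataset like this one. The sketch below recomputes both by hand on toy labels (invented data, hypothetical helper names) to show how a single rare class can pull the macro score well below the micro score.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0

def f1_micro_macro(actual, predicted, labels):
    per_label = []
    tp_all = fp_all = fn_all = 0
    for label in labels:
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        per_label.append(f1(tp, fp, fn))
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro = f1(tp_all, fp_all, fn_all)       # pooled counts: dominated by large classes
    macro = sum(per_label) / len(per_label)  # unweighted mean: every class counts equally
    return micro, macro

# 98 normal records, all correct; 2 rare attack records, one missed.
actual = ["normal."] * 98 + ["imap."] * 2
predicted = ["normal."] * 98 + ["normal.", "imap."]

micro, macro = f1_micro_macro(actual, predicted, ["normal.", "imap."])
print("micro: " + str(micro))
print("macro: " + str(macro))
```

One missed record out of 100 barely moves the micro score, but it costs the rare class a third of its F1 and therefore lowers the macro score noticeably.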

## Loading the dataset

In [79]:
data = pd.read_csv("/Users/haris/Desktop/kdd/kdd_full.csv")
print "csv loaded"

csv loaded


## Analyzing metadata

### Remove duplicates

In [None]:
print 'rows and columns: ' + str(data.shape)
# remove duplicate rows
data = data.drop_duplicates()
print 'rows and columns after removing duplicates:' + str(data.shape)
print 'number of rows with null values: ' + str(len(data[data.isnull().any(axis=1)]))

In [None]:
categorical_columns_without_label = ['service', 'flag', 'protocol_type']
categorical_columns_with_label = ['service', 'flag', 'protocol_type', 'label']

### Convert categorical features into binary

In [None]:
data = convert_categorical_to_binary(data, categorical_columns_without_label)

### Remove categorical columns

In [None]:
# axis=1 drops columns (axis=0 would drop rows)
data = data.drop(categorical_columns_without_label, axis=1)

### Rescale numeric features between 0 and 1 

In [None]:
data = rescale_non_binary_columns(data, categorical_columns_with_label)

### Separate features and class label

In [None]:
y = data['label']
X = data.drop('label', axis=1)

### Create Train, Cross Validation and Test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify = y)
X_train, X_cross_validation, y_train, y_cross_validation = train_test_split(
    X_train, y_train, test_size=0.25, random_state=31, stratify = y_train)
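
The two calls above give a 60/20/20 split: the first holds out 20% for test, and the second holds out 25% of the remaining 80%, which is again 20% of the whole. The pure-Python sketch below illustrates the idea of stratified sampling on toy labels; `stratified_split` is a hypothetical helper and is not how scikit-learn implements `stratify`.

```python
import random
from collections import defaultdict

def stratified_split(labels, held_out_fraction, seed):
    """Return (rest_indices, held_out_indices), sampling the fraction within each class."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    rest, held_out = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = int(round(len(indices) * held_out_fraction))
        held_out.extend(indices[:cut])
        rest.extend(indices[cut:])
    return sorted(rest), sorted(held_out)

labels = ["normal."] * 80 + ["neptune."] * 20   # toy labels, 80/20 class ratio

# Step 1: hold out 20% as the test set.
rest_idx, test_idx = stratified_split(labels, 0.2, seed=42)
rest_labels = [labels[i] for i in rest_idx]

# Step 2: hold out 25% of the remainder as the cross validation set.
# The returned values are positions within rest_labels.
train_pos, cv_pos = stratified_split(rest_labels, 0.25, seed=31)

print(len(train_pos), len(cv_pos), len(test_idx))
```

Every set keeps the same class ratio as the full data, which is the point of stratifying.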

In [None]:
print 'Train size: ' + str(len(y_train))
print 'Test size: ' + str(len(y_test))
print 'Cross validation size: ' + str(len(y_cross_validation))

### Merge labels and save sets on disk

In [None]:
X_train['label'] = y_train
X_cross_validation['label'] = y_cross_validation
X_test['label'] = y_test

In [None]:
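# save into the directory that Step 2 reads from (the directory must already exist)
X_train.to_csv('/Users/haris/Desktop/kdd_datasets/train.csv', sep=',', index=False)
X_cross_validation.to_csv('/Users/haris/Desktop/kdd_datasets/cross_validation.csv', sep=',', index=False)
X_test.to_csv('/Users/haris/Desktop/kdd_datasets/test.csv', sep=',', index=False)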

# Step 2: Load train and cross validation set

In [80]:
train = pd.read_csv("/Users/haris/Desktop/kdd_datasets/train.csv")
cross_validation = pd.read_csv("/Users/haris/Desktop/kdd_datasets/cross_validation.csv")
print "csvs loaded"
print str(len(train)) + ' train rows'
print str(len(cross_validation)) + ' cross validation rows'

y_train = train['label']
X_train = train.drop('label', axis=1)
y_cross_validation = cross_validation['label']
X_cross_validation = cross_validation.drop('label', axis=1)
labels = y_train.unique()

csvs loaded
644994 train rows
214999 cross validation rows


# Step 3: Dummy Classifier

In [None]:
dummy_classfier = DummyClassifier(strategy='most_frequent', random_state=0)
# train the classifiers
dummy_classfier = dummy_classfier.fit(X_train, y_train)
# test the classifiers
dummy_classfier_predictions = dummy_classfier.predict(X_cross_validation)
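
The `most_frequent` strategy always predicts the majority training label, so its accuracy equals the share of that label in the evaluation set. The sketch below shows this on toy labels with a hypothetical helper, then applies the same arithmetic to the cross validation counts reported later in this notebook (162563 normal records out of 214999). It assumes 'normal.' is also the most frequent training label.

```python
from collections import Counter

def most_frequent_baseline(train_labels, eval_labels):
    """Predict the majority training label for every record and return (label, accuracy)."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in eval_labels if label == majority)
    return majority, correct / float(len(eval_labels))

toy_train = ["normal.", "normal.", "normal.", "smurf."]
toy_eval = ["normal.", "normal.", "neptune.", "smurf."]
majority, toy_accuracy = most_frequent_baseline(toy_train, toy_eval)
print(majority + ": " + str(toy_accuracy))

# Same arithmetic with the cross validation counts from the summary output
# (assumes 'normal.' is the majority class in the training set as well).
expected_cv_accuracy = 162563 / 214999.0
print("expected baseline accuracy: " + str(round(expected_cv_accuracy, 4)))
```

Any real model should beat this figure comfortably; otherwise it has learned little beyond the class distribution.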

### Create confusion matrix from classification results

In [None]:
confusion_matrix = metrics.confusion_matrix(y_cross_validation, dummy_classfier_predictions, labels=labels)
print confusion_matrix

### Print summary of classification results

In [None]:
print_summary_statistics(confusion_matrix, 0)

### Confusion matrix of dummy classifier

In [None]:
print_confusion_matrix(confusion_matrix, labels)

### Accuracy of dummy classifier

In [None]:
print_f_scores(y_cross_validation, dummy_classfier_predictions, labels)

# Step 4: Random Forest

In [81]:
random_state = 37

In [82]:
random_forest_classifier = RandomForestClassifier(n_estimators=10, random_state=random_state)
random_forest_classifier = random_forest_classifier.fit(X_train, y_train)
random_forest_predictions = random_forest_classifier.predict(X_cross_validation)

### Create confusion matrix from classification results

In [83]:
confusion_matrix = metrics.confusion_matrix(y_cross_validation, random_forest_predictions, labels=labels)

### Print result summary

In [84]:
print_summary_statistics(confusion_matrix, 0)

total: 214999
normal: 162563
anomalous: 52436
total correctly identified: 214904(99.955813748%)
normal correctly identified: 162548(99.9907728081%)
anomalous correctly identified: 52356(99.8474330613%)


### Print random forest confusion matrix

In [85]:
print_confusion_matrix(confusion_matrix, labels)

-------------normal.-------------
total: 162563
correct: 162548
ipsweep.: 6
satan.: 3
nmap.: 1
warezclient.: 2
land.: 3
-------------ipsweep.-------------
total: 745
correct: 731
normal.: 7
nmap.: 7
-------------neptune.-------------
total: 48430
correct: 48430
-------------satan.-------------
total: 1004
correct: 992
normal.: 11
portsweep.: 1
-------------smurf.-------------
total: 602
correct: 602
-------------portsweep.-------------
total: 713
correct: 708
normal.: 4
ipsweep.: 1
-------------nmap.-------------
total: 311
correct: 296
normal.: 8
ipsweep.: 7
-------------teardrop.-------------
total: 183
correct: 182
satan.: 1
-------------back.-------------
total: 194
correct: 194
-------------warezclient.-------------
total: 178
correct: 162
normal.: 16
-------------guess_passwd.-------------
total: 10
correct: 9
land.: 1
-------------imap.-------------
total: 2
correct: 1
normal.: 1
-------------pod.-------------
total: 41
correct: 40
normal.: 1
-------------buffer_overflow.-------

### Accuracy of random forest

In [86]:
print_f_scores(y_cross_validation, random_forest_predictions, labels)

micro: 0.99955813748
macro: 0.726997715881
weighted: 0.999541505154


### Remove the columns identified as redundant or useless in EDA

In [87]:
columns_to_remove = ['num_outbound_cmds', 'num_root']
X_train_copy = X_train.copy()
X_cross_validation_copy = X_cross_validation.copy()

print X_train_copy.shape
print X_cross_validation_copy.shape

for col in columns_to_remove:
    X_train_copy.drop(col, axis=1, inplace=True)
    X_cross_validation_copy.drop(col, axis=1, inplace=True)

print X_train_copy.shape
print X_cross_validation_copy.shape

(644994, 122)
(214999, 122)
(644994, 120)
(214999, 120)


In [88]:
random_forest_classifier = RandomForestClassifier(n_estimators=10, random_state=random_state)
random_forest_classifier = random_forest_classifier.fit(X_train_copy, y_train)
random_forest_predictions = random_forest_classifier.predict(X_cross_validation_copy)

In [89]:
confusion_matrix = metrics.confusion_matrix(y_cross_validation, random_forest_predictions, labels=labels)

In [90]:
print_summary_statistics(confusion_matrix, 0)

total: 214999
normal: 162563
anomalous: 52436
total correctly identified: 214885(99.9469764976%)
normal correctly identified: 162541(99.9864667852%)
anomalous correctly identified: 52344(99.8245480204%)


#### Results were the same when we removed the zero-variance feature [num_outbound_cmds], probably because random forest effectively ignores it: a constant feature can never produce a useful split
#### Results remained almost the same (15 fewer correct classifications) when we removed one highly correlated feature

### Trying different number of trees (10-90) on random forest

In [91]:
random_state = 37
print 'random state: ' + str(random_state)

for num in range(1, 10):
    print '----------------------------------------'
    start_time = time.time()
    n_estimators = num * 10
    print 'n_estimators: ' + str(n_estimators)
    random_forest_classifier = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    random_forest_classifier = random_forest_classifier.fit(X_train, y_train)
    random_forest_predictions = random_forest_classifier.predict(X_cross_validation)
    confusion_matrix = metrics.confusion_matrix(y_cross_validation, random_forest_predictions, labels=labels)
    print_summary_statistics(confusion_matrix, 0)
    total_time = time.time() - start_time
    print 'total time: ' + str(total_time) + ' seconds'

random state: 37
----------------------------------------
n_estimators: 10
total: 214999
normal: 162563
anomalous: 52436
total correctly identified: 214904(99.955813748%)
normal correctly identified: 162548(99.9907728081%)
anomalous correctly identified: 52356(99.8474330613%)
total time: 19.2350928783 seconds
----------------------------------------
n_estimators: 20
total: 214999
normal: 162563
anomalous: 52436
total correctly identified: 214894(99.9511625635%)
normal correctly identified: 162547(99.990157662%)
anomalous correctly identified: 52347(99.8302692806%)
total time: 37.6239840984 seconds
----------------------------------------
n_estimators: 30
total: 214999
normal: 162563
anomalous: 52436
total correctly identified: 214897(99.9525579189%)
normal correctly identified: 162545(99.9889273697%)
anomalous correctly identified: 52352(99.8398047143%)
total time: 54.3900029659 seconds
----------------------------------------
n_estimators: 40
total: 214999
normal: 162563
anomalous: 52

KeyboardInterrupt: 

### Conclusion: in this case, increasing the number of trees did not improve the number of correct predictions (runs with 10, 20 and 30 trees completed before the loop was interrupted)

### Separating anomalous and normal data

In [92]:
train_copy = train.copy()
cross_validation_copy = cross_validation.copy()

train_copy.loc[train_copy['label'] != 'normal.', 'label'] = 'anomalous'
cross_validation_copy.loc[cross_validation_copy['label'] != 'normal.', 'label'] = 'anomalous'

y_train_copy = train_copy['label']
X_train_copy = train_copy.drop('label', axis=1)
y_cross_validation_copy = cross_validation_copy['label']
X_cross_validation_copy = cross_validation_copy.drop('label', axis=1)

labels = y_train_copy.unique()
# str() is needed: 'labels: ' + labels would prepend the prefix to every element of the array
print 'labels: ' + str(labels)

labels: ['normal.' 'anomalous']


In [93]:
random_forest_classifier = RandomForestClassifier(n_estimators=10, random_state=random_state)
random_forest_classifier = random_forest_classifier.fit(X_train_copy, y_train_copy)
random_forest_predictions = random_forest_classifier.predict(X_cross_validation_copy)

confusion_matrix = metrics.confusion_matrix(y_cross_validation_copy, random_forest_predictions, labels=labels)
print_summary_statistics(confusion_matrix, 0)

total: 214999
normal: 162563
anomalous: 52436
total correctly identified: 214925(99.9655812353%)
normal correctly identified: 162539(99.9852364929%)
anomalous correctly identified: 52386(99.9046456633%)


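### Conclusion: by first doing anomalous vs normal classification, correct predictions rose by 21 overall (214904 to 214925) and anomalous records correctly flagged rose by 30 (52356 to 52386), while correctly identified normal records fell by 9 (162548 to 162539). Note that the binary figure only requires flagging a record as anomalous, not naming its attack type.

#### We also tried this approach after removing the columns identified as useless/redundant in EDA, but found no benefit.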

In [94]:
print 'random state: ' + str(random_state)

for num in range(1, 10):
    print '----------------------------------------'
    start_time = time.time()
    n_estimators = num * 10
    print 'n_estimators: ' + str(n_estimators)
    random_forest_classifier = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    random_forest_classifier = random_forest_classifier.fit(X_train_copy, y_train_copy)
    random_forest_predictions = random_forest_classifier.predict(X_cross_validation_copy)
    confusion_matrix = metrics.confusion_matrix(y_cross_validation_copy, random_forest_predictions, labels=labels)
    print_summary_statistics(confusion_matrix, 0)
    total_time = time.time() - start_time
    print 'total time: ' + str(total_time) + ' seconds'

SyntaxError: invalid syntax (<ipython-input-94-12e885848aae>, line 13)

### Conclusion: even for binary classification (normal vs anomalous), increasing the number of trees didn't help

### Building the complete two-step classifier using Random Forest

#### Make two sets: (1) normal vs anomalous (2) all anomalous
#### First detect normal vs anomalous (1/0); if anomalous, use the second classifier to predict the type of anomaly

In [95]:
# get copy of train and cross validation data
train_data = train.copy()
cross_validation_data = cross_validation.copy()

# create a new column binary_label and set it to 'normal.' or 'anomalous'
train_data['binary_label'] = 'normal.'
cross_validation_data['binary_label'] = 'normal.'
train_data.loc[train_data['label'] != 'normal.', 'binary_label'] = 'anomalous'
cross_validation_data.loc[cross_validation_data['label'] != 'normal.', 'binary_label'] = 'anomalous'

# separate X, y_label and y_binary_label for train data
y_train_label = train_data['label']
y_train_binary_label = train_data['binary_label']
X_train_data = train_data.drop(['label', 'binary_label'], axis=1)

# keep only the anomalous rows (and their labels) to train the attack-type classifier
is_anomalous_train = y_train_label != 'normal.'
X_train_anomalous = X_train_data[is_anomalous_train]
y_train_anomalous = y_train_label[is_anomalous_train]

### Train two classifiers (1) binary (2) anomalous and make predictions

In [96]:
# train binary classifier
random_forest_binary_classifier = RandomForestClassifier(n_estimators=10, random_state=37)
random_forest_binary_classifier = random_forest_binary_classifier.fit(X_train_data, y_train_binary_label)
# train anomalous classifier
random_forest_anomalous_classifier = RandomForestClassifier(n_estimators=10, random_state=37)
random_forest_anomalous_classifier = random_forest_anomalous_classifier.fit(X_train_anomalous, y_train_anomalous)

### Prepare cross validation data

In [97]:
# separate X, y_label and y_binary_label for cross validation data
y_cross_validation_label = cross_validation_data['label']
y_cross_validation_binary_label = cross_validation_data['binary_label']
X_cross_validation_data = cross_validation_data.drop(['label', 'binary_label'], axis=1)

# keep only the anomalous rows (and their labels) of the cross validation data;
# this helps us determine the accuracy of the anomalous-only classifier
is_anomalous_cross_validation = y_cross_validation_label != 'normal.'
X_cross_validation_anomalous = X_cross_validation_data[is_anomalous_cross_validation]
y_cross_vadliation_anomalous = y_cross_validation_label[is_anomalous_cross_validation]

#### Binary Classifier Accuracy

In [98]:
binary_unique_labels = y_train_binary_label.unique()

random_forest_binary_predictions = random_forest_binary_classifier.predict(X_cross_validation_data)
confusion_matrix_binary_classifier = metrics.confusion_matrix(y_cross_validation_binary_label, 
                                                              random_forest_binary_predictions,
                                                              labels=binary_unique_labels)

print_f_scores(y_cross_validation_binary_label, random_forest_binary_predictions, binary_unique_labels)

micro: 0.999655812353
macro: 0.999533308529
weighted: 0.999655783437


#### Anomalous Classifier Accuracy

In [99]:
anomalous_labels_unique = y_train_anomalous.unique()

random_forest_anomalous_predictions = random_forest_anomalous_classifier.predict(X_cross_validation_anomalous)
confusion_matrix_anomalous_classifier = metrics.confusion_matrix(y_cross_vadliation_anomalous, 
                                                              random_forest_anomalous_predictions,
                                                              labels=anomalous_labels_unique)

print_f_scores(y_cross_vadliation_anomalous, random_forest_anomalous_predictions, anomalous_labels_unique)

micro: 0.999389732245
macro: 0.750320468047
weighted: 0.999352475034


#### Combined accuracy

In [100]:
# we have all the binary predictions in the 'random_forest_binary_predictions'
# find index of all anomalous predictions within 'random_forest_binary_predictions'
# because we will use anomalous classifier to predict exact type of anomaly
X_cross_validation_of_anomalous_predictions = X_cross_validation.iloc[numpy.where(random_forest_binary_predictions
                                                                               != 'normal.')]
#y_cross_validation_of_anomalous_predictions = y_cross_validation_label.iloc[numpy.where(random_forest_binary_predictions
#                                                                               != 'normal.')]

# make the anomalous predictions
cross_validation_anomalous_predictions = random_forest_anomalous_classifier.predict(
    X_cross_validation_of_anomalous_predictions)

# merge the anomalous predictions back in 'random_forest_binary_predictions' to 
# get final accuracy
anomalous_predictions_index = 0
for index in range(0, len(random_forest_binary_predictions)):
    if random_forest_binary_predictions[index] != 'normal.':
        random_forest_binary_predictions[index] = cross_validation_anomalous_predictions[anomalous_predictions_index]
        anomalous_predictions_index += 1

all_unique_labels = pd.Series.unique(y_train_label)
# random_forest_binary_predictions now contains all predictions
confusion_matrix_complete = metrics.confusion_matrix(y_cross_validation_label, random_forest_binary_predictions, 
                                            labels=all_unique_labels)

# print the final score of cross validation data
print_f_scores(y_cross_validation_label, random_forest_binary_predictions, all_unique_labels)
print_summary_statistics(confusion_matrix_complete, 0)

micro: 0.999548835111
macro: 0.662630371032
weighted: 0.99953377716
total: 214999
normal: 162563
anomalous: 52436
total correctly identified: 214902(99.9548835111%)
normal correctly identified: 162539(99.9852364929%)
anomalous correctly identified: 52363(99.8607826684%)
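
The merge step relies on the attack-type predictions being in the same order as the records that the binary stage flagged. A minimal sketch of that merge on plain lists follows; the labels are toy values and `combine_two_stage` is a hypothetical helper, not part of the notebook.

```python
def combine_two_stage(binary_predictions, attack_type_predictions, normal_label="normal."):
    """Replace each non-normal binary prediction with the next attack-type prediction, in order."""
    flagged = sum(1 for p in binary_predictions if p != normal_label)
    if flagged != len(attack_type_predictions):
        raise ValueError("need exactly one attack-type prediction per flagged record")
    attack_iter = iter(attack_type_predictions)
    # Builds a new list, so the binary predictions are left untouched.
    return [p if p == normal_label else next(attack_iter) for p in binary_predictions]

binary = ["normal.", "anomalous", "normal.", "anomalous"]
attack_types = ["smurf.", "neptune."]   # second-stage output for the two flagged records

combined = combine_two_stage(binary, attack_types)
print(combined)
```

Records that the binary stage wrongly passes as normal never reach the second classifier, so the combined result cannot be better than the binary stage at separating normal from anomalous.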


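# Conclusion on random forest: the two-step (binary, then attack type) method is slightly better at identifying anomalous records (52363 correct vs 52356 for the single multi-class model, 7 more), but it misclassifies 9 more normal records (162539 correct vs 162548), so overall it gets 2 fewer records right (214902 vs 214904).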

## Accuracy of final random forest on test set. 

In [101]:
test = pd.read_csv("/Users/haris/Desktop/kdd_datasets/test.csv")
print "csv loaded"
print str(len(test)) + ' test rows'

test['binary_label'] = 'normal.'
test.loc[test['label'] != 'normal.', 'binary_label'] = 'anomalous'

# separate X, y_label and y_binary_label for test data
y_test_label = test['label']
y_test_binary_label = test['binary_label']
X_test = test.drop(['label', 'binary_label'], axis=1)
random_forest_test_binary_predictions = random_forest_binary_classifier.predict(X_test)

# find index of all anomalous predictions
X_test_of_anomalous_predictions = X_test.iloc[numpy.where(random_forest_test_binary_predictions
                                                                               != 'normal.')]
y_test_of_anomalous_predictions = y_test_label.iloc[numpy.where(random_forest_test_binary_predictions
                                                                               != 'normal.')]

test_anomalous_predictions = random_forest_anomalous_classifier.predict(
    X_test_of_anomalous_predictions)

test_anomalous_predictions_index = 0
for index in range(0, len(random_forest_test_binary_predictions)):
    if random_forest_test_binary_predictions[index] != 'normal.':
        random_forest_test_binary_predictions[index] = test_anomalous_predictions[test_anomalous_predictions_index]
        test_anomalous_predictions_index += 1

print_f_scores(y_test_label, random_forest_test_binary_predictions, all_unique_labels)

csv loaded
644994 test rows
micro: 0.999581393402
macro: 0.625597103083
weighted: 0.999560347679
