## Dataset
The dataset is taken from the KDD Cup 1999 archive: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

The task is to build a network intrusion detector: a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections.

## Step 1: setting it up

In [None]:
#disable auto save, this sometimes hangs the browser
%autosave 0
# need this to properly plot graphs using matplotlib
%matplotlib inline

import pandas as pd
# scatter_matrix lives in pandas.plotting (pandas >= 0.20); pandas.tools.plotting was removed
from pandas.plotting import scatter_matrix
# to suppress printing of exponential notation in pandas
pd.options.display.float_format = '{:20,.2f}'.format

# avoid data truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# plotly settings
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
print(__version__) # requires version >= 1.9.0
init_notebook_mode(connected=True)

import numpy
from sklearn import preprocessing

### Helper functions

#### Print box plot grouped by unique label

In [None]:
def print_box_plot_grouped_by_label(data, feature_column, label_column):
    x = data[label_column].unique()

    traces = []
    for label in x:
        trace = go.Box(y = numpy.array(data.loc[data[label_column] == label][feature_column]), name = label)
        traces.append(trace)

    iplot(traces)

#### Print histograms of correlated columns

In [None]:
def print_correlated_column_histogram(data, column_names):
    sorted_data = {}
    for column in column_names:
        sorted_data[column] = data[column].sort_values()

    traces = []
    for column in column_names:
        traces.append(go.Histogram(x = sorted_data[column], name = column, opacity=0.5))

    histogramData = traces
    layout = go.Layout(barmode='overlay')
    fig = go.Figure(data = histogramData, layout=layout)
    iplot(fig, filename='overlaid histogram')

#### Print normal vs abnormal box plots

In [None]:
def print_normal_abnormal_box_plot(data, column):
    normal = data.loc[data.label == 'normal.']
    abnormal = data.loc[data.label != 'normal.']    
    trace0 = go.Box(y= normal[column], name = 'Normal')
    trace1 = go.Box(y= abnormal[column], name = 'Abnormal')
    boxData = [trace0, trace1]
    iplot(boxData)

#### Print normal vs abnormal scatter plots

In [None]:
def print_normal_vs_abnormal_scatter_plot(data_source, col1, col2): 

    normal = data_source.loc[data_source.label == 'normal.']
    abnormal = data_source.loc[data_source.label != 'normal.']

    # keep only the required columns and remove duplicate points
    # otherwise graphs become too heavy and chrome gets stuck
    normal = normal.loc[:,[col1, col2,]]
    normal = normal.drop_duplicates()
    abnormal = abnormal.loc[:,[col1, col2,]]
    abnormal = abnormal.drop_duplicates()

    # Create a trace
    trace0 = go.Scatter(
        x = normal[col1],
        y = normal[col2],
        mode = 'markers',
        name = 'normal',
        marker = dict(opacity= 0.8)
    )

    # Create a trace for the abnormal points
    trace1 = go.Scatter(
        x = abnormal[col1],
        y = abnormal[col2],
        mode = 'markers',
        name = 'abnormal',
        marker = dict(opacity= 0.8)
    )

    tempData = [trace0, trace1]

    layout = go.Layout(
        xaxis=dict(title=col1),
        yaxis=dict(title=col2)
    )

    fig = go.Figure(data = tempData, layout=layout)

    # Plot and embed in ipython notebook!
    iplot(fig, filename='basic-scatter')

#### Print correlation scatter plots

In [None]:
def print_correlation_scatter_plots(data, columns):
    max_value = -1
    for col in columns:
        temp = numpy.max(data[col].unique())
        if temp > max_value:
            max_value = temp

    row_count = len(data)
    x_axis = numpy.linspace(0, max_value, row_count)

    for col in columns:
        trace = go.Scatter(
            x = x_axis,
            y = data[col],
            mode = 'lines',
            name = col)

        iplot([trace], filename='line-mode')



#### Return True if the array holds exactly two values, one 0 and one 1

In [None]:
def is_only_zero_and_one(array):
    return len(array) == 2 and ((array[0] == 0 and array[1] == 1) or ((array[0] == 1 and array[1] == 0)))

## Step 2: loading the dataset

In [None]:
data = pd.read_csv("/Users/haris/Desktop/kdd/kdd_full.csv")
print("csv loaded")

## Step 3: analyzing metadata

### Remove duplicates

In [None]:
print('rows and columns: ' + str(data.shape))
# remove duplicate rows
data = data.drop_duplicates()
print('rows and columns after removing duplicates: ' + str(data.shape))

### Unique class label count in dataset

In [None]:
# print number of unique class labels
len(data['label'].unique())

### Features and data types

In [None]:
data.dtypes

### Percentage-wise distribution of class labels in dataset

In [None]:
rows_count = data.shape[0]
data.groupby('label').size() * 100/rows_count

#### In anomaly detection problems generally, and in this dataset in particular, most of the data is normal (75%) and only a small subset is anomalous. Within the anomalous rows this dataset has 22 different class labels. It therefore makes more sense to first train a model that classifies an input row as normal or anomalous (1/0) and then, if the row is anomalous, use a second model to identify the type of anomaly. This approach has the added advantage of being more effective against unseen anomalies.
#### If needed, we can take this approach further and, within the anomalous rows, first detect neptune (22%) and then handle the remaining classes.
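
The two-stage idea described above needs a binary target for the first stage. A minimal sketch, assuming the `label` column and the `'normal.'` value used elsewhere in this notebook; the helper name `add_binary_label`, the column name `is_normal`, and the tiny sample frame are hypothetical and only illustrate the shape of the data:

```python
import pandas as pd

def add_binary_label(frame, label_column='label', normal_value='normal.'):
    """Return a copy with an 'is_normal' column: 1 for normal rows, 0 otherwise."""
    result = frame.copy()
    result['is_normal'] = (result[label_column] == normal_value).astype(int)
    return result

# Made-up rows, not taken from the real dataset
sample = pd.DataFrame({
    'duration': [0, 2, 0, 5],
    'label': ['normal.', 'neptune.', 'normal.', 'smurf.'],
})
labelled = add_binary_label(sample)

# Stage 1 would be trained on 'is_normal';
# stage 2 would only ever see the anomalous rows and their original labels.
anomalies_only = labelled[labelled['is_normal'] == 0]
print(labelled['is_normal'].tolist())
print(anomalies_only['label'].tolist())
```

Keeping the original `label` column next to `is_normal` means the second-stage model can be trained from the same frame without reloading the data.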

### Do we need to fill missing values? Count the rows that have a null value in any column

In [None]:
len(data[data.isnull().any(axis=1)])

## Step 4: Individual feature analysis
### Questions we are trying to answer: 
1. Is there any feature that can be removed because it has no impact on class label? 
2. Is there any feature that clearly differentiates between class labels?

### Features that can be removed because they have no impact on class label

In [None]:
### Features that can be removed because they have a single value throughout the column
for col in numpy.array(data.columns):
    if len(data[col].unique()) == 1:
        print(col)

### Print features with low std dev, excluding categorical and binary value (1/0) features

In [None]:
categorical = ['protocol_type', 'service', 'flag', 'label']
for col in numpy.array(data.columns):
    if col not in categorical:
        unique_values = numpy.array(data[col].unique())
        if not is_only_zero_and_one(unique_values):
            std_dev = numpy.std(data[col])
            if std_dev < 0.1:                          
                print(col + ': ' + str(std_dev))

### Analyzing columns with low std dev (< 0.1)
#### The goal is to see whether we can remove some low std dev features if they do not help in determining any class label

In [None]:
print_box_plot_grouped_by_label(data, 'urgent', 'label')

In [None]:
print_box_plot_grouped_by_label(data, 'num_failed_logins', 'label')

In [None]:
print_box_plot_grouped_by_label(data, 'wrong_fragment', 'label')

In [None]:
print_box_plot_grouped_by_label(data, 'su_attempted', 'label')

In [None]:
print_box_plot_grouped_by_label(data, 'dst_host_srv_diff_host_rate', 'label')

In [None]:
print_box_plot_grouped_by_label(data, 'num_shells', 'label')

### Conclusion: Features with low std dev can be very effective in classifying the type of anomaly

### Categorical features: we listed the distribution of categorical features against class labels to look for a pattern, but did not find a strong one. Categorical feature values are spread roughly uniformly across the class labels.

In [None]:
# see how categorical attribute protocol_type is contributing to various labels
data.groupby(['protocol_type', 'label']).count()

In [None]:
# see how categorical attribute service is contributing to various labels
data.groupby(['service', 'label']).count()

In [None]:
# see how categorical attribute flag is contributing to various labels
data.groupby(['flag', 'label']).count()

### Minor findings on categorical feature
1. Whenever the service value is "http_2784" or "harvest", the class label is "satan"
2. Whenever the flag is "RSTOS0", the class label is always "portsweep"
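
Findings like the two above can be listed programmatically instead of being read off the grouped counts. The sketch below is one possible way to do it; the helper name `single_label_values` and the sample rows are hypothetical. A value that appears only a handful of times will also show up here, so the row counts are still worth checking before relying on such a rule.

```python
import pandas as pd

def single_label_values(frame, feature_column, label_column='label'):
    """Map each feature value that co-occurs with exactly one label to that label."""
    labels_per_value = frame.groupby(feature_column)[label_column].unique()
    return {
        value: labels[0]
        for value, labels in labels_per_value.items()
        if len(labels) == 1
    }

# Made-up rows, only to show the behaviour
sample = pd.DataFrame({
    'flag': ['SF', 'SF', 'RSTOS0', 'RSTOS0', 'S0'],
    'label': ['normal.', 'smurf.', 'portsweep.', 'portsweep.', 'neptune.'],
})
print(single_label_values(sample, 'flag'))
```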

### Conclusion: so far we have seen that one feature has no impact on the class label and can be removed. Furthermore, no small subset of features is enough to predict the label on its own, so we will have to train the model using all features.
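
Removing a feature that carries a single value could look like the sketch below. The helper name `drop_constant_columns` and the column name `constant_feature` are hypothetical; the notebook itself only prints the names of such columns.

```python
import pandas as pd

def drop_constant_columns(frame):
    """Return (reduced_frame, dropped_names) without single-valued columns."""
    constant = [col for col in frame.columns
                if frame[col].nunique(dropna=False) == 1]
    return frame.drop(columns=constant), constant

# Made-up rows; 'constant_feature' stands in for the single-valued column
sample = pd.DataFrame({
    'constant_feature': [0, 0, 0],
    'duration': [0, 1, 2],
    'label': ['normal.', 'neptune.', 'normal.'],
})
reduced, dropped = drop_constant_columns(sample)
print(dropped)
print(list(reduced.columns))
```

Returning the list of dropped names makes it easy to apply the same removal to any later test or validation set.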

## Step 5: finding correlated features
#### Question we are trying to answer: are there any highly correlated features in the dataset? If yes, we can reduce the feature set by removing redundant columns. 

In [None]:
pairsSet = set()
# skipping some columns
# first three are categorical
columns_to_skip = ['flag', 'protocol_type', 'service', 'num_outbound_cmds', 'label']

for column in data.columns:
    # print column
    for inner_column in data.columns:
        if column not in columns_to_skip and inner_column not in columns_to_skip:
            
            key1 = column + '-' + inner_column
            key2 = inner_column + '-' + column
            
            if column != inner_column and key1 not in pairsSet and key2 not in pairsSet:
                # print column + " -- " + inner_column                
                pairsSet.add(key1)
                pairsSet.add(key2)
                correlation = numpy.corrcoef(data[column], data[inner_column])[0, 1]
                # list all column pairs where correlation is >= 0.75
                if correlation >= 0.75:
                    print(column + " -- " + inner_column)
                    print(correlation)

print('DONE')
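
The pairwise loop above can also be written more compactly with `DataFrame.corr`, which computes the whole correlation matrix in one call. This is an alternative sketch, not the method used in this notebook; the helper name `highly_correlated_pairs` and the sample values are hypothetical. Constant columns produce NaN correlations, which never pass the threshold test.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.75):
    """Return (column_a, column_b, correlation) for numeric pairs at or above threshold."""
    corr = frame.select_dtypes(include=[np.number]).corr()
    columns = corr.columns
    pairs = []
    # Walk only the upper triangle so every pair is reported once
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            value = corr.iloc[i, j]
            if value >= threshold:
                pairs.append((columns[i], columns[j], value))
    return pairs

# Made-up values: 'a' and 'b' move together, 'c' does not
sample = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 8.5],
    'c': [4, 1, 3, 2],
    'label': ['normal.', 'normal.', 'neptune.', 'neptune.'],
})
pairs = highly_correlated_pairs(sample)
print([(first, second) for first, second, _ in pairs])
```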

### Plot correlated columns
### Correlation : serror_rate | dst_host_srv_serror_rate | srv_serror_rate | dst_host_serror_rate => 0.99 correlation


#### These four features have nearly the same mean, std dev and interquartile range, but are they really the same and highly correlated?

In [None]:
print(data['serror_rate'].describe())
print(data['dst_host_srv_serror_rate'].describe())
print(data['srv_serror_rate'].describe())
print(data['dst_host_serror_rate'].describe())

#### Printing line plot to visualize the correlation

In [None]:
print_correlation_scatter_plots(data, ['dst_host_serror_rate', 
                                       'dst_host_srv_serror_rate',
                                       'serror_rate', 
                                       'srv_serror_rate'
                                       ])

#### Print scatter matrix to visualize the correlation

In [None]:
df = data[['dst_host_serror_rate', 'dst_host_srv_serror_rate', 'serror_rate', 'srv_serror_rate']]
scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='kde')

#### Looking at the plots above, we can conclude that although these features are correlated in some places, the overall correlation is not high enough to justify dropping any of them.

### Correlation : num_compromised | num_root => 0.99 correlation

In [None]:
print data['num_root'].describe()
print data['num_compromised'].describe()

In [None]:
# columns have slightly different max values, so apply scaling
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
# MinMaxScaler expects 2-D input, so pass one-column frames and flatten the result
num_compromised_scaled = min_max_scaler.fit_transform(data[['num_compromised']]).ravel()
num_root_scaled = min_max_scaler.fit_transform(data[['num_root']]).ravel()

row_count = len(data)
x_axis = numpy.linspace(0, 1, row_count)

trace0 = go.Scatter(x = x_axis, y = num_compromised_scaled, mode = 'lines', name = 'num_compromised')
trace1 = go.Scatter(x = x_axis, y = num_root_scaled, mode = 'lines', name = 'num_root')

df = data[['num_root', 'num_compromised']]
scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='kde')

iplot([trace0], filename='line-mode')
iplot([trace1], filename='line-mode')

In [None]:
print_box_plot_grouped_by_label(data, 'num_root', 'label')
print_box_plot_grouped_by_label(data, 'num_compromised', 'label')

#### Here the correlation is very high, and it looks like we can drop one of these two features from the feature set

### Correlation: rerror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | srv_rerror_rate
1. rerror_rate : srv_rerror_rate => 0.98959088658
2. rerror_rate : dst_host_rerror_rate => 0.965840027856
3. dst_host_rerror_rate : dst_host_srv_rerror_rate => 0.961246461143
4. srv_rerror_rate : dst_host_srv_rerror_rate => 0.959543136155
5. rerror_rate : dst_host_srv_rerror_rate => 0.958523973064
6. srv_rerror_rate : dst_host_rerror_rate => 0.956401661999

In [None]:
print(data['rerror_rate'].describe())
print(data['dst_host_rerror_rate'].describe())
print(data['dst_host_srv_rerror_rate'].describe())
print(data['srv_rerror_rate'].describe())

In [None]:
print_correlation_scatter_plots(data,
    ['rerror_rate',
    'srv_rerror_rate',
    'dst_host_srv_rerror_rate',
    'dst_host_rerror_rate'])

In [None]:
df = data[['rerror_rate', 'srv_rerror_rate', 'dst_host_srv_rerror_rate', 'dst_host_rerror_rate']]
scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='kde')

#### Again, there is some correlation but not high enough to drop the features

### Conclusion: we have found one highly correlated pair of features in the dataset. Based on this we can remove one feature from our feature set.
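
Dropping one feature from a highly correlated pair could be sketched as follows. This is an illustration under assumptions, not the notebook's method: the helper name `columns_to_drop` and the sample values are hypothetical, the sketch uses the absolute correlation (the loop in Step 5 only looks at positive correlation), and it always drops the later column of a pair, which is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def columns_to_drop(frame, threshold=0.99):
    """Return the later column of every numeric pair whose |correlation| >= threshold."""
    corr = frame.select_dtypes(include=[np.number]).corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] >= threshold).any()]

# Made-up values; the first two columns are strongly correlated
sample = pd.DataFrame({
    'num_compromised': [0, 1, 2, 10],
    'num_root': [0, 1, 3, 10],
    'duration': [5, 0, 3, 1],
})
redundant = columns_to_drop(sample, threshold=0.95)
reduced = sample.drop(columns=redundant)
print(redundant)
print(list(reduced.columns))
```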

## Step 6: Compare anomalous and normal data: Single Feature
#### Question we are trying to answer: do anomalous and normal data differ in some way that can help us in prediction?

In [None]:
print_normal_abnormal_box_plot(data, 'srv_serror_rate')

In [None]:
print_normal_abnormal_box_plot(data, 'dst_host_srv_count')

#### Conclusion: there exist features that have different patterns for normal and anomalous data

## Step 7: Compare anomalous and normal data: Multiple features
#### Question we are trying to answer: can creating new features (if needed) help us in prediction?

In [None]:
print_normal_vs_abnormal_scatter_plot(data, 'count', 'same_srv_rate')

In [None]:
print_normal_vs_abnormal_scatter_plot(data, 'srv_count', 'count')

In [None]:
print_normal_vs_abnormal_scatter_plot(data, 'num_root', 'wrong_fragment')

In [None]:
tempData = data.loc[data['dst_bytes'] < 1000000]
print_normal_vs_abnormal_scatter_plot(tempData, 'dst_bytes', 'num_file_creations')

In [None]:
print_normal_vs_abnormal_scatter_plot(data, 'duration', 'dst_host_same_src_port_rate')

### Final Conclusion:
1. We found one column that has no impact on the class label.
2. We found a few correlated columns, which can help in reducing the feature set.
3. We found that it might be more suitable to first detect normal vs anomalous (1/0) and then predict the type of anomaly.
4. We found that no single feature or small subset of features can do the prediction alone. However, many features provide some level of distinction between class labels. Therefore, by using all the available features together we can make a good prediction.
5. If needed, we can create new features by combining existing ones, or we can give neural networks a try, which can learn such combinations automatically.

### Next Steps:
To find out how accurate our data analysis has been, we can do the following:
1. Convert categorical features into continuous or binary (1/0) features
2. Rescale all the features between 0 and 1
3. Using stratified sampling, create three sets: (1) test, (2) train, (3) cross validation
4. Implement a dummy classifier and note the results (http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)
5. Select a machine learning algorithm to apply to the dataset
6. Apply the selected machine learning algorithm using all features to get the baseline accuracy of the algorithm. Note the train and cross validation errors.
7. Apply the selected machine learning algorithm using the subset of features identified in EDA. Note the train and cross validation errors.
8. Use the two-step approach, i.e. normal vs anomalous and then predict the type of anomaly. Use all features present in the feature set. Note the train and cross validation errors.
9. Repeat the previous step using the subset of features identified in EDA. Note the train and cross validation errors.
10. Repeat steps 6-9 using a neural network
11. Apply the best method to the test set to get the final accuracy
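
The first four steps of the list above can be sketched with pandas alone. This is a rough illustration under assumptions: the helper names `prepare_features` and `stratified_split`, the sample rows, and the two-way split are all hypothetical, and the majority-class baseline here is a hand-written stand-in for the dummy classifier mentioned in step 4. A three-way split could be obtained by applying `stratified_split` a second time to the training part.

```python
import pandas as pd

def prepare_features(frame, categorical_columns, label_column='label'):
    """One-hot encode the categorical columns and rescale every feature to [0, 1]."""
    features = pd.get_dummies(frame.drop(columns=[label_column]),
                              columns=categorical_columns, dtype=float)
    features = features.astype(float)
    low = features.min()
    # A constant column has zero range; use 1 there to avoid dividing by zero
    value_range = (features.max() - low).replace(0, 1)
    return (features - low) / value_range, frame[label_column]

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_index, test_index), sampling test_fraction of every label."""
    test_ids = []
    for _, group in labels.groupby(labels):
        test_ids.extend(group.sample(frac=test_fraction, random_state=seed).index)
    test_index = pd.Index(test_ids)
    return labels.index.difference(test_index), test_index

# Made-up rows, only to exercise the helpers
sample = pd.DataFrame({
    'duration': [0, 4, 1, 3, 0, 2, 0, 1],
    'protocol_type': ['tcp', 'udp', 'tcp', 'tcp', 'tcp', 'tcp', 'udp', 'tcp'],
    'label': ['normal.'] * 4 + ['neptune.'] * 4,
})
features, labels = prepare_features(sample, ['protocol_type'])
train_index, test_index = stratified_split(labels, test_fraction=0.5)

# Majority-class baseline: always predict the most frequent training label
majority_label = labels.loc[train_index].value_counts().idxmax()
baseline_accuracy = (labels.loc[test_index] == majority_label).mean()
print(list(features.columns))
print(baseline_accuracy)
```

Any real model trained in the later steps should beat `baseline_accuracy`; if it does not, the features or the split deserve a second look.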