#### This notebook trains a classifier on NSL-KDD and uses it to make predictions on Snort logs

In [1]:
# Import the required libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
# Silence pandas' chained-assignment (SettingWithCopy) warning
pd.options.mode.chained_assignment = None

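##### The function below preprocesses the training dataset (NSL-KDD) and the test set (Snort logs):
- Class labelling on the training set: every attack type is relabelled as `attack`
- Handling the categorical `protocol_type` feature with a label encoder
- Processing the Snort logs: protocol names are lowercased and stripped, and `duration` is converted to seconds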
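
The Snort-side steps can be sketched on a toy frame. The two rows and the three-protocol vocabulary below are made-up stand-ins, not values taken from the actual logs, and the duration strings assume a format that `pd.to_timedelta` can parse; only the pandas and scikit-learn calls mirror the function.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows standing in for a Snort log; real values may differ.
toy = pd.DataFrame({
    "duration": ["00:00:01.500000", "00:01:00"],
    "protocol_type": [" TCP", "udp "],
})

# Normalise protocol names so they match the lowercase NSL-KDD values.
toy["protocol_type"] = toy["protocol_type"].str.lower().str.strip()

# Convert the duration strings to seconds (floats).
toy["duration"] = pd.to_timedelta(toy["duration"]).dt.total_seconds()

# Fit the encoder on the training vocabulary (assumed here), then reuse it on the log rows.
encoder = LabelEncoder()
encoder.fit(["icmp", "tcp", "udp"])
toy["protocol_type"] = encoder.transform(toy["protocol_type"])

print(toy)
```

Fitting the encoder on the training values and only calling `transform` on the logs keeps the integer codes consistent between the two datasets; a protocol that never appears in training would raise an error at the `transform` step.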


In [2]:
def Preprocessing(dataset,test) :
    # Work on copies so the caller's DataFrames are left unchanged
    dataset = dataset.copy()
    test = test.rename(columns={'protocol': 'protocol_type'})
    test = test[['duration', 'protocol_type', 'src_bytes', 'dst_bytes', 'count', 'srv_count']].copy()

    # Processing the Snort logs: normalise protocol names to match NSL-KDD
    test['protocol_type'] = test['protocol_type'].str.lower().str.strip()
    # The duration column is converted to seconds
    test['duration'] = pd.to_timedelta(test['duration']).dt.total_seconds()

    X_test = test.values

    # Relabel every attack type as 'attack'; 'normal' rows keep their label
    attack_types = ['back', 'buffer_overflow', 'ftp_write', 'guess_passwd', 'imap', 'ipsweep', 'land', 'loadmodule', 'multihop', 'neptune', 'nmap', 'perl', 'phf', 'pod', 'portsweep', 'rootkit', 'satan', 'smurf', 'spy', 'teardrop', 'warezclient', 'warezmaster']
    dataset['class'] = dataset['class'].replace(attack_types, 'attack')

    x = dataset.iloc[:, :-1].values
    y = dataset['class'].values

    # Handling the categorical feature: column 1 is protocol_type.
    # The encoder is fitted on the training data and reused on the Snort logs.
    labelencoder_x_1 = LabelEncoder()
    x[:, 1] = labelencoder_x_1.fit_transform(x[:, 1])
    X_test[:, 1] = labelencoder_x_1.transform(X_test[:, 1])

    return x,y,X_test

In [3]:
def training(x,y) :
    model=RandomForestClassifier()
    # Model training
    model.fit(x, y)

    return model

In [4]:
def predictions(model,X_test) :
    y_pred= model.predict(X_test)
    return y_pred

**Reading the training dataset:**

In [5]:
column_names = ['duration','protocol_type','service','flag','src_bytes','dst_bytes','land',
'wrong_fragment','urgent','hot','num_failed_logins','logged_in','num_compromised','root_shell',
'su_attempted','num_root','num_file_creations','num_shells','num_access_files','num_outbound_cmds',
'is_host_login','is_guest_login','count','srv_count','serror_rate','srv_serror_rate','rerror_rate','srv_rerror_rate',
'same_srv_rate','diff_srv_rate','srv_diff_host_rate','dst_host_count','dst_host_srv_count',
'dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate','dst_host_srv_diff_host_rate',
'dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate','dst_host_srv_rerror_rate', 'class','misc']
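# KDDTrain+.txt has no header row, so the column names are supplied explicitly
dataset = pd.read_csv('NSL-KDD/KDDTrain+.txt', header=None, names=column_names, index_col=False)
# Keep only the features that Preprocessing also takes from the Snort logs, plus the label
selected_columns = ["duration", "protocol_type", "src_bytes", "dst_bytes", "count", "srv_count", "class"]
dataset = dataset[selected_columns]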

**Importing the Snort logs:**
1. Non-attack logs
2. Attack logs

In [6]:
test1=pd.read_csv('tcplogWithCount_non_attack.csv')
test2=pd.read_csv('tcplogWithCount_attack.csv')
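
Before preprocessing, it may help to confirm that each log carries the columns `Preprocessing` reads. The sketch below assumes the raw logs use the `protocol` header (as the rename inside `Preprocessing` suggests); `missing_columns` and the one-row `sample` frame are hypothetical helpers for illustration, not part of the original notebook.

```python
import pandas as pd

# Columns that Preprocessing reads from a raw log before renaming 'protocol'.
REQUIRED = ["duration", "protocol", "src_bytes", "dst_bytes", "count", "srv_count"]

def missing_columns(log):
    """Return the required columns that are absent from the log DataFrame."""
    return [name for name in REQUIRED if name not in log.columns]

# Hypothetical one-row log used only to exercise the check.
sample = pd.DataFrame({
    "duration": ["00:00:02"],
    "protocol": ["TCP"],
    "src_bytes": [120],
    "dst_bytes": [0],
    "count": [3],
})

print(missing_columns(sample))
```

Calling `missing_columns(test1)` and `missing_columns(test2)` right after loading would surface a renamed or missing header before it turns into a `KeyError` inside `Preprocessing`.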

**Preprocessing the training dataset and the Snort logs:**

In [7]:
x,y,X_test1=Preprocessing(dataset,test1)
x,y,X_test2=Preprocessing(dataset,test2)

**Model Training:**

In [8]:
model=training(x,y)

**Predictions:**

In [9]:
y_pred1=predictions(model,X_test1)
y_pred2=predictions(model,X_test2)

**Results on non-attack logs:**

In [11]:
df=pd.DataFrame({'Results':y_pred1})
df.value_counts()

Results
normal     20
dtype: int64

**Results on attack logs:**

In [12]:
df=pd.DataFrame({'Results':y_pred2})
df.value_counts()

Results
attack     5719
normal       27
dtype: int64
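
The two outputs above can be turned into rates with a few lines of arithmetic. This is a minimal sketch that assumes every row in the attack log really is an attack and every row in the non-attack log really is normal; the counts are copied from the `value_counts()` results, and `share` is a hypothetical helper.

```python
# Counts copied from the two value_counts() outputs in this notebook.
non_attack_log = {"normal": 20, "attack": 0}
attack_log = {"attack": 5719, "normal": 27}

def share(counts, label):
    """Fraction of rows in `counts` that were predicted as `label`."""
    total = sum(counts.values())
    return counts[label] / total

detection_rate = share(attack_log, "attack")        # attacks flagged as attack
miss_rate = share(attack_log, "normal")             # attacks predicted as normal
false_alarm_rate = share(non_attack_log, "attack")  # normal traffic flagged as attack

print(f"detection rate:   {detection_rate:.4f}")
print(f"miss rate:        {miss_rate:.4f}")
print(f"false alarm rate: {false_alarm_rate:.4f}")
```

Under that assumption the model flags about 99.5% of the attack rows (5719 of 5746). The non-attack log has only 20 rows, so the zero false-alarm figure is a coarse estimate.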