
# 3ND: Neural Network Ntrusion Detection
The internet can be a beautiful thing, but the highway that connects us to our cat memes and YouTube videos can also be the door for malicious users to gain access to our networks and resources. Because of this, network security is paramount; however, detecting these intrusions can be less than obvious to the naked eye. We suggest a machine learning approach that takes in network traffic data and labels it malicious or benign. We will be making use of the [IDS 2018 Intrusion CSVs](https://www.kaggle.com/solarmainframe/ids-intrusion-csv) and the [LUFlow Network Intrusion Detection](https://www.kaggle.com/mryanm/luflow-network-intrusion-detection-data-set) data sets.

## Loading and Preparing the Data
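The IDS and LUFlow datasets share only 9 comparable columns, and each dataset names them differently, so we have to load both and align them. LUFlow has no `duration` column of its own; we derive it from `time_start` and `time_end`. For reference (LUFlow name -> IDS name):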
* bytes_in -> TotLen Bwd Pkts 
* bytes_out -> TotLen Fwd Pkts 
* dest_port -> Dst Port 
* num_pkts_out -> Tot Fwd Pkts
* num_pkts_in -> Tot Bwd Pkts
* proto -> Protocol
* time_start -> Timestamp
* duration -> Flow Duration
* label -> Label
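
The mapping above can be written down once as a dictionary and reused when renaming. This is only a minimal sketch: the name `IDS_TO_LUFLOW` and the one-row sample frame are illustrative and are not part of the notebook. The IDS names follow the spelling used in the code cells.

```python
import pandas as pd

# IDS column name -> LUFlow column name (illustrative constant, not from the notebook)
IDS_TO_LUFLOW = {
    "TotLen Bwd Pkts": "bytes_in",
    "TotLen Fwd Pkts": "bytes_out",
    "Dst Port": "dest_port",
    "Tot Fwd Pkts": "num_pkts_out",
    "Tot Bwd Pkts": "num_pkts_in",
    "Protocol": "proto",
    "Timestamp": "time_start",
    "Flow Duration": "duration",
    "Label": "label",
}

# a one-row stand-in for the IDS data, just to show the rename
sample = pd.DataFrame(
    [[0, 0, 0, 3, 0, 0, "14/02/2018 08:31:01", 112641719, "Benign"]],
    columns=list(IDS_TO_LUFLOW),
)
renamed = sample.rename(columns=IDS_TO_LUFLOW)
print(list(renamed.columns))
```

Keeping the mapping in one place means the column filter and the rename cannot drift apart.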



In [None]:
import pandas as pd

# read our csv files
ids_df = pd.read_csv("data/02-14-2018.csv")
luflow_df = pd.read_csv("data/2020.06.19.csv")

In [None]:
# keep only the columns the two datasets have in common
ids_df = ids_df.filter(['TotLen Bwd Pkts', 'TotLen Fwd Pkts', 'Dst Port', 'Tot Fwd Pkts', 'Tot Bwd Pkts', 'Protocol', 'Timestamp', 'Flow Duration', 'Label'])
luflow_df = luflow_df.filter(['bytes_in', 'bytes_out', 'dest_port', 'num_pkts_out', 'num_pkts_in', 'proto', 'time_start', 'time_end', 'label'])

# drop rows that are missing info
ids_df = ids_df.dropna()
luflow_df = luflow_df.dropna()
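
One caveat with the cell above: `DataFrame.filter` keeps the requested labels that exist and silently skips the ones that do not, so a misspelled column name goes unnoticed. A small check can catch that. The helper `missing_columns` and the sample frame below are hypothetical, a sketch rather than part of the notebook.

```python
import pandas as pd

def missing_columns(df, wanted):
    """Return the names in `wanted` that are not columns of `df` (hypothetical helper)."""
    return [name for name in wanted if name not in df.columns]

# tiny stand-in frame; 'time_end' is deliberately absent
sample = pd.DataFrame({"bytes_in": [342], "bytes_out": [3679]})
wanted = ["bytes_in", "bytes_out", "time_end"]

kept = sample.filter(wanted)              # no error, 'time_end' is just skipped
absent = missing_columns(sample, wanted)  # names that were asked for but not found
print(list(kept.columns), absent)
```

Running such a check on `ids_df` and `luflow_df` before filtering would make a missing column an explicit finding instead of a quiet omission.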

In [None]:
luflow_df.head(3)

Unnamed: 0,bytes_in,bytes_out,dest_port,num_pkts_out,num_pkts_in,proto,time_start,time_end,label
0,342,3679,9200.0,2,2,6,1592533725632946,1592533725648144,benign
1,0,0,55972.0,1,1,6,1592533744644904,1592533744644904,outlier
2,15440,942,9300.0,3,3,6,1592533770933553,1592533770936279,benign


In [None]:
ids_df.head(3)

Unnamed: 0,TotLen Bwd Pkts,TotLen Fwd Pkts,Dst Port,Tot Fwd Pkts,Tot Bwd Pkts,Protocol,Timestamp,Flow Duration,Label
0,0,0,0,3,0,0,14/02/2018 08:31:01,112641719,Benign
1,0,0,0,3,0,0,14/02/2018 08:33:50,112641466,Benign
2,0,0,0,3,0,0,14/02/2018 08:36:39,112638623,Benign


In [None]:
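import calendar
import datetime

# convert the IDS date string (dd/mm/YYYY HH:MM:SS) to a Unix timestamp in seconds.
# calendar.timegm reads the parsed time as UTC, so the result does not depend on
# the local timezone of the machine running the notebook (time.mktime would)
def convert_to_timestamp(x):
  return calendar.timegm(datetime.datetime.strptime(x[:19],"%d/%m/%Y %H:%M:%S").timetuple())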

ids_df['Timestamp']=ids_df['Timestamp'].apply(convert_to_timestamp)
ids_df.head(3)

Unnamed: 0,TotLen Bwd Pkts,TotLen Fwd Pkts,Dst Port,Tot Fwd Pkts,Tot Bwd Pkts,Protocol,Timestamp,Flow Duration,Label
0,0,0,0,3,0,0,1518597061,112641719,Benign
1,0,0,0,3,0,0,1518597230,112641466,Benign
2,0,0,0,3,0,0,1518597399,112638623,Benign
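
The same conversion can also be done on the whole column at once. The sketch below assumes the timestamps are meant to be read as UTC, which the notebook does not state; the three strings are the sample rows shown above.

```python
import pandas as pd

raw = pd.Series(["14/02/2018 08:31:01", "14/02/2018 08:33:50", "14/02/2018 08:36:39"])

# parse with an explicit day-first format and treat the values as UTC (assumption)
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S", utc=True)

# whole seconds since the Unix epoch
epoch = pd.Timestamp("1970-01-01", tz="UTC")
seconds = (parsed - epoch) // pd.Timedelta(seconds=1)
print(seconds.tolist())
```

Read as UTC, the three sample rows give the same values as the table above. An explicit `format` also avoids any day/month ambiguity in dates such as 02/03/2018.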


In [None]:
# compute each LUFlow flow's duration from its start and end times
# (time_end - time_start, in the same units as those two columns)
luflow_df['duration'] = (luflow_df['time_end'] - luflow_df['time_start']).astype(int)

# drop the time_end column
luflow_df=luflow_df.drop(['time_end'], axis=1)
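
The sample rows suggest a unit mismatch worth checking: LUFlow's `time_start` values (for example 1592533725632946) have 16 digits, which looks like microseconds since the epoch, while the converted IDS `Timestamp` is in seconds. Assuming that reading is correct, the sketch below rescales `time_start` to seconds. The helper `micros_to_seconds` is hypothetical.

```python
import pandas as pd

def micros_to_seconds(col):
    """Convert integer microseconds since the epoch to whole seconds (hypothetical helper)."""
    return col // 1_000_000

# first two LUFlow sample rows from the output shown earlier
luflow_sample = pd.DataFrame({
    "time_start": [1592533725632946, 1592533744644904],
    "time_end":   [1592533725648144, 1592533744644904],
})

# duration stays in the original (assumed microsecond) units
luflow_sample["duration"] = luflow_sample["time_end"] - luflow_sample["time_start"]
# time_start is rescaled to seconds so it is comparable with the IDS column
luflow_sample["time_start"] = micros_to_seconds(luflow_sample["time_start"])
print(luflow_sample[["time_start", "duration"]])
```

Whether `duration` computed this way uses the same unit as the IDS `Flow Duration` column should be confirmed against the dataset documentation before the two are combined.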

In [None]:
# make labels the same across both
# LUFlow labels are benign, outlier, or malicious; treat outlier as benign
def convert_outlier_to_benign(x):
  if x == "outlier":
    return "benign"
  return x

luflow_df['label']=luflow_df['label'].apply(convert_outlier_to_benign)

# convert dest_port to int
def convert_dest_port(x):
  return int(x)

luflow_df['dest_port']=luflow_df['dest_port'].apply(convert_dest_port)

luflow_df.head(3)

Unnamed: 0,bytes_in,bytes_out,dest_port,num_pkts_out,num_pkts_in,proto,time_start,label,duration
0,342,3679,9200,2,2,6,1592533725632946,benign,15198
1,0,0,55972,1,1,6,1592533744644904,benign,0
2,15440,942,9300,3,3,6,1592533770933553,benign,2726


In [None]:
# IDS labels are either Benign or the name of a specific attack; map Benign to benign and everything else to malicious
def convert_ids_label(x):
  if x == "Benign":
    return "benign"
  return "malicious"

ids_df['Label']=ids_df['Label'].apply(convert_ids_label)
ids_df.head(3)

Unnamed: 0,TotLen Bwd Pkts,TotLen Fwd Pkts,Dst Port,Tot Fwd Pkts,Tot Bwd Pkts,Protocol,Timestamp,Flow Duration,Label
0,0,0,0,3,0,0,1518597061,112641719,benign
1,0,0,0,3,0,0,1518597230,112641466,benign
2,0,0,0,3,0,0,1518597399,112638623,benign
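
The two label conversions follow the same rule, so they could be expressed as one function. This is a sketch under that assumption: `to_binary_label` is a hypothetical helper, and `SomeAttack` is a made-up placeholder for an attack name.

```python
import pandas as pd

def to_binary_label(label):
    """Map any dataset label to 'benign' or 'malicious' (hypothetical helper)."""
    text = str(label).strip().lower()
    # 'outlier' is folded into 'benign', as in the LUFlow cell above
    return "benign" if text in ("benign", "outlier") else "malicious"

labels = pd.Series(["Benign", "benign", "outlier", "malicious", "SomeAttack"])
binary = labels.apply(to_binary_label)
print(binary.value_counts().to_dict())
```

Unlike the notebook's exact string comparisons, this version ignores case and surrounding whitespace, which is a design choice rather than a requirement of either dataset.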


In [None]:

# rename ids_df columns to match LUFlow's
ids_df=ids_df.rename(columns={'TotLen Bwd Pkts': 'bytes_in', 'TotLen Fwd Pkts': 'bytes_out', 'Dst Port':'dest_port', 'Tot Fwd Pkts':'num_pkts_out', 'Tot Bwd Pkts':'num_pkts_in', 'Protocol':'proto', 'Timestamp':'time_start', 'Flow Duration':'duration', 'Label':'label'})

# combine the two into one DataFrame; ignore_index gives the result a fresh 0..n-1 index
df = pd.concat([ids_df, luflow_df], ignore_index=True)

df.head(3)

Unnamed: 0,bytes_in,bytes_out,dest_port,num_pkts_out,num_pkts_in,proto,time_start,duration,label
0,0,0,0,3,0,0,1518597061,112641719,benign
1,0,0,0,3,0,0,1518597230,112641466,benign
2,0,0,0,3,0,0,1518597399,112638623,benign
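
Two details of `pd.concat` matter for this step: columns are matched by name rather than by position, and by default each frame keeps its own row labels. The toy frames below are made up to illustrate both points.

```python
import pandas as pd

a = pd.DataFrame({"time_start": [1], "duration": [10], "label": ["benign"]})
# same columns as `a`, but in a different order
b = pd.DataFrame({"time_start": [2], "label": ["malicious"], "duration": [20]})

combined = pd.concat([a, b])                   # keeps each frame's row labels
fresh = pd.concat([a, b], ignore_index=True)   # renumbers the rows from 0

print(list(combined.columns))
print(combined.index.tolist())
print(fresh.index.tolist())
```

Because of the name matching, the different column order in `luflow_df` (where `duration` was appended last) does not scramble the combined data.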


In [None]:
import tensorflow


# make dest_port an int
df['dest_port'] = df['dest_port'].astype(int)

# one-hot encoding the port is left disabled: ports go up to 65535, so 300 classes
# would not cover them, and the encoded result would not fit in a single column
# df['dest_port']=tensorflow.keras.utils.to_categorical(df['dest_port'], 300) #each port should be a category
df.head(3)


Unnamed: 0,bytes_in,bytes_out,dest_port,num_pkts_out,num_pkts_in,proto,time_start,duration,label
0,0,0,0,3,0,0,1518597061,112641719,benign
1,0,0,0,3,0,0,1518597230,112641466,benign
2,0,0,0,3,0,0,1518597399,112638623,benign
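
If the port is to be treated as a category, one option is to bucket it first, since a full one-hot encoding would need up to 65536 columns. The sketch below uses the IANA port ranges (0-1023, 1024-49151, 49152-65535). This is a swapped-in technique for illustration, not the notebook's method, and the bucket names are made up.

```python
import pandas as pd

ports = pd.Series([0, 443, 9200, 55972])

# right-inclusive bins: (-1, 1023], (1023, 49151], (49151, 65535]
port_class = pd.cut(
    ports,
    bins=[-1, 1023, 49151, 65535],
    labels=["well_known", "registered", "dynamic"],
)

# one indicator column per bucket
port_onehot = pd.get_dummies(port_class, prefix="port")
print(port_class.tolist())
print(list(port_onehot.columns))
```

Three buckets discard a lot of detail; a finer scheme (for example, separate categories for the most frequent ports) may suit the data better.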


In [None]:
# encode labels as integers: benign -> 0, malicious -> 1
def label_to_num(x):
  if x == "benign":
    return 0
  return 1
df['label_num']=df['label'].apply(label_to_num)

In [None]:
# make input and output
X = df.drop(['label_num', 'label'], axis=1)
# one-hot encode the labels (keras.utils.to_categorical(y, num_classes)); this returns an
# (n_samples, 2) array, which cannot be stored in a single DataFrame column
Y = tensorflow.keras.utils.to_categorical(df['label_num'], 2)
X.head(3)

Unnamed: 0,bytes_in,bytes_out,dest_port,num_pkts_out,num_pkts_in,proto,time_start,duration
0,0,0,0,3,0,0,1518597061,112641719
1,0,0,0,3,0,0,1518597230,112641466
2,0,0,0,3,0,0,1518597399,112638623
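
To make the target encoding concrete, here is what a one-hot encoding of binary labels looks like, written in NumPy so that it runs without TensorFlow. The four toy labels are made up.

```python
import numpy as np

label_num = np.array([0, 1, 1, 0])   # 0 = benign, 1 = malicious

# row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype="float32")[label_num]
print(one_hot.shape)

# argmax along each row recovers the original integer labels
recovered = one_hot.argmax(axis=1)
```

The result has one row per sample and one column per class, so it has to be kept as a separate array rather than as a single DataFrame column.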
