# **Detecting Network Anomalies using Machine Learning**
## Authors: Matthew Grubelic, Chris Saliby, Anthony Saldana, Luke Turbert, Andrew Rittenhouse

## Advisors: Liberty Page, Vahid Behzadan, Ph.D.

### *Sponsor: Secure and Assured Intelligent Learning (SAIL) Lab*

  The risk of crippling cyber attacks on computer systems grows every day. Current defensive software and techniques are showing their limitations and, in some cases, are being completely overwhelmed. As a result, a more modern and forward-thinking solution is becoming increasingly necessary. The team is implementing one such solution: a sequence-to-sequence neural network, written in Python 3 with the PyTorch library, that observes malicious events and predicts when the next attack might happen. In this project, a deep learning sequence-to-sequence model, modeled after the behavior of predictive typing technologies, was implemented on the CICIDS 2017 dataset to detect malicious DNS traffic and estimate the probability of another attack in the future. The model observes sequences of network packets, uses them to predict upcoming sequences of packets, and compares the actual observed data to the prediction. Little notable research and experimentation has been done in this area so far; the results of this network traffic anomaly detection approach add evidence that a sequence-to-sequence model is a viable, though still emerging, solution for modern intrusion detection systems.

  Goal: To leverage machine learning techniques to automatically detect anomalous traffic that uses the DNS protocol. The project seeks to eliminate the need for manual examination of suspicious log files by allowing a machine learning algorithm to make decisions. The algorithm will use an unsupervised learning model that is trained on known data and then compares what it has learned against unknown, live DNS data.
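
The idea of comparing a predicted sequence with the observed one can be illustrated with a minimal sketch. It is not the project's model: the predicted sequence is simply supplied as input, each packet is reduced to a hypothetical pair of numeric features, and the threshold value is an arbitrary assumption chosen for this example.

```python
# Illustrative sketch only: scores a predicted packet sequence against the
# observed one. The feature vectors, function names, and threshold are
# assumptions made for this example, not part of the project code.

def sequence_error(predicted, observed):
    """Mean squared error between two equal-length sequences of feature vectors."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    count = 0
    for p_vec, o_vec in zip(predicted, observed):
        for p, o in zip(p_vec, o_vec):
            total += (p - o) ** 2
            count += 1
    return total / count


def is_anomalous(predicted, observed, threshold=0.5):
    """Flag the observed sequence when its error exceeds the threshold."""
    return sequence_error(predicted, observed) > threshold


# Hypothetical features per packet: (normalized length, normalized time gap)
predicted = [(0.10, 0.20), (0.10, 0.20), (0.10, 0.20)]
normal = [(0.10, 0.20), (0.12, 0.20), (0.10, 0.22)]
unusual = [(0.90, 0.20), (1.00, 1.50), (0.95, 0.20)]

print(is_anomalous(predicted, normal))
print(is_anomalous(predicted, unusual))
```

In practice the threshold would have to be tuned on traffic known to be benign; the value used here only serves to separate the two made-up examples.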

  Google Colab has its own API library that lets us browse our local machine for the file we would like to use and upload it for this program. Unfortunately, Google only keeps the uploaded file for 24 hours before the system is flushed and reset. The following code lets us select the specific file that is referenced later in the program and upload it for use. The for loop in this section displays the name and size in bytes of the file so that the user knows the upload succeeded. Depending on the file, the upload takes 5-10 minutes; a percentage indicator shows the progress.

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(name = fn, length = len(uploaded[fn])))

  This one-liner is not needed for proper execution of the program, but it lets the user check that the file from the code snippet above was uploaded properly. If it was, running this code will display the file name in the current directory. The file name shown will match the file name that was shown above.

In [0]:
!ls

DARPA1999_week1mon  sample_data
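
The same check can also be done from Python rather than with a shell command. The sketch below is an optional alternative, not part of the original notebook; the helper name describe_file is hypothetical, and the file name is the one shown in the listing above.

```python
import os


def describe_file(path):
    """Return (exists, size_in_bytes); the size is None when the file is missing."""
    if os.path.isfile(path):
        return True, os.path.getsize(path)
    return False, None


# File name taken from the directory listing above
exists, size = describe_file("DARPA1999_week1mon")
print("Found:", exists, "Size in bytes:", size)
```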


  Continuing on to the first major piece of the program, the following snippet processes the data and defines a function that will be used later in the main part of the program. The snippet begins by reading the previously uploaded file row by row, which lets us extract the most important pieces of information held in the CSV file. Once the data has been read in, the protocol column is used to keep only the 'TCP' packets, which are placed in their own DataFrame. Later, this filter will be generalized so that the user can look for any protocol present in the CSV file.
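
One way such a general protocol filter might look is sketched below. The function name filter_by_protocol and the sample rows are made up for illustration; only the 'Protocol' column name comes from the notebook.

```python
import io

import pandas as pd


def filter_by_protocol(frame, protocol):
    """Return the rows whose 'Protocol' column equals `protocol`, ignoring case."""
    mask = frame["Protocol"].str.upper() == protocol.upper()
    return frame.loc[mask]


# Tiny made-up capture that reuses a few of the notebook's column names
sample_csv = io.StringIO(
    "No.,Time,Source IP,Destination IP,Protocol,Length\n"
    "1,0.000,10.0.0.1,10.0.0.2,TCP,60\n"
    "2,0.010,10.0.0.2,10.0.0.9,DNS,74\n"
    "3,0.020,10.0.0.1,10.0.0.2,TCP,60\n"
)
sample = pd.read_csv(sample_csv, delimiter=",", header=0)

tcp_only = filter_by_protocol(sample, "tcp")
print(tcp_only)
```

Passing the protocol as an argument keeps the rest of the pipeline unchanged when the user wants to examine, for example, DNS instead of TCP.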

Moving on to the newly defined function, it takes two values. The first is the position at which to start looking in the previously filtered protocol data. The second is the packet data to examine, which comes from the DataFrame. We use DataFrames because, in our experience, they were the most useful structure and took up the least memory: using regular arrays led to Random Access Memory (RAM) overloads, Google runtime resets, and, in the worst case, computer resets. DataFrames are also useful because they allow the rows and columns of the data being processed to be labeled. Inside the function, a set of rules keeps it running properly and ensures that we obtain the correct information. The rules are as follows:

1. If the Source IP does not match the Source IP of the first packet in the sequence, the loop will break.
2. If the Destination IP does not match the Destination IP of the first packet in the sequence, the loop will break.
3. If a sequence contains fewer than 3 packets in total, it will not be included in the final DataFrame, because so few packets are not sufficient to produce an accurate prediction of the sequences to follow.
4. If the current packet did not take place within 1 second of the first packet in the sequence, the loop will break.

These four rules are the most important for the successful completion of the program. Once these requirements are met, the function generates the sequences present in the dataset so that they can be placed into a labeled DataFrame. The counter keeps track of the number of packets in each sequence that it encounters, which will be helpful later in predicting future sequences: the more packets a sequence contains, the more accurate the prediction will be. Finally, the function returns two values: the number of packets in the sequence and the sequence itself.
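
The four rules, read literally, can be sketched on a small in-memory capture. This is a simplified illustration and not the notebook's get_sequence function: the names next_sequence and all_sequences, the tuple layout, and the sample packets are all assumptions made for this example.

```python
# Each packet is a tuple: (number, time in seconds, source IP, destination IP)

def next_sequence(packets, pos, window=1.0):
    """Collect packets from `pos` that share the first packet's source and
    destination IP and arrive within `window` seconds of it (rules 1, 2, 4)."""
    _, t1, src1, dst1 = packets[pos]
    sequence = []
    for packet in packets[pos:]:
        _, t, src, dst = packet
        if src != src1 or dst != dst1:  # rules 1 and 2
            break
        if t - t1 > window:  # rule 4
            break
        sequence.append(packet)
    return sequence


def all_sequences(packets, min_length=3):
    """Walk the capture, keeping sequences of at least `min_length` packets (rule 3)."""
    kept = []
    pos = 0
    while pos < len(packets):
        sequence = next_sequence(packets, pos)
        pos += len(sequence)  # always at least 1, so the loop terminates
        if len(sequence) >= min_length:
            kept.append(sequence)
    return kept


packets = [
    (1, 0.00, "10.0.0.1", "10.0.0.2"),
    (2, 0.10, "10.0.0.1", "10.0.0.2"),
    (3, 0.20, "10.0.0.1", "10.0.0.2"),
    (4, 0.30, "10.0.0.3", "10.0.0.2"),  # different source: starts a new sequence
    (5, 0.40, "10.0.0.3", "10.0.0.2"),
    (6, 2.00, "10.0.0.1", "10.0.0.2"),
    (7, 2.10, "10.0.0.1", "10.0.0.2"),
    (8, 2.20, "10.0.0.1", "10.0.0.2"),
    (9, 3.50, "10.0.0.1", "10.0.0.2"),  # more than 1 second after packet 6
]

for seq in all_sequences(packets):
    print([p[0] for p in seq])
```

Packets 4 and 5 form a sequence of only two packets and packet 9 stands alone, so rule 3 discards both.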

In [0]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import time
import os 
import psutil

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', None)

# DATA PREPROCESSING------------------------------------------------------------
data = pd.read_csv("DARPA1999_week1mon", delimiter=',', header=0, quotechar='"')
tcp_data = data.loc[data['Protocol']=='TCP']

# Gets the next input sequence of packets
def get_sequence(pos, input_data):
    t1 = input_data[pos][1]  # Timestamp of the first packet in the sequence
    src1 = input_data[pos][2]  # Source IP of the first packet in the sequence
    num_packets = 0  # Store the number of packets in the sequence
    sequence = []

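    # Look at no more than 320 packets, and never read past the end of the data
    for x in range(pos, min(pos + 320, len(input_data))):
        # if neither the source nor the destination IP of the packet matches the source IP of the first one in the sequence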
        if (input_data[x][2] != src1 and input_data[x][3] != src1 and len(sequence)>3):
            break
        # if the current one did not happen within 1 second of the first one in the sequence
        elif ((input_data[x][1]-t1)>1 and len(sequence)>3):
            break
        else:
            row = input_data[x]
            sequence.append(row)  # Append the sequence with the current packet
            num_packets = num_packets + 1  # Update the number of packets

    return num_packets, sequence

The following snippet of code is the main part of the program. This is where the function described above is used and where the filtered dataset that was previously read in is split into train and test data. The train data is the data we will use to fit the model; the test data is the portion we will use to test the prediction functions. After the split, the train data is converted into a NumPy array (np.array), which can hold arbitrary objects and from which the DataFrame is built. Our DataFrame currently comprises 8 columns:
1. Packet Number (No.)
2. Time
3. Source IP
4. Destination IP
5. Protocol
6. Length
7. Source Port
8. Destination Port

These 8 columns are currently used as the labels in the DataFrame; we will add or remove columns as we find necessary. The while loop is where the program starts working. The rules for this loop are as follows:
1. If the counting variable reaches or exceeds the length of the filtered train data array, the loop will break.
2. If the length of the filtered train data array minus the count is 3 or fewer, the loop will break.

Rule 2 exists because we exclude any sequences shorter than three packets. The count can therefore reach a number smaller than the length of the filtered train data array while too little data remains to form another sequence; this rule keeps the loop from going out of bounds. After each sequence is found, the loop informs the user of the number of sequences it has processed so far. The lines that print the number of packets in the sequence and the sequence itself are present in the code but commented out. After all sequences have been accounted for and the loop ends, the program prints the execution time in seconds and finishes.

In [0]:
# Split into train, test data
train, test = train_test_split(tcp_data, test_size=0.33, random_state=0, shuffle=False)
print("Train: \n", train)
np_data = np.array(train)

# Get all of the sequences
count = 1  # Store the total number of processed packets

num_sequences = 0  # Store the total number of sequences

all_sequences_df = pd.DataFrame(
    columns=["No.", "Time", "Source IP", "Destination IP", "Protocol", "Length", "Source Port", "Destination Port"])

start_time = time.time()

while (count < len(np_data) and len(np_data)-count>3):
    num_packets, sequence = get_sequence(count, np_data)  # Get the next sequence
    count = count + num_packets  # Update the number of processed packets

    sequence_df = pd.DataFrame(data=sequence, columns = ["No.", "Time", "Source IP", "Destination IP", "Protocol", "Length", "Source Port", "Destination Port"])
    #print('Number of packets in sequence: ', num_packets)
    #print("Sequence: \n", sequence_df)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
    all_sequences_df = pd.concat([all_sequences_df, sequence_df])

    num_sequences = num_sequences + 1  # Update the number of sequences
    print("Sequence Number: ", num_sequences)
    print("=========================================================\n")

print("\n\n\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@")
print("-----> FINISHED READING SEQUENCES <-----")
print("Time of Execution: %s seconds" % (time.time() - start_time))
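
Because the split at the top of this cell is called with shuffle=False, the train data is the earliest part of the capture and the test data is the latest part. The sketch below shows that idea with plain slicing. The function name chronological_split is hypothetical, and rounding the test size up is an assumption about how scikit-learn sizes the split rather than something stated in the notebook.

```python
import math


def chronological_split(rows, test_size=0.33):
    """Split without shuffling: the earliest rows train, the latest rows test."""
    n_test = math.ceil(test_size * len(rows))  # assumed rounding rule
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


packet_numbers = list(range(1, 11))  # stand-in for ten time-ordered packets
train_part, test_part = chronological_split(packet_numbers)
print("Train:", train_part)
print("Test: ", test_part)
```

Keeping the rows in time order matters here because the sequences are built from consecutive packets; shuffling first would scatter packets that belong to the same exchange.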

After all of the above steps have taken place, the following code snippet prints the DataFrame and saves it to the system as a CSV file so that the user has an accessible copy.

In [3]:
print(all_sequences_df)

# convert dataframe to csv file 
all_sequences_df.to_csv("/content/SequenceCSV.csv", header = True, index = True)



       No.          Time       Source IP  Destination IP Protocol Length  Source Port  Destination Port
0       49     37.294817  172.16.113.105   196.37.75.158      TCP     60         79.0            1024.0
1       50     37.295017   196.37.75.158  172.16.113.105      TCP     60       1024.0              79.0
2       51     37.295563   196.37.75.158  172.16.113.105      TCP     60       1024.0              79.0
3       52     37.307251  172.16.113.105   196.37.75.158      TCP     60         79.0            1024.0
4       54     37.327150  172.16.113.105   196.37.75.158      TCP     60         79.0            1024.0
..     ...           ...             ...             ...      ...    ...          ...               ...
3   849259  25253.689354  172.16.112.207   194.27.251.21      TCP     60      15901.0              23.0
0   849262  25253.710977  197.182.91.233   172.16.114.50      TCP     60       8803.0              23.0
1   849265  25253.779422  172.16.112.207   194.27.251.21      TC