In [2]:
import pandas as pd
import numpy as np 
from sklearn import preprocessing 
from sklearn.preprocessing import StandardScaler

This code is submitted as part of Project 2 for the subject COMP90037 (Security Analytics) at the University of Melbourne.
     
    -------------------------------------------
    COMP90037 Security Analytics - Project 2 
    Machine learning based Threat detection

    Author : Mohammed Ahsan Kollathodi 
    Student id: 1048942.
    

### The principal aim of this code is to pre-process the given dataset in order to clean it and remove noise.

### Train data.

The dataset provided contains the NetFlow data for a network under cyberattack. Each line of the dataset includes the following 15 fields: (1) stream ID, (2) timestamp, (3) duration, (4) protocol, (5) source IP address, (6) source port, (7) direction, (8) destination IP address, (9) destination port, (10) state, (11) source type of service, (12) destination type of service, (13) the total number of packets, (14) the number of bytes transferred in both directions, (15) the number of bytes transferred from the source to the destination.

I have not labelled the stream ID, as it is not relevant to this project.

In [3]:
# Train dataframe.
train_df = pd.read_csv('trainingdata_cleaned.csv', sep=',')

In [4]:
# Test dataframe. 
test_df = pd.read_csv('testdata_cleaned.csv', sep=',')

In [5]:
# Combine the train and test sets into a single DataFrame; the outer index keys ('train', 'test') keep them separable.
selected_features = pd.concat([train_df,test_df], keys=['train','test'])
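The keyed concatenation above can be illustrated on toy data. This is a minimal sketch, not the project's data: `toy_train`, `toy_test`, and their values are hypothetical stand-ins for `train_df` and `test_df`.

```python
import pandas as pd

# Hypothetical stand-ins for train_df and test_df.
toy_train = pd.DataFrame({"total_bytes": [100, 250, 75]})
toy_test = pd.DataFrame({"total_bytes": [60, 900]})

# keys= adds an outer index level that records which frame each row came from.
combined = pd.concat([toy_train, toy_test], keys=["train", "test"])

# Selecting on the outer level recovers each partition with its original row index.
recovered_train = combined.loc["train"]
recovered_test = combined.loc["test"]

print(combined.index.nlevels)
print(recovered_train)
print(recovered_test)
```

Keeping both sets in one frame lets later steps select columns once and still split the rows with `.loc['train']` and `.loc['test']`.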


In [6]:
# We need to scale the numerical features. 
numerical_features = selected_features[['num_total_packets','total_bytes','src_bytes','packets_in_Sec','bytes_total_in_Sec','Source_Bytes_Sec']]


In [7]:
# Reassign instead of using inplace=True: numerical_features is a slice of
# selected_features, and an in-place replace on it raises SettingWithCopyWarning.
numerical_features = numerical_features.replace([np.inf, np.nan], -1)


In [8]:
numerical_features.isna().any()

num_total_packets     False
total_bytes           False
src_bytes             False
packets_in_Sec        False
bytes_total_in_Sec    False
Source_Bytes_Sec      False
dtype: bool
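The sentinel replacement checked above can be sketched on toy data. The notebook does not show how the per-second features were computed, so this example assumes they come from dividing by a `duration` column, which is where infinities and NaNs would typically arise. The toy frame is hypothetical, and replacing `-np.inf` as well is an addition not present in the original cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; duration of 0 or NaN produces inf/NaN rates.
toy = pd.DataFrame({
    "total_bytes": [100.0, 40.0, 0.0, 50.0],
    "duration": [2.0, 0.0, 0.0, np.nan],
})
toy["bytes_total_in_Sec"] = toy["total_bytes"] / toy["duration"]

# .copy() makes the selection an independent frame, so edits to it never
# touch (or warn about) the frame it was selected from.
rates = toy[["bytes_total_in_Sec"]].copy()

# Map non-finite values to the sentinel -1 (assumption: -inf is handled too).
rates = rates.replace([np.inf, -np.inf, np.nan], -1)

print(rates)
```

A sentinel of -1 cannot occur as a genuine rate, so it stays distinguishable, although it will still influence the mean and variance used for scaling.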

In [9]:
scaler = StandardScaler() 
scaler.fit(numerical_features.loc['train'])

StandardScaler()

In [13]:
train_numerical_df = pd.DataFrame(scaler.transform(numerical_features.loc['train']), columns=numerical_features.columns,
                                   index=numerical_features.loc['train'].index)

In [15]:
test_numerical_df = pd.DataFrame(scaler.transform(numerical_features.loc['test']), columns=numerical_features.columns,
                                   index=numerical_features.loc['test'].index)
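The scaler is fitted on the training rows only and then applied to both sets, so the test data is standardised with the training mean and standard deviation. A minimal sketch on hypothetical toy values (`toy_train`, `toy_test`, and `toy_scaler` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the numerical features.
toy_train = pd.DataFrame({"total_bytes": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"total_bytes": [40.0]})

# Fit on the training rows only, so no test statistics leak into the scaling.
toy_scaler = StandardScaler().fit(toy_train)

train_scaled = toy_scaler.transform(toy_train)
test_scaled = toy_scaler.transform(toy_test)

# The same transform by hand: (x - train mean) / train population std.
train_values = toy_train.to_numpy()
manual = (toy_test.to_numpy() - train_values.mean(axis=0)) / train_values.std(axis=0)

print(test_scaled, manual)
```

Because the statistics come from the training set, scaled test values need not have zero mean or unit variance.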

In [18]:
# Select the categorical features; these are label-encoded in the cells below.
categorical_features = selected_features[['protocol', 'src_ip', 'src_port','direction',
                         'dst_ip', 'dst_port', 'state', 'srctype_service','dsttype_service']]

In [19]:
encoderset = {}

for column in categorical_features :
    print(column)
    encoderset[column] = preprocessing.LabelEncoder()
    encoderset[column].fit(categorical_features[column].astype(str))

protocol
src_ip
src_port
direction
dst_ip
dst_port
state
srctype_service
dsttype_service


In [20]:
# Encode each test column with the encoder fitted for that column.
test_categorical_dataframe = categorical_features.loc['test'].apply(lambda col: encoderset[col.name].transform(col.astype(str)))

In [21]:
# Encode each train column with the encoder fitted for that column.
train_categorical_dataframe = categorical_features.loc['train'].apply(lambda col: encoderset[col.name].transform(col.astype(str)))
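The encoders above are fitted on the combined train and test rows. One likely reason is that `LabelEncoder` raises an error when asked to transform a label it has not seen during fitting. A minimal sketch on a hypothetical `protocol` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns; "icmp" appears only in the test set.
toy_train = pd.Series(["tcp", "udp", "tcp"], name="protocol")
toy_test = pd.Series(["icmp", "udp"], name="protocol")

# Fitting on train + test means every label receives an integer code.
encoder = LabelEncoder().fit(pd.concat([toy_train, toy_test]).astype(str))
train_codes = encoder.transform(toy_train.astype(str))
test_codes = encoder.transform(toy_test.astype(str))

# An encoder fitted on the training column alone rejects the unseen label.
train_only = LabelEncoder().fit(toy_train.astype(str))
try:
    train_only.transform(toy_test.astype(str))
    unseen_label_failed = False
except ValueError:
    unseen_label_failed = True

print(list(encoder.classes_), train_codes, test_codes, unseen_label_failed)
```

Note that the integer codes follow the sorted label order and carry no numeric meaning, which matters for high-cardinality fields such as IP addresses and ports.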

In [22]:
# Join the scaled numerical features and the encoded categorical features column-wise.
data_train = pd.concat([train_numerical_df, train_categorical_dataframe], axis=1)
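`pd.concat(..., axis=1)` aligns rows by index label rather than by position, which is why the scaled frames were rebuilt earlier with the original row index. A minimal sketch with hypothetical toy frames shows what happens when the indexes match and when they do not:

```python
import pandas as pd

# Hypothetical toy frames sharing the same row index.
numeric_part = pd.DataFrame({"total_bytes": [0.5, -1.2]}, index=[0, 1])
categorical_part = pd.DataFrame({"protocol": [2, 0]}, index=[0, 1])

# Matching indexes: rows line up one-to-one.
joined = pd.concat([numeric_part, categorical_part], axis=1)

# Mismatched indexes: concat takes the union of labels and fills gaps with NaN.
shifted = categorical_part.copy()
shifted.index = [1, 2]
misaligned = pd.concat([numeric_part, shifted], axis=1)

print(joined)
print(misaligned)
```

Checking that the joined frame has the same number of rows as its inputs and contains no NaN is a cheap guard against silent misalignment.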

In [25]:
# Convert the output to CSV. 
data_train.to_csv('train_data_of_A1.csv', sep=',', index=False)

In [23]:
# The test data comprises the concatenated numerical and categorical features.
data_test = pd.concat([test_numerical_df, test_categorical_dataframe], axis=1)


In [24]:
# Convert the test data into output CSV.
data_test.to_csv('test_data_of_A1.csv', sep=',', index=False)