### Guodong SUN & Liang WANG
<h1><center>Malicious Data detection based on Neural Network</center></h1>
<h3><center>SR2I track project, supervised by: Jean-Philippe MONTEUUIS</center></h3>
<center>17 June 2019</center>

# Contents

### 1. [Analyzing the dataset](#paragraph1)
### 2. [Extracting the feature vectors](#paragraphe2)
### 3. [Training and testing the Neural Network](#paragraphe3)
### 4. [Comparison with other models](#paragraphe4)
### 5. [Testing the model](#paragraphe5)
<br/><br/>
Note that all the helper functions used in this notebook are defined in the `facilities` file; refer to it if needed.

## 1. Analyzing the dataset <a name = "paragraph1"></a>

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np

from keras.models import Sequential
from keras.layers import Dense,Activation
from sklearn.model_selection import train_test_split

from ipynb.fs.defs.facilities import *

Using TensorFlow backend.


In [122]:
data_file = ['attack1withlabels', 'attack2withlabels', 'attack4withlabels',
             'attack8withlabels','attack16withlabels']
data_atk1 = read_raw_data(data_file[0])
data_atk2 = read_raw_data(data_file[1])
data_atk4 = read_raw_data(data_file[2])
data_atk8 = read_raw_data(data_file[3])
data_atk16 = read_raw_data(data_file[4])

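# Rescale the labels so that a malicious row is 1 in every dataset
# (the raw label is assumed to equal the attack type).
data_atk2['Label'] = data_atk2['Label']/2
data_atk4['Label'] = data_atk4['Label']/4
data_atk8['Label'] = data_atk8['Label']/8
data_atk16['Label'] = data_atk16['Label']/16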

raw_data = [data_atk1, data_atk2, data_atk4, data_atk8, data_atk16]
atk_type = [1,2,4,8,16]

### 1.1 Every BSM whose distance exceeds the line-of-sight range (800 m in this project) is malicious

In [18]:
for i in range(5):
    data = raw_data[i]
    # First get the positions of the BSMs whose distance exceeds 800 m.
    Plus800_index = check_range(data, 800)
    # Then collect the distinct labels of those BSMs.
    data_Plus800 = data.iloc[Plus800_index]
    label_this = np.unique(np.array(data_Plus800.iloc[:,11]))
    print('For all the data for attack type', atk_type[i], ', The labels are ', label_this)

For all the data for attack type 1 , The labels are  [1]
For all the data for attack type 2 , The labels are  [1]
For all the data for attack type 3 , The labels are  [1]
For all the data for attack type 4 , The labels are  [1]
For all the data for attack type 5 , The labels are  [1]
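
The helper `check_range` lives in the facilities file and is not shown here. Below is a minimal pure-Python sketch of the test it presumably performs, assuming the distance is the Euclidean distance between the receiver position (`re_x`, `re_y`) and the claimed transmitter position (`tr_x`, `tr_y`). The function name `indices_beyond_range` and the sample rows are hypothetical.

```python
import math


def indices_beyond_range(rows, max_range):
    """Return the positions of rows whose sender-receiver distance exceeds max_range.

    Each row is a dict with the keys re_x, re_y, tr_x, tr_y (assumed column names).
    """
    beyond = []
    for position, row in enumerate(rows):
        distance = math.hypot(row["tr_x"] - row["re_x"], row["tr_y"] - row["re_y"])
        if distance > max_range:
            beyond.append(position)
    return beyond


# Hypothetical messages: the second one claims a position about 1000 m away.
sample_rows = [
    {"re_x": 0.0, "re_y": 0.0, "tr_x": 300.0, "tr_y": 400.0},
    {"re_x": 0.0, "re_y": 0.0, "tr_x": 600.0, "tr_y": 800.0},
]
print(indices_beyond_range(sample_rows, 800))
```

The returned positions play the role of `Plus800_index` in the cell above: they select the rows whose labels are then inspected.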


### 1.2 Within a communication session between a sender and a receiver, all messages are either malicious or normal

In [51]:
for i in range(5):
    data = raw_data[i]
    print('************For the attack type',atk_type[i], '************' )
    # Show the row-level and session-level statistics of the dataset.
    statistics = check_session(data)

************For the attack type 1 ************
There are  387516 rows in the dataset
267305 rows are normal, i.e.,  0.689790873151044 percent of rows in the dataset
267305 rows are malicious, i.e.,  0.689790873151044 percent of rows in the dataset
There are  30588 sessions in the dataset
20973 session are normal, i.e.,  0.6856610435464888 percent of sessions in the dataset 
9615 session are malicious, i.e.,  0.3143389564535112 percent of sessions in the dataset 
For all the session, there are only 1 kind of label, in other word, a session is whether attack or normal
************For the attack type 2 ************
There are  387516 rows in the dataset
267305 rows are normal, i.e.,  0.689790873151044 percent of rows in the dataset
387516 rows are malicious, i.e.,  1.0 percent of rows in the dataset
There are  30588 sessions in the dataset
20973 session are normal, i.e.,  0.6856610435464888 percent of sessions in the dataset 
9615 session are malicious, i.e.,  0.3143389564535112 percent of
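
`check_session` is also defined in the facilities file. The property it reports can be checked along the following lines. This is a sketch only, assuming a session is identified by the (`re_ID`, `tr_ID`) pair; the function name and the rows are made up for illustration.

```python
from collections import defaultdict


def mixed_label_sessions(rows):
    """Return the sessions that contain more than one distinct label.

    A session is assumed to be the (receiver ID, transmitter ID) pair.
    """
    labels_per_session = defaultdict(set)
    for row in rows:
        labels_per_session[(row["re_ID"], row["tr_ID"])].add(row["Label"])
    return [session for session, labels in labels_per_session.items() if len(labels) > 1]


# Hypothetical rows: each of the two sessions carries a single label.
sample_rows = [
    {"re_ID": 2446, "tr_ID": 2455, "Label": 1},
    {"re_ID": 2446, "tr_ID": 2455, "Label": 1},
    {"re_ID": 2446, "tr_ID": 2460, "Label": 0},
]
# An empty list means that no session mixes normal and malicious messages.
print(mixed_label_sessions(sample_rows))
```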

<br/><br/>
## 2. Extracting the feature vectors <a name = "paragraphe2" ></a>

We extract the six features described in the paper.

In [None]:
for i in range(5):
    data = raw_data[i]
    data_with_features = add_feature_vectors(data)
    put_csv(data_with_features, atk_type[i])

We save the feature vectors to CSV files for further processing.

<br/><br/>
## 3. Training and testing the Neural Network <a name = "paragraphe3" ></a>

In [57]:
vector_file = ['attack1with7FeatureVector', 'attack2with7FeatureVector', 'attack4with7FeatureVector',
             'attack8with7FeatureVector','attack16with7FeatureVector']

data_vector1 = read_vector_data(vector_file[0])
data_vector2 = read_vector_data(vector_file[1])
data_vector4 = read_vector_data(vector_file[2])
data_vector8 = read_vector_data(vector_file[3])
data_vector16 = read_vector_data(vector_file[4])

data_vector2['Label'] = data_vector2['Label']/2
data_vector4['Label'] = data_vector4['Label']/4
data_vector8['Label'] = data_vector8['Label']/8
data_vector16['Label'] = data_vector16['Label']/16

### 3.1 Detection

In [126]:
Hyper_parameter = ['sigmoid', 'mean_absolute_error', 'RMSprop', 'binary_accuracy']
NN_structure = [64, 16, 1]

In [127]:
model1 = NN_model(data_vector1, Hyper_parameter, NN_structure)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
***********************************************
For this model, the CCR is 0.987741091860085 , the Precision is 1.0 and the Recall is 0.9612803304078472
There are  30588 session in total
The training dataset has, 24470 sessions, there are  [0.31377197] malicious data, and [0.68622803] normal data
The testing dataset has 6118 sessions, and there are  [0.31660673] malicious data, and [0.68339327] normal data
The prediction includes  [0.3043478] malicious data, and [0.6956522] normal data
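
The three scores printed above can be computed from the true and predicted session labels as sketched below. The sketch assumes that CCR stands for the correct classification rate (plain accuracy) and that malicious is the positive class. The function name and the toy labels are illustrative and not taken from the facilities file.

```python
def detection_scores(y_true, y_pred):
    """Return (CCR, precision, recall) for binary labels, where 1 means malicious."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    ccr = correct / len(pairs)
    # Guard against empty denominators, e.g. when nothing is predicted malicious.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return ccr, precision, recall


# Toy example: three malicious sessions, one of which is missed.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
print(detection_scores(y_true, y_pred))
```

Precision is undefined when no session is predicted malicious; the sketch returns 0.0 in that case.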


In [128]:
model2 = NN_model(data_vector2, Hyper_parameter, NN_structure)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
***********************************************
All the prediction is normal
For this model, the CCR is 0.6835567178816607 , the Precision is 0 and the Recall is 0.0
There are  30588 session in total
The training dataset has, 24470 sessions, there are  [0.31381283] malicious data, and [0.68618717] normal data
The testing dataset has 6118 sessions, and there are  [0.31644328] malicious data, and [0.68355672] normal data
The prediction includes  [0.] malicious data, and [1.] normal data


In [129]:
model4 = NN_model(data_vector4, Hyper_parameter, NN_structure)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
***********************************************
For this model, the CCR is 0.9619840104421602 , the Precision is 0.9984334203655353 and the Recall is 0.892623716153128
There are  30642 session in total
The training dataset has, 24513 sessions, there are  [0.34369518] malicious data, and [0.65630482] normal data
The testing dataset has 6129 sessions, and there are  [0.34948605] malicious data, and [0.65051395] normal data
The prediction includes  [0.312449] malicious data, and [0.687551] normal data


In [130]:
model8 = NN_model(data_vector8, Hyper_parameter, NN_structure)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
***********************************************
For this model, the CCR is 0.9741830065359477 , the Precision is 0.9992716678805535 and the Recall is 0.8973185088293002
There are  30596 session in total
The training dataset has, 24476 sessions, there are  [0.25641445] malicious data, and [0.74358555] normal data
The testing dataset has 6120 sessions, and there are  [0.2498366] malicious data, and [0.7501634] normal data
The prediction includes  [0.2243464] malicious data, and [0.7756536] normal data


In [131]:
model16 = NN_model(data_vector16, Hyper_parameter, NN_structure)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
***********************************************
For this model, the CCR is 0.9705930403528835 , the Precision is 0.9878493317132442 and the Recall is 0.910414333706607
There are  30601 session in total
The training dataset has, 24480 sessions, there are  [0.29754902] malicious data, and [0.70245098] normal data
The testing dataset has 6121 sessions, and there are  [0.29178239] malicious data, and [0.70821761] normal data
The prediction includes  [0.26891032] malicious data, and [0.7310897] normal data


In [132]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way.
data_overall = pd.concat([data_vector1, data_vector2, data_vector4, data_vector8, data_vector16])
model_all = NN_model(data_overall, Hyper_parameter, NN_structure)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
***********************************************
For this model, the CCR is 0.9265104728294612 , the Precision is 0.9963462619449128 and the Recall is 0.7613014066358853
There are  153015 session in total
The training dataset has, 122412 sessions, there are  [0.30519067] malicious data, and [0.69480933] normal data
The testing dataset has 30603 sessions, and there are  [0.30431657] malicious data, and [0.69568343] normal data
The prediction includes  [0.23252623] malicious data, and [0.76747376] normal data


In [133]:
model_trained = [model1, model2, model4, model8, model16]

### 3.2 Classification
The first step was to identify whether a session is normal or malicious. We now train a second model that classifies the malicious sessions into the five attack types.

In [72]:
# Keep only the malicious sessions of each attack type.
malicious_1 = data_vector1.loc[data_vector1['Label'] == 1]
malicious_2 = data_vector2.loc[data_vector2['Label'] == 1]
malicious_4 = data_vector4.loc[data_vector4['Label'] == 1]
malicious_8 = data_vector8.loc[data_vector8['Label'] == 1]
malicious_16 = data_vector16.loc[data_vector16['Label'] == 1]

# Stack them (pd.concat replaces DataFrame.append, removed in pandas 2.0) and build the class labels.
malicious_data_overall = pd.concat([malicious_1, malicious_2, malicious_4, malicious_8, malicious_16])
labels_classfication = to_labels(malicious_1, malicious_2, malicious_4, malicious_8, malicious_16)
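
`to_labels` comes from the facilities file. Since the classifier below has a five-unit output layer and is trained with `categorical_crossentropy`, the labels are presumably one-hot vectors with one position per attack type. A minimal sketch of that encoding, with a hypothetical function name and toy group sizes:

```python
def one_hot_labels(group_sizes):
    """Build one-hot label rows for groups stacked in order.

    group_sizes[k] is the number of samples of class k. The result has
    sum(group_sizes) rows and len(group_sizes) columns.
    """
    n_classes = len(group_sizes)
    labels = []
    for class_index, size in enumerate(group_sizes):
        row = [1 if j == class_index else 0 for j in range(n_classes)]
        # Copy the row for every sample so the rows stay independent lists.
        labels.extend(list(row) for _ in range(size))
    return labels


# Toy example: two samples of class 0, one of class 1, one of class 2.
print(one_hot_labels([2, 1, 1]))
```

The rows must follow the same order in which the per-attack frames are stacked; otherwise features and labels no longer line up.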

In [97]:
Hyper_parameter = ['sigmoid', 'categorical_crossentropy', 'RMSprop', 'categorical_accuracy']
NN_structure = [128, 32, 5]

model_classification = Classification_model(malicious_data_overall, labels_classfication, Hyper_parameter, NN_structure)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
[[7.61016071e-01 4.14722654e-03 8.13893209e-02 2.59201659e-02
  1.27527216e-01]
 [0.00000000e+00 8.82702703e-01 1.18918919e-02 9.13513514e-02
  1.40540541e-02]
 [2.28102190e-03 0.00000000e+00 9.95437956e-01 1.82481752e-03
  4.56204380e-04]
 [0.00000000e+00 1.25944584e-03 6.29722922e-03 9.81108312e-01
  1.13350126e-02]
 [2.45495495e-01 2.81531532e-02 3.60360360e-02 8.89639640e-02
  6.01351351e-01]]
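
Each row of the matrix printed above sums to 1, which is consistent with a row-normalised confusion matrix. Assuming that rows are the true attack classes and columns the predicted ones (the notebook does not state this), such a matrix can be computed as in the following sketch. The function name and the toy labels are illustrative.

```python
def normalised_confusion_matrix(true_classes, predicted_classes, n_classes):
    """Return a confusion matrix whose rows (true classes) each sum to 1."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for true_class, predicted_class in zip(true_classes, predicted_classes):
        counts[true_class][predicted_class] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no samples yields a row of zeros instead of a division error.
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix


# Toy example with two classes: one sample of class 0 is misclassified.
true_classes = [0, 0, 0, 0, 1, 1]
predicted_classes = [0, 0, 0, 1, 1, 1]
print(normalised_confusion_matrix(true_classes, predicted_classes, 2))
```

Under that reading, the diagonal holds the per-class recall and the off-diagonal entries show which attack types are confused with each other.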


<br/><br/>
## 4. Comparison with other models <a name = "paragraphe4"></a>

This part was done by Liang and is shown in a separate file.

<br/><br/>
## 5. Testing the model <a name="paragraphe5"></a>

We randomly choose a session from the dataset to validate the models.

In [150]:
input_type = 4 # Index of the attack type the session is drawn from; choose 0 to 4
mode = 1 # Whether the chosen session is an attack or normal; choose 0 or 1
validation(input_type, mode)

The session we randomly choose from the dataset 16  is shown as follows


Unnamed: 0,re_time,re_ID,re_x,re_y,tr_time,tr_ID,tr_x,tr_y,tr_vx,tr_vy,RSSI,Label
353306,21824,2446,3800.3,5216.8,21824,2455,3634.4,5252.7,-4.4023,28.221,9.9173e-09,1.0
353372,21827,2446,3810.3,5255.5,21827,2455,3630.0,5280.9,-3.2759,28.385,1.9426e-08,1.0
353389,21828,2446,3799.5,5258.9,21828,2455,3630.0,5280.9,-3.2729,28.359,1.7566e-08,1.0
353412,21829,2446,3788.4,5258.3,21829,2455,3630.0,5280.9,-3.2716,28.348,6.0523e-09,1.0
353454,21831,2446,3767.3,5249.9,21831,2455,3630.0,5280.9,-3.2827,28.444,7.2461e-09,1.0
353479,21832,2446,3756.9,5245.4,21832,2455,3630.0,5280.9,-3.2819,28.437,5.5237e-09,1.0
353501,21833,2446,3746.5,5241.0,21833,2455,3630.0,5280.9,-2.3333,28.484,4.0297e-09,1.0
353568,21836,2446,3714.9,5228.3,21836,2455,3630.0,5280.9,-2.3312,28.458,3.2064e-09,1.0
353614,21838,2446,3693.9,5219.9,21838,2455,3630.0,5280.9,-1.1923,28.557,2.6447e-09,1.0
353639,21839,2446,3683.1,5216.3,21839,2455,3630.0,5280.9,-1.1894,28.488,2.8727e-09,1.0


*************We firstly check whether it is malicious by its corresponding model*************
The detection system said: The BSM is malicious!
The detection is correct!
*************We then check whether it is malicious by its general model*************
The detection system said: The BSM is malicious!
The detection is correct!
*************If it is malicious, we classify it*************
The classification is correct!
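
The flow printed above (detect first, then classify only if the session is flagged) can be summarised as follows. This is a sketch with stand-in callables rather than the trained Keras models; the 0.5 threshold and all function names are assumptions.

```python
ATTACK_TYPES = [1, 2, 4, 8, 16]  # same order as atk_type in the notebook


def two_stage_decision(features, detector, classifier, threshold=0.5):
    """Return None for a normal session, otherwise the predicted attack type.

    detector(features) returns the probability that the session is malicious.
    classifier(features) returns one score per attack type.
    """
    if detector(features) < threshold:
        return None
    scores = classifier(features)
    best = max(range(len(scores)), key=scores.__getitem__)
    return ATTACK_TYPES[best]


# Stand-in models for illustration only.
def always_malicious(features):
    return 0.9


def prefers_last_class(features):
    return [0.1, 0.0, 0.1, 0.1, 0.7]


print(two_stage_decision([0.0] * 6, always_malicious, prefers_last_class))
```

Keeping the two stages separate means the classifier is only ever asked about sessions that the detector has already flagged, which matches how it was trained on malicious data only.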
