# Detection of Malicious Data

## Part I: Preprocessing 

### Read the data file
The data file contains records structured like the Basic Safety Messages (BSMs) defined in the SAE J2735 standard, including each vehicle's location, speed, acceleration, heading, and brake status. We use the data to check the correctness of received messages and to classify the five types of position spoofing attack.

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

### Column data
1. Type (3=BSM)
2. Time BSM was received by the receiver
3. Receiver ID
4. Receiver X position
5. Receiver Y position
6. Receiver Z position
7. Time BSM was transmitted
8. Transmitter ID
9. BSM ID
10. Transmitter X position
11. Transmitter Y position
12. Transmitter Z position
13. Transmitter X velocity
14. Transmitter Y velocity
15. Transmitter Z velocity
16. RSSI (Received Signal Strength Indicator)
17. Label (0=Normal Behavior)
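
The list above is numbered from 1, while the `usecols` argument of `pd.read_csv` takes 0-based positions, so every index in the `cols` list of the next cell is one less than the item number here. The sketch below makes that mapping explicit. The field names are illustrative labels chosen for this example and are not part of the dataset.

```python
# Field names in the order listed above (names are illustrative assumptions).
BSM_FIELDS = [
    "type", "rx_time", "receiver_id", "rx_x", "rx_y", "rx_z",
    "tx_time", "transmitter_id", "bsm_id", "tx_x", "tx_y", "tx_z",
    "tx_vx", "tx_vy", "tx_vz", "rssi", "label",
]

def zero_based(names):
    """Return the 0-based column positions for the given field names."""
    return [BSM_FIELDS.index(name) for name in names]

wanted = ["receiver_id", "rx_x", "rx_y", "tx_x", "tx_y", "rssi", "label"]
cols = zero_based(wanted)
print(cols)  # [2, 3, 4, 9, 10, 15, 16]
```

Note that 0-based position 2 is the receiver ID (item 3 in the list), not the transmitter ID (item 8).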

In [20]:
cols = [2, 3, 4, 9, 10, 15, 16]  # receiver id, xr, yr, xt, yt, RSSI, label (0-based positions)
# The file has no header row, so header=None keeps the first record as data.
data = pd.read_csv("dataset/attack1withlabels.csv", usecols=cols, header=None)
data = data.dropna(axis=0, how="any")  # remove rows with missing values
data_id = data.iloc[:, 0]
# Attack 1: the claimed transmitter position is too far from the receiver
pos_xr = data.iloc[:, 1]
pos_yr = data.iloc[:, 2]
pos_xt = data.iloc[:, 3]
pos_yt = data.iloc[:, 4]
rssi = data.iloc[:, 5]
label = data.iloc[:, 6]


In [23]:
# We read the data from the dataset.
data = pd.read_csv("dataset/attack1withlabels.csv", usecols=cols, header=None)
X = data.iloc[:,:6]
y = data.iloc[:,6]

Two observations for attack 1:
* The fake distance exceeds the upper bound of the communication range, which is 800 m.
* The distance does not change, but the RSSI changes.

The first criterion is trivial to check, so we apply a filter for it.
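
The second observation is not implemented in this notebook. One hedged way to look at it is to group messages by receiver and claimed transmitter position and compare the RSSI values within each group. The sketch below does this for four messages whose values are copied from the data preview later in this notebook. The function name `rssi_spread` and the max/min ratio are assumptions made for illustration, and the grouping ignores small receiver movements.

```python
from collections import defaultdict

# (receiver_id, claimed_tx_x, claimed_tx_y, rssi), copied from the data preview.
messages = [
    (3166, 5560.0, 5820.0, 3.79e-09),
    (3166, 5560.0, 5820.0, 5.58e-09),
    (3166, 5560.0, 5820.0, 1.58e-09),
    (3166, 3781.6, 5256.0, 3.15e-09),
]

def rssi_spread(msgs):
    """Map each (receiver, claimed position) to its max/min RSSI ratio."""
    groups = defaultdict(list)
    for rid, x, y, rssi in msgs:
        groups[(rid, x, y)].append(rssi)
    return {key: max(vals) / min(vals) for key, vals in groups.items()}

spread = rssi_spread(messages)
for key, ratio in spread.items():
    print(key, round(ratio, 2))
```

A ratio well above 1 for an unchanged claimed position means the signal strength varies while the reported geometry does not, which is the pattern the second observation describes. Any threshold on this ratio would have to be tuned on the data.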

In [26]:
# Add a distance feature: Euclidean distance in the X-Y plane between the
# receiver (columns 3, 4) and the transmitter (columns 9, 10).
def distance(X):
    dx = X[3] - X[9]
    dy = X[4] - X[10]
    return np.sqrt(dx ** 2 + dy ** 2)

distance_series = distance(X)
X['distance'] = distance_series

In [25]:
# Filter out the BSMs whose distance is above the threshold of 800 m
def filtering_dis(X, y, threshold):
    # 1 marks a row to drop, 0 a row to keep
    return (X['distance'] > threshold).to_numpy().astype(float)

drop_index = filtering_dis(X, y, 800)
drop_rows = np.where(drop_index > 0)[0]  # positions of the rows to drop
X_filter = X.drop(X.index[drop_rows])
y_filter = y.drop(y.index[drop_rows])

| | 2 (id) | 3 (xr) | 4 (yr) | 9 (xt) | 10 (yt) | 15 (RSSI) |
|---|---|---|---|---|---|---|
| 0 | 562 | 5607.3 | 5965.1 | 5367.5 | 5930.1 | 1.420000e-08 |
| 1 | 562 | 5594.2 | 5973.5 | 5373.5 | 5934.6 | 3.120000e-09 |
| 2 | 562 | 5583.8 | 5980.2 | 5368.1 | 5924.2 | 2.260000e-09 |
| 3 | 3166 | 3642.0 | 5183.0 | 5560.0 | 5820.0 | 3.790000e-09 |
| 4 | 3166 | 3642.0 | 5183.0 | 3781.6 | 5256.0 | 3.150000e-09 |
| 5 | 3166 | 3642.0 | 5183.0 | 5560.0 | 5820.0 | 5.580000e-09 |
| 6 | 3166 | 3642.0 | 5183.0 | 4168.2 | 5286.1 | 2.940000e-09 |
| 7 | 3166 | 3642.0 | 5183.0 | 3609.1 | 5420.4 | 2.070000e-08 |
| 8 | 3166 | 3641.4 | 5186.5 | 3485.7 | 5192.2 | 3.580000e-09 |
| 9 | 3166 | 3640.9 | 5190.1 | 5560.0 | 5820.0 | 1.580000e-09 |
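
As a worked check of the range filter, the sketch below recomputes the receiver-to-transmitter distance for rows 3 and 4 of the preview above, which share a receiver position but report different transmitter positions. The helper name `planar_distance` and the constant `RANGE_LIMIT_M` are introduced here for illustration only.

```python
import math

RANGE_LIMIT_M = 800  # communication-range bound used by the filter

def planar_distance(xr, yr, xt, yt):
    """Euclidean distance between receiver and transmitter in the X-Y plane."""
    return math.hypot(xr - xt, yr - yt)

# Row 3: claimed transmitter position (5560.0, 5820.0)
d3 = planar_distance(3642.0, 5183.0, 5560.0, 5820.0)
# Row 4: claimed transmitter position (3781.6, 5256.0)
d4 = planar_distance(3642.0, 5183.0, 3781.6, 5256.0)

print(round(d3, 1), d3 > RANGE_LIMIT_M)  # roughly 2021 m, above the limit
print(round(d4, 1), d4 > RANGE_LIMIT_M)  # roughly 157.5 m, within the limit
```

Row 3 is therefore removed by the filter, while row 4 is kept with a `distance` value of about 157.53.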


In [46]:
# Drop the receiver position columns
X_feed = X_filter.drop([3, 4], axis = 1)
X_feed

| | 2 (id) | 9 (xt) | 10 (yt) | 15 (RSSI) | distance |
|---|---|---|---|---|---|
| 0 | 562 | 5367.5 | 5930.1 | 1.420000e-08 | 242.340752 |
| 1 | 562 | 5373.5 | 5934.6 | 3.120000e-09 | 224.101986 |
| 2 | 562 | 5368.1 | 5924.2 | 2.260000e-09 | 222.850825 |
| 4 | 3166 | 3781.6 | 5256.0 | 3.150000e-09 | 157.534631 |
| 6 | 3166 | 4168.2 | 5286.1 | 2.940000e-09 | 536.205231 |
| 7 | 3166 | 3609.1 | 5420.4 | 2.070000e-08 | 239.668876 |
| 8 | 3166 | 3485.7 | 5192.2 | 3.580000e-09 | 155.804300 |
| 10 | 3166 | 3596.1 | 5514.8 | 1.240000e-08 | 327.776036 |
| 11 | 3166 | 3479.0 | 5190.9 | 4.180000e-09 | 161.901977 |
| 12 | 3166 | 3600.3 | 5503.8 | 2.350000e-09 | 316.316376 |


In [72]:
# Split into training and validation data (first 80% / last 20%, in file order)
data_count = y_filter.count()
data_rate = 0.8 * data_count
data_rate = int(data_rate)
X_train = X_feed.iloc[:data_rate,:]
y_train = y_filter.iloc[:data_rate]
X_test = X_feed.iloc[data_rate:,:]
y_test = y_filter.iloc[data_rate:]

X_train.shape

(266402, 5)
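
The split above is positional: the first 80% of the rows are used for training and the rest for validation, without shuffling. The sketch below reproduces that logic on toy data to show the resulting shapes and to illustrate that the attack share can differ between the two parts when labels are not evenly spread through the file. The function name `positional_split` and the toy arrays are hypothetical.

```python
import numpy as np

def positional_split(features, labels, train_fraction=0.8):
    """Split by position: the first `train_fraction` of rows go to training."""
    cut = int(train_fraction * len(labels))
    return features[:cut], labels[:cut], features[cut:], labels[cut:]

# Toy data: 10 rows with 5 features each, attacks only near the end.
features = np.arange(50, dtype=float).reshape(10, 5)
labels = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

X_tr, y_tr, X_te, y_te = positional_split(features, labels)
print(X_tr.shape, X_te.shape)    # (8, 5) (2, 5)
print(y_tr.mean(), y_te.mean())  # attack share in each part
```

Comparing `y_train.mean()` with `y_test.mean()` on the real data is a quick way to see whether the two parts have a similar class mix.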

After preprocessing, we keep the following information (column labels are the 0-based positions in the file):
1. The receiver ID (column 2)
2. The transmitter position (columns 9 and 10)
3. The distance between transmitter and receiver (column 'distance')
4. The RSSI (column 15)

We feed this information into the training model.

## Keras implementation

In [73]:
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM  # LSTM is exported from keras.layers
import tensorflow
from sklearn import model_selection


In [74]:
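# Model: one LSTM layer followed by two dense layers
model = Sequential()
model.add(LSTM(32, input_shape=(None, 5), return_sequences=False))
model.add(Dense(8))  # the input size is inferred from the LSTM output
# A sigmoid output keeps predictions in (0, 1), as binary_crossentropy expects
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])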
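
For reference, the `binary_crossentropy` loss passed to `model.compile` treats the model output as the predicted probability of the positive class (here, an attack). The NumPy sketch below writes out the standard formula. The clipping constant `eps` is an assumption of this sketch, and it is not meant to mirror Keras's exact implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; `eps` clipping is an assumption of this sketch."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

# Same labels, once with mostly correct and once with mostly wrong probabilities.
confident = binary_crossentropy([0, 0, 1], [0.1, 0.1, 0.9])
wrong = binary_crossentropy([0, 0, 1], [0.9, 0.9, 0.1])
print(confident < wrong)  # True
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when confident predictions are wrong.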

In [79]:
print("The ratio of 0/all = ", 1 - sum(y_filter) / y_filter.shape[0])
X_train.shape

The ratio of 0/all =  0.8671063023456245


(266402, 1, 5)

In the filtered dataset, attacks account for 13.3% of the messages and normal behavior for 86.7%.
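
This class imbalance sets a reference point for accuracy: a classifier that always predicts "normal" is already right for every normal message. The sketch below computes that majority-class baseline from the ratio printed above. The function name is hypothetical, and the ratio is for the whole filtered dataset, so the validation part may have a somewhat different mix.

```python
def majority_baseline_accuracy(normal_fraction):
    """Accuracy of a classifier that always predicts the more common class."""
    return max(normal_fraction, 1.0 - normal_fraction)

normal_fraction = 0.8671063023456245  # ratio of label 0 in the filtered data
baseline = majority_baseline_accuracy(normal_fraction)
print(f"attack share: {1 - normal_fraction:.1%}, baseline accuracy: {baseline:.1%}")
```

Any accuracy reported for the model is best read against this baseline of roughly 86.7%.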

In [None]:
# Reshape to (samples, timesteps, features), the 3-D input the LSTM layer expects
X_train = np.reshape(X_train.values, (X_train.shape[0], 1, X_train.shape[1]))
X_test = np.reshape(X_test.values, (X_test.shape[0], 1, X_test.shape[1]))
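
Keras LSTM layers take 3-D input of shape (samples, timesteps, features). The reshape above inserts a time axis of length 1, so each BSM is treated as a one-step sequence. The sketch below shows the same operation on a small toy array (the array contents are hypothetical) and checks that the feature values are unchanged.

```python
import numpy as np

# Toy feature matrix: 4 samples with 5 features each.
flat = np.arange(20, dtype=float).reshape(4, 5)

# Insert a time axis of length 1: (samples, features) -> (samples, 1, features).
sequences = flat.reshape(flat.shape[0], 1, flat.shape[1])

print(flat.shape, sequences.shape)               # (4, 5) (4, 1, 5)
print(np.array_equal(sequences[:, 0, :], flat))  # True
```

With a single timestep per sample, the LSTM does not see any history across consecutive messages.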

In [None]:
# Fit the model
model.fit(X_train,y_train,epochs=10,batch_size=100)

In [125]:
# X_test is already a NumPy array after the reshape, so it has no .values attribute
cost = model.evaluate(X_test, y_test.values, batch_size=100)
print('loss \n', cost[0])
print('accuracy \n', cost[1])

loss 
 1.8329824162138717
accuracy 
 0.8862779837549901
