# 1. Introduction :

A computer worm is self-replicating malware that duplicates itself to spread to uninfected computers. Worms often exploit parts of an operating system that run automatically and are invisible to the user [1]. Internet worms have proliferated, and to defend against them many companies deploy network intrusion detection systems (NIDS). A NIDS typically detects worms by scanning packets for specific byte sequences, known as signatures, that match known attacks [2]. Because a worm cannot be detected until its signature has been created, new attacks are difficult to detect. In this report, we propose a machine learning approach to detect worms in real time.
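
To make the limitation of signature matching concrete, here is a minimal sketch of the idea. The signature names and byte patterns are illustrative assumptions, not entries from a real NIDS rule set.

```python
# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES = {
    "example-worm-a": b"\x90\x90\xeb\x1f",
    "example-worm-b": b"GET /default.ida?NNNN",
}


def match_signatures(payload: bytes) -> list:
    """Return the names of all stored signatures found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


known = b"GET /default.ida?NNNNNNNN HTTP/1.0"
novel = b"GET /default.ida?XXXXXXXX HTTP/1.0"  # variant with no stored signature

print(match_signatures(known))  # the stored pattern is found
print(match_signatures(novel))  # nothing matches, so the variant goes undetected
```

The second payload differs from the first by only a few bytes, yet it passes unnoticed until someone adds a new pattern to the database.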

# 2. Related Work :

In this section, we present a table listing the research publications that address the same problem (see the accompanying file MalwareDetection_LiteratureReview.xlsx).


# 3. Description of the Dataset used :

We obtained the CTU-13 dataset [3], which was captured on the CTU University network (Czech Republic, 2011). The dataset consists of thirteen captures (called scenarios) of different botnet samples. In each scenario, a specific malware sample was executed and the following features were extracted: Start Time, Duration, Protocol, Source IP address, Source Port, Direction, Destination IP address, Destination Port, State, sTos, dTos, Total Packets, Total Bytes, Source Bytes. We used 70% of the CTU-Malware-Capture-Botnet-50 (Scenario 9) dataset for training and 30% for testing.




In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

dataFrame = pd.read_csv('CTU-Malware-Capture-Botnet-50_Scenario_9/capture20110817.csv')

y = dataFrame['Label']

# Columns excluded from the feature matrix, including the target column
listOfFeaturesToDrop = [
    'StartTime',
    'Dur',
    'Proto',
    'SrcAddr',
    'Sport',
    'DstAddr',
    'Dport',
    'State',
    'Label',
]

X = dataFrame.drop(listOfFeaturesToDrop, axis=1)  # drop() returns a new DataFrame

# 70% of the rows for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# 4. Libraries used:

## 4.1 Pandas (0.23.4) : 

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.[4]

## 4.2 Numpy (1.15.4) :

NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. [5]

## 4.3 Matplotlib (2.1.1) :

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits. [6]

## 4.4 Scikit-learn (0.20.0) :

Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. [7]

## 4.5 Scipy (1.1.0) : 

SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering. Its core packages include NumPy, the SciPy library, Matplotlib, IPython, SymPy and pandas. [8]

## 4.6 Imbalanced-learn (0.4.3) :

Imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of scikit-learn-contrib projects. [9]
    
# 5. Model training : 

## 5.1 Data cleansing : 

In this part of the project, we first removed the columns 'sTos', 'dTos' and 'Dir' from the dataset (CTU-Malware-Capture-Botnet-50, Scenario 9 [3]). Then, we removed every row with a missing (NaN) source or destination port. Next, we removed every row with an invalid source or destination port. Finally, we converted the remaining columns to the appropriate types.

In [None]:
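def preprocessData(dataFrame):
    '''
    Perform the operations needed to convert
    the raw data into a clean data set.
    '''

    # dp is a helper module whose import is not shown in this notebook

    # Remove the unused columns
    dp.deleteColumn(dataFrame, 'sTos')
    dp.deleteColumn(dataFrame, 'dTos')
    dp.deleteColumn(dataFrame, 'Dir')

    # Remove the rows with a missing source or destination port
    dp.deleteNullRow(dataFrame, 'Sport')
    dp.deleteNullRow(dataFrame, 'Dport')

    # Remove the rows with an invalid source or destination port
    dp.deleteRowWhere(dataFrame, 'Sport', 'x')
    dp.deleteRowWhere(dataFrame, 'Dport', 'x')

    # Convert the remaining columns to the appropriate types
    dp.convertColumnToInt32(dataFrame, 'Sport')
    dp.convertColumnToInt32(dataFrame, 'Dport')
    dp.convertColumnToTimeStamp(dataFrame, 'StartTime')
    dp.convertColumnToFloat16(dataFrame, 'Dur')
    dp.convertColumnToInt16(dataFrame, 'TotPkts')

    # Replace the raw 'Label' values (the reports in section 5.4 show classes 0 and 1)
    dp.replaceColumn(dataFrame, 'Label')

    return dataFrame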
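
The helper module used above is not shown, so the following is only a sketch of how the same cleansing steps might look in plain pandas. It assumes that the ports arrive as strings and that the 'x' filter targets hexadecimal values such as `0x0303`; the function name `clean_flows` and the sample rows are hypothetical.

```python
import pandas as pd


def clean_flows(df: pd.DataFrame) -> pd.DataFrame:
    """Assumed pandas equivalent of the cleansing steps described above."""
    df = df.drop(columns=["sTos", "dTos", "Dir"])
    df = df.dropna(subset=["Sport", "Dport"])
    # Keep only rows whose ports are plain decimal numbers (drops e.g. '0x0303').
    for col in ("Sport", "Dport"):
        df = df[df[col].astype(str).str.isdigit()]
    df = df.astype({"Sport": "int32", "Dport": "int32"})
    df["StartTime"] = pd.to_datetime(df["StartTime"])
    return df.reset_index(drop=True)


raw = pd.DataFrame({
    "StartTime": ["2011/08/17 12:00:00", "2011/08/17 12:00:01", "2011/08/17 12:00:02"],
    "Sport": ["1025", None, "0x0303"],
    "Dport": ["80", "53", "443"],
    "Dir": ["->", "->", "<->"],
    "sTos": [0, 0, 0],
    "dTos": [0, 0, 0],
})
clean = clean_flows(raw)
print(clean)
```

Only the first sample row survives: the second has no source port and the third has a non-decimal one.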

## 5.2 Data discretization : 

In this part, we added the columns 'Sport_Dis', 'Dport_Dis', 'Proto_Dis' and 'State_Dis' to the dataset. The first two partition the 'Sport' and 'Dport' values into nominal intervals: 0-1023 (WELLKNOWN_PORTNUMBER), 1024-49151 (REGISTERED_PORTNUMBER) and 49152-65535 (DYNAMIC_PORTNUMBER). The other two convert the 'Proto' and 'State' values to nominal values.

In [None]:
def discretizeData(dataFrame):
    '''
    Perform the operations needed to reduce
    the number of distinct values in the data set.
    '''

    # dd is a helper module whose import is not shown in this notebook

    # Map the port numbers to the three port ranges
    dd.bucketingPortNumber(dataFrame, 'Sport_Dis', 'Sport')
    dd.bucketingPortNumber(dataFrame, 'Dport_Dis', 'Dport')

    # Encode the protocol and state values as nominal values
    dd.labelEncoder8(dataFrame, 'Proto_Dis', 'Proto')
    dd.labelEncoder16(dataFrame, 'State_Dis', 'State')

    return dataFrame
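
As an illustration of the two discretization steps, the sketch below buckets port numbers with `pd.cut` and encodes a nominal column with pandas category codes. It is an assumed stand-in for the helper functions, which are not shown; `bucket_ports`, `label_encode` and the sample values are hypothetical.

```python
import pandas as pd

# Port ranges and labels as listed in the text above.
PORT_BINS = [-1, 1023, 49151, 65535]
PORT_LABELS = ["WELLKNOWN_PORTNUMBER", "REGISTERED_PORTNUMBER", "DYNAMIC_PORTNUMBER"]


def bucket_ports(ports: pd.Series) -> pd.Series:
    """Map each port number to the label of its range (bins are right-inclusive)."""
    return pd.cut(ports, bins=PORT_BINS, labels=PORT_LABELS)


def label_encode(values: pd.Series) -> pd.Series:
    """Replace each distinct value with an integer code (sorted order of the values)."""
    return values.astype("category").cat.codes


flows = pd.DataFrame({
    "Sport": [80, 1024, 49152, 65535],
    "Proto": ["tcp", "udp", "tcp", "icmp"],
})
flows["Sport_Dis"] = bucket_ports(flows["Sport"])
flows["Proto_Dis"] = label_encode(flows["Proto"])
print(flows)
```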

## 5.3 Feature Generation : 

Based on the requirements document, we have generated the following features :

### 5.3.1 Connection-based features :

(Using a rolling window over the previous n netflows in which a given source or destination address appears in the traffic)

- Average bytes over the flow records in which SRCADDRESS appeared within the last n flow records (A_TotBytes_S)
- Average packets over the flow records in which SRCADDRESS appeared within the last n flow records (A_TotPkts_S)
- Number of appearances of SRCADDRESS in the last n/10 netflows (Nbr_App_S)
- Number of distinct source ports over the flow records in which SRCADDRESS appeared within the last n flow records (Dct_Sport_S)
- Number of distinct destination ports over the flow records in which SRCADDRESS appeared within the last n flow records (Dct_Dport_S)
- Number of distinct source IPs over the flow records in which SRCADDRESS appeared within the last n flow records (Dct_SrcAddr_S)

- Average bytes over the flow records in which DSTADDRESS appeared within the last n flow records (A_TotBytes_D)
- Average packets over the flow records in which DSTADDRESS appeared within the last n flow records (A_TotPkts_D)
- Number of appearances of DSTADDRESS in the last n/10 netflows (Nbr_App_D)
- Number of distinct source ports over the flow records in which DSTADDRESS appeared within the last n flow records (Dct_Sport_D)
- Number of distinct destination ports over the flow records in which DSTADDRESS appeared within the last n flow records (Dct_Dport_D)
- Number of distinct destination IPs over the flow records in which DSTADDRESS appeared within the last n flow records (Dct_DstAddr_D)
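
One possible reading of the source-address features listed above is sketched here on a toy frame. It is not the notebook's implementation: the function `src_window_features` and the sample flows are hypothetical, and the statistics are restricted to the flows sent by the given address, which is an assumption about the intended meaning.

```python
import numpy as np
import pandas as pd


def src_window_features(df: pd.DataFrame, src_addr: str, n: int) -> pd.DataFrame:
    """Per-flow features over the last n flows, restricted to flows sent by src_addr."""
    is_src = df["SrcAddr"].eq(src_addr)
    hits = is_src.astype(float)
    out = pd.DataFrame(index=df.index)

    # Average bytes: sum of the matching flows divided by how many matched.
    seen = hits.rolling(n, min_periods=1).sum()
    byte_sum = df["TotBytes"].where(is_src, 0.0).rolling(n, min_periods=1).sum()
    out["A_TotBytes_S"] = (byte_sum / seen).where(seen > 0, 0.0)

    # Distinct destination ports: non-matching flows are masked as NaN and ignored.
    out["Dct_Dport_S"] = (
        df["Dport"].where(is_src)
        .rolling(n, min_periods=1)
        .apply(lambda w: len(np.unique(w[~np.isnan(w)])), raw=True)
        .fillna(0.0)
    )

    # Appearances in the last n/10 flows (at least one flow for small n).
    short = max(n // 10, 1)
    out["Nbr_App_S"] = hits.rolling(short, min_periods=1).sum()
    return out


flows = pd.DataFrame({
    "SrcAddr": ["A", "B", "A", "A", "B"],
    "TotBytes": [100, 500, 300, 200, 50],
    "Dport": [80, 53, 443, 80, 53],
})
feats = src_window_features(flows, "A", n=3)
print(feats)
```

The destination-address features follow the same pattern with `DstAddr` in place of `SrcAddr`.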


### 5.3.2 Time-based features :

(Using a rolling window over the previous n minutes in which a given source or destination address appears in the traffic)

- Average bytes over the flow records in which SRCADDRESS appeared within the last n minutes (A_TotBytes_S)
- Average packets over the flow records in which SRCADDRESS appeared within the last n minutes (A_TotPkts_S)
- Number of distinct source ports over the flow records in which SRCADDRESS appeared within the last n minutes (Dct_Sport_S)
- Number of distinct source IPs over the flow records in which SRCADDRESS appeared within the last n minutes (Dct_SrcAddr_S)
- Number of appearances of SRCADDRESS within the last n/10 minutes (Nbr_App_S)

- Average bytes over the flow records in which DSTADDRESS appeared within the last n minutes (A_TotBytes_D)
- Average packets over the flow records in which DSTADDRESS appeared within the last n minutes (A_TotPkts_D)
- Number of distinct source ports over the flow records in which DSTADDRESS appeared within the last n minutes (Dct_Sport_D)
- Number of distinct destination IPs over the flow records in which DSTADDRESS appeared within the last n minutes (Dct_DstAddr_D)
- Number of appearances of DSTADDRESS within the last n/10 minutes (Nbr_App_D)
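
Time-based windows differ from count-based ones in that the number of flows per window varies. The sketch below shows this for the average-bytes feature using a pandas offset window on a `DatetimeIndex`. The function `time_window_mean` and the timestamps are hypothetical, and restricting the average to flows from the given address is an assumption about the intended meaning.

```python
import pandas as pd


def time_window_mean(df: pd.DataFrame, src_addr: str, minutes: int) -> pd.Series:
    """Average TotBytes of the flows sent by src_addr within the last `minutes` minutes."""
    is_src = df["SrcAddr"].eq(src_addr)
    window = f"{minutes}min"  # offset window; needs a sorted DatetimeIndex
    seen = is_src.astype(float).rolling(window).sum()
    byte_sum = df["TotBytes"].where(is_src, 0.0).rolling(window).sum()
    return (byte_sum / seen).where(seen > 0, 0.0)


flows = pd.DataFrame(
    {"SrcAddr": ["A", "B", "A", "A"], "TotBytes": [100, 500, 300, 200]},
    index=pd.to_datetime([
        "2011-08-17 12:00:00",
        "2011-08-17 12:00:30",
        "2011-08-17 12:01:00",
        "2011-08-17 12:02:40",
    ]),
)
avg_bytes = time_window_mean(flows, "A", minutes=2)
print(avg_bytes)
```

For the last flow, the two-minute window no longer covers the first two records, so only the flows at 12:01:00 and 12:02:40 contribute to the average.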





In [None]:
import numpy as np  # used by the rolling-window lambdas in this cell

# dd and dp are helper modules whose imports are not shown in this notebook


def generateSrcAddrFeaturesConnectionBased(dataFrame, srcAddr, windowSize):
    '''
    Generate connection-based features using the given
    source IP address and window size (in flow records).
    '''

    # Encode the source addresses; srcAddr_dis is compared below
    # against the encoded 'SrcAddr_Dis' values
    srcAddr_dis = dd.labelEncoder32(dataFrame, srcAddr, 'SrcAddr_Dis', 'SrcAddr')

    # Averages over the last windowSize flow records
    dataFrame['A_TotBytes_S'] = dataFrame['TotBytes'].rolling(windowSize).mean()
    dataFrame['A_SrcBytes_S'] = dataFrame['SrcBytes'].rolling(windowSize).mean()
    dataFrame['A_TotPkts_S'] = dataFrame['TotPkts'].rolling(windowSize).mean()

    # Distinct source ports, destination ports and source addresses in the window
    dataFrame['Dct_Sport_S'] = dataFrame['Sport'].rolling(windowSize).apply(lambda x: len(np.unique(x)), raw=False)
    dataFrame['Dct_Dport_S'] = dataFrame['Dport'].rolling(windowSize).apply(lambda x: len(np.unique(x)), raw=False)
    dataFrame['Dct_SrcAddr_S'] = dataFrame['SrcAddr_Dis'].rolling(windowSize).apply(lambda x: len(np.unique(x)), raw=False)

    # Number of appearances of srcAddr in the last windowSize // 10 flow records.
    # Counting the boolean mask directly also counts a match at position 0,
    # which counting the indices returned by np.where would miss.
    dataFrame['Nbr_App_S'] = dataFrame['SrcAddr_Dis'].rolling(windowSize // 10).apply(lambda x: np.count_nonzero(x == srcAddr_dis), raw=False)

    # Drop the leading rows that have no complete window
    dp.deleteNullRow(dataFrame, 'A_TotBytes_S')

    return dataFrame


def generateDstAddrFeaturesConnectionBased(dataFrame, dstAddr, windowSize):
    '''
    Generate connection-based features using the given
    destination IP address and window size (in flow records).
    '''

    # Encode the destination addresses; dstAddr_dis is compared below
    # against the encoded 'DstAddr_Dis' values
    dstAddr_dis = dd.labelEncoder32(dataFrame, dstAddr, 'DstAddr_Dis', 'DstAddr')

    # Averages over the last windowSize flow records
    dataFrame['A_TotBytes_D'] = dataFrame['TotBytes'].rolling(windowSize).mean()
    dataFrame['A_SrcBytes_D'] = dataFrame['SrcBytes'].rolling(windowSize).mean()
    dataFrame['A_TotPkts_D'] = dataFrame['TotPkts'].rolling(windowSize).mean()

    # Distinct source ports, destination ports and destination addresses in the window
    dataFrame['Dct_Sport_D'] = dataFrame['Sport'].rolling(windowSize).apply(lambda x: len(np.unique(x)), raw=False)
    dataFrame['Dct_Dport_D'] = dataFrame['Dport'].rolling(windowSize).apply(lambda x: len(np.unique(x)), raw=False)
    dataFrame['Dct_DstAddr_D'] = dataFrame['DstAddr_Dis'].rolling(windowSize).apply(lambda x: len(np.unique(x)), raw=False)

    # Number of appearances of dstAddr in the last windowSize // 10 flow records.
    # Counting the boolean mask directly also counts a match at position 0,
    # which counting the indices returned by np.where would miss.
    dataFrame['Nbr_App_D'] = dataFrame['DstAddr_Dis'].rolling(windowSize // 10).apply(lambda x: np.count_nonzero(x == dstAddr_dis), raw=False)

    # Drop the leading rows that have no complete window
    dp.deleteNullRow(dataFrame, 'A_TotBytes_D')

    return dataFrame



def generateSrcAddrFeaturesTimeBased(dataFrame, srcAddr, time):
    '''
    Generate time-based features using the given
    source IP address and window length (in minutes).
    '''

    window = str(time * 60) + 's'      # rolling window: minutes converted to seconds
    shortWindow = str(time * 6) + 's'  # one tenth of the window, used for Nbr_App_S

    # Time-based rolling windows require a DatetimeIndex
    dataFrame['timeStampIndex'] = pd.to_datetime(dataFrame['StartTime'])
    dataFrame.set_index('timeStampIndex', inplace=True)

    # Encode the source addresses; srcAddr_dis is compared below
    # against the encoded 'SrcAddr_Dis' values
    srcAddr_dis = dd.labelEncoder32(dataFrame, srcAddr, 'SrcAddr_Dis', 'SrcAddr')

    # Averages over the window
    dataFrame['A_TotBytes_S'] = dataFrame['TotBytes'].rolling(window).mean()
    dataFrame['A_SrcBytes_S'] = dataFrame['SrcBytes'].rolling(window).mean()
    dataFrame['A_TotPkts_S'] = dataFrame['TotPkts'].rolling(window).mean()

    # Distinct source ports and source addresses in the window
    # (distinct destination ports are not listed in the specification document)
    dataFrame['Dct_Sport_S'] = dataFrame['Sport'].rolling(window).apply(lambda x: len(np.unique(x)), raw=False)
    dataFrame['Dct_SrcAddr_S'] = dataFrame['SrcAddr_Dis'].rolling(window).apply(lambda x: len(np.unique(x)), raw=False)

    # Number of appearances of srcAddr within the last time / 10 minutes
    dataFrame['Nbr_App_S'] = dataFrame['SrcAddr_Dis'].rolling(shortWindow).apply(lambda x: np.count_nonzero(x == srcAddr_dis), raw=False)

    dp.deleteNullRow(dataFrame, 'A_TotBytes_S')

    return dataFrame



def generateDstAddrFeaturesTimeBased(dataFrame, dstAddr, time):
    '''
    Generate time-based features using the given
    destination IP address and window length (in minutes).
    '''

    window = str(time * 60) + 's'      # rolling window: minutes converted to seconds
    shortWindow = str(time * 6) + 's'  # one tenth of the window, used for Nbr_App_D

    # Time-based rolling windows require a DatetimeIndex
    dataFrame['timeStampIndex'] = pd.to_datetime(dataFrame['StartTime'])
    dataFrame.set_index('timeStampIndex', inplace=True)

    # Encode the destination addresses; dstAddr_dis is compared below
    # against the encoded 'DstAddr_Dis' values
    dstAddr_dis = dd.labelEncoder32(dataFrame, dstAddr, 'DstAddr_Dis', 'DstAddr')

    # Averages over the window
    dataFrame['A_TotBytes_D'] = dataFrame['TotBytes'].rolling(window).mean()
    dataFrame['A_SrcBytes_D'] = dataFrame['SrcBytes'].rolling(window).mean()
    dataFrame['A_TotPkts_D'] = dataFrame['TotPkts'].rolling(window).mean()

    # Distinct source ports and destination addresses in the window
    # (distinct destination ports are not listed in the specification document)
    dataFrame['Dct_Sport_D'] = dataFrame['Sport'].rolling(window).apply(lambda x: len(np.unique(x)), raw=False)
    dataFrame['Dct_DstAddr_D'] = dataFrame['DstAddr_Dis'].rolling(window).apply(lambda x: len(np.unique(x)), raw=False)

    # Number of appearances of dstAddr within the last time / 10 minutes
    dataFrame['Nbr_App_D'] = dataFrame['DstAddr_Dis'].rolling(shortWindow).apply(lambda x: np.count_nonzero(x == dstAddr_dis), raw=False)

    dp.deleteNullRow(dataFrame, 'A_TotBytes_D')

    return dataFrame



## 5.4 Model execution and results : 

#### Gradient boosting :
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. [10]
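
The stage-wise idea can be illustrated with a small NumPy sketch. For simplicity it uses squared loss on a regression toy problem with one-split decision stumps as weak learners, rather than the classification loss used in our experiment. All names (`fit_stump`, `gradient_boost`) and the toy data are hypothetical.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the maximum so both sides are non-empty
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        error = ((residual - np.where(left, left_value, right_value)) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def gradient_boost(x, y, n_stages=20, learning_rate=0.5):
    """Each stage fits a stump to the residuals, the negative gradient of squared loss."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps, losses = [], []
    for _ in range(n_stages):
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return base, stumps, losses


x = np.linspace(0.0, 6.0, 50)
y = np.sin(x)
base, stumps, losses = gradient_boost(x, y)
print(losses[0], losses[-1])  # training error after the first and the last stage
```

Each stump alone is a poor model, but because every stage corrects the remaining error of the ensemble, the training error shrinks as stages are added.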


For our experiment, we used the gradient boosting classifier to create the model; the results are shown in Tables 1 and 2.

- Table 1 : Accuracy rate, detection rate and true positives (without oversampling)

Accuracy of the GBM on test set (Without Oversampling): 0.985

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.99 | 1.00 | 0.99 | 71041 |
| 1 | 0.83 | 0.37 | 0.51 | 1524 |
| micro avg | 0.99 | 0.99 | 0.99 | 72565 |
| macro avg | 0.91 | 0.68 | 0.75 | 72565 |
| weighted avg | 0.98 | 0.99 | 0.98 | 72565 |

- Table 2 : Accuracy rate, detection rate and true positives (oversampling)

Accuracy of the GBM on test set (Over Sampling): 0.924

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 1.00 | 0.92 | 0.96 | 71094 |
| 1 | 0.20 | 0.94 | 0.33 | 1471 |
| micro avg | 0.92 | 0.92 | 0.92 | 72565 |
| macro avg | 0.60 | 0.93 | 0.65 | 72565 |
| weighted avg | 0.98 | 0.92 | 0.95 | 72565 |
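
As a quick check on how the columns relate, the f1-score is the harmonic mean of precision and recall. The short sketch below recomputes it for class 1 from the rounded precision and recall values in the two reports; the helper name `f1_from` is ours.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 1 rows: (precision, recall) without and with oversampling.
without_oversampling = f1_from(0.83, 0.37)
with_oversampling = f1_from(0.20, 0.94)
print(round(without_oversampling, 2), round(with_oversampling, 2))  # 0.51 0.33
```

Both values agree with the reported f1-scores. Oversampling trades precision for recall on class 1, and the harmonic mean penalizes the resulting low precision.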





In [None]:
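from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def createModel(dataFrame):
    '''
    Train a gradient boosting classifier on SMOTE-oversampled
    training data and report its performance on the test set.
    '''

    y = dataFrame['Label']

    # Columns excluded from the feature matrix, including the target column
    listOfFeaturesToDrop = [
        'StartTime',
        'Dur',
        'Proto',
        'SrcAddr',
        'SrcAddr_Dis',
        'Sport',
        'DstAddr',
        'DstAddr_Dis',
        'Dport',
        'State',
        'Label',
    ]

    X = dataFrame.drop(listOfFeaturesToDrop, axis=1)  # drop() returns a new DataFrame

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # Oversample the minority class of the training set only, up to a 1:1 ratio
    # (ratio and fit_sample are the deprecated names of sampling_strategy and fit_resample)
    sm = SMOTE(random_state=12, sampling_strategy=1.0)
    X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

    model = GradientBoostingClassifier()
    model.fit(X_train_res, y_train_res)

    # saveModel is a helper that is not shown in this notebook
    saveModel(model, 'trainedModel.sav')

    # Evaluate on the untouched (not oversampled) test set
    print('Accuracy of the GBM on test set (Over Sampling): {:.3f}'.format(model.score(X_test, y_test)))
    pred = model.predict(X_test)
    print(classification_report(y_test, pred))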
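
The oversampling step relies on SMOTE from imbalanced-learn. To show the idea behind it, the sketch below creates synthetic minority samples by interpolating between a sample and one of its nearest minority neighbours. This is a simplified NumPy illustration, not the library's implementation; `smote_like` and the sample points are hypothetical.

```python
import numpy as np


def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between neighbours."""
    rng = np.random.RandomState(seed)
    # Pairwise distances between minority samples; a sample is not its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours of each sample
    base = rng.randint(0, len(X_min), size=n_new)  # sample to start from
    picked = neighbours[base, rng.randint(0, k, size=n_new)]  # one of its neighbours
    gap = rng.random_sample((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like(X_min, n_new=10)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies. Only the training set is oversampled, so the test metrics still reflect the original class balance.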

# 6. Conclusion and Future Work :

Machine learning techniques show promising results for the worm detection problem. In this experiment, our pipeline consisted of preprocessing, feature engineering and classification. We investigated the configuration of the gradient boosting classifier and found that certain choices greatly affect its performance: oversampling the training set raised recall on class 1 from 0.37 to 0.94, while lowering precision on that class from 0.83 to 0.20 and overall accuracy from 0.985 to 0.924. Our next experiment will use a classifier other than gradient boosting and compare its results with the ones reported here.

# 7. References :

[1] What is computer worm? - Definition from WhatIs.com. (n.d.). Retrieved from https://searchsecurity.techtarget.com/definition/worm
[2] O. Sharma, M. Girolami, and J. Sventek, “Detecting worm variants using machine learning,” in Proc. ACM CoNEXT Conf., New York, NY, USA: ACM, Dec. 2007, pp. 1–12.
[3] García, S.; Grill, M.; Stiborek, J.; Zunino, A. An Empirical Comparison of Botnet Detection Methods. Comput. Secur. 2014, 45, 100–123.
[4] Python Data Analysis Library. (n.d.). Retrieved from https://pandas.pydata.org/
[5] NumPy. (n.d.). Retrieved from http://www.numpy.org/
[6] Matplotlib. (n.d.). Retrieved from https://matplotlib.org/
[7] Scikit-learn. (n.d.). Retrieved from https://scikit-learn.org/stable/
[8] SciPy.org. (n.d.). Retrieved from https://www.scipy.org/
[9] Imbalanced-learn. (n.d.). Retrieved from https://pypi.org/project/imbalanced-learn/
[10] Grover, P. (2017, December 09). Gradient Boosting from scratch – ML Review – Medium. Retrieved from https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
[11] Siddiqui, M., Wang, M.C. and Lee, J. (2009) Detecting Internet Worms Using Data Mining Techniques. Journal of Systemics, Cybernetics and Informatics, 6, 48-53.
[12] S. Yang, J. P. Song, H. Rajamani, T. W. Cho, Y. Zhang, and R. Mooney, “Fast and effective worm fingerprinting via machine learning,” in Proc. of the 3rd IEEE International Conference on Autonomic Computing (ICAC), Dublin, Ireland, Jun. 2006.






