last updated: 23rd July, 2018
Project Members:
- Kunal Ranjan
- Pranjal Pratik Lal
The aim of this project was to analyse and predict the behaviour of tools like KLEE and AFL (which is used in finding bugs) using Machine Learning. This project was started in summer'18 (1st June) and is still going on under the supervision of Prof. Subhajit Roy and guidance of Awanish Pandey.
A set of programs (written in C Programming Language) with bugs injected by generating various versions for each code and real time run output generated by KLEE or AFL to detect the presence of bug.
-
We aimed to predict whether KLEE or AFL software detect the bug or not on new sets of programs.
-
Furthermore given that KLEE failed to detect the bug then we analyze the various parameters(through which bug is injected) and try to predict relationship between them.
Libraries used
- Numpy
- Pickle
- Pandas
- Scikit- Learn
- Keras
- XGBoost
- MLXtend
NOTE : All the preprocessing are done in python
We are given folder containing various C programs and each program has 4 versions which are actually different bugs injected in 4 ways. Each file has path for bug through line numbers and parameters for each line.
Definition of various line parameters of the programs
- trace_length : how many for loops or if else come before the given line
- global_nesting_depth : real depth of if-else / for / while when used inside for / if-else / while
- call_depth : recursive depth of function at this line
- no_of_local_variable : total numbers of variables
- taint_count :- no of times the given condition inside if-else or any is updated if many updation is present then taint count is max of them
So we first extract the above 5 bug parameters from dataset.
NOTE : All the preprocessing are done in python
Step 1 Output generated by KLEE and AFL (See File AFLSingle.csv and KLEESingle.csv)
aflsingle = pd.read_csv('AFLSingle.csv', delimiter=' ', header=None)
kleesingle = pd.read_csv('KLEEsingle.csv', delimiter=' ', header=None)
y_afl = aflsingle.iloc[:, 2].values
y_afl = (y_afl >= 1)
y_klee = kleesingle.iloc[:, 2].values
y_klee = (y_klee >= 1)
Step 2
Imorting dataset(bug files) and their important parameters extraction
class tosent:
def __init__ (self,command,version):
self.command = command
self.version = version
def get_data(self):
dataset = pd.read_csv('tosent/' + self.command + '/' + self.version + '/AvailableVariablesFiltered.txt', header=None)
dataset= dataset.iloc[:, 3:-1]
dataset.columns = ["trace_length", "global_nesting_depth", "call_depth", "no_of_local_variable", "taint_count"]
return dataset
def get_train_data(self):
dataset_final= self.get_data().iloc[:, :].values
return dataset_final
commands = ['CAT', 'CHMOD', 'CUT', 'DF', 'DU', 'ECHO', 'EXPAND', 'EXPR', 'FMT', 'HEAD', 'ID',
'KILL', 'LN', 'LS', 'NL', 'PATHCHK', 'PINKY', 'PR', 'PRINTF', 'PTX', 'SEQ', 'STAT',
'SUM', 'TAIL', 'TEST', 'TOUCH', 'UNEXPAND', 'UNIQ', 'WC', 'WHO']
data_dict= []
for name in commands:
for i in range(1, 5):
scan_data= tosent(name, str(i))
temp=scan_data.get_train_data()
temp_mean=temp.mean(0)
data_dict.append(temp_mean)
Note: taint_count is ignored in above code due to lack of some more information
Step3 splitting datasets into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_dict, y_afl, test_size = 0.2 )
Various Machine Learning Models used :
- Neural Networks
- Decision Trees
- Gradient Boosted Trees
- Association Rules Mining algorithms
Note: Since we are given for each path (i.e. version) output generated by KLEE and AFL (1 if bug detected otherwise 0) which lead to features all having multivalue so we have to take aggregrate of them by trying different mathematical tools such mean,median,max or min to know which can define the model most appropriate.
In neural network adjusting nodes-to-features ratios of connected layers helps to reduce the dimensionality and thus we can then know which feature is more important.
classifier = Sequential()
classifier.add(Dense(output_dim = 1000, init = 'uniform', activation = 'relu', input_dim = 5))
classifier.add(Dense(output_dim = 1000, init = 'uniform', activation = 'relu'))
classifier.add(Dense(output_dim = 1, init = 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.fit(X_train, y_train, batch_size = 10, nb_epoch = 100)
We are not able to reduce feature because if we try to reduce ratio then number of features get reduced but our model effieciency on test set reduces and if we increase to as in above as 1000 output efficiency on test cases is about 75 to 85 percent which most probably the case of overfitting as we have not enough data.
Note: we consider mean values of all the features over all lines as input for this model.
We used XGBoost library to train a Boosted Tree model and used Bayesian Optimization using the Scikit-Optimize library to optimize tree parameters to produce best results on our data.
bayes_cv_tuner = BayesSearchCV(
estimator = xgb.XGBClassifier(
n_jobs = 1,
objective = 'binary:logistic',
eval_metric = 'map',
silent=1,
tree_method='exact'
),
search_spaces = {
'learning_rate': (0.01, 1.0, 'log-uniform'),
'min_child_weight': (0, 10),
'max_depth': (0, 50),
'max_delta_step': (0, 20),
'subsample': (0.01, 1.0, 'uniform'),
'colsample_bytree': (0.01, 1.0, 'uniform'),
'colsample_bylevel': (0.01, 1.0, 'uniform'),
'reg_lambda': (1e-9, 1000, 'log-uniform'),
'reg_alpha': (1e-9, 1.0, 'log-uniform'),
'gamma': (1e-9, 0.5, 'log-uniform'),
'min_child_weight': (0, 5),
'n_estimators': (50, 100),
'scale_pos_weight': (1e-6, 500, 'log-uniform')
},
scoring = 'accuracy',
cv = StratifiedKFold(
n_splits=3,
shuffle=True,
random_state=42
),
n_jobs = 1,
n_iter = 10,
verbose = 0,
refit = True,
random_state = 42
)
The optimized models gave accuracy score(fraction of correctly classified samples) of 0.73 for KLEE and 0.80 for AFL upon 3-fold cross validation. Feature Importance Plots also showed Trace length to be the most important feature out of the five. AFL OUTPUT
KLEE OUTPUT Since Gradient Boosted Trees are fairly complicated, the models we made had 70 trees. We were later told that we needed to know exactly how our classifier was working i.e. the exact function on our features, which was very difficult to do with Gradient Boosted Trees. Hence, we started looking for simpler models.
Note: Before implementing Simpler Model we have tried to know feature importance on XGBoost by expanding to all parameters i.e. mean, median, maximum, minimum of KLEE .
Note: we consider mean values of all the features over all lines as input for this model.
We started working on Decision Trees because they are a lot simpler than Boosted Trees and can easily be viewed as an image using the library we were using. We used the Scikit-Learn library to implement Decision Trees and Scikit-Optimize to optimize tree parameters to give best results on our data.
bayes_cv_tuner = BayesSearchCV(
estimator = DecisionTreeClassifier(
criterion='gini',
splitter='best',
presort=True
),
search_spaces = {
'max_features': (1,4),
'max_depth':(1,15),
'min_samples_split': (2,10),
'min_samples_leaf': (1,10),
'min_weight_fraction_leaf': (1e-9,0.5,'log-uniform'),
'min_impurity_decrease' : (1e-9,0.5,'log-uniform'),
'max_leaf_nodes':(2,50)
},
scoring = 'accuracy',
cv = StratifiedKFold(
n_splits=3,
shuffle=True,
random_state=42
),
n_jobs = 1,
n_iter = 10,
verbose = 0,
refit = True,
random_state = 42
)
The trees gave an accuracy score of 0.78 for KLEE and 0.71 for AFL upon 3-fold cross validation.
We weren’t able to get any good result because the result we got from the tree for KLEE did not match the known algorithm on which KLEE works. During this time we also came to know that the Taint Count values on individual lines of the data were Taint Counts of different variables at different lines, hence taking their mean was not a good way to summarise it over the entire file. So we stopped using Taint Count in our models from here onwards. We also used maximum values of features in a file as inputs for models instead of mean and got similar accuracy scores ~0.7 for both KLEE and AFL but the trees for KLEE didn’t match with the algorithm on which KLEE works(according to our guide Awanish Pandey).
Note: We also tried using the mean, maximum and median data in the same model for KLEE but the Decision Tree became very complicated to comprehend. These tree also used the same feature multiple times as mean, median and max are correlated. Note: Around this time we also came to know that taint_count was a parameter for individual variables in the program hence taking its mean or median over the entire program didn't make sense so we didn't use taint_count in furthur models.
Output table for feature importance for KLEE
Parameter | Weightage |
---|---|
max_trace_length | 0.238 |
min_global_nesting_depth | 0.147 |
median_no_of_local_variable | 0.145 |
mean_no_of_local_variable | 0.125 |
mean_trace_length | 0.088 |
max_global_nesting_depth | 0.082 |
median_trace_length | 0.079 |
min_trace_length | 0.061 |
max_no_of_local_variable | 0.031 |
Association Rule Mining algorithm is used to find relation among variables in large datasets. It works for binary data i.e. data in form of 1 an 0 (present or not present), hence we needed to convert our data, which was numeric into 0s and 1s. We used thresholding to do this i.e. if the value at a point was greater than a certain value it was assigned the value 1, otherwise 0. We tried thresholding at mean and median of the features on the whole dataset.
MLXtend API(frequent_patterns) MLXtend API(association_rules)
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(X, min_support=0.05, use_colnames=True)
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules = rules[(rules['consequents'] == frozenset({'AFL_data'}))]
Even after trying several combination of thresholds and introducing new variables (where we assigned value 1 if value at a point was smaller than a certain value and 0 otherwise), we didn’t get any strong rule from our data. All the rules generated had a low value for either support or confidence or lift. One of the reasons might be that Association Rules Mining is used for finding patterns in large datasets and our dataset only had a 120 points.
Note: For any rule A->B
Support = P(A) Confidence = P(B|A) Lift = P(AB)/P(A)P(B)
-
After trying several algorithms and methods we weren’t able to devise any sort of rule or formula on which KLEE and AFL work.
-
While looking at the data files manually we found a few abnormalities in the data given to us.
-
In one of the folders FMT, three files had the exact same values of all the features but KLEE had found bug in two of them but not in the third.
-
Few files had negative values of Global Nesting Depth.
-
Another reason why our models didn’t do very well was that there are many instances where two files are very similar, hence having similar values of all the features but bug was found in one and not in other. Such conflicting data might have resulted in our models making wrong classification on such similar files.
Note: We are working further on SVM to see whether it can solve our problem or not.