## Proof of concept of the AST metrics approach

This notebook presents a proof of concept of an approach that applies regression techniques to a dataset storing,
for each source file under analysis, the count of every node type found in its Abstract Syntax Tree (AST).

### Feature extraction
The simplest way to turn code into numbers is to count the occurrences of each token in the sources, building a
bag-of-words.

This produces a very large number of distinct tokens to fit a regression model on, which is likely to cause
overfitting. It is also worth noting that many tokens lose their meaning under this approach: variable identifiers,
for example, carry no information about their type.
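
As a rough illustration of the bag-of-words idea, the sketch below counts tokens in a short Java fragment. The regular expression and the `bag_of_words` helper are assumptions made for this example, not part of the notebook's code.

```python
import re
from collections import Counter

# Hypothetical tokenizer: identifiers/keywords, integer literals,
# or any other single non-whitespace character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def bag_of_words(source: str) -> Counter:
    """Count how often each token occurs in the given source text."""
    return Counter(TOKEN_PATTERN.findall(source))


snippet = "int j = 1; j++; int k = j;"
token_counts = bag_of_words(snippet)
print(token_counts.most_common(5))
```

Here `j` and `k` end up as two separate features even though both are plain `int` variables, which shows how the vocabulary grows with every new identifier while the type information is lost.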

Building the AST of the sources treats the code as code rather than as plain text: each identifier is linked to its
type, and declarations can be told apart from invocations, casts from argument lists, and so on.

The AST can be built, and many of the metrics we consider relevant extracted, with a Java parser such as
[ANTLR](https://www.antlr.org/) or [javalang](https://github.com/c2nes/javalang) (we chose the latter because of its
speed). Once the AST is built, we simply visit it and increment the counter associated with the type of each node
encountered.
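
The visit-and-count step can be sketched without any third-party dependency by using Python's built-in `ast` module on a Python fragment. This is only an analogy: the notebook's analyzer works on Java sources through javalang, and the `count_node_types` helper below is a hypothetical name introduced for this example.

```python
import ast
from collections import Counter


def count_node_types(source: str) -> Counter:
    """Parse the source into an AST and count the nodes of each type."""
    tree = ast.parse(source)
    counts = Counter()
    # ast.walk yields every node in the tree, in no guaranteed order
    for node in ast.walk(tree):
        counts[type(node).__name__] += 1
    return counts


sample = (
    "def greet(name):\n"
    "    message = 'Hello ' + name\n"
    "    print(message)\n"
)
node_counts = count_node_types(sample)
print(node_counts)
```

Unlike the bag-of-words, the resulting features are node types (function definitions, assignments, calls), so their number is bounded by the grammar rather than by the identifiers used in the code.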

A demonstration of how this approach works is given below.

In [1]:
from src.processing import analyzer

source = "HelloClass.java"

print("// Source code to be analyzed:")
with open(f"./{source}") as file:
    print(file.read())
    print()

print("// Analyzer output:")
print(analyzer.analyze(f"./{source}"))

// Source code to be analyzed:
package greetings;

import java.util.*;

/*
 * A very simple "Hello World" example in Java
 */
public class HelloClass<T extends List> extends Class1.SubType<Type, Int3> implements Interface1, Interface2 {

	public static void main(String[] args) {
		// Print to standard output
		int j = 1;
		j++;

		ArrayList<String> names = new ArrayList<String>();

		System.out.println("Hello World!");
	}

}


// Analyzer output:
Counter({'INTERFACES_IMPLEMENTED': 2, 'VARIABLE_DECLARATIONS': 2, 'LITERALS': 2, 'STATEMENTS': 2, 'EXPRESSION_STATEMENTS': 2, 'PACKAGE_DECLARATIONS': 1, 'IMPORT_STATEMENTS': 1, 'CLASS_DECLARATIONS': 1, 'PUBLIC_TYPE_DECLARATIONS': 1, 'PARAMETRIZED_TYPE_DECLARATIONS': 1, 'SUBTYPE_DECLARATIONS': 1, 'METHOD_DECLARATIONS': 1, 'PUBLIC_METHOD_DECLARATIONS': 1, 'STATIC_METHOD_DECLARATIONS': 1, 'BASIC_INT_VARIABLES': 1, 'TYPED_REFERENCES': 1, 'ARRAYLIST_VARIABLES': 1, 'METHOD_INVOCATIONS': 1})


### Model building

The task is to explore whether Logistic Regression can identify vulnerable files from these data.
A simple model built on the gathered data without any fine-tuning provides a baseline against which the models built
later on can be compared.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from src.processing import metrics_extractor

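# Load the extracted metrics; the IS_WEAK column marks the vulnerable files
features = metrics_extractor.load_features()

# Separate the target column from the feature columns
labels = features["IS_WEAK"]
data = features.drop("IS_WEAK", axis=1)

# Hold out 20% of the samples for testing
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.2)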

classifier = LogisticRegression()
classifier.fit(train_data, train_labels)

predictions = classifier.predict(test_data)
report = classification_report(test_labels, predictions, target_names=["Safe files", "Weak files"])

print(report)



              precision    recall  f1-score   support

  Safe files       0.92      1.00      0.95      7343
  Weak files       0.35      0.03      0.05       686

    accuracy                           0.91      8029
   macro avg       0.63      0.51      0.50      8029
weighted avg       0.87      0.91      0.88      8029



This simple classifier performs poorly on the class of interest: recall on weak files is only 0.03, and the 0.91
accuracy merely reflects the share of safe files in the test set (7343 of 8029). The cause is the imbalance between
the two classes, which means that either the classes should be given different weights or the dataset should be cut
down to restore balance, that is, undersampled.
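
The first option, weighting the classes, can be sketched as follows. The function computes weights inversely proportional to class frequency, `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. The label encoding (0 for safe, 1 for weak) is an assumption, and the class counts are borrowed from the test-split report above purely for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumed encoding: 0 = safe file, 1 = weak file.
# Counts taken from the report above (7343 safe, 686 weak), for illustration only.
labels = [0] * 7343 + [1] * 686
weights = balanced_class_weights(labels)
print(weights)
```

With these counts a weak file weighs roughly ten times as much as a safe one (about 5.85 against 0.55). In scikit-learn the same effect should be obtainable by passing `class_weight="balanced"` to `LogisticRegression`, although that variant is not evaluated in this notebook.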

Following the second approach, the dataset is built from every sample of the minority class (weak files) and an
equal number of randomly picked samples of the majority class (safe files). The resulting dataset is then shuffled
and split, using 80% of the samples for training and the remaining 20% for testing.
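
A minimal sketch of this undersampling and splitting procedure, using only the standard library, is given below. It is not the code in `src.processing.dataset_splitter`, whose implementation is not shown here; the `undersample` and `split` helpers, the seed and the toy data are all assumptions.

```python
import random


def undersample(samples, labels, seed=0):
    """Keep the whole minority class and an equal-sized random draw from the others."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    minority_size = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        # Sampling minority_size items from the minority class keeps all of it
        for sample in rng.sample(group, minority_size):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


def split(pairs, test_fraction=0.2):
    """Split already shuffled (sample, label) pairs into train and test parts."""
    n_test = round(len(pairs) * test_fraction)
    return pairs[:-n_test], pairs[-n_test:]


# Toy data: 100 samples, of which the first 10 belong to the minority class (label 1)
samples = list(range(100))
labels = [1 if i < 10 else 0 for i in samples]

balanced = undersample(samples, labels)
train, test = split(balanced)
print(len(balanced), len(train), len(test))
```

Undersampling discards most of the majority class, which is the price paid for a balanced training set.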

In [14]:
from src.processing.dataset_splitter import __train_data as train_data
from src.processing.dataset_splitter import __test_data as test_data
from src.processing.dataset_splitter import __train_labels as train_labels
from src.processing.dataset_splitter import __test_labels as test_labels

classifier = LogisticRegression()
classifier.fit(train_data, train_labels)

predictions = classifier.predict(test_data)
report = classification_report(test_labels, predictions, target_names=["Safe files", "Weak files"])

print(report)



              precision    recall  f1-score   support

  Safe files       0.68      0.85      0.76       689
  Weak files       0.80      0.60      0.69       689

    accuracy                           0.73      1378
   macro avg       0.74      0.73      0.72      1378
weighted avg       0.74      0.73      0.72      1378



The results are encouraging: even this simple model generalizes to both classes. However, recall on weak files (0.60)
should be improved, since it means that about 40% of the vulnerable files are still missed.
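
To make the relationship between these figures explicit, the sketch below computes precision, recall and F1 from confusion-matrix counts. The counts are hypothetical, chosen only so that precision and recall match the 0.80 and 0.60 reported for weak files; they are not the notebook's actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)  # share of predicted weak files that really are weak
    recall = tp / (tp + fn)     # share of truly weak files that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 60 weak files found, 40 missed, 15 safe files flagged by mistake
precision, recall, f1 = precision_recall_f1(tp=60, fp=15, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.60 f1=0.69
```

Raising recall means reducing the false negatives (`fn`), usually at the cost of more false positives and therefore lower precision.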
