# Avast-CTU Public CAPE Dataset (Model Example)

This document demonstrates how to use the dataset. We first show statistics of the dataset and then create an HMIL model that is built and trained on the reduced reports.

The example uses Julia 1.6, Mill.jl 2.7 and JsonGrinder 2.2.3. For details about Julia or the main packages, please see the official documentation:
* https://julialang.org
* https://ctuavastlab.github.io/JsonGrinder.jl/stable/

Activating the environment and loading the packages:

In [1]:
using Pkg
Pkg.activate("./")

using Flux, MLDataPattern, Mill, JsonGrinder, JSON, Statistics, IterTools, StatsBase, ThreadTools
using JsonGrinder: suggestextractor, ExtractDict
using Mill: reflectinmodel
using CSV, DataFrames
using Random
using Dates
using Plots
using Printf


[32m[1m  Activating[22m[39m project at `~/Diplomka/ExplainMill.jl/myscripts`


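Some parts of the code are parallelized to speed up loading and processing of the data. The following variable determines the number of threads. It reflects how Julia was started (for example with `julia --threads 4` or via the `JULIA_NUM_THREADS` environment variable).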

In [2]:
THREADS = Threads.nthreads() 

1

Setting the path variables to directories containing metadata (labels) and reports: 

In [3]:
PATH_TO_REPORTS = "PATH/TO/REPORTS/"
PATH_TO_REDUCED_REPORTS = PATH_TO_REPORTS * "public_small_reports/"
PATH_TO_FULL_REPORTS = PATH_TO_REPORTS * "public_full_reports/"
PATH_TO_LABELS = "./" ;

Loading the labels:

In [4]:
df_labels = CSV.read(PATH_TO_LABELS * "public_labels.csv", DataFrame) ;

ArgumentError: ArgumentError: "./public_labels.csv" is not a valid file or doesn't exist

## Statistics for malware families in the dataset:

We first provide some basic statistics, starting with the list of malware families in the dataset and the number of samples that belong to each family.

In [5]:
all_samples_count = size(df_labels, 1)
println("All samples: $(all_samples_count)")
println("Malware families: ")
[println(k => v) for (k,v) in countmap(df_labels.classification_family)] ;

UndefVarError: UndefVarError: `df_labels` not defined

We now show how the samples are distributed over time:

In [6]:
df_labels[!,:month] = map(i -> string(year(i), "-", month(i) < 10 ? "0$(month(i))" : month(i)), df_labels.date) ;
month_counts = sort(countmap(df_labels.month) |> collect, by = x -> x[1])
index2017 = findfirst(j -> j[1] == "2017-01", month_counts)
previous_months = sum(map(j -> j[2], month_counts[1:index2017-1]))
month_counts[index2017] = Pair("≤"*month_counts[index2017][1], month_counts[index2017][2]+previous_months)
deleteat!(month_counts, 1:index2017-1)  # drop the months that were merged into the "≤2017-01" bucket
bar(getindex.(month_counts,2), xticks=(1:length(month_counts), getindex.(month_counts,1)), xtickfontsize=5, ytickfontsize=5, xrotation=45, yguidefontsize=8, xguidefontsize=8, legend=false,
    xlabel="Month and year of the first evidence of a sample", ylabel="Number of samples for each month",size=(900,400),
    left_margin = 5Plots.mm, bottom_margin = 10Plots.mm)

UndefVarError: UndefVarError: `df_labels` not defined
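The month-bucketing step can be tried on toy data without the label file. The following is a minimal sketch that uses only the `Dates` standard library; the sample dates and the helper names `month_key` and `bucket_months` are hypothetical and are not part of the dataset tooling.

```julia
using Dates

# Hypothetical helper: format a date as a sortable "yyyy-mm" key.
month_key(d::Date) = Dates.format(d, "yyyy-mm")

# Hypothetical helper: count samples per month and fold every month
# up to and including `cutoff` into a single "≤cutoff" bucket.
function bucket_months(dates::Vector{Date}, cutoff::String)
    counts = Dict{String,Int}()
    for d in dates
        key = month_key(d)
        label = key <= cutoff ? "≤" * cutoff : key
        counts[label] = get(counts, label, 0) + 1
    end
    # Sort by the month itself so that the "≤" bucket comes first.
    sort(collect(counts), by = p -> replace(p[1], "≤" => ""))
end

# Made-up dates standing in for `df_labels.date`.
toy_dates = [Date(2015, 3, 1), Date(2016, 12, 24), Date(2017, 1, 5),
             Date(2017, 2, 10), Date(2017, 2, 11)]
toy_counts = bucket_months(toy_dates, "2017-01")
println(toy_counts)
```

Because the keys are zero-padded `yyyy-mm` strings, plain string comparison orders them chronologically, which is what makes the single-pass bucketing possible.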

## Building a model

We first need to split the data into training and testing parts. We split according to time to reflect the way in which the models should be used (i.e., to detect new, unseen malware). We use the date **2019-08-01** as an example splitting date; other dates can be used in a more detailed study of the drift and changes in the data distributions.

In [7]:
timesplit = Date(2019,8,1)
train_indexes = findall(i -> df_labels.date[i] < timesplit, 1:all_samples_count)
test_indexes = setdiff(1:all_samples_count, train_indexes) ;

train_size = length(train_indexes)
test_size = length(test_indexes)

println("Train size: $(train_size)")
println("Test size: $(test_size)")

UndefVarError: UndefVarError: `all_samples_count` not defined
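The time-based split can be checked on a handful of made-up dates. The sketch below is only an illustration under that assumption: `toy_dates` stands in for `df_labels.date`, and the names `train_idx` and `test_idx` are hypothetical.

```julia
using Dates

# Toy stand-in for `df_labels.date`; these values are made up.
toy_dates = [Date(2019, 7, 31), Date(2019, 8, 1), Date(2018, 1, 15), Date(2019, 12, 2)]
timesplit = Date(2019, 8, 1)

# Samples strictly before the split date go to training, the rest to testing.
train_idx = findall(<(timesplit), toy_dates)
test_idx = setdiff(eachindex(toy_dates), train_idx)

println("Train: $train_idx, test: $test_idx")
```

Note that a sample dated exactly on the split date ends up in the test set, because the comparison is strict.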

Now we need to load all JSON files. For the example model, we are using reduced reports:

In [8]:
jsons = tmap(df_labels.sha256) do s
    try 
        open(JSON.parse, "$(PATH_TO_REDUCED_REPORTS)$(s).json")
    catch e
        @error "Error when processing sha $s: $e"
    end
end ;
@assert size(jsons, 1) == all_samples_count # one entry per sample
@assert !any(isnothing, jsons) # a failed load leaves `nothing`, so check that all samples loaded correctly

UndefVarError: UndefVarError: `df_labels` not defined
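A `try`/`catch` block that only logs the error evaluates to `nothing`, so the result vector keeps its full length even when some files fail to load. The sketch below illustrates this with a hypothetical stand-in for the JSON parser (integer parsing of made-up strings), since the point is the error-handling pattern rather than the parsing itself.

```julia
# Hypothetical stand-in for JSON loading: parse strings as integers and log failures.
raw = ["1", "2", "oops"]
parsed = map(raw) do s
    try
        parse(Int, s)
    catch e
        @error "Error when processing $s: $e"
        nothing   # make the failure value explicit
    end
end

# The length matches the input even though one entry failed.
failed = findall(isnothing, parsed)
println("Loaded $(length(parsed) - length(failed)) of $(length(raw)) entries; failed positions: $failed")
```

Checking for `nothing` entries in addition to the length is therefore a cheap way to detect samples that were silently skipped.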

Next, we build a schema from the JSONs, which the model will be derived from. Note that only the training data are used to build the schema and the model.

In [9]:
chunks = Iterators.partition(train_indexes, div(train_size, THREADS))
sch_parts = tmap(chunks) do ch
    JsonGrinder.schema(jsons[ch])
end
time_split_complete_schema = merge(sch_parts...)
printtree(time_split_complete_schema)

UndefVarError: UndefVarError: `train_indexes` not defined
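How the training indexes are divided into chunks depends on the chunk size passed to `Iterators.partition`. The sketch below uses made-up numbers (10 indexes, 4 threads) to compare a chunk size computed with `div` against one computed with `cld`; the variable names are hypothetical.

```julia
# Toy stand-ins: 10 training indexes split across 4 threads (values are made up).
toy_indexes = collect(1:10)
nthreads = 4

# `div` rounds down, so a remainder spills into an extra, smaller chunk.
chunks_div = collect(Iterators.partition(toy_indexes, div(length(toy_indexes), nthreads)))

# `cld` rounds up, so there are never more chunks than threads.
chunks_cld = collect(Iterators.partition(toy_indexes, cld(length(toy_indexes), nthreads)))

println(length(chunks_div), " chunks with div, ", length(chunks_cld), " chunks with cld")
```

Either choice covers every index exactly once. The difference only matters for load balancing, and for the edge case where there are fewer samples than threads, in which `div` would yield a chunk size of zero.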

Now we extract the data from the JSONs based on the schema so that we can build and train the model. This can take a few minutes.

In [10]:
extractor = suggestextractor(time_split_complete_schema)
data = tmap(extractor, jsons) ;

UndefVarError: UndefVarError: `time_split_complete_schema` not defined

Now we are ready to create the model and to prepare the minibatches and callback functions for training.

In [11]:
labelnames = sort(unique(df_labels.classification_family))
neurons = 32
model = reflectinmodel(time_split_complete_schema, extractor,
	k -> Dense(k, neurons, relu),
	d -> SegmentedMeanMax(d),
	fsm = Dict("" => k -> Dense(k, length(labelnames))),
)

minibatchsize = 500
function minibatch()
	idx = sample(train_indexes, minibatchsize, replace = false)
	reduce(catobs, data[idx]), Flux.onehotbatch(df_labels.classification_family[idx], labelnames)
end

iterations = 200

function accuracy(x,y) 
    vals = tmap(x) do s
        Flux.onecold(softmax(model(s)), labelnames)[1]
    end
    mean(vals .== y)
end     
    

eval_trainset = shuffle(train_indexes)[1:1000]
eval_testset = shuffle(test_indexes)[1:1000]

cb = () -> begin
	train_acc = accuracy(data[eval_trainset], df_labels.classification_family[eval_trainset])
	test_acc = accuracy(data[eval_testset], df_labels.classification_family[eval_testset])
	println("accuracy: train = $train_acc, test = $test_acc")
end
ps = Flux.params(model)
loss = (x,y) -> Flux.logitcrossentropy(model(x), y)
opt = ADAM()

UndefVarError: UndefVarError: `df_labels` not defined
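The accuracy computation and the minibatch sampling can be illustrated without Flux or the dataset. The sketch below is a simplified stand-in under stated assumptions: `toy_scores` plays the role of the model output (one column per sample), picking the highest-scoring class mirrors what `Flux.onecold` does with a list of labels, and all names and values are made up.

```julia
using Random, Statistics

# Toy stand-ins (made up): three class names and a score matrix with one column per sample.
toy_labelnames = ["A", "B", "C"]
toy_scores = [2.0 0.1 0.3;
              0.5 1.5 0.2;
              0.1 0.4 0.9]
toy_truth = ["A", "B", "A"]

# Pick the class with the highest score in each column.
toy_pred = [toy_labelnames[argmax(col)] for col in eachcol(toy_scores)]

# Accuracy is the fraction of predictions that match the true labels.
toy_acc = mean(toy_pred .== toy_truth)

# Minibatch indexes drawn without replacement; `rng` is seeded only to make the sketch repeatable.
rng = MersenneTwister(1)
toy_batch = shuffle(rng, 1:10)[1:4]

println("predictions = $toy_pred, accuracy = $toy_acc")
```

Since softmax is monotonic, it does not change which class has the highest score, so the predicted label is the same whether it is taken from the raw outputs or from the softmax probabilities.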

We can now train the model (this can take some time). Note that the actual performance may vary slightly from the numbers presented in the accompanying paper.

In [12]:
Flux.Optimise.train!(loss, ps, repeatedly(minibatch, iterations), opt, cb = Flux.throttle(cb, 2))

UndefVarError: UndefVarError: `minibatch` not defined

The model is now trained, so we can evaluate its performance on the complete test set:

In [13]:
full_test_accuracy = accuracy(data[test_indexes], df_labels.classification_family[test_indexes])
println("Final evaluation:")
println("Accuracy on test data: $(full_test_accuracy)")

UndefVarError: UndefVarError: `data` not defined

We can look at the confusion matrix of the test data for the individual malware families. The true labels are in the rows and the predictions are in the columns; each row is normalized and shown in percent.

In [14]:
test_predictions = Dict()
for true_label in labelnames
    current_predictions = Dict()
    [current_predictions[pl]=0.0 for pl in labelnames]
    family_indexes = filter(i -> df_labels.classification_family[i] == true_label, test_indexes)
    predictions = tmap(data[family_indexes]) do s
        Flux.onecold(softmax(model(s)), labelnames)[1]
    end
    [current_predictions[pl] += 1.0 for pl in predictions]
    [current_predictions[pl] = current_predictions[pl] ./ length(predictions) for pl in labelnames]
    test_predictions[true_label] = current_predictions
end

@printf "%8s\t" "TL\\PL"
[@printf " %8s" s for s in labelnames]
print("\n")
for tl in labelnames
    @printf "%8s\t" tl 
    for pl in labelnames
        @printf "%9s" @sprintf "%.2f" test_predictions[tl][pl]*100
    end
    print("\n")
end

UndefVarError: UndefVarError: `labelnames` not defined
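The row-normalized confusion matrix can also be computed with a plain integer matrix instead of nested dictionaries. The sketch below shows this on made-up labels for six samples; the names `index_of`, `counts` and `percent` are hypothetical and the result has the same layout as the table above (true labels in rows, predictions in columns).

```julia
using Printf

# Toy stand-ins (made up): true and predicted family names for six test samples.
toy_labelnames = ["A", "B"]
toy_true = ["A", "A", "A", "B", "B", "B"]
toy_pred = ["A", "A", "B", "B", "B", "B"]

# Count (true, predicted) pairs; rows are true labels, columns are predictions.
index_of = Dict(name => i for (i, name) in enumerate(toy_labelnames))
counts = zeros(Int, length(toy_labelnames), length(toy_labelnames))
for (t, p) in zip(toy_true, toy_pred)
    counts[index_of[t], index_of[p]] += 1
end

# Normalize each row so that it sums to 100 percent.
percent = 100 .* counts ./ sum(counts, dims = 2)

# Print one row per true label.
for (i, name) in enumerate(toy_labelnames)
    @printf "%8s" name
    for j in eachindex(toy_labelnames)
        @printf " %8.2f" percent[i, j]
    end
    print("\n")
end
```

As in the dictionary-based version, a family with no test samples would produce `NaN` entries in its row, because the row sum used for normalization is zero.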

If you want to test the static version of the reduced model, the schema can be altered to remove the behavioral part. Re-extracting the data with the new extractor and re-training the model would work in the same way as above.

In [15]:
time_split_static_schema = deepcopy(time_split_complete_schema)
delete!(time_split_static_schema.childs,:behavior)

UndefVarError: UndefVarError: `time_split_complete_schema` not defined