# Ransomware detection model- Windows OS

## Table of Contents
* Introduction
* Dataset & Volatility Plugins
* Data Preprocessing
* Model Training
* Model Evaluation
* Conclusions
* References

## Introduction

Ransomware attacks are growing in volume and sophistication. Many attackers evade detection from traditional file scanning techniques. Here we use data sampled from volatile memory (RAM) to detect the presence of ransomware in Windows OS. We engineer several features from a dataset containing artifacts from running both benign and ransomware processses and train a random forest classifier. We can create multiple models based on the number of samples or snapshots available.

## Dataset

We ran hundreds of ransomwares in our lab environment and have recorded the generated process features using the [Volatility framework](https://github.com/volatilityfoundation/volatility3) to create a labeled dataset.

The csv file contains 530 columns- a combination of features from 5 different Volatility Plugins. This data collection is part of [DOCA AppShield](https://developer.nvidia.com/networking/doca)

## Volatility Plugin Features

#### Envars Plugin
Displays a process's environment variables. Typically this will show the number of CPUs installed and the hardware architecture, the process's current directory, temporary directory, session name, computer name, user name, and various other interesting artifacts.

#### Threadlist Plugin
Displays the threads that are used by a process.

#### VadInfo Plugin
Displays extended information about a process's VAD nodes.
- The address of the MMVAD structure in kernel memory
- The starting and ending virtual addresses in process memory that the MMVAD structure pertains to
- The VAD Tag
- The VAD flags, control flags, etc
- The name of the memory mapped file (if one exists)
- The memory protection constant (permissions)

#### Handles Plugin
Displays the open handles in a process, use the handles command. This applies to files, registry keys, mutexes, named pipes, events, window stations, desktops, threads, and all other types of securable executive objects

#### LdrModules Plugin
Displays a process's loaded DLLs. LdrModules detects a dll-hiding or injection kind of activities in a process memory.

### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import random

from sklearn.ensemble import RandomForestClassifier

## Data Preprocessing

In [2]:
TRAINING_DATA_PATH = "../../datasets/training-data/ransomware-training-data.csv"

# Read CSV of Data from Plugins with Ransomware labels
ransom_df = pd.read_csv(TRAINING_DATA_PATH)

In [3]:
# Sorting the dataframe by ransomware_name, PID_Process and snapshot to make the dataframe for time series
sortby_cols = ['ransomware_name', 'PID_Process', 'snapshot']
ransom_df = ransom_df.sort_values(by=sortby_cols)
ransom_df = ransom_df.reset_index(drop=True)

In [4]:
# Using only 99 of the 530 features based on prior experimentation 
# In the prior experiment that we did, we select the best 99 features of Random-Forest model based on single snapshot

# Information and defenitions about the important features:
# Defenitions:
    # Commit charged - the total amount of virtual memory of all processes that must be backed by either physical memory or the page file
    # vad - virtual address descriptor
    # vads - virtual address descriptor short
    # private memory - this field refers to committed regions that cannot be shared with other processes.
# In the features engineering stage we are using several memory plugins as raw data:
# Environment variables plugin feature engineering:
    # Check if the extended PATHEXT environment variable is exists
    # Claculating the amount of environment variables for each process
# Threadlist plugin feature engineering:
    # Count amount of unique states and wait reasons and thread with staten - '2'-'Running' and wait reason - '9'-'WrPageIn', '13'-'WrUserRequest', '31'-'WrDispatchInt'
    # Calculate the amount of unique states
    # Calculate the amount of unique wait reasons
# Vadinfo plugin feature engineering:
    # Calculate the amount of vad, vads and private memory
    # Calculate the ratio of vad, vads and private memory in vadinfo df
    # Calculate the mean, max, sum and len of commit charged
    # Calculate the mean, max, sum of vad commit charged
    # Calculate the min of vads commit charged
    # Calculate for each page protection: 'PAGE_EXECUTE_READWRITE ','PAGE_EXECUTE_WRITECOPY ','PAGE_NOACCESS ' min commit charged
    # Calculate min commit charged for vad with 'PAGE_NOACCESS' protection
    # Calculate for each page protection: 'PAGE_EXECUTE_READWRITE ','PAGE_NOACCESS ','PAGE_READONLY ' mean commit charged
    # Calculate mean commit charged for vad with 'PAGE_NOACCESS' protection
    # Calculate for each page protection: 'PAGE_EXECUTE_READWRITE ','PAGE_NOACCESS ' max commit charged
    # Calculate for each page protection: 'PAGE_EXECUTE_READWRITE ','PAGE_NOACCESS ','PAGE_EXECUTE_WRITECOPY ' sum commit charged
    # Calculate the std of commit charge with 'PAGE_EXECUTE_READWRITE' protection
    # Calculate the amount of entire memory commit charged of vads
    # Count the amount and ratio of each page protection: 'PAGE_READONLY ','PAGE_NOACCESS ','PAGE_EXECUTE_READWRITE ','PAGE_EXECUTE_WRITECOPY '
    # Count the amount and ratio of vads with each page protection: 'PAGE_READONLY ','PAGE_NOACCESS ','PAGE_EXECUTE_READWRITE ','PAGE_READWRITE ','PAGE_EXECUTE_READWRITE '
    # Count the amount and ratio of vad with each page protection: 'PAGE_READONLY ','PAGE_NOACCESS ','PAGE_EXECUTE_WRITECOPY ','PAGE_READWRITE '
    # Count vadinfo unique paths
    # Calculate the ratio between vads amount and amount pages with PAGE_EXECUTE_WRITECOPY access + 1
    # Count amount of unique extensions
# Handles plugin feature engineering:
    # Count double extensions file handles
    # Count amount of files with common file extension
    # Count amount of directories with personal user directory
    # Count amount of directories with windows directory
    # Count amount of unique directories
    # Count unique file extension
    # Count amount of handles
    # Count amount and ration of unique handles names
    # Count amount and ration of unique handles type
    # Count amount and ratio of handles type
# LdrModules plugin feature engineering:
    # Extract process size and path
REQ_FEATURES = ['envirs_pathext', 'count_double_extension_count_handles', 'page_readonly_vads_count', 'double_extension_len_handles', 'get_commit_charge_max_vad', 'count_entire_commit_charge_vads', 'get_commit_charge_min_vad_page_noaccess', 'check_doc_file_handle_count', 'envars_df_count', 'page_noaccess_vad_count', 'get_commit_charge_min_vads', 'get_commit_charge_mean_vad_page_noaccess', 'page_noaccess_vad_ratio', 'handles_df_directory_count', 'threadlist_df_wait_reason_9', 'page_noaccess_count', 'get_commit_charge_mean_page_noaccess', 'ldrmodules_df_size_int', 'get_commit_charge_max_page_execute_readwrite', 'ratio_private_memory', 'get_commit_charge_max_page_noaccess', 'page_readwrite_ratio', 'get_commit_charge_mean_page_execute_readwrite', 'handles_df_section_ratio', 'vad_ratio', 'page_noaccess_ratio', 'page_execute_writecopy_vad_ratio', 'handles_df_section_count', 'handles_df_tpworkerfactory_count', 'page_readonly_count', 'handles_df_waitcompletionpacket_count', 'get_commit_charge_mean_page_readonly', 'page_readonly_vad_ratio', 'handles_df_event_ratio', 'handles_df_semaphore_ratio', 'get_commit_charge_sum_page_execute_readwrite', 'threadlist_df_state_2', 'handles_df_iocompletionreserve_count', 'handles_df_directory_ratio', 'handles_df_iocompletionreserve_ratio', 'get_commit_charge_mean_vad', 'get_commit_charge_sum_page_execute_writecopy', 'page_execute_readwrite_ratio', 'get_commit_charge_min_page_execute_readwrite', 'threadlist_df_wait_reason_31', 'get_commit_charge_sum_page_noaccess', 'page_readwrite_vads_ratio', 'handles_df_mutant_ratio', 'get_commit_charge_sum_vad', 'get_commit_charge_max', 'handles_df_type_unique', 'handles_df_iocompletion_ratio', 'handles_df_waitcompletionpacket_ratio', 'handles_df_tpworkerfactory_ratio', 'vadinfo_df_path_unique', 'vad_count', 'page_readonly_ratio', 'count_private_memory', 'page_execute_readwrite_vads_ratio', 'vads_page_execute_writecopy_ratio', 'handles_df_file_ratio', 'handles_df_etwregistration_ratio', 'handles_df_key_ratio', 'get_commit_charge_min_page_noaccess', 'page_readonly_vads_ratio', 'handles_df_thread_ratio', 'handles_df_file_count', 'handles_df_thread_count', 'threadlist_df_count', 'get_commit_charge_len', 'get_commit_charge_min_page_execute_writecopy', 'handles_df_alpc port_ratio', 'file_users_exists', 'file_windows_count', 'handles_df_key_count', 'threadlist_df_wait_reason_13', 'threadlist_df_wait_reason_unique', 'handles_df_semaphore_count', 'handles_df_name_unique_ratio', 'threadlist_df_state_unique', 'get_count_unique_extensions', 'handles_df_name_unique', 'page_noaccess_vads_ratio', 'handles_df_event_count', 'page_readwrite_vad_ratio', 'handles_df_alpc port_count', 'get_commit_charge_std_page_execute_readwrite', 'count_directories_handles_uniques', 'count_extension_handles_uniques', 'page_readwrite_vad_count', 'get_commit_charge_sum', 'get_commit_charge_mean', 'handles_df_desktop_ratio', 'handles_df_count', 'handles_df_mutant_count', 'handles_df_windowstation_ratio', 'page_execute_readwrite_vads_count', 'handles_df_type_unique_ratio', 'page_execute_readwrite_count']

### Split Dataset Into Training and Validation

In [5]:
# Spliting into training and validation sets by ransomware name
files = ransom_df.ransomware_name.unique()
files_count = len(files)

# We randomize the files to remove biases related to recording process
random.shuffle(files)

train_files = files[:int(files_count*0.8)]
test_files = files[int(files_count*0.8):]

In [6]:
train_df = ransom_df[ransom_df.ransomware_name.isin(train_files)]
val_df = ransom_df[ransom_df.ransomware_name.isin(test_files)]

In [7]:
class FeaturesData():
    
    def __init__(self, df, labels, rw_names, pid_processes, snapshot_ids):
        self._df = df
        self._labels = labels
        self._rw_names = rw_names
        self._pid_processes = pid_processes
        self._snapshot_ids = snapshot_ids
    
    @property
    def df(self):
        return self._df
    
    @property
    def labels(self):
        return self._labels
    
    @property
    def rw_names(self):
        return self._rw_names
    
    @property
    def pid_processes(self):
        return self._pid_processes
    
    @property
    def snapshot_ids(self):
        return self._snapshot_ids

In [8]:
def sort_entries(df, columns):
    df = df.sort_values(by=columns).reset_index(drop=True)
    return df


def sliding_window_offsets(ids, window):
        """
        Create snapshot_id's sliding sequence for a given window
        """
        ids_len = len(ids)

        sliding_window_offsets = []

        for start in range(ids_len - (window - 1)):
            stop = start + window
            sequence = ids[start:stop]
            consecutive = sorted(sequence) == list(range(min(sequence), max(sequence) + 1))
            if consecutive:
                sliding_window_offsets.append((start, stop))

        return sliding_window_offsets


def generate_sequences(df, window=3):
    """
    Generate time series sequences.
    """
    features_data = []
    labels = []
    snapshots = []
    rw_names = []
    pid_processes = []
    
    pid_processes_unique = list(df.PID_Process.unique())

    for pid_process in pid_processes_unique:

        pid_process_df = df[df.PID_Process==pid_process]
        pid_process_df.index = pid_process_df.snapshot
        pid_process_df = pid_process_df[~pid_process_df.index.duplicated(keep='last')]
        pid_process_labels = pid_process_df.label.values
        pid_process_rwname = pid_process_df.ransomware_name.values
        pid_process_df = pid_process_df[REQ_FEATURES]
        
        if len(pid_process_df) >= window:
            snapshot_ids = pid_process_df.index.values
            offsets = sliding_window_offsets(snapshot_ids, window)
            for start, stop in offsets:
                features_data.append(list(pid_process_df[start:stop].values.ravel()))
                labels.append(pid_process_labels[start])
                snapshots.append(snapshot_ids[start])
                rw_names.append(pid_process_rwname[start])
                pid_processes.append(pid_process)

    features_df = pd.DataFrame(np.array(features_data))
    
    sd = FeaturesData(features_df, labels, rw_names, pid_processes, snapshots)
    
    return sd
    

In [9]:
columns = ["PID_Process", "snapshot", "ransomware_name"]
# sort the entries by ["PID_Process", "snapshot", "ransomware_name"] to create time series data.
train_df = sort_entries(train_df, columns)
val_df = sort_entries(val_df, columns)

In [10]:
train_data = generate_sequences(train_df)
val_data = generate_sequences(val_df)

## Model Training

In [11]:
X_df_train = train_data.df
Y_train = train_data.labels

In [12]:
# RandomForest model parameters
MAX_DEPTH=10
MIN_SAMPLES_SPLIT=10
N_ESTIMATORS=250

# For our model we select RandomForest to avoid overfitting
model = RandomForestClassifier(max_depth=MAX_DEPTH, 
                               min_samples_split=MIN_SAMPLES_SPLIT, 
                               n_estimators=N_ESTIMATORS)
model.fit(X_df_train, Y_train)

RandomForestClassifier(max_depth=10, min_samples_split=10, n_estimators=250)

In [13]:
# Save model
def save_model(model, output_file='ransomware_model_new.sav'):
    pickle.dump(model, open(output_file, 'wb'))

## Model Evaluation

In [14]:
# Evaluate model
def model_eval(model, val_data):
    df_val = val_data.df
    Y_pred = model.predict_proba(df_val)
    df_val['pred'] = Y_pred[:, 1]

    Pre = []
    Rec = []
    
    df_val['label'] = val_data.labels
    df_val['PID_Process'] = val_data.pid_processes
    df_val['ransomware_name'] = val_data.rw_names
    
    x = df_val[df_val.label == 0]
    y = df_val[df_val.label == 1]
    
    tp_fn_len = len(x)
    
    # Changing the threshold to calculate precision and recall and plot them
    for thr in np.arange(0, 1, 0.01):
        print(f"thr: {thr}")

        fp_df = x[x.pred > thr]
        tp_df = y[y.pred > thr]
        fn_df = y[y.pred < thr]

        tp_len = len(tp_df)
        fp_len = len(fp_df)

        # Calculating the Recall = TP/(TP+FN)
        recall_val = tp_len / tp_fn_len
        # Calculating the precision = TP/(TP+FP)
        precision_val =  tp_len / (tp_len + fp_len)

        Rec.append(recall_val)
        Pre.append(precision_val)

        print(f"Recall val: \n{recall_val}")
        print(f"Precision val: \n{precision_val}")

        include_cols = ['ransomware_name', 'PID_Process']

        # Printing the TPs, FNs and FPs by name recording to see which ransomware we detected or missed and which
        # legitimate software we detected as ransomware (FP)
        if precision_val > 0.85:
            print(f"TPs: \n{tp_df[include_cols].value_counts()}")
            print(f"FNs: \n{fn_df[include_cols].value_counts()}")
            print(f"FPs: \n{fp_df[include_cols].value_counts()}")


In [15]:
model_eval(model, val_data)

thr: 0.0
Recall val: 
0.03128321943811693
Precision val: 
0.030334265940214992
thr: 0.01
Recall val: 
0.03128321943811693
Precision val: 
0.2881118881118881
thr: 0.02
Recall val: 
0.03128321943811693
Precision val: 
0.3249211356466877
thr: 0.03
Recall val: 
0.03128321943811693
Precision val: 
0.34563758389261745
thr: 0.04
Recall val: 
0.03128321943811693
Precision val: 
0.36589698046181174
thr: 0.05
Recall val: 
0.03128321943811693
Precision val: 
0.37522768670309653
thr: 0.06
Recall val: 
0.03128321943811693
Precision val: 
0.3850467289719626
thr: 0.07
Recall val: 
0.03128321943811693
Precision val: 
0.39768339768339767
thr: 0.08
Recall val: 
0.03128321943811693
Precision val: 
0.40234375
thr: 0.09
Recall val: 
0.03128321943811693
Precision val: 
0.41282565130260523
thr: 0.1
Recall val: 
0.03128321943811693
Precision val: 
0.4256198347107438
thr: 0.11
Recall val: 
0.03128321943811693
Precision val: 
0.4345991561181435
thr: 0.12
Recall val: 
0.03128321943811693
Precision val: 
0.435517

## Save Model

In [16]:
save_model(model)

## Conclusions
Here we show an example of how to train a single-model for a time window of 3 snapshots. If we extened this training to create three cascading models with time windows of 3, 5, and 10 snapshots our precision is 90% and our recall is also 90%. Our model based on AppShield - BlueField which is an agentless system. By using AppShield we succeeded to detect ransomware without the ransomware is knowing that we are monitoring. 

## References
##### https://github.com/volatilityfoundation/volatility3
##### https://developer.nvidia.com/networking/doca