# Ransomware Detection Model: Windows OS

## Table of Contents
* Introduction
* Dataset & Volatility Plugins
* Data Preprocessing
* Model Training
* Model Evaluation
* Conclusions
* References

## Introduction

Ransomware attacks are growing in volume and sophistication, and many attackers evade traditional file-scanning techniques. Here we use data sampled from volatile memory (RAM) to detect the presence of ransomware on Windows. We engineer several features from a dataset containing artifacts from running both benign and ransomware processes and train a random forest classifier. Multiple models can be created depending on the number of samples or snapshots available.

## Dataset

We ran hundreds of ransomware samples in our lab environment and recorded the generated process features using the [Volatility framework](https://github.com/volatilityfoundation/volatility3) to create a labeled dataset.

The CSV file contains 530 columns, a combination of features from 5 different Volatility plugins. This data collection is part of [DOCA AppShield](https://developer.nvidia.com/networking/doca).

## Volatility Plugin Features

#### Envars Plugin
Displays a process's environment variables. Typically this will show the number of CPUs installed and the hardware architecture, the process's current directory, temporary directory, session name, computer name, user name, and various other interesting artifacts.

#### Threadlist Plugin
Displays the threads that are used by a process.

#### VadInfo Plugin
Displays extended information about a process's VAD nodes.
- The address of the MMVAD structure in kernel memory
- The starting and ending virtual addresses in process memory that the MMVAD structure pertains to
- The VAD Tag
- The VAD flags, control flags, etc
- The name of the memory mapped file (if one exists)
- The memory protection constant (permissions)

#### Handles Plugin
Displays the open handles in a process. This applies to files, registry keys, mutexes, named pipes, events, window stations, desktops, threads, and all other types of securable executive objects.

#### LdrModules Plugin
Displays a process's loaded DLLs. LdrModules can reveal DLL hiding or injection activity in a process's memory.

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import random

## Data Preprocessing

In [2]:
TRAINING_DATA_PATH = "../../datasets/training-data/ransomware-training-data.csv"

# Read CSV of Data from Plugins with Ransomware labels
ransom_df = pd.read_csv(TRAINING_DATA_PATH)

In [3]:
# Sort the dataframe by ransomware_name, PID_Process and snapshot so that the snapshots of each process are in time order
ransom_df = ransom_df.sort_values(by=['ransomware_name', 'PID_Process', 'snapshot'])
ransom_df = ransom_df.reset_index(drop=True)

In [4]:
# Using only 99 of the 530 features based on prior experimentation
# In that experiment, we selected the 99 best features of a random forest model trained on single snapshots

# Information and definitions about the important features:
# Definitions:
    # Commit charged - the total amount of virtual memory of all processes that must be backed by either physical memory or the page file
    # vad - virtual address descriptor
    # vads - virtual address descriptor short
    # private memory - this field refers to committed regions that cannot be shared with other processes.
# In the feature engineering stage we use the output of several memory plugins as raw data:
# Environment variables plugin feature engineering:
    # Check whether an extended PATHEXT environment variable exists
    # Calculate the number of environment variables for each process
# Threadlist plugin feature engineering:
    # Count the threads with state '2' ('Running') and with wait reason '9' ('WrPageIn'), '13' ('WrUserRequest') or '31' ('WrDispatchInt')
    # Calculate the amount of unique states
    # Calculate the amount of unique wait reasons
# Vadinfo plugin feature engineering:
    # Calculate the amount of vad, vads and private memory
    # Calculate the ratio of vad, vads and private memory in vadinfo df
    # Calculate the mean, max, sum and len of commit charged
    # Calculate the mean, max, sum of vad commit charged
    # Calculate the min of vads commit charged
    # Calculate for each page protection: 'PAGE_EXECUTE_READWRITE ','PAGE_EXECUTE_WRITECOPY ','PAGE_NOACCESS ' min commit charged
    # Calculate min commit charged for vad with 'PAGE_NOACCESS' protection
    # Calculate for each page protection: 'PAGE_EXECUTE_READWRITE ','PAGE_NOACCESS ','PAGE_READONLY ' mean commit charged
    # Calculate mean commit charged for vad with 'PAGE_NOACCESS' protection
    # Calculate for each page protection: 'PAGE_EXECUTE_READWRITE ','PAGE_NOACCESS ' max commit charged
    # Calculate for each page protection: 'PAGE_EXECUTE_READWRITE ','PAGE_NOACCESS ','PAGE_EXECUTE_WRITECOPY ' sum commit charged
    # Calculate the std of commit charge with 'PAGE_EXECUTE_READWRITE' protection
    # Calculate the amount of entire memory commit charged of vads
    # Count the amount and ratio of each page protection: 'PAGE_READONLY ','PAGE_NOACCESS ','PAGE_EXECUTE_READWRITE ','PAGE_EXECUTE_WRITECOPY '
    # Count the amount and ratio of vads with each page protection: 'PAGE_READONLY ','PAGE_NOACCESS ','PAGE_EXECUTE_READWRITE ','PAGE_READWRITE ','PAGE_EXECUTE_READWRITE '
    # Count the amount and ratio of vad with each page protection: 'PAGE_READONLY ','PAGE_NOACCESS ','PAGE_EXECUTE_WRITECOPY ','PAGE_READWRITE '
    # Count vadinfo unique paths
    # Calculate the ratio between vads amount and amount pages with PAGE_EXECUTE_WRITECOPY access + 1
    # Count amount of unique extensions
# Handles plugin feature engineering:
    # Count double extensions file handles
    # Count amount of files with common file extension
    # Count amount of directories with personal user directory
    # Count amount of directories with windows directory
    # Count amount of unique directories
    # Count unique file extension
    # Count amount of handles
    # Count amount and ratio of unique handles names
    # Count amount and ratio of unique handles type
    # Count amount and ratio of handles type
# LdrModules plugin feature engineering:
    # Extract process size and path
IMPORTANT_FEATURES = ['envirs_pathext', 'count_double_extension_count_handles', 'page_readonly_vads_count', 'double_extension_len_handles', 'get_commit_charge_max_vad', 'count_entire_commit_charge_vads', 'get_commit_charge_min_vad_page_noaccess', 'check_doc_file_handle_count', 'envars_df_count', 'page_noaccess_vad_count', 'get_commit_charge_min_vads', 'get_commit_charge_mean_vad_page_noaccess', 'page_noaccess_vad_ratio', 'handles_df_directory_count', 'threadlist_df_wait_reason_9', 'page_noaccess_count', 'get_commit_charge_mean_page_noaccess', 'ldrmodules_df_size_int', 'get_commit_charge_max_page_execute_readwrite', 'ratio_private_memory', 'get_commit_charge_max_page_noaccess', 'page_readwrite_ratio', 'get_commit_charge_mean_page_execute_readwrite', 'handles_df_section_ratio', 'vad_ratio', 'page_noaccess_ratio', 'page_execute_writecopy_vad_ratio', 'handles_df_section_count', 'handles_df_tpworkerfactory_count', 'page_readonly_count', 'handles_df_waitcompletionpacket_count', 'get_commit_charge_mean_page_readonly', 'page_readonly_vad_ratio', 'handles_df_event_ratio', 'handles_df_semaphore_ratio', 'get_commit_charge_sum_page_execute_readwrite', 'threadlist_df_state_2', 'handles_df_iocompletionreserve_count', 'handles_df_directory_ratio', 'handles_df_iocompletionreserve_ratio', 'get_commit_charge_mean_vad', 'get_commit_charge_sum_page_execute_writecopy', 'page_execute_readwrite_ratio', 'get_commit_charge_min_page_execute_readwrite', 'threadlist_df_wait_reason_31', 'get_commit_charge_sum_page_noaccess', 'page_readwrite_vads_ratio', 'handles_df_mutant_ratio', 'get_commit_charge_sum_vad', 'get_commit_charge_max', 'handles_df_type_unique', 'handles_df_iocompletion_ratio', 'handles_df_waitcompletionpacket_ratio', 'handles_df_tpworkerfactory_ratio', 'vadinfo_df_path_unique', 'vad_count', 'page_readonly_ratio', 'count_private_memory', 'page_execute_readwrite_vads_ratio', 'vads_page_execute_writecopy_ratio', 'handles_df_file_ratio', 'handles_df_etwregistration_ratio', 
'handles_df_key_ratio', 'get_commit_charge_min_page_noaccess', 'page_readonly_vads_ratio', 'handles_df_thread_ratio', 'handles_df_file_count', 'handles_df_thread_count', 'threadlist_df_count', 'get_commit_charge_len', 'get_commit_charge_min_page_execute_writecopy', 'handles_df_alpc port_ratio', 'file_users_exists', 'file_windows_count', 'handles_df_key_count', 'threadlist_df_wait_reason_13', 'threadlist_df_wait_reason_unique', 'handles_df_semaphore_count', 'handles_df_name_unique_ratio', 'threadlist_df_state_unique', 'get_count_unique_extensions', 'handles_df_name_unique', 'page_noaccess_vads_ratio', 'handles_df_event_count', 'page_readwrite_vad_ratio', 'handles_df_alpc port_count', 'get_commit_charge_std_page_execute_readwrite', 'count_directories_handles_uniques', 'count_extension_handles_uniques', 'page_readwrite_vad_count', 'get_commit_charge_sum', 'get_commit_charge_mean', 'handles_df_desktop_ratio', 'handles_df_count', 'handles_df_mutant_count', 'handles_df_windowstation_ratio', 'page_execute_readwrite_vads_count', 'handles_df_type_unique_ratio', 'page_execute_readwrite_count']
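
As a rough illustration of the VadInfo feature engineering listed in the comments above, the sketch below derives a few count, ratio, and commit-charge statistics for one process snapshot. The input column names (`Tag`, `Protection`, `CommitCharge`), the toy values, and the mapping from feature names to computations are assumptions made for this example; the actual plugin output and feature pipeline may differ.

```python
import pandas as pd

# Hypothetical VadInfo rows for one process in one snapshot.
# Column names and values are assumptions for illustration only.
vadinfo_df = pd.DataFrame({
    'Tag': ['Vad', 'VadS', 'VadS', 'Vad', 'VadS'],
    'Protection': ['PAGE_READONLY', 'PAGE_NOACCESS', 'PAGE_EXECUTE_READWRITE',
                   'PAGE_EXECUTE_WRITECOPY', 'PAGE_NOACCESS'],
    'CommitCharge': [0, 4, 12, 2, 6],
})

def vadinfo_features(df):
    """Return a few count, ratio and commit-charge features as a dict."""
    total = len(df)
    features = {
        'get_commit_charge_len': total,
        'get_commit_charge_sum': int(df['CommitCharge'].sum()),
        'get_commit_charge_mean': float(df['CommitCharge'].mean()),
        'get_commit_charge_max': int(df['CommitCharge'].max()),
    }
    # Count and ratio of regions for each page protection of interest
    for protection in ['PAGE_READONLY', 'PAGE_NOACCESS', 'PAGE_EXECUTE_READWRITE']:
        mask = df['Protection'] == protection
        name = protection.lower()
        features[name + '_count'] = int(mask.sum())
        features[name + '_ratio'] = float(mask.sum() / total) if total else 0.0
    # Count and ratio of regions tagged 'Vad'
    vad_mask = df['Tag'] == 'Vad'
    features['vad_count'] = int(vad_mask.sum())
    features['vad_ratio'] = float(vad_mask.sum() / total) if total else 0.0
    return features

features = vadinfo_features(vadinfo_df)
print(features)
```

Each snapshot of each process would yield one such dictionary, and the dictionaries from the different plugins would be merged into one row of the training table.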

In [5]:
# Create time-series data- rolling window size given by snapshot length
def create_timeseries_data(full_df, snapshot_len):
    list_data = []
    list_data_n_snapshots = []
    list_labels = []
    list_snapshots = []
    ransomware_name_list = []
    list_pid_process = []
    # Creating batches of time-series data with additional information: ransomware_name, PID_Process, label, snapshot
    for i in range(len(full_df)):
        current_ransomname = full_df.iloc[i]['ransomware_name']
        current_pid_process = full_df.iloc[i]['PID_Process']
        current_label = full_df.iloc[i]['label']
        current_snapshot = full_df.iloc[i]['snapshot']
        list_data_n_snapshots.extend(list(full_df.iloc[i][IMPORTANT_FEATURES]))
        for j in range(1, snapshot_len):
            if i + j < len(full_df) and current_ransomname == full_df.iloc[i + j][
                'ransomware_name'] and current_pid_process == full_df.iloc[i + j]['PID_Process']:
                list_data_n_snapshots.extend(list(full_df.iloc[i + j][IMPORTANT_FEATURES]))
        # Store the batch only if a full window of snapshot_len snapshots was collected
        if len(list_data_n_snapshots) == snapshot_len * len(IMPORTANT_FEATURES):
            list_data_n_snapshots = np.array(list_data_n_snapshots).reshape(snapshot_len, len(IMPORTANT_FEATURES))
            list_labels.append(current_label)
            list_data.append(list_data_n_snapshots)
            ransomware_name_list.append(current_ransomname)
            list_pid_process.append(current_pid_process)
            list_snapshots.append(current_snapshot)
        list_data_n_snapshots = []
    return np.array(list_data), list_labels, ransomware_name_list, list_pid_process, list_snapshots

# Convert array to dataframe
def convert_array_dataframe(ransom_df, snapshot_len):
    # Create time-series data, because a ransomware attack unfolds over time
    list_data_series, list_labels_series, list_ransomware_series, list_pid_process, list_snapshots = create_timeseries_data(
        ransom_df, snapshot_len)
    list_data_series_flat = list_data_series.reshape(len(list_data_series), snapshot_len * len(IMPORTANT_FEATURES))
    time_series_features = []
    # Tagging the features according to the snapshot
    for i in range(snapshot_len):
        time_series_features.extend([s + '_' + str(i) for s in IMPORTANT_FEATURES])
    # Creating dataframe input for the model
    ransom_df_series = pd.DataFrame(list_data_series_flat, columns=time_series_features)
    ransom_df_series['label'], ransom_df_series['ransomware_name'], ransom_df_series['PID_Process'], ransom_df_series[
        'snapshot'] = list_labels_series, list_ransomware_series, list_pid_process, list_snapshots
    list_labels_series = list(ransom_df_series['label'])
    return ransom_df_series, list_labels_series

# Save model
def save_model(model, output_file='ransomware_model_new.sav'):
    # Use a context manager so the file is closed even if pickling fails
    with open(output_file, 'wb') as f:
        pickle.dump(model, f)
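
To make the windowing concrete, here is a small self-contained sketch on toy data with two features and a window of 3 snapshots. It uses `groupby` instead of the row-by-row loop above, so it is an alternative formulation rather than the notebook's implementation; the toy columns `f1` and `f2` and the helper `window_rows` are hypothetical.

```python
import pandas as pd

FEATURES = ['f1', 'f2']  # toy stand-ins for IMPORTANT_FEATURES

toy_df = pd.DataFrame({
    'ransomware_name': ['a'] * 4 + ['b'] * 2,
    'PID_Process': [1] * 4 + [7] * 2,
    'snapshot': [1, 2, 3, 4, 1, 2],
    'label': [1] * 4 + [0] * 2,
    'f1': [10, 11, 12, 13, 20, 21],
    'f2': [0, 1, 0, 1, 5, 5],
})

def window_rows(df, snapshot_len):
    """Flatten every run of snapshot_len consecutive rows per process into one row."""
    # Column order matches the notebook: all features of snapshot 0, then snapshot 1, ...
    columns = [f + '_' + str(i) for i in range(snapshot_len) for f in FEATURES]
    rows = []
    for (name, pid), group in df.groupby(['ransomware_name', 'PID_Process'], sort=False):
        group = group.sort_values('snapshot')
        values = group[FEATURES].to_numpy()
        # A group with fewer than snapshot_len rows yields no windows
        for start in range(len(group) - snapshot_len + 1):
            flat = values[start:start + snapshot_len].reshape(-1)
            row = dict(zip(columns, flat))
            row.update(label=group['label'].iloc[start], ransomware_name=name,
                       PID_Process=pid, snapshot=group['snapshot'].iloc[start])
            rows.append(row)
    return pd.DataFrame(rows)

windows = window_rows(toy_df, 3)
print(windows)
```

Process `a` has four snapshots and therefore produces two overlapping windows, while process `b` has only two snapshots and produces none, which mirrors the "store a batch" condition in `create_timeseries_data`.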

### Split Dataset Into Training and Validation

In [6]:
# Splitting into training and validation sets by ransomware name
files = ransom_df['ransomware_name'].unique()
# We randomize the files to remove biases related to the recording process
random.shuffle(files)
train_files = files[:int(len(files)*0.8)]
test_files = files[int(len(files)*0.8):]

In [7]:
full_df_train = ransom_df[ransom_df['ransomware_name'].isin(train_files)]
full_df_val = ransom_df[ransom_df['ransomware_name'].isin(test_files)]
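
The shuffle above is unseeded, so the split changes from run to run. If a reproducible split is wanted, one option is to shuffle with a seeded generator, as in the sketch below. The seed value, the helper name `split_by_name`, and the toy recording names are assumptions for illustration.

```python
import random

def split_by_name(names, train_fraction=0.8, seed=42):
    """Split unique recording names into train and validation lists, reproducibly."""
    unique_names = sorted(set(names))          # fixed starting order
    random.Random(seed).shuffle(unique_names)  # seeded, independent of global state
    cut = int(len(unique_names) * train_fraction)
    return unique_names[:cut], unique_names[cut:]

toy_names = ['rec_%d' % i for i in range(10)]
train_names, val_names = split_by_name(toy_names)

# No recording appears on both sides, so snapshots of one recording never leak
assert not set(train_names) & set(val_names)
print(len(train_names), len(val_names))
```

The resulting lists can be passed to `isin` in the same way as `train_files` and `test_files`.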

In [8]:
# Creating training data for the short model - 3 snapshots
ransom_df_train_series_short, list_labels_train_series_short = convert_array_dataframe(full_df_train, 3)

In [9]:
# Creating validation data for the short model - 3 snapshots
ransom_df_val_series_short, list_labels_val_series_short = convert_array_dataframe(full_df_val, 3)

## Model Training

In [10]:
# Remove the label and the identifier columns (ransomware_name, PID_Process, snapshot) from the features
training_features = [item for item in list(ransom_df_train_series_short.columns) if
                               item not in ['label', 'ransomware_name', 'PID_Process', 'snapshot']]
X_df_train = ransom_df_train_series_short[training_features]
Y_train = list_labels_train_series_short

In [11]:
# RandomForest model parameters
MAX_DEPTH=10
MIN_SAMPLES_SPLIT=10
N_ESTIMATORS=250

# We use a random forest; averaging many depth-limited trees helps reduce overfitting
model = RandomForestClassifier(max_depth=MAX_DEPTH, min_samples_split=MIN_SAMPLES_SPLIT, n_estimators=N_ESTIMATORS)
model.fit(X_df_train, Y_train)

RandomForestClassifier(max_depth=10, min_samples_split=10, n_estimators=250)
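
The comments in cell 4 mention that the 99 features were chosen using a random forest trained on single snapshots. The notebook does not show that step; one common way to do it is to rank features by `feature_importances_` and keep the top k, sketched here on synthetic data. The dataset, the value of `k`, and the feature names are placeholders, not the authors' experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the single-snapshot feature table
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
feature_names = ['feature_%d' % i for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
forest.fit(X, y)

k = 5  # number of features to keep (99 in the notebook)
# argsort is ascending, so reverse it to get the most important features first
order = np.argsort(forest.feature_importances_)[::-1][:k]
top_features = [feature_names[i] for i in order]
print(top_features)
```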

## Model Evaluation

In [12]:
# Evaluate model
def model_eval(model, full_df_val):
    feature_columns = [item for item in list(full_df_val.columns) if
                       item not in ['label', 'ransomware_name', 'PID_Process', 'snapshot', 'pred']]
    # Score a copy so that the caller's dataframe is left untouched and pandas
    # does not raise SettingWithCopyWarning
    scored_df = full_df_val.copy()
    scored_df['pred'] = model.predict_proba(scored_df[feature_columns])[:, 1]
    is_ransomware = scored_df['label'] == 1
    is_benign = scored_df['label'] == 0
    Pre = []
    Rec = []
    # Sweep the decision threshold and calculate precision and recall at each value
    for thr in np.arange(0, 1, 0.01):
        flagged = scored_df['pred'] > thr
        tp = int((flagged & is_ransomware).sum())
        fn = int((~flagged & is_ransomware).sum())
        fp = int((flagged & is_benign).sum())
        print("thr: " + str(thr))
        # Precision and recall are undefined when their denominators are zero
        if tp + fn == 0 or tp + fp == 0:
            print('error: ' + str(thr))
            continue
        # Calculating the Recall = TP/(TP+FN)
        recall = tp / (tp + fn)
        print("Recall val: ")
        print(recall)
        Rec.append(recall)
        # Calculating the precision = TP/(TP+FP)
        precision = tp / (tp + fp)
        print("Precision val: ")
        print(precision)
        Pre.append(precision)
        # Printing the TPs, FNs and FPs by recording name to see which ransomware we detected or missed and which
        # legitimate software we detected as ransomware (FP)
        if precision > 0.85:
            print("TPs:")
            print(scored_df[flagged & is_ransomware][['ransomware_name', 'PID_Process']].value_counts())
            print("FNs:")
            print(scored_df[~flagged & is_ransomware][['ransomware_name', 'PID_Process']].value_counts())
            print("FPs:")
            print(scored_df[flagged & is_benign][['ransomware_name', 'PID_Process']].value_counts())
    # Return the collected values so that they can be plotted
    return Pre, Rec
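
The threshold sweep reduces to counting true positives, false positives, and false negatives at each cut-off. A compact NumPy version on toy scores is sketched below; the helper `precision_recall_at` and the toy arrays are hypothetical and only meant to make the two formulas concrete.

```python
import numpy as np

def precision_recall_at(scores, labels, thr):
    """Return (precision, recall) for predictions scores > thr; None where undefined."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    flagged = scores > thr
    tp = int(np.sum(flagged & (labels == 1)))
    fp = int(np.sum(flagged & (labels == 0)))
    fn = int(np.sum(~flagged & (labels == 1)))
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall

toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 0]

for thr in (0.0, 0.35, 0.85, 0.95):
    print(thr, precision_recall_at(toy_scores, toy_labels, thr))
```

At a threshold of 0.95 nothing is flagged, so precision is undefined. This is the case that `model_eval` has to handle at high thresholds.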

In [13]:
model_eval(model, ransom_df_val_series_short)

thr: 0.0
Recall val: 
1.0
Precision val: 
0.016313116223483717
thr: 0.01
Recall val: 
1.0
Precision val: 
0.25045871559633026
thr: 0.02
Recall val: 
1.0
Precision val: 
0.2676470588235294
thr: 0.03
Recall val: 
1.0
Precision val: 
0.28556485355648537
thr: 0.04
Recall val: 
1.0
Precision val: 
0.2970620239390642
thr: 0.05
Recall val: 
1.0
Precision val: 
0.3127147766323024
thr: 0.06
Recall val: 
1.0
Precision val: 
0.3482142857142857
thr: 0.07
Recall val: 
1.0
Precision val: 
0.36941813261163736
thr: 0.08
Recall val: 
1.0
Precision val: 
0.4050445103857567
thr: 0.09
Recall val: 
1.0
Precision val: 
0.4396135265700483
thr: 0.1
Recall val: 
1.0
Precision val: 
0.5
thr: 0.11
Recall val: 
1.0
Precision val: 
0.5055555555555555
thr: 0.12
Recall val: 
1.0
Precision val: 
0.5219885277246654
thr: 0.13
Recall val: 
1.0
Precision val: 
0.5571428571428572
thr: 0.14
Recall val: 
1.0
Precision val: 
0.5594262295081968
thr: 0.15
Recall val: 
1.0
Precision val: 
0.5973741794310722
thr: 0.16
Recall val

## Save Model

In [14]:
save_model(model)
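
To reuse the saved model later, it can be loaded back with `pickle.load`. The sketch below round-trips a stand-in object through a temporary file so that it runs on its own; with the real model, the path would be the `output_file` passed to `save_model`. The helper `load_model` is hypothetical. Loading a pickle file can execute arbitrary code, so only load files from a trusted source.

```python
import os
import pickle
import tempfile

def load_model(input_file):
    """Load a pickled object; hypothetical counterpart to save_model."""
    with open(input_file, 'rb') as f:
        return pickle.load(f)

# Stand-in for a trained classifier, used only to keep the example self-contained
stand_in_model = {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 250}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ransomware_model_new.sav')
    with open(path, 'wb') as f:
        pickle.dump(stand_in_model, f)
    restored = load_model(path)

print(restored == stand_in_model)
```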

## Conclusions
Here we show an example of how to train a single model for a time window of 3 snapshots. If we extend this training to create three cascading models with time windows of 3, 5, and 10 snapshots, both precision and recall reach 90%. Our model is based on AppShield on BlueField, which is an agentless system. By using AppShield, we were able to detect ransomware without the ransomware knowing that it is being monitored.
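
The notebook does not show how the three models are combined. One plausible arrangement, sketched below purely as an assumption, is to pick the model with the longest window that the snapshots collected so far can fill. The function name `select_window` and the behavior when too few snapshots exist are hypothetical.

```python
WINDOW_SIZES = (3, 5, 10)  # window lengths mentioned above

def select_window(num_snapshots, window_sizes=WINDOW_SIZES):
    """Return the largest window that fits, or None if too few snapshots exist."""
    usable = [w for w in window_sizes if w <= num_snapshots]
    return max(usable) if usable else None

for n in (2, 3, 7, 12):
    print(n, select_window(n))
```

Under this scheme a process is first scored by the 3-snapshot model and handed to the longer-window models as more snapshots arrive.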

## References
* https://github.com/volatilityfoundation/volatility3
* https://developer.nvidia.com/networking/doca