# Malicious False Labels

Describes and summarizes a cluster or set of clusters. Useful when you want to list the samples (and their API calls) that were falsely labelled as malicious according to the third-party verification tool, VirusTotal.

Note that this notebook uses only the verified xxxx_SampleHash_Common.csv file, which represents a significant majority of the entire Oliveira dataset.
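As a minimal sketch of the selection criterion this notebook relies on, the snippet below flags a sample as falsely labelled when all three malware-type columns are empty. The toy rows and hashes are invented for illustration; only the column names (`cluster`, `hash`, `Type 1` to `Type 3`) and the `_` placeholder follow the verified CSV.

```python
import pandas as pd

# Toy stand-in for a verified <DataClustering>_SampleHash_Common.csv file (values are invented)
toy_df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "hash": ["aaa", "bbb", "ccc", "ddd"],
    "Type 1": ["trojan", "-", "-", "trojan"],
    "Type 2": ["dropper", "-", "adware", "-"],
    "Type 3": ["-", "-", "-", "-"],
})

# Normalise the empty-type placeholder, as the notebook does before filtering
toy_df = toy_df.replace(to_replace="-", value="_")

# A sample counts as falsely labelled only if every type column is empty
type_cols = ["Type 1", "Type 2", "Type 3"]
is_false_label = toy_df[type_cols].eq("_").all(axis=1)
toy_false_labelled = toy_df.loc[is_false_label, ["cluster", "hash"]]

print(toy_false_labelled)
```

Samples with at least one non-empty type column (for example `ccc`, which has `adware` in `Type 2`) are kept out of the falsely labelled set.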

## Import Libraries/Datasets

In [1]:
import pandas as pd

malicious_df = pd.read_csv('./Clustering/(EDITED)KMeans_SampleHash_Common.csv', low_memory=False) #This should point to a verified <DataClustering>_SampleHash_Common.csv file
benign_df = pd.read_csv('./Clustering/Benign/API_Patterns.csv', low_memory=False) #This should point to the API_Patterns.csv file

#Load list of API calls
API_LIST = "api_calls.txt"
DELIMITER = "NaN"
with open(API_LIST, "r") as API_FILE:  #context manager closes the file even if reading fails
    APIS = API_FILE.readline().strip().split(',')  #strip() removes the trailing newline from the last API name
APIS.append(DELIMITER) #serves as a label for NaN values for Instance-based datasets

def get_unique_clusters(df:pd.DataFrame):
    return list(df['cluster'].unique())
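A quick way to sanity-check the loading logic above without the real `api_calls.txt` is to run it against an in-memory file. The helper name `load_api_list` and the sample contents below are hypothetical; the sketch assumes the file holds a single comma-separated line, as the cell implies.

```python
import io

DELIMITER = "NaN"

def load_api_list(handle, delimiter=DELIMITER):
    """Read one comma-separated line of API names and append the NaN label."""
    # strip() drops the trailing newline so the last name stays clean
    apis = handle.readline().strip().split(",")
    apis.append(delimiter)
    return apis

# Hypothetical file contents; the real api_calls.txt is not reproduced here
fake_file = io.StringIO("NtClose,LdrLoadDll,RegCloseKey\n")
apis = load_api_list(fake_file)
print(apis)
```

Passing a file handle rather than a path keeps the helper easy to test; with the real file it could be called inside a `with open("api_calls.txt") as f:` block.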

## DataFrame Preview

In [2]:
#Replace '-' empty malware type delimiter with '_' for consistency
malicious_df.replace(to_replace='-',value='_', inplace=True)
malicious_df

Unnamed: 0,cluster,hash,Type 1,Type 2,Type 3,pattern
0,0,490d584c7d303ed35c673460b63f3ca8,trojan,dropper,pua,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
1,0,9ab8ea1d2d68a0d4110df413e677976c,trojan,hacktool,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
2,0,adbc74815ef2bd1ea4967abad812233d,trojan,_,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
3,0,f6eb4841bba3a4cee747700dc0ee1609,_,_,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
4,0,f5a0ad49337ebc87897698e70d03364e,trojan,dropper,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
...,...,...,...,...,...,...
1756,198,d24b78bd73f17379ed62e4c776b4f66e,trojan,adware,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
1757,198,f666dd4b3a53b7fe71f8976fa09bfdfb,trojan,adware,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
1758,199,b6d6520b608875282d831b1e983cd5e5,_,_,_,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
1759,199,18bce1a594550daf8b3f318de48c1674,trojan,dropper,_,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."


## How many of the verified samples are falsely labelled?

In [3]:
false_labelled = malicious_df[(malicious_df['Type 1']=='_')&(malicious_df['Type 2']=='_')&(malicious_df['Type 3']=='_')].copy(deep=True)
false_labelled.drop(columns=['Type 1', 'Type 2', 'Type 3'], inplace=True)

print(f"No. of falsely labelled samples from verified samples: {false_labelled.shape[0]}")
print("")

print("Counts of Falsely Labelled Samples in each Cluster")
print(false_labelled['cluster'].value_counts())
print("")

display(false_labelled)

No. of falsely labelled samples from verified samples: 97

Counts of Falsely Labelled Samples in each Cluster
cluster
41     10
194     8
63      7
163     6
162     6
92      5
126     5
34      5
141     5
69      5
143     4
114     4
156     4
147     3
158     3
67      2
165     1
177     1
0       1
96      1
137     1
130     1
3       1
80      1
57      1
44      1
37      1
30      1
29      1
16      1
199     1
Name: count, dtype: int64



Unnamed: 0,cluster,hash,pattern
3,0,f6eb4841bba3a4cee747700dc0ee1609,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
29,3,891ad40bae7bcb15ad7d6c3a512ca31b,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
147,16,149a13fecb33c3f8f6e4705a7184b334,"GetSystemTimeAsFileTime,NtAllocateVirtualMemor..."
269,29,0462d98ae8bdc75c707d42a90d01a0e9,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
277,30,f524bc7e9bcef51a5fd5b976b047643f,"SetErrorMode,LdrGetDllHandle,LdrGetProcedureAd..."
...,...,...,...
1713,194,20282634cb838987bb48d969655cc6a9,"NtAllocateVirtualMemory,LoadStringA,NtClose,Lo..."
1714,194,d8fcabb2b4a0ce949949423026b3bca6,"NtAllocateVirtualMemory,LoadStringA,NtClose,Lo..."
1715,194,addcc965f09c6fab10000c7c18960455,"NtAllocateVirtualMemory,LoadStringA,NtClose,Lo..."
1717,194,9132e700162931646171e3ff5c3c6ea7,"NtAllocateVirtualMemory,LoadStringA,NtClose,Lo..."


## Do the API Call Patterns of the falsely labelled samples match those of the Benign samples?

**Note:** The samples labelled as benign in Oliveira came from Win7 executables, so they are guaranteed to be truly benign, which makes them safe to use for comparison.
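The comparison in this section can also be expressed as a single inner join on the `pattern` column, which pairs each falsely labelled sample with every benign sample that shares its exact API call pattern. This is a sketch on invented toy data, not the notebook's own cell; it assumes both frames carry `hash` and `pattern` columns, and the suffixed column names are chosen for illustration.

```python
import pandas as pd

# Invented falsely labelled samples (cluster, hash, API call pattern)
toy_false = pd.DataFrame({
    "cluster": [0, 3, 3],
    "hash": ["f1", "f2", "f3"],
    "pattern": ["A,B,C", "A,B,C", "X,Y"],
})

# Invented benign samples
toy_benign = pd.DataFrame({
    "hash": ["b1", "b2", "b3"],
    "pattern": ["A,B,C", "Q,R", "A,B,C"],
})

# Inner join keeps only patterns present in both frames;
# the shared 'hash' column is disambiguated with suffixes
matches = toy_false.merge(toy_benign, on="pattern", suffixes=("_false", "_benign"))

# Each distinct pattern shared by both groups, in order of first appearance
matched_patterns = matches["pattern"].unique().tolist()

print(matches[["cluster", "hash_false", "hash_benign"]])
print(matched_patterns)
```

The join compares whole pattern strings, so two samples match only when their API call sequences are identical in both content and order.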

In [4]:
unique_false_patterns = list(false_labelled['pattern'].unique())  #deduplicate so each pattern is reported and counted once

ctr = 1
same = []
print("Falsely Labelled Malicious Samples that Match API Call Patterns of Benign Samples","\n")
for f in unique_false_patterns:
    if benign_df[benign_df['pattern']==f].shape[0]>0:
        print("PATTERN: ", ctr)
        print("API Call Pattern:")
        print("\t" + f)
        # print("API Calls: ")
        # print("\t", pd.Series(list(f.split(","))).unique())
        print("Clusters & Hashes of Matching Falsely Labelled Samples:")
        false_matches = false_labelled[false_labelled['pattern']==f]  #only the falsely labelled rows sharing this pattern
        for p in range(false_matches.shape[0]):
            print(f"\t{false_matches['cluster'].iloc[p]} - {false_matches['hash'].iloc[p]}")
        print("Hashes of Benign Samples with Matching API Call Patterns:")
        benign_matches = benign_df[benign_df['pattern']==f]  #only the benign rows sharing this pattern
        for p in range(benign_matches.shape[0]):
            print(f"\t{benign_matches['hash'].iloc[p]}")
        same.append(f)
        print("\n")
        ctr+=1
print("")
print(f"No. of API Call Patterns of Falsely-Labelled Malicious Samples that match the API Call Patterns of Benign Samples: {len(same)} ({len(same)/benign_df.shape[0]*100:.4f}%)")

Falsely Labelled Malicious Samples that Match API Call Patterns of Benign Samples 

PATTERN:  1
API Call Pattern:
	GetSystemTimeAsFileTime,NtCreateMutant,GetSystemTimeAsFileTime,NtOpenKeyEx,NtQueryKey,NtOpenKeyEx,LdrLoadDll,LdrGetProcedureAddress,RegOpenKeyExW,LdrGetProcedureAddress,RegQueryInfoKeyW,LdrGetProcedureAddress,RegEnumKeyExW,RegOpenKeyExW,RegQueryInfoKeyW,LdrGetProcedureAddress,RegEnumValueW,LdrGetProcedureAddress,RegCloseKey,GetFileAttributesW,RegOpenKeyExW,LdrGetProcedureAddress,RegQueryValueExW,RegCloseKey,NtOpenFile,NtQueryDirectoryFile,NtClose,RegOpenKeyExW,RegQueryInfoKeyW,RegCloseKey,RegOpenKeyExW,RegQueryInfoKeyW,RegEnumValueW,RegCloseKey,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,NtOpenFile,RegOpenKeyExW,RegQueryInfoKeyW,RegCloseKey,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,NtOpenFile,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,GetSystemTimeAsFileTime,NtQuerySystemInformation,NtPro

In [5]:
print("In terms of unique API Calls:")
for i, s in enumerate(same):
    print(f"PATTERN: {i}", "\n", list(pd.Series(s.split(',')).unique()))
    print("")

In terms of unique API Calls:
PATTERN: 0 
 ['GetSystemTimeAsFileTime', 'NtCreateMutant', 'NtOpenKeyEx', 'NtQueryKey', 'LdrLoadDll', 'LdrGetProcedureAddress', 'RegOpenKeyExW', 'RegQueryInfoKeyW', 'RegEnumKeyExW', 'RegEnumValueW', 'RegCloseKey', 'GetFileAttributesW', 'RegQueryValueExW', 'NtOpenFile', 'NtQueryDirectoryFile', 'NtClose', 'NtQuerySystemInformation', 'NtProtectVirtualMemory', 'GetSystemDirectoryW', 'LdrGetDllHandle', 'NtOpenKey', 'NtQueryValueKey', 'NtCreateFile', 'GetFileSize', 'NtCreateSection', 'NtMapViewOfSection', 'GetSystemInfo', 'NtUnmapViewOfSection']

PATTERN: 1 
 ['GetSystemTimeAsFileTime', 'NtOpenKey', 'NtQueryValueKey', 'NtClose', 'NtCreateMutant', 'NtOpenKeyEx', 'NtQueryKey', 'LdrLoadDll', 'LdrGetProcedureAddress', 'RegOpenKeyExW', 'RegQueryInfoKeyW', 'RegEnumKeyExW', 'RegEnumValueW', 'RegCloseKey', 'GetFileAttributesW', 'RegQueryValueExW', 'NtOpenFile', 'NtQueryDirectoryFile', 'NtQuerySystemInformation', 'NtProtectVirtualMemory', 'GetSystemDirectoryW', 'LdrGetDl