# Malicious ClusterScan

Describes and summarizes a cluster or set of clusters. Useful when you want to list the malware types found in each cluster in order of Type # as presented in VirusTotal, so that the first types listed per cluster come from Type 1 (most popular), followed by Type 2, then Type 3 (least popular).

Note that this notebook uses only the verified xxxx_SampleHash_Common.csv file, which represents a significant majority of the entire Oliveira dataset.

## 1. Import Libraries/Datasets

In [31]:
import pandas as pd

# This should point to a VirusTotal-verified <Data Clustering>_SampleHash_Common.csv file
malicious_df = pd.read_csv('./Clustering/(EDITED)KMeans_SampleHash_Common.csv', low_memory=False, index_col=False)

# Load list of API calls
API_LIST = "api_calls.txt"
DELIMITER = "NaN"
with open(API_LIST, "r") as API_FILE:
    APIS = API_FILE.readline().split(',')
APIS.append(DELIMITER)  # Serves as a label for NaN values for instance-based datasets

def get_unique_clusters(df: pd.DataFrame):
    return list(df['cluster'].unique())

## 2. DataFrame Preview

In [32]:
#Replace '-' empty malware type delimiter with '_' for consistency
malicious_df.replace(to_replace='-',value='_', inplace=True)
malicious_df

Unnamed: 0,cluster,hash,Type 1,Type 2,Type 3,pattern
0,0,490d584c7d303ed35c673460b63f3ca8,trojan,dropper,pua,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
1,0,9ab8ea1d2d68a0d4110df413e677976c,trojan,hacktool,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
2,0,adbc74815ef2bd1ea4967abad812233d,trojan,_,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
3,0,f6eb4841bba3a4cee747700dc0ee1609,_,_,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
4,0,f5a0ad49337ebc87897698e70d03364e,trojan,dropper,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
...,...,...,...,...,...,...
1756,198,d24b78bd73f17379ed62e4c776b4f66e,trojan,adware,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
1757,198,f666dd4b3a53b7fe71f8976fa09bfdfb,trojan,adware,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
1758,199,b6d6520b608875282d831b1e983cd5e5,_,_,_,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
1759,199,18bce1a594550daf8b3f318de48c1674,trojan,dropper,_,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
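
`DataFrame.replace` with a plain string matches whole cell values only, so a hyphen inside a longer label is left alone. A minimal sketch on a made-up two-row frame; the `demo` frame and its values are hypothetical and not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy frame; the real data comes from the *_SampleHash_Common.csv file.
demo = pd.DataFrame({
    "Type 1": ["trojan", "-"],
    "Type 2": ["-", "not-a-virus"],
})

# Without regex=True, only cells that are exactly '-' are replaced.
cleaned = demo.replace(to_replace="-", value="_")
print(cleaned)
```

Here the hyphenated label in `Type 2` survives unchanged, while the bare `-` cells become `_`.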


In [33]:
# Drop rows that are falsely labelled (i.e., '_' on all popularity levels of VirusTotal)
malicious_df.drop(malicious_df[(malicious_df['Type 1']=='_')&(malicious_df['Type 2']=='_')&(malicious_df['Type 3']=='_')].index, inplace=True)
malicious_df

Unnamed: 0,cluster,hash,Type 1,Type 2,Type 3,pattern
0,0,490d584c7d303ed35c673460b63f3ca8,trojan,dropper,pua,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
1,0,9ab8ea1d2d68a0d4110df413e677976c,trojan,hacktool,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
2,0,adbc74815ef2bd1ea4967abad812233d,trojan,_,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
4,0,f5a0ad49337ebc87897698e70d03364e,trojan,dropper,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
5,0,4c972b447659f1e86769eb43593fd2a5,trojan,downloader,dropper,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
...,...,...,...,...,...,...
1755,198,0226e311ed2648ff399c7902fc113421,adware,trojan,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
1756,198,d24b78bd73f17379ed62e4c776b4f66e,trojan,adware,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
1757,198,f666dd4b3a53b7fe71f8976fa09bfdfb,trojan,adware,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
1759,199,18bce1a594550daf8b3f318de48c1674,trojan,dropper,_,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
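
The same filter can be written as a boolean mask over the three type columns, which avoids repeating the comparison per column. A minimal sketch, assuming the column names shown above; the `demo` frame and the `TYPE_COLUMNS` name are hypothetical:

```python
import pandas as pd

TYPE_COLUMNS = ["Type 1", "Type 2", "Type 3"]

# Hypothetical toy frame using the same column names as malicious_df.
demo = pd.DataFrame({
    "cluster": [0, 0, 1],
    "Type 1": ["trojan", "_", "adware"],
    "Type 2": ["_", "_", "trojan"],
    "Type 3": ["_", "_", "_"],
})

# True for rows where every type column holds the '_' placeholder.
unlabelled = demo[TYPE_COLUMNS].eq("_").all(axis=1)

# Keep the labelled rows; the original index values are preserved,
# which is why the preview above skips index 3 after the drop.
labelled = demo[~unlabelled]
print(labelled)
```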


In [34]:
malicious_df['Type 3'].unique()

array(['pua', '_', 'dropper', 'adware', 'downloader', 'virus', 'trojan',
       'hacktool', 'ransomware', 'spyware', 'banker', 'worm'],
      dtype=object)
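
The check above covers `Type 3` only. To see every label that occurs at any popularity level, the three columns can be concatenated first. A minimal sketch on a hypothetical toy frame (`demo` and `TYPE_COLUMNS` are illustrative names, not part of the notebook):

```python
import pandas as pd

TYPE_COLUMNS = ["Type 1", "Type 2", "Type 3"]

# Hypothetical toy frame using the same column names as malicious_df.
demo = pd.DataFrame({
    "Type 1": ["trojan", "adware", "trojan"],
    "Type 2": ["dropper", "_", "adware"],
    "Type 3": ["pua", "_", "_"],
})

# Stack the three columns into one Series, then count each label.
all_labels = pd.concat([demo[c] for c in TYPE_COLUMNS], ignore_index=True)
label_counts = all_labels.value_counts()
print(label_counts)
```

Rare or unexpected labels show up at the bottom of `label_counts`, which makes them easy to spot before summarizing per cluster.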

## 3. Identify "What Malware Types are there in each cluster?"

This step identifies the individual malware types associated with each cluster, combining all three VirusTotal (VT) popularity levels.

Note that order matters: the types are listed in the order in which they first appear across Type 1, Type 2, and Type 3, so the most popular labels as indicated by VT come first.

In [35]:
# Identify the overall list of types in each cluster, as designated by VirusTotal.
unique_clusters = get_unique_clusters(malicious_df)

summary = []
for u in unique_clusters:
    df_copy = malicious_df[malicious_df['cluster'] == u].copy(deep=True)
    # Combine the 3 levels of VirusTotal classification: all Type 1 values, then Type 2, then Type 3
    types = list(df_copy['Type 1']) + list(df_copy['Type 2']) + list(df_copy['Type 3'])
    types = [t.strip() for t in types]
    # Meant to sort by quantity, but CPython makes a list appear empty while it is being sorted,
    # so every key below is 0 and the stable sort leaves the order unchanged.
    types.sort(key=lambda x: types.copy().count(x), reverse=True)
    types = list(pd.Series(types).unique())  # De-duplicate, keeping first-appearance order
    types = [t for t in types if t != '_']  # Drop the empty-label placeholder
    summary.append([int(u), ' '.join(types)])
summary = pd.DataFrame(summary, columns=['cluster', 'types'])
for s in range(summary.shape[0]):
    print(f"{summary['cluster'].iloc[s]:2d} | {summary['types'].iloc[s]}")
# Note that malware types are listed in the order they first appear across Type 1, Type 2, and Type 3.

 0 | trojan dropper hacktool downloader pua
 1 | trojan pua adware
 2 | trojan adware downloader pua dropper
 3 | trojan adware downloader virus
 4 | trojan downloader dropper worm virus
 5 | trojan adware
 6 | downloader adware trojan
 7 | trojan adware
 8 | trojan adware pua downloader
 9 | trojan adware
10 | trojan adware
11 | trojan adware pua
12 | trojan adware
13 | trojan adware virus
14 | adware trojan virus
15 | trojan adware dropper
16 | trojan miner downloader adware pua
17 | trojan adware dropper pua
18 | adware trojan downloader virus
19 | trojan adware virus
20 | adware trojan virus downloader pua
21 | trojan adware
22 | trojan adware
23 | trojan dropper
24 | trojan adware downloader
25 | trojan adware
26 | trojan spyware
27 | trojan adware dropper pua
28 | trojan adware
29 | adware trojan pua
30 | trojan pua dropper downloader miner virus adware
31 | trojan pua
32 | trojan adware virus
33 | downloader adware trojan
34 | virus trojan adware pua
35 | trojan adware virus
36 
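
If ordering by quantity is what is wanted, note that CPython makes a list appear empty while `list.sort()` is running, so a key function that reads the list being sorted cannot count anything. A minimal sketch that counts first with `collections.Counter` and then orders by count; the helper name `types_by_frequency` and the sample labels are hypothetical, and its output will generally differ from the listing above:

```python
from collections import Counter

def types_by_frequency(labels, placeholder="_"):
    """Return distinct labels, most frequent first; ties keep first-appearance order."""
    cleaned = [label.strip() for label in labels]
    counts = Counter(label for label in cleaned if label != placeholder)
    # most_common() sorts by count descending; equal counts keep insertion order.
    return [label for label, _ in counts.most_common()]

# Hypothetical labels: Type 1 values, then Type 2, then Type 3 of one cluster.
labels = ["trojan", "adware", "trojan", "adware", "_", "adware", "_", "_", "pua"]
print(" ".join(types_by_frequency(labels)))  # prints: adware trojan pua
```

Counting before sorting keeps the key function independent of the list being reordered. Whether frequency order or Type 1 to Type 3 order is preferable depends on what the summary is meant to convey.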

In [42]:
# Summarize clusters that have the same malware types as per VirusTotal.
unique_type_summary = list(summary['types'].unique())
print("# of Unique Type Summaries:", len(unique_type_summary), "\n")

count_summary = []
for u in unique_type_summary:
    matching_clusters = list(summary[summary['types'] == u]['cluster'])
    count_summary.append([u, len(matching_clusters)])
    print("Malware Type Summary:", u)
    print("Matching Clusters:", matching_clusters)
    print("")
count_summary.sort(key=lambda x: x[1])  # Ascending by count; determines the row index shown below

count_summary = pd.DataFrame(count_summary, columns=['malware_type_summary', 'matching_cluster_count'])
count_summary.sort_values(by='matching_cluster_count', ascending=False, inplace=True)
display("Top 20 Most Common Malicious Types")
display(count_summary.iloc[0:20])

display("Top 20 Least Common Malicious Types")
display(count_summary.iloc[len(count_summary)-20-1:len(count_summary)-1])

# of Unique Type Summaries: 89 

Malware Type Summary: trojan dropper hacktool downloader pua
Matching Clusters: [0]

Malware Type Summary: trojan pua adware
Matching Clusters: [1]

Malware Type Summary: trojan adware downloader pua dropper
Matching Clusters: [2, 179]

Malware Type Summary: trojan adware downloader virus
Matching Clusters: [3, 38, 86]

Malware Type Summary: trojan downloader dropper worm virus
Matching Clusters: [4]

Malware Type Summary: trojan adware
Matching Clusters: [5, 7, 9, 10, 12, 21, 22, 25, 28, 42, 47, 52, 58, 62, 65, 67, 71, 72, 83, 84, 88, 91, 99, 110, 116, 123, 141, 157, 161, 172, 174, 183, 193, 195]

Malware Type Summary: downloader adware trojan
Matching Clusters: [6, 33, 76]

Malware Type Summary: trojan adware pua downloader
Matching Clusters: [8, 36, 64, 89, 173, 184, 189]

Malware Type Summary: trojan adware pua
Matching Clusters: [11, 45, 56, 79, 93, 106, 145, 164, 167, 182]

Malware Type Summary: trojan adware virus
Matching Clusters: [13, 19, 32, 

'Top 20 Most Common Malicious Types'

Unnamed: 0,malware_type_summary,matching_cluster_count
88,trojan adware,34
87,trojan adware virus,17
86,trojan adware pua,10
85,trojan,7
84,trojan adware pua downloader,7
83,trojan downloader adware,5
82,trojan dropper,5
81,adware trojan downloader,4
80,trojan adware dropper pua,4
75,trojan spyware,3


'Top 20 Least Common Malicious Types'

Unnamed: 0,malware_type_summary,matching_cluster_count
23,adware trojan pua ransomware,1
34,trojan downloader,1
42,trojan worm dropper downloader,1
41,trojan dropper downloader worm,1
40,trojan miner adware virus downloader,1
39,miner trojan adware pua,1
38,adware trojan dropper,1
37,trojan adware virus dropper,1
36,trojan techjoydown adware downloader,1
35,trojan pua syncopate,1
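
The grouping of clusters by identical type summary can also be expressed with `groupby`. A minimal sketch on a hypothetical stand-in for the `summary` frame; the rows below are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the `summary` frame built earlier.
summary = pd.DataFrame({
    "cluster": [0, 1, 2, 3],
    "types": ["trojan adware", "trojan", "trojan adware", "adware trojan"],
})

by_types = summary.groupby("types")["cluster"]

# Series mapping each type summary to the list of clusters that share it.
matching = by_types.apply(list)

# Number of clusters per type summary, most common first.
counts = by_types.size().sort_values(ascending=False)
print(counts)
```

Because the summaries are compared as strings, `trojan adware` and `adware trojan` count as different summaries, as they do in the tables above.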
