# Analysis of Similarities in Threat Actors

## Description 

This notebook combines the results of the other notebooks (e.g. [aggregation-notebook](./aggregation-notebook.ipynb), [attack-pattern-similarities](./attack-pattern-similarities.ipynb), [malware-similarities](./malware-similarities.ipynb)) into a structured analysis of the similarities among threat groups. See the [README](./README.md) for the methodology behind this analysis.

**NOTE:** This notebook assumes that the other notebooks have already been run and that the output file for each similarity matrix exists.
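
The assumption in the note above can be checked before any data is loaded. The following is a minimal sketch: the helper `find_missing_inputs` and the list `REQUIRED_INPUTS` are illustrative names that are not part of the project, and the paths are copied from the loading cell in the Setup section.

```python
from pathlib import Path

# Input files read later in this notebook (paths copied from the loading cell).
REQUIRED_INPUTS = [
    "data/jsons/intrusion-sets.json",
    "data/analysis-outcomes/actor-attack-pattern-similarities/actor-ttp-similarity-matrix.csv",
    "data/analysis-outcomes/actor-attack-pattern-similarities/matrix-labels.json",
    "data/analysis-outcomes/actor-malware-similarities/actor-malware-similarity-matrix.csv",
    "data/analysis-outcomes/actor-malware-similarities/matrix-labels.json",
    "data/analysis-outcomes/actor-tool-similarities/actor-tool-similarity-matrix.csv",
    "data/analysis-outcomes/actor-tool-similarities/matrix-labels.json",
]


def find_missing_inputs(root: str = ".") -> list[Path]:
    """Return the required input files that do not exist under `root`."""
    base = Path(root)
    return [base / rel for rel in REQUIRED_INPUTS if not (base / rel).is_file()]


# Report missing files instead of failing later with a less helpful error.
missing = find_missing_inputs()
if missing:
    print("Missing inputs; run the upstream notebooks first:")
    for path in missing:
        print(f"  {path}")
else:
    print("All similarity-matrix inputs are present.")
```

Running this from the repository root lists every file that an upstream notebook still needs to produce.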

# Similarity Analysis

In [1]:
import numpy as np
import pandas as pd 
import json 
from functions.functions import generate_similarity_table

## Setup 

**Loading the necessary data**

In [2]:
data_dir:str = 'data/analysis-outcomes/'
# Use a context manager so the file handle is closed after reading
with open('data/jsons/intrusion-sets.json', 'r') as file:
    actor_names:list[str] = [ a['name'] for a in json.load(file) ]

# Actor TTP similarities (`.values` returns a plain np.ndarray, not an np.matrix)
actor_ttps_similarity_matrix:np.ndarray = pd.read_csv(data_dir + 'actor-attack-pattern-similarities/actor-ttp-similarity-matrix.csv', header=None).values
with open(data_dir + 'actor-attack-pattern-similarities/matrix-labels.json', 'r') as file:
    actor_ttps_matrix_labels:dict[str, list[str]] = json.load(file)

# Actor malware similarities
actor_malware_similarity_matrix:np.ndarray = pd.read_csv(data_dir + 'actor-malware-similarities/actor-malware-similarity-matrix.csv', header=None).values
with open(data_dir + 'actor-malware-similarities/matrix-labels.json', 'r') as file:
    actor_malware_matrix_labels:list[str] = json.load(file)

# Actor tools similarities
actor_tools_similarities_matrix:np.ndarray = pd.read_csv(data_dir + 'actor-tool-similarities/actor-tool-similarity-matrix.csv', header=None).values
with open(data_dir + 'actor-tool-similarities/matrix-labels.json', 'r') as file:
    actor_tools_matrix_labels:list[str] = json.load(file)
    

**Defining hyperparameters**

In [3]:
# NOTE: the weights below can be written as fractions or decimals but MUST sum to 1 (100%)

# ttp_similarity_weight := weight for the actor_ttps_similarity_matrix 
ttp_similarity_weight:float = 1/6

# malware_similarity_weight := weight for the actor_malware_similarity_matrix
malware_similarity_weight:float = 1/2

# tool_similarity_weight := weight for the actor_tools_similarities_matrix
tool_similarity_weight:float = 1/3

**Checking that hyperparams are valid**

In [4]:
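# Compare with a tolerance: a sum of floats such as 1/6 + 1/2 + 1/3 is not guaranteed to equal 1 exactly
if not np.isclose(ttp_similarity_weight + malware_similarity_weight + tool_similarity_weight, 1.0):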
    raise ValueError('Hyperparameters do not add up to 1.')
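
Since the weights may be given as fractions, one option is to validate them with exact rational arithmetic and convert to floats afterwards. This is a sketch of that idea, not the notebook's own method; the names `exact_weights` and `float_weights` are introduced here for illustration only.

```python
from fractions import Fraction

# Same weights as the hyperparameter cell, expressed as exact fractions.
exact_weights = {
    "ttp": Fraction(1, 6),
    "malware": Fraction(1, 2),
    "tool": Fraction(1, 3),
}

# Fractions add without rounding error, so an exact equality test is safe here.
if sum(exact_weights.values()) != 1:
    raise ValueError("Hyperparameters do not add up to 1.")

# Convert to floats only after validation has passed.
float_weights = {name: float(weight) for name, weight in exact_weights.items()}
print(float_weights)
```

The trade-off is an extra conversion step in exchange for a check that does not depend on floating-point rounding.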

## Creating the model

**Weighting the matrices**

In [5]:
# TTPs similarity matrix
actor_ttps_similarity_matrix = ttp_similarity_weight * actor_ttps_similarity_matrix

# Malware similarity matrix
actor_malware_similarity_matrix = malware_similarity_weight * actor_malware_similarity_matrix

# Tools similarity matrix 
actor_tools_similarities_matrix = tool_similarity_weight * actor_tools_similarities_matrix

**Checking that the matrices have the same shape**

In [6]:
if not (actor_ttps_similarity_matrix.shape == actor_malware_similarity_matrix.shape == actor_tools_similarities_matrix.shape): 
    raise ValueError('Similarity matrices do not have the same shape. Check your matrices.')
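
Matching shapes do not guarantee that each matrix is a well-formed similarity matrix. The sketch below adds three further checks under the assumption that every matrix should be square, symmetric, and bounded by 0 and 1. The printed output later in this notebook is consistent with that assumption, but the source does not state it, and `check_similarity_matrix` is a hypothetical helper.

```python
import numpy as np


def check_similarity_matrix(matrix: np.ndarray, name: str) -> None:
    """Raise ValueError unless `matrix` is square, symmetric, and within [0, 1]."""
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError(f"{name} is not square: shape {matrix.shape}")
    if not np.allclose(matrix, matrix.T):
        raise ValueError(f"{name} is not symmetric")
    # Allow a small tolerance for floating-point noise at the bounds.
    if matrix.min() < -1e-9 or matrix.max() > 1 + 1e-9:
        raise ValueError(f"{name} has entries outside [0, 1]")


# Toy matrix standing in for one of the loaded similarity matrices.
toy_matrix = np.array([[1.0, 0.2], [0.2, 1.0]])
check_similarity_matrix(toy_matrix, "toy_matrix")
print("toy_matrix passed all checks")
```

In the notebook, the same function could be called once per matrix, for example `check_similarity_matrix(actor_ttps_similarity_matrix, 'TTP')`.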

**Summing the matrices together**

In [7]:
summed_similarities_matrix:np.ndarray = actor_ttps_similarity_matrix + actor_malware_similarity_matrix + actor_tools_similarities_matrix
print(summed_similarities_matrix)

# Save the summed similarities matrix 
np.savetxt('data/analysis-outcomes/similarity-matrix.csv', summed_similarities_matrix.astype(float), delimiter=',', fmt='%f')

[[1.         0.15074867 0.         ... 0.0474835  0.         0.        ]
 [0.15074867 1.         0.         ... 0.05498567 0.         0.        ]
 [0.         0.         0.5        ... 0.         0.         0.        ]
 ...
 [0.0474835  0.05498567 0.         ... 0.66666667 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.5       ]]
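
To make the weighted sum concrete, here is a small worked example on toy data. The 2x2 matrices are invented for illustration and do not come from the dataset; only the weights match the hyperparameter cell. Because the weights sum to 1, an entry of the combined matrix reaches 1 only when all three component matrices are 1 at that position.

```python
import numpy as np

# Toy 2x2 matrices standing in for the TTP, malware, and tool similarities.
ttp = np.array([[1.0, 0.6], [0.6, 1.0]])
malware = np.array([[1.0, 0.0], [0.0, 1.0]])
# Illustrative assumption: the first actor has no tool data, so its row is all zeros.
tool = np.array([[0.0, 0.0], [0.0, 1.0]])

# Weights as defined in the hyperparameter cell.
weights = {"ttp": 1 / 6, "malware": 1 / 2, "tool": 1 / 3}

# Element-wise weighted sum of the three matrices.
combined = (
    weights["ttp"] * ttp
    + weights["malware"] * malware
    + weights["tool"] * tool
)
print(combined)
```

In this toy case the first diagonal entry is 1/6 + 1/2 (about 0.67) because the tool component contributes nothing there, while the second diagonal entry is 1.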


In [9]:
sim_table, pretty_table = generate_similarity_table(summed_similarities_matrix, actor_names, check_if_aliases=True)


# Convert the table to a dataframe and save it
sim_table_df:pd.DataFrame = pd.DataFrame(sim_table, columns=['Actor 1', 'Actor 2', 'Similarity Rating', 'Is Alias'])
sim_table_df.to_csv('data/analysis-outcomes/similarities-table.csv', index=False)
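
The source of `generate_similarity_table` is not shown in this notebook. Judging only from the printed table, it appears to pair each actor with its most similar other actor. The sketch below illustrates that idea with a hypothetical helper, `best_matches`. It is not the project's implementation, and it omits the alias check.

```python
from typing import Optional

import numpy as np


def best_matches(matrix: np.ndarray, names: list[str]) -> list[tuple[str, Optional[str], float]]:
    """For each actor, return (actor, most similar other actor, rounded score)."""
    scores = np.array(matrix, dtype=float)  # copy so the caller's matrix is left untouched
    np.fill_diagonal(scores, -np.inf)       # exclude self-similarity from the search
    rows = []
    for i, name in enumerate(names):
        j = int(np.argmax(scores[i]))
        best = float(scores[i, j])
        # Report no partner when nothing overlaps (an assumed reading of the "None" row).
        partner = names[j] if best > 0 else None
        rows.append((name, partner, round(max(best, 0.0), 2)))
    return rows


# Toy data: A and B overlap, C overlaps with nobody.
toy_scores = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 0.0]])
for row in best_matches(toy_scores, ["A", "B", "C"]):
    print(row)
```

Masking the diagonal before taking the row-wise maximum keeps an actor from being matched with itself.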

Label 1             Label 2               Similarity Rating  Is Alias
------------------  ------------------  -----------------  --------
APT-C-23            None                             0     False
APT-C-36            TA2541                           0.3   False
APT1                admin@338                        0.29  False
APT12               APT30                            0.11  False
APT16               Indrik Spider                    0.04  False
APT17               Leviathan                        0.16  False
APT18               Higaisa                          0.22  False
APT19               CopyKittens                      0.67  False
APT28               APT38                            0.21  False
APT29               UNC2452                          1     True
APT3                Inception                        0.28  False
APT30               TA459                            0.11  False
APT32               Chimera                          0.36  False
APT33               MuddyWater                       0.23  False
APT34               OilRig                           1     True
APT37  