# Unsupervised learning for detecting obfuscated bash commands

In this Jupyter notebook, we will study a dataset comprised of bash commands and obfuscated versions of them, using unsupervised learning approaches such as clustering and anomaly detection to try to detect these obfuscated (and therefore likely malicious) commands. The purpose of this exercise is to demonstrate some unsupervised learning pipelines, consisting of feature extraction, learning and qualitatively exploring the results of an unsupervised model, and quantitatively evaluating it in the rare instance that we have labels.

In order to circumvent simple, signature-based detectors for suspicious commands executed in the shell, attackers may choose to obfuscate their commands using automated methods. Such tools take a normal command as input and generate a version that is functionally the same but looks significantly different, and would therefore escape detection by a signature-based detector. The obfuscation tool we will be using to generate our example dataset is Bashfuscator [1], which works with bash commands.
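
To make the evasion concrete, the following minimal sketch shows a toy signature-based detector and a hand-obfuscated command that slips past it. The signature list, the `matches_signature` helper, and the obfuscated string are illustrative assumptions: the string was written by hand and is not Bashfuscator output. It relies only on the fact that bash removes empty quote pairs, so `c''at` runs `cat`.

```python
# Hypothetical signature list: substrings considered suspicious (illustrative only)
SIGNATURES = ["cat /etc/passwd"]

def matches_signature(cmd):
    """Return True if any known signature appears verbatim in the command."""
    return any(sig in cmd for sig in SIGNATURES)

plain_cmd = "cat /etc/passwd"
# Hand-written equivalent: bash drops the empty quote pairs before execution
obfuscated_cmd = "c''at /et''c/pas''swd"

print(matches_signature(plain_cmd))       # the detector fires on the plain command
print(matches_signature(obfuscated_cmd))  # same behavior in bash, but no verbatim match
```

Automated obfuscators apply far heavier transformations than this, but the failure mode of exact-match signatures is the same.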

To generate our example dataset, we downloaded a set of normal (i.e., unobfuscated) bash commands from a project called NL2Bash [2]. This data consists of a set of ~12K bash one-liners [3] collected from websites such as StackOverflow. Using Bashfuscator and a subset of these commands, we generated ~1200 obfuscated bash commands.

### References
[1] https://github.com/Bashfuscator/Bashfuscator

[2] https://github.com/TellinaTool/nl2bash

[3] https://github.com/TellinaTool/nl2bash/blob/master/data/bash/all.cm



## Load data
For this exercise, the data collection and pre-processing steps described earlier have already been completed, so we only need to load the dataset in order to start using it.

Note that for an unsupervised learning pipeline, we typically don't have labels for the data; if we did, we would have used a supervised approach! However, for the purpose of quantitatively evaluating the performance of our unsupervised methods later on, we will retain labels when we load the dataset (+1 for obfuscated commands and -1 for unobfuscated commands) so that we can compare the performance of our unsupervised methods to earlier supervised approaches. **We do not require or use these labels when training our unsupervised methods.**

In [None]:
import os
import sys
import random

# Load raw text
nor_text = list()
mal_text = list()

# Load unobfuscated commands
sample_proportion_nor = 1  # lower to decrease class imbalance
my_file = 'data/bash_commands'
with open(my_file) as f:
    for i,line in enumerate(f):
        if random.random() < sample_proportion_nor:
            cmd = line.rstrip()
            nor_text.append(cmd)

# Load obfuscated commands
sample_proportion_mal = 0.1  # lower to increase class imbalance
my_file = 'data/obs_bash_commands'
with open(my_file) as f:
    for i,line in enumerate(f):
        if random.random() < sample_proportion_mal:
            cmd = line.rstrip()
            mal_text.append(cmd)

# Count number of normal and malicious commands
num_nor = len(nor_text)
num_mal = len(mal_text)
print('\nLoaded %s normal commands and %s obfuscated commands.' % (num_nor, num_mal))

### Examples from dataset
Let's first look at some examples of normal and obfuscated bash commands to better understand the task at hand. This step of manually inspecting the data and trying to understand its features and peculiarities is an important and necessary one, as it informs us on the type of model we may want to use to best solve our problem.

In [None]:
# Examples of normal bash commands
print('Normal bash examples:')
for elem in nor_text[0:10]: print('(len: ' + str(len(elem)) + ') ' + elem)

In [None]:
# Examples of obfuscated bash commands
print('Obfuscated bash examples:')
for elem in mal_text[0:10]: print('(len: ' + str(len(elem)) + ') ' + elem)

### Length of commands
We inspect the distribution of lengths of normal and obfuscated commands to determine whether there are any patterns that can be used to discriminate between the two classes.

*Bonus: Can you prescribe a hand-coded set of rules (based on command length) that does well at discriminating between normal and obfuscated commands?*
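
One possible starting point for the bonus question is a single length threshold. The sketch below is only an illustration: the `flag_by_length` helper and the default threshold of 200 characters are hypothetical, and a sensible threshold would need to be read off the length histogram for the actual data.

```python
def flag_by_length(commands, max_normal_length=200):
    """Flag a command as obfuscated (+1) if it is longer than the threshold, else normal (-1)."""
    return [1 if len(cmd) > max_normal_length else -1 for cmd in commands]

# Toy inputs (made up for illustration): one short command and one very long one
toy_commands = ["ls -la /tmp", "x" * 5000]
print(flag_by_length(toy_commands))
```

A rule like this stops being informative once obfuscated commands are truncated to lengths drawn from the normal commands, which is what makes the later feature-based approach necessary.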

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Compute log (base 10) lengths of normal and obfuscated commands
len_nor = [len(elem) for elem in nor_text]
loglen_nor = np.log10(len_nor)

len_mal = [len(elem) for elem in mal_text]
loglen_mal = np.log10(len_mal)

# Evaluate max of log lengths
print('Max. log length for normal commands: ' + str(np.max(loglen_nor)))
print('Max. log length for obfuscated commands: ' + str(np.max(loglen_mal)))

# Set figure size before plotting so that it applies to this figure
from matplotlib import rcParams
rcParams['figure.figsize'] = [8,6]

# Plot density
bins = np.linspace(0, 8, 160)
plt.hist(loglen_nor, bins, density=True, alpha=0.5, label='normal')
plt.hist(loglen_mal, bins, density=True, alpha=0.5, label='obfuscated')
plt.legend(loc='upper right')
plt.title('Density plot of log (base 10) command lengths')
plt.show()

In [None]:
# Truncate obfuscated commands
truncate = True
hard_truncate = False
max_length = 100

if truncate:
    if hard_truncate:
        # Truncate all obfuscated commands at specified max length
        mal_text = [elem[:max_length] for elem in mal_text]
    else:
        # Truncate obfuscated commands according to normal command length distribution
        mal_text = [elem[:random.choice(len_nor)] for elem in mal_text]

## Feature extraction
We now perform feature extraction on a labeled dataset comprised of both normal and obfuscated commands. To do this, we will use the `CountVectorizer` class in scikit-learn to generate a set of features based on $n$-grams of words or characters (or some other user-specified criterion), then compute the number of occurrences of every feature within each command. Each such count vector can be thought of as a vector representation (i.e., an embedding) of the command from which it was generated.
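
As a small illustration of what these count vectors look like, the sketch below runs a character-level `CountVectorizer` on two made-up commands (the `toy_*` names are ours and are not part of the dataset). Each row of the resulting matrix is one command and each column is one character.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy commands (made up for illustration)
toy_commands = ["ls -l", "ls"]
toy_vect = CountVectorizer(analyzer="char", ngram_range=(1, 1))
toy_counts = toy_vect.fit_transform(toy_commands)

# vocabulary_ maps each character to its column index in the count matrix
toy_features = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_features)
print(toy_counts.toarray())
```

Note that `CountVectorizer` lowercases its input by default (`lowercase=True`); whether letter case carries useful signal for obfuscation is something worth checking on the data.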

In [None]:
# Merge normal/obfuscated commands into one labeled dataset
raw_text = nor_text + mal_text
raw_labels = [-1]*num_nor + [1]*num_mal

In [None]:
# Build feature extractor
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
#count_vect = CountVectorizer()
count_vect = CountVectorizer(analyzer='char', ngram_range=(1,1))  # character n-gram feature extraction

# Extract feature counts
raw_counts = count_vect.fit_transform(raw_text)

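# Display features
# Note: get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use it in place of get_feature_names_out()
features = count_vect.get_feature_names_out().tolist()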
print('Feature set: ' + str(features))
print('Number of features: ' + str(len(features)))

In [None]:
# Normalize counts
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False)
all_data = tf_transformer.fit_transform(raw_counts)

# Convert labels to numpy array
import numpy as np
all_labels = np.asarray(raw_labels)

# Create set of indices
indices = np.arange(len(all_labels))
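
To see what the normalization step does, the sketch below applies `TfidfTransformer(use_idf=False)` to a tiny made-up count matrix. With the default `norm='l2'`, each row is rescaled to unit Euclidean length and no inverse-document-frequency weighting is applied. The toy values are chosen only to make the arithmetic easy to check.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix (made up): two "commands", two features
toy_counts = csr_matrix(np.array([[3.0, 4.0], [1.0, 0.0]]))
toy_tf = TfidfTransformer(use_idf=False).fit_transform(toy_counts).toarray()

print(toy_tf)                          # rows rescaled, relative proportions unchanged
print(np.linalg.norm(toy_tf, axis=1))  # Euclidean norm of each row
```

Normalizing in this way means that two commands with the same character proportions but different lengths map to the same feature vector.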

## Unsupervised learning
Next, we can learn an unsupervised model that tries to distinguish parts of the feature space that correspond to each of the two classes: normal vs. obfuscated (anomalous). We will focus on k-means clustering (a clustering method) and isolation forests (a tree-based approach for anomaly detection) in order to illustrate how these models can be utilized, and will later show how their results can be analyzed and interpreted.

### K-means clustering

In [None]:
# Cluster all data using k-means
from sklearn.cluster import KMeans
model = KMeans(n_clusters=2, random_state=0).fit(all_data)
model_type = 'clustering'

# Print resulting cluster labels
print('Cluster labels: ' + str(model.labels_))
#print('Cluster centers: ' + str(model.cluster_centers_))

In [None]:
# Compute silhouette score (measure of cluster consistency, ranging from -1 (poor consistency) to +1 (good consistency))
from sklearn.metrics import silhouette_score
print('Silhouette score: ' + str(silhouette_score(all_data, model.labels_)))
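
For intuition about the range of the silhouette score, the sketch below computes it on four made-up points that form two obvious groups, once with the clustering found by k-means and once with a deliberately bad labeling. The `toy_*` names and the `n_init=10` setting are our own choices for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four toy points (made up): two tight, well-separated groups
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)
good_score = silhouette_score(toy_points, toy_model.labels_)

# A labeling that mixes the two groups, for contrast
bad_score = silhouette_score(toy_points, [0, 1, 0, 1])

print('Silhouette (k-means labels): ' + str(good_score))
print('Silhouette (mixed labels):   ' + str(bad_score))
```

A score close to +1 indicates compact, well-separated clusters, whereas a score near or below 0 suggests that the clusters overlap or that points have been assigned to the wrong cluster.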

### Isolation forest (anomaly detection)

In [None]:
from sklearn.ensemble import IsolationForest
# Note: the `behaviour` argument has been removed from recent scikit-learn releases, so it is omitted here
model = IsolationForest(max_samples=100, random_state=0, contamination='auto').fit(all_data)
model_type = 'anomaly detection'

# Compute anomaly scores and labels (label 1 if anomaly score is negative)
anomaly_scores = model.decision_function(all_data)
model.labels_ = (anomaly_scores < 0).astype(int)
print('Anomaly scores: ' + str(anomaly_scores))
print('Anomaly labels: ' + str(model.labels_))
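
The sign conventions here are easy to mix up, so the following sketch checks them on made-up data: 200 points near the origin plus one far-away point. In scikit-learn, `decision_function` returns negative values for anomalies, and `predict` returns -1 for outliers and +1 for inliers, which is the opposite sign of the +1 = obfuscated convention used in this notebook. The `toy_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Toy data (made up): 200 points near the origin plus one far-away point (last row)
toy_data = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])

toy_forest = IsolationForest(max_samples=100, random_state=0, contamination='auto').fit(toy_data)
toy_scores = toy_forest.decision_function(toy_data)
toy_pred = toy_forest.predict(toy_data)  # scikit-learn convention: -1 = outlier, +1 = inlier

# Convert to the convention used in this notebook: 1 = anomalous, 0 = not anomalous
toy_labels = (toy_scores < 0).astype(int)

print('Score of far-away point: ' + str(toy_scores[-1]))
print('Median score:            ' + str(np.median(toy_scores)))
print('Number flagged:          ' + str(toy_labels.sum()))
```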

In [None]:
# Evaluate max/min of anomaly scores
print('Max. anomaly score: ' + str(np.max(anomaly_scores)))
print('Min. anomaly score: ' + str(np.min(anomaly_scores)))

# Set figure size before plotting so that it applies to this figure
from matplotlib import rcParams
rcParams['figure.figsize'] = [8,6]

# Plot density
bins = np.linspace(np.min(anomaly_scores), np.max(anomaly_scores), 160)
plt.hist(anomaly_scores, bins, density=True, alpha=0.5, label='anomaly scores')
plt.legend(loc='upper right')
plt.title('Density plot of anomaly scores from isolation forest')
plt.show()

## Qualitative results (i.e., intrinsic evaluation)

Since we do not have labels when performing unsupervised learning, we can only perform an evaluation of the model results using intrinsic measures of “goodness” (e.g., consistency of clusters for k-means or distribution of anomaly scores from isolation forest).

We can also try to visualize the results of these methods, which requires us to project our high-dimensional feature vectors to two- or three-dimensional plots. There are many methods for doing so, and their performance largely depends on the characteristics of the dataset being visualized. Besides linear projection methods like principal component analysis (PCA), one can also use nonlinear methods like manifold learning to try to discern lower-dimensional structure in the dataset under consideration.

One such nonlinear method is t-distributed stochastic neighbor embedding (t-SNE), which tries to capture local structure in the data manifold. Note that t-SNE (like all manifold learning methods) can produce visualizations that are quite deceptive, implying structures and clusters that may not actually exist in the dataset. For further discussion and an interactive illustration of the capabilities and potential pitfalls of t-SNE, we refer to [4] and [5].
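
As a linear baseline to compare against a t-SNE plot, a PCA projection can be computed in much the same way. The sketch below uses a made-up dense feature matrix; for a sparse matrix such as the one built in this notebook, one option is to convert it with `.toarray()` first, or to use `TruncatedSVD`, which accepts sparse input.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
toy_features = rng.rand(50, 10)  # made-up dense feature matrix (50 samples, 10 features)

pca = PCA(n_components=2, random_state=0)
pca_proj = pca.fit_transform(toy_features)

print(pca_proj.shape)                 # one 2-d point per sample
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component
```

If a cluster structure appears in the t-SNE plot but not at all in the PCA plot, that is a hint to treat it with some caution.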

### References

[4] https://mlexplained.com/2018/09/14/paper-dissected-visualizing-data-using-t-sne-explained/

[5] https://distill.pub/2016/misread-tsne/

#### t-SNE plot of feature vectors

In [None]:
# Compute 2-d TSNE projection
from sklearn.manifold import TSNE
num_points = 1000
sample_indices = random.sample(list(indices), num_points)
tsne = TSNE(n_components=2, init='random', random_state=0)
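# toarray() returns a plain ndarray; recent scikit-learn releases reject the np.matrix returned by todense()
tsne_proj = tsne.fit_transform(all_data[sample_indices].toarray())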

In [None]:
# Create TSNE plot (colored by cluster/anomaly labels)
import matplotlib.pyplot as plt
colors = 'c', 'g'
sample_model_labels = model.labels_[sample_indices]
for i, c in zip([0,1], colors):
    mask = sample_model_labels == i
    # Pass a label so that plt.legend() has entries to display
    plt.scatter(tsne_proj[mask, 0], tsne_proj[mask, 1], c=c, label='label ' + str(i))
plt.legend()
plt.show()

What if we had ground truth labels?

In [None]:
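# Create TSNE plot (colored by ground truth labels)
import matplotlib.pyplot as plt
colors = 'b', 'r'
class_names = 'normal', 'obfuscated'
sample_true_labels = all_labels[sample_indices]
for i, c, name in zip([-1,1], colors, class_names):
    mask = sample_true_labels == i
    # Pass a label so that plt.legend() has entries to display
    plt.scatter(tsne_proj[mask, 0], tsne_proj[mask, 1], c=c, label=name)
plt.legend()
plt.show()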

## Quantitative evaluation given labels (extrinsic evaluation)
While we typically do not have a labeled dataset when using an unsupervised ML pipeline, if we do have access to labels we can perform a more comprehensive evaluation of the method being used. In this section, we fit the unsupervised model on a training set (without labels) and apply it to a test set, studying its performance as a classifier using ground truth labels and the same metrics introduced in the supervised ML pipeline. We can also take a closer look at the examples that were misclassified and try to discern why prediction may have failed.

### Train/test split
To avoid overfitting the model and overestimating its true performance, we first split the dataset into a training set and a test set. In the following sections, we will fit a model using only the feature vectors from the training set, then evaluate its performance by predicting labels for the feature vectors in the test set and comparing them to the held-out ground truth labels (see **Model training and prediction** and **Analyze performance**).
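
The split in this notebook also passes an array of row indices through `train_test_split`, which is what later lets us map a test example back to its raw command text. The sketch below illustrates the idea on made-up arrays; the `toy_*` names are ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 samples with 2 features each, plus labels and row indices
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([-1] * 8 + [1] * 2)
toy_idx = np.arange(len(toy_y))

X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    toy_X, toy_y, toy_idx, test_size=0.5, random_state=0)

# The returned indices point back into the original arrays
print(idx_te)
print(np.array_equal(toy_X[idx_te], X_te))
```

Because all arrays are shuffled together, `idx_te[k]` is always the original row number of the k-th test example.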

In [None]:
# Create train/test split
from sklearn.model_selection import train_test_split

# Include indices for tracking of individual data points after splitting
train_data, test_data, train_labels, test_labels, train_indices, test_indices = train_test_split(all_data, all_labels, indices, test_size=0.5, random_state=0)

### Model training and prediction
Next, we can train a classifier that tries to learn which parts of the feature space correspond to each of the two classes: normal vs. obfuscated.

#### K-means clustering

In [None]:
# Fit k-means on training data, then assign test data to the learned clusters
from sklearn.cluster import KMeans

# Train model assuming only two clusters in data
classifier = KMeans(n_clusters=2, random_state=2).fit(train_data)
classifier_type = 'clustering'

# Predict labels for test data, with clusters 0 and 1 mapped to negative and positive class, respectively (arbitrary choice)
predicted_labels = classifier.predict(test_data)
predicted_labels = np.where(predicted_labels==0, -1, predicted_labels)

# Assign classifier classes
classifier.classes_ = [-1,1]
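
Because k-means numbers its clusters arbitrarily, the fixed mapping above can end up inverted, which would flip every metric. One label-free alternative, sketched below, is to treat the smaller cluster as the positive (obfuscated) class. This rests on the assumption that obfuscated commands are the minority, and the `map_clusters_to_classes` helper is hypothetical rather than part of the notebook.

```python
import numpy as np

def map_clusters_to_classes(cluster_ids):
    """Map the smaller of two clusters to +1 (obfuscated) and the larger to -1 (normal)."""
    cluster_ids = np.asarray(cluster_ids)
    counts = np.bincount(cluster_ids, minlength=2)
    minority = int(np.argmin(counts))  # on a tie, cluster 0 is treated as the minority
    return np.where(cluster_ids == minority, 1, -1)

# The same partition with swapped cluster ids yields the same class labels
print(map_clusters_to_classes([0, 0, 0, 1]))
print(map_clusters_to_classes([1, 1, 1, 0]))
```

If the classes are roughly balanced, this heuristic is no better than the arbitrary choice, so it is worth checking the cluster sizes before relying on it.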

#### Isolation forest

In [None]:
# Perform anomaly detection on test data
from sklearn.ensemble import IsolationForest
# Note: the `behaviour` argument has been removed from recent scikit-learn releases, so it is omitted here
classifier = IsolationForest(max_samples=100, random_state=0, contamination='auto').fit(train_data)
classifier_type = 'anomaly detection'

# Predict labels for test data, with negative anomaly scores (most anomalous) labeled as obfuscated
anomaly_scores = classifier.decision_function(test_data)
predicted_labels = (anomaly_scores < 0).astype(int)
predicted_labels = np.where(predicted_labels==0, -1, predicted_labels)

# Assign classifier classes
classifier.classes_ = [-1,1]

### Analyze performance
We assess the performance of our classifier using metrics such as precision, recall, F1-measure, and accuracy. These can all be computed from the confusion matrix, which is essentially a two-dimensional histogram of the true labels and predicted labels of examples in the test set.
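
To make the relationship between these metrics and the confusion matrix explicit, the sketch below computes them by hand from the four confusion-matrix counts on a made-up set of labels. The `binary_metrics` helper is hypothetical and treats +1 (obfuscated) as the positive class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy from the four confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
            'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Toy labels (made up): 3 obfuscated and 5 normal commands
toy_true = [1, 1, 1, -1, -1, -1, -1, -1]
toy_pred = [1, 1, -1, 1, -1, -1, -1, -1]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

With heavily imbalanced classes, accuracy alone can look good even when most obfuscated commands are missed, which is why precision and recall are reported alongside it.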

In [None]:
# Analyze performance
from sklearn import metrics

if classifier_type == 'clustering':
    nmi_score = sklearn.metrics.normalized_mutual_info_score(test_labels, predicted_labels, average_method='arithmetic')
    print('NMI between cluster labels and class labels: \n' + str(nmi_score))
    print('\n')

# Classification report
#print(metrics.classification_report(test_labels, predicted_labels))

# Standard metrics
precision = metrics.precision_score(test_labels, predicted_labels)
recall = metrics.recall_score(test_labels, predicted_labels)
f1measure = metrics.f1_score(test_labels, predicted_labels)
accuracy = metrics.accuracy_score(test_labels, predicted_labels)

print(' precision = ' + str(precision))
print('    recall = ' + str(recall))
print('F1-measure = ' + str(f1measure))
print('  accuracy = ' + str(accuracy))
print('\n')

# Confusion matrix
print('Confusion matrix (text-only):')
cm = metrics.confusion_matrix(test_labels, predicted_labels, labels=classifier.classes_)
print(classifier.classes_)
print(cm)

In [None]:
# Plot fancy confusion matrix
# Reference: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
import matplotlib.pyplot as plt
import numpy as np

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = metrics.confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    #    print("Normalized confusion matrix")
    #else:
    #    print('Confusion matrix, without normalization')
        
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.4f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

np.set_printoptions(precision=4)


# Set figure size before plotting so that it applies to the figures below
from matplotlib import rcParams
rcParams['figure.figsize'] = [6,6]

# Pretty plot non-normalized confusion matrix
plot_confusion_matrix(test_labels, predicted_labels, classes=['normal','obfuscated'],
                      title='Confusion matrix, without normalization')

# Pretty plot normalized confusion matrix
plot_confusion_matrix(test_labels, predicted_labels, classes=['normal','obfuscated'], normalize=True,
                      title='Normalized confusion matrix, by true label')

plt.show()

### Inspect misclassified examples
By looking at those examples that were misclassified, we can better understand what the classifier has learned and how it could potentially be improved. In our case, each misclassified example is either (i) a command marked as obfuscated although it was actually normal (i.e., a false positive) or (ii) a command marked as normal although it was actually obfuscated (i.e., a false negative). It is largely domain-dependent to determine how to balance the cost of a false negative versus a false positive, and which we'd prefer our model to try to avoid.
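
For a score-based detector such as the isolation forest, one way to act on this trade-off is to move the decision threshold on the anomaly score. The sketch below counts false positives and false negatives at a few thresholds on made-up scores and labels; the `count_errors` helper and the threshold values are illustrative assumptions.

```python
import numpy as np

def count_errors(scores, true_labels, threshold=0.0):
    """Count false positives/negatives when scores below `threshold` are flagged as obfuscated (+1)."""
    scores = np.asarray(scores)
    true_labels = np.asarray(true_labels)
    predicted = np.where(scores < threshold, 1, -1)
    false_pos = int(np.sum((predicted == 1) & (true_labels == -1)))
    false_neg = int(np.sum((predicted == -1) & (true_labels == 1)))
    return false_pos, false_neg

# Toy anomaly scores and ground truth labels (made up for illustration)
toy_scores = [-0.2, -0.1, -0.05, 0.02, 0.05, 0.1, 0.2]
toy_labels = [1, 1, -1, 1, -1, -1, -1]

for threshold in (-0.07, 0.0, 0.03):
    print(threshold, count_errors(toy_scores, toy_labels, threshold))
```

Lowering the threshold flags fewer commands and so trades false positives for false negatives, while raising it does the opposite.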

In [None]:
# Show misclassified examples
misclassified_fp = np.where((test_labels != predicted_labels) & (predicted_labels == 1))
misclassified_fn = np.where((test_labels != predicted_labels) & (predicted_labels == -1))

false_positives = [raw_text[index] for index in test_indices[misclassified_fp]]
print('False positives (marked as obfuscated, but actually normal):')
for elem in false_positives:
    print(elem)
print('\n')

false_negatives = [raw_text[index] for index in test_indices[misclassified_fn]]
print('False negatives (marked as normal, but actually obfuscated):')
for elem in false_negatives:
    print(elem)