# Webshell Prediction/Detection

In this Notebook, we will train a classifier to identify malicious PHP files.

This builds on the finding from our other Notebooks that file entropy is strongly correlated with malicious obfuscated web shells, whose entropy values are distinct from those of normal files in the web roots of 3 common content management systems (WordPress, Joomla, Drupal).

With this information, we will attempt to predict malicious PHP based on entropy and a small selection of other features.

In [1]:
# Import a bunch of stuff
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
import plotly.express as px
import numpy as np
import os

## Feature Creation

Since we're talking about files, there are no inherent features in our "dataset" as yet. We'll have to derive them one way or another. 

Other than entropy, some simple features we'll extract from the files include:

* Average line length
* Max line length
* Total line count
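
For reference, the entropy feature can be sketched in a few lines of Python. This is a minimal sketch that assumes the `entropy` column is Shannon entropy measured in bits per byte (a 0 to 8 scale); the `shannon_entropy` helper is hypothetical and is not the tool used to produce our CSVs.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum -p * log2(p) over the relative frequency p of each byte value
    return sum(-(n / total) * math.log2(n / total) for n in Counter(data).values())

# Repetitive content scores at the bottom of the scale...
low = shannon_entropy(b"aaaaaaaa")
# ...while bytes spread evenly over all 256 values score the maximum.
high = shannon_entropy(bytes(range(256)))
```

Obfuscated or encoded payloads tend to use a wider, flatter spread of byte values than hand-written source, which is why they land higher on this scale.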

### PHP Only?

We could focus our dataset on just PHP files, but first I'd like to try to train against _all_ files that you might find in a CMS webroot. If that creates too many false positives, we'll limit to PHP.

## Data Import

We'll start by importing our data and configuring headers. We also have a helper function to extract file extensions for when we _do_ want to filter.

Our sample set is made of 3 entire CMS webroots along with 100+ samples of obfuscated PHP webshells.

In [3]:
header_names = [
    "filename",
    "path",
    "entropy",
    "is_elf",
    "md5",
    "sha1",
    "sha256",
    "sha512"
]

In [4]:
# Import individual CMS files and create the CMS category
drupal_df = pd.read_csv("drupal.csv", names=header_names)
drupal_df["cms"] = "drupal"
joomla_df = pd.read_csv("joomla.csv", names=header_names)
joomla_df["cms"] = "joomla"
wordpress_df = pd.read_csv("wordpress.csv", names=header_names)
wordpress_df["cms"] = "wordpress"
# Create main DF
df = pd.concat([drupal_df, joomla_df, wordpress_df]).reset_index()
df["disposition"] = "benign"

# Get webshells
webshell = pd.read_csv("webshell.csv", names=header_names)
webshell["disposition"] = "malicious"

# Add webshell to df
df = pd.concat([df, webshell]).reset_index()
df["extension"] = df.filename.apply(lambda f: os.path.splitext(f)[-1])

# Remove unnecessary cols
df.drop(["level_0","index","md5","sha1","sha256","sha512","is_elf","cms"], axis=1, inplace=True)

In [5]:
df

Unnamed: 0,filename,path,entropy,disposition,extension
0,.csslintrc,cms/drupal/.csslintrc,3.89,benign,
1,.editorconfig,cms/drupal/.editorconfig,4.68,benign,
2,.eslintignore,cms/drupal/.eslintignore,3.70,benign,
3,.eslintrc.json,cms/drupal/.eslintrc.json,4.14,benign,.json
4,.gitattributes,cms/drupal/.gitattributes,4.62,benign,
...,...,...,...,...,...
48455,r00tshell_6dc8b59781183d4061990b8b0fdb617063b8...,Webshell-samples/samples/webshell/PHP-backdoor...,5.71,malicious,.php
48456,r57Shell_0963884cbc71293d7e290ad3ecf06a81355b5...,Webshell-samples/samples/webshell/PHP-backdoor...,5.67,malicious,.php
48457,r57_89fa86a8748b5dfbabf610e47cf447675a817182.php,Webshell-samples/samples/webshell/PHP-backdoor...,6.02,malicious,.php
48458,webshell_e8eaf8da94012e866e51547cd63bb99637969...,Webshell-samples/samples/webshell/PHP-backdoor...,5.94,malicious,.php


In [7]:
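# Define feature functions
# Note: UnicodeDecodeError is a subclass of ValueError, so files that cannot
# be decoded as text (images, archives, ...) take the except branch below.
from typing import Optional

def avg_line_length(filepath: str) -> float:
    """
    Returns average line length for a file (1 if the file cannot be decoded)
    """
    with open(filepath) as f:
        try:
            return np.mean([len(l) for l in f.readlines()])
        except ValueError:
            return 1

def max_line_length(filepath: str) -> Optional[int]:
    """
    Returns max line length for a file (None if empty or undecodable)
    """
    with open(filepath) as f:
        try:
            return np.max([len(l) for l in f.readlines()])
        except ValueError:
            return None


def line_count(filepath: str) -> Optional[int]:
    """
    Returns number of lines in a file (None if the file cannot be decoded)
    """
    with open(filepath) as f:
        try:
            return len(f.readlines())
        except ValueError:
            return None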
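
Each of the functions above opens and reads the file separately. As a minimal sketch, the three line-based features could also be computed from a single read. The `line_stats` helper is hypothetical, and unlike the functions above it assumes undecodable bytes should be replaced (`errors="replace"`) rather than causing the file to be skipped.

```python
import os
import tempfile

def line_stats(filepath: str) -> dict:
    """Hypothetical helper: line-based features from one pass over the file."""
    # errors="replace" is an assumption: binary files still yield numbers
    with open(filepath, errors="replace") as f:
        lengths = [len(line) for line in f]
    if not lengths:
        return {"avg_line_length": None, "max_line_length": None, "line_count": 0}
    return {
        "avg_line_length": sum(lengths) / len(lengths),
        "max_line_length": max(lengths),
        "line_count": len(lengths),
    }

# Try it on a throwaway two-line PHP file
with tempfile.NamedTemporaryFile("w", suffix=".php", delete=False) as tmp:
    tmp.write("<?php\necho 'hi';\n")
    demo_path = tmp.name

stats = line_stats(demo_path)
os.remove(demo_path)
```

Line lengths here include the trailing newline, which matches how `len(l)` behaves on the lines returned by `readlines()`.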

### Feature Generation

Now that we have the basics imported, it's time to generate our features by applying the functions defined above.

In [8]:
# Get features
# ============
# File size in bytes, then the line-based features defined above
df["size"] = df.path.apply(os.path.getsize)
df["avg_line_length"] = df.path.apply(avg_line_length)
df["max_line_length"] = df.path.apply(max_line_length)
df["line_count"] = df.path.apply(line_count)
df["is_php"] = df.filename.str.endswith(".php")


Now we clean the data by removing any rows that have `NaN` in one of their features.

In [9]:
clean = df.dropna().reset_index()

## Model Fit

With our source data ready, it's time to build and fit the model.

We begin by splitting the data into the features (`X`) and the target (`y`).

In [10]:
# Split the data
X = clean[["entropy","size","avg_line_length","max_line_length","line_count", "is_php"]]
y = clean.disposition

### Classification

Our initial model is very simple: a `KNeighborsClassifier` with a `StandardScaler` to normalize the features before fitting.
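
Scaling matters here because k-nearest neighbors works on distances, and our features live on very different scales (entropy spans a few units, file size spans many thousands). The sketch below uses three made-up rows, not values from our dataset, to show the effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [entropy, size in bytes]
features = np.array([
    [4.1, 1_200.0],    # row 0: low entropy, small file
    [4.3, 250_000.0],  # row 1: similar entropy, much larger file
    [6.0, 1_300.0],    # row 2: high entropy, small file
])

# Unscaled, distance is driven almost entirely by size: the entropy gap
# between rows 0 and 2 barely registers next to the size gap to row 1.
raw_dist_01 = np.linalg.norm(features[0] - features[1])
raw_dist_02 = np.linalg.norm(features[0] - features[2])

# StandardScaler gives every column zero mean and unit variance,
# so entropy and size contribute on comparable terms.
scaled = StandardScaler().fit_transform(features)
scaled_dist_01 = np.linalg.norm(scaled[0] - scaled[1])
scaled_dist_02 = np.linalg.norm(scaled[0] - scaled[2])
```

Before scaling, row 2 looks like a near neighbor of row 0 despite the large entropy difference; after scaling, the two distances are of the same order.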

We'll further split the data between `train` and `test` sets.

In [12]:
# Instantiate our classifier and scaler
knn = KNeighborsClassifier()
scaler = StandardScaler()

# K-fold splitter used for cross-validation later on
kf = KFold(n_splits=6, shuffle=True, random_state=42)

# Split the data between train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Make pipeline
pipeline = make_pipeline(scaler, knn)

Now we'll fit the model.

In [13]:
pipeline.fit(X_train, y_train)
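
We fit with the default number of neighbors. As a minimal sketch of how that value could be tuned with the `GridSearchCV` we imported earlier, the example below runs a small search on synthetic data. The data, the candidate grid, the F1 scoring and the use of `StratifiedKFold` are all assumptions for illustration, not settings used in this Notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for our six features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.9, 0.1], random_state=42
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name,
# so the KNN parameters are addressed as "kneighborsclassifier__<param>"
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}

# Stratified folds keep the rare class present in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no information from the held-out fold leaks into the scaling.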

With the model fit, now we make our predictions.

In [14]:
preds = pipeline.predict(X_test)

## Results Review

To fully review the results, we'll reattach the columns we dropped before fitting the model. Note that `disposition` in this table holds the predicted label, not the true one.

In [24]:
# Copy so that adding columns does not modify X_test in place
pred_df = X_test.copy()
pred_df["disposition"] = preds
pred_df["filename"] = clean.loc[pred_df.index, "filename"]
pred_df["path"] = clean.loc[pred_df.index, "path"]

Okay, what got flagged as malicious?

In [25]:
pd.set_option("display.max_colwidth", 300)
pred_df[pred_df.disposition == "malicious"]

Unnamed: 0,entropy,size,avg_line_length,max_line_length,line_count,is_php,disposition,filename,path
46622,6.11,17844,85.37799,766.0,209.0,True,malicious,WebShell_0ba8e8b6c1334b8335a9a9374bfb1109c0371478.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/WebShell_0ba8e8b6c1334b8335a9a9374bfb1109c0371478.php
46598,5.96,1330,83.125,90.0,16.0,True,malicious,Unknown_afdb2cf061897a383718ef8c59e9be10fc76d1c5.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/Unknown_afdb2cf061897a383718ef8c59e9be10fc76d1c5.php
46560,5.71,38636,80.157676,92.0,482.0,True,malicious,Sosyete_6ba719a5dcbbe675542d9001e5bf7987c979910f.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/Sosyete_6ba719a5dcbbe675542d9001e5bf7987c979910f.php
46617,6.07,23982,959.28,23038.0,25.0,True,malicious,WSOShell_1d2746a23a5201da7a0e89ff52adc3e8304b98a2.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/WSOShell_1d2746a23a5201da7a0e89ff52adc3e8304b98a2.php
46607,6.1,1389,81.705882,89.0,17.0,True,malicious,Unknown_d88733ed68ddb4a6f715d6a2c889d979c3ffef63.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/Unknown_d88733ed68ddb4a6f715d6a2c889d979c3ffef63.php
46603,6.12,1281,80.0625,86.0,16.0,True,malicious,Unknown_c17433899e87dc065f3e6eadd26d92e82a62db08.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/Unknown_c17433899e87dc065f3e6eadd26d92e82a62db08.php
46581,6.17,2006,80.24,85.0,25.0,True,malicious,Unknown_63e22f2dad6485eff1b516666443220c31a83173.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/Unknown_63e22f2dad6485eff1b516666443220c31a83173.php
46584,5.91,1718,85.9,92.0,20.0,True,malicious,Unknown_851cc138b197afae9e1f24d7ac8e7f7e89bff85d.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/Unknown_851cc138b197afae9e1f24d7ac8e7f7e89bff85d.php
46618,5.68,264001,81.886166,94.0,3224.0,True,malicious,WSOShell_606ece05d586d7b76817fbe10634871aa286222d.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/WSOShell_606ece05d586d7b76817fbe10634871aa286222d.php
46639,5.75,20044,78.603922,92.0,255.0,True,malicious,fabf134fce36292cf2bd03c5f2c9d3195f102bb7.php,Webshell-samples/samples/webshell/PHP-backdoors-master/Obfuscated/fabf134fce36292cf2bd03c5f2c9d3195f102bb7.php


Just eyeballing it, all but 2 of our predictions are correct. That is pretty dang impressive!

And what was the score against our test set?

In [18]:
# Select only the feature columns, in case extra columns were attached to X_test
score = round(pipeline.score(X_test[["entropy","size","avg_line_length","max_line_length","line_count", "is_php"]], y_test) * 100, 2)
print(f"Score: {score}")
print("Cross val score:")
print(cross_val_score(pipeline, X, y, cv=kf))

Score: 99.97
Cross val score:
[0.99961415 0.99909956 0.99909956 0.99948546 0.99948546 0.9996141 ]
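
One caveat when reading these numbers: our classes are heavily imbalanced (tens of thousands of benign files against roughly a hundred webshells), and accuracy alone can look excellent on such data. The sketch below uses made-up labels, not our results, to show how a confusion matrix and per-class report expose what accuracy hides.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels mimicking a strong imbalance: 995 benign, 5 malicious
y_true = np.array(["benign"] * 995 + ["malicious"] * 5)

# A degenerate "model" that never flags anything is still 99.5% accurate
y_always_benign = np.array(["benign"] * 1000)
accuracy = (y_true == y_always_benign).mean()

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_always_benign, labels=["benign", "malicious"])

# Per-class precision and recall; recall for "malicious" is 0 here
# because all 5 malicious samples were missed
report = classification_report(
    y_true, y_always_benign, labels=["benign", "malicious"], zero_division=0
)
print(cm)
print(report)
```

The same two calls should work on `y_test` and `preds` from the cells above, which would show how many webshells in the test set were actually caught.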


### Further detections

Now let's try with a completely new sample. This is a real webshell I've just encountered in the wild. It's been saved here as `webshell.php`. Let's process its data into a DataFrame.

In [37]:
# Collect entropy!
entropy_result = ! sandfly-entropyscan/sandfly-entropyscan -file webshell.php 

In [39]:
# Build DataFrame
webshell_X = pd.DataFrame([{
    "entropy": float(entropy_result[2].split(": ")[-1]),
    "size": os.path.getsize("webshell.php"),
    "avg_line_length": avg_line_length("webshell.php"),
    "max_line_length": max_line_length("webshell.php"),
    "line_count": line_count("webshell.php"),
    "is_php": True
}])

And now, to predict.

In [41]:
# And our prediction...
pipeline.predict(webshell_X)

array(['malicious'], dtype=object)

## Conclusions

This is an extremely rudimentary model, but even with this approach, we can see that adding a few features easily derived from the files themselves raises our detection rate from 94.13% to 99.97%.

Obfuscated PHP webshells are a fairly specific detection use case. However, the efficacy of this method speaks to the potential for well-applied machine learning in defense techniques.