
# Avast-CTU Public CAPEv2 Dataset

Publicly available data to support research in malware analysis is limited. In particular, there are virtually no publicly available datasets generated by rich sandboxes such as Cuckoo/CAPE.

The benefit of dynamic sandboxes is that they realistically simulate file execution on the target machine and record a log of that execution. Because the machine can actually be infected by the malware, there is a good chance that the execution logs capture the malicious behavior, which lets researchers study it in detail. Although the subsequent analysis of such logs is covered extensively in industrial cybersecurity backends, to our knowledge academia has invested only limited effort in advancing log analysis with cutting-edge techniques.

We make this sample dataset available to support the design of new machine learning methods for malware detection, especially for the automatic detection of generic malicious behavior. The dataset was collected in cooperation between Avast Software and the Czech Technical University - AI Center (AIC).

The archive contains CAPEv2 reports of 48,976 malicious files. For each file, we provide the following metadata:
* sha256,
* classification to malware family,
* type of the malware,
* date of detection of the file.

There are 6 types of malware ("banker", "trojan", "pws", "coinminer", "rat", "keylogger") and 10 malware families in the dataset:

| Malware Family Name | Number of Samples |
| ------------------- | ----------------- |
| Adload              | 704               |
| Emotet              | 14,429            |
| HarHar              | 655               |
| Lokibot             | 4,191             |
| njRAT               | 3,372             |
| Qakbot              | 4,895             |
| Swisyn              | 12,591            |
| Trickbot            | 4,202             |
| Ursnif              | 1,343             |
| Zeus                | 2,594             |

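As a quick sanity check on the table above, the per-family counts can be summed and compared with the 48,976 reports stated for the archive. The sketch below hard-codes the counts from the table; the variable names are illustrative and are not part of the dataset's tooling.

```python
# Per-family sample counts copied from the table above
family_counts = {
    "Adload": 704, "Emotet": 14429, "HarHar": 655, "Lokibot": 4191,
    "njRAT": 3372, "Qakbot": 4895, "Swisyn": 12591, "Trickbot": 4202,
    "Ursnif": 1343, "Zeus": 2594,
}

# The counts should add up to the number of reports in the archive
total = sum(family_counts.values())
assert total == 48_976

# Relative share of each family, largest first, to expose class imbalance
shares = {
    name: count / total
    for name, count in sorted(family_counts.items(), key=lambda kv: -kv[1])
}
for name, share in shares.items():
    print(f"{name:<10} {share:6.1%}")
```

Emotet and Swisyn together account for more than half of the samples, so experiments on this data may benefit from stratified splits or class weighting.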

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.animation as animation
import os

In [None]:
# Load the label metadata for every sample
avast_ctu_dataset = pd.read_csv('/kaggle/input/avast-ctu-dataset/public_labels.csv')

# Encode each malware family as an integer; unmapped families fall back to 10
family_type_mapping = {
    'Adload': 0, 'Emotet': 1, 'HarHar': 2, 'Lokibot': 3, 'njRAT': 4,
    'Qakbot': 5, 'Swisyn': 6, 'Trickbot': 7, 'Ursnif': 8, 'Zeus': 9
}
avast_ctu_dataset["family_type"] = avast_ctu_dataset['classification_family'].map(family_type_mapping).fillna(10).astype(int)

# Encode each malware type as an integer; unmapped types fall back to 6
malware_type_mapping = {
    'banker': 0, 'trojan': 1, 'pws': 2, 'coinminer': 3, 'rat': 4,
    'keylogger': 5
}
avast_ctu_dataset["malware_type"] = avast_ctu_dataset['classification_type'].map(malware_type_mapping).fillna(6).astype(int)

# Parse the detection dates and sort the samples chronologically
avast_ctu_dataset['date'] = pd.to_datetime(avast_ctu_dataset['date'])
avast_ctu_dataset = avast_ctu_dataset.sort_values(by='date')
avast_ctu_dataset.head()

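The integer encoding above relies on `Series.map` returning `NaN` for labels that are missing from the mapping, which `fillna` then replaces with a catch-all code. A minimal sketch of that behavior on a toy frame follows. The helper name `encode_labels`, the two-entry mapping, and the sample rows are hypothetical; only the column name `classification_family` comes from the notebook.

```python
import pandas as pd

def encode_labels(labels: pd.Series, mapping: dict) -> pd.Series:
    """Map string labels to integer codes; unmapped labels get len(mapping)."""
    unknown_code = len(mapping)
    return labels.map(mapping).fillna(unknown_code).astype(int)

# Toy data: "SomeOtherFamily" is deliberately absent from the mapping
toy = pd.DataFrame(
    {"classification_family": ["Emotet", "Zeus", "SomeOtherFamily"]}
)
toy_mapping = {"Emotet": 0, "Zeus": 1}

codes = encode_labels(toy["classification_family"], toy_mapping)
print(codes.tolist())
```

Using `len(mapping)` as the fallback matches the notebook's choices of 10 for the ten families and 6 for the six types, assuming the mapped codes run from 0 upward without gaps.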

In [None]:
# Plot the malware types over time
def plot_malware_frequency(df):

    # Aggregate malware counts by date and type
    malware_counts = df.groupby(['date', 'classification_type']).size().reset_index(name='count')

    # Create the main plot
    f, ax = plt.subplots(figsize=(12, 8))
    
    # Create additional axes for the distributions
    ax_histx = plt.axes([0.1, 0.85, 0.8, 0.15])
    ax_histy = plt.axes([0.85, 0.1, 0.15, 0.75])

    # Remove unnecessary spines
    sns.despine(f)
    sns.despine(ax=ax_histx, left=True, bottom=True)
    sns.despine(ax=ax_histy, left=True, bottom=True)

    pink_palette = sns.color_palette("RdPu", n_colors=len(df['classification_type'].unique()))

    # Main scatter plot
    sns.scatterplot(x="date", y="count",
                    hue="classification_type",
                    size="count",
                    sizes=(40, 400),
                    alpha=0.6,
                    data=malware_counts,
                    palette=pink_palette,
                    ax=ax)

    # KDE plot for x-axis (dates)
    sns.kdeplot(data=malware_counts, x="date", ax=ax_histx, color='thistle', fill=True)
    ax_histx.set_xlim(ax.get_xlim())
    ax_histx.set_yticks([])

    # KDE plot for y-axis (counts)
    sns.kdeplot(data=malware_counts, y="count", ax=ax_histy, color='mediumorchid', fill=True)
    ax_histy.set_ylim(ax.get_ylim())
    ax_histy.set_xticks([])

    ax.set_xlabel("Date", fontsize=12)
    ax.set_ylabel("Malware Count", fontsize=12)
    ax.set_title("Frequency of Malware Types Over Time", fontsize=14)

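    # Rotate the date labels on the main axes; plt.xticks would act on the
    # most recently created axes (ax_histy) instead
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right")

    # Retitle the legend and blank the redundant hue subtitle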
    leg = ax.get_legend()
    leg.set_title("Malware Category")
    for t in leg.texts:
        if t.get_text() == 'classification_type':
            t.set_text('')
    
    #plt.tight_layout()
    plt.show()

# Plot the malware type counts for the labeled dataset
plot_malware_frequency(avast_ctu_dataset)


In [None]:
# Plot the malware families over time
def plot_malware_frequency(df):

    # Aggregate malware counts by date and type
    malware_counts = df.groupby(['date', 'classification_family']).size().reset_index(name='count')

    # Create the main plot
    f, ax = plt.subplots(figsize=(12, 8))
    
    # Create additional axes for the distributions
    ax_histx = plt.axes([0.1, 0.85, 0.8, 0.15])
    ax_histy = plt.axes([0.85, 0.1, 0.15, 0.75])

    # Remove unnecessary spines
    sns.despine(f)
    sns.despine(ax=ax_histx, left=True, bottom=True)
    sns.despine(ax=ax_histy, left=True, bottom=True)

    family_palette = sns.color_palette("BuPu", n_colors=len(df['classification_family'].unique()))

    # Main scatter plot
    sns.scatterplot(x="date", y="count",
                    hue="classification_family",
                    size="count",
                    sizes=(40, 400),
                    alpha=0.6,
                    data=malware_counts,
                    palette=family_palette,
                    ax=ax)

    # KDE plot for x-axis (dates)
    sns.kdeplot(data=malware_counts, x="date", ax=ax_histx, color='thistle', fill=True)
    ax_histx.set_xlim(ax.get_xlim())
    ax_histx.set_yticks([])

    # KDE plot for y-axis (counts)
    sns.kdeplot(data=malware_counts, y="count", ax=ax_histy, color='mediumorchid', fill=True)
    ax_histy.set_ylim(ax.get_ylim())
    ax_histy.set_xticks([])

    ax.set_xlabel("Date", fontsize=12)
    ax.set_ylabel("Malware Count", fontsize=12)
    ax.set_title("Frequency of Malware Families Over Time", fontsize=14)

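    # Rotate the date labels on the main axes; plt.xticks would act on the
    # most recently created axes (ax_histy) instead
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right")

    # Retitle the legend and blank the redundant hue subtitle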
    leg = ax.get_legend()
    leg.set_title("Malware family")
    for t in leg.texts:
        if t.get_text() == 'classification_family':
            t.set_text('')
    
    #plt.tight_layout()
    plt.show()

# Plot the malware family counts for the labeled dataset
plot_malware_frequency(avast_ctu_dataset)

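The two plotting cells differ only in the label column they group by. One way to share the aggregation step is a small helper that takes the column name as a parameter. This is a sketch rather than the notebook's code: `count_by_date` and the toy rows are hypothetical, while the `date`, `classification_type`, and `classification_family` column names come from the cells above.

```python
import pandas as pd

def count_by_date(df: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Count samples per (date, label) pair, as both plotting cells do."""
    return df.groupby(["date", label_col]).size().reset_index(name="count")

# Toy rows for illustration only; they are not taken from the dataset
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"]
    ),
    "classification_type": ["banker", "banker", "trojan", "banker"],
    "classification_family": ["Emotet", "Emotet", "Swisyn", "Zeus"],
})

type_counts = count_by_date(toy, "classification_type")
family_counts_by_date = count_by_date(toy, "classification_family")
print(type_counts)
print(family_counts_by_date)
```

Each result has one row per date and label with a `count` column, which is the shape that `sns.scatterplot` receives as `data=malware_counts` in the cells above.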