<h1> DYNAMIC MALWARE CLASSIFICATION</h1>

Supervised, classification, multi-class

# 0. Basic Set-Up and General info

Note: there are a lot of non-used imports (to be removed at the end)

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import seaborn as sns
from scipy import stats

import copy

In [56]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


In [57]:
from imblearn.over_sampling import SMOTENC
from boruta import BorutaPy


In [58]:
import importlib

# Import your modules initially
import library.pipeline

# Reload the modules in case they were updated
importlib.reload(library.pipeline)

# Now you can import specific classes/functions
from library.pipeline.pipeline import Pipeline
from library.pipeline.pipeline_manager import PipelineManager

## Documentation Standarization
Follow the following guidelines to make sure our document is consistent and easier to undestand
- All constant should be writte in full capitall letters (e.g: MY_CONSTANT). Notebook-level constants should be written in the cell assigned to it (see below)
- Every new major section (EDA, Feature engineering) should be written with '# My title'#
- Subsequent sections should use '##', '###' or '####' hierarchacically
- At the beginning of each major section write a content index with the content included in that section (see 'feature engineering' section as an example)
- Please write paragraphs before and after each cell explain what u are about to do and the conclusions, correspondingly. Do not assume they are too obvious.
- Avoid excessive ChatGPT-originated comments
- Avoid writing more than 20 lines per code cell (exceptions for subroutines, which should be written in utilitied_functions.py)
- Ideally, add a "Questions" and "Things to be done" section in each major sections where u write about futher iterations u want to do (while sharing in with the rest)


## Set-up

In [59]:
dataset_path ="./dataset/dynamic_dataset.csv"
results_path = "results/results.csv"

In [60]:
metrics_to_evaluate = ["accuracy", "precision", "recall", "f1-score"]

We start off with a pipeline for all the models. Then at some moment, whenever we think the pipelines will diverge, we remove the by-object-refrence (that we will create in the next cell), and create a new copy with our modifications ready.

In [61]:
default_pipeline = ensembled_pipeline = tree_pipeline = supportVectorsMachine_pipeline = baseline_pipeline = naiveBayes_pipeline = stacking_pipeline = feedForwardNN_pipeline = example = Pipeline(
                        dataset_path=dataset_path,
                        model_results_path=results_path,
                        model_task="classification")

In [62]:
pipelines = {
            "not-baseline": {
                  "ensembled": ensembled_pipeline,
                  "tree-based": tree_pipeline,
                  "linear": supportVectorsMachine_pipeline,
                  "naive-bayes": naiveBayes_pipeline,
                  "feedForwardNN": feedForwardNN_pipeline,
                  "stacking": stacking_pipeline,
                  }, 
            "baseline": {
                  "baselines": baseline_pipeline, 
                  "example":  example               
            }
}
pipeline_manager = PipelineManager(pipelines)

Here I show an example of how we will be working with by-object-reference.

In [63]:
baseline_pipeline.dataset.example_attribute = "1"

In [64]:
default_pipeline.dataset.example_attribute

'1'

However, if we do a (deep) copy:
Note that the difference between a deep and a shallow copy is that the former copies all the objects that are inside the class. We need that.
We will overwrite the prior attribute and show is this is not propagated to the other results anymore

In [65]:
new_pipeline = copy.deepcopy(default_pipeline)
new_pipeline.dataset.example_attribute = "2"

In [66]:
default_pipeline.dataset.example_attribute

'1'

It may also be of interest for you to propagate the same method's invokation to all the pipelines within the pipeline manager. For instance, after a divergence in, lets say, feature scaling, you may need to fit all the models in the corresponding pipelines. You would want to avoid reduntandly calling the same method with the same parameters for all the Pipeline class instances. In order to do so, we present an example that avoids such annoyance

In [67]:
pipeline_manager.create_pipeline_divergence(category="baseline", pipelineName="example", print_results=True)

Pipeline example in category baseline has diverged
 Pipeline schema is now: {'not-baseline': {'ensembled': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'tree-based': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'linear': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'naive-bayes': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'feedForwardNN': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'stacking': <library.pipeline.pipeline.Pipeline object at 0x317256d50>}, 'baseline': {'baselines': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'example': <library.pipeline.pipeline.Pipeline object at 0x317253990>}}


<library.pipeline.pipeline.Pipeline at 0x317253990>

In [68]:
pipeline_manager.pipelines["baseline"]["example"].feature_analysis.feature_selection.automatic_feature_selection.speak("Whats goooddd")

Whats goooddd from 13086427216. You are at automatic feature selection!


In [69]:
pipeline_manager.all_pipelines_execute(methodName="speak", message="Hello, world!")

Hello, world! from 13273230672
Hello, world! from 13273217424


{'not-baseline': {'ensembled': None}, 'baseline': {'example': None}}

A method present deeper down the class:

Instead of doing:
'pipeline_manager.pipelines["baseline"]["logistic"].feature_analysis.feature_selection.automatic_feature_selection.speak("Whats goooddd") for all objects'

we can do:

In [70]:
pipeline_manager.all_pipelines_execute(methodName="feature_analysis.feature_selection.automatic_feature_selection.speak", message="Whats good")

Whats good from 13275691472. You are at automatic feature selection!
Whats good from 13086427216. You are at automatic feature selection!


{'not-baseline': {'ensembled': None}, 'baseline': {'example': None}}

In [71]:
# lets delete the example obj
del new_pipeline 
del pipeline_manager.pipelines["baseline"]["example"]

In [72]:
pipeline_manager.pipelines

{'not-baseline': {'ensembled': <library.pipeline.pipeline.Pipeline at 0x317256d50>,
  'tree-based': <library.pipeline.pipeline.Pipeline at 0x317256d50>,
  'linear': <library.pipeline.pipeline.Pipeline at 0x317256d50>,
  'naive-bayes': <library.pipeline.pipeline.Pipeline at 0x317256d50>,
  'feedForwardNN': <library.pipeline.pipeline.Pipeline at 0x317256d50>,
  'stacking': <library.pipeline.pipeline.Pipeline at 0x317256d50>},
 'baseline': {'baselines': <library.pipeline.pipeline.Pipeline at 0x317256d50>}}

<hr>

# Start Of The Pipeline

In [73]:
default_pipeline.dataset.df.head()

Unnamed: 0,Memory_PssTotal,Memory_PssClean,Memory_SharedDirty,Memory_PrivateDirty,Memory_SharedClean,Memory_PrivateClean,Memory_SwapPssDirty,Memory_HeapSize,Memory_HeapAlloc,Memory_HeapFree,...,Logcat_error,Logcat_warning,Logcat_debug,Logcat_verbose,Logcat_total,Process_total,Hash,Category,Family,Reboot
0,31053,2448,14044,23472,74824,2452,0,8919,4786,4132,...,1635,2351,3285,1551,11221,193,f460abb8f2e4e3fb689966ddaea6d6babbd1738bb691c7...,Trojan_SMS,opfake,before
1,107787,21976,11852,74548,69052,23152,0,25341,20965,4375,...,1816,826,1544,2045,8457,189,556c238536d837007e647543eaf3ea95ae9aaf1c1a52d0...,Trojan_SMS,opfake,before
2,86584,18460,12284,59992,91548,19376,0,24500,21378,3121,...,2244,3406,1565,2819,10780,195,398322f94b5bfa2a9e7b3756a4cf409764595003280c48...,Trojan_SMS,fakeinst,before
3,41248,924,10328,36280,55768,928,0,10082,7281,2800,...,974,4134,3138,1556,11739,191,4a9c14872b2c66165599a969a1a8654bb6887d7a18ab6d...,Trojan_SMS,fakeinst,before
4,38621,5080,12392,27388,71048,5088,0,9077,5750,3326,...,936,2298,3752,1992,10488,188,6b37b9b9c170727f706b69731e64da4bbca2638b4237a7...,Trojan_SMS,fakeinst,before


In [74]:
default_pipeline.dataset.df["API_DeviceData_android.location.Location_getLatitude"]

0        0
1        0
2        0
3        0
4        0
        ..
53434    0
53435    0
53436    0
53437    0
53438    0
Name: API_DeviceData_android.location.Location_getLatitude, Length: 53439, dtype: int64

In [75]:
default_pipeline.dataset.df.rename(columns={"reboot": "Reboot"}, inplace=True) # consistency with other features' names
default_pipeline.dataset.df.drop(columns=["Family", "Hash"], inplace=True) # We have decided to use only category as target variable; Hash is temporary while im debugging (it will be deleted in EDA)

### Description of our features

For each of our entries in the dataset we have a feature describing the state of the (operating) system under which the malware was running. Here is an intial description for each of them.
Please access this document for a complete problem context description: https://docs.google.com/document/d/1yH9gvnJVSH9GLv9ATQ5JQWA2z8Jy4umxxRfMF-y2fiU/edit?usp=sharing

# 1. EDA

This section below shall be deleted

In this section, we conduct an Exploratory Data Analysis (EDA) on the dynamic malware dataset to gain initial insights into the structure, distribution, and quality of the data prior to modeling. The dataset includes behavioral features extracted from Android applications, along with labels indicating their respective malware categories. Understanding the composition of the dataset, such as class imbalance, feature correlations, and the presence of outliers, is crucial to ensure robust preprocessing, informed feature engineering, and  the success of machine learning classifiers.


### SUGGESTIONS FOR IMPROVEMENTS
- Cluster analysis

## Data Type Distribution

In [76]:
# # Get the data types of each column
# data_types = dataset.df.dtypes

# # Count the frequency of each data type
# type_counts = data_types.value_counts()

# # Plotting the frequency of data types
# type_counts.plot(kind='bar', color='skyblue')

# # Add title and labels
# plt.title('Frequency of Data Types in DataFrame')
# plt.xlabel('Data Type')
# plt.ylabel('Frequency')

We can see that most columns are numerical. Lets gets to see which are the variables that are of type object.

In [77]:
# df_onlyCols = dataset.df.select_dtypes(include=["object"]).columns
# df_onlyCols

## Summary Statistics Overview

## Histograms


In [78]:
"""# Select only numerical features
numerical_cols = dataset.df.select_dtypes(include=[np.number]).columns

# Define number of rows and columns for subplots
num_features = len(numerical_cols)
cols = 4  # Number of columns per row
rows = math.ceil(num_features / cols)  # Calculate required rows

# Create subplots
fig, axes = plt.subplots(rows, cols, figsize=(16, rows * 4))
axes = axes.flatten()  # Flatten to easily iterate

# Plot histograms
for i, col in enumerate(numerical_cols):
    sns.histplot(df[col], bins=30, kde=True, ax=axes[i])  # kde=True for smooth curve
    axes[i].set_title(col)

# Remove empty subplots
for i in range(num_features, len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()"""

'# Select only numerical features\nnumerical_cols = dataset.df.select_dtypes(include=[np.number]).columns\n\n# Define number of rows and columns for subplots\nnum_features = len(numerical_cols)\ncols = 4  # Number of columns per row\nrows = math.ceil(num_features / cols)  # Calculate required rows\n\n# Create subplots\nfig, axes = plt.subplots(rows, cols, figsize=(16, rows * 4))\naxes = axes.flatten()  # Flatten to easily iterate\n\n# Plot histograms\nfor i, col in enumerate(numerical_cols):\n    sns.histplot(df[col], bins=30, kde=True, ax=axes[i])  # kde=True for smooth curve\n    axes[i].set_title(col)\n\n# Remove empty subplots\nfor i in range(num_features, len(axes)):\n    fig.delaxes(axes[i])\n\nplt.tight_layout()\nplt.show()'

We can see most distributions tend to be right-skewed and only a small portion follows a normal distribution. This right-skewness will be dealt in feature-engineering.

In [79]:
"""# Select only numerical features
numerical_cols = dataset.df.select_dtypes(include=[np.number]).columns

# Define number of rows and columns for subplots
num_features = len(numerical_cols)
cols = 4  # Number of columns per row
rows = math.ceil(num_features / cols)  # Calculate required rows

# Create subplots
fig, axes = plt.subplots(rows, cols, figsize=(16, rows * 4))
axes = axes.flatten()  # Flatten to easily iterate

# Plot boxplots
for i, col in enumerate(numerical_cols):
    sns.boxplot(x=dataset.df[col], ax=axes[i])  # Boxplot for each feature
    axes[i].set_title(col)

# Remove empty subplots
for i in range(num_features, len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()"""

'# Select only numerical features\nnumerical_cols = dataset.df.select_dtypes(include=[np.number]).columns\n\n# Define number of rows and columns for subplots\nnum_features = len(numerical_cols)\ncols = 4  # Number of columns per row\nrows = math.ceil(num_features / cols)  # Calculate required rows\n\n# Create subplots\nfig, axes = plt.subplots(rows, cols, figsize=(16, rows * 4))\naxes = axes.flatten()  # Flatten to easily iterate\n\n# Plot boxplots\nfor i, col in enumerate(numerical_cols):\n    sns.boxplot(x=dataset.df[col], ax=axes[i])  # Boxplot for each feature\n    axes[i].set_title(col)\n\n# Remove empty subplots\nfor i in range(num_features, len(axes)):\n    fig.delaxes(axes[i])\n\nplt.tight_layout()\nplt.show()'

## Numerical Features

Seemed to be grouped by prefixes: Memory, Network, Battery, Logcat, Process y API.

According to dataset authors to capture how various malware families and categories behave at runtime, the analysis relies on six distinct sets of features obtained after executing each sample within a controlled emulated environment. These feature groups offer a comprehensive view of the malware's dynamic activity.

This categories appear before the first _ in every feature label and are defined as:


"Memory: Memory features define activities performed by malware by utilizing memory.

API: Application Programming Interface (API) features delineate the communication between two applications.

Network: Network features describe the data transmitted and received between other devices in the network. It indicates foreground and background network usage.

Battery: Battery features describe the access to battery wakelock and services by malware.

Logcat: Logcat features write log messages corresponding to a function performed by malware.

Process: Process features count the interaction of malware with total number of processes."



In [80]:
"""numeric_cols = dataset.df.select_dtypes(include='number').columns

# Grouping based on the first prefix before "_"
prefix_groups = defaultdict(list)

for col in numeric_cols:
    prefix = col.split("_")[0]  # Get the first word before the underscore
    prefix_groups[prefix].append(col)

for prefix, columns in prefix_groups.items():
    print(f"\n {prefix} ({len(columns)} features):")
    for col in columns:
        print(f"  - {col}")"""

'numeric_cols = dataset.df.select_dtypes(include=\'number\').columns\n\n# Grouping based on the first prefix before "_"\nprefix_groups = defaultdict(list)\n\nfor col in numeric_cols:\n    prefix = col.split("_")[0]  # Get the first word before the underscore\n    prefix_groups[prefix].append(col)\n\nfor prefix, columns in prefix_groups.items():\n    print(f"\n {prefix} ({len(columns)} features):")\n    for col in columns:\n        print(f"  - {col}")'

## Categorical Features

In [81]:
"""#Statistical summary for categorical features
dataset.df.describe(include=["object", "category", "bool"])"""

'#Statistical summary for categorical features\ndataset.df.describe(include=["object", "category", "bool"])'

In [82]:
"""print(dataset.df[['Hash', 'Category', 'Family']].head())"""

"print(dataset.df[['Hash', 'Category', 'Family']].head())"

Hash: unique identifier that represents each malware sample. <<<>>>THIS IS PROBABLY WRONG<<<>>>

Category: general classification of the malware sample based on its behavior.

Family: more fine-grained grouping of malware based on its codebase or origin

For hash, it will first be checked if the same malware before and after reboot contains the same hash value.

In [83]:
"""# Count how many times each hash appears in 'before' and 'after'
hash_reboot_counts =dataset.df.groupby(['Hash', 'reboot']).size().unstack(fill_value=0)

# Hashes in both with exactly one in each
hashes_with_one_each = hash_reboot_counts[
    (hash_reboot_counts['before'] == 1) & (hash_reboot_counts['after'] == 1)
].index

# Hashes in both but with extra rows
hashes_in_both_but_not_clean = hash_reboot_counts[
    (hash_reboot_counts['before'] > 0) &
    (hash_reboot_counts['after'] > 0) &
    ~((hash_reboot_counts['before'] == 1) & (hash_reboot_counts['after'] == 1))
].index

# Total unique hashes
total_unique_hashes = dataset.df['Hash'].nunique()

# Hashes in only one reboot condition
hashes_in_one_condition = hash_reboot_counts[
    (hash_reboot_counts['before'] == 0) | (hash_reboot_counts['after'] == 0)
]

# Only once in one reboot condition
only_once_in_one = hashes_in_one_condition[
    (hashes_in_one_condition['before'] == 1) | (hashes_in_one_condition['after'] == 1)
]

# More than once in one reboot condition
more_than_once_in_one = hashes_in_one_condition[
    ((hashes_in_one_condition['before'] > 1) & (hashes_in_one_condition['after'] == 0)) |
    ((hashes_in_one_condition['after'] > 1) & (hashes_in_one_condition['before'] == 0))
]

# Split those into counts
more_than_once_in_before = more_than_once_in_one[more_than_once_in_one['before'] > 1]
more_than_once_in_after = more_than_once_in_one[more_than_once_in_one['after'] > 1]

# --- PRINT RESULTS ---
print(f"Hashes with EXACTLY one row in BOTH before and after: {len(hashes_with_one_each)}")
print(f"Hashes in BOTH, BUT with extra rows: {len(hashes_in_both_but_not_clean)}")

print(f"\nHashes in ONLY ONE reboot condition:")
print(f"• Appearing ONLY ONCE: {len(only_once_in_one)}")
print(f"• Appearing MORE THAN ONCE: {len(more_than_once_in_one)}")
print(f"   - More than once in BEFORE: {len(more_than_once_in_before)}")
print(f"   - More than once in AFTER: {len(more_than_once_in_after)}")

print(f"\nTotal breakdown:")
print(f"• In BOTH (any): {len(hashes_with_one_each) + len(hashes_in_both_but_not_clean)}")
print(f"• In ONLY ONE reboot: {len(hashes_in_one_condition)}")
print(f"• TOTAL unique hashes: {total_unique_hashes}")
"""

'# Count how many times each hash appears in \'before\' and \'after\'\nhash_reboot_counts =dataset.df.groupby([\'Hash\', \'reboot\']).size().unstack(fill_value=0)\n\n# Hashes in both with exactly one in each\nhashes_with_one_each = hash_reboot_counts[\n    (hash_reboot_counts[\'before\'] == 1) & (hash_reboot_counts[\'after\'] == 1)\n].index\n\n# Hashes in both but with extra rows\nhashes_in_both_but_not_clean = hash_reboot_counts[\n    (hash_reboot_counts[\'before\'] > 0) &\n    (hash_reboot_counts[\'after\'] > 0) &\n    ~((hash_reboot_counts[\'before\'] == 1) & (hash_reboot_counts[\'after\'] == 1))\n].index\n\n# Total unique hashes\ntotal_unique_hashes = dataset.df[\'Hash\'].nunique()\n\n# Hashes in only one reboot condition\nhashes_in_one_condition = hash_reboot_counts[\n    (hash_reboot_counts[\'before\'] == 0) | (hash_reboot_counts[\'after\'] == 0)\n]\n\n# Only once in one reboot condition\nonly_once_in_one = hashes_in_one_condition[\n    (hashes_in_one_condition[\'before\'] == 1) 

A total of 19,169 hashes appear exactly once in both before and after conditions. These are highly reliable for paired  comparisons, ideal for understanding how reboot affects malware behavior.


There are 158 hashes that appear in both reboot states but not exactly once in each. These extra instances may come from inconsistencies in data capture like multiple logs for the same sample and should be checked.

A significant portion of samples appear only in one reboot condition. This is consistent with limitations described in the original dataset paper, where some malware samples failed to execute after the reboot. However, what is curious is that some still have been logged more than once.


In [84]:
"""dataset.df.drop(columns=['Hash'], inplace=True)
'''
The Hash column is a high-cardinality feature, containing unique values for a high number of rows in the dataset.
It serves as an identifier for each malware sample. Including this column in modeling
would not only offer no predictive value but could also lead to overfitting or cause issues with algorithms that are
sensitive to high-cardinality categorical features.
 <<<>>> J.N: may be better to focus the argumentation on ID not being useful rather than high-cardinality per se. Also write the 
  argumentation in a text cell not in this type of comments. <<<>>>
'''"""

"dataset.df.drop(columns=['Hash'], inplace=True)\n'''\nThe Hash column is a high-cardinality feature, containing unique values for a high number of rows in the dataset.\nIt serves as an identifier for each malware sample. Including this column in modeling\nwould not only offer no predictive value but could also lead to overfitting or cause issues with algorithms that are\nsensitive to high-cardinality categorical features.\n <<<>>> J.N: may be better to focus the argumentation on ID not being useful rather than high-cardinality per se. Also write the \n  argumentation in a text cell not in this type of comments. <<<>>>\n'''"

This research will be using both Category and Family as the target variables for classification.

## Reboot Analysis

In [85]:
"""print(dataset.df["reboot"].value_counts())"""

'print(dataset.df["reboot"].value_counts())'

The imbalance observed in the dataset, with 28,380 samples collected before reboot and only 25,059 after reboot, is explained by limitations found during the dynamic analysis. The authors of the dataset note that "there was no entry point in some Android malware samples and some Android malware samples stopped abruptly." This means that certain malware applications either failed to launch or terminated unexpectedly during execution, preventing the collection of dynamic behavior data, particularly after the reboot phase.

Additionally, the study highlights another critical limitation: "the dynamic analysis is performed in an emulator. Some malware samples are able to detect the emulated environment and are not executed." This behavior reflects common anti-analysis techniques used by sophisticated malware, which can detect when they are running in a sandbox or emulator and intentionally suspend their malicious actions.




<<<>>>THIS ANALYSIS IS SUPER GOOD (you can delete this comment)<<<>>>

The displayed features are the top 10  most affected by reboot showing a clear reboot-sensitive behavior.

In [86]:
"""#Category distribution across reboot
plt.figure(figsize=(12, 6))
sns.countplot(data=dataset.df, x='Category', hue='reboot')
plt.title("Malware Categories by Reboot Condition")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()"""


'#Category distribution across reboot\nplt.figure(figsize=(12, 6))\nsns.countplot(data=dataset.df, x=\'Category\', hue=\'reboot\')\nplt.title("Malware Categories by Reboot Condition")\nplt.xticks(rotation=45)\nplt.tight_layout()\nplt.show()'

To identify which numeric features are most influenced by the reboot condition, the dataset will be grouped by the reboot variable, separating entries collected before and after the device reboot. Within each group, the mean of every numeric feature will be computed, allowing for the comparison of average behavior across both states.

A new column labeled 'diff' was then added, representing the difference between the mean values after and before the reboot for each feature. A positive value indicates that the feature increased after reboot, while a negative value shows it decreased.

In [87]:
"""reboot_means = dataset.df.groupby('reboot').mean(numeric_only=True).T
reboot_means['diff'] = reboot_means['after'] - reboot_means['before']
reboot_means_sorted = reboot_means.sort_values(by='diff', ascending=False)

reboot_means_sorted.head(10)"""

"reboot_means = dataset.df.groupby('reboot').mean(numeric_only=True).T\nreboot_means['diff'] = reboot_means['after'] - reboot_means['before']\nreboot_means_sorted = reboot_means.sort_values(by='diff', ascending=False)\n\nreboot_means_sorted.head(10)"

The results reveal that several features show clear shifts after reboot. Specially, network-related features such as Network_TotalReceivedBytes and Network_TotalTransmittedBytes demonstrate significant increases, suggesting that some malware types intensify data transmission once the device has rebooted. Memory features like Memory_SharedClean, Memory_HeapSize, and Memory_HeapAlloc also show increased values after reboot, indicating greater memory use or altered memory management after reboot.
This shows that the reboot condition plays an important role in runtime behavior and should be treated as an important factor in exploratory analysis and modeling.

## Family

In [88]:
"""#How many categories each family belongs to
dataset.df.groupby("Family")["Category"].nunique().sort_values(ascending=False)"""

'#How many categories each family belongs to\ndataset.df.groupby("Family")["Category"].nunique().sort_values(ascending=False)'

Almost every family is either unknown or unique


In [89]:
# <<<Error: NameError: name 'family_to_category' is not defined>>> (this Irina's code; copied from Argentinan guy's notebook)
# multi_cat_families = family_to_category[family_to_category > 1]
# print(f"Number of families mapping to multiple categories: {len(multi_cat_families)}")
# print(multi_cat_families)

There is only one Family that maps to multiple categories, and is the placeholder unknown.

The following code displays how many samples with unknown family labels belong to each malware category.

In [90]:
"""dataset.df[dataset.df["Family"] == "<unknown>"]["Category"].value_counts()"""

'dataset.df[dataset.df["Family"] == "<unknown>"]["Category"].value_counts()'

In [91]:
"""# Step 1: Count unique families per category
family_amount = dataset.df.groupby("Category")["Family"].nunique()

# Step 2: Total number of instances per category
total_per_category = dataset.df["Category"].value_counts()

# Step 3: Count how many of those are <unknown> per category
unknown_amount = dataset.df[dataset.df["Family"] == "<unknown>"]["Category"].value_counts()

# Step 4: Combine all stats into a summary table
summary_df = pd.DataFrame({
    "Family_amount": family_amount,
    "Total_category": total_per_category,
    "Unknown_amount": unknown_amount
}).fillna(0).astype({"Unknown_amount": int})

# Step 5: Calculate percentage of unknowns per category
summary_df["%_Unknown"] = (summary_df["Unknown_amount"] / summary_df["Total_category"] * 100).round(2)

# Reorder columns for readability
summary_df = summary_df[["Family_amount", "Total_category", "Unknown_amount", "%_Unknown"]]

# Display the summary
print(summary_df)"""

'# Step 1: Count unique families per category\nfamily_amount = dataset.df.groupby("Category")["Family"].nunique()\n\n# Step 2: Total number of instances per category\ntotal_per_category = dataset.df["Category"].value_counts()\n\n# Step 3: Count how many of those are <unknown> per category\nunknown_amount = dataset.df[dataset.df["Family"] == "<unknown>"]["Category"].value_counts()\n\n# Step 4: Combine all stats into a summary table\nsummary_df = pd.DataFrame({\n    "Family_amount": family_amount,\n    "Total_category": total_per_category,\n    "Unknown_amount": unknown_amount\n}).fillna(0).astype({"Unknown_amount": int})\n\n# Step 5: Calculate percentage of unknowns per category\nsummary_df["%_Unknown"] = (summary_df["Unknown_amount"] / summary_df["Total_category"] * 100).round(2)\n\n# Reorder columns for readability\nsummary_df = summary_df[["Family_amount", "Total_category", "Unknown_amount", "%_Unknown"]]\n\n# Display the summary\nprint(summary_df)'

In [92]:
"""unknown_count = (dataset.df["Family"] == "<unknown>").sum()
print(f"Number of rows with Family == '<unknown>': {unknown_count}")"""


'unknown_count = (dataset.df["Family"] == "<unknown>").sum()\nprint(f"Number of rows with Family == \'<unknown>\': {unknown_count}")'

Based on the analysis of family distribution across categories:

The Adware category stands out with zero instances labeled as <unknown> and a balanced distribution across 43 families. This makes it a strong candidate for modeling.

In contrast, Zero_Day and No_Category The categories Zero_Day and No_Category exhibit extremely high family dispersion, with 2576 and 335 unique families. These values are significantly higher than all other categories, which generally have fewer than 50 families each.


This suggests they function more as placeholder labels. In particular, Zero_Day likely serves as a catch-all label for unknown or uncategorized threats, making it ambiguous. In cybersecurity, this term is refered to a new unknown vulnerability, not yet classified in terms of malware behavior, this is why samples are varied. They do not seem to represent a consistent type. On the other hand, No_Category explicitly denotes a lack of category. So, including these instances would only bring noise to the training process, preventing the model from learning meaningful patterns.
Therefore, they are excluded from the final dataset to preserve the quality and consistency of the classification task.


Additionally, categories like FileInfector show a high percentage of <unknown> families (6.85%) despite having a small total count, raising concerns about label quality. Most other categories maintain a relatively stable level of unknowns (around 3–5%), indicating that the presence of <unknown> is manageable.

<hr>

# 2. DATA SPLITTING
### TO BE DONE
- Statistical analysis of this
- Make sure they follow the same distributions

### Data Splitting: Category as target variable
Originally, we will focus only on category

Lets first get the X and y extracted from our dataset

Also object!
Lets get back to the splitting!

Before, we split the dataset lets observe the SE of accuracy variation based on our choice of split. 
Brief explanation: we can model accuracy via a Binomial distribution. We know each event in a binomial distribution can be modelled through a bernoulli distribution, where the outcome represents the probability predicting the correct class or not. We make the assumption that each classification error is independent from each other. For:
$$
\text{Bin} \sim (n, p)
$$
 let us assume that the parameter of this distribution is p = .85 and n is given by the choice of sample split for the test set. The SE of the sample proportion can be modeled via: 
$$
\text{SE}_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}

$$

Before we continue to assess all possible choices of split based on a variant n, also note that that choice of evenly distributed split (e.g: 10% for each hold-out set) between hold-out sets is arbitrary. Proper choice is that which guarantees an equivalent distribution at each hold-out sets, which may not neccesarily be the equivalent split. This brings a new choice of tradeoff between certainity of prediction accuracy (higher test size, smaller validaiton set) but possibly less space for proper hyperparemter optimization or the inverse (higher validation size, smaller set set). As with other tradeoffs, the priority for a given option is rooted in the model's application (which may come derived from a client/employer). For our case, we dont favor either option of the tradeoof, thus we will keep the even hold-out distribution.

Finally, also note that that choice of evenly distributed split between hold-out sets is arbitrary. Proper choice is that which guarantees an equivalent distribution at each hold-out sets, which may not neccesarily be the equivalent split. This brings a new choice of tradeoff between certainity of prediction accuracy (higher test size, smaller validaiton set) but possibly less space for proper hyperparemter optimization or the inverse (higher validation size, smaller set set). As with other tradeoffs, the priority for a given option is rooted in the model's application (which may come derived from a client/employer). For our case, we dont favor either option of the tradeoof, thus we will keep the even hold-out distribution.


In [93]:
default_pipeline.dataset.split.asses_split_classifier(p=.85, step=.05, plot=True)

TypeError: NoTimeSeries.asses_split_classifier() got an unexpected keyword argument 'plot'

[We can see a diminishing-returns class graph](https://en.wikipedia.org/wiki/Knee_of_a_curve). The more we decline the training set percentage the slower and more steadier the current SE varies as well as the difference to prior SE, in percentage. We can see that the knee in the curve is between 80 and 90 training set percentage. This represents the area where when you start going below the lower bound, no **significant** improve appears. Given the fact that our criteria for choice of split percentage is to keep as much training possible while increasing hold-out sets size only if the decrease in SE is significant **we are going to select 80% for the training set**.

In [40]:
default_pipeline.dataset.split.split_data(y_column="Category",
                                   train_size=.8, 
                                   validation_size=.1,
                                   test_size=.1, 
                                   plot_distribution=False)

In [None]:
default_pipeline.dataset.X_train.shape, default_pipeline.dataset.X_val.shape, default_pipeline.dataset.X_test.shape, default_pipeline.dataset.y_train.shape, default_pipeline.dataset.y_val.shape, default_pipeline.dataset.y_test.shape

<hr>

# 3. DATA PREPROCESSING

In [97]:
pipeline_manager.pipelines

{'not-baseline': {'ensembled': <library.pipeline.pipeline.Pipeline at 0x31b4d79d0>,
  'tree-based': <library.pipeline.pipeline.Pipeline at 0x31b317750>,
  'linear': <library.pipeline.pipeline.Pipeline at 0x31b4ca850>,
  'naive-bayes': <library.pipeline.pipeline.Pipeline at 0x31b4d4610>,
  'feedForwardNN': <library.pipeline.pipeline.Pipeline at 0x317256d50>,
  'stacking': <library.pipeline.pipeline.Pipeline at 0x317256d50>},
 'baseline': {'baselines': <library.pipeline.pipeline.Pipeline at 0x17dba6d50>}}

Making pipeline divergences...

In [95]:
baseline_pipeline = pipeline_manager.create_pipeline_divergence(category="baseline", pipelineName="baselines", print_results=True)
tree_pipeline = pipeline_manager.create_pipeline_divergence(category="not-baseline", pipelineName="tree-based", print_results=True)
supportVectorsMachine_pipeline = pipeline_manager.create_pipeline_divergence(category="not-baseline", pipelineName="linear", print_results=True)
naiveBayes_pipeline = pipeline_manager.create_pipeline_divergence(category="not-baseline", pipelineName="naive-bayes", print_results=True)
ensembled_pipeline = pipeline_manager.create_pipeline_divergence(category="not-baseline", pipelineName="ensembled", print_results=True)
feedForwardNN_pipeline = pipeline_manager.create_pipeline_divergence(category="not-baseline", pipelineName="ensembled", print_results=True)

Pipeline baselines in category baseline has diverged
 Pipeline schema is now: {'not-baseline': {'ensembled': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'tree-based': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'linear': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'naive-bayes': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'feedForwardNN': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'stacking': <library.pipeline.pipeline.Pipeline object at 0x317256d50>}, 'baseline': {'baselines': <library.pipeline.pipeline.Pipeline object at 0x17dba6d50>}}
Pipeline tree-based in category not-baseline has diverged
 Pipeline schema is now: {'not-baseline': {'ensembled': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'tree-based': <library.pipeline.pipeline.Pipeline object at 0x31b317750>, 'linear': <library.pipeline.pipeline.Pipeline object at 0x317256d50>, 'naive-bayes': <library.pipeline.pipeline.Pipeline object

In [96]:
pipeline_manager.pipelines

{'not-baseline': {'ensembled': <library.pipeline.pipeline.Pipeline at 0x31b4d79d0>,
  'tree-based': <library.pipeline.pipeline.Pipeline at 0x31b317750>,
  'linear': <library.pipeline.pipeline.Pipeline at 0x31b4ca850>,
  'naive-bayes': <library.pipeline.pipeline.Pipeline at 0x31b4d4610>,
  'feedForwardNN': <library.pipeline.pipeline.Pipeline at 0x317256d50>,
  'stacking': <library.pipeline.pipeline.Pipeline at 0x317256d50>},
 'baseline': {'baselines': <library.pipeline.pipeline.Pipeline at 0x17dba6d50>}}

### Duplicate and Missing Value Analysis

**Quantifying Missingness**

Our goal is to understand the structure (i.e., the potential patterns) of how the data is missing in our training features because two features missing values may indicate bias or reduce model performance, and we may be able to assure ourselves, that it is not too critical based on the type of missingness or its distribution.


We will begin by calculating the percentage of missing values per column; this is done as: 
$$
\text{Missing\_Percentage}_j = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x_{ij} = \text{NaN}) \times 100
$$

where:
- $x_{ij}$ is the value of the row  $i$ and the column $j$
- $n$ is the total number of rows
- $\mathbb{1}$ is the indicator function that returns 1 is the value is NaN and 0 if it is not.



This provides a fair comparison of the features, and sets aside concerns with dataset sizes. Once we can sort these percentages we can focus our assistance by features with the most serious issues first.


To support this summary we will also produce a heatmap to present the distribution of the missing values across our rows and columns, which will allow us to see possibly interactions and/or patterns - for example, do certain samples (or groups of features) fail together, which you couldn't see in simple statistics; and we favor this type of representation vs. just using summary tables, as this will also expose some structural issues likely related to acquisition processes (or conditional logic) that may be visible in the dataset.

In [None]:
placeholders = [np.nan, -999, "N/A", "missing", "", -np.inf, "NA", "ND", "Null", "null", "NaN", "nan", "none", "None", "missing", -float('inf'), "-inf"]
baseline_pipeline.preprocessing.uncomplete_data_obj.get_missing_values(placeholders=placeholders, plot=True)

If we had missing values, a useful method to handle them would be **KNN (k-nearest neighbors)** imputation, which estimates values based on similar samples to preserve the original distribution. However, it's computationally expensive with $O(n^2)$ complexity. A faster alternative is **imputing with the mean or median** --$O(n)$--, though this reduces variance. Since we have no missing values, imputation isn't necessary.

We need to check missing values for all pipelines. However, we will get no missing nor duplicates in every pipeline given that the dataset is the same.

In the case that there were missing values, we would probably go with the following:
- Tree-based: Trees handle missing values reasonably well, but we could impute with the median to avoid introducing bias.

- Linear (SVM): Requires complete data, so KNN might be the best implementation here.

- Naive Bayes: Performs poorly with missing data; we'd use KNN as well to ensure no missing data, but also accuracy.

- Ensembled: These combine several models, so consistent imputation (likely median for numeric, mode for categorical) is important to avoid skewing individual learners. KNN could also work for this.

- Baseline: Simple imputation (mean or median) would suffice here, as performance is not the focus but rather establishing a reference.

- FNNs: Requires complete and numerically stable input, as missing values can break backpropagation and lead to NaNs in training. The most common approach is mean or median imputation for numerical features, and mode or one-hot encoding for categorical ones. 

In [None]:
tree_pipeline.preprocessing.uncomplete_data_obj.get_missing_values(placeholders=placeholders)
supportVectorsMachine_pipeline.preprocessing.uncomplete_data_obj.get_missing_values(placeholders=placeholders)
naiveBayes_pipeline.preprocessing.uncomplete_data_obj.get_missing_values(placeholders=placeholders)
ensembled_pipeline.preprocessing.uncomplete_data_obj.get_missing_values(placeholders=placeholders)
feedForwardNN_pipeline.preprocessing.uncomplete_data_obj.get_missing_values(placeholders=placeholders)

In [45]:
# pipeline_manager.all_pipelines_execute(
#     methodName="preprocessing..uncomplete_data_obj.get_missing_values",
#     verbose=True
    
# )

# pipeline_manager.all_pipelines_execute(methodName="feature_analysis.feature_transformation.get_categorical_features_encoded", verbose=True, features=featuresToEncode, encode_y=True)

We want to assess if there were any duplicates in the training data, especially in the cases where duplicate records would lead to a biasing of the model, before the modelling occurs. First, we'll look for duplicate rows in the feature matrix $X$ train, where the same unique input is being repeated multiple times, that could lead the model to learn the unique patterns stronger or potentially put more weight on those records than other unique inputs. 

Next, we'll check for duplicates in the joint matrix [ $X$ train ∣ $y$ train], with the target variable as well. This exercise allows us to explicitly identify duplicate observations, i.e., where the input and output are both repeated, and may potentially indicate some unintentional data leak and/or potentially some issues in generating the dataset. We intend not to keep deductions at this point and keep this analysis exploratory, as to leave the modeller with the option of how and when to edit it at all.

This method serves to provide more credibility than looking to the dataset as an entirety and helps to identify the unique duplicated records (which might not be problem) and then the unique duplicated observations (which might be a more serious matter).

In [None]:
baseline_pipeline.preprocessing.uncomplete_data_obj.analyze_duplicates(plot=True)

In [None]:
tree_pipeline.preprocessing.uncomplete_data_obj.analyze_duplicates(plot=True)
supportVectorsMachine_pipeline.preprocessing.uncomplete_data_obj.analyze_duplicates(plot=True)
naiveBayes_pipeline.preprocessing.uncomplete_data_obj.analyze_duplicates(plot=True)
stacking_pipeline.preprocessing.uncomplete_data_obj.analyze_duplicates(plot=True)

Nothing is plotted because no duplicates exist.

In [None]:
baseline_pipeline.preprocessing.uncomplete_data_obj.remove_duplicates()

In [None]:
tree_pipeline.preprocessing.uncomplete_data_obj.remove_duplicates()
supportVectorsMachine_pipeline.preprocessing.uncomplete_data_obj.remove_duplicates()
naiveBayes_pipeline.preprocessing.uncomplete_data_obj.remove_duplicates()
stacking_pipeline.preprocessing.uncomplete_data_obj.remove_duplicates()

This is mandatory for all pipelines to ensure there are no duplicates.

### Outlier Detection

To improve the quality of the data, we detect and remove extreme outliers from the training set.
Outliers can distort model training, especially for models sensitive to scale or distance.
We detect them using the Interquartile Range (IQR) method and remove the affected rows.
This step makes the dataset cleaner, helps models converge better, and improves generalization without being influenced by extreme noise.

However, we have to be very careful with values specially in the lower range because there are a lot of 0 values.

**Candidate Detectors**

| Category | Method | Formula / Principle | Pros | Cons |
|----------|--------|---------------------|------|------|
| **Univariate** | **IQR rule** | $x < Q_1 - 1.5\,\mathrm{IQR}$ or $x > Q_3 + 1.5\,\mathrm{IQR}$ | Robust to skew; O(n) | Ignores correlation; per-feature only |
| | **Percentile clipping** | $x > P_{99}$ → clip to $P_{99}$ | Fast; preserves distribution shape | Capping, not removing; univariate only |
| | **Winsorization** |  | Prevents leakage and preserves df size | Risk towards the lower range of the dataset and losing valuable information |
| **Multivariate** | **Mahalanobis** | $D^2=(\mathbf x-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf x-\boldsymbol\mu)$ | Captures covariance | Needs Σ⁻¹; sensitive to MNAR |
| | **Isolation Forest** | Random splits → path length | Non-parametric; high-dim | Hyper-params; O(n log n) |
| | Local Outlier Factor (LOF) | Density ratio vs k-NN | Detects local anomalies | High cost; choice of *k* |
| **Density-based** | DBSCAN reachability | ε-neighbourhood | Finds clusters + noise | ε, minPts tuning |
| **Ensemble-robust** | Robust Covariance (MinCovDet) | DetMCD; breakdown 0.5 | Good for elliptic | O(n²) |


We choose **99th percentile clipping** because it's fast and doesn't imply any risk of losing valuable data from the lower bound of values.

The function will do two different things depending on the pipeline:

If `pipeline == "iqr"`, remove rows.

If `pipeline == "percentile"`, cap values from the 99th percentile.

For the baseline model pipeline, we don't really need outlier detection since it's just a basic version of the pipeline. However, we'll implement it either way.

In [None]:
f"Baseline: {baseline_pipeline.preprocessing.outliers_bounds_obj.get_outliers_df(detection_type='percentile', plot=False)}"
#baseline_pipeline.preprocessing.outliers_bounds_obj.smart_outlier_handler()

The `smart_outlier_handler` function dynamically assigns outlier handling strategies tailored to different feature families, offering flexibility and domain-aware preprocessing. However, the current implementation risks significant data loss by dropping rows with high outlier values, especially when applied across many features. 

While winsorization is a safer alternative that preserves rows by capping extreme values, standard symmetric capping can inadvertently distort important patterns—particularly at the lower bound, where zeros often carry meaningful information (e.g., absence of usage, activity, or allocation). 



For the lineal model pipeline (SVMs), we detect the outliers and either remove or cap extreme values before applying scaling. Linear models are sensitive to outliers because they rely on absolute magnitudes that's why for this pipeline it's essential to get it right but also not lose information in the lower bound of values..

For a tree-based pipeline we don't really need to handle outliers since trees are robust to outliers. However, we're gonna handle the 1st and 99th percentiles to avoid distortion

For the ensembled pipeline, we will take the same approach as tree-based: capping if extreme outliers severely skew feature distributions.

In [None]:
print(f"Tree: {tree_pipeline.preprocessing.outliers_bounds_obj.get_outliers_df(detection_type='percentile', plot=False)}")
print(f"SVMs: {supportVectorsMachine_pipeline.preprocessing.outliers_bounds_obj.get_outliers_df(detection_type='percentile', plot=False)}")
print(f"Naive Bayes: {naiveBayes_pipeline.preprocessing.outliers_bounds_obj.get_outliers_df(detection_type='percentile', plot=False)}")
print(f"Ensembled: {ensembled_pipeline.preprocessing.outliers_bounds_obj.get_outliers_df(detection_type='percentile', plot=False)}")
print(f"FNNs: {feedForwardNN_pipeline.preprocessing.outliers_bounds_obj.get_outliers_df(detection_type='percentile', plot=False)}")

### Bound Checking

To ensure data quality, we check that selected features stay within expected bounds. If the amount of values that are out of bounds is less than 0.5% of the data, then we will automathically delete those values inside of the `bound_checking()` function.

In addition, based on the nature of the dataset, we have created a dict with minimum values and an educated guess for maximum values of each individual feature or family of features.

In [None]:
baseline_pipeline.preprocessing.outliers_bounds_obj.bound_checking()

In [None]:
tree_pipeline.preprocessing.outliers_bounds_obj.bound_checking()
supportVectorsMachine_pipeline.preprocessing.outliers_bounds_obj.bound_checking()
naiveBayes_pipeline.preprocessing.outliers_bounds_obj.bound_checking()
ensembled_pipeline.preprocessing.outliers_bounds_obj.bound_checking()

**This step is mandatory for all pipelines.** If a latitude or longitude falls outside these bounds, it cannot physically exist.
This is not a modeling problem — it’s a data error.

### Feature Scaling

To address feature scaling:

- For the baseline pipeline, we do not apply any feature scaling; we train on the raw feature values to establish a control benchmark, allowing us to objectively measure the performance gains (or losses) afforded by each scaling strategy in subsequent experiments.

- Tree-based models do not require feature scaling. This is because decision trees split data based on thresholds (e.g., “is feature X > 4.5?”), which makes them invariant to the magnitude or distribution of feature values. Since ensemble methods like Random Forests and Gradient Boosting are composed of tree-based estimators, they also do not benefit from scaling.

- Gaussian Naive Bayes assumes normally distributed features but does not strictly require scaling. However, standardization can help improve performance by better aligning feature distributions with the model's assumptions. Therefore, we apply standardization in this case as a light enhancement.

- For Feedforward Neural Networks (FNNs), standardization is generally the preferred approach, as it centers features around zero and ensures unit variance, which accelerates convergence and improves stability. However, because FNNs are sensitive to outliers which can significantly distort gradients, RobustScaler is more appropriate when the data contains outliers, as it scales based on the median and interquartile range. **HOWEVER**... if this proves to not work, an alternative could be to go back to StandardScaler given that the NNs might need values in specific intervals that could be affected by the outlier handling from the RobustScaler. We will decide on a final method after testing.

For a similar reason to FNNs, we will also apply RobustScaler to SVMs.

With this in mind, this process will be mandatory for all pipelines except Tree-Based and Ensembled pipeline.

For scaling data, we will work only with the training set to avoid data leakage.

In [None]:
baseline_pipeline.dataset.X_train.head()

In the function created, if we decide to plot and check distributions before and after scaling, the function will select the first 10 features and plot them to avoid plotting all features.

In [None]:
baseline_pipeline.preprocessing.feature_scaling_obj.scale_features(scaler="robust", columnsToScale=baseline_pipeline.dataset.X_train.select_dtypes(include=["number"]).columns, plot=True)

The histograms show how `RobustScaler` compresses extreme values and brings the bulk of the data closer to a common scale. Despite the presence of outliers, the core distribution becomes more uniform and comparable across features — ideal for many ML models.

In [None]:
print(f"SVMs: {supportVectorsMachine_pipeline.preprocessing.feature_scaling_obj.scale_features(scaler='robust', columnsToScale=default_pipeline.dataset.X_train.select_dtypes(include=['number']).columns)}")
print(f"Naive Bayes: {naiveBayes_pipeline.preprocessing.feature_scaling_obj.scale_features(scaler='standard', columnsToScale=default_pipeline.dataset.X_train.select_dtypes(include=['number']).columns)}")
print(f"FNNs: {feedForwardNN_pipeline.preprocessing.feature_scaling_obj.scale_features(scaler='robust', columnsToScale=default_pipeline.dataset.X_train.select_dtypes(include=['number']).columns)}")

In [None]:
baseline_pipeline.dataset.X_train.head()

Since there is a very large number of fields, we need to avoid looking at the distribution and information about each feature. For this, we decided to create an automation that decides the appropriate scaling method -if necessary- for each feature.

This function `determine_scaling_method` takes into account outlier detection and skewness and based on that it decides to use a **robust scaler**, **normalize** or **none**.

**Note**: we use robust scaler rather than a standard scaler because robust scales with the median and IQR, which is less sensitive to outliers than standardizing is -computed with $\mu$ and $\sigma$. It will turn the median to 0 and turn the values in a field in values between -1 and 1 with some outliers.

Normalization Formula:
 
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Robust Scaler Formula:

$$x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}$$
 
Where:

$IQR(x)$ (Interquartile Range) $= Q3 - Q1$

### Class Imbalance

It's very important to check the distribution of the target variable before training any model. If there is a large class imbalance, the models could easily become biased towards the majority classes or get too little representation in the minority classes, particularly in precision and recall.

To illustrate this mathematically, we would calculate the relative frequency of each class: 

$$
P(y = c_i) = \frac{\text{Number of samples in class} c_i}{\text{Total number of samples}}
$$

The probability distribution tells us if the dataset appears to be biased towards one or more of the classes. We actually used raw counts, but we will use normalized frequencies, so the analysis can remain interpretable regardless of the size of the dataset.

To illustrate this, we have a bar chart that depicts this distribution. Bar charts provide an easier way to visualize distribution imbalance compared to tabular summaries or piecharts, especially if the discrepancies are smaller. At this stage of the analysis, we haven't undertaken any class balancing strategy; our goal is to diagnose whether there is any imbalance, and to give direction to the next step and consideration of resampling or adjusting class weights.

SMOTE is probably gonna be the best move in this case. We won't use methods like SMOTE NC since, as showed on EDA, we only have categorical variables and SMOTE NC is optimal for datasets with a wide mix of numerical and categorical variables. 

To address class imbalance:

- For the baseline pipeline, we do not apply any resampling or class‐weight adjustments; we train on the raw class distribution to establish a control benchmark, allowing us to objectively measure the uplift provided by each subsequent imbalance mitigation strategy.

- We apply SMOTE for models like SVMs, where generating synthetic examples helps improve decision boundaries.

- For tree-based models, which are naturally more robust to imbalance, we do not apply SMOTE. Instead, we handle imbalance directly in the model by setting `class_weight='balanced'` when creating the classifier.

- For Naive Bayes models, we avoid SMOTE and instead adjust class priors, as synthetic oversampling would violate their probabilistic assumptions.

- For ensemble and stacking methods, the approach depends on the base models: SMOTE is applied if the base learners are sensitive to class imbalance, and class weighting is used otherwise.
Thus, SMOTE is applied selectively through a function, and when not appropriate, imbalance is handled directly within the model definition.

- For Feedforward Neural Networks (FNNs), we **will not apply SMOTE**, as it would unnecessarily increase training time on a large dataset. Instead, we address class imbalance by using a **weighted loss function** later in the pipeline. This will allows the model to account for class distribution during training without generating synthetic samples. Neural networks are very good at adjusting to sample-level importance when guided by loss weights during backpropagation. Their training dynamically adapts to these weights.

In [None]:
supportVectorsMachine_pipeline.preprocessing.class_imbalance_obj.class_imbalance(plot=True)

SMOTE was only applied to SVMs.

## !> Pipeline example
Let us show an example of the pipeline divergence. We will create a divergence for the baseline pipeline. We will exempt it from being scaled. This specific example is not meant to be kept, but rather show the purpose of the corresponding functions.

In [50]:
baseline_pipeline = pipeline_manager.create_pipeline_divergence(category="baseline", pipelineName="logistic")

In [None]:
default_pipeline.dataset.X_train.describe()

In [52]:
default_pipeline.preprocessing.scale_features(scaler="robust",
                                      columnsToScale=default_pipeline.dataset.X_train.select_dtypes(include=["number"]).columns)

In [None]:
default_pipeline.dataset.X_train.describe()

In [None]:
default_pipeline.dataset.X_train.describe()

In [None]:
# This is not updated (as expected)
baseline_pipeline.dataset.X_train.describe()

<hr>

# 4. FEATURE ANALYSIS

Here we will take adjust the features that compose the learning inputs to our model. The correctness of this section is pivotal for proper learning by the model


## FEATURE ENGINEERING
- Domain-specific features
- Binning
- Interaction terms


In [57]:
featuresToEncode = ["Reboot"]

In [None]:
encoded_maps_perPipeline = pipeline_manager.all_pipelines_execute(methodName="feature_analysis.feature_transformation.get_categorical_features_encoded", verbose=True, features=featuresToEncode, encode_y=True)

Lets visualize the results of the encoding...

In [None]:
default_pipeline.dataset.X_train

In [None]:
default_pipeline.dataset.y_train

In [None]:
baseline_pipeline.dataset.X_train

In [None]:
baseline_pipeline.dataset.y_train

## FEATURE SELECTION
- Analyze correlation and low-variances


## Feature selection 
As explained before, we now proceed to carefully reduce the present high-dimensionality. High-dimensionality increases the chances of the model overfitting (capturing noise from irrelevant features; increasing variance and reducing bias), as well as introducing a significant computational overhead. We match this high-dimensionality with a highly filtering models_pipeline of feature selection.
The most extensive cut comes given at the first level, with the mutual information threshold-based cut. This metric captures the level of uncertainity between the feature and the target variable (cnt). In marked contrast with pearson coefficeint (correlation), it is able to model non-linear and linea relationships altoghther. 
This feature-selectin models_pipeline is compromised of five (3 as of the final models_pipeline) cuts:
 - mutual information
 - low variance
 - multicolinearity analysis
 - PCA
 - Boruta and/or Lasso

The different thresholds for each of this cuts have been altered over the different models_pipeline iterations. Specifically, in the bias-variance tradeoff (to be elaborated in further detail later), I increased all the thresholds in order to avoid por performance.

#### Feature to target variable mutual information

In [63]:
# models_pipeline.feature_analysis.feature_selection.manual_feature_selection.fit(type="MutualInformation", threshold=.2, delete_features=True, plot=True)

#### Eliminating low-variances features
Features with low variances provide little new information for the model to learn from, thus they could introduce statistical noise. Due to this reason, they should be elimanted from the dataset. The reason why we don't focus on high-variance is because this symbolizes outliers, which have been dealt with before in data preporcesing.
This function call eliminate low-variance (based on threshold) and all cosntant variables (regardless of threshold)

We will start off this analysis eliminating univariate features (i.e: the featuers with constant values)

In [64]:
# default_pipeline.feature_analysis.feature_selection.manual_feature_selection.fit(type="LowVariances", 
#                                                                          threshold=0.5, 
#                                                                          delete_features=True, 
#                                                                          plot=True)

As explained in scikit-learn's in (1.13.1 Feature selection)[https://scikit-learn.org/stable/modules/feature_selection.html], the variance threshold must be selected carefully. Too low may delete few variables, and too may be too restrictive, deleting more variables than it should.

#### Eliminating highly correlated feature 
Highly correlated variables (multicolinearity) are problem for models because they introduce a redundancy (features that contain significantly related similar information are not bringing much new insight into the model's input) to the model that can introduce significant variance. This is due to the fact that small changes in the data may make the coefficeints of the highly correlated variables **swing** more than it should

A note on the shape of this heatmap: due to the high amount of features, and the redundancy to measure the correlation between features (where corr(A,B) = corr(B,A)) we set 'np.triu(np.ones_like(corr, dtype=bool))' in the utilities functions in order to show only new non-redudant correlations between features, thus the right triangle shape.
Well, thats a lot to digest! We can see some solid red (high positive correlation) and medium-solid blue (some high negative correlation). 
Lets use a non-visual methodology to confirm our initial hypothesis. We will use variance inflaction factor (VIF) along with checking manually.
A brief explanation on VIF. 
Formula: 
$$
VIF_i = \frac{1}{1 - R_i^2}
$$
You regress (i.e: do linear regression) on the ith feature as target variable, and all other features as predictors. You then compute the coefficient of determination as a way to measure how well the predictors fit the target variable. VIF values ranges from [1, inf], where lower bound signifies little multicolinearity (R^2 = 0), and upper bound occurs when R^2=1 (perfect multicolineairty). 5 is considered a standard threshold for VIF as it symbolized an 80% R^2.
Once you ve obtained the results of VIF, you need to delete the variable with the highest VIF, and recompute VIF until there is no multicolinearity. For example, say there are three features that are 4 features, 3 of them being linear combinations of each other. You would delete the variables with VIF until there is no VIF (when only one of those linear combinations remains). You cant delete all n-1 variables where n are the amount of variables with exceeding (with respect to the chosen threshold), because you may be deleting one that has high VIF due to another feature.

A relevant note on why one-hot-encoding must be done dropping the first one:
If we were to not remove one of the labels of one hot encoding, you would be able to predict which level of the categorical variable based on all other ones (there is only degree of freedom for categories levels; they are a linear combination). You would essentially see inf VIF in that area and delete it in this section. 
The VIF has to be computed every time we delete a feature due to high multicolinearity. Lets do that

In [65]:
#models_pipeline.feature_analysis.feature_selection.manual_feature_selection.fit(type="VIF", threshold=10, delete_features=True, plot=False)

### PCA
PCA can still bring some more value to feature selection. It will take into account interaction effects by itself and find the principals that capture as much variance as we specify. Thus, its inclusion in the feature selection models_pipeline.
It has been excldued due to underperformance.

In [66]:
#models_pipeline.feature_analysis.feature_selection.manual_feature_selection.fit(type="PCA", threshold=.95, delete_features=False, plot=False)

### 1.3 Automatic Feature Selection
#### L1 regularization
Because of L1 regularization being able to set weights 0, we can briefly train our logistic regression model with such regulartization and see which features it uses. The reason why it sets 0 to insignificant feature is because the objective function is not only the MSE but added a component of the wieght magnitude (which is trying to minimize)

In [67]:
# excluded_features, predictive_features, coefficients = models_pipeline.feature_analysis.feature_selection.automatic_feature_selection.fit(type="L1",
#                                                                                                                                     max_iter=1000, 
#                                                                                                                                     print_results=True, delete_features=False)
# excluded_features, len(excluded_features)

#### BORUTA
Boruta is a more powerful feature selection method (thus we use it as a reference for variable deletion). It is more powerful that L1 becuase it compares the importance of features to shuffled versions, ensuring robust feature selection


In [68]:
# excluded_features, predictive_features = models_pipeline.feature_analysis.feature_selection.automatic_feature_selection.fit(type="Boruta", max_iter=10, print_results=True, delete_features=False)
# excluded_features, len(excluded_features)

Awesome, lets move onto the actual modelling part!

<hr>

# MODELLING

## Fitting the model

QUESTIONS FOR THIS SECTION
- Can the ROC curve be used for multiclass?
- Can a unsupervised learning algorithm (e.g: KNN) be used for this problem even tough its nature is to be supervised?
- Does val to train deltas are meaningful here?

TO BE DONE FOR THIS SECTION
- Multiple models as classifiers
- Learn more about each alogrithms' paremters
- Can KNN be used here?

## Random Forest & Decision Trees
Instead of the originally planned logistic regression, we will be using an ensembled model first: random forest (collection of week decision trees). This model is not only likely to outperform the original choice because its nature to handle multiclass better, but also does this several orders of magnitude faster. We will add its not ensembled version too, along with gradient boosted machine
Note we are not using not-by-default multiclass classifiers (e.g: logistic regressions, svms)

## Non-optimized fitting
We first fit all these models with the default paramterers. This is done to constrat more starkly the difference between pre and post tuning.

In [69]:
# Ensembled models
gradientBoostingModel = GradientBoostingClassifier()
randomForestModel = RandomForestClassifier()

# Tree-based models
decisionTreeModel = DecisionTreeClassifier()

# Linear models
supportVectorModel = SVC()

# Baseline
logisticRegressionModel = LogisticRegression()

#### MODEL PERFORMANCE, HYPOTHESIS
- Trees (e.g: random forest, decision trees):
  - Time to fit:
    - Ensemble models (random forest) take more time to train due to the fact that they are larger and heavier than their non-ensembled version. 
  - Correctness:
     - High
- Binary-classifiers by default models (e.g: SMV, logistic regression)
  - Time to fit:
    - compute C models for all C number of classes. Each trained to detect a single class, then when we make predictions, we select the ones that has the highest probability in its predictions ("the most confident in its prediction"). This strategy is called One-vs-Rest, note however, a single logistic regression may be used if we used the softmax objective function (instead of log-odds) (it still heavy computationally, tough). They will be sloder than ensemble models
  - Correctness:
     - Very low if the problem is non-linear which it is for the SVM and logistic regression

In [None]:
pipeline_manager.pipelines

Pipelines always need to diverge from training onwards. Otherwise they will have each other results (which does not follow the isolation pattern we have programmed this with)

In [71]:
ensembled_pipeline = pipeline_manager.create_pipeline_divergence(category="models", pipelineName="ensembled")
linear_pipeline = pipeline_manager.create_pipeline_divergence(category="models", pipelineName="linear")

In [None]:
pipeline_manager.pipelines

In [73]:
# Ensembled models
ensembled_pipeline.model_selection.add_model("Gradient Boosting", gradientBoostingModel)
ensembled_pipeline.model_selection.add_model("Random Forest", randomForestModel)

# Tree-based models
tree_pipeline.model_selection.add_model("Decision Tree", decisionTreeModel)

# Linear models
linear_pipeline.model_selection.add_model("SVM", supportVectorModel) 

# Baseline
baseline_pipeline.model_selection.add_model("Logistic Regression", logisticRegressionModel)


While we debug, lets exlclude some models we dont need for now (they are very slow to train)

In [74]:
# Ensembled models
ensembled_pipeline.model_selection.models_to_exclude = ["Gradient Boosting"]

# Tree-based models
tree_pipeline.model_selection.models_to_exclude = []

# Linear models
linear_pipeline.model_selection.models_to_exclude = ["SVM"]

# Baseline
baseline_pipeline.model_selection.models_to_exclude = ["Logistic Regression"]


In [None]:
pipeline_manager.all_pipelines_execute(methodName="model_selection.fit_models", current_phase="pre")

Let's make sure the predictions vary between holdout sets

<i> the aforeshown diagram was originally done with the sole intention to debug an error that made predictions be the same across sets. Insights may not be much meaningful after correctio, but it is worth keeping until the end of the notebook development </i>

### PREDICTIONS RESULTS 
Before we get into the actual results, lets elaborate briefly on all the metrics that we are using to asses our classifiers:

- Accuracy => total correctly predicted elemetnts (sigma over the moments we predicted x_i and it was actually x_i / number_of_samples)
$$
\text{Accuracy} = \frac{\sum_{i} \mathbf{1}(\hat{y}_i = y_i)}{N}
$$
- Precision => out of how many predicted for that class were actually from that class (predicted for class x when it was x/ predicted for class x when it was x + predicted for class x when it was NOT x)
$$
\text{Precision}_x = \frac{\text{TP}_x}{\text{TP}_x + \text{FP}_x}
$$
- Recall => out of all cases that were positive how many got predicted correctly?
$$
\text{Recall}_x = \frac{\text{TP}_x}{\text{TP}_x + \text{FN}_x}
$$
- F1-score => harmonic mean of precision and recall (balances both metrics, heavily penalize spreadness between ratios that are being averaged out)
$$
\text{F1}_x = 2 \times \frac{\text{Precision}_x \times \text{Recall}_x}{\text{Precision}_x + \text{Recall}_x}
$$
- Support => number of actual occurences of class in the dataset
- macro avg => averages given metric across all classes
$$
\text{Macro Avg} = \frac{1}{C} \sum_{i=1}^{C} M_i
$$
- weighted avg => averages with weights per class occurence (considers frequency of class in average computation)
$$
\text{Weighted Avg} = \sum_{i=1}^{C} \frac{\text{Support}_i}{\text{Total Instances}} \times M_i
$$

In [76]:
comments = "wiLL THIS work?"

In [None]:
model_results = pipeline_manager.all_pipelines_execute(methodName="model_selection.evaluate_models", comments=comments, current_phase="pre")

## Performance Evaluation (pre-tuning)
Below are shown all the metrics we can compare our plots to:

In [None]:
model_results["models"]["ensembled"].columns

In [None]:
pipeline_manager.pipelines_analysis.plot_results_metrics(
                                                         metrics=["accuracy_val", "precision_val", "recall_val", "f1-score_val", "timeToFit", "timeToPredict"], 
                                                         phase="pre")

## Feature importances (pre-tuning)

In [None]:
importances_dfs = pipeline_manager.pipelines_analysis.pot_feature_importance(phase="pre")
importances_dfs

## Residual analysis (pre-tuning)
Gotta add the mappers naming to the plot


This contains the confusion matrices (weighted and not weighted). 
It also returns the specific elements that were erroneously classified and does cluster analysis

### Confusion Matrix

In [None]:
pipeline_manager.pipelines_analysis.plot_confusion_matrix(phase="pre")

PLOTS:
- Early signs of overfitting
- Plot errors 

# Hyperparameter Optimization