<h1>Integration of MLOps with Signature-based and Image-based Malware Detection Systems</h1>

<h2>1. Problem Description</h2>

Malware detection is a critical aspect of cybersecurity, and signature-based and
image-based approaches are two prominent methods. Signature-based detection matches files
against predefined patterns (signatures) of known malware, while image-based detection
typically renders a sample's raw bytes as an image and analyzes its visual characteristics.
Combined with machine learning algorithms, these approaches automate the identification
and classification of malicious software.
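
To make the two approaches concrete, here is a minimal sketch rather than the pipeline used in this notebook. It assumes a signature is simply an MD5 digest and that an image is built by mapping one byte to one grayscale pixel. The names `KNOWN_BAD_MD5`, `matches_signature`, and `bytes_to_image`, the stand-in `sample_bytes`, and the row width of 32 are all hypothetical choices made for illustration.

```python
import hashlib
import numpy as np

# Stand-in for the raw bytes of a file (hypothetical, not a real sample).
sample_bytes = bytes(range(256)) * 4

# Hypothetical signature database: MD5 digests of known-malicious files.
KNOWN_BAD_MD5 = {hashlib.md5(sample_bytes).hexdigest()}

def matches_signature(raw: bytes) -> bool:
    """Signature-based check: exact digest match against known malware."""
    return hashlib.md5(raw).hexdigest() in KNOWN_BAD_MD5

def bytes_to_image(raw: bytes, width: int = 32) -> np.ndarray:
    """Image-based view: one byte per grayscale pixel, fixed row width."""
    arr = np.frombuffer(raw, dtype=np.uint8)
    rows = len(arr) // width  # drop a trailing partial row, if any
    return arr[: rows * width].reshape(rows, width)

image = bytes_to_image(sample_bytes)
```

An exact-digest lookup only catches files that are already known, whereas the 2-D array from `bytes_to_image` could be passed to an image classifier that may generalize to unseen variants.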

<h2>2. Import Relevant Libraries/Packages</h2>

In [10]:
# Importing pandas for data manipulation and analysis
import pandas as pd

# Importing numpy for numerical operations and array handling
import numpy as np

# Importing CountVectorizer for converting text data into numerical features
from sklearn.feature_extraction.text import CountVectorizer

# Importing StandardScaler for feature scaling (standardization)
from sklearn.preprocessing import StandardScaler

# Importing hstack and coo_matrix for handling sparse matrices
from scipy.sparse import hstack, coo_matrix

# Importing resample for resampling datasets (e.g., for balancing classes)
from sklearn.utils import resample

# Importing PrettyTable for creating nicely formatted tables
from prettytable import PrettyTable

# Importing SMOTE for handling class imbalance through synthetic oversampling
from imblearn.over_sampling import SMOTE

# Importing train_test_split for splitting data into training and testing sets
from sklearn.model_selection import train_test_split

# Importing LogisticRegression for performing logistic regression
from sklearn.linear_model import LogisticRegression

# Importing various metrics for evaluating model performance
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Importing GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Importing SVC for Support Vector Classification
from sklearn.svm import SVC

# Importing RandomForestClassifier for Random Forest Classification
from sklearn.ensemble import RandomForestClassifier

# Importing XGBClassifier for XGBoost Classification
from xgboost import XGBClassifier

# Importing matplotlib for plotting
import matplotlib.pyplot as plt

# Importing seaborn for statistical data visualization
import seaborn as sns

# Suppressing warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')


<h2>3. Exploratory Data Analysis</h2>

<h3>3.1 Some High Level Information</h3>

In [14]:
# Loading the Data
data = pd.read_csv("MalwareData.csv", sep="|")
data.head(5)

Unnamed: 0,Name,md5,Machine,SizeOfOptionalHeader,Characteristics,MajorLinkerVersion,MinorLinkerVersion,SizeOfCode,SizeOfInitializedData,SizeOfUninitializedData,...,ResourcesNb,ResourcesMeanEntropy,ResourcesMinEntropy,ResourcesMaxEntropy,ResourcesMeanSize,ResourcesMinSize,ResourcesMaxSize,LoadConfigurationSize,VersionInformationSize,legitimate
0,memtest.exe,631ea355665f28d4707448e442fbf5b8,332,224,258,9,0,361984,115712,0,...,4,3.262823,2.568844,3.537939,8797.0,216,18032,0,16,1
1,ose.exe,9d10f99a6712e28f8acd5641e3a7ea6b,332,224,3330,9,0,130560,19968,0,...,2,4.250461,3.420744,5.080177,837.0,518,1156,72,18,1
2,setup.exe,4d92f518527353c0db88a70fddcfd390,332,224,3330,9,0,517120,621568,0,...,11,4.426324,2.846449,5.271813,31102.272727,104,270376,72,18,1
3,DW20.EXE,a41e524f8d45f0074fd07805ff0c9b12,332,224,258,9,0,585728,369152,0,...,10,4.364291,2.669314,6.40072,1457.0,90,4264,72,18,1
4,dwtrig20.exe,c87e561258f2f8650cef999bf643a731,332,224,258,9,0,294912,247296,0,...,2,4.3061,3.421598,5.190603,1074.5,849,1300,72,18,1


In [18]:
# Last 5 rows of the dataset
data.tail()

Unnamed: 0,Name,md5,Machine,SizeOfOptionalHeader,Characteristics,MajorLinkerVersion,MinorLinkerVersion,SizeOfCode,SizeOfInitializedData,SizeOfUninitializedData,...,ResourcesNb,ResourcesMeanEntropy,ResourcesMinEntropy,ResourcesMaxEntropy,ResourcesMeanSize,ResourcesMinSize,ResourcesMaxSize,LoadConfigurationSize,VersionInformationSize,legitimate
138042,VirusShare_8e292b418568d6e7b87f2a32aee7074b,8e292b418568d6e7b87f2a32aee7074b,332,224,258,11,0,205824,223744,0,...,7,4.122736,1.37026,7.677091,14900.714286,16,81654,72,0,0
138043,VirusShare_260d9e2258aed4c8a3bbd703ec895822,260d9e2258aed4c8a3bbd703ec895822,332,224,33167,2,25,37888,185344,0,...,26,3.377663,2.031619,5.050074,6905.846154,44,67624,0,15,0
138044,VirusShare_8d088a51b7d225c9f5d11d239791ec3f,8d088a51b7d225c9f5d11d239791ec3f,332,224,258,10,0,118272,380416,0,...,22,6.825406,2.617026,7.990487,14981.909091,48,22648,72,14,0
138045,VirusShare_4286dccf67ca220fe67635388229a9f3,4286dccf67ca220fe67635388229a9f3,332,224,33166,2,25,49152,16896,0,...,10,3.421627,2.060964,4.739744,601.6,16,2216,0,0,0
138046,VirusShare_d7648eae45f09b3adb75127f43be6d11,d7648eae45f09b3adb75127f43be6d11,332,224,258,11,0,111616,468480,0,...,4,4.407252,1.980482,6.115374,96625.0,20,318464,72,0,0


In [15]:
# High-level statistics of the numerical features
data.describe()

Unnamed: 0,Machine,SizeOfOptionalHeader,Characteristics,MajorLinkerVersion,MinorLinkerVersion,SizeOfCode,SizeOfInitializedData,SizeOfUninitializedData,AddressOfEntryPoint,BaseOfCode,...,ResourcesNb,ResourcesMeanEntropy,ResourcesMinEntropy,ResourcesMaxEntropy,ResourcesMeanSize,ResourcesMinSize,ResourcesMaxSize,LoadConfigurationSize,VersionInformationSize,legitimate
count,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,...,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0
mean,4259.069274,225.845632,4444.145994,8.619774,3.819286,242595.6,450486.7,100952.5,171956.1,57798.45,...,22.0507,4.000127,2.434541,5.52161,55450.93,18180.82,246590.3,465675.0,12.363115,0.29934
std,10880.347245,5.121399,8186.782524,4.088757,11.862675,5754485.0,21015990.0,16352880.0,3430553.0,5527658.0,...,136.494244,1.112981,0.815577,1.597403,7799163.0,6502369.0,21248600.0,26089870.0,6.798878,0.457971
min,332.0,224.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,332.0,224.0,258.0,8.0,0.0,30208.0,24576.0,0.0,12721.0,4096.0,...,5.0,3.458505,2.178748,4.828706,956.0,48.0,2216.0,0.0,13.0,0.0
50%,332.0,224.0,258.0,9.0,0.0,113664.0,263168.0,0.0,52883.0,4096.0,...,6.0,3.729824,2.458492,5.317552,2708.154,48.0,9640.0,72.0,15.0,0.0
75%,332.0,224.0,8226.0,10.0,0.0,120320.0,385024.0,0.0,61578.0,4096.0,...,13.0,4.233051,2.696833,6.502239,6558.429,132.0,23780.0,72.0,16.0,1.0
max,34404.0,352.0,49551.0,255.0,255.0,1818587000.0,4294966000.0,4294941000.0,1074484000.0,2028711000.0,...,7694.0,7.999723,7.999723,8.0,2415919000.0,2415919000.0,4294903000.0,4294967000.0,26.0,1.0
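
The minimum and maximum of the `Machine` column (332 and 34404) are easier to read in hexadecimal: they correspond to the PE/COFF machine-type codes `0x014C` (x86) and `0x8664` (x64). The sketch below decodes such codes; the `MACHINE_TYPES` table and the `describe_machine` helper are hypothetical names, and the table covers only these two codes, so any other value falls back to a generic label.

```python
# PE/COFF machine-type codes as stored in the file header.
# Only the two codes seen as min/max in the summary above are listed.
MACHINE_TYPES = {
    0x014C: "x86 (IMAGE_FILE_MACHINE_I386)",
    0x8664: "x64 (IMAGE_FILE_MACHINE_AMD64)",
}

def describe_machine(code: int) -> str:
    """Return a readable label for a Machine value, or a hex fallback."""
    return MACHINE_TYPES.get(code, f"other (0x{code:04X})")

labels = [describe_machine(code) for code in (332, 34404)]
```

Because these values are category codes rather than magnitudes, the mean and standard deviation reported for `Machine` carry little meaning on their own.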


In [16]:
# Take a look at dataframe shape
print("Shape of the dataframe: ",data.shape)

Shape of the dataframe:  (138047, 57)


In [17]:
# Names of the columns
data.columns

Index(['Name', 'md5', 'Machine', 'SizeOfOptionalHeader', 'Characteristics',
       'MajorLinkerVersion', 'MinorLinkerVersion', 'SizeOfCode',
       'SizeOfInitializedData', 'SizeOfUninitializedData',
       'AddressOfEntryPoint', 'BaseOfCode', 'BaseOfData', 'ImageBase',
       'SectionAlignment', 'FileAlignment', 'MajorOperatingSystemVersion',
       'MinorOperatingSystemVersion', 'MajorImageVersion', 'MinorImageVersion',
       'MajorSubsystemVersion', 'MinorSubsystemVersion', 'SizeOfImage',
       'SizeOfHeaders', 'CheckSum', 'Subsystem', 'DllCharacteristics',
       'SizeOfStackReserve', 'SizeOfStackCommit', 'SizeOfHeapReserve',
       'SizeOfHeapCommit', 'LoaderFlags', 'NumberOfRvaAndSizes', 'SectionsNb',
       'SectionsMeanEntropy', 'SectionsMinEntropy', 'SectionsMaxEntropy',
       'SectionsMeanRawsize', 'SectionsMinRawsize', 'SectionMaxRawsize',
       'SectionsMeanVirtualsize', 'SectionsMinVirtualsize',
       'SectionMaxVirtualsize', 'ImportsNbDLL', 'ImportsNb',
       'Impor

In [19]:
# Summary statistics for all columns, including the non-numeric ones (Name, md5)
data.describe(include="all")

Unnamed: 0,Name,md5,Machine,SizeOfOptionalHeader,Characteristics,MajorLinkerVersion,MinorLinkerVersion,SizeOfCode,SizeOfInitializedData,SizeOfUninitializedData,...,ResourcesNb,ResourcesMeanEntropy,ResourcesMinEntropy,ResourcesMaxEntropy,ResourcesMeanSize,ResourcesMinSize,ResourcesMaxSize,LoadConfigurationSize,VersionInformationSize,legitimate
count,138047,138047,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,...,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0,138047.0
unique,107488,138047,,,,,,,,,...,,,,,,,,,,
top,mshtml.dll,631ea355665f28d4707448e442fbf5b8,,,,,,,,,...,,,,,,,,,,
freq,187,1,,,,,,,,,...,,,,,,,,,,
mean,,,4259.069274,225.845632,4444.145994,8.619774,3.819286,242595.6,450486.7,100952.5,...,22.0507,4.000127,2.434541,5.52161,55450.93,18180.82,246590.3,465675.0,12.363115,0.29934
std,,,10880.347245,5.121399,8186.782524,4.088757,11.862675,5754485.0,21015990.0,16352880.0,...,136.494244,1.112981,0.815577,1.597403,7799163.0,6502369.0,21248600.0,26089870.0,6.798878,0.457971
min,,,332.0,224.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,,332.0,224.0,258.0,8.0,0.0,30208.0,24576.0,0.0,...,5.0,3.458505,2.178748,4.828706,956.0,48.0,2216.0,0.0,13.0,0.0
50%,,,332.0,224.0,258.0,9.0,0.0,113664.0,263168.0,0.0,...,6.0,3.729824,2.458492,5.317552,2708.154,48.0,9640.0,72.0,15.0,0.0
75%,,,332.0,224.0,8226.0,10.0,0.0,120320.0,385024.0,0.0,...,13.0,4.233051,2.696833,6.502239,6558.429,132.0,23780.0,72.0,16.0,1.0
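
The summary shows 138047 unique `md5` values but only 107488 unique `Name` values, so `md5` identifies each row while file names repeat. A minimal sketch of how this could be checked, and of one common follow-up (separating identifier columns from features), is given below on a toy frame. The `toy` data is invented for illustration, and dropping `Name` and `md5` is an assumption about later preprocessing, not a step taken in this section.

```python
import pandas as pd

# Toy frame mimicking the dataset layout: two identifier columns,
# one numeric feature, and the target column.
toy = pd.DataFrame({
    "Name": ["a.dll", "a.dll", "b.exe"],
    "md5": ["h1", "h2", "h3"],
    "SizeOfCode": [10, 20, 30],
    "legitimate": [1, 1, 0],
})

# Every row has its own hash, while file names may repeat.
md5_is_unique = toy["md5"].is_unique
repeated_names = int(toy["Name"].duplicated().sum())

# Identifiers carry no generalizable signal, so keep them out of X.
X = toy.drop(columns=["Name", "md5", "legitimate"])
y = toy["legitimate"]
```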


In [20]:
# info about the whole dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138047 entries, 0 to 138046
Data columns (total 57 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Name                         138047 non-null  object 
 1   md5                          138047 non-null  object 
 2   Machine                      138047 non-null  int64  
 3   SizeOfOptionalHeader         138047 non-null  int64  
 4   Characteristics              138047 non-null  int64  
 5   MajorLinkerVersion           138047 non-null  int64  
 6   MinorLinkerVersion           138047 non-null  int64  
 7   SizeOfCode                   138047 non-null  int64  
 8   SizeOfInitializedData        138047 non-null  int64  
 9   SizeOfUninitializedData      138047 non-null  int64  
 10  AddressOfEntryPoint          138047 non-null  int64  
 11  BaseOfCode                   138047 non-null  int64  
 12  BaseOfData                   138047 non-null  int64  
 13 

In [21]:
# Count of malware (0) and benign (1) files in the dataset
data["legitimate"].value_counts()

legitimate
0    96724
1    41323
Name: count, dtype: int64
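
The counts above can be turned into class shares and an imbalance ratio, which helps in judging whether resampling is worth considering. The sketch below hard-codes the two counts from the output instead of reading the CSV; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 96724, 1: 41323}, name="count")

# Fraction of samples in each class (0 = malware, 1 = benign).
shares = counts / counts.sum()

# Majority-to-minority ratio: how many malware rows per benign row.
imbalance_ratio = counts.max() / counts.min()
```

Here malware makes up roughly 70% of the rows and benign files about 30%, a ratio of about 2.3 to 1.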

In [23]:
# Pie chart of the class distribution (autopct is only valid for pie charts)
data["legitimate"].value_counts().plot(kind="pie", autopct="%1.1f%%")
plt.show()