# Feature Selection w/ Android Malware

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
df_origin = pd.read_csv('Android_Malware.csv', low_memory=False)

In [3]:
df = df_origin.copy() # Just in case ;)

In [4]:
# Calculate the variance of each column
numdf = df.select_dtypes(include="number")
variance_per_column = numdf.var()

# Set a threshold for minimum variance
threshold = 0.01  # You can adjust this threshold based on your preference

# Identify columns with variance below the threshold
low_variance_columns = variance_per_column[variance_per_column < threshold].index.tolist()

print(low_variance_columns)
# Consider removing these columns as well
# [' ECE Flag Count', ' Fwd Avg Packets/Bulk', ' Fwd Avg Bulk Rate', 
#  ' Bwd Avg Bytes/Bulk', ' Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate']

[' ECE Flag Count', ' Fwd Avg Packets/Bulk', ' Fwd Avg Bulk Rate', ' Bwd Avg Bytes/Bulk', ' Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate']
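
Raw variance depends on each column's units, so a fixed 0.01 cutoff treats a flag count and a byte rate very differently. One option, sketched here on made-up data, is to min-max scale each column before comparing variances. The helper name `low_variance_after_scaling` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def low_variance_after_scaling(frame, threshold=0.01):
    """Return numeric columns whose variance is below `threshold` after min-max scaling."""
    numeric = frame.select_dtypes(include="number")
    span = numeric.max() - numeric.min()
    # Constant columns have zero span; use 1 instead to avoid dividing by zero.
    # Their scaled values are all 0, so they are still flagged.
    scaled = (numeric - numeric.min()) / span.replace(0, 1)
    variances = scaled.var()
    return variances[variances < threshold].index.tolist()

# Toy data (hypothetical column names)
toy = pd.DataFrame({
    "constant": [0, 0, 0, 0, 0, 0],
    "tiny_scale": [0.001, 0.002, 0.001, 0.003, 0.002, 0.001],
    "rare_flag": [0, 0, 0, 0, 0, 1],
    "bytes": [10, 2000, 350, 9000, 40, 700],
})

raw_var = toy.var()
raw_flagged = raw_var[raw_var < 0.01].index.tolist()  # same rule as the cell above
scaled_flagged = low_variance_after_scaling(toy)

print("raw:", raw_flagged)
print("scaled:", scaled_flagged)
```

On this toy frame the raw rule also flags `tiny_scale`, which varies a lot relative to its own range, while the scaled rule flags only the truly constant column.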



In [5]:
correlation_matrix = numdf.corr()

# Get the upper triangle of the correlation matrix (excluding the diagonal)
upper_triangle = correlation_matrix.where(
    np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool)
)

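# Find columns whose correlation with an earlier column exceeds 0.9.
# Only positive correlations are caught here; strong negative ones are not.
highly_correlated_columns = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.9)]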

# Print or further analyze the highly correlated columns
print("Highly correlated columns to consider removing:", highly_correlated_columns)


Highly correlated columns to consider removing: [' Total Backward Packets', ' Fwd Packet Length Max', ' Flow IAT Max', ' Flow IAT Min', ' Fwd IAT Mean', ' Fwd IAT Max', ' Fwd IAT Min', ' Bwd IAT Max', 'Fwd PSH Flags', ' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags', ' Bwd Header Length', 'Fwd Packets/s', ' Max Packet Length', ' Packet Length Mean', ' Packet Length Std', ' Packet Length Variance', 'FIN Flag Count', ' SYN Flag Count', ' RST Flag Count', ' Average Packet Size', ' Avg Fwd Segment Size', ' Avg Bwd Segment Size', ' Fwd Header Length.1', 'Subflow Fwd Packets', ' Subflow Fwd Bytes', ' Subflow Bwd Packets', ' Subflow Bwd Bytes', 'Init_Win_bytes_forward', ' act_data_pkt_fwd', ' Active Max', ' Active Min', ' Idle Max', ' Idle Min']


'Total Backward Packets' and 'Total Fwd Packets': Both count packets in the same flow, so a high correlation is expected. Keeping one of them is enough.

'Fwd Packet Length Max' and 'Packet Length Mean': These are highly correlated and therefore redundant. Keep whichever is more relevant to your analysis.

'Fwd IAT Mean', 'Fwd IAT Max', and 'Fwd IAT Min': These inter-arrival time features are highly correlated. One representative feature is enough.

'Bwd IAT Max' and 'Fwd IAT Max': These are highly correlated and therefore redundant. Keep whichever fits your analysis better.

'Fwd PSH Flags' and 'Bwd PSH Flags': These are highly correlated. Keep whichever is more relevant to your analysis.

'Fwd Packets/s' and 'Flow IAT Max': These are highly correlated. Keep whichever fits your analysis better.

Note that the scan above reports only the flagged column, not the column it correlates with, so confirm each pairing against `correlation_matrix` before dropping anything.
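
Since the cell above lists only column names, a small helper can recover the actual pairs and their coefficients. The sketch below uses the absolute value so that strong negative correlations are reported too. The function name `correlated_pairs` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """List (col_a, col_b, r) for numeric column pairs with |r| above `threshold`."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(mask).stack()
    # Masked cells are NaN; NaN > threshold is False, so they never pass the filter.
    strong = stacked[stacked.abs() > threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in strong.items()]

# Toy data: y rises with x, z falls with x, w is unrelated
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
    "w": [1, -1, 2, 0, 1],
})

pairs = correlated_pairs(toy)
for a, b, r in pairs:
    print(f"{a} <-> {b}: r = {r}")
```

On the real data this would be called as `correlated_pairs(numdf)`.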

### From EDA here are some potential columns that we can drop

In [4]:
# Remove leading spaces from column names
df.columns = df.columns.str.strip()

In [5]:
# Define the pairs of columns for differences
packet_pairs = [('Total Fwd Packets', 'Total Backward Packets'),
                ('Total Length of Fwd Packets', 'Total Length of Bwd Packets'),
                ('Fwd IAT Total', 'Bwd IAT Total'),
                ('Fwd PSH Flags', 'Bwd PSH Flags'),
                ('Fwd URG Flags', 'Bwd URG Flags'),
                ('Fwd Header Length', 'Bwd Header Length'),
                ('Fwd Packets/s', 'Bwd Packets/s'),
                ('Avg Fwd Segment Size', 'Avg Bwd Segment Size'),
                ('Fwd Avg Bytes/Bulk', 'Bwd Avg Bytes/Bulk'),
                ('Fwd Avg Packets/Bulk', 'Bwd Avg Packets/Bulk'),
                ('Fwd Avg Bulk Rate', 'Bwd Avg Bulk Rate'),
                ('Subflow Fwd Packets', 'Subflow Bwd Packets'),
                ('Subflow Fwd Bytes', 'Subflow Bwd Bytes'),
                ('Init_Win_bytes_forward', 'Init_Win_bytes_backward')]

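# Convert columns to numeric before calculating differences.
# Caution: applied to the whole frame, this turns every non-numeric column
# (Flow ID, Source IP, Destination IP, Timestamp, and any string-valued label)
# into NaN, as the output below shows.
df = df.apply(pd.to_numeric, errors='coerce')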

# Calculate differences for each pair and create separate columns
for fwd_col, bwd_col in packet_pairs:
    col_name_diff = f'{fwd_col} - {bwd_col}'
    df[col_name_diff] = df[fwd_col] - df[bwd_col]

# Display the resulting DataFrame
print(df)

        Unnamed: 0  Flow ID  Source IP  Source Port  Destination IP  \
0                0      NaN        NaN        50004             NaN   
1                1      NaN        NaN        35455             NaN   
2                2      NaN        NaN        51775             NaN   
3                3      NaN        NaN        51775             NaN   
4                4      NaN        NaN        51776             NaN   
...            ...      ...        ...          ...             ...   
355625         405      NaN        NaN           80             NaN   
355626         406      NaN        NaN         7632             NaN   
355627         407      NaN        NaN        45970             NaN   
355628         408      NaN        NaN        51982             NaN   
355629         409      NaN        NaN         9320             NaN   

        Destination Port  Protocol  Timestamp  Flow Duration  \
0                  443.0       6.0        NaN          37027   
1                  
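
The NaN columns in the output above come from coercing the whole frame. A narrower approach, sketched here on a toy frame, converts only the columns that take part in a difference and leaves everything else (including the label) untouched. The class names in `Label` are hypothetical placeholders.

```python
import pandas as pd

# Toy stand-in for the real frame; column names mirror one of the pairs above.
toy = pd.DataFrame({
    "Total Fwd Packets": ["3", "5", "bad"],      # strings, one unparseable
    "Total Backward Packets": [1, 2, 4],
    "Label": ["Benign", "Adware", "Benign"],     # hypothetical class names
})
pairs = [("Total Fwd Packets", "Total Backward Packets")]

# Coerce only the columns used in a difference.
pair_cols = sorted({col for pair in pairs for col in pair})
toy[pair_cols] = toy[pair_cols].apply(pd.to_numeric, errors="coerce")

for fwd_col, bwd_col in pairs:
    toy[f"{fwd_col} - {bwd_col}"] = toy[fwd_col] - toy[bwd_col]

print(toy)
```

Only the unparseable entry becomes NaN; the `Label` column keeps its original values.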

In [6]:
# List of all columns to drop
columns_to_drop = [
    'Unnamed: 0', 'Flow ID', 'Source IP', 'Source Port', 'Destination IP', 'Destination Port',
    'Protocol', 'Timestamp', 'Total Fwd Packets', 'Total Backward Packets',
    'Total Length of Fwd Packets', 'Total Length of Bwd Packets',
    'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Fwd Packet Length Std',
    'Bwd Packet Length Max', 'Bwd Packet Length Min', 'Bwd Packet Length Std',
    'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min',
    'Fwd IAT Total', 'Fwd IAT Std', 'Fwd IAT Max', 'Fwd IAT Min',
    'Bwd IAT Total', 'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min',
    'Fwd PSH Flags', 'Bwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags',
    'Fwd Header Length', 'Bwd Header Length', 'Fwd Packets/s', 'Bwd Packets/s',
    'Min Packet Length', 'Max Packet Length', 'Packet Length Std', 'Packet Length Variance',
    'ECE Flag Count', 'Avg Fwd Segment Size', 'Avg Bwd Segment Size',
    'Fwd Header Length.1', 'Fwd Avg Bytes/Bulk', 'Fwd Avg Packets/Bulk', 'Fwd Avg Bulk Rate',
    'Bwd Avg Bytes/Bulk', 'Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate',
    'Subflow Fwd Packets', 'Subflow Fwd Bytes', 'Subflow Bwd Packets', 'Subflow Bwd Bytes',
    'Init_Win_bytes_forward', 'Init_Win_bytes_backward',
    'Active Std', 'Active Max', 'Active Min', 'Idle Std', 'Idle Max', 'Idle Min'
]

# Drop the specified columns
df = df.drop(columns=columns_to_drop, errors='ignore')
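
Because `errors='ignore'` silently skips names that do not exist, a typo in `columns_to_drop` would go unnoticed. A quick check like the following, shown on a toy frame with a deliberately misspelled name, makes such misses visible.

```python
import pandas as pd

toy = pd.DataFrame({"Flow ID": [1, 2], "Protocol": [6, 17], "Flow Duration": [10, 20]})
to_drop = ["Flow ID", "Protocol", "Timestmap"]  # last name is a deliberate typo

# Names requested for removal that are not actually in the frame
missing = [name for name in to_drop if name not in toy.columns]
print("Not found, so not dropped:", missing)

# Drop only the names that exist
toy = toy.drop(columns=[name for name in to_drop if name in toy.columns])
print(list(toy.columns))
```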

In [7]:
df.shape

(355630, 37)

In [9]:
df.columns

Index(['Flow Duration', 'Fwd Packet Length Mean', 'Bwd Packet Length Mean',
       'Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean', 'Fwd IAT Mean',
       'Bwd IAT Mean', 'Packet Length Mean', 'FIN Flag Count',
       'SYN Flag Count', 'RST Flag Count', 'PSH Flag Count', 'ACK Flag Count',
       'URG Flag Count', 'CWE Flag Count', 'Down/Up Ratio',
       'Average Packet Size', 'act_data_pkt_fwd', 'min_seg_size_forward',
       'Active Mean', 'Idle Mean', 'Label',
       'Total Fwd Packets - Total Backward Packets',
       'Total Length of Fwd Packets - Total Length of Bwd Packets',
       'Fwd IAT Total - Bwd IAT Total', 'Fwd PSH Flags - Bwd PSH Flags',
       'Fwd URG Flags - Bwd URG Flags',
       'Fwd Header Length - Bwd Header Length',
       'Fwd Packets/s - Bwd Packets/s',
       'Avg Fwd Segment Size - Avg Bwd Segment Size',
       'Fwd Avg Bytes/Bulk - Bwd Avg Bytes/Bulk',
       'Fwd Avg Packets/Bulk - Bwd Avg Packets/Bulk',
       'Fwd Avg Bulk Rate - Bwd Avg Bulk Rate',
  

In [8]:
df.describe()

| | Flow Duration | Fwd Packet Length Mean | Bwd Packet Length Mean | Flow Bytes/s | Flow Packets/s | Flow IAT Mean | Fwd IAT Mean | Bwd IAT Mean | Packet Length Mean | FIN Flag Count | ... | Fwd URG Flags - Bwd URG Flags | Fwd Header Length - Bwd Header Length | Fwd Packets/s - Bwd Packets/s | Avg Fwd Segment Size - Avg Bwd Segment Size | Fwd Avg Bytes/Bulk - Bwd Avg Bytes/Bulk | Fwd Avg Packets/Bulk - Bwd Avg Packets/Bulk | Fwd Avg Bulk Rate - Bwd Avg Bulk Rate | Subflow Fwd Packets - Subflow Bwd Packets | Subflow Fwd Bytes - Subflow Bwd Bytes | Init_Win_bytes_forward - Init_Win_bytes_backward |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 355630.0 | 355630.0 | 355630.0 | 355630.0 | 355630.0 | 355630.0 | 355630.0 | 355630.0 | 355629.0 | 355629.0 | ... | 355630.0 | 355630.0 | 355629.0 | 355627.0 | 355626.0 | 355626.0 | 355626.0 | 355626.0 | 355626.0 | 355626.0 |
| mean | 10929750.0 | 59.643539 | 168.537728 | 83989.08 | 5494.58 | 3175805.0 | 3206486.0 | 946156.1 | 115.158037 | 0.041301 | ... | 0.002126 | 820731.6 | 3870.642 | -108.892824 | 0.0 | 0.0 | 0.0 | -3.137181 | -10615.01 | 21354.297723 |
| std | 21808610.0 | 119.309 | 311.332303 | 911269.8 | 39135.03 | 8459869.0 | 8859162.0 | 4583215.0 | 193.966724 | 14.68998 | ... | 1.281203 | 247157600.0 | 35451.43 | 312.014179 | 0.0 | 0.0 | 0.0 | 131.1452 | 276841.1 | 29856.085938 |
| min | -1.0 | 0.0 | 0.0 | 0.0 | -2000000.0 | -1.0 | 0.0 | 0.0 | -1.0 | 0.0 | ... | -8.0 | -16881960000.0 | -250000.0 | -1460.0 | 0.0 | 0.0 | 0.0 | -37775.0 | -80509520.0 | -65535.0 |
| 25% | 48886.0 | 0.0 | 0.0 | 0.0 | 0.6318762 | 24574.25 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | -83.0 | 0.0 | 0.0 | 0.0 | 0.0 | -99.0 | 0.0 |
| 50% | 560225.5 | 28.5 | 15.5 | 89.86458 | 7.758975 | 230593.5 | 45280.3 | 0.0 | 35.333333 | 0.0 | ... | 0.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1387.0 |
| 75% | 10769070.0 | 54.25 | 150.75 | 2373.444 | 63.27312 | 2127172.0 | 1842099.0 | 41869.54 | 124.666667 | 0.0 | ... | 0.0 | 56.0 | 0.5398507 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 64815.0 |
| max | 119999900.0 | 1460.0 | 1460.0 | 170500000.0 | 2000000.0 | 119951400.0 | 119951400.0 | 119225800.0 | 1373.111111 | 8760.0 | ... | 764.0 | 108814800000.0 | 2000000.0 | 1460.0 | 0.0 | 0.0 | 0.0 | 2996.0 | 9039466.0 | 65536.0 |


Unnamed: 0: This column seems to be an index or identifier and may not provide meaningful information for classification.

Flow ID: If this column is just a unique identifier for network flows, it might not contribute much to the classification task.

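Source IP, Source Port, Destination IP, Destination Port, Protocol: Addresses and ports identify specific endpoints in the capture, so a model can memorize them instead of learning traffic behavior. Unless your objective needs endpoint information, these columns (and Protocol) can be dropped.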

Timestamp: If the temporal aspect is not a key factor in your classification, you may consider dropping this column or extracting relevant features.

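Init_Win_bytes_forward, Init_Win_bytes_backward: The raw columns are dropped, but their difference is kept as 'Init_Win_bytes_forward - Init_Win_bytes_backward'.

Fwd Header Length.1: The `.1` suffix is what pandas appends when a CSV header repeats a column name, so this is most likely a duplicate of Fwd Header Length. Confirm the two columns are identical, then drop it.

Subflow Fwd Packets, Subflow Fwd Bytes, Subflow Bwd Packets, Subflow Bwd Bytes: The raw columns are dropped, and the forward/backward differences are kept in their place.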

act_data_pkt_fwd, min_seg_size_forward: Assess their importance based on their correlation with the target variable.

Total Fwd Packets, Packet Length Mean, Fwd IAT Mean, Fwd IAT Max, Bwd PSH Flags, Flow IAT Max: These are from highly correlated pairs, and keeping only one from each pair helps reduce redundancy in the dataset.

ECE Flag Count, Fwd Avg Packets/Bulk, Fwd Avg Bulk Rate, Bwd Avg Bytes/Bulk, Bwd Avg Packets/Bulk, Bwd Avg Bulk Rate: These were flagged by the variance check (variance below 0.01), so they are nearly constant and carry little information for the model. The describe() output agrees: the three bulk difference columns are 0 at every statistic.
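
Before dropping 'Fwd Header Length.1' as a duplicate, it is worth confirming that it really matches 'Fwd Header Length'. A minimal check, shown on toy values rather than the real data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Fwd Header Length": [20, 40, 32],
    "Fwd Header Length.1": [20, 40, 32],
})

# Series.equals compares values and index, and treats NaNs in the same position as equal
is_duplicate = toy["Fwd Header Length"].equals(toy["Fwd Header Length.1"])
print("Safe to drop the .1 column:", is_duplicate)

if is_duplicate:
    toy = toy.drop(columns=["Fwd Header Length.1"])
```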

### Possible consolidation of features

Packet Length Features:

'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Fwd Packet Length Mean', 'Fwd Packet Length Std'
'Bwd Packet Length Max', 'Bwd Packet Length Min', 'Bwd Packet Length Mean', 'Bwd Packet Length Std'
You might consider consolidating these into a single metric, such as the range (Max - Min) or the standard deviation, depending on the relevance to your problem.
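
As a rough sketch of the range idea, the snippet below derives one range column from a Max/Min pair. The new column name is a hypothetical choice, and on the real frame this would have to run before the Max and Min columns are dropped.

```python
import pandas as pd

# Toy values standing in for the real packet length columns
toy = pd.DataFrame({
    "Fwd Packet Length Max": [1460, 52, 0],
    "Fwd Packet Length Min": [0, 52, 0],
})

# Hypothetical consolidated feature: spread between largest and smallest packet
toy["Fwd Packet Length Range"] = (
    toy["Fwd Packet Length Max"] - toy["Fwd Packet Length Min"]
)
print(toy)
```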

Flow IAT Features:

'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Min'
'Fwd IAT Total', 'Fwd IAT Std', 'Fwd IAT Min'
'Bwd IAT Total', 'Bwd IAT Mean', 'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min'
Similar to the packet length features, you might consider consolidating these based on your problem requirements.

Flags:

'Fwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags'
'FIN Flag Count', 'SYN Flag Count', 'RST Flag Count', 'PSH Flag Count', 'ACK Flag Count', 'URG Flag Count', 'CWE Flag Count', 'ECE Flag Count'
Depending on the information these flags convey, you might choose to consolidate them or keep them separate.
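
If the individual flags turn out not to matter on their own, one possible consolidation is a single per-flow total. This is only a sketch on toy data; `Total Flag Count` is a made-up column name, and summing discards which flag was set, so it is worth comparing model performance with and without it.

```python
import pandas as pd

toy = pd.DataFrame({
    "FIN Flag Count": [0, 1, 0],
    "SYN Flag Count": [1, 1, 0],
    "ACK Flag Count": [1, 0, 0],
    "Flow Duration": [100, 200, 300],
})

# Select every column following the "<NAME> Flag Count" naming pattern
flag_cols = [col for col in toy.columns if col.endswith("Flag Count")]

# Hypothetical aggregate feature: total number of flags seen in the flow
toy["Total Flag Count"] = toy[flag_cols].sum(axis=1)
print(toy)
```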

Packet Rate Features:

'Fwd Packets/s', 'Bwd Packets/s'
If the rate of packets is essential, a single overall rate is enough. 'Flow Packets/s', which is already in the dataset, can play that role.

Packet Size Features:

'Min Packet Length', 'Max Packet Length', 'Packet Length Std', 'Packet Length Variance'
'Average Packet Size', 'Avg Fwd Segment Size', 'Avg Bwd Segment Size', 'Fwd Avg Bytes/Bulk'
These features convey overlapping information. In particular, 'Packet Length Variance' is by definition the square of 'Packet Length Std', so keeping both adds nothing. Consolidate the rest based on which metric is most relevant.