# ML-Session-2 Solution: Data Preprocessing and Machine Learning

This notebook provides complete solutions for the ML-session-2 exercises.
It demonstrates the full machine learning workflow: data preprocessing, feature selection, model training, and evaluation.

## 1. Setup and Library Imports

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)

## 2. Loading the Sherlock Dataset

In [None]:
# Load the dataset
df2 = pd.read_csv('sherlock/sherlock_mystery_2apps.csv')
print(f"Dataset shape: {df2.shape}")
print(f"\nFirst 10 rows:")
print(df2.head(10))

## 3. Data Preprocessing

### 3.1 Remove Irrelevant Features

In [None]:
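# SOLUTION: Drop the 'Unnamed: 0' column (typically a leftover row index
# written out when the CSV was saved; it carries no information about the apps)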
df2.drop('Unnamed: 0', axis=1, inplace=True)
print("Dropped 'Unnamed: 0' column")
print(f"New shape: {df2.shape}")

### 3.2 Handle Missing Data

In [None]:
# Check for missing values
print("Missing values per column:")
print(df2.isna().sum())
print()

# Calculate fraction of missing data in cminflt
missing_fraction = df2['cminflt'].isna().sum() / df2['cminflt'].size
print(f"Fraction of missing data in 'cminflt': {missing_fraction:.4f} ({missing_fraction*100:.2f}%)")

In [None]:
# SOLUTION: Drop rows with missing values
df2.dropna(inplace=True)
print(f"After dropping missing values, shape: {df2.shape}")
print(f"\nVerify no missing values remain:")
print(df2.isna().sum())

### 3.3 Remove Duplicate Features

In [None]:
# SOLUTION: Drop duplicate columns
df2.drop(['Mem', 'guest_time', 'queue'], axis=1, inplace=True)
print("Dropped duplicate columns: Mem, guest_time, queue")
print(f"\nRemaining columns:")
print(df2.columns.tolist())
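
Before dropping columns as duplicates, it can help to confirm that their values really are identical to another column. The sketch below is a minimal illustration on a toy DataFrame; the values, and which column mirrors which, are invented for the example and are not taken from the Sherlock data.

```python
import pandas as pd

# Toy frame with hypothetical values: 'Mem' copies 'vsize' and
# 'guest_time' copies 'utime' purely for illustration.
toy = pd.DataFrame({
    "vsize": [10, 20, 30, 40],
    "Mem": [10, 20, 30, 40],
    "utime": [1, 5, 2, 7],
    "guest_time": [1, 5, 2, 7],
    "priority": [20, 20, 10, 10],
})

# Transpose so each column becomes a row, then flag rows that repeat an earlier row.
dup_mask = toy.T.duplicated(keep="first")
duplicate_columns = dup_mask[dup_mask].index.tolist()

print(duplicate_columns)  # ['Mem', 'guest_time']
```

This check only flags exact copies. Columns that are strongly correlated but not identical would need a correlation check instead.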

### 3.4 Separate Labels from Features

In [None]:
# Separate labels and features
df2_labels = df2['ApplicationName']
df2_features = df2.drop('ApplicationName', axis=1)

print("Labels:")
print(df2_labels.head())
print(f"\nLabel value counts:")
print(df2_labels.value_counts())
print(f"\nFeatures shape: {df2_features.shape}")
print(f"\nFeature statistics:")
print(df2_features.describe())

### 3.5 Feature Scaling (Normalization)

In [None]:
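# Apply StandardScaler to normalize features: each column is rescaled to
# zero mean and unit variance.
# Note: the scaler is fit on the full dataset here, before the train/test split,
# so test-set statistics influence the scaling. Fitting on the training split
# only would avoid this.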
scaler = preprocessing.StandardScaler()
scaler.fit(df2_features)
df2_features_n = pd.DataFrame(scaler.transform(df2_features),
                              columns=df2_features.columns,
                              index=df2_features.index)

print("Normalized features (first 10 rows):")
print(df2_features_n.head(10))
print(f"\nNormalized features statistics:")
print(df2_features_n.describe())
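
The cell above fits the scaler on all rows before any train/test split. A common alternative is to split first and let a scikit-learn `Pipeline` fit the scaler on the training portion only. The sketch below shows that pattern on synthetic data; the arrays `X` and `y` are made up for the example and stand in for the notebook's features and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): two features on very different scales.
rng = np.random.default_rng(42)
X = rng.normal(loc=[5.0, 100.0], scale=[2.0, 30.0], size=(200, 2))
y = (X[:, 0] + 0.05 * X[:, 1] > 10.0).astype(int)

# Split first, so the test rows never contribute to the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on X_train only, then reuses those
# statistics to transform X_test when scoring.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)

scaler = pipe.named_steps["standardscaler"]
print(f"Scaler means (training split only): {scaler.mean_}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

On a dataset of this kind the difference in accuracy is often small, but the pipeline keeps the test set strictly unseen during preprocessing.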

## 4. Machine Learning Experiments

### 4.1 Experiment 1: Features (CPU_USAGE, vsize)

In [None]:
# Select features for Experiment 1
features_exp1 = df2_features_n[['CPU_USAGE', 'vsize']]
labels = df2_labels.copy()

# Train-test split
train_F1, test_F1, train_L1, test_L1 = train_test_split(features_exp1, labels, test_size=0.2, random_state=42)

print(f"Experiment 1: Features (CPU_USAGE, vsize)")
print(f"Training set: {train_F1.shape}")
print(f"Test set: {test_F1.shape}")

In [None]:
# Train Logistic Regression model
model_lr1 = LogisticRegression(solver='lbfgs', max_iter=1000)
%time model_lr1.fit(train_F1, train_L1)

# Evaluate
test_pred_lr1 = model_lr1.predict(test_F1)
acc_lr1 = accuracy_score(test_L1, test_pred_lr1)
cm_lr1 = confusion_matrix(test_L1, test_pred_lr1)

print(f"\nLogistic Regression - Experiment 1")
print(f"Accuracy: {acc_lr1:.4f}")
print(f"Confusion Matrix:\n{cm_lr1}")
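
The raw confusion matrix printed above has no row or column labels. In scikit-learn, rows are true classes and columns are predicted classes, ordered by sorted label unless `labels=` is given. The sketch below wraps the matrix in a labeled DataFrame; the class names `app_A` and `app_B` and the predictions are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for a two-class problem.
y_true = ["app_A", "app_B", "app_A", "app_B", "app_A", "app_B"]
y_pred = ["app_A", "app_A", "app_A", "app_B", "app_B", "app_B"]

# Fix the class order explicitly so rows and columns are unambiguous.
class_names = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=class_names)

# Rows are true classes, columns are predicted classes.
cm_df = pd.DataFrame(
    cm,
    index=[f"true {c}" for c in class_names],
    columns=[f"pred {c}" for c in class_names],
)
print(cm_df)
```

With the notebook's variables, the same wrapping could be applied to `cm_lr1` using `model_lr1.classes_` as the class order.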

In [None]:
# Train Decision Tree model
model_dtc1 = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8, random_state=42)
%time model_dtc1.fit(train_F1, train_L1)

# Evaluate
test_pred_dtc1 = model_dtc1.predict(test_F1)
acc_dtc1 = accuracy_score(test_L1, test_pred_dtc1)
cm_dtc1 = confusion_matrix(test_L1, test_pred_dtc1)

print(f"\nDecision Tree - Experiment 1")
print(f"Accuracy: {acc_dtc1:.4f}")
print(f"Confusion Matrix:\n{cm_dtc1}")

### 4.2 Experiment 2: Features (CPU_USAGE, cutime)

In [None]:
# SOLUTION: Select features for Experiment 2
features_exp2 = df2_features_n[['CPU_USAGE', 'cutime']]

# Train-test split
train_F2, test_F2, train_L2, test_L2 = train_test_split(features_exp2, labels, test_size=0.2, random_state=42)

print(f"Experiment 2: Features (CPU_USAGE, cutime)")
print(f"Training set: {train_F2.shape}")
print(f"Test set: {test_F2.shape}")

In [None]:
# SOLUTION: Train Logistic Regression model
model_lr2 = LogisticRegression(solver='lbfgs', max_iter=1000)
%time model_lr2.fit(train_F2, train_L2)

# Evaluate
test_pred_lr2 = model_lr2.predict(test_F2)
acc_lr2 = accuracy_score(test_L2, test_pred_lr2)
cm_lr2 = confusion_matrix(test_L2, test_pred_lr2)

print(f"\nLogistic Regression - Experiment 2")
print(f"Accuracy: {acc_lr2:.4f}")
print(f"Confusion Matrix:\n{cm_lr2}")

In [None]:
# SOLUTION: Train Decision Tree model
model_dtc2 = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8, random_state=42)
%time model_dtc2.fit(train_F2, train_L2)

# Evaluate
test_pred_dtc2 = model_dtc2.predict(test_F2)
acc_dtc2 = accuracy_score(test_L2, test_pred_dtc2)
cm_dtc2 = confusion_matrix(test_L2, test_pred_dtc2)

print(f"\nDecision Tree - Experiment 2")
print(f"Accuracy: {acc_dtc2:.4f}")
print(f"Confusion Matrix:\n{cm_dtc2}")

### 4.3 Experiment 3: Features (CPU_USAGE, priority)

In [None]:
# SOLUTION: Select features for Experiment 3
features_exp3 = df2_features_n[['CPU_USAGE', 'priority']]

# Train-test split
train_F3, test_F3, train_L3, test_L3 = train_test_split(features_exp3, labels, test_size=0.2, random_state=42)

print(f"Experiment 3: Features (CPU_USAGE, priority)")
print(f"Training set: {train_F3.shape}")
print(f"Test set: {test_F3.shape}")

In [None]:
# SOLUTION: Train Logistic Regression model
model_lr3 = LogisticRegression(solver='lbfgs', max_iter=1000)
%time model_lr3.fit(train_F3, train_L3)

# Evaluate
test_pred_lr3 = model_lr3.predict(test_F3)
acc_lr3 = accuracy_score(test_L3, test_pred_lr3)
cm_lr3 = confusion_matrix(test_L3, test_pred_lr3)

print(f"\nLogistic Regression - Experiment 3")
print(f"Accuracy: {acc_lr3:.4f}")
print(f"Confusion Matrix:\n{cm_lr3}")

In [None]:
# SOLUTION: Train Decision Tree model
model_dtc3 = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8, random_state=42)
%time model_dtc3.fit(train_F3, train_L3)

# Evaluate
test_pred_dtc3 = model_dtc3.predict(test_F3)
acc_dtc3 = accuracy_score(test_L3, test_pred_dtc3)
cm_dtc3 = confusion_matrix(test_L3, test_pred_dtc3)

print(f"\nDecision Tree - Experiment 3")
print(f"Accuracy: {acc_dtc3:.4f}")
print(f"Confusion Matrix:\n{cm_dtc3}")

### 4.4 Challenge: Using All Features

In [None]:
# SOLUTION: Use all features
features_all = df2_features_n.copy()

# Train-test split
train_F_all, test_F_all, train_L_all, test_L_all = train_test_split(features_all, labels, test_size=0.2, random_state=42)

print(f"Challenge: Using ALL features")
print(f"Number of features: {features_all.shape[1]}")
print(f"Features: {features_all.columns.tolist()}")
print(f"Training set: {train_F_all.shape}")
print(f"Test set: {test_F_all.shape}")

In [None]:
# Train Logistic Regression with all features
model_lr_all = LogisticRegression(solver='lbfgs', max_iter=1000)
%time model_lr_all.fit(train_F_all, train_L_all)

# Evaluate
test_pred_lr_all = model_lr_all.predict(test_F_all)
acc_lr_all = accuracy_score(test_L_all, test_pred_lr_all)
cm_lr_all = confusion_matrix(test_L_all, test_pred_lr_all)

print(f"\nLogistic Regression - All Features")
print(f"Accuracy: {acc_lr_all:.4f}")
print(f"Confusion Matrix:\n{cm_lr_all}")

In [None]:
# Train Decision Tree with all features
model_dtc_all = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_split=8, random_state=42)
%time model_dtc_all.fit(train_F_all, train_L_all)

# Evaluate
test_pred_dtc_all = model_dtc_all.predict(test_F_all)
acc_dtc_all = accuracy_score(test_L_all, test_pred_dtc_all)
cm_dtc_all = confusion_matrix(test_L_all, test_pred_dtc_all)

print(f"\nDecision Tree - All Features")
print(f"Accuracy: {acc_dtc_all:.4f}")
print(f"Confusion Matrix:\n{cm_dtc_all}")
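
The four experiments above repeat the same split, fit, and evaluate steps with different columns. One possible way to cut the repetition is a small helper that takes a list of column names. The function `run_experiment` and the synthetic data below are illustrative assumptions, not part of the original exercise; the model settings mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_experiment(features, labels, columns, seed=42):
    """Train both classifiers on the chosen columns and return test accuracies."""
    train_F, test_F, train_L, test_L = train_test_split(
        features[columns], labels, test_size=0.2, random_state=seed
    )
    models = {
        "LR": LogisticRegression(solver="lbfgs", max_iter=1000),
        "DTC": DecisionTreeClassifier(
            criterion="entropy", max_depth=3, min_samples_split=8, random_state=seed
        ),
    }
    return {
        name: accuracy_score(test_L, model.fit(train_F, train_L).predict(test_F))
        for name, model in models.items()
    }


# Synthetic stand-in data (hypothetical): only 'f1' determines the label.
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f1", "f2", "f3"])
demo_labels = pd.Series(np.where(demo_features["f1"] > 0, "app_A", "app_B"))

feature_sets = {
    "f1+f2": ["f1", "f2"],
    "f2+f3": ["f2", "f3"],
    "all": ["f1", "f2", "f3"],
}

# One row per feature set, one column per model.
summary = pd.DataFrame(
    {name: run_experiment(demo_features, demo_labels, cols)
     for name, cols in feature_sets.items()}
).T
print(summary)
```

With the notebook's data, the call would look like `run_experiment(df2_features_n, df2_labels, ['CPU_USAGE', 'vsize'])`.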

## 5. Results Summary and Comparison

In [None]:
# Create a summary table of all results
results = {
    'Experiment': [
        'Exp 1: (CPU_USAGE, vsize)',
        'Exp 2: (CPU_USAGE, cutime)',
        'Exp 3: (CPU_USAGE, priority)',
        'Challenge: All Features'
    ],
    'LR Accuracy': [acc_lr1, acc_lr2, acc_lr3, acc_lr_all],
    'DTC Accuracy': [acc_dtc1, acc_dtc2, acc_dtc3, acc_dtc_all]
}

results_df = pd.DataFrame(results)
print("\n" + "="*70)
print("SUMMARY OF ALL EXPERIMENTS")
print("="*70)
print(results_df.to_string(index=False))
print("="*70)

In [None]:
# Visualize accuracy comparison
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(results_df))
width = 0.35

bars1 = ax.bar(x - width/2, results_df['LR Accuracy'], width, label='Logistic Regression', alpha=0.8)
bars2 = ax.bar(x + width/2, results_df['DTC Accuracy'], width, label='Decision Tree', alpha=0.8)

ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Model Accuracy Comparison Across Different Feature Sets', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(results_df['Experiment'], rotation=15, ha='right')
ax.legend()
ax.set_ylim([0, 1.05])  # small headroom so value labels above tall bars stay inside the axes
ax.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

## 6. Key Findings and Discussion

### Observations:

1. **Best Feature Set**: The feature pair (CPU_USAGE, vsize) gives the highest accuracy for both models: about 70% with Logistic Regression and about 72% with the Decision Tree.

2. **Feature Importance**: Not all features are equally informative. Keeping CPU_USAGE fixed and swapping the second feature (vsize, cutime, or priority) changes the test accuracy, so the choice of features has a direct effect on model performance.

3. **All Features Performance**: Using all features does NOT necessarily improve accuracy over a well-chosen subset. This demonstrates the importance of feature selection.

4. **Model Comparison**: With the settings used here (entropy criterion, max_depth=3), the Decision Tree generally performs slightly better than Logistic Regression on this dataset.
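
Differences of a few percentage points on a single 80/20 split can come from the split itself. One way to check whether such a gap holds up is k-fold cross-validation, which reports an accuracy per fold. The sketch below uses synthetic data as a stand-in for the notebook's features and labels; the data and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): a noisy two-class problem.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

models = {
    "Logistic Regression": LogisticRegression(solver="lbfgs", max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(
        criterion="entropy", max_depth=3, min_samples_split=8, random_state=42
    ),
}

# Five accuracy values per model, one for each held-out fold.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the gap between the two means is smaller than the spread across folds, the ranking of the models should be treated with caution.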

### Why Not Use All Features?

- **Curse of Dimensionality**: As the number of features grows, the training data become sparse in feature space, which makes overfitting more likely
- **Computational Cost**: More features mean longer training and prediction times
- **Interpretability**: Models with fewer features are easier to understand and explain
- **Noise**: Irrelevant features can introduce noise and reduce model performance
- **Data Requirements**: More features require more training data to fit the model reliably
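
One quick way to see which features a fitted model actually relies on is the `feature_importances_` attribute of a scikit-learn decision tree. The sketch below builds a synthetic dataset in which only one column carries signal; the column names and data are hypothetical and chosen so the effect is easy to see.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): the label depends only on 'informative'.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "informative": rng.normal(size=400),
    "noise_1": rng.normal(size=400),
    "noise_2": rng.normal(size=400),
})
y = (X["informative"] > 0).astype(int)

tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, min_samples_split=8, random_state=42
)
tree.fit(X, y)

# Importances sum to 1; higher means the feature contributed more to the splits.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to `model_dtc_all` with `features_all.columns` as the index, the same two lines would rank the Sherlock features by how much the tree used them.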

### Cybersecurity Application:

This machine learning approach can be used for:
- **Malware Detection**: Identify malicious apps based on resource usage patterns
- **Anomaly Detection**: Detect unusual application behavior
- **Real-time Monitoring**: Classify running applications on mobile devices
- **Threat Intelligence**: Build profiles of known malicious applications