# ML Technique - SVM

Sources: https://github.com/cmadusankahw/email-spam-detection-with-SVM

https://www.kaggle.com/code/elnahas/phishing-email-detection-using-svm-rfc

In [7]:
!pip install scikit-learn



In [3]:
# Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, balanced_accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## Load the Dataset

In [2]:
import pandas as pd

# Load the dataset
data_path = '../masterData.csv'
master_data = pd.read_csv(data_path)

# Display the first few rows of the dataset
master_data.head()

| | Subject | Body | label | Body_Length |
|---|---|---|---|---|
| 0 | review your shipment details shipment notific... | notice this message was sent from outside the ... | 1 | 890.0 |
| 1 | υоur ассоunt іѕ оn hоld | votre réponse a bien été prise en compte\r\nht... | 1 | 1235.0 |
| 2 | completed invoice kz89tys2564 frombestbuycom ... | notice this message was sent from outside the ... | 1 | 3024.0 |
| 3 | uvic important notice | your uvic account has been filed under the lis... | 1 | 528.0 |
| 4 | you have 6 suspended incoming messages | message generated from uvicca source\r\n\r\n\... | 1 | 1234.0 |


In [3]:
# Check for missing values
missing_values = master_data.isnull().sum()
print("Missing values in each column:\n", missing_values)

Missing values in each column:
 Subject        19365
Body              68
label              0
Body_Length        5
dtype: int64


In [4]:
# Handle missing values
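# Drop rows with a missing 'Body'; .copy() avoids chained-assignment warnings below
master_data = master_data.dropna(subset=['Body']).copy()

# Replace missing values in 'Subject' with a space
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
master_data['Subject'] = master_data['Subject'].fillna(' ')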

# Fill missing values in 'Body_Length' with the mean
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
master_data['Body_Length'] = imputer.fit_transform(master_data[['Body_Length']])

In [5]:
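# Label-encode the 'Subject' and 'Body' text columns.
# Note: LabelEncoder maps each distinct string to an arbitrary integer ID,
# so the encoded values carry no information about the words in an email.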
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

le_body = LabelEncoder()
le_subject = LabelEncoder()


In [6]:
master_data['Body'] = le_body.fit_transform(master_data['Body'].astype(str))
master_data['Subject'] = le_subject.fit_transform(master_data['Subject'].astype(str))

In [7]:
# Separate features (X) and target variable (y)
X = master_data.drop('label', axis=1)
y = master_data['label']

In [8]:
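# Feature scaling
# Note: the scaler is fitted on the full dataset before the split,
# so test-set statistics leak into the training features.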
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [9]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
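In the cells above the scaler is fitted before the split and the split is not stratified. A minimal sketch of the alternative order, on hypothetical stand-in data (`X_demo`, `y_demo` and the other names here are assumptions for illustration, with an imbalance of about 0.5% positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 3 numeric features, 50 positives out of 10,000 rows.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:50] = 1

# stratify=y_demo keeps the positive rate the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("positives in train/test:", y_tr.sum(), y_te.sum())
```

Fitting on the training split only means the test set is scaled with statistics the model could have known at training time, and stratifying avoids a test set whose minority share differs from the training set by chance.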

In [10]:
# Check the shapes of the resulting datasets
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Testing target shape:", y_test.shape)

Training set shape: (415927, 3)
Testing set shape: (103982, 3)
Training target shape: (415927,)
Testing target shape: (103982,)


## Training the SVM Model

Training a baseline with the default svm_classifier = SVC(), which uses an RBF kernel.

In [11]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Initialize the SVM classifier
svm_classifier = SVC()

# Train the SVM classifier on the training data
svm_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = svm_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

Accuracy: 0.9947587082379643
Confusion Matrix:
 [[103437      0]
 [   545      0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00    103437
           1       0.00      0.00      0.00       545

    accuracy                           0.99    103982
   macro avg       0.50      0.50      0.50    103982
weighted avg       0.99      0.99      0.99    103982





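The classification report shows that the SVM fails completely on the minority class (label 1). The confusion matrix contains no predicted positives, so all 545 label-1 emails in the test set are missed, and precision and recall for label 1 are both 0.00. The 99.5% accuracy is misleading: it equals the share of label 0 in the test set (103437 of 103982), so a model that always predicts the majority class would score exactly the same. This is a common failure mode on heavily imbalanced datasets.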
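To make the comparison with a trivial model concrete, the sketch below rebuilds a label vector with the test-set class counts from the report above and scores a classifier that always predicts the most frequent class. The placeholder feature matrix and variable names are assumptions for illustration; only the class counts come from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Label vector with the test-set class counts reported above.
y_true = np.array([0] * 103437 + [1] * 545)
X_placeholder = np.zeros((len(y_true), 1))  # features are ignored by this baseline

# A classifier that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
y_base = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true, y_base)
baseline_balanced = balanced_accuracy_score(y_true, y_base)
print("accuracy:", baseline_accuracy)
print("balanced accuracy:", baseline_balanced)
```

Balanced accuracy averages the recall of each class, so it is less flattering than plain accuracy when one class is never predicted.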

In [12]:
class_distribution = master_data['label'].value_counts()
print("Class Distribution:\n", class_distribution)

Class Distribution:
 label
0    517355
1      2554
Name: count, dtype: int64


Trying LinearSVC with class_weight='balanced'

In [12]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Initialize the Linear SVM classifier with class weights
svm_classifier = LinearSVC(class_weight='balanced')

# Train the Linear SVM classifier on the training data
svm_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = svm_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)


Accuracy: 0.5171087303571772
Confusion Matrix:
 [[53402 50035]
 [  177   368]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.52      0.68    103437
           1       0.01      0.68      0.01       545

    accuracy                           0.52    103982
   macro avg       0.50      0.60      0.35    103982
weighted avg       0.99      0.52      0.68    103982
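The rounded report hides how low the minority-class precision is. A short worked calculation from the confusion matrix printed above (the variable names are just for illustration):

```python
# Confusion matrix reported above for LinearSVC(class_weight='balanced').
tn, fp = 53402, 50035
fn, tp = 177, 368

precision = tp / (tp + fp)        # share of predicted label-1 emails that are truly label 1
recall = tp / (tp + fn)           # share of true label-1 emails that were found
specificity = tn / (tn + fp)      # share of true label-0 emails that were kept as label 0
balanced_accuracy = (recall + specificity) / 2

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"specificity={specificity:.4f} balanced_accuracy={balanced_accuracy:.4f}")
```

Precision comes out at roughly 0.7%: the class weights recover about two thirds of the label-1 emails, but only by flagging almost half of the label-0 emails as well.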





In [14]:
pip install imbalanced-learn




In [15]:
from imblearn.over_sampling import SMOTE
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Initialize the Linear SVM classifier
# (class_weight='balanced' has no practical effect here: SMOTE has already equalized the class counts)
svm_classifier = LinearSVC(class_weight='balanced')

# Train the Linear SVM classifier on the resampled training data
svm_classifier.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test data
y_pred = svm_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)


Accuracy: 0.5134638687465138
Confusion Matrix:
 [[53008 50429]
 [  162   383]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.51      0.68    103437
           1       0.01      0.70      0.01       545

    accuracy                           0.51    103982
   macro avg       0.50      0.61      0.35    103982
weighted avg       0.99      0.51      0.67    103982




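Adding SMOTE to the Linear SVM barely changes the picture. Recall for the minority class rises slightly (0.68 to 0.70, or 383 instead of 368 of the 545 label-1 emails), accuracy stays near 0.51, and minority-class precision remains at about 0.01 because roughly half of the label-0 emails are still flagged. Oversampling alone does not fix the problem. A likely reason is the feature set rather than the imbalance itself: Subject and Body were label-encoded into arbitrary integer IDs, so the three features probably carry little signal that separates the classes.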
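To see why oversampling cannot add information that the features lack, it helps to look at how SMOTE-style synthesis works. The sketch below is a simplified illustration written for this note, not imbalanced-learn's implementation: it always interpolates toward the single nearest minority neighbour, whereas SMOTE picks among several nearest neighbours. The `minority` points and the `synthesize` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])


def synthesize(points, n_new, rng):
    """Create n_new points, each on the segment between a sample and its nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf  # exclude the point itself
        j = int(np.argmin(dists))
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_points)


synthetic = synthesize(minority, n_new=5, rng=rng)
print(synthetic)
```

Every synthetic point lies between existing minority points, so if the two classes overlap in feature space before oversampling, they still overlap afterwards.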

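Next steps: Try Naive Bayes.
Naive Bayes is a common baseline for text classification because it works directly on word counts. Whether it copes better with the class imbalance here still needs to be tested.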
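A minimal sketch of what that next step could look like, using the CountVectorizer already imported at the top of the notebook. The toy `texts` and `labels` are hypothetical and only show the shape of the pipeline; nothing here has been run on masterData.csv.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only (1 = phishing, 0 = legitimate).
texts = [
    "verify your account password now",
    "your account is suspended click to verify",
    "lunch meeting moved to friday",
    "please review the attached project report",
]
labels = [1, 1, 0, 0]

# Word counts feed a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

prediction = model.predict(["click here to verify your password"])
print(prediction)
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned from the training texts only when the pipeline is fitted on a training split.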