# Phishing Email Detection

A comparison of Random Forest vs. Support Vector Machine

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import metrics

## Ingest data set
> Download from https://www.kaggle.com/datasets/subhajournal/phishingemails/download?datasetVersionNumber=1

**See README.md** for instructions.

In [2]:
# Ingest: read the CSV into a DataFrame
pd.options.mode.copy_on_write = True
data = pd.read_csv("data/Phishing_Email.csv")
data.info()  # Row count, column names, dtypes and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18650 entries, 0 to 18649
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  18650 non-null  int64 
 1   Email Text  18634 non-null  object
 2   Email Type  18650 non-null  object
dtypes: int64(1), object(2)
memory usage: 437.2+ KB


## Wrangle

### Remove NA's
Remove partially filled rows.

In [3]:
# Wrangle
data.isna().sum()  # Count missing values per column
df = data.dropna()  # Drop rows with any missing value

### Check class balance
Check whether the classes are balanced in the sample. If the sample is imbalanced, the model may learn to predict the larger class better than the smaller one and bias its predictions towards it.

In [4]:
# Check sample balance
## Remember .to_frame(), unless you prefer working with a pandas.Series directly
base_lines = df["Email Type"].value_counts().to_frame()  # Sample is IMBALANCED

For simplicity, the majority class (Safe Email) was undersampled to the size of the minority class.

In [5]:
## Add relative class weights and undersample the majority class
base_lines["weights"] = base_lines["count"] / base_lines["count"].max()
undersample_obs = base_lines["count"].min()  # Size of the smallest class
# random_state makes the random draw repeatable between runs
undersampled = df[df["Email Type"] == "Safe Email"].sample(
    undersample_obs, random_state=0
)
dfx = pd.concat(
    [undersampled, df[df["Email Type"] == "Phishing Email"]], ignore_index=True
)
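The same undersampling idea can be sketched without naming each class by hand. This is a minimal illustration on a hypothetical toy frame (`toy`, `n_min` and `balanced` are names made up for the example, not part of the notebook), assuming pandas 1.1 or later for `groupby(...).sample`.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`.
toy = pd.DataFrame(
    {
        "Email Text": ["a", "b", "c", "d", "e", "f", "g"],
        "Email Type": ["Safe Email"] * 5 + ["Phishing Email"] * 2,
    }
)

# Size of the smallest class.
n_min = toy["Email Type"].value_counts().min()

# Draw n_min rows from every class; random_state makes the draw repeatable.
balanced = (
    toy.groupby("Email Type")
    .sample(n=n_min, random_state=0)
    .reset_index(drop=True)
)

print(balanced["Email Type"].value_counts())
```

Because every class is sampled down to the same size, this variant keeps working if a third label is ever added to the data.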

### Split into train and test sets

In [6]:
# Split train test
X = dfx["Email Text"]  # Features
y = dfx["Email Type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Model

Instead of writing a function that converts each text observation into a vector of words and then iterating over every observation manually, we can use scikit-learn, which provides a vectorizer for this step and lets us chain it with the ML algorithm as a pipeline.
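To make the chaining concrete, here is a minimal sketch on a hypothetical toy corpus. It fits a `RandomForestClassifier` twice, once by vectorizing manually and once through a `Pipeline`, and compares the predictions. The texts, labels, step names and `n_estimators=10` are assumptions chosen for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only for illustration.
texts = [
    "win a free prize now",
    "claim your free reward",
    "meeting moved to monday",
    "lunch at noon tomorrow",
]
labels = ["Phishing Email", "Phishing Email", "Safe Email", "Safe Email"]
new_texts = ["free prize inside", "see you at the meeting"]

# Manual route: vectorize first, then fit the classifier on the matrix.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(features, labels)
manual_pred = clf.predict(vectorizer.transform(new_texts))

# Pipeline route: the same two steps chained behind one fit/predict call.
pipe = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
    ]
)
pipe.fit(texts, labels)
pipe_pred = pipe.predict(new_texts)

# Both routes apply identical steps, so the predictions should agree.
print(list(manual_pred) == list(pipe_pred))
```

The pipeline also guarantees that the vectorizer is fitted on the training data only and merely applied to new text, which is easy to get wrong in the manual route.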

### Support Vector Machine

In [7]:
# SVM
from sklearn.svm import SVC

SVM = Pipeline([("vectorizer", TfidfVectorizer()), ("SVM", SVC(C=100, gamma="auto"))])
SVM.fit(X_train, y_train)

**To see what the vectorizer is doing, you can run the following block:**
```python
vectorizer = TfidfVectorizer(encoding="utf-8")
x = vectorizer.fit_transform(X_train)  # This passes the whole training set for demonstration.
vectorizer.get_feature_names_out()
```

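Do note that, with its default settings, the vectorizer only lowercases the text and splits it into word tokens; it does not stem, lemmatize, or remove stop words.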

### Evaluate SVM

In [8]:
# Evaluate svm
svm_y_hat = SVM.predict(X_test)
metrics.confusion_matrix(y_test, svm_y_hat)
print(metrics.classification_report(y_test, svm_y_hat))
print("\nAccuracy:", metrics.accuracy_score(y_test, svm_y_hat))

                precision    recall  f1-score   support

Phishing Email       0.00      0.00      0.00      1468
    Safe Email       0.50      1.00      0.66      1457

      accuracy                           0.50      2925
     macro avg       0.25      0.50      0.33      2925
  weighted avg       0.25      0.50      0.33      2925


Accuracy: 0.49811965811965814


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


What the warning indicates is that one of the classes never appears among the predicted labels, so its precision (and therefore its F1 score) is undefined and is reported as 0. Inspecting the predictions shows that the model predicted everything as Safe, so the Phishing class doesn't appear in the `svm_y_hat` array.
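The all-Safe prediction can be checked directly by counting the distinct predicted labels. The sketch below uses hypothetical toy arrays (`y_true`, `y_pred`) in place of `y_test` and `svm_y_hat`, and passes `zero_division=0` so that the undefined precision is set to 0 without emitting the warning.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: two classes in the truth, only one in the predictions.
y_true = np.array(["Phishing Email", "Safe Email", "Phishing Email", "Safe Email"])
y_pred = np.array(["Safe Email"] * 4)

# Which classes did the model actually predict, and how often?
predicted_classes, counts = np.unique(y_pred, return_counts=True)
print(dict(zip(predicted_classes, counts)))

# Precision for a class that is never predicted is undefined (0 / 0);
# zero_division=0 reports it as 0.0 instead of warning.
phishing_precision = metrics.precision_score(
    y_true, y_pred, pos_label="Phishing Email", zero_division=0
)
print(phishing_precision)
```

One possible cause, which this notebook does not verify, is `gamma="auto"`. That setting uses 1 / n_features, which becomes very small for a large TF-IDF vocabulary and can make the RBF kernel nearly constant across all emails. Trying `gamma="scale"` (the scikit-learn default) or a linear kernel would be a reasonable next experiment.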

## Save Model

In [None]:
from joblib import dump
dump(SVM, 'svm-classifier.joblib')
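A saved pipeline can later be restored with `joblib.load`. The sketch below is a self-contained round trip on a hypothetical toy model written to a temporary directory. In the notebook, the equivalent step would presumably be `load('svm-classifier.joblib')`.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy model standing in for the fitted `SVM` pipeline.
texts = ["win a free prize", "claim free reward", "meeting on monday", "lunch at noon"]
labels = ["Phishing Email", "Phishing Email", "Safe Email", "Safe Email"]
model = Pipeline([("vectorizer", TfidfVectorizer()), ("SVM", SVC())])
model.fit(texts, labels)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svm-classifier.joblib")
    dump(model, path)
    restored = load(path)

# The restored pipeline should predict exactly like the original.
new_texts = ["free prize waiting", "monday meeting agenda"]
same = list(model.predict(new_texts)) == list(restored.predict(new_texts))
print(same)
```

Because the whole pipeline is saved, the fitted vectorizer travels with the classifier and raw text can be passed straight to `predict`. joblib files are pickle-based, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.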