# Phishing Email Detection – Training and Testing Guide
This guide explains how to train and test a lightweight phishing email detection model using a Kaggle dataset and Google Colab.
It walks through each step, from downloading the dataset to testing the model interactively on your own email text.

## Step 1: Setup and Get Data
We start by:
1. Installing required Python packages.
2. Uploading the kaggle.json API token.
3. Downloading a phishing email dataset from Kaggle.
4. Loading and preparing the data.

### Install Required Packages

In [None]:
!pip -q install kaggle scikit-learn pandas joblib

#### Explanation
- `kaggle`: Access Kaggle datasets programmatically.
- `scikit-learn`: Machine learning toolkit for training the model.
- `pandas`: Data handling and cleaning.
- `joblib`: For model saving/loading (optional later).

--------
### Upload your kaggle.json API Token
This authenticates your access to Kaggle datasets.

In [None]:
import os  # os is part of the standard library, not google.colab
from google.colab import files
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)
uploaded = files.upload()

with open(os.path.expanduser("~/.kaggle/kaggle.json"), "wb") as f:
    f.write(uploaded['kaggle.json'])
os.chmod(os.path.expanduser("~/.kaggle/kaggle.json"), 0o600)

#### Explanation

- You’ll be prompted to upload your Kaggle API key file (kaggle.json).
- This file is downloaded from your Kaggle account under:

    **Profile → Account → Create New API Token.**

- The code creates the ~/.kaggle directory and sets correct file permissions for secure access.
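The permission step can be checked with the standard library alone. The sketch below is not part of the original notebook: it uses a temporary directory as a stand-in for your home folder and a dummy token, so `fake_home` and the token contents are hypothetical.

```python
import os
import stat
import tempfile

# Hypothetical stand-in for the home directory, so the sketch never touches a real token.
fake_home = tempfile.mkdtemp()
kaggle_dir = os.path.join(fake_home, ".kaggle")
os.makedirs(kaggle_dir, exist_ok=True)

token_path = os.path.join(kaggle_dir, "kaggle.json")
with open(token_path, "w") as f:
    f.write('{"username": "example", "key": "not-a-real-key"}')  # dummy content

# 0o600 = read/write for the owner only; no access for group or others.
os.chmod(token_path, 0o600)

# Read the permission bits back to confirm the change took effect.
mode = stat.S_IMODE(os.stat(token_path).st_mode)
print(oct(mode))
```

On a POSIX system such as Colab this prints `0o600`.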

### Download the Dataset

In [None]:
!kaggle datasets download -d naserabdullahalam/phishing-email-dataset -p /content -q
!unzip -qq /content/*.zip -d /content/data

#### Explanation
- Downloads a public Phishing Email dataset from Kaggle.
- Extracts it into /content/data for use in Colab.
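A dataset archive can contain more than one CSV file, so it can help to list what was extracted before loading anything. The following is a minimal sketch under that assumption; `list_csv_files` and the demo file names are hypothetical, and in Colab you would pass `"/content/data"` as the root.

```python
import glob
import os
import tempfile

def list_csv_files(root):
    """Return (path, size_in_bytes) for every CSV under root, sorted by path."""
    paths = sorted(glob.glob(os.path.join(root, "**", "*.csv"), recursive=True))
    return [(p, os.path.getsize(p)) for p in paths]

# Demo on a temporary folder with hypothetical file names.
demo_root = tempfile.mkdtemp()
for name, body in [("b.csv", "text,label\nhi,0\n"), ("a.csv", "text,label\n")]:
    with open(os.path.join(demo_root, name), "w") as f:
        f.write(body)

found = list_csv_files(demo_root)
for path, size in found:
    print(os.path.basename(path), size)
```

Sorting the paths makes the listing order reproducible, since `glob` itself does not guarantee one.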

### Load and Inspect the Dataset

In [None]:
import pandas as pd, glob

# Find the first CSV file (sorted so the choice is reproducible) and load it
csv = sorted(glob.glob("/content/data/**/*.csv", recursive=True))[0]
df = pd.read_csv(csv, encoding='latin-1')
df.columns = [c.lower() for c in df.columns]

# Identify likely text and label columns by name
text_col = [c for c in df.columns if 'text' in c or 'message' in c or 'body' in c or 'email' in c][0]
label_col = [c for c in df.columns if 'label' in c or 'class' in c or 'spam' in c or 'phish' in c][0]

# Keep only those two columns, then drop rows missing either value
# (dropping earlier would also discard rows with gaps in unrelated columns)
df = df[[text_col, label_col]].rename(columns={text_col:'text', label_col:'label'}).dropna()
print("✅ Loaded", len(df), "emails")
df.head(3)

#### Explanation
- Automatically finds and reads the first .csv file inside the dataset folder.
- Cleans missing values (dropna()).
- Detects which columns contain email text and labels based on column names.
- Renames them to standardized columns:
    - `text` → email content
    - `label` → phishing/spam indicator

- Displays the first few rows for quick inspection.
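The column detection can be sketched without pandas. In this illustration `find_column`, the keyword tuples, and the sample header rows are hypothetical names; unlike the list-indexing approach in the cell above, it returns `None` when nothing matches instead of raising `IndexError`.

```python
def find_column(columns, keywords):
    """Return the first column whose lowercased name contains any keyword, else None."""
    for col in columns:
        name = col.lower()
        if any(k in name for k in keywords):
            return col
    return None

TEXT_KEYWORDS = ("text", "message", "body", "email")
LABEL_KEYWORDS = ("label", "class", "spam", "phish")

# Hypothetical header row, used only for illustration.
columns = ["Sender", "Subject", "Body", "Label"]
text_col = find_column(columns, TEXT_KEYWORDS)
label_col = find_column(columns, LABEL_KEYWORDS)
print(text_col, label_col)
# → Body Label

# A header with no recognizable label column yields None.
missing = find_column(["id", "content"], LABEL_KEYWORDS)
print(missing)
# → None
```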
---
## Step 2: Train Model

Now we train a lightweight, explainable model using text-based features.

---

### Clean & Normalize Labels

In [None]:
# --- Robust label mapping ---
lab = df['label'].astype(str).str.lower().str.strip()

# Common tokens for spam and ham classes
SPAM_TOKENS = {'spam','phish','phishing','malicious','malware','attack','fraud','scam','bad','abusive','1','true','yes'}
HAM_TOKENS  = {'ham','legit','legitimate','benign','normal','not_spam','0','false','no','safe'}

def map_label(s):
    # exact match first
    if s in SPAM_TOKENS: return 1
    if s in HAM_TOKENS:  return 0
    # substring fallback
    if any(t in s for t in ['spam','phish','malic','attack','fraud','scam']): return 1
    if any(t in s for t in ['ham','legit','normal','safe']): return 0
    # numeric fallback: also handles float-formatted labels such as '1.0' or '0.0'
    try:
        return 1 if float(s) == 1.0 else 0
    except ValueError:
        return 0  # default to ham if unknown

y = lab.apply(map_label).astype(int)
X = df['text'].astype(str)

#### Explanation

- Converts all labels to lowercase text.
- Maps common variants of phishing/spam labels (`spam`, `phish`, `fraud`, etc.) to 1.
- Maps safe/ham/legit labels to 0.
- The `map_label` function ensures the dataset works even if label names differ across sources.
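Because `map_label` returns 0 for anything it does not recognize, unexpected label values are silently counted as ham. One way to audit this is sketched below, assuming abbreviated token sets and exact matching only (no substring fallback); `map_label_strict` and the sample labels are hypothetical.

```python
from collections import Counter

# Abbreviated token sets for the sketch; the notebook's sets are longer.
SPAM_TOKENS = {'spam', 'phish', 'phishing', 'malicious', 'fraud', 'scam', '1', 'true', 'yes'}
HAM_TOKENS = {'ham', 'legit', 'legitimate', 'benign', 'normal', 'not_spam', '0', 'false', 'no', 'safe'}

def map_label_strict(s):
    """Like map_label, but returns None for unrecognized labels instead of defaulting to 0."""
    s = str(s).lower().strip()
    if s in SPAM_TOKENS:
        return 1
    if s in HAM_TOKENS:
        return 0
    return None

# Hypothetical raw labels for illustration.
raw_labels = ["Spam", "ham", " PHISHING ", "1", "0", "unknown", "n/a", "unknown"]
mapped = [map_label_strict(v) for v in raw_labels]

# Count the raw values that no rule recognized.
unrecognized = Counter(v for v, m in zip(raw_labels, mapped) if m is None)
print(Counter(mapped))
print(unrecognized)
```

If `unrecognized` is not empty, it is worth extending the token sets or dropping those rows before training.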

### Clean Email Text

In [None]:
import re

def clean_email(t):
    t = t.replace('\r','\n')
    cleaned_lines = []
    for ln in t.splitlines():
        if not re.match(r'^(from:|to:|subject:|cc:|bcc:|date:|reply-to:)', ln.strip().lower()):
            cleaned_lines.append(ln)
    t = '\n'.join(cleaned_lines)
    t = re.sub(r'\s+', ' ', t).strip().lower()
    return t

X_clean = X.apply(clean_email)

#### Explanation
- Removes common email header lines (e.g., From, Subject, To).
- Normalizes whitespace and converts text to lowercase.
- Keeps URLs and numbers, which are often critical indicators of phishing.
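To see the effect of the cleaning rules, the sketch below applies a compact but equivalent version of `clean_email` to a hypothetical raw message (the addresses and URL are made up).

```python
import re

def clean_email(t):
    # Normalize line endings, drop header-like lines, collapse whitespace, lowercase.
    t = t.replace('\r', '\n')
    kept = [ln for ln in t.splitlines()
            if not re.match(r'^(from:|to:|subject:|cc:|bcc:|date:|reply-to:)', ln.strip().lower())]
    return re.sub(r'\s+', ' ', '\n'.join(kept)).strip().lower()

# Hypothetical raw message for illustration.
raw = ("From: alerts@example.com\r\nSubject: Action needed\r\n\r\n"
       "Verify  NOW at http://x.example/login within 24 hours")
cleaned = clean_email(raw)
print(cleaned)
# → verify now at http://x.example/login within 24 hours
```

Because the header check runs on every line, a body line that happens to begin with one of these prefixes (for example `To:`) is removed as well.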

---

### Feature Extraction (TF-IDF)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

tfidf_word = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, max_features=60000)
tfidf_char = TfidfVectorizer(analyzer='char', ngram_range=(3,5), min_df=2, max_features=60000)

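# The vectorizers are fitted in the next cell, after the train/test split,
# so the test emails do not influence the vocabulary or IDF weights.

In [None]:
# === Train/Test Split ===
# Hold out 20% for testing to measure performance on emails the model has not seen
from sklearn.model_selection import train_test_split

X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    X_clean, y,
    test_size=0.20,        # 20% test share
    random_state=42,       # for reproducibility
    stratify=y             # keeps the spam/ham ratio the same in both sets
)

# Fit TF-IDF on the training text only, then apply the same transform to the test text
X_train = hstack([tfidf_word.fit_transform(X_train_txt), tfidf_char.fit_transform(X_train_txt)]).tocsr()
X_test  = hstack([tfidf_word.transform(X_test_txt), tfidf_char.transform(X_test_txt)]).tocsr()

print("Train:", X_train.shape, " | Test:", X_test.shape)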


#### Explanation
- Uses TF-IDF (Term Frequency–Inverse Document Frequency) to turn text into numerical vectors.

- Two types of features are combined:

    - Word-level features: Capture words and short phrases (n-grams).

    - Character-level features: Capture URL and obfuscated word patterns often used in phishing.

- The two matrices are combined using `hstack()` for a rich representation.
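The value of character-level features can be illustrated with plain Python. The sketch below is a simplification: `TfidfVectorizer(analyzer='char')` extracts n-grams from the whole document, spaces included, whereas `char_ngrams` here (a hypothetical helper) works on a single word. It compares a brand name with an obfuscated spelling that a word-level vocabulary would treat as an unrelated token.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """Return the set of character n-grams of length n_min..n_max in text."""
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

genuine = "paypal"
spoofed = "paypa1"   # digit 1 substituted for the letter l

a, b = char_ngrams(genuine), char_ngrams(spoofed)
shared = a & b
jaccard = len(shared) / len(a | b)   # overlap relative to all distinct n-grams

print(sorted(shared))
# → ['ayp', 'aypa', 'pay', 'payp', 'paypa', 'ypa']
print(jaccard)
# → 0.5
```

Half of the distinct n-grams are shared, so the character-level features of the two spellings stay close even though the words differ.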

----

### Train the Model

In [None]:
# === Train the model on the training set only ===
# Important: fit on X_train/y_train, not the full dataset, so the test set stays unseen for evaluation
from sklearn.svm import LinearSVC

model = LinearSVC(class_weight='balanced', max_iter=3000)
model.fit(X_train, y_train)

print("✅ Trained on", X_train.shape[0], "emails with", X_train.shape[1], "features")


In [None]:
# === Quick evaluation: accuracy + precision/recall/F1 on the test set ===
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

y_pred_train = model.predict(X_train)
y_pred_test  = model.predict(X_test)

print("Training Accuracy:", accuracy_score(y_train, y_pred_train))
print("Test Accuracy    :", accuracy_score(y_test,  y_pred_test))

# Treat 1 (phish/spam) as the positive class; y holds 0/1 integers after label mapping
prec, rec, f1, _ = precision_recall_fscore_support(
    y_test, y_pred_test, pos_label=1, average="binary", zero_division=0
)
print(f"Precision: {prec:.4f} | Recall: {rec:.4f} | F1: {f1:.4f}\n")

print("Classification report (test):\n",
      classification_report(y_test, y_pred_test, zero_division=0))
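For reference, the precision, recall and F1 values reported above can be derived from raw counts. The sketch below is a plain-Python illustration, not a replacement for the scikit-learn call; `binary_metrics` and the sample vectors are hypothetical, and a zero denominator yields 0.0, mirroring `zero_division=0`.

```python
def binary_metrics(y_true, y_pred, pos_label=1):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)  # true positives
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)  # false positives
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical labels and predictions (1 = phishing/spam, 0 = ham).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
prec, rec, f1 = binary_metrics(y_true, y_pred)
print(f"Precision: {prec:.4f} | Recall: {rec:.4f} | F1: {f1:.4f}")
# → Precision: 0.7500 | Recall: 0.7500 | F1: 0.7500
```

Here one phishing email is missed (lowering recall) and one legitimate email is flagged (lowering precision).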


### Training Accuracy & Performance Graph
This step computes the accuracy of the fitted model on the training and test sets and plots both as a bar chart.


In [None]:
# === Simple Accuracy Graph ===
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

# Compute accuracy on the training and test sets
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc  = accuracy_score(y_test, model.predict(X_test))

# Draw a simple bar chart comparing train and test accuracy
labels = ["Train", "Test"]
values = [train_acc, test_acc]

plt.bar(labels, values, color=["skyblue", "orange"])
plt.ylim(0, 1.1)  # headroom so the value labels above the bars stay inside the axes
plt.ylabel("Accuracy")
plt.title("Model Accuracy (Train vs Test)")
for i, v in enumerate(values):
    plt.text(i, v + 0.02, f"{v:.3f}", ha="center", fontsize=10)
plt.show()




#### Explanation

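- LinearSVC (a linear Support Vector Machine) trains quickly on high-dimensional sparse features such as TF-IDF, which makes it a common choice for text classification.

- `class_weight='balanced'` weights each class inversely to its frequency, which compensates for imbalance between spam and ham samples.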

- Trains on the TF-IDF features to learn distinguishing patterns.
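The `balanced` setting follows the formula given in the scikit-learn documentation, `n_samples / (n_classes * class_count)`. The sketch below reproduces that arithmetic in plain Python; `balanced_class_weights` and the label vector are hypothetical.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical label vector: 8 ham (0) and 2 spam (1).
y_demo = [0] * 8 + [1] * 2
weights = balanced_class_weights(y_demo)
print(weights)
# → {0: 0.625, 1: 2.5}
```

With these weights, an error on the rarer spam class costs four times as much as an error on the ham class.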

---
## Step 3: Test the Model
You can now input any text or email content to test the model’s prediction.

---

### Define the Test Function

In [None]:
from scipy.sparse import hstack

def check_email(email_text: str):
    cleaned = clean_email(email_text)  # clean once, reuse for both vectorizers
    v_w = tfidf_word.transform([cleaned])
    v_c = tfidf_char.transform([cleaned])
    v   = hstack([v_w, v_c])
    pred = model.predict(v)[0]
    return "PHISHING/SPAM" if pred == 1 else "Safe email"

#### Explanation

- Cleans your input email text.

- Converts it into both word and character TF-IDF feature vectors.

- Feeds it to the trained SVM model to predict phishing (1) or safe (0).
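`check_email` returns only a label. If you also want a rough sense of how far an email sits from the decision boundary, `LinearSVC` exposes `decision_function`. The sketch below is an alternative arrangement, not the notebook's code: it swaps the manual `hstack` for scikit-learn's `FeatureUnion` inside a pipeline, and trains on a tiny hypothetical corpus purely so it can run on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus, only to make the sketch runnable (1 = phishing/spam, 0 = ham).
texts = [
    "verify your account now at http://bank-login.example",
    "your password expires today click http://reset.example",
    "urgent confirm your payment details at http://pay.example",
    "meeting minutes attached for today",
    "lunch tomorrow at noon works for me",
    "here is the draft report you asked for",
]
labels = [1, 1, 1, 0, 0, 0]

# Word and character TF-IDF side by side, equivalent in spirit to hstack([Xw, Xc]).
features = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
])
pipe = make_pipeline(features, LinearSVC(class_weight="balanced", max_iter=3000))
pipe.fit(texts, labels)

# decision_function returns a signed score: positive values fall on the class-1 side,
# and larger magnitudes mean the sample lies further from the boundary.
samples = ["confirm your account at http://login.example", "see you at the meeting"]
scores = pipe.decision_function(samples)
preds = pipe.predict(samples)
for s, score, pred in zip(samples, scores, preds):
    print(f"{score:+.3f}  {pred}  {s}")
```

The score is a margin, not a probability, so it is best used for ranking or for choosing a stricter threshold than 0.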

### Try Example Emails

In [None]:
print(check_email("Your account has been locked. Verify now at http://fakebank.com"))
print(check_email("Hi John, attached are the meeting minutes for today."))
print(check_email("Hey did you check this picture before, its great landscape http://someoddpic.sm"))

#### Expected Output

Predictions depend on the trained model; a typical run prints:

```
PHISHING/SPAM
Safe email
PHISHING/SPAM
```
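`joblib` was installed in Step 1 for saving and loading the model. The sketch below shows one possible way to do that, assuming a single vectorizer and a tiny hypothetical corpus so it runs on its own; the file name `phishing_model.joblib` is also an assumption. In the notebook you would bundle `tfidf_word`, `tfidf_char` and `model` in the same dictionary.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus standing in for the real training data.
texts = ["verify your account now", "click to reset your password",
         "meeting minutes attached", "lunch tomorrow at noon"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = LinearSVC().fit(vectorizer.fit_transform(texts), labels)

# Save the vectorizer together with the model: the classifier is only usable
# with the exact vocabulary and IDF weights it was trained on.
path = os.path.join(tempfile.mkdtemp(), "phishing_model.joblib")  # hypothetical file name
joblib.dump({"vectorizer": vectorizer, "model": model}, path)

# Later (or in another session): load the bundle and predict as before.
bundle = joblib.load(path)
sample = ["please verify your password"]
restored_pred = bundle["model"].predict(bundle["vectorizer"].transform(sample))
original_pred = model.predict(vectorizer.transform(sample))
print(restored_pred, original_pred)
```

Only load joblib files you created yourself or trust, since loading executes pickled code.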

## Summary
| Step                      | Description                                |
| ------------------------- | ------------------------------------------ |
| **1. Setup**              | Install dependencies & authenticate Kaggle |
| **2. Download Data**      | Get a real phishing dataset                |
| **3. Preprocess**         | Clean text and normalize labels            |
| **4. Feature Extraction** | TF-IDF on words + characters               |
| **5. Train Model**        | Linear SVM classifier                      |
| **6. Test Model**         | Predict phishing vs safe emails            |
