### Gaussian Naive Bayes: Assumptions

Gaussian Naive Bayes is based on **Bayes' Theorem** and assumes:

1. **Feature Independence**:
   - Features are conditionally independent given the class label.
   - This means that, once the class label is known, the value of one feature carries no additional information about the value of any other feature.
   - While this is a simplifying assumption, Naive Bayes often performs well even when independence is violated.

2. **Gaussian (Normal) Distribution**:
   - Continuous features are assumed to follow a Gaussian (normal) distribution.
   - The class-conditional probability density function (PDF) of a feature $x$ given class $C$ is:
     $$
     P(x|C) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
     $$
     Where:
     - $ \mu $ : Mean of the feature for the given class.
     - $ \sigma^2 $ : Variance of the feature for the given class.

3. **Class Prior Probabilities**:
   - The model assumes the prior probability of each class is proportional to its frequency in the training data unless specified otherwise.

4. **Feature Contributions**:
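   - Each feature contributes one likelihood term per class, and these terms are multiplied (summed in log space) without any learned feature weights. A feature's influence therefore depends on how well its class-conditional Gaussians separate the classes.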
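
The assumptions above can be made concrete with a minimal from-scratch sketch. It is not scikit-learn's implementation: the helper names (`fit_gnb`, `log_posterior`, `predict`), the toy data, and the fixed `1e-9` variance floor are all assumptions chosen for illustration. It estimates class priors from frequencies, fits one Gaussian per feature and class, and sums per-feature log densities as the independence assumption allows.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[int(label)] = {
            "prior": rows.shape[0] / X.shape[0],   # class frequency in the data
            "mean": rows.mean(axis=0),
            "var": rows.var(axis=0) + 1e-9,        # small floor avoids division by zero
        }
    return params

def log_posterior(x, p):
    """Unnormalized log posterior: log prior plus summed per-feature log densities."""
    log_pdf = -0.5 * np.log(2 * np.pi * p["var"]) - (x - p["mean"]) ** 2 / (2 * p["var"])
    return np.log(p["prior"]) + log_pdf.sum()

def predict(x, params):
    """Return the class label with the highest log posterior."""
    return max(params, key=lambda label: log_posterior(x, params[label]))

# Toy data (hypothetical): class 0 clusters near 0, class 1 clusters near 5
X = np.array([[0.1, 0.3], [-0.2, 0.1], [0.4, -0.3],
              [5.1, 4.8], [4.7, 5.2], [5.3, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

params = fit_gnb(X, y)
print(predict(np.array([0.0, 0.2]), params))
print(predict(np.array([5.0, 5.0]), params))
```

Working in log space avoids numerical underflow when many small densities are multiplied together.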

---

### Practical Considerations
- **Violation of Independence**:
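  - In real-world datasets, features are often correlated, violating the independence assumption. Gaussian Naive Bayes often still classifies well in this setting, but correlated features are effectively counted more than once, which can make its predicted probabilities overconfident.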
  
- **Non-Gaussian Features**:
  - Features that do not follow a Gaussian distribution may degrade performance. In such cases, consider feature transformations (e.g., a log transform for right-skewed features) or a Naive Bayes variant suited to the feature type (e.g., Multinomial for count features or Bernoulli for binary features).
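
One way to gauge how badly the independence assumption is violated is to look at correlations within each class, since the assumption is conditional on the class label. The sketch below uses synthetic data and a hypothetical helper, `max_within_class_corr`; neither comes from this notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
y = rng.integers(0, 2, size=n)            # synthetic binary labels
a = rng.normal(size=n)
b = 0.8 * a + 0.2 * rng.normal(size=n)    # strongly tied to feature a
c = rng.normal(size=n)                    # unrelated to a and b
X = np.column_stack([a, b, c])

def max_within_class_corr(X, y):
    """Largest absolute off-diagonal Pearson correlation inside each class."""
    result = {}
    for label in np.unique(y):
        corr = np.corrcoef(X[y == label], rowvar=False)
        off_diagonal = np.abs(corr[~np.eye(corr.shape[0], dtype=bool)])
        result[int(label)] = float(off_diagonal.max())
    return result

print(max_within_class_corr(X, y))
```

Values near 1 flag feature pairs whose evidence the model will effectively count twice. Pearson correlation only captures linear dependence, so a low value does not prove independence.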

In [1]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd
import numpy as np
from tqdm import tqdm

# Load preprocessed data
X_train = pd.read_csv("../../data/preprocessed_phishing/naive_bayes/nb_X_train.csv")
X_test = pd.read_csv("../../data/preprocessed_phishing/naive_bayes/nb_X_test.csv")
y_train = pd.read_csv("../../data/preprocessed_phishing/naive_bayes/nb_y_train.csv")
y_test = pd.read_csv("../../data/preprocessed_phishing/naive_bayes/nb_y_test.csv")

In [4]:
# Step 1: Initialize Gaussian Naive Bayes
gnb = GaussianNB()

# Step 2: Train the model with a progress bar
print("Training Gaussian Naive Bayes...")
for _ in tqdm(range(1), desc="Training Progress"):
    gnb.fit(X_train, y_train.values.ravel())  # Flatten target if needed

# Step 3: Evaluate the model
print("\nEvaluating Gaussian Naive Bayes...")
y_pred = gnb.predict(X_test)

# Step 4: Generate classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nAccuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(report)

Training Gaussian Naive Bayes...


Training Progress: 100%|██████████| 1/1 [00:00<00:00, 41.70it/s]


Evaluating Gaussian Naive Bayes...

Accuracy: 0.6761

Classification Report:
              precision    recall  f1-score   support

  Legitimate       0.61      0.99      0.75     10000
    Phishing       0.98      0.36      0.53     10000

    accuracy                           0.68     20000
   macro avg       0.80      0.68      0.64     20000
weighted avg       0.80      0.68      0.64     20000






### Gaussian Naive Bayes: Results Analysis

**Accuracy**:
- The overall accuracy of **82.26%** indicates reasonable performance in classifying the data.

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 0.75
     - Of all samples predicted as legitimate, 75% were correctly classified.
   - **Recall**: 0.97
     - The model correctly identified 97% of all legitimate samples.
   - **F1-Score**: 0.84
     - A good balance between precision and recall, showing strong performance for legitimate samples.

2. **Phishing (Class 1)**:
   - **Precision**: 0.96
     - Of all samples predicted as phishing, 96% were correctly classified.
   - **Recall**: 0.68
     - The model identified 68% of all phishing samples, indicating some difficulty in capturing all phishing cases.
   - **F1-Score**: 0.79
     - Reflects a performance trade-off, with higher precision but lower recall.

3. **Weighted Averages**:
   - Reflect the dataset's class distribution and overall model performance, showing an acceptable but improvable balance between precision and recall.
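
The per-class figures above all derive from a confusion matrix. The sketch below shows the arithmetic using hypothetical counts, chosen only so that the rounded precision and recall land near the values discussed; they are not the notebook's actual confusion matrix.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for 10,000 legitimate and 10,000 phishing samples
legit_as_legit, legit_as_phish = 9700, 300
phish_as_phish, phish_as_legit = 6800, 3200

# For the legitimate class, a phishing sample predicted legitimate is a false positive
legit = class_metrics(tp=legit_as_legit, fp=phish_as_legit, fn=legit_as_phish)
phish = class_metrics(tp=phish_as_phish, fp=legit_as_phish, fn=phish_as_legit)
accuracy = (legit_as_legit + phish_as_phish) / 20000

print("Legitimate (precision, recall, f1):", [round(v, 2) for v in legit])
print("Phishing   (precision, recall, f1):", [round(v, 2) for v in phish])
print("Accuracy:", accuracy)
```

The same misclassified phishing samples lower both legitimate precision and phishing recall, which is why those two numbers move together.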

---

### Insights
1. **Imbalance in Recall**:
   - The model performs better in identifying legitimate samples (high recall) but struggles to capture all phishing cases (low recall for phishing).

2. **Feature-Model Alignment**:
   - Gaussian Naive Bayes assumes feature independence and Gaussian distributions, which might not hold perfectly for this dataset, impacting performance.


In [3]:
# Initialize Gaussian Naive Bayes with refined parameters
gnb = GaussianNB(var_smoothing=1e-9)  # 1e-9 is scikit-learn's default; set explicitly for clarity

# Adjust sample weights
sample_weights = np.where(y_train.values.ravel() == 1, 1.5, 1.0)  # Phishing class gets higher weight

print("Training Gaussian Naive Bayes with refined parameters...")
for _ in tqdm(range(1), desc="Training Progress"):
    gnb.fit(X_train, y_train.values.ravel(), sample_weight=sample_weights)

# Evaluate the model on the test set
print("\nEvaluating Gaussian Naive Bayes...")
y_pred = gnb.predict(X_test)

# Calculate accuracy and classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nAccuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(report)

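# Perform k-fold cross-validation to detect overfitting
# Note: cross_val_score clones gnb and refits it without sample_weight,
# so these scores describe the unweighted model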
print("\nPerforming k-Fold Cross-Validation...")
cv_scores = cross_val_score(gnb, X_train, y_train.values.ravel(), cv=10, scoring='accuracy')

print("\nCross-Validation Accuracy Scores:", cv_scores)
print("Mean Accuracy:", np.mean(cv_scores))
print("Standard Deviation:", np.std(cv_scores))

Training Gaussian Naive Bayes with refined parameters...


Training Progress: 100%|██████████| 1/1 [00:00<00:00, 27.25it/s]


Evaluating Gaussian Naive Bayes...






Accuracy: 0.8403

Classification Report:
              precision    recall  f1-score   support

  Legitimate       0.77      0.97      0.86      9922
    Phishing       0.96      0.71      0.82     10078

    accuracy                           0.84     20000
   macro avg       0.86      0.84      0.84     20000
weighted avg       0.87      0.84      0.84     20000


Performing k-Fold Cross-Validation...

Cross-Validation Accuracy Scores: [0.825125 0.810375 0.821375 0.83175  0.819625 0.819    0.821375 0.828375
 0.835    0.819625]
Mean Accuracy: 0.8231624999999999
Standard Deviation: 0.006761344633281163


### Refined Gaussian Naive Bayes: Results Analysis

**Accuracy**:
- The overall accuracy improved to **84.03%** (from 82.26%), indicating better performance after assigning a higher sample weight (1.5) to phishing samples.

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 0.77  
     Of all samples predicted as legitimate, 77% were correct.
   - **Recall**: 0.97  
     The model identified 97% of legitimate samples correctly.
   - **F1-Score**: 0.86  
     Reflects strong performance in classifying legitimate cases.

2. **Phishing (Class 1)**:
   - **Precision**: 0.96  
     Of all samples predicted as phishing, 96% were correct.
   - **Recall**: 0.71  
     An improvement over the previous model, where phishing recall was 68%.
   - **F1-Score**: 0.82  
     Shows better handling of phishing cases with a balance between precision and recall.

3. **Macro Average**:
   - Precision (0.86), recall (0.84), and F1-score (0.84) average around **84-86%**, showing an overall improvement.

---

### Cross-Validation Results

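1. **Mean Accuracy**:
   - Cross-validation mean accuracy: **82.32%**
   - Accuracy is consistent across data splits. Because `cross_val_score` refits the estimator without `sample_weight`, this figure reflects the unweighted model rather than the weighted one evaluated on the test set.

2. **Standard Deviation**:
   - Standard deviation: **0.0068**
   - Indicates low fold-to-fold variability, so the accuracy estimate is stable across splits.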
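
To cross-validate the weighted model itself, one option is to loop over the folds manually and pass `sample_weight` to `fit` in each fold. The following is a minimal sketch on synthetic stand-in data; the helper name `weighted_cv_accuracy`, the data, and the fold count are assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic stand-in data (hypothetical): two overlapping Gaussian blobs
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(1.5, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

def weighted_cv_accuracy(X, y, pos_weight=1.5, n_splits=5, seed=0):
    """Stratified k-fold accuracy where class 1 gets a higher training weight."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        # Weights are built from the training fold only
        weights = np.where(y[train_idx] == 1, pos_weight, 1.0)
        model = GaussianNB()
        model.fit(X[train_idx], y[train_idx], sample_weight=weights)
        scores.append(model.score(X[val_idx], y[val_idx]))
    return np.array(scores)

scores = weighted_cv_accuracy(X, y)
print("Fold accuracies:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```

Validation folds are scored without weights here, so the result stays comparable to plain test-set accuracy.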

---

### Insights

1. **Improved Recall for Phishing**:
   - Weighting the phishing class raised phishing recall from 0.68 to 0.71. This narrows, but does not close, the gap with legitimate recall (0.97).

2. **Generalization**:
   - Cross-validation results confirm the model generalizes well across different data splits.

3. **Further Refinement Opportunities**:
   - Recall for phishing, while improved, can be further enhanced through feature transformations or advanced engineering techniques.

---


### Next Steps for Refining Gaussian Naive Bayes

1. **Testing Further**:
   - Testing the robustness of the model is crucial to ensure its reliability in real-world scenarios. 
   - By introducing noise into the dataset, we can simulate imperfect or noisy environments and evaluate how well the model maintains its classification performance. 
   - This step helps identify potential weaknesses in the model and ensures it can handle real-world data variability.

2. **Feature Transformation**:
   - To better align features with Gaussian Naive Bayes assumptions, we will experiment with feature transformations:
     - **Logarithmic Transformation**: Reduces skewness in highly skewed features, making their distribution more Gaussian-like.
     - **Polynomial Features**: Captures interactions between features and introduces non-linear components that the model may otherwise miss.
     - **Feature Interaction Terms**: Helps uncover relationships between features that may contribute to classification performance.
   - These transformations aim to enhance the discriminative power of features while maintaining compatibility with Gaussian Naive Bayes.

3. **Final Refinement**:
   - After applying feature transformations, the refined dataset will be combined with the adjusted class weighting from previous steps.
   - The final model will be evaluated on its ability to balance precision, recall, and overall accuracy across both classes.
   - Cross-validation will be repeated to ensure the model generalizes well and is not overfitting due to the transformations or weighting adjustments.

In [4]:
# Create a noisy version of the training and testing datasets
X_train_noisy = X_train.copy()
X_test_noisy = X_test.copy()

# Add random Gaussian noise to the features
noise_train = np.random.normal(0, 0.05, X_train_noisy.shape)  # Mean=0, Std=0.05
noise_test = np.random.normal(0, 0.05, X_test_noisy.shape)

X_train_noisy += noise_train
X_test_noisy += noise_test

# Train Gaussian Naive Bayes on the noisy data
print("\nTraining Gaussian Naive Bayes on noisy data...")
gnb_noisy = GaussianNB(var_smoothing=1e-9)  # Use the same refined parameters as before
gnb_noisy.fit(X_train_noisy, y_train.values.ravel())

# Evaluate the model on noisy test data
print("\nEvaluating Gaussian Naive Bayes on noisy data...")
y_pred_noisy = gnb_noisy.predict(X_test_noisy)

# Calculate accuracy and classification report
accuracy_noisy = accuracy_score(y_test, y_pred_noisy)
report_noisy = classification_report(y_test, y_pred_noisy, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nAccuracy on Noisy Data: {accuracy_noisy:.4f}")
print("\nClassification Report on Noisy Data:")
print(report_noisy)


Training Gaussian Naive Bayes on noisy data...

Evaluating Gaussian Naive Bayes on noisy data...

Accuracy on Noisy Data: 0.8258

Classification Report on Noisy Data:
              precision    recall  f1-score   support

  Legitimate       0.75      0.97      0.85      9922
    Phishing       0.96      0.69      0.80     10078

    accuracy                           0.83     20000
   macro avg       0.85      0.83      0.82     20000
weighted avg       0.85      0.83      0.82     20000



### Gaussian Naive Bayes: Results on Noisy Data

**Accuracy**:
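- The model achieved an accuracy of **82.58%** on the noisy dataset, slightly below the 84.03% of the weighted model on clean data, indicating moderate robustness to noise. The noisy model was trained without sample weights, so part of this gap may stem from the missing weighting rather than from the noise itself.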

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 0.75  
     Of all samples predicted as legitimate, 75% were correctly classified.
   - **Recall**: 0.97  
     The model identified 97% of legitimate samples correctly.
   - **F1-Score**: 0.85  
     Demonstrates strong performance in identifying legitimate cases, even with noise.

2. **Phishing (Class 1)**:
   - **Precision**: 0.96  
     Of all samples predicted as phishing, 96% were correctly classified.
   - **Recall**: 0.69  
     The model captured 69% of phishing cases, slightly below the 71% of the weighted model on clean data.
   - **F1-Score**: 0.80  
     Indicates that phishing classification is moderately affected by noise.

3. **Macro Average**:
   - Precision, recall, and F1-scores averaged around **82%-85%**, showing a balanced but slightly degraded performance compared to clean data.

---

### Insights
1. **Robustness to Noise**:
   - The model's performance declined slightly with the addition of noise, particularly in the recall for phishing cases.
   - This suggests that the features are moderately sensitive to noise, but the model maintains reasonable stability.

2. **Next Focus**:
   - Enhancing feature distributions through transformations may improve robustness and performance, especially for phishing cases.
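
How much a fixed noise level perturbs a feature depends on that feature's scale. The sketch below divides the noise standard deviation used above (0.05) by a few per-feature standard deviations taken from the feature summary statistics reported later in this notebook. The selection of features is arbitrary and the ratio is only a rough indicator.

```python
noise_std = 0.05  # standard deviation of the injected Gaussian noise

# Standard deviations as reported in the feature summary statistics
feature_std = {
    "url_length": 84.201335,
    "dot_count": 1.716658,
    "url_entropy": 0.644900,
    "has_punycode": 0.022074,
}

# Ratio of noise scale to feature scale: larger means the noise matters more
ratios = {name: noise_std / std for name, std in feature_std.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:>14}: noise/std = {ratio:.4f}")
```

On this measure the noise is negligible for `url_length` but larger than the natural spread of `has_punycode`, so a single noise level stresses the features very unevenly.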

---

### Next Steps
Proceed with **feature transformations** to better align features with Gaussian assumptions and potentially improve performance.

### Identifying Skewness in Feature Distributions

Before refining the dataset for Gaussian Naive Bayes, it is essential to understand the statistical properties of the features. In this code block, we will:

1. **Calculate Feature Statistics**:
   - Compute the mean, standard deviation, skewness, minimum, and maximum for each feature in the training dataset.
   - These metrics provide a comprehensive overview of the feature distributions.

2. **Identify Skewed Features**:
   - Features with high skewness (absolute skewness > 1) deviate significantly from a normal distribution.
   - Such features may require transformations (e.g., logarithmic scaling) to align better with Gaussian assumptions.

3. **Focus on Refinement**:
   - This analysis will guide the selective refinement process, ensuring that only highly skewed features are transformed while preserving features that already align well with Gaussian assumptions.

By running this code block, you will gain insights into the distributions of all features and identify candidates for refinement.

In [5]:
# Calculate skewness and other statistics
skewness = X_train.skew()
std_dev = X_train.std()
mean = X_train.mean()
min_val = X_train.min()
max_val = X_train.max()

# Combine results into a summary DataFrame
summary_stats = pd.DataFrame({
    "Mean": mean,
    "Standard Deviation": std_dev,
    "Skewness": skewness,
    "Minimum": min_val,
    "Maximum": max_val
})

# Display the results
print("Feature Summary Statistics:")
print(summary_stats)

# Identify features with high skewness
highly_skewed = summary_stats[summary_stats["Skewness"].abs() > 1]
print("\nFeatures with High Skewness (|Skewness| > 1):")
print(highly_skewed)

Feature Summary Statistics:
                           Mean  Standard Deviation   Skewness   Minimum  \
url_length            49.456725           84.201335  17.224523  5.000000   
starts_with_ip         0.006800            0.082182  12.002961  0.000000   
url_entropy            3.941826            0.644900   0.216875  1.007621   
has_punycode           0.000487            0.022074  45.258804  0.000000   
digit_letter_ratio     0.113286            0.221263   4.402489  0.000000   
dot_count              2.248162            1.716658   8.039215  1.000000   
at_count               0.011362            0.127215  23.684375  0.000000   
dash_count             0.778250            1.590694   7.999229  0.000000   
tld_count              0.034688            0.391486  52.986467  0.000000   
domain_has_digits      0.093063            0.290522   2.801495  0.000000   
subdomain_count        0.883288            1.156269   4.966115  0.000000   
nan_char_entropy       0.458588            0.183485   0.4938

### Analysis of Feature Distributions for Refinement

#### Observations on Feature Distributions

1. **Highly Skewed Features**:
   - Several features exhibit high skewness (absolute skewness > 1), indicating significant deviation from a normal distribution:
     - `url_length` (Skewness: 17.22)
     - `starts_with_ip` (Skewness: 12.00)
     - `has_punycode` (Skewness: 45.26)
     - `digit_letter_ratio` (Skewness: 4.40)
     - `dot_count` (Skewness: 8.04)
     - `at_count` (Skewness: 23.68)
     - `dash_count` (Skewness: 8.00)
     - `tld_count` (Skewness: 52.99)
     - `domain_has_digits` (Skewness: 2.80)
     - `subdomain_count` (Skewness: 4.97)
     - `has_internal_links` (Skewness: 6.35)

2. **Moderate and Low Skewness**:
   - Features like `url_entropy` (Skewness: 0.22) and `domain_age_days` (Skewness: 0.41) are closer to a normal distribution and may not need transformations.

3. **Binary Features**:
   - Features like `starts_with_ip` and `has_punycode` are binary (`0` or `1`) and do not require transformations, as their distribution is inherently discrete.
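
For a 0/1 feature, skewness is fully determined by the share of ones, which explains why rare binary flags show extreme values. The sketch below applies the Bernoulli skewness formula $(1 - 2p) / \sqrt{p(1 - p)}$ to the means from the summary table. The feature selection is illustrative, and it assumes `domain_has_digits` is also a 0/1 flag, as its name and statistics suggest.

```python
import math

def bernoulli_skewness(p):
    """Skewness of a 0/1 variable whose probability of 1 is p."""
    return (1 - 2 * p) / math.sqrt(p * (1 - p))

# Means taken from the feature summary statistics
means = {
    "starts_with_ip": 0.006800,
    "has_punycode": 0.000487,
    "domain_has_digits": 0.093063,
}

for name, p in means.items():
    print(f"{name}: skewness ~ {bernoulli_skewness(p):.2f}")
```

The results land close to the reported skewness values. Small differences are expected because the means are rounded and pandas' `skew()` applies a sample-size bias correction.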

---

#### Why Refinement is Necessary

1. **Gaussian Assumptions**:
   - Gaussian Naive Bayes assumes that each feature follows a **normal distribution**.
   - Features with high skewness deviate significantly from this assumption, which may lead to inaccurate probability estimations.

2. **Improving Robustness**:
   - Transforming skewed features (e.g., via logarithmic scaling) can make their distribution more Gaussian-like, improving the model's overall performance and robustness.

3. **Focus on Critical Features**:
   - Refining only highly skewed features helps maintain the integrity of well-distributed features, preventing unnecessary transformations that could degrade performance.

---

#### Recommendations for Refinement

1. **Logarithmic Transformations**:
   - Apply to features with extreme skewness and large ranges, such as `url_length`, `dot_count`, and `dash_count`.

2. **Retain Binary Features**:
   - Leave features like `starts_with_ip` and `has_punycode` unchanged, as they are already binary and not continuous.

3. **Selective Application**:
   - Focus transformations on features with absolute skewness > 2, ensuring that the refinement aligns with Gaussian assumptions without overcomplicating the feature space.

By addressing these points, we aim to align the feature distributions better with Gaussian Naive Bayes' assumptions, improving the model's ability to classify phishing and legitimate cases accurately.
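
The effect of a log transform on skewness can be checked on synthetic data before touching the real features. The sketch below draws a right-skewed log-normal sample (an assumption standing in for a feature such as `url_length`) and compares skewness before and after `np.log1p`. The `skewness` helper uses the plain population formula rather than pandas' bias-corrected estimator.

```python
import numpy as np

def skewness(values):
    """Population skewness: third central moment divided by std cubed."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # synthetic right-skewed feature
transformed = np.log1p(raw)                            # log(1 + x), defined at x = 0

print(f"Skewness before: {skewness(raw):.2f}")
print(f"Skewness after:  {skewness(transformed):.2f}")
```

The same check on each real feature shows whether the transform actually helps that feature.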

In [6]:
# Create a copy of the training and testing datasets for transformation
X_train_transformed = X_train.copy()
X_test_transformed = X_test.copy()

# Features with high skewness identified earlier
# Note: domain_has_digits appears to be binary, so log1p only maps its values 0/1 to 0/log(2)
skewed_features = [
    "url_length", "digit_letter_ratio", "dot_count", 
    "dash_count", "subdomain_count", "domain_has_digits"
]

# Apply log transformation to reduce skewness
for feature in skewed_features:
    X_train_transformed[feature] = np.log1p(X_train_transformed[feature])  # log1p avoids log(0)
    X_test_transformed[feature] = np.log1p(X_test_transformed[feature])

# Check the transformed data
print("Transformations applied to highly skewed features:")
print(skewed_features)

Transformations applied to highly skewed features:
['url_length', 'digit_letter_ratio', 'dot_count', 'dash_count', 'subdomain_count', 'domain_has_digits']


In [7]:
# Train Gaussian Naive Bayes on transformed data
print("\nTraining Gaussian Naive Bayes on transformed data...")
gnb_transformed = GaussianNB(var_smoothing=1e-9)  # Retain refined hyperparameter
gnb_transformed.fit(X_train_transformed, y_train.values.ravel())

# Evaluate the model on the transformed test data
print("\nEvaluating Gaussian Naive Bayes on transformed data...")
y_pred_transformed = gnb_transformed.predict(X_test_transformed)

# Calculate accuracy and classification report
accuracy_transformed = accuracy_score(y_test, y_pred_transformed)
report_transformed = classification_report(y_test, y_pred_transformed, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nAccuracy on Transformed Data: {accuracy_transformed:.4f}")
print("\nClassification Report on Transformed Data:")
print(report_transformed)


Training Gaussian Naive Bayes on transformed data...

Evaluating Gaussian Naive Bayes on transformed data...

Accuracy on Transformed Data: 0.9235

Classification Report on Transformed Data:
              precision    recall  f1-score   support

  Legitimate       0.89      0.96      0.93      9922
    Phishing       0.96      0.88      0.92     10078

    accuracy                           0.92     20000
   macro avg       0.93      0.92      0.92     20000
weighted avg       0.93      0.92      0.92     20000



### Gaussian Naive Bayes: Results on Transformed Data

**Accuracy**:
- The model achieved an impressive accuracy of **92.35%** on the transformed dataset, marking a significant improvement over previous evaluations.

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 0.89  
     Of all samples predicted as legitimate, 89% were correct.
   - **Recall**: 0.96  
     The model identified 96% of legitimate samples correctly.
   - **F1-Score**: 0.93  
     Demonstrates strong performance in classifying legitimate cases.

2. **Phishing (Class 1)**:
   - **Precision**: 0.96  
     Of all samples predicted as phishing, 96% were correct.
   - **Recall**: 0.88  
     The model captured 88% of phishing cases, a noticeable improvement.
   - **F1-Score**: 0.92  
     Reflects a well-balanced performance for phishing classification.

3. **Macro and Weighted Averages**:
   - Both averages are approximately **92%**, showcasing a consistent and balanced model across classes.

---

### Insights

1. **Impact of Transformations**:
   - The log transformations applied to highly skewed features significantly improved the model's performance.
   - By better aligning feature distributions with Gaussian assumptions, the model achieved higher recall for phishing cases without compromising legitimate case detection.

2. **Balanced Performance**:
   - The precision-recall balance indicates that the model is effective in identifying both legitimate and phishing cases.

3. **Comparison**:
   - The transformed dataset outperformed both the original and noisy datasets, suggesting that feature refinement was critical for enhancing the model.

---

### Next Steps
1. **Final Validation**:
   - Perform cross-validation on the transformed dataset to ensure generalization.

2. **Stress Testing with Noise**:
   - Evaluate the model’s performance by introducing noise into the transformed dataset to test its robustness further.


In [8]:
# Perform cross-validation on the transformed training data
print("\nPerforming 10-Fold Cross-Validation on Transformed Data...")
cv_scores_transformed = cross_val_score(
    gnb_transformed, 
    X_train_transformed, 
    y_train.values.ravel(), 
    cv=10, 
    scoring='accuracy'
)

# Display cross-validation results
print("\nCross-Validation Accuracy Scores:", cv_scores_transformed)
print("Mean Accuracy:", np.mean(cv_scores_transformed))
print("Standard Deviation:", np.std(cv_scores_transformed))


Performing 10-Fold Cross-Validation on Transformed Data...

Cross-Validation Accuracy Scores: [0.92725  0.919375 0.91825  0.927875 0.921875 0.92125  0.924875 0.926125
 0.93075  0.922625]
Mean Accuracy: 0.924025
Standard Deviation: 0.003805752225250599


### Gaussian Naive Bayes: Cross-Validation Results on Transformed Data

**Cross-Validation Accuracy Scores**:
- Individual Fold Scores:  
  `[0.92725, 0.919375, 0.91825, 0.927875, 0.921875, 0.92125, 0.924875, 0.926125, 0.93075, 0.922625]`

**Mean Accuracy**:
- The model achieved a mean accuracy of **92.40%** across 10 folds, indicating consistent performance.

**Standard Deviation**:
- The standard deviation of **0.0038** reflects minimal variance between folds, showcasing excellent generalization and stability.

---

### Insights

1. **Consistency Across Folds**:
   - The small standard deviation highlights that the model performs reliably on different splits of the dataset.

2. **Generalization Ability**:
   - The mean cross-validation accuracy (92.40%) closely matches the held-out test accuracy (92.35%), which suggests the model generalizes to unseen data rather than overfitting the training split.
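
The summary figures can be reproduced from the fold scores listed above using only the standard library. The standard error at the end is an addition for illustration and only a rough guide, because cross-validation folds share training data and are not independent.

```python
import math
import statistics

# Fold scores copied from the cross-validation output above
fold_scores = [0.92725, 0.919375, 0.91825, 0.927875, 0.921875,
               0.92125, 0.924875, 0.926125, 0.93075, 0.922625]

mean_acc = statistics.fmean(fold_scores)
pop_std = statistics.pstdev(fold_scores)     # population std, like np.std's default (ddof=0)
sample_std = statistics.stdev(fold_scores)   # sample std (ddof=1)
std_error = sample_std / math.sqrt(len(fold_scores))

print(f"Mean accuracy:      {mean_acc:.6f}")
print(f"Population std:     {pop_std:.6f}")
print(f"Approx. std error:  {std_error:.6f}")
```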

---


In [9]:
# Add Gaussian noise to the transformed features
X_train_transformed_noisy = X_train_transformed + np.random.normal(0, 0.05, X_train_transformed.shape)
X_test_transformed_noisy = X_test_transformed + np.random.normal(0, 0.05, X_test_transformed.shape)

# Train Gaussian Naive Bayes on noisy transformed data
print("\nTraining Gaussian Naive Bayes on noisy transformed data...")
gnb_transformed_noisy = GaussianNB(var_smoothing=1e-9)
gnb_transformed_noisy.fit(X_train_transformed_noisy, y_train.values.ravel())

# Evaluate the model on noisy transformed test data
print("\nEvaluating Gaussian Naive Bayes on noisy transformed data...")
y_pred_transformed_noisy = gnb_transformed_noisy.predict(X_test_transformed_noisy)

# Calculate accuracy and classification report
accuracy_transformed_noisy = accuracy_score(y_test, y_pred_transformed_noisy)
report_transformed_noisy = classification_report(y_test, y_pred_transformed_noisy, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nAccuracy on Noisy Transformed Data: {accuracy_transformed_noisy:.4f}")
print("\nClassification Report on Noisy Transformed Data:")
print(report_transformed_noisy)


Training Gaussian Naive Bayes on noisy transformed data...

Evaluating Gaussian Naive Bayes on noisy transformed data...

Accuracy on Noisy Transformed Data: 0.9192

Classification Report on Noisy Transformed Data:
              precision    recall  f1-score   support

  Legitimate       0.89      0.96      0.92      9922
    Phishing       0.95      0.88      0.92     10078

    accuracy                           0.92     20000
   macro avg       0.92      0.92      0.92     20000
weighted avg       0.92      0.92      0.92     20000



### Gaussian Naive Bayes: Stress Test Results on Noisy Transformed Data

**Accuracy**:
- The model achieved an accuracy of **91.92%** on the noisy transformed dataset, maintaining high performance even under stress.

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 0.89  
     Of all samples predicted as legitimate, 89% were correct.
   - **Recall**: 0.96  
     The model identified 96% of legitimate cases correctly.
   - **F1-Score**: 0.92  
     Reflects consistent performance despite added noise.

2. **Phishing (Class 1)**:
   - **Precision**: 0.95  
     Of all samples predicted as phishing, 95% were correct.
   - **Recall**: 0.88  
     The model captured 88% of phishing cases, showcasing robustness.
   - **F1-Score**: 0.92  
     Highlights balanced performance for phishing classification.

3. **Macro and Weighted Averages**:
   - Both averages remain steady at **92%**, demonstrating the model’s resilience to noisy data.

---

### Final Insights

1. **Robustness to Noise**:
   - The model handled noisy conditions effectively, with minimal performance degradation.

2. **Generalization**:
   - Cross-validation and stress testing confirm the model generalizes well to unseen and imperfect data.

3. **Feature Refinement Success**:
   - The transformations applied to highly skewed features improved the alignment with Gaussian assumptions, significantly enhancing performance.

---