In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    mean_squared_error,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.utils import resample
from tqdm import tqdm

# Step 1: Load the preprocessed data
X_train_lr = pd.read_csv("../../data/preprocessed_phishing/l_regression/lr_X_train.csv")
X_test_lr = pd.read_csv("../../data/preprocessed_phishing/l_regression/lr_X_test.csv")
y_train_lr = pd.read_csv("../../data/preprocessed_phishing/l_regression/lr_y_train.csv").values.ravel()  # Ensure target is 1D
y_test_lr = pd.read_csv("../../data/preprocessed_phishing/l_regression/lr_y_test.csv").values.ravel()    # Ensure target is 1D

### Linear Regression Implementation


**Model Initialization:**

A linear regression model is initialized with the default configuration to model the relationship between the features and the target variable.

**Training the Model:**

The model is trained on the training dataset (`lr_X_train`, `lr_y_train`), estimating a coefficient for each feature.

**Model Evaluation:**

Predictions are made on both the training and testing datasets. The evaluation metric is:
- **Mean Squared Error (MSE)**: the average squared difference between predicted and actual values, computed for both the training and testing sets.

If PCA is applied, the regression results are visualized along the first principal component.
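
As a minimal sketch of the metric described above, MSE is the mean of the squared residuals. The arrays below are made-up illustration values, not taken from the phishing dataset. The 0.5 threshold for turning continuous outputs into class labels is an assumption, based on the target being a 0/1 label.

```python
import numpy as np

# Hypothetical 0/1 labels and continuous linear-regression outputs (illustration only)
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.8, 0.9, 0.2])

# MSE: mean of squared differences between targets and predictions
mse = np.mean((y_true - y_pred) ** 2)

# Assumed convention: threshold the continuous output at 0.5 to get a class label
labels = (y_pred >= 0.5).astype(int)
accuracy = np.mean(labels == y_true)

print(f"MSE: {mse:.4f}, thresholded accuracy: {accuracy:.2f}")
```

A low MSE on 0/1 targets means the continuous outputs sit close to the labels, which is why a thresholded linear model can already classify well on such data.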

In [16]:
# Linear Regression Model
lin_reg = LinearRegression()
lin_reg.fit(X_train_lr, y_train_lr)

# Predictions and MSE for training and test sets
y_train_pred_linear = lin_reg.predict(X_train_lr)
y_test_pred_linear = lin_reg.predict(X_test_lr)
mse_train_linear = mean_squared_error(y_train_lr, y_train_pred_linear)
mse_test_linear = mean_squared_error(y_test_lr, y_test_pred_linear)


# Results Summary
linear_results = {
    "Train MSE": mse_train_linear,
    "Test MSE": mse_test_linear,
}

linear_results

{'Train MSE': 0.03058393252753603, 'Test MSE': 0.030813566363399224}

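The training and test MSE are nearly identical (about 0.0306 and 0.0308), so the linear model shows no sign of overfitting and generalizes well to unseen data. The low error suggests that the relationship between the features and the target is close to linear, with moderate noise.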

### Logistic Regression Implementation

In this block, we implement a **Logistic Regression** model using the preprocessed dataset. The steps include:

1. **Loading Preprocessed Data**:
   - The data is split into training (`lr_X_train`, `lr_y_train`) and testing (`lr_X_test`, `lr_y_test`) subsets.
   - Logistic regression requires all features to be numerical and scaled, which has been ensured during preprocessing.

2. **Model Initialization**:
   - A logistic regression model is initialized with the following parameters:
     - `max_iter=1000`: Allows the model sufficient iterations to converge for larger datasets.
     - `random_state=42`: Ensures reproducibility of results.

3. **Training the Model**:
   - The model is trained on the training dataset, learning the relationship between features and the target variable.

4. **Model Evaluation**:
   - Predictions are made on the test set.
   - Accuracy and a detailed classification report (precision, recall, F1-score) are generated to evaluate performance.

5. **Cross-Validation**:
   - A 10-fold cross-validation is performed to assess the model's generalization ability across different splits of the training dataset.
   - Metrics include mean accuracy and standard deviation, indicating consistency across folds.
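
The precision, recall, and F1-score listed in the evaluation step all derive from confusion-matrix counts. The sketch below shows the arithmetic for one class; the helper name `precision_recall_f1` and the counts are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for the positive (phishing) class
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

`classification_report` applies the same formulas per class and then averages them (unweighted for the macro average, support-weighted for the weighted average).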

In [None]:
# Logistic Regression

# Step 2: Initialize Logistic Regression
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Step 3: Train the model with a progress bar
print("Training Logistic Regression...")
for _ in tqdm(range(1), desc="Training Progress"):
    log_reg.fit(X_train_lr, y_train_lr)

# Step 4: Evaluate the model on the test set
print("\nEvaluating Logistic Regression...")
y_pred_lr = log_reg.predict(X_test_lr)

# Step 5: Generate classification report
accuracy_lr = accuracy_score(y_test_lr, y_pred_lr)
report_lr = classification_report(y_test_lr, y_pred_lr, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nAccuracy: {accuracy_lr:.4f}")
print("\nClassification Report:")
print(report_lr)

# Step 6: Perform cross-validation
print("\nPerforming 10-Fold Cross-Validation...")
cv_scores_lr = cross_val_score(log_reg, X_train_lr, y_train_lr, cv=10, scoring='accuracy')

# Display cross-validation results
print("\nCross-Validation Accuracy Scores:", cv_scores_lr)
print("Mean Accuracy:", cv_scores_lr.mean())
print("Standard Deviation:", cv_scores_lr.std())

Training Logistic Regression...


Training Progress:   0%|          | 0/1 [00:00<?, ?it/s]

Training Progress: 100%|██████████| 1/1 [00:01<00:00,  1.08s/it]



Evaluating Logistic Regression...

Accuracy: 1.0000

Classification Report:
              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00      9922
    Phishing       1.00      1.00      1.00     10078

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000


Performing 10-Fold Cross-Validation...

Cross-Validation Accuracy Scores: [1.       1.       1.       1.       1.       1.       0.999875 1.
 1.       1.      ]
Mean Accuracy: 0.9999874999999999
Standard Deviation: 3.750000000001252e-05


### Logistic Regression: Results and Analysis

**Accuracy**:
- The model achieved a perfect accuracy of **100%** on the test set, demonstrating excellent performance in classifying both legitimate and phishing cases.

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 1.00  
     All predictions for legitimate cases were correct.
   - **Recall**: 1.00  
     The model identified all legitimate cases without error.
   - **F1-Score**: 1.00  
     Reflects a perfect balance between precision and recall.

2. **Phishing (Class 1)**:
   - **Precision**: 1.00  
     All phishing predictions were correct.
   - **Recall**: 1.00  
     The model captured all phishing cases perfectly.
   - **F1-Score**: 1.00  
     Indicates flawless performance for phishing classification.

3. **Macro and Weighted Averages**:
   - Both averages are **1.00**, confirming balanced and exceptional classification across both classes.

---

**Cross-Validation Results**:
1. **Accuracy Scores**:
   - Individual fold scores are consistently **1.00**, with a single fold scoring **0.999875**.
2. **Mean Accuracy**:
   - The mean cross-validation accuracy is **99.99875%**, demonstrating remarkable generalization ability.
3. **Standard Deviation**:
   - The standard deviation of **0.0000375** reflects extremely low variance, indicating consistent performance across all folds.

---

### Insights

1. **Exceptional Performance**:
   - The logistic regression model achieves perfect classification on the test set and near-perfect accuracy in cross-validation (nine folds at 1.00 and one at 0.999875), with negligible variance across folds.

2. **Dataset Characteristics**:
   - The high performance suggests that the features provide sufficient separation between classes, enabling the model to classify accurately.

3. **Generalization**:
   - The consistency in cross-validation scores highlights the model's ability to generalize well to unseen data.
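
Near-perfect scores can also mean that a single feature almost encodes the label. One hedged way to check this, sketched here on synthetic data rather than the phishing features, is to score each feature on its own with `roc_auc_score`. The data, the seed, and the name `single_feature_auc` are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

# Synthetic features: one nearly determines the label, one is pure noise
X = np.column_stack([
    y + rng.normal(0, 0.05, size=n),  # near-perfect (possibly leaky) feature
    rng.normal(0, 1, size=n),         # uninformative feature
])

# AUC of each feature used alone as a score; values near 0 or 1 flag a
# feature that almost separates the classes by itself
single_feature_auc = [roc_auc_score(y, X[:, j]) for j in range(X.shape[1])]
print(single_feature_auc)
```

A feature with a single-feature AUC close to 1.0 (or 0.0) is worth tracing back to preprocessing to confirm it would be available at prediction time.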

---
### Next Steps: Ensuring Model Robustness

To ensure the logistic regression model is not overfitting, we will:

1. **Test the Model on Noisy Data**:
   - Introduce Gaussian noise to the features in both training and testing datasets and evaluate the model's performance.
   - This simulates real-world scenarios where data might contain inconsistencies or imperfections.
  
2. **Adversarial Testing**:
   - Introduce targeted perturbations to test the model's robustness.
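
A hedged sketch of the noise test, run on synthetic data from `make_classification` instead of the phishing dataset: it sweeps several noise levels with a seeded generator so that runs are reproducible. The dataset, the noise levels, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, well-separated two-class problem (stand-in for the real data)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Seeded generator: the same noise is drawn on every run
rng = np.random.default_rng(42)
noise_accuracy = {}
for sigma in [0.0, 0.05, 0.5, 2.0]:
    X_noisy = X_test + rng.normal(0, sigma, size=X_test.shape)
    noise_accuracy[sigma] = model.score(X_noisy, y_test)

for sigma, acc in noise_accuracy.items():
    print(f"sigma={sigma}: accuracy={acc:.4f}")
```

Sweeping the noise level shows where accuracy starts to fall, which is more informative than a single small noise level that leaves the score unchanged.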

In [2]:
# Create noisy versions of the transformed datasets
X_train_noisy_lr = X_train_lr + np.random.normal(0, 0.05, X_train_lr.shape)  # Add Gaussian noise to training data
X_test_noisy_lr = X_test_lr + np.random.normal(0, 0.05, X_test_lr.shape)    # Add Gaussian noise to test data

# Train the model on noisy training data
print("\nTraining Logistic Regression on noisy data...")
log_reg_noisy = LogisticRegression(max_iter=1000, random_state=42)
log_reg_noisy.fit(X_train_noisy_lr, y_train_lr)

# Evaluate the model on noisy test data
print("\nEvaluating Logistic Regression on noisy data...")
y_pred_noisy_lr = log_reg_noisy.predict(X_test_noisy_lr)

# Calculate accuracy and classification report
accuracy_noisy_lr = accuracy_score(y_test_lr, y_pred_noisy_lr)
report_noisy_lr = classification_report(y_test_lr, y_pred_noisy_lr, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nAccuracy on Noisy Data: {accuracy_noisy_lr:.4f}")
print("\nClassification Report on Noisy Data:")
print(report_noisy_lr)


Training Logistic Regression on noisy data...

Evaluating Logistic Regression on noisy data...

Accuracy on Noisy Data: 1.0000

Classification Report on Noisy Data:
              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00      9922
    Phishing       1.00      1.00      1.00     10078

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000



### Logistic Regression: Results on Noisy Data

**Accuracy**:
- The model maintained perfect accuracy of **100%** on the noisy dataset, demonstrating exceptional robustness.

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 1.00  
     All legitimate predictions were correct.
   - **Recall**: 1.00  
     The model identified all legitimate cases perfectly.
   - **F1-Score**: 1.00  
     Reflects flawless performance under noisy conditions.

2. **Phishing (Class 1)**:
   - **Precision**: 1.00  
     All phishing predictions were correct.
   - **Recall**: 1.00  
     The model identified all phishing cases without error.
   - **F1-Score**: 1.00  
     Indicates balanced and perfect classification for phishing cases.

3. **Macro and Weighted Averages**:
   - Both averages remain at **1.00**, confirming robustness against noise.

---

### Next Steps: Adversarial Testing

Adversarial testing is a method to evaluate a model's robustness by introducing **targeted perturbations** to the input data. These perturbations are designed to simulate situations where an adversary might intentionally manipulate data to deceive the model.

In this test, we will:
1. **Generate Adversarial Noise**:
   - Small, random Gaussian noise will be added to the test dataset.
   - The noise is scaled by each feature's standard deviation (a noise standard deviation of 5% of the feature's own, from the factors 0.1 and 0.5), so every feature is perturbed in proportion to its spread.

2. **Create an Adversarial Dataset**:
   - The original test dataset will be combined with the generated noise to produce an adversarial version.

3. **Evaluate the Model**:
   - The logistic regression model trained on the noisy data (`log_reg_noisy`) will be evaluated on the adversarial dataset.
   - Key metrics such as accuracy, precision, recall, and F1-score will be analyzed to assess the model's robustness against adversarial manipulation.

This step will help determine if the model can maintain its performance even when faced with challenging, noisy inputs.
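
For a linear model, a targeted perturbation can be computed in closed form, which makes the contrast with random noise easy to see. The sketch below is not the notebook's method. It uses synthetic data and a hypothetical budget `eps`, and shifts every feature by `eps` in the direction that pushes the logit toward the wrong class (the worst case within an L-infinity ball for a linear decision function).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

w = model.coef_[0]
eps = 0.5  # hypothetical per-feature perturbation budget

# Targeted: lower the logit for true class 1, raise it for true class 0
direction = np.where(y_test[:, None] == 1, -np.sign(w), np.sign(w))
X_targeted = X_test + eps * direction

# Random: same per-feature magnitude, but random signs
rng = np.random.default_rng(0)
X_random = X_test + eps * rng.choice([-1.0, 1.0], size=X_test.shape)

acc_clean = model.score(X_test, y_test)
acc_random = model.score(X_random, y_test)
acc_targeted = model.score(X_targeted, y_test)
print(acc_clean, acc_random, acc_targeted)
```

By construction, the targeted shift reduces every sample's margin by the largest amount possible for the given budget, so `acc_targeted` is a lower bound on what random noise of the same size can produce.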

In [3]:
# Ensure the data is in NumPy array format for compatibility
X_test_lr_np = X_test_lr.to_numpy()

# Generate adversarial noise
adversarial_noise = np.random.normal(0, 0.1, X_test_lr_np.shape) * X_test_lr_np.std(axis=0, keepdims=True) * 0.5

# Create adversarial test dataset
X_test_adversarial_lr = X_test_lr_np + adversarial_noise

# Evaluate the model on adversarial test data
print("\nEvaluating Logistic Regression on adversarial data...")
y_pred_adversarial_lr = log_reg_noisy.predict(X_test_adversarial_lr)

# Calculate accuracy and classification report
accuracy_adversarial_lr = accuracy_score(y_test_lr, y_pred_adversarial_lr)
report_adversarial_lr = classification_report(y_test_lr, y_pred_adversarial_lr, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nAccuracy on Adversarial Data: {accuracy_adversarial_lr:.4f}")
print("\nClassification Report on Adversarial Data:")
print(report_adversarial_lr)


Evaluating Logistic Regression on adversarial data...

Accuracy on Adversarial Data: 1.0000

Classification Report on Adversarial Data:
              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00      9922
    Phishing       1.00      1.00      1.00     10078

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000





### Logistic Regression: Results on Adversarial Data

**Accuracy**:
- The model maintained perfect accuracy of **100%** on the adversarial dataset, showing no degradation under these random, feature-scaled perturbations.

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 1.00  
     All legitimate predictions were correct.
   - **Recall**: 1.00  
     The model identified all legitimate cases perfectly.
   - **F1-Score**: 1.00  
     Flawless classification for legitimate cases.

2. **Phishing (Class 1)**:
   - **Precision**: 1.00  
     All phishing predictions were correct.
   - **Recall**: 1.00  
     The model captured all phishing cases without error.
   - **F1-Score**: 1.00  
     Perfect classification for phishing cases.

3. **Macro and Weighted Averages**:
   - Both averages remain at **1.00**, confirming balanced performance across both classes, even with adversarial inputs.

---

### Insights

1. **Robustness to Adversarial Inputs**:
   - The model successfully handled adversarial perturbations without any degradation in performance, highlighting its resilience.

2. **Balanced Classification**:
   - The model retained perfect precision, recall, and F1-scores across both classes, even under challenging conditions.

3. **Final Validation**:
   - These results indicate that the logistic regression model is robust to small random perturbations of its inputs.

### Additional Tests for Logistic Regression

To further validate and refine the logistic regression model, we will perform the following tests:

1. **Feature Importance and Analysis**:
   - **Purpose**: Identify which features contribute the most to the model's predictions.
   - **Approach**: Extract model coefficients and rank features by their impact.
   - **Benefit**: Helps ensure the model is leveraging meaningful features and not overfitting to irrelevant data.

2. **Correlated Noise Testing**:
   - **Purpose**: Evaluate the model's robustness to structured, correlated noise.
   - **Approach**: Add systematic perturbations to specific columns (e.g., `url_length` or `dot_count`) and test the model's performance.
   - **Benefit**: Confirms resilience to systematic changes or adversarial attempts based on correlated feature manipulation.

3. **Class Imbalance Testing**:
   - **Purpose**: Simulate imbalanced scenarios to test the model's ability to handle minority classes effectively.
   - **Approach**: Modify the test set to have a strong imbalance (e.g., 80% legitimate, 20% phishing) and evaluate the model.
   - **Benefit**: Ensures the model can classify minority classes without bias.

4. **Feature Removal Testing**:
   - **Purpose**: Evaluate how the model performs if specific features are removed or corrupted.
   - **Approach**: Sequentially drop individual features (or groups of features) and measure the impact on accuracy.
   - **Benefit**: Ensures that the model does not overly rely on specific features and can adapt if some data becomes unavailable.


---

We will begin with **Feature Importance and Analysis** and proceed sequentially.

### Feature Importance Analysis

Feature importance analysis helps identify which features contribute the most to the logistic regression model's predictions. Understanding these contributions allows for better interpretability and potential refinement of the model.

---

**What We Are Doing**:
1. **Extract Model Coefficients**:
   - The logistic regression model assigns a coefficient to each feature, indicating its contribution to the prediction.

2. **Rank Features by Importance**:
   - Features are ranked based on the absolute value of their coefficients. Larger coefficients (positive or negative) indicate a greater impact on the model's decisions.

3. **Display Top Features**:
   - The top 10 features are displayed to highlight the most influential attributes driving the model's performance.
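
Ranking by absolute coefficient is only meaningful when the features share a scale, which the preprocessing here is stated to ensure. As a hedged sketch of why this matters, the example below fits a model on synthetic, unscaled features and compares the raw coefficient with the coefficient multiplied by the feature's standard deviation. The feature names, the data, and the `Std_Effect` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
# Hypothetical feature names on deliberately different scales
X = pd.DataFrame({
    "url_length": rng.normal(0, 10.0, n),  # large scale
    "dot_count": rng.normal(0, 1.0, n),    # unit scale
    "noise": rng.normal(0, 1.0, n),        # unrelated to the label
})
logit = 0.2 * X["url_length"] + 1.0 * X["dot_count"]
y = (logit + rng.logistic(size=n) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

importance = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_[0],
    # Effect of a one-standard-deviation change in the feature
    "Std_Effect": model.coef_[0] * X.std().to_numpy(),
})
importance["Abs_Std_Effect"] = importance["Std_Effect"].abs()
importance = importance.sort_values(by="Abs_Std_Effect", ascending=False)

# Feature with the largest raw coefficient, for comparison
raw_top = importance.loc[importance["Coefficient"].abs().idxmax(), "Feature"]
print(importance)
print("Largest raw coefficient:", raw_top)
```

On unscaled data the two rankings disagree: the raw coefficient favors `dot_count`, while the per-standard-deviation effect favors `url_length`. With standardized inputs the two rankings coincide.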

---

### Code for Feature Importance and Analysis

In [4]:
# Assuming the logistic regression model (log_reg) is already trained
# Extract feature names from the training dataset
feature_names = X_train_lr.columns

# Extract model coefficients
coefficients = log_reg.coef_[0]  # Coefficients for the logistic regression model

# Combine feature names and coefficients into a DataFrame
feature_importance = pd.DataFrame({
    "Feature": feature_names,
    "Coefficient": coefficients
})

# Add an absolute coefficient column to sort by magnitude
feature_importance["Abs_Coefficient"] = np.abs(feature_importance["Coefficient"])

# Sort features by absolute coefficient
feature_importance = feature_importance.sort_values(by="Abs_Coefficient", ascending=False)

# Display top features
print("Feature Importance (Top 10 Features):")
print(feature_importance.head(10))

Feature Importance (Top 10 Features):
   Feature  Coefficient  Abs_Coefficient
11      11     8.943489         8.943489
0        0     8.188846         8.188846
2        2     3.614554         3.614554
10      10    -0.951891         0.951891
5        5    -0.940916         0.940916
13      13    -0.646416         0.646416
7        7     0.221642         0.221642
12      12     0.081941         0.081941
9        9    -0.035299         0.035299
8        8     0.031990         0.031990


### Analysis of Feature Importance

**Note**: Features are identified by column index rather than by name. Recovering the actual feature names requires the original `subset.csv`, and because preprocessing is done in a separate file, that mapping is not available in this notebook. The analysis below therefore refers to features by index.

**Top Features**:
The top 10 features ranked by their absolute coefficients provide insight into the most influential attributes in the logistic regression model:
- **Feature 11**: Largest positive impact with a coefficient of **8.94**.
- **Feature 0**: Second most important, also with a strong positive effect (**8.19**).
- **Feature 2**: Significant positive contribution (**3.61**).
- **Feature 10** and **Feature 5**: Most influential negative contributors, with coefficients of **-0.95** and **-0.94**, respectively.

**Insights**:

1. **Dominance of Features**:
   - Features **11**, **0**, and **2** have a significantly higher impact compared to others.
   - These features are likely driving the model's high performance.

2. **Lesser Impact Features**:
   - Features like **8** and **9** have negligible coefficients, indicating minimal contribution.

---

Let's move on with **Correlated Noise Testing**.

### Correlated Noise Testing

Correlated noise testing evaluates the model's robustness by introducing structured noise into specific, high-impact features. This test simulates real-world scenarios where certain features might be systematically altered or perturbed.

---

**What We Are Doing**:
1. **Select High-Impact Features**:
   - Based on feature importance analysis, the top three influential features (`Feature 11`, `Feature 0`, and `Feature 2`) are selected for testing.

2. **Generate Correlated Noise**:
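   - Independent Gaussian noise is added to each selected feature, with a standard deviation of 2.5% of that feature's standard deviation (the factors 0.05 and 0.5). The perturbation is structured by feature choice and scale; it is not correlated across features.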

3. **Evaluate Model Performance**:
   - The trained logistic regression model is tested on the noisy dataset.
   - Accuracy and classification metrics (precision, recall, F1-score) are analyzed to assess the model's robustness against structured noise.
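
The test described above scales noise per feature. If noise that is genuinely correlated across features is wanted, one option is to draw it from a multivariate normal whose covariance is a scaled copy of the feature covariance. The sketch below uses synthetic columns; the data, the `scale` value, and the names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for three correlated feature columns (not the real data)
base = rng.normal(size=(5000, 1))
X_sel = np.hstack([base + 0.3 * rng.normal(size=(5000, 1)) for _ in range(3)])

# Noise covariance = scale^2 * feature covariance, so perturbations
# move correlated features together
scale = 0.05  # hypothetical noise level relative to the feature spread
cov = np.cov(X_sel, rowvar=False)
noise = rng.multivariate_normal(
    mean=np.zeros(X_sel.shape[1]), cov=(scale ** 2) * cov, size=X_sel.shape[0]
)
X_perturbed = X_sel + noise

# Correlation between the noise columns mirrors the feature correlation
noise_corr = np.corrcoef(noise, rowvar=False)
print(np.round(noise_corr, 2))
```

Independent per-feature noise would instead give a noise correlation matrix close to the identity.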

---
### Code for Correlated Noise Testing

In [5]:
# Select high-impact features from the feature importance ranking
high_impact_features = [11, 0, 2]  # Replace with actual feature names if needed

# Add correlated noise to the high-impact features in the test dataset
X_test_correlated_lr = X_test_lr.copy()  # Copy to avoid modifying the original test set

# Add independent Gaussian noise to each selected column, scaled by that column's standard deviation
for feature in high_impact_features:
    noise = np.random.normal(0, 0.05, X_test_correlated_lr.shape[0]) * X_test_lr.iloc[:, feature].std() * 0.5
    X_test_correlated_lr.iloc[:, feature] += noise

# Evaluate the model on the noisy test data
print("\nEvaluating Logistic Regression on test data with correlated noise...")
y_pred_correlated_lr = log_reg.predict(X_test_correlated_lr)

# Calculate accuracy and classification report
accuracy_correlated_lr = accuracy_score(y_test_lr, y_pred_correlated_lr)
report_correlated_lr = classification_report(y_test_lr, y_pred_correlated_lr, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nAccuracy on Correlated Noisy Data: {accuracy_correlated_lr:.4f}")
print("\nClassification Report on Correlated Noisy Data:")
print(report_correlated_lr)


Evaluating Logistic Regression on test data with correlated noise...

Accuracy on Correlated Noisy Data: 1.0000

Classification Report on Correlated Noisy Data:
              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00      9922
    Phishing       1.00      1.00      1.00     10078

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000



### Logistic Regression: Results on Correlated Noisy Data

**Accuracy**:
- The model maintained perfect accuracy of **100%** on the test set with correlated noise, demonstrating exceptional robustness.

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 1.00  
     All legitimate predictions were correct.
   - **Recall**: 1.00  
     The model identified all legitimate cases perfectly.
   - **F1-Score**: 1.00  
     Flawless classification for legitimate cases.

2. **Phishing (Class 1)**:
   - **Precision**: 1.00  
     All phishing predictions were correct.
   - **Recall**: 1.00  
     The model identified all phishing cases without error.
   - **F1-Score**: 1.00  
     Perfect classification for phishing cases.

3. **Macro and Weighted Averages**:
   - Both averages remain at **1.00**, confirming the model's resilience to structured, correlated noise.

---

**Insights**:
- The model's performance was unaffected by the introduction of structured noise to high-impact features.
- This result highlights the robustness of the logistic regression model under challenging conditions.

---

Let’s move on with **Class Imbalance Testing**.

### Class Imbalance Testing

Class imbalance is a common challenge in real-world datasets where one class significantly outweighs the other. This test simulates an imbalanced scenario in the test data to evaluate the model's ability to classify minority classes effectively.

---

**What We Are Doing**:
1. **Create an Imbalanced Test Set**:
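   - Subsample the test dataset toward a split of 80% `Legitimate` and 20% `Phishing`. Because the test set contains only 9,922 legitimate samples (fewer than the 16,000 that the 80% target requires), all of them are kept alongside 4,000 phishing samples, giving an actual split of about 71% / 29%.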

2. **Evaluate Model Performance**:
   - Test the trained logistic regression model on this imbalanced dataset.
   - Analyze accuracy and detailed classification metrics (precision, recall, F1-score) to assess if the model can still classify minority samples accurately.
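
When one class has too few samples to reach the target share of the full test set, an exact ratio can still be obtained by shrinking the other class instead. The helper below is a hypothetical sketch of that arithmetic (it is not part of the notebook), applied to the class counts from the test-set classification report.

```python
def imbalanced_sizes(n_majority_available, n_minority_available, majority_frac):
    """Largest subsample sizes that match the target majority fraction
    (up to integer rounding) without sampling with replacement."""
    ratio = (1 - majority_frac) / majority_frac  # minority samples per majority sample
    n_major = min(n_majority_available, int(n_minority_available / ratio))
    n_minor = int(round(n_major * ratio))
    return n_major, n_minor


# Test-set class counts: 9,922 legitimate and 10,078 phishing
n_legit, n_phish = imbalanced_sizes(9922, 10078, majority_frac=0.8)
achieved = n_legit / (n_legit + n_phish)
print(n_legit, n_phish, round(achieved, 4))
```

The resulting sizes can be passed as `n_samples` to `resample(..., replace=False)` for each class.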

---

### Code for Class Imbalance Testing

In [6]:
# Convert y_test_lr to a Pandas Series for indexing
y_test_lr_series = pd.Series(y_test_lr)

# Create a highly imbalanced test dataset (e.g., 80% Legitimate, 20% Phishing)
legitimate_indices = y_test_lr_series[y_test_lr_series == 0].index
phishing_indices = y_test_lr_series[y_test_lr_series == 1].index

# Target sample sizes for an 80/20 split of the full test set (capped below at the samples available per class)
legitimate_sample_size = int(0.8 * len(y_test_lr_series))
phishing_sample_size = len(y_test_lr_series) - legitimate_sample_size

# Downsample Legitimate or Phishing cases to achieve imbalance
imbalanced_legitimate = resample(
    legitimate_indices, replace=False, n_samples=min(legitimate_sample_size, len(legitimate_indices)), random_state=42
)
imbalanced_phishing = resample(
    phishing_indices, replace=False, n_samples=min(phishing_sample_size, len(phishing_indices)), random_state=42
)

# Combine indices and create the imbalanced test set
imbalanced_indices = np.hstack((imbalanced_legitimate, imbalanced_phishing))
X_test_imbalanced_lr = X_test_lr.iloc[imbalanced_indices]
y_test_imbalanced_lr = y_test_lr_series.iloc[imbalanced_indices]

# Evaluate the model on the imbalanced test data
print("\nEvaluating Logistic Regression on imbalanced test data...")
y_pred_imbalanced_lr = log_reg.predict(X_test_imbalanced_lr)

# Calculate accuracy and classification report
accuracy_imbalanced_lr = accuracy_score(y_test_imbalanced_lr, y_pred_imbalanced_lr)
report_imbalanced_lr = classification_report(
    y_test_imbalanced_lr, y_pred_imbalanced_lr, target_names=["Legitimate", "Phishing"]
)

# Display results
print(f"\nAccuracy on Imbalanced Data: {accuracy_imbalanced_lr:.4f}")
print("\nClassification Report on Imbalanced Data:")
print(report_imbalanced_lr)


Evaluating Logistic Regression on imbalanced test data...

Accuracy on Imbalanced Data: 1.0000

Classification Report on Imbalanced Data:
              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00      9922
    Phishing       1.00      1.00      1.00      4000

    accuracy                           1.00     13922
   macro avg       1.00      1.00      1.00     13922
weighted avg       1.00      1.00      1.00     13922



### Logistic Regression: Results on Imbalanced Test Data

**Accuracy**:
- The model maintained perfect accuracy of **100%** on the imbalanced test dataset, showcasing its robustness.

**Classification Report**:
1. **Legitimate (Class 0)**:
   - **Precision**: 1.00  
     All predictions for legitimate cases were correct.
   - **Recall**: 1.00  
     The model successfully identified all legitimate cases in the imbalanced dataset.
   - **F1-Score**: 1.00  
     Flawless classification for the majority class.

2. **Phishing (Class 1)**:
   - **Precision**: 1.00  
     All phishing predictions were correct.
   - **Recall**: 1.00  
     The model accurately captured all phishing cases despite their minority representation.
   - **F1-Score**: 1.00  
     Perfect classification for the minority class.

3. **Macro and Weighted Averages**:
   - Both averages remain at **1.00**, confirming that the model performs equally well across both majority and minority classes.

---

### Insights

1. **Robustness to Class Imbalance**:
   - The model showed no performance degradation on the imbalanced test set (about 71% legitimate, 29% phishing). This is expected: the model already classifies every test sample correctly, so any subsample of the test set also scores 100%.

2. **Balanced Predictions**:
   - The equal precision, recall, and F1-scores for both classes indicate that the model is not biased toward the majority class.

---

Let’s move on with **Feature Removal Testing** to explore potential performance optimizations.

### Feature Removal Testing

This test evaluates how the logistic regression model performs when specific features are removed. It helps identify:
- Features that the model heavily relies on.
- Redundant features that have minimal impact on accuracy.

---

**What We Are Doing**:
1. **Sequential Feature Removal**:
   - Drop one feature at a time from the training and test datasets.
   - Retrain the logistic regression model without the dropped feature.

2. **Evaluate Model Performance**:
   - Measure the accuracy of the reduced model on the test dataset.
   - Compare the reduced accuracy with the original model’s accuracy to determine the impact of each feature.

3. **Analyze Impact**:
   - Features that cause significant accuracy drops are critical for the model’s performance.
   - Features with minimal impact may be candidates for removal to simplify the model.

---
### Code for Robustness to Feature Removal

In [7]:
# Store the original model's accuracy for comparison
original_accuracy = log_reg.score(X_test_lr, y_test_lr)

# Evaluate the model by sequentially dropping each feature
print("\nEvaluating Robustness to Feature Removal...")
feature_impact = []

for feature in X_train_lr.columns:
    # Drop the current feature
    X_train_dropped = X_train_lr.drop(columns=[feature])
    X_test_dropped = X_test_lr.drop(columns=[feature])
    
    # Retrain the model on the reduced dataset
    log_reg_dropped = LogisticRegression(max_iter=1000, random_state=42)
    log_reg_dropped.fit(X_train_dropped, y_train_lr)
    
    # Evaluate the model on the reduced test set
    dropped_accuracy = log_reg_dropped.score(X_test_dropped, y_test_lr)
    
    # Store the results
    feature_impact.append((feature, dropped_accuracy))
    print(f"Feature '{feature}' removed. Accuracy: {dropped_accuracy:.4f}")

# Convert results to a DataFrame for analysis
feature_impact_df = pd.DataFrame(feature_impact, columns=["Feature", "Accuracy"])
feature_impact_df["Accuracy Drop"] = original_accuracy - feature_impact_df["Accuracy"]

# Display features with the highest impact
print("\nFeatures with the Highest Impact on Accuracy:")
print(feature_impact_df.sort_values(by="Accuracy Drop", ascending=False).head(10))


Evaluating Robustness to Feature Removal...
Feature '0' removed. Accuracy: 0.9988
Feature '1' removed. Accuracy: 1.0000
Feature '2' removed. Accuracy: 1.0000
Feature '3' removed. Accuracy: 1.0000
Feature '4' removed. Accuracy: 1.0000
Feature '5' removed. Accuracy: 1.0000
Feature '6' removed. Accuracy: 1.0000
Feature '7' removed. Accuracy: 1.0000
Feature '8' removed. Accuracy: 1.0000
Feature '9' removed. Accuracy: 1.0000
Feature '10' removed. Accuracy: 1.0000
Feature '11' removed. Accuracy: 0.9392
Feature '12' removed. Accuracy: 1.0000
Feature '13' removed. Accuracy: 1.0000

Features with the Highest Impact on Accuracy:
   Feature  Accuracy  Accuracy Drop
11      11   0.93920        0.06080
0        0   0.99875        0.00125
13      13   0.99995        0.00005
1        1   1.00000        0.00000
2        2   1.00000        0.00000
3        3   1.00000        0.00000
4        4   1.00000        0.00000
5        5   1.00000        0.00000
6        6   1.00000        0.00000
7        7   1.00000        0.00000

### Logistic Regression: Results of Feature Removal Test

**Evaluation Results**:
- The model was evaluated by sequentially removing each feature and measuring the impact on accuracy.
- Key findings from the test are:

1. **Features with the Highest Impact**:
   - **Feature '11'**: Caused the most significant accuracy drop (from **1.0000** to **0.9392**) when removed.
   - **Feature '0'**: Resulted in a minor accuracy drop (from **1.0000** to **0.9988**).

2. **Redundant Features**:
   - Several features (e.g., **1**, **2**, and **3**) caused no measurable drop in accuracy when removed, indicating that the remaining features compensate for their absence.

---

**Insights**:
1. **Critical Features**:
   - **Feature '11'** is crucial for the model's performance and should be retained in all scenarios.
   - **Feature '0'** is also somewhat impactful and contributes to the model's overall robustness.

2. **Potential Redundancy**:
   - Features that caused no accuracy drop (e.g., **'1'** and **'2'**) might be redundant and could potentially be removed to simplify the model without sacrificing performance.

---

The results show that the model depends heavily on Feature '11' but is largely insensitive to the removal of any other single feature.

### Hyperparameter Tuning for Logistic Regression

Hyperparameter tuning aims to optimize the logistic regression model by systematically searching for the best combination of parameters. The goal is to find the configuration with the highest cross-validated accuracy on the training data, which is then checked on the test set.

---

**What We Are Doing**:
1. **Define Hyperparameter Grid**:
   - Tune key parameters:
     - **`C`**: Inverse of the regularization strength (smaller values mean stronger regularization).
     - **`solver`**: Specifies the optimization algorithm.
     - **`penalty`**: Determines the type of regularization (`l1` or `l2`).

2. **Grid Search with Cross-Validation**:
   - Use 10-fold cross-validation to evaluate multiple combinations of parameters.
   - Identify the best configuration based on accuracy.

3. **Evaluate Best Model**:
   - Test the best hyperparameter configuration on the test dataset.
   - Generate accuracy and a detailed classification report.

---
### Code for Hyperparameter Tuning

In [8]:
# Define the hyperparameter grid
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],  # Inverse regularization strength (smaller = stronger)
    "solver": ["liblinear", "lbfgs", "saga"],  # Optimization algorithms
    "penalty": ["l1", "l2"]  # Regularization types (lbfgs supports only l2, so its l1 fits fail)
}

# Initialize the logistic regression model
log_reg_tune = LogisticRegression(max_iter=1000, random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=log_reg_tune,
    param_grid=param_grid,
    cv=10,  # 10-fold cross-validation
    scoring="accuracy",
    verbose=1,  # Display progress
    n_jobs=-1  # Use all available CPU cores
)

# Perform hyperparameter tuning
print("\nPerforming Grid Search for Hyperparameter Tuning...")
grid_search.fit(X_train_lr, y_train_lr)

# Get the best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
print("\nEvaluating the Best Logistic Regression Model...")
y_pred_best_lr = best_model.predict(X_test_lr)

# Calculate accuracy and classification report
accuracy_best_lr = accuracy_score(y_test_lr, y_pred_best_lr)
report_best_lr = classification_report(y_test_lr, y_pred_best_lr, target_names=["Legitimate", "Phishing"])

# Display results
print(f"\nBest Hyperparameters: {best_params}")
print(f"\nAccuracy of Best Model: {accuracy_best_lr:.4f}")
print("\nClassification Report of Best Model:")
print(report_best_lr)


Performing Grid Search for Hyperparameter Tuning...
Fitting 10 folds for each of 30 candidates, totalling 300 fits


50 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
50 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

       nan 0.9998875 0.9998125 0.9998


Evaluating the Best Logistic Regression Model...

Best Hyperparameters: {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}

Accuracy of Best Model: 1.0000

Classification Report of Best Model:
              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00      9922
    Phishing       1.00      1.00      1.00     10078

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000





### Logistic Regression: Hyperparameter Tuning Results

The grid search selected the following configuration for the logistic regression model, based on 10-fold cross-validated accuracy.

---

**Best Parameters**:
- **C**: `1`  
   The inverse of the regularization strength. `1` is the scikit-learn default and corresponds to moderate regularization.
- **Penalty**: `l1`  
   Applies lasso regularization, which can help with feature selection by shrinking less important coefficients to zero.
- **Solver**: `liblinear`  
   A solver that is well suited to small datasets and supports `l1` regularization.
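Because `C` is the inverse of the regularization strength, smaller values constrain the coefficients more. The sketch below illustrates this on synthetic stand-in data (an assumption, not the phishing dataset) with the default `l2` penalty, where the overall coefficient norm grows as `C` increases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption; not the phishing dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=14, n_informative=3, n_redundant=0, random_state=42
)

coef_norms = {}
for C in [0.01, 1, 100]:
    # Default penalty is l2 and default solver is lbfgs
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} coefficient L2 norm: {coef_norms[C]:.3f}")
```

A small norm at `C=0.01` and a larger one at `C=100` reflects weaker shrinkage as `C` grows.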

---

**Model Performance**:
1. **Accuracy**:
   - Achieved perfect accuracy of **1.0000** on the test dataset.

2. **Classification Report**:
   - **Legitimate (Class 0)**:
     - **Precision**: 1.00  
     - **Recall**: 1.00  
     - **F1-Score**: 1.00  
   - **Phishing (Class 1)**:
     - **Precision**: 1.00  
     - **Recall**: 1.00  
     - **F1-Score**: 1.00  
   - Macro and Weighted Averages: All metrics are **1.00**, consistent with the per-class scores on 20,000 test samples (9,922 legitimate, 10,078 phishing).

---

**Observations**:
1. **Robustness**:
   - The tuned model reaches the same test accuracy (**1.0000**) as the baseline in the feature removal test, so tuning preserved rather than improved test performance.

2. **Feature Selection**:
   - `l1` regularization can shrink the coefficients of less important features to exactly zero. Whether it did so here was not checked; inspecting the fitted coefficients would confirm it.
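One way to check whether the `l1` penalty actually zeroed out any coefficients is to inspect the fitted model's `coef_` attribute. The sketch below does this on synthetic stand-in data (an assumption) with the best parameters reported above. In the notebook, `best_model.coef_` could be inspected the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption; not the phishing dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=14, n_informative=3, n_redundant=0, random_state=42
)

# Same configuration as the best hyperparameters reported by the grid search
l1_model = LogisticRegression(C=1, penalty="l1", solver="liblinear", random_state=42)
l1_model.fit(X_demo, y_demo)

coefs = l1_model.coef_.ravel()          # one coefficient per feature
zeroed = np.flatnonzero(coefs == 0)     # indices of features the penalty removed

print(f"Non-zero coefficients: {np.count_nonzero(coefs)} of {coefs.size}")
print(f"Features with a zero coefficient: {zeroed.tolist()}")
```

Features with a zero coefficient do not influence the model's predictions, so they are natural candidates for removal.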

---

**Warnings During Tuning**:
1. **Convergence Issues**:
   - Some solvers reached the maximum number of iterations without converging.
   - This did not affect the final results, as `liblinear` produced the optimal configuration.

2. **Unsupported Parameter Combinations**:
   - The solver `lbfgs` does not support the `l1` penalty, so every fit for that combination failed: 50 fits in total (5 values of `C` × 10 folds).
   - The failed fits received a score of `nan` and could not be selected; all valid combinations were evaluated normally.
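The failed fits can be avoided by passing `GridSearchCV` a list of grids, so that each solver is paired only with penalties it supports. The following is a minimal sketch on synthetic stand-in data (an assumption), with a smaller `cv` and a larger `max_iter` than the notebook to keep the example quick and to help `saga` converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption; not the phishing dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=14, random_state=42)

C_values = [0.01, 0.1, 1, 10, 100]

# Each dict is expanded separately, so lbfgs is never combined with l1
valid_param_grid = [
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": C_values},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": C_values},
]

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=5000, random_state=42),
    param_grid=valid_param_grid,
    cv=5,
    scoring="accuracy",
    error_score="raise",  # Fail loudly instead of silently recording nan
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(f"Candidates evaluated: {n_candidates}")
print(f"Best parameters: {search.best_params_}")
```

This grid has 25 candidates instead of 30, because the five `lbfgs` plus `l1` combinations are never generated.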

---

Overall, the tuned logistic regression model, using an `l1` penalty with the `liblinear` solver, classifies every sample in the test set correctly.