## **Feature Importance in Logistic Regression**
   - Train a Logistic Regression model using the entire dataset.
   - Identify the most important feature based on the highest absolute coefficient value.

In [1]:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
df = pd.read_csv('spambase.csv')

# Separate features and target
X = df.drop('class', axis=1)
y = df['class']

# Split dataset into training and testing sets

In [7]:
# The coefficient analysis below fits on the entire dataset, as the exercise requests.
# A train/test split is also created here so the model can be evaluated on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# For coefficient analysis, we'll use the entire dataset
X_full = X
y_full = y
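
The train/test split above is created but not used in the coefficient analysis. As a minimal sketch of how it could support held-out evaluation, the block below fits a model on the training portion and reports metrics on the test portion. It runs on synthetic data from `make_classification` so that it is self-contained; `X_demo`, `y_demo`, and `eval_model` are illustrative names, not part of the original notebook, and the same steps should apply to `X_train`, `X_test`, `y_train`, `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: replace with X, y from the notebook).
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_clusters_per_class=1,
    random_state=0,
)

# Same split parameters as the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit on the training portion only, then score on data the model has not seen.
eval_model = LogisticRegression(max_iter=3000)
eval_model.fit(X_tr, y_tr)
y_pred = eval_model.predict(X_te)

test_accuracy = accuracy_score(y_te, y_pred)
print(f"Held-out accuracy: {test_accuracy:.3f}")
print(classification_report(y_te, y_pred))
```

Fitting on the full dataset is fine for reading coefficients, but any accuracy figure computed on the same rows the model was trained on would be optimistic, which is why the evaluation model here is fitted separately.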

## Logistic Regression

In [8]:
from sklearn.linear_model import LogisticRegression

# Train Logistic Regression on the entire dataset
cls = LogisticRegression(random_state=21, max_iter=3000)
cls.fit(X_full, y_full)

# Get feature names
feature_names = X.columns.tolist()

print("Logistic Regression Coefficients:")
print("="*50)

# Create a DataFrame for better visualization
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': cls.coef_[0],
    'Abs_Coefficient': np.abs(cls.coef_[0]),
    'Odds_Ratio': np.exp(cls.coef_[0])
})

# Sort by absolute coefficient value (importance)
coef_df_sorted = coef_df.sort_values('Abs_Coefficient', ascending=False)

print("Top 10 Most Important Features:")
print(coef_df_sorted.head(10))

Logistic Regression Coefficients:
Top 10 Most Important Features:
                 Feature  Coefficient  Abs_Coefficient  Odds_Ratio
26      word_freq_george    -3.978600         3.978600    0.018712
52         char_freq_%24     3.923058         3.923058   50.554800
6       word_freq_remove     2.198654         2.198654    9.012878
22         word_freq_000     2.079641         2.079641    8.001598
41     word_freq_meeting    -1.818344         1.818344    0.162294
24          word_freq_hp    -1.794308         1.794308    0.166242
47  word_freq_conference    -1.741294         1.741294    0.175293
40          word_freq_cs    -1.719267         1.719267    0.179198
45         word_freq_edu    -1.410830         1.410830    0.243941
28         word_freq_lab    -1.354008         1.354008    0.258203
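
The ranking above uses raw coefficients, which depend on each feature's units: a feature on a large numeric scale tends to receive a small coefficient even when it carries a strong signal. One common way to make magnitudes comparable is to standardize the features before fitting. The sketch below illustrates the effect on synthetic data with two features that differ mainly in scale; all names (`X_demo`, `raw_model`, `scaled_model`) are illustrative assumptions, and this is not the procedure the notebook itself followed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 2000

# Two independent signals; the second one has the stronger true effect.
signal_a = rng.normal(size=n_samples)
signal_b = rng.normal(size=n_samples)
logits = 1.0 * signal_a + 2.0 * signal_b
y_demo = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Record the signals in very different units: A on a tiny scale, B on a large one.
X_demo = np.column_stack([signal_a * 0.01, signal_b * 100.0])

# Model fitted on the raw features, as in the notebook.
raw_model = LogisticRegression(max_iter=3000).fit(X_demo, y_demo)
raw_coefs = raw_model.coef_[0]

# Same model, but each feature is rescaled to zero mean and unit variance first.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=3000))
scaled_model.fit(X_demo, y_demo)
std_coefs = scaled_model.named_steps["logisticregression"].coef_[0]

print("Raw |coef|:         ", np.abs(raw_coefs))
print("Standardized |coef|:", np.abs(std_coefs))
```

After standardization, each coefficient describes the change in log-odds per one standard deviation of the feature, so the ranking reflects signal strength rather than units. Applying the same pipeline to `X_full` and `y_full` would show whether the top-10 ordering above is sensitive to feature scale.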


In [9]:
# Detailed interpretation of coefficients
print("\n" + "="*60)
print("COEFFICIENT INTERPRETATION GUIDE:")
print("="*60)

print("\n1. MOST IMPORTANT FEATURE:")
most_important = coef_df_sorted.iloc[0]
print(f"   Feature: {most_important['Feature']}")
print(f"   Coefficient: {most_important['Coefficient']:.4f}")
print(f"   Odds Ratio: {most_important['Odds_Ratio']:.4f}")

# The odds-ratio line is identical in both cases, so only the direction word varies
direction = "INCREASES" if most_important['Coefficient'] > 0 else "DECREASES"
print(f"   → This feature {direction} spam probability")
print(f"   → Each unit increase multiplies odds of spam by {most_important['Odds_Ratio']:.4f}")

print("\n2. POSITIVE COEFFICIENTS (Increase spam probability):")
positive_coefs = coef_df_sorted[coef_df_sorted['Coefficient'] > 0].head(5)
for idx, row in positive_coefs.iterrows():
    print(f"   • {row['Feature']}: {row['Coefficient']:.4f} (odds ratio: {row['Odds_Ratio']:.4f})")

print("\n3. NEGATIVE COEFFICIENTS (Decrease spam probability):")
negative_coefs = coef_df_sorted[coef_df_sorted['Coefficient'] < 0].head(5)
for idx, row in negative_coefs.iterrows():
    print(f"   • {row['Feature']}: {row['Coefficient']:.4f} (odds ratio: {row['Odds_Ratio']:.4f})")

print("\n4. INTERPRETATION NOTES:")
print("   • Coefficient = change in log-odds per unit change in feature")
print("   • Odds Ratio = exp(coefficient) = multiplicative effect on odds")
print("   • Odds Ratio > 1: feature increases spam odds")
print("   • Odds Ratio < 1: feature decreases spam odds")
print("   • Larger |coefficient| = stronger influence on prediction")



COEFFICIENT INTERPRETATION GUIDE:

1. MOST IMPORTANT FEATURE:
   Feature: word_freq_george
   Coefficient: -3.9786
   Odds Ratio: 0.0187
   → This feature DECREASES spam probability
   → Each unit increase multiplies odds of spam by 0.0187

2. POSITIVE COEFFICIENTS (Increase spam probability):
   • char_freq_%24: 3.9231 (odds ratio: 50.5548)
   • word_freq_remove: 2.1987 (odds ratio: 9.0129)
   • word_freq_000: 2.0796 (odds ratio: 8.0016)
   • char_freq_%23: 1.3187 (odds ratio: 3.7384)
   • word_freq_free: 1.0165 (odds ratio: 2.7634)

3. NEGATIVE COEFFICIENTS (Decrease spam probability):
   • word_freq_george: -3.9786 (odds ratio: 0.0187)
   • word_freq_meeting: -1.8183 (odds ratio: 0.1623)
   • word_freq_hp: -1.7943 (odds ratio: 0.1662)
   • word_freq_conference: -1.7413 (odds ratio: 0.1753)
   • word_freq_cs: -1.7193 (odds ratio: 0.1792)

4. INTERPRETATION NOTES:
   • Coefficient = change in log-odds per unit change in feature
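   • Odds Ratio = exp(coefficient) = multiplicative effect on odds
   • Odds Ratio > 1: feature increases spam odds
   • Odds Ratio < 1: feature decreases spam odds
   • Larger |coefficient| = stronger influence on prediction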
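
To make the odds-ratio arithmetic concrete, the sketch below recomputes the odds ratios from the reported coefficients and shows how a one-unit increase in a feature shifts a predicted probability when all other features are held fixed. The helper names `odds_ratio` and `shift_probability` are hypothetical, and the starting probability of 0.5 is an arbitrary choice for illustration. In the standard Spambase encoding, `word_freq_*` features are percentages of words, so "one unit" would mean one percentage point; that is an assumption about this CSV.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Multiplicative change in the odds per one-unit increase in the feature."""
    return math.exp(coefficient)


def shift_probability(p: float, coefficient: float, delta: float = 1.0) -> float:
    """Probability after raising one feature by `delta`, other features fixed."""
    odds = p / (1.0 - p)                         # probability -> odds
    new_odds = odds * math.exp(coefficient * delta)
    return new_odds / (1.0 + new_odds)           # odds -> probability


# Coefficients copied from the output above.
george_coef = -3.9786   # word_freq_george
dollar_coef = 3.9231    # char_freq_%24

print(f"george odds ratio: {odds_ratio(george_coef):.4f}")
print(f"dollar odds ratio: {odds_ratio(dollar_coef):.4f}")

# Starting from an even chance of spam, raise each feature by one unit.
p_after_george = shift_probability(0.5, george_coef)
p_after_dollar = shift_probability(0.5, dollar_coef)
print(f"P(spam) after +1 george: {p_after_george:.4f}")
print(f"P(spam) after +1 dollar: {p_after_dollar:.4f}")
```

The odds ratio is constant, but the change in probability is not: the same coefficient moves a prediction much less when the starting probability is already close to 0 or 1.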

## Results

## Summary of Logistic Regression Coefficient Analysis

**Key Insights:**

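1. **Most Important Feature**: `word_freq_george` has the largest absolute coefficient (-3.9786, odds ratio 0.0187), making it the strongest single predictor in this model. Its sign is negative, so higher values point toward non-spam rather than spam.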

2. **Coefficient Interpretation**:
   - **Positive coefficients**: Features that increase the probability of an email being spam
   - **Negative coefficients**: Features that decrease the probability of an email being spam
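   - **Magnitude**: Larger absolute values indicate a stronger effect per unit of the feature. The features were not standardized before fitting, so magnitudes are directly comparable only between features measured on similar scales.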

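3. **Odds Ratios**:
   - Values > 1 indicate the feature increases spam odds
   - Values < 1 indicate the feature decreases spam odds
   - The further from 1 on a multiplicative scale, the stronger the effect: 0.0187 for `word_freq_george` (roughly 1/53) is comparable in strength to 50.55 for `char_freq_%24`, in the opposite direction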

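4. **Practical Application**:
   - Spam filters can prioritize the top-ranked spam indicators, such as `char_freq_%24` (the `$` character, URL-encoded), `word_freq_remove`, and `word_freq_000`
   - Knowing which words and characters raise spam probability helps when designing or auditing filtering rules
   - Features with large negative coefficients, such as `word_freq_george`, `word_freq_meeting`, and `word_freq_hp`, help identify legitimate emails, though such terms are likely specific to the mailbox the data came from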