# Building Logistic Regression Models

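Logistic regression is a foundational classification model: it estimates the probability that an observation belongs to a class from a linear combination of its features.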

In this notebook you will:
- Fit a logistic regression model with R-style formulas
- Interpret coefficients and predicted probabilities
- Evaluate performance with accuracy and a confusion matrix
- Visualize decision boundaries and sigmoid curves


## Setup: Load the Breast Cancer Dataset

In [None]:
import cuanalytics


In [None]:
# Load data and create train/test split
df = cuanalytics.load_breast_cancer_data()
train_df, test_df = cuanalytics.split_data(df, test_size=0.2)
train_df.head()


## Step 1: Build Your First Logistic Regression Model

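Logistic regression predicts a **probability** for each class. We turn those
probabilities into class labels using a threshold (usually 0.5).
In R-style formula syntax, `diagnosis ~ .` means "model `diagnosis` using every other column as a predictor."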
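
The probability-to-label step can be sketched in a few lines of plain Python. This is a minimal illustration with made-up log-odds scores, not the code `cuanalytics` runs internally; the helper names `sigmoid` and `to_label` are hypothetical.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(prob, threshold=0.5):
    """Return 1 (positive class) when prob meets the threshold, else 0."""
    return 1 if prob >= threshold else 0

# Hypothetical log-odds scores for three observations (not from the dataset)
example_scores = [-2.0, 0.0, 1.5]
example_probs = [sigmoid(z) for z in example_scores]
example_labels = [to_label(p) for p in example_probs]

print([round(p, 3) for p in example_probs])
print(example_labels)
```

A log-odds score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural default threshold.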

In [None]:
logit = cuanalytics.fit_logit(
    train_df,
    formula='diagnosis ~ .',
    C=1.0,
    max_iter=5000
)


Note: If you see a convergence warning, try increasing `max_iter` or scaling the features.

## Step 2: Interpret the Coefficients

Each coefficient is the change in the **log-odds** of the positive class for a
one-unit increase in that feature, holding the other features fixed.
A positive coefficient raises the predicted probability; a negative one lowers it.
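
To make the log-odds scale concrete, a coefficient can be converted to an odds ratio with `exp(coef)`. The sketch below uses a made-up coefficient of 0.7 (not a value from this model) and hypothetical helper names; it also shows that the same coefficient moves the probability by different amounts depending on where you start.

```python
import math

def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds when the feature rises by delta."""
    return math.exp(coef * delta)

def shifted_probability(p_start, coef, delta=1.0):
    """Probability after moving the feature by delta, starting from p_start."""
    log_odds = math.log(p_start / (1.0 - p_start)) + coef * delta
    return 1.0 / (1.0 + math.exp(-log_odds))

example_coef = 0.7  # hypothetical coefficient, for illustration only

print(round(odds_ratio(example_coef), 3))
print(round(shifted_probability(0.50, example_coef), 3))
print(round(shifted_probability(0.90, example_coef), 3))
```

The odds roughly double in both cases, but the probability gain is much larger when starting from 0.50 than from 0.90, because the sigmoid flattens near 0 and 1.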

In [None]:
logit.summary()


## Step 3: Evaluate Performance

We will report accuracy and a confusion matrix on both train and test sets.

In [None]:
train_report = logit.score(train_df)
test_report = logit.score(test_df)

print(f"Train accuracy: {train_report['accuracy']:.2%}")
print(f"Test accuracy: {test_report['accuracy']:.2%}")
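
For reference, here is how accuracy relates to the four cells of a binary confusion matrix. This is a standalone sketch on made-up labels; the function names are hypothetical, and the structure of the report returned by `logit.score` may differ.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

def accuracy_from_counts(counts):
    """Accuracy is the share of predictions on the matrix diagonal."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Made-up labels: 1 = positive class, 0 = negative class
example_true = [1, 0, 1, 1, 0, 0, 1, 0]
example_pred = [1, 0, 0, 1, 0, 1, 1, 0]

example_counts = confusion_counts(example_true, example_pred)
print(example_counts)
print(f"Accuracy: {accuracy_from_counts(example_counts):.2%}")
```

Accuracy alone hides which kind of error the model makes; the `fp` and `fn` counts show that split.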


## Step 4: Predict Probabilities

Predicted probabilities are useful for ranking, risk scoring, and custom thresholds.

In [None]:
probs = logit.predict_proba(test_df)
probs[:5]
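
The sketch below illustrates ranking and custom thresholds on a handful of made-up probabilities. It does not assume anything about the shape of the object returned by `predict_proba`; the case IDs, values, and helper names are hypothetical.

```python
def rank_by_risk(ids, probabilities):
    """Return (id, probability) pairs sorted from highest to lowest risk."""
    return sorted(zip(ids, probabilities), key=lambda pair: pair[1], reverse=True)

def flag(probabilities, threshold):
    """Label each case 1 if its probability meets the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Made-up cases and predicted probabilities of the positive class
case_ids = ["a", "b", "c", "d"]
case_probs = [0.35, 0.92, 0.08, 0.61]

ranked = rank_by_risk(case_ids, case_probs)
print(ranked)
print(flag(case_probs, 0.3))  # lower threshold flags more cases
print(flag(case_probs, 0.7))  # higher threshold flags fewer cases
```

Lowering the threshold catches more true positives at the cost of more false alarms; raising it does the opposite.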


## Step 5: Visualize Coefficients

A bar chart makes it easy to compare which features push predictions up or down.
Coefficient sizes are only directly comparable when the features are on similar scales.

In [None]:
logit.visualize()


## Step 6: Visualize Decision Boundary (2 Features)

This shows how two features alone separate the classes.
All other features are ignored in this 2D view.
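
For a logistic model fit on just two features, the 0.5-probability boundary is the straight line where the log-odds equal zero. The sketch below solves that line for the second feature using made-up coefficients; it is an illustration of the geometry, not necessarily how `visualize_features` draws its plot, and `boundary_x2` is a hypothetical helper.

```python
import math

def boundary_x2(x1, intercept, w1, w2):
    """x2 on the line intercept + w1*x1 + w2*x2 = 0 for a given x1."""
    if w2 == 0:
        raise ValueError("the boundary is vertical when w2 is 0")
    return -(intercept + w1 * x1) / w2

# Hypothetical coefficients for two features (not fitted values)
example_intercept, example_w1, example_w2 = -20.0, 1.0, 0.25

x1_value = 15.0
x2_on_boundary = boundary_x2(x1_value, example_intercept, example_w1, example_w2)

# Check: the predicted probability at a boundary point is 0.5
z = example_intercept + example_w1 * x1_value + example_w2 * x2_on_boundary
prob_on_boundary = 1.0 / (1.0 + math.exp(-z))
print(x2_on_boundary, prob_on_boundary)
```

Points on one side of this line get a probability above 0.5 and points on the other side fall below it.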

In [None]:
logit.visualize_features('radius_mean', 'texture_mean')


## Step 7: Visualize Sigmoid Curves by Feature

Each panel shows the sigmoid curve for one feature with all other
features held at their mean values.
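
The idea of "one feature varies, the rest stay at their means" can be written out directly. This is a sketch with made-up coefficients and means (only the column names come from this notebook); `feature_curve` is a hypothetical helper, not part of `cuanalytics`.

```python
import math

def feature_curve(grid, feature, coefs, intercept, means):
    """Predicted probability along `grid` for one feature, others at their means."""
    # Fixed contribution of the intercept and every other feature at its mean
    base = intercept + sum(
        coefs[name] * means[name] for name in coefs if name != feature
    )
    return [1.0 / (1.0 + math.exp(-(base + coefs[feature] * x))) for x in grid]

# Hypothetical coefficients and feature means, for illustration only
example_coefs = {"radius_mean": 1.2, "texture_mean": 0.3}
example_means = {"radius_mean": 14.0, "texture_mean": 19.0}

curve = feature_curve(
    [10.0, 14.0, 18.0], "radius_mean", example_coefs, -22.0, example_means
)
print([round(p, 3) for p in curve])
```

With a positive coefficient the curve rises from near 0 to near 1 as the feature increases; a negative coefficient would flip it.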

In [None]:
logit.visualize_all_features()


## Step 8: Inspect Metrics and Coefficients

The metrics dictionary is useful if you want to programmatically record results.

In [None]:
metrics = logit.get_metrics()
metrics['coefficients'].head()


## 🎓 Final Challenge

- Try changing the probability threshold from 0.5 to 0.3 or 0.7 and see how the confusion matrix changes.
- Compare logistic regression to LDA or SVM on the same dataset.
- Use `C=0.1`, `C=1`, and `C=10` to see how regularization affects coefficients.

## Key Takeaways

- Logistic regression models class **probabilities**, not just class labels.
- Each coefficient is the change in the **log-odds** of the positive class per one-unit increase in its feature.
- Visualization helps you see which features push probability up or down.

## Real-World Considerations

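- Scaling features to comparable ranges often speeds up convergence and makes coefficient magnitudes easier to compare.
- The choice of threshold matters when classes are imbalanced or when false positives and false negatives carry different costs.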
- Use probabilities to rank risk, not just make hard decisions.
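
As one illustration of feature scaling, the sketch below standardizes a single column to zero mean and unit variance using only the standard library. The values are made up and the helper names are hypothetical; the key point is that the mean and standard deviation are learned from the training data and then reused on the test data.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Rescale values to z-scores using previously learned statistics."""
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]

# Made-up values for one feature
train_values = [10.0, 12.0, 14.0, 16.0, 18.0]
test_values = [14.0, 20.0]

train_mean, train_std = fit_scaler(train_values)
scaled_train = standardize(train_values, train_mean, train_std)
scaled_test = standardize(test_values, train_mean, train_std)

print([round(v, 3) for v in scaled_train])
print([round(v, 3) for v in scaled_test])
```

Fitting the scaler on the training split only avoids leaking information from the test set into the model.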