# Regression Models
    Predicting Continuous Outcomes with Supervised Learning
    
## Objective

This notebook provides a structured overview of **core regression models**, covering:

- Linear and regularized regression
- Tree-based regression models
- Ensemble methods
- Model assumptions and trade-offs
- Regression modeling inside pipelines

It answers:

    How do we choose, train, and compare regression models in a principled and leakage-safe way?

## Why Regression Models Matter

Regression problems arise when predicting:
- Revenue
- Demand
- Lifetime value
- Risk scores
- Continuous KPIs

Different regression models encode **different assumptions** about how the features relate to the target.

## Imports and Dataset


In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv("D:/GitHub/Data-Science-Techniques/datasets/Superviased-regression/synthetic_customer_ltv_regression_complete.csv")
df.head()

|   | customer_id | signup_year | signup_month | days_since_signup | tenure_months | avg_monthly_spend | purchase_frequency | discount_sensitivity | returns_rate | email_open_rate | ad_click_rate | loyalty_score | support_tickets | churn_risk_score | credit_score_proxy | customer_lifetime_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022 | 8 | 899.094991 | 29 | 123.916907 | 3 | 0.401322 | 0.043396 | 0.042156 | 0.023647 | 0.123574 | 1 | 0.959716 | 671.029435 | 2691.193107 |
| 1 | 2 | 2019 | 9 | 2017.615223 | 66 | 204.814055 | 5 | 0.26684 | 0.338968 | 0.540674 | 0.180153 | 0.323954 | 1 | 0.78927 | 746.074773 | 11690.801889 |
| 2 | 3 | 2020 | 3 | 1720.937794 | 57 | 218.905816 | 3 | 0.028719 | 0.041845 | 0.517227 | 0.173583 | 0.26843 | 2 | 0.53341 | 601.164043 | 13094.093874 |
| 3 | 4 | 2022 | 3 | 1001.962036 | 33 | 188.02806 | 4 | 0.421602 | 0.140611 | 0.512366 | 0.277571 | 0.498941 | 3 | 0.699054 | 722.688139 | 6251.644013 |
| 4 | 5 | 2018 | 4 | 2522.620983 | 84 | 142.413565 | 6 | 0.192419 | 0.051116 | 0.462827 | 0.123844 | 0.500634 | 2 | 0.439348 | 659.860235 | 16474.610236 |


## Step 1 – Define Target and Features


In [3]:
target = "customer_lifetime_value"

X = df.drop(columns=[target, "customer_id"])
y = df[target]


## Step 2 – Train/Test Split


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


## Step 3 – Baseline Model

Always compare against a naive baseline. Here the baseline predicts the training-set mean for every test row.


In [5]:
# Constant prediction: the mean of the training target, repeated for each test row
baseline_pred = np.full(len(y_test), y_train.mean())

baseline_rmse = np.sqrt(np.mean((y_test - baseline_pred) ** 2))
baseline_rmse


np.float64(6397.159934038442)
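
The same mean baseline can also be written with scikit-learn's `DummyRegressor`, which keeps the baseline behind the same fit/predict interface as the real models. A minimal sketch on synthetic stand-in data (the arrays and the `_demo` names are made up for illustration and are not the LTV dataset):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's LTV dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 100 + 20 * X_demo[:, 0] + rng.normal(scale=5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# strategy="mean" always predicts the mean of the training target
dummy = DummyRegressor(strategy="mean")
dummy.fit(X_tr, y_tr)
dummy_pred = dummy.predict(X_te)

dummy_rmse = np.sqrt(mean_squared_error(y_te, dummy_pred))

# Manual version, computed the same way as in the cell above
manual_rmse = np.sqrt(np.mean((y_te - y_tr.mean()) ** 2))
print(dummy_rmse, manual_rmse)
```

Both numbers should agree, since the two approaches predict the same constant.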

## Step 4 – Linear Regression & Evaluation

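Linear regression assumes the target is a linear function of the features. Strongly correlated features (multicollinearity) do not break the predictions, but they make individual coefficients unstable and hard to interpret.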


In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

lr_pred = lr.predict(X_test)

In [8]:
from sklearn.metrics import mean_squared_error, r2_score

# `squared=False` is deprecated in recent scikit-learn releases; take the square root of the MSE instead
rmse_lr = np.sqrt(mean_squared_error(y_test, lr_pred))
r2_lr = r2_score(y_test, lr_pred)

rmse_lr, r2_lr




(np.float64(2580.1580448117697), 0.8372608981288544)
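
To make the two metrics concrete, the sketch below computes RMSE and R² by hand on a tiny made-up example and checks them against scikit-learn. The arrays `y_true` and `y_hat` are illustrative values, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Tiny illustrative example (made-up numbers)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# RMSE: square root of the mean squared residual, in the units of the target
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```

R² compares the model against a predict-the-mean reference, which is why it reads as a relative score, while RMSE stays in the units of the target.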

## Step 5 – Regularized Regression

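Regularization penalizes large coefficients. This reduces overfitting and stabilizes the coefficients when features are multicollinear; it does not remove the correlation between features.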


### Ridge Regression


In [9]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

ridge_pred = ridge.predict(X_test)


### Lasso Regression

In [10]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)

lasso_pred = lasso.predict(X_test)




### Regularization Interpretation

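- Ridge (L2 penalty) shrinks coefficients toward zero but rarely makes them exactly zero
- Lasso (L1 penalty) can set coefficients exactly to zero, which acts as feature selection
- ElasticNet combines the L1 and L2 penalties and balances both effects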
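
The difference between the three penalties shows up directly in the fitted coefficients. The sketch below uses synthetic data from `make_regression` in which only 3 of 10 features carry signal; the data, the `alpha` values, and the `_demo` names are assumptions chosen for illustration, not tuned settings for the LTV dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 informative (assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=10.0, random_state=0
)

# Penalized models are scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

ridge_demo = Ridge(alpha=10.0).fit(X_scaled, y_demo)
lasso_demo = Lasso(alpha=5.0).fit(X_scaled, y_demo)
enet_demo = ElasticNet(alpha=5.0, l1_ratio=0.5).fit(X_scaled, y_demo)

# Count coefficients that were driven exactly to zero
zero_counts = {}
for name, model in [("Ridge", ridge_demo), ("Lasso", lasso_demo), ("ElasticNet", enet_demo)]:
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f"{name:<10} zero coefficients: {zero_counts[name]} of {model.coef_.size}")
```

Ridge typically keeps every coefficient nonzero, while the L1 term in Lasso and ElasticNet removes uninformative features outright.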


## Step 6 – Tree-Based Regression

Trees capture nonlinearities and interactions.
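
A small illustration of that claim, assuming a synthetic one-feature dataset where the target is quadratic in the feature (all names and values below are made up for the sketch). A straight line cannot follow the curve, while a shallow tree approximates it with piecewise-constant steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data: y depends on x squared (assumption for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(500, 1))
y_nl = x[:, 0] ** 2 + rng.normal(scale=0.3, size=500)

# Simple holdout: first 400 rows for training, last 100 for testing
x_tr, x_te = x[:400], x[400:]
y_tr, y_te = y_nl[:400], y_nl[400:]

lin_demo = LinearRegression().fit(x_tr, y_tr)
tree_demo = DecisionTreeRegressor(max_depth=5, random_state=0).fit(x_tr, y_tr)

r2_lin = r2_score(y_te, lin_demo.predict(x_te))
r2_tree = r2_score(y_te, tree_demo.predict(x_te))
print(f"Linear R2: {r2_lin:.3f}  Tree R2: {r2_tree:.3f}")
```

The linear model could recover the curve if a squared feature were added by hand; the tree finds the shape without that step.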


### Decision Tree

In [11]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=5, random_state=42)
dt.fit(X_train, y_train)

dt_pred = dt.predict(X_test)


### Random Forest Regressor

In [12]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=300,
    random_state=42
)

rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)


## Step 7 – Gradient Boosting Regression

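Boosting reduces bias iteratively: each new tree is fit to the errors left by the trees before it, and its scaled prediction is added to the ensemble.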


In [13]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train, y_train)

gbr_pred = gbr.predict(X_test)
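
The iterative behaviour can be observed with `staged_predict`, which yields the ensemble's prediction after each added tree. The sketch below tracks RMSE per stage on synthetic data; the dataset, `n_estimators=200`, and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption: stands in for the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=15.0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

gbr_demo = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, random_state=1
)
gbr_demo.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
train_rmse = [
    np.sqrt(mean_squared_error(y_tr, p)) for p in gbr_demo.staged_predict(X_tr)
]
test_rmse = [
    np.sqrt(mean_squared_error(y_te, p)) for p in gbr_demo.staged_predict(X_te)
]

# Stage (1-based) with the lowest held-out error
best_stage = int(np.argmin(test_rmse)) + 1
print(f"RMSE after 1 tree: {test_rmse[0]:.1f}, after {len(test_rmse)} trees: {test_rmse[-1]:.1f}")
print(f"Lowest test RMSE at stage {best_stage}")
```

Training error keeps falling as trees are added, so the held-out curve is the one to watch when choosing the number of estimators.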


## Step 8 – Model Comparison


In [14]:
results = pd.DataFrame({
    "Model": [
        "Baseline",
        "Linear",
        "Ridge",
        "Lasso",
        "Decision Tree",
        "Random Forest",
        "Gradient Boosting"
    ],
    "RMSE": [
        baseline_rmse,
        rmse_lr,
        # RMSE as the square root of the MSE (`squared=False` is deprecated)
        np.sqrt(mean_squared_error(y_test, ridge_pred)),
        np.sqrt(mean_squared_error(y_test, lasso_pred)),
        np.sqrt(mean_squared_error(y_test, dt_pred)),
        np.sqrt(mean_squared_error(y_test, rf_pred)),
        np.sqrt(mean_squared_error(y_test, gbr_pred))
    ]
})

results.sort_values("RMSE")




|   | Model | RMSE |
|---|-------|------|
| 6 | Gradient Boosting | 1218.679247 |
| 5 | Random Forest | 1382.669603 |
| 1 | Linear | 2580.158045 |
| 3 | Lasso | 2580.237428 |
| 2 | Ridge | 2580.61749 |
| 4 | Decision Tree | 2871.522702 |
| 0 | Baseline | 6397.159934 |
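
A single train/test split gives one RMSE per model, with no sense of how much that number varies. One way to get a more stable comparison is k-fold cross-validation. The sketch below is a minimal example on synthetic data; the dataset, the two candidate models, and the names `candidates` and `cv_results` are assumptions for illustration. Because the synthetic target is linear by construction, the ranking it produces says nothing about the LTV dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a linear signal (assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=20.0, random_state=7
)

candidates = {
    "Linear": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=7),
}

# Same folds for every model so the comparison is like-for-like
cv = KFold(n_splits=5, shuffle=True, random_state=7)

rows = []
for name, model in candidates.items():
    # scikit-learn scorers are "higher is better", so RMSE comes back negated
    scores = -cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rows.append({"Model": name, "RMSE mean": scores.mean(), "RMSE std": scores.std()})

cv_results = pd.DataFrame(rows).sort_values("RMSE mean")
print(cv_results)
```

The standard deviation column indicates whether a gap between two models is larger than the fold-to-fold noise.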


## Regression Models by Assumption

| Model | Handles Nonlinearity | Scaling Needed |
|-----|---------------------|---------------|
| Linear | ❌ | ❌ (predictions unaffected; scaling only helps compare coefficients) |
| Ridge / Lasso | ❌ | ✔ |
| Tree | ✔ | ❌ |
| Random Forest | ✔ | ❌ |
| Boosting | ✔ | ❌ |
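
Scale sensitivity can be checked empirically. The penalty in Ridge and Lasso depends on the size of the coefficients, and therefore on the units of each feature, whereas tree splits and unpenalized least-squares predictions do not. The sketch below multiplies one feature by 1000 (as if its units changed) and measures how far each model's predictions move; the data and the helper `max_prediction_shift` are hypothetical, written only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=5.0, random_state=3
)

# Rescale the first feature, e.g. a change of units
X_rescaled = X_demo.copy()
X_rescaled[:, 0] *= 1000.0


def max_prediction_shift(model_factory):
    """Largest absolute change in predictions caused by rescaling one feature."""
    pred_a = model_factory().fit(X_demo, y_demo).predict(X_demo)
    pred_b = model_factory().fit(X_rescaled, y_demo).predict(X_rescaled)
    return float(np.max(np.abs(pred_a - pred_b)))


shift_ols = max_prediction_shift(LinearRegression)
shift_ridge = max_prediction_shift(lambda: Ridge(alpha=10.0))
shift_tree = max_prediction_shift(
    lambda: DecisionTreeRegressor(max_depth=4, random_state=3)
)

print(f"OLS:   {shift_ols:.2e}")
print(f"Ridge: {shift_ridge:.2e}")
print(f"Tree:  {shift_tree:.2e}")
```

Only the penalized model should show a meaningful shift, which is why scaling belongs in front of Ridge and Lasso in particular.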


## Step 10 – Pipelines (Preview)

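All regression models should be embedded in pipelines, so that preprocessing steps such as scaling are fit on the training data only and never see the test data.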
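
A minimal sketch of such a pipeline, assuming a `StandardScaler` followed by `Ridge` on synthetic data (the step names `"scaler"` and `"model"`, the data, and `ridge_pipe` are illustrative choices, not this repository's pipeline):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: stands in for the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=10.0, random_state=5
)

ridge_pipe = Pipeline([
    ("scaler", StandardScaler()),   # fit on the training part of each fold only
    ("model", Ridge(alpha=1.0)),
])

# cross_val_score refits the whole pipeline per fold, so the scaler's
# statistics never include the held-out rows
cv_rmse = -cross_val_score(
    ridge_pipe, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
print(f"CV RMSE: {cv_rmse.mean():.2f} +/- {cv_rmse.std():.2f}")
```

Scaling the full dataset before splitting would leak test-set statistics into training; wrapping the scaler in the pipeline avoids that by construction.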


## Common Mistakes (Avoided)

- ❌ No baseline
- ❌ Comparing untuned models unfairly
- ❌ Ignoring residual analysis
- ❌ Blindly trusting R²
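
As an illustration of what residual analysis can reveal, the sketch below fits a straight line to deliberately curved synthetic data and inspects the residuals in three bands of the fitted values. The data and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with curvature the linear model cannot represent
rng = np.random.default_rng(11)
x = rng.uniform(0, 10, size=(300, 1))
y_curved = 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=1.0, size=300)

line = LinearRegression().fit(x, y_curved)
fitted = line.predict(x)
residuals = y_curved - fitted

# With an intercept, training residuals average to zero by construction,
# so the overall mean says nothing; look for structure instead.
order = np.argsort(fitted)
bins = np.array_split(residuals[order], 3)
bin_means = [float(b.mean()) for b in bins]

print(f"Overall residual mean: {residuals.mean():.2e}")
print("Mean residual by fitted-value band (low, mid, high):", bin_means)
```

A sign pattern across the bands (here positive, negative, positive) points to a missed nonlinearity even when summary metrics look acceptable; a scatter plot of residuals against fitted values shows the same thing visually.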


## Summary Table

| Model Type | Strength |
|----------|---------|
| Linear | Interpretability |
| Regularized | Stability |
| Tree | Nonlinearity |
| Ensemble | Performance |


## Key Takeaways

- Start simple
- Compare against baseline
- Match model to data assumptions
- Ensembles gave the lowest RMSE on this dataset (gradient boosting, then random forest)
- Pipelines keep preprocessing leakage-safe


## Next Notebook

04_Supervised_Learning/

└── [01_linear_and_regularized_models.ipynb](01_linear_and_regularized_models.ipynb)





# Complete: [Data Science Techniques](https://github.com/lei-soares/Data-Science-Techniques)

- [00_Data_Generation_and_Simulation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/00_Data_Generation_and_Simulation)


- [01_Exploratory_Data_Analysis_(EDA)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/01_Exploratory_Data_Analysis_(EDA))


- [02_Data_Preprocessing](https://github.com/lei-soares/Data-Science-Techniques/tree/main/02_Data_Preprocessing)


- [03_Feature_Engineering](https://github.com/lei-soares/Data-Science-Techniques/tree/main/03_Feature_Engineering)


- [04_Supervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/04_Supervised_Learning)


- [05_Unsupervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/05_Unsupervised_Learning)


- [06_Model_Evaluation_and_Validation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/06_Model_Evaluation_and_Validation)


- [07_Model_Tuning_and_Optimization](https://github.com/lei-soares/Data-Science-Techniques/tree/main/07_Model_Tuning_and_Optimization)


- [08_Interpretability_and_Explainability](https://github.com/lei-soares/Data-Science-Techniques/tree/main/08_Interpretability_and_Explainability)


- [09_Pipelines_and_Workflows](https://github.com/lei-soares/Data-Science-Techniques/tree/main/09_Pipelines_and_Workflows)


- [10_Natural_Language_Processing_(NLP)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/10_Natural_Language_Processing_(NLP))


- [11_Time_Series](https://github.com/lei-soares/Data-Science-Techniques/tree/main/11_Time_Series)


- [12_Anomaly_and_Fraud_Detection](https://github.com/lei-soares/Data-Science-Techniques/tree/main/12_Anomaly_and_Fraud_Detection)


- [13_Imbalanced_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/13_Imbalanced_Learning)


- [14_Deployment_and_Production_Concepts](https://github.com/lei-soares/Data-Science-Techniques/tree/main/14_Deployment_and_Production_Concepts)


- [15_Business_and_Experimental_Design](https://github.com/lei-soares/Data-Science-Techniques/tree/main/15_Business_and_Experimental_Design)





[Pantufo Dados](www.pantufodados.com)


[Pantufo Dados - YouTube Channel](https://www.youtube.com/@pantufodados)