# **Goal: Recommendation System / Next Best Action**

## **Context**
We have a **random extraction** of a (real world) dataset containing **customers of a large wealth management company**.  

The data is **anonymous, mostly clean, and NOT always normalized/scaled**.  

Our objective is to **estimate investment needs** for these customers using **Data Science techniques**.

### **Why Estimate Investment Needs?**
Identifying customer needs is useful for several reasons, including:

* **Recommender Systems / Next Best Action:**  
  * Needs can serve as **key inputs** for **content-based** or **knowledge-based filtering algorithms**, which allow for personalized services.  
  * This is our **primary focus** in this notebook, i.e., "Know Your Client (KYC)".  

* **Product Targeting & Governance (Regulatory Compliance - MIFID/IDD in EU):**  
  * Regulatory standards require that **customer needs match the investment products offered**. So financial institutions must estimate customer needs.
  * This is essentially an **"institutional view"** of a recommendation system...

<br>

---

## **Dataset Overview**
The dataset, named **"Needs"**, is stored in an **Excel file called Dataset2_Needs.xlsx**.  
It contains several **potentially relevant features** along with two **target variables**:

* **AccumulationInvestment**  
  * Indicates a customer preference for **accumulation investing**, typically through **dollar-cost averaging** (i.e., investing small amounts at regular intervals over time, say on a monthly basis).  
  * **Binary (Boolean) response:**  
    * `1 = High propensity`.  
    * `0 = Low propensity`.

- **IncomeInvestment**  
  - Indicates a customer preference for **income investing**, typically through **lump-sum investing** (i.e., one-shot investments). Anyone who aspires to obtain income from coupons and dividends must already have accumulated capital, so this need is typical of older clients than the accumulation need above.
  - **Binary (Boolean) response:**  
    - `1 = High propensity`.  
    - `0 = Low propensity`.  

    **Where do these two response variables come from?** From a **revealed preference scheme**: if the client has an advisor who is considered professionally reliable (which rules out conflicts of interest), the advisor has proposed a product that satisfies a given need, and the client has purchased it, we can say with good probability that the advisor identified the need correctly and that the client has that need. In other words, the machine learning model we are building is a clone of the financial advisor.

<br>

Additionally, we have a **second dataset**, **"Products"**, containing investment products (funds, segregated accounts, unit-linked policies), along with:

* **Product Type:**  
  * `1 = Accumulation` (that is, a product that is good for those who have a high need for accumulation investments)
  * `0 = Income`  (that is, a product that is good for those who have a high need for income investments)

* **Risk Level:**  
  * A **normalized risk score** in the range **$[0,1]$**.  
  * This usually represents the normalized value in $[0, 1]$ of the **[Synthetic Risk and Reward Indicator (SRRI)](https://www.esma.europa.eu/sites/default/files/library/2015/11/10_673.pdf)** of the product, an ordinal variable taking values in $\{1, \dots, 7\}$ that is derived from continuous data.

<br>

---

## **Recommendation System Approach**
The recommendation system consists of **two key steps**:

1. **Identifying customers with high investment propensity:**  
   - Using **machine learning models**, we aim to classify customers based on **AccumulationInvestment** (`1 = High propensity`) and/or **IncomeInvestment** (`1 = High propensity`).  

2. **Recommending the most suitable product for each customer:**  
   - For each customer, we match the **most appropriate product** based on:  
     - **Investment need** (Accumulation or Income).  
     - **Risk compatibility** (matching product risk level with the customer profile).  
   - This **personalized recommendation** represents the **Next Best Action** for each client.
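
The second step can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the `recommend` helper, the toy product list, and the matching rule (pick the product of the required type with the highest risk that does not exceed the client's risk propensity) are all assumptions made for the example.

```python
def recommend(need_type, risk_propensity, products):
    """Return the best-matching product ID, or None if nothing is suitable.

    need_type: 1 = Accumulation, 0 = Income (same coding as the Products sheet).
    products: list of (product_id, type, risk) tuples, risk normalized to [0, 1].
    """
    # Keep products of the right type whose risk the client can tolerate.
    suitable = [p for p in products if p[1] == need_type and p[2] <= risk_propensity]
    if not suitable:
        return None
    # Assumed rule: among suitable products, take the one closest to the client's risk level.
    return max(suitable, key=lambda p: p[2])[0]


# Hypothetical catalogue (IDs and risk levels are made up for illustration).
toy_products = [(1, 1, 0.20), (2, 1, 0.45), (3, 0, 0.15), (4, 0, 0.60)]

print(recommend(1, 0.50, toy_products))  # 2
print(recommend(0, 0.10, toy_products))  # None
```

Returning `None` when no product fits keeps unsuitable products from being recommended, which matters for the regulatory view described earlier.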

<br>


Let's start with data ingestion.

<br>

In [1]:
import pandas as pd

In [2]:
import os

folder = r'C:\Users\leo_h\Documents\project\fintech_final_project'
print(os.listdir(folder))

['.git', '.vscode', 'BaseNotebook.ipynb', 'dataset', 'Dataset2_Needs.xlsx', 'FeaEngwithDecisionTree.ipynb', 'Projectwork_Zenti.pdf', 'RIJ.ipynb', 'SL.ipynb', 'Zenti_Business_Case_2.pdf']


In [3]:
import os

folder = r'C:\Users\leo_h\Documents\project\fintech_final_project'
file_path = "Dataset2_Needs.xlsx"

# Work from the project folder so that the relative file path resolves
os.chdir(folder)

# Print current working directory for debugging
print("Current working directory:", os.getcwd())
print("Looking for file at:", file_path)

# Load the three sheets of the workbook
needs_df = pd.read_excel(file_path, sheet_name='Needs')
products_df = pd.read_excel(file_path, sheet_name='Products')
metadata_df = pd.read_excel(file_path, sheet_name='Metadata')

Current working directory: C:\Users\leo_h\Documents\project\fintech_final_project
Looking for file at: Dataset2_Needs.xlsx


# **Data Exploration**

As in the last business case, I keep this part minimal for the sake of brevity, so that we can get to the heart of the problem. You could easily spend far more time here in order to **understand the problem and the dataset**.

Let's display our variables to better understand the data structure and characteristics of the dataset.

<br>

In [4]:
# Let's see the actual variables names in metadata_df
print("Metadata DataFrame columns:")
print(metadata_df.columns.tolist())

# Let's peek at the first few rows
print("\nFirst few rows of metadata:")
print(metadata_df.head())


Metadata DataFrame columns:
['Metadata', 'Unnamed: 1']

First few rows of metadata:
        Metadata                     Unnamed: 1
0        Clients                            NaN
1             ID                   Numerical ID
2            Age                  Age, in years
3         Gender  Gender (Female = 1, Male = 0)
4  FamilyMembers           Number of components


<br>

We drop the `ID` column, as it is not needed for the analysis.

<br>

In [5]:
# Drop ID column as it's not needed for analysis
needs_df = needs_df.drop('ID', axis=1)

<br>

Create a formatted table to summarize the dataset (you can expand the number of statistics you look at).


<br>

In [6]:
def create_variable_summary(df, metadata_df):
    # Create empty lists to store the chosen statistics
    stats_dict = {
        'Variable': [],
        'Description': [],
        'Mean': [],
        'Std': [],
        'Missing': [],
        'Min': [],
        'Max': []
    }

    # Create a metadata dictionary for easy lookup
    meta_dict = dict(zip(metadata_df['Metadata'], metadata_df['Unnamed: 1']))

    for col in df.columns:
        stats_dict['Variable'].append(col)
        stats_dict['Description'].append(meta_dict.get(col, 'N/A'))

        # Calculate some statistics for each column
        if pd.api.types.is_numeric_dtype(df[col]):
            stats_dict['Mean'].append(f"{df[col].mean():.2f}")
            stats_dict['Std'].append(f"{df[col].std():.2f}")
            stats_dict['Min'].append(f"{df[col].min():.2f}")
            stats_dict['Max'].append(f"{df[col].max():.2f}")
        else:
            stats_dict['Mean'].append('N/A')
            stats_dict['Std'].append('N/A')
            stats_dict['Min'].append('N/A')
            stats_dict['Max'].append('N/A')

        stats_dict['Missing'].append(df[col].isna().sum())

    return pd.DataFrame(stats_dict)


# Create summary tables
print("NEEDS VARIABLES SUMMARY:")
needs_summary = create_variable_summary(needs_df, metadata_df)
display(needs_summary.style
        .set_properties(**{'text-align': 'left'})
        .hide(axis='index'))

print("\nPRODUCTS VARIABLES SUMMARY:")
products_summary = create_variable_summary(products_df, metadata_df)
display(products_summary.style
        .set_properties(**{'text-align': 'left'})
        .hide(axis='index'))


NEEDS VARIABLES SUMMARY:


Variable,Description,Mean,Std,Missing,Min,Max
Age,"Age, in years",55.25,11.97,0,18.0,97.0
Gender,"Gender (Female = 1, Male = 0)",0.49,0.5,0,0.0,1.0
FamilyMembers,Number of components,2.51,0.76,0,1.0,5.0
FinancialEducation,Normalized level of Financial Education (estimate),0.42,0.15,0,0.04,0.9
RiskPropensity,Normalized Risk propensity from MIFID profile,0.36,0.15,0,0.02,0.88
Income,Income (thousands of euros); estimate,62.99,44.36,0,1.54,365.32
Wealth,Wealth (thousands of euros); sum of investments and cash accounts,93.81,105.47,0,1.06,2233.23
IncomeInvestment,Boolean variable for Income investment; 1 = High propensity,0.38,0.49,0,0.0,1.0
AccumulationInvestment,Boolean variable for Accumulation/growth investment; 1 = High propensity,0.51,0.5,0,0.0,1.0



PRODUCTS VARIABLES SUMMARY:


Variable,Description,Mean,Std,Missing,Min,Max
IDProduct,Product description,6.0,3.32,0,1.0,11.0
Type,"1 = Accumulation product, 0 = Income product",0.64,0.5,0,0.0,1.0
Risk,Normalized Synthetic Risk Indicator,0.43,0.24,0,0.12,0.88


<h1>Feature Engineering</h1>

<body>
<h2>Feature Idea 8: Gender × Age Interaction</h2>
Rationale: Investigate whether life-cycle financial needs differ by gender within the dataset. Additionally, refine the Income/Wealth Ratio by applying a logarithmic transformation for better scaling.
Combining the Gender × Age interaction feature with the log-transformed Income/Wealth Ratio and the base feature set yielded the strongest performance among all feature engineering strategies tested.
</body>
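
Written out, the two engineered features are as follows. The symbols are introduced here for illustration; the definitions follow the `prepare_features` code in the next cell.

$$
x_{G \times A} = \text{Gender} \cdot \text{Age}, \qquad
x_{I/W} = \ln\!\left(1 + \frac{\text{Income}}{\text{Wealth}}\right)
$$

Since Gender is coded as Female = 1 and Male = 0, the interaction equals the client's age for women and 0 for men. Every feature is then min-max scaled, $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, so that all inputs lie in $[0, 1]$.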

In [7]:
import pandas as pd
import numpy as np

# --- Sklearn Imports ---
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import MinMaxScaler
# --- Add imports for new models ---

from sklearn.tree import DecisionTreeClassifier

# ---------------------------------
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tabulate import tabulate

# --- Global Settings ---
RANDOM_STATE = 42

# --- ASSUME 'needs_df' is loaded and preprocessed correctly before this point ---
# Example (only needed when the raw columns are not already numeric):
# needs_df = pd.read_csv('your_data.csv')
# # Ensure Gender, Age, FamilyMembers are numeric (metadata coding: Female = 1, Male = 0)
# needs_df['Gender'] = needs_df['Gender'].apply(lambda x: 1 if x == 'Female' else 0)
# needs_df['FamilyMembers'] = pd.to_numeric(needs_df['FamilyMembers'], errors='coerce').fillna(0)
# needs_df['Age'] = pd.to_numeric(needs_df['Age'], errors='coerce').fillna(needs_df['Age'].median())


# Step 1: Feature engineering function (Implementing Both New Features)
def prepare_features(df):
    """Prepares base and engineered (Gender*Age + I/W Ratio) feature sets."""
    X = df.copy()
    income_col = 'Income ' if 'Income ' in X.columns else 'Income'
    wealth_col = 'Wealth'
    gender_col = 'Gender'
    age_col = 'Age'

    # Log transformation for Wealth and Income
    if wealth_col in X.columns: X['Wealth_log'] = np.log1p(X[wealth_col])
    if income_col in X.columns: X['Income_log'] = np.log1p(X[income_col])

    # --- Engineered Feature 1: Gender * Age Interaction ---
    interaction_gender_age = 'Gender_x_Age'
    if age_col in X.columns and gender_col in X.columns and \
       pd.api.types.is_numeric_dtype(X[age_col]) and \
       pd.api.types.is_numeric_dtype(X[gender_col]):
        X[interaction_gender_age] = X[age_col] * X[gender_col]
    else:
        X[interaction_gender_age] = 0
        print(f"Warning: Could not calculate {interaction_gender_age}.")

    # --- Engineered Feature 2: Refined Income/Wealth Ratio (log) ---
    feature_iw_ratio_log = 'Income_Wealth_Ratio_log'
    if income_col in X.columns and wealth_col in X.columns and \
       pd.api.types.is_numeric_dtype(X[income_col]) and \
       pd.api.types.is_numeric_dtype(X[wealth_col]):
        # Calculate ratio using original values, handle division by zero
        ratio = X[income_col].div(X[wealth_col].replace(0, np.nan))
        # Fill NaN ratios with 0 before log1p
        X[feature_iw_ratio_log] = np.log1p(ratio.fillna(0))
    else:
        X[feature_iw_ratio_log] = 0
        print(f"Warning: Could not calculate {feature_iw_ratio_log}.")


    # --- Define Feature Lists ---
    # Base features: Standard set
    features_base_expected = ['Age', 'Gender', 'FamilyMembers', 'FinancialEducation',
                              'RiskPropensity', 'Wealth_log', 'Income_log']

    # Engineered features: Base + BOTH New Features
    features_engineered_expected = features_base_expected.copy()
    if interaction_gender_age in X.columns:
         features_engineered_expected.append(interaction_gender_age)
    if feature_iw_ratio_log in X.columns:
         features_engineered_expected.append(feature_iw_ratio_log)

    # Select only available columns
    features_base = [f for f in features_base_expected if f in X.columns]
    features_engineered = [f for f in features_engineered_expected if f in X.columns]


    # Normalize all features
    scaler_base = MinMaxScaler()
    scaler_eng = MinMaxScaler()

    # Minimal error checking assumed data is clean/numeric
    X_base = pd.DataFrame(scaler_base.fit_transform(X[features_base]), columns=features_base, index=X.index)
    X_engineered = pd.DataFrame(scaler_eng.fit_transform(X[features_engineered]), columns=features_engineered, index=X.index)

    return X_base, X_engineered


In [8]:
X_base, X_engineered = prepare_features(needs_df)
y_income = needs_df['IncomeInvestment']
y_accum = needs_df['AccumulationInvestment']

In [9]:
# Base Features
print("Base Features:")
X_base.head(5)

Base Features:


Unnamed: 0,Age,Gender,FamilyMembers,FinancialEducation,RiskPropensity,Wealth_log,Income_log
0,0.531646,0.0,0.25,0.222172,0.243105,0.468132,0.664782
1,0.759494,0.0,0.25,0.37241,0.170321,0.60016,0.441614
2,0.189873,1.0,0.25,0.324649,0.262161,0.498951,0.45397
3,0.64557,1.0,0.75,0.843975,0.73411,0.756044,0.842246
4,0.506329,0.0,0.5,0.45409,0.377948,0.482307,0.436064


In [10]:

# Engineered Features
print("\nEngineered Features:")
X_engineered.head(5)


Engineered Features:


Unnamed: 0,Age,Gender,FamilyMembers,FinancialEducation,RiskPropensity,Wealth_log,Income_log,Gender_x_Age,Income_Wealth_Ratio_log
0,0.531646,0.0,0.25,0.222172,0.243105,0.468132,0.664782,0.0,0.155428
1,0.759494,0.0,0.25,0.37241,0.170321,0.60016,0.441614,0.0,0.025499
2,0.189873,1.0,0.25,0.324649,0.262161,0.498951,0.45397,0.347368,0.054647
3,0.64557,1.0,0.75,0.843975,0.73411,0.756044,0.842246,0.726316,0.062964
4,0.506329,0.0,0.5,0.45409,0.377948,0.482307,0.436064,0.0,0.055916
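
As a sanity check, the scaled values in the two tables can be reproduced by hand. The sketch below assumes that row 2 belongs to a 33-year-old woman and that the largest Gender × Age value in the data is 95. Both figures are inferred from the scaled output and are not shown directly in the notebook. The Age range (18 to 97) comes from the summary table.

```python
def min_max(value, lo, hi):
    """Min-max scale `value` from [lo, hi] to [0, 1], as MinMaxScaler does per column."""
    return (value - lo) / (hi - lo)


age, gender = 33, 1  # inferred values for row 2 (assumption)

age_scaled = min_max(age, 18, 97)                   # Age range from the summary table
interaction_scaled = min_max(age * gender, 0, 95)   # 0..95 range is inferred (assumption)

print(round(age_scaled, 6))          # 0.189873
print(round(interaction_scaled, 6))  # 0.347368
```

Both results match the `Age` and `Gender_x_Age` entries of row 2.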


In [11]:
# Income Investment Model
y_income


0       0
1       1
2       0
3       1
4       0
       ..
4995    0
4996    1
4997    0
4998    0
4999    1
Name: IncomeInvestment, Length: 5000, dtype: int64

In [12]:
# Accumulation Investment Model
y_accum

0       1
1       0
2       1
3       1
4       0
       ..
4995    0
4996    1
4997    1
4998    0
4999    1
Name: AccumulationInvestment, Length: 5000, dtype: int64

## Assumptions for Streaming Model

- **Independence:** Each customer record is treated as an independent observation. There are no time-dependent or sequential relationships between records.
- **Feature Stability:** The relationships between features (e.g., Gender × Age interaction, log(Income/Wealth)) and target variables are assumed to remain relevant over time, though gradual changes (concept drift) are expected and handled by the streaming model.
- **Concept Drift:** The streaming approach assumes that if the distribution of data changes, it does so gradually enough for the model and drift detector (ADWIN) to adapt.
- **Representative Sample:** The dataset of 5,000 records is assumed to be representative of the broader client population we wish to model.
- **No Explicit Temporal Information:** As the dataset is a cross-sectional snapshot (not longitudinal), we do not model time series effects or repeated measurements per client.

## Sample Size Justification

- **Sufficiency:** With 5,000 records and a moderate number of features, the data is sufficient for both batch and streaming learning for binary classification tasks.
- **Class Balance:** Outcomes are checked for class imbalance; if severe imbalance were present, additional measures (e.g., resampling) could be considered.
- **Streaming Model Suitability:** Streaming models like Hoeffding Tree and ADWIN can learn effectively from thousands of samples, especially when feature engineering captures key relationships.
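
To make the sample-size argument concrete: the Hoeffding bound that gives Hoeffding Trees their name states that, after $n$ observations of a quantity with range $R$, the observed mean lies within $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$ of the true mean with probability $1 - \delta$. A leaf is split once the gap between the two best split candidates exceeds $\epsilon$. The sketch below evaluates $\epsilon$ for a few sample sizes. $R = 1$ (information gain for a binary target, in bits) and $\delta = 10^{-7}$ are illustrative assumptions, not values taken from this notebook.

```python
import math


def hoeffding_bound(n, value_range=1.0, delta=1e-7):
    """Hoeffding bound epsilon after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


for n in (200, 1000, 5000):
    print(n, round(hoeffding_bound(n), 4))
```

Under these assumptions, $\epsilon$ falls from about 0.20 after 200 records to about 0.04 after 5,000, so only splits with a clear advantage are made. This is one reason streaming trees stay shallow on a dataset of this size.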

## Time Series Discussion

- **No Temporal Sequence:** The current dataset does not contain explicit timestamps or sequential data per client; each record is a point-in-time snapshot.
- **Streaming Simulation:** The 'streaming' process is simulated by iterating over records as if they arrive in real-time. This is appropriate for adaptive learning, but not a replacement for true time series methods.
- **When Time Series Matters:** If the goal were to model changes in an individual client’s behavior over time or to forecast future investment needs, time series or panel data modeling would be required.
- **Justification:** For this business case—recommending next best actions based on current client profiles—cross-sectional modeling is appropriate and time series modeling is not strictly necessary.
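
The streaming simulation follows a test-then-train (prequential) pattern: each record is first used to score the model and only afterwards to update it. The sketch below shows the pattern with a deliberately trivial stand-in model. `MajorityClass` and the toy stream are hypothetical and only illustrate the control flow used later with River models.

```python
from collections import Counter


class MajorityClass:
    """Toy online model: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        # No prediction is possible before the first label has been seen.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(model, records):
    """Score each record before learning from it and return the running accuracy."""
    correct = 0
    total = 0
    for x, y in records:
        y_pred = model.predict_one(x)   # 1) test on the unseen record
        correct += int(y_pred == y)
        total += 1
        model.learn_one(x, y)           # 2) then train on it
    return correct / total


toy_stream = [({"Age": 0.5}, 0), ({"Age": 0.7}, 0), ({"Age": 0.2}, 1), ({"Age": 0.6}, 0)]
print(prequential_accuracy(MajorityClass(), toy_stream))  # 0.5
```

Because every prediction is made before the label is seen, the metric reflects performance on unseen data without a separate hold-out set.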


In [13]:
#%pip install river

import river
print(river.__version__)

0.22.0


In [14]:
from river import stream

#### Disclaimer

Streaming methods like Hoeffding Trees and their ensembles learn from the data one sample at a time and never see the whole dataset at once. This is much harder than batch learning.
Typical streaming benchmarks (especially for classification) often show that accuracies in the 65–75% range are quite solid, especially on non-trivial, real-world data.
Batch models (like full RandomForest or XGBoost) often do better, but they have the advantage of multiple passes and more complex optimization.
Practical context: If your classes are balanced and your baseline (e.g., always guessing the majority class) is much lower (say, 50–60%), then 75% is strong. If the problem is very easy (e.g., baseline is 70%), then it's less impressive.
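
For this dataset, the majority-class baseline can be read off the summary table, because the mean of a 0/1 target is its positive rate. A small sketch (the helper name is illustrative):

```python
def majority_baseline(positive_rate):
    """Accuracy of always predicting the more frequent class of a binary target."""
    return max(positive_rate, 1.0 - positive_rate)


# Positive rates taken from the 'Mean' column of the Needs summary table.
print(round(majority_baseline(0.38), 2))  # IncomeInvestment       -> 0.62
print(round(majority_baseline(0.51), 2))  # AccumulationInvestment -> 0.51
```

Accuracies should be judged against these baselines: roughly 0.62 for the income target and 0.51 for the accumulation target.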

#### Models

Models Used:
The code evaluates four streaming models from the River library:
<ol>
<li>HoeffdingTreeClassifier (single decision tree for streaming)</li>
<li>LogisticRegression (online linear model)</li>
<li>NaiveBayes (Gaussian Naive Bayes for streaming)</li>
<li>Bagging (HoeffdingTree) (ensemble of Hoeffding Trees for improved stability and accuracy)</li>
</ol>

In [15]:
from river import tree, linear_model, naive_bayes, ensemble, metrics, stream

models = {
    "HoeffdingTree": tree.HoeffdingTreeClassifier(),
    "LogisticRegression": linear_model.LogisticRegression(),
    "NaiveBayes": naive_bayes.GaussianNB(),
    "Bagging (HoeffdingTree)": ensemble.BaggingClassifier(
        model=tree.HoeffdingTreeClassifier(),
        n_models=10,
        seed=42
    )
}

targets = {
    "income": y_income,
    "accumulation": y_accum
}

for model_name, model in models.items():
    for target_name, y in targets.items():
        print(f"\nExperiment: Target = {target_name}, Features = engineered, Model = {model_name}")
        data_stream = stream.iter_pandas(X_engineered, y)
        model_instance = model.clone()
        # Initialize metrics
        accuracy = metrics.Accuracy()
        precision = metrics.Precision()
        recall = metrics.Recall()
        f1 = metrics.F1()
        rocauc = metrics.ROCAUC()
        for x_row, y_row in data_stream:
            y_pred = model_instance.predict_one(x_row)
            y_pred_proba = model_instance.predict_proba_one(x_row)
            accuracy.update(y_row, y_pred)
            precision.update(y_row, y_pred)
            recall.update(y_row, y_pred)
            f1.update(y_row, y_pred)
            # Update ROC AUC with the predicted class probabilities (a dict keyed by label)
            rocauc.update(y_row, y_pred_proba)
            model_instance.learn_one(x_row, y_row)
        print(f"Accuracy:  {accuracy.get():.4f}")
        print(f"Precision: {precision.get():.4f}")
        print(f"Recall:    {recall.get():.4f}")
        print(f"F1:        {f1.get():.4f}")
        print(f"ROC AUC:   {rocauc.get():.4f}\n")


Experiment: Target = income, Features = engineered, Model = HoeffdingTree
Accuracy:  0.7420
Precision: 0.7062
Recall:    0.5615
F1:        0.6256
ROC AUC:   0.7370


Experiment: Target = accumulation, Features = engineered, Model = HoeffdingTree
Accuracy:  0.6520
Precision: 0.6452
Recall:    0.7151
F1:        0.6784
ROC AUC:   0.6981


Experiment: Target = income, Features = engineered, Model = LogisticRegression
Accuracy:  0.6248
Precision: 0.8500
Recall:    0.0266
F1:        0.0516
ROC AUC:   0.5801


Experiment: Target = accumulation, Features = engineered, Model = LogisticRegression
Accuracy:  0.5628
Precision: 0.5542
Recall:    0.7576
F1:        0.6401
ROC AUC:   0.5538


Experiment: Target = income, Features = engineered, Model = NaiveBayes
Accuracy:  0.7494
Precision: 0.6869
Recall:    0.6382
F1:        0.6616
ROC AUC:   0.7542


Experiment: Target = accumulation, Features = engineered, Model = NaiveBayes
Accuracy:  0.6374
Precision: 0.6191
Recall:    0.7627
F1:        0.6834
R

In [16]:
from sklearn.tree import DecisionTreeClassifier
from river import tree, naive_bayes, metrics, stream

# Try to import advanced ensembles and specialized trees; classes missing from the installed river version are skipped
extra_models = {}

try:
    from river.ensemble import AdaptiveRandomForestClassifier
    extra_models["AdaptiveRandomForest"] = AdaptiveRandomForestClassifier(seed=42, n_models=10)
except ImportError:
    pass

try:
    from river.ensemble import ARFClassifier
    extra_models["ARFClassifier"] = ARFClassifier(seed=42, n_models=10)
except ImportError:
    pass

try:
    from river.tree import ExtremelyFastDecisionTreeClassifier
    extra_models["ExtremelyFastDecisionTree"] = ExtremelyFastDecisionTreeClassifier()
except ImportError:
    pass

try:
    from river.tree import MondrianTreeClassifier
    extra_models["MondrianTree"] = MondrianTreeClassifier()
except ImportError:
    pass

# Feature selection using sklearn
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_engineered, y_income)
importances = dict(zip(X_engineered.columns, dtree.feature_importances_))
threshold = 0.01
selected_features = [feat for feat, imp in importances.items() if imp > threshold]
print("Selected features:", selected_features)

X_pruned = X_engineered[selected_features]

# Main set of models
models = {
    "HoeffdingTree": tree.HoeffdingTreeClassifier(),
}

# Add Bagging if available
try:
    from river import ensemble
    models["Bagging (HoeffdingTree)"] = ensemble.BaggingClassifier(
        model=tree.HoeffdingTreeClassifier(),
        n_models=10,
        seed=42
    )
except ImportError:
    pass

# Add the extra models found above
models.update(extra_models)

targets = {
    "income": y_income,
    "accumulation": y_accum
}

for model_name, model in models.items():
    for target_name, y in targets.items():
        print(f"\nExperiment: Target = {target_name}, Features = pruned, Model = {model_name}")
        data_stream = stream.iter_pandas(X_pruned, y)
        model_instance = model.clone()
        # Initialize metrics
        accuracy = metrics.Accuracy()
        precision = metrics.Precision()
        recall = metrics.Recall()
        f1 = metrics.F1()
        rocauc = metrics.ROCAUC()
        for x_row, y_row in data_stream:
            y_pred = model_instance.predict_one(x_row)
            y_pred_proba = model_instance.predict_proba_one(x_row)
            accuracy.update(y_row, y_pred)
            precision.update(y_row, y_pred)
            recall.update(y_row, y_pred)
            f1.update(y_row, y_pred)
            rocauc.update(y_row, y_pred_proba)
            model_instance.learn_one(x_row, y_row)
        print(f"Accuracy:  {accuracy.get():.4f}")
        print(f"Precision: {precision.get():.4f}")
        print(f"Recall:    {recall.get():.4f}")
        print(f"F1:        {f1.get():.4f}")
        print(f"ROC AUC:   {rocauc.get():.4f}\n")

Selected features: ['Age', 'FamilyMembers', 'FinancialEducation', 'RiskPropensity', 'Wealth_log', 'Income_log', 'Gender_x_Age', 'Income_Wealth_Ratio_log']

Experiment: Target = income, Features = pruned, Model = HoeffdingTree
Accuracy:  0.7424
Precision: 0.7072
Recall:    0.5615
F1:        0.6260
ROC AUC:   0.7397


Experiment: Target = accumulation, Features = pruned, Model = HoeffdingTree
Accuracy:  0.6520
Precision: 0.6450
Recall:    0.7159
F1:        0.6786
ROC AUC:   0.6995


Experiment: Target = income, Features = pruned, Model = Bagging (HoeffdingTree)
Accuracy:  0.7532
Precision: 0.7237
Recall:    0.5777
F1:        0.6425
ROC AUC:   0.7491


Experiment: Target = accumulation, Features = pruned, Model = Bagging (HoeffdingTree)
Accuracy:  0.6672
Precision: 0.6579
Recall:    0.7323
F1:        0.6931
ROC AUC:   0.7215


Experiment: Target = income, Features = pruned, Model = ExtremelyFastDecisionTree
Accuracy:  0.7602
Precision: 0.7605
Recall:    0.5480
F1:        0.6370
ROC AUC:  

### Result Discussion
<body>
Bagging (HoeffdingTree) Is the Strongest Tree-Based Model:

On both targets, the bagged ensemble outperforms the single Hoeffding Tree in accuracy, F1, and ROC AUC. On the accumulation target it also has the highest accuracy and F1 of the four models.

- income target: Accuracy 0.7508, F1 0.6390, ROC AUC 0.7494
- accumulation target: Accuracy 0.6664, F1 0.6924, ROC AUC 0.7195

NaiveBayes Is Competitive:

NaiveBayes also performs well, especially on the income target (Accuracy 0.7494, F1 0.6616, ROC AUC 0.7542), where its F1 and ROC AUC exceed those of the bagged ensemble and its accuracy is only marginally lower.
NaiveBayes therefore appears robust on this dataset, even though the engineered features are not independent of the base features they are derived from.

HoeffdingTree (Single Tree):

Slightly lower performance than the bagged version, as expected (ensembling reduces variance and improves generalization).

LogisticRegression Underperforms:

Recall on the income target is very low (0.0266), meaning the model almost never predicts the positive class.
This leads to a poor F1 score (0.0516) despite high precision (0.8500), likely due to class imbalance or poor feature separability for a linear model.

</body>
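
The LogisticRegression figures show how F1, the harmonic mean of precision and recall, is dominated by the weaker of the two. The check below recomputes it from the reported precision and recall (the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values for LogisticRegression on the income target.
print(round(f1_score(0.8500, 0.0266), 4))  # 0.0516
# Reported values for HoeffdingTree on the income target.
print(round(f1_score(0.7062, 0.5615), 4))  # 0.6256
```

Both results agree with the F1 values printed by the experiment loop.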

Tuning the feature-pruning threshold (importance cut-off from a decision tree)

In [17]:
from sklearn.tree import DecisionTreeClassifier

def select_features(X, y, threshold=0.01):
    dtree = DecisionTreeClassifier(random_state=42)
    dtree.fit(X, y)
    importances = dict(zip(X.columns, dtree.feature_importances_))
    selected = [feat for feat, imp in importances.items() if imp > threshold]
    print(f"Threshold: {threshold} | Selected features: {selected}")
    return selected

# To prune for both targets
threshold = 0.01  # or 0.005, etc.

selected_features_income = select_features(X_engineered, y_income, threshold=threshold)
X_pruned_income = X_engineered[selected_features_income]

selected_features_accum = select_features(X_engineered, y_accum, threshold=threshold)
X_pruned_accum = X_engineered[selected_features_accum]

# Now you have X_pruned_income for y_income and X_pruned_accum for y_accum

Threshold: 0.01 | Selected features: ['Age', 'FamilyMembers', 'FinancialEducation', 'RiskPropensity', 'Wealth_log', 'Income_log', 'Gender_x_Age', 'Income_Wealth_Ratio_log']
Threshold: 0.01 | Selected features: ['Age', 'FamilyMembers', 'FinancialEducation', 'RiskPropensity', 'Wealth_log', 'Income_log', 'Gender_x_Age', 'Income_Wealth_Ratio_log']


In [18]:
from river import tree, ensemble, metrics, stream

# --- From earlier results, Bagging HoeffdingTree and ExtremelyFastDecisionTree were best ---

# Bagging HoeffdingTree parameters (as previously best)
bagging_params = dict(n_models=10, max_depth=10, grace_period=200)
# ExtremelyFastDecisionTree parameters (as previously best)
eft_params = dict(max_depth=10, grace_period=100)

def get_best_models():
    models = {}
    models["Bagging_HT"] = ensemble.BaggingClassifier(
        model=tree.HoeffdingTreeClassifier(max_depth=bagging_params["max_depth"], grace_period=bagging_params["grace_period"]),
        n_models=bagging_params["n_models"], seed=42
    )
    try:
        from river.tree import ExtremelyFastDecisionTreeClassifier
        models["EFT"] = ExtremelyFastDecisionTreeClassifier(
            max_depth=eft_params["max_depth"], grace_period=eft_params["grace_period"]
        )
    except ImportError:
        pass
    return models

def evaluate_model(model, X, y, title=""):
    data_stream = stream.iter_pandas(X, y)
    acc = metrics.Accuracy()
    f1 = metrics.F1()
    rocauc = metrics.ROCAUC()
    for x_row, y_row in data_stream:
        y_pred = model.predict_one(x_row)
        y_pred_proba = model.predict_proba_one(x_row)
        acc.update(y_row, y_pred)
        f1.update(y_row, y_pred)
        rocauc.update(y_row, y_pred_proba)
        model.learn_one(x_row, y_row)
    print(f"{title} | Accuracy: {acc.get():.4f} | F1: {f1.get():.4f} | ROC AUC: {rocauc.get():.4f}")

# --- Test on income ---
models_income = get_best_models()
print("\n--- Results for y_income (Best Models) ---")
for name, model in models_income.items():
    evaluate_model(model.clone(), X_pruned_income, y_income, title=name)

# --- Test on accumulation ---
models_accum = get_best_models()
print("\n--- Results for y_accum (Best Models) ---")
for name, model in models_accum.items():
    evaluate_model(model.clone(), X_pruned_accum, y_accum, title=name)


--- Results for y_income (Best Models) ---
Bagging_HT | Accuracy: 0.7532 | F1: 0.6425 | ROC AUC: 0.7491
EFT | Accuracy: 0.7608 | F1: 0.6421 | ROC AUC: 0.7376

--- Results for y_accum (Best Models) ---
Bagging_HT | Accuracy: 0.6672 | F1: 0.6931 | ROC AUC: 0.7215
EFT | Accuracy: 0.6598 | F1: 0.6709 | ROC AUC: 0.7056


Hyperparameter Tuning

In [19]:
from river import tree, ensemble, metrics, stream

# Final best hyperparameters
final_params = dict(n_models=10, max_depth=10, grace_period=100, seed=42)

def evaluate_bagging_ht(X, y, target_name=""):
    model = ensemble.BaggingClassifier(
        model=tree.HoeffdingTreeClassifier(max_depth=final_params["max_depth"], grace_period=final_params["grace_period"]),
        n_models=final_params["n_models"],
        seed=final_params["seed"]
    )
    acc = metrics.Accuracy()
    f1 = metrics.F1()
    rocauc = metrics.ROCAUC()
    data_stream = stream.iter_pandas(X, y)
    for x_row, y_row in data_stream:
        y_pred = model.predict_one(x_row)
        y_pred_proba = model.predict_proba_one(x_row)
        acc.update(y_row, y_pred)
        f1.update(y_row, y_pred)
        rocauc.update(y_row, y_pred_proba)
        model.learn_one(x_row, y_row)
    print(f"{target_name} | Accuracy: {acc.get():.4f} | F1: {f1.get():.4f} | ROC AUC: {rocauc.get():.4f}")

# Evaluate on y_income
evaluate_bagging_ht(X_pruned_income, y_income, target_name="y_income (Bagging_HT)")

# Evaluate on y_accum
evaluate_bagging_ht(X_pruned_accum, y_accum, target_name="y_accum (Bagging_HT)")

y_income (Bagging_HT) | Accuracy: 0.7538 | F1: 0.6439 | ROC AUC: 0.7524
y_accum (Bagging_HT) | Accuracy: 0.6712 | F1: 0.6992 | ROC AUC: 0.7288


<body>
Builds a streaming ensemble with Bagging Hoeffding Trees (and EFT if available). Uses majority voting for predictions. Evaluates with accuracy, F1, and ROC AUC in a streaming (online) fashion. </body>
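
Majority voting itself is simple: each member predicts a label and the most frequent label wins. A minimal sketch, independent of River (the function name and the tie-breaking rule are assumptions for illustration):

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the smallest label (assumed rule)."""
    counts = Counter(p for p in predictions if p is not None)
    if not counts:
        return None
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


print(majority_vote([1, 0, 1]))  # 1
print(majority_vote([0, 1]))     # 0 (tie broken towards the smaller label)
```

With only two members, every disagreement is a tie, so the outcome then depends entirely on the tie-breaking rule.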

In [20]:
from river import tree, ensemble, metrics, stream

bagging_ht = ensemble.BaggingClassifier(
    model=tree.HoeffdingTreeClassifier(max_depth=10, grace_period=100),
    n_models=10,
    seed=42
)

# Pass only model instances, not (name, model) tuples
ensemble_models = [bagging_ht]

try:
    from river.tree import ExtremelyFastDecisionTreeClassifier
    eft = ExtremelyFastDecisionTreeClassifier(max_depth=10, grace_period=100)
    ensemble_models.append(eft)
except ImportError:
    pass

final_voting = ensemble.VotingClassifier(*ensemble_models)

def evaluate_ensemble(model, X, y, title=""):
    acc = metrics.Accuracy()
    f1 = metrics.F1()
    rocauc = metrics.ROCAUC()
    data_stream = stream.iter_pandas(X, y)
    for x_row, y_row in data_stream:
        y_pred = model.predict_one(x_row)
        # Try to get probabilities, else use empty dict
        try:
            y_pred_proba = model.predict_proba_one(x_row)
        except NotImplementedError:
            y_pred_proba = {}
        acc.update(y_row, y_pred)
        f1.update(y_row, y_pred)
        rocauc.update(y_row, y_pred_proba)
        model.learn_one(x_row, y_row)
    print(f"{title} | Accuracy: {acc.get():.4f} | F1: {f1.get():.4f} | ROC AUC: {rocauc.get():.4f}")

    
evaluate_ensemble(final_voting.clone(), X_pruned_income, y_income, title="Final Voting Ensemble (income)")
evaluate_ensemble(final_voting.clone(), X_pruned_accum, y_accum, title="Final Voting Ensemble (accumulation)")

Final Voting Ensemble (income) | Accuracy: 0.7474 | F1: 0.6437 | ROC AUC: 0.5000
Final Voting Ensemble (accumulation) | Accuracy: 0.6470 | F1: 0.6737 | ROC AUC: 0.5000


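Accuracy and F1 are slightly below the Bagging_HT results above. The ROC AUC of exactly 0.5000 on both targets is most likely an artifact of the evaluation loop rather than a property of the model: when `predict_proba_one` raises `NotImplementedError` on the voting ensemble, the loop falls back to an empty dict, so every sample receives the same score.
Constant scores always give an AUC of 0.5, so this number says nothing about ranking quality or calibration.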
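
To make the 0.5 figure concrete, here is a minimal pure-Python sketch of the pairwise definition of ROC AUC. It is not River's implementation; `pairwise_auc` is a made-up helper, and the assumption that an empty probability dict is scored as 0.0 for the positive class should be checked against the installed River version.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 1, 0, 1, 0]

# Assumed fallback behaviour: an empty dict yields the same score for every sample
empty_proba = {}
constant_scores = [empty_proba.get(1, 0.0) for _ in y_true]

# Scores that actually separate the two classes (illustrative values)
informative_scores = [0.1, 0.9, 0.2, 0.7, 0.4]

auc_constant = pairwise_auc(y_true, constant_scores)
auc_informative = pairwise_auc(y_true, informative_scores)
print(auc_constant, auc_informative)
```

Every pair is a tie under constant scores, so the result is 0.5 regardless of how well the hard predictions perform.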

In [21]:
from river import naive_bayes, linear_model, neighbors, preprocessing, metrics, stream

# List of model pipelines to test (excluding tree-based and logistic regression)
models = {
    "GaussianNB": naive_bayes.GaussianNB(),
    "MultinomialNB": naive_bayes.MultinomialNB(),
    "Perceptron": preprocessing.StandardScaler() | linear_model.Perceptron(),
    "PassiveAggressive": preprocessing.StandardScaler() | linear_model.PAClassifier(),
    "KNNClassifier": neighbors.KNNClassifier(n_neighbors=8)  # window_size removed
}
results = {}

def evaluate_model(model, X, y):
    acc = metrics.Accuracy()
    f1 = metrics.F1()
    rocauc = metrics.ROCAUC()
    for x_row, y_row in stream.iter_pandas(X, y):
        y_pred = model.predict_one(x_row)
        try:
            y_pred_proba = model.predict_proba_one(x_row)
        except Exception:
            y_pred_proba = {}
        acc.update(y_row, y_pred)
        f1.update(y_row, y_pred)
        rocauc.update(y_row, y_pred_proba)
        model.learn_one(x_row, y_row)
    return acc.get(), f1.get(), rocauc.get()

# Evaluate on both targets
for name, model in models.items():
    results[f"{name} (income)"] = evaluate_model(model.clone(), X_pruned_income, y_income)
    results[f"{name} (accumulation)"] = evaluate_model(model.clone(), X_pruned_accum, y_accum)

# Print results
print(f"{'Model':<30} | {'Accuracy':>8} | {'F1':>8} | {'ROC AUC':>8}")
print('-'*65)
for model_name, (acc, f1, rocauc) in results.items():
    print(f"{model_name:<30} | {acc:8.4f} | {f1:8.4f} | {rocauc:8.4f}")

Model                          | Accuracy |       F1 |  ROC AUC
-----------------------------------------------------------------
GaussianNB (income)            |   0.7476 |   0.6595 |   0.7551
GaussianNB (accumulation)      |   0.6378 |   0.6841 |   0.6845
MultinomialNB (income)         |   0.6156 |   0.0010 |   0.5068
MultinomialNB (accumulation)   |   0.5232 |   0.6627 |   0.5083
Perceptron (income)            |   0.6436 |   0.5357 |   0.6495
Perceptron (accumulation)      |   0.5640 |   0.5754 |   0.5862
PassiveAggressive (income)     |   0.6212 |   0.5024 |   0.6327
PassiveAggressive (accumulation) |   0.5778 |   0.5916 |   0.5989
KNNClassifier (income)         |   0.7146 |   0.5507 |   0.7100
KNNClassifier (accumulation)   |   0.6606 |   0.6745 |   0.7108


<body>



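- **GaussianNB** is the strongest model on the income target (highest accuracy, F1, and ROC AUC) and has the best F1 on accumulation (0.6841).
- **KNNClassifier** has the best accuracy (0.6606) and ROC AUC (0.7108) on accumulation; on income its F1 (0.5507) trails GaussianNB (0.6595).
- **MultinomialNB** fails on income: an F1 of 0.0010 at 0.6156 accuracy means it almost never identifies a positive case. A likely cause is that it is designed for non-negative count features, which these inputs are not guaranteed to be.
- **Perceptron** and **PassiveAggressive** are only moderately better than chance (ROC AUC between 0.59 and 0.65).
- **Summary:**  
  GaussianNB and KNN are the most effective non-tree streaming models here. The linear models and MultinomialNB are less suitable for this dataset.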
</body>
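
The MultinomialNB income row combines a plausible accuracy with an F1 near zero. A small sketch shows how that can happen when a model almost never finds a positive case. The confusion counts below are invented for illustration and are not taken from the dataset; `accuracy_and_f1` is a hypothetical helper.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 (for the positive class) from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical counts: 385 positives out of 1000, only one of them found
acc, f1 = accuracy_and_f1(tp=1, fp=1, fn=384, tn=614)
print(f"accuracy={acc:.4f}, f1={f1:.4f}")
```

Accuracy is carried by the true negatives, while F1 collapses because recall on the positive class is close to zero.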

In [22]:
from river import naive_bayes, linear_model, neighbors, preprocessing, ensemble, metrics, stream

# Define all models (as instances, not tuples)
ensemble_models = [
    naive_bayes.GaussianNB(),
    naive_bayes.MultinomialNB(),
    preprocessing.StandardScaler() | linear_model.Perceptron(),
    preprocessing.StandardScaler() | linear_model.PAClassifier(),
    neighbors.KNNClassifier(n_neighbors=8)
]

# Create the final voting ensemble
final_voting = ensemble.VotingClassifier(models=ensemble_models)

def evaluate_ensemble(model, X, y, title=""):
    acc = metrics.Accuracy()
    f1 = metrics.F1()
    rocauc = metrics.ROCAUC()
    data_stream = stream.iter_pandas(X, y)
    for x_row, y_row in data_stream:
        y_pred = model.predict_one(x_row)
        try:
            y_pred_proba = model.predict_proba_one(x_row)
        except Exception:
            y_pred_proba = {}
        acc.update(y_row, y_pred)
        f1.update(y_row, y_pred)
        rocauc.update(y_row, y_pred_proba)
        model.learn_one(x_row, y_row)
    print(f"{title} | Accuracy: {acc.get():.4f} | F1: {f1.get():.4f} | ROC AUC: {rocauc.get():.4f}")

# Evaluate on your datasets
evaluate_ensemble(final_voting.clone(), X_pruned_income, y_income, title="Final Voting Ensemble (income)")
evaluate_ensemble(final_voting.clone(), X_pruned_accum, y_accum, title="Final Voting Ensemble (accumulation)")

Final Voting Ensemble (income) | Accuracy: 0.7174 | F1: 0.5818 | ROC AUC: 0.5000
Final Voting Ensemble (accumulation) | Accuracy: 0.6174 | F1: 0.6379 | ROC AUC: 0.5000


**Code summary:**  
- Builds a streaming voting ensemble using GaussianNB, MultinomialNB, Perceptron, PassiveAggressive, and KNN (all from River).
- Each model predicts on each sample, votes are combined, and metrics (accuracy, F1, ROC AUC) are updated online.

**Results:**  
- **Income:** Accuracy 0.7174, F1 0.5818, ROC AUC 0.5000  
- **Accumulation:** Accuracy 0.6174, F1 0.6379, ROC AUC 0.5000  
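- **Interpretation:**  
  - Accuracy and F1 are below GaussianNB alone on both targets (0.7476 / 0.6595 on income, 0.6378 / 0.6841 on accumulation), so the weaker members (MultinomialNB, Perceptron, PassiveAggressive) pull the vote down.
  - The ROC AUC of exactly 0.5 is most likely an artifact: when `predict_proba_one` fails on the ensemble, the loop passes an empty dict to the metric, so every sample gets the same score.  
  - To judge ranking quality, the ensemble needs to expose real probabilities, for example by averaging the members' `predict_proba_one` outputs.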
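
As a rough illustration of how votes are combined, here is a minimal sketch of plain majority voting over member predictions. It is not River's code: `majority_vote` is a hypothetical helper, and River's `VotingClassifier` may instead sum class probabilities depending on its `use_probabilities` setting.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among member predictions, or None if there are none."""
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None
    # On a tie, Counter.most_common keeps the label that was seen first
    return votes.most_common(1)[0][0]


# Hypothetical predictions from five members for one sample
member_predictions = [1, 0, 1, 1, 0]
ensemble_prediction = majority_vote(member_predictions)
print(ensemble_prediction)
```

With hard votes, one strong member can be outvoted by several weak ones, which is one reason an ensemble can score below its best member.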

Next, combine the best tree-based and non-tree-based models into a single voting ensemble.

In [23]:
from river import naive_bayes, neighbors, tree, ensemble, metrics, stream
import pandas as pd
import numpy as np

# --------- 1. Data Preparation ----------
def ensure_pandas(X, y):
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X)
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
    y = y.astype(int)
    return X, y

# Replace these lines with your actual data assignments!
# X_pruned_income, y_income = ...
# X_pruned_accum, y_accum = ...

X_pruned_income, y_income = ensure_pandas(X_pruned_income, y_income)
X_pruned_accum, y_accum = ensure_pandas(X_pruned_accum, y_accum)

# --------- 2. Ensemble Models (only those that support predict_proba_one) ----------
ensemble_models = [
    naive_bayes.GaussianNB(),
    neighbors.KNNClassifier(n_neighbors=8),
    tree.ExtremelyFastDecisionTreeClassifier()
    # Note: tree.HoeffdingTreeClassifier() also implements predict_proba_one (the Bagging_HT cell relies on it)
]

# --------- 3. Diagnostic: Check models for predict_proba_one support ----------
for m in ensemble_models:
    print(f"Testing {type(m).__name__} for probability support:")
    try:
        # Need to warm-start before probabilities for most models
        m.learn_one(X_pruned_income.iloc[0], y_income.iloc[0])
        print("Proba output:", m.predict_proba_one(X_pruned_income.iloc[0]))
    except NotImplementedError:
        print(f"{type(m).__name__} does NOT implement predict_proba_one!")
    except Exception as e:
        print(f"{type(m).__name__} error: {e}")

# --------- 4. Helper: Warm-start all models with each class ----------
def force_warm_start(model, X, y):
    for label in [0, 1]:
        idx = np.where(y.values == label)[0][0]
        x_row = X.iloc[idx]
        y_row = y.iloc[idx]
        model.learn_one(x_row, y_row)

# --------- 5. Ensemble Evaluation Function ----------
def evaluate_ensemble(model, X, y, title=""):
    acc = metrics.Accuracy()
    f1 = metrics.F1()
    rocauc = metrics.ROCAUC()
    data_stream = stream.iter_pandas(X, y)
    for idx, (x_row, y_row) in enumerate(data_stream):
        y_pred = model.predict_one(x_row)
        try:
            y_pred_proba = model.predict_proba_one(x_row)
        except NotImplementedError:
            y_pred_proba = {}
        if idx < 5:
            print(f"True: {y_row}, Proba: {y_pred_proba}")
        acc.update(y_row, y_pred)
        f1.update(y_row, y_pred)
        rocauc.update(y_row, y_pred_proba)
        model.learn_one(x_row, y_row)
    print(f"{title} | Accuracy: {acc.get():.4f} | F1: {f1.get():.4f} | ROC AUC: {rocauc.get():.4f}")

# --------- 6. INCOME: Warm-start and evaluate ensemble ----------
# clone() returns fresh, untrained copies, so warm-starting the list members first would have no effect
voting_ensemble_income = ensemble.VotingClassifier(models=[m.clone() for m in ensemble_models])
# Show the ensemble one example of each class before the evaluation loop starts
force_warm_start(voting_ensemble_income, X_pruned_income, y_income)
evaluate_ensemble(voting_ensemble_income, X_pruned_income, y_income, title="Voting Ensemble (income)")

# --------- 7. ACCUMULATION: Warm-start and evaluate ensemble ----------
# clone() returns fresh, untrained copies, so warm-starting the list members first would have no effect
voting_ensemble_accum = ensemble.VotingClassifier(models=[m.clone() for m in ensemble_models])
# Show the ensemble one example of each class before the evaluation loop starts
force_warm_start(voting_ensemble_accum, X_pruned_accum, y_accum)
evaluate_ensemble(voting_ensemble_accum, X_pruned_accum, y_accum, title="Voting Ensemble (accumulation)")

Testing GaussianNB for probability support:
Proba output: {0: 1.0}
Testing KNNClassifier for probability support:
Proba output: {0: 1.0}
Testing ExtremelyFastDecisionTreeClassifier for probability support:
Proba output: {0: 1.0}
True: 0, Proba: {}
True: 1, Proba: {}
True: 0, Proba: {}
True: 1, Proba: {}
True: 0, Proba: {}
Voting Ensemble (income) | Accuracy: 0.7630 | F1: 0.6447 | ROC AUC: 0.5000
True: 1, Proba: {}
True: 0, Proba: {}
True: 1, Proba: {}
True: 1, Proba: {}
True: 0, Proba: {}
Voting Ensemble (accumulation) | Accuracy: 0.6680 | F1: 0.6967 | ROC AUC: 0.5000


**Summary of Results:**

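- **All three member models (GaussianNB, KNN, ExtremelyFastDecisionTree) return probabilities from `predict_proba_one`.** The voting ensemble itself does not: the first five `Proba:` lines printed for each target are empty dicts.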
- **Voting Ensemble (income):**
  - Accuracy: **0.7630**
  - F1: **0.6447**
  - ROC AUC: **0.5000**
- **Voting Ensemble (accumulation):**
  - Accuracy: **0.6680**
  - F1: **0.6967**
  - ROC AUC: **0.5000**

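**Interpretation:**
- **Accuracy and F1 are competitive.** Income accuracy (0.7630) is the best so far, ahead of Bagging_HT (0.7538); accumulation accuracy (0.6680) is just below Bagging_HT (0.6712).
- **ROC AUC of 0.5 is not a ranking result.** The printed `Proba: {}` lines show that the metric received no probabilities from the ensemble, so every sample was scored identically.
- **Takeaway:**  
  The ensemble is effective for classification. Its ranking quality is still unmeasured; measuring it requires probabilities from the ensemble, for example an average of the members' outputs.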
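
If ranking metrics are wanted from the ensemble, one option is to average the members' probability outputs directly. The sketch below is a hypothetical wrapper, not part of River: `SoftVotingProba` and the stub `_ConstantModel` are invented names, and the wrapper assumes every member implements `learn_one` and `predict_proba_one`.

```python
from collections import defaultdict


class SoftVotingProba:
    """Hypothetical wrapper: averages the predict_proba_one outputs of its members."""

    def __init__(self, models):
        self.models = list(models)

    def learn_one(self, x, y):
        for m in self.models:
            m.learn_one(x, y)

    def predict_proba_one(self, x):
        totals = defaultdict(float)
        n_used = 0
        for m in self.models:
            proba = m.predict_proba_one(x)
            if not proba:  # skip members that have nothing to say yet
                continue
            n_used += 1
            for label, p in proba.items():
                totals[label] += p
        if n_used == 0:
            return {}
        return {label: s / n_used for label, s in totals.items()}

    def predict_one(self, x):
        proba = self.predict_proba_one(x)
        return max(proba, key=proba.get) if proba else None


class _ConstantModel:
    """Stub member used only for this demo; always returns the same probabilities."""

    def __init__(self, proba):
        self.proba = proba

    def learn_one(self, x, y):
        pass

    def predict_proba_one(self, x):
        return dict(self.proba)


voter = SoftVotingProba([_ConstantModel({0: 0.2, 1: 0.8}), _ConstantModel({0: 0.6, 1: 0.4})])
avg = voter.predict_proba_one({})
print(avg, voter.predict_one({}))
```

An object like this could be passed to the evaluation loop in place of the voting ensemble, so that the ROC AUC metric receives a per-sample score instead of an empty dict.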

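**ADWIN Explanation:**  
ADWIN (Adaptive Windowing) is a streaming change detector. It keeps a variable-length window over a numeric stream and signals a change when the means of two sub-windows differ by more than a statistical bound. Here the stream is the model's per-sample correctness (1 = correct, 0 = wrong), so a detection means the running accuracy has shifted, which can be a symptom of concept drift. ADWIN only raises the signal; adapting the model (for example by resetting or retraining it) has to be coded separately.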
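
The windowing idea can be sketched in a few lines of pure Python. This is a deliberately simplified, slow variant for illustration, not River's `drift.ADWIN`: the class name `SimpleAdwin`, the `max_window` cap, and the single cut per update are assumptions of this sketch. The cut threshold follows the usual ADWIN form, $\epsilon_{cut} = \sqrt{\frac{1}{2m}\ln\frac{4n}{\delta}}$ with $m$ the harmonic mean of the two sub-window sizes.

```python
import math
from collections import deque


class SimpleAdwin:
    """Simplified ADWIN-style detector (illustrative only)."""

    def __init__(self, delta=0.002, max_window=500):
        self.delta = delta
        self.window = deque(maxlen=max_window)
        self.drift_detected = False

    def update(self, value):
        self.drift_detected = False
        self.window.append(float(value))
        n = len(self.window)
        total = sum(self.window)
        head_sum = 0.0
        # Try every split of the window into an older part (n0) and a newer part (n1)
        for n0, v in enumerate(list(self.window)[:-1], start=1):
            head_sum += v
            n1 = n - n0
            mean_gap = abs(head_sum / n0 - (total - head_sum) / n1)
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if mean_gap > eps_cut:
                # Drop the older part and report a change
                for _ in range(n0):
                    self.window.popleft()
                self.drift_detected = True
                break


# Synthetic correctness stream: always right, then always wrong
stream_values = [1] * 200 + [0] * 200
detector = SimpleAdwin()
detections = []
for i, v in enumerate(stream_values):
    detector.update(v)
    if detector.drift_detected:
        detections.append(i)
print(detections)
```

On a stream whose mean never changes, no split exceeds the bound and the window simply keeps growing up to the cap.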


In [35]:
from river import drift, metrics, stream

def evaluate_ensemble_with_drift(model, X, y, title=""):
    """Online evaluation that also feeds per-sample correctness to an ADWIN detector."""
    acc = metrics.Accuracy()
    f1 = metrics.F1()
    rocauc = metrics.ROCAUC()
    adwin = drift.ADWIN()
    data_stream = stream.iter_pandas(X, y)
    for idx, (x_row, y_row) in enumerate(data_stream):
        y_pred = model.predict_one(x_row)
        try:
            y_pred_proba = model.predict_proba_one(x_row)
        except NotImplementedError:
            y_pred_proba = {}
        correct = int(y_pred == y_row)
        # Recent River releases expose the flag as `drift_detected`; older ones used
        # `change_detected` or returned it from update(). Adjust to your installed version.
        adwin.update(correct)
        if adwin.drift_detected:
            print(f"[Drift detected at sample {idx}] Accuracy window mean changed!")
        acc.update(y_row, y_pred)
        f1.update(y_row, y_pred)
        rocauc.update(y_row, y_pred_proba)
        model.learn_one(x_row, y_row)
    print(f"{title} | Accuracy: {acc.get():.4f} | F1: {f1.get():.4f} | ROC AUC: {rocauc.get():.4f}")

# The detector only reports changes; nothing here resets or retrains the model.
# Both ensembles were already trained on the full stream in the previous cell,
# so this is a second pass over the same data.
evaluate_ensemble_with_drift(voting_ensemble_income, X_pruned_income, y_income, title="Voting Ensemble (income) with ADWIN")
evaluate_ensemble_with_drift(voting_ensemble_accum, X_pruned_accum, y_accum, title="Voting Ensemble (accumulation) with ADWIN")


Voting Ensemble (income) with ADWIN | Accuracy: 0.7748 | F1: 0.6641 | ROC AUC: 0.5000
Voting Ensemble (accumulation) with ADWIN | Accuracy: 0.7066 | F1: 0.7114 | ROC AUC: 0.5000


**Interpretation:**  
- **Accuracy and F1 are higher** than in the previous cell (income 0.7748 vs 0.7630, accumulation 0.7066 vs 0.6680), but ADWIN is not the cause. The function only prints a message when a change is flagged and never modifies the model. `voting_ensemble_income` and `voting_ensemble_accum` had already learned from the full stream in the previous cell, so this run is a second pass over data the models have seen, and the gain is optimistic.
- **No drift message appears in the printed output.** That would be unsurprising for a random extraction with no time ordering.
- **ROC AUC remains 0.5** for the same reason as before: the metric receives an empty dict instead of probabilities from the ensemble.

**Summary:**  
ADWIN is wired in as a monitor here, not as an adaptation mechanism. A fair comparison needs freshly cloned ensembles, and a meaningful ROC AUC needs real probability outputs from the ensemble.
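
For a detector to influence results, the loop has to act on its signal. One simple strategy, sketched below under stated assumptions, is to replace the model with a fresh one whenever drift is flagged. This is not what the notebook does. `prequential_with_reset`, `MajorityModel`, and `LowAccuracyDetector` are hypothetical stand-ins that mimic the `learn_one` / `predict_one` / `drift_detected` conventions so the example runs without River.

```python
from collections import Counter, deque


class MajorityModel:
    """Stub classifier: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class LowAccuracyDetector:
    """Stub detector: flags drift when the last `size` outcomes are mostly wrong."""

    def __init__(self, size=10, threshold=0.2):
        self.recent = deque(maxlen=size)
        self.threshold = threshold
        self.drift_detected = False

    def update(self, correct):
        self.recent.append(correct)
        full = len(self.recent) == self.recent.maxlen
        self.drift_detected = full and sum(self.recent) / len(self.recent) < self.threshold


def prequential_with_reset(make_model, make_detector, samples):
    """Test-then-train loop that swaps in a fresh model when drift is flagged."""
    model, detector = make_model(), make_detector()
    n_correct, resets = 0, []
    for i, (x, y) in enumerate(samples):
        correct = int(model.predict_one(x) == y)
        n_correct += correct
        detector.update(correct)
        if detector.drift_detected:
            model, detector = make_model(), make_detector()
            resets.append(i)
        model.learn_one(x, y)
    return n_correct / max(len(samples), 1), resets


# Synthetic stream whose label flips halfway through
samples = [({}, 0)] * 100 + [({}, 1)] * 100
accuracy, resets = prequential_with_reset(MajorityModel, LowAccuracyDetector, samples)
print(accuracy, resets)
```

In a real setup the factories would return a cloned River model and a River detector; whether a full reset is the right reaction depends on how abrupt the drift is.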