# Project 2

The goal of this project is to develop a parsimonious predictive model that accurately identifies customers most likely to respond to a bank's marketing offer. The model should efficiently utilize a limited number of predictive variables to maximize accuracy and minimize cost, optimizing the balance between predictive performance and economic expenditure.

The model will be trained on a dataset comprising 5000 historical instances, each described by 500 anonymized variables. It will be evaluated based on its ability to select 1000 customers from a test set, predicted to accept the marketing offer, while minimizing the use of costly variables. The effectiveness of the model will be measured by the net profit achieved, considering both the rewards for correct predictions and penalties for each variable used.

It may help to treat this as a regression problem, where the target is the probability that a customer accepts the offer. Regression techniques then produce a score for each customer, and customers can be classified as responders or non-responders either by applying a threshold to that score or, since a fixed number of customers must be selected, by ranking them and keeping the top-scoring ones.
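As a minimal sketch of the ranking idea, the snippet below picks the `k` highest-scoring customers from an array of predicted scores. The helper name `select_top_k` and the toy scores are illustrative assumptions, not part of the project code.

```python
import numpy as np


def select_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest scores, best first (hypothetical helper)."""
    order = np.argsort(scores)[::-1]  # argsort is ascending, so reverse it
    return order[:k]


# toy predicted probabilities for five customers
scores = np.array([0.10, 0.85, 0.40, 0.95, 0.30])
chosen = select_top_k(scores, k=2)
print(chosen)  # [3 1]
```

Ranking avoids having to tune a threshold: the number of selected customers is fixed in advance, so only the ordering of the scores matters.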

In [120]:
import numpy as np
import pandas as pd
import xgboost as xgb
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

## Load data

In [121]:
X_input = np.loadtxt('data/x_train.txt')
y_input = np.loadtxt('data/y_train.txt')

print(f'Data: {X_input.shape[0]} samples, {X_input.shape[1]} variables')
print(f'Target: {y_input.mean():.2%} positive')

Data: 5000 samples, 500 variables
Target: 49.92% positive


## Split data
Assume a share of positive instances in the test set. This matters because the net profit depends on how many responders the test set contains.
`percentage_positive` should range from 20% to 50% (both inclusive).
The test set takes 20% of all the data (1000 samples) with the requested share of positives. The training set is built from the remaining data and downsampled so that about 50% of its instances are positive.

In [122]:
def get_data(percentage_positive: float):
    assert 0.2 <= percentage_positive <= 0.5
    
    X_pos = X_input[y_input == 1]
    X_neg = X_input[y_input == 0]
    y_pos = y_input[y_input == 1]
    y_neg = y_input[y_input == 0]
    
    num_test_samples = int(0.2 * X_input.shape[0])
    num_pos_test_samples = int(percentage_positive * num_test_samples)
    num_neg_test_samples = num_test_samples - num_pos_test_samples
    
    X_train_pos, X_test_pos, y_train_pos, y_test_pos = train_test_split(X_pos, y_pos, test_size=num_pos_test_samples)
    X_train_neg, X_test_neg, y_train_neg, y_test_neg = train_test_split(X_neg, y_neg, test_size=num_neg_test_samples)
    
    # keep at most as many training positives as training negatives (roughly balances the classes)
    num_pos_train = X_train_neg.shape[0]
    X_train_pos = X_train_pos[:num_pos_train]
    y_train_pos = y_train_pos[:num_pos_train]
    
    X_train = np.vstack([X_train_pos, X_train_neg])
    X_test = np.vstack([X_test_pos, X_test_neg])
    y_train = np.hstack([y_train_pos, y_train_neg])
    y_test = np.hstack([y_test_pos, y_test_neg])
    
    return X_train, X_test, y_train, y_test

In [123]:
for percentage_positive in [0.2, 0.3, 0.4, 0.5]:
    X_train, X_test, y_train, y_test = get_data(percentage_positive)
    print(f'\nExpected % positive in test set: {percentage_positive:.2%}')
    print(f'Train data: {X_train.shape[0]} samples, {y_train.mean():.2%} positive')
    print(f'Test data: {X_test.shape[0]} samples, {y_test.mean():.2%} positive')
    


Expected % positive in test set: 20.00%
Train data: 3408 samples, 50.00% positive
Test data: 1000 samples, 20.00% positive

Expected % positive in test set: 30.00%
Train data: 3608 samples, 50.00% positive
Test data: 1000 samples, 30.00% positive

Expected % positive in test set: 40.00%
Train data: 3808 samples, 50.00% positive
Test data: 1000 samples, 40.00% positive

Expected % positive in test set: 50.00%
Train data: 4000 samples, 49.90% positive
Test data: 1000 samples, 50.00% positive


## Evaluation function
Select the 200 customers from the test set with the highest predicted probability of accepting the offer.
Then compute the net profit from their actual responses and the cost of the variables used:
$$
\text{Net profit} = 50 \times \text{TP} - 200 \times \text{number of variables}
$$
where:
- TP is the number of true positives (correctly predicted responders),
- the cost of using a variable is 200,
- the reward for correctly predicting a responder is 50 (the original reward scaled by 5, since only 200 customers are selected here rather than 1000).
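A short worked example of the formula may help. The function name `net_profit` and the sample numbers below are illustrative assumptions; only the reward of 50 and the cost of 200 come from the definition above.

```python
def net_profit(true_positives: int, n_features: int,
               reward_per_tp: int = 50, cost_per_feature: int = 200) -> int:
    """Net profit = reward * TP - cost * number of variables (hypothetical helper)."""
    return reward_per_tp * true_positives - cost_per_feature * n_features


# 150 responders among the 200 selected customers, using 6 variables
print(net_profit(150, 6))  # 6300

# best possible outcome with 6 variables: all 200 selected customers respond
print(net_profit(200, 6))  # 8800
```

Since 200 / 50 = 4, an additional variable only pays off if it raises the number of true positives among the 200 selected customers by more than 4.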


In [124]:
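def evaluate_selection(model, X_train_selected, X_val_selected, y_train, y_val, verbose=False) -> int:
    """Fit the model on the selected features and return the net profit on the validation set."""
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_val_selected)
    
    # rank validation customers by predicted score, highest first, and keep the top 200
    indices = np.argsort(y_pred)[::-1]
    selected_indices = indices[:200]
    
    # y_val is 0/1, so its sum over the selection is the number of true positives
    TP = np.sum(y_val[selected_indices])
    number_of_features = X_val_selected.shape[1]
    
    # reward of 50 per true positive minus a cost of 200 per feature used
    score = int(50 * TP - 200 * number_of_features)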
    
    if verbose:
        print(f'Score: {score}, TP: {TP:.0f}/200, #features: {number_of_features}')
    
    return score


def full_evaluation(model, selected_features, cv=5):
    results = {}
    for percentage_positive in [0.2, 0.3, 0.4, 0.5]:
        scores = []
        for _ in range(cv):
            X_train, X_val, y_train, y_val = get_data(percentage_positive)
            X_train_selected = X_train[:, selected_features]
            X_val_selected = X_val[:, selected_features]
            score = evaluate_selection(model, X_train_selected, X_val_selected, y_train, y_val)
            scores.append(score)
        mean_score = np.mean(scores)
        results[percentage_positive] = mean_score
        print(f'Expected % positive in test set: {percentage_positive:.2%}, mean score: {mean_score}')
    return results

evaluation_models = {
    'XGBoost': xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=100),
    'RandomForest': RandomForestRegressor(random_state=0, n_estimators=100)
}

## Feature selection
Base class `IFeatureSelector` with the following members:
- `fit`: train the selector
- `selected_features`: return the indices of the selected features (indexed from 0)

In [125]:
from abc import ABC, abstractmethod


class IFeatureSelector(ABC):
    @abstractmethod
    def fit(self, X_train, y_train):
        pass
    
    @property
    @abstractmethod
    def selected_features(self) -> np.ndarray:
        pass

### Mean decrease impurity

In [126]:
class MeanDecreaseImpuritySelector(IFeatureSelector):
    def __init__(self, n_features: int, n_estimators: int = 100):
        self.n_features = n_features
        self.rf = RandomForestRegressor(n_estimators=n_estimators)
        self._selected_features = None
    
    def fit(self, X_train, y_train):
        self.rf.fit(X_train, y_train)
        importances = self.rf.feature_importances_
        indices = np.argsort(importances)[::-1]
        self._selected_features = indices[:self.n_features]
    
    @property
    def selected_features(self) -> np.ndarray:
        return self._selected_features

In [127]:
selector = MeanDecreaseImpuritySelector(n_features=6, n_estimators=5)
selector.fit(X_input, y_input)

selected_features = selector.selected_features
print(f'Selected features: {selected_features}')

Selected features: [100 102 105 104 103 101]


In [128]:
for model_name, model in evaluation_models.items():
    print(f'\nModel: {model_name}')
    full_evaluation(model, selected_features)


Model: XGBoost
Expected % positive in test set: 20.00%, mean score: 3150.0
Expected % positive in test set: 30.00%, mean score: 4630.0
Expected % positive in test set: 40.00%, mean score: 5680.0
Expected % positive in test set: 50.00%, mean score: 6480.0

Model: RandomForest
Expected % positive in test set: 20.00%, mean score: 2900.0
Expected % positive in test set: 30.00%, mean score: 4110.0
Expected % positive in test set: 40.00%, mean score: 5080.0
Expected % positive in test set: 50.00%, mean score: 6020.0


### Drop highly correlated features

In [129]:
def remove_correlated_greedy(X, threshold=0.5):
    df = pd.DataFrame(X)
    corr_matrix = df.corr().abs()
    
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    
    reduced_df = df.drop(columns=to_drop)
    
    return reduced_df
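
To illustrate what the greedy filter does, here is a toy sketch that repeats its core steps on three synthetic columns, two of which are nearly identical. The data and the names `X_toy`, `a`, and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

# column 1 is a noisy copy of column 0; column 2 is independent of both
X_toy = np.column_stack([a, a + 0.01 * rng.normal(size=200), b])

corr = pd.DataFrame(X_toy).corr().abs()

# keep only the strict upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# a column is dropped if it correlates strongly with any earlier column
to_drop = [int(col) for col in upper.columns if (upper[col] > 0.5).any()]
print(to_drop)  # [1]
```

Because only the upper triangle is inspected, the first column of a correlated group is the one that survives, so the result depends on column order.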

In [130]:
def remove_correlated_hierarchy(X, threshold=0.5):
    df = pd.DataFrame(X)
    corr_matrix = df.corr().abs()

    distance_matrix = 1 - corr_matrix
    condensed_distance_matrix = squareform(distance_matrix, checks=False)

    Z = linkage(condensed_distance_matrix, method='average')

    clusters = fcluster(Z, threshold, criterion='distance')

    def get_representative_feature(cluster_features, corr_matrix):
        if len(cluster_features) == 1:
            return cluster_features[0]
        else:
            avg_corr = corr_matrix.loc[cluster_features, cluster_features].mean()
            return avg_corr.idxmax()

    cluster_dict = {i: df.columns[np.where(clusters == i)[0]] for i in np.unique(clusters)}

    representative_features = [get_representative_feature(cluster_dict[i], corr_matrix) for i in cluster_dict]

    reduced_df = df[representative_features]

    return reduced_df

In [135]:
X_input_reduced_greedy = remove_correlated_greedy(X_input)
selector = MeanDecreaseImpuritySelector(n_features=6, n_estimators=1)
selector.fit(X_input_reduced_greedy, y_input)
selected_features = selector.selected_features
print(f'Selected features: {selected_features}')
for model_name, model in evaluation_models.items():
    print(f'\nModel: {model_name}')
    full_evaluation(model, selected_features)

Selected features: [ 92 341 235 171 197 274]

Model: XGBoost
Expected % positive in test set: 20.00%, mean score: 750.0
Expected % positive in test set: 30.00%, mean score: 1610.0
Expected % positive in test set: 40.00%, mean score: 2750.0
Expected % positive in test set: 50.00%, mean score: 3770.0

Model: RandomForest
Expected % positive in test set: 20.00%, mean score: 690.0
Expected % positive in test set: 30.00%, mean score: 1660.0
Expected % positive in test set: 40.00%, mean score: 2910.0
Expected % positive in test set: 50.00%, mean score: 3860.0


In [137]:
X_input_reduced_hierarchy = remove_correlated_hierarchy(X_input)
selector = MeanDecreaseImpuritySelector(n_features=6, n_estimators=1)
selector.fit(X_input_reduced_hierarchy, y_input)
selected_features = selector.selected_features
print(f'Selected features: {selected_features}')
for model_name, model in evaluation_models.items():
    print(f'\nModel: {model_name}')
    full_evaluation(model, selected_features)

Selected features: [82 80 83 77 86 81]

Model: XGBoost
Expected % positive in test set: 20.00%, mean score: 980.0
Expected % positive in test set: 30.00%, mean score: 1670.0
Expected % positive in test set: 40.00%, mean score: 3150.0
Expected % positive in test set: 50.00%, mean score: 4020.0

Model: RandomForest
Expected % positive in test set: 20.00%, mean score: 800.0
Expected % positive in test set: 30.00%, mean score: 1860.0
Expected % positive in test set: 40.00%, mean score: 2640.0
Expected % positive in test set: 50.00%, mean score: 3930.0
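
One thing worth checking in the two runs above: the selector is fitted on a reduced DataFrame, so the indices it returns are positions within that reduced frame, while `full_evaluation` uses them to index columns of the full data. Both correlation filters keep the original column labels, so the positions can be translated back. The helper name `to_original_indices` below is a hypothetical addition, shown as a sketch on a toy frame.

```python
import numpy as np
import pandas as pd


def to_original_indices(reduced_df: pd.DataFrame, positions) -> np.ndarray:
    """Map positional indices in a reduced frame back to original column labels.

    Assumes the reduced frame still carries the original integer column labels.
    """
    return reduced_df.columns.to_numpy()[np.asarray(positions)]


# toy frame with 5 columns; dropping labels 1 and 3 leaves labels 0, 2, 4
full = pd.DataFrame(np.zeros((3, 5)))
reduced = full.drop(columns=[1, 3])

# position 2 in the reduced frame is original column 4, position 0 is column 0
print(to_original_indices(reduced, [2, 0]))  # [4 0]
```

In the notebook this could be applied as `to_original_indices(X_input_reduced_greedy, selector.selected_features)` before calling `full_evaluation`. The scores printed above were obtained without this mapping and would likely change.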
