In [59]:
## Importing Libraries
import pandas as pd

## XGBoost model
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

## SKLearn libraries
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.utils import resample
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, f1_score
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import OneHotEncoder, StandardScaler


## Ideation and Approach:

1. The dataset is highly imbalanced. To handle this, I will use XGBoost with class weighting, giving more weight to the minority (donor) class.
2. EDA showed that some features correlate with donation but many do not, and no clear manual feature selection emerged. To handle this, I will build the model using RFE (Recursive Feature Elimination).
3. I will combine all donation targets into one feature called `donated`. The model predicts how likely each person is to donate, and the predicted donors are the target audience for the mailing campaign.
4. Once the donors are identified, we can estimate their donations for the next 6, 12, 18 and 24 months.


In [60]:
donor_df = pd.read_csv('../Data/cleaned_data/data_for_modeling.csv')

In [61]:
donor_df.shape

(19895, 30)

In [62]:
# donor_df.dtypes
donor_df.columns

Index(['TARGET_B', 'TARGET_D6', 'TARGET_D12', 'TARGET_D18', 'TARGET_D24',
       'CONTROL_NUMBER', 'MONTHS_SINCE_ORIGIN', 'NUMBER_OF_RESP', 'DONOR_AGE',
       'IN_HOUSE', 'URBANICITY', 'CLUSTER_CODE', 'HOME_OWNER', 'DONOR_GENDER',
       'INCOME_GROUP', 'PUBLISHED_PHONE', 'WEALTH_RATING', 'MEDIAN_HOME_VALUE',
       'MEDIAN_HOUSEHOLD_INCOME', 'PCT_OWNER_OCCUPIED', 'PEP_STAR',
       'RECENT_STAR_STATUS', 'RECENCY_FREQ_STATUS',
       'RECENT_CARD_RESPONSE_PROP', 'MONTHS_SINCE_LAST_PROM_RESP',
       'LAST_GIFT_AMT', 'NUMBER_PROM_12', 'MONTHS_SINCE_LAST_GIFT',
       'MONTHS_SINCE_FIRST_GIFT', 'donated'],
      dtype='object')

In [63]:
## Creating features and target variable
X = donor_df.drop(columns=['TARGET_B', 'TARGET_D6', 'TARGET_D12', 'TARGET_D18', 'TARGET_D24', 'CONTROL_NUMBER', 'donated'])
y = donor_df['donated']

### 1. Data Prep for building a predictive model

I will split the data into train and test sets at an 80:20 ratio, then split the training portion 85:15 into training and validation sets.

I will apply the column transformations after the split to avoid data leakage, so that the test data is treated as unseen data, as it would be in a real-world scenario.

In [64]:
X.head()

Unnamed: 0,MONTHS_SINCE_ORIGIN,NUMBER_OF_RESP,DONOR_AGE,IN_HOUSE,URBANICITY,CLUSTER_CODE,HOME_OWNER,DONOR_GENDER,INCOME_GROUP,PUBLISHED_PHONE,...,PCT_OWNER_OCCUPIED,PEP_STAR,RECENT_STAR_STATUS,RECENCY_FREQ_STATUS,RECENT_CARD_RESPONSE_PROP,MONTHS_SINCE_LAST_PROM_RESP,LAST_GIFT_AMT,NUMBER_PROM_12,MONTHS_SINCE_LAST_GIFT,MONTHS_SINCE_FIRST_GIFT
0,137,15,42.0,1,S,11.0,H,F,6.0,1,...,45,1,0,A4,0.4,17.0,189.0,15,17,128
1,65,33,42.0,1,R,53.0,U,M,5.0,0,...,79,1,0,A1,0.0,18.0,0.0,33,7,57
2,53,16,42.0,1,U,7.0,H,M,5.0,0,...,55,0,0,A1,0.167,20.0,0.0,16,20,57
3,53,33,42.0,0,C,22.0,U,M,5.0,0,...,18,1,1,S3,0.125,18.0,0.0,33,18,52
4,17,28,42.0,1,C,24.0,H,M,6.0,0,...,85,0,0,F1,0.0,17.0,50.0,28,20,20


In [65]:
# y

In [66]:
## Splitting the data into training+validation and testing set and ensuring the data distribution remains the same
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, stratify= y, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.15, stratify= y_train_val, random_state=7)

In [67]:
print("X-train-val: ", X_train_val.shape)
print("X-train: ", X_train.shape)
print("X-val: ", X_val.shape)
print("X-test: ", X_test.shape)

X-train-val:  (15916, 23)
X-train:  (13528, 23)
X-val:  (2388, 23)
X-test:  (3979, 23)
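As a quick sanity check on the split, the printed shapes can be converted into shares of the full dataset. The sketch below uses only numbers that appear in this notebook's outputs: row counts from the shape printouts, and donor counts from the `value_counts()` output (train) and the `support` column of the classification reports (validation, test) further down. The variable names are illustrative and not part of the original notebook.

```python
# Row counts as printed in this notebook (total rows from donor_df.shape).
total_rows = 19895
split_rows = {"train": 13528, "val": 2388, "test": 3979}

# Donor counts per split, taken from the value_counts() output (train)
# and the `support` column of the classification reports (val, test).
split_donors = {"train": 671, "val": 119, "test": 197}

# The three splits should cover the whole dataset exactly once.
assert sum(split_rows.values()) == total_rows

for name, rows in split_rows.items():
    share = rows / total_rows                 # fraction of all rows in this split
    donor_rate = split_donors[name] / rows    # fraction of donors within the split
    print(f"{name:>5}: {share:.1%} of rows, donor rate {donor_rate:.2%}")
```

An 80:20 split followed by an 85:15 split gives roughly 68% / 12% / 20% of the rows for train / validation / test, and stratification keeps the donor rate close to 5% in each split.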


In [68]:
print(X_train_val.dtypes.nunique())
print(set(X_train_val.dtypes.to_list()))

3
{dtype('O'), dtype('float64'), dtype('int64')}


In [69]:
## Selecting categorical and numerical features to use in the pipeline
cat_to_transform = X_train.select_dtypes(include=['object']).columns
numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns 

In [70]:
## Defining the pipeline
# Scaling numerical features and encoding categorical features

preprocessor = ColumnTransformer(
    transformers = [
        ('numerical', StandardScaler(), numerical_features),
        ('categorical', OneHotEncoder(drop='first'), cat_to_transform)
    ]
)

In [71]:
## Fitting the preprocessor on the training data
X_train_t = preprocessor.fit_transform(X_train)

## Transforming the validation and test data
X_val_t = preprocessor.transform(X_val)
X_test_t = preprocessor.transform(X_test)

In [72]:
# ## Trying SMOTE to handle class imbalance
# smote = SMOTE(random_state=42)
# X_train_t, y_train = smote.fit_resample(X_train_t, y_train)

Tried SMOTE: precision goes up by 0.01 but recall drops by 0.4.

Next: downsample the majority class and evaluate the performance.

In [73]:
## Trying Downsampling to handle class imbalance
# Combine the features and target into a single DataFrame for downsampling
train_data = pd.DataFrame(X_train_t, columns=preprocessor.get_feature_names_out())
train_data['DONATED'] = y_train.reset_index(drop=True)

# Separate the majority and minority classes
majority_class = train_data[train_data.DONATED == 0]
minority_class = train_data[train_data.DONATED == 1]

# Downsample the majority class
majority_downsampled = resample(majority_class, 
                                replace=False,    # sample without replacement
                                n_samples=len(minority_class),     # to match minority class
                                random_state=7)  # reproducible results

# Combine minority class with downsampled majority class
downsampled_data = pd.concat([majority_downsampled, minority_class])

# Separate features and target
X_train_t = downsampled_data.drop('DONATED', axis=1)
y_train = downsampled_data['DONATED']

## Converting the data to numpy arrays to keep the format consistent as required by XGBoost
X_train_t = X_train_t.values  
# y_train = y_train.values


### 2. Feature Selection

I will use RFECV (Recursive Feature Elimination with Cross-Validation) to select the best features for the model. This reduces the number of features and can also improve model performance.

In [74]:
y_train.value_counts()

0    671
1    671
Name: DONATED, dtype: int64

In [75]:
## Ratio of negative to positive samples, passed to XGBClassifier as the positive-class weight.
## Note: y_train has already been downsampled to 1:1, so this ratio is 1.0 here.
scale_pos_weight = y_train.value_counts()[0]/y_train.value_counts()[1]
xgb_classifier = XGBClassifier(scale_pos_weight=scale_pos_weight,
                               random_state=7,
                               n_estimators=100,
                               max_depth=10,
                               learning_rate=0.01,
                               # early_stopping_rounds=5,
                               n_jobs=-1)
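At this point `y_train` holds 671 rows per class, so the ratio computed above is 1.0 and adds no extra weight on top of the downsampling. For comparison, the sketch below computes the ratio on the original training split before downsampling. The counts come from this notebook's outputs (13,528 training rows, 671 of them donors); the helper name `neg_pos_ratio` is hypothetical.

```python
def neg_pos_ratio(n_negative: int, n_positive: int) -> float:
    """Ratio of negative to positive samples, a typical value for scale_pos_weight."""
    if n_positive == 0:
        raise ValueError("need at least one positive sample")
    return n_negative / n_positive

# After downsampling: 671 non-donors vs. 671 donors (value_counts() output above).
weight_downsampled = neg_pos_ratio(671, 671)

# Before downsampling: 13528 training rows, of which 671 are donors.
weight_original = neg_pos_ratio(13528 - 671, 671)

print(f"downsampled: {weight_downsampled:.2f}")
print(f"original:    {weight_original:.2f}")
```

Downsampling and class weighting are two alternative ways to rebalance the classes. Used together as in this notebook, only the downsampling has an effect; weighting alone on the full training split would use a ratio of roughly 19.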

In [76]:
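## Performing Recursive Feature Elimination with Cross Validation
## Note: step is the number of features removed per iteration. With step=200 and only
## 50 transformed features, RFECV compares just two subset sizes (all 50 vs. 1);
## a smaller step (e.g. step=1) would evaluate every subset size.
metric_tracker = make_scorer(f1_score)
rfecv = RFECV(estimator=xgb_classifier, step=200, cv=10, scoring=metric_tracker, n_jobs=-1, min_features_to_select=1)
rfecv.fit(X_train_t, y_train)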


In [77]:
# feature_names = X_train_t.columns
# selected_features_boolean = rfecv.support_
# selected_features = feature_names[selected_features_boolean]
# ranking_features = feature_names[rfecv.ranking_]

print("Optimal number of features : %d" % rfecv.n_features_)
print("Selected features mask: ", rfecv.support_)
print("Ranking of each Features: ", rfecv.ranking_)

Optimal number of features : 50
Selected features mask:  [ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True]
Ranking of each Features:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]
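All 50 features are kept, which is what one would expect when `step` is larger than the number of features: the elimination jumps from the full set straight to `min_features_to_select`. The sketch below is a plain-Python approximation of that elimination schedule for an integer `step`, based on my understanding of how scikit-learn's recursive elimination behaves; it is not the library's code, and `rfe_subset_sizes` is a hypothetical helper.

```python
def rfe_subset_sizes(n_features: int, step: int, min_features: int = 1) -> list[int]:
    """Feature-subset sizes visited when `step` features are dropped per round."""
    sizes = [n_features]
    remaining = n_features
    while remaining > min_features:
        # Never drop below the minimum: the last round removes fewer features.
        remaining -= min(step, remaining - min_features)
        sizes.append(remaining)
    return sizes

print(rfe_subset_sizes(50, step=200))      # the setting used above
print(len(rfe_subset_sizes(50, step=1)))   # one feature removed per round
```

With `step=200` only two candidates are compared (50 features and 1 feature), so the result says little about which intermediate subsets might work. With `step=1` all 50 subset sizes are scored, at the cost of many more model fits.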


In [78]:
# len(selected_features_boolean)

### 3. Creating Dataset and training the model

In [81]:
# Create new datasets with selected features
X_train_sel = X_train_t[:, rfecv.support_]
X_val_sel = X_val_t[:, rfecv.support_]
X_test_sel = X_test_t[:, rfecv.support_]

In [106]:
# Train the XGBoost model with selected features
xgb_model_sel = XGBClassifier(scale_pos_weight=scale_pos_weight,
                              random_state=7,
                              n_estimators=500,
                              max_depth=10,
                              learning_rate=0.05,
                              early_stopping_rounds=15,
                              n_jobs=-1)
xgb_model_sel.fit(X_train_sel, y_train, eval_set=[(X_val_sel, y_val)])

[0]	validation_0-logloss:0.67752
[1]	validation_0-logloss:0.66351
[2]	validation_0-logloss:0.65106
[3]	validation_0-logloss:0.63978
[4]	validation_0-logloss:0.63004
[5]	validation_0-logloss:0.62105


[6]	validation_0-logloss:0.61275
[7]	validation_0-logloss:0.60609
[8]	validation_0-logloss:0.59987
[9]	validation_0-logloss:0.59399
[10]	validation_0-logloss:0.58891
[11]	validation_0-logloss:0.58362
[12]	validation_0-logloss:0.57958
[13]	validation_0-logloss:0.57683
[14]	validation_0-logloss:0.57467
[15]	validation_0-logloss:0.57145
[16]	validation_0-logloss:0.56802
[17]	validation_0-logloss:0.56660
[18]	validation_0-logloss:0.56393
[19]	validation_0-logloss:0.56249
[20]	validation_0-logloss:0.56041
[21]	validation_0-logloss:0.55882
[22]	validation_0-logloss:0.55734
[23]	validation_0-logloss:0.55683
[24]	validation_0-logloss:0.55554
[25]	validation_0-logloss:0.55474
[26]	validation_0-logloss:0.55451
[27]	validation_0-logloss:0.55328
[28]	validation_0-logloss:0.55279
[29]	validation_0-logloss:0.55161
[30]	validation_0-logloss:0.55134
[31]	validation_0-logloss:0.55147
[32]	validation_0-logloss:0.55168
[33]	validation_0-logloss:0.55103
[34]	validation_0-logloss:0.55071
[35]	validation_0-

In [107]:
# Validate the model
y_val_pred = xgb_model_sel.predict(X_val_sel)

##### Model Performance: Classification with Downsampling and n_estimators = 500 and early stopping rounds = 15

In [108]:
print(classification_report(y_val, y_val_pred))

              precision    recall  f1-score   support

           0       0.98      0.68      0.80      2269
           1       0.11      0.75      0.19       119

    accuracy                           0.68      2388
   macro avg       0.54      0.71      0.50      2388
weighted avg       0.94      0.68      0.77      2388



#### Saving the model

In [110]:
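import joblib
import os

## Save the trained model to a file (create the target directory if it does not exist)
model_filename = "../models/xgboost_donor_model.joblib"
os.makedirs(os.path.dirname(model_filename), exist_ok=True)
joblib.dump(xgb_model_sel, model_filename)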

['../models/xgboost_donor_model.joblib']

# Performance of other models that were tried

##### Classification with Downsampling and n_estimators = 500 and early stopping rounds = 15

In [89]:
print(classification_report(y_val, y_val_pred))

              precision    recall  f1-score   support

           0       0.98      0.68      0.80      2269
           1       0.11      0.74      0.19       119

    accuracy                           0.68      2388
   macro avg       0.54      0.71      0.49      2388
weighted avg       0.94      0.68      0.77      2388



##### Classification with Downsampling and n_estimators = 100

In [84]:
### Classification with Downsampling
print(classification_report(y_val, y_val_pred))

              precision    recall  f1-score   support

           0       0.98      0.67      0.79      2269
           1       0.10      0.74      0.18       119

    accuracy                           0.67      2388
   macro avg       0.54      0.70      0.49      2388
weighted avg       0.94      0.67      0.76      2388



In [85]:
confusion_matrix(y_val, y_val_pred)

array([[1510,  759],
       [  31,   88]], dtype=int64)

##### Classification with SMOTE

In [23]:
### Classification with SMOTE
print(classification_report(y_val, y_val_pred))

              precision    recall  f1-score   support

           0       0.96      0.91      0.94      2269
           1       0.13      0.24      0.17       119

    accuracy                           0.88      2388
   macro avg       0.54      0.58      0.55      2388
weighted avg       0.92      0.88      0.90      2388



In [24]:
confusion_matrix(y_val, y_val_pred)

array([[2073,  196],
       [  90,   29]], dtype=int64)

# Evaluating the model on unseen data

In [109]:
# Evaluate the model on the unseen test set
y_test_pred = xgb_model_sel.predict(X_test_sel)
test_report = classification_report(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)

print("Test Classification Report:\n", test_report)
print("Test Confusion Matrix:\n", test_confusion_matrix)

Test Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.69      0.81      3782
           1       0.12      0.79      0.20       197

    accuracy                           0.69      3979
   macro avg       0.55      0.74      0.51      3979
weighted avg       0.94      0.69      0.78      3979

Test Confusion Matrix:
 [[2600 1182]
 [  42  155]]
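The figures in the classification report can be recomputed directly from the confusion matrix, which makes it easier to see where each number comes from. This is a minimal sketch in plain Python; `per_class_metrics` is a hypothetical helper, and the matrix values are copied from the test output above.

```python
def per_class_metrics(matrix):
    """Precision and recall per class from a square confusion matrix.

    Rows are true labels and columns are predicted labels (scikit-learn's layout).
    """
    n = len(matrix)
    metrics = {}
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column total
        actual_k = sum(matrix[k])                              # row total
        metrics[k] = {
            "precision": true_pos / predicted_k if predicted_k else 0.0,
            "recall": true_pos / actual_k if actual_k else 0.0,
        }
    return metrics

test_matrix = [[2600, 1182], [42, 155]]  # test confusion matrix printed above
for label, m in per_class_metrics(test_matrix).items():
    print(label, f"precision={m['precision']:.2f}", f"recall={m['recall']:.2f}")
```

For the donor class this gives recall 155 / 197 and precision 155 / (155 + 1182), which round to the 0.79 and 0.12 shown in the report.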


# Conclusion:

1. Metric used:

    From the dataset, we observed that only about 5% of the individuals are donors. To ensure that our mailing campaign effectively reaches these potential donors, I prioritized recall as our primary evaluation metric.


2. Model Performance:

    The model was trained with a focus on increasing recall. On the unseen test set, it achieved a recall of 79% for the donor class, meaning it correctly identified 155 of the 197 actual donors.
    Precision for the non-donor class is 98%: of the people the model predicts will not donate, 98% are indeed non-donors, so few donors are missed. Recall for the non-donor class is 69%, however, so about 31% of non-donors (1,182 of 3,782) are still flagged as potential donors.

3. Trade-off Between Precision and Recall:

    Precision for the donor class is low (12%): roughly one in eight people flagged by the model is an actual donor. In exchange, the model captures the majority of actual donors (recall of 79%), which makes it a useful tool for our donor targeting strategy.
    This trade-off can be acceptable when missing a donor costs more than mailing a non-donor. Compared with mailing everyone, the model more than doubles the donor rate among recipients (about 12% versus the 5% base rate) while still reaching most donors.
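To make the trade-off concrete for campaign planning, the test confusion matrix can be read as a mailing scenario: everyone the model flags receives a mailing. The sketch below derives the mailing volume and the lift over mailing the whole population. It uses only the four counts from the test confusion matrix in this notebook; the variable names are illustrative, and no cost or revenue figures are assumed.

```python
# Test confusion matrix from this notebook: rows = actual, columns = predicted.
tn, fp = 2600, 1182
fn, tp = 42, 155

population = tn + fp + fn + tp
mailed = tp + fp                       # everyone the model flags as a donor
base_rate = (tp + fn) / population     # donor rate if everyone is mailed
hit_rate = tp / mailed                 # donor rate among flagged people (precision)

print(f"share of people mailed: {mailed / population:.1%}")
print(f"share of donors reached: {tp / (tp + fn):.1%}")
print(f"lift over mailing everyone: {hit_rate / base_rate:.2f}x")
```

On the test set, the model mails about a third of the population and still reaches about 79% of the donors, a lift of roughly 2.3 over mailing everyone. Whether that is worthwhile depends on the cost per mailing and the expected donation, which this notebook does not estimate.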