# Data Science for Non-Profit Fundraising

Data science can support non-profit fundraising by using models to predict who will donate in the current year. Analytics can also help identify cohorts and characteristics of potential donors. The aim of this project is to use an anonymized dataset to predict which constituents will donate in the current fiscal year.

## This report will be laid out in several steps:

1. Data Preparation: Reviewing the data source and preparing the data for analysis.
2. Data Analysis: A preliminary, visualization-based analysis of the constituent base in the dataset.
3. Prediction Models: Using three different models to predict who will be a current-year donor.
4. Summary: Findings from analysis and performance of the models.

## 1. Data Preparation

In [1]:
# import libraries
import pandas as pd
import numpy as np
import xgboost as xgb
import xgboost  # some later cells call xgboost.XGBClassifier directly
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.gridspec as grd
import scipy.stats
import re
from os import path
from datetime import datetime
from PIL import Image

# for regressions with statsmodels:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
from statsmodels.stats.outliers_influence import OLSInfluence
from statsmodels.graphics.regressionplots import plot_leverage_resid2

# for regressions, model selection, and metrics with scikit-learn:
import sklearn.linear_model as sklm
from sklearn import preprocessing
from sklearn.model_selection import (train_test_split, KFold, cross_validate,
                                     cross_val_score, GridSearchCV, RandomizedSearchCV)
from sklearn.metrics import (confusion_matrix, classification_report, precision_score,
                             accuracy_score, balanced_accuracy_score, roc_curve,
                             roc_auc_score, make_scorer)
# ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
from sklearn.metrics import ConfusionMatrixDisplay

# for plotting confusion matrix:
import scikitplot as skplt

# for ordinal logistic regression
from mord import LogisticIT

# These are utility tools of the DMBA book.
from dmba import regressionSummary, exhaustive_search
from dmba import backward_elimination, forward_selection, stepwise_selection
from dmba import adjusted_r2_score, AIC_score, BIC_score
from dmba import classificationSummary, gainsChart, liftChart

# for KNN:
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

### Data Source

The data was sourced from Kaggle. It is a 34,000-row sample constituent dataset from the book Data Science for Fundraising.

Pawlus, M. Fundraising Data, Version 1. Retrieved December 18, 2022 from https://www.kaggle.com/datasets/michaelpawlus/fundraising-data?select=data_science_for_fundraising_donor_data.csv.

### Importing data

In [None]:
#loading the datasets
df = pd.read_csv("data_science_for_fundraising_donor_data.csv")

In [None]:
df.info()

### Understanding the variables and dataset

In [None]:
df.describe(include='all') 

### Data Preprocessing

Summary:
1. Check and remove duplicates
2. Removing irrelevant or duplicate data points
3. Change categorical variables to numerical
4. Check missing data

#### 1) Check for duplicates

In [None]:
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))  # counts the number of True's   

#### 2) Removing irrelevant or duplicate data points

In [None]:
df['MEMBERSHIP_IND'].value_counts()

In [None]:
len(df['BIRTH_DATE'])

In [None]:
len(df['AGE'])

In [None]:
df.drop(['ID','BIRTH_DATE', 'MEMBERSHIP_IND'],
        axis=1, inplace=True) ## set axis=0 to remove rows, axis=1 to remove columns
df.head()

Removed the following features:

    ID - Not meaningful
    MEMBERSHIP_IND - Not meaningful, only has negative value or null
    BIRTH_DATE - redundant with AGE feature


#### 3) Change categorical variables to correct numerical format

In [None]:
df['ALUMNUS_IND'].value_counts()

In [None]:
df['PARENT_IND'].value_counts()

In [None]:
df['HAS_INVOLVEMENT_IND'].value_counts()

In [None]:
df['EMAIL_PRESENT_IND'].value_counts()

In [None]:
df['DONOR_IND'].value_counts()

In [None]:
#change to numerical
df['ALUMNUS_IND'] = np.where(df['ALUMNUS_IND'] == "Y", 1, 0)
df['PARENT_IND'] = np.where(df['PARENT_IND'] == "Y", 1, 0)
df['HAS_INVOLVEMENT_IND'] = np.where(df['HAS_INVOLVEMENT_IND'] == "Y", 1, 0)
df['EMAIL_PRESENT_IND'] = np.where(df['EMAIL_PRESENT_IND'] == "Y", 1, 0)
df['DONOR_IND'] = np.where(df['DONOR_IND'] == "Y", 1, 0)

In [None]:
#check
df['ALUMNUS_IND'].value_counts()

In [None]:
Giving= df[['PrevFYGiving', 'PrevFY1Giving', 'PrevFY2Giving', 'PrevFY3Giving', 'PrevFY4Giving', 'CurrFYGiving']].copy()
Giving.max()

In [None]:
#remove commas
df = df.replace(',','', regex=True)

In [None]:
#remove $
df['PrevFYGiving'] = df['PrevFYGiving'].str[1:]
df['PrevFY1Giving'] = df['PrevFY1Giving'].str[1:]
df['PrevFY2Giving'] = df['PrevFY2Giving'].str[1:]
df['PrevFY3Giving'] = df['PrevFY3Giving'].str[1:]
df['PrevFY4Giving'] = df['PrevFY4Giving'].str[1:]
df['CurrFYGiving'] = df['CurrFYGiving'].str[1:]

In [None]:
#change to numerical; values that cannot be parsed become NaN
giving_cols = ['PrevFYGiving', 'PrevFY1Giving', 'PrevFY2Giving',
               'PrevFY3Giving', 'PrevFY4Giving', 'CurrFYGiving']
for col in giving_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')
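
The comma removal, `$` stripping, and numeric conversion above could also be folded into one helper. The sketch below assumes the giving columns hold strings such as `$1,234.50`; `parse_currency` and the toy values are hypothetical and not part of the dataset.

```python
import pandas as pd

def parse_currency(series: pd.Series) -> pd.Series:
    """Convert strings such as '$1,234.50' to floats; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up values, including a missing entry and a non-numeric entry.
toy = pd.Series(["$1,234.50", "$0", "$25", None, "n/a"])
parsed = parse_currency(toy)
print(parsed.tolist())
```

On the real frame this would be applied column by column, for example `df[col] = parse_currency(df[col])` for each giving column.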

In [None]:
df['ZIPCODE'].value_counts()

90265 is the ZIP code for Malibu, CA. As it is the most common value, I will replace this attribute with a binary field indicating whether a constituent is in Malibu. I chose this split, rather than converting each ZIP code to a state or making each one a separate feature, to avoid adding too many dimensions for later computation.

In [None]:
df['MalibuCA'] = np.where(df['ZIPCODE'] == 90265.0, 1, 0)

In [None]:
#check
df.loc[df['ZIPCODE'] == 90265]

In [None]:
df.drop(['ZIPCODE'],
        axis=1, inplace=True) ## set axis=0 to remove rows, axis=1 to remove columns
df.head(7)

In [None]:
df['WEALTH_RATING'].value_counts()

In [None]:
#dictionary for integer encoding for ordinal categorical value
ordinal_cols = {
        '$1-$24999':8, 
        '$25000-$49999':7,
        '$50000-$99999':6, 
        '$100000-$249999':5,
        '$250000-$499999':4, 
        '$500000-$999999':3, 
        '$1000000-$2499999':2,
        '$2500000-$4999999':1
}

In [None]:
#map new values to attribute
df['WEALTH_RATING'] = df['WEALTH_RATING'].map(ordinal_cols).fillna(df['WEALTH_RATING'])

In [None]:
df['WEALTH_RATING'].value_counts()
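
The direction of this encoding matters when reading the later plots: 1 is the wealthiest band and 8 the lowest. Below is a small sketch of the same `map` approach on made-up values, assuming unrated constituents are coded as 0 (which is what the missing-data step does with `fillna(0)`). `rating_codes` is a hypothetical subset of the dictionary above.

```python
import pandas as pd

# Hypothetical subset of the rating bands; 1 is the wealthiest band.
rating_codes = {"$1-$24999": 8, "$50000-$99999": 6, "$2500000-$4999999": 1}

# Made-up ratings, with None standing in for an unrated constituent.
toy = pd.Series(["$1-$24999", None, "$2500000-$4999999", "$50000-$99999"])
encoded = toy.map(rating_codes).fillna(0).astype(int)  # 0 marks "no rating"
print(encoded.tolist())
```

Because 0 sits outside the 1-8 scale, it acts as a separate "unrated" marker rather than as the top of the wealth ordering, which is worth keeping in mind for models that treat the column as numeric.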

In [None]:
df['MARITAL_STATUS'].value_counts()

In [None]:
df['GENDER'].value_counts()

In [None]:
#removing misclassed fields and combining those without a rating
df= df.replace('Uknown', "0")
df= df.replace('Unknown', "0")
df= df.replace('U', "0")

In [None]:
df = df.loc[(df['GENDER'] != "0")]

In [None]:
df['GENDER'].value_counts()

In [None]:
df['AGE'].value_counts()

####  4) Missing Data

In [None]:
#find columns that have missing values
nan_cols = df.loc[:,df.isna().any(axis=0)]
nan_cols

In [None]:
df['WEALTH_RATING'] = df['WEALTH_RATING'].fillna(0)

In [None]:
df.fillna({'AGE' : df['AGE'].median()}, inplace=True)

In [None]:
#change categorical variables to new attributes - will also remove null variables
df2 = pd.get_dummies(df, columns=['PREF_ADDRESS_TYPE', 'MARITAL_STATUS', 'GENDER', 'DEGREE_LEVEL', ])

In [None]:
nan_cols = df2.loc[:,df2.isna().any(axis=0)]
nan_cols

In [None]:
df2.info()

#### Create new target variable CurrYrDonor from CurrFYGiving

In [None]:
df2['CurrYrDonor'] = np.where(df2['CurrFYGiving'] > 0, 1, 0)

In [None]:
df2['CurrYrDonor'].value_counts()

## 2. Data Analysis

### What are the age ranges of the constituency base?

In [None]:
df['AGE'].plot.hist(bins=10)

Most of the constituency base is between the ages of 35 and 45. This is typically the early to mid-career point for many people, so they are in their core earning years. This constituency base could be ready for a qualification visit from a fundraiser to explore their potential philanthropic interests as their wealth continues to increase.

### How many of the constituents have wealth ratings?

In [None]:
sns.countplot(y="WEALTH_RATING", data=df2)

As expected, most of the constituents do not have wealth ratings. Among those that do, the most common rating is 6, in the lower range of wealth, which indicates capacity for a gift of $50,000 to $99,999.

### How many constituents have ever made a gift?

In [None]:
sns.countplot(y="DONOR_IND", data=df2)

The majority of the constituent base have made a gift before to the institution. As this is anonymized data, the reason is unclear. Potentially, most constituents were collected through a donation of some kind. Overall, the inclination to give again is high based on past giving.

### Exploring total giving

### What is the correlation of total giving and age?

In [None]:
plt.scatter(x='AGE', y='TotalGiving', data=df2, marker='x')

Two of the donors with the highest total giving are above 80 years of age. However, the donor with the third-highest total giving is in their 40s. As most people do not give, the correlation is hard to see in this graph.


### Let's explore this further by incorporating gender into this graph.

In [None]:
sns.relplot(x='AGE', y="TotalGiving", hue="GENDER", data=df)

The three highest donors to the institution are female.

### Now let's take a look at wealth rating by age.

In [None]:
plt.scatter(x='AGE', y='WEALTH_RATING', data=df2, marker='x')

The wealthiest constituents, those with a wealth rating of 1, are between the ages of 30 and 45.

### Let's take a look at their gender.

In [None]:
sns.relplot(x='AGE', y='WEALTH_RATING', hue="GENDER", data=df)

Two of the three wealthiest known constituents are male. Let's see whether they have ever donated to the institution.

In [None]:
sns.relplot(x='AGE', y='WEALTH_RATING', hue="GENDER", col="DONOR_IND", data=df)

An earlier graph showed that the highest total giving comes from female donors. This graph shows that the wealthiest female constituent has not donated.

### How many constituents donated last year?

In [None]:
LastYrDonor = df2.loc[df2['PrevFYGiving'] != 0]

In [None]:
LastYrDonor

There were a total of 2,289 constituents who made a gift last year.

### How many have donated so far?

In [None]:
CurrYrDonor = df2.loc[df2['CurrFYGiving'] > 0]

In [None]:
CurrYrDonor

1,845 donors have so far donated this year.

### Are there any new donors to the institution? Are current year donors the same as last year?

In [None]:
New = df2.loc[(df2['PrevFYGiving'] == 0) & (df2['CurrFYGiving'] != 0)]

In [None]:
New

There are 1,689 people that have donated this year but not last year.
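
The three questions above (last year's donors, this year's donors, new donors) are pieces of one retention breakdown. Here is a minimal sketch on made-up giving amounts, assuming missing amounts have already been filled with 0. The toy frame and the `summary` keys are illustrative only.

```python
import pandas as pd

# Made-up giving amounts for five constituents.
toy = pd.DataFrame({
    "PrevFYGiving": [100.0, 0.0, 50.0, 0.0, 25.0],
    "CurrFYGiving": [20.0, 10.0, 0.0, 0.0, 5.0],
})
gave_prev = toy["PrevFYGiving"] > 0
gave_curr = toy["CurrFYGiving"] > 0

summary = {
    "retained": int((gave_prev & gave_curr).sum()),   # gave in both years
    "new": int((~gave_prev & gave_curr).sum()),       # gave this year only
    "lapsed": int((gave_prev & ~gave_curr).sum()),    # gave last year only
}
retention_rate = summary["retained"] / int(gave_prev.sum())
print(summary, round(retention_rate, 3))
```

Using `> 0` for both years keeps rows with missing amounts out of every group, whereas a `!= 0` comparison would count `NaN` as a gift.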

### Correlations between features

In [None]:
# correlation heatmap 
corrmat = df2.corr(numeric_only=True)  # skip any non-numeric columns (requires pandas 1.5 or later)
sns.heatmap(corrmat, square = True, cmap="Blues")

The most highly correlated features are the giving features.

Reviewing how current fiscal year giving correlates to other giving features:

In [None]:
sns.pairplot(data=df2, y_vars=['CurrFYGiving'], x_vars=['TotalGiving', 'PrevFYGiving', 'CON_YEARS'])

### Reviewing how wealth rating correlates to giving features:

In [None]:
sns.pairplot(data=df2, y_vars=['WEALTH_RATING'], x_vars=['TotalGiving', 'PrevFYGiving','CON_YEARS'])

From the pair plot above, there are some donors who have made sizeable gifts but have no wealth rating. These donors could form a discovery list to confirm their capacity to donate.

### Let's identify a discovery pool of unrated constituents with total giving of $100,000 or more.

In [None]:
discovery = df2.loc[(df2['TotalGiving'] >= 100000) & (df2['WEALTH_RATING'] == 0)]

In [None]:
discovery

There are 46 constituents with total giving of $100,000 or more and no wealth rating. This group should be reviewed internally. The table above shows that many are in an age range where future gifts are possible. A wealth rating could help inform the strategic plan to cultivate them.

## 3. Prediction Models

#### Prepping predictor and target variables

The features in the final dataframe are reviewed here as potential predictor variables. I am removing features that contain giving information that could leak what the model is trying to predict, such as total giving, consecutive years of giving, and the donor indicator.

In [None]:
#Removing attributes that have giving information to avoid leakage to the data.
X = df2.drop(columns=['CurrFYGiving', 'TotalGiving', 'CON_YEARS', 'DONOR_IND', 'CurrYrDonor'])


The final list for prediction:

In [None]:
list(X.columns)

The target variable is the binary feature created earlier. It is 1 if the constituent has made a gift this year and 0 if not.

In [None]:
y = df2['CurrYrDonor']

In [None]:
# checking sizes of variables are the same
X.shape

In [None]:
y.shape

In [None]:
y.value_counts()

#### Check balance of classes

In [None]:
sum(y)/len(y)

The classes are imbalanced, so the data is stratified when splitting into train and test sets.

#### Split the data for performance evaluation with 25% test size and stratified for imbalance.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

In [None]:
sum(y_train)/len(y_train)

In [None]:
sum(y_test)/len(y_test)
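
To see what `stratify=y` does, the sketch below splits a synthetic label vector with a 4% positive rate (an assumed figure, not the dataset's) and checks that both splits keep that rate. `X_toy` and `y_toy` are made-up stand-ins for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 40 positives (4%); not the donor dataset.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 40 + [0] * 960)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Positive rate in each split; stratification keeps them aligned.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a rare class can end up over- or under-represented in the test split purely by chance, which makes test metrics noisier.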

### Model #1: Logistic Regression

In [None]:
#initial model
lreg=sklm.LogisticRegression(solver='liblinear')
lreg.fit(X_train, y_train)

In [None]:
lreg_predictions_tr =lreg.predict(X_train)

In [None]:
#get the predicted probabilities in training:
lreg_predict_prob_tr=lreg.predict_proba(X_train) # predictions for training set as probability values
lreg_predict_prob_tr

In [None]:
logit_result_tr = pd.DataFrame({'actual': y_train, 
                             'p(0)': [p[0] for p in lreg_predict_prob_tr],
                             'p(1)': [p[1] for p in lreg_predict_prob_tr],
                             'predicted': lreg_predictions_tr })
print("Predicted probabilities of training data")
logit_result_tr

In [None]:
print("Highest probability of being a donor from training data")

logit_result_tr.sort_values(by='p(1)', ascending=False).head()

In [None]:
print("Lowest probability of being a donor from training data")
logit_result_tr.sort_values(by='p(1)').head()

In [None]:
skplt.metrics.plot_confusion_matrix(y_train, lreg_predictions_tr, figsize=(4,4), cmap="Blues")
plt.title('Confusion matrix of lreg on train')

In [None]:
#classification report for training
print("Classification Report for lreg train:\n",classification_report(y_train, lreg_predictions_tr))

In [None]:
#Performance on test data
lreg_predictions_tt=lreg.predict(X_test)

In [None]:
skplt.metrics.plot_confusion_matrix(y_test, lreg_predictions_tt, figsize=(4,4), cmap="Greens")
plt.title('Confusion matrix of lreg on test')

In [None]:
print("Classification Report for lreg test:\n",classification_report(y_test, lreg_predictions_tt))


Performance of the model on the test data is much lower than on the training data.

#### Lowering the model's probability threshold from the default of 50% to 5% to see if it captures more donors.

In [None]:
probabilities = lreg.predict_proba(X_test)[:, 1]

In [None]:
prediction = probabilities > 0.05

skplt.metrics.plot_confusion_matrix(y_test, prediction, figsize=(4,4), cmap="Greens")
plt.title('Confusion matrix of 5% lreg on test')

In [None]:
print("Classification Report of 5% lreg on test:\n",classification_report(y_test, prediction))

At 5% probability, the model makes many more mistakes on who is a donor, but it also correctly identifies 367 donors. This could be preferable to capture potential donors. There should be a review on the negative effects of soliciting a donor that is unlikely to donate.
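
The trade-off described here can be made concrete by computing precision and recall at each cutoff. This is a minimal sketch on made-up labels and scores; `precision_recall_at` is a hypothetical helper, and the numbers are not output from the notebook's model.

```python
import numpy as np

def precision_recall_at(y_true, probs, threshold):
    """Precision and recall for the positive class at a given probability cutoff."""
    y_true = np.asarray(y_true)
    predicted = np.asarray(probs) > threshold
    true_pos = int(np.sum(predicted & (y_true == 1)))
    precision = true_pos / predicted.sum() if predicted.sum() else 0.0
    recall = true_pos / (y_true == 1).sum()
    return precision, recall

# Made-up labels and scores, not model output from this notebook.
y_toy = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
p_toy = [0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.07, 0.30, 0.40, 0.90]

for cutoff in (0.5, 0.05):
    print(cutoff, precision_recall_at(y_toy, p_toy, cutoff))
```

In this toy case the lower cutoff finds every positive but half of the flagged rows are false alarms, which is the same pattern as the 5% confusion matrix above: more donors captured in exchange for more wasted solicitations.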

#### Adding a penalty and cross-validation to see if model can be improved.

L1

In [None]:
# regularization
logitcv = sklm.LogisticRegressionCV(penalty="l1", solver='liblinear', cv=5)
logitcv.fit(X_train, y_train)

In [None]:
skplt.metrics.plot_confusion_matrix(y_test, logitcv.predict(X_test), figsize=(4,4), cmap="Greens")
plt.title('Confusion matrix of l1 lreg')

There is no improvement from the original model.

#### Nominal Logit

In [None]:
nlogit = sklm.LogisticRegression(penalty="l2", solver='lbfgs', C=1e24, multi_class='multinomial')
nlogit.fit(X, y)

In [None]:
probs = nlogit.predict_proba(X)

In [None]:
skplt.metrics.plot_confusion_matrix(y, nlogit.predict(X), figsize=(4,4), cmap="Blues")
plt.title('Confusion matrix of l2 lreg')

In [None]:
print("Classification Report:\n",classification_report(y, nlogit.predict(X), zero_division=1))

### Model #2: KNN

In [None]:
# we scale the Xvar before using KNN:
X=preprocessing.scale(X)

In [None]:
# Set aside 25% of data for out-of-training-sample test:
X3pc_train, X3pc_test, Y3pc_train, Y3pc_test = train_test_split(X, y, \
                                                           test_size=0.25, random_state=7)
print(X3pc_train.shape, Y3pc_train.shape)
print(X3pc_test.shape, Y3pc_test.shape)
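
One caveat with calling `preprocessing.scale(X)` before splitting is that the test rows contribute to the scaling statistics. An alternative is to put the scaler inside a pipeline so it is fitted on the training rows only. The sketch below assumes synthetic data from `make_classification` in place of the donor features; `knn_pipe` and the toy variables are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the donor features (roughly 10% positives).
X_toy, y_toy = make_classification(n_samples=500, n_features=6,
                                   weights=[0.9], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=7, stratify=y_toy
)

# The scaler is fitted inside the pipeline, so it only ever sees the training rows.
knn_pipe = make_pipeline(StandardScaler(),
                         KNeighborsClassifier(n_neighbors=4, weights="uniform"))
knn_pipe.fit(X_tr, y_tr)

test_accuracy = knn_pipe.score(X_te, y_te)
print(test_accuracy)
```

Because the pipeline carries its own scaler, it can be refitted on any training split (for instance inside `cross_val_score`) without rescaling by hand.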

In [None]:
# Specify the parameters of the KNN classifier:
knn_ccv = KNeighborsClassifier(n_neighbors=4, weights='uniform') # considering 4 nearest neighbors weighted equally

In [None]:
# Fit the KNN model with training data:
knn_ccv.fit(X3pc_train, Y3pc_train)

In [None]:
# Get the prediction of the KNN training
knn_prediction_ctr=knn_ccv.predict(X3pc_train)

In [None]:
skplt.metrics.plot_confusion_matrix(Y3pc_train, knn_prediction_ctr, figsize=(4,4), cmap="Blues")
plt.title('Confusion matrix of knn train')

In [None]:
# Evaluate how good the knn classification of training data is:
cm_knn = confusion_matrix(Y3pc_train, knn_prediction_ctr)

print("Classification Report of knn train:\n",classification_report(Y3pc_train, knn_prediction_ctr))

In [None]:
# Evaluate the KNN classifier with 3-fold cross-validation on the training data:
cvparam = KFold(3, random_state=13, shuffle=True)
scores_accuracy_knn =  cross_val_score(knn_ccv, X3pc_train, Y3pc_train, cv=cvparam, scoring='accuracy')

In [None]:
scores_accuracy_knn.mean() #average cross-validated accuracy on the training split

In [None]:
# How good is the trained model for predicting the test data?
knn_prediction_ctr_tt=knn_ccv.predict(X3pc_test) #use test data

In [None]:
skplt.metrics.plot_confusion_matrix(Y3pc_test, knn_prediction_ctr_tt, figsize=(4,4), cmap="Greens")
plt.title('Confusion matrix of knn test')

In [None]:
# Evaluate how good the knn classification of test data is:
cm_knn_tt = confusion_matrix(Y3pc_test, knn_prediction_ctr_tt)
print("Classification Report knn on test:\n",classification_report(Y3pc_test, knn_prediction_ctr_tt))

### Model #3: XGBoost

In [None]:
#hyper parameter optimization

params = {
     'max_depth': [1, 2, 3, 4, 5],
     'learning_rate': [0.1, 0.01, 0.05],
     'min_child_weight': [1, 3, 5, 7],
     'colsample_bytree': [ 0.3, 0.4, 0.5, 0.7],
     'gamma': [0, 0.25, 1.0, 1.5],
     'reg_lambda': [0, 1.0, 10.0],
     'scale_pos_weight': [1, 3, 5, 7] # NOTE: XGBoost recommends sum(negative instances) / sum(positive instances)
 }
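
The note on `scale_pos_weight` can be turned into a number directly from the labels. This sketch uses a made-up label vector (95 non-donors, 5 donors). Computing the same ratio from `y_train` in the notebook would be an assumption about intent, not something the original code does.

```python
import numpy as np

y_toy = np.array([0] * 95 + [1] * 5)  # made-up labels, not the donor data

n_negative = int((y_toy == 0).sum())
n_positive = int((y_toy == 1).sum())

# Ratio of negative to positive instances, as suggested in the note above.
suggested_weight = n_negative / n_positive
print(suggested_weight)
```

If the computed ratio falls well outside the grid `[1, 3, 5, 7]`, it may be worth adding it as a candidate value in the search.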

In [None]:
def timer(start_time=None):
    if not start_time:
        start_time =datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))

In [None]:
classifier=xgboost.XGBClassifier(objective='binary:logistic')

In [None]:
random_search=RandomizedSearchCV(classifier,param_distributions=params,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3, random_state=42)

In [None]:
from datetime import datetime

start_time=timer(None)
random_search.fit(X_train, y_train)  # tune on the training split only so the test rows stay unseen
timer(start_time)


In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
clf_xgb=xgb.XGBClassifier(seed=42,
                          objective='binary:logistic',
                          gamma=1.5,
                          learning_rate=0.1,
                          max_depth=4,
                          reg_lambda=1.0,
                          scale_pos_weight=7,
                          subsample=0.9,
                          colsample_bytree=0.4,
                          use_label_encoder=False)

clf_xgb.fit(X_train, 
            y_train, 
            verbose=True, 
            early_stopping_rounds=10,
            eval_metric='aucpr',
            eval_set=[(X_test, y_test)])

In [None]:
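# ConfusionMatrixDisplay.from_estimator replaces plot_confusion_matrix (removed in scikit-learn 1.2)
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf_xgb,
                                      X_train,
                                      y_train,
                                      values_format='d',
                                      cmap="Blues",
                                      display_labels=["Did not donate", "Donated"])
plt.title('Confusion matrix of XGB train')

In [None]:
ConfusionMatrixDisplay.from_estimator(clf_xgb,
                                      X_test,
                                      y_test,
                                      values_format='d',
                                      cmap="Greens",
                                      display_labels=["Did not donate", "Donated"])
plt.title('Confusion matrix of XGB test')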

In [None]:
import shap
explainer = shap.Explainer(clf_xgb)
shap_values = explainer(X_test)

In [None]:
shap.plots.beeswarm(shap_values)

#### Evaluating all the models

In [None]:
def build_cumulative_curve(model, scale=100):
    # Fit model
    model.fit(X_train, y_train)

    # Get the probability of Y_test records being = 1
    Y_test_probability_1 = model.predict_proba(X_test)[:, 1]

    # Sort these probabilities and the true values in descending order of probability
    order = np.argsort(Y_test_probability_1)[::-1]
    Y_test_probability_1_sorted = Y_test_probability_1[order]
    Y_test_sorted = np.array(y_test)[order]

    # Build the cumulative response curve
    x_cumulative = np.arange(len(Y_test_probability_1_sorted)) + 1
    y_cumulative = np.cumsum(Y_test_sorted)

    # Rescale
    x_cumulative = np.array(x_cumulative)/float(x_cumulative.max()) * scale
    y_cumulative = np.array(y_cumulative)/float(y_cumulative.max()) * scale
    
    return x_cumulative, y_cumulative

def plot_cumulative_curve(models):
    # Plot curve for each model
    for key in models:
        x_cumulative, y_cumulative = build_cumulative_curve(models[key])
        plt.plot(x_cumulative, y_cumulative, label=key)
    # Plot other details
    plt.plot([0,100], [0,100], 'k--', label="Random")
    plt.xlabel("Percentage of test instances targeted (decreasing score)")
    plt.ylabel("Percentage of positives targeted")
    plt.title("Cumulative response curve")
    plt.legend()

models = {"Logistic Regression": lreg,
          "KNN": knn_ccv,
          "XGBoost": clf_xgb}
plot_cumulative_curve(models)

In [None]:
def plot_lift_curve(models):
    # Plot curve for each model
    for key in models:
        x_cumulative, y_cumulative = build_cumulative_curve(models[key])
        plt.plot(x_cumulative, y_cumulative/x_cumulative, label=key)
    # Plot other details
    plt.plot([0,100], [1,1], 'k--', label="Random")
    plt.xlabel("Percentage of test instances (decreasing score)")
    plt.ylabel("Lift (times)")
    plt.title("Lift curve")
    plt.legend()

plot_lift_curve(models)
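
A single point on the lift curve is often easier to communicate than the whole plot. The sketch below computes lift for the top fraction of a ranked list on made-up labels and scores. `lift_at` is a hypothetical helper that follows the same sort-then-accumulate idea as `build_cumulative_curve`.

```python
import numpy as np

def lift_at(y_true, scores, fraction):
    """Lift among the top `fraction` of instances ranked by score."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)[::-1]                  # highest scores first
    k = max(1, int(round(fraction * len(y_true))))    # size of the targeted group
    top_rate = y_true[order][:k].mean()               # positive rate in the targeted group
    base_rate = y_true.mean()                         # positive rate overall
    return top_rate / base_rate

# Made-up example: 20 constituents, 4 donors, scores already in descending order.
y_toy = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 1] + [0] * 10)
s_toy = np.linspace(0.95, 0.0, 20)

lift_top_10 = lift_at(y_toy, s_toy, 0.10)
lift_top_50 = lift_at(y_toy, s_toy, 0.50)
print(lift_top_10, lift_top_50)
```

A lift of 1 means the ranking is no better than contacting constituents at random, which matches the dashed "Random" line in the plot.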

## 4. Summary

### Some key takeaways

From the data preparation stage, this anonymized dataset is from an institution whose constituents are most densely concentrated in Malibu, CA. The dataset tracks membership, alumnus, parent, and involvement indicators. The majority of the 34,000 rows are not members, alumni, parents, or otherwise involved. There is an even split between male and female constituents, and the majority are married.

Analyzing the data further, most of the constituents are between the ages of 35 and 45. Nearly two-thirds of the constituents have made a gift in the past and are identified as donors. Some of the highest total giving to the institution comes from constituents aged 35 to 45. The highest donors are female, with total giving over $1M. Male constituents with a wealth rating of 1 have previously made a gift to the institution, while the wealthiest female constituent has not made a donation.

At the time of this data collection, the institution has about eighty percent as many donors as it did last year (1,845 versus 2,289), and 1,689 of this year's donors did not give last year. This churn is high and implies that predicting returning donors could be difficult.

There is an opportunity to rate donors who have already given to the institution by identifying those with high giving and no wealth rating. This will help inform the strategic plan to cultivate these donors further.

### Predicting donors 

Predicting current year donors with the available features was a difficult task, given the imbalance of donors to non-donors.

While most models were accurate overall, at 94%+ on average, they all failed to identify donors, mostly classifying everyone as a non-donor. Identifying donors proved even more difficult when testing generalization on the test dataset.
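
High accuracy alongside missed donors is what an imbalanced target tends to produce. The sketch below uses made-up labels with a 6% donor rate (an assumed figure) and a "model" that predicts non-donor for everyone, then compares plain accuracy with balanced accuracy and donor recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up labels with 6% donors, and predictions that are non-donor for everyone.
y_toy = np.array([0] * 94 + [1] * 6)
y_all_negative = np.zeros_like(y_toy)

acc = accuracy_score(y_toy, y_all_negative)               # share of correct predictions
bal_acc = balanced_accuracy_score(y_toy, y_all_negative)  # mean of per-class recall
donor_recall = recall_score(y_toy, y_all_negative)        # share of donors found

print(acc, bal_acc, donor_recall)
```

Here accuracy is 0.94 even though no donor is found, while balanced accuracy drops to 0.5. This is why recall on the donor class or balanced accuracy is a more informative yardstick for this task.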

At the cost of overall accuracy, a logistic regression with a lower probability threshold could identify more true donors, while misclassifying more non-donors as donors. A cutoff and cost-benefit analysis could help determine whether this model would be helpful in outreach such as mailings, event invitations, calls, or visits from fundraisers.

### Limitations

Given the models' poor performance in predicting donors, more data and information, especially around past giving, could be helpful, as the giving features were the most highly correlated variables. The challenge with this classification task will continue to be the imbalance of the dataset. With a larger dataset, the model could also have a better opportunity to learn patterns that improve its ability to correctly classify donors.