## 1. Load libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix, classification_report
from test_results import test_results

%matplotlib inline


## 2. Load data

In [2]:
train_df = pd.read_csv('data/training.csv')
print('Loaded training data: size {}'.format(train_df.shape))
test_df = pd.read_csv('data/test.csv')
print('Loaded testing data: size {}'.format(test_df.shape))
print('----------------')
print('train_df:')
train_df.head()

Loaded training data: size (84534, 10)
Loaded testing data: size (41650, 10)
----------------
train_df:


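| | ID | Promotion | purchase | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | No | 0 | 2 | 30.443518 | -1.165083 | 1 | 1 | 3 | 2 |
| 1 | 3 | No | 0 | 3 | 32.15935 | -0.645617 | 2 | 3 | 2 | 2 |
| 2 | 4 | No | 0 | 2 | 30.431659 | 0.133583 | 1 | 1 | 4 | 2 |
| 3 | 5 | No | 0 | 0 | 26.588914 | -0.212728 | 2 | 1 | 4 | 2 |
| 4 | 8 | Yes | 0 | 3 | 28.044331 | -0.385883 | 1 | 1 | 2 | 2 |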


## 3. Modeling

Build a model that selects which customers to target with the promotion so as to maximize the incremental response rate (IRR) and the net incremental revenue (NIR).

### 3.1. Get training data

Train the model only on customers in the treatment group, i.e. rows where Promotion is 'Yes'.

In [3]:
# Keep only the treatment group; .copy() avoids pandas' SettingWithCopyWarning
# when the frame is modified in place below
treatment_train_df = train_df[train_df['Promotion'] == 'Yes'].copy()
# treatment_test_df = test_df[test_df['Promotion']=='Yes']

# treatment_train_df = pd.concat([treatment_train_df, treatment_test_df])

# ID is an identifier and Promotion is now constant, so drop both
treatment_train_df.drop(['ID','Promotion'], axis=1, inplace=True)

print('treatment_train_df: size {}\n'.format(treatment_train_df.shape))
treatment_train_df.head()

treatment_train_df: size (42364, 8)


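| | purchase | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|---|---|---|---|---|---|---|---|
| 4 | 0 | 3 | 28.044331 | -0.385883 | 1 | 1 | 2 | 2 |
| 8 | 0 | 2 | 31.930423 | 0.393317 | 2 | 3 | 1 | 2 |
| 10 | 0 | 1 | 32.770916 | -1.511395 | 2 | 1 | 4 | 1 |
| 12 | 0 | 1 | 36.957009 | 0.133583 | 2 | 3 | 1 | 1 |
| 14 | 0 | 3 | 36.911714 | -0.90535 | 2 | 2 | 4 | 1 |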


### 3.2. Class distribution

In [4]:
print('Class distribution: \n{}'.format(treatment_train_df['purchase'].value_counts()))

Class distribution: 
purchase
0    41643
1      721
Name: count, dtype: int64


* The class distribution is extremely imbalanced: only 721 of the 42,364 treatment-group customers (about 1.7%) made a purchase. We first train on this distribution as it is and see how the model performs.

* We then try a strategy that deals with the class imbalance and check whether the results improve.
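
As a quick check on why accuracy alone can mislead with this distribution, the sketch below uses the class counts printed above to compute the minority-class share and the accuracy of a trivial model that always predicts "no purchase". The variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the value_counts() output above
n_no_purchase = 41643
n_purchase = 721
n_total = n_no_purchase + n_purchase

# Share of treatment-group customers who purchased
minority_share = n_purchase / n_total

# Accuracy of a trivial model that always predicts 0 (no purchase)
baseline_accuracy = n_no_purchase / n_total

print('Minority share: {:.4f}'.format(minority_share))
print('Always-0 accuracy: {:.4f}'.format(baseline_accuracy))
```

A classifier therefore needs to be judged on more than accuracy, since predicting the majority class for everyone already scores about 98%.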

### 3.3. Data splitting

In [5]:
# Split training data into predictors and response
X = treatment_train_df.drop(['purchase'], axis=1)
y = treatment_train_df['purchase']

print('X: {}'.format(X.shape))
print('y: {}'.format(y.shape))

X: (42364, 7)
y: (42364,)


### 3.4. Feature scaling

In [6]:
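# Rescale every feature to the range [0, 1].
# Note: the scaler is fit on all rows here, before the train/test split.
min_max_scaler = preprocessing.MinMaxScaler()
X = min_max_scaler.fit_transform(X)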
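
The cell above fits the scaler on the full feature matrix. A commonly used alternative, sketched here on hypothetical toy data rather than the notebook's data, is to split first, fit the scaler on the training rows only, and reuse that fitted scaler for unseen rows. All names ending in `_demo`, `_tr`, and `_te` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix standing in for V1-V7 (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=30.0, scale=5.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn each feature's min and max from the training rows only
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# Reuse the training-time min and max for unseen rows (no re-fitting)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

With this ordering the scaled test values may fall slightly outside [0, 1], which is expected: they are expressed on the training-set scale.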

### 3.5. Train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('X_train: {}'.format(X_train.shape))
print('X_test: {}'.format(X_test.shape))
print('y_train: {}'.format(y_train.shape))
print('y_test: {}'.format(y_test.shape))

X_train: (33891, 7)
X_test: (8473, 7)
y_train: (33891,)
y_test: (8473,)
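
The split above does not pass `stratify`, so with so few purchasers the positive rate in each part is left to chance. As a sketch on hypothetical toy labels (not the notebook's data), `stratify=y` keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 2% positive rate (hypothetical data)
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print('Positive rate, train: {:.3f}'.format(y_tr.mean()))
print('Positive rate, test:  {:.3f}'.format(y_te.mean()))
```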


### 3.6. Fitting data

In [8]:
classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(X_train, y_train)

## 4. Prediction

In [9]:
y_pred = classifier.predict(X_test)

## 5. Evaluation

### 5.1. Accuracy

In [10]:
accuracy = (y_pred == y_test).mean()
print('Accuracy: {0:.3f}'.format(accuracy))

Accuracy: 0.980


### 5.2. Confusion matrix

In [11]:
print('Confusion matrix: \n')
print(confusion_matrix(y_test, y_pred))
print('\n')
print('Classification report: \n')
print(classification_report(y_test, y_pred))

Confusion matrix: 

[[8306    7]
 [ 160    0]]


Classification report: 

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8313
           1       0.00      0.00      0.00       160

    accuracy                           0.98      8473
   macro avg       0.49      0.50      0.50      8473
weighted avg       0.96      0.98      0.97      8473
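
To make the report easier to read, the sketch below recomputes accuracy, precision, and recall for class 1 directly from the confusion matrix printed above. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = np.array([[8306, 7],
               [160, 0]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
# Guard against division by zero when a class is never predicted or never present
precision_1 = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall_1 = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print('Accuracy: {:.3f}'.format(accuracy))
print('Precision (class 1): {:.2f}'.format(precision_1))
print('Recall (class 1): {:.2f}'.format(recall_1))
```

The high accuracy comes entirely from the majority class: none of the 160 actual purchasers was identified, so recall for class 1 is 0.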



## 6. Test promotion strategy

In [12]:
def promotion_strategy(df):
    '''
    INPUT 
    df - a dataframe with *only* the columns V1 - V7 (same as train_data)

    OUTPUT
    promotion_df - np.array with the values
                   'Yes' or 'No' related to whether or not an 
                   individual should receive a promotion;
                   its length should be df.shape[0]
                
    Ex:
    INPUT: df
    
    V1	V2	  V3	V4	V5	V6	V7
    2	30	-1.1	1	1	3	2
    3	32	-0.6	2	3	2	2
    2	30	0.13	1	1	4	2
    
    OUTPUT: promotion
    
    array(['Yes', 'Yes', 'No'])
    indicating the first two users would receive the promotion and 
    the last should not.
    '''
    
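    # NOTE: fit_transform re-fits the scaler on df, so the min/max come from
    # the data being scored rather than from the training data;
    # min_max_scaler.transform(df) would reuse the training-time scaling.
    df = min_max_scaler.fit_transform(df)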

    y_pred = classifier.predict(df)

    # Convert 1/0 predictions into 'Yes'/'No'
    promotion = np.where(y_pred == 1, 'Yes', 'No')
    
    
    return promotion

In [13]:
test_results(promotion_strategy)

Nice job!  See how well your strategy worked on our test data below!

Your irr with this strategy is 0.0000.

Your nir with this strategy is -2.25.
We came up with a model with an irr of 0.0000 and an nir of -2.25 on the test set.


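This promotion strategy performs poorly: IRR is 0.0000 and NIR is negative (-2.25). This is in line with the confusion matrix above, where the classifier did not correctly identify any of the 160 purchasers in the hold-out split.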
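
`test_results` is an external helper whose source is not shown in this notebook. As a rough sketch only, assuming IRR is the purchase rate among targeted treatment customers minus the purchase rate among targeted control customers, and NIR is revenue from treated purchases minus promotion costs minus revenue from control purchases, the two metrics could be written as below. The function names, the revenue of 10 per purchase, and the cost of 0.15 per promotion are assumptions, not values confirmed by this notebook.

```python
def irr(purch_treat, cust_treat, purch_ctrl, cust_ctrl):
    """Incremental response rate (assumed definition)."""
    return purch_treat / cust_treat - purch_ctrl / cust_ctrl


def nir(purch_treat, cust_treat, purch_ctrl, revenue=10.0, cost=0.15):
    """Net incremental revenue (assumed definition; revenue and cost are assumed constants)."""
    return (revenue * purch_treat - cost * cust_treat) - revenue * purch_ctrl


# Hypothetical counts: 15 targeted treatment customers, 10 targeted control
# customers, and no purchases in either group
print('IRR: {:.4f}'.format(irr(0, 15, 0, 10)))
print('NIR: {:.2f}'.format(nir(0, 15, 0)))
```

Under these assumed definitions, a strategy that targets a handful of customers who never purchase earns nothing and only pays the promotion cost, which would produce a zero IRR and a small negative NIR like the one reported above. The actual counts behind that result are not shown.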

## 7. Deal with imbalanced class and test promotion strategy again

In [14]:
# Use SMOTE to oversample the minority class
oversample = SMOTE()
over_X, over_y = oversample.fit_resample(X, y)
over_X_train, over_X_test, over_y_train, over_y_test = train_test_split(over_X, over_y, test_size=0.2, stratify=over_y)

print('over_X_train: {}'.format(over_X_train.shape))
print('over_X_test: {}'.format(over_X_test.shape))
print('over_y_train: {}'.format(over_y_train.shape))
print('over_y_test: {}'.format(over_y_test.shape))

over_X_train: (66628, 7)
over_X_test: (16658, 7)
over_y_train: (66628,)
over_y_test: (16658,)


In [15]:
# Build a random forest model on the SMOTE-resampled data
SMOTE_classifier = RandomForestClassifier(n_estimators=150)
SMOTE_classifier.fit(over_X_train, over_y_train)

In [16]:
# Prediction
y_pred_SMOTE = SMOTE_classifier.predict(over_X_test)

In [17]:
# Evaluation

## Accuracy
accuracy_SMOTE = (y_pred_SMOTE == over_y_test).mean()
print('Accuracy_SMOTE: {0:.3f}'.format(accuracy_SMOTE))

print('\n')

## Confusion matrix
print('Confusion matrix: \n')
print(confusion_matrix(over_y_test, y_pred_SMOTE))
print('\n')
print('Classification report: \n')
print(classification_report(over_y_test, y_pred_SMOTE))

Accuracy_SMOTE: 0.948


Confusion matrix: 

[[7839  490]
 [ 380 7949]]


Classification report: 

              precision    recall  f1-score   support

           0       0.95      0.94      0.95      8329
           1       0.94      0.95      0.95      8329

    accuracy                           0.95     16658
   macro avg       0.95      0.95      0.95     16658
weighted avg       0.95      0.95      0.95     16658



In [18]:
def promotion_strategy(df):
    '''
    INPUT 
    df - a dataframe with *only* the columns V1 - V7 (same as train_data)

    OUTPUT
    promotion_df - np.array with the values
                   'Yes' or 'No' related to whether or not an 
                   individual should receive a promotion;
                   its length should be df.shape[0]
                
    Ex:
    INPUT: df
    
    V1	V2	  V3	V4	V5	V6	V7
    2	30	-1.1	1	1	3	2
    3	32	-0.6	2	3	2	2
    2	30	0.13	1	1	4	2
    
    OUTPUT: promotion
    
    array(['Yes', 'Yes', 'No'])
    indicating the first two users would receive the promotion and 
    the last should not.
    '''
    
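    # NOTE: fit_transform re-fits the scaler on df, so the min/max come from
    # the data being scored rather than from the training data;
    # min_max_scaler.transform(df) would reuse the training-time scaling.
    df = min_max_scaler.fit_transform(df)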

    y_pred = SMOTE_classifier.predict(df)

    # Convert 1/0 predictions into 'Yes'/'No'
    promotion = np.where(y_pred == 1, 'Yes', 'No')
    
    
    return promotion

In [19]:
# Test promotion strategy
test_results(promotion_strategy)

Nice job!  See how well your strategy worked on our test data below!

Your irr with this strategy is 0.0152.

Your nir with this strategy is 5.55.
We came up with a model with an irr of 0.0152 and an nir of 5.55 on the test set.


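After addressing the class imbalance with SMOTE, NIR on the test set rises from -2.25 to 5.55 and IRR rises slightly from 0.0000 to 0.0152. We will therefore use SMOTE_classifier to target customers. Note that the 0.948 hold-out accuracy above was measured on SMOTE-resampled data, so it is not directly comparable with the 0.980 accuracy of the first model.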