# CRISP-DM Analysis for Business Problem: Inactivity prediction with transactional data

This notebook is a companion to the Medium article (link below) and shows how the CRISP-DM methodology can be applied to understand, analyze and communicate a business problem through a proven and tested Data Science methodology.

CRISP-DM comprises 6 steps:

Section 1: Business Understanding

Section 2: Data Understanding

Section 3: Data Preparation

Section 4: Data Modeling

Section 5: Evaluate the Results

Section 6: Deployment

Medium Article:
https://medium.com/@fernandocarliniguimaraes/innactivity-prediction-using-machine-learning-on-transacional-data-642ef7c84674

# Section 1: Business Understanding

The broader business context is laid out in the companion Medium article.
A brief summary of the business understanding follows:
A Brazilian Credit Union wishes to preemptively predict mobile phone app inactivity in a six-month window.

The business value of such an endeavor lies in:
- (1) expanding the use cases of a dataset (data enrichment may lead to revenue growth);
- (2) deterring potential customer churn (avoiding lost revenue);
- (3) early detection of customer friction (guaranteeing user satisfaction).

The business questions that arise from this objective are:

### Question 1: Can transactional data alone safely predict channel inactivity in a six-month window?

### Question 2: What are the main features of a transactional dataset that can be used for understanding channel inactivity in a six-month window?

### Question 3: Are mono-product-family users more likely to have channel inactivity in a six-month window?


    


# Section 2: Data Understanding

### Credit Union's Transaction Dataset overview

Since this project deals with classified, company-owned information, we won't be able to show a complete exploratory analysis.

So here is a short, high-level description:

The Credit Union has several client channels. For this project we are looking at only one of them: the mobile phone app. It has roughly 4 million users, with an average of 40–45 million transactions per month. About 40% of these are financial transactions (like paying a bill) and 60% are non-financial (like looking up a bill receipt). Our main goal for the project is avoiding financial transaction inactivity, so we focused only on those.

All of these transactions are stored in a main database that is ingested daily into an AWS Data Lake. That was the interface I used to query the data and extract it for the project.

The transactional database holds A LOT of information. But for this project the most vital pieces of information used were:

- Time and date the transaction happened;
- The transaction code;
- The product family the transaction is part of (example: investment application and investment cashout are two different transactions of the same product family);¹
- The Credit Union Member who requested the transaction;
- The Credit Union the Member is linked to;
- The status of the transaction (did it complete, or was it canceled?);
- The channel through which the transaction was requested.

¹ There are 8 main product families: Channels (managing your self-service channel), Checking account (wire transfers), Payments (government taxes or company payment slips), Bills (water, phone, etc.), Credit (loans), Cards (credit and debit), PIX (Brazil’s own instant payment financial product) and Investments (Long Term Deposits, Market Shares).

For this project I filtered the channel to be only the Mobile App. I also chose 5 medium-sized Credit Unions from our system (we have over 140) so as to have a good amount of data without making the processing time too long. I also fixed a six-month period for the analysis.

### IMPORTANT OBSERVATION: 
This dataset is quite clean because it’s a high management information system. When we use the filters described above, like the channel filter and the completed status filter, we flush out basically anything that could get in our way. The heaviest data wrangling needed is grouping the transaction codes into product families, and that is still quite easy to accomplish.
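
The grouping step mentioned above can be pictured with a minimal sketch. The transaction codes in `CODE_TO_FAMILY` are hypothetical (the real codes are confidential), and plain Python is used only to keep the example dependency-free; the actual grouping was done in SQL. It also shows the two counters that Section 3 builds on: transactions per family (depth) and number of distinct families used (amplitude).

```python
from collections import Counter

# Hypothetical transaction codes; the real ones are confidential.
CODE_TO_FAMILY = {
    "INV_APPLY": "Investments",
    "INV_CASHOUT": "Investments",
    "PIX_SEND": "PIX",
    "BILL_PAY": "Bills",
}

def family_counts(transaction_codes):
    """Count transactions per product family for one member and one month."""
    return Counter(CODE_TO_FAMILY[code] for code in transaction_codes)

# Made-up list of completed transactions for a single member in a single month.
codes = ["PIX_SEND", "INV_APPLY", "PIX_SEND", "INV_CASHOUT", "BILL_PAY"]

depth = family_counts(codes)   # transactions per product family
amplitude = len(depth)         # number of distinct families used
print(dict(depth), amplitude)
```

Two different investment codes land in the same family, which is exactly the situation described in the footnote above.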

### Exploratory Analysis of the Transactional Database

I have written a second article that showcases the method I used for both the exploratory analysis and the model selection and development. Please check out the article, especially the <b>Data understanding — What data do we have / need? Is it clean?</b> section, for further insight.

Link:
https://medium.com/@fernandocarliniguimaraes/innactivity-prediction-using-machine-learning-on-transacional-data-642ef7c84674

### Disclaimer about Compliance and Confidentiality

Due to company compliance rules, I had to do all of the data wrangling and manipulation on our AWS Data Lake server, using Redash with an AWS Athena and AthenaSQL engine. Data was only available for extraction after anonymization. I've included in the repository a SQL file with a pseudo-algorithm that masks the sensitive information (like dataset names and columns) and shows how the data manipulation was done.

GitHub Repo for this project: https://github.com/nandodsg/Innactivity-Prediction-with-Transactional-Data

##### Queries used during exploratory analysis:
1. pseudo query - exploratory analysis dataset (anonymous).sql :
https://github.com/nandodsg/Innactivity-Prediction-with-Transactional-Data/blob/main/pseudo%20query%20-%20exploratory%20analysis%20dataset%20(anonymous).sql
2. pseudo query - exporatory analysis - churn flags (anonymous).sql : 
https://github.com/nandodsg/Innactivity-Prediction-with-Transactional-Data/blob/main/pseudo%20query%20-%20exporatory%20analysis%20-%20churn%20flags%20(anonymous).sql

# Section 3: Data Preparation

### Classifier Models Data Set preparation

There can be many approaches when it comes to modeling this specific business problem. One way to look at it is to think of this sixth-month inactivity as a kind of “churn” that we want to predict based on a series of features (predictors). With this approach we can choose a Classifier Model for the problem.

With this framing, the dataset has to be modeled per individual and not per transaction (a per-transaction layout would work for a Time Series model, though).

We need one individual per row, with all the features laid out on separate columns. Based on the exploratory analysis I want to construct my dataframe with the following blocks:

- Account Number ID
- Credit Union Number ID
- Sixth Month Inactivity Flag (our future dependent variable)
- A depth counter (number of transactions) by month and by product family
- An amplitude counter (number of different families used) by month
- Total depth counter by month
- Standard Deviation (by product family)
- 3 month window Moving Average (by product family)
- Absolute Moving Average variation (by product family)

The first 6 items were prepared via SQL query. The last three items were calculated in a second notebook, also found in this repository, called 'Data Wrangling for Innactivity Prediction'.
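
As a rough illustration of the last three items, the sketch below derives a standard deviation, the 3-month moving averages and their variation from five monthly counts. It is plain Python to stay dependency-free and is not the code from the wrangling notebook. The window layout (first window January to March, last window March to May), the use of the sample standard deviation and the definition of the variation as the last moving average minus the first are assumptions, although the last one agrees with the sample rows printed in Section 4. The input numbers are made up.

```python
from statistics import mean, stdev

def window_features(monthly_counts, window=3):
    """Return stdev, moving averages and their variation for one member.

    monthly_counts: transaction counts for consecutive months, oldest first.
    """
    if len(monthly_counts) < max(window, 2):
        raise ValueError("not enough months for the requested window")
    # One moving average per full window, sliding one month at a time.
    moving_averages = [
        mean(monthly_counts[start:start + window])
        for start in range(len(monthly_counts) - window + 1)
    ]
    return {
        "stdev": stdev(monthly_counts),      # sample standard deviation
        "moving_averages": moving_averages,  # e.g. MA_1, MA_2, MA_3
        # Assumed definition: last moving average minus the first one.
        "ma_variation": moving_averages[-1] - moving_averages[0],
    }

# Made-up counts for January to May.
features = window_features([90, 96, 98, 122, 78])
print(features)
```

With five months and a window of three, the function yields three moving averages, matching the `_MA_1`, `_MA_2` and `_MA_3` column suffixes.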

I extracted the data from the other 4 credit unions I had previously selected. This time I brought in every member who met one simple rule: they had to be active in the first five months of 2022. This extraction gave me a dataset 91,848 rows long, each row representing a unique individual.

### Disclaimer about Compliance and Confidentiality

Due to company compliance rules, I had to do all of the data wrangling and manipulation on our AWS Data Lake server, using Redash with an AWS Athena and AthenaSQL engine. Data was only available for extraction after anonymization. I've included in the repository a SQL file with a pseudo-algorithm that masks the sensitive information (like dataset names and columns) and shows how the data manipulation was done.

GitHub Repo for this project: https://github.com/nandodsg/Innactivity-Prediction-with-Transactional-Data

##### Query used to generate model dataset
1. pseudo query - model dataset.SQL : 
https://github.com/nandodsg/Innactivity-Prediction-with-Transactional-Data/blob/main/pseudo%20query%20-%20model%20dataset.SQL


# Section 4: Data Modeling

The following section details the development of 10 different classifier models aimed at supporting the analysis of the three business questions.

I have written a second article that showcases the method I used for both the exploratory analysis and the model selection and development. Please check out the article, especially the <b>Modeling — What modeling techniques should we apply?</b> and the <b>Evaluation — Which model best meets the business objectives?</b> sections, for further insight.

Link: https://medium.com/@fernandocarliniguimaraes/innactivity-prediction-using-machine-learning-on-transacional-data-642ef7c84674

In [1]:
# Utils class with functions for model development, testing and evaluation
import utils as u
%matplotlib inline

In [2]:
df_first_run = u.pd.read_csv('./Model Data Set STDEV MA (pseudo).csv',sep=',')
df_first_run.head()

Unnamed: 0.1,Unnamed: 0,CREDIT_UNION_ID,ACCOUNT_NUM,FLG_202201,FLG_202202,FLG_202203,FLG_202204,FLG_202205,FLG_202206,DEEP_CHANNELS_202201,...,PAYMENTS_MA_3,PAYMENTS_MA_VAR,AMP_MA_1,AMP_MA_2,AMP_MA_3,AMP_MA_VAR,NUM_TRANSACTIONS_MA_1,NUM_TRANSACTIONS_MA_2,NUM_TRANSACTIONS_MA_3,NUM_TRANSACTIONS_MA_VAR
0,0,A,ZWZZ!W,1,1,1,1,1,1,0,...,8.666667,2.666667,3.666667,3.333333,2.666667,-1.0,94.666667,105.333333,99.333333,4.666667
1,1,A,&WXYY&,1,1,1,1,1,1,0,...,0.0,-2.0,2.0,1.666667,1.666667,-0.333333,8.666667,12.0,13.333333,4.666667
2,2,A,Y%@YZ&,1,1,1,1,1,1,0,...,0.0,0.0,2.0,2.0,1.666667,-0.333333,9.333333,9.333333,11.333333,2.0
3,3,A,!W%&#!,1,1,1,1,1,1,0,...,1.333333,0.0,2.666667,3.0,2.666667,0.0,32.666667,38.0,54.0,21.333333
4,4,A,%##AXY,1,1,1,1,1,1,0,...,2.666667,-0.666667,3.0,2.666667,2.666667,-0.333333,16.0,10.666667,8.666667,-7.333333


In [3]:
df_big_blind_test = u.pd.read_csv('./Big Blind Predict Test.csv',sep=',')
df_big_blind_test.head()

Unnamed: 0.1,Unnamed: 0,FLG_202201,FLG_202202,FLG_202203,FLG_202204,FLG_202205,FLG_202206,DEEP_CHANNELS_202201,DEEP_CHANNELS_202202,DEEP_CHANNELS_202203,...,PAYMENTS_MA_3,PAYMENTS_MA_VAR,AMP_MA_1,AMP_MA_2,AMP_MA_3,AMP_MA_VAR,NUM_TRANSACTIONS_MA_1,NUM_TRANSACTIONS_MA_2,NUM_TRANSACTIONS_MA_3,NUM_TRANSACTIONS_MA_VAR
0,0,1,1,1,1,1,1,0,0,0,...,10.666667,2.666667,2.333333,2.666667,3.0,0.666667,26.666667,29.333333,32.0,5.333333
1,1,1,1,1,1,1,1,0,0,0,...,3.333333,2.0,2.0,2.0,2.0,0.0,18.0,18.666667,19.333333,1.333333
2,2,1,1,1,1,1,1,0,0,0,...,15.333333,5.333333,2.333333,2.666667,2.666667,0.333333,53.333333,72.0,79.333333,26.0
3,3,1,1,1,1,1,1,0,0,0,...,1.333333,1.333333,2.0,2.0,1.666667,-0.333333,34.0,20.666667,19.333333,-14.666667
4,4,1,1,1,1,1,1,0,0,0,...,4.0,1.333333,3.0,3.0,3.333333,0.333333,25.333333,29.333333,38.666667,13.333333


In [4]:
# We make sure to create a copy of the data before we start altering it, so the original data we loaded stays unchanged.
# A deep copy is required for that: a shallow copy (deep=False) would still share the underlying data with the original.
data_first_run = df_first_run.copy(deep=True)
data_big_blind_test = df_big_blind_test.copy(deep=True)

In [6]:
# Add the target column to both working copies: it mirrors the sixth-month activity flag (FLG_202206).
# The other columns are kept because the feature selection further down needs them.
data_first_run['FLG_INNACTIVITY'] = df_first_run['FLG_202206']
data_big_blind_test['FLG_INNACTIVITY'] = df_big_blind_test['FLG_202206']
# Preview the target next to the total-transaction moving averages
data_first_run[['FLG_INNACTIVITY',
                'NUM_TRANSACTIONS_MA_1',
                'NUM_TRANSACTIONS_MA_2',
                'NUM_TRANSACTIONS_MA_3',
                'NUM_TRANSACTIONS_MA_VAR']]

Unnamed: 0,FLG_INNACTIVITY,NUM_TRANSACTIONS_MA_1,NUM_TRANSACTIONS_MA_2,NUM_TRANSACTIONS_MA_3,NUM_TRANSACTIONS_MA_VAR
0,1,94.666667,105.333333,99.333333,4.666667
1,1,8.666667,12.000000,13.333333,4.666667
2,1,9.333333,9.333333,11.333333,2.000000
3,1,32.666667,38.000000,54.000000,21.333333
4,1,16.000000,10.666667,8.666667,-7.333333
...,...,...,...,...,...
91843,1,76.666667,79.333333,102.666667,26.000000
91844,1,16.000000,18.666667,18.000000,2.000000
91845,1,3.000000,2.666667,2.666667,-0.333333
91846,1,15.333333,8.000000,8.666667,-6.666667


# Handling class imbalance

We know from our exploratory analysis that this dataset will be heavily imbalanced, with churn in the 6th month as the minority class (represented as inactivity in that month, or FLG_202206 = 0). Check the histogram below for visual reference.

In [None]:
fig, axis = u.plt.subplots()
axis.hist(data_first_run['FLG_INNACTIVITY'])
u.plt.ylabel('Accounts')
u.plt.xlabel('Innactivity Flag')
u.plt.show()

The problem with classifiers and class imbalance is that the classifier will more easily predict the majority class, simply because most cases belong to that class. For that reason, model performance metrics have to be carefully selected. Precision, recall and F1 will be used as the main metrics for evaluating performance. In our specific case we are most interested in those metrics for the prediction of the minority class (0 in our case).

So in this study we will contrast two widely used classification models: Logistic Regression and RandomForestClassifier, both in their scikit-learn implementations. Tree ensembles are supposedly better at handling imbalance. A common way of getting better results is to use resampling techniques, so we will contrast the metrics of baseline models with those of resampled models (Random Over Sampling, SMOTE and Near Miss).
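
To make the resampling idea concrete, here is a minimal sketch of the two simplest strategies, random over-sampling and random under-sampling, written in plain Python. It is an illustration only and not the implementation behind `utils.model_run`, which is not shown in this notebook; SMOTE and Near Miss are more elaborate and are left out. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_over_sample(rows, labels, seed=42):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

def random_under_sample(rows, labels, seed=42):
    """Keep a random subset of every class, sized to the smallest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [row for row, label in zip(rows, labels) if label == cls]
        out_rows.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels

# Made-up data: 8 majority rows (label 1) and 2 minority rows (label 0).
rows = [[i] for i in range(10)]
labels = [1] * 8 + [0] * 2

_, over_labels = random_over_sample(rows, labels)
_, under_labels = random_under_sample(rows, labels)
print(Counter(over_labels), Counter(under_labels))
```

Resampling is normally applied to the training split only, so that the test split keeps the original class proportions.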


Reference:

https://medium.com/grabngoinfo/four-oversampling-and-under-sampling-methods-for-imbalanced-classification-using-python-7304aedf9037

https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec

https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/

# Preparing Global Variables

In [None]:
# Sixth-month (202206) columns describe the month being predicted, so they are dropped together with the target
prediction_columns = ['FLG_INNACTIVITY',
                      'DEEP_CHANNELS_202206',
                      'DEEP_CARDS_202206',
                      'DEEP_CHECKING_202206',
                      'DEEP_BILLS_202206',
                      'DEEP_CREDIT_202206',
                      'DEEP_INVESTMENTS_202206',
                      'DEEP_PAYMENTS_202206',
                      'DEEP_PIX_202206',
                      'AMP_202206',
                      'NUM_TRANSACTIONS_202206'
                     ]

#Declare independent variables (X) and dependent variable (y) for the Model Training and Setup dataset
X = data_first_run.loc[:,'DEEP_CHANNELS_202201':'NUM_TRANSACTIONS_MA_VAR'] # Slice leaves out the ID and monthly flag columns
X = X.drop(columns=prediction_columns, errors='ignore') # errors='ignore' tolerates columns that fall outside the slice
y = data_first_run['FLG_INNACTIVITY']

#Declare independent variables (X) and dependent variable (y) for the Big Blind Predict Test dataset
X_bbt = data_big_blind_test.loc[:,'DEEP_CHANNELS_202201':'NUM_TRANSACTIONS_MA_VAR'] # Slice leaves out the ID and monthly flag columns
X_bbt = X_bbt.drop(columns=prediction_columns, errors='ignore')
y_bbt = data_big_blind_test['FLG_INNACTIVITY']

#set shared model, scaler and splitter variables
random_state = 42
test_size = 0.30
verbose = 'off'
print_report = 'off'

#set model names
models = [
          'Random Forest',
#           'Logistic Regression',
         ]

#set resampling method names
resamplers = [
              'Baseline',
              'Random Over Sampling',
              'SMOTE',
              'Near Miss KNN',
              'Random Under Sampling',
             ]

## Processing and evaluating models

In [None]:
u = u.reload(u)
model_prediction, model_scores_table, model_coef_table, BBT_model_scores_table = u.model_run(models,
                                                                                           resamplers,
                                                                                           X,
                                                                                           y,
                                                                                           X_bbt,
                                                                                           y_bbt,
                                                                                           random_state,
                                                                                           test_size,
                                                                                           verbose,
                                                                                           print_report)

In [None]:
#Model Training Score Table
m = model_scores_table.loc[model_scores_table['Precision 1'] >= .08].sort_values(by='Recall 1')
#m = m.loc[m['Recall 1'] >= .85]
m

In [None]:
# Model training coefficient and feature importance table
model_coef_table

In [None]:
#Blind Test Score Table
BBT_model_scores_table

# Section 5: Evaluate the Results

This section is split into separate analyses for each business question.
Each one comprises a brief analysis, an evaluation and a conclusion.

# Question 1: Can transactional data alone safely predict channel inactivity in a six-month window?

In this section we will analyze the Model Scores table for the Big Blind Test routine. The idea is to simulate a real-world application, where a model trained on a data sample is used to predict on a bigger group.

Let's first analyze the scores to identify our best models.

Our models can be considered to have medium to high success if they:
1) Correctly predict over 80% of the True Positive cases
We want a model that gets most of these cases right from the start. In this case, our main interest is the minority class (1).

2) Correctly predict over 95% of the True Negative cases
The model should also be able to get most of the True Negative cases (majority class, 0) right, since it will have abundant data for the job.

3) Have a Precision higher than 75% for the minority class
Above this level the model produces at least three True Positives for every False Positive.

4) Have a Recall higher than 80%
Recall is the share of all real cases of a class that the model classifies correctly, in this case the minority class.
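
The four criteria above can be written down as a small check on the confusion matrix counts. This is a sketch under the assumption that class 1 is the positive class; the function name and the example counts are hypothetical and are not taken from the score tables.

```python
def meets_success_criteria(tp, fp, tn, fn):
    """Check the four success criteria against one confusion matrix."""
    positives = tp + fn   # real cases of class 1
    negatives = tn + fp   # real cases of class 0
    recall = tp / positives if positives else 0.0
    tn_rate = tn / negatives if negatives else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    checks = {
        "true positives > 80%": recall > 0.80,
        "true negatives > 95%": tn_rate > 0.95,
        "precision > 75%": precision > 0.75,
        "recall > 80%": recall > 0.80,
    }
    return all(checks.values()), checks

# Made-up counts: one model that passes and one that fails on precision.
passed, _ = meets_success_criteria(tp=90, fp=20, tn=880, fn=10)
failed, failed_checks = meets_success_criteria(tp=85, fp=900, tn=8000, fn=15)
print(passed, failed, failed_checks)
```

Since the share of True Positive cases correctly predicted is the Recall itself, criteria 1 and 4 evaluate the same quantity here.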


##### Quick primer on classifier evaluation scores

Quick primer on reading the Confusion Matrix and the Classification Report measures.
How to read the quadrants of the matrix:

True Negative | False Positive

False Negative | True Positive

Precision
Measure of how many of the positive predictions made are correct (true positives).
Formula: TP/(TP+FP)

Recall
Measure of how many of the positive cases the classifier correctly predicted, out of all positive cases in the data.
It is sometimes also referred to as Sensitivity.
Formula: TP/(TP+FN)

f1-Score
Harmonic mean of precision and recall.
Formula: 2*(Precision*Recall)/(Precision+Recall)

Accuracy
Measure of the number of correct predictions over all predictions.
Formula: (TP+TN)/(TP+TN+FP+FN)
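
The formulas in this primer translate directly into code. The helper below is a minimal sketch; its name and the counts passed to it are made up for illustration, and the zero-division guards are a design choice rather than part of the definitions.

```python
def classification_scores(tp, fp, tn, fn):
    """Compute precision, recall, f1 and accuracy from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up counts for an imbalanced problem with 50 positive cases out of 1,000.
scores = classification_scores(tp=40, fp=10, tn=940, fn=10)
print(scores)
```

Accuracy is dominated by the 940 True Negatives, which is why the analysis leans on precision, recall and f1 instead.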

In [None]:
BBT_model_scores_table

In [None]:
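# Support values (number of real cases per class) are read from row 1; all models share the same blind test set
Support_1 = BBT_model_scores_table.loc[1,'Support 1']
Support_0 = BBT_model_scores_table.loc[1,'Support 0']

# Count how many models reach each True Positive cutoff, from 10% to 100% of the class 1 support
for step in range(1, 11):
    i = step / 10  # derived from an integer to avoid float accumulation drift
    model_cutoff = BBT_model_scores_table.loc[BBT_model_scores_table['TP'] >= (Support_1*i)]
    print('Cutoff:',i,'Number of models:',model_cutoff.shape[0])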

In [None]:
model_cutoff

In [None]:
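# Support values are needed before the loop runs
Support_1 = BBT_model_scores_table.loc[1,'Support 1']
Support_0 = BBT_model_scores_table.loc[1,'Support 0']

# Keep the models above 70% of True Positives and True Negatives, then sweep the Precision 1 cutoff from 0.05 to 1.00
for step in range(1, 21):
    i = round(step * 0.05, 2)  # 0.05 steps keep the cutoff inside the valid precision range
    model_cutoff = BBT_model_scores_table.loc[BBT_model_scores_table['TP'] >= (Support_1*.7)]
    model_cutoff = model_cutoff.loc[model_cutoff['TN'] >= (Support_0*.7)]
    model_cutoff = model_cutoff.loc[model_cutoff['Precision 1'] >= i]
    print('Cutoff:',i,'Number of models:',model_cutoff.shape[0])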

In [None]:
Support_1 = BBT_model_scores_table.loc[1,'Support 1']
Support_0 = BBT_model_scores_table.loc[1,'Support 0']

# 1) Correctly predict the True Positive cases (stated target: over 80%; this cell applies a 70% cutoff)
model_cutoff = BBT_model_scores_table.loc[BBT_model_scores_table['TP'] >= (Support_1*.7)]
print('Total number of models after cutoff round:',model_cutoff.shape[0])

# 2) Correctly predict the True Negative cases (stated target: over 95%; this cell applies a 70% cutoff)
model_cutoff = model_cutoff.loc[model_cutoff['TN'] >= (Support_0*.7)]
print('Total number of models after cutoff round:',model_cutoff.shape[0])

# # 3) Have a Precision higher than 75% for the minority class
# model_cutoff = model_cutoff.loc[model_cutoff['Precision 1'] >= 0.75]
# print('Total number of models after cutoff round:',model_cutoff.shape[0])

# # 4) Have a Recall higher than 80% (the disabled filter below uses 0.85)
# model_cutoff = model_cutoff.loc[model_cutoff['Recall 1'] >= 0.85]
# print('Total number of models after cutoff round:',model_cutoff.shape[0])



In [None]:
# Let's look at our Top 3 models, sorted by Precision 1 value
model_cutoff.sort_values(by='Precision 1',ascending=False)

### Classification Score analysis

Our Top 3 models are all Random Forest, with different resampling variations: Random Over Sampling, Baseline (no resampling) and Near Miss KNN.

All three had very similar performance. All three had 100% Recall, which means they correctly predicted all True Positive cases available in the data. All three also had 100% Precision on the majority class (0), which means every case they labeled as 0 was a True Negative and there were zero False Negatives to account for.

The main point that differentiates the three models is the number of False Positives and, consequently, the Precision 1 score. Random Forest with Random Over Sampling outperformed both the Baseline and the Near Miss KNN variants by predicting fewer False Positives, which in turn led to a Precision 1 score slightly above the other two models.

## Evaluation for Question 1:

We have three models that achieve (and in fact surpass) our success prerequisites. Thus, we can affirm that the classifier models will safely predict inactivity in month six.

# Question 2: What are the main features of a transactional dataset that can be used for understanding channel inactivity in a six-month window?

The objective behind this question is to understand which predictors from our transactional database have the highest impact on model performance.

We will:

(1) Use the model coefficients and feature importances to select the most important features

(2) Reevaluate our models using only the best predictors to check whether performance improves.

In [None]:
#Let's create a Table of only the Top 3 Model's Feature Importance Scores
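# .copy() makes Top3_FI an independent table, so adding columns to it later does not trigger pandas chained-assignment warnings
Top3_FI = model_coef_table.loc[:,['Features','Random Forest Baseline','Random Forest Random Over Sampling','Random Forest Near Miss KNN']].copy()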
Top3_FI

In [None]:
# Let's check how many features were fed to the model
print('Total number of features fed to the model = ',Top3_FI.shape[0])

# Let's average the feature importance scores across the three models
Top3_FI['Features Importance Average'] = Top3_FI.drop(columns=['Features']).mean(axis = 1)

#Let's check how many features show an average importance of 0.
print('Average Feature Importance of zero =',Top3_FI.loc[Top3_FI['Features Importance Average'] == 0].shape[0])

In [None]:
#Let's analyze a histogram of the average feature importance scores.
u.plt.hist(Top3_FI['Features Importance Average'])

In [None]:
# Let's see how many features survive different importance cutoffs before cleaning up
print(Top3_FI['Features Importance Average'].describe(include='all'))
print('Average Feature Importance >= 0.01 =',Top3_FI.loc[Top3_FI['Features Importance Average'] >= 0.01].shape[0])
print('Average Feature Importance >= 0.02 =',Top3_FI.loc[Top3_FI['Features Importance Average'] >= 0.02].shape[0])
print('Average Feature Importance >= 0.03 =',Top3_FI.loc[Top3_FI['Features Importance Average'] >= 0.03].shape[0])
print('Average Feature Importance >= 0.04 =',Top3_FI.loc[Top3_FI['Features Importance Average'] >= 0.04].shape[0])

In [None]:
#Let's drop the features with an average importance < 0.01
Top3_FI_Clean = Top3_FI.loc[Top3_FI['Features Importance Average'] >= 0.01]
print('Total number of features with importance >= 0.01 =',Top3_FI_Clean.shape[0])

In [None]:
#Now that we've cleaned things up, let's look again at a histogram of the average feature importance scores.
u.plt.hist(Top3_FI_Clean['Features Importance Average'])

In [None]:
# Which features have the highest average importance across the Top 3 models? (highest first)
Top3_FI_Clean.sort_values(by=['Features Importance Average'],ascending=False)


### Coefficient and Feature Importance Analysis

In this section we used Coefficient and Feature Importance Analysis to identify the most important features in our Top 3 models.

In the first few rounds of investigation we noticed a huge number of features scoring less than 0.01 importance. After cleaning them up we were left with the 16 features of highest importance. We will now test these in a new training and evaluation routine to see if our model performance improves.


## Model tweaking based on feature importance

In [None]:
#Declare independent variables (X) and dependent variable (y)

# Based on our prior analysis, I've decided to test the performance of the models with fewer predictors.
# So we drop most of them and keep only the features with an average importance of at least 0.01.

# To avoid writing them out every time, the feature names are taken from the Top3_FI_Clean table.
#Declare independent variables (X) and dependent variable (y) for the Model Training and Setup dataset
X1 = data_first_run.loc[:,Top3_FI_Clean['Features']] # Keep only the selected features
y1 = data_first_run['FLG_INNACTIVITY']

#Declare independent variables (X) and dependent variable (y) for the Big Blind Predict Test dataset
X1_bbt = data_big_blind_test.loc[:,Top3_FI_Clean['Features']] # Keep only the selected features
y1_bbt = data_big_blind_test['FLG_INNACTIVITY']

models1 = ['Random Forest']
resamplers1 = ['Random Over Sampling',
               'Baseline',
               'Near Miss KNN'
              ]

In [None]:
u = u.reload(u)
model_prediction1, model_scores_table1, model_coef_table1, BBT_model_scores_table1 = u.model_run(models1,
                                                                                           resamplers1,
                                                                                           X1,
                                                                                           y1,
                                                                                           X1_bbt,
                                                                                           y1_bbt,
                                                                                           random_state,
                                                                                           test_size,
                                                                                           verbose,
                                                                                           print_report)

In [None]:
BBT_model_scores_table1

# Question 3: Are mono-product-family users more likely to have channel inactivity in a six-month window?

# Evaluation

Though the exploratory analysis indicated the possibility of finding a correlation between transaction patterns and inactivity, the two classifiers and 4 resampling techniques used did not perform well on this highly imbalanced dataset.

The models just didn't perform well, unfortunately. But hey, this is a scientific approach: knowing that something doesn't work is also a valid result, it just brushes the false positives out of your line of sight.

All models had precision scores ranging from 0.08 to 0.09, recall from 0.83 to 0.85 and f1-score at exactly 0.15. The main differences are visible in the confusion matrix, with slight variations in the true/false positive/negative predictions. The Random Forest with Random Under Sampling had similar measures: precision at 0.09, recall at 0.77 and f1-score at 0.16.

Example of Classification Report and Confusion Matrix for the Logistic Regression with Baseline model.

The models actually did an interesting job of predicting 325+ cases of the 423 inactivity targets in the test set (you can see that by looking at the confusion matrix's top left quadrant, 358 in the example above). That is why the Recall (or sensitivity) is high.

This means the model is more confident at trying to predict the minority cases (the Random Forest Baseline practically didn't even try to predict the minority cases: in the report it only classified 15 as negatives, and 14 of them were false; check the screenshot below).

# Conclusion

Unfortunately, this project doesn't seem to provide strong evidence towards answering the business question either positively or negatively.

Our exploratory analysis shows there is a potential correlation to be explored between inactivity, depth (especially PIX) and amplitude. But the use of classifier models, at least with the present configuration, hasn't presented promising results.

### Recommendations on future studies

1. Study the use of time series prediction techniques as a substitute for classifiers.
2. Using the cumulative transactional variation over the 5 months prior to the 6th month inactivity prediction may yield better results than using the absolute number of transactions per month as features.
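
One possible reading of the second recommendation is sketched below: instead of feeding the raw monthly counts to the model, feed the running total of month-over-month changes. This interpretation, the function name and the numbers are assumptions made for illustration; the idea has not been tested in this project.

```python
from itertools import accumulate

def cumulative_variation(monthly_counts):
    """Running total of month-over-month changes, one value per month after the first."""
    changes = [
        later - earlier
        for earlier, later in zip(monthly_counts, monthly_counts[1:])
    ]
    return list(accumulate(changes))

# Made-up transaction counts for the five months before the predicted month.
print(cumulative_variation([12, 10, 7, 7, 3]))  # [-2, -5, -5, -9]
```

A steadily decreasing sequence like this one would describe a member whose usage is fading, regardless of how many transactions they started with.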