# Objective: Decide whether we should switch to the new recommendation engine.
## Strategy:
We can treat each recommendation engine as a customer-detector (a binary classifier).

In practice, a potential customer needs two essential characteristics to convert into an actual customer.
First, she has to be interested in the product; second, she has to be eligible for it.

It is not clear whether the recommendation engine aims to detect both aspects. For the purpose of this exercise,
I will evaluate the recommendation engine on just the first of these two related concepts:<br>
1) Interest-detector: Ability to detect an interested customer. <br>
2) Eligibility-detector: Ability to detect an interested and eligible customer.

For now, let me focus only on the Interest-detector.
First, let's characterize the workflows in the system.

In [20]:
import pandas as pd
print("Examining the data:")
rec_one= pd.read_csv("out_recommender01.csv")
rec_two= pd.read_csv("out_recommender02.csv")
all_data= pd.concat( [rec_one, rec_two], ignore_index=True)
#I have left out preliminary-analysis of data structure 
#print (set( rec_one.action.tolist()))
#print (set( rec_two.action.tolist()))
#print (rec_one.columns)
#print (rec_two.columns)


# Let's see the possible workflows that a visitor to the site goes through.
# A workflow is the sequence of actions in a journey.
workflow = [ ",".join(group.action.tolist()) for (key,group) in all_data.groupby("journey_id")]
print("Workflows are : \n",  "\n".join(set(workflow) ))

Examining the data:
Workflows are : 
 Recommender-1.0,Recommender-1.0,Recommender-1.0,Customer Opted-Out
Recommender-1.0,Recommender-1.0,Recommender-1.0,Recommender-1.0,Customer Applied,Application Declined
Recommender-1.0,Customer Applied,Application Approved,Customer Signed-Up
Recommender-1.0,Customer Applied,Application Declined
Recommender-1.0,Recommender-1.0,Recommender-1.0,Recommender-1.0,Customer Applied,Application Approved,Customer Cancelled Application
Recommender-1.0,Customer Applied,Application Approved,Customer Cancelled Application
Recommender-2.0,Customer Applied,Application Declined
Recommender-1.0,Recommender-1.0,Customer Applied,Application Approved,Customer Cancelled Application
Recommender-1.0,Recommender-1.0,Recommender-1.0,Customer Applied,Application Approved,Customer Cancelled Application
Recommender-1.0,Customer Opted-Out
Recommender-2.0,Customer Applied,Application Approved,Customer Cancelled Application
Recommender-2.0,Recommender-2.0,Recommender-2.0,Customer

## Workflows, Success and Failure Criteria
From the above workflow analysis, I understand that a customer may or may not
be chosen by the recommender. From the customer's perspective, the possible workflows are:
*) 'Customer Opted-Out' <br>
*) 'Customer Applied,Application Declined' <br>
*) 'Customer Applied,Application Approved,Customer Cancelled Application' <br>
*) 'Customer Applied,Application Approved,Customer Signed-Up'  <br>

For my initial analysis, where I focus solely on the interest-detection capability of the recommender,
I consider : <br>
1) True Positive  : A customer who was recommended the product and applied.<br>
2) False Positive : A customer who was recommended the product and did not apply. <br> 
3) False Negative : A customer who was not recommended the product but still applied. <br>
       <b>Such a case is attributed to recommender-01 if she applied before 01 July 2017. Otherwise, it is split between the two engines in an 80:20 ratio (recommender-01 : recommender-02). </b> <br>
4) True Negative  : N/A or undetectable. The present logs appear to contain no data about
    customers who were neither recommended the product nor applied for it.

### Characterizing Recommender-Performance

Based on the above data, we can measure the following metrics to characterize the recommendation engines:<br>
1) Precision (P) = TP / (TP + FP) <br>
2) Recall or Sensitivity (R) = TP /(TP + FN) <br>
3) Balanced F1 score  = 2*P*R / (P+R) <br>
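The three formulas above can be sketched as a small helper. This is only an illustration: the function name `precision_recall_f1`, the choice to return 0.0 when a denominator is zero, and the toy counts are assumptions of this sketch, not part of the original analysis.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion counts.

    Counts may be fractional, e.g. when a false negative is split
    between two engines.
    """
    # Assumption of this sketch: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy counts, made up purely for illustration (not taken from the logs).
p, r, f1 = precision_recall_f1(tp=30, fp=70, fn=20)
print("{0:.2f} {1:.2f} {2:.2f}".format(p, r, f1))
```

Keeping the three metrics in one function makes it easy to apply the same definitions to both engines.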

#### Simplifying Assumptions
These metrics are <b>journey-specific</b>, not actor-specific.
In other words, I am ignoring the case where the same customer goes through several recommendation journeys.
It is quite possible that repeated recommendations have a higher conversion rate.
For example, a recommendation might be more effective when seen for the 3rd time.
For this first-cut analysis, I ignore such aspects and make the simplifying assumption
that journeys are i.i.d. I have also not done a <b> rigorous error-analysis </b>. For example,
is it possible for a customer to apply, opt out, and repeat this process many times?
How should such a case be handled? These aspects require more analysis.

In [25]:
tp1, tp2, fp1, fp2, fn1, fn2 = 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
###
for journey_id, group in all_data.groupby("journey_id") :
    action_lst = group.action.tolist()
    if 'Recommender-1.0' in  action_lst:
        if 'Customer Applied' in action_lst:
            tp1 += 1
        else:
            assert('Customer Opted-Out' in action_lst )
            fp1 += 1
    elif 'Recommender-2.0' in action_lst:
        if 'Customer Applied' in action_lst:
            tp2 += 1
        else:
            assert('Customer Opted-Out' in action_lst )
            fp2 += 1
    else:  # customer arrived without having been recommended the product
        assert('Customer Applied' in action_lst)  # has to be an FN; there is no TN-detection capability
        if group._time.min() < '2017-07-01':  # only recommender-01 was in play
            fn1 += 1.0
        else:  # split the FN between the two recommenders in a 0.8 : 0.2 ratio
            fn1 += 0.8
            fn2 += 0.2
########
#print( "Journey level metrics :", tp1, tp2, fp1, fp2, fn1, fn2)
p1 = tp1 / (tp1+fp1)
p2 = tp2 / (tp2+fp2)
r1 = tp1 / (tp1+fn1)
r2 = tp2 / (tp2+fn2)
f1 = 2*p1*r1/(p1+r1)
f2 = 2*p2*r2/(p2+r2)
print("Journey level metrics of Precision, Recall and F1 for ")
print("\t\tRecommender-01 : ", "{0:.2f}".format(p1), "{0:.2f}".format(r1), "{0:.2f}".format(f1) )
print("\t\tRecommender-02 : ", "{0:.2f}".format(p2), "{0:.2f}".format(r2), "{0:.2f}".format(f2) )
trials1 = tp1 + fp1 
trials2 = tp2 + fp2 
print("Number of journeys or trials for the two recommenders :" , trials1, trials2)

Journey level metrics of Precision, Recall and F1 for 
		Recommender-01 :  0.37 0.35 0.36
		Recommender-02 :  0.39 0.36 0.37
Number of journeys or trials for the two recommenders : 30239.0 2196.0


# Comparing the two Recommenders.
The final question is which of the two detectors is better.
The newer version appears slightly better on Precision, Recall and F-score (0.39 vs 0.37, 0.36 vs 0.35 and 0.37 vs 0.36).
The sample size is also substantial (2196 journeys with Recommender-02).

Still, it is instructive to check for statistical significance before
rejecting the null hypothesis H0: The two engines are equally effective.
Further, I will choose Precision as the measure of effectiveness instead of Recall or F1,
as it seems appropriate for this task setting and is computed directly from the logs, unlike Recall and F1, which depend on the estimated FN counts.

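We cannot use paired tests such as McNemar's test, as the trials are done on different customers.
For independent samples (i.e. different customers corresponding to different recommenders), I choose an unpaired test: the two-proportion z-test, which is the large-sample counterpart of the unpaired t-test when the quantity compared is a proportion.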

Precision is analogous to a population mean, i.e. it is the proportion of truly interested customers
    among those to whom we recommended the product.

That is, p == tp/(tp+fp) can be seen as the proportion of successes in (tp+fp) Bernoulli trials.
Therefore, we can compare the precisions by comparing the population proportions for the two engines.

The null hypothesis H0: Both are equally precise, i.e. (P1-P2)==0  <br> 
The alternative hypothesis H1: Their precisions are different, i.e. P1 != P2 <br>
The two binomial distributions are given by n1, p1 and n2, p2: <br>
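For reference, the pooled two-proportion z statistic for these hypotheses can be written as below. This is the standard textbook form, stated under the assumption that the journeys are independent and that n1 and n2 are large enough for the normal approximation to hold. Here p denotes the pooled success proportion under H0.

$$
p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2},
\qquad
z = \frac{p_1 - p_2}{\sqrt{p\,(1-p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

Under H0, z is approximately standard normal, so a two-sided test at the 5 % level rejects H0 when |z| > 1.96.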



In [28]:
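import numpy as np

# Trial counts and precisions copied from the previous cell's output
# (the precisions are rounded to two decimals, so z is approximate)
n1 = 30239.0
p1 = 0.37
n2 = 2196.0
p2 = 0.39
p = (n1*p1 + n2*p2) / (n1+n2)  # pooled probability of success under H0
z = (p1-p2) / np.sqrt(p*(1-p)*(1/n1 + 1/n2))  # two-proportion z test statistic
print(z)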

-1.872948221730775


The magnitude of the above statistic, |z| = 1.87, is below the critical value Z (alpha/2) == Z (0.025) == 1.96 for a two-sided test at the 95 % confidence level. 
This means the statistic falls inside the acceptance region, and hence we cannot reject the null hypothesis.
In other words, there is insufficient statistical evidence to claim that the newer version is more precise. <br>
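The same conclusion can be expressed as a two-sided p-value. The sketch below is illustrative only: the helper name `two_proportion_z_test` is an assumption of this example, it uses the rounded precisions (0.37 and 0.39) reported earlier, and it relies on the normal approximation rather than an exact test.

```python
import math


def two_proportion_z_test(n1, p1, n2, p2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)  # success proportion under H0
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Inputs are the rounded values printed by the earlier cell.
z, p_value = two_proportion_z_test(30239, 0.37, 2196, 0.39)
print("z = {0:.3f}, two-sided p-value = {1:.3f}".format(z, p_value))
```

A p-value above 0.05 corresponds to |z| < 1.96, i.e. the null hypothesis is not rejected at the 5 % level.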

<b> A switch to the newer version cannot be recommended. </b> 