# Statistical Learning - Project 2

Please find below the project for the Statistical Learning course. This is an individual assignment. Kindly submit it before its deadline.

The Titan Insurance Company has just installed a new incentive payment scheme for its life policy sales force. It wants to have an early view of the success or failure of the new scheme. Indications are that the sales force is selling more policies, but sales always vary in an unpredictable pattern from month to month and it is not clear that the scheme has made a significant difference.

Life insurance companies typically measure the monthly output of a salesperson as the total sum assured for the policies sold by that person during the month. For example, suppose salesperson X has, in the month, sold seven policies for which the sums assured are £1000, £2500, £3000, £5000, £5000, £10000, £35000. X's output for the month is the total of these sums assured, £61,500. Titan's new scheme is that the sales force receives low regular salaries but is paid large bonuses related to their output (i.e. to the total sum assured of policies sold by them). The scheme is expensive for the company, but they are looking for sales increases which more than compensate. The agreement with the sales force is that if the scheme does not at least break even for the company, it will be abandoned after six months.

The scheme has now been in operation for four months. It has settled down after fluctuations in the first two months due to the changeover.

To test the effectiveness of the scheme, Titan have taken a random sample of 30 salespeople, measured their output in the penultimate month prior to changeover, and then measured it in the fourth month after the changeover (they have deliberately chosen months not too close to the changeover). The outputs of the salespeople are in the data.csv file.



## Questions

**1. Find the means of the old scheme and new scheme columns. (5 points)**

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import t,ttest_1samp,ttest_rel, ttest_ind, mannwhitneyu, levene, shapiro
from statsmodels.stats.power import ttest_power

In [2]:
insuranceSalesData = pd.read_csv('data.csv')
insuranceSalesData.head()

| | SALESPERSON | Old Scheme (in thousands) | New Scheme (in thousands) |
|---|---|---|---|
| 0 | 1 | 57 | 62 |
| 1 | 2 | 103 | 122 |
| 2 | 3 | 59 | 54 |
| 3 | 4 | 75 | 82 |
| 4 | 5 | 84 | 84 |


In [3]:
insuranceSalesData.shape

(30, 3)

In [4]:
#The data for the old and new schemes are given in thousands, so multiply by 1000 to convert them to pounds
insuranceSalesData["OldScheme"]=insuranceSalesData["Old Scheme (in thousands)"]*1000
insuranceSalesData["NewScheme"]=insuranceSalesData["New Scheme (in thousands)"]*1000
#NOTE: the means are calculated from the newly converted columns
print( "Old Scheme Mean Value : {0}".format(insuranceSalesData["OldScheme"].mean()))
print( "New Scheme Mean Value : {0}".format(insuranceSalesData["NewScheme"].mean()))

Old Scheme Mean Value : 68033.33333333333
New Scheme Mean Value : 72033.33333333333


**2. Use a test at the five percent significance level on the data to determine the p-value and check whether the new scheme has significantly raised outputs. (10 points)**

In [5]:
# Let μ2 = mean output under the new scheme.
# μ1 = mean output under the old scheme.

#H0: μ1 = μ2  ; μ2 – μ1  = 0

#HA: μ1 < μ2   ; μ2 – μ1  > 0 ; the true difference of means is greater than zero.

#The same 30 salespeople are measured under both schemes and the population standard deviation is unknown, so a paired-sample t-test is used.
#The question asks whether the new scheme has raised output, so this is a one-tailed (right-tailed) test.
#ttest_rel returns a two-tailed p-value; because the observed mean difference is positive (in the direction of HA), dividing it by 2 gives the one-tailed p-value.

t_statistic, p_value = ttest_rel(insuranceSalesData["NewScheme"], insuranceSalesData["OldScheme"])
print ("p-Value = {0}".format( p_value/2))


p-Value = 0.06528776980668831
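
A paired t-test is the same calculation as a one-sample t-test on the per-person differences. The following is a minimal sketch of that calculation using only the standard library and only the five rows shown in the `head()` preview, so its result will differ from the full 30-row result above. The helper name `paired_t_statistic` is illustrative and not part of the assignment.

```python
import math
import statistics

def paired_t_statistic(after, before):
    """Return (t, df) for a paired t-test, computed from the differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Sample standard deviation (n - 1 in the denominator)
    sd_diff = statistics.stdev(diffs)
    # Standard error of the mean difference
    se = sd_diff / math.sqrt(n)
    return mean_diff / se, n - 1

# First five rows from the head() preview, in thousands of pounds
old_preview = [57, 103, 59, 75, 84]
new_preview = [62, 122, 54, 82, 84]

t_value, dof = paired_t_statistic(new_preview, old_preview)
print(f"t = {t_value:.4f}, df = {dof}")
```

Passing the full `NewScheme` and `OldScheme` columns to the same helper should reproduce the t-statistic that `ttest_rel` returns.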


**3. What conclusion does the test (p-value) lead to? (2.5 points)**

In [6]:
# Null hypothesis: the new scheme has NOT raised mean output (μ2 – μ1 = 0)

# Alternative hypothesis: the new scheme has raised mean output (μ2 – μ1 > 0)

# Decision rule at the 5% significance level:
# one-tailed p-value < 0.05  => reject the null hypothesis
# one-tailed p-value >= 0.05 => fail to reject the null hypothesis

print ("Since the p-value = {0} is higher than 0.05, we fail to reject the null hypothesis.".format( p_value/2))
print ("The new scheme has NOT significantly raised outputs.")

Since the p-value = 0.06528776980668831 is higher than 0.05, we fail to reject the null hypothesis.
The new scheme has NOT significantly raised outputs.


**4. Suppose it has been calculated that in order for Titan to break even, the average output must increase by £5000 under the new scheme compared to the old scheme. If this figure is the alternative hypothesis, what is:**

 **a) The probability of a type 1 error? (2.5 points)**

<b>Ans :</b><br>
*A type I error occurs when the null hypothesis is true but is rejected.*<br>
Alpha = P(Type I error) = significance level = 0.05, or 5%<br>

**b) What is the p-value of the hypothesis test if we test for a difference of £5000? (5 points)**

In [7]:
# Let  μ2 = Mean value of New Scheme.
# μ1 = Mean value of Old Scheme.
# μd = μ2 – μ1   

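# H0: μd = 5000
# HA: μd < 5000

# The observed mean difference (4000) lies below 5000, so the t-statistic is negative
# and the one-tailed p-value computed below refers to the left tail.

# ttest_1samp returns a two-tailed p-value; dividing it by 2 gives the one-tailed p-value
# in the direction of the observed t-statistic (here, the left tail).
# For the right-tailed alternative HA: μd > 5000, the p-value would instead be 1 - p_value/2.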

t_statistic, p_value = ttest_1samp(insuranceSalesData["NewScheme"]-insuranceSalesData["OldScheme"], 5000)

print("one tail p-Value  = {0}".format(p_value/2))

one tail p-Value  = 0.3500667456306643


**c) Power of the test (5 points)**

In [8]:
# Calculating the power of the test

newScheme=insuranceSalesData["NewScheme"]
oldScheme=insuranceSalesData["OldScheme"]
insuranceSalesData["Salesdiff"]=newScheme-oldScheme
n=insuranceSalesData.shape[0]
alpha=0.05
#Standard error of the mean difference
se=((np.std(insuranceSalesData["Salesdiff"],ddof=1))/np.sqrt(n))
print("Standard Error :{0} ".format(se))
#Degrees of freedom
df=n-1
print("Degrees of Freedom :{0} ".format(df))
#Critical value of the right-tailed test
cv = t.ppf(1.0 - alpha, df)
print("Critical Value :{0} ".format(cv))
mu0=0
#Smallest sample mean difference that leads to rejecting H0
x_bar=mu0+cv*se
print("x_bar : {0}".format(x_bar))

#The test is H0: μd = 0 against HA: μd > 0; H0 is rejected when the sample mean difference exceeds x_bar.
#Power = P(reject H0 | μd = 5000); beta = P(type II error) = P(do not reject H0 | μd = 5000) = 1 - power.
#The shifted central t distribution used here is an approximation; the exact calculation uses the noncentral t.

t_stat=(x_bar-5000)/se
print("t_Stat Value : {0}".format(t_stat))
#Power is the upper-tail probability beyond t_stat
powerOfTest = t.sf(t_stat, df)
#Beta is the complementary lower-tail probability
beta = 1 - powerOfTest
print("beta Value : {0}".format(beta))

print("Power of Test Value: {0}".format(powerOfTest))



Standard Error :2570.8355455569017 
Degrees of Freedom :29 
Critical Value :1.6991270265334972 
x_bar : 4368.176156228719
t_Stat Value : -0.2457659514095497
beta Value : 0.40379725247943166
Power of Test Value: 0.5962027475205683
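
The type I error rate and the power can both be cross-checked by simulation: draw many samples of 30 differences, run the right-tailed t-test of H0: μd = 0 on each, and count how often H0 is rejected. The sketch below uses only the standard library and assumes the differences are normally distributed, which the assignment does not state. The function name `rejection_rate` is illustrative, and the standard error and critical value are copied from the output above.

```python
import math
import random

def rejection_rate(true_mean, sd, n, t_crit, n_sims=10_000, seed=0):
    """Estimate how often a right-tailed one-sample t-test of H0: mean = 0
    rejects when the differences are Normal(true_mean, sd)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = sum(sample) / n
        # Sample variance with n - 1 in the denominator
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t_stat = mean / math.sqrt(var / n)
        if t_stat > t_crit:
            rejections += 1
    return rejections / n_sims

n = 30
se = 2570.8355455569017       # standard error reported above
sd = se * math.sqrt(n)        # implied standard deviation of the differences
t_crit = 1.6991270265334972   # critical value reported above (alpha = 0.05, df = 29)

type_1_rate = rejection_rate(0, sd, n, t_crit)   # H0 true: rate should be near alpha
power = rejection_rate(5000, sd, n, t_crit)      # true mean difference of 5000
print(f"Estimated type I error rate: {type_1_rate:.3f}")
print(f"Estimated power at 5000:     {power:.3f}")
```

The rejection rate under a true difference of 0 estimates alpha, and the rejection rate under a true difference of 5000 estimates the power; one minus the latter estimates beta.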
