In [1]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns  

%matplotlib inline

import scipy.stats as stats 
import random

# 1. **One Proportion Z-test**

**Business Problem 1**

*A Fox Sports channel reporter claims that Real Madrid will be the next UEFA Champions League champion, given that 76 clubs are currently competing.*

*We randomly sampled 50 seasons of UEFA Champions League winners; Real Madrid won 12 of them.*

*Is there enough evidence at α = 0.05 to support her claim?*


Let $p$ be the proportion of seasons won by Real Madrid (a success is a season in which Real Madrid is the champion).

We will test the null hypothesis

>$H_0:p \leq 0.5$

against the alternate hypothesis

>$H_a:p > 0.5$

In [2]:
uefa_samp = pd.read_csv('UEFA_Champions_League_Sample.csv')
uefa_samp.drop(columns = ['Unnamed: 0'], inplace=True)
uefa_samp.head(2)

Unnamed: 0,Season,UCL champion
0,1983-84,Liverpool (England)
1,1982-83,Hamburger SV (Germany)


In [3]:
uefa_samp['UCL champion'].value_counts().head(1)

Real Madrid (Spain)    12
Name: UCL champion, dtype: int64

In [4]:
def prop_binom_func(n, success):
    '''Print the sample proportions of the two possible outcomes (Bernoulli trial).'''
    prob_success = success / n  # share of trials that were successes
    prob_fail = 1 - prob_success  # share of trials that were failures
    result = f'The probability of success : {round(prob_success, 2)*100}% \nThe probability of failure : {round(prob_fail, 2)*100} %'
    print(result)

In [5]:
def avg_binom_func(n, success):
    '''Print the expected number of successes (np) and failures (n(1-p)) in n trials.
    Both counts should be at least 10 for the normal approximation to the binomial to be reasonable.'''
    avg_success = success  # n * (success / n) simplifies to the observed success count
    avg_fail = n - success  # n * (1 - success / n) simplifies to the observed failure count
    result = f' np = {int(avg_success)} \n n(1-p) = {int(avg_fail)}'
    print(result)

In [6]:
prob =  prop_binom_func(50, 12)
prob

The probability of success : 24.0% 
The probability of failure : 76.0 %


In [7]:
binom_ = avg_binom_func(50, 12)
binom_

 np = 12 
 n(1-p) = 38


### Are the Z-test assumptions satisfied?

- Binomially distributed population - Yes, in each season the team either wins the title or does not.
- Random sampling from the population - Yes, we randomly selected the data.
- Can the binomial distribution be approximated by a normal distribution - Yes. For binary data, convergence under the CLT is slower than usual, so the standard rule of thumb is to check that np and n(1-p) are both greater than or equal to 10.
>$np = 12 \geq 10\\
n(1-p) = 38 \geq 10$
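
One caveat worth sketching: many textbooks check this condition for a hypothesis test using the hypothesized proportion $p_0$ rather than the sample proportion. The helper below does the check under both conventions. `normal_approx_ok` is a hypothetical name that is not part of the original notebook, and the threshold of 10 is the rule of thumb quoted above.

```python
def normal_approx_ok(n, p, threshold=10):
    """Return True when both expected counts n*p and n*(1-p) reach the threshold."""
    expected_success = n * p
    expected_failure = n * (1 - p)
    return expected_success >= threshold and expected_failure >= threshold

n = 50
p_hat = 12 / n   # observed sample proportion
p0 = 0.5         # proportion stated in the null hypothesis

print(normal_approx_ok(n, p_hat))  # expected counts 12 and 38
print(normal_approx_ok(n, p0))     # expected counts 25 and 25
```

In this problem the condition holds either way, so the choice of convention does not change the conclusion.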

### Find the z-test statistic and p-value

In [8]:
from statsmodels.stats.proportion import proportions_ztest

test_stat, p_value = proportions_ztest(12, 50, value = 0.5, alternative = 'larger')
print('The p-value is : {}, the z-test statistic is : {}'.format(str(p_value),str(test_stat)))

The p-value is : 0.9999916405302924, the z-test statistic is : -4.304730160461392


As the p-value is much greater than the significance level of 0.05, we cannot reject the null hypothesis. The observed proportion (0.24) is in fact well below the hypothesized 0.5, which is why the test statistic is negative. Thus, at a 5% significance level, the Fox reporter does not have enough statistical evidence to support her claim that the next UCL champion will be Real Madrid.
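
To make the statistic less of a black box, the calculation can be reproduced by hand. This is a sketch that assumes `proportions_ztest`, with its default settings, builds the standard error from the sample proportion; the reported statistic of about -4.30 is consistent with that. The textbook variant, which uses the null proportion $p_0$ in the standard error, is computed alongside for comparison. All variable names are illustrative, and `math.erfc` stands in for `scipy.stats.norm.sf` so that the block needs only the standard library.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, successes, p0 = 50, 12, 0.5
p_hat = successes / n

# Variant 1: standard error from the sample proportion (assumed statsmodels default)
se_sample = math.sqrt(p_hat * (1 - p_hat) / n)
z_sample = (p_hat - p0) / se_sample
p_sample = upper_tail(z_sample)

# Variant 2: standard error from the null proportion (common textbook formula)
se_null = math.sqrt(p0 * (1 - p0) / n)
z_null = (p_hat - p0) / se_null
p_null = upper_tail(z_null)

print(f'sample-variance version: z = {z_sample:.4f}, p-value = {p_sample:.6f}')
print(f'null-variance version  : z = {z_null:.4f}, p-value = {p_null:.6f}')
```

Both variants give a negative statistic and a p-value close to 1, so the decision not to reject the null hypothesis is the same either way.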

### **Conclusion**

- Based on our test results, we don't have enough statistical evidence to say that Real Madrid will be the next UCL champion.
- Real Madrid is the most successful club in the sample, so the reporter's claim is understandable, but the data do not support it at the 5% significance level.
- Real Madrid won 24% of the sampled seasons, so a win next season is possible but far from guaranteed.

# 2. **Two Proportion Z-test**

**Business Problem 2**

*A bank aims to improve its security process by reducing the number of security breaches. To that end, the bank randomly checks the efficiency of two transactional systems. System 1 suffered 52 breaches out of 300 attempted cyber attacks, and system 2 suffered 20 breaches out of 400 attempted cyber attacks.*

*At a 5% level of significance, do we have enough statistical evidence to conclude that the protection procedures followed in the two systems are different?*

Let $p_1$ and $p_2$ be the proportions of attempted cyber attacks that resulted in a breach in system 1 and system 2, respectively.

We will test the null hypothesis

>$H_0:p_1 =p_2$

against the alternate hypothesis

>$H_a:p_1 \neq p_2$

In [9]:
sys1_prob = prop_binom_func(300, 52)
sys1_prob

The probability of success : 17.0% 
The probability of failure : 83.0 %


In [10]:
sys2_prob = prop_binom_func(400, 20)
sys2_prob

The probability of success : 5.0% 
The probability of failure : 95.0 %


In [11]:
system_1 = avg_binom_func(300, 52)
system_1

 np = 52 
 n(1-p) = 248


In [12]:
system_2 = avg_binom_func(400, 20)
system_2

 np = 20 
 n(1-p) = 380


### Are the Z-test assumptions satisfied?

- Binomially distributed population - Yes, each attempted cyber attack either breaches the system or does not.
- Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.  
- Can the binomial distribution be approximated by a normal distribution - Yes. For binary data, convergence under the CLT is slower than usual, so the standard rule of thumb is to check that np and n(1-p) are both greater than or equal to 10 for each sample.
>$n_1p_1 = 52 \geq 10\\
n_1(1-p_1) = 248 \geq 10 \\
n_2p_2 = 20 \geq 10\\
n_2(1-p_2) = 380 \geq 10 $


### Find the z-test statistic and p-value

In [13]:
from statsmodels.stats.proportion import proportions_ztest

breaches_count = np.array([52, 20])

nobs = np.array([300, 400])

# find the p-value
test_stat, p_value = proportions_ztest(breaches_count, nobs)
print('The p-value is : {}, the z-test statistic is : {}'.format(str(p_value),str(test_stat)))

The p-value is : 1.0615112471575655e-07, the z-test statistic is : 5.3158662128413745


As the p-value is much smaller than the significance level of 0.05, we reject the null hypothesis in favor of the alternative. We have enough statistical evidence that the breach proportions of the two systems differ.
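
The two-sample statistic can also be reproduced by hand. This is a sketch that assumes `proportions_ztest` pools the two samples to estimate a common proportion under the null hypothesis $p_1 = p_2$; the reported statistic of about 5.32 is consistent with that. Variable names are illustrative, and `math.erfc` is used for the two-sided normal tail in place of a `scipy.stats` call.

```python
import math

breaches_1, attempts_1 = 52, 300   # system 1
breaches_2, attempts_2 = 20, 400   # system 2

p1 = breaches_1 / attempts_1
p2 = breaches_2 / attempts_2

# Pooled proportion: the best estimate of the common breach rate if H0 is true
p_pool = (breaches_1 + breaches_2) / (attempts_1 + attempts_2)

# Standard error of the difference under H0
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / attempts_1 + 1 / attempts_2))

z = (p1 - p2) / se_pool

# Two-sided p-value: P(|Z| > |z|) = erfc(|z| / sqrt(2))
p_value_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f'z = {z:.4f}, two-sided p-value = {p_value_two_sided:.3e}')
```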

### **Conclusion**

- At the 5% significance level, we have enough statistical evidence to say that the protection procedures followed by the two systems are different.
- The observed breach rate is about 17% for system 1 and only 5% for system 2, so system 1 appears more vulnerable to cyber attacks than system 2.
- From a practical perspective, the protection procedures of system 2 outperform those of system 1 by an observed difference of about 12 percentage points.
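
To put a range around the observed difference, a 95% confidence interval can be sketched as follows. This is not part of the original analysis: it assumes the standard Wald interval with an unpooled standard error, and the variable names are illustrative. `statistics.NormalDist` from the standard library supplies the critical value.

```python
import math
from statistics import NormalDist

breaches_1, attempts_1 = 52, 300   # system 1
breaches_2, attempts_2 = 20, 400   # system 2

p1 = breaches_1 / attempts_1
p2 = breaches_2 / attempts_2
diff = p1 - p2

# Unpooled standard error: each sample contributes its own variance estimate
se_unpooled = math.sqrt(p1 * (1 - p1) / attempts_1 + p2 * (1 - p2) / attempts_2)

# Critical value for a two-sided 95% interval (about 1.96)
z_crit = NormalDist().inv_cdf(0.975)

ci_lower = diff - z_crit * se_unpooled
ci_upper = diff + z_crit * se_unpooled

print(f'difference = {diff:.4f}, 95% CI = ({ci_lower:.4f}, {ci_upper:.4f})')
```

Under these assumptions the interval lies entirely above zero, which agrees with the rejection of the null hypothesis.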