# Business Analytics & Machine Learning

## Tutorial sheet 1: Statistics


In [3]:
import numpy as np
from scipy import stats

### Exercise 1: Effect of tax on consumption

The following table contains data of 10 individuals’ consumption levels before and after a tax increase,
measured by an index value. High index values correspond to high consumption levels. The rows represent
individuals’ identifiers i, their index values prior to the tax increase a_i, and after the tax increase b_i.


Because we work with sample data, the t-test on n observations has n - 1 degrees of freedom.  
Furthermore, sample data cannot prove that a statement holds for the whole population. Instead, we start from a null hypothesis and check whether the data make it implausible enough to reject it in favour of the alternative hypothesis.

![ex1.jpeg](ex1.jpeg)


In [2]:
a = [27, 31, 23, 35, 26, 27, 26, 18, 22, 21]
b = [40, 36, 43, 34, 25, 41, 32, 29, 21, 36]

# TODO

In [10]:
aa = np.array(a)
bb = np.array(b)

diff = np.subtract(aa, bb)
diff_list = diff.tolist()
diff_list

[-13, -5, -20, 1, 1, -14, -6, -11, 1, -15]

In [60]:
d = diff.mean()
n = len(diff)
s = diff.std(ddof=1)

t = d / (s/np.sqrt(n))
t

-3.373373572263509
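For reference, the cell above evaluates the paired t statistic on the differences d_i = a_i - b_i. The symbols $\bar{d}$ and $s_d$ are notation introduced here for the sample mean and sample standard deviation of the differences (`d` and `s` in the code); the numeric values are rounded.

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{-8.1}{7.593 / \sqrt{10}} \approx -3.373, \qquad \text{df} = n - 1 = 9
$$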

In [42]:
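# Monte Carlo estimate of the one-sided p-value P(T < t).
# The paired t-test has n - 1 degrees of freedom; the estimate varies between runs.
ss = np.random.standard_t(len(diff) - 1, size=100000)
p = np.mean(ss < t)
p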

0.0037

In [5]:
stats.ttest_rel(a, b)

Ttest_relResult(statistic=-3.3733735722635085, pvalue=0.008212883453442434)
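The simulated p-value and the `ttest_rel` p-value above differ because the simulation is one-sided while `ttest_rel` is two-sided by default. A minimal sketch of the exact one-sided value follows; it assumes the alternative "the mean of a - b is below 0", which is inferred from the `ss<t` comparison in the simulation and not stated in the exercise text. The names `t_stat`, `p_exact` and `res` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

a = np.array([27, 31, 23, 35, 26, 27, 26, 18, 22, 21])
b = np.array([40, 36, 43, 34, 25, 41, 32, 29, 21, 36])

diff = a - b
n = len(diff)
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# Exact lower-tail probability P(T < t) under a t distribution with n - 1 df
p_exact = stats.t.cdf(t_stat, df=n - 1)

# Same one-sided test through SciPy (assumed alternative: mean of a - b is below 0)
res = stats.ttest_rel(a, b, alternative="less")

print(f"t = {t_stat:.4f}, exact one-sided p = {p_exact:.5f}, SciPy p = {res.pvalue:.5f}")
```

Since the t distribution is symmetric, this one-sided p-value is half of the two-sided value reported by `stats.ttest_rel(a, b)`.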

### Exercise 2: Masks during COVID-19

In the context of the COVID-19 pandemic, 8 men and 10 women were asked how many hours per day they wear a mask. The following table shows their answers. The hypothesis is "On average, women wear their mask longer per day than men". It can be assumed that the average time people wear their mask is normally distributed.

![ex2.jpeg](ex2.jpeg)


In [32]:
# individuals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
# hours = [4, 2, 3, 5, 7, 2, 7, 3, 5, 2, 2, 1, 5, 3, 1, 3, 2, 3]
# gender = ["f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "m", "m", "m", "m", "m", "m", "m", "m"]

female = np.array([4, 2, 3, 5, 7, 2, 7, 3, 5, 2])
male = np.array([2, 1, 5, 3, 1, 3, 2, 3])

f_mean = female.mean()
m_mean = male.mean()
print(f"female mean: {f_mean}, male mean: {m_mean}")

f_std = female.std(ddof=1)
m_std = male.std(ddof=1)
print(f"female std: {f_std:0.3f}, male std: {m_std:0.3f}")

s_diff = np.sqrt(((f_std**2)/len(female)) + ((m_std**2)/len(male)))
print(f"s_diff: {s_diff:0.3f}")

t = (f_mean - m_mean) / s_diff
print(f"t: {t:0.3f}")

female mean: 4.0, male mean: 2.5
female std: 1.944, male std: 1.309
s_diff: 0.769
t: 1.949


In [65]:
test = stats.ttest_ind(female, male, alternative="greater", equal_var=False)
test
# pvalue < alpha: we reject H0

Ttest_indResult(statistic=1.9494276330540574, pvalue=0.03470640093813483)

In [49]:
stats.t.ppf(0.95, df=16)

1.7458836762762397
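The critical value above uses df = 16, the degrees of freedom of the pooled test (10 + 8 - 2). The test itself was run with `equal_var=False` (Welch's test), whose degrees of freedom follow the Welch-Satterthwaite approximation. The sketch below computes that value; the names `v_f`, `v_m`, `df_welch` and `t_crit` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

female = np.array([4, 2, 3, 5, 7, 2, 7, 3, 5, 2])
male = np.array([2, 1, 5, 3, 1, 3, 2, 3])

# Squared standard errors of the two sample means
v_f = female.var(ddof=1) / len(female)
v_m = male.var(ddof=1) / len(male)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (v_f + v_m) ** 2 / (v_f**2 / (len(female) - 1) + v_m**2 / (len(male) - 1))

t_stat = (female.mean() - male.mean()) / np.sqrt(v_f + v_m)
t_crit = stats.t.ppf(0.95, df=df_welch)   # one-sided critical value for alpha = 0.05
p_value = stats.t.sf(t_stat, df=df_welch)  # upper-tail p-value

print(f"df = {df_welch:.2f}, t = {t_stat:.3f}, critical value = {t_crit:.3f}, p = {p_value:.4f}")
```

The Welch degrees of freedom come out slightly below 16, so the critical value is marginally larger than with df = 16; the decision to reject H0 is the same here.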

### Exercise 3: Research Methods

You are a researcher investigating whether a new study technique improves the average test scores of
students. You have collected data on the test scores of 15 students who used the new technique (group
NT) and 15 students who did not (group OT). The following table contains the test scores, where i is the
index of a student in a specific group. We can assume that the differences are normally distributed.


a) State the null hypothesis (H0) and the alternative hypothesis (H1) for this scenario.


H_0: mean(NT) <= mean(OT)  
H_1: mean(NT) > mean(OT)

(H_1 states the research question; H_0 is its complement.)


b) Explain whether this is a one-sided or two-sided test and justify your choice.


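One-sided: we only ask whether the new technique gives higher scores, so only large positive values of the test statistic ("on the right") count as evidence against H_0. It is a two-sample test because we have 2 independent samples.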


c) Conduct the t-test in Python using the SciPy library to compare the means of the two groups using a significance level of α = 0.05. You can leverage the provided notebook.


In [19]:
scores_nt = np.array([85, 89, 92, 88, 91, 90, 87, 93, 86, 91, 84, 88, 89, 90, 92])
scores_ot = np.array([79, 81, 75, 82, 77, 80, 78, 84, 76, 80, 78, 83, 82, 79, 85])

test = stats.ttest_ind(scores_nt, scores_ot, alternative="greater") 

test

Ttest_indResult(statistic=8.87976253646568, pvalue=6.191452789567058e-10)

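# critical value for the pooled two-sample test: df = n_NT + n_OT - 2 = 28
stats.t.ppf(0.95, df=len(scores_nt) + len(scores_ot) - 2)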


d) Given the test result, would you reject the null hypothesis H0? Explain the result in the context of the research question.


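Reject H_0, because the p-value (about 6.2e-10) is far below alpha = 0.05. The data support the claim that students using the new technique score higher on average.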
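The decision can be checked in two equivalent ways: by comparing the p-value with alpha, or by comparing the test statistic with the critical value. A minimal sketch, assuming the pooled test used in c) with df = n_NT + n_OT - 2; the names `alpha`, `df`, `t_crit`, `reject_by_p` and `reject_by_crit` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

scores_nt = np.array([85, 89, 92, 88, 91, 90, 87, 93, 86, 91, 84, 88, 89, 90, 92])
scores_ot = np.array([79, 81, 75, 82, 77, 80, 78, 84, 76, 80, 78, 83, 82, 79, 85])
alpha = 0.05

# One-sided pooled two-sample t-test (equal_var=True is SciPy's default)
res = stats.ttest_ind(scores_nt, scores_ot, alternative="greater")

# Critical value of the pooled test
df = len(scores_nt) + len(scores_ot) - 2
t_crit = stats.t.ppf(1 - alpha, df=df)

reject_by_p = res.pvalue < alpha          # p-value rule
reject_by_crit = res.statistic > t_crit   # critical-value rule

print(f"t = {res.statistic:.3f}, critical value = {t_crit:.3f}, p = {res.pvalue:.2e}")
print(f"reject H0: {reject_by_p} (p-value rule), {reject_by_crit} (critical-value rule)")
```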


e) Determine and interpret the corresponding 95% confidence interval in Python.


Answer interpretation:  
We are 95% confident that the difference between the population mean scores (NT minus OT) lies between 6.88 and 11.26.  
Because the whole interval lies above 0, the data support the claim that the new technique improves students' test scores.  
![ex3.jpeg](ex3.jpeg)
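The interval can also be computed in Python. The sketch below is one possible approach, not necessarily the one used in the figure: the text does not state which degrees of freedom were used, but the quoted bounds are reproduced when the conservative choice df = min(n_NT, n_OT) - 1 = 14 is assumed. The pooled choice df = 28 is shown for comparison. The helper `ci` and the variable names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

scores_nt = np.array([85, 89, 92, 88, 91, 90, 87, 93, 86, 91, 84, 88, 89, 90, 92])
scores_ot = np.array([79, 81, 75, 82, 77, 80, 78, 84, 76, 80, 78, 83, 82, 79, 85])

diff_means = scores_nt.mean() - scores_ot.mean()

# Standard error of the difference in means
# (with equal group sizes the pooled and unpooled formulas coincide)
se = np.sqrt(scores_nt.var(ddof=1) / len(scores_nt) + scores_ot.var(ddof=1) / len(scores_ot))


def ci(df, conf=0.95):
    """Two-sided confidence interval for the difference in means (hypothetical helper)."""
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=df)
    return diff_means - t_crit * se, diff_means + t_crit * se


# Assumption: conservative df = min(n1, n2) - 1, which matches the bounds quoted above
ci_conservative = ci(df=min(len(scores_nt), len(scores_ot)) - 1)

# Pooled df = n1 + n2 - 2, consistent with the default stats.ttest_ind
ci_pooled = ci(df=len(scores_nt) + len(scores_ot) - 2)

print(f"difference in means: {diff_means:.3f}")
print(f"95% CI, df = 14: ({ci_conservative[0]:.2f}, {ci_conservative[1]:.2f})")
print(f"95% CI, df = 28: ({ci_pooled[0]:.2f}, {ci_pooled[1]:.2f})")
```

With df = 28 the interval is slightly narrower; both intervals lie entirely above 0, so the interpretation is unchanged.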
