# Controlled experiments

In [None]:
require(readxl)
require(dplyr)
require(pwr)
require(effsize)

## Task 1

We are going to replicate the analysis for two hypotheses formulated in the paper:

```
@article{salman2019controlled,
  title={A controlled experiment on time pressure and confirmation bias in functional software testing},
  author={Salman, Iflaah and Turhan, Burak and Vegas, Sira},
  journal={Empirical Software Engineering},
  volume={24},
  number={4},
  pages={1727--1761},
  year={2019},
  publisher={Springer}
}
```

The aim of this research was as follows: analyse the ***functional test cases*** for the purpose of ***examining the effects of time pressure with respect to confirmation bias*** from the point of view of researchers in the context of an experiment run with graduate students (as proxies for novice professionals) in an academic setting.

Before we continue, let's introduce some terms:

***Consistent test case (c):*** A consistent test case tests strictly according to what has been specified in the requirements, i.e. consistency with the specified behaviour. In the context of testing this refers to: 1) the defined program behaviour on a certain input; and 2) the defined behaviour for a specified invalid input. Example: If the specifications state, “… the phone number field does not accept alphabetic characters...”, the test case designed to validate that phone number field does not accept alphabetic characters is considered a consistent test case.

***Inconsistent test case (ic):*** An inconsistent test case tests the scenario or the data input that is not explicitly specified in the requirements. We also consider such test cases that present outside-of-the-box thinking at the tester’s end inconsistent. Example: If the specifications only state, “… the phone number field accepts digits...”, and the application’s behaviour for the other types of input for that field is not specified, then the following test case is considered inconsistent: the phone number field accepts only the + sign from the set of special characters (e.g. to set an international call prefix).

***Proxy Measure of Confirmation Bias***

We mark the functional test cases designed by the participants as either consistent (c) or inconsistent (ic). To detect the bias of participants through a proxy measure, we derive a scalar parameter based on (c) and (ic) test cases designed by the participants and the total count of (all possible) consistent (C) and inconsistent (IC) test cases for the given specification:

$z = \frac{c}{C} - \frac{ic}{IC}$

* if z > 0: the participant has designed relatively more consistent test cases

* if z < 0: the participant has designed relatively more inconsistent test cases

* if z = 0: the participant has designed a relatively equal number of consistent and inconsistent test cases.
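
As a minimal sketch of the proxy measure, the function below computes z for one participant. The function name, argument names and input values are hypothetical and serve only as an illustration; they are not taken from the paper's dataset.

```r
# Proxy measure of confirmation bias: z = c/C - ic/IC
# (function and argument names are illustrative assumptions)
confirmation_bias_z <- function(c_designed, ic_designed, c_total, ic_total) {
  stopifnot(c_total > 0, ic_total > 0)
  c_designed / c_total - ic_designed / ic_total
}

# Hypothetical participant: 6 of 10 possible consistent test cases,
# 2 of 20 possible inconsistent test cases
z_example <- confirmation_bias_z(c_designed = 6, ic_designed = 2,
                                 c_total = 10, ic_total = 20)
print(z_example)  # a positive value means relatively more consistent test cases
```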

***Hypothesis Formulation***

*H1 states that: Testers design more consistent test cases than inconsistent test cases.*

$H_{1A}: \mu(c) > \mu(ic)$

$H_{10}: \mu(c) \le \mu(ic)$

*H3: Testers under time pressure manifest relatively more confirmation bias than testers under no time pressure.*

$H_{3A}: \mu(z_{TP}) > \mu(z_{NTP})$

$H_{30}: \mu(z_{TP}) \le \mu(z_{NTP})$


### Load the dataset

In [None]:
exp_data <- read_excel("experiment1.xlsx", sheet=1)

In [None]:
head(exp_data)

* TP = Time Pressure
* NTP = No Time Pressure

### Descriptive statistics

In [None]:
summary(exp_data %>% select(C_TP, IC_TP, Z_TP, C_NTP, IC_NTP, Z_NTP) )

### H1 states that: Testers design more consistent test cases than inconsistent test cases.

$H_{1A}: \mu(c) > \mu(ic)$

$H_{10}: \mu(c) \le \mu(ic)$

We have to pool the data from both conditions (TP and NTP) because the hypothesis concerns testers in general, regardless of time pressure.

In [None]:
pooled_data <- data.frame(rbind(as.matrix(exp_data %>% select(C_TP, IC_TP)), 
                     as.matrix(exp_data %>% select(C_NTP, IC_NTP))))
colnames(pooled_data) <- c("C", "IC")

In [None]:
dim(pooled_data)
head(pooled_data)

Let's calculate descriptive statistics and visualize the data.

In [None]:
summary(pooled_data)

In [None]:
options(repr.plot.width=5, repr.plot.height=3)
old_par <- par(mfrow=c(1,1), mar=c(3, 10, 2, 0), cex=0.7, omi=c(0,0,0,0), mgp=c(2, 1, 0))
boxplot(pooled_data, 
     main="C vs IC", 
     xlab="Test cases", horizontal=TRUE)
par(old_par)

We would have to decide what to do with the potential outliers. The authors decided not to remove them.

Let's check the normality assumption for both variables.

In [None]:
options(repr.plot.width=5.5, repr.plot.height=3)
old_par <- par(mfrow=c(1,2), mar=c(3, 3, 2, 0), cex=0.7, omi=c(0,0,0,0), mgp=c(2, 1, 0))

qqnorm(pooled_data$C, main="Normal Q-Q for C")
qqline(pooled_data$C)

qqnorm(pooled_data$IC, main="Normal Q-Q for IC")
qqline(pooled_data$IC)

par(old_par)


In [None]:
shapiro.test(pooled_data$C)

In [None]:
shapiro.test(pooled_data$IC)

Conclusion: there is strong evidence against normality.

It's better to use a non-parametric test. Going back to H1, our null hypothesis is that there is no difference in the median number of test cases between C and IC, while the alternative hypothesis states that the median number of test cases is greater for C.

In [None]:
wilcox.test(pooled_data$C, pooled_data$IC, alternative="greater", paired = TRUE )

To calculate the required sample size for the Wilcoxon test, you can use the procedure for the t-test and divide the resulting value by the asymptotic relative efficiency (A.R.E.), which in this case equals 0.955.
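
The A.R.E. adjustment can be sketched as follows. The effect size and target power are hypothetical values chosen for illustration (they are not results from the experiment), and base R's `power.t.test` is used here in place of `pwr.t.test` only to keep the sketch free of extra packages.

```r
# Hypothetical planning inputs (assumptions, not taken from the experiment)
d_assumed    <- 0.5   # assumed standardized effect size
target_power <- 0.8   # assumed desired power

# Number of pairs needed for a one-sided paired t-test
n_t <- power.t.test(delta = d_assumed, sd = 1, sig.level = 0.05,
                    power = target_power, type = "paired",
                    alternative = "one.sided")$n

# A.R.E. of the Wilcoxon test relative to the t-test under normality: 3/pi
are_wilcoxon <- 3 / pi

# Inflate the t-test sample size and round up to whole pairs
n_wilcoxon <- ceiling(n_t / are_wilcoxon)
print(c(n_t = n_t, n_wilcoxon = n_wilcoxon))
```

Because the A.R.E. is below 1, the Wilcoxon test needs slightly more observations than the t-test to reach the same power when the data are in fact normal.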

In [None]:
# Cohen's d effect size for the difference between C and IC
cd <- as.numeric(cohen.d(pooled_data$C, 
                         pooled_data$IC, na.rm=TRUE)$estimate)
cd

In [None]:
# power=NULL tells pwr.t.test to solve for the achieved power
pwr.t.test( n=nrow(pooled_data) , sig.level=0.05 , 
           power=NULL , d=cd , alternative='greater' ,
           type='paired')

Conclusion: **Testers are more likely to produce consistent tests**

### H3: Testers under time pressure manifest relatively more confirmation bias than testers under no time pressure.

$H_{3A}: \mu(z_{TP}) > \mu(z_{NTP})$

$H_{30}: \mu(z_{TP}) \le \mu(z_{NTP})$

In [None]:
z_data <- exp_data %>% select(Z_TP, Z_NTP)

Let's calculate some descriptive statistics and visualize the data.

In [None]:
summary(z_data)

In [None]:
options(repr.plot.width=5, repr.plot.height=3)
old_par <- par(mfrow=c(1,1), mar=c(3, 10, 2, 0), cex=0.7, omi=c(0,0,0,0), mgp=c(2, 1, 0))
boxplot(z_data, 
     main="TP vs NTP", 
     xlab="Z", horizontal=TRUE,
     names=c("TP", "NTP"))
par(old_par)

We would have to decide what to do with the potential outliers. The authors decided not to remove them.

Let's check the normality assumption for both variables.

In [None]:
options(repr.plot.width=5.5, repr.plot.height=3)
old_par <- par(mfrow=c(1,2), mar=c(3, 3, 2, 0), cex=0.7, omi=c(0,0,0,0), mgp=c(2, 1, 0))

qqnorm(z_data$Z_TP, main="Normal Q-Q for TP")
qqline(z_data$Z_TP)

qqnorm(z_data$Z_NTP, main="Normal Q-Q for NTP")
qqline(z_data$Z_NTP)

par(old_par)


In [None]:
shapiro.test(z_data$Z_TP)

In [None]:
shapiro.test(z_data$Z_NTP)

Conclusion: there is no strong evidence against normality.

Since the normality assumption seems valid, we can perform a parametric test. Looking at H3, we can define two hypotheses: the null hypothesis states that the mean z is the same for the TP and NTP groups, while the alternative hypothesis states that the mean z is greater for the TP group (time pressure increases confirmation bias).
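
As a reminder of what the test computes: with `paired = FALSE` and the default `var.equal = FALSE`, R's `t.test` performs Welch's two-sample t-test. In the notation used here (the symbols $\bar{z}$, $s^2$ and $n$ for the group means, sample variances and group sizes are introduced only for this illustration), the statistic is:

$$
t = \frac{\bar{z}_{TP} - \bar{z}_{NTP}}{\sqrt{\dfrac{s_{TP}^2}{n_{TP}} + \dfrac{s_{NTP}^2}{n_{NTP}}}}
$$

With `alternative = "greater"`, large positive values of $t$ count as evidence against the null hypothesis.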

In [None]:
t.test(z_data$Z_TP, z_data$Z_NTP, alternative="greater", paired = FALSE )

Let's calculate the effect size and the power of the test. Because a t-test was used here, the procedure for the t-test applies directly and no A.R.E. correction is needed.

In [None]:
# Cohen's d effect size for the difference between Z_TP and Z_NTP
cd <- as.numeric(cohen.d(z_data$Z_TP, 
                         z_data$Z_NTP, na.rm=TRUE)$estimate)
cd

In [None]:
# power=NULL tells pwr.t.test to solve for the achieved power
pwr.t.test( n=nrow(z_data) , sig.level=0.05 , 
           power=NULL , d=cd , alternative='greater' ,
           type='two.sample')

Conclusion: ***It is unlikely that time pressure affects confirmation bias***

## Task 2

Your task will be to replicate one of the analyses described in the paper:

```
@article{mackowiak2018some,
  title={On some end-user programming constructs and their understandability},
  author={Ma{\'c}kowiak, Micha{\l} and Nawrocki, Jerzy and Ochodek, Miros{\l}aw},
  journal={Journal of Systems and Software},
  volume={142},
  pages={206--222},
  year={2018},
  publisher={Elsevier}
}
```

The experiments in the paper evaluate some constructs that could be useful from the perspective of end-user programmers (end-user programming as a research area tackles the problem of how to help domain experts who are not professional programmers write code to support their work).

The particular experiment we will focus on investigated whether writing programs using the single assignment paradigm makes them easier to understand.

The understandability of a set of programming constructs, when evaluated by a group of people, can be described with the following performance indicators (we call them the FACT indicators of understandability):
* F - first attempt failure ratio: the percentage of people (participants) who fail at the first attempt to predict the results of a computation defined by means of a given set of programming constructs;
* A - attempt number: the average number of attempts needed to predict the correct results of computations defined by code containing the analysed programming constructs (if A = 1 then F = 0, i.e. all participants from the group predict the correct results at the first attempt);
* C - cancellation ratio: the percentage of participants who are unable to complete a given assignment (i.e. correctly predict the result, regardless of the number of attempts) in a given amount of time;
* T - prediction time: the average time used by participants to predict the results of a computation defined by code composed of the analysed programming constructs.

**The task of the participants was to predict the output generated by a given function written in two different ways: control group = standard way, experimental group = using a single assignment paradigm.**

**We will limit our analysis here to the prediction time - T.**


### Load data

Load the data from the experiment2.xls file into a data frame and display the dataset.

### Separate data frames for each group

As you can see, the data frame contains data for both the control and experimental groups.

Create two separate data frames by filtering the main one: 
* experiment - should contain observations with the Treatment "Star"
* control - should contain observations with the Treatment "Standard"
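
One possible way to do the split is sketched below on a small made-up data frame. Only the column names and the Treatment labels come from the task description; the values are invented, and base R subsetting is used where `dplyr::filter` would work equally well.

```r
# Toy data standing in for the real dataset (all values are made up)
toy <- data.frame(
  Treatment = c("Star", "Standard", "Star", "Standard", "Standard"),
  Time      = c(120, 150, 95, 200, 180),
  IsDone    = c("True", "True", "False", "True", "False"),
  IsCorrect = c("True", "True", "False", "False", "False"),
  stringsAsFactors = FALSE
)

# Keep the rows of each treatment in a separate data frame
experiment <- toy[toy$Treatment == "Star", ]
control    <- toy[toy$Treatment == "Standard", ]

print(experiment)
print(control)
```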


### How many participants were in both groups?

Find out how many participants were in each of the groups.

### T - Prediction Time

We will replicate the analysis for the prediction time. The null hypothesis is that there is no difference in prediction time (T) between the groups. The alternative hypothesis states that the "average" time for the experimental group is lower than for the control group.

Since not all the participants completed the task, we have to filter the samples:
* **experiment** - reject all non-completed tasks for the experimental group (you should preserve only the rows with IsDone=="True" and IsCorrect=="True")
* **control** - since the alternative hypothesis treats the programs using the single assignment paradigm as superior, it is acceptable to favor the control group. Therefore, it was decided to accept all incomplete solutions, assuming that the participants would have finished in time T+1. Create a new column **TimeCorrected** in the control group data frame that equals `Time` if `IsDone=="True"` and `IsCorrect=="True"`, and `Time+1` otherwise. **Hint:** *if you want to use the mutate function from the dplyr package, you can use the function ifelse(condition, exp1, exp2)*.
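
The two filtering rules can be sketched on made-up data as follows. The data frames `experiment_toy` and `control_toy` and the helper `is_completed` are hypothetical names used only for this illustration; the same logic can be written with `dplyr::filter` and `dplyr::mutate`.

```r
# Made-up stand-ins for the two group data frames
experiment_toy <- data.frame(
  Time = c(120, 95), IsDone = c("True", "False"),
  IsCorrect = c("True", "False"), stringsAsFactors = FALSE
)
control_toy <- data.frame(
  Time = c(150, 200, 180), IsDone = c("True", "True", "False"),
  IsCorrect = c("True", "False", "False"), stringsAsFactors = FALSE
)

# A task counts as completed only if it is both done and correct
is_completed <- function(df) df$IsDone == "True" & df$IsCorrect == "True"

# Experimental group: drop every non-completed task
experiment_toy <- experiment_toy[is_completed(experiment_toy), ]

# Control group: keep all rows, but add 1 to the time of non-completed tasks
control_toy$TimeCorrected <- ifelse(is_completed(control_toy),
                                    control_toy$Time,
                                    control_toy$Time + 1)
print(control_toy)
```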

Calculate descriptive statistics for the Prediction Time - the columns Time and TimeCorrected in the data frames.

Plot a boxplot comparing the distributions of T in both groups.

Let's check the normality assumption for both variables - check the QQ plots and conduct Shapiro-Wilk tests.

**Conclusion:** What is your conclusion about the normality assumption? What kind of test will you use to investigate the hypotheses?

Perform the statistical test of your choice to investigate the hypotheses.
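
A rough sketch of how the choice of test could be driven by the normality check is shown below. The two samples are simulated, so the outcome says nothing about the real data, and using a fixed 0.05 threshold on the Shapiro-Wilk p-values is a simplification; in practice the QQ plots should inform the decision as well.

```r
set.seed(1)
# Simulated, skewed prediction times (made-up data, not from the paper)
time_experiment <- rexp(30, rate = 1 / 100)
time_control    <- rexp(30, rate = 1 / 130)

# Simplified rule: treat both samples as normal only if neither
# Shapiro-Wilk test rejects normality at the 0.05 level
normal_enough <- shapiro.test(time_experiment)$p.value > 0.05 &&
                 shapiro.test(time_control)$p.value > 0.05

# One-sided test: is the experimental group faster than the control group?
result <- if (normal_enough) {
  t.test(time_experiment, time_control, alternative = "less")
} else {
  wilcox.test(time_experiment, time_control, alternative = "less")
}
print(result$method)
print(result$p.value)
```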

**Conclusion:** What is your final conclusion?