# One between-subject factor



In [None]:
import pandas as pd
import pingouin as pg
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_columns', 20)
df = pg.read_dataset('mixed_anova.csv')
pg.pairwise_ttests(dv='Scores', between='Group', data=df).round(3)

# One within-subject factor



In [None]:
post_hocs = pg.pairwise_ttests(dv='Scores', within='Time', subject='Subject', data=df)
post_hocs.round(3)

# Non-parametric pairwise paired test (wilcoxon)



In [None]:
pg.pairwise_ttests(dv='Scores', within='Time', subject='Subject',
                   data=df, parametric=False).round(3)

# Mixed design (within and between) with bonferroni-corrected p-values



In [None]:
posthocs = pg.pairwise_ttests(dv='Scores', within='Time', subject='Subject',
                              between='Group', padjust='bonf', data=df)
posthocs.round(3)

In [None]:
import pandas as pd
import numpy as np
import pingouin as pg
import seaborn as sns

# Let's assume that we have a balanced design with 30 students in each group
n = 30
months = ['August', 'January', 'June']

# Generate random data
np.random.seed(1234)
control = np.random.normal(5.5, size=len(months) * n)
meditation = np.r_[np.random.normal(5.4, size=n),
                   np.random.normal(5.8, size=n),
                   np.random.normal(6.4, size=n)]

# Create a dataframe
df = pd.DataFrame({'Scores': np.r_[control, meditation],
                   'Time': np.r_[np.repeat(months, n), np.repeat(months, n)],
                   'Group': np.repeat(['Control', 'Meditation'], len(months) * n),
                   'Subject': np.r_[np.tile(np.arange(n), 3),
                                    np.tile(np.arange(n, n + n), 3)]})
# DESCRIPTIVE STATS
pg.print_table(df.head())

# PLOT
sns.set()
sns.pointplot(data=df, x='Time', y='Scores', hue='Group', dodge=True,
              markers=['o', 's'], capsize=.1, errwidth=1, palette='colorblind')

print(df.groupby(['Time', 'Group']).agg(['mean', 'std']))

# ANOVA
aov = pg.mixed_anova(dv='Scores', within='Time', between='Group',
                     subject='Subject', data=df)
pg.print_table(aov)

# POST-HOC TESTS
posthocs = pg.pairwise_ttests(dv='Scores', within='Time', between='Group',
                              subject='Subject', data=df)
pg.print_table(posthocs)


In [3]:
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats.weightstats import ttost_paired
from scipy import stats

In [None]:
data = pd.read_csv('../input/ctr-a-b/ctr_a_b.csv')
data

The table above shows the A/B test results. It has 447,602 rows and 4 columns: "userid", which identifies each user; "dt", the date on which the experiment occurred; "groupid", which is 0 for Design A and 1 for Design B; and "ctr", the metric of interest.

The dataset info shows one null value in the ctr column. Since it is only a single value, we can simply drop that row. We can also rename the groupid column to design and replace the values 0 and 1 with a and b, to make the data easier to read.



In [None]:
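# Drop the row with the null ctr (assigning a dropna() result back to the column would keep the NaN)
data = data.dropna(subset=['ctr'])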
data['dt'] = pd.to_datetime(data['dt'])
data['groupid'] = data['groupid'].replace([0, 1], ['a', 'b'])
data = data.rename(columns = {'groupid':'design'})

data.info()

In [None]:
data['dt'].unique()

In [None]:
data['design'].unique()

In [None]:
len(data['userid'].unique())


The "dt" column has only 10 unique values, one for each date of the experiment, and the "design" column (formerly "groupid") has only 2. We can also see that 59,984 users took part in the test. To run the analysis, let's group the dataset by date and design, taking the average ctr as the value.



In [None]:
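# Select ctr explicitly; the first positional argument of mean() is numeric_only, not a column name
group = data.groupby(['dt', 'design'])[['ctr']].mean()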
group

## T-Test
A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

The t-test is a parametric test of difference, meaning that it makes the same assumptions about your data as other parametric tests. The t-test assumes your data:

- are independent
- are (approximately) normally distributed
- have a similar amount of variance within each group being compared (a.k.a. homogeneity of variance)

When choosing a t-test, you will need to consider two things: whether the groups being compared come from a single population or two different populations, and whether you want to test the difference in a specific direction.
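The normality and equal-variance assumptions listed above can be checked before running the test. The following is a minimal sketch on synthetic data; the arrays `group_a` and `group_b` are hypothetical and stand in for the two samples being compared.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumption: purely illustrative values)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.10, scale=0.01, size=30)
group_b = rng.normal(loc=0.11, scale=0.01, size=30)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)

# Levene: the null hypothesis is that the groups have equal variances
levene = stats.levene(group_a, group_b)

print(f"Shapiro-Wilk p-values: A={shapiro_a.pvalue:.3f}, B={shapiro_b.pvalue:.3f}")
print(f"Levene p-value: {levene.pvalue:.3f}")
```

A p-value above the chosen threshold means the test found no evidence against the assumption. It does not prove that the assumption holds.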

## One-sample, two-sample, or paired t-test?

If the groups come from a single population (e.g. measuring before and after an experimental treatment), perform a paired t-test.

If the groups come from two different populations (e.g. two different species, or people from two separate cities), perform a two-sample t-test (a.k.a. independent t-test).

If there is one group being compared against a standard value (e.g. comparing the acidity of a liquid to a neutral pH of 7), perform a one-sample t-test.
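As a rough illustration, the three cases above map onto three SciPy functions. The data below are synthetic and the variable names (`before`, `after`, `other`) are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(5.0, 1.0, size=20)          # same subjects, first measurement
after = before + rng.normal(0.5, 0.5, size=20)  # same subjects, second measurement
other = rng.normal(5.5, 1.0, size=25)           # an unrelated group

# Paired t-test: two measurements on the same units
paired = stats.ttest_rel(before, after)

# Two-sample (independent) t-test: two separate groups
independent = stats.ttest_ind(before, other)

# One-sample t-test: one group against a reference value
one_sample = stats.ttest_1samp(before, popmean=5.0)

print(paired.pvalue, independent.pvalue, one_sample.pvalue)
```

A paired t-test is equivalent to a one-sample t-test on the per-unit differences against zero, which is why the pairing matters.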

## One-tailed or two-tailed t-test?

If you only care whether the two populations are different from one another, perform a two-tailed t-test.

If you want to know whether one population mean is greater than or less than the other, perform a one-tailed t-test.
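In SciPy, the direction of the test is chosen with the `alternative` argument (available in SciPy 1.6 and later). This is a small sketch on synthetic paired data; `a` and `b` are hypothetical arrays.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(5.0, 1.0, size=15)
b = a + rng.normal(0.5, 0.2, size=15)  # b is shifted upwards relative to a

# Two-tailed: is mean(b - a) different from zero?
two_sided = stats.ttest_rel(b, a, alternative='two-sided')

# One-tailed: is mean(b - a) greater than zero?
greater = stats.ttest_rel(b, a, alternative='greater')

# One-tailed in the other direction: is mean(b - a) less than zero?
less = stats.ttest_rel(b, a, alternative='less')

print(two_sided.pvalue, greater.pvalue, less.pvalue)
```

When the observed difference lies in the hypothesised direction, the one-tailed p-value is half the two-tailed one, and the two one-tailed p-values sum to 1.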

Since we are interested in the difference between two variables (Design A and Design B) for the same subject (CTR), we'll perform a paired t-test. There are two ways to run a paired t-test in Python here: SciPy and Pingouin. But first, we have to subset the data into two groups by design in order to run the t-test.
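The subsetting cell is not shown here, so the following is only a sketch of one way it could be done. It uses a hypothetical toy frame with the same column names (`dt`, `design`, `ctr`). Pivoting the daily means so that each design becomes a column keeps the two series aligned by date, which is what a paired test requires.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the A/B dataset
toy = pd.DataFrame({
    'dt': pd.to_datetime(['2023-01-01', '2023-01-01',
                          '2023-01-02', '2023-01-02']),
    'design': ['a', 'b', 'a', 'b'],
    'ctr': [0.10, 0.12, 0.11, 0.14],
})

# Daily mean ctr per design, with one column per design
daily = toy.groupby(['dt', 'design'])['ctr'].mean().unstack('design')

# One series per design, both indexed by date
design_a = daily['a']
design_b = daily['b']

print(daily)
```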

In [None]:
sci = stats.ttest_rel(design_a, design_b)
sci

## Pingouin T-Test
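Before running the Pingouin t-test, we run Levene's test to check the homogeneity of variances between the two subsets, which determines the right correction to choose. Note that in Pingouin the `correction` argument (Welch's correction) applies to unpaired two-sample t-tests, so it has no effect on a paired test.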

In [None]:
stats.levene(design_a, design_b)


The results above show the test statistic and the p-value of Levene's test: the statistic is 2.46 and the p-value is 0.13. Since the p-value is above the 0.05 threshold, we fail to reject the hypothesis of equal variances, so there is no evidence that the variance in ctr differs between the two designs.



In [None]:
ping = pg.ttest(design_a, design_b, paired = True, correction = False)
ping

## Analysis of T-Test for A/B Testing
We have generated t-test results with two different methods. Both give the same t-score, -895.20, and the same p-value, 1.38 x 10^-23. The p-value is far below the threshold, so we can reject the null hypothesis. Both calls use the default two-sided alternative; the negative t-score (Design A minus Design B) shows that the difference lies in the direction of Ha.

Remember our hypothesis:

H0 : The average CTR of Design B is less than or equal to the average CTR of Design A, μB ≤ μA

Ha : The average CTR of Design B is more than the average CTR of Design A, μB > μA

Since we can reject the null hypothesis, the data support the alternative hypothesis. So our marketing manager is right: Design B can increase the CTR.

To complement the p-value, we can look at the Cohen's d column in the Pingouin table. Cohen's d is one of several measures of effect size. Effect size tells you how meaningful the relationship between variables or the difference between groups is. It indicates the practical significance of a research outcome.

Cohen’s d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means.
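To make the definition concrete, here is a minimal sketch of the classic pooled-standard-deviation form of Cohen's d. The helper `cohens_d` is a hypothetical name written for this illustration. It is not Pingouin's implementation, which may use a different formula for paired samples.

```python
import numpy as np

def cohens_d(x, y):
    """Difference between two means in pooled standard deviation units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # Pooled variance, weighting each sample variance by its degrees of freedom
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

d = cohens_d([2, 4, 6, 8], [1, 3, 5, 7])
print(round(d, 3))  # 0.387
```

Here the means differ by 1 and the pooled standard deviation is about 2.58, so the two means lie roughly 0.39 standard deviations apart.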

Effect sizes can be categorized as small, medium, or large according to Cohen's criteria, which differ depending on the effect size measure used. For Cohen's d, values of about 0.2, 0.5, and 0.8 are conventionally considered small, medium, and large. Cohen's d can be negative or positive; the sign only indicates the direction of the difference, while its magnitude ranges from 0 to infinity. In general, the greater the magnitude of Cohen's d, the larger the effect size.

Here Cohen's d is 434.07, far above the conventional cut-off for a large effect. This means that the change from Design A to Design B is practically significant as well as statistically significant.

## Conclusion
We ran an A/B test to compare the mean CTR of two different designs. The t-test gives a p-value below the threshold, a large Cohen's d, and a high Bayes factor, so the team can change the ad pop-up layout from Design A to Design B to get more people to click the ad.