# IN269 Kecerdasan Bisnis
## Session 06: A/B Testing

- In the previous session, we discussed the scientific practice of observing two groups and making quantitative judgments about how they relate to each other. 
- But scientists (including data scientists) do more than just observe preexisting differences. 
- A huge part of science consists of creating differences experimentally and then drawing conclusions. 

## Table of Contents
1. Discuss the need for experimentation and our motivations for testing. 
2. Cover how to properly set up experiments, including the need for randomization. 
3. Detail the steps of A/B testing and the champion/challenger framework. 
4. Describe nuances like the exploration/exploitation trade-off, as well as ethical concerns.

## The Need for Experimentation
- Imagine that you run a computer company and maintain email marketing lists that your customers can choose to subscribe to. 
- One email list is designed for customers who are interested in your desktop computers, and the other email list is for customers interested in your laptops. 

In [1]:
import pandas as pd
desktop=pd.read_csv('desktop.csv')
laptop=pd.read_csv('laptop.csv')

You can run `print(desktop.head())` and `print(laptop.head())` to see the first five rows of each dataset.

In [2]:
print(desktop.shape)
print(desktop.head(), "\n")
print(laptop.shape)
print(laptop.head())

(30, 4)
   userid  spending  age  visits
0       1      1250   31     126
1       2       900   27       5
2       3         0   30     459
3       4      2890   22      18
4       5      1460   38      20 

(30, 4)
   userid  spending  age  visits
0      31      1499   32      12
1      32       799   23      40
2      33      1200   45      22
3      34         0   59     126
4      35      1350   17      85


In the previous session, you learned how to use simple t-tests to detect differences between our datasets, as follows:

In [3]:
import scipy.stats
print(scipy.stats.ttest_ind(desktop['spending'],laptop['spending']))
print(scipy.stats.ttest_ind(desktop['age'],laptop['age']))
print(scipy.stats.ttest_ind(desktop['visits'],laptop['visits']))

TtestResult(statistic=np.float64(-2.109853741030508), pvalue=np.float64(0.03919630411621095), df=np.float64(58.0))
TtestResult(statistic=np.float64(-0.7101437106800108), pvalue=np.float64(0.4804606394128761), df=np.float64(58.0))
TtestResult(statistic=np.float64(0.20626752311535543), pvalue=np.float64(0.8373043059847984), df=np.float64(58.0))


- After determining that desktop subscribers are different from laptop subscribers, we can conclude that <u>we should send them different marketing emails</u>. 
- However, this fact alone is not enough to completely guide our marketing strategy. 
- Just knowing that our desktop subscriber group spends a little less than the laptop subscriber group doesn’t tell us whether crafting long messages or short ones would lead to better sales, or whether using red text or blue text would get us more clicks, or whether informal or formal language would improve customer loyalty most. 

- In some cases, past research published in academic marketing journals can give us hints about what will work best.
- But even when relevant research exists, every company has its own unique set of customers that may not respond to marketing in exactly the same way that past research indicates.

- We <u>need a way to generate new data that’s never been collected or published before</u>, so <u>we can use that data to answer new questions about the new situations that we regularly face</u>. 
- Only if we can generate this kind of new data can we reliably learn about what will work best in our efforts to grow our business with our particular set of unique customers. 

> **A/B testing** uses experiments to help businesses determine which practices will give them the greatest chances of success. 
- It consists of a few steps: 
    - experimental design, 
    - random assignment into treatment and control groups, 
    - careful measurement of outcomes, and finally, 
    - statistical comparison of outcomes between groups.



- The way we’ll do statistical comparisons will be familiar: 
    - we’ll use the t-tests introduced in the previous session. 
    - While t-tests are a part of the A/B testing process, they are not the only part. 
    - A/B testing is a process for collecting new data, which can then be analyzed using tests like the t-test. 

## Running Experiments to Test New Hypotheses
- Let’s consider just one hypothesis about our customers that might interest us. 
- Suppose we’re interested in studying whether changing the color of text in our marketing emails from black to blue will increase the revenue we earn as a result of the emails. 

Let’s express two hypotheses related to this:           
   
**Hypothesis 0** : Changing the color of text in our emails from black to blue will have no effect on revenues.   
**Hypothesis 1** : Changing the color of text in our emails from black to blue will lead to a change in revenues (either an increase or a decrease).

- Here, our datasets do not include information about blue-text and black-text emails. 
- So, extra steps are required before we perform hypothesis testing: 
    - designing an experiment, 
    - running an experiment, and 
    - collecting data related to the experiment’s results.

- To do the hypothesis test we just outlined, we’ll need data from two groups: **a group that has received a blue-text email** and **a group that has received a black-text email**.     
- We’ll need to know <u>how much revenue we received</u> from each member of the group that received the blue-text email and how much revenue we received from each member of the group that received the black-text email.   
- After we have that, <u>we can do a simple $t$-test to determine whether the revenue collected from the blue-text group differed significantly from the revenue collected from the black-text group</u>. 

- We need to split our population of interest into <u>two subgroups</u> and **send a blue-text email to one subgroup and a black-text email to our other subgroup so we can compare revenues from each group**. 
- For now, let’s focus on desktop subscribers only and split our desktop dataframe into two subgroups.



- We can split a group into two subgroups in many ways. 
- One possible choice is to split our dataset into **a group of younger people and a group of older people**. 
- We might split our data this way because we believe that younger people and older people might be interested in different products, or we might do it this way just because age is one of the few variables that appears in our data. 
- Later, we’ll see that this way of splitting our group into subgroups will lead to problems in our analysis, and we’ll discuss better ways to create subgroups. 
- But since this method of splitting into subgroups is simple and easy, let’s start by trying it to see what happens:

In [4]:
import numpy as np
medianage=np.median(desktop['age'])
groupa=desktop.loc[desktop['age']<=medianage,:]
groupb=desktop.loc[desktop['age']>medianage,:]

- After creating `groupa` and `groupb`, you can send these two dataframes to your marketing team members and instruct them to send different emails to each group. 
- Suppose they send the **black-text** email to `groupa` and the **blue-text** email to `groupb`. 
- In every email, they include links to new products they want to sell, and by tracking who clicks which links and their purchases, the team members can measure the total revenue earned from each individual email recipient.

In [5]:
emailresults1=pd.read_csv('emailresults1.csv')

print(emailresults1.head())

   userid  revenue
0       1      100
1       2        0
2       3       50
3       4      550
4       5      175


- It will be useful to have this new revenue information in the same dataframe as our other information about each user. 
- Let’s join the datasets:

In [6]:
groupa_withrevenue=groupa.merge(emailresults1,on='userid')
groupb_withrevenue=groupb.merge(emailresults1,on='userid')
# on='userid' matches each row of emailresults1 to the row of groupa (or groupb)
# that has the same userid, so each user's revenue sits next to their other data.

In [7]:
print("Median age =", medianage, "\n")
print(groupa.shape)
print(groupa_withrevenue.head(), "\n")
print(groupb.shape)
print(groupb_withrevenue.head(), "\n")

Median age = 32.0 

(15, 4)
   userid  spending  age  visits  revenue
0       1      1250   31     126      100
1       2       900   27       5        0
2       3         0   30     459       50
3       4      2890   22      18      550
4       7       900   18      61       40 

(15, 4)
   userid  spending  age  visits  revenue
0       5      1460   38      20      175
1       6         0   60     100        0
2       8      1000   51     115      220
3       9       150   41     610      100
4      10      3400   48     154      150 



- After preparing our data, it’s simple to perform a t-test to check whether our groups are different. 
- We can do it in one line, as follows:

In [8]:
print(scipy.stats.ttest_ind(groupa_withrevenue['revenue'],groupb_withrevenue['revenue']))

TtestResult(statistic=np.float64(-2.186454851070545), pvalue=np.float64(0.03730073920038287), df=np.float64(28.0))


- The important part of this output is the `pvalue` variable, which tells us the p-value of our test. 
- We can see that the result says that `p = 0.037`, approximately. 
- Since `p < 0.05`, we can conclude that this is a statistically significant difference. 

We can check the size of the difference:

In [9]:
print(np.mean(groupb_withrevenue['revenue'])-np.mean(groupa_withrevenue['revenue']))

125.0


- The output is 125.0. 
- The average `groupb` customer has outspent the average `groupa` customer by \$125. 
- This difference is statistically significant, so we reject Hypothesis 0 in favor of Hypothesis 1, concluding (for now, at least) that the **blue text in marketing emails leads to about \$125 more in revenue per user than black text**.

- What we have just done was <u>an experiment</u>. 
- We split a population into two groups, performed different actions on each group, and compared the results. 
- In the context of business, such an experiment is often called an **A/B test**. 
- The A/B part of the name refers to the two groups, Group A and Group B, whose different responses to emails we compared. 

Every A/B test follows the same pattern we went through here: 
- a split into two groups, 
- application of a different treatment (for example, sending different emails) to each group, and 
- statistical analysis to compare the groups’ outcomes and draw conclusions about which treatment is better.

Now that we’ve successfully conducted an A/B test, we may want to conclude that the effect of blue text is to increase spending by \$125.  
However, something is wrong with the A/B test we ran: 

> _it’s confounded._ 

<center>
<img src="images/confounded.png" width="1000"/>
</center>    

- We can see the important features of Group A and Group B. 
- Our $t$-test comparing spending found that their spending levels were significantly different. 
- We want an explanation for why they’re different, and any explanation of different outcomes will have to rely on the differences listed in Table 4-1. 
- We want to be able to conclude that the difference in spending can be explained by the difference in the text color. 
- However, that difference coexists with another difference: **age**.

- We can’t be certain that the difference in spending levels is due to text color rather than age. 
- For example, perhaps no one even noticed the text difference, but older people tend to be wealthier and more eager to buy your products than young people. 
- If so, our A/B test didn’t test for the effect of blue text, but rather for **the effect of age or wealth**. 
- We intended to study only the effect of text color in this A/B test, and now we don’t know whether we truly studied that or whether we studied age, wealth, or something else. It would be better if our A/B test had a simpler, non-confounded design like the one illustrated in Table 4-2.

<center>
<img src="images/non-confounded.png" width="1000"/>
</center>    

- Table 4-2 imagines that we had split the users into hypothetical groups called C and D, which are identical in all personal characteristics and differ only in the text color of the emails they received. 
- In this hypothetical scenario, the spending difference can be explained only by the different text colors sent to each group because that’s the only difference between them. 
- We should have split our groups in a way that ensured that the only differences between groups were in our experimental treatment, not in the group members’ preexisting characteristics. 
- If we had done so, we would have avoided having a confounded experiment.
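
One way to spot this kind of confounding before any email is sent is to compare the groups' preexisting characteristics. The following is a minimal sketch on synthetic data (the `users` dataframe, its values, and the seed are assumptions, not the course's `desktop.csv`). It shows that a median-age split produces groups that differ sharply in age before any treatment is applied.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Synthetic stand-in for the desktop subscriber list (values are made up).
rng = np.random.default_rng(0)
users = pd.DataFrame({
    'userid': np.arange(1, 201),
    'age': rng.integers(18, 80, size=200),
})

# Split at the median age, as in the confounded experiment.
medianage = np.median(users['age'])
young = users.loc[users['age'] <= medianage, :]
old = users.loc[users['age'] > medianage, :]

# A t-test on age itself: a tiny p-value warns that the groups differ
# in something other than the treatment we plan to apply.
age_check = scipy.stats.ttest_ind(young['age'], old['age'])
print(young['age'].mean(), old['age'].mean())
print(age_check.pvalue)
```

Running the same check on `spending` and `visits` would be a natural extension. A randomized split should usually show no significant difference on any of these columns.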

### Understanding the Math of A/B Testing
- We can also express these notions mathematically. 
- We can use the common statistical notation $E()$ to refer to the expected value. 
- So $E(\text{A's revenue with blk text})$ will mean **the expected value of revenue we would earn by sending a black-text email to Group A**.

<center>
    <img src="images/expected_value.png" width=900/>
</center>    

<center>
    <img src="images/example-expected-value.png" width=700/>
</center>    

<center>
    <img src="images/try-it-expected.png" width=1000/>
</center>    

- We can write two simple equations that describe the relationship between the revenue we expect to earn from black text, the effect of our experiment, and the revenue we expect to earn from blue text:    
    
$$
\begin{aligned}
    E(\text{A's revenue with blk text}) + E(\text{effect of changing blk} \rightarrow \text{blue on A}) &= E(\text{A's revenue with blue text}) \\
    E(\text{B's revenue with blk text}) + E(\text{effect of changing blk} \rightarrow \text{blue on B}) &= E(\text{B's revenue with blue text})
\end{aligned}
$$

To decide whether to _reject Hypothesis 0_, we need to solve for the effect sizes: 
- $E(\text{effect of changing blk} \rightarrow \text{blue on A})$ and 
- $E(\text{effect of changing blk} \rightarrow \text{blue on B})$. 

- If either of these effect sizes is different from 0, we should reject Hypothesis 0. 
- By performing our experiment, we found $E(\text{A's revenue with blk text}) = 104$ and $E(\text{B's revenue with blue text}) = 229$. Substituting these values gives the following equations:

$$
\begin{aligned}
    104 + E(\text{effect of changing blk} \rightarrow \text{blue on A}) &= E(\text{A's revenue with blue text}) \\
    E(\text{B's revenue with blk text}) + E(\text{effect of changing blk} \rightarrow \text{blue on B}) &= 229
\end{aligned}
$$

- But this still leaves many variables we don’t know, and we’re not yet able to solve for $E(\text{effect of changing blk} \rightarrow \text{blue on A})$ and $E(\text{effect of changing blk} \rightarrow \text{blue on B})$. 
- The only way we’ll be able to solve for our effect sizes will be if we can simplify these two equations.

For example, suppose we knew that     
   
$E(\text{A's revenue with blk text}) = E(\text{B's revenue with blk text})$,     
$E(\text{effect of changing blk} \rightarrow \text{blue on A}) = E(\text{effect of changing blk} \rightarrow \text{blue on B})$, and     
$E(\text{A's revenue with blue text}) = E(\text{B's revenue with blue text})$.      
     
Then we could reduce these two equations to just one simple equation. 

If we knew that our groups were identical before our experiment, we would know that all of these expected values were equal, and we could simplify our two equations to the following easily solvable equation:    
    
$$
    104 + E(\text{effect of changing blk} \rightarrow \text{blue on everyone}) = 229
$$

- If the groups really were identical, this equation would let us be sure that the effect of blue text is a \$125 revenue increase. 
- This is why we consider it so important to design non-confounded experiments in which the groups have equal expected values for personal characteristics. 
- By doing so, we’re able to solve the preceding equations and be confident that our measured effect size is actually the effect of what we’re studying and not the result of different underlying characteristics.
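
The equations above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the revenue model (revenue rising with age), the `true_effect` of 10, and all variable names are invented for illustration. Because the simulation knows both potential revenues for every user, it can show that the difference observed in a confounded design equals the true effect plus the baseline gap between the groups.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
age = rng.integers(18, 80, size=n)

# Hypothetical potential outcomes: the revenue each user would generate
# under each email. Black-text revenue is assumed to rise with age.
revenue_black = 2.0 * age + rng.normal(0, 10, size=n)
true_effect = 10.0
revenue_blue = revenue_black + true_effect

# Confounded design: the younger half gets black text, the older half gets blue.
is_old = age > np.median(age)
observed_diff = revenue_blue[is_old].mean() - revenue_black[~is_old].mean()

# Baseline gap: how much the groups would differ even with the same email.
baseline_gap = revenue_black[is_old].mean() - revenue_black[~is_old].mean()

print(observed_diff, baseline_gap, observed_diff - baseline_gap)
```

In a real experiment only `observed_diff` is visible, which is why the design has to make the baseline gap zero in expectation.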

### Translating the Math into Practice
- We know what to do mathematically, but we need to translate that into practical action. 
- How should we ensure that $E(\text{A’s revenue with blk text}) = E(\text{B’s revenue with blk text})$, and how should we ensure that the other expected values are all the same? 
- In other words, how can we ensure that our study design looks like Table 4-2 instead of Table 4-1? 
- We need to find a way to select subgroups of our desktop subscriber list that are expected to be identical.

- The simplest way to select subgroups that are expected to be identical is to select them randomly. 
- We mentioned this briefly in the previous session: 
> _the mean of every random sample from a population has an expected value equal to the population mean_.    
    
So, we expect that two random samples from the same population won’t differ from each other significantly.
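
This property can be checked numerically. The sketch below uses a made-up population of revenue values (the distribution, sizes, and seed are assumptions) and shows that the average of many random-sample means lands very close to the population mean.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic "revenue" population; the numbers are illustrative only.
population = rng.normal(loc=100, scale=30, size=10_000)

# Draw many random samples of 50 users and record each sample's mean.
sample_means = np.array([
    rng.choice(population, size=50, replace=False).mean()
    for _ in range(2_000)
])

print(population.mean())
print(sample_means.mean())
```

Any single sample mean still varies around the population mean, which is why a t-test is needed to decide whether an observed difference is larger than chance alone would produce.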

## A/B Testing on Laptop Subscribers
- Let’s perform an A/B test on our laptop subscriber list, but this time we’ll use randomization to select our groups to avoid having a confounded experimental design. 
- Suppose that in this new A/B test, we want to test whether adding a picture to a marketing email will improve revenue. 
- We can proceed just as we did before: _we split the laptop subscriber list into two subgroups, and we send different emails to each subgroup_. 
- The difference is that this time, instead of splitting based on age, we perform a random split:

In [10]:
# Translating Math into Practice

np.random.seed(18811015)
laptop.loc[:,'groupassignment1']=1*(np.random.random(len(laptop.index))>0.5)
groupc=laptop.loc[laptop['groupassignment1']==0,:].copy()
groupd=laptop.loc[laptop['groupassignment1']==1,:].copy()

In [11]:
print(groupc.head(), "\n")
print(groupd.head())

   userid  spending  age  visits  groupassignment1
0      31      1499   32      12                 0
2      33      1200   45      22                 0
4      35      1350   17      85                 0
5      36      2780   25       6                 0
7      38         0   79     450                 0 

    userid  spending  age  visits  groupassignment1
1       32       799   23      40                 1
3       34         0   59     126                 1
6       37      3400   65     428                 1
8       39      1800   25     180                 1
10      41       999   35     835                 1


- After generating this random column of 0s and 1s that indicates the group assignment of each customer, we create two smaller dataframes, `groupc` and `groupd`, that contain user IDs and information about the users in each subgroup. 
- One subgroup, either C or D, should receive an email without a picture, and the other should receive an email with a picture. 
- Then, suppose that the marketing team sends you a file containing the results of this latest A/B test. 
- Let’s read the results of this email campaign into Python as follows:

In [12]:
emailresults2=pd.read_csv('emailresults2.csv')

Again, let’s join our email results to our group dataframes, just as we did before:

In [13]:
groupc_withrevenue=groupc.merge(emailresults2,on='userid')
groupd_withrevenue=groupd.merge(emailresults2,on='userid')

And again, we can use a t-test to check _whether the revenue resulting from Group C is different from the revenue we get from Group D_:

In [14]:
print(scipy.stats.ttest_ind(groupc_withrevenue['revenue'],groupd_withrevenue['revenue']))

TtestResult(statistic=np.float64(-2.381320497676198), pvalue=np.float64(0.024288828555138562), df=np.float64(28.0))


- We find that the <u>p-value is less than 0.05</u>, indicating that _the difference between the groups is statistically significant_. 
- **This time, our experiment isn’t confounded**, because we used random assignment to ensure that the differences between groups are the result of our different emails, not the result of different characteristics of each group.

- Since our experiment isn’t confounded, and since we find a significant difference between the revenues earned from Group C and Group D, we conclude that 
> **including the picture in the email has a nonzero effect**. 
- If the marketing team tells us that it sent the picture only to Group D, we can find the estimated size of the effect easily:

In [15]:
print(np.mean(groupd_withrevenue['revenue'])-np.mean(groupc_withrevenue['revenue']))

260.3333333333333


- We calculate the estimated effect here with subtraction: **the mean revenue obtained from subjects in Group D minus the mean revenue obtained from subjects in Group C**. 
- The difference between mean revenue from Group C and mean revenue from Group D, about \$260, is **the size of the effect of our experiment**.

- The process we follow for A/B testing is really quite simple, but it’s also powerful. 
- We can use it for a wide variety of questions that we might want to answer. 
- Anytime you’re unsure about an approach to take in business, especially in user interactions and product design, considering an A/B test as an approach to learn the answer is worthwhile. 
- Now that you know the process, let’s move on and understand its nuances.
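
The steps of the process (merge results onto group assignments, run a t-test, compute the difference in means) can be wrapped in one reusable function. This is a minimal sketch: `analyze_ab_test`, its column names, and the synthetic data are hypothetical and not part of the course material.

```python
import numpy as np
import pandas as pd
import scipy.stats

def analyze_ab_test(groups, results, outcome='revenue', key='userid'):
    """Join results onto group assignments and compare group 1 with group 0.

    `groups` needs columns `key` and 'group' (0 = control, 1 = treatment);
    `results` needs columns `key` and `outcome`.
    """
    merged = groups.merge(results, on=key)
    control = merged.loc[merged['group'] == 0, outcome]
    treatment = merged.loc[merged['group'] == 1, outcome]
    test = scipy.stats.ttest_ind(treatment, control)
    return {
        'effect': treatment.mean() - control.mean(),
        'pvalue': test.pvalue,
        'n_control': len(control),
        'n_treatment': len(treatment),
    }

# Synthetic example: 200 users, where the treatment adds about 50 to revenue.
rng = np.random.default_rng(3)
groups = pd.DataFrame({'userid': np.arange(200),
                       'group': rng.permutation(np.repeat([0, 1], 100))})
results = pd.DataFrame({
    'userid': np.arange(200),
    'revenue': rng.normal(100, 20, 200) + 50 * groups['group'].to_numpy(),
})
summary = analyze_ab_test(groups, results)
print(summary)
```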

## Optimizing with the Champion/Challenger Framework
- When we’ve crafted a great email, we might call it **our champion email design**: the one that, according to what we know so far, we think will perform the best. 
- After we have a champion email design, we may wish to stop doing A/B testing and simply rest on our laurels, collecting money indefinitely from our "perfect" email campaigns.



- But this isn’t a good idea, for a few reasons. 
- The first is that times change. 
- Fads in design and marketing change quickly, and a marketing effort that seems exciting and effective today may soon seem dated and outmoded. 

- _Like all champions, your champion email design will become weaker and less effective as it ages_. 
- Even if design and marketing fads don’t change, your champion will eventually seem boring as the novelty wears off: 
> _new stimuli are more likely to get people’s attention_.

- Another reason that you shouldn’t stop A/B testing is that 
> _your customer base will change_. 
- You’ll lose some old customers and gain new ones. 
- You’ll release new products and enter new markets. 
- As your customer mix changes, the types of emails that they tend to respond to will change as well, and constant A/B testing will enable you to keep up with their changing characteristics and preferences.

- A final reason to continue A/B testing is that **although your champion likely is good, you might not have optimized it in every possible way**. 
- A dimension you haven’t tested yet could enable you to have an even better champion that gets even better performance.
- If we can successfully run one A/B test and learn one thing, we’ll naturally want to continue to use our A/B testing skills to learn more and more and to increase profits higher and higher.

- Suppose you have a champion email and want to continue A/B testing to try to improve it. 
- You do another random split of your users, into a new Group A and a new Group B. 
- You send the champion email to Group A. 
- You send another email to Group B that differs from the champion email in one way that you want to learn about; for example, maybe it uses formal rather than informal language. 
- When we compare the revenues from Group A and Group B after the email campaign, we’ll be able to see whether this new email performs better than the champion email.

- Since the new email is in direct competition with the champion email, we call it the **challenger**. 
- If the champion performs better than the challenger, the champion retains its champion status. 
- If the challenger performs better than the champion, that challenger becomes the new champion.

- This process can continue indefinitely: 
    - we have a champion that represents the state of the art of whatever we’re doing (marketing emails, in this case). 
    - We constantly test the champion by putting it in direct competition with a succession of challengers in A/B tests. 
    - Each challenger that leads to significantly better outcomes than the champion becomes the new champion and is, in turn, put into competition against new challengers later.

- This endless process is called the **champion/challenger framework** for A/B tests. 
- It’s meant to lead to _continuous improvement_, _continuous refinement_, and _asymptotic optimization_ to get to the best-possible performance in all aspects of business. 
> _The biggest tech companies in the world run literally hundreds of A/B tests per day, with hundreds of challengers taking on hundreds of champions, sometimes defeating them and sometimes being defeated_. 
- The **champion/challenger framework** is a common approach for setting up and running A/B tests for the most important and most challenging parts of your business.
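
The champion/challenger loop can be sketched as a simulation. Everything here is assumed for illustration: the design names, their true mean revenues, the group size, and the promotion rule (promote only when the challenger is significantly better at the 0.05 level).

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(4)

# Hypothetical true mean revenue per recipient for each email design.
true_means = {'champion_v1': 100.0, 'formal_tone': 100.0, 'with_picture': 130.0}

def run_ab_test(mean_a, mean_b, n=500, sd=40.0):
    """Simulate revenues for two randomized groups; return (difference, p-value)."""
    revenue_a = rng.normal(mean_a, sd, n)
    revenue_b = rng.normal(mean_b, sd, n)
    test = scipy.stats.ttest_ind(revenue_b, revenue_a)
    return revenue_b.mean() - revenue_a.mean(), test.pvalue

champion = 'champion_v1'
for challenger in ['formal_tone', 'with_picture']:
    diff, pvalue = run_ab_test(true_means[champion], true_means[challenger])
    # Promote the challenger only if it is significantly better, not merely different.
    if pvalue < 0.05 and diff > 0:
        champion = challenger
    print(challenger, round(diff, 1), round(pvalue, 4), '-> champion:', champion)
```

In practice the true means are unknown, and each loop iteration corresponds to a real email campaign rather than a call to a simulator.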

## Preventing Mistakes with Twyman’s Law and A/A Testing
- A/B testing is a relatively simple process from beginning to end. 
- Nevertheless, we are all human and make mistakes. 
- In any data science effort, not just A/B testing, it’s important to proceed carefully and constantly check whether we’ve done something wrong. 
- One piece of evidence that often indicates that we’ve done something wrong is that 
> _things are going too well_.

- How could it be bad for things to go too well? 
- Consider a simple example. 

- You perform an A/B test: Group A gets one email, and Group B gets a different one. 
- You measure revenue from each group afterward and find that the average revenue earned from members of Group A is about \$25, while the average revenue earned from members of Group B is \$99,999. 
- You feel thrilled about the enormous revenue you earned from Group B. 
- You call all your colleagues to an emergency meeting and tell them to stop everything they’re doing and immediately work on implementing the email that Group B got and pivot the whole company strategy around this miracle email.

- As your colleagues are working around the clock on sending the new email to everyone they know, you start to feel a nagging sense of doubt. 
- You think about how unlikely it is that a single email campaign could plausibly earn almost \$100,000 in revenue per recipient, especially when your other campaigns are earning only about \$25 per user. 
- You think about how \$99,999, the amount of revenue you supposedly earned per user, is five identical digits repeated. 

- Maybe you remember a conversation you had with a database administrator who told you that your company database automatically inserts 99999 every time a database error occurs or data is missing. 
- Suddenly, you realize that your email campaign didn’t really earn \$99,999 per user, but rather a database error for Group B caused the appearance of the apparently miraculous result.

- A/B testing is a simple process from a data science point of view, but it can be quite complex from a practical and social point of view. 
- For example, in any company larger than a tiny startup, the creative people designing marketing emails will be different from the technology people who maintain the databases that record revenues per user. 

- Other groups may be involved in little parts of A/B testing: maybe a group that maintains the software used to schedule and send out emails, maybe a group that creates art that the email marketing team asks for, and maybe others.
- With all these groups and steps involved, many possible chances exist for miscommunication and small errors. 



- Maybe two different emails are designed, but the person who’s in charge of sending them out doesn’t understand A/B testing and copies and pastes the same email to both groups. 
- Maybe they accidentally paste in something that’s not even supposed to be in the A/B test at all. 
- In our example, maybe the database that records revenues encounters an error and puts 99999 in the results as an error code, which others mistakenly interpret as a high revenue. 
- No matter how careful we try to be, mistakes and miscommunications will always find a way to happen.
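
A simple sanity check before computing any group means can catch the error code from this example. The sketch below uses made-up data, and the `SENTINEL` value is an assumption taken from the story above. The real error code would need to be confirmed with whoever maintains the database.

```python
import pandas as pd

# Made-up results in which 99999 marks a database error, as in the story above.
results = pd.DataFrame({
    'userid': [1, 2, 3, 4, 5, 6],
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'revenue': [20, 30, 25, 99999, 99999, 99999],
})

SENTINEL = 99999  # assumed error code, not a real revenue figure

suspicious = results.loc[results['revenue'] == SENTINEL, :]
print(len(suspicious), "rows contain the sentinel value")

# Treat sentinel rows as missing rather than as revenue before computing means.
clean = results.copy()
clean['revenue'] = clean['revenue'].mask(clean['revenue'] == SENTINEL)
print(clean.groupby('group')['revenue'].mean())
```

Here Group B ends up with no usable revenue at all, which is the honest result: the test has to be rerun or the data recovered before any conclusion is drawn.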

- The inevitability of mistakes should lead us to be naturally suspicious of anything that seems too good, bad, interesting, or strange to be true. 
- This natural suspicion is advocated by **Twyman's law**, which states that 
> “_any figure that looks interesting or different is usually wrong._” 
- This law has been restated in several ways, including 
> “any statistic that appears interesting is almost certainly a mistake” 

and 
> “the more unusual or interesting the data, the more likely it is to have been the result of an error.”

- Besides extreme carefulness and natural suspicion of good news, we have another good way to prevent the kinds of interpretive mistakes that Twyman’s law warns against: **A/A testing**. 
- This type of testing is just what it sounds like; 
> _we go through the steps of randomization, treatment, and comparison of two groups just as in A/B testing, but instead of sending two different emails to our two randomized groups, **we send the identical email to each group**. In this case, we expect the null hypothesis to be true, and we won’t be gullibly convinced by a group that appears to get \$100,000 more revenue than the other group._

- If we consistently find that A/A tests lead to statistically significant differences between groups, **we can conclude that our process has a problem**: 
    - a database gone haywire, 
    - a t-test being run incorrectly, 
    - an email being pasted wrong, 
    - randomization performed incorrectly, or something else. 

An A/A test would also help us realize that the first test described in today's session (where Group A consists of younger people and Group B consists of older people) was confounded: if both groups received the identical email and their revenues still differed, we would know that the difference must be due to the differences between the groups themselves, such as age, rather than differences between emails.
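
A healthy process should flag a significant difference in only about 5% of A/A tests when the threshold is 0.05. The following sketch simulates many A/A tests on synthetic revenue data (the distribution, group sizes, and seed are assumptions) and measures how often the t-test reports a significant difference.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(5)
n_tests = 2_000
false_positives = 0

for _ in range(n_tests):
    # Both groups get the same email, so all revenue comes from one distribution.
    revenue = rng.normal(100, 40, size=200)
    assignment = rng.permutation(np.repeat([0, 1], 100))
    test = scipy.stats.ttest_ind(revenue[assignment == 0], revenue[assignment == 1])
    if test.pvalue < 0.05:
        false_positives += 1

false_positive_rate = false_positives / n_tests
print(false_positive_rate)
```

A rate far above 0.05 in real A/A tests would point to one of the process problems listed above.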

## Understanding Effect Sizes
- In the first A/B test we ran, we observed a difference of \$125 between the Group A users who received a black-text email and the Group B users who received a blue-text email. 
- This \$125 difference between groups is also called **the A/B test’s effect size**.

It’s natural to try to form a judgment about whether we should consider this \$125 effect size a small effect, a medium effect, or a large effect.

To calculate a standardized **effect size**, we need to use the standard deviation of the dataset.     
    
$$
    \text{Cohen's } d = \frac{\text{an effect size}}{\text{relevant standard deviation}}
$$   
   
Cohen's $d$ is a common metric for measuring effect sizes.    

Cohen's $d$ is just <u>the number of standard deviations that two populations' means are apart from each other</u>.      

We can calculate Cohen's $d$ for our first A/B test as follows:

In [16]:
print(125/np.std(emailresults1['revenue']))

0.763769235188029


- We see that the result is about 0.76. 
- A common convention when we’re working with Cohen’s $d$ is to say that 
    - if Cohen’s $d$ is about 0.2 or lower, we have a **small effect**; 
    - if Cohen’s $d$ is about 0.5, we have a **medium effect**; and 
    - if Cohen’s $d$ is around 0.8 or even higher, we have a **large effect**. 
        
Since our result is about 0.76, which is quite close to 0.8, we can say that we’re working with a **large effect size**.
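
The calculation above divides by the standard deviation of the whole results dataset. A common alternative, sketched here as an assumption rather than as this course's method, is to divide by the pooled standard deviation of the two groups. The function name and the revenue values are made up for illustration.

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d using the pooled sample standard deviation of the two groups."""
    group1 = np.asarray(group1, dtype=float)
    group2 = np.asarray(group2, dtype=float)
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * group1.var(ddof=1)
                  + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
    return (group2.mean() - group1.mean()) / np.sqrt(pooled_var)

# Made-up revenues for two small groups.
black_text = [0, 50, 100, 150, 200]
blue_text = [100, 150, 200, 250, 300]
print(cohens_d(black_text, blue_text))
```

The two versions usually give similar but not identical values, so it is worth stating which standard deviation was used when reporting Cohen's $d$.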

## Calculating the Significance of Data
- We typically use statistical significance as the key piece of evidence that convinces us that an effect that we study in an A/B test is real. 
- Mathematically, statistical significance depends on three things:

- The size of the effect being studied (like the increase in revenue that results from changing an email’s text color). Bigger effects make statistical significance more likely.   
   
- The size of the sample being studied (the number of people on a subscriber list who are receiving our marketing emails). Bigger samples make statistical significance more likely.      
    
- The significance threshold we’re using (typically 0.05). A higher threshold makes statistical significance more likely.

- If we have a big sample size, and we’re studying a big effect, our t-tests will likely reach statistical significance. 
- On the other hand, if we study an effect that’s very small, with a sample that’s very small, we may have predestined our own failure: 
> the probability that we detect a statistically significant result is very low, even if the email truly does have an effect. 
- Since running an A/B test costs time and money, we’d rather not waste resources running tests like this that are predestined to fail to reach statistical significance.

- The probability that a correctly run A/B test will reject a false null hypothesis is called the **A/B test’s statistical power**. 
- If changing the color of text leads to a \$125 increase in revenue per user, we can say that \$125 is the effect size, and since the effect size is nonzero, we know the null hypothesis (that changing the text color has no effect on revenue) is false. 
- But if we study this true effect by using a sample of only three or four email subscribers, it’s very possible that, by chance, none of these subscribers purchase anything, so we fail to detect the true \$125 effect. By contrast, if we study the effect of changing the text color by using an email list of a million subscribers, we’re much more likely to detect the \$125 effect and measure it as statistically significant. 
- With the million-subscriber list, we have greater statistical power.

We can import a module into Python that makes calculating statistical power easy:

In [17]:
from statsmodels.stats.power import TTestIndPower

alpha = 0.05       # significance threshold
nobs = 45          # number of observations in each group
effectsize = 0.5   # effect size expressed as Cohen's d

analysis = TTestIndPower()
power = analysis.solve_power(effect_size=effectsize, nobs1=nobs, alpha=alpha)
print(power)

0.650185508398425


- If you run `print(power)`, you can see that the estimated statistical power for our hypothetical A/B test is about 0.65. 
- This means that we expect about a 65 percent chance of detecting an effect from our A/B test and about a 35 percent chance that even though a true effect exists, our A/B test doesn’t find it. 
- These odds might seem unfavorable if a given A/B test is expected to be expensive; you’ll have to make your own decisions about the minimum level of power that is acceptable to you. 

- Power calculations can help at the planning stage to understand what to expect and be prepared. 
- One common convention is to authorize only A/B tests that are expected to have at least 80 percent power.
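To build intuition for how effect size, sample size, and the significance threshold each drive power, here is a rough sketch that uses a normal approximation instead of the exact t-based calculation. It relies only on the standard library, and `approx_power` is a hypothetical helper name. Its results should land near, but not exactly on, the t-based values.

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Normal approximation to the power of a two-sided, two-sample test
    with equal group sizes (the negligible opposite tail is ignored)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                 # critical value for the threshold
    shift = effect_size * (n_per_group / 2) ** 0.5    # expected size of the test statistic
    return z.cdf(shift - z_crit)

# Power grows with the number of observations in each group
for n in (10, 45, 64, 200):
    print(n, round(approx_power(0.5, n), 3))

# A bigger effect or a higher threshold also raises power
print(round(approx_power(0.8, 45), 3))
print(round(approx_power(0.5, 45, alpha=0.10), 3))
```

For planning purposes, the point of such a sketch is the direction of each relationship. For the actual decision, the t-based calculation is the more accurate choice.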

- You can also use the same `solve_power()` method we used in the previous snippet to “reverse” the power calculation: you’d start by assuming a certain power level and then calculate the parameters required to achieve that level of statistical power. 
- For example, in the following snippet, we define power, alpha, and our effect size, and run the `solve_power()` command not to calculate the power but to calculate `observations`, the number of observations we’ll need in each group to achieve the power level we specified:

In [18]:
# solve_power() reverses power calculations
analysis = TTestIndPower()
alpha = 0.05
effect = 0.5
power = 0.8
observations = analysis.solve_power(effect_size=effect, power=power, alpha=alpha)
# calculates the nobs to achieve the power level
print(observations)

63.76561058785403


- If you run `print(observations)`, you’ll see that the result is about 63.8. 
- This means that if we want to have 80 percent statistical power for our planned A/B test, we’ll need to recruit at least 64 participants in each group (128 in total). 
- Being able to perform these kinds of calculations can be helpful in the planning stages of A/B tests.
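The same planning question can be asked for several effect sizes at once. The sketch below uses a textbook normal-approximation formula, $n \approx 2\left((z_{1-\alpha/2} + z_{\text{power}})/d\right)^2$ per group, with the standard library only. `approx_n_per_group` is a hypothetical helper, and because it ignores the t-distribution correction it tends to come out slightly below the t-based figure.

```python
import math
from statistics import NormalDist

def approx_n_per_group(effect_size, power=0.8, alpha=0.05):
    """Approximate observations needed in each group (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = z.inv_cdf(power)           # desired statistical power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Small, medium, and large effects under the Cohen's d convention
for d in (0.2, 0.5, 0.8):
    print(d, approx_n_per_group(d))
```

Smaller effects require far more participants, which is why it helps to estimate the plausible effect size before committing to a test.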

## Applications and Advanced Considerations
- So far, we’ve considered only A/B tests related to marketing emails. 
- But A/B tests are applicable to a wide variety of business challenges beyond optimal email design. 
- One of the most common applications of A/B testing is user interface/experience design. 

- A website might randomly assign visitors to two groups (called Group A and Group B, as usual) and show different versions of the site to each group. 
- The site can then measure which version leads to more user satisfaction, higher revenue, more link clicks, more time spent on the site, or whatever else interests the company. 
- The whole process can be completely automated, which is what enables the high-speed, high-volume A/B testing that today’s top tech companies are doing.
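One way the automated assignment step could look is sketched below. It hashes a visitor ID so that each visitor is placed in a group pseudo-randomly yet always sees the same version on return visits. This is an assumption about how a site might do it, not a method described in this session, and the experiment name and user IDs are hypothetical.

```python
import hashlib

def assign_group(user_id, experiment="homepage_test"):
    """Deterministically assign a visitor to Group A or Group B."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # An even hash value goes to A, an odd one to B: roughly a 50/50 split
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Assign 1,000 hypothetical visitors and check that the split is roughly even
groups = [assign_group(uid) for uid in range(1, 1001)]
print(groups.count("A"), groups.count("B"))
```

Including the experiment name in the hash means the same visitor can fall into different groups in different experiments, which helps keep separate tests from being confounded with one another.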

- E-commerce companies run tests, including A/B tests, on product pricing. 
- By running an A/B test on pricing, you can measure what economists call _the price elasticity of demand_, meaning how much demand changes in response to price changes. 

- If your A/B test finds only a very small change in demand when you increase the price, you should increase the price for everyone and take advantage of their greater willingness to pay. 
- If your A/B test finds that demand drops off significantly when you increase the price slightly, you can conclude that customers are sensitive to price, and their purchase decisions depend heavily on price considerations. 
- If customers are sensitive to price and constantly thinking about it, they’ll likely respond positively to a price decrease. 
- If so, you should decrease the price for everyone instead, and expect a large increase in demand. 
- Some businesses have to set prices based on intuition or painstaking calculations, but A/B testing makes determining the right price relatively simple.
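As a small worked example of the pricing logic, the sketch below estimates price elasticity of demand from the two arms of a pricing A/B test as the percentage change in demand divided by the percentage change in price. All numbers are made up, and the two groups are assumed to be the same size.

```python
def price_elasticity(price_a, price_b, demand_a, demand_b):
    """Percentage change in demand divided by percentage change in price."""
    pct_price = (price_b - price_a) / price_a
    pct_demand = (demand_b - demand_a) / demand_a
    return pct_demand / pct_price

# Hypothetical results: Group A saw a $100 price and made 500 purchases,
# Group B saw a $110 price and made 480 purchases
elasticity = price_elasticity(100, 110, 500, 480)
revenue_a = 100 * 500
revenue_b = 110 * 480

print(round(elasticity, 2))
print(revenue_a, revenue_b)
```

An elasticity between 0 and -1 means demand falls proportionally less than the price rises, so in this made-up case revenue is higher at the higher price. An elasticity below -1 would point the other way.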

## The Ethics of A/B Testing
- A/B testing is fraught with difficult ethical issues. 
- This may seem surprising, but remember, A/B testing is an experimental method in which we intentionally alter human subjects’ experiences in order to study the results for our own gain. 
- This means that A/B testing is human experimentation. 
- Think about other examples of human experimentation to see why people have ethical concerns about it:

- Jonas Salk developed an untested, unprecedented polio vaccine, tried it on himself and his family, and then tried it on millions of American children to ensure that it worked. (It worked and helped eliminate a terrible disease from much of the world.)
- My grandmother made pisang ijo for her grandchildren, observed how we reacted to it, and then the next day made a different pisang ijo and checked whether we reacted more or less positively. (Both were delicious.)


- A professor posed as a student and emailed 6300 professors to ask them to schedule time to talk to her, lying about herself and her intentions in an attempt to determine whether her false persona would be a target of discrimination so she could publish a paper about the replies. 
- She didn’t compensate any of the unwitting study participants for the deception or the schedule disruption, nor did she receive their consent beforehand to be an experimental subject. (Every detail of this study was approved by a university ethics board.)

- A corporation intentionally manipulated the emotions of its users to better understand and sell products to them.

- Josef Mengele performed painful and deadly sadistic experiments on unwilling human subjects in the Auschwitz concentration camp.
- You perform an A/B test.

- Because of the broad range of activities that could be called **human experimentation**, making a single ethical judgment about all of its forms isn’t possible. 
- We have to consider several important ethical concepts when we’re deciding whether our A/B tests make us a hero like my grandmother or Salk, a villain like Mengele, or something in between.

> The first concept we should consider is **consent**. 

- In some cases, obtaining informed consent is not feasible. 
- For example, if we perform experiments about which outdoor billboard designs are most effective, we can’t obtain informed consent from every possible human research subject, since any person in the world could conceivably see a public billboard, and we don’t have a way to contact every living human.

- Other cases form a large gray area. For example, a website that performs A/B tests may have a Terms and Conditions section, with small print and legalese that claims that every website visitor provides consent to be experimented on (via A/B tests of user-interface features) whenever they navigate to the site. 
- This may technically meet the definition of informed consent, but only a tiny percentage of any website’s visitors likely read and understand these conditions.

- Another important ethical consideration related to A/B testing is **risk**. 
- Risk itself involves two considerations: potential downsides to participation as a human subject and the probability of experiencing those downsides. 
- Salk’s vaccine had a large potential downside (contracting polio), but because of Salk’s preparation and knowledge, the probability of subjects experiencing it was remarkably low. 
- A/B testing for marketing campaigns usually has potential downsides that are minuscule or smaller, as it’s hard to even imagine any downside that could occur because (for example) someone was exposed to blue rather than black text in one marketing email. 
> Experiments with low risks to subjects are more ethical than risky experiments.

- We should also consider the potential benefits that could result from our experimentation. 
- Salk’s vaccine experiments had the potential (later realized) of eradicating polio from most of the Earth. 
- A/B tests are designed to improve profits, not cure diseases, so your judgment of their benefits will have to depend on your opinion of the moral status of corporate profits. 
- The only other benefit likely to come from a corporation’s marketing experiment would be an advance in understanding of human psychology. 
- Indeed, corporate marketing practitioners occasionally publish the results of marketing experiments in psychology journals, so this isn’t unheard of.

- Ethical and philosophical questions can never reach a definitive, final conclusion that everyone agrees on. 
- You can make up your own mind about whether you feel that A/B testing is fundamentally good, like Salk’s vaccine experiments, or fundamentally abhorrent, like Mengele’s horrors. 
- Most people agree that the extremely low risks of most online A/B testing, and the fact that people rarely refuse consent to benign A/B tests, mean that A/B testing is an ethically justifiable activity when performed properly.
- Regardless, you should think carefully through your own situation and come to your own conclusion.

## Conclusion
- In this session, we discussed A/B testing. 
- We started with a simple t-test, and then looked at the need for random, non-confounded data collection as part of the A/B testing process. 
- We covered some nuances of A/B testing, including the champion/challenger framework and Twyman’s law, as well as ethical concerns.

<center>
        <h1>The End</h1>
</center>