# A/B Testing -
Scientists do more than observe data and make quantitative judgements about it, and the job involves more than observing preexisting differences. A huge part of science consists of creating differences experimentally and then drawing conclusions from them. Here we discuss how to conduct those kinds of experiments in business. <br>
We start by discussing the need for experimentation and our motivations for testing. We then move to how to properly set up experiments, including the need for randomization. Next, we cover the details of A/B testing and the champion/challenger framework. We end by describing nuances like the exploration/exploitation trade-off, as well as ethical concerns.

## The Need for Experimentation -
Imagine a computer company that maintains one email list of customers who are interested in desktop computers and another email list of customers who are interested in laptops.


In [3]:
import pandas as pd
desktop = pd.read_csv('desktop.csv')
laptop = pd.read_csv('laptop.csv')

In [4]:
print(desktop.head())
print(laptop.head())

   userid  spending  age  visits
0       1      1250   31     126
1       2       900   27       5
2       3         0   30     459
3       4      2890   22      18
4       5      1460   38      20
   userid  spending  age  visits
0      31      1499   32      12
1      32       799   23      40
2      33      1200   45      22
3      34         0   59     126
4      35      1350   17      85


In [5]:
import scipy.stats 
print(scipy.stats.ttest_ind(desktop['spending'], laptop['spending']))
print(scipy.stats.ttest_ind(desktop['age'], laptop['age']))
print(scipy.stats.ttest_ind(desktop['visits'], laptop['visits']))

TtestResult(statistic=np.float64(-2.109853741030508), pvalue=np.float64(0.03919630411621095), df=np.float64(58.0))
TtestResult(statistic=np.float64(-0.7101437106800108), pvalue=np.float64(0.4804606394128761), df=np.float64(58.0))
TtestResult(statistic=np.float64(0.20626752311535543), pvalue=np.float64(0.8373043059847984), df=np.float64(58.0))
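
As a rough reading aid for the three results above, the sketch below compares each p-value with a 0.05 threshold. The p-values are copied from the printed output; the 5 percent threshold is an assumption at this point, and the helper name `is_significant` is hypothetical.

```python
# p-values copied from the three t-test results printed above
p_values = {
    "spending": 0.03919630411621095,
    "age": 0.4804606394128761,
    "visits": 0.8373043059847984,
}

ALPHA = 0.05  # assumed significance level (5 percent)


def is_significant(p_value, alpha=ALPHA):
    """Return True when the p-value falls below the chosen threshold."""
    return p_value < alpha


verdicts = {name: is_significant(p) for name, p in p_values.items()}
for name, significant in verdicts.items():
    label = "significant" if significant else "not significant"
    print(f"{name}: {label} at the {ALPHA:.0%} level")
```

Read this way, the desktop and laptop lists differ significantly in spending, but not in age or number of visits.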


<i>A/B testing</i>, the focus of this chapter, uses experiments to help businesses determine which practices will give them the greatest chance of success. It consists of a few steps: experimental design, random assignment into treatment and control groups, careful measurement of outcomes, and finally statistical comparison of outcomes between groups. <br>
The way we'll do statistical comparisons will be familiar: we'll use t-tests. While t-tests are a part of the A/B testing process, they are not the only part. <b>A/B testing is a process for collecting new data, which can then be analyzed using tests like the t-test.</b>
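
The four steps can be sketched end to end as follows. This is only an illustration under stated assumptions: the `users` dataframe is hypothetical, the revenue values are simulated with no built-in treatment effect, and shuffling with `sample(frac=1)` is just one of several reasonable ways to assign groups at random.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Step 1 (design): a hypothetical subscriber list of 40 users
users = pd.DataFrame({"userid": np.arange(1, 41)})

# Step 2 (random assignment): shuffle the rows, then cut the list in half
shuffled = users.sample(frac=1, random_state=0).reset_index(drop=True)
half = len(shuffled) // 2
treatment = shuffled.iloc[:half].copy()
control = shuffled.iloc[half:].copy()

# Step 3 (measurement): simulated revenue, identical distribution in both groups
treatment["revenue"] = rng.normal(loc=100, scale=20, size=len(treatment))
control["revenue"] = rng.normal(loc=100, scale=20, size=len(control))

# Step 4 (comparison): two-sample t-test on the measured outcome
result = stats.ttest_ind(treatment["revenue"], control["revenue"])
print(result.pvalue)
```

In a real test, step 3 would be replaced by revenue actually observed after each group received its treatment.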

## Running Experiments to Test New Hypotheses --
Consider one hypothesis about our customers that might interest us. Suppose we're interested in studying whether changing the color of the text in our marketing emails from black to blue will increase the revenue we earn as a result of the emails. <br>
<b>Hypothesis 0</b> Changing the color of the text in our emails from black to blue will have no effect on revenues. <br>
<b>Hypothesis 1</b> Changing the color of text in our emails from black to blue will lead to a change in revenues (either an increase or a decrease). 

We could use the hypothesis-testing framework to check Hypothesis 0 and decide whether to reject it in favour of its alternative. The only difference in this instance is that we do not yet have any data showing the impact of black-text or blue-text emails. So extra steps are required before we perform hypothesis testing: designing an experiment, running it, and collecting data on its results. <br>
Running experiments may not sound difficult, but some tricky parts are important to get exactly right. We need data from two groups: a group that has received a blue-text email and a group that has received a black-text email. We'll need to know how much revenue we received from each member of each group. <br>
After that, we can do a simple t-test to determine whether the revenue collected from the blue-text group differed significantly from the revenue collected from the black-text group. We are going to use the 5 percent significance level for all of our tests: we'll reject the null hypothesis in favour of the alternative hypothesis if the <i>p-value</i> is lower than 0.05. If the revenues are significantly different, we reject our null hypothesis. <br>
Otherwise, we won't reject our null hypothesis, and we'll keep working under its assertion that changing the color of the text in our emails from black to blue has no effect on revenues. Now we need to split our population of interest into two subgroups and send a blue-text email to one subgroup and a black-text email to the other so we can compare revenues from each group. We begin with the desktop group only and split our desktop dataframe into two subgroups. <br>
Although there are many ways to split a group into two subgroups, one possible choice is to split our dataset into a group of younger people and a group of older people. We might split our data this way because we believe that younger people and older people might be interested in different products, or simply because age is one of the variables that appears in our data. There are better ways to split our data into subgroups, and this one may lead to problems, but since it is simple and easy, we begin this way:

In [6]:
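import numpy as np
# Median age of the desktop subscribers
medianage = np.median(desktop['age'])
# groupa: at or below the median age; groupb: above the median age
groupa = desktop.loc[desktop['age'] <= medianage, :]
groupb = desktop.loc[desktop['age'] > medianage, :]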

Using the numpy package (alias: np), we call its median() function to find the median age of our desktop subscribers. We then create groupa, the subset of desktop subscribers whose age is below or equal to the median age, and groupb, the subset of desktop subscribers whose age is above the median age. <br>
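
One detail of this split is worth checking: because groupa uses `<=`, everyone whose age equals the median lands in groupa, so the two groups need not be the same size. A minimal sketch with a hypothetical toy dataframe (the ages and the names `toy`, `younger`, and `older` are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical ages; several subscribers share the median age of 30
toy = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5, 6],
    "age": [22, 30, 30, 30, 41, 55],
})

median_age = np.median(toy["age"])           # 30.0
younger = toy.loc[toy["age"] <= median_age]  # ties fall into this group
older = toy.loc[toy["age"] > median_age]

print(len(younger), len(older))  # 4 2
```

Every row still ends up in exactly one group, but the sizes are 4 and 2 rather than 3 and 3.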
After creating groupa and groupb, we can send these datasets to our marketing team members and instruct them to send a different email to each group. Suppose they send the black-text email to groupa and the blue-text email to groupb. In every email, they include links to new products they want to sell. By tracking who clicks which links and what they purchase, the team members can measure the total revenue earned from each individual recipient. <br>
(The results loaded below are fabricated data.)

In [10]:
emailresults1= pd.read_csv('emailresults1.csv')

In [11]:
#Merging the information
groupa_withrevenue = groupa.merge(emailresults1, on='userid')
groupb_withrevenue = groupb.merge(emailresults1, on='userid')

With the pandas merge() method, we combine our dataframes. We specify on='userid', meaning that we take the row of emailresults1 that corresponds to a particular userid and merge it with the row of groupa (or groupb) that corresponds to the same userid. The end result is a dataframe in which every row corresponds to a particular user, identified by their unique userid. The columns now tell us not only about the user's characteristics, like age, but also about the revenue we earned from them as a result of our recent email campaign. <br>
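
A small sketch of how merge() treats userids that appear in only one table may help here. The `group` and `results` dataframes are hypothetical; `how`, `validate`, and `indicator` are standard pandas merge parameters, though whether the chapter's data needs them is an assumption.

```python
import pandas as pd

# Hypothetical group and results tables; userid 3 has no recorded result
group = pd.DataFrame({"userid": [1, 2, 3], "age": [31, 27, 30]})
results = pd.DataFrame({"userid": [1, 2, 4], "revenue": [100, 0, 250]})

# The default merge is an inner join: only userids present in both tables survive
inner = group.merge(results, on="userid")

# A left join keeps every group member; missing revenue shows up as NaN.
# validate="one_to_one" raises an error if a userid is duplicated on either side,
# and indicator=True adds a _merge column showing where each row came from.
left = group.merge(results, on="userid", how="left",
                   validate="one_to_one", indicator=True)

print(inner)
print(left)
```

With the default inner join, a user with no recorded revenue silently drops out of the group, which changes the sample the t-test sees.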
After the data is prepared, we can simply run a t-test to check whether our groups are different.

In [12]:
print(scipy.stats.ttest_ind(groupa_withrevenue['revenue'], groupb_withrevenue['revenue']))

TtestResult(statistic=np.float64(-2.186454851070545), pvalue=np.float64(0.03730073920038287), df=np.float64(28.0))


The p-value in this output is approximately 0.037. Since p < 0.05, we can conclude that the difference is statistically significant. To check the size of the difference, we subtract the mean revenue of groupa from the mean revenue of groupb:

In [17]:
print(np.mean(groupb_withrevenue['revenue']) - np.mean(groupa_withrevenue['revenue']))

125.0


The output is <b>125</b>: the average groupb customer outspent the average groupa customer by $125. Therefore, we reject Hypothesis 0 in favour of Hypothesis 1, concluding (at least in this instance) that blue text in a marketing email leads to about $125 more in revenue per user than black text. <br>
What we just did is an <i>experiment</i>: we split a population into two groups, performed a different action on each group, and compared the results. In the context of business, such an experiment is often called an <i><b>A/B test</b></i>. The A/B part of the name refers to the two groups, Group A and Group B, whose different responses to emails we compared. <br>
<b>Every A/B test follows the same pattern: a split into two groups, application of a different treatment to each group (for example, sending different emails), and statistical analysis to compare the groups' outcomes and draw conclusions about which treatment is better.</b> <br>
Now that we have conducted an A/B test, we may be tempted to conclude that the effect of the blue text is to increase spending by $125. However, something is wrong with the A/B test we ran: it's <b><i>confounded</i></b>. <br>
The most important feature of Group A and Group B is their spending, and our t-test found that their spending levels were significantly different. But we also need an explanation for why they are different. We want to conclude that the difference in spending can be explained by the <b>difference in the text color</b>. However, that difference coexists with another difference: <i>age</i>. <br>
We can't be sure that the difference in spending is due to the text color rather than age. For example, perhaps no one even noticed the text difference, but older people just tend to be wealthier and more eager to buy our products than younger people. If so, our A/B test didn't test for the effect of blue text but rather for the effect of age and wealth. We intended to study only the effect of text color in this A/B test, and now we don't know whether we truly studied that or whether we studied age, wealth, or something else. It would be better if we could run our A/B test with a non-confounded design in which the groups were <b>alike in every way</b> except for the treatment. <br>
For example, if Group C and Group D are the same in every way and differ only in the text color of the emails they received, the difference in their spending can be explained only by the different text colors. Therefore, we should have split the groups in a way that ensured that the only differences between them were in our experimental treatment, not in the group members' preexisting characteristics. If we had done that, we would have avoided having a confounded experiment.
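
The confounding problem can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the subscribers are simulated, revenue is made to rise with age, and text color has no effect at all. A random shuffle is used as one possible contrasting design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n = 1000

# Simulated subscribers: revenue rises with age, and text color does nothing
age = rng.uniform(18, 70, size=n)
revenue = 5 * age + rng.normal(0, 50, size=n)

# Confounded design: split on median age, as in the experiment above
is_older = age > np.median(age)
diff_age_split = revenue[is_older].mean() - revenue[~is_older].mean()
p_age_split = stats.ttest_ind(revenue[is_older], revenue[~is_older]).pvalue

# Contrast: assign groups by a random shuffle, ignoring age
shuffled = rng.permutation(n)
group_c, group_d = shuffled[: n // 2], shuffled[n // 2:]
diff_random = revenue[group_d].mean() - revenue[group_c].mean()

print(f"median-age split: difference {diff_age_split:.1f}, p-value {p_age_split:.3g}")
print(f"random split:     difference {diff_random:.1f}")
```

Even though no treatment effect exists in the simulation, the median-age split produces a large and statistically significant revenue gap, which is exactly the kind of result that could be mistaken for an effect of text color.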

## Understanding the Math of A/B Testing --
Mathematically, we can express these notions using E() to refer to the expected value. So <i>E(A's revenue with blk text)</i> means the <b>expected value of the revenue we would earn by sending a black-text email to Group A.</b> With this notation, we can write two simple equations that describe the relationship between the revenue we expect to earn from black text, the effect of our experiment, and the revenue we expect to earn from blue text: <br>

E(A's revenue with blk text) + E(effect of changing blk --> blue on A) = E(A's revenue with blue text) <br>
E(B's revenue with blk text) + E(effect of changing blk --> blue on B) = E(B's revenue with blue text) <br>

To decide whether to reject the null hypothesis (Hypothesis 0), we need to solve for the effect sizes: E(effect of changing blk --> blue on A) and E(effect of changing blk --> blue on B). If either of these effect sizes is different from 0, we should reject Hypothesis 0. By performing our experiment, we found E(A's revenue with blk text) = $104 and E(B's revenue with blue text) = $229. Substituting these values into our equations gives: <br>

104 + E(effect of changing blk --> blue on A) = E(A's revenue with blue text) <br>
E(B's revenue with blk text) + E(effect of changing blk --> blue on B) = 229 <br>

But this still leaves several unknowns, so we're not yet able to solve for E(effect of changing blk --> blue on A) or E(effect of changing blk --> blue on B). The only way we'll be able to solve for our effect sizes is if we can simplify these two equations. For example, if we knew that E(A's revenue with blk text) = E(B's revenue with blk text), that E(effect of changing blk --> blue on A) = E(effect of changing blk --> blue on B), and that E(A's revenue with blue text) = E(B's revenue with blue text), we could reduce these two equations to just one. <b>If we knew our groups were identical before our experiment, we would know that all of these expected values were equal and we could simplify our two equations to the following easily solvable equation:</b>

<center> 104 + E(Effect of changing blk --> blue text on everyone) = 229</center>

With this equation, we can conclude that the effect of blue text is a $125 revenue increase. This is why it is so important to design non-confounded experiments in which groups have equal <b>expected values for personal characteristics.</b> By doing so, we're able to solve the preceding equations and be confident that our measured effect size is actually the effect of what we're studying and not the result of different underlying characteristics.
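
To see what goes wrong when the groups are not identical, one can start from the equation for Group B and subtract the measured black-text revenue of Group A from both sides. The following derivation is a sketch that uses only the two measured values, 104 and 229; the shortened label "effect on B" stands for E(effect of changing blk --> blue on B) and is introduced here only for compactness.

```latex
\begin{aligned}
E(\text{B's revenue with blue text})
  &= E(\text{B's revenue with blk text}) + E(\text{effect on B}) \\
229 - 104
  &= \bigl[E(\text{B's revenue with blk text}) - 104\bigr] + E(\text{effect on B}) \\
125
  &= \underbrace{E(\text{B's revenue with blk text}) - E(\text{A's revenue with blk text})}_{\text{preexisting difference between groups}}
   + \underbrace{E(\text{effect on B})}_{\text{effect of blue text}}
\end{aligned}
```

The measured $125 equals the effect of blue text only when the preexisting difference between the groups is zero. In a confounded design, that first term is unknown, so the effect cannot be separated from it.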
