# Analyzing FarmBurg's A/B Test
Brian is a Product Manager at FarmBurg, a company that makes a farming simulation social network game. In the FarmBurg game, you can plow, plant, and harvest different crops. Brian has been conducting an A/B test with three different variants, and he wants you to help him analyze the results. Using the Python modules pandas and SciPy, you will help him make some important business decisions!

Note that a solution.py file is also loaded for you in the workspace, which contains solution code for this project. We highly recommend that you complete the project on your own without checking the solution, but feel free to take a look if you get stuck or want to check your answers!

In [1]:
import pandas as pd
import numpy as np

# Project Requirements
## 1
Brian ran an A/B test with three different groups: A, B, and C. He has provided us with a CSV file of his results named clicks.csv. It has the following columns:

- user_id: a unique id for each visitor to the FarmBurg site
- group: either 'A', 'B', or 'C' depending on which group the visitor was assigned to
- is_purchase: either 'Yes' if the visitor made a purchase or 'No' if they did not.

We’ve already imported pandas as pd and loaded clicks.csv as abdata. Inspect the data using the .head() method.

In [3]:
# Read in the `clicks.csv` file as `abdata`
abdata = pd.read_csv('clicks.csv')
abdata.head()

Unnamed: 0,user_id,group,is_purchase
0,8e27bf9a,A,No
1,eb89e6f0,A,No
2,7119106a,A,No
3,e53781ff,A,No
4,02d48cf1,A,Yes


## 2
Note that we have two categorical variables: group and is_purchase. We are interested in whether visitors are more likely to make a purchase if they are in any one group compared to the others. Because we want to know if there is an association between two categorical variables, we’ll start by using a Chi-Square test to address our question.

In order to run a Chi-Square test, we first need to create a contingency table of the variables group and is_purchase. Use pd.crosstab() to create this table and name the result Xtab, then print it out. Which group appears to have the highest number of purchases?

In [6]:
Xtab = pd.crosstab(abdata['group'], abdata['is_purchase'])
print(Xtab)

is_purchase    No  Yes
group                 
A            1350  316
B            1483  183
C            1583   83
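Raw counts are easiest to compare as rates. The sketch below copies the counts from the table above into a plain dictionary (so it runs without `clicks.csv`) and computes each group's purchase rate. The names `counts` and `purchase_rates` are illustrative and not part of the project.

```python
# Counts copied from the contingency table printed above.
counts = {
    "A": {"No": 1350, "Yes": 316},
    "B": {"No": 1483, "Yes": 183},
    "C": {"No": 1583, "Yes": 83},
}

purchase_rates = {}
for group, row in counts.items():
    group_size = row["No"] + row["Yes"]          # visitors assigned to this group
    purchase_rates[group] = row["Yes"] / group_size
    print(f"Group {group}: {row['Yes']} of {group_size} purchased ({purchase_rates[group]:.1%})")
```

Because all three groups have the same size (1666 visitors), the ranking by rate matches the ranking by raw count: A, then B, then C.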


## 3
To conduct the Chi-Square Test, import chi2_contingency from scipy.stats.

Then, use the function chi2_contingency with the data in Xtab to calculate the p-value. Remember that of the four values returned by chi2_contingency, the p-value is the second value.

Save the p-value to a variable named pval and print the result. Using a significance threshold of 0.05, is there a significant difference in the purchase rate for groups A, B, and C?

Note that you might see a number in scientific notation. For example, 1.234e-8 is equal to 0.00000001234 (we move the decimal to the left by 8 places and insert zeros).

In [9]:
from scipy.stats import chi2_contingency

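# The second positional argument of chi2_contingency is `correction`, not a
# significance level, so we pass only the contingency table.
_, pval, _, _ = chi2_contingency(Xtab)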
print('There is a significant difference between the rates for each group' if pval < 0.05 else 'There is not a significant difference between the rates for each group')
print(f'The pvalue is {pval}')

There is a significant difference between the rates for each group
The pvalue is 2.4126213546684264e-35
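To see where this p-value comes from, here is a minimal sketch that recomputes the Chi-Square statistic by hand from the printed contingency table. It relies on one shortcut that applies only to this table shape: with exactly 2 degrees of freedom, the chi-square survival function reduces to `exp(-x / 2)`. The variable names are illustrative.

```python
import math

# Observed counts from the contingency table: [No, Yes] per group.
observed = [
    [1350, 316],  # group A
    [1483, 183],  # group B
    [1583, 83],   # group C
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell, where the expected
# count assumes group and purchase are independent.
chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (3 - 1) * (2 - 1) = 2

# Closed form valid only for 2 degrees of freedom.
p_value = math.exp(-chi2_stat / 2)

print(f"chi-square statistic: {chi2_stat:.2f}, dof: {dof}")
print(f"p-value: {p_value:.3e}")
```

The statistic comes out near 159.42 and the p-value agrees with the value printed by `chi2_contingency` above. For any other table shape, use SciPy rather than this shortcut.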


## 4
Our day is a little less busy than expected, so we decide to ask Brian about his test.

**Us**: Hey Brian! What was that test you were running anyway?

**Brian**: We are trying to get users to purchase a small FarmBurg upgrade package. It’s called a microtransaction. We’re not sure how much to charge for it, so we tested three different price points: \\$0.99 (group 'A'), \\$1.99 (group 'B'), and \\$4.99 (group 'C'). It looks like significantly more people bought the upgrade package for \\$0.99, so I guess that’s what we’ll charge.

**Us**: Oh no! We should have asked you this before we did that Chi-Square test. That wasn’t the right test at all. It’s true that more people wanted to purchase the upgrade at $0.99; you probably expected that. What we really want to know is whether each price point allows us to make enough money that we can exceed some target goal. Brian, how much do you think it cost to build this feature?

**Brian**: Hmm. I guess that we need to generate a minimum of $1000 in revenue per week in order to justify this project.

**Us**: We have some work to do!

In order to justify this feature, we will need to calculate the necessary purchase rate for each price point. Let’s start by calculating the number of visitors to the site this week.

It turns out that Brian ran his original test over the course of a week, so the number of visitors in abdata is equal to the number of visitors in a typical week. Calculate the number of visitors in the data and save the value in a variable named num_visits. Make sure to print the value.

In [10]:
num_visits = len(abdata)
print(f'There are {num_visits} people.')

There are 4998 people.


## 5
Now that we know how many visitors we generally get each week (num_visits), we need to calculate the number of visitors who would need to purchase the upgrade package at each price point (\\$0.99, \\$1.99, \\$4.99) in order to generate Brian’s minimum revenue target of \\$1,000 per week.

To start, calculate the number of sales that would be needed to reach \\$1,000 of revenue at a price point of \\$0.99. Save the result as num_sales_needed_099 and print it out.

In [12]:
num_sales_needed_099 = 1000 / 0.99
print(f'Number of sales at $0.99 is {np.ceil(num_sales_needed_099)}')

Number of sales at $0.99 is 1011.0


## 6
Now that we know how many sales we need at a \\$0.99 price point, calculate the proportion of weekly visitors who would need to make a purchase in order to meet that goal. Remember that the number of weekly visitors is saved as num_visits. Save the result as p_sales_needed_099 and print it out.

In [14]:
p_sales_needed_099 = num_sales_needed_099 / num_visits
print(f'The proportion of weekly visitors required to make a purchase is {p_sales_needed_099}')

The proportion of weekly visitors required to make a purchase is 0.20210104243717691


## 7
Repeat the steps from tasks 5 and 6 for the other price points (\\$1.99 and \\$4.99). Save the number of sales needed for each price point as num_sales_needed_199 and num_sales_needed_499, respectively. Then, save the proportion of visits needed as p_sales_needed_199 and p_sales_needed_499, respectively.

Print out the proportions. Note that for higher price points, you’ll need to sell fewer upgrade packages in order to meet your minimum revenue target — so the proportions should decrease as the price points increase.

In [16]:
# For $1.99
num_sales_needed_199 = 1000 / 1.99
print(f'Number of sales at $1.99 is {np.ceil(num_sales_needed_199)}')
p_sales_needed_199 = num_sales_needed_199 / num_visits
print(f'The proportion of weekly visitors required to make a purchase is {p_sales_needed_199}')
# For $4.99
num_sales_needed_499 = 1000 / 4.99
print(f'Number of sales at $4.99 is {np.ceil(num_sales_needed_499)}')
p_sales_needed_499 = num_sales_needed_499 / num_visits
print(f'The proportion of weekly visitors required to make a purchase is {p_sales_needed_499}')

Number of sales at $1.99 is 503.0
The proportion of weekly visitors required to make a purchase is 0.10054272965467594
Number of sales at $4.99 is 201.0
The proportion of weekly visitors required to make a purchase is 0.040096198800161346
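Tasks 5 through 7 repeat the same arithmetic three times, so it can also be written as a loop. This is a sketch using only the standard library, with the visitor count and revenue target hard-coded from the steps above. The `required` dictionary is an illustrative name, not part of the project.

```python
import math

revenue_target = 1000            # dollars per week (Brian's minimum)
num_visits = 4998                # weekly visitors, from task 4
price_points = [0.99, 1.99, 4.99]

required = {}
for price in price_points:
    sales_needed = revenue_target / price
    required[price] = {
        "whole_sales": math.ceil(sales_needed),       # sales must be whole numbers
        "proportion": sales_needed / num_visits,      # share of visitors who must buy
    }
    print(f"${price:.2f}: {required[price]['whole_sales']} sales, "
          f"{required[price]['proportion']:.2%} of visitors")
```

As expected, the required proportion falls as the price rises: roughly 20%, 10%, and 4% of visitors.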


## 8
Now let’s return to Brian’s question. To start, we want to know if the percent of Group A (the $0.99 price point) that purchased an upgrade package is significantly greater than p_sales_needed_099 (the percent of visitors who need to buy an upgrade package at \\$0.99 in order to make our minimum revenue target of \\$1,000).

To answer this question, we want to focus on just the visitors in group A. Then, we want to compare the number of purchases in that group to p_sales_needed_099.

Since we have a single sample of categorical data and want to compare it to a hypothetical population value, a binomial test is appropriate. In order to run a binomial test for group A, we need to know two pieces of information:

- The number of visitors in group A (the number of visitors who were offered the \\$0.99 price point)
- The number of visitors in Group A who made a purchase

Calculate these two numbers and save them as samp_size_099 and sales_099, respectively. Note that you can use the contingency table that you printed earlier to get these numbers OR you can use Python syntax.

In [29]:
samp_size_099 = len(abdata[abdata['group'] == 'A'])
print(f'The number of samples in A is {samp_size_099}')
sales_099 = len(abdata[(abdata['group'] == 'A') & (abdata['is_purchase'] == 'Yes')])
print(f'The number of sales in A is {sales_099}')

The number of samples in A is 1666
The number of sales in A is 316
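The same numbers can be read off the contingency table instead of filtering the DataFrame. The sketch below rebuilds a stand-in for `abdata` from the counts printed in task 2 (so it runs without `clicks.csv`) and then takes row totals and the `Yes` column from the crosstab. The names `abdata_demo`, `Xtab_demo`, `samp_sizes`, and `sales` are illustrative.

```python
import pandas as pd

# Stand-in data rebuilt from the printed contingency table: (No, Yes) per group.
counts = {"A": (1350, 316), "B": (1483, 183), "C": (1583, 83)}
rows = []
for group, (n_no, n_yes) in counts.items():
    rows += [(group, "No")] * n_no + [(group, "Yes")] * n_yes
abdata_demo = pd.DataFrame(rows, columns=["group", "is_purchase"])

Xtab_demo = pd.crosstab(abdata_demo["group"], abdata_demo["is_purchase"])

samp_sizes = Xtab_demo.sum(axis=1)  # visitors per group (row totals)
sales = Xtab_demo["Yes"]            # purchases per group

print(samp_sizes.to_dict())
print(sales.to_dict())
```

With the real data, `Xtab.loc['A', 'Yes']` and `Xtab.loc['A'].sum()` should give the same values as `sales_099` and `samp_size_099`.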


## 9

Calculate the sample size and number of purchases in group B (the \\$1.99 price point) and save them as samp_size_199 and sales_199, respectively. Then do the same for group C (the \\$4.99 price point) and save them as samp_size_499 and sales_499, respectively.

In [32]:
samp_size_199 = len(abdata[abdata['group'] == 'B'])
print(f'The number of samples in B is {samp_size_199}')
sales_199 = len(abdata[(abdata['group'] == 'B') & (abdata['is_purchase'] == 'Yes')])
print(f'The number of sales in B is {sales_199}')
samp_size_499 = len(abdata[abdata['group'] == 'C'])
print(f'The number of samples in C is {samp_size_499}')
sales_499 = len(abdata[(abdata['group'] == 'C') & (abdata['is_purchase'] == 'Yes')])
print(f'The number of sales in C is {sales_499}')

The number of samples in B is 1666
The number of sales in B is 183
The number of samples in C is 1666
The number of sales in C is 83


## 10
For Group A (\\$0.99 price point), perform a binomial test using binom_test() to see if the observed purchase rate is significantly greater than p_sales_needed_099. Remember that there are four inputs to binom_test():

- x will be the number of purchases for Group A
- n will be the total number of visitors assigned group A
- p will be the target percent of purchases for the \\$0.99 price point
- alternative will indicate the alternative hypothesis for this test; in this case, we want to know if the observed purchase rate is significantly 'greater' than the purchase rate that results in the minimum revenue target.
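To make the 'greater' alternative concrete: the one-sided p-value is the probability of seeing x or more purchases out of n visitors if the true purchase rate were exactly p. Here is a minimal sketch of that sum using only the standard library. The helper `binom_tail_greater` and the toy numbers are hypothetical and are not FarmBurg data.

```python
from math import comb

def binom_tail_greater(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p): the one-sided 'greater' p-value."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))

# Hypothetical toy numbers: 8 purchases from 10 visitors,
# tested against a target purchase rate of 0.5.
toy_pvalue = binom_tail_greater(x=8, n=10, p=0.5)
print(toy_pvalue)  # 56 / 1024 = 0.0546875
```

This direct sum is only practical for small n. For group sizes like the ones in this project, the binomial coefficients become too large to multiply by a float, so the SciPy function is the right tool.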

Save the results to pvalueA, and print its value. Note that you’ll first need to import the binom_test() function from scipy.stats using the following line of code:

In [53]:
from scipy.stats import binom_test
pvalueA = binom_test(x=sales_099, n=samp_size_099, p=p_sales_needed_099, alternative='greater')
print('The sales are significantly higher than the required amount.' if pvalueA < 0.05 else 'The sales are not significantly higher than the required amount.')
print(f'Pvalue for A is {pvalueA}')

The sales are not significantly higher than the required amount.
Pvalue for A is 0.9999999999999999


## 11 & 12
Do this for the other price points.

In [54]:
pvalueB = binom_test(x=sales_199, n=samp_size_199, p=p_sales_needed_199, alternative='greater')
print('The sales are significantly higher than the required amount.' if pvalueB < 0.05 else 'The sales are not significantly higher than the required amount.')
print(f'Pvalue for B is {pvalueB}')
pvalueC = binom_test(x=sales_499, n=samp_size_499, p=p_sales_needed_499, alternative='greater')
print('The sales are significantly higher than the required amount.' if pvalueC < 0.05 else 'The sales are not significantly higher than the required amount.')
print(f'Pvalue for C is {pvalueC}')

The sales are not significantly higher than the required amount.
Pvalue for B is 0.11184562623739903
The sales are significantly higher than the required amount.
Pvalue for C is 0.027944826659907135
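One practical caveat: `binom_test` was deprecated in SciPy 1.10 and removed in SciPy 1.12, so the cells above may fail on a recent installation. A sketch of the same tests for groups B and C with the replacement function `binomtest` (available since SciPy 1.7) follows. It hard-codes the counts from task 9 so that it runs on its own. Note that the count argument is named `k` rather than `x`, and the p-value is an attribute of the returned result object.

```python
from scipy.stats import binomtest

num_visits = 4998  # weekly visitors, from task 4

# Purchases, group size, and price for the two remaining groups (from task 9).
tests = {
    "B": {"sales": 183, "n": 1666, "price": 1.99},
    "C": {"sales": 83, "n": 1666, "price": 4.99},
}

pvalues = {}
for group, t in tests.items():
    p_needed = (1000 / t["price"]) / num_visits  # purchase rate needed for $1000
    result = binomtest(k=t["sales"], n=t["n"], p=p_needed, alternative="greater")
    pvalues[group] = result.pvalue
    print(f"Pvalue for {group} is {result.pvalue}")
```

The p-values should agree with the `binom_test` output above, up to floating-point rounding.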


**Mike:** From this data, it turns out that Brian's test was useful after all, if only via a serendipitous route. Group C (\\$4.99) is the only price point whose purchase rate is significantly higher than the rate needed to reach the \\$1,000 weekly revenue target. Keep in mind that this is only one week's worth of data, so the purchase rate could still dip.
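As a rough sanity check on this conclusion, we can project weekly revenue at each price point. This sketch assumes that every weekly visitor is offered a single price and that the purchase rate observed in the test holds, which is an assumption and not something the data guarantees. The names are illustrative.

```python
num_visits = 4998  # weekly visitors, from task 4

# Observed purchases and group sizes from the contingency table.
observed = {
    "A": {"price": 0.99, "sales": 316, "n": 1666},
    "B": {"price": 1.99, "sales": 183, "n": 1666},
    "C": {"price": 4.99, "sales": 83, "n": 1666},
}

projected_revenue = {}
for group, g in observed.items():
    rate = g["sales"] / g["n"]                        # observed purchase rate
    projected_revenue[group] = rate * num_visits * g["price"]
    print(f"Group {group} (${g['price']:.2f}): projected ${projected_revenue[group]:.2f} per week")
```

The point estimates are about \\$938.52 for A, \\$1092.51 for B, and \\$1242.51 for C. Group B's estimate is above \\$1,000, but the binomial test shows that its margin is not statistically significant, while group C's is.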