
Project: Hypothesis Testing for Microtransactions
Brian is a Product Manager at FarmBurg, a company that makes a farming simulation social network game. In the FarmBurg game, you can plow, plant, and harvest different crops.

Today, you will be acting as Brian's data analyst for an A/B Test that he has been conducting.

Part 1: Testing for Significant Difference
Start by importing the following modules that you'll need for this project:

pandas as pd

Brian tells you that he ran an A/B test with three different groups: A, B, and C. You're kind of busy today, so you don't ask too many questions about the differences between A, B, and C. Maybe they were shown three different versions of an ad. Who cares?

(HINT: you will care later)

Brian gives you a CSV of results called clicks.csv. It has the following columns:

user_id: a unique id for each visitor to the FarmBurg site
group: either A, B, or C depending on which group the visitor was assigned to
click_day: only filled in if the user clicked on a link to purchase
Load clicks.csv into the variable df.

In [3]:
import pandas as pd
df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
print(df.head())

                                user_id group click_day
0  8e27bf9a-5b6e-41ed-801a-a59979c0ca98     A       NaN
1  eb89e6f0-e682-4f79-99b1-161cc1c096f1     A       NaN
2  7119106a-7a95-417b-8c4c-092c12ee5ef7     A       NaN
3  e53781ff-ff7a-4fcd-af1a-adba02b2b954     A       NaN
4  02d48cf1-1ae6-40b3-9d8b-8208884a0904     A  Saturday


Define a new column called is_purchase which is Purchase if click_day is not null and No Purchase if click_day is null (pandas reads the empty click_day cells as NaN rather than None). This will tell us if each visitor clicked on the Purchase link.

In [7]:
import pandas as pd
df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
df['is_purchase'] = df.click_day.apply(lambda x: 'Purchase' if pd.notnull(x) else 'No Purchase')
print(df.head())

                                user_id group click_day  is_purchase
0  8e27bf9a-5b6e-41ed-801a-a59979c0ca98     A       NaN  No Purchase
1  eb89e6f0-e682-4f79-99b1-161cc1c096f1     A       NaN  No Purchase
2  7119106a-7a95-417b-8c4c-092c12ee5ef7     A       NaN  No Purchase
3  e53781ff-ff7a-4fcd-af1a-adba02b2b954     A       NaN  No Purchase
4  02d48cf1-1ae6-40b3-9d8b-8208884a0904     A  Saturday     Purchase
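
A small aside on why the cell above uses pd.notnull rather than a comparison with None: empty cells in a CSV come back from pandas as NaN (a float), not None. The sketch below uses only the standard library to show the difference. label_naive and label_nan_aware are hypothetical helper names for illustration, not part of the project.

```python
import math

def label_naive(click_day):
    # Treats only None as missing, so a NaN slips through as a "purchase".
    return 'Purchase' if click_day is not None else 'No Purchase'

def label_nan_aware(click_day):
    # Treats both None and float NaN as missing, similar in spirit to pd.notnull.
    is_missing = click_day is None or (
        isinstance(click_day, float) and math.isnan(click_day))
    return 'No Purchase' if is_missing else 'Purchase'

missing = float('nan')  # stand-in for an empty click_day cell
naive_result = label_naive(missing)
aware_result = label_nan_aware(missing)
clicked_result = label_nan_aware('Saturday')
print(naive_result, '|', aware_result, '|', clicked_result)
```

The naive version would count every visitor as a purchaser, which is why the NaN-aware check matters here.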


We want to count the number of users who made a purchase from each group. Use groupby to count the number of Purchase and No Purchase from each group. Save your answer to the variable purchase_counts.

Hint: Group by group and is_purchase, then apply the function count to the column user_id.

In [19]:
import pandas as pd
df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
df['is_purchase'] = df.click_day.apply(lambda x: 'Purchase' if pd.notnull(x) else 'No Purchase')
purchase_counts = df.groupby(['group','is_purchase']).user_id.count()

print(purchase_counts)  # group A has the most purchases, then group B; group C has the fewest

group  is_purchase
A      No Purchase    1350
       Purchase        316
B      No Purchase    1483
       Purchase        183
C      No Purchase    1583
       Purchase         83
Name: user_id, dtype: int64
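
For intuition, what the groupby and count above compute can be sketched in plain Python. The rows below are hypothetical, and None stands in for a visitor who never clicked; this is only an illustration of the counting logic, not a replacement for the pandas code.

```python
from collections import Counter

# Hypothetical (group, click_day) rows; None marks a visitor who never clicked.
rows = [('A', None), ('A', 'Saturday'), ('B', None), ('B', None), ('C', 'Monday')]

# Count one entry per visitor, keyed by (group, is_purchase).
counts = Counter(
    (group, 'Purchase' if click_day is not None else 'No Purchase')
    for group, click_day in rows
)

for key in sorted(counts):
    print(key, counts[key])
```

Note that a combination with no visitors simply does not appear, which is also how the groupby result behaves.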


This data is categorical and there are more than 2 conditions, so we'll want to use a chi-squared test to see if there is a significant difference between the three conditions.

Start by filling in the contingency table below with the correct values:

contingency = [[groupA_purchases, groupA_not_purchases],
               [groupB_purchases, groupB_not_purchases],
               [groupC_purchases, groupC_not_purchases]]

In [None]:
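# Rows are groups A, B, C; each row is [No Purchase, Purchase].
# Swapping the two columns would not change the chi-squared result.
contingency = [[1350, 316],
               [1483, 183],
               [1583, 83]]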

Now import the function chi2_contingency from scipy.stats and perform the chi-squared test.

Recall that the p-value is the second output of chi2_contingency.

In [41]:
import pandas as pd
from scipy.stats import chi2_contingency
df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
df['is_purchase'] = df.click_day.apply(lambda x: 'Purchase' if pd.notnull(x) else 'No Purchase')
purchase_counts = df.groupby(['group', 'is_purchase']).user_id.count()

contingency = [[1350, 316],
               [1483, 183],
               [1583, 83]]

chi2, pval_contingency, dof, expected = chi2_contingency(contingency)

print('P-val is {}'.format(pval_contingency))


def pval_hypothesis_test(pval):
    if pval < 0.05:
        print("Reject Null Hypothesis: there is a significant difference between the datasets")
    else:
        print("No significant difference between datasets")
pval_hypothesis_test(pval_contingency)

P-val is 2.4126213546684264e-35
Reject Null Hypothesis: there is a significant difference between the datasets


Great! The difference between the groups is significant, and Group A has the largest share of purchasers (316 of 1666, about 19%).
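
To see where that p-value comes from, the chi-squared statistic can be recomputed by hand from the same counts. This is a minimal standard-library sketch, not how scipy implements it. It relies on the fact that with exactly 2 degrees of freedom the chi-squared tail probability is exp(-x / 2); that shortcut does not hold for other table sizes.

```python
import math

# Observed counts from purchase_counts: [No Purchase, Purchase] for groups A, B, C.
observed = [[1350, 316],
            [1483, 183],
            [1583, 83]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell of the table.
chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (3 - 1) * (2 - 1) = 2

# Valid only because dof == 2: the chi-squared survival function is exp(-x / 2).
p_value = math.exp(-chi2_stat / 2)

purchase_rates = [row[1] / sum(row) for row in observed]
print(chi2_stat, dof, p_value)
print([round(rate, 4) for rate in purchase_rates])
```

The per-group purchase rates printed at the end show the direction of the difference, which the chi-squared test by itself does not report.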

Part 2: Testing for Exceeding a Goal
Your day is a little less busy than you expected, so you decide to ask Brian about his test.

You: Hey Brian! What was that test you were running anyway?

Brian: It was awesome! We are trying to get users to purchase a small FarmBurg upgrade package. It's called a microtransaction. We're not sure how much to charge for it, so we tested three different price points: $0.99, $1.99, and $4.99. It looks like significantly more people bought the upgrade package for $0.99, so I guess that's what we'll charge.

You: Oh no! I should have asked you this before we did that chi-squared test. I don't think that this was the right test at all. It's true that more people wanted to purchase the upgrade at $0.99; you probably expected that. What we really want to know is if each price point allows us to make enough money that we can exceed some target goal. Brian, how much do you think it cost to build this feature?

Brian: Hmm. I guess that we need to generate a minimum of $1000 per week in order to justify this project.

You: We have some work to do!

How many visitors came to the site this week?

Hint: Look at the length of df.

In [47]:
import pandas as pd
df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
total_visitors = df.user_id.count()  # len(df) gives the same number when no user_id is missing
print('Total Visitors: {}'.format(total_visitors))

Total Visitors: 4998


Let's assume that this is how many visitors we generally get each week. Given that, calculate the percent of visitors who would need to purchase the upgrade package at each price point ($0.99, $1.99, $4.99) in order to generate $1000 per week.

In [84]:
# Calculate the number of people who would need to purchase a $0.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.
import pandas as pd
df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
total_visitors = len(df)

goal_revenue = 1000.
price_point_a = 0.99
price_point_b = 1.99
price_point_c = 4.99
number_of_purchases_to_goal = goal_revenue /price_point_a
total_visitors_reach_goal_a = (number_of_purchases_to_goal / total_visitors) * 100

print('Number of Visitors to reach revenue goal using price point 0.99: {}'.format(number_of_purchases_to_goal))
print('Percent of Visitors to reach revenue goal using price point 0.99: {} %'.format(round(total_visitors_reach_goal_a,2)))


Number of Visitors to reach revenue goal using price point 0.99: 1010.1010101010102
Percent of Visitors to reach revenue goal using price point 0.99: 20.21 %


In [83]:
# Calculate the number of people who would need to purchase a $1.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.
import pandas as pd
df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
total_visitors = len(df)

goal_revenue = 1000.
price_point_a = 0.99
price_point_b = 1.99
price_point_c = 4.99
number_of_purchases_to_goal = goal_revenue /price_point_b
total_visitors_reach_goal_b = (number_of_purchases_to_goal / total_visitors) * 100
print('Number of Visitors to reach revenue goal using price point 1.99: {}'.format(number_of_purchases_to_goal))
print('Percent of Visitors to reach revenue goal using price point 1.99: {} %'.format(round(total_visitors_reach_goal_b,2)))

Number of Visitors to reach revenue goal using price point 1.99: 502.51256281407035
Percent of Visitors to reach revenue goal using price point 1.99: 10.05 %


In [82]:
# Calculate the number of people who would need to purchase a $4.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.
import pandas as pd
df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
total_visitors = len(df)

goal_revenue = 1000.
price_point_a = 0.99
price_point_b = 1.99
price_point_c = 4.99
number_of_purchases_to_goal = goal_revenue /price_point_c
total_visitors_reach_goal_c = (number_of_purchases_to_goal / total_visitors) * 100
print('Number of Visitors to reach revenue goal using price point 4.99: {}'.format(number_of_purchases_to_goal))
print('Percent of Visitors to reach revenue goal using price point 4.99: {} %'.format(round(total_visitors_reach_goal_c,2)))

Number of Visitors to reach revenue goal using price point 4.99: 200.40080160320642
Percent of Visitors to reach revenue goal using price point 4.99: 4.01 %


Note that you need a smaller percentage of purchases for higher price points.
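
The three cells above repeat the same arithmetic, so here is a compact sketch that does it in one loop. It hard-codes the 4998 weekly visitors and the $1000 goal from above; the dictionary names are just illustrative.

```python
# Figures taken from the cells above.
total_visitors = 4998
goal_revenue = 1000.0
price_points = {'A': 0.99, 'B': 1.99, 'C': 4.99}

required_percent = {}
for group, price in price_points.items():
    # Purchases needed to hit the goal, as a share of weekly visitors.
    purchases_needed = goal_revenue / price
    required_percent[group] = purchases_needed / total_visitors * 100
    print('Group {} (${}): {:.2f} % of visitors'.format(
        group, price, required_percent[group]))
```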

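Now, for each group, perform a binomial test using binom_test from scipy.stats. (binom_test was removed in SciPy 1.12; on newer versions use scipy.stats.binomtest, which returns a result object whose pvalue attribute holds the p-value.)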

x will be the number of purchases for that group
n will be the total number of visitors assigned to that group
p will be the target percent of purchases for that price point (calculated above)
Recall that:

Group A is the $0.99 price point
Group B is the $1.99 price point
Group C is the $4.99 price point
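
For intuition about what a two-sided binomial test computes from x, n, and p, here is a minimal standard-library sketch: it adds up the probability of every outcome that is no more likely than the observed one. The function names are hypothetical, and this is an illustration under that definition rather than scipy's actual implementation.

```python
import math

def binom_pmf(k, n, p):
    # Work in log space: the binomial coefficient for n = 1666 overflows a float.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def two_sided_binom_pvalue(x, n, p):
    # Sum the probability of every outcome no more likely than the observed one.
    threshold = binom_pmf(x, n, p) * (1 + 1e-7)  # small slack for floating-point ties
    return sum(pmf for pmf in (binom_pmf(k, n, p) for k in range(n + 1))
               if pmf <= threshold)

total_visitors = 4998
target_a = (1000 / 0.99) / total_visitors  # target purchase rate at $0.99
pval_a = two_sided_binom_pvalue(316, 1666, target_a)
print(pval_a)
```

Because the test is two-sided, a small p-value only says the observed rate differs from the target; whether it is above or below still has to be checked against the counts.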

In [None]:
# import the binomial test from scipy.stats here
from scipy.stats import binom_test

In [100]:
# Test group A here
import pandas as pd
from scipy.stats import binom_test

df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
df['is_purchase'] = df.click_day.apply(lambda x: 'Purchase' if pd.notnull(x) else 'No Purchase')
purchase_counts = df.groupby(['group','is_purchase']).user_id.count()

print(purchase_counts)
contingency = [[1350,316],
               [1483,183],
               [1583,83]]
observed_failures_groupA = contingency[0][0]
print('# Failures(GrpA): {}'.format(observed_failures_groupA))
observed_successes_groupA = contingency[0][1]
print('# Successes(GrpA): {}'.format(observed_successes_groupA))
number_trials_groupA = observed_failures_groupA + observed_successes_groupA
print('Total Trials(GrpA): {}'.format(number_trials_groupA))

total_visitors = len(df)

goal_revenue = 1000.
price_point_a = 0.99
price_point_b = 1.99
price_point_c = 4.99
number_of_purchases_to_goal = goal_revenue /price_point_a

probability_to_goal_grpA = number_of_purchases_to_goal / total_visitors
print('Probability to goal(GrpA): {}'.format(probability_to_goal_grpA))
# binomial test(observed successes, number of trials, probability)
pval_group_a = binom_test(observed_successes_groupA,n=number_trials_groupA,p=probability_to_goal_grpA)

print('P-Val for GroupA: {}'.format(pval_group_a))


def pval_hypothesis_test(pval):
    if pval < 0.05:
        print("Reject Null Hypothesis: there is a significant difference between the datasets")
    else:
        print("No significant difference between datasets")
pval_hypothesis_test(pval_group_a)


group  is_purchase
A      No Purchase    1350
       Purchase        316
B      No Purchase    1483
       Purchase        183
C      No Purchase    1583
       Purchase         83
Name: user_id, dtype: int64
# Failures(GrpA): 1350
# Successes(GrpA): 316
Total Trials(GrpA): 1666
Probability to goal(GrpA): 0.20210104243717691
P-Val for GroupA: 0.2111287299402726
No significant difference between datasets


In [103]:
# Test group B here
import pandas as pd
from scipy.stats import binom_test

df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
df['is_purchase'] = df.click_day.apply(lambda x: 'Purchase' if pd.notnull(x) else 'No Purchase')
purchase_counts = df.groupby(['group','is_purchase']).user_id.count()

print(purchase_counts)
contingency = [[1350,316],
               [1483,183],
               [1583,83]]
observed_failures_groupB = contingency[1][0]
print('# Failures(GrpB): {}'.format(observed_failures_groupB))
observed_successes_groupB = contingency[1][1]
print('# Successes(GrpB): {}'.format(observed_successes_groupB))
number_trials_groupB = observed_failures_groupB + observed_successes_groupB
print('Total Trials(GrpB): {}'.format(number_trials_groupB))

total_visitors = len(df)

goal_revenue = 1000.
price_point_a = 0.99
price_point_b = 1.99
price_point_c = 4.99
number_of_purchases_to_goal = goal_revenue /price_point_b

probability_to_goal_grpB = number_of_purchases_to_goal / total_visitors
print('Probability to goal(GrpB): {}'.format(probability_to_goal_grpB))
# binomial test(observed successes, number of trials, probability)
pval_group_b = binom_test(observed_successes_groupB,n=number_trials_groupB,p=probability_to_goal_grpB)

print('P-Val for GroupB: {}'.format(pval_group_b))


def pval_hypothesis_test(pval):
    if pval < 0.05:
        print("Reject Null Hypothesis: there is a significant difference between the datasets")
    else:
        print("No significant difference between datasets")
pval_hypothesis_test(pval_group_b)

group  is_purchase
A      No Purchase    1350
       Purchase        316
B      No Purchase    1483
       Purchase        183
C      No Purchase    1583
       Purchase         83
Name: user_id, dtype: int64
# Failures(GrpB): 1483
# Successes(GrpB): 183
Total Trials(GrpB): 1666
Probability to goal(GrpB): 0.10054272965467594
P-Val for GroupB: 0.20660209246555486
No significant difference between datasets


In [104]:
# Test group C here
import pandas as pd
from scipy.stats import binom_test

df = pd.read_csv(r'C:\Users\Jessica\PycharmProjects\practice\microtransactions\clicks.csv')
df['is_purchase'] = df.click_day.apply(lambda x: 'Purchase' if pd.notnull(x) else 'No Purchase')
purchase_counts = df.groupby(['group','is_purchase']).user_id.count()

print(purchase_counts)
contingency = [[1350,316],
               [1483,183],
               [1583,83]]
# Group C is the third row of the table (index 2)
observed_failures_groupC = contingency[2][0]
print('# Failures(GrpC): {}'.format(observed_failures_groupC))
observed_successes_groupC = contingency[2][1]
print('# Successes(GrpC): {}'.format(observed_successes_groupC))
number_trials_groupC = observed_failures_groupC + observed_successes_groupC
print('Total Trials(GrpC): {}'.format(number_trials_groupC))

total_visitors = len(df)

goal_revenue = 1000.
price_point_a = 0.99
price_point_b = 1.99
price_point_c = 4.99
number_of_purchases_to_goal = goal_revenue /price_point_c

probability_to_goal_grpC = number_of_purchases_to_goal / total_visitors
print('Probability to goal(GrpC): {}'.format(probability_to_goal_grpC))
# binomial test(observed successes, number of trials, probability)
pval_group_c = binom_test(observed_successes_groupC,n=number_trials_groupC,p=probability_to_goal_grpC)

print('P-Val for GroupC: {}'.format(pval_group_c))


def pval_hypothesis_test(pval):
    if pval < 0.05:
        print("Reject Null Hypothesis: there is a significant difference between the datasets")
    else:
        print("No significant difference between datasets")
pval_hypothesis_test(pval_group_c)

group  is_purchase
A      No Purchase    1350
       Purchase        316
B      No Purchase    1483
       Purchase        183
C      No Purchase    1583
       Purchase         83
Name: user_id, dtype: int64
# Failures(GrpC): 1583
# Successes(GrpC): 83
Total Trials(GrpC): 1666
Probability to goal(GrpC): 0.040096198800161346
Reject Null Hypothesis: there is a significant difference between the datasets


If any group's binomial test gives p < 0.05 and its observed purchase rate is above the target, then we can be confident that enough people will buy the upgrade package at that price point to justify the feature.

Brian should go with price point C ($4.99). Group C's observed purchase rate (83 of 1666, about 4.98%) is above the 4.01% target, so we can be confident that enough people will purchase the upgrade package at this price point to meet the $1000-per-week revenue goal that justifies the project.

This goes to show that a lower price that attracts more buyers doesn't necessarily mean you will meet your revenue goals. What matters is whether enough customers buy at a given price to reach the goal. It's important to consider tradeoffs!
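
To make the tradeoff concrete, here is a rough sketch that turns each group's observed purchase rate into a projected weekly revenue. It assumes (and this is only an assumption) that all 4998 weekly visitors would see a single price and buy at the rate observed in that group.

```python
# Observed purchases per group (from purchase_counts) and the price each group saw.
total_visitors = 4998
goal_revenue = 1000.0
groups = {
    'A': {'purchases': 316, 'visitors': 1666, 'price': 0.99},
    'B': {'purchases': 183, 'visitors': 1666, 'price': 1.99},
    'C': {'purchases': 83, 'visitors': 1666, 'price': 4.99},
}

projected_revenue = {}
for name, g in groups.items():
    observed_rate = g['purchases'] / g['visitors']
    target_rate = goal_revenue / g['price'] / total_visitors
    # Assumption: every weekly visitor sees this one price and buys at the observed rate.
    projected_revenue[name] = observed_rate * total_visitors * g['price']
    print('{}: observed {:.2%} vs target {:.2%}, projected ${:.2f}/week'.format(
        name, observed_rate, target_rate, projected_revenue[name]))
```

These are point estimates only. The binomial tests above are what tell us how much confidence the data supports for each price point.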