# Project: Hypothesis Testing for Microtransactions
Brian is a Product Manager at FarmBurg, a company that makes a farming simulation social network game.  In the FarmBurg game, you can plow, plant, and harvest different crops.

Today, you will be acting as Brian's data analyst for an A/B Test that he has been conducting.

## Part 1: Testing for Significant Difference

Start by importing the following modules that you'll need for this project:
- `pandas` as `pd`

In [5]:
import pandas as pd

Brian tells you that he ran an A/B test with three different groups: A, B, and C.  You're kind of busy today, so you don't ask too many questions about the differences between A, B, and C.  Maybe they were shown three different versions of an ad.  Who cares?

(HINT: you will care later)

Brian gives you a CSV of results called `clicks.csv`.  It has the following columns:
- `user_id`: a unique id for each visitor to the FarmBurg site
- `group`: either `A`, `B`, or `C` depending on which group the visitor was assigned to
- `click_day`: only filled in *if* the user clicked on a link to purchase

Load `clicks.csv` into the variable `df`.

In [5]:
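# Re-importing so this cell runs on its own. Imports do carry over between cells,
# but only within one kernel session and only once the importing cell has been run.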
import pandas as pd

# I've had no exposure to `pandas`. The instructions below presume the CSV
# file has been loaded into some sort of data structure. It seems to have
# methods associated with it so it's not just the `csv` module's `DictReader`,
# or something like it.
#
# I Googled something like "pandas groupby" and found the documentation for
# the `DataFrame` class. Its abilities fit the usage below. (Also the
# variable name "df" seems to be a hint.) While reading the guide on 
# DataFrames I found pandas' `read_csv` function. I realized I could
# use that instead of using the `DictReader` and creating a DataFrame from
# that.
df = pd.read_csv('clicks.csv')

Define a new column called `is_purchase` which is `Purchase` if `click_day` is not `None` and `No Purchase` if `click_day` is `None`.  This will tell us if each visitor clicked on the Purchase link.
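One vectorized way to build a column like this, shown as a minimal sketch rather than the required solution. The three-row `sample` frame is made up for illustration and stands in for `clicks.csv`. It assumes missing `click_day` values arrive as `NaN`, which is how `read_csv` represents empty fields by default.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for clicks.csv; a missing click_day is NaN.
sample = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "group": ["A", "B", "C"],
    "click_day": [np.nan, "Saturday", np.nan],
})

# notna() is evaluated element-wise over the whole Series, and np.where
# picks a label for each row from the resulting boolean mask.
sample["is_purchase"] = np.where(
    sample["click_day"].notna(), "Purchase", "No Purchase"
)

print(sample["is_purchase"].tolist())
```

Because `notna()` works on `object` columns as well as numeric ones, it avoids the type problems that come up when calling `math.isnan` or `numpy.isnan` on a column of day names.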

In [57]:
import math

# The most straightforward way to create the new column would be something like this:
# (using the `assign` method of DataFrames)
#
# import math
# df = df.assign(is_purchase = lambda x: 'No Purchase' if math.isnan(x['click_day']) else 'Purchase')
#
# NOTE: when missing, the rows in the click_day column are coming through as NaN, instead of as None
# like the instructions say.
#
# However, that would be attempting to check if the entire click_day Series is NaN, which obviously
# doesn't work. I tried using numpy's `isnan` but that didn't like the pandas.Series as an input.
# I tried converting the Series to a numpy array and a list but those also didn't work. According
# to these docs: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.isnan.html
# I would expect them to. (Probably the cause: the column mixes day-name strings with NaN, so it
# has dtype `object`, and `numpy.isnan` only accepts numeric input.)
#
# Instead, I'm just going to create a new column with a list.
#

is_purchase = ['No Purchase' if isinstance(x, float) and math.isnan(x) else 'Purchase' for x in df['click_day']]
# Assign with brackets: attribute-style assignment (df.is_purchase = ...) does not create a new column.
df['is_purchase'] = is_purchase

print(df)
# examining the output shows that that worked

                                   user_id group  click_day  is_purchase
0     8e27bf9a-5b6e-41ed-801a-a59979c0ca98     A        NaN  No Purchase
1     eb89e6f0-e682-4f79-99b1-161cc1c096f1     A        NaN  No Purchase
2     7119106a-7a95-417b-8c4c-092c12ee5ef7     A        NaN  No Purchase
3     e53781ff-ff7a-4fcd-af1a-adba02b2b954     A        NaN  No Purchase
4     02d48cf1-1ae6-40b3-9d8b-8208884a0904     A   Saturday     Purchase
5     5a3ca2d6-25d5-4909-8f07-519f71ee55e8     A        NaN  No Purchase
6     6b929341-1336-4c34-965b-92e368ab160b     A        NaN  No Purchase
7     90b0a07b-e20e-4e0a-872e-5cc303c5676b     A        NaN  No Purchase
8     4b16c922-b2ab-48a8-885c-713ebf0ae159     A        NaN  No Purchase
9     5eb5fc03-fbda-4149-b909-4f5fbc6b152f     A        NaN  No Purchase
10    389ff492-4635-4535-8e42-685f771fccb1     A        NaN  No Purchase
11    1d25885b-56c7-4fdd-bee8-6348c1386bf0     A   Thursday     Purchase
12    4546807b-8211-4e7b-94cf-4f9c879e284b     A   

We want to count the number of users who made a purchase from each group.  Use `groupby` to count the number of `Purchase` and `No Purchase` from each `group`.  Save your answer to the variable `purchase_counts`.

**Hint**: Group by `group` and `is_purchase` and apply the function `count` to the column `user_id`.

In [78]:
# following the examples here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
purchase_counts = df.groupby(['group', 'is_purchase'])['user_id'].count()
print(purchase_counts)

# okay, that worked, but how do we access the data directly?
# print(purchase_counts['A'][0])  # okay, that's the "No Purchase" for group A
# print(purchase_counts['A'][1])  # okay, that's the "Purchase" for group A

group  is_purchase
A      No Purchase    1350
       Purchase        316
B      No Purchase    1483
       Purchase        183
C      No Purchase    1583
       Purchase         83
Name: user_id, dtype: int64


This data is *categorical* and there are *more than 2* conditions, so we'll want to use a chi-squared test to see if there is a significant difference between the three conditions.

Start by filling in the contingency table below with the correct values:
```py
contingency = [[groupA_purchases, groupA_not_purchases],
               [groupB_purchases, groupB_not_purchases],
               [groupC_purchases, groupC_not_purchases]]
```

In [82]:
pc = purchase_counts

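# Within each group the rows are sorted alphabetically by label, so
# position 0 is "No Purchase" and position 1 is the purchase row.
# .iloc makes the positional lookup explicit.
gA_purch    = pc['A'].iloc[1]
gA_no_purch = pc['A'].iloc[0]
gB_purch    = pc['B'].iloc[1]
gB_no_purch = pc['B'].iloc[0]
gC_purch    = pc['C'].iloc[1]
gC_no_purch = pc['C'].iloc[0]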

contingency = [[gA_purch, gA_no_purch],
               [gB_purch, gB_no_purch],
               [gC_purch, gC_no_purch]]

# is that set up correctly?
print(contingency)
# yes...

[[316, 1350], [183, 1483], [83, 1583]]
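Positional lookups depend on the sort order of the labels. As an alternative, here is a minimal sketch that builds the same kind of table by label with `pd.crosstab`. The six-row `sample` frame and the name `contingency_sketch` are made up for illustration; the column names mirror the ones used in this project.

```python
import pandas as pd

# Made-up rows for illustration; only the two columns needed for the table.
sample = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "C"],
    "is_purchase": [
        "Purchase", "No Purchase", "No Purchase",
        "Purchase", "No Purchase", "No Purchase",
    ],
})

# crosstab counts rows for every (group, is_purchase) pair and fills
# combinations that never occur with 0.
counts = pd.crosstab(sample["group"], sample["is_purchase"])

# Selecting columns by label fixes the order as [purchases, non-purchases]
# regardless of how the labels happen to sort.
contingency_sketch = counts[["Purchase", "No Purchase"]].values.tolist()

print(contingency_sketch)
```

The resulting nested list has one row per group, in the same `[purchases, non-purchases]` layout as the template above.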


Now import the function `chi2_contingency` from `scipy.stats` and perform the chi-squared test.

Recall that the *p-value* is the second output of `chi2_contingency`.

In [45]:
from scipy.stats import chi2_contingency

In [93]:
# re-importing so this cell runs on its own (e.g. after a kernel restart)
from scipy.stats import chi2_contingency

cont_results = chi2_contingency(contingency)
p = cont_results[1]

print(cont_results)
print(p)

# hmm.. what do those results mean?
# I'm not well-versed in the statistics math referenced here: 
# https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html
# but I can see that the p-value is very small and, according to this: 
# http://www.dummies.com/education/math/statistics/what-a-p-value-tells-you-about-statistical-data/
# a small p-value is strong evidence against the null hypothesis.
#
# For a chi-squared test on a contingency table, the null hypothesis is that
# the purchase rate is the same in every group. Rejecting it tells us that at
# least one group differs; it does not, on its own, say which one. The counts
# above show that Group A has the highest purchase rate.


(159.41952879874498, 2.4126213546684264e-35, 2, array([[ 194., 1472.],
       [ 194., 1472.],
       [ 194., 1472.]]))
2.4126213546684264e-35
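To make the four outputs less opaque, the sketch below recomputes them by hand in plain Python from the observed counts printed earlier. It illustrates the arithmetic behind Pearson's chi-squared test and is not a description of how SciPy implements it. The closed-form p-value used here holds only for 2 degrees of freedom, which is what a 3x2 table has.

```python
import math

# Observed counts copied from the contingency table above:
# [purchases, non-purchases] for groups A, B, C.
observed = [[316, 1350], [183, 1483], [83, 1583]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell if purchasing were independent of group.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1) = 2 for a 3x2 table.
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# For exactly 2 degrees of freedom the chi-squared tail probability has
# the closed form exp(-chi2 / 2); other values of dof need a library call.
p_value = math.exp(-chi2 / 2)

print(expected[0], chi2, dof, p_value)
```

The expected row `[194.0, 1472.0]` is what each group would look like if all three shared the pooled purchase rate of 582 / 4998. The statistic measures how far the observed counts sit from that.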


Great! It looks like a significantly greater portion of users from Group A made a purchase.

## Part 2: Testing for Exceeding a Goal

Your day is a little less busy than you expected, so you decide to ask Brian about his test.

**You**: Hey Brian! What was that test you were running anyway?

**Brian**: It was awesome! We are trying to get users to purchase a small FarmBurg upgrade package.  It's called a microtransaction.  We're not sure how much to charge for it, so we tested three different price points: \$0.99, \$1.99, and \$4.99.  It looks like significantly more people bought the upgrade package for \$0.99, so I guess that's what we'll charge.

**You**: Oh no! I should have asked you this before we did that chi-squared test.  I don't think that this was the right test at all.  It's true that more people wanted to purchase the upgrade at \$0.99; you probably expected that.  What we really want to know is if each price point allows us to make enough money that we can exceed some target goal.  Brian, how much do you think it cost to build this feature?

**Brian**: Hmm.  I guess that we need to generate a minimum of \$1000 per week in order to justify this project.

**You**: We have some work to do!

How many visitors came to the site this week?

Hint: Look at the length of `df`.

In [109]:
num_visitors_each_week = df.shape[0]  # recommended on Stack Overflow because df.count ignores null values
print(num_visitors_each_week)

4998


Let's assume that this is how many visitors we generally get each week.  Given that, calculate the percent of visitors who would need to purchase the upgrade package at each price point (\$0.99, \$1.99, \$4.99) in order to generate \$1000 per week.

In [120]:
# Calculate the number of people who would need to purchase a $0.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.

project_min_each_week = 1000.

# the equations would look like this according to the instructions above.
# $1000 = $0.99 * x
# solve for x
# x = $1000 / $0.99
# y = x / num_visitors_each_week

percent_for_0_99 = project_min_each_week / 0.99 / num_visitors_each_week
print("{:.3}%".format(percent_for_0_99 * 100))

20.2%


In [121]:
# Calculate the number of people who would need to purchase a $1.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.

percent_for_1_99 = project_min_each_week / 1.99 / num_visitors_each_week
print("{:.3}%".format(percent_for_1_99 * 100))

10.1%


In [123]:
# Calculate the number of people who would need to purchase a $4.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.

percent_for_4_99 = project_min_each_week / 4.99 / num_visitors_each_week
print("{:.3}%".format(percent_for_4_99 * 100))

4.01%


Note that you need a smaller percentage of purchases for higher price points.

Now, for each group, perform a binomial test using `binom_test` from `scipy.stats`.
- `x` will be the number of purchases for that group
- `n` will be the total number of visitors assigned to that group
- `p` will be the target percent of purchases for that price point (calculated above)

Recall that:
- Group `A` is the \$0.99 price point
- Group `B` is the \$1.99 price point
- Group `C` is the \$4.99 price point

In [39]:
# import the binomial test from scipy.stats here
from scipy.stats import binom_test

In [136]:
# Test group A here
from scipy.stats import binom_test

p_A = binom_test(gA_purch, gA_purch + gA_no_purch, percent_for_0_99)
# print the passed value first for easy visual scanning (the reader
# is thinking "did it pass, yes/no?") and then the actual value for
# more information.
print(p_A < 0.05, "{:.2}".format(p_A))

False 0.21


In [134]:
# Test group B here
from scipy.stats import binom_test

p_B = binom_test(gB_purch, gB_purch + gB_no_purch, percent_for_1_99)
print(p_B < 0.05, "{:.2}".format(p_B))

False 0.21


In [135]:
# Test group C here
from scipy.stats import binom_test

p_C = binom_test(gC_purch, gC_purch + gC_no_purch, percent_for_4_99)
print(p_C < 0.05, "{:.2}".format(p_C))

True 0.046


If any of the groups passed the binomial test with $p < 0.05$, then we can be confident that enough people will buy the upgrade package at that price point to justify the feature.
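The test as called above is two-sided, so a small p-value says the purchase rate differs from the target but not in which direction. The sketch below is a plain-Python illustration of a one-sided check: the probability of seeing at least the observed number of purchases if the true rate were exactly the target. The helper name `binom_upper_tail` is made up for this example. The counts (1666 visitors per group, 4998 per week) come from the outputs earlier in this notebook. Recent SciPy releases also provide `scipy.stats.binomtest`, whose `alternative` argument serves the same purpose.

```python
import math

def binom_upper_tail(x, n, p):
    """Return P(X >= x) for X ~ Binomial(n, p), summed in log space."""
    total = 0.0
    for k in range(x, n + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k); lgamma avoids huge integers.
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return total

n_per_group = 1666       # visitors assigned to each group
weekly_visitors = 4998   # total visitors in the week
goal = 1000.0            # weekly revenue target in dollars

# group -> (observed purchases, price point)
groups = {"A": (316, 0.99), "B": (183, 1.99), "C": (83, 4.99)}

one_sided = {}
for name, (purchases, price) in groups.items():
    target_rate = goal / price / weekly_visitors
    one_sided[name] = binom_upper_tail(purchases, n_per_group, target_rate)
    print(name, round(purchases / n_per_group, 4), round(target_rate, 4),
          one_sided[name] < 0.05)
```

Printing the observed rate next to the target rate makes the direction of each difference visible, which the two-sided p-value alone does not.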

Which price point should Brian go with?  Did this surprise you?

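Brian should go with the \$4.99 price point. Group C is the only group whose purchase rate differs significantly from its target ($p \approx 0.046$), and the difference is in the right direction: 83 of 1,666 visitors (about 5.0%) purchased, against a target of about 4.0%. Groups A and B both returned $p \approx 0.21$, so we cannot conclude that either one beats its target; Group A's observed rate (about 19.0%) is actually below its 20.2% target. This is surprising at first, because the obvious (and true) assumption is that fewer people will buy the more expensive package. What the exercise shows is that you have to weigh fewer purchases against more money per purchase, and the relationship between the two is not necessarily linear.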
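The trade-off can be made concrete with a rough projection of weekly revenue at each price. This is a back-of-the-envelope sketch, not part of the assigned analysis. It assumes that all 4998 weekly visitors would see a single price and convert at the rate observed in that price's test group. It gives point estimates only and ignores sampling uncertainty.

```python
weekly_visitors = 4998   # total visitors in the week
group_size = 1666        # visitors assigned to each test group
goal = 1000.0            # weekly revenue target in dollars

# group -> (price point, observed purchases)
observed = {"A": (0.99, 316), "B": (1.99, 183), "C": (4.99, 83)}

projected = {}
for name, (price, purchases) in observed.items():
    conversion_rate = purchases / group_size
    # Scale the group's conversion rate up to a full week of traffic.
    projected[name] = weekly_visitors * conversion_rate * price
    print(f"{name}: ${projected[name]:.2f} per week, "
          f"meets goal: {projected[name] >= goal}")
```

Group B's point estimate lands above \$1000, yet its binomial test was not significant. The projection alone cannot tell us whether that margin is real or sampling noise, which is why the test is needed.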

It's a good thing we analyzed the results according to what he was actually looking for!