# Project: Hypothesis Testing for Microtransactions
Brian is a Product Manager at FarmBurg, a company that makes a farming simulation social network game.  In the FarmBurg game, you can plow, plant, and harvest different crops.

Today, you will be acting as Brian's data analyst for an A/B Test that he has been conducting.

## Part 1: Testing for Significant Difference

Start by importing the following modules that you'll need for this project:
- `pandas` as `pd`

In [1]:
import pandas as pd

Brian tells you that he ran an A/B test with three different groups: A, B, and C.  You're kind of busy today, so you don't ask too many questions about the differences between A, B, and C.  Maybe they were shown three different versions of an ad.  Who cares?

(HINT: you will care later)

Brian gives you a CSV of results called `clicks.csv`.  It has the following columns:
- `user_id`: a unique id for each visitor to the FarmBurg site
- `group`: either `A`, `B`, or `C` depending on which group the visitor was assigned to
- `click_day`: only filled in *if* the user clicked on a link to purchase

Load `clicks.csv` into the variable `df`.

In [2]:
df = pd.read_csv('clicks.csv')

Define a new column called `is_purchase` which is `Purchase` if `click_day` is not null and `No Purchase` if `click_day` is null (missing).  This will tell us if each visitor clicked on the Purchase link.

In [3]:
# Take a row and return "No Purchase" if click_day is missing, otherwise "Purchase".
def is_purchased(row):
    if pd.isnull(row['click_day']):
        return "No Purchase"
    else:
        return "Purchase"

# Apply is_purchased to every row in df (axis=1 passes one row at a time)
df['is_purchase'] = df.apply(is_purchased, axis=1)
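The same column can be built without a row-by-row function. This is a minimal sketch, assuming a small made-up sample in place of `clicks.csv` (the `sample` frame and its values are hypothetical). It uses `Series.notna()` with `numpy.where`, which usually runs faster on large frames than `apply` with `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample standing in for clicks.csv
sample = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "group": ["A", "B", "C"],
    "click_day": ["Saturday", None, np.nan],
})

# notna() is True where click_day holds a value; np.where picks the label per row
sample["is_purchase"] = np.where(
    sample["click_day"].notna(), "Purchase", "No Purchase"
)
print(sample)
```

Both `None` and `NaN` count as missing here, which matches what `pd.isnull` does in the function above.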

We want to count the number of users who made a purchase from each group.  Use `groupby` to count the number of `Purchase` and `No Purchase` from each `group`.  Save your answer to the variable `purchase_counts`.

**Hint**: Group by `group` and `is_purchase`, then call the function `count` on the column `user_id`.

In [4]:
#groupby the group and is_purchase columns, and run the count() function on the result's 'user_id' column
purchase_counts = df.groupby(['group', 'is_purchase'])['user_id'].count()
purchase_counts

group  is_purchase
A      No Purchase    1350
       Purchase        316
B      No Purchase    1483
       Purchase        183
C      No Purchase    1583
       Purchase         83
Name: user_id, dtype: int64

This data is *categorical* and there are *more than 2* conditions, so we'll want to use a chi-squared test to see if there is a significant difference between the three conditions.

Start by filling in the contingency table below with the correct values:
```py
contingency = [[groupA_purchases, groupA_not_purchases],
               [groupB_purchases, groupB_not_purchases],
               [groupC_purchases, groupC_not_purchases]]
```

In [5]:
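# Counts read off the purchase_counts output above
groupA_purchases = 316
groupA_not_purchases = 1350
groupB_purchases = 183
groupB_not_purchases = 1483
groupC_purchases = 83
groupC_not_purchases = 1583

# One row per group: [purchases, non-purchases]
contingency = [[groupA_purchases, groupA_not_purchases],
               [groupB_purchases, groupB_not_purchases],
               [groupC_purchases, groupC_not_purchases]]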
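Typing the counts by hand works, but they can also be derived from a grouped Series like `purchase_counts`. The sketch below rebuilds that Series from the numbers printed earlier (so it runs without `clicks.csv`) and reshapes it with `unstack`. The names `index` and `table` are illustrative.

```python
import pandas as pd

# Rebuild a Series shaped like purchase_counts from the output shown above
index = pd.MultiIndex.from_product(
    [["A", "B", "C"], ["No Purchase", "Purchase"]],
    names=["group", "is_purchase"],
)
purchase_counts = pd.Series(
    [1350, 316, 1483, 183, 1583, 83], index=index, name="user_id"
)

# Pivot is_purchase into columns, then order them as [Purchase, No Purchase]
table = purchase_counts.unstack("is_purchase")[["Purchase", "No Purchase"]]
contingency = table.values.tolist()
print(contingency)
```

Deriving the table this way avoids transcription mistakes if the data file changes.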

Now import the function `chi2_contingency` from `scipy.stats` and perform the chi-squared test.

Recall that the *p-value* is the second output of `chi2_contingency`.

In [6]:
from scipy.stats import chi2_contingency

In [7]:
chi2_contingency(contingency)

(159.41952879874498, 2.4126213546684264e-35, 2, array([[  194.,  1472.],
        [  194.,  1472.],
        [  194.,  1472.]]))
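The raw tuple is easier to read once it is unpacked. The sketch below names the four outputs and compares the p-value against the conventional 0.05 threshold. The variable names and the threshold constant are choices made for illustration.

```python
from scipy.stats import chi2_contingency

# Rows are groups A, B, C; columns are [purchases, non-purchases]
contingency = [[316, 1350],
               [183, 1483],
               [83, 1583]]

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under the null hypothesis of no difference
chi2, p_value, dof, expected = chi2_contingency(contingency)

ALPHA = 0.05  # conventional significance threshold
significant = p_value < ALPHA
print(f"chi2={chi2:.2f}, p={p_value:.3g}, dof={dof}, significant={significant}")
```

Each row of `expected` is the same because the three groups have the same size (1666 visitors), so the null hypothesis predicts the same number of purchases in each.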

Great! The p-value is far below 0.05, so the purchase rates differ significantly across the three groups, and Group A has the highest rate (316 of 1666 visitors, about 19%).

## Part 2: Testing for Exceeding a Goal

Your day is a little less busy than you expected, so you decide to ask Brian about his test.

**You**: Hey Brian! What was that test you were running anyway?

**Brian**: It was awesome! We are trying to get users to purchase a small FarmBurg upgrade package.  It's called a microtransaction.  We're not sure how much to charge for it, so we tested three different price points: \$0.99, \$1.99, and \$4.99.  It looks like significantly more people bought the upgrade package for \$0.99, so I guess that's what we'll charge.

**You**: Oh no! I should have asked you this before we did that chi-squared test.  I don't think that this was the right test at all.  It's true that more people wanted to purchase the upgrade at \$0.99; you probably expected that.  What we really want to know is if each price point allows us to make enough money that we can exceed some target goal.  Brian, how much do you think it cost to build this feature?

**Brian**: Hmm.  I guess that we need to generate a minimum of \$1000 per week in order to justify this project.

**You**: We have some work to do!

How many visitors came to the site this week?

Hint: Look at the length of `df`.

In [8]:
#the number of visitors is the number of rows, i.e. the length of the user_id column assuming
#all user_id's are unique
numVisitors = len(df['user_id'])

Let's assume that this is how many visitors we generally get each week.  Given that, calculate the percent of visitors who would need to purchase the upgrade package at each price point (\$0.99, \$1.99, \$4.99) in order to generate \$1000 per week.

In [9]:
# Calculate the number of people who would need to purchase a $0.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.
numRequired99 = 1000/0.99
groupAPercent = numRequired99/numVisitors
groupAPercent

0.20210104243717691

In [10]:
# Calculate the number of people who would need to purchase a $1.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.
numRequired199 = 1000/1.99
groupBPercent = numRequired199/numVisitors
groupBPercent

0.10054272965467594

In [11]:
# Calculate the number of people who would need to purchase a $4.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.
numRequired499 = 1000/4.99
groupCPercent = numRequired499/numVisitors
groupCPercent

0.040096198800161346
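The three cells above repeat the same arithmetic, so it can be collapsed into one loop. This sketch assumes 4998 weekly visitors, which is the total of the counts in Part 1 (three groups of 1666). The names `weekly_target`, `prices`, and `required_rate` are illustrative.

```python
num_visitors = 4998    # assumed: 3 groups x 1666 visitors, from the Part 1 counts
weekly_target = 1000   # dollars per week that Brian needs

# Price shown to each test group
prices = {"A": 0.99, "B": 1.99, "C": 4.99}

# Fraction of visitors who must buy = (purchases needed) / (visitors)
required_rate = {
    group: (weekly_target / price) / num_visitors
    for group, price in prices.items()
}

for group, rate in required_rate.items():
    print(f"Group {group}: {rate:.4f} ({rate:.2%} of visitors)")
```

Note that these values are fractions of 1, so 0.2021 means about 20.2% of visitors.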

In [12]:
# Import the binomial test from scipy.stats.
# Note: binom_test was removed in SciPy 1.12; newer versions provide binomtest instead.
from scipy.stats import binom_test

In [17]:
# Test group A here: (successes, number of trials, required purchase rate)
binom_test(groupA_purchases, groupA_purchases + groupA_not_purchases, groupAPercent)

0.2111287299402726

In [14]:
# Test group B here
binom_test(groupB_purchases, groupB_purchases + groupB_not_purchases, groupBPercent)

0.20660209246555486

In [16]:
# Test group C here
binom_test(groupC_purchases, groupC_purchases + groupC_not_purchases, groupCPercent)

0.045623672477172125
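The p-values above are two-sided, so a small value only says that a group's purchase rate differs from the required rate, not in which direction. Since the question is whether a group exceeds its target, a one-sided check is a useful companion. The sketch below is a swapped-in approach, not the notebook's method: it computes the exact upper-tail probability with `scipy.stats.binom.sf`, assuming 4998 weekly visitors as before.

```python
from scipy.stats import binom

num_visitors = 4998   # assumed weekly visitors (3 groups x 1666)
group_size = 1666     # visitors per test group
purchases = {"A": 316, "B": 183, "C": 83}
prices = {"A": 0.99, "B": 1.99, "C": 4.99}

p_greater = {}
for group, k in purchases.items():
    required = (1000 / prices[group]) / num_visitors
    # sf(k - 1) = P(X >= k) when the true purchase rate equals the required rate
    p_greater[group] = binom.sf(k - 1, group_size, required)
    observed = k / group_size
    print(f"Group {group}: observed={observed:.4f}, required={required:.4f}, "
          f"one-sided p={p_greater[group]:.4f}")
```

Group A's observed rate is below its required rate, so its one-sided p-value is large. Only Group C falls under 0.05, which agrees with the two-sided result.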

If any of the groups passed the binomial test with $p < 0.05$, and that group's observed purchase rate is above the required rate, then we can be confident that enough people will buy the upgrade package at that price point to justify the feature.

Which price point should Brian go with?  Did this surprise you?

In [None]:
'''
Brian should go with the 4.99 price point. It is the only group whose p-value (about 0.046) is under the 0.05
significance threshold, which indicates that we can trust that we will make over 1000 dollars per week at this purchase rate.

This is slightly surprising, although it makes sense when I think about it. Fewer people buy a more expensive
package, but the price is high enough to more than make up for the smaller number of buyers. So, as long as
enough people are willing to pay the higher price, everything works out.

I guess this is slightly surprising because I have seen a lot of companies and apps use smaller microtransaction
amounts, like 0.99 dollars. Now that I think about it, though, they usually offer several different packages, so
those willing to pay the higher amount do so and everyone else falls back to the lower prices. Also, other
companies have probably done analyses like this one to determine the most effective price points.
'''