# Assignment 4
## Econ 8310 - Business Forecasting

This assignment uses the Bayesian statistical models covered in Lessons 10 to 12.

A/B Testing is a critical concept in data science, and for many companies one of the most relevant applications of data-driven decision-making. In order to improve product offerings, marketing campaigns, user interfaces, and many other user-facing interactions, scientists and engineers create experiments to determine the efficacy of proposed changes. Users are then randomly assigned to either the treatment or control group, and their behavior is recorded.
If the changes shown to the treatment group produce a measurable benefit in the metric of interest, those changes are scaled up and rolled out across all interactions.
Below is a short video detailing the A/B Testing process, in case you want to learn a bit more:
[https://youtu.be/DUNk4GPZ9bw](https://youtu.be/DUNk4GPZ9bw)

For this assignment, you will use an A/B test data set, which was pulled from the Kaggle website (https://www.kaggle.com/datasets/yufengsui/mobile-games-ab-testing). I have added the data from the page into Codio for you. It can be found in the cookie_cats.csv file in the file tree. It can also be found at [https://github.com/dustywhite7/Econ8310/raw/master/AssignmentData/cookie_cats.csv](https://github.com/dustywhite7/Econ8310/raw/master/AssignmentData/cookie_cats.csv)

The variables are defined as follows:

| Variable Name  | Definition |
|----------------|------------|
| userid         | A unique number that identifies each player |
| version        | Whether the player was put in the control group (gate_30 - a gate at level 30) or the group with the moved gate (gate_40 - a gate at level 40) |
| sum_gamerounds | The number of game rounds played by the player during the first 14 days after install |
| retention_1    | Did the player come back and play 1 day after installing? |
| retention_7    | Did the player come back and play 7 days after installing? |
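
Before fitting any model, it can help to tabulate the observed retention rates per group as a sanity check. The sketch below is a minimal example, assuming the columns are named `version`, `retention_1`, and `retention_7` as they appear in the CSV. The helper `retention_by_version` and the six-row demo frame are hypothetical and only illustrate the shape of the output.

```python
import pandas as pd


def retention_by_version(df: pd.DataFrame) -> pd.DataFrame:
    """Return the player count and observed retention rates for each version."""
    return df.groupby("version").agg(
        players=("retention_1", "size"),
        retention_1_rate=("retention_1", "mean"),
        retention_7_rate=("retention_7", "mean"),
    )


# Made-up rows for illustration only; these are NOT taken from cookie_cats.csv.
demo = pd.DataFrame(
    {
        "version": ["gate_30", "gate_30", "gate_30", "gate_40", "gate_40", "gate_40"],
        "retention_1": [True, False, True, True, False, False],
        "retention_7": [False, False, True, False, False, False],
    }
)

rates = retention_by_version(demo)
print(rates)
```

Applied to the full data frame, the same call gives one row per gate placement, which is a useful reference point when checking whether the posterior means look plausible.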

### The questions

You will be asked to answer the following questions in a small quiz on Canvas:
1. What was the effect of moving the gate from level 30 to level 40 on 1-day retention rates?
2. What was the effect of moving the gate from level 30 to level 40 on 7-day retention rates?
3. What was the biggest challenge for you in completing this assignment?

You will also be asked to submit a URL to your forked GitHub repository containing your code used to answer these questions.
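
One way to read "the effect" in questions 1 and 2 is as the posterior distribution of the difference between the two retention probabilities, delta = p_40 - p_30. The sketch below assumes posterior draws for each group are already available as NumPy arrays. The helper `summarize_effect` is hypothetical, and the Beta draws in the demo are synthetic placeholders rather than results from the assignment data.

```python
import numpy as np


def summarize_effect(samples_30, samples_40, interval=0.95):
    """Summarize the posterior of delta = p_40 - p_30 from paired draws."""
    delta = np.asarray(samples_40) - np.asarray(samples_30)
    tail = (1.0 - interval) / 2.0
    lower, upper = np.quantile(delta, [tail, 1.0 - tail])
    return {
        "mean_delta": float(delta.mean()),
        "interval": (float(lower), float(upper)),  # equal-tailed credible interval
        "prob_40_lower": float((delta < 0).mean()),  # P(p_40 < p_30)
    }


# Synthetic draws for illustration only (not fitted to cookie_cats.csv).
rng = np.random.default_rng(42)
fake_p_30 = rng.beta(450, 550, size=10_000)
fake_p_40 = rng.beta(440, 560, size=10_000)

effect = summarize_effect(fake_p_30, fake_p_40)
print(effect)
```

The interval here is equal-tailed, so it may differ slightly from the highest-density interval (HDI) that ArviZ reports. A negative `mean_delta` together with a `prob_40_lower` near 1 would suggest that moving the gate to level 40 lowered retention.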

In [None]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")

df = pd.read_csv("https://github.com/dustywhite7/Econ8310/raw/master/AssignmentData/cookie_cats.csv")

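# Split the retention outcomes by experiment group
# (A = gate_30, the control group; B = gate_40, the group with the moved gate)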
observations_1_day_A = df[df['version'] == 'gate_30']['retention_1']
observations_1_day_B = df[df['version'] == 'gate_40']['retention_1']
observations_7_day_A = df[df['version'] == 'gate_30']['retention_7']
observations_7_day_B = df[df['version'] == 'gate_40']['retention_7']

# Define the model for 1-day retention
with pm.Model() as model_1_day:
    p_30 = pm.Uniform('p_30', 0, 1)  # Prior for gate_30 retention probability
    p_40 = pm.Uniform('p_40', 0, 1)  # Prior for gate_40 retention probability

    # Likelihood
    obs_30 = pm.Bernoulli("obs_30", p_30, observed=observations_1_day_A)
    obs_40 = pm.Bernoulli("obs_40", p_40, observed=observations_1_day_B)

    # Sampling
    step = pm.Metropolis()
    trace_1_day = pm.sample(20000, step=step, chains=3, random_seed=RANDOM_SEED)

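# Posterior samples for 1-day retention
# pm.sample already discards its tuning draws by default; the [:, 1000:] slice
# drops a further 1000 draws per chain as burn-in, and np.concatenate pools the chains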
p_30_samples_1_day = np.concatenate(trace_1_day.posterior.p_30.data[:, 1000:])
p_40_samples_1_day = np.concatenate(trace_1_day.posterior.p_40.data[:, 1000:])

# Plot posterior distributions for 1-day retention
plt.figure(figsize=(12, 8))

# Gate_30 posterior
plt.subplot(211)
plt.xlim(0.4, 0.5)
plt.hist(p_30_samples_1_day, histtype='stepfilled', bins=25, alpha=0.85,
         label="Posterior of $p_{30}$", color="#A60628", density=True)
plt.legend(loc="upper right")
plt.title("Posterior Distribution of $p_{30}$ for 1-Day Retention")

# Gate_40 posterior
plt.subplot(212)
plt.xlim(0.4, 0.5)
plt.hist(p_40_samples_1_day, histtype='stepfilled', bins=25, alpha=0.85,
         label="Posterior of $p_{40}$", color="#467821", density=True)
plt.legend(loc="upper right")
plt.title("Posterior Distribution of $p_{40}$ for 1-Day Retention")
plt.tight_layout()
plt.show()

# Define the model for 7-day retention
with pm.Model() as model_7_day:
    p_30 = pm.Uniform('p_30', 0, 1)  # Prior for gate_30 retention probability
    p_40 = pm.Uniform('p_40', 0, 1)  # Prior for gate_40 retention probability

    # Likelihood
    obs_30 = pm.Bernoulli("obs_30", p_30, observed=observations_7_day_A)
    obs_40 = pm.Bernoulli("obs_40", p_40, observed=observations_7_day_B)

    # Sampling
    step = pm.Metropolis()
    trace_7_day = pm.sample(20000, step=step, chains=3, random_seed=RANDOM_SEED)

# Posterior samples for 7-day retention
# As above: drop 1000 extra burn-in draws per chain, then pool the chains
p_30_samples_7_day = np.concatenate(trace_7_day.posterior.p_30.data[:, 1000:])
p_40_samples_7_day = np.concatenate(trace_7_day.posterior.p_40.data[:, 1000:])

# Plot posterior distributions for 7-day retention
plt.figure(figsize=(12, 8))

# Gate_30 posterior
plt.subplot(211)
plt.xlim(0.1, 0.2)
plt.hist(p_30_samples_7_day, histtype='stepfilled', bins=25, alpha=0.85,
         label="Posterior of $p_{30}$", color="#A60628", density=True)
plt.legend(loc="upper right")
plt.title("Posterior Distribution of $p_{30}$ for 7-Day Retention")

# Gate_40 posterior
plt.subplot(212)
plt.xlim(0.1, 0.2)
plt.hist(p_40_samples_7_day, histtype='stepfilled', bins=25, alpha=0.85,
         label="Posterior of $p_{40}$", color="#467821", density=True)
plt.legend(loc="upper right")
plt.title("Posterior Distribution of $p_{40}$ for 7-Day Retention")
plt.tight_layout()
plt.show()

# Summarize 1-day retention model
# I got help from Copilot for part of this code
az_summary_1_day = az.summary(trace_1_day, var_names=["p_30", "p_40"], hdi_prob=0.95)
print("1-Day Retention Summary:\n", az_summary_1_day)

# Summarize 7-day retention model
# I got help from Copilot for part of this code
az_summary_7_day = az.summary(trace_7_day, var_names=["p_30", "p_40"], hdi_prob=0.95)
print("\n7-Day Retention Summary:\n", az_summary_7_day)
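
Because each model pairs a Uniform(0, 1) prior with a Bernoulli likelihood, the posterior for each retention probability has a closed form: Beta(1 + k, 1 + n - k), where k is the number of retained players out of n. This offers a cheap cross-check on the Metropolis results. The sketch below is a minimal version of that check; the helper `beta_posterior` and the counts in the demo are hypothetical, not values from the data set.

```python
import numpy as np


def beta_posterior(retained, total, n_draws=20_000, seed=42):
    """Draw from the Beta(1 + k, 1 + n - k) posterior implied by a flat prior."""
    rng = np.random.default_rng(seed)
    a = 1 + retained          # prior pseudo-count plus observed successes
    b = 1 + total - retained  # prior pseudo-count plus observed failures
    draws = rng.beta(a, b, size=n_draws)
    exact_mean = a / (a + b)  # analytic posterior mean
    return draws, exact_mean


# Hypothetical counts for illustration only.
draws, exact_mean = beta_posterior(retained=45, total=100)
print(f"analytic mean: {exact_mean:.4f}, sampled mean: {draws.mean():.4f}")
```

With the real data, `retained` would be `observations_1_day_A.sum()` and `total` would be `len(observations_1_day_A)`, and likewise for the other three series. The resulting means should sit close to the `mean` column of the ArviZ summaries; a large gap would point to a sampling or data-handling problem.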