# Case 2 - Ubisoft
* Kyle Anderson u09858930

# Table of Contents

1. [Introduction and Understanding of the Case](#Introduction-and-Understanding-of-the-Case)
2. [Data](#Data)
3. [Question 1](#Question-1)
4. [Question 2](#Question-2)
5. [Question 3](#Question-3)
6. [Question 4](#Question-4)
7. [Question 5](#Question-5)
8. [Question 6](#Question-6)
9. [Question 7](#Question-7)


## Introduction and Understanding of the Case

### Problem
* Ubisoft is testing a redesign of the For Honor game page to determine which layout converts more site visitors into buyers. The question is whether the buy button performs better at the bottom of the game description or in the top-left corner of the page.

## Import Libraries

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import ceil
from scipy.stats import norm
from math import sqrt

## Data

In [51]:
# Ubisoft Historical
ubisoft_hist = "https://raw.githubusercontent.com/jefftwebb/data/main/ubisoft_historical.csv"
df_h = pd.read_csv(ubisoft_hist, index_col=False)

# Descriptive Dictionary
diction = "https://raw.githubusercontent.com/jefftwebb/data/main/ubisoft_historical_dictionary.csv"
df_d = pd.read_csv(diction, index_col=False)

In [52]:
# Inspect data frame of historical
df_h.head(5)

| | Date | Day_of_Week | Visitors | Conversions | Is_Campaign_Day |
|---|---|---|---|---|---|
| 0 | 2023-01-01 | Sunday | 633 | 50 | True |
| 1 | 2023-01-02 | Monday | 487 | 30 | False |
| 2 | 2023-01-03 | Tuesday | 501 | 25 | False |
| 3 | 2023-01-04 | Wednesday | 475 | 19 | False |
| 4 | 2023-01-05 | Thursday | 497 | 25 | False |


In [53]:
# Inspect data frame of dictionary
df_d.head(10)

# Date could be a confounder because of events, cultural activities, Ubisoft activity, or weekends/holidays.
# Day_of_Week could be a confounder.
# Is_Campaign_Day is the most obvious potential confounder.

| | variable | type | description |
|---|---|---|---|
| 0 | Date | date | The date on which the data were recorded, from... |
| 1 | Day_of_Week | string | The day of the week corresponding to the date. |
| 2 | Visitors | integer | The total number of visitors to the For Honor ... |
| 3 | Conversions | integer | The number of visitors who completed a convers... |
| 4 | Is_Campaign_Day | logical | A flag indicating whether a marketing campaign... |


## Question 1
* What is the unit of analysis for the proposed A/B test? Explain.

## Answer 1
* The goal is to measure whether a visitor clicks to buy, so the unit of analysis is the individual customer: each visitor to the page is assigned to group A or B and either converts or does not.

## Question 2
* What is the baseline conversion rate for the For Honor game in the historical data?

In [54]:
# Total conversions / total site visits, expressed as a percentage
conversion_rate = (df_h['Conversions'].sum() / df_h['Visitors'].sum()) * 100
conversion_rate = round(conversion_rate, 2)
print("Conversion Rate:")
print(conversion_rate,"%")


Conversion Rate:
4.85 %


## Answer 2
* The baseline conversion rate is 4.85%, which is low: only about 1 in 20 visitors to the page goes on to buy the game.
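
The rate above is pooled: total conversions divided by total visitors. As a minimal sketch of that calculation, and of why `Is_Campaign_Day` may matter, the snippet below hard-codes only the five rows shown in the `df_h.head(5)` output. It does not use the full dataset, so the numbers are illustrative and will not match the 4.85% baseline. The helper name `pooled_rate` is hypothetical.

```python
# First five rows of the historical data, copied from the df_h.head(5) output.
# This is NOT the full dataset, so the rates below are only illustrative.
rows = [
    {"date": "2023-01-01", "visitors": 633, "conversions": 50, "campaign": True},
    {"date": "2023-01-02", "visitors": 487, "conversions": 30, "campaign": False},
    {"date": "2023-01-03", "visitors": 501, "conversions": 25, "campaign": False},
    {"date": "2023-01-04", "visitors": 475, "conversions": 19, "campaign": False},
    {"date": "2023-01-05", "visitors": 497, "conversions": 25, "campaign": False},
]


def pooled_rate(subset):
    """Total conversions divided by total visitors (not a mean of daily rates)."""
    visitors = sum(r["visitors"] for r in subset)
    conversions = sum(r["conversions"] for r in subset)
    return conversions / visitors


overall = pooled_rate(rows)
by_campaign = {
    flag: pooled_rate([r for r in rows if r["campaign"] == flag])
    for flag in (True, False)
}

print(f"Pooled rate, all five days: {overall:.2%}")
print(f"Campaign day:               {by_campaign[True]:.2%}")
print(f"Non-campaign days:          {by_campaign[False]:.2%}")
```

With a single campaign day in the sample this proves nothing, but the same split on the full historical data would show whether campaign days inflate the baseline.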

## Question 3
* What is the required sample size in each group in order for the test to detect the specified MEI of 1% in conversions, assuming alpha of .05 and power of .8 in a one-tailed (directional) test?

In [55]:
# MEI = minimum effect of interest, added here as 1 percentage point on top of the baseline.
# Power of .8 means that we will correctly reject a false null hypothesis 80% of the time.

conversion_rate = 0.0485    # baseline rate from Question 2
p2 = conversion_rate + 0.01 # baseline + MEI
alpha = 0.05
power = 0.8

# Critical z-value for a one-tailed test at the given alpha
z_alpha = norm.ppf(1 - alpha)
# z-value corresponding to the desired power (1 - beta)
z_beta = norm.ppf(power)

p_bar = (conversion_rate + p2) / 2 # p_bar is the pooled conversion rate across both groups

n = (((z_alpha * sqrt(2 * p_bar * (1 - p_bar))) + sqrt(conversion_rate * (1 - conversion_rate) + p2 * (1 - p2)) * z_beta)/ (p2 - conversion_rate)) ** 2
# Round up to a whole number of customers
sample = np.ceil(n)
print("Sample Size:")
sample

Sample Size:


6261.0

## Answer 3:
* Each group (A and B) needs a sample of 6,261 visitors, or 12,522 visitors in total across both groups.
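
The calculation above can be wrapped in a reusable function so that alpha and power are easy to vary. This is a sketch of the same formula used in the cell above, not a different method; the function name `sample_size_per_group` is hypothetical, and it uses `statistics.NormalDist` from the standard library in place of `scipy.stats.norm` so it runs without extra packages.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1, mei, alpha=0.05, power=0.80):
    """Per-group sample size for a one-tailed test of two proportions.

    p1 is the baseline rate and mei the absolute lift to detect.
    """
    p2 = p1 + mei
    p_bar = (p1 + p2) / 2  # pooled rate under the null hypothesis
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    )
    return ceil((numerator / (p2 - p1)) ** 2)


n_per_group = sample_size_per_group(0.0485, 0.01, alpha=0.05, power=0.80)
print("Per group:", n_per_group)
print("Both groups:", 2 * n_per_group)
```

Calling it with the parameters from this question reproduces the 6,261 reported above.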

## Question 4
* Given the study parameters from the previous question—MEI, alpha and power—and the visitor counts in the historical data, how long will the test need to run? Discuss the assumptions you are making in estimating the test duration.

In [56]:
median_visits = df_h['Visitors'].median() # Daily visitors

median_duration = np.ceil(6261 / median_visits)
print("Duration using median daily visitors:")
print(median_duration)

mean_visits = df_h['Visitors'].mean() # Daily visitors

mean_duration = np.ceil(6261 / mean_visits)
print('Duration using mean daily visitors:')
print(mean_duration)

std_visits = df_h['Visitors'].std()
std_visits = np.ceil(std_visits)
print('Standard deviation of daily visitors:')
print(std_visits)


Duration using median daily visitors:
13.0
Duration using mean daily visitors:
12.0
Standard deviation of daily visitors:
53.0


## Answer 4
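* It depends on the business context and on which summary of daily traffic is used.
* Using the median number of daily visitors, the A/B test will finish after 13 days; using the mean, it finishes after 12 days. Daily traffic has a standard deviation of about 53 visitors, so the actual duration could be somewhat shorter or longer. Decision makers will need to weigh the cost of waiting an extra day for this information against the risk of cutting the test short before the sample is complete.
* Assumptions: daily traffic during the test resembles the historical data, and the per-group sample of 6,261 is divided by total daily visitors, which treats each group as receiving a full day's traffic. If visitors are instead split evenly between A and B, each group fills at half that pace and the test would take roughly twice as long.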
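
To make the traffic-split assumption concrete, here is a small sketch of the duration arithmetic. The figure of 530 daily visitors is an assumption taken from the simulation comment in Question 6 ("530 mean for total"), not recomputed from the data, and the function name `days_to_reach_sample` is hypothetical.

```python
from math import ceil


def days_to_reach_sample(n_per_group, daily_visitors, n_groups=2):
    """Days needed when each day's visitors are shared evenly across n_groups."""
    total_needed = n_per_group * n_groups
    return ceil(total_needed / daily_visitors)


DAILY_VISITORS = 530  # assumed mean of total daily visitors

# Traffic split 50/50 between A and B: both groups must be filled.
split_days = days_to_reach_sample(6261, DAILY_VISITORS, n_groups=2)
# The calculation used in the cell above: one group's sample over all traffic.
unsplit_days = days_to_reach_sample(6261, DAILY_VISITORS, n_groups=1)

print("Days with traffic split across A and B:", split_days)
print("Days if one group received all traffic:", unsplit_days)
```

Which figure applies depends on how the experiment platform allocates visitors, so that should be confirmed before committing to a timeline.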

## Question 5
* In A/B testing false negatives can be more detrimental to a company than false positives. Primarily, they prevent the recognition and implementation of beneficial changes, resulting in missed opportunities, and, in the long run, competitive disadvantage. Recalculate sample size and study duration with this in mind, using different settings for alpha and power. Explain your choices.

In [57]:
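# Re-run the sample size calculation with settings that favor fewer false negatives

conversion_rate = 0.0485    # baseline rate from Question 2
p2 = conversion_rate + 0.01 # baseline + MEI
alpha = 0.04                # slightly stricter than the original 0.05
power = 0.95                # raised from 0.8, so beta falls from 0.20 to 0.05

# Critical z-value for a one-tailed test at the given alpha
z_alpha = norm.ppf(1 - alpha)
# z-value corresponding to the desired power (1 - beta)
z_beta = norm.ppf(power)

p_bar = (conversion_rate + p2) / 2 # p_bar is the pooled conversion rate across both groups

n = (((z_alpha * sqrt(2 * p_bar * (1 - p_bar))) + sqrt(conversion_rate * (1 - conversion_rate) + p2 * (1 - p2)) * z_beta)/ (p2 - conversion_rate)) ** 2
# Round up to a whole number of customers
new_sample = np.ceil(n)
print("Sample Size:")
new_sample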

Sample Size:


11674.0

In [58]:
# Re run test days
new_median_duration = np.ceil(new_sample / median_visits)
print("Duration using median daily visitors:")
print(new_median_duration)

new_mean_duration = np.ceil(new_sample / mean_visits)
print('Duration using mean daily visitors:')
print(new_mean_duration)


Duration using median daily visitors:
23.0
Duration using mean daily visitors:
23.0


## Answer 5
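* Raising power from 0.8 to 0.95 lowers the false negative rate (beta) from 20% to 5%, and alpha is tightened slightly from 0.05 to 0.04. The required sample size rises from 6,261 to 11,674 per group, which roughly doubles the test duration: the median and the mean of daily visitors both give 23 days. The longer test gathers more information and drives false negatives down. However, the business still has to weigh that benefit against the cost of waiting.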
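
One way to see the trade-off is to turn the sample size formula around and ask what power a given sample size delivers. The sketch below rearranges the same normal-approximation formula used in the cells above; the function name `approx_power` is hypothetical, and the result is an approximation rather than an exact power calculation.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, p1, mei, alpha):
    """Approximate power of a one-tailed two-proportion test.

    Solves the sample size formula for z_beta and converts it to a probability.
    """
    p2 = p1 + mei
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = (
        (p2 - p1) * sqrt(n_per_group) - z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    ) / sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return NormalDist().cdf(z_beta)


power_original = approx_power(6261, 0.0485, 0.01, alpha=0.05)
power_stricter_alpha = approx_power(6261, 0.0485, 0.01, alpha=0.04)
power_new = approx_power(11674, 0.0485, 0.01, alpha=0.04)

print(f"n=6261,  alpha=0.05: power ~ {power_original:.3f}")
print(f"n=6261,  alpha=0.04: power ~ {power_stricter_alpha:.3f}")
print(f"n=11674, alpha=0.04: power ~ {power_new:.3f}")
```

The middle line shows that tightening alpha without adding visitors costs power, which is why the larger sample is needed to reach 0.95.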

## Question 6
* Simulate visitor level data for the test, based on numbers from the historical data, and given the MEI and the test duration you calculated in Q4. The simulated data should resemble the actual data you would collect during the experiment. You can ignore weekly seasonality and marketing campaigns but the simulation should include a realistic number of visitors per day for the A and B groups for the duration of the experiment, as well as a realistic proportion of conversions. Each row should represent a unique customer-day combination. Use the simulated data to analyze the difference between A and B conversion rates statistically. Report and explain your results. (Hint: conversion is a binary process that can be modeled as a binomial random variable. Use rbinom() in R or numpy.random.binomial() in Python.)

In [59]:
n = 6261  # Sample size per group (Question 3)
duration = 13  # Test duration in days (Question 4, median estimate)
conversion_rate = 0.0485  # Group A (control) rate
p2 = conversion_rate + 0.01  # Group B rate: baseline + MEI

data = []

np.random.seed(223)
# Simulate both groups on every day so that day effects are balanced across groups
for day in range(duration):
    for group in ['A', 'B']: # set groups

        # Daily visitors for this group, drawn around 530 (the mean for total daily visitors)
        num_visitors = np.random.poisson(530)
        # Day-level conversion count from the binomial hint. It is not used below,
        # but removing it would change the seeded random draws and the output shown.
        conversions = np.random.binomial(num_visitors, conversion_rate if group == 'A' else p2)
        # One row per visitor: Conversion is a single binomial (Bernoulli) draw at the group's rate
        for _ in range(num_visitors):
            data.append({'Day': day, 'Group': group, 'Conversion': np.random.binomial(1, conversion_rate if group == 'A' else p2)})

df_simul = pd.DataFrame(data)
df_simul.head(10)


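| | Day | Group | Conversion |
|---|---|---|---|
| 0 | 0 | A | 0 |
| 1 | 0 | A | 0 |
| 2 | 0 | A | 0 |
| 3 | 0 | A | 0 |
| 4 | 0 | A | 0 |
| 5 | 0 | A | 1 |
| 6 | 0 | A | 0 |
| 7 | 0 | A | 0 |
| 8 | 0 | A | 0 |
| 9 | 0 | A | 0 |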
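
A single simulated experiment is one random draw, so its result could be lucky or unlucky. A rough way to check the design is to repeat the simulation many times and count how often a one-sided test detects the lift. The sketch below does this with the standard library only; the number of repetitions (`REPS = 200`) and the helper names are hypothetical choices, and it uses a fixed 6,261 visitors per group rather than Poisson daily traffic.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_conversions(rng, n, p):
    """Number of conversions among n visitors, each converting with probability p."""
    return sum(rng.random() < p for _ in range(n))


def rejects_null(conv_a, conv_b, n, z_crit):
    """One-sided pooled z-test for B > A with equal group sizes n."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * p_pool * (1 - p_pool) / n)
    return (conv_b - conv_a) / n / se > z_crit


rng = random.Random(223)               # fixed seed for reproducibility
N, P_A, P_B = 6261, 0.0485, 0.0585     # per-group size, baseline, baseline + MEI
Z_CRIT = NormalDist().inv_cdf(0.95)    # one-sided critical value at alpha = 0.05
REPS = 200                             # hypothetical number of repeated experiments

hits = sum(
    rejects_null(
        simulate_conversions(rng, N, P_A),
        simulate_conversions(rng, N, P_B),
        N,
        Z_CRIT,
    )
    for _ in range(REPS)
)
estimated_power = hits / REPS
print(f"Share of simulated experiments that detect the lift: {estimated_power:.2f}")
```

If the planning in Question 3 is sound, this share should land near the target power of 0.8, within simulation noise.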


In [60]:
# Note: I had help from Chat and Copilot on this part. I was unsure how to run the comparison, and they suggested the chi-squared test.
conversion_rates = df_simul.groupby('Group')['Conversion'].mean()
print("Conversion Rates:")
print(conversion_rates)

from scipy.stats import chi2_contingency

# 2x2 table of group (A/B) by outcome (0 = no conversion, 1 = conversion)
contingency_table = pd.crosstab(df_simul['Group'], df_simul['Conversion'])
chi2, p, _, _ = chi2_contingency(contingency_table)

print(f"Chi-square statistic: {round(chi2, 2)}")
# Rounded to 2 decimals, so a very small p-value prints as 0.0
print(f"P-value: {round(p,2)}")

if p < 0.05:
    print("There is a difference in conversion rates of group A and B.")
else:
    print("No statistically significant difference in conversion rates of group A and B.")


Conversion Rates:
Group
A    0.044381
B    0.058933
Name: Conversion, dtype: float64
Chi-square statistic: 14.78
P-value: 0.0
There is a difference in conversion rates of group A and B.


## Answer 6
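* In the simulated data, group B converts at 5.89% versus 4.44% for group A, a difference of about 1.46 percentage points. The chi-squared test (statistic 14.78, with a p-value that rounds to 0.00) indicates that this difference is statistically significant. Because B has the higher conversion rate, the simulation marks a positive result for moving the buy button to the top-left corner. This outcome is expected, since the MEI was built into the simulated data; what it shows is that the planned test can detect a lift of that size.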
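
The chi-squared test above is two-sided, while the sample size in Question 3 was planned for a one-tailed test. As an alternative sketch, a one-sided two-proportion z-test asks directly whether B converts better than A. The function name is hypothetical and the counts in the example are made up for illustration (roughly 4.85% and 5.85% of 6,261); they are not the counts from `df_simul`.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """z statistic and one-sided p-value for H1: rate_B > rate_A (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical counts for illustration only, not taken from the simulation above.
z, p_value = one_sided_two_proportion_ztest(conv_a=304, n_a=6261, conv_b=366, n_b=6261)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: B converts at a higher rate than A.")
else:
    print("Fail to reject the null hypothesis: no evidence that B beats A.")
```

To apply it to the simulated data, the four counts could be read from the contingency table built in the cell above.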

## Question 7

## Answer 7

* **Background**: When optimizing the buy button for the 'For Honor' game, it is important to find the location that best converts visitors into buyers. This study compares the Control, where the button sits at the bottom of the page, with a new design that moves it to the top left.
* Estimating the test duration up front gives stakeholders a clear expectation of the timeline. The median and mean of daily visitors gave estimates of 13 and 12 days, a difference of one day, with daily traffic varying by about 53 visitors (one standard deviation). However, it would be wise to reduce false negatives by extending the test to 23 days, which captures a wider range of visitors over roughly twice as many days. Studying visitors across that full period, with each visitor randomly assigned to group A (the Control) or group B (the new button location), helps keep the data unbiased.
* It will be vital to let the test run its full 23 days and to review the results for the Control and Test groups at the end.
    

