In [24]:
import pandas as pd
import numpy as np
from scipy import stats

Background Story

Company Overview:
I work for a leading SaaS provider that recently launched a new app designed to help content creators optimize their online presence and engagement. The app provides a suite of tools and actionable insights to enhance content strategy and improve audience interaction.

Current Situation:
To boost user adoption, the company has been running advertising campaigns on Instagram. Two distinct campaigns are in progress:

Control Campaign: Targets a broad audience of content creators.

Test Campaign: Focuses on a premium user segment identified through analysis as having higher potential value but also incurring higher advertising costs.

Insight and Challenge:
Analysis indicates that a significant portion of our current user base belongs to this premium segment. However, targeting this group comes with higher advertising expenses. This has raised concerns for the CEO about whether the increased investment in the Test Campaign is justified.


The primary goal is to assess whether the Test Campaign is effectively driving more user clicks compared to the Control Campaign.



In [25]:
dataframe1 = pd.read_csv("test_group.csv", sep=';')
dataframe2 = pd.read_csv("control_group.csv", sep =';')

In [26]:
dataframe2.head(10)

Unnamed: 0,Campaign Name,Date,Spend [USD],# of Impressions,Reach,# of Website Clicks,# of Searches,# of View Content,# of Add to Cart,# of Purchase
0,Control Campaign,1.08.2019,2280,82702.0,56930.0,7016.0,2290.0,2159.0,1819.0,618.0
1,Control Campaign,2.08.2019,1757,121040.0,102513.0,8110.0,2033.0,1841.0,1219.0,511.0
2,Control Campaign,3.08.2019,2343,131711.0,110862.0,6508.0,1737.0,1549.0,1134.0,372.0
3,Control Campaign,4.08.2019,1940,72878.0,61235.0,3065.0,1042.0,982.0,1183.0,340.0
4,Control Campaign,5.08.2019,1835,,,,,,,
5,Control Campaign,6.08.2019,3083,109076.0,87998.0,4028.0,1709.0,1249.0,784.0,764.0
6,Control Campaign,7.08.2019,2544,142123.0,127852.0,2640.0,1388.0,1106.0,1166.0,499.0
7,Control Campaign,8.08.2019,1900,90939.0,65217.0,7260.0,3047.0,2746.0,930.0,462.0
8,Control Campaign,9.08.2019,2813,121332.0,94896.0,6198.0,2487.0,2179.0,645.0,501.0
9,Control Campaign,10.08.2019,2149,117624.0,91257.0,2277.0,2475.0,1984.0,1629.0,734.0


In [27]:
new_columns = ['Campaign_name','Date','Spend[usd]','Number_of_Impressions',
               'Reach','Number_of_Website_Clicks','Numeber_of_searches',
               'Number_of_view_content','Number_of_Add_To_Cart','Number_of_Purchase']

# Both files share the same layout, so one list of names serves both frames.
# 'Numeber_of_searches' keeps its original spelling because the outputs below use it.
dataframe1.columns = new_columns
dataframe2.columns = new_columns

In [28]:
dataframe1.isnull().sum()

Campaign_name               0
Date                        0
Spend[usd]                  0
Number_of_Impressions       0
Reach                       0
Number_of_Website_Clicks    0
Numeber_of_searches         0
Number_of_view_content      0
Number_of_Add_To_Cart       0
Number_of_Purchase          0
dtype: int64

In [29]:
dataframe2.isnull().sum()

Campaign_name               0
Date                        0
Spend[usd]                  0
Number_of_Impressions       1
Reach                       1
Number_of_Website_Clicks    1
Numeber_of_searches         1
Number_of_view_content      1
Number_of_Add_To_Cart       1
Number_of_Purchase          1
dtype: int64

In [30]:
empty_value = dataframe2[dataframe2['Number_of_Impressions'].isna()]
date_with_empty = empty_value['Date']
date_with_empty

4    5.08.2019
Name: Date, dtype: object

In [31]:
print(dataframe2[dataframe2['Date']=='5.08.2019'])

      Campaign_name       Date  Spend[usd]  Number_of_Impressions  Reach  \
4  Control Campaign  5.08.2019        1835                    NaN    NaN   

   Number_of_Website_Clicks  Numeber_of_searches  Number_of_view_content  \
4                       NaN                  NaN                     NaN   

   Number_of_Add_To_Cart  Number_of_Purchase  
4                    NaN                 NaN  


In [32]:
print(dataframe1[dataframe1['Date']=='5.08.2019'])

   Campaign_name       Date  Spend[usd]  Number_of_Impressions  Reach  \
4  Test Campaign  5.08.2019        2297                 114295  95138   

   Number_of_Website_Clicks  Numeber_of_searches  Number_of_view_content  \
4                      5863                 2106                     858   

   Number_of_Add_To_Cart  Number_of_Purchase  
4                    956                 768  


I decided to remove 5.08.2019 from both groups to avoid bias or a misrepresented skew. The data is not missing at random: every control-group metric is missing for that one day, while the test group has complete data for it.

In [33]:
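# Drop the day from both groups. .copy() makes each result an independent frame,
# so the Date conversion in the next cell does not raise SettingWithCopyWarning.
dataframe1 = dataframe1[dataframe1['Date'] != '5.08.2019'].copy()
dataframe2 = dataframe2[dataframe2['Date'] != '5.08.2019'].copy()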

In [34]:
dataframe2['Date'] = pd.to_datetime(dataframe2['Date'],format='%d.%m.%Y')
dataframe1['Date'] = pd.to_datetime(dataframe1['Date'],format='%d.%m.%Y')

In [None]:
dataframe1 = dataframe1.sort_values(by='Date',ignore_index=True)
dataframe1

In [None]:
dataframe2 = dataframe2.sort_values(by='Date',ignore_index=True)
dataframe2

significant_test = 0.05  # Significance level
confidence_interval = 95  # Confidence level as a percentage


Decision Rule
If the p-value is less than 0.05, we reject the null hypothesis; otherwise, we fail to reject it.

Context
I had two candidate primary metrics for deciding whether the test campaign performs better than the current campaign: click-through rate (CTR) and the mean number of clicks. CTR measured on unique users counts only one click per user, so if a user clicks multiple times, the extra clicks are ignored. The mean number of clicks takes every click into account and measures the average per session, which in this dataset means per day. I use the mean number of clicks, as the hypotheses below state.
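
To make the two candidate metrics concrete, here is a minimal sketch using the first three control-campaign rows from the table above. It assumes the common aggregate definition of CTR (total clicks divided by total impressions), since the dataset has no user-level data from which a unique-user CTR could be computed. The variable names are illustrative and not part of the notebook.

```python
# First three control-campaign days, copied from the dataframe2.head(10) output.
impressions = [82702, 121040, 131711]
website_clicks = [7016, 8110, 6508]

# Assumed definition: aggregate CTR = total clicks / total impressions.
ctr = sum(website_clicks) / sum(impressions)

# Mean of clicks: average number of website clicks per day.
mean_daily_clicks = sum(website_clicks) / len(website_clicks)

print(f"CTR: {ctr:.4f}")
print(f"Mean daily clicks: {mean_daily_clicks:.1f}")
```

CTR is a ratio that normalizes for how often the ad was shown, while the mean of clicks is a raw daily volume, so the two metrics can rank campaigns differently.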

Business Hypothesis
The targeted campaign will lead to an increase in the number of clicks on the ad compared to the generic ad campaign.

Statistical Hypotheses
Null Hypothesis: The mean number of clicks generated by the targeted campaign is equal to the mean number of clicks generated by the generic ad campaign.
Alternative Hypothesis: The mean number of clicks generated by the targeted ad campaign is different from the mean number of clicks generated by the generic ad campaign.
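
In symbols, with $\mu_{\text{con}}$ and $\mu_{\text{test}}$ denoting the mean daily website clicks of the control and test campaigns (notation introduced here for illustration), the two-sided hypotheses and the pooled t-statistic computed later in this notebook can be written as:

```latex
H_0:\ \mu_{\text{test}} = \mu_{\text{con}}
\qquad
H_1:\ \mu_{\text{test}} \neq \mu_{\text{con}}

t = \frac{\bar{x}_{\text{con}} - \bar{x}_{\text{test}}}
         {s_p \sqrt{\dfrac{1}{N_{\text{con}}} + \dfrac{1}{N_{\text{test}}}}},
\qquad
s_p^2 = \frac{(N_{\text{con}} - 1)\, s_{\text{con}}^2 + (N_{\text{test}} - 1)\, s_{\text{test}}^2}
             {N_{\text{con}} + N_{\text{test}} - 2}
```

Under the null hypothesis and the equal-variance assumption, $t$ follows a t-distribution with $N_{\text{con}} + N_{\text{test}} - 2$ degrees of freedom.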


Data Requirements
Two columns are needed:
1. Number_of_Website_Clicks: The count of clicks on the website.
2. Campaign_name: Identifier for the current or test campaign.


Limitations
The dataset, summarized by date (e.g., Kaggle data), does not provide detailed user-level interaction data. This lack of granularity is a common limitation in pre-aggregated datasets and may prevent a true replication of the original experiment.

In [None]:
dataframe2.describe()

The choice of a statistical test depends on several factors, including the type of data (continuous or categorical), sample size, number of groups being compared, and distribution assumptions.
Examples for continuous data: t-tests, z-tests, ANOVA.
Examples for categorical data: chi-square tests, Fisher’s exact test.



If the variances are found to be unequal, use Welch's t-test rather than the standard (pooled) t-test, because it does not assume equal variances. It adjusts the degrees of freedom to account for the unequal variances, which makes it more reliable in that scenario. However, for our dataset, Levene's test (run below) finds no significant difference between the two groups' variances, so the standard t-test is fine.
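
As a minimal sketch of that adjustment, the snippet below computes Welch's t-statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The function name `welch_t` and the sample values are hypothetical, chosen purely for illustration; they are not the campaign data.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    # Per-group squared standard errors; variance() is the sample variance (n - 1).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)

    # Unlike the pooled test, the two variances are never combined into one.
    t_stat = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t_stat, dof


# Hypothetical daily click counts, for illustration only.
group_a = [1, 2, 3, 4]
group_b = [2, 3, 4, 5]
t_stat, dof = welch_t(group_a, group_b)
print(f"Welch t-statistic: {t_stat:.4f}, degrees of freedom: {dof:.2f}")
```

When both groups have the same size and the same sample variance, as in this toy input, the Welch degrees of freedom equal the pooled value (the two sample sizes added together, minus 2); they shrink as the variances diverge. With SciPy, `stats.ttest_ind(a, b, equal_var=False)` runs the same test and also returns the p-value.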

In [40]:

X_con = dataframe2['Number_of_Website_Clicks'].values
x_test = dataframe1['Number_of_Website_Clicks'].values

# Levene's test for equality of variances
f_stat, p_value_f = stats.levene(X_con, x_test)  # Levene's test is robust to departures from normality

# Checking the result of Levene's test
print(f"F-statistic (Levene's Test): {f_stat}")
print(f"P-value (Levene's Test): {p_value_f}")

if p_value_f < 0.05:
    print("The variances are significantly different (unequal variances).")
else:
    print("The variances are not significantly different (equal variances).")


F-statistic (Levene's Test): 0.11590561333015952
P-value (Levene's Test): 0.7347921763398915
The variances are not significantly different (equal variances).


In [41]:

# Calculate sample means and variances
x_con_mean = np.mean(X_con)
x_test_mean = np.mean(x_test)

x_con_var = np.var(X_con, ddof=1)  # sample variance
x_test_var = np.var(x_test, ddof=1)  # sample variance

# Sample sizes
N_con = len(X_con)
N_test = len(x_test)

# Pooled variance (since variances are assumed equal)
pooled_var = ((N_con - 1) * x_con_var + (N_test - 1) * x_test_var) / (N_con + N_test - 2)

# Pooled standard deviation
pooled_std = np.sqrt(pooled_var)

# T-statistic
t_stat = (x_con_mean - x_test_mean) / (pooled_std * np.sqrt(1 / N_con + 1 / N_test))

# Degrees of freedom
df = N_con + N_test - 2

# p-value (two-tailed)
p_value = 2 * (1 - stats.t.cdf(np.abs(t_stat), df))

# Confidence interval (95%)
confidence_level = 0.95
alpha = 1 - confidence_level
t_critical = stats.t.ppf(1 - alpha/2, df)  # t-critical value for 95% CI

# Margin of error
margin_of_error = t_critical * pooled_std * np.sqrt(1 / N_con + 1 / N_test)

# Confidence interval for the difference in means
ci_lower = (x_con_mean - x_test_mean) - margin_of_error
ci_upper = (x_con_mean - x_test_mean) + margin_of_error

# Print the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {df}")
print(f"Confidence Interval: ({ci_lower}, {ci_upper})")

# Conclusion based on p-value
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two groups.")


T-statistic: -1.5627904156089754
P-value: 0.1237359312213655
Degrees of Freedom: 56
Confidence Interval: (-1636.9417999712882, 202.18317928163435)
Fail to reject the null hypothesis: There is no significant difference between the two groups.


T-statistic: Tells you how far apart the sample means are, measured in terms of the pooled standard error.
P-value: Helps you decide whether the difference between the groups is statistically significant.
Confidence Interval: Provides a range for the difference in means, allowing you to understand the potential size of the effect.
 
Since zero lies within the confidence interval, it is possible that there is no real difference between the groups; any observed difference could be due to random variation.
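
As a quick sanity check on how these three quantities relate, the reported numbers can be tied together with plain arithmetic: the midpoint of the confidence interval is the observed difference in means (control minus test), dividing that difference by the t-statistic recovers the standard error, and the interval's half-width divided by the standard error gives back the critical t-value. This is a sketch that uses only the printed output above; the variable names are illustrative.

```python
# Values copied from the printed results of the pooled t-test.
t_stat = -1.5627904156089754
ci_lower = -1636.9417999712882
ci_upper = 202.18317928163435

# The interval is centred on the observed difference in means (control - test).
mean_difference = (ci_lower + ci_upper) / 2

# t = difference / standard error, so standard error = difference / t.
standard_error = mean_difference / t_stat

# Half-width = t_critical * standard error, so t_critical = half-width / SE.
half_width = (ci_upper - ci_lower) / 2
implied_t_critical = half_width / standard_error

print(f"Difference in means: {mean_difference:.2f}")
print(f"Standard error: {standard_error:.2f}")
print(f"Implied critical t-value: {implied_t_critical:.3f}")
```

The difference comes out at roughly -717 clicks per day, meaning the test campaign averaged about 717 more daily clicks than the control, but the interval is wide enough to include zero.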