In this notebook, we use the hypotheses formed during our EDA to conduct hypothesis tests. We are most interested in whether posts in a given category have significantly higher views/subscriber or engagement ((likes + comments) / views) on average. 

From our EDA on the training data, we formed the following hypotheses:

1) Posts that use at least one hashtag in either the title or description have significantly higher views/subscriber
2) Posts with an affiliate link have significantly lower views/subscriber.
3) Posts that declare sponsorship in the description (not the title) have significantly lower views/subscriber
4) Posts from users that are verified have significantly higher engagement
5) Posts from users that are verified have significantly lower views/subscriber
6) Posts that mention a popular brand have significantly higher views/subscriber
7) Posts that used keywords from the "korean" keyword group have higher views/subscriber...
8) ...as well as higher engagement
9) Posts that were posted during a "prime hour" have significantly higher views/subscriber
10) Posts that were posted during a "prime hour" have significantly higher engagement.

As a secondary part of our EDA, we also analyzed the coefficients of linear models fit on all categorical features for the targets y1 = views/subscriber and y2 = engagement. We ran t-tests (on the training data) for the features with the largest coefficients and found that: 

11) Posts that mention a popular brand have decreased engagement.
12) Posts that contain at least one hashtag have decreased engagement.
    
13-15) Videos that contain keywords from the "comparing_products" or "products" keyword group decrease engagement, whereas videos with keywords from the "self_ref" keyword group increase engagement

We also found based on our linear models that posts that used keywords from the "korean" keyword group have higher views/subscriber, which was already confirmed by our first EDA. 

We will run several t-tests on the testing set to check whether these hypotheses hold. To account for the fact that we are running multiple hypothesis tests, we lower our significance level from the standard alpha = 0.05 to alpha = 0.005. By the union bound, this gives at most a 15 × 0.005 = 7.5% chance of at least one type I error across 15 tests. Keep in mind that at the very beginning we also ran two t-tests with alpha = 0.005 to determine if the data from the first two months had significantly lower views. This puts the bound on our total error rate at 17 × 0.005 = 8.5%. 
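
The error-rate arithmetic above can be sketched as follows. The 7.5% and 8.5% figures correspond to a union (Bonferroni) bound, which holds whether or not the tests are independent; the second function shows the slightly smaller rate that would apply if the tests happened to be independent. The function names are illustrative and not part of the original analysis.

```python
def familywise_error_bound(alpha: float, n_tests: int) -> float:
    """Union (Bonferroni) bound on the chance of at least one type I error."""
    return min(1.0, alpha * n_tests)


def familywise_error_independent(alpha: float, n_tests: int) -> float:
    """Family-wise error rate under the assumption that all tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests


alpha = 0.005
for n_tests in (15, 17):  # 15 planned tests, plus the 2 earlier date-cutoff tests
    bound = familywise_error_bound(alpha, n_tests)
    independent = familywise_error_independent(alpha, n_tests)
    print(f"{n_tests} tests: bound = {bound:.4f}, if independent = {independent:.4f}")
```

Because several of our features are tested on the same posts, the tests are unlikely to be independent, so the union bound is the safer number to quote.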

Our type II error rate is not as important. It is possible, for example, that we missed the fact that including a hashtag increases views. However, this is fine for our use case: we would rather miss a recommendation than make one that doesn't work. This also matters for power calculations. We are forced to analyze data that already exists, so we cannot design a future study with the sample size needed to reach a desired power. 

Next, we check whether these statistically significant differences are also practically significant. What constitutes practical significance is subjective here. We will use an ex-post analysis of the MDE (minimum detectable effect) to find out if our differences are practically significant, as inspired by [this blog post](https://blogs.worldbank.org/en/impactevaluations/why-ex-post-power-using-estimated-effect-sizes-bad-ex-post-mde-not)  
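
As a rough illustration of what an ex-post MDE looks like in code, the sketch below uses one common normal-approximation form: a multiple of the estimated standard error of the difference in means. The 80% power target, the one-sided alpha = 0.005, the function name `ex_post_mde`, and the synthetic data are all assumptions made for this example; the cells later in this notebook compare Cohen's $d$ against a threshold rather than calling this function.

```python
import numpy as np
from scipy import stats


def ex_post_mde(x, y, alpha=0.005, power=0.80):
    """Approximate one-sided ex-post MDE for a difference in means.

    Normal approximation: MDE = (z_{1-alpha} + z_{power}) * SE, where SE is
    the unpooled (Welch) standard error of the difference in means.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    multiplier = stats.norm.ppf(1 - alpha) + stats.norm.ppf(power)
    return multiplier * se


# Synthetic data for illustration only (not the notebook's dataset).
rng = np.random.default_rng(0)
with_feature = rng.normal(loc=1.2, scale=5.0, size=4000)
without_feature = rng.normal(loc=1.0, scale=5.0, size=6000)

mde = ex_post_mde(with_feature, without_feature)
print("ex-post MDE:", mde)
print("Could a true difference of 0.1 be reliably detected?", mde <= 0.1)
```

If the MDE is larger than the difference we care about, a non-significant result does not rule out a practically relevant effect; it only says the data could not reliably detect one of that size.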

For views/subscriber, we will say that an increase of .1 is practically significant. If a user has 100,000 subscribers, this means that the video got an extra 10,000 views. If a user has 10,000 subscribers, this means that the video got an extra 1,000 views. This is a subjective judgment call.

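I believe this means that our desired value of Cohen's $d$ is $0.1 / s_p$ or larger, where $s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$ is the pooled standard deviation of the two groups, as computed in the `cohen` function below. 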
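
To make the threshold concrete, here is a minimal sketch that converts a raw difference of 0.1 into a Cohen's $d$ threshold by dividing by the pooled standard deviation, which is the convention the `cohen` function later in this notebook uses. The helper names and the summary statistics are hypothetical and chosen only for illustration.

```python
import numpy as np


def pooled_std(var_with, n_with, var_without, n_without):
    """Pooled standard deviation from per-group sample variances and sizes."""
    numerator = (n_with - 1) * var_with + (n_without - 1) * var_without
    return np.sqrt(numerator / (n_with + n_without - 2))


def desired_cohens_d(delta, var_with, n_with, var_without, n_without):
    """Cohen's d corresponding to a raw mean difference of `delta`."""
    return abs(delta) / pooled_std(var_with, n_with, var_without, n_without)


# Hypothetical summary statistics (not taken from the dataset).
d_star = desired_cohens_d(0.1, var_with=36.0, n_with=500,
                          var_without=16.0, n_without=500)
print("desired d:", d_star)
```

When the within-group variance is large, as it is for views/subscriber, a raw difference of 0.1 maps to a very small standardized effect, which is why the desired $d$ values are far below the usual 0.2 cutoff.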

For engagement, it is hard to say what is practically significant. We will use the following standard rough estimates for Cohen's $d$:

- A value of 0.2 represents a small effect size
- A value of 0.5 represents a medium effect size
- A value of 0.8 represents a large effect size

In [62]:
import pandas as pd
import numpy as np
import scipy.stats as stats

df = pd.read_csv('../data/new/no_early_dates_all_features_test.csv')
#Creating the missing target variables
df["engagement"] = (df["likes"] + df["commentsCount"])  / (df["viewCount"] + 1) 
df["views"] = (df["viewCount"]) / ( df["numberOfSubscribers"] + 1) 
df.columns

Index(['Unnamed: 0', 'any_ht', 'commentsCount', 'isChannelVerified', 'likes',
       'numberOfSubscribers', 'text', 'title', 'viewCount',
       'views_per_subscriber', 'duration_in_seconds', 'date',
       'hashtag_indicator', 'has_any_affiliate', 'hasAdinTitle', 'hasAdinText',
       'Engagement_per_Subscriber', 'Engagement_per_View', 'popular_brand',
       'prime_hour', 'product', 'skills/teach', 'speed', 'comparing_products',
       'self_ref', 'budget', 'korean', 'engagement', 'views'],
      dtype='object')

In [63]:
#Note that we use Welch's t-test instead of Student's t-test; the variances of each group are printed as we run the following code
#to show that equal variances cannot be assumed. 

#First we need to create a categorical hashtag variable
df["any_ht"] = 1 * df["any_ht"].astype(bool)

views_features = ["any_ht", "has_any_affiliate", "hasAdinText", "isChannelVerified", "popular_brand", "korean", "prime_hour"]
eng_features = ["isChannelVerified", "korean", "prime_hour", "popular_brand", "any_ht", "comparing_products", "product", "self_ref"] 

def cohen(x,y): 
    '''x is the series from the yes df, y is no'''
    mean_with = x.mean()
    mean_without = y.mean()
    std_with = x.std(ddof=1)
    std_without = y.std(ddof=1)
    n_with = len(x)
    n_without = len(y) 
    pooled_std = np.sqrt(((n_with - 1) * std_with**2 + (n_without - 1) * std_without**2) / (n_with + n_without - 2))
    cohens_d = abs( (mean_with - mean_without) / pooled_std )
    desired_d = abs(0.1 / pooled_std) 
    #we take the abs since which group we expect to be higher vs lower depends on the test 
    return (cohens_d, desired_d)

In [64]:
for feature in views_features: 
    yes = df.loc[df[feature] == 1, "views"]
    no = df.loc[df[feature] == 0, "views"]
    #Welch's t-test (unequal variances)
    t_stat, p_value = stats.ttest_ind(yes, no, equal_var=False)
    cohens_d, desired_d = cohen(yes, no)  #compute once and reuse below
    print(feature)
    #Halving the two-sided p-value gives the one-tailed p-value only when the sign of t_stat
    #matches the hypothesized direction, so check t_stat before reading "Pass".
    print("p-value:", p_value/2)
    print("t-stat:", t_stat)
    print("Pass:", p_value/2 < 0.005)
    print("Cohen's d:", cohens_d)
    print("Desired d:", desired_d)
    print("Desired difference obtained:", cohens_d > desired_d)
    print("variances:", yes.var(), no.var())
    #When t_stat > 0 the "yes" group mean is greater than the "no" group mean; when t_stat < 0 it is lower. 

any_ht
p-value: 2.2368813261800056e-05
t-stat: 4.083816568105291
Pass: True
Cohen's d: 0.08166307936719766
Desired d: 0.018614694438988753
Desired difference obtained: True
variances: 38.600522273288675 11.268364658202877
has_any_affiliate
p-value: 2.326208027584433e-12
t-stat: -6.930613675538518
Pass: True
Cohen's d: 0.10317966863463907
Desired d: 0.01860973050520539
Desired difference obtained: True
variances: 1.5667041297945037 32.05638862504538
hasAdinText
p-value: 0.3847060669838128
t-stat: 0.29319192748459416
Pass: False
Cohen's d: 0.009001836449133846
Desired d: 0.0186005647314472
Desired difference obtained: False
variances: 27.09204363059775 29.220039282372912
isChannelVerified
p-value: 6.965578537378104e-13
t-stat: -7.098876241368914
Pass: True
Cohen's d: 0.12607301523049674
Desired d: 0.01863365138262357
Desired difference obtained: True
variances: 2.7178139581688323 42.275963786372095
popular_brand
p-value: 0.003566895227436586
t-stat: -2.6916694132460037
Pass: True
Cohen's
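
Halving the two-sided p-value gives the intended one-tailed p-value only when the sign of the t-statistic matches the hypothesized direction. A minimal sketch of a direction-aware alternative is shown below, assuming a SciPy version in which `stats.ttest_ind` accepts the `alternative` argument (1.6 or later); the helper name `one_sided_welch` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats


def one_sided_welch(yes, no, expect_higher):
    """Welch's t-test whose alternative matches the hypothesized direction."""
    alternative = "greater" if expect_higher else "less"
    result = stats.ttest_ind(yes, no, equal_var=False, alternative=alternative)
    return result.statistic, result.pvalue


# Synthetic groups for illustration only (not the notebook's dataset).
rng = np.random.default_rng(1)
yes = rng.normal(loc=0.8, scale=1.0, size=300)
no = rng.normal(loc=1.0, scale=3.0, size=700)

t_stat, p_two_sided = stats.ttest_ind(yes, no, equal_var=False)
_, p_less = one_sided_welch(yes, no, expect_higher=False)
_, p_greater = one_sided_welch(yes, no, expect_higher=True)

print("t-stat:", t_stat)
print("two-sided p / 2:", p_two_sided / 2)
print("p if we expected 'yes' to be lower: ", p_less)
print("p if we expected 'yes' to be higher:", p_greater)
```

The two one-sided p-values sum to 1, and the halved two-sided value always equals the smaller of the two, so it reflects whichever direction the data happen to favor rather than the direction that was hypothesized.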

In [68]:
for feature in eng_features: 
    yes = df.loc[df[feature] == 1]
    no = df.loc[df[feature] == 0]
    #Welch's t-test (unequal variances)
    t_stat, p_value = stats.ttest_ind(yes["engagement"], no["engagement"], equal_var=False)
    cohens_d = cohen(yes["engagement"], no["engagement"])[0]  #compute once and reuse below
    print(feature)
    print("p-value:", p_value/2)  #one-tailed only if the sign of t_stat matches the hypothesis
    print("t-stat:", t_stat)
    print("Pass:", p_value/2 < 0.005)
    print("Cohen's d:", cohens_d)

    if cohens_d < 0.2:
        print("Not a significant effect size.")
    elif 0.2 <= cohens_d < 0.5:
        print("Effect size: Small")
    elif 0.5 <= cohens_d < 0.8:
        print("Effect size: Medium")
    elif cohens_d >= 0.8:
        print("Effect size: Large") 
    
    #Caution: this line prints the variances of "views", not "engagement";
    #use yes["engagement"].var() and no["engagement"].var() to check the variances for this target.
    print("variances:", yes["views"].var(), no["views"].var())
    #When t_stat > 0 the "yes" group mean is greater than the "no" group mean; when t_stat < 0 it is lower. 

isChannelVerified
p-value: 1.8424588246660634e-22
t-stat: 9.711456550257012
Pass: True
Cohen's d: 0.20688476655837193
Effect size: Small
variances: 2.7178139581688323 42.275963786372095
korean
p-value: 0.06469617079865729
t-stat: 1.5184830117894588
Pass: False
Cohen's d: 0.05697002981458164
Not a significant effect size.
variances: 203.7225309059297 16.326070479286987
prime_hour
p-value: 2.397052061507639e-09
t-stat: 5.86142932445298
Pass: True
Cohen's d: 0.1339735017390076
Not a significant effect size.
variances: 7.25480481048998 47.252575933148236
popular_brand
p-value: 7.165796787125945e-26
t-stat: -10.625127727251936
Pass: True
Cohen's d: 0.3350873622164535
Effect size: Small
variances: 4.830906603154853 33.309647775579364
any_ht
p-value: 1.137741311926081e-47
t-stat: -14.587929735696385
Pass: True
Cohen's d: 0.34209686442231635
Effect size: Small
variances: 38.600522273288675 11.268364658202877
comparing_products
p-value: 2.041758511745789e-08
t-stat: -5.527548650061268
Pass: Tru

# Final Results:

We have confirmed that the following posts have significantly different views/subscriber on average:
1) Posts that contain at least one hashtag in the title or description have higher views/subscriber with a desired effect size 
2) Posts that contain an affiliate link have lower views/subscriber with a desired effect size
3) Posts from users that are verified have lower views/subscriber with a desired effect size 
4) Posts that mention a popular brand have lower views/subscriber (t-stat = -2.69), which is the opposite direction from hypothesis 6, with a desired effect size 
5) Posts that contain a keyword from the "korean" keyword group have higher views/subscriber with a desired effect size 

We have confirmed that the following posts have significantly different engagement rates on average:
1) Posts from users that are verified have higher engagement with a small effect size 
2) Posts posted during a prime hour have higher engagement, but the effect size is below the "small" threshold (Cohen's $d$ < 0.2) 
3) Posts that mention a popular brand have lower engagement with a small effect size 
4) Posts that contain at least one hashtag in the title or description have lower engagement with a small effect size 
5) Posts that contain at least one keyword from the "comparing_products" keyword group have lower engagement, but the effect size is below the "small" threshold (Cohen's $d$ < 0.2) 
6) Posts that contain at least one keyword from the "product" keyword group have lower engagement, but the effect size is below the "small" threshold (Cohen's $d$ < 0.2) 