In this notebook, we test the hypotheses formed during our EDA. We are most interested in whether posts in a given category have significantly higher views/subscriber or engagement ((likes + comments)/views) on average.

From our EDA on the training data, we formed the following hypotheses:

1) Posts that use at least one hashtag in either the title or description have significantly higher views/subscriber
2) Posts with an affiliate link have significantly lower views/subscriber.
3) Posts that declare sponsorship in the description (not the title) have significantly lower views/subscriber
4) Posts from users that are verified have significantly higher engagement
5) Posts from users that are verified have significantly lower views/subscriber
6) Posts that mention a popular brand have significantly higher views/subscriber
7) Posts that used keywords from the "korean" keyword group have higher views/subscriber...
8) ...as well as higher engagement
9) Posts that were posted during a "prime hour" have significantly higher views/subscriber
10) Posts that were posted during a "prime hour" have significantly higher engagement.

As a secondary part of our EDA, we also analyzed the coefficients of linear models that regress each target, y1 = views/subscriber and y2 = engagement, on all categorical features. We ran t-tests (on the training data) for the features with the largest coefficients and found that:

11) Posts that mention a popular brand have decreased engagement.
12) Posts that contain at least one hashtag have decreased engagement.
    
13-15) Posts that contain keywords from the "comparing_products" or "product" keyword group have decreased engagement, whereas posts with keywords from the "self_ref" keyword group have increased engagement.

Our linear models also indicated that posts using keywords from the "korean" keyword group have higher views/subscriber, which agrees with hypothesis 7 from our first EDA.

We will run t-tests on the testing set to check whether these hypotheses hold. To account for running multiple hypothesis tests, we lower our significance level from the standard alpha = 0.05 to alpha = 0.005. With 15 tests, the chance of at least one type 1 error is then at most 15 × 0.005 = 7.5%. Keep in mind that at the very beginning we also ran two t-tests with alpha = 0.005 to determine whether the data from the first two months had significantly lower views. This raises the bound on our total error rate to 17 × 0.005 = 8.5%.
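
The error-rate figures above follow from a union bound over the individual tests. The short sketch below reproduces that arithmetic and, for comparison, shows the slightly smaller value obtained if the tests are assumed to be independent. The independence assumption and the variable names are ours, for illustration only.

```python
alpha = 0.005      # per-test significance level used in this notebook
n_tests = 15       # hypothesis tests run on the testing set
n_earlier = 2      # earlier t-tests on the first two months of data

# Union bound: P(at least one false positive) <= number of tests * alpha
union_bound = n_tests * alpha
total_union_bound = (n_tests + n_earlier) * alpha

# Exact family-wise error rate under the (assumed) independence of the tests
independent_fwer = 1 - (1 - alpha) ** n_tests

print(f"Union bound for {n_tests} tests: {union_bound:.3f}")
print(f"Union bound for {n_tests + n_earlier} tests: {total_union_bound:.3f}")
print(f"FWER if the {n_tests} tests were independent: {independent_fwer:.4f}")
```

The union bound holds whether or not the tests are independent, which is why it is the safer figure to quote here: several of the features are tested against both targets on the same posts.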

In [22]:
import pandas as pd
import scipy.stats as stats

df = pd.read_csv('../data/new/no_early_dates_all_features_test.csv')
# Create the target variables; the +1 in each denominator guards against division by zero
df["engagement"] = (df["likes"] + df["commentsCount"]) / (df["viewCount"] + 1)
df["views"] = df["viewCount"] / (df["numberOfSubscribers"] + 1)
df.columns

Index(['Unnamed: 0', 'any_ht', 'commentsCount', 'isChannelVerified', 'likes',
       'numberOfSubscribers', 'text', 'title', 'viewCount',
       'views_per_subscriber', 'duration_in_seconds', 'date',
       'hashtag_indicator', 'has_any_affiliate', 'hasAdinTitle', 'hasAdinText',
       'Engagement_per_Subscriber', 'Engagement_per_View', 'popular_brand',
       'prime_hour', 'product', 'skills/teach', 'speed', 'comparing_products',
       'self_ref', 'budget', 'korean', 'engagement', 'views'],
      dtype='object')

In [35]:

# Convert any_ht to a 0/1 indicator so posts can be split into "yes"/"no" groups
df["any_ht"] = df["any_ht"].astype(bool).astype(int)

views_features = ["any_ht", "has_any_affiliate", "hasAdinText", "isChannelVerified", "popular_brand", "korean", "prime_hour"]
eng_features = ["isChannelVerified", "korean", "prime_hour", "popular_brand", "any_ht", "comparing_products", "product", "self_ref"] 

for feature in views_features:
    yes = df.loc[df[feature] == 1]
    no = df.loc[df[feature] == 0]
    # Two-sided t-test (SciPy's default pools the variances of the two groups).
    # t_stat > 0 means the "yes" group mean is greater than the "no" group mean; t_stat < 0 means it is smaller.
    t_stat, p_value = stats.ttest_ind(yes["views"], no["views"])
    print(feature)
    print("p-value:", p_value)
    print("t-stat:", t_stat)
    print("Pass:", p_value < 0.005)

any_ht
p-value: 0.0004957348344092647
t-stat: 3.4844920678893194
Pass: True
has_any_affiliate
p-value: 0.004945310983733811
t-stat: -2.811362926442286
Pass: True
hasAdinText
p-value: 0.7752593760763163
t-stat: 0.28551198467456
Pass: False
isChannelVerified
p-value: 1.0474830443308478e-07
t-stat: -5.323198493686492
Pass: True
popular_brand
p-value: 0.12786023940923
t-stat: -1.5227534874232165
Pass: False
korean
p-value: 8.82825602920083e-27
t-stat: 10.752370253527811
Pass: True
prime_hour
p-value: 0.11224490431603802
t-stat: -1.588359531809305
Pass: False
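
Because the test is two-sided, a "Pass" only says the group means differ; the direction comes from the sign of the t-statistic. A minimal sketch of that reading is shown below. The helper `verdict` and the `reported` dictionary are hypothetical names introduced here, and the values are rounded copies of the first four results printed above.

```python
def verdict(t_stat, p_value, alpha=0.005):
    """Classify the "yes" group mean relative to the "no" group mean."""
    if p_value >= alpha:
        return "not significant"
    return "higher" if t_stat > 0 else "lower"

# (t-stat, p-value) pairs, rounded from the views/subscriber output above
reported = {
    "any_ht": (3.484, 0.000496),
    "has_any_affiliate": (-2.811, 0.004945),
    "hasAdinText": (0.286, 0.775259),
    "isChannelVerified": (-5.323, 1.047e-07),
}

summary = {feature: verdict(t, p) for feature, (t, p) in reported.items()}
for feature, outcome in summary.items():
    print(f"{feature}: {outcome}")
```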


In [39]:
for feature in eng_features:
    yes = df.loc[df[feature] == 1]
    no = df.loc[df[feature] == 0]
    # Two-sided t-test (SciPy's default pools the variances of the two groups).
    # t_stat > 0 means the "yes" group mean is greater than the "no" group mean; t_stat < 0 means it is smaller.
    t_stat, p_value = stats.ttest_ind(yes["engagement"], no["engagement"])
    print(feature)
    print("p-value:", p_value)
    print("t-stat:", t_stat)
    print("Pass:", p_value < 0.005)

isChannelVerified
p-value: 2.9292719089874792e-18
t-stat: 8.735324333258344
Pass: True
korean
p-value: 0.21084840241325786
t-stat: 1.2513371067753714
Pass: False
prime_hour
p-value: 2.834079790861884e-09
t-stat: 5.947731083782474
Pass: True
popular_brand
p-value: 4.935861676246001e-27
t-stat: -10.806639390686946
Pass: True
any_ht
p-value: 1.2127742939725515e-47
t-stat: -14.596973562182166
Pass: True
comparing_products
p-value: 9.322208388064035e-07
t-stat: -4.909307887522326
Pass: True
product
p-value: 0.00037648697069116226
t-stat: -3.5575692448201424
Pass: True
self_ref
p-value: 0.011960421529940018
t-stat: 2.513889405307217
Pass: False
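
The calls above use the default `equal_var=True`, which is Student's t-test with a pooled variance. When the "yes" and "no" groups differ a lot in size or spread, Welch's t-test (`equal_var=False`) may be a useful robustness check. The sketch below compares the two on synthetic data; the group sizes, means, and standard deviations are made up for illustration and are not taken from this notebook's data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Synthetic stand-ins: a small, noisy "yes" group and a large, tight "no" group
yes_group = rng.normal(loc=1.0, scale=2.0, size=200)
no_group = rng.normal(loc=0.0, scale=0.5, size=2000)

# Student's t-test (pooled variance), as used in the loops above
student = stats.ttest_ind(yes_group, no_group)

# Welch's t-test: does not assume the two groups share a variance
welch = stats.ttest_ind(yes_group, no_group, equal_var=False)

print("Student t-stat:", student.statistic, "p-value:", student.pvalue)
print("Welch   t-stat:", welch.statistic, "p-value:", welch.pvalue)
```

If both variants agree on a feature, the conclusion does not hinge on the equal-variance assumption.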


# Final Results:

We have confirmed that the following posts have significantly different views/subscriber on average:
1) Posts that contain at least one hashtag in the title or description have higher views/subscriber
2) Posts that contain an affiliate link have lower views/subscriber
3) Posts from users that are verified have lower views/subscriber
4) Posts that contain a keyword from the "korean" keyword group have higher views/subscriber

We have confirmed that the following posts have significantly different engagement rates on average:
1) Posts from users that are verified have higher engagement
2) Posts posted during a prime hour have higher engagement
3) Posts that mention a popular brand have lower engagement
4) Posts that contain at least one hashtag in the title or description have lower engagement.
5) Posts that contain at least one keyword from the "comparing_products" keyword group have lower engagement
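6) Posts that contain at least one keyword from the "product" keyword group have lower engagement.

The remaining hypotheses were not confirmed at alpha = 0.005: sponsorship declared in the description, popular brand mentions, and prime-hour posting showed no significant difference in views/subscriber, and the "korean" and "self_ref" keyword groups showed no significant difference in engagement.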