In this notebook, we run t-tests to check whether posts in a given category significantly overperform or underperform with respect to views or engagement.

From our notebook views_models_categorical, the top 5 categorical features by coefficient size in the linear regression with views as the target variable are:
1) "korean" - Posts that used keywords from the "korean" keyword group
2) "has_any_affiliate" - Posts that contained affiliate links
3) "acronym" - Posts that used keywords from the "acronym" keyword group
4) "product" - Posts that used keywords from the "product" keyword group
5) "hashtags" - Posts that contained a hashtag in the title or description

From our notebook engagement_models_categorical, the top 5 categorical features by coefficient size in the linear regression with engagement as the target variable are:
1) "hashtags" - Posts that contained a hashtag in the title or description
2) "skincare" - Posts that used keywords from the "skincare" keyword group
3) "self_ref" - Posts that used keywords from the "self_ref" keyword group
4) "product" - Posts that used keywords from the "product" keyword group
5) "popular_brand" - Posts that mentioned a popular brand

In [115]:
import pandas as pd
import numpy as np
import math 
df = pd.read_csv(r"no_early_dates_90_days.csv")
df.columns

Index(['Unnamed: 0.3', 'Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0',
       'channelDescription', 'channelJoinedDate', 'channelTotalVideos',
       'channelTotalViews', 'channelUsername', 'commentsCount', 'date',
       'duration', 'id', 'isChannelVerified', 'likes', 'numberOfSubscribers',
       'order', 'text', 'title', 'url', 'viewCount', 'likes_per_subscriber',
       'comments_per_subscriber', 'views_per_subscriber',
       'duration_in_seconds', 'datetime_date', 'hashtags', 'comm_to_views',
       'likes_to_views', 'popular_brand', 'has_title_affiliate',
       'has_description_affiliate', 'has_channel_description_affiliate',
       'has_any_affiliate', 'has_business_inquiry',
       'engagement_per_subscriber', 'product', 'budget', 'self_ref', 'acronym',
       'korean', 'speed', 'skills/teach', 'skincare', 'comparing_products',
       'datetime', 'hour', 'day_of_week', 'month', 'year', 'day_name',
       'engagement_rate', 'prime_time', 'cluster1', 'prime_time1',
       'postin

In [116]:
#Creating the missing target variable: (likes + comments) per view, with +1 in the denominator to avoid division by zero
df["engagement"] = (df["likes"] + df["commentsCount"])  / (df["viewCount"] + 1) 

#Now we run t-tests for each variable 
import scipy.stats as stats
yes = df.loc[ df["korean"] == 1]
no = df.loc[ df["korean"] == 0]
t_stat, p_value = stats.ttest_ind( yes["views_per_subscriber"], no["views_per_subscriber"] ) 
print(p_value/2) #ttest_ind returns a two-sided p-value; halving it gives the one-sided p-value in the direction given by the sign of t_stat.
print(t_stat)
#When t>0 the "yes" group mean is greater than the "no" group mean; when t<0 it is less.

2.915611299093363e-31
11.606052343907693
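
Halving the two-sided p-value is only valid when the sign of t_stat matches the direction being tested. As a minimal sketch, the direction can instead be stated explicitly through the `alternative` argument of `scipy.stats.ttest_ind` (available in SciPy 1.6 and later). The synthetic `yes_demo` and `no_demo` samples are illustrative assumptions and are not taken from the dataset used in this notebook.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-ins for the "yes" and "no" groups (assumed data, for illustration only).
rng = np.random.default_rng(0)
yes_demo = rng.normal(loc=0.5, scale=1.0, size=200)
no_demo = rng.normal(loc=0.0, scale=1.0, size=200)

# Two-sided test, as used throughout this notebook.
t_demo, p_two_sided = stats.ttest_ind(yes_demo, no_demo)

# One-sided tests with the direction stated explicitly.
# "greater" tests whether the mean of the first sample exceeds the mean of the second.
_, p_greater = stats.ttest_ind(yes_demo, no_demo, alternative="greater")
_, p_less = stats.ttest_ind(yes_demo, no_demo, alternative="less")

print(t_demo)
print(p_two_sided / 2, p_greater)  # these agree because t_demo > 0 here
print(p_less)                      # close to 1: the data do not support "less"
```

When t_stat is positive, the "greater" p-value equals half the two-sided one, and the "less" p-value equals one minus that half. Halving therefore gives the p-value for whichever direction the data happen to point in.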


In [117]:
yes = df.loc[ df["has_any_affiliate"] == 1] 
no = df.loc[ df["has_any_affiliate"] == 0] 
t_stat, p_value = stats.ttest_ind( yes["views_per_subscriber"], no["views_per_subscriber"] ) 
print(p_value/2) #We are really doing a one-tailed t-test so we divide the p-value by 2. 
print(t_stat)

0.018942233751125373
-2.0763548560534826


In [118]:
yes = df.loc[ df["acronym"] == 1] 
no = df.loc[ df["acronym"] == 0]
t_stat, p_value = stats.ttest_ind( yes["views_per_subscriber"], no["views_per_subscriber"] ) 
print(p_value/2) #We are really doing a one-tailed t-test so we divide the p-value by 2. 
print(t_stat)

0.22229268573268257
0.7645008008755365


In [119]:
yes = df.loc[ df["product"] == 1] 
no = df.loc[ df["product"] == 0]
t_stat, p_value = stats.ttest_ind( yes["views_per_subscriber"], no["views_per_subscriber"] ) 
print(p_value/2) #We are really doing a one-tailed t-test so we divide the p-value by 2. 
print(t_stat)

2.340621006442661e-05
4.072634310047421


In [120]:
yes = df.loc[ df["hashtags"] == 1] 
no = df.loc[ df["hashtags"] == 0]
t_stat, p_value = stats.ttest_ind( yes["views_per_subscriber"], no["views_per_subscriber"] ) 
print(p_value/2) #We are really doing a one-tailed t-test so we divide the p-value by 2. 
print(t_stat)

0.1407022115162404
-1.0772773043171597


In [121]:
#Now for engagement 
yes = df.loc[ df["hashtags"] == 1] 
no = df.loc[ df["hashtags"] == 0] 
t_stat, p_value = stats.ttest_ind( yes["engagement"], no["engagement"] ) 
print(p_value/2) #We are really doing a one-tailed t-test so we divide the p-value by 2. 
print(t_stat)

2.8671283055588e-35
-12.424712053167248


In [122]:
#Now for engagement 
yes = df.loc[ df["skincare"] == 1] 
no = df.loc[ df["skincare"] == 0] 
t_stat, p_value = stats.ttest_ind( yes["engagement"], no["engagement"] ) 
print(p_value/2) #We are really doing a one-tailed t-test so we divide the p-value by 2. 
print(t_stat)

1.0912747645231904e-28
-11.081598949068626


In [123]:
#Now for engagement 
yes = df.loc[ df["self_ref"] == 1] 
no = df.loc[ df["self_ref"] == 0] 
t_stat, p_value = stats.ttest_ind( yes["engagement"], no["engagement"] ) 
print(p_value/2) #We are really doing a one-tailed t-test so we divide the p-value by 2. 
print(t_stat)

0.0022624596295346766
2.839630778160876


In [124]:
#Now for engagement 
yes = df.loc[ df["product"] == 1] 
no = df.loc[ df["product"] == 0] 
t_stat, p_value = stats.ttest_ind( yes["engagement"], no["engagement"] ) 
print(p_value/2) #We are really doing a one-tailed t-test so we divide the p-value by 2. 
print(t_stat)

6.165838273994999e-05
-3.840820459384414


In [125]:
#Now for engagement 
yes = df.loc[ df["popular_brand"] == 1] 
no = df.loc[ df["popular_brand"] == 0] 
t_stat, p_value = stats.ttest_ind( yes["engagement"], no["engagement"] ) 
print(p_value/2) #We are really doing a one-tailed t-test so we divide the p-value by 2. 
print(t_stat)

6.220362098808018e-37
-12.688319127539316
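
All calls above use the default `equal_var=True`, which pools the variances of the two groups. If the "yes" and "no" groups differ a lot in size and spread, Welch's variant (`equal_var=False`) may be a useful robustness check. The sketch below compares the two on synthetic data; the sample sizes and spreads are assumptions chosen only to make the difference visible, not properties of this dataset.

```python
import numpy as np
import scipy.stats as stats

# Synthetic groups with unequal sizes and unequal variances (assumed, for illustration only).
rng = np.random.default_rng(1)
yes_demo = rng.normal(loc=0.5, scale=3.0, size=50)
no_demo = rng.normal(loc=0.0, scale=1.0, size=500)

# Student's t-test with pooled variance (the default used in this notebook).
student = stats.ttest_ind(yes_demo, no_demo)

# Welch's t-test, which does not assume equal variances.
welch = stats.ttest_ind(yes_demo, no_demo, equal_var=False)

print(student.statistic, student.pvalue / 2)
print(welch.statistic, welch.pvalue / 2)
```

Both tests use the same difference in group means, so the t-statistics always share a sign. Only the standard error and degrees of freedom differ.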


---Conclusions and Recommendations---

With a significance level of alpha = 0.05, we found the following.

For views per subscriber:
1) Posts that contained a keyword from the "korean" keyword group had higher views/subscriber on average
2) Posts that contained an affiliate link had lower views/subscriber on average
3) Posts that contained a keyword from the "product" keyword group had higher views/subscriber on average
The differences for the "acronym" keyword group (one-sided p = 0.22) and for hashtags (one-sided p = 0.14) were not significant.

For engagement:
1) Posts that contained a hashtag had lower engagement on average
2) Posts that contained a keyword from the "skincare" keyword group had lower engagement on average
3) Posts that contained a keyword from the "self_ref" keyword group had higher engagement on average
4) Posts that contained a keyword from the "product" keyword group had lower engagement on average
5) Posts that mentioned a popular brand had lower engagement on average
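
The ten tests in this notebook repeat the same steps with a different flag column and target each time. The sketch below shows one way to collect them into a single summary table. The helper name `summarize_ttests` and the synthetic `demo` frame are hypothetical; with the real data, `demo` would be replaced by `df` and the list of (flag, target) pairs would hold the ten pairs tested above.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

def summarize_ttests(frame, tests, alpha=0.05):
    """Run a two-sample t-test for each (flag, target) pair and tabulate the results."""
    rows = []
    for flag, target in tests:
        yes = frame.loc[frame[flag] == 1, target]
        no = frame.loc[frame[flag] == 0, target]
        t_stat, p_value = stats.ttest_ind(yes, no)
        p_one_sided = p_value / 2  # one-sided p-value in the direction of the sign of t_stat
        rows.append({
            "feature": flag,
            "target": target,
            "t_stat": t_stat,
            "p_one_sided": p_one_sided,
            "direction": "higher" if t_stat > 0 else "lower",
            "significant": p_one_sided < alpha,
        })
    return pd.DataFrame(rows)

# Synthetic data with the same column names as two of the tests above (assumed, for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "korean": rng.integers(0, 2, size=400),
    "hashtags": rng.integers(0, 2, size=400),
})
demo["views_per_subscriber"] = rng.normal(size=400) + 1.0 * demo["korean"]
demo["engagement"] = rng.normal(size=400) - 1.0 * demo["hashtags"]

summary = summarize_ttests(
    demo,
    [("korean", "views_per_subscriber"), ("hashtags", "engagement")],
)
print(summary)
```

A table like this makes it easier to see the direction and significance of each test side by side, including the tests that did not reach significance.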