In this notebook we conduct exploratory data analysis for the question "What effect does mentioning popular brands have on engagement?" We look at mentions in both the title and the description.

We have two engagement metrics: likes per subscriber and comments per subscriber. Both are highly correlated with views, since a short must be seen before it can be liked or commented on.
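
One quick way to check this correlation is a rank-correlation matrix of the three per-subscriber metrics. The sketch below uses a small made-up DataFrame in place of the real training data; only the column names are taken from this notebook, and Spearman correlation is an assumption on our part, chosen because engagement ratios tend to be heavily skewed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (values are made up for illustration).
rng = np.random.default_rng(0)
views = rng.lognormal(mean=-1.0, sigma=1.0, size=500)
toy = pd.DataFrame({
    "views_per_subscriber": views,
    # Likes and comments are generated as noisy fractions of views.
    "likes_per_subscriber": views * rng.uniform(0.02, 0.08, size=500),
    "comments_per_subscriber": views * rng.uniform(0.0002, 0.0010, size=500),
})

# Spearman works on ranks, so a few viral shorts do not dominate the result.
corr = toy.corr(method="spearman")
print(corr.round(2))
```

On the real data, `toy` would be replaced by `df2` restricted to the same three columns.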

We have hand-selected 30 popular beauty brands; 15 of these are skincare brands and 15 of these are makeup brands. This list is subjective and may not be exhaustive but should cover a good number of popular brands. 

Brands:
Makeup:
1) Natasha Denona
2) Tower 28
3) Pat McGrath
4) Urban Decay
5) ColourPop
6) Fenty Beauty
7) e.l.f. Cosmetics
8) NYX Professional Makeup
9) Essence
10) Benefit Cosmetics
11) Anastasia Beverly Hills
12) Tarte
13) Milk Makeup
14) Maybelline
15) Oden's Eye

Skincare:
1) The Ordinary
2) Beauty of Joseon
3) Bubble
4) Paula's Choice
5) CeraVe
6) Good Molecules
7) COSRX
8) Olive Young
9) Dr. Dennis Gross
10) Skinfix
11) Drunk Elephant
12) La Roche-Posay
13) Supergoop
14) Glow Recipe
15) Rhode

In [3]:
# Lower-case aliases for each brand, including spaced and unspaced spellings (the latter for hashtags).
# Note: "josean" and "grossman" will not match the usual spellings "Beauty of Joseon" and "Dr. Dennis Gross".
mentions = ["natasha denona", "natashadenona", "denona", "tower 28", "tower28", "pat mcgrath", "pmg labs", "mcgrath", "patmcgrath"]
mentions += ["urban decay", "urbandecay", "colourpop", "colorpop", "colour pop", "fenty", "e.l.f.", "elf", "nyx", "essence", "benefit"]
mentions += ["anastasia", "abh", "tarte", "milk", "maybelline", "oden's eye", "oden'seye", "odenseye", "the ordinary", "theordinary"]
mentions += ["beauty of josean", "josean", "bubble", "paula's choice", "paula'schoice", "paulaschoice", "cerave", "good molecules"]
mentions += ["cosrx", "olive young", "oliveyoung", "grossman", "skinfix", "drunk elephant", "drunkelephant", "roche-posay", "roche posay", "rocheposay"]
mentions += ["supergoop", "glow recipe", "glowrecipe", "rhode"]

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df1 = pd.read_csv("clean_data_train.csv")
df2 = pd.read_csv("no_early_dates_30_days_train.csv")

# Replace missing (non-string) descriptions with an empty string so they can be concatenated and searched.
for df in (df1, df2):
    df["text"] = df["text"].apply(lambda v: v if isinstance(v, str) else "")

In [7]:
# I am now creating an indicator variable that is True if any brand alias is mentioned and False otherwise.

def mentions_popular_brand(text):
    """Return True if any alias in `mentions` is a substring of the lower-cased text."""
    text = text.lower()
    return any(ment in text for ment in mentions)

for df in (df1, df2):
    # A single column holding both title and description, so one search covers both.
    df["title plus desc"] = df["title"] + df["text"]
    df["popular brand ind"] = df["title plus desc"].apply(mentions_popular_brand)
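
The `in` check above is a plain substring match, so short aliases can fire on unrelated words: "elf" is contained in "myself" and "shelf", and "milk" in "milkshake". A minimal sketch of a stricter, whole-word variant using the standard `re` module follows. The helper name `mentions_brand_whole_word` and the shortened alias list are hypothetical, and swapping this in would change the counts reported below, so treat it as a possible robustness check rather than what this notebook ran.

```python
import re

# Shortened alias list for illustration; the notebook's full `mentions` list could be used instead.
aliases = ["elf", "e.l.f.", "milk", "tower 28", "the ordinary"]

# (?<!\w) and (?!\w) require that the alias is not glued to another word character,
# which also works for aliases such as "e.l.f." that end in punctuation.
pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(a) for a in aliases) + r")(?!\w)"
)

def mentions_brand_whole_word(text):
    """Return True if any alias appears in `text` as a whole word or phrase."""
    return pattern.search(text.lower()) is not None

print(mentions_brand_whole_word("Doing this look myself"))      # False: "elf" only inside "myself"
print(mentions_brand_whole_word("New ELF primer #elf"))         # True: standalone "elf"
print(mentions_brand_whole_word("Strawberry milkshake nails"))  # False: "milk" only inside "milkshake"
```

The trade-off is that run-together hashtags such as "#elfcosmetics" no longer match "elf", so they would need their own aliases in the list.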

In [13]:
df1_yes = df1[df1["popular brand ind"]]
df1_no = df1[~df1["popular brand ind"]]

print(df1_yes.shape)
print(df1.shape)
# 870 of the 5,790 posts (about 15%) mention a popular brand, which is enough for a group comparison.

(870, 28)
(5790, 28)


In [15]:
df2_yes = df2[df2["popular brand ind"]]
df2_no = df2[~df2["popular brand ind"]]

print(df2_yes.shape)
print(df2.shape)
# 1,266 of the 7,937 posts (about 16%) mention a popular brand, which is enough for a group comparison.

(1266, 27)
(7937, 27)


In [29]:
print(df2_yes["views_per_subscriber"].mean())
print(df2_no["views_per_subscriber"].mean())
print(df2["views_per_subscriber"].mean())

1.9040020843261447
0.7610972067437469
0.9433975185768472


In [31]:
print(df2_yes["likes_per_subscriber"].mean())
print(df2_no["likes_per_subscriber"].mean())
print(df2["likes_per_subscriber"].mean())

0.05917265855890454
0.039319825967545206
0.04248647407900557


In [33]:
print(df2_yes["comments_per_subscriber"].mean())
print(df2_no["comments_per_subscriber"].mean())
print(df2["comments_per_subscriber"].mean())

# We use df2 for the comparisons above because likes, comments, and views accumulate over time.

0.0005353636478497805
0.00046479661517011717
0.00047605248809092534


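Here are my initial thoughts based on the analysis. Posts that mention a popular brand average about 2.5 times the views per subscriber of posts that do not (1.90 vs. 0.76). Likes per subscriber are about 50% higher (0.059 vs. 0.039) and comments per subscriber about 15% higher (0.00054 vs. 0.00046), so the gain in likes and comments is much smaller than the gain in views, and none of these differences has been tested for significance yet. This is quite strange! One possible reason is that these posts tend to mention popular brands in hashtags, which increases reach and therefore views, even though viewers are not more likely to like or comment on a post just because it mentions a popular brand.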
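
Because per-subscriber metrics are typically right-skewed, a handful of viral shorts can pull a group mean well away from the typical post. A sketch of a side-by-side summary that reports the median next to the mean is given here; the DataFrame is synthetic and only the column names come from this notebook, so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; not the notebook's results.
rng = np.random.default_rng(1)
n = 1000
toy = pd.DataFrame({
    "popular brand ind": rng.random(n) < 0.16,
    "views_per_subscriber": rng.lognormal(-1.0, 1.2, n),
    "likes_per_subscriber": rng.lognormal(-4.0, 1.0, n),
    "comments_per_subscriber": rng.lognormal(-8.0, 1.0, n),
})

metrics = ["views_per_subscriber", "likes_per_subscriber", "comments_per_subscriber"]

# One row per group (False / True), with count, mean and median for each metric.
summary = toy.groupby("popular brand ind")[metrics].agg(["count", "mean", "median"])
print(summary.T)
```

If the medians of the two groups are close while the means differ a lot, the gap in means is probably driven by a few very large values.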

In [36]:
#We are allowed to run as many t-tests as we like on the training set. 
import scipy.stats as stats
t_stat, p_value = stats.ttest_ind(df2_yes["views_per_subscriber"], df2_no["views_per_subscriber"])
print(p_value/2) # One-tailed test (H1: posts with brand mentions have higher mean views), so we halve the two-sided p-value. This is only valid when t_stat is positive.
print(t_stat)

0.005443751267405046
2.546889441241137


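Yes: posts that mention a popular brand have significantly higher views per subscriber (t ≈ 2.55, one-tailed p ≈ 0.005). This is an association in observational data, so on its own it does not show that the mention causes the extra views.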
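
The test above uses SciPy's default pooled-variance t-test. As a robustness check, one could also run Welch's t-test, which does not assume equal variances in the two groups, and a Mann-Whitney U test, which compares ranks and is less sensitive to outliers. The sketch below runs both on made-up lognormal samples; it assumes SciPy 1.6 or later for the `alternative` argument of `ttest_ind`, and the sample sizes merely mirror the group sizes in `df2`.

```python
import numpy as np
from scipy import stats

# Made-up samples standing in for the views per subscriber of df2_yes / df2_no.
rng = np.random.default_rng(2)
yes = rng.lognormal(mean=-0.8, sigma=1.2, size=1266)
no = rng.lognormal(mean=-1.0, sigma=1.2, size=6671)

# Welch's t-test, one-sided: H1 is mean(yes) > mean(no).
welch = stats.ttest_ind(yes, no, equal_var=False, alternative="greater")

# Mann-Whitney U test, one-sided: H1 is that `yes` tends to be larger than `no`.
mwu = stats.mannwhitneyu(yes, no, alternative="greater")

print(f"Welch t = {welch.statistic:.3f}, one-sided p = {welch.pvalue:.4g}")
print(f"Mann-Whitney U = {mwu.statistic:.0f}, one-sided p = {mwu.pvalue:.4g}")
```

Passing `alternative="greater"` also removes the need to halve the p-value by hand, and it stays correct if the t-statistic turns out to be negative.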