# Upworthy A/B title testing `5 points`

Source: https://www.nature.com/articles/s41597-021-00934-7 and https://upworthy.natematias.com/.

The dataset itself is a bit awkward to locate, but it can be downloaded directly from [here](https://osf.io/vy8mj/download), as linked from [this page](https://osf.io/vy8mj/).

Description from [Data Is Plural](https://www.data-is-plural.com/archive/2021-08-18-edition/):

> The Upworthy Research Archive describes 32,000+ headline-testing experiments conducted in 2013–15 by Upworthy, the online publication that popularized a once-ubiquitous style of headline. The dataset, contributed by the publication to a team of academics, is split into three tranches for use in different phases of research. In total, it covers 150,000+ headline-plus-image permutations; for each, it provides the headline, an image identifier, the number of viewers assigned to see it, the number who clicked, and other details.

### Topics covered

* String functions
* Reading and understanding data dictionaries
* Sentiment analysis

## Two questions about questions `1 point`


Headlines that end in questions are so common there's even a rule about it: Betteridge's law of headlines! It states "Any headline that ends in a question mark can be answered by the word no."

### How often were Upworthy headlines phrased as questions?

In [6]:
import pandas as pd

df1 = pd.read_csv('upworthy-archive-confirmatory-packages-03.12.2020.csv', low_memory=False)
df1.loc[0].headline

'Let’s See … Hire Cops, Pay Teachers, Buy Books For Schools. Or Kill People. Hard Choice, Right?'

In [7]:
df1['headline'] = df1['headline'].astype(str)
# Absolute number of headlines that contain a question mark
df1[df1['headline'].str.contains('?', regex=False)].headline.count()

15992

In [8]:
# Percentage of all headlines that contain a question mark
(df1[df1['headline'].str.contains('?', regex=False)].headline.count()
 / df1.headline.count()
 * 100).round(2)

15.15
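The count above treats any headline that contains a question mark as a question. A stricter reading, closer to Betteridge's law, would count only headlines that end in one. The sketch below contrasts the two definitions on made-up headlines (not taken from the dataset); the variable names are illustrative, and the same two expressions could be applied to `df1['headline']`.

```python
import pandas as pd

# Made-up headlines for illustration only (not taken from the dataset)
headlines = pd.Series([
    "Is This The Best Idea Ever?",
    "What Happened Next? You Won't Believe It.",
    "A Plain Statement With No Question",
    "Could You Do This?  ",
])

# Loose definition: a question mark appears anywhere in the headline
contains_q = headlines.str.contains("?", regex=False)

# Strict definition: the headline ends with a question mark
# (trailing whitespace is stripped first so it does not hide the "?")
ends_with_q = headlines.str.strip().str.endswith("?")

print("contains '?':", int(contains_q.sum()))
print("ends with '?':", int(ends_with_q.sum()))
print("share ending with '?' (%):", round(ends_with_q.mean() * 100, 2))
```

On the real data the two counts may differ noticeably, since headlines such as the first row end in a question while others only include one mid-sentence.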

### Are headlines with question marks more likely to succeed than ones without?

Along with your actual answer, defend what you mean by "succeed" (for example, what column you picked to represent success). You probably want to read the "Data Records" section from the source.

In [9]:
# Mean of the significance column for headlines with a question mark
df1[df1['headline'].str.contains('?', regex=False)].significance.mean()

37.5809779889945

In [10]:
# Mean of the significance column for headlines without a question mark
df1[~df1['headline'].str.contains('?', regex=False)].significance.mean()

41.15700041313547

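Articles with a higher significance value were more likely to be clicked on, so I use the mean of the `significance` column as the measure of success.
Headlines with a question mark average 37.58, compared with 41.16 for headlines without one, so by this measure questions were less likely to succeed.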
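The comparison above relies on the `significance` column. Another possible definition of success, offered here only as an assumption and not as the source's recommended metric, is the click-through rate: `clicks` divided by `impressions`. A minimal sketch on made-up numbers (the `toy` frame and the `group` labels are hypothetical):

```python
import pandas as pd

# Made-up numbers for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "headline": ["Is It True?", "It Is True", "Why Though?", "Because"],
    "impressions": [1000, 1000, 2000, 2000],
    "clicks": [10, 20, 30, 50],
})

# Label each row by whether its headline contains a question mark
is_question = toy["headline"].str.contains("?", regex=False)
toy["group"] = is_question.map({True: "question", False: "no question"})

# Pool clicks and impressions per group before dividing, so packages
# shown to more viewers carry proportionally more weight
totals = toy.groupby("group")[["clicks", "impressions"]].sum()
ctr = totals["clicks"] / totals["impressions"]

print(ctr)
```

Applied to the real data, the same steps would use `df1` in place of `toy`. Because packages were tested against each other within a test (`clickability_test_id`), a comparison inside each test would likely be fairer than pooling everything, but that goes beyond this sketch.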

### Upworthy was thought of as an "obnoxiously positive clickbait" kind of site. Does the data support this? `3 points`

In [11]:
df1.head(1)

Unnamed: 0.1,Unnamed: 0,created_at,updated_at,clickability_test_id,excerpt,headline,lede,slug,eyecatcher_id,impressions,clicks,significance,first_place,winner,share_text,square,test_week
0,11,2014-11-20 11:33:26.475,2016-04-02 16:25:54.046,546dd17e26714c82cc00001c,Things that matter. Pass 'em on.,"Let’s See … Hire Cops, Pay Teachers, Buy Books...",<p>Iff you start with the basic fact that inno...,let-s-see-hire-cops-pay-teachers-buy-books-for...,546dce659ad54ec65b000041,3118,8,0.1,False,False,,,201446


In [None]:
from textblob import TextBlob
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
import nltk

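# One-time downloads: the VADER lexicon for SIA, plus the corpus and
# tokenizer that TextBlob's NaiveBayesAnalyzer relies on
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
nltk.download('punkt')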

In [21]:
df2 = df1.head(2000)

sia = SIA()
blobber = Blobber(analyzer=NaiveBayesAnalyzer())

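def get_scores(content):
    # Score one headline with three analyzers; each score runs from -1 (negative) to 1 (positive)
    blob = TextBlob(content)
    nb_blob = blobber(content)
    sia_scores = sia.polarity_scores(content)

    return pd.Series({
        'content': content,
        'textblob': blob.sentiment.polarity,
        # Naive Bayes returns class probabilities; their difference gives a -1 to 1 score
        'textblob_bayes': nb_blob.sentiment.p_pos - nb_blob.sentiment.p_neg,
        'nltk': sia_scores['compound'],
    })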

scores = df2.headline.apply(get_scores)
scores.style.background_gradient(cmap='RdYlGn', axis=None, low=0.4, high=0.4)

Unnamed: 0,content,textblob,textblob_bayes,nltk
0,"Let’s See … Hire Cops, Pay Teachers, Buy Books For Schools. Or Kill People. Hard Choice, Right?",-0.002976,0.174192,-0.7579
1,People Sent This Lesbian Questions And Her Raised Eyebrow Game Shames Us All,-0.4,0.970016,-0.4019
2,$3 Million Is What It Takes For A State To Legally Kill Someone,0.2,0.841553,-0.6486
3,The Fact That Sometimes Innocent People Are Executed Is Enough To End The Death Penalty. But This?,0.25,0.429043,-0.4118
4,Reason #351 To End The Death Penalty: It Costs $3 Million Per Case.,0.0,0.117873,-0.7845
5,"I Was Already Against The Death Penalty, But Now That I See What It Costs Us All? Ahem.",0.0,-0.304713,-0.5346
6,I'll Say It: It's Not OK For States To Legally Murder People.,-0.025,0.907326,-0.7737
7,The Fact That Sometimes Innocent People Are Executed Is Enough To End The Death Penalty. But This?,0.25,0.429043,-0.4118
8,The Fact That Sometimes Innocent People Are Executed Is Enough To End The Death Penalty. But This?,0.25,0.429043,-0.4118
9,The Fact That Sometimes Innocent People Are Executed Is Enough To End The Death Penalty. But This?,0.25,0.429043,-0.4118


### Were the selected headlines more likely to be positive or negative? `1 point`

In [22]:
print(scores['textblob'].mean())
print(scores['textblob_bayes'].mean())
print(scores['nltk'].mean())

0.07325566438191439
0.15946016656446485
0.0038070499999999993


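On average, the selected headlines lean positive: all three mean scores are above zero (TextBlob 0.073, TextBlob Naive Bayes 0.159, NLTK VADER 0.004), although the VADER mean is very close to neutral. These means cover only the first 2,000 rows (`df2`), not the full dataset.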
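A mean above zero does not by itself show how many headlines are positive, since a few strongly scored headlines can pull the average. One way to check "more likely" directly is to count the share of headlines on each side of zero. The helper below is a hypothetical sketch (the name `sentiment_shares` and its `threshold` parameter are not part of any library), shown on made-up scores:

```python
def sentiment_shares(scores, threshold=0.0):
    """Return the fraction of scores that are positive, negative, and neutral.

    A score counts as positive if it is above `threshold` and negative if it
    is below `-threshold`; everything in between is treated as neutral.
    """
    scores = list(scores)
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    positive = sum(s > threshold for s in scores) / n
    negative = sum(s < -threshold for s in scores) / n
    return {
        "positive": positive,
        "negative": negative,
        "neutral": 1 - positive - negative,
    }


# Made-up polarity scores for illustration only
example = [0.5, 0.25, 0.0, -0.75]
print(sentiment_shares(example))
```

Because the function accepts any iterable of numbers, it could be called on a column such as `scores['nltk']` to compare the positive and negative shares for each analyzer.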