# Selecting Valid Tests and Packages from the Upworthy Archive
## Example: Getting the Effect on Clickthrough Rates of Including a Notable Person's Name in Headlines
*March 2020* - Marianne Aubin Le Quere 

This example code walks through the steps needed to estimate the minimum and maximum effect size of any property in the Upworthy dataset. As an example, we look at how including a notable person's name in an Upworthy headline affects clickthrough rates.

In [3]:
# import statements
import pandas as pd
import csv
import numpy as np
import codecs, os
from collections import defaultdict, Counter
import random
import pprint
import upworthy_methods as up

## Set random seed from Brooklyn Integers
## https://www.brooklynintegers.com/int/1713001553/
random.seed(1713001553)

# Section 1: Load the Upworthy data
The goal of this section is to load the exploratory dataset into a form that we can work with effectively. Here, the data structure we chose to work with is a [python dictionary](https://docs.python.org/3/tutorial/datastructures.html).

We create two dictionaries:
   * `tests`: This dictionary groups the packages that were tested against each other. Each entry stores the list of `package_ids` in a single test, which makes it easy to compare packages that belong to the same test.
   * `packages`: This dictionary contains all the individual information for a single package. This includes information such as the headline, the lede, the url slug, etc.
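
As a rough illustration of how the two dictionaries relate, the sketch below uses two made-up packages (the ids, headlines, and counts are hypothetical, not taken from the archive) and looks up every package in one test to compute its clickthrough rate. The helper `clickthrough_rates` is an assumption for this example and is not part of `upworthy_methods`.

```python
# Toy data: the ids, headlines, and counts below are invented for illustration.
toy_packages = {
    "p1": {"headline": "Headline A", "test_id": "t1", "impressions": 1000, "clicks": 20},
    "p2": {"headline": "Headline B", "test_id": "t1", "impressions": 1000, "clicks": 35},
}
toy_tests = {
    "t1": {"id": "t1", "headlines": ["Headline A", "Headline B"], "package_ids": ["p1", "p2"]},
}

def clickthrough_rates(test_id, tests, packages):
    """Return {package_id: clicks / impressions} for every package in one test."""
    rates = {}
    for package_id in tests[test_id]["package_ids"]:
        package = packages[package_id]
        rates[package_id] = package["clicks"] / package["impressions"]
    return rates

toy_rates = clickthrough_rates("t1", toy_tests, toy_packages)
print(toy_rates)
```

Because each test only stores package ids, the package details live in one place and the test entry stays small.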

In [4]:
# set these variables to the directory and filename of your exploratory data file
# from your terminal, you can type `pwd` on unix or `cd` on windows to find your
# current directory
data_dir = "/Users/nathan/Documents/github/Upworthy-Research-Archive/create-research-samples/output/"
filename = "upworthy-archive-exploratory-packages-03.12.2020.csv"

In [5]:
# this is all the information that will be associated with tests
def test_object():
    return {"id": None, "headlines":[], "package_ids":[]}

# this is all the information that will be associated with packages
def package_object():
    return {"headline": None,
            "test_id": None,
            "package_id": None,
            "excerpt": None,
            "lede": None,
            "eyecatcher_id": None,
            "share_text": None,
            "square": None,
            "created_at": None,
            "updated_at": None,
            "slug": None,
            "impressions": None,
            "clicks": None,
            "significance": None,
            "first_place": None,
            "winner": None,
            "test_week": None
            }

#Note: This does not currently work. Need Nathan to take a look.
#I think the date format is somehow different to the OG data?
def apply_date(date_raw):
    #parser.parse(date_raw)
    return date_raw

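# note: this helper reads the current `row` from the CSV loop below
# (a global variable in this notebook), so call it only inside that loop
def load_columns(convert_method, list_of_keys, package_dict):
    for key in list_of_keys:
        package_dict[key] = convert_method(row[key])
    return package_dict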

tests = defaultdict(test_object)
packages = defaultdict(package_object)

# this code runs through our csv file with the exploratory Upworthy data
# and loads it into our two dictionaries
with codecs.open(os.path.join(data_dir, filename)) as f:
    for row in csv.DictReader(f):
                
        # the package id lives in the CSV column with an empty header
        package_id = row['']
        
        # load strings
        packages[package_id] = load_columns(str, ['headline', 'excerpt','lede', 'slug',
                    'share_text', 'eyecatcher_id', 'square'], packages[package_id])
        # load numbers
        packages[package_id] = load_columns(int, ['impressions', 'clicks'], packages[package_id])
        packages[package_id] = load_columns(float, ['significance'], packages[package_id])
        
        # load dates
        packages[package_id] = load_columns(apply_date, ['created_at', 'updated_at'], packages[package_id])
        packages[package_id] = load_columns(apply_date, ['test_week'], packages[package_id])
        
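        # load booleans
        # bool() on any non-empty string (including "False") is True, so parse the text explicitly
        # (this assumes the CSV encodes booleans as "True"/"False" or "1"/"0")
        packages[package_id] = load_columns(lambda v: v.strip().lower() in ('true', '1'),
                                            ['first_place', 'winner'], packages[package_id])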

        packages[package_id]['test_id'] = row['clickability_test_id']
        packages[package_id]['package_id'] = package_id
        
        test_id = packages[package_id]['test_id']
        tests[test_id]['id'] = test_id
        tests[test_id]['headlines'].append(packages[package_id]['headline'])
        tests[test_id]['package_ids'].append(packages[package_id]['package_id'])
        
print("{0} total package items created.".format(len(packages)))
print("{0} total test items created.".format(len(tests)))

22666 total package items created.
4873 total test items created.
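
The date columns above are stored as raw strings because `apply_date` currently returns its input unchanged. One possible approach, sketched here under the assumption that the timestamps follow one of a few common layouts (the candidate formats and the sample strings are guesses, not taken from the archive), is to try each format in turn and fall back to `None` when none matches. The name `parse_date_or_none` is hypothetical.

```python
from datetime import datetime

# Candidate formats are assumptions; adjust them to match the actual CSV.
CANDIDATE_FORMATS = ["%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]

def parse_date_or_none(date_raw):
    """Return a datetime for the first matching format, or None if nothing matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(date_raw, fmt)
        except (ValueError, TypeError):
            # wrong format (ValueError) or a non-string value (TypeError): try the next one
            continue
    return None

# Invented sample values, for illustration only
print(parse_date_or_none("2014-11-20 06:43:16"))
print(parse_date_or_none("not a date"))
```

Returning `None` instead of raising keeps the loading loop running when a row has a missing or unexpected date.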


To check that this worked, we print a random test and a random package. You can use the printing functions below at any point to inspect your dictionaries.

In [6]:
def print_random_dictionary_item(dictionary, dictionary_name):
    pp = pprint.PrettyPrinter(indent=1)
    random_key, random_value = random.choice(list(dictionary.items()))
    print("****** Printing {0} Value at Random *******".format(dictionary_name))
    pp.pprint(random_value)
    
def print_dictionary_item(dictionary, key):
    pp = pprint.PrettyPrinter(indent=1)
    print("****** Printing Value with Key {0} from Dictionary *******".format(key))
    pp.pprint(dictionary[key])

In [7]:
print_random_dictionary_item(tests, "Tests")
print("\n")
print_random_dictionary_item(packages, "Packages")

****** Printing Tests Value at Random *******
{'headlines': ['5 Things More Businesses Should Care About',
               'Watch This Ridiculously Adorable And Uplifting Video Of A Huge '
               'Company Actually Doing Good Stuff',
               '84 Seconds Of Proof That Some Companies Care About More Than '
               'Just Their Bottom Line',
               'Here Are 5 Core Beliefs I Wish More Companies Shared. And Then '
               'Actually Acted On, Like This One.',
               "You Probably Haven't Heard Of Project Sunlight, But You Should "
               'Totally Agree With Its 5 Core Beliefs',
               'This Huge Company Has 5 Goals For Its New Project… None Of '
               'Which Involve Increasing Profits.',
               "What's Project Sunlight? A Beautiful Movement That's Already "
               'In Motion.'],
 'id': '534c8ec3a649df271500005f',
 'package_ids': ['20542', '87889', '89205', '91109', '91622', '92922', '93044']}


****** Printin

# Section 2: Read in a list of notable people
For this example, we are interested in headlines that do or do not include notable people's names. We have compiled a list of notable people based on two data sources:
1. The IMDB Top 100 most viewed actors in the years 2013-2015
2. All people named in the Time 100 Most Influential People of the Year 2013-2015

This list is stored in a file named `notable_people.csv`. It contains three columns:
   * `person_name`: The person's full name OR the alias they are better known by (e.g. 'Rihanna' will be present instead of 'Robyn Rihanna Fenty'). Note these are all lower case.
   * `rank`: If applicable, the rank on the Top 100 IMDB list that the actor occupied that year (rows without a rank appear as `-1` in the sample output below)
   * `year`: The year that they were featured on a Top 100 list
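
Because a person can appear on more than one list or in more than one year, the file can hold several rows per name. The sketch below shows one way to collapse such rows into distinct people using only the standard library. The in-memory CSV is a stand-in for `notable_people.csv`: two rows reuse values from the sample shown in this notebook, and the 2013 row for Scarlett Johansson is invented to show a repeated name.

```python
import csv
import io
from collections import defaultdict

# Stand-in for notable_people.csv; the 2013 row is invented for illustration.
notable_csv = io.StringIO(
    "person_name,rank,year\n"
    "scarlett johansson,5,2014\n"
    "scarlett johansson,7,2013\n"
    "thomas piketty,-1,2015\n"
)

# Map each distinct name to the years in which it appears
years_by_person = defaultdict(list)
for row in csv.DictReader(notable_csv):
    years_by_person[row["person_name"]].append(int(row["year"]))

print("rows: 3, distinct people:", len(years_by_person))
```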

In [9]:
# read in the csv file and construct a pandas dataframe
notable_people_df = pd.read_csv('notable_people.csv')

# randomly print a sample of 5 people to see what the data looks like
print(notable_people_df.sample(5),"\n")

# print the # of total rows
print("There are {0} rows in our dataframe.".format(len(notable_people_df)))

# now see how many people this corresponds to (as there may be duplicates that are featured on multiple lists)
print("This corresponds to {0} notable people in all.".format(len(set(notable_people_df['person_name']))))

            person_name  rank  year
94       julianne hough    95  2013
897      thomas piketty    -1  2015
104  scarlett johansson     5  2014
510         lee daniels    -1  2015
346    steven spielberg    -1  2013 

There are 900 rows in our dataframe.
This corresponds to 467 notable people in all.


# Section 3: Calculate the valid comparisons with minimum and maximum effect sizes
Now that we have a list of packages, and a list of notable people, it's time to compare the two! We specifically want to isolate tests where:
1. At least one package has a headline with a notable person's name and one package does not
2. The packages are identical except for the property we are interested in. For example, we can only compare the headlines of two packages if both packages also use the same image.

The below function `get_valid_tests` is intended to be flexible such that you can test any effect you wish. In this case, our `has_treatment` function is called `has_notable_person` and will return 1 if the headline has the name of a notable person and 0 otherwise. However, we could just as easily pass in the `has_number` function below and it will test for a different kind of treatment.
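
Any function that maps a text value to 1 or 0 can play the role of `has_treatment`. As one more illustration, plain substring matching can flag a short name that happens to sit inside a longer word, so the sketch below matches whole names only by using regular-expression word boundaries. The factory `make_has_name` is hypothetical and is not part of `upworthy_methods`; it assumes the names are already lower case, as in `notable_people.csv`.

```python
import re

def make_has_name(names):
    """Build a has_treatment-style function that matches whole names only."""
    unique_names = sorted(set(names), key=len, reverse=True)
    if not unique_names:
        # with no names to look for, nothing can ever match
        return lambda value: 0

    # \b anchors each name at word boundaries, so "rihanna" does not match "marihannah"
    escaped = [re.escape(name) for name in unique_names]
    pattern = re.compile(r"\b(?:" + "|".join(escaped) + r")\b")

    def has_name(value):
        return 1 if pattern.search(value.lower()) else 0

    return has_name

has_name = make_has_name(["rihanna", "lee daniels"])
print(has_name("Rihanna Just Said Something Important"))
print(has_name("A Story About Marihannah"))
```

Compiling the pattern once also avoids rebuilding the list of names on every call.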

Each test in the `valid_tests` dictionary tells us:
* for the smallest possible effect size within that test, the package id of the package with the treatment we are testing for, and the package id of the package without the treatment we are testing for
* for the largest possible effect size within that test, the package id of the package with the treatment we are testing for, and the package id of the package without the treatment we are testing for

In our example, the smallest possible effect size will look for a package with a large clickthrough rate that does not have a notable person's name in it, and a package with a small clickthrough rate that does. These two packages must be a valid comparison (i.e. nothing other than the headline will be different). The largest possible effect size does the opposite: it pairs a package with a large clickthrough rate that does include a name with a package with a small clickthrough rate that does not.
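
To make the selection concrete, here is a minimal sketch of how the two extreme pairs could be picked within a single test. This is not the code in `upworthy_methods`. The function `pick_extreme_pairs`, the two-argument check `is_comparable`, and the toy packages are all assumptions for illustration. The effect of a pair is taken to be the treated package's clickthrough rate minus the untreated package's.

```python
from itertools import product

def pick_extreme_pairs(package_ids, packages, has_treatment, is_comparable):
    """Return the valid treated/untreated pairs with the smallest and largest CTR difference."""
    def ctr(package_id):
        return packages[package_id]["clicks"] / packages[package_id]["impressions"]

    treated = [p for p in package_ids if has_treatment(packages[p]["headline"])]
    untreated = [p for p in package_ids if not has_treatment(packages[p]["headline"])]

    # every valid (treated, untreated) pair with its effect: CTR(treated) - CTR(untreated)
    candidates = [(ctr(t) - ctr(u), t, u)
                  for t, u in product(treated, untreated) if is_comparable(t, u)]
    if not candidates:
        return None  # this test has no valid comparison

    _, min_t, min_u = min(candidates, key=lambda c: c[0])
    _, max_t, max_u = max(candidates, key=lambda c: c[0])
    return {"min": {"treatment": min_t, "no_treatment": min_u},
            "max": {"treatment": max_t, "no_treatment": max_u}}

# Invented toy packages: two headlines with a name, two without
toy = {
    "a": {"headline": "Rihanna Did A Thing", "clicks": 10, "impressions": 1000},
    "b": {"headline": "What Rihanna Said Next", "clicks": 40, "impressions": 1000},
    "c": {"headline": "Someone Did A Thing", "clicks": 20, "impressions": 1000},
    "d": {"headline": "What Someone Said Next", "clicks": 30, "impressions": 1000},
}
extreme = pick_extreme_pairs(
    ["a", "b", "c", "d"], toy,
    has_treatment=lambda h: 1 if "rihanna" in h.lower() else 0,
    is_comparable=lambda id_a, id_b: True,  # treat every pair as valid in this toy case
)
print(extreme)
```

In the toy data, the smallest effect pairs the low-CTR treated package `a` with the high-CTR untreated package `d`, and the largest effect pairs `b` with `c`.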

In [10]:
# One example of a 'has_treatment' function.
# This function returns 1 if a headline contains the name of a notable person
# and 0 otherwise.
def has_notable_person(value):
    
    # clean headline if necessary for analysis
    value = value.lower()
    
    # create list of notable people
    # again, clean if necessary
    list_of_notable_people = list(set(notable_people_df['person_name']))
    
    for notable_person in list_of_notable_people:
        #if there is a notable person in the headline, return 1
        if notable_person in value:
            return 1
    #if no notable person is found in the headline, return 0
    return 0

# here is another example of an effect you can test for
# this returns 1 if a number is included in a property and 0 otherwise
def has_number(value):
    for character in value:
        if character.isdigit():
            return 1
    return 0

# this function is what determines whether or not two packages are statistically
# valid comparisons. For example, if one headline has an effect and the other does not,
# the two packages may nonetheless be invalid comparisons if they have a different image 
def is_valid_comparison(package_id1, package_id2, properties_to_contrast):
    
    all_relevant_properties = ['headline', 'excerpt', 'lede', 'eyecatcher_id', 'square', 'share_text']
        
    properties_must_be_the_same = [x for x in all_relevant_properties if x not in properties_to_contrast]
    
    for prop in properties_must_be_the_same:
        if packages[package_id1][prop] != packages[package_id2][prop]:
            return False
    return True

# this function generates the tests that are directly comparable
# the inputs to this function are:
#   tests: dictionary of tests
#   packages: dictionary of packages
#   has_treatment: 
#      this is a function you define that should return 1 if an effect is present in a property and 0 otherwise
#   is_valid_comparison:
#      this defines which packages can be compared based on the list of properties specified. It is written for
#      you above, though you may choose to alter it if you wish
#   properties_to_contrast:
#      this should be a list of columns you want to compare
# in the example below, something will be considered a valid test if there is at least:
#   one package with a headline that contains at least one notable person's name
#   one package with a headline that does not contain any notable person's name
valid_tests = up.get_valid_tests(tests, packages, 
                                 has_notable_person, 
                                 is_valid_comparison, 
                                 ['headline'])


There are 40 valid tests that match the given criteria.


In [11]:
print_random_dictionary_item(valid_tests, "Valid Tests")

****** Printing Valid Tests Value at Random *******
{'max': {'no_treatment': '92097', 'treatment': '88322'},
 'min': {'no_treatment': '88419', 'treatment': '88924'}}


# Section 4: Output valid test dictionary to CSV file
Now that we have the results that are valid, we will output them to a format that we can then perform our analysis on.

This section will generate three CSV files:
1. "upworthy_archive_exploratory_max_effect_size_dataset.csv"
2. "upworthy_archive_exploratory_min_effect_size_dataset.csv"
3. "headlines.csv"

Note you may want to alter this section if you plan to contrast another property, such as `lede` instead of `headline`.

In [12]:
def create_headline_indexes(packages):
    ## assign each distinct headline an index and record it on every package
    headline_idx = 0
    headlines = {}
    for p in packages:
        headline = packages[p]['headline']
        if headline not in headlines:
            headlines[headline] = headline_idx
            headline_idx += 1
        # use the index stored for this headline, not the running counter
        packages[p]['headline_idx'] = headlines[headline]
    return packages, headlines

def write_dataset(packages, valid_tests):
    
    for effect_size in ["max", "min"]:
        with open("upworthy_archive_exploratory_" + effect_size + "_effect_size_dataset.csv", 'w', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(["clickability_test_id", "headline_number", "has_treatment","clicked"])

            for test in valid_tests.keys():
                for condition in ['treatment', 'no_treatment']:
                    clicks, impressions, headline_idx = up.get_properties(packages, 
                                                                          valid_tests[test][effect_size][condition])
                    has_treatment = 1 if condition == 'treatment' else 0
                    for i in range(0,impressions):
                        clicked = 1 if i < clicks else 0
                        writer.writerow([test, headline_idx, has_treatment, clicked]) 
                    
def write_headline_index(headlines):
    with open('headlines.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["headline_index", "headline"])

        for headline in headlines:
            writer.writerow([headlines[headline], headline])

In [13]:
packages_with_headline_indexes, headlines = create_headline_indexes(packages)
write_dataset(packages_with_headline_indexes, valid_tests)
write_headline_index(headlines)

Et voilà! You are now free to go analyse the resulting datasets as you wish!
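
As a possible first step, the sketch below reads a dataset in the column layout written by `write_dataset` (one row per impression, with `clicked` set to 1 or 0) and computes the clickthrough rate for each condition. The in-memory rows are invented. Only the column names come from the code above.

```python
import csv
import io
from collections import defaultdict

# A tiny in-memory stand-in for one of the generated CSV files (rows are invented)
sample_csv = io.StringIO(
    "clickability_test_id,headline_number,has_treatment,clicked\n"
    "t1,0,1,1\n"
    "t1,0,1,0\n"
    "t1,1,0,0\n"
    "t1,1,0,0\n"
    "t1,1,0,1\n"
    "t1,1,0,0\n"
)

# Count clicks and impressions separately for treated (1) and untreated (0) rows
totals = defaultdict(lambda: {"clicks": 0, "impressions": 0})
for row in csv.DictReader(sample_csv):
    group = totals[int(row["has_treatment"])]
    group["clicks"] += int(row["clicked"])
    group["impressions"] += 1

ctr_by_condition = {k: v["clicks"] / v["impressions"] for k, v in totals.items()}
print(ctr_by_condition)
```

To run this against a real output file, `sample_csv` could be replaced with `open("upworthy_archive_exploratory_max_effect_size_dataset.csv", newline="")`.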