# Newcomer Task edits in the AB variant test

The phabricator task for this work is [T253902](https://phabricator.wikimedia.org/T253902).

We want to understand if there's a meaningful difference between variants A and B in the probability that a user completes at least one edit tagged with "newcomer task" within 14 days of registration.

To analyze this, we'll grab data in the same way that we did for the previous analysis of interactions, but focus on the number of tagged edits done.

## Timestamps

The variant test launched on "2019-12-13T00:32:04Z" (ref [T238888#5738223](https://phabricator.wikimedia.org/T238888#5738223)). That's just after midnight on a Friday.

We're gathering data on 2020-06-01. In order to enable gathering 14 days of data from each user, and to use whole weeks of data, we are limiting this to users registered prior to midnight on 2020-05-15.
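As a quick sanity check of the dates above, the following sketch works at calendar-date granularity (it ignores the 32 minutes after midnight at launch). The variable names are made up for illustration; the dates and the 14-day follow-up period come from the text.

```python
import datetime as dt

# Dates taken from the text above.
test_start = dt.date(2019, 12, 13)
registration_cutoff = dt.date(2020, 5, 15)
data_gathered_on = dt.date(2020, 6, 1)

# Follow-up period per user, as described above.
follow_up = dt.timedelta(days=14)

# Length of the registration window, split into whole weeks and leftover days.
window_days = (registration_cutoff - test_start).days
whole_weeks, leftover_days = divmod(window_days, 7)

# Every user registered before the cutoff has a complete follow-up period
# if the cutoff plus the follow-up falls on or before the gathering date.
full_follow_up = registration_cutoff + follow_up <= data_gathered_on

print(window_days, whole_weeks, leftover_days, full_follow_up)
```

A leftover of zero days means the registration window covers whole weeks only, so each weekday is represented equally often.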

In [1]:
import json
import datetime as dt

import numpy as np
import pandas as pd

from wmfdata import hive, spark, mariadb
from growth import utils

# Canonical user dataset

Gather a canonical dataset of user registrations (user ID, user registration timestamp), their treatment/control assignment, and whether they registered from desktop or mobile.

In [4]:
## Configuration variables

wmf_snapshot = '2020-04'

wikis = ['cswiki', 'kowiki', 'viwiki', 'arwiki']

## Start and end timestamps of data gathering for each wiki, corresponding to the dates/times
## listed above.

cs_start_ts = dt.datetime(2019, 12, 13, 0, 32, 4)
cs_end_ts = dt.datetime(2020, 5, 15, 0, 0, 0)

ko_start_ts = cs_start_ts
ko_end_ts = cs_end_ts

vi_start_ts = cs_start_ts
vi_end_ts = cs_end_ts

ar_start_ts = cs_start_ts
ar_end_ts = cs_end_ts

## User IDs of known users to exclude (e.g. Stephane, Elena, and Marshall's accounts)
known_users = {
    'cswiki' : set([322106, 339583, 341191, 341611, 433381, 433382, 433511, 404765, 421667,
                      427625, 437386, 181724, 272273, 339583, 437386, 439783, 439792, 138342,
                      392634, 404765, 275298, 458487, 458049]),
    'kowiki' : set([384066, 539296, 539299, 539302, 539303, 539304, 539305, 539306, 539307,
                      539298, 416361, 416360, 413162, 495265, 518393, 518394, 518396, 530285,
                      531579, 531785, 536786, 536787, 542720, 542721, 542722, 543192, 543193,
                      544145, 544283, 470932, 38759, 555673]),
    'viwiki' : set(),
    'arwiki' : set()
}

## Filename of where the canonical dataset is stored.
tsv_output_file = "/home/nettrom/src/Growth-homepage-2019/datasets/variant-test-1-canonical-users.tsv"
tsv_user_file = "/home/nettrom/src/Growth-homepage-2019/datasets/variant-test-1-users.tsv"
tsv_usage_file = "/home/nettrom/src/Growth-homepage-2019/datasets/variant-test-1-usage.tsv"

In [58]:
tsv_tagged_edits_file = "/home/nettrom/src/Growth-homepage-2019/datasets/variant-test-1-tagged-edit-counts.tsv"

## Get known test accounts

In [5]:
## Grab the user IDs of known test accounts so they can be added to the exclusion list

def get_known_users(wiki):
    '''
    Get user IDs of known test accounts and return a set of them.
    '''
    
    username_patterns = ["MMiller", "Zilant", "Roan", "KHarlan", "MWang", "SBtest",
                         "Cloud", "Rho2019", "Test"]

    known_user_query = '''
SELECT user_id
FROM user
WHERE user_name LIKE "{name_pattern}%"
    '''
    
    known_users = set()
    
    for u_pattern in username_patterns:
        new_known = mariadb.run(known_user_query.format(
            name_pattern = u_pattern), wiki)
        known_users = known_users | set(new_known['user_id'])

    return(known_users)
        
for wiki in wikis:
    known_users[wiki] = known_users[wiki] | get_known_users(wiki)

## Get all registrations

We grab registrations from the ServerSideAccountCreation schema. This allows us to identify whether the user created the account themselves, and whether they registered through the API. We require the user to register the account themselves. Registrations through the API are typically app installs, so we filter those out.

Using this also enables us to identify whether the user registered on the desktop or mobile site, so we can control for that in our analyses.

We also grab self-registrations from MediaWiki, together with usernames and information on whether the user is currently in the bot user group. This allows us to compare the two sources to identify discrepancies, as well as remove known bot accounts from our analysis.

In [13]:
user_registrations_query = '''
SELECT wiki, event.userid AS user_id,
       dt AS reg_timestamp,
       CAST(event.displaymobile AS INT) AS reg_on_mobile
FROM event_sanitized.serversideaccountcreation
WHERE ((year = 2019 AND month >= 12) OR year = 2020)
AND wiki IN ("cswiki", "kowiki", "viwiki", "arwiki")
AND event.isselfmade = true
AND event.isapi = false
AND (
    (wiki = "cswiki"
     AND dt > "{cs_start_time}"
     AND dt < "{cs_end_time}"
     AND event.userid NOT IN ({cs_known_users}))
    OR
    (wiki = "kowiki"
     AND dt > "{ko_start_time}"
     AND dt < "{ko_end_time}"
     AND event.userid NOT IN ({ko_known_users}))
    OR
    (wiki = "viwiki"
     AND dt > "{vi_start_time}"
     AND dt < "{vi_end_time}"
     AND event.userid NOT IN ({vi_known_users}))
    OR
    (wiki = "arwiki"
     AND dt > "{ar_start_time}"
     AND dt < "{ar_end_time}"
     AND event.userid NOT IN ({ar_known_users}))
)
'''

In [14]:
## Each wiki gets its own start/end timestamps, so the query stays correct
## if the per-wiki windows ever differ (they are identical in this test).
user_registrations = spark.run(
    user_registrations_query.format(
        cs_start_time = cs_start_ts.strftime(utils.hive_format),
        cs_end_time = cs_end_ts.strftime(utils.hive_format),
        ko_start_time = ko_start_ts.strftime(utils.hive_format),
        ko_end_time = ko_end_ts.strftime(utils.hive_format),
        vi_start_time = vi_start_ts.strftime(utils.hive_format),
        vi_end_time = vi_end_ts.strftime(utils.hive_format),
        ar_start_time = ar_start_ts.strftime(utils.hive_format),
        ar_end_time = ar_end_ts.strftime(utils.hive_format),
        cs_known_users = ', '.join([str(u) for u in known_users['cswiki']]),
        ko_known_users = ', '.join([str(u) for u in known_users['kowiki']]),
        vi_known_users = ', '.join([str(u) for u in known_users['viwiki']]),
        ar_known_users = ', '.join([str(u) for u in known_users['arwiki']]),
    )
)
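One thing to watch with the `NOT IN ({...})` clauses is that an empty exclusion set renders as `NOT IN ()`, which many SQL dialects reject. A minimal sketch of a guard follows. The helper `render_id_list` is hypothetical (it is not part of `wmfdata` or `growth.utils`), and the `-1` fallback assumes that no real user has that ID.

```python
def render_id_list(user_ids):
    """Render user IDs as a comma-separated string for a SQL IN (...) clause.

    Hypothetical helper for illustration. Falls back to -1 when the
    collection is empty, on the assumption that no real user has that ID.
    """
    ids = sorted(int(u) for u in user_ids)
    if not ids:
        ids = [-1]
    return ', '.join(str(u) for u in ids)


print(render_id_list({339583, 322106, 341191}))
print(render_id_list(set()))
```

Sorting the IDs is not required for correctness, but it makes the generated query text reproducible between runs.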

In [15]:
## verify first registration in the dataset
user_registrations['reg_timestamp'].min()

'2019-12-13T00:37:16Z'

In [16]:
## verify last registration in the dataset
user_registrations['reg_timestamp'].max()

'2020-05-14T23:59:30Z'

In [9]:
def get_mw_regs(wikis, start_timestamp, end_timestamp):

    ## Query to get self-registrations through MediaWiki.
    ## Also grabbing usernames, bot-info in username, bot user group membership.

    ## From Analytics Engineering: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/504025/
    botUsernamePattern = r"^.*bot([^a-z].*$|$)"
    
    ## Using "CONVERT" to make the regexp case-insensitive
    mw_reg_query = '''
    SELECT "{wiki}" AS wiki,
           user_id, user_name, user_registration,
           IF(CONVERT (user_name USING utf8) REGEXP "{bot_regex}", 1, 0) AS bot_by_name,
           IF(ug_user IS NOT NULL, 1, 0) AS bot_by_group
    FROM user
    JOIN actor
    ON user_id = actor_user
    JOIN logging
    ON log_actor = actor_id
    LEFT JOIN (
        SELECT ug_user
        FROM user_groups
        WHERE ug_group = "bot"
    ) AS ug
    ON user_id = ug_user
    WHERE user_registration >= "{start_ts}"
    AND user_registration < "{end_ts}"
    AND log_type = "newusers"
    AND log_action = "create" -- only self-creations
    '''
    
    regs = list()
    for wiki in wikis:
        regs.append(
            mariadb.run(
                mw_reg_query.format(
                    wiki = wiki,
                    bot_regex = botUsernamePattern,
                    start_ts = start_timestamp.strftime(utils.mw_format),
                    end_ts = end_timestamp.strftime(utils.mw_format)
                ), wiki
            )
        )
                   
        
    return(pd.concat(regs))
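To get a feel for what the bot username pattern flags, here is a small sketch using Python's `re` module with case-insensitive matching. MariaDB's regular expression engine may differ in details, so treat this as an approximation of the query's behaviour. The usernames are made up for illustration.

```python
import re

# Same pattern as botUsernamePattern above, compiled case-insensitively to
# mirror the intent of the CONVERT ... REGEXP comparison in the query.
bot_username_re = re.compile(r"^.*bot([^a-z].*$|$)", re.IGNORECASE)


def looks_like_bot(user_name):
    """Return True if the username contains 'bot' not followed by a letter."""
    return bot_username_re.match(user_name) is not None


# Made-up usernames, for illustration only.
examples = ["ExampleBot", "Bot 2", "Botanist", "Robotics fan", "Alice"]
flags = {name: looks_like_bot(name) for name in examples}

for name, flagged in flags.items():
    print(name, flagged)
```

The pattern requires "bot" to be followed by a non-letter or the end of the name, which is why names such as "Botanist" are not flagged.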

In [10]:
mw_regs = get_mw_regs(user_registrations['wiki'].unique(), cs_start_ts, cs_end_ts)

In [11]:
mw_regs.groupby('wiki').agg({'user_id' : 'size'})

wiki,user_id
arwiki,59834
cswiki,8460
kowiki,12914
viwiki,21009


In [17]:
## What's the number of registrations?

user_registrations.groupby('wiki').agg({'user_id' : 'size'})

wiki,user_id
arwiki,50768
cswiki,7302
kowiki,11071
viwiki,17286


Given that we can't filter out app registrations from the MediaWiki data, it's difficult to compare these numbers and be sure that they're correct. However, they're not too far off. Can we find all the SSAC users in the MediaWiki database?

In [18]:
all_users = user_registrations.merge(mw_regs, on = ['wiki', 'user_id'])

In [19]:
all_users.groupby('wiki').agg({'user_id' : 'size'})

wiki,user_id
arwiki,50768
cswiki,7302
kowiki,11071
viwiki,17286


Yeah, looks like they all exist, so let's go with that.
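The check above compares per-wiki counts before and after the inner merge. A more explicit variant is to compare the key sets directly, which also shows which users are missing if any are. This is a sketch with made-up `(wiki, user_id)` pairs standing in for the two sources.

```python
# Made-up (wiki, user_id) pairs standing in for the two registration sources.
ssac_users = {("cswiki", 1), ("cswiki", 2), ("kowiki", 7)}
mw_users = {("cswiki", 1), ("cswiki", 2), ("cswiki", 3), ("kowiki", 7)}

# Users logged by ServerSideAccountCreation but absent from MediaWiki.
missing_from_mw = ssac_users - mw_users

# Users present in MediaWiki only (e.g. registrations we filtered out of SSAC).
only_in_mw = mw_users - ssac_users

print(len(missing_from_mw), sorted(only_in_mw))
```

An empty `missing_from_mw` corresponds to the inner merge keeping every SSAC row.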

In [20]:
## Dropping the user name column, it's no longer needed.
all_users.drop('user_name', axis = 'columns', inplace = True)

In [21]:
## Removing all bots by name or group membership
all_users = all_users.loc[(all_users['bot_by_name'] == 0) & (all_users['bot_by_group'] == 0)]

In [22]:
all_users.groupby('wiki').agg({'user_id' : 'size'})

wiki,user_id
arwiki,50745
cswiki,7292
kowiki,11059
viwiki,17279


## Get treatment/control assignments

This involves two operations.

1. Get all users who have the Homepage turned on in their preferences.
2. Get all users who have Newcomer Tasks pre-initialized.

Users who don't have the Homepage turned on are candidates for the control group, and users who do have it turned on are candidates for the experiment group. Second, we assign pre-initialized status to users who have that preference set. Once we remove all users who changed their preferences, our group assignments should be correct.
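The two-step assignment described above can be sketched as a small function over the two preference flags. The function name and the group labels are hypothetical and chosen for illustration; the notebook itself keeps the two flags as separate columns and does not name the groups this way.

```python
def assign_group(hp_enabled, nt_pre_enabled):
    """Map the two preference flags to a descriptive group label.

    Hypothetical labels: users without the Homepage are treated as control,
    and Homepage users are split by whether Newcomer Tasks was pre-initialized.
    """
    if not hp_enabled:
        return "control"
    if nt_pre_enabled:
        return "homepage, tasks pre-initialized"
    return "homepage, tasks not pre-initialized"


for flags in [(0, 0), (1, 0), (1, 1)]:
    print(flags, assign_group(*flags))
```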

In [23]:
def get_prop_settings(wiki, prop, col_name, users=None):
    '''
    Query and return a `pandas.DataFrame` with columns `wiki` and `user_id` of all users who have
    the given property turned on in their preferences for that given wiki.
    
    :param wiki: database code of the wiki we're querying
    :type wiki: str
    
    :param prop: the user preference we're querying for
    :type prop: str
    
    :param col_name: name that the column with preference value should have in the
                     resulting DataFrame (e.g. "is_treatment")
    :type col_name: str
    
    :param users: user IDs of the users we are interested in. This is optional.
    :type users: list
    '''
    
    prop_query = '''
    SELECT "{wiki}" AS wiki, up_user AS user_id,
           CAST(up_value AS UNSIGNED INTEGER) AS {col_name}
    FROM user_properties
    WHERE up_property = "{prop}"
    '''.format(wiki = wiki, prop = prop, col_name = col_name)
    
    if users is not None:
        prop_query += '''
        AND up_user IN ({})
        '''.format(','.join([str(uid) for uid in users]))
        
    return(mariadb.run(prop_query, wiki))

In [24]:
## Get treatment/control assignments from the MW databases

hp_prefs = pd.concat(
    [get_prop_settings('cswiki', 'growthexperiments-homepage-enable', 'hp_enabled'),
     get_prop_settings('kowiki', 'growthexperiments-homepage-enable', 'hp_enabled'),
     get_prop_settings('viwiki', 'growthexperiments-homepage-enable', 'hp_enabled'),
     get_prop_settings('arwiki', 'growthexperiments-homepage-enable', 'hp_enabled')
    ])

In [25]:
all_users = all_users.merge(hp_prefs, on = ['wiki', 'user_id'], how = 'left').fillna(0)

In [26]:
## Get variant settings from the MW database

variant_prefs = pd.concat(
    [get_prop_settings('cswiki', 'growthexperiments-homepage-suggestededits-preactivated', 'nt_pre_enabled'),
     get_prop_settings('kowiki', 'growthexperiments-homepage-suggestededits-preactivated', 'nt_pre_enabled'),
     get_prop_settings('viwiki', 'growthexperiments-homepage-suggestededits-preactivated', 'nt_pre_enabled'),
     get_prop_settings('arwiki', 'growthexperiments-homepage-suggestededits-preactivated', 'nt_pre_enabled')
    ])

In [27]:
all_users = all_users.merge(variant_prefs, on = ['wiki', 'user_id'], how = 'left').fillna(0)

## Users who turned the Homepage on/off in their preferences

Lastly, we identify all users who turned the Homepage on or off in their preferences, as that means they self-selected into or out of our group assignments. These users can therefore not be part of the analysis.

In [28]:
## Identify all users who either turned the Homepage on themselves, or at some point
## turned the preference off.

switch_query = '''
SELECT wiki, event.userid AS user_id, event.value
FROM event_sanitized.prefupdate
WHERE ((year = 2019 AND month >= 11) OR year = 2020)
AND wiki IN ("cswiki", "kowiki", "viwiki", "arwiki")
AND event.property = "{prop}"
'''

In [29]:
switched_users = spark.run(
    switch_query.format(
        prop = 'growthexperiments-homepage-enable'
    )
)

How many users switched?

In [30]:
len(switched_users)

885

That's a surprisingly high number compared to other interventions we've run. I'm now curious to learn to what extent users turn this on or off, and how many of the users in the experiment actually turned it on or off.

In [31]:
switched_users.groupby(['wiki', 'value']).agg({'user_id' : 'size'})

wiki,value,user_id
arwiki,False,75
arwiki,True,261
cswiki,False,54
cswiki,True,133
kowiki,False,66
kowiki,True,122
viwiki,False,31
viwiki,True,143


Ok, so largely users are turning the Homepage *on*, not off. Given the number of users in our dataset, I don't think we're looking at a significant proportion turning it off.

In [32]:
## Left-join with switched users

all_users = all_users.merge(switched_users,
                            on = ['wiki', 'user_id'], how = 'left')

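Now, for each wiki and registration platform (desktop or mobile), aggregate how many users turned it on or off. Users who didn't change the preference have a missing `value` and are therefore left out of the grouped output.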

In [33]:
all_users.groupby(['wiki', 'reg_on_mobile', 'value']).agg({'user_id' : 'size'})

wiki,reg_on_mobile,value,user_id
arwiki,0,False,19
arwiki,0,True,46
arwiki,1,False,25
arwiki,1,True,39
cswiki,0,False,17
cswiki,0,True,10
cswiki,1,True,1
kowiki,0,False,9
kowiki,0,True,12
kowiki,1,False,26


In [34]:
len(all_users)

86422

In [35]:
len(all_users.loc[~all_users['value'].isna()])

295

In [36]:
round(100 * len(all_users.loc[~all_users['value'].isna()]) / len(all_users), 1)

0.3

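So, at the time of writing (June 1, 2020), only 295 of the 86,422 rows in the joined dataset have a recorded preference change. That's 0.3%, and most of these turned it *on*, btw. Note that `switched_users` holds one row per preference change, so a user who changed the setting more than once appears more than once: the dataset grew from 86,375 rows after bot removal to 86,422 rows after this join, which suggests the number of distinct users who switched is somewhat below 295. Either way, all of them are removed in the next step.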
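Because a preference-change log can hold several events for the same user, counting rows after a left join can count a user more than once. A minimal sketch of counting and excluding distinct users instead, using made-up events and registrations:

```python
# Made-up preference-change events as (wiki, user_id, value); one row per change.
events = [
    ("cswiki", 10, True),
    ("cswiki", 10, False),
    ("cswiki", 11, True),
    ("kowiki", 10, True),
]

# Distinct users who changed the preference at least once.
switched = {(wiki, user_id) for wiki, user_id, _ in events}

# Made-up registrations as (wiki, user_id).
all_registered = [
    ("cswiki", 10), ("cswiki", 11), ("cswiki", 12),
    ("kowiki", 10), ("kowiki", 13),
]

# Keep only users without any preference change.
kept = [user for user in all_registered if user not in switched]
pct_switched = round(100 * len(switched) / len(all_registered), 1)

print(len(events), len(switched), kept, pct_switched)
```

Here four events collapse to three distinct users, so the share of switched users is computed against users rather than rows.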

In [37]:
all_users = all_users.loc[all_users['value'].isna()].copy()

In [38]:
## Drop the 'value' column, it's no longer needed
all_users.drop('value', axis = 'columns', inplace = True)

In [40]:
## Drop the 'reg_timestamp' column, it's also no longer needed
all_users.drop('reg_timestamp', axis = 'columns', inplace = True)

In [42]:
all_users.head()

,wiki,user_id,reg_on_mobile,user_registration,bot_by_name,bot_by_group,hp_enabled,nt_pre_enabled
0,arwiki,1796993,1,20200115200510,0,0,0,0
1,arwiki,1797003,1,20200115202129,0,0,1,0
2,arwiki,1797030,1,20200115205649,0,0,1,1
3,arwiki,1797023,0,20200115205046,0,0,1,1
4,arwiki,1796996,1,20200115201116,0,0,0,0


In [41]:
## Turn hp_enabled and nt_pre_enabled into ints

all_users['hp_enabled'] = all_users['hp_enabled'].astype(int)
all_users['nt_pre_enabled'] = all_users['nt_pre_enabled'].astype(int)

## Grab edit data for these users

We use the `event.mediawiki_revision_tags_change` table in the Data Lake to get data on the number of edits tagged with "newcomer task" made by users registered between the start and end of our data gathering window.
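The query below counts only edits made less than 14 days after registration, comparing timestamps in seconds. The same rule can be sketched in Python as follows. The function name and the example timestamps are made up; the timestamp format mirrors the `yyyy-MM-dd'T'HH:mm:ss'Z'` pattern used in the query.

```python
import datetime as dt

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
WINDOW = dt.timedelta(days=14)


def within_window(registration_ts, revision_ts):
    """Return True if the revision was made less than 14 days after registration."""
    registered = dt.datetime.strptime(registration_ts, TS_FORMAT)
    edited = dt.datetime.strptime(revision_ts, TS_FORMAT)
    return edited - registered < WINDOW


print(within_window("2020-01-01T00:00:00Z", "2020-01-14T23:59:59Z"))
print(within_window("2020-01-01T00:00:00Z", "2020-01-15T00:00:00Z"))
```

The comparison is strict, so an edit made exactly 14 days after registration falls outside the window.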

In [44]:
tagged_edits_query = '''
SELECT `database` AS wiki, performer.user_id, count(DISTINCT rev_id) AS num_edits
FROM event.mediawiki_revision_tags_change
WHERE datacenter = "eqiad"
AND ((year = 2019 AND month = 12) OR year = 2020)
AND `database` IN ("cswiki", "kowiki", "viwiki", "arwiki")
AND performer.user_registration_dt >= "{start_ts}"
AND performer.user_registration_dt < "{end_ts}"
AND array_contains(tags, "newcomer task")
AND unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
    unix_timestamp(performer.user_registration_dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") < 60*60*24*14
GROUP BY `database`, performer.user_id
'''

In [45]:
tagged_edits = spark.run(tagged_edits_query.format(
    start_ts = cs_start_ts.strftime(utils.hive_format),
    end_ts = cs_end_ts.strftime(utils.hive_format)
))

In [None]:
## Grab a few user IDs from Czech Wikipedia to verify against the replicated databases
tagged_edits.loc[tagged_edits['wiki'] == 'cswiki'].head()

In [51]:
## Query to verify number of edits for those users

verification_query = '''
SELECT actor_user, user_registration, revactor_rev, revactor_timestamp
FROM actor
JOIN user
ON actor_user = user_id
JOIN revision_actor_temp
ON actor_id = revactor_actor
JOIN change_tag
ON revactor_rev  = ct_rev_id
JOIN change_tag_def
ON ct_tag_id = ctd_id
WHERE actor_user = {user_id}
AND ctd_name = "newcomer task"
'''

In [None]:
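## The query has a {user_id} placeholder; fill it with the first Czech Wikipedia user listed above
check_user_id = tagged_edits.loc[tagged_edits['wiki'] == 'cswiki', 'user_id'].iloc[0]
mariadb.run(verification_query.format(user_id = check_user_id), 'cswiki')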

Counts for a couple of the users appear to be correct. Left-join with the user data.

In [53]:
users_and_edits = all_users.merge(tagged_edits, how = 'left', on = ['wiki', 'user_id'])

In [56]:
users_and_edits = users_and_edits.fillna(0)

In [57]:
users_and_edits['num_edits'] = users_and_edits['num_edits'].astype(int)
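The left join followed by the zero-fill means that users without any tagged edits end up with `num_edits = 0`, which is what makes "at least one tagged edit" well defined for every user. A minimal sketch of that logic and of the raw per-group share, using made-up data; the actual analysis is done in R and may use a different model.

```python
from collections import defaultdict

# Made-up users as (wiki, user_id, nt_pre_enabled).
users = [
    ("cswiki", 1, 1), ("cswiki", 2, 1),
    ("cswiki", 3, 0), ("cswiki", 4, 0), ("cswiki", 5, 0), ("cswiki", 6, 0),
]

# Made-up tagged edit counts; users without tagged edits are simply absent.
tagged_edit_counts = {("cswiki", 1): 3, ("cswiki", 3): 1}

totals = defaultdict(int)
editors = defaultdict(int)
for wiki, user_id, nt_pre_enabled in users:
    # Equivalent of the left join plus fillna(0): a missing count becomes 0.
    num_edits = tagged_edit_counts.get((wiki, user_id), 0)
    totals[nt_pre_enabled] += 1
    editors[nt_pre_enabled] += int(num_edits > 0)

# Share of users per group with at least one tagged edit.
share_with_edit = {group: editors[group] / totals[group] for group in totals}
print(share_with_edit)
```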

Write out to a TSV to be imported into R.

In [59]:
## Export users and their tagged edit counts to a TSV for reading into R for analysis

users_and_edits.to_csv(tsv_tagged_edits_file, sep = '\t', header = True, index = False)