# Q2 2020 OKR baselines

The team defined a set of Q2 OKRs that require us to estimate a couple of baselines. They are:

1. Of the users who make edits to suggested articles, what percentage edits more than one article?
2. Of the users who select (click on) a suggested article, what percentage save an edit to that article?

We'll grab data from MediaWiki on known test accounts so we can exclude them from the analysis. We'll most likely also want to segment or restrict the data by registration date or user tenure to filter out experienced users. The most straightforward approach is probably to only include users who registered after Guidance was launched (on 2020-06-15).

Edits made to a suggested article are tagged if they're made within a week after the user clicks to select a task. In order to have the same set of users across the whole analysis, we limit registrations to those prior to 2020-09-01. That allows for two weeks of click events, and a week for the edit to happen. Since we're using the HomepageModule data as our basis for this analysis, we lose users without client-side EventLogging. That means users without JavaScript, or users with ad-blockers enabled that block our EL beacon.
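
The date arithmetic behind these windows can be sketched as follows. This is a minimal illustration that uses only the dates stated above; the variable names are made up for this example and are not part of the analysis code.

```python
import datetime as dt

# Registration cutoff stated in the text above (registrations must be prior to it).
registration_cutoff = dt.datetime(2020, 9, 1)

# A user registering right before the cutoff gets two weeks to click on a task...
last_click = registration_cutoff + dt.timedelta(days=14)

# ...and one more week after that click for the edit to be tagged.
last_edit = last_click + dt.timedelta(days=7)

print(last_click.date())  # 2020-09-15
print(last_edit.date())   # 2020-09-22
```

In other words, data through roughly three weeks after the registration cutoff is needed before every user has had the full window.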

For the baseline of edits to multiple articles, we count edits within two weeks after registration, because that is easy to restrict in the query. Because this baseline is based on edit data, it is not limited to users with client-side EventLogging and therefore covers a larger group of users than the click-edit funnel.

We include all wikis with the Growth features up to and including Hebrew Wikipedia, where they were deployed on 2020-08-11. Deployment to Russian Wikipedia happened on 2020-08-27, so it's not included due to there being only a few days between deployment and our registration cutoff.

In [2]:
import json
import datetime as dt

import numpy as np
import pandas as pd

from collections import defaultdict

from wmfdata import hive, spark, mariadb
from growth import utils

# Configuration

In [38]:
## Configuration variables

wmf_snapshot = '2020-04'

## list of Wikipedias where the Homepage is deployed that we'll use in this analysis
wikis = ['cswiki', 'kowiki', 'viwiki', 'arwiki', 'huwiki', 'ukwiki', 'hywiki', 'euwiki',
         'srwiki', 'frwiki', 'fawiki', 'hewiki']

## Start and end timestamps of data gathering, per notes above
start_ts = dt.datetime(2020, 6, 16, 0, 0, 0)
end_ts = dt.datetime(2020, 9, 1, 0, 0, 0)

## User IDs of known users to exclude
known_users = defaultdict(set)

known_users['cswiki'].update([322106, 339583, 341191, 341611, 433381, 433382, 433511, 404765, 421667,
                      427625, 437386, 181724, 272273, 339583, 437386, 439783, 439792, 138342,
                      392634, 404765, 275298, 458487, 458049])
known_users['kowiki'].update([384066, 539296, 539299, 539302, 539303, 539304, 539305, 539306, 539307,
                      539298, 416361, 416360, 413162, 495265, 518393, 518394, 518396, 530285,
                      531579, 531785, 536786, 536787, 542720, 542721, 542722, 543192, 543193,
                      544145, 544283, 470932, 38759, 555673])

## Get known test accounts

In [39]:
## Grab the user IDs of known test accounts so they can be added to the exclusion list

def get_known_users(wiki):
    '''
    Get user IDs of known test accounts and return a set of them.
    '''
    
    username_patterns = ["MMiller", "Zilant", "Roan", "KHarlan", "MWang", "SBtest",
                         "Cloud", "Rho2019", "Test"]

    known_user_query = '''
SELECT user_id
FROM user
WHERE user_name LIKE "{name_pattern}%"
    '''
    
    ## use a local name that doesn't shadow the module-level known_users dict
    found_users = set()
    
    for u_pattern in username_patterns:
        new_known = mariadb.run(known_user_query.format(
            name_pattern = u_pattern), wiki)
        found_users = found_users | set(new_known['user_id'])

    return found_users
        
for wiki in wikis:
    known_users[wiki] = known_users[wiki] | get_known_users(wiki)
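
The exclusion sets built above still need to be applied to the data. Below is a minimal sketch of one way to do that, assuming a DataFrame with `wiki` and `user_id` columns like the ones used in this analysis. `exclude_known_users` is a hypothetical helper written for this illustration; it is not a function from this notebook or from `wmfdata`, and the data is made up.

```python
import pandas as pd

def exclude_known_users(df, known):
    """Drop rows whose user_id is in the exclusion set for that row's wiki.

    `known` maps a wiki name to a set of user IDs (hypothetical helper).
    """
    is_known = pd.Series(
        [uid in known.get(wiki, set())
         for wiki, uid in zip(df['wiki'], df['user_id'])],
        index=df.index, dtype=bool)
    return df.loc[~is_known].reset_index(drop=True)

# Made-up example data: user 2 on cswiki is a known test account,
# user 2 on kowiki is a different account and must be kept.
registrations = pd.DataFrame({
    'wiki': ['cswiki', 'cswiki', 'kowiki'],
    'user_id': [1, 2, 2],
})
known = {'cswiki': {2}}

filtered = exclude_known_users(registrations, known)
print(filtered)
```

Matching on the wiki and the user ID together matters because user IDs are local to each wiki in these tables.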

## Get all registrations

We'll grab registrations from ServerSideAccountCreation, and use that as a basis for our analysis to exclude users who didn't register during our given timeframe.

In [40]:
user_registrations_query = '''
SELECT wiki, event.userid AS user_id,
       dt AS reg_timestamp,
       CAST(event.displaymobile AS INT) AS reg_on_mobile
FROM event_sanitized.serversideaccountcreation
WHERE year = 2020
AND month BETWEEN 6 AND 9
AND wiki IN ({wiki_list})
AND event.isselfmade = true
AND event.isapi = false
AND dt BETWEEN "{start_time}" AND "{end_time}"
'''

In [41]:
user_registrations = spark.run(
    user_registrations_query.format(
        wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
        start_time = start_ts.strftime(utils.hive_format),
        end_time = end_ts.strftime(utils.hive_format)
    )
)
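
To make the string formatting above easier to follow, here is a small sketch of what the substituted values could look like. `HIVE_FORMAT` is an assumed stand-in for `utils.hive_format`, whose actual value is not shown in this notebook; the ISO-style format is a guess based on the timestamps that appear in the output.

```python
import datetime as dt

# ASSUMPTION: stand-in for utils.hive_format; the real value is not shown here.
HIVE_FORMAT = '%Y-%m-%dT%H:%M:%SZ'

wikis = ['cswiki', 'kowiki']
start_ts = dt.datetime(2020, 6, 16, 0, 0, 0)

# Quote each wiki name and join with commas, ready for an SQL IN (...) clause
wiki_list = ','.join(['"{}"'.format(w) for w in wikis])
start_time = start_ts.strftime(HIVE_FORMAT)

print(wiki_list)   # "cswiki","kowiki"
print(start_time)  # 2020-06-16T00:00:00Z
```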

In [44]:
## verify first registration in the dataset
user_registrations['reg_timestamp'].min()

'2020-06-16T00:03:05Z'

In [45]:
## verify last registration in the dataset
user_registrations['reg_timestamp'].max()

'2020-08-31T23:58:30Z'

## Proportion of edits to multiple articles

Of the users who make edits to suggested articles, what percentage edits more than one article? This needs to be calculated on a per-wiki basis, and then averaged across all wikis.

We'll use the `mediawiki_revision_tags_change` event table to measure this. We'll use all edits tagged with "newcomer task" made by users within 14 days of registration, so that every user gets the same amount of time to make edits. Because this table also has bot information, we'll filter out bot edits. It's unlikely that bot users make tagged edits, but the filter is easy to apply. We aggregate by wiki and user ID, and count the number of distinct articles each user has edited.
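
The 14-day window and the distinct-article count can be illustrated in plain Python. This is a sketch on made-up rows, not the query used for the analysis; the tuple layout and the names `edits`, `WINDOW` and `num_articles_edited` are chosen for this example only.

```python
import datetime as dt
from collections import defaultdict

TS_FORMAT = '%Y-%m-%dT%H:%M:%SZ'
WINDOW = dt.timedelta(days=14)

# Hypothetical tagged edits: (wiki, user_id, page_id, registration, revision)
edits = [
    ('cswiki', 1, 100, '2020-07-01T00:00:00Z', '2020-07-02T10:00:00Z'),
    ('cswiki', 1, 100, '2020-07-01T00:00:00Z', '2020-07-03T10:00:00Z'),  # same page again
    ('cswiki', 1, 200, '2020-07-01T00:00:00Z', '2020-07-10T10:00:00Z'),
    ('cswiki', 2, 300, '2020-07-01T00:00:00Z', '2020-07-20T10:00:00Z'),  # after 14 days
    ('kowiki', 3, 400, '2020-07-05T00:00:00Z', '2020-07-06T00:00:00Z'),
]

pages_per_user = defaultdict(set)
for wiki, user_id, page_id, reg, rev in edits:
    reg_ts = dt.datetime.strptime(reg, TS_FORMAT)
    rev_ts = dt.datetime.strptime(rev, TS_FORMAT)
    # Keep only edits made less than 14 days after registration
    if rev_ts - reg_ts < WINDOW:
        pages_per_user[(wiki, user_id)].add(page_id)

# Number of distinct articles edited per (wiki, user)
num_articles_edited = {user: len(pages) for user, pages in pages_per_user.items()}
print(num_articles_edited)
```

Users whose only tagged edits fall outside the window do not appear in the result at all, which mirrors how the aggregation drops them from the denominator.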

In [46]:
num_articles_query = '''
SELECT `database` AS wiki, performer.user_id, count(DISTINCT page_id) AS num_articles_edited
FROM event.mediawiki_revision_tags_change
WHERE datacenter = "eqiad"
AND year = 2020
AND month BETWEEN 6 AND 9
AND `database` IN ({wiki_list})
AND performer.user_registration_dt >= "{start_timestamp}"
AND performer.user_registration_dt < "{end_timestamp}"
AND performer.user_is_bot = false -- not a bot edit
AND array_contains(performer.user_groups, 'bot') = false -- not in the bot group either
AND array_contains(tags, "newcomer task")
AND unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
    unix_timestamp(performer.user_registration_dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") < 60*60*24*14
GROUP BY `database`, performer.user_id
'''

In [47]:
num_articles_per_user = spark.run(num_articles_query.format(
    wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
    start_timestamp = start_ts.strftime(utils.hive_format),
    end_timestamp = end_ts.strftime(utils.hive_format)
))

In [48]:
## join users and edits
articles_per_user_data = user_registrations.merge(num_articles_per_user, on = ['wiki', 'user_id'])

In [49]:
articles_per_user_data['edited_many'] = (articles_per_user_data['num_articles_edited'] > 1)

In [50]:
articles_per_user_agg = (articles_per_user_data.groupby(['wiki', 'edited_many'])
                         .agg({'user_id' : 'count'})
                         .reset_index()
                         .rename(columns = {'user_id' : 'n'}))

In [51]:
articles_per_user_agg['perc'] = (100 * articles_per_user_agg['n'] /
                                 articles_per_user_agg.groupby('wiki')['n'].transform('sum'))

In [None]:
articles_per_user_agg.round(1)

In [105]:
## average across all wikis except Basque

articles_per_user_agg.loc[
    (articles_per_user_agg['wiki'] != 'euwiki') &
    (articles_per_user_agg['edited_many'] == True), 'perc'].mean().round(1)

39.5

In [106]:
## average across all wikis

articles_per_user_agg.loc[(articles_per_user_agg['edited_many'] == True), 'perc'].mean().round(1)

44.6
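
Averaging per-wiki percentages gives every wiki the same weight regardless of how many users it has, so the result can differ from the percentage computed over all users pooled together. A small sketch with made-up counts (the wiki names and numbers are hypothetical and unrelated to the results above) shows the difference:

```python
# Hypothetical counts per wiki: (users who edited one article, users who edited several)
counts = {
    'wiki_a': (80, 20),  # 100 users, 20% edited several articles
    'wiki_b': (2, 8),    # 10 users, 80% edited several articles
}

# Percentage of users who edited several articles, per wiki
per_wiki_perc = {
    wiki: 100 * many / (one + many) for wiki, (one, many) in counts.items()
}

# Unweighted mean of the per-wiki percentages (each wiki counts equally)
unweighted_mean = sum(per_wiki_perc.values()) / len(per_wiki_perc)

# Pooled percentage over all users (each user counts equally)
total_many = sum(many for _, many in counts.values())
total_users = sum(one + many for one, many in counts.values())
pooled_perc = 100 * total_many / total_users

print(round(unweighted_mean, 1))  # 50.0
print(round(pooled_perc, 1))      # 25.5
```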

## Proportion of users who save an edit on a suggested article

Of the users who select (click on) a suggested article, what percentage save an edit to that article?

We'll reuse the query from the background analysis for "Add a link" tasks, and use it to grab clicks from `event_sanitized.homepagemodule` and page IDs from `event_sanitized.newcomertask`. We'll add edits from `event.mediawiki_revision_tags_change` as before, and limit them to one week after the click happened.
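
The one-week rule for matching an edit to a click can be sketched in plain Python. `edit_follows_click` is a hypothetical helper for illustration only. Note that it also requires the edit not to precede the click, which is an assumption added in this sketch rather than something stated above.

```python
import datetime as dt

TS_FORMAT = '%Y-%m-%dT%H:%M:%SZ'

def edit_follows_click(click_timestamp, rev_timestamp, days=7):
    """Return True if an edit was saved within `days` days after the click.

    A missing revision timestamp (None) means the click led to no edit.
    """
    if rev_timestamp is None:
        return False
    click_ts = dt.datetime.strptime(click_timestamp, TS_FORMAT)
    rev_ts = dt.datetime.strptime(rev_timestamp, TS_FORMAT)
    delta = rev_ts - click_ts
    # ASSUMPTION: the lower bound (edit not before click) is added for this sketch
    return dt.timedelta(0) <= delta < dt.timedelta(days=days)

print(edit_follows_click('2020-07-01T12:00:00Z', '2020-07-03T12:00:00Z'))  # True
print(edit_follows_click('2020-07-01T12:00:00Z', '2020-07-09T12:00:00Z'))  # False
print(edit_follows_click('2020-07-01T12:00:00Z', None))                    # False
```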

In [93]:
task_click_edit_query = '''
WITH gc AS ( -- guidance click
    SELECT wiki, event.user_id, dt, event.homepage_pageview_token,
           regexp_extract(event.action_data, 'newcomerTaskToken=([\\\\w\\\\d]+)', 1) AS newcomer_task_token
    FROM event_sanitized.homepagemodule
    WHERE year = 2020
    AND month BETWEEN 6 AND 9
    AND dt BETWEEN "{event_start_timestamp}" AND "{event_end_timestamp}"
    AND wiki IN ({wiki_list})
    AND event.action = "se-task-click"
),
tok AS ( -- task info for guidance impressions and clicks
    SELECT event.newcomer_task_token, event.task_type, event.ordinal_position, event.page_id,
           ROW_NUMBER() OVER (PARTITION BY wiki, event.newcomer_task_token ORDER BY dt) AS row_number
    FROM event_sanitized.newcomertask
    WHERE year = 2020
    AND month BETWEEN 6 AND 9
    AND dt BETWEEN "{event_start_timestamp}" AND "{event_end_timestamp}"
    AND wiki IN ({wiki_list})
),
te AS ( -- edits tagged with "newcomer task" within 21 days of registration
    -- NOTE: edits can have multiple tag events, so we use DISTINCT to de-duplicate
    SELECT DISTINCT `database` AS wiki, performer.user_id, rev_id, rev_timestamp, page_id
    FROM event.mediawiki_revision_tags_change
    WHERE datacenter = "eqiad"
    AND year = 2020
    AND month BETWEEN 6 AND 9
    AND `database` IN ({wiki_list})
    AND performer.user_registration_dt >= "{reg_start_timestamp}"
    AND performer.user_registration_dt < "{reg_end_timestamp}"
    AND performer.user_is_bot = false -- not a bot edit
    AND array_contains(performer.user_groups, 'bot') = false -- not in the bot group either
    AND array_contains(tags, "newcomer task")
    -- edits within 21 days of registration, per the requirements
    AND unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
        unix_timestamp(performer.user_registration_dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") < 60*60*24*21
)
SELECT gc.wiki, gc.user_id, gc.dt AS click_timestamp, te.rev_id, te.page_id, te.rev_timestamp
FROM gc
JOIN tok
ON gc.newcomer_task_token = tok.newcomer_task_token
LEFT JOIN te
ON gc.wiki = te.wiki
AND gc.user_id = te.user_id
AND tok.page_id = te.page_id
WHERE te.rev_timestamp IS NULL
   OR (unix_timestamp(te.rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
       unix_timestamp(gc.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") < 60*60*24*7)
'''
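
The token extraction in the `gc` subquery can be tried out locally with Python's `re` module. This is only a sketch: the sample `action_data` string is made up, since the real format of that field is not shown here, and `extract_task_token` is a hypothetical helper.

```python
import re

# Same pattern as in the query, written as a Python raw string
TOKEN_PATTERN = re.compile(r'newcomerTaskToken=([\w\d]+)')

def extract_task_token(action_data):
    """Return the newcomer task token found in action_data, or None if absent."""
    match = TOKEN_PATTERN.search(action_data)
    return match.group(1) if match else None

# ASSUMPTION: made-up sample value, for illustration only
sample = 'foo=bar;newcomerTaskToken=abc123;x=1'
print(extract_task_token(sample))     # abc123
print(extract_task_token('foo=bar'))  # None
```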

In [42]:
## registration start and end timestamps are the same as before
## event start timestamp is start_ts, event end timestamp is end_ts + two weeks
## We'll do further limitations on when clicks occurred after joining with registrations.

event_start_ts = start_ts
event_end_ts = end_ts + dt.timedelta(days = 14)

reg_start_ts = start_ts
reg_end_ts = end_ts


In [94]:
task_clicks_edits_data = spark.run(task_click_edit_query.format(
    wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
    event_start_timestamp = event_start_ts.strftime(utils.hive_format),
    event_end_timestamp = event_end_ts.strftime(utils.hive_format),
    reg_start_timestamp = reg_start_ts.strftime(utils.hive_format),
    reg_end_timestamp = reg_end_ts.strftime(utils.hive_format)
))

Merge with user registrations using an inner join (the default for `merge`), which keeps only registered users who clicked on a suggested article. A click is a requirement for this baseline.

In [95]:
user_clicks_edits = user_registrations.merge(task_clicks_edits_data, on = ['wiki', 'user_id'])

Convert all timestamp strings to datetime objects.

In [96]:
user_clicks_edits['click_ts'] = pd.to_datetime(user_clicks_edits['click_timestamp'],
                                               format = '%Y-%m-%dT%H:%M:%SZ')

In [97]:
user_clicks_edits['rev_ts'] = pd.to_datetime(user_clicks_edits['rev_timestamp'],
                                             format = '%Y-%m-%dT%H:%M:%SZ')

In [98]:
user_clicks_edits['reg_ts'] = pd.to_datetime(user_clicks_edits['reg_timestamp'],
                                             format = '%Y-%m-%dT%H:%M:%SZ')

Now we restrict the data to clicks made within 14 days of registration.

In [99]:
user_clicks_edits = user_clicks_edits.loc[
    (user_clicks_edits['click_ts'] - user_clicks_edits['reg_ts'] < dt.timedelta(days = 14))]

In [101]:
## make a 'did_edit' column, set it to 0
user_clicks_edits['did_edit'] = 0

## set 'did_edit' to 1 if rev_id is not null, meaning the click was matched to an edit
user_clicks_edits.loc[~user_clicks_edits['rev_id'].isna(), 'did_edit'] = 1

## group by wiki and user, sum 'did_edit'
user_clicks_edits_user_agg = (user_clicks_edits.groupby(['wiki', 'user_id'])
                             .agg({'did_edit' : 'sum'})
                             .reset_index()
                             .rename(columns = {'did_edit' : 'num_clicks_edited'}))

## set a flag for users whose sum is > 0
user_clicks_edits_user_agg['did_edit'] = (user_clicks_edits_user_agg['num_clicks_edited'] > 0)
                         
## group by wiki and that flag
user_clicks_edits_agg = (user_clicks_edits_user_agg.groupby(['wiki', 'did_edit'])
                         .agg({'user_id' : 'count'})
                         .reset_index()
                         .rename(columns = {'user_id' : 'n'}))

## calculate per-wiki proportions
user_clicks_edits_agg['perc'] = (100 * user_clicks_edits_agg['n'] /
                                 user_clicks_edits_agg.groupby('wiki')['n'].transform('sum'))

In [None]:
user_clicks_edits_agg

In [104]:
user_clicks_edits_agg.loc[user_clicks_edits_agg['did_edit'] == True, 'perc'].mean().round(1)

31.8

# Summary

Of the users who make edits to suggested articles, what percentage edits more than one article? This needs to be calculated on a per-wiki basis, and then averaged across all wikis. We count all edits made by users within 14 days of registration.

The answer is 39.5% if we exclude Basque Wikipedia, 44.6% if we include it.

Of the users who select (click on) a suggested article, what percentage save an edit to that article? Again, we calculate this on a per-wiki basis and average across all wikis. We count clicks on articles made within 14 days of registration, and edits to those articles made within 7 days of those clicks.

The answer is 31.8%.