# Impact of Missing Newcomer Task Edit Tags

The Phabricator task for this analysis is [T266610](https://phabricator.wikimedia.org/T266610). Per that task and its parent task, certain conditions led to the "newcomer task" tag not being added to edits starting on 2020-09-15. A fix was deployed on 2020-10-28.

The bug was triggered when a user visited the Homepage and clicked on the first card shown in the Newcomer Task queue, because that card had an incorrect URL. If they changed their topic or difficulty settings, the URL was correct and the "newcomer task" tag was applied to their edits to the article. I tested on French Wikipedia, and going through the initialization process also resulted in the correct URL being shown, so we can exclude those sessions too.

We'd like a historical estimate of how often users edited a task under those conditions, and then use that to estimate the actual number of task edits based on the number of existing tagged edits.

Here's the proposed approach:

1. Identify Homepage sessions that have an impression/task click of the task with position 0.
2. Remove sessions where said impression/click occurred *after* changing topics or difficulty, or after the user initialized the Newcomer Tasks module. Note that if a user saved the topic or difficulty settings without changing anything, the bug was still present. We ignore that case and assume that users who clicked to save topic/difficulty changes actually changed them, rather than adding complexity to identify whether anything changed.
3. Identify the page ID of the recommended task.
4. Use the `event.mediawiki_revision_tags_change` table to identify edits to those pages within 7 days of the task impression/click.
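
The filtering in steps 1, 2 and 4 can be sketched on toy data as follows. This is a minimal illustration in pandas, not the Spark query used later in the notebook; the column names and the three example sessions are hypothetical.

```python
import pandas as pd

# Hypothetical position-0 impressions/clicks, one row per Homepage session (step 1)
tasks = pd.DataFrame({
    'session': ['a', 'b', 'c'],
    'user_id': [1, 2, 3],
    'page_id': [10, 20, 30],
    'task_dt': pd.to_datetime(['2020-07-01 10:00', '2020-07-01 11:00', '2020-07-01 12:00']),
})

# Hypothetical earliest init/topic/difficulty event per session (step 2)
excluders = pd.DataFrame({
    'session': ['b', 'c'],
    'exclude_dt': pd.to_datetime(['2020-07-01 10:30', '2020-07-01 12:30']),
})

# Hypothetical edits made by the same users
edits = pd.DataFrame({
    'user_id': [1, 2, 3],
    'page_id': [10, 20, 30],
    'rev_dt': pd.to_datetime(['2020-07-02 09:00', '2020-07-01 11:05', '2020-07-20 08:00']),
})

# Keep sessions with no excluding event, or where the task event came first
candidates = tasks.merge(excluders, on='session', how='left')
affected = candidates[candidates['exclude_dt'].isna() |
                      (candidates['task_dt'] < candidates['exclude_dt'])]

# Step 4: edits to the recommended page within 7 days of the task event
matched = affected.merge(edits, on=['user_id', 'page_id'], how='inner')
delay = matched['rev_dt'] - matched['task_dt']
matched = matched[(delay >= pd.Timedelta(0)) & (delay <= pd.Timedelta(days=7))]

print(matched[['session', 'user_id', 'page_id']])
```

Session `b` is dropped because its task event follows a settings change, and session `c` is dropped at the last step because its edit falls outside the 7-day window, leaving only session `a`.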

We'll exclude known test users in the same way that we usually do, and count tagged edits per week as we do in the reporting notebook. We'll then calculate a proportion on a weekly basis and summarize it. These exclusions mean that the edit counts won't match the previously reported numbers, but we want the best estimate of the impact this bug had on actual users, and that means removing test users.

Lastly, we'll focus on the Guidance era because our instrumentation changed when Guidance was introduced. This simplifies the queries. It also means that we're limited to a three-month period of data, from 2020-06-15 to 2020-09-15.

Doing things this way has some important implications and rests on some assumptions.

We're using client-side EventLogging (EL) data from the HomepageModule schema to identify user sessions where an impression or click took place. We know from previous analysis (e.g. [T243632#5898720](https://phabricator.wikimedia.org/T243632#5898720)) that ad blockers and similar tools can affect this data. Based on that analysis, we expect about 10% of users to block EL completely, and another 10–15% to block it some of the time. We've also seen evidence that blocking is more likely for highly prolific users, because the block percentage is higher on a per-session basis.

The analysis in this notebook makes two assumptions. First, that the impact of ad/JavaScript blocking is relatively consistent over time, so that calculating a proportion of edits makes sense. Second, that the proportion behaves in such a way that we can calculate either a mean and standard deviation or a median and interquartile range from it and get meaningful results.

In [1]:
import datetime as dt

from collections import defaultdict

import numpy as np
import pandas as pd

from wmfdata import spark, mariadb

In [2]:
## _Ordered_ list of wikis that we've deployed to and that we want to include in the reports
## (the ordering determines ordering in later report queries)
wikis = ['cswiki', 'kowiki', 'viwiki', 'arwiki', 'ukwiki', 'huwiki', 'srwiki', 'hywiki',
         'frwiki', 'euwiki', 'fawiki', 'hewiki', 'ruwiki', 'plwiki']

## Lists of known users to ignore (e.g. test accounts and experienced users)
known_users = defaultdict(set)
known_users['cswiki'].update([14, 127629, 303170, 342147, 349875, 44133, 100304, 307410, 439792, 444907,
                              454862, 456272, 454003, 454846, 92295, 387915, 398470, 416764, 44751, 132801,
                              137787, 138342, 268033, 275298, 317739, 320225, 328302, 339583, 341191,
                              357559, 392634, 398626, 404765, 420805, 429109, 443890, 448195, 448438,
                              453220, 453628, 453645, 453662, 453663, 453664, 440694, 427497, 272273,
                              458025, 458487, 458049, 59563, 118067, 188859, 191908, 314640, 390445,
                              451069, 459434, 460802, 460885, 79895, 448735, 453176, 467557, 467745,
                              468502, 468583, 468603, 474052, 475184, 475185, 475187, 475188, 294174,
                              402906, 298011])

known_users['kowiki'].update([303170, 342147, 349875, 189097, 362732, 384066, 416362, 38759, 495265,
                              515553, 537326, 566963, 567409, 416360, 414929, 470932, 472019, 485036,
                              532123, 558423, 571587, 575553, 576758, 360703, 561281, 595100, 595105,
                              595610, 596025, 596651, 596652, 596653, 596654, 596655, 596993, 942,
                              13810, 536529])

known_users['viwiki'].update([451842, 628512, 628513, 680081, 680083, 680084, 680085, 680086, 355424,
                              387563, 443216, 682713, 659235, 700934, 705406, 707272, 707303, 707681, 585762])

known_users['arwiki'].update([237660, 272774, 775023, 1175449, 1186377, 1506091, 1515147, 1538902,
                              1568858, 1681813, 1683215, 1699418, 1699419, 1699425, 1740419, 1759328, 1763990])

## Grab the user IDs of known test accounts so they can be added to the exclusion list

def get_known_users(wiki):
    '''
    Get user IDs of known test accounts and return a set of them.
    '''
    
    username_patterns = ["MMiller", "Zilant", "Roan", "KHarlan", "MWang", "SBtest",
                         "Cloud", "Rho2019", "Test"]

    known_user_query = '''
SELECT user_id
FROM user
WHERE user_name LIKE "{name_pattern}%"
    '''
    
    ## Use a distinct local name so we don't shadow the module-level `known_users` dictionary
    matched_ids = set()
    
    for u_pattern in username_patterns:
        new_known = mariadb.run(known_user_query.format(
            name_pattern = u_pattern), wiki)
        matched_ids = matched_ids | set(new_known['user_id'])

    return(matched_ids)
        
for wiki in wikis:
    known_users[wiki] = known_users[wiki] | get_known_users(wiki)

## Helper functions

In [3]:
def make_known_users_sql(kd, wiki_column, user_column):
    '''
    Based on the dictionary `kd` mapping wiki names to sets of user IDs of known users,
    create a SQL expression to exclude users based on the name of the wiki matching `wiki_column`
    and the user ID not matching `user_column`
    '''
    
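    wiki_exp = '''({w_column} = '{wiki}' AND {u_column} NOT IN ({id_list}))'''
    ## Wikis without known users only need the wiki match, since `NOT IN ()` is invalid SQL
    wiki_only_exp = '''({w_column} = '{wiki}')'''
    
    expressions = list()

    ## Iteratively build the expression for each wiki
    for wiki_name, wiki_users in kd.items():
        if not wiki_users:
            expressions.append(wiki_only_exp.format(
                w_column = wiki_column, wiki = wiki_name))
            continue
        expressions.append(wiki_exp.format(
            w_column = wiki_column,
            wiki = wiki_name,
            u_column = user_column,
            id_list = ','.join([str(u) for u in wiki_users])
        ))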
    
    ## We then join all the expressions with an OR, and we're done.
    return(' OR '.join(expressions))
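
To show the shape of the SQL this helper produces, here is a condensed restatement of it applied to a toy dictionary. The function name `build_exclusion_sql` and the user IDs are hypothetical and only serve the illustration; IDs are sorted here so the output is deterministic.

```python
def build_exclusion_sql(known, wiki_column, user_column):
    # Condensed restatement of the notebook helper, for illustration only
    template = "({w} = '{wiki}' AND {u} NOT IN ({ids}))"
    return ' OR '.join(
        template.format(w=wiki_column, wiki=wiki, u=user_column,
                        ids=','.join(str(i) for i in sorted(ids)))
        for wiki, ids in known.items()
    )

# Hypothetical user IDs, not real accounts
example = {'cswiki': {102, 101}, 'kowiki': {201}}
sql = build_exclusion_sql(example, 'wiki', 'event.user_id')
print(sql)
# (wiki = 'cswiki' AND event.user_id NOT IN (101,102)) OR (wiki = 'kowiki' AND event.user_id NOT IN (201))
```

Because the per-wiki clauses are joined with `OR`, rows from a wiki that is not a key in the dictionary match none of them and are dropped as well.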
    

In [4]:
def make_when_then(wiki_list, wiki_column):
    '''
    Take the ordered list of wiki names and turn it into a string
    of "WHEN wiki_column = '{wiki}' THEN '{k}'" where `k` is the index
    of the wiki in the list, so it can be used for ordering results.
    '''

    whens = list()
    
    for k, wiki in enumerate(wiki_list):
        whens.append(f'WHEN {wiki_column} = "{wiki}" THEN "{k:02}"')
    
    ## Join them with line breaks to create the list
    return('\n'.join(whens))


## Data Gathering of Clicks and Edits

In [5]:
## NOTE: added GROUP BY to impression/clicks to de-duplicate those rows based on their task token,
##       and using the latest timestamp for the event as our basis for excluding sessions

tagged_edit_query = r'''
WITH gi AS ( -- impressions/clicks
    SELECT wiki, event.user_id, event.homepage_pageview_token,
           regexp_extract(event.action_data, "newcomerTaskToken=([\\w\\d]+)", 1) AS newcomer_task_token,
           MAX(dt) AS task_dt
    FROM event_sanitized.homepagemodule
    WHERE year = 2020
    AND month BETWEEN 6 AND 9
    AND TO_DATE(dt) BETWEEN "2020-06-16" AND "2020-09-14" -- bug started on 2020-09-15
    AND event.action IN ("se-task-impression", "se-task-click")
    AND wiki IN ({wiki_list})
    AND ({known_user_id_expression})
    GROUP BY wiki, event.user_id, event.homepage_pageview_token, newcomer_task_token
),
tok AS ( -- task info for impressions and clicks
    SELECT event.newcomer_task_token, event.task_type, event.ordinal_position, event.page_id,
           ROW_NUMBER() OVER (PARTITION BY wiki, event.newcomer_task_token ORDER BY dt) AS row_number
    FROM event_sanitized.newcomertask
    WHERE year = 2020
    AND month BETWEEN 6 AND 9
    AND TO_DATE(dt) BETWEEN "2020-06-16" AND "2020-09-14"
    AND event.ordinal_position = 0
),
exc AS ( -- sessions we can exclude: module initialization, topic changes, difficulty changes
    SELECT wiki, event.user_id, event.homepage_pageview_token, MIN(dt) AS exclude_dt
    FROM event_sanitized.homepagemodule
    WHERE year = 2020
    AND month BETWEEN 6 AND 9
    AND TO_DATE(dt) BETWEEN "2020-06-16" AND "2020-09-14"
    AND event.action IN ("se-activate", "se-topicfilter-done", "se-taskfilter-done")
    GROUP BY wiki, event.user_id, event.homepage_pageview_token
),
te AS ( -- revisions tagged with "newcomer task"
    SELECT `database` AS wiki, performer.user_id, page_id, rev_id,
           FIRST_VALUE(rev_timestamp) AS rev_timestamp
    FROM event.mediawiki_revision_tags_change
    WHERE year = 2020
    AND month BETWEEN 6 AND 9
    AND TO_DATE(rev_timestamp) BETWEEN "2020-06-16" AND "2020-09-22" -- edits are tagged up to 7 days later
    AND array_contains(tags, "newcomer task")
    GROUP BY `database`, performer.user_id, page_id, rev_id
)
SELECT gi.wiki, gi.user_id, gi.homepage_pageview_token, gi.task_dt,
       gi.newcomer_task_token,
       tok.ordinal_position, tok.page_id,
       rev_id, rev_timestamp,
       CONCAT(YEAR(rev_timestamp), "-", LPAD(WEEKOFYEAR(rev_timestamp), 2, "0")) AS rev_week
FROM gi
JOIN tok -- inner join, every task should have a token
ON gi.newcomer_task_token = tok.newcomer_task_token
LEFT JOIN exc
ON gi.homepage_pageview_token = exc.homepage_pageview_token
LEFT JOIN te
ON gi.wiki = te.wiki
AND gi.user_id = te.user_id
AND tok.page_id = te.page_id
WHERE tok.row_number = 1 -- only select first newcomer task token
-- include sessions that didn't have an init/topic/difficulty event
AND (exc.homepage_pageview_token IS NULL
     OR gi.task_dt < exc.exclude_dt) -- or impression/click happened before the init/topic/difficulty event
'''

In [6]:
first_clicks_and_edits = spark.run(tagged_edit_query.format(
    wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
    known_user_id_expression = make_known_users_sql(known_users, 'wiki', 'event.user_id')
))

Now we can aggregate by week, as we otherwise would.

In [7]:
first_task_edits_weekly = (first_clicks_and_edits
                           .groupby('rev_week')
                           .agg({'rev_id' : 'count', 'rev_timestamp' : 'min'})
                           .reset_index()
                           .rename(columns = {'rev_id' : 'n'}))

In [8]:
first_task_edits_weekly['rev_date'] = first_task_edits_weekly['rev_timestamp'].apply(lambda x: x[:10])

In [9]:
first_task_edits_weekly

| | rev_week | n | rev_timestamp | rev_date |
|---|---|---|---|---|
| 0 | 2020-25 | 164 | 2020-06-16T00:08:02Z | 2020-06-16 |
| 1 | 2020-26 | 123 | 2020-06-22T01:31:01Z | 2020-06-22 |
| 2 | 2020-27 | 96 | 2020-06-29T05:26:22Z | 2020-06-29 |
| 3 | 2020-28 | 143 | 2020-07-06T04:23:58Z | 2020-07-06 |
| 4 | 2020-29 | 171 | 2020-07-13T01:02:13Z | 2020-07-13 |
| 5 | 2020-30 | 220 | 2020-07-20T04:32:17Z | 2020-07-20 |
| 6 | 2020-31 | 239 | 2020-07-27T01:44:47Z | 2020-07-27 |
| 7 | 2020-32 | 279 | 2020-08-03T01:19:10Z | 2020-08-03 |
| 8 | 2020-33 | 245 | 2020-08-10T01:27:30Z | 2020-08-10 |
| 9 | 2020-34 | 173 | 2020-08-17T04:40:39Z | 2020-08-17 |
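
The `rev_week` labels come from combining `YEAR()` with a zero-padded `WEEKOFYEAR()` in the query. A rough Python equivalent is sketched below, assuming Spark's `WEEKOFYEAR` follows ISO 8601 week numbering (weeks start on Monday); the helper name `week_label` is hypothetical.

```python
import datetime as dt

def week_label(timestamp):
    '''Build a "YYYY-WW" label from an ISO 8601 timestamp such as "2020-06-16T00:08:02Z".'''
    day = dt.date.fromisoformat(timestamp[:10])
    # isocalendar() returns (ISO year, ISO week, ISO weekday); index 1 is the week number
    return f'{day.year}-{day.isocalendar()[1]:02d}'

print(week_label('2020-06-16T00:08:02Z'))  # 2020-25
print(week_label('2020-09-15T00:00:00Z'))  # 2020-38
```

Pairing the calendar year with the ISO week can mislabel dates right around New Year, but that does not affect the June to October range used in this notebook.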


## Number of Tagged Edits per Week

This query is from the new version of the reporting notebook, which excludes test users in the same way as our query above. We've modified the date range to match the one in the above queries so that we're counting edits in the same time period.

In [10]:
def get_datalake_tagged_edits_by_week():
    '''
    Return a `pandas.DataFrame` with the number of edits tagged
    with "newcomer task" per wiki per week using the Data Lake as the source,
    with known test accounts excluded.
    '''
    
    tagged_edits_per_week_query = '''
    WITH tagged_edits AS (
        SELECT `database` AS wiki,
        rev_id,
        FIRST_VALUE(CONCAT(YEAR(rev_timestamp), "-", LPAD(WEEKOFYEAR(rev_timestamp), 2, "0"))) AS week,
        FIRST_VALUE(performer.user_id) AS user_id,
        FIRST_VALUE(IF(
            CONCAT(YEAR(rev_timestamp), "-", LPAD(WEEKOFYEAR(rev_timestamp), 2, "0")) =
            CONCAT(YEAR(performer.user_registration_dt), "-", LPAD(WEEKOFYEAR(performer.user_registration_dt), 2, "0")), 1, 0))
            AS user_registered_same_week,
        MAX(IF(array_contains(tags, 'mw-reverted') AND
               (unix_timestamp(meta.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
                unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") < 60*60*48), 1, 0)) AS was_reverted -- within 48 hours
        FROM event.mediawiki_revision_tags_change
        WHERE year = 2020
        AND month BETWEEN 6 AND 9
        AND TO_DATE(rev_timestamp) BETWEEN "2020-06-16" AND "2020-09-22" -- same date range as above
        AND `database` IN ({wiki_list})
        AND ({known_user_database_expression})
        AND array_contains(tags, "newcomer task")
        GROUP BY wiki, rev_id
    )
    SELECT
        wiki, week, user_registered_same_week,
        SUM(1) AS num_edits,
        SUM(IF(was_reverted = true, 1, 0)) AS num_reverts,
        COUNT(DISTINCT user_id) AS num_editors
    FROM tagged_edits
    GROUP BY wiki, week, user_registered_same_week
    ORDER BY wiki, week, user_registered_same_week
    '''
  
    return(spark.run(
        tagged_edits_per_week_query.format(
            wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
            known_user_database_expression = make_known_users_sql(known_users,
                                                                  '`database`', 'performer.user_id')
        )))

In [11]:
edits_by_week = get_datalake_tagged_edits_by_week()

In [12]:
edits_by_week_agg = (edits_by_week
                     .groupby('week')
                     .agg({'num_edits' : 'sum'})
                     .reset_index()
                     .rename(columns = {'num_edits' : 'num_total_tagged_edits'}))

## Calculating the Proportion

We combine the edits stemming from clicking a task with position 0 and the total number of tagged edits per week, and then calculate a percentage on a weekly basis.

In [13]:
combined_edits_weekly = first_task_edits_weekly.merge(edits_by_week_agg,
                                                      left_on = 'rev_week', right_on = 'week')

In [14]:
combined_edits_weekly['prop_first_edits'] = (combined_edits_weekly['n'] /
                                             combined_edits_weekly['num_total_tagged_edits'])

In [15]:
combined_edits_weekly['perc_first_edits'] = 100 * combined_edits_weekly['prop_first_edits']

In [16]:
combined_edits_weekly.round(2)

| | rev_week | n | rev_timestamp | rev_date | week | num_total_tagged_edits | prop_first_edits | perc_first_edits |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-25 | 164 | 2020-06-16T00:08:02Z | 2020-06-16 | 2020-25 | 824 | 0.2 | 19.9 |
| 1 | 2020-26 | 123 | 2020-06-22T01:31:01Z | 2020-06-22 | 2020-26 | 545 | 0.23 | 22.57 |
| 2 | 2020-27 | 96 | 2020-06-29T05:26:22Z | 2020-06-29 | 2020-27 | 667 | 0.14 | 14.39 |
| 3 | 2020-28 | 143 | 2020-07-06T04:23:58Z | 2020-07-06 | 2020-28 | 452 | 0.32 | 31.64 |
| 4 | 2020-29 | 171 | 2020-07-13T01:02:13Z | 2020-07-13 | 2020-29 | 609 | 0.28 | 28.08 |
| 5 | 2020-30 | 220 | 2020-07-20T04:32:17Z | 2020-07-20 | 2020-30 | 801 | 0.27 | 27.47 |
| 6 | 2020-31 | 239 | 2020-07-27T01:44:47Z | 2020-07-27 | 2020-31 | 787 | 0.3 | 30.37 |
| 7 | 2020-32 | 279 | 2020-08-03T01:19:10Z | 2020-08-03 | 2020-32 | 992 | 0.28 | 28.12 |
| 8 | 2020-33 | 245 | 2020-08-10T01:27:30Z | 2020-08-10 | 2020-33 | 1103 | 0.22 | 22.21 |
| 9 | 2020-34 | 173 | 2020-08-17T04:40:39Z | 2020-08-17 | 2020-34 | 815 | 0.21 | 21.23 |
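
The weekly percentage is simply the position-0 edit count divided by the total tagged edit count for the same week. The sketch below reproduces it for the first three weeks, with the counts copied from the table; the frame names `first_task` and `all_tagged` are stand-ins for the notebook's own data frames.

```python
import pandas as pd

# Counts for weeks 2020-25 to 2020-27, copied from the table
first_task = pd.DataFrame({'rev_week': ['2020-25', '2020-26', '2020-27'],
                           'n': [164, 123, 96]})
all_tagged = pd.DataFrame({'week': ['2020-25', '2020-26', '2020-27'],
                           'num_total_tagged_edits': [824, 545, 667]})

# Inner join on the week label, then compute the percentage per week
combined = first_task.merge(all_tagged, left_on='rev_week', right_on='week')
combined['perc_first_edits'] = 100 * combined['n'] / combined['num_total_tagged_edits']

print(combined[['week', 'perc_first_edits']].round(2))
```

Note that `merge` defaults to an inner join, so a week that is missing from either frame silently drops out of the result.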


## Estimating an Average and Variation

We check whether the mean and standard deviation make sense to use as the average and its variation, or whether it's more meaningful to use the median and the interquartile range.

In [17]:
combined_edits_weekly.loc[(combined_edits_weekly['week'] < '2020-38'), 'perc_first_edits'].describe()

count    13.000000
mean     24.927829
std       5.628841
min      14.392804
25%      21.226994
50%      26.262626
75%      28.125000
max      33.407325
Name: perc_first_edits, dtype: float64

In [18]:
combined_edits_weekly_desc = (combined_edits_weekly.loc[
    (combined_edits_weekly['week'] < '2020-38'), 'perc_first_edits'].describe())

In [19]:
combined_edits_weekly_desc['mean'] + 2*combined_edits_weekly_desc['std']

36.18551030799077

In [20]:
combined_edits_weekly_desc['mean'] - 2*combined_edits_weekly_desc['std']

13.670148253394698

We don't have many data points, so the standard deviation is high: the interval of the mean ± 2 standard deviations (13.7% to 36.2%) extends below the observed minimum (14.4%) and above the observed maximum (33.4%). I think going with the interquartile range makes more sense here.
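
The comparison can be spelled out with the summary statistics reported by `describe()`. This is only a sketch that copies those rounded values in by hand rather than reading them from the data frame.

```python
# Summary statistics copied from the describe() output (rounded to 6 decimals)
mean, std = 24.927829, 5.628841
observed_min, observed_max = 14.392804, 33.407325
q1, q3 = 21.226994, 28.125000

two_sd_low = mean - 2 * std
two_sd_high = mean + 2 * std

print(f'mean +/- 2 SD: {two_sd_low:.2f} to {two_sd_high:.2f}')
print(f'IQR:           {q1:.2f} to {q3:.2f}')

# The +/- 2 SD interval spills past both observed extremes, the IQR stays inside them
print(two_sd_low < observed_min, two_sd_high > observed_max)
```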

In [21]:
combined_edits_weekly_desc['25%']

21.226993865030675

In [22]:
combined_edits_weekly_desc['75%']

28.125

## Counts of tagged edits in affected weeks

We get the number of tagged edits on a weekly basis for the time period during which the bug was active, and then use the lower and upper bounds to estimate how many edits we would've seen if the bug hadn't been present.

In [23]:
def get_affected_edits_by_week():
    '''
    Return a `pandas.DataFrame` with the number of edits tagged
    with "newcomer task" per wiki per week using the Data Lake as the source,
    with known test accounts excluded, for the period 2020-09-15 through 2020-10-28.
    '''
    
    tagged_edits_per_week_query = '''
    WITH tagged_edits AS (
        SELECT `database` AS wiki,
        rev_id,
        FIRST_VALUE(CONCAT(YEAR(rev_timestamp), "-", LPAD(WEEKOFYEAR(rev_timestamp), 2, "0"))) AS week,
        FIRST_VALUE(performer.user_id) AS user_id,
        FIRST_VALUE(IF(
            CONCAT(YEAR(rev_timestamp), "-", LPAD(WEEKOFYEAR(rev_timestamp), 2, "0")) =
            CONCAT(YEAR(performer.user_registration_dt), "-", LPAD(WEEKOFYEAR(performer.user_registration_dt), 2, "0")), 1, 0))
            AS user_registered_same_week,
        MAX(IF(array_contains(tags, 'mw-reverted') AND
               (unix_timestamp(meta.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
                unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") < 60*60*48), 1, 0)) AS was_reverted -- within 48 hours
        FROM event.mediawiki_revision_tags_change
        WHERE year = 2020 AND month >= 9
        AND TO_DATE(rev_timestamp) BETWEEN "2020-09-15" AND "2020-10-28"
        AND `database` IN ({wiki_list})
        AND ({known_user_database_expression})
        AND array_contains(tags, "newcomer task")
        GROUP BY wiki, rev_id
    )
    SELECT
        wiki, week, user_registered_same_week,
        SUM(1) AS num_edits,
        SUM(IF(was_reverted = true, 1, 0)) AS num_reverts,
        COUNT(DISTINCT user_id) AS num_editors
    FROM tagged_edits
    GROUP BY wiki, week, user_registered_same_week
    ORDER BY wiki, week, user_registered_same_week
    '''
  
    return(spark.run(
        tagged_edits_per_week_query.format(
            wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
            known_user_database_expression = make_known_users_sql(known_users,
                                                                  '`database`', 'performer.user_id')
        )))

In [24]:
affected_edits_by_week = get_affected_edits_by_week()

In [25]:
affected_edits_by_week_agg = (affected_edits_by_week
                              .groupby('week')
                              .agg({'num_edits' : 'sum'})
                              .reset_index()
                              .rename(columns = {'num_edits' : 'num_total_tagged_edits'}))

In [26]:
affected_edits_by_week_agg

| | week | num_total_tagged_edits |
|---|---|---|
| 0 | 2020-38 | 949 |
| 1 | 2020-39 | 1101 |
| 2 | 2020-40 | 926 |
| 3 | 2020-41 | 912 |
| 4 | 2020-42 | 1071 |
| 5 | 2020-43 | 1626 |
| 6 | 2020-44 | 795 |


Now we use our interquartile range to estimate lower and upper bounds on how many edits we'd have had without the bug. We have estimates of the proportion of total edits that came from clicks in position 0, and those are the edits that lost their tag, so the measured count is the total multiplied by one minus that proportion. Dividing the measured count by one minus the proportion therefore gives an estimate of the total number of edits.
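
As a minimal sketch of that correction, take the 949 tagged edits measured in week 2020-38 and the two quartiles from the previous section. The helper name `estimate_actual_edits` is hypothetical, and the calculation assumes that the share of edits losing their tag during the bug equals the historical position-0 share.

```python
def estimate_actual_edits(observed, lost_percentage):
    '''Estimate total edits, assuming `lost_percentage` percent of them went untagged.'''
    # observed = actual * (1 - p)  =>  actual = observed / (1 - p)
    return observed / (1 - lost_percentage / 100)

observed = 949  # tagged edits measured in week 2020-38
lower = estimate_actual_edits(observed, 21.226993865030675)  # first quartile
upper = estimate_actual_edits(observed, 28.125)              # third quartile

print(round(lower, 1), round(upper, 1))  # 1204.7 1320.3
```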

In [34]:
affected_edits_by_week_agg['lower_bound'] = (
    affected_edits_by_week_agg['num_total_tagged_edits'] /
    (1 - (combined_edits_weekly_desc['25%']/100)))

In [35]:
affected_edits_by_week_agg['upper_bound'] = (
    affected_edits_by_week_agg['num_total_tagged_edits'] /
    (1 - (combined_edits_weekly_desc['75%']/100)))

The table below shows the measured number of tagged edits per week while the bug was in effect, and the lower and upper bound estimates for what the actual number of tagged edits was.

In [36]:
affected_edits_by_week_agg

| | week | num_total_tagged_edits | lower_bound | upper_bound |
|---|---|---|---|---|
| 0 | 2020-38 | 949 | 1204.727414 | 1320.347826 |
| 1 | 2020-39 | 1101 | 1397.686916 | 1531.826087 |
| 2 | 2020-40 | 926 | 1175.529595 | 1288.347826 |
| 3 | 2020-41 | 912 | 1157.757009 | 1268.869565 |
| 4 | 2020-42 | 1071 | 1359.602804 | 1490.086957 |
| 5 | 2020-43 | 1626 | 2064.158879 | 2262.26087 |
| 6 | 2020-44 | 795 | 1009.228972 | 1106.086957 |


In [39]:
affected_edits_by_week_agg['num_total_tagged_edits'].sum()

7380

In [37]:
(affected_edits_by_week_agg['lower_bound'] - affected_edits_by_week_agg['num_total_tagged_edits']).sum()

1988.6915887850469

In [38]:
(affected_edits_by_week_agg['upper_bound'] - affected_edits_by_week_agg['num_total_tagged_edits']).sum()

2887.826086956522

In [40]:
(affected_edits_by_week_agg['num_total_tagged_edits'].sum() +
 (affected_edits_by_week_agg['lower_bound'] - affected_edits_by_week_agg['num_total_tagged_edits']).sum())

9368.691588785046

In [41]:
(affected_edits_by_week_agg['num_total_tagged_edits'].sum() +
 (affected_edits_by_week_agg['upper_bound'] - affected_edits_by_week_agg['num_total_tagged_edits']).sum())

10267.826086956522
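
Because every week is divided by the same factor, these totals can be cross-checked by applying the factor once to the sum of the weekly counts. The sketch below does that with the weekly numbers copied from the table; the variable names are illustrative.

```python
# Measured tagged edits for weeks 2020-38 through 2020-44
weekly_tagged = [949, 1101, 926, 912, 1071, 1626, 795]
total_tagged = sum(weekly_tagged)

results = {}
for label, pct in [('lower', 21.226993865030675), ('upper', 28.125)]:
    estimated_total = total_tagged / (1 - pct / 100)
    # Store (estimated total, estimated untagged edits), rounded to whole edits
    results[label] = (round(estimated_total), round(estimated_total - total_tagged))

print(total_tagged)  # 7380
print(results)       # {'lower': (9369, 1989), 'upper': (10268, 2888)}
```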

# Quick summary

We use EventLogging data to estimate, on a weekly basis, the proportion of edits that would have triggered the bug. The median (26.3%) and interquartile range of this proportion make the most sense as a summary, and give a lower bound of 21.2% and an upper bound of 28.1%.

Gathering weekly edits tagged with "newcomer task" during the affected period, we find 7,380 tagged edits. Based on our estimates, this would have been between 1,989 and 2,888 edits higher had the bug not been present. In other words, we estimate that somewhere between 9,369 and 10,268 Newcomer Task edits were actually made during this period.