# Welcome Survey and Activation in Russian Wikipedia

The Welcome Survey was deployed as an A/B test on Russian Wikipedia to determine whether it affects editor activation. The Phabricator task for this analysis is [T259768](https://phabricator.wikimedia.org/T259768).

In [46]:
import datetime as dt

import numpy as np
import pandas as pd

from scipy import stats

from collections import defaultdict

from wmfdata import spark, mariadb
from growth import utils

## Group assignment

Per [T257490#6395284](https://phabricator.wikimedia.org/T257490#6395284), the survey was deployed with 20% of newcomers not getting it. Let's start by checking that assignment to the Welcome Survey is independent of assignment to the Growth features. Ideally we would base this check on all registrations, but we'll use all Welcome Survey responses since that's the dataset we'll be aggregating over anyway.

In [7]:
group_stats_query = '''
WITH wc_groups AS (
  SELECT
    up_user,
    CAST(json_value(up_value, "$._group") AS CHAR CHARACTER SET utf8) AS group_name
  FROM user_properties
  WHERE up_property = "welcomesurvey-responses"
),
hp_users AS (
  SELECT
    up_user,
    CAST(up_value AS UNSIGNED INTEGER) AS hp_enabled
  FROM user_properties
  WHERE up_property = "growthexperiments-homepage-enable"
)
SELECT
  wc_groups.group_name AS wc_group,
  IF(hp_users.hp_enabled IS NOT NULL, 1, 0) AS hp_group,
  SUM(1) AS num_users
FROM wc_groups
LEFT JOIN hp_users
ON wc_groups.up_user = hp_users.up_user
GROUP BY wc_group, hp_group
'''

In [8]:
group_assignment_stats = mariadb.run(group_stats_query, 'ruwiki')

In [11]:
group_assignment_stats['percent'] = (100 * group_assignment_stats['num_users'] /
                                   group_assignment_stats.groupby('wc_group')['num_users'].transform('sum'))
group_assignment_stats.round(1)

| | wc_group | hp_group | num_users | percent |
|---|---|---|---|---|
| 0 | exp2_target_specialpage | 0 | 6201.0 | 20.1 |
| 1 | exp2_target_specialpage | 1 | 24617.0 | 79.9 |
| 2 | NONE | 0 | 1522.0 | 19.5 |
| 3 | NONE | 1 | 6264.0 | 80.5 |


The split between the two Growth feature groups is close to 80/20 in both survey groups, so the randomization looks sound and we can proceed.
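
As a supplementary check that is not part of the original analysis, the counts from the output above can be run through a chi-squared test of independence. This is a minimal sketch: the variable names are made up for illustration and the counts are copied by hand from the output.

```python
import numpy as np
from scipy import stats

# Rows: survey group (exp2_target_specialpage, NONE).
# Columns: hp_group = 0, hp_group = 1.
# Counts copied by hand from the group assignment output above.
randomization_counts = np.array([
    [6201, 24617],
    [1522, 6264],
])

chi2, p_value, dof, expected = stats.chi2_contingency(randomization_counts)

# A large p-value means the counts are consistent with the survey group and
# the Growth feature group being assigned independently of each other.
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
```

A p-value well above 0.05 here would support the reading that the two assignments are independent.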

## Helper Functions

In [21]:
def make_known_users_sql(kd, wiki_column, user_column):
    '''
    Build a SQL expression that excludes known users. `kd` is a dictionary mapping
    wiki names to sets of user IDs of known users. For each wiki, the expression
    matches rows where `wiki_column` equals the wiki name and `user_column` is not
    one of that wiki's known user IDs. The per-wiki expressions are joined with OR.
    '''
    
    wiki_exp = '''({w_column} = '{wiki}' AND {u_column} NOT IN ({id_list}))'''
    
    expressions = list()

    ## Iteratively build the expression for each wiki
    for wiki_name, wiki_users in kd.items():
        expressions.append(wiki_exp.format(
            w_column = wiki_column,
            wiki = wiki_name,
            u_column = user_column,
            id_list = ','.join([str(u) for u in wiki_users])
        ))
    
    ## We then join all the expressions with an OR, and we're done.
    return(' OR '.join(expressions))
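
To show what the helper produces, here is a small self-contained sketch. It uses a compact copy of the function (with the IDs sorted so the output is deterministic) and made-up user IDs; `example_known_users` and `example_sql` are hypothetical names used only for illustration.

```python
def make_known_users_sql(kd, wiki_column, user_column):
    """Compact copy of the helper above, with sorted IDs for stable output."""
    wiki_exp = "({w_column} = '{wiki}' AND {u_column} NOT IN ({id_list}))"
    return ' OR '.join(
        wiki_exp.format(
            w_column=wiki_column,
            wiki=wiki_name,
            u_column=user_column,
            id_list=','.join(str(u) for u in sorted(wiki_users)),
        )
        for wiki_name, wiki_users in kd.items()
    )

# Hypothetical user IDs, for illustration only
example_known_users = {'ruwiki': {7, 42}, 'cswiki': {14}}
example_sql = make_known_users_sql(example_known_users, 'wiki_db', 'user_id')
print(example_sql)
# (wiki_db = 'ruwiki' AND user_id NOT IN (7,42)) OR (wiki_db = 'cswiki' AND user_id NOT IN (14))
```

Note that a wiki with an empty set of known users would produce `NOT IN ()`, which many SQL dialects reject, so every wiki in the dictionary should have at least one ID.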
    

## Dataset of registrations and responses

Per [T257490#6416979](https://phabricator.wikimedia.org/T257490#6416979), the features were enabled on 2020-08-27T18:31:46Z. We'll use registrations from that timestamp onwards, limiting them to users who registered directly on the Russian Wikipedia (meaning autocreated accounts are excluded). We also exclude known test accounts.

We'll use [MediaWiki history](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history) as our data source for user registrations and edits, because it's the most authoritative source for that type of data. At the time of analysis the most recent available snapshot is "2020-11", containing data through November 2020. We therefore only use registrations made before 2020-11-29, which leaves at least 24 hours within the snapshot for a possible activation edit by the users who registered last.
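
The cutoff arithmetic can be sketched as follows. The assumption here, labelled in the code, is that the "2020-11" snapshot covers events strictly before 2020-12-01 00:00:00 UTC; the variable names are illustrative.

```python
import datetime as dt

# Assumption: the "2020-11" snapshot covers events up to the end of November,
# i.e. strictly before 2020-12-01 00:00:00 UTC.
snapshot_end = dt.datetime(2020, 12, 1, 0, 0, 0)

# Activation is defined as editing within 24 hours of registration.
activation_window = dt.timedelta(hours=24)

# Registration cutoff used in this analysis.
registration_cutoff = dt.datetime(2020, 11, 29, 0, 0, 0)

# The latest moment at which an included user's activation window can close.
latest_window_close = registration_cutoff + activation_window

print(latest_window_close)                  # 2020-11-30 00:00:00
print(latest_window_close <= snapshot_end)  # True
```

Under that assumption the cutoff is conservative by one day, since every included user's 24-hour window closes before the snapshot ends.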

In [18]:
## Start and end timestamps of data gathering for analysis, per notes above.
exp_start_ts = dt.datetime(2020, 8, 27, 18, 31, 46)
exp_end_ts = dt.datetime(2020, 11, 29, 0, 0, 0)

## The wikis that we're gathering data for, which is only one wiki:
wikis = ['ruwiki']

## The MediaWiki history snapshot we're using for registrations and edit data
wmf_snapshot = '2020-11'

## Lists of known users to ignore (e.g. test accounts and experienced users)
known_users = defaultdict(set)
known_users['cswiki'].update([14, 127629, 303170, 342147, 349875, 44133, 100304, 307410, 439792, 444907,
                              454862, 456272, 454003, 454846, 92295, 387915, 398470, 416764, 44751, 132801,
                              137787, 138342, 268033, 275298, 317739, 320225, 328302, 339583, 341191,
                              357559, 392634, 398626, 404765, 420805, 429109, 443890, 448195, 448438,
                              453220, 453628, 453645, 453662, 453663, 453664, 440694, 427497, 272273,
                              458025, 458487, 458049, 59563, 118067, 188859, 191908, 314640, 390445,
                              451069, 459434, 460802, 460885, 79895, 448735, 453176, 467557, 467745,
                              468502, 468583, 468603, 474052, 475184, 475185, 475187, 475188, 294174,
                              402906, 298011])

known_users['kowiki'].update([303170, 342147, 349875, 189097, 362732, 384066, 416362, 38759, 495265,
                              515553, 537326, 566963, 567409, 416360, 414929, 470932, 472019, 485036,
                              532123, 558423, 571587, 575553, 576758, 360703, 561281, 595100, 595105,
                              595610, 596025, 596651, 596652, 596653, 596654, 596655, 596993, 942,
                              13810, 536529])

known_users['viwiki'].update([451842, 628512, 628513, 680081, 680083, 680084, 680085, 680086, 355424,
                              387563, 443216, 682713, 659235, 700934, 705406, 707272, 707303, 707681, 585762])

known_users['arwiki'].update([237660, 272774, 775023, 1175449, 1186377, 1506091, 1515147, 1538902,
                              1568858, 1681813, 1683215, 1699418, 1699419, 1699425, 1740419, 1759328, 1763990])

## Grab the user IDs of known test accounts so they can be added to the exclusion list

def get_known_users(wiki):
    '''
    Get user IDs of known test accounts and return a set of them.
    '''
    
    username_patterns = ["MMiller", "Zilant", "Roan", "KHarlan", "MWang", "SBtest",
                         "Cloud", "Rho2019", "Test"]

    known_user_query = '''
SELECT user_id
FROM user
WHERE user_name LIKE "{name_pattern}%"
    '''
    
    ## Use a local name that does not shadow the module-level `known_users` dictionary
    wiki_known_users = set()
    
    for u_pattern in username_patterns:
        new_known = mariadb.run(known_user_query.format(
            name_pattern = u_pattern), wiki)
        wiki_known_users = wiki_known_users | set(new_known['user_id'])

    return(wiki_known_users)
        
for wiki in wikis:
    known_users[wiki] = known_users[wiki] | get_known_users(wiki)

In [32]:
group_assignment_query = '''
SELECT
  "ruwiki" AS wiki_db,
  up_user AS user_id,
  IF(CAST(json_value(up_value, "$._group") AS CHAR CHARACTER SET utf8) = "NONE",
     "control", "survey") AS user_survey_group
FROM user_properties
WHERE up_property = "welcomesurvey-responses"
'''

In [27]:
registrations_and_edits_query = '''
WITH regs AS (
  SELECT
    wiki_db,
    user_id
  FROM wmf.mediawiki_user_history
  WHERE snapshot = "{snapshot}"
  AND wiki_db IN ({wiki_list})
  AND ({known_user_id_expression}) -- not a known test account
  AND caused_by_event_type = "create" -- account creation
  AND created_by_self = true -- self-created account
  AND size(is_bot_by_historical) = 0 -- is/was not a bot
   -- ...and registered within our data gathering window
  AND user_registration_timestamp > "{start_ts}"
  AND user_registration_timestamp < "{end_ts}"

),
edits AS (
  SELECT
    wiki_db,
    event_user_id AS user_id,
    SUM(1) AS num_edits
  FROM wmf.mediawiki_history
  WHERE snapshot = "{snapshot}"
  AND event_entity = "revision"
  AND event_type = "create"
  AND wiki_db IN ({wiki_list})
  AND ({known_event_user_id_expression}) -- not a known test account
  AND event_user_is_created_by_self = true -- self-created account
  AND size(event_user_is_bot_by_historical) = 0 -- is/was not a bot
  -- ...and registered within our data gathering window
  AND event_user_registration_timestamp > "{start_ts}"
  AND event_user_registration_timestamp < "{end_ts}"

  -- activation is editing within 24 hours of registration
  AND unix_timestamp(event_timestamp) - unix_timestamp(event_user_registration_timestamp) < 86400
  GROUP BY wiki_db, event_user_id
)
SELECT
  regs.wiki_db AS wiki_db,
  regs.user_id AS user_id,
  IF(num_edits IS NOT NULL, 1, 0) AS did_activate,
  coalesce(num_edits, 0) AS num_activation_edits
FROM regs
LEFT JOIN edits
ON (regs.wiki_db = edits.wiki_db
    AND regs.user_id = edits.user_id)
'''
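
The activation logic in the query above (count edits made less than 24 hours after registration, then left join so that users without such edits get zero) can be illustrated in pandas on toy data. This is only a sketch of the same logic, not the code used in the analysis, and all data and names below are hypothetical.

```python
import pandas as pd

# Hypothetical registrations and edits, for illustration only
registrations = pd.DataFrame({
    'user_id': [1, 2, 3],
    'registered': pd.to_datetime(['2020-09-01 10:00:00'] * 3),
})
edits = pd.DataFrame({
    'user_id': [1, 1, 2],
    'edited': pd.to_datetime(['2020-09-01 11:00:00',    # 1 hour after registration
                              '2020-09-01 12:00:00',    # 2 hours after registration
                              '2020-09-03 10:00:00']),  # 48 hours after registration
})

# Keep only edits made less than 24 hours (86400 seconds) after registration
timed_edits = edits.merge(registrations, on='user_id')
seconds_since_reg = (timed_edits['edited'] - timed_edits['registered']).dt.total_seconds()
activation_edits = (timed_edits.loc[seconds_since_reg < 86400]
                    .groupby('user_id')
                    .size()
                    .rename('num_activation_edits')
                    .reset_index())

# Left join so that users without a qualifying edit are kept with zero edits
activation = registrations.merge(activation_edits, how='left', on='user_id')
activation['num_activation_edits'] = activation['num_activation_edits'].fillna(0).astype(int)
activation['did_activate'] = (activation['num_activation_edits'] > 0).astype(int)
print(activation[['user_id', 'did_activate', 'num_activation_edits']])
```

In this toy data only user 1 activates: user 2's edit falls outside the 24-hour window and user 3 never edits.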

In [28]:
registrations_and_edits = spark.run(registrations_and_edits_query.format(
    snapshot = wmf_snapshot,
    wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
    known_user_id_expression = make_known_users_sql(known_users, 'wiki_db', 'user_id'),
    known_event_user_id_expression = make_known_users_sql(known_users, 'wiki_db', 'event_user_id'),
    start_ts = exp_start_ts.strftime(utils.hive_format),
    end_ts = exp_end_ts.strftime(utils.hive_format)
))

In [33]:
group_assignments = mariadb.run(group_assignment_query, 'ruwiki')

In [34]:
experiment_dataset = registrations_and_edits.merge(group_assignments, how = 'left',
                                                   on = ['wiki_db', 'user_id'])

In [38]:
## How many users are in this dataset?

len(experiment_dataset)

36060

In [37]:
## How many users do we have without a group assigned?

len(experiment_dataset.loc[experiment_dataset['user_survey_group'].isna()])

5585

In [39]:
## What proportion is that?
(len(experiment_dataset.loc[experiment_dataset['user_survey_group'].isna()]) /
 len(experiment_dataset))

0.15488075429839157

I suspect that these users are mostly Android/iOS app users. If a user registers an account in the app, it shows up as a local registration in MediaWiki history, but as far as I know the registration itself is done through the API, so the user never gets a Welcome Survey group. Having 15.5% of users register through the app seems reasonable, and we'll exclude these users from our analysis.
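
An alternative way to count registrations without a group is the `indicator` option of pandas' `merge`, which records where each row was found. The sketch below uses hypothetical toy data and names, not the experiment dataset.

```python
import pandas as pd

# Hypothetical data: four registrations, three of which have a survey group
toy_registrations = pd.DataFrame({
    'wiki_db': ['ruwiki'] * 4,
    'user_id': [1, 2, 3, 4],
})
toy_assignments = pd.DataFrame({
    'wiki_db': ['ruwiki'] * 3,
    'user_id': [1, 2, 4],
    'user_survey_group': ['survey', 'control', 'survey'],
})

# indicator=True adds a `_merge` column saying where each row was found
toy_merged = toy_registrations.merge(toy_assignments, how='left',
                                     on=['wiki_db', 'user_id'], indicator=True)

# 'left_only' rows are registrations with no survey group assignment
num_unassigned = (toy_merged['_merge'] == 'left_only').sum()
print(num_unassigned, num_unassigned / len(toy_merged))
```

This gives the same count as checking `user_survey_group` for missing values, but it also works when the joined column can legitimately be null.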

In [40]:
valid_survey_users = experiment_dataset.loc[~experiment_dataset['user_survey_group'].isna()]

We'll aggregate and calculate activation proportions for each group.

In [45]:
activation_aggregation = (valid_survey_users
                          .groupby(['user_survey_group', 'did_activate'])
                          .agg({'user_id' : 'count'})
                          .reset_index()
                          .rename(columns = {'user_id' : 'num_users'}))
activation_aggregation['percent'] = (100 * activation_aggregation['num_users'] /
                                   activation_aggregation.groupby('user_survey_group')['num_users'].transform('sum'))
activation_aggregation.round(1)

| | user_survey_group | did_activate | num_users | percent |
|---|---|---|---|---|
| 0 | control | 0 | 3882 | 63.6 |
| 1 | control | 1 | 2220 | 36.4 |
| 2 | survey | 0 | 15300 | 62.8 |
| 3 | survey | 1 | 9073 | 37.2 |


We then use SciPy's `chi2_contingency` function to run a chi-squared test on a 2x2 contingency table. The rows of the table are the control and survey groups, and the columns are the counts of users who did not and did activate.

In [60]:
stats.chi2_contingency(
    np.array(
        [activation_aggregation.loc[activation_aggregation['user_survey_group'] == 'control', 'num_users'],
         activation_aggregation.loc[activation_aggregation['user_survey_group'] == 'survey', 'num_users']]
    )
)

(1.454811389184393,
 0.227757490744804,
 1,
 array([[ 3840.80603774,  2261.19396226],
        [15341.19396226,  9031.80603774]]))
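
One detail worth knowing when reading this output: for a 2x2 table (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default. The sketch below, with counts copied by hand from the aggregation above and illustrative variable names, compares the result with and without the correction.

```python
import numpy as np
from scipy import stats

# Rows: control, survey. Columns: did not activate, did activate.
# Counts copied by hand from the aggregation output above.
activation_counts = np.array([
    [3882, 2220],
    [15300, 9073],
])

# Default behaviour: Yates' continuity correction is applied when dof == 1
chi2_corrected, p_corrected, dof, _ = stats.chi2_contingency(activation_counts)

# Same test without the continuity correction, for comparison
chi2_plain, p_plain, _, _ = stats.chi2_contingency(activation_counts, correction=False)

print(f"with correction:    chi2 = {chi2_corrected:.3f}, p = {p_corrected:.3f}")
print(f"without correction: chi2 = {chi2_plain:.3f}, p = {p_plain:.3f}")
```

With samples of this size the correction changes the statistic only slightly, and neither version is significant at the 0.05 level.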

The test does not find a significant difference in activation between the Control group (36.4%) and the Survey group (37.2%): the chi-squared statistic is 1.45 with 1 degree of freedom, p = 0.23. We therefore find no evidence that the Welcome Survey affects activation.
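
To put a size on the difference, a confidence interval for the difference in activation proportions can be sketched as below. This is a supplementary calculation, not part of the original analysis; it uses the Wald (normal approximation) interval, and the counts are copied by hand from the aggregation above.

```python
import math
from scipy import stats

# Counts copied by hand from the aggregation output above
control_activated, control_total = 2220, 3882 + 2220
survey_activated, survey_total = 9073, 15300 + 9073

p_control = control_activated / control_total
p_survey = survey_activated / survey_total
diff = p_survey - p_control

# Wald (normal approximation) standard error of the difference in proportions
std_err = math.sqrt(p_control * (1 - p_control) / control_total
                    + p_survey * (1 - p_survey) / survey_total)

# Two-sided 95% interval
z = stats.norm.ppf(0.975)
ci_low, ci_high = diff - z * std_err, diff + z * std_err

print(f"difference = {100 * diff:.2f} pp, "
      f"95% CI = [{100 * ci_low:.2f}, {100 * ci_high:.2f}] pp")
```

The interval spans zero, which is consistent with the chi-squared test above.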