# Homepage Discovery Revisit

We analysed the Homepage discovery rate in [T228338](https://phabricator.wikimedia.org/T228338) to determine the effect of the Homepage discovery features on Czech and Korean Wikipedia.

We're now revisiting that analysis in [T245369](https://phabricator.wikimedia.org/T245369). We'll again look at the proportion of newly registered users who visit the Homepage within 48 hours of registration, this time for a larger number of wikis:

* Czech
* Korean
* Arabic
* Vietnamese
* Hungarian
* Ukrainian
* Armenian

We do not need old data for this, so I'll use the first three months of 2020. Like we've done before, we'll exclude known test accounts from the dataset.

The phab task asks for a breakdown of desktop/mobile. In this case, we can break it down by where the first visit occurred, thus ignoring whether the user *registered* on desktop or mobile.

In order to allow everyone an equal opportunity to visit the Homepage, we'll require users to visit it within 48 hours of registration.
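
As a minimal sketch of that rule, the check below uses only the standard library. The function name `visited_within_window` and the sample timestamps are made up for illustration; the notebook applies the same comparison with pandas when it aggregates.

```python
import datetime as dt

# Hypothetical helper, not part of the notebook: True if the first Homepage
# visit happened less than `window` after registration.
def visited_within_window(reg_ts, visit_ts, window=dt.timedelta(hours=48)):
    if visit_ts is None:  # the user never visited the Homepage
        return False
    return visit_ts - reg_ts < window

reg = dt.datetime(2020, 3, 1, 12, 0, 0)
print(visited_within_window(reg, dt.datetime(2020, 3, 2, 9, 0, 0)))  # True (21 hours)
print(visited_within_window(reg, dt.datetime(2020, 3, 4, 9, 0, 0)))  # False (69 hours)
print(visited_within_window(reg, None))                              # False
```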

In [1]:
import json
import datetime as dt

import tabulate

import numpy as np
import pandas as pd

from wmfdata import hive, spark, mariadb

In [2]:
%load_ext rpy2.ipython

In [3]:
%%R
library(data.table)
library(ggplot2)
library(pwr)
library(zoo)

R[write to console]: 
Attaching package: ‘zoo’


R[write to console]: The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric




In [4]:
## Configuration variables

wmf_snapshot = '2020-02'

wikis = ['cswiki', 'kowiki', 'viwiki', 'arwiki', 'huwiki', 'ukwiki', 'hywiki']

## Start and end timestamps of data gathering for each wiki, corresponding to the dates/times
## listed above.

start_ts = dt.datetime(2020, 1, 1, 0, 0, 0)
end_ts = dt.datetime(2020, 4, 1, 0, 0, 0)

## User IDs of known users to exclude (e.g. Stephane, Elena, and Marshall's accounts)
known_users = {
    'cswiki' : set([322106, 339583, 341191, 341611, 433381, 433382, 433511, 404765, 421667,
                      427625, 437386, 181724, 272273, 339583, 437386, 439783, 439792, 138342,
                      392634, 404765, 275298, 458487, 458049]),
    'kowiki' : set([384066, 539296, 539299, 539302, 539303, 539304, 539305, 539306, 539307,
                      539298, 416361, 416360, 413162, 495265, 518393, 518394, 518396, 530285,
                      531579, 531785, 536786, 536787, 542720, 542721, 542722, 543192, 543193,
                      544145, 544283, 470932, 38759, 555673]),
    'viwiki' : set(),
    'arwiki' : set(),
    'huwiki' : set(),
    'ukwiki' : set(),
    'hywiki' : set()
}

## Get known test accounts

In [5]:
## Grab the user IDs of known test accounts so they can be added to the exclusion list

def get_known_users(wiki):
    '''
    Get user IDs of known test accounts and return a set of them.
    '''
    
    username_patterns = ["MMiller", "Zilant", "Roan", "KHarlan", "MWang", "SBtest",
                         "Cloud", "Rho2019", "Test"]

    known_user_query = '''
SELECT user_id
FROM user
WHERE user_name LIKE "{name_pattern}%"
    '''
    
    known_users = set()
    
    for u_pattern in username_patterns:
        new_known = mariadb.run(known_user_query.format(
            name_pattern = u_pattern), wiki)
        known_users = known_users | set(new_known['user_id'])

    return(known_users)
        
for wiki in wikis:
    known_users[wiki] = known_users[wiki] | get_known_users(wiki)
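
Note that the lookup matches on username prefix, so a broad pattern such as `Test` also catches any genuine account whose name happens to start with those letters. The sketch below mimics `LIKE "<prefix>%"` with `str.startswith`. The helper name and the usernames are invented for illustration, and case-sensitive matching is an assumption that depends on how the `user_name` column is defined.

```python
# The same prefixes the notebook passes to LIKE "<prefix>%".
username_patterns = ("MMiller", "Zilant", "Roan", "KHarlan", "MWang", "SBtest",
                     "Cloud", "Rho2019", "Test")

def matches_known_prefix(user_name, patterns=username_patterns):
    # str.startswith accepts a tuple of prefixes and is case-sensitive;
    # whether LIKE behaves the same on the wiki database is an assumption.
    return user_name.startswith(patterns)

# Invented usernames, for illustration only.
for name in ["Test account 42", "Testudo enthusiast", "Example user"]:
    print(name, matches_known_prefix(name))
```

Here "Testudo enthusiast" is flagged even though it does not look like a test account, so the exclusion list may be slightly larger than intended.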

## Data gathering

This largely reuses what we already had in the previous analysis.

In [6]:
user_activity_query = '''
WITH regs AS (
    SELECT wiki, event.userid AS user_id, dt AS reg_ts, event.displaymobile AS reg_on_mobile
    FROM event_sanitized.serversideaccountcreation
    WHERE year = 2020
    AND month < 4
    AND (
        (wiki = "cswiki"
         AND event.userid NOT IN ({cs_known_list}))
        OR
        (wiki = "kowiki"
         AND event.userid NOT IN ({ko_known_list}))
        OR
        (wiki = "viwiki"
         AND event.userid NOT IN ({vi_known_list}))
        OR
        (wiki = "arwiki"
         AND event.userid NOT IN ({ar_known_list}))
        OR
        (wiki = "huwiki"
          AND event.userid NOT IN ({hu_known_list}))
        OR
        (wiki = "ukwiki"
          AND event.userid NOT IN ({uk_known_list}))
        OR
        (wiki = "hywiki"
          AND event.userid NOT IN ({hy_known_list}))
    )
    AND event.isapi = false
    AND event.isselfmade = 1
),
pref_switch AS (
    SELECT wiki, event.userid AS user_id
    FROM event_sanitized.prefupdate
    WHERE year = 2020
    AND month < 4
    AND (
        (wiki = "cswiki"
         AND event.userid NOT IN ({cs_known_list}))
        OR
        (wiki = "kowiki"
         AND event.userid NOT IN ({ko_known_list}))
        OR
        (wiki = "viwiki"
         AND event.userid NOT IN ({vi_known_list}))
        OR
        (wiki = "arwiki"
         AND event.userid NOT IN ({ar_known_list}))
        OR
        (wiki = "huwiki"
          AND event.userid NOT IN ({hu_known_list}))
        OR
        (wiki = "ukwiki"
          AND event.userid NOT IN ({uk_known_list}))
        OR
        (wiki = "hywiki"
          AND event.userid NOT IN ({hy_known_list}))
    )
    AND event.property = "growthexperiments-homepage-enable"
),
hp_visits AS (
    SELECT wiki, event.user_id, dt AS visit_ts, event.is_mobile AS visit_mobile,
    event.referer_route, ROW_NUMBER() OVER (PARTITION BY wiki, event.user_id ORDER BY dt) AS visit_no
    FROM event_sanitized.homepagevisit
    WHERE year = 2020
    AND month < 4
    AND (
        (wiki = "cswiki"
         AND event.user_id NOT IN ({cs_known_list}))
        OR
        (wiki = "kowiki"
         AND event.user_id NOT IN ({ko_known_list}))
        OR
        (wiki = "viwiki"
         AND event.user_id NOT IN ({vi_known_list}))
        OR
        (wiki = "arwiki"
         AND event.user_id NOT IN ({ar_known_list}))
        OR
        (wiki = "huwiki"
          AND event.user_id NOT IN ({hu_known_list}))
        OR
        (wiki = "ukwiki"
          AND event.user_id NOT IN ({uk_known_list}))
        OR
        (wiki = "hywiki"
          AND event.user_id NOT IN ({hy_known_list}))
    )
)
SELECT regs.wiki, regs.user_id, reg_ts, reg_on_mobile, visit_ts, visit_mobile, referer_route,
    IF(pref_switch.user_id IS NOT NULL, 1, 0) AS pref_switched
FROM regs
LEFT JOIN hp_visits
ON (regs.wiki = hp_visits.wiki AND regs.user_id = hp_visits.user_id)
LEFT JOIN pref_switch
ON (regs.wiki = pref_switch.wiki AND regs.user_id = pref_switch.user_id)
WHERE (visit_no IS NULL
       OR visit_no = 1)
'''

In [7]:
user_activity = spark.run(
    user_activity_query.format(
        cs_known_list = ', '.join([str(u) for u in known_users['cswiki']]),
        ko_known_list = ', '.join([str(u) for u in known_users['kowiki']]),
        vi_known_list = ', '.join([str(u) for u in known_users['viwiki']]),
        ar_known_list = ', '.join([str(u) for u in known_users['arwiki']]),
        hu_known_list = ', '.join([str(u) for u in known_users['huwiki']]),
        uk_known_list = ', '.join([str(u) for u in known_users['ukwiki']]),
        hy_known_list = ', '.join([str(u) for u in known_users['hywiki']])
    )
)
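
Each `{..._known_list}` placeholder is filled with a comma-joined list of user IDs. If one of the sets were ever empty, the clause would become `NOT IN ()`, which SQL parsers generally reject. The prefix lookup above most likely returns at least one ID per wiki, but a guard is cheap. The helper below is a hypothetical sketch, and the `-1` sentinel assumes that real user IDs are always positive.

```python
def format_id_list(user_ids, sentinel=-1):
    # Sort for reproducible query text; fall back to a sentinel that matches
    # no real user so that "NOT IN (...)" is never empty.
    ids = sorted(int(u) for u in user_ids)
    if not ids:
        ids = [sentinel]
    return ', '.join(str(u) for u in ids)

print(format_id_list({433381, 322106}))  # 322106, 433381
print(format_id_list(set()))             # -1
```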

In [8]:
def get_panel_prefs(wikis, prop, include_users = None, exclude_users = None):
    '''
    Query and return a `pandas.DataFrame` with a column `user_id` of all users who have
    the given property turned on in their preferences on the given list of wikis.
    
    :param wikis: the wikis we're querying
    :type wikis: list
    
    :param prop: the user preference we use for determining treatment/control membership
    :type prop: str
    
    :param include_users: user IDs of the users we are interested in
    :type include_users: iterator
    
    :param exclude_users: user IDs of users we are not interested in
    :type exclude_users: iterator
    '''
    
    panel_query = '''
    SELECT DATABASE() AS wiki, up_user AS user_id,
           CAST(up_value AS UNSIGNED INTEGER) AS is_treatment
    FROM user_properties
    WHERE up_property = "{prop}"
    '''.format(prop = prop)
    
    if include_users is not None:
        panel_query += '''
        AND up_user IN ({})
        '''.format(','.join([str(uid) for uid in include_users]))
        
    if exclude_users is not None:
        panel_query += '''
        AND up_user NOT IN ({})
        '''.format(', '.join([str(uid) for uid in exclude_users]))
        
    dfs = list()
    for wiki in wikis:
        dfs.append(mariadb.run(panel_query, wiki))
        
    return(pd.concat(dfs))
    
panel_prefs = get_panel_prefs(wikis, "growthexperiments-homepage-enable")


In [9]:
user_data = user_activity.merge(panel_prefs, how = 'left', on = ['wiki', 'user_id'])

In [10]:
## Exclude all users who switched the Homepage on or off
user_data = user_data.loc[user_data.pref_switched == 0]

In [11]:
## Fill NAs from the left join with preferences to set the control/treatment flag
user_data.loc[user_data.is_treatment.isna(), 'is_treatment'] = 0

In [12]:
user_data['is_treatment'] = user_data['is_treatment'].astype(int)

In [13]:
user_data['reg_dt']  = user_data['reg_ts']
user_data['visit_dt'] = user_data['visit_ts']

In [14]:
## Parse the timestamps
user_data['reg_ts'] = pd.to_datetime(user_data.reg_dt, format = '%Y-%m-%dT%H:%M:%SZ')
user_data['visit_ts'] = pd.to_datetime(user_data.visit_dt, format = '%Y-%m-%dT%H:%M:%SZ')

In [15]:
user_data['visit_diff'] = user_data.visit_ts - user_data.reg_ts

In [16]:
user_data['reg_date'] = user_data.reg_ts.apply(lambda x: x.date())

We have to limit the data to end on March 29 so that all users have 48 hours to visit the Homepage.

In [41]:
user_data = user_data.loc[user_data['reg_date'] <= dt.date(2020, 3, 29)]
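
The March 29 cutoff follows from the end of the data (`end_ts`, April 1 at midnight) and the 48-hour window. A small sketch of that arithmetic, with a hypothetical helper name:

```python
import datetime as dt

def last_full_reg_date(end_ts, window=dt.timedelta(hours=48)):
    # A user registering at the very end of date D needs data up to
    # (D + 1 day) + window, so D can be at most end_ts - window - 1 day.
    return (end_ts - window - dt.timedelta(days=1)).date()

end_ts = dt.datetime(2020, 4, 1, 0, 0, 0)
print(last_full_reg_date(end_ts))  # 2020-03-29
```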

## Switching platforms

Of the users who visited the Homepage, how many switched platforms between registering and making their first visit?

In [42]:
len(user_data.loc[(user_data['is_treatment'] == 1) &
                  (~user_data['visit_ts'].isna()) &
                  (user_data['reg_on_mobile'] != user_data['visit_mobile'])])

314

In [43]:
len(user_data.loc[(user_data['is_treatment'] == 1) & (~user_data['visit_ts'].isna())])

23962

What proportion is that?

In [44]:
round(100 * len(user_data.loc[(user_data['is_treatment'] == 1) &
                              (~user_data['visit_ts'].isna()) &
                      (user_data['reg_on_mobile'] != user_data['visit_mobile'])]) /
      len(user_data.loc[(user_data['is_treatment'] == 1) & (~user_data['visit_ts'].isna())]), 2)

1.31

This means that 98.7% of treatment-group users who visited the Homepage made their first visit on the same platform they registered on. Because the share who switched platforms (1.31%) is so low, I'd like to exclude those users from the analysis, as that allows us to easily aggregate by platform.

## Aggregate and plot

In [45]:
def homepage_visits(group):
    aggs = {
        'n_reg' : len(group),
        'n_visits' : len(group.loc[(~group['visit_ts'].isna()) &
                                   (group['visit_diff'] < dt.timedelta(days = 2))])
    }
    return(pd.Series(aggs, index = aggs.keys()))

## We exclude users who don't have the Homepage enabled (i.e. the control group),
## then include users who didn't visit the Homepage, or users who visited it
## on the same platform as they registered on.
user_data_agg = (user_data.loc[(user_data['is_treatment'] == 1) &
                               ((user_data['visit_ts'].isna()) |
                                ((~user_data['visit_ts'].isna()) &
                                 (user_data['reg_on_mobile'] == user_data['visit_mobile'])))]
                 .groupby(['wiki', 'reg_on_mobile', 'reg_date']).apply(homepage_visits))

In [46]:
user_data_agg = user_data_agg.reset_index()
user_data_agg['reg_date'] = user_data_agg.reg_date.apply(lambda x: x.strftime('%Y-%m-%d'))

In [47]:
%%R -i user_data_agg

user_data_agg = data.table(user_data_agg)
user_data_agg[, reg_date := as.Date(reg_date)]
user_data_agg[, visit_prop := 100 * n_visits / n_reg]

user_data_agg[
  , visit_prop.ma7 := rollapply(visit_prop,
                               width=7,
                               FUN=mean,
                               na.rm=TRUE,
                               fill=0,
                               align='right'),
    by = c('wiki', 'reg_on_mobile')];

user_data_agg$platform = 'Desktop'
user_data_agg[reg_on_mobile == TRUE, platform := 'Mobile']
user_data_agg$platform = factor(user_data_agg$platform)

for (wiki_code in unique(user_data_agg$wiki)) {
    g = ggplot(user_data_agg[wiki == wiki_code], aes(x = reg_date, y = visit_prop)) +
    theme_bw() +
    facet_grid(platform ~ .) +
    scale_x_date("Registration date", limits = c(as.Date('2020-01-01'), as.Date('2020-04-01'))) +
    scale_y_continuous("Percentage (w/7-day moving average in red)", limits = c(0, 100),
                       breaks = c(0:10) * 10, minor_breaks = c(1:20) * 5) +
    ggtitle(paste(wiki_code, ": Homepage first visit percentage Q1 2020")) +
    geom_line() +
    geom_line(aes(y = visit_prop.ma7), color = 'red')

    ggsave(plot = g, file = paste0("graphs/homepage_discovery_rate_Q1_2020_", wiki_code, ".png"),
           width = 8, height = 6, units = "in", dpi = "retina")
}; rm(wiki_code)
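
The `rollapply` call computes a right-aligned (trailing) mean over 7 rows and pads the positions that lack a full window with `fill = 0`. It works on rows rather than calendar days, so it is a true 7-day average only if every wiki/platform group has one row per date. The sketch below shows the same idea in plain Python. The function name and sample values are made up, and it does not replicate `na.rm`.

```python
def trailing_mean(values, width=7, fill=0.0):
    # Mean of the current value and the (width - 1) values before it;
    # positions without a full window get the fill value instead.
    out = []
    for i in range(len(values)):
        if i + 1 < width:
            out.append(fill)
        else:
            window = values[i + 1 - width : i + 1]
            out.append(sum(window) / width)
    return out

daily = [60.0, 62.0, 58.0, 64.0, 66.0, 61.0, 63.0, 70.0]
print([round(v, 1) for v in trailing_mean(daily)])
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 62.0, 63.4]
```

Because of the zero padding, the moving-average line sits at 0 for the first six rows of each series in the plots; that is padding, not a real dip in visits.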


## Proportion who visited in March

We have full data for February and March for all wikis, while for some wikis we don't have all of January. We therefore aggregate by wiki, platform, and month, and calculate the proportion of users who visited the Homepage within 48 hours of registration. We limit the final output to March for three reasons: it keeps the reporting table easy to read, the proportions are often similar between months, and March is the most recent month we have data for.

In [48]:
## Add a "reg_month" variable
user_data['reg_month'] = user_data['reg_date'].apply(lambda x: x.replace(day = 1))

In [49]:
## We exclude users who don't have the Homepage enabled (i.e. the control group),
## then include users who didn't visit the Homepage, or users who visited it
## on the same platform as they registered on.
user_data_agg_month = (user_data.loc[(user_data['is_treatment'] == 1) &
                               ((user_data['visit_ts'].isna()) |
                                ((~user_data['visit_ts'].isna()) &
                                 (user_data['reg_on_mobile'] == user_data['visit_mobile'])))]
                 .groupby(['wiki', 'reg_on_mobile', 'reg_month']).apply(homepage_visits))

In [50]:
user_data_agg_month = user_data_agg_month.reset_index()

In [51]:
user_data_agg_month['visit_prop'] = 100 * user_data_agg_month['n_visits'] / user_data_agg_month['n_reg']

In [52]:
user_data_agg_month.loc[user_data_agg_month['reg_month'] != dt.date(2020, 1, 1)]

|    | wiki   | reg_on_mobile | reg_month  | n_reg | n_visits | visit_prop |
|----|--------|---------------|------------|-------|----------|------------|
| 1  | arwiki | False         | 2020-02-01 | 1984  | 1459     | 73.538306  |
| 2  | arwiki | False         | 2020-03-01 | 2242  | 1578     | 70.383586  |
| 4  | arwiki | True          | 2020-02-01 | 4230  | 2801     | 66.217494  |
| 5  | arwiki | True          | 2020-03-01 | 4308  | 2852     | 66.202414  |
| 7  | cswiki | False         | 2020-02-01 | 753   | 502      | 66.666667  |
| 8  | cswiki | False         | 2020-03-01 | 780   | 548      | 70.25641   |
| 10 | cswiki | True          | 2020-02-01 | 362   | 218      | 60.220994  |
| 11 | cswiki | True          | 2020-03-01 | 344   | 203      | 59.011628  |
| 13 | huwiki | False         | 2020-02-01 | 442   | 225      | 50.904977  |
| 14 | huwiki | False         | 2020-03-01 | 478   | 233      | 48.74477   |


In [58]:
## Select just the month of March, then drop the month column.
print(
    tabulate.tabulate(
        user_data_agg_month.loc[user_data_agg_month['reg_month'] == dt.date(2020, 3, 1)].replace({
            'arwiki' : 'Arabic',
            'cswiki' : 'Czech',
            'huwiki' : 'Hungarian',
            'hywiki' : 'Armenian',
            'kowiki' : 'Korean',
            'ukwiki' : 'Ukrainian',
            'viwiki' : 'Vietnamese',
            False : 'Desktop',
            True : 'Mobile'
        }).rename(columns = {
            'wiki' : 'Wiki',
            'reg_on_mobile' : 'Platform',
            'n_reg' : 'N Regs',
            'n_visits' : 'N Visits',
            'visit_prop' : 'Visit %'
        }).drop(columns = 'reg_month'),
        tablefmt = "github", headers = "keys", showindex = None,
        floatfmt = ["", "", "", "", ".1f"])
)

| Wiki       | Platform   |   N Regs |   N Visits |   Visit % |
|------------|------------|----------|------------|-----------|
| Arabic     | Desktop    |     2242 |       1578 |      70.4 |
| Arabic     | Mobile     |     4308 |       2852 |      66.2 |
| Czech      | Desktop    |      780 |        548 |      70.3 |
| Czech      | Mobile     |      344 |        203 |      59.0 |
| Hungarian  | Desktop    |      478 |        233 |      48.7 |
| Hungarian  | Mobile     |      263 |         88 |      33.5 |
| Armenian   | Desktop    |      100 |         65 |      65.0 |
| Armenian   | Mobile     |       67 |         33 |      49.3 |
| Korean     | Desktop    |     1078 |        532 |      49.4 |
| Korean     | Mobile     |      797 |        343 |      43.0 |
| Ukrainian  | Desktop    |      706 |        409 |      57.9 |
| Ukrainian  | Mobile     |      394 |        158 |      40.1 |
| Vietnamese | Desktop    |     1543 |        682 |      44.2 |
| Vietnamese | Mobile     |     1123 |  

In [54]:
## Do we have any control group users on Arabic?
len(user_data.loc[(user_data['wiki'] == 'arwiki') &
                  (user_data['is_treatment'] == 0) &
                  (user_data['reg_month'] == dt.date(2020, 3, 1))])

1693

WikiStats lists 9,366 registrations in February. Some of those will be app registrations, which we discard. 1,693 control group users for March is in the ballpark of 20% of that total (given that we dropped the last 2 days).
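
The "ballpark of 20%" can be checked with quick arithmetic. Like the paragraph above, this compares the March control-group count with the February WikiStats total, so it is only a rough sanity check. The variable names are made up.

```python
control_march = 1693   # control-group registrations counted in the cell above
wikistats_feb = 9366   # WikiStats registrations for February, as quoted

share = 100 * control_march / wikistats_feb
print(round(share, 1))  # 18.1
```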

## Visit routes split by wiki and platform

In this case, our denominator is users who visited the Homepage, since we're interested in learning the proportion who took each of the possible routes on their first visit. We'll again count this for the month of March as that is the most recent month of data we have available.

In [55]:
visit_paths = (user_data.loc[(user_data['is_treatment'] == 1) &
                             (~user_data['visit_ts'].isna()) &
                             (user_data['reg_month'] == dt.date(2020, 3, 1))]
               .groupby(['wiki', 'visit_mobile', 'referer_route'])
               .agg({'user_id' : 'count'})
               .reset_index()
               .rename(columns = {'user_id' : 'n'}))
visit_paths['percent'] = (100 * visit_paths['n'] /
                          visit_paths.groupby(['wiki', 'visit_mobile'])['n'].transform('sum'))
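
The `transform('sum')` step divides each route's count by the total for its wiki and platform, so the percentages within each wiki/platform group add up to 100. As a plain-Python sketch of the same calculation, here are the Arabic desktop counts copied from the output table further down (the variable names are made up):

```python
# First-visit counts per route for Arabic Wikipedia on desktop.
arabic_desktop = {
    'specialwelcomesurvey': 1012,
    'personaltoolslink': 367,
    'specialconfirmemail': 180,
    'specialcontributions': 26,
    'usertalkpagetab': 19,
    'userpagetab': 18,
}

total = sum(arabic_desktop.values())
percent = {route: round(100 * n / total, 1) for route, n in arabic_desktop.items()}

print(total)                            # 1622
print(percent['specialwelcomesurvey'])  # 62.4
print(percent['personaltoolslink'])     # 22.6
```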

In [56]:
visit_paths['platform'] = visit_paths['visit_mobile'].apply(lambda x: 'Mobile' if x else 'Desktop')

In [57]:
print(
    tabulate.tabulate(
        visit_paths.sort_values(['wiki', 'platform', 'n'], ascending = [True, True, False])
        .drop(columns = 'visit_mobile')
        .replace(
            {'arwiki' : 'Arabic',
            'cswiki' : 'Czech',
            'huwiki' : 'Hungarian',
            'hywiki' : 'Armenian',
            'kowiki' : 'Korean',
            'ukwiki' : 'Ukrainian',
            'viwiki' : 'Vietnamese'}
        ).rename(columns = {
            'wiki' : 'Wiki',
            'platform' : 'Platform',
            'referer_route' : 'Route to Homepage',
            'n' : 'N Visitors',
            'percent' : 'Percent'

        })[['Wiki', 'Platform', 'Route to Homepage', 'N Visitors', 'Percent']],
        tablefmt = "github", headers = "keys", showindex = None, floatfmt = ["", "", "", "", ".1f"]))

| Wiki       | Platform   | Route to Homepage    |   N Visitors |   Percent |
|------------|------------|----------------------|--------------|-----------|
| Arabic     | Desktop    | specialwelcomesurvey |         1012 |      62.4 |
| Arabic     | Desktop    | personaltoolslink    |          367 |      22.6 |
| Arabic     | Desktop    | specialconfirmemail  |          180 |      11.1 |
| Arabic     | Desktop    | specialcontributions |           26 |       1.6 |
| Arabic     | Desktop    | usertalkpagetab      |           19 |       1.2 |
| Arabic     | Desktop    | userpagetab          |           18 |       1.1 |
| Arabic     | Mobile     | specialwelcomesurvey |         1849 |      64.0 |
| Arabic     | Mobile     | personaltoolslink    |          655 |      22.7 |
| Arabic     | Mobile     | specialconfirmemail  |          176 |       6.1 |
| Arabic     | Mobile     | specialcontributions |          132 |       4.6 |
| Arabic     | Mobile     | userpagetab          |           49 