# Interacting with the Newcomer Task Module

What percent of visitors to the Homepage interact with the Newcomer Task module?

Per the conversation in [the Phabricator task](https://phabricator.wikimedia.org/T264831#6658203), interacting means any of the following: using the topic or difficulty filter buttons, navigating cards with the arrows, hovering on the "i", or selecting a task.

This definition is the same on desktop and mobile. To apply it, we need to ignore the onboarding overlays for Variant C and the module initialization steps for Variant D. This gives us the following approaches:

* Variant C: The overlays shown are onboarding overlays that don't allow for any interaction, and they're all logged as `se-cta-*` events. In other words, interacting with the topic or difficulty filters happens after the onboarding screen. So for Variant C, we count all interaction events.
* Variant D: We grab the timestamp of the user's `se-activate` event or their first task impression (or pseudo-impression), whichever comes first. That timestamp marks when the clock starts for counting interaction events.
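
The Variant D rule can be sketched in pandas on a toy event log. This is only an illustration of the logic, assuming hypothetical column names (`user_id`, `action`, `dt`) and made-up events; the analysis below does the equivalent in SQL against `event_sanitized.homepagemodule`.

```python
import pandas as pd

# Actions that mark the module as initialized for Variant D
INIT_ACTIONS = {"se-activate", "se-task-impression", "se-task-pseudo-impression"}
# A subset of the interaction actions, kept short for the example
INTERACTION_ACTIONS = {"se-taskfilter-open", "se-topicfilter-open",
                       "se-task-navigation", "se-task-click", "se-explanation-open"}

# Made-up event log (hypothetical data, not from the experiment)
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "action": ["se-topicfilter-open", "se-activate", "se-task-click",
               "se-topicfilter-open", "se-activate", "se-task-click"],
    "dt": pd.to_datetime([
        "2020-11-01T10:00:00", "2020-11-01T10:01:00", "2020-11-01T10:02:00",
        "2020-11-01T11:00:00", "2020-11-01T11:05:00", "2020-11-01T12:00:00",
    ]),
})

def variant_d_interactors(events):
    """Return the set of user IDs with an interaction after initialization."""
    # Earliest initialization event per user: this is when the clock starts
    init = (events.loc[events["action"].isin(INIT_ACTIONS)]
            .groupby("user_id")["dt"].min()
            .rename("init_dt")
            .reset_index())
    # The inner join drops users who never initialized the module
    merged = events.merge(init, on="user_id", how="inner")
    after_init = merged.loc[merged["action"].isin(INTERACTION_ACTIONS) &
                            (merged["dt"] > merged["init_dt"])]
    return set(after_init["user_id"].tolist())

interactors = variant_d_interactors(events)
```

In this toy log only user 1 counts: user 2's only interaction event precedes initialization, and user 3 never initialized the module.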

In [1]:
import json
import datetime as dt

from collections import defaultdict

import numpy as np
import pandas as pd

from wmfdata import hive, spark, mariadb
from growth import utils

In [15]:
## Start timestamp of the Variant C/D experiment, which is when the edit tag bug fix went into place
exp_start_ts = dt.datetime(2020, 10, 28, 18, 40, 2)

## Ordered list of wikis that we'll be gathering data for
## Note that we're excluding euwiki due to their small number of registrations
wikis = ['cswiki', 'kowiki', 'viwiki', 'arwiki', 'ukwiki', 'huwiki', 'srwiki', 'hywiki',
         'frwiki', 'fawiki', 'hewiki', 'ruwiki', 'plwiki', 'ptwiki', 'svwiki', 'trwiki']

## The mediawiki_history snapshot we'll be working with
wmf_snapshot = '2020-11'

## The canonical user table that we'll join mediawiki_history with
canonical_user_table = 'nettrom_growth.hp_variant_test2'

## Where we write out the resulting dataset for further analysis
interaction_counts_output_filename = 'datasets/variant-test-2-interaction-counts.tsv'

## Interaction Data

In [22]:
interaction_query = '''
WITH hp_visits AS( -- visits within 24 hours of registration
    SELECT users.wiki_db, users.user_id,
    1 AS visited_homepage
    FROM {exp_user_table} users
    JOIN event_sanitized.homepagemodule hpm
    ON (users.wiki_db = hpm.wiki
        AND users.user_id = hpm.event.user_id)
    WHERE hpm.year = 2020
    AND hpm.month IN (10, 11)
    AND (unix_timestamp(hpm.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
         unix_timestamp(users.user_registration, "yyyyMMddHHmmss") < 86400) -- within 24 hours of registration
    GROUP BY users.wiki_db, users.user_id
),
var_c AS ( -- user in Variant C group and interacted within 24 hours of registration
    SELECT users.wiki_db, users.user_id, 1 AS did_interact
    FROM {exp_user_table} users
    JOIN event_sanitized.homepagemodule hpm
    ON (users.wiki_db = hpm.wiki
        AND users.user_id = hpm.event.user_id)
    WHERE hpm.year = 2020
    AND hpm.month IN (10, 11)
    AND (unix_timestamp(hpm.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
         unix_timestamp(users.user_registration, "yyyyMMddHHmmss") < 86400) -- within 24 hours of registration
    AND users.hp_variant = "C" -- Variant C
    AND event.action IN ( -- the types of interaction that we count
        "se-taskfilter-open", "se-taskfilter-done", "se-taskfilter-cancel",
        "se-topicfilter-open", "se-topicfilter-select-all", "se-topicfilter-remove-all",
        "se-topicfilter-done", "se-topicfilter-cancel",
        "se-task-navigation", "se-task-click",
        "se-explanation-open", "se-explanation-close", "se-explanation-link-click"
    )
    GROUP BY users.wiki_db, users.user_id
),
var_d AS ( -- user in Variant D, activated the module and interacted within 24 hours of registration
    SELECT hpm.wiki AS wiki_db, hpm.event.user_id, 1 AS did_interact
    FROM event_sanitized.homepagemodule hpm
    JOIN (
        SELECT users.wiki_db, users.user_id, first_value(users.user_registration) AS user_registration,
               MIN(hpm2.dt) AS init_dt
        FROM {exp_user_table} users
        JOIN event_sanitized.homepagemodule hpm2
        ON (users.wiki_db = hpm2.wiki
            AND users.user_id = hpm2.event.user_id)
        WHERE hpm2.year = 2020
        AND hpm2.month IN (10, 11)
        AND (unix_timestamp(hpm2.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
             unix_timestamp(users.user_registration, "yyyyMMddHHmmss") < 86400) -- w/i 24 hours
        AND users.hp_variant = "D" -- Variant D
        AND hpm2.event.action IN ("se-activate", "se-task-impression", "se-task-pseudo-impression")
        GROUP BY users.wiki_db, users.user_id
    ) AS u
    ON (hpm.wiki = u.wiki_db
        AND hpm.event.user_id = u.user_id)
    WHERE hpm.year = 2020
    AND hpm.month IN (10, 11)
    AND (unix_timestamp(hpm.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
         unix_timestamp(u.user_registration, "yyyyMMddHHmmss") < 86400) -- w/i 24 hours of registration
    AND hpm.dt > u.init_dt -- after module initialization
    AND hpm.event.action IN ( -- the types of interaction that we count
        "se-taskfilter-open", "se-taskfilter-done", "se-taskfilter-cancel",
        "se-topicfilter-open", "se-topicfilter-select-all", "se-topicfilter-remove-all",
        "se-topicfilter-done", "se-topicfilter-cancel",
        "se-task-navigation", "se-task-click",
        "se-explanation-open", "se-explanation-close", "se-explanation-link-click"
    )
    GROUP BY hpm.wiki, hpm.event.user_id
)
SELECT u.wiki_db, u.user_id, u.user_registration, u.reg_on_mobile, u.hp_enabled, u.hp_variant,
       COALESCE(hp_visits.visited_homepage, 0) AS visited_homepage,
       COALESCE(int.did_interact, 0) AS did_interact
FROM {exp_user_table} AS u
LEFT JOIN hp_visits
ON u.wiki_db = hp_visits.wiki_db
AND u.user_id = hp_visits.user_id
LEFT JOIN (
    SELECT *
    FROM var_c
    UNION ALL
    SELECT *
    FROM var_d
) AS int
ON (hp_visits.wiki_db = int.wiki_db
    AND hp_visits.user_id = int.user_id)
'''
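
The 24-hour window in the query compares two differently formatted timestamps: the ISO 8601 `dt` field of the event and the 14-digit registration timestamp. As a rough Python illustration of the same check (the helper name `within_24_hours` and the example values are made up for this sketch):

```python
import datetime as dt

def within_24_hours(event_dt, user_registration):
    """Check whether an event happened less than 86400 seconds after registration.

    Mirrors the SQL condition; like the query, it sets no lower bound.
    """
    event_ts = dt.datetime.strptime(event_dt, "%Y-%m-%dT%H:%M:%SZ")
    reg_ts = dt.datetime.strptime(user_registration, "%Y%m%d%H%M%S")
    return (event_ts - reg_ts).total_seconds() < 86400

# One second before the 24-hour mark, and exactly at it
example_inside = within_24_hours("2020-11-02T18:40:01Z", "20201101184002")
example_outside = within_24_hours("2020-11-02T18:40:02Z", "20201101184002")
```

Because the comparison is strict, an event exactly 86400 seconds after registration falls outside the window.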

In [23]:
interaction_data = spark.run(
    interaction_query.format(
        exp_user_table = canonical_user_table
    )
)

Verify the number of users in the dataset, and check that the number of Homepage visitors matches the number in the task navigation and click dataset.

In [24]:
len(interaction_data)

67599

In [27]:
len(interaction_data.loc[(interaction_data['hp_enabled'] == 1) &
                         (interaction_data['hp_variant'].isin(['C', 'D'])) &
                         (interaction_data['visited_homepage'] == 1)])

32615

That is exactly the number of visitors we have in the task navigation and task clicks dataset, so we can move ahead with analyzing this dataset.
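
Since the final query relies on `LEFT JOIN`s, a further check worth considering is that each user appears only once in the result. A minimal sketch on made-up rows, assuming the pair (`wiki_db`, `user_id`) should uniquely identify a user:

```python
import pandas as pd

# Made-up rows standing in for interaction_data
toy = pd.DataFrame({
    "wiki_db": ["cswiki", "cswiki", "kowiki"],
    "user_id": [1, 2, 1],
    "did_interact": [0, 1, 1],
})

# True for every row whose (wiki_db, user_id) pair occurs more than once
dupes = toy.duplicated(subset=["wiki_db", "user_id"], keep=False)
n_duplicate_rows = int(dupes.sum())
```

A non-zero count would suggest that one of the joins fanned out and that the row counts above overstate the number of users.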

In [None]:
interaction_data.loc[interaction_data['visited_homepage'] > 0].head()

In [None]:
interaction_data.loc[interaction_data['did_interact'] > 0].head()
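
To connect this back to the opening question, the share of Homepage visitors who interacted can be computed per variant. The sketch below uses made-up rows with the same column names as `interaction_data`; the numbers are illustrative only and say nothing about the experiment's results.

```python
import pandas as pd

# Made-up rows with the same columns as interaction_data
toy = pd.DataFrame({
    "hp_enabled": [1, 1, 1, 1, 1, 0],
    "hp_variant": ["C", "C", "D", "D", "D", "C"],
    "visited_homepage": [1, 1, 1, 1, 0, 0],
    "did_interact": [1, 0, 1, 1, 0, 0],
})

# Same visitor filter as used for the count check
visitors = toy.loc[(toy["hp_enabled"] == 1) &
                   (toy["hp_variant"].isin(["C", "D"])) &
                   (toy["visited_homepage"] == 1)]

# The mean of a 0/1 column is the proportion of visitors who interacted
pct_interacted = visitors.groupby("hp_variant")["did_interact"].mean() * 100
```

Applied to the real data, the same `groupby` could be extended with `wiki_db` or `reg_on_mobile` to break the rate down further.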

Write out the dataset for importing into R for further analysis.

In [30]:
interaction_data.to_csv(interaction_counts_output_filename,
                        header = True, index = False, sep = '\t')