# Number of Edits

What percent of newcomers save a suggested edit?

In this case, we'll use `mediawiki_history`, the processed version of the replicated databases, as our source of truth for edit data. We could also use `mediawiki_revision_tags_change`, as we did in the Variant A/B analysis, but since `mediawiki_history` is available to us here, I prefer it.

In [1]:
import json
import datetime as dt

from collections import defaultdict

import numpy as np
import pandas as pd

from wmfdata import hive, spark, mariadb
from growth import utils

In [14]:
## Start timestamp of the Variant C/D experiment, which is when the edit tag bug fix went into place
exp_start_ts = dt.datetime(2020, 10, 28, 18, 40, 2)

## Ordered list of wikis that we'll be gathering data for
## Note that we're excluding euwiki due to its small number of registrations
wikis = ['cswiki', 'kowiki', 'viwiki', 'arwiki', 'ukwiki', 'huwiki', 'srwiki', 'hywiki',
         'frwiki', 'fawiki', 'hewiki', 'ruwiki', 'plwiki', 'ptwiki', 'svwiki', 'trwiki']

## The mediawiki_history snapshot we'll be working with
wmf_snapshot = '2020-11'

## The canonical user table that we'll join mediawiki_history with
canonical_user_table = 'nettrom_growth.hp_variant_test2'

## Where we write out the resulting dataset for further analysis
edit_data_output_filename = 'datasets/variant-test-2-edit-data.tsv'

## Edit Data

In [7]:
edit_data_query = '''
WITH edits AS (
    SELECT wiki_db, event_user_id AS user_id, SUM(1) AS num_article_edits,
    SUM(IF(array_contains(revision_tags, "newcomer task"), 1, 0)) AS num_suggested_edits
    FROM wmf.mediawiki_history
    WHERE snapshot = "{snapshot}"
    AND event_entity = "revision"
    AND event_type = "create"
    AND wiki_db IN ({wiki_list})
    AND event_timestamp > "{start_date}"
    -- only article edits
    AND page_namespace = 0
    -- within 24 hours of registration
    AND unix_timestamp(event_timestamp) - unix_timestamp(event_user_creation_timestamp) < 86400
    GROUP BY wiki_db, event_user_id
),
users AS (
    SELECT wiki_db, user_id, user_registration, reg_on_mobile, hp_enabled, hp_variant
    FROM {exp_user_table}
)
SELECT users.wiki_db, users.user_id, users.user_registration, reg_on_mobile, hp_enabled, hp_variant,
    COALESCE(num_article_edits, 0) AS num_article_edits,
    COALESCE(num_suggested_edits, 0) AS num_suggested_edits
FROM users
LEFT JOIN edits
ON users.wiki_db = edits.wiki_db
AND users.user_id = edits.user_id
'''
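
The `LEFT JOIN` combined with `COALESCE` is what ensures that users without any qualifying edits still show up in the result, with zero counts. Here's a minimal pandas sketch of the same idea. The data frames and their values are made up for illustration; only the column names mirror the query.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the query, values are invented.
users = pd.DataFrame({
    'wiki_db': ['cswiki', 'cswiki', 'kowiki'],
    'user_id': [1, 2, 1],
    'hp_variant': ['C', 'D', 'C'],
})
edits = pd.DataFrame({
    'wiki_db': ['cswiki'],
    'user_id': [1],
    'num_article_edits': [3],
    'num_suggested_edits': [1],
})

# LEFT JOIN: keep every user, matching on both wiki_db and user_id,
# since user IDs are only unique within a wiki.
joined = users.merge(edits, how='left', on=['wiki_db', 'user_id'])

# COALESCE(..., 0): users without a matching edit row get zero counts.
count_cols = ['num_article_edits', 'num_suggested_edits']
joined[count_cols] = joined[count_cols].fillna(0).astype(int)

print(joined)
```

In this toy example, user 1 on kowiki gets zero edits even though user 1 on cswiki has three, which is why the join uses both keys.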

In [8]:
all_users_edit_data = spark.run(
    edit_data_query.format(
        snapshot = wmf_snapshot,
        wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
        start_date = exp_start_ts.date().isoformat(),
        exp_user_table = canonical_user_table
    )
)
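
To see what actually gets substituted into the query, we can render the two computed parameters on their own. This is a small sketch that reuses the expressions from the cell above; the wiki list is shortened for illustration.

```python
import datetime as dt

# Same start timestamp as in the configuration cell
exp_start_ts = dt.datetime(2020, 10, 28, 18, 40, 2)

# Shortened list for illustration; the notebook uses the full list of wikis
wikis = ['cswiki', 'kowiki', 'viwiki']

# Each wiki name is wrapped in double quotes and joined with commas,
# ready to be placed inside the IN (...) clause
wiki_list = ','.join(['"{}"'.format(w) for w in wikis])

# Only the date part of the timestamp is passed to the query
start_date = exp_start_ts.date().isoformat()

print(wiki_list)
print(start_date)
```

Note that the rendered `start_date` carries no time of day, so it is a coarser lower bound than `exp_start_ts` itself.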

In [9]:
len(all_users_edit_data)

67599

In [None]:
all_users_edit_data.loc[all_users_edit_data['num_article_edits'] > 0].head()

In [None]:
all_users_edit_data.loc[all_users_edit_data['num_suggested_edits'] > 0].head()

In [None]:
all_users_edit_data.loc[(all_users_edit_data['hp_enabled'] == 0) &
                        (all_users_edit_data['num_article_edits'] > 0)].head()
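
As a quick sanity check before moving to R, the question posed at the top could be answered directly in pandas. The sketch below uses a made-up data frame in place of `all_users_edit_data`, and assumes that `hp_enabled == 1` marks users who had the Homepage turned on; it is not the analysis itself.

```python
import pandas as pd

# Hypothetical stand-in for all_users_edit_data; values are invented.
toy = pd.DataFrame({
    'hp_enabled': [1, 1, 1, 1, 0],
    'hp_variant': ['C', 'C', 'D', 'D', 'C'],
    'num_suggested_edits': [2, 0, 0, 0, 0],
})

# Flag users who saved at least one suggested edit
toy['saved_suggested_edit'] = toy['num_suggested_edits'] > 0

# Percent of users with hp_enabled == 1 who saved a suggested edit, per variant
# (the mean of a boolean column is the proportion of True values)
pct_by_variant = (
    toy.loc[toy['hp_enabled'] == 1]
       .groupby('hp_variant')['saved_suggested_edit']
       .mean() * 100
)

print(pct_by_variant)
```

Replacing `toy` with `all_users_edit_data` should give the corresponding percentages for the real dataset, assuming the columns have the types shown here.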

Write out the canonical edit dataset for importing into R.

In [15]:
all_users_edit_data.to_csv(edit_data_output_filename,
                           header = True, index = False, sep = '\t')