# Experiment Duration

How long do we need to run this experiment?

## Time since registration for first tagged edit?

Let's use `event.mediawiki_revision_tags_change` to get an estimate of time since registration for users who registered between 2020-07-01 and 2020-09-01 (when the "newcomer task" edit tag bug wasn't in effect). We'll limit it to edits within a week after registration.

In [1]:
import datetime as dt

import numpy as np
import pandas as pd

from wmfdata import spark, mariadb
from growth import utils

In [55]:
first_edit_query = '''
SELECT `database` AS wiki_db, performer.user_id,
  MIN(unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ssX") -
      unix_timestamp(performer.user_registration_dt, "yyyy-MM-dd'T'HH:mm:ssX"))
      AS seconds_to_first_tagged_edit
FROM event.mediawiki_revision_tags_change
WHERE year = 2020
AND month BETWEEN 7 AND 9
AND `database` IN ('cswiki', 'kowiki', 'viwiki', 'arwiki' 'ukwiki', 'hywiki', 'srwiki',
                   'frwiki', 'fawiki', 'hewiki', 'ruwiki')
AND TO_DATE(performer.user_registration_dt) >= "2020-07-01"
AND TO_DATE(performer.user_registration_dt) < "2020-09-01"
AND performer.user_is_bot = false
-- edit within 7 days of registration
AND (unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ssX") -
     unix_timestamp(performer.user_registration_dt, "yyyy-MM-dd'T'HH:mm:ssX") < 60*60*24*7)
AND array_contains(tags, "newcomer task")
GROUP BY `database`, performer.user_id
'''

In [56]:
first_edit_data = spark.run(first_edit_query)

In [57]:
first_edit_data['minutes_to_first_tagged_edit'] = first_edit_data['seconds_to_first_tagged_edit'] / 60

In [67]:
first_edit_data['minutes_to_first_tagged_edit'].describe()

count      545.000000
mean       505.431315
std       1389.147376
min          2.083333
25%         10.050000
50%         20.483333
75%         97.083333
max      10039.900000
Name: minutes_to_first_tagged_edit, dtype: float64

Median is 20.5mins, 75th percentile is 1hr 37mins.

In [62]:
first_edit_data['hours_to_first_tagged_edit'] = first_edit_data['minutes_to_first_tagged_edit'] / 60

In [63]:
first_edit_data['hours_to_first_tagged_edit'].describe()

count    545.000000
mean       8.423855
std       23.152456
min        0.034722
25%        0.167500
50%        0.341389
75%        1.618056
max      167.331667
Name: hours_to_first_tagged_edit, dtype: float64

In [64]:
first_edit_data['hours_to_first_tagged_edit'].quantile(0.9)

24.39238888888889

In [65]:
first_edit_data['hours_to_first_tagged_edit'].quantile(0.95)

53.82977777777749

In [66]:
first_edit_data['hours_to_first_tagged_edit'].quantile(0.99)

118.00958888888886

90th percentile is slightly more than 24 hours, 95th percentile is 53.8 hours, 99th is 118 hours, or just below 5 days.

## Number of registrations

How many users register on those wikis in four weeks? (We're ignoring test accounts for now, that's fine)

In [43]:
num_registrations_query = '''
SELECT wiki, count(*) AS num_registrations, count(*) * 0.8 AS est_homepage_group_size
FROM event.serversideaccountcreation
WHERE year = 2020
AND month = 10
AND TO_DATE(dt) >= '2020-10-01'
AND TO_DATE(dt) < '2020-10-29' -- four weeks
AND wiki IN ('cswiki', 'kowiki', 'viwiki', 'arwiki' 'ukwiki', 'hywiki', 'srwiki',
             'frwiki', 'fawiki', 'hewiki', 'ruwiki')
AND event.isselfmade = true
AND event.isapi = false
GROUP BY wiki
'''

In [44]:
registrations_by_wiki = spark.run(num_registrations_query)

In [45]:
registrations_by_wiki

Unnamed: 0,wiki,num_registrations,est_homepage_group_size
0,viwiki,2884,2307.2
1,hewiki,1519,1215.2
2,hywiki,201,160.8
3,frwiki,12766,10212.8
4,kowiki,1473,1178.4
5,cswiki,1110,888.0
6,srwiki,388,310.4
7,fawiki,5891,4712.8
8,ruwiki,9494,7595.2


In [46]:
registrations_by_wiki['est_homepage_group_size'].sum()

Decimal('28580.8')