In [13]:
import pandas as pd
import wmfdata as wmf

We've found a discrepancy between the monthly editor counts reported by Wikistats and those reported by the wiki comparison tool. More information is in [T293660](https://phabricator.wikimedia.org/T293660).

## Wikistats values

For this investigation, I'll look at the average monthly editor count during 2020 at various wikis. Here are the Wikistats and wiki comparison values, showing the discrepancy.

In [14]:
monthly_editors = pd.DataFrame({
    "wiki": ["enwiki", "eswiki", "commonswiki", "kowiki"],
    "wikistats": [131860, 17157, 34702, 2436],
    "wiki_comparison": [139341, 19042, 41917, 2516]
})

In [15]:
monthly_editors

Unnamed: 0,wiki,wikistats,wiki_comparison
0,enwiki,131860,139341
1,eswiki,17157,19042
2,commonswiki,34702,41917
3,kowiki,2436,2516
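
To put a size on the discrepancy, here is a minimal sketch that expresses the wiki comparison value as a relative difference from the Wikistats value. It uses only the standard library so it can be run outside the notebook; the numbers are copied from the table above, and the helper name `relative_gap` is just an illustrative choice.

```python
# Values copied from the table above.
wikistats = {"enwiki": 131860, "eswiki": 17157, "commonswiki": 34702, "kowiki": 2436}
wiki_comparison = {"enwiki": 139341, "eswiki": 19042, "commonswiki": 41917, "kowiki": 2516}


def relative_gap(reference, other):
    """Return (other - reference) / reference for each wiki in `reference`."""
    return {
        wiki: (other[wiki] - reference[wiki]) / reference[wiki]
        for wiki in reference
    }


gaps = relative_gap(wikistats, wiki_comparison)

for wiki, gap in gaps.items():
    absolute = wiki_comparison[wiki] - wikistats[wiki]
    print(f"{wiki}: {absolute:+d} editors ({gap:+.1%})")
```

In every case the wiki comparison value is the larger of the two, and the relative gap varies a lot between wikis, with Commons showing the largest one.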


## Wiki comparison query

If I re-run the query used to generate the wiki comparison dataset, I get results that are very close to the ones shown in the wiki comparison tool (within 1% for each of the four wikis).

In [16]:
wiki_comparison_recalculation = wmf.spark.run("""
SELECT
    wiki,
    COUNT(*) / 12 AS wiki_comparison_recalculation
FROM cchen.editor_month
WHERE
    month >= "2020-01" 
    AND month < "2021-01" 
    AND user_id != 0 
    AND bot_by_group = FALSE
    AND (user_name not regexp "bot\\b" or user_name in ("Paucabot", "Niabot", "Marbot"))
    AND wiki IN ("enwiki", "eswiki", "commonswiki", "kowiki")
GROUP BY wiki
""", session_type="yarn-large")

monthly_editors = monthly_editors.merge(wiki_comparison_recalculation, how="inner", on="wiki")
monthly_editors["wiki_comparison_recalculation"] = monthly_editors["wiki_comparison_recalculation"].round().apply(int)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [17]:
monthly_editors

Unnamed: 0,wiki,wikistats,wiki_comparison,wiki_comparison_recalculation
0,enwiki,131860,139341,138415
1,eswiki,17157,19042,19069
2,commonswiki,34702,41917,41894
3,kowiki,2436,2516,2522


## Wiki comparison query, bypassing `editor_month`

The wiki comparison query draws from an intermediate table, `cchen.editor_month`. If I rewrite the query to incorporate the relevant parts of the query that generates `cchen.editor_month`, I also get results close to the original ones (within about 0.3% for each wiki).

In [18]:
wiki_comparison_recalculation_2 = wmf.spark.run("""
SELECT
    wiki,
    (SUM(editors) / 12) AS wiki_comparison_recalculation_2
FROM (
    SELECT
        TRUNC(event_timestamp, "MONTH") AS month,
        wiki_db AS wiki,
        COUNT(DISTINCT event_user_text) AS editors
    FROM wmf.mediawiki_history mwh
    INNER JOIN canonical_data.wikis cdw
    ON
        wiki_db = database_code
        AND database_group IN (
            "commons", "incubator", "foundation", "mediawiki", "meta", "sources", 
            "species","wikibooks", "wikidata", "wikinews", "wikipedia", "wikiquote",
            "wikisource", "wikiversity", "wikivoyage", "wiktionary"
        )
    WHERE
        event_timestamp BETWEEN "2020-01-01 00:00:00.0" AND "2021-01-01 00:00:00.0"
        AND SIZE(event_user_is_bot_by) = 0
        AND SIZE(event_user_is_bot_by_historical) = 0
        AND event_entity = "revision"
        AND event_type = "create"
        AND snapshot = "2021-10"
    GROUP BY
        TRUNC(event_timestamp, "MONTH"),
        wiki_db
) monthly_wiki_editors
GROUP BY wiki
""", session_type="yarn-large")

monthly_editors = monthly_editors.merge(wiki_comparison_recalculation_2, how="inner", on="wiki")
monthly_editors["wiki_comparison_recalculation_2"] = monthly_editors["wiki_comparison_recalculation_2"].round().apply(int)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [19]:
monthly_editors

Unnamed: 0,wiki,wikistats,wiki_comparison,wiki_comparison_recalculation,wiki_comparison_recalculation_2
0,enwiki,131860,139341,138415,139576
1,eswiki,17157,19042,19069,19068
2,commonswiki,34702,41917,41894,42042
3,kowiki,2436,2516,2522,2521


## Wiki comparison query, excluding deleted pages

I just noticed that the Wikistats metric [excludes edits made to deleted pages](https://meta.wikimedia.org/wiki/Research:Wikistats_metrics/Editors). This isn't the way we prefer to calculate metrics now (since it means historical metrics shift over time as pages get deleted), but it is more compatible with Wikistats's predecessor.

Now that I've rewritten the wiki comparison query to directly use the `mediawiki_history` dataset, I can easily filter out edits made to pages which are now deleted.
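
As a toy illustration of why this filter matters, the sketch below counts distinct editors per month and averages over months, with and without edits to pages that are now deleted. The revision data and the function name `average_monthly_editors` are hypothetical; it mirrors the shape of the aggregation, not the actual `mediawiki_history` schema.

```python
from collections import defaultdict

# Hypothetical revision events: (month, user, page_is_now_deleted).
revisions = [
    ("2020-01", "Alice", False),
    ("2020-01", "Bob", True),    # Bob only edited pages that were later deleted
    ("2020-01", "Alice", True),
    ("2020-02", "Alice", False),
    ("2020-02", "Carol", False),
    ("2020-02", "Bob", True),
]


def average_monthly_editors(events, exclude_deleted_pages):
    """Count distinct users per month, then average over all months in `events`."""
    # Fix the set of months before filtering so both variants share a denominator.
    months = {month for month, _, _ in events}
    editors_by_month = defaultdict(set)
    for month, user, page_deleted in events:
        if exclude_deleted_pages and page_deleted:
            continue
        editors_by_month[month].add(user)
    return sum(len(editors_by_month[m]) for m in months) / len(months)


including = average_monthly_editors(revisions, exclude_deleted_pages=False)
excluding = average_monthly_editors(revisions, exclude_deleted_pages=True)
print(including, excluding)
```

An editor whose only edits in a month were to pages that were later deleted (Bob here) drops out of that month's count entirely, so the filtered average can only be equal to or lower than the unfiltered one, and it keeps changing as more pages are deleted.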

In [20]:
wiki_comparison_excluding_deleted_pages = wmf.spark.run("""
SELECT
    wiki,
    (SUM(editors) / 12) AS wiki_comparison_excluding_deleted_pages
FROM (
    SELECT
        TRUNC(event_timestamp, "MONTH") AS month,
        wiki_db AS wiki,
        COUNT(DISTINCT event_user_text) AS editors
    FROM wmf.mediawiki_history mwh
    INNER JOIN canonical_data.wikis cdw
    ON
        wiki_db = database_code
        AND database_group IN (
            "commons", "incubator", "foundation", "mediawiki", "meta", "sources", 
            "species","wikibooks", "wikidata", "wikinews", "wikipedia", "wikiquote",
            "wikisource", "wikiversity", "wikivoyage", "wiktionary"
        )
    WHERE
        event_timestamp BETWEEN "2020-01-01 00:00:00.0" AND "2021-01-01 00:00:00.0"
        AND NOT revision_is_deleted_by_page_deletion
        AND SIZE(event_user_is_bot_by) = 0
        AND SIZE(event_user_is_bot_by_historical) = 0
        AND event_entity = "revision"
        AND event_type = "create"
        AND snapshot = "2021-10"
    GROUP BY
        TRUNC(event_timestamp, "MONTH"),
        wiki_db
) monthly_wiki_editors
GROUP BY wiki
""", session_type="yarn-large")

monthly_editors = monthly_editors.merge(wiki_comparison_excluding_deleted_pages, how="inner", on="wiki")
monthly_editors["wiki_comparison_excluding_deleted_pages"] = monthly_editors["wiki_comparison_excluding_deleted_pages"].round().apply(int)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [28]:
monthly_editors = monthly_editors.set_index("wiki")

These results are very close to those from Wikistats (within 0.4% for each wiki), which indicates that edits to deleted pages are the main cause of the discrepancy.

In [37]:
monthly_editors[["wikistats", "wiki_comparison_excluding_deleted_pages"]]

wiki,wikistats,wiki_comparison_excluding_deleted_pages
enwiki,131860,131780
eswiki,17157,17151
commonswiki,34702,34577
kowiki,2436,2434


Here's the wiki comparison value when excluding deleted pages, as a proportion of the Wikistats value.

In [38]:
(
    (monthly_editors["wiki_comparison_excluding_deleted_pages"] / monthly_editors["wikistats"])
    .to_frame()
    .rename({0: "comparison"}, axis="columns")
    .style.format({
        "comparison": wmf.utils.pct_str
    })
)

wiki,comparison
enwiki,99.9%
eswiki,100.0%
commonswiki,99.6%
kowiki,99.9%
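
Another way to read these numbers is to ask how much of the original gap remains after excluding deleted pages. The sketch below is a rough summary measure of my own, not something either tool reports: it compares the remaining difference from Wikistats with the original difference between the wiki comparison tool and Wikistats. The values are copied from the tables above, and the code uses only the standard library.

```python
# Values copied from the tables above.
wikistats = {"enwiki": 131860, "eswiki": 17157, "commonswiki": 34702, "kowiki": 2436}
wiki_comparison = {"enwiki": 139341, "eswiki": 19042, "commonswiki": 41917, "kowiki": 2516}
excluding_deleted = {"enwiki": 131780, "eswiki": 17151, "commonswiki": 34577, "kowiki": 2434}

share_of_gap_closed = {}
for wiki in wikistats:
    original_gap = abs(wiki_comparison[wiki] - wikistats[wiki])
    remaining_gap = abs(excluding_deleted[wiki] - wikistats[wiki])
    # 1.0 would mean the filtered query reproduces the Wikistats value exactly.
    share_of_gap_closed[wiki] = 1 - remaining_gap / original_gap

for wiki, share in share_of_gap_closed.items():
    print(f"{wiki}: {share:.1%} of the original gap closed")
```

By this measure, at least 97% of the original gap disappears for each of the four wikis once edits to deleted pages are excluded.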
