# Revert events in mediawiki_revision_tags_change

This notebook shows how to count revisions and flag reverts based on edit tags using the `event.mediawiki_revision_tags_change` table in the Data Lake. We'll query it using Spark. The same principles apply if you're using Presto or Hive, although some of the syntax may differ.

## Revert tags

The GSoC project to add a "reverted" filter ([T248775](https://phabricator.wikimedia.org/T248775)) introduced tags for reverts into MediaWiki ([T254074](https://phabricator.wikimedia.org/T254074)). Tags are applied both to the *reverting* revision (the one making the revert) and the *reverted* revision (the one that's being undone). This notebook focuses on the latter, because we're often interested in understanding revert rates in our analyses.

Tags in `event.mediawiki_revision_tags_change` use the system names; for reverted revisions that is "mw-reverted". On-wiki, the tag is translated into the language of the user interface; in English it is displayed as "Reverted". This tag was deployed to production on 2020-09-15 (ref: [T164307#6463808](https://phabricator.wikimedia.org/T164307#6463808)).

Because these tags are applied when the revert takes place, there can be a significant gap between the time a revision is made and the time it is tagged as reverted. This gap can be measured by comparing the timestamp of the event (`meta.dt`) with the revision timestamp (`rev_timestamp`). As we'll see, this lets us apply the common 48-hour cutoff for counting reverts (see [Research:Revert#Cutoffs for time to revert and edit radius](https://meta.wikimedia.org/wiki/Research:Revert#Cutoffs_for_time_to_revert_and_edit_radius) for more information).
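The timestamp comparison can be sketched in plain Python. This is a minimal illustration rather than part of the analysis: the helper names and the example timestamps are made up, and it assumes both timestamps are UTC strings in the `yyyy-MM-dd'T'HH:mm:ss'Z'` layout that the Spark query in this notebook parses.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout: the same one the query's unix_timestamp() pattern expects.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
REVERT_CUTOFF = timedelta(hours=48)


def parse_timestamp(value):
    """Parse a UTC timestamp such as '2020-10-10T12:00:00Z'."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def reverted_within_cutoff(rev_timestamp, event_dt, cutoff=REVERT_CUTOFF):
    """Return True if the tag event came less than `cutoff` after the revision."""
    return parse_timestamp(event_dt) - parse_timestamp(rev_timestamp) < cutoff


# Hypothetical timestamps, for illustration only.
print(reverted_within_cutoff("2020-10-10T12:00:00Z", "2020-10-11T09:30:00Z"))  # True
print(reverted_within_cutoff("2020-10-10T12:00:00Z", "2020-10-14T12:00:00Z"))  # False
```

The comparison is strict, so a tag applied exactly 48 hours after the revision falls outside the cutoff, which mirrors the `< 60*60*48` condition in the query.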

We already knew that a revision can have multiple tag change events (see [T218246#5981155](https://phabricator.wikimedia.org/T218246#5981155) for more about this). The introduction of the "mw-reverted" tag likely makes this more common, since reverts are commonplace. In our queries, this means we need to either count distinct revision IDs or aggregate over revision IDs in order to de-duplicate these events.
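To see why de-duplication matters, here is a small sketch in plain Python using made-up events. It assumes each event carries the revision's full tag list at that moment, which is how the query in this notebook treats the `tags` column. The revision IDs and the tag named `hypothetical-tag` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical tag change events: (rev_id, tags on the revision after the change).
events = [
    (1001, ["mobile edit"]),
    (1001, ["mobile edit", "mw-reverted"]),
    (1002, ["visualeditor"]),
    (1003, ["mw-undo", "mw-reverted"]),
    (1003, ["mw-undo", "mw-reverted", "hypothetical-tag"]),
]

# Naive count: every event that lists the tag, so revision 1003 is counted twice.
naive_reverted_events = sum("mw-reverted" in tags for _, tags in events)

# De-duplicated: one flag per revision, set if any of its events lists the tag.
was_reverted = defaultdict(bool)
for rev_id, tags in events:
    was_reverted[rev_id] = was_reverted[rev_id] or ("mw-reverted" in tags)

num_revisions = len(was_reverted)
num_reverted = sum(was_reverted.values())

print(naive_reverted_events)        # 3
print(num_revisions, num_reverted)  # 3 2
```

Taking the per-revision "any" here plays the same role as `MAX(...)` with `GROUP BY rev_id` in SQL.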

In this example, we'll count the number of revisions made on the English Wikipedia on 2020-10-10, the number of those that have had the "mw-reverted" tag applied as of 2020-10-26, and the number of those tags that were applied within 48 hours of the revision being made. This query can be expanded as needed, for example to iterate over a group of (or all) wikis or to extract contributor-specific information.
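One possible way to extend the query to several wikis is to turn the wiki and date into placeholders and build one query string per wiki. The sketch below is only an illustration: `QUERY_TEMPLATE` and `build_revision_count_query` are hypothetical names, the list of wikis is arbitrary, and the extra `wiki` output column is an addition for telling results apart. The query body otherwise mirrors the one used in this notebook.

```python
import re

# Same aggregation as the notebook's query, with wiki and date as placeholders.
QUERY_TEMPLATE = '''
WITH revs AS (
    SELECT
        rev_id,
        MAX(IF(array_contains(tags, 'mw-reverted'), 1, 0)) AS was_reverted,
        MAX(IF(array_contains(tags, 'mw-reverted') AND
           (unix_timestamp(meta.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
            unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") < 60*60*48), 1, 0))
            AS was_reverted_48hrs
    FROM event.mediawiki_revision_tags_change
    WHERE year = {year}
    AND month = {month}
    AND day = {day}
    AND `database` = "{wiki}"
    GROUP BY rev_id
)
SELECT
    "{wiki}" AS wiki,
    SUM(1) AS num_revisions,
    SUM(was_reverted) AS num_reverted,
    SUM(was_reverted_48hrs) AS num_reverted_48_hrs
FROM revs
'''


def build_revision_count_query(wiki, year, month, day):
    """Fill in the template, rejecting wiki names that are not plain identifiers."""
    if not re.fullmatch(r"[a-z0-9_]+", wiki):
        raise ValueError(f"unexpected wiki database name: {wiki!r}")
    return QUERY_TEMPLATE.format(wiki=wiki, year=int(year), month=int(month), day=int(day))


# Arbitrary example list of wikis.
wikis = ["enwiki", "dewiki", "frwiki"]
queries = {wiki: build_revision_count_query(wiki, 2020, 10, 10) for wiki in wikis}

# In this notebook, each string could then be passed to spark.run().
print(queries["dewiki"])
```

Validating the wiki name before formatting keeps unexpected input out of the SQL string, which is worth doing whenever a query is assembled from variables.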

In [1]:
from wmfdata import spark

In [17]:
revision_count_query = '''
WITH revs AS (
    SELECT
        rev_id,
        MAX(IF(array_contains(tags, 'mw-reverted'), 1, 0)) AS was_reverted, -- was reverted?
        MAX(IF(array_contains(tags, 'mw-reverted') AND
           (unix_timestamp(meta.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
            unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") < 60*60*48), 1, 0))
            AS was_reverted_48hrs -- reverted within 48 hours?
    FROM event.mediawiki_revision_tags_change
    WHERE year = 2020
    AND month = 10
    AND day = 10
    AND `database` = "enwiki"
    GROUP BY rev_id
)
SELECT
    SUM(1) AS num_revisions,
    SUM(was_reverted) AS num_reverted,
    SUM(was_reverted_48hrs) AS num_reverted_48_hrs
FROM revs
'''

In [18]:
revision_counts = spark.run(revision_count_query)

In [19]:
revision_counts

   num_revisions  num_reverted  num_reverted_48_hrs
0          61953         12998                10667


We can then use these counts to calculate revert rates: first, the proportion of revisions that are currently labelled as reverted, and second, the proportion that were labelled as reverted within 48 hours of being made.

In [20]:
round(100 * revision_counts['num_reverted'] / revision_counts['num_revisions'], 2)

0    20.98
dtype: float64

In [21]:
round(100 * revision_counts['num_reverted_48_hrs'] / revision_counts['num_revisions'], 2)

0    17.22
dtype: float64