# Investigating claim actions

Exploring the claim and label data to understand whether edit comments can tell us how many statements a given file has.  
<font color=red>_**Note: all metrics calculated in this version of the notebook differ from those in previous versions because they include December 2019 data, which is now available in the Data Lake.**_</font>


In [2]:
import pandas as pd
import numpy as np

import datetime as dt

from wmfdata import hive, mariadb

You are using wmfdata 0.1.0 (latest).

You can find the source for `wmfdata` at https://github.com/neilpquinn/wmfdata


## Configuration variables
**UPDATE 2020-01-17:** Changed the `wmf_snapshot` and `end_date` configuration variables so that all queries include December 2019 data. See [T242816](https://phabricator.wikimedia.org/T242816).

In [3]:
wmf_snapshot = '2019-12'
start_date = '2019-01-01' # first date of caption edits
end_date = '2020-01-01' # last date of caption edits (exclusive)

# Investigating claim actions

I dug into the edit comments to find examples, and found that [this file](https://commons.wikimedia.org/wiki/File:Rosendahl,_Darfeld,_Ortsansicht_--_2014_--_9391.jpg) has a number of property edits that gave good insight into what to look for. In short, I had to expand my search to any comment whose action starts with "wb". Using that insight, the query below looks for edit comments matching "wb{something}-{something}:", anchored at the start of the comment with three wildcard characters to skip the leading "/* ", and aggregates over the first and second "something".
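To make the pattern concrete, here is a minimal sketch that applies the same regular expression with Python's `re` module. The sample comments are hypothetical strings written for illustration (they are not rows from `wmf.mediawiki_history`), and `parse_claim_comment` is a helper name introduced here.

```python
import re

# Same pattern as in the Hive query: skip three characters ("/* "),
# then capture "wb{action}" and "{subaction}" up to the first colon.
CLAIM_PATTERN = re.compile(r"^...(wb[^-]+)-([^:]+):")

# Hypothetical comments written for illustration, not taken from the data.
sample_comments = [
    "/* wbsetclaim-create:2||1 */ example statement",
    "/* wbeditentity-update-languages-short:0||en, de */ example captions",
    "/* wbsetlabel-add:1|en */ example caption",
    "Uploaded own work with UploadWizard",
]


def parse_claim_comment(comment):
    """Return (claim_action, claim_subaction), or None if there is no match."""
    match = CLAIM_PATTERN.match(comment)
    return match.groups() if match else None


parsed = [parse_claim_comment(c) for c in sample_comments]
for comment, result in zip(sample_comments, parsed):
    print(result, "<-", comment)
```

Because the first group stops at the first hyphen and the second group runs up to the first colon, a subaction can itself contain hyphens (as in `update-languages-short`).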

In [4]:
claim_query = '''
SELECT claim_action, claim_subaction, count(*) AS num_actions
FROM (
    SELECT
        regexp_extract(event_comment, "^...(wb[^-]+)", 1) AS claim_action,
        regexp_extract(event_comment, "^...wb[^-]+-([^:]+):", 1) AS claim_subaction
    FROM wmf.mediawiki_history
    WHERE snapshot = "{snapshot}"
    AND wiki_db = "commonswiki"
    AND event_entity = "revision"
    AND event_type = "create"
    AND event_timestamp >= "{start_date}"
    AND event_timestamp < "{end_date}"
    AND page_is_deleted = false -- only count live pages
    AND page_namespace = 6 -- only count files
    AND event_comment REGEXP "^...(wb[^-]+)-([^:]+):"
) AS ce
GROUP BY claim_action, claim_subaction
'''

In [5]:
claim_counts = hive.run(
    [
        "SET mapreduce.map.memory.mb=4096", 
        claim_query.format(
            snapshot = wmf_snapshot,
            start_date = start_date,
            end_date = end_date
        )
    ]
)

In [6]:
claim_counts.sort_values(['claim_action', 'claim_subaction'])

| | claim_action | claim_subaction | num_actions |
|---|---|---|---|
| 6 | wbcreateclaim | create | 1755433 |
| 12 | wbeditentity | update | 272306 |
| 0 | wbeditentity | update-languages | 1163 |
| 8 | wbeditentity | update-languages-short | 1479 |
| 10 | wbremoveclaims | remove | 23247 |
| 4 | wbremoveclaims | update | 14345 |
| 5 | wbsetclaim | create | 595942 |
| 11 | wbsetclaim | update | 41617 |
| 7 | wbsetdescription | add | 2 |
| 2 | wbsetlabel | add | 1772898 |
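To compare the actions at a glance, the counts can be rolled up per `claim_action`. The following is a minimal sketch that rebuilds the output above as a small DataFrame so it runs without Hive access. `claim_counts_example`, `action_totals`, and `action_shares` are names introduced here for illustration.

```python
import pandas as pd

# Counts copied from the query output above.
claim_counts_example = pd.DataFrame(
    [
        ("wbcreateclaim", "create", 1755433),
        ("wbeditentity", "update", 272306),
        ("wbeditentity", "update-languages", 1163),
        ("wbeditentity", "update-languages-short", 1479),
        ("wbremoveclaims", "remove", 23247),
        ("wbremoveclaims", "update", 14345),
        ("wbsetclaim", "create", 595942),
        ("wbsetclaim", "update", 41617),
        ("wbsetdescription", "add", 2),
        ("wbsetlabel", "add", 1772898),
    ],
    columns=["claim_action", "claim_subaction", "num_actions"],
)

# Total number of matching edits per action, largest first.
action_totals = (
    claim_counts_example
    .groupby("claim_action")["num_actions"]
    .sum()
    .sort_values(ascending=False)
)

# Share of all matching edits that each action accounts for.
action_shares = (action_totals / action_totals.sum()).round(3)

print(pd.DataFrame({"num_actions": action_totals, "share": action_shares}))
```

In the notebook itself, the same `groupby` could be applied directly to `claim_counts`.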


Based on this, I need to ask the developers what each of these edit comment types means. Also, we know from the example file that a single edit can modify multiple properties, so we cannot determine how many properties a file has from the comments alone.