# Quality Demo

### *TODO/WARNING: Deal with empty results returned*

This notebook demonstrates three functions for collecting revision quality measures from Wikipedia:
1. `get_revisions_quality` - gets quality scores for a list of revision IDs.
2. `get_articles_quality` - gets quality scores for a list of titles / page IDs.
3. `pipeline_quality` - a convenience wrapper that, in addition to the above, also sets up the session and redirect maps.

All of these functions can access models through the Wikimedia Lift Wing API (including the older revscoring/ORES models). The available models are:
- `articlequality` - language agnostic revision quality score 0-1
- `revertrisk-multilingual` - multilingual revision revert risk probability 0-1
- `revertrisk-language-agnostic` - language agnostic revision revert risk probability 0-1
- `{wiki}-articlequality` - ORES revision quality - probability of Stub/Start/C/B/GA/FA 0-1
- `{wiki}-draftquality` - ORES draft quality - probability of OK/attack/spam/vandalism 0-1
- `{wiki}-goodfaith` - ORES "good faith" revision probability 0-1
- `{wiki}-damaging` - ORES "damaging" revision probability 0-1
- `{wiki}-reverted` - ORES revision reverted probability 0-1
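
To make the `{wiki}` placeholder in the list above concrete, here is a minimal sketch of how a model name and a Lift Wing request might be assembled. The helper names `expand_model_name` and `build_request` are hypothetical and are not part of `wikitoolkit`; the `enwiki`-style prefix and the payload fields follow the public Lift Wing conventions as I understand them, and should be treated as assumptions rather than a description of what the library does internally. Nothing is sent over the network here.

```python
# Base path of the public Lift Wing inference API (assumed; check the API docs).
LIFTWING_BASE = "https://api.wikimedia.org/service/lw/inference/v1/models"


def expand_model_name(model: str, lang: str = "en") -> str:
    """Fill the {wiki} placeholder, e.g. '{wiki}-damaging' -> 'enwiki-damaging'.

    Assumption: the wiki prefix is '<lang>wiki', as for Wikipedia projects.
    Names without a placeholder are returned unchanged.
    """
    return model.format(wiki=f"{lang}wiki")


def build_request(model: str, rev_id: int, lang: str = "en"):
    """Return (url, payload) for one revision without performing the request."""
    url = f"{LIFTWING_BASE}/{expand_model_name(model, lang)}:predict"
    payload = {"rev_id": rev_id}
    if "{wiki}" not in model:
        # Assumption: language-agnostic/multilingual models take the language
        # in the payload, while ORES-style models encode it in the model name.
        payload["lang"] = lang
    return url, payload


url, payload = build_request("{wiki}-damaging", 1236428488)
print(url)
print(payload)
```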




## Setup

In [1]:
import wikitoolkit
import pandas as pd

my_agent = 'mwapi testing <p.gildersleve@lse.ac.uk>'
wtsession = wikitoolkit.WTSession('en.wikipedia', user_agent=my_agent)

pagemaps = wikitoolkit.PageMaps() # see demo_redirects.ipynb for more info

toparts = pd.read_csv('data/topviews-2024_07_31.csv')
artlist = toparts['Page'].unique().tolist() # ~1000 top-viewed articles (file dated 2024-07-31)
revision_ids = [1236428488,
                1236453299,
                1237461948,
                1237046423,
                1237232495,
                1236992079,
                1236436502,
                1236488217,
                1236305118,
                1237376589] # 10 random revision ids

## `get_revisions_quality`

Gets quality scores based on revision IDs.

In [2]:
# articlequality is the default model
r_quality = await wikitoolkit.get_revisions_quality(wtsession, revision_ids, 'en')
pd.DataFrame(r_quality).T

| | articlequality |
|---|---|
| 1236428488 | |
| 1236453299 | |
| 1237461948 | 0.954037 |
| 1237046423 | |
| 1237232495 | |
| 1236992079 | 0.955253 |
| 1236436502 | |
| 1236488217 | |
| 1236305118 | |
| 1237376589 | |
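
Most revisions in the output above have no `articlequality` score, which is the situation the TODO at the top of this notebook refers to. A minimal sketch of separating scored from unscored revisions with pandas follows. The sample dictionary is hand-built to mirror the output; the exact form in which `wikitoolkit` represents a missing score (`None`, `NaN`, or an absent key) is an assumption.

```python
import pandas as pd

# Sample shaped like the result above: revision ID -> {model: score}.
# Assumption: an unscored revision carries None for the model.
r_quality_sample = {
    1236428488: {"articlequality": None},
    1237461948: {"articlequality": 0.954037},
    1236992079: {"articlequality": 0.955253},
    1237376589: {"articlequality": None},
}

# Same reshaping as in the notebook, then force a numeric dtype so that
# None becomes NaN and the usual pandas missing-data tools apply.
scores = pd.DataFrame(r_quality_sample).T.astype(float)

# Revision IDs without a score, e.g. to retry or to report.
missing_ids = scores.index[scores["articlequality"].isna()].tolist()

# Rows that do have a score.
scored = scores.dropna(subset=["articlequality"])

print(missing_ids)
print(scored)
```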


Other models can be requested with the `models` argument:

In [3]:
r_quality = await wikitoolkit.get_revisions_quality(wtsession, revision_ids, 'en',
                                                    models=['articlequality', 'revertrisk-multilingual'])
pd.DataFrame(r_quality).T

| | revertrisk-multilingual |
|---|---|
| 1236428488 | 0.837449 |
| 1236453299 | 0.277937 |
| 1237461948 | 0.188958 |
| 1237046423 | 0.185478 |
| 1237232495 | 0.040484 |
| 1236992079 | 0.280694 |
| 1236436502 | 0.705815 |
| 1236488217 | 0.244512 |
| 1236305118 | 0.644111 |
| 1237376589 | 0.138911 |
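
The cells above use top-level `await`, which works in Jupyter/IPython but not in a plain Python script. A minimal sketch of the same pattern for a script follows. `fetch_scores` is a hypothetical stand-in for an awaitable such as `get_revisions_quality`, so the example runs without network access or `wikitoolkit` installed.

```python
import asyncio


async def fetch_scores(revision_ids):
    """Hypothetical stand-in for an awaitable call such as get_revisions_quality."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {rev_id: {"articlequality": None} for rev_id in revision_ids}


async def main():
    # In a real script, the wikitoolkit calls would be awaited here.
    return await fetch_scores([1236428488, 1236453299])


if __name__ == "__main__":
    # Outside a notebook there is no running event loop, so start one explicitly.
    result = asyncio.run(main())
    print(result)
```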


## `get_articles_quality`

Gets quality scores based on article titles or page IDs. By default it returns the most recent revision of each article, but it can also return the revision at a given date or all revisions in a date range.

In [4]:
a_quality = await wikitoolkit.get_articles_quality(wtsession, titles=artlist[:10],
                           lang='en', pagemaps=pagemaps)
pd.DataFrame(a_quality).T

| | revid | parentid | timestamp |
|---|---|---|---|
| Michael Phelps | 1246292801 | 1246289231 | 2024-09-18T01:59:14Z |
| Sunisa Lee | 1246512503 | 1246512450 | 2024-09-19T11:49:49Z |
| Katie Ledecky | 1246151473 | 1246108898 | 2024-09-17T06:26:28Z |
| Jonathan Owens | 1246317609 | 1246308172 | 2024-09-18T06:16:28Z |
| Deadpool & Wolverine | 1246618701 | 1246618342 | 2024-09-20T01:37:32Z |
| Simone Biles | 1246613302 | 1246477069 | 2024-09-20T00:54:42Z |
| MyKayla Skinner | 1241947299 | 1241713378 | 2024-08-24T02:32:30Z |
| 2024 Summer Olympics | 1246620112 | 1246328710 | 2024-09-20T01:48:24Z |
| Kamala Harris | 1246676367 | 1246676096 | 2024-09-20T11:47:02Z |
| Ismail Haniyeh | 1246086465 | 1246014999 | 2024-09-16T20:28:08Z |


In [5]:
a_quality = await wikitoolkit.get_articles_quality(wtsession, titles=artlist[:10],
                           start='2015-07-31T00:00:00Z', stop='2015-08-07T00:00:00Z', 
                           lang='en', pagemaps=pagemaps)
pd.concat({k: pd.DataFrame(v) for k, v in a_quality.items()}
          ).reset_index(level=1, drop=True).reset_index().rename(columns={'index': 'title'})

| | title | revid | parentid | timestamp |
|---|---|---|---|---|
| 0 | Katie Ledecky | 674235344.0 | 672360085.0 | 2015-08-02T15:33:15Z |
| 1 | Katie Ledecky | 674235395.0 | 674235344.0 | 2015-08-02T15:33:45Z |
| 2 | Katie Ledecky | 674235478.0 | 674235395.0 | 2015-08-02T15:34:31Z |
| 3 | Katie Ledecky | 674236253.0 | 674235478.0 | 2015-08-02T15:41:32Z |
| 4 | Katie Ledecky | 674236305.0 | 674236253.0 | 2015-08-02T15:42:02Z |
| ... | ... | ... | ... | ... |
| 92 | 2024 Summer Olympics | 674719585.0 | 674712538.0 | 2015-08-05T18:13:57Z |
| 93 | 2024 Summer Olympics | 674737540.0 | 674719585.0 | 2015-08-05T20:29:57Z |
| 94 | 2024 Summer Olympics | 674748281.0 | 674737540.0 | 2015-08-05T21:59:25Z |
| 95 | 2024 Summer Olympics | 674810185.0 | 674748281.0 | 2015-08-06T08:26:49Z |
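
The `pd.concat` chain used for date-range results can be followed step by step on a small sample. The sketch below assumes that a range query returns a dictionary mapping each title to a list of per-revision records; that shape is inferred from the reshaping code and its output, not confirmed from the `wikitoolkit` source. The sample values are taken from the table above.

```python
import pandas as pd

# Assumed shape of a range result: title -> list of per-revision records.
a_quality_sample = {
    "Katie Ledecky": [
        {"revid": 674235344, "parentid": 672360085, "timestamp": "2015-08-02T15:33:15Z"},
        {"revid": 674235395, "parentid": 674235344, "timestamp": "2015-08-02T15:33:45Z"},
    ],
    "2024 Summer Olympics": [
        {"revid": 674719585, "parentid": 674712538, "timestamp": "2015-08-05T18:13:57Z"},
    ],
}

# One DataFrame per title, keyed by title.
frames = {title: pd.DataFrame(revs) for title, revs in a_quality_sample.items()}

# Concatenating a dict yields a MultiIndex: (title, row number within title).
stacked = pd.concat(frames)

# Drop the per-title row number, then move the title out of the index
# into an ordinary column named 'title'.
flat = (
    stacked.reset_index(level=1, drop=True)
    .reset_index()
    .rename(columns={"index": "title"})
)

print(flat)
```

Titles with no revisions in the requested range would contribute an empty frame, which is presumably why fewer than ten titles appear in the output above.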


## `pipeline_quality`

This function sets up the session, resolves redirects with `PageMaps` (if necessary), and collects revision quality data. It is a convenience wrapper around the previous two functions, so data can be collected by revision ID, title, or page ID, and different models, dates, and date ranges can still be specified. Note that it does not require manual setup of the `wtsession`.

In [6]:
p_quality = await wikitoolkit.pipeline_quality('en.wikipedia', my_agent,
                                               titles=artlist[:10], pagemaps=pagemaps)
pd.DataFrame(p_quality).T

| | revid | parentid | timestamp | articlequality |
|---|---|---|---|---|
| Michael Phelps | 1246292801 | 1246289231 | 2024-09-18T01:59:14Z | 0.995604 |
| Sunisa Lee | 1246512503 | 1246512450 | 2024-09-19T11:49:49Z | 0.988588 |
| Katie Ledecky | 1246151473 | 1246108898 | 2024-09-17T06:26:28Z | 0.976945 |
| Jonathan Owens | 1246317609 | 1246308172 | 2024-09-18T06:16:28Z | 0.395441 |
| Deadpool & Wolverine | 1246618701 | 1246618342 | 2024-09-20T01:37:32Z | 0.956483 |
| Simone Biles | 1246613302 | 1246477069 | 2024-09-20T00:54:42Z | 0.982634 |
| MyKayla Skinner | 1241947299 | 1241713378 | 2024-08-24T02:32:30Z | 0.75594 |
| 2024 Summer Olympics | 1246620112 | 1246328710 | 2024-09-20T01:48:24Z | 0.98258 |
| Kamala Harris | 1246676367 | 1246676096 | 2024-09-20T11:47:02Z | 0.988517 |
| Ismail Haniyeh | 1246086465 | 1246014999 | 2024-09-16T20:28:08Z | 0.70489 |


In [7]:
p_quality = await wikitoolkit.pipeline_quality('en.wikipedia', my_agent, titles=artlist[:10],
                                               qf_args={'start': '2024-09-17T01:59:14Z', 'stop': '2024-09-18T01:59:14Z'},
                                               models=['articlequality', 'revertrisk-multilingual'],
                                               pagemaps=pagemaps)
pd.concat({k: pd.DataFrame(v) for k, v in p_quality.items()}
          ).reset_index(level=1, drop=True).reset_index().rename(columns={'index': 'title'})   

| | title | revid | parentid | timestamp | articlequality | revertrisk-multilingual |
|---|---|---|---|---|---|---|
| 0 | Michael Phelps | 1246289000.0 | 1244097000.0 | 2024-09-18T01:28:36Z | 0.995604 | 0.435464 |
| 1 | Michael Phelps | 1246293000.0 | 1246289000.0 | 2024-09-18T01:59:14Z | 0.995604 | 0.446746 |
| 2 | Sunisa Lee | 1246292000.0 | 1245126000.0 | 2024-09-18T01:53:32Z | 0.988588 | 0.359671 |
| 3 | Katie Ledecky | 1246151000.0 | 1246109000.0 | 2024-09-17T06:26:28Z | 0.976945 | 0.861662 |
| 4 | Deadpool & Wolverine | 1246214000.0 | 1246123000.0 | 2024-09-17T16:11:09Z | | 0.294085 |
| 5 | Deadpool & Wolverine | 1246262000.0 | 1246214000.0 | 2024-09-17T21:59:14Z | | 0.344429 |
| 6 | Deadpool & Wolverine | 1246262000.0 | 1246262000.0 | 2024-09-17T21:59:35Z | | 0.277937 |
| 7 | Deadpool & Wolverine | 1246274000.0 | 1246262000.0 | 2024-09-17T23:28:21Z | | 0.266657 |
| 8 | Deadpool & Wolverine | 1246287000.0 | 1246274000.0 | 2024-09-18T01:13:01Z | | 0.212187 |
| 9 | Simone Biles | 1246244000.0 | 1246112000.0 | 2024-09-17T19:47:48Z | | 0.795054 |
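
In the date-range outputs the `revid` and `parentid` columns appear as floats. One common cause in pandas is a missing value in an otherwise integer column, which upcasts the column to `float64`; whether that is the cause here is an assumption. A minimal sketch of converting such a column to pandas' nullable integer dtype follows. The third row, including the title "Example page", is hypothetical and only there to provide a missing value.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Katie Ledecky", "Katie Ledecky", "Example page"],
    # A NaN in the column forces float64, so IDs display with a trailing '.0'.
    "revid": [674235344.0, 674235395.0, float("nan")],
})

# 'Int64' (capital I) is the nullable integer dtype: integers stay integers
# and missing entries become <NA> instead of NaN.
df["revid"] = df["revid"].astype("Int64")

print(df)
print(df.dtypes)
```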
