# Revisions Demo

This notebook demonstrates five functions for collecting revision data from Wikipedia:
1. `get_revision` - for getting a single revision (default most recent), and associated data, for each title/pageid.
2. `get_revisions` - for getting all revisions in a date range, and associated data, for each title/pageid.
3. `get_revisions_data` - for getting revision data for known revision ids.
4. `get_revisions_content` - for getting revision content for known revision ids.
5. `pipeline_revisions` - a convenience wrapper that calls the functions above and also sets up the session and redirect maps.

## Setup

In [1]:
import mwapi
import wikitoolkit
import pandas as pd

my_agent = 'mwapi testing <p.gildersleve@lse.ac.uk>'
async_session = mwapi.AsyncSession('https://en.wikipedia.org',
                    formatversion=2, user_agent=my_agent)

toparts = pd.read_csv('data/topviews-2024_07_31.csv')
artlist = toparts['Page'].unique().tolist() # ~1000 top articles yesterday
revision_ids = [1236428488,
                1236453299,
                1237461948,
                1237046423,
                1237232495,
                1236992079,
                1236436502,
                1236488217,
                1236305118,
                1237376589] # 10 random revision ids

## `get_revision`

### Get most recent revision

In [2]:
current_revisions = await wikitoolkit.get_revision(async_session, titles=artlist[:10])
pd.DataFrame(current_revisions).T



| title | revid | parentid | timestamp |
|---|---|---|---|
| MyKayla Skinner | 1241947299 | 1241713378 | 2024-08-24T02:32:30Z |
| Michael Phelps | 1244096767 | 1244096624 | 2024-09-05T00:58:05Z |
| Katie Ledecky | 1243479333 | 1243204533 | 2024-09-01T18:18:03Z |
| Sunisa Lee | 1244096528 | 1243840412 | 2024-09-05T00:56:06Z |
| Ismail Haniyeh | 1244776547 | 1244746905 | 2024-09-09T03:34:33Z |
| 2024 Summer Olympics | 1244806576 | 1244805492 | 2024-09-09T08:56:33Z |
| Simone Biles | 1244189717 | 1244188496 | 2024-09-05T15:37:41Z |
| Deadpool & Wolverine | 1244802439 | 1244775182 | 2024-09-09T08:10:22Z |
| Jonathan Owens | 1241702742 | 1239984736 | 2024-08-22T17:50:46Z |
| Kamala Harris | 1244740856 | 1244740488 | 2024-09-08T22:12:57Z |


### Get revision on specific date

In [3]:
date_revisions = await wikitoolkit.get_revision(async_session, titles=artlist[:10],
                                                 date='2015-07-31T00:00:00Z')
pd.DataFrame(date_revisions).T



| title | revid | parentid | timestamp |
|---|---|---|---|
| MyKayla Skinner | 673825992.0 | 673729589.0 | 2015-07-30T19:01:12Z |
| Michael Phelps | 673536509.0 | 671250902.0 | 2015-07-28T21:46:23Z |
| Katie Ledecky | 672360085.0 | 668007409.0 | 2015-07-21T01:38:38Z |
| Sunisa Lee | | | |
| Ismail Haniyeh | 661522866.0 | 659341058.0 | 2015-05-09T07:45:05Z |
| 2024 Summer Olympics | 673833579.0 | 673831521.0 | 2015-07-30T20:02:16Z |
| Simone Biles | 673859220.0 | 673842766.0 | 2015-07-30T23:56:13Z |
| Deadpool & Wolverine | | | |
| Jonathan Owens | | | |
| Kamala Harris | 673028346.0 | 673025637.0 | 2015-07-25T15:26:15Z |
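
Some titles in the table above have no revision on the requested date (for example, pages that did not exist yet). A minimal sketch for separating those titles from the rest is shown here. It assumes the result is a dict keyed by title and that a missing revision appears as `None`, an empty dict, or a NaN `revid`; the exact representation is an assumption, and `split_by_availability` is a hypothetical helper, not part of `wikitoolkit`.

```python
def split_by_availability(revisions):
    """Split a title -> revision mapping into found revisions and missing titles."""
    found, missing = {}, []
    for title, rev in revisions.items():
        # Assumption: a missing revision is None, an empty dict, or has a NaN revid.
        revid = (rev or {}).get('revid')
        if revid is None or revid != revid:  # NaN is the only value not equal to itself
            missing.append(title)
        else:
            found[title] = rev
    return found, missing


# Sample shaped like the output above; the empty/None entries are assumed shapes.
sample = {
    'Katie Ledecky': {'revid': 672360085, 'parentid': 668007409,
                      'timestamp': '2015-07-21T01:38:38Z'},
    'Sunisa Lee': {},
    'Jonathan Owens': None,
}
found, missing = split_by_availability(sample)
print(list(found))
print(missing)
```

Filtering before building a DataFrame also keeps the `revid` column from being coerced to floats by the empty rows.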


## `get_revisions`

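By default, this collects revisions from the last 30 days. The example below passes explicit `start` and `stop` timestamps to collect one week instead.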

In [4]:
revisions_1week = await wikitoolkit.get_revisions(async_session, titles=artlist[:10],
                                                start='2024-07-24T00:00:00Z',
                                                stop='2024-07-31T00:00:00Z')
pd.concat({k: pd.DataFrame(v) for k, v in revisions_1week.items()}).reset_index(
            level=1, drop=True).reset_index().rename(columns={'index': 'title'})



| | title | revid | parentid | timestamp |
|---|---|---|---|---|
| 0 | MyKayla Skinner | 1.237274e+09 | 1.235772e+09 | 2024-07-29T00:18:39Z |
| 1 | MyKayla Skinner | 1.237275e+09 | 1.237274e+09 | 2024-07-29T00:25:02Z |
| 2 | MyKayla Skinner | 1.237276e+09 | 1.237275e+09 | 2024-07-29T00:27:41Z |
| 3 | MyKayla Skinner | 1.237277e+09 | 1.237276e+09 | 2024-07-29T00:32:35Z |
| 4 | MyKayla Skinner | 1.237278e+09 | 1.237277e+09 | 2024-07-29T00:41:12Z |
| ... | ... | ... | ... | ... |
| 1534 | Kamala Harris | 1.237585e+09 | 1.237568e+09 | 2024-07-30T14:19:00Z |
| 1535 | Kamala Harris | 1.237631e+09 | 1.237585e+09 | 2024-07-30T18:49:51Z |
| 1536 | Kamala Harris | 1.237650e+09 | 1.237631e+09 | 2024-07-30T20:36:03Z |
| 1537 | Kamala Harris | 1.237650e+09 | 1.237650e+09 | 2024-07-30T20:36:39Z |
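
The `start` and `stop` arguments in the call above are ISO 8601 strings in UTC. As a rough sketch, a window of that shape can be built with the standard library; `iso_window` is a hypothetical helper introduced here for illustration, not a `wikitoolkit` function.

```python
from datetime import datetime, timedelta, timezone

ISO_FMT = '%Y-%m-%dT%H:%M:%SZ'  # timestamp format used in the calls above


def iso_window(stop, days):
    """Return (start, stop) ISO 8601 strings covering the `days` before `stop`."""
    stop_dt = datetime.strptime(stop, ISO_FMT).replace(tzinfo=timezone.utc)
    start_dt = stop_dt - timedelta(days=days)
    return start_dt.strftime(ISO_FMT), stop_dt.strftime(ISO_FMT)


start, stop = iso_window('2024-07-31T00:00:00Z', days=7)
print(start, stop)
```

With `days=7` this reproduces the one-week window used in the cell above.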


## `get_revisions_data`

Revision IDs must be supplied to this function, not titles or pageids. The data this function collects can also be collected by previous functions with the right `props` arguments.

In [5]:
revisions_data = await wikitoolkit.get_revisions_data(async_session, revids=revision_ids,
                                                    props=['timestamp', 'ids',
                                                           'size', 'comment','user'])
pd.DataFrame(revisions_data).T



| revid | parentid | user | timestamp | size | comment | anon |
|---|---|---|---|---|---|---|
| 1236488217 | 1236487199 | JRDkg | 2024-07-25T00:27:42Z | 243454 | secondary classification...Tamil is not a sepa... | |
| 1236305118 | 1236300833 | 178.84.53.53 | 2024-07-24T00:12:12Z | 158909 | | True |
| 1236428488 | 1236428120 | LifeJustKnowYouthfulness | 2024-07-24T17:06:17Z | 197106 | /* Plot */ | |
| 1236436502 | 1236435702 | MutantX13 | 2024-07-24T18:04:05Z | 197622 | | |
| 1236453299 | 1236451598 | Sc2353 | 2024-07-24T19:58:08Z | 197401 | updated RT info | |
| 1236992079 | 1236991794 | BarntToust | 2024-07-27T15:25:20Z | 223617 | Undid revision [[Special:Diff/1236991794\|12369... | |
| 1237046423 | 1237046107 | BarntToust | 2024-07-27T21:38:12Z | 227630 | /* References */ | |
| 1237232495 | 1237232311 | Kailash29792 | 2024-07-28T19:56:49Z | 234012 | Rescuing 19 sources and tagging 0 as dead.) #I... | |
| 1237461948 | 1237461784 | ErnestoCabral2018 | 2024-07-29T22:30:31Z | 235927 | | |
| 1237376589 | 1237217410 | Marquardtika | 2024-07-29T13:40:23Z | 13372 | /* Personal life */ not needed | |
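
Because this function takes revision IDs rather than titles, a common step is to gather IDs from an earlier range query first. The sketch below assumes a result shaped like the `get_revisions` output (a dict mapping each title to a list of revision dicts with a `revid` key); `collect_revids` is a hypothetical helper, not part of `wikitoolkit`.

```python
def collect_revids(range_revisions):
    """Flatten a title -> list-of-revisions mapping into a de-duplicated list of ids."""
    revids, seen = [], set()
    for revs in range_revisions.values():
        for rev in revs:
            revid = int(rev['revid'])  # cast in case ids were coerced to floats
            if revid not in seen:
                seen.add(revid)
                revids.append(revid)
    return revids


# Sample shaped like a range query result for one title.
sample = {
    'MyKayla Skinner': [
        {'revid': 1240555941, 'parentid': 1239249462, 'timestamp': '2024-08-16T00:42:03Z'},
        {'revid': 1241161523, 'parentid': 1240555941, 'timestamp': '2024-08-19T17:24:29Z'},
    ],
}
ids = collect_revids(sample)
print(ids)
```

The resulting list could then be passed as the `revids` argument, as in the cell above.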


## `get_revisions_content`

This function collects the revision content for the revision IDs supplied. The data this function collects can also be collected by previous functions with the right `props` arguments.

In [6]:
revisions_content = await wikitoolkit.get_revisions_content(async_session, revids=revision_ids)
pd.Series(revisions_content)



1236488217    {{Short description|Vice President of the Unit...
1236305118    {{Short description|Multi-sport event in Paris...
1236428488    {{Short description|2024 Marvel Studios film}}...
1236436502    {{Short description|2024 Marvel Studios film}}...
1236453299    {{Short description|2024 Marvel Studios film}}...
1236992079    {{Short description|2024 Marvel Studios film}}...
1237046423    {{Short description|2024 Marvel Studios film}}...
1237232495    {{Short description|2024 Marvel Studios film}}...
1237461948    {{Short description|2024 Marvel Studios film}}...
1237376589    {{Short description|American football player (...
dtype: object

In [7]:
print(revisions_content[1236488217][:10000])

{{Short description|Vice President of the United States since 2021}}
<!--Do not include the distinguish hatnote per this discussion: [[Talk:Kamala Harris/Archive 4#Distinguish hatnote with wrestler Kamala]]-->
{{pp-extended|small=yes}}
{{Use American English|date=July 2024}}
{{Use mdy dates|date=July 2024}}
{{Infobox officeholder
| image         = Kamala Harris Vice Presidential Portrait.jpg
| caption       = Official portrait, 2021
| office        = 49th [[Vice President of the United States]]
| president     = [[Joe Biden]]
| term_start    = January 20, 2021
| predecessor   = [[Mike Pence]]
| jr/sr1        = United States Senator
| state1        = [[California]]
| term_start1   = January 3, 2017
| term_end1     = January 18, 2021
| predecessor1  = [[Barbara Boxer]]
| successor1    = [[Alex Padilla]]
| office2       = 32nd [[Attorney General of California]]
| governor2     = [[Jerry Brown]]
| term_start2   = January 3, 2011
| term_end2     = January 3, 2017
| predecessor2  = Jerry Bro
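
The content returned is raw wikitext, so any structured field has to be parsed out of it. As a small illustration, the sketch below pulls the `{{Short description|...}}` value with a regular expression. It is a simplification that assumes the template contains no nested templates; `short_description` is a hypothetical helper introduced here.

```python
import re

# Matches {{Short description|...}} with no nested braces inside the value.
SHORT_DESC = re.compile(r'\{\{\s*Short description\s*\|([^{}]*)\}\}', re.IGNORECASE)


def short_description(wikitext):
    """Return the short description from wikitext, or None if absent."""
    match = SHORT_DESC.search(wikitext)
    return match.group(1).strip() if match else None


sample = ('{{Short description|Vice President of the United States since 2021}}\n'
          '{{pp-extended|small=yes}}\n')
print(short_description(sample))
```

For anything beyond simple top-level templates, a dedicated wikitext parser is likely a better fit than regular expressions.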

## `pipeline_revisions`

This function sets up the session, resolves redirects with PageMaps, and collects revision data. It is a convenience wrapper around the previous functions; the `mode` argument (`'single'`, `'range'`, `'data'`, or `'content'`) selects which one is called. It does not require manual setup of the `async_session`.

### Single mode

This mode uses `get_revision`.

In [8]:
single_revision, pagemaps = await wikitoolkit.pipeline_revisions('en.wikipedia', user_agent=my_agent,
                                           mode='single', titles=artlist[:10])
# Additionally returns new pagemaps object (if not supplied), storing redirects, normalizations, and page ids
# It's recommended to create a single pagemaps object in a project and update it with each call
print(pagemaps)

single_revision

Redirects: 0, Norms: 0, IDs: 10, Existing: 0


{'MyKayla Skinner': {'revid': 1241947299,
  'parentid': 1241713378,
  'timestamp': '2024-08-24T02:32:30Z'},
 'Michael Phelps': {'revid': 1244096767,
  'parentid': 1244096624,
  'timestamp': '2024-09-05T00:58:05Z'},
 'Katie Ledecky': {'revid': 1243479333,
  'parentid': 1243204533,
  'timestamp': '2024-09-01T18:18:03Z'},
 'Sunisa Lee': {'revid': 1244096528,
  'parentid': 1243840412,
  'timestamp': '2024-09-05T00:56:06Z'},
 'Ismail Haniyeh': {'revid': 1244776547,
  'parentid': 1244746905,
  'timestamp': '2024-09-09T03:34:33Z'},
 '2024 Summer Olympics': {'revid': 1244806576,
  'parentid': 1244805492,
  'timestamp': '2024-09-09T08:56:33Z'},
 'Simone Biles': {'revid': 1244189717,
  'parentid': 1244188496,
  'timestamp': '2024-09-05T15:37:41Z'},
 'Deadpool & Wolverine': {'revid': 1244802439,
  'parentid': 1244775182,
  'timestamp': '2024-09-09T08:10:22Z'},
 'Jonathan Owens': {'revid': 1241702742,
  'parentid': 1239984736,
  'timestamp': '2024-08-22T17:50:46Z'},
 'Kamala Harris': {'revid': 124

### Single mode - redirect demo

The pipeline function can also resolve redirects and normalize titles as part of the process. The redirect has to be recorded on Wikipedia; the function cannot fix arbitrary typos.

In [9]:
# create some incorrect, possibly normalisable/redirectable, titles
bad_titles = [x.lower() for x in artlist[:10]] + ['thisisnotatitle']

# call function with pagemaps object supplied
single_rd_revision = await wikitoolkit.pipeline_revisions('en.wikipedia', user_agent=my_agent,
                                           pagemaps=pagemaps, mode='single', titles=bad_titles)

# pagemaps object is updated with new redirects and normalizations
print(pagemaps)
single_rd_revision

Redirects: 11, Norms: 10, IDs: 21, Existing: 0


{'Michael Phelps': {'revid': 1244096767,
  'parentid': 1244096624,
  'timestamp': '2024-09-05T00:58:05Z'},
 'Kamala Harris': {'revid': 1244740856,
  'parentid': 1244740488,
  'timestamp': '2024-09-08T22:12:57Z'},
 'Ismail Haniyeh': {'revid': 1244776547,
  'parentid': 1244746905,
  'timestamp': '2024-09-09T03:34:33Z'}}
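
To see why only some lowercased titles resolve, it helps to separate normalization from redirects. The sketch below approximates MediaWiki-style title normalization on English Wikipedia (underscores become spaces, whitespace is collapsed, and the first character is capitalized). It is an approximation for illustration, not the code used by `wikitoolkit` or MediaWiki, and `normalize_title` is a hypothetical helper.

```python
def normalize_title(title):
    """Approximate title normalization: spaces for underscores, capital first letter."""
    # Replace underscores, then collapse runs of whitespace and trim the ends.
    title = ' '.join(title.replace('_', ' ').split())
    # Only the first character is capitalized; the rest is left untouched.
    return title[:1].upper() + title[1:]


print(normalize_title('michael phelps'))
print(normalize_title('deadpool_&_wolverine'))
```

Under this approximation `'michael phelps'` normalizes to `'Michael phelps'`, which still differs from the article title. Reaching `'Michael Phelps'` then depends on a redirect from the normalized form existing on Wikipedia, which would be consistent with only some of the titles coming back above.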

### Single mode - specific date

In [10]:
# call function with pagemaps object supplied
date_revision = await wikitoolkit.pipeline_revisions('en.wikipedia', user_agent=my_agent,
                                           mode='single', titles=artlist[:10],
                                           rf_args={'date': '2020-07-31T00:00:00Z'},
                                           pagemaps=pagemaps)
date_revision

{'MyKayla Skinner': {'revid': 969862901,
  'parentid': 968660653,
  'timestamp': '2020-07-27T20:30:18Z'},
 'Michael Phelps': {'revid': 964988208,
  'parentid': 958962948,
  'timestamp': '2020-06-28T19:23:35Z'},
 'Katie Ledecky': {'revid': 966831582,
  'parentid': 966831547,
  'timestamp': '2020-07-09T12:58:47Z'},
 'Sunisa Lee': {'revid': 969802408,
  'parentid': 964132745,
  'timestamp': '2020-07-27T14:10:23Z'},
 'Ismail Haniyeh': {'revid': 969250117,
  'parentid': 969250045,
  'timestamp': '2020-07-24T08:29:43Z'},
 '2024 Summer Olympics': {'revid': 968800570,
  'parentid': 967802903,
  'timestamp': '2020-07-21T15:53:28Z'},
 'Simone Biles': {'revid': 970165125,
  'parentid': 969898853,
  'timestamp': '2020-07-29T17:02:05Z'},
 'Deadpool & Wolverine': {'revid': 969855025,
  'parentid': 969282515,
  'timestamp': '2020-07-27T19:47:51Z'},
 'Jonathan Owens': {'revid': 963383629,
  'parentid': 949988363,
  'timestamp': '2020-06-19T14:34:03Z'},
 'Kamala Harris': {'revid': 970363031,
  'parenti

### Range mode

This mode uses `get_revisions`.

In [11]:
# call function with pagemaps object supplied
range_revision = await wikitoolkit.pipeline_revisions('en.wikipedia', user_agent=my_agent,
                                                    mode='range', titles=artlist[:10],
                                                    pagemaps=pagemaps)
range_revision
                                                    

{'MyKayla Skinner': [{'revid': 1240555941,
   'parentid': 1239249462,
   'timestamp': '2024-08-16T00:42:03Z'},
  {'revid': 1241161523,
   'parentid': 1240555941,
   'timestamp': '2024-08-19T17:24:29Z'},
  {'revid': 1241164520,
   'parentid': 1241161523,
   'timestamp': '2024-08-19T17:44:31Z'},
  {'revid': 1241713378,
   'parentid': 1241164520,
   'timestamp': '2024-08-22T19:06:15Z'},
  {'revid': 1241947299,
   'parentid': 1241713378,
   'timestamp': '2024-08-24T02:32:30Z'}],
 'Michael Phelps': [{'revid': 1240180315,
   'parentid': 1239158237,
   'timestamp': '2024-08-13T23:48:44Z'},
  {'revid': 1240182154,
   'parentid': 1240180315,
   'timestamp': '2024-08-14T00:02:01Z'},
  {'revid': 1240183040,
   'parentid': 1240182154,
   'timestamp': '2024-08-14T00:06:57Z'},
  {'revid': 1240190010,
   'parentid': 1240183040,
   'timestamp': '2024-08-14T01:04:11Z'},
  {'revid': 1240201578,
   'parentid': 1240190010,
   'timestamp': '2024-08-14T02:46:49Z'},
  {'revid': 1240202982,
   'parentid': 124

### Data mode

This mode uses `get_revisions_data`.

In [12]:
# call function with pagemaps object supplied
data_revision = await wikitoolkit.pipeline_revisions('en.wikipedia', user_agent=my_agent,
                                                    mode='data', revids=revision_ids,
                                                    rf_args={'props': ['timestamp', 'ids',
                                                                      'size', 'comment','user']},
                                                    pagemaps=pagemaps)
data_revision

{1236488217: {'parentid': 1236487199,
  'user': 'JRDkg',
  'timestamp': '2024-07-25T00:27:42Z',
  'size': 243454,
  'comment': 'secondary classification...Tamil is not a separate country'},
 1236305118: {'parentid': 1236300833,
  'user': '178.84.53.53',
  'anon': True,
  'timestamp': '2024-07-24T00:12:12Z',
  'size': 158909,
  'comment': ''},
 1236428488: {'parentid': 1236428120,
  'user': 'LifeJustKnowYouthfulness',
  'timestamp': '2024-07-24T17:06:17Z',
  'size': 197106,
  'comment': '/* Plot */'},
 1236436502: {'parentid': 1236435702,
  'user': 'MutantX13',
  'timestamp': '2024-07-24T18:04:05Z',
  'size': 197622,
  'comment': ''},
 1236453299: {'parentid': 1236451598,
  'user': 'Sc2353',
  'timestamp': '2024-07-24T19:58:08Z',
  'size': 197401,
  'comment': 'updated RT info'},
 1236992079: {'parentid': 1236991794,
  'user': 'BarntToust',
  'timestamp': '2024-07-27T15:25:20Z',
  'size': 223617,
  'comment': 'Undid revision [[Special:Diff/1236991794|1236991794]] by [[Special:Contributi

### Content mode

This mode uses `get_revisions_content`.

In [13]:
# call function with pagemaps object supplied
content_revision = await wikitoolkit.pipeline_revisions('en.wikipedia', user_agent=my_agent,
                                                        mode='content', revids=revision_ids,
                                                        pagemaps=pagemaps)
content_revision

 1236305118: '{{Short description|Multi-sport event in Paris, France}}\n{{Use British English|date=October 2019}}\n{{Use dmy dates|date=July 2024}}\n{{Redirect-multi|2|Paris 2024|2024 Olympics|the Summer Paralympics|2024 Summer Paralympics|the Winter Youth Olympics in Gangwon, South Korea|2024 Winter Youth Olympics}}\n{{Infobox Olympic games|2024|Summer|Olympics|\n|image = 2024 Summer Olympics logo.svg\n|image_size = 220\n|caption = Emblem of the 2024 Summer Olympics\n|host_city = [[Paris]], France\n|motto =\'\'Games wide open\'\' ({{lang-fr|Ouvrons grand les Jeux}})<ref>{{cite web|url=https://olympics.com/ioc/news/new-paris-2024-slogan-games-wide-open-welcomed-by-ioc-president|title=New Paris 2024 slogan "Games wide open" welcomed by IOC President|date=25 July 2022|publisher=International Paralympic Committee|language=en|access-date=25 July 2022|archive-url=https://web.archive.org/web/20220726043101/https://olympics.com/ioc/news/new-paris-2024-slogan-games-wide-open-welcomed-by-ioc-pr