# Pageviews Demo

This demo covers two functions for collecting pageview data from Wikipedia:
1. `api_article_views` - collects daily or monthly pageview data from the API for a list of articles.
2. `pipeline_api_article_views` - a convenience wrapper that, in addition to the above, sets up the client/session and redirect maps.

## Setup

In [1]:
import wikitoolkit
import pandas as pd

my_agent = 'mwapi testing <p.gildersleve@lse.ac.uk>'
wtsession = wikitoolkit.WTSession('en.wikipedia', user_agent=my_agent)

toparts = pd.read_csv('data/topviews-2024_07_31.csv')
artlist = toparts['Page'].unique().tolist() # ~1000 top articles for the day covered by the topviews export

## `api_article_views`

In [2]:
pageviews = wikitoolkit.api_article_views(wtsession, 'en.wikipedia', artlist[:10])
pd.DataFrame(pageviews).T.head()



| Date | Simone Biles | Ismail Haniyeh | 2024 Summer Olympics | Kamala Harris | Deadpool & Wolverine | Katie Ledecky | Sunisa Lee | MyKayla Skinner | Jonathan Owens | Michael Phelps |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024-08-21 | 23083 | 4050 | 35565 | 405214 | 120290 | 5068 | 5334 | 799 | 6820 | 8074 |
| 2024-08-22 | 18968 | 4127 | 32923 | 286921 | 107200 | 4668 | 4851 | 666 | 5323 | 8407 |
| 2024-08-23 | 17403 | 4170 | 30850 | 838677 | 121971 | 4146 | 4273 | 612 | 6051 | 6798 |
| 2024-08-24 | 17071 | 3767 | 25497 | 218058 | 183183 | 4105 | 4038 | 560 | 4976 | 5961 |
| 2024-08-25 | 16468 | 5642 | 26604 | 143746 | 204169 | 3738 | 3678 | 593 | 5003 | 6132 |
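
For orientation, the public Wikimedia REST API exposes per-article pageviews at a URL built from the project, access method, agent type, article title, granularity, and a `YYYYMMDD` date range. The sketch below only builds such a URL; it makes no network request. Whether `api_article_views` calls this exact endpoint with these defaults is an assumption, and the helper name `per_article_url` is hypothetical, not part of wikitoolkit.

```python
from urllib.parse import quote

# Base path of the Wikimedia per-article pageviews REST endpoint.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, title, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL (hypothetical helper, for illustration).

    start and end are date strings in YYYYMMDD form.
    The default access/agent/granularity values are assumptions chosen for
    this sketch, not necessarily what wikitoolkit uses.
    """
    # Titles use underscores for spaces and must be percent-encoded.
    article = quote(title.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}"


url = per_article_url("en.wikipedia", "Simone Biles", "20240821", "20240825")
print(url)
```

Passing `granularity="monthly"` would request monthly rather than daily totals.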


## `pipeline_api_article_views`

This convenience function wraps `api_article_views`: it sets up the session/client, resolves redirects with PageMaps, and collects pageview data, so no manual `wtsession` setup is required. The figures are slightly higher than in the previous example because pageviews for each article's redirects are included.

In [3]:
pageviews_p, pagemaps = await wikitoolkit.pipeline_api_article_views('en.wikipedia', my_agent,
                                                        artlist[:10])
# Additionally returns new pagemaps object (if not supplied), storing redirects, normalizations, and page ids
# It's recommended to create a single pagemaps object in a project and update it with each call
print(pagemaps)
pd.DataFrame(pageviews_p).T.head()

Redirects: 181, Norms: 0, IDs: 181, Existing: 10


| Date | Simone Biles | Ismail Haniyeh | 2024 Summer Olympics | Kamala Harris | Deadpool & Wolverine | Katie Ledecky | Sunisa Lee | MyKayla Skinner | Jonathan Owens | Michael Phelps |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024-08-21 | 23086 | 4117 | 36121 | 406860 | 121141 | 5069 | 5396 | 812 | 6820 | 8346 |
| 2024-08-22 | 18971 | 4185 | 33402 | 288231 | 108062 | 4668 | 4948 | 679 | 5323 | 8701 |
| 2024-08-23 | 17407 | 4237 | 31285 | 841451 | 122693 | 4149 | 4352 | 627 | 6051 | 7063 |
| 2024-08-24 | 17075 | 3818 | 25842 | 218944 | 183946 | 4107 | 4105 | 569 | 4976 | 6195 |
| 2024-08-25 | 16472 | 5705 | 26942 | 144501 | 204876 | 3738 | 3753 | 605 | 5003 | 6376 |


In [4]:
# Demo that views from redirects can be considerable!

# Ratio of redirect-inclusive views to direct views (rows: dates, columns: articles)
ratio = pd.DataFrame(pageviews_p).T / pd.DataFrame(pageviews).T
rd_max_increase = ratio.max().max()  # largest ratio overall
rd_max_increase_ix1 = ratio.max().idxmax()  # article with the largest ratio
rd_max_increase_ix2 = ratio.max(axis=1).idxmax().strftime('%Y-%m-%d')  # date of the largest ratio
print(f"Max increase in pageviews from redirects ={(rd_max_increase-1)*100:.2f}% for {rd_max_increase_ix1} on {rd_max_increase_ix2}")

Max increase in pageviews from redirects =10.41% for Michael Phelps on 2024-09-18
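
One way to picture where the extra views come from: counts recorded under redirect titles are added to the counts of the article they point to. The following is a minimal sketch of that idea in plain Python, assuming views keyed by title and a simple redirect-to-target mapping. The function `merge_redirect_views`, the titles, and the numbers are all made up for illustration; this is not wikitoolkit's implementation or its data layout.

```python
from collections import defaultdict


def merge_redirect_views(views, redirects):
    """Sum per-title daily views into their redirect targets.

    views:     {title: {date: count}}
    redirects: {redirect_title: canonical_title}
    Titles absent from `redirects` are treated as canonical.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for title, daily in views.items():
        canonical = redirects.get(title, title)
        for date, count in daily.items():
            merged[canonical][date] += count
    # Convert nested defaultdicts back to plain dicts for easy comparison.
    return {title: dict(daily) for title, daily in merged.items()}


# Made-up data: one canonical article and one redirect pointing to it.
views = {
    "Article A": {"2024-08-21": 100, "2024-08-22": 90},
    "Old title of A": {"2024-08-21": 5},
}
redirects = {"Old title of A": "Article A"}

merged = merge_redirect_views(views, redirects)
print(merged)
```

In this toy case the redirect contributes 5 of the 105 views on the first date, which is the kind of difference the ratio above measures.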


In [5]:
# call function with pagemaps object supplied
pageviews_old = await wikitoolkit.pipeline_api_article_views('en.wikipedia', my_agent,
                                                        artlist[10:20], pagemaps=pagemaps,
                                                        aav_args={'start':'20170701',
                                                                 'end':'20170731'})

# pagemaps object is updated with new redirects and normalizations
print(pagemaps)
display(pd.DataFrame(pageviews_old).T.head())

Redirects: 269, Norms: 0, IDs: 269, Existing: 20


| Date | Alex Yee | Jordan Chiles | Imane Khelif | Donald J. Harris | India at the 2024 Summer Olympics | Léon Marchand | Hamas | Assassination of Ismail Haniyeh | Cleopatra | Deaths in 2024 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017-07-01 | 0 | 12 | 0 | 0 | 0 | 0 | 1678 | 0 | 5813 | 1831 |
| 2017-07-02 | 0 | 15 | 0 | 0 | 0 | 0 | 2126 | 0 | 6657 | 1753 |
| 2017-07-03 | 0 | 8 | 0 | 0 | 0 | 0 | 2721 | 0 | 6456 | 2007 |
| 2017-07-04 | 0 | 11 | 0 | 0 | 0 | 0 | 3385 | 0 | 5972 | 2045 |
| 2017-07-05 | 0 | 7 | 0 | 0 | 0 | 0 | 3656 | 0 | 6197 | 2311 |
