# Pageviews Demo

We demo two functions for collecting pageview data from Wikipedia:
1. `api_article_views`: collects daily or monthly pageview data from the API for a list of articles.
2. `pipeline_api_article_views`: a convenience wrapper that, in addition to the above, sets up the client/session and redirect maps.

## Setup

In [1]:
import mwapi
from mwviews.api import PageviewsClient
import wikitools
import pandas as pd

my_agent = 'mwapi testing <p.gildersleve@lse.ac.uk>'
async_session = mwapi.AsyncSession('https://en.wikipedia.org',
                    formatversion=2, user_agent=my_agent)
client = PageviewsClient(user_agent=my_agent)


toparts = pd.read_csv('data/topviews-2024_07_31.csv')
artlist = toparts['Page'].unique().tolist() # ~1000 most-viewed articles (topviews export dated 2024-07-31)

## `api_article_views`

In [2]:
pageviews = wikitools.api_article_views(client, 'en.wikipedia', artlist[:10])
pd.DataFrame(pageviews).T.head()



| Date | Simone Biles | Ismail Haniyeh | 2024 Summer Olympics | Kamala Harris | Deadpool & Wolverine | Katie Ledecky | Sunisa Lee | MyKayla Skinner | Jonathan Owens | Michael Phelps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-08-07 | 190115 | 30287 | 440037 | 260067 | 285567 | 41610 | 60855 | 33579 | 46432 | 70891 |
| 2024-08-08 | 134676 | 20383 | 455510 | 175514 | 283432 | 37031 | 47331 | 15647 | 31380 | 69258 |
| 2024-08-09 | 112944 | 13826 | 473874 | 157382 | 302166 | 30021 | 34413 | 15235 | 26025 | 62647 |
| 2024-08-10 | 100528 | 8949 | 498718 | 155136 | 305052 | 26015 | 27763 | 7944 | 25703 | 52213 |
| 2024-08-11 | 122659 | 8384 | 600182 | 144764 | 408243 | 58699 | 31403 | 5419 | 28803 | 59099 |


## `pipeline_api_article_views`

This convenience function wraps `api_article_views`: it sets up the session and client, resolves redirects with PageMaps, and collects the pageview data, so no manual setup of `async_session` or `client` is required. The figures are slightly higher than in the previous example because they include pageviews for redirects to each article.

In [3]:
pageviews_p, pagemaps = await wikitools.pipeline_api_article_views('en.wikipedia', my_agent,
                                                        artlist[:10])
# Additionally returns new pagemaps object (if not supplied), storing redirects, normalizations, and page ids
# It's recommended to create a single pagemaps object in a project and update it with each call
print(pagemaps)
pd.DataFrame(pageviews_p).T.head()

Redirects: 178, Norms: 0, IDs: 178, Existing: 10


| Date | Simone Biles | Ismail Haniyeh | 2024 Summer Olympics | Kamala Harris | Deadpool & Wolverine | Katie Ledecky | Sunisa Lee | MyKayla Skinner | Jonathan Owens | Michael Phelps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-08-07 | 190134 | 30587 | 444131 | 261756 | 286689 | 41612 | 61497 | 33688 | 46432 | 71455 |
| 2024-08-08 | 134697 | 20656 | 459393 | 176687 | 284538 | 37031 | 47860 | 15707 | 31380 | 69695 |
| 2024-08-09 | 112958 | 13984 | 478444 | 158476 | 303219 | 30021 | 34816 | 15287 | 26025 | 63111 |
| 2024-08-10 | 100541 | 9064 | 502671 | 156080 | 306130 | 26015 | 28148 | 7987 | 25703 | 52617 |
| 2024-08-11 | 122673 | 8508 | 605266 | 145665 | 409579 | 58701 | 31811 | 5470 | 28803 | 59488 |
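
The higher figures come from adding each redirect's views to its target article. Below is a minimal sketch of that folding step in plain Python. The function name `fold_redirect_views`, the input shapes, and the titles and counts are all hypothetical and made up for illustration; this is not necessarily how `wikitools` implements it.

```python
from collections import defaultdict


def fold_redirect_views(views, redirects):
    """Sum per-title view counts into their redirect targets.

    views:     {title: view_count} for a single day (assumed shape)
    redirects: {redirect_title: target_title} (assumed shape)
    """
    totals = defaultdict(int)
    for title, count in views.items():
        # A title that is not a redirect maps to itself
        target = redirects.get(title, title)
        totals[target] += count
    return dict(totals)


# Made-up titles and counts, for illustration only
views = {"Target page": 1000, "Redirect one": 15, "Redirect two": 5, "Other page": 300}
redirects = {"Redirect one": "Target page", "Redirect two": "Target page"}

folded = fold_redirect_views(views, redirects)
print(folded)
```

Keeping the redirect map separate from the counts is what makes it reusable across calls, which is presumably why the pipeline returns a pagemaps object alongside the pageviews.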


In [4]:
# Demo that views from redirects can be considerable!

# Ratio of redirect-inclusive views to direct views (rows: dates, columns: articles)
ratio = pd.DataFrame(pageviews_p).T / pd.DataFrame(pageviews).T
rd_max_increase = ratio.max().max()
rd_max_increase_ix1 = ratio.max().idxmax()  # article with the largest ratio
rd_max_increase_ix2 = ratio.max(axis=1).idxmax().strftime('%Y-%m-%d')  # date with the largest ratio
print(f"Max increase in pageviews from redirects ={(rd_max_increase-1)*100:.2f}% for {rd_max_increase_ix1} on {rd_max_increase_ix2}")

Max increase in pageviews from redirects =7.36% for Michael Phelps on 2024-09-04
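
To make the calculation concrete, here is a small sketch that applies the same percentage formula to a few 2024-08-07 values copied from the two tables above. The helper name `pct_increase` and the zero-view guard are additions for illustration, not part of `wikitools`.

```python
def pct_increase(direct, with_redirects):
    """Percentage increase of redirect-inclusive views over direct views.

    Returns None when there are no direct views, because the ratio is undefined.
    """
    if direct == 0:
        return None
    return (with_redirects / direct - 1) * 100


# (direct views, views including redirects) on 2024-08-07, from the tables above
views_0807 = {
    "Simone Biles": (190115, 190134),
    "Ismail Haniyeh": (30287, 30587),
    "Sunisa Lee": (60855, 61497),
    "Michael Phelps": (70891, 71455),
}

increases = {article: pct_increase(d, r) for article, (d, r) in views_0807.items()}
top_article = max(increases, key=increases.get)
print(f"{top_article}: {increases[top_article]:.2f}%")
```

On this single day the largest increase among these four articles is about 1%. The notebook cell takes the maximum over every date returned, so its result can be much larger. The guard matters for articles with zero direct views on a given day, where a plain division would produce `inf` or `NaN` in pandas.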


In [5]:
# call function with pagemaps object supplied
pageviews_old = await wikitools.pipeline_api_article_views('en.wikipedia', my_agent,
                                                        artlist[10:20], pagemaps=pagemaps,
                                                        aav_args={'start':'20170701',
                                                                 'end':'20170731'})

# pagemaps object is updated with new redirects and normalizations
print(pagemaps)
display(pd.DataFrame(pageviews_old).T.head())

Redirects: 268, Norms: 0, IDs: 268, Existing: 20


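| Date | Alex Yee | Jordan Chiles | Imane Khelif | Donald J. Harris | India at the 2024 Summer Olympics | Léon Marchand | Hamas | Assassination of Ismail Haniyeh | Cleopatra | Deaths in 2024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2017-07-01 | 0 | 12 | 0 | 0 | 0 | 0 | 1678 | 0 | 5813 | 1831 |
| 2017-07-02 | 0 | 15 | 0 | 0 | 0 | 0 | 2126 | 0 | 6657 | 1753 |
| 2017-07-03 | 0 | 8 | 0 | 0 | 0 | 0 | 2721 | 0 | 6456 | 2007 |
| 2017-07-04 | 0 | 11 | 0 | 0 | 0 | 0 | 3385 | 0 | 5972 | 2045 |
| 2017-07-05 | 0 | 7 | 0 | 0 | 0 | 0 | 3656 | 0 | 6197 | 2311 |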
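
The `start` and `end` values in the call above are `YYYYMMDD` strings. A small sketch for building such a pair for a whole calendar month, assuming `aav_args` accepts these two keys in that format as it does in the cell above; the helper name `month_range_args` is hypothetical.

```python
import calendar
from datetime import date


def month_range_args(year, month):
    """Return {'start': ..., 'end': ...} covering one calendar month as YYYYMMDD strings."""
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    return {
        "start": date(year, month, 1).strftime("%Y%m%d"),
        "end": date(year, month, last_day).strftime("%Y%m%d"),
    }


july_2017 = month_range_args(2017, 7)
print(july_2017)
```

The resulting dict could then be passed as `aav_args=july_2017`. Deriving the last day from `calendar.monthrange` avoids hard-coding month lengths and handles leap years.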
