# Topics Demo

This notebook demonstrates two functions for collecting predicted topic data from Wikipedia:
1. `get_articles_topics` - Takes a list of article titles or revision IDs and returns the predicted topics for each article.
2. `pipeline_topics` - A convenience wrapper that, in addition to the above, sets up the session and redirect maps.

## Setup

In [1]:
import wikitoolkit
import pandas as pd

my_agent = 'mwapi testing <p.gildersleve@lse.ac.uk>'
wtsession = wikitoolkit.WTSession('en.wikipedia', user_agent=my_agent)
pagemaps = wikitoolkit.PageMaps() # see demo_redirects.ipynb for more info

toparts = pd.read_csv('data/topviews-2024_07_31.csv')
artlist = toparts['Page'].unique().tolist() # ~1000 top articles yesterday
revision_ids = [1236428488,
                1236453299,
                1237461948,
                1237046423,
                1237232495,
                1236992079,
                1236436502,
                1236488217,
                1236305118,
                1237376589] # 10 random revision ids

## `get_articles_topics`

By default, this collects topics based on the `outlink-topic-model` model.

In [2]:
a_topics = await wikitoolkit.get_articles_topics(wtsession, artlist[:10],
                                                 lang='en', pagemaps=pagemaps)
pd.concat({k: pd.Series(v) for k, v in a_topics.items()}).reset_index().rename(
    columns={'level_0': 'article', 'level_1': 'topic', 0: 'score'})

| | article | topic | score |
|---|---|---|---|
| 0 | Simone Biles | Culture.Sports | 0.985506 |
| 1 | Simone Biles | Culture.Biography.Biography* | 0.890304 |
| 2 | Simone Biles | Culture.Biography.Women | 0.754925 |
| 3 | Ismail Haniyeh | Geography.Regions.Asia.West_Asia | 0.995105 |
| 4 | Ismail Haniyeh | Geography.Regions.Asia.Asia* | 0.987578 |
| 5 | 2024 Summer Olympics | Culture.Sports | 0.993106 |
| 6 | 2024 Summer Olympics | Geography.Regions.Europe.Europe* | 0.637041 |
| 7 | 2024 Summer Olympics | Geography.Regions.Europe.Western_Europe | 0.546748 |
| 8 | Kamala Harris | Geography.Regions.Americas.North_America | 0.939923 |
| 9 | Kamala Harris | Culture.Biography.Biography* | 0.863402 |


Alternatively, the older ORES `articletopic`/`drafttopic` models can be used:

In [3]:
a_topics = await wikitoolkit.get_articles_topics(wtsession, revids=revision_ids, lang='en',
                                                 model='enwiki-drafttopic', pagemaps=pagemaps)
pd.concat({k: pd.Series(v) for k, v in a_topics.items()}).reset_index().rename(
    columns={'level_0': 'revision ID', 'level_1': 'topic', 0: 'score'})

| | revision ID | topic | score |
|---|---|---|---|
| 0 | 1236428488 | Culture.Biography.Biography* | 0.049051 |
| 1 | 1236428488 | Culture.Biography.Women | 0.030239 |
| 2 | 1236428488 | Culture.Food and drink | 0.001738 |
| 3 | 1236428488 | Culture.Internet culture | 0.863449 |
| 4 | 1236428488 | Culture.Linguistics | 0.088043 |
| ... | ... | ... | ... |
| 635 | 1237376589 | STEM.Medicine & Health | 0.002519 |
| 636 | 1237376589 | STEM.Physics | 0.001045 |
| 637 | 1237376589 | STEM.STEM* | 0.023641 |
| 638 | 1237376589 | STEM.Space | 0.000127 |
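
The ORES models return a score for every topic, so the table above runs to 639 rows for 10 revisions, most of them near zero. As a minimal sketch, assuming the result is a dict of the form `{revision ID: {topic: score}}` (which is what the reshaping code above implies), low-scoring topics can be dropped before building the DataFrame. The helper `filter_topics` and the 0.5 threshold are illustrative choices, not part of `wikitoolkit`; the sample scores are copied from the output above.

```python
# Sample in the shape the reshaping code above implies: {revision ID: {topic: score}}.
# Scores are copied from the output above; only a few topics per revision are included.
sample_topics = {
    1236428488: {
        'Culture.Biography.Biography*': 0.049051,
        'Culture.Biography.Women': 0.030239,
        'Culture.Food and drink': 0.001738,
        'Culture.Internet culture': 0.863449,
        'Culture.Linguistics': 0.088043,
    },
    1237376589: {
        'STEM.Medicine & Health': 0.002519,
        'STEM.Physics': 0.001045,
        'STEM.STEM*': 0.023641,
        'STEM.Space': 0.000127,
    },
}


def filter_topics(topics, threshold=0.5):
    """Keep only topics scoring at or above `threshold` (hypothetical helper)."""
    return {
        key: {topic: score for topic, score in scores.items() if score >= threshold}
        for key, scores in topics.items()
    }


confident = filter_topics(sample_topics, threshold=0.5)
print(confident)
```

Revisions with no topic above the threshold are kept with an empty dict, so it stays visible which revisions were queried.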


## `pipeline_topics`

This convenience function wraps `get_articles_topics`: in one call it sets up the session, resolves redirects with `PageMaps`, and collects the article topic data. No manual setup of `wtsession` is required.

In [4]:
p_topics = await wikitoolkit.pipeline_topics('en.wikipedia', my_agent, titles=artlist[:10],
                                             pagemaps=pagemaps)
# Results are keyed by article title here, since titles (not revision IDs) were passed in.
pd.concat({k: pd.Series(v) for k, v in p_topics.items()}).reset_index().rename(
    columns={'level_0': 'article', 'level_1': 'topic', 0: 'score'})

| | article | topic | score |
|---|---|---|---|
| 0 | Simone Biles | Culture.Sports | 0.985506 |
| 1 | Simone Biles | Culture.Biography.Biography* | 0.890304 |
| 2 | Simone Biles | Culture.Biography.Women | 0.754925 |
| 3 | Ismail Haniyeh | Geography.Regions.Asia.West_Asia | 0.995105 |
| 4 | Ismail Haniyeh | Geography.Regions.Asia.Asia* | 0.987578 |
| 5 | 2024 Summer Olympics | Culture.Sports | 0.993106 |
| 6 | 2024 Summer Olympics | Geography.Regions.Europe.Europe* | 0.637041 |
| 7 | 2024 Summer Olympics | Geography.Regions.Europe.Western_Europe | 0.546748 |
| 8 | Kamala Harris | Geography.Regions.Americas.North_America | 0.939923 |
| 9 | Kamala Harris | Culture.Biography.Biography* | 0.863402 |
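
A common next step is to reduce the long table to a single best topic per article. The sketch below assumes the result is a dict of the form `{title: {topic: score}}`, as the reshaping code above implies; `sample_topics` is a stand-in built from the scores shown in the output above, and `top_topic` is an illustrative name.

```python
import pandas as pd

# Stand-in for the pipeline result, using the scores shown in the output above.
sample_topics = {
    'Simone Biles': {
        'Culture.Sports': 0.985506,
        'Culture.Biography.Biography*': 0.890304,
        'Culture.Biography.Women': 0.754925,
    },
    'Ismail Haniyeh': {
        'Geography.Regions.Asia.West_Asia': 0.995105,
        'Geography.Regions.Asia.Asia*': 0.987578,
    },
    '2024 Summer Olympics': {
        'Culture.Sports': 0.993106,
        'Geography.Regions.Europe.Europe*': 0.637041,
        'Geography.Regions.Europe.Western_Europe': 0.546748,
    },
    'Kamala Harris': {
        'Geography.Regions.Americas.North_America': 0.939923,
        'Culture.Biography.Biography*': 0.863402,
    },
}

# Same long-format reshaping as in the cells above.
long_df = (
    pd.concat({k: pd.Series(v) for k, v in sample_topics.items()})
    .reset_index()
    .rename(columns={'level_0': 'article', 'level_1': 'topic', 0: 'score'})
)

# For each article, pick the row holding its highest score.
best_rows = long_df.loc[long_df.groupby('article')['score'].idxmax()]
top_topic = best_rows.set_index('article')['topic']
print(top_topic)
```

Grouping on the long format keeps the approach independent of how many topics each article has.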
