# Link Demo

This notebook demonstrates two functions for collecting link data from Wikipedia:
1. `get_links` - collects different types of links for a list of pages.
2. `pipeline_get_links` - a convenience wrapper that, in addition to the above, sets up the session and redirect maps.

## Setup

In [1]:
import mwapi
import wikitoolkit
import pandas as pd

my_agent = 'mwapi testing <p.gildersleve@lse.ac.uk>'
async_session = mwapi.AsyncSession('https://en.wikipedia.org',
                    formatversion=2, user_agent=my_agent)
toparts = pd.read_csv('data/topviews-2024_07_31.csv')
artlist = toparts['Page'].unique().tolist() # ~1000 top articles yesterday
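
The cells in this notebook use top-level `await`, which Jupyter supports but a plain Python script does not. Below is a minimal sketch of the script equivalent using `asyncio.run`. `fake_get_links` is a hypothetical stand-in so the example runs without network access; it is not part of wikitoolkit.

```python
import asyncio


async def fake_get_links(titles):
    """Hypothetical stand-in for an awaitable such as wikitoolkit.get_links."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return {title: [] for title in titles}


async def main():
    # In a real script, the session setup and the wikitoolkit calls would go here.
    return await fake_get_links(["Kamala Harris", "Simone Biles"])


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
result = asyncio.run(main())
print(result)
```

Inside a notebook, `asyncio.run` is unnecessary because an event loop is already running; use `await` directly as the cells here do.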


## `get_links`

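`get_links` collects in-links, out-links, interwiki links, language links, and external links to or from a list of articles. In the cells below, the `mode` argument selects the link type and is passed either as a single string (such as `'out'`) or as a list of strings.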

### Out-links

In [2]:
outlinks = await wikitoolkit.get_links(async_session, mode='out', titles=artlist[:10])
edges = pd.DataFrame(set([(k, x['title']) for k, v in outlinks.items() for x in v]),
            columns=['source', 'target'])
edges

Getting out-links


,source,target
0,Deadpool & Wolverine,Chicago Sun-Times
1,Simone Biles,Gymnastics at the 1992 Summer Olympics – Women...
2,Kamala Harris,Dan Quayle
3,Kamala Harris,Amazon (company)
4,Kamala Harris,Electoral history of Kamala Harris
...,...,...
9725,Kamala Harris,Results of the 2024 Democratic Party president...
9726,Kamala Harris,National Council of Negro Women
9727,Michael Phelps,2009 Duel in the Pool
9728,2024 Summer Olympics,South Korea at the 2024 Summer Olympics
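
The comprehension in the cell above flattens a mapping from source title to a list of link records into `(source, target)` pairs. Below is a minimal sketch of that step in plain Python, assuming the return shape implied by the cell (a dict of lists of dicts with a `'title'` key). `links_to_edges` and the sample data are hypothetical and not part of wikitoolkit.

```python
def links_to_edges(links, key="title"):
    """Flatten {source: [{key: target, ...}, ...]} into sorted, unique (source, target) pairs."""
    edges = {
        (source, record[key])
        for source, records in links.items()
        for record in records
    }
    # Sorting gives a deterministic order, which a bare set does not guarantee.
    return sorted(edges)


# Hypothetical sample in the shape the demo cell appears to consume.
sample_outlinks = {
    "Kamala Harris": [
        {"title": "Dan Quayle"},
        {"title": "Amazon (company)"},
        {"title": "Dan Quayle"},  # duplicate, removed by the set
    ],
    "Michael Phelps": [{"title": "2009 Duel in the Pool"}],
}

edges = links_to_edges(sample_outlinks)
for source, target in edges:
    print(f"{source} -> {target}")
```

For external links, the same helper would be called with `key="url"`, since the cells here read the target from `x['url']` in that case.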


### External links

In [3]:
extlinks = await wikitoolkit.get_links(async_session, mode=['ext'], titles=artlist[:10])
edges = pd.DataFrame([{'source': k, 'target': x['url']}
                      for k, v in extlinks.items() for x in v],
                     columns=['source', 'target'])
edges

Getting ext-links


,source,target
0,Kamala Harris,https://www.mercurynews.com/2020/08/18/heres-k...
1,Kamala Harris,https://web.archive.org/web/20201014130548/htt...
2,Kamala Harris,https://www.harris.senate.gov/about
3,Kamala Harris,https://abcnews.go.com/Politics/kamala-harris-...
4,Kamala Harris,https://www.washingtonpost.com/politics/kamala...
...,...,...
3476,Jonathan Owens,https://www.chicagobears.com/news/roster-moves...
3477,Jonathan Owens,https://web.archive.org/web/20220528191717/htt...
3478,Jonathan Owens,https://www.chicagobears.com/team/players-rost...
3479,Jonathan Owens,https://www.nfl.com/news/simone-biles-bears-gr...


### All kinds of link

In [4]:
alllinks = await wikitoolkit.get_links(async_session, mode=['in', 'out', 'lang',
                                                          'interwiki', 'ext'],
                                     titles=artlist[:10])

# Create a dataframe for each type of link
wikiedges = pd.DataFrame(set([(k, x['title']) for k, v in alllinks['out'].items()
                              for x in v] +
                             [(x['title'], k) for k, v in alllinks['in'].items()
                              for x in v]), columns=['source', 'target'])
extedges = pd.DataFrame([{'source': k, 'target': x['url']}
                         for k, v in alllinks['ext'].items() for x in v])
interwikiedges = pd.concat({k: pd.DataFrame(v) for k, v in alllinks['interwiki'].items()}
                            ).reset_index(drop=True, level=1).reset_index(
                            ).rename(columns={'index': 'source',
                                              'prefix': 'target_prefix',
                                              'title': 'target_title'})
langedges = pd.concat({k: pd.DataFrame(v).T for k, v in alllinks['lang'].items()}
                     ).reset_index().rename(columns={'level_0': 'source',
                                                     'level_1': 'target_lang',
                                                     0: 'target_title'})

display(wikiedges.head(), extedges.head(), interwikiedges.head(), langedges.head())

Getting in-links
Getting out-links
Getting lang-links
Getting interwiki-links
Getting ext-links


,source,target
0,Steve Gleason,Simone Biles
1,Morgan Pearson,2024 Summer Olympics
2,Alma Valencia,2024 Summer Olympics
3,List of years in film,Deadpool & Wolverine
4,Abdelrahman El-Sayed,2024 Summer Olympics


,source,target
0,Kamala Harris,https://www.mercurynews.com/2020/08/18/heres-k...
1,Kamala Harris,https://web.archive.org/web/20201014130548/htt...
2,Kamala Harris,https://www.harris.senate.gov/about
3,Kamala Harris,https://abcnews.go.com/Politics/kamala-harris-...
4,Kamala Harris,https://www.washingtonpost.com/politics/kamala...


,source,target_prefix,target_title
0,Kamala Harris,c,Category:Kamala_Harris
1,Kamala Harris,commons,Category:Kamala_Harris
2,Kamala Harris,commons,Vice_President_of_the_United_States
3,Kamala Harris,d,Q10853588
4,Kamala Harris,n,Category:Kamala_Harris


,source,target_lang,target_title
0,Kamala Harris,af,Kamala Harris
1,Kamala Harris,an,Kamala Harris
2,Kamala Harris,ar,كامالا هاريس
3,Kamala Harris,arz,كامالا هاريس
4,Kamala Harris,as,কমলা হেৰিছ


## `pipeline_get_links`

`pipeline_get_links` is a convenience wrapper around `get_links`: it sets up the session, fixes redirects with PageMaps (see redirectsdemo.ipynb), and collects link data, so no manual setup of the `async_session` is required. Arguments intended for `get_links` are passed through `gl_args`.

In [5]:
links, pagemaps = await wikitoolkit.pipeline_get_links('en.wikipedia', my_agent,
                                                titles=artlist[:10],
                                                gl_args={'mode':['out', 'in'],
                                                        'update_maps':True})

# Get list of articles and edges (excluding redlinks, those with no page id)
articles = set([x['title'] for y in links['in'].values() for x in y] +
                [x['title'] for y in links['out'].values() for x in y
                 if pagemaps.id_map[x['title']] != -1])
edges = pd.DataFrame(set([(k, x['title']) for k, v in links['out'].items()
                          for x in v if pagemaps.id_map[x['title']] != -1] +
                         [(x['title'], k) for k, v in links['in'].items()
                          for x in v]), columns=['source', 'target'])

# Additionally returns new pagemaps object (if not supplied), storing redirects, normalizations, and page ids
# It's recommended to create a single pagemaps object in a project and update it with each call
print(pagemaps)

print(len(articles), len(edges))
display(edges.head())


Getting out-links
Getting in-links
Redirects: 1091, Norms: 0, IDs: 18733, Existing: 0
17642 25857


,source,target
0,Steve Gleason,Simone Biles
1,Morgan Pearson,2024 Summer Olympics
2,Alma Valencia,2024 Summer Olympics
3,List of years in film,Deadpool & Wolverine
4,Abdelrahman El-Sayed,2024 Summer Olympics
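
The cell above drops redlinks by checking each out-link target against `pagemaps.id_map` and keeping only targets whose id is not `-1`. Below is a minimal sketch of that filter, with a plain dict standing in for `id_map`. The function name, the sample titles, and the page ids are hypothetical. Unlike the cell, the sketch treats a title missing from the map as a redlink instead of raising `KeyError`, which is a design choice of this example only.

```python
def existing_edges(out_links, in_links, id_map):
    """Combine out- and in-links into (source, target) pairs, skipping redlink targets."""
    edges = set()
    for source, records in out_links.items():
        for record in records:
            # Assumption: a page id of -1 marks a page that does not exist (a redlink).
            if id_map.get(record["title"], -1) != -1:
                edges.add((source, record["title"]))
    for target, records in in_links.items():
        for record in records:
            # In-links originate from existing pages, so the cell applies no filter here.
            edges.add((record["title"], target))
    return edges


# Hypothetical data: one real out-link, one redlink, and one in-link.
out_links = {"Simone Biles": [{"title": "Jonathan Owens"}, {"title": "No such page"}]}
in_links = {"Simone Biles": [{"title": "Steve Gleason"}]}
id_map = {"Simone Biles": 101, "Jonathan Owens": 102, "Steve Gleason": 103, "No such page": -1}

edges = existing_edges(out_links, in_links, id_map)
print(sorted(edges))
```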


In [6]:
# testing with page ids instead of titles
ids = [pagemaps.id_map[x] for x in artlist[:10]]
# supply pre-existing pagemaps object
links = await wikitoolkit.pipeline_get_links('en.wikipedia', my_agent,
                                                pageids=ids, pagemaps=pagemaps,
                                                gl_args={'mode':['out', 'in',
                                                                 'lang', 'ext',
                                                                 'interwiki'],
                                                        'update_maps':True})
print(pagemaps)
print(len(links['in']), len(links['out']), len(links['lang']),
        len(links['ext']), len(links['interwiki']))

Getting out-links
Getting in-links
Getting lang-links
Getting ext-links
Getting interwiki-links
Redirects: 1091, Norms: 0, IDs: 18733, Existing: 0
10 10 10 10 10


### Redirect demo

The pipeline function can also fix redirects and normalize titles as part of the process. The redirect has to be recorded on Wikipedia; the function cannot correct arbitrary typos.

In [7]:
bad_titles = [x.lower() for x in artlist[:10]] + ['thisisnotatitle']
links = await wikitoolkit.pipeline_get_links('en.wikipedia', my_agent,
                                                titles=bad_titles, pagemaps=pagemaps,
                                                gl_args={'mode':['out', 'in'],
                                                        'update_maps':True})

# Get list of articles and edges (excluding redlinks, those with no page id)
articles = set([x['title'] for y in links['in'].values() for x in y] +
                [x['title'] for y in links['out'].values() for x in y
                 if pagemaps.id_map[x['title']] != -1])
edges = pd.DataFrame(set([(k, x['title']) for k, v in links['out'].items()
                          for x in v if pagemaps.id_map[x['title']] != -1] +
                         [(x['title'], k) for k, v in links['in'].items()
                          for x in v]), columns=['source', 'target'])

# pagemaps has been updated in place with norms/redirects/page ids
print(pagemaps)

print(len(articles), len(edges))
display(edges.head())

Getting out-links
Getting in-links
Redirects: 1099, Norms: 10, IDs: 18741, Existing: 0
7213 10098


,source,target
0,"University of California College of the Law, S...",Kamala Harris
1,Nation of Islam,Kamala Harris
2,African-American leftism,Kamala Harris
3,Kamala Harris,Dan Quayle
4,Kamala Harris,Amazon (company)
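
The counts printed above (`Norms: 10`, extra redirects) suggest two lookup steps: a title is first normalized, then any recorded redirect is followed. Below is a minimal sketch of that resolution order using plain dicts. The maps, their contents, and `resolve_title` are hypothetical illustrations and do not reflect the internals of the PageMaps object.

```python
def resolve_title(title, norm_map, redirect_map, max_hops=5):
    """Apply a normalization lookup, then follow recorded redirects up to max_hops times."""
    resolved = norm_map.get(title, title)
    for _ in range(max_hops):
        if resolved not in redirect_map:
            break
        resolved = redirect_map[resolved]
    return resolved


# Hypothetical maps: normalization capitalizes the first letter,
# and a redirect is assumed to exist for the lower-cased surname.
norm_map = {
    "kamala harris": "Kamala harris",
    "thisisnotatitle": "Thisisnotatitle",
}
redirect_map = {"Kamala harris": "Kamala Harris"}

for raw in ["kamala harris", "thisisnotatitle", "Simone Biles"]:
    print(raw, "->", resolve_title(raw, norm_map, redirect_map))
```

A title such as `thisisnotatitle` is normalized but has no redirect to follow, so it stays unresolved to any real page, which matches the point that only redirects recorded on Wikipedia can be fixed.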


## Large scale collection

The functions can be used to collect data on a large scale asynchronously, adapting to rate limiting and server errors by retrying with adjusted batch sizes. Here we demonstrate collecting data for the full list of roughly 1000 top-viewed articles loaded in Setup. In-link collection is particularly slow for popular articles because so many other articles link to them.

In [8]:
links, pagemaps = await wikitoolkit.pipeline_get_links('en.wikipedia', my_agent,
                                                titles=artlist,
                                                gl_args={'mode':['out', 'in'],
                                                        'update_maps':True})

# Get list of articles and edges (excluding redlinks, those with no page id)
articles = set([x['title'] for y in links['in'].values() for x in y] +
                [x['title'] for y in links['out'].values() for x in y
                 if pagemaps.id_map[x['title']] != -1])
edges = pd.DataFrame(set([(k, x['title']) for k, v in links['out'].items()
                          for x in v if pagemaps.id_map[x['title']] != -1] +
                         [(x['title'], k) for k, v in links['in'].items()
                          for x in v]), columns=['source', 'target'])

print(len(articles), len(edges))
display(edges.head())

Getting out-links
('MediaWiki returned an error:', 'Could not decode as JSON:\n<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n* { margin: 0; padding: 0; }\nbody { background: #fff; font: 15px/1.6 sans-serif; color: #333; }\n.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }\n.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f')
60.73% complete
Trying again at n=600 with batchsize=100
Increasing batchsize to 200
Getting in-links
('MediaWiki returned an error:', 'Could not decode as JSON:\n<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n* { margin: 0; padding: 0; }\nbody { background: #fff; font: 15px/1.6 sans-serif; color: #333; }\n.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }\n.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f')
0.00% complete
Trying again at n=0 w

,source,target
0,Flour massacre,Joe Biden
1,2008 Oklahoma Democratic presidential primary,President of the United States
2,Haldon Aerodrome,World War I
3,List of places named for Douglas MacArthur,World War I
4,Sensibility Objectified,India
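
The log above shows collection resuming at a position `n` with a smaller batch size after a server error, then increasing the batch size again. Below is a minimal synchronous sketch of that general pattern. It is not wikitoolkit's implementation, which is asynchronous and not shown in this notebook. `collect_in_batches`, `flaky_fetch`, and the halve/double policy are assumptions made for illustration.

```python
def collect_in_batches(items, fetch, batchsize=200, min_batchsize=1, max_batchsize=200):
    """Fetch items in batches; halve the batch on failure, double it after a success."""
    results = {}
    n = 0
    while n < len(items):
        batch = items[n:n + batchsize]
        try:
            results.update(fetch(batch))
        except RuntimeError:
            if batchsize == min_batchsize:
                raise  # give up rather than retry the smallest batch forever
            batchsize = max(min_batchsize, batchsize // 2)
            continue  # retry from the same position n with a smaller batch
        n += len(batch)
        batchsize = min(max_batchsize, batchsize * 2)
    return results


calls = []


def flaky_fetch(batch):
    """Hypothetical stand-in for an API call: fails on the second call only."""
    calls.append(len(batch))
    if len(calls) == 2:
        raise RuntimeError("simulated server error")
    return {item: len(item) for item in batch}


items = [f"page{i}" for i in range(10)]
results = collect_in_batches(items, flaky_fetch, batchsize=4, max_batchsize=4)
print(len(results), calls)
```

Resuming from `n` means already collected batches are never refetched, so a transient error costs only the one failed request.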


In [9]:
# I have a hunch that ordering the articles by number of (in-)links
# will make collection of articles more efficient for subsequent API calls.
# Let's try it

top_indegree = edges['target'].value_counts()
ordered_artlist = ([x for x in top_indegree.index if x in artlist] +
                   [x for x in artlist if x not in top_indegree.index])[::-1]
ordered_artlist[-10:] # These articles have loooots of in-links, v slow to collect

new_links, new_pagemaps = await wikitoolkit.pipeline_get_links('en.wikipedia', my_agent,
                                                titles=ordered_artlist,
                                                gl_args={'mode':['out', 'in'],
                                                        'update_maps':True})

# Get list of articles and edges (excluding redlinks, those with no page id)
new_articles = set([x['title'] for y in new_links['in'].values() for x in y] +
                [x['title'] for y in new_links['out'].values() for x in y
                 if new_pagemaps.id_map[x['title']] != -1])
new_edges = set([(k, x['title']) for k, v in new_links['out'].items() 
                 for x in v if new_pagemaps.id_map[x['title']] != -1] +
                [(x['title'], k) for k, v in new_links['in'].items()
                for x in v])

print(len(new_articles), len(new_edges))

Getting out-links
('MediaWiki returned an error:', 'Could not decode as JSON:\n<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n* { margin: 0; padding: 0; }\nbody { background: #fff; font: 15px/1.6 sans-serif; color: #333; }\n.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }\n.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f')
80.97% complete
Trying again at n=800 with batchsize=100
Increasing batchsize to 200
Getting in-links
('MediaWiki returned an error:', 'Could not decode as JSON:\n<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n* { margin: 0; padding: 0; }\nbody { background: #fff; font: 15px/1.6 sans-serif; color: #333; }\n.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }\n.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f')
40.49% complete
Trying again at n=40
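
The ordering step in cell 9 places articles with the most observed in-links last. Below is a minimal sketch of the same idea using `collections.Counter` in place of `value_counts`. `order_by_indegree` and the sample data are hypothetical, and the handling of ties differs slightly from the cell, which reverses a concatenated list.

```python
from collections import Counter


def order_by_indegree(articles, edges):
    """Return articles sorted from fewest to most in-links observed in edges."""
    indegree = Counter(target for _, target in edges)
    # Counter returns 0 for unseen titles; sorted() is stable, so ties keep input order.
    return sorted(articles, key=lambda title: indegree[title])


articles = ["India", "Flour massacre", "World War I"]
edges = [
    ("Haldon Aerodrome", "World War I"),
    ("List of places named for Douglas MacArthur", "World War I"),
    ("Sensibility Objectified", "India"),
]

ordered = order_by_indegree(articles, edges)
print(ordered)
```

Counting once and sorting by key also avoids the repeated `x in artlist` list scans in the cell, which scale quadratically with the number of articles.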