# Link Demo

This notebook demonstrates two functions for collecting link data from Wikipedia:
1. `get_links`: collects different types of links for a list of pages.
2. `pipeline_get_links`: a convenience wrapper that, in addition to the above, sets up the session and returns the ID, redirect, and normalization maps.

## Setup

In [1]:
import mwapi
import wikitools
import pandas as pd

my_agent = 'mwapi testing <p.gildersleve@lse.ac.uk>'
async_session = mwapi.AsyncSession('https://en.wikipedia.org',
                    formatversion=2, user_agent=my_agent)
toparts = pd.read_csv('data/topviews-2024_07_31.csv')
artlist = toparts['Page'].unique().tolist() # ~1000 top-viewed articles from the 2024-07-31 topviews file


## `get_links`

This can be used to collect in-, out-, interwiki-, lang-, and external links to/from a list of articles.

### Out-links

In [2]:
outlinks = await wikitools.get_links(async_session, mode='out', titles=artlist[:10])
edges = pd.DataFrame(set([(k, x['title']) for k, v in outlinks.items() for x in v]),
            columns=['source', 'target'])
edges

Getting out-links


|  | source | target |
| --- | --- | --- |
| 0 | Michael Phelps | Swimming at the 2012 Summer Olympics – Men's 1... |
| 1 | Kamala Harris | Republican reactions to Donald Trump's claims ... |
| 2 | Kamala Harris | Francis Suarez |
| 3 | Michael Phelps | Pan Pacific Swimming Championships |
| 4 | Katie Ledecky | Missy Gregg |
| ... | ... | ... |
| 9868 | Katie Ledecky | Swimming at the 1973 World Aquatics Championsh... |
| 9869 | 2024 Summer Olympics | United Arab Emirates at the 2024 Summer Olympics |
| 9870 | Sunisa Lee | The Guardian |
| 9871 | MyKayla Skinner | Ashleigh Gnat |
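
The edge-building comprehension in the cell above can be wrapped in a small helper. The sketch below assumes, as the cells in this notebook suggest, that the result maps each source title to a list of dicts carrying a `'title'` key. `links_to_edges` and the sample data are hypothetical and not part of `wikitools`.

```python
def links_to_edges(links, key='title'):
    """Return sorted, de-duplicated (source, target) pairs.

    `links` is assumed to map a source title to a list of dicts,
    each holding the target under `key`.
    """
    return sorted({(source, item[key])
                   for source, items in links.items()
                   for item in items})


# Made-up sample in the assumed shape (not real API output)
sample_outlinks = {
    'Page A': [{'title': 'Page B'}, {'title': 'Page C'}, {'title': 'Page B'}],
    'Page B': [{'title': 'Page A'}],
}
sample_edges = links_to_edges(sample_outlinks)
print(sample_edges)
```

Sorting the set gives a stable row order between runs, which a bare `set` does not guarantee. Passing `key='url'` should handle a result whose items carry a `'url'` key instead, as the external-link cell below does.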


### External links

In [3]:
extlinks = await wikitools.get_links(async_session, mode=['ext'], titles=artlist[:10])
edges = pd.DataFrame([{'source': k, 'target': x['url']}for k, v in extlinks.items()
                      for x in v], columns=['source', 'target'])
edges

Getting ext-links


|  | source | target |
| --- | --- | --- |
| 0 | Kamala Harris | https://www.mercurynews.com/2020/08/18/heres-k... |
| 1 | Kamala Harris | https://web.archive.org/web/20201014130548/htt... |
| 2 | Kamala Harris | https://www.harris.senate.gov/about |
| 3 | Kamala Harris | https://abcnews.go.com/Politics/kamala-harris-... |
| 4 | Kamala Harris | https://www.washingtonpost.com/politics/kamala... |
| ... | ... | ... |
| 3472 | Jonathan Owens | https://www.chicagobears.com/news/roster-moves... |
| 3473 | Jonathan Owens | https://web.archive.org/web/20220528191717/htt... |
| 3474 | Jonathan Owens | https://www.chicagobears.com/team/players-rost... |
| 3475 | Jonathan Owens | https://www.nfl.com/news/simone-biles-bears-gr... |


### All kinds of link

In [4]:
alllinks = await wikitools.get_links(async_session, mode=['in', 'out', 'lang',
                                                          'interwiki', 'ext'],
                                     titles=artlist[:10])

# Create a dataframe for each type of link
wikiedges = pd.DataFrame(set([(k, x['title']) for k, v in alllinks['out'].items()
                              for x in v] +
                             [(x['title'], k) for k, v in alllinks['in'].items()
                              for x in v]), columns=['source', 'target'])
extedges = pd.DataFrame([{'source': k, 'target': x['url']}
                         for k, v in alllinks['ext'].items() for x in v])
interwikiedges = pd.concat({k: pd.DataFrame(v) for k, v in alllinks['interwiki'].items()}
                            ).reset_index(drop=True, level=1).reset_index(
                            ).rename(columns={'index': 'source',
                                              'prefix': 'target_prefix',
                                              'title': 'target_title'})
langedges = pd.concat({k: pd.DataFrame(v).T for k, v in alllinks['lang'].items()}
                     ).reset_index().rename(columns={'level_0': 'source',
                                                     'level_1': 'target_lang',
                                                     0: 'target_title'})

display(wikiedges.head(), extedges.head(), interwikiedges.head(), langedges.head())

Getting in-links
Getting out-links
Getting lang-links
Getting interwiki-links
Getting ext-links


|  | source | target |
| --- | --- | --- |
| 0 | Michael Phelps | Pan Pacific Swimming Championships |
| 1 | Michael Phelps | Swimming at the 2011 World Aquatics Championsh... |
| 2 | Kamala Harris | Jim Costa |
| 3 | List of flag bearers for Uruguay at the Olympics | 2024 Summer Olympics |
| 4 | Weightlifting at the 2024 Summer Olympics – Me... | 2024 Summer Olympics |


|  | source | target |
| --- | --- | --- |
| 0 | Kamala Harris | https://www.mercurynews.com/2020/08/18/heres-k... |
| 1 | Kamala Harris | https://web.archive.org/web/20201014130548/htt... |
| 2 | Kamala Harris | https://www.harris.senate.gov/about |
| 3 | Kamala Harris | https://abcnews.go.com/Politics/kamala-harris-... |
| 4 | Kamala Harris | https://www.washingtonpost.com/politics/kamala... |


|  | source | target_prefix | target_title |
| --- | --- | --- | --- |
| 0 | Kamala Harris | c | Category:Kamala_Harris |
| 1 | Kamala Harris | commons | Category:Kamala_Harris |
| 2 | Kamala Harris | commons | Vice_President_of_the_United_States |
| 3 | Kamala Harris | d | Q10853588 |
| 4 | Kamala Harris | n | Category:Kamala_Harris |


|  | source | target_lang | target_title |
| --- | --- | --- | --- |
| 0 | Kamala Harris | af | Kamala Harris |
| 1 | Kamala Harris | an | Kamala Harris |
| 2 | Kamala Harris | ar | كامالا هاريس |
| 3 | Kamala Harris | arz | كامالا هاريس |
| 4 | Kamala Harris | as | কমলা হেৰিছ |


## `pipeline_get_links`

This convenience function wraps `get_links`: it sets up the session, resolves redirects, and collects link data. No manual setup of the `async_session` is required.

In [5]:
links_data = await wikitools.pipeline_get_links('en.wikipedia', my_agent,
                                                titles=artlist[:10],
                                                gl_args={'mode':['out', 'in'],
                                                        'update_maps':True})
links = links_data['links']
id_map = links_data['id_map']
redirect_map = links_data['redirect_map']
norm_map = links_data['norm_map']

# Get list of articles and edges (excluding redlinks, those with no page id)
articles = set([x['title'] for y in links['in'].values() for x in y] +
                [x['title'] for y in links['out'].values() for x in y
                 if id_map[x['title']] != -1])
edges = pd.DataFrame(set([(k, x['title']) for k, v in links['out'].items()
                          for x in v if id_map[x['title']] != -1] +
                         [(x['title'], k) for k, v in links['in'].items()
                          for x in v]), columns=['source', 'target'])

# Additionally returns id_map, redirect_map, norm_map
print(id_map)
print(redirect_map)
print(norm_map)

print(len(articles), len(edges))
display(edges.head())


Getting out-links
Getting in-links
{'Kamala Harris': 3120522, '2024 Summer Olympics': 8351239, 'Michael Phelps': 19084502, 'Katie Ledecky': 36301418, 'MyKayla Skinner': 38513743, 'Simone Biles': 38659992, 'Ismail Haniyeh': 41207463, 'Deadpool & Wolverine': 52234178, 'Sunisa Lee': 59267698, 'Jonathan Owens': 60104513, 'Atlanta': 3138, 'Berlin': 3354, 'Coca-Cola': 6690, 'Captain America': 7729, "Director's cut": 8731, 'Formula One': 10854, 'Fantastic Four': 11664, 'Galactus': 13095, 'Heinz': 13881, 'Ingmar Bergman': 14626, 'Iceman (Marvel Comics)': 15505, 'Jean Grey': 16559, 'Hulk': 31509, 'Glam rock': 38283, 'Framestore': 45057, 'Doctor Doom': 47497, 'Darth Vader': 53862, 'ITV (TV network)': 58089, 'CNN': 62028, 'Iron Man': 67055, 'John Byrne (comics)': 75920, 'Fourth wall': 76440, 'ESPN': 77795, 'Cyclops (Marvel Comics)': 79109, 'Guinness World Records': 100796, 'New York Post': 102227, 'Doctor Strange': 150076, 'Extended play': 156702, 'Jack in the Box': 166778, '20th Century Studios'

|  | source | target |
| --- | --- | --- |
| 0 | Michael Phelps | Pan Pacific Swimming Championships |
| 1 | Michael Phelps | Swimming at the 2011 World Aquatics Championsh... |
| 2 | Kamala Harris | Jim Costa |
| 3 | List of flag bearers for Uruguay at the Olympics | 2024 Summer Olympics |
| 4 | Weightlifting at the 2024 Summer Olympics – Me... | 2024 Summer Olympics |
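
The redlink filter used in the cell above can be isolated into a helper. This is a minimal sketch that assumes `id_map` maps each title to a page ID and uses `-1` for pages that do not exist, as the printed map suggests. `drop_redlinks` and the sample data are hypothetical, and unlike the cell above the sketch checks both ends of every edge.

```python
def drop_redlinks(edges, id_map, missing_id=-1):
    """Keep only edges whose source and target both have a real page ID.

    Titles absent from `id_map` are treated as missing rather than
    raising a KeyError (a choice made for this sketch).
    """
    def exists(title):
        return id_map.get(title, missing_id) != missing_id

    return [(s, t) for s, t in edges if exists(s) and exists(t)]


# Made-up sample data in the assumed shape
sample_id_map = {'Page A': 101, 'Page B': 202, 'Missing page': -1}
sample_edges = [
    ('Page A', 'Page B'),
    ('Page A', 'Missing page'),   # target is a redlink
    ('Unknown page', 'Page B'),   # source not in the map at all
]
kept = drop_redlinks(sample_edges, sample_id_map)
print(kept)
```

Using `dict.get` with a default avoids a `KeyError` for titles that never made it into the map, which direct indexing with `id_map[...]` would raise.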


In [6]:
# testing with page ids instead of titles
ids = [id_map[x] for x in artlist[:10]]
links_data = await wikitools.pipeline_get_links('en.wikipedia', my_agent,
                                                pageids=ids,
                                                gl_args={'mode':['out', 'in',
                                                                 'lang', 'ext',
                                                                 'interwiki'],
                                                        'update_maps':True})
links = links_data['links']
id_map = links_data['id_map']
redirect_map = links_data['redirect_map']
norm_map = links_data['norm_map']

Getting out-links
Getting in-links
Getting lang-links
Getting ext-links
Getting interwiki-links


### Redirect demo

The pipeline function can also resolve redirects and normalize titles as part of the process. A redirect must already exist on Wikipedia; the function cannot correct arbitrary typos.

In [7]:
bad_titles = [x.lower() for x in artlist[:10]] + ['thisisnotatitle']
links_data = await wikitools.pipeline_get_links('en.wikipedia', my_agent,
                                                titles=bad_titles,
                                                gl_args={'mode':['out', 'in'],
                                                        'update_maps':True})
links = links_data['links']
id_map = links_data['id_map']
redirect_map = links_data['redirect_map']
norm_map = links_data['norm_map']

# Get list of articles and edges (excluding redlinks, those with no page id)
articles = set([x['title'] for y in links['in'].values() for x in y] +
                [x['title'] for y in links['out'].values() for x in y
                 if id_map[x['title']] != -1])
edges = pd.DataFrame(set([(k, x['title']) for k, v in links['out'].items()
                          for x in v if id_map[x['title']] != -1] +
                         [(x['title'], k) for k, v in links['in'].items()
                          for x in v]), columns=['source', 'target'])

# Additionally returns id_map, redirect_map, norm_map
print(id_map)
print(redirect_map) # some titles have redirects found
print(norm_map) # some titles have normalized titles found

print(len(articles), len(edges))
display(edges.head())

Getting out-links
Getting in-links
{'Kamala Harris': 3120522, 'Michael Phelps': 19084502, 'Ismail Haniyeh': 41207463, 'Nahal Oz military base': -1, 'Murder of Abdel Aziz al-Rantisi': -1, 'Muhammad Abu Shamala': -1, 'Mazen Sonokrot': -1, 'Marj al-Zahour': -1, 'Mahmoud Rushdi Ishtiwi': -1, 'Mahmoud Rushdi Eshtewi': -1, 'Killing of Jamila al-Shanti': -1, 'Ibrahim Biari': -1, 'Ghoul sniper rifle': -1, 'Gaza Protectorate': -1, 'Ezzedeen Al-Qassam': -1, 'Electoral Lists': -1, 'Demolitions in the Gaza Strip': -1, 'Assassination of Mahmoud Al Mabhouh': -1, 'Ashraf al-Qidra': -1, 'Al Qassam military choir': -1, 'Al Jazeera Net': -1, 'Al-Zawari': -1, 'Al-Qassam Military Media': -1, 'Al-Mustaqbal electoral list': -1, '35th anniversary of Hamas': -1, '2008 Kerem Shalom attack': -1, 'Arabic': 803, 'Al-Qaeda': 1921, 'Ariel Sharon': 2944, 'Assassination': 2963, 'Cairo': 6293, 'European Union': 9317, 'Encyclopædia Britannica': 9508, 'Association football': 10568, 'Fatah': 11104, 'Gaza Strip': 12047, '

|  | source | target |
| --- | --- | --- |
| 0 | Kamala Harris | Republican reactions to Donald Trump's claims ... |
| 1 | Michael Phelps | Swimming at the 2012 Summer Olympics – Men's 1... |
| 2 | Kamala Harris | Francis Suarez |
| 3 | Ismail Haniyeh | Ahlam Tamimi |
| 4 | Michael Phelps | Pan Pacific Swimming Championships |
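
To look up the canonical title for an input such as a lowercased name, the two maps can be chained. The sketch below assumes that `norm_map` and `redirect_map` are plain dicts from an original title to its replacement, and that normalization is applied before redirects, which is the order the MediaWiki API uses. The actual structure returned by `wikitools` may differ; `resolve_title` and the sample maps are hypothetical.

```python
def resolve_title(title, norm_map, redirect_map):
    """Map a requested title to its canonical title.

    Applies normalization first, then a single redirect hop.
    Titles found in neither map are returned unchanged.
    """
    normalized = norm_map.get(title, title)
    return redirect_map.get(normalized, normalized)


# Illustrative maps in the assumed shape (not real API output)
sample_norm_map = {'example page': 'Example page'}
sample_redirect_map = {'Example page': 'Example Page'}

resolved = resolve_title('example page', sample_norm_map, sample_redirect_map)
unchanged = resolve_title('thisisnotatitle', sample_norm_map, sample_redirect_map)
print(resolved, unchanged)
```

A title such as `'thisisnotatitle'` passes through unchanged because neither map knows it, which matches the note above that only redirects recorded on Wikipedia can be resolved.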


## Large scale collection

The functions can be used to collect data on a large scale asynchronously, with adaptations around any rate limiting. Here we demonstrate collecting data for the full list of ~1000 top-viewed articles loaded above. In-link collection is particularly slow for popular articles because so many other articles link to them.

In [8]:
links_data = await wikitools.pipeline_get_links('en.wikipedia', my_agent,
                                                titles=artlist,
                                                gl_args={'mode':['out', 'in'],
                                                        'update_maps':True})
links = links_data['links']
id_map = links_data['id_map']
redirect_map = links_data['redirect_map']
norm_map = links_data['norm_map']

# Get list of articles and edges (excluding redlinks, those with no page id)
articles = set([x['title'] for y in links['in'].values() for x in y] +
                [x['title'] for y in links['out'].values() for x in y
                 if id_map[x['title']] != -1])
edges = pd.DataFrame(set([(k, x['title']) for k, v in links['out'].items()
                          for x in v if id_map[x['title']] != -1] +
                         [(x['title'], k) for k, v in links['in'].items()
                          for x in v]), columns=['source', 'target'])

print(len(articles), len(edges))
display(edges.head())

Getting out-links
Getting in-links
('MediaWiki returned an error:', 'Could not decode as JSON:\n<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n* { margin: 0; padding: 0; }\nbody { background: #fff; font: 15px/1.6 sans-serif; color: #333; }\n.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }\n.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f')
0.00% complete
Trying again at n=0 with batchsize=100
Increasing batchsize to 200
Cannot connect to host en.wikipedia.org:443 ssl:default [Connection reset by peer]
30.36% complete
Trying again at n=300 with batchsize=100
Increasing batchsize to 200
2590231 6184542


|  | source | target |
| --- | --- | --- |
| 0 | Miu Hirano | Austria |
| 1 | Charles Kamathi | Paris |
| 2 | Internet Protocol television | India |
| 3 | Funny Girl (Fiona album) | Hong Kong |
| 4 | 1994 Family Circle Cup – Doubles | Germany |
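
The log output above shows failed requests being retried from the same position with a smaller batch size, which then grows again after a success. The sketch below illustrates that general pattern only; it is not the `wikitools` implementation. `collect_in_batches`, `make_flaky_fetch`, and all parameter names and defaults are hypothetical.

```python
import asyncio


async def collect_in_batches(items, fetch, start_size=200, min_size=25,
                             max_retries=5, delay=0.0):
    """Fetch `items` in batches, shrinking the batch after a failure.

    `fetch` is any coroutine function that takes a list of items and
    returns a dict of results for that list.
    """
    results = {}
    n, size, retries = 0, start_size, 0
    while n < len(items):
        batch = items[n:n + size]
        try:
            results.update(await fetch(batch))
        except Exception as exc:
            retries += 1
            if retries > max_retries:
                raise
            size = max(min_size, size // 2)   # halve the batch, keep a floor
            print(f'{exc}; trying again at n={n} with batchsize={size}')
            await asyncio.sleep(delay)        # pause before retrying
            continue
        n += len(batch)                       # advance only after success
        retries = 0
        size = min(start_size, size * 2)      # grow back towards the start size
    return results


def make_flaky_fetch(fail_on_call=1):
    """Build a fake fetch that raises once, to exercise the retry path."""
    calls = {'n': 0}

    async def fetch(batch):
        calls['n'] += 1
        if calls['n'] == fail_on_call:
            raise ConnectionError('simulated connection reset')
        return {title: [] for title in batch}

    return fetch


titles = [f'Page {i}' for i in range(500)]
collected = asyncio.run(collect_in_batches(titles, make_flaky_fetch()))
print(len(collected))
```

Inside a notebook, where an event loop is already running, `await collect_in_batches(...)` would replace the `asyncio.run(...)` call. Because the position `n` only advances after a successful batch, a retry never skips items.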


In [9]:
# I have a hunch that ordering the articles by number of (in-)links
# will make collection of articles more efficient for subsequent API calls.
# Let's try it

top_indegree = edges['target'].value_counts()
artset = set(artlist)  # set lookup avoids scanning the list for every index entry
ordered_artlist = ([x for x in top_indegree.index if x in artset] +
                   [x for x in artlist if x not in top_indegree.index])[::-1]
ordered_artlist[-10:] # These articles have loooots of in-links, v slow to collect

new_links_data = await wikitools.pipeline_get_links('en.wikipedia', my_agent,
                                                titles=ordered_artlist,
                                                gl_args={'mode':['out', 'in'],
                                                        'update_maps':True})
new_links = new_links_data['links']
new_id_map = new_links_data['id_map']
new_redirect_map = new_links_data['redirect_map']
new_norm_map = new_links_data['norm_map']

# Get list of articles and edges (excluding redlinks, those with no page id)
new_articles = set([x['title'] for y in new_links['in'].values() for x in y] +
                [x['title'] for y in new_links['out'].values() for x in y
                 if new_id_map[x['title']] != -1])
new_edges = set([(k, x['title']) for k, v in new_links['out'].items() 
                 for x in v if new_id_map[x['title']] != -1] +
                [(x['title'], k) for k, v in new_links['in'].items()
                for x in v])

print(len(new_articles), len(new_edges))

Getting out-links
Getting in-links
('MediaWiki returned an error:', 'Could not decode as JSON:\n<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n* { margin: 0; padding: 0; }\nbody { background: #fff; font: 15px/1.6 sans-serif; color: #333; }\n.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }\n.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f')
40.49% complete
Trying again at n=400 with batchsize=100
Increasing batchsize to 200
('MediaWiki returned an error:', 'Could not decode as JSON:\n<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n* { margin: 0; padding: 0; }\nbody { background: #fff; font: 15px/1.6 sans-serif; color: #333; }\n.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }\n.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f')
70.85% complete
Trying again at n=70