# Redirects demo

This notebook demonstrates the `PageMaps` object, which fixes, collects, and manages redirect data. Other package functions rely on it to collect data efficiently. It has two key methods:
1. `fix_redirects`: Resolves incorrect or outdated article titles to their canonical titles.
2. `get_redirects`: Collects all redirects for given Wikipedia articles.

It also has six key attributes:
1. `titles_redirect_map` - A dictionary of redirect titles to their canonical titles.
2. `pageids_redirect_map` - A dictionary of redirect page IDs to their canonical page IDs.
3. `norm_map` - A dictionary of non-normalised to normalised titles.
4. `id_map` - A dictionary of titles to page IDs.
5. `collected_title_redirects` - A dictionary of canonical titles to all their redirects.
6. `collected_pageid_redirects` - A dictionary of canonical page IDs to all their redirect page IDs.

## Setup

In [1]:
import mwapi
import wikitools
import pandas as pd

my_agent = 'mwapi testing <p.gildersleve@lse.ac.uk>'
async_session = mwapi.AsyncSession('https://en.wikipedia.org',
                    formatversion=2, user_agent=my_agent)
toparts = pd.read_csv('data/topviews-2024_07_31.csv')
artlist = toparts['Page'].unique().tolist() # ~1000 top articles from the topviews export

## `PageMaps`

Wikipedia articles have a canonical title, but there are typically many "redirect" pages with different titles that point to the same article, e.g. "US" redirects to "United States". In different Wiki data sources, these redirects may or may not be resolved. The `PageMaps` object is used to manage this redirect data.

Use a single `PageMaps` object per project so that redirect and page ID resolution stays consistent.
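To make the redirect idea concrete, here is a minimal sketch of how normalisation and redirect information can be read out of a MediaWiki `action=query` response requested with `redirects` and `formatversion=2`. The helper name `parse_query_maps` is hypothetical, the sample response is hand-written to mirror the shape of such a response, and nothing here is claimed to be how `PageMaps` works internally.

```python
def parse_query_maps(response):
    """Split a MediaWiki query response into three lookup dictionaries.

    Hypothetical helper for illustration; assumes formatversion=2,
    where "pages" is a list rather than a dict keyed by page ID.
    """
    query = response.get("query", {})
    # Non-normalised title -> normalised title (e.g. first letter capitalised)
    norm_map = {n["from"]: n["to"] for n in query.get("normalized", [])}
    # Redirect title -> canonical title
    redirect_map = {r["from"]: r["to"] for r in query.get("redirects", [])}
    # Title -> page ID (missing pages carry no "pageid", so skip them)
    id_map = {p["title"]: p["pageid"]
              for p in query.get("pages", []) if "pageid" in p}
    return norm_map, redirect_map, id_map


# Hand-written sample mirroring the response shape (not fetched live)
sample_response = {
    "query": {
        "normalized": [{"from": "uk", "to": "Uk"}],
        "redirects": [{"from": "Uk", "to": "United Kingdom"}],
        "pages": [{"pageid": 31717, "ns": 0, "title": "United Kingdom"}],
    }
}

norm_map, redirect_map, id_map = parse_query_maps(sample_response)
print(norm_map)      # {'uk': 'Uk'}
print(redirect_map)  # {'Uk': 'United Kingdom'}
print(id_map)        # {'United Kingdom': 31717}
```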

In [2]:
# Initialising a PageMaps object
pagemaps = wikitools.PageMaps()

### `fix_redirects`

Fixes redirects for any incorrect or outdated article titles. Results are stored in the `PageMaps` object, which is edited in place.

In [3]:
rd_arts = ["kamala harris", "joe biden", "uk"] # non-normalised titles of a few articles

await pagemaps.fix_redirects(async_session, titles=rd_arts)
print(pagemaps)
pagemaps.return_maps()


Redirects: 3, Norms: 3, IDs: 6, Existing: 0


{'titles_redirect_map': {'Uk': 'United Kingdom',
  'Joe biden': 'Joe Biden',
  'Kamala harris': 'Kamala Harris'},
 'pageids_redirect_map': {641291: 31717, 4725301: 145422, 60551360: 3120522},
 'norm_map': {'joe biden': 'Joe biden',
  'uk': 'Uk',
  'kamala harris': 'Kamala harris'},
 'id_map': {'United Kingdom': 31717,
  'Joe Biden': 145422,
  'Kamala Harris': 3120522,
  'Uk': 641291,
  'Joe biden': 4725301,
  'Kamala harris': 60551360},
 'collected_title_redirects': {},
 'collected_pageid_redirects': {}}
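The maps in this output can be chained to take a raw title to its canonical title and page ID: normalise first, then follow the redirect, then look up the ID. The sketch below copies the dictionaries from the output above into plain Python dicts; `resolve_title` is a hypothetical helper written for illustration, not a `PageMaps` method.

```python
# Dictionaries copied from the return_maps() output shown above
norm_map = {'joe biden': 'Joe biden',
            'uk': 'Uk',
            'kamala harris': 'Kamala harris'}
titles_redirect_map = {'Uk': 'United Kingdom',
                       'Joe biden': 'Joe Biden',
                       'Kamala harris': 'Kamala Harris'}
id_map = {'United Kingdom': 31717,
          'Joe Biden': 145422,
          'Kamala Harris': 3120522,
          'Uk': 641291,
          'Joe biden': 4725301,
          'Kamala harris': 60551360}


def resolve_title(raw_title):
    """Return (canonical_title, page_id) for a raw title.

    Hypothetical helper: titles absent from a map pass through unchanged,
    and the page ID is None if the canonical title is unknown.
    """
    normalised = norm_map.get(raw_title, raw_title)
    canonical = titles_redirect_map.get(normalised, normalised)
    return canonical, id_map.get(canonical)


print(resolve_title('uk'))         # ('United Kingdom', 31717)
print(resolve_title('Joe Biden'))  # ('Joe Biden', 145422)
```

Falling back to the input at each step means an already-canonical title such as `'Joe Biden'` resolves to itself.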

### `get_redirects`

Collects all the redirect pages for the given page titles. If incorrect titles are supplied, it attempts to fix redirects first. As with `fix_redirects`, results are stored in the `PageMaps` object, which is edited in place.

In [4]:
await pagemaps.get_redirects(async_session, rd_arts)
print(pagemaps)
print(pagemaps.collected_title_redirects.keys())
print(pagemaps.collected_pageid_redirects.keys())
print(pagemaps.collected_title_redirects)
print(pagemaps.collected_pageid_redirects)

Redirects: 300, Norms: 3, IDs: 300, Existing: 3
dict_keys(['United Kingdom', 'Joe Biden', 'Kamala Harris'])
dict_keys([31717, 145422, 3120522])
{'United Kingdom': ['United Kingdom', 'United Kindom', 'U.K.', 'ISO 3166-1:GB', 'U.K', 'United Kingom', 'Uk', 'Great Britain and Northern Ireland', 'The UK', 'UK', 'The United Kingdom', "UK's", 'United Kingdom of Great Britain and Northern Island', "United Kingdom's", 'UnitedKingdom', 'United kingdom of great britain and northern ireland', 'United Kingsom', 'British state', 'TUKOGBANI', 'United Kingdom of Great Britain and Northern Ireland', 'The United Kingdom of Great Britain and Northern Ireland', 'United Kingdom of Great Britain & Northern Ireland', 'United kingdom', 'United Kindgom', 'Great britain and northern ireland', 'UKGBNI', 'U.K.G.B.N.I.', 'The uk', 'Royaume-Uni', 'UKOGBANI', 'United Kingdom of Great Britain and Ulster', 'Great Britain and Ulster', 'Great Britain & Ulster', 'United Kingdom of Great Britain & Ulster', 'The United Kin

In [5]:
# also works with pageids
ids = [736, 9332, 60815369] # random new page ids
await pagemaps.get_redirects(async_session, pageids=ids)
print(pagemaps)
print(pagemaps.collected_title_redirects.keys())
print(pagemaps.collected_pageid_redirects.keys())
print(pagemaps.collected_title_redirects)
print(pagemaps.collected_pageid_redirects)

Redirects: 332, Norms: 3, IDs: 332, Existing: 6
dict_keys(['United Kingdom', 'Joe Biden', 'Kamala Harris', 'Albert Einstein', 'Errol Morris', 'Daisy Edgar-Jones'])
dict_keys([31717, 145422, 3120522, 736, 9332, 60815369])
{'United Kingdom': ['United Kingdom', 'United Kindom', 'U.K.', 'ISO 3166-1:GB', 'U.K', 'United Kingom', 'Uk', 'Great Britain and Northern Ireland', 'The UK', 'UK', 'The United Kingdom', "UK's", 'United Kingdom of Great Britain and Northern Island', "United Kingdom's", 'UnitedKingdom', 'United kingdom of great britain and northern ireland', 'United Kingsom', 'British state', 'TUKOGBANI', 'United Kingdom of Great Britain and Northern Ireland', 'The United Kingdom of Great Britain and Northern Ireland', 'United Kingdom of Great Britain & Northern Ireland', 'United kingdom', 'United Kindgom', 'Great britain and northern ireland', 'UKGBNI', 'U.K.G.B.N.I.', 'The uk', 'Royaume-Uni', 'UKOGBANI', 'United Kingdom of Great Britain and Ulster', 'Great Britain and Ulster', 'Great B
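A common follow-up is to invert the collected redirects into a flat lookup from any redirect title to its canonical title, for example to merge counts recorded under different titles. The sketch below uses only a handful of the "United Kingdom" redirects from the output above as sample data; the variable `redirect_to_canonical` is a name chosen for illustration.

```python
# A small subset of collected_title_redirects, copied from the output above
collected_title_redirects = {
    'United Kingdom': ['United Kingdom', 'United Kindom', 'U.K.', 'UK', 'Uk'],
}

# Invert canonical -> [redirects] into redirect -> canonical
redirect_to_canonical = {
    redirect: canonical
    for canonical, redirects in collected_title_redirects.items()
    for redirect in redirects
}

print(redirect_to_canonical['U.K.'])  # United Kingdom
print(len(redirect_to_canonical))     # 5
```

Because each list in the output includes the canonical title itself, the inverted lookup also maps canonical titles to themselves.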