# Extract text content and links from Wikipedia

API documentation is here: https://www.mediawiki.org/wiki/API

Some relevant pages are:
- https://www.mediawiki.org/wiki/API:Parsing_wikitext
- https://www.mediawiki.org/wiki/API:Get_the_contents_of_a_page

In [1]:
import pickle as pkl
import requests

### Functions to get a list of page titles in a Wikipedia category and its subcategories, subsubcategories, etc

Note these get **only** the page titles.

In [2]:
def catMembers(cat):
    """ Get all page titles and subcategory titles from a Wikipedia category """
    pagelist = []
    subcatlist = []
    base = 'https://en.wikipedia.org/w/api.php'
    params = {'cmtitle' : cat,
              'cmprop' : 'title',
              'action' : 'query',
              'list' : 'categorymembers',
              'cmlimit' : 'max',
              'format' : 'json'
             }

    while True:
        resp = requests.get(url = base, params = params).json()
        for member in resp['query']['categorymembers']:
            # Namespace (ns) 0 means articles, 14 means categories
            if member['ns'] == 0:
                pagelist.append(member['title'])
            elif member['ns'] == 14:
                subcatlist.append(member['title'])
        # Each request returns at most 'cmlimit' members, so keep following
        # the 'continue' block until the whole category has been read.
        if 'continue' not in resp:
            break
        params.update(resp['continue'])
    return pagelist, subcatlist


def catSubcatPgs(cat, depth = 1):
    """ Given a wikipedia category, get:
        (1) Titles of pages in the category
        (2) Titles of pages in all subcategories
        (3) (optional) Titles of pages in all subsubcategories and deeper
    cat : the name of the Wikipedia category to get data from
    depth : how 'deep' to go, e.g. depth = 1 will get the
            category and subcategory pages, depth = 2 will get category,
            subcategory and subsubcategory pages, etc
    Returns : a sorted list of titles, with any duplicates removed.
    """
    pgs, subcats = catMembers(cat = cat)
    for sc in subcats:
        d = depth
        subpgs, subsubcats = catMembers(cat = sc)
        if subpgs is not None: pgs.extend(subpgs)

        while d > 1:
            cs = []
            for ssc in subsubcats:
                ssp, sssc = catMembers(cat = ssc)
                if ssp is not None: pgs.extend(ssp)
                if sssc is not None: cs.extend(sssc)
            subsubcats = cs[:]
            d -= 1
    return(sorted(list(set(pgs))))

### Functions to get text/links from a single Wikipedia page


Known issues and limitations:
1. Text in tables will not be returned. This is especially noticeable on 'List of ____' pages, where most sections will be entirely empty. See [this stackoverflow link](https://stackoverflow.com/questions/40210536) for possible workarounds.
2. Text in references/ending sections will often not be returned. This includes sections titled 'References', 'Notes', 'Footnotes', 'Citations', 'Bibliography' and 'Further Reading' (and maybe more if I've missed any). Note that 'References' is the only common one; the others appear rarely.
    - A workaround for the References section only can be done by:
        - Extracting the list of sections in the page, e.g. https://en.wikipedia.org/w/api.php?action=parse&page=1st_Armoured_Regiment_(Australia)&prop=sections&format=json
        - Finding where 'line' equals 'References' and noting the 'index' value
        - Extracting the wikitext from this section, e.g. https://en.wikipedia.org/w/api.php?action=parse&page=1st_Armoured_Regiment_(Australia)&prop=wikitext&section=11&format=json
        - Converting from the references format to plain text
3. The plain text of links in 'External links' sections is returned but the actual URL it links to is rarely included. 
    - Workaround 1: 
        - Get all URLs from external links/references/notes/etc sections, e.g. https://en.wikipedia.org/w/api.php?action=parse&page=1st_Armoured_Regiment_(Australia)&prop=externallinks&format=json
        - Limitations: gets more than just the external links section URLs; doesn't include any more info about the URL, e.g. which section it's referenced in or what it corresponds to
    - Workaround 2 (probably better):
        - Similar to the References section workaround, find the list of sections and note the index of the 'External links' line
        - Extract the wikitext from this section
        - Parse the result and make it look better. External links appear to all be in the first part, before the first double line break \n\n. Links are split by '\n*'. After the double line break there are links to category pages, wrapped in double square brackets and split by single line breaks. These aren't actually part of the External links section.
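
The section-based workarounds in items 2 and 3 can be sketched roughly as below. This is a minimal sketch, not a tested part of this notebook: the helper names `findSectionIndex`, `sectionWikitextParams` and `splitExternalLinks` are hypothetical, and the splitting rule simply follows the observation above (links sit before the first blank line, each starting on a new line with '*'), so it may not hold for every page.

```python
def findSectionIndex(sections, heading):
    """ Return the 'index' of the first section whose 'line' equals heading, else None.
    sections : the list found under ['parse']['sections'] in a prop=sections response
    """
    for sec in sections:
        if sec.get('line') == heading:
            return sec.get('index')
    return None


def sectionWikitextParams(title, index):
    """ Build the query parameters that request the wikitext of a single section """
    return {'page' : title,
            'prop' : 'wikitext',
            'section' : index,
            'action' : 'parse',
            'format' : 'json'
           }


def splitExternalLinks(wikitext):
    """ Split 'External links' wikitext into one string per bulleted link.
    Assumes (as noted above) that the links come before the first blank line.
    """
    firstPart = wikitext.split('\n\n')[0]
    # The chunk before the first bullet is the section heading, so drop it
    return [item.strip() for item in firstPart.split('\n*')[1:]]
```

The params dict can be passed to `requests.get` in the same way as in the other functions here; the section wikitext is then found under `['parse']['wikitext']['*']` in the JSON response.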

In [7]:
def getWikiPageText(title):
    """ 
    Get plain text of a Wikipedia page (i.e. all html/markup tags removed)
    Input : String containing page title
    Returns : String containing article text
    """
    
    base = 'https://en.wikipedia.org/w/api.php'
    params = {'titles' : title,
              'prop' : 'extracts',
              'action' : 'query',
              'explaintext' : '1',
              'redirects' : '1',
              'format' : 'json'
             }
    txt = requests.get(url = base, params = params).json()['query']['pages']
    return(txt[list(txt.keys())[0]]['extract'])


def getWikiPageLinks(title):
    """ Get the list of pages a Wikipedia page links to """
    
    base = 'https://en.wikipedia.org/w/api.php'
    params = {'page' : title,
              'prop' : 'links',
              'action' : 'parse',
              'format' : 'json'
             }
    resp = requests.get(url = base, params = params).json()['parse']['links']
    # Namespace (ns) 0 means articles.
    # The 'exists' key is only included if the linked page exists, so checking for it skips links to missing pages.
    links = (title, [i['*'] for i in resp if i['ns']==0 and 'exists' in i])
    return(links)
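
To make the filtering in `getWikiPageLinks` easier to follow, here is a small offline sketch. The entries in `sampleLinks` are made up for illustration and only mimic the shape of the `['parse']['links']` list; `filterArticleLinks` is a hypothetical helper, not part of the notebook above. It checks for the 'exists' key, which is only present for pages that exist.

```python
def filterArticleLinks(linkEntries):
    """ Keep titles of links that are articles (ns 0) and point to existing pages """
    return [entry['*'] for entry in linkEntries
            if entry['ns'] == 0 and 'exists' in entry]


# Made-up entries in the same shape as the API response
sampleLinks = [
    {'ns': 0, 'exists': '', '*': 'Australian Army'},           # existing article: kept
    {'ns': 0, '*': 'A page that does not exist'},              # missing page: dropped
    {'ns': 14, 'exists': '', '*': 'Category:Example'},         # not an article: dropped
]

print(filterArticleLinks(sampleLinks))
```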

### Functions extending the above two, to get text/links from a list of Wiki pages

In [4]:
def allWikiPageText(titles):
    """
    Get the plain text from all Wikipedia pages provided in a list, and
    include the start and end tokens (for GPT-2) in each article.
    Input: List of Wikipedia page titles
    Returns : String containing text from each article
    """
    
    text = ''
    for ttl in titles:
        # Begin articles with '== Article Start ==\n' so GPT-2 learns it's the start.
        # Same with <|endoftext|> to end them
        text = text + '== Article Start ==\n' + ttl + '\n\n\n'+ getWikiPageText(ttl) + '\n\n<|endoftext|>\n\n\n'
    # Drop only the three trailing newlines so the final <|endoftext|> token stays intact
    return(text[:-3])


def allWikiPageLinks(titles):
    """
    Get the list of pages a Wikipedia page links to, for each
    page title provided in a list
    Input: List of Wikipedia page titles
    Returns : List of tuples of the form: (title, linkList) where
            linkList is a list containing each of the linked page titles
    """
    
    links = []
    for ttl in titles:
        try:
            links.append(getWikiPageLinks(ttl))
        except Exception:
            print('Error getting page: ' + ttl)
    return(links)
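
The layout described in the `allWikiPageText` docstring can be checked offline with a small sketch like the one below. `formatArticles` is a hypothetical helper that takes (title, text) pairs directly instead of calling the API, and the two sample articles are placeholders. It assumes the intended result ends cleanly on the final `<|endoftext|>` token.

```python
ARTICLE_START = '== Article Start ==\n'
ARTICLE_END = '\n\n<|endoftext|>'


def formatArticles(articles):
    """ Join (title, text) pairs into one training string.
    articles : iterable of (title, text) tuples
    """
    blocks = [ARTICLE_START + ttl + '\n\n\n' + txt + ARTICLE_END
              for ttl, txt in articles]
    # Three newlines separate consecutive articles; nothing follows the last token
    return '\n\n\n'.join(blocks)


sample = formatArticles([('Page A', 'Text of page A.'),
                         ('Page B', 'Text of page B.')])
print(sample)
```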

## Example to extract page text + links from all pages in a category (and deeper)

Set name of category we want to get data from:

In [None]:
cat = 'Category:Military of Australia'

Get page titles in the category, subcategories, subsubcategories, and subsubsubcategories (depth 3):

In [None]:
pageTitles = catSubcatPgs(cat = cat, depth = 3)
print('Number of pages: ' + str(len(pageTitles)))

Get plain text from each of the pages:

In [None]:
pageText = allWikiPageText(pageTitles)

# Save result
#with open('pageText.txt', 'w', encoding="utf-8") as f:
#    f.write(pageText)

Get list of links from each of the pages:

In [None]:
pageLinks = allWikiPageLinks(pageTitles)

# Save result
#with open('pageLinks.pkl', 'wb') as f:
#    pkl.dump(pageLinks, f)

## Example output from a single page

#### Page text:

In [5]:
getWikiPageText('1st Armoured Regiment (Australia)')

# or to make it look nicer:
# print(getWikiPageText('1st Armoured Regiment (Australia)'))

"1st Armoured Regiment is an armoured regiment of the Australian Army and is the senior regiment of the Royal Australian Armoured Corps. Formed as a tank unit in the new Australian Regular Army on 7 July 1949, the regiment subsequently saw service during the Vietnam War operating Centurion tanks. Currently the unit is based in Edinburgh, South Australia as part of the 1st Brigade. As part of the Plan Beersheba reorganisation, the unit has become one of three Armoured Cavalry Regiments (ACRs) assigned to the Army's multirole combat brigades in Brisbane, Darwin and Townsville. Each ACR is equipped with M1A1 tanks and ASLAV light armoured vehicles.\n\n\n== History ==\n\n\n=== Formation ===\nThe 1st Armoured Regiment was raised as a regular unit on 7 July 1949 at Puckapunyal in Victoria when the 1st Armoured Car Squadron, which had returned from occupation duties in Japan a few months earlier, was converted to a tank unit. The formation occurred following the renaming of a reserve unit of 

#### Page links:

In [6]:
getWikiPageLinks('1st Armoured Regiment (Australia)')

('1st Armoured Regiment (Australia)',
 ['10th Light Horse Regiment (Australia)',
  '12th/16th Hunter River Lancers',
  '15th Northern River Lancers',
  '1st/15th Royal New South Wales Lancers',
  '1st Armoured Car Squadron (Australia)',
  '1st Australian Task Force',
  '1st Brigade (Australia)',
  '1st Royal New South Wales Lancers',
  '2003 invasion of Iraq',
  '2nd/14th Light Horse Regiment',
  '2nd Cavalry Regiment (Australia)',
  '3rd/4th Cavalry Regiment (Australia)',
  '3rd/9th Light Horse (South Australian Mounted Rifles)',
  "4th/19th Prince of Wales' Light Horse",
  "4th/19th Prince of Wales's Light Horse",
  '5th Battalion, Royal Australian Regiment',
  'ASLAV',
  'Aida',
  'Alvis Saracen',
  'Armoured recovery vehicle',
  'Australian 1st Brigade',
  'Australian Army',
  'Australian Army Reserve',
  'Australian Medium Tank Trials Unit',
  'Battalion',
  'Battle of Binh Ba',
  'Battle of Cambrai (1917)',
  'Battle of Coral–Balmoral',
  'Battle of Hat Dich',
  'Battle of Long K