### Data Extraction - Wikipedia Pages
##### a) Imports 
Let's start by importing some packages that we'll need...

In [1]:
import pandas as pd
import time
import sys
import pickle
import copy

# File utilities
from pathlib import Path
from gzip import GzipFile
from bz2file import BZ2File
import wget

# parsing
from lxml import etree
import re
import wikitextparser as wtp
from urllib.parse import unquote
from typing import Dict, Set, List, Tuple

##### b) Download page content
Note that there are many different [Wikipedia sites](https://meta.wikimedia.org/wiki/List_of_Wikipedias). The subdomain of each site is usually the 2-letter [international language code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) for the language in which that Wikipedia is written. There are exceptions, such as the Simple English Wikipedia, [simple.wikipedia.org](https://simple.wikipedia.org/wiki/Main_Page). We'll begin by downloading the page content from the Wikimedia dumps site...

In [2]:
wiki = "simple"
rawdatadir = "" # "../rawdata/"
datadir = "" # "../data/"
wikidump = "https://dumps.wikimedia.org/" + wiki + "wiki/latest/"
filenames = {}
filenames['article'] = wiki + "wiki-latest-pages-articles.xml.bz2"
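try:
    # Join by plain concatenation: rawdatadir is either empty or ends with a slash,
    # and an extra "/" would point at the filesystem root when rawdatadir is empty
    Path(rawdatadir + filenames['article']).resolve(strict=True)
    print("Articles file already downloaded")
except FileNotFoundError:
    wget.download(wikidump + filenames['article'], rawdatadir)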
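
As a minimal sketch, the URL and local-path logic above could be wrapped in a small helper, so that switching to another wiki only means changing the wiki code. The names `DUMP_ROOT` and `article_dump` are hypothetical and are not used elsewhere in this notebook. The sketch only builds the names and checks for the file; it does not download anything.

```python
from pathlib import Path

# Root of the Wikimedia dumps site, as used in the cell above
DUMP_ROOT = "https://dumps.wikimedia.org/"

def article_dump(wiki, rawdatadir=""):
    """Return (url, local_path) for the latest articles dump of one wiki."""
    filename = wiki + "wiki-latest-pages-articles.xml.bz2"
    url = DUMP_ROOT + wiki + "wiki/latest/" + filename
    return url, Path(rawdatadir) / filename

url, local_path = article_dump("simple")
print(url)
print("already downloaded" if local_path.exists() else "needs download")
```

`Path.exists()` is used here as a simpler alternative to the `resolve(strict=True)` check in the cell above; both only test whether the file is present.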

##### c) Parse page XML file for content
The pages file is an XML file. We'll use lxml's etree module to parse the XML. We'll need a class that can extract the relevant data elements from each page element...

In [3]:
class PageData:
    __slots__ = ("page", "title", "ns", "pageid", "redirect", "rd_title", "wikitext", "pageweight")

    def __init__(self, page):
        self.page = page
        self.parse()

    def __str__(self):
        return self.title + "(" + self.pageid + "): " + self.wikitext

    def parse(self):
        # Defaults, so every slot is set even if an element is missing or empty
        self.title = self.ns = self.pageid = self.wikitext = ""
        self.pageweight = "0"
        self.redirect = False
        for child in self.page:
            name = etree.QName(child).localname  # tag name without the XML namespace
            if name == "title":
                self.title = child.text
            elif name == "ns":
                self.ns = child.text
            elif name == "id":
                self.pageid = child.text
            elif name == "redirect":
                self.redirect = True
                self.rd_title = child.attrib['title']
            elif name == "revision":
                for grandchild in child:
                    if etree.QName(grandchild).localname == "text":
                        if grandchild.text:
                            # Newlines and tabs would break the one-row-per-page TSV output
                            self.wikitext = grandchild.text.replace('\n', ' ').replace('\t', ' ')
                        # Keep pageweight as a string; fall back to "0" if absent or malformed
                        weight = grandchild.attrib.get('bytes', "0")
                        try:
                            int(weight)
                        except ValueError:
                            weight = "0"
                        self.pageweight = weight
        # A page that is not a redirect points at itself
        if not self.redirect:
            self.rd_title = self.title

Below is a simplified sample of the page content for testing. Let's run the class on this sample of XML data:

In [4]:
testpage=" \
<page> \
  <title>April</title> \
  <ns>0</ns> \
  <id>1</id> \
  <revision> \
    <id>8446859</id> \
    <text bytes = \"22188\"> April is the fourth [[month]] of the [[year]] </text> \
    <sha1>iyw2lle520lh9mgpxgg1y0age5yr5b5</sha1> \
  </revision> \
</page>"

page = etree.fromstring(testpage)
print(PageData(page))

April(1):  April is the fourth [[month]] of the [[year]] 
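
The sample above is an ordinary article, so it doesn't exercise the redirect branch. As a rough illustration of how a redirect page stores its target, here is a sketch that uses the standard library's `xml.etree.ElementTree` rather than lxml, so it runs without extra packages. The page content, the id 99 and the helper `redirect_target` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical redirect page: the target title lives in the "title"
# attribute of an empty <redirect/> element
redirect_page = ET.fromstring(
    "<page>"
    "<title>Aug</title><ns>0</ns><id>99</id>"
    '<redirect title="August" />'
    "</page>"
)

def redirect_target(page):
    """Return the redirect target, or the page's own title if it is not a redirect."""
    rd = page.find("redirect")
    return rd.attrib["title"] if rd is not None else page.findtext("title")

print(redirect_target(redirect_page))
```

This mirrors the rule used for `rd_title` in `PageData`: a redirect page reports its target, and any other page reports itself.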


Now, we'll parse the page data. Note that we won't be loading all the data into memory. The etree module allows us to traverse the XML tree iteratively with the "iterparse" function. In addition to a content file, we'll also create a pagemaster file. This will include a flag for whether the page redirects to another page and the title of the redirect target.

In [5]:
def process_pages(*,wiki, rawdatadir = "", datadir = ""):
    article_file = wiki + "wiki-latest-pages-articles.xml.bz2"
    title_redirect_dict = {}
    title_pageid_dict = {}

    with BZ2File(rawdatadir + article_file) as infile:
        with BZ2File(datadir + wiki + "wiki-pagemaster.tsv.bz2", "w") as pagemasterfile:
            pagemasterfile.write(("pageid\ttitle\trdflag\tredirect\tpageweight\n").encode())
            with BZ2File(datadir + wiki + "wiki-pages.tsv.bz2", "w") as pagefile:
                pagefile.write(("pageid\ttitle\twikitext\n").encode())
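                # The schema version in this namespace URI must match the xmlns of the downloaded dump;
                # if it differs, no page elements will match and the output files will be empty
                wiki_ns = "{http://www.mediawiki.org/xml/export-0.10/}"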

                for _, page in etree.iterparse(infile, tag=wiki_ns + "page"):

                    article = PageData(page)
                    if article.ns == "0": #article 
                        title_redirect_dict[article.title.lower()] = article.rd_title
                        title_pageid_dict[article.title] = article.pageid   
                        output = article.pageid + "\t" + article.title + "\t" + str(int(article.redirect)) + "\t" + article.rd_title + "\t"
                        output += str(article.pageweight) + "\n"  # str() guards against a non-string fallback value
                        pagemasterfile.write(output.encode())
                        if not article.redirect:
                            pagefile.write((article.pageid + "\t" + article.title + "\t" + article.wikitext + "\n").encode())   
                    page.clear(keep_tail=True)
                    
                with open(rawdatadir + wiki + 'wiki-title-redirect.pickle', 'wb') as handle:
                    pickle.dump(title_redirect_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
                with open(rawdatadir + wiki + 'wiki-title-pageid.pickle', 'wb') as handle:
                    pickle.dump(title_pageid_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
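
To make the streaming idea concrete, here is a minimal sketch of the same pattern on a tiny in-memory sample. It is not the function above: it uses the standard library's `bz2` and `xml.etree.ElementTree` in place of `bz2file` and lxml, so the `tag=` filter is replaced by an explicit check on `elem.tag`. The sample XML and the helper name `iter_titles` are assumptions made for illustration.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Same namespace as in process_pages above
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Made-up two-page dump, compressed in memory to stand in for the real file
sample = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    "<page><title>April</title><ns>0</ns><id>1</id></page>"
    "<page><title>August</title><ns>0</ns><id>2</id></page>"
    "</mediawiki>"
).encode()

def iter_titles(stream):
    """Yield (pageid, title) one page at a time instead of building the whole tree."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "page":
            yield elem.findtext(NS + "id"), elem.findtext(NS + "title")
            elem.clear()  # drop the page's children once it has been used

compressed = io.BytesIO(bz2.compress(sample))
with bz2.open(compressed) as infile:
    titles = list(iter_titles(infile))
print(titles)
```

The "end" event fires once an element has been read completely, which is why the page's children are available at that point and can be cleared immediately afterwards.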

Let's process and load the data. Note that we're specifying the numeric data types to limit memory usage. 

In [6]:
pagemasterfile = datadir  +  wiki + "wiki-pagemaster.tsv.bz2"
try:
    Path(pagemasterfile).resolve(strict=True)
    print ("Page master file already exists")
except FileNotFoundError:
    process_pages(wiki=wiki,rawdatadir = rawdatadir ,datadir = datadir)

pagemaster = pd.read_table(
    pagemasterfile,
    dtype = {"pageid":"int32", "rdflag":"int8", "pageweight":"int32"},
    keep_default_na=False,
    na_values=['_'],
    quoting = 3,  # csv.QUOTE_NONE: quote characters in titles are ordinary text
    iterator = True).get_chunk(100)  # read only the first 100 rows
pagemaster.head()

Unnamed: 0,pageid,title,rdflag,redirect,pageweight
0,1,April,0,April,22188
1,2,August,0,August,13326
2,6,Art,0,Art,7655
3,8,A,0,A,3182
4,9,Air,0,Air,4328
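
The `quoting = 3` argument above is the numeric value of `csv.QUOTE_NONE`. As a small sketch of why that setting matters for titles containing quote characters, the example below uses the standard library's `csv` module instead of pandas, on a made-up two-row sample.

```python
import csv
import io

# Made-up sample: the first title contains literal double quotes
tsv = 'pageid\ttitle\n8\t"A" (letter)\n9\tAir\n'

# Default dialect: double quotes are interpreted as quoting syntax
default_rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))

# QUOTE_NONE: double quotes are kept as ordinary characters
raw_rows = list(csv.reader(io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE))

print(csv.QUOTE_NONE)
print(default_rows[1])
print(raw_rows[1])
```

Since the files written by `process_pages` are never quoted, reading them with quoting switched off keeps every title exactly as it was written.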


We'll also take a look at the page file. We'll use a parsing library, wikitextparser, to convert the wikitext into plaintext.

In [7]:
pagefile = datadir + wiki + "wiki-pages.tsv.bz2"
pages = pd.read_table(
        pagefile, 
        dtype = {"pageid":"int32"},
        keep_default_na=False, 
        na_values=['_'],
        quoting = 3,
        iterator = True).get_chunk(100)

def wikitext_to_plaintext(wikitext):
    return wtp.parse(wikitext).plain_text()

pages["plaintext"]=pages["wikitext"].map(wikitext_to_plaintext)

pages.head()

Unnamed: 0,pageid,title,wikitext,plaintext
0,1,April,{{monththisyear|4}} '''April''' is the fourth ...,April is the fourth month of the year in the ...
1,2,August,{{monththisyear|8}} '''August''' (Aug.) is the...,August (Aug.) is the eighth month of the year...
2,6,Art,"[[File:Pierre-Auguste_Renoir,_Le_Moulin_de_la_...",Art is a creative activity and technical ski...
3,8,A,{{about| the first [[letter]] in the [[alphabe...,"thumb|Writing ""A"" in cursive font. A or a ..."
4,9,Air,[[Image:Kawasaki-Electric Fan.jpg|thumb|A [[wi...,Air is the Earth's atmosphere. Air is a mixt...
