# Downloading and Converting Wikipedia XML Dump Files to Clean Text

This notebook downloads a Wikipedia dump file, processes and cleans the articles, and saves the result for later use.

All the functions in this notebook work lazily.

We don't want to read or process the whole file at once, as that would consume too much memory.
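
As a minimal sketch of what "lazy" means here, the generators below (hypothetical names, not part of the notebook's pipeline) are chained without computing anything until a consumer asks for a value:

```python
def read_numbers(limit):
    """Yield integers one at a time instead of building a list."""
    for n in range(limit):
        yield n


def squares(numbers):
    """Transform each item lazily; nothing runs until a value is requested."""
    for n in numbers:
        yield n * n


# Chaining generators builds a pipeline without computing anything yet,
# even though the source could produce a billion items.
pipeline = squares(read_numbers(10**9))

# Only the first three items are ever produced.
first_three = [next(pipeline) for _ in range(3)]
print(first_three)  # [0, 1, 4]
```

The parsing, cleaning, and writing steps later in this notebook follow the same pattern, with one page flowing through the chain at a time.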

First, the imports:
- bz2 - for extracting the downloaded file
- json - the "clean text" data is saved as UTF-8 encoded JSON, one object per line, in the format {"text": "the content of the article"}
- os - for file stats and os.path functionality
- re - for cleaning text
- ElementTree - for parsing XML
- six.moves.urllib - for downloading the Wikipedia file
- time - for tracking elapsed time

In [1]:
import bz2
import json
import os
import re
import xml.etree.ElementTree as ET

from six.moves import urllib
from time import time

Download the Wikipedia file.

You may change the url and filename if necessary. If you do, update `expected_bytes` as well, or leave it out to skip the size check.

In [2]:
def maybe_download(url, filename, expected_bytes=None):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        print("getting from: {}".format(url))
        filename, _ = urllib.request.urlretrieve(url, filename)
    if expected_bytes:
        statinfo = os.stat(filename)
        if statinfo.st_size == expected_bytes:
            print('Found and verified', filename)
        else:
            print(statinfo.st_size)
            raise Exception(
                'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename


bz2file = maybe_download("https://dumps.wikimedia.org/idwiki/20170620/idwiki-20170620-pages-articles.xml.bz2", "idwiki-20170620-pages-articles.xml.bz2", 409688848)
print("The file is in:", bz2file)

Found and verified idwiki-20170620-pages-articles.xml.bz2
The file is in: idwiki-20170620-pages-articles.xml.bz2
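
A size check catches truncated downloads but not corrupted content. As a hedged sketch, the helper below (`file_digest` is a hypothetical name, not part of this notebook) hashes a file in chunks so the result can be compared against a published checksum. That the dump directory provides such a checksum for the file is an assumption here.

```python
import hashlib
import os
import tempfile


def file_digest(filename, algorithm="sha1", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file standing in for the real download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello wikipedia")
    tmp_name = tmp.name
try:
    digest = file_digest(tmp_name)
finally:
    os.remove(tmp_name)

# The chunked digest matches hashing the same bytes in one call.
print(digest == hashlib.sha1(b"hello wikipedia").hexdigest())  # True
```

For the real file, the call would be `file_digest(bz2file)`, compared by hand against whatever checksum is published alongside the dump.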


After verifying the file, extract the .bz2 file.

In [3]:
def extract_bz2(filename):
    """Decompress a .bz2 file next to the original and return the new filename."""
    fname, ext = os.path.splitext(filename)
    if ext != ".bz2":
        raise ValueError("filename specified is not a .bz2")
    if os.path.exists(fname):
        print(fname, "already exists")
        return fname

    # Copy in 100 KB chunks so the whole archive is never held in memory.
    with open(fname, "wb") as f, bz2.BZ2File(filename, "rb") as bf:
        for data in iter(lambda: bf.read(100 * 1024), b''):
            f.write(data)
    return fname


xmlfile = extract_bz2(bz2file)
statinfo = os.stat(xmlfile)
print("file size: {:.3f} GB".format(statinfo.st_size / (1024*1024*1024)))

idwiki-20170620-pages-articles.xml already exists
file size: 1.949 GB
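
If disk space is tight, the extraction step could in principle be skipped, because `ET.iterparse` accepts a file object and `bz2.open` returns one that decompresses as it is read. The sketch below illustrates this on a tiny made-up document compressed in memory. It is an alternative under stated assumptions, not what this notebook does.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a compressed dump (hypothetical content).
xml_bytes = b"<root><page><title>A</title></page><page><title>B</title></page></root>"
compressed = bz2.compress(xml_bytes)

titles = []
# bz2.open takes a filename or a binary file object and decompresses on read,
# so the parser only ever sees a stream of XML bytes.
with bz2.open(io.BytesIO(compressed), "rb") as stream:
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            titles.append(elem.find("title").text)
            elem.clear()  # release the page's children once it has been used

print(titles)  # ['A', 'B']
```

The trade-off is that decompression is repeated on every run, whereas extracting once makes later passes over the .xml file faster.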


After extracting the .bz2 file, we have the .xml file.

It's time to parse the XML file to get the article pages.

This functionality is similar to:

```
def read_xml(filename):
    tree = ET.parse(filename)
    root = tree.getroot()

    pages = root.findall('export-0.1:page', ns)
    return pages
```

However, this code builds the whole tree in memory and consumes too much of it: even an instance with 8 GB of memory still hits a MemoryError.

Hence, we use `iterparse()`, which yields elements while the file is being read.
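
Before the full parser, here is a minimal sketch of the idea on a tiny in-memory document. The sample XML and the `iter_pages` name are made up for illustration, and the sketch compares full namespaced tags, whereas the notebook's parser uses substring checks.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://www.mediawiki.org/xml/export-0.10/"

# A made-up miniature dump that uses the same namespace as the real one.
sample = (
    '<mediawiki xmlns="{ns}">'
    "<siteinfo><sitename>Wikipedia</sitename></siteinfo>"
    "<page><title>Satu</title><revision><text>isi satu</text></revision></page>"
    "<page><title>Dua</title><revision><text>isi dua</text></revision></page>"
    "</mediawiki>"
).format(ns=NS).encode("utf-8")


def iter_pages(source):
    """Yield each complete <page> element, then free its contents."""
    page_tag = "{%s}page" % NS
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == page_tag:
            yield elem
            elem.clear()  # runs after the consumer is done with the page


found = []
for page in iter_pages(io.BytesIO(sample)):
    title = page.find("{%s}title" % NS).text
    text = page.find("{%s}revision/{%s}text" % (NS, NS)).text
    found.append((title, text))

print(found)  # [('Satu', 'isi satu'), ('Dua', 'isi dua')]
```

Because `elem.clear()` runs only when the generator resumes, the consumer must read everything it needs from a page before asking for the next one.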

In [4]:
ns = {'export-0.1': 'http://www.mediawiki.org/xml/export-0.10/'}
tags_to_skip = ["siteinfo"]


def parse_wiki_xml(filename):
    """Lazily yield each <page> element, skipping the tags in tags_to_skip."""
    skipping = ""
    in_page = False
    for event, elem in ET.iterparse(filename, events=("start", "end",)):
        if event == "start":
            for tag in tags_to_skip:
                if tag in elem.tag:
                    print("removing elem siteinfo")
                    skipping = tag
                    elem.clear()
                    break
            if in_page:
                continue
            if "page" in elem.tag:
                in_page = True
#             if not skipping and "page" not in elem.tag:
#                 print("start event for tag:", elem.tag)
        elif event == "end":
            if skipping:
                if skipping in elem.tag:
                    elem.clear()
                    skipping = ""
            else:
                if "page" in elem.tag:
                    yield elem
                    elem.clear()
                    in_page = False


pages = parse_wiki_xml(xmlfile)

The article text inside the XML file still contains wiki markup and needs cleaning, so we create a function for that.

This function, `process_text`, is similar to the Perl script `wikifil.pl` available at http://mattmahoney.net/dc/textdata (see Appendix A there).

In [5]:
def process_text(text):
    # Remove any text not normally visible
    text = re.sub(r"<.*>", "", text)  # remove xml tags (greedy: first "<" to last ">" on a line)
    text = re.sub(r"&amp;", "&", text)  # decode HTML entities
    text = re.sub(r"&nbsp;", " ", text)
    text = re.sub(r"&lt;", "<", text)
    text = re.sub(r"&gt;", ">", text)
    text = re.sub(r"<ref[^<]*</ref>", "", text)  # remove references <ref...> ... </ref>
    text = re.sub(r"<[^>]*>", "", text)  # remove xhtml tags
    text = re.sub(r"\[http:[^] ]*", "[", text)  # remove normal url, preserve visible text
    text = re.sub(r"\|thumb", "", text)  # remove images links, preserve caption
    text = re.sub(r"\|left", "", text)
    text = re.sub(r"\|right", "", text)
    text = re.sub(r"\|\d+px", "", text)
    text = re.sub(r"\[\[image:[^\[\]]*\|", "", text)
    text = re.sub(r"\[\[category:([^|\]]*)[^]]*\]\]", r"\1", text, flags=re.I)  # show categories without markup
    text = re.sub(r"\[\[[a-z\-]*:[^\]]*\]\]", "", text)  # remove links to other languages
    text = re.sub(r"\[\[[^\|\]]*\|", "[[", text)  # remove wiki url, preserve visible text
    text = re.sub(r"{{[^}]*}}", "", text)  # remove {{icons}} and {tables}
    text = re.sub(r"{[^}]*}", "", text)
    text = re.sub(r"\[", "", text)  # remove [ and ]
    text = re.sub(r"\]", "", text)
    text = re.sub(r"\(", "", text)  # remove ( and )
    text = re.sub(r"\)", "", text)
    text = re.sub(r"&[^;]*;", " ", text)  # remove remaining HTML entities
    text = re.sub(r"\"", "", text)  # remove " and '
    text = re.sub(r"'", "", text)
    text = re.sub(r"_", "", text)  # remove _
    text = re.sub(r"\W+", " ", text)  # collapse runs of non-word characters into one space
    text = re.sub(r"0", " nol ", text)  # spell out digits in Indonesian
    text = re.sub(r"1", " satu ", text)
    text = re.sub(r"2", " dua ", text)
    text = re.sub(r"3", " tiga ", text)
    text = re.sub(r"4", " empat ", text)
    text = re.sub(r"5", " lima ", text)
    text = re.sub(r"6", " enam ", text)
    text = re.sub(r"7", " tujuh ", text)
    text = re.sub(r"8", " delapan ", text)
    text = re.sub(r"9", " sembilan ", text)
    text = text.lower()
    return text
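
One behaviour of the first substitution in `process_text` is worth knowing: `<.*>` is greedy, so on a line with several tags it removes everything from the first `<` to the last `>`. The sketch below compares it with a per-tag pattern on a made-up sentence. It only illustrates the difference and does not claim which behaviour the notebook intends.

```python
import re

# A made-up line of wiki text with a reference and a bold tag.
line = "Jakarta<ref>sumber</ref> adalah ibu kota <b>Indonesia</b>."

# Greedy: matches from the first "<" to the last ">" on the line.
greedy = re.sub(r"<.*>", "", line)

# Per tag: each match stops at the first ">" after its "<".
per_tag = re.sub(r"<[^>]*>", "", line)

# Dropping references first, then the remaining tags one by one.
refs_first = re.sub(r"<[^>]*>", "", re.sub(r"<ref[^<]*</ref>", "", line))

print(greedy)      # Jakarta.
print(per_tag)     # Jakartasumber adalah ibu kota Indonesia.
print(refs_first)  # Jakarta adalah ibu kota Indonesia.
```

Since `.` does not match newlines by default, the greedy pattern never removes more than one line of the article at a time.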

Now that we have a function to clean text, it's time to process the Wikipedia pages and convert them into clean text.

In [6]:
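def convert_to_text(pages):
    """Lazily yield cleaned article text; the progress message uses the global t0 set before iteration starts."""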
    for i, page in enumerate(pages):
        if i % 20000 == 19999:
            print("Read {}k articles. Elapsed time: {:.3f}s".format(int((i+1)/1000), time() - t0))

        title = page.find('export-0.1:title', ns).text.lower()
        if title.startswith("wikipedia:catatan commons"):
            page.clear()
            continue
        del title

        text = page.find('export-0.1:revision', ns).find('export-0.1:text', ns).text
        if not text:
            page.clear()
            continue

        text = process_text(text)

        words = text.split()
        del text
        total_words = len(words)
        total_long_words = len([w for w in words if len(w) > 3])
        if total_long_words < 15:
            page.clear()
            continue
        yield " ".join(words)


texts = convert_to_text(pages)
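
The filter inside `convert_to_text` keeps a page only if it has at least 15 words longer than three characters. A small standalone sketch of that rule, using a hypothetical helper name and made-up inputs:

```python
def has_enough_long_words(text, min_length=4, min_count=15):
    """Return True if the text has at least min_count words of min_length or more characters."""
    long_words = [w for w in text.split() if len(w) >= min_length]
    return len(long_words) >= min_count


# A short stub: only "adalah" and "rintisan" count as long words.
stub = "ini adalah rintisan"

# A made-up "article" with exactly fifteen long words.
article = " ".join(["wikipedia"] * 15)

print(has_enough_long_words(stub))     # False
print(has_enough_long_words(article))  # True
```

This drops stubs and near-empty pages that would add little to a text corpus.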

In [7]:
text_filename = os.path.splitext(xmlfile)[0]+".text"
print("start writing wikipedia texts to {}".format(text_filename))
t0 = time()

with open(text_filename, "wb") as f:
    for i, text in enumerate(texts):
        f.write((json.dumps({"text": text}) + os.linesep).encode("utf-8"))
        if i % 10000 == 9999:
            print("Done writing {}k pages. Elapsed time: {:.3f}s".format(int((i+1)/1000), time() - t0))
print("Done writing wikipedia texts in {:.3f}s".format(time() - t0))

start writing wikipedia texts to idwiki-20170620-pages-articles.text
removing elem siteinfo
Done writing 10k pages. Elapsed time: 18.612s
Read 20k articles. Elapsed time: 22.436s
Done writing 20k pages. Elapsed time: 35.760s
Read 40k articles. Elapsed time: 38.214s
Done writing 30k pages. Elapsed time: 48.134s
Read 60k articles. Elapsed time: 48.445s
Read 80k articles. Elapsed time: 57.816s
Done writing 40k pages. Elapsed time: 61.211s
Read 100k articles. Elapsed time: 67.440s
Done writing 50k pages. Elapsed time: 73.638s
Read 120k articles. Elapsed time: 79.241s
Done writing 60k pages. Elapsed time: 87.741s
Read 140k articles. Elapsed time: 90.314s
Done writing 70k pages. Elapsed time: 97.043s
Read 160k articles. Elapsed time: 100.681s
Done writing 80k pages. Elapsed time: 109.359s
Read 180k articles. Elapsed time: 112.199s
Done writing 90k pages. Elapsed time: 121.453s
Read 200k articles. Elapsed time: 123.773s
Read 220k articles. Elapsed time: 133.969s
Done writing 100k pages. Elaps

To read the file back, use the function below.

In [8]:
import json
def read_text_data(filename="idwiki-20170620-pages-articles.text"):
    with open(filename, "rb") as f:
        i = 0
        for line in f:
            row = json.loads(line.decode("utf-8"))
            text = row["text"]
            yield text
            i += 1
        print("total line: {}".format(i))
        
for item in read_text_data():
    pass

total line: 410495
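
Because the output holds one JSON object per line, it can be consumed lazily too. The sketch below round-trips a few made-up records through an in-memory buffer and uses `itertools.islice` to peek at the first two without reading the rest. The helper names are hypothetical.

```python
import io
import json
from itertools import islice


def write_json_lines(texts, f):
    """Write each text as one UTF-8 encoded JSON object per line."""
    for text in texts:
        f.write((json.dumps({"text": text}) + "\n").encode("utf-8"))


def read_json_lines(f):
    """Lazily yield the "text" field of each line."""
    for line in f:
        yield json.loads(line.decode("utf-8"))["text"]


# An in-memory buffer stands in for the .text file.
buffer = io.BytesIO()
write_json_lines(["artikel pertama", "artikel kedua", "artikel ketiga"], buffer)
buffer.seek(0)

# Only the first two lines are decoded.
first_two = list(islice(read_json_lines(buffer), 2))
print(first_two)  # ['artikel pertama', 'artikel kedua']
```

`json.dumps` escapes any newline inside a text, which is what keeps each record on a single line.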
