## Exploration of XML files

Let's take the first portion of the code in the `sql_exploration.ipynb` file.

In [2]:
import requests
from bs4 import BeautifulSoup

base_url = "https://dumps.wikimedia.org/simplewiki/"
# Get the text response of the base page
index = requests.get(base_url).text
soup_index = BeautifulSoup(index, "html.parser")

# The base index should contain multiple <a> tags leading
# to different versions of the database
dumps = [a["href"] for a in soup_index.find_all("a") if a.has_attr("href")]

if "latest" not in dumps[-1]:
    # A bare `exit` is a no-op, so raise to actually stop execution
    raise RuntimeError("Couldn't find the latest dump")

# Later, in production, we will use this.
dump = dumps[-2]

# For now, let's use this.
dump = "20201001/"

# Create dump url with the base and the latest timestamp
dump_url = base_url + dump

# Retrieve the dump page
dump_html = requests.get(dump_url).text
soup_dump = BeautifulSoup(dump_html, "html.parser")

# Search for the pages-articles XML files
files = []
for file in soup_dump.find_all("li", {"class": "file"}):
    text = file.text
    if 'pages-articles' in text:
        files.append((text.split()[0], text.split()[1:]))

files_to_download = [file[0] for file in files]
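
The cell above picks the dump directory by list position (`dumps[-2]`). As an alternative, the newest dated directory could be selected by name. This is a minimal sketch, assuming the directory links have the form `YYYYMMDD/` (as in `20201001/`); the helper name `latest_dated_dump` and the sample link list are hypothetical and not part of the notebook.

```python
import re


def latest_dated_dump(hrefs):
    """Return the newest href of the form 'YYYYMMDD/', or None if there is none."""
    dated = [href for href in hrefs if re.fullmatch(r"\d{8}/", href)]
    # Zero-padded dates sort chronologically when compared as plain strings
    return max(dated) if dated else None


# Hypothetical link list standing in for the hrefs scraped from the index page
sample_hrefs = ["../", "20200901/", "20201001/", "latest/"]
newest = latest_dated_dump(sample_hrefs)
print(newest)
```

Selecting by pattern rather than by position avoids depending on the order of links in the index page.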

In [3]:
files_to_download

['simplewiki-20201001-pages-articles-multistream.xml.bz2',
 'simplewiki-20201001-pages-articles-multistream-index.txt.bz2',
 'simplewiki-20201001-pages-articles.xml.bz2']

Let's take only the `multistream.xml` file, which is all we need.

In [4]:
# Search for the multistream pages-articles XML file
files = []
for file in soup_dump.find_all("li", {"class": "file"}):
    text = file.text
    if 'pages-articles' in text and 'multistream.xml' in text:
        files.append((text.split()[0], text.split()[1:]))

files_to_download = [file[0] for file in files]

In [5]:
files_to_download

['simplewiki-20201001-pages-articles-multistream.xml.bz2']
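
The same filter can also be written as a small helper that works on plain file names, which makes it easy to check without fetching the dump page. This is only a sketch: `select_dump_files` and its `required` parameter are hypothetical names, and the input list is copied from the earlier output.

```python
def select_dump_files(names, required=("pages-articles", "multistream.xml")):
    """Keep only the names that contain every required fragment."""
    return [name for name in names if all(part in name for part in required)]


# File names as listed on the 20201001 dump page (see the output above)
listed = [
    "simplewiki-20201001-pages-articles-multistream.xml.bz2",
    "simplewiki-20201001-pages-articles-multistream-index.txt.bz2",
    "simplewiki-20201001-pages-articles.xml.bz2",
]
selected = select_dump_files(listed)
print(selected)
```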

In [6]:
import os

from keras.utils import get_file

# Directory where Keras downloads the files
dataset_dir = os.path.join(os.getcwd(), "datasets")

data_paths = []
file_info = []

for file in files_to_download:
    path = os.path.join(dataset_dir, file)

    if not os.path.exists(path):
        print(f"Downloading {file} ...")
        data_paths.append(
            get_file(fname=file, origin=dump_url + file, cache_subdir=dataset_dir)
        )
        # Find the file size in MB
        file_size = os.stat(path).st_size / 1e6
        print(file_size)

        file_info.append((file, file_size))

    else:
        # If the file already exists, still add it to the list for later processing
        data_paths.append(path)
        file_size = os.stat(path).st_size / 1e6
        file_info.append((file, file_size))

### Parsing the file
- First, we will unzip the file using `bz2`
- Then, we will parse the data using `xml`
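
The two steps above can be sketched end to end on a tiny in-memory sample before touching the real dump. The sample XML, the `TitleCollector` class, and the `collect_titles` function are all hypothetical stand-ins; only `bz2` and `xml.sax` come from the notebook.

```python
import bz2
import io
import xml.sax

# Hypothetical two-page sample standing in for the real dump
SAMPLE_XML = b"""<mediawiki>
  <page><title>April</title><id>1</id></page>
  <page><title>August</title><id>2</id></page>
</mediawiki>
"""


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._chunks = None

    def startElement(self, name, attrs):
        if name == "title":
            self._chunks = []

    def characters(self, content):
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._chunks))
            self._chunks = None


def collect_titles(compressed):
    """Decompress bz2 bytes line by line and feed them to a SAX parser."""
    handler = TitleCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    with bz2.open(io.BytesIO(compressed), "rb") as stream:
        for line in stream:
            parser.feed(line)
    parser.close()
    return handler.titles


titles = collect_titles(bz2.compress(SAMPLE_XML))
print(titles)
```

Feeding the parser line by line means the decompressed XML never has to be held in memory as a whole, which matters for the full dump.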

In [7]:
import bz2

# Since there is only 1 file
data_path = data_paths[0]

lines = []
for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    lines.append(line)
    if i > 100:
        break

lines

[b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">\n',
 b'  <siteinfo>\n',
 b'    <sitename>Wikipedia</sitename>\n',
 b'    <dbname>simplewiki</dbname>\n',
 b'    <base>https://simple.wikipedia.org/wiki/Main_Page</base>\n',
 b'    <generator>MediaWiki 1.36.0-wmf.10</generator>\n',
 b'    <case>first-letter</case>\n',
 b'    <namespaces>\n',
 b'      <namespace key="-2" case="first-letter">Media</namespace>\n',
 b'      <namespace key="-1" case="first-letter">Special</namespace>\n',
 b'      <namespace key="0" case="first-letter" />\n',
 b'      <namespace key="1" case="first-letter">Talk</namespace>\n',
 b'      <namespace key="2" case="first-letter">User</namespace>\n',
 b'      <namespace key="3" case="first-letter">User talk</namespace>\n',
 b'      <namespace key="4" case="first-l

Our goal is to keep the content between the `<title>`, `<id>` and `<text>` tags.

For that, we are going to use `xml.sax`, which parses the file as a stream of events instead of loading it all into memory. This code is inspired by:

https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c

In [42]:
import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    """Content handler for Wiki XML data using SAX"""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None
        self._pages = []

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text', 'timestamp', 'id'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            # Store each page as a (title, id, text) tuple
            self._pages.append((self._values['title'], self._values['id'], self._values['text']))

In [43]:
# Content handler for Wiki XML
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

handler._pages

[]

In [44]:
# Feed the decompressed file to the parser line by line
for line in bz2.BZ2File(data_path, 'r'):
    parser.feed(line)

    # Stop once a handful of pages has been collected
    if len(handler._pages) > 20:
        break

In [45]:
import mwparserfromhell 

print(handler._pages[19][0])

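# Create the wiki article.
# Each page is stored as (title, id, text), so index 1 passes the id field,
# which is why the output below shows only a number and no links.
# Use handler._pages[19][2] to parse the article text instead.
wiki = mwparserfromhell.parse(handler._pages[19][1])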

Angel


In [46]:
print(wiki[:1000])
print('='*10)
wiki.filter_wikilinks()
print('='*10)
wiki.filter_external_links()

551349


[]

As we can see, the `wiki` object created by `mwparserfromhell` comes in handy, with many built-in functions such as `filter_wikilinks()` and `filter_external_links()`.

We can now easily extract information about a page.

In [51]:
print("Page title: ", handler._pages[0][0])
print("Page id: ", handler._pages[0][1])
print("Page content preview: ", handler._pages[0][2][:150])

Page title:  April
Page id:  844779
Page content preview:  {{monththisyear|4}} 
 '''April''' is the fourth [[month]] of the [[year]], and comes between [[March]] and [[May]]. It is one of four months to have 3
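
One thing worth verifying in this output: in the MediaWiki export format, `<id>` elements appear not only directly under `<page>` but also inside `<revision>` and `<contributor>`. A handler that matches on tag name alone keeps whichever `<id>` closed last, which may not be the page id. The following is a minimal sketch of a variant that tracks the parent element. The `PageLevelHandler` class and the sample XML with its id values are hypothetical, and it joins text chunks with `""` rather than `" "`.

```python
import xml.sax

# Hypothetical single-page sample with page, revision and contributor ids
SAMPLE = b"""<mediawiki>
  <page>
    <title>April</title>
    <id>1</id>
    <revision>
      <id>1001</id>
      <contributor><username>Example</username><id>2002</id></contributor>
      <text>'''April''' is the fourth [[month]].</text>
    </revision>
  </page>
</mediawiki>"""


class PageLevelHandler(xml.sax.ContentHandler):
    """Keep <title> and <id> directly under <page>, and <text> under <revision>."""

    WANTED = {("page", "title"), ("page", "id"), ("revision", "text")}

    def __init__(self):
        super().__init__()
        self.pages = []
        self._stack = []      # names of the currently open elements
        self._values = {}
        self._buffer = None   # collects text only inside a wanted element

    def startElement(self, name, attrs):
        self._stack.append(name)
        if tuple(self._stack[-2:]) in self.WANTED:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if tuple(self._stack[-2:]) in self.WANTED:
            self._values[name] = "".join(self._buffer)
            self._buffer = None
        self._stack.pop()
        if name == "page":
            self.pages.append(
                (self._values.get("title"), self._values.get("id"), self._values.get("text"))
            )
            self._values = {}


handler = PageLevelHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.pages)
```

With this sample, the stored id is the one directly under `<page>`, not the revision or contributor id.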


This information can later be ingested into our MySQL database. However, that is not a current requirement of the assignment.

We have, however, explored the data available in the Simple Wikipedia dumps.