In [1]:
import zipfile

In [43]:
import xml.etree.ElementTree as ElementTree # A simple API for parsing and creating XML data. I'm just parsing here.

This is from the Python documentation page:
https://docs.python.org/3.4/library/xml.etree.elementtree.html
Warning: The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

The church's epub download is probably safe, but it's worth keeping the warning in mind just to be careful.
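
One cheap way to be careful, sketched below: refuse any XML that declares its own entities before handing it to ElementTree, since entity expansion is what the best-known XML attacks rely on. The helper name parse_without_entities is made up for this note, the byte check only works for ASCII-compatible encodings such as UTF-8, and it is not a substitute for a hardened parser like the defusedxml package.

```python
import xml.etree.ElementTree as ElementTree

def parse_without_entities(xml_bytes):
    # Hypothetical helper: a crude guard, not a complete defence.
    # Documents that declare entities are rejected before parsing.
    if b"<!ENTITY" in xml_bytes:
        raise ValueError("XML declares entities; refusing to parse")
    return ElementTree.fromstring(xml_bytes)

# A harmless document parses as usual.
safe_root = parse_without_entities(b"<ncx><docTitle/></ncx>")
print(safe_root.tag)
```

A plain DOCTYPE line without entity declarations (like the one in an NCX file) still gets through this check.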

In [3]:
BoMEpub = zipfile.ZipFile("book-of-mormon-eng.epub")

The object has a filelist member, which is a list of ZipInfo objects, one per file in the archive. Each one carries metadata such as the filename. Still need to work out the rest of how these work exactly.

In [4]:
for zipped_file in BoMEpub.filelist:
    print(zipped_file.filename)

mimetype
OEBPS/images/03990_000_bofm_000-cover.jpg
OEBPS/lds_ePub_scriptures.css
OEBPS/images/03990_000_bofm-image-1.jpg
OEBPS/images/03990_000_bofm-image-2.jpg
OEBPS/images/03990_000_bofm-image-3.jpg
OEBPS/images/03990_000_bofm-image-4.jpg
OEBPS/images/03990_000_bofm-image-5.jpg
OEBPS/images/03990_000_bofm-image-6.jpg
OEBPS/images/03990_000_bofm-image-7.jpg
OEBPS/images/03990_000_bofm-image-8.jpg
META-INF/container.xml
OEBPS/1-ne.xhtml
OEBPS/1-ne_1.xhtml
OEBPS/1-ne_10.xhtml
OEBPS/1-ne_11.xhtml
OEBPS/1-ne_12.xhtml
OEBPS/1-ne_13.xhtml
OEBPS/1-ne_14.xhtml
OEBPS/1-ne_15.xhtml
OEBPS/1-ne_16.xhtml
OEBPS/1-ne_17.xhtml
OEBPS/1-ne_18.xhtml
OEBPS/1-ne_19.xhtml
OEBPS/1-ne_2.xhtml
OEBPS/1-ne_20.xhtml
OEBPS/1-ne_21.xhtml
OEBPS/1-ne_22.xhtml
OEBPS/1-ne_3.xhtml
OEBPS/1-ne_4.xhtml
OEBPS/1-ne_5.xhtml
OEBPS/1-ne_6.xhtml
OEBPS/1-ne_7.xhtml
OEBPS/1-ne_8.xhtml
OEBPS/1-ne_9.xhtml
OEBPS/2-ne.xhtml
OEBPS/2-ne_1.xhtml
OEBPS/2-ne_10.xhtml
OEBPS/2-ne_11.xhtml
OEBPS/2-ne_12.xhtml
OEBPS/2-ne_13.xhtml
OEBPS/2-ne_1
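
To get a better feel for those per-file objects, here is a small sketch that builds a stand-in archive in memory and lists each entry's name and uncompressed size. The archive contents are invented for the example (it is not the real epub), and infolist() is used as the documented way to get the same ZipInfo objects that filelist holds.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("mimetype", "application/epub+zip")
    demo_zip.writestr("OEBPS/toc.ncx", "<ncx/>")

# Each ZipInfo object describes one stored file.
with zipfile.ZipFile(buffer) as demo_zip:
    entries = [(info.filename, info.file_size) for info in demo_zip.infolist()]

for name, size in entries:
    print(name, size)
```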

The ZipFile object has a method called 'read' which returns the contents of the file whose name (including path) is given.

The TOC file (toc.ncx) is standard for an epub. The ZipFile object has the file list which we looked at previously, but that gives the files in the order they are stored in the archive (which looks alphabetical here), rather than in the right (book) order.

In [5]:
tocFile = BoMEpub.read("OEBPS/toc.ncx")

This tocFile registers as a 'bytes' type, which seems odd at first, because it's just text. ZipFile.read returns the contents of any file in the zip as 'bytes' and leaves the decoding to the caller.

In [6]:
type(tocFile)

bytes
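
If a text version is wanted, the bytes can be decoded explicitly. A minimal sketch, using a short made-up stand-in for the TOC contents and assuming UTF-8 (which is what the real file's XML declaration claims):

```python
# Stand-in for the bytes that ZipFile.read() returns (shortened, hypothetical).
toc_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<ncx><docTitle><text>Book of Mormon</text></docTitle></ncx>"
)

# Decode to str using the encoding named in the XML declaration.
toc_text = toc_bytes.decode("utf-8")
print(type(toc_text).__name__)
print(toc_text)
```

ElementTree.fromstring accepts the raw bytes directly, so decoding is only needed for reading or printing.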

In [7]:
print(tocFile)

b'<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE ncx\n  PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">\n<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1"><head><meta name="dtb:uid" content="978-1-59297-692-8"/><meta name="dtb:depth" content="2"/><meta name="dtb:totalPageCount" content="0"/><meta name="dtb:maxPageNumber" content="0"/></head><docTitle><text>Book of Mormon</text></docTitle><navMap><navPoint id="navPoint-1" playOrder="1"><navLabel><text>Title Page</text></navLabel><content src="bofm_bofm-title.xhtml"/></navPoint><navPoint id="navPoint-2" playOrder="2"><navLabel><text>Introduction</text></navLabel><content src="bofm_introduction.xhtml"/></navPoint><navPoint id="navPoint-3" playOrder="3"><navLabel><text>Testimony of Three Witnesses</text></navLabel><content src="bofm_three.xhtml"/></navPoint><navPoint id="navPoint-4" playOrder="4"><navLabel><text>Testimony of Eight Witnesses</text></navLabel><content src="bofm_eight

This "root" thing is the main item in the XML document. All the subsequent things are nested underneath it. It's an Element (fromstring returns the root Element directly, not an ElementTree), which means it can have child things. It's based on the XML above, which is hard to read because it's been printed without line breaks and indentation.


tocRoot = ElementTree.fromstring(tocFile)
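
To actually see the nesting in the raw XML, one option is to pretty-print it with xml.dom.minidom from the standard library. This is only a sketch for viewing: the sample below is a cut-down copy of the start of the TOC, and minidom carries the same security caveat as ElementTree.

```python
from xml.dom import minidom

# Cut-down sample based on the first navPoint of the TOC shown above.
sample_ncx = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/">'
    b'<navMap><navPoint id="navPoint-1" playOrder="1">'
    b"<navLabel><text>Title Page</text></navLabel>"
    b'<content src="bofm_bofm-title.xhtml"/>'
    b"</navPoint></navMap></ncx>"
)

# toprettyxml returns a str with one element per line, indented by depth.
pretty = minidom.parseString(sample_ncx).toprettyxml(indent="  ")
print(pretty)
```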

The Element class has a method "iter" which iterates recursively through the whole tree, but as will be seen, the output is flat: you don't really get a feel for which things are children and which are parents. So I'll need to write a better function.

In [56]:
for element in tocRoot.iter():
    print(element)

<Element '{http://www.daisy.org/z3986/2005/ncx/}ncx' at 0x7f7ed2f161d8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}head' at 0x7f7ed2f166d8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}meta' at 0x7f7ed2f16778>
<Element '{http://www.daisy.org/z3986/2005/ncx/}meta' at 0x7f7ed2f167c8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}meta' at 0x7f7ed2f16818>
<Element '{http://www.daisy.org/z3986/2005/ncx/}meta' at 0x7f7ed2f16868>
<Element '{http://www.daisy.org/z3986/2005/ncx/}docTitle' at 0x7f7ed2f16908>
<Element '{http://www.daisy.org/z3986/2005/ncx/}text' at 0x7f7ed2f169a8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navMap' at 0x7f7ed2f16a48>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f16ae8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navLabel' at 0x7f7ed2f16b88>
<Element '{http://www.daisy.org/z3986/2005/ncx/}text' at 0x7f7ed2f16bd8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}content' at 0x7f7ed2f16c78>
<Element '{http://www.daisy.org/z39

In [None]:
def recurseElementTree(etree, depth=0):
    print("  " * depth + etree.tag)  # indent each tag by its nesting depth
    for child in etree:
        recurseElementTree(child, depth + 1)

In [11]:
for element in tocRoot:
    print(element)

<Element '{http://www.daisy.org/z3986/2005/ncx/}head' at 0x7f7ed2f166d8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}docTitle' at 0x7f7ed2f16908>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navMap' at 0x7f7ed2f16a48>


This "navMap" thing is the XML block which has the actual table of contents in it.

It probably makes clickable links in the epub as well, but I'm not terribly worried about that...

In [12]:
tocNavMap = tocRoot[2]

It's got the same type as its parent.

In [13]:
type(tocNavMap)

xml.etree.ElementTree.Element
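
Indexing with [2] depends on navMap being the third child. A sketch of looking it up by name instead, using the namespace that shows up in the tags above; the prefix "ncx" and the sample document are choices made for this example only.

```python
import xml.etree.ElementTree as ElementTree

# Map a prefix of our choosing to the namespace seen in the element tags.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Minimal stand-in with the same top-level layout as the real TOC.
sample_root = ElementTree.fromstring(
    '<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/">'
    "<head/><docTitle/><navMap><navPoint/></navMap></ncx>"
)

# find() looks among the direct children for the first matching tag.
nav_map = sample_root.find("ncx:navMap", NCX_NS)
print(nav_map.tag)
print(len(nav_map))
```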

In [22]:
for element in tocNavMap:
    print(element)

<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f16ae8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f16cc8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f16e08>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f16f98>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f30188>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f302c8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f30408>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f30598>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f306d8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f30818>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f30958>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f30a98>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7f7ed2f30bd8>

In [52]:
# Just a bit of messing around to see what's in here. Just turtles all the way down, it looks like.
for element in tocNavMap:
    print("\nElement: {}".format(element))
    print(element.tag)
    print(element.attrib)
    print(type(element))
    print(len(element))
    print(dir(element))
    for item in element:
        print("\nItem: {}".format(item))
        print(item.tag)
        print(item.attrib)
        print(type(item))
        print(len(item))
        print(dir(item))

2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2


In [45]:
tocNavMap.__getattribute__??

In [15]:
# Extract the individual item paths from the navMap.
tocItems = [item[1].attrib["src"] for item in tocNavMap]
print(tocItems)

['bofm_bofm-title.xhtml', 'bofm_introduction.xhtml', 'bofm_three.xhtml', 'bofm_eight.xhtml', 'bofm_js.xhtml', 'bofm_explanation.xhtml', 'bofm-illustrations.xhtml', '1-ne.xhtml', '2-ne.xhtml', 'jacob.xhtml', 'enos.xhtml', 'jarom.xhtml', 'omni.xhtml', 'w-of-m.xhtml', 'mosiah.xhtml', 'alma.xhtml', 'hel.xhtml', '3-ne.xhtml', '4-ne.xhtml', 'morm.xhtml', 'ether.xhtml', 'moro.xhtml', 'bofm_pronunciation.xhtml', 'bofm_reference.xhtml']
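
The item[1] index relies on content always being the second child of each navPoint. A sketch of the same extraction done by tag name, which also picks up the human-readable label; the sample XML copies the first two entries of the TOC, and the names NCX_NS and toc_pairs are invented here.

```python
import xml.etree.ElementTree as ElementTree

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# First two navPoints of the TOC, reproduced as a small sample.
sample_ncx = (
    '<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/"><navMap>'
    '<navPoint id="navPoint-1" playOrder="1">'
    "<navLabel><text>Title Page</text></navLabel>"
    '<content src="bofm_bofm-title.xhtml"/></navPoint>'
    '<navPoint id="navPoint-2" playOrder="2">'
    "<navLabel><text>Introduction</text></navLabel>"
    '<content src="bofm_introduction.xhtml"/></navPoint>'
    "</navMap></ncx>"
)
root = ElementTree.fromstring(sample_ncx)

# Pair each top-level navPoint's label with the file it points to.
toc_pairs = [
    (
        point.find("ncx:navLabel/ncx:text", NCX_NS).text,
        point.find("ncx:content", NCX_NS).attrib["src"],
    )
    for point in root.findall("ncx:navMap/ncx:navPoint", NCX_NS)
]
print(toc_pairs)
```

Like the loop over tocNavMap, this only visits the top-level navPoint entries, not any nested ones.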


In [16]:
# Parse each file listed in the TOC into its own XML root element.
tocItemsRoots = [ElementTree.fromstring(BoMEpub.read("OEBPS/%s"%item)) for item in tocItems]

In [17]:
bookFileList = []

for item in tocItemsRoots:
    # item[1] is the second child of the root (presumably <body>);
    # its class attribute starts with "chapter" or "book", then a name.
    bodyClass = item[1].attrib["class"]
    if "chapter" in bodyClass:
        print("Chapter: {0}".format(bodyClass[len("chapter") + 1:]))
    elif "book" in bodyClass and "illustrations" not in bodyClass:
        bookName = bodyClass[len("book") + 1:]
        print("Book: {0}".format(bookName))
        bookFileList.append(BoMEpub.read("OEBPS/{0}.xhtml".format(bookName)))

Chapter: bofm_bofm-title
Chapter: bofm_introduction
Chapter: bofm_three
Chapter: bofm_eight
Chapter: bofm_js
Chapter: bofm_explanation
Book: 1-ne
Book: 2-ne
Book: jacob
Book: enos
Book: jarom
Book: omni
Book: w-of-m
Book: mosiah
Book: alma
Book: hel
Book: 3-ne
Book: 4-ne
Book: morm
Book: ether
Book: moro
Chapter: bofm_pronunciation
Chapter: bofm_reference
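
The slicing with len("chapter") + 1 could also be written with str.partition. This sketch assumes the class attribute is a kind and a name separated by a single space (for example "book 1-ne"), which is a guess based on the output above rather than something checked against the files; split_body_class is a made-up helper name.

```python
def split_body_class(class_value):
    # Split "kind name" at the first space; name is "" if there is no space.
    kind, _, name = class_value.partition(" ")
    return kind, name

# Assumed attribute values, inferred from the printed output above.
for value in ["chapter bofm_introduction", "book 1-ne"]:
    print(split_body_class(value))
```

Unlike the substring test in the loop above, this compares only the first word, so it would behave differently if the class value had some other layout.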


In [18]:
bookRootList = []
for file in bookFileList:
    bookRootList.append(ElementTree.fromstring(file))
    for child in bookRootList[-1][1][0]:
        print(child.attrib)

{'class': 'title', 'title': '1 Nephi', 'id': 'lds_1-ne_title'}
{'class': 'subtitle'}
{'class': 'intro', 'id': 'lds_1-ne_intro'}
{'class': 'title', 'title': '2 Nephi', 'id': 'lds_2-ne_title'}
{'class': 'intro', 'id': 'lds_2-ne_intro'}
{'class': 'title', 'title': 'Jacob', 'id': 'lds_jacob_title'}
{'class': 'intro', 'id': 'lds_jacob_intro'}
{'class': 'title', 'title': 'Enos', 'id': 'lds_enos_title'}
{'class': 'titleNumber', 'id': 'lds_enos_1_title'}
{'class': 'studySummary', 'id': 'lds_enos_1_intro'}
{'class': 'title', 'title': 'Jarom', 'id': 'lds_jarom_title'}
{'class': 'titleNumber', 'id': 'lds_jarom_1_title'}
{'class': 'studySummary', 'id': 'lds_jarom_1_intro'}
{'class': 'title', 'title': 'Omni', 'id': 'lds_omni_title'}
{'class': 'titleNumber', 'id': 'lds_omni_1_title'}
{'class': 'studySummary', 'id': 'lds_omni_1_intro'}
{'class': 'title', 'title': 'Words of Mormon', 'id': 'lds_w-of-m_title'}
{'class': 'titleNumber', 'id': 'lds_w-of-m_1_title'}
{'class': 'studySummary', 'id': 'lds_w-of
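
Since each book has one child with class 'title', the book titles can be pulled out by attribute rather than by position. A sketch using ElementTree's limited XPath support; the sample XHTML is invented to match the attributes printed above, and the tag names (div, p) are assumptions about the real files.

```python
import xml.etree.ElementTree as ElementTree

# Invented sample: only the attributes are taken from the output above.
sample_book = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head/>'
    '<body class="book 1-ne"><div>'
    '<p class="title" title="1 Nephi" id="lds_1-ne_title">title text</p>'
    '<p class="subtitle">subtitle text</p>'
    '<p class="intro" id="lds_1-ne_intro">intro text</p>'
    "</div></body></html>"
)
book_root = ElementTree.fromstring(sample_book)

# "*" matches any tag, so the XHTML namespace does not get in the way.
titles = [
    element.attrib["title"]
    for element in book_root.findall(".//*[@class='title']")
]
print(titles)
```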