In [1]:
import zipfile

In [2]:
import xml.etree.ElementTree as ElementTree # A simple API for parsing and creating XML data. I'm just parsing here.

This is from the Python documentation page:
https://docs.python.org/3.4/library/xml.etree.elementtree.html
Warning The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

The church's epub download is probably safe, but it's worth being careful.

In [3]:
BoMEpub = zipfile.ZipFile("book-of-mormon-eng.epub")

The object has a filelist member, which is a list of ZipInfo objects, one per file in the archive. Each one has a filename attribute (among others), which is all I need for now.

In [4]:
for zipped_file in BoMEpub.filelist:
    print(zipped_file.filename)

mimetype
OEBPS/images/03990_000_bofm_000-cover.jpg
OEBPS/lds_ePub_scriptures.css
OEBPS/images/03990_000_bofm-image-1.jpg
OEBPS/images/03990_000_bofm-image-2.jpg
OEBPS/images/03990_000_bofm-image-3.jpg
OEBPS/images/03990_000_bofm-image-4.jpg
OEBPS/images/03990_000_bofm-image-5.jpg
OEBPS/images/03990_000_bofm-image-6.jpg
OEBPS/images/03990_000_bofm-image-7.jpg
OEBPS/images/03990_000_bofm-image-8.jpg
META-INF/container.xml
OEBPS/1-ne.xhtml
OEBPS/1-ne_1.xhtml
OEBPS/1-ne_10.xhtml
OEBPS/1-ne_11.xhtml
OEBPS/1-ne_12.xhtml
OEBPS/1-ne_13.xhtml
OEBPS/1-ne_14.xhtml
OEBPS/1-ne_15.xhtml
OEBPS/1-ne_16.xhtml
OEBPS/1-ne_17.xhtml
OEBPS/1-ne_18.xhtml
OEBPS/1-ne_19.xhtml
OEBPS/1-ne_2.xhtml
OEBPS/1-ne_20.xhtml
OEBPS/1-ne_21.xhtml
OEBPS/1-ne_22.xhtml
OEBPS/1-ne_3.xhtml
OEBPS/1-ne_4.xhtml
OEBPS/1-ne_5.xhtml
OEBPS/1-ne_6.xhtml
OEBPS/1-ne_7.xhtml
OEBPS/1-ne_8.xhtml
OEBPS/1-ne_9.xhtml
OEBPS/2-ne.xhtml
OEBPS/2-ne_1.xhtml
OEBPS/2-ne_10.xhtml
OEBPS/2-ne_11.xhtml
OEBPS/2-ne_12.xhtml
OEBPS/2-ne_13.xhtml
OEBPS/2-ne_1
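
To get a feel for what those ZipInfo objects carry, here is a minimal sketch that runs without the epub. It builds a tiny stand-in archive in memory (the contents are made up for illustration) and uses infolist(), the documented counterpart of the filelist attribute.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory (hypothetical contents, not the real epub).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/toc.ncx", "<ncx/>")

with zipfile.ZipFile(buffer) as archive:
    # infolist() returns one ZipInfo object per archived file, in archive order.
    entries = [(info.filename, info.file_size) for info in archive.infolist()]
    # namelist() is a shortcut when only the names are needed.
    names = archive.namelist()

for filename, size in entries:
    print(filename, size)
```

The same loop should work on BoMEpub.infolist(); file_size is the uncompressed size in bytes.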

The ZipFile object has a method called 'read' which returns the contents of the file whose name (including path) is given.

The TOC (toc.ncx) is standard for an epub. The ZipFile object has the file list which we looked at previously, but that gives the files in archive order (which here looks roughly alphabetical), rather than in the right (book) order. The TOC is what gives the book order.

In [5]:
tocFile = BoMEpub.read("OEBPS/toc.ncx")

This tocFile registers as a 'bytes' type rather than a string, even though it's just text. That is expected: ZipFile.read returns the raw contents of any file in the zip as 'bytes' and leaves the decoding to the caller.

In [6]:
type(tocFile)

bytes
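
A minimal sketch of what that means in practice, using a made-up one-file archive in memory rather than the real epub: the bytes can be decoded to a string for reading, and ElementTree.fromstring accepts the bytes directly.

```python
import io
import zipfile
import xml.etree.ElementTree as ElementTree

# Hypothetical stand-in for the real toc.ncx, just enough to parse.
sample_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<ncx><docTitle><text>Book of Mormon</text></docTitle></ncx>"
)

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("OEBPS/toc.ncx", sample_xml)

with zipfile.ZipFile(buffer) as archive:
    raw = archive.read("OEBPS/toc.ncx")  # always bytes, whatever the file holds

text = raw.decode("utf-8")            # a str, handy for printing
root = ElementTree.fromstring(raw)    # the parser takes the bytes as they are
title = root.find("docTitle/text").text
print(type(raw).__name__, type(text).__name__, title)
```

Printing the decoded string instead of the bytes also turns the \n escapes into real line breaks.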

In [7]:
print(tocFile)

b'<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE ncx\n  PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">\n<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1"><head><meta name="dtb:uid" content="978-1-59297-692-8"/><meta name="dtb:depth" content="2"/><meta name="dtb:totalPageCount" content="0"/><meta name="dtb:maxPageNumber" content="0"/></head><docTitle><text>Book of Mormon</text></docTitle><navMap><navPoint id="navPoint-1" playOrder="1"><navLabel><text>Title Page</text></navLabel><content src="bofm_bofm-title.xhtml"/></navPoint><navPoint id="navPoint-2" playOrder="2"><navLabel><text>Introduction</text></navLabel><content src="bofm_introduction.xhtml"/></navPoint><navPoint id="navPoint-3" playOrder="3"><navLabel><text>Testimony of Three Witnesses</text></navLabel><content src="bofm_three.xhtml"/></navPoint><navPoint id="navPoint-4" playOrder="4"><navLabel><text>Testimony of Eight Witnesses</text></navLabel><content src="bofm_eight

The "root" created in the next cell is the main item in the XML document. All the subsequent things are nested underneath it. It's an Element (which is what ElementTree.fromstring returns), which means it can have child things. It's based on the XML above, whose structure is hard to see because it has been printed without line breaks or indentation.


In [8]:
tocRoot = ElementTree.fromstring(tocFile)

The Element class has a method "iter" which iterates recursively through the whole tree, but as will be seen, it yields everything as one flat sequence, so you don't really get a feel for which things are children and which are parents. So I'll need to write a better function.

In [9]:
for element in tocRoot.iter():
    print(element)

<Element '{http://www.daisy.org/z3986/2005/ncx/}ncx' at 0x7fa50c37d318>
<Element '{http://www.daisy.org/z3986/2005/ncx/}head' at 0x7fa50c37d3b8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}meta' at 0x7fa50c37d458>
<Element '{http://www.daisy.org/z3986/2005/ncx/}meta' at 0x7fa50c37d4a8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}meta' at 0x7fa50c37d4f8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}meta' at 0x7fa50c37d548>
<Element '{http://www.daisy.org/z3986/2005/ncx/}docTitle' at 0x7fa50c37d5e8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}text' at 0x7fa50c37d688>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navMap' at 0x7fa50c37d728>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c37d7c8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navLabel' at 0x7fa50c37d868>
<Element '{http://www.daisy.org/z3986/2005/ncx/}text' at 0x7fa50c37d8b8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}content' at 0x7fa50c37d958>
<Element '{http://www.daisy.org/z39

Before I get started with that though, I need to remove the annoying namespace URI (the part in curly braces) at the beginning of each tag:

In [10]:
annoying_string = "{http://www.daisy.org/z3986/2005/ncx/}navPoint"
print(annoying_string[38:]) # Determined empirically: 38 is the length of the "{...}" namespace prefix
#del(annoying_string)

navPoint
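
The fixed offset of 38 only fits this one namespace. A sketch of an alternative that does not depend on the prefix length follows; local_name is a hypothetical helper name of my own, not something from the library.

```python
def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    if tag.startswith("{"):
        # Everything after the first closing brace is the plain tag name.
        return tag.split("}", 1)[1]
    return tag

print(local_name("{http://www.daisy.org/z3986/2005/ncx/}navPoint"))
print(local_name("{http://www.w3.org/1999/xhtml}body"))
print(local_name("navPoint"))
```

This matters later on, because the XHTML files inside the epub use a different (shorter) namespace than the NCX file.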


In [11]:
def recurseElementTree(etree, indent=0):
    # Print the tag (and text, if any) of an element and all its descendants, indented by depth.
    # Note: the hard-coded 38 only strips the NCX namespace prefix; tags in other namespaces come out wrong.
    print("\t"*indent + "Tag: " + etree.tag[38:])
    #print("\t"*indent + "Attrib: " +str(etree.attrib))
    if etree.text is not None:
        print("\t"*indent + "Text: " + etree.text)
        # Note, this is how to print the text in each element. <tag>This is the stuff I mean.</tag>
    # Iterating over an element yields its direct children (nothing happens if there are none).
    for child in etree:
        recurseElementTree(child, indent + 1)

    # Reasonably happy with this function for the time being.
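
A possible variation, sketched here on a made-up two-level sample rather than the real TOC: instead of printing, a generator can yield the depth along with each tag, so the same walk can feed a printout or a list. The name walk is my own invention for this sketch.

```python
import xml.etree.ElementTree as ElementTree

def walk(element, depth=0):
    """Yield (depth, tag, text) for an element and all its descendants, parents first."""
    tag = element.tag.split("}", 1)[-1]  # drop any '{namespace}' prefix
    text = (element.text or "").strip()
    yield depth, tag, text
    for child in element:
        yield from walk(child, depth + 1)

# Hypothetical sample in the NCX namespace, much smaller than the real file.
sample = ElementTree.fromstring(
    '<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/">'
    "<docTitle><text>Book of Mormon</text></docTitle></ncx>"
)
rows = list(walk(sample))
for depth, tag, text in rows:
    print("\t" * depth + tag + (": " + text if text else ""))
```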

In [12]:
recurseElementTree(tocRoot)

Tag: ncx
	Tag: head
		Tag: meta
		Tag: meta
		Tag: meta
		Tag: meta
	Tag: docTitle
		Tag: text
		Text: Book of Mormon
	Tag: navMap
		Tag: navPoint
			Tag: navLabel
				Tag: text
				Text: Title Page
			Tag: content
		Tag: navPoint
			Tag: navLabel
				Tag: text
				Text: Introduction
			Tag: content
		Tag: navPoint
			Tag: navLabel
				Tag: text
				Text: Testimony of Three Witnesses
			Tag: content
		Tag: navPoint
			Tag: navLabel
				Tag: text
				Text: Testimony of Eight Witnesses
			Tag: content
		Tag: navPoint
			Tag: navLabel
				Tag: text
				Text: Testimony of the Prophet Joseph Smith
			Tag: content
		Tag: navPoint
			Tag: navLabel
				Tag: text
				Text: Brief Explanation
			Tag: content
		Tag: navPoint
			Tag: navLabel
				Tag: text
				Text: Illustrations
			Tag: content
		Tag: navPoint
			Tag: navLabel
				Tag: text
				Text: 1 Nephi
			Tag: content
		Tag: navPoint
			Tag: navLabel
				Tag: text
				Text: 2 Nephi
			Tag: content
		Tag: navPoint
			Tag: navLabel
				Tag: te

Looking just at the first level below the root, you'll see that there are three subitems:

In [13]:
for element in tocRoot:
    print(element)

<Element '{http://www.daisy.org/z3986/2005/ncx/}head' at 0x7fa50c37d3b8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}docTitle' at 0x7fa50c37d5e8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navMap' at 0x7fa50c37d728>


This "navMap" thing is the XML block which has the actual table of contents in it, which produced most of the above output.

It probably makes clickable links in the epub as well, but I'm not terribly worried about that for my purposes.

In [14]:
tocNavMap = tocRoot[2]

In [15]:
for element in tocNavMap:
    print(element)

<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c37d7c8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c37d9a8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c37dae8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c37dc78>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c37de08>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c37df48>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c3840e8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c384278>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c3843b8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c3844f8>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c384638>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c384778>
<Element '{http://www.daisy.org/z3986/2005/ncx/}navPoint' at 0x7fa50c3848b8>

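These navPoints aren't immediately obvious from the highest level, but the recursive printout from earlier shows that each one has two children: a navLabel and a content element. The content element's "src" attribute holds the file path, so with a bit of Python kung-fu I can populate a list of the books.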

In [16]:
# Extract the individual item paths from the navMap.
tocItems = [item[1].attrib["src"] for item in tocNavMap]
for item in tocItems:
    print(item)

bofm_bofm-title.xhtml
bofm_introduction.xhtml
bofm_three.xhtml
bofm_eight.xhtml
bofm_js.xhtml
bofm_explanation.xhtml
bofm-illustrations.xhtml
1-ne.xhtml
2-ne.xhtml
jacob.xhtml
enos.xhtml
jarom.xhtml
omni.xhtml
w-of-m.xhtml
mosiah.xhtml
alma.xhtml
hel.xhtml
3-ne.xhtml
4-ne.xhtml
morm.xhtml
ether.xhtml
moro.xhtml
bofm_pronunciation.xhtml
bofm_reference.xhtml
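
The positional lookups (tocRoot[2], item[1]) depend on the children always arriving in the same order. As a sketch of a name-based alternative, the find/findall methods take a namespace mapping; the two-entry sample below is a cut-down stand-in for the real toc.ncx, and the "ncx" prefix is just a label I picked.

```python
import xml.etree.ElementTree as ElementTree

# Prefix-to-URI mapping; the prefix name "ncx" is arbitrary.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Cut-down stand-in for the real toc.ncx.
sample_toc = b"""<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/"><navMap>
<navPoint id="navPoint-1" playOrder="1"><navLabel><text>Title Page</text></navLabel>
<content src="bofm_bofm-title.xhtml"/></navPoint>
<navPoint id="navPoint-2" playOrder="2"><navLabel><text>Introduction</text></navLabel>
<content src="bofm_introduction.xhtml"/></navPoint>
</navMap></ncx>"""

root = ElementTree.fromstring(sample_toc)
toc_entries = []
for nav_point in root.findall("ncx:navMap/ncx:navPoint", NCX_NS):
    # Look children up by tag name instead of by position.
    label = nav_point.findtext("ncx:navLabel/ncx:text", namespaces=NCX_NS)
    src = nav_point.find("ncx:content", NCX_NS).get("src")
    toc_entries.append((label, src))

for label, src in toc_entries:
    print(label, "->", src)
```

Keeping the label next to the path also gives a readable title for each file.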


Okay, now what to do with them? This isn't entirely obvious from the get-go.

TODO: Decide from here. Think about the structure of the book a bit more. What am I going to do about the different levels of 'books', 'chapters' and 'verses'? I've got some code here but it quite likely needs re-thinking.

Each of these is an XML (specifically XHTML) file in the epub, so we could extract 'roots' from each of them.

In [17]:
# Get the same XML roots for each individual item in the TOC.
tocItemsRoots = [ElementTree.fromstring(BoMEpub.read("OEBPS/%s"%item)) for item in tocItems]
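
The "OEBPS/" prefix works because the src paths in the TOC are relative to the folder that toc.ncx itself sits in. A small sketch of doing that resolution generally, assuming plain relative paths with no percent-encoding; resolve_href is a hypothetical helper name.

```python
import posixpath

def resolve_href(toc_path, src):
    """Resolve a TOC 'src' value relative to the folder that holds the TOC file."""
    src = src.split("#", 1)[0]  # drop any '#fragment' part
    # Zip member names always use forward slashes, hence posixpath.
    return posixpath.normpath(posixpath.join(posixpath.dirname(toc_path), src))

print(resolve_href("OEBPS/toc.ncx", "1-ne.xhtml"))
```

The result can be passed straight to BoMEpub.read.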

In [27]:
chapterFileList = []
bookFileList = []

for item in tocItemsRoots:
    if "chapter" in item[1].attrib["class"]:
        # In the root, the things listed as "chapters" are mostly introductions and witnesses, as well as the appendix.
        # The only thing of ancient origin is the title page, so we'll exclude all the other stuff.
        if "title" in item[1].attrib["class"]:
            print("Chapter: {0}".format(item[1].attrib["class"][len("chapter") + 1:]))
            #print("Chapter: {0}".format(item[1].attrib["class"]))
            chapterFileList.append(BoMEpub.read("OEBPS/{0}.xhtml".format(item[1].attrib["class"][len("chapter") + 1:])))
    elif "book" in item[1].attrib["class"]:
        # One thing listed as a book is the illustrations for some reason. We will exclude this.
        if "illustrations" not in item[1].attrib["class"]:
            print("Book: {0}".format(item[1].attrib["class"][len("book") + 1:]))
            #print("Book: {0}".format(item[1].attrib["class"]))
            bookFileList.append(BoMEpub.read("OEBPS/{0}.xhtml".format(item[1].attrib["class"][len("book") + 1:])))
    else:
        # Catch-all just in case. There shouldn't be any of these.
        print("Other item: {0}".format(item[1].attrib["class"]))
        
# Note: the "class" strings have "book " or "chapter " in them, which are not part of the filenames so I'll remove them.
# The list generated here should just be able to have ".xhtml" appended to every string to get a filename.

Chapter: bofm_bofm-title
Book: 1-ne
Book: 2-ne
Book: jacob
Book: enos
Book: jarom
Book: omni
Book: w-of-m
Book: mosiah
Book: alma
Book: hel
Book: 3-ne
Book: 4-ne
Book: morm
Book: ether
Book: moro
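
The slicing with len("chapter") + 1 could also be written as a split on the first space. This is only a sketch, and it assumes the class strings always look like "book 1-ne" or "chapter bofm_bofm-title", which is what the output above suggests. The name split_class is made up.

```python
def split_class(class_attr):
    """Split a class string such as 'book 1-ne' into its kind and its name."""
    kind, _, name = class_attr.partition(" ")
    return kind, name

print(split_class("book 1-ne"))
print(split_class("chapter bofm_bofm-title"))
```

The name part is then the filename without its ".xhtml" extension.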


In [30]:
recurseElementTree(ElementTree.fromstring(chapterFileList[0]))

Tag: 
	Tag: 
		Tag: 
		Tag: 
		Text: title page of the Book of Mormon
		Tag: 
		Tag: 
		Tag: 
	Tag: 
		Tag: 
			Tag: 
			Text: The 
				Tag: 
				Tag: 
				Text: Book of Mormon
			Tag: 
			Text: An Account Written by 
				Tag: 
				Tag: 
				Text: the Hand of Mormon
				Tag: 
				Tag: 
		Tag: 
			Tag: 
			Text: Wherefore, it is an abridgment of the record of the people of Nephi, and also of the Lamanites—Written to the Lamanites, who are a remnant of the house of Israel; and also to Jew and Gentile—Written by way of commandment, and also by the spirit of prophecy and of revelation—Written and sealed up, and hid up unto the Lord, that they might not be destroyed—To come forth by the gift and power of God unto the interpretation thereof—Sealed by the hand of Moroni, and hid up unto the Lord, to come forth in due time by way of the Gentile—The interpretation thereof by the gift of God.
			Tag: 
			Text: An abridgment taken from the Book of Ether also, which is a record of the people of Ja

In [20]:
bookRootList = []
for file in bookFileList:
    bookRootList.append(ElementTree.fromstring(file))
    for child in bookRootList[-1][1][0]:
        print(child.attrib)

{'title': '1 Nephi', 'class': 'title', 'id': 'lds_1-ne_title'}
{'class': 'subtitle'}
{'class': 'intro', 'id': 'lds_1-ne_intro'}
{'title': '2 Nephi', 'class': 'title', 'id': 'lds_2-ne_title'}
{'class': 'intro', 'id': 'lds_2-ne_intro'}
{'title': 'Jacob', 'class': 'title', 'id': 'lds_jacob_title'}
{'class': 'intro', 'id': 'lds_jacob_intro'}
{'title': 'Enos', 'class': 'title', 'id': 'lds_enos_title'}
{'class': 'titleNumber', 'id': 'lds_enos_1_title'}
{'class': 'studySummary', 'id': 'lds_enos_1_intro'}
{'title': 'Jarom', 'class': 'title', 'id': 'lds_jarom_title'}
{'class': 'titleNumber', 'id': 'lds_jarom_1_title'}
{'class': 'studySummary', 'id': 'lds_jarom_1_intro'}
{'title': 'Omni', 'class': 'title', 'id': 'lds_omni_title'}
{'class': 'titleNumber', 'id': 'lds_omni_1_title'}
{'class': 'studySummary', 'id': 'lds_omni_1_intro'}
{'title': 'Words of Mormon', 'class': 'title', 'id': 'lds_w-of-m_title'}
{'class': 'titleNumber', 'id': 'lds_w-of-m_1_title'}
{'class': 'studySummary', 'id': 'lds_w-of