## A Demonstration of how one can use Python to select chapters of an ebook based on the POV in their title

 - This is a simple demonstration; it only catches the basic case
 - You'll likely need to customise it on a per-book basis
 - For demonstration, I am using a particular ebook containing the first four books of George R.R. Martin's A Song of Ice and Fire
 - It is your responsibility to ensure the legality of this in your locale


#### The MIT License (MIT)

Copyright (c) 2015,2016, 2018, Lyndon White

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

----------------------

#### The Libraries
We are using Python 3 today, but this code should work almost without change in Python 2. Two libraries are required.

 - [ebooklib](https://github.com/aerkalov/ebooklib) is for reading and writing the epubs as a whole -- they are basically zip archives
 - [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is for reading the HTML files within them

Both can be installed with `pip`.

We are also going to use the standard library component:

 - [re](https://docs.python.org/3/library/re.html) for regular expressions 

In [7]:
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
import re

## What to Keep:
ebooklib.epub breaks the epub up into items. These are files within the zip archive.
Generally, most books have one item (i.e. one file) per chapter. That is the case for our book.
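
Because an epub is basically a zip archive, the standard library's `zipfile` module can show the kind of files that ebooklib exposes as items. This is a minimal sketch on a toy in-memory archive; the file names are hypothetical, and a real epub contains further required files.

```python
import io
import zipfile

# Build a tiny stand-in "epub" in memory; the file names are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/b01-c01.html", "<h1>BRAN</h1>")
    archive.writestr("OEBPS/b01-c02.html", "<h1>CATELYN</h1>")
    archive.writestr("OEBPS/cover.jpg", b"not really an image")

# Each entry in the archive corresponds roughly to one item.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()

# One HTML file per chapter, as in the book used here.
chapter_files = [name for name in names if name.endswith(".html")]
print(names)
print(chapter_files)
```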

Of these items, there are three categories we want to keep:

- items that are not chapters at all -- these could be pictures, metadata or something else; we don't know.
- chapters that are universal, e.g. the prologue, the dedication or the appendix.
- chapters that are about the character we are interested in

In [18]:
def is_not_chapter(item):
    return item.get_type() != ebooklib.ITEM_DOCUMENT

##### Recognising universal chapters
All the normal chapters in our case are named along the lines of `b01-c01`, for book 1 (as it is a compilation), chapter 1. Special chapters like the appendix don't follow this pattern. We can check for it with a regex.

In [3]:
def is_univeral_chapter(chapter):
    return not re.match(r"(b\d\d.c\d\d)|(c\d\d)", chapter.get_id())  # raw string avoids invalid-escape warnings
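
To see what the pattern accepts, it can be tried on a few ids without opening a book. The ids below are made up for illustration; only `b01-c01` is taken from the text above.

```python
import re

# Same pattern as in is_univeral_chapter, written as a raw string.
CHAPTER_ID = re.compile(r"(b\d\d.c\d\d)|(c\d\d)")

# Hypothetical item ids, for illustration only.
ids = ["b01-c01", "b04-c46", "c07", "appendix", "cover", "b01-prologue"]

# re.match anchors at the start of the string only, so an id that merely
# begins with the pattern (e.g. "c07a") also counts as a normal chapter.
universal = [item_id for item_id in ids if not CHAPTER_ID.match(item_id)]
print(universal)
```

Anything that does not start with the chapter pattern is treated as universal, so an unusual naming scheme in another book would need its own pattern.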

## Is it about our character?

In this particular book all the character names are in the chapter headings.
However, it represents them in two different ways. In some sections the name is in an `<h1>` element; in others it is in a `<p class="ct">` element. We'll check for both.
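
The notebook does this check with BeautifulSoup. Purely to illustrate the two heading forms without any third-party dependency, here is a sketch that swaps in the standard library's `html.parser`; `HeadingFinder` and `find_heading` are hypothetical names, and the sample markup is made up.

```python
from html.parser import HTMLParser

class HeadingFinder(HTMLParser):
    """Collects the text of the first <h1> or <p class="ct"> element."""

    def __init__(self):
        super().__init__()
        self._inside = None  # tag name of the heading currently being read
        self._parts = []
        self.heading = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything once a heading has been found or is being read.
        if self.heading is not None or self._inside is not None:
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" or (tag == "p" and "ct" in classes):
            self._inside = tag
            self._parts = []

    def handle_data(self, data):
        if self._inside is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._inside is not None and tag == self._inside:
            self.heading = "".join(self._parts).strip()
            self._inside = None

def find_heading(html):
    """Return the heading text, or None if neither form is present."""
    finder = HeadingFinder()
    finder.feed(html)
    return finder.heading

print(find_heading("<body><h1> JON </h1><p>Some text.</p></body>"))
print(find_heading('<body><p class="ct">ARYA</p><p>Some text.</p></body>'))
```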

Notice this function is a higher-order function that returns a function. That makes it work nicely with `filter`, which is useful for testing if you've already stripped down to just the normal chapters.

`filter(is_character("JON"), chapters)`
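
The same pattern can be tried on stand-in data. In this sketch `Chapter` and `has_heading` are hypothetical; a real chapter is an ebooklib item whose heading has to be parsed out of its HTML.

```python
from collections import namedtuple

# Hypothetical stand-in for an epub chapter: just an id and a heading.
Chapter = namedtuple("Chapter", ["id", "heading"])

def has_heading(name):
    """Return a predicate that is true for chapters headed `name`."""
    def inner(chapter):
        return chapter.heading == name
    return inner

chapters = [
    Chapter("b01-c01", "BRAN"),
    Chapter("b01-c05", "JON"),
    Chapter("b01-c10", "JON"),
]

# The predicate is built once and then applied to every chapter by filter.
jon_chapters = list(filter(has_heading("JON"), chapters))
print([chapter.id for chapter in jon_chapters])
```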

In [15]:
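def character_name_finder(matchers=(), postproc=lambda matched: matched.text.strip()):
    """Build a function that extracts the POV character name from a chapter.

    matchers: functions taking a soup and returning a list of candidate elements.
    postproc: turns the first matched element into the name string.
    """
    def character_name(chapter):
        soup = BeautifulSoup(chapter.content, "lxml")

        # Try each matcher in order; the first usable match wins.
        for matcher in matchers:
            matched = matcher(soup)
            if len(matched) > 0:
                try:
                    return postproc(matched[0])
                except IndexError:
                    continue  # Couldn't post-process -- so not an expected name

        # Nothing else found
        return False
    return character_name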

#####

title_case_postproc=lambda matched:matched.text.strip()[0]+matched.text.strip()[1:].lower()

####
character_name_none = character_name_finder(
    matchers=[lambda soup: soup.find_all(name='body')],
    postproc = lambda x: "<unknown>"
)

character_name_GoT = character_name_finder(
        matchers=[lambda soup: soup.find_all(class_='ct'),
         lambda soup: soup.find_all(class_='subchapter'),
         lambda soup: soup.find_all('h1') if len(soup.find_all('h1'))==1 else []
        ],
        postproc=title_case_postproc)

####

character_name_HoO2 = character_name_finder(
        matchers=[lambda soup: soup.find_all(name="title")],
        postproc=lambda matched:matched.text.strip().split(" ")[1])

character_name_HoO5 = character_name_finder(
        matchers=[lambda soup: soup.find_all(class_="brandingheadclosedtitle")])

####
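import os.path  # used by the postproc below

# Here the name is taken from the file name of the chapter's heading image.
character_name_Bartimaeus = character_name_finder(matchers=[lambda soup: soup.find_all(name="img")],
               postproc=lambda ele: os.path.splitext(os.path.basename(ele['src']))[0]
              )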
###
character_name_Dregs = character_name_finder(matchers=[
                    lambda soup: soup.find_all(class_ = "CT"),
                    lambda soup: soup.find_all(class_ = "Chap-Title-ct")],
               postproc=title_case_postproc
              )
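
The control flow of these finders (try each matcher in order, post-process the first hit, otherwise return `False`) can be exercised without BeautifulSoup. In this sketch plain dictionaries stand in for parsed pages; `first_match`, `title_case` and the page contents are all hypothetical.

```python
def first_match(matchers, postproc=str.strip):
    """Build a finder: try each matcher in order, post-process the first hit."""
    def finder(page):
        for matcher in matchers:
            matched = matcher(page)
            if matched:
                return postproc(matched[0])
        return False  # mirrors the notebook: False means "no name found"
    return finder

def title_case(text):
    """Keep the first letter as-is and lower-case the rest."""
    text = text.strip()
    return text[0] + text[1:].lower()

# Pages are dicts mapping a (made-up) class name to the texts found under it.
find_name = first_match(
    matchers=[lambda page: page.get("ct", []),
              lambda page: page.get("subchapter", [])],
    postproc=title_case,
)

print(find_name({"ct": [" DAENERYS "]}))
print(find_name({"subchapter": ["TYRION"]}))
print(find_name({"body": ["no heading here"]}))
```

Note that the title-casing step means the finder reports `Daenerys` even when the heading reads `DAENERYS`, which matters when comparing against a name typed in capitals.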

In [5]:
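def is_character(name, character_name=character_name_GoT):
    # character_name is the finder used to read a chapter's heading;
    # defaulting to the GoT finder is an assumption, pass another for other books.
    def inner(chapter):
        chapter_character_name = character_name(chapter)
        # The finder returns False when no name is found; compare case-insensitively
        # because the GoT finder title-cases headings ("DAENERYS" -> "Daenerys").
        return bool(chapter_character_name) and chapter_character_name.lower() == name.lower()

    return inner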

### Bring our conditions together
Another higher order function, again to make it work with `filter`.
In this case it is a closure.

In [6]:
def keep_item(character_name):
    is_our_character = is_character(character_name)
    def inner(item):
        return (is_not_chapter(item)
                or is_univeral_chapter(item)
                or is_our_character(item))
    return inner
        
    

### Combine it all, with a read and a write
Also, we'll modify the title so we don't get the two books confused.
There is also a helper function below to work out the new filename.

In [7]:
import os.path

def rewrite_book_by_character(filename, character):
    book = epub.read_epub(filename)
    # Keep only non-chapter items, universal chapters and this character's chapters
    book.items = list(filter(keep_item(character), book.items))
    book.title += ": " + character + " POVs_ONLY"

    new_filename = get_new_filename(filename, character)
    print(new_filename)
    epub.write_epub(new_filename, book, {})
    return new_filename

def get_new_filename(filename, character):
    # NOTE: the output_books/ directory is assumed to already exist
    filename_base, ext = os.path.splitext(os.path.basename(filename))
    new_filename = "output_books/" + filename_base + "_" + character + "_ONLY" + ext
    return new_filename
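
The filename helper writes into `output_books/`, which presumably has to exist before the epub can be written there (an assumption; I have not checked how ebooklib behaves otherwise). A sketch of a variant that creates the directory first, using the hypothetical name `new_filename_sketch` and a temporary directory so nothing real is touched:

```python
import os
import tempfile

def new_filename_sketch(filename, character, out_dir):
    """Like get_new_filename, but creates `out_dir` if it is missing."""
    os.makedirs(out_dir, exist_ok=True)
    base, ext = os.path.splitext(os.path.basename(filename))
    return os.path.join(out_dir, base + "_" + character + "_ONLY" + ext)

# Use a temporary directory so the sketch does not touch the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "output_books")
    result = new_filename_sketch("../input_books/example.book.epub", "DAENERYS", out_dir)
    created = os.path.isdir(out_dir)

# Only the final extension is split off, so dots inside the name survive.
print(os.path.basename(result))
```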


## Give it a go

In [None]:
from IPython.display import FileLink
filename = rewrite_book_by_character(
    '../input_books/ASOIF/5.ADanceWithDragons-GeorgeR.R.Martin.epub',
    "DAENERYS")
FileLink(filename)

# End ASOIF stuff

In [6]:
book=epub.read_epub("../input_books/labelled/dregs01.epub")


In [None]:
character_name_SoC(chapter)

In [None]:
soup=BeautifulSoup(chapter.content,"lxml")
soup.find_all()[0].text

# Generating a flat-file test set

In [19]:
import re
def get_raw_text(ch, character_name):
    char_name = character_name(ch)
    soup = BeautifulSoup(ch.get_content(), "lxml")
    full_text = soup.text

    # Remove the first reference to the name -- this comes from the heading.
    # re.escape stops punctuation in a name being read as regex syntax.
    return re.sub(re.escape(char_name), "", full_text, count=1, flags=re.IGNORECASE)
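
The effect of `count=1` together with `re.IGNORECASE` is easiest to see on a short string. This sketch uses a hypothetical helper, `strip_first_mention`, and made-up sample text; it also escapes the name so that punctuation in it is matched literally.

```python
import re

def strip_first_mention(name, text):
    """Remove the first case-insensitive occurrence of `name` from `text`."""
    # re.escape keeps characters such as "." or "(" in a name from being
    # interpreted as regular-expression syntax.
    return re.sub(re.escape(name), "", text, count=1, flags=re.IGNORECASE)

sample = "JON\nJon rode north from the Wall."

# Only the heading goes; the mention in the body text is kept.
print(repr(strip_first_mention("Jon", sample)))
```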

In [20]:

def get_annonated_chapters(book_filename,character_name):
    book = epub.read_epub(book_filename)
    for ch in book.items:
        if is_not_chapter(ch):
            continue
        
        name = character_name(ch)
        if (not(name) 
            or " " in name
            or "Part" in name
            or name in {'Acknowledgments','Contents','Prologue','Appendix','Epilogue'}):
           
            continue
        
        text = get_raw_text(ch,character_name)
        #if len(text)<2000:
        #    continue
               
        yield {
            "character": name,
            "text": text,
        }
    
        
    

In [21]:
import json
import os.path
def make_data(book_filename, character_name):
    filename_base, ext = os.path.splitext(os.path.basename(book_filename))   
    data = list(get_annonated_chapters(book_filename,character_name))
    with open("../flat_data/"+filename_base+".json",'w') as fh:
        json.dump(data, fh)
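
Once written, the flat file is an ordinary JSON list of `{"character": ..., "text": ...}` records. A sketch of reading one back and counting chapters per character; the records and the file name here are made up, and a temporary directory stands in for `../flat_data/`.

```python
import json
import os
import tempfile
from collections import Counter

# Made-up records in the same shape that make_data writes.
records = [
    {"character": "Jon", "text": "..."},
    {"character": "Arya", "text": "..."},
    {"character": "Jon", "text": "..."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")
    with open(path, "w") as fh:
        json.dump(records, fh)

    # Read the flat file back.
    with open(path) as fh:
        loaded = json.load(fh)

# Count how many chapters each character has.
chapters_per_character = Counter(record["character"] for record in loaded)
print(chapters_per_character.most_common())
```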

In [22]:
make_data("../input_books/labelled/asoif01-04.epub", character_name_GoT)

In [23]:
make_data("../input_books/labelled/dregs01.epub", character_name_Dregs)
make_data("../input_books/labelled/dregs02.epub", character_name_Dregs)

In [None]:
make_data("../input_books/warbreaker/Warbreaker.epub", character_name_none)

In [9]:
from book import *
import tempfile

In [23]:
for ln in convert_book("../input_books/labelled/dregs01.epub",
                       "../input_books/labelled/dregs01_cleaned.epub",
                       reprocess_epub=True, heuristics=True):
    print(ln)
    
for ln in convert_book("../input_books/labelled/dregs02.epub",
                       "../input_books/labelled/dregs02_cleaned.epub",
                       reprocess_epub=True, heuristics=True):
    print(ln)

ebook-convert ../input_books/labelled/dregs01.epub ../input_books/labelled/dregs01_cleaned.epub --enable-heuristics


1% Converting input to HTML...

InputFormatPlugin: EPUB Input running

on /home/wheel/oxinabox/public-html/apps/NovelPerspective/input_books/labelled/dregs01.epub

Found HTML cover OEBPS/xhtml/cover.xhtml

Parsing all content...

flow is too short, not running heuristics


In [20]:
for ln in convert_book("../input_books/labelled/asoif01-04.epub",
                       "../input_books/labelled/asoif01-04_cleaned.epub",
                       reprocess_epub=True, heuristics=True):
    print(ln)

ebook-convert ../input_books/labelled/asoif01-04.epub ../input_books/labelled/asoif01-04_cleaned.epub --enable-heuristics


1% Converting input to HTML...

InputFormatPlugin: EPUB Input running

on /home/wheel/oxinabox/public-html/apps/NovelPerspective/input_books/labelled/asoif01-04.epub

Found HTML cover OEBPS/Mart_9780345529060_epub_cvi_r1.htm

Parsing all content...

flow is too short, not running heuristics


In [26]:
def get_annonated_sections(book_filename,character_name):
    book = epub.read_epub(book_filename)
    for ch in book.items:
        if is_not_chapter(ch):
            continue
        
        name = character_name(ch)
        if (not(name) 
            or " " in name
            or "Part" in name
            or name in {'Acknowledgments','Contents','Prologue','Appendix','Epilogue'}):
           
            continue
        
        
        
        soup = BeautifulSoup(ch.get_content(),"lxml")
        scenebreaks = soup.find_all(class_="scenebreak")  # find_all is the current name for findAll
        for scenebreak in scenebreaks:
            print(scenebreak)
        

               
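        # NOTE: while this yield stays commented out, the function is not a
        # generator and returns None, so wrapping the call in list(...) fails.
        #yield dict((
        #    ("character", name),
        #    ("text", text)
        # ))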
        
list(get_annonated_sections("../input_books/labelled/dregs02.epub", character_name_Dregs))


TypeError: 'NoneType' object is not iterable
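
The likely cause of this error is that the only `yield` in the function is commented out: without a `yield`, Python treats it as an ordinary function that returns `None`, and `list(None)` fails. A small sketch with hypothetical function names shows the difference:

```python
def not_a_generator(items):
    for item in items:
        pass
        # yield item   <- commented out, so this is an ordinary function
    # Falls off the end and implicitly returns None.

def a_generator(items):
    for item in items:
        yield item

try:
    list(not_a_generator([1, 2, 3]))
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)
print(list(a_generator([1, 2, 3])))
```

Restoring a `yield` inside the loop turns the function back into a generator, and the `list(...)` call works again.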

## useful loading

In [33]:
soup.ma

In [34]:
load_book("../input_books/warbreaker/Warbreaker.epub")

In [25]:
import os.path
os.path.splitext("a/b.c")



('a/b', '.c')

In [28]:
?subprocess
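
The `convert_book` helper used earlier comes from the `book` module, which is not shown here. Judging only by the echoed `ebook-convert ...` line in its output, it appears to run that command and yield the output line by line. The following is a guess at that shape using `subprocess`; `stream_command` is a hypothetical name, and a harmless Python one-liner stands in for the real command so the sketch runs anywhere.

```python
import subprocess
import sys

def stream_command(args):
    """Run a command and yield its output one line at a time."""
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the same stream
        universal_newlines=True,   # decode bytes to str
    )
    for line in process.stdout:
        yield line.rstrip("\n")
    process.stdout.close()
    returncode = process.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, args)

# Stand-in command; for a real conversion the argument list would name
# ebook-convert followed by the input and output files.
demo = [sys.executable, "-c", "print('1% Converting input to HTML...'); print('done')"]
lines = list(stream_command(demo))
print(lines)
```

Yielding lines as they arrive lets the notebook print progress during a long conversion instead of waiting for the command to finish.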