# Preprocessing

## Parsing the HTML in the EPUBs

In [1]:
import glob
from collections import OrderedDict
from bs4 import BeautifulSoup

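We represent the Potter canon as nested Python dictionaries and lists: each book title maps to an ordered dictionary of chapter titles, and each chapter title maps to a list of paragraph strings.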
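
As a minimal sketch of this nested layout, the snippet below builds a toy structure by hand. The titles and paragraphs are invented placeholders, not text from the books; only the shape (book, then chapter, then list of paragraphs) mirrors what the parsing functions below return.

```python
from collections import OrderedDict

# Toy data only: placeholder titles and paragraphs, not real book text.
canon = OrderedDict()
canon['Book One'] = OrderedDict()
canon['Book One']['Chapter One'] = ['First paragraph.', 'Second paragraph.']
canon['Book One']['Chapter Two'] = ['Only paragraph.']

# Walk the structure in insertion order: book -> chapter -> paragraphs.
summary = []
for book, chapters in canon.items():
    for chapter, paragraphs in chapters.items():
        summary.append((book, chapter, len(paragraphs)))

print(summary)
```

Because insertion order is preserved, iterating over the dictionary yields books and chapters in reading order, which the comparison cells further down rely on.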

### UK edition (Bloomsbury version)

In [2]:
def EnglishHarry():
    HP = OrderedDict()
    uk_path = '../../data/rowling/potter/UK/OEBPS'

    for i in range(1, 8):
        # iterate over individual book folders
        p = f'{uk_path}/{i}'
        files = sorted(glob.glob(p + '/hp*_ch*.html'))

        book_title = None

        for html_file in files:
            with open(html_file, 'r') as f:
                tree = BeautifulSoup(f.read(), 'lxml')

                # get title of current book:
                if book_title is None:
                    book_title = tree.title.text.split(' - ')[0].strip()
                    HP[book_title] = OrderedDict()
                
                # get title of current chapter:
                chapter = tree.h1
                chapter_title = chapter.text.split(' - ')[0].strip()
                HP[book_title][chapter_title] = []

                for element in chapter.next_siblings:
                    if not element.name == 'p':
                        continue
                    paragraph = ' '.join(element.text.split())
                    HP[book_title][chapter_title].append(paragraph)
    
    return HP

In [3]:
UK_HP = EnglishHarry()

for book in UK_HP:
    print(book)
    for chapter in UK_HP[book]:
        print(f'   {chapter} ({len(UK_HP[book][chapter])} paragraphs)')

Harry Potter and the Philosopher's Stone
   The Boy Who Lived (111 paragraphs)
   The Vanishing Glass (100 paragraphs)
   The Letters from No One (128 paragraphs)
   The Keeper of the Keys (137 paragraphs)
   Diagon Alley (260 paragraphs)
   The Journey from Platform Nine and Three-Quarters (288 paragraphs)
   The Sorting Hat (143 paragraphs)
   The Potions Master (84 paragraphs)
   The Midnight Duel (214 paragraphs)
   Hallowe’en (162 paragraphs)
   Quidditch (138 paragraphs)
   The Mirror of Erised (210 paragraphs)
   Nicolas Flamel (125 paragraphs)
   Norbert the Norwegian Ridgeback (140 paragraphs)
   The Forbidden Forest (188 paragraphs)
   Through the Trapdoor (305 paragraphs)
   The Man with Two Faces (233 paragraphs)
Harry Potter and the Chamber of Secrets
   The Worst Birthday (95 paragraphs)
   The Burrow (183 paragraphs)
   At Flourish and Blotts (192 paragraphs)
   The Whomping Willow (196 paragraphs)
   Gilderoy Lockhart (148 paragraphs)
   Mudbloods and Murmurs (170 parag

### US edition (Scholastic)

*Note: manually remove these files, which have not been properly inserted: part0024.html, part0113.html*

In [4]:
def AmericanHarry():
    HP = OrderedDict()
    us_path = '../../data/rowling/potter/US/text/'
    
    book_title = ''
    
    for fn in sorted(glob.glob(us_path + '*.html')):
        with open(fn, 'r') as f:
            tree = BeautifulSoup(f.read(), 'lxml')
            
            title = tree.title.text.split(' - ')[0].strip()
            title = title.replace('’', "'")
            
            if 'collection' in title.lower():
                continue
            
            # detect start of new book:
            if title is not None and title != book_title:
                book_title = title
                HP[book_title] = OrderedDict()

            chapter = tree.html.body.h3
            if not chapter:
                chapter = tree.html.body.h2
            
            if chapter:
                chapter_title = chapter.text.strip()
                chapter_title = chapter_title.replace('’', "'")
                
                # skip ToC
                if 'contents' in chapter_title.lower():
                    continue
                
                HP[book_title][chapter_title] = []
            
                for element in chapter.next_siblings:
                    if not element.name == 'p':
                        continue
                    paragraph = ' '.join(element.text.split())
                    HP[book_title][chapter_title].append(paragraph)
    
    return HP

In [5]:
US_HP = AmericanHarry()

for book in US_HP:
    print(book)
    for chapter in US_HP[book]:
        print(f'   {chapter} ({len(US_HP[book][chapter])} paragraphs)')

Harry Potter and the Sorcerer's Stone
   THE BOY WHO LIVED (110 paragraphs)
   THE VANISHING GLASS (100 paragraphs)
   THE LETTERS FROM NO ONE (121 paragraphs)
   THE KEEPER OF THE KEYS (137 paragraphs)
   DIAGON ALLEY (252 paragraphs)
   THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS (282 paragraphs)
   THE SORTING HAT (143 paragraphs)
   THE POTIONS MASTER (84 paragraphs)
   THE MIDNIGHT DUEL (211 paragraphs)
   HALLOWEEN (161 paragraphs)
   QUIDDITCH (137 paragraphs)
   THE MIRROR OF ERISED (203 paragraphs)
   NICOLAS FLAMEL (120 paragraphs)
   NORBERT THE NORWEGIAN RIDGEBACK (134 paragraphs)
   THE FORBIDDEN FOREST (184 paragraphs)
   THROUGH THE TRAPDOOR (304 paragraphs)
   THE MAN WITH TWO FACES (229 paragraphs)
Harry Potter and the Chamber of Secrets
   THE WORST BIRTHDAY (96 paragraphs)
   THE BURROW (184 paragraphs)
   AT FLOURISH AND BLOTTS (198 paragraphs)
   THE WHOMPING WILLOW (199 paragraphs)
   GILDEROY LOCKHART (148 paragraphs)
   MUDBLOODS AND MURMURS (170 paragraph

In [6]:
import os

# create the output folder if it does not exist yet
os.makedirs('us_txt', exist_ok=True)
for book_name, book in US_HP.items():
    with open('us_txt/' + book_name + '.txt', 'w') as f:
        for us_chap, paragraphs in book.items():
            f.write('\n===========\n' + us_chap + '\n')
            f.write('\n'.join(paragraphs))

### Compare

First, we compare the number of paragraphs (or rather, text blocks) that we extracted per chapter:

In [7]:
diffs = []
for uk_book, us_book in zip(UK_HP, US_HP):
    print(uk_book, 'vs', us_book)
    for uk_chap, us_chap in zip(UK_HP[uk_book], US_HP[us_book]):
        print('   ', uk_chap, 'vs', us_chap)
        uk_len = len(UK_HP[uk_book][uk_chap])
        us_len = len(US_HP[us_book][us_chap])
        diff = abs(uk_len - us_len)
        print('   ', uk_len, 'vs', us_len, '-> diff of ', diff)
        diffs.append(diff)

print('Maximum difference in text blocks between chapters:', max(diffs))

Harry Potter and the Philosopher's Stone vs Harry Potter and the Sorcerer's Stone
    The Boy Who Lived vs THE BOY WHO LIVED
    111 vs 110 -> diff of  1
    The Vanishing Glass vs THE VANISHING GLASS
    100 vs 100 -> diff of  0
    The Letters from No One vs THE LETTERS FROM NO ONE
    128 vs 121 -> diff of  7
    The Keeper of the Keys vs THE KEEPER OF THE KEYS
    137 vs 137 -> diff of  0
    Diagon Alley vs DIAGON ALLEY
    260 vs 252 -> diff of  8
    The Journey from Platform Nine and Three-Quarters vs THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS
    288 vs 282 -> diff of  6
    The Sorting Hat vs THE SORTING HAT
    143 vs 143 -> diff of  0
    The Potions Master vs THE POTIONS MASTER
    84 vs 84 -> diff of  0
    The Midnight Duel vs THE MIDNIGHT DUEL
    214 vs 211 -> diff of  3
    Hallowe’en vs HALLOWEEN
    162 vs 161 -> diff of  1
    Quidditch vs QUIDDITCH
    138 vs 137 -> diff of  1
    The Mirror of Erised vs THE MIRROR OF ERISED
    210 vs 203 -> diff of  7
   

    The White Tomb vs THE WHITE TOMB
    133 vs 131 -> diff of  2
Harry Potter and the Deathly Hallows vs Harry Potter and the Deathly Hallows
    The Dark Lord Ascending vs THE DARK LORD ASCENDING
    108 vs 107 -> diff of  1
    In Memoriam vs IN MEMORIAM
    25 vs 26 -> diff of  1
    The Dursleys Departing vs THE DURSLEYS DEPARTING
    120 vs 121 -> diff of  1
    The Seven Potters vs THE SEVEN POTTERS
    167 vs 166 -> diff of  1
    Fallen Warrior vs FALLEN WARRIOR
    259 vs 260 -> diff of  1
    The Ghoul in Pyjamas vs THE GHOUL IN PAJAMAS
    226 vs 226 -> diff of  0
    The Will of Albus Dumbledore vs THE WILL OF ALBUS DUMBLEDORE
    269 vs 271 -> diff of  2
    The Wedding vs THE WEDDING
    211 vs 211 -> diff of  0
    A Place to Hide vs A PLACE TO HIDE
    145 vs 145 -> diff of  0
    Kreacher’s Tale vs KREACHER'S TALE
    193 vs 190 -> diff of  3
    The Bribe vs THE BRIBE
    206 vs 202 -> diff of  4
    Magic is Might vs MAGIC IS MIGHT
    215 vs 216 -> diff of  1
    T

Next, we compare the number of characters per chapter:

In [8]:
diffs = []  # start afresh so the paragraph-level differences are not mixed in
for uk_book, us_book in zip(UK_HP, US_HP):
    print(uk_book, 'vs', us_book)
    for uk_chap, us_chap in zip(UK_HP[uk_book], US_HP[us_book]):
        print('   ', uk_chap, 'vs', us_chap)
        uk_len = len('\n'.join(UK_HP[uk_book][uk_chap]))
        us_len = len('\n'.join(US_HP[us_book][us_chap]))
        diff = abs(uk_len - us_len)
        print('   ', uk_len, 'vs', us_len, '-> diff of ', diff)
        diffs.append(diff)

print('Maximum character difference between two chapters: ', max(diffs))

Harry Potter and the Philosopher's Stone vs Harry Potter and the Sorcerer's Stone
    The Boy Who Lived vs THE BOY WHO LIVED
    25677 vs 25842 -> diff of  165
    The Vanishing Glass vs THE VANISHING GLASS
    18983 vs 19070 -> diff of  87
    The Letters from No One vs THE LETTERS FROM NO ONE
    21246 vs 21338 -> diff of  92
    The Keeper of the Keys vs THE KEEPER OF THE KEYS
    19558 vs 19627 -> diff of  69
    Diagon Alley vs DIAGON ALLEY
    35890 vs 36080 -> diff of  190
    The Journey from Platform Nine and Three-Quarters vs THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS
    34257 vs 34277 -> diff of  20
    The Sorting Hat vs THE SORTING HAT
    23648 vs 23809 -> diff of  161
    The Potions Master vs THE POTIONS MASTER
    16332 vs 16425 -> diff of  93
    The Midnight Duel vs THE MIDNIGHT DUEL
    28013 vs 28026 -> diff of  13
    Hallowe’en vs HALLOWEEN
    23695 vs 23737 -> diff of  42
    Quidditch vs QUIDDITCH
    18919 vs 19025 -> diff of  106
    The Mirror of Er

    35887 vs 36157 -> diff of  270
    Lord Voldemort’s Request vs LORD VOLDEMORT'S REQUEST
    37639 vs 37819 -> diff of  180
    The Unknowable Room vs THE UNKNOWABLE ROOM
    32698 vs 33009 -> diff of  311
    After the Burial vs AFTER THE BURIAL
    32747 vs 33234 -> diff of  487
    Horcruxes vs HORCRUXES
    33124 vs 33264 -> diff of  140
    Sectumsempra vs SECTUMSEMPRA
    33821 vs 34111 -> diff of  290
    The Seer Overheard vs THE SEER OVERHEARD
    29250 vs 29457 -> diff of  207
    The Cave vs THE CAVE
    36282 vs 36555 -> diff of  273
    The Lightning-Struck Tower vs THE LIGHTNING-STRUCK TOWER
    25705 vs 25997 -> diff of  292
    Flight of the Prince vs FLIGHT OF THE PRINCE
    20571 vs 20388 -> diff of  183
    The Phoenix Lament vs THE PHOENIX LAMENT
    32077 vs 32287 -> diff of  210
    The White Tomb vs THE WHITE TOMB
    30882 vs 31122 -> diff of  240
Harry Potter and the Deathly Hallows vs Harry Potter and the Deathly Hallows
    The Dark Lord Ascending vs THE D

## Save as TEI-XML

### Define a skeleton

We first create a very simple dump. We add a minimalistic TEI header that contains just enough data for the file to be parsed as a valid TEI document (see the [guidelines](http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html#HD7)). Adapt this header to your own project:

In [9]:
import lxml.etree  # `import lxml` alone does not load the etree submodule
parser = lxml.etree.XMLParser()

In [10]:
def make_header(title='dummy'):
    header = lxml.etree.XML(f"""<teiHeader>
     <fileDesc>
      <titleStmt>
       <title>{title}</title>
       <respStmt>
        <resp>written by</resp>
        <name>J.K. Rowling</name>
       </respStmt>
      </titleStmt>
      <publicationStmt>
       <distributor>Bloomsbury</distributor>
      </publicationStmt>
      <sourceDesc>
       <bibl>{title}</bibl>
      </sourceDesc>
     </fileDesc>
    </teiHeader>""", parser)
    return header

In [11]:
print(make_header())

<Element teiHeader at 0x10bec3248>


In [12]:
print(lxml.etree.tostring(make_header(), pretty_print=True).decode('utf-8'))

<teiHeader>
     <fileDesc>
      <titleStmt>
       <title>dummy</title>
       <respStmt>
        <resp>written by</resp>
        <name>J.K. Rowling</name>
       </respStmt>
      </titleStmt>
      <publicationStmt>
       <distributor>Bloomsbury</distributor>
      </publicationStmt>
      <sourceDesc>
       <bibl>dummy</bibl>
      </sourceDesc>
     </fileDesc>
    </teiHeader>



In [13]:
def make_text():
    text = lxml.etree.XML("""<text>
      <front>
    <!-- front matter of copy text, if any, goes here -->
      </front>
      <body>
    <!-- body of copy text goes here -->
      </body>
      <back>
    <!-- back matter of copy text, if any, goes here -->
      </back>
     </text>""", parser)
    return text

In [14]:
print(make_text())
print(lxml.etree.tostring(make_text(), pretty_print=True).decode('utf-8'))

<Element text at 0x10bcd3608>
<text>
      <front>
    <!-- front matter of copy text, if any, goes here -->
      </front>
      <body>
    <!-- body of copy text goes here -->
      </body>
      <back>
    <!-- back matter of copy text, if any, goes here -->
      </back>
     </text>



We join the header and body:

In [15]:
root = lxml.etree.Element('TEI')
root.attrib['xmlns'] = 'http://www.tei-c.org/ns/1.0'
root.append(make_header())
root.append(make_text())
with open('skeleton.xml', 'w') as f:
    f.write(lxml.etree.tostring(root, xml_declaration=True,
                                    pretty_print=True, encoding='utf-8').decode())

Inspect the newly created file `skeleton.xml` in a text editor.
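
The same inspection can also be sketched programmatically. The snippet below is a rough illustration, not part of the original workflow: it uses the standard library's `xml.etree.ElementTree` instead of `lxml`, and it parses a hand-written stand-in string (an assumption about the top-level layout of `skeleton.xml`) so that it runs without the file on disk.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for skeleton.xml (assumption: the real file has
# the same top-level layout, with a fuller header).
sample = """<?xml version='1.0' encoding='utf-8'?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc/></teiHeader>
  <text><front/><body/><back/></text>
</TEI>"""

TEI_NS = '{http://www.tei-c.org/ns/1.0}'

root = ET.fromstring(sample.encode('utf-8'))

# After parsing, every element lives in the TEI namespace declared on the
# root, so tags carry the namespace as a prefix in curly braces.
children = [child.tag.replace(TEI_NS, '') for child in root]
sections = [el.tag.replace(TEI_NS, '') for el in root.find(TEI_NS + 'text')]

print(children)
print(sections)
```

Note that any later lookup by tag name has to include the namespace prefix, because the default namespace applies to all descendants of the root.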

### Fill the skeleton

Under the `body` element of the text element, we now add the series of books and chapters as a tree of `div1` (book) and `div2` (chapter) nodes. The actual text blocks (or "paragraphs") take the form of consecutive `p`-elements under the chapter nodes.
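
To make the target shape concrete, here is a small sketch of the `div1` / `div2` / `p` nesting on toy data. It is only an illustration: it uses the standard library's `xml.etree.ElementTree` rather than `lxml`, and the paragraph strings are placeholders.

```python
import xml.etree.ElementTree as ET

# Build one book with one chapter and two placeholder paragraphs.
body = ET.Element('body')
book = ET.SubElement(body, 'div1', {'type': 'book', 'n': '1'})
chapter = ET.SubElement(book, 'div2', {'type': 'chapter', 'n': '1'})
for idx, para in enumerate(['Placeholder one.', 'Placeholder two.']):
    p = ET.SubElement(chapter, 'p', {'n': str(idx + 1)})
    p.text = para

# Serialise the toy tree and read the paragraph numbers back via a path query.
xml_string = ET.tostring(body, encoding='unicode')
paragraph_ids = [p.get('n') for p in body.findall('./div1/div2/p')]

print(xml_string)
print(paragraph_ids)
```

The `n` attributes give every book, chapter and paragraph a stable position that can be used to address it later.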

In [16]:
def simple_xml(HP, fn):
    root = lxml.etree.Element('TEI')
    root.attrib['xmlns'] = 'http://www.tei-c.org/ns/1.0'
    
    root.append(make_header())
    
    text = make_text()
    for book_idx, book_title in enumerate(HP):
        print(book_idx, book_title)
        book_node = lxml.etree.Element('div1')
        book_node.attrib['type'] = 'book'
        book_node.attrib['title'] = book_title
        book_node.attrib['n'] = str(book_idx + 1)

        for chapter_idx, chapter_title in enumerate(HP[book_title]):
            print('   ', chapter_idx, chapter_title)
            chapter_node = lxml.etree.Element('div2')
            chapter_node.attrib['type'] = 'chapter'
            chapter_node.attrib['title'] = chapter_title
            chapter_node.attrib['n'] = str(chapter_idx + 1)

            for paragraph_idx, paragraph in enumerate(HP[book_title][chapter_title]):
                paragraph_node = lxml.etree.Element('p')
                paragraph_node.attrib['n'] = str(paragraph_idx + 1)
                paragraph_node.text = paragraph

                chapter_node.append(paragraph_node)

            book_node.append(chapter_node)

        text[1].append(book_node)
    
    root.append(text)

    with open(fn, 'w') as f:
        f.write(lxml.etree.tostring(root, xml_declaration=True,
                                    pretty_print=True, encoding='utf-8').decode())

In [17]:
simple_xml(UK_HP, 'simple_potter_uk.xml')
simple_xml(US_HP, 'simple_potter_us.xml')

0 Harry Potter and the Philosopher's Stone
    0 The Boy Who Lived
    1 The Vanishing Glass
    2 The Letters from No One
    3 The Keeper of the Keys
    4 Diagon Alley
    5 The Journey from Platform Nine and Three-Quarters
    6 The Sorting Hat
    7 The Potions Master
    8 The Midnight Duel
    9 Hallowe’en
    10 Quidditch
    11 The Mirror of Erised
    12 Nicolas Flamel
    13 Norbert the Norwegian Ridgeback
    14 The Forbidden Forest
    15 Through the Trapdoor
    16 The Man with Two Faces
1 Harry Potter and the Chamber of Secrets
    0 The Worst Birthday
    2 The Burrow
    3 At Flourish and Blotts
    4 The Whomping Willow
    5 Gilderoy Lockhart
    6 Mudbloods and Murmurs
    7 The Deathday Party
    8 The Writing on the Wall
    9 The Rogue Bludger
    10 The Duelling Club
    11 The Polyjuice Potion
    12 The Very Secret Diary
    13 Cornelius Fudge
    14 Aragog
    15 The Chamber of Secrets
    16 The Heir of Slytherin
    17 Dobby’s Reward
2 Harry Potter and the 

## Detecting direct speech

### Quick introduction to NLP

In [18]:
import spacy
nlp = spacy.load('en')  # on spaCy 3 and later, load a full model name such as 'en_core_web_sm' instead

### American version

For the US edition, which marks direct speech with double quotation marks (“ ”) that cannot be confused with apostrophes, the solution is relatively simple. It is not perfect, however, because names and other short quoted phrases also get tagged as speech:
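
The core idea can be sketched without spaCy. The snippet below is a simplified stand-in for the token-based cell, not the method used in this notebook: it splits a paragraph into direct and non-direct segments with a regular expression over curly double quotes. The helper name `split_speech` and the example sentence are invented for illustration.

```python
import re

# One stretch of direct speech delimited by curly double quotes.
QUOTED = re.compile(r'“[^”]*”')


def split_speech(paragraph):
    """Return (is_direct, text) segments that together cover the paragraph."""
    segments, pos = [], 0
    for match in QUOTED.finditer(paragraph):
        if match.start() > pos:
            # narration between the previous quote and this one
            segments.append((False, paragraph[pos:match.start()]))
        segments.append((True, match.group()))
        pos = match.end()
    if pos < len(paragraph):
        # trailing narration after the last quote
        segments.append((False, paragraph[pos:]))
    return segments


example = '“Hello,” she said. “Come in.”'
segments = split_speech(example)
print(segments)
```

Like the token-based version, this sketch treats every quoted stretch as speech, so a quoted name or title would be tagged as well.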

In [19]:
import shutil
import os

try:
    shutil.rmtree('hp_us_xml')
except FileNotFoundError:
    pass
os.mkdir('hp_us_xml')

book_cnt = 0

for book_idx, book_title in enumerate(US_HP):
    print(book_idx, book_title)
    book_cnt += 1

    root = lxml.etree.Element('TEI')
    root.attrib['xmlns'] = 'http://www.tei-c.org/ns/1.0'
    
    header = make_header(title=book_title)
    root.append(header)
    
    text = make_text()
    
    for chapter_idx, chapter_title in enumerate(US_HP[book_title]):
        print('   ', chapter_idx, chapter_title)
        chapter_node = lxml.etree.Element('div')
        chapter_node.attrib['n'] = f'b{book_cnt}-c{chapter_idx + 1}'
        
        head_node = lxml.etree.Element('head')
        head_node.text = chapter_title
        chapter_node.append(head_node)
        
        for paragraph_idx, paragraph in enumerate(US_HP[book_title][chapter_title]):
            paragraph_node = lxml.etree.Element('p')
            paragraph_node.attrib['n'] = f'b{book_cnt}-c{chapter_idx + 1}-p{paragraph_idx + 1}'
            
            said_node = lxml.etree.Element('said')
            said_node.attrib['direct'] = 'false'
            said_node.text = ''
            just_flushed = False
            
            tokens = nlp(paragraph)
            
            for idx, token in enumerate(tokens):
                
                # opening quotation mark:
                if token.text == '“':
                    if len(said_node.text):
                        paragraph_node.append(said_node)
                    
                    said_node = lxml.etree.Element('said')
                    said_node.attrib['direct'] = 'true'
                    said_node.attrib['who'] = 'unknown'
                    said_node.text = token.text_with_ws
                
                elif token.text[-1] == '”':
                    said_node.text += token.text_with_ws
                    paragraph_node.append(said_node)
                    just_flushed = True
                else:
                    if just_flushed:
                        said_node = lxml.etree.Element('said')
                        said_node.attrib['direct'] = 'false'
                        said_node.text = ''
                        just_flushed = False
                    
                    said_node.text += token.text_with_ws
            
            # don't forget last bit dangling:
            if said_node.text:
                paragraph_node.append(said_node)
            
            chapter_node.append(paragraph_node)
        
        text[1].append(chapter_node)
    
    root.append(text)

    with open(f'hp_us_xml/us{book_cnt}.xml', 'w') as f:
        f.write(lxml.etree.tostring(root, xml_declaration=True,
                                pretty_print=True, encoding='utf-8').decode())

0 Harry Potter and the Sorcerer's Stone
    0 THE BOY WHO LIVED
    1 THE VANISHING GLASS
    2 THE LETTERS FROM NO ONE
    3 THE KEEPER OF THE KEYS
    4 DIAGON ALLEY
    5 THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS
    6 THE SORTING HAT
    7 THE POTIONS MASTER
    8 THE MIDNIGHT DUEL
    9 HALLOWEEN
    10 QUIDDITCH
    11 THE MIRROR OF ERISED
    12 NICOLAS FLAMEL
    13 NORBERT THE NORWEGIAN RIDGEBACK
    14 THE FORBIDDEN FOREST
    15 THROUGH THE TRAPDOOR
    16 THE MAN WITH TWO FACES
1 Harry Potter and the Chamber of Secrets
    0 THE WORST BIRTHDAY
    2 THE BURROW
    3 AT FLOURISH AND BLOTTS
    4 THE WHOMPING WILLOW
    5 GILDEROY LOCKHART
    6 MUDBLOODS AND MURMURS
    7 THE DEATHDAY PARTY
    8 THE WRITING ON THE WALL
    9 THE ROGUE BLUDGER
    10 THE DUELING CLUB
    11 THE POLYJUICE POTION
    12 THE VERY SECRET DIARY
    13 CORNELIUS FUDGE
    14 ARAGOG
    15 THE CHAMBER OF SECRETS
    16 THE HEIR OF SLYTHERIN
    17 DOBBY'S REWARD
2 Harry Potter and the Priso

We add basic NLP annotations using spaCy. The UK edition is a very good illustration of the ambiguity of natural language: its typography does not distinguish between apostrophes and closing quotation marks, which requires a bit of hacking (the problem is surprisingly simple for the US edition).
Leaving aside standard contractions such as `don't`, an apostrophe at the end of a token is typically a closing quotation mark, but not always:
- genitives of plural nouns, e.g. `The Dursleys' house`
- abbreviated ing-forms, e.g. `flyin'` (i.e. slang, typically uttered by Hagrid)
- other abbreviated forms in formulaic expressions, such as `rock 'n roll`
- words with special emphasis or verbatim quotes, e.g. `What is his name? 'Harry'`

The last case is very hard to detect, but we can try to solve the first two issues, to some extent, below. (Note that the problem is aggravated because spaCy does not properly recognize the closing quotes, cf. the `token.is_quote` property.)
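
The abbreviation check is easy to try in isolation. The sketch below applies the same suffix test as the cell that follows to a few hand-picked strings; the helper name `looks_like_abbreviation` and the sample words are illustrative assumptions, and the sentence-boundary condition of the full cell is left out.

```python
# Suffixes treated as clipped forms (e.g. flyin’, an’, o’) rather than
# as a word followed by a closing quotation mark.
ABBREVIATION_ENDINGS = ('an’', 'in’', 'o’')


def looks_like_abbreviation(word):
    """True if a trailing ’ probably marks a clipped word, not a closing quote."""
    return word.endswith(ABBREVIATION_ENDINGS)


samples = ['flyin’', 'o’', 'Harry’', 'said’', 'again’']
flags = [looks_like_abbreviation(word) for word in samples]
print(list(zip(samples, flags)))
```

The last sample shows the limits of the heuristic: an ordinary word such as `again’` at the end of a quotation also matches the `in’` suffix, so some closing quotes will be missed.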

In [None]:
try:
    shutil.rmtree('hp_uk_xml')
except FileNotFoundError:
    pass
os.mkdir('hp_uk_xml')

book_cnt = 0

for book_idx, book_title in enumerate(UK_HP):
    print(book_idx, book_title)
    book_cnt += 1

    root = lxml.etree.Element('TEI')
    root.attrib['xmlns'] = 'http://www.tei-c.org/ns/1.0'
    
    header = make_header(title=book_title)
    root.append(header)
    
    text = make_text()
    
    for chapter_idx, chapter_title in enumerate(UK_HP[book_title]):
        print('   ', chapter_idx, chapter_title)
        chapter_node = lxml.etree.Element('div')
        chapter_node.attrib['n'] = f'b{book_cnt}-c{chapter_idx + 1}'
        
        head_node = lxml.etree.Element('head')
        head_node.text = chapter_title
        chapter_node.append(head_node)
        
        
        for paragraph_idx, paragraph in enumerate(UK_HP[book_title][chapter_title]):
            paragraph_node = lxml.etree.Element('p')
            paragraph_node.attrib['n'] = f'b{book_cnt}-c{chapter_idx + 1}-p{paragraph_idx + 1}'
            
            said_node = lxml.etree.Element('said')
            said_node.attrib['direct'] = 'false'
            said_node.text = ''
            just_flushed = False
            
            tokens = nlp(paragraph)
            
            for idx, token in enumerate(tokens):
                
                # the next token, if any; at the end of a paragraph there is none
                next_token = tokens[idx + 1] if idx + 1 < len(tokens) else None
                mid_sentence = next_token is not None and not next_token.is_sent_start
                
                # catch potential plural genitive
                plural_genitive = False
                if token.text[-1] == '’' and len(token.text) > 1:
                    plural_genitive = (nlp(token.text[:-1])[0].tag_ == 'NNS') and mid_sentence
                
                # catch potential abbreviation
                abbreviation = token.text.endswith(('an’', 'in’', 'o’')) and mid_sentence
                
                # opening quotation mark:
                if token.text == '‘':
                    if len(said_node.text):
                        paragraph_node.append(said_node)
                    
                    said_node = lxml.etree.Element('said')
                    said_node.attrib['direct'] = 'true'
                    said_node.attrib['who'] = 'unknown'
                    said_node.text = token.text_with_ws
                
                elif token.text[-1] == '’' and not (plural_genitive or abbreviation):
                    said_node.text += token.text_with_ws
                    paragraph_node.append(said_node)
                    just_flushed = True
                else:
                    if just_flushed:
                        said_node = lxml.etree.Element('said')
                        said_node.attrib['direct'] = 'false'
                        said_node.text = ''
                        just_flushed = False
                    
                    said_node.text += token.text_with_ws
            
            # don't forget last bit dangling:
            if said_node.text:
                paragraph_node.append(said_node)
            
            chapter_node.append(paragraph_node)
        
        text[1].append(chapter_node)
    
    root.append(text)

    with open(f'hp_uk_xml/uk{book_cnt}.xml', 'w') as f:
        f.write(lxml.etree.tostring(root, xml_declaration=True,
                                pretty_print=True, encoding='utf-8').decode())

0 Harry Potter and the Philosopher's Stone
    0 The Boy Who Lived
    1 The Vanishing Glass
    2 The Letters from No One
    3 The Keeper of the Keys
    4 Diagon Alley
    5 The Journey from Platform Nine and Three-Quarters
    6 The Sorting Hat
    7 The Potions Master
    8 The Midnight Duel
    9 Hallowe’en
    10 Quidditch
    11 The Mirror of Erised
    12 Nicolas Flamel
    13 Norbert the Norwegian Ridgeback
    14 The Forbidden Forest
    15 Through the Trapdoor
    16 The Man with Two Faces
1 Harry Potter and the Chamber of Secrets
    0 The Worst Birthday
    2 The Burrow
    3 At Flourish and Blotts
    4 The Whomping Willow
    5 Gilderoy Lockhart
    6 Mudbloods and Murmurs
    7 The Deathday Party
    8 The Writing on the Wall
    9 The Rogue Bludger
    10 The Duelling Club
    11 The Polyjuice Potion
    12 The Very Secret Diary
    13 Cornelius Fudge
    14 Aragog
    15 The Chamber of Secrets
    16 The Heir of Slytherin
    17 Dobby’s Reward
2 Harry Potter and the 

If you would like to keep the annotations:

```python
import lxml.etree
import spacy

nlp = spacy.load('en')  # on spaCy 3 and later, load a full model name such as 'en_core_web_sm' instead

series = lxml.etree.Element('HarryPotterSeries')

def token_to_xml(token):
    token_node = lxml.etree.Element('token')
    
    token_node.text = token.text_with_ws
    token_node.attrib['lemma'] = token.lemma_
    token_node.attrib['pos'] = token.tag_
    token_node.attrib['ent'] = token.ent_type_
    token_node.attrib['ent_iob'] = token.ent_iob_
    
    return token_node

cnt = 0

for book_idx, book_title in enumerate(UK_HP):
    print(book_idx, book_title)
    book_node = lxml.etree.Element('book')
    book_node.attrib['title'] = book_title
    book_node.attrib['n'] = str(book_idx + 1)
    
    for chapter_idx, chapter_title in enumerate(UK_HP[book_title]):
        print('   ', chapter_idx, chapter_title)
        chapter_node = lxml.etree.Element('chapter')
        chapter_node.attrib['title'] = chapter_title
        chapter_node.attrib['n'] = str(chapter_idx + 1)
        
        for paragraph_idx, paragraph in enumerate(UK_HP[book_title][chapter_title]):
            paragraph_node = lxml.etree.Element('p')
            paragraph_node.attrib['n'] = str(paragraph_idx + 1)
            
            said_node = lxml.etree.Element('said')
            said_node.attrib['direct'] = 'false'
            said_node.text = ''
            just_flushed = False
            
            tokens = nlp(paragraph)
            
            for idx, token in enumerate(tokens):
                
                # the next token, if any; at the end of a paragraph there is none
                next_token = tokens[idx + 1] if idx + 1 < len(tokens) else None
                mid_sentence = next_token is not None and not next_token.is_sent_start
                
                # catch potential plural genitive
                plural_genitive = False
                if token.text[-1] == '’' and len(token.text) > 1:
                    plural_genitive = (nlp(token.text[:-1])[0].tag_ == 'NNS') and mid_sentence
                
                # catch potential abbreviation
                abbreviation = token.text.endswith(('an’', 'in’', 'o’')) and mid_sentence
                
                # opening quotation mark:
                if token.text == '‘':
                    if len(said_node):
                        paragraph_node.append(said_node)
                    
                    said_node = lxml.etree.Element('said')
                    said_node.attrib['direct'] = 'true'
                    said_node.attrib['who'] = 'unknown'
                    said_node.append(token_to_xml(token))
                
                elif token.text[-1] == '’' and not (plural_genitive or abbreviation):
                    said_node.append(token_to_xml(token))
                    paragraph_node.append(said_node)
                    just_flushed = True
                else:
                    if just_flushed:
                        said_node = lxml.etree.Element('said')
                        said_node.attrib['direct'] = 'false'
                        just_flushed = False
                    
                    said_node.append(token_to_xml(token))
            
            # don't forget last bit dangling:
            if len(said_node):
                paragraph_node.append(said_node)
            
            
            chapter_node.append(paragraph_node)

        book_node.append(chapter_node)

    series.append(book_node)

with open('rich_potter_uk.xml', 'w') as f:
    f.write(lxml.etree.tostring(series, xml_declaration=True,
                                pretty_print=True, encoding='utf-8').decode())
```

## Tagging character names

Use the Wikipedia IDs from this list:
https://en.m.wikipedia.org/wiki/List_of_Harry_Potter_characters

## Parsing the background corpus

In [15]:
!pip install ebooklib

Collecting ebooklib
Installing collected packages: ebooklib
Successfully installed ebooklib-0.16


In [65]:
import re

import isbnlib  # third-party package (pip install isbnlib)

SPECIAL_ISBN = ['0000000000']
ISBN = re.compile(r'[- 0-9X]{10,19}', re.M | re.S)


def extract_isbn(ebook):
    isbns = set()
    for match in ISBN.finditer(ebook.content):
        isbn = match.group()
        isbn = ''.join(c.upper() if c in 'isbn' else c for c in isbn)
        isbn = re.sub(r'ISBN', 'ISBN\x20', re.sub(r'\x20', '', isbn))
        if isbn not in SPECIAL_ISBN:
            try:
                canonical_isbn = isbnlib.get_canonical_isbn(isbn)
            except IndexError:
                continue
            # only record matches that resolve to a canonical ISBN
            if canonical_isbn:
                isbn_formats = [
                    canonical_isbn,
                    isbnlib.to_isbn10(canonical_isbn) or '',
                    isbnlib.to_isbn13(canonical_isbn) or '',
                ]
                isbns.add(','.join(isbn_formats))
    return isbns

In [66]:
!pip install isbnlib



In [71]:
import html.parser
import logging
import zipfile
import zlib

import isbnlib

logger = logging.getLogger(__name__)


class EbookReader(html.parser.HTMLParser):
    def __init__(self, fname):
        super().__init__()
        self.fname = fname
        self.content = ''

    def handle_data(self, data):
        # collect all text nodes into one long string
        self.content += data

    def parse(self):
        try:
            with zipfile.ZipFile(self.fname, 'r') as zf:
                html_files = [f for f in zf.filelist if f.filename.endswith('html')]
                for html_file in html_files:
                    try:
                        # decode the raw bytes (UTF-8 is an assumption);
                        # str(bytes) would feed the "b'...'" repr to the parser
                        self.feed(zf.read(html_file).decode('utf-8', errors='replace'))
                    except KeyError:
                        logger.debug(f'File not found in Zipfile {self.fname}')
        except (zipfile.BadZipFile, zlib.error):
            logger.debug(f'Exception in ZipFile {self.fname}.')

In [75]:
!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-2.5.0.tar.gz (169kB)
Collecting jdcal (from openpyxl)
  Downloading jdcal-1.3.tar.gz
Collecting et_xmlfile (from openpyxl)
  Downloading et_xmlfile-1.0.1.tar.gz
Building wheels for collected packages: openpyxl, jdcal, et-xmlfile
  Running setup.py bdist_wheel for openpyxl ... done
  Stored in directory: /Users/mike/Library/Caches/pip/wheels/a7/88/96/29c1f91ba5a9b94dfc39a9f6f72d0eb92d6f0d917cf2341a3f
  Running setup.py bdist_wheel for jdcal ... done
  Stored in directory: /Users/mike/Library/Caches/pip/wheels/0f/63/92/19ac65ed64189de4d662f269d39dd08a887258842ad2f29549
  Running setup.py bdist_wheel for et-xmlfile ... done
  Stored in directory: /Users/mike/Library/Caches/pip/wheels/99/f6/53/5e18f3ff4ce36c990fa90ebdf2b80cd9b44dc461f750a1a77c
Successfully built openpyxl jdcal et-xmlfile
Installing collected packages: jdcal, et-xmlfile

In [76]:
import glob
import os
import shutil
import zipfile

import bs4
import ebooklib
import lxml.etree
import pandas as pd
from ebooklib import epub

# start from an empty output directory
shutil.rmtree('background', ignore_errors=True)
os.mkdir('background')

entries, entry_id = [], 1
for filepath in glob.glob('background_dirty/*'):
    filename = os.path.basename(filepath)
    ext = os.path.splitext(filename)[-1]
    # placeholder metadata for files that carry none (.txt, .rtf)
    author, title = 'Unknown', 'Unknown'
    if ext == '.txt':
        shutil.copyfile(filepath, 'background/' + filename)
    elif ext == '.rtf':
        # not converted automatically; still has to be done by hand
        pass
    elif ext == '.epub':
        print(filepath[:50])
        try:
            book = epub.read_epub(filepath)
        except (KeyError, AttributeError, lxml.etree.XMLSyntaxError,
                epub.EpubException, zipfile.BadZipFile):
            continue

        try:
            metadata = book.metadata['http://purl.org/dc/elements/1.1/']
            title, author = metadata['title'][0][0], metadata['creator'][0][0]
        except (KeyError, IndexError):
            print('missing author/title', filename)

        text = ''
        for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
            content = item.get_content()
            if not content:
                continue
            soup = bs4.BeautifulSoup(content, 'lxml')
            body = soup.find('body')
            if body is None:
                continue
            for p in body.find_all('p'):
                text += p.get_text(separator=' ', strip=True) + '\n\n'

        # write the file once, after all documents have been collected
        text = text.strip()
        if text:
            with open('background/' + filename.replace('.epub', '.txt'), 'w',
                      encoding='utf-8') as newf:
                newf.write(text)
        else:
            author, title = 'ERROR', 'ERROR'

    # one metadata row per file
    entries.append([entry_id, filename.replace('.epub', '.txt'), author, title])
    entry_id += 1

df = pd.DataFrame(entries, columns=['id', 'filename', 'author', 'title'])
df.to_excel("background_meta.xlsx", index=False)

background_dirty/(Chrestomanci 7) Diana Wynne Jone
background_dirty/Alexander, Lloyd - [Chronicles Of
background_dirty/Alexander, Lloyd - [Chronicles Of
background_dirty/Alexander, Lloyd - [Chronicles of
background_dirty/Alexander, Lloyd - [Chronicles of
background_dirty/Blyton, Enid & Cox, Pamela - [Com
background_dirty/Blyton, Enid & Cox, Pamela - [Com
background_dirty/Blyton, Enid & Cox, Pamela - [Com
background_dirty/Blyton, Enid & Cox, Pamela - [Mal
background_dirty/Blyton, Enid & Cox, Pamela - [Mal
background_dirty/Blyton, Enid - [Malory Towers 1] 
background_dirty/Blyton, Enid - [Malory Towers 2] 
background_dirty/Blyton, Enid - [Malory Towers 3] 
background_dirty/Blyton, Enid - [Naughtiest Girl 1
background_dirty/Blyton, Enid - [Naughtiest Girl 2
background_dirty/Blyton, Enid - [St Clare's 1] - T
background_dirty/Blyton, Enid - [St Clare's 2] - T
background_dirty/Blyton, Enid - [St Clare's 3] - S
background_dirty/Blyton, Enid - [St Clare's 4] - T
background_dirty/Blyton, Enid -

## Student intern

### Tasks

1. The original folder `background_dirty` contains the 229 originally downloaded files in various formats (epub, txt, rtf, etc.). I was able to convert most of them (212) to txt automatically (see the folder `background`). The remaining files still need to be converted to plain text (a file with a .txt extension). This is best done in a *plain text editor* such as `Sublime Text`: save with `File > Save with encoding ... > UTF-8`.
2. The front matter and back matter still have to be removed from the files (e.g. the part that refers to the digitisation by Project Gutenberg). The aim is that *only the author's own text* remains. It is important that every mention of the author's name is removed.
3. The metadata is currently stored as a spreadsheet file (`background_meta.xlsx`) that you should be able to open in an application such as Microsoft Excel (or upload to Google Drive if several people are going to work on it at the same time). For each text (identified by its full filename) we need the following fields:
    - filename in the folder `background` (e.g. "Brontë Wuthering Heights.txt")
    - author's first name (e.g. Emily)
    - author's last name (e.g. Brontë)
    - full title of the book (e.g. "Wuthering Heights")
    - a short, **unique** version of the title (e.g. 'heights')
    - whether the book is a boarding-school novel (see the central catalogue of children's literature?)
    - date of first publication of the work (e.g. 1847)
    - biological gender of the author (M/F)

    Above all, the author's first and last names must be spelled **consistently**. This is easy to verify by sorting the spreadsheet on the columns in question, which brings inconsistencies to light.
4. We need to check whether there are still important gaps that can easily be filled. In particular, (1) more boarding-school novels are needed, (2) gaps in certain series can be filled, and (3) for some important authors we may still have included relatively few texts.
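
The consistency check on author names described in task 3 could also be partly automated. The following is a minimal sketch under stated assumptions: the rows are available as `(first_name, last_name)` tuples, and the helper `spelling_variants` is hypothetical. It only catches differences in capitalisation and whitespace, so variants such as initials still require a manual look at the sorted spreadsheet.

```python
from collections import defaultdict


def spelling_variants(rows):
    """Group author names that differ only in case or whitespace.

    `rows` is an iterable of (first_name, last_name) tuples. The result maps
    a normalised (last, first) key to the sorted list of distinct spellings,
    restricted to keys that occur with more than one spelling.
    """
    seen = defaultdict(set)
    for first, last in rows:
        key = (' '.join(last.split()).casefold(),
               ' '.join(first.split()).casefold())
        seen[key].add((first, last))
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


# illustrative rows, not taken from the actual spreadsheet
rows = [
    ('Enid', 'Blyton'),
    ('enid', 'Blyton '),
    ('Lloyd', 'Alexander'),
]
variants = spelling_variants(rows)
for key, names in variants.items():
    print(key, names)
```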