# Text Reuse and Intertextuality

## Data preparation

1. TEXT: Format the HP books and HP movie scripts in the TRACER format. For the movie scripts, this means removing the names of the speakers. I can give you detailed instructions on TRACER formatting (all texts in one text file, one segment per line):
```
seven-digit id (book 1: 11, book 2: 12) \t tokenized sentence, lowercased tokens, split on whitespace \t "NULL" \t "book 1, chapter 1" (free field)
1100001 quod sit officium sapientis .   NULL    Summa Contra Gentiles
1100002 veritatem meditabitur guttur meum , et labia mea detestabuntur impium .     NULL    Summa Contra Gentiles
1100003 prov. 8-7 .     NULL    Summa Contra Gentiles
1100004 multitudinis usus , quem in rebus nominandis sequendum philosophus censet , communiter obtinuit ut sapientes dicantur qui res directe ordinant et eas bene gubernant .  NULL    Summa Contra Gentiles
```

2. SYNONYMS: For near-verbatim/paraphrase detection, you need an English list of synonyms or a thesaurus. (If we don't do this, we will only find exact matches.) The list can be extracted from WordNet and needs to be formatted as a bidirectional list in two columns. I can also give you details for this. The problem is that many HP neologisms won't be in a standard thesaurus. Two requirements for the list:
    * an exhaustive list of synonyms, but only for words that occur in the movie scripts or books; also add a British vs. American list
    * it has to be bidirectional, i.e. each pair appears in both directions. E.g.

```
- love \t care
- care \t love
```
3. LEMMAS: PoS-tag and lemmatise the corpus. Please do this with Stanford CoreNLP, as its output is recognised by TRACER and no extra conversion is needed. One line per word form:
```
lowercased wordform \t baseform/lemma from corenlp \t postag \n
```
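
The three file formats listed above can be sketched as small helper functions. This is a minimal illustration under assumptions, not part of TRACER itself: the function names `tracer_line`, `synonym_lines` and `lemma_line` are hypothetical, and the split of the seven-digit id into a two-digit source number plus a five-digit counter is inferred from the example ids (`1100001`, ...).

```python
def tracer_line(source_no, counter, tokens, info):
    """Return one text line: id \t lowercased tokens \t NULL \t free field."""
    # Assumption: two-digit source number (11, 12, ...) + five-digit counter.
    unit_id = f"{source_no:02d}{counter:05d}"
    text = " ".join(t.lower() for t in tokens)
    return "\t".join((unit_id, text, "NULL", info))


def synonym_lines(pairs):
    """Return sorted 'word \t synonym' lines containing both directions."""
    both = set()
    for a, b in pairs:
        a, b = a.lower(), b.lower()
        if a != b:
            both.add((a, b))
            both.add((b, a))
    return ["\t".join(p) for p in sorted(both)]


def lemma_line(form, lemma, pos):
    """Return one lemma line: lowercased word form \t lemma \t PoS tag."""
    return "\t".join((form.lower(), lemma.lower(), pos))


print(tracer_line(11, 1, ["Quod", "sit", "officium", "sapientis", "."], "book_1-chapter_1"))
print(synonym_lines([("love", "care")]))
print(lemma_line("Owls", "owl", "NNS"))
```

Collecting the synonym pairs in a set means that a pair supplied in both directions is still written only once per direction.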

### Format texts

It is important to use the same tokenizer for both data streams:

In [9]:
from nltk import word_tokenize, sent_tokenize

Keep track of the unique words that occur:

In [10]:
vocab = set()

First, the subtitles:

In [11]:
import glob
import os
import pysrt

with open('movies.txt', 'w') as new_file:
    filenames = sorted(glob.glob('/Users/mike/GitRepos/potter/data/subtitles/*.srt'))
    title_cnt = 10  # movie 1 gets prefix 11, movie 2 gets 12, ...

    for filename in filenames:
        title_cnt += 1
        # file name without extension and parenthesised suffix, spaces -> underscores
        title = os.path.basename(filename).split('.')[0]
        title = title.split('(')[0].strip().replace(' ', '_')
        print(title)

        sub_cnt = 0
        for sub in pysrt.open(filename):
            sub_cnt += 1

            # time span of the subtitle goes into the free field
            start_time = sub.start.to_time().strftime('%H:%M:%S')
            end_time = sub.end.to_time().strftime('%H:%M:%S')
            info = title + '-' + start_time + '-' + end_time

            # collapse line breaks and repeated whitespace, then tokenize
            text = ' '.join(sub.text_without_tags.split())
            tokens = [t.lower() for t in word_tokenize(text)]

            vocab.update(tokens)

            # two-digit movie prefix + five-digit counter = seven-digit id
            unit_id = str(title_cnt) + str(sub_cnt).zfill(5)
            new_file.write(unit_id + '\t' + ' '.join(tokens) + '\tNULL\t' + info + '\n')

01-Harry_Potter_and_the_Sorcerer_s_Stone
02-Harry_Potter_and_the_Chamber_of_Secrets
03-Harry_Potter_and_the_Prisoner_of_Azkaban
04-Harry_Potter_And_The_Goblet_Of_Fire
05-Harry_Potter_and_the_Order_of_the_Phoenix
06-Harry_Potter_And_The_Half_blood_Prince
07a-Harry_Potter_and_Deathly_Hallows_Part_1
07b-Harry_Potter_And_Deathly_Hallows_Part_2
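
The data-preparation notes ask for speaker names to be removed from movie scripts. For script-style input in which a line starts with a speaker label, that step could look roughly like the sketch below. The label pattern (an upper-case name followed by a colon) and the function name `strip_speaker` are assumptions, and the sample lines are invented for illustration; real scripts may mark speakers differently.

```python
import re

# Hypothetical label pattern: an upper-case name (letters, spaces, dots,
# apostrophes or hyphens) followed by a colon at the start of the line.
SPEAKER = re.compile(r"^[A-Z][A-Z .'\-]*:\s*")


def strip_speaker(line):
    """Remove a leading speaker label such as 'HARRY: ' if there is one."""
    return SPEAKER.sub("", line.strip())


# Invented example lines, not taken from the corpus.
print(strip_speaker("HARRY: Hello there."))
print(strip_speaker("MR. SMITH: Good evening."))
print(strip_speaker("No label here: just text."))
```

Lines without a matching label are returned unchanged, so the function can be applied to every line before tokenization.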


Then the novels (UK edition, although the movies might be closer in language and spelling to the US edition):

In [12]:
from lxml import etree
from collections import OrderedDict

def load_potter(fn):
    """Parse the series XML into {book title: {chapter title: [paragraphs]}}."""
    series = etree.parse(fn)
    HP = OrderedDict()
    for book in series.iterfind('.//book'):
        book_title = book.attrib['title']
        HP[book_title] = OrderedDict()

        for chapter in book.iterfind('.//chapter'):
            chapter_title = chapter.attrib['title']
            HP[book_title][chapter_title] = []

            for paragraph in chapter.iterfind('.//p'):
                # join all text nodes, including text inside nested elements
                text = ''.join(paragraph.itertext())
                HP[book_title][chapter_title].append(text)
    return HP

novels = load_potter('../preprocessing/simple_potter_uk.xml')

In [13]:
with open('books.txt', 'w') as new_file:
    book_cnt = 10  # book 1 gets prefix 11, book 2 gets 12, ...
    for book in novels:
        print(book)
        book_cnt += 1
        # count sentences per book (not per chapter) so that ids stay unique
        sent_cnt = 0
        for chapter in novels[book]:
            info = book.replace(' ', '_') + '-' + chapter.replace(' ', '_')
            for paragraph in novels[book][chapter]:
                for sentence in sent_tokenize(paragraph):
                    sent_cnt += 1
                    tokens = [t.lower() for t in word_tokenize(sentence.strip())]
                    vocab.update(tokens)
                    # two-digit book prefix + five-digit counter = seven-digit id
                    unit_id = str(book_cnt) + str(sent_cnt).zfill(5)
                    new_file.write(unit_id + '\t' + ' '.join(tokens) + '\tNULL\t' + info + '\n')

Harry Potter and the Philosopher's Stone
Harry Potter and the Chamber of Secrets
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Goblet of Fire
Harry Potter and the Order of the Phoenix
Harry Potter and the Half Blood Prince
Harry Potter and the Deathly Hallows
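
Before handing the files to TRACER, it may be worth checking that every line has four tab-separated fields and that the ids are numeric, of the expected length, and unique. The checker below is a minimal sketch; the function name `check_tracer_lines` and its `id_length` parameter are hypothetical, and the sample lines are made up.

```python
def check_tracer_lines(lines, id_length=7):
    """Return a list of (line_no, problem) tuples for malformed lines."""
    problems = []
    seen = set()
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            problems.append((line_no, "expected 4 tab-separated fields"))
            continue
        unit_id = fields[0]
        if not unit_id.isdigit() or len(unit_id) != id_length:
            problems.append((line_no, "bad id"))
        if unit_id in seen:
            problems.append((line_no, "duplicate id"))
        seen.add(unit_id)
    return problems


# Made-up sample: the second line repeats an id, the third has no tabs.
sample = [
    "1100001\ta b .\tNULL\tx\n",
    "1100001\tc d .\tNULL\tx\n",
    "broken line\n",
]
print(check_tracer_lines(sample))
```

Because the function only iterates over lines, an open file handle for `books.txt` or `movies.txt` can be passed in directly.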


### Extract synonyms

How many unique words did we collect?

In [15]:
print(len(vocab))

23467


In [20]:
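from nltk.corpus import wordnet as wn

pairs = set()
# NOTE: only the first 1000 vocabulary items (in sorted order) are looked up here
for word in sorted(vocab)[:1000]:
    for synset in wn.synsets(word):
        for synonym in synset.lemma_names():
            # lowercase to match the lowercased tokens in the text files
            synonym = synonym.lower()
            if synonym != word:
                # store both directions; the set removes repeated pairs
                pairs.add((word, synonym))
                pairs.add((synonym, word))

with open('synonyms.txt', 'w') as f:
    for pair in sorted(pairs):
        f.write('\t'.join(pair) + '\n')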

!
#
&
'
''
''an
''come
''get
''glistening
''magical
''marvelous
''me
''miss
''old
''potentially
''try
'd
'did
'em
'hey
'll
'm
'prolly
're
's
't
've
'you
'your
(
)
*
,
-
--
-a
-absolutely
-alastor
-albus
-all
-amazing
-and
-any
-are
-attention
-auror
-avada
-axxo-
-barty
-because
-besides
-better
-bodied
-bonsoir
-bottoms
-brilliant
-but
-ced
-cedric
-cho
-come
-congratulations
-cooked
-crime
-crouch
-curse
-did
-do
-dress
-dumb
-everything
-excellent
-excuse
-expelliarmus
-feet
-fine
-fleur
-follow
-for
-four
-fourteen
-get
-ginny
-go
-good
-great
-hagrid
-harry
-have
-he
-hello
-here
-hermione
-hey
-hi
-his
-how
-just
-keep
-kill
-knew
-krum
-l
-least
-let
-lf
-look
-ls
-lt
-moral
-most
-mr.
-my
-never
-neville
-next
-no
-nothing
-now
-oh
-oi
-one
-or
-pack
-pathetic
-perhaps
-potter
-professor
-put
-quiet
-read
-ready
-really
-right
-ron
-rosier
-second
-see
-shall
-she
-shut
-silence
-simple
-snape
-someone
-stupid
-such
-surely
-take
-teaching
-technically
-thank
-thanks
-that
-the
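
The listing above shows that many vocabulary items are words glued to a leading dialogue dash (`-harry`) or to a doubled apostrophe (`''come`, presumably the tokenizer's rendering of a double quote), while others are pure punctuation. A minimal sketch of cleaning the vocabulary before the WordNet lookup follows; the function names and the exact set of stripped prefixes are assumptions based only on this listing.

```python
import re

# Assumed noise prefixes: runs of hyphens, or the two-character quote
# tokens '' and ``. Single apostrophes are kept so clitics like 'll survive.
LEADING_NOISE = re.compile(r"^(?:-+|''|``)+")


def normalize_token(token):
    """Strip leading dialogue dashes and doubled quote marks from a token."""
    return LEADING_NOISE.sub("", token)


def clean_vocab(vocab):
    """Return normalized tokens that contain at least one letter."""
    cleaned = set()
    for token in vocab:
        word = normalize_token(token)
        if any(ch.isalpha() for ch in word):
            cleaned.add(word)
    return cleaned


print(sorted(clean_vocab({"-harry", "''come", "'ll", "--", "harry", "mr."})))
```

If the text files are meant to match this vocabulary, the same normalization would have to be applied to the tokens written to `movies.txt` and `books.txt`.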

### Provide lemmas and tags

In [24]:
import spacy
# on spaCy 3+ the 'en' shortcut is gone; use a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

items = set()

for book in novels:
    print(book)
    for chapter in novels[book]:
        for paragraph in novels[book][chapter]:
            for sentence in sent_tokenize(paragraph):
                for token in nlp(sentence):
                    form = token.text.lower()
                    lemma = token.lemma_.lower()
                    pos = token.tag_
                    # add every token, not just the last one of each sentence
                    items.add((form, lemma, pos))

with open('lemma_pos.txt', 'w') as new_file:
    for item in sorted(items):
        new_file.write('\t'.join(item) + '\n')

Harry Potter and the Philosopher's Stone
Harry Potter and the Chamber of Secrets
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Goblet of Fire
Harry Potter and the Order of the Phoenix
Harry Potter and the Half Blood Prince
Harry Potter and the Deathly Hallows
