# Text Reuse and Intertextuality

## Data preparation

1. TEXT: Format the HP books and HP movie scripts in the TRACER format. For the movie scripts, this means removing the names of the speakers. I can give you clear instructions on the TRACER formatting (all texts go into one text file, one segment per line, four tab-separated fields):
```
seven-digit id (book 1: 11, book 2: 12) \t tokenized sentence, lowercased tokens, split on whitespace \t "NULL" \t "book 1, chapter 1" (free field)
1100001 quod sit officium sapientis .   NULL    Summa Contra Gentiles
1100002 veritatem meditabitur guttur meum , et labia mea detestabuntur impium .     NULL    Summa Contra Gentiles
1100003 prov. 8-7 .     NULL    Summa Contra Gentiles
1100004 multitudinis usus , quem in rebus nominandis sequendum philosophus censet , communiter obtinuit ut sapientes dicantur qui res directe ordinant et eas bene gubernant .  NULL    Summa Contra Gentiles
```

2. SYNONYMS: For near-verbatim/paraphrase detection, you need an English list of synonyms or a thesaurus. This can be extracted from WordNet and needs to be formatted as a bidirectional list in two columns. I can also give you details for this. The problem is that many HP neologisms won't be in a standard thesaurus. (If we don't do this, we will only find exact matches.) The list should meet two requirements:
    * exhaustive list of synonyms, but only for words that occur in the movie scripts or books; also add a British vs. American list
    * has to be bidirectional, i.e. every pair is listed in both directions. E.g.

```
- love \t care
- care \t love
```
3. LEMMAS: PoS-tag and lemmatise the corpus. Please do this with Stanford CoreNLP, as its output is recognised by TRACER and no extra conversion is needed.
```
lowercased wordform \t baseform/lemma from corenlp \t postag \n
```
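
The TEXT format from step 1 can be sketched as a small helper. This is a minimal sketch, not part of TRACER itself: the name `tracer_line` and its parameters are hypothetical, and it assumes a two-digit source prefix followed by a five-digit running number, as in the sample lines above.

```python
def tracer_line(source_no, sent_no, tokens, info):
    """Build one tab-separated TRACER line (hypothetical helper)."""
    # seven-digit ID: two-digit source prefix + five-digit running number (assumed layout)
    seg_id = f"{source_no:02d}{sent_no:05d}"
    # lowercased tokens, separated by single spaces
    text = " ".join(t.lower() for t in tokens)
    return "\t".join([seg_id, text, "NULL", info])


line = tracer_line(11, 1, ["Quod", "sit", "officium", "sapientis", "."],
                   "Summa Contra Gentiles")
print(line)
```

The free field is passed through unchanged, so it can hold anything from a work title to a book and chapter label.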

### Format texts

It is important to use the same tokenizer for both data streams:

In [5]:
from nltk import word_tokenize, sent_tokenize
import spacy
# the original 'en' shortcut only exists in spaCy 2; the full model name works in both
nlp = spacy.load('en_core_web_sm')
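
One way to make sure both streams are normalized identically is to route them through a single helper. The sketch below is only an assumption about how that could look: `clean_forms` is a hypothetical name, and it works on plain token strings so that the output of any tokenizer can be passed in.

```python
def clean_forms(token_texts):
    """Lowercase tokens and keep alphabetic characters only (hypothetical helper)."""
    forms = []
    for text in token_texts:
        form = "".join(ch for ch in text.lower() if ch.isalpha())
        if form:  # tokens that were pure punctuation or digits vanish
            forms.append(form)
    return forms


print(clean_forms(["Mr.", "Dursley", "'s", "No.", "4", "!"]))
```

With spaCy, this could be called as `clean_forms(t.text for t in nlp(text))` for subtitles and novel sentences alike.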

First subtitles:

In [6]:
import glob
import os
import pysrt

title_cnt = 10  # source prefix: the first file gets 11, the next one 12, ...

lemma_pos = set()  # unique (form, lemma, PoS tag) triples across the whole corpus

with open('lines.txt', 'w') as new_file:
    filenames = sorted(glob.glob('/Users/mike/GitRepos/potter/data/subtitles/*.srt'))

    for filename in filenames:
        title_cnt += 1
        title = os.path.basename(filename).split('.')[0]
        title = title.split('(')[0].strip().replace(' ', '_')
        print(title)

        for sub_cnt, sub in enumerate(pysrt.open(filename), start=1):
            # free field: title plus start and end time of the subtitle
            start_time = sub.start.to_time().strftime('%H:%M:%S')
            end_time = sub.end.to_time().strftime('%H:%M:%S')
            info = title + '-' + start_time + '-' + end_time

            # collapse line breaks and repeated whitespace
            text = ' '.join(sub.text_without_tags.split())

            # ID: two-digit title prefix + six-digit, zero-padded subtitle counter
            new_file.write(str(title_cnt) + str(sub_cnt).zfill(6) + '\t')

            for token in nlp(text):
                # keep alphabetic characters only; drops punctuation and digits
                form = ''.join(ch for ch in token.text.lower() if ch.isalpha())
                if form:
                    new_file.write(form + ' ')
                    lemma_pos.add((form, token.lemma_.lower(), token.tag_))

            new_file.write('\tNULL\t' + info + '\n')

01-Harry_Potter_and_the_Sorcerer_s_Stone
02-Harry_Potter_and_the_Chamber_of_Secrets
03-Harry_Potter_and_the_Prisoner_of_Azkaban
04-Harry_Potter_And_The_Goblet_Of_Fire
05-Harry_Potter_and_the_Order_of_the_Phoenix
06-Harry_Potter_And_The_Half_blood_Prince
07a-Harry_Potter_and_Deathly_Hallows_Part_1
07b-Harry_Potter_And_Deathly_Hallows_Part_2


Then novels (US version for orthographic reasons):

In [7]:
from lxml import etree
from collections import OrderedDict

def load_potter(fn):
    series = etree.parse(fn)
    HP = OrderedDict()
    for book in series.iterfind('.//book'):
        book_title = book.attrib['title']
        #print(book_title)
        HP[book_title] = OrderedDict()
        
        for chapter in book.iterfind('.//chapter'):
            chapter_title = chapter.attrib['title']
            #print('   ', chapter_title)
            HP[book_title][chapter_title] = []
            
            for paragraph in chapter.iterfind('.//p'):
                text = ''.join([x for x in paragraph.itertext()])
                HP[book_title][chapter_title].append(text)
    return HP

novels = load_potter('../preprocessing/simple_potter_us.xml')

In [8]:
with open('lines.txt', 'a') as new_file:
    for book in novels:
        print(book)
        title_cnt += 1
        sent_cnt = 0  # counted per book, so that IDs stay unique across chapters
        for chapter in novels[book]:
            info = book.replace(' ', '_') + '-' + chapter.replace(' ', '_')
            for paragraph in novels[book][chapter]:
                for sentence in sent_tokenize(paragraph):
                    sent_cnt += 1
                    text = sentence.strip()

                    # ID: two-digit title prefix + six-digit, zero-padded sentence counter
                    new_file.write(str(title_cnt) + str(sent_cnt).zfill(6) + '\t')

                    for token in nlp(text):
                        # same normalization as for the subtitles
                        form = ''.join(ch for ch in token.text.lower() if ch.isalpha())
                        if form:
                            new_file.write(form + ' ')
                            lemma_pos.add((form, token.lemma_.lower(), token.tag_))

                    new_file.write('\tNULL\t' + info + '\n')

Harry Potter and the Sorcerer's Stone
Harry Potter and the Chamber of Secrets
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Goblet of Fire
Harry Potter and the Order of the Phoenix
Harry Potter and the Half-Blood Prince
Harry Potter and the Deathly Hallows


### Dump lemmas and tags

In [10]:
with open('lemma_pos.txt', 'w') as new_file:
    for item in lemma_pos:
        new_file.write('\t'.join(item) + '\n')

### Extract synonyms

In [11]:
from nltk.corpus import wordnet as wn

with open('synonyms.txt', 'w') as f:
    pairs = set()  # deduplicate: lemmas recur across forms and synsets overlap
    for _, lemma, _ in lemma_pos:
        for synset in wn.synsets(lemma):
            for synonym in synset.lemma_names():
                # WordNet joins multiword lemmas with underscores
                synonym = synonym.replace('_', ' ').lower()
                if synonym != lemma:
                    # write every pair in both directions
                    pairs.add((lemma, synonym))
                    pairs.add((synonym, lemma))

    for a, b in sorted(pairs):
        f.write(a + '\t' + b + '\n')

(Add UK-US pairs?)

In [9]:
#from nltk.stem import WordNetLemmatizer
#wordnet_lemmatizer = WordNetLemmatizer()

#with open('synonyms.txt', 'a') as f:
#    for line in open('../collation/uk_vs_us.txt', 'r'):
#        if line.startswith('#'):
#            continue
        
#        a, b = line.strip().split()
#        a = wordnet_lemmatizer.lemmatize(a)
#        b = wordnet_lemmatizer.lemmatize(b)
        
#        f.write('\t'.join((a, b)) + '\n')
#        f.write('\t'.join((b, a)) + '\n')