# 01: Quoth The Raven NLG - Data Loading and Preprocessing

---

In this notebook, we load the text files for the complete works of Edgar Allan Poe and prepare the text for training natural language generation models.

* [Section A: Loading and Cleaning the Text Files](#load)
* [Section B: Cleaning the Loaded Text](#clean)
* [Section C: Generating Word-Level Tokens of the Text Data and Sequences of Tokens](#token)
* [Section D: Saving the Sequences to a File for Use in Modeling](#save)

### Imports:

In [1]:
import os
import re
from unidecode import unidecode
from keras.preprocessing.text import text_to_word_sequence, hashing_trick

--- 

### <a name="load"></a>Section A: Loading and Cleaning the Text Files 

In the data folder, you'll notice a few different items:
* Original text files from Project Gutenberg
    * These are the original full version with no editing or cleaning.
* Trimmed version of the files separated into `Prose` and `Poetry` folders 
    * Since the ultimate goal of this project is to generate micro-stories about an image, only Poe's prose is currently being used to train the model.
    * Non-Poe text has been manually removed wherever possible, including the Project Gutenberg headers and footers, editor notes, and chapter headings and titles, so that the corpus reflects Poe's own language as closely as possible.

*Aside:* I originally wanted to acquire the texts by web scraping, but many of the sites were either incomplete or forbade scraping. Since the texts are easily and ethically accessible via Project Gutenberg, I decided to use the text files available there.

In [2]:
# function to load the data from the trimmed data files
def load_corpus(path, file_encoding=None):
    
    # make a list of all the text files in the specified directory
    text_files = [file for file in os.listdir(path) if file.endswith('.txt')]
    
    print(f'The following {len(text_files)} file(s) have been loaded:')
    for file_name in text_files:
        print(file_name)
    
    # create variable to hold the text from our combined documents
    loaded_text = ''
    
    # open each file and append its contents to our combined text
    for file_name in text_files:
        with open(os.path.join(path, file_name), encoding=file_encoding) as f:
            loaded_text += f.read()
        loaded_text += ' '
    
    # gets rid of extra spaces due to project gutenberg formatting
    loaded_text = ' '.join(loaded_text.split())
    
    # converting to ASCII to get rid of smart quotes and some special characters
    loaded_text = unidecode(loaded_text)
    
    print()
    print(f'The length of the combined documents (in characters) is: {len(loaded_text)}')
    
    return loaded_text

In [3]:
raw_text = load_corpus('./data/Poe_NLG/02_Poe_author_text_only/Prose/', 'utf-8')

The following 5 file(s) have been loaded:
CompletePoeVol3-trimmed.txt
CompletePoeVol4-trimmed.txt
CompletePoeVol1-trimmed.txt
CompletePoeVol5-prose-trimmed.txt
CompletePoeVol2-trimmed.txt

The length of the combined documents (in characters) is: 2296101
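
To make the whitespace normalization in `load_corpus` concrete, here is a minimal sketch on a made-up, hard-wrapped snippet. The sample string is invented for illustration and is not taken from the corpus files. It uses the same `' '.join(text.split())` idiom as the function above and leaves out the `unidecode` step.

```python
# Illustrative sample only: hard-wrapped text with uneven spacing,
# similar in shape to a Project Gutenberg plain-text file.
sample = "Once upon a   midnight dreary,\nwhile I pondered,\n\n   weak and weary"

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines), so re-joining with single spaces collapses them all.
normalized = ' '.join(sample.split())

print(normalized)
```

Line breaks and repeated spaces both collapse to a single space, which is why the combined corpus ends up as one long line of text.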


--- 

### <a name="clean"></a>Section B: Cleaning the Loaded Text 

In [4]:
def clean_corpus(corpus, punc_to_keep):
    ''' Clean and prepare the loaded text for tokenization.
        Selected punctuation is kept as separate words to see if the model can learn to
        apply it correctly. Capitalization is also kept to see how the model places it.'''
    
    # creating a regex-safe string of the punctuation marks to keep
    punc_string = ''.join(re.escape(c) for c in punc_to_keep)
    
    # using regex to replace all numeric and special characters with a space
    corpus = re.sub('[^A-Za-z' + punc_string + ']+', ' ', corpus)
   
    # temporarily swapping the ASCII double dash for '&' so it stays a single token
    # and is not treated as two individual dashes when '-' is spaced out below
    corpus = corpus.replace('--', ' & ')
    
    # putting spacing around punctuation so they are treated as their own words during tokenization
    for punc in punc_to_keep:
        corpus = corpus.replace(punc, f' {punc} ')
    
    # putting our double dashes (em-dashes) back into the corpus
    corpus = corpus.replace(' & ', ' -- ')
    
    # removing any additional spaces
    corpus = re.sub(r'\s\s+', ' ', corpus)
    
    # returning our cleaned data
    return corpus

In [5]:
punc_to_keep = ['!', '?', '.', ',', '"', "'", ':', ';', '-'] # '&' cannot be included: clean_corpus uses it as a temporary placeholder for '--'
cleaned_text = clean_corpus(raw_text, punc_to_keep)
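
As a rough illustration of what the cleaning step does to a single sentence, the sketch below re-implements the same sequence of steps in a small standalone helper. The name `space_out_punctuation` and the sample sentence are hypothetical, and the helper differs slightly from `clean_corpus` in that it also strips leading and trailing whitespace.

```python
import re

def space_out_punctuation(text, punc_to_keep):
    # Hypothetical helper mirroring the steps of clean_corpus on a small scale.
    # Build a character class of the punctuation to keep, escaping each mark.
    punc_class = ''.join(re.escape(p) for p in punc_to_keep)
    # Replace every run of characters that is neither a letter nor kept punctuation.
    text = re.sub('[^A-Za-z' + punc_class + ']+', ' ', text)
    # Protect double dashes with a placeholder that cannot survive the step above.
    text = text.replace('--', ' & ')
    # Surround each kept mark with spaces so it becomes its own token.
    for p in punc_to_keep:
        text = text.replace(p, f' {p} ')
    # Restore the double dashes, then collapse repeated whitespace.
    text = text.replace(' & ', ' -- ')
    return re.sub(r'\s+', ' ', text).strip()

# Invented sample sentence, not taken from the corpus.
sample = 'It was 1849--a well-known year, he said.'
cleaned = space_out_punctuation(sample, ['!', '?', '.', ',', '"', "'", ':', ';', '-'])

print(cleaned)
print(cleaned.split())
```

In this sample the digits are dropped, the double dash survives as a single `--` token, and the hyphen in "well-known" becomes its own `-` token.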

---

### <a name="token"></a>Section C: Generating Word-Level Tokens of the Text Data and Sequences of Tokens

In [6]:
# function to separate text into word tokens and put them into sequences
def create_tokens_and_sequences(cleaned_text, sequence_input_length):
    
    # creating tokens by splitting text at spaces
    tokens = cleaned_text.split()
    print(f'Total tokens created from text: {len(tokens)}')
    print(f'Unique tokens created from text: {len(set(tokens))}')
    
    # total length of generated sequences will be the designated number of words + the next word to be predicted
    seq_total_len = sequence_input_length + 1
    
    # variable to hold our generated sequences
    sequences = []
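    # slide a window of seq_total_len tokens across the text, one token at a time
    # (i is the exclusive end index, so the window ending on the final token is not generated)
    for i in range(seq_total_len, len(tokens)):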
        selected_tokens = tokens[i - seq_total_len: i]
        sequence = ' '.join(selected_tokens)
        sequences.append(sequence)
        
    print(f'{len(sequences)} sequences were created.')
    print(f'Each sequence is {seq_total_len} word(s) in length with {sequence_input_length} word(s) preceding one output word.')
    
    return tokens, sequences, seq_total_len

In [7]:
tokens, sequences, seq_total_len = create_tokens_and_sequences(cleaned_text, sequence_input_length=100)

Total tokens created from text: 480070
Unique tokens created from text: 22467
479969 sequences were created.
Each sequence is 101 word(s) in length with 100 word(s) preceding one output word.
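
The counts above follow from the loop bounds: 480070 tokens minus a window of 101 gives 479969 sequences. A minimal sketch of the same sliding window on a toy token list makes this easier to see. The helper name `sliding_windows` and the toy tokens are illustrative only.

```python
def sliding_windows(tokens, input_length):
    # Hypothetical helper: each window holds `input_length` input words
    # plus one target word, as in create_tokens_and_sequences.
    window = input_length + 1
    # Same loop bounds as the notebook: i is the exclusive end index of each window.
    return [tokens[i - window:i] for i in range(window, len(tokens))]

toy_tokens = ['Once', 'upon', 'a', 'midnight', 'dreary', ',', 'while']
windows = sliding_windows(toy_tokens, input_length=3)

# Show each window as "input words -> target word".
for w in windows:
    print(w[:-1], '->', w[-1])

print(len(windows), 'windows from', len(toy_tokens), 'tokens')
```

With 7 tokens and a window of 4, the loop yields 3 windows, that is `len(tokens) - window`, the same relationship as in the output above.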


In [8]:
tokens[:10]

['Upon', 'my', 'return', 'to', 'the', 'United', 'States', 'a', 'few', 'months']

In [9]:
sequences[:5]

['Upon my return to the United States a few months ago , after the extraordinary series of adventure in the South Seas and elsewhere , of which an account is given in the following pages , accident threw me into the society of several gentlemen in Richmond , Va . , who felt deep interest in all matters relating to the regions I had visited , and who were constantly urging it upon me , as a duty , to give my narrative to the public . I had several reasons , however , for declining to do so , some',
 'my return to the United States a few months ago , after the extraordinary series of adventure in the South Seas and elsewhere , of which an account is given in the following pages , accident threw me into the society of several gentlemen in Richmond , Va . , who felt deep interest in all matters relating to the regions I had visited , and who were constantly urging it upon me , as a duty , to give my narrative to the public . I had several reasons , however , for declining to do so , some o

---

### <a name="save"></a>Section D: Saving the Sequences to a File for Use in Modeling

In [10]:
# save cleaned and prepped sequences
def save_sequences_to_file(sequences, filename):
    # write one sequence per line
    sequence_lines = '\n'.join(sequences)
    with open(f'./data/Poe_NLG/03_Text_files_for_models/{filename}', 'w') as file:
        file.write(sequence_lines)

save_sequences_to_file(sequences, f'cleaned_poe_tot_seq_len_{seq_total_len}.txt')
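
For reference, a modeling notebook could read the saved file back along these lines. This is only a sketch under assumptions: it writes two toy sequences to a temporary directory rather than the project's data folder, and the file name and variable names are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-ins for the notebook's sequences (3 input words + 1 target word).
toy_sequences = ['Once upon a midnight', 'upon a midnight dreary']

# Write to a temporary directory instead of the project's data folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_sequences.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(toy_sequences))

    # Read the file back: one sequence per line.
    with open(path, encoding='utf-8') as f:
        loaded = f.read().split('\n')

# Split each sequence into its input words and the final word to be predicted.
pairs = [(line.split()[:-1], line.split()[-1]) for line in loaded]
print(pairs)
```

Because each sequence is stored as a single space-joined line, splitting on newlines and then on spaces recovers the original tokens.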