# TV Script Generation - Preprocess Data
This notebook preprocesses Simpsons TV script data, converting it into a form better suited as input to a recurrent neural network. The network will generate a new TV script for a scene at [Moe's Tavern](https://simpsonswiki.com/wiki/Moe's_Tavern).

## Get the Data
The data consists only of the scenes in Moe's Tavern.  It does not include other versions of the tavern, such as "Moe's Cavern", "Flaming Moe's", or "Uncle Moe's Family Feed-Bag".

In [1]:
import simpsons.helper as helper

data_dir = './data/simpsons/moes_tavern_lines.txt'
text = helper.load_data(data_dir)

# Skip the first 81 characters (the Twentieth Century Fox header), since it isn't used when analysing the data
text = text[81:]

## Explore the Data
Play around with `view_sentence_range` to view different parts of the data.

In [2]:
view_sentence_range = (0, 10)

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

# Equivalent count using a set, which is the more idiomatic way to collect unique words
print('Roughly the number of unique words - 2: {}'.format(len(set(text.split()))))

scenes = text.split('\n\n')
print('Number of scenes: {}'.format(len(scenes)))
sentence_count_scene = [scene.count('\n') for scene in scenes]
print('Average number of sentences in each scene: {}'.format(np.average(sentence_count_scene)))

sentences = [sentence for scene in scenes for sentence in scene.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

print()
print('The sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))

Dataset Stats
Roughly the number of unique words: 11492
Roughly the number of unique words - 2: 11492
Number of scenes: 262
Average number of sentences in each scene: 15.251908396946565
Number of lines: 4258
Average number of words in each line: 11.50164396430249

The sentences 0 to 10:

Moe_Szyslak: (INTO PHONE) Moe's Tavern. Where the elite meet to drink.
Bart_Simpson: Eh, yeah, hello, is Mike there? Last name, Rotch.
Moe_Szyslak: (INTO PHONE) Hold on, I'll check. (TO BARFLIES) Mike Rotch. Mike Rotch. Hey, has anybody seen Mike Rotch, lately?
Moe_Szyslak: (INTO PHONE) Listen you little puke. One of these days I'm gonna catch you, and I'm gonna carve my name on your back with an ice pick.
Moe_Szyslak: What's the matter Homer? You're not your normal effervescent self.
Homer_Simpson: I got my problems, Moe. Give me another one.
Moe_Szyslak: Homer, hey, you should not drink to forget your problems.
Barney_Gumble: Yeah, you should only drink to enhance your social skills.



## Preprocessing Functions

### Lookup Table
- Dictionary to go from the words to an id, called `vocab_to_int`
- Dictionary to go from the id to word, called `int_to_vocab`

The function returns these dictionaries as the tuple `(vocab_to_int, int_to_vocab)`.

In [3]:
import numpy as np

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of TV scripts, already split into a list of words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    vocab_to_int = {word: i for i, word in enumerate(set(text))}
    int_to_vocab = {i: word for word, i in vocab_to_int.items()}

    return vocab_to_int, int_to_vocab
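
A minimal sketch of how the two dictionaries work together, assuming a short made-up word list rather than the real script. The function is repeated here only so the example runs on its own; `sample_words`, `encoded` and `decoded` are illustrative names that do not appear elsewhere in the notebook.

```python
# Copy of create_lookup_tables so this sketch is self-contained.
def create_lookup_tables(words):
    vocab_to_int = {word: i for i, word in enumerate(set(words))}
    int_to_vocab = {i: word for word, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

# Made-up sample input (assumption): a list of already-split words.
sample_words = "moe's tavern where the elite meet to drink the".split()
vocab_to_int, int_to_vocab = create_lookup_tables(sample_words)

# Encode words as ids, then decode the ids back to words.
encoded = [vocab_to_int[word] for word in sample_words]
decoded = [int_to_vocab[i] for i in encoded]

print(len(vocab_to_int))        # number of unique words in the sample
print(decoded == sample_words)  # the round trip should reproduce the input
```

Because the ids come from iterating over a `set`, the particular id assigned to a word can differ between Python sessions. This should not matter as long as both dictionaries are saved and loaded together; if stable ids were needed, enumerating `sorted(set(words))` would be one option.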

### Tokenize Punctuation
Replace punctuation with dummy words so that the neural network does not treat "bye" and "bye!" as two unrelated words: it sees the word "bye" followed by a separate exclamation-mark token.  The dummy words all start and end with "||" to differentiate them from normal words.  For example, "!" will be tokenized into "||exclamationmark||".

Later in the processing, spaces are added around these dummy words, which completes the transformation of punctuation into words.

The following punctuation is converted:
- Period ( . )
- Comma ( , )
- Quotation Mark ( " )
- Semicolon ( ; )
- Exclamation mark ( ! )
- Question mark ( ? )
- Left Parenthesis ( ( )
- Right Parenthesis ( ) )
- Dash ( -- )
- Return ( \n )

In [4]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenization dictionary where the key is the punctuation and the value is the token
    """
    punc_dict = {'.': '||period||',
                 ',': '||comma||',
                 '"': '||quotationmark||',
                 ';': '||semicolon||',
                 '!': '||exclamationmark||',
                 '?': '||questionmark||',
                 '(': '||leftparen||',
                 ')': '||rightparen||',
                 '--': '||dash||',
                 '\n': '||return||'}
    
    return punc_dict
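
The sketch below shows one way the punctuation dictionary could be applied to raw text, with spaces added around each token before splitting. It is an illustration under stated assumptions, not the code in `simpsons.helper`: the `tokenize` function is hypothetical, the lowercasing step is an assumption, and only a subset of the dictionary is reproduced to keep the example short.

```python
# Subset of the mapping returned by token_lookup(), repeated so the sketch runs on its own.
token_dict = {'.': '||period||',
              ',': '||comma||',
              '!': '||exclamationmark||',
              '?': '||questionmark||',
              '\n': '||return||'}

def tokenize(text, token_dict):
    """Hypothetical helper: turn punctuation into separate dummy words."""
    for symbol, token in token_dict.items():
        # Surround the token with spaces so it splits off as its own word.
        text = text.replace(symbol, ' {} '.format(token))
    # Lowercasing is an assumption made for this sketch.
    return text.lower().split()

tokens = tokenize('Bye! See you, Moe.', token_dict)
print(tokens)
# ['bye', '||exclamationmark||', 'see', 'you', '||comma||', 'moe', '||period||']
```

With this approach "Bye!" and "bye" both map to the single word `bye`, and the exclamation mark becomes its own vocabulary entry.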

## Preprocess all the data and save it
The processed data is saved to a pickle file.

The pickle file contains a tuple with the following elements:

`int_text, vocab_to_int, int_to_vocab, token_dict`

Where:
- `int_text` is the textual data, with each word replaced by its integer id
- `vocab_to_int` is the dictionary mapping words to integers
- `int_to_vocab` is the dictionary mapping integers to words
- `token_dict` is the dictionary mapping punctuation to tokenized punctuation

In [5]:
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)
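
As a rough sketch of the saved format, the example below pickles a four-element tuple with the layout described above and reads it back. The toy values and the file name `preprocess_example.p` are assumptions made for illustration; the path that `helper.preprocess_and_save_data` actually writes to is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Toy stand-ins (assumptions) for the four objects in the tuple.
token_dict = {'.': '||period||'}
vocab_to_int = {'moe': 0, '||period||': 1}
int_to_vocab = {0: 'moe', 1: '||period||'}
int_text = [0, 1]

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'preprocess_example.p')

# Save the tuple in the documented order.
with open(path, 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)

# Load it back and unpack in the same order.
with open(path, 'rb') as f:
    loaded_int_text, loaded_vocab_to_int, loaded_int_to_vocab, loaded_token_dict = pickle.load(f)

# Map the integer ids back to words to check the round trip.
words = [loaded_int_to_vocab[i] for i in loaded_int_text]
print(words)  # ['moe', '||period||']
```

Unpacking in the same order used when saving is what keeps the ids in `int_text` consistent with the two lookup dictionaries.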