# Preprocessing data

In [1]:
# read the whole text; the with-block closes the file afterwards
with open('anna.txt', 'r') as anna:
    txt = anna.read()

In [2]:
txt[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

### Character tokenizer

We encode characters as integers. In Keras this can be done with the Tokenizer class.

In [3]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


- By default, the Tokenizer class splits the text into words rather than individual characters. This can be changed by setting char_level = True.

- The default tokenizer converts all letters to lower case. This can be changed by setting lower = False.

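- The default tokenizer ignores punctuation, tabs, line breaks, etc. This can be changed by passing an explicit string of characters to the keyword argument 'filters'. Note that the filters seem to be applied only when the text is split into words; with char_level = True they appear to have no effect, which is consistent with the 83 distinct characters counted further down.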
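The difference between word-level and character-level tokenization can be illustrated without Keras. The following is a rough pure-Python sketch, not the Keras implementation: the helper names word_tokens and char_tokens are made up for this illustration, and the default filter string is assumed to mirror the Keras default.

```python
def word_tokens(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True):
    """Split text into words after replacing every filtered character with a space."""
    if lower:
        text = text.lower()
    # map each filtered character to a space so that it acts as a separator
    table = str.maketrans(filters, ' ' * len(filters))
    return text.translate(table).split()


def char_tokens(text, lower=True):
    """Treat every character, including spaces and punctuation, as a token."""
    if lower:
        text = text.lower()
    return list(text)


sample = 'Happy families are all alike; every unhappy family...'
print(word_tokens(sample))
print(char_tokens(sample, lower=False)[:5])
```

The word-level helper drops punctuation and case, while the character-level helper keeps every symbol as its own token.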



In [4]:
# keep all the default filter characters except line breaks, question marks, full stops and exclamation marks
fltr = '"#$%&()*+,-/:;<=>@[\\]^_`{|}~\t'

# create a tokenizer instance
# we also want the tokenizer to be case sensitive
tokenizer = Tokenizer(filters = fltr, lower = False, char_level = True)

The Tokenizer should be thought of as analogous to the data transformers in sklearn: we first fit it to our training text and then use the fitted tokenizer to transform any given text.
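To make the fit-then-transform pattern concrete, here is a toy character tokenizer in plain Python. It is only a sketch of the idea, not how Keras implements it: the class name CharTokenizer and its attributes are hypothetical. Indexing characters by descending frequency, starting at 1, is an assumption chosen to resemble the Keras behaviour.

```python
from collections import Counter


class CharTokenizer:
    """Toy character tokenizer with a fit/transform interface (hypothetical, not Keras)."""

    def fit_on_texts(self, texts):
        counts = Counter()
        for text in texts:
            counts.update(text)
        # the most frequent character gets index 1; index 0 is left unused
        self.char_index = {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}
        self.index_char = {i: ch for ch, i in self.char_index.items()}
        return self

    def texts_to_sequences(self, texts):
        # characters never seen during fitting are skipped
        return [[self.char_index[ch] for ch in text if ch in self.char_index] for text in texts]

    def sequences_to_texts(self, sequences):
        return [''.join(self.index_char[i] for i in seq) for seq in sequences]


toy = CharTokenizer().fit_on_texts(['aab', 'abc'])   # learn the vocabulary
ids = toy.texts_to_sequences(['cab', 'abz'])         # transform new texts
print(ids)
print(toy.sequences_to_texts(ids))
```

The vocabulary is learned once during fitting, so a character that only appears in later text (the 'z' here) has no index and is dropped.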

In [5]:
# fit the tokenizer
# we can fit it on a list of multiple different texts
tokenizer.fit_on_texts([txt])

In [6]:
# checking that the tokenizer indeed produces a sensible output
sample = 'How are you doing?'

tokenizer.texts_to_sequences([sample])

[[37, 5, 15, 1, 4, 10, 2, 1, 19, 5, 14, 1, 11, 5, 8, 6, 18, 35]]

In [7]:
# converting a sequence back to text
tokenizer.sequences_to_texts([[37, 5, 15, 1, 4, 10, 2, 1, 19, 5, 14, 1, 11, 5, 8, 6, 18, 35]])

['H o w   a r e   y o u   d o i n g ?']
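sequences_to_texts joins the decoded tokens with spaces, which is why every character above is separated by a blank. To recover the original string, the characters can be joined directly. The sketch below uses a hand-written index-to-character mapping read off the two outputs above. With a fitted tokenizer the same lookup would presumably go through its index_word attribute, assuming the installed Keras version provides it.

```python
# index -> character pairs read off the encoded and decoded sample above
index_char = {37: 'H', 5: 'o', 15: 'w', 1: ' ', 4: 'a', 10: 'r', 2: 'e',
              19: 'y', 14: 'u', 11: 'd', 8: 'i', 6: 'n', 18: 'g', 35: '?'}

seq = [37, 5, 15, 1, 4, 10, 2, 1, 19, 5, 14, 1, 11, 5, 8, 6, 18, 35]

# joining with the empty string avoids the extra spaces
decoded = ''.join(index_char[i] for i in seq)
print(decoded)
```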

In [13]:
# number of distinct characters in the text
print('Number of distinct characters in the text: {}'.format(len(tokenizer.word_index)))

Number of distinct characters in the text: 83


In [27]:
# total number of characters in the text
# the attribute .word_counts holds the number of times each character/word appears in the text
# it is an ordered dictionary with the characters as its keys and their counts as the corresponding values
character_count = tokenizer.word_counts

# the counts can be folded into a single total with reduce (the built-in sum() would work equally well)
# e.g. see here: https://book.pythontips.com/en/latest/map_filter.html
from functools import reduce

total_chars = reduce(lambda x, y: x+y , character_count.values())
print('total number of characters in the text: {}'.format(total_chars))

total number of characters in the text: 1985223
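As a cross-check that does not depend on the tokenizer, the same kind of count can be made with collections.Counter from the standard library. The sketch below runs on a short stand-in string rather than on anna.txt, and its variable names are illustrative only.

```python
from collections import Counter

sample_text = 'Happy families are all alike'

char_counts = Counter(sample_text)         # character -> number of occurrences
sample_total = sum(char_counts.values())   # adds up all the per-character counts

print(len(char_counts))   # number of distinct characters
print(sample_total)       # equals len(sample_text), since every character is counted
```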


In [34]:
import numpy as np

In [37]:
# converting the whole text to a sequence of integers using tokenizer
encoded = np.array(tokenizer.texts_to_sequences([txt]))
encoded

array([[53,  7,  4, ...,  9, 24, 13]])

In [38]:
encoded.shape

(1, 1985223)
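The leading dimension of size 1 is there because a list containing a single text was passed to texts_to_sequences. If a flat sequence is more convenient downstream, the first row can be selected. A small sketch, using the encoded sample sentence from earlier as stand-in data:

```python
import numpy as np

# stand-in for the full encoded text: one text, hence one row
encoded_sample = np.array([[37, 5, 15, 1, 4, 10, 2, 1, 19, 5, 14, 1, 11, 5, 8, 6, 18, 35]])
print(encoded_sample.shape)

# pick the single row to get a 1-D array of character ids
flat = encoded_sample[0]
print(flat.shape)
```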