<a href="https://colab.research.google.com/github/kevinrchilders/nlp/blob/master/space_and_punctuation_adder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
# Upload the text from "Warbreaker" by Brandon Sanderson

from google.colab import files
uploaded = files.upload()

Saving warbreaker-pro-5.txt to warbreaker-pro-5 (1).txt


In [3]:
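# Read the uploaded file; UTF-8 is assumed here because the text contains curly quotes and em dashes
with open('warbreaker-pro-5.txt', encoding='utf-8') as f:
  text = f.read()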

In [4]:
print(text[:1000])

It’s funny, Vasher thought, how many things begin with my getting thrown into prison.

The guards laughed to one another, slamming the cell door shut with a clang. Vasher stood and dusted himself off, rolling his shoulder and wincing. While the bottom half of his cell door was solid wood, the top half was barred, and he could see the three guards open his large duffel and rifle through his possessions.

One of them noticed him watching. The guard was an oversized beast of a man with a shaved head and a dirty uniform that barely retained the bright yellow and blue coloring of the T’Telir city guard.

Bright colors, Vasher thought. I’ll have to get used to those again. In any other nation, the vibrant blues and yellows would have been ridiculous on soldiers. This, however, was Hallandren: land of Returned gods, Lifeless servants, BioChromatic research, and—of course—color.

The large guard sauntered up to the cell door, leaving his friends to amuse themselves with Vasher’s belongings. “T

In [5]:
# A function to strip a string of capitalization, whitespace, and punctuation
# The goal is to train a model to reverse this process

import re

def strip_string(s):
  return (re.sub(r'\W', r'', s)).lower()

In [6]:
print(strip_string(text[:1000]))

itsfunnyvasherthoughthowmanythingsbeginwithmygettingthrownintoprisontheguardslaughedtooneanotherslammingthecelldoorshutwithaclangvasherstoodanddustedhimselfoffrollinghisshoulderandwincingwhilethebottomhalfofhiscelldoorwassolidwoodthetophalfwasbarredandhecouldseethethreeguardsopenhislargeduffelandriflethroughhispossessionsoneofthemnoticedhimwatchingtheguardwasanoversizedbeastofamanwithashavedheadandadirtyuniformthatbarelyretainedthebrightyellowandbluecoloringofthettelircityguardbrightcolorsvasherthoughtillhavetogetusedtothoseagaininanyothernationthevibrantbluesandyellowswouldhavebeenridiculousonsoldiersthishoweverwashallandrenlandofreturnedgodslifelessservantsbiochromaticresearchandofcoursecolorthelargeguardsauntereduptothecelldoorleavinghisfriendstoamusethemselveswithvashersbelongingst
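
One detail of `strip_string` is worth checking: in Python's `re` module, `\W` matches anything that is not a word character, and word characters include digits and the underscore. The sketch below uses made-up sample strings (not taken from the book) to show that punctuation, whitespace, and curly quotes are removed, while digits and underscores survive.

```python
import re

def strip_string(s):
  # Same definition as in the cell above
  return (re.sub(r'\W', r'', s)).lower()

# Hypothetical sample strings, chosen only for illustration
samples = ["It’s funny, Vasher thought.", "Chapter 12", "snake_case"]
stripped = [strip_string(s) for s in samples]
for s, t in zip(samples, stripped):
  print(repr(s), '->', repr(t))
```

If digits or underscores should also be stripped, a pattern such as `[\W\d_]` would do that; whether it matters depends on the text being processed.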


In [7]:
# Create a 'vocabulary' of the characters used, together with lookup tables for converting
# between characters and indices (a dict for char -> index, an array for index -> char)

vocab = sorted(set(text))
print("Number of distinct characters: {}".format(len(vocab)))
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
print(char2idx)

Number of distinct characters: 65
{'\n': 0, ' ': 1, '!': 2, ',': 3, '-': 4, '.': 5, ':': 6, ';': 7, '?': 8, 'A': 9, 'B': 10, 'C': 11, 'D': 12, 'E': 13, 'F': 14, 'G': 15, 'H': 16, 'I': 17, 'J': 18, 'K': 19, 'L': 20, 'M': 21, 'N': 22, 'O': 23, 'P': 24, 'Q': 25, 'R': 26, 'S': 27, 'T': 28, 'U': 29, 'V': 30, 'W': 31, 'X': 32, 'Y': 33, 'a': 34, 'b': 35, 'c': 36, 'd': 37, 'e': 38, 'f': 39, 'g': 40, 'h': 41, 'i': 42, 'j': 43, 'k': 44, 'l': 45, 'm': 46, 'n': 47, 'o': 48, 'p': 49, 'q': 50, 'r': 51, 's': 52, 't': 53, 'u': 54, 'v': 55, 'w': 56, 'x': 57, 'y': 58, 'z': 59, '—': 60, '‘': 61, '’': 62, '“': 63, '”': 64}


In [8]:
print(text[:50])
print([char2idx[c] for c in text[:50]])

It’s funny, Vasher thought, how many things begin 
[17, 53, 62, 52, 1, 39, 54, 47, 47, 58, 3, 1, 30, 34, 52, 41, 38, 51, 1, 53, 41, 48, 54, 40, 41, 53, 3, 1, 41, 48, 56, 1, 46, 34, 47, 58, 1, 53, 41, 42, 47, 40, 52, 1, 35, 38, 40, 42, 47, 1]
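
Going the other way, from indices back to text, uses `idx2char`. The sketch below rebuilds the lookup tables on a short sample so that it runs on its own. A plain list stands in for the NumPy array used in the notebook; that substitution is an assumption made only to keep the example dependency-free.

```python
sample = "It’s funny, Vasher thought"  # first characters of the text

vocab = sorted(set(sample))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = vocab  # the notebook uses np.array(vocab); a list indexes the same way for single indices

# Encode to indices, then decode back to a string
encoded = [char2idx[c] for c in sample]
decoded = ''.join(idx2char[i] for i in encoded)
print(decoded == sample)
```

Because this toy vocabulary is built from the sample alone, its indices differ from the ones printed above; only the round trip is the point here.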


In [64]:
# Split the text into non-overlapping sequences of length SEQUENCE_LENGTH

SEQUENCE_LENGTH = 100

sequences = []
for i in range(len(text) // SEQUENCE_LENGTH):
  sequences.append(text[i*SEQUENCE_LENGTH:i*SEQUENCE_LENGTH+SEQUENCE_LENGTH])
stripped_sequences = [strip_string(s) for s in sequences]
for i in range(5):
  print([sequences[i], stripped_sequences[i]])

['It’s funny, Vasher thought, how many things begin with my getting thrown into prison.\n\nThe guards la', 'itsfunnyvasherthoughthowmanythingsbeginwithmygettingthrownintoprisontheguardsla']
['ughed to one another, slamming the cell door shut with a clang. Vasher stood and dusted himself off,', 'ughedtooneanotherslammingthecelldoorshutwithaclangvasherstoodanddustedhimselfoff']
[' rolling his shoulder and wincing. While the bottom half of his cell door was solid wood, the top ha', 'rollinghisshoulderandwincingwhilethebottomhalfofhiscelldoorwassolidwoodthetopha']
['lf was barred, and he could see the three guards open his large duffel and rifle through his possess', 'lfwasbarredandhecouldseethethreeguardsopenhislargeduffelandriflethroughhispossess']
['ions.\n\nOne of them noticed him watching. The guard was an oversized beast of a man with a shaved hea', 'ionsoneofthemnoticedhimwatchingtheguardwasanoversizedbeastofamanwithashavedhea']
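
Note that the loop in the cell above uses integer division, so any trailing characters that do not fill a whole sequence are dropped. A small sketch with a made-up string and a short sequence length (both chosen only for illustration, and `chunk` is a hypothetical helper name):

```python
def chunk(text, sequence_length):
  # Split text into consecutive, non-overlapping pieces of exactly sequence_length characters
  return [text[i*sequence_length:(i+1)*sequence_length]
          for i in range(len(text) // sequence_length)]

toy_text = "abcdefghijk"   # 11 characters, hypothetical example
pieces = chunk(toy_text, 4)
print(pieces)               # the last 3 characters ('ijk') do not appear
```

For a long text and a sequence length of 100, the loss is at most 99 characters, so it is unlikely to matter in practice.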

In [60]:
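# Encode each sequence as a pair: [indices of the original text, indices of the stripped text]
data = [[[char2idx[c] for c in s], [char2idx[c] for c in t]] for s, t in zip(sequences, stripped_sequences)]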
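
Because stripping removes a different number of characters from each sequence, the stripped index lists in `data` generally have different lengths, so they cannot be stacked into one array as they are. One common option is to pad them to a fixed length. The sketch below shows one way to do that in plain Python. `pad_to` and `PAD_ID` are hypothetical names, and reserving an extra index for padding is an assumption, since the notebook's vocabulary has no padding character.

```python
def pad_to(ids, length, pad_id):
  # Right-pad a list of indices with pad_id, or truncate it, so that it has exactly `length` entries
  return ids[:length] + [pad_id] * max(0, length - len(ids))

VOCAB_SIZE = 65          # number of distinct characters reported above
PAD_ID = VOCAB_SIZE      # hypothetical extra index reserved for padding

toy_stripped = [[42, 53, 52], [39, 54, 47, 47, 58]]   # made-up index lists of unequal length
padded = [pad_to(ids, 5, PAD_ID) for ids in toy_stripped]
print(padded)
```

With a padding index added in this way, an embedding layer would need `VOCAB_SIZE + 1` input entries.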

In [None]:
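# Sketch only: layer sizes are placeholder assumptions, not values from the original notebook
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

model = Sequential([
  Embedding(input_dim=len(vocab), output_dim=64),
  GRU(256, return_sequences=True),
  Dense(len(vocab))  # one logit per character in the vocabulary
])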