
Neural Language Model and Spinoza's Ethics [view code]


The code is available here or by clicking on the [view code] link above.

  • Introduction
  • The Ethics
  • Imports
  • Loading Text
  • Preprocessing
  • Training
  • Model
  • Generating Sequences
  • Conclusion

In this project I will show how to build a language model for text generation using deep learning techniques. For more details on this topic and several others, see Ref. 1.

Introduction

Though natural language, in principle, has formal structure and grammar, in practice it is full of ambiguities. Modeling it from examples, rather than from hand-crafted rules, is therefore an interesting alternative. The definition of a (statistical) language model given by Ref. 2 is:

A statistical language model is a probability distribution over sequences of words. Given such a sequence it assigns a probability to the whole sequence.

Or equivalently, given a sequence of words of length m, the model assigns a probability

P(w_1, w_2, ..., w_m)

to the whole sequence. In particular, a neural language model can predict the probability of the next word in a sentence (see Ref. 3 for more details).
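
By the chain rule, this joint probability factorizes into a product of conditional next-word probabilities, which is exactly the quantity such a model learns to estimate:

P(w_1, ..., w_m) = P(w_1) P(w_2 | w_1) ... P(w_m | w_1, ..., w_{m-1})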

The use of neural networks has become one of the main approaches to language modeling. Three properties describe this neural language modeling (NLM) approach succinctly (Ref. 3):

We first associate words in the vocabulary with a distributed word feature vector, then express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence and then learn simultaneously the word feature vector and the parameters of the probability function.

In this project I used Spinoza's Ethics (Ethica, ordine geometrico demonstrata) to build a character-level NLM.

The Ethics

From Ref. 4:

Ethics, Demonstrated in Geometrical Order, usually known as the Ethics, is a philosophical treatise written by Benedict de Spinoza.

The article goes on to say that:

The book is perhaps the most ambitious attempt to apply the method of Euclid in philosophy. Spinoza puts forward a small number of definitions and axioms from which he attempts to derive hundreds of propositions and corollaries [...]

The book has the structure shown below. We see that it is set out in geometrical form, paralleling the "canonical example of a rigorous structure of argument producing unquestionable results: the example being the geometry of Euclid" (see link).

PART I. CONCERNING GOD.

DEFINITIONS.

I. By that which is self-caused, I mean that of which the essence involves existence, or that of which the nature is only conceivable as existent.

II. A thing is called finite after its kind, when it can be limited by another thing of the same nature; for instance, a body is called finite because we always conceive another greater body. So, also, a thought is limited by another thought, but a body is not limited by thought, nor a thought by body.

III. By substance, I mean that which is in itself, and is conceived through itself: in other words, that of which a conception can be formed independently of any other conception.

IV. By attribute, I mean that which the intellect perceives as constituting the essence of substance.

Imports

The following libraries were imported:

numpy 
pickle 
keras
pandas
nltk
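
A minimal sketch of the corresponding import statements (the exact submodules are an assumption, inferred from the code used later in this README; string is from the standard library and is needed by the cleaning step):

import string
from pickle import dump, load
from numpy import array
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
import pandas as pd  # listed above, though not used in the snippets shown below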

Loading Text

We first write a function to load texts. The steps of the function below are:

  • Opens the file 'ethics.txt'
  • Reads it into a string
  • Closes it
def load_txt(file):
    # open the file, read its full contents into a single string, and close it
    f = open(file, 'r')
    text = f.read()
    f.close()
    return text
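
Calling it on the file mentioned above gives us the raw text that the rest of the pipeline works with:

raw = load_txt('ethics.txt')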

Preprocessing

The first step is tokenization; the tokens are what we will train the model on. The other cleaning actions are:

  • Exclude stopwords (common words that add little meaning, such as "I" and "am")
  • Remove punctuation and spaces
  • Convert the text to lower case
  • Split the text into words (on white space)
  • Eliminate '--', quotation marks, numbers, and brackets
  • Drop non-alphabetic tokens
  • Stemming

The following function accomplishes these steps (note that the stemmer and the stopword list are instantiated but, as written, not applied to the tokens):

def cleaner(text):
    # tools for stemming and stopword removal (instantiated here, not applied below)
    stemmer = PorterStemmer()
    stop = stopwords.words('english')
    # drop brackets and long dashes before splitting on white space
    text = text.replace('[', ' ').replace(']', ' ').replace('--', ' ')
    tokens = text.split()
    # strip punctuation from every token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # keep alphabetic tokens only and lower-case them
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word.lower() for word in tokens]
    return tokens

After cleaning, we join the tokens back into a single string:

tokens = cleaner(raw)
raw = ' '.join(tokens)

The next step is building overlapping character sequences of length n+1 (I chose n=20, so each sequence holds 20 input characters plus one target character) and saving them:

n = 20
sequences = list()
for i in range(n, len(raw)):
    # each sequence is a window of n input characters plus the next character
    sequences.append(raw[i-n:i+1])

The following function saves the prepared sequences:

def save_txt(sequences, file):
    f = open(file, 'w')
    f.write('\n'.join(sequences))
    f.close()

out = 'ethics_sequences.txt'
save_txt(sequences, out)

Training

We load the sequences and encode each character as an integer:

raw = load_txt('ethics_sequences.txt')
seqs = raw.split('\n')
# map each distinct character in the file to an integer index
unique_chars = sorted(list(set(raw)))
char_int_map = dict((a, b) for b, a in enumerate(unique_chars))

encoded_sequences = list()
for seq in seqs:
    encoded_sequences.append([char_int_map[char] for char in seq])

Printing out two consecutive sequences and their encoded forms:

part i concerning god
[17, 2, 19, 21, 1, 10, 1, 4, 16, 15, 4, 6, 19, 15, 10, 15, 8, 1, 8, 16, 5]
art i concerning god 
[2, 19, 21, 1, 10, 1, 4, 16, 15, 4, 6, 19, 15, 10, 15, 8, 1, 8, 16, 5, 1]

Next we build an array from the encoded sequences, define our X and y, and one-hot encode them:

encoded_sequences = array(encoded_sequences)
X,y = encoded_sequences[:,:-1], encoded_sequences[:,-1]
sequences = [to_categorical(x, num_classes=len(char_int_map)) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=len(char_int_map))
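
As a quick sanity check, the resulting shapes are (here N stands for the number of sequences and V = len(char_int_map) for the vocabulary size):

print(X.shape, y.shape)
# X: (N, 20, V) -- 20 one-hot encoded input characters per sample
# y: (N, V)     -- one one-hot encoded target character per sample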

Model

def define_model(X):
    # the output layer needs one unit per character in the vocabulary,
    # which is the size of the one-hot vectors, i.e. X.shape[2]
    vocab_size = X.shape[2]
    model = Sequential()
    model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

model = define_model(X)

We fit it and save it:

history = model.fit(X, y, epochs=30, verbose=2)
loss = history.history['loss']
model.save('model.h5')
dump(char_int_map, open('char_int_map.pkl', 'wb'))

Generating Sequences

from pickle import load
from numpy import array
from keras.models import load_model
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

def gen_seq(model, char_int_map, n_seq, test_seq, size_gen):
    num_classes = len(char_int_map)
    txt = test_seq
    print(txt)
    # generate a fixed number of characters
    for i in range(size_gen):
        # encode the current text, keeping only the last n_seq characters
        encoded = pad_sequences([[char_int_map[c] for c in txt]],
                                maxlen=n_seq, truncating='pre')
        encoded = to_categorical(encoded, num_classes=num_classes)
        # predict the integer index of the next character
        ypred = model.predict_classes(encoded)[0]
        # map the predicted index back to a character
        int_to_char = ''
        for c, idx in char_int_map.items():
            if idx == ypred:
                int_to_char = c
                break
        # append the predicted character to the input
        txt += int_to_char
    return txt

Loading the model and the dictionary
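
A minimal sketch of this step (the file names are the ones used when saving the model and the dictionary above):

model = load_model('model.h5')
char_int_map = load(open('char_int_map.pkl', 'rb'))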

Testing the model:

print(gen_seq(model, char_int_map, 20, 'that which is self caused', 40))
print(gen_seq(model, char_int_map, 20, 'nature for instance a body', 40))

Conclusion

To be finished.
