https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html


We will try to classify posts from 20 different newsgroups into their original 20 categories.


## Approach
- Convert all text samples into sequences of word indices. Only the 20,000 most frequent words in the dataset are considered, and sequences are truncated to a maximum length of 1,000 words.
- Prepare an "embedding matrix" that contains, at index `i`, the embedding vector for the word with index `i` in our word index.
- Load this embedding matrix into a Keras `Embedding` layer, set to be frozen (its weights, the embedding vectors, will not be updated during training).
- Build a 1D convolutional neural network on top of it, ending in a softmax output over our 20 categories.
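
As a rough illustration of the first step, the sketch below builds a frequency-ranked word index on a toy corpus and then pads or truncates each sequence to a fixed length. It is pure Python and only mimics what `Tokenizer` and `pad_sequences` do later in this notebook. The helper names `build_word_index`, `to_sequence` and `pad` are made up for this example, and the tokenization is a plain lowercase whitespace split (Keras also strips punctuation).

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most common word (0 is kept for padding)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # Ties are broken alphabetically here, purely to keep the toy output stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below num_words; rarer words are dropped."""
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < num_words]

def pad(seq, maxlen):
    """Left-pad with zeros and keep the last maxlen items (the pad_sequences defaults)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_texts = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(toy_texts)
sequences = [to_sequence(t, word_index, num_words=4) for t in toy_texts]
padded = [pad(s, maxlen=4) for s in sequences]
print(word_index)
print(padded)  # [[0, 1, 3, 2], [0, 1, 2, 1]]
```

With `num_words=4` only the three most frequent words survive, which is the same effect `MAX_NB_WORDS` has on the real dataset.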

In [1]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Embedding, Conv1D, MaxPooling1D
from keras.models import Model
from keras.utils import to_categorical
import numpy as np
import sys
import os

Using TensorFlow backend.


In [2]:
BASE_DIR = ''
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')
TEXT_DATA_DIR = os.path.join(BASE_DIR, '20_newsgroup')
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

# Preparing the text data

In [10]:
texts = []
labels_index = {}
labels = []

for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                # the context manager closes the file even if reading fails
                with open(fpath, encoding='latin-1') as f:
                    t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                labels.append(label_id)
                
print('Found {} texts'.format(len(texts)))

Found 19997 texts
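
The header-skipping step in the loop above can be isolated as a small function. This is a minimal sketch: `strip_header` is a hypothetical helper name and the sample post is invented. It keeps the same `0 < i` condition as the loop, so the leading blank line stays in the result.

```python
def strip_header(raw):
    """Drop everything before the first blank line, as the loading loop does."""
    i = raw.find("\n\n")
    # Same condition as in the loop: strip only when a blank line is found
    # at a position greater than 0.
    return raw[i:] if i > 0 else raw

# Invented sample post, not taken from the dataset.
post = "From: someone@example.com\nSubject: hello\n\nBody of the post.\n"
body = strip_header(post)
print(repr(body))  # '\n\nBody of the post.\n'
```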


In [11]:
texts[0:2]

['\n\nArchive-name: atheism/resources\nAlt-atheism-archive-name: resources\nLast-modified: 11 December 1992\nVersion: 1.0\n\n                              Atheist Resources\n\n                      Addresses of Atheist Organizations\n\n                                     USA\n\nFREEDOM FROM RELIGION FOUNDATION\n\nDarwin fish bumper stickers and assorted other atheist paraphernalia are\navailable from the Freedom From Religion Foundation in the US.\n\nWrite to:  FFRF, P.O. Box 750, Madison, WI 53701.\nTelephone: (608) 256-8900\n\nEVOLUTION DESIGNS\n\nEvolution Designs sell the "Darwin fish".  It\'s a fish symbol, like the ones\nChristians stick on their cars, but with feet and the word "Darwin" written\ninside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.\n\nWrite to:  Evolution Designs, 7119 Laurel Canyon #4, North Hollywood,\n           CA 91605.\n\nPeople in the San Francisco Bay area can get Darwin Fish from Lynn Gold --\ntry mailing <figmo@netcom.com>.  For net

In [12]:
labels[0]  # category id of the first text

0

In [13]:
labels_index

{'alt.atheism': 0,
 'comp.graphics': 1,
 'comp.os.ms-windows.misc': 2,
 'comp.sys.ibm.pc.hardware': 3,
 'comp.sys.mac.hardware': 4,
 'comp.windows.x': 5,
 'misc.forsale': 6,
 'rec.autos': 7,
 'rec.motorcycles': 8,
 'rec.sport.baseball': 9,
 'rec.sport.hockey': 10,
 'sci.crypt': 11,
 'sci.electronics': 12,
 'sci.med': 13,
 'sci.space': 14,
 'soc.religion.christian': 15,
 'talk.politics.guns': 16,
 'talk.politics.mideast': 17,
 'talk.politics.misc': 18,
 'talk.religion.misc': 19}

### Format text samples and labels into tensors

In [14]:
tokenizer = Tokenizer(num_words=MAX_NB_WORDS) # 20,000 words
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print("Found {} unique tokens.".format(len(word_index)))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH) # pad inputs

print('before', type(labels), len(labels))
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]


Found 174074 unique tokens.
before <class 'list'> 19997
Shape of data tensor: (19997, 1000)
Shape of label tensor: (19997, 20)
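
`to_categorical` turns each integer class id into a one-hot row, which is why the label tensor has one column per category. The sketch below reproduces that idea in plain Python on three toy labels. `one_hot_labels` is a name made up for this example, not a Keras function.

```python
def one_hot_labels(label_ids, num_classes):
    """Return one row per label, with a single 1.0 at the position of the class id."""
    rows = []
    for label_id in label_ids:
        row = [0.0] * num_classes
        row[label_id] = 1.0
        rows.append(row)
    return rows

toy_labels = [0, 2, 1]
encoded = one_hot_labels(toy_labels, num_classes=3)
for row in encoded:
    print(row)
```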


In [8]:
nb_validation_samples

3999

In [9]:
# 19997 texts
print(len(texts))
print(x_train.shape)
print(x_train.shape[0], x_val.shape[0], x_train.shape[0]+x_val.shape[0])

19997
(15998, 1000)
15998 3999 19997
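
The sizes above follow directly from the split arithmetic: `int(0.2 * 19997)` is 3999, and the remaining 15998 samples are used for training. A minimal sketch of the same shuffle-then-slice logic, using only the standard library (`shuffled_split` and the fixed seed are assumptions of this example):

```python
import random

def shuffled_split(num_samples, validation_split, seed=0):
    """Shuffle sample indices, then hold out the last fraction for validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    nb_val = int(validation_split * num_samples)
    # Note: this slicing assumes nb_val > 0; with nb_val == 0, [:-0] would be empty.
    return indices[:-nb_val], indices[-nb_val:]

train_idx, val_idx = shuffled_split(19997, 0.2)
print(len(train_idx), len(val_idx))  # 15998 3999
```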


In [10]:
x_train[0]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,

# Preparing the Embedding layer

In [11]:
embeddings_index = {}
# We will use the 100-dimensional GloVe embeddings of 400k words
# as pre-trained weights for the embedding layer.
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Print the number of vectors, not the dictionary itself.
print('Found {} word vectors.'.format(len(embeddings_index)))

Found 400000 word vectors.



In [12]:
len(embeddings_index.keys())

400000
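
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses one such line in plain Python. The sample line is invented and only 3-dimensional to keep it short, and `parse_glove_line` is a hypothetical helper name.

```python
def parse_glove_line(line):
    """Split a 'word v1 v2 ...' line into the word and a list of floats."""
    values = line.split()
    return values[0], [float(v) for v in values[1:]]

# Invented sample line; real glove.6B.100d.txt lines carry 100 numbers.
sample_line = "cat 0.25 -0.5 1.0\n"
word, vector = parse_glove_line(sample_line)
print(word, vector)  # cat [0.25, -0.5, 1.0]
```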

#### Compute the embedding matrix

In [13]:
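# Index 0 is reserved for padding, so the matrix needs len(word_index) + 1 rows
# when the vocabulary is smaller than MAX_NB_WORDS.
num_words = min(MAX_NB_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= num_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index stay all-zeros
        embedding_matrix[i] = embedding_vector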

In [14]:
embedding_matrix.shape

(20000, 100)

In [15]:
print(num_words, MAX_NB_WORDS, len(word_index))

20000 20000 174074
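
To make the matrix construction concrete, here is a toy version in plain Python lists. The vocabulary and vectors are invented, and `build_embedding_matrix` is a name made up for this sketch. Row 0 is left for the padding index, and a word with no pre-trained vector keeps an all-zero row.

```python
def build_embedding_matrix(word_index, embeddings_index, max_words, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    # +1 because word indices start at 1 and row 0 is the padding row.
    num_words = min(max_words, len(word_index) + 1)
    matrix = [[0.0] * dim for _ in range(num_words)]
    for word, i in word_index.items():
        if i >= num_words:
            continue
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix

# Invented toy data: "zzyzx" has no pre-trained vector.
toy_word_index = {"the": 1, "cat": 2, "zzyzx": 3}
toy_embeddings = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
matrix = build_embedding_matrix(toy_word_index, toy_embeddings, max_words=10, dim=2)
for row in matrix:
    print(row)
```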


In [16]:
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],  # initialize with the GloVe vectors
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)  # frozen: not updated during training
# The embedding layer maps integer word indices to their vectors,
# e.g. [1, 2] is converted to [embedding[1], embedding[2]].
# The output is a 3D tensor of shape (samples, sequence_length, embedding_dim).
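
The lookup described in the comments can be mimicked with nested lists. This is only a sketch of the idea, not how Keras implements the layer; `embed` and the toy matrix are assumptions of this example.

```python
def embed(batch, matrix):
    """Replace every word index in the batch by its row of the embedding matrix."""
    return [[matrix[i] for i in seq] for seq in batch]

toy_matrix = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]  # 3 words, embedding_dim = 2
batch = [[0, 1, 2], [2, 2, 1]]                     # 2 samples, sequence_length = 3
out = embed(batch, toy_matrix)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)  # (2, 3, 2), i.e. (samples, sequence_length, embedding_dim)
```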


In [18]:
# training a 1D convnet
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequence = embedding_layer(sequence_input)

x = Conv1D(128, 5, activation='relu')(embedded_sequence)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x) # global max pooling
x = Dense(128, activation='relu')(x)

preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train, validation_data=(x_val, y_val),
         epochs=2,
         batch_size=128)


Train on 15998 samples, validate on 3999 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x181cb8e0f0>
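
It can help to trace how the sequence length shrinks through the convolutional stack above. The sketch below does the arithmetic, assuming the Keras defaults used in the model: `'valid'` padding and stride 1 for `Conv1D`, and a stride equal to the pool size for `MaxPooling1D`. The helper names are made up for this example.

```python
def conv1d_len(n, kernel_size):
    """Output length of a 'valid' convolution with stride 1."""
    return n - kernel_size + 1

def maxpool1d_len(n, pool_size):
    """Output length of 'valid' max pooling with stride equal to pool_size."""
    return (n - pool_size) // pool_size + 1

trace = [1000]                                # MAX_SEQUENCE_LENGTH
trace.append(conv1d_len(trace[-1], 5))        # Conv1D(128, 5)   -> 996
trace.append(maxpool1d_len(trace[-1], 5))     # MaxPooling1D(5)  -> 199
trace.append(conv1d_len(trace[-1], 5))        # Conv1D(128, 5)   -> 195
trace.append(maxpool1d_len(trace[-1], 5))     # MaxPooling1D(5)  -> 39
trace.append(conv1d_len(trace[-1], 5))        # Conv1D(128, 5)   -> 35
print(trace)  # [1000, 996, 199, 195, 39, 35]
```

`GlobalMaxPooling1D` then takes the maximum over the remaining 35 positions for each of the 128 filters, leaving one 128-dimensional vector per sample.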

In [67]:
text = """
Newsgroups: alt.atheism
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!uunet!psgrain!ee.und.ac.za!csir.co.za!proxima.alt.za!lucio
From: lucio@proxima.alt.za (Lucio de Re)
Subject: Re: atheist?
Message-ID: <1993Apr6.090128.15186@proxima.alt.za>
Reply-To: lucio@proxima.Alt.ZA
Organization: MegaByte Digital Telecommunications
References: <1993Apr1.140431.250@juncol.juniata.edu> <ii1i2B1w165w@mantis.co.uk>
Date: Tue, 6 Apr 1993 09:01:28 GMT
Lines: 33

Tony Lezard <tony@mantis.co.uk> writes:

>My opinion is that the strong atheist position requires too much
>belief for me to be comfortable with. Any strong atheists out there
>care to comment? As far as I can tell, strong atheists are far
>outnumbered on alt.atheism by weak atheists.

At the cost of repudiating the FAQ, I think too much is made of the
strong vs weak atheism issue, although in the context of alt.atheism,
where we're continually attacked on the basis that strong atheists
"believe" in the non-existence of god, I think the separation is a
valid one.

To cover my arse, what I'm trying to say is that there is an
infinitely grey area between weak and strong, as well as between
strong and the unattainable mathematical atheism (I wish!).  Whereas I
_logically_ can only support the weak atheist position, in effect I am
a strong atheist (and wish I could be a mathematical one).  To
justify my strong atheist position I believe I need only show that
the evidence presented in favour of any of the gods under scrutiny
is faulty.

If I read the FAQ correctly, no argument for the existence of god
(generic, as represented by mainstream theologians) has ever been
found to be unassailable.  To me this is adequate evidence that the
_real_god_ is undefinable (or at least no definition has yet been
found to be watertight), which in turn I accept as sufficient to
base a disbelief in each and every conceivable god.

I'm a little fuzzy on the edges, though, so opinions are welcome
(but perhaps we should change the thread subject).
-- 
Lucio de Re (lucio@proxima.Alt.ZA) - tab stops at four.
"""
i = text.find('\n\n')
if 0 < i:
    text = text[i:]
print(text)
# Reuse the tokenizer fitted on the training texts so that the word indices
# match the ones the model was trained on. A new tokenizer fitted on this
# single post would assign different indices.
sequences2 = tokenizer.texts_to_sequences([text])
data2 = pad_sequences(sequences2, maxlen=MAX_SEQUENCE_LENGTH)
print(data2.shape)

# One-hot target for class 0 (alt.atheism), shaped (1, 20) like a row of y_train.
y2 = np.zeros((1, len(labels_index)))
y2[0, 0] = 1

model.evaluate(x=data2, y=y2)



Tony Lezard <tony@mantis.co.uk> writes:

>My opinion is that the strong atheist position requires too much
>belief for me to be comfortable with. Any strong atheists out there
>care to comment? As far as I can tell, strong atheists are far
>outnumbered on alt.atheism by weak atheists.

At the cost of repudiating the FAQ, I think too much is made of the
strong vs weak atheism issue, although in the context of alt.atheism,
where we're continually attacked on the basis that strong atheists
"believe" in the non-existence of god, I think the separation is a
valid one.

To cover my arse, what I'm trying to say is that there is an
infinitely grey area between weak and strong, as well as between
strong and the unattainable mathematical atheism (I wish!).  Whereas I
_logically_ can only support the weak atheist position, in effect I am
a strong atheist (and wish I could be a mathematical one).  To
justify my strong atheist position I believe I need only show that
the evidence presented in fav

[0.9114609956741333, 0.0]

In [65]:
predicted = model.predict(x=data2)

The prediction should be `alt.atheism`, but the model picks a different label:

In [66]:
print(max(predicted[0]))
print(list(labels_index.keys())[list(labels_index.values()).index(predicted.argmax())])

0.470408
talk.religion.misc
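
The reverse lookup from class id to label name is easier to read with an inverted dictionary. A minimal sketch in plain Python follows; `top_label` is a hypothetical helper, and the three-class label map and the probabilities are invented for illustration, not model output.

```python
def top_label(probabilities, labels_index):
    """Return the name and score of the most probable class."""
    id_to_name = {label_id: name for name, label_id in labels_index.items()}
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return id_to_name[best], probabilities[best]

# Invented toy values, not taken from the model above.
toy_labels_index = {"alt.atheism": 0, "comp.graphics": 1, "talk.religion.misc": 2}
toy_probs = [0.30, 0.23, 0.47]
name, score = top_label(toy_probs, toy_labels_index)
print(name, score)  # talk.religion.misc 0.47
```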


In [5]:
from keras.preprocessing.text import one_hot
one_hot(text="oops i did it agian.", n = 10)

[1, 8, 2, 7, 1]
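
Despite its name, `one_hot` does not return one-hot vectors. It hashes each word to an integer in the range 1 to `n - 1`, so different words can collide: in the output above, the first and last words both map to 1. The sketch below imitates that hashing trick with MD5 so the result is reproducible. By default Keras uses Python's built-in `hash` instead, so the numbers printed here are not expected to match the output above. `hashed_indices` is a name made up for this example, and its punctuation handling is simplified.

```python
import hashlib

def hashed_indices(text, n):
    """Map each word to an integer in [1, n - 1] via a stable hash."""
    # Simplified tokenization: lowercase, drop periods, split on whitespace.
    words = text.lower().replace(".", " ").split()
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in words]

indices = hashed_indices("oops i did it agian.", n=10)
print(indices)
```

With only 9 possible values, collisions are likely even for a five-word sentence, which is why a larger `n` is usually chosen in practice.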