# Predicting Groceries using a Word-Based Neural Language Model in Python with Keras

## Testing One-Word-In, One-Word-Out Sequences with a toy corpus

In [0]:
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from the model
def generate_seq2(model, tokenizer, seed_text, n_words):
	in_text, result = seed_text, seed_text
	# generate a fixed number of words
	for _ in range(n_words):
		# encode the text as integer
		encoded = tokenizer.texts_to_sequences([in_text])[0]
		encoded = array(encoded)
		# predict a word in the vocabulary
		yhat = model.predict_classes(encoded, verbose=0)
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text, result = out_word, result + ' ' + out_word
	return result

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
	in_text = seed_text
	# generate a fixed number of words
	for _ in range(n_words):
		# encode the text as integer
		encoded = tokenizer.texts_to_sequences([in_text])[0]
		# pre-pad sequences to a fixed length
		encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
		# predict the index of the most likely next word
		yhat = model.predict_classes(encoded, verbose=0)
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += ' ' + out_word
	return in_text
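
Recent TensorFlow/Keras releases no longer provide `predict_classes()` on `Sequential` models. A minimal sketch of the equivalent lookup, assuming the probabilities come from `model.predict(encoded, verbose=0)[0]` and the reverse mapping from the tokenizer's `index_word` attribute. The helper name `next_word` and the toy values are illustrative, not part of the original notebook:

```python
def next_word(probabilities, index_word):
    """Return the word whose class index has the highest probability."""
    # argmax over the softmax output
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    # indices without a word (such as the padding index 0) map to ''
    return index_word.get(best_index, '')

# toy softmax output over a 4-entry vocabulary (index 0 is padding)
toy_probabilities = [0.05, 0.10, 0.70, 0.15]
toy_index_word = {1: 'and', 2: 'jack', 3: 'jill'}
print(next_word(toy_probabilities, toy_index_word))
```

Using a dictionary lookup also avoids scanning `word_index` for every generated word.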

The first step is to encode the text as integers.

Each lowercase word in the source text is assigned a unique integer and we can convert the sequences of words to sequences of integers.

Keras provides the Tokenizer class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the `texts_to_sequences()` function.

In [0]:
# source text
data2 = """ Jack and Jill went up the hill\n
		To fetch a pail of water\n
		Jack fell down and broke his crown\n
		And Jill came tumbling after\n """

# integer encode text
tokenizer3 = Tokenizer()
tokenizer3.fit_on_texts([data2])
encoded3 = tokenizer3.texts_to_sequences([data2])[0]
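
To make the mapping concrete, here is a pure-Python sketch that approximates what the Tokenizer does with its default settings: lowercase the text, rank words by frequency, and number them from 1. The function names `build_word_index` and `encode` are hypothetical, and the regular expression is a simplification of the Tokenizer's own filtering:

```python
from collections import Counter
import re

def build_word_index(text):
    """Map each lowercase word to an integer, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # indices start at 1 because 0 is reserved for padding
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}

def encode(text, word_index):
    """Convert text to a list of integers, skipping unknown words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word_index[w] for w in words if w in word_index]

rhyme = ("Jack and Jill went up the hill\n"
         "To fetch a pail of water\n"
         "Jack fell down and broke his crown\n"
         "And Jill came tumbling after\n")
word_index = build_word_index(rhyme)
print(len(word_index))
print(encode("Jack and Jill", word_index))
```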


We will need to know the size of the vocabulary later for both defining the word embedding layer in the model, and for encoding output words using a one hot encoding.

The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the `word_index` attribute.

In [58]:
# determine the vocabulary size
vocab_size3 = len(tokenizer3.word_index) + 1
print('Vocabulary Size: %d' % vocab_size3)


Vocabulary Size: 22


Running this example, we can see that the vocabulary contains 21 unique words; the reported size is 22 because we add one.

We add one because word indices start at 1 and index 0 is reserved, so the largest word index must still be a valid array index: words encoded 1 to 21 need array indices 0 to 21, or 22 positions.

Next, we need to create sequences of words to fit the model with one word as input and one word as output.

In [59]:
# create word -> word sequences
sequences3 = list()
for i in range(1, len(encoded3)):
	sequence3 = encoded3[i-1:i+1]
	sequences3.append(sequence3)
print('Total Sequences: %d' % len(sequences3))


Total Sequences: 24
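
The pairing step can be checked on a tiny input. This sketch uses a hypothetical helper `make_pairs` and a made-up encoding; it shows that n tokens always give n-1 input/output pairs, which is why the 25-word rhyme produces 24 sequences:

```python
def make_pairs(encoded):
    """Return [input, output] pairs of adjacent tokens."""
    return [encoded[i - 1:i + 1] for i in range(1, len(encoded))]

toy_encoded = [2, 1, 3, 4]  # made-up encoding of a four-word phrase
print(make_pairs(toy_encoded))
```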


We can then split the sequences into input (X) and output elements (y). This is straightforward as we only have two columns in the data.

In [0]:
# split into X and y elements
sequences3 = array(sequences3)
X3, y3 = sequences3[:,0],sequences3[:,1]


We will fit our model to predict a probability distribution across all words in the vocabulary. That means that we need to turn the output element from a single integer into a one hot encoding with a 0 for every word in the vocabulary and a 1 for the actual next word. This gives the network a ground truth to aim for from which we can calculate error and update the model.

Keras provides the to_categorical() function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.

In [0]:
# one hot encode outputs
y3 = to_categorical(y3, num_classes=vocab_size3)
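
As a rough illustration of what the encoding produces for a single integer, here is a pure-Python sketch; `one_hot` is a hypothetical helper, not the Keras implementation. It also shows why the vocabulary size needs the extra position: the largest word index, 21, must fit inside the vector.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at the given index."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

print(one_hot(3, 5))
# index 21 only fits because the vector has 22 positions (0 to 21)
print(len(one_hot(21, 22)))
```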


We are now ready to define the neural network model.

The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection. The input sequence contains a single word, therefore `input_length=1`.

The model has a single hidden LSTM layer with 50 units. This is far more than is needed. The output layer has one neuron for each word in the vocabulary and uses a softmax activation function so that the outputs are non-negative and sum to one, like a probability distribution.

In [62]:
# define model
model3 = Sequential()
model3.add(Embedding(vocab_size3, 10, input_length=1))
model3.add(LSTM(50))
model3.add(Dense(vocab_size3, activation='softmax'))
print(model3.summary())



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1, 10)             220       
_________________________________________________________________
lstm_6 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_6 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
None
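
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the function names are illustrative only:

```python
def embedding_params(vocab_size, embedding_dim):
    # one vector of length embedding_dim per vocabulary entry
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embedding_dim, lstm_units = 22, 10, 50
total = (embedding_params(vocab_size, embedding_dim)
         + lstm_params(embedding_dim, lstm_units)
         + dense_params(lstm_units, vocab_size))
print(total)
```

Note that the LSTM count depends only on the embedding width and the number of units, not on the vocabulary size or the input length.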


We will use this same general network structure for each example in this tutorial, with minor changes to the learned embedding layer.

Next, we can compile and fit the network on the encoded text data. Technically, we are modeling a multi-class classification problem (predict the word in the vocabulary), so we use the categorical cross entropy loss function. We use the efficient Adam implementation of gradient descent and track accuracy at the end of each epoch. The model is fit for 500 training epochs, which is again perhaps more than is needed.

The network configuration was not tuned for this and later experiments; an over-specified configuration was chosen to ensure that we could focus on the framing of the language model.

In [0]:
# compile network
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
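
For intuition about the loss values, here is a small sketch of categorical cross entropy for one sample, written in plain Python rather than taken from Keras. A model that spreads its probability evenly over the 22 classes scores ln(22), roughly 3.09, which is close to the first-epoch loss of 3.0899 in the training log:

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

vocab_size = 22
uniform = [1.0 / vocab_size] * vocab_size  # untrained, evenly spread guess
target = [0.0] * vocab_size
target[5] = 1.0                            # arbitrary true class
print(round(categorical_crossentropy(target, uniform), 4))
```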


In [65]:
# fit network
model3.fit(X3, y3, epochs=500, verbose=2)


Epoch 1/500
 - 2s - loss: 3.0899 - acc: 0.0833
Epoch 2/500
 - 0s - loss: 3.0891 - acc: 0.1250
Epoch 3/500
 - 0s - loss: 3.0884 - acc: 0.2083
Epoch 4/500
 - 0s - loss: 3.0876 - acc: 0.2083
Epoch 5/500
 - 0s - loss: 3.0869 - acc: 0.2083
Epoch 6/500
 - 0s - loss: 3.0861 - acc: 0.2083
Epoch 7/500
 - 0s - loss: 3.0853 - acc: 0.2083
Epoch 8/500
 - 0s - loss: 3.0845 - acc: 0.2083
Epoch 9/500
 - 0s - loss: 3.0837 - acc: 0.2083
Epoch 10/500
 - 0s - loss: 3.0829 - acc: 0.2083
Epoch 11/500
 - 0s - loss: 3.0821 - acc: 0.2083
Epoch 12/500
 - 0s - loss: 3.0812 - acc: 0.2083
Epoch 13/500
 - 0s - loss: 3.0804 - acc: 0.2083
Epoch 14/500
 - 0s - loss: 3.0795 - acc: 0.2083
Epoch 15/500
 - 0s - loss: 3.0786 - acc: 0.2083
Epoch 16/500
 - 0s - loss: 3.0777 - acc: 0.2083
Epoch 17/500
 - 0s - loss: 3.0768 - acc: 0.2083
Epoch 18/500
 - 0s - loss: 3.0758 - acc: 0.2083
Epoch 19/500
 - 0s - loss: 3.0749 - acc: 0.2083
Epoch 20/500
 - 0s - loss: 3.0739 - acc: 0.2083
Epoch 21/500
 - 0s - loss: 3.0729 - acc: 0.2083
...

<keras.callbacks.History at 0x7f9dde1ff518>

After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. Here we pass in 'Jack' by encoding it and calling `model.predict_classes()` to get the integer output for the predicted word. This is then looked up in the vocabulary mapping to give the associated word.

In [70]:
# evaluate
print(generate_seq2(model3, tokenizer3, 'Jack', 6))

Jack and jill went up the hill


## Downloading grocery dataset

In [1]:
import os
!pip install kaggle



In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [3]:
!ls -lha kaggle.json

-rw-r--r-- 1 root root 69 Apr  3 13:16 kaggle.json


In [4]:
!ls
!pwd
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

 aisles.csv		     order_products__prior.csv.zip
 aisles.csv.zip		     order_products__train.csv.zip
 departments.csv	     orders.csv
 departments.csv.zip	     orders.csv.zip
'kaggle (1).json'	     products.csv
 kaggle.json		     products.csv.zip
 __MACOSX		     sample_data
 order_products__prior.csv   sample_submission.csv.zip
/content


In [5]:
!kaggle competitions download -c instacart-market-basket-analysis

departments.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
aisles.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
order_products__train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
products.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
orders.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
order_products__prior.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


In [6]:
!ls
# unzip the downloaded archives; -n skips files that were already extracted
!unzip -n order_products__prior.csv.zip
!unzip -n orders.csv.zip
!unzip -n products.csv.zip
!unzip -n aisles.csv.zip
!unzip -n departments.csv.zip

 aisles.csv		     order_products__prior.csv.zip
 aisles.csv.zip		     order_products__train.csv.zip
 departments.csv	     orders.csv
 departments.csv.zip	     orders.csv.zip
'kaggle (1).json'	     products.csv
 kaggle.json		     products.csv.zip
 __MACOSX		     sample_data
 order_products__prior.csv   sample_submission.csv.zip

## Data Preprocessing

In [71]:
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter
from IPython.display import display

# Function that returns the size of an object in MB
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))

orders = pd.read_csv('order_products__prior.csv')
print('orders -- dimensions: {0};   size: {1}'.format(orders.shape, size(orders)))
display(orders.head())

orders -- dimensions: (32434489, 4);   size: 1037.90 MB


| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


### Convert order data into a Series indexed by order ID

In [72]:
# Convert from DataFrame to a Series, with order_id as index and item_id as value
orders = orders.set_index('order_id')['product_id'].rename('item_id')
display(orders.head(10))
type(orders)

order_id
2    33120
2    28985
2     9327
2    45918
2    30035
2    17794
2    40141
2     1819
2    43668
3    33754
Name: item_id, dtype: int64

pandas.core.series.Series

In [73]:
item_name = pd.read_csv('products.csv')
item_name = item_name.rename(columns={'product_id':'item_id', 'product_name':'item_name'})
item_name.tail()

| | item_id | item_name | aisle_id | department_id |
|---|---|---|---|---|
| 49683 | 49684 | Vodka, Triple Distilled, Twist of Vanilla | 124 | 5 |
| 49684 | 49685 | En Croute Roast Hazelnut Cranberry | 42 | 1 |
| 49685 | 49686 | Artisan Baguette | 112 | 3 |
| 49686 | 49687 | Smartblend Healthy Metabolism Dry Cat Food | 41 | 8 |
| 49687 | 49688 | Fresh Foaming Cleanser | 73 | 11 |


## Line-by-Line Sequence


Another approach is to split up the source text line-by-line, then break each line down into a series of words that build up.

This approach may allow the model to use the context of each line to help the model in those cases where a simple one-word-in-and-out model creates ambiguity.

In this case, this comes at the cost of predicting words across lines, which might be fine for now if we are only interested in modeling and generating lines of text.

Note that in this representation, we will require a padding of sequences to ensure they meet a fixed length input. This is a requirement when using Keras.

First, we build the source text from the order data, writing the product names of each order on one line. We then fit a Tokenizer on this text and create the sequences of integers line by line.

In [0]:
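data = ""
indexset = set()

# build one line of product names per order until 2000 distinct items are seen
# note: orders.index repeats an order_id once per item row, so the line for
# an order is appended once per item in that order
for i in orders.index:
  if len(indexset) >= 2000:
    break
  # np.atleast_1d also handles orders that contain a single item
  for itemid in np.atleast_1d(orders.loc[i]):
    # note: .loc selects by row label (0-based), so this returns the
    # products.csv row whose item_id is itemid + 1
    data += item_name.loc[itemid].item_name + " "
    indexset.add(itemid)
  data += "\n"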
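
A cleaner way to build the text is to group the rows by order and look up names by item ID rather than by row position. The sketch below shows the idea on made-up data in plain Python; `build_lines`, `toy_rows` and `toy_names` are hypothetical, and the outputs recorded in this section were produced by the cell above, not by this sketch:

```python
from itertools import groupby

def build_lines(rows, names):
    """Return one line of product names per order.

    rows:  (order_id, item_id) tuples, sorted by order_id
    names: dict mapping item_id to product name
    """
    lines = []
    # groupby yields each run of consecutive rows sharing an order_id once
    for _, group in groupby(rows, key=lambda row: row[0]):
        lines.append(" ".join(names[item_id] for _, item_id in group))
    return "\n".join(lines)

# made-up orders and product names for illustration
toy_rows = [(2, 10), (2, 11), (3, 12)]
toy_names = {10: "Apple Juice", 11: "Rye Bread", 12: "Whole Milk"}
print(build_lines(toy_rows, toy_names))
```

Each order then contributes exactly one line, so repeated lines do not inflate the training set.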

In [75]:
# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

# create line-based sequences
sequences = list()
for line in data.split('\n'):
	encoded = tokenizer.texts_to_sequences([line])[0]
	for i in range(len(encoded)-2, len(encoded)):
		sequence = encoded[:i+1]
		sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Vocabulary Size: 2450
Total Sequences: 6090
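
The cell above keeps only the two longest prefixes of each line. For comparison, the general build-up described earlier keeps every prefix of at least two tokens. A minimal sketch with a hypothetical helper `prefix_sequences` and made-up token values:

```python
def prefix_sequences(encoded_line):
    """Return every prefix of the line that holds at least two tokens."""
    return [encoded_line[:i + 1] for i in range(1, len(encoded_line))]

toy_line = [7, 3, 9, 4]  # made-up encoding of a four-word line
print(prefix_sequences(toy_line))
```

Lines with fewer than two tokens yield no sequences here, since they offer no input/output pair to learn from.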


Next, we can pad the prepared sequences. We can do this using the `pad_sequences()` function provided in Keras. This first involves finding the longest sequence, then using that as the length by which to pad-out all other sequences.

The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are `max_length`-1 in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.

In [76]:
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)

Max Sequence Length: 173
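
To see what pre-padding does, here is a plain-Python sketch that mirrors the behaviour of `pad_sequences()` with `padding='pre'` and its default of truncating from the front. `pre_pad` is a hypothetical helper that assumes `maxlen` is at least 1:

```python
def pre_pad(sequences, maxlen, value=0):
    """Left-pad each sequence with value, keeping only the last maxlen tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop tokens from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pre_pad([[5], [1, 2, 3, 4]], 3))
```

Because the padding goes at the front, the output word always sits in the last column, which is what makes the later split into `X` and `y` a simple column slice.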


In [0]:
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)


In [78]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 172, 10)           24500     
_________________________________________________________________
lstm_7 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_7 (Dense)              (None, 2450)              124950    
Total params: 161,650
Trainable params: 161,650
Non-trainable params: 0
_________________________________________________________________
None


In [0]:
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [80]:
# fit network
model.fit(X, y, epochs=50, verbose=2)


Epoch 1/50
 - 63s - loss: 6.3393 - acc: 0.0172
Epoch 2/50
 - 61s - loss: 5.7334 - acc: 0.0197
Epoch 3/50
 - 60s - loss: 5.7096 - acc: 0.0200
Epoch 4/50
 - 60s - loss: 5.6646 - acc: 0.0190
Epoch 5/50
 - 61s - loss: 5.5838 - acc: 0.0207
Epoch 6/50
 - 61s - loss: 5.4907 - acc: 0.0263
Epoch 7/50
 - 61s - loss: 5.3854 - acc: 0.0264
Epoch 8/50
 - 61s - loss: 5.2664 - acc: 0.0291
Epoch 9/50
 - 61s - loss: 5.0806 - acc: 0.0412
Epoch 10/50
 - 61s - loss: 4.7595 - acc: 0.0691
Epoch 11/50
 - 61s - loss: 4.4180 - acc: 0.1010
Epoch 12/50
 - 61s - loss: 4.0973 - acc: 0.1557
Epoch 13/50
 - 61s - loss: 3.8001 - acc: 0.2059
Epoch 14/50
 - 61s - loss: 3.5202 - acc: 0.2690
Epoch 15/50
 - 61s - loss: 3.2599 - acc: 0.3255
Epoch 16/50
 - 61s - loss: 3.0103 - acc: 0.4011
Epoch 17/50
 - 61s - loss: 2.7724 - acc: 0.4695
Epoch 18/50
 - 61s - loss: 2.5503 - acc: 0.5184
Epoch 19/50
 - 62s - loss: 2.3473 - acc: 0.5704
Epoch 20/50
 - 62s - loss: 2.1421 - acc: 0.6046
Epoch 21/50
 - 62s - loss: 1.9544 - acc: 0.6532
...

<keras.callbacks.History at 0x7f9ddd788748>

We can use the model to generate new sequences as before. The `generate_seq()` function can be updated to build up an input sequence by adding predictions to the list of input words each iteration.

In [81]:
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Strawberry', 4))
print(generate_seq(model, tokenizer, max_length-1, 'sandwiches', 4))


Strawberry drink bags carrots potatoes
sandwiches ranchera filberts soup santa


## Two-Words-In, One-Word-Out Sequence

We can use an intermediate between the one-word-in and the whole-sentence-in approaches and pass in a sub-sequence of words as input.

This provides a trade-off between the two framings, allowing new lines to be generated and generation to be picked up mid-line.

We will use 2 words as input to predict one word as output. The preparation of the sequences is much like the first example, except with different offsets in the source sequence arrays, as follows:

In [0]:
# integer encode sequences of words
tokenizer2 = Tokenizer()
tokenizer2.fit_on_texts([data])
encoded2 = tokenizer2.texts_to_sequences([data])[0]


In [48]:
# retrieve vocabulary size
vocab_size2 = len(tokenizer2.word_index) + 1
print('Vocabulary Size: %d' % vocab_size2)


Vocabulary Size: 2450


In [49]:
# encode 2 words -> 1 word
# note: only the first 6090 tokens of encoded2 are used
sequences2 = list()
for i in range(2, 6090):
	sequence2 = encoded2[i-2:i+1]
	sequences2.append(sequence2)
print('Total Sequences: %d' % len(sequences2))


Total Sequences: 6088
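
The windowing can be illustrated on a short made-up token list. In this sketch `sliding_windows` is a hypothetical helper; with 2 input words each window holds 3 tokens, and n tokens give n-2 windows, matching the 6088 sequences built from 6090 tokens:

```python
def sliding_windows(encoded, n_inputs):
    """Return windows of n_inputs input tokens followed by one output token."""
    width = n_inputs + 1
    return [encoded[i:i + width] for i in range(len(encoded) - width + 1)]

toy_encoded = [4, 8, 15, 16, 23]  # made-up token ids
windows = sliding_windows(toy_encoded, 2)
inputs = [w[:-1] for w in windows]   # first two tokens of each window
outputs = [w[-1] for w in windows]   # the token to predict
print(windows)
```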


In [50]:
# pad sequences
max_length2 = max([len(seq) for seq in sequences2])
sequences2 = pad_sequences(sequences2, maxlen=max_length2, padding='pre')
print('Max Sequence Length: %d' % max_length2)


Max Sequence Length: 3


In [0]:
# split into input and output elements
sequences2 = array(sequences2)
X2, y2 = sequences2[:,:-1],sequences2[:,-1]
y2 = to_categorical(y2, num_classes=vocab_size2)


In [52]:
# define model
model2 = Sequential()
model2.add(Embedding(vocab_size2, 10, input_length=max_length2-1))
model2.add(LSTM(50))
model2.add(Dense(vocab_size2, activation='softmax'))
print(model2.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 2, 10)             24500     
_________________________________________________________________
lstm_5 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_5 (Dense)              (None, 2450)              124950    
Total params: 161,650
Trainable params: 161,650
Non-trainable params: 0
_________________________________________________________________
None


In [0]:
# compile network
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [54]:
# fit network
model2.fit(X2, y2, epochs=50, verbose=2)

Epoch 1/50
 - 3s - loss: 6.7099 - acc: 0.0227
Epoch 2/50
 - 2s - loss: 5.4353 - acc: 0.0235
Epoch 3/50
 - 2s - loss: 5.3245 - acc: 0.0218
Epoch 4/50
 - 2s - loss: 5.2675 - acc: 0.0232
Epoch 5/50
 - 2s - loss: 5.2141 - acc: 0.0230
Epoch 6/50
 - 2s - loss: 5.1119 - acc: 0.0238
Epoch 7/50
 - 2s - loss: 4.7985 - acc: 0.0348
Epoch 8/50
 - 2s - loss: 4.2641 - acc: 0.0971
Epoch 9/50
 - 2s - loss: 3.6493 - acc: 0.2221
Epoch 10/50
 - 2s - loss: 3.0628 - acc: 0.3883
Epoch 11/50
 - 2s - loss: 2.5205 - acc: 0.5309
Epoch 12/50
 - 2s - loss: 2.0587 - acc: 0.6590
Epoch 13/50
 - 2s - loss: 1.6781 - acc: 0.7372
Epoch 14/50
 - 2s - loss: 1.3708 - acc: 0.8134
Epoch 15/50
 - 2s - loss: 1.1273 - acc: 0.8530
Epoch 16/50
 - 2s - loss: 0.9313 - acc: 0.8955
Epoch 17/50
 - 2s - loss: 0.7741 - acc: 0.9156
Epoch 18/50
 - 2s - loss: 0.6458 - acc: 0.9379
Epoch 19/50
 - 2s - loss: 0.5419 - acc: 0.9497
Epoch 20/50
 - 2s - loss: 0.4555 - acc: 0.9614
Epoch 21/50
 - 2s - loss: 0.3881 - acc: 0.9662
Epoch 22/50
...

<keras.callbacks.History at 0x7f9de00bfba8>

In [55]:
# evaluate model
print(generate_seq(model2, tokenizer2, max_length2-1, 'Strawberry', 4))
print(generate_seq(model2, tokenizer2, max_length2-1, 'sandwiches', 4))

Strawberry cake style traditional with
sandwiches ground activated ultracube pizza
