<a href="https://colab.research.google.com/github/harryahlas/generate-survey-comments/blob/master/seq2seqcomments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generate Survey Comments
Builds a sequence to sequence model to create comments resembling responses from employee surveys.  Training data (*training_comments.csv*, stored in my personal Google Drive and available on request) was pulled from multiple online sources, mostly *data.world*. I truncated the comments at 1000 characters to facilitate training.

The model is based on the work of George Pipis, link below.

https://pub.towardsai.net/word-level-text-generation-dd61a5a0313d

*Note: runs faster on CPU than TPU*


#### Mount Drive

In [3]:
# Mount Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


#### Load Modules

In [4]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku 
import numpy as np
import pandas as pd

#### Build Model

In [5]:
tokenizer = Tokenizer()
data = open('/content/gdrive/MyDrive/Development/seq2seqcomments/training_comments.csv').read()
#data = open('comments-not-on-github.txt').read()
corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
# create input sequences using list of tokens
input_sequences = []
for line in corpus:
 token_list = tokenizer.texts_to_sequences([line])[0]
 for i in range(1, len(token_list)):
  n_gram_sequence = token_list[:i+1]
  input_sequences.append(n_gram_sequence)
# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
# create predictors and label
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
label = ku.to_categorical(label, num_classes=total_words)
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150, return_sequences = True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words/2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150, return_sequences = True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words/2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 201, 100)          796400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 201, 300)          301200    
_________________________________________________________________
dropout_1 (Dropout)          (None, 201, 300)          0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               160400    
_________________________________________________________________
dense_2 (Dense)              (None, 3982)              402182    
_________________________________________________________________
dense_3 (Dense)              (None, 7964)              31720612  
Total params: 33,380,794
Trainable params: 33,380,794
Non-trainable params: 0
__________________________________________

#### Train Model

In [83]:
history = model.fit(predictors, label, epochs=15, verbose=1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


#### Save Model

In [212]:
model.save('/content/gdrive/My Drive/Development/seq2seqcomments/seq2seq50')
#model_backup = model



INFO:tensorflow:Assets written to: /content/gdrive/My Drive/Development/seq2seqcomments/seq2seq50/assets


INFO:tensorflow:Assets written to: /content/gdrive/My Drive/Development/seq2seqcomments/seq2seq50/assets


#### Load Model from Drive *(Optional)*

In [6]:
from tensorflow import keras
model = keras.models.load_model('/content/gdrive/My Drive/Development/seq2seqcomments/seq2seq50')

#### Function to Predict Words *print_next_words()*

In [209]:
# orig
def print_next_words(seed_text,number_of_words_to_predict):
  for _ in range(number_of_words_to_predict):
   token_list = tokenizer.texts_to_sequences([seed_text])[0]
   token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
   #predicted = model.predict_classes(token_list, verbose=0)
   predicted = np.argmax(model.predict(token_list), axis=-1)
   output_word = ""
   for word, index in tokenizer.word_index.items():
    if index == predicted:
     output_word = word
     break
   seed_text += " " + output_word
  print(seed_text)

#### Make Predictions

In [210]:
print_next_words("My manager has helped me at my job. I have grown and become a better employee.", 50)

My manager has helped me at my job. I have grown and become a better employee. the city of the same as well i am concerned that is a lot of people who are not a lot of people who are not a lot of people who have to get the same as a lot of the same as well as a lot of women are


In [190]:
seed_text = "My manager has helped me at my job. I have grown and become a better employee. I wonder why this works. to "

In [211]:
token_list = tokenizer.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
predicted = np.argmax(model.predict(token_list), axis=-1)
output_word = ""
for word, index in tokenizer.word_index.items():
    if index == predicted:
      output_word = word
      break
seed_text += " " + output_word
print(seed_text)

My work is wonderful. I love my manager. I would change with a lot of the same boat groups are wasting money on the same system i am a lot of people in the same time the same as above i think


In [203]:
# Seems to work ok
seed_text = "My work is wonderful. I love my manager. I would change"
words_to_add = 30 
for i in range(0,words_to_add):
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
  predicted = np.argmax(model.predict(token_list), axis=-1)
  output_word = ""
  for word, index in tokenizer.word_index.items():
      if index == predicted:
        output_word = word
        break
  seed_text += " " + output_word
  if i == (words_to_add - 1):
    print(seed_text)

My work is wonderful. I love my manager. I would change with a lot of the same boat groups are wasting money on the same system i am a lot of people in the same time the same as above i


In [109]:
def get_new_text(seed_text_input):
  token_list = tokenizer.texts_to_sequences([seed_text_input])[0]
  token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
  predicted = np.argmax(model.predict(token_list), axis=-1)
  seed_text_input += " " + list(tokenizer.word_index.items())[predicted[0]][0]
  return(seed_text_input)

In [195]:
seed_text = "My manager has helped me at my job. I have grown and become a better employee. "

In [169]:
seed_text = get_new_text(seed_text)
print(seed_text)

My manager has helped me at my job. I have grown and become a better employee.  to good as to to
