### Preparing Model for Deployment
- Create data cleaning pipeline for a single user input
- Load pre-trained model
- Calculate the prediction for that input

Special Notes:
- This model relies on the GoogleNews word2vec vectors (GoogleNews-vectors-negative300.bin, 300 dimensions), which are about 1.5 GB zipped and must be in a ./model folder relative to this notebook
- Requires gensim and nltk (pandas, numpy, and keras are also imported below)
- The first draft of this notebook was done in Google Colab; some Colab-specific cells are left commented out for posterity in case I go back
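
Since the notebook fails late if the large files are missing, a quick pre-flight check can help. This is a minimal sketch, not part of the original notebook: `missing_files` is a hypothetical helper, and the two paths are simply the ones used by the load calls further down.

```python
from pathlib import Path

# Paths assumed from the load calls later in this notebook.
MODEL_PATH = Path("./model/GoogleNews-vectors-negative300.bin")
KERAS_MODEL_PATH = Path("model.h5")


def missing_files(paths):
    """Return the paths (as strings) that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


missing = missing_files([MODEL_PATH, KERAS_MODEL_PATH])
if missing:
    print("Missing required files:", ", ".join(missing))
```

Running this before the imports makes a missing download obvious without waiting for gensim to fail.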

***Colab Specific Cells:***

In [75]:
# For a fresh Colab instance, clone fresh:
#!pip install -q xlrd
#!git clone https://github.com/hamil168/NSV_Hackfest

In [80]:
#ls

In [79]:
#cd NSV_Hackfest/

In [78]:
#ls

In [76]:
#!pip install gensim
#!python -m pip install --upgrade pip

In [77]:
#!pip install nltk

***Begin Work***:

In [5]:
import pandas as pd
import numpy as np
import gensim
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string



In [81]:
# For some reason, Colab couldn't download the stopwords corpus,
# so I switched to local development

# nltk.download('stopwords')
# nltk.download('wordnet')  # WordNetLemmatizer needs this corpus too

In [8]:
# Load the English stop-word list
stop_words = stopwords.words('english')

In [9]:
exclude_chars = set(string.punctuation)
lemma = WordNetLemmatizer()

***CONFIGURATION***

In [13]:
# This file is huge but it must be present at ./model/ wherever the script is deployed or w2v will not work
WORD2VEC_MODEL = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)

In [26]:
INPUT_LENGTH_LIMIT = 6  # max words per input (RNN time steps); specific to the NSV_Hackfest model
W2V_LENGTH = 300  # word2vec embedding size; specific to this model

In [82]:
# type(WORD2VEC_MODEL.wv)

***User Inputs***

In [16]:
# Get Line of User Input

user_input = ""
length_ok = False

while not length_ok:
  
  # get a phrase
  user_input_string = input("enter input string: ")

  # split on spaces
  user_input_list = user_input_string.split(' ')

  # check against length limit
  if len(user_input_list) > INPUT_LENGTH_LIMIT:
    print('\ninput exceeds {} words'.format(INPUT_LENGTH_LIMIT))
  else:
    print('\ninput length OK')
    length_ok = True


input length OK
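
For deployment, where there is no interactive prompt, the same length check could be wrapped in a function. The sketch below is an assumption about how that might look: `check_length` is a hypothetical name, and it uses `split()` with no argument so that repeated spaces are not counted as extra words.

```python
INPUT_LENGTH_LIMIT = 6  # same limit as the configuration cell


def check_length(text, limit=INPUT_LENGTH_LIMIT):
    """Return (ok, n_words) for a raw input string."""
    # split() with no argument collapses runs of whitespace, so
    # "a  b" counts as two words rather than three.
    n_words = len(text.split())
    return n_words <= limit, n_words


print(check_length("testing one two three"))              # (True, 4)
print(check_length("one two three four five six seven"))  # (False, 7)
```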


In [20]:
def clean_user_input(user_input_string):
    
    #filter for stop words
    stop_filtered = [word for word in user_input_string.lower().split(' ') if word not in stop_words]
    
    # drop tokens that are a single punctuation character
    # (punctuation attached to a word, e.g. "two,", is kept as-is)
    punc_filtered = [word for word in stop_filtered if word not in exclude_chars]
    
    # lemmatize
    lemma_filtered = [lemma.lemmatize(word) for word in punc_filtered]
    
    return lemma_filtered

In [21]:
clean_user_input(user_input_string)

['testing', 'one', 'two', 'three']
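
The cleaning function above only drops tokens that are a lone punctuation mark, so an input like "two," would reach word2vec with the comma attached. The following is a sketch of one possible alternative, not the cleaning used to train the model: `strip_and_filter` and `DEMO_STOP_WORDS` are hypothetical, the stop-word set is a tiny stand-in for the nltk list, and lemmatization is left out so the example runs without any corpora.

```python
import string

# Hypothetical stand-in for nltk's English stop-word list, so the
# sketch runs without downloading any corpora.
DEMO_STOP_WORDS = {"a", "an", "the", "is", "his", "this"}


def strip_and_filter(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation from token edges, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)  # "two," -> "two"
        if word and word not in stop_words:
            tokens.append(word)
    return tokens


print(strip_and_filter("Testing, one two... three!"))
# ['testing', 'one', 'two', 'three']
```

Any change like this should match whatever cleaning was applied to the training data; otherwise the model sees inputs it was not trained on.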

***Prepare Input Volume***

In [51]:
def input_volume(user_input_string, rnn_time_steps):
  input_list = clean_user_input(user_input_string)

  # w2v volume has 3 dimensions:
  # dim 1: # of rows; here it is 1, in training it is the number of training examples
  # dim 2: # of words (RNN time steps)
  # dim 3: # of elements in the w2v encoding, 300 for the NSV_Hackfest model
  w2v = np.zeros([1, rnn_time_steps, W2V_LENGTH])

  # truncate so a long input cannot index past the end of the volume
  for w2v_idx, word in enumerate(input_list[:rnn_time_steps]):

    try:
      w2v[0][w2v_idx] = WORD2VEC_MODEL[word]
    except KeyError:
      # out-of-vocabulary word: leave this row as zeros
      # (previously the prior word's vector was reused here)
      pass

  return w2v

In [52]:
# This can go to the model

input_vol = input_volume(user_input_string, INPUT_LENGTH_LIMIT)

In [53]:
input_vol

array([[[-0.29101562,  0.02905273,  0.13671875, ...,  0.01275635,
         -0.08203125, -0.25976562],
        [ 0.0456543 , -0.14550781,  0.15625   , ..., -0.01586914,
          0.00671387, -0.00188446],
        [ 0.03173828, -0.10644531,  0.00241089, ..., -0.0625    ,
         -0.10302734,  0.02929688],
        [ 0.04931641, -0.10009766,  0.00665283, ..., -0.02478027,
         -0.15917969, -0.02282715],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]]])

In [54]:
input_vol.shape

(1, 6, 300)
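
To see the padding and out-of-vocabulary behaviour without loading the 1.5 GB vectors, the same layout can be reproduced with toy data. This is an illustrative sketch only: `TOY_VECTORS` and `toy_volume` are hypothetical, and the 3-dimensional vectors stand in for the real 300-dimensional embeddings.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the 300-dimensional
# GoogleNews vectors, so the sketch runs without the large model file.
TOY_VECTORS = {
    "testing": np.array([1.0, 0.0, 0.0]),
    "one": np.array([0.0, 1.0, 0.0]),
    "two": np.array([0.0, 0.0, 1.0]),
}


def toy_volume(words, time_steps, vectors=TOY_VECTORS, dim=3):
    """Build a (1, time_steps, dim) array, zero-padded on the right."""
    volume = np.zeros((1, time_steps, dim))
    for i, word in enumerate(words[:time_steps]):
        vec = vectors.get(word)  # None for out-of-vocabulary words
        if vec is not None:
            volume[0, i] = vec
    return volume


vol = toy_volume(["testing", "unknown", "two"], time_steps=6)
print(vol.shape)  # (1, 6, 3)
```

Row 1 (the unknown word) and rows 3 to 5 (padding) stay all zeros, which mirrors the trailing zero rows in the output above.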

***Load Model and Predict***

In [67]:
from keras.models import load_model

In [68]:
lstm_model = load_model('model.h5')

In [69]:
y_pred = lstm_model.predict(input_vol)

In [70]:
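THRESHOLD = 0.8  # same threshold used when we trained the model; it doesn't have to be this high

# binarize each row of predicted probabilities into a bit string
yp = [''.join('1' if p >= THRESHOLD else '0' for p in label) for label in y_pred]
print(yp)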

['0101']


In [71]:
def user_classification(user_input, model):

    input_vol = input_volume(user_input, INPUT_LENGTH_LIMIT)

    y_pred = model.predict(input_vol)

    # turn each row of probabilities into a bit string, e.g. '0101'
    yp = []
    for label in y_pred:
        val = ""
        for x in label:
            val = val + ('1' if x >= THRESHOLD else '0')
        yp.append(val)
    return yp
    
    

***Classification Examples***

In [72]:
user_classification(user_input_string, lstm_model)

['0101']

In [73]:
user_classification("jimmy ate his dog", lstm_model)

['0110']

In [74]:
user_classification("eating bugs", lstm_model)

['0010']
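
The bit strings above depend on the threshold, so it can be useful to see how a different cut-off changes the result. A minimal sketch, assuming the same "1 if at or above the threshold" rule as the notebook: `to_bit_string` is a hypothetical helper, and the probabilities are made up for illustration rather than taken from the real model.

```python
def to_bit_string(probabilities, threshold=0.8):
    """Map each probability to '1' (>= threshold) or '0' and join them."""
    return "".join("1" if p >= threshold else "0" for p in probabilities)


# Made-up probabilities for illustration; not output from the real model.
example = [0.10, 0.85, 0.55, 0.95]

print(to_bit_string(example))       # 0101
print(to_bit_string(example, 0.5))  # 0111
```

Lowering the threshold flips the third label on, which is the trade-off to keep in mind if 0.8 turns out to be too strict.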

***Validation TODO:***
- check the order of the w2v word vectors in the input volume