# Keras Text Classification Artifacts

###### Purpose of this notebook: 

CONTEXT: Text Classification Models

Deploying a neural network as an inference endpoint requires more than the model file: every new input must go through the same preprocessing steps that were used at training time. The purpose of this notebook is to produce all the artifacts of the neural network model needed to run inference on a new text input.

###### Needed:
1. The model file (Example: The .h5 file)
2. Tokenizer (to convert text to tokens in the same way)
3. Other parameters (the padding type and the maximum sequence length)

###### Priorities
1. The NN model need not be the best.
2. The goal is to develop an inference pipeline on Amazon Web Services. For prototyping, I chose a small dataset from Kaggle (disaster tweets) and a model with as few parameters as possible.
3. The accuracy is not expected to be the best.

## Imports

In [1]:
import re
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Embedding, Dense, Dropout, SpatialDropout1D, LSTM
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow as tf
import pickle
from tensorflow.keras.regularizers import l2
import numpy as np

import pandas as pd
pd.set_option('display.max_colwidth', None)

## Save model and tokenizer

In [3]:
PATH = "path/to/data"
df = pd.read_csv(PATH, usecols=['text', 'target'])

In [None]:
def preprocess_tweet(tweet):
    """Clean a raw tweet and return it as a list of lowercase tokens."""
    tweet = tweet.lower() # convert to lowercase
    tweet = re.sub(r"http\S+", "", tweet) # remove urls
    tweet = re.sub(r"@\w+", "", tweet) # remove mentions
    tweet = re.sub(r"#\w+", "", tweet) # remove hashtags
    tweet = re.sub(r'[^\w\s]', '', tweet) # remove punctuation
    tweet = re.sub(r'\d+', '', tweet) # remove numbers
    tweet = re.sub(r'\s+', ' ', tweet).strip() # remove extra whitespace
    
    stopwords = set([
        'ourselves', 'hers','between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very',
        'having', 'with', 'they', 'own', 'an', 'be', 'some','for', 'do', 'its', 'yours', 'such', 'into', 'of', 
        'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves',
        'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more',
        'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she',
        'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does',
        'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he',
        'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom',
        't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'
    ])
    tweet = tweet.split() # split into tokens
    tweet = [i for i in tweet if i not in stopwords] # drop stopwords
    
    return tweet
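
To see what each cleaning step contributes, the substitutions can be traced one at a time. The following is a minimal sketch that reuses the same regular expressions in the same order; `CLEANING_STEPS`, `trace_cleaning` and the sample tweet are hypothetical names and data introduced only for illustration, and stopword removal is left out.

```python
import re

# The same regex substitutions as preprocess_tweet, applied in the same order.
CLEANING_STEPS = [
    ("remove urls", r"http\S+", ""),
    ("remove mentions", r"@\w+", ""),
    ("remove hashtags", r"#\w+", ""),
    ("remove punctuation", r"[^\w\s]", ""),
    ("remove numbers", r"\d+", ""),
    ("collapse whitespace", r"\s+", " "),
]


def trace_cleaning(tweet):
    """Return (step name, text after that step) pairs for one tweet."""
    text = tweet.lower()
    trace = [("lowercase", text)]
    for name, pattern, replacement in CLEANING_STEPS:
        text = re.sub(pattern, replacement, text)
        trace.append((name, text))
    trace.append(("strip", text.strip()))
    return trace


# Hypothetical sample tweet, not taken from the dataset.
sample = "Forest FIRE near the lake #wildfire http://t.co/abc @user 2 dead!!"
for step, text in trace_cleaning(sample):
    print(f"{step:>20}: {text!r}")
```

Note that the hashtag step removes the whole tag, word included, so `#wildfire` disappears entirely rather than becoming `wildfire`.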

In [None]:
df.text = df.text.apply(preprocess_tweet)

In [None]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['text'].tolist())

sequences = tokenizer.texts_to_sequences(df['text'].tolist())
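
The tokenizer has to be saved because the word-to-index mapping is learned from the training data. The sketch below is a simplified pure-Python illustration of that idea, not the Keras implementation: words are ranked by frequency, and words whose index is not below `num_words` (or that were never seen) are dropped. `build_word_index` and `to_sequences` are hypothetical helper names.

```python
from collections import Counter


def build_word_index(tokenized_docs):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(tokenized_docs, word_index, num_words):
    """Map tokens to indices, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in doc if w in word_index and word_index[w] < num_words]
        for doc in tokenized_docs
    ]


docs = [["fire", "forest", "fire"], ["flood", "fire"]]
word_index = build_word_index(docs)
print(word_index)
print(to_sequences([["flood", "fire", "forest", "quake"]], word_index, num_words=3))
```

A different training corpus would yield a different mapping, which is why inference must reuse the fitted tokenizer instead of fitting a new one.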

### Save Tokenizer

In [None]:
with open('disaster-tweet-tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

In [None]:
x = pad_sequences(sequences, padding='post', maxlen=16)
y = np.array(df.target)
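
As an illustration of what `padding='post'` with `maxlen` does, here is a pure-Python sketch. It assumes the Keras default `truncating='pre'`, so sequences that are too long lose tokens from the front; `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end; truncate from the front when a sequence is too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


print(pad_post([[5, 2], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

Every row comes out with exactly `maxlen` entries, which is what the model's fixed-size input layer expects.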

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42, shuffle=True)

In [None]:
def build_model(vocab_size, max_len):
    input_layer = Input(shape=(max_len,))

    x = Embedding(vocab_size, 4)(input_layer)
    x = SpatialDropout1D(rate=0.1)(x)
    x = LSTM(32, return_sequences=True, activation='relu')(x)
    x = Dropout(rate=0.5)(x)
    x = LSTM(16, activation='relu')(x)
    x = Dropout(rate=0.5)(x)
    x = Dense(1, activation='sigmoid', kernel_regularizer=l2(0.01))(x)

    model = Model(input_layer, x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

model = build_model(5000, 16)
model.summary()

In [None]:
model.fit(X_train, y_train, batch_size=32, epochs=5, verbose=1, validation_split=0.15, shuffle=True)

In [None]:
model.evaluate(X_val, y_val)

### Save Model

In [None]:
model.save(filepath='disaster-tweet-model.h5')

### Save modelling params

In [None]:
params = {
    "padding_type": "post",
    "maxlen": 16
}

In [None]:
with open('disaster-tweet-modelling-params.pkl', 'wb') as f:
    pickle.dump(params, f, protocol=pickle.HIGHEST_PROTOCOL)
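
Because the parameters are a plain dictionary of strings and integers, JSON would also work here. The sketch below shows that alternative under the assumption that only JSON-compatible values are stored; the `.json` file name is hypothetical and the file is written to a temporary directory.

```python
import json
import os
import tempfile

params = {"padding_type": "post", "maxlen": 16}

# Hypothetical file name; a temporary directory keeps the sketch side-effect free.
path = os.path.join(tempfile.mkdtemp(), "disaster-tweet-modelling-params.json")

with open(path, "w") as f:
    json.dump(params, f)

with open(path) as f:
    restored = json.load(f)

print(restored == params)
```

A JSON file is human-readable and, unlike unpickling, loading it cannot execute arbitrary code, which may matter if the artifact is fetched from shared storage.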

## Load Model, Tokenizer and do inference

In [None]:
del model, tokenizer, params

### Load Model

In [None]:
model = load_model("disaster-tweet-model.h5")

### Load Tokenizer

In [None]:
with open('disaster-tweet-tokenizer.pkl', 'rb') as handle:
    tokenizer = pickle.load(handle)

### Load modelling params

In [None]:
with open('disaster-tweet-modelling-params.pkl', 'rb') as handle:
    params = pickle.load(handle)

## Inference

In [None]:
new_text = "no disaster warning; no need to seek shelter immediately."
new_text = preprocess_tweet(new_text)

tokenized_text = tokenizer.texts_to_sequences([new_text])
tokenized_text = pad_sequences(tokenized_text, maxlen=params["maxlen"], padding=params["padding_type"])

In [None]:
model.predict(tokenized_text) # sigmoid output: predicted probability of the positive class (target = 1)

In [None]:
model.evaluate(X_val, y_val)
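
For an endpoint, the load-then-predict steps above could be wrapped in a single function. This is a minimal sketch, not a finished handler: `predict_texts` and the `threshold` of 0.5 are assumptions introduced here, and all dependencies are passed in as arguments so the function can be exercised without TensorFlow.

```python
def predict_texts(texts, preprocess, tokenizer, pad, model, params, threshold=0.5):
    """Run raw texts through the full pipeline; return (probability, label) pairs."""
    tokens = [preprocess(text) for text in texts]
    sequences = tokenizer.texts_to_sequences(tokens)
    padded = pad(sequences, maxlen=params["maxlen"], padding=params["padding_type"])
    probabilities = model.predict(padded)

    results = []
    for row in probabilities:
        p = float(row[0])  # one sigmoid output per input text
        results.append((p, int(p >= threshold)))
    return results


# Intended usage with the objects loaded in this notebook (not executed here):
# predict_texts(["forest fire near the lake"], preprocess_tweet,
#               tokenizer, pad_sequences, model, params)
```

Passing `params` through, rather than hardcoding `maxlen` and the padding type, keeps inference aligned with whatever values were saved at training time.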