<a href="https://colab.research.google.com/github/imaaditya-stack/SpamFilterForQuoraQuestions-DeepLearning/blob/master/SpamFilterTrain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook contains Training a Long Short Term Memory Neural Network LSTM from scratch to classify between **"spam"** and **not spam**" examples of Quora Questions. <br>
It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
#Importing Packages
import pandas as pd
import numpy as np

#Scikit-Learn
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight

#Keras
import keras
import keras.backend as K
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.models import Model
from keras.initializers import he_uniform
from keras.optimizers import Adam
from keras import callbacks

#Disables deprecation warnings
from tensorflow.python.util import deprecation
deprecation._PRINT_DEPRECATION_WARNINGS = False

In [0]:
#Reading data
df = pd.read_csv("/content/drive/My Drive/SpamFilterCleanedData.csv",sep=',')

In [0]:
df.shape

(1306122, 2)

In [0]:
df.head()

Unnamed: 0,Question_text_modified,target
0,quebec nationalists see province nation,0
1,adopt dog would encourage people shop,0
2,velocity affect time space geometry,0
3,otto von guericke use magdeburg hemispheres,0
4,convert montra helicon mountain bike change tyres,0


In [0]:
#Checking for Nan values
df.isnull().any()

Question_text_modified     True
target                    False
dtype: bool

In [0]:
df.isnull().sum()

Question_text_modified    404
target                      0
dtype: int64

In [0]:
#Removing Nans
df.dropna(axis=0,inplace=True)

In [0]:
df.isnull().any().any()

False

As the dataset is cleaned earlier we will again check some basic features.

In [0]:
#Number of words
df["word_count"] = df["Question_text_modified"].apply(lambda x: len(str(x).split(" ")))
df[["Question_text_modified","word_count"]].head()

Unnamed: 0,Question_text_modified,word_count
0,quebec nationalists see province nation,5
1,adopt dog would encourage people shop,6
2,velocity affect time space geometry,5
3,otto von guericke use magdeburg hemispheres,6
4,convert montra helicon mountain bike change tyres,7


In [0]:
max(df["word_count"]), min(df["word_count"])

(53, 1)

In [0]:
#Number of characters
df['char_count'] = df['Question_text_modified'].str.len() ## this also includes spaces
df[['Question_text_modified','char_count']].head()

Unnamed: 0,Question_text_modified,char_count
0,quebec nationalists see province nation,39.0
1,adopt dog would encourage people shop,37.0
2,velocity affect time space geometry,35.0
3,otto von guericke use magdeburg hemispheres,43.0
4,convert montra helicon mountain bike change tyres,49.0


In [0]:
max(df["char_count"]), min(df["char_count"])

(335.0, 1.0)

In [0]:
#Number of special words
df['hastags'] = df['Question_text_modified'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
df[['Question_text_modified','hastags']].head()

Unnamed: 0,Question_text_modified,hastags
0,quebec nationalists see province nation,0
1,adopt dog would encourage people shop,0
2,velocity affect time space geometry,0
3,otto von guericke use magdeburg hemispheres,0
4,convert montra helicon mountain bike change tyres,0


In [0]:
max(df["hastags"]), min(df["hastags"])

(0, 0)

In [0]:
#Number of numerics
df['numerics'] = df['Question_text_modified'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
df[['Question_text_modified','numerics']].head()

Unnamed: 0,Question_text_modified,numerics
0,quebec nationalists see province nation,0
1,adopt dog would encourage people shop,0
2,velocity affect time space geometry,0
3,otto von guericke use magdeburg hemispheres,0
4,convert montra helicon mountain bike change tyres,0


In [0]:
max(df["numerics"]), min(df["numerics"])

(2, 0)

In [0]:
df["Question_text_modified"][df["numerics"]==2]

687189    evaluate limit ⁴x ⁴ ³x ³ x approach give function
853043                               remainder ²²² ³ divide
886025                                         solve ³ ⁿ² ²
Name: Question_text_modified, dtype: object

I do not know how to clean such data so i kept it as it is.

### Splitting dataset into Training and Validation set.
### The validation set will be used for tuning the model performance

In [115]:
X = df["Question_text_modified"]
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)

(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

((1044574,), (261144,), (1044574,), (261144,))

#Word Embedding
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

In [0]:
def tokenization(data):

  """This function creates the vocabulary index based on word frequency and 
  Transforms each text in texts to a sequence of integers and 
  also returns vocabulary length
  """

  tok = Tokenizer(char_level=False,split=' ')
  #this creates the dictionary
  tok.fit_on_texts(data)
  #this transforms the texts in to sequences of indices
  return tok.texts_to_sequences(data), len(tok.index_word.keys())

def padding(sequences_data,maxlen):

  """This function pads variable length sequences.The default padding value is 0.0"""

  return sequence.pad_sequences(sequences_data,maxlen=maxlen)

In [0]:
#Let's check how many maximum words on an average are there in the dataset
np.quantile(df["word_count"],0.95)

13.0

In [0]:
#maxlen = 13
maxlen = 15
sequences_train, vocab_len = tokenization(X_train)
sequences_train_matrix = padding(sequences_train,maxlen)

In [118]:
vocab_len

181459

### The training dataset contains **181459** Unique Vocabulary

In [0]:
sequences_test, _ = tokenization(X_test)
sequences_test_matrix = padding(sequences_test,maxlen)

#Building LSTM Architecture

In [0]:
def build_model(input,LSTM_units,nb_classes,finalAct='sigmoid'):

    """This function Builds the LSTM Model using keras Functional API"""

    #Defining basic parameters
    embedding_input_dim = vocab_len
    embedding_output_dim = 200
    initializer = he_uniform(seed=200)
    
    #Input Layer, shape=15.0
    inputs = Input(name='inputs',shape=[input])
    #Embedding Layer
    layer = Embedding(embedding_input_dim+1,embedding_output_dim,input_length=input,
                      mask_zero=True,embeddings_initializer=initializer)(inputs)
    #LSTM Layer
    layer = LSTM(LSTM_units,kernel_initializer=initializer)(layer)
    #Classifier
    layer = Dense(units=64,name='FC1',kernel_initializer=initializer)(layer)
    layer = Activation('relu')(layer)
    #Dropout
    layer = Dropout(0.5)(layer)
    layer = Dense(nb_classes,name='Output_layer',kernel_initializer=initializer)(layer)
    #Final Output Layer
    layer = Activation(finalAct)(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

In [121]:
model = build_model(input=15,LSTM_units=100,nb_classes=1,finalAct='sigmoid')
model.summary()

Model: "model_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          (None, 15)                0         
_________________________________________________________________
embedding_19 (Embedding)     (None, 15, 200)           36292000  
_________________________________________________________________
lstm_19 (LSTM)               (None, 100)               120400    
_________________________________________________________________
FC1 (Dense)                  (None, 64)                6464      
_________________________________________________________________
activation_37 (Activation)   (None, 64)                0         
_________________________________________________________________
dropout_22 (Dropout)         (None, 64)                0         
_________________________________________________________________
Output_layer (Dense)         (None, 1)                 65 

### Compiling the model

In [0]:
#Using Adam optimizer with an initial learning rate of 0.0001
opt=Adam(lr=0.001, beta_1=0.91, beta_2=0.999, epsilon=1e-08, decay=0)
#Compile the model
model.compile(loss='binary_crossentropy',optimizer=opt,metrics=['accuracy'])

### Defining Callbacks

In [0]:
def myCallbacks():

    """This function returns a list of callbacks"""

    #Model Checkpoint
    file_path = r"/content/drive/My Drive/spamModel.h5"
    checkpoint = callbacks.ModelCheckpoint(file_path,monitor='val_acc',verbose=1,save_best_only=True,mode='auto')

    #ReduceLROnPlateau
    reduce_lr = callbacks.ReduceLROnPlateau(monitor='val_loss',factor=0.1,patience=5,min_lr=1e-30,cooldown=2,verbose=1)

    # EarlyStopping
    es = callbacks.EarlyStopping(monitor='val_loss',mode='min',verbose=1,patience=10)

    return [checkpoint,reduce_lr, es]

# Training model using Keras .fit_generator()

In [0]:
def train_batch_generator(features, labels, batch_size):
  # Create empty arrays to contain batch of features and labels#
  batch_features = np.zeros((batch_size, 15, ))
  batch_labels = np.zeros((batch_size,))
  while True:
    for i in range(batch_size):
      # choose random index in features
      index = np.random.choice(len(features),1)
      batch_features[i] = features[index]
      batch_labels[i] = labels[index]
    yield batch_features, batch_labels

In [0]:
def validation_batch_generator(features, labels, batch_size):
  # Create empty arrays to contain batch of features and labels#
  batch_features = np.zeros((batch_size, 15, ))
  batch_labels = np.zeros((batch_size,))
  while True:
    for i in range(batch_size):
      # choose random index in features
      index = np.random.choice(len(features),1)
      batch_features[i] = features[index]
      batch_labels[i] = labels[index]
    yield batch_features, batch_labels

In [128]:
#Defining class weights as the dataset is heavily imbalance
class_weight = class_weight.compute_class_weight('balanced',np.unique(y_train),y_train)
class_weight_dict = dict(enumerate(class_weight))
class_weight_dict

{0: 0.532981269165812, 1: 8.08006002568109}

# Training

In [0]:
#Training batch size
tbs = 512
#validation batch size
vbs = 64
training_generator = train_batch_generator(sequences_train_matrix, np.asarray(y_train),tbs)
validation_generator = validation_batch_generator(sequences_test_matrix, np.asarray(y_test),vbs)

In [0]:
epochs = 30
steps_per_epoch = int(sequences_train_matrix.shape[0]/512)
validation_steps = int(sequences_test_matrix.shape[0]/64)

In [131]:
history = model.fit_generator(generator=training_generator,steps_per_epoch=steps_per_epoch,
    epochs=epochs,class_weight=class_weight_dict,
    validation_data=validation_generator,validation_steps=validation_steps,
    callbacks=myCallbacks(),verbose=1)

Epoch 1/30

Epoch 00001: val_acc improved from -inf to 0.83351, saving model to /content/drive/My Drive/spamModel.h5
Epoch 2/30

Epoch 00002: val_acc improved from 0.83351 to 0.83864, saving model to /content/drive/My Drive/spamModel.h5
Epoch 3/30

Epoch 00003: val_acc improved from 0.83864 to 0.84446, saving model to /content/drive/My Drive/spamModel.h5
Epoch 4/30

Epoch 00004: val_acc improved from 0.84446 to 0.85939, saving model to /content/drive/My Drive/spamModel.h5
Epoch 5/30

Epoch 00005: val_acc did not improve from 0.85939
Epoch 6/30

Epoch 00006: val_acc did not improve from 0.85939

Epoch 00006: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 7/30

Epoch 00007: val_acc did not improve from 0.85939
Epoch 8/30

Epoch 00008: val_acc did not improve from 0.85939
Epoch 9/30

Epoch 00009: val_acc did not improve from 0.85939
Epoch 10/30

Epoch 00010: val_acc did not improve from 0.85939
Epoch 11/30

Epoch 00011: val_acc improved from 0.85939 to 0.85967, 