<a href="https://colab.research.google.com/github/imaaditya-stack/SpamFilterForQuoraQuestions-DeepLearning/blob/master/SpamFilterTrain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook contains Training a Long Short Term Memory Neural Network LSTM from scratch to classify between **"spam"** and **not spam**" examples of Quora Questions. <br>
It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [2]:
%tensorflow_version 2.x

TensorFlow 2.x selected.


In [3]:
#Importing Packages
import pandas as pd
import numpy as np
import keras
from sklearn.model_selection import train_test_split
from nltk import word_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.models import Model

Using TensorFlow backend.


In [0]:
#Reading data
df = pd.read_csv("/content/drive/My Drive/SpamFilterCleanedData.csv",sep=',')

In [0]:
df.shape

(1306122, 2)

In [0]:
df.head()

Unnamed: 0,Question_text_modified,target
0,quebec nationalists see province nation,0
1,adopt dog would encourage people shop,0
2,velocity affect time space geometry,0
3,otto von guericke use magdeburg hemispheres,0
4,convert montra helicon mountain bike change tyres,0


In [0]:
#Checking for Nan values
df.isnull().any()

Question_text_modified     True
target                    False
dtype: bool

In [0]:
df.isnull().sum()

Question_text_modified    404
target                      0
dtype: int64

In [0]:
#Removing Nans
df.dropna(axis=0,inplace=True)

In [0]:
df.isnull().any().any()

False

As the dataset is cleaned earlier we will again check some basic features.

In [0]:
#Number of words
df["word_count"] = df["Question_text_modified"].apply(lambda x: len(str(x).split(" ")))
df[["Question_text_modified","word_count"]].head()

Unnamed: 0,Question_text_modified,word_count
0,quebec nationalists see province nation,5
1,adopt dog would encourage people shop,6
2,velocity affect time space geometry,5
3,otto von guericke use magdeburg hemispheres,6
4,convert montra helicon mountain bike change tyres,7


In [0]:
max(df["word_count"]), min(df["word_count"])

(53, 1)

In [0]:
#Number of characters
df['char_count'] = df['Question_text_modified'].str.len() ## this also includes spaces
df[['Question_text_modified','char_count']].head()

Unnamed: 0,Question_text_modified,char_count
0,quebec nationalists see province nation,39.0
1,adopt dog would encourage people shop,37.0
2,velocity affect time space geometry,35.0
3,otto von guericke use magdeburg hemispheres,43.0
4,convert montra helicon mountain bike change tyres,49.0


In [0]:
max(df["char_count"]), min(df["char_count"])

(335.0, 1.0)

In [0]:
#Number of special words
df['hastags'] = df['Question_text_modified'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
df[['Question_text_modified','hastags']].head()

Unnamed: 0,Question_text_modified,hastags
0,quebec nationalists see province nation,0
1,adopt dog would encourage people shop,0
2,velocity affect time space geometry,0
3,otto von guericke use magdeburg hemispheres,0
4,convert montra helicon mountain bike change tyres,0


In [0]:
max(df["hastags"]), min(df["hastags"])

(0, 0)

In [0]:
#Number of numerics
df['numerics'] = df['Question_text_modified'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
df[['Question_text_modified','numerics']].head()

Unnamed: 0,Question_text_modified,numerics
0,quebec nationalists see province nation,0
1,adopt dog would encourage people shop,0
2,velocity affect time space geometry,0
3,otto von guericke use magdeburg hemispheres,0
4,convert montra helicon mountain bike change tyres,0


In [0]:
max(df["numerics"]), min(df["numerics"])

(2, 0)

In [0]:
#I do not know how to clean such data so i kept it as it is.
df["Question_text_modified"][df["numerics"]==2]

687189    evaluate limit ⁴x ⁴ ³x ³ x approach give function
853043                               remainder ²²² ³ divide
886025                                         solve ³ ⁿ² ²
Name: Question_text_modified, dtype: object

### Splitting dataset into training and validation set which will be used for tuning the model performance

In [111]:
X = df["Question_text_modified"]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=2)
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

((1044574,), (261144,), (1044574,), (261144,))

#Word Embedding
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

In [0]:
def tokenization(data):
  """This function creates the vocabulary index based on word frequency and 
  Transforms each text in texts to a sequence of integers and 
  also returns vocabluary length
  """
  tok = Tokenizer(char_level=False,split=' ')
  tok.fit_on_texts(data)
  return tok.texts_to_sequences(data), len(tok.index_word.keys())

def padding(sequences_data,maxlen):
  """This function pads variable length sequences.The default padding value is 0.0"""
  return sequence.pad_sequences(sequences_data,maxlen=maxlen)

In [0]:
#Let's check how many maximum words are there in the dataset
np.quantile(df["word_count"],0.95)

13.0

In [0]:
maxlen = 13
sequences_train, vocab_len = tokenization(X_train)
sequences_train_matrix = padding(sequences_train,maxlen)

In [0]:
sequences_train_matrix.shape

(1044574, 13)

In [0]:
sequences_test, vocab_len = tokenization(X_test)
sequences_test_matrix = padding(sequences_test,maxlen)

In [0]:
sequences_test_matrix.shape

(261144, 13)

In [135]:
vocab_len

80727

#Building LSTM Architecture

In [0]:
def build_model(input,LSTM_units,dropout,classes,finalAct='sigmoid'):
    """This function Builds the LSTM Model using keras Functional API"""
    inputs = Input(name='inputs',shape=[input])
    layer = Embedding(vocab_len+1,500,input_length=input,mask_zero=True)(inputs)
    layer = LSTM(LSTM_units)(layer)
    layer = Dense(units=64,name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(dropout)(layer)
    layer = Dense(classes,name='Output_layer')(layer)
    layer = Activation(finalAct)(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

In [146]:
model = build_model(input=13,LSTM_units=64,dropout=0.5,classes=1,finalAct='sigmoid')
model.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 13)]              0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 13, 500)           40364000  
_________________________________________________________________
lstm_3 (LSTM)                (None, 64)                144640    
_________________________________________________________________
FC1 (Dense)                  (None, 64)                4160      
_________________________________________________________________
activation_6 (Activation)    (None, 64)                0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 64)                0         
_________________________________________________________________
Output_layer (Dense)         (None, 1)                 65  

In [0]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

In [0]:
training_generator = DataGenerator(sequences_train_matrix, np.asarray(y_train).reshape(-1), **params)
validation_generator = DataGenerator(sequences_test_matrix, np.asarray(y_test).reshape(-1), **params)

In [0]:
# #Custom data generator
# def generator(data, classes, batch_size):
#     """
#     data : data
#     classes : y_train
#     batch_size: integer value
#     """
#     index = 0
#     while True:
#         for index,sample in enumerate(data):
#           if index == data.shape[0]:
#             index = 0
#           yield data[index*batch_size:(index+1)*batch_size], classes[index*batch_size:(index+1)*batch_size]

In [0]:
def trainbatchgenerator(features, labels, batch_size):
 # Create empty arrays to contain batch of features and labels#
 batch_features = np.zeros((batch_size, 13, ))
 batch_labels = np.zeros((batch_size,))
 while True:
   for i in range(batch_size):
     # choose random index in features
     index = np.random.choice(len(features),1)
     batch_features[i] = features[index]
     batch_labels[i] = labels[index]
   yield batch_features, batch_labels

In [0]:
def validationbatchgenerator(features, labels, batch_size):
 # Create empty arrays to contain batch of features and labels#
 batch_features = np.zeros((batch_size, 13, ))
 batch_labels = np.zeros((batch_size,))
 while True:
   for i in range(batch_size):
     # choose random index in features
     index = np.random.choice(len(features),1)
     batch_features[i] = features[index]
     batch_labels[i] = labels[index]
   yield batch_features, batch_labels

In [132]:
#Defining class weights as the dataset is heavily imbalance
from sklearn.utils import class_weight
class_weight = class_weight.compute_class_weight('balanced',np.unique(y_train),y_train)
class_weight_dict = dict(enumerate(class_weight))
class_weight_dict

{0: 0.532981269165812, 1: 8.08006002568109}

# Training

In [0]:
training_generator = trainbatchgenerator(sequences_train_matrix, np.asarray(y_train),512)
validation_generator = validationbatchgenerator(sequences_test_matrix, np.asarray(y_test),512)

In [151]:
history = model.fit_generator(
    generator=training_generator,
    steps_per_epoch=int(sequences_train_matrix.shape[0]/512),
    epochs=5,
    class_weight=class_weight_dict,
    validation_data=validation_generator,
    validation_steps=int(sequences_test_matrix.shape[0]/512))

  ...
    to  
  ['...']
  ...
    to  
  ['...']
Train for 2040 steps, validate for 510 steps
Epoch 1/5
   1/2040 [..............................] - ETA: 1:45:01

InvalidArgumentError: ignored