<a href="https://colab.research.google.com/github/nazirumar/NLP/blob/main/classifying_text_with_deep_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Recipe 6-2. Classifying Text with Deep Learning**

In this recipe, let us build a text classifier using deep learning approaches.

**Problem**

We want to build a text classification model using CNN, RNN, and LSTM.

**Solution**

The approach and NLP pipeline remain the same as discussed
earlier. The only change is that we build the models with deep learning
algorithms instead of classical machine learning algorithms.

**How It Works**

Let’s follow the steps in this section to build the email classifier using the
deep learning approaches.

**Step 2-1 Understanding/defining business problem**

Email classification (spam or ham). We need to classify each email as spam
or ham based on its content.

**Step 2-2 Identifying potential data sources, collection, and understanding**

Using the same data used in Recipe 4-6 from Chapter 4:

In [2]:
import pandas as pd


In [5]:
# read file
file_content = pd.read_csv('/content/drive/MyDrive/NLP/spam.csv', encoding='latin-1')
file_content

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


**Step 2-3 Text preprocessing**

Let’s preprocess the data:

*Import library*

In [37]:
from nltk.corpus import stopwords
import re  # needed by the regex cleanup later in this step
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
from keras.utils import to_categorical




*Remove stopwords*

In [7]:
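stop = set(stopwords.words('english'))  # set membership is faster than list lookup
# note: tokens are compared as-is, so capitalized stop words such as "The" are kept
file_content['v2'] = file_content['v2'].apply(lambda text: " ".join(w for w in text.split() if w not in stop))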

LookupError: ignored

In [8]:
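# The cell above raises LookupError until the stopwords corpus is available;
# download it once here, then re-run that cell.
import nltk
nltk.download('stopwords')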

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True
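
The stop-word filter above compares tokens exactly as they appear, while the NLTK list is lowercase. A minimal sketch of a case-insensitive variant follows. The small `STOP_WORDS` set and the `remove_stopwords` helper are hypothetical stand-ins so the example runs without downloading the corpus; they are not part of the original notebook.

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
STOP_WORDS = {"a", "the", "to", "in", "is", "only", "until"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop stop words, lowercasing each token only for the comparison."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

# Made-up sentence for illustration
print(remove_stopwords("The offer is valid only in June"))  # offer valid June
```

Lowercasing only for the comparison keeps the original casing of the words that survive, which leaves the later lowercasing step unchanged.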

*Delete unwanted columns*

In [9]:
Email_Data = file_content[['v1', 'v2']].copy()  # copy so later column assignments do not write to a view

*Rename column names*

In [10]:
Email_Data = Email_Data.rename(columns={"v1":"Target", "v2":"Email"})
Email_Data.head()

Unnamed: 0,Target,Email
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


*Delete punctuation, convert the text to lowercase, and collapse
double spaces*

In [11]:
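# strip the listed punctuation characters and lowercase the text
Email_Data['Email'] = Email_Data['Email'].apply(lambda x:
re.sub('[!@#$:).;,?&]', '', x.lower()))
# collapse repeated spaces into a single space
Email_Data['Email'] = Email_Data['Email'].apply(lambda x:
re.sub(r' +', ' ', x))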
Email_Data['Email'].head(5)

0    go until jurong point crazy available only in ...
1                              ok lar joking wif u oni
2    free entry in 2 a wkly comp to win fa cup fina...
3          u dun say so early hor u c already then say
4    nah i don't think he goes to usf he lives arou...
Name: Email, dtype: object
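
The two cleanup passes above can be sketched as one reusable function. The name `clean_text` is hypothetical; the punctuation character class is the one used in the cell above, and the example string is row 1 of the data with a doubled space added for illustration.

```python
import re

PUNCT_RE = re.compile(r"[!@#$:).;,?&]")  # same character class as the cell above
SPACE_RE = re.compile(r" +")             # one or more consecutive spaces

def clean_text(text):
    """Lowercase, strip the listed punctuation, and collapse repeated spaces."""
    text = PUNCT_RE.sub("", text.lower())
    return SPACE_RE.sub(" ", text).strip()

print(clean_text("Ok lar...  Joking wif u oni..."))  # ok lar joking wif u oni
```

Under these assumptions it could be applied with `Email_Data['Email'].apply(clean_text)`.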

*Separate the text (input) and the target classes*

In [12]:
list_sentences_rawdata = Email_Data["Email"].fillna("_na_").values
list_classes = ["Target"]
target = Email_Data[list_classes].values
To_Process=Email_Data[['Email', 'Target']]

**Step 2-4 Data preparation for model building**

*Now we prepare the data*:


*Train and test split with 80:20 ratio*


In [16]:
train, test = train_test_split(To_Process, test_size=0.2)


*Define the sequence lengths, max number of words and
embedding dimensions*

*Sequence length of each sentence. If more, truncate. If less,
pad with zeros*

In [13]:
MAX_SEQUENCE_LENGTH = 300


**Top 20000 frequently occurring words**

In [14]:
MAX_NB_WORDS = 20000

**Get the frequently occurring words**

In [21]:
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(train.Email)
train_sequences = tokenizer.texts_to_sequences(train.Email)
test_sequences = tokenizer.texts_to_sequences(test.Email)

*dictionary containing words and their index*

In [22]:
word_index = tokenizer.word_index
# print(tokenizer.word_index)
# total words in the corpus
print('Found %s unique tokens.' % len(word_index))

Found 8389 unique tokens.


In [28]:
# pad or truncate every training sequence to MAX_SEQUENCE_LENGTH tokens
train_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)

In [29]:
# pad or truncate every test sequence to MAX_SEQUENCE_LENGTH tokens
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

In [30]:
print(train_data.shape)
print(test_data.shape)

(4457, 300)
(1115, 300)
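
To make the tokenizing and padding steps concrete, here is a rough NumPy sketch of what they do, assuming the Keras defaults: word indices start at 1 and are ordered by frequency, only indices below `num_words` are kept, and padding and truncation both happen at the front of the sequence. The helper names and the two example texts are hypothetical, and the sketch skips the lowercasing and punctuation filtering that the real `Tokenizer` also performs.

```python
from collections import Counter
import numpy as np

def fit_word_index(texts):
    """Map each word to an integer rank: 1 = most frequent. 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping unknown words and ranks >= num_words."""
    return [[word_index[w] for w in t.split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros, or keep only the last maxlen tokens."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]
        if seq:  # skip empty sequences; the row stays all zeros
            out[row, -len(seq):] = seq
    return out

texts = ["free entry free prize", "ok ok ok see you"]  # made-up examples
word_index = fit_word_index(texts)
padded = pad_pre(texts_to_sequences(texts, word_index, num_words=20000), maxlen=6)
print(padded.shape)  # (2, 6)
```

With `maxlen=300` the same logic yields the `(4457, 300)` and `(1115, 300)` shapes printed above, one row per email.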


In [31]:
train_labels = train['Target']
test_labels = test['Target']

In [35]:
le = LabelEncoder()
le.fit(train_labels)

train_labels = le.transform(train_labels)
test_labels = le.transform(test_labels)

print(le.classes_)
print(np.unique(train_labels, return_counts=True))
print(np.unique(test_labels, return_counts=True))

['ham' 'spam']
(array([0, 1]), array([3860,  597]))
(array([0, 1]), array([965, 150]))
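
The encoding and the class counts printed above can be reproduced with plain Python. The sketch below mimics the sorted-classes-to-integers mapping on a made-up label list, then uses the counts from the output above to show how imbalanced the data is. The variable names are illustrative only.

```python
labels = ["ham", "spam", "ham", "ham"]  # made-up labels, not the real data

# LabelEncoder-style mapping: sorted unique classes -> consecutive integers
classes = sorted(set(labels))
class_to_id = {c: i for i, c in enumerate(classes)}
encoded = [class_to_id[c] for c in labels]
print(classes, encoded)  # ['ham', 'spam'] [0, 1, 0, 0]

# Class balance, using the counts printed by the cell above
train_counts = {"ham": 3860, "spam": 597}
test_counts = {"ham": 965, "spam": 150}
spam_share_train = train_counts["spam"] / sum(train_counts.values())
majority_baseline = test_counts["ham"] / sum(test_counts.values())
print(round(spam_share_train, 3), round(majority_baseline, 3))
```

Only about 13% of the training emails are spam, so a model that always predicts ham would already score roughly 87% accuracy on the test set. This is why the per-class precision and recall reported later are more informative than accuracy alone.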


In [38]:
# one-hot encode the integer labels (0 -> [1, 0], 1 -> [0, 1])
labels_train = to_categorical(np.asarray(train_labels))
labels_test = to_categorical(np.asarray(test_labels))
print('Shape of data tensor:', train_data.shape)
print('Shape of label tensor:', labels_train.shape)
print('Shape of label tensor:', labels_test.shape)

Shape of data tensor: (4457, 300)
Shape of label tensor: (4457, 2)
Shape of label tensor: (1115, 2)
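
As a sketch of what the one-hot step produces, the same encoding can be written with NumPy alone. The helper name `one_hot` is hypothetical, and the labels are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class ids into rows of an identity matrix."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

encoded = one_hot([0, 1, 0], num_classes=2)
print(encoded)
# argmax along the class axis recovers the integer ids
decoded = encoded.argmax(axis=1)
```

Each label becomes a row with a single 1 in the column of its class, which is the format a two-unit softmax output layer is trained against.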


In [39]:
EMBEDDING_DIM = 100
print(MAX_SEQUENCE_LENGTH)

300


**Step 2-5 Model building and predicting**


*We build models with several deep learning architectures (CNN, RNN,
LSTM, and Bidirectional LSTM) and compare their performance using
precision, recall, and F1-score.
We can now define our CNN model.
It stacks two 1D convolution layers with 128 filters each (kernel size 5),
each followed by max pooling, dropout with a probability of 0.5, and batch
normalization. A dense layer with 128 units follows, and the output layer is a
dense layer using the softmax activation function to output a probability
for each class*

In [44]:
# Import Libraries
import sys, os, re, csv, codecs, numpy as np, pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences, to_categorical
from keras.layers import (Dense, Input, LSTM, Embedding, Dropout, Activation,
                          Bidirectional, GlobalMaxPool1D, Conv1D, SimpleRNN,
                          Flatten, BatchNormalization, MaxPooling1D)
from keras.models import Model, Sequential
from keras import initializers, regularizers, constraints, optimizers, layers

In [46]:
print('Training CNN 1D model.')

model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
 EMBEDDING_DIM,
 input_length=MAX_SEQUENCE_LENGTH
 ))
model.add(Dropout(0.5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
 optimizer='rmsprop',
 metrics=['acc'])

Training CNN 1D model.
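
To see how the sequence length shrinks through the convolution and pooling layers defined above, the arithmetic can be traced by hand. This is a minimal sketch assuming the Keras defaults (`padding='valid'` and stride 1 for `Conv1D`, and a pool stride equal to the pool size for `MaxPooling1D`). The helper names are hypothetical.

```python
def conv1d_valid_len(n, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (n - kernel_size) // stride + 1

def maxpool1d_len(n, pool_size):
    """Output length of max pooling with stride equal to pool_size, no padding."""
    return (n - pool_size) // pool_size + 1

n = 300                      # MAX_SEQUENCE_LENGTH
n = conv1d_valid_len(n, 5)   # 296 after the first Conv1D
n = maxpool1d_len(n, 5)      # 59 after the first MaxPooling1D
n = conv1d_valid_len(n, 5)   # 55 after the second Conv1D
n = maxpool1d_len(n, 5)      # 11 after the second MaxPooling1D
flat = n * 128               # 1408 features after Flatten (128 filters)
dense_params = flat * 128 + 128  # weights plus biases of Dense(128)
print(n, flat, dense_params)
```

Under these assumptions, `model.summary()` should report the same shapes; it is the quickest way to confirm them on the actual model.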


*We now fit the model to the data for 5 epochs with a batch size of
64 samples. The test set is also used as the validation data here.*

In [47]:
model.fit(train_data, labels_train,
 batch_size=64,
 epochs=5,
 validation_data=(test_data, labels_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fabfb77b7c0>

In [48]:
#predictions on test data
predicted=model.predict(test_data)
predicted



array([[0.54133695, 0.45866308],
       [0.54365015, 0.4563498 ],
       [0.34789062, 0.6521094 ],
       ...,
       [0.5350634 , 0.4649366 ],
       [0.5396483 , 0.4603517 ],
       [0.53957343, 0.46042654]], dtype=float32)

In [49]:
#model evaluation
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score

In [51]:
precision, recall, fscore, support = score(labels_test,
predicted.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test,
predicted.round()))

precision: [0.96981891 0.99173554]
recall: [0.99896373 0.8       ]
fscore: [0.9841756  0.88560886]
support: [965 150]
############################
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       965
           1       0.99      0.80      0.89       150

   micro avg       0.97      0.97      0.97      1115
   macro avg       0.98      0.90      0.93      1115
weighted avg       0.97      0.97      0.97      1115
 samples avg       0.97      0.97      0.97      1115
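
Because the labels are one-hot encoded, scikit-learn treats the rounded predictions as a multilabel problem, which is why the report contains `micro avg` and `samples avg` rows. An alternative, sketched here with NumPy only, is to take the arg max of each probability row to get a single class id per email. The three probability rows are the first ones printed above; the helper `f1` is hypothetical and simply checks the reported spam F1-score against the reported precision and recall.

```python
import numpy as np

# First three rows of `predicted` as printed above
probs = np.array([[0.54133695, 0.45866308],
                  [0.54365015, 0.4563498],
                  [0.34789062, 0.6521094]])
class_names = np.array(["ham", "spam"])  # order of le.classes_

pred_ids = probs.argmax(axis=1)          # most probable class per row
pred_names = class_names[pred_ids]
print(pred_ids, pred_names)              # [0 0 1] ['ham' 'ham' 'spam']

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

spam_f1 = f1(0.99173554, 0.8)            # reported spam precision and recall
print(round(spam_f1, 4))                 # 0.8856
```

Passing `labels_test.argmax(axis=1)` and `predicted.argmax(axis=1)` to `classification_report` would give a single-label report with an overall accuracy row, if that view is preferred.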



**We can now define our RNN model**

In [53]:
#import library
from keras.layers import SimpleRNN

In [54]:
#model training
print('Training SIMPLERNN model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
 EMBEDDING_DIM,
 input_length=MAX_SEQUENCE_LENGTH
 ))
model.add(SimpleRNN(2))  # input shape is inferred from the Embedding layer
model.add(Dense(2,activation='softmax'))

model.compile(loss = 'binary_crossentropy',
optimizer='adam',metrics = ['accuracy'])

Training SIMPLERNN model.
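
For intuition about what the `SimpleRNN(2)` layer computes, here is a minimal NumPy sketch of the recurrence, assuming the default `tanh` activation: at every time step the new state is `tanh(x_t W + h_prev U + b)`, and only the last state is passed on. The weights are random placeholders rather than trained values, and the function name is hypothetical.

```python
import numpy as np

def simple_rnn_last_state(x, W, U, b):
    """Run a tanh recurrence over x (timesteps, input_dim) and return the final state."""
    h = np.zeros(U.shape[0])
    for x_t in x:
        h = np.tanh(x_t @ W + h @ U + b)
    return h

rng = np.random.default_rng(0)
timesteps, input_dim, units = 300, 100, 2     # sequence length, EMBEDDING_DIM, SimpleRNN units
x = rng.normal(size=(timesteps, input_dim))   # stands in for one embedded email
W = rng.normal(scale=0.1, size=(input_dim, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

h_last = simple_rnn_last_state(x, W, U, b)
n_params = W.size + U.size + b.size           # 2 * (100 + 2 + 1) = 206
print(h_last.shape, n_params)                 # (2,) 206
```

With only two units, the whole email is summarized in two numbers before the softmax layer, so this model is deliberately tiny.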


In [55]:
model.fit(train_data, labels_train,
 batch_size=16,
 epochs=5,
 validation_data=(test_data, labels_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fabfc1bb760>

In [56]:
# prediction on test data
predicted_Srnn=model.predict(test_data)
predicted_Srnn



array([[9.9112761e-01, 8.8724168e-03],
       [9.9896932e-01, 1.0306175e-03],
       [3.8694527e-02, 9.6130544e-01],
       ...,
       [9.9734265e-01, 2.6574109e-03],
       [9.9432892e-01, 5.6710988e-03],
       [9.9906796e-01, 9.3210797e-04]], dtype=float32)

In [57]:
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test,
predicted_Srnn.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test,
predicted_Srnn.round()))

precision: [0.97461929 0.96153846]
recall: [0.99481865 0.83333333]
fscore: [0.98461538 0.89285714]
support: [965 150]
############################
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       965
           1       0.96      0.83      0.89       150

   micro avg       0.97      0.97      0.97      1115
   macro avg       0.97      0.91      0.94      1115
weighted avg       0.97      0.97      0.97      1115
 samples avg       0.97      0.97      0.97      1115



**And here is our Long Short-Term Memory (LSTM) model:**

In [61]:
#model training
print('Training LSTM model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
 EMBEDDING_DIM,
 input_length=MAX_SEQUENCE_LENGTH
 ))
model.add(LSTM(16, activation='relu', recurrent_activation='hard_sigmoid',return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy',
optimizer='adam',metrics = ['accuracy'])


Training LSTM model.
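
The size of the LSTM model above can be estimated by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias. The helper `lstm_params` is hypothetical, and batch normalization is left out of the count.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

embedding_params = 20000 * 100             # MAX_NB_WORDS * EMBEDDING_DIM
lstm_layer_params = lstm_params(100, 16)   # LSTM(16) on 100-dimensional embeddings
flatten_size = 300 * 16                    # return_sequences=True keeps all 300 steps
dense_params = flatten_size * 2 + 2        # Dense(2) weights plus biases

print(embedding_params, lstm_layer_params, flatten_size, dense_params)
# 2000000 7488 4800 9602
```

Almost all of the parameters sit in the embedding table, which is worth keeping in mind given that only 8389 unique tokens were found in the training text.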


In [62]:
model.fit(train_data, labels_train,
 batch_size=16,
 epochs=5,
 validation_data=(test_data, labels_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fabeb2562c0>

In [63]:
# prediction on test data
predicted_lstm=model.predict(test_data)
predicted_lstm



array([[9.9999946e-01, 4.4962260e-07],
       [9.9999994e-01, 5.8289885e-16],
       [5.3873176e-15, 9.9999994e-01],
       ...,
       [9.9726641e-01, 2.7336071e-03],
       [9.9987471e-01, 1.2530525e-04],
       [1.0000000e+00, 7.7512163e-11]], dtype=float32)

In [64]:
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test,
predicted_lstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test,
predicted_lstm.round()))

precision: [0.98466258 0.98540146]
recall: [0.99792746 0.9       ]
fscore: [0.99125064 0.94076655]
support: [965 150]
############################
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.99      0.90      0.94       150

   micro avg       0.98      0.98      0.98      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.98      0.98      0.98      1115
 samples avg       0.98      0.98      0.98      1115



*Finally, let’s look at the Bidirectional LSTM and implement it.
As we know, an LSTM carries information from earlier inputs forward
through its hidden state. A bidirectional LSTM processes the input in two
directions: one pass runs from past to future and the other runs backward
from future to past, so the representation at each position also reflects
the words that follow it. Bidirectional LSTMs often produce very good
results because they capture context on both sides of a word*
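
The two-direction idea can be sketched in a few lines of NumPy. To keep the example short, a plain tanh recurrence stands in for the LSTM cell (an assumption made for illustration only), and all names and sizes are hypothetical. The forward pass reads the sequence left to right, the backward pass reads it right to left, and the two state sequences are re-aligned and concatenated per position.

```python
import numpy as np

def run_rnn(x, W, U):
    """Simple tanh recurrence; returns the state at every time step."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ W + h @ U)
        states.append(h)
    return np.stack(states)

def bidirectional(x, fwd, bwd):
    """Concatenate a left-to-right pass with a right-to-left pass, aligned by position."""
    forward = run_rnn(x, *fwd)
    backward = run_rnn(x[::-1], *bwd)[::-1]  # flip back to the original time order
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
timesteps, input_dim, units = 5, 3, 4
x = rng.normal(size=(timesteps, input_dim))
fwd = (rng.normal(scale=0.5, size=(input_dim, units)), rng.normal(scale=0.5, size=(units, units)))
bwd = (rng.normal(scale=0.5, size=(input_dim, units)), rng.normal(scale=0.5, size=(units, units)))

out = bidirectional(x, fwd, bwd)
print(out.shape)  # (5, 8): each position carries 4 forward and 4 backward features
```

With 16 units per direction, as in the model below, each position would carry 32 features before the convolution layer.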

In [65]:
#model training
print('Training Bidirectional LSTM model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
 EMBEDDING_DIM,
 input_length=MAX_SEQUENCE_LENGTH
 ))
model.add(Bidirectional(LSTM(16, return_sequences=True,
dropout=0.1, recurrent_dropout=0.1)))
model.add(Conv1D(16, kernel_size = 3, padding = "valid",
kernel_initializer = "glorot_uniform"))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(2,activation='softmax'))

model.compile(loss = 'binary_crossentropy',
optimizer='adam',metrics = ['accuracy'])

Training Bidirectional LSTM model.


In [68]:
model.fit(train_data, labels_train,
 batch_size=16,
 epochs=3,
 validation_data=(test_data, labels_test))


Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fabe501e2f0>

In [70]:
# prediction on test data
predicted_blstm=model.predict(test_data)
predicted_blstm



array([[9.9999982e-01, 6.7775765e-08],
       [9.9999994e-01, 1.8605950e-13],
       [2.3985425e-10, 9.9999994e-01],
       ...,
       [1.0000000e+00, 2.3395990e-11],
       [9.9999976e-01, 2.3934504e-07],
       [1.0000000e+00, 3.8733337e-13]], dtype=float32)

In [72]:
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test,
predicted_blstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test,
predicted_blstm.round()))

precision: [0.98567042 0.98550725]
recall: [0.99792746 0.90666667]
fscore: [0.99176107 0.94444444]
support: [965 150]
############################
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       965
           1       0.99      0.91      0.94       150

   micro avg       0.99      0.99      0.99      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.99      0.99      0.99      1115
 samples avg       0.99      0.99      0.99      1115
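
To compare the four models side by side, the spam-class scores printed in the evaluation cells above can be collected and ranked. The snippet only rearranges those reported numbers; the dictionary layout is an illustrative choice.

```python
# Spam-class (label 1) precision, recall, and F1 as printed in the evaluation cells above
reported = {
    "CNN":                (0.99173554, 0.80000000, 0.88560886),
    "SimpleRNN":          (0.96153846, 0.83333333, 0.89285714),
    "LSTM":               (0.98540146, 0.90000000, 0.94076655),
    "Bidirectional LSTM": (0.98550725, 0.90666667, 0.94444444),
}

# Rank by F1, the last value of each tuple
ranked = sorted(reported.items(), key=lambda item: item[1][2], reverse=True)
for name, (precision, recall, fscore) in ranked:
    print(f"{name:<20} precision={precision:.3f} recall={recall:.3f} f1={fscore:.3f}")
```

On this run the Bidirectional LSTM has the highest spam F1-score, closely followed by the LSTM. The gaps come from a single unseeded train/test split, with the test set also used for validation, so small differences should be read with caution.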

