

**Q1. Problem Statement: Long Short-Term Memory Networks**<br>
Write a Python program that reads spam.csv (provided on the LMS) into a DataFrame. The dataset contains mail messages, each labeled as spam or not spam. Build an LSTM model for this binary classification task, covering the following steps:
1.	Load the given dataset into a DataFrame (use “,” as the delimiter and “latin-1” as the encoding), e.g. pd.read_csv('file name', delimiter=',', encoding='latin-1')
2.	Drop all “Unnamed” columns and do a missing value analysis for the remaining columns
3.	Use a count plot to check the balance of the target variable (“v1” is the target variable)
4.	Split the data into X and Y (independent and dependent variables)
5.	Label-encode the target variable and reshape its array into 2D format
6.	Split the data into train and test sets with a 20% test size
7.	Generate tokens (max words = 1000), convert them into numbers (text to sequence), and pad to length 150 for both train and test data
8.	Create a new function that declares the LSTM and all other layers of the model, then call it to create the final RNN model
9.	Fit the final model on the train dataset and measure accuracy on the test dataset


**Step-1:** Importing required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.optimizers import RMSprop
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical
from keras.callbacks import EarlyStopping
%matplotlib inline




**Step-2:** Loading the given dataset into a DataFrame.

In [2]:
df = pd.read_csv('spam.csv',delimiter=',',encoding='latin-1')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


**Step-3:**  Dropping unnecessary columns.

In [3]:
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'],axis=1,inplace=True)


**Step-4:** Missing value analysis.

In [4]:
df.isna().sum()

v1    0
v2    0
dtype: int64
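
Task 3 asks for a check of the target's class balance. The counts that a count plot would display can be sketched as below. The `labels` list is a hypothetical toy sample, not the real `df['v1']` column; with the notebook's DataFrame, `sns.countplot(x='v1', data=df)` should draw the same per-class counts as bars.

```python
from collections import Counter

# Hypothetical toy labels standing in for the real df['v1'] column.
labels = ["ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]

counts = Counter(labels)
total = sum(counts.values())

# Print each class with its count and its share of the data.
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.1%})")
```

A strongly skewed split matters later on, because a model that always predicts the majority class can already reach a high accuracy.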

**Step-6:** Splitting the data into X and Y (independent and dependent variables).

In [6]:
X = df.v2 #independent variable
Y = df.v1 #dependent variable


**Step-7:** Label encoding the target variable.

In [7]:
le = LabelEncoder()
Y = le.fit_transform(Y)
Y = Y.reshape(-1,1) # reshape the array to 2D format

In [8]:
Y 

array([[0],
       [0],
       [1],
       ...,
       [0],
       [0],
       [0]])
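
To make the mapping explicit: `LabelEncoder` assigns integers to the sorted distinct labels, so `ham` becomes 0 and `spam` becomes 1. The following is a minimal pure-Python sketch of that idea; `label_encode` is a hypothetical helper and the labels are toy values, not the real data.

```python
def label_encode(labels):
    """Map each distinct label to an integer, using sorted label order."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[label] for label in labels]


# Toy labels (hypothetical), mirroring the first rows shown by df.head().
labels = ["ham", "ham", "spam", "ham"]
classes, encoded = label_encode(labels)

# One value per row, the same layout that reshape(-1, 1) produces.
column = [[value] for value in encoded]
print(classes, column)
```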

**Step-8:** Splitting the data into train and test sets for training and testing purposes.

In [9]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)

**Step-9:** Tokenization, sequencing of tokens, and padding on the training data.

In [10]:
max_words = 1000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X_train)
sequences = tok.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
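
What these three calls do can be illustrated with a simplified pure-Python sketch. It assumes the Keras defaults (lowercasing, frequency-ranked indices starting at 1, only indices below `num_words` kept, zero-padding and truncation at the front). The helper names, the regular expression, and the toy texts are hypothetical, and the real `Tokenizer` filters characters somewhat differently.

```python
import re
from collections import Counter


def split_words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    ranked = counts.most_common()  # ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words):
    """Replace words by their index, keeping only indices below num_words."""
    return [
        [word_index[w] for w in split_words(text)
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_left(sequences, maxlen):
    """Keep the last maxlen entries and left-pad shorter rows with zeros."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


# Toy training texts (hypothetical).
train_texts = ["free entry win", "ok free", "win free prize now"]
word_index = fit_word_index(train_texts)
sequences = to_sequences(train_texts, word_index, num_words=4)
padded = pad_left(sequences, maxlen=4)
print(padded)
```

Words that were not seen when fitting, or that rank outside the vocabulary limit, are silently dropped. This is why the tokenizer is fitted on the training texts only and then reused for the test texts.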

**Step-10:** Defining the RNN architecture with an LSTM layer.

In [11]:
def RNN():
    inputs = Input(name='inputs',shape=[max_len])
    layer = Embedding(max_words,50,input_length=max_len)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256,name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(1,name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

**Step-11:** Model compilation and training.

In [12]:
model = RNN()
model.summary()
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 inputs (InputLayer)         [(None, 150)]             0         
                                                                 
 embedding (Embedding)       (None, 150, 50)           50000     
                                                                 
 lstm (LSTM)                 (None, 64)                29440     
                                                                 
 FC1 (Dense)                 (None, 256)               16640     
                                                                 
 activation (Activation)     (None, 256)               0         
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 out_layer (Dense)           (None, 1)                 257  
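
The parameter counts in the summary can be checked by hand. The sketch below is plain arithmetic from the layer sizes used in `RNN()`, assuming a standard LSTM cell with four gates, each with input weights, recurrent weights, and a bias. The variable names are illustrative only.

```python
# Layer sizes taken from the notebook.
max_words, embed_dim, lstm_units, fc_units = 1000, 50, 64, 256

# Embedding: one vector of size embed_dim per vocabulary index.
embedding_params = max_words * embed_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense layers: weights plus one bias per output unit.
fc1_params = lstm_units * fc_units + fc_units
out_params = fc_units * 1 + 1

total_params = embedding_params + lstm_params + fc1_params + out_params
print(embedding_params, lstm_params, fc1_params, out_params, total_params)
```

The activation and dropout layers have no trainable weights, which is why they show 0 parameters.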

In [13]:
model.fit(sequences_matrix,Y_train,batch_size=32,epochs=20,
          validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20

<keras.src.callbacks.History at 0x25eddf94110>

**Step-12:** Evaluating the model on the test dataset.

In [14]:
# tokenize and pad the test data with the tokenizer fitted on the training data
test_sequences = tok.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)

In [15]:
accr = model.evaluate(test_sequences_matrix,Y_test)



**Step-13:** Final result.

In [16]:
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 0.073
  Accuracy: 0.977
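
For reference, the two reported numbers correspond to the mean binary cross-entropy and the fraction of correct predictions at a 0.5 threshold. The following is a minimal sketch of both computations on hypothetical toy values. The function names and the clipping constant `eps` are assumptions for illustration, not the exact Keras implementation.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Toy labels and sigmoid outputs (hypothetical).
y_true = [1, 0, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.1]

loss = binary_cross_entropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
print('Loss: {:0.3f}  Accuracy: {:0.3f}'.format(loss, acc))
```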
