# SMS Spam or Ham Classification Using RNN

This project uses a **Recurrent Neural Network (RNN)** to classify SMS messages as **spam** or **ham** (non-spam) based on a dataset containing labeled text messages.

---

## Data Preprocessing

- **Loading Data**:  
  Dataset loaded from `SPAM text message 20170820 - Data.csv` with `Category` (spam/ham) and `Message` columns.

- **Cleaning**:  
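  Renamed `Category` to `output` and `Message` to `input`. The notebook also includes a step to drop `Unnamed: 2`, `Unnamed: 3`, and `Unnamed: 4`, but `data.info()` shows this file has only the two columns above, so there is nothing to remove.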

- **Text Processing**:  
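  Removed non-letter characters, lowercased the text, dropped English stopwords, and applied Porter stemming. Each message was then encoded as a sequence of integer word indices with Keras `one_hot` (a hashing-based encoding, not one-hot vectors) using a vocabulary size of 10,000, and all sequences were pre-padded with zeros to a length of 100.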

- **Label Encoding**:  
  Converted output labels to binary: `0` for spam and `1` for ham (the `op` mapping used in the notebook).

- **Train-Test Split**:  
  Split the data into **80% training** and **20% testing** using `random_state=42`.
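
The encoding and padding steps above can be sketched in plain Python. This is only an illustration of the idea, not Keras's implementation: the helper names `hash_encode` and `pad_pre` are hypothetical, and MD5 is an assumption chosen here so the result is reproducible across runs.

```python
import hashlib

VOC_SIZE = 10_000  # vocabulary size used in this project
SENT_LEN = 100     # padded sequence length used in this project


def hash_encode(text, voc_size=VOC_SIZE):
    """Map each whitespace-separated word to an index in [1, voc_size - 1].

    Index 0 is never produced, so it can serve as the padding value.
    MD5 is an assumed stand-in hash; distinct words may still collide.
    """
    indices = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (voc_size - 1) + 1)
    return indices


def pad_pre(seq, maxlen=SENT_LEN):
    """Keep at most the last `maxlen` items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


encoded = hash_encode("free entri wkli comp win")
padded = pad_pre(encoded)
print(len(encoded), len(padded), padded[-len(encoded):] == encoded)
# prints: 5 100 True
```

Because the indices come from a hash, two different words can share an index; a larger vocabulary size makes such collisions less likely.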

---

## Model Architecture

- **Embedding Layer**:  
  Maps words to **70-dimensional vectors** (vocab size: 10,000, input length: 100).

- **Simple RNN Layer**:  
  Contains **70 units** with **ReLU activation**.

- **Output Layer**:  
  Single neuron with **sigmoid activation** for binary classification.

- **Training Setup**:  
  Used **Adam optimizer**, **binary cross-entropy loss**, and **accuracy** as the evaluation metric.
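
As a sanity check on the layer sizes above, the trainable-parameter counts can be worked out by hand. The sketch below assumes the standard parameterization with bias terms (an embedding matrix, an input kernel plus recurrent kernel plus bias for the RNN, and a weight vector plus bias for the output neuron); the function names are illustrative only.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim


def simple_rnn_params(input_dim, units):
    # Input kernel + recurrent kernel + bias.
    return units * input_dim + units * units + units


def dense_params(input_dim, units):
    # Weight matrix + bias.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(10_000, 70),
    "simple_rnn": simple_rnn_params(70, 70),
    "dense": dense_params(70, 1),
}
total = sum(counts.values())
print(counts, total)
# prints: {'embedding': 700000, 'simple_rnn': 9870, 'dense': 71} 709941
```

Almost all of the parameters sit in the embedding matrix, so the vocabulary size and embedding dimension dominate the model's size.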

---

## Training and Validation

- **Configuration**:  
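  Trained with **batch size 10** for up to **10 epochs**, holding out 20% of the training set for validation (`validation_split=0.2`), with **early stopping** (patience = 5, best weights restored) based on validation loss.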

- **Performance**:  
  - ~**99.7% training accuracy** (0.9974 at epoch 6)  
  - ~**98% validation accuracy** (0.9798)  
  - **Best validation loss**: ~0.0828 (epoch 5)
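
The patience rule can be traced by hand on the validation losses logged for epochs 1 to 6. The sketch below is a simplified model of patience-based early stopping (strict improvement, no minimum delta), not the Keras callback itself; `early_stopping_trace` is a hypothetical helper.

```python
def early_stopping_trace(val_losses, patience):
    """Return (best_epoch, stopped_at) for a list of per-epoch losses.

    Epochs are 1-indexed. `stopped_at` is None if the patience budget
    is never exhausted within the given epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


# Validation losses from the training log, epochs 1-6.
val_losses = [0.1122, 0.0892, 0.0887, 0.1240, 0.0828, 0.0864]
best_epoch, stopped_at = early_stopping_trace(val_losses, patience=5)
print(best_epoch, stopped_at)
# prints: 5 None
```

With a patience of 5 and only 10 epochs, stopping can trigger only if the loss fails to improve for five epochs in a row; in the logged epochs the best value arrives at epoch 5, so the counter resets before it gets close.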

---

## Key Observations

- Dataset contains **5,572 messages**, each padded to **100 tokens**.
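- High accuracy was achieved, but the training loss in epoch 1 was extremely large (~3.7 × 10^8) before falling to ~0.06 in epoch 2, which suggests room for improvement in preprocessing or in the numerical stability of the ReLU-activated recurrent layer.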
- **Class imbalance** is likely present but not addressed in the current implementation.
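
One common way to address imbalance is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, `n_samples / (n_classes * class_count)`; the label list is a hypothetical toy example, not the real class distribution, and `balanced_class_weights` is an illustrative helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy labels using the notebook's encoding (0 = spam, 1 = ham).
toy_labels = [1] * 8 + [0] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
# prints: {1: 0.625, 0: 2.5}
```

A dictionary of this shape could be passed to Keras `model.fit` through its `class_weight` argument so that errors on the rarer class contribute more to the loss.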

---

## Conclusion

The RNN model effectively classifies SMS messages as spam or ham. However, several improvements could enhance performance:

- More advanced text preprocessing (for example, lemmatization in place of stemming; stopword removal is already applied)
- Switching to **LSTM** or **GRU** for better sequence modeling
- Addressing class imbalance (oversampling or class weights)
- Hyperparameter tuning and using additional metrics (precision, recall, F1-score)
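
Precision, recall, and F1 can be computed directly from predictions. The sketch below is a minimal plain-Python version with a configurable positive label, which matters here because the notebook encodes spam as `0`; the label lists are hypothetical toy data, not model output.

```python
def precision_recall_f1(y_true, y_pred, pos_label):
    """Compute precision, recall, and F1 for the class `pos_label`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels and predictions (0 = spam, 1 = ham).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred, pos_label=0))
# prints: (0.75, 0.75, 0.75)
```

Reporting these for the spam class gives a clearer picture than accuracy alone when ham messages dominate the data.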


In [None]:
import pandas as pd
import numpy as np
import re

In [None]:
%pip install nltk



In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
data = pd.read_csv('/content/SPAM text message 20170820 - Data.csv', encoding='latin1')

In [None]:
data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [None]:
data.columns

Index(['Category', 'Message'], dtype='object')

In [None]:
data.rename(columns = {'Category' : 'output', 'Message' : 'input'}, inplace = True)

In [None]:
data.head()

Unnamed: 0,output,input
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# This file has only two columns, so ignore any 'Unnamed' columns that are absent
data.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], errors='ignore', inplace=True)

In [None]:
data

Unnamed: 0,output,input
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [None]:
data.reset_index()

Unnamed: 0,index,output,input
0,0,ham,"Go until jurong point, crazy.. Available only ..."
1,1,ham,Ok lar... Joking wif u oni...
2,2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,3,ham,U dun say so early hor... U c already then say...
4,4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...,...
5567,5567,spam,This is the 2nd time we have tried 2 contact u...
5568,5568,ham,Will ü b going to esplanade fr home?
5569,5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,5570,ham,The guy did some bitching but I acted like i'd...


In [None]:
y = data['output']

In [None]:
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once; set lookup is fast
corpus = []

for message in data['input']:
  # keep letters only, lowercase, and split into words
  words = re.sub('[^a-zA-Z]', ' ', message).lower().split()
  # drop stopwords and stem the remaining words
  words = [ps.stem(word) for word in words if word not in stop_words]
  corpus.append(' '.join(words))


In [None]:
corpus[0]

'go jurong point crazi avail bugi n great world la e buffet cine got amor wat'

In [None]:
op = {'spam' : 0, 'ham' : 1}

data['output'] = data['output'].map(op)

In [None]:
data['output']

Unnamed: 0,output
0,1
1,1
2,0
3,1
4,1
5,0
6,1
7,1
8,0
9,0


In [None]:
from tensorflow.keras.preprocessing.text import one_hot

In [None]:
voc_size = 10000


# Each corpus entry is a full cleaned message, encoded as a list of word indices
one_hot_rep = [one_hot(sentence, voc_size) for sentence in corpus]

In [None]:
length = [len(i) for i in one_hot_rep]
max(length)

77

In [None]:
from tensorflow.keras.utils import pad_sequences

sent_len = 100
pad_rep = pad_sequences(one_hot_rep, padding = 'pre', maxlen = sent_len)

In [None]:
pad_rep[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0, 6602,   13, 6160,  467,
       2125, 9090, 5819, 9900, 2351, 8061, 8402, 4314, 1514,  673, 6757,
       5231], dtype=int32)

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(pad_rep, data['output'].values, test_size = 0.2, random_state = 42 )

In [None]:
y_train


array([0, 1, 1, ..., 1, 1, 1])

In [None]:
X_train.shape

(4457, 100)

In [None]:
X_test.shape

(1115, 100)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
model = Sequential()
# input_length is omitted because recent Keras releases deprecate it; model.build() below sets the input shape
model.add(Embedding(voc_size, 70))
model.add(SimpleRNN(70, activation='relu'))
model.add(Dense(1,activation = 'sigmoid'))
model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy'])



In [None]:
model.build(input_shape=(None, sent_len))
model.summary()

In [None]:
# EarlyStopping was already imported above; stop when val_loss stalls for 5 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

In [None]:
model.fit(X_train, y_train, validation_split= 0.2, batch_size = 10, epochs = 10, callbacks = [early_stop] )

Epoch 1/10
357/357 ━━━━━━━━━━━━━━━━━━━━ 16s 39ms/step - accuracy: 0.8611 - loss: 370057408.0000 - val_accuracy: 0.9787 - val_loss: 0.1122
Epoch 2/10
357/357 ━━━━━━━━━━━━━━━━━━━━ 13s 37ms/step - accuracy: 0.9873 - loss: 0.0597 - val_accuracy: 0.9798 - val_loss: 0.0892
Epoch 3/10
357/357 ━━━━━━━━━━━━━━━━━━━━ 12s 34ms/step - accuracy: 0.9953 - loss: 0.0283 - val_accuracy: 0.9798 - val_loss: 0.0887
Epoch 4/10
357/357 ━━━━━━━━━━━━━━━━━━━━ 21s 36ms/step - accuracy: 0.9968 - loss: 0.0182 - val_accuracy: 0.9675 - val_loss: 0.1240
Epoch 5/10
357/357 ━━━━━━━━━━━━━━━━━━━━ 15s 42ms/step - accuracy: 0.9944 - loss: 0.0227 - val_accuracy: 0.9798 - val_loss: 0.0828
Epoch 6/10
357/357 ━━━━━━━━━━━━━━━━━━━━ 18s 35ms/step - accuracy: 0.9974 - loss: 0.0179 - val_accuracy: 0.9798 - val_loss: 0.0864
Epoch 7/

<keras.src.callbacks.history.History at 0x7a74a5e63e80>