
# **Topic:** To implement text processing with a Neural Network (LSTM)

**Name:** Prexit Joshi  
**Roll No.:** UE233118



## Aim
Implementation of text preprocessing and an LSTM model for basic sentiment classification.

## Description
We will:
- Use a very small sample dataset
- Preprocess text (lowercase, simple cleaning)
- Tokenize and pad sequences
- Build a small LSTM model
- Train, test, and make one prediction

Run the cells one by one.
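
The "lowercase, simple cleaning" step can be sketched as a small standalone function. In this notebook the Keras `Tokenizer` handles it implicitly, since it lowercases and filters punctuation by default; `clean_text` below is a hypothetical helper, not part of the notebook, that makes the step explicit.

```python
import re

def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # punctuation and symbols become spaces
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

# Illustrative inputs (not taken from the dataset below)
samples = ["I LOVE this!!!", "  Really   bad experience... "]
cleaned = [clean_text(s) for s in samples]
print(cleaned)  # ['i love this', 'really bad experience']
```

A function like this could be applied to the `text` column before tokenizing if the raw data were noisier than the six short sentences used here.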

## Requirements
Google Colab already includes the required libraries. If running locally, install them first:
```
pip install tensorflow pandas scikit-learn
```

In [1]:
# 1) Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

print('TF version:', tf.__version__)

TF version: 2.19.0


In [2]:
# 2) Very small dataset (easy to understand)
data = {
    'text': [
        'I love this',
        'I hate this',
        'This is awesome',
        'This is bad',
        'Really good experience',
        'Really bad experience'
    ],
    'label': ['positive','negative','positive','negative','positive','negative']
}
df = pd.DataFrame(data)
print(df)


                     text     label
0             I love this  positive
1             I hate this  negative
2         This is awesome  positive
3             This is bad  negative
4  Really good experience  positive
5   Really bad experience  negative


In [3]:
# 3) Encode labels and split
e = LabelEncoder()
df['label_enc'] = e.fit_transform(df['label'])
X = df['text']
y = df['label_enc']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print('Train:', len(X_train), 'Test:', len(X_test))
print('\nTrain samples:\n', X_train.values)


Train: 4 Test: 2

Train samples:
 ['Really good experience' 'I love this' 'This is bad'
 'Really bad experience']
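
To make the label encoding concrete: `LabelEncoder` assigns integer ids to the classes in sorted order, so here `negative` becomes 0 and `positive` becomes 1. The plain-Python sketch below mimics that mapping without scikit-learn; the variable names are illustrative only.

```python
labels = ['positive', 'negative', 'positive', 'negative', 'positive', 'negative']

# Sorting the unique class names reproduces the order LabelEncoder uses
classes = sorted(set(labels))
class_to_id = {name: i for i, name in enumerate(classes)}
encoded = [class_to_id[name] for name in labels]

print(classes)   # ['negative', 'positive']
print(encoded)   # [1, 0, 1, 0, 1, 0]
```

Because `positive` maps to 1, a sigmoid output above 0.5 can be read as "positive" in the prediction step.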


In [4]:
# 4) Tokenize & pad
vocab_size = 50
max_len = 6
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding='post')
print('Example:', X_train.values[0], '->', X_train_seq[0], '->', X_train_pad[0])


Example: Really good experience -> [2, 6, 3] -> [2 6 3 0 0 0]
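
The mapping above can be reproduced with a rough plain-Python sketch of what the tokenizer and padding do. It is a simplification under stated assumptions: words are split on whitespace only (the real `Tokenizer` also filters punctuation), index 1 is reserved for the OOV token, more frequent words get smaller indices, and 0 is the padding value. The helper names are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Assign index 1 to the OOV token, then rank words by frequency (ties: first seen)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _count) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index):
    """Map each word to its index; unknown words fall back to the OOV index 1."""
    return [word_index.get(word, 1) for word in text.lower().split()]

def pad_post(seq, max_len):
    """Keep at most the last max_len tokens, then right-pad with zeros."""
    seq = seq[-max_len:]
    return seq + [0] * (max_len - len(seq))

train_texts = ["Really good experience", "I love this",
               "This is bad", "Really bad experience"]
word_index = build_word_index(train_texts)

seq = to_sequence("Really good experience", word_index)
print(seq, "->", pad_post(seq, 6))   # [2, 6, 3] -> [2, 6, 3, 0, 0, 0]

unseen = to_sequence("I hate this", word_index)
print(unseen)                        # [7, 1, 4]
```

Note that "hate" never appears in the training split, so it collapses to the OOV index 1. With a dataset this small, the words that carry the sentiment in the test sentences may be invisible to the model.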


In [5]:
# 5) Build a tiny LSTM model
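model = Sequential([
    tf.keras.Input(shape=(max_len,)),  # declare the input shape so the model is built before summary()
    Embedding(input_dim=vocab_size, output_dim=8),  # input_length is deprecated in Keras 3
    LSTM(8),
    Dense(1, activation='sigmoid')
])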
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
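
The summary output is not shown here, but the parameter counts follow from the layer sizes. The sketch below works them out by hand using the standard formulas for embedding, LSTM, and dense layers; it is an arithmetic check, not output captured from the notebook.

```python
vocab_size, embed_dim, lstm_units = 50, 8, 8

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)  # 400 544 9 953
```

Even this tiny model has roughly 950 trainable parameters, far more than the four training sentences can constrain.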




In [7]:
# 6) Train (very few epochs to keep it simple)
history = model.fit(X_train_pad, y_train, epochs=8, batch_size=2, validation_data=(X_test_pad, y_test), verbose=1)


Epoch 1/8
2/2 ━━━━━━━━━━━━━━━━━━━━ 2s 171ms/step - accuracy: 0.5000 - loss: 0.6941 - val_accuracy: 0.5000 - val_loss: 0.6934
Epoch 2/8
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 125ms/step - accuracy: 0.5000 - loss: 0.6935 - val_accuracy: 0.5000 - val_loss: 0.6935
Epoch 3/8
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 265ms/step - accuracy: 0.6667 - loss: 0.6932 - val_accuracy: 0.5000 - val_loss: 0.6937
Epoch 4/8
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 137ms/step - accuracy: 0.5000 - loss: 0.6929 - val_accuracy: 0.5000 - val_loss: 0.6938
Epoch 5/8
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 128ms/step - accuracy: 0.5000 - loss: 0.6926 - val_accuracy: 0.5000 - val_loss: 0.6940
Epoch 6/8
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 178ms/step - accuracy: 0.5000 - loss: 0.6922 - val_accuracy: 0.5000 - val_loss: 0.6941
Epoch 7/8
2/2 ━━━━━━━━━━━━━━━━━━━

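The loss values in the training log sit close to 0.693, which is ln 2: the binary cross-entropy of a model that outputs 0.5 for every sample. The sketch below computes the loss by hand to show this baseline next to a hypothetical confident predictor; the probabilities are made up for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y = [1, 0, 1, 0]

chance = binary_cross_entropy(y, [0.5, 0.5, 0.5, 0.5])     # always predicts 0.5
confident = binary_cross_entropy(y, [0.9, 0.1, 0.9, 0.1])  # hypothetical good model

print(round(chance, 4))     # 0.6931
print(round(confident, 4))  # 0.1054
```

A loss that stays near 0.693 across epochs suggests the network has not yet moved away from chance-level predictions.
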
In [8]:
# 7) Evaluate and predict
loss, acc = model.evaluate(X_test_pad, y_test, verbose=0)
print(f'Test acc: {acc:.2f}')

# Single sample prediction
sample = ['I really love this product']
seq = tokenizer.texts_to_sequences(sample)
pad = pad_sequences(seq, maxlen=max_len, padding='post')
prob = model.predict(pad)[0][0]
print('Text:', sample[0])
print('Prob positive:', round(float(prob),3))
print('Predicted label:', 'positive' if prob>0.5 else 'negative')


Test acc: 0.00
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step
Text: I really love this product
Prob positive: 0.499
Predicted label: negative


## Conclusion
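This practical implemented a simple text-processing workflow with an LSTM-based neural network for sentiment classification: text was converted into padded integer sequences, a basic LSTM model was trained, and its performance was evaluated on held-out data. With only four training sentences and eight epochs, the model did not learn a useful decision rule. The training loss stayed near 0.693 (the binary cross-entropy of a constant 0.5 prediction), test accuracy was 0.00, and the sample prediction of 0.499 is essentially a coin flip. The experiment therefore demonstrates the mechanics of the pipeline rather than the effectiveness of the model; a larger dataset and longer training would be needed before the LSTM could learn patterns from sequential text. The objective of implementing text processing with a neural network was achieved.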