# ***To Implement Text Processing with LSTM***


**Name:** Prexit Joshi  
**Roll No.:** 118



## 1. Aim
To implement a basic text-processing pipeline and train a small LSTM model for sentiment classification using Keras (TensorFlow).

## 2. Description
This practical demonstrates how raw text is preprocessed and converted to numerical inputs suitable for neural networks. Steps include:

- Simple text cleaning (lowercasing)
- Tokenization and converting words to integer sequences
- Padding sequences to a fixed length
- Building and training a small LSTM network
- Evaluating model performance and making predictions

The notebook uses a small sample dataset for clarity; the same pipeline applies to larger datasets with minimal changes.
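
The tokenization step listed above can be illustrated without any library. The sketch below is a simplified stand-in for what a tokenizer does, not the Keras implementation: it assumes whitespace splitting, reserves id 0 for padding and id 1 for out-of-vocabulary words, and assigns the remaining ids by descending word frequency. The helper names `build_vocab` and `encode` are hypothetical.

```python
from collections import Counter

PAD_ID = 0   # assumed: reserved for padding
OOV_ID = 1   # assumed: reserved for words not seen while building the vocabulary


def build_vocab(texts):
    """Map each word to an integer id, most frequent words first."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=2)}


def encode(text, vocab):
    """Convert one sentence to a list of ids, using OOV_ID for unknown words."""
    return [vocab.get(word, OOV_ID) for word in text.lower().split()]


train_texts = ["I love this product", "I hate it"]
vocab = build_vocab(train_texts)
print(vocab)
print(encode("I love it", vocab))    # [2, 5, 4]
print(encode("I adore it", vocab))   # [2, 1, 4]  ('adore' is unseen)
```

Because the vocabulary is built from the training sentences only, any word that first appears at test time falls back to the out-of-vocabulary id.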

## 3. Requirements
- Google Colab (recommended) or local Jupyter
- Libraries: `numpy`, `pandas`, `tensorflow`, `scikit-learn` (imported as `sklearn`)

If running locally, install with:
```
pip install numpy pandas tensorflow scikit-learn
```

In [1]:
# 4.1 Imports and seed
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Reproducibility (best-effort)
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

print('TensorFlow version:', tf.__version__)

TensorFlow version: 2.19.0


In [2]:
# 4.2 Small sample dataset
# Replace this with a real dataset (CSV) when required
data = {
    'text': [
        'I love this product',
        'This is the worst',
        'Absolutely fantastic service',
        'I hate it',
        'Not satisfied with the quality',
        'Very happy with my purchase',
        'Will never buy again',
        'Best experience ever',
        'Terrible customer support',
        'I really liked it'
    ],
    'label': ['positive','negative','positive','negative','negative','positive','negative','positive','negative','positive']
}

df = pd.DataFrame(data)
df.index += 1  # lab-style indices
print('Dataset preview:')
print(df.head())


Dataset preview:
                             text     label
1             I love this product  positive
2               This is the worst  negative
3    Absolutely fantastic service  positive
4                       I hate it  negative
5  Not satisfied with the quality  negative


In [3]:
# 4.3 Encode labels and split
le = LabelEncoder()
df['label_enc'] = le.fit_transform(df['label'])

X = df['text']
y = df['label_enc']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED, stratify=y)
print('Train size:', len(X_train), 'Test size:', len(X_test))


Train size: 8 Test size: 2
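
`LabelEncoder` assigns integer codes to the sorted class names, so here `negative` becomes 0 and `positive` becomes 1. The `stratify=y` argument keeps the class ratio the same in both parts of the split. The sketch below reproduces both ideas in plain Python for illustration; `stratified_split` is a hypothetical helper and is not how scikit-learn implements the split internally.

```python
import random

labels = ['positive', 'negative', 'positive', 'negative', 'negative',
          'positive', 'negative', 'positive', 'negative', 'positive']

# Sorted unique class names get consecutive integer codes.
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}
encoded = [to_code[name] for name in labels]
print(to_code)   # {'negative': 0, 'positive': 1}


def stratified_split(y, test_fraction, seed):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for code in sorted(set(y)):
        members = [i for i, value in enumerate(y) if value == code]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


train_idx, test_idx = stratified_split(encoded, test_fraction=0.2, seed=42)
print(len(train_idx), len(test_idx))          # 8 2
print(sorted(encoded[i] for i in test_idx))   # [0, 1]
```

With five sentences per class and a 20% test fraction, the test set holds exactly one sentence from each class, which is why a test accuracy can only be 0.0, 0.5, or 1.0 on this dataset.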


In [4]:
# 4.4 Simple preprocessing, tokenization, and padding
# Lowercase (simple cleaning)
X_train = X_train.str.lower()
X_test = X_test.str.lower()

VOCAB_SIZE = 500
OOV_TOKEN = '<OOV>'
MAX_LEN = 10

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token=OOV_TOKEN)
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

X_train_pad = pad_sequences(X_train_seq, maxlen=MAX_LEN, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=MAX_LEN, padding='post')

print('Example:')
print('Text:', X_train.iloc[0])
print('Sequence:', X_train_seq[0])
print('Padded:', X_train_pad[0])


Example:
Text: best experience ever
Sequence: [6, 7, 8]
Padded: [6 7 8 0 0 0 0 0 0 0]
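
With `padding='post'`, zeros are appended after the real tokens, as the padded example above shows. Sequences longer than `maxlen` are cut instead, and `pad_sequences` truncates from the front by default (`truncating='pre'`). The function below is a plain-Python sketch of that behaviour for illustration; `pad_post` is a hypothetical name, not a Keras function.

```python
def pad_post(sequences, maxlen, truncating='pre', value=0):
    """Pad each sequence with `value` at the end; cut long ones to `maxlen`."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'pre' drops tokens from the start, 'post' drops them from the end.
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded


print(pad_post([[6, 7, 8]], maxlen=10))
# [[6, 7, 8, 0, 0, 0, 0, 0, 0, 0]]
print(pad_post([[1, 2, 3, 4, 5]], maxlen=3))
# [[3, 4, 5]]
print(pad_post([[1, 2, 3, 4, 5]], maxlen=3, truncating='post'))
# [[1, 2, 3]]
```

None of the sample sentences exceeds `MAX_LEN = 10`, so truncation does not occur in this practical, but it matters once longer reviews are used.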


In [5]:
# 4.5 Build small LSTM model
EMBED_DIM = 32
LSTM_UNITS = 32

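model = Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),  # declare the input shape so summary() reports built layers
    Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),  # input_length is deprecated in Keras 3
    LSTM(LSTM_UNITS),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])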

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
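
The output of `model.summary()` is not reproduced in this write-up. As a cross-check, the expected number of trainable parameters can be worked out by hand. The arithmetic below assumes the standard Keras layer definitions: one embedding vector per vocabulary index, an LSTM with four gates that each have input weights, recurrent weights, and a bias, and a dense layer with one weight per input plus a bias.

```python
VOCAB_SIZE = 500
EMBED_DIM = 32
LSTM_UNITS = 32

# Embedding: one EMBED_DIM-sized vector per vocabulary index.
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (EMBED_DIM * LSTM_UNITS + LSTM_UNITS * LSTM_UNITS + LSTM_UNITS)

# Dense(1): one weight per LSTM unit plus one bias. Dropout has no parameters.
dense_params = LSTM_UNITS * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 16000 8320 33 24353
```

Roughly 24,000 parameters are being fitted to 8 training sentences, which gives a sense of how under-constrained this demonstration model is.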




In [6]:
# 4.6 Train the model (few epochs for demo)
EPOCHS = 8
BATCH_SIZE = 2

history = model.fit(
    X_train_pad, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_test_pad, y_test),
    verbose=1
)


Epoch 1/8
4/4 ━━━━━━━━━━━━━━━━━━━━ 4s 137ms/step - accuracy: 0.2167 - loss: 0.6970 - val_accuracy: 0.5000 - val_loss: 0.6936
Epoch 2/8
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step - accuracy: 0.6833 - loss: 0.6842 - val_accuracy: 0.5000 - val_loss: 0.6937
Epoch 3/8
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step - accuracy: 0.6833 - loss: 0.6823 - val_accuracy: 0.5000 - val_loss: 0.6939
Epoch 4/8
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step - accuracy: 0.6833 - loss: 0.6751 - val_accuracy: 0.5000 - val_loss: 0.6940
Epoch 5/8
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step - accuracy: 0.5833 - loss: 0.6872 - val_accuracy: 0.5000 - val_loss: 0.6941
Epoch 6/8
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step - accuracy: 0.6833 - loss: 0.6880 - val_accuracy: 0.5000 - val_loss: 0.6943
Epoch 7/8
4/4 ━━━━━━━━━━━━━━━━━━━━

In [7]:
# 4.7 Evaluate and sample prediction
loss, acc = model.evaluate(X_test_pad, y_test, verbose=0)
print(f'Test loss: {loss:.4f}, Test accuracy: {acc:.4f}')

# Helper to predict
def predict_text(text_list):
    texts = [t.lower() for t in text_list]
    seq = tokenizer.texts_to_sequences(texts)
    pad = pad_sequences(seq, maxlen=MAX_LEN, padding='post')
    probs = model.predict(pad)
    labels = le.classes_  # ['negative', 'positive']: LabelEncoder sorts class names
    for t,p in zip(text_list, probs):
        print(f"Text: {t}\nProb positive: {p[0]:.4f} -> Pred: {labels[int(p[0]>0.5)]}\n")

predict_text(['I absolutely love this service', 'Worst product I bought'])


Test loss: 0.6947, Test accuracy: 0.5000
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 218ms/step
Text: I absolutely love this service
Prob positive: 0.4936 -> Pred: negative

Text: Worst product I bought
Prob positive: 0.4893 -> Pred: negative



In [8]:
# 4.8 Save model (optional)
model_path = '/mnt/data/text_lstm_model.h5'
model.save(model_path)
print('Model saved to', model_path)




Model saved to /mnt/data/text_lstm_model.h5


## 5. Observations
- Tokenization converts words to integer IDs; padding makes sequences equal length for batch processing.
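- An LSTM processes tokens in order, so it can capture word-order information that helps in tasks such as sentiment classification.
- With only 8 training sentences, the model did not learn a useful decision boundary in this run: validation accuracy stayed at 0.50 in the logged epochs, test loss was 0.6947 (close to chance), and both sample sentences received probabilities near 0.5. On such small datasets results are unreliable and the model can also overfit; for meaningful results use a larger dataset and regularization.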
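
To make the word-order point concrete, the sketch below runs a single-unit LSTM cell over the same two inputs in both orders. It is written from the standard LSTM equations with hand-picked, arbitrary weights, so it is an illustration only and does not reflect the trained Keras model; `lstm_step` and `run` are hypothetical helper names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One LSTM time step for scalar input x, hidden state h, cell state c."""
    f = sigmoid(w['wf'] * x + w['uf'] * h + w['bf'])    # forget gate
    i = sigmoid(w['wi'] * x + w['ui'] * h + w['bi'])    # input gate
    o = sigmoid(w['wo'] * x + w['uo'] * h + w['bo'])    # output gate
    g = math.tanh(w['wg'] * x + w['ug'] * h + w['bg'])  # candidate cell value
    c = f * c + i * g          # keep part of the old memory, add new content
    h = o * math.tanh(c)       # expose a gated view of the memory
    return h, c


def run(sequence, w):
    """Feed a sequence through the cell and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h


# Arbitrary illustrative weights, not learned values.
weights = {'wf': 0.5, 'uf': 0.1, 'bf': 0.0,
           'wi': 1.0, 'ui': 0.2, 'bi': 0.0,
           'wo': 0.7, 'uo': 0.3, 'bo': 0.0,
           'wg': 1.5, 'ug': 0.4, 'bg': 0.0}

h_ab = run([1.0, -1.0], weights)
h_ba = run([-1.0, 1.0], weights)
print(h_ab, h_ba)   # same inputs in a different order give different states
```

A model that only summed its inputs would see 0 in both cases; the recurrent state is what lets the final output depend on the order.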


## 6. Conclusion

This practical implemented a basic text-processing pipeline and trained a small LSTM-based neural network for binary sentiment classification. Raw sentences were lowercased, tokenized, and padded before being fed into an embedding layer and LSTM. The experiment demonstrates the end-to-end process for preparing textual data and using sequence models to learn patterns. Although a small dataset was used here for instructional purposes, the same pipeline is applicable to larger datasets with further tuning and validation.
