In [19]:
import tensorflow as tf
print(tf.__version__)


2.20.0


In [20]:
!pip install gensim



## Phase 3: Deep NLP with LSTM

Before:
text → TF-IDF / Word2Vec → Logistic Regression

Now:
text → tokenizer → padded sequences
     → embedding layer
     → LSTM
     → sentiment prediction

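Here the model:

- Learns word embeddings jointly with the classifier instead of relying on precomputed features
- Processes tokens in sequence, so word order can affect the prediction
- Carries context from earlier words forward through the LSTM state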

## Step 1: Import required libraries

In [21]:
import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split


## Step 2: Load dataset

In [22]:
df = pd.read_csv("../data/sentimentdataset.csv")

texts = df["Text"].astype(str)
labels = df["Sentiment"]


## Step 3: Convert labels to numbers

Neural networks require numeric labels.


In [23]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

## Step 4: Tokenize text

In [24]:
tokenizer = Tokenizer(num_words=5000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)

# Understand the distribution of sequence lengths
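The comment above names the check but leaves it unwritten. Here is a minimal sketch of it using only the standard library. `length_summary` is a hypothetical helper, and the toy list stands in for the output of `tokenizer.texts_to_sequences(texts)`. The idea is to choose `maxlen` near a high percentile of the token counts rather than guessing.

```python
import math

def length_summary(sequences, percentiles=(50, 90, 95, 100)):
    """Return {percentile: length} using the nearest-rank method."""
    lengths = sorted(len(seq) for seq in sequences)
    if not lengths:
        raise ValueError("sequences is empty")
    summary = {}
    for p in percentiles:
        # Nearest rank: smallest length with at least p% of values at or below it.
        rank = max(1, math.ceil(p / 100 * len(lengths)))
        summary[p] = lengths[rank - 1]
    return summary

# Toy stand-in for tokenizer.texts_to_sequences(texts)
toy_sequences = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10], [11, 12, 13, 14, 15, 16]]
toy_summary = length_summary(toy_sequences)
print(toy_summary)
```

In the notebook this would be called as `length_summary(sequences)`. If the 95th percentile is well above or below 20, the `maxlen` used in the next step is worth revisiting.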

## Step 5: Pad sequences

In [25]:
X = pad_sequences(sequences, padding='post', maxlen=20)

In [26]:
print(X.shape)

(732, 20)
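It helps to know what `pad_sequences` does with rows that are shorter or longer than `maxlen`. With `padding='post'`, zeros are appended at the end, while the default `truncating='pre'` drops tokens from the start of over-long rows. The pure-Python sketch below imitates that behaviour for illustration only. `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequences, maxlen, truncating="pre", value=0):
    """Imitate pad_sequences(..., padding='post') on lists of token IDs."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' keeps the last maxlen tokens, 'post' keeps the first maxlen.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        # Fill the remaining positions at the end with the padding value.
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

rows = pad_post([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(rows)  # [[5, 6, 0, 0], [3, 4, 5, 6]]
```

If the start of a post matters more than its end, passing `truncating='post'` to `pad_sequences` keeps the first 20 tokens instead of the last 20.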


## Step 6: Train–test split

In [27]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


## Step 7: Build LSTM model

In [28]:
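model = Sequential([
    # input_dim matches the tokenizer's num_words. The input_length argument
    # is deprecated in Keras 3, so it is omitted and the length is inferred.
    Embedding(input_dim=5000, output_dim=64),
    LSTM(64),
    Dense(32, activation='relu'),
    # One output unit per encoded class.
    Dense(len(np.unique(y)), activation='softmax')
])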
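A rough parameter count puts the model size in perspective next to the 732 available samples. The sketch below applies the standard formulas for embedding, LSTM, and dense layers to the layer sizes used above. The helper names are made up for illustration, and `num_classes` is a placeholder, since the real value comes from `len(np.unique(y))`.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(inputs, units):
    # Weight matrix plus one bias per unit.
    return inputs * units + units

num_classes = 100  # placeholder, not the real class count

counts = {
    "embedding": embedding_params(5000, 64),
    "lstm": lstm_params(64, 64),
    "dense_hidden": dense_params(64, 32),
    "dense_output": dense_params(32, num_classes),
}
counts["total"] = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>13}: {n:,}")
```

Under these assumptions the network has several hundred thousand trainable weights, most of them in the embedding table, which is far more than the number of training rows. Calling `model.summary()` after the first batch reports the exact figures.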



## Step 8: Compile model

In [29]:
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

## Step 9: Train model

In [30]:
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), batch_size=16)


Epoch 1/10
37/37 ━━━━━━━━━━━━━━━━━━━━ 4s 22ms/step - accuracy: 0.0462 - loss: 5.6042 - val_accuracy: 0.0612 - val_loss: 5.5432
Epoch 2/10
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.0496 - loss: 5.2998 - val_accuracy: 0.0204 - val_loss: 5.6080
Epoch 3/10
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.0530 - loss: 5.0320 - val_accuracy: 0.0680 - val_loss: 5.8492
Epoch 4/10
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.0718 - loss: 4.7918 - val_accuracy: 0.0340 - val_loss: 5.5295
Epoch 5/10
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.0650 - loss: 4.6107 - val_accuracy: 0.0748 - val_loss: 5.6924
Epoch 6/10
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.0957 - loss: 4.4118 - val_accuracy: 0.0884 - val_loss: 5.6866
Epoch 7/10
37/37 ━━━━
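The first-epoch loss offers a rough sanity check on how many classes the model is choosing between. If an untrained softmax spreads probability about evenly over K classes, the cross-entropy is close to ln K, so K is approximately exp(loss). This is a back-of-the-envelope estimate that assumes near-uniform initial predictions. The loss value is copied from the Epoch 1 line of the training log.

```python
import math

first_epoch_loss = 5.6042  # training loss reported for Epoch 1

# A uniform guess over K classes gives cross-entropy ln(K), so K ≈ exp(loss).
implied_classes = math.exp(first_epoch_loss)
print(f"implied number of classes: about {implied_classes:.0f}")

# The same relation in the other direction, for a few candidate class counts.
for k in (10, 50, 100, 300):
    print(f"K={k:>3}  uniform-guess loss={math.log(k):.2f}  chance accuracy={1 / k:.3f}")
```

The estimate is only a guide, because the weights already change during the first epoch. `len(encoder.classes_)` gives the exact count, and 1/K is the chance-level accuracy against which the validation accuracy should be judged.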

## Why the LSTM didn’t improve much

Again, the limiting factor is the dataset, not the model.

Main reasons

1) Very small dataset

Deep learning models typically need thousands to millions of labeled samples. This dataset has 732 samples, and only about 585 of them are used for training after the 80/20 split, so the LSTM does not see enough examples to learn reliable patterns.

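2) Too many classes

The label column contains dozens of distinct emotion labels, and many of them have only 1–3 samples. Deep models struggle with small multi-class datasets and with severe class imbalance: a class seen once or twice in training gives the network almost nothing to generalize from.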
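The imbalance is easy to quantify before training. The sketch below counts samples per label and reports how many labels are rare. `label_report` and its `rare_threshold` parameter are hypothetical names, and the toy list stands in for `df["Sentiment"]`. It also strips whitespace and lowercases labels first, on the assumption (not verified here) that some labels may differ only in spacing or case.

```python
from collections import Counter

def label_report(labels, rare_threshold=3):
    """Count samples per normalized label and flag rare classes."""
    # Assumption: labels that differ only by spacing or case are the same class.
    normalized = [str(label).strip().lower() for label in labels]
    counts = Counter(normalized)
    rare = sorted(name for name, n in counts.items() if n <= rare_threshold)
    return {
        "num_samples": len(normalized),
        "num_classes": len(counts),
        "num_rare_classes": len(rare),
        "most_common": counts.most_common(3),
    }

toy_labels = [" Positive ", "Positive", "positive", "Joy", "Joy ",
              "Negative", "Negative", "Negative", "Negative", "Awe"]
report = label_report(toy_labels)
print(report)
```

In the notebook this would be called as `label_report(df["Sentiment"])`. If most classes turn out to be rare, merging them into a few broader groups is one common way to make the task learnable on a dataset this size.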

But this is actually good for interviews

You now have a clear experimental story:

- Phase 1 worked best on small data.
- Phase 2 struggled because embeddings need large corpora.
- Phase 3 showed some improvement but was still limited by dataset size.

This shows:

- Analytical thinking
- Understanding of model–data relationships

Interview-ready explanation

If asked:

“Why didn’t deep learning perform better?”

You can say:

“The dataset was very small and highly imbalanced across many sentiment classes. Deep learning models like LSTMs require larger datasets to generalize well, so the performance was limited by data rather than model capability.”

That’s a strong, honest answer.