# Twitter Sentiment Analysis – LSTM with Class Weights

**Goal:** Build, evaluate, and (later) deploy a sentiment analysis model on  
`Twitter_Data.csv` using an LSTM-based deep learning model.

We follow this workflow:

1. Dataset overview and basic EDA  
2. Text cleaning (NLP preprocessing)  
3. Train–test split  
4. Tokenization and sequence padding  
5. Building a BiLSTM model in TensorFlow/Keras  
6. Training with EarlyStopping and class weights  
7. Model evaluation (accuracy, precision, recall, F1, confusion matrix)  
8. Saving the model and tokenizer for deployment (Streamlit app)

Imports & configuration

In [1]:
# ---------------------------------------------------------
# Imports and global configuration
# ---------------------------------------------------------

import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from joblib import dump

# File paths (adapt if your files are in another folder)
DATA_PATH = "Twitter_Data.csv"        # raw dataset
MODEL_PATH = "sentiment_model.h5"     # trained Keras model (for Streamlit)
TOKENIZER_PATH = "tokenizer.joblib"   # saved tokenizer (for Streamlit)

# Reproducibility
RANDOM_STATE = 42

# Model & text hyperparameters
MAX_NUM_WORDS = 20000         # vocabulary size for Tokenizer
MAX_SEQUENCE_LENGTH = 40      # max tokens per tweet (for padding)
EMBEDDING_DIM = 100           # dimension of embedding vectors
EPOCHS = 15                   # max epochs (EarlyStopping will likely stop earlier)
BATCH_SIZE = 32               # batch size

# Mapping between sentiment strings and numeric IDs
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}
ID_TO_LABEL = {v: k for k, v in LABEL_TO_ID.items()}
NUM_CLASSES = len(LABEL_TO_ID)

import platform  # tf.sysconfig.get_build_info() has no "python_version" key, so use the stdlib

print("Python version:", platform.python_version())
print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.17.0


## 1. Load the dataset and basic overview

In this step we:

- Load `Twitter_Data.csv`
- Inspect the first few rows
- Check the distribution of the `sentiment` labels
- Look at basic statistics of tweet length

This gives us an understanding of class balance and the nature of the text.

Load and inspect raw data

In [2]:
# ---------------------------------------------------------
# Load raw dataset and basic EDA
# ---------------------------------------------------------

df_raw = pd.read_csv(DATA_PATH)

print("Raw shape:", df_raw.shape)
print("\nColumns:", df_raw.columns.tolist())

# Show first few rows
display(df_raw.head())

# Sentiment label distribution
print("\nSentiment value counts:")
print(df_raw["sentiment"].value_counts(dropna=False))

# Simple length distribution of raw text
df_raw["text_len"] = df_raw["text"].astype(str).str.len()
print("\nText length (characters) summary:")
print(df_raw["text_len"].describe())

Raw shape: (27481, 4)

Columns: ['textID', 'text', 'selected_text', 'sentiment']


|   | textID | text | selected_text | sentiment |
|---|--------|------|---------------|-----------|
| 0 | cb774db0d1 | I`d have responded, if I were going | I`d have responded, if I were going | neutral |
| 1 | 549e992a42 | Sooo SAD I will miss you here in San Diego!!! | Sooo SAD | negative |
| 2 | 088c60f138 | my boss is bullying me... | bullying me | negative |
| 3 | 9642c003ef | what interview! leave me alone | leave me alone | negative |
| 4 | 358bd9e861 | Sons of ****, why couldn`t they put them on t... | Sons of ****, | negative |



Sentiment value counts:
sentiment
neutral     11118
positive     8582
negative     7781
Name: count, dtype: int64

Text length (characters) summary:
count    27481.000000
mean        68.327645
std         35.605403
min          3.000000
25%         39.000000
50%         64.000000
75%         97.000000
max        141.000000
Name: text_len, dtype: float64


## 2. NLP preprocessing – cleaning text

We now:

- Define a `clean_text()` function that:
  - Removes URLs  
  - Removes @mentions  
  - Keeps **letters, spaces, `!`, and `?`** (sentiment-rich punctuation)  
  - Lowercases text  
  - Collapses multiple spaces

- Drop rows with missing `text` or `sentiment`
- Apply `clean_text()` to create a `clean_text` column
- Drop very short cleaned tweets
- Map sentiment labels to numeric IDs (0 = negative, 1 = neutral, 2 = positive)

We will **reuse `clean_text()` in the deployment app** so training and inference preprocessing match.

Clean text & prepare labels

In [4]:
# ---------------------------------------------------------
# Define cleaning function and apply it
# ---------------------------------------------------------

def clean_text(text: str) -> str:
    """
    Simple tweet cleaning.

    Steps:
    - Ensure input is a string
    - Remove URLs
    - Remove @mentions
    - Keep letters, spaces, ! and ?
    - Lowercase
    - Collapse multiple spaces
    """
    text = str(text)
    text = re.sub(r"http\S+", " ", text)            # remove URLs
    text = re.sub(r"@[A-Za-z0-9_]+", " ", text)     # remove @mentions

    # Keep letters, spaces, and basic sentiment punctuation ! and ?
    text = re.sub(r"[^a-zA-Z\s!?]", " ", text)

    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Drop rows with missing text or sentiment
df = df_raw.dropna(subset=["text", "sentiment"]).copy()

# Apply cleaning
df["clean_text"] = df["text"].apply(clean_text)

# Drop empty or very short cleaned tweets (2 characters or fewer)
df = df[df["clean_text"].str.len() > 2].copy()

# Map labels to IDs
df["label_id"] = df["sentiment"].map(LABEL_TO_ID)

# Drop any rows with unmapped labels (should be none)
df = df.dropna(subset=["label_id"])
df["label_id"] = df["label_id"].astype(int)

print("After cleaning, shape:", df.shape)
display(df[["text", "clean_text", "sentiment", "label_id"]].head())

After cleaning, shape: (27470, 7)


|   | text | clean_text | sentiment | label_id |
|---|------|------------|-----------|----------|
| 0 | I`d have responded, if I were going | i d have responded if i were going | neutral | 1 |
| 1 | Sooo SAD I will miss you here in San Diego!!! | sooo sad i will miss you here in san diego!!! | negative | 0 |
| 2 | my boss is bullying me... | my boss is bullying me | negative | 0 |
| 3 | what interview! leave me alone | what interview! leave me alone | negative | 0 |
| 4 | Sons of ****, why couldn`t they put them on t... | sons of why couldn t they put them on the rele... | negative | 0 |
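
As a quick sanity check, the cleaning rules can be exercised on a made-up tweet. The sketch below restates the same regular expressions under a hypothetical name, `clean_text_sketch`, so that it runs on its own; the sample tweet is invented for illustration.

```python
import re

def clean_text_sketch(text: str) -> str:
    """Standalone restatement of the cleaning steps used in this notebook."""
    text = str(text)
    text = re.sub(r"http\S+", " ", text)           # remove URLs
    text = re.sub(r"@[A-Za-z0-9_]+", " ", text)    # remove @mentions
    text = re.sub(r"[^a-zA-Z\s!?]", " ", text)     # keep letters, whitespace, ! and ?
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

sample = "@user Check https://t.co/abc I`d LOVE this!!! 100% #happy"
cleaned = clean_text_sketch(sample)
print(cleaned)  # check i d love this!!! happy
```

Note that digits are dropped and apostrophes or backticks become spaces, so contractions split into separate tokens (`I`d` becomes `i d`), as also seen in the table above.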


## 3. Split into train and test sets

We now split:

- `X` – the features (cleaned tweets)
- `y` – the numeric labels (0/1/2)

We use a **stratified train–test split** so that the class distribution is similar
in the train and test sets.

Train/test split

In [5]:
# ---------------------------------------------------------
# Train–test split (stratified)
# ---------------------------------------------------------

X = df["clean_text"].values   # numpy array of strings
y = df["label_id"].values     # numpy array of ints

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")

print("\nTrain label distribution:")
print(pd.Series(y_train).map(ID_TO_LABEL).value_counts())

print("\nTest label distribution:")
print(pd.Series(y_test).map(ID_TO_LABEL).value_counts())

Train size: 21976, Test size: 5494

Train label distribution:
neutral     8888
positive    6866
negative    6222
Name: count, dtype: int64

Test label distribution:
neutral     2222
positive    1716
negative    1556
Name: count, dtype: int64
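
The effect of stratification can be checked with simple arithmetic on the counts printed above. This is only a verification sketch; the dictionaries below copy the numbers from the output by hand.

```python
# Label counts copied from the train/test output above
train_counts = {"neutral": 8888, "positive": 6866, "negative": 6222}
test_counts = {"neutral": 2222, "positive": 1716, "negative": 1556}

def shares(counts):
    """Convert absolute counts into class proportions."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_shares = shares(train_counts)
test_shares = shares(test_counts)

for label in train_counts:
    print(f"{label:>8}: train {train_shares[label]:.4f}  test {test_shares[label]:.4f}")
```

The class shares in the two splits agree to within about 0.1 percentage points, which is what a stratified split is meant to achieve.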


## 4. Tokenization and sequence padding

Steps:

1. Create a Keras `Tokenizer` with a fixed vocabulary size and `<OOV>` token  
2. Fit it **only on training texts** (best practice)  
3. Convert train and test texts to integer sequences  
4. Pad/truncate sequences to a fixed length (`MAX_SEQUENCE_LENGTH`)

Outputs:

- `X_train_pad`: (n_train, max_len) integer array  
- `X_test_pad`: (n_test, max_len) integer array

Tokenization & padding

In [6]:
# ---------------------------------------------------------
# Tokenizer and padded sequences
# ---------------------------------------------------------

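# Note: Tokenizer's default `filters` strip punctuation, including the ! and ?
# that clean_text() keeps, so those characters do not reach the model as tokens.
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)   # fit only on training data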

# Convert text to sequences of word IDs
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq  = tokenizer.texts_to_sequences(X_test)

# Pad / truncate to fixed length
X_train_pad = pad_sequences(
    X_train_seq,
    maxlen=MAX_SEQUENCE_LENGTH,
    padding="post",
    truncating="post"
)
X_test_pad = pad_sequences(
    X_test_seq,
    maxlen=MAX_SEQUENCE_LENGTH,
    padding="post",
    truncating="post"
)

print("X_train_pad shape:", X_train_pad.shape)
print("X_test_pad shape :", X_test_pad.shape)


X_train_pad shape: (21976, 40)
X_test_pad shape : (5494, 40)
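
To make the index conventions concrete, here is a minimal pure-Python sketch of the same idea: index 0 is reserved for padding, index 1 for the OOV token, and words are numbered by descending frequency. It is an illustration under simplifying assumptions (whitespace splitting, no punctuation filtering), not the Keras implementation; `build_word_index` and `encode_and_pad` are hypothetical helper names.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Assign integer IDs by descending frequency; 0 = padding, 1 = OOV."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode_and_pad(text, word_index, num_words, maxlen, oov_token="<OOV>"):
    """Map words to IDs, then truncate and zero-pad at the end ("post")."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.split():
        idx = word_index.get(word, oov_id)
        # Words outside the top `num_words` indices are treated as OOV
        ids.append(idx if idx < num_words else oov_id)
    ids = ids[:maxlen]
    return ids + [0] * (maxlen - len(ids))

train_texts = ["good good good day", "bad day"]
word_index = build_word_index(train_texts)
print(word_index)
print(encode_and_pad("good movie day", word_index, num_words=10, maxlen=5))
```

The unseen word `movie` maps to the OOV index, which is why the tokenizer is fitted on training texts only: test-time vocabulary gaps are then handled the same way as in deployment.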


## 5. Build the BiLSTM model

We now define a slightly richer model than the earlier baseline:

- `Embedding`: maps word IDs to embedding vectors  
- `SpatialDropout1D(0.1)`: regularization  
- `Bidirectional(LSTM(128))`: reads the tweet left→right and right→left  
- `Dense(64, relu) + Dropout(0.3)`: non-linear layer with moderate dropout  
- `Dense(3, softmax)`: outputs probabilities for 3 sentiment classes

We use:

- `sparse_categorical_crossentropy` (since labels are integer IDs)  
- `adam` optimizer  
- `accuracy` as the main metric.
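
As a rough check on model size, the trainable parameter counts implied by these layer sizes can be worked out by hand. The sketch assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and that the two LSTM directions are concatenated before the dense layer.

```python
vocab_size = 20000      # MAX_NUM_WORDS
embedding_dim = 100     # EMBEDDING_DIM
lstm_units = 128
dense_units = 64
num_classes = 3

embedding_params = vocab_size * embedding_dim
# One LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_one_direction
# The dense layer sees the concatenated forward and backward states
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

total_params = embedding_params + bilstm_params + dense_params + output_params
print(f"Embedding: {embedding_params:,}")
print(f"BiLSTM:    {bilstm_params:,}")
print(f"Dense(64): {dense_params:,}")
print(f"Output:    {output_params:,}")
print(f"Total:     {total_params:,}")
```

The embedding table accounts for roughly 2.0 million of about 2.25 million parameters, so the vocabulary size is the main lever on model size. The dropout layers add no parameters.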

Build & compile model

In [7]:
# ---------------------------------------------------------
# Build BiLSTM model
# ---------------------------------------------------------

# `input_length` is deprecated in Keras 3 (bundled with recent TensorFlow releases),
# so the sequence length is declared with an explicit Input layer instead.
model = models.Sequential([
    layers.Input(shape=(MAX_SEQUENCE_LENGTH,)),   # fixed-length sequences of word IDs
    layers.Embedding(
        input_dim=MAX_NUM_WORDS,
        output_dim=EMBEDDING_DIM
    ),
    layers.SpatialDropout1D(0.1),
    layers.Bidirectional(layers.LSTM(128)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax")
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

model.summary()


## 6. Train the model with EarlyStopping and class weights

In the earlier version, the model **ignored the "negative" class**,  
predicting mostly "neutral" and "positive".

To address this, we:

1. Compute **class weights** using `compute_class_weight` (balanced)  
2. Pass them to `model.fit(class_weight=...)` so errors on minority/ignored classes  
   are penalized more strongly  
3. Use `EarlyStopping` on validation loss with `restore_best_weights=True`
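
The "balanced" weights in step 1 follow the formula `n_samples / (n_classes * count_of_class)`. A small sketch with the train label counts from section 3 (copied by hand) shows where the printed values come from.

```python
# Train label counts from the stratified split (0 = negative, 1 = neutral, 2 = positive)
train_counts = {0: 6222, 1: 8888, 2: 6866}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# Balanced weighting: rarer classes get weights above 1, common ones below 1
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in balanced_weights.items():
    print(label, round(weight, 4))
```

With these counts the negative class gets the largest weight (about 1.18) and the neutral class the smallest (about 0.82), so the imbalance being corrected here is mild.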

Interpretation:

- Watch training `loss/accuracy` vs `val_loss/val_accuracy`  
- Training may stop before `EPOCHS` when validation loss stops improving  
- After this, we expect non-zero recall for the “negative” class.
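
The stopping rule itself is easy to simulate. The sketch below is a simplified model of patience-based early stopping, not the Keras callback; `simulate_early_stopping` is a hypothetical helper, and the validation losses are the three values recorded in this run's training log.

```python
import math

def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best val_loss)."""
    best_loss = math.inf
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss per epoch from this run
val_losses = [0.6857, 0.7345, 0.7735]
stopped_epoch, best_epoch = simulate_early_stopping(val_losses, patience=2)
print(stopped_epoch, best_epoch)  # 3 1
```

With `patience=2`, training stops after epoch 3 and `restore_best_weights=True` rolls the model back to epoch 1, where validation loss was lowest. The rising validation loss alongside falling training loss indicates that the model starts overfitting after the first epoch.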

Class weights & training

In [8]:
# ---------------------------------------------------------
# Compute class weights and train model
# ---------------------------------------------------------

# Ensure numeric arrays
X_train_pad = np.asarray(X_train_pad, dtype="int32")
X_test_pad  = np.asarray(X_test_pad, dtype="int32")
y_train     = np.asarray(y_train, dtype="int32")
y_test      = np.asarray(y_test, dtype="int32")

# Compute class weights to handle any imbalance / ignored classes
classes = np.array([0, 1, 2])  # negative, neutral, positive
class_weights_array = compute_class_weight(
    class_weight="balanced",
    classes=classes,
    y=y_train
)
class_weight_dict = {int(c): float(w) for c, w in zip(classes, class_weights_array)}
print("Class weights:", class_weight_dict)

# EarlyStopping callback
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=2,
    restore_best_weights=True
)

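# Train model
# Note: the test set doubles as validation data here, so EarlyStopping picks the
# best epoch using test examples and the test metrics are somewhat optimistic.
# Holding out a separate validation split from the training data avoids this.
history = model.fit(
    X_train_pad,
    y_train,
    validation_data=(X_test_pad, y_test),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[early_stop],
    verbose=1,
    class_weight=class_weight_dict
)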

Class weights: {0: 1.1773277617057751, 1: 0.8241824182418241, 2: 1.066899698999903}
Epoch 1/15
687/687 ━━━━━━━━━━━━━━━━━━━━ 19s 22ms/step - accuracy: 0.5113 - loss: 0.9375 - val_accuracy: 0.7080 - val_loss: 0.6857
Epoch 2/15
687/687 ━━━━━━━━━━━━━━━━━━━━ 13s 18ms/step - accuracy: 0.7726 - loss: 0.5480 - val_accuracy: 0.6986 - val_loss: 0.7345
Epoch 3/15
687/687 ━━━━━━━━━━━━━━━━━━━━ 12s 18ms/step - accuracy: 0.8448 - loss: 0.3895 - val_accuracy: 0.7037 - val_loss: 0.7735


## 7. Evaluate the model on the test set

We now:

1. Get prediction probabilities on `X_test_pad`  
2. Convert them to predicted class IDs (`argmax`)  
3. Print a `classification_report`  
4. Print the confusion matrix

Interpretation:

- **Accuracy** should be noticeably higher than the majority-class baseline (~40%, the share of neutral tweets)  
- **"Negative" class** should now have non-zero recall and F1-score  
- The confusion matrix shows which pairs of classes are still confused.

Evaluation

In [9]:
# ---------------------------------------------------------
# Evaluate on test set
# ---------------------------------------------------------

y_pred_prob = model.predict(X_test_pad)
y_pred = np.argmax(y_pred_prob, axis=1)

target_names = [ID_TO_LABEL[i] for i in range(NUM_CLASSES)]

print("Classification report:\n")
print(classification_report(y_test, y_pred, target_names=target_names, zero_division=0))

print("Confusion matrix (rows = true, cols = pred):")
print(confusion_matrix(y_test, y_pred))

172/172 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step
Classification report:

              precision    recall  f1-score   support

    negative       0.69      0.71      0.70      1556
     neutral       0.67      0.67      0.67      2222
    positive       0.77      0.76      0.77      1716

    accuracy                           0.71      5494
   macro avg       0.71      0.71      0.71      5494
weighted avg       0.71      0.71      0.71      5494

Confusion matrix (rows = true, cols = pred):
[[1098  382   76]
 [ 428 1480  314]
 [  73  331 1312]]
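
The per-class numbers in the report can be recomputed directly from the confusion matrix, which makes the link between the two outputs explicit. The matrix below is copied by hand from the output above.

```python
labels = ["negative", "neutral", "positive"]
# Rows = true class, columns = predicted class
cm = [
    [1098, 382, 76],
    [428, 1480, 314],
    [73, 331, 1312],
]

metrics = {}
for i, label in enumerate(labels):
    true_positives = cm[i][i]
    predicted_as_i = sum(row[i] for row in cm)   # column sum
    actually_i = sum(cm[i])                      # row sum
    precision = true_positives / predicted_as_i
    recall = true_positives / actually_i
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"{label:>8}: precision {precision:.2f}  recall {recall:.2f}  f1 {f1:.2f}")

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total
print(f"accuracy: {accuracy:.4f}")
```

Most errors involve the neutral class: 382 negative and 331 positive tweets are predicted as neutral, while direct negative/positive confusions are rare (76 and 73).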


## 8. Save model and tokenizer for deployment

Finally we:

- Save the trained Keras model to `sentiment_model.h5`  
- Save the fitted tokenizer to `tokenizer.joblib`

Your `app.py` (Streamlit app) will:

- Use the **same `clean_text()`** function  
- Load `tokenizer.joblib` to tokenize and pad new text  
- Load `sentiment_model.h5` to predict probabilities and sentiment labels.

This ensures training and inference pipelines stay consistent.
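
A minimal sketch of the inference path is shown below, assuming the app follows the clean, tokenize, pad, predict sequence described above. `predict_sentiment` is a hypothetical helper, and the `FakeTokenizer` and `FakeModel` classes are stand-ins that only make the sketch runnable on its own; in the app, the real objects would come from `joblib.load("tokenizer.joblib")` and `tf.keras.models.load_model("sentiment_model.h5")`.

```python
import numpy as np

ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def predict_sentiment(text, clean_fn, tokenizer, model, maxlen=40):
    """Run clean -> tokenize -> pad -> predict for a single text."""
    cleaned = clean_fn(text)
    ids = tokenizer.texts_to_sequences([cleaned])[0][:maxlen]   # truncate at the end
    padded = ids + [0] * (maxlen - len(ids))                    # zero-pad at the end
    batch = np.asarray([padded], dtype="int32")
    probs = np.asarray(model.predict(batch))[0]
    return ID_TO_LABEL[int(np.argmax(probs))], probs

# Stand-ins for the saved tokenizer and model (illustration only)
class FakeTokenizer:
    def texts_to_sequences(self, texts):
        return [[len(word) for word in text.split()] for text in texts]

class FakeModel:
    def predict(self, batch):
        return np.array([[0.1, 0.2, 0.7]] * len(batch))

label, probs = predict_sentiment("Great day", str.lower, FakeTokenizer(), FakeModel())
print(label, probs)
```

Padding and truncation here mirror the `padding="post"` and `truncating="post"` settings used at training time; using `pad_sequences` with the same arguments in the app would be equivalent.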

Save model & tokenizer

In [10]:
# ---------------------------------------------------------
# Save model & tokenizer
# ---------------------------------------------------------

model.save(MODEL_PATH)
dump(tokenizer, TOKENIZER_PATH)

print("Saved model to:", MODEL_PATH)
print("Saved tokenizer to:", TOKENIZER_PATH)



Saved model to: sentiment_model.h5
Saved tokenizer to: tokenizer.joblib
