# Import and Setup

In [1]:
import numpy as np 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings("ignore") 
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv('dataset/tweet_emotions.csv')
data.head()

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,wants to hang out with friends SOON!
4,1956968416,neutral,@dannycastillo We want to trade with someone w...


# EDA

## Removing ID, duplicate and Null Values

In [3]:
#If there is any null value throughout the row, remove it.
data = data.dropna()
data = data.reset_index(drop=True)
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweet_id   40000 non-null  int64 
 1   sentiment  40000 non-null  object
 2   content    40000 non-null  object
dtypes: int64(1), object(2)
memory usage: 937.6+ KB


It was noticed that there are no null values present, therefore no values were dropped

In [4]:
# Remove the id column
data = data.drop(['tweet_id'], axis=1)
data.head()

Unnamed: 0,sentiment,content
0,empty,@tiffanylue i know i was listenin to bad habi...
1,sadness,Layin n bed with a headache ughhhh...waitin o...
2,sadness,Funeral ceremony...gloomy friday...
3,enthusiasm,wants to hang out with friends SOON!
4,neutral,@dannycastillo We want to trade with someone w...


### Removing Duplicate Rows

In this step, duplicate rows based on the "content" column were identified and removed. Duplicate rows can introduce bias and redundancy in the dataset, which can negatively impact the performance of machine learning models. By removing these duplicates, each tweet is ensured to be unique, leading to a more accurate and reliable analysis.

The process involved:
1. **Counting Duplicates**: The number of duplicate rows in the dataset was counted.
2. **Removing Duplicates**: The duplicate rows based on the "content" column were removed.
3. **Resetting Index**: After removing duplicates, the index of the dataframe was reset to maintain a clean and sequential index.

The code used for this process is as follows:

In [5]:
# Remove duplciate rows if they have the same "content" value. Also print the number of removed rows.
print("Number of duplicate rows before removing: ", data.duplicated().sum())
data = data.drop_duplicates(subset='content')
data = data.reset_index(drop=True)
print("Number of duplicate rows after removing: ", data.duplicated().sum())

Number of duplicate rows before removing:  91
Number of duplicate rows after removing:  0


There were 91 duplicate rows found, h

## Cleaning Text

The text cleaning function is designed to preprocess and clean the text data in the dataset. It performs the following operations:

1. **Remove URLs**: Eliminates any URLs from the text.
2. **Remove Non-Word Characters**: Replaces non-word characters with spaces.
3. **Remove @Mentions**: Removes mentions (e.g., `@username`).
4. **Remove Hashtags**: Removes the `#` symbol from hashtags.
5. **Remove Non-ASCII Characters**: Removes any non-ASCII characters.
6. **Remove Digits**: Eliminates digits from the text.
7. **Fix Multiple Spaces**: Replaces multiple spaces with a single space.
8. **Trim Spaces**: Removes leading and trailing spaces.
9. **Lemenisation** Applied word net lemmenisation to all words

The function is applied to the `content` column of the dataset to standardize the text format.

In [6]:
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required NLTK resources (run only once)
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)

# Function to clean text
def clean_text(text):
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\W', ' ', text)
    #convert to lowercase
    text = text.lower()
    #remove any @mentions
    text = re.sub(r'@\w+', '', text)
    #remove # from #hashtags
    text = re.sub(r'#', '', text)
    #remove any non-ascii characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    #remove any digits
    text = re.sub(r'\d', '', text)
    
    # Fix double or multiple spacing cause from removal
    text = re.sub(r'\s+', ' ', text)
    # Remove any leading or trailing spaces
    text = re.sub(r'^\s+|\s+?$', '', text)
    
    # Apply lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
    
    return lemmatized_text

# Apply the function to the "content" column
data['content'] = data['content'].apply(clean_text)

# Display the cleaned data
print(data.head())

    sentiment                                            content
0       empty  tiffanylue i know i wa listenin to bad habit e...
1     sadness  layin n bed with a headache ughhhh waitin on y...
2     sadness                     funeral ceremony gloomy friday
3  enthusiasm                  want to hang out with friend soon
4     neutral  dannycastillo we want to trade with someone wh...


Saving the cleaned data to a new file

In [7]:
data.to_csv('dataset/cleaned_tweet_emotions.csv', index=False)

In [8]:
df = pd.read_csv('dataset/cleaned_tweet_emotions.csv')
df.head()

Unnamed: 0,sentiment,content
0,empty,tiffanylue i know i wa listenin to bad habit e...
1,sadness,layin n bed with a headache ughhhh waitin on y...
2,sadness,funeral ceremony gloomy friday
3,enthusiasm,want to hang out with friend soon
4,neutral,dannycastillo we want to trade with someone wh...


In [9]:
print(df.shape)
df = df.dropna()
df.isnull().sum()

(39827, 2)


sentiment    0
content      0
dtype: int64

## Embedding Tweets with GloVe and TF-IDF



In this section, tweets were embedded using a combination of GloVe embeddings and TF-IDF weighting. GloVe (Global Vectors for Word Representation) is a pre-trained word embedding model that captures semantic relationships between words. The 200-dimensional GloVe embeddings trained on Twitter data were used, which is particularly suitable for the sentiment analysis task on tweets due to its training corpus.

The process involved:
1. **Loading GloVe Embeddings**: The pre-trained GloVe embeddings were loaded from a file.
2. **TF-IDF Vectorization**: A TF-IDF vectorizer was fitted on the dataset to capture the importance of words in the context of the specific dataset.
3. **Combining GloVe and TF-IDF**: For each tweet, a weighted average of the GloVe vectors of its words was computed, using the TF-IDF scores as weights. This resulted in a single 200-dimensional vector representing each tweet.

This approach leverages the semantic richness of GloVe embeddings and the contextual relevance captured by TF-IDF, providing a robust representation of tweets for sentiment analysis. Compared to other encodings, this method is particularly effective for short, informal text like tweets, where word context and importance can vary significantly.


In [None]:
import pickle
import numpy as np
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.sequence import pad_sequences

# Load Pre-trained GloVe Embeddings (200d)
def load_glove_embeddings(filepath):
    glove_dict = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            glove_dict[word] = vector
    return glove_dict

# Convert tweets to TF-IDF weighted GloVe embeddings
def get_tweet_embedding(tweet, glove_dict, tfidf, feature_names):
    words = tweet.split()
    tweet_vector = np.zeros(200)  # GloVe 200d
    word_count = 0

    for word in words:
        if word in glove_dict and word in feature_names:
            weight = tfidf.get(word, 1)  # Default to 1 if word not in TF-IDF dict
            tweet_vector += weight * glove_dict[word]
            word_count += weight

    return tweet_vector / word_count if word_count != 0 else tweet_vector

# Load GloVe Embeddings
glove_path = "glove.twitter.27B.200d.txt"
glove_embeddings = load_glove_embeddings(glove_path)

# TF-IDF Vectorizer (Fitted on Your Dataset), then pickle it
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
tfidf_vectorizer.fit(df['content'])
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

# Load the saved TF-IDF vectorizer
with open('tfidf_vectorizer.pkl', 'rb') as f:
    loaded_tfidf_vectorizer = pickle.load(f)

feature_names = set(loaded_tfidf_vectorizer.get_feature_names_out())
idf_scores = dict(zip(loaded_tfidf_vectorizer.get_feature_names_out(), loaded_tfidf_vectorizer.idf_))

# Convert Tweets to GloVe Embeddings
df['embedding'] = df['content'].apply(lambda x: get_tweet_embedding(x, glove_embeddings, idf_scores, feature_names))

print(df.head())

    sentiment                                            content  \
0       empty  tiffanylue i know i wa listenin to bad habit e...   
1     sadness  layin n bed with a headache ughhhh waitin on y...   
2     sadness                     funeral ceremony gloomy friday   
3  enthusiasm                  want to hang out with friend soon   
4     neutral  dannycastillo we want to trade with someone wh...   

                                           embedding  
0  [-0.009336741176117474, -0.11781357224288551, ...  
1  [0.020380503826825182, -0.05467616021290979, -...  
2  [-0.10435475312059417, -0.02760633744912397, -...  
3  [0.0272797092070235, 0.18452336164318683, 0.01...  
4  [0.02917883614275043, 0.36351611138752765, -0....  


### Label encoding for the sentiment

In [11]:
#print all unique values in the sentiment column
print(df['sentiment'].unique())

['empty' 'sadness' 'enthusiasm' 'neutral' 'worry' 'surprise' 'love' 'fun'
 'hate' 'happiness' 'boredom' 'relief' 'anger']


In [12]:
#Perform label encoding in a new colomn. [empty, sadness, worry, hate, boredom, anger] is the first label which is negative. [neutral] label is neutral. [enthusiasm, love, fun, happiness, relief] is positive label. Create a new colomn for encoding.
df['sentiment_encoded'] = df['sentiment'].apply(lambda x: 0 if x in ['empty', 'sadness', 'worry', 'hate', 'boredom', 'anger'] else 1 if x in ['neutral'] else 2)
print(df.head())

#print how many instances are there in each class
print(df['sentiment_encoded'].value_counts())

    sentiment                                            content  \
0       empty  tiffanylue i know i wa listenin to bad habit e...   
1     sadness  layin n bed with a headache ughhhh waitin on y...   
2     sadness                     funeral ceremony gloomy friday   
3  enthusiasm                  want to hang out with friend soon   
4     neutral  dannycastillo we want to trade with someone wh...   

                                           embedding  sentiment_encoded  
0  [-0.009336741176117474, -0.11781357224288551, ...                  0  
1  [0.020380503826825182, -0.05467616021290979, -...                  0  
2  [-0.10435475312059417, -0.02760633744912397, -...                  0  
3  [0.0272797092070235, 0.18452336164318683, 0.01...                  2  
4  [0.02917883614275043, 0.36351611138752765, -0....                  1  
sentiment_encoded
0    16023
2    15205
1     8598
Name: count, dtype: int64


In [13]:
print(df.shape)
print(df.dtypes)

(39826, 4)
sentiment            object
content              object
embedding            object
sentiment_encoded     int64
dtype: object


## Further Data Pre-processing and Cleaning

It was noticed that once the encoding was done, there were some issues with how the data was stored and how it was retrieved. This caused the model to not run since it can not read strings, therefore the data was further processed to change the data type and remove trailing and beginning text that was added during time of encoding

In [14]:
import ast
import numpy as np

def parse_embedding(embedding):
    # 1) If it's already a NumPy array, just ensure dtype float32.
    if isinstance(embedding, np.ndarray):
        return embedding.astype(np.float32)
    
    # 2) If it's a string that looks like "array([-0.07, 0.08, ...])", remove "array(" and trailing ")".
    if isinstance(embedding, str):
        if embedding.startswith("array(") and embedding.endswith(")"):
            # remove the leading array( and trailing )
            embedding = embedding[len("array("):-1]  # everything inside the parentheses

        # now it should look like "[-0.07, 0.08, ...]"
        python_list = ast.literal_eval(embedding)  # parse as Python list
        return np.array(python_list, dtype=np.float32)
    
    # 3) Otherwise, try to convert it to float32 array anyway (covers lists or other formats).
    return np.array(embedding, dtype=np.float32)

# Now apply
df['embedding'] = df['embedding'].apply(parse_embedding)


In [15]:
print(df['embedding'].iloc[0])
print(type(df['embedding'].iloc[0]))  # <class 'numpy.ndarray'>
print(df['embedding'].iloc[0].dtype)  # float32 (or float64, depending on your code)


[-9.33674164e-03 -1.17813572e-01  1.22281229e-02  6.81492612e-02
 -6.33047298e-02  1.30258262e-01  5.25684357e-01 -2.91683655e-02
 -3.08205783e-01 -3.26570541e-01 -2.42037419e-02 -1.09019227e-01
 -6.26551151e-01 -1.21132649e-01 -1.47696985e-02 -2.19138026e-01
  1.58629134e-01  1.12198010e-01 -9.28363428e-02 -8.19794461e-02
  6.53500184e-02  1.51154352e-02  1.46863945e-02 -1.94295198e-01
 -7.14265183e-02  9.73988235e-01 -1.12449199e-01  7.18857348e-02
 -5.75402007e-03 -9.51587930e-02 -1.14354961e-01 -7.59564415e-02
 -3.50154161e-01  4.51495834e-02 -2.32437313e-01  1.06302112e-01
 -2.54962463e-02 -1.25372291e-01  2.15248689e-01  5.58137894e-02
  4.61367577e-01  2.26965860e-01  1.52214497e-01 -5.74009120e-02
 -1.53786600e-01  1.57505050e-02  4.47052896e-01 -2.52260659e-02
 -3.82465087e-02  1.31910861e-01  8.63551274e-02 -2.89009154e-01
 -6.92554861e-02 -1.41300857e-01  2.20181625e-02  1.73297785e-02
 -7.28377327e-02 -1.81020528e-01  1.54833734e-01 -1.40639067e-01
  3.45269032e-02  3.43265

In [16]:
#Store only the embedding colomn and the sentiment_encoded colomn in a new dataframe.
df = df[['embedding', 'sentiment_encoded']]
print(df.head())

                                           embedding  sentiment_encoded
0  [-0.009336742, -0.11781357, 0.012228123, 0.068...                  0
1  [0.020380504, -0.05467616, -0.048116226, -0.20...                  0
2  [-0.104354754, -0.027606338, -0.0489604, 0.226...                  0
3  [0.027279709, 0.18452336, 0.019765332, -0.4269...                  2
4  [0.029178835, 0.36351612, -0.011995873, 0.0128...                  1


In [17]:
df.dtypes

embedding            object
sentiment_encoded     int64
dtype: object

In [18]:
df["embedding"] = df["embedding"].apply(lambda arr: arr.tolist())

#Save to CSV
df.to_csv("dataset/embeddings_as_string.csv", index=False)
print("\nSaved to 'embeddings_as_string.csv'")


Saved to 'embeddings_as_string.csv'


In [19]:
df.head()

Unnamed: 0,embedding,sentiment_encoded
0,"[-0.009336741641163826, -0.11781357228755951, ...",0
1,"[0.020380504429340363, -0.05467616021633148, -...",0
2,"[-0.10435475409030914, -0.027606338262557983, ...",0
3,"[0.02727970853447914, 0.18452335894107819, 0.0...",2
4,"[0.029178835451602936, 0.3635161221027374, -0....",1


# Model Building

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Split the dataset into train (70%) and temp (30%) with stratification
train_df, temp_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df['sentiment_encoded'])

# Split the temp dataset into validation (20% of original) and test (10% of original) with stratification
val_df, test_df = train_test_split(temp_df, test_size=1/3, random_state=42, stratify=temp_df['sentiment_encoded'])

# Print the sizes of the splits to verify
print(f"Train set size: {len(train_df)}")
print(f"Validation set size: {len(val_df)}")
print(f"Test set size: {len(test_df)}")

Train set size: 27878
Validation set size: 7965
Test set size: 3983


In [21]:
import numpy as np
import pandas as pd

def prepare_lstm_data(df, label_col='sentiment_encoded', embed_col='embedding'):
    """
    df:       DataFrame with at least 2 columns: [label_col, embed_col]
    label_col: name of the sentiment/label column
    embed_col: name of the embedding column (a numerical vector or numeric data)
    """
    # 1) Extract labels
    y = df[label_col].values  # shape -> (num_samples,)

    # 2) Extract numeric features (assuming 'embedding' column contains numeric vectors)
    #    If 'embedding' is already stored as a vector (list/np.array) per row, convert each row to np.array:
    X = np.array(df[embed_col].tolist())  # shape -> (num_samples, embedding_dim)

    # 3) Reshape to 3D for LSTM: (samples, timesteps=1, features=embedding_dim)
    #    If each row is just one “step” with that embedding:
    X = X.reshape((X.shape[0], 1, X.shape[1]))

    return X, y

In [22]:
import numpy as np
import pandas as pd

# Keras / TensorFlow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Sklearn for additional metrics
from sklearn.metrics import classification_report, confusion_matrix
X_train, y_train = prepare_lstm_data(train_df,
                                     label_col='sentiment_encoded',
                                     embed_col='embedding')

X_val, y_val = prepare_lstm_data(val_df,
                                 label_col='sentiment_encoded',
                                 embed_col='embedding')

X_test, y_test = prepare_lstm_data(test_df,
                                   label_col='sentiment_encoded',
                                   embed_col='embedding')

print("X_train shape:", X_train.shape)  # (28000, 1, embedding_dim) for example
print("y_train shape:", y_train.shape)  # (28000,)

print("X_val shape:", X_val.shape)
print("y_val shape:", y_val.shape)

print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (27878, 1, 200)
y_train shape: (27878,)
X_val shape: (7965, 1, 200)
y_val shape: (7965,)
X_test shape: (3983, 1, 200)
y_test shape: (3983,)


Each LSTM layer processes sequential data, gradually reducing the dimensionality from 64 units down to 16, which captures increasingly distilled features at each stage. The key parameters and layers function as follows:

- **kernel_initializer**: Initializes the weights of the LSTM neurons. GlorotUniform balances variance across layers, aiding faster convergence.  
- **recurrent_initializer**: Specifies how the recurrent (hidden) weights are initialized. Orthogonal initialization preserves gradient magnitude over time, helping stability in deep recurrent networks.  
- **bias_initializer**: Sets biases to zero, which provides a consistent starting point for training.  
- **kernel_regularizer (l2)**: Adds a penalty on large weight values, reducing overfitting by encouraging smaller, more general weights.  
- **return_sequences**: Controls whether each time step’s output or only the final output is passed to the next layer, allowing deeper stacked LSTM structures.  
- **BatchNormalization**: Normalizes activations, stabilizing training and speeding up convergence.  
- **Dropout**: Randomly zeros out a fraction of connections during training to prevent overfitting and encourage more robust feature learning.  
- **Dense(3, activation='softmax')**: Maps the final LSTM output to three probability scores for multi-class classification.  
- **Adam(learning_rate=1e-4)**: Adam optimizer adjusts learning rates adaptively for each weight, improving speed and stability of convergence.

In [23]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Dropout, BatchNormalization
from tensorflow.keras import regularizers, initializers
from tensorflow.keras.optimizers import Adam

def build_lstm_model(input_shape):
    model = Sequential()
    
    # First LSTM layer
    model.add(LSTM(64, 
                  return_sequences=True, 
                  kernel_initializer=initializers.GlorotUniform(),
                  recurrent_initializer=initializers.Orthogonal(),
                  bias_initializer='zeros',
                  kernel_regularizer=regularizers.l2(1e-4), 
                  input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))

    # Second LSTM layer - new
    model.add(LSTM(48, 
                  return_sequences=True, 
                  kernel_initializer=initializers.GlorotUniform(),
                  recurrent_initializer=initializers.Orthogonal(),
                  bias_initializer='zeros',
                  kernel_regularizer=regularizers.l2(1e-4)))
    model.add(BatchNormalization())
    model.add(Dropout(0.25))
    
    # Third LSTM layer - new
    model.add(LSTM(32, 
                  return_sequences=True, 
                  kernel_initializer=initializers.GlorotUniform(),
                  recurrent_initializer=initializers.Orthogonal(),
                  bias_initializer='zeros',
                  kernel_regularizer=regularizers.l2(1e-4)))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))

    # Fourth LSTM layer (final)
    model.add(LSTM(16, 
                  return_sequences=False, 
                  kernel_initializer=initializers.GlorotUniform(),
                  recurrent_initializer=initializers.Orthogonal(),
                  bias_initializer='zeros',
                  kernel_regularizer=regularizers.l2(1e-4)))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    
    # Dense output layer
    model.add(Dense(3, 
                   activation='softmax', 
                   kernel_initializer=initializers.GlorotUniform(),
                   bias_initializer='zeros',
                   kernel_regularizer=regularizers.l2(1e-4)))

    # Compile the model
    model.compile(
        loss='categorical_crossentropy',
        optimizer=Adam(learning_rate=1e-4),
        metrics=['AUC']
    )
    return model

In [24]:
# Note: X_train.shape[1:] is (timesteps, features)
model = build_lstm_model(X_train.shape[1:])
model.summary()


In [25]:
import datetime
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping

# Create a TensorBoard callback
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

# Create an EarlyStopping callback
early_stopping_callback = EarlyStopping(monitor='AUC', patience=5, restore_best_weights=True)

# Convert y_train and y_val to categorical
y_train_categorical = tf.keras.utils.to_categorical(y_train, num_classes=3)
y_val_categorical = tf.keras.utils.to_categorical(y_val, num_classes=3)

# Fit the model with the TensorBoard and EarlyStopping callbacks
history = model.fit(
    X_train, y_train_categorical,
    epochs=100,
    batch_size=64,
    validation_data=(X_val, y_val_categorical),
    callbacks=[tensorboard_callback, early_stopping_callback],
    verbose=1
)

Epoch 1/100
[1m436/436[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 4ms/step - AUC: 0.5379 - loss: 1.5409 - val_AUC: 0.7067 - val_loss: 1.0270
Epoch 2/100
[1m436/436[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - AUC: 0.6153 - loss: 1.2542 - val_AUC: 0.7151 - val_loss: 1.0257
Epoch 3/100
[1m436/436[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - AUC: 0.6475 - loss: 1.1628 - val_AUC: 0.7287 - val_loss: 0.9979
Epoch 4/100
[1m436/436[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - AUC: 0.6647 - loss: 1.1168 - val_AUC: 0.7377 - val_loss: 0.9802
Epoch 5/100
[1m436/436[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - AUC: 0.6850 - loss: 1.0721 - val_AUC: 0.7446 - val_loss: 0.9677
Epoch 6/100
[1m436/436[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - AUC: 0.6929 - loss: 1.0537 - val_AUC: 0.7490 - val_loss: 0.9608
Epoch 7/100
[1m436/436[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/

In [26]:
#!tensorboard --logdir logs/fit

In [27]:
#use the model to predict the test data, it is a multi-class classification problem with 3 classes
y_pred = model.predict(X_test)

# Convert the predicted probabilities to class labels
y_pred_labels = np.argmax(y_pred, axis=1)

# Print the classification report
print(classification_report(y_test, y_pred_labels))

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred_labels))

[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step
              precision    recall  f1-score   support

           0       0.60      0.72      0.65      1602
           1       0.45      0.23      0.31       860
           2       0.61      0.66      0.63      1521

    accuracy                           0.59      3983
   macro avg       0.55      0.54      0.53      3983
weighted avg       0.57      0.59      0.57      3983

[[1148  126  328]
 [ 362  200  298]
 [ 401  122  998]]


In [28]:
#write Per-Class ROC-AUC:

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer

def per_class_roc_auc(y_true, y_pred):
    lb = LabelBinarizer()
    lb.fit(y_true)
    y_true = lb.transform(y_true)
    
    for (idx, label) in enumerate(lb.classes_):
        print(f"{label} AUC: {roc_auc_score(y_true[:, idx], y_pred[:, idx])}")
        
per_class_roc_auc(y_test, y_pred)


0 AUC: 0.7682295230499884
1 AUC: 0.6896877257258598
2 AUC: 0.7690238101723448


In [29]:
#mcc score please test

from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(y_test, y_pred_labels)

print("MCC Score:", mcc)

MCC Score: 0.34972914063743094


In [30]:
#save the model
model.save('lstm_model.keras')
print("Model saved to 'lstm_model.keras'")

Model saved to 'lstm_model.keras'
