# Deep Learning Model - All Features

The deep learning model is based on a Multilayer Perceptron (MLP) architecture and uses both textual data (the post's title and selftext) and categorical metadata (such as subreddit, flair, is_self, and nsfw).

In [None]:
# Install essential packages
# !pip install pandas
# !pip install tensorflow
# !pip install scikeras


In [None]:
# Import essential libraries
import pandas as pd             
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Dense, GlobalAveragePooling1D, Concatenate, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

In [147]:
# Step 1: Load the dataset
df = pd.read_csv("../data/cleaned_reddit_posts.csv")

Show how many entries fall into each popularity bucket to understand class balance.

In [None]:
print(df["popularity_bucket"].value_counts())

popularity_bucket
high      3415
low       3316
medium    3316
Name: count, dtype: int64
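
As a reference point for later results, the counts above imply a majority-class baseline. The following is a minimal sketch with the counts copied by hand from the output above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"high": 3415, "low": 3316, "medium": 3316}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:<6} {share:.4f}")
print(f"Majority-class baseline ({majority_label}): {baseline_accuracy:.4f}")
```

Because the three buckets are nearly balanced, any useful model should score clearly above roughly one third on accuracy.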


In [149]:
# Step 2: Drop unneeded columns
df = df.drop(columns=["id", "author", "score", "num_comments", "upvote_ratio"])

Combine title and selftext into a single text field so that the model sees all of the text in a post.

In [150]:
# Step 3: Combine title and selftext
df["text"] = df["title"].fillna('') + " " + df["selftext"].fillna('')

This step encodes the target column popularity_bucket into integer labels using LabelEncoder.   
The integer labels are then converted into one-hot vectors with to_categorical() for classification.   
The final output y is a one-hot encoded target array used to train the model.   

In [151]:
# Step 4: Encode labels (popularity_bucket)
label_encoder = LabelEncoder()
df["label"] = label_encoder.fit_transform(df["popularity_bucket"])
y = to_categorical(df["label"])
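
To make the two encoding steps concrete, here is a small NumPy-only sketch on toy labels. It assumes, as LabelEncoder does, that classes are numbered in sorted order; the array `buckets` is a made-up stand-in for `df["popularity_bucket"]`.

```python
import numpy as np

# Toy labels standing in for df["popularity_bucket"] (illustrative only).
buckets = np.array(["low", "high", "medium", "high"])

# np.unique returns the classes in sorted order plus each sample's class index,
# which mirrors the integer labels LabelEncoder would assign.
classes, labels = np.unique(buckets, return_inverse=True)

# One row per sample, with a 1 in the column of its class.
one_hot = np.eye(len(classes), dtype=int)[labels]

print(classes)
print(labels)
print(one_hot)
```

With sorted ordering, column 0 corresponds to high, column 1 to low, and column 2 to medium, which is the order the classification report uses later.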

Prepare the text for the neural network by converting it to numbers.  
First, a tokenizer maps the 10,000 most frequent words to integers and replaces any other word with the <OOV> token.  
Then each text is converted to a sequence of integers and padded or truncated to a fixed length of 100 so that every input has the same shape.  

In [152]:
# Step 5: Tokenize text
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(df["text"])
sequences = tokenizer.texts_to_sequences(df["text"])
X_text = pad_sequences(sequences, maxlen=100)
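
The sketch below imitates the tokenize-and-pad idea in plain Python so the mechanics are visible. It is a simplification, not the Keras implementation: it only lowercases and splits on whitespace, and the helper names (`build_word_index`, `to_sequence`, `pad_pre`) are hypothetical. It does follow the Keras defaults of padding and truncating at the front of the sequence.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words):
    """Map words to integers; unseen or too-rare words become the OOV index."""
    oov = 1
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word, oov)
        sequence.append(idx if idx < num_words else oov)
    return sequence

def pad_pre(sequence, maxlen):
    """Keep the last maxlen tokens and left-pad with zeros."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

texts = ["the cat sat", "the dog sat down"]
word_index = build_word_index(texts)

encoded = to_sequence("the bird sat", word_index, num_words=10000)
padded = pad_pre(encoded, maxlen=5)
truncated = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(encoded, padded, truncated)
```

Note that with front truncation, posts longer than the maximum length keep their final tokens rather than their opening ones.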

This code fills missing values in the categorical columns (subreddit, flair, and media_type) and encodes each one with LabelEncoder.  
It also converts the binary flags (is_self and nsfw) and the numeric created_hour to integers.  
Finally, the encoded features are stacked into a NumPy array X_other, which is joined with the padded text sequences X_text into a single input array X_combined.  

In [154]:
# Step 6: Encode categorical features
cat_features = ["subreddit", "flair", "media_type"]
encoded_features = []

for col in cat_features:
    le = LabelEncoder()
    df[col] = df[col].fillna("unknown")
    encoded = le.fit_transform(df[col])
    encoded_features.append(encoded)

# Add binary flags (is_self, nsfw) and the posting hour
encoded_features.append(df["is_self"].astype(int))
encoded_features.append(df["nsfw"].astype(int))
encoded_features.append(df["created_hour"].fillna(0).astype(int))

# Final non-text input
X_other = np.stack(encoded_features, axis=1)

# horizontally join features into one array
X_combined = np.hstack([X_text, X_other])  
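
The following sketch illustrates the shapes produced by the stacking and joining steps, using placeholder arrays in place of the real data. The `_demo` names and the row count are assumptions for illustration only.

```python
import numpy as np

n_posts = 4  # toy number of rows

# Stand-in for X_text: 100 padded token IDs per post.
X_text_demo = np.zeros((n_posts, 100), dtype=int)

# Stand-ins for the six non-text columns:
# three label-encoded columns plus is_self, nsfw, and created_hour.
columns = [np.arange(n_posts) for _ in range(6)]
X_other_demo = np.stack(columns, axis=1)  # one column per feature

# Join text and non-text features side by side.
X_combined_demo = np.hstack([X_text_demo, X_other_demo])
print(X_other_demo.shape, X_combined_demo.shape)
```

In the combined array, the first 100 columns hold raw token IDs (values below 10,000) and the last six hold category codes, flags, and an hour, so the columns sit on very different numeric scales.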

This function builds a Multilayer Perceptron (MLP) with three hidden layers and dropout for regularization.   
It uses ReLU activation in hidden layers and softmax for multiclass output.    
The model is compiled with Adam optimizer and categorical crossentropy loss for classification.

In [155]:
# Step 7: Build the model
def create_model(dropout_rate=0.3):
    model = Sequential()
    model.add(Input(shape=(X_combined.shape[1],)))  # single combined input
    model.add(Dense(256, activation='relu'))  #128
    model.add(Dropout(dropout_rate))
    model.add(Dense(128, activation='relu'))  #64
    model.add(Dropout(dropout_rate))
    
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(dropout_rate))

    model.add(Dense(3, activation='softmax'))
    
    optimizer = Adam(learning_rate=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model
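
To get a feel for the size of this network, the trainable parameters can be counted by hand. The sketch assumes an input width of 106 (100 padded token positions plus the six non-text features built earlier); `dense_params` is an illustrative helper, and Dropout layers add no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Assumed input width followed by the Dense layer sizes from create_model.
layer_sizes = [106, 256, 128, 64, 3]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)
print(total_params)
```

If the input width is as assumed, calling `summary()` on the Keras model should report the same total.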

The dataset is split into training and testing sets to evaluate model performance.      
A Keras model is wrapped and tuned using GridSearchCV to find the best hyperparameters.    
The best model is then evaluated on the test set, and accuracy is calculated after converting predictions and labels from one-hot to class format.    

In [None]:
# Step 8: Compile and train
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

print("Type of X_train:", type(X_train))
print("X_train shape:", X_train.shape)
print("X_train[0] shape:", np.array(X_train[0]).shape)

# Wrap model for GridSearch
model = KerasClassifier(model=create_model, verbose=0)

# Define hyperparameter grid
param_grid = {
    'batch_size': [16, 32],
    'epochs': [20, 30],  #10, 15
    "model__dropout_rate": [0.3, 0.5]
}

# Perform GridSearchCV on training data
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X_train, y_train)

# Print best hyperparameters
print(f"Best params: {grid_result.best_params_}")
print(f"Best accuracy: {grid_result.best_score_:.4f}")

# Evaluate on unseen test data
y_pred = grid_result.best_estimator_.predict(X_test)

# Convert predictions to class labels if they are probabilities or one-hot
y_pred_labels = np.argmax(y_pred, axis=1)  

# Convert one-hot y_test to class labels
y_test_labels = np.argmax(y_test, axis=1)

accuracy = accuracy_score(y_test_labels, y_pred_labels)
print(f"Test accuracy: {accuracy:.4f}")
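
The cost of this grid search can be estimated before running it. The sketch below enumerates the grid with `itertools.product` and counts model fits, assuming GridSearchCV's default behavior of refitting the best configuration once on the full training set.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "batch_size": [16, 32],
    "epochs": [20, 30],
    "model__dropout_rate": [0.3, 0.5],
}
cv_folds = 3

# Every combination of one value per hyperparameter.
combinations = list(product(*param_grid.values()))

cv_fits = len(combinations) * cv_folds
total_fits = cv_fits + 1  # plus one final refit of the best configuration

print(len(combinations), cv_fits, total_fits)
```

Each fit trains a fresh network for 20 or 30 epochs, so widening the grid increases the runtime quickly.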

Evaluate the model on the test set to get loss and accuracy metrics.

In [157]:
# Evaluate
# Get the underlying Keras model from the best estimator
best_model = grid_result.best_estimator_.model_

# Evaluate on test data using combined features
loss, accuracy = best_model.evaluate(X_test, y_test, verbose=0)
print(f"Test loss: {loss:.4f}, Test accuracy: {accuracy:.4f}")

Test loss: 1.0994, Test accuracy: 0.3234
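
One way to read the reported loss is to compare it with the cross-entropy of a model that assigns equal probability to all three classes. The sketch below does that arithmetic; the reported value is copied from the output above, and the interpretation is an inference from the number rather than something the notebook checks directly.

```python
import math

n_classes = 3

# Cross-entropy when every class receives probability 1/3.
uniform_loss = -math.log(1 / n_classes)

reported_loss = 1.0994  # copied from the output above
print(f"Uniform-guess loss: {uniform_loss:.4f}")
print(f"Difference from reported loss: {reported_loss - uniform_loss:.4f}")
```

A loss this close to ln 3 is consistent with the network producing nearly uniform probabilities, which suggests it has learned little from the inputs.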


Print a detailed classification report and a confusion matrix to evaluate model performance by comparing true and predicted labels.

In [None]:
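# Print classification report
print("\nClassification Report:")
print(classification_report(y_test_labels, y_pred_labels, target_names=label_encoder.classes_, zero_division=0))
# Print confusion matrix (rows: true labels, columns: predicted labels)
print("\nConfusion Matrix:")
print(confusion_matrix(y_test_labels, y_pred_labels))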


Classification Report:
              precision    recall  f1-score   support

        high       0.32      1.00      0.49       650
         low       0.00      0.00      0.00       684
      medium       0.00      0.00      0.00       676

    accuracy                           0.32      2010
   macro avg       0.11      0.33      0.16      2010
weighted avg       0.10      0.32      0.16      2010


Confusion Matrix:
[[650   0   0]
 [684   0   0]
 [676   0   0]]
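
The per-class figures in the report can be recovered directly from the confusion matrix. This is a small plain-Python sketch with the matrix copied from the output above; `safe_div` is an illustrative helper that returns 0 when a class is never predicted.

```python
# Confusion matrix copied from the output above
# (rows: true labels, columns: predicted labels).
labels = ["high", "low", "medium"]
cm = [[650, 0, 0],
      [684, 0, 0],
      [676, 0, 0]]

def safe_div(numerator, denominator):
    """Return 0.0 instead of dividing by zero."""
    return numerator / denominator if denominator else 0.0

precision, recall = {}, {}
for i, label in enumerate(labels):
    true_positives = cm[i][i]
    predicted_as_label = sum(row[i] for row in cm)  # column total
    actually_label = sum(cm[i])                     # row total
    precision[label] = safe_div(true_positives, predicted_as_label)
    recall[label] = safe_div(true_positives, actually_label)
    print(f"{label:<6} precision={precision[label]:.2f} recall={recall[label]:.2f}")
```

Only the first column is non-zero, which means every test post was predicted as high.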



Calculate and display overall accuracy, precision, recall, and F1 score for model evaluation.

In [160]:
# Evaluate model on test set
y_pred = grid_result.best_estimator_.predict(X_test)
y_pred_labels = np.argmax(y_pred, axis=1)
y_test_labels = np.argmax(y_test, axis=1)

# Compute metrics
model_accuracy = accuracy_score(y_test_labels, y_pred_labels)
model_precision = precision_score(y_test_labels, y_pred_labels, average='weighted', zero_division=0)
model_recall = recall_score(y_test_labels, y_pred_labels, average='weighted', zero_division=0)
model_f1 = f1_score(y_test_labels, y_pred_labels, average='weighted', zero_division=0)

# Print all metrics
print("\n=== Model Performance Metrics ===")
print(f"Accuracy:  {model_accuracy:.4f}")
print(f"Precision: {model_precision:.4f}")
print(f"Recall:    {model_recall:.4f}")
print(f"F1 Score:  {model_f1:.4f}")



=== Model Performance Metrics ===
Accuracy:  0.3234
Precision: 0.1046
Recall:    0.3234
F1 Score:  0.1580
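
The weighted averages above follow from the per-class values and the class supports in the classification report. The sketch below reproduces them by hand, assuming the standard definition of a support-weighted mean; the numbers are copied from the earlier outputs.

```python
# Supports and per-class scores taken from the classification report.
support = {"high": 650, "low": 684, "medium": 676}
total = sum(support.values())

precision = {"high": 650 / total, "low": 0.0, "medium": 0.0}
recall = {"high": 1.0, "low": 0.0, "medium": 0.0}

def f1(p, r):
    """Harmonic mean of precision and recall, 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def weighted(scores):
    """Average per-class scores, weighting each class by its support."""
    return sum(scores[c] * support[c] for c in support) / total

f1_scores = {c: f1(precision[c], recall[c]) for c in support}

weighted_precision = weighted(precision)
weighted_recall = weighted(recall)
weighted_f1 = weighted(f1_scores)

print(f"{weighted_precision:.4f} {weighted_recall:.4f} {weighted_f1:.4f}")
```

Weighted recall always equals accuracy in a single-label problem, which is why those two lines of the output match.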


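The model assigns every test post to the high class, so it performs no better than always guessing a single bucket. We will next train the model on a different subset of input features to see whether performance improves.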