<a href="https://colab.research.google.com/github/laurenthanhvo/steam_predictive/blob/casey/steam_schema_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Steam AU Reviews & Items – Modeling

**Context**

Our predictive task is formulated as a supervised binary classification problem.
For each (user, game) interaction, we want to use the user’s review text and gameplay behavior to predict whether the user will recommend the game (recommend = True) or not recommend it (recommend = False).

**Inputs (Features)**

We use information available to us in the merged interaction dataset:

Gameplay-Based Features

- playtime_forever

- playtime_2weeks


Text-Based Features

- TF–IDF vectorization of the review text

Game Metadata

- Genre

**Output (Label)**

- A binary variable:
  - 1 → user recommends the game
  - 0 → user does not recommend the game

**Objective Function**

We train the models with the standard binary classification objective:

- Binary cross-entropy loss (for logistic regression and neural networks)
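
For reference, binary cross-entropy is the average negative log-probability that the model assigns to the true label. The sketch below uses only the standard library; the function name `binary_cross_entropy` and the example probabilities are illustrative.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident, correct predictions give a low loss ...
low = binary_cross_entropy([1, 0], [0.9, 0.1])
# ... while confident, wrong predictions are penalized heavily.
high = binary_cross_entropy([1, 0], [0.1, 0.9])
print(low, high)
```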

Metrics used for evaluation:

- Accuracy

- F1 score

- Precision / Recall

- ROC-AUC (reported for the final model)
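
Accuracy, precision, recall, and F1 can all be derived from the four confusion-matrix counts. A minimal sketch, with made-up counts and a hypothetical helper name `classification_metrics`:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical counts for 100 test reviews (illustration only).
acc, prec, rec, f1 = classification_metrics(tp=80, fp=5, fn=10, tn=5)
print(acc, prec, rec, f1)
```

Here the positive class is `recommend = 1`, matching the label definition above.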


**Appropriate Models**

Because our dataset contains both structured numeric data and unstructured text, multiple model types are suitable:

- Logistic Regression (with TF-IDF text features)

- Naive Bayes (text-only baseline)

- Random Forest or Gradient Boosted Trees (numerical + metadata)

- Shallow Neural Network (dense layers on concatenated features)

These models allow us to test both simple linear approaches and slightly more complex nonlinear ones.

**Discussion**

This section compares modeling approaches and explains why each is useful or limited for this task.

1. Logistic Regression

Advantages

- Strong baseline for text classification

- Works well with TF-IDF sparse vectors

- Fast to train and easy to interpret

- Robust on moderately sized datasets

Disadvantages

- Linear decision boundary

- Struggles with nonlinear interactions between features

- Sensitive to feature scaling

Logistic regression is the standard baseline from the course material, and it is the final model we fit on the combined text and playtime features.

2. Naive Bayes

Advantages

- Extremely fast and lightweight

- Performs surprisingly well on raw text

- Good baseline for TF-IDF or Bag-of-Words

Disadvantages

- Assumes independence between features

- Not suitable when numeric features matter

- Lower accuracy when reviews are long or nuanced

We use Naive Bayes as a text-only benchmark to check whether adding gameplay features truly helps.

3. Tree-Based Models (Random Forest, XGBoost, LightGBM)

Advantages

- Capture nonlinear patterns

- Handle missing values and skewed numeric data well

- Do not require heavy preprocessing

Disadvantages

- Scale poorly to high-dimensional sparse TF-IDF matrices

- Usually need text features compressed first, via dimensionality reduction or embeddings

- Slower to train on large datasets

These models help test whether nonlinear relationships in gameplay data matter.
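
One way to give a tree ensemble access to review text is to compress the TF-IDF matrix with truncated SVD and append playtime as an extra column. The sketch below assumes that approach and runs on made-up data; the 2 SVD components, the `log1p` transform, and the random forest settings are illustrative choices, not tuned values from this project.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews, playtimes (minutes), and labels for illustration only.
reviews = [
    "great game loved it", "fun combat great story", "amazing soundtrack",
    "boring and buggy", "crashes constantly want refund", "waste of money",
]
playtime = np.array([[1200], [800], [450], [30], [15], [60]], dtype=float)
labels = np.array([1, 1, 1, 0, 0, 0])

# Sparse TF-IDF -> low-dimensional dense text representation.
tfidf_toy = TfidfVectorizer().fit_transform(reviews)
svd = TruncatedSVD(n_components=2, random_state=0)
text_dense = svd.fit_transform(tfidf_toy)

# Append log-scaled playtime so the trees see text and gameplay together.
X_tree = np.hstack([text_dense, np.log1p(playtime)])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_tree, labels)
print(X_tree.shape, forest.predict(X_tree))
```

On the real data, the vectorizer and SVD would be fit on the training split only and then applied to the test split, as the TF-IDF step in the modeling cell does.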

4. Neural Networks

Advantages

- Can combine numeric data with dense text embeddings

- More expressive than linear models

- Can model interaction effects between playtime and review sentiment

Disadvantages

- Higher computational cost

- Longer training time

- Less interpretable

- Requires careful tuning to avoid overfitting

In this project, the neural network is an optional extension, not the primary model.
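
As a sketch of what this optional extension could look like, scikit-learn's `MLPClassifier` can stand in for a shallow neural network on concatenated dense features. The feature matrix below is random placeholder data standing in for text embeddings plus scaled playtime, and the layer size and iteration count are arbitrary.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Placeholder features: 8 "text embedding" dims + 1 scaled playtime column.
n_samples = 200
text_embed = rng.normal(size=(n_samples, 8))
playtime_scaled = rng.normal(size=(n_samples, 1))
X_nn = np.hstack([text_embed, playtime_scaled])

# Synthetic label that depends on an interaction between two features,
# which a purely linear model cannot represent.
y_nn = (X_nn[:, 0] * X_nn[:, 8] > 0).astype(int)

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
mlp.fit(X_nn, y_nn)

proba = mlp.predict_proba(X_nn)[:, 1]  # estimated P(label = 1)
print(proba[:5])
```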

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

from scipy.sparse import hstack

# ----------------------------------------------------
# 1. Load your dataset here (adjust path as needed)
# ----------------------------------------------------
# Example: interactions = pd.read_csv("interactions.csv")
# `interactions` must already be defined (by an earlier cell or by the
# example line above); otherwise the next line raises NameError.

df = interactions.copy()

# ----------------------------------------------------
# 2. Clean and prepare data
# ----------------------------------------------------
df = df.dropna(subset=["review", "recommend"]).copy()
df["label"] = df["recommend"].astype(int)
df["playtime_forever"] = df["playtime_forever"].fillna(0)

# ----------------------------------------------------
# 3. Train/Test Split
# ----------------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    df, df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# ----------------------------------------------------
# 4. TF-IDF Text Features
# ----------------------------------------------------
tfidf = TfidfVectorizer(
    max_features=20000,
    stop_words="english"
)

X_train_text = tfidf.fit_transform(X_train["review"])
X_test_text = tfidf.transform(X_test["review"])

# ----------------------------------------------------
# 5. Numeric Feature (playtime)
# ----------------------------------------------------
scaler = StandardScaler()

X_train_num = scaler.fit_transform(X_train[["playtime_forever"]])
X_test_num = scaler.transform(X_test[["playtime_forever"]])

# Combine text + numeric (CSR avoids an internal format conversion in scikit-learn)
X_train_combined = hstack([X_train_text, X_train_num], format="csr")
X_test_combined  = hstack([X_test_text, X_test_num], format="csr")

# ===============================
# BASELINES
# ===============================

# ----------------------------------------------------
# Baseline 1: Majority Class
# ----------------------------------------------------
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_text, y_train)
pred_baseline = baseline.predict(X_test_text)

print("\n=== Baseline: Majority Class ===")
print("Accuracy:", accuracy_score(y_test, pred_baseline))

# ----------------------------------------------------
# Baseline 2: Naive Bayes
# ----------------------------------------------------
nb = MultinomialNB()
nb.fit(X_train_text, y_train)
pred_nb = nb.predict(X_test_text)

print("\n=== Baseline: Naive Bayes ===")
print("Accuracy:", accuracy_score(y_test, pred_nb))
print("F1 Score:", f1_score(y_test, pred_nb))

# ===============================
# FINAL MODEL
# ===============================

# ----------------------------------------------------
# Logistic Regression (Text + Numeric Features)
# ----------------------------------------------------
# max_iter is set generously so lbfgs can converge on up to 20,000 TF-IDF features
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_combined, y_train)

pred_lr = logreg.predict(X_test_combined)
proba_lr = logreg.predict_proba(X_test_combined)[:, 1]

print("\n=== Final Model: Logistic Regression ===")
print("Accuracy:", accuracy_score(y_test, pred_lr))
print("F1 Score:", f1_score(y_test, pred_lr))
print("ROC-AUC:", roc_auc_score(y_test, proba_lr))

# ----------------------------------------------------
# Optional: Print Most Important Words
# ----------------------------------------------------
feature_names = np.array(tfidf.get_feature_names_out())
coef = logreg.coef_[0][:-1]  # drop the last coefficient (playtime_forever)

order = np.argsort(coef)                    # ascending by coefficient
top_pos = feature_names[order[-15:][::-1]]  # most positive first
top_neg = feature_names[order[:15]]         # most negative first

print("\nTop words predicting recommend=True:")
print(top_pos)

print("\nTop words predicting recommend=False:")
print(top_neg)


NameError: name 'interactions' is not defined
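
The `NameError` above means `interactions` was never created in this session. Until the real merged dataset is loaded, a tiny synthetic frame with the columns the cell reads (`review`, `recommend`, `playtime_forever`) is enough to exercise the cleaning steps. The rows below are made up and are not part of the Steam data.

```python
import pandas as pd

# Synthetic stand-in for the merged interaction table (illustration only).
interactions = pd.DataFrame({
    "review": [
        "great game loved the story", "fun combat and great music",
        "amazing world worth every hour", "solid gameplay would play again",
        "addictive and polished", "charming art and good pacing",
        "boring and buggy mess", "crashes constantly want refund",
        "waste of money avoid", "terrible controls not fun",
    ],
    "recommend": [True] * 6 + [False] * 4,
    "playtime_forever": [1200, 800, 450, 300, 2000, 150, 30, 15, 60, None],
})

# Same cleaning steps as the modeling cell.
df = interactions.dropna(subset=["review", "recommend"]).copy()
df["label"] = df["recommend"].astype(int)
df["playtime_forever"] = df["playtime_forever"].fillna(0)
print(df[["label", "playtime_forever"]].describe())
```

Any scores obtained on a frame this small say nothing about model quality; it only confirms that the column names and dtypes line up.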