# Steam AU Reviews & Items

## Introduction

### Predictive task
***Given a user's review of a game and their play history, does the user recommend the game?***

The **input** features for our model include:
* Hours played
* Number of sessions
* Game genre
* Review text
* Basic user history

Note: Review text refers to user reviews processed with TF-IDF vectorization and sentiment scoring. Basic user history refers to a user's past recommendation rate and the games already in their Steam library.
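
As a rough illustration of the "past recommendation rate" feature, the sketch below computes it on a toy table so that each row only sees the user's earlier reviews. The toy data and the `past_rec_rate` column name are hypothetical; the real feature would be built on the time-sorted review table.

```python
import pandas as pd

# Toy review log (hypothetical data), one row per review,
# already sorted by time within each user.
toy = pd.DataFrame({
    "user_id":   ["u1", "u1", "u1", "u2", "u2"],
    "recommend": [True, False, True, True, True],
})

# Mean of each user's *earlier* labels only: shift(1) removes the current
# row, so the label being predicted never leaks into its own feature.
toy["past_rec_rate"] = (
    toy.groupby("user_id")["recommend"]
       .transform(lambda s: s.astype(float).shift(1).expanding().mean())
)

print(toy)
```

A user's first review has no history, so its rate is `NaN` and would need a fill value (for example the global recommendation rate) before modeling.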

The **output** of our model is a binary label (1 - recommend, 0 - not recommend) indicating whether the user recommends the game or not. This task is appropriate for supervised learning and aligns directly with models covered in the course.

### Plans: Baselines and Evaluation

We plan to use the following baseline models:
* Random baseline
* Logistic regression
* Naive Bayes

We plan to evaluate these models by comparing these metrics:
* Accuracy
* F1 score

## Imports & Setup

In [None]:
# Loading data
import gzip
import ast
from pathlib import Path

# Essentials
import pandas as pd
import numpy as np
from collections import defaultdict

# Preprocessing & Splitting
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Pipelines & Feature Combos
from sklearn.pipeline import make_pipeline, FeatureUnion

# Models
from sklearn.svm import LinearSVC
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Evaluation
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

In [None]:
DATA_DIR = Path('data')

def load_python_dicts_gz(path: Path, max_rows=None, verbose=True) -> pd.DataFrame:
    rows = []
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            rows.append(ast.literal_eval(line))
            if max_rows is not None and len(rows) >= max_rows:
                break
            if verbose and i % 100_000 == 0:
                print(f"Read {i} lines from {path.name}...")

    df = pd.json_normalize(rows)
    return df

In [None]:
# Load user reviews data
reviews_path = DATA_DIR / 'australian_user_reviews.json.gz'
reviews = load_python_dicts_gz(reviews_path, max_rows=100_000)

print('reviews shape:', reviews.shape)
print('reviews columns:')
print(list(reviews.columns))

reviews.head()

In [None]:
# Load user items data
items_path = DATA_DIR / 'australian_users_items.json.gz'
items = load_python_dicts_gz(items_path, max_rows=100)

print('items shape:', items.shape)
print('items columns:')
print(list(items.columns))

items.head()

## Preprocessing Data

### What does each value in the reviews column represent in the ``reviews`` dataframe?

In [None]:
from pprint import pprint

review_list = reviews['reviews'].iloc[0]
print(type(review_list))

pprint(review_list)

In [None]:
first_review = review_list[0]
pprint(first_review)

### What about the items column in the ``items`` dataframe?

In [None]:
item_list = items['items'].iloc[0]
print(type(item_list))

item_list

In [None]:
first_item = item_list[0]
pprint(first_item)

Each value in the reviews column and in the items column is a list of dictionaries. In the reviews column, each list holds one dictionary per review the user wrote, one for each game reviewed. In the items column, each list holds one dictionary per item, or game, in that user's Steam library.

It's a bit hard to read each review from a user, especially if they leave a lot of reviews. We can explode the list of reviews to make it easier to read. We will also do the same thing for items.
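
To make the reshaping concrete, here is a minimal sketch on a two-user toy frame (hypothetical data, not taken from the dataset). It shows how `explode` produces one row per review and `json_normalize` turns the dictionary keys into columns.

```python
import pandas as pd

# Toy frame with the same nesting as the raw data: one row per user,
# a list of review dictionaries per row (values are made up).
toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reviews": [
        [{"item_id": "10", "recommend": True},
         {"item_id": "20", "recommend": False}],
        [{"item_id": "10", "recommend": True}],
    ],
})

# One row per review; user_id is repeated for each exploded element.
long = toy.explode("reviews").reset_index(drop=True)

# Expand each dictionary into its own columns and attach them.
details = pd.json_normalize(long["reviews"].tolist())
long = pd.concat([long.drop(columns=["reviews"]), details], axis=1)

print(long)
```

The index is reset before the `concat` so that the exploded rows and the normalized columns line up row by row.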

In [None]:
# Explode list of reviews so each review gets its own row
reviews_long = reviews.explode('reviews').reset_index(drop = True)

# Split each dictionary key into separate columns
review_details = pd.json_normalize(reviews_long['reviews'])

reviews_long = pd.concat(
    [reviews_long.drop(columns = ['reviews']), review_details],
    axis = 1
)

reviews_long.head()

In [None]:
items_long = items.explode("items").reset_index(drop=True)

item_details = pd.json_normalize(items_long["items"])
items_long = pd.concat(
    [items_long.drop(columns=["items"]), item_details],
    axis=1
)

items_long.head()

### Exploratory Data Analysis

In [None]:
print("reviews_long shape:", reviews_long.shape)
print("items_long shape:", items_long.shape)

print("Unique users in reviews:", reviews_long["user_id"].nunique())
print("Unique users in items:",   items_long["user_id"].nunique())
print("Unique items (reviews):", reviews_long["item_id"].nunique())
print("Unique items (items):",   items_long["item_id"].nunique())

In [None]:
reviews_per_user = reviews_long.groupby("user_id")["item_id"].nunique()
reviews_per_user.describe()

In [None]:
owned_per_user = items_long.groupby("user_id")["item_id"].nunique()
owned_per_user.describe()

In [None]:
items_long["playtime_forever"].describe()

In [None]:
reviews_long["recommend"].value_counts(normalize=True)

In [None]:
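# keep only rows with an item_id (copy so the column assignment below
# acts on an independent frame and avoids SettingWithCopyWarning)
reviews_long = reviews_long.dropna(subset=["item_id"]).copy()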

# parse 'posted' like "Posted November 5, 2011."
posted_clean = (
    reviews_long["posted"]
      .str.replace("Posted ", "", regex=False)
      .str.rstrip(".")
)

reviews_long["timestamp"] = pd.to_datetime(posted_clean, errors="coerce")

# drop rows we couldn't parse
reviews_long = reviews_long.dropna(subset=["timestamp"])

# sort by time per user
reviews_long = reviews_long.sort_values(["user_id", "timestamp"])
reviews_long.head()

In [None]:
# make sure item_id is the same type in both tables
reviews_long["item_id"] = reviews_long["item_id"].astype(str)
items_long["item_id"]   = items_long["item_id"].astype(str)

interactions = reviews_long.merge(
    items_long[["user_id", "item_id", "item_name", "playtime_forever", "playtime_2weeks"]],
    on=["user_id", "item_id"],
    how="left"
)

interactions.head()

In [None]:
(
    interactions.groupby("recommend")["playtime_forever"]
    .describe()
)

### Feature Engineering

In [None]:
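# Combine word- and character-level TF-IDF features for the review text

features = FeatureUnion([
    ("word_tfidf",
     TfidfVectorizer(
         max_features=50000,
         ngram_range=(1,2),
         min_df=3,
         stop_words="english",
         sublinear_tf=True
     )),
    # Character 3-5 grams, as described for the final model below;
    # analyzer="char_wb" and min_df=3 are assumed settings.
    ("char_tfidf",
     TfidfVectorizer(
         analyzer="char_wb",
         ngram_range=(3,5),
         min_df=3,
         sublinear_tf=True
     )),
])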

In [None]:
# train/test split

X_train, X_test, y_train, y_test = train_test_split(
    interactions["review"],
    interactions["recommend"],
    test_size=0.2,
    random_state=42
)

## Modeling

### Baseline models

**1) Random baseline**

Our first model is a random baseline, which predicts 0 or 1 at random in proportion to the class distribution seen in training. In the context of our dataset, each prediction is whether the user recommends a game or not. When the classes are imbalanced, this baseline usually scores below a majority-class baseline on accuracy, so it acts as a floor rather than a hard target: any model that learns from the review text should clear it. Comparing against it shows how much the learned models add over chance.
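
To see how the stratified baseline relates to a majority-class baseline, the sketch below compares the two on toy labels with an assumed 90/10 class split (the split is illustrative, not measured from our data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: 90% positive, 10% negative (assumed split for illustration).
y = np.array([1] * 900 + [0] * 100)
X = np.zeros((len(y), 1))  # dummy classifiers ignore the features

stratified = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)
majority = DummyClassifier(strategy="most_frequent").fit(X, y)

acc_strat = accuracy_score(y, stratified.predict(X))
acc_major = accuracy_score(y, majority.predict(X))

print("stratified accuracy:   ", acc_strat)
print("most_frequent accuracy:", acc_major)
```

With a split like this, always predicting the majority class is right 90% of the time, while random stratified guessing is expected to land near 0.9 * 0.9 + 0.1 * 0.1 = 0.82.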

In [None]:
def build_baseline():
    baseline = DummyClassifier(strategy="stratified")
    baseline.fit(X_train, y_train)
    return baseline

**2) Logistic regression**

Our next model is logistic regression, a strong and widely used baseline in machine learning. We chose it because it works well with high-dimensional sparse features like TF-IDF and is simple, interpretable, and fast to train. For our dataset, logistic regression learns a weighted linear boundary between recommend and not recommend. Its weights map directly onto words: the sign of a weight gives the direction of a word's sentiment and its magnitude gives the strength.

In [None]:
def build_logistic_regression():
    model = make_pipeline(
        features,
        LogisticRegression(max_iter=2000)
    )
    model.fit(X_train, y_train)
    return model

**3) Naive Bayes**

Why we included Naive Bayes

- Classic baseline for text classification tasks.

- Very fast to train and evaluate.

- Performs surprisingly well on short reviews and simple sentiment.

- Helps us check whether TF-IDF alone can produce strong performance.

What it does

- Uses word frequencies under a conditional independence assumption.

- Learns how often words appear in positive vs. negative reviews.

- Provides a lightweight benchmark to compare against more complex models.

In [None]:
def build_naive_bayes():
    text_vectorizer = TfidfVectorizer(stop_words="english")

    nb_model = make_pipeline(
        text_vectorizer,
        MultinomialNB()
    )
    # X_train is already the Series of review text, so it is passed directly
    nb_model.fit(X_train, y_train)
    return nb_model

### Final model

**Large TF-IDF + Linear SVC**

Why LinearSVC?

We choose a Linear Support Vector Classifier (LinearSVC) as our final model because:

- It performs extremely well on high-dimensional sparse text data

- It is more robust than Naive Bayes when features correlate

- It scales better than kernel SVM for large datasets

- It is fast to train on tens of thousands of TF-IDF features

- It handles class imbalance well when paired with strong features

Why combine multiple TF-IDF representations?

Our FeatureUnion merges different types of text signals:

- Word-level TF-IDF (1–2 grams)
  - captures phrases like “very fun”, “not good”

- Character-level TF-IDF (3–5 grams)
  - captures subword patterns
  - helps with misspellings, slang, repeated letters (“goooood”, “amazzing”)
  - helps with stylized writing common in game reviews

Together, these create a richer and more expressive representation of Steam review text.
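
A quick way to see what character n-grams add is to compare how similar "good" and "goooood" look under each representation. This is a minimal sketch; `analyzer="char_wb"` is one assumed way to get character-level features in scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["good", "goooood"]

# Word-level: the two spellings are different tokens with nothing in common.
word_vec = TfidfVectorizer(analyzer="word")
# Character-level 3-5 grams built inside word boundaries (assumed setting).
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))

word_sim = cosine_similarity(word_vec.fit_transform(docs))[0, 1]
char_sim = cosine_similarity(char_vec.fit_transform(docs))[0, 1]

print("word-level similarity:", word_sim)
print("char-level similarity:", char_sim)
```

The word-level vectors share no features, so their similarity is zero, while the character-level vectors overlap on n-grams such as "goo" and "ood".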

Why C=1.0?

- scikit-learn's default value, a moderate regularization strength that limits overfitting

- Strong performance without needing heavy tuning

Why Pipeline?

Using make_pipeline ensures:

- preprocessing + model are connected

- no manual feature handling needed

- one unified model object for training + prediction

In [None]:
def build_linear_svc():
    model = make_pipeline(
        features,
        LinearSVC(C=1.0)
    )

    print("Training final LinearSVC model...")
    model.fit(X_train, y_train)
    return model

## Evaluation

Recall that our goal is to correctly predict whether a Steam user will recommend a game based on their review and gameplay behavior. Since this is a binary classification task, our evaluation needs to measure how well our model distinguishes between positive and negative recommendations.

As mentioned at the beginning of the notebook, we planned to evaluate our models with the following metrics:

* Accuracy

Measuring accuracy will allow us to see the percentage of correct predictions our model makes, which is useful when classes (recommend vs. not recommend) are fairly balanced.

* F1 Score

The F1 score is the harmonic mean of precision and recall. It is more robust than accuracy when the dataset is imbalanced, which can apply to Steam reviews.

We chose these metrics for the following reasons:
* Recommendation data often contains more positive reviews than negative ones
* Accuracy alone, especially at a high value, may be misleading for models that ignore the minority class
* F1 score balances precision and recall, so a model cannot score well by maximizing one at the expense of the other. Note that scikit-learn's default binary F1 is computed for the positive (recommend) class only
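
To make the difference between these metrics concrete, the sketch below scores a model that predicts "recommend" for every review on toy labels with an assumed 80/20 split. Macro-averaged F1 is included as a point of comparison; it is not one of the metrics reported in this notebook.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 8 recommend, 2 not recommend (assumed split for illustration).
y_true = [1] * 8 + [0] * 2
# A model that ignores the minority class and always predicts "recommend".
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)
# Default binary F1 scores the positive class (label 1) only.
f1_pos = f1_score(y_true, y_pred)
# Macro F1 averages the per-class F1 of both classes.
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("accuracy:", acc)
print("F1 (positive class):", f1_pos)
print("F1 (macro):", f1_macro)
```

Accuracy is 0.8 and positive-class F1 is 8/9, even though no negative review is ever identified. Macro F1 drops to 4/9 because the negative class contributes an F1 of zero, so it is worth checking whenever the minority class matters.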

We evaluate all of our models (baselines and final model) with the following helper function, then compare the metrics in a single table to see which model scores highest.

In [None]:
def evaluate(model, name):
    '''
    Purpose: Keeps code clean and readable
             Ensures all models are evaluated consistently on the test set
    Returns: dict with the model name, accuracy, and F1 score
    '''
    preds = model.predict(X_test)

    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds)

    print(f"\n=== {name} ===")
    print("Accuracy:", acc)
    print("F1 Score:", f1)

    return {
        "model": name,
        "accuracy": acc,
        "f1": f1
    }

### Baseline evaluation

In [None]:
baseline = DummyClassifier(strategy="stratified")
baseline.fit(X_train, y_train)

baseline_results = evaluate(baseline, "Baseline: Random (Stratified)")
baseline_results

In [None]:
logreg_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=300))
logreg_model.fit(X_train, y_train)

logreg_results = evaluate(logreg_model, "Logistic Regression + TF-IDF")
logreg_results

In [None]:
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_model.fit(X_train, y_train)

nb_results = evaluate(nb_model, "Naive Bayes + TF-IDF")
nb_results

### Final model evaluation

In [None]:
steam_model = make_pipeline(features, LinearSVC(C=1.0))
steam_model.fit(X_train, y_train)

svm_results = evaluate(steam_model, "Final Model: LinearSVC + TF-IDF")
svm_results

### Model comparison

In [None]:
results_df = pd.DataFrame([
    baseline_results,
    nb_results,
    logreg_results,
    svm_results
])

results_df

### Performance plot

In [None]:
import matplotlib.pyplot as plt  # not imported in the setup cell

results_df.set_index("model")[["accuracy","f1"]].plot(kind="bar", figsize=(8,4))

plt.title("Model Performance Comparison")
plt.ylabel("Score")
plt.xticks(rotation=45, ha="right")

plt.show()

## Conclusion