# Steam AU Reviews & Items

## Introduction

### Predictive task
***Given a user's review of a game and their history on Steam, does the user recommend the game?***

The **input** features for our model include:
* Hours played
* Number of sessions
* Game genre
* Review text
* Basic user history

Note: Review text refers to user reviews processed with TF-IDF vectorization and sentiment scoring. Basic user history refers to a user's past recommendation rate and the games already in their Steam library.
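One way to compute a past recommendation rate without leaking the current label is to average only the reviews that came before each row. The sketch below is illustrative: it assumes a flat table with hypothetical `user_id` and `recommend` columns, already sorted chronologically within each user, which may not match the real schema.

```python
import pandas as pd

# Toy data: one row per (user, review), assumed sorted by time within each user.
df = pd.DataFrame({
    "user_id":   ["u1", "u1", "u1", "u2", "u2"],
    "recommend": [1, 0, 1, 1, 1],
})

def prior_rate(labels: pd.Series) -> pd.Series:
    # shift() drops the current review, so each row only sees earlier labels
    return labels.shift().expanding().mean()

df["past_rec_rate"] = df.groupby("user_id")["recommend"].transform(prior_rate)
print(df)
```

A user's first review has no history, so its rate is NaN and needs a fill value chosen from the training split only.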

The **output** of our model is a binary label (1 = recommend, 0 = not recommend) indicating whether the user recommends the game. This makes the task a supervised binary classification problem, which aligns directly with models covered in the course.

### Plans: Baselines and Evaluation

We plan to use the following baseline models:
* Random baseline
* Logistic regression
* Naive Bayes

We plan to evaluate these models by comparing these metrics:
* Accuracy
* F1 score
* Precision/recall
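As a rough illustration of how these metrics could be computed with scikit-learn, the sketch below scores a set of made-up predictions. The `y_true` and `y_pred` values are hypothetical and are not results from our models.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels and predictions (1 = recommend, 0 = not recommend)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Precision, recall, and F1 are reported for the positive class (recommend) by default, so the minority class may need its own report if the labels are imbalanced.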

## Imports & Setup

In [None]:
# Loading data
import gzip
import ast
from pathlib import Path

# Essentials
import pandas as pd
import numpy as np
from collections import defaultdict

# Preprocessing & Splitting
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Pipelines & Feature Combos
from sklearn.pipeline import make_pipeline, FeatureUnion

# Models
from sklearn.svm import LinearSVC
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Evaluation
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

In [None]:
DATA_DIR = Path('data')

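def load_python_dicts_gz(path: Path, max_rows=None, verbose=True) -> pd.DataFrame:
    """Read a gzipped file with one Python dict literal per line into a DataFrame.

    Lines are parsed with ast.literal_eval rather than json.loads, so
    single-quoted Python-style records are accepted.
    """
    rows = []
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            rows.append(ast.literal_eval(line))
            if max_rows is not None and len(rows) >= max_rows:
                break  # stop early once enough rows are collected
            if verbose and i % 100_000 == 0:
                print(f"Read {i} lines from {path.name}...")

    # Flatten nested dict fields into dotted column names
    return pd.json_normalize(rows)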

In [None]:
# Load user reviews data
reviews_path = DATA_DIR / 'australian_user_reviews.json.gz'
reviews = load_python_dicts_gz(reviews_path, max_rows=100_000)

print('reviews shape:', reviews.shape)
print('reviews columns:')
print(list(reviews.columns))

reviews.head()
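If each loaded row holds one user with a nested list of reviews, the table needs flattening to one row per review before modeling. The sketch below shows one possible approach on toy records. The schema (`user_id`, a `reviews` list, and the `item_id`, `recommend`, `review` fields) is an assumption and should be checked against the printed columns.

```python
import pandas as pd

# Toy records shaped like the assumed schema: one dict per user, nested reviews.
rows = [
    {"user_id": "u1", "reviews": [
        {"item_id": "10", "recommend": True, "review": "great"},
        {"item_id": "20", "recommend": False, "review": "meh"},
    ]},
    {"user_id": "u2", "reviews": [
        {"item_id": "10", "recommend": True, "review": "fun"},
    ]},
]

# record_path expands the nested list; meta carries the user id onto each row
flat = pd.json_normalize(rows, record_path="reviews", meta=["user_id"])
flat["label"] = flat["recommend"].astype(int)  # 1 = recommend, 0 = not recommend
print(flat[["user_id", "item_id", "label", "review"]])
```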

In [None]:
# Load user items data
items_path = DATA_DIR / 'australian_users_items.json.gz'
items = load_python_dicts_gz(items_path, max_rows=100)

print('items shape:', items.shape)
print('items columns:')
print(list(items.columns))

items.head()

## Preprocessing Data

### Exploratory Data Analysis

### Feature Engineering

## Modeling

### Baseline models

**1) Random baseline**

Our first model is a random baseline, which predicts 0 or 1 at random according to the class distribution of the training data. In the context of our dataset, the prediction is whether the user recommends a game or not. Because it ignores the features, its expected accuracy is p^2 + (1 - p)^2 for a positive-class rate p, which is never higher than the max(p, 1 - p) of a majority-class baseline, so it is the easier of the two to beat on accuracy when the classes are imbalanced. Unlike a majority-class baseline, however, it predicts both classes, so precision, recall, and F1 stay meaningful for each class. It sets the floor that any actual machine learning model should clear.

In [None]:
def build_baseline(X_train, y_train):
    # Predict labels at random, following the training class distribution
    baseline = DummyClassifier(strategy="stratified")
    baseline.fit(X_train, y_train)
    return baseline
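To see how a stratified random baseline compares with a majority-class baseline on accuracy, the sketch below works through a made-up example. The 80/20 class split is hypothetical and is not measured from our data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: 80% recommend, 20% not recommend
y = np.array([1] * 80 + [0] * 20)
X = np.zeros((len(y), 1))  # DummyClassifier ignores the features

p = y.mean()
# A random draw matches the true label with probability p*p + (1-p)*(1-p)
expected_stratified = p**2 + (1 - p) ** 2
majority_accuracy = max(p, 1 - p)

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
print(f"stratified (expected): {expected_stratified:.2f}")
print(f"majority (formula):    {majority_accuracy:.2f}")
print(f"majority (measured):   {majority.score(X, y):.2f}")
```

Under this split the stratified baseline is expected to score below the majority-class one on accuracy, so reporting both gives a fuller picture.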

**2) Logistic regression**

Why we included Logistic Regression

- Strong and widely used baseline in machine learning.

- Works well with high-dimensional sparse features like TF-IDF.

- Simple, interpretable, and fast to train.


What it does

- Learns a weighted linear decision boundary between recommend and not-recommend.

- Assigns one weight per TF-IDF feature, so the largest weights point to the most influential words.

- Encodes the direction of sentiment in each weight's sign and its strength in the magnitude.

In [None]:
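def build_logistic_regression(features, X_train, y_train):
    # `features` is the feature-extraction step, assumed to be built during feature engineering
    model = make_pipeline(
        features,
        LogisticRegression(max_iter=2000)
    )
    model.fit(X_train, y_train)
    return model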

**3) Naive Bayes**

Why we included Naive Bayes

- Classic baseline for text classification tasks.

- Very fast to train and evaluate.

- Often performs well on short reviews with simple sentiment.

- Helps us check whether TF-IDF alone can produce strong performance.

What it does

- Uses word frequencies under a conditional independence assumption.

- Learns how often words appear in positive vs. negative reviews.

- Provides a lightweight benchmark to compare against more complex models.

In [None]:
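def build_naive_bayes(X_train, y_train):
    # MultinomialNB needs non-negative features, which TF-IDF provides
    text_vectorizer = TfidfVectorizer(stop_words="english")

    nb_model = make_pipeline(
        text_vectorizer,
        MultinomialNB()
    )
    # Train on the raw review text only
    nb_model.fit(X_train["review"], y_train)
    return nb_model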

### Final model

**Large TF-IDF + Linear SVC**

Why LinearSVC?

We chose a Linear Support Vector Classifier (LinearSVC) as our final model because:

- It performs well on high-dimensional sparse text data

- It does not rely on a conditional independence assumption, so it is more robust than Naive Bayes when features are correlated

- It scales better than a kernel SVM on large datasets

- It is fast to train on tens of thousands of TF-IDF features

- It can be adjusted for class imbalance through its class_weight parameter, although strong features alone do not correct for imbalance

Why combine multiple TF-IDF representations?

Our FeatureUnion merges different types of text signals:

- Word-level TF-IDF (1–2 grams)
  - captures phrases like “very fun”, “not good”

- Character-level TF-IDF (3–5 grams)
  - captures subword patterns
  - helps with misspellings, slang, repeated letters (“goooood”, “amazzing”)
  - helps with stylized writing common in game reviews

Together, these create a richer and more expressive representation of Steam review text.

Why C=1.0?

- It is the scikit-learn default and applies moderate regularization, which limits overfitting

- It is a reasonable starting point that does not require heavy tuning

Why Pipeline?

Using make_pipeline ensures:

- preprocessing + model are connected

- no manual feature handling needed

- one unified model object for training + prediction
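The pieces described above could be assembled roughly as follows. This is a minimal sketch, not the tuned model: `build_final_model` is a hypothetical helper, the `analyzer="char"` choice and the absence of a vocabulary cap are assumptions, and the reviews are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

def build_final_model():
    # Word n-grams capture phrases; character n-grams capture subword patterns.
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ])
    return make_pipeline(features, LinearSVC(C=1.0))

# Hypothetical reviews and labels (1 = recommend, 0 = not recommend)
texts = [
    "great game with a great story",
    "great fun with friends",
    "boring and buggy mess",
    "boring grind, want a refund",
]
labels = [1, 1, 0, 0]

model = build_final_model().fit(texts, labels)
preds = model.predict(["really great fun", "so boring"])
print(preds)
```

Because the vectorizers live inside the pipeline, they are fitted on the training text only, which keeps test-set vocabulary out of the features.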

## Evaluation

### Baseline evaluation

### Final model evaluation

### Model comparison

## Conclusion