# Steam Review Sentiment

A machine learning project to predict Steam game ratings (thumbs up/down) using review text and game metadata.

## Data Overview TODO

- Our dataset contains 700 rows with 5 features used to predict a Steam game rating.
- We collected the data with a script, run on our local machine, that pulls reviews from the `/review` endpoint of the Steam API.
- We did not need to perform any initial transformations when collecting this data.

### Details on the data
- What is your dependent variable? regression or classification? distribution?
    - **Dependent Variable**: The dependent variable is `Recommended`, which indicates whether a review is positive (thumbs up) or negative (thumbs down).
    - **Regression or Classification**: This is a binary classification problem, since the target variable `Recommended` has two possible values: 1 (positive) or 0 (negative).
    - **Distribution**: The class balance can be checked by counting the values in the `Recommended` column; a strong skew toward one value would indicate class imbalance.
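
The distribution check can be sketched in a few lines of pandas. The small DataFrame here is an invented stand-in for `steam_reviews` (an assumption for illustration); only the `Recommended` column name comes from this notebook.

```python
import pandas as pd

# Invented stand-in for the real dataset; only the column name is from the notebook
sample_reviews = pd.DataFrame({"Recommended": [True, True, True, False, True, False]})

# Cast to 0/1 so the classes match the labels used for training
labels = sample_reviews["Recommended"].astype(int)

# Absolute counts and relative shares per class
class_counts = labels.value_counts()
class_shares = labels.value_counts(normalize=True)

print(class_counts)
print(class_shares.round(2))
```

If one class holds a large majority of the rows, accuracy alone becomes a weak metric, and precision, recall, or class weighting are worth considering.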

### Data Descriptions

#### Continuous
- `VotesUp`: The number of users that found this review helpful
- `VotesFunny`: The number of users that found this review funny
- `PlaytimeTotal`: Lifetime playtime tracked in this app
- `PlaytimeReview`: Playtime when the review was written
- `PlaytimeTwoWeek`: Playtime tracked in the past two weeks for this app
- `NumberofReviews`: Number of reviews written by the user
- `PostedDate`: Date the review was created (unix timestamp)

#### Categorical
- `AppID`: The unique id of the game
- `GameName`: The name of the reviewed game
- `ReviewID`: The unique id of the recommendation
- `Author`: The user’s SteamID
- `Review`: Text of written review
- `Recommended`: True means it was a positive recommendation
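
For orientation, the sketch below maps one raw review object to the columns described above. The collection script is not part of this notebook, so the raw field names (`recommendationid`, `voted_up`, `author.playtime_forever`, and so on) are an assumption based on the JSON shape of Steam's public review listing, and `flatten_review` and the example payload are hypothetical.

```python
from typing import Any, Dict


def flatten_review(app_id: int, game_name: str, review: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw review object to the column names used in this notebook.

    The raw keys are assumed, not taken from the project's collection script.
    """
    author = review.get("author", {})
    return {
        "AppID": app_id,
        "GameName": game_name,
        "ReviewID": review.get("recommendationid"),
        "Author": author.get("steamid"),
        "Review": review.get("review"),
        "Recommended": review.get("voted_up"),
        "VotesUp": review.get("votes_up"),
        "VotesFunny": review.get("votes_funny"),
        "PlaytimeTotal": author.get("playtime_forever"),
        "PlaytimeReview": author.get("playtime_at_review"),
        "PlaytimeTwoWeek": author.get("playtime_last_two_weeks"),
        "NumberofReviews": author.get("num_reviews"),
        "PostedDate": review.get("timestamp_created"),
    }


# Made-up payload for illustration only
example_review = {
    "recommendationid": "123456789",
    "author": {
        "steamid": "76561190000000000",
        "num_reviews": 4,
        "playtime_forever": 600,
        "playtime_last_two_weeks": 30,
        "playtime_at_review": 540,
    },
    "review": "Great puzzles, still holds up.",
    "timestamp_created": 1700000000,
    "voted_up": True,
    "votes_up": 12,
    "votes_funny": 1,
}

row = flatten_review(620, "Portal 2", example_review)
print(row)
```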

## Import data

Load the data with pandas. The dataset is stored as a CSV file.

In [7]:
import pandas as pd

# Load the dataset
file_path = "Dataset/steamreviews.csv"
steam_reviews = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(steam_reviews.head())

ModuleNotFoundError: No module named 'pandas'

## Training TODO

10 pts: perform some thoughtful supervised learning, including engineering and selecting features, selecting
and optimizing a model, and explaining your model (coefficients or feature importance, performance). Here
are some suggested key points.
- feature engineering / selection, bivariate charts? Interactions?
- missing data? how to handle it?
- Selection of modeling algorithm? classification or regression? binary or multi-class?
- interpretation of variable importance, coefficients if applicable
- justification of choice of metric (accuracy, precision / recall, other?)
- is class weighting or over / under sampling appropriate?
- discussion of choice or tuning of hyperparameters, if any
- meaningful discussion of predictive power and conclusions from model
- look at misclassified examples from test dataset, what do they say about your model?
- outliers in data?

### Perform some Feature Engineering

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import torch
import scipy.sparse

# Drop rows where Review or Recommended is missing.
# Copy the result so the column assignment below does not trigger SettingWithCopyWarning.
steam_reviews_clean = steam_reviews.dropna(subset=["Review", "Recommended"]).copy()

# Convert 'Recommended' column to binary labels
steam_reviews_clean["label"] = steam_reviews_clean["Recommended"].astype(int)

# -------- Numeric Features --------
numeric_cols = ["VotesUp", "VotesFunny", "PlaytimeTotal", "PlaytimeTwoWeeks", "NumberofReviews"]
X_numeric = steam_reviews_clean[numeric_cols].fillna(0)

# Scale numeric features
scaler = StandardScaler()
X_numeric_scaled = scaler.fit_transform(X_numeric)

# -------- Text Features --------
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X_text = vectorizer.fit_transform(steam_reviews_clean["Review"])

# -------- Combine --------
# Request CSR output so the combined matrix supports row indexing during the split
X_combined = scipy.sparse.hstack([X_text, X_numeric_scaled], format="csr")

# -------- Labels --------
y = steam_reviews_clean["label"].values

# -------- Train/Test Split --------
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

# -------- Convert to PyTorch Tensors --------
X_train_tensor = torch.tensor(X_train.toarray(), dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.toarray(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
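
One thing to watch in the cell above: the scaler and the vectorizer are fit on all rows before the split, so statistics from the test rows leak into the training features. A minimal sketch of the alternative, splitting first and fitting the transformers on the training rows only, follows. The toy DataFrame is invented for illustration, and `ColumnTransformer` is a swapped-in technique rather than what the notebook currently does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Invented toy rows standing in for steam_reviews_clean
toy = pd.DataFrame({
    "Review": ["great game", "boring and buggy", "loved the story", "refund please",
               "fun with friends", "crashes constantly", "great soundtrack", "waste of money"],
    "VotesUp": [3, 0, 5, 1, 2, 0, 4, 1],
    "PlaytimeTotal": [1200, 30, 800, 15, 600, 45, 950, 20],
    "label": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Split the raw rows first so the test set stays unseen while fitting
train_df, test_df = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["label"]
)

# Text goes through a bag-of-words vectorizer, numeric columns through a scaler
features = ColumnTransformer([
    ("text", CountVectorizer(max_features=1000, stop_words="english"), "Review"),
    ("numeric", StandardScaler(), ["VotesUp", "PlaytimeTotal"]),
])

# Fit on training rows only, then reuse the fitted transformers on the test rows
X_train_demo = features.fit_transform(train_df)
X_test_demo = features.transform(test_df)
y_train_demo = train_df["label"].values
y_test_demo = test_df["label"].values

print(X_train_demo.shape, X_test_demo.shape)
```

The `stratify` argument keeps the class ratio similar in both splits, which matters when the classes are imbalanced.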


### Define Simple Neural Network

Details on neural network:
- One hidden layer
- ReLU activation
- Sigmoid output (since it's binary classification)

In [None]:
import torch.nn as nn

class SteamReviewClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

# Initialize with the correct input size
input_dim = X_train_tensor.shape[1]
model = SteamReviewClassifier(input_dim)

### Train the Model

A simple full-batch training loop that runs for 10 epochs.

In [None]:
import torch.optim as optim

# Loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
epochs = 10
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()

    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()

    # Print training loss
    print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

### Evaluate the Model

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Switch to evaluation mode
model.eval()

# No need to compute gradients during evaluation
with torch.no_grad():
    y_pred_probs = model(X_test_tensor)
    y_pred = (y_pred_probs >= 0.5).float()

# Convert tensors to flat integer NumPy arrays so the report shows classes 0 and 1
y_pred_np = y_pred.numpy().ravel().astype(int)
y_test_np = y_test_tensor.numpy().ravel().astype(int)

# Report metrics
print("Accuracy:", accuracy_score(y_test_np, y_pred_np))
print("\nClassification Report:\n", classification_report(y_test_np, y_pred_np))
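
To follow up on the report, the misclassified test reviews can be pulled out by comparing predictions with labels. The arrays and texts below are invented placeholders. In the notebook, the review texts would have to be carried through the train/test split alongside the features, which the current cells do not do.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented placeholders for the test labels, predictions, and review texts
y_true = np.array([1, 0, 1, 1, 0, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0])
texts = np.array([
    "great game", "not bad at all", "loved it",
    "good but crashes a lot", "boring", "refunded",
])

# Rows are true classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)

# Indices where the prediction disagrees with the label
wrong = np.flatnonzero(y_true != y_hat)
for i in wrong:
    print(f"true={y_true[i]} pred={y_hat[i]} text={texts[i]!r}")
```

Reading the misclassified texts side by side often shows patterns such as negation or mixed sentiment that a bag-of-words representation cannot capture.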


### Save the Model

Export the model to the ONNX format so it can be loaded in a JavaScript app.

In [None]:
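import os

# Make sure the output directory exists before exporting
os.makedirs("models", exist_ok=True)

dummy_input = torch.randn(1, input_dim)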
torch.onnx.export(
    model,
    dummy_input,
    "models/steam_review_model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=11
)

## PCA TODO

5 pts: PCA as data exploration and visualization. Here are some suggested key points.
- take a look at PCA, percent explained
- take a look at top eigenvector or two, what is it made out of?
- can you visualize your prediction problem by projecting to 2 dimensions?
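
As a starting point for this section, the sketch below runs PCA on a synthetic stand-in for the numeric feature matrix and prints the explained variance and the loadings of the first two components. The data is randomly generated (an assumption for illustration, not project results); only the feature names are taken from the feature engineering cell.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features (rows = reviews, columns = features)
rng = np.random.default_rng(0)
feature_names = ["VotesUp", "VotesFunny", "PlaytimeTotal", "PlaytimeTwoWeeks", "NumberofReviews"]
X_demo = rng.normal(size=(200, len(feature_names)))
X_demo[:, 2] += 2.0 * X_demo[:, 0]  # inject correlation so the components are not trivial

# PCA is sensitive to scale, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Share of total variance captured by each component
print("explained variance ratio:", pca.explained_variance_ratio_)

# Loadings: how strongly each original feature contributes to each component
for name, pc1, pc2 in zip(feature_names, pca.components_[0], pca.components_[1]):
    print(f"{name:>16}: PC1={pc1:+.2f}  PC2={pc2:+.2f}")
```

The two columns of `X_2d` can then be drawn as a scatter plot colored by the `Recommended` label to see whether the classes separate in two dimensions.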

## K-means and data exploration TODO

5 pts: k-means as data exploration and visualization. Here are some suggested key points.
- discussion for choosing number of clusters
- analysis of cluster centers
- scatter plot(s) showing 2 dimensional perspective of clusters and cluster centers?
- meaningful interpretation / discussion of conclusions
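
A possible starting point for choosing the number of clusters is to compare inertia (for an elbow plot) and silhouette scores across candidate values of k. The sketch below does this on synthetic blobs, which are an invented stand-in for the scaled review features, so the numbers it prints say nothing about the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three well-separated blobs in two dimensions
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X_demo)

# Compare candidate cluster counts by inertia (elbow) and silhouette score
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    scores[k] = silhouette_score(X_scaled, km.labels_)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={scores[k]:.3f}")

# Pick the k with the highest silhouette score and refit
best_k = max(scores, key=scores.get)
best_model = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_scaled)
print("chosen k:", best_k)
print("cluster centers (scaled units):\n", best_model.cluster_centers_)
```

Cluster centers are reported in standardized units; applying the scaler's `inverse_transform` to them gives values in the original feature units, which are easier to interpret.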