<a href="https://colab.research.google.com/github/mobius29er/AIML_Class/blob/main/CrossValidationComparisonSteamGames.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Self Study Activity 9.1: Which Cross-Validation Technique Is Best?
Machine learning offers several variants of cross-validation, all derived from statistics. In this section, you will learn about three of them: leave-one-out cross-validation, k-fold cross-validation, and holdout cross-validation.

In machine learning, leave-one-out cross-validation (LOOCV) is used to assess how well an algorithm predicts data it was not trained on. It is most useful when you have only a small sample made up of a few examples.

k-fold cross-validation estimates a model's performance on unseen data by splitting the dataset into k folds and validating on each fold in turn. The same procedure can be used to tune hyperparameters by comparing candidate settings on their cross-validated scores. In addition, every example is used for validation exactly once (as part of one held-out fold) and for training in the other k-1 rounds.

Holdout cross-validation is the simplest form of cross-validation, so it is sometimes called a “simple validation method” or treated as a degenerate form of cross-validation. In this method, you randomly divide your data into two sets: the training set and the test/validation set (i.e., the holdout set). Its advantage is low cost: the model is trained only once, and the holdout set provides an estimate of performance on unseen data.

Using a dataset of your own, explore the data utilizing multiple cross-validation techniques. Choose the most appropriate cross-validation technique for your data.

Possible sources for datasets:

- Kaggle
- World Bank

In your initial post, describe your data, state which cross-validation technique you used, and explain your rationale for deciding on which cross-validation technique was the most appropriate for your specific dataset.

#### Index

- [K-fold](#Problem-1)
- [HoldOut](#Problem-2)
- [LOOCV](#Problem-3)
- [Comparison](#Problem-4)

In [10]:
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, LeaveOneOut
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### The Data
https://www.kaggle.com/datasets/hbugrae/best-selling-steam-games-of-all-time/data?select=bestSelling_games.csv

This dataset provides a comprehensive snapshot of the best-selling games on the Steam platform. The data was collected on June 1, 2025, from the official 'Bestsellers' page on the Steam store.

In [11]:
data = pd.read_csv('data/bestSelling_games.csv')

In [12]:
data.head()

Unnamed: 0,game_name,reviews_like_rate,all_reviews_number,release_date,developer,user_defined_tags,supported_os,supported_languages,price,other_features,age_restriction,rating,difficulty,length,estimated_downloads
0,Counter-Strike 2,86,8803754,"21 Aug, 2012",Valve,"FPS, Action, Tactical","win, linux","English, Czech, Danish, Dutch, Finnish, French...",0.0,"Cross-Platform Multiplayer, Steam Trading Card...",17,3.2,4,80,306170000
1,PUBG: BATTLEGROUNDS,59,2554482,"21 Dec, 2017",PUBG Corporation,"Survival, Shooter, Action, Tactical",win,"English, Korean, Simplified Chinese, French, G...",0.0,"Online PvP, Stats, Remote Play on Phone, Remot...",13,3.1,4,73,162350000
2,ELDEN RING NIGHTREIGN,77,53426,"30 May, 2025","FromSoftware, Inc.","Souls-like, Open World, Fantasy, RPG",win,"English, Japanese, French, Italian, German, Sp...",25.99,"Single-player, Online Co-op, Steam Achievement...",17,3.96,4,50,840000
3,The Last of Us™ Part I,79,45424,"28 Mar, 2023",Naughty Dog LLC,"Story Rich, Shooter, Survival, Horror",win,"English, Italian, Spanish - Spain, Czech, Dutc...",59.99,"Single-player, Steam Achievements, Steam Tradi...",17,4.1,3,24,2000000
4,Red Dead Redemption 2,92,672140,"5 Dec, 2019",Rockstar Games,"Open World, Story Rich, Adventure, Realistic, ...",win,"English, French, Italian, German, Spanish - Sp...",59.99,"Single-player, Online PvP, Online Co-op, Steam...",17,4.32,3,80,21610000


In [13]:
# Define the target variable
target_variable = 'price'
y = data[target_variable]

In [14]:
# Define the feature variables (exclude the target variable)
X = data.drop(target_variable, axis=1)

In [15]:
# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Cross Validation Type 1: K-Fold Cross-Validation

#### Method

We use k=5 (5 folds), which means dividing the training data into 5 subsets.

For each fold:

- Use the other 4 folds to train a model (e.g., a linear regression model) to predict price.
- Evaluate the trained model on the remaining fold.

Then calculate the average performance metric (e.g., mean squared error) across all folds. This gives an estimate of how well the model generalizes to unseen data.
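
To make the fold structure described above concrete, here is a minimal sketch on a toy array of 10 samples (an assumption for illustration, not the Steam data). It only prints which row indices land in the training and validation parts of each of the 5 folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumption): 10 samples with a single feature
X_toy = np.arange(10).reshape(-1, 1)

# No shuffling here, so each validation fold is a contiguous block of rows
kf_toy = KFold(n_splits=5)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf_toy.split(X_toy), start=1):
    # Each round trains on 4 folds (8 rows) and validates on 1 fold (2 rows)
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

Every row appears in exactly one validation fold, which is why the averaged score uses all of the data passed to the splitter.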

In [16]:
# 1. Choose a model
model = Ridge(alpha=1.0)

In [17]:
# 2. Instantiate a K-Fold cross-validation object
# Let's use 5 folds (n_splits=5)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

In [18]:
# 3. Perform cross-validation
mse_scores = [] # Initialize mse_scores list before the loop

for train_index, val_index in kf.split(X_train):
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

    # Drop non-numerical columns before fitting the model
    X_train_fold = X_train_fold.select_dtypes(include=np.number)
    X_val_fold = X_val_fold.select_dtypes(include=np.number)

    # Train the model on the training fold
    model.fit(X_train_fold, y_train_fold)

    # Make predictions on the validation fold
    val_preds = model.predict(X_val_fold)

    # Calculate the performance metric (e.g., MSE)
    mse = mean_squared_error(y_val_fold, val_preds)
    mse_scores.append(mse)

In [19]:
# 4. and 5. Collect and Analyze results
average_mse = np.mean(mse_scores)
print(f'Mean MSE across folds: {average_mse}')

Mean MSE across folds: 102.4827731017482
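
The manual loop above can also be written more compactly with scikit-learn's `cross_val_score`. The sketch below uses synthetic stand-in data (an assumption, so the numbers will not match the Steam results) and the same `Ridge(alpha=1.0)` and 5-fold setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 3 numeric features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so MSE comes back negated
neg_mse = cross_val_score(
    Ridge(alpha=1.0), X_demo, y_demo,
    cv=kf_demo, scoring="neg_mean_squared_error",
)
fold_mse = -neg_mse

print("Per-fold MSE:", fold_mse)
print("Mean MSE across folds:", fold_mse.mean())
```

With the notebook's data, passing the numeric columns of `X_train` and `y_train` in place of the synthetic arrays should give the same kind of per-fold scores as the loop.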


In [20]:
# After this, you would typically train the final model on the entire training set
# and evaluate it on the test set as usual.
final_model = Ridge(alpha=1.0) # Using the same alpha for the final model

# Select only numerical columns from X_train before fitting the final model
X_train_numeric = X_train.select_dtypes(include=np.number)
X_test_numeric = X_test.select_dtypes(include=np.number)


final_model.fit(X_train_numeric, y_train)
test_preds = final_model.predict(X_test_numeric)
test_mse = mean_squared_error(y_test, test_preds)
print(f'Test MSE: {test_mse}')

Test MSE: 126.75841947152595


# Cross Validation Type 2: Holdout cross-validation

#### What is it?


Holdout cross-validation is a simple and common technique where you split your dataset into two parts: a training set and a test set.

Training Set: Used to train your machine learning model. Typically, around 70-80% of the data goes here.

Test Set: Used only after the model is trained to evaluate its performance on unseen data. This simulates how well it will generalize to real-world situations.

**Here's a simple breakdown:**

Split your data: Randomly divide your dataset into training and test sets.

Train your model: Use the training set to train your chosen machine learning algorithm.

Evaluate performance: Use the untouched test set to evaluate how well your trained model performs on new, unseen data.

**Advantages of Holdout Cross-Validation:**

Simple and easy to implement.

Computationally less expensive than techniques like K-Fold cross-validation.

**Disadvantages:**

The performance estimate can be sensitive to the specific random split of your data. A different split might lead to noticeably different results.

The test set is never used for training, which costs you useful training data when the dataset is small.
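
The sensitivity to the random split can be checked directly. The sketch below repeats a 70/30 holdout split with ten different seeds on synthetic data (an assumption for illustration) and reports how much the test MSE moves between splits.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 150 rows, 3 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=150)

holdout_mse = []
for seed in range(10):
    # Same data and model each time; only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    model_demo = Ridge(alpha=1.0).fit(X_tr, y_tr)
    holdout_mse.append(mean_squared_error(y_te, model_demo.predict(X_te)))

print(f"Lowest holdout MSE:  {min(holdout_mse):.3f}")
print(f"Highest holdout MSE: {max(holdout_mse):.3f}")
```

The gap between the lowest and highest value gives a rough sense of how far a single holdout estimate can be trusted.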

In [21]:
# Split the data into training and testing sets (70% train, 30% test)
# This is the core of the holdout cross-validation method
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [22]:
# We then trained a final model on the training data
final_model = Ridge(alpha=1.0) # Using the same alpha for the final model

In [23]:
# Ensure only numerical columns are used for fitting
X_train_numeric = X_train.select_dtypes(include=np.number)
X_test_numeric = X_test.select_dtypes(include=np.number)

In [24]:
final_model.fit(X_train_numeric, y_train)

In [25]:
# And evaluated its performance on the unseen test data
test_preds = final_model.predict(X_test_numeric)
test_mse = mean_squared_error(y_test, test_preds)
print(f'Test MSE (Holdout Validation): {test_mse}')

Test MSE (Holdout Validation): 126.75841947152595


# Cross Validation Type 3: LOOCV

#### Leave-One-Out Cross-Validation (LOOCV)

What is LOOCV?

LOOCV is an extreme case of K-Fold cross-validation where K equals the number of data points in your dataset. In simpler terms:

- Iteration: You train your machine learning model on all but one data point.

- Prediction: You use the trained model to predict the target value for the single left-out data point.

- Repeat: You repeat the two steps above for every single data point in your dataset, each time leaving a different one out.

Advantages of LOOCV:

- Less Biased: It tends to give you a less biased estimate of your model's performance compared to other cross-validation techniques because it uses almost all the data for training in each iteration.

- Useful for Small Datasets: Especially helpful when you have very limited data, as it maximizes the amount of data used for training in each fold.

Disadvantages of LOOCV:

- Computationally Expensive: It can be incredibly slow, especially for large datasets, because you train your model once per data point.
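
The statement that LOOCV is K-Fold with K equal to the number of data points can be verified on a small example. This sketch uses a toy array of 6 samples (an assumption for illustration) and compares the splits produced by `LeaveOneOut` with those from an unshuffled `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data (assumption): 6 samples, 2 features
X_demo = np.arange(12).reshape(6, 2)

loo_splits = [
    (tr.tolist(), va.tolist()) for tr, va in LeaveOneOut().split(X_demo)
]
kf_splits = [
    (tr.tolist(), va.tolist())
    for tr, va in KFold(n_splits=len(X_demo)).split(X_demo)
]

# One split per sample, and each validation set holds a single row
print("Number of LOOCV splits:", len(loo_splits))
print("Same splits as KFold(n_splits=n):", loo_splits == kf_splits)
```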



In [26]:
# 1. Instantiate a LeaveOneOut cross-validation object
loo = LeaveOneOut()

In [27]:
# 2. Choose your model (using the same Ridge model as before)
model_loo = Ridge(alpha=1.0)

# Initialize a list to store the performance metric for each fold
mse_scores_loo = []

# Ensure X_train contains only numerical columns for the LOOCV loop
X_train_numeric = X_train.select_dtypes(include=np.number)
y_train_numeric = y_train # y_train should already be numeric

In [28]:
# 3. Perform LOOCV
print("Performing Leave-One-Out Cross-Validation...")
# We iterate over the splits generated by loo.split(X_train_numeric)
for train_index, val_index in loo.split(X_train_numeric):
    # Get the training and validation data for the current fold
    X_train_fold_loo, X_val_fold_loo = X_train_numeric.iloc[train_index], X_train_numeric.iloc[val_index]
    y_train_fold_loo, y_val_fold_loo = y_train_numeric.iloc[train_index], y_train_numeric.iloc[val_index]

    # Train the model on the training fold
    model_loo.fit(X_train_fold_loo, y_train_fold_loo)

    # Make predictions on the validation fold (which is just one data point)
    val_preds_loo = model_loo.predict(X_val_fold_loo)

    # Calculate the performance metric (MSE) for this fold
    mse_loo = mean_squared_error(y_val_fold_loo, val_preds_loo)
    mse_scores_loo.append(mse_loo)

Performing Leave-One-Out Cross-Validation...


In [29]:
# 4. Analyze results
# Calculate the average MSE across all folds
average_mse_loo = np.mean(mse_scores_loo)
print(f'Mean MSE across LOOCV folds: {average_mse_loo}')

Mean MSE across LOOCV folds: 101.29982233220576


### Cross Validation Comparison and Rationale

We have explored three different cross-validation techniques: K-Fold, Holdout, and Leave-One-Out Cross-Validation (LOOCV). Let's compare their results and discuss which method is most appropriate for this dataset.

#### Results Summary (Mean Squared Error)

- **K-Fold Cross-Validation (k=5):** 102.48 (`average_mse`)
- **Holdout Cross-Validation (70% train, 30% test):** 126.76 (`test_mse`)
- **Leave-One-Out Cross-Validation (LOOCV):** 101.30 (`average_mse_loo`)
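
MSE is in squared price units, which makes it hard to read. Taking the square root gives RMSE, which is in the same units as `price`. The sketch below does that arithmetic for the three MSE values printed earlier in this notebook; the dictionary names are only illustrative.

```python
import math

# MSE values copied from the cell outputs above
mse_by_method = {
    "K-Fold (k=5)": 102.4827731017482,
    "Holdout (70/30)": 126.75841947152595,
    "LOOCV": 101.29982233220576,
}

# RMSE is the square root of MSE, expressed in the units of the target
rmse_by_method = {name: math.sqrt(mse) for name, mse in mse_by_method.items()}

for name, rmse in rmse_by_method.items():
    print(f"{name}: RMSE = {rmse:.2f}")
```

Read this way, all three estimates put the typical prediction error at roughly 10 to 11 price units.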

#### Analysis and Discussion

Discuss the Performance Values:

Compare the magnitudes of the MSE values. Are they similar? Is one significantly higher or lower than the others?

Remember that a lower MSE generally indicates better model performance (closer predictions to actual values).

Consider if any of the results seem unexpectedly high or low, which might indicate issues like overfitting (low training error but high validation/test error) or underfitting (high error on both).

Analyze the Bias-Variance Trade-off in Cross-Validation:

- Holdout: High variance in the performance estimate (sensitive to the specific split). The estimate also tends to be pessimistic, because the model is trained on only part of the data (70% here).
- K-Fold: Lower variance in the performance estimate (averaged over multiple splits). Higher bias than LOOCV (each model is trained on k-1 folds, leaving out more data than LOOCV does).
- LOOCV: Very low bias in the performance estimate (uses almost all data for training in each fold). Can have high variance in the performance estimate (if the removal of a single data point significantly affects the model). Very high computational cost.
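
The bias side of this trade-off comes down to how much data each fit gets to see. As a small worked example, the sketch below computes the share of rows used for training in each scheme, relative to whatever data is handed to the splitter. It uses 2380 rows, the dataset size stated in this notebook, purely as an illustration.

```python
n_rows = 2380  # dataset size stated in this notebook

# Share of the rows passed to the splitter that each individual fit trains on
train_fraction = {
    "Holdout (70/30)": 0.70,
    "K-Fold (k=5)": (5 - 1) / 5,
    "K-Fold (k=10)": (10 - 1) / 10,
    "LOOCV": (n_rows - 1) / n_rows,
}

for name, frac in train_fraction.items():
    print(f"{name}: {frac:.2%} of rows used for training per fit")
```

The closer this share is to 100%, the closer each fitted model is to the one you would train on all the data, which is why LOOCV is described as the least biased of the three.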

Relate these concepts to your observed MSE values. For example:

- If your K-Fold MSE is much lower than the Holdout MSE, it might suggest that your single holdout split was less representative.
- If your LOOCV MSE is significantly different from the K-Fold MSE, it could point to high variance in the LOOCV estimate or indicate that leaving out even one data point has a notable effect on the model.

Consider Computational Cost:

- Holdout: Fastest to implement and run.
- K-Fold: More computationally expensive than Holdout, but usually manageable for moderately sized datasets.
- LOOCV: The most computationally expensive, especially for larger datasets, as it requires training the model n times (where n is the number of samples). Discuss if the computational burden of LOOCV is justified by the potential gain in estimate reliability for your dataset size.

Relate to Your Dataset Size:

Your dataset has 2380 rows. Discuss how this size influences the suitability of each method.

- For this size, K-Fold (e.g., with k=5 or k=10) generally provides a good balance between computational cost and a reliable estimate of performance.
- LOOCV over the full dataset would train the model 2380 times, which is far more computationally intensive than 5-fold cross-validation. (The LOOCV run above used only the 70% training split, so it needed one fit per training row.) While it offers a low-bias estimate, the high computational cost might not be worth it for this dataset size compared to K-Fold.
- A single Holdout split might not be as reliable as K-Fold due to the potential for a non-representative test set, and holding out 30% leaves less data for training.
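
The cost comparison above can be put in numbers by counting model fits. This sketch asks scikit-learn's splitters how many splits they would produce for 2380 rows; the placeholder array is an assumption, since only its row count matters.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

n_rows = 2380  # dataset size stated above
X_placeholder = np.zeros((n_rows, 1))  # contents are irrelevant, only the row count is used

# One model fit per split; a holdout split needs a single fit
fits = {
    "Holdout": 1,
    "K-Fold (k=5)": KFold(n_splits=5).get_n_splits(X_placeholder),
    "K-Fold (k=10)": KFold(n_splits=10).get_n_splits(X_placeholder),
    "LOOCV": LeaveOneOut().get_n_splits(X_placeholder),
}

for name, n_fits in fits.items():
    print(f"{name}: {n_fits} model fit(s)")
```

For a cheap model such as Ridge, even a few thousand fits finish quickly, so the cost argument matters more for slower models or larger datasets.
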

Recommend the Most Appropriate Technique:

Based on your comparison of results, computational cost, bias-variance trade-off, and dataset size, conclude which cross-validation technique you recommend for your specific dataset and why.

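For example:

"Looking at the results, we see that the MSE values for K-Fold (102.48) and LOOCV (101.30) are close to each other, while the Holdout MSE (126.76) is noticeably higher. Part of that gap may come from the setup: K-Fold and LOOCV were run on the training split, whereas the Holdout score comes from the separate test split.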

K-Fold cross-validation provides a robust estimate of the model's generalization performance by averaging over multiple splits, reducing the variance compared to a single holdout split. The computational cost is reasonable.

Holdout validation is the simplest and fastest but can be sensitive to the specific random split.

LOOCV offers a nearly unbiased estimate but is computationally very expensive for a dataset of this size (2380 rows). The marginal gain in estimate reliability over K-Fold for this dataset might not justify the significant increase in computation time.

Considering the dataset size and the balance between computational cost and the reliability of the performance estimate, **K-Fold cross-validation** appears to be the most appropriate technique for this dataset. It provides a more stable estimate than Holdout and is significantly less computationally demanding than LOOCV while still making good use of the available data for both training and validation."





### Result

#### Results of this analysis



In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [33]:
# Split the data into training and testing sets (70% train, 30% test)
# This is the core of the holdout cross-validation method
# This line must be executed to define X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Select only numerical columns from X_train before fitting the final model
X_train_numeric = X_train.select_dtypes(include=np.number)
X_test_numeric = X_test.select_dtypes(include=np.number)

# Print the numerical features to know what inputs are needed
print("Numerical features used for training the final model:")
print(X_train_numeric.columns.tolist())

# Instantiate the model here so this cell does not depend on earlier cells
final_model = Ridge(alpha=1.0)

final_model.fit(X_train_numeric, y_train)

Numerical features used for training the final model:
['reviews_like_rate', 'all_reviews_number', 'age_restriction', 'rating', 'difficulty', 'length', 'estimated_downloads']


In [34]:
# Assume X_train, y_train, and final_model are already defined and the model is trained
# (This relies on the previous cells having been executed successfully)

# 1. Get the list of numerical features the model was trained on
# This line should ideally be executed after fitting the model to X_train_numeric
# to be absolutely sure of the columns used for training.
numerical_features = X_train.select_dtypes(include=np.number).columns.tolist()

print(f"Your model was trained on the following numerical features: {numerical_features}")

# 2. Prepare the new data (Example for a hypothetical new game)
# This dictionary MUST contain values for all features in the numerical_features list.
# The keys should match the feature names exactly.
# Replace the sample values with hypothetical values for your new game.
new_game_raw_data = {
    'reviews_like_rate': 0.95,          # Caution: the dataset stores this on a 0-100 scale (e.g., 86), so 0.95 means under 1% positive; use 95 for 95%
    'all_reviews_number': 100000,       # Example: 100,000 total reviews
    'age_restriction': 10,              # Example: Suitable for age 10+
    'rating': 4.5,                      # Example: Rating out of 5 (if that's how rating is represented)
    'difficulty': 3.0,                  # Example: Difficulty score (e.g., on a scale)
    'length': 50.0,                     # Example: Game length in hours
    'estimated_downloads': 5000000      # Example: 5 million downloads
}

# Create a DataFrame from the new data
new_game_df = pd.DataFrame([new_game_raw_data])

# 3. Ensure the new data DataFrame contains only the required numerical columns
# and is in the correct order.
# Select only the columns that are in numerical_features
# Using .reindex(columns=numerical_features) is safer as it handles cases
# where the original new_game_df might have columns in a different order
# or be missing a required column (it will add NaN for missing ones, which
# might need further handling depending on your model).
# For simplicity and robustness, just selecting is fine if you are certain
# the keys in new_game_raw_data match and cover all numerical_features.
new_game_df_numeric = new_game_df[numerical_features]


# 4. Make the prediction using the final trained model
try:
    predicted_price = final_model.predict(new_game_df_numeric)

    print(f"\nBased on the provided numerical inputs, the predicted price for this game is: ${predicted_price[0]:.2f}")

except Exception as e:
    print(f"\nAn error occurred during prediction: {e}")
    print("Please ensure the new data DataFrame has the correct numerical features")
    print(f"Required features: {numerical_features}")
    print(f"Provided features in new data: {new_game_df_numeric.columns.tolist()}")

# Optional: Provide a simple price range suggestion (as discussed before)
# print(f"Suggested price range: ${predicted_price[0] * 0.8:.2f} - ${predicted_price[0] * 1.2:.2f} (Arbitrary +/- 20%)")

Your model was trained on the following numerical features: ['reviews_like_rate', 'all_reviews_number', 'age_restriction', 'rating', 'difficulty', 'length', 'estimated_downloads']

Based on the provided numerical inputs, the predicted price for this game is: $27.28
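
The code comments above mention `.reindex(columns=...)` as a safer way to align new data with the training features. The sketch below shows what it does when a feature is missing or the columns arrive in a different order. The feature subset and sample values are hypothetical, chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical subset of the training features, in training order
required_features = ['reviews_like_rate', 'all_reviews_number', 'rating']

# Incoming record with columns in a different order and one feature missing
incoming = pd.DataFrame([{'rating': 4.5, 'reviews_like_rate': 95}])

# reindex puts columns in the required order and fills missing ones with NaN
aligned = incoming.reindex(columns=required_features)

# Flag missing features before predicting, since Ridge cannot handle NaN inputs
missing = aligned.columns[aligned.isna().any()].tolist()

print(aligned)
print("Missing features:", missing)
```

Checking `missing` before calling `predict` turns a confusing model error into a clear message about which inputs still need values.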
