# Gaming Engagement Level Machine Learning Project

# Dataset Overview

This dataset captures behavioral and engagement patterns of video game players across different regions, genres, and difficulty levels.
The goal of this project is to predict a player’s **engagement level** (Low, Medium, or High) based on their demographics, gameplay habits, and in-game performance indicators.

---

## Table of Contents

1. [Dataset Overview & Features](#dataset-overview--features)
2. [Introduction](#introduction)
3. [Import Libraries](#import-libraries)
4. [Data Preprocessing](#data-preprocessing)
5. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis)
6. [Model Development](#model-development)
7. [Model Evaluation](#model-evaluation)
8. [Conclusion](#conclusion)

---

## Dataset Overview & Features <a id="dataset-overview--features"></a>

This dataset contains **player behavior and engagement data** from various gaming genres.
Each record represents one player, with demographic information, gaming activity statistics, and performance achievements.

### Features

| Feature                       | Description                                                             |
| ----------------------------- | ----------------------------------------------------------------------- |
| **PlayerID**                  | Unique identifier assigned to each player.                              |
| **Age**                       | Age of the player (in years).                                           |
| **Gender**                    | Gender of the player (Male/Female).                                     |
| **Location**                  | Geographic region (e.g., USA, Europe, Other).                           |
| **GameGenre**                 | Primary genre of games played (Action, Strategy, Sports, RPG, etc.).    |
| **PlayTimeHours**             | Total gameplay hours recorded over a period.                            |
| **InGamePurchases**           | Indicates whether the player makes in-game purchases (0 = No, 1 = Yes). |
| **GameDifficulty**            | Average difficulty setting used (Easy, Medium, Hard).                   |
| **SessionsPerWeek**           | Number of gameplay sessions per week.                                   |
| **AvgSessionDurationMinutes** | Average duration (in minutes) of each gaming session.                   |
| **PlayerLevel**               | Player’s current game level or rank.                                    |
| **AchievementsUnlocked**      | Number of achievements or milestones earned.                            |
| **EngagementLevel**           | Overall engagement rating (Low, Medium, High). *(Target Variable)*      |

---

## Target Variable

**EngagementLevel** — The categorical target variable representing a player's overall engagement level.

| Class      | Description                                                                                |
| ---------- | ------------------------------------------------------------------------------------------ |
| **High**   | Highly engaged players with frequent sessions, longer durations, and active participation. |
| **Medium** | Moderately engaged players with balanced gameplay and activity.                            |
| **Low**    | Less active players with minimal gameplay or short sessions.                               |

---

## Goal

To build and evaluate **machine learning models** that can predict a player’s *Engagement Level* based on their gameplay behavior, activity patterns, and game-related attributes.

---



In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.multiclass import OneVsOneClassifier
import warnings
warnings.filterwarnings("ignore", category=UserWarning) 
%matplotlib inline

# Libraries Imported

## Data Loading and Structure

In [20]:
url = 'online_gaming_insights.csv'
df = pd.read_csv(url)
df

|  | PlayerID | Age | Gender | Location | GameGenre | PlayTimeHours | InGamePurchases | GameDifficulty | SessionsPerWeek | AvgSessionDurationMinutes | PlayerLevel | AchievementsUnlocked | EngagementLevel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9000 | 43 | Male | Other | Strategy | 16.271119 | 0 | Medium | 6 | 108 | 79 | 25 | Medium |
| 1 | 9001 | 29 | Female | USA | Strategy | 5.525961 | 0 | Medium | 5 | 144 | 11 | 10 | Medium |
| 2 | 9002 | 22 | Female | USA | Sports | 8.223755 | 0 | Easy | 16 | 142 | 35 | 41 | High |
| 3 | 9003 | 35 | Male | USA | Action | 5.265351 | 1 | Easy | 9 | 85 | 57 | 47 | Medium |
| 4 | 9004 | 33 | Male | Europe | Action | 15.531945 | 0 | Medium | 2 | 131 | 95 | 37 | Medium |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40029 | 49029 | 32 | Male | USA | Strategy | 20.619662 | 0 | Easy | 4 | 75 | 85 | 14 | Medium |
| 40030 | 49030 | 44 | Female | Other | Simulation | 13.539280 | 0 | Hard | 19 | 114 | 71 | 27 | High |
| 40031 | 49031 | 15 | Female | USA | RPG | 0.240057 | 1 | Easy | 10 | 176 | 29 | 1 | High |
| 40032 | 49032 | 34 | Male | USA | Sports | 14.017818 | 1 | Medium | 3 | 128 | 70 | 10 | Medium |


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40034 entries, 0 to 40033
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PlayerID                   40034 non-null  int64  
 1   Age                        40034 non-null  int64  
 2   Gender                     40034 non-null  object 
 3   Location                   40034 non-null  object 
 4   GameGenre                  40034 non-null  object 
 5   PlayTimeHours              40034 non-null  float64
 6   InGamePurchases            40034 non-null  int64  
 7   GameDifficulty             40034 non-null  object 
 8   SessionsPerWeek            40034 non-null  int64  
 9   AvgSessionDurationMinutes  40034 non-null  int64  
 10  PlayerLevel                40034 non-null  int64  
 11  AchievementsUnlocked       40034 non-null  int64  
 12  EngagementLevel            40034 non-null  object 
dtypes: float64(1), int64(7), object(5)
memory usag

In [22]:
df.isna().sum()

PlayerID                     0
Age                          0
Gender                       0
Location                     0
GameGenre                    0
PlayTimeHours                0
InGamePurchases              0
GameDifficulty               0
SessionsPerWeek              0
AvgSessionDurationMinutes    0
PlayerLevel                  0
AchievementsUnlocked         0
EngagementLevel              0
dtype: int64

### No missing values detected

In [23]:
df.duplicated().sum()

np.int64(0)

### No duplicate rows found

In [24]:
df = df.drop(['Gender', 'PlayerID'], axis=1)

In [25]:
# df['GameDifficulty'].unique()

In [26]:
# # Categorical encoding using integer mappings (disabled; one-hot encoding is applied later instead)
# df.replace({
#     'Extracurricular Activities': {'Yes': 1, 'No': 0},
#     'Location': {'Other': 0, 'USA': 1, 'Europe': 2, 'Asia': 3},
#     'GameGenre': {'Strategy': 0, 'Sports': 1, 'Action': 2, 'RPG': 3, 'Simulation': 4},
#     'GameDifficulty': {'Easy': 0, 'Medium': 1, 'Hard': 2}
# }, inplace=True)


In [27]:
# Identify columns you want to scale (only true continuous features)
continuous_columns = [
    'Age',
    'PlayTimeHours',
    'SessionsPerWeek',
    'AvgSessionDurationMinutes',
    'PlayerLevel',
    'AchievementsUnlocked'
]

# Standardize those columns only
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[continuous_columns])

# Convert scaled features back to DataFrame
scaled_df = pd.DataFrame(scaled_features, columns=continuous_columns)

# Combine scaled continuous data with the rest (the categorical columns are still unencoded at this point)
scaled_data = pd.concat([df.drop(columns=continuous_columns), scaled_df], axis=1)


In [28]:
scaled_data

|  | Location | GameGenre | InGamePurchases | GameDifficulty | EngagementLevel | Age | PlayTimeHours | SessionsPerWeek | AvgSessionDurationMinutes | PlayerLevel | AchievementsUnlocked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Other | Strategy | 0 | Medium | Medium | 1.096023 | 0.614176 | -0.602363 | 0.269487 | 1.026459 | 0.032814 |
| 1 | USA | Strategy | 0 | Medium | Medium | -0.297969 | -0.939816 | -0.775865 | 1.004019 | -1.352160 | -1.006648 |
| 2 | USA | Sports | 0 | Easy | High | -0.994965 | -0.549654 | 1.132666 | 0.963212 | -0.512647 | 1.141573 |
| 3 | USA | Action | 1 | Easy | Medium | 0.299456 | -0.977506 | -0.081854 | -0.199798 | 0.256906 | 1.557358 |
| 4 | Europe | Action | 0 | Medium | Medium | 0.100314 | 0.507275 | -1.296374 | 0.738771 | 1.586134 | 0.864383 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40029 | USA | Strategy | 0 | Easy | Medium | 0.000744 | 1.243074 | -0.949368 | -0.403835 | 1.236337 | -0.729458 |
| 40030 | Other | Simulation | 0 | Hard | High | 1.195594 | 0.219091 | 1.653174 | 0.391909 | 0.746622 | 0.171409 |
| 40031 | USA | RPG | 1 | Easy | High | -1.691961 | -1.704277 | 0.091649 | 1.656937 | -0.722525 | -1.630325 |
| 40032 | USA | Sports | 1 | Medium | Medium | 0.199885 | 0.288298 | -1.122871 | 0.677560 | 0.711642 | -1.006648 |


In [29]:
#Identifying Categorical columns
categorical_columns = scaled_data.select_dtypes(include=['object']).columns.to_list()
categorical_columns.remove('EngagementLevel')

#Applying one-hot encoding
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_features = encoder.fit_transform(scaled_data[categorical_columns])

#Converting to a dataframe
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(categorical_columns))

#Combining with the original dataset
prepped_data = pd.concat([scaled_data.drop(columns=categorical_columns), encoded_df], axis=1)

In [30]:
df.head()

|  | Age | Location | GameGenre | PlayTimeHours | InGamePurchases | GameDifficulty | SessionsPerWeek | AvgSessionDurationMinutes | PlayerLevel | AchievementsUnlocked | EngagementLevel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 43 | Other | Strategy | 16.271119 | 0 | Medium | 6 | 108 | 79 | 25 | Medium |
| 1 | 29 | USA | Strategy | 5.525961 | 0 | Medium | 5 | 144 | 11 | 10 | Medium |
| 2 | 22 | USA | Sports | 8.223755 | 0 | Easy | 16 | 142 | 35 | 41 | High |
| 3 | 35 | USA | Action | 5.265351 | 1 | Easy | 9 | 85 | 57 | 47 | Medium |
| 4 | 33 | Europe | Action | 15.531945 | 0 | Medium | 2 | 131 | 95 | 37 | Medium |


In [31]:
df.corr(numeric_only=True)

|  | Age | PlayTimeHours | InGamePurchases | SessionsPerWeek | AvgSessionDurationMinutes | PlayerLevel | AchievementsUnlocked |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Age | 1.0 | 0.002462 | -0.000186 | 0.008777 | -0.002269 | 0.001353 | -0.0011 |
| PlayTimeHours | 0.002462 | 1.0 | -0.006067 | -0.003655 | -0.001925 | -0.005152 | 0.003913 |
| InGamePurchases | -0.000186 | -0.006067 | 1.0 | 0.005132 | -0.003059 | 0.006524 | 9.8e-05 |
| SessionsPerWeek | 0.008777 | -0.003655 | 0.005132 | 1.0 | -0.00062 | 0.003257 | 0.003187 |
| AvgSessionDurationMinutes | -0.002269 | -0.001925 | -0.003059 | -0.00062 | 1.0 | 0.001368 | -0.002227 |
| PlayerLevel | 0.001353 | -0.005152 | 0.006524 | 0.003257 | 0.001368 | 1.0 | 0.006343 |
| AchievementsUnlocked | -0.0011 | 0.003913 | 9.8e-05 | 0.003187 | -0.002227 | 0.006343 | 1.0 |
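
The matrix above only relates the numeric features to each other (all pairwise correlations are close to zero); it says nothing about the target, which is still a string column at this point. A simple way to look at feature/target relationships is to compare per-class means. The sketch below uses only the nine rows displayed earlier in this notebook, so the numbers are illustrative and should not be read as estimates for the full dataset; `sample` and `class_means` are hypothetical names.

```python
import pandas as pd

# The nine rows shown in the DataFrame preview above (not a representative sample).
sample = pd.DataFrame({
    "SessionsPerWeek": [6, 5, 16, 9, 2, 4, 19, 10, 3],
    "AvgSessionDurationMinutes": [108, 144, 142, 85, 131, 75, 114, 176, 128],
    "EngagementLevel": ["Medium", "Medium", "High", "Medium", "Medium",
                        "Medium", "High", "High", "Medium"],
})

# Mean of each numeric feature within each engagement class.
class_means = sample.groupby("EngagementLevel").mean(numeric_only=True)
print(class_means)
```

Running the same `groupby` on the full `df` would show which features actually separate the Low, Medium, and High classes.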


In [32]:
#Encoding the target variable
prepped_data['EngagementLevel'] = prepped_data['EngagementLevel'].astype('category').cat.codes
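
One detail worth checking here: `astype('category').cat.codes` assigns codes in alphabetical order of the labels, not in order of engagement. The sketch below shows the default mapping and, as one possible alternative rather than what this notebook does, an explicitly ordered dtype. The names `levels` and `ordered_dtype` are hypothetical.

```python
import pandas as pd

levels = pd.Series(["Medium", "High", "Low", "Medium"])

# Default behaviour: categories are sorted alphabetically,
# so High -> 0, Low -> 1, Medium -> 2.
default_codes = levels.astype("category").cat.codes
print(default_codes.tolist())   # [2, 0, 1, 2]

# Alternative: declare the order explicitly so the codes follow engagement,
# Low -> 0, Medium -> 1, High -> 2.
ordered_dtype = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
ordinal_codes = levels.astype(ordered_dtype).cat.codes
print(ordinal_codes.tolist())   # [1, 2, 0, 1]
```

For the classifiers the code order does not matter, but any regression or MSE computed on these codes depends on it.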

In [33]:
#Preparing Final Dataset
X = prepped_data.drop('EngagementLevel', axis=1)
y = prepped_data['EngagementLevel']

In [34]:
#Stratify keeps the class proportions of the target the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=67, stratify=y)
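
As a quick check of what `stratify` does, the sketch below splits a made-up target with a 20/50/30 class mix and compares the class shares on each side. The toy arrays are assumptions for illustration; only the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 20 High, 50 Medium, 30 Low.
y_toy = pd.Series(["High"] * 20 + ["Medium"] * 50 + ["Low"] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=67, stratify=y_toy
)

# Both splits keep the original 20% / 50% / 30% class mix.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```

Without `stratify`, a random split could over- or under-represent a class in the test set, which makes per-class metrics noisier.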

In [35]:
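#Training the linear regression model
#Note: the target is the integer class code from cat.codes (alphabetical: High=0, Low=1, Medium=2),
#so this treats a nominal label as a continuous value and serves only as a baseline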
Lin_reg = LinearRegression()
Lin_reg.fit(X_train, y_train)
print(f'The intercept is {Lin_reg.intercept_}')
print(f'The Coefficient is {Lin_reg.coef_}')
y_pred = Lin_reg.predict(X_test)

The intercept is 1.2420941550766875
The Coefficient is [-0.02374159  0.00770747 -0.00776623 -0.20648745 -0.24671017  0.00986867
  0.01354504  0.00412753 -0.00597526 -0.00172152 -0.00392765 -0.02619355
 -0.01602243 -0.00514705 -0.01062679  0.00299633]


In [36]:
#Model Evaluation
print(f'The R Square Value is {Lin_reg.score(X_test, y_test)}')
mse = mean_squared_error(y_test, y_pred)
print(f'The Mean Square Error is {mse}')

The R Square Value is 0.14428449581014813
The Mean Square Error is 0.5913969031306738
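
The two figures printed above are linked: on a given test set, R² equals 1 − MSE / Var(y), where Var(y) is the population variance of the true values. The sketch below verifies the identity on made-up numbers and then plugs in the notebook's printed values to recover the implied variance of the encoded test target. The arrays `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up class codes and regression outputs, for illustration only.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_hat = np.array([0.4, 1.1, 1.6, 1.9, 0.8, 0.5])

mse = mean_squared_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)

# R^2 = 1 - SS_res / SS_tot = 1 - MSE / Var(y), using the population variance.
r2_from_mse = 1 - mse / np.var(y_true)
print(r2, r2_from_mse)

# Applying the same identity to the values printed above:
implied_var = 0.5913969031306738 / (1 - 0.14428449581014813)
print(implied_var)   # roughly 0.69
```

An R² of about 0.14 means the linear model removes only about 14% of that variance, which is consistent with a nominal target being a poor fit for regression.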


In [37]:
# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns.to_list()
categorical_cols = df.select_dtypes(include=["object"]).columns.to_list()
categorical_cols.remove("EngagementLevel")  # Target column

# Scale numeric features
scaler = StandardScaler()
scaled_numeric = scaler.fit_transform(df[numeric_cols])
scaled_df = pd.DataFrame(scaled_numeric, columns=scaler.get_feature_names_out(numeric_cols))

# Encode categorical features
encoder = OneHotEncoder(sparse_output=False, drop="first")
encoded_features = encoder.fit_transform(df[categorical_cols])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(categorical_cols))

In [38]:
# Combine scaled + encoded + target
data = pd.concat([scaled_df, encoded_df, df["EngagementLevel"]], axis=1)

# Encode target labels
data["EngagementLevel"] = data["EngagementLevel"].astype("category").cat.codes

In [39]:
# Split dataset
X = data.drop("EngagementLevel", axis=1)
y = data["EngagementLevel"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
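#ONE-VS-ALL STRATEGY
#Note: multi_class is deprecated as of scikit-learn 1.5; OneVsRestClassifier(LogisticRegression()) is the suggested replacement
model_ova = LogisticRegression(multi_class='ovr', max_iter=1000)  # max_iter raises the solver's iteration cap so it can converge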
model_ova.fit(X_train, y_train)
y_pred_ova = model_ova.predict(X_test)

# Evaluation metrics for OvA (MSE is computed on integer class codes, so accuracy is the more meaningful figure)
print("One-Vs-All (OvA) Strategy")
print(f'Accuracy: {np.round(100 * accuracy_score(y_test, y_pred_ova), 2)}%')
print(f'Error (MSE): {mean_squared_error(y_test, y_pred_ova)}')





One-Vs-All (OvA) Strategy
Accuracy: 82.33%
Error (MSE): 0.34457349818908456
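
Because the class codes are nominal, MSE on predicted labels depends on how the classes happen to be numbered. Classification metrics such as a confusion matrix or macro-averaged F1 do not have that problem. The sketch below uses made-up label arrays to show the calls; it is not output from this notebook's models.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up true and predicted class codes, for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(cm)
print(f"Accuracy: {acc:.2f}, macro F1: {macro_f1:.3f}")
```

Applied to `y_test` and `y_pred_ova`, the confusion matrix would show which engagement classes the model confuses with each other, which a single accuracy figure hides.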




In [None]:
#ONE-VS-ONE STRATEGY
model_ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000))
model_ovo.fit(X_train, y_train)

y_pred_ovo = model_ovo.predict(X_test)

# Evaluation metrics for OvO (MSE is computed on integer class codes, so accuracy is the more meaningful figure)
print("\nOne-Vs-One (OvO) Strategy")
print(f'Accuracy: {np.round(100 * accuracy_score(y_test, y_pred_ovo), 2)}%')
print(f'Error (MSE): {mean_squared_error(y_test, y_pred_ovo)}')


One-Vs-One (OvO) Strategy
Accuracy: 81.84%
Error (MSE): 0.34494817035094294
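
The two strategies differ in how many binary classifiers they train: one-vs-rest fits one per class (k), while one-vs-one fits one per pair of classes (k(k−1)/2). With the three engagement classes both come to 3, so the sketch below uses four made-up classes on random data to make the difference visible. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Random toy data with four classes (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = np.repeat([0, 1, 2, 3], 25)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)

n_ovr = len(ovr.estimators_)   # k = 4 binary models
n_ovo = len(ovo.estimators_)   # k * (k - 1) / 2 = 6 binary models
print(n_ovr, n_ovo)
```

Since both strategies train three logistic models for this dataset, the small accuracy gap above (82.33% vs 81.84%) reflects how the binary problems are posed rather than model capacity.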


In [None]:
#POLYNOMIAL REGRESSION
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

poly_reg = LinearRegression()
poly_reg.fit(X_poly_train, y_train)

# Predictions & Evaluation
y_pred_poly = poly_reg.predict(X_poly_test)
mse_poly = mean_squared_error(y_test, y_pred_poly)

print("POLYNOMIAL REGRESSION RESULTS")
print(f"Mean Squared Error (MSE): {mse_poly}")

POLYNOMIAL REGRESSION RESULTS
Mean Squared Error (MSE): 0.3569547789068502
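
A degree-3 expansion grows the feature space quickly. For n input features, the number of polynomial terms up to degree d (excluding the bias) is C(n + d, d) − 1. The sketch below checks this for 16 inputs, which matches the 16 coefficients printed for the linear model earlier; the zero-filled array is only a placeholder used to fit the transformer.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 16, 3

poly = PolynomialFeatures(degree=degree, include_bias=False)
poly.fit(np.zeros((1, n_features)))  # only the column count matters for fitting

# Closed-form count of monomials of degree 1..3 in 16 variables.
expected = comb(n_features + degree, degree) - 1
print(poly.n_output_features_, expected)   # 968 968
```

With 968 columns and an unregularized `LinearRegression`, some of the MSE improvement over the plain linear model may come from added flexibility, so it is worth comparing train and test error before drawing conclusions.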
