# Final Report


Group members: Juliane Lou (30661920), Kaitlyn Yee (88878830), Anastasya Situmorang (73455958)

# Introduction

To collect data about how people play video games, a research group in Computer Science at UBC set up a Minecraft server to monitor players' actions. To help the project run smoothly, they have provided us with two data files. Our group will analyze the player information in the players.csv file to answer Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Our specific, predictive question is: Can we predict a player's subscription status from "experience", "played_hours" and "age"? If so, which combination of explanatory variables gives the most accurate prediction of the target variable?

The raw players.csv dataset, collected from a Minecraft server, contains 196 observations, where each row represents an individual player, and 9 columns, one per variable:

- experience (categorical): The experience level of the player (Amateur, Beginner, Regular, Pro, Veteran)
- subscribe (boolean): Indicates whether the player is subscribed to the game-related newsletter (target variable)
- hashedEmail (string): Hashed version of the player's email
- played_hours (numerical): Total number of hours played
- name (string): Player's name
- gender (categorical): Player's reported gender
- age (numerical): Player's age in years
- individualId (N/A, no data): Each player's in-game ID
- organizationName (N/A, no data): Player's affiliated organization

Columns "individualId" and "organizationName" contain no data, while "name" and "hashedEmail" identify players and are not predictive. The remaining variables are of mixed types, which makes it difficult to plot them on the same graphs or evaluate them with the same methods.

# Methods 

The dataset contains multiple variables for each player and a categorical target variable (True/False), so we will use KNN classification. The response variable is "subscribe", and the explanatory variables are "experience", "played_hours", and "age". We will train one KNN classifier for each pair of explanatory variables, using a 75-25 training-testing split, to predict "subscribe".
Then, we will compare the 5-fold cross-validation results of the 3 models to find the highest accuracy, precision and recall. The variables in the best-performing model are then the ones most predictive of subscribing to a game-related newsletter.

Since KNN only works with numeric values, we must convert "experience", a categorical variable, into a number scale (e.g., Amateur = 1, Beginner = 2, Regular = 3, Pro = 4, Veteran = 5). Each KNN model will then be fit using cross-validation to compare accuracy, precision, and recall, allowing us to determine which pair of variables best predicts newsletter subscription.

KNN makes few assumptions about how the data are distributed, and we assess performance through cross-validation rather than by checking model assumptions. However, converting "experience" to a numeric scale assumes that the levels are ordered from 1 (lowest) to 5 (highest) and evenly spaced, and the value assigned to each level is our own assumption.

Exploratory Data Analysis (EDA):
- Before fitting the models, we will examine the relationship between each pair of predictor variables (age vs. played hours, age vs. experience, played hours vs. experience)
    - A scatter plot is created for each pair, with points coloured by the "subscribe" target variable
    - Figures 1-3 allow us to visualize the relationship between the predictors and the target variable, helping us determine which pair will likely be the most predictive

Wrangling:
- Remove empty/irrelevant columns after checking for missing values
- Convert the categorical explanatory variable "experience" to numeric for KNN. (Limitation: the number assigned to each level of experience could be subjective.)
- "subscribe" is boolean; we will convert it to 0 and 1 (nominal, ":N") for KNN.
- Drop rows with a duplicated "hashedEmail" so that each player appears only once.
- Standardize/scale the explanatory variables.
- To split the data properly, use stratify=y so that the proportion of subscribers is similar in the training and testing sets.
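
The deduplication and stratified split listed above could look like the following minimal sketch. The toy DataFrame and its values are invented for illustration; only the column names come from players.csv, and the variable names `toy_players`, `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for players.csv (values are made up; column names follow the dataset).
toy_players = pd.DataFrame({
    "hashedEmail": ["a", "b", "b", "c", "d", "e", "f", "g", "h"],
    "subscribe": [True, True, True, False, True, False, True, True, False],
    "played_hours": [30.3, 3.8, 3.8, 0.0, 0.7, 0.1, 1.2, 0.0, 2.3],
    "age": [9, 17, 17, 17, 21, 21, 19, 22, 17],
})

# Keep one row per player (the second "b" row is dropped).
toy_players = toy_players.drop_duplicates(subset="hashedEmail")

# stratify keeps the share of subscribers similar in both splits.
toy_train, toy_test = train_test_split(
    toy_players,
    test_size=0.25,
    stratify=toy_players["subscribe"],
    random_state=123,
)
```

Without `stratify`, a small test set can end up with very few non-subscribers by chance, which makes precision and recall on the test set less stable.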

Issues with the model:
- Sensitivity to noise
- Biased toward the majority class when the classes are imbalanced
- Scaling the variables changes how much each one contributes to the distance calculation
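
To illustrate the scaling issue, the sketch below finds the nearest neighbour of one hypothetical player before and after standardization. The five (played_hours, age) points are made up, and `nearest_to_first` is a helper written only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical players as (played_hours, age) pairs; values are made up.
points = np.array([
    [0.5, 17.0],   # row 0: the query player
    [40.0, 18.0],  # similar age, very different hours
    [1.0, 30.0],   # similar hours, very different age
    [10.0, 22.0],
    [0.0, 25.0],
])

def nearest_to_first(data):
    """Return the index of the row closest to row 0 (excluding row 0 itself)."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

# Euclidean nearest neighbour on the raw values and on standardized values.
raw_neighbour = nearest_to_first(points)
scaled_neighbour = nearest_to_first(StandardScaler().fit_transform(points))

print(raw_neighbour, scaled_neighbour)
```

On these points the nearest neighbour changes from row 4 to row 3 once both variables are put on the same scale, which is why the choice to standardize can change the predictions of a KNN model.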

Comparing and selecting the best model and processing the data to apply the model:
1. Select the columns to keep: age, played_hours, experience, and subscribe.
2. Fit 3 KNN classification models, one for each pair of predictors (age vs. played hours, age vs. experience, played hours vs. experience). The target "subscribe" is the same in all 3 models (encoded as "colour":N in the plots).
3. Quantitative analysis: split the data into training and testing sets (0.75, 0.25).
   - Perform 5-fold cross-validation on each model to evaluate accuracy, precision and recall.
4. Compare the cross-validation results of the 3 models, and select the model with the best combination of accuracy, precision and recall.
5. Evaluate the selected model on the held-out test set (0.25).
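
The cross-validation step above mentions accuracy, precision and recall. One way to collect all three in a single call is the `scoring` argument of `cross_validate`, sketched below on synthetic data. The data, the fixed `n_neighbors=5`, and the names `toy`, `pipe`, `cv_scores` and `cv_summary` are assumptions for illustration, not the values used in this report.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (random values, not from players.csv).
rng = np.random.default_rng(123)
toy = pd.DataFrame({
    "played_hours": rng.exponential(5.0, size=100),
    "age": rng.integers(9, 50, size=100),
})
toy["subscribe"] = (toy["played_hours"] + rng.normal(0, 3, size=100)) > 2

# Scale both predictors, then classify with KNN.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validation that records three metrics per fold.
cv_scores = pd.DataFrame(
    cross_validate(
        pipe,
        toy[["played_hours", "age"]],
        toy["subscribe"],
        cv=5,
        scoring=["accuracy", "precision", "recall"],
    )
)

# Mean and standard error of each metric across the folds.
metric_columns = ["test_accuracy", "test_precision", "test_recall"]
cv_summary = cv_scores[metric_columns].agg(["mean", "sem"])
```

Repeating this for each pair of predictors would give a table of three metrics per model, rather than accuracy alone.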

# Results

In [1]:
# Imports
import altair as alt
import numpy as np
import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [2]:
# Read in the dataset and do minimal wrangling to tidy the data
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players_data = pd.read_csv(url)

# Keep only the columns used in the analysis (drops the empty and identifying columns);
# .copy() avoids modifying a view of players_data later on
players = players_data[["experience", "subscribe", "played_hours", "age"]].copy()
players


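| | experience | subscribe | played_hours | age |
|---|---|---|---|---|
| 0 | Pro | True | 30.3 | 9 |
| 1 | Veteran | True | 3.8 | 17 |
| 2 | Veteran | False | 0.0 | 17 |
| 3 | Amateur | True | 0.7 | 21 |
| 4 | Regular | True | 0.1 | 21 |
| ... | ... | ... | ... | ... |
| 191 | Amateur | True | 0.0 | 17 |
| 192 | Veteran | False | 0.3 | 22 |
| 193 | Amateur | False | 0.0 | 17 |
| 194 | Amateur | False | 2.3 | 17 |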


In [3]:
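# Convert experience to ordinal numeric values (1 = lowest, 5 = highest)

experience_levels = {
    "Amateur": 1,
    "Beginner": 2,
    "Regular": 3,
    "Pro": 4,
    "Veteran": 5,
}

# assign() returns a new DataFrame, so no chained-assignment warning is raised
players = players.assign(experience=players["experience"].map(experience_levels))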

players
 



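| | experience | subscribe | played_hours | age |
|---|---|---|---|---|
| 0 | 4 | True | 30.3 | 9 |
| 1 | 5 | True | 3.8 | 17 |
| 2 | 5 | False | 0.0 | 17 |
| 3 | 1 | True | 0.7 | 21 |
| 4 | 3 | True | 0.1 | 21 |
| ... | ... | ... | ... | ... |
| 191 | 1 | True | 0.0 | 17 |
| 192 | 5 | False | 0.3 | 22 |
| 193 | 1 | False | 0.0 | 17 |
| 194 | 1 | False | 2.3 | 17 |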


In [4]:
#SPLIT DATASET INTO TRAINING AND TESTING DATA

players_train, players_test = train_test_split(players, test_size=0.25, random_state=123) # set the random state to be 123

players_train

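| | experience | subscribe | played_hours | age |
|---|---|---|---|---|
| 100 | 1 | True | 0.0 | 20 |
| 10 | 5 | True | 1.6 | 23 |
| 149 | 1 | True | 0.0 | 16 |
| 171 | 2 | False | 1.8 | 32 |
| 178 | 1 | True | 0.4 | 17 |
| ... | ... | ... | ... | ... |
| 17 | 1 | True | 48.4 | 17 |
| 98 | 1 | False | 0.0 | 17 |
| 66 | 5 | False | 0.1 | 22 |
| 126 | 2 | True | 0.7 | 24 |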


In [5]:
# EDA on the training data only: one scatter plot per pair of predictors,
# with points coloured by subscribe

In [6]:
#build preprocessor

players_preprocessor_1 = make_column_transformer(
    (StandardScaler(), ["experience", "played_hours"]),
    remainder='passthrough', 
    verbose_feature_names_out=False
)

players_preprocessor_2 = make_column_transformer(
    (StandardScaler(), ["experience", "age"]),
    remainder='passthrough', 
    verbose_feature_names_out=False
)

players_preprocessor_3 = make_column_transformer(
    (StandardScaler(), ["played_hours", "age"]),
    remainder='passthrough', 
    verbose_feature_names_out=False
)

In [7]:
#specify knn classifier
knn_spec = KNeighborsClassifier(n_neighbors=3)

#identify training predictors vs. target

X_train_1 = players_train[["experience", "played_hours"]]
X_train_2 = players_train[["experience", "age"]]
X_train_3 = players_train[["played_hours", "age"]]
y = players_train["subscribe"]

#create fitted pipelines

players_fit_1 = make_pipeline(players_preprocessor_1, knn_spec).fit(X_train_1, y)
players_fit_2 = make_pipeline(players_preprocessor_2, knn_spec).fit(X_train_2, y)
players_fit_3 = make_pipeline(players_preprocessor_3, knn_spec).fit(X_train_3, y)

In [8]:
# first focus on players_preprocessor_1, find optimal KNeighbors value through cross-validation to tune KNeighbors
# create parameter grid with range of 1-15 (inclusive) for simplicity and processing speed

param_grid={'kneighborsclassifier__n_neighbors':range(1, 16)}

#specify X and y variables for predictors vs. target for the variables within the preprocessor

X_train_1 = players_train[["experience", "played_hours"]]
y = players_train["subscribe"]

# create pipe

players_pipe_1=make_pipeline(players_preprocessor_1, knn_spec)
             
# perform standard 5-fold cross validation

knn_tune_grid_1=GridSearchCV(
        estimator=players_pipe_1,
        param_grid=param_grid,
        cv=5
    )

# fit tuned grid to X and y

knn_model_grid_1 = knn_tune_grid_1.fit(X_train_1, y)

# find the results and store in a new dataframe 

accuracies_grid_1 = pd.DataFrame(knn_model_grid_1.cv_results_)


# create a line graph (with points) to visualize results and help determine optimal KNeighbors value

accuracy_k_grid_1=alt.Chart(accuracies_grid_1).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("KNeighborsClassifier").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean Test Score").scale(zero=False)
)
accuracy_k_grid_1

In [9]:
# from the above results, the optimal number of neighbours is k=13 or k=15; we will use k=13 in this project

# still focusing on the variable set in players_preprocessor_1, create new knn specification for this dataset

knn_spec_1=KNeighborsClassifier(n_neighbors=13)

# create new pipeline and fit to X_1 and y variables

players_fit_1_final=make_pipeline(players_preprocessor_1, knn_spec_1).fit(X_train_1, y)

# perform standard 5-fold cross-validation on training dataset and store information in new dataframe

players_tune_grid_1=pd.DataFrame(
    cross_validate(
        estimator=players_fit_1_final,
        cv=5,
        X=X_train_1,
        y=y,
        return_train_score=True
    )
)

#calculate the accuracy of the model using these predictor variables

player_metrics_1=players_tune_grid_1.agg(['mean', 'sem'])
player_metrics_1

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| mean | 0.003599 | 0.003867 | 0.727816 | 0.750007 |
| sem | 0.000192 | 0.000178 | 0.019021 | 0.001775 |


In [10]:
# repeat above steps for players_preprocessor_2 to find accuracy
# specify X (predictors) for the variables within the preprocessor

X_train_2 = players_train[["experience", "age"]]

# create pipe

players_pipe_2=make_pipeline(players_preprocessor_2, knn_spec)
             
# perform standard 5-fold cross validation

knn_tune_grid_2=GridSearchCV(
        estimator=players_pipe_2,
        param_grid=param_grid,
        cv=5
    )

# fit tuned grid to X and y

knn_model_grid_2=knn_tune_grid_2.fit(X_train_2, y)

# find the results and store in a new dataframe 

accuracies_grid_2=pd.DataFrame(knn_model_grid_2.cv_results_)


# create a line graph (with points) to visualize results and help determine optimal KNeighbors value

accuracy_k_grid_2=alt.Chart(accuracies_grid_2).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("KNeighborsClassifier").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean Test Score").scale(zero=False)
)
accuracy_k_grid_2

In [11]:
# from the above results, the optimal number of neighbours is k=9

# still focusing on the variable set in players_preprocessor_2, create new knn specification for this dataset

knn_spec_2=KNeighborsClassifier(n_neighbors=9)

# create new pipeline and fit to X_train_2 and y variables

players_fit_2_final=make_pipeline(players_preprocessor_2, knn_spec_2).fit(X_train_2, y)

# perform standard 5-fold cross-validation on training dataset and store information in new dataframe

players_tune_grid_2=pd.DataFrame(
    cross_validate(
        estimator=players_fit_2_final,
        cv=5,
        X=X_train_2,
        y=y,
        return_train_score=True
    )
)

#calculate the accuracy of the model using these predictor variables

player_metrics_2=players_tune_grid_2.agg(['mean', 'sem'])
player_metrics_2

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| mean | 0.003634 | 0.009448 | 0.727816 | 0.748312 |
| sem | 0.000163 | 0.005505 | 0.011135 | 0.003102 |


In [12]:
# focus on players_preprocessor_3, find optimal KNeighbors value through cross-validation to tune KNeighbors
# specify X and y variables for predictors vs. target for the variables within the preprocessor

X_train_3 = players_train[["played_hours", "age"]]

# create pipe

players_pipe_3=make_pipeline(players_preprocessor_3, knn_spec)
             
# perform standard 5-fold cross validation

knn_tune_grid_3=GridSearchCV(
        estimator=players_pipe_3,
        param_grid=param_grid,
        cv=5
    )

# fit tuned grid to X and y

knn_model_grid_3=knn_tune_grid_3.fit(X_train_3, y)

# find the results and store in a new dataframe 

accuracies_grid_3=pd.DataFrame(knn_model_grid_3.cv_results_)


# create a line graph (with points) to visualize results and help determine optimal KNeighbors value

accuracy_k_grid_3=alt.Chart(accuracies_grid_3).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("KNeighborsClassifier").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean Test Score").scale(zero=False)
)
accuracy_k_grid_3

In [13]:
# from the above results, the optimal number of neighbours is k=5

# still focusing on the variable set in players_preprocessor_3, create new knn specification for this dataset

knn_spec_3=KNeighborsClassifier(n_neighbors=5)

# create new pipeline and fit to X_train_3 and y variables

players_fit_3_final=make_pipeline(players_preprocessor_3, knn_spec_3).fit(X_train_3, y)

# perform standard 5-fold cross-validation on training dataset and store information in new dataframe

players_tune_grid_3=pd.DataFrame(
    cross_validate(
        estimator=players_fit_3_final,
        cv=5,
        X=X_train_3,
        y=y,
        return_train_score=True
    )
)

#calculate the accuracy of the model using these predictor variables

player_metrics_3=players_tune_grid_3.agg(['mean', 'sem'])
player_metrics_3

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| mean | 0.003695 | 0.003889 | 0.741609 | 0.787382 |
| sem | 0.000232 | 0.000178 | 0.038444 | 0.009105 |


In [14]:
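# Make predictions on the test data
# (assign() returns a new DataFrame, which avoids a chained-assignment warning)
players_test = players_test.assign(
    predicted=players_fit_3_final.predict(players_test[["played_hours", "age"]])
)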

# Calculating test accuracy
players_fit_3_final.score(
    players_test[["played_hours", "age"]],
    players_test["subscribe"]
)

0.7346938775510204

In [15]:
# Calculating test precision
precision_score(
    y_true=players_test["subscribe"],
    y_pred=players_test["predicted"]
)

np.float64(0.7333333333333333)

In [16]:
# Test recall
recall_score(
    y_true=players_test["subscribe"],
    y_pred=players_test["predicted"]
)

np.float64(0.9705882352941176)

In [17]:
# Confusion matrix
pd.crosstab(
    players_test["subscribe"], players_test["predicted"]
)

| subscribe (actual) | predicted False | predicted True |
|---|---|---|
| False | 3 | 12 |
| True | 1 | 33 |
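
As a check, the test accuracy, precision and recall reported above can be recomputed by hand from the four counts in the confusion matrix. The sketch also computes the accuracy of always predicting the majority class, which is a comparison we add here for context and was not part of the original cells.

```python
# Counts from the confusion matrix above (rows: actual, columns: predicted).
true_negative, false_positive = 3, 12
false_negative, true_positive = 1, 33

total = true_negative + false_positive + false_negative + true_positive

# Share of all test players classified correctly.
accuracy = (true_positive + true_negative) / total
# Share of predicted subscribers who actually subscribed.
precision = true_positive / (true_positive + false_positive)
# Share of actual subscribers the model found.
recall = true_positive / (true_positive + false_negative)

# Accuracy of always predicting the majority class (subscribed).
majority_baseline = (true_positive + false_negative) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"majority-class baseline={majority_baseline:.3f}")
```

This prints an accuracy of 0.735, a precision of 0.733 and a recall of 0.971, matching the cell outputs, and a majority-class baseline of 0.694.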


# Discussion

Our KNN classification analysis compared three models, each using a different pair of predictor variables, to determine which combination best predicted subscription status. In cross-validation, the model using played_hours and age achieved the highest mean accuracy (0.742, compared with 0.728 for each of the other two models), so we selected it as our final classifier. On the test set, the tuned model achieved an accuracy of approximately 0.735, correctly classifying about three out of four players. Its recall was 0.971, meaning that it identified nearly all players who subscribed, while its precision of 0.733 means that about three-quarters of the players it predicted to subscribe actually did. The confusion matrix shows that most errors were false positives: 12 of the 15 non-subscribers in the test set were predicted to subscribe.

These findings are generally consistent with expectations. Since played_hours represents engagement, it is reasonable that it would be a strong predictor of subscription status. Age also contributed to the model's performance, suggesting that subscription likelihood varies across age groups. The models involving experience performed slightly worse, suggesting that self-reported experience did not provide as much predictive value as the other predictors, although the gap in mean cross-validation accuracy is smaller than the standard error of the best model (0.038), so this difference should be interpreted with caution. Overall, our results support the idea that both player activity and demographic factors contain useful information for predicting subscription behaviour.
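
To put the differences between the three models in context, the short sketch below compares each model's mean cross-validation accuracy with that of the best model, using the means and standard errors reported in the Results section. The dictionary layout and variable names are our own.

```python
# Mean cross-validation accuracy and standard error reported for each predictor pair.
cv_results = {
    ("experience", "played_hours"): (0.727816, 0.019021),
    ("experience", "age"): (0.727816, 0.011135),
    ("played_hours", "age"): (0.741609, 0.038444),
}

# Pair with the highest mean accuracy.
best_pair = max(cv_results, key=lambda pair: cv_results[pair][0])
best_mean, best_sem = cv_results[best_pair]

# Gap between the best mean accuracy and each model's mean accuracy.
gaps = {pair: best_mean - mean for pair, (mean, _) in cv_results.items()}

for pair, gap in gaps.items():
    print(pair, f"gap to best = {gap:.4f}", f"within one SEM of best: {gap <= best_sem}")
```

All three models fall within one standard error of the best mean accuracy, so the ranking of the predictor pairs could plausibly change with a different random split.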

The results of this analysis may help identify which players are most likely to subscribe, and the game administrators could use this insight to develop more targeted engagement strategies. For example, players who resemble non-subscribers, such as those who play less, could be encouraged to explore more features or join community events that might increase their interest. Subscription reminders could also be tailored to the player groups that the model identifies as more likely to subscribe.

There are several ways this analysis could be improved. Although our model performed well, it is likely that additional predictors, such as achievement progress, session duration, or in-game purchases, would increase predictive accuracy. The relatively high recall and moderate precision indicate that the model tends to predict "subscribe" more often than necessary, suggesting that adjusting the classification threshold or testing alternative models such as logistic regression may reduce false positives. We also learned that evaluating only pairs of variables may overlook interactions between predictors, so future work could explore models that include all predictors simultaneously or examine interaction effects directly.
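
As a rough illustration of the threshold idea, the sketch below uses `predict_proba` from a KNN pipeline to require a larger share of subscribing neighbours before predicting "subscribe". The data are synthetic, and the threshold of 0.8, the value `n_neighbors=5` and the helper `predict_with_threshold` are assumptions chosen for this example rather than tuned values.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data with an imbalanced target (not from players.csv).
rng = np.random.default_rng(123)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=300)) > -0.6

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123
)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# Share of the 5 nearest neighbours that are positive, for each test point.
positive_share = model.predict_proba(X_test)[:, 1]

def predict_with_threshold(share, threshold):
    """Predict positive only when the neighbour share reaches the threshold."""
    return share >= threshold

default_pred = predict_with_threshold(positive_share, 0.5)  # same as a majority vote
strict_pred = predict_with_threshold(positive_share, 0.8)   # needs 4 of 5 neighbours

for name, pred in [("default", default_pred), ("strict", strict_pred)]:
    print(
        name,
        f"precision={precision_score(y_test, pred, zero_division=0):.3f}",
        f"recall={recall_score(y_test, pred):.3f}",
    )
```

A stricter threshold can only reduce the number of positive predictions, so recall cannot increase; whether precision improves enough to justify the lost recall would need to be checked with cross-validation on the real data.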

These findings lead to several questions for further research. How do subscription patterns change over time? Do certain updates, events, or engagement campaigns influence whether players subscribe? Would similar models perform equally well in other gaming communities, or are these patterns specific to this dataset? Exploring these questions would provide a more complete understanding of player behaviour and could lead to more effective engagement strategies.

In summary, our analysis evaluated three KNN classification models to determine which pair of predictors best identified players who subscribed to the newsletter. After tuning and comparing the cross-validation results, the model using played_hours and age showed the strongest performance. These results suggest that both engagement level and demographic factors play an important role in predicting subscription behaviour. While the model performed well, incorporating additional predictors and exploring alternative modelling approaches may further improve accuracy and reduce false positives. Overall, this work provides a useful foundation for understanding subscription patterns and for developing targeted strategies to increase player engagement.

# References

Timbers, T., Campbell, T., Lee, M., Ostblom, J., & Heagy, L. (2024). Data science: A first introduction with Python. https://python.datasciencebook.ca/classification2.html#tuning-the-classifier