# DSCI100 101 - Project Final Report

Group 25:

Melody Mokhtari Amirmajdi (88736350), Sophia Boniati Ozi (99642803), Anson Lam (97811442)

# Introduction

In this project, we aim to analyze player data from the PLAI Minecraft research server to understand which player characteristics are most associated with subscribing to the project’s newsletter. Subscribing is used as a proxy for player engagement, as it suggests greater interest in contributing to research data collection.

**Specific question:**  
Can we predict a player’s likelihood of subscribing to the newsletter using personal and in-game characteristics such as age, experience, and played hours? This predictive question will utilize classification methods, and will serve to respond to researchers' question #1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Our analysis focuses on the `players.csv` dataset. This report includes a descriptive summary of the dataset, exploratory data analysis (EDA) through visualizations, and a K-nearest neighbors classification analysis with its results.

**Data Description:**

There are 196 observations in the `players.csv` file, each representing an individual player. The dataset contains 9 columns. Two of these columns (`individualId` and `organizationName`), which are empty in the rows displayed below, were excluded from this project's analysis. Of the remaining 7 variables, four are stored as objects (categorical or string), one is a float, one is an integer, and one is boolean.

**Variables**

- experience (object): player’s experience level in Minecraft (Beginner, Amateur, Regular, Veteran, or Pro)  
- subscribe (boolean): whether the player subscribed to the PLAI newsletter (True or False)  
- played_hours (float): total hours played on the server  
- name (object): player name (not used for modeling)  
- gender (object): gender identification (Male, Female, Other, Prefer not to say, etc...)  
- age (integer): player’s age in years

There are some issues with the data that may cause problems later in the analysis:

- There is a class imbalance: more players are subscribed (144) than unsubscribed (52).
- Some ages appear unusually high (e.g., 91) for a typical player and may be outliers.
- `experience` must be converted into numeric format for modeling.
- There are no missing values, but some numeric variables may contain mostly zeros, which could reduce their usefulness in modeling.

The data was collected from the PLAI Minecraft server and participant registration system, so in-game variables (like hours and experience) considered in the context of this project reflect real and accurate gameplay metrics.
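
The checks listed above (missing values, zero-heavy columns, class shares) can be sketched in a few lines of pandas. The block below is only an illustration on a hypothetical four-row stand-in table, `toy_players`; its values are made up and it does not read `players.csv`.

```python
import pandas as pd

# Hypothetical miniature stand-in for players.csv (illustrative values only).
toy_players = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur", "Regular"],
    "subscribe": [True, True, False, True],
    "played_hours": [30.3, 0.0, 0.0, 0.1],
    "age": [9, 17, 17, 21],
})

# Number of missing values in each column.
missing_per_column = toy_players.isna().sum()

# Share of rows where each numeric column is exactly zero.
zero_share = (toy_players[["played_hours", "age"]] == 0).mean()

# Share of players who subscribed (the mean of a boolean column).
subscribed_share = toy_players["subscribe"].mean()

print(missing_per_column, zero_share, subscribed_share, sep="\n\n")
```

Running the same three expressions on the real `players` table would confirm or refute each bullet directly.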

In [1]:
import altair as alt
import numpy as np
import pandas as pd
np.random.seed(4)

from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.model_selection import train_test_split
set_config(transform_output="pandas")

In [2]:
players_url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(players_url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


## Methods & Results

We conducted a supervised classification analysis to predict whether a player will subscribe to the PLAI newsletter. In classification, we use labeled examples to learn a rule that maps a set of input features (predictor variables) to a categorical outcome (the response). Here, the response variable is `subscribe`, and the predictors include player-level characteristics such as age, played hours, and experience.

We used the k-nearest neighbors (KNN) algorithm to conduct this classification. KNN predicts the class of a new observation by identifying the K most similar observations in the training data and assigning the most common label among them. Similarity is measured by the distance between observations in predictor space (Euclidean distance by default in scikit-learn). Because KNN relies directly on distances, appropriate preprocessing and tuning of K are essential to achieve reliable predictions.
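
To make the voting rule concrete, here is a minimal from-scratch sketch of KNN prediction. The function `knn_predict` and the toy arrays are hypothetical illustrations; the analysis itself uses scikit-learn's `KNeighborsClassifier`, not this code.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training row.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature toy data.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_toy = np.array([False, False, True, True, True])

prediction = knn_predict(X_toy, y_toy, np.array([0.8, 0.9]), k=3)
print(prediction)
```

The query point sits among the three `True` rows, so all three of its nearest neighbors vote `True`.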

### 1 - Data loading

We loaded the `players` dataset used to study newsletter subscription behavior on the PLAI Minecraft server. The dataset contains demographic and engagement variables for individual players, along with an indicator of whether they subscribed to the newsletter. Our analysis uses these variables to assess whether subscription likelihood can be predicted from measurable player characteristics.

### 2 - Data cleaning and initial wrangling

We removed identifier and organization-related columns (e.g., individualId, organizationName, hashedEmail). These fields do not provide meaningful predictive information for subscription behavior, so removing them allows a focus on relevant player attributes for the analysis.

In [3]:
dropped_players=players.drop(columns=["individualId","organizationName","hashedEmail"])
dropped_players

Unnamed: 0,experience,subscribe,played_hours,name,gender,age
0,Pro,True,30.3,Morgan,Male,9
1,Veteran,True,3.8,Christian,Male,17
2,Veteran,False,0.0,Blake,Male,17
3,Amateur,True,0.7,Flora,Female,21
4,Regular,True,0.1,Kylie,Male,21
...,...,...,...,...,...,...
191,Amateur,True,0.0,Bailey,Female,17
192,Veteran,False,0.3,Pascal,Male,22
193,Amateur,False,0.0,Dylan,Prefer not to say,17
194,Amateur,False,2.3,Harlow,Male,17


### 3 - Exploratory data analysis

We began with exploratory analysis to understand the distribution of key variables and their relationships with subscription. We examined how played hours and age vary across experience levels and subscription groups. This step helps confirm whether the chosen predictors plausibly relate to the response and provides context for later modeling choices.

In [4]:
# Figure 1 is a bar graph of players' experience level against their total summed played hours. We can see that regulars and amateurs play substantially more than players at other experience levels.
plot_bar1 = (
    alt.Chart(dropped_players)
    .mark_bar()
    .encode(
        x=alt.X("experience:N", title="Experience Level"),
        y=alt.Y("sum(played_hours):Q", title="Total Played Hours"),
        color=alt.Color("experience:N", legend=alt.Legend(title="Experience")),
        tooltip=["experience", alt.Tooltip("sum(played_hours):Q", title="Total Hours")]
    )
    .properties(title="Figure 1: Total Played Hours by Experience Level")
)

# Figure 2 is a bar graph of subscription status against average played hours, coloured by experience level. Players on average play more hours if they are subscribed, but that may also be because subscribing gives you more play time.
plot_bar2 = (
    alt.Chart(dropped_players)
    .mark_bar()
    .encode(
        x=alt.X("subscribe:N", title="Subscription Status"),
        y=alt.Y("mean(played_hours):Q", title="Average Played Hours"),
        color=alt.Color(
            "experience:N",
            legend=alt.Legend(title="Experience"),
            scale=alt.Scale(scheme="set2")
        ),
        tooltip=[
            "subscribe",
            "experience",
            alt.Tooltip("mean(played_hours):Q", title="Average Hours"),
        ],
    )
    .properties(title="Figure 2: Average Played Hours by Subscription and Experience")
)

# Figure 3 is a bar graph of subscription status against the number of players. There are nearly three times as many subscribed players (144) as unsubscribed players (52), so we balance the two classes before setting up the KNN classifier.
plot_class = (
    alt.Chart(dropped_players)
    .mark_bar()
    .encode(
        x=alt.X("subscribe:N", title="Subscription Status"),
        y=alt.Y("count():Q", title="Number of Players"),
        fill=alt.Fill(
            "subscribe:N",
            legend=alt.Legend(title="Subscription status"),
            scale=alt.Scale(scheme="set2")
        ),
        tooltip=["subscribe", alt.Tooltip("count():Q", title="Count")],
    )
    .properties(title="Figure 3: Class Balance: Subscribed vs Not Subscribed")
)

# Figure 4 is a bar graph of experience level against subscription rate. We can see that, on average, roughly 70-80% of players subscribe regardless of their experience level.
plot_rate = (
    alt.Chart(dropped_players)
    .mark_bar()
    .encode(
        x=alt.X("experience:N", title="Experience Level"),
        y=alt.Y("mean(subscribe):Q", title="Subscription Rate"),
        color=alt.Color("experience:N", legend=alt.Legend(title="Experience")),
        tooltip=[
            "experience",
            alt.Tooltip("mean(subscribe):Q", title="Subscription Rate"),
        ],
    )
    .properties(title="Figure 4: Subscription Rate by Experience Level")
)


# Figure 5 is a scatter plot of age against played hours, coloured by experience level. Most players fall in a narrow range, but a few are outliers in either dimension (very high age or very high played hours).
plot_scatter = (
    alt.Chart(dropped_players)
    .mark_point(size=80, opacity=0.7)
    .encode(
        x=alt.X("age:Q", title="Age (years)"),
        y=alt.Y("played_hours:Q", title="Played Hours"),
        color=alt.Color("experience:N", legend=alt.Legend(title="Experience")),
        tooltip=["age", "played_hours", "experience", "subscribe"]
    )
    .properties(title="Figure 5: Relationship Between Age, Experience, and Played Hours")
)

(plot_bar1 | plot_scatter ) & ( plot_bar2 | plot_class | plot_rate )

### 4 - Class balance

We checked the distribution of the subscribe classes. Class imbalance can affect classification performance, especially for models like KNN that are influenced by the local composition of labels. Identifying imbalance early helps determine whether resampling is needed.

In [5]:
dropped_players["subscribe"].value_counts()

subscribe
True     144
False     52
Name: count, dtype: int64
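
As a reference point for later accuracy figures, the counts above imply a simple majority-class baseline on the original (unbalanced) data. The short calculation below uses only the two counts printed above.

```python
# Class counts copied from the value_counts() output above.
n_subscribed = 144
n_not_subscribed = 52

n_total = n_subscribed + n_not_subscribed

# Accuracy of a rule that always predicts the larger class.
majority_baseline = max(n_subscribed, n_not_subscribed) / n_total

# How many subscribed players there are per unsubscribed player.
imbalance_ratio = n_subscribed / n_not_subscribed

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")  # 0.735
print(f"subscribed per unsubscribed player: {imbalance_ratio:.2f}")  # 2.77
```

Always predicting "subscribed" would be right about 73.5% of the time on the original data. On a balanced dataset the same rule drops to 50%.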

### 5 - Handling class imbalance

To reduce bias toward the majority class, we used upsampling to increase the representation of the minority class. This creates a more balanced dataset for model training and helps the classifier learn patterns associated with both subscription outcomes rather than defaulting to the most common label.

In [6]:
# upsampling the minority class (with replacement) so both classes have equal counts
not_subscribed_players = dropped_players[dropped_players["subscribe"] == False]
subscribed_players = dropped_players[dropped_players["subscribe"] == True]
not_subscribed_scaledup = not_subscribed_players.sample(
    n=subscribed_players.shape[0], replace=True
)
upsampled_players = pd.concat((not_subscribed_scaledup, subscribed_players))
upsampled_players["subscribe"].value_counts()

subscribe
False    144
True     144
Name: count, dtype: int64
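
Upsampling with replacement creates exact copies of minority-class rows. If the data is later split into training and test sets, copies of the same player can end up on both sides, which may make test accuracy look better than it would be on unseen players. One commonly used alternative ordering, which is not what this notebook does, is to split first and upsample only the training rows. The sketch below shows that ordering on a hypothetical toy table (`toy`); the column values and `random_state=4` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a similar imbalance (not the real players table).
toy = pd.DataFrame({
    "played_hours": [float(i) for i in range(20)],
    "subscribe": [True] * 14 + [False] * 6,
})

# 1. Split first, so every original row lands in exactly one of the two sets.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["subscribe"], random_state=4
)

# 2. Upsample the minority class inside the training set only.
train_minority = toy_train[~toy_train["subscribe"]]
train_majority = toy_train[toy_train["subscribe"]]
train_minority_up = train_minority.sample(
    n=len(train_majority), replace=True, random_state=4
)
toy_train_balanced = pd.concat([train_minority_up, train_majority])

# No original row index appears in both the balanced training set and the test set.
shared_rows = set(toy_train_balanced.index) & set(toy_test.index)
print(toy_train_balanced["subscribe"].value_counts())
print("rows shared between train and test:", len(shared_rows))
```

With this ordering the test set keeps the original class proportions, so metrics computed on it describe performance on players as they actually occur.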

### 6 - Feature engineering: encoding experience

KNN requires numeric predictors to compute distances. We converted the categorical experience variable into an ordered numeric scale to reflect increasing proficiency levels. This preserves the ordinal structure of experience while allowing it to be used in KNN modeling.

In [7]:
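# changing experience into ordered numeric values for KNN
# (.map avoids the downcasting FutureWarning that .replace raises in recent pandas versions;
#  any unmapped label becomes NaN, so .astype(int) fails loudly instead of passing it through)
upsampled_players["experience"] = (
    upsampled_players["experience"]
    .map({
        "Beginner": 1,
        "Amateur": 2,
        "Regular": 3,
        "Veteran": 4,
        "Pro": 5,
    })
    .astype(int)
)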

upsampled_players


Unnamed: 0,experience,subscribe,played_hours,name,gender,age
184,5,False,1.7,Asher,Male,17
29,4,False,0.1,Vivienne,Male,18
7,2,False,0.0,Emerson,Male,21
171,1,False,1.8,Amelia,Male,32
108,4,False,0.0,Wren,Male,20
...,...,...,...,...,...,...
187,2,True,0.0,Jasper,Male,17
188,1,True,0.0,Lina,Female,17
190,2,True,0.0,Rhys,Male,20
191,2,True,0.0,Bailey,Female,17


### 7 - Summary visualization

We created a final visualization to summarize played hours by subscription status and gender. Because it is drawn from the upsampled data, unsubscribed players are counted with repetition. The figure provides a transition into the predictive modeling stage.

In [8]:
# A visual using the variable 'gender'. 'gender' is kept as a nominal (unordered) category rather than converted to numbers, because it has no natural numeric ordering.
final_visualization = alt.Chart(upsampled_players, title="Figure 6: Players subscribed vs. played hours by gender").mark_bar().encode(
    x = alt.X("subscribe").title("Players Subscribed"),
    y = alt.Y("played_hours").title("Played Hours"),
    color = alt.Color("gender").title("Gender").scale(scheme = "set2")
)
final_visualization

**Figure 6**: A bar graph of subscription status against played hours. We have added a colour legend for players' gender, and we can see that female, male, and non-binary players account for a large portion of the played hours.

### 7 - Train/test split & Pre-processing

We split the dataset into training and test sets to evaluate how well the model generalizes to new players. The training set is used for model fitting and tuning, while the test set is held out for final performance evaluation. Moreover, because KNN uses distances between observations, predictors measured on larger numeric scales can dominate the neighbor search. To ensure each predictor contributes comparably, we standardized the numeric features. This keeps the neighbor search from depending on the units in which each predictor happens to be measured.

In [9]:
#train test split

players_train, players_test = train_test_split(
    upsampled_players, train_size=0.75, stratify=upsampled_players["subscribe"])

X_train = players_train[["age", "played_hours","experience"]]
y_train = players_train["subscribe"]
X_test = players_test[["age", "played_hours","experience"]]
y_test = players_test["subscribe"]
# scaling

players_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "played_hours","experience"]),)

#finding best K
knn = KNeighborsClassifier()
players_tune_pipe = make_pipeline(players_preprocessor, knn)

parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 40),
}
players_tune_grid = GridSearchCV(
    estimator=players_tune_pipe,
    param_grid=parameter_grid,
    cv=5
)
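
The effect of standardization on the neighbor search can be seen on a small example. The four rows below are hypothetical players (age, played hours, experience on the 1-5 scale), and `nearest_to_first` is a helper written only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical players: columns are age (years), played_hours, experience (1-5).
X_toy = np.array([
    [17.0, 10.0, 1.0],   # row 0: the player whose nearest neighbor we look up
    [40.0, 11.0, 5.0],   # row 1: similar hours, very different age and experience
    [17.0, 40.0, 1.0],   # row 2: same age and experience, 30 more hours
    [25.0, 200.0, 3.0],  # row 3: a heavy player that widens the spread of hours
])


def nearest_to_first(X):
    """Return the index of the row (other than row 0) closest to row 0."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1


raw_neighbor = nearest_to_first(X_toy)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X_toy))
print(raw_neighbor, scaled_neighbor)
```

On the raw values, the 30-hour gap to row 2 outweighs everything else, so row 1 is chosen. After standardization, 30 hours is small relative to the spread of played hours, and row 2 (identical age and experience) becomes the nearest neighbor.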

### 8 - Model training and tuning

We fit a KNN classifier and tuned the number of neighbors K. Small K values can lead to overfitting by making predictions too sensitive to individual observations, while large K values can lead to underfitting by overly smoothing decision boundaries. We used cross-validation to compare a range of K values and selected the value that produced the strongest and most stable performance.

In [10]:
#fitting to data

players_tune_grid.fit(X_train, y_train)

players_grid = pd.DataFrame(players_tune_grid.cv_results_)
players_grid
#plotting the accuracy vs k
accuracy_vs_k = alt.Chart(players_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

accuracy_vs_k
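
Besides the mean accuracy per K, `cv_results_` also stores the fold-to-fold standard deviation, which helps judge whether differences between neighboring K values are meaningful. The sketch below shows how to pull both out of a fitted grid search. It runs on synthetic data from `make_classification`, so the names `X_demo`, `y_demo`, `demo_grid` and all resulting numbers are illustrative assumptions, not results for the players data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the players table).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=4
)

n_folds = 5
demo_grid = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 21)},
    cv=n_folds,
)
demo_grid.fit(X_demo, y_demo)

# Keep the K value, the mean accuracy, and the spread across folds.
demo_results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]
].copy()

# Standard error of the mean accuracy across the folds.
demo_results["sem_test_score"] = demo_results["std_test_score"] / n_folds ** 0.5

print(demo_results.sort_values("mean_test_score", ascending=False).head())
```

If several K values score within about one standard error of the best, preferring a larger K among them is a common way to get a smoother, less noise-sensitive classifier.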

### 9 - Selecting the best model

Based on the cross-validation results, we identified the best-scoring *K*. By default, `GridSearchCV` refits the pipeline on the full training set using this value, and that refit model (`best_estimator_`) is the one used for final evaluation on the test set.

In [11]:
#best k?
players_tune_grid.best_params_

{'kneighborsclassifier__n_neighbors': 1}

### 10 - Model evaluation

We evaluated the tuned model on the test set using accuracy, precision, and recall. Accuracy summarizes overall correctness. Precision indicates how often predicted subscribers were truly subscribers, while recall reflects how many true subscribers the model successfully identified. Reporting multiple metrics provides a more complete view of performance than accuracy alone.

In [14]:
#Using the model to predict

final_model = players_tune_grid.best_estimator_

players_pred = final_model.predict(X_test)

from sklearn.metrics import accuracy_score, precision_score, recall_score

test_accuracy = accuracy_score(y_test, players_pred)
test_precision = precision_score(y_test, players_pred)
test_recall = recall_score(y_test, players_pred)

print(test_accuracy*100, test_precision*100, test_recall*100)

#Accuracy score is 77.78%, precision score is 85.71% and recall score is 66.67%

77.77777777777779 85.71428571428571 66.66666666666666
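
The three scores can be traced back to confusion-matrix counts. The counts below are a reconstruction, not notebook output: they assume the stratified 25% test split of the 288 balanced rows contains 36 players of each class, and they are the counts that reproduce the printed scores under that assumption.

```python
# Reconstructed counts (an assumption, not notebook output): a stratified
# 25% test split of 288 balanced rows gives 36 players per class.
tp, fn = 24, 12   # true subscribers predicted True / predicted False
fp, tn = 4, 32    # non-subscribers predicted True / predicted False

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted subscribers who subscribed
recall = tp / (tp + fn)                      # share of true subscribers that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.7778 precision=0.8571 recall=0.6667
```

Read this way, the model misses 12 of 36 subscribers but wrongly flags only 4 of 36 non-subscribers. Calling `confusion_matrix(y_test, players_pred)` in the notebook would confirm the actual counts.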


### 11 - Example prediction

To demonstrate practical use, we applied the final model to a sample player profile. This illustrates how the classifier predicts whether a new player will subscribe based on their age, played hours, and experience. Note that `experience` must be supplied on the same 1-5 numeric scale used in training (3 = Regular).

In [13]:
#An example using our model

new_player = pd.DataFrame({
    "age" : [25],
    "played_hours" : [120],
    "experience" : [3],
})

new_player_pred = final_model.predict(new_player)
new_player_pred

array([ True])
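
`predict` returns only a class label. For a KNN classifier with uniform weights, `predict_proba` returns the fraction of the K nearest neighbors in each class, which is the closest thing to a "likelihood" this model offers. The sketch below uses hypothetical one-feature toy data to show how that fraction depends on K; with K = 1 it can only be 0 or 1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-feature toy data.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([False, False, True, False, True, True])
query = np.array([[2.4]])

proba_by_k = {}
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
    # Column position of the True class in the probability output.
    true_column = list(model.classes_).index(True)
    # Fraction of the k nearest neighbors labeled True.
    proba_by_k[k] = float(model.predict_proba(query)[0, true_column])

print(proba_by_k)
```

Here the single nearest neighbor is `True`, but only 1 of the 3 nearest and 2 of the 5 nearest are, so the estimated share of `True` neighbors falls as K grows.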

# Discussion

By upsampling the minority class, encoding `experience` as an ordered numeric scale, standardizing the predictors, and applying KNN classification, we built a final model. This model predicts whether a player will subscribe to the game's newsletter based on their characteristics: age, played hours, and experience. We calculated its accuracy (77.78%), precision (85.71%), and recall (66.67%) on the test set to check the quality of its predictions. The example above shows the model in use.

With this assignment, we expected to build a model that helps answer our question: can we predict a player’s likelihood of subscribing to the newsletter using personal and in-game characteristics such as age, experience, and played hours? We also wanted the model to have good accuracy and precision, since a poorly fitting model would give unreliable results. Once we had put together the final model and computed its accuracy, precision, and recall scores, we were satisfied with the results.

The first question asked by the researchers was: “What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?”

Connecting back to this question, the model can be used to better understand the game’s target demographic. This can lead to better-targeted advertising toward groups of players defined by age, experience, and played hours. It can also help the researchers customize the game to improve the player experience and attract more people. Together, these changes could increase the number of players subscribing to the newsletter.

Further questions follow from this model and its impact. How can the subscription process be changed to maximize how many people subscribe? How does subscribing to a newsletter relate to engagement across other channels (e.g., social media, in-game events, email promotions)? Understanding whether newsletter subscribers are more likely to engage through other channels could help the researchers target players with a multi-channel approach to increase game visibility and retention, and this could be connected back to individual player characteristics. Another question for the future is where the most common drop-off points lie in the subscription funnel (e.g., from website visit to sign-up, from sign-up to email confirmation). Analyzing where players drop off in the subscription process can improve the user experience and increase conversion rates. Finally, this model helps show how certain player characteristics relate to the likelihood of subscribing, but would the same hold for a paid subscription? How do those players differ from players who opt for a free newsletter?