# DSCI100 101 - Project Final Report

Group 25:

Melody Mokhtari Amirmajdi (88736350), Sophia Boniati Ozi (99642803), Anson Lam (97811442)

# Introduction

In this project, we analyze player data from the PLAI Minecraft research server to understand which player characteristics are most associated with subscribing to the project’s newsletter. Subscription is used as a proxy for player engagement, as it suggests greater interest in contributing to research data collection.  

**Specific question:**  
Can we predict whether a player subscribes to the newsletter using personal and in-game characteristics such as age, experience, and played hours? This predictive question is addressed with classification methods and responds to the researchers' question #1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Our analysis focuses on the `players.csv` dataset. This report includes a descriptive summary of the dataset, exploratory data analysis (EDA) through visualizations, and a K-nearest neighbors classification analysis.

**Data Description:**

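There are 196 observations in the `players.csv` file, each representing an individual player. The dataset contains 9 columns. Two of these columns (`individualId` and `organizationName`) appear empty in the displayed rows and were excluded from this project's analysis. Of the remaining 7 variables, four are stored as objects (categorical or string), one is a float, one is an integer, and one is boolean. The `hashedEmail` identifier was also dropped before analysis, leaving the six variables listed below.  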

**Variables**

- experience (object): player’s experience level in Minecraft (Beginner, Amateur, Regular, Veteran, or Pro)  
- subscribe (boolean): whether the player subscribed to the PLAI newsletter (True or False)  
- played_hours (float): total hours played on the server  
- name (object): player name (not used for modeling)  
- gender (object): gender identification (Male, Female, Other, Prefer not to say, etc.)  
- age (integer): player’s age in years

There are some issues with the data that may cause problems later in the analysis:

- There is a class imbalance: more players are subscribed (144) than unsubscribed (52).  
- Some ages appear unusually high (e.g., 91) for a typical player and may be outliers.  
- `experience` must be converted into a numeric format for modeling.  
- There are no missing values, but some numeric variables may contain mostly zeros, which could reduce their usefulness in modeling.

The data was collected from the PLAI Minecraft server and participant registration system, so the in-game variables considered in this project (such as hours and experience) reflect real and accurate gameplay metrics.
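
The data checks listed above can be sketched in a few lines of pandas. The example below is only an illustration: `toy_players` is a small made-up table (not the real `players.csv`), and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for players.csv (hypothetical values, for illustration only)
toy_players = pd.DataFrame({
    "age": [17, 21, 91, 17, 9, 22],
    "played_hours": [0.0, 0.0, 0.0, 1.5, 30.3, 0.0],
    "subscribe": [True, True, False, True, True, False],
})

# Missing values per column
missing_per_column = toy_players.isna().sum()

# Share of players with zero recorded hours
zero_share = (toy_players["played_hours"] == 0).mean()

# Largest age, to spot possible outliers
oldest_age = toy_players["age"].max()

print(missing_per_column)
print(f"share of zero played_hours: {zero_share:.2f}")
print(f"oldest age: {oldest_age}")
```

Running the same three checks on the real data frame would show whether the concerns about zeros and unusually high ages hold.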

In [1]:
import altair as alt
import numpy as np
import pandas as pd
np.random.seed(4)

from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.model_selection import train_test_split
set_config(transform_output="pandas")

In [2]:
players=pd.read_csv("players.csv")
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


# Methods & Results

In [3]:
dropped_players=players.drop(columns=["individualId","organizationName","hashedEmail"])
dropped_players

Unnamed: 0,experience,subscribe,played_hours,name,gender,age
0,Pro,True,30.3,Morgan,Male,9
1,Veteran,True,3.8,Christian,Male,17
2,Veteran,False,0.0,Blake,Male,17
3,Amateur,True,0.7,Flora,Female,21
4,Regular,True,0.1,Kylie,Male,21
...,...,...,...,...,...,...
191,Amateur,True,0.0,Bailey,Female,17
192,Veteran,False,0.3,Pascal,Male,22
193,Amateur,False,0.0,Dylan,Prefer not to say,17
194,Amateur,False,2.3,Harlow,Male,17


In [19]:
# This chart displays the total played hours for each experience level to highlight how engagement differs among player types.
plot_bar1 = (
    alt.Chart(dropped_players)
    .mark_bar()
    .encode(
        x=alt.X("experience:N", title="Experience Level"),
        y=alt.Y("sum(played_hours):Q", title="Total Played Hours"),
        color=alt.Color("experience:N", title="Experience"),
        tooltip=["experience", alt.Tooltip("sum(played_hours):Q", title="Total Hours")]
    )
    .properties(title="Total Played Hours by Experience Level")
)

# This grouped bar chart compares the average played hours by subscription status across experience levels.
# Bars are placed side by side with xOffset, because stacking averages on top of each other would be misleading.
plot_bar2 = (
    alt.Chart(dropped_players)
    .mark_bar()
    .encode(
        x=alt.X("subscribe:N", title="Subscription Status"),
        xOffset=alt.XOffset("experience:N"),
        y=alt.Y("mean(played_hours):Q", title="Average Played Hours"),
        color=alt.Color("experience:N").title("Experience").scale(scheme="set2"),
        tooltip=[
            "subscribe",
            "experience",
            alt.Tooltip("mean(played_hours):Q", title="Average Hours"),
        ],
    )
    .properties(title="Average Played Hours by Subscription and Experience")
)

# This plot shows the overall class balance between subscribed and non-subscribed players to confirm whether the data is imbalanced.
plot_class = (
    alt.Chart(dropped_players)
    .mark_bar()
    .encode(
        x=alt.X("subscribe:N", title="Subscription Status"),
        y=alt.Y("count():Q", title="Number of Players"),
        color=alt.Color("subscribe:N", legend=None).scale(scheme = "set2"),
        tooltip=["subscribe", alt.Tooltip("count():Q", title="Count")],
    )
    .properties(title="Class Balance: Subscribed vs Not Subscribed")
)

# This chart illustrates the average subscription rate within each experience level.
plot_rate = (
    alt.Chart(dropped_players)
    .mark_bar()
    .encode(
        x=alt.X("experience:N", title="Experience Level"),
        y=alt.Y("mean(subscribe):Q", title="Subscription Rate"),
        color=alt.Color("experience:N", legend=None),
        tooltip=[
            "experience",
            alt.Tooltip("mean(subscribe):Q", title="Subscription Rate"),
        ],
    )
    .properties(title="Subscription Rate by Experience Level")
)


# This scatter plot shows how age relates to total played hours. Color indicates each player's experience level.
plot_scatter = (
    alt.Chart(dropped_players)
    .mark_point(size=80, opacity=0.7)
    .encode(
        x=alt.X("age:Q", title="Age (years)"),
        y=alt.Y("played_hours:Q", title="Played Hours"),
        color=alt.Color("experience:N", title="Experience"),
        tooltip=["age", "played_hours", "experience", "subscribe"]
    )
    .properties(title="Relationship Between Age, Experience, and Played Hours")
)

(plot_bar1 | plot_scatter ) & ( plot_bar2 | plot_class | plot_rate )

In [20]:
dropped_players["subscribe"].value_counts()

subscribe
True     144
False     52
Name: count, dtype: int64
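
As a point of reference for the later accuracy figures, the counts above can be turned into class proportions. The short calculation below is a sketch using only the two counts printed by `value_counts()`; the variable names are hypothetical.

```python
# Counts taken from the value_counts() output above
counts = {True: 144, False: 52}
total = sum(counts.values())

# Proportion of subscribed players in the original data
subscribed_share = counts[True] / total

# Accuracy of a classifier that always predicts the majority class
majority_baseline_accuracy = max(counts.values()) / total

print(f"subscribed share: {subscribed_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

On the original, imbalanced data a model that always predicts "subscribed" would be right about 73.5% of the time, which is why class imbalance matters when reading accuracy.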

In [21]:
# Upsampling (with replacement) so there are equal numbers of subscribed and non-subscribed players.
# Note: because this happens before the train/test split, copies of the same player can end up in both sets.
not_subscribed_players = dropped_players[~dropped_players["subscribe"]]
subscribed_players = dropped_players[dropped_players["subscribe"]]
not_subscribed_scaledup = not_subscribed_players.sample(
    n=subscribed_players.shape[0], replace=True
)
upsampled_players = pd.concat((not_subscribed_scaledup, subscribed_players))
upsampled_players["subscribe"].value_counts()


subscribe
False    144
True     144
Name: count, dtype: int64
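
One alternative ordering, shown here only as a sketch, is to split the data first and upsample the minority class inside the training set alone, so that no test row is a copy of a training row. This is not what the notebook does; `toy_players` and all `toy_*` names are hypothetical, and `random_state=4` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for dropped_players (hypothetical values, not the real data)
toy_players = pd.DataFrame({
    "age": [17, 21, 19, 22, 30, 18, 25, 17, 20, 23, 16, 27],
    "played_hours": [0.0, 1.2, 0.3, 5.0, 0.0, 2.1, 0.7, 0.0, 3.3, 0.1, 9.5, 0.4],
    "subscribe": [True] * 8 + [False] * 4,
})

# 1. Split first, so the test set only contains original, unseen rows
toy_train, toy_test = train_test_split(
    toy_players, train_size=0.75, stratify=toy_players["subscribe"], random_state=4
)

# 2. Upsample the minority class within the training set only
train_subscribed = toy_train[toy_train["subscribe"]]
train_not_subscribed = toy_train[~toy_train["subscribe"]]
train_not_subscribed_up = train_not_subscribed.sample(
    n=len(train_subscribed), replace=True, random_state=4
)
toy_train_balanced = pd.concat([train_not_subscribed_up, train_subscribed])

# No original row index appears in both the balanced training set and the test set
overlap = set(toy_train_balanced.index) & set(toy_test.index)
print(toy_train_balanced["subscribe"].value_counts())
print(f"rows shared between train and test: {len(overlap)}")
```

With this ordering the test set keeps the original class imbalance, so precision and recall become more informative than accuracy alone.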

In [22]:
# Converting the ordered experience categories into integers (1 = least experienced, 5 = most experienced)
experience_order = {
    "Beginner": 1,
    "Amateur": 2,
    "Regular": 3,
    "Veteran": 4,
    "Pro": 5,
}
upsampled_players["experience"] = upsampled_players["experience"].map(experience_order)

upsampled_players

Unnamed: 0,experience,subscribe,played_hours,name,gender,age
47,3,False,0.0,Edmund,Prefer not to say,23
141,3,False,0.0,Cornelius,Male,18
86,2,False,0.0,Ziad,Two-Spirited,23
21,2,False,0.1,Anastasia,Female,17
45,1,False,0.0,Umar,Male,24
...,...,...,...,...,...,...
187,2,True,0.0,Jasper,Male,17
188,1,True,0.0,Lina,Female,17
190,2,True,0.0,Rhys,Male,20
191,2,True,0.0,Bailey,Female,17
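
The same ordinal encoding could also be done with scikit-learn's `OrdinalEncoder`, which can be placed inside a preprocessing pipeline. The sketch below assumes the level ordering used in the cell above; `toy_experience` is a made-up example column. Note that `OrdinalEncoder` produces codes starting at 0 rather than 1, which makes no difference once the column is standardized.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Level order assumed from the mapping in the notebook (least to most experienced)
experience_levels = ["Beginner", "Amateur", "Regular", "Veteran", "Pro"]

# Toy column for illustration (not the real data)
toy_experience = pd.DataFrame(
    {"experience": ["Pro", "Beginner", "Regular", "Amateur", "Veteran"]}
)

# Passing the categories explicitly fixes the order instead of sorting alphabetically
encoder = OrdinalEncoder(categories=[experience_levels])
encoded = np.asarray(encoder.fit_transform(toy_experience)).ravel()
encoded_values = [int(v) for v in encoded]

print(encoded_values)
```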


In [23]:
# A visual using the variable 'gender'. Gender is left as a categorical (nominal) variable because its categories have no natural numeric order.
final_visualization = alt.Chart(upsampled_players, title="Players subscribed vs. played hours by gender").mark_bar().encode(
    x = alt.X("subscribe").title("Players Subscribed"),
    y = alt.Y("played_hours").title("Played Hours"),
    color = alt.Color("gender").title("Gender").scale(scheme = "set2")
)
final_visualization

In [24]:
# Train/test split (75% training), stratified on the response variable
players_train, players_test = train_test_split(
    upsampled_players, train_size=0.75, stratify=upsampled_players["subscribe"]
)

predictors = ["age", "played_hours", "experience"]
X_train = players_train[predictors]
y_train = players_train["subscribe"]
X_test = players_test[predictors]
y_test = players_test["subscribe"]

# Standardize all predictors so each contributes equally to the distance calculation
players_preprocessor = make_column_transformer(
    (StandardScaler(), predictors),
)

# Tune the number of neighbors (K) with 5-fold cross-validation
knn = KNeighborsClassifier()
players_tune_pipe = make_pipeline(players_preprocessor, knn)

parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 40),
}
players_tune_grid = GridSearchCV(
    estimator=players_tune_pipe,
    param_grid=parameter_grid,
    cv=5,
)

In [25]:
# Fit the grid search on the training data
players_tune_grid.fit(X_train, y_train)

# Collect the cross-validation results for each value of K
players_grid = pd.DataFrame(players_tune_grid.cv_results_)

# Plot the cross-validated accuracy estimate against K
accuracy_vs_k = alt.Chart(players_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

accuracy_vs_k

In [26]:
# Best K selected by cross-validation
players_tune_grid.best_params_

{'kneighborsclassifier__n_neighbors': 1}

In [29]:
# Use the tuned model to predict on the held-out test set

final_model = players_tune_grid.best_estimator_

players_pred = final_model.predict(X_test)

from sklearn.metrics import accuracy_score, precision_score, recall_score

test_accuracy = accuracy_score(y_test, players_pred)
test_precision = precision_score(y_test, players_pred)
test_recall = recall_score(y_test, players_pred)

print(test_accuracy*100, test_precision*100, test_recall*100)

#Accuracy score is 79.17%, precision score is 78.38% and recall score is 80.56%

79.16666666666666 78.37837837837837 80.55555555555556
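
The three scores can be cross-checked by hand. Assuming the 288 upsampled rows split into 216 training and 72 test rows with 36 rows per class (which follows from `train_size=0.75` and stratification), and taking "subscribed" as the positive class, the reported values are consistent with 29 true positives, 8 false positives, 7 false negatives, and 28 true negatives. These counts are inferred from the printed metrics, not read from a confusion matrix in the notebook.

```python
# Counts inferred from the reported metrics (assumptions: 36 test rows per class,
# subscribed = positive class); they are not taken from notebook output.
tp, fp, fn, tn = 29, 8, 7, 28

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # correct positives / predicted positives
recall = tp / (tp + fn)                      # correct positives / actual positives

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}")
# prints: accuracy=0.7917, precision=0.7838, recall=0.8056
```

Calling `confusion_matrix(y_test, players_pred)`, which is already imported in the first cell, would confirm or correct these inferred counts.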


In [30]:
# An example prediction for a hypothetical new player (experience 3 corresponds to "Regular")

new_player = pd.DataFrame({
    "age" : [25],
    "played_hours" : [120],
    "experience" : [3],
})

new_player_pred = final_model.predict(new_player)
new_player_pred

array([ True])
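
Since the research question is phrased in terms of a player's likelihood of subscribing, `predict_proba` may be a useful complement to `predict`. The sketch below uses a small made-up training set and a hypothetical `toy_model` with `n_neighbors=3`; it is not the tuned model from the notebook. For K-nearest neighbors the probability is the fraction of the K neighbors in each class, so with K = 1 it can only be 0 or 1.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data (hypothetical values, for illustration only)
toy_X = pd.DataFrame({
    "age": [17, 18, 19, 20, 30, 31, 32, 33],
    "played_hours": [0.0, 0.1, 0.0, 0.2, 10.0, 12.0, 11.0, 9.0],
    "experience": [1, 2, 1, 2, 4, 5, 4, 5],
})
toy_y = pd.Series([False] * 4 + [True] * 4, name="subscribe")

# Standardize the predictors, then classify with 3 nearest neighbors
toy_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
toy_model.fit(toy_X, toy_y)

toy_new_player = pd.DataFrame({"age": [31], "played_hours": [11.5], "experience": [5]})

# Columns of predict_proba follow the order of classes_
toy_proba = toy_model.predict_proba(toy_new_player)
true_column = list(toy_model.classes_).index(True)
proba_subscribed = toy_proba[0, true_column]

print(f"estimated probability of subscribing: {proba_subscribed:.2f}")
```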

# Discussion