# Project Final Report

## Introduction:

Most video games require new players to register and returning players to log in with an email or username and a password. The game PLAICraft is no different: new players register with their email, age, gender, and other details. Because the game is relatively new, its developers are looking for ways to attract more players. To that end, they would like to know the main demographics of their player base, so that they know which types of players play the most and therefore contribute the most to the data that is collected.

For this reason, we use the `players.csv` dataset. It is compiled from the information each player enters when registering on their first login, and it is kept up to date every time the player logs in with the same player name and email.

### Question:

Our group aims to answer:

**Question 1:** We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Because our question does not strictly require a regression or classification model, our exploratory data analysis consists of simple bar graphs, one for each independent variable chosen for our study. We also include a multivariable regression model, which the game developers could use for further, more complex questions and to reach a more conclusive result.

### Data Description:

`players.csv`:
The file has 196 rows and 7 columns of data (2 additional columns provided are empty for all rows). Each cell has one value and there are no missing values. It contains the following information:

| Feature Name | Data Type | Description |
|---|---|---|
| Experience | Categorical | The level of experience the player has playing video games. It can be one of the following values: "Beginner", "Amateur", "Regular", "Veteran", "Pro". |
| Gender | Categorical | The player's gender identity. It can be one of the following values: "Male", "Female", "Two-Spirited", "Non-binary", "Agender", "Other", "Prefer not to say". |
| played_hours | Float | Number of hours played |
| age | Integer | Age of the player |
| name | String | Player's name |
| hashedEmail | String | Hashed email address |
| subscribe | Boolean | Whether or not the player has subscribed to the mailing list (True/False) |
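
The row, column, and missing-value counts described above can be checked directly with pandas. The sketch below is only an illustration: the `toy` frame is made up rather than taken from `players.csv`, and the same calls would apply to the loaded data.

```python
import pandas as pd

# Hypothetical stand-in for players.csv, including one all-empty column
toy = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7],
    "age": [9, 17, 21],
    "individualId": [None, None, None],
})

n_rows, n_cols = toy.shape

# Columns in which every value is missing
empty_cols = [col for col in toy.columns if toy[col].isna().all()]

# Missing values per column, ignoring the all-empty columns
missing_per_col = toy.drop(columns=empty_cols).isna().sum()

print(n_rows, n_cols)
print(empty_cols)
print(missing_per_col)
```

Separating the all-empty columns first keeps them from hiding whether the remaining columns have gaps.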

## Methods & Results:

To answer this question, we use the `players.csv` file, in particular the columns "Experience", "Gender" and "Age", to determine which "kinds" of players contribute a large amount of data ("played_hours"). Exploring the relationship between these 4 variables should help answer the question.

We will use linear regression to determine the "kinds" of players who contribute a lot of data. This method is appropriate because it is simple and interpretable. We will use it to learn and predict the number of hours a player will contribute. Then, based on the coefficients calculated for features such as "Gender," "Age," etc., we will determine which factors lead to greater contribution. 

In [29]:
import pandas as pd
import altair as alt
import numpy as np
from sklearn import set_config
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import KBinsDiscretizer

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

DataTransformerRegistry.enable('vegafusion')

### Data Loading and Cleaning:
To answer the question, we begin by loading and preparing the data.

In [30]:
url = "https://drive.google.com/uc?id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,



In the cell below, we extract the columns of interest and remove the rows where gender is not specified ("Prefer not to say").

In [31]:
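# .copy() gives an independent frame, so later column assignments do not act on a slice of `players`
players_filtered = players.loc[players["gender"]!="Prefer not to say", ["experience", "gender", "age", "played_hours"]].copy()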
players_filtered

Unnamed: 0,experience,gender,age,played_hours
0,Pro,Male,9,30.3
1,Veteran,Male,17,3.8
2,Veteran,Male,17,0.0
3,Amateur,Female,21,0.7
4,Regular,Male,21,0.1
...,...,...,...,...
190,Amateur,Male,20,0.0
191,Amateur,Female,17,0.0
192,Veteran,Male,22,0.3
194,Amateur,Male,17,2.3


### Exploratory data analysis:

To perform some exploratory data analysis, let us observe the impact of the 3 columns of interest ("Experience", "Gender" and "Age") on the target value ("played_hours"). Additionally, let us look at some summary statistics.

In [32]:
players_filtered.describe()

Unnamed: 0,age,played_hours
count,185.0,185.0
mean,21.275676,6.171351
std,9.915661,29.15973
min,8.0,0.0
25%,17.0,0.0
50%,19.0,0.1
75%,22.0,0.6
max,99.0,223.1


In [33]:
plot_age = alt.Chart(players_filtered, title="Figure 1: Age vs. Hours Played").mark_bar().encode(
    x = alt.X("age").title("Age of player in years"),
    y = alt.Y("played_hours").title("Hours played")
)
plot_age

From the graph above, it appears that most of the hours played come from players in the teenage to young-adult range (15-25 years).

In [34]:
plot_exp = alt.Chart(players_filtered, title="Figure 2: Experience vs. Hours Played").mark_bar().encode(
    x = alt.X("experience").title("Experience of players with video games"),
    y = alt.Y("played_hours").title("Hours played")
)
plot_exp

From the visualization above, it appears that players with "Amateur" and "Regular" experience play the most.

In [35]:
plot_gender = alt.Chart(players_filtered, title="Figure 3: Gender vs. Hours Played").mark_bar().encode(
    x = alt.X("gender").title("Gender of player"),
    y = alt.Y("played_hours").title("Hours played")
)
plot_gender

The plot shows that players who identify as "Male", "Female" and "Non-binary" appear to play the most.
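
Bar heights in Figures 1 to 3 are best read with group sizes in mind. Assuming the bars add up the hours of every player in a category (which, as far as we can tell, is how Altair draws unaggregated bar charts), a single heavy player can dominate a bar. The sketch below uses made-up data to show how a per-group summary separates totals from typical values.

```python
import pandas as pd

# Hypothetical data: one heavy player in group "A", several light players in group "B"
toy = pd.DataFrame({
    "gender": ["A", "A", "B", "B", "B"],
    "played_hours": [100.0, 0.0, 1.0, 2.0, 3.0],
})

# Count, total, mean and median hours per group
summary = (
    toy.groupby("gender")["played_hours"]
    .agg(["count", "sum", "mean", "median"])
)
print(summary)
```

In this toy example group "A" has by far the larger total, yet half of its players logged no hours at all, which the total alone would not reveal.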

### Converting "Experience" values into numerical values

In [36]:
players_filtered["experience"] = players_filtered["experience"].replace({
    "Beginner" : 0,
    "Amateur" : 1, 
    "Regular" : 2, 
    "Pro": 3, 
    "Veteran": 4,
})
players_filtered


Unnamed: 0,experience,gender,age,played_hours
0,3,Male,9,30.3
1,4,Male,17,3.8
2,4,Male,17,0.0
3,1,Female,21,0.7
4,2,Male,21,0.1
...,...,...,...,...
190,1,Male,20,0.0
191,1,Female,17,0.0
192,4,Male,22,0.3
194,1,Male,17,2.3


Above, we numerically encode the "Experience" column to reflect increasing levels of expertise. We needed to convert the original values into numbers because the predictive model we use (LinearRegression) only accepts numerical data.
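
As an alternative sketch of the same encoding, `Series.map` can be used instead of `replace`. The mapping below is copied from the cell above, while the `toy` series is a made-up sample; `map` returns a missing value for any label absent from the mapping, which makes unexpected categories easy to catch.

```python
import pandas as pd

# Mapping copied from the cell above
experience_order = {"Beginner": 0, "Amateur": 1, "Regular": 2, "Pro": 3, "Veteran": 4}

# Hypothetical sample of the "experience" column
toy = pd.Series(["Pro", "Veteran", "Amateur", "Beginner"], name="experience")

# Labels missing from the mapping become NaN, so we can check for them explicitly
encoded = toy.map(experience_order)
if encoded.isna().any():
    raise ValueError("unmapped experience label found")

encoded = encoded.astype(int)
print(encoded.tolist())
```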

### Splitting the data:
We then split the data so 25% of the dataset is used for testing. We are using a smaller test size because we have a relatively small dataset and we need to ensure we have enough data points for both learning and testing.

In [37]:
players_training, players_testing = train_test_split(
    players_filtered, test_size=0.25, random_state=2024
)
X_train_play = players_training[["experience", "gender", "age"]]
y_train_play = players_training["played_hours"]

X_test_play = players_testing[["experience", "gender", "age"]]
y_test_play = players_testing["played_hours"]

### Preprocessing:
We have already encoded the categorical column "Experience" as a numerical column representing experience level. Now we will process the columns "Age" and "Gender".

The column "Gender" is also a categorical column, but unlike "Experience" its values have no meaningful order. Hence, we use OneHotEncoder to transform this data so we have a binary column for each gender type, such as "Agender", "Female", etc. (where 0 would suggest the individual does not identify as such and 1 would mean they do). By creating separate binary columns for each gender category, the model can learn the unique contribution of each gender to the outcome.

While we could have simply scaled "Age", we use a KBinsDiscretizer because it allows us to capture non-linear relationships between age and the target variable. By dividing the age range into 4 discrete bins, we can model different age groups as separate categories, potentially revealing patterns that linear scaling might miss. For example, linear scaling treats the difference between 20 and 30 years the same as the difference between 60 and 70 years, while binning can account for the potentially different impact of these age ranges on the target variable.

In [38]:
play_preprocessor = make_column_transformer(
    (KBinsDiscretizer(n_bins=4), ["age"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["gender"]),
    remainder="passthrough",
    verbose_feature_names_out=False
)
play_preprocessor

players_scaled = play_preprocessor.fit_transform(X_train_play)
feature_names = play_preprocessor.get_feature_names_out()
X_train_enc_df = pd.DataFrame(players_scaled, columns = feature_names)
X_train_enc_df

Unnamed: 0,age_0.0,age_1.0,age_2.0,age_3.0,gender_Agender,gender_Female,gender_Male,gender_Non-binary,gender_Other,gender_Two-Spirited,experience
0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
133,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4.0
134,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,4.0
135,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4.0
136,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


In [39]:
print(play_preprocessor.transformers_[0][1].bin_edges_)

[array([ 8., 17., 19., 22., 99.])]


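As seen above, the KBinsDiscretizer divides the column "Age" into 4 binary columns, one per bin. Each bin includes its lower edge and excludes its upper edge, except the last bin, which includes both. The 4 ranges are:
- **age_0.0:** ages 8-17 (17 excluded)
- **age_1.0:** ages 17-19 (19 excluded)
- **age_2.0:** ages 19-22 (22 excluded)
- **age_3.0:** ages 22-99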
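
To make the bin boundaries concrete, the sketch below assigns a few ages to bins using the edges printed above. It is a NumPy approximation of what we understand KBinsDiscretizer to do (the helper `age_bin` is our own illustrative function, not part of scikit-learn): an age equal to an interior edge goes to the higher bin.

```python
import numpy as np

# Bin edges printed by the cell above
edges = np.array([8.0, 17.0, 19.0, 22.0, 99.0])

def age_bin(ages):
    """Return the bin index (0 to 3) for each age."""
    ages = np.asarray(ages, dtype=float)
    # Only the interior edges are compared, so the top edge (99) stays in the last bin
    return np.searchsorted(edges[1:-1], ages, side="right")

bins = age_bin([8, 16, 17, 19, 21, 22, 99])
print(bins)
```

Under this rule a 17-year-old belongs to `age_1.0`, not `age_0.0`, which matters because 17 is a very common age in the dataset.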

In [40]:
pipe = make_pipeline(play_preprocessor, LinearRegression())
lm_fit = pipe.fit(X_train_play, y_train_play)

In [41]:
coefs_df = pd.DataFrame(pipe[1].coef_, feature_names)
print(pipe[1].intercept_)
coefs_df

13.955145639071802


Unnamed: 0,0
age_0.0,9.973386
age_1.0,-8.42928
age_2.0,5.390737
age_3.0,-6.934844
gender_Agender,-1.959117
gender_Female,4.407071
gender_Male,-6.526405
gender_Non-binary,11.782122
gender_Other,-3.08982
gender_Two-Spirited,-4.613851
experience,-1.243494


As seen from the coefficients above, the factors that most increase the predicted hours played are `gender_Non-binary` and `age_0.0` (ages 8-17). Conversely, the factors that most decrease the predicted hours played are `age_1.0` (ages 17-19), `age_3.0` (ages 22-99) and `gender_Male`.
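
To make the role of the coefficients concrete, the sketch below rebuilds one prediction by hand for a 17-year-old female player with "Beginner" experience (encoded as 0). It assumes that a prediction is the intercept plus the coefficients of the active binary columns, and that age 17 falls in the `age_1.0` bin. The result, about 9.93 hours, agrees with the value the fitted pipeline reports for such players in the training-set predictions later in this section.

```python
# Values copied from the fitted model output above
intercept = 13.955145639071802
coef_age_1 = -8.42928          # age_1.0, the bin assumed to contain age 17
coef_gender_female = 4.407071  # gender_Female

# With experience encoded as 0, the experience term contributes nothing,
# whatever its coefficient is
experience_term = 0.0

predicted_hours = intercept + coef_age_1 + coef_gender_female + experience_term
print(round(predicted_hours, 4))
```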

In [42]:
# Train score
play_predictions = pipe.predict(X_train_play)
lm_rmse = mean_squared_error(y_train_play, play_predictions)**(1/2)
lm_rmse

np.float64(25.903829982467414)

In [43]:
# Test score
play_predictions = pipe.predict(X_test_play)
lm_rmspe = mean_squared_error(y_test_play, play_predictions)**(1/2)
lm_rmspe

np.float64(35.397381001383756)

In [44]:
#Training set 
train_pred = players_training.assign(predictions=pipe.predict(X_train_play))
train_pred

Unnamed: 0,experience,gender,age,played_hours,predictions
84,0,Female,17,0.0,9.932937
82,0,Female,37,0.2,11.427373
181,1,Female,22,0.8,10.183879
172,4,Agender,20,0.0,12.412790
76,1,Female,21,3.5,22.509460
...,...,...,...,...,...
38,4,Male,17,0.0,-5.974515
27,4,Male,23,0.0,-4.480079
137,4,Male,17,0.0,-5.974515
101,1,Female,17,0.0,8.689443


In [45]:
plot_training_preds = alt.Chart(train_pred, title="Figure 4: Actual vs. predicted hours played in training set").mark_point().encode(
    x = alt.X("played_hours").title("Actual hours played"),
    y = alt.Y("predictions").title("Predicted hours played")
)
plot_training_preds

In [46]:
#Testing set 
test_pred = players_testing.assign(predictions=pipe.predict(X_test_play))
test_pred

Unnamed: 0,experience,gender,age,played_hours,predictions
186,4,Female,44,0.1,6.453397
128,1,Female,17,0.0,8.689443
80,4,Female,17,0.0,4.958961
142,0,Female,17,1.0,9.932937
162,3,Male,19,0.6,9.088996
39,1,Male,17,0.0,-2.244033
129,1,Two-Spirited,17,0.0,-0.331479
153,0,Male,17,0.1,-1.000539
30,1,Male,23,0.1,-0.749597
123,0,Male,17,7.1,-1.000539


In [47]:
plot_test_preds = alt.Chart(test_pred, title="Figure 5: Actual vs. predicted hours played in test set").mark_point().encode(
    x = alt.X("played_hours").title("Actual hours played"),
    y = alt.Y("predictions").title("Predicted hours played")
)
plot_test_preds

As observed from the visualizations and RMSE scores above (about 26 hours on the training set and 35 hours on the test set), our model could perform better: on unseen data, its predictions are typically off by about 35 hours. While techniques such as cross-validation are useful for tuning hyperparameter values, the plain linear regression model in scikit-learn exposes no tuning hyperparameter comparable to, for example, the number of neighbors in a K-nearest neighbors model. Therefore, cross-validation cannot be used to tune this model toward better results.
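
One way to put these RMSE values in context is to compare them with a simple baseline that always predicts the mean of the training targets. The sketch below uses made-up hours, not our actual split, and the helper `rmse` is our own illustrative function; it computes the same quantity as the square root of `mean_squared_error`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical hours played
y_train = [0.0, 0.1, 0.6, 30.3]
y_test = [0.0, 2.3, 7.1]

# Baseline: predict the training mean for every test player
baseline_pred = np.full(len(y_test), np.mean(y_train))
print(rmse(y_test, baseline_pred))
```

If a model's test RMSE is not clearly below that of such a baseline, its features add little predictive value.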

While other models, such as `KNeighborsRegressor`, can help us predict "played_hours", they do not help us answer the question "what kinds of players contribute more?" This is because `KNeighborsRegressor` predicts the target by averaging the values of neighboring points, rather than providing insight into the factors driving its predictions. Hence, the tools currently in our toolbox are not sufficient to improve on these results. As future work, we would need to explore additional techniques or more advanced machine learning models to gain deeper insights.

## Discussion:

Based on the coefficients obtained from the linear regression, the factors that most increase the predicted hours played are non-binary gender and ages 8-17. Conversely, the factors that most decrease the predictions are ages 17-19, ages 22-99 and male gender. Ages 19-22 and female gender also increased the predictions, but to a lesser extent than the other positive factors. Similarly, agender, two-spirited and other genders, as well as experience, had a small negative impact on the predictions.

These results are partly in line with our expectations, since we expected younger players (teens and young adults) to contribute more played hours. However, we did not expect gender to have much of an impact, contrary to what the model suggests. Finally, we expected experience to have a positive weight, meaning that players with more expertise play more, but we observed the opposite.

The impact of these findings is that we can now target specific groups of players when recruiting for richer playing data. Based on our results, these players would be non-binary or female and in the age range 8-17 or 19-22.

To delve deeper into these findings, future research could explore the following:

- The impact of specific game genres and platforms on player behavior within different demographic groups.
- The underlying reasons for the negative correlation between experience and playtime.
- The role of social and cultural factors in shaping gaming preferences and habits.
- The robustness of these findings when using alternative modeling techniques.

By addressing these questions, we can gain a more comprehensive understanding of player behavior and develop targeted strategies to optimize user engagement.