# DSCI100 Group Project #

In [1]:
import pandas as pd
import numpy as np
import altair as alt
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

## Introduction ##

### (1) Background Information: ###
Minecraft, a widely popular 3D sandbox video game, is the focus of a research study at UBC that aims to predict overall video game usage patterns. The study seeks to identify the kinds of players most likely to contribute substantial amounts of gameplay data. To address this question, we analyse key predictors such as experience, gender, and age in relation to total hours played.

The dataset used for this exploration is players.csv. It contains nine variables: "experience", "subscribe", "hashedEmail", "played_hours", "name", "gender", "age", "individualId", and "organizationName".

 - experience: A string that describes how much experience a player has had with Minecraft, categorized as Amateur, Beginner, Regular, Pro, or Veteran. 
 - subscribe: A boolean that describes whether the player is subscribed to the mail or not. 
 - hashedEmail: The player's email hashed to conceal their email information. 
 - played_hours: The total hours this player has spent playing on the server. 
 - name: The given name of the player based on the options on the website upon registration. 
 - gender: The gender of the player, with categories including Male, Female, Non-binary, Agender, Two-Spirited, and Prefer not to say. 
 - age: The age of the player. 
 - individualId: The id of the player, which is all NaN for now. 
 - organizationName: The organization name of the player, which is all NaN for now. 

However, one clear limitation is that the dataset records nothing about what a player actually contributed beyond the hours spent in the game, so our analysis rests entirely on total play time. We nonetheless consider it reasonable to assume that the quantity of data obtained from a player correlates positively with hours played.

In this project, we seek to determine which age group contributes the most hours of gameplay based on the available data. Identifying an age or range of ages will allow study recruitment efforts to be targeted at age-specific locations, such as schools, colleges, community centers, or seniors' homes. 

## Methods & Results: ##

In [2]:
#load the players data
players = pd.read_csv("data/players.csv")
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


### Step 1: Data Wrangling ###

We first drop the columns that won't help us during data modeling: "hashedEmail", "subscribe", "name", "individualId", and "organizationName". After dropping these columns, we manually remove a few outliers that we consider noise in the data. Rows where the recorded age is over 80 are dropped, since it is highly improbable that someone older than 80 is contributing data to PlaiCraft.

In [3]:
#Step 1
players_dropped = players.drop(["hashedEmail", "subscribe", "name", "individualId", "organizationName"], axis = 1)
players_clean = players_dropped[players_dropped["age"] <= 80]
players_clean

Unnamed: 0,experience,played_hours,gender,age
0,Pro,30.3,Male,9
1,Veteran,3.8,Male,17
2,Veteran,0.0,Male,17
3,Amateur,0.7,Female,21
4,Regular,0.1,Male,21
...,...,...,...,...
190,Amateur,0.0,Male,20
191,Amateur,0.0,Female,17
192,Veteran,0.3,Male,22
193,Amateur,0.0,Prefer not to say,17
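One detail of the filter above is easy to miss: a comparison such as `age <= 80` evaluates to False for missing values, so any rows with a missing age would be dropped along with the rows over 80. A minimal sketch on a hypothetical toy frame (the values below are made up for illustration and are not taken from players.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one plausible age, one implausible age, one missing age.
toy = pd.DataFrame({
    "played_hours": [1.5, 0.2, 3.0],
    "age": [17, 91, np.nan],
})

# NaN <= 80 is False, so the missing-age row is filtered out as well.
toy_clean = toy[toy["age"] <= 80]

# Counting both cases beforehand makes the effect of the filter explicit.
n_missing_age = int(toy["age"].isna().sum())
n_over_80 = int((toy["age"] > 80).sum())
```

Running the same two counts on the real data before filtering would show how many rows are removed for each reason.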


### Step 2: Data Transformation ###

For the second step of our data wrangling, we need to deal with the categorical data. The models we use work on numeric values, so columns that hold strings, like "experience", cannot be used directly. However, we can use one-hot encoding to transform each unique value in a column into its own column, which is set to 1 in the rows where the original column held that value and to 0 otherwise. We need to apply this one-hot encoding to "experience" and "gender".

In [4]:
#Step 2
players_one_hot_exp = players_clean.pivot(columns="experience", values="experience")    
players_one_hot_gen = players_clean.pivot(columns="gender", values="gender")  

players_one_hot_exp = players_one_hot_exp.fillna(0)
players_one_hot_gen = players_one_hot_gen.fillna(0)

players_one_hot_exp[players_one_hot_exp != 0] = 1
players_one_hot_gen[players_one_hot_gen != 0] = 1

players_wrangled = pd.concat([players_clean.drop(["experience", "gender"], axis=1),players_one_hot_exp,players_one_hot_gen], axis=1)
players_wrangled

Unnamed: 0,played_hours,age,Amateur,Beginner,Pro,Regular,Veteran,Agender,Female,Male,Non-binary,Prefer not to say,Two-Spirited
0,30.3,9,0,0,1,0,0,0,0,1,0,0,0
1,3.8,17,0,0,0,0,1,0,0,1,0,0,0
2,0.0,17,0,0,0,0,1,0,0,1,0,0,0
3,0.7,21,1,0,0,0,0,0,1,0,0,0,0
4,0.1,21,0,0,0,1,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,0.0,20,1,0,0,0,0,0,0,1,0,0,0
191,0.0,17,1,0,0,0,0,0,1,0,0,0,0
192,0.3,22,0,0,0,0,1,0,0,1,0,0,0
193,0.0,17,1,0,0,0,0,0,0,0,0,1,0
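The pivot-based encoding above works, but pandas also offers `pd.get_dummies`, which produces the same kind of 0/1 columns in a single call. The sketch below is an alternative, not the method used in this project; it uses the first five displayed rows of `players_clean`, and the names `sample` and `sample_wrangled` are introduced here for illustration only.

```python
import pandas as pd

# First five displayed rows of players_clean.
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular"],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
    "gender": ["Male", "Male", "Male", "Female", "Male"],
    "age": [9, 17, 17, 21, 21],
})

# An empty prefix and separator keep the bare category names as column names;
# dtype=int gives 0/1 integers instead of booleans.
sample_wrangled = pd.get_dummies(
    sample,
    columns=["experience", "gender"],
    prefix="",
    prefix_sep="",
    dtype=int,
)
```

Only categories present in the data receive a column, so on this small sample there is no "Beginner" or "Non-binary" column.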


### Step 3: Data Visualization ###

Now that our data has been properly cleaned, we can perform the data analysis. We want to compare age and experience against played hours, so we make a few simple visualizations of how played hours relate to our predictor variables. 

In [46]:
age_time = alt.Chart(players_wrangled).mark_circle(opacity = 0.7).encode(
    x = alt.X("age").title("The age of the player (years)"),
    y = alt.Y("played_hours").title("Hours spent playing Minecraft(hours)"),
)
age_time

Figure 1. Ages of individual players aged 80 or under and how long they spent playing Minecraft in hours. There are some clear outliers who logged over 100 hours in the game.

One issue with this visualization is that the y-axis is scaled to the largest values, which makes it impossible to see the typical play times of roughly 1 to 5 hours. We can visualize these more clearly by filtering out all play times greater than 10 hours. However, our modeling will use the entire dataset, including observations with play times greater than 10 hours.

In [6]:
filtered_age_time = alt.Chart(players_wrangled[players_wrangled["played_hours"] <= 10]).mark_circle(opacity = 0.7).encode(
    x = alt.X("age").title("The age of the player (years)"),
    y = alt.Y("played_hours").title("Hours spent playing Minecraft(hours)"),
)
filtered_age_time

Figure 2. Ages of individual players aged 80 or under and how long they spent playing Minecraft in hours. Individuals who logged more than 10 hours in the game were omitted from this plot.

After filtering out the players with large play times, the graph is much clearer. We can see that most of the data points fall between ages 15 and 28, with 0 to 2 hours of total play time. 

Now we need to graph the experience of players against the total played hours. Here we go back to our original data set, before cleaning and one-hot encoding, since it is much easier and clearer to plot a single column of categorical data on a bar chart than to plot five columns of ones and zeros against the total played hours.

In [7]:
experience_time = alt.Chart(players).mark_bar().encode(
    x = alt.X("experience").title("The experience level"),
    y = alt.Y("played_hours").title("Hours spent playing Minecraft(hours)"),
    color = alt.Color("experience").title("Experience")
)
experience_time

Figure 3. Total hours of gameplay contributed by players according to their self-reported experience levels. Regulars contributed the most hours followed by Amateurs. The rest only contributed a fraction of what Amateurs and Regulars contributed.
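The totals behind a chart like Figure 3 can also be tabulated directly. Because a group total can be driven by a handful of heavy players, it may help to look at the player count and median alongside the sum. The following is a minimal sketch using only the first five displayed rows of the players table; `sample` and `hours_by_experience` are names introduced for illustration.

```python
import pandas as pd

# First five displayed rows of the players table (two columns only).
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular"],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
})

# Total, count, and median of played hours for each experience level,
# sorted so the largest total comes first.
hours_by_experience = (
    sample.groupby("experience")["played_hours"]
    .agg(total_hours="sum", n_players="count", median_hours="median")
    .sort_values("total_hours", ascending=False)
)
```

On the full dataset, the same call would show whether the ranking in Figure 3 reflects many moderate players or a few very active ones.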

### Step 4: Data Modeling ###

The regression model we have decided to use is KNN regression. 

The code below runs 5-fold cross-validation over values of the n_neighbors hyperparameter from 1 to 30 and outputs a data frame containing the cross-validation scores, expressed as negative root mean squared error. The predictors are first standardized so that each one contributes comparably to the distance calculation. KNN is sensitive to the scale of its inputs: without standardization, variables with large numeric ranges (such as age) would dominate the 0/1 one-hot columns.

In [8]:
player_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

player_training, player_testing = train_test_split(
   players_wrangled, test_size=0.2, random_state=200,
)

X_train = player_training.drop(["played_hours"], axis = 1)
y_train = player_training["played_hours"]

X_test = player_testing.drop(["played_hours"], axis = 1)
y_test = player_testing["played_hours"]

param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 31, 1),
}
player_tuned = GridSearchCV(player_pipe, param_grid, cv=5, n_jobs=-1, scoring="neg_root_mean_squared_error")
player_results = pd.DataFrame(player_tuned.fit(X_train, y_train).cv_results_) 
player_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003177,0.000553,0.001701,0.000151,1,{'kneighborsregressor__n_neighbors': 1},-70.295147,-42.565119,-40.317862,-56.395799,-40.748196,-50.064425,11.725149,30
1,0.002723,3.2e-05,0.001563,1.2e-05,2,{'kneighborsregressor__n_neighbors': 2},-40.296274,-40.89283,-20.192876,-45.622678,-39.576463,-37.316224,8.820291,29
2,0.002706,5e-05,0.003987,0.004871,3,{'kneighborsregressor__n_neighbors': 3},-28.735725,-40.206983,-17.494757,-43.706491,-39.360621,-33.900915,9.60797,28
3,0.019033,0.032618,0.001628,0.000114,4,{'kneighborsregressor__n_neighbors': 4},-21.834626,-41.364621,-14.185231,-43.221738,-41.836146,-32.488472,12.082062,27
4,0.002818,0.000254,0.001569,8e-06,5,{'kneighborsregressor__n_neighbors': 5},-20.536214,-40.952818,-11.400432,-43.752198,-41.72249,-31.67283,13.175882,26
5,0.002728,9.8e-05,0.001558,1.8e-05,6,{'kneighborsregressor__n_neighbors': 6},-18.419364,-40.774141,-13.439457,-43.464794,-40.987555,-31.417062,12.778353,24
6,0.002684,4.2e-05,0.001587,5.1e-05,7,{'kneighborsregressor__n_neighbors': 7},-18.023772,-40.65581,-13.003141,-45.486593,-40.611373,-31.556138,13.313133,25
7,0.002696,5.7e-05,0.001576,3e-05,8,{'kneighborsregressor__n_neighbors': 8},-18.789092,-40.658708,-10.232113,-41.758505,-40.279093,-30.343502,13.216612,23
8,0.002684,3.3e-05,0.001579,1.8e-05,9,{'kneighborsregressor__n_neighbors': 9},-18.346605,-41.15176,-10.159205,-41.510407,-40.224185,-30.278433,13.345084,22
9,0.00277,6.8e-05,0.00166,7.1e-05,10,{'kneighborsregressor__n_neighbors': 10},-17.459483,-41.168782,-9.25232,-41.467823,-40.393237,-29.948329,13.798474,21
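To illustrate why the pipeline standardizes the predictors before KNN, the sketch below compares Euclidean distances before and after scaling. The three rows are hypothetical (an age column and a single 0/1 indicator), chosen only to show the effect.

```python
import numpy as np

# Hypothetical rows: [age in years, 0/1 indicator column].
X = np.array([
    [17.0, 0.0],
    [18.0, 1.0],   # close in age, different category
    [40.0, 0.0],   # far in age, same category
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On the raw scale, age differences swamp the indicator column.
raw_ratio = euclidean(X[0], X[2]) / euclidean(X[0], X[1])

# Standardize each column to mean 0 and standard deviation 1,
# which is what StandardScaler does.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# After scaling, the two distances are of comparable size.
scaled_ratio = euclidean(X_scaled[0], X_scaled[2]) / euclidean(X_scaled[0], X_scaled[1])
```

On the raw scale the 23-year age gap is more than ten times the distance created by a category change, whereas after scaling both differences carry similar weight.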


### Step 5: Model Visualization ###

We want to store the different k values and their respective test score in a data frame that can be visualized.

In [16]:
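# mean_test_score is the negative RMSE, so flip the sign to get a positive error
cv_result = pd.DataFrame({"k": player_results["param_kneighborsregressor__n_neighbors"], "rmse" : -1 * player_results["mean_test_score"]})

# plot only k > 10; smaller k values have much larger errors and would compress the y-axis
cv_plot = alt.Chart(cv_result[cv_result['k'] > 10]).mark_line(point=True).encode(
    x = alt.X("k").title("K values").scale(zero=False, domain=[10, 30]),
    y = alt.Y("rmse").title("Mean cross-validation RMSE (hours)").scale(zero=False)
)

# the best k is the one with the smallest mean cross-validation RMSE
best_val = cv_result.sort_values("rmse")
print("best K value: ", best_val.head(1)["k"].values[0])
cv_plot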

best K value:  29


Figure 4. Cross-validation results for values of the n_neighbors hyperparameter from 1 to 30. K=29 has the lowest mean cross-validation error (RMSE), meaning it performed best on the validation folds drawn from the training data. K values of 10 or less were omitted from the above plot as their errors were much worse than the ones shown. 
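A fitted `GridSearchCV` object also exposes the selected hyperparameter and a refitted copy of the whole pipeline, so the best k does not have to be read off a sorted table. The sketch below shows this on synthetic stand-in data; `X_demo`, `y_demo`, and the other `demo_*` names are hypothetical and unrelated to the players data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the players data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 50, size=(60, 2))
y_demo = 0.1 * X_demo[:, 0] + rng.normal(0, 1, size=60)

demo_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
demo_grid = GridSearchCV(
    demo_pipe,
    {"kneighborsregressor__n_neighbors": range(1, 11)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_grid.fit(X_demo, y_demo)

# The k with the best mean cross-validation score, and that score as a positive RMSE.
best_k = demo_grid.best_params_["kneighborsregressor__n_neighbors"]
best_cv_rmse = -demo_grid.best_score_

# With the default refit=True, predict() uses the full pipeline
# (scaler plus KNN) refitted on all of the data passed to fit().
demo_preds = demo_grid.predict(X_demo[:5])
```

Predicting through the fitted search object keeps the standardization step in place, so test-set predictions are made on the same scale that was used during tuning.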

The next step is to refit our model with k = 29 and evaluate it on our test data. Then we'll make a visualization of the KNN regression predictions against the data.

In [47]:
# Set up KNN and have k = 29
knn = KNeighborsRegressor(n_neighbors = 29)

model = knn.fit(X_train, y_train)
y_hat = knn.predict(X_test)

RMSE = np.sqrt(((np.sum(y_hat - y_test)) ** 2) / len(y_hat))
RMSE

np.float64(27.58345490804527)
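For reference, the root mean squared prediction error is conventionally computed by squaring each residual, averaging, and then taking the square root. The expression in the cell above instead squares the sum of the residuals, which lets positive and negative errors cancel, so the two quantities generally differ. A minimal sketch on hypothetical values (these are not model output):

```python
import numpy as np

# Hypothetical observed and predicted hours (not model output).
y_true = np.array([0.0, 4.0, 1.0, 3.0])
y_pred = np.array([2.0, 2.0, 2.0, 2.0])

residuals = y_pred - y_true

# RMSPE: square each residual, average, then take the square root.
rmspe = float(np.sqrt(np.mean(residuals ** 2)))

# Squaring the sum of residuals lets positive and negative errors cancel.
square_of_sum = float(np.sqrt(np.sum(residuals) ** 2 / len(residuals)))
```

Here the residuals sum to zero, so the second expression reports no error at all even though every prediction is off. The `mean_squared_error` function imported at the top of the notebook, followed by a square root, gives the same result as the first expression.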

Below, the model makes predictions on the testing data. These predictions are then plotted on top of the entire dataset, with the omission of players that played more than 60 hours to clearly illustrate the relationship between the testing data and hours played. 

In [62]:
players_preds = X_test.assign(predictions = knn.predict(X_test))


knn_plot = alt.Chart(players_preds).mark_line(color="black").encode(
    x="age",
    y="predictions", 
)

knn_plot

age_time_clean = alt.Chart(players_wrangled[players_wrangled["played_hours"]<60]).mark_circle(opacity = 0.7).encode(
    x = alt.X("age").title("The age of the player (years)"),
    y = alt.Y("played_hours").title("Hours spent playing Minecraft(hours)"),
)

big_viz = alt.layer(age_time_clean, knn_plot)
big_viz


Figure 5. Predictions of the model on the testing data plotted as a black line over the initial dataset. Observations with play time exceeding 60 hours were omitted from the plot so as to better visualize the relationship proposed by the model.

## Discussion ##

In summary, we found that KNN regression with 29 neighbours gave the lowest cross-validation error of the values of k we tried. The error we computed on the test set for k = 29 is 27.58. Taken at face value, that means that given a test observation with a player's experience, gender, and age, our prediction of total played hours is typically off by about 27.58 hours. This figure should be read with caution, however: the expression used to compute it squares the sum of the residuals rather than summing the squared residuals, and the final model was fit without the standardization step used during cross-validation, so it is not directly comparable to the cross-validation RMSE. One difficulty with RMSE in general is that it is hard to tell from the value alone whether a model predicts well or badly. The only thing we know for certain is that k = 29 gave the best model relative to all the other models we tested. Notably, the reported error is greater than the range of play times predicted by the model, but it is hard to draw firm conclusions from this.

We chose KNN regression because we expected the data to form clusters, which would suit a nearest-neighbours method. Specifically, the group made an educated assumption that players would cluster by age and experience, in which case KNN would predict better than a basic linear regression. With regard to age specifically, there is no clear linear relationship between age and hours played, making linear regression a poor choice when age is a predictor. Additionally, K-means clustering and KNN classification aren't good choices, as we are predicting a quantitative variable (hours played) and not a category. 
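This kind of model choice can be checked empirically by cross-validating both models on the same data. The sketch below does so on synthetic data with a peaked, non-linear relationship between a single predictor and the response. The shape of the relationship, the noise level, and all variable names are assumptions made for illustration, so the numbers say nothing about the players data itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: hours peak around one age and fall off on either side.
rng = np.random.default_rng(1)
age = rng.uniform(10, 50, size=200)
hours = 20 * np.exp(-((age - 19) / 4) ** 2) + rng.normal(0, 1, size=200)
X_syn = age.reshape(-1, 1)

def cv_rmse(model):
    """Mean 5-fold cross-validation RMSE for a model on the synthetic data."""
    scores = cross_validate(
        model, X_syn, hours, cv=5, scoring="neg_root_mean_squared_error"
    )
    return float(-scores["test_score"].mean())

knn_rmse = cv_rmse(make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)))
lin_rmse = cv_rmse(make_pipeline(StandardScaler(), LinearRegression()))
```

On data shaped like this, KNN attains a lower cross-validation error than a straight line. Running the same comparison on `X_train` and `y_train` would show whether that also holds for the real predictors.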

Based on our initial visualization of the data, it was clear to us that many of the "whales" contributing large amounts of play time were in their late teens. Our model agreed, suggesting that the age contributing the most play time is around 19 years. 

With this model, there is a possibility that we can predict future data collected by the PlaiCraft team. However, because k was chosen to minimize the error on this particular dataset (optimization bias), the model may not work as well on newer data, even though cross-validation indicated a lower error.

Due to time constraints, we could not try other possible models to predict the data. There are many ways of transforming the data that might make linear regression a better choice than KNN regression. Examples include linear regression with different kinds of regularization, or non-linear regression. However, these are beyond the scope of the course and were not used in the project. Another question for future work is how additional variables would affect the results. Could we use player inputs, voice, or progression speed to get a better prediction of total played hours?



## Conclusion ##

Overall, based on our findings, "Amateur" and "Regular" players and 19-year-olds contributed the most play time to the study. Additionally, 10- to 13-year-olds and 17- to 21-year-olds contributed large amounts of play time as well. This makes middle schools and college campuses excellent targets for recruitment efforts. However, the KNN regression model may not have been the best choice for this analysis, and different methods could serve to elucidate the relationship between play time and the rest of the variables in the dataset.