# Final Report


Group members: Juliane Lou (30661920),

# Introduction

In order to collect data about how people play video games, a research group in Computer Science at UBC set up a Minecraft server to monitor players' actions. To help run the project smoothly, they have provided us with two data files. Our group will analyze player information in the players.csv file to answer Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Our specific, predictive question is: Can we predict a player's subscription status with "experience", "played_hours" and "age"? If yes, which combination of explanatory variables gives the most accurate prediction for the target variable?

The raw players.csv dataset, collected from the Minecraft server, contains 196 observations, where each row represents an individual player, and 9 columns, one per variable:

- experience (categorical): The experience level of the player (Amateur, Beginner, Regular, Pro, Veteran)
- subscribe (boolean): Indicates whether the player is subscribed to the game-related newsletter (target variable)
- hashedEmail (string): Player's hashed email
- played_hours (numerical): Total number of hours played
- name (string): Player's name
- gender (categorical): Player's reported gender
- age (numerical): Player's age in years
- individualId (N/A, no data): Each player's in-game ID
- organizationName (N/A, no data): Player's affiliated organization

Columns "individualId" and "organizationName" contain no data, while "name" and "hashedEmail" are identifying variables rather than predictors. The variable types are mixed, which makes it difficult to plot them on the same graphs or to use the same evaluation methods.

# Methods 

The dataset contains several variables for each player and a categorical (True/False) target variable, so we will use K-nearest neighbours (KNN) classification. The response variable is subscribe, and the explanatory variables are experience, played_hours, and age. We will fit a KNN classifier to each pair of explanatory variables, using a 75-25 training-testing split (test_size=0.25 in the code below) to predict subscribe.
Then, we will compare the 5-fold cross-validation results of the 3 models to find the one with the highest accuracy, precision, and recall. The variables of the best-performing model are then the variables most predictive of subscribing to a game-related newsletter.
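
The three metrics named above can be illustrated with a minimal plain-Python sketch. The labels are made-up values, and `classification_metrics` is a hypothetical helper rather than a scikit-learn function; it assumes that `True` (subscribed) is the positive class.

```python
def classification_metrics(actual, predicted):
    """Return accuracy, precision, and recall, treating True as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # subscribed, predicted subscribed
    tn = sum(1 for a, p in pairs if not a and not p)  # not subscribed, predicted not subscribed
    fp = sum(1 for a, p in pairs if not a and p)      # not subscribed, predicted subscribed
    fn = sum(1 for a, p in pairs if a and not p)      # subscribed, predicted not subscribed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Made-up labels for eight players (illustration only, not taken from players.csv)
actual = [True, True, True, True, True, False, False, False]
predicted = [True, True, True, True, False, True, False, False]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

Accuracy counts all correct predictions, precision asks how many predicted subscribers really subscribed, and recall asks how many real subscribers were found. When most players subscribe, accuracy alone can look good even for a model that rarely predicts `False`, which is why all three are worth comparing.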

Since KNN only works with numeric values, we must convert "experience", a categorical variable, to a numeric scale (e.g., Amateur = 1, Beginner = 2, Regular = 3, Pro = 4, Veteran = 5). Each KNN model will then be evaluated using cross-validation to compare accuracy, precision, and recall, allowing us to determine which pair of variables best predicts newsletter subscription.
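
To show why the numeric coding is needed, here is a minimal plain-Python sketch of the KNN idea: encode experience, measure Euclidean distance, and take a majority vote among the nearest neighbours. The players are made up, `knn_predict` is a hypothetical helper (the analysis itself uses scikit-learn's `KNeighborsClassifier`), and the features are left unscaled to keep the sketch short.

```python
from collections import Counter
import math

# Assumed ordinal coding, matching the example scale in the text
EXPERIENCE_LEVELS = {"Amateur": 1, "Beginner": 2, "Regular": 3, "Pro": 4, "Veteran": 5}

def knn_predict(train_points, train_labels, new_point, k):
    """Predict the label of new_point by majority vote among its k nearest neighbours."""
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])  # closest training points first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up players as (experience, age); labels say whether each one subscribed
raw_players = [("Amateur", 17), ("Beginner", 18), ("Pro", 30), ("Veteran", 32), ("Regular", 19)]
labels = [True, True, False, False, True]
train_points = [(EXPERIENCE_LEVELS[level], age) for level, age in raw_players]

new_player = (EXPERIENCE_LEVELS["Beginner"], 20)
prediction = knn_predict(train_points, labels, new_player, k=3)
print(prediction)
```

Without the coding step there is no way to compute a distance between "Pro" and "Beginner", so the classifier could not rank neighbours at all.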

KNN makes few assumptions about the shape of the data, and we check its performance through cross-validation rather than relying on model assumptions. However, when we convert "experience" to a numeric scale, we assume that the levels are ordered from 1 (lowest) to 5 (highest) and that adjacent levels are equally far apart.

Wrangling:
- Remove empty/irrelevant columns after checking for missing values.
- Convert the categorical explanatory variable "experience" to numeric for KNN. (Limitation: the number assigned to each experience level is subjective.)
- "subscribe" is boolean; we will convert it to 0 and 1 for KNN and encode it as nominal (:N) in plots.
- Drop rows with duplicated "hashedEmail" so that each player is counted only once.
- Standardize (scale) the explanatory variables.
- When splitting the data, use stratify=y to keep the proportion of subscribers similar in the training and testing sets.
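
The stratified split in the last step can be sketched in plain Python as follows. `stratified_split` is a hypothetical helper written for illustration and the labels are made up; in the notebook itself, the same effect would come from passing `stratify` to scikit-learn's `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed):
    """Return (train_indices, test_indices) with similar class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_indices, test_indices = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)

# Made-up labels: 12 subscribers and 8 non-subscribers
labels = [True] * 12 + [False] * 8
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=123)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_share, test_share)
```

Because each class is cut separately, the share of subscribers is the same in both parts, so the test set cannot end up with almost no non-subscribers by chance.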

Issues with the model:
- Sensitivity to noise
- Biased toward the majority class when the classes are imbalanced
- Scaling the variables changes their relative influence on the distance calculation
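
The last point can be made concrete with a small plain-Python sketch. The values are made up, and `standardize` and `distance_between` are hypothetical helpers; `standardize` uses the population standard deviation, which is an assumption chosen to mirror what `StandardScaler` does.

```python
import math
import statistics

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [(value - mean) / spread for value in values]

def distance_between(i, j, xs, ys):
    """Euclidean distance between players i and j using two feature columns."""
    return math.dist((xs[i], ys[i]), (xs[j], ys[j]))

# Made-up values (illustration only): experience on the 1-5 scale, played_hours in hours
experience = [1, 5, 1, 3]
played_hours = [0.0, 2.0, 40.0, 1.0]

# Player 0 vs. player 1 (very different experience) and player 2 (very different hours)
raw_to_1 = distance_between(0, 1, experience, played_hours)
raw_to_2 = distance_between(0, 2, experience, played_hours)

scaled_experience = standardize(experience)
scaled_hours = standardize(played_hours)
scaled_to_1 = distance_between(0, 1, scaled_experience, scaled_hours)
scaled_to_2 = distance_between(0, 2, scaled_experience, scaled_hours)

print(f"raw:    {raw_to_1:.2f} vs {raw_to_2:.2f}")
print(f"scaled: {scaled_to_1:.2f} vs {scaled_to_2:.2f}")
```

In raw units the gap in hours dominates the distance simply because hours span a wider range than the 1-5 experience scale. After standardizing, the two distances are close to each other, meaning a large gap in experience counts about as much as a large gap in hours.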

Comparing and selecting the best model and processing the data to apply the model:
1. Select the columns to keep: age, played_hours, experience, and subscribe.
2. Fit 3 KNN classification models, one per pair of explanatory variables, with "subscribe" as the target in all 3 (encoded as colour, :N, in plots). Models: age vs. played_hours, age vs. experience, played_hours vs. experience.
3. Quantitative analysis: split the data into training and testing sets (0.75, 0.25) and run cross-validation.
4. Compare the cross-validation results of the 3 models to find the highest accuracy, precision, and recall.
5. Select the model with the highest accuracy.
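
The cross-validation in steps 3 and 4 relies on splitting the rows into folds, each of which serves once as the validation set. A minimal sketch of that bookkeeping follows; `k_fold_indices` is a hypothetical helper that neither shuffles nor balances classes, so it only illustrates the idea and is not what `GridSearchCV` or `cross_validate` run internally.

```python
def k_fold_indices(n_rows, n_folds):
    """Split row indices 0..n_rows-1 into n_folds contiguous validation folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_rows // n_folds + (1 if fold < n_rows % n_folds else 0)
        for fold in range(n_folds)
    ]
    folds, start = [], 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = [i for i in range(n_rows) if i < start or i >= start + size]
        folds.append((training, validation))
        start += size
    return folds

folds = k_fold_indices(n_rows=12, n_folds=5)
for training, validation in folds:
    print(len(training), validation)
```

Each row is used for validation exactly once and for training in the other folds; the model's score is then averaged over the folds.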

# Results

In [30]:
# Imports (only the ones used in this notebook)
import altair as alt
import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [31]:
# Read in the dataset and do minimal wrangling to tidy the data
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players_data = pd.read_csv(url)

# Keep only the target and the three explanatory variables
# (.copy() avoids pandas' SettingWithCopyWarning when columns are modified later)
players = players_data[["experience", "subscribe", "played_hours", "age"]].copy()
players


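| | experience | subscribe | played_hours | age |
|---|---|---|---|---|
| 0 | Pro | True | 30.3 | 9 |
| 1 | Veteran | True | 3.8 | 17 |
| 2 | Veteran | False | 0.0 | 17 |
| 3 | Amateur | True | 0.7 | 21 |
| 4 | Regular | True | 0.1 | 21 |
| ... | ... | ... | ... | ... |
| 191 | Amateur | True | 0.0 | 17 |
| 192 | Veteran | False | 0.3 | 22 |
| 193 | Amateur | False | 0.0 | 17 |
| 194 | Amateur | False | 2.3 | 17 |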


In [32]:
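# Convert experience to numerical values (assumed ordering: Amateur lowest, Veteran highest)

experience_levels = {
    "Amateur": 1,
    "Beginner": 2,
    "Regular": 3,
    "Pro": 4,
    "Veteran": 5,
}
# .map() with .assign() returns a new integer column without the pandas warning raised by .replace()
players = players.assign(experience=players["experience"].map(experience_levels))

players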
 



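| | experience | subscribe | played_hours | age |
|---|---|---|---|---|
| 0 | 4 | True | 30.3 | 9 |
| 1 | 5 | True | 3.8 | 17 |
| 2 | 5 | False | 0.0 | 17 |
| 3 | 1 | True | 0.7 | 21 |
| 4 | 3 | True | 0.1 | 21 |
| ... | ... | ... | ... | ... |
| 191 | 1 | True | 0.0 | 17 |
| 192 | 5 | False | 0.3 | 22 |
| 193 | 1 | False | 0.0 | 17 |
| 194 | 1 | False | 2.3 | 17 |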


In [33]:
#SPLIT DATASET INTO TRAINING AND TESTING DATA

players_train, players_test = train_test_split(players, test_size=0.25, random_state=123) # set the random state to be 123

players_train

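| | experience | subscribe | played_hours | age |
|---|---|---|---|---|
| 100 | 1 | True | 0.0 | 20 |
| 10 | 5 | True | 1.6 | 23 |
| 149 | 1 | True | 0.0 | 16 |
| 171 | 2 | False | 1.8 | 32 |
| 178 | 1 | True | 0.4 | 17 |
| ... | ... | ... | ... | ... |
| 17 | 1 | True | 48.4 | 17 |
| 98 | 1 | False | 0.0 | 17 |
| 66 | 5 | False | 0.1 | 22 |
| 126 | 2 | True | 0.7 | 24 |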


In [34]:
# EDA on the training data only (per TA feedback)
# TODO: make scatter plots/histograms of the explanatory variables using players_train,
# adapting the plots from the individual proposals (the TA confirmed this is allowed)

In [35]:
#build preprocessor

players_preprocessor_1 = make_column_transformer(
    (StandardScaler(), ["experience", "played_hours"]),
    remainder='passthrough', 
    verbose_feature_names_out=False
)

players_preprocessor_2 = make_column_transformer(
    (StandardScaler(), ["experience", "age"]),
    remainder='passthrough', 
    verbose_feature_names_out=False
)

players_preprocessor_3 = make_column_transformer(
    (StandardScaler(), ["played_hours", "age"]),
    remainder='passthrough', 
    verbose_feature_names_out=False
)

In [45]:
#specify knn classifier
knn_spec = KNeighborsClassifier(n_neighbors=3)

#identify training predictors vs. target

X_train_1 = players_train[["experience", "played_hours"]]
X_train_2 = players_train[["experience", "age"]]
X_train_3 = players_train[["played_hours", "age"]]
y = players_train["subscribe"]

#create fitted pipelines

players_fit_1 = make_pipeline(players_preprocessor_1, knn_spec).fit(X_train_1, y)
players_fit_2 = make_pipeline(players_preprocessor_2, knn_spec).fit(X_train_2, y)
players_fit_3 = make_pipeline(players_preprocessor_3, knn_spec).fit(X_train_3, y)

In [37]:
# first focus on players_preprocessor_1, find optimal KNeighbors value through cross-validation to tune KNeighbors
# create parameter grid with range of 1-15 (inclusive) for simplicity and processing speed

param_grid={'kneighborsclassifier__n_neighbors':range(1, 16)}

#specify X and y variables for predictors vs. target for the variables within the preprocessor

X_1=players[['experience', 'played_hours']]
y=players['subscribe']

# create pipe

players_pipe_1=make_pipeline(players_preprocessor_1, knn_spec)
             
# perform standard 5-fold cross validation

knn_tune_grid_1=GridSearchCV(
        estimator=players_pipe_1,
        param_grid=param_grid,
        cv=5
    )

# fit tuned grid to X and y

knn_model_grid_1=knn_tune_grid_1.fit(X_1, y)

# find the results and store in a new dataframe 

accuracies_grid_1=pd.DataFrame(knn_model_grid_1.cv_results_)


# create a line graph (with points) to visualize results and help determine optimal KNeighbors value

accuracy_k_grid_1=alt.Chart(accuracies_grid_1).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Number of neighbours (k)").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean Test Score").scale(zero=False)
)
accuracy_k_grid_1

In [38]:
# from the above results, the optimal number of neighbours is k=13 or k=15; we will use k=13 in this project

# still focusing on the variable set in players_preprocessor_1, create new knn specification for this dataset

knn_spec_1=KNeighborsClassifier(n_neighbors=13)

# create new pipeline and fit to X_1 and y variables

players_fit_1_final=make_pipeline(players_preprocessor_1, knn_spec_1).fit(X_1, y)

# perform standard 5-fold cross-validation and store the results in a new dataframe
# (note: X_1 and y come from the full players table here, not from players_train)

players_tune_grid_1=pd.DataFrame(
    cross_validate(
        estimator=players_fit_1_final,
        cv=5,
        X=X_1,
        y=y,
        return_train_score=True
    )
)

#calculate the accuracy of the model using these predictor variables

player_metrics_1=players_tune_grid_1.agg(['mean', 'sem'])
player_metrics_1

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| mean | 0.003368 | 0.004131 | 0.734744 | 0.734697 |
| sem | 0.000119 | 0.000121 | 0.005531 | 0.001381 |


In [39]:
# repeat above steps for players_preprocessor_2 to find accuracy
# specify X (predictors) for the variables within the preprocessor

X_2=players[['experience', 'age']]

# create pipe

players_pipe_2=make_pipeline(players_preprocessor_2, knn_spec)
             
# perform standard 5-fold cross validation

knn_tune_grid_2=GridSearchCV(
        estimator=players_pipe_2,
        param_grid=param_grid,
        cv=5
    )

# fit tuned grid to X and y

knn_model_grid_2=knn_tune_grid_2.fit(X_2, y)

# find the results and store in a new dataframe 

accuracies_grid_2=pd.DataFrame(knn_model_grid_2.cv_results_)


# create a line graph (with points) to visualize results and help determine optimal KNeighbors value

accuracy_k_grid_2=alt.Chart(accuracies_grid_2).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Number of neighbours (k)").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean Test Score").scale(zero=False)
)
accuracy_k_grid_2

In [40]:
# from the above results, the optimal number of neighbours is k=9

# still focusing on the variable set in players_preprocessor_2, create new knn specification for this dataset

knn_spec_2=KNeighborsClassifier(n_neighbors=9)

# create new pipeline and fit to X_2 and y variables

players_fit_2_final=make_pipeline(players_preprocessor_2, knn_spec_2).fit(X_2, y)

# perform standard 5-fold cross-validation and store the results in a new dataframe
# (note: X_2 and y come from the full players table here, not from players_train)

players_tune_grid_2=pd.DataFrame(
    cross_validate(
        estimator=players_fit_2_final,
        cv=5,
        X=X_2,
        y=y,
        return_train_score=True
    )
)

#calculate the accuracy of the model using these predictor variables

player_metrics_2=players_tune_grid_2.agg(['mean', 'sem'])
player_metrics_2

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| mean | 0.003385 | 0.004998 | 0.75 | 0.760207 |
| sem | 9e-05 | 0.00088 | 0.02202 | 0.006202 |


In [41]:
# focus on players_preprocessor_3, find optimal KNeighbors value through cross-validation to tune KNeighbors
# specify X and y variables for predictors vs. target for the variables within the preprocessor

X_3=players[['played_hours', 'age']]

# create pipe

players_pipe_3=make_pipeline(players_preprocessor_3, knn_spec)
             
# perform standard 5-fold cross validation

knn_tune_grid_3=GridSearchCV(
        estimator=players_pipe_3,
        param_grid=param_grid,
        cv=5
    )

# fit tuned grid to X and y

knn_model_grid_3=knn_tune_grid_3.fit(X_3, y)

# find the results and store in a new dataframe 

accuracies_grid_3=pd.DataFrame(knn_model_grid_3.cv_results_)


# create a line graph (with points) to visualize results and help determine optimal KNeighbors value

accuracy_k_grid_3=alt.Chart(accuracies_grid_3).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Number of neighbours (k)").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean Test Score").scale(zero=False)
)
accuracy_k_grid_3

In [42]:
# from the above results, the optimal number of neighbours is k=5

# still focusing on the variable set in players_preprocessor_3, create new knn specification for this dataset

knn_spec_3=KNeighborsClassifier(n_neighbors=5)

# create new pipeline and fit to X_3 and y variables

players_fit_3_final=make_pipeline(players_preprocessor_3, knn_spec_3).fit(X_3, y)

# perform standard 5-fold cross-validation and store the results in a new dataframe
# (note: X_3 and y come from the full players table here, not from players_train)

players_tune_grid_3=pd.DataFrame(
    cross_validate(
        estimator=players_fit_3_final,
        cv=5,
        X=X_3,
        y=y,
        return_train_score=True
    )
)

#calculate the accuracy of the model using these predictor variables

player_metrics_3=players_tune_grid_3.agg(['mean', 'sem'])
player_metrics_3

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| mean | 0.003388 | 0.00427 | 0.759872 | 0.789515 |
| sem | 0.000114 | 0.000342 | 0.026932 | 0.008234 |
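
One way to read the three cross-validation results side by side is to compare the gaps between mean test scores with their standard errors. The sketch below copies the `mean` and `sem` of `test_score` from the three result tables above. Combining the two standard errors in quadrature is an assumption of this sketch, used as a rough yardstick rather than a formal test.

```python
import math

# Mean test score and standard error (sem) copied from the three result tables above
results = {
    "experience + played_hours (k=13)": (0.734744, 0.005531),
    "experience + age (k=9)": (0.750000, 0.022020),
    "played_hours + age (k=5)": (0.759872, 0.026932),
}

best_name = max(results, key=lambda name: results[name][0])
best_mean, best_sem = results[best_name]

comparisons = {}
for name, (mean, sem) in results.items():
    if name == best_name:
        continue
    gap = best_mean - mean
    # Rough yardstick (assumption, not a formal test): standard errors combined in quadrature
    combined_sem = math.hypot(best_sem, sem)
    comparisons[name] = (gap, combined_sem)
    print(f"{best_name} vs {name}: gap={gap:.3f}, combined sem={combined_sem:.3f}")
```

With these numbers, played_hours and age has the highest mean test score, but each gap to the other two pairs is smaller than the corresponding combined standard error, so the ranking should be read with caution.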


In [43]:
# Build 3 KNN models to evaluate (one per pair of explanatory variables)
# TODO: use the tutorial-style scaffolding to find the best pair:
#   - compute the accuracy score (see the classification chapter)
#   - compute the precision score to help pick the best model
# Open question: should we also report a confusion matrix?

In [44]:
#EVALUATE THE MOST ACCURATE/PRECISE PAIR ON THE TEST DATA

# Discussion


# References (Optional)