# DSCI100 Project Final Report
### Players Subscription Predictive Data Analysis

### Group 29:
- Ysabel Maria Fleet - 13009485
- Sanjana Gopee - 59940676
- Simar Pandher - 14521397
- Olivia Kong - 72594369

## Introduction
This project uses datasets provided by a Computer Science research group at UBC led by Frank Wood. The data describe which individuals play video games and how they play. It was collected by recording player actions on a Minecraft server, PLAIcraft, together with information about each player's personal attributes. The data have been condensed into two datasets, **Players** and **Sessions**. Our group has chosen to answer Question 3 from the project criteria, which we have identified as a predictive classification question.

**Question 3**: We would like to know something about our populations of users, in particular, we would like to have a good model of whether or not a player will continue contributing given past participation. 

**Formulated Question:** Given a player's total played hours and average session length, can future contribution be predicted, assuming their subscription status reflects commitment?


## Data Description

### Players Data
The players.csv file is a data set containing information about the players in the game. There are 196 observations with data about the players such as their experience, whether they subscribe to the game, their email, the number of hours played, their names, gender and age. These categories are split into the 9 variables (column names) below.


**players.csv:**

|     Variable     |  Type   |                    Description                            |
|------------------|---------|-----------------------------------------------------------|
| experience       | String  | The level of expertise that the player has in the game. Possible values are "Amateur", "Beginner", "Regular", "Pro", "Veteran". This is a categorical variable.              |
| subscribe        | Boolean | Whether or not the player has subscribed to the game. The variable can only take the value "TRUE" or "FALSE", indicating "yes"  or "no" to whether they are subscribed.  |
| hashedEmail      | String  | A string of letters and numbers produced by hashing the user's email address. This is a unique identifier for each player.                        |
| played_hours     | Float   | The played hours indicates the number of hours spent playing the game approximated to one decimal place.                               |
| name             | String  | This is the name (first name) of the player. This is probably not a unique identifier since two people could coincidentally have the same name.                                             |
| gender           | String  | Gender is a categorical variable which has the following possible values: "Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-spirited", "Other".                                        |
| age              | Integer | Player's age                                              |
| individualId     | N/A     | Individual ID of the player. No values were provided, so this column cannot be used in the analysis.           |
| organizationName | N/A     | Organization name. No values were provided, so this column cannot be used either.                  |

The final two columns, 'individualId' and 'organizationName', contain only null cells and will be dropped. Furthermore, 'name', 'gender', 'age', and 'experience' are not used as predictors in our classification analysis and will also be dropped. Thus, the columns of use are reduced from 9 to 3.

### Sessions Data
The sessions.csv data set contains specific data about the playing sessions of the players in the game. There are 1535 observations in the sessions data set and 5 variables. Variables include the players' hashed email, start time, end time, original start time and original end time. This provides data about individual sessions played by each player, importantly, the times and dates of sessions played associated with a unique identifier.

**sessions.csv:**
|     Variable     |  Type   |                    Description                            |
|------------------|---------|-----------------------------------------------------------|
| hashedEmail      | String  | A string of letters and numbers produced by hashing the user's email address. This is a unique identifier for each player.                       |
| start_time     | String   | This includes the date - in format DD/MM/YY - and the time the player started playing the game. The time is in 24-hour format.                               |
| end_time             | String  | This includes the date - in format DD/MM/YY - and the time the player stopped playing the game. The time is in 24-hour format.                                             |
| original_start_time           | Integer  | This variable is a 14-digit integer indicating the start time in Unix Time format.                                        |
| original_end_time                | Integer | This variable is a 14-digit integer indicating the end time in Unix Time format.                                               |

Unlike the players data, none of these variables is empty; each provides information about the playing sessions. The two columns 'original_start_time' and 'original_end_time' are both in Unix time, which is not a useful unit of measurement for this analysis, so these columns will be dropped. Both 'start_time' and 'end_time' hold two measurements per cell: the date and the time at which the session started or ended. These columns will need to be tidied and converted from strings to a datetime type so that dates and times can be handled appropriately.
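
As a minimal sketch of the tidying step described above, the snippet below parses two made-up session rows. The timestamps and the `toy_sessions` name are illustrative assumptions, not values from sessions.csv. It shows that `dayfirst=True` makes pandas read `01/02/2024` as 1 February, and that subtracting parsed datetimes gives a duration that can be expressed in hours.

```python
import pandas as pd

# Hypothetical rows in the same day-first layout as 'start_time' and 'end_time'.
toy_sessions = pd.DataFrame({
    "start_time": ["01/02/2024 18:00", "15/03/2024 23:30"],
    "end_time": ["01/02/2024 18:45", "16/03/2024 01:00"],
})

# Parse the strings as datetimes, reading the first number as the day.
start = pd.to_datetime(toy_sessions["start_time"], dayfirst=True)
end = pd.to_datetime(toy_sessions["end_time"], dayfirst=True)

# Subtracting datetimes yields a timedelta; convert it to hours.
toy_sessions["session_length_hours"] = (end - start).dt.total_seconds() / 3600
print(toy_sessions)
```

The second row crosses midnight, which is why the date part of each cell has to be kept when computing the session length.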

## Methods

### _Summary_
Using data available about **Players** and **Sessions of Players**, the goal of this project is to predict player retention, whether a player will continue playing the game based on the data from the two given dataframes. The column "subscribe" is the categorical variable we are trying to predict, and this is based on our assumption that players subscribing to the game will continue to play it. The two variables chosen to predict this are:

1. Total number of hours played
2. Average session time per player

These have been chosen as predictors based on two assumptions. The first is that a player with a higher total number of played hours is more likely to subscribe. The second is that a player with a higher average session time is more likely to subscribe. These assumptions are reasonable because a player's prior interest in the game is likely to indicate their continued interest in the game.

To predict the value of the "subscribe" column, a KNN classification model is built using the optimal K value and the two quantitative predictor variables mentioned above. The data will also be split into training and test sets to evaluate the classifier's performance using accuracy, recall and precision. An optimal K value will be chosen through a GridSearchCV object with accuracy as the scoring metric.


### Import Packages:
> 1. Import altair as alt.
> 2. Import numpy as np.
> 3. Import pandas as pd. From scikit-learn (sklearn), import all relevant commands for classification.

### Reading Data:
> 1. Set random seed (2020)
> 2. Use a relative path to load in 'sessions.csv' data set, named as 'sessions_original'.
> 3. Use a relative path to load in 'players.csv' data set, named as 'players_original'.
> 4. Manually inspect each data set and use .info() to understand the contents of the data set, and create data descriptions (see above). Use descriptions and understanding to inform cleaning and wrangling.

### Cleaning and Wrangling Data:

Sessions Data Frame:
> 1. Convert the 'start_time' and 'end_time' columns from a dtype object to datetime, specifying dayfirst as true. For clarity, name them 'start_time_final' and 'end_time_final' respectively.
> 2. Subtract the 'start_time_final' from the 'end_time_final' to calculate and create a 'session_length' column that has the session time per session per player.
> 3. Drop columns 'original_start_time' and 'original_end_time', and name the data frame 'sessions'.
> 4. Call Groupby on hashedEmail to find the mean average 'session_length', yielding the average session time (in hours) per player. Name this column 'session_length_hours' and name the data frame 'sessions_group'.

Merging Data Frames:
> 1. Merge 'players_original' with 'sessions_group' on the hashedEmail column. Name the new data frame 'merged'.

Cleaning the Merged Data Frame:
>1. From the 'merged' data frame, drop unwanted columns identified in the 'players_original' data frame and unwanted columns created when wrangling the 'sessions_original' data frame. Dropped columns: 'name', 'gender', 'age', 'individualId', 'organizationName', 'experience', and 'session_length'.
>2. For clarity, name the data frame 'merged_final'.

Balancing the Data Frame:
> 1. Address the class imbalance in the 'subscribe' column, the predicted label, by oversampling the minority class (False) from the 'merged_final' data frame.
> 2. Name the balanced data frame 'upsampled_merged'. This data frame should have 186 rows and 4 columns: 'hashedEmail', 'session_length_hours', 'subscribe', and 'played_hours'.

### Exploratory Analysis:

Session Length and Played Hours Scatter Plot:
> 1. Filter the 'upsampled_merged' data frame to address outliers, so that the exploratory visualisation is focused and easily interpretable. Name this ‘filtered_data’.
> 2. From the 'upsampled_merged' data frame, create a scatter plot for 'session_length_hours' on the x-axis and 'played_hours' on the y-axis. Assign color to the 'subscribe' column, to observe how the predicted label factors in the relationship between the variables.
> 3. From the ‘filtered_data’, create a scatter plot for 'session_length_hours' on the x-axis and 'played_hours' on the y-axis. Assign color to the 'subscribe' column, to observe how the predicted label factors in the relationship between the variables.
> 4. From the visualisations, make note of any observations that may help with further analysis.  

### Data Analysis, Training and Predicting: 

Data Preprocessing:
> 1. Split the data into training and testing data sets, called 'players_sessions_train' and 'players_sessions_test' respectively. Set the test size to 25% so that most of the data is used to train the model. Use shuffle so that the row order does not influence which observations land in each set, and pass the class label to the stratify parameter so that both sets keep the same proportion of each class.
> 2. Create a preprocessor, ensuring to scale and center (standardise) the data by calling StandardScaler on the predictor variables.
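
The standardisation step matters for K-NN because the classifier relies on distances, and total played hours can be far larger than average session length. The sketch below uses made-up predictor values (`toy_X` is a hypothetical array, not project data) to show what StandardScaler does to each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor values: column 0 is total played hours, column 1 is
# average session length in hours. The two columns are on very different scales.
toy_X = np.array([[0.5, 0.2], [2.0, 0.4], [150.0, 1.0], [30.0, 0.8]])

scaled = StandardScaler().fit_transform(toy_X)

# After standardisation each column has mean 0 and standard deviation 1,
# so neither predictor dominates the distance calculation.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```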

Data Processing and Creating a Pipeline:
> 1. Create a K-NN specification model, setting k to 3 as an arbitrary baseline.
> 2. From 'players_sessions_train', name the predictor variables 'X_train' and the response variable 'y_train'.
> 3. Create a pipeline with the preprocessor and K-NN model, and fit it to the X and y training data.
> 4. Use the fitted pipeline's predict method to generate predictions for the test data.

Examining the Accuracy:
> 1. Examine the accuracy by using the score method with data from the 'players_sessions_test' data frame.
> 2. Analyse the accuracy of the pipeline, considering its significance in this application.

### Data Analysis, Evaluation and Tuning:

Cross Validation:
> 1. Perform a 5-fold cross-validation using the cross_validate function.
> 2. Aggregate the mean and standard error of the classifier’s validation accuracy across the folds.


Tuning:
> 1. Construct a parameter grid covering the K values to try (here, 2 to 14 neighbours).
> 2. Make a tuning pipeline, calling the preprocessor and KNeighborsClassifier.
> 3. Create a K-NN tuning grid by calling GridSearchCV on the tuning pipeline and parameter grid. With no scoring argument, accuracy is used to compare K values, and best_params_ reports the K with the highest mean validation accuracy.
> 4. Fit the K-NN tuning grid to the training data predictors and labels, naming this the K-NN model grid.
> 5. Create an accuracy grid by wrapping the cv_results_.
> 6. Plot an accuracy grid called 'accuracy_versus_k_grid', with the accuracy estimate on the y-axis and K-neighbours on the x-axis. Ensure to layer data points on the line chart.
> 7. Based on the plot, select the most optimal value of K, whilst critically considering the result of the tuning for the application.

### Final Model Training and Evaluation:

Retraining the K-NN Model for Test Data:
> 1. Retrain the K-NN classifier, using the most optimal value of K-Neighbours. For clarity, name this classifier 'knn_2'.
> 2. Make a new pipeline 'players_session_fit2', that calls upon the preprocessor and 'knn_2' classifier.
> 3. Create a new data frame called 'players_session_predict', adding a new column 'predicted' using the 'players_session_fit2' pipeline and the predictor variables from 'players_sessions_test'.

Examining the Accuracy:
> 1. Examine the accuracy by using the score method with data from the 'players_sessions_test' data frame.
> 2. Analyse the accuracy of the pipeline, considering its significance in this application.

Visualise and Summarise Performance with a Confusion Matrix:
> 1. Use the ‘crosstab’ function to display the actual and predicted labels in a confusion matrix.

Computing Precision and Recall:
> 1. Referencing the confusion matrix, manually calculate precision by dividing the number of correct positive predictions by the total number of positive predictions.
> 2. Referencing the confusion matrix, manually calculate recall by dividing the number of correct positive predictions by the total number of positive test set observations.
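
For reference, with subscribed as the positive label, these two metrics can be written in terms of confusion matrix counts. The symbols $TP$, $FP$ and $FN$ (true positives, false positives and false negatives) are shorthand introduced here and are not names used elsewhere in this report.

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
$$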


Create Prediction Plots with Test Data:
> 1. Address outliers that make the lower-valued, clustered observations difficult to interpret. Do this by creating a filtered data frame ('filtered_data_2') whose range and domain omit outlying observations.
> 2. Generate a scatter plot named ‘test_plot’ that has 'played_hours' on the y-axis and 'session_length_hours' on the x-axis, using colour to indicate if a player is currently subscribed or not.
> 3. Generate a scatter plot named ‘test_predicted_plot’ that has 'played_hours' on the y-axis and 'session_length_hours' on the x-axis, using colour to indicate if a player is predicted to subscribe or not, therefore, visualising predicted labels.
> 4. Analyse the plots, specifically, the predicted map’s effectiveness in addressing the classification question; identifying something about our population of users, to predict if a player will continue contributing given past participation.

Critically Analyzing Performance:
> 1. This should be done throughout different steps of the method.
> 2. Using the context of the problem and data, analyze the performance of the K-NN classification of the model. Consider the precision-recall trade-off, and acknowledge both the accuracy and confusion matrix.

## Import Packages

In [33]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import (
    GridSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

## Reading Data

In [34]:
#Setting random seed (2020)

np.random.seed(2020)

# Using a relative path to load in 'sessions.csv' data set, named as 'sessions_original'.
# Using a relative path to load in 'players.csv' data set, named as 'players_original'.

sessions_original=pd.read_csv('data/sessions.csv')
sessions_original

players_original = pd.read_csv("data/players.csv")

## Cleaning and Wrangling Data

In [35]:
# Converting the 'start_time' and 'end_time' columns from a dtype object to datetime.
sessions_original['start_time_final']=pd.to_datetime(sessions_original["start_time"], dayfirst=True)
sessions_original['end_time_final']=pd.to_datetime(sessions_original["end_time"], dayfirst=True)

# Calculating session length per session per player, and creating a new column 'session_length'.
sessions_original["session_length"] = sessions_original["end_time_final"]-sessions_original["start_time_final"]

# Tidying and renaming the data frame.
sessions=sessions_original.drop(columns=['original_start_time','original_end_time', 'start_time', 'end_time'])
sessions

# Grouping and calculating the mean 'session_length' per player, given in hours.
sessions_group=sessions.groupby('hashedEmail')["session_length"].mean().reset_index()
sessions_group["session_length_hours"] = sessions_group["session_length"].dt.total_seconds()/3600

# Merging the 'players_original' with the 'sessions_group' on the hashedEmail column.
merged=sessions_group.merge(players_original, on='hashedEmail')

# Cleaning the merged data frame, creating the 'merged_final' data frame.
merged_final = merged.drop(columns=["session_length", "individualId", "organizationName", "name", "gender","age","experience"])

# Addressing data imbalance in the 'subscribe' column.
subscribe_false = merged_final[merged_final["subscribe"] == False]
subscribe_true = merged_final[merged_final["subscribe"] == True]

subscribe_false_upsample = subscribe_false.sample(
n=subscribe_true.shape[0], replace=True)
upsampled_merged = pd.concat((subscribe_false_upsample, subscribe_true))

# Run (upsampled_merged["subscribe"].value_counts()) to ensure that the data is balanced.

This block of code processes the players.csv and sessions.csv files, tidying their data to prepare for data analysis and classification. As mentioned above in the Methods section, the predictor variables chosen for this report are the total played hours and the mean session time per player. To obtain the latter, a new column called 'session_length_hours' was made to store the average session length for each player. This was done by converting the session start and end time columns in the sessions dataframe to datetime, which allows us to subtract the start time of the session from the end time to get the length of that session. From there, columns that we aren't using for this report were dropped and the groupby function was applied to the hashedEmail column to group sessions by player, allowing the mean session length to be calculated and ensuring that each player corresponded to one row. The two dataframes were then merged via the 'hashedEmail' column and additional unused columns were dropped (e.g. 'individualId' and 'organizationName').

Since this report requires classification and the 'subscribe' column is the categorical variable being predicted, it's important to have a balanced number of subscribed and non subscribed observations. This ensures that the classification model doesn't consistently predict the majority class due to an imbalance between the two classes. In the merged dataset, subscribed observations outnumbered non-subscribed ones. To address this, the sample function was used to balance the dataset by matching the number of non-subscribed observations to the number of subscribed observations.
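
The balancing step can be illustrated on a small made-up data frame. The values, the `toy` name and the explicit `random_state` below are assumptions for illustration only. The minority class is sampled with replacement until it has as many rows as the majority class.

```python
import pandas as pd

# Hypothetical imbalanced data: 6 subscribed players, 2 not subscribed.
toy = pd.DataFrame({
    "played_hours": [0.1, 0.3, 1.2, 5.0, 0.0, 2.4, 0.2, 0.7],
    "subscribe": [True, True, True, True, True, True, False, False],
})

majority = toy[toy["subscribe"]]
minority = toy[~toy["subscribe"]]

# Sample the minority class with replacement until it matches the majority.
minority_upsampled = minority.sample(
    n=len(majority), replace=True, random_state=2020
)
balanced = pd.concat([minority_upsampled, majority])

# Both classes now have the same number of rows.
print(balanced["subscribe"].value_counts())
```

Because sampling is done with replacement, the upsampled rows are duplicates of existing minority observations rather than new players.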

## Exploratory Analysis

In [36]:
# Filtering the upsampled data to address outliers, and prepare the data for a more interpretable visualisation.
filtered_data = upsampled_merged[
    (upsampled_merged["session_length_hours"] <= 1.0) &
    (upsampled_merged["played_hours"] <= 20)]

# Generating a visualisation for the upsampled data, plotting 'session_length_hours' against 'played_hours'.

players_sessions_plot_2 = alt.Chart(upsampled_merged, title='Relationship between Total Played Hours and Average Session Length (Hours) in the Merged Dataset').mark_circle(opacity=0.6).encode(
    x=alt.X("session_length_hours").title('Average Session Length (Hours)').scale(zero=False),
    y=alt.Y("played_hours").title('Total Played Hours').scale(zero=False),
    color = alt.Color("subscribe", title = 'Subscribed (True) or not Subscribed (False)', scale = alt.Scale(scheme="set1")),
    tooltip=["session_length_hours", "played_hours"])

# Generating a visualisation for the filtered upsampled data, plotting 'session_length_hours' against 'played_hours'.
players_sessions_plot_2_filtered = alt.Chart(filtered_data, title='Relationship between Total Played Hours and Average Session Length (Hours) in the Merged Dataset (Filtered)').mark_circle(opacity=0.6).encode(
    x=alt.X("session_length_hours").title('Average Session Length (Hours)').scale(zero=False),
    y=alt.Y("played_hours").title('Total Played Hours').scale(zero=False),
    color = alt.Color("subscribe", title = 'Subscribed (True) or not Subscribed (False)', scale = alt.Scale(scheme="set1")),
    tooltip=["session_length_hours", "played_hours"])

players_sessions_plot_2 | players_sessions_plot_2_filtered

**Figure 1: Relationship between Total Played Hours and Average Session Length (Hours) in the Merged Dataset**

The plots above display the relationship between total played hours and the average session length for each individual in the tidied and balanced merged dataset. The plot on the left (unfiltered data) takes into account all of the observations provided. One issue with this plot is that many of the points have similar total played hours (approximately 5 hours or less), so they overlap. This makes it difficult to see how many true and false observations are present. To help with interpretation, the plot on the right is filtered so that it only shows observations with at most 20 total played hours and an average session length of at most 1 hour. This new plot makes it easier to see which observations are true (subscribed) and which are false (not subscribed). The fact that the observations are so densely clustered indicates low total played hours and mean session time for most players. A few significant outliers are present, with total played hours extending up to 240 hours. Despite these variations, there is no obvious relationship or trend apparent in the data. For instance, looking at the points in the bottom right corner of the plot, where the average session time is greater, it would be expected that these observations are all subscribed; however, that is not the case.

## Data Analysis, Training and Predicting

In [37]:
# Splitting data into  training and testing data sets, with test size set to 25%.
players_sessions_train, players_sessions_test = train_test_split(
    upsampled_merged, 
    test_size=0.25,
    shuffle=True,
    stratify=upsampled_merged['subscribe']
) 

# Creating a preprocessor
players_sessions_preprocessor = make_column_transformer(
     (StandardScaler(), ["played_hours", "session_length_hours"]),
     remainder="passthrough",
     verbose_feature_names_out=False
 )

# Data processing, setting k to 3 as an arbitrary baseline.

players_knn = KNeighborsClassifier(n_neighbors=3)

X_train = players_sessions_train[["session_length_hours","played_hours"]]
y_train = players_sessions_train["subscribe"]

players_sessions_fit = make_pipeline(players_sessions_preprocessor, players_knn).fit(X_train,y_train)

# Creating a pipeline with test data.
players_sessions_test_predictions = players_sessions_test.assign(
    predicted = players_sessions_fit.predict(
        players_sessions_test[["session_length_hours","played_hours"]]))

# Examining the accuracy of the initial pipeline.
X_test = players_sessions_test[["session_length_hours","played_hours"]]
y_test = players_sessions_test["subscribe"]

players_prediction_accuracy = players_sessions_fit.score(X_test,y_test)
players_prediction_accuracy

0.8085106382978723

This block of code starts the classification process by splitting the merged dataframe with balanced classes into training and testing sets. A preprocessor was made to scale the predictor variables, and a classification model with K=3 was arbitrarily chosen as a baseline. This will be compared with the optimal K value, which will be determined by a GridSearchCV object, to evaluate their performance in predicting the observations. A pipeline was trained using the training data set and used to predict the testing data, resulting in an accuracy of 81%.
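
To illustrate what the stratify argument does in the split, here is a small sketch on a made-up balanced data frame. The `toy` frame and the explicit `random_state` are assumptions for illustration. Both subsets keep the class proportions of the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data frame with 20 rows per class.
toy = pd.DataFrame({
    "played_hours": range(40),
    "subscribe": [True] * 20 + [False] * 20,
})

train, test = train_test_split(
    toy,
    test_size=0.25,
    shuffle=True,
    stratify=toy["subscribe"],
    random_state=2020,
)

# The mean of a boolean column is the share of True rows in each subset.
print(len(train), len(test))
print(train["subscribe"].mean(), test["subscribe"].mean())
```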

## Data Analysis, Evaluation and Tuning

In [38]:
# Performing a 5-fold cross-validation.


players_sessions_pipe = make_pipeline(players_sessions_preprocessor, players_knn)

players_vfold_score = pd.DataFrame(
     cross_validate(
         estimator= players_sessions_pipe,
         cv=5,
         X = players_sessions_train[["session_length_hours","played_hours"]],
         y = players_sessions_train["subscribe"],
         return_train_score=True,
     )
)

# Aggregating the average mean and standard error of the classifier's validation.
players_metrics = players_vfold_score.agg(["mean","sem"])
players_metrics[['test_score']]

Unnamed: 0,test_score
mean,0.733598
sem,0.037138


This block of code performs 5-fold cross validation for the classification model using K=3. The test_score column reports the mean (estimated accuracy) and the standard error (a measure of uncertainty around the mean value) for the model. As a result, the KNN model with K=3 is expected to correctly classify approximately 73% of new observations, and the standard error of 3.71 percentage points indicates that the accuracy can be expected to fall approximately between 70% and 77% (one standard error either side of the mean). The validation scores collected from this 5-fold cross validation give a performance baseline for a classification model with K=3, and can be used for comparison once the optimal K value is found. This is because these scores help assess the model's ability to generalize across different splits of the data and how consistent its performance is.
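
The range quoted for the accuracy can be checked with a short calculation, taking one standard error either side of the mean. The two numbers are copied from the cross-validation output; the variable names are illustrative.

```python
# Values copied from the cross-validation output above.
mean_accuracy = 0.733598
standard_error = 0.037138

# One standard error either side of the mean.
lower = mean_accuracy - standard_error
upper = mean_accuracy + standard_error

print(f"{lower:.3f} to {upper:.3f}")
```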

In [39]:
# Constructing a parameter grid.
param_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 15, 1),
}

# Create a tuning pipeline and K-NN tuning grid.
players_tune_pipe = make_pipeline(players_sessions_preprocessor, KNeighborsClassifier())

knn_tune_grid = GridSearchCV(
     players_tune_pipe, param_grid, cv=4,
)

# Fit the K-NN tuning grid to the training data, then create an accuracy grid by wrapping the cv_results_.
knn_model_grid = knn_tune_grid.fit(X_train,y_train)

accuracies_grid= pd.DataFrame(knn_model_grid.cv_results_)
accuracies_grid

# Plot the accuracy grid.
accuracy_versus_k_grid = alt.Chart(accuracies_grid, title='Estimated Accuracy of each K Value (Range: 2-14) in the Classification Model ').mark_line(point=True).encode(
     x=alt.X("param_kneighborsclassifier__n_neighbors")
         .title("Number of Neighbors")
         .scale(zero=False),
     y=alt.Y("mean_test_score")
         .title("Accuracy")
         .scale(zero=False)
 )

accuracy_versus_k_grid

**Figure 2: Estimated Accuracy of each K Value (Range: 2-14) in the Classification Model**

This line plot visualizes the accuracy of the classification model across K values in the range from 2 to 14, as determined by the GridSearchCV object. The plot reveals that K=2 has the highest accuracy, making it the optimal choice for the model within the tested range. Accuracy decreases as the number of neighbors increases, which suggests that a model using a greater K value is likely to make more errors when predicting.
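
The best K can also be read directly from the fitted grid object instead of from the plot. The sketch below uses synthetic data from `make_classification` as a stand-in for the training set, so its result says nothing about the players data. It only shows where GridSearchCV stores the selected value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for the training set (an assumption).
X, y = make_classification(
    n_samples=120, n_features=2, n_informative=2, n_redundant=0,
    random_state=2020,
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    pipe,
    {"kneighborsclassifier__n_neighbors": range(2, 15)},
    cv=4,
    scoring="accuracy",
)
grid.fit(X, y)

# best_params_ holds the K with the highest mean cross-validated accuracy,
# and best_score_ is that mean accuracy.
print(grid.best_params_, round(grid.best_score_, 3))
```

In this report the equivalent call would be `knn_model_grid.best_params_` on the grid fitted above.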

## Final Model Training and Evaluation

In [40]:
# Retraining the K-NN model for the testing data, using the optimal value of K.
knn_2=KNeighborsClassifier(n_neighbors=2)

players_session_fit2=make_pipeline(players_sessions_preprocessor, knn_2).fit(X_train,y_train)

players_session_predict=players_sessions_test.assign(
    predicted=players_session_fit2.predict(
        players_sessions_test[['session_length_hours','played_hours']])
)

# Displaying the first 15 rows of the data frame.
players_session_predict.head(15)

Unnamed: 0,hashedEmail,session_length_hours,subscribe,played_hours,predicted
121,fcab03c6d3079521e7f9665caed0f31fe3dae6b5ccb86e...,1.333333,True,1.2,True
101,e21a324ccf5c873bafe82e47d5137b36aa312ee4803eeb...,0.1,True,0.0,False
1,060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe...,0.5,False,0.4,False
28,5a340c0e3d1aa3e579efc625bd3e5bca7fc25f7115b68e...,0.275,True,0.4,True
88,ca20f724571080b997e0efa874b9611e9f280c1af5f68f...,0.35,True,0.2,True
35,6fa105fac7f4f37350f21830db78cde153d8edda41d6f4...,0.116667,False,0.1,False
60,90f1495942837b1cde67cc9e3119421e38183502a4c6de...,0.266667,True,0.2,True
52,88247d9a46fc214a12485dcbcbb03a8ddebfe8c1ec5fe2...,0.327778,False,1.4,False
92,d3ca24e4d7fe8ffe2a821ebd3b841252950ca53e4d659a...,0.233333,False,0.1,False
27,577aa5f15468252b1c6f32dcd515012923476292e30f95...,0.166667,True,0.1,False


In [41]:
# Examining the accuracy of the final model.
players_session_accuracy = players_session_fit2.score(X_test,y_test)
players_session_accuracy

0.851063829787234

These two blocks of code create and test a new KNN classification model using the optimal K value of 2. A new pipeline was made using the new model and trained with the training data set. A new dataframe was made which includes a new column 'predicted', containing the model's predictions for the test set. We see that the accuracy of the predictions when using K=2 is 85% compared to 81% when using K=3 previously. This means that the classifier makes fewer overall mistakes when classifying observations.

In [42]:
# Visualising and summarising the model's performance with a confusion matrix.
players_session_mat=pd.crosstab(
    players_session_predict['subscribe'],
    players_session_predict['predicted'],
)

players_session_mat
# Manual calculations of precision and recall are given in the commentary below.

predicted,False,True
subscribe,Unnamed: 1_level_1,Unnamed: 2_level_1
False,23,0
True,7,17


We see from this confusion matrix that the classifier incorrectly predicted 7 subscribed observations as not subscribed and did not predict any non-subscribed observation as subscribed. We are more interested in identifying observations that are subscribed, making this the positive label, and non-subscribed observations are treated as the negative label.

As a result of these predictions, the precision of the classifier when using K=2 is 100% (17/17) and the recall is 71% (17/(7+17)). Therefore, every observation that is predicted as subscribed is correct. However, the classifier only identifies 71% of the actual positive observations and misses the remaining 29%. This means that the classifier produces a significant number of false negatives (i.e., observations that are subscribed are incorrectly classified as not subscribed).
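
These manual calculations can be reproduced with a few lines of arithmetic. The counts are copied from the confusion matrix; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above.
true_negatives, false_positives = 23, 0   # actual False row
false_negatives, true_positives = 7, 17   # actual True row

total = true_negatives + false_positives + false_negatives + true_positives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / total

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

The accuracy computed this way matches the score reported for the final model.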

The lower recall could mean that the classifier is conservative when predicting the positive class (subscribed), as it avoids predicting subscribed unless it is very certain, which explains the high precision, but it fails to catch all positive (subscribed) observations. 

In a real-life scenario, such as when game developers are predicting how many players will play their new game, prioritizing precision can maximize profit. A model with high precision ensures that the players predicted to play the game are likely to actually play. This would reduce the cost of any additional marketing or resources wasted on targeting players who are unlikely to play. On the other hand, a model with higher recall would help game developers anticipate the majority of players likely to join the game. This would allow them to prepare servers capable of accommodating increased demand from players, which in turn would reduce the risk of the server crashing.

In [43]:
# Addressing outliers in the data that make visualisation more difficult.

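# Use the final (K=2) model's predictions, 'players_session_predict', so the
# plots show the classifier that was evaluated above rather than the K=3 baseline.
filtered_data_2 = players_session_predict[
    (players_session_predict["session_length_hours"] <= 1.2) &
    (players_session_predict["played_hours"] <= 50)
]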

# Generating a scatter plot visualising player's current subscription status, given 'played_hours' and 'session_length_hours'.
test_plot = alt.Chart(filtered_data_2, title="Hours Played vs Session Length (Filtered)").mark_circle().encode(
    x=alt.X("session_length_hours", title="Session Length (Hours)"),
    y=alt.Y("played_hours", title="Hours Played"),
    color=alt.Color("subscribe", title="Subscribed (Yes or No)"),
    tooltip=["session_length_hours", "played_hours"])

# Generating a scatter plot visualising player's predicted subscription status, given 'played_hours' and 'session_length_hours'.
test_predicted_plot = alt.Chart(filtered_data_2, title="Hours Played vs Session Length (Predicted, Filtered)").mark_circle().encode(
    x=alt.X("session_length_hours", title="Session Length (Hours)"),
    y=alt.Y("played_hours", title="Hours Played"),
    color=alt.Color("predicted", title="Subscribed (Yes or No)", scale=alt.Scale(scheme="set1")),
    tooltip=["session_length_hours", "played_hours"])

test_plot | test_predicted_plot

**Figure 3: Current Player Subscription, Hours Played vs Session Length for Filtered Data**

**Figure 4: Predicted Player Subscription, Hours Played vs Session Length for Filtered Data**

The plots above compare the actual subscription labels in the upsampled data with the labels predicted by our model, with both plots using the same filtered data. From what we can see, the model does a good job of predicting whether or not an individual is subscribed based on hours played and session length. Among the observations with low hours played and low session length, the overall pattern of red and blue (false and true) is consistent. In general, the few non-subscribed players with long session lengths do not have many hours played, and their session lengths are on average shorter than those of subscribed players. The cluster of points in the bottom left of both plots shows that subscribed individuals still have longer session lengths, even though their hours played are about the same. This is shown by the blue (true) points lying further along the x-axis than the red (false) points. The dataset was filtered because, as in Figure 1, many of the observations were densely clustered in the bottom left due to the hours played being very low on average. Filtering the data makes the plots easier to interpret.

## Discussion 

### Summary of Findings:

From the K-NN classification model visualization, a few conclusions can be drawn. Firstly, there is no clear correlation between a player's total hours played or average session length and their likelihood of subscribing to the game. As seen in the visualization, players with high values for both variables are not always predicted to be subscribed. Secondly, there is a visible trend where players with greater total played hours tend to have higher average session lengths, indicating a positive relationship between the two variables. This implies that as players engage more with the game, they are likely to play increasingly longer sessions.

Therefore, regarding the classification question for this report, our model does not provide any concrete conclusions. The classification model indicates that, given past participation, a player's likelihood of subscribing cannot be discerned, or at least not from the predictor variables chosen. That said, the visualization does indicate that players who play the game more tend to do so in longer sessions. This suggests that even when players are not subscribed, greater past participation indicates a higher likelihood of continued engagement. Based on the inconclusive results from our classification model, the subscribe label may therefore not be a suitable response variable for predicting a player's likelihood of continuing to play. However, the classification model could be improved by including additional predictor variables, such as the age column in the players dataset. In a real-life scenario, other predictor variables that could be used include the number of in-game purchases or the number of friends in the game.

### Comparison of Findings to Expectations:

The results from the K-NN classification model do not align with our initial assumptions. To summarize, we expected players who score highly in both predictor variables to be predicted as subscribed by the classifier (and, conversely, those scoring low in both to be predicted as not subscribed). Instead, the classifier predicts players as subscribed seemingly regardless of how they score on each variable.

### Impact of Findings:

These findings contradict what we expected, so we should reevaluate both the variables we have chosen for the K-NN classification and the type of classification model.

Moving forward, it would be advisable to reconsider whether a player's subscription status is an accurate label for a player's past and future participation. Furthermore, we must reconsider whether the chosen predictor variables, total played hours and average session length, have any influence on the subscribe label or on any new classification label we may choose. By doing this, we may create an improved K-NN classification model that better predicts whether or not a player will continue contributing given past participation.

We may also consider posing the initial question differently to formulate a regression problem. Hypothetically, in this scenario we would predict a player's likelihood of continuing to play as a score out of 10, using either numerical or categorical data as predictor variables. However, this may not be straightforward to do, and so it may be neither the optimal course of action nor an improvement on what we have already done.

As mentioned above, we could alternatively choose to use a different type of classification model. K-NN was chosen as the default from the course material, but it does have limitations, namely that its performance can suffer when classes are imbalanced. As seen in our exploratory analysis, the data does have class imbalance in the subscribe column. For this reason, a different type of classification model that is better suited to the data and the question could be used. Additionally, K-NN may not perform as well when more predictor variables are used, and, as mentioned above, one way to improve the performance of the model is to increase the number of predictor variables. Therefore, a type of classification model that performs better with more predictor variables could be used, which in turn would improve the quality of the predictions made. As a result, a more definitive conclusion could be reached that better answers the research question of this report.
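To illustrate why class imbalance can hurt K-NN, the sketch below implements a bare-bones majority-vote classifier in plain Python. The toy points, the labels "majority" and "minority", and the `knn_predict` helper are all hypothetical and are not taken from the players data; they only show how a larger class can outvote a smaller one as K grows.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Return the most common label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: seven "majority" points and three "minority" points.
train = [
    ((0.0, 0.0), "majority"), ((0.1, 0.2), "majority"), ((0.2, 0.1), "majority"),
    ((0.3, 0.3), "majority"), ((0.2, 0.4), "majority"), ((0.4, 0.2), "majority"),
    ((0.5, 0.5), "majority"),
    ((1.0, 1.0), "minority"), ((1.1, 0.9), "minority"), ((0.9, 1.1), "minority"),
]

# A query point sitting in the middle of the minority cluster.
query = (1.0, 1.0)

print(knn_predict(train, query, k=3))  # minority
print(knn_predict(train, query, k=7))  # majority
```

With K=3 the three nearest neighbours are all minority points, but with K=7 there are only three minority points available to vote, so four majority points decide the prediction even though the query lies inside the minority cluster.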
  
### Future Questions:

Considering the impact of our analysis on our findings, we may want to ask questions regarding:

- Which variables in the original players and sessions data frames may be better predictors for addressing the question?
- What assumptions did we overlook or get wrong in our exploration, and how could correcting them help us create a better model in the future?
- In a real-world sense, in what ways does past participation indicate future contribution? Is there a better response label (other than subscribe) that can represent this?

We may also want to consider some future questions that build upon our findings and the initial question:

- Do players who are predicted to continue contributing to the game, given their past participation, share certain demographic factors?
- Are there specific times or other attributes that correlate with predictions to continue contributing?
- What may be limiting factors of a player's continued contribution to the game? How can player participation retention be improved?
- Is there an initial period of time, or a "honeymoon phase", in which players are initially contributing more to the game that is impacting our data analysis? If so, what factors influence this?

## References
Timbers, T., Campbell, T., Lee, M., Ostblom, J., & Heagy, L. (2024). Data science: A first introduction with Python. CRC Press. https://python.datasciencebook.ca/index.html