# DSCI100 Project Final Report
### Players Subscription Predictive Data Analysis

### Group 29:
- Ysabel Maria Fleet - 13009485
- Sanjana Gopee - 59940676
- Simar Pandher - 14521397
- Olivia Kong - 72594369

## Instructions for Final Report
- Title **(Done)**

- Introduction: Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report. Clearly state the question you tried to answer with your project. Identify and fully describe the dataset that was used to answer the question. **(Done)**

- Methods & Results: Describe the methods you used to perform your analysis from beginning to end that narrates the analysis code. **(In-progress)**

Your report should include code which:

1. Loads data 
2. Wrangles and cleans the data to the format necessary for the planned analysis
3. Performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
4. Creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
5. Performs the data analysis
6. Creates a visualization of the analysis 

Note: All figures should have a figure number and a legend

- Discussion:
1. Summarize what you found
2. Discuss whether this is what you expected to find?
3. Discuss what impact could such findings have?
4. Discuss what future questions could this lead to?

- References
You may include references if necessary, as long as they all have a consistent citation style.

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.model_selection import (
    GridSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix


set_config(transform_output="pandas")

### Introduction
This project is based on datasets provided by a research group in Computer Science at UBC, led by Frank Wood, that collected data on how individuals play video games. The group recorded player actions on a Minecraft server, PLAIcraft, and collected information about each individual player as well as how they play. This data has been condensed into two datasets, **Players** and **Sessions**. Our group chose to answer Question 3 from the project criteria.

**Question 3**: We would like to know something about our populations of users, in particular, we would like to have a good model of whether or not a player will continue contributing given past participation. 

## Data Description

## Players Data
The players.csv file is a data set containing information about the players in the game. There are 196 observations with data about the players such as their experience, whether they subscribe, their email, the number of hours played, their names, gender and age. These categories are split into the 9 variables (column names) below. 

  


**Players.csv:**

|     Variable     |  Type   |                    Description                            |
|------------------|---------|-----------------------------------------------------------|
| experience       | String  | The level of expertise that the player has in the game. Possible values are "Amateur", "Beginner", "Regular", "Pro", "Veteran". This is a categorical variable.              |
| subscribe        | Boolean | Whether or not the player has subscribed to the game. The variable can only take the value "TRUE" or "FALSE", indicating "yes"  or "no" to whether they are subscribed.  |
| hashedEmail      | String  | A string of letters and numbers that is a hashed form of the user's email. This is a unique identifier for each player.                        |
| played_hours     | Float   | The played hours indicates the number of hours spent playing the game approximated to one decimal place.                               |
| name             | String  | This is the name (first name) of the player. This is probably not a unique identifier since two people could coincidentally have the same name.                                             |
| gender           | String  | Gender is a categorical variable which has the following possible values: "Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-spirited", "Other".                                        |
| age              | Integer | Player's age                                              |
| individualId     | N/A     | Individual ID of the player. No values were provided, so this column cannot be used in the data analysis.           |
| organizationName | N/A     | Organization name. No values were provided, so this column cannot be used either, just like the "individualId" column.                  |

The final two columns, 'individualId' and 'organizationName', contain only null values, so they carry no information and will be dropped. Furthermore, columns such as 'name', 'gender', and 'age' may not be useful to the analysis and will also be dropped. Thus, the variables of use are reduced from 9 to 4.
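The all-null check described above can be made explicit before dropping anything. This is a minimal sketch on a small made-up frame: `toy_players` and its values are hypothetical and are not taken from players.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for players.csv; the values are invented for illustration.
toy_players = pd.DataFrame({
    "subscribe": [True, False, True],
    "played_hours": [1.2, 0.0, 30.3],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# Boolean Series indexed by column name: True where every cell is missing.
null_mask = toy_players.isna().all()
all_null_cols = null_mask[null_mask].index.tolist()

# Drop only the columns that are entirely null; partially filled columns are kept.
toy_players_clean = toy_players.dropna(axis=1, how="all")
print(all_null_cols)
```

Checking first and dropping second makes it easy to confirm that only the two empty columns are removed.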

## Sessions Data
The sessions.csv data set contains specific data about the playing sessions of the players in the game. There are 1535 observations in the sessions data set and 5 variables. Variables include the players' hashed email, start time, end time, original start time and original end time. This provides data about individual sessions played by each player, importantly, the times and dates of sessions played associated with a unique identifier.

**Sessions.csv:**
|     Variable     |  Type   |                    Description                            |
|------------------|---------|-----------------------------------------------------------|
| hashedEmail         | String  | A string of letters and numbers that is a hashed form of the user's email. This is a unique identifier for each player. |
| start_time          | String  | This includes the date - in format DD/MM/YY - and the time the player started playing the game. The time is in 24-hour format. |
| end_time            | String  | This includes the date - in format DD/MM/YY - and the time the player stopped playing the game. The time is in 24-hour format. |
| original_start_time | Integer | This variable is a 14-digit integer indicating the start time in Unix time. |
| original_end_time   | Integer | This variable is a 14-digit integer indicating the end time in Unix time. |

Unlike the players data, none of the variables are empty; all of them provide information about the playing sessions. The two columns 'original_start_time' and 'original_end_time' are in Unix time, which is not a convenient unit for this analysis, so these columns will be dropped. The 'start_time' and 'end_time' columns each hold two measurements per cell: the date and the time at which the session started or ended. This needs to be tidied so that the data set has 'start_date', 'end_date', 'start_time', and 'end_time' columns with appropriate data types.
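The date/time split described above could look roughly like the following. This is a sketch under assumptions: the rows in `toy_sessions` are made up, and the exact text layout (`%d/%m/%Y %H:%M`) is assumed rather than confirmed from sessions.csv.

```python
import pandas as pd

# Hypothetical rows in an assumed day-first layout; not taken from sessions.csv.
toy_sessions = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", "01/07/2024 23:50"],
    "end_time": ["30/06/2024 18:24", "02/07/2024 00:20"],
})

for col in ["start_time", "end_time"]:
    prefix = col.split("_")[0]  # "start" or "end"
    # Parse the combined string once, then split it into date and time parts.
    parsed = pd.to_datetime(toy_sessions[col], format="%d/%m/%Y %H:%M")
    toy_sessions[f"{prefix}_date"] = parsed.dt.date
    toy_sessions[col] = parsed.dt.time

print(toy_sessions)
```

Note that a session crossing midnight (the second row) cannot be measured from the time-of-day columns alone, so the full datetime is still the right input for computing session length.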


## Method

### Summary
Using data available about **Players** and **Sessions of Players**, the goal of this project is to predict player retention, whether a player will continue playing the game based on the data from the two given dataframes. The column "subscribe" is the categorical variable we are trying to predict, and this is based on our assumption that players subscribing to the game will continue to play it. The two variables chosen to predict this are:

1. Total number of hours played
2. Average session time per player

These have been chosen as predictors based on two assumptions. The first is that a player with a higher total number of played hours is more likely to subscribe. The second is that a player with a higher average session time is more likely to subscribe. These assumptions were made because both variables indicate past interest in the game, from which future interest can be inferred.

Predicting the value of the "subscribe" column will require a KNN classification model with the optimal K value and the two quantitative predictor variables above. The data will also be split into training and test sets to evaluate the classifier's performance using accuracy, recall and precision. The optimal K value will be chosen with a GridSearchCV object, using accuracy as the scoring metric.

### Detailed Steps [INCOMPLETE - Still need to write more]
- **Data Wrangling**:
  - Players data set:
    1. Drop columns: 'name', 'gender', 'age', 'individualId', 'organizationName'
  - Sessions data set:
    1. Drop columns: 'original_start_time' and 'original_end_time'
    2. Convert the 'start_time' and 'end_time' columns from dtype object to datetime
    3. Subtract 'start_time' from 'end_time' to create a 'session_time' column that holds the length of each session.
    4. Group by 'hashedEmail' and find the mean session time per player.
  - Merging data sets:
    1. Merge the Players data set with the Sessions data set on the 'hashedEmail' column

- **Exploratory Analysis**: explain
- **Data Analysis**: explain
- **Visualization of Analysis**: explain

## Analysis

In [2]:
np.random.seed(2020)

# reading data files
sessions_original = pd.read_csv("sessions.csv")
players_original = pd.read_csv("players.csv")

# converting str to datetime (dates are written day-first)
sessions_original["start_time_final"] = pd.to_datetime(sessions_original["start_time"], dayfirst=True)
sessions_original["end_time_final"] = pd.to_datetime(sessions_original["end_time"], dayfirst=True)

# calculating session length as a timedelta
sessions_original["session_length"] = sessions_original["end_time_final"] - sessions_original["start_time_final"]

# tidying: drop the raw time columns that are no longer needed
sessions = sessions_original.drop(columns=["original_start_time", "original_end_time", "start_time", "end_time"])

# grouping by player, calculating the mean session length, and converting it to hours
sessions_group = sessions.groupby("hashedEmail")["session_length"].mean().reset_index()
sessions_group["session_length_hours"] = sessions_group["session_length"].dt.total_seconds() / 3600

# merging data sets (inner join, so players without any sessions are dropped)
merged = sessions_group.merge(players_original, on="hashedEmail")
merged_final = merged.drop(columns=["session_length", "individualId", "organizationName", "name", "gender", "age", "experience"])

# correcting class imbalance by upsampling the non-subscribed class with replacement
subscribe_false = merged_final[merged_final["subscribe"] == False]
subscribe_true = merged_final[merged_final["subscribe"] == True]

subscribe_false_upsample = subscribe_false.sample(
    n=subscribe_true.shape[0], replace=True)

upsampled_merged = pd.concat((subscribe_false_upsample, subscribe_true))
# upsampled_merged["subscribe"].value_counts()  # run this to confirm the classes are balanced

This block of code processes the players.csv and sessions.csv files, tidying their data to prepare for data analysis and classification. As mentioned above in the Methods section, the predictor variables chosen for this report are the total played hours and the mean session time per player. To obtain the latter, a new column called 'session_length_hours' was made to store the average session length for each player. This was done by converting the session start and end time columns in the sessions dataframe to datetime, which allows us to subtract the start time of a session from its end time to get the length of that session. From there, columns that we aren't using for this report were dropped and the groupby function was applied to the hashedEmail column to group sessions by player, allowing the mean session length to be calculated and ensuring that each player corresponds to one row. The two dataframes were then merged on the 'hashedEmail' column and additional unused columns were dropped (e.g. 'individualId' and 'organizationName').

Since this report requires classification and the 'subscribe' column is the categorical variable being predicted, it's important to have a balanced number of subscribed and non-subscribed observations. This keeps the classification model from consistently predicting the majority class because of an imbalance between the two classes. In the merged dataset, subscribed observations outnumbered non-subscribed ones. To address this, the sample function was used with replacement to upsample the non-subscribed observations until their number matched the number of subscribed observations.
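One caveat of upsampling before the train/test split is that copies of the same non-subscribed row can land in both sets, which can make test accuracy look better than it really is. The sketch below shows the alternative order (split first, then upsample only the training portion). It is an illustration on made-up data, not the analysis used in this report; `toy`, its values, and `random_state=1` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 12 subscribed rows and 4 non-subscribed rows.
toy = pd.DataFrame({
    "played_hours": [float(i) for i in range(16)],
    "subscribe": [True] * 12 + [False] * 4,
})

# Split first, so the test set contains only original, unduplicated rows.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy["subscribe"], random_state=1
)

# Upsample the minority class inside the training set only.
train_true = toy_train[toy_train["subscribe"]]
train_false = toy_train[~toy_train["subscribe"]]
train_false_up = train_false.sample(n=len(train_true), replace=True, random_state=1)
toy_train_balanced = pd.concat([train_false_up, train_true])
```

With this order the test set keeps the original class proportions, so accuracy on it reflects performance on unduplicated players.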

In [22]:
filtered_data = upsampled_merged[
    (upsampled_merged["session_length_hours"] <= 1.0) &
    (upsampled_merged["played_hours"] <= 20)]

players_sessions_plot_2 = alt.Chart(upsampled_merged, title='Relationship between Total Played Hours and Average Session Length (Hours) in the Merged Dataset').mark_circle(opacity=0.6).encode(
    x=alt.X("session_length_hours").title('Average Session Length (Hours)').scale(zero=False),
    y=alt.Y("played_hours").title('Total Played Hours').scale(zero=False),
    color = alt.Color("subscribe", title = 'Subscribed (True) or not Subscribed (False)', scale = alt.Scale(scheme="set1")),
    tooltip=["session_length_hours", "played_hours"])

players_sessions_plot_2_filtered = alt.Chart(filtered_data, title='Relationship between Total Played Hours and Average Session Length (Hours) in the Merged Dataset (Filtered)').mark_circle(opacity=0.6).encode(
    x=alt.X("session_length_hours").title('Average Session Length (Hours)').scale(zero=False),
    y=alt.Y("played_hours").title('Total Played Hours').scale(zero=False),
    color = alt.Color("subscribe", title = 'Subscribed (True) or not Subscribed (False)', scale = alt.Scale(scheme="set1")),
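    tooltip=["session_length_hours", "played_hours"])

# display the unfiltered (left) and filtered (right) plots side by side
players_sessions_plot_2 | players_sessions_plot_2_filtered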

Figure 1: Relationship between Total Played Hours and Average Session Length (Hours) in the Merged Dataset
-
The plots above display the relationship between total played hours and average session length for each player in the tidied and balanced merged dataset. The plot on the left (unfiltered data) includes all of the observations. One issue with this plot is that many points have similar total played hours (approximately 5 hours or less), so they overlap, which makes it difficult to see how many true and false observations are present. To help with interpretation, the plot on the right shows only observations with at most 20 total played hours and an average session length of at most 1 hour. This makes it easier to see which observations are true (subscribed) and which are false (not subscribed). The dense clustering indicates that most players have low total played hours and a low mean session time. A few significant outliers are present, with total played hours extending up to 240 hours. Despite these variations, there is no obvious relationship or trend apparent in the data. For instance, the points at the bottom right corner of the plot, where the average session time is greater, might be expected to all be subscribed; however, that is not the case.

In [4]:
# splitting data
players_sessions_train, players_sessions_test = train_test_split(
    upsampled_merged, 
    test_size=0.25,
    shuffle=True,
    stratify=upsampled_merged['subscribe']
) 

# creating preprocessor
players_sessions_preprocessor = make_column_transformer(
     (StandardScaler(), ["played_hours", "session_length_hours"]),
     remainder="passthrough",
     verbose_feature_names_out=False
 )

# create knn spec

players_knn = KNeighborsClassifier(n_neighbors=3)

X_train = players_sessions_train[["session_length_hours","played_hours"]]
y_train = players_sessions_train["subscribe"]

players_sessions_fit = make_pipeline(players_sessions_preprocessor, players_knn).fit(X_train,y_train)

# predictions
players_sessions_test_predictions = players_sessions_test.assign(
    predicted = players_sessions_fit.predict(
        players_sessions_test[["session_length_hours","played_hours"]]))

# accuracy
X_test = players_sessions_test[["session_length_hours","played_hours"]]
y_test = players_sessions_test["subscribe"]

players_prediction_accuracy = players_sessions_fit.score(X_test,y_test)
players_prediction_accuracy

0.8085106382978723

This block of code starts the classification process by splitting the class-balanced merged dataframe into training and testing sets. A preprocessor was made to standardize the predictor variables, and a classification model with K=3 was chosen arbitrarily as a baseline. This baseline will be compared with the optimal K value, determined by a GridSearchCV object, to evaluate their performance in predicting the observations. A pipeline was trained on the training set and used to predict the testing set, resulting in an accuracy of 81%.

In [5]:
# cross validation


players_sessions_pipe = make_pipeline(players_sessions_preprocessor, players_knn)

players_vfold_score = pd.DataFrame(
     cross_validate(
         estimator= players_sessions_pipe,
         cv=5,
         X = players_sessions_train[["session_length_hours","played_hours"]],
         y = players_sessions_train["subscribe"],
         return_train_score=True,
     )
)

#  Average of Metrics
players_metrics = players_vfold_score.agg(["mean","sem"])
players_metrics[['test_score']]

|      | test_score |
|------|------------|
| mean | 0.733598   |
| sem  | 0.037138   |


This block of code performs 5-fold cross-validation for the classification model using K=3. The test_score column reports the mean (the estimated accuracy) and the standard error (a measure of uncertainty around the mean) across the five folds. The KNN model with K=3 is therefore expected to correctly classify approximately 73% of unseen observations, and the standard error of about 3.7 percentage points places the estimate roughly between 70% and 77%. These cross-validation scores provide a baseline for a classification model with K=3 and can be used for comparison once the optimal K value is found, because they assess how well the model generalizes across different splits of the data and how consistent its performance is.
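The range of one standard error on either side of the mean can be checked directly from the two values reported in the output above. The variable names below are illustrative only.

```python
mean_accuracy = 0.733598  # mean test_score from the cross-validation output
sem_accuracy = 0.037138   # standard error of the mean from the same output

# One standard error below and above the estimated accuracy.
lower = mean_accuracy - sem_accuracy
upper = mean_accuracy + sem_accuracy

print(f"{lower:.1%} to {upper:.1%}")
```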

In [6]:
# selecting k value
param_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 15, 1),
}

# making tuning pipeline
players_tune_pipe = make_pipeline(players_sessions_preprocessor, KNeighborsClassifier())

# knn tuning grid
knn_tune_grid = GridSearchCV(
     players_tune_pipe, param_grid, cv=4,
)#if you want higher precision, add scoring='precision', so when we call best_params_ it will show precision

#knn model grid
knn_model_grid = knn_tune_grid.fit(X_train,y_train)

accuracies_grid= pd.DataFrame(knn_model_grid.cv_results_)
accuracies_grid

#plot accuracies grid
accuracy_versus_k_grid = alt.Chart(accuracies_grid, title='Estimated Accuracy of each K Value (Range: 2-14) in the Classification Model ').mark_line(point=True).encode(
     x=alt.X("param_kneighborsclassifier__n_neighbors")
         .title("Number of Neighbors")
         .scale(zero=False),
     y=alt.Y("mean_test_score")
         .title("Accuracy")
         .scale(zero=False)
 )

accuracy_versus_k_grid

Figure 2: Estimated Accuracy of each K Value (Range: 2-14) in the Classification Model 
-
This line plot shows the estimated accuracy of the classification model for each K value from 2 to 14, as determined by the GridSearchCV object. The plot reveals that K=2 has the highest accuracy, making it the optimal choice for the model within the tested range. Accuracy decreases as the number of neighbors increases, which suggests that a model using a larger K value makes more errors when predicting.
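Besides reading the best K off the plot, a fitted GridSearchCV object exposes it through `best_params_` and `best_score_`. The sketch below illustrates this on synthetic data generated with `make_classification`; `X_toy`, `y_toy`, and `toy_grid` are hypothetical names, and the resulting K has no bearing on the result reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for the real predictors.
X_toy, y_toy = make_classification(
    n_samples=120, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Same grid and fold count as the tuning cell, applied to the synthetic data.
toy_grid = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    {"kneighborsclassifier__n_neighbors": range(2, 15)},
    cv=4,
).fit(X_toy, y_toy)

# K with the highest mean cross-validated accuracy, and that accuracy.
best_k = toy_grid.best_params_["kneighborsclassifier__n_neighbors"]
best_cv_accuracy = toy_grid.best_score_
print(best_k, round(best_cv_accuracy, 3))
```

Reading the value programmatically avoids misjudging close scores by eye when two K values are nearly tied.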

In [7]:
knn_2=KNeighborsClassifier(n_neighbors=2)

players_session_fit2=make_pipeline(players_sessions_preprocessor, knn_2).fit(X_train,y_train)

players_session_predict=players_sessions_test.assign(
    predicted=players_session_fit2.predict(
        players_sessions_test[['session_length_hours','played_hours']])
)
players_session_predict.head(15)

|     | hashedEmail | session_length_hours | subscribe | played_hours | predicted |
|-----|-------------|----------------------|-----------|--------------|-----------|
| 121 | fcab03c6d3079521e7f9665caed0f31fe3dae6b5ccb86e... | 1.333333 | True | 1.2 | True |
| 101 | e21a324ccf5c873bafe82e47d5137b36aa312ee4803eeb... | 0.1 | True | 0.0 | False |
| 1 | 060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe... | 0.5 | False | 0.4 | False |
| 28 | 5a340c0e3d1aa3e579efc625bd3e5bca7fc25f7115b68e... | 0.275 | True | 0.4 | True |
| 88 | ca20f724571080b997e0efa874b9611e9f280c1af5f68f... | 0.35 | True | 0.2 | True |
| 35 | 6fa105fac7f4f37350f21830db78cde153d8edda41d6f4... | 0.116667 | False | 0.1 | False |
| 60 | 90f1495942837b1cde67cc9e3119421e38183502a4c6de... | 0.266667 | True | 0.2 | True |
| 52 | 88247d9a46fc214a12485dcbcbb03a8ddebfe8c1ec5fe2... | 0.327778 | False | 1.4 | False |
| 92 | d3ca24e4d7fe8ffe2a821ebd3b841252950ca53e4d659a... | 0.233333 | False | 0.1 | False |
| 27 | 577aa5f15468252b1c6f32dcd515012923476292e30f95... | 0.166667 | True | 0.1 | False |


In [8]:
players_session_accuracy = players_session_fit2.score(X_test,y_test)
players_session_accuracy

0.851063829787234

These two blocks of code create and test a new KNN classification model using the optimal K value of 2. A new pipeline was made using the new model and trained with the training data set. A new dataframe was made which includes a new column 'predicted', containing the model's predictions for the test set. We see that the accuracy of the predictions when using K=2 is 85% compared to 81% when using K=3 previously. This means that the classifier makes fewer overall mistakes when classifying observations.

Make a Confusion Matrix
-

In [9]:
players_session_mat=pd.crosstab(
    players_session_predict['subscribe'],
    players_session_predict['predicted'],
)

players_session_mat

| subscribe (rows) / predicted (columns) | False | True |
|----------------------------------------|-------|------|
| False                                  | 23    | 0    |
| True                                   | 7     | 17   |


We see from this confusion matrix that the classifier incorrectly predicted 7 subscribed observations as not subscribed and no non-subscribed observations as subscribed. We are more interested in identifying observations that are subscribed, making this the positive label, while non-subscribed observations are treated as the negative label.

As a result of these predictions, the precision of the classifier when using K=2 is 100% (17/17) and the recall is 71% (17/(7+17)). Therefore, every observation that is predicted as subscribed is correct. However, the classifier identifies only 71% of the actual positive observations and misses the remaining 29%. This means that the classifier produces a substantial number of false negatives (i.e., observations that are subscribed but are incorrectly classified as not subscribed).
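These figures can be reproduced from the four counts in the confusion matrix above, with subscribed as the positive label. The variable names are illustrative only.

```python
# Counts read from the confusion matrix (positive label: subscribed).
true_positives = 17   # subscribed, predicted subscribed
false_negatives = 7   # subscribed, predicted not subscribed
false_positives = 0   # not subscribed, predicted subscribed
true_negatives = 23   # not subscribed, predicted not subscribed

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / (
    true_positives + false_negatives + false_positives + true_negatives
)

print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.3f}")
```

The accuracy computed this way (40 of 47 correct) matches the 0.851 score reported for the K=2 model.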

The lower recall could mean that the classifier is conservative when predicting the positive class (subscribed), as it avoids predicting subscribed unless it is very certain, which explains the high precision, but it fails to catch all positive (subscribed) observations. 

In a real-life scenario, such as when game developers are predicting how many players will play their new game, prioritizing precision can maximize profit. A model with high precision ensures that the players predicted to play the game are likely to actually play. This would reduce the cost of additional marketing or resources wasted on targeting players who are unlikely to play. On the other hand, a model with higher recall would help game developers anticipate the majority of players likely to join the game. This would allow them to prepare servers capable of accommodating increased demand from players, which in turn would reduce the risk of the server crashing.

In [10]:
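# The data had to be filtered because outlying observations with very high values made the lower-value observations difficult to interpret.
# Note: players_sessions_test_predictions holds the predictions of the K=3 baseline pipeline from In [4]; players_session_predict holds the K=2 predictions.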

filtered_data_2 = players_sessions_test_predictions[
    (players_sessions_test_predictions["session_length_hours"] <= 1.2) &
    (players_sessions_test_predictions["played_hours"] <= 50)
]

test_plot = alt.Chart(filtered_data_2, title="Hours Played vs Session Length (Filtered)").mark_circle().encode(
    x=alt.X("session_length_hours", title="Session Length (Hours)"),
    y=alt.Y("played_hours", title="Hours Played"),
    # same color scheme as the predicted plot so that False/True colors match across panels
    color=alt.Color("subscribe", title="Subscribed (Yes or No)", scale=alt.Scale(scheme="set1")),
    tooltip=["session_length_hours", "played_hours"])

test_predicted_plot = alt.Chart(filtered_data_2, title="Hours Played vs Session Length (Predicted, Filtered)").mark_circle().encode(
    x=alt.X("session_length_hours", title="Session Length (Hours)"),
    y=alt.Y("played_hours", title="Hours Played"),
    color=alt.Color("predicted", title="Subscribed (Yes or No)", scale=alt.Scale(scheme="set1")),
    tooltip=["session_length_hours", "played_hours"])

test_plot | test_predicted_plot

Figure 3 and 4:
-
The plots above compare the actual labels in the upsampled test data (left) with the labels predicted by our model (right), with both plots filtered. From what we can see, the model does a good job of predicting whether or not an individual is subscribed based on the hours played and session length. Among the observations with low hours played and low session length, the overall pattern of red and blue (false and true) is consistent. In general, the few people with long session lengths who are not subscribed do not have many hours played, and their session lengths are on average lower than those of subscribed players. The cluster of points in the bottom left of both plots shows that subscribed individuals still have longer session lengths, though their hours played are about the same. This is shown by the blue points (true) lying further along the x-axis than the red points (false). The dataset was filtered because, similarly to Figure 1, many of the observations were densely clustered in the bottom left due to the hours played being very low on average. Filtering the data makes the plots easier to interpret.