# DSCI100 Project Final Report
### Players Subscription Predictive Data Analysis

### Group 29:
- Ysabel Maria
- Sanjana Gopee - 59940676
- Simar Pandher - 14521397
- Olivia Kong - 72594369

## Instructions for Final Report
- Title **(Done)**

- Introduction: Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report, clearly state the question you tried to answer with your project, and identify and fully describe the dataset that was used to answer the question **(Done)**

- Methods & Results: Describe the methods you used to perform your analysis from beginning to end that narrates the analysis code. **(In-progress)**

Your report should include code which:

1. Loads data 
2. Wrangles and cleans the data to the format necessary for the planned analysis
3. Performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
4. Creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
5. Performs the data analysis
6. Creates a visualization of the analysis 

Note: All figures should have a figure number and a legend

- Discussion:
1. Summarize what you found
2. Discuss whether this is what you expected to find?
3. Discuss what impact could such findings have?
4. Discuss what future questions could this lead to?

- References
You may include references if necessary, as long as they all have a consistent citation style.

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix



alt.data_transformers.enable('vegafusion')

set_config(transform_output="pandas")

### Introduction
This project uses datasets provided by a Computer Science research group at UBC that studies how individuals play video games. The group recorded player actions on a Minecraft server and collected information about each individual as well as how they play. This data has been condensed into two datasets, **Players** and **Sessions**. Our group chose to answer Question 3 from the project criteria.

**Question 3**: We would like to know something about our populations of users, in particular, we would like to have a good model of whether or not a player will continue contributing given past participation. 

## Data Description

## Players Data
The players.csv file is a data set containing information about the players in the game. There are 196 observations, each describing one player: their experience level, whether they subscribe, their hashed email, the number of hours played, their name, gender, and age. This information is stored in the 9 variables (columns) below.

  


**Players.csv:**

|     Variable     |  Type   |                    Description                            |
|------------------|---------|-----------------------------------------------------------|
| experience       | String  | The level of expertise that the player has in the game. Possible values are "Amateur", "Beginner", "Regular", "Pro", "Veteran". This is a categorical variable.              |
| subscribe        | Boolean | Whether or not the player has subscribed to the game. The variable can only take the value "TRUE" or "FALSE", indicating "yes"  or "no" to whether they are subscribed.  |
| hashedEmail      | String  | This is a string of letters and numbers to encrypt the email of the user. This is a unique identifier for each player.                        |
| played_hours     | Float   | The played hours indicates the number of hours spent playing the game approximated to one decimal place.                               |
| name             | String  | This is the name (first name) of the player. This is probably not a unique identifier since two people could coincidentally have the same name.                                             |
| gender           | String  | Gender is a categorical variable which has the following possible values: "Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-spirited", "Other".                                        |
| age              | Integer | Player's age                                              |
| individualId     | N/A     | Individual ID of the player. No values were provided, so this column cannot be used in the analysis.           |
| organizationName | N/A     | Organization name. No values were provided, so this column cannot be used either, just like "individualId".                  |

The final two columns are empty and will therefore be dropped, which reduces the usable variables from 9 to 7.

## Sessions Data
The sessions.csv data set contains data about the individual playing sessions of the players in the game. There are 1535 observations, one per session. Its variables are the player's hashed email, start time, end time, original start time, and original end time.

**Sessions.csv:**
|     Variable     |  Type   |                    Description                            |
|------------------|---------|-----------------------------------------------------------|
| hashedEmail      | String  | This is a string of letters and numbers to encrypt the email of the user. This is a unique identifier for each player.                       |
| start_time     | String   | This includes the date - in format DD/MM/YY - and the time the player started playing the game. The time is in 24-hour format.                               |
| end_time             | String  | This includes the date - in format DD/MM/YY - and the time the player stopped playing the game. The time is in 24-hour format.                                             |
| original_start_time           | Integer  | This variable is a 14-digit integer indicating the start time.                                           |
| original_end_time                | Integer | This variable is a 14-digit integer indicating the end time.                                              |

Unlike the players data, none of the columns in the sessions data are empty; every variable provides information about the playing sessions.


## Method

### Summary
Using the **Players** and **Sessions** data, the goal of this project is to predict player retention, that is, whether a player will continue playing the game. The column "subscribe" is the categorical variable we are trying to predict, based on our assumption that a player who subscribes to the game will continue to play it. The two variables chosen to predict this are:

1. Total number of hours played
2. Average session time per player

This requires a KNN classification model that uses the two quantitative predictors above. The data is split into training and test sets, the value of K is chosen through cross-validation on the training set, and the classifier's performance is evaluated on the test set using accuracy, recall, and precision.

### Detailed Steps [INCOMPLETE - Still need to write more]
- **Data Wrangling**:

1. Converting dtype object to datetime
2. Groupby() and mean
3. Merging the two data sets on `hashedEmail`

- **Exploratory Analysis**: explain
- **Data Analysis**: explain
- **Visualization of Analysis**: explain
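The three wrangling steps listed above can be sketched on a small toy example. This is only an illustration under assumptions: the `toy_sessions` and `toy_players` frames and all of their values are made up, and the date format string is chosen to match the toy strings rather than the real files.

```python
import pandas as pd

# Hypothetical stand-ins for sessions.csv and players.csv (made-up values).
toy_sessions = pd.DataFrame({
    "hashedEmail": ["a", "a", "b"],
    "start_time": ["01/05/2024 10:00", "02/05/2024 18:30", "03/05/2024 09:00"],
    "end_time": ["01/05/2024 10:30", "02/05/2024 20:00", "03/05/2024 09:15"],
})
toy_players = pd.DataFrame({
    "hashedEmail": ["a", "b", "c"],
    "subscribe": [True, False, True],
    "played_hours": [2.0, 0.3, 0.0],
})

# 1. Convert the string columns to datetime (day first, 24-hour clock).
for col in ["start_time", "end_time"]:
    toy_sessions[col] = pd.to_datetime(toy_sessions[col], format="%d/%m/%Y %H:%M")

# 2. Compute each session's length in hours, then average per player.
toy_sessions["session_length_hours"] = (
    (toy_sessions["end_time"] - toy_sessions["start_time"]).dt.total_seconds() / 3600
)
toy_mean = (
    toy_sessions.groupby("hashedEmail")["session_length_hours"].mean().reset_index()
)

# 3. Inner merge on hashedEmail: player "c" has no sessions, so it is dropped.
toy_merged = toy_mean.merge(toy_players, on="hashedEmail")
print(toy_merged)
```

Because `merge` defaults to an inner join, players without any recorded session do not appear in the merged table, so the analysis only covers players who played at least once.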

## Analysis

In [2]:
np.random.seed(2020)

# reading data file
sessions_original=pd.read_csv('data/sessions.csv')
sessions_original

players_original = pd.read_csv("data/players.csv")

# converting str to datetime
sessions_original['start_time_final']=pd.to_datetime(sessions_original["start_time"], dayfirst=True)
sessions_original['end_time_final']=pd.to_datetime(sessions_original["end_time"], dayfirst=True)

# calculating session length
sessions_original["session_length"] = sessions_original["end_time_final"]-sessions_original["start_time_final"]

# tidying
sessions=sessions_original.drop(columns=['original_start_time','original_end_time', 'start_time', 'end_time'])
sessions

# grouping and calculating mean
sessions_group=sessions.groupby('hashedEmail')["session_length"].mean().reset_index()
sessions_group["session_length_hours"] = sessions_group["session_length"].dt.total_seconds()/3600

# merging data sets (dropped rows without any sessions)
merged=sessions_group.merge(players_original, on='hashedEmail')
merged_final = merged.drop(columns=["session_length", "individualId", "organizationName", "name", "gender","age","experience"])

# correcting class imbalance by upsampling the minority class (subscribe == False)

subscribe_false = merged_final[merged_final["subscribe"] == False]
subscribe_true = merged_final[merged_final["subscribe"] == True]

# sample with replacement until both classes have the same number of rows
subscribe_false_upsample = subscribe_false.sample(
    n=subscribe_true.shape[0], replace=True
)

upsampled_merged = pd.concat((subscribe_false_upsample, subscribe_true))
upsampled_merged["subscribe"].value_counts()  # check that the classes are now balanced

subscribe
False    93
True     93
Name: count, dtype: int64

In [3]:
upsampled_merged

Unnamed: 0,hashedEmail,session_length_hours,subscribe,played_hours
1,060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe...,0.500000,False,0.4
31,5c27e8b9fed2816b006dc8397ec04470b59339fd591a46...,0.950000,False,0.9
9,1d2371d8a35c8831034b25bda8764539ab7db0f6393869...,0.083333,False,0.0
71,a2a0612e9a7da558cbac2ee3c816740324505a69a6e042...,0.100000,False,0.0
9,1d2371d8a35c8831034b25bda8764539ab7db0f6393869...,0.083333,False,0.0
...,...,...,...,...
120,fc0224c81384770e93ca717f32713960144bf0b52ff676...,0.266667,True,0.2
121,fcab03c6d3079521e7f9665caed0f31fe3dae6b5ccb86e...,1.333333,True,1.2
122,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,0.257796,True,56.1
123,fe218a05c6c3fc6326f4f151e8cb75a2a9fa29e22b110d...,0.150000,True,0.1


In [4]:
players_sessions_plot_2 = alt.Chart(upsampled_merged).mark_point().encode(
    x = "session_length_hours",
    y = "played_hours",
    color = alt.Color("subscribe")
)

In [5]:
players_sessions_plot_2

In [6]:
np.random.seed(2020)

# splitting data
players_sessions_train, players_sessions_test = train_test_split(
    upsampled_merged, 
    test_size=0.25, 
    # shuffle=True,
    # stratify=upsampled_merged['subscribe']
)  # no random_state is passed; the split is reproducible through np.random.seed(2020) above

# creating preprocessor
players_sessions_preprocessor = make_column_transformer(
     (StandardScaler(), ["played_hours", "session_length_hours"]),
     remainder="passthrough",
     verbose_feature_names_out=False
 )

# create knn spec

players_knn = KNeighborsClassifier(n_neighbors=3)

X_train = players_sessions_train[["session_length_hours","played_hours"]]
y_train = players_sessions_train["subscribe"]

players_sessions_fit = make_pipeline(players_sessions_preprocessor, players_knn).fit(X_train,y_train)

# predictions
players_sessions_test_predictions = players_sessions_test.assign(
    predicted = players_sessions_fit.predict(
        players_sessions_test[["session_length_hours","played_hours"]]))

# accuracy
X_test = players_sessions_test[["session_length_hours","played_hours"]]
y_test = players_sessions_test["subscribe"]

players_prediction_accuracy = players_sessions_fit.score(X_test,y_test)
players_prediction_accuracy

0.7872340425531915

In [7]:
# cross validation

# np.random.seed(2020)  # DO NOT REMOVE

# players_sessions_pipe = make_pipeline(players_sessions_preprocessor, players_knn)

# players_vfold_score = pd.DataFrame(
#      cross_validate(
#          estimator= players_sessions_pipe,
#          cv=5,
#          X = players_sessions_train[["session_length_hours","played_hours"]],
#          y = players_sessions_train["subscribe"],
#          return_train_score=True,
#      )
#  )

# players_vfold_score

# # Average of Metrics
# players_metrics = players_vfold_score.agg(["mean","sem"])
# players_metrics

In [8]:
#np.random.seed(2020)

# selecting k value
param_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 15, 1),
}

# making tuning pipeline
players_tune_pipe = make_pipeline(players_sessions_preprocessor, KNeighborsClassifier())

# knn tuning grid
knn_tune_grid = GridSearchCV(
     players_tune_pipe, param_grid, cv=4,
)

#knn model grid
knn_model_grid = knn_tune_grid.fit(X_train,y_train)

accuracies_grid= pd.DataFrame(knn_model_grid.cv_results_)
accuracies_grid

#plot accuracies grid
accuracy_versus_k_grid = alt.Chart(accuracies_grid).mark_line(point=True).encode(
     x=alt.X("param_kneighborsclassifier__n_neighbors")
         .title("Number of Neighbors")
         .scale(zero=False),
     y=alt.Y("mean_test_score")
         .title("Accuracy")
         .scale(zero=False)
 )

accuracy_versus_k_grid

## Conclusion from Cross Validation
From the plot above, K=2 gives the highest cross-validation accuracy estimate, and larger K values give lower estimates.
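Reading the best K off a plot can be cross-checked programmatically. The sketch below is an illustration on synthetic data, not the project's data: `X_demo`, `y_demo`, `demo_grid`, and `cv_table` are hypothetical names, and the grid mirrors the `range(2, 15)` and `cv=4` settings used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature, two-class data standing in for the real predictors.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Same structure as the tuning pipeline: scale, then KNN, tuned over K.
demo_grid = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    {"kneighborsclassifier__n_neighbors": range(2, 15)},
    cv=4,
).fit(X_demo, y_demo)

# best_params_ holds the K with the highest mean cross-validation accuracy.
best_k = demo_grid.best_params_["kneighborsclassifier__n_neighbors"]

# The per-K mean and spread show how close the candidates are to each other.
cv_table = pd.DataFrame(demo_grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]
]
print(best_k)
print(cv_table)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the winning K is clearly better or only marginally ahead of its neighbours. With two classes, an even K can also produce tied votes, which is worth keeping in mind when the top candidate is K=2.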

In [9]:
knn_2=KNeighborsClassifier(n_neighbors=2)

players_session_fit2=make_pipeline(players_sessions_preprocessor, knn_2).fit(X_train,y_train)

players_session_predict=players_sessions_test.assign(
    predicted=players_session_fit2.predict(
        players_sessions_test[['session_length_hours','played_hours']])
)
players_session_predict.head(15)

Unnamed: 0,hashedEmail,session_length_hours,subscribe,played_hours,predicted
54,8b71f4d66a38389b7528bb38ba6eb71157733df7d17403...,0.216667,True,0.1,True
19,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,0.496538,True,53.9,True
22,42eafe96ed5c1684e3b5cc614d1b01a117173d3ec6898a...,0.089394,False,0.3,False
71,a2a0612e9a7da558cbac2ee3c816740324505a69a6e042...,0.1,False,0.0,False
31,5c27e8b9fed2816b006dc8397ec04470b59339fd591a46...,0.95,False,0.9,False
48,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.15,True,0.1,True
83,bda1905e54b6c745bcced9d59ce655a5bfd03c35cc6abd...,0.083333,True,0.0,False
17,2cfed571797b66cc810c32562fc5b0f70b5bec0f525079...,0.1,True,0.0,False
80,ba9b943f7fd9b5078b20ece741c4cb6435418d5d8373b4...,0.166667,False,0.1,False
35,6fa105fac7f4f37350f21830db78cde153d8edda41d6f4...,0.116667,False,0.1,False


In [10]:
players_session_accuracy = players_session_fit2.score(X_train,y_train)
players_session_accuracy

0.8920863309352518

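The score above is computed on the training set (`X_train`, `y_train`), so the 89% it reports is training accuracy and overstates how well K=2 generalizes. On the test set, the confusion matrix below gives an accuracy of 74% (35 of 47 correct), compared with 79% (37 of 47) for K=3 earlier. Although K=2 had the highest cross-validation accuracy, it therefore does not make fewer mistakes than K=3 on this test set.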

## Confusion Matrix

In [11]:
players_session_mat=pd.crosstab(
    players_session_predict['subscribe'],
    players_session_predict['predicted'],
)

players_session_mat

predicted,False,True
subscribe,Unnamed: 1_level_1,Unnamed: 2_level_1
False,20,1
True,11,15


We see from this confusion matrix that the classifier incorrectly predicted 11 subscribed observations as not subscribed, and 1 non-subscribed observation as subscribed. We are more interested in identifying players who are subscribed, so "subscribed" is treated as the positive label and "not subscribed" as the negative label.

As a result of these predictions, the precision of the classifier when using K=2 is 94% (15/16) and the recall is 58% (15/(11+15)). Therefore, almost every observation that is predicted as subscribed is correct, with only one false positive. However, the classifier only identifies 58% of the actual positive observations and misses the remaining 42%. This means that the classifier makes a substantial number of false negatives (i.e., observations that are subscribed are incorrectly classified as not subscribed).

The lower recall could mean that the classifier is conservative when predicting the positive class (subscribed), as it avoids predicting subscribed unless it is very certain, which explains the high precision, but it fails to catch all positive (subscribed) observations. 
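The precision, recall, and test accuracy quoted above follow directly from the four cells of the confusion matrix. As a quick check, the arithmetic can be written out as below; the variable names are ours, and the counts are copied from the matrix with "subscribed" as the positive label.

```python
# Counts from the confusion matrix (rows: actual, columns: predicted).
tn, fp = 20, 1   # actual not subscribed: predicted False / predicted True
fn, tp = 11, 15  # actual subscribed:     predicted False / predicted True

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all test observations classified correctly
precision = tp / (tp + fp)     # share of predicted subscribers that really subscribe
recall = tp / (tp + fn)        # share of actual subscribers that were found

print(f"n={total}, accuracy={accuracy:.2%}, precision={precision:.2%}, recall={recall:.2%}")
```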

In [12]:
# actual subscription labels on the test set (for comparison with the predictions)
test_actual_plot=alt.Chart(players_sessions_test).mark_circle().encode(
    x=alt.X('session_length_hours').title('Session Length (hours)'),
    y=alt.Y('played_hours').title('Total Played Hours'),
    color=alt.Color('subscribe').title('Subscribed: Yes or No')
)
test_actual_plot

In [13]:
test_predicted_plot=alt.Chart(players_session_predict).mark_circle().encode(
    x=alt.X('session_length_hours').title('Session Length (hours)'),
    y=alt.Y('played_hours').title('Total Played Hours'),
    color=alt.Color('predicted').title('Predicted Subscribed: Yes or No')
)
test_predicted_plot