# DSCI100 Project Final Report
### Players Subscription Predictive Data Analysis

### Group 29:
- Ysabel Maria Fleet - 13009485
- Sanjana Gopee - 59940676
- Simar Pandher - 14521397
- Olivia Kong - 72594369

### Introduction
This project is based on datasets provided by a research group in Computer Science at UBC, led by Frank Wood. The data describe which individuals play video games and how they play. They were collected by recording player actions on a Minecraft server, PLAIcraft, and by gathering information about each player's personal attributes. The data have been condensed into two datasets, **Players** and **Sessions**. Our group has chosen to answer Question 3 from the project criteria, which we have identified as a predictive classification question.

**Question 3**: We would like to know something about our populations of users, in particular, we would like to have a good model of whether or not a player will continue contributing given past participation. 

## Data Description

### Players Data
The players.csv file is a data set containing information about the players in the game. There are 196 observations with data about the players such as their experience, whether they subscribe to the game, their email, the number of hours played, their names, gender and age. These categories are split into the 9 variables (column names) below. 



**players.csv:**

|     Variable     |  Type   |                    Description                            |
|------------------|---------|-----------------------------------------------------------|
| experience       | String  | The level of expertise that the player has in the game. Possible values are "Amateur", "Beginner", "Regular", "Pro", "Veteran". This is a categorical variable.              |
| subscribe        | Boolean | Whether or not the player has subscribed to the game. The variable can only take the value "TRUE" or "FALSE", indicating "yes"  or "no" to whether they are subscribed.  |
| hashedEmail      | String  | A string of letters and numbers that stands in for the user's email (a hash), so the address itself is not exposed. This is a unique identifier for each player.                        |
| played_hours     | Float   | The played hours indicates the number of hours spent playing the game approximated to one decimal place.                               |
| name             | String  | This is the name (first name) of the player. This is probably not a unique identifier since two people could coincidentally have the same name.                                             |
| gender           | String  | Gender is a categorical variable which has the following possible values: "Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-spirited", "Other".                                        |
| age              | Integer | Player's age                                              |
| individualID     | N/A     | Individual ID of the player, values were not provided therefore the category is essentially useless in data analysis           |
| organizationName | N/A     | Organization name, values were not provided making this category useless just as the "IndividualID" column is.                  |

The final two columns, 'individualID' and 'organizationName', contain only null cells, so they carry no information and will be dropped. The columns 'name', 'gender', 'age', and 'experience' are not used as features in this classification analysis (all but 'age' are categorical), so they will also be dropped. This reduces the variables in use from 9 to 3.

### Sessions Data
The sessions.csv data set contains specific data about the playing sessions of the players in the game. There are 1535 observations in the sessions data set and 5 variables. Variables include the players' hashed email, start time, end time, original start time and original end time. Each row describes one session played by one player: most importantly, the date and time of the session, tied to the player's unique identifier.

**sessions.csv:**

|     Variable     |  Type   |                    Description                            |
|------------------|---------|-----------------------------------------------------------|
| hashedEmail      | String  | A string of letters and numbers that stands in for the user's email (a hash), so the address itself is not exposed. This is a unique identifier for each player.                       |
| start_time     | String   | This includes the date - in format DD/MM/YY - and the time the player started playing the game. The time is in 24-hour format.                               |
| end_time             | String  | This includes the date - in format DD/MM/YY - and the time the player stopped playing the game. The time is in 24-hour format.                                             |
| original_start_time           | Integer  | This variable is a 14-digit integer indicating the start time in Unix Time format.                                        |
| original_end_time                | Integer | This variable is a 14-digit integer indicating the end time in Unix Time format.                                               |

Unlike the players data, none of the variables in the sessions data are empty; every column provides information about the playing sessions. The two columns 'original_start_time' and 'original_end_time' are both in Unix time, which is not a convenient unit for this analysis, so these columns will be dropped. Both 'start_time' and 'end_time' hold two measurements per cell: the date and the time at which the session started or ended. These columns need to be tidied and wrangled from strings into a datetime type so that date and time can be handled appropriately.


## Methods

### _Summary_
Using the data available in **Players** and **Sessions**, the goal of this project is to predict player retention: whether a player will continue playing the game, based on the data from the two given data frames. The column "subscribe" is the categorical variable we are trying to predict, on the assumption that players who subscribe to the game will continue to play it. The two variables chosen to predict this are:

1. Total number of hours played
2. Average session time per player

These predictors were chosen based on two assumptions. The first is that a player with a higher total number of played hours is more likely to subscribe. The second is that a player with a higher average session time is more likely to subscribe. These assumptions were made because both quantities indicate past interest in the game, from which future interest can be inferred.

Predicting the value of the "subscribe" column requires a K-NN classification model with an optimal K value and the two quantitative predictor variables above. The data will be split into training and test sets, and the classifier's performance will be evaluated using accuracy, recall and precision. The optimal K value will be chosen through cross-validation.

### Reading Data:
> 1. Import Pandas and name as 'pd'.
> 2. Use a relative path to load in 'sessions.csv' data set, named as 'sessions_original'.
> 3. Use a relative path to load in 'players.csv' data set, named as 'players_original'.
> 4. Manually inspect each data set and use .info() to understand the contents of the data set, and create data descriptions (see above). Use descriptions and understanding to inform cleaning and wrangling.
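
The loading steps above can be sketched as follows. This is a minimal sketch rather than the notebook's exact code: the helper name `load_datasets` and the default relative paths `data/players.csv` and `data/sessions.csv` are assumptions made for illustration.

```python
import pandas as pd


def load_datasets(players_path="data/players.csv", sessions_path="data/sessions.csv"):
    """Read both CSV files and return (players_original, sessions_original).

    The default relative paths are assumptions; adjust them to the project layout.
    """
    players_original = pd.read_csv(players_path)
    sessions_original = pd.read_csv(sessions_path)
    return players_original, sessions_original
```

Calling `.info()` on each returned data frame lists every column's dtype and non-null count, which is one way the empty 'individualID' and 'organizationName' columns can be spotted.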

### Cleaning and Wrangling Data:

Sessions Data Frame:
> 1. Convert the 'start_time' and 'end_time' columns from a dtype object to datetime, specifying dayfirst as true. For clarity, name them 'start_time_final' and 'end_time_final' respectively.
> 2. Subtract the 'start_time_final' from the 'end_time_final' to calculate and create a 'session_length' column that has the session time per session per player.
> 3. Drop columns 'original_start_time' and 'original_end_time', and name the data frame 'sessions'.
> 4. Group by 'hashedEmail' and take the mean of 'session_length', yielding the average session time (in hours) per player. Name this column 'session_length_hours' and name the data frame 'sessions_group'.
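
A small sketch of these four steps is shown below. The rows are made up to mimic the layout of sessions.csv and are not project data. Converting each session length to hours before averaging is one possible way to arrive at 'session_length_hours', and is an assumption about the order of operations.

```python
import pandas as pd

# Made-up rows that mimic the layout of sessions.csv (not real project data).
sessions_original = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2"],
    "start_time": ["30/06/2024 18:00", "01/07/2024 09:00", "15/07/2024 22:30"],
    "end_time": ["30/06/2024 19:00", "01/07/2024 09:30", "15/07/2024 23:00"],
    "original_start_time": [1.71977e12, 1.71982e12, 1.72108e12],
    "original_end_time": [1.71977e12, 1.71982e12, 1.72108e12],
})

# Step 1: parse the day-first strings into datetime columns.
sessions_original["start_time_final"] = pd.to_datetime(
    sessions_original["start_time"], dayfirst=True
)
sessions_original["end_time_final"] = pd.to_datetime(
    sessions_original["end_time"], dayfirst=True
)

# Step 2: length of each session as a timedelta.
sessions_original["session_length"] = (
    sessions_original["end_time_final"] - sessions_original["start_time_final"]
)

# Step 3: drop the Unix-time columns.
sessions = sessions_original.drop(columns=["original_start_time", "original_end_time"])

# Step 4: convert each session to hours, then average per player.
sessions["session_length_hours"] = sessions["session_length"].dt.total_seconds() / 3600
sessions_group = sessions.groupby("hashedEmail", as_index=False)[
    "session_length_hours"
].mean()
```

Passing `dayfirst=True` matters for dates such as 01/07, which would otherwise be read as 7 January rather than 1 July.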

Merging Data Frames:
> 1. Merge 'players_original' with 'sessions_group' on the 'hashedEmail' column. Name the new data frame 'merged'.

Cleaning the Merged Data Frame:
> 1. From the 'merged' data frame, drop unwanted columns identified in the 'players_original' data frame and unwanted columns created when wrangling the 'sessions_original' data frame. Dropped columns: 'name', 'gender', 'age', 'individualID', 'organizationName', 'experience', and 'session_length'.
> 2. For clarity, name the data frame 'merged_final'.
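
The merge and the column clean-up can be sketched as below, using made-up rows in place of the real data frames. The default inner join is an assumption: it keeps only players who have at least one recorded session.

```python
import pandas as pd

# Made-up stand-ins for the two data frames (not real project data).
players_original = pd.DataFrame({
    "experience": ["Pro", "Amateur", "Veteran"],
    "subscribe": [True, False, True],
    "hashedEmail": ["a1", "b2", "c3"],
    "played_hours": [30.3, 0.5, 0.0],
    "name": ["Ada", "Ben", "Cy"],
    "gender": ["Female", "Male", "Non-binary"],
    "age": [21, 19, 23],
    "individualID": [None, None, None],
    "organizationName": [None, None, None],
})
sessions_group = pd.DataFrame({
    "hashedEmail": ["a1", "b2"],
    "session_length": pd.to_timedelta([45, 30], unit="min"),
    "session_length_hours": [0.75, 0.5],
})

# Inner join on the shared identifier; player "c3" has no sessions and drops out.
merged = players_original.merge(sessions_group, on="hashedEmail")

# Remove the columns that are not used as predictors or as the label.
merged_final = merged.drop(columns=[
    "name", "gender", "age", "individualID",
    "organizationName", "experience", "session_length",
])
```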

Balancing the Data Frame:
> 1. Address the class imbalance in the 'subscribe' column, the label to be predicted, by oversampling the rare class (False) in the 'merged_final' data frame.
> 2. Name the balanced data frame 'upsampled_merged'. This data frame should have 186 rows and 4 columns: 'hashedEmail', 'session_length_hours', 'subscribe', and 'played_hours'.
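
One way to oversample the rare class is `sklearn.utils.resample`, sketched below on made-up rows. The choice of `resample` and the `random_state` value are assumptions; the report does not state which tool was used.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up, imbalanced stand-in for merged_final: 6 subscribers, 2 non-subscribers.
merged_final = pd.DataFrame({
    "hashedEmail": [f"p{i}" for i in range(8)],
    "subscribe": [True] * 6 + [False] * 2,
    "played_hours": [3.0, 12.5, 0.4, 7.1, 1.2, 20.0, 0.3, 5.5],
    "session_length_hours": [0.5, 1.2, 0.2, 0.9, 0.4, 1.8, 0.1, 0.7],
})

majority = merged_final[merged_final["subscribe"]]
minority = merged_final[~merged_final["subscribe"]]

# Sample the rare class with replacement until it matches the majority count.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)

upsampled_merged = pd.concat([majority, minority_upsampled], ignore_index=True)
```

Because rows are duplicated before the data are split, copies of the same player can end up in both the training and the test set, which tends to make test accuracy look better than it is.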

### Exploratory Analysis:

Session Length and Played Hours Scatter Plot:
> 1. Import altair as 'alt'.
> 2. From the 'upsampled_merged' data frame, create a scatter plot for 'session_length_hours' on the x-axis and 'played_hours' on the y-axis. Assign color to the 'subscribe' column, to observe how the predicted label factors in the relationship between the variables.
> 3. From the visualisation, make note of any observations that may help with further analysis.  

### Data Analysis, Training and Predicting: 

Data Preprocessing:
> 1. From scikit-learn (sklearn) import all relevant commands.
> 2. Split the data into training and testing sets, called 'players_session_train' and 'players_session_test' respectively. The test size will be set to 25% to maximise the data available for training the model.
> 3. Create a preprocessor, ensuring to scale and center (standardise) the data by calling StandardScaler on the predictor variables.

Data Processing and Creating a Pipeline:
> 1. Create a K-NN model specification, setting k to 3 as an arbitrary baseline.
> 2. From 'players_session_train' name predictor variables 'X_train' and the response variable 'y_train'.
> 3. Create a pipeline with the preprocessor and K-NN model, and fit it to 'X_train' and 'y_train'.
> 4. Use the predict method of the fitted pipeline to make a prediction for a new observation.

Examining the Accuracy:
> 1. Examine the accuracy by using the score method with data from the 'players_session_test' data frame.
> 2. Analyse the accuracy of the pipeline, considering its significance in this application.
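
The baseline pipeline and its accuracy check can be sketched as follows. The data are synthetic, and the names `players_session_fit` and `new_player`, the new player's values, and the `stratify` and `random_state` settings are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for upsampled_merged (not real project data).
rng = np.random.default_rng(1)
n = 20
upsampled_merged = pd.DataFrame({
    "played_hours": np.concatenate([rng.uniform(0, 5, n), rng.uniform(3, 20, n)]),
    "session_length_hours": np.concatenate([rng.uniform(0.1, 1, n), rng.uniform(0.5, 2, n)]),
    "subscribe": [False] * n + [True] * n,
})
predictors = ["played_hours", "session_length_hours"]

players_session_train, players_session_test = train_test_split(
    upsampled_merged, test_size=0.25,
    stratify=upsampled_merged["subscribe"], random_state=1,
)
preprocessor = make_column_transformer((StandardScaler(), predictors))

# Baseline K-NN with k = 3, chained after the preprocessor.
knn = KNeighborsClassifier(n_neighbors=3)
X_train = players_session_train[predictors]
y_train = players_session_train["subscribe"]
players_session_fit = make_pipeline(preprocessor, knn)
players_session_fit.fit(X_train, y_train)

# Predict the label of one hypothetical new player.
new_player = pd.DataFrame({"played_hours": [5.0], "session_length_hours": [1.0]})
prediction = players_session_fit.predict(new_player)

# Accuracy on the held-out test set.
accuracy = players_session_fit.score(
    players_session_test[predictors], players_session_test["subscribe"]
)
```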

### Data Analysis, Evaluation and Tuning:

Cross Validation and Tuning:
> 1. From sklearn.model_selection, import GridSearchCV.
> 2. Call get_params() on the pipeline to identify parameter values.
> 3. Construct a parameter_grid, using a range informed by the parameter values.
> 4. Make a tuning pipeline, calling the preprocessor and KNeighborsClassifier.
> 5. Create a K-NN tuning grid by calling GridSearchCV. Then fit it to the training data predictors and labels, naming this the K-NN model grid.
> 6. Create an accuracies grid by wrapping cv_results_ in a data frame.
> 7. Plot the accuracies grid as a chart called 'accuracy_versus_k_grid', with the accuracy estimate on the y-axis and the number of neighbours (K) on the x-axis. Ensure data points are layered on the line chart.
> 8. Based on the plot, select the optimal value of K, whilst critically considering the result of the tuning for the application.
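
The tuning steps can be sketched as below on synthetic data. The candidate range of 1 to 10 neighbours, the 5 folds, and the names `tune_pipe`, `knn_model_grid` and `best_k` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for upsampled_merged (not real project data).
rng = np.random.default_rng(1)
n = 20
upsampled_merged = pd.DataFrame({
    "played_hours": np.concatenate([rng.uniform(0, 5, n), rng.uniform(3, 20, n)]),
    "session_length_hours": np.concatenate([rng.uniform(0.1, 1, n), rng.uniform(0.5, 2, n)]),
    "subscribe": [False] * n + [True] * n,
})
predictors = ["played_hours", "session_length_hours"]
players_session_train, players_session_test = train_test_split(
    upsampled_merged, test_size=0.25,
    stratify=upsampled_merged["subscribe"], random_state=1,
)
X_train = players_session_train[predictors]
y_train = players_session_train["subscribe"]

preprocessor = make_column_transformer((StandardScaler(), predictors))
tune_pipe = make_pipeline(preprocessor, KNeighborsClassifier())

# get_params() exposes the tunable names, including the K-NN neighbour count.
param_names = [name for name in tune_pipe.get_params() if name.endswith("n_neighbors")]

# Candidate K values (assumed range) keyed by the pipeline parameter name.
parameter_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 11))}

knn_model_grid = GridSearchCV(estimator=tune_pipe, param_grid=parameter_grid, cv=5)
knn_model_grid.fit(X_train, y_train)

# One row per candidate K with its mean cross-validated accuracy.
accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
]
best_k = knn_model_grid.best_params_["kneighborsclassifier__n_neighbors"]
```

The 'accuracy_versus_k_grid' chart can then be built from `accuracies_grid` with Altair, for example with `mark_line(point=True)` so that the points are layered on the line.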

### Final Model Training and Evaluation:

Retraining the K-NN Model for Test Data:
> 1. Retrain the K-NN classifier, using the optimal value of K. For clarity, name this classifier 'knn_2'.
> 2. Make a new pipeline 'players_session_fit2' that combines the preprocessor and the 'knn_2' classifier.
> 3. Create a new data frame called 'players_session_predict', creating a new column 'predicted' using the 'players_session_fit2' pipeline and data of predictor variables from 'players_session_test'.

Examining the Accuracy:
> 1. Examine the accuracy by using the score method with data from the 'players_session_test' data frame.
> 2. Analyse the accuracy of the pipeline, considering its significance in this application.
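
Retraining and scoring the final model can be sketched as follows. The data are synthetic and `best_k = 5` is a placeholder, not the value selected in this project; the `stratify` and `random_state` settings are also assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for upsampled_merged (not real project data).
rng = np.random.default_rng(1)
n = 20
upsampled_merged = pd.DataFrame({
    "played_hours": np.concatenate([rng.uniform(0, 5, n), rng.uniform(3, 20, n)]),
    "session_length_hours": np.concatenate([rng.uniform(0.1, 1, n), rng.uniform(0.5, 2, n)]),
    "subscribe": [False] * n + [True] * n,
})
predictors = ["played_hours", "session_length_hours"]
players_session_train, players_session_test = train_test_split(
    upsampled_merged, test_size=0.25,
    stratify=upsampled_merged["subscribe"], random_state=1,
)
preprocessor = make_column_transformer((StandardScaler(), predictors))

best_k = 5  # placeholder: use the K chosen from the tuning plot

knn_2 = KNeighborsClassifier(n_neighbors=best_k)
players_session_fit2 = make_pipeline(preprocessor, knn_2)
players_session_fit2.fit(
    players_session_train[predictors], players_session_train["subscribe"]
)

# Test rows with an extra 'predicted' column next to the true 'subscribe' label.
players_session_predict = players_session_test.assign(
    predicted=players_session_fit2.predict(players_session_test[predictors])
)

# Fraction of test rows whose predicted label matches the true label.
test_accuracy = players_session_fit2.score(
    players_session_test[predictors], players_session_test["subscribe"]
)
```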

Visualise and Summarise Performance with a Confusion Matrix:
> 1. Perform a 5-fold cross-validation using the cross_validate function.
> 2. Aggregate the mean and standard error of the classifier's validation accuracy across folds.
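
The two cross-validation steps can be sketched as below on synthetic data. The value `n_neighbors=5` is a placeholder for the tuned K, and the names `cv_results` and `cv_summary` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not real project data).
rng = np.random.default_rng(1)
n = 20
train = pd.DataFrame({
    "played_hours": np.concatenate([rng.uniform(0, 5, n), rng.uniform(3, 20, n)]),
    "session_length_hours": np.concatenate([rng.uniform(0.1, 1, n), rng.uniform(0.5, 2, n)]),
    "subscribe": [False] * n + [True] * n,
})
predictors = ["played_hours", "session_length_hours"]

pipeline = make_pipeline(
    make_column_transformer((StandardScaler(), predictors)),
    KNeighborsClassifier(n_neighbors=5),  # placeholder for the tuned K
)

# One row per fold; 'test_score' holds the validation accuracy of that fold.
cv_results = pd.DataFrame(
    cross_validate(pipeline, train[predictors], train["subscribe"], cv=5)
)

# Mean and standard error of the validation accuracy across the 5 folds.
cv_summary = cv_results["test_score"].agg(["mean", "sem"])
```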


Computing Precision and Recall:
> 1. Use the precision_score function, passing the true 'subscribe' labels as the y_true argument, the predicted labels as y_pred, and the 'True' label as pos_label.
> 2. Use the recall_score function, passing the true 'subscribe' labels as the y_true argument, the predicted labels as y_pred, and the 'True' label as pos_label.
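
A short sketch of the precision and recall calculation, together with the confusion matrix they are derived from, is given below. The 'subscribe' and 'predicted' values are made up and do not reflect the project's results.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels standing in for players_session_predict.
players_session_predict = pd.DataFrame({
    "subscribe": [True, True, True, False, False, True, False, False],
    "predicted": [True, False, True, False, True, True, False, False],
})
y_true = players_session_predict["subscribe"]
y_pred = players_session_predict["predicted"]

# Precision: of the players predicted to subscribe, the share who actually did.
precision = precision_score(y_true, y_pred, pos_label=True)

# Recall: of the players who actually subscribed, the share the model found.
recall = recall_score(y_true, y_pred, pos_label=True)

# Rows are true labels and columns are predicted labels, ordered False then True.
matrix = confusion_matrix(y_true, y_pred, labels=[False, True])
```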

Create Prediction Plots with Test Data:
> 1. Generate a scatter plot of 'played_hours' vs 'session_length_hours' that functions as a colored prediction map visualising the predicted labels.
> 2. Analyse how effectively the colored prediction map addresses the classification question: identifying something about our population of users, to predict whether a player will continue contributing given past participation.

Critically Analyzing Performance:
> 1. Using the context of the problem and the data, analyze the performance of the K-NN classification model. Consider the precision-recall trade-off, and acknowledge both the accuracy and the confusion matrix.

## Discussion

### Summary of Findings:
- No clear trend links the two predictors to subscription.
- The only trend is that session length increases as played hours increase; however, scoring higher on both does not make a player more likely to subscribe.


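### Comparison of Findings to Expectations:
- The findings do not match our expectations, namely the assumptions that a higher total number of played hours and a higher average session time make a player more likely to subscribe.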
  
### Impact of Findings:
- Reevaluation of the correlation between the predictor variables and the predicted label.
- Reevaluation of the labels and their significance.
- Reference the question and the study.
  
### Future Questions:
- Which variables may serve as better predictors in addressing the question?
- Does past participation have a strong correlation with future contribution?
