GitHub Repo: https://github.com/irealyash/Individual-Planning/tree/master

Overview:

This project uses data collected from a Minecraft research server hosted by a UBC research group. The dataset is divided into two files:

1. sessions.csv
2. players.csv


In [None]:
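# Load the packages used throughout the notebook
library(repr)
library(tidyverse)
library(tidymodels)

# Read both data files (paths are relative to the notebook)
sessions <- read_csv("sessions.csv")
players <- read_csv("players.csv")

# Preview the first five rows of each table
sessions |>
  head(5)
players |>
  head(5)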

Data Summary:

1. sessions.csv: contains the hashed email of each user and the start and end time of each session, each recorded in two different formats.
2. players.csv: contains each player's experience, newsletter subscription status, hashed email, total hours played, name, gender, and age.

Potential Issues:

1. The data were collected from gameplay logs, which may not capture when a player was offline, so that time may still be counted in the total hours played.
2. Some columns may contain missing values, which need to be handled before analysis.
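
The second issue can be checked directly. The sketch below counts missing values per column on a small made-up table. `played_hours` and `gender` are columns used elsewhere in this notebook, while `age` and all of the values are assumptions for illustration only.

```r
library(dplyr)

# Made-up stand-in for players.csv; the `age` column name and all values
# are assumptions for illustration, not taken from the real file.
toy_players <- tibble(
  played_hours = c(0.5, NA, 3.2),
  age = c(17, 21, NA),
  gender = c("Male", NA, NA)
)

# Count the NA values in every column
missing_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))
missing_counts
```

The same `summarise(across(...))` call could be applied to `players` and `sessions` once they are loaded.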

Broad Question :-

What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Question :-

Can a player’s total playtime, number of sessions, and average session length predict whether they subscribe to the newsletter?

Understanding the characteristics and playing style of players who have subscribed to the newsletter will help in designing recruitment strategies and understanding engagement. Players who are more active and consistent are probably more interested in content beyond the game itself.
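
Two of the predictors in the specific question, number of sessions and average session length, are not stored directly and would have to be derived from the session log. The following is a minimal sketch of one way to do that on made-up data. The column names `hashedEmail`, `start_time` and `end_time`, and the day/month/year timestamp format, are assumptions and should be checked against sessions.csv.

```r
library(dplyr)
library(lubridate)

# Made-up stand-in for sessions.csv; column names and the timestamp
# format are assumptions, not taken from the real file.
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time = c("01/05/2024 10:00", "02/05/2024 18:30", "03/05/2024 09:15"),
  end_time = c("01/05/2024 10:45", "02/05/2024 19:00", "03/05/2024 09:25")
)

session_features <- toy_sessions |>
  mutate(
    start = dmy_hm(start_time),   # parse text into date-times
    end = dmy_hm(end_time),
    session_minutes = as.numeric(difftime(end, start, units = "mins"))
  ) |>
  group_by(hashedEmail) |>
  summarise(
    n_sessions = n(),                                       # sessions per player
    mean_session_minutes = mean(session_minutes, na.rm = TRUE)
  )
session_features
```

The resulting table has one row per player and could then be joined to `players` on the hashed email column.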

Now let's visualize the data in different ways to analyze it in more depth.

In [None]:
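# Count of players at each played_hours value, coloured by subscription status;
# xlim() restricts the x-axis to the 0-3 hour range
time_players_plot <- ggplot(players, aes(x = played_hours, color = as_factor(subscribe))) +
  geom_line(stat = "count") +
  xlim(c(0, 3)) +
  labs(title = "Number of Players vs Total Time Played",
       color = "Subscription Status",
       x = "Total Time Played (hours)", y = "Number of Players")
time_players_plot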

Now we will analyze the average playtime of users, grouped by gender.

In [15]:
# Mean hours played per gender, ignoring missing values
players |>
  group_by(gender) |>
  summarise(mean_playing_time = mean(played_hours, na.rm = TRUE))

gender,mean_playing_time
<chr>,<dbl>
Agender,6.25
Female,10.63513514
Male,4.12741935
Non-binary,14.88
Other,0.2
Prefer not to say,0.37272727
Two-Spirited,0.08333333


The plot above suggests that players who subscribed to the newsletter were more active and had a higher total playing time overall. It also shows that, regardless of subscription status, most players have short playtimes. This trend suggests that subscription is linked to engagement with the game. The data also show large differences in playing time for a few players, which raise the mean played time significantly.
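
The effect of a few very long playing times on the mean can be illustrated with a small made-up vector. The values below are not from the dataset.

```r
# Made-up playing times in hours: five short ones and one very long one
toy_hours <- c(0.1, 0.2, 0.3, 0.5, 1.0, 150)

toy_mean <- mean(toy_hours)      # pulled upward by the single 150-hour value
toy_median <- median(toy_hours)  # stays close to the typical player

c(mean = toy_mean, median = toy_median)
```

Here the mean is 25.35 hours while the median is 0.4 hours, so reporting the median alongside the mean may give a fairer picture of a typical player.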

Proposed Method
K-Nearest Neighbors (KNN) Classification

KNN is a suitable method for this data because the goal is to predict whether a player will subscribe to the game's newsletter, and KNN makes predictions based on the similarity between observations. This suits behavioural data, where players with similar gaming activity are expected to have a similar inclination to subscribe.

This method assumes that the data are scaled properly, so that each variable contributes equally to the prediction, and that the selected K value is appropriate for both the recall and the precision of the model.
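
A small made-up example shows why scaling matters for a distance-based method. The table, its values, and the helper `nearest_to_first()` are all hypothetical. The sketch finds the nearest neighbour of the first player with and without standardising the columns.

```r
# Made-up players: hours are small numbers, session minutes are large ones
toy_players <- data.frame(
  played_hours = c(1, 2, 30, 1.5, 2.5),
  mean_session_minutes = c(30, 90, 40, 200, 150)
)

# Hypothetical helper: row index of the nearest neighbour of row 1
nearest_to_first <- function(features) {
  d <- as.matrix(dist(features))[1, -1]  # Euclidean distances from row 1 to rows 2..n
  unname(which.min(d)) + 1L              # shift by one because row 1 was dropped
}

raw_neighbour <- nearest_to_first(toy_players)            # unscaled columns
scaled_neighbour <- nearest_to_first(scale(toy_players))  # centred and scaled columns

c(raw = raw_neighbour, scaled = scaled_neighbour)
```

On the raw values the nearest neighbour is player 3, because the minutes column dominates the distance. After standardising, it is player 2, whose hours and minutes are both reasonably close to those of player 1.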

Limitations of this method :-
1. On a large data set, prediction would take a long time, because each prediction requires computing distances to the training observations.
2. It does not explain why a certain player is classified as subscribed or not.

Model Selection :-

1. The dataset will be split into training (80%) and testing (20%) sets.
2. 5-fold cross-validation will be used on the training data to tune k.
3. Models will be compared based on accuracy.
4. The optimal k value will be chosen to maximize performance on validation data.
5. Final performance will be reported on the test set.
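
The steps above can be sketched with tidymodels as follows. This is a minimal illustration on simulated data, not the final analysis. The simulated table, the seed, the predictor names `n_sessions` and `mean_session_minutes`, and the grid of k values are all assumptions, and the "kknn" engine requires the kknn package to be installed.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the joined player data; all values are made up
toy <- tibble(
  played_hours = rexp(200, rate = 0.2),
  n_sessions = rpois(200, lambda = 5),
  mean_session_minutes = runif(200, 5, 120)
) |>
  mutate(subscribe = factor(if_else(played_hours + rnorm(200, sd = 3) > 4,
                                    "TRUE", "FALSE")))

# 1. 80/20 split, stratified by the outcome
data_split <- initial_split(toy, prop = 0.8, strata = subscribe)
train <- training(data_split)
test <- testing(data_split)

# Standardise the predictors so each contributes comparably to the distance
knn_recipe <- recipe(subscribe ~ played_hours + n_sessions + mean_session_minutes,
                     data = train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# 2-3. 5-fold cross-validation on the training data, compared on accuracy
folds <- vfold_cv(train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))
cv_results <- tune_grid(knn_workflow, resamples = folds, grid = k_grid,
                        metrics = metric_set(accuracy))

# 4. Pick the k with the highest cross-validated accuracy
best_k <- select_best(cv_results, metric = "accuracy")

# 5. Refit on the full training set and evaluate once on the test set
final_fit <- knn_workflow |>
  finalize_workflow(best_k) |>
  fit(data = train)

test_accuracy <- predict(final_fit, test) |>
  bind_cols(test) |>
  accuracy(truth = subscribe, estimate = .pred_class)
test_accuracy
```

Keeping the test set out of the tuning loop means the reported accuracy is not used to choose k.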
