# Planning Proposal: Identifying player types that contribute the most data

**Broad Question:** Which kinds of players contribute the most behavioral data? 

**Specific question:** Can subscription, experience tier, and age identify players in the top 10% of total session minutes?

**Data Description:** Two linked tables are used. `players.csv` has one row per player: `experience`, `subscribe`, `hashedEmail`, `played_hours`, `name`, `gender`, `Age`. `sessions.csv` has one row per session: `hashedEmail`, start and end times as strings, and the corresponding epoch timestamps. I will parse the times, compute per-session duration in minutes, and aggregate to per-player totals. Potential issues: missing or invalid timestamps, duplicated IDs, zero or extreme durations, clock skew, multi-user accounts, and automated play. These issues are not equally likely; automated play, for instance, is probably rare given the effort required to set it up.
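
The parsing and duration step described above can be sketched in base R as follows. This is a minimal illustration, not the final pipeline: the `sessions_demo` rows are invented, the helper `parse_dmy` is a hypothetical name, and treating the timestamps as UTC is an assumption the data does not confirm.

```r
# Toy stand-in for sessions.csv; every row below is invented for illustration.
sessions_demo <- data.frame(
  hashedEmail = c("a", "a", "b", "c"),
  start_time  = c("01/05/2024 10:00", "02/05/2024 23:50", "03/05/2024 09:00", NA),
  end_time    = c("01/05/2024 10:45", "03/05/2024 00:20", "03/05/2024 08:30", "04/05/2024 12:00"),
  stringsAsFactors = FALSE
)

# Parse day/month/year strings. The UTC time zone is an assumption.
parse_dmy <- function(x) as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")

start <- parse_dmy(sessions_demo$start_time)
end   <- parse_dmy(sessions_demo$end_time)

# Per-session duration in minutes; NA when either timestamp failed to parse
sessions_demo$duration_min <- as.numeric(difftime(end, start, units = "mins"))

# Flag rows for review instead of silently dropping them
sessions_demo$flag_missing     <- is.na(sessions_demo$duration_min)
sessions_demo$flag_nonpositive <- !sessions_demo$flag_missing &
  sessions_demo$duration_min <= 0

print(sessions_demo)
```

Keeping the flags as columns makes it possible to count each kind of problem before deciding whether to exclude or repair the affected sessions.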


**How the data answers the question.** After aggregation, I define HighContributor as a player at or above the 90th percentile of total minutes, with the cutoff computed on training folds only. This label is then related to subscription, experience, and age. Protected attributes will not be used for targeting. Names are ignored.
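
A small sketch of the labelling rule, assuming per-player totals are already available. The `total_min` values and the choice of the first eight players as the "training" subset are invented purely to show the mechanics; `quantile()` is used with its default interpolation, which is one of several reasonable choices.

```r
# Hypothetical per-player totals in minutes (invented values)
total_min <- c(0, 0, 5, 12, 30, 45, 60, 90, 240, 1500)

# Estimate the cutoff from training players only (here: the first 8, for illustration)
train_idx <- 1:8
cutoff <- quantile(total_min[train_idx], probs = 0.90, names = FALSE)

# Apply the training cutoff to every player, including those outside the training set
high_contributor <- total_min >= cutoff

print(data.frame(total_min, high_contributor))
```

Because the cutoff comes from the training subset, the share of players labelled HighContributor outside that subset need not be exactly 10%.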

**Minimum wrangling for this deliverable.** Sort timestamps, compute session duration, aggregate sessions per player (counts, total minutes, mean minutes), left‑join with `players`, fill missing session metrics with zeros, and produce tidy tables and plots. 
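
The aggregation and join could look roughly like the base R sketch below. The two small data frames are invented stand-ins for the real tables, and `per_player` and `players_sessions` are hypothetical names chosen for this example.

```r
# Invented stand-ins for the players table and the per-session durations
players_demo <- data.frame(
  hashedEmail = c("a", "b", "c"),
  subscribe   = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
sess <- data.frame(
  hashedEmail  = c("a", "a", "b"),
  duration_min = c(45, 30, 10),
  stringsAsFactors = FALSE
)

# Aggregate sessions per player: count, total minutes, mean minutes
by_player <- split(sess$duration_min, sess$hashedEmail)
per_player <- data.frame(
  hashedEmail = names(by_player),
  n_sessions  = vapply(by_player, length, integer(1)),
  total_min   = vapply(by_player, sum, numeric(1)),
  mean_min    = vapply(by_player, mean, numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Left join: keep every player, even those without any recorded session
players_sessions <- merge(players_demo, per_player, by = "hashedEmail", all.x = TRUE)

# Players with no sessions get zeros rather than NA
for (col in c("n_sessions", "total_min", "mean_min")) {
  players_sessions[[col]][is.na(players_sessions[[col]])] <- 0
}

print(players_sessions)
```

Filling `mean_min` with zero follows the plan stated above; whether a zero or a missing value is the better representation for players without sessions is a judgment call worth revisiting during modelling.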

**Exploratory checks.** Compare total minutes by experience × subscription, using a log scale to handle skew. Plot age versus minutes. Report medians and high quantiles. Inspect session-count and session-length distributions. Flag missing times and implausible durations. `played_hours` will be excluded because it is close to a restatement of the outcome: finding that players with the most recorded hours also contribute the most session minutes would be nearly circular and not informative.
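
As an illustration of the group summaries, the sketch below reports the median and the 90th percentile of total minutes for each experience × subscription cell. The `demo` rows, including the experience labels, are invented; `log1p` is used as one way to put zero-minute players on a log-like scale, which is an assumption rather than a settled choice.

```r
# Invented example rows; the real table has one row per player
demo <- data.frame(
  experience = c("Amateur", "Amateur", "Pro", "Pro", "Pro", "Veteran"),
  subscribe  = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE),
  total_min  = c(0, 120, 15, 45, 600, 30),
  stringsAsFactors = FALSE
)

# Median total minutes per experience x subscription cell
med <- aggregate(total_min ~ experience + subscribe, data = demo, FUN = median)
names(med)[3] <- "median_min"

# 90th percentile per cell; extra arguments are passed on to quantile()
q90 <- aggregate(total_min ~ experience + subscribe, data = demo,
                 FUN = quantile, probs = 0.90, names = FALSE)
names(q90)[3] <- "q90_min"

summary_tbl <- merge(med, q90, by = c("experience", "subscribe"))

# log1p keeps zero-minute players on the scale (log(0) would be -Inf)
demo$log_min <- log1p(demo$total_min)

print(summary_tbl)
```

The `log_min` column can then feed a boxplot or scatter plot so that a few very heavy players do not flatten the rest of the distribution.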

**Method planned for later.** Regularized logistic regression for HighContributor using subscription, experience, age, and simple interactions. Rationale: it is easy to interpret, gives usable probabilities, and works well with only a few inputs. Assumptions: players are independent of one another, the model's functional form fits the data, the predictors are not highly collinear, and behavior does not change much during the study period. Limitations: unmeasured factors may still drive results, behavior can shift over time, and results depend on the choice of the top-10% cutoff.

**Model selection and evaluation (later).** Hold out 20% of players once for final testing. On the remaining 80%, use 5-fold stratified cross-validation. Compute the percentile threshold within each training fold to avoid leakage. Compare ROC AUC and PR AUC, examine calibration, and choose the simplest well-calibrated model among the top performers. Set the decision threshold either by maximizing F1 or by maximizing precision at a fixed recall target.
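
The split and the per-fold cutoff can be sketched as below. Everything here is illustrative: the totals are simulated, the seed is arbitrary, and stratifying the folds on a provisional label computed over the whole 80% is my assumption about how to reconcile stratification with a cutoff that is re-estimated in each training fold. The provisional label is used only to assign folds, never as the modelling outcome.

```r
set.seed(1)                                   # arbitrary seed for this sketch
n <- 100
total_min <- round(rexp(n, rate = 1 / 60))    # simulated, right-skewed totals

# One-time 20% hold-out of players for final testing
test_idx <- sample(n, size = round(0.2 * n))
dev_idx  <- setdiff(seq_len(n), test_idx)

# Provisional label, used ONLY to balance the folds (assumption, see lead-in)
dev_total   <- total_min[dev_idx]
provisional <- dev_total >= quantile(dev_total, probs = 0.90, names = FALSE)

# Stratified assignment of development players to k folds
k <- 5
fold <- integer(length(dev_idx))
for (cls in c(TRUE, FALSE)) {
  idx <- which(provisional == cls)
  fold[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

fold_cutoffs <- numeric(k)
valid_rate   <- numeric(k)
for (i in seq_len(k)) {
  train <- dev_idx[fold != i]
  valid <- dev_idx[fold == i]
  # Cutoff estimated from the training part of this fold only
  fold_cutoffs[i] <- quantile(total_min[train], probs = 0.90, names = FALSE)
  y_train <- total_min[train] >= fold_cutoffs[i]
  y_valid <- total_min[valid] >= fold_cutoffs[i]
  valid_rate[i] <- mean(y_valid)
  # Model fitting on (train, y_train) and scoring on (valid, y_valid) would go here
}

print(data.frame(fold = seq_len(k), cutoff = fold_cutoffs, valid_rate = valid_rate))
```

The hold-out players in `test_idx` are never touched inside the loop, so they remain available for a single final evaluation.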


```r
players  <- read.csv("players.csv", check.names = TRUE)
sessions <- read.csv("sessions.csv", check.names = TRUE)

cat("players: ", nrow(players), "rows x", ncol(players), "cols\n")
cat("sessions:", nrow(sessions), "rows x", ncol(sessions), "cols\n")

# Data dictionary listing every variable in both tables, used to decide
# which ones matter for the planned analysis
var_dict <- data.frame(
  table    = c(rep("players", 7), rep("sessions", 5)),
  variable = c("experience", "subscribe", "hashedEmail", "played_hours", "name", "gender", "Age",
               "hashedEmail", "start_time", "end_time", "original_start_time", "original_end_time"),
  type     = c("factor", "logical", "id", "numeric", "string", "factor", "numeric",
               "id", "string", "string", "numeric", "numeric"),
  meaning  = c("Experience tier", "Has subscription", "Join key", "Lifetime hours", "Given name",
               "Self-reported gender", "Age in years",
               "Join key", "Session start (d/m/Y H:M)", "Session end (d/m/Y H:M)",
               "Start epoch (ms)", "End epoch (ms)")
)
print(var_dict)
```

```
players:  196 rows x 7 cols
sessions: 1535 rows x 5 cols
      table            variable    type                   meaning
1   players          experience  factor           Experience tier
2   players           subscribe logical          Has subscription
3   players         hashedEmail      id                  Join key
4   players        played_hours numeric            Lifetime hours
5   players                name  string                Given name
6   players              gender  factor      Self-reported gender
7   players                 Age numeric              Age in years
8  sessions         hashedEmail      id                  Join key
9  sessions          start_time  string Session start (d/m/Y H:M)
10 sessions            end_time  string   Session end (d/m/Y H:M)
11 sessions original_start_time numeric          Start epoch (ms)
12 sessions   original_end_time numeric            End epoch (ms)
```
