# DSCI 100: Individual Project Planning Stage  
**Name:** Ishir Ghatpande  
**Dataset:** Minecraft Research Server (players.csv, sessions.csv)

In this planning report, I describe the video game research server data, pose a predictive question of interest, explore the relevant variables, and outline a modelling plan. 

# Data Description

**Importing the datasets:** 

In [2]:
library(tidyverse)

players_url <- "players.csv"
sessions_url <- "sessions.csv"

players <- read_csv(players_url, show_col_types = FALSE)
sessions <- read_csv(sessions_url, show_col_types = FALSE)

head(players)
head(sessions)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


| experience `<chr>` | subscribe `<lgl>` | hashedEmail `<chr>` | played_hours `<dbl>` | name `<chr>` | gender `<chr>` | Age `<dbl>` |
|---|---|---|---|---|---|---|
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


| hashedEmail `<chr>` | start_time `<chr>` | end_time `<chr>` | original_start_time `<dbl>` | original_end_time `<dbl>` |
|---|---|---|---|---|
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000.0 | 1718670000000.0 |
| f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000.0 | 1721930000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 25/07/2024 03:22 | 25/07/2024 03:58 | 1721880000000.0 | 1721880000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 25/05/2024 16:01 | 25/05/2024 16:12 | 1716650000000.0 | 1716650000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 23/06/2024 15:08 | 23/06/2024 17:10 | 1719160000000.0 | 1719160000000.0 |


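- **players.csv**
  - One row per unique player.
  - Includes:
    - A unique player identifier (`hashedEmail`).
    - Demographic information (`name`, `gender`, `Age`).
    - Indicators of behaviour and engagement: a categorical experience level (`experience`, e.g., Amateur, Regular, Veteran, Pro) and hours played (`played_hours`).
    - A logical indicator (`subscribe`) of whether the player subscribed to the game-related newsletter.
- **sessions.csv**
  - One row per play session.
  - Includes:
    - Player identifier (`hashedEmail`, linking to `players.csv`).
    - Session start/end timestamps (`start_time`, `end_time`), stored as text in day/month/year hour:minute form.
    - The same start/end times as large numeric values (`original_start_time`, `original_end_time`), which appear to be Unix timestamps in milliseconds.
    - No explicit duration column; session length has to be derived from the start and end times.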
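
Because `start_time` and `end_time` are read as character columns, session length has to be computed after parsing them. The sketch below shows one way to do this with `lubridate::dmy_hm()`. It uses the first three rows of the `head(sessions)` output, with the hashes replaced by short placeholder IDs; the names `sessions_demo`, `start_dt`, `end_dt`, and `duration_min` are illustrative, not part of the dataset.

```r
library(tidyverse)  # tidyverse 2.0.0 also attaches lubridate

# First three rows of head(sessions); IDs are shortened placeholders
sessions_demo <- tibble(
  hashedEmail = c("player_a", "player_b", "player_c"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

sessions_demo <- sessions_demo |>
  mutate(
    start_dt     = dmy_hm(start_time),  # day/month/year hour:minute, parsed as UTC
    end_dt       = dmy_hm(end_time),
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

sessions_demo |>
  select(hashedEmail, start_dt, end_dt, duration_min)
```

Parsing as UTC is an assumption here; the source files do not state a time zone, so durations are safe to compute but clock-time comparisons would need that checked.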

**General properties & checks**

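- I inspect the number of players and sessions, and the number of variables in each file, with `nrow()` and `ncol()`.
- I confirm variable types (numeric, categorical, datetime) with `glimpse()`. The output above shows that `start_time` and `end_time` are read as character, so they need parsing before they can be used as datetimes.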
- Potential issues:
  - Missing values in demographics or behaviour measures.
  - Extremely short/long sessions or outliers.
  - Players with zero or only one session.
  - Sampling / participation bias: data only represent players who joined this specific research server.
  - Temporal bias: activity patterns may depend on recruitment campaigns or academic terms.
- I will use tidy tables and a consistent key (`hashedEmail`, the identifier shared by both files) to join data in later stages.
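
The checks listed above could be sketched roughly as follows. The two small tibbles are made-up toy data (placeholder IDs, one deliberately missing `Age`), used only so the example runs on its own; in the report the same calls would be applied to `players` and `sessions`. The join key is `hashedEmail`, the column both `head()` outputs share. The names `players_demo`, `sessions_demo`, `missing_counts`, `session_counts`, `players_joined`, and `n_sessions` are illustrative.

```r
library(tidyverse)

# Toy stand-ins for players and sessions (made-up rows, placeholder IDs)
players_demo <- tibble(
  hashedEmail  = c("player_a", "player_b", "player_c"),
  experience   = c("Pro", "Veteran", "Veteran"),
  played_hours = c(30.3, 3.8, 0.0),
  Age          = c(9, 17, NA)
)
sessions_demo <- tibble(
  hashedEmail = c("player_a", "player_a", "player_b"),
  start_time  = c("30/06/2024 18:12", "25/07/2024 03:22", "17/06/2024 23:33")
)

# Size and variable types
nrow(players_demo)
ncol(players_demo)
glimpse(players_demo)

# Missing values per column
missing_counts <- players_demo |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Sessions per player, joined back so players with no sessions get 0
session_counts <- sessions_demo |>
  count(hashedEmail, name = "n_sessions")

players_joined <- players_demo |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(n_sessions = replace_na(n_sessions, 0L))

players_joined
```

A left join from the players table keeps players who never logged a session, which is what makes the "zero or only one session" check possible.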


**Means of Quantitative Variables:**

In [4]:
players_means <- players |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) |>
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "mean") |>
  mutate(mean = round(mean, 2))

players_means

| variable `<chr>` | mean `<dbl>` |
|---|---|
| played_hours | 5.85 |
| Age | 21.14 |


# Questions