Individual Planning Stage
-
In this project, I will perform statistical modeling and analysis using data collected through a Minecraft server set up by a UBC Computer Science team. 

I plan to address the following research question:
What kinds of players of the Minecraft server are likely to contribute a large amount of data? 

Through visualization, I will present insight into the relationships between player characteristics and the amount of data players contribute, which may help inform the team's research. Below, I load the two datasets: one with player information and another recording each session hosted by the server.

In [48]:
library(tidyverse)
library(tidymodels)
library(repr)

In [49]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")
head(players)
head(sessions)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


The two datasets are described below, followed by summary statistics for the players data:

From the output generated when loading the data, we can see that "players" has 196 observations, while "sessions" has 1535 observations. This means we have data on 1535 Minecraft sessions from at most 196 distinct players, since a player with no recorded play time (such as those showing 0 played hours) may not appear in "sessions" at all.
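The relationship between the two tables can be checked directly. The following is a minimal sketch, assuming the tables are linked by `hashedEmail`; the toy tibbles `players_toy` and `sessions_toy` are hypothetical stand-ins, not the real data.

```r
library(dplyr)

# Hypothetical stand-ins for the real `players` and `sessions` data frames
players_toy <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(30.3, 3.8, 0.0)
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2")
)

# Number of distinct players who appear in the sessions data
n_players_with_sessions <- sessions_toy |>
  summarize(n = n_distinct(hashedEmail)) |>
  pull(n)

# Players who never appear in the sessions data
players_without_sessions <- players_toy |>
  anti_join(sessions_toy, by = "hashedEmail")

n_players_with_sessions
players_without_sessions
```

Running the same two steps on the real `players` and `sessions` objects would show how many of the 196 players actually contributed session data.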

The variables in the "sessions" dataset are:
- hashedEmail (the user identification of the player)
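- start_time (the start time of the session, stored as text in day/month/year hour:minute format)
- end_time (the end time of the session, in the same format)
- original_start_time and original_end_time (raw numeric timestamps; their magnitude suggests Unix time in milliseconds, though this is not documented)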
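Because `start_time` and `end_time` are read in as character strings, they need to be parsed before any arithmetic is possible. The sketch below uses two rows copied from the output above and assumes the day/month/year format holds for every row. The column names `duration_mins` and `original_start` are my own, and treating the `original_*` values as Unix milliseconds is an assumption.

```r
library(dplyr)
library(lubridate)

# Two sessions copied from the head(sessions) output
sessions_sample <- tibble(
  start_time = c("30/06/2024 18:12", "23/06/2024 15:08"),
  end_time   = c("30/06/2024 18:24", "23/06/2024 17:10"),
  original_start_time = c(1719770000000, 1719160000000)
)

sessions_parsed <- sessions_sample |>
  mutate(
    # Parse day/month/year hour:minute strings into date-times (UTC by default)
    start_time = dmy_hm(start_time),
    end_time   = dmy_hm(end_time),
    # Session length in minutes
    duration_mins = as.numeric(difftime(end_time, start_time, units = "mins")),
    # Assumption: raw values are milliseconds since 1970-01-01 UTC
    original_start = as_datetime(original_start_time / 1000)
  )
sessions_parsed
```

If the converted `original_start` dates line up with the parsed `start_time` dates, that supports the millisecond interpretation. The raw values appear rounded, so the readable columns are likely the better source for durations.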

The variables in the "players" dataset are:
- experience (the player's level of Minecraft experience)
- subscribe (whether or not the player is subscribed to the game-related newsletter)
- hashedEmail (the user identification of the player)
- played_hours (the total number of hours spent on the server)
- name (the player's name)
- gender (the player's gender)
- Age (the player's age in years)

In [51]:
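# Mean hours played and mean age (missing ages are ignored)
summary_players <- players |>
  summarize(mean_hrs = mean(played_hours),
            mean_age = mean(Age, na.rm = TRUE))
summary_players

# Number of players in each gender category
gender_freq <- players |>
  count(gender)
gender_freq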

mean_hrs,mean_age
<dbl>,<dbl>
5.845918,21.13918


gender,n
<chr>,<int>
Agender,2
Female,37
Male,124
Non-binary,15
Other,1
Prefer not to say,11
Two-Spirited,6


The code above shows that players spent an average of 5.85 hours on the server and that their mean age was 21.14 years (ignoring missing ages). Of the 196 players, 2 are agender, 37 are female, 124 are male, 15 are non-binary, 6 are two-spirited, 1 identifies with a gender not listed here, and 11 prefer not to disclose their gender.

From my initial look at the data, there are several potential issues:
1. The data appear to be skewed toward players who registered their information but contributed little or no play time (four of the first six players shown have between 0 and 0.7 played hours), so they may never have intended to produce data for the research.
2. It is not immediately clear how the session start and end times, which are stored as text, can be manipulated to visualize patterns.
3. Some demographic groups (for example, some of the gender identities and possibly some experience levels) may be too small to generalize findings from this analysis to a wider population.
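The third concern can be made concrete by looking at group shares. This is a minimal sketch using the gender counts reported above; `min_group_size` is a hypothetical cutoff chosen only for illustration, not an established threshold.

```r
library(dplyr)

# Gender counts copied from the count(gender) output above
gender_freq <- tibble(
  gender = c("Agender", "Female", "Male", "Non-binary", "Other",
             "Prefer not to say", "Two-Spirited"),
  n = c(2, 37, 124, 15, 1, 11, 6)
)

min_group_size <- 10  # hypothetical cutoff, for illustration only

gender_check <- gender_freq |>
  mutate(prop = n / sum(n),                  # share of all players
         small_group = n < min_group_size) |> # flag sparse groups
  arrange(desc(n))
gender_check
```

With these counts, male players make up roughly 63% of the sample, and three categories fall below the illustrative cutoff. The same check could be applied to `experience` using `count(experience)`.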

Focus Question
-
The specific question I will seek to answer in my analysis is:

Can a player's age predict the amount of time they spend on this Minecraft server?
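One possible way to frame this question is as a simple linear regression of `played_hours` on `Age`. The sketch below uses base R `lm()` instead of a tidymodels workflow, purely to keep it short, and it fits only the six players shown in the earlier output. It illustrates the mechanics and is not a result; the model actually used in the analysis may differ.

```r
# Six players copied from the head(players) output above
players_sample <- data.frame(
  Age          = c(9, 17, 17, 21, 21, 17),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0)
)

# Simple linear regression of hours played on age
age_fit <- lm(played_hours ~ Age, data = players_sample)
coef(age_fit)

# Predicted hours for a hypothetical 20-year-old player
predict(age_fit, newdata = data.frame(Age = 20))
```

Six rows are far too few to draw conclusions, and a single heavy player (30.3 hours) dominates the fit. This suggests checking the full data for skew and outliers before modeling.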

Here are some plots that give a sense of the distributions of the explanatory variable (Age) and the response variable (played_hours).