
# Project Planning Stage (Individual)


In [1]:
library(tidyverse)
library(lubridate)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
# Download the players dataset and read it into a tibble
url <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players_data <- read_csv(url)

# Preview the first six rows
head(players_data)

Rows: 196 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, age
lgl (3): subscribe, individualId, organizationName

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<lgl>,<lgl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9,,
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17,,
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17,,
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21,,
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21,,
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17,,


## 1. Data Description

This dataset describes players on a Minecraft research server; the data were collected by recording player gameplay, speech, and key presses. It has 196 rows and 9 columns, that is, 196 observations of 9 variables, and each row represents a single player. Of the 9 variables, 4 are character (`chr`), 2 are numeric (`dbl`), and 3 are logical (`lgl`).
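
The row, column, and type counts above can be checked directly rather than read off the import message. The following is a minimal sketch, assuming a small stand-in tibble named `players_sample` (a hypothetical name) built from the first three rows shown earlier, with `hashedEmail` values shortened for readability; on the real data the same calls would be applied to `players_data`.

```r
library(tibble)

# Stand-in for players_data: the first three rows from the preview above.
# hashedEmail values are shortened here (an assumption made for readability).
players_sample <- tibble(
  experience       = c("Pro", "Veteran", "Veteran"),
  subscribe        = c(TRUE, TRUE, FALSE),
  hashedEmail      = c("f6daba42", "f3c81357", "b674dd7e"),
  played_hours     = c(30.3, 3.8, 0.0),
  name             = c("Morgan", "Christian", "Blake"),
  gender           = c("Male", "Male", "Male"),
  age              = c(9, 17, 17),
  individualId     = c(NA, NA, NA),
  organizationName = c(NA, NA, NA)
)

# Number of rows and columns
dims <- dim(players_sample)

# Number of columns of each type (character, logical, numeric)
type_counts <- table(sapply(players_sample, class))

dims
type_counts
```

On the full dataset, `dim(players_data)` should report the 196 rows and 9 columns stated above.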

#### Summary Statistics

In [20]:
# Mean, max, min, and standard deviation of the two numeric variables
players_data |>
  summarize(across(
    c(played_hours, age),
    list(
      mean = ~ round(mean(.x), 2),
      max = ~ round(max(.x), 2),
      min = ~ round(min(.x), 2),
      sd = ~ round(sd(.x), 2)
    )
  ))

# Total hours played across all players
players_data |>
  summarize(sum_played_hours = sum(played_hours))

# Frequency of each value, one table per variable
subscribe_count <- players_data |>
  count(subscribe)
experience_count <- players_data |>
  count(experience)
gender_count <- players_data |>
  count(gender)
age_count <- players_data |>
  count(age)
played_hours_count <- players_data |>
  count(played_hours)

subscribe_count
experience_count
gender_count
age_count
played_hours_count

played_hours_mean,played_hours_max,played_hours_min,played_hours_sd,age_mean,age_max,age_min,age_sd
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5.85,223.1,0,28.36,21.28,99,8,9.71


sum_played_hours
<dbl>
1145.8


subscribe,n
<lgl>,<int>
False,52
True,144


experience,n
<chr>,<int>
Amateur,63
Beginner,35
Pro,14
Regular,36
Veteran,48


gender,n
<chr>,<int>
Agender,2
Female,37
Male,124
Non-binary,15
Other,1
Prefer not to say,11
Two-Spirited,6


age,n
<dbl>,<int>
8,1
9,1
10,1
11,1
12,1
14,2
15,2
16,3
17,75
18,7


played_hours,n
<dbl>,<int>
0.0,85
0.1,34
0.2,10
0.3,5
0.4,5
0.5,4
0.6,5
0.7,3
0.8,2
0.9,1


#### Summary

| Variable | Type | Description | Potential Errors | Observations |
|----------|------|-------------|------------------|--------------|
| experience | chr | Player's gaming experience level | none | 5 categories; Amateur is the most common (63) |
| hashedEmail | chr | Player's hashed email address | Possible duplicates; hard to tell at first glance | Anonymized identifier |
| name | chr | Player's name | none | Likely not useful |
| gender | chr | Player's gender | none | 7 categories; Male is the most common (124) |
| played_hours | dbl | Number of hours played | none | Highly variable: range 0 to 223.1, SD 28.36 |
| age | dbl | Player's age | none | Range 8 to 99 |
| subscribe | lgl | Whether the player is subscribed | none | 144 of 196 are subscribed |
| individualId | lgl | Unknown (column is empty) | All values are NA | Cannot be used |
| organizationName | lgl | Unknown (column is empty) | All values are NA | Cannot be used |
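
Whether `hashedEmail` contains duplicates does not have to be judged by eye. The sketch below shows one way to list repeated values, assuming a small hypothetical tibble `sample_ids` with made-up identifiers; the same `count()` and `filter()` steps could be applied to `players_data`.

```r
library(dplyr)

# Hypothetical identifiers for illustration only; "aaa" appears twice.
sample_ids <- tibble(hashedEmail = c("aaa", "bbb", "aaa"))

# Keep only identifiers that occur more than once
duplicated_ids <- sample_ids |>
  count(hashedEmail) |>
  filter(n > 1)

duplicated_ids
```

An empty result on the real data would mean that every row has a distinct `hashedEmail`.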

Initial Observations
- played_hours varies a lot between players
- age also has a wide range
- Two columns (individualId and organizationName) consist entirely of missing values
- hashedEmail and name are identifiers and will not be used in the predictions
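
The last two observations suggest a simple cleaning step: drop the columns that are entirely missing, then drop the identifier columns. This is a minimal sketch under stated assumptions, using a hypothetical stand-in tibble `players_sample` with two rows taken from the preview (and shortened `hashedEmail` values) rather than the full dataset; `players_clean` is likewise an illustrative name.

```r
library(dplyr)

# Stand-in for players_data: two rows from the preview, hashedEmail shortened.
players_sample <- tibble(
  experience       = c("Pro", "Amateur"),
  subscribe        = c(TRUE, TRUE),
  hashedEmail      = c("f6daba42", "23fe711e"),
  played_hours     = c(30.3, 0.7),
  name             = c("Morgan", "Flora"),
  gender           = c("Male", "Female"),
  age              = c(9, 21),
  individualId     = c(NA, NA),
  organizationName = c(NA, NA)
)

players_clean <- players_sample |>
  # Drop columns in which every value is NA
  select(where(~ !all(is.na(.x)))) |>
  # Drop identifier columns that will not be used as predictors
  select(-hashedEmail, -name)

names(players_clean)
```

Selecting with `where()` avoids hard-coding the names of the empty columns, so the step still works if other empty columns appear in a later export.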