<h1><ins>Individual Project Planning Stage (Jairoop Brar - Group 33)</ins></h1>

<h2>Data Description</h2>

Our group will focus on which types of players contribute the largest amount of data. More details about the question can be found under the **Questions** section. Because our analysis will primarily use the "players.csv" dataset, the following data description covers only that dataset. Below, I have read the raw "players.csv" dataset from GitHub and cleaned it up (fixing the column names and variable types), followed by some preliminary summary statistics of the quantitative variables. Below that is my data description of the "players.csv" dataset.

In [9]:
#importing libraries
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)

In [62]:
#reading the dataset from a github url
players_raw_data <- read_csv("https://raw.githubusercontent.com/jai-o1/individual-project/refs/heads/main/players.csv",
                            col_names = c('experience', 'subscribed', 'hashed_email', 'hours_played', 'name', 'gender', 'age'),
                            skip = 1,    #skip the old column names
                            show_col_types = FALSE)

#making sure each vector is the right type of variable (eg: experience should be a categorical factor)
players_data <- players_raw_data |>
                mutate(experience = as.factor(experience),
                        gender = as.factor(gender),
                        age = as.integer(age))

players_data

experience,subscribed,hashed_email,hours_played,name,gender,age
<fct>,<lgl>,<chr>,<dbl>,<chr>,<fct>,<int>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
⋮,⋮,⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,0.0,Dylan,Prefer not to say,57
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Harlow,Male,17
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Ahmed,Other,


In [63]:
#summarizing data to get summary statistics for the quantitative variables (hours_played and age)
players_summary <- players_data |>
                    summarize(max_hours_played = max(hours_played),
                              min_hours_played = min(hours_played),
                              mean_hours_played = mean(hours_played),
                              max_age = max(age, na.rm = TRUE),
                              min_age = min(age, na.rm = TRUE),
                              mean_age = mean(age, na.rm = TRUE))
players_summary

max_hours_played,min_hours_played,mean_hours_played,max_age,min_age,mean_age
<dbl>,<dbl>,<dbl>,<int>,<int>,<dbl>
223.1,0,5.845918,58,9,21.13918


<h2>Data Description of 'players.csv' Dataset</h2>

The players.csv dataset is a list of unique players who have played on a Minecraft server set up by a Computer Science research group at UBC led by Frank Wood. The dataset includes information about each unique player: personal information such as their name, gender, and age; logistical information such as their hashed email and whether they are subscribed; and gameplay information such as their experience level and the number of hours they played. This dataset helps illustrate the types of players who have played on the server and how much they have played.
- There are **196** Total Observations
- There are **7** Total Variables
- Below is a list of each variable's **name**, <ins>type</ins>, and meaning:
  - **experience**, <ins>categorical (factor)</ins>, this variable is the experience/skill level of each player and can be one of the following: Pro, Veteran, Amateur, Regular, or Beginner
  - **subscribed**, <ins>logical (TRUE or FALSE)</ins>, this variable is whether the player is subscribed to the game-related newsletter or not, either TRUE or FALSE
  - **hashed_email**, <ins>character</ins>, this variable is a scrambled string of characters associated with each player's email (this hides the players' real emails, and identical emails always produce the same string of characters)
  - **hours_played**, <ins>quantitative (double)</ins>, this quantitative variable is the total number of hours each player has played on the server and is in the form of a double-precision floating point number (summary stats can be found below)
  - **name**, <ins>character</ins>, this variable is the name of each player as a string of characters
  - **gender**, <ins>categorical (factor)</ins>, this variable is the gender of each player and can be one of the following: Male, Female, Non-binary, Two-Spirited, Agender, Other, or Prefer not to say
  - **age**, <ins>quantitative (integer)</ins>, this quantitative variable is the age of each player and is in the form of an integer (summary stats can be found below)
- As seen in the table above, the summary statistics for the quantitative variables (hours_played and age) are as follows:
  - **Hours Played (hours_played)**: Mean = 5.85, Maximum Value = 223.10, Minimum Value = 0.00
  - **Age (age)**: Mean = 21.14, Maximum Value = 58, Minimum Value = 9
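
The variable types and categories listed above can be checked directly in R. The following is a minimal sketch that assumes a small illustrative sample, `players_sample`, built from a few of the rows printed earlier (with `hashed_email` left out for brevity). The sample and the helper names `sample_dims`, `column_types`, `experience_levels`, and `gender_levels` are hypothetical and are not part of the analysis itself; on the full data the same calls would be run on `players_data`.

```r
library(dplyr)

# Illustrative sample only: a few rows copied from the printed table above
# (hashed_email omitted). This is NOT the full 196-row dataset.
players_sample <- tibble(
  experience   = factor(c("Pro", "Veteran", "Veteran", "Amateur", "Amateur", "Pro")),
  subscribed   = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  hours_played = c(30.3, 3.8, 0.0, 0.0, 2.3, 0.2),
  name         = c("Morgan", "Christian", "Blake", "Dylan", "Harlow", "Ahmed"),
  gender       = factor(c("Male", "Male", "Male", "Prefer not to say", "Male", "Other")),
  age          = c(9L, 17L, 17L, 57L, 17L, NA)
)

# Number of observations (rows) and variables (columns)
sample_dims <- dim(players_sample)

# Storage class of each column (factor, logical, numeric, character, integer)
column_types <- sapply(players_sample, function(col) class(col)[1])

# Categories that actually appear in each factor column of the sample
experience_levels <- levels(players_sample$experience)
gender_levels <- levels(players_sample$gender)

sample_dims
column_types
experience_levels
gender_levels
```

Because the sample holds only six rows, it shows fewer categories than the full dataset; the point is only to show which calls confirm the counts and types described in the list.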

One issue in the dataset is that the age variable has some 'NA' values, which we will have to work around. Additionally, although we can't see it in the dataset itself, some players could have lied about their age, name, or gender, or given a fake email. We cannot identify these cases, but they should be kept in mind for later data analysis. Similarly, for the number of hours played, some players could have had Minecraft running in the background or been AFK while their hours accumulated. This also cannot be identified, but it should be taken into consideration.
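
As a rough sketch of how the missing ages could be worked around, the example below counts the 'NA' values and shows two common options: skipping them with `na.rm = TRUE` when computing a statistic, or dropping the affected rows with `filter()`. It assumes a small illustrative sample, `players_sample`, built from a few of the rows printed earlier; the sample and the names `n_missing_age`, `mean_age_na_rm`, and `players_with_age` are hypothetical. Which option is appropriate depends on the later analysis.

```r
library(dplyr)

# Illustrative sample only: a few rows copied from the printed table above.
# This is NOT the full dataset.
players_sample <- tibble(
  name         = c("Morgan", "Christian", "Blake", "Dylan", "Harlow", "Ahmed"),
  hours_played = c(30.3, 3.8, 0.0, 0.0, 2.3, 0.2),
  age          = c(9L, 17L, 17L, 57L, 17L, NA)
)

# Count how many players have a missing age
n_missing_age <- players_sample |>
  summarize(n_missing = sum(is.na(age))) |>
  pull(n_missing)

# Option 1: keep every row and skip NA only when computing age statistics
mean_age_na_rm <- mean(players_sample$age, na.rm = TRUE)

# Option 2: drop players with a missing age before any age-based analysis
players_with_age <- players_sample |>
  filter(!is.na(age))

n_missing_age
mean_age_na_rm
players_with_age
```

Option 1 keeps the hours played by players with an unknown age in the data, while Option 2 removes those players entirely, so the choice affects any summary of `hours_played`.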

<h2>Questions</h2>
...