# DSCI 100-004 Individual Project Plan

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

## Data Description

In [None]:
players <- read_csv("data/players.csv")
players

#### Overall data set description
- Contains 196 observations
- Contains 7 variables (columns)
- Combines self-reported demographic information with gameplay data recorded by the Minecraft server

#### Variables in the data set

##### experience
- Type: Character
- Meaning: Minecraft experience level
- Issues: Categories have not been standardized and several appear near-synonymous (for example: amateur and beginner; veteran, pro, and regular), which could make classification difficult
##### subscribe
- Type: Logical
- Meaning: Indicates whether the player subscribed to updates or optional parts of the study
##### hashedEmail
- Type: Character
- Meaning: An anonymized identifier derived from players' email addresses
- Issues: Functions only as a unique ID; not useful as an explanatory variable
##### played_hours
- Type: Double (numeric)
- Meaning: Total number of hours a player spent on the Minecraft server
- Issues: Strongly right-skewed; most players have very low hours, while a few have extremely high values (outliers)
##### name
- Type: Character
- Meaning: Player's name
- Issues: Every name is unique, so it functions like an identifier and is not useful as an explanatory variable
##### gender
- Type: Character
- Meaning: Self-reported gender identity
- Issues: Contains 7 categories, some with very few observations compared to others such as "Male"; comparisons may be unreliable for underrepresented categories
##### Age
- Type: Double (numeric)
- Meaning: Player's age in years
- Issues: 2 missing (NA) values
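
Issues such as underrepresented categories and missing values can be checked directly with counts. The following is a minimal sketch, assuming a small invented tibble `toy_players` that stands in for `players`; its values are illustrative only and are not taken from the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `players`; all values are invented for illustration
toy_players <- tibble(
  experience   = c("Amateur", "Beginner", "Pro", "Veteran", "Regular", "Amateur"),
  gender       = c("Male", "Female", "Male", "Non-binary", "Male", "Male"),
  played_hours = c(0.0, 0.1, 30.3, 0.6, 1.2, 0.0),
  Age          = c(17, 21, NA, 19, 22, NA)
)

# Count observations per category to spot underrepresented groups
gender_counts <- toy_players |>
  count(gender, sort = TRUE)

# Count missing values in every column
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

gender_counts
na_counts
```

Running the same two pipelines on `players` instead of `toy_players` would show the category counts for gender and the number of missing values in Age.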

#### Summary statistics

##### played_hours
(Measured in hours)
- Mean: 12.93
- Min: 0.00
- Max: 223.10
- Standard deviation: 27.63
- First quartile: 0.00
- Median: 0.10
- Third quartile: 0.60

##### Age
(Measured in years)
- Mean: 21.14
- Min: 9
- Max: 58
- Standard deviation: 7.39
- First quartile: 17
- Median: 19
- Third quartile: 22.75
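
Statistics like these could be computed with a single `summarise()` call. The following is a minimal sketch, assuming a small invented tibble `toy_players` in place of the real `players` data, so its output will not match the figures listed above.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `players`; all values are invented for illustration
toy_players <- tibble(
  played_hours = c(0.0, 0.1, 0.6, 1.2, 30.3),
  Age          = c(17, 19, 21, 22, NA)
)

# Summary statistics for played_hours; na.rm = TRUE skips missing values
hours_summary <- toy_players |>
  summarise(
    mean   = mean(played_hours, na.rm = TRUE),
    min    = min(played_hours, na.rm = TRUE),
    max    = max(played_hours, na.rm = TRUE),
    sd     = sd(played_hours, na.rm = TRUE),
    q1     = quantile(played_hours, 0.25, na.rm = TRUE),
    median = median(played_hours, na.rm = TRUE),
    q3     = quantile(played_hours, 0.75, na.rm = TRUE)
  )

hours_summary
```

Replacing `played_hours` with `Age` gives the corresponding table for age; `na.rm = TRUE` matters there because Age contains missing values.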

#### Potential hidden or indirect issues
- Self-reported information may have mistakes or inconsistencies
- Played hours does not indicate quality of gameplay; for example, one player could complete many achievements in a short time
- Selection bias may be present, since players who chose to join the server may differ from those who chose not to participate
- Differences in schedules, time zones, or server accessibility could influence played hours

#### How the data was likely collected
- Variables focused on demographics or that were self-reported were likely collected through a survey or consent form
- Gameplay data such as played hours was likely collected automatically by the Minecraft server's logging system
- Email addresses were likely hashed to anonymize player identities for privacy and ethical reasons

## Questions

Question 2: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Specific question: Can age predict the total amount of gameplay data, measured as played hours, that a player contributes on the Minecraft server?

To address the specific question, we would use each player's age as the explanatory variable and their played hours as the response variable. By wrangling the data set to select these two variables for each participant, we can visualize a potential relationship and fit a simple linear regression model to determine if age is a meaningful factor in total amount of gameplay.
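
The regression step described above might look roughly like the following. This is a minimal sketch, not the final analysis: it assumes a small invented tibble `toy_players` in place of the wrangled data, uses the `tidymodels` workflow interface, and omits any train/test split for brevity.

```r
library(tidymodels)

# Hypothetical stand-in for the wrangled data; all values are invented
toy_players <- tibble(
  Age          = c(15, 17, 18, 19, 20, 21, 22, 25, 30, 40),
  played_hours = c(5.2, 0.1, 12.0, 0.0, 0.6, 3.4, 0.2, 1.1, 0.0, 0.3)
)

# Model specification: ordinary least squares linear regression
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Recipe: played_hours is the response, Age the single predictor
lm_recipe <- recipe(played_hours ~ Age, data = toy_players)

# Combine recipe and model in a workflow, then fit
lm_fit <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec) |>
  fit(data = toy_players)

# Intercept and slope estimates with standard errors and p-values
lm_coefs <- lm_fit |>
  extract_fit_parsnip() |>
  tidy()

lm_coefs
```

The slope on `Age` in the resulting table estimates the change in played hours per additional year of age.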

## Exploratory Data Analysis and Visualization

In [None]:
# Loading data set into R
players <- read_csv("data/players.csv")

In [None]:
# Minimal wrangling: keep the two variables of interest and drop rows with missing values
players_select <- players |>
    select(Age, played_hours) |>
    drop_na()
players_select

In [None]:
# Mean of each selected variable (Age and played_hours)
mean_table <- players_select |>
    summarise(
        mean_age = mean(Age, na.rm = TRUE),
        mean_played_hours = mean(played_hours, na.rm = TRUE)
    )
mean_table

### Exploratory Visualizations

In [None]:
# Distribution of players' ages
options(repr.plot.width = 6, repr.plot.height = 8)
age_distribution <- players_select |>
  ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Player Ages", x = "Age (years)", y = "Count of Players") +
  theme(text = element_text(size = 12))
age_distribution

In [None]:
# Distribution of total played hours across players
options(repr.plot.width = 6, repr.plot.height = 8)
hours_distribution <- players_select |>
  ggplot(aes(x = played_hours)) +
  geom_histogram(binwidth = 10, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Total Gameplay Hours", x = "Total Hours Played on Server", y = "Number of Players") +
  theme(text = element_text(size = 12))
hours_distribution

In [None]:
# Relationship between played hours and age
options(repr.plot.width = 6, repr.plot.height = 8)
players_plot <- players_select |>
    ggplot(aes(x = Age, y = played_hours)) +
    geom_point() +
    labs(title = "Relationship Between Age and Gameplay Hours", x = "Age (years)", y = "Total Hours Played") +
    theme(text = element_text(size = 12))
players_plot

#### Insights
- Hours played is highly variable and right-skewed: most players contribute very few hours, while a few contribute many (extreme outliers)
- Ages are concentrated in a narrow range (the middle 50% of players are between 17 and 22.75 years old), so the sample does not represent the full lifespan evenly
- The scatterplot of Age versus played_hours shows widely spread points with no obvious pattern
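
Because most players sit near zero hours, a few large values compress the rest of the scatterplot. One option, shown here only as a sketch and not as part of the plan above, is to replot played hours on a `log1p` scale, which is defined at zero. The tibble `toy_players` is an invented stand-in for `players_select`.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Hypothetical stand-in for `players_select`; all values are invented
toy_players <- tibble(
  Age          = c(15, 17, 18, 19, 20, 21, 22, 25, 30, 40),
  played_hours = c(5.2, 0.1, 150.0, 0.0, 0.6, 3.4, 0.2, 1.1, 0.0, 0.3)
)

# log1p(x) = log(1 + x), so players with 0 hours map to 0 instead of -Inf
toy_log <- toy_players |>
  mutate(log_hours = log1p(played_hours))

log_plot <- toy_log |>
  ggplot(aes(x = Age, y = log_hours)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Age and Gameplay Hours (log scale)",
    x = "Age (years)",
    y = "log(1 + Total Hours Played)"
  ) +
  theme(text = element_text(size = 12))

log_plot
```

On this scale the low-hour players are spread out rather than stacked along the x-axis, which can make any weak trend easier to see.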

## Methods and Plan

### GitHub Repository

URL to GitHub Repo: https://github.com/gnouvles/individual_project_plan