## Individual Planning Stage

In \[1\]:

    library(tidyverse)
    library(repr)
    library(tidymodels)
    options(repr.matrix.rows = 6)

    ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ✔ dplyr     1.1.4     ✔ readr     2.1.5
    ✔ forcats   1.0.0     ✔ stringr   1.5.1
    ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
    ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
    ✔ purrr     1.0.2     
    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()
    ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

    ✔ broom        1.0.6     ✔ rsample      1.2.1
    ✔ dials        1.3.0     ✔ tune         1.1.2
    ✔ infer        1.0.7     ✔ workflows    1.1.4
    ✔ modeldata    1.4.0     ✔ workflowsets 1.0.1
    ✔ parsnip      1.2.1     ✔ yardstick    1.3.1
    ✔ recipes      1.1.0     

    ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
    ✖ scales::discard() masks purrr::discard()
    ✖ dplyr::filter()   masks stats::filter()
    ✖ recipes::fixed()  masks stringr::fixed()
    ✖ dplyr::lag()      masks stats::lag()
    ✖ yardstick::spec() masks readr::spec()
    ✖ recipes::step()   masks stats::step()
    • Use suppressPackageStartupMessages() to eliminate package startup messages

In \[2\]:

    url_players <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
    players <- read_csv(url_players)
    url_sessions <- "https://drive.google.com/uc?export=download&id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"
    sessions <- read_csv(url_sessions)

    Rows: 196 Columns: 9
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    chr (4): experience, hashedEmail, name, gender
    dbl (2): played_hours, age
    lgl (3): subscribe, individualId, organizationName

    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    Rows: 1535 Columns: 5
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    chr (3): hashedEmail, start_time, end_time
    dbl (2): original_start_time, original_end_time

    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# 1. Data Descriptions

## Players

The players dataset describes individual players on the Minecraft
server. It has 196 observations and 9 variables:

1.  experience (chr): player's level of experience (e.g. Pro, Veteran,
    and Amateur)
2.  subscribe (lgl): TRUE or FALSE, stating whether the player
    subscribes to the server
3.  hashedEmail (chr): player's hashed email
4.  played_hours (dbl): total hours the player spent on the server
5.  name (chr): player's name
6.  gender (chr): player's gender
7.  age (dbl): player's age (in years)
8.  individualId (lgl): player's individual ID
9.  organizationName (lgl): player's organization name

chr = character, lgl = logical, dbl = double

Variables 8 and 9 in the players dataset (individualId and
organizationName) are a data-quality issue: both columns contain only NA
(missing) values. They are typed as logical only because read_csv
guesses the logical type for a column with no observed values, so they
do not hold genuine TRUE or FALSE data and carry no usable information.
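
The all-NA columns can also be found programmatically rather than by eye. The following is a minimal sketch, assuming a small made-up tibble (`players_demo`, a hypothetical stand-in, not the real data) with the same two empty columns; the same pipeline should apply to `players` unchanged.

```r
library(tidyverse)

# Hypothetical stand-in for `players`; the values are made up for illustration.
players_demo <- tibble(
  experience = c("Pro", "Veteran", "Amateur"),
  played_hours = c(30.3, 3.8, 0.0),
  individualId = c(NA, NA, NA),
  organizationName = c(NA, NA, NA)
)

# For each column, check whether every value is missing.
all_na_cols <- players_demo |>
  summarise(across(everything(), ~ all(is.na(.x)))) |>
  unlist()

# Names of the columns that are entirely NA.
empty_names <- names(all_na_cols)[all_na_cols]
empty_names

# Drop those columns before any further analysis.
players_nonempty <- players_demo |>
  select(-all_of(empty_names))
```

Selecting by computed names (instead of hard-coding them) keeps the step valid if other empty columns appear in a later data export.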

## Sessions

The sessions dataset records player activity on the server, with 1,535
observations and 5 variables:

1.  hashedEmail (chr): player's hashed email
2.  start_time (chr): start time of the session, recorded as a date and
    time string
3.  end_time (chr): end time of the session, recorded as a date and
    time string
4.  original_start_time (dbl): numeric timestamp corresponding to
    start_time
5.  original_end_time (dbl): numeric timestamp corresponding to end_time

# 2. Question

The question we aim to answer is: "Which 'kinds' of players are most
likely to contribute a large amount of data, so that we can target those
players in our recruiting efforts?"

### Response variable: played_hours

This variable directly measures a player's contribution, based on the
amount of time the player has spent on the server.

### Explanatory variables:

experience, subscribe, age, gender

## Data Wrangling Plan

Before we analyze the data, we should put it into a tidy, structured
format that is suitable for predictive modelling.

1.  Filter and Select Columns
    -   Exclude columns with all missing values (individualId and
        organizationName)
    -   Keep only the response and explanatory variables
2.  Handle Missing Values
    -   Check for missing values in each column, and decide whether to
        keep the affected rows
3.  Convert Data Types
    -   experience and gender should be categorical variables
    -   age and played_hours should remain numeric
4.  Aggregate and scale predictors
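
Step 2 of the plan (checking for missing values) could be sketched as follows. This assumes a small made-up tibble (`players_demo`, hypothetical) in place of the real data, and it shows dropping incomplete rows as only one possible choice, not a decision the plan has already made.

```r
library(tidyverse)

# Hypothetical stand-in for the selected `players` columns; values are made up.
players_demo <- tibble(
  experience = c("Pro", "Veteran", "Amateur", "Pro"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  gender = c("Male", "Female", "Male", NA),
  age = c(17, NA, 21, 19),
  played_hours = c(30.3, 3.8, 0.0, 1.2)
)

# Count the missing values in every column.
na_counts <- players_demo |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# One option: keep only the rows with no missing values at all.
players_complete <- players_demo |>
  drop_na()
nrow(players_complete)
```

If only a handful of rows are incomplete, dropping them is usually the simplest route; if many are, it may be worth reconsidering whether the affected variable should be a predictor.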

# 3. Exploratory Data Analysis and Visualization

In \[3\]:

    # Tidy format of players: keep the response and the four explanatory variables
    players_tidy  <- players |>
                     select(experience, subscribe, gender, age, played_hours) |>
                     mutate(experience = as_factor(experience),
                            gender = as_factor(gender)) |>
                     filter(!is.na(played_hours))

    # Exploratory visualization
    # Plot 1: experience level vs played hours, colored by subscription status
    plot1 <- players_tidy |>
             ggplot(aes(x = experience, y = played_hours, color = subscribe)) +
             geom_point() +
             labs(x = "Experience Level", y = "Total Played Hours", color = "Subscribed",
                  title = "Experience Vs Amount of Time Player Has Spent on the Server") +
             theme(text = element_text(size = 15))

    # Plot 2: total played hours at each age, stacked by gender
    plot2 <- players_tidy |>
             ggplot(aes(x = age, y = played_hours, fill = gender)) +
             geom_bar(stat = "identity") +
             labs(x = "Age (Years)", y = "Total Played Hours", fill = "Gender",
                  title = "Age Vs Amount of Time Player Has Spent on the Server") +
             theme(text = element_text(size = 15))

    plot1
    plot2

Figure: Experience vs played hours (scatter plot, colored by
subscription status)

Figure: Age vs total played hours (bar plot, filled by gender)

1.  Experience vs Played Hours (scatter plot)
    -   Shows the relationship between player experience and total hours
        played.
    -   Colors indicate subscription status, highlighting any
        differences in engagement between subscribed and non-subscribed
        players.
2.  Age vs Played Hours (Bar plot)
    -   Displays total played hours across different ages, with fill
        color representing gender.
    -   Highlights trends by age and any noticeable gender-related
        differences.
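
A numeric summary can complement the two plots, since overlapping points in a scatter plot over a categorical axis are hard to compare by eye. The sketch below assumes a small made-up tibble (`players_demo`, hypothetical) with the same column names as `players_tidy`; the numbers are illustrative only.

```r
library(tidyverse)

# Hypothetical stand-in for `players_tidy`; values are made up.
players_demo <- tibble(
  experience = factor(c("Pro", "Pro", "Amateur", "Amateur", "Veteran")),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  played_hours = c(10, 2, 0.5, 1.5, 4)
)

# Group size plus mean and median played hours for each
# experience level and subscription status.
hours_summary <- players_demo |>
  group_by(experience, subscribe) |>
  summarise(
    n = n(),
    mean_hours = mean(played_hours),
    median_hours = median(played_hours),
    .groups = "drop"
  )
hours_summary
```

Reporting the median next to the mean is a deliberate choice here: if played_hours is strongly skewed, the two can differ a lot, and the group size `n` shows how much weight each cell deserves.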

# 4. Method and Plan

I will use a linear regression model to answer the question above by
predicting played_hours, a continuous response variable, from the four
explanatory variables (experience, subscribe, age, and gender).

### Assumptions

1.  The relationship between the predictors and the response variable
    is linear
2.  Observations are independent of each other

### Limitations

1.  If the relationship is not linear, the linear model might
    underperform
2.  Extreme values in played_hours can distort the fitted model
3.  Highly correlated predictor variables (for example, age and
    experience) could make the coefficients hard to interpret
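
The second limitation can be checked before modelling. The sketch below flags unusually large played_hours values using the common 1.5 × IQR rule of thumb. That rule is an assumption introduced here for illustration, not part of the original plan, and `players_demo` is a hypothetical tibble with made-up values.

```r
library(tidyverse)

# Hypothetical stand-in for `players_tidy`; values are made up.
players_demo <- tibble(
  played_hours = c(0, 0.1, 0.3, 0.5, 1, 1.2, 2, 150)
)

# Upper fence: third quartile plus 1.5 times the interquartile range.
bounds <- players_demo |>
  summarise(
    q1 = quantile(played_hours, 0.25, names = FALSE),
    q3 = quantile(played_hours, 0.75, names = FALSE),
    upper = q3 + 1.5 * (q3 - q1)
  )

# Rows whose played_hours fall above the fence.
flagged <- players_demo |>
  filter(played_hours > bounds$upper)
flagged
```

Flagged rows are candidates for a closer look, not automatic removals: heavy players may be exactly the group the recruiting question cares about.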

### Model Selection, Comparison, and Data Processing

1.  Split the data into training and testing sets (75/25).
2.  Use cross-validation on the training set to detect overfitting or
    underfitting.
3.  Calculate the RMSPE (root mean squared prediction error).
4.  Train the final model on the full training data with the selected
    parameters.
5.  Test the final model on the holdout test set to validate
    performance.
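
The steps above might look roughly like the following tidymodels sketch. Several things here are assumptions rather than part of the plan: the data are simulated (`players_demo` is a hypothetical stand-in for `players_tidy`), the seed and the choice of 5 folds are arbitrary, subscribe is stored as a factor for simplicity, and the RMSPE is taken to be the `rmse` metric evaluated on the test set.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Hypothetical stand-in for `players_tidy`; all values are simulated.
players_demo <- tibble(
  experience = factor(sample(c("Amateur", "Pro", "Veteran"), 120, replace = TRUE)),
  subscribe = factor(sample(c("TRUE", "FALSE"), 120, replace = TRUE)),
  gender = factor(sample(c("Female", "Male"), 120, replace = TRUE)),
  age = sample(10:40, 120, replace = TRUE),
  played_hours = rexp(120, rate = 0.5)
)

# Step 1: 75/25 split into training and testing sets.
players_split <- initial_split(players_demo, prop = 0.75)
players_train <- training(players_split)
players_test <- testing(players_split)

# Linear regression specification and a recipe naming the four predictors.
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

lm_recipe <- recipe(played_hours ~ experience + subscribe + gender + age,
                    data = players_train)

lm_workflow <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec)

# Step 2: 5-fold cross-validation on the training set only.
players_folds <- vfold_cv(players_train, v = 5)
cv_metrics <- lm_workflow |>
  fit_resamples(resamples = players_folds) |>
  collect_metrics()
cv_metrics

# Step 4: fit the workflow on the full training set.
lm_fit <- lm_workflow |>
  fit(data = players_train)

# Steps 3 and 5: RMSPE, computed as rmse on the held-out test set.
test_rmspe <- lm_fit |>
  predict(players_test) |>
  bind_cols(players_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric == "rmse")
test_rmspe
```

Comparing the cross-validated `rmse` in `cv_metrics` with the test-set value in `test_rmspe` gives a rough check on overfitting: a test error far above the cross-validated error would suggest the model does not generalize well.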
