In [None]:
library(tidyverse)
library(janitor)

In [None]:
players <- read_csv("data/players.csv")
players

# preview_players <- players|> slice(1:10)
# preview_players

In [None]:
sessions <- read_csv("data/sessions.csv")
sessions

# preview <- sessions|> slice(1:10)
# preview

<p><h3>Data Description (players.csv and sessions.csv)</h3>
Understanding the demographics of video game players is important for improving user experience and helps developers allocate server resources effectively. The first dataset, players.csv, was collected from players on a public Minecraft server and describes user details and profile information. It consists of 196 rows, each representing a unique player identified by a hashed email. Its 7 columns, one variable each, cover players' demographics, experience, and time spent playing. <p></p>
<p>Table 1. Data types and meanings of the 7 variables in players.csv:<p></p>

|Column|Data Type|Meaning|
|------|------|------|
|**experience** |Character| Player's skill level (Amateur, Beginner, Pro, Regular, Veteran)|
|**subscribe**| Logical| Whether the player is subscribed to the game |
|**hashedEmail**| Character| Unique identifier for each player (hashed email address) |
|**played_hours** |Double| Total time the player has spent on the game, in hours|
|**name**| Character| Player's name|
|**gender** |Character| Player's gender|
|**Age** |Double| Player's age in years |
 <p></p>
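The claims above (one row per player, the column types in Table 1) can be checked directly on the loaded data. A minimal sketch, using a small made-up tibble in place of players.csv: the values and the name `players_demo` are hypothetical, and only the column names and types follow Table 1.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv; all values are invented for illustration.
players_demo <- tibble(
  experience = c("Pro", "Veteran", "Amateur"),
  subscribe = c(TRUE, TRUE, FALSE),
  hashedEmail = c("f6daba", "f3c813", "b674dd"),
  played_hours = c(30.3, 3.8, 0.0),
  name = c("Morgan", "Christian", "Blake"),
  gender = c("Male", "Male", "Male"),
  Age = c(9, 17, 17)
)

# Column names and types at a glance.
glimpse(players_demo)

# Each hashedEmail should appear exactly once if every row is a unique player.
n_duplicated_ids <- players_demo |>
  count(hashedEmail) |>
  filter(n > 1) |>
  nrow()
n_duplicated_ids

# Missing values per column, worth knowing before computing summaries.
missing_per_column <- players_demo |>
  summarize(across(everything(), \(x) sum(is.na(x))))
missing_per_column
```

Running the same three checks on `players` would confirm or correct the description before any modelling.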

 <p>The second dataset, sessions.csv, was likewise collected by recording player activity on the same server. It contains 1535 rows, each recording the start and end of one play session. The columns start_time and end_time each store the date and the time together in a single character string, which makes them difficult to analyze directly. Unlike in players.csv, the same hashed email can appear in multiple rows, since each player can have multiple recorded sessions. The original_start_time and original_end_time columns are redundant with start_time and end_time, and they are not stored in a human-readable format. <p></p>
<p></p> Table 2. Data types and meanings of the 5 variables in sessions.csv: <p>
     
|Column|Data Type|Meaning|
|------|------|------|
 |**hashedEmail**|Character| Unique identifier for each player |
 |**start_time** |Character| Start date and time of each session|
 |**end_time** |Character| End date and time of each session|
 |**original_start_time** |Double| Start of each session in milliseconds since the Unix epoch, 1 January 1970 (same moment as start_time)|
 |**original_end_time**|Double| End of each session in milliseconds since the Unix epoch, 1 January 1970 (same moment as end_time)|
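Because start_time and end_time are stored as character strings, they need to be parsed before session lengths can be computed. A minimal sketch with lubridate, assuming the strings follow a "DD/MM/YYYY HH:MM" layout; that layout, the toy values, and the name `sessions_demo` are assumptions and should be checked against the actual file.

```r
library(tidyverse)
library(lubridate)

# Hypothetical rows standing in for sessions.csv; the date layout is an assumption.
sessions_demo <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"),
  original_start_time = c(1.71977e12, 1.71867e12, 1.72193e12)
)

sessions_parsed <- sessions_demo |>
  mutate(
    # Parse day/month/year hour:minute strings into date-times.
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Session length in minutes.
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins")),
    # Epoch milliseconds to date-time: as_datetime() expects seconds.
    original_start = as_datetime(original_start_time / 1000)
  )
sessions_parsed
```

If the real strings use a different layout, another lubridate parser such as `ymd_hms()` would replace `dmy_hm()`.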



The summary statistics below give the mean, minimum, and maximum of the quantitative variables in players.csv (Age and played_hours) for each experience level, rounded to 2 decimal places:

In [None]:
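# Mean, minimum, and maximum of Age and played_hours for each experience level
summary_players <- players |>
  group_by(experience) |>
  summarize(
    avg_age = mean(Age, na.rm = TRUE),
    min_age = min(Age, na.rm = TRUE),
    max_age = max(Age, na.rm = TRUE),
    avg_played_hours = mean(played_hours, na.rm = TRUE),
    min_played_hours = min(played_hours, na.rm = TRUE),
    max_played_hours = max(played_hours, na.rm = TRUE)
  ) |>
  # Round every numeric column to 2 decimal places
  mutate(across(where(is.numeric), \(x) round(x, 2)))
summary_players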


|Experience|Average Age|Youngest|Oldest|Average Played Hours|Shortest Hours|Longest Hours|
|---|---|---|---|---|---|---|
|Amateur|21.37|11|57|6.02|0|150.00|
|Beginner|21.66|17|42|1.25|0|23.70|
|Pro|16.92|9|25|2.60|0|30.30|
|Regular|22.03|10|58|18.21|0|223.10|
|Veteran|20.96|16|46|0.65|0|12.50|
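A grouped summary like this one can also be written more compactly with `across()`, which applies the same set of functions to several columns at once. A minimal sketch on a made-up tibble: `players_demo` and its values are hypothetical, and the generated column names (for example `mean_Age`) differ from the ones used in this report.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv with only the columns needed here.
players_demo <- tibble(
  experience = c("Pro", "Pro", "Veteran", "Veteran"),
  played_hours = c(30.3, 0.0, 3.8, 0.2),
  Age = c(9, 25, 17, NA)
)

summary_demo <- players_demo |>
  group_by(experience) |>
  summarize(
    across(
      c(Age, played_hours),
      # The same three statistics for every selected column, ignoring NAs.
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        min = \(x) min(x, na.rm = TRUE),
        max = \(x) max(x, na.rm = TRUE)
      ),
      .names = "{.fn}_{.col}"
    )
  ) |>
  mutate(across(where(is.numeric), \(x) round(x, 2)))
summary_demo
```

This form scales better if more quantitative columns are added later, at the cost of less descriptive column names.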

<p><h3>Predictive Question/Exploratory Analysis and Visualization</h3>
Can a player's age and played hours predict their experience level? <p></p>

To answer this question, we first tidy the players.csv data frame by cleaning up the column names, converting experience to a factor (the response variable for KNN classification must be a factor), and keeping only players with more than 0 played hours, since players who never played tell us nothing about how play time relates to experience. Next, we group by both experience and age and average the played hours within each group to better see how the two variables relate at each skill level. This helps us assess whether these variables are useful for predictive analysis.

In [None]:
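# Standardize column names (e.g. Age -> age, hashedEmail -> hashed_email),
# make the response a factor, and keep players with more than 0 played hours
players <- players |>
  clean_names() |>
  mutate(experience = as.factor(experience)) |>
  filter(played_hours > 0)

# Average played hours for each combination of experience level and age
age_hours <- players |>
  group_by(experience, age) |>
  summarize(avg_hours = mean(played_hours), .groups = "drop")
age_hours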
# session_summary <- sessions|>
# group_by(hashed_email) |>
# summarize(total_sessions = n()) 

# combined_data <- players |>
# left_join(session_summary, by = "hashed_email")|>
# select(experience, played_hours, total_sessions)
# combined_data


We can now visualize the relationship between age and played hours using the dataframe "age_hours".

In [None]:
# Scatter plot of average played hours against age, colored by skill level
summary_players_plot <- age_hours |>
  ggplot(aes(x = age, y = avg_hours, color = experience)) +
  geom_point() +
  labs(x = "Age of Players (years)", y = "Average Played Hours",
       title = "Average Played Hours by Age", color = "Skill Level")
summary_players_plot
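The summary table shows that played hours are strongly right-skewed (a maximum of 223.10 against group means below 20), so a few large values can compress most points near the bottom of the plot. One possible variant is a log-scaled y axis with one panel per skill level. The sketch below uses a made-up `age_hours_demo` tibble (hypothetical values), and it relies on all hours being greater than 0, as they are after the filter above.

```r
library(tidyverse)

# Hypothetical stand-in for age_hours; values are invented for illustration.
age_hours_demo <- tibble(
  experience = factor(c("Amateur", "Amateur", "Pro", "Pro", "Regular", "Regular")),
  age = c(17, 21, 14, 25, 19, 22),
  avg_hours = c(0.3, 150.0, 1.2, 30.3, 0.1, 223.1)
)

age_hours_log_plot <- age_hours_demo |>
  ggplot(aes(x = age, y = avg_hours, color = experience)) +
  geom_point(alpha = 0.7) +
  # Log scale spreads out the many small values; requires avg_hours > 0.
  scale_y_log10() +
  # One panel per skill level makes within-group patterns easier to compare.
  facet_wrap(vars(experience)) +
  labs(x = "Age of Players (years)", y = "Average Played Hours (log scale)",
       title = "Average Played Hours by Age", color = "Skill Level")
age_hours_log_plot
```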


From the plot, there is no distinct relationship between age and average played hours.
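The predictive question can still be tested directly with a classifier. Below is a minimal sketch of KNN classification with tidymodels, assuming the kknn package is installed. The simulated `players_demo` data, the 75/25 split, and the choice of 5 neighbors are all illustrative assumptions, not results or settings from this project.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in for the tidied players data; values are random, not real.
players_demo <- tibble(
  experience = factor(sample(c("Amateur", "Pro", "Veteran"), 60, replace = TRUE)),
  age = sample(10:50, 60, replace = TRUE),
  played_hours = round(runif(60, 0.1, 50), 1)
)

# Hold out 25% of the rows for testing, stratified by the response.
players_split <- initial_split(players_demo, prop = 0.75, strata = experience)
players_train <- training(players_split)
players_test <- testing(players_split)

# KNN is distance-based, so both predictors are centered and scaled.
knn_recipe <- recipe(experience ~ age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# neighbors = 5 is an arbitrary choice; in practice it would be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

# Predict on the held-out rows and report accuracy.
knn_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

knn_predictions |>
  metrics(truth = experience, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```

On the real data, the accuracy would need to be compared against the proportion of the most common experience level to judge whether age and played hours add any predictive value.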