In [None]:
library(tidyverse)
library(janitor)

In [None]:
players <- read_csv("data/players.csv")
players

# preview_players <- players|> slice(1:10)
# preview_players

In [None]:
sessions <- read_csv("data/sessions.csv")
sessions

# preview <- sessions|> slice(1:10)
# preview

<p><h3>Data Description (players.csv and sessions.csv)</h3>
Understanding the demographics of video game players is important for improving user experience and helps developers allocate server resources effectively. The first dataset, players.csv, was collected from players on a public Minecraft server and describes user details and profile information. It consists of 196 rows, each representing a unique player identified by a hashed email. Its 7 columns, each a variable, cover players' demographics, experience, and time spent playing. <p></p>
<p>The data types and meanings of the 7 variables:<p></p>

|Column|Data Type|Meaning|
|------|------|------|
|**experience** |Character| Player's skill level (Veteran, Amateur, Regular)|
|**subscribe**| Logical| Indicates whether users are subscribed to the game |
|**hashedEmail**| Character| Unique identifier for each player |
|**played_hours** |Double| Total time the player has spent on the game, in hours|
|**name**| Character| Represents each player's name|
|**gender** |Character| Represents each player's gender|
|**Age** |Double| Represents each player's age |
 <p></p>

 <p>The second dataset, sessions.csv, was collected from recorded player activity on the same server. It contains 1535 rows, each recording the start and end of one session. The columns start_time and end_time each store the date and the time together in a single character string, which makes them difficult to analyze directly. Unlike in the first dataset, the same hashed email can appear multiple times, because each player can have multiple recorded sessions. The columns original_start_time and original_end_time are redundant with start_time and end_time, and they are not stored in a human-readable format. <p></p>
<p></p>The data types and meanings of the 5 variables: <p>
     
|Column|Data Type|Meaning|
|------|------|------|
 |**hashedEmail**|Character| Unique identifier for each player |
 |**start_time** |Character| The start time and date of each session|
 |**end_time** |Character| The end time and date of each session|
 |**original_start_time** |Double| Start time of each session in milliseconds since 1 January 1970, the Unix epoch (same timestamp as the start_time variable)|
 |**original_end_time**|Double| End time of each session in milliseconds since 1 January 1970, the Unix epoch (same timestamp as the end_time variable)|
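
To illustrate what "milliseconds since 1970" means for the original_start_time and original_end_time columns, here is a minimal sketch that converts such values into readable date-times. The data frame `toy_sessions` and its values are hypothetical and are not taken from sessions.csv; the analysis itself disregards the session start/end times.

```r
library(dplyr)

# Hypothetical toy rows; the values are illustrative, not taken from sessions.csv
toy_sessions <- tibble(
  original_start_time = c(0, 1719770000000),
  original_end_time   = c(60000, 1719773600000)
)

toy_sessions <- toy_sessions |>
  mutate(
    # Divide by 1000 to turn milliseconds into seconds since 1970-01-01 UTC
    start_dt = as.POSIXct(original_start_time / 1000, origin = "1970-01-01", tz = "UTC"),
    end_dt   = as.POSIXct(original_end_time / 1000, origin = "1970-01-01", tz = "UTC"),
    # Session length in minutes
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )
toy_sessions
```

The time zone is set to UTC here as an assumption; the server's actual time zone is not stated in the data description.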



<p><h3>Predictive Question</h3>
Can a player's played hours and total number of sessions predict their experience level? <p></p>

To answer this question, we must first tidy the two data frames by cleaning up column names, converting columns to appropriate data types, and filtering for the desired variables. We then group the rows of the sessions.csv data frame by hashed_email, summarize each group by its total number of sessions (the number of times the same hashed email appears), and disregard the session start/end times.

In [None]:
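# Standardize column names to snake_case and keep only players with recorded play time
players <- players |>
  clean_names() |>
  filter(played_hours > 0)

sessions <- sessions |>
  clean_names()

# Count the number of sessions per player (one row per hashed_email)
sessions_summary <- sessions |>
  group_by(hashed_email) |>
  summarize(total_sessions = n())
sessions_summary

# Attach each player's session count; players with no sessions get NA
combined <- players |>
  left_join(sessions_summary, by = "hashed_email")
combined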
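
The tidying plan also mentions converting columns to appropriate data types and keeping only the desired variables. A minimal sketch of those two steps on hypothetical toy data follows; `toy_players`, `toy_sessions`, and their values are made up for illustration, and treating a missing session count as zero is an assumption, not something the data description confirms.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; not taken from players.csv or sessions.csv
toy_players <- tibble(
  hashed_email = c("a", "b", "c"),
  experience   = c("Veteran", "Amateur", "Regular"),
  played_hours = c(1.5, 0.2, 3.0)
)
toy_sessions <- tibble(hashed_email = c("a", "a", "b"))

# One row per player with the number of recorded sessions
toy_summary <- toy_sessions |>
  count(hashed_email, name = "total_sessions")

toy_combined <- toy_players |>
  left_join(toy_summary, by = "hashed_email") |>
  mutate(
    # Assumption: a player with no matching sessions has zero sessions
    total_sessions = replace_na(total_sessions, 0L),
    # The response of a classification question is usually stored as a factor
    experience = as.factor(experience)
  ) |>
  # Keep only the response and the two predictors
  select(experience, played_hours, total_sessions)
toy_combined
```

Whether to replace the missing counts with zero or drop those players instead depends on how the sessions were recorded, so this choice should be checked against the real data.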