In [None]:
library(tidyverse)

In [None]:
# Load data

url_players <- "https://raw.githubusercontent.com/hayounggc/DSCI100_individual_project_planning/refs/heads/main/data/players.csv"
url_sessions <- "https://raw.githubusercontent.com/hayounggc/DSCI100_individual_project_planning/refs/heads/main/data/sessions.csv"

players <- read_csv(url_players)
sessions <- read_csv(url_sessions)

<h3>1. Data description</h3>

<p>These data were collected by the Pacific Laboratory for Artificial Intelligence (PLAI), a research group led by Dr. Frank Wood, through PLAICraft, a Minecraft server. The server records players' actions as they navigate the game.</p>

<h4>players.csv</h4>

In [None]:
players
summary(players)

<ul>
    <li>196 observations</li>
    <li>7 variables</li>
    <li>Users played for a mean of 5.85 hours but a median of only 0.1 hours. The large gap shows that the distribution is strongly right-skewed: most users played for close to 0.1 hours, while a few outliers played far longer. The maximum of 223.1 hours supports this.</li>
    <li>Users' ages ranged from 9 to 58 years, with a mean of 21.1 years. Two observations did not include age data.</li>
</ul>
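
<p>The gap between the mean and the median can be reproduced on a small example. The sketch below uses made-up values (not the real players.csv) to show how a single heavy player pulls the mean far above the median, and how missing ages can be counted. The names demo and demo_summary are hypothetical.</p>

```r
library(dplyr)
library(tibble)

# Hypothetical playtimes: mostly near zero, with one heavy player.
demo <- tibble(
  played_hours = c(0, 0, 0.1, 0.1, 0.2, 0.5, 40),
  Age = c(17, 19, NA, 21, 22, 25, 30)
)

demo_summary <- demo |>
  summarize(
    mean_hours = mean(played_hours),      # pulled upward by the outlier
    median_hours = median(played_hours),  # robust to the outlier
    max_hours = max(played_hours),
    missing_age = sum(is.na(Age))         # number of rows without an age
  )

demo_summary
```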

<br>

<table>
    <tr>
        <th>Variable name</th>
        <th>Variable type</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>experience</td>
        <td>chr</td>
        <td>Gaming experience of the user.</td>
    </tr>
    <tr>
        <td>subscribe</td>
        <td>lgl</td>
        <td>TRUE or FALSE: whether the user is subscribed to a game-related newsletter.</td>
    </tr>
    <tr>
        <td>hashedEmail</td>
        <td>chr</td>
        <td>Hashed (one-way anonymized) email address of the user.</td>
    </tr>
    <tr>
        <td>played_hours</td>
        <td>dbl</td>
        <td>Total time the user played the game, in hours.</td>
    </tr>
    <tr>
        <td>name</td>
        <td>chr</td>
        <td>Name of the user.</td>
    </tr>
    <tr>
        <td>gender</td>
        <td>chr</td>
        <td>Self-identified gender of the user.</td>
    </tr>
    <tr>
        <td>Age</td>
        <td>dbl</td>
        <td>Age of the user, in years.</td>
    </tr>
</table>

<ul>
    <li>One improvement would be to store experience and gender as factors (fct) rather than character strings, since each takes only a limited number of possible values. This would make grouping and summarizing easier later on.</li>
    <li>Another improvement would be consistent headers: writing every column name in lowercase with underscores (for example, hashed_email and age rather than hashedEmail and Age).</li>
</ul>
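
<p>As a minimal sketch of the two improvements above, the code below converts the categorical columns to factors and renames the headers. It runs on a small made-up tibble (players_demo is hypothetical, and its values are invented for illustration) rather than on the real data.</p>

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in rows that mimic the relevant players.csv columns.
players_demo <- tibble(
  experience = c("Pro", "Amateur", "Veteran", "Amateur"),
  gender = c("Male", "Female", "Non-binary", "Male"),
  hashedEmail = c("a1", "b2", "c3", "d4"),
  Age = c(17, 21, 35, NA)
)

players_demo_clean <- players_demo |>
  # Store the categorical columns as factors instead of character strings.
  mutate(across(c(experience, gender), as.factor)) |>
  # Use lowercase, underscore-separated headers for consistency.
  rename(hashed_email = hashedEmail, age = Age)

glimpse(players_demo_clean)
```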

<h4>sessions.csv</h4>

In [None]:
sessions
summary(sessions)

<ul>
    <li>1535 observations</li>
    <li>5 variables</li>
    <li>The mean and median of original_start_time were both approximately 1.719e+12 milliseconds in UNIX time.</li>
    <li>The mean and median of original_end_time were both approximately 1.719e+12 milliseconds in UNIX time.</li>
</ul>

<table>
    <tr>
        <th>Variable name</th>
        <th>Variable type</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>hashedEmail</td>
        <td>chr</td>
        <td>Hashed (one-way anonymized) email address of the user.</td>
    </tr>
    <tr>
        <td>start_time</td>
        <td>chr</td>
        <td>Start date and time of the user's gaming session.</td>
    </tr>
    <tr>
        <td>end_time</td>
        <td>chr</td>
        <td>End date and time of the user's gaming session.</td>
    </tr>
    <tr>
        <td>original_start_time</td>
        <td>dbl</td>
        <td>Start date and time of the user's gaming session in UNIX time (milliseconds).</td>
    </tr>
    <tr>
        <td>original_end_time</td>
        <td>dbl</td>
        <td>End date and time of the user's gaming session in UNIX time (milliseconds).</td>
    </tr>
</table>

<ul>
    <li>One improvement would be to how start_time and end_time are organized. Each currently stores the date and the time together as a single character (chr) string, which makes the data difficult to analyze. Separating them into date and time columns would make it possible, for example, to find the time of day at which most users start playing.</li>
    <li>Another improvement would be to original_start_time and original_end_time. These columns are currently in UNIX time in milliseconds. Because the unit is so small, the values are hard to read and interpret: the mean and median of both columns print as 1.719e+12 milliseconds. Converting to a larger unit would make the values easier to use in the analysis.</li>
</ul>
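
<p>One possible way to address both points is sketched below with lubridate. The rows are made up, and the "dd/mm/yyyy HH:MM" layout of start_time is an assumption that should be checked against the real file before reusing dmy_hm(). The names sessions_demo and sessions_demo_parsed are hypothetical.</p>

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical session rows; the "dd/mm/yyyy HH:MM" layout is an assumption.
sessions_demo <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  original_start_time = c(1.71977e12, 1.71867e12)
)

sessions_demo_parsed <- sessions_demo |>
  mutate(
    # Parse the character string into a date-time (UTC by default).
    start_datetime = dmy_hm(start_time),
    # Split out the pieces needed for time-of-day analysis.
    start_date = as_date(start_datetime),
    start_hour = hour(start_datetime),
    # UNIX milliseconds -> seconds -> readable date-time.
    start_from_unix = as_datetime(original_start_time / 1000)
  )

sessions_demo_parsed
```

<p>Keeping a single parsed date-time column and deriving the date and hour from it avoids working with separate character columns.</p>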

<h3>2. Questions</h3>

<strong>Broad question:</strong> 

<p>We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.</p>

<strong>Specific question:</strong>

<p>Can the time of day at which a user starts playing predict how long that user plays in total?</p>

<p>To answer this question, I plan to separate the date and time in start_time from sessions.csv to obtain the time of day at which each session started. I will then assign each user to one of four groups: overnight, morning, afternoon, or evening gamers. If a user played more than once, the most frequent group (the mode) across their sessions will be used. Next, I will train a regression model to test whether the "kind" of gamer (overnight, morning, afternoon, or evening) predicts a user's total duration of play. The results would allow the research team to target the hours of the day that attract players who contribute larger amounts of data.</p>
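
<p>The grouping step of this plan could look roughly like the sketch below. It uses made-up sessions, and the hour boundaries (overnight 0-5, morning 6-11, afternoon 12-17, evening 18-23) are assumptions chosen for illustration, as are the names sessions_demo, labelled, and player_kinds. Ties for the most frequent group are broken arbitrarily by keeping the first row.</p>

```r
library(dplyr)
library(tibble)

# Hypothetical sessions: start hour (0-23) and session length in minutes.
sessions_demo <- tibble(
  hashed_email = c("a", "a", "a", "b", "b", "c"),
  start_hour = c(20, 21, 9, 2, 3, 14),
  duration_min = c(30, 45, 10, 120, 60, 15)
)

# Label each session with a time-of-day group (boundaries are assumptions).
labelled <- sessions_demo |>
  mutate(time_of_day = case_when(
    start_hour < 6 ~ "overnight",
    start_hour < 12 ~ "morning",
    start_hour < 18 ~ "afternoon",
    TRUE ~ "evening"
  ))

# Mode: the most frequent group per player.
player_mode <- labelled |>
  count(hashed_email, time_of_day, name = "n_sessions") |>
  group_by(hashed_email) |>
  slice_max(n_sessions, n = 1, with_ties = FALSE) |>
  ungroup() |>
  select(hashed_email, time_of_day)

# Total duration of play per player.
player_total <- labelled |>
  group_by(hashed_email) |>
  summarize(total_min = sum(duration_min))

player_kinds <- inner_join(player_mode, player_total, by = "hashed_email")

player_kinds
```

<p>The resulting table has one row per player, with the categorical predictor (time_of_day) and the response (total_min) that a regression model would use.</p>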

<h3>3. Exploratory Data Analysis and Visualization</h3>

In [None]:
# Data is loaded

players
sessions

In [None]:
# Tidy players.csv

players_tidy <- players |>
    rename(hashed_email = hashedEmail,
           age = Age)

players_tidy

# Tidy sessions.csv

sessions_tidy <- sessions |>
    separate(col = start_time,
             into = c("start_date", "start_time"),
             sep = " ") |>
    separate(col = end_time,
             into = c("end_date", "end_time"),
             sep = " ") |>
    rename(hashed_email = hashedEmail,
           start_unix_time_ms = original_start_time,
           end_unix_time_ms = original_end_time)

sessions_tidy

In [None]:
# players.csv mean values

players_quantitative <- select(players_tidy, played_hours, age)

players_mean <- map_dfc(players_quantitative, mean, na.rm = TRUE)

players_mean

In [None]:
# Plot 1: played_hours vs. age

options(repr.plot.height = 5, repr.plot.width = 10)

plot1 <- players_tidy |>
    ggplot(aes(x = age, y = played_hours, color = experience)) +
    geom_point(alpha = 0.5) +
    labs(x = 'Age of user (years)', y = 'Total duration of play (hours)') +
    ggtitle("Number of hours played based on the user's age") +
    theme(text = element_text(size = 15))
    
plot1

plot1.5 <- players_tidy |>
    ggplot(aes(x = age, y = played_hours, color = experience)) +
    geom_point(alpha = 0.5) +
    labs(x = 'Age of user (years)', y = 'Total duration of play (hours)') +
    ggtitle("Number of hours played based on the user's age (max = 40 hours)") +
    theme(text = element_text(size = 15)) +
    ylim(c(0, 40))
    
plot1.5

<ul>
    <li>The first graph shows that most users are between 15 and 30 years old and that most of them played fewer than 10 hours, with the exception of a few users who recorded over 100 hours. Most users older than 30 played less than an hour.</li>
    <li>The second graph shows only the data points with a total duration of at most 40 hours. It gives a closer look at the points that better represent the overall population of participants. From it, we can see that most users between the ages of 15 and 30 played less than 5 hours.</li>
    <li>These graphs relate to my question because they suggest that the users who provide a lot of data are those who play for longer than 5 hours. They also show that the few values above 40 hours do not represent the participant population well.</li>
</ul>

In [None]:
# Plot 2: experience vs. median duration

experience_hours <- players_tidy |>
    group_by(experience) |>
    summarize(median_duration = median(played_hours, na.rm = TRUE))

experience_hours

options(repr.plot.height = 7, repr.plot.width = 9)

plot2 <- experience_hours |>
    ggplot(aes(x = experience, y = median_duration)) +
    geom_col(fill = "steelblue") +
    labs(x = 'Gaming experience of user', y = 'Median duration of play (hours)') +
    ggtitle("Median number of hours played based on user's gaming experience") +
    theme(text = element_text(size = 15))
    
plot2

<ul>
    <li>Seeing that there are a few outliers in </li>
</ul>