In [None]:
library(tidyverse)

In [None]:
# Load data

url_players <- "https://raw.githubusercontent.com/hayounggc/DSCI100_individual_project_planning/refs/heads/main/data/players.csv"
url_sessions <- "https://raw.githubusercontent.com/hayounggc/DSCI100_individual_project_planning/refs/heads/main/data/sessions.csv"

players <- read_csv(url_players)
sessions <- read_csv(url_sessions)

players
sessions

In [None]:
summary(players)
summary(sessions)

<h3>1. Data description</h3>

<p>This data was collected by Dr. Frank Wood and his research team, the Pacific Laboratory for Artificial Intelligence (PLAI), through PLAICraft, a Minecraft server. The server records users' actions as they play and navigate through the game.</p>

<h4>players.csv</h4>

<ul>
    <li>196 observations</li>
    <li>7 variables</li>
    <li>Users played for a mean of 5.85 hours, but the median was only 0.1 hours. This gap indicates a strongly right-skewed distribution: most users played very little, while a few heavy users (maximum of 223.1 hours) pull the mean upward.</li>
    <li>Users' ages ranged from 9 to 58 years, with a mean of 21.1 years. Two observations are missing age data.</li>
</ul>
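
<p>As a rough illustration of why the mean and median can differ so much, the sketch below computes both on a small made-up tibble. The data frame <code>players_example</code> and its values are hypothetical stand-ins, not rows from players.csv.</p>

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv; these values are invented.
players_example <- tibble(
  played_hours = c(0, 0.1, 0.1, 0.2, 50),
  age = c(9, 21, NA, 17, 58)
)

# One large value pulls the mean far above the median.
players_summary <- players_example |>
  summarise(
    mean_hours = mean(played_hours),
    median_hours = median(played_hours),
    max_hours = max(played_hours),
    mean_age = mean(age, na.rm = TRUE),  # ignore missing ages
    missing_age = sum(is.na(age))        # count missing ages
  )

players_summary
```

<p>Running the same <code>summarise()</code> call on the real <code>players</code> data frame should give the figures reported above, assuming the column names match.</p>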

<br>

<table>
    <tr>
        <th>Variable name</th>
        <th>Variable type</th>
        <th>Description</th>
    </tr>
    <tr>
        <th>experience</th>
        <th>chr</th>
        <th>Values describe gaming experience of user.</th>
    </tr>
    <tr>
        <th>subscribe</th>
        <th>lgl</th>
        <th>Values are TRUE or FALSE, indicating whether the user is subscribed to a game-related newsletter.</th>
    </tr>
    <tr>
        <th>hashedEmail</th>
        <th>chr</th>
        <th>Values are the hashed (one-way anonymized) email addresses of users.</th>
    </tr>
    <tr>
        <th>played_hours</th>
        <th>dbl</th>
        <th>Values describe the amount of time the user played the game in hours.</th>
    </tr>
    <tr>
        <th>name</th>
        <th>chr</th>
        <th>Values are the names of the users.</th>
    </tr>
    <tr>
        <th>gender</th>
        <th>chr</th>
        <th>Values describe self-identified gender of user.</th>
    </tr>
    <tr>
        <th>age</th>
        <th>dbl</th>
        <th>Values describe age of users in years.</th>
    </tr>
</table>

<ul>
    <li>One thing that could be improved is the variable type of experience and gender. Both are stored as chr even though each has a limited number of possible values, so converting them to factors (fct) would make later grouping and analysis easier.</li>
    <li>Another aspect that could be improved is the column headers. They are inconsistently named (mixed capitalization and underscores), which may be confusing or inefficient when wrangling or analyzing the data later on.</li>
</ul>
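
<p>The two improvements above could be sketched as follows. This is a minimal example on made-up rows: <code>players_example</code>, its values, and the new name <code>hashed_email</code> are assumptions for illustration, not part of the original data.</p>

```r
library(dplyr)
library(tibble)

# Hypothetical rows standing in for players.csv; the values are invented.
players_example <- tibble(
  experience = c("Pro", "Amateur", "Veteran"),
  subscribe = c(TRUE, FALSE, TRUE),
  hashedEmail = c("a1", "b2", "c3"),
  played_hours = c(30.3, 0.1, 0.0),
  name = c("A", "B", "C"),
  gender = c("Male", "Female", "Male"),
  age = c(9, 21, 17)
)

players_tidy <- players_example |>
  # Use one naming style (snake_case) for every column.
  rename(hashed_email = hashedEmail) |>
  # Store the categorical columns as factors instead of character.
  mutate(across(c(experience, gender), as.factor))

players_tidy
```

<p>With factors in place, <code>levels(players_tidy$experience)</code> lists the distinct categories, which is handy when checking for unexpected values.</p>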

<h4>sessions.csv</h4>

<ul>
    <li>1535 observations</li>
    <li>5 variables</li>
    <li>The mean and median of original_start_time were both approximately 1.719e+12 milliseconds in UNIX time.</li>
    <li>The mean and median of original_end_time were both approximately 1.719e+12 milliseconds in UNIX time.</li>
</ul>

<table>
    <tr>
        <th>Variable name</th>
        <th>Variable type</th>
        <th>Description</th>
    </tr>
    <tr>
        <th>hashedEmail</th>
        <th>chr</th>
        <th>Values are the hashed (one-way anonymized) email addresses of users.</th>
    </tr>
    <tr>
        <th>start_time</th>
        <th>chr</th>
        <th>Values describe the start date and time of the user's gaming session.</th>
    </tr>
    <tr>
        <th>end_time</th>
        <th>chr</th>
        <th>Values describe the end date and time of the user's gaming session.</th>
    </tr>
    <tr>
        <th>original_start_time</th>
        <th>dbl</th>
        <th>Values represent the start date and time of the user's gaming session in UNIX time (milliseconds).</th>
    </tr>
    <tr>
        <th>original_end_time</th>
        <th>dbl</th>
        <th>Values represent the end date and time of the user's gaming session in UNIX time (milliseconds).</th>
    </tr>
</table>

<ul>
    <li>One part that could be improved is the organization of the start_time and end_time variables. Each stores the date and time together in a single chr column, which makes the data difficult to analyze. Separating them into date and time components would help: with the time on its own, for example, you could find the time of day at which the most users start to play.</li>
    <li>Another aspect that could be improved is original_start_time and original_end_time. These columns are currently in UNIX time in milliseconds. Because the unit is so small, the values are hard to read and to draw meaningful inferences from. For example, the mean and median of both columns print as 1.719e+12 milliseconds, which hides any variation between sessions. Converting to a larger unit, or to a readable date-time, would make the data easier to use during analysis.</li>
</ul>
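
<p>Both suggestions could be sketched with lubridate as shown below. The rows in <code>sessions_example</code> are made up, and the day/month/year hour:minute text format is an assumption about how start_time and end_time are written. If the real file uses a different format, a different parser (such as <code>ymd_hms()</code>) would be needed.</p>

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical rows standing in for sessions.csv; the values are invented
# and the "dd/mm/yyyy HH:MM" text format is an assumption.
sessions_example <- tibble(
  hashedEmail = c("a1", "b2"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1.71977e12, 1.71867e12),
  original_end_time = c(1.71977e12, 1.71867e12)
)

sessions_tidy <- sessions_example |>
  mutate(
    # Parse the character columns into date-times (UTC by default).
    start_dt = dmy_hm(start_time),
    end_dt = dmy_hm(end_time),
    # Split out the date and the hour of day for separate analysis.
    start_date = as_date(start_dt),
    start_hour = hour(start_dt),
    # Session length in minutes.
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # UNIX milliseconds -> seconds -> readable date-time.
    original_start_dt = as_datetime(original_start_time / 1000)
  )

sessions_tidy
```

<p>Counting rows by <code>start_hour</code>, for instance with <code>count(sessions_tidy, start_hour)</code>, would then show which hours of the day have the most session starts.</p>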