## DSCI100 GROUP PROJECT

##### Group Members: Aasha (**add student#**), Jamie (70834411), Nathelie (**add student#**), April (**add student#**)
##### Group Number: 007-39

In [1]:
#loading packages
library(tidyverse)
library(repr)
library(tidymodels)
library(dplyr)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

## Introduction
#### Background Information:
This project analyzes real-world data collected from a Minecraft research server (plaicraft.ai), run by the PLAI group of the UBC Computer Science department.             
The goal is to help researchers find out which "kinds" of players are most likely to contribute a large amount of data, so that they can maximize the amount of data collected with their limited resources.              
Specifically, the project will focus on predicting how much playing time a player contributes to the research based on their age and experience. The longer the play time, the more data the participant contributes. 

#### Research Question: Can player age and player experience predict the duration of playing?

#### Dataset Description
Two datasets were provided for this project, namely "players.csv" (assigned to `players`), and "sessions.csv" (assigned to `sessions`).           
`players` stores information about each research participant, while `sessions` stores information about each unique playing session.         

In [None]:
#initial data loading
players <- read_csv("https://raw.githubusercontent.com/jamiekyh/dsci100_project/refs/heads/main/players.csv")
head(players)

sessions <- read_csv("https://raw.githubusercontent.com/jamiekyh/dsci100_project/refs/heads/main/sessions.csv")
head(sessions)

#### Dataset Description: `players`
Now that we have loaded both comma-separated values (CSV) files, we need to narrow the data down to what is needed to answer the question.  
To answer our question, we will be using the `players` dataset to obtain data on players' age, experience and hours played.             

##### Summary
Let's take the summary and break down what is in the dataset:


In [None]:
summary(players)

Tabulated summary of `players`:
| Variable | Type | Description | Number of rows |
|----------|------|-------------|----------------|
| experience      | character | Player's self-reported experience level (e.g., Beginner, Regular, Veteran) | 196 |
| subscribe | logical | Whether the player is subscribed to a game-related newsletter | 196 |
| hashedEmail       | character | Unique identifier for each player | 196 |
| played_hours    | double | Total duration the player spent in the game | 196 |
| name    | character | Name of player | 196 |
| gender          | character | Player's gender (e.g., Male, Female, Other) | 196 |
| Age            | double | Player’s age in years| 196 |

The table above lists each variable with its data type, a short description, and its number of rows.

There are a few things to note from this table:    

                   
First, `Age` is stored as a double, although the ages listed are whole numbers. The data type is not a problem for our analysis, but it is worth noting that age is a discrete numerical variable rather than a continuous one.
For KNN, the distance between two ages will be defined as the absolute value of the difference between the two age values.

Secondly, `experience` is a categorical variable with a natural order. To use it in a KNN regression model, its categories have to be converted into numerical values, with levels assigned according to increasing player experience.
No such "levelling" step is needed for `Age` (a larger age already has a larger numerical value), but it has to be done manually for `experience`. This process will be included in the data wrangling portion of the Methods section.
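
As a minimal sketch of the two points above, the snippet below uses a few made-up experience labels and ages (illustrative values, not rows from `players`). It assumes the ordering Beginner < Amateur < Regular < Pro < Veteran and uses base R's `factor()` and `as.integer()` to obtain numeric levels, then computes the distance between two ages as an absolute difference.

```r
# Hypothetical example values, not rows from the real dataset
toy_experience <- c("Pro", "Beginner", "Veteran", "Amateur")
toy_age <- c(17L, 21L, 9L, 30L)

# Order the levels from least to most experienced
experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")
toy_factor <- factor(toy_experience, levels = experience_levels)

# as.integer() on a factor returns each value's position in `levels`
toy_level <- as.integer(toy_factor)
toy_level  # 4 1 5 2

# Distance between the first two ages: |17 - 21|
age_distance <- abs(toy_age[1] - toy_age[2])
age_distance  # 4
```

Note that this encoding treats the gap between neighbouring experience levels as equal, which is itself a modelling assumption.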

## Methods

To answer our question of whether player age and experience can predict playing time, our analysis will involve data wrangling and cleaning, predictive modelling through KNN regression, and visualization of the obtained results.
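
As a rough preview of the modelling step, the sketch below shows one way a KNN regression could be set up with tidymodels. Everything in it is an assumption made for illustration: the data in `toy_players` is made up, `experience_level` is a hypothetical numeric encoding of experience, K = 3 is an arbitrary choice, and the `kknn` engine (which requires the kknn package to be installed) is just one available option. It is not the final model of this project.

```r
library(tidymodels)

set.seed(2024)  # make the random split reproducible

# Hypothetical data: experience encoded as levels 1 (Beginner) to 5 (Veteran)
toy_players <- tibble(
    experience_level = rep(c(1, 2, 3, 4, 5), times = 4),
    Age = c(17, 21, 9, 30, 25, 19, 22, 16, 28, 20,
            18, 24, 35, 15, 23, 27, 26, 14, 31, 29),
    played_hours = c(0.1, 1.5, 30.3, 0.0, 2.2, 0.7, 5.6, 12.4, 0.3, 1.0,
                     0.0, 3.8, 0.2, 48.4, 0.9, 0.5, 1.7, 9.9, 0.0, 0.4)
)

# Hold out 25% of the rows for testing
toy_split <- initial_split(toy_players, prop = 0.75)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize both predictors so neither dominates the distance
toy_recipe <- recipe(played_hours ~ Age + experience_level, data = toy_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# KNN regression with an arbitrary K of 3 and unweighted neighbours
toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
    set_engine("kknn") |>
    set_mode("regression")

toy_fit <- workflow() |>
    add_recipe(toy_recipe) |>
    add_model(toy_spec) |>
    fit(data = toy_train)

# Predict on the held-out rows and compute regression metrics (incl. RMSE)
toy_preds <- predict(toy_fit, toy_test) |>
    bind_cols(toy_test)
toy_metrics <- toy_preds |>
    metrics(truth = played_hours, estimate = .pred)
toy_metrics
```

In practice K would normally be chosen by cross-validation rather than fixed in advance.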

First, the `players` dataset has to be tidied:
* `experience` should be of factor type rather than character, due to the ordered nature of experience: "Beginner" < "Amateur" < "Regular" < "Pro" < "Veteran"
* `Age` is a discrete numerical variable, so it should be of integer type rather than double.         
                
               
Below, the data is wrangled into a tidy format:

In [None]:
##wrangling the relevant datasets ("players.csv") into a tidy format:

tidy_players <- players |>
    mutate(experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Pro", "Veteran"))) |>
    mutate(Age = as.integer(Age))
head(tidy_players)

To narrow the data down to what our research question needs, we next decide which columns to keep.  
In this case, since we are predicting playing duration from player age and experience, we need three variables: player age (`Age`), player experience (`experience`), and play duration (`played_hours`).                     
Below, we create a more targeted object containing only those three variables:

In [None]:
target_players <- tidy_players |>
    select(experience, Age, played_hours) 
head(target_players)

##### Summarizing the data set for exploratory analysis

Below is a summary of `Age` and `played_hours`, the two numerical variables involved in our data analysis:

In [None]:
mean_players <- target_players |>
    select(Age, played_hours) |>
    map_dfr(mean, na.rm = TRUE) |>
    as.data.frame()
colnames(mean_players) <- c("Mean Age of Participants", "Mean Duration Played (hours)")
mean_players

min_players <- target_players |>
    select(Age, played_hours) |>
    map_dfr(min, na.rm = TRUE) |>
    as.data.frame()
colnames(min_players) <- c("Minimum Age of Participants", "Minimum Duration Played (hours)")
min_players

max_players <- target_players |>
    select(Age, played_hours) |>
    map_dfr(max, na.rm = TRUE) |>
    as.data.frame()
colnames(max_players) <- c("Maximum Age of Participants", "Maximum Duration Played (hours)")
max_players

The following table summarizes the above statistics:

| Variable | Mean | Min | Max |
|----------|------|-----|-----|
| Age (years) | 20.5206 | 8 | 50 |
| played_hours (hours) | 5.8459 | 0 | 223.1 |
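
The same three statistics can also be computed in a single call. The sketch below is one possible alternative, using `summarize()` with `across()` from dplyr on a small made-up tibble; `toy_players` and its values are hypothetical stand-ins, not the real data.

```r
library(tidyverse)

# Hypothetical stand-in for `target_players`; the values are made up
toy_players <- tibble(
    Age = c(17L, 21L, NA, 30L),
    played_hours = c(0.0, 1.5, 3.0, 7.5)
)

# One row holding the mean, min and max of each selected column;
# resulting columns are named like Age_mean, Age_min, played_hours_max
toy_summary <- toy_players |>
    summarize(across(c(Age, played_hours),
                     list(mean = ~ mean(.x, na.rm = TRUE),
                          min = ~ min(.x, na.rm = TRUE),
                          max = ~ max(.x, na.rm = TRUE))))
toy_summary
```

As in the cells above, `na.rm = TRUE` drops missing values before each statistic is computed.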

To understand the distribution of player experience, we count the number of players within each experience level:

In [None]:
count <- target_players |>
    group_by(experience) |>
    summarize(n = n())
count

The dataset contains 35 Beginner, 63 Amateur, 36 Regular, 14 Pro, and 48 Veteran players.

##### Visualizing the data set for exploratory analysis
The above data can also be visualized through graphs to provide a more direct understanding: 

In [None]:
## distribution of player experience (TODO: add figure legend and number)
exp_distribution <- target_players |>
    ggplot(aes(x = experience)) +
        geom_bar(stat = "count") +
        labs(x = "Player Experience", y = "Number of Players", title = "Distribution of Player Experience") +
        theme(text = element_text(size = 16))
exp_distribution