A research group in Computer Science at UBC, led by Frank Wood, set up a Minecraft server and recorded players' actions along with various observations about them. The group needs to target its recruitment efforts and make sure it has enough resources to accommodate the number of players it attracts.

**Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?**

**players.csv**: a list of all unique players, with 196 rows and 7 columns. Variables:
- **experience**: character, how familiar a player is with the game
    - Entries: amateur, beginner, regular, veteran, and pro.
- **subscribe**: logical, whether the player subscribed to the newsletter (TRUE or FALSE)
- **hashedEmail**: character, hashed string that identifies a player's email without revealing it
- **played_hours**: double, time (in hours) spent playing on this server
- **name**: character, name of the player
- **gender**: character, gender of the player
    - Entries: male, female, non-binary, prefer not to say, two-spirited, agender, and other.
- **Age**: double, age of the player (in years)
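The column types listed above can be declared explicitly when the file is read, so that a type mismatch surfaces at load time rather than later in the analysis. The following is a minimal sketch, assuming the column names match those used in the code cells of this notebook (`subscribe`, `played_hours`, `name`, `gender`, `Age`). The two inline rows are made-up stand-ins for `players.csv`, and `players_demo` is a hypothetical name.

```r
library(readr)

# Two made-up rows standing in for players.csv (hypothetical values).
csv_text <- paste(
  "experience,subscribe,hashedEmail,played_hours,name,gender,Age",
  "Pro,TRUE,abc123,30.3,Alex,Male,9",
  "Amateur,FALSE,def456,0,Sam,Female,21",
  sep = "\n"
)

# Declare the type of each column up front instead of relying on guessing.
# I() tells read_csv() to treat the string as literal data, not a file path.
players_demo <- read_csv(
  I(csv_text),
  col_types = cols(
    experience   = col_character(),
    subscribe    = col_logical(),
    hashedEmail  = col_character(),
    played_hours = col_double(),
    name         = col_character(),
    gender       = col_character(),
    Age          = col_double()
  )
)
players_demo
```

With the real file, the same `col_types` specification could be passed alongside the file path in place of `I(csv_text)`.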


In [1]:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

Read in the data set!

In [2]:
# The working directory is already individual_project/, so the path is relative to it.
players <- read_csv("players.csv")
players


**experience**:
- Counts are broadly similar across categories; pro has the fewest players and amateur the most


In [None]:
# Number of players in each experience category, largest first
experience_count <- players |>
    count(experience) |>
    arrange(desc(n))
experience_count

**subscribe**:
- Most players subscribed


In [3]:
# Number of subscribers and non-subscribers, largest first
subscribe_count <- players |>
    count(subscribe) |>
    arrange(desc(n))
subscribe_count

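Raw counts can be turned into proportions, and the subscription rate can be compared across experience levels, which speaks directly to Question 1. The following is a minimal sketch on made-up toy data; `players_demo`, `subscribe_share`, and `subscribe_by_experience` are hypothetical names, and the values are not taken from the real table.

```r
library(dplyr)

# Made-up toy data (hypothetical values), not the real players table.
players_demo <- tibble(
  experience = c("Pro", "Amateur", "Amateur", "Veteran", "Regular", "Amateur"),
  subscribe  = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Overall share of subscribers: counts plus proportions.
subscribe_share <- players_demo |>
  count(subscribe) |>
  mutate(prop = n / sum(n))
subscribe_share

# Subscription rate within each experience level.
# mean() of a logical column is the fraction of TRUE values.
subscribe_by_experience <- players_demo |>
  group_by(experience) |>
  summarize(n_players = n(), subscribe_rate = mean(subscribe)) |>
  arrange(desc(subscribe_rate))
subscribe_by_experience
```

Reporting `n_players` next to the rate helps flag categories where the rate rests on very few players.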


**played_hours**:
- Many players have 0 hours played, which is a concern because it can skew statistics such as the mean
- Wide range of time played (0 to 223.1 hours)
- Mean is 5.85 hours (toward the low end of the range)
- Standard deviation is 28.36 hours (high relative to the mean)


In [4]:
# How often each value of played_hours occurs, most frequent first
played_hours_count <- players |>
    count(played_hours) |>
    arrange(desc(n))
played_hours_count

# Minimum and maximum hours played, ignoring missing values
played_hours_min <- players |>
    summarize(min_played_hours = min(played_hours, na.rm = TRUE))
played_hours_min

played_hours_max <- players |>
    summarize(max_played_hours = max(played_hours, na.rm = TRUE))
played_hours_max

# Mean and standard deviation, rounded to 2 decimal places
played_hours_mean <- players |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE)) |>
    round(digits = 2)
played_hours_mean

played_hours_sd <- players |>
    summarize(standard_deviation_played_hours = sd(played_hours, na.rm = TRUE)) |>
    round(digits = 2)
played_hours_sd

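The separate summaries above can also be gathered into one table, together with the median and the number of zero-hour players, both of which are useful when many zeros distort the mean. The following is a minimal sketch on made-up toy data; `players_demo` and `played_hours_summary` are hypothetical names, and adding the median is a suggestion rather than part of the original analysis.

```r
library(dplyr)

# Made-up toy data (hypothetical values) with several zeros and one missing value.
players_demo <- tibble(played_hours = c(0, 0, 0, 0.1, 0.5, 1.2, 48.4, NA))

played_hours_summary <- players_demo |>
  summarize(
    min_played_hours    = min(played_hours, na.rm = TRUE),
    max_played_hours    = max(played_hours, na.rm = TRUE),
    mean_played_hours   = mean(played_hours, na.rm = TRUE),
    median_played_hours = median(played_hours, na.rm = TRUE),
    sd_played_hours     = sd(played_hours, na.rm = TRUE),
    # Count of players with exactly 0 hours (missing values excluded).
    n_zero_hours        = sum(played_hours == 0, na.rm = TRUE)
  ) |>
  # Round only the double columns; the integer count is left untouched.
  mutate(across(where(is.double), \(x) round(x, digits = 2)))
played_hours_summary
```

A mean that sits far above the median is one sign of the skew described in the bullets.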


**gender**:
- The distribution is very uneven; male is by far the most common category.

In [5]:
# Number of players in each gender category, largest first
gender_count <- players |>
    count(gender) |>
    arrange(desc(n))
gender_count



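**Age**:
- Ages range from 9 to 58 years; the standard deviation is 7.4 years (wide variation)
- Mean age is about 21 (toward the younger end of the range)

In [6]:
# How often each age occurs, most frequent first
Age_count <- players |>
    count(Age) |>
    arrange(desc(n))
Age_count

# Minimum and maximum age, ignoring missing values
Age_min <- players |>
    summarize(min_Age = min(Age, na.rm = TRUE))
Age_min

Age_max <- players |>
    summarize(max_Age = max(Age, na.rm = TRUE))
Age_max

# Mean and standard deviation, rounded to 2 decimal places
Age_mean <- players |>
    summarize(mean_Age = mean(Age, na.rm = TRUE)) |>
    round(digits = 2)
Age_mean

Age_sd <- players |>
    summarize(standard_deviation_Age = sd(Age, na.rm = TRUE)) |>
    round(digits = 2)
Age_sd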

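The cells above pass `na.rm = TRUE`, which silently drops missing ages, so it can help to report how many values were actually excluded. The following is a minimal sketch on made-up toy data; whether the real `Age` column contains missing values is not stated here, and `players_demo` and `age_missing` are hypothetical names.

```r
library(dplyr)

# Made-up toy data (hypothetical values) with one missing age.
players_demo <- tibble(Age = c(9, 17, 21, 21, 58, NA))

# How many rows there are, how many ages are missing,
# and how many values the na.rm = TRUE summaries are based on.
age_missing <- players_demo |>
  summarize(
    n_players = n(),
    n_missing = sum(is.na(Age)),
    n_used    = sum(!is.na(Age))
  )
age_missing
```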


**name**:
- Each name is listed only once

In [7]:
# Number of rows per name
name_count <- players |>
    count(name)
name_count



**hashedEmail**:
- Each email is recorded only once

In [8]:
# Number of rows per hashed email
hashedEmail_count <- players |>
    count(hashedEmail)
hashedEmail_count

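Scanning a full table of counts for repeats is error-prone, so the uniqueness claims for `name` and `hashedEmail` can be checked by keeping only values that occur more than once. The following is a minimal sketch on made-up toy data, in which one name is deliberately duplicated; `players_demo`, `duplicated_names`, and `duplicated_emails` are hypothetical names.

```r
library(dplyr)

# Made-up toy data (hypothetical values); "Sam" appears twice on purpose.
players_demo <- tibble(
  name        = c("Alex", "Sam", "Kai", "Sam"),
  hashedEmail = c("abc123", "def456", "ghi789", "jkl012")
)

# Values that occur more than once; an empty result means every value is unique.
duplicated_names <- players_demo |>
  count(name) |>
  filter(n > 1)
duplicated_names

duplicated_emails <- players_demo |>
  count(hashedEmail) |>
  filter(n > 1)
duplicated_emails

# Equivalent yes/no check: number of distinct values versus number of rows.
players_demo |>
  summarize(
    names_unique  = n_distinct(name) == n(),
    emails_unique = n_distinct(hashedEmail) == n()
  )
```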


Load in the next data set!

In [40]:
sessions <- read_csv("sessions.csv")

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
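Both tables contain a `hashedEmail` column, so session records can be linked back to individual players. The following is a minimal sketch of counting sessions per player and joining the result onto the player table. It uses made-up toy data; `players_demo`, `sessions_demo`, `sessions_per_player`, and `n_sessions` are hypothetical names, and the `start_time` values are placeholders because the real time format is not assumed here.

```r
library(dplyr)

# Made-up toy tables (hypothetical values), not the real data.
players_demo <- tibble(
  hashedEmail  = c("abc123", "def456", "ghi789"),
  played_hours = c(30.3, 0, 1.5)
)
sessions_demo <- tibble(
  hashedEmail = c("abc123", "abc123", "ghi789"),
  start_time  = c("s1", "s2", "s3")  # placeholder strings only
)

# One row per player who has at least one recorded session.
sessions_per_player <- sessions_demo |>
  count(hashedEmail, name = "n_sessions")

# Keep every player; players without sessions get NA from the join,
# which is replaced with 0.
players_with_sessions <- players_demo |>
  left_join(sessions_per_player, by = "hashedEmail") |>
  mutate(n_sessions = coalesce(n_sessions, 0L))
players_with_sessions
```

A left join is used so that players with no sessions stay in the table, which matters if zero-hour players are to be analyzed rather than dropped.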
