# Individual plan
## Loading data

In [5]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [6]:
players <- read_csv("data/players.csv")
head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


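Data Description:
- Number of observations: 196
- Number of variables: 7
- experience (chr): the player's experience level, one of five categories: Pro, Veteran, Amateur, Regular, Beginner
- subscribe (lgl): whether the player subscribed (TRUE/FALSE); this is the response variable for the question below
- hashedEmail (chr): a hashed identifier for each player; it carries no predictive information and is not used as a predictor
- played_hours (dbl): hours played
- name (chr): the player's name
- gender (chr): the player's gender
- Age (dbl): the player's age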


In [None]:
# Maximum and mean of played_hours across all players
summarize(players, max_hr = max(played_hours, na.rm = TRUE), mean_hr = mean(played_hours, na.rm = TRUE))

## Question to answer
Question: Can experience, play hours, gender, and age predict whether a player subscribes to a game?

Explanation:
The goal is to analyze whether a player's experience level, total playtime, gender, and age influence their likelihood of subscribing. These factors reflect a player's engagement and preferences, helping us understand if certain characteristics make a player more likely to become a subscriber.

Data Wrangling Plan:
To make the data suitable for analysis, the categorical variables need to be converted into numerical form:

- Experience: categorical with five levels (Pro, Veteran, Amateur, Regular, Beginner), so it will be encoded numerically.
- Play Hours & Age: these are already numerical but may need normalization for better model performance.
- Gender: categorical (e.g., Male, Female, and possibly other categories), so I will use one-hot encoding, creating separate binary columns such as gender_male, gender_female, and gender_other.

One potential downside of one-hot encoding is that it increases the number of variables, which may introduce redundancy if differences between categories are small. However, it ensures that each category is represented explicitly in the model.

By converting all variables into numerical form, I will be able to apply predictive methods to analyze their impact on subscription rates effectively.
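The wrangling steps above could be sketched as follows. This is only an illustration on the six rows printed by `head(players)`, not the full dataset. The ordering of the experience levels (Beginner lowest, Pro highest), the `*_std` column names, and the use of z-score standardization as the "normalization" step are assumptions made for the example.

```r
library(dplyr)
library(tibble)

# First six rows of players.csv, copied from the head(players) output
first_six <- tribble(
  ~experience, ~subscribe, ~played_hours, ~gender,  ~Age,
  "Pro",       TRUE,       30.3,          "Male",    9,
  "Veteran",   TRUE,        3.8,          "Male",   17,
  "Veteran",   FALSE,       0.0,          "Male",   17,
  "Amateur",   TRUE,        0.7,          "Female", 21,
  "Regular",   TRUE,        0.1,          "Male",   21,
  "Amateur",   TRUE,        0.0,          "Female", 17
)

# ASSUMPTION: this ordering of experience levels is a guess, not given in the data
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

encoded <- first_six |>
  mutate(
    # Ordinal encoding: 1 = Beginner, ..., 5 = Pro (under the assumed ordering)
    experience_num = as.integer(factor(experience, levels = experience_levels)),
    # One-hot encoding of gender into binary indicator columns
    gender_male   = as.integer(gender == "Male"),
    gender_female = as.integer(gender == "Female"),
    gender_other  = as.integer(!gender %in% c("Male", "Female")),
    # Z-score standardization (mean 0, sd 1) of the numeric predictors
    played_hours_std = as.numeric(scale(played_hours)),
    age_std          = as.numeric(scale(Age))
  )

encoded
```

On the full data, `first_six` would be replaced by `players`. If the experience levels turn out not to have a natural order, one-hot encoding them in the same way as gender may be the safer choice.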



In [None]:
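# Scatter plot of hours played against age, with points colored by age
ggplot(players, aes(x = Age, y = played_hours)) +
  geom_point(aes(color = Age)) +
  labs(x = "Age", y = "Played hours")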

In [27]:
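# Keep only the quantitative columns, then compute the mean of each
players_num <- select(players, played_hours, Age)
players_num |> map_df(mean, na.rm = TRUE)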

played_hours,Age
<dbl>,<dbl>
5.845918,20.52062


In this exploratory analysis, I loaded the players.csv dataset into R and tidied it by removing duplicates and missing values. The means of the quantitative variables are 5.85 for played_hours and 20.52 for Age. In the scatter plot, a few players with very large played_hours values stretch the axis, making it hard to see the pattern among the other players. However, most data points fall within the 10-20 range. Therefore, I plan to filter played_hours to this range to analyze the majority of players and uncover clearer patterns without the influence of extreme values.
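A minimal sketch of the planned filter is shown below, again using only the six rows from `head(players)`. The helper `filter_hours()` is a hypothetical name introduced for this example; the bounds 10 and 20 are the ones mentioned above.

```r
library(dplyr)
library(tibble)

# First six rows of players.csv, copied from the head(players) output
first_six <- tribble(
  ~experience, ~played_hours, ~Age,
  "Pro",       30.3,           9,
  "Veteran",    3.8,          17,
  "Veteran",    0.0,          17,
  "Amateur",    0.7,          21,
  "Regular",    0.1,          21,
  "Amateur",    0.0,          17
)

# Hypothetical helper: keep rows whose played_hours lie in [lower, upper]
filter_hours <- function(df, lower, upper) {
  filter(df, between(played_hours, lower, upper))
}

kept <- filter_hours(first_six, lower = 10, upper = 20)

# Share of rows that survive the filter
share_kept <- nrow(kept) / nrow(first_six)
share_kept
```

None of the six sample rows fall between 10 and 20 hours, so it may be worth computing `share_kept` on the full `players` data before committing to this range.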

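Summary Statistics:
- played_hours: mean 5.85 over the full data; the maximum is still to be calculated (the first six rows shown by head(players) have a maximum of 30.3 and a mean of about 5.82).
- Age: mean 20.52 over the full data; the range is still to be calculated (the first six rows range from 9 to 21, with a mean of 17).
- subscribe: likely a high proportion of TRUE (5 of the first six rows, or 83.33%); the full-data proportion is still to be calculated.
- experience, gender: distributions are still to be calculated on the full data.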
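The pending statistics could be computed along these lines. The sketch runs on the six rows from `head(players)` so that it is self-contained; the object names (`first_six`, `subscribe_rate`, and so on) are made up for the example.

```r
library(dplyr)
library(tibble)

# First six rows of players.csv, copied from the head(players) output
first_six <- tribble(
  ~experience, ~subscribe, ~played_hours, ~gender,  ~Age,
  "Pro",       TRUE,       30.3,          "Male",    9,
  "Veteran",   TRUE,        3.8,          "Male",   17,
  "Veteran",   FALSE,       0.0,          "Male",   17,
  "Amateur",   TRUE,        0.7,          "Female", 21,
  "Regular",   TRUE,        0.1,          "Male",   21,
  "Amateur",   TRUE,        0.0,          "Female", 17
)

# Proportion of subscribers: the mean of a logical column is the share of TRUE
subscribe_rate <- mean(first_six$subscribe, na.rm = TRUE)

# Maximum, mean, and range of the quantitative variables
num_summary <- summarize(
  first_six,
  max_hr   = max(played_hours, na.rm = TRUE),
  mean_hr  = mean(played_hours, na.rm = TRUE),
  min_age  = min(Age, na.rm = TRUE),
  max_age  = max(Age, na.rm = TRUE),
  mean_age = mean(Age, na.rm = TRUE)
)

# Distributions of the categorical variables
experience_counts <- count(first_six, experience)
gender_counts     <- count(first_six, gender)
```

Replacing `first_six` with `players` would give the full-data values for the list above.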

## Methods and plan

I propose a linear regression model to explore the relationship between Age and played_hours in players.csv. Linear regression is chosen because it quantifies linear trends between continuous variables. It is appropriate here, but it assumes a linear relationship and normally distributed residuals. Its limitations include sensitivity to outliers (e.g., a player with 150 hours) and an inability to capture non-linear patterns. Models will be compared using adjusted R-squared and residual standard error. After exploration, the data will be split into 70% training and 30% testing; within the training data, a 10% validation set and 5-fold cross-validation will be used to check stability and reduce the risk of overfitting.
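The plan above might look roughly like this in base R. The data here are simulated stand-ins, not the real players.csv, so the numbers it produces say nothing about the actual players. Treating played_hours as the response and Age as the predictor is an assumption taken from the axes of the scatter plot, the seed is arbitrary, and the separate 10% validation set is left out to keep the sketch short.

```r
set.seed(2024)  # arbitrary seed, only to make the sketch reproducible

# SIMULATED stand-in for players.csv (not the real data): 196 rows
n <- 196
sim_players <- data.frame(
  Age          = sample(9:50, n, replace = TRUE),
  played_hours = rexp(n, rate = 1 / 6)
)

# 70% training / 30% testing split
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim_players[train_idx, ]
test  <- sim_players[-train_idx, ]

# 5-fold cross-validation on the training set, scored by RMSE
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_rmse <- sapply(seq_len(k), function(i) {
  fit      <- lm(played_hours ~ Age, data = train[fold_id != i, ])
  held_out <- train[fold_id == i, ]
  sqrt(mean((held_out$played_hours - predict(fit, newdata = held_out))^2))
})

# Refit on the whole training set and extract the selection metrics
final_fit   <- lm(played_hours ~ Age, data = train)
fit_summary <- summary(final_fit)
adj_r2 <- fit_summary$adj.r.squared  # adjusted R-squared
rse    <- fit_summary$sigma          # residual standard error

# Out-of-sample error on the held-out 30%
test_rmse <- sqrt(mean((test$played_hours - predict(final_fit, newdata = test))^2))

mean(cv_rmse)
```

Comparing `mean(cv_rmse)` with `test_rmse` gives a rough check on stability: a test error far above the cross-validated error would suggest overfitting or an unlucky split.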