# Individual Planning Report

GitHub Repository link: https://github.com/mcheng250/DSCI_project

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)

In [None]:
players_origin <- read_csv("https://raw.githubusercontent.com/mcheng250/DSCI_project/refs/heads/main/players.csv")
players_origin

In [None]:
players_tidy <- players_origin |> 
                    mutate(gender = as.factor(gender), experience = as.factor(experience)) 
# treat gender and experience as categorical
players_tidy

In [None]:
players_tidy |> summarize(mean_played_hours = round(mean(played_hours, na.rm = TRUE),2),
                            min_played_hours = round(min(played_hours, na.rm = TRUE),2),
                            max_played_hours = round(max(played_hours, na.rm = TRUE),2)
                           )
players_tidy |> summarize(mean_age = round(mean(Age, na.rm = TRUE),2),
                            min_age = round(min(Age, na.rm = TRUE),2),
                            max_age = round(max(Age, na.rm = TRUE),2)
                           )
players_tidy |> filter(subscribe == FALSE) |> summarize(subscribe_FALSE = n())
players_tidy |> filter(subscribe == TRUE) |> summarize(subscribe_TRUE = n())

mean_table <- players_tidy |> summarize(mean_played_hours = round(mean(played_hours, na.rm = TRUE),2),
                                        mean_age = round(mean(Age, na.rm = TRUE),2)) |>
                              pivot_longer(cols = everything(),
                                           names_to = "variable", 
                                           values_to = "mean_value")
mean_table

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8) 
age_plot <- players_tidy |>
                ggplot(aes(x=Age)) +
                geom_histogram(bins = 30) +
                labs(title = "Distribution of Age", x = "Age(years)", y = "count")
age_plot

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8) 
played_hours_plot <- players_tidy |>
                ggplot(aes(x=played_hours)) +
                geom_histogram(bins = 15) +
                labs(title = "Distribution of Played Hours", x = "Played Hours(hours)", y = "count")
played_hours_plot

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8) 
# geom_bar(stat = "identity") stacks every player's hours, so each bar is the group total
exp_vs_played_hours <- players_tidy |>
                ggplot(aes(x=experience, y=played_hours)) +
                geom_bar(stat = "identity") +
                labs(title = "Experience VS Total Played Hours", x = "Experience", y = "Total Played Hours(hours)")
exp_vs_played_hours

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8) 
age_vs_played_hours <- players_tidy |>
                ggplot(aes(x=Age, y=played_hours)) +
                geom_point(alpha = 0.5) +
                labs(title = "Age VS Played Hours", x = "Age(years)", y = "Played Hours(hours)")
age_vs_played_hours

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8) 
age_vs_played_hours_c <- players_tidy |>
                ggplot(aes(x=Age, y=played_hours)) +
                geom_point(alpha = 0.5,aes(color=experience)) +
                labs(title = "Age VS Played Hours", x = "Age(years)", y = "Played Hours(hours)")
age_vs_played_hours_c

# Insights from plots
* Players aged roughly 15 to 27 played this game the most, especially those between 15 and 20.
* Most players played for less than one hour.
* Amateur and Regular players have the highest total played hours, while Pro and Veteran players have the lowest. This may suggest that the game is not engaging enough to attract Pro players, but the bars show group totals, so group sizes should be checked before drawing that conclusion.
* It is hard to draw a conclusion about the relationship between experience and age without further wrangling.
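
A per-group summary is one way to check whether the totals in the experience bar chart reflect group size or typical play time. The following is a minimal sketch, not the analysis itself: `players_toy` is a hypothetical stand-in with made-up values, and in the notebook `players_tidy` would take its place.

```r
library(tidyverse)

# Hypothetical stand-in for players_tidy (values are made up for illustration)
players_toy <- tibble(
  experience = factor(c("Amateur", "Amateur", "Amateur", "Pro", "Veteran", "Veteran")),
  played_hours = c(0.5, 2.0, 30.0, 1.0, 0.0, 0.2)
)

# Separate group size from total and typical play time
experience_summary <- players_toy |>
  group_by(experience) |>
  summarize(
    n_players = n(),
    total_hours = sum(played_hours, na.rm = TRUE),
    mean_hours = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE)
  )
experience_summary
```

Comparing `median_hours` with `total_hours` shows whether a tall bar comes from many players or from a few heavy players.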

# Data Description-players.csv:

* Observations: 196
  
* Summary statistics:
    * played_hours:
        * mean = 5.85
        * min = 0
        * max = 223.1
    * Age:
        * mean = 21.14
        * min = 9
        * max = 58
* Number of variables: 7

* Variable names & types & meanings:
    * experience(character--Categorical, Explanatory): Player’s self-assessed Minecraft experience level
    * subscribe(logical, Response & Explanatory): Whether the player subscribed to the newsletter
    * hashedEmail(character, Identifier): The player's hashed email, used to link the player's profile data to sessions.
    * played_hours(double--Numeric, Explanatory & Response): Player’s cumulative play time
    * name(character, Identifier): Player’s display name
    * gender(character--Categorical, Explanatory): Player’s gender
    * Age(double--Numeric, Explanatory): Player's age
      
* Any issues you see in the data:
    * There are two NA cells in the Age column.
    * The dataset could use some wrangling to be better organized, for example sorting rows in ascending order of age or played hours.
      
* Any other potential issues related to things you cannot directly see:
    * One person could have multiple accounts, since players log in with an email address. Their playing hours would then be split across accounts.
    * There is an imbalance in the subscribe column, with FALSE = 52 and TRUE = 144, which could affect the prediction stage.
      
* How the data were collected: The project ran on a Minecraft server. When players logged in to play, they were asked to provide their experience level, name, gender, age, and email. The email was transformed into hashedEmail in the dataset, which links to the sessions dataset to obtain the played_hours variable. Players were also asked whether they wanted to receive the newsletter, which produced the subscribe variable.
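
The missing-value and imbalance checks listed above can be reproduced with a short summary. This is a sketch under assumptions: `players_toy` is a hypothetical stand-in with made-up values, and `players_tidy` would replace it in the notebook.

```r
library(tidyverse)

# Hypothetical stand-in for players_tidy (values are made up for illustration)
players_toy <- tibble(
  subscribe = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  Age = c(17, NA, 21, 35, NA)
)

# Number of missing values in every column
missing_counts <- players_toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))
missing_counts

# Count and share of each subscribe value, to gauge class imbalance
subscribe_counts <- players_toy |>
  count(subscribe) |>
  mutate(proportion = n / sum(n))
subscribe_counts
```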

In [None]:
sessions_origin <- read_csv("https://raw.githubusercontent.com/mcheng250/DSCI_project/refs/heads/main/sessions.csv")
sessions_origin

# Data Description-sessions.csv:
* Observations: 1535

* Number of variables: 5

* Variable names & types & meanings:
    * hashedEmail(character, Identifier): The player's hashed email (user ID), used to link sessions to the player's profile
    * start_time(character): Human-readable login time
    * end_time(character): Human-readable logout time
    * original_start_time(double, Numeric): Raw system login time
    * original_end_time(double, Numeric): Raw system logout time


* How the data were collected: Each time a player logs in, the start time is recorded, and when they log out, the end time is recorded.
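
Since hashedEmail is the key shared by both files, the link between sessions and player profiles can be sketched as a join. The tibbles below are hypothetical stand-ins with made-up IDs (the time columns are omitted), and `players_origin` and `sessions_origin` would replace them in the notebook.

```r
library(tidyverse)

# Hypothetical stand-ins for players_origin and sessions_origin
players_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  played_hours = c(1.5, 0, 12.0)
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "c3")
)

# One row per player with the number of recorded sessions
sessions_per_player <- sessions_toy |>
  count(hashedEmail, name = "n_sessions")

# Keep every player; players without sessions get a count of 0
players_with_sessions <- players_toy |>
  left_join(sessions_per_player, by = "hashedEmail") |>
  mutate(n_sessions = replace_na(n_sessions, 0L))
players_with_sessions
```

A left join keeps players who never started a session, which matters here because many players have very few played hours.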

# Questions:
I will address Question 2 with my formulated question: Can a player's age predict their played hours?
I will explore the relationship between age and played hours by plotting, while wrangling players' ages together with their corresponding game experience. I will use regression for the relationship between age and played hours, and classification for players' age and experience if necessary.

# Methods and Plan:
  I will use linear regression as the primary method for age and played hours, since its slope gives an interpretable estimate of how played hours change with age. A clear slope suggests that age carries predictive information, although prediction accuracy still has to be measured on held-out data. I'm not choosing classification because Age and played_hours are both numeric rather than factors, but I may use it to explore players' age and experience.
  
  Assumptions: The relationship should be roughly linear. If it turns out to be curved, I will consider switching to k-nn regression.
  
  Limitations: I'm assuming a straight-line relationship, and linear regression cannot capture curvature in how age relates to played hours.
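
  One way to judge the linearity assumption is to overlay a fitted straight line on the scatter plot and look for systematic departures from it. The sketch below uses `players_toy`, a hypothetical stand-in with simulated values; `players_tidy` would replace it in the notebook.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Hypothetical stand-in for players_tidy (simulated values, not real data)
players_toy <- tibble(
  Age = sample(9:58, 60, replace = TRUE),
  played_hours = rexp(60, rate = 0.2)
)

# Scatter plot with a least-squares line to inspect the linearity assumption
linearity_check <- players_toy |>
  drop_na(Age, played_hours) |>
  ggplot(aes(x = Age, y = played_hours)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Age VS Played Hours with linear fit",
       x = "Age(years)", y = "Played Hours(hours)")
linearity_check
```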

  I will use RMSE as the metric to choose between k-nn and linear regression; the model with the lower RMSE is better.

  I will select the numeric variables Age and played_hours, drop rows with NA, and split the dataset into 80% training and 20% testing. There will be a cross-validation step.
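
  The plan above can be sketched with tidymodels as follows. This is a minimal sketch, not the final analysis. The helper name `compare_age_models`, the 5 folds, the grid of k values, and the use of cross-validation to tune k on the training set are all assumptions made for illustration, and the k-nn model requires the kknn package to be installed.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: compares linear and k-nn regression of played_hours on Age
compare_age_models <- function(players, prop = 0.8, folds = 5,
                               k_grid = seq(1, 15, by = 2)) {
  # Keep the two numeric variables and drop rows with NA
  players_numeric <- players |>
    select(Age, played_hours) |>
    drop_na()

  # 80/20 train/test split
  players_split <- initial_split(players_numeric, prop = prop)
  players_train <- training(players_split)
  players_test <- testing(players_split)

  players_recipe <- recipe(played_hours ~ Age, data = players_train)

  # Linear regression fit on the training set
  lm_spec <- linear_reg() |> set_engine("lm") |> set_mode("regression")
  lm_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(lm_spec) |>
    fit(data = players_train)

  # k-nn regression: tune k by cross-validation on the training set (assumed design)
  knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")
  best_k <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = vfold_cv(players_train, v = folds),
              grid = tibble(neighbors = k_grid)) |>
    collect_metrics() |>
    filter(.metric == "rmse") |>
    slice_min(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

  # Refit k-nn on the full training set with the selected k
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("regression")
  knn_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    fit(data = players_train)

  # RMSE of a fitted workflow on the held-out test set
  test_rmse <- function(fitted_workflow) {
    predict(fitted_workflow, players_test) |>
      bind_cols(players_test) |>
      rmse(truth = played_hours, estimate = .pred) |>
      pull(.estimate)
  }

  tibble(model = c("linear regression", "k-nn regression"),
         test_rmse = c(test_rmse(lm_fit), test_rmse(knn_fit)),
         best_k = c(NA, best_k))
}
```

  In the notebook this would be called as `compare_age_models(players_tidy)` after a `set.seed()` call, so that the split and folds are reproducible. The model with the lower `test_rmse` is the one the plan would keep.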
  