# DSCI 100: Individual Project Planning

## (1) Data Description

In the `sessions.csv` dataset, there are 1535 observations. There are 5 variables:
- hashedEmail: Anonymized identifier for each player's email
- start_time: Start time of game session
- end_time: End time of game session
- original_start_time: Original start time of game session
- original_end_time: Original end time of game session


In the `players.csv` dataset, there are 196 observations. There are 7 variables:
- experience: Experience level of the player
- subscribe: Whether a player subscribed to a game-related newsletter
- hashedEmail: Anonymized identifier for each player's email
- played_hours: Total time spent playing for each player in hours
- name: Player name
- gender: Player gender
- Age: Player age in years

### Summary Statistics

For the `players.csv` dataset, the mean player age is 21.14 years, with ages ranging from 9 (minimum) to 58 (maximum) years. The mean hours played per player is 5.90 hours, with a range of 0 (minimum) to 223.1 (maximum) hours. The code for these calculations is listed in section (3) Exploratory Data Analysis and Visualization.

## (2) Questions

The goal of this analysis is to answer the question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? 

In order to address this broader question, this analysis will use the data in the `players.csv` dataset to answer a more specific question: Can age and the total hours played by a player predict whether they subscribe to a game-related newsletter?

The `players.csv` dataset contains the data necessary to answer this specific question, formatted already into a table such that the explanatory variables, `played_hours` (total number of hours played) and `Age` (age of player), are in their own columns and can be analyzed to see if they can predict the response variable, `subscribe` (whether or not the player subscribes to the game-related newsletter). Each player is one observation in this dataset, so we can easily compare these variables for each player and see if the explanatory variables are good predictors.

Since only the variables from the `players.csv` dataset are relevant to this question, only this dataset will be wrangled and used in the analysis.
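As a rough sketch of that wrangling step, the relevant columns could be selected and `subscribe` converted to a factor so that it is treated as a categorical response. The toy tibble `players_example` and all of its values are made up for illustration; only the column names come from `players.csv`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv (values invented for illustration)
players_example <- tibble(
  experience = c("Pro", "Amateur", "Veteran"),
  subscribe = c(TRUE, FALSE, TRUE),
  played_hours = c(30.3, 0.0, 1.5),
  Age = c(9, 21, NA)
)

# Keep the response and the two explanatory variables, drop rows with
# missing values, and store the response as a factor
players_model_data <- players_example |>
  select(subscribe, played_hours, Age) |>
  filter(!is.na(Age), !is.na(played_hours), !is.na(subscribe)) |>
  mutate(subscribe = as.factor(subscribe))

players_model_data
```

Storing `subscribe` as a factor is an assumption about how the later analysis will treat the response; the plots in section (3) work with either a logical or a factor column.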

## (3) Exploratory Data Analysis and Visualization

In [None]:
#Loading libraries

library(tidyverse)
library(repr)
library(tidymodels)

options(repr.matrix.max.rows = 6)
# source('cleanup.R')

In [None]:
#Reading in data from the sessions dataset

sessions_data <- read_csv("data/sessions.csv")

#Reading in data from the players dataset

players_data <- read_csv("data/players.csv")

### `players.csv` Dataset: Tidying and Summary Statistics

The following section of code 1) tidies the data so that it may be analyzed and 2) computes summary statistics (reported in section (1) Data Description).

In [None]:
#Tidying data: Removing rows with NA in order to analyze the data

players_data <- players_data |>
                filter(!is.na(Age), !is.na(played_hours), !is.na(subscribe))
players_data

In [None]:
#Computing summary statistics for the players.csv dataset

#Mean for each quantitative variable

mean_players_data <- players_data |>
    select(played_hours, Age) |>
    map_df(mean)
mean_players_data

#Min and max for each quantitative variable
max_min_players <- players_data |> 
    summarize(min_played_hours = min(played_hours),
              max_played_hours = max(played_hours),
              min_age = min(Age),
              max_age = max(Age))
max_min_players
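
Because the question is whether `played_hours` and `Age` relate to `subscribe`, it may also help to compare these summaries between subscribers and non-subscribers. The following is a minimal sketch using a small made-up tibble (`players_example` is hypothetical, not the real data); the same `group_by()` and `summarize()` calls could be applied to `players_data`.

```r
library(dplyr)
library(tibble)

# Hypothetical example data (values invented for illustration)
players_example <- tibble(
  subscribe = c(TRUE, TRUE, FALSE, FALSE),
  played_hours = c(10, 2, 0, 1),
  Age = c(17, 21, 25, 19)
)

# Count players and compute the mean of each quantitative variable
# separately for each subscription status
summary_by_subscribe <- players_example |>
  group_by(subscribe) |>
  summarize(n_players = n(),
            mean_played_hours = mean(played_hours),
            mean_age = mean(Age))
summary_by_subscribe
```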

### Exploratory Visualizations

In [None]:
#Bar plot of the number of players by subscription status
subscribe_barplot <- players_data |>
    ggplot(aes(x = subscribe)) +
    geom_bar() +
    labs(x = "Subscription status",
         y = "Number of players",
         title = "Number of players by subscription status")
subscribe_barplot

played_hours_histogram <- players_data |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_histogram(binwidth = 10, position = "stack") +
    labs(x = "Total hours played per player",
         y = "Number of players",
         title = "Distribution of the total hours played by each player") +
    scale_fill_manual(values = c("FALSE" = "darkorchid", "TRUE" = "darkorange")) +
    theme(text = element_text(size = 10))
played_hours_histogram


age_histogram <- players_data |> 
    ggplot(aes(x = Age, fill = subscribe)) +
    geom_histogram(binwidth = 10, position = "stack") +
    labs(x = "Age of player (years)",
         y = "Number of players",
         title = "Distribution of player age") +
    scale_fill_manual(values = c("FALSE" = "darkorchid", "TRUE" = "darkorange")) +
    theme(text = element_text(size = 10))
age_histogram

age_vs_played_hours <- players_data |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.6) +
    labs(x = "Age of players",
         y = "Total hours played per player",
         title = "Subscription status based on age vs total hours played") +
    scale_color_manual(values = c("FALSE" = "darkorchid", "TRUE" = "darkorange"))
age_vs_played_hours

## (4) Methods and Plan



## (5) GitHub Repository