**Group 7**

group: project-009-7

ilin27

GitHub Link: https://github.com/ilin27/project_planning_stage_individual.git 
-

In [None]:
# Loading datasets into R
library(tidyverse)

url_players <- "https://raw.githubusercontent.com/ilin27/project_planning_stage_individual/refs/heads/main/players.csv"
download.file(url_players, destfile = "players.csv")
players <- read_csv("players.csv")

url_sessions <- "https://raw.githubusercontent.com/ilin27/project_planning_stage_individual/refs/heads/main/sessions.csv"
download.file(url_sessions, destfile = "sessions.csv")
sessions <- read_csv("sessions.csv")

In [None]:
summary(players)

In [None]:
summary(sessions)

**(1) Data Description:**
-

players.csv
=
- **Number of observations**: 196
- **Summary statistics (2 d.p.)**: Please refer to the tables displayed by summary(players) above. Some interesting ones to note:
    - Many more players subscribed (144) than not (52)
    - The mean time played is about 5.85 hours
    - The mean age of a player is about 21 years
- **Number of variables**: 7
    - experience <chr> The experience level of the player (Amateur, Regular, Pro, Veteran)
    - subscribe <lgl> Whether players have subscribed to the newsletter (TRUE, FALSE)
    - hashedEmail <chr> The player's email, hashed for privacy
    - played_hours <dbl> The number of hours the player has spent on the server
    - name <chr> The player's name
    - gender <chr> The gender of the player (Male, Female, Non-binary, Agender, Two-Spirited, Prefer not to say)
    - Age <dbl> The player's age in years (note that the "A" in the column name is capitalized)
- **Any issues in the data**: Many more players have subscribed (TRUE) than not (FALSE), so the classes are imbalanced. This may bias a classifier toward predicting TRUE and is something to keep in mind.
- **How the data was collected**: Dr. Frank Wood collected data on player actions as they navigate his Minecraft server.
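
The class imbalance noted above can be quantified with a short sketch. It rebuilds only the `subscribe` column from the counts reported in this section (144 TRUE, 52 FALSE) so that it runs on its own; the names `subscribe_counts` and `majority_baseline` are illustrative, and the real analysis would use `players` directly.

```r
library(tidyverse)

# Rebuild the subscribe column from the counts reported above
# (144 TRUE, 52 FALSE); in the notebook, use players$subscribe instead.
subscribe_counts <- tibble(subscribe = c(rep(TRUE, 144), rep(FALSE, 52))) %>%
    count(subscribe) %>%
    mutate(proportion = n / sum(n))

subscribe_counts

# Accuracy of a classifier that always predicts the majority class.
# Any model should be compared against this baseline.
majority_baseline <- max(subscribe_counts$proportion)
round(majority_baseline, 2)
```

A model whose accuracy is close to this baseline has learned little beyond "most players subscribe".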

sessions.csv
=
- **Number of observations**: 1535
- **Summary statistics (2 d.p.)**: Please refer to the tables displayed by summary(sessions) above. Some interesting ones to note:
    - Since start_time and end_time are chr and not dbl, we cannot see useful summary information (e.g. average start time)
- **Number of variables**: 5
    - hashedEmail <chr> The player's email, hashed for privacy
    - start_time <chr> The start date and start time (format: DD/MM/YYYY HH:MM)
    - end_time <chr> The end date and end time (format: DD/MM/YYYY HH:MM)
    - original_start_time <dbl> The original start time as a Unix timestamp (values near 1.72e+12 suggest milliseconds since 1970-01-01 UTC); not needed for the proposed question
    - original_end_time <dbl> The original end time as a Unix timestamp (same units as original_start_time); not needed for the proposed question
- **Any issues in the data**: The start_time and end_time columns are not tidy, since each cell holds two values (date and time).
- **How the data was collected**: Dr. Frank Wood collected data on player login and logout times on his Minecraft server.
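
If sessions.csv were needed later, the character timestamps could be parsed along these lines. This is a minimal sketch on two made-up rows in the DD/MM/YYYY HH:MM format described above. The time zone (lubridate's default, UTC) and the millisecond interpretation of the Unix timestamps are assumptions, and `toy_sessions` is a hypothetical name.

```r
library(tidyverse)
library(lubridate)

# Two made-up rows in the DD/MM/YYYY HH:MM format (not real data)
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

toy_sessions_parsed <- toy_sessions %>%
    mutate(start_time = dmy_hm(start_time),   # chr -> date-time (UTC assumed)
           end_time = dmy_hm(end_time),
           start_date = as_date(start_time),  # date part only
           start_hour = hour(start_time),     # hour of day (0-23)
           session_minutes = as.numeric(difftime(end_time, start_time, units = "mins")))

toy_sessions_parsed

# Assumption: original_start_time is in milliseconds, so divide by 1000
# before converting. 1.72e12 is the approximate mean reported in Table 2.
approx_mean_start <- as_datetime(1.72e12 / 1000)
approx_mean_start
```

Once parsed, `summary()` reports a meaningful range and mean for the time columns.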

**(2) Questions:**
-

I will be exploring Question 1: **"What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"**

Specific question: **"Can the number of hours played (played_hours) and the age of the player (Age) predict if a player will subscribe (subscribe) to a game-related newsletter in the players.csv dataset?"**

The rationale for choosing played_hours as a predictor is that more hours played may indicate a stronger interest in the game, which in turn may make a player more likely to subscribe to the newsletter. Age was chosen because older players may be more interested in reading a newsletter. These are only assumptions, and they need to be tested through visualization and modelling.

I will be focusing on the players.csv dataset, and will not be using the sessions.csv dataset since it does not contain any data relevant to the above stated question.
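
To make the specific question concrete, here is one possible shape for the modelling step. The method has not been chosen in this document, so logistic regression with base R's `glm()` is used purely as a stand-in. The data are simulated, and the assumed relationship between played_hours and subscribe is invented for illustration only.

```r
library(tidyverse)

# Simulated stand-in for players.csv (NOT real data); the link between
# played_hours and subscribe below is an assumption for illustration.
set.seed(2024)
toy_players <- tibble(
    played_hours = rexp(200, rate = 1 / 6),
    Age = round(runif(200, min = 9, max = 58)),
    subscribe = rbinom(200, size = 1, prob = plogis(0.5 + 0.1 * played_hours)) == 1
)

# Logistic regression: probability of subscribing given the two predictors
subscribe_fit <- glm(subscribe ~ played_hours + Age,
                     data = toy_players,
                     family = binomial)

# Predicted probability for a hypothetical new player
new_player <- tibble(played_hours = 10, Age = 21)
predicted_prob <- predict(subscribe_fit, newdata = new_player, type = "response")
predicted_prob
```

With the real data, the same formula would be fitted on a training split and evaluated on held-out players.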

**(3) Exploratory Data Analysis and Visualization**
-

In [None]:
# Reloading the players dataset for this section
library(tidyverse)
url_players <- "https://raw.githubusercontent.com/ilin27/project_planning_stage_individual/refs/heads/main/players.csv"
download.file(url_players, destfile = "players.csv")
players <- read_csv("players.csv")

In [None]:
## Wrangling

# The players.csv dataset is already tidy:
# each row is a single observation, each column is a single variable,
# and each cell contains a single value.
# Therefore, no wrangling is needed at this point of the project.

In [None]:
# Compute the mean value for each quantitative variable in the players.csv and sessions.csv data sets.
# https://www.codecademy.com/resources/docs/markdown/tables

summarize(players,
          mean_played_hours = mean(played_hours, na.rm = TRUE),
          mean_Age = mean(Age, na.rm = TRUE))

summarize(sessions,
          mean_original_start_time = mean(original_start_time, na.rm = TRUE),
          mean_original_end_time = mean(original_end_time, na.rm = TRUE))

### Table 1: Mean Quantitative Variables in players.csv

| mean_played_hours    | mean_Age |
| -------------------- | -------- |
| 5.846                | 21       |

### Table 2: Mean Quantitative Variables in sessions.csv

| mean_original_start_time | mean_original_end_time |
| ------------------------ | ---------------------- |
| 1.72e+12                 | 1.72e+12               |
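
The means in the tables above can also be computed for every numeric column at once, without naming each column. The sketch below uses a small made-up tibble (`toy_players`, not the real data) so that it runs on its own.

```r
library(tidyverse)

# Made-up rows with the same column types as players.csv (not real data)
toy_players <- tibble(
    name = c("A", "B", "C", "D"),
    played_hours = c(0.0, 1.5, 30.3, 0.2),
    Age = c(17, 21, NA, 22)
)

# Mean of every numeric column, ignoring missing values, rounded to 2 d.p.
toy_means <- toy_players %>%
    summarize(across(where(is.numeric),
                     ~ round(mean(.x, na.rm = TRUE), 2),
                     .names = "mean_{.col}"))

toy_means
```

Replacing `toy_players` with `players` or `sessions` would produce the same kind of one-row summary.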

In [None]:
# A few exploratory visualizations of the data.
# Make sure to include labels, titles, units of measurement, etc.
# Explain any insights you gain from these plots that are relevant to address your question.

In [None]:
# Graph 1: Age vs Hours Played

options(repr.plot.width = 12, repr.plot.height = 6)
players_scatter_plot <- ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) + 
    geom_point(na.rm = TRUE) + 
    labs(x = "Age of Player (years)",
         y = "Amount of Time Played (hours)",
         color = "Subscribe or Not",
         title = "Age vs Amount of Time Played") +
    theme(text = element_text(size = 16))
players_scatter_plot

### Graph 1 insights:

- Player ages range from about 9 to 58 years; most players are between 15 and 25.
- There seems to be no relationship between the age of the player and whether or not they subscribe to the newsletter.
- There seems to be a weak relationship between hours played and subscription. All of the players who have not subscribed spent little time playing the game. However, many players who *did* subscribe also spent little time playing. All of the players who spent more than 20 hours on the game subscribed to the newsletter.
- Predictors other than Age may contribute to player subscription and should be explored. The next graph explores whether there is a relationship between hours played, experience level, and subscription.
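
The visual impressions above can be checked numerically. This is a minimal sketch on made-up rows (`toy_players` is hypothetical); running the same pipelines on `players` would confirm or refute the 20-hour observation.

```r
library(tidyverse)

# Made-up rows with the same columns used in Graph 1 (not real data)
toy_players <- tibble(
    played_hours = c(0.0, 0.3, 1.2, 25.0, 48.4, 0.1),
    Age = c(17, 22, 19, 21, 17, 35),
    subscribe = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Did every player with more than 20 hours subscribe?
heavy_players <- toy_players %>%
    filter(played_hours > 20) %>%
    summarize(n_players = n(),
              all_subscribed = all(subscribe))

# Compare hours played and age between the two subscription groups
hours_by_subscription <- toy_players %>%
    group_by(subscribe) %>%
    summarize(median_hours = median(played_hours),
              max_hours = max(played_hours),
              median_age = median(Age, na.rm = TRUE))

heavy_players
hours_by_subscription
```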

In [None]:
# Supplementary plot: Distribution of Player Age

options(repr.plot.width = 8, repr.plot.height = 6)

players_hist <- ggplot(players, aes(x = Age)) +
    geom_histogram(binwidth = 2, na.rm = TRUE) +
    labs(x = "Age of Player (years)",
         y = "Number of Players",
         title = "Distribution of Player Age") +
    theme(text = element_text(size = 14))
players_hist

In [None]:
# Graph 2: Experience Level vs Hours Played
# Each bar stacks the hours of all players at that experience level.

players_bar <- ggplot(players, aes(x = experience, y = played_hours, fill = subscribe)) +
    geom_bar(stat = "identity") +
    labs(x = "Experience Level of Player",
         y = "Amount of Time Played (hours)",
         fill = "Subscribe or Not",
         title = "Experience Level vs Amount of Time Played") +
    scale_fill_manual(values = c("blue", "darkorange")) +
    theme(text = element_text(size = 14))
players_bar

### Graph 2 insights:

- Regular players spent the most time playing on the server, and all Regular players subscribed.
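
The Graph 2 insight about Regular players can be verified with a grouped summary, since a stacked bar of hours does not directly show how many players at each level subscribed. The sketch below uses made-up rows (`toy_players` is hypothetical); applied to `players`, `prop_subscribed` would equal 1 for an experience level only if every player at that level subscribed.

```r
library(tidyverse)

# Made-up rows with the same columns used in Graph 2 (not real data)
toy_players <- tibble(
    experience = c("Regular", "Regular", "Amateur", "Amateur", "Pro", "Veteran"),
    played_hours = c(30.0, 2.0, 1.0, 0.5, 3.0, 0.2),
    subscribe = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Total hours and share of subscribers per experience level
experience_summary <- toy_players %>%
    group_by(experience) %>%
    summarize(n_players = n(),
              total_hours = sum(played_hours),
              prop_subscribed = mean(subscribe)) %>%
    arrange(desc(total_hours))

experience_summary
```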