In [None]:
library(tidyverse)  
library(lubridate)
library(dplyr)
library(readr)
library(ggplot2)

# ***Title: How To Get More Players***

Analyzing user data has become increasingly important in today’s gaming industry. By collecting and analyzing player data, we can identify ***which "kinds" of players*** are most likely to contribute a large amount of data. This insight allows us to better target those players in our recruiting efforts, which is the primary goal of our project.

To achieve this, we developed and explored four key questions:
1. Relation between "age" and engagement level metrics  
    > ***Do older players spend more time per session? (age vs average session duration)***

2. Relation between "experience" and engagement level metrics  
    > ***Do players with more experience tend to play more? (experience vs Average Played Hours)***

3. Relation between "gender" and engagement level metrics  
    > ***Which gender has more sessions? (gender vs total sessions)***

4. Relation between "subscription" and engagement level metrics  
    > ***Do subscribers tend to play more than non-subscribers? (subscribe vs total play time)***

The datasets used for this analysis include personal information of players from a specific game platform, along with detailed records of their play sessions. ***We merged two datasets (players.csv, sessions.csv) using the players' hashed email addresses (`hashedEmail`) as the common key.*** The variables in the merged dataset include age, gender, experience, subscription status, name, hashed email, total play hours, session start time, and end time. Based on these variables, we performed various calculations and comparisons to answer our research questions.



### **Data** 

In [None]:
players <- read_csv("https://raw.githubusercontent.com/Elvis614412/Dsci-100-group-project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/Elvis614412/Dsci-100-group-project/refs/heads/main/sessions.csv")

merged_data <- left_join(players,sessions, by = "hashedEmail")

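# Drop columns that are not needed for the analysis
merged_data <- merged_data |>
    select(-hashedEmail, -original_start_time, -original_end_time)

# Replace missing ages with the mean age (a separate statement, not part of the pipe above)
merged_data$Age[is.na(merged_data$Age)] <- mean(merged_data$Age, na.rm = TRUE)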

head(merged_data, 5)

## ***Q1. Do older players spend more time per session? (age vs average session duration)***

In [None]:
merged_data <- merged_data|>
    mutate(
        start_time = as.POSIXct(start_time, format = "%m/%d/%Y %H:%M"),
        end_time = as.POSIXct(end_time, format = "%m/%d/%Y %H:%M"),
        session_duration = as.numeric(difftime(end_time, start_time, units = "mins"))
    )

name_order <- merged_data |>
  count(name, sort = TRUE) |>
  pull(name)

merged_data$name <- factor(merged_data$name, levels = name_order)

merged_data <- merged_data |>
  arrange(name,start_time)

# merged_data <- merged_data |>
#     filter(name == "Morgan") |>
#     arrange(start_time)

merged_data_clean <- merged_data |>
  filter(session_duration <= 1440 ,!is.na(start_time),!is.na(end_time))

merged_data2 <- merged_data_clean |> 
    group_by(name,Age,experience,subscribe) |>
    summarize(average_playtime = mean(session_duration),.groups = "drop")

merged_data2 <- merged_data2 |>
    select(-name,-experience,-subscribe)

merged_data2 <- merged_data2 |>
    mutate(
        Age = as.integer(Age),
        average_playtime = round(average_playtime,2)
    ) |>
    arrange(Age)

average_by_age <- merged_data2|> 
    group_by(Age) |>
    summarize(average_playtime = mean(average_playtime),.groups = "drop")

average_by_age <- average_by_age |> 
    mutate(
        average_playtime = round(average_playtime)
    )

head(average_by_age,5)



To calculate the average playtime per session, the start_time and end_time variables were first converted from character strings to date-time (POSIXct) values.
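
As a hedged illustration of this conversion step, the sketch below parses the same made-up timestamp strings under two different field orders. The example strings are invented for illustration. Whether the real `start_time` column is month-first or day-first is an assumption that should be checked against the raw file, because a mismatched format can yield `NA` or a shifted date without raising an error.

```r
library(lubridate)

# Hypothetical timestamp strings (not taken from sessions.csv)
x <- c("08/07/2024 23:10", "30/06/2024 18:12")

# Month-first parsing, the same field order as format = "%m/%d/%Y %H:%M"
month_first <- mdy_hm(x, quiet = TRUE)

# Day-first parsing, the same field order as format = "%d/%m/%Y %H:%M"
day_first <- dmy_hm(x, quiet = TRUE)

month_first  # the second value is NA because there is no month 30
day_first    # both values parse

# Count how many strings failed to parse under each assumption
c(month_first = sum(is.na(month_first)), day_first = sum(is.na(day_first)))
```

Counting the `NA` values after parsing is a quick way to see whether the chosen format matches the data.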

Before performing the calculation, **I identified outliers: non-overlapping, extremely long sessions** whose removal was unlikely to affect the analysis.

**The outliers were extremely long sessions, implying that some players had played for almost a month in a single session.** I picked one specific player, "Morgan", who appeared to have played for nearly a month, and confirmed that this was due to erroneous data. **A few other players showed similar issues.** To ensure data quality, **I filtered out any sessions longer than one day (1440 minutes).**
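
The one-day cutoff described above can be sketched on a small invented table. The player names and durations below are hypothetical. The sketch only shows one way to list the long sessions before dropping them, so the number of removed rows per player is visible.

```r
library(dplyr)

# Hypothetical sessions; durations are in minutes
toy_sessions <- tibble(
  name = c("A", "A", "B", "B", "C"),
  session_duration = c(35, 44000, 120, 90, 1500)
)

# List sessions longer than one day so they can be inspected before removal
too_long <- toy_sessions |>
  filter(session_duration > 1440) |>
  arrange(desc(session_duration))
too_long

# Keep only sessions of one day or less, using the same 1440-minute threshold
toy_clean <- toy_sessions |>
  filter(session_duration <= 1440)

# Number of sessions that the cutoff removes for each player
removed_per_player <- toy_sessions |>
  group_by(name) |>
  summarize(removed = sum(session_duration > 1440), .groups = "drop")
removed_per_player
```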

After cleaning the data, I retained only the necessary columns (Age, average_playtime) and converted the data types to integer and double (rounded to two decimal places) to improve readability and analysis.


In [None]:
age_plot <- average_by_age |>
    ggplot(aes(x = factor(Age), y = average_playtime)) +
    geom_bar(stat = "identity", fill = "skyblue") + 
    labs(
        x = "Age (9 ~ 50)",
        y = "Average Play Time per Session (minutes)",
        title = "Relationship Between Age and Average Play Time Per Session"
    ) +
    theme_minimal() 

age_plot

In [None]:
merged_data2 <- merged_data2 |>
  mutate(age_group = case_when(
    Age >= 10 & Age < 20 ~ "10s",
    Age >= 20 & Age < 30 ~ "20s",
    Age >= 30 & Age < 40 ~ "30s",
    Age >= 40 & Age <= 59 ~ "40s~50s",
    TRUE ~ NA_character_
  ))

head(merged_data2,5)

grouped_playtime <- merged_data2 |>
  filter(!is.na(age_group)) |>
  group_by(age_group) |>
  summarize(avg_playtime = round(mean(average_playtime), 2))


After grouping the data by age ranges (e.g., teens, 20s, 30s, etc.), I calculated the average playtime per session for each group. This allowed for a clearer comparison of gaming behavior across different age demographics.
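
As an alternative sketch of the same binning, base R's `cut()` can assign the age ranges in one call. The ages below are invented, and the breaks are chosen to match the ranges used in this notebook (10 to 19, 20 to 29, 30 to 39, 40 to 59).

```r
# Hypothetical ages, not taken from the dataset
ages <- c(9, 17, 20, 29, 30, 45, 59)

# right = FALSE makes each interval closed on the left: [10, 20), [20, 30), ...
age_group <- cut(
  ages,
  breaks = c(10, 20, 30, 40, 60),
  labels = c("10s", "20s", "30s", "40s~50s"),
  right = FALSE
)

# Ages outside 10 to 59 (here, 9) fall outside every interval and become NA
data.frame(ages, age_group)
table(age_group, useNA = "ifany")
```

Note that any player younger than 10 receives `NA` and is dropped by a later `filter(!is.na(age_group))`.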

In [None]:
age_plot_by_group <- ggplot(grouped_playtime, aes(x = age_group, y = avg_playtime, fill = age_group)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Play Time by Age Group",
       x = "Age Group",
       y = "Average Play Time (minutes)") +
  theme_minimal()

age_plot_by_group


## ***Question 1 result***

Based on the analysis of average playtime by age, **20-year-old players recorded the longest average playtime per session at 150 minutes**, followed by **32-year-olds (116 minutes)** and **24-year-olds (105 minutes)**. 

In contrast, **50-year-olds had the shortest average playtime per session at 5 minutes**, with **45-year-olds (7 minutes)** and **26-year-olds (15 minutes)** also showing relatively short play durations. 

When grouped by age ranges, **players in their 30s had the highest average playtime per session**, followed by those in their **20s, teens (10s), and 40s~50s.** 

These results indicate that **players in their 20s and 30s tend to spend the most time per game session.**

## ***Q2. Do players with more experience tend to play more? (experience vs Average Played Hours)***

To analyze player engagement, I first cleaned and prepared the dataset by converting the experience, subscribe, and gender variables into categorical factors to ensure proper grouping. I also recoded the subscribe variable for clarity, renaming "TRUE" to "Subscribed" and "FALSE" to "Not Subscribed". Then, I grouped the data by experience level and calculated the average number of hours played within each group, excluding any missing values. To visualize this relationship, I created a bar plot using ggplot2, with experience levels on the x-axis, average played hours on the y-axis, and bars color-coded by category. This plot gives an overview of how playtime varies with experience and offers insight into the relationship between "experience" and play time.

In [None]:
# summary of the data set that is relevant for exploratory data analysis related to the planned analysis
tidy_player_dataset <- players |> 
    mutate(experience = as_factor(experience), 
           subscribe = as_factor(subscribe), 
           gender = as_factor(gender)) |>
    mutate(subscribe = fct_recode(subscribe, "Subscribed" = "TRUE", "Not Subscribed" = "FALSE"))

head(tidy_player_dataset, 5)

In [None]:
# visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
avg_played_hours <- tidy_player_dataset |>
  group_by(experience) |> 
  summarise(avg_hours = mean(played_hours, na.rm = TRUE))


experience_bar <- ggplot(avg_played_hours, aes(x = experience, y = avg_hours, fill = experience)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 2: Average Played Hours vs. Experience Level",
       x = "Experience Level",
       y = "Average Played Hours (hrs)") +
  theme(text = element_text(size = 15))

experience_bar

## ***Q2 Result***

Figure 2 shows that "Regular" players have the highest average playtime, well above every other experience level. "Veteran" players have one of the lowest average playtimes, suggesting that higher experience does not necessarily lead to greater engagement. "Pro" and "Amateur" players fall in the middle range, while "Beginner" players are also near the bottom. This pattern indicates that "Regular" players are the most engaged. Therefore, we should recruit "Regular" players, as they are the most likely to contribute a large amount of data (greater play time).
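
Because a group mean can be pulled upward by a single heavy player, one way to sanity-check a comparison like this is to report the group size and median next to the mean. The sketch below uses invented values, not the real `played_hours` column, and is only meant to show the shape of such a summary.

```r
library(dplyr)

# Hypothetical played hours; one "Regular" player is far above the rest
toy_players <- tibble(
  experience = c("Regular", "Regular", "Regular", "Veteran", "Veteran", "Pro"),
  played_hours = c(0.5, 1.0, 150, 0.2, 0.4, 2.0)
)

# Report group size and median next to the mean for each experience level
toy_summary <- toy_players |>
  group_by(experience) |>
  summarize(
    n = n(),
    avg_hours = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE),
    .groups = "drop"
  )
toy_summary
```

In this toy table the "Regular" mean is far above its median, which is the kind of gap that would suggest a few players dominate the average.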

## ***Q3. Which gender has more sessions? (gender vs total sessions)***

In [None]:
gender_sessions <- merged_data |>
    filter(!is.na(gender)) |>
    select(name, gender, start_time, end_time) |>
    group_by(gender) |>
    summarize(total_sessions = n())
gender_sessions

options(repr.plot.width = 15, repr.plot.height = 8)
gender_sessions_plot <- gender_sessions |>
    ggplot(aes(x = gender, y = total_sessions, fill = gender)) +
    geom_bar(stat = "identity") +
    labs(title = "Total Sessions Related to Gender", x = "Player's Gender", y = "Total Count") +
    theme(text = element_text(size = 20))
gender_sessions_plot

**Steps**
1. Filter out rows with a missing gender (NA) using the `filter` function.
2. Select four columns (name, gender, start_time, end_time) from the data using the `select` function.
3. Group the data by gender using the `group_by` function.
4. Count the rows for each gender using the `summarize` function and store the result in a new column, total_sessions.
5. Print gender_sessions.
6. Create a plot of the relationship between gender and total sessions from the gender_sessions data.
7. Map x = gender, y = total_sessions, and fill = gender to draw a bar chart, and set the title and axis labels using `labs`.
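
One detail worth checking in a row count like this: `left_join` keeps players who have no recorded sessions as a single row with a missing `start_time`, so counting rows can include them. The sketch below uses invented tables with a hypothetical `id` key to show the difference. Whether this matters for the real data is an assumption to verify.

```r
library(dplyr)

# Hypothetical players and sessions; "p3" has no recorded sessions
toy_players <- tibble(
  id = c("p1", "p2", "p3"),
  gender = c("Male", "Female", "Male")
)
toy_sessions <- tibble(
  id = c("p1", "p1", "p2"),
  start_time = c("t1", "t2", "t3")
)

toy_merged <- left_join(toy_players, toy_sessions, by = "id")

# Counting rows treats the session-less player "p3" as one session
row_counts <- toy_merged |>
  count(gender, name = "total_sessions")

# Dropping rows with a missing start_time counts only real sessions
session_counts <- toy_merged |>
  filter(!is.na(start_time)) |>
  count(gender, name = "total_sessions")

row_counts
session_counts
```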

**Relationship with Question 3**
1. The goal is to examine how players' gender relates to the amount of data contributed to the game. I used several functions to build a final plot that shows which genders contributed the most and the least.
2. First, I used the `filter` function with `!is.na` to keep only the rows of the merged data frame whose gender value is not NA.
3. Next, I used the `select` function to keep the columns needed for this question: name, gender, start time, and end time. The name and gender columns identify each player and their gender, and the start time and end time columns show how long they spent in the game.
4. Third, I used the `group_by` and `summarize` functions to group the rows by gender and count the rows in each group, storing the counts in a new column named total_sessions.
5. Then, I used the `ggplot` function to create a bar chart showing the total number of sessions for each gender.
6. The bar chart shows how much each gender contributed. *Male* players contributed the most, with upward of 1,000 sessions. *Other* contributed the least, with almost 0. *Female* is in second place, followed by *Non-binary*, *Prefer not to say*, *Agender*, and *Two-Spirited*.
7. Overall, *Male* players account for the most sessions and *Other* players for the fewest.

## ***Q4. Do subscribers tend to play more than non-subscribers? (subscribe vs total play time)***

To examine the relationship between subscription status and player engagement, I began by grouping the cleaned dataset by the subscribe variable, which had been previously recoded for clarity ("Subscribed" and "Not Subscribed"). I then calculated the average number of hours played for each subscription group using the mean() function, excluding missing values with na.rm = TRUE. This step allowed me to quantify differences in playtime between subscribed and non-subscribed users. To visualize these differences, I created a bar plot, with subscription status on the x-axis and average played hours on the y-axis. The bars were color-filled according to subscription status, and appropriate axis labels and a title were added to enhance readability. This plot helps reveal how subscription status is associated with average playtime, which, in turn, indicates the volume of gameplay data contributed by each group.

In [None]:
# Compute average played hours by subscription status
avg_played_hours_sub <- tidy_player_dataset |>
  group_by(subscribe) |>
  summarise(avg_hours = mean(played_hours, na.rm = TRUE)) 

# Create the bar plot
sub_bar <- ggplot(avg_played_hours_sub, aes(x = subscribe, y = avg_hours, fill = subscribe)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 4: Average Played Hours vs. Subscription Status",
       x = "Subscription Status",
       y = "Average Played Hours (hrs)",
       fill = "Subscription Status") +
  theme(text = element_text(size = 15))
sub_bar

## ***Q4 Result***

In Figure 4, the comparison between subscribed and non-subscribed players shows a clear difference in playtime. Subscribed players spend substantially more time playing than non-subscribers, suggesting that subscribed players are more likely to contribute a large amount of data.

**Discussion:**

According to the plots for the four questions, players in their 20s and 30s had the longest average session times, followed by teens, and players in their 40s and 50s had the shortest. Among the experience levels, "Regular" players had the highest average played hours, well above "Pro" players. In terms of gender, male players recorded the most sessions, far more than any other group. While some of the results are close to what we predicted, we did not expect "Pro" players to play less than "Regular" players. One possible explanation is that "Pro" players focus more on mastering the game than on the amount of time they spend playing. These findings offer suggestions for planning and adjusting the direction of the game, as well as for studying user retention. Future work could examine how to make the game more accessible to players of different ages.

Through our analysis, we found that certain types of players are associated with higher engagement levels, which in turn means greater data contribution. Players in their 20s and 30s tend to spend the most time per session, suggesting that age is related to how long players engage with the game in a single session. Interestingly, while we expected more experienced players to play more, the data showed that "Regular" players had the highest average playtime, indicating they may be the most engaged group. In terms of gender, male players had the highest number of total sessions, followed by female and non-binary players, meaning male players contributed the most session-based data. We also found that subscribed players played substantially more than non-subscribers, which suggests that subscription is a strong indicator of player engagement and data contribution.
