Individual Planning Report:

Predicting high-data contributors based on time-of-day habits

1. Data Description

In [None]:
library(tidyverse)

#load data directly from GitHub
players <- read_csv("https://raw.githubusercontent.com/haylieannel22/DSCI-100-Project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/haylieannel22/DSCI-100-Project/refs/heads/main/sessions.csv")

In [None]:
#Inspect the data

glimpse(players)
glimpse(sessions)

In [None]:
#Get basic dimensions of the data

nrow(players)
ncol(players)

nrow(sessions)
ncol(sessions)

In [None]:
summary(players) # summary statistics for each column, including NA counts

In [None]:
summary(sessions) # sessions needs tidying: start_time and end_time each combine date and time, so split them into separate columns to make analysis easier

The data set includes two files:

    - players.csv: contains player demographics and play statistics: experience level, newsletter subscription status, hashed email, hours played on the server, name, gender, and age.
    
    - sessions.csv: contains session-level data: hashed player email, session start and end times, and the original start and end timestamps.


Below is a summary of the main variables and their descriptions for the players data set:

|Variable Name|Data Type|Description/Meaning|Summary Statistics/Values|Missing Values|Notes/Potential Issues|
|-------------|---------|-------------------|-------------------------|--------------|----------------------|
|hashedemail| character | Unique ID for each player| N/A | 0 | Unique; can be used to link databases|
|played_hours| numeric| Total hours played by each player| Mean = 5.85, Median =0.1, Min = 0, Max = 223.1|0| Very skewed, potential outliers|
|Age| numeric| Player age in years| Mean = 21.14, Median = 19, Min = 9, Max = 58|2|Check for unrealistic ages and missing values|
|gender| character|Player gender|N/A|0|Some categories are rare and have very few observations|
|experience| character| Experience level of each player| N/A | 0 | Consider treating as an ordered factor if modeling|
|subscribe| logical| Whether a player is subscribed to the newsletter| True = 144, False = 52| 0| Not relevant to our question|
|name| character| Name of each player| N/A | 0 | Not for modeling, may contain duplicates|

Number of rows (players): 196

Number of columns: 7

Number of missing points throughout the dataset: 2


Below is a summary of the main variables and their descriptions for the sessions data set:

|Variable Name|Data Type|Description/Meaning|Summary Statistics/Values|Missing Values|Notes/Potential Issues|
|-------------|---------|-------------------|-------------------------|--------------|----------------------|
|hashedemail| character | Unique ID for each player| N/A | 0 | Unique; can be used to link databases|
|start_time| character| Date and time of the start of a session for a player| N/A| 0| Data not tidy; need to separate date and time into their own columns|
|end_time| character| Date and time of the end of a session for a player| N/A| 0| Data not tidy; need to separate date and time into their own columns|
|original_start_time| numeric|When the session originally started, in milliseconds since 1970-01-01 UTC.|Mean=1.719e+12, median=1.719e+12, max=1.727e+12, min = 1.712e+12| 0| Will need to convert to readable units|
|original_end_time|numeric|When the session originally ended, in milliseconds since 1970-01-01 UTC.|Mean=1.719e+12, median=1.719e+12, max=1.727e+12, min = 1.712e+12| 2| Will need to convert to readable units|

Number of rows (sessions): 1535

Number of columns: 5

Number of missing data points throughout the dataset: 2
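
The table above describes `original_start_time` and `original_end_time` as milliseconds since 1970-01-01 UTC. A minimal sketch of the conversion to a readable date-time, using base R and a made-up value of the same magnitude as the reported mean (the names `ms` and `as_utc` are illustrative and not part of the project code):

```r
# Illustrative value only: roughly the reported mean of original_start_time
ms <- 1.719e12

# POSIXct counts seconds since the epoch, so divide milliseconds by 1000
as_utc <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")

# Print in a fixed, readable layout
format(as_utc, "%Y-%m-%d %H:%M:%S")
```

The same call should work on a whole column inside `mutate()`, since `as.POSIXct()` is vectorized.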


Summary statistics show that the mean play time is 5.85 hours per player, while the median is only 0.1 hours and the maximum is 223.1 hours. This indicates that a small number of players contributed far more data than the rest.

Data are missing in two variables: age in the players data set (2 values) and original_end_time in the sessions data set (2 values). This will not affect the analysis, because neither variable is used in it.
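
One way to confirm missing-value counts like these is to count `NA` values column by column. The sketch below uses a small made-up tibble (`toy_players`) in place of the real file, so the numbers it prints are not the project's:

```r
library(dplyr)

# Toy stand-in for players.csv; values are made up for illustration
toy_players <- tibble(
  played_hours = c(0.1, 30.3, 0),
  Age = c(17, NA, 21)
)

# Count NA values in every column
na_counts <- toy_players |>
  summarize(across(everything(), \(x) sum(is.na(x))))

na_counts
```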

2. Research Question

Broad Question: Which kinds of players are most likely to contribute the most data?

Specific Question: Can the average time of day a player starts their session and their total number of sessions predict whether they will be a high or low data contributor?

Response Variable (dependent): 
- Binary variable: high vs. low data contributor (high = 1, low = 0), split at the median data contributed (played_hours) per player.
  

Explanatory variable (independent): 
- Average start time of sessions (converted to hour of day)

Explanation: 
Together, the two data sets contain player identities, session start and end times, and the total hours played (data contributed) per player. To answer the question, the players.csv and sessions.csv data sets will first be joined using the hashed player email as the key. In the sessions data set, the start time will be separated from the date and placed in its own column to make the data tidy. For each player, the average start hour across all of their sessions will then be calculated and linked to their total data contribution from the players data set. The resulting data set will allow for predictive analysis to determine whether the average start time can be used to classify players as high or low data contributors.
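
The steps in this explanation can be sketched on toy data. Everything below is an assumption made for illustration: the tibbles `toy_sessions` and `toy_players`, their values, the key column name `hashedEmail`, and the choice to take the median after the join (so only players with at least one session are counted). `n_sessions` is included because the specific question also mentions the total number of sessions.

```r
library(dplyr)

# Toy stand-ins for the two files; all values are made up.
# The key column name `hashedEmail` is an assumption about the CSV header.
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b", "c", "c"),
  start_hour  = c(20, 22, 3, 14, 16)
)
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c", "d"),
  played_hours = c(30.3, 0.1, 2.0, 0)
)

# One row per player: mean start hour and session count
player_hours <- toy_sessions |>
  group_by(hashedEmail) |>
  summarize(avg_start_hour = mean(start_hour), n_sessions = n())

# Join onto players (players without sessions are dropped),
# then label each player relative to the median of played_hours
model_data <- toy_players |>
  inner_join(player_hours, by = "hashedEmail") |>
  mutate(contributor = factor(
    if_else(played_hours > median(played_hours), "high", "low"),
    levels = c("low", "high")
  ))

model_data
```

With a strict `>` comparison, a player sitting exactly at the median is labeled "low"; whether ties should count as high or low is a choice the analysis would need to state.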


3. Exploratory Data Analysis

In [None]:
#Load data:

library(tidyverse)

players <- read_csv("https://raw.githubusercontent.com/haylieannel22/DSCI-100-Project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/haylieannel22/DSCI-100-Project/refs/heads/main/sessions.csv")

glimpse(players)
glimpse(sessions)

In [None]:
#Wrangle the data

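library(lubridate) # provides parse_date_time() and hour(); loaded explicitly in case the installed tidyverse does not attach it

# Several orders are tried; a date such as 06/05/2024 fits both mdy and dmy, so a single known order is safer
sessions_tidy <- sessions |> # parse start_time as a date-time and extract the hour of day (0-23)
    mutate(start_time_dt = parse_date_time(start_time, orders = c("mdy HM", "dmy HM", "ymd HM")),
           start_hour = hour(start_time_dt))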

glimpse(sessions_tidy)

In [None]:
#Compute the mean of every numeric column in the players data set

players_mean <- players |>
  summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

players_mean

In [None]:
#Exploratory Visualizations:

#A - Distribution of played_hours

options(repr.plot.width = 14, repr.plot.height = 8) 

hours_played_plot <- players |>
    ggplot(aes(x = played_hours)) +
    geom_histogram(binwidth = 50, fill = 'blue', color = 'black') +
    labs(title = "Distribution of Hours Played", x = "Hours Played (hrs)", y = "Number of players") +
    theme_minimal()

hours_played_plot

Graph A shows that total data contribution is strongly right-skewed: most of the people who signed up contributed little or no play time, while a few contributed a great deal.

In [None]:
#B - Distribution of Age

options(repr.plot.width = 14, repr.plot.height = 8) 

age_plot <- players |>
    ggplot(aes(x = Age)) +
    geom_bar(fill = 'blue', color = 'black') +
    labs(title = "Age Distribution of Players",
         x = "Age (years)",
         y = "Number of Players") +
    theme_minimal()

age_plot

Graph B shows that ages are clustered at the younger end, at approximately 15-20 years, with many players being approximately 17 years old.

In [None]:
#C - Session start hour

sessions_tidy <- sessions |> #Tidy data by making a column for only start time
    mutate(start_time_dt = parse_date_time(start_time, orders = c("mdy HM", "dmy HM", "ymd HM")),
    start_hour = hour(start_time_dt))

start_hour_plot <- sessions_tidy |>
    ggplot(aes(x = start_hour)) +
    geom_histogram(binwidth = 1, fill = "salmon", color = "black") +
    labs(title = "Distribution of Session Start Hours",
         x = "Hour of Day (0-23)",
         y = "Number of Sessions") +
    theme_minimal()

start_hour_plot

Graph C shows the distribution of session start hours. Many players begin their sessions in the late evening to early morning, and very few start between 10:00 and 15:00.
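
Because many sessions start on either side of midnight, a plain arithmetic mean of the hour can land in the middle of the day even when a player never plays then. One possible alternative, offered here only as a sketch, is a circular mean that treats the 24 hours as points on a circle. The function name `circular_mean_hour` and the sample hours are hypothetical and not part of the project code.

```r
# Circular mean of clock hours (0-23): map hours to angles on a 24-hour circle,
# average the unit vectors, and map the resulting angle back to hours.
circular_mean_hour <- function(hours) {
  angles <- hours / 24 * 2 * pi
  mean_angle <- atan2(mean(sin(angles)), mean(cos(angles)))
  (mean_angle / (2 * pi) * 24) %% 24
}

# Made-up start hours for one player, clustered around midnight
late_sessions <- c(22, 23, 1)

arithmetic <- mean(late_sessions)             # falls in mid-afternoon
circular <- circular_mean_hour(late_sessions) # falls between 23:00 and midnight

arithmetic
circular
```

Whether this matters for the model depends on how many players have sessions that straddle midnight, which the real data would need to show.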

4. Methods and Plan:

The proposed method for addressing the research question (can the average start hour predict whether a player is a high or low data contributor?) is a k-nearest neighbors (KNN) classification model. KNN is a supervised learning method that assigns each observation to a class based on the classes of its k nearest neighbors. In this case, KNN would classify each player as either a high contributor or a low contributor. In this project the predictor variable is start_hour and the response va