Individual Planning Report:

1. Data Description

The data set includes two files:

    - players.csv: contains individual player demographics and play statistics: experience level, newsletter subscription status, hashed email, hours played on the server, name, gender, and age.

    - sessions.csv: contains session-level data: hashed player email, session start and end times, and the original start and end times as timestamps.


Below is a summary of the main variables and their descriptions for the players data set:

|Variable Name|Data Type|Description/Meaning|Summary Statistics/Values|Missing Values|Notes/Potential Issues|
|-------------|---------|-------------------|-------------------------|--------------|----------------------|
|hashedemail| character | Unique ID for each player| N/A | 0 | Unique; can be used to link databases|
|played_hours| numeric| Total hours played by each player| Mean = 5.85, Median = 0.1, Min = 0, Max = 223.1|0| Strongly right-skewed, potential outliers|
|Age| numeric| Player age in years| Mean = 21.14, Median = 19, Min = 9, Max = 58|2|Check for unrealistic ages and handle missing values|
|gender| character|Player gender|N/A|0|Some categories are rare, so those groups are very small|
|experience| character| Experience level of each player| N/A | 0 | Consider treating as an ordered factor if modeling|
|subscribe| logical| Whether a player is subscribed to the newsletter| True = 144, False = 52| 0| Not used for this question|
|name| character| Name of each player| N/A | 0 | Not for modeling, may contain duplicates|

Players dataset: 196 rows, 7 columns, 2 missing values
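The missing-value counts in the table can be checked with a per-column count of `NA` entries. The snippet below is a minimal sketch on a made-up table (`toy_players` and its values are hypothetical); the same `summarize()` call should work on `players` once it is loaded.

```r
library(tidyverse)

# Hypothetical toy table; the values are made up and are not from players.csv.
toy_players <- tibble(
  played_hours = c(0.1, 5.2, 0.0),
  Age = c(17, NA, 21)
)

# Count the NA values in every column.
missing_counts <- toy_players |>
  summarize(across(everything(), \(x) sum(is.na(x))))

missing_counts
```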


Below is a summary of the main variables and their descriptions for the sessions data set:

|Variable Name|Data Type|Description/Meaning|Summary Statistics/Values|Missing Values|Notes/Potential Issues|
|-------------|---------|-------------------|-------------------------|--------------|----------------------|
|hashedemail| character | ID of the player who played the session| N/A | 0 | Repeats across a player's sessions; can be used to link to players.csv|
|start_time| character| Date and time of the start of a session| N/A| 0| Data not tidy, need to separate date and time into their own columns|
|end_time| character| Date and time of the end of a session| N/A| 0| Data not tidy, need to separate date and time into their own columns|
|original_start_time| numeric|When the session originally started, in milliseconds since 1970-01-01 UTC.|Mean=1.719e+12, median=1.719e+12, max=1.727e+12, min = 1.712e+12| 0| Will need to convert to readable units|
|original_end_time|numeric|When the session originally ended, in milliseconds since 1970-01-01 UTC.|Mean=1.719e+12, median=1.719e+12, max=1.727e+12, min = 1.712e+12| 2| Will need to convert to readable units|

Sessions dataset: 1535 rows, 5 columns, 2 missing values
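The "convert to readable units" note for the two `original_*` columns could be handled roughly as follows. This is a sketch only: the three input values are illustrative numbers in the same range as the summary statistics, not rows from sessions.csv, and `ms_since_epoch` and `readable_time` are hypothetical names.

```r
library(lubridate)

# Illustrative values in milliseconds since 1970-01-01 UTC (not real rows).
ms_since_epoch <- c(1.712e12, 1.719e12, 1.727e12)

# as_datetime() interprets numbers as seconds, so divide by 1000 first.
readable_time <- as_datetime(ms_since_epoch / 1000, tz = "UTC")

readable_time
```

Applied to the real table, the same conversion would go inside a `mutate()` on `original_start_time` and `original_end_time`.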


Summary statistics show a mean playtime of 5.85 hours per player but a median of only 0.1 hours and a maximum of 223.1 hours, indicating that a few players contributed far more data than others. The missing values occur in two variables (player age and original_end_time) that are not used, so they will not affect the analysis.

In [None]:
library(tidyverse)

#load data directly from GitHub
players <- read_csv("https://raw.githubusercontent.com/haylieannel22/DSCI-100-Project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/haylieannel22/DSCI-100-Project/refs/heads/main/sessions.csv")

In [None]:
#Inspect the data

glimpse(players)
glimpse(sessions)

In [None]:
#Get basic dimensions of the data

nrow(players)
ncol(players)

nrow(sessions)
ncol(sessions)

In [None]:
summary(players) #shows summary statistics for each column, including NA counts

In [None]:
#sessions needs cleaning: start_time and end_time each store date and time in one character column,
#so they should be parsed or separated to make the data tidy and the analysis easier
summary(sessions)

2. Research Question

Broad: Which kinds of players are most likely to contribute the most data?

Specific: Can the average session start time predict whether a player will be a high or low data contributor?

Response Variable: Binary variable, high (1) vs. low (0) data contributor, defined by whether a player's total contribution is above or below the median.


Explanatory variable: Average session start time (hour of day)

Explanation: 
Together, the two data sets contain player identities, total hours played per player (players.csv), and session start and end times (sessions.csv). To answer the question, players.csv and sessions.csv will be joined using the hashed email. The start hour will be extracted from each session start time, and the average start hour per player will be calculated. This will be merged with each player's total contribution from players.csv, producing a data set with one row per player that is suitable for analysis.
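The wrangling plan above can be sketched on toy data as follows. Everything here is an assumption made for illustration: the toy tables and their values are made up, the join key is written as `hashedemail` as in the tables above (the capitalisation in the real files should be checked), played hours stand in for "total contribution", and the median is taken over the joined players only.

```r
library(tidyverse)

# Hypothetical stand-ins for players and the wrangled sessions table.
players_toy <- tibble(
  hashedemail = c("a", "b", "c", "d"),
  played_hours = c(0.1, 12.0, 0.0, 3.5)
)
sessions_toy <- tibble(
  hashedemail = c("a", "a", "b", "b", "b", "d"),
  start_hour = c(20, 22, 1, 3, 2, 15)
)

# One row per player: mean start hour across that player's sessions.
avg_start <- sessions_toy |>
  group_by(hashedemail) |>
  summarize(avg_start_hour = mean(start_hour), .groups = "drop")

# Keep players with at least one session, then label each player as
# high (1) or low (0) relative to the median of played_hours.
player_data <- players_toy |>
  inner_join(avg_start, by = "hashedemail") |>
  mutate(high_contributer = as.integer(played_hours > median(played_hours)))

player_data
```

One caveat worth checking before modelling: hour of day wraps around at midnight, so a plain mean of 23 and 1 gives 12 rather than 0.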


3. Exploratory Data Analysis

In [None]:
#Load data:

 library(tidyverse)

players <- read_csv("https://raw.githubusercontent.com/haylieannel22/DSCI-100-Project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/haylieannel22/DSCI-100-Project/refs/heads/main/sessions.csv")

glimpse(players)
glimpse(sessions)

In [None]:
#Wrangle the data

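library(lubridate) #provides parse_date_time() and hour(); attached by tidyverse >= 2.0.0, loaded here for older versions

#Parse start_time into a date-time and add a column holding only the start hour (0-23)
#Day-first vs month-first order is ambiguous, so check the parsed values against the raw strings
sessions_tidy <- sessions |>
    mutate(start_time_dt = parse_date_time(start_time, orders = c("mdy HM", "dmy HM", "ymd HM")),
           start_hour = hour(start_time_dt))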

glimpse(sessions_tidy)

In [None]:
#Compute the mean of every quantitative variable in the players data set

players_mean <- players |>
  summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

players_mean

In [None]:
#Exploratory Visualizations:

#A - Distribution of played_hours

options(repr.plot.width = 14, repr.plot.height = 8) 

hours_played_plot <- players |>
    ggplot(aes(x = played_hours)) +
    geom_histogram(binwidth = 50, fill = 'blue', color = 'black') +
    labs(title = "Distribution of Hours Played", x = "Hours Played (hrs)", y = "Number of players") +
    theme_minimal()

hours_played_plot

Graph A shows that total data contributions are strongly right-skewed: most players contribute very little, while a few contribute a great deal.

In [None]:
#B - Distribution of Age

options(repr.plot.width = 14, repr.plot.height = 8) 

age_plot <- players |>
    ggplot(aes(x = Age))+
    geom_bar(fill = 'blue', color = 'black')+
    labs(title = "Age Distribution of Players",
       x = "Age (years)",
       y = "Number of Players") +
  theme_minimal()

age_plot

Graph B shows most players are aged 15-20 years, clustered around 17 years old.

In [None]:
#C - Session start hour

sessions_tidy <- sessions |> #Tidy data by making a column for only start time
    mutate(start_time_dt = parse_date_time(start_time, orders = c("mdy HM", "dmy HM", "ymd HM")),
    start_hour = hour(start_time_dt))

start_hour_plot <- sessions_tidy |>
    ggplot(aes(x = start_hour)) +
    geom_histogram(binwidth = 1, fill = "salmon", color = "black") +
    labs(title = "Distribution of Session Start Hours",
       x = "Hour of Day (0-23)",
       y = "Number of Sessions") +
    theme_minimal()

start_hour_plot

Graph C shows that many sessions start in the late evening to early morning, with few starting between 10:00 and 15:00.

4. Methods and Plan:

To address whether the average start hour can predict if a player is a high or low data contributor, the proposed method is k-nearest neighbours (kNN) classification. kNN would assign each player to one of two classes: high contributor or low contributor. The predictor is each player's average start_hour, and the response variable is high_contributer, a binary variable indicating whether a player contributes above or below the median amount of data. This approach classifies players based on similarity to other players' behaviour and is suitable for exploratory predictive analysis.

kNN is suitable because it handles binary classification without strong assumptions about the shape of the relationship and identifies patterns based on similarity. It assumes that players with similar average start hours belong to the same class and that observations are independent. While kNN is flexible and simple, it can be very sensitive to the choice of k and to imbalanced data.

To apply the model, the average start hour will be calculated for each player and merged with their total data contributions to create the binary response. The dataset will be randomly split into a 70% training set and a 30% testing set. Model performance will be evaluated using metrics like accuracy, and k-fold cross-validation will be used on the training set to select the best value of k and ensure the model generalizes well.
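The plan above could look roughly like this with the tidymodels packages (plus the kknn engine), which is an assumption about tooling rather than something fixed by this report. The data are simulated stand-ins, and the column name `avg_start_hour`, the 5 folds, and the grid of k values are illustrative choices.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)

# Simulated per-player table (not the real data): one predictor, one class label.
player_data <- tibble(
  avg_start_hour = runif(200, min = 0, max = 23),
  high_contributer = factor(sample(c("low", "high"), 200, replace = TRUE))
)

# 70/30 split, stratified so both classes appear in each set.
data_split <- initial_split(player_data, prop = 0.7, strata = high_contributer)
train_data <- training(data_split)
test_data <- testing(data_split)

# With a single predictor, scaling would not change which neighbours are
# nearest, so the recipe has no preprocessing steps.
knn_recipe <- recipe(high_contributer ~ avg_start_hour, data = train_data)

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set over a small grid of k values.
folds <- vfold_cv(train_data, v = 5, strata = high_contributer)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest cross-validated accuracy.
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k, then score on the test set.
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = train_data)

test_accuracy <- predict(final_fit, test_data) |>
  bind_cols(test_data) |>
  accuracy(truth = high_contributer, estimate = .pred_class)

test_accuracy
```

Because the simulated labels are random, the accuracy here should hover near chance; the point of the sketch is the order of the steps, with the test set touched only once at the end.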

5. Github repository: https://github.com/haylieannel22/DSCI-100-Project.git