<h2> Questions & Description</h2>	



**Broad Question**

What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? 

- This broad question explores demographic and behavioural factors (such as age, gender, experience, and engagement) and how they may influence a player's likelihood of subscribing. It looks at who the players are and how they interact and engage with the game.

**Specific Question** 

Using measures of a player's engagement, such as total playtime, average session duration, and experience level, are experienced players (Pro and Veteran) more likely to subscribe to the newsletter than beginner players?

This specific question will be addressed through predictive modeling, where the response variable is:

`subscribe` (Logical: TRUE/FALSE),

and the explanatory variables include:

`experience` (categorical: Beginner, Amateur, Regular, Pro, Veteran),

`played_hours` (numeric: total hours played),

`avg_session_duration` (numeric: average session length),

`num_sessions` (numeric: number of sessions per player),

and possibly demographic controls (Age, gender).

Some of these variables (`avg_session_duration` and `num_sessions`) are not in the original datasets; I describe later how I will derive them from the session data.

<h2> Data Summary Description</h2>	

The dataset consists of two files, players.csv and sessions.csv. 


<h3> players.csv File Summary </h3>	


The file "players.csv" contains demographic and behavioural information in seven variables: "experience", "subscribe", "hashedEmail", "played_hours", "name", "gender", and "Age". The file has 196 observations, and only the "Age" variable has missing values (two). The key identifier of each player is the "hashedEmail" variable, which is unique to each player.

<h4>players.csv Variables Summary </h4>	




| Variable | Type | Missing Values | Unique Values | Description / Notes |
|-----------|------|----------------|----------------|----------------------|
| `experience` | fct | 0 | 5 | Describes the gaming experience of each player. The five levels, from least to most experienced, are Beginner, Amateur, Regular, Pro, and Veteran. |
| `subscribe` | lgl | 0 | 2 | Logical data type that indicates whether the player subscribed to the game-related newsletter. |
| `hashedEmail` | chr | 0 | 196 | Unique anonymized player ID (key for joining with `sessions.csv`). This identifies the players and is a string of lowercase letters and numbers. |
| `played_hours` | int | 0 | 43 | Total number of hours played by each player. |
| `name` | chr | 0 | 196 | Player alias or name (not used as an analytical variable). |
| `gender` | fct | 0 | 7 | Player-reported gender (categorical). The categories include Male, Female, Agender, Non-binary, and "Prefer not to say". |
| `Age` | int | 2 | 32 | Player's age in years. |

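The missing-value and unique-value counts in the table above could be reproduced with a short script. The following is only a minimal sketch in Python (standard library only); the analysis itself may well be done in another language. The three-row sample is made up, and the function name `summarize_columns` and the missing-value tokens `""` and `"NA"` are assumptions, not details taken from the data files.

```python
import csv
import io

# Tiny made-up sample standing in for players.csv (values are illustrative only).
SAMPLE = """experience,subscribe,hashedEmail,played_hours,name,gender,Age
Pro,TRUE,a1,30,Ann,Male,9
Veteran,TRUE,b2,4,Bo,Male,17
Amateur,FALSE,c3,0,Cy,Female,NA
"""

def summarize_columns(rows, missing_tokens=("", "NA")):
    """Return {column: (n_missing, n_unique_non_missing)} for a list of dict rows."""
    summary = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        # ASSUMPTION: empty strings and "NA" mark missing values.
        present = [v for v in values if v not in missing_tokens]
        summary[col] = (len(values) - len(present), len(set(present)))
    return summary

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarize_columns(rows)
for col, (n_missing, n_unique) in summary.items():
    print(f"{col}: missing={n_missing}, unique={n_unique}")
```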
**Notes and Potential Issues**

- There is an uneven class distribution in "experience" (the "Amateur" level is the most common). When analyzing my specific question of whether experienced players are more likely to subscribe to a newsletter than beginner players, there will be more data on beginner players (Beginners and Amateurs) than on experienced players.
- The "subscribe" variable is my response variable for the project; it records whether or not each player subscribed.
- The "played_hours" variable is heavily right-skewed, with some players logging over 200 hours.
- There may be outliers in "Age", which ranges from 9 to 57.


<h3> sessions.csv File Summary </h3>	


The file "sessions.csv" is a list of individual play sessions, with a detailed record of each player's gameplay. Each player is identified by the "hashedEmail" variable, as in "players.csv". The file has 1,535 observations and five variables: "hashedEmail", "start_time", "end_time", "original_start_time", and "original_end_time".

<h4> sessions.csv Variables Summary </h4>	



| Variable | Type | Missing Values | Unique Values | Description |
|-----------|------|----------------|----------------|----------------------|
| `hashedEmail` | chr | 0 | 125 | Player identifier for joining with `players.csv` |
| `start_time` | int | 0 | 1504 | Start time of a play session |
| `end_time` | int | 2 | 1489 | End time of a play session (missing values probably mean interrupted sessions) |
| `original_start_time` | int | 0 | 649 | Original (possibly unadjusted) session start time |
| `original_end_time` | int | 2 | 650 | Original session end time (missing for incomplete sessions) |

**Notes and Potential Issues**

- Repeated player IDs: each row is one session, so a player can appear many times (some people have up to 310 sessions).
- Outliers: some sessions are much longer or shorter than average.
- The start time should be converted to a date-time format; it is currently a string of numbers.

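Converting the raw time values and computing a session length could look roughly like the sketch below, written in Python with the standard library only. The exact layout of the raw values is not specified here, so both layouts handled by the hypothetical helper `parse_timestamp` (all-digit values read as milliseconds since the Unix epoch, and `DD/MM/YYYY HH:MM` text) are assumptions that should be checked against the actual file. Times are also assumed to be in UTC.

```python
from datetime import datetime, timezone

# ASSUMPTION: this text layout is hypothetical; adjust it after inspecting the file.
TEXT_FORMAT = "%d/%m/%Y %H:%M"

def parse_timestamp(raw):
    """Convert a raw timestamp string to a datetime, or None if it is missing."""
    raw = raw.strip()
    if raw in ("", "NA"):
        return None
    if raw.isdigit():
        # All digits: treated as milliseconds since the Unix epoch (assumed).
        return datetime.fromtimestamp(int(raw) / 1000, tz=timezone.utc)
    # Otherwise: treated as day/month/year text, assumed to be in UTC.
    return datetime.strptime(raw, TEXT_FORMAT).replace(tzinfo=timezone.utc)

def duration_minutes(start_raw, end_raw):
    """Session length in minutes, or None when either endpoint is missing."""
    start, end = parse_timestamp(start_raw), parse_timestamp(end_raw)
    if start is None or end is None:
        return None
    return (end - start).total_seconds() / 60

print(duration_minutes("30/06/2024 18:12", "30/06/2024 18:24"))
print(duration_minutes("30/06/2024 18:12", "NA"))
```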
<h3> How Data Will Be Used </h3>	


The data will be used to answer the question: "Can demographic characteristics (age, gender, and experience level) and gameplay behaviour predict newsletter subscription among players?" More specifically: "Can we determine if experienced players are more likely to subscribe to a game-related newsletter than beginner players?"

1. I will merge the datasets on `hashedEmail` to link player demographics and session data. Since each player has the same `hashedEmail` in both datasets, joining on this variable gives one unified file containing each player's characteristics and their aggregated session information.
2. Next, I will wrangle the variables:
   - Impute the two missing values in the `Age` variable with the mean or median age.
   - Remove or adjust sessions with missing `end_time` values.
   - Convert the time variables to a proper date-time format.
   - Standardize the numeric variables and reduce the skew of highly skewed ones such as `played_hours`.
   - Remove variables and columns that are redundant. For example, `name` and the timestamps are not useful for prediction once I have calculated the per-session summaries.
3. To better capture behavioural patterns, I will mutate and create new columns from the sessions.csv data. They are listed in the table below with their uses:

| Mutated Feature            | Description                                | Use                                                                 |
| -------------------------- | ------------------------------------------ | ------------------------------------------------------------------- |
| **`num_sessions`**         | Count of total sessions per player         | Indicates player engagement frequency                               |
| **`avg_session_duration`** | Mean of (`end_time - start_time`)          | Measures, in minutes, how long players typically play per session   |
| **`total_play_time`**      | Sum of all session durations               | Proxy for total time spent in the game (may refine `played_hours`)  |
| **`session_frequency`**    | `num_sessions` ÷ number of active days     | Shows consistency of daily engagement                               |
| **`active_days`**          | Count of distinct days a player was active | Represents sustained engagement over a period of time               |

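The derived features in the table above can be sketched as a single per-player aggregation. The following is a minimal illustration in Python (standard library only), assuming the timestamps have already been converted to `datetime` values. The function name `session_features`, the made-up demo sessions, and the choice to skip sessions with a missing end time (rather than adjust them) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

def session_features(sessions):
    """Aggregate (player_id, start, end) tuples into per-player features.

    Sessions with a missing start or end time (None) are skipped, which is
    one of the two options ("remove or adjust") named in the wrangling step.
    """
    durations = defaultdict(list)  # player -> session lengths in minutes
    days = defaultdict(set)        # player -> calendar dates with a session
    for player, start, end in sessions:
        if start is None or end is None:
            continue
        durations[player].append((end - start).total_seconds() / 60)
        days[player].add(start.date())

    features = {}
    for player, mins in durations.items():
        n, active = len(mins), len(days[player])
        features[player] = {
            "num_sessions": n,
            "avg_session_duration": sum(mins) / n,
            "total_play_time": sum(mins),
            "active_days": active,
            "session_frequency": n / active,
        }
    return features

# Made-up sessions for two hypothetical players.
demo = [
    ("p1", datetime(2024, 6, 30, 18, 0), datetime(2024, 6, 30, 18, 30)),
    ("p1", datetime(2024, 6, 30, 20, 0), datetime(2024, 6, 30, 20, 10)),
    ("p1", datetime(2024, 7, 1, 9, 0), datetime(2024, 7, 1, 9, 20)),
    ("p2", datetime(2024, 7, 2, 9, 0), None),  # missing end_time, skipped
]
features = session_features(demo)
print(features["p1"])
```

In this sketch a player whose only sessions lack an end time gets no feature row at all, so such players would need the same handling as players with no recorded sessions.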

The variables that I would keep in my merged dataset of 'players.csv' and 'sessions.csv' are:
- `hashedEmail` = this is the key identifier used to merge the datasets. It is not a predictor, but it is necessary during merging. I can drop this variable after merging to clean up the dataset.
- `subscribe` = this is the response variable for the prediction. It is a logical variable that indicates, with TRUE or FALSE, whether or not a player has subscribed to the newsletter.
- `experience` = this is the most important categorical predictor, as it gives each player's level, from Beginner to Veteran. This way, I can organize the players into the categories of "experienced players" and "beginner players" when answering my question.
- `Age` = this is a numeric predictor that may give me more insight into how it interacts with other factors (such as experience or hours played) and how this could affect the likelihood of subscription. For instance, a younger, highly experienced player may engage differently from an older beginner, even if both play similar hours. Alternatively, younger players might be more engaged or spend more time playing, while older players might be more likely to subscribe if they prefer updates and reading newsletters. Although this variable does not seem that important, it can help me interpret the data. However, if I find during the project that age shows no correlation with subscription, or that there are too many outliers, I will consider removing it.
- `gender` = I can keep this as it is a useful demographic indicator. Similar to age, it can help me interpret the data and results. Even if gender does not strongly predict subscription, it helps describe my sample and understand who my users are. This column notably has no missing values. However, if I find during the project that this variable does not help, or that the gender distribution is too unbalanced, with far more of one gender than the others, I will remove it.
- `played_hours` = I would keep this variable as it is one of the predictor variables. It is heavily right-skewed, so I would first reduce the skew with a log transformation before using it.

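The median imputation of `Age` and the log transformation of `played_hours` described above can be sketched as follows, in Python with the standard library only. The input values are made up. Using `log(1 + x)` instead of a plain log is an assumption made here so that a player with zero hours does not cause an error. The helper names `impute_median` and `log1p_transform` are hypothetical.

```python
import math
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def log1p_transform(values):
    """Apply log(1 + x), which is defined at x = 0 unlike a plain log."""
    return [math.log1p(v) for v in values]

ages = impute_median([9, 17, None, 21, 57])  # made-up ages, one missing
hours = log1p_transform([0, 1, 30, 223])     # made-up played_hours
print(ages)
print([round(h, 2) for h in hours])
```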
The variables that I would completely remove are: 
- `name` = since I have already established `hashedEmail` as my identifier variable, keeping this would be redundant.
- `start_time` and `end_time` = I would need these to calculate session duration and frequency after converting them to the date-time format. However, I would remove them shortly afterwards, as they would then be redundant.
- `original_start_time` and `original_end_time` = I would remove these right away, as they are redundant duplicates of `start_time` and `end_time`.

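After I finish cleaning and merging, I should have 196 rows (one for each player in my dataset) and 10 variables: `subscribe`, `experience`, `Age`, `gender`, `played_hours`, and the five derived session features, once `hashedEmail` is dropped. Because only 125 of the 196 players appear in sessions.csv, keeping all 196 rows requires a join that retains players with no recorded sessions. If needed, I can keep some variables temporarily and remove them if I find that they are not required. The `subscribe` variable is my response variable.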
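
A join that keeps every player, including those with no rows in sessions.csv, is a left join on `hashedEmail`. The sketch below shows the idea in Python (standard library only) on made-up data. The function name `left_join_players` and the zero fill values for players without sessions are assumptions; an average session duration is undefined for such players and would need its own decision.

```python
def left_join_players(players, features, defaults):
    """Keep every player row; fill session features with defaults when absent."""
    merged = []
    for player in players:
        row = dict(player)
        row.update(features.get(player["hashedEmail"], defaults))
        merged.append(row)
    return merged

# Made-up inputs: player "c3" has no recorded sessions.
players = [
    {"hashedEmail": "a1", "subscribe": True, "experience": "Pro"},
    {"hashedEmail": "b2", "subscribe": True, "experience": "Veteran"},
    {"hashedEmail": "c3", "subscribe": False, "experience": "Amateur"},
]
features = {
    "a1": {"num_sessions": 3, "total_play_time": 60.0},
    "b2": {"num_sessions": 1, "total_play_time": 15.0},
}
# ASSUMPTION: players without sessions are given zero counts and zero play time.
NO_SESSIONS = {"num_sessions": 0, "total_play_time": 0.0}

merged = left_join_players(players, features, NO_SESSIONS)
print(len(merged), merged[2])
```

The row count of the result always equals the number of players, which is what keeps the final dataset at one row per player.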