# Project Proposal

Matt Wiens \
\#21158845 \
Group 26

In [None]:
import altair as alt
import pandas as pd

## Data description

The Pacific Laboratory for Artificial Intelligence (PLAI) at UBC has provided data recorded from their Minecraft server, PLAICraft. The data were obtained by recording players’ gameplay and collecting user registration information, and include details about player activity and demographic attributes.

We'll begin by loading the data.

In [None]:
players = pd.read_csv("data/players.csv")
sessions = pd.read_csv("data/sessions.csv")

In [None]:
players

In [None]:
sessions

The `players` dataframe contains information about 196 players with the following 9 variables:

| Variable | Type | Description | Notes |
|---|---|---|---|
| experience | ordinal | Experience level | Ordered from lowest to highest: Beginner, Amateur, Regular, Pro, Veteran |
| subscribe | boolean | Newsletter subscription status | |
| hashedEmail | string | Email (hashed) | Unique for each player and can be used to join the `sessions` dataframe (see below) |
| played_hours | numeric | Total gameplay hours | |
| name | string | First name | |
| gender | categorical | Gender | Gender includes more than two categories, e.g., Male, Female, Two-Spirited, Non-binary |
| age | numeric | Player's age | |
| individualId | unknown | Player's ID | No values included in the data |
| organizationName | string | Player's organization | No values included in the data |

The `sessions` dataframe contains information about 1535 gameplay sessions with the following 5 variables:

| Variable | Type | Description | Notes |
|---|---|---|---|
| hashedEmail | string | Player email (hashed) | Identifies the player, so it repeats across that player's sessions; can be used to join to the `players` dataframe |
| start_time | datetime (string) | Session start datetime | Stored as string in `DD/MM/YYYY HH:MM` format |
| end_time | datetime (string) | Session end datetime | Stored as string in `DD/MM/YYYY HH:MM` format |
| original_start_time | numeric | Session start datetime | Stored as Unix time in milliseconds |
| original_end_time | numeric | Session end datetime | Stored as Unix time in milliseconds |
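
Since `start_time` and `end_time` are stored as strings, they need to be parsed before any time-based feature can be computed. The following is a minimal sketch of one way to do that, assuming the `DD/MM/YYYY HH:MM` format noted in the table. The dataframe `toy_sessions` and its values are made up for illustration and are not taken from the real data; `session_minutes` is a hypothetical derived column.

```python
import pandas as pd

# Toy rows laid out like `sessions`; every value here is made up for illustration.
toy_sessions = pd.DataFrame(
    {
        "hashedEmail": ["a1", "a1", "b2"],
        "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"],
        "end_time": ["30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"],
    }
)

# Parse the day-first strings into proper datetime columns.
for col in ["start_time", "end_time"]:
    toy_sessions[col] = pd.to_datetime(toy_sessions[col], format="%d/%m/%Y %H:%M")

# Session length in minutes (hypothetical derived column).
toy_sessions["session_minutes"] = (
    toy_sessions["end_time"] - toy_sessions["start_time"]
).dt.total_seconds() / 60
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates.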

For our research question discussed below, we'll only need a subset of the variables.

## Research question

The project lead of PLAICraft, Frank Wood, is interested in answering the following question:

> "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"

This proposal will investigate that question. The goal is to find features (either the variables themselves or transformations of them) that help predict a player's newsletter subscription status. This is a binary classification problem, and we will focus on maximizing the accuracy of whichever model we choose, with recall and precision as secondary considerations. Since we don't know the motivation behind Frank Wood's question, it's not clear which of those two we should prioritize.
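
To make the three metrics concrete, here is a small sketch of how accuracy, precision, and recall are computed from true and predicted labels. It assumes that "subscribed" (`True`) is treated as the positive class, which is a choice made here for illustration. The function name `classification_metrics` and the label lists are hypothetical and do not come from the data.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall, treating True as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)              # subscribed, predicted subscribed
    tn = sum((not t) and (not p) for t, p in pairs)  # not subscribed, predicted not
    fp = sum((not t) and p for t, p in pairs)        # not subscribed, predicted subscribed
    fn = sum(t and (not p) for t, p in pairs)        # subscribed, predicted not

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision and recall are undefined when their denominators are zero.
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }


# Made-up labels for six hypothetical players.
y_true = [True, True, True, False, False, True]
y_pred = [True, True, False, True, False, True]
metrics = classification_metrics(y_true, y_pred)
```

Precision penalizes predicting a subscription that does not exist, while recall penalizes missing a real subscriber, which is why the choice between them depends on the motivation behind the question.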

## Analysis

The only classification algorithm we've learned in class is $k$-nearest neighbors (KNN), which we'll use here. The response variable is `subscribe` from the `players` dataframe. Candidate predictor features are the `experience`, `played_hours`, `gender`, and `age` variables, all of which also come from the `players` dataframe. We can also derive additional predictor features from the `sessions` dataframe, such as:
+ number of sessions
+ average session length
+ average time of day played

These features can be obtained by grouping the `sessions` data by `hashedEmail` and aggregating appropriately. All data can then be combined by joining on `hashedEmail`.
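
The grouping and joining steps described above could look roughly like the sketch below. It runs on made-up toy dataframes rather than the real files, and the feature names (`n_sessions`, `mean_session_minutes`, `mean_start_hour`) are hypothetical. Using a left join is an assumption made so that players with no recorded sessions are kept.

```python
import pandas as pd

# Made-up stand-ins for `players` and `sessions`.
toy_players = pd.DataFrame(
    {
        "hashedEmail": ["a1", "b2", "c3"],
        "subscribe": [True, False, True],
        "played_hours": [1.5, 0.4, 0.0],
    }
)
toy_sessions = pd.DataFrame(
    {
        "hashedEmail": ["a1", "a1", "b2"],
        "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"],
        "end_time": ["30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"],
    }
)
for col in ["start_time", "end_time"]:
    toy_sessions[col] = pd.to_datetime(toy_sessions[col], format="%d/%m/%Y %H:%M")

# Per-session quantities to aggregate.
toy_sessions["session_minutes"] = (
    toy_sessions["end_time"] - toy_sessions["start_time"]
).dt.total_seconds() / 60
toy_sessions["start_hour"] = (
    toy_sessions["start_time"].dt.hour + toy_sessions["start_time"].dt.minute / 60
)

# One row per player, with one column per derived feature.
session_features = (
    toy_sessions.groupby("hashedEmail")
    .agg(
        n_sessions=("start_time", "count"),
        mean_session_minutes=("session_minutes", "mean"),
        mean_start_hour=("start_hour", "mean"),
    )
    .reset_index()
)

# Left join keeps players without sessions; their session count becomes 0.
combined = toy_players.merge(session_features, on="hashedEmail", how="left")
combined["n_sessions"] = combined["n_sessions"].fillna(0).astype(int)
```

Two caveats apply to this sketch. Players without sessions still have missing values for the two averages, which would need to be handled before fitting KNN. Also, an arithmetic mean of start hours is only a rough summary of time of day, because the clock wraps around at midnight.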

Note that since KNN relies on a distance metric, all predictor features must be numerical. Therefore, if we want to use `experience` and `gender` as predictor features, we'll need to transform them into numerical variables first.
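
One possible way to make `experience` and `gender` numerical is sketched below on made-up rows. It encodes `experience` as integer codes that follow the ordering given in the data description, and expands `gender` into indicator (one-hot) columns. Both encodings are assumptions rather than settled choices: integer codes treat adjacent experience levels as equally spaced, and other encodings are possible. The final standardization step is also an assumption, included because distance-based methods are sensitive to the scale of each feature.

```python
import pandas as pd

# Made-up rows laid out like `players`.
toy_players = pd.DataFrame(
    {
        "experience": ["Pro", "Beginner", "Veteran", "Amateur"],
        "gender": ["Male", "Female", "Non-binary", "Male"],
        "played_hours": [30.3, 0.0, 3.8, 0.7],
        "age": [9, 17, 21, 17],
    }
)

# Ordinal encoding: the integer codes follow the stated ordering of the levels.
experience_order = ["Beginner", "Amateur", "Regular", "Pro", "Veteran"]
toy_players["experience_code"] = pd.Categorical(
    toy_players["experience"], categories=experience_order, ordered=True
).codes

# One-hot encoding: one 0/1 column per gender category present in the data.
gender_dummies = pd.get_dummies(toy_players["gender"], prefix="gender", dtype=float)

features = pd.concat(
    [toy_players[["experience_code", "played_hours", "age"]], gender_dummies],
    axis=1,
)

# Standardize each column to mean 0 and standard deviation 1 so that no single
# feature dominates the distance computation.
standardized = (features - features.mean()) / features.std()
```

In a real analysis, the mean and standard deviation used for scaling would be computed on the training data only and then applied to any held-out data.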

## Methods and plan

**TODO:**
Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
