# Project Proposal

Matt Wiens \
\#21158845 \
group 26

In [None]:
import altair as alt
import pandas as pd

## Data description

The Pacific Laboratory for Artificial Intelligence (PLAI) at UBC has provided data recorded from their Minecraft server, PLAICraft. The data were obtained by recording players’ gameplay and collecting user registration information, and include details about player activity and demographic attributes.

We'll begin by loading in the data.

In [None]:
players = pd.read_csv("data/players.csv")
sessions = pd.read_csv("data/sessions.csv")

In [None]:
players

In [None]:
sessions

The `players` dataframe contains information about 196 players with the following 9 variables:

| Variable | Type | Description | Notes |
|---|---|---|---|
| experience | ordinal | Experience level | Values have the following ordering: Beginner, Amateur, Regular, Pro, Veteran |
| subscribe | boolean | Newsletter subscription status | |
| hashedEmail | string | Email (hashed) | Unique for each player and can be used to join the `sessions` dataframe (see below) |
| played_hours | numeric | Total gameplay hours | |
| name | string | First name | |
| gender | categorical | Gender | Gender includes more than two categories, e.g., Male, Female, Two-Spirited, Non-binary |
| age | numeric | Player's age | |
| individualId | unknown | Player's ID | No values included in the data |
| organizationName | string | Player's organization | No values included in the data |

The `sessions` dataframe contains information about 1535 gameplay sessions with the following 5 variables:

| Variable | Type | Description | Notes |
|---|---|---|---|
| hashedEmail | string | Player email (hashed) | Unique for each player and can be used to join the `players` dataframe |
| start_time | datetime (string) | Session start datetime | Stored as string in `DD/MM/YYYY HH:MM` format |
| end_time | datetime (string) | Session end datetime | Stored as string in `DD/MM/YYYY HH:MM` format |
| original_start_time | numeric | Session start datetime | Stored as Unix time in milliseconds |
| original_end_time | numeric | Session end datetime | Stored as Unix time in milliseconds |

For our research question discussed below, we'll only need a subset of the variables.
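
As a minimal sketch of how the datetime formats noted in the `sessions` table could be parsed, the example below uses two made-up rows (not real PLAICraft data) and the illustrative names `sample_sessions`, `start`, `end`, and `session_minutes`. Whether the string columns are recorded in UTC or local time is not stated, so any comparison between the string and Unix columns should be treated with care.

```python
import pandas as pd

# Two illustrative rows (made-up values) in the formats described in the table
sample_sessions = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33"],
    "end_time": ["30/06/2024 18:24", "17/06/2024 23:46"],
    "original_start_time": [1719771120000, 1718667180000],
})

# Parse the DD/MM/YYYY HH:MM strings with an explicit format so that
# day and month are never swapped
start = pd.to_datetime(sample_sessions["start_time"], format="%d/%m/%Y %H:%M")
end = pd.to_datetime(sample_sessions["end_time"], format="%d/%m/%Y %H:%M")

# Unix time in milliseconds converts with unit="ms" (interpreted as UTC)
start_from_unix = pd.to_datetime(sample_sessions["original_start_time"], unit="ms")

# Session length in minutes
session_minutes = (end - start).dt.total_seconds() / 60
```

Passing an explicit `format` avoids relying on pandas to guess between day-first and month-first dates.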

## Research question

The project lead of PLAICraft, Frank Wood, is interested in answering the following question:

> "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"

This proposal will investigate that question. Broadly, we need to find features (either the variables themselves or transformations of them) that help us predict player newsletter subscription status. This is a binary classification problem, and we will focus on maximizing the accuracy of whichever model we choose, with recall and precision as secondary considerations. Since we don't know the motivation behind Frank Wood's question, it's not clear which of the two we should prioritize.
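
To make the three metrics concrete, here is a small sketch that computes accuracy, precision, and recall from true and predicted labels. The function name `classification_metrics` and the labels are made up for illustration; in practice a library implementation would likely be used instead.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for boolean labels, True = subscribed."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)              # subscribed, predicted subscribed
    fp = sum((not t) and p for t, p in pairs)        # not subscribed, predicted subscribed
    fn = sum(t and (not p) for t, p in pairs)        # subscribed, predicted not subscribed
    tn = sum((not t) and (not p) for t, p in pairs)  # not subscribed, predicted not subscribed

    accuracy = (tp + tn) / len(pairs)
    # Precision and recall are undefined when their denominators are zero
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall


# Made-up labels for illustration only
y_true = [True, True, True, False, False]
y_pred = [True, True, False, True, False]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
```

Precision asks how many predicted subscribers really are subscribed, while recall asks how many actual subscribers the model finds.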

## Analysis

The only classification algorithm we've learned in class is $k$-nearest neighbors (KNN), which we'll use here. The response variable we will predict is the `subscribe` variable from the `players` dataframe. Candidate predictor features are the `experience`, `played_hours`, `gender`, and `age` variables, all from the `players` dataframe. We can also derive additional predictor features from the `sessions` dataframe, such as:
+ number of sessions
+ average session length
+ average time of day played

These features can be obtained by grouping the `sessions` data by `hashedEmail` and aggregating appropriately. They can then be combined with the player features by joining on `hashedEmail`.
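
The grouping and joining steps might look roughly like the sketch below. It runs on made-up rows for three hypothetical players, and the feature names (`n_sessions`, `mean_session_minutes`, `mean_start_hour`) are illustrative choices rather than settled definitions.

```python
import pandas as pd

# Made-up sessions and players for illustration only
sample_sessions = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2"],
    "start_time": ["01/05/2024 10:00", "02/05/2024 20:00", "03/05/2024 15:30"],
    "end_time": ["01/05/2024 10:30", "02/05/2024 21:00", "03/05/2024 15:45"],
})
sample_players = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "c3"],
    "subscribe": [True, False, True],
})

fmt = "%d/%m/%Y %H:%M"
start = pd.to_datetime(sample_sessions["start_time"], format=fmt)
end = pd.to_datetime(sample_sessions["end_time"], format=fmt)
sample_sessions = sample_sessions.assign(
    session_minutes=(end - start).dt.total_seconds() / 60,
    start_hour=start.dt.hour + start.dt.minute / 60,
)

# One row per player with the derived session features
session_features = (
    sample_sessions.groupby("hashedEmail")
    .agg(
        n_sessions=("session_minutes", "size"),
        mean_session_minutes=("session_minutes", "mean"),
        mean_start_hour=("start_hour", "mean"),
    )
    .reset_index()
)

# A left join keeps players with no recorded sessions (their features are NaN)
player_features = sample_players.merge(session_features, on="hashedEmail", how="left")
player_features["n_sessions"] = player_features["n_sessions"].fillna(0).astype(int)
```

Note that a plain mean of the start hour ignores the wraparound at midnight (sessions at 23:00 and 01:00 would average to noon), so the time-of-day feature may need a different summary.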

Note that since KNN relies on a distance metric, all predictor features must be numerical. Therefore, if we want to use `experience` and `gender` as predictor features, we'll need to transform them into numerical variables first.
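
One possible way to do this transformation, shown on made-up rows, is to map `experience` to integer codes that respect its ordering and to one-hot encode `gender`. The names `sample_players`, `experience_code`, and `encoded` are illustrative, and the integer coding assumes the experience levels are evenly spaced, which may not hold.

```python
import pandas as pd

# Made-up rows for illustration only
sample_players = pd.DataFrame({
    "experience": ["Pro", "Beginner", "Veteran"],
    "gender": ["Male", "Female", "Non-binary"],
})

# Ordinal variable: integer codes 0..4 following the stated ordering
levels = ["Beginner", "Amateur", "Regular", "Pro", "Veteran"]
experience_code = pd.Categorical(
    sample_players["experience"], categories=levels, ordered=True
).codes

# Nominal variable: one 0/1 indicator column per gender category
gender_dummies = pd.get_dummies(sample_players["gender"], prefix="gender", dtype=int)

encoded = pd.concat(
    [pd.Series(experience_code, name="experience_code"), gender_dummies],
    axis=1,
)
```

Because KNN distances are sensitive to the scale of each feature, the numeric features would likely also need to be standardized before fitting.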

### Exploration

We'll now briefly explore how player experience, gender, gameplay hours, and age are associated with subscription status. 

In [None]:
# Drop the last two columns (which contain no values) from players to tidy the data.
# errors="ignore" keeps this cell safe to re-run after the columns are already gone.
players = players.drop(columns=["individualId", "organizationName"], errors="ignore")

# Convert experience to an ordinal variable
players["experience"] = pd.Categorical(
    players["experience"],
    categories=["Beginner", "Amateur", "Regular", "Pro", "Veteran"],
    ordered=True,
)

# Add a column with label for subscription status
players["subscribe_label"] = players["subscribe"].map({True: "Subscribed", False: "Not Subscribed"})

#### Overall subscription counts

Before diving into potential associations, let's first get a baseline look at overall subscription counts.

In [None]:
# Overall subscription counts
overall_counts_plot = alt.Chart(players).mark_bar().encode(
    x=alt.X("subscribe_label").title("Subscription status").axis(labelAngle=0),
    y=alt.Y("count()").title("Player count"),
).properties(
    width=500,
    title=f"{players['subscribe'].mean():.0%} of Players Have Newsletter Subscriptions",
)
overall_counts_plot

#### Subscription status by experience level

The first potential association we'll look at is between subscription status and experience level.

In [None]:
# Proportion subscribed vs experience level
proportion_subscribed_by_experience = (
    players.groupby("experience", observed=True)["subscribe"]
    .agg([("count", "size"), ("proportion_subscribed", "mean")])
    .reset_index()
)
proportion_subscribed_by_experience

In [None]:
experience_plot = alt.Chart(proportion_subscribed_by_experience).mark_bar().encode(
    x=alt.X("experience").title("Experience level").axis(labelAngle=0),
    y=alt.Y("proportion_subscribed").title("Percentage subscribed to newsletter").axis(format="%").scale(domain=[0,1]),
).properties(
    width=500,
    title="Proportion of Players Subscribed by Experience Level",
)
experience_plot

We can see that experience level does not appear to be strongly associated with subscription status.

#### Subscription status by gender

Next we'll look at subscription status and gender.

In [None]:
# Proportion subscribed vs gender
proportion_subscribed_by_gender = (
    players.groupby("gender")["subscribe"]
    .agg([("count", "size"), ("proportion_subscribed", "mean")])
    .reset_index()
)
proportion_subscribed_by_gender

We can see that some of the genders have low counts. The aggregate proportion subscribed isn't very meaningful for these, so we'll exclude them from the plot below.

In [None]:
gender_plot = alt.Chart(
    proportion_subscribed_by_gender.loc[proportion_subscribed_by_gender["count"] > 5]
).mark_bar().encode(
    x=alt.X("gender").title("Gender").axis(labelAngle=0).sort("-y"),
    y=alt.Y("proportion_subscribed").title("Percentage subscribed to newsletter").axis(format="%").scale(domain=[0,1]),
).properties(
    width=500,
    title="Proportion of Players Subscribed by Gender (excluding genders with ≤5 players)",
)
gender_plot

Players who choose to declare their gender seem to be much more likely to subscribe than players who do not declare their gender.

#### Subscription status by gameplay hours

Now we'll look at how subscription status is associated with gameplay hours. Ideally we would compare the distribution split by subscription status with side-by-side boxplots. Unfortunately, as we can see below, the vast majority of players have negligible gameplay hours, which collapses any boxplot into a line.

In [None]:
# Histogram of hours played
hours_hist_plot = alt.Chart(players).mark_bar().encode(
    x=alt.X("played_hours").bin(step=2).title("Gameplay hours"),
    y=alt.Y("count()").title("Player count")
).properties(
    width=500,
    title="Distribution of Gameplay Hours",
)
hours_hist_plot

To address this, we'll break the players into two groups: one having less than one hour of gameplay, the other having at least one hour.

We'll first look at subscription counts of players with less than one hour of gameplay (since the gameplay hours values are too clustered to make boxplots meaningful) and then look at boxplots of players with at least one hour of gameplay.

In [None]:
# Subscription counts for players with <1 hour of gameplay
low_hours_counts_plot = alt.Chart(
    players.loc[players["played_hours"] < 1]
).mark_bar().encode(
    x=alt.X("subscribe_label").title("Subscription status").axis(labelAngle=0),
    y=alt.Y("count()").title("Player count"),
).properties(
    width=500,
    title=f"{players.loc[players['played_hours'] < 1]['subscribe'].mean():.0%} of Players with Less Than One Hour of Gameplay Have Newsletter Subscriptions",
)
low_hours_counts_plot

Comparing this with our baseline subscription counts, we see essentially no association between very low gameplay hours and subscription status.

In [None]:
# Gameplay hours vs subscription status (at least 1 hour played)
hours_positive_plot = alt.Chart(
    players.loc[players["played_hours"] >= 1]
).mark_boxplot(size=60).encode(
    x=alt.X("subscribe_label").title("Subscription status").axis(labelAngle=0),
    y=alt.Y("played_hours").title("Gameplay hours"),
).properties(
    width=500,
    title="Distribution of Played Hours by Subscription Status (at least 1 hour gameplay)"
)

hours_positive_plot

Here we can clearly see that having at least an hour of gameplay is strongly associated with newsletter subscriptions. All of the players with a large number of hours are subscribed.

#### Subscription status by age

Finally, we will look at how age is associated with subscription status.

In [None]:
# Age vs subscription status
age_plot = alt.Chart(players).mark_boxplot(size=60).encode(
    x=alt.X("subscribe_label").title("Subscription status").axis(labelAngle=0),
    y=alt.Y("age").title("Age"),
).properties(
    width=500,
    title="Distribution of Age by Subscription Status"
)

age_plot

Here we can see that subscribed players tend to be younger, although the distributions overlap so much that this effect is weak.

### Summary

Through our exploration we have seen the following associations:
+ Minimal association between experience level and subscription status
+ Strong association between having declared gender and subscription status
+ Essentially no association between having less than one hour of gameplay and subscription status
+ Strong association between having at least one hour of gameplay and subscription status
+ Weak association between age and subscription status

## Methods and plan

**TODO:**
Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
