# **Individual Planning DSCI 100 Project**

# 1: Data Description

### Data Collection Method:
- Both datasets were collected by a UBC research group, the Pacific Laboratory for Artificial Intelligence, by recording player behaviour in Minecraft.

### players.csv:
- The players dataframe has 196 observations and 7 variables.

| Variable | Description | Data type |
|----------|-------------|-----------|
| experience | Self-identified Level of Experience | chr |
| subscribe  | Game Newsletter Subscription Status | lgl |
| hashedEmail | Hashed Version of Player Email | chr |
| played_hours | Hours spent in game | dbl |
| name | Name of player | chr |
| gender | Gender of player | chr |
| Age | Age of player | dbl |

### summary statistics
- From the `summary()` function:

| Variable | Min | Max | Mean |
|----------|-----|-----|------|
| played_hours | 0.00 | 223.10 | 5.85|
| Age | 9.00 | 58.00 | 21.14 |


In [None]:
library(tidyverse)

player_data <- read_csv("data/players.csv") 
sessions_data <- read_csv("data/sessions.csv")

player_pro <- player_data |>
    filter(experience == "Pro")|>
    nrow()

player_veteran <- player_data |>
    filter(experience == "Veteran") |>
    nrow()

player_amateur <- player_data |>
    filter(experience == "Amateur")|>
    nrow()

player_regular <- player_data |>
    filter(experience == "Regular")|>
    nrow()

player_beginner <- player_data |>
    filter(experience == "Beginner")|>
    nrow()

player_rows <- nrow(player_data)

beginner_pc <- round((player_beginner/player_rows), 2)
amateur_pc <- round((player_amateur/player_rows), 2)
regular_pc <- round((player_regular/player_rows), 2)
veteran_pc <- round((player_veteran/player_rows), 2)
pro_pc <- round((player_pro/player_rows), 2)

The proportions of players at each self-identified experience level are as follows:

- Beginner: 0.18
- Amateur: 0.32
- Regular: 0.18
- Veteran: 0.24
- Pro: 0.07
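
The five filter-and-count steps above could also be collapsed into a single pipeline. The sketch below is only an illustration: `toy_players` is a made-up tibble (not the real `player_data`), and `experience_props` is a hypothetical name. It uses `count()` to tally each level and `mutate()` to turn the tallies into proportions.

```r
library(dplyr)

# Hypothetical stand-in for player_data; the values are invented for illustration
toy_players <- tibble(
    experience = c("Pro", "Amateur", "Amateur", "Veteran", "Beginner",
                   "Regular", "Amateur", "Veteran", "Beginner", "Amateur")
)

# Count the rows in each experience level, then divide by the total
experience_props <- toy_players |>
    count(experience) |>
    mutate(proportion = round(n / sum(n), 2))

experience_props
```

The same pattern should work for `subscribe` by swapping the column passed to `count()`.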

In [None]:
true_players <- player_data |>
    filter(subscribe == TRUE) |>
    nrow()

false_players <- player_data |>
    filter(subscribe == FALSE) |>
    nrow()

true_pc <- round((true_players/player_rows), 2)
false_pc <- round((false_players/player_rows), 2)

The proportion of players subscribed to the game newsletter is as follows:
- True: 0.73
- False: 0.27

### sessions.csv
- The sessions dataframe has 1535 observations, with 5 variables

| Variable | Description | Data type |
|----------|-------------|-----------|
| hashedEmail | Hashed Version of Player Email | chr |
| start_time | Start Time of Session | chr|
| end_time | End Time of Session | chr |
| original_start_time | Start Time Represented in UNIX time | dbl |
| original_end_time | End Time Represented in UNIX time | dbl |

### summary statistics
- From the `summary()` function:

| Variable | Min | Max | Mean |
|----------|-----|-----|------|
| original_start_time | 1.71e+12 | 1.73e+12 | 1.72e+12 |
| original_end_time | 1.71e+12 | 1.73e+12 | 1.72e+12 |

## Issues
- The data are split across 2 separate .csv files, which makes it difficult to relate players to their sessions. This can be solved by merging the datasets on their shared variable (`hashedEmail`), using `inner_join`.
- In the `players.csv` dataframe, `experience` can be converted to a factor (fct). `Age` can also be converted to an integer (int).
- `start_time` and `end_time` in `sessions.csv` are not tidy: each cell holds 2 values (a date and a time).
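
The fixes listed above can be sketched on a small example. Everything in the block is an assumption made for illustration: the two toy tibbles contain invented values, and the timestamp format `"%d/%m/%Y %H:%M"` (day/month/year hour:minute) is a guess that would need to be checked against the real `sessions.csv`. Parsing each string into a single date-time value is one possible way to tidy the columns; splitting them into separate date and time columns would be another.

```r
library(dplyr)

# Hypothetical miniature versions of the two dataframes (values invented)
toy_players <- tibble(
    hashedEmail = c("a1", "b2", "c3"),
    experience  = c("Pro", "Amateur", "Veteran"),
    Age         = c(21, 17, 30)
)
toy_sessions <- tibble(
    hashedEmail = c("a1", "a1", "b2"),
    start_time  = c("30/06/2024 18:12", "01/07/2024 09:05", "15/08/2024 22:30"),
    end_time    = c("30/06/2024 18:24", "01/07/2024 10:00", "15/08/2024 23:10")
)

# Convert experience to a factor and Age to an integer
tidy_players <- toy_players |>
    mutate(experience = as.factor(experience),
           Age = as.integer(Age))

# ASSUMED timestamp format: day/month/year hour:minute
time_format <- "%d/%m/%Y %H:%M"

# Keep only sessions with a matching player, then parse the timestamps
joined <- inner_join(toy_sessions, tidy_players, by = "hashedEmail") |>
    mutate(start_time = as.POSIXct(start_time, format = time_format, tz = "UTC"),
           end_time = as.POSIXct(end_time, format = time_format, tz = "UTC"),
           session_minutes = as.numeric(difftime(end_time, start_time, units = "mins")))

joined
```

Note that `inner_join` drops players with no recorded sessions (here the hypothetical player `"c3"`), which matters if those players should stay in the analysis.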

# 2: Questions
**Broad question:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific question:** Can the status of subscribing to a game-related newsletter be predicted from Age and played_hours?

The data can help answer this specific question because `players.csv` records each player's demographics and whether they are subscribed. Using the players whose subscription status is known, we can predict the status of a new player from their characteristics. The `subscribe` variable can be converted to a factor (fct), and K-nearest neighbours classification can be used to predict it. The predictors will need to be standardized because age and played hours are on different scales.
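
To illustrate why standardizing matters for a distance-based method, the sketch below compares Euclidean distances before and after scaling. The four players in `toy` are invented values, not rows from the real data. In raw units, a large gap in `played_hours` can dominate the distance even when ages are similar.

```r
# Hypothetical players: Age in years, played_hours in hours (values invented)
toy <- data.frame(
    Age          = c(17, 21, 25, 40),
    played_hours = c(0.5, 150, 2.0, 1.0)
)

# Standardize each column: subtract its mean and divide by its standard deviation
toy_scaled <- as.data.frame(scale(toy))

# Pairwise Euclidean distances between players, before and after standardizing
raw_dist    <- as.matrix(dist(toy))
scaled_dist <- as.matrix(dist(toy_scaled))

round(raw_dist, 2)
round(scaled_dist, 2)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither variable outweighs the other purely because of its units.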

# 3: Exploratory Data Analysis + Visualization
Using the tidyverse package, the datasets can be loaded into R with `read_csv`.

In [None]:
library(tidyverse)

player_data <- read_csv("data/players.csv") 
sessions_data <- read_csv("data/sessions.csv")

Because the question only uses variables from one dataframe, we will be making changes to the `players.csv` dataframe.

In [None]:
head(player_data)

Changing subscription and experience to factor types:

In [None]:
new_player_data <- player_data |>
    mutate(subscribe = as.factor(subscribe)) |>
    mutate(experience = as.factor(experience))
head(new_player_data)

summary(new_player_data$subscribe)
summary(new_player_data$experience)

From the `summary()` function:

| Variable | Mean |
|----------|------|
| played_hours | 5.85 |
| Age | 21.14 |

The `cut` function can be used to bin `Age` into ranges, so that the bars in the plot can be colour-coded by age range (source: https://stackoverflow.com/questions/5570293/create-binned-values-of-a-numeric-column).

In [None]:
new_player_data$age_ranges <- cut(new_player_data$Age, breaks=c(0,10,20,30,100), labels=c("0-10","11-20","21-30", "30+"))
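
As a quick check of how `cut` assigns values that sit on a break point, the sketch below bins a few hypothetical ages (the vector `ages` is invented for illustration) using the same breaks and labels. By default `cut` uses intervals that are open on the left and closed on the right, so an age of exactly 30 lands in "21-30" and the "30+" bin starts above 30.

```r
# Hypothetical ages chosen to sit on and around the break points
ages <- c(9, 10, 11, 20, 21, 30, 31, 58)

# Same breaks and labels as used for new_player_data$Age
age_bins <- cut(ages,
                breaks = c(0, 10, 20, 30, 100),
                labels = c("0-10", "11-20", "21-30", "30+"))

# Pair each age with the bin it was assigned to
data.frame(age = ages, bin = age_bins)
```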

In [None]:
age_plot <- new_player_data |>
    ggplot(aes(x = subscribe, fill = age_ranges)) +
    geom_bar() +
    labs(x = "Subscription Status", 
         y = "Number of Individuals", 
         fill = "Age Ranges",
         title = "Ages which are Subscribed vs. Not Subscribed")
age_plot

From this plot, we can see how many subscribed and non-subscribed individuals fall into each age range. More individuals in the 11-30 age range are subscribed, while fewer individuals in the 30+ age range are subscribed.
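
Because the age groups differ in size, raw counts can make it hard to compare subscription rates between them. One possible alternative, sketched below on invented data (`toy_players` and `age_share_plot` are hypothetical names), is to put the age range on the x-axis and use `position = "fill"` so that every bar is rescaled to a height of 1.

```r
library(ggplot2)

# Hypothetical data, invented for illustration only
toy_players <- data.frame(
    subscribe  = factor(c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)),
    age_ranges = factor(c("11-20", "11-20", "11-20", "21-30",
                          "21-30", "21-30", "30+", "30+"))
)

# position = "fill" rescales each bar to height 1, so the fill shows shares
age_share_plot <- ggplot(toy_players, aes(x = age_ranges, fill = subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Age Range",
         y = "Proportion of Individuals",
         fill = "Subscribed",
         title = "Share of Subscribers within Each Age Range")
age_share_plot
```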

A similar process can be done for `played_hours`.

In [None]:
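# The lowest break is -1 so that 0 hours falls in the first bin; cut() has no na.rm argument, missing values simply stay NA
new_player_data$hours_ranges <- cut(new_player_data$played_hours, breaks=c(-1,20,50,100,200,250), labels=c("0-20","20-50","50-100", "100-200", "200+"))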

In [None]:
phours_plot <- new_player_data |>
    ggplot(aes(x = subscribe, fill = hours_ranges)) +
    geom_bar() +
    labs(x = "Subscription Status", 
         y = "Number of Individuals", 
         fill = "Hour Ranges",
         title = "Individuals who are Subscribed vs. Not Subscribed") +
    theme(text = element_text(size = 15))
phours_plot

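From this plot, we can see that the individuals who are not subscribed to the game newsletter have all played relatively little. The entire false bar is red (meaning they've played 0-20 hours), while the true bar is a mix of colours. Note that this shows non-subscribers are low-hours players, not that low-hours players are always non-subscribers.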

# 4: Methods and Plan

### Method is classification K-nearest neighbours, with the players.csv dataframe:
- players.csv has more information on the player’s demographics
- Classification is used because we are predicting a categorical variable, not a numerical one
- The method is appropriate because it uses existing patterns in who is subscribed to predict whether future users will be subscribed, by comparing them to the existing data.

### Limitations:
- Can perform poorly when classes are imbalanced
- Can be slow when the training data becomes larger

### Data processing:
- Predictors will be scaled and centered
- The data will be split 70/30 into training and testing sets, so the accuracy of the model can be judged on the testing set at the end
- To find the best K value, standard 5-fold cross-validation can be used on the training data
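
The plan above could be sketched with the `tidymodels` package as follows. This is a rough outline under several assumptions, not the final analysis: `toy_players` is simulated data, and the seed, the stratified split, the `"rectangular"` weight function, the `kknn` engine, and the grid of K values from 1 to 15 are illustrative choices rather than decisions stated in this plan.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the players data (invented values, not real results)
n <- 120
toy_players <- tibble(
    Age          = round(runif(n, 9, 58)),
    played_hours = round(rexp(n, rate = 1 / 6), 1),
    subscribe    = factor(ifelse(played_hours + rnorm(n, sd = 3) > 3, "TRUE", "FALSE"))
)

# 70/30 split, stratified so both sets keep a similar class balance
player_split <- initial_split(toy_players, prop = 0.70, strata = subscribe)
player_train <- training(player_split)
player_test  <- testing(player_split)

# Scale and center the predictors; the recipe is defined on the training set only
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = player_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# K-nearest neighbours classifier with K left to be tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation on the training data over an assumed grid of K values
player_vfold <- vfold_cv(player_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

cv_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = player_vfold, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# Pick the K with the highest mean cross-validation accuracy
best_k <- cv_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

best_k
```

The chosen K would then be used to refit the model on the full training set, with the testing set held back for a single accuracy estimate at the end.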

