<h1><b>Demand Forecasting to Accommodate All Players</b></h1><br>

<h3><b>Introduction:</b></h3> <br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; This report is based on data collected by a Computer Science research group at the University of British Columbia. The data was collected through the video game Minecraft, with the aim of understanding how people play video games by recording players' actions. Minecraft is a block-based game that supports a wide variety of playstyles and challenges, so it appeals to a broad range of players.
<br> <br>
One concern during the project was whether the research group had enough resources to handle all the players it attracted. Players can play at any time of day, as long as spots are available. To help address this concern, we will answer the following question from the project lead, Frank Wood: <b>"We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability."</b>
<br><br>
To answer this question, we use two datasets provided by the research group. The first, "players", has 9 variables and 196 observations and stores information about each player. The second, "sessions", has 5 variables and 1535 observations and stores information about each play session. The tables below list the variables in each dataset with a short description of each. We focus on the sessions dataset, since we are mainly interested in which time window(s) have the most players.

<br>
<table>
<tr>
    <th>Players Dataset Variables:</th>
</tr>
<tr>
    <th>experience:</th>
    <td>Player's experience with Minecraft. </td>
</tr>
<tr>
    <th>subscribe:</th>
    <td>If the player is subscribed with their email. </td>
</tr>
<tr>
    <th>hashedEmail:</th>
    <td>Hashed email address used to identify a specific player. </td>
</tr>
<tr>
    <th>played_hours:</th>
    <td>Player's total play time in hours. </td>
</tr>
<tr>
    <th>name:</th>
    <td>Player's pseudonym. </td>
</tr>
<tr>
    <th>gender:</th>
    <td>Player's gender. </td>
</tr>
<tr>
    <th>age:</th>
    <td>Player's age. </td>
</tr>
<tr>
    <th>individualId:</th>
    <td>Player's individual ID. </td>
</tr>
<tr>
    <th>organizationName:</th>
    <td>Player's organization name. </td>
</tr>
</table>

<br><br>

<table>
<tr>
    <th>Sessions Dataset Variables:</th>
</tr>
<tr>
    <th>hashedEmail:</th>
    <td>Hashed email address used to identify a specific player. </td>
</tr>
<tr>
    <th>start_time:</th>
    <td>When the player starts playing.</td>
</tr>
<tr>
    <th>end_time:</th>
    <td>When the player stops playing.</td>
</tr>
<tr>
    <th>original_start_time:</th>
    <td>Session start as a Unix timestamp in milliseconds.</td>
</tr>
<tr>
    <th>original_end_time:</th>
    <td>Session end as a Unix timestamp in milliseconds.</td>
</tr>
</table>


<h3><b>Methods &amp; Results:</b></h3>

<h4>First, we read both datasets and cleaned them by dropping the columns we do not plan to use and reducing the time variables to digits only (HHMM).</h4>


In [4]:
import pandas as pd

players = pd.read_csv('data/players.csv')
sessions = pd.read_csv('data/sessions.csv')

# Drop identifier columns that the analysis does not use.
# Note (RL): the players table may be unnecessary, since only sessions is analysed.
players = players.drop(columns=["subscribe", "individualId", "organizationName", "hashedEmail"])
# Note (RL): original_start_time and original_end_time could probably be dropped as well.
sessions = sessions.drop(columns=["hashedEmail"])
# Split each "date time" string into a separate date column and time column.
sessions[["start_date", "start_time"]] = sessions["start_time"].str.split(" ", expand=True)
sessions[["end_date", "end_time"]] = sessions["end_time"].str.split(" ", expand=True)
# Remove the colon so that times read as HHMM strings, e.g. "18:12" -> "1812".
sessions["start_time"] = sessions["start_time"].str.replace(":", "")
sessions["end_time"] = sessions["end_time"].str.replace(":", "")
display(sessions.head(5), players.head(5))

Unnamed: 0,start_time,end_time,original_start_time,original_end_time,start_date,end_date
0,1812,1824,1719770000000.0,1719770000000.0,30/06/2024,30/06/2024
1,2333,2346,1718670000000.0,1718670000000.0,17/06/2024,17/06/2024
2,1734,1757,1721930000000.0,1721930000000.0,25/07/2024,25/07/2024
3,0322,0358,1721880000000.0,1721880000000.0,25/07/2024,25/07/2024
4,1601,1612,1716650000000.0,1716650000000.0,25/05/2024,25/05/2024


Unnamed: 0,experience,played_hours,name,gender,age
0,Pro,30.3,Morgan,Male,9
1,Veteran,3.8,Christian,Male,17
2,Veteran,0.0,Blake,Male,17
3,Amateur,0.7,Flora,Female,21
4,Regular,0.1,Kylie,Male,21
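
The string splitting above could also be done by parsing the timestamps into datetimes, which keeps the date and time together and makes durations easy to compute. The following is a minimal sketch, assuming `start_time` and `end_time` are stored as `DD/MM/YYYY HH:MM` strings (the format suggested by the output above). The `toy` frame, its values, and the column names `start_dt`, `end_dt`, and `duration_min` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the sessions table; the values are invented for illustration.
toy = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", "25/07/2024 03:22"],
    "end_time": ["30/06/2024 18:24", "25/07/2024 03:58"],
})

# Assumed format: day/month/year hour:minute.
fmt = "%d/%m/%Y %H:%M"
toy["start_dt"] = pd.to_datetime(toy["start_time"], format=fmt)
toy["end_dt"] = pd.to_datetime(toy["end_time"], format=fmt)

# Hour of day as an integer, and session length in minutes.
toy["hour"] = toy["start_dt"].dt.hour
toy["duration_min"] = (toy["end_dt"] - toy["start_dt"]).dt.total_seconds() / 60
print(toy[["hour", "duration_min"]])
```

Parsing with an explicit `format` avoids ambiguity between day-first and month-first dates.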


In [5]:
# Gathering extra data (do we need this?)
# Time windows incremented by the hour, and the number of sessions starting in each window

sessions["hour"] = sessions["start_time"].str[:2].astype(int)  # hour of day (0-23) in which the session starts
player_counts = sessions.groupby('hour').size().reset_index(name='player_count')  # number of sessions starting in each hour

sessions = pd.merge( #adding hour and player_count to dataframe
    sessions,
    player_counts,
    on="hour",
    how="left" 
)
print(sessions)

     start_time end_time  original_start_time  original_end_time  start_date  \
0          1812     1824         1.719770e+12       1.719770e+12  30/06/2024   
1          2333     2346         1.718670e+12       1.718670e+12  17/06/2024   
2          1734     1757         1.721930e+12       1.721930e+12  25/07/2024   
3          0322     0358         1.721880e+12       1.721880e+12  25/07/2024   
4          1601     1612         1.716650e+12       1.716650e+12  25/05/2024   
...         ...      ...                  ...                ...         ...   
1530       2301     2307         1.715380e+12       1.715380e+12  10/05/2024   
1531       0408     0419         1.719810e+12       1.719810e+12  01/07/2024   
1532       1536     1557         1.722180e+12       1.722180e+12  28/07/2024   
1533       0615     0622         1.721890e+12       1.721890e+12  25/07/2024   
1534       0226     0245         1.716170e+12       1.716170e+12  20/05/2024   

        end_date  hour  player_count  
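
The `player_count` computed above is the number of sessions that start in each hour. Because the question is about simultaneous players, one alternative is to track how many sessions are open at each moment and take the peak within each hour. The following is a rough sketch of that idea on made-up data, assuming the start and end times have already been parsed into datetime columns (hypothetically named `start_dt` and `end_dt`). It is not the method used in this report.

```python
import pandas as pd

# Toy sessions with invented times; start_dt and end_dt are hypothetical column names.
toy = pd.DataFrame({
    "start_dt": pd.to_datetime(["2024-06-30 18:12", "2024-06-30 18:40", "2024-06-30 19:15"]),
    "end_dt": pd.to_datetime(["2024-06-30 19:10", "2024-06-30 18:55", "2024-06-30 19:30"]),
})

# One event per session start (+1) and one per session end (-1).
events = pd.concat(
    [
        pd.DataFrame({"time": toy["start_dt"], "change": 1}),
        pd.DataFrame({"time": toy["end_dt"], "change": -1}),
    ],
    ignore_index=True,
)

# Sort by time; on ties, process ends (-1) before starts (+1).
events = events.sort_values(["time", "change"]).reset_index(drop=True)

# Running total of open sessions after each event.
events["active"] = events["change"].cumsum()

# Peak number of simultaneous sessions observed at events within each hour of day.
# Limitation: an hour with ongoing sessions but no start or end event is not listed.
events["hour"] = events["time"].dt.hour
peak_by_hour = events.groupby("hour")["active"].max()
print(peak_by_hour)
```

The peak per hour relates more directly to the number of licenses needed than the number of session starts does.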


In [12]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

sessions_training, sessions_testing = train_test_split(
    sessions,
    test_size = 0.25,
    random_state = 2000,
)
X_train = sessions_training[["hour"]]
y_train = sessions_training["player_count"]

X_test = sessions_testing[["hour"]]
y_test = sessions_testing["player_count"]


# Check the shapes to verify
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1151, 1) (384, 1) (1151,) (384,)


In [13]:
sessions_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(),
)

sessions_cv = pd.DataFrame(
    cross_validate(
        estimator = sessions_pipe,
        cv = 5,
        X = sessions_training[["hour"]],
        y = sessions_training["player_count"],
        scoring = "neg_root_mean_squared_error",
        return_train_score=True,
    )
)
sessions_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.003462,0.00183,-0.394771,-0.463681
1,0.012666,0.001695,-0.0,-0.392824
2,0.002487,0.001631,-0.746004,-0.453961
3,0.002406,0.001618,-0.619818,-0.41177
4,0.002414,0.001904,-0.698943,-0.390106


In [14]:
# Tune the number of neighbours (1 to 200) with 5-fold cross-validation.
param_grid = {"kneighborsregressor__n_neighbors": range(1, 201)}
sessions_tuned = GridSearchCV(sessions_pipe, param_grid, cv=5, n_jobs=-1, scoring="neg_root_mean_squared_error")
sessions_results = pd.DataFrame(sessions_tuned.fit(X_train, y_train).cv_results_)

# The scorer returns the negative RMSE, so negate it to recover the error.
sessions_min = sessions_tuned.best_params_
sessions_best_RMSPE = -sessions_tuned.best_score_

print(sessions_min)
print(sessions_best_RMSPE)

{'kneighborsregressor__n_neighbors': 1}
0.02637521893583148


In [15]:
sessions_prediction = sessions_tuned.predict(X_test)
sessions_summary = mean_squared_error(y_test, sessions_prediction)**0.5  # RMSPE on the test set (y_true first, then y_pred)
sessions_summary

np.float64(0.0)
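
A test error of exactly 0 is consistent with how `player_count` was built: after the merge, every row with the same `hour` carries the same `player_count`, so a 1-nearest-neighbour model only has to find a training row with the same hour. The toy example below uses made-up numbers (the `lookup` values are not from the dataset) to illustrate this effect. It is a sketch of the mechanism, not a re-run of the analysis.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: the target is a fixed lookup of the feature, like a per-hour count.
lookup = {0: 40, 1: 55, 2: 70}
hours = [0, 1, 2] * 10
toy = pd.DataFrame({"hour": hours, "player_count": [lookup[h] for h in hours]})

# Simple split: every hour value in the test rows also appears in the training rows.
train, test = toy.iloc[:21], toy.iloc[21:]

knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(train[["hour"]], train["player_count"])
pred = knn.predict(test[["hour"]])

# Root mean squared error between predictions and true values.
rmse = float(np.sqrt(np.mean((pred - test["player_count"].to_numpy()) ** 2)))
print(rmse)
```

In this situation a zero error shows that the target can be looked up from the feature, rather than that the model forecasts unseen demand.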

In [16]:
np.random.seed(2019) # DO NOT CHANGE

sessions_preds  = sessions_training.assign(
    predictions = sessions_tuned.predict(X_train)
)

sessions_plot_working = alt.Chart(sessions_preds).mark_circle(opacity = 0.4).encode(
    x=alt.X("hour").title("Hour of day"),
    y=alt.Y("player_count").title("Player Count").scale(zero=False),
)
sessions_line = alt.Chart(sessions_preds).mark_line(color="black").encode(
    x = "hour",
    y = "predictions",
)
sessions_plot = alt.layer(sessions_plot_working, sessions_line)
sessions_plot

<h3><b>Discussion:</b></h3>

#### Summary of Results:

#### Were these results expected?

#### Impacts of Findings:

#### Future Questions:
Knowing which time periods are likely to have large numbers of simultaneous players could lead to further exploratory questions that give a more detailed analysis of demand forecasting. Other variables could be tested and analysed to see how demand differs between groups of players, for example by experience level, age, or gender. It could also lead to the question of how the number of players in each time window changes when updates to the game are released or when new features are added.
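
As a starting point for the breakdown by player characteristics, the sessions could be joined to the players table and counted per group and hour. The following is a minimal sketch on made-up data. It assumes the `hashedEmail` column is kept in both tables as the join key (it is dropped in the cleaning step of this report), and the frames `players_toy` and `sessions_toy` are hypothetical.

```python
import pandas as pd

# Toy tables with invented values; hashedEmail is assumed to be kept as the join key.
players_toy = pd.DataFrame({
    "hashedEmail": ["a", "b", "c"],
    "experience": ["Pro", "Amateur", "Pro"],
})
sessions_toy = pd.DataFrame({
    "hashedEmail": ["a", "a", "b", "c"],
    "hour": [18, 19, 18, 18],
})

# Attach each player's experience level to their sessions.
merged = sessions_toy.merge(players_toy, on="hashedEmail", how="left")

# Number of sessions per experience level (rows) and hour of day (columns).
counts = merged.groupby(["experience", "hour"]).size().unstack(fill_value=0)
print(counts)
```

The same pattern would apply to `age` or `gender` by changing the grouping column.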