##### (1) Data Description

<b>players.csv: 
- A set of all unique players containing 7 variables and 196 observations

Variables include:
- subscribe: subscriptions status (binary: True or False)
- hashedemail: encrypted email (string)
- played_hours: number of hours played (numerical)
- name: player first name (string)
- gender: player gender (categorical)
- age: player age (numerical)
- individualId (empty)
- organizationName (empty)

No major issues observed in this dataset.
Data collected by players from Plaicraft and their info personally submitted during signup.



<b>sessions.csv:
- Set of individual play session by each player including data about session containing
5 Variables and 1535 observations
  
Variables include: 
- hashedemail: encrypted email (string)
- start_time: time and date player began playing (24 hour time)
- end_time: time and date player stopped playing (24 hour time)
- original_start_time: time after 1970 that player began playing. Numerical data
- original_end_time: time after 1970 that player stopped playing. Numerical data

start_time and end_time should be divided so that date and time are separate variables.
Data collected from players on plaicraft.



##### (2) Question:

<i>Question 2: We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability. 

- Response variable: Number of players during specific time windows
- Variables of interest: start_time, end_time, original_start_time, original_end_time, age, played_hours, gender.

I plan to define specific time windows (1 hour increments), and build a model that deterimes the number of players playing in each window. The times could be wrangled into days, months, and seasons to observe trends. Including age and gender can show trends of demographics.

##### (3) Exploratory Data Analysis and Visualization

In [1]:
import pandas as pd
import altair as alt
import matplotlib.pyplot 
import numpy as np

#players
players = pd.read_csv("data/players.csv")
players = players.drop(columns=["individualId", "organizationName"])
players.rename(columns={"hashedEmail":"hashed_email"}, inplace=True)

#sessions 
sessions = pd.read_csv("data/sessions.csv")
sessions[["start_date", "start_time"]] = sessions["start_time"].str.split(" ", expand=True)
sessions[["end_date", "end_time"]] = sessions["end_time"].str.split(" ", expand=True)
sessions.rename(columns={"hashedEmail":"hashed_email"}, inplace=True)
sessions["start_time"] = sessions["start_time"].str.replace(":", "")
sessions["end_time"] = sessions["start_time"].str.replace(":", "")

display(sessions.head(10), players.head(10))

Unnamed: 0,hashed_email,start_time,end_time,original_start_time,original_end_time,start_date,end_date
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,1812,1812,1719770000000.0,1719770000000.0,30/06/2024,30/06/2024
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2333,2333,1718670000000.0,1718670000000.0,17/06/2024,17/06/2024
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,1734,1734,1721930000000.0,1721930000000.0,25/07/2024,25/07/2024
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,322,322,1721880000000.0,1721880000000.0,25/07/2024,25/07/2024
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,1601,1601,1716650000000.0,1716650000000.0,25/05/2024,25/05/2024
5,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,1508,1508,1719160000000.0,1719160000000.0,23/06/2024,23/06/2024
6,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,712,712,1713170000000.0,1713170000000.0,15/04/2024,15/04/2024
7,ad6390295640af1ed0e45ffc58a53b2d9074b0eea694b1...,213,213,1726880000000.0,1726890000000.0,21/09/2024,21/09/2024
8,96e190b0bf3923cd8d349eee467c09d1130af143335779...,231,231,1718940000000.0,1718940000000.0,21/06/2024,21/06/2024
9,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,513,513,1715840000000.0,1715840000000.0,16/05/2024,16/05/2024


Unnamed: 0,experience,subscribe,hashed_email,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21
5,Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee...,0.0,Adrian,Female,17
6,Regular,True,8e594b8953193b26f498db95a508b03c6fe1c24bb5251d...,0.0,Luna,Female,19
7,Amateur,False,1d2371d8a35c8831034b25bda8764539ab7db0f6393869...,0.0,Emerson,Male,21
8,Amateur,True,8b71f4d66a38389b7528bb38ba6eb71157733df7d17403...,0.1,Natalie,Male,17
9,Veteran,True,bbe2d83de678f519c4b3daa7265e683b4fe2d814077f90...,0.0,Nyla,Female,22


In [2]:
explore_plot = alt.Chart(sessions).mark_bar().encode(
    alt.X("start_time", bin=alt.Bin(maxbins=60), title="Start time (24 hour time)"),
    alt.Y("count()", title="Frequency (number of sessions)"),
).properties(
    title='Frequency of Start Time'
)
explore_plot_2 = alt.Chart(sessions).mark_bar().encode(
    alt.X("end_time", bin=alt.Bin(maxbins=60), title="End time (24 hour time)"),
    alt.Y("count()", title="Frequency (number of sessions)"),
).properties(
    title='Frequency of End Time'
)

print(explore_plot.show())
print(explore_plot_2.show())

None


None


Most players seem to start and end playing in the early and late hours of the day, both decrasing towards noon

##### (4) Methods and Plan



- Method: KNN regression 
- Assumptions: data is scaled properly
- Limitations/weaknesses: slow prediction times for large data sets (1535 observations)
- Compare: I could different evaluation metrics like RMSE and RMSPE
- Process: split into 75% and 25% training and testing before scaling. I will use cross validation

