# Reminder
Problem: Predicting Usage of a Video Game Research Server

Question 3: We would like to know something about our population of users. In particular, we would like to have a good model of whether or not a player will continue contributing given past participation.

submit in two formats: html and ipynb

## 1. Data Description

**Summary**

  The players.csv file is a data set containing information about the players in the game. There are 196 observations, with data about each player such as their experience, whether they subscribe, their hashed email, the number of hours played, their name, gender and age.

  The sessions.csv data set contains specific data about the playing sessions of the players in the game. There are 1535 observations in the sessions data set. It has variables like the players' email, start time, end time, original start time and original end time.


**Players.csv:**
1. Experience: The level of expertise that the player has in the game. The data type of this column variable is string. Possible values are "Amateur", "Beginner", "Regular", "Pro", "Veteran". This is a categorical variable.

2. Subscribe: Whether or not the player has subscribed to the game. This is a boolean variable that can only take the value "TRUE" or "FALSE", indicating "yes" or "no" to whether they are subscribed.

3. Hashed Email: This is a string of letters and numbers produced by hashing the user's email, so the real address is not exposed. It is a unique identifier for each player. The data type is string.

4. Played Hours: The number of hours spent playing the game, rounded to one decimal place. The data type of this variable is float.

5. Name: This is the name (first name) of the player. This is probably not a unique identifier since two people could coincidentally have the same name. The data type is string.

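6. Gender: Gender is a categorical variable which has the following possible values: "Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-spirited", "Other".

7. Age: The age of the player. The values shown in this notebook are whole numbers, so this is a numeric variable.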


**Sessions.csv:**
1. Hashed Email: This is a string of letters and numbers produced by hashing the user's email. It identifies the player who played the session, so the same value can appear in more than one row. The data type is string.

2. Start Time: This includes the date - in format DD/MM/YY - and the time the player started playing the game. The time is in 24-hour format. The value is probably a string as it combines both the date and time.

3. End Time: This includes the date - in format DD/MM/YY - and the time the player stopped playing the game. The time is in 24-hour format. The value is probably a string as it combines both the date and time.

4. Original Start Time: This variable is a 14-digit integer indicating the start time.

5. Original End Time: This variable is a 14-digit integer indicating the end time.
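
Since the start and end times are described above as strings, they would need to be parsed before any session length can be computed. The following is a minimal sketch on a made-up three-row sample; the column names `start_time` and `end_time`, the sample values, and the exact format string `%d/%m/%y %H:%M` are assumptions to adjust to what sessions.csv actually contains.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of sessions.csv (values are invented).
sessions_sample = pd.DataFrame({
    "hashedEmail": ["a1b2", "a1b2", "c3d4"],
    "start_time": ["30/06/24 18:12", "17/06/24 23:33", "25/07/24 17:34"],
    "end_time": ["30/06/24 18:24", "17/06/24 23:46", "25/07/24 17:57"],
})

# Parse the DD/MM/YY HH:MM strings into real datetimes (format is an assumption).
for col in ["start_time", "end_time"]:
    sessions_sample[col] = pd.to_datetime(sessions_sample[col], format="%d/%m/%y %H:%M")

# Session length in minutes, from the difference of the two datetimes.
sessions_sample["session_minutes"] = (
    sessions_sample["end_time"] - sessions_sample["start_time"]
).dt.total_seconds() / 60

print(sessions_sample[["hashedEmail", "session_minutes"]])
```

Passing an explicit `format` avoids pandas guessing whether the day or the month comes first.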

**Some issues noted with the data include**
1. The meaning of some variables is not easy to understand. For example, the original start time and original end time values are not very intuitive.
2. The "hashedEmail" variable name follows a different convention from other two-word variables like "played_hours". It should perhaps have been "hashed_email" for consistency.
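
If a consistent naming convention is wanted, the column could be renamed after loading. This is a small sketch on a two-row stand-in data frame (the values are placeholders, not real hashes); only the `hashedEmail` and `played_hours` names come from the data description above.

```python
import pandas as pd

# Tiny stand-in for players.csv; only the column names matter here.
players_sample = pd.DataFrame({
    "hashedEmail": ["a1b2", "c3d4"],
    "played_hours": [30.3, 3.8],
})

# Rename to snake_case so every two-word column follows the same convention.
players_renamed = players_sample.rename(columns={"hashedEmail": "hashed_email"})

print(list(players_renamed.columns))
```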

## 2. Question

- **Question Chosen**

"We would like to know something about our population of users. In particular, we would like to have a good model of whether or not a player will continue contributing given past participation."

We would like to investigate whether age and hours played can predict the experience level of the players.

The data available in players.csv can help with this classification problem. Specifically, the variables "experience", "played_hours" and "age" will be used in this data analysis.

- **Data Wrangling**

We can convert the played_hours column to minutes, since many players have less than an hour of playing time.
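
One way to check how many players fall under an hour is sketched here on nine sample rows. The hour values are back-computed from the minutes shown in the table output later in this notebook, so they are an illustration rather than the full data set.

```python
import pandas as pd

# Nine sample rows; hours are back-computed from the minutes displayed in the notebook.
players_sample = pd.DataFrame({
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.3, 0.0, 2.3],
})

# Convert to minutes; rounding removes floating-point noise such as 6.000000000000001.
players_sample["played_minutes"] = (players_sample["played_hours"] * 60).round()

# Boolean mask of players with less than one hour of play.
under_an_hour = players_sample["played_hours"] < 1

print(f"{under_an_hour.sum()} of {len(players_sample)} sample players logged under an hour")
print(players_sample.loc[under_an_hour, "played_minutes"].tolist())
```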

## 3. Exploratory Data Analysis and Visualization

- minimum necessary wrangling for tidy format
- make a few plots using best practices (labels, titles, units)
- explain any insights gained from these plots to help understand

In [69]:
import pandas as pd
import altair as alt

# Read the data
players_data = pd.read_csv("players.csv")
sessions_data = pd.read_csv("sessions.csv")

# Data wrangling: express play time in minutes and keep only the columns used below
players = players_data.assign(
    played_minutes=players_data["played_hours"] * 60
)[["played_minutes", "experience", "age"]]

# Plot 1: minutes played for each level of experience
players_plot_1 = alt.Chart(players).mark_bar(opacity=0.5).encode(
    x=alt.X("played_minutes").title("Minutes Played by Player"),
    y=alt.Y("experience").title("Level of Experience")
)

# Plot 2: age for each level of experience
players_plot_2 = alt.Chart(players).mark_bar().encode(
    x=alt.X("age").title("Age of Player"),
    y=alt.Y("experience").title("Level of Experience")
)

# Plot 3: minutes played against age
players_plot_3 = alt.Chart(players).mark_point().encode(
    x=alt.X("age").title("Age of Player"),
    y=alt.Y("played_minutes").title("Minutes Played by Player")
)

# Plot 4: minutes played against age, coloured by level of experience
players_plot_4 = alt.Chart(players).mark_point().encode(
    x=alt.X("age").title("Age of Player"),
    y=alt.Y("played_minutes").title("Minutes Played by Player"),
    color=alt.Color("experience").title("Level of Experience")
)

# Only the last expression in a cell is rendered, so this displays Plot 3
players_plot_3


Plot 1 suggests that regular players spend the most time on the platform, with amateur players second. Beginners, pros and veterans spend a similar amount of time playing, which might make these three levels hard to tell apart in a prediction.
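
A numeric summary per experience level can be used to cross-check what is read off a bar chart. The sketch below uses only the nine rows visible in the table output of this notebook, so its numbers describe that sample and not the full data set.

```python
import pandas as pd

# The nine rows shown in the notebook's table output (not the full data set).
players_sample = pd.DataFrame({
    "played_minutes": [1818.0, 228.0, 0.0, 42.0, 6.0, 0.0, 18.0, 0.0, 138.0],
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular",
                   "Amateur", "Veteran", "Amateur", "Amateur"],
    "age": [9, 17, 17, 21, 21, 17, 22, 17, 17],
})

# Count, total, mean, median and maximum minutes played for each experience level.
summary = (
    players_sample.groupby("experience")["played_minutes"]
    .agg(["count", "sum", "mean", "median", "max"])
    .sort_values("mean", ascending=False)
)

print(summary)
```

Looking at the median next to the mean helps show whether a level's total is driven by a few heavy players.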

In Plot 3, there is no clear relationship between minutes played and the age of the player. More analysis might be required, or the explanatory variables for the project might need to change.
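
The visual impression from a scatter plot can be backed up with a correlation coefficient. This is a sketch on the nine rows visible in the notebook's table output; with so few rows a single heavy player can dominate the value, so it should be recomputed on the full `players` data frame before drawing conclusions.

```python
import pandas as pd

# The nine rows shown in the notebook's table output (not the full data set).
players_sample = pd.DataFrame({
    "played_minutes": [1818.0, 228.0, 0.0, 42.0, 6.0, 0.0, 18.0, 0.0, 138.0],
    "age": [9, 17, 17, 21, 21, 17, 22, 17, 17],
})

# Pearson correlation: near 0 means no linear relationship, near +1 or -1 a strong one.
r = players_sample["age"].corr(players_sample["played_minutes"])

print(f"Pearson correlation between age and minutes played: {r:.2f}")
```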

In [54]:
players

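| | played_minutes | experience | age |
|---|---|---|---|
| 0 | 1818.0 | Pro | 9 |
| 1 | 228.0 | Veteran | 17 |
| 2 | 0.0 | Veteran | 17 |
| 3 | 42.0 | Amateur | 21 |
| 4 | 6.0 | Regular | 21 |
| ... | ... | ... | ... |
| 191 | 0.0 | Amateur | 17 |
| 192 | 18.0 | Veteran | 22 |
| 193 | 0.0 | Amateur | 17 |
| 194 | 138.0 | Amateur | 17 |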


## 4. Methods and Plan

**Method:**

  One possible method is K-nearest neighbours (KNN) classification, with the number of neighbours chosen to optimize the model.
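
A minimal sketch of what a KNN classifier for this problem could look like with scikit-learn is given here. The data are randomly generated stand-ins with the same column names as `players`, the choice of `n_neighbors=5` is arbitrary, and the `StandardScaler` step is an addition not stated in the plan, included because KNN relies on distances.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real players.csv), 196 rows like the real file.
rng = np.random.default_rng(0)
levels = ["Amateur", "Beginner", "Regular", "Pro", "Veteran"]
n = 196
synthetic = pd.DataFrame({
    "age": rng.integers(9, 50, size=n),
    "played_minutes": rng.exponential(scale=300, size=n).round(),
    "experience": np.resize(levels, n),
})

X = synthetic[["age", "played_minutes"]]
y = synthetic["experience"]

# Scale the predictors, then classify by the 5 nearest neighbours (k is arbitrary here).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)

# Predict the experience level of one hypothetical new player.
new_player = pd.DataFrame({"age": [20], "played_minutes": [120.0]})
prediction = knn.predict(new_player)
print(prediction[0])
```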

**Why is this method appropriate?**
    This method is appropriate because this is a classification problem: the goal is to predict which experience level a player has based on their age and hours played.

**What are the potential limitations or weaknesses of the method selected?**
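    These variables might not be the best ones to predict the experience level. KNN also relies on distances between observations, so age and played time, which are on very different scales, would need to be standardized before fitting.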

**How are you going to compare and select the model?**
    Models will be compared by accuracy, and the number of neighbours that gives the highest accuracy will be selected. Cross-validation can be used to choose this optimal number of neighbours.
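
Choosing the number of neighbours by cross-validated accuracy could be sketched with scikit-learn's `GridSearchCV`, as follows. The training data are randomly generated stand-ins, and the 5 folds, the grid of odd values from 1 to 29, and the scaling step are all assumptions rather than decisions made in this plan.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training portion of the data (NOT the real players.csv).
rng = np.random.default_rng(1)
levels = ["Amateur", "Beginner", "Regular", "Pro", "Veteran"]
n_train = 117
X_train = pd.DataFrame({
    "age": rng.integers(9, 50, size=n_train),
    "played_minutes": rng.exponential(scale=300, size=n_train).round(),
})
y_train = pd.Series(np.resize(levels, n_train), name="experience")

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate values of k; make_pipeline names the step "kneighborsclassifier".
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}

# 5-fold cross-validation on the training data, scored by accuracy.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(f"Best k: {best_k}, cross-validated accuracy: {search.best_score_:.2f}")
```

Because the labels here are assigned without regard to the predictors, the accuracy printed for this synthetic data is expected to sit near chance level; the point is the workflow, not the number.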

  
**How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?**

Yes, the data will be split once into a training set and a testing set, with 60% of the data used for training the model and 40% kept for testing it.

The data will be split before the model is created. Yes, cross-validation will be used on the training data to find the optimal number of KNN neighbours.
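
The 60/40 split described above could be done with scikit-learn's `train_test_split`, sketched here on synthetic stand-in data with 196 rows. Stratifying on `experience` and fixing `random_state=123` are assumptions added for balanced classes and reproducibility; they are not part of the original plan.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the wrangled players data (NOT real data).
rng = np.random.default_rng(2)
levels = ["Amateur", "Beginner", "Regular", "Pro", "Veteran"]
n = 196
synthetic = pd.DataFrame({
    "age": rng.integers(9, 50, size=n),
    "played_minutes": rng.exponential(scale=300, size=n).round(),
    "experience": np.resize(levels, n),
})

# Split once, before any model is fitted: 60% training, 40% testing.
train, test = train_test_split(
    synthetic,
    test_size=0.4,
    stratify=synthetic["experience"],  # keep class proportions similar in both sets
    random_state=123,                  # arbitrary seed for reproducibility
)

print(len(train), len(test))
```

Cross-validation for choosing the number of neighbours would then run on `train` only, with `test` held back for a single final accuracy estimate.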