Project Planning
---
Olivia Kong, Student ID: 72594369
Group 29

Question
-
Question 3: We would like to know something about our populations of users, in particular, we would like to have a good model of whether or not a player will continue contributing given past participation. 

Group 29 has decided to study Question 3. The goal of this project is to predict player retention: whether a player will continue playing the game, based on the data in the two given dataframes. This is a classification problem in which each new observation is assigned a class (for example, keeps playing or doesn't keep playing). The two datasets, 'players.csv' and 'sessions.csv', contain several variables that we will use as predictors when classifying new observations.
Because the prediction is a class and not a numerical value, we will use a KNN classification model with a tuned K value and a set of useful predictor variables. The data will also be split into training and test sets to evaluate the classifier's performance using accuracy, recall and precision.

In [1]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler



alt.data_transformers.enable('vegafusion')

set_config(transform_output="pandas")

Players Messy Dataframe
-

In [2]:
players_messy=pd.read_csv('players.csv')
players_messy

FileNotFoundError: [Errno 2] No such file or directory: 'players.csv'

This dataframe shows the data from the players.csv file, which lists the players along with information about each one. Variables include experience, hours played, age and gender. These variables can be plotted to look for trends or patterns in the data set, and they can serve as predictor variables when classifying new observations. Some columns will not be useful as predictors in this project, namely the hashed email, organization name and individual ID. In fact, the columns 'organizationName' and 'individualId' don't have values for any observation. None of these three columns is relevant to our research question, so they can be dropped during data wrangling (the hashed email only after it has been used to merge the two dataframes).


There are 196 rows, corresponding to 196 observations, and 9 columns, each representing a variable. However, not all columns are useful variables, so some can be dropped during data wrangling, as mentioned above. The columns contain a mix of data types, some numerical and others categorical.

Numerical columns: played_hours, age

Categorical columns: experience, subscribe, gender 


Note that the name column is not treated as a categorical variable: it simply identifies the players and doesn't provide any predictive value.
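The structure described above (row and column counts, empty columns, numerical versus categorical columns) can be checked directly in pandas. The sketch below is only illustrative: `players_demo` is a hypothetical three-row stand-in with invented values, using the column names mentioned in this report, and is not the real players.csv data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for players.csv; every value here is invented.
players_demo = pd.DataFrame({
    "experience": ["Pro", "Amateur", "Veteran"],
    "subscribe": [True, False, True],
    "hashedEmail": ["a1", "b2", "c3"],
    "played_hours": [30.3, 0.0, 3.8],
    "name": ["Ana", "Ben", "Cy"],
    "gender": ["Male", "Female", "Male"],
    "age": [9, 17, 21],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# Number of observations (rows) and variables (columns).
n_rows, n_cols = players_demo.shape

# Columns in which every single value is missing.
empty_columns = players_demo.columns[players_demo.isna().all()].tolist()

# Drop the all-empty columns, then list the numerical columns that remain.
players_trimmed = players_demo.dropna(axis="columns", how="all")
numeric_columns = players_trimmed.select_dtypes(include="number").columns.tolist()

print(n_rows, n_cols)
print(empty_columns)
print(numeric_columns)
```

Running the same three checks on the real `players_messy` dataframe would be one way to confirm the counts reported above.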


Sessions Messy Dataframe
-

In [None]:
sessions=pd.read_csv('sessions.csv')
sessions

This dataframe shows the data from the sessions.csv file, which records the start and end time of each play session. Unlike the players dataframe, only the 'hashedEmail' column is present to identify each player (no name, gender, etc.). It is therefore helpful to merge the two dataframes on the 'hashedEmail' column, as this makes it easier to find patterns and trends in the data. For example, if pro players tend to play around the same time of day, this would be hard to see by looking at the two datasets separately: you would first need to find the time in the sessions dataframe, match the hashed email across both dataframes, and then look up the player's experience class in the players dataframe. The sessions dataframe offers fewer potential predictor variables than the players dataframe; only the start and end times are likely to be used. As before, the hashed email gives no insight into the class of a new observation and shouldn't be used as a predictor variable, but it can be used to merge the two dataframes.

The original start and end time columns are hard to interpret, as their values are very large (~1.7 trillion). Unlike the 'start_time' and 'end_time' columns, which show the date and time, 'original_start_time' and 'original_end_time' only provide a single large number. From external research, these columns hold Unix time, which counts the time elapsed since January 1, 1970. At this magnitude the values are most likely in milliseconds rather than seconds, since only about 1.7 billion seconds have elapsed since 1970. Because the readable 'start_time' and 'end_time' columns are already included in the sessions dataframe, there is no need for the original start and end times.
Therefore the columns 'hashedEmail', 'original_start_time' and 'original_end_time' are not useful as predictors for the classification of new observations.
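If the Unix-time columns ever needed to be read, pandas can convert them. Values around 1.7 trillion are too large to be seconds since 1970 (roughly 1.7 billion seconds have elapsed), so this sketch assumes the unit is milliseconds. The two epoch values are hypothetical, chosen only to match the magnitude described above.

```python
import pandas as pd

# Hypothetical epoch values (assumed to be milliseconds since 1970-01-01 UTC).
original_start_time = pd.Series([1_719_770_000_000, 1_718_670_000_000])

# unit="ms" tells pandas how to interpret the integers.
converted = pd.to_datetime(original_start_time, unit="ms", utc=True)

# Sanity check on the epoch itself: 0 ms is midnight on January 1, 1970.
epoch_start = pd.to_datetime(0, unit="ms")

print(converted)
print(epoch_start)
```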

This dataframe contains 1535 rows, corresponding to 1535 observations, and 5 columns, each representing a variable. It has far more rows than the players dataframe, which has only 196. This is likely because each session is recorded as its own observation, so a player with more than one session appears in several rows of the sessions dataframe. Once again, not every column can be used as a predictor variable, as mentioned above, so some can be dropped during data wrangling.
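The explanation that one player can account for several rows can be checked by counting sessions per hashed email. The sketch below uses a small hypothetical `sessions_demo` dataframe with invented values rather than the real sessions.csv.

```python
import pandas as pd

# Hypothetical sessions data: six sessions played by three distinct players.
sessions_demo = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2", "a1", "c3", "b2"],
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34",
                   "25/07/2024 03:22", "25/05/2024 16:01", "23/06/2024 15:08"],
})

# One row per player, holding the number of sessions that player has.
sessions_per_player = (
    sessions_demo.groupby("hashedEmail")
    .size()
    .rename("n_sessions")
    .reset_index()
)

n_sessions_total = len(sessions_demo)
n_players = sessions_demo["hashedEmail"].nunique()

print(sessions_per_player)
print(n_sessions_total, n_players)
```

If the number of unique hashed emails in the real data is at most 196 while the row count is 1535, that would support the one-row-per-session reading.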

Looking at the 'start_time' and 'end_time' columns, each cell holds two values (a date and a time), which isn't tidy. This can be fixed by splitting each column into two separate columns, one for the date and one for the time, using the str.split method.

Tidying the Data Frames
-

In [None]:
sessions[['start_date','start_time']]=sessions['start_time'].str.split(' ', expand=True)
sessions


In [None]:
sessions[['end_date','end_time']]=sessions['end_time'].str.split(' ',expand=True)
sessions


In [None]:
sessions_tidy=sessions.drop(columns=['original_start_time','original_end_time'])
sessions_tidy

In [None]:
# Parse the "HH:MM" strings, then keep only the time-of-day component.
sessions_tidy["start_time"] = pd.to_datetime(sessions_tidy["start_time"], format="%H:%M").dt.time
#sessions_tidy["session_time"] = sessions_tidy["start_time"].dt.minute /60
sessions_tidy

Players dataframe merged with Sessions dataframe on hashed email
-

In [None]:
players_session_tgt=sessions_tidy.merge(players_messy, on='hashedEmail')
players_session_tgt

Players_session_tgt Dataframe, dropping unhelpful columns
-

In [None]:
players_session_tgt=players_session_tgt.drop(columns=['hashedEmail','individualId','organizationName', 'start_date', 'end_date'])
players_session_tgt


In [None]:
# 'start_date' and 'end_date' were dropped in the previous cell, so they are not listed here.
new_column_order = ['name', 'start_time', 'end_time', 'experience', 'subscribe', 'gender', 'played_hours', 'age']


players_session_tgt = players_session_tgt[new_column_order]
players_session_tgt


In [None]:
import altair as alt

age_vs_hours_plot=alt.Chart(players_session_tgt,title='Experience: Hours Played vs Age').mark_circle(clip=True).encode(
    x=alt.X('age').title('Age').scale(domain=[6, 30], zero=False),
    y=alt.Y('played_hours').title('Hours Played').scale(domain=[0, 6], zero=False),
    color=alt.Color('experience').title('Experience')
)
age_vs_hours_plot

The plot of hours played versus age shows significant overplotting, with most observations concentrated around age 20 and approximately 1 hour of playtime. Even after narrowing the x-axis domain (age) to spread the points out, they remain densely packed, so no patterns or relationships can be observed. Color was mapped to the experience column to help identify potential patterns or trends, but because of the overplotting no conclusions could be drawn from it. The clustering can be attributed to the fact that the majority of observations come from the same age group (15-20), so data from these individuals overlap in the same region. For any type of relationship (positive, negative, weak, strong) to be visible, there would need to be a broader distribution of observations across ages. Many points also lie on the x-axis (0 hours of playtime), making it harder to spot trends or patterns.


In [None]:
age_vs_hours_plot_2=alt.Chart(players_session_tgt, title='Gender: Hours Played vs Age').mark_circle(clip=True).encode(
    x=alt.X('age').title('Age').scale(domain=[6, 30], zero=False),
    y=alt.Y('played_hours').title('Hours Played').scale(domain=[0, 7], zero=False),
    color=alt.Color('gender').title('Gender').scale(scheme='set1')
)
age_vs_hours_plot_2

This is the same plot as above, except that color is mapped to the 'gender' column instead of 'experience'. There isn't much difference, as the points are still concentrated in the same area (age 20 and 1 hour of playtime), so no major trends or patterns can be observed. In this plot, I limited the domain once again to zoom in on the points, and doing so shows many more green points than points of any other color. Green corresponds to male players, which could be helpful later on when deciding which predictor variables to use to classify new observations. In other words, male players contribute more of the data, so we could tentatively hypothesize that male players are more likely to keep playing, although their larger share of the data does not by itself establish this.

In [None]:
age_vs_hours_plot_3=alt.Chart(players_session_tgt, title='Subscribe: Hours Played vs Age').mark_circle(clip=True).encode(
    x=alt.X('age').title('Age').scale(domain=[6, 30], zero=False),
    y=alt.Y('played_hours').title('Hours Played').scale(domain=[0, 7], zero=False),
    color=alt.Color('subscribe').title('Subscribed (Yes or No)').scale(scheme='set1')
)
age_vs_hours_plot_3

Again, this is the same plot as the two above, but color is mapped to the 'subscribe' column. By zooming into the region of overplotting, we can see the distribution of the points slightly better. There are more blue points than red points, meaning more players are subscribed than not.

Methods and Plan
-
Question 3: We would like to know something about our populations of users, in particular, we would like to have a good model of whether or not a player will continue contributing given past participation. 


This is a classification question, so a class will be predicted for new observations instead of a numerical value as in regression. As mentioned above, there are a few potential predictor variables that can be used to predict whether a player will continue playing, such as 'experience', 'gender' and 'subscribe'. The scatter plots above examined these variables, but no major relationships could be determined. To address this question, a KNN classifier will be trained on the training set, with the K value chosen through cross-validation.
One way of doing this is to first split the data into a training and a testing set (e.g. 75% training, 25% testing), and then create a GridSearchCV object from the KNN classifier, the parameter grid and the chosen number of folds. By fitting the GridSearchCV object, we can evaluate the cross-validation accuracy for each K value and collect the results in a separate dataframe using the 'cv_results_' attribute. Plotting this dataframe as a line plot shows which K value gives the highest estimated accuracy. Once the K value has been chosen, we can retrain a new classifier with this value and evaluate its ability to correctly predict observations in the testing set by making a confusion matrix. The confusion matrix will allow us to evaluate the model's accuracy, precision and recall.
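The plan above could look roughly like the following sketch. It is not the project's final analysis: the data are synthetic (generated with `make_classification`), and the three numeric features, the grid of K from 1 to 20 and the 5 folds are all assumptions made for illustration. The real categorical predictors would first need to be encoded as numbers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the real predictors and label.
X, y = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=1
)

# 75% training set, 25% testing set, keeping class proportions similar in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=1
)

# Standardize the predictors so that no single one dominates the distances.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 21))}

# 5-fold cross-validation over every K in the grid, scored by accuracy.
knn_search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
knn_search.fit(X_train, y_train)

# One row per K value; this is the dataframe that could be drawn as a line plot.
cv_results = pd.DataFrame(knn_search.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]
]
best_k = knn_search.best_params_["kneighborsclassifier__n_neighbors"]

# With the default refit=True, the best pipeline is retrained on the full
# training set, so it can be used directly to predict the testing set.
predictions = knn_search.predict(X_test)
cm = confusion_matrix(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)

print(best_k)
print(cm)
print(precision, recall)
```

The `std_test_score` column gives a rough sense of how much each accuracy estimate varies across folds, which is useful when several K values score almost the same.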

When choosing the K value, we look for the value with the highest cross-validation accuracy on the line plot, but it is possible that a range of values would perform similarly. This is because the values on the plot are only estimates of the true accuracy of the classifier. The K value with the highest point on the line plot is therefore not necessarily the one for which the classifier is truly most accurate. For this step of the project, we will assume that the K value with the highest estimated accuracy is the best one. Additionally, a KNN classification model assumes that the closer two points are, the more similar they are. This might be problematic given the overplotting in this relatively small dataset, because it becomes difficult to differentiate between classes based on proximity. In other words, it might be difficult for the model to accurately determine which class an observation belongs to based on its closest neighbors. The small dataset also gives the KNN model less information from which to learn patterns, and the overplotting makes this worse by reducing the variation among the data points.

A KNN classification model should be suitable for this problem: it can handle multi-class classification, and here we only need to classify whether a player will continue playing (yes or no). One limitation of a KNN model is that it becomes slow on larger datasets; however, as mentioned above, our dataset is relatively small, so this shouldn't be much of an issue. Another limitation is that it may not perform as well with many predictor variables, which might be an issue since we are likely using 3 predictor variables (experience, gender and subscribe). Additionally, imbalanced classes may hurt the performance of the model, and the plots above hint that this could come up. For example, in the plot where color is mapped to the subscribe column, we see more blue points than red, which may suggest that the classes are imbalanced. To address this, we can use the sample function to oversample the rare class until it has as many observations as the frequent class.
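The oversampling step mentioned above might be sketched as follows, assuming that "the sample function" refers to the pandas `DataFrame.sample` method. The dataframe `train_demo` and its values are hypothetical, and 'subscribe' is used as the label only because it is the column discussed above.

```python
import pandas as pd

# Hypothetical training data with an imbalanced label: 6 True rows, 2 False rows.
train_demo = pd.DataFrame({
    "played_hours": [0.1, 2.0, 0.0, 5.5, 1.2, 0.3, 3.1, 0.7],
    "subscribe": [True, True, True, True, True, True, False, False],
})

majority = train_demo[train_demo["subscribe"]]
minority = train_demo[~train_demo["subscribe"]]

# Draw rows from the rare class with replacement until it matches the frequent class.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=1)

balanced = pd.concat([majority, minority_upsampled], ignore_index=True)
class_counts = balanced["subscribe"].value_counts()

print(class_counts)
```

Oversampling would normally be applied to the training set only, after the train/test split, so that duplicated rows never end up in the testing set.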