# Modeling Video Game Popularity

In the previous notebook we examined several datasets to determine threshold values and ways to measure popularity. After noting which datasets lacked sufficient information and which could be combined, we settled on three datasets to model with. The main one is the 2019 Steam Tag data. The other two, included out of curiosity to see whether the results differ significantly, are the 2019 Video Game Sales data and the merged dataset combining the two.

#### Things to note that have been done:

- Dropped columns that were not independent variables (i.e., variables a developer cannot control when designing the game), other than the variable we chose to determine popularity (Rating).
- Created dummy variables for categorical values
- Train_test_split:
 - We chose a categorical dependent variable, Popularity, determined by whether a game achieved a rating above 70%
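As a rough illustration of the labeling rule described in the list above, the sketch below derives a binary Popularity target from a rating column. The column name `Rating`, the 0-100 scale, and the toy values are assumptions made for illustration; the actual preprocessing was done in the previous notebook and may differ.

```python
import pandas as pd

# Hypothetical ratings on a 0-100 scale (toy values, not taken from the dataset)
games = pd.DataFrame({"Rating": [55, 70, 71, 92]})

# A game is labeled popular only if its rating is strictly above 70
games["Popularity"] = (games["Rating"] > 70).astype(int)

print(games)
```

With a strict threshold, a game rated exactly 70 falls in the "not popular" class.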
 
#### What we will be doing in this notebook: 

We will select the model that best predicts the popularity of a game, which in turn tells us which features are most important to have in our games. We will look at Logistic Regression, K-Nearest Neighbors, Support Vector Machines, and Random Forests.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import math

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import validation_curve

import warnings
warnings.filterwarnings('ignore')

## Loading Training data: steam tags

In [2]:
x_train = pd.read_csv("C:/Users/book_/OneDrive/Documents/GitHub/Capstone2-Video-Game-Popularity/preprocessed_data/x_steam_train.csv", index_col=0)
y_train = pd.read_csv("C:/Users/book_/OneDrive/Documents/GitHub/Capstone2-Video-Game-Popularity/preprocessed_data/y_steam_train.csv", index_col=0)

x_test = pd.read_csv("C:/Users/book_/OneDrive/Documents/GitHub/Capstone2-Video-Game-Popularity/preprocessed_data/x_steam_test.csv", index_col=0)
y_test = pd.read_csv("C:/Users/book_/OneDrive/Documents/GitHub/Capstone2-Video-Game-Popularity/preprocessed_data/y_steam_test.csv", index_col=0)


In [3]:
print(x_train.shape)
print(y_train.shape)

(18949, 1476)
(18949, 1)


## I. Logistic Regression:

In [4]:
C_param_range = [0.001,0.01,0.1,1,10,100]

table = pd.DataFrame(columns = ['C_parameter','Accuracy'])
table['C_parameter'] = C_param_range


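for j, C in enumerate(C_param_range):

    # Fit an L2-regularized logistic regression with inverse regularization strength C
    Logreg = LogisticRegression(penalty='l2', C=C, random_state=40)
    # ravel() passes the single-column target as a 1-D array, as scikit-learn expects
    Logreg.fit(x_train, y_train.values.ravel())

    # Predict on the test set
    y_pred_lr = Logreg.predict(x_test)

    # Save the accuracy score in the table
    table.iloc[j, 1] = accuracy_score(y_test, y_pred_lr)

# Note: after the loop, Logreg and y_pred_lr refer to the last model (C = 100)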
    
table

   C_parameter  Accuracy
0        0.001  0.666585
1        0.01   0.667446
2        0.1    0.666338
3        1.0    0.666215
4       10.0    0.666215
5      100.0    0.665969
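The manual loop scores every value of C directly on the test set. One alternative, sketched here only as an illustration, is to let `GridSearchCV` (imported at the top of the notebook) choose C by cross-validation on the training data alone. The synthetic data from `make_classification` and the `max_iter` setting are assumptions standing in for the Steam tag features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train / y_train (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=40)

C_param_range = [0.001, 0.01, 0.1, 1, 10, 100]

# 5-fold cross-validated search over the same C values used in the loop
search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000, random_state=40),
    param_grid={"C": C_param_range},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```

Choosing C this way keeps the test set untouched until the final evaluation.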


In [5]:
from sklearn.metrics import confusion_matrix

cnf_matrix= confusion_matrix(y_test,y_pred_lr)
print(cnf_matrix)
Accuracy_lr=Logreg.score(x_test,y_test)

print(Accuracy_lr)

[[1291 1974]
 [ 739 4118]]
0.6659689731593204


The accuracy of the logistic regression stays at roughly 0.67 regardless of the value of C. Note that the confusion matrix and score above come from the last model fitted in the loop (C = 100).
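To put the 0.67 figure in context, the arithmetic below recomputes accuracy from the printed confusion matrix and compares it with a majority-class baseline. It assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and that class 1 means "popular".

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class)
cnf = np.array([[1291, 1974],
                [739, 4118]])

total = cnf.sum()

# Fraction of correct predictions (diagonal over total)
accuracy = np.trace(cnf) / total

# Accuracy of always predicting the larger true class
majority_baseline = cnf.sum(axis=1).max() / total

# Per-class recall: how many games of each true class were identified
recall_popular = cnf[1, 1] / cnf[1].sum()
recall_unpopular = cnf[0, 0] / cnf[0].sum()

print(round(accuracy, 3), round(majority_baseline, 3))
print(round(recall_popular, 3), round(recall_unpopular, 3))
```

Under these assumptions the model beats the roughly 0.60 baseline by about seven percentage points, and it recovers popular games far more often than unpopular ones.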

In [6]:
from sklearn.model_selection import cross_val_score

cv_scores_test= cross_val_score(Logreg,x_test,y_test,cv=5,scoring='roc_auc')
cv_scores_train= cross_val_score(Logreg,x_train,y_train,cv=5,scoring='roc_auc')
print(cv_scores_test)
cv_scores_lr_test= cv_scores_test.mean()
cv_scores_lr_train= cv_scores_train.mean()
cv_scores_std_test_lr= cv_scores_test.std()
print ('Mean cross validation test score: ' +str(cv_scores_lr_test))
print ('Mean cross validation train score: ' +str(cv_scores_lr_train))
print ('Standard deviation in cv test scores: ' +str(cv_scores_std_test_lr))

[0.64096698 0.63661307 0.66638252 0.65459584 0.66361939]
Mean cross validation test score: 0.6524355593139284
Mean cross validation train score: 0.6894576642310113
Standard deviation in cv test scores: 0.011883955624493046
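The "test" scores above come from calling `cross_val_score` on `x_test`, which refits the model on folds of the test set rather than scoring the model trained on `x_train`. A complementary check, sketched here on synthetic data (an assumption standing in for the Steam tag features), scores an already fitted model on held-out data with `roc_auc_score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real train/test split (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Fit once on the training portion only
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC AUC needs scores, so use the predicted probability of the positive class
proba_pos = model.predict_proba(X_te)[:, 1]
held_out_auc = roc_auc_score(y_te, proba_pos)

print(held_out_auc)
```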


## II. K-Nearest Neighbor (KNN):

In [7]:
from sklearn.neighbors import KNeighborsClassifier
#from sklearn.metrics import plot_roc_curve

# Apply KNN model to training data:

knn = KNeighborsClassifier(p=2,weights='distance',n_neighbors=50)
knn.fit(x_train,y_train)

# Predict using model:

y_predict_knn=knn.predict(x_test)

#Confusion matrix:

cnf_matrix = confusion_matrix(y_test, y_predict_knn)
print(cnf_matrix)
Accuracy_knn=knn.score(x_test,y_test)

print(Accuracy_knn)
#knn_disp = plot_roc_curve(knn, x_test, y_test)

[[1546 1719]
 [1143 3714]]
0.6476237379955676


In [8]:
cv_scores_test= cross_val_score(knn,x_test,y_test,cv=5,scoring='roc_auc')
cv_scores_train= cross_val_score(knn,x_train,y_train,cv=5,scoring='roc_auc')
print(cv_scores_test)
cv_scores_knn_test= cv_scores_test.mean()
cv_scores_knn_train= cv_scores_train.mean()
cv_scores_std_knn= cv_scores_test.std()
print ('Mean cross validation test score: ' +str(cv_scores_knn_test))
print ('Mean cross validation train score: ' +str(cv_scores_knn_train))
print ('Standard deviation in cv scores: ' +str(cv_scores_std_knn))

[0.6206516  0.59312196 0.62883814 0.62840049 0.64009176]
Mean cross validation test score: 0.6222207906356931
Mean cross validation train score: 0.6548701478214691
Standard deviation in cv scores: 0.015814889351761386
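The KNN model above uses a fixed `n_neighbors=50`. A possible way to check that choice, sketched here with `validation_curve` (imported at the top of the notebook) on synthetic data, is to compare cross-validated ROC AUC across several neighbor counts. The candidate values in `k_range` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for x_train / y_train (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)

k_range = [5, 25, 50, 100]  # hypothetical candidate values

# Each row of the returned arrays holds the 5 fold scores for one k
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(p=2, weights="distance"),
    X_demo, y_demo,
    param_name="n_neighbors",
    param_range=k_range,
    cv=5,
    scoring="roc_auc",
)

mean_val = val_scores.mean(axis=1)
best_k = k_range[int(np.argmax(mean_val))]
print(dict(zip(k_range, mean_val.round(3))), best_k)
```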


## III. Support Vector Machine (SVM):

In [9]:
from sklearn.svm import SVC

# svm = SVC(kernel='linear')
# svm.fit(x_train, y_train)

# # Predict using model:

# y_predict_svm=svm.predict(x_test)

# #Confusion matrix:

# cnf_matrix = confusion_matrix(y_test, y_predict_svm)
# print(cnf_matrix)

# Accuracy_svm=svm.score(x_test,y_test)
# print(Accuracy_svm)

In [10]:
# cv_scores_test= cross_val_score(svm,x_test,y_test,cv=5,scoring='roc_auc')
# cv_scores_train= cross_val_score(svm,x_train,y_train,cv=5,scoring='roc_auc')
# print(cv_scores_test)
# cv_scores_svm_test= cv_scores_test.mean()
# cv_scores_svm_train= cv_scores_train.mean()
# cv_scores_std_svm= cv_scores_test.std()
# print ('Mean cross validation test score: ' +str(cv_scores_svm_test))
# print ('Mean cross validation train score: ' +str(cv_scores_svm_train))
# print ('Standard deviation in cv scores: ' +str(cv_scores_std_svm))

The SVM cells took too long to run on my computer, so they are commented out above. Here are the confusion matrix and Accuracy_svm results that were obtained. The computer could not handle cross-validation:

[[ 962 2303]
 [ 435 4422]]
0.6628909135680867
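If runtime is the obstacle, one option worth trying is `LinearSVC`, which generally scales better to many samples and features than `SVC(kernel='linear')`. This is a sketch on synthetic data, not a drop-in replacement: `LinearSVC` uses a different solver and a squared hinge loss by default, so its results may differ from the numbers reported here. The `C` and `max_iter` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real train/test split (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Linear SVM; max_iter is raised so the solver has room to converge
svm_linear = LinearSVC(C=1.0, max_iter=5000)
svm_linear.fit(X_tr, y_tr)

accuracy_svm_linear = svm_linear.score(X_te, y_te)
print(accuracy_svm_linear)
```

Because `LinearSVC` exposes `decision_function`, it can also be passed to `cross_val_score` with `scoring='roc_auc'`.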

## IV. Random Forest:

In [11]:
from sklearn.ensemble import RandomForestClassifier

#Apply RF to the training data:

rf = RandomForestClassifier(bootstrap=True,n_estimators=100,criterion='entropy')
rf.fit(x_train, y_train)

#Predict using the model:

y_predict_rf = rf.predict(x_test)

#Confusion matrix:

cnf_matrix = confusion_matrix(y_test, y_predict_rf)
print(cnf_matrix)
Accuracy_rf=rf.score(x_test,y_test)
print(Accuracy_rf)

[[1601 1664]
 [1189 3668]]
0.6487318394484117


In [12]:
cv_scores_test= cross_val_score(rf,x_test,y_test,cv=5,scoring='roc_auc')
cv_scores_train= cross_val_score(rf,x_train,y_train,cv=5,scoring='roc_auc')
print(cv_scores_test)
cv_scores_rf_test= cv_scores_test.mean()
cv_scores_rf_train= cv_scores_train.mean()
cv_scores_std_rf= cv_scores_test.std()
print ('Mean cross validation test score: ' +str(cv_scores_rf_test))
print ('Mean cross validation train score: ' +str(cv_scores_rf_train))
print ('Standard deviation in cv scores: ' +str(cv_scores_std_rf))

[0.6585528  0.63138947 0.65685429 0.67705417 0.66519494]
Mean cross validation test score: 0.6578091349583475
Mean cross validation train score: 0.6772081953820478
Standard deviation in cv scores: 0.014995899380975956
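Since the goal of this notebook is to learn which features matter most, a fitted random forest can be inspected through its `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical feature names (`tag_0`, `tag_1`, ...) as stand-ins for the Steam tag columns. Impurity-based importances are only a rough guide and can be misleading when features are correlated.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for x_train / y_train (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=0
)
feature_names = [f"tag_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

rf_demo = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort so the strongest features come first
importances = pd.Series(rf_demo.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances.head(5))
```

On the real data, the index would be `x_train.columns` instead of the hypothetical names.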
