## Predictive Modeling on Starcraft Data
This notebook aims to perform specific classification methods to predict LeagueIndex. As LeagueIndex contains values from 1 to 8, certain methods don't really make sense in the context of this problem statement. For example, logistic regression is a binary classifier, and this wouldn't be applicable for the problem statement. Additionally, the presence of missing data for the LeagueIndex 8 players means that there must also be some modifications to the data or else some methods won't work at all. Upon researching potential methods, I settled on two for the sake of this assessment: one "simple" and one more complex. I chose a Naive Bayes classification as well as XGBoost, as those were the ones that seemed most appealing while also being quite different in methodology. 

### What to do with missing data
With missing data points, I attempt to fill in the gaps in order to make the most accurate model. For Age, it seems as though the ages stay relatively the same throughout all LeagueIndices, so I will simply take the median age to fill in the missing values. When looking at HoursPerWeek, it doesn't seem to have a clearcut impact until the higher LeagueIndices, so predicting a rank accurately from LeagueIndex 1-5 based on HoursPerWeek sounds difficult. Thus, I have chosen to remove HoursPerWeek from the prediction model. However, we must fill the last column with missing values: TotalHours. It makes sense from a logical standpoint that someone who spends more time playing total will be better. Looking at the median across the ranks, it seems as though it is increasing somewhat predicatably. 

In [104]:
import numpy as np 
import pandas as pd
import load_data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import random

random.seed(12345)
#get data from local machine
spd = load_data.load_everything()
spd = spd.apply(pd.to_numeric, downcast = 'float')
spd = spd.drop('GameID', axis = 1)

#fill age with median age 
spd['Age'] = spd['Age'].fillna(spd['Age'].median())

#removing HoursPerWeek from the dataframe
spd = spd.drop('HoursPerWeek', axis = 1)

#fill TotalHours with 
# temp = spd.copy()
# temp = temp[['LeagueIndex', 'APM', 'SelectByHotkeys', "TotalHours"]]

# test = spd[spd['TotalHours'].isnull()]
# spd.dropna(inplace=True)
# x_train = spd.drop('TotalHours', axis = 1)
# y_train = spd['TotalHours']
# lr = LinearRegression()
# lr.fit(x_train,y_train)
# x_test = test.drop('TotalHours',axis = 1)
# y_pred = lr.predict(x_test)
# test['y_pred'] = y_pred
# test


# X = spd.drop('TotalHours', axis = 1) 
# Y = spd['TotalHours']
# X = X.values 
# Y = Y.values
# imputer = KNNImputer()
# imputer.fit(X)
# Xtrans = imputer.transform(X)
# print('Missing: %d' % sum(np.isnan(X).flatten()))

#spd = spd.to_numpy()


### Naive Bayes Multinomial Classification

In [106]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = spd.drop(['LeagueIndex', 'TotalHours', 'Age'], axis = 1) 
#print(X)
y =  spd['LeagueIndex']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
classifer = GaussianNB()
classifer.fit(X_train, y_train)
y_pred = classifer.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.37978410206084395


### Predicting LeagueIndex with KNN Impute

In [61]:
from sklearn.impute import KNNImputer


X = spd.drop('LeagueIndex', axis = 1) 
Y = spd['LeagueIndex']
Xv = X.values 
Yv = Y.values

imputer = KNNImputer()
imputer.fit(Xv)
Xtrans = imputer.transform(Xv)

asdf = pd.DataFrame(Xtrans, columns = X.columns)
asdf

Unnamed: 0,Age,TotalHours,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
0,27.0,3000.0,143.718002,0.003515,0.000220,7.0,0.000110,0.000392,0.004849,32.667702,40.867298,4.7508,28.0,0.001397,6.0,0.000000,0.000000
1,23.0,5000.0,129.232193,0.003304,0.000259,4.0,0.000294,0.000432,0.004307,32.919399,42.345402,4.8434,22.0,0.001194,5.0,0.000000,0.000208
2,30.0,200.0,69.961197,0.001101,0.000336,4.0,0.000294,0.000461,0.002926,44.647499,75.354797,4.0430,22.0,0.000745,6.0,0.000000,0.000189
3,19.0,400.0,107.601601,0.001034,0.000213,1.0,0.000053,0.000543,0.003783,29.220301,53.735199,4.9155,19.0,0.000426,7.0,0.000000,0.000384
4,32.0,500.0,122.890800,0.001136,0.000327,2.0,0.000000,0.001329,0.002368,22.688499,62.081299,9.3740,15.0,0.001174,4.0,0.000000,0.000019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3390,21.0,201160.0,259.629608,0.020425,0.000743,9.0,0.000621,0.000146,0.004555,18.605900,42.834202,6.2754,46.0,0.000877,5.0,0.000000,0.000000
3391,21.0,201072.0,314.670013,0.028043,0.001157,10.0,0.000246,0.001083,0.004259,14.302300,36.115601,7.1965,16.0,0.000788,4.0,0.000000,0.000000
3392,21.0,201072.0,299.428192,0.028341,0.000860,7.0,0.000338,0.000169,0.004439,12.402800,39.515598,6.3979,19.0,0.001260,4.0,0.000000,0.000000
3393,21.0,201280.0,375.866394,0.036436,0.000594,5.0,0.000204,0.000780,0.004346,11.691000,34.854698,7.9615,15.0,0.000613,6.0,0.000000,0.000631


In [99]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder 
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

Xcolumns = spd.drop('LeagueIndex', axis = 1).columns
X = spd.drop(['LeagueIndex', 'TotalHours'], axis = 1).values
y =  spd['LeagueIndex']
# print(X)
# print(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
lc = LabelEncoder() 
y_train = lc.fit_transform(y_train)
model = XGBClassifier() 
model.fit(X_train, y_train)
y_pred = model.predict(X_test) 
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions) 
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print((mean_squared_error(y_test,predictions)))
print((mean_absolute_error(y_test,predictions)))

Accuracy: 21.65%
2.3328424153166423
1.2223858615611194


In [100]:
X = spd[['ActionLatency','GapBetweenPACs','NumberOfPACs','AssignToHotkeys','SelectByHotkeys','APM','TotalHours']].values
#print(X)
y =  spd['LeagueIndex']
#print(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
lc = LabelEncoder() 
y_train = lc.fit_transform(y_train)
model = XGBClassifier() 
model.fit(X_train, y_train)
y_pred = model.predict(X_test) 
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions) 
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print((mean_squared_error(y_test,predictions)))
print((mean_absolute_error(y_test,predictions)))

Accuracy: 22.83%
2.312223858615611
1.2047128129602356
