## Predictive Modeling on Starcraft Data
This notebook applies a few classification methods to predict LeagueIndex. Because LeagueIndex takes values from 1 to 8, some methods do not fit the problem as stated. For example, standard logistic regression is a binary classifier, so it would not apply here without a multiclass extension. Additionally, the LeagueIndex 8 players have missing data, so the data must be modified before some methods will run at all. After researching potential methods, I settled on two for this assessment: one "simple" and one more complex. I chose Naive Bayes classification and XGBoost, as they seemed the most appealing while also being quite different in methodology.

### What to do with missing data
For the missing data points, I try to fill in the gaps so that the model is as accurate as possible. Age appears to stay roughly the same across all LeagueIndex values, so I simply fill the missing values with the median age. HoursPerWeek does not seem to have a clear-cut impact until the higher LeagueIndex values, so predicting a rank accurately for LeagueIndex 1-5 based on HoursPerWeek sounds difficult. Thus, I have chosen to remove HoursPerWeek from the prediction model. That leaves one last column with missing values to fill: TotalHours. It makes logical sense that someone who has played more hours in total will be better. However, TotalHours is heavily skewed, with a few values far larger than every other data point, so a single summary statistic is a poor fill value. The code below therefore uses KNN imputation (KNNImputer) for the values that are still missing.
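To illustrate why the choice of fill statistic matters for a skewed column, here is a minimal sketch on made-up numbers. The values and variable names are assumptions for illustration and are not taken from the dataset. It compares the mean and the median as fill values when one entry dwarfs the rest.

```python
import numpy as np
import pandas as pd

# Hypothetical TotalHours-like column: one extreme value and two gaps
hours = pd.Series([200.0, 400.0, 500.0, 800.0, np.nan, 1000.0, np.nan, 1_000_000.0])

mean_value = hours.mean()      # pulled far upward by the single extreme value
median_value = hours.median()  # stays close to the bulk of the data

mean_filled = hours.fillna(mean_value)
median_filled = hours.fillna(median_value)

print("mean fill value:  ", mean_value)
print("median fill value:", median_value)
print("skewness:         ", hours.skew())
```

With a column like this, a mean fill would place the imputed players far above almost everyone else, which is the kind of distortion a median or neighbor-based fill avoids.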

In [137]:
import numpy as np 
import pandas as pd
import load_data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import random

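# Seeds only Python's built-in RNG; NumPy and scikit-learn take their own seed or random_state
random.seed(12345)
# Load the data from the local machine
spd = load_data.load_everything()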
spd = spd.apply(pd.to_numeric, downcast = 'float')

#Drop unnecessary GameID, but save it just in case 
spdID = spd['GameID']
spd = spd.drop('GameID', axis = 1)

#fill age with median age 
spd['Age'] = spd['Age'].fillna(spd['Age'].median())

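from sklearn.impute import KNNImputer

# Fill the remaining missing values from each row's nearest neighbors (default n_neighbors=5)
impute = KNNImputer()
x = impute.fit_transform(spd)
spd_impute = pd.DataFrame(x, columns = spd.columns)

# Preview the frame before KNN imputation (Age is already median-filled)
spd.head()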

| | LeagueIndex | Age | HoursPerWeek | TotalHours | APM | SelectByHotkeys | AssignToHotkeys | UniqueHotkeys | MinimapAttacks | MinimapRightClicks | NumberOfPACs | GapBetweenPACs | ActionLatency | ActionsInPAC | TotalMapExplored | WorkersMade | UniqueUnitsMade | ComplexUnitsMade | ComplexAbilitiesUsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 27.0 | 10.0 | 3000.0 | 143.718002 | 0.003515 | 0.00022 | 7.0 | 0.00011 | 0.000392 | 0.004849 | 32.667702 | 40.867298 | 4.7508 | 28.0 | 0.001397 | 6.0 | 0.0 | 0.0 |
| 1 | 5.0 | 23.0 | 10.0 | 5000.0 | 129.232193 | 0.003304 | 0.000259 | 4.0 | 0.000294 | 0.000432 | 0.004307 | 32.919399 | 42.345402 | 4.8434 | 22.0 | 0.001194 | 5.0 | 0.0 | 0.000208 |
| 2 | 4.0 | 30.0 | 10.0 | 200.0 | 69.961197 | 0.001101 | 0.000336 | 4.0 | 0.000294 | 0.000461 | 0.002926 | 44.647499 | 75.354797 | 4.043 | 22.0 | 0.000745 | 6.0 | 0.0 | 0.000189 |
| 3 | 3.0 | 19.0 | 20.0 | 400.0 | 107.601601 | 0.001034 | 0.000213 | 1.0 | 5.3e-05 | 0.000543 | 0.003783 | 29.220301 | 53.735199 | 4.9155 | 19.0 | 0.000426 | 7.0 | 0.0 | 0.000384 |
| 4 | 3.0 | 32.0 | 10.0 | 500.0 | 122.8908 | 0.001136 | 0.000327 | 2.0 | 0.0 | 0.001329 | 0.002368 | 22.688499 | 62.081299 | 9.374 | 15.0 | 0.001174 | 4.0 | 0.0 | 1.9e-05 |
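As a rough picture of what KNNImputer does, the sketch below runs it on a tiny made-up matrix. The numbers and the choice of n_neighbors=2 are assumptions for illustration; the notebook itself uses the default settings. Each gap is filled with the average of that column over the nearest rows, where distance is measured on the columns both rows have in common.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with gaps (illustrative values only)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry becomes the mean of that column over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Note that the imputer in the notebook is fitted on every column of spd, so LeagueIndex itself takes part in the neighbor distances. Whether that is desirable depends on how the model is meant to be used.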


## Modeling with Imputed Columns
### Random Forest Classification

In [138]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_absolute_error


X = spd_impute.drop(['LeagueIndex'], axis = 1).values
y = spd_impute['LeagueIndex']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_jobs = -1)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("Accuracy: %.2f" % metrics.accuracy_score(y_test, y_pred))
print("MAE: %2f" % (mean_absolute_error(y_test, y_pred)))

Accuracy: 0.41
MAE: 0.723258
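Because LeagueIndex is ordered, accuracy and MAE answer different questions: accuracy counts only exact matches, while MAE measures how many leagues off a prediction is on average. The sketch below uses made-up labels to show both, plus a "within one league" rate. That third metric is an assumption added here for illustration and is not computed anywhere in the notebook.

```python
import numpy as np

# Made-up true and predicted leagues, for illustration only
y_true = np.array([1, 2, 3, 4, 5, 6, 7, 4, 5, 3])
y_hat = np.array([2, 2, 4, 4, 4, 6, 5, 5, 5, 1])

errors = np.abs(y_true - y_hat)

accuracy = np.mean(errors == 0)    # share of exact matches
mae = errors.mean()                # average distance in leagues
within_one = np.mean(errors <= 1)  # share of predictions at most one league off

print("accuracy:  ", accuracy)
print("MAE:       ", mae)
print("within one:", within_one)
```

Read this way, an MAE below 1 alongside a modest accuracy suggests that many of the misses land in a neighboring league rather than far away.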


### XGBoost Classification

In [139]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder 
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

#classification with every feature
X = spd_impute.drop(['LeagueIndex'], axis = 1).values
y = spd_impute['LeagueIndex']
# print(X)
# print(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
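# LabelEncoder maps the raw league labels to 0-based codes for XGBoost
lc = LabelEncoder()
y_train = lc.fit_transform(y_train)
model = XGBClassifier()
model.fit(X_train, y_train)
# predict returns encoded labels, while y_test below still holds the raw labels;
# lc.inverse_transform(y_pred) would put both in the same label space
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]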
accuracy = accuracy_score(y_test, predictions) 
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("MAE: %2f" % (mean_absolute_error(y_test,predictions)))

Accuracy: 23.56%
MAE: 1.154639
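One thing worth checking in the XGBoost cell: the model is trained on labels produced by LabelEncoder, so model.predict returns encoded classes (0 to 7 when all eight leagues appear in the training split), while y_test still holds the raw values 1 to 8. The sketch below, on made-up labels, shows what that offset alone does to the scores. LabelEncoder assigns codes in sorted order, and np.unique with return_inverse=True reproduces that mapping so the example needs only NumPy. The "perfect classifier" is hypothetical.

```python
import numpy as np

# Made-up raw test labels covering all eight leagues
y_test = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 4.0, 5.0])

# Sorted unique labels get codes 0..7, the same mapping LabelEncoder would learn
classes, encoded = np.unique(y_test, return_inverse=True)

# Hypothetical perfect classifier: it returns the correct class, but in encoded form
y_pred_encoded = encoded.copy()

# Scoring encoded predictions against raw labels
mismatched_accuracy = np.mean(y_pred_encoded == y_test)
mismatched_mae = np.mean(np.abs(y_pred_encoded - y_test))

# Decoding first puts both sides in the same label space
y_pred_decoded = classes[y_pred_encoded]
matched_accuracy = np.mean(y_pred_decoded == y_test)

print("mismatched accuracy:", mismatched_accuracy)
print("mismatched MAE:     ", mismatched_mae)
print("matched accuracy:   ", matched_accuracy)
```

In the notebook's terms, this would correspond to scoring lc.inverse_transform(y_pred) against y_test, or encoding y_test with lc.transform before scoring. The XGBoost figures reported here may therefore understate how well the model actually does.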


Given the poor performance with imputed columns, the question is whether other approaches can improve it. One option is a log transformation (called LogLog here), applied with np.log1p to reduce the heavy skew in many of the variables (as seen in the exploratory data analysis). Additionally, I want to see whether removing the rows with missing data would work better than imputing them.

## Modeling with LogLog Transformation

In [140]:
# Find features whose absolute skewness exceeds 0.75
skew = spd_impute.skew(skipna = True).sort_values(ascending = False)
skew = skew[abs(skew) > 0.75]
#print(skew)
skew_feat = skew.index
# Apply log1p, i.e. log(1 + x), to each skewed feature in place
for feat in skew_feat:
    spd_impute[feat] = np.log1p(spd_impute[feat])
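A small sketch of what this transformation does to a skewed column, using synthetic lognormal draws rather than the real data (the distribution, seed, and names are assumptions). np.log1p computes log(1 + x), which stays defined at zero, and np.expm1 reverses it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature, not taken from the dataset
feature = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=2000))

# log(1 + x) compresses the long right tail
transformed = np.log1p(feature)

# expm1 is the inverse, so the original scale can be recovered
restored = np.expm1(transformed)

print("skew before:", feature.skew())
print("skew after: ", transformed.skew())
```

The 0.75 cutoff used in the notebook decides which columns get this treatment; columns below it are left on their original scale.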


### Random Forest with LogLog Transformation

In [141]:
X = spd_impute.drop(['LeagueIndex'], axis = 1).values
y = spd_impute['LeagueIndex']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_jobs = -1)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("MAE: %2f" % (mean_absolute_error(y_test, y_pred)))

Accuracy: 0.43473994111874387
MAE: 0.704612


### XGBoost with LogLog Transformation

In [142]:
X = spd_impute.drop(['LeagueIndex'], axis = 1).values
y = spd_impute['LeagueIndex']
# print(X)
# print(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
lc = LabelEncoder()
y_train = lc.fit_transform(y_train)
model = XGBClassifier()
model.fit(X_train, y_train)
# As in the earlier XGBoost cell, y_pred is in the encoded label space and y_test in the raw one
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions) 
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("MSE: %2f" % mean_squared_error(y_test,predictions))
print("MAE: %2f" % (mean_absolute_error(y_test,predictions)))

Accuracy: 23.56%
MSE: 2.114875
MAE: 1.154639
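The XGBoost figures match the earlier run exactly. One plausible reason is that tree-based models split on thresholds, so a strictly increasing transformation such as log1p keeps the ordering of every feature and therefore the set of possible partitions. The sketch below checks this with a hand-written single-split search on synthetic data. The functions gini and best_split_mask are hypothetical helpers written for this example, not library calls.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_mask(x, y):
    """Boolean mask of the samples sent left by the lowest-impurity threshold on x."""
    order = np.argsort(x, kind="stable")
    best_score, best_mask = np.inf, None
    for i in range(1, len(x)):
        if x[order[i]] == x[order[i - 1]]:
            continue  # no threshold fits between equal values
        left = np.zeros(len(x), dtype=bool)
        left[order[:i]] = True
        # Size-weighted impurity of the two sides
        score = left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])
        if score < best_score:
            best_score, best_mask = score, left
    return best_mask

rng = np.random.default_rng(1)
x = rng.lognormal(mean=5.0, sigma=1.5, size=200)  # synthetic skewed feature
y = (x > np.median(x)).astype(int)                # synthetic binary target

mask_raw = best_split_mask(x, y)
mask_log = best_split_mask(np.log1p(x), y)

print("same partition:", np.array_equal(mask_raw, mask_log))
```

This is only a single split on training data. Thresholds can still land slightly differently for unseen points, so identical scores are likely but not guaranteed in general.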


Still, the models are performing no better, or only marginally better, than before. I want to try one more thing: the LogLog transformation with the rows containing missing values removed rather than imputed.

## Modeling with Missing Rows Removed and LogLog Transformation

In [154]:
spd = load_data.load_everything()
spd = spd.dropna(axis = 0)
spd = spd.drop('GameID', axis = 1)
spd = spd.apply(pd.to_numeric, downcast = 'float')

#print(spd)

# Measure skewness on the features only, leaving LeagueIndex out
temp = spd.drop('LeagueIndex', axis = 1)
skew = temp.skew(skipna = True).sort_values(ascending = False)
skew = skew[abs(skew) > 0.75]
skew_feat = skew.index
# Apply log1p, i.e. log(1 + x), to each skewed feature
for feat in skew_feat:
    spd[feat] = np.log1p(spd[feat])
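Dropping rows also changes which leagues remain in the data. The introduction notes that the missing values sit with the LeagueIndex 8 players, so dropna would be expected to remove that class. A quick check along these lines, on a toy frame with made-up values (the frame and variable names are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame in which only the league 8 rows have missing values
toy = pd.DataFrame({
    "LeagueIndex": [5, 5, 4, 3, 8, 8],
    "Age": [27.0, 23.0, 30.0, 19.0, np.nan, np.nan],
    "TotalHours": [3000.0, 5000.0, 200.0, 400.0, np.nan, np.nan],
})

before = toy["LeagueIndex"].value_counts().sort_index()
after = toy.dropna(axis=0)["LeagueIndex"].value_counts().sort_index()

# Leagues present before dropping rows but absent afterwards
lost = sorted(int(c) for c in set(before.index) - set(after.index))

print("rows kept:", len(toy.dropna(axis=0)), "of", len(toy))
print("leagues lost:", lost)
```

If a league disappears this way, the model trained on the remaining rows can never predict it, which matters when comparing this run against the imputed ones.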

In [155]:
X = spd.drop(['LeagueIndex'], axis = 1).values
y = spd['LeagueIndex']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_jobs = -1)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("MAE: %2f" % (mean_absolute_error(y_test, y_pred)))

Accuracy: 0.3812375249500998
MAE: 0.797405
