## Predictive Modeling on Starcraft Data
This notebook applies classification methods to predict LeagueIndex. Because LeagueIndex takes values from 1 to 8, this is a multiclass problem, and some methods don't fit it well. For example, logistic regression in its basic form is a binary classifier, so it would need a multinomial or one-vs-rest extension to be applicable here. Additionally, the LeagueIndex 8 players have missing data, so the data must be modified or some methods won't work at all. After researching potential classification methods, I chose Random Forest and XGBoost. These are methods that I am familiar with but have not really explored, which is why they were chosen for this assignment. Other methods such as KNN or Naive Bayes could also work, but were not explored.

### What to do with missing data
To make the model as accurate as possible, I attempt to fill in the missing data points. Age seems to stay relatively constant across all LeagueIndex values, so I simply fill its missing values with the median age. However, the other two features with missing values, HoursPerWeek and TotalHours, don't seem to follow a linear or otherwise predictable pattern. Therefore, I use KNNImputer, which fills each missing value from its K nearest neighbors. This is discussed further in the conclusions, because the TotalHours values filled in for LeagueIndex 8 are far higher than those of all the other ranks.
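As a minimal sketch of how KNNImputer fills a gap, the toy array below (made-up values, not the StarCraft data) has one missing entry. With `n_neighbors=2`, the missing value is replaced by the mean of that column over the two rows that are closest on the observed column. It also hints at why an imputed value can look extreme: it is driven entirely by whichever rows happen to be nearest.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical values): column 0 is fully observed,
# column 1 has one missing entry in the third row.
toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [100.0, 1000.0],
])

# Use the 2 nearest rows, with distance measured on the observed columns.
toy_imputer = KNNImputer(n_neighbors=2)
toy_filled = toy_imputer.fit_transform(toy)

# The two rows nearest to [3, nan] are [2, 20] and [1, 10],
# so the gap is filled with the mean of 20 and 10.
print(toy_filled[2, 1])
```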

In [166]:
import numpy as np 
import pandas as pd
import load_data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import random

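# Seed Python's and NumPy's global RNGs; scikit-learn estimators created
# without random_state draw from NumPy's global RNG, not Python's
random.seed(12345)
np.random.seed(12345)
# Load the data from the local machine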
spd = load_data.load_everything()
spd = spd.apply(pd.to_numeric, downcast = 'float')

#Drop unnecessary GameID, but save it just in case 
spdID = spd['GameID']
spd = spd.drop('GameID', axis = 1)

#fill age with median age 
spd['Age'] = spd['Age'].fillna(spd['Age'].median())

from sklearn.impute import KNNImputer

impute = KNNImputer()
x = impute.fit_transform(spd)
spd_impute = pd.DataFrame(x, columns = spd.columns)
spd_impute

Index,LeagueIndex,Age,HoursPerWeek,TotalHours,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
0,5.0,27.0,10.000000,3000.0,143.718002,0.003515,0.000220,7.0,0.000110,0.000392,0.004849,32.667702,40.867298,4.7508,28.0,0.001397,6.0,0.000000,0.000000
1,5.0,23.0,10.000000,5000.0,129.232193,0.003304,0.000259,4.0,0.000294,0.000432,0.004307,32.919399,42.345402,4.8434,22.0,0.001194,5.0,0.000000,0.000208
2,4.0,30.0,10.000000,200.0,69.961197,0.001101,0.000336,4.0,0.000294,0.000461,0.002926,44.647499,75.354797,4.0430,22.0,0.000745,6.0,0.000000,0.000189
3,3.0,19.0,20.000000,400.0,107.601601,0.001034,0.000213,1.0,0.000053,0.000543,0.003783,29.220301,53.735199,4.9155,19.0,0.000426,7.0,0.000000,0.000384
4,3.0,32.0,10.000000,500.0,122.890800,0.001136,0.000327,2.0,0.000000,0.001329,0.002368,22.688499,62.081299,9.3740,15.0,0.001174,4.0,0.000000,0.000019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3390,8.0,21.0,24.799999,201160.0,259.629608,0.020425,0.000743,9.0,0.000621,0.000146,0.004555,18.605900,42.834202,6.2754,46.0,0.000877,5.0,0.000000,0.000000
3391,8.0,21.0,25.200001,201072.0,314.670013,0.028043,0.001157,10.0,0.000246,0.001083,0.004259,14.302300,36.115601,7.1965,16.0,0.000788,4.0,0.000000,0.000000
3392,8.0,21.0,29.200001,201072.0,299.428192,0.028341,0.000860,7.0,0.000338,0.000169,0.004439,12.402800,39.515598,6.3979,19.0,0.001260,4.0,0.000000,0.000000
3393,8.0,21.0,26.400000,201280.0,375.866394,0.036436,0.000594,5.0,0.000204,0.000780,0.004346,11.691000,34.854698,7.9615,15.0,0.000613,6.0,0.000000,0.000631


## Modeling with Imputed Columns
### Random Forest Classification

In [174]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_absolute_error


X = spd_impute.drop(['LeagueIndex'], axis = 1).values
y = spd_impute['LeagueIndex']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_jobs = -1)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("Accuracy: %.2f%%" % (metrics.accuracy_score(y_test, y_pred)*100))
print("MAE: %2f" % (mean_absolute_error(y_test, y_pred)))

Accuracy: 41.90%
MAE: 0.718351


### XGBoost Classification

In [175]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder 
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

#classification with every feature
X = spd_impute.drop(['LeagueIndex'], axis = 1).values
y = spd_impute['LeagueIndex']
# print(X)
# print(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
# XGBClassifier expects class labels 0..K-1, so encode LeagueIndex first
lc = LabelEncoder()
y_train = lc.fit_transform(y_train)
# Reuse the mapping learned from the training labels instead of refitting
y_test = lc.transform(y_test)
model = XGBClassifier()
model.fit(X_train, y_train)
# predict() already returns class labels, so no rounding is needed
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("MAE: %2f" % (mean_absolute_error(y_test, y_pred)))

Accuracy: 36.52%
MAE: 0.826215
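
XGBClassifier expects class labels coded as 0 to K-1, which is why LeagueIndex is passed through LabelEncoder. One thing to watch, sketched below with made-up labels, is that the encoder learns its mapping from whatever it is fit on: fitting it again on a test split that happens to lack a class yields different codes than the mapping learned from the training labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label splits: the test split happens to contain no league 1.
train_labels = [1, 2, 3, 8]
test_labels = [2, 3, 8]

encoder = LabelEncoder()
encoder.fit(train_labels)

# Reusing the training mapping keeps codes consistent: 2 -> 1, 3 -> 2, 8 -> 3.
consistent = encoder.transform(test_labels)

# Refitting on the test labels builds a new mapping: 2 -> 0, 3 -> 1, 8 -> 2.
refit = LabelEncoder().fit_transform(test_labels)

print(list(consistent), list(refit))
```

When every class appears in both splits the two approaches agree, but `transform` on the already-fitted encoder is the safer habit.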


Given the poor performance with the imputed columns, the question is whether other approaches can improve it. One option is to apply a log transformation (np.log1p, referred to here as LogLog) to the skewed features, since many of the variables are heavily skewed (as seen in the exploratory data analysis). Additionally, I want to see whether removing the rows with missing data works better than imputing them.
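
To illustrate the idea, the sketch below applies `np.log1p` to a synthetic right-skewed sample (log-normal draws standing in for a skewed feature; it is not the StarCraft data) and compares the skew before and after.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature (an assumption for illustration only).
rng = np.random.default_rng(0)
feature = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# Series.skew() is the same measure used to pick features in the notebook.
skew_before = feature.skew()
skew_after = np.log1p(feature).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```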

## Modeling with LogLog Transformation

In [169]:
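# Measure skew on the features only, so the LeagueIndex target is never transformed
skew = spd_impute.drop('LeagueIndex', axis = 1).skew(skipna = True).sort_values(ascending = False)
# Keep the features whose absolute skew exceeds 0.75
skew = skew[abs(skew) > 0.75]
#print(skew)
skew_feat = skew.index
# log1p computes log(1 + x), so zero values are handled without error
for feat in skew_feat:
    spd_impute[feat] = np.log1p(spd_impute[feat])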


### Random Forest with LogLog Transformation

In [183]:

X = spd_impute.drop(['LeagueIndex'], axis = 1).values
y = spd_impute['LeagueIndex']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_jobs = -1)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("Accuracy: %.2f%%" % (metrics.accuracy_score(y_test, y_pred)*100))
print("MAE: %2f" % (mean_absolute_error(y_test, y_pred)))

pd.crosstab(y_pred,y_test, rownames = ['Predicted'], colnames = ['Actual'])

Accuracy: 41.61%
MAE: 0.716389


Predicted \ Actual,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0
1.0,19,12,5,0,0,0,0,0
2.0,19,22,21,17,2,0,0,0
3.0,12,44,37,41,20,0,0,0
4.0,5,17,74,122,62,15,0,0
5.0,0,5,14,61,101,57,0,0
6.0,0,0,1,9,71,109,11,0
8.0,0,0,0,0,0,0,0,14
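
The accuracy and MAE above can be recovered directly from this table, along with a within-one-league accuracy that shows how many misses land in an adjacent league. The sketch below re-enters the counts from the table by hand; the within-one metric is an addition for illustration, not something computed in the original notebook.

```python
import numpy as np

# Counts copied from the crosstab: rows are predicted leagues, columns are
# actual leagues 1..8. League 7 was never predicted, so it has no row.
predicted_leagues = np.array([1, 2, 3, 4, 5, 6, 8])
actual_leagues = np.arange(1, 9)
counts = np.array([
    [19, 12,  5,   0,   0,   0,  0,  0],
    [19, 22, 21,  17,   2,   0,  0,  0],
    [12, 44, 37,  41,  20,   0,  0,  0],
    [ 5, 17, 74, 122,  62,  15,  0,  0],
    [ 0,  5, 14,  61, 101,  57,  0,  0],
    [ 0,  0,  1,   9,  71, 109, 11,  0],
    [ 0,  0,  0,   0,   0,   0,  0, 14],
])

# Absolute league difference for every (predicted, actual) cell.
diff = np.abs(predicted_leagues[:, None] - actual_leagues[None, :])

total = counts.sum()
accuracy = counts[diff == 0].sum() / total
within_one = counts[diff <= 1].sum() / total
mae = (counts * diff).sum() / total

print(f"accuracy: {accuracy:.2%}, within one league: {within_one:.2%}, MAE: {mae:.6f}")
```

Run on these counts, the sketch reproduces the reported accuracy and MAE, and roughly 88% of predictions fall within one league of the true value, so most of the errors are near misses.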


### XGBoost with LogLog Transformation

In [176]:
X = spd_impute.drop(['LeagueIndex'], axis = 1).values
y = spd_impute['LeagueIndex']
# print(X)
# print(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
# XGBClassifier expects class labels 0..K-1, so encode LeagueIndex first
lc = LabelEncoder()
y_train = lc.fit_transform(y_train)
# Reuse the mapping learned from the training labels instead of refitting
y_test = lc.transform(y_test)
model = XGBClassifier()
model.fit(X_train, y_train)
# predict() already returns class labels, so no rounding is needed
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("MAE: %2f" % (mean_absolute_error(y_test, y_pred)))

Accuracy: 36.52%
MAE: 0.826215


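Still, the models are not performing any better than before. This is not surprising for tree-based models: they split on feature thresholds, so a monotonic transformation such as log1p preserves the ordering of the values and leaves the splits essentially unchanged. The XGBoost scores are identical, and the small change in the Random Forest score is likely run-to-run randomness, since the forest is created without a fixed random_state. I want to try one more thing: the same transformation with the rows containing missing data removed instead of imputed.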
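
One way to see why a log transformation may not help tree-based models is that trees split on thresholds, so only the ordering of each feature's values matters. The sketch below uses synthetic integer-valued features and a single seeded decision tree (both assumptions for illustration, standing in for the forests and boosted trees above) and checks that predictions on the training data are the same before and after `np.log1p`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and a hypothetical 3-class target driven by the
# first feature plus noise; none of this is the StarCraft data.
rng = np.random.default_rng(0)
X_raw = rng.integers(0, 1000, size=(300, 4)).astype(float)
y_toy = np.digitize(X_raw[:, 0] + rng.normal(0, 50, size=300), bins=[333, 666])

# log1p is strictly increasing, so it keeps the ordering of every feature.
X_log = np.log1p(X_raw)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y_toy)
tree_log = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_log, y_toy)

# Both trees partition the training rows identically, so they agree.
same = np.array_equal(tree_raw.predict(X_raw), tree_log.predict(X_log))
print("identical training predictions:", same)
```

On held-out data the two trees can still differ slightly, because split thresholds sit at midpoints that do not map exactly through the logarithm.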

## Modeling with Missing Rows Removed and LogLog Transformation

In [172]:
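# Reload the raw data and drop every row that has a missing value
spd = load_data.load_everything()
spd = spd.dropna(axis = 0)
spd = spd.drop('GameID', axis = 1)
spd = spd.apply(pd.to_numeric, downcast = 'float')

print(spd)

# Measure skew on the features only, leaving the LeagueIndex target untouched
temp = spd.drop('LeagueIndex', axis = 1)
skew = temp.skew(skipna = True).sort_values(ascending = False)
skew = skew[abs(skew) > 0.75]
skew_feat = skew.index
# Apply log1p to every feature whose absolute skew exceeds 0.75
for feat in skew_feat:
    spd[feat] = np.log1p(spd[feat])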

      LeagueIndex   Age  HoursPerWeek  TotalHours         APM   
0             5.0  27.0          10.0      3000.0  143.718002  \
1             5.0  23.0          10.0      5000.0  129.232193   
2             4.0  30.0          10.0       200.0   69.961197   
3             3.0  19.0          20.0       400.0  107.601601   
4             3.0  32.0          10.0       500.0  122.890800   
...           ...   ...           ...         ...         ...   
3335          4.0  20.0           8.0       400.0  158.139008   
3336          5.0  16.0          56.0      1500.0  186.132004   
3337          4.0  21.0           8.0       100.0  121.699203   
3338          3.0  20.0          28.0       400.0  134.284805   
3339          4.0  22.0           6.0       400.0   88.824600   

      SelectByHotkeys  AssignToHotkeys  UniqueHotkeys  MinimapAttacks   
0            0.003515         0.000220            7.0        0.000110  \
1            0.003304         0.000259            4.0        0.000294   


In [173]:
X = spd.drop(['LeagueIndex'], axis = 1).values
y = spd['LeagueIndex']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_jobs = -1)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("Accuracy: %.2f%%" % (metrics.accuracy_score(y_test, y_pred)*100))
print("MAE: %2f" % (mean_absolute_error(y_test, y_pred)))

Accuracy: 40.62%
MAE: 0.756487
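
The three Random Forest scores (41.90%, 41.61%, 40.62%) each come from a single train/test split, so differences of about a percentage point may be within split-to-split noise. One hedged way to check this is k-fold cross-validation. The sketch below runs on synthetic data from `make_classification`, and the fold count, class count and seeds are arbitrary choices for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic multiclass data standing in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

# The spread across folds gives a rough sense of how noisy one split is.
print("mean accuracy: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```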
