## Machine Learning / Prediction

### Predicting using shallow-learning techniques
* Random Forest
* XGBoost
* SVR

### Predicting using SOTA Machine-Learning NNs (black box, though)
* To be fair to the shallow-learning methods above, the NNs are restricted to be shallow as well: each network has at most 3 hidden layers.

### Procedures
We will first introduce the RF, XGBoost, and SVR techniques, and time each method while we run it. This makes $time$ another factor to consider, rather than spending an indefinite amount of time to perform 2$\%$ better. That way, we get an index of $\dfrac{accuracy}{time}$ (accuracy gained per minute of training). In my opinion, it can serve as a rough index of how quickly we can prototype ideas.
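As a rough sketch of how such an index could be measured, the helper below times an arbitrary training call and divides the returned score by the elapsed minutes. The names `accuracy_per_minute` and `dummy_train` are hypothetical and only stand in for a real fit-and-evaluate step; the score is assumed to be "higher is better" (for example $1 - \text{MAE}$).

```python
import time


def accuracy_per_minute(train_fn):
    """Run train_fn() and return (accuracy, elapsed_seconds, accuracy / minutes).

    train_fn is a hypothetical zero-argument callable that trains a model and
    returns an accuracy-like score where higher is better (e.g. 1 - MAE).
    """
    start = time.perf_counter()
    accuracy = train_fn()
    elapsed = time.perf_counter() - start
    minutes = elapsed / 60.0
    return accuracy, elapsed, accuracy / minutes


def dummy_train():
    # Stand-in for a real fit/evaluate call: sleeps briefly and reports a fixed score.
    time.sleep(0.05)
    return 0.95


acc, seconds, index = accuracy_per_minute(dummy_train)
print(f"accuracy={acc:.3f}, time={seconds:.2f}s, index={index:.1f} per minute")
```

Because the index divides by wall-clock time, it is only comparable between runs on the same hardware and dataset.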

### Predicting using SOTA Machine-Learning NNs (black box, though)

In [1]:
import pandas as pd
import numpy as np
import gc
import time
# df = pd.read_csv("data/train_V2.csv")

In [4]:
def reduce_mem_usage(df):
    # iterate through all the columns of a dataframe and modify the data type
    #   to reduce memory usage.        
    
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

The above technique is used when the data no longer fits in 16 GB of RAM. Basically, it downcasts each numeric column to the smallest type that can still hold its values. For instance, a value such as $705023$ does not need an $int64$; an $int32$ is enough (it is too large for $int16$, whose maximum is $32767$). This saves 32 bits (4 bytes) per value, so with, say, 1 million rows we save about 4 MB per column.
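The arithmetic behind the downcast can be checked with a few lines of plain Python. This is only an illustrative sketch: `smallest_signed_int_bits` is a hypothetical helper, and it uses inclusive bounds, whereas `reduce_mem_usage` above compares strictly against the `np.iinfo` limits.

```python
def smallest_signed_int_bits(lo, hi):
    """Return the smallest width in (8, 16, 32, 64) whose signed range holds [lo, hi]."""
    for bits in (8, 16, 32, 64):
        min_val = -(2 ** (bits - 1))
        max_val = 2 ** (bits - 1) - 1
        if min_val <= lo and hi <= max_val:
            return bits
    raise OverflowError("range does not fit in a signed 64-bit integer")


# A column whose values lie between 0 and 705023.
bits = smallest_signed_int_bits(0, 705023)
saved_bytes_per_value = (64 - bits) // 8  # bytes saved compared with int64
rows = 1_000_000
saved_mb = saved_bytes_per_value * rows / 1024 ** 2
print(bits, saved_bytes_per_value, round(saved_mb, 2))  # prints: 32 4 3.81
```

The saving is per column, which is why the function pays off on a frame with several hundred columns.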

In [5]:
def featureModify(isTrain):
    if isTrain:
        all_data = pd.read_csv("data/train_V2.csv")
        all_data = all_data[all_data['maxPlace'] > 1]
        all_data = reduce_mem_usage(all_data)
        all_data = all_data[all_data['winPlacePerc'].notnull()]
    else:
        all_data = pd.read_csv('../input/test_V2.csv')


    all_data['matchType'] = all_data['matchType'].map({
    'crashfpp':1,
    'crashtpp':2,
    'duo':3,
    'duo-fpp':4,
    'flarefpp':5,
    'flaretpp':6,
    'normal-duo':7,
    'normal-duo-fpp':8,
    'normal-solo':9,
    'normal-solo-fpp':10,
    'normal-squad':11,
    'normal-squad-fpp':12,
    'solo':13,
    'solo-fpp':14,
    'squad':15,
    'squad-fpp':16
    })
    all_data = reduce_mem_usage(all_data)

    print("Match size")
    matchSizeData = all_data.groupby(['matchId']).size().reset_index(name='matchSize')
    all_data = pd.merge(all_data, matchSizeData, how='left', on=['matchId'])
    del matchSizeData
    gc.collect()
    
    
    all_data.loc[(all_data['rankPoints']==-1), 'rankPoints'] = 0
    all_data['_killPoints_rankpoints'] = all_data['rankPoints']+all_data['killPoints']


    all_data["_Kill_headshot_Ratio"] = all_data["kills"]/all_data["headshotKills"]
    all_data['_killStreak_Kill_ratio'] = all_data['killStreaks']/all_data['kills']
    all_data['_totalDistance'] = 0.25*all_data['rideDistance'] + all_data["walkDistance"] + all_data["swimDistance"]
    all_data['_killPlace_MaxPlace_Ratio'] = all_data['killPlace'] / all_data['maxPlace']
    all_data['_totalDistance_weaponsAcq_Ratio'] = all_data['_totalDistance'] / all_data['weaponsAcquired']
    all_data['_walkDistance_heals_Ratio'] = all_data['walkDistance'] / all_data['heals']
    all_data['_walkDistance_kills_Ratio'] = all_data['walkDistance'] / all_data['kills']
    all_data['_kills_walkDistance_Ratio'] = all_data['kills'] / all_data['walkDistance']
    all_data['_totalDistancePerDuration'] =  all_data["_totalDistance"]/all_data["matchDuration"]
    all_data['_killPlace_kills_Ratio'] = all_data['killPlace']/all_data['kills']
    all_data['_walkDistancePerDuration'] =  all_data["walkDistance"]/all_data["matchDuration"]
    all_data['walkDistancePerc'] = all_data.groupby('matchId')['walkDistance'].rank(pct=True).values
    all_data['killPerc'] = all_data.groupby('matchId')['kills'].rank(pct=True).values
    all_data['killPlacePerc'] = all_data.groupby('matchId')['killPlace'].rank(pct=True).values
    all_data['weaponsAcquired'] = all_data.groupby('matchId')['weaponsAcquired'].rank(pct=True).values
    all_data['_walkDistance_kills_Ratio2'] = all_data['walkDistancePerc'] / all_data['killPerc']
    all_data['_kill_kills_Ratio2'] = all_data['killPerc']/all_data['walkDistancePerc']
    all_data['_killPlace_walkDistance_Ratio2'] = all_data['walkDistancePerc']/all_data['killPlacePerc']
    all_data['_killPlace_kills_Ratio2'] = all_data['killPlacePerc']/all_data['killPerc']
    all_data['_totalDistance'] = all_data.groupby('matchId')['_totalDistance'].rank(pct=True).values
    all_data['_walkDistance_kills_Ratio3'] = all_data['walkDistancePerc'] / all_data['kills']
    all_data['_walkDistance_kills_Ratio4'] = all_data['kills'] / all_data['walkDistancePerc']
    all_data['_walkDistance_kills_Ratio5'] = all_data['killPerc'] / all_data['walkDistance']
    all_data['_walkDistance_kills_Ratio6'] = all_data['walkDistance'] / all_data['killPerc']

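    # Ratios with a zero denominator give +/-inf (or NaN for 0/0); set them all to 0.
    # The np.Inf / np.NINF / np.NaN aliases were removed in NumPy 2.0.
    all_data = all_data.replace([np.inf, -np.inf], np.nan)
    all_data.fillna(0, inplace=True)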
    
    features = list(all_data.columns)
    features.remove("Id")
    features.remove("matchId")
    features.remove("groupId")
    features.remove("matchSize")
    features.remove("matchType")
    if isTrain:
        features.remove("winPlacePerc")

    
    print("Mean Data")
    meanData = all_data.groupby(['matchId','groupId'])[features].agg('mean')
    meanData = reduce_mem_usage(meanData)
    meanData = meanData.replace([np.inf, -np.inf, np.nan], 0)
    meanDataRank = meanData.groupby('matchId')[features].rank(pct=True).reset_index()
    meanDataRank = reduce_mem_usage(meanDataRank)
    all_data = pd.merge(all_data, meanData.reset_index(), suffixes=["", "_mean"], how='left', on=['matchId', 'groupId'])
    del meanData
    gc.collect()
    all_data = all_data.drop(["vehicleDestroys_mean","rideDistance_mean","roadKills_mean","rankPoints_mean"], axis=1)
    all_data = pd.merge(all_data, meanDataRank, suffixes=["", "_meanRank"], how='left', on=['matchId', 'groupId'])
    del meanDataRank
    gc.collect()
    all_data = all_data.drop(["numGroups_meanRank","rankPoints_meanRank"], axis=1)
    
    all_data = all_data.join(reduce_mem_usage(all_data.groupby('matchId')[features].rank(ascending=False).add_suffix('_rankPlace').astype(int)))

    
    print("Std Data")
    stdData = all_data.groupby(['matchId','groupId'])[features].agg('std').replace([np.inf, -np.inf, np.nan], 0)
    stdDataRank = reduce_mem_usage(stdData.groupby('matchId')[features].rank(pct=True)).reset_index()
    del stdData
    gc.collect()
    all_data = pd.merge(all_data, stdDataRank, suffixes=["", "_stdRank"], how='left', on=['matchId', 'groupId'])
    del stdDataRank
    gc.collect()
    
    print("Max Data")
    maxData = all_data.groupby(['matchId','groupId'])[features].agg('max')
    maxData = reduce_mem_usage(maxData)
    maxDataRank = maxData.groupby('matchId')[features].rank(pct=True).reset_index()
    maxDataRank = reduce_mem_usage(maxDataRank)
    all_data = pd.merge(all_data, maxData.reset_index(), suffixes=["", "_max"], how='left', on=['matchId', 'groupId'])
    del maxData
    gc.collect()
    all_data = all_data.drop(["assists_max","killPoints_max","headshotKills_max","numGroups_max","revives_max","teamKills_max","roadKills_max","vehicleDestroys_max"], axis=1)
    all_data = pd.merge(all_data, maxDataRank, suffixes=["", "_maxRank"], how='left', on=['matchId', 'groupId'])
    del maxDataRank
    gc.collect()
    all_data = all_data.drop(["roadKills_maxRank","matchDuration_maxRank","maxPlace_maxRank","numGroups_maxRank"], axis=1)


    print("Min Data")
    minData = all_data.groupby(['matchId','groupId'])[features].agg('min')
    minData = reduce_mem_usage(minData)
    minDataRank = minData.groupby('matchId')[features].rank(pct=True).reset_index()
    minDataRank = reduce_mem_usage(minDataRank)
    all_data = pd.merge(all_data, minData.reset_index(), suffixes=["", "_min"], how='left', on=['matchId', 'groupId'])
    del minData
    gc.collect()
    all_data = all_data.drop(["heals_min","killStreaks_min","killPoints_min","maxPlace_min","revives_min","headshotKills_min","weaponsAcquired_min","_walkDistance_kills_Ratio_min","rankPoints_min","matchDuration_min","teamKills_min","numGroups_min","assists_min","roadKills_min","vehicleDestroys_min"], axis=1)
    all_data = pd.merge(all_data, minDataRank, suffixes=["", "_minRank"], how='left', on=['matchId', 'groupId'])
    del minDataRank
    gc.collect()
    all_data = all_data.drop(["killPoints_minRank","matchDuration_minRank","maxPlace_minRank","numGroups_minRank"], axis=1)

    
    print("group Size")
    groupSize = all_data.groupby(['matchId','groupId']).size().reset_index(name='group_size')
    groupSize = reduce_mem_usage(groupSize)
    all_data = pd.merge(all_data, groupSize, how='left', on=['matchId', 'groupId'])
    del groupSize
    gc.collect()

    
    print("Match Mean")
    matchMeanFeatures = features
    matchMeanFeatures = [ v for v in matchMeanFeatures if v not in ["killPlacePerc","matchDuration","maxPlace","numGroups"] ]
    matchMeanData = reduce_mem_usage(all_data.groupby(['matchId'])[matchMeanFeatures].transform('mean')).replace([np.inf, -np.inf, np.nan], 0)
    all_data = pd.concat([all_data,matchMeanData.add_suffix('_matchMean')],axis=1)
    del matchMeanData,matchMeanFeatures
    gc.collect()

    print("matchMax")
    matchMaxFeatures = ["walkDistance","kills","_walkDistance_kills_Ratio","_kill_kills_Ratio2"]
    all_data = pd.merge(all_data, reduce_mem_usage(all_data.groupby(['matchId'])[matchMaxFeatures].agg('max')).reset_index(), suffixes=["", "_matchMax"], how='left', on=['matchId'])

    print("match STD")
    matchMaxFeatures = ["kills","_walkDistance_kills_Ratio2","_walkDistance_kills_Ratio","killPerc","_kills_walkDistance_Ratio"]
    all_data = pd.merge(all_data, reduce_mem_usage(all_data.groupby(['matchId'])[matchMaxFeatures].agg('std')).reset_index().replace([np.inf, -np.inf, np.nan], 0), suffixes=["", "_matchSTD"], how='left', on=['matchId'])


    all_data = all_data.drop(["Id","groupId"], axis=1)
    all_data = all_data.drop(["DBNOs","assists","headshotKills","heals","killPoints","_killStreak_Kill_ratio","killStreaks","longestKill","revives","roadKills","teamKills","vehicleDestroys","_walkDistance_kills_Ratio","weaponsAcquired"], axis=1)
    all_data = all_data.drop(["_walkDistance_heals_Ratio","_totalDistancePerDuration","_killPlace_kills_Ratio","_totalDistance_weaponsAcq_Ratio","_killPlace_MaxPlace_Ratio","_walkDistancePerDuration","rankPoints","rideDistance","boosts","winPoints","swimDistance","_kills_walkDistance_Ratio"], axis=1)
    all_data = all_data.drop(["_Kill_headshot_Ratio","maxPlace","_totalDistance","numGroups","walkDistance","killPlace"], axis=1)
    all_data = reduce_mem_usage(all_data)
    gc.collect()
    
    print("done")
    features_label = all_data.columns
    features_label = features_label.drop('matchId')
    if isTrain:
        features_label = features_label.drop('winPlacePerc')

    gc.collect()
    return all_data,features_label


def split_train_val(data, fraction):
    matchIds = data['matchId'].unique().reshape([-1])
    train_size = int(len(matchIds)*fraction)
    
    random_idx = np.random.RandomState(seed=2).permutation(len(matchIds))
    train_matchIds = matchIds[random_idx[:train_size]]
    val_matchIds = matchIds[random_idx[train_size:]]
    
    data_train = data.loc[data['matchId'].isin(train_matchIds)]
    data_val = data.loc[data['matchId'].isin(val_matchIds)]
    return data_train, data_val

In [6]:
X_train,features_label = featureModify(True) 

Memory usage of dataframe is 1017.83 MB
Memory usage after optimization is: 322.31 MB
Decreased by 68.3%
Memory usage of dataframe is 322.31 MB
Memory usage after optimization is: 292.63 MB
Decreased by 9.2%
Match size
Mean Data
Memory usage of dataframe is 642.07 MB
Memory usage after optimization is: 216.85 MB
Decreased by 66.2%
Memory usage of dataframe is 757.68 MB
Memory usage after optimization is: 212.61 MB
Decreased by 71.9%
Memory usage of dataframe is 1628.53 MB
Memory usage after optimization is: 233.25 MB
Decreased by 85.7%
Std Data
Memory usage of dataframe is 758.04 MB
Memory usage after optimization is: 212.98 MB
Decreased by 71.9%
Max Data
Memory usage of dataframe is 394.67 MB
Memory usage after optimization is: 189.79 MB
Decreased by 51.9%
Memory usage of dataframe is 757.68 MB
Memory usage after optimization is: 212.61 MB
Decreased by 71.9%
Min Data
Memory usage of dataframe is 394.67 MB
Memory usage after optimization is: 189.79 MB
Decreased by 51.9%
Memory usage of

In [7]:
X_train, X_train_test = split_train_val(X_train, 0.91)
print("Y time")
y = X_train['winPlacePerc']
y_test = X_train_test['winPlacePerc']
X_train = X_train.drop(columns=['matchId', 'winPlacePerc'])
X_train_test = X_train_test.drop(columns=['matchId', 'winPlacePerc'])

print("X test np time")
X_train_test = np.array(X_train_test)
print("y test np time")
y_test = np.array(y_test)


y = np.array(y)
X_train = np.array(X_train)
np.save("y", y)
np.save("x", X_train)
np.save("x_test",X_train_test)
np.save("y_test",y_test)

Y time
X test np time
y test np time


In [17]:
X_train.shape

(4044887, 409)

In [2]:
# Loading the file, so the next time I don't have to spend time waiting.
X_train = np.load("x.npy", allow_pickle=True)
X_train_test = np.load("x_test.npy", allow_pickle=True)
y = np.load('y.npy', allow_pickle=True)
y_test = np.load('y_test.npy', allow_pickle=True)

In [3]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization, Dropout
from keras.callbacks import EarlyStopping

Using TensorFlow backend.


In [8]:
model = Sequential()
model.add(Dense(450, input_dim=409))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Dense(450))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Dense(450))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Dense(1))
model.add(Activation('tanh'))
model.compile(optimizer='Adam', loss='mse', metrics=['mae'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              (None, 450)               184500    
_________________________________________________________________
batch_normalization_7 (Batch (None, 450)               1800      
_________________________________________________________________
activation_9 (Activation)    (None, 450)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 450)               202950    
_________________________________________________________________
batch_normalization_8 (Batch (None, 450)               1800      
_________________________________________________________________
activation_10 (Activation)   (None, 450)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 450)               202950    
__________

In [9]:
start = time.time()
es = EarlyStopping(patience=4)
model.fit(X_train,y, validation_data=(X_train_test,y_test), epochs=40, batch_size=2048, callbacks=[es])
end = time.time()

Train on 4044887 samples, validate on 402078 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40


The results above seem to have exploded on the validation set.
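The run above used `EarlyStopping(patience=4)`, and its log stops at epoch 10 of 40. To illustrate how a patience rule reacts when the validation loss blows up, here is a simplified pure-Python emulation. It is not Keras's implementation, `early_stop_epoch` is a hypothetical helper, and the loss values are invented for illustration (they are not the values from this run).

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Simplified rule: stop once the monitored loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: six improving epochs, then the loss explodes.
made_up_val_losses = [0.010, 0.006, 0.004, 0.003, 0.0025, 0.002, 0.05, 0.2, 0.9, 1.5, 1.6]
stop = early_stop_epoch(made_up_val_losses, patience=4)
print(stop)  # prints: 10
```

Under this rule the best epoch is the one `patience` epochs before the stop, so the weights at the end of training are not necessarily the best ones seen.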

In [12]:
end - start, (1 - 0.0359)/ (end-start)

(863.3746693134308, 0.0011166646813563195)

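It took us 863.37 seconds to reach a best MAE of 0.0359. Treating $1 - \text{MAE}$ as the accuracy, the index comes out at roughly 0.0011 per second of training (about 0.067 per minute). Note that this is an average over the whole run, not the gain from each additional second.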

In [4]:
# model.save('450_3.h5')
# del model
model = Sequential()
model.add(Dense(450, input_dim=409))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(450))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(450))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(1))
model.add(Activation('tanh'))
model.compile(optimizer='Adam', loss='mse', metrics=['mae'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 450)               184500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 450)               1800      
_________________________________________________________________
activation_1 (Activation)    (None, 450)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 450)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 450)               202950    
_________________________________________________________________
batch_normalization_2

In [5]:
start = time.time()
es = EarlyStopping(patience=10)
model.fit(X_train,y, validation_data=(X_train_test,y_test), epochs=80, batch_size=30000, callbacks=[es])
end = time.time()


Train on 4044887 samples, validate on 402078 samples
Epoch 1/80
Epoch 2/80
Epoch 3/80
Epoch 4/80
Epoch 5/80
Epoch 6/80
Epoch 7/80
Epoch 8/80
Epoch 9/80
Epoch 10/80
Epoch 11/80
Epoch 12/80
Epoch 13/80
Epoch 14/80
Epoch 15/80
Epoch 16/80
Epoch 17/80
Epoch 18/80
Epoch 19/80
Epoch 20/80
Epoch 21/80
Epoch 22/80
Epoch 23/80
Epoch 24/80
Epoch 25/80
Epoch 26/80
Epoch 27/80
Epoch 28/80
Epoch 29/80
Epoch 30/80
Epoch 31/80
Epoch 32/80
Epoch 33/80
Epoch 34/80
Epoch 35/80
Epoch 36/80
Epoch 37/80
Epoch 38/80
Epoch 39/80
Epoch 40/80
Epoch 41/80
Epoch 42/80


(3343.470095872879, 0.00028554165960044473)

In [6]:
end - start, (1 - 0.037)/ (end-start)

(3343.470095872879, 0.000288024110396175)
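Plugging both timed runs into the same index makes the trade-off explicit. The sketch below only reuses the elapsed times and MAE values that appear in the notebook's own calculations; the names `runs` and `index_per_second` are illustrative.

```python
# (elapsed seconds, MAE value used in the notebook's calculation) for each run.
runs = {
    "no dropout": (863.3746693134308, 0.0359),
    "dropout 0.2": (3343.470095872879, 0.037),
}


def index_per_second(seconds, mae):
    # The notebook's index: treat (1 - MAE) as accuracy and divide by training time.
    return (1 - mae) / seconds


indices = {name: index_per_second(s, m) for name, (s, m) in runs.items()}
for name, value in indices.items():
    print(f"{name}: {value:.6f} per second, {value * 60:.4f} per minute")

# How many times higher the index of the first run is.
ratio = indices["no dropout"] / indices["dropout 0.2"]
print(f"ratio: {ratio:.2f}")
```

The gap between the two indices comes almost entirely from training time (roughly 863 s versus 3343 s), since the two MAE values differ only in the third decimal place.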


From this we see that if we add dropout layers to the neural network, although it did perform somewhat better