Data Science Challenge

The dataset you are given contains matches of a MOBA game. The matches are uniquely identified by a `match_id`. Per match, there are several rows. Each row describes a unique gamestate in the match. We have removed the names of the other columns.

Your task is to produce a model that predicts the `winner` of the match given a particular gamestate. The winner may be the home team (`0`) or the away team (`1`). Please bear in mind that the trained model would be used to produce live odds while a game is being played, i.e. it would have to continuously update the probability of each team winning based on the game played so far.

The data is in `challenge_data.csv`.
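
Because the model is meant to drive live odds, class probabilities matter more than hard labels. A minimal sketch, on synthetic data rather than the challenge file, of reading win probabilities from a scikit-learn classifier and scoring them with log loss; the synthetic features, the `p_away` name, and the choice of random forest and log loss are illustrative assumptions, not part of the task statement:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one time slice: two numeric features, binary winner.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Columns follow clf.classes_, here [0, 1] -> [P(home wins), P(away wins)].
proba = clf.predict_proba(X_test)
p_away = proba[:, 1]

# Log loss penalizes confident but wrong probabilities; lower is better.
score = log_loss(y_test, proba, labels=[0, 1])
print("log loss:", score)
```

Accuracy only checks which side of 0.5 a prediction falls on, so a probability-based score such as log loss is usually a better fit when the output is quoted as odds.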

In [1]:
import pandas as pd
import numpy as np
# some basic libraries
import statsmodels.api as sm
import matplotlib
from matplotlib import pyplot as plt

# plot styling
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

# feel free to change the variable name to something you are comfortable with
df = pd.read_csv("challenge_data.csv")

# remove index col
del df["Unnamed: 0"]
# no null values; the single "catch me if you can" entry in feature_4 is replaced
# one .loc call (row mask + column label) avoids chained assignment
df.loc[df["feature_4"] == "catch me if you can", "feature_4"] = 5845.0

# so what we have is a list of games played between two teams
# we assume it is the same two teams every time, because otherwise we would
# need some identifier for each team to take it into account; since we
# don't have one, we treat all these games as played between the same
# two teams at different times
# therefore each match is a trial - a sample

# it is a semi-structured dataset with varying recorded times, but they
# seem to follow a pattern: each row comes roughly 20 seconds after the previous one

# get rid of rows with -1 in the winner column
df.drop(df[df["winner"] == -1.0].index, inplace=True)

# convert the feature_4 column to floats
df = df.astype({"feature_4": float})

# get match ids in a list
match_ids = df["match_id"].unique()

## we train one model per time interval
# for a single time interval:
# X -> candidate features feat_1 .. feat_4 (only feat_1 and feat_2 are used below)
# y -> winner

# stats box to hold statistic results
stats_box = []
# verbosity for individual predictions
talkative = False
# training method: "forest" -> random forest classifier (sklearn), "nn" -> neural network (keras implementation)
training = "forest"  # NOTE: nn doesn't work

# iterate through the game_time_id(s) and collect the accuracy for each id
# rows are roughly 20 seconds apart, so max game_time / 20 approximates the largest id
for game_time_id in range(round(max(df["game_time"])/20)):
    # boxes for features and labels
    X = []
    y = []

    # build X, y across all matches for one time interval
    # quick intuition about the features: feat_4 seems to be linear,
    # feat_3 barely changes and its changes don't seem significant,
    # so keep feat_1 and feat_2

    for match_id in match_ids:
        try:
            # row number game_time_id of this match (positional, not a game_time value)
            match = df[df["match_id"] == match_id].iloc[game_time_id]
            X.append(list(match.iloc[3:5]))  # feature columns
            y.append(match.iloc[2])  # label - winner column value
        # game lengths differ; a match that ended earlier has no such row
        except IndexError:
            continue

    # some time slices are so small that splitting them raises an error
    try:
        # train/test split
        from sklearn.model_selection import train_test_split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        
        if talkative:
            # some stats
            print("time int: ", (game_time_id +3)*20)
            print("total X/y: ", len(X))
            print("total X_train/y_train: ", len(X_train))
            print("total X_test/y_test: ", len(X_test))

        # without tuning, without actually thinking
        # training
        if training == "forest":
            from sklearn.ensemble import RandomForestClassifier
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_test)

            # metrics
            from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
            acc = accuracy_score(y_test, y_pred)
            if talkative:
                print(confusion_matrix(y_test,y_pred))
                print(classification_report(y_test,y_pred))
                print(acc)
                  
        elif training == "nn":
            # some sets are so small that the layers can't work on them
            # needs smarter layer handling, failed!
            try:
                # NN approach, architecture copied from the Machine Learning Mastery website,
                # "first neural network with Keras" tutorial
                from keras.models import Sequential
                from keras.layers import Dense

                # convert to np arrays as tf requires
                X_train_np = np.array(X_train, dtype=float)
                X_test_np = np.array(X_test, dtype=float)
                y_train_np = np.array(y_train, dtype=float)
                y_test_np = np.array(y_test, dtype=float)

                # define the keras model
                # input_dim must match the number of features used (2: feat_1 and feat_2)
                model = Sequential()
                model.add(Dense(12, input_dim=2, activation='relu'))
                model.add(Dense(4, activation='relu'))
                model.add(Dense(1, activation='sigmoid'))
                # compile the keras model
                model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
                # fit the keras model on the training split only
                model.fit(X_train_np, y_train_np, epochs=150, batch_size=10, verbose=0)
                # evaluate the keras model on the held-out split
                _, acc = model.evaluate(X_test_np, y_test_np, verbose=0)
            except Exception as e:
                acc = 0
        else:
            print("please choose a training method above")            

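    # splitting or training failed (e.g. too few samples): skip this interval
    # so that stale values from the previous iteration are not recorded
    except Exception as e:
        continue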

    # collect info -> game_time, total samples, training samples, test samples, accuracy
    stats_box.append([(game_time_id +3)*20, len(X), len(X_train), len(X_test), acc*100])
    
# stats_box to dataframe
df1 = pd.DataFrame(stats_box)
df1.columns = ["time","total","train","test","acc"]
df1[["total","train","test","acc"]].plot()

# intuitively, it should be easier to predict the winner towards the end of the game
# for the very last intervals there are too few samples, so those predictions are not meaningful
# overall the model doesn't seem to learn much; further feature engineering is needed

# at least the slope is greater than 0, though I know this is a cheap check...
from scipy.stats import linregress
linregress(df1["time"], df1["acc"])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
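
The warning above is what pandas typically emits for chained assignment, i.e. selecting a column first and then assigning through `.loc` on the result, which may write to a temporary copy instead of the original frame. A minimal sketch of the single-`.loc` form on a toy frame (the `toy` name and its values are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the challenge data; object dtype mimics a
# numeric column that was read as text because of one stray string.
toy = pd.DataFrame(
    {"feature_4": pd.Series(["1.5", "catch me if you can", "3.0"], dtype=object)}
)

# Row mask and column label go into one .loc call, so pandas writes
# directly into `toy` rather than into an intermediate Series.
mask = toy["feature_4"] == "catch me if you can"
toy.loc[mask, "feature_4"] = 5845.0

# With the stray string gone, the column converts cleanly to float.
toy["feature_4"] = toy["feature_4"].astype(float)
print(toy)
```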


LinregressResult(slope=0.01975817364568719, intercept=47.45521868871651, rvalue=0.2912827466638715, pvalue=0.03809694844912819, stderr=0.00927003072712697)
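
The regression output above can be read more directly with a little arithmetic. A rough sketch, assuming the `time` column is in seconds (as the roughly 20-second spacing suggests) and using a normal approximation for the slope interval, since the number of fitted points is not shown:

```python
# Values copied from the LinregressResult printed above.
slope = 0.01975817364568719    # accuracy points per unit of time
intercept = 47.45521868871651  # fitted accuracy (%) at time 0
rvalue = 0.2912827466638715
stderr = 0.00927003072712697   # standard error of the slope

# Share of the variance in accuracy explained by game time.
r_squared = rvalue ** 2

# Assumption: time is in seconds, so 60 units make one minute.
gain_per_minute = slope * 60

# Assumption: normal approximation (z = 1.96) for a ~95% interval;
# a t-quantile would need the sample size, which is not shown.
z = 1.96
ci_low = slope - z * stderr
ci_high = slope + z * stderr

print("r^2:", r_squared)
print("accuracy gain per minute:", gain_per_minute)
print("approx. 95% interval for the slope:", (ci_low, ci_high))
```

On these numbers the interval excludes zero only narrowly and the explained variance is small, which is in line with the notebook's own reading that the model learns little from these features.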