Let's create a model that predicts the result of an ongoing match based on the variables we've already selected.

The idea is to understand whether the variables we've selected are really predictive of the result of a match.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import re

import gc

pd.set_option('display.max_columns', 2000)
pd.set_option('display.max_rows', 2000)

In [None]:
DBS_PATH = "../Bases/"
VARS_PATH = "../1. Data Preparation/"

Let's import the train and test data.

In [None]:
train = pd.read_csv(DBS_PATH + "processed_train_games.csv", index_col=0)
test = pd.read_csv(DBS_PATH + "processed_test_games.csv", index_col=0)

In [None]:
train.head(2)

In [None]:
test.head(2)

In [None]:
train.info()

In [None]:
test.info()

Some extra preprocessing is needed before modeling:
* We should train our model on the intermediate moves of the match, as we're going to test it on ongoing matches
* Some fields are unknown while the match is still being played, or are redundant once derived fields exist, so they should be removed:
    - `opening_ply`
    - `last_move_at` (once we have created `games_delay_in_sec`, and we have `created_at`) => already excluded
    - `turns` => let's remove after exploding the matches
    - `black_rating` (once we have created `rating_difference`, and we have `white_rating`)
    - `victory_status` variables
* OneHotEncoder should use `drop_invariant=True`

_Interesting question:_ Would it be better to do the variable selection on data from intermediate moves as well? (I think not: when selecting variables, we're really interested in finding the ones that are useful for winning, and the move information is already present in each feature. For training, however, we'll be in an intermediate state of the match, which is why we need this preprocessing.)

In [None]:
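def remove_columns(df):
    """Drop the columns that should not be used as model features."""
    # Every column derived from the final victory status
    victory_status_cols = [x for x in df.columns if 'victory_status' in x]
    return df.drop(columns=
        victory_status_cols + [
            'opening_ply',
            'turns',
            'black_rating',
            'opening_eco_A00',
            'moves'
        ])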

In [None]:
train = remove_columns(train)

In [None]:
test = remove_columns(test)

In [None]:
train.shape, test.shape

Let's explode the matches, and get one line for each turn.

In [None]:
#pattern = re.compile("move{}$".format(10))
#[col for col in train.columns if pattern.search(col)]

In [42]:
explode_matches(train.head(2)).shape

0 -326
1 -19


(0, 0)

In [None]:
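cols = []
for next_move in range(2, 10):
    # Keep only the columns whose name ends in "move<N>" (same pattern as in explode_matches)
    pattern = re.compile("move{}$".format(next_move))
    cols = cols + [col for col in train.columns if pattern.search(col)]
print(cols)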

In [16]:
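# to test
def explode_matches(df):
    """Return one copy of df per move, with the columns of all later moves set to 0."""
    snapshots = []

    for move in range(1, 100):
        df_aux = df.copy()
        # Collect the columns that describe moves after the current one
        zero_cols = []
        for next_move in range(move+1, 100):
            pattern = re.compile("move{}$".format(next_move))
            # Use the columns of the frame passed in, not the global train
            zero_cols = zero_cols + [col for col in df.columns if pattern.search(col)]
        if zero_cols:
            df_aux[zero_cols] = 0
        snapshots.append(df_aux)

    # Concatenate once at the end instead of growing a DataFrame inside the loop
    return pd.concat(snapshots)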

In [None]:
train = explode_matches(train)

In [None]:
test = explode_matches(test)
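
To make the exploding step easier to check, here is a minimal sketch on a toy frame. The column names (`capture_move1`, ...) and the helper column `current_move` are hypothetical and not part of the original data. The only thing carried over from the notebook is the assumption that per-move columns end in `move<N>`. Recording `current_move` is a design choice that could help later when evaluating the model move by move.

```python
import re
import pandas as pd

# Toy data: two matches with three per-move columns each (hypothetical names)
toy = pd.DataFrame({
    "white_rating": [1500, 1800],
    "capture_move1": [0, 1],
    "capture_move2": [1, 1],
    "capture_move3": [1, 0],
})

def move_number(col):
    """Return N if the column name ends in 'move<N>', otherwise None."""
    match = re.search(r"move(\d+)$", col)
    return int(match.group(1)) if match else None

def explode_toy(df, max_move):
    """One snapshot of every match per move; columns of later moves are set to 0."""
    snapshots = []
    for move in range(1, max_move + 1):
        snap = df.copy()
        future = [c for c in df.columns
                  if move_number(c) is not None and move_number(c) > move]
        if future:
            snap[future] = 0
        snap["current_move"] = move  # hypothetical helper column
        snapshots.append(snap)
    return pd.concat(snapshots)

exploded = explode_toy(toy, max_move=3)
print(exploded)
```

Each match appears once per move, and in the snapshot for move 1 only `capture_move1` keeps its original values.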

## Evaluation

* Match result over time
    * What decisions does our model make during a single match?
    * Does it do better at the beginning or at the end? Is there a threshold move that is decisive for our algorithm to determine the result?
    
* Result at the end
    * How well does our model do at the end of the games? (ROC, Precision, Recall, F1)
    
* Compare with a benchmark
    * Engine evaluation
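
The first two points of this plan could be sketched as follows. Everything here is an assumption for illustration: the data is synthetic, the column names (`game_id`, `current_move`, `y_true`, `y_score`) are hypothetical, and scikit-learn is used for the metrics even though the notebook has not imported it so far. It is not a result from the real model.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for model output on exploded test matches
n_games, n_moves = 200, 20
games = pd.DataFrame({
    "game_id": np.repeat(np.arange(n_games), n_moves),
    "current_move": np.tile(np.arange(1, n_moves + 1), n_games),
})
white_wins = rng.integers(0, 2, size=n_games)
games["y_true"] = white_wins[games["game_id"].to_numpy()]

# Fake scores: the noise is made smaller for later moves, purely for illustration
noise = rng.normal(0, 1, size=len(games)) * (1.5 - games["current_move"] / n_moves)
games["y_score"] = 1 / (1 + np.exp(-(2 * games["y_true"] - 1 + noise)))
games["y_pred"] = (games["y_score"] >= 0.5).astype(int)

# Match result over time: accuracy for each move number
correct = games["y_pred"] == games["y_true"]
accuracy_by_move = correct.groupby(games["current_move"]).mean()

# Result at the end: metrics on the last snapshot of each game
final = games[games["current_move"] == n_moves]
metrics = {
    "roc_auc": roc_auc_score(final["y_true"], final["y_score"]),
    "precision": precision_score(final["y_true"], final["y_pred"]),
    "recall": recall_score(final["y_true"], final["y_pred"]),
    "f1": f1_score(final["y_true"], final["y_pred"]),
}

print(accuracy_by_move)
print(metrics)
```

Plotting `accuracy_by_move` would be one way to look for a threshold move after which the predictions stabilize.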

## Deploy

* Evaluate a match in real time