# Feature engineering

## Introduction
We always look at a round from the point of view of team_1.

## Features: 
- **target**: team_1 won the round
- **kills**: Number of players killed by team_1
- **deaths**: Number of team_1 players that died
- **equipment_value_ratio**: Total equipment value of team_1 relative to the equipment value of team_2. *This is more informative than storing the equipment values of both teams separately. $650 in a pistol round is normal. $650 in round 10 can probably be considered a saving round.*
- **first_kill**: Whether team_1 made the first kill. Making the first kill results in a higher chance of winning the round.
- **map**: The map being played. Maybe there are more clutches on Mirage?
- **damage_done**: Amount of damage team_1 did to team_2
- **damage_taken**: Amount of damage team_2 did to team_1
- **median_player_health**: Median health of all players of the team.
- **is_bomb_planted**: Whether the bomb has been planted.

- **utility**: [TODO]
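
To make the reasoning behind **equipment_value_ratio** concrete, here is a minimal sketch. The function name, the zero-value fallback, and the dollar amounts for the later round are assumptions made for illustration; they are not taken from the parser.

```python
def equipment_value_ratio(team_1_value: float, team_2_value: float) -> float:
    """Return team_1's total equipment value relative to team_2's.

    1.0 means both teams are equally equipped; values below 1.0 mean
    team_1 is under-equipped compared to team_2.
    """
    if team_2_value == 0:
        # Assumption for this sketch: avoid dividing by zero.
        return float("inf") if team_1_value > 0 else 1.0
    return team_1_value / team_2_value


# Pistol round: both teams hold roughly $650 per player.
pistol_round = equipment_value_ratio(650 * 5, 650 * 5)

# Later round (made-up values): team_1 saves, team_2 has a full buy.
saving_round = equipment_value_ratio(650 * 5, 4500 * 5)

print(pistol_round, round(saving_round, 2))
```

The same $650 per player gives a ratio of 1.0 in the pistol round but a much lower ratio against a full buy, which is exactly the difference the raw value would hide.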

# Features to implement
- **defuse_kit_count**: Number of defuse kits.
  - **values**: [None,0,1,2,3,4,5] (None when team_1 are terrorists)
- **alive_ratio_on_bomb_planted**: Number of team_1 players alive compared to the number of team_2 players alive when the bomb is planted.
- **distance_from_bomb_on_plant**: Average distance of team_1 players from the bomb when it is planted. (Maybe they are already saving.)
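
The two bomb-related features above could be computed roughly as follows. This is only a sketch: the function names, the position tuples, and the handling of a wiped-out team_2 are assumptions, since these features are not implemented yet.

```python
import math
from statistics import mean


def alive_ratio(team_1_alive: int, team_2_alive: int) -> float:
    """Players alive on team_1 divided by players alive on team_2."""
    # Assumption: if team_2 has nobody left, divide by 1 instead of 0.
    return team_1_alive / max(team_2_alive, 1)


def average_distance_from_bomb(player_positions, bomb_position) -> float:
    """Mean Euclidean distance between each player and the bomb."""
    return mean(math.dist(position, bomb_position) for position in player_positions)


# Hypothetical (x, y, z) positions at the moment the bomb is planted.
bomb = (0.0, 0.0, 0.0)
team_1_positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]

print(alive_ratio(2, 4))                                   # 0.5
print(average_distance_from_bomb(team_1_positions, bomb))  # 2.5
```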




In [24]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Add the parent folder, which contains the lib folder, to the import path.
sys.path.insert(0, '..')


In [25]:
# Feature engineering
from dp.database.connection import get_collections
import parser
from bson.objectid import ObjectId
from sklearn.model_selection import train_test_split

collections = get_collections()
Round = parser.Round.dataclass()

parsed_rounds = []
target = []
# TODO: remove this, we limit to 50 while testing.
matches = collections["matches"].find({}).limit(50)

# MATCHES
for match in matches:
  round_winstreak = 0
  team_1_id = match["teams"][0]

  # ROUND
  for round_id in match["rounds"]:

    # Sometimes a round is not parsed correctly and the data is not available.
    round_doc = collections["rounds"].find_one({"_id": ObjectId(round_id)})
    if round_doc is None:
      continue

    # Sample the round at 4 ticks: 35%, 50%, 75% and 100% of its duration,
    # so the model also sees states where not all kills and information are known yet.
    duration = round_doc["endTick"] - round_doc["startTick"]
    ticks = [round_doc["startTick"] + duration * fraction for fraction in [.35, .5, .75, 1]]

    # Update the win streak once per round, not for every tick we sample.
    # NOTE: the streak is updated before the features below are built, so it
    # already includes the outcome of the current round (target leakage).
    full_round = parser.Round(round_id, team_1_id, round_doc["endTick"])
    if full_round.is_win():
      round_winstreak += 1
    else:
      round_winstreak = 0

    for tick in ticks:
      round_at_tick = parser.Round(round_id, team_1_id, tick)
      (kills, deaths) = round_at_tick.kills_and_deaths()
      target.append(1 if round_at_tick.is_win() else 0)
      parsed_rounds.append(Round(kills=kills, deaths=deaths, first_blood=round_at_tick.is_first_blood(), round_winstreak=round_winstreak))

print("Parsed rounds:", len(parsed_rounds))
data = pd.DataFrame(parsed_rounds)


Parsed rounds: 5344
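
In the cell above, `round_winstreak` is updated with the result of the current round before the features for that round are built, so the feature partly encodes the target. A possible way around this is to record the streak *entering* each round and only update it afterwards. The helper below is a hypothetical sketch on plain booleans, not part of the parser:

```python
def winstreaks_before_each_round(round_wins):
    """Return, for every round, the number of consecutive wins before it."""
    streaks = []
    current_streak = 0
    for won in round_wins:
        # Feature value: the streak the team carries into this round.
        streaks.append(current_streak)
        # Only now update the streak with this round's outcome.
        current_streak = current_streak + 1 if won else 0
    return streaks


outcomes = [True, True, False, True]
print(winstreaks_before_each_round(outcomes))  # [0, 1, 2, 0]
```

With this ordering the feature for a round never depends on whether that same round was won.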


In [26]:
# Keep this in a separate cell, otherwise we can't rerun the cell above.
processed_data = parser.preprocessing(data)

In [27]:
# We should check how balanced our dataset is. If team_1 wins 80% of the rounds,
# the model can learn to always pick team_1 and still get an accuracy of 80%. We don't want that :)

# Split the data in training and testing data
train_X,test_X, train_Y, test_Y = train_test_split(processed_data,target,test_size=0.2, random_state=42)
print(train_X)


      kills  deaths  round_winstreak  first_blood__0.0  first_blood__1.0
5054      2       4                1                 1                 0
120       0       0                0                 0                 0
2351      5       3                1                 1                 0
1907      5       4                1                 1                 0
3648      0       0                0                 0                 0
...     ...     ...              ...               ...               ...
3092      0       0                1                 0                 0
3772      0       0                0                 0                 0
5191      5       1                1                 0                 1
5226      2       1                3                 0                 1
860       0       0                0                 0                 0

[4275 rows x 5 columns]
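
The class-balance concern from the comment above can be checked with a majority-class baseline. The sketch below uses a made-up label list rather than the real `target`; the helper name is an assumption.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Made-up example: team_1 wins 8 out of 10 samples.
example_target = [1] * 8 + [0] * 2
label, baseline_accuracy = majority_baseline(example_target)
print(label, baseline_accuracy)  # 1 0.8
```

Any model we train should clearly beat this baseline accuracy, otherwise it has learned little more than the class distribution.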


# Hyperparameter tuning
With an SVM we try to find the maximum-margin separator between our two classes (team_1_wins, team_2_wins): the line that lies furthest from the nearest training data points. Conceptually, the SVM looks at the distance to the closest data point for each possible separating line and picks the line for which that distance is largest. This makes an SVM a **maximum margin estimator**.

In most real problems it is not possible to find a perfect separating plane, because some data points lie closer to the other class. To handle this we allow the SVM to soften the margin, which means some points may creep into the margin if that allows a better fit. This turns our SVM into a soft-margin classifier, since we allow a few mistakes. How strongly these mistakes are penalized is controlled by a parameter typically called C.

C is a hyperparameter that needs to be tuned based on the dataset.
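
To see what C does, it helps to write out the standard soft-margin objective, 0.5·‖w‖² + C·Σ hinge losses, for a fixed separator. The sketch below is plain Python on made-up points and does not train anything; it only shows how the same margin violation costs more as C grows.

```python
def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * sum of hinge losses. Labels must be -1 or +1."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Zero when the point is on the correct side, outside the margin.
        hinge_total += max(0.0, 1.0 - y * score)
    return margin_term + C * hinge_total


# Made-up data: the third point sits on the wrong side of the separator.
points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, -1]
w, b = (1.0, 0.0), 0.0

print(soft_margin_objective(w, b, points, labels, C=1))   # 2.0
print(soft_margin_objective(w, b, points, labels, C=10))  # 15.5
```

A small C tolerates the misclassified point cheaply (a softer margin), while a large C pushes the optimizer towards separators that avoid such violations.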

In [28]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

parameters = {
  "kernel": ["rbf", "linear", "sigmoid"],
  "C": range(1,50),
}

clf = GridSearchCV(svm.SVC(probability=True), parameters)
clf.fit(train_X, train_Y)

print(clf.cv_results_)


# TODO: handle overfitting, with cross validation and folding

{'mean_fit_time': array([0.14653964, 0.00781908, 0.74964576, 0.09669762, 0.00737991,
       0.80635633, 0.08717632, 0.00863495, 0.80525732, 0.08522401,
       0.00640221, 0.8079174 , 0.07568698, 0.00824566, 0.75144944,
       0.0773778 , 0.00752959, 0.77227373, 0.07512927, 0.00731697,
       0.79827757, 0.07812576, 0.00967927, 0.76048908, 0.07335486,
       0.00794411, 0.73167405, 0.07281537, 0.00967264, 0.75616679,
       0.07834916, 0.00955596, 0.77863951, 0.07207613, 0.00790763,
       0.76595054, 0.06952605, 0.00594273, 0.57948136, 0.07554116,
       0.01045094, 0.63120837, 0.07220359, 0.00666857, 0.54103451,
       0.07101865, 0.00727339, 0.65489922, 0.07116551, 0.00723133,
       0.58209982, 0.07801952, 0.00873957, 0.59841871, 0.07171707,
       0.0052433 , 0.6418036 , 0.07280679, 0.00842485, 0.60119996,
       0.07927508, 0.00740891, 0.62072988, 0.07361746, 0.00649161,
       0.48890362, 0.07262583, 0.00777707, 0.58114381, 0.07127008,
       0.00782957, 0.63149533, 0.07085848, 0

In [29]:
from sklearn.metrics import accuracy_score, confusion_matrix
# Performance metrics
predictions = clf.predict(test_X)
print(predictions)
accuracy = accuracy_score(test_Y, predictions, normalize=True)
print(f"accuracy:{accuracy*100}%")
confusion_matrix(test_Y, predictions)



# TODO: Should add some more performance metrics to tweak the model.
# certainty for each category  (team1_wins, team1_loses)
print(clf.predict_proba(test_X))

[1 1 1 ... 1 1 0]
accuracy:100.0%
[[3.00000090e-14 1.00000000e+00]
 [3.00000090e-14 1.00000000e+00]
 [3.62688426e-07 9.99999637e-01]
 ...
 [3.00000090e-14 1.00000000e+00]
 [5.32851941e-06 9.99994671e-01]
 [9.99107857e-01 8.92143090e-04]]
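
As a starting point for the extra performance metrics mentioned in the TODO, precision and recall can be derived from the confusion-matrix counts. The sketch below uses made-up labels and hypothetical helper names, and is written in plain Python so each count is explicit.

```python
def binary_confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for labels encoded as 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1 and predicted == 0:
            fn += 1
        elif truth == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (team_1 wins)."""
    tn, fp, fn, tp = binary_confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels, not the notebook's test set.
example_true = [1, 1, 1, 0, 0, 1]
example_pred = [1, 0, 1, 0, 1, 1]
print(binary_confusion_counts(example_true, example_pred))  # (1, 1, 1, 3)
print(precision_recall(example_true, example_pred))         # (0.75, 0.75)
```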


In [30]:
import pickle

# Export model to disk
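filename = "../model/finalized_model.sav"
try:
  # The with-block makes sure the file is closed, even if dumping fails.
  with open(filename, "wb") as model_file:
    pickle.dump(clf, model_file)
except (OSError, pickle.PickleError) as error:
  print("Could not dump model:", error)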


# Live demo parsing

A second part, which we unfortunately had no time to implement, is **live demo parsing**.

We need to start by creating an N by N probability matrix, where N is the number of different states.
This is also called a **stochastic matrix**. Each entry at row x and column y is the probability of going from stateX to stateY.

Since we eventually want to find the winner of a game, we also need something called absorbing states. **An absorbing state is a state that, once entered, cannot be left**. In this case these are ["team1_wins", "team2_wins"].
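
As an illustration of this idea, the sketch below builds a small stochastic matrix with two absorbing states and repeatedly applies it to a starting distribution. The transient states and all probabilities are invented for the example; real values would have to be estimated from parsed demos.

```python
states = ["even", "team1_advantage", "team2_advantage", "team1_wins", "team2_wins"]

# Row x, column y: probability of moving from states[x] to states[y].
transition_matrix = [
    [0.0, 0.5, 0.5, 0.0, 0.0],  # even
    [0.2, 0.0, 0.0, 0.8, 0.0],  # team1_advantage
    [0.4, 0.0, 0.0, 0.0, 0.6],  # team2_advantage
    [0.0, 0.0, 0.0, 1.0, 0.0],  # team1_wins (absorbing)
    [0.0, 0.0, 0.0, 0.0, 1.0],  # team2_wins (absorbing)
]


def step(distribution, matrix):
    """Advance a probability distribution over states by one transition."""
    size = len(matrix)
    return [
        sum(distribution[i] * matrix[i][j] for i in range(size))
        for j in range(size)
    ]


# Start in the "even" state and iterate until the mass is absorbed.
distribution = [1.0, 0.0, 0.0, 0.0, 0.0]
for _ in range(200):
    distribution = step(distribution, transition_matrix)

win_probabilities = dict(zip(states[3:], distribution[3:]))
print(win_probabilities)  # team1_wins close to 4/7, team2_wins close to 3/7
```

Every row sums to 1, and the rows of the absorbing states keep all probability on themselves, so after enough steps the distribution ends up entirely in "team1_wins" and "team2_wins".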
