# Feature engineering

## Introduction
We always look at a round from the point of view of team_1.

## Features
- **target**: Whether team_1 won the round.
- **kills**: Number of players killed by team_1.
- **deaths**: Number of team_1 players that died.
- **equipment_value_ratio**: Total equipment value of team_1 relative to the equipment value of team_2. *This is more informative than storing the equipment values of both teams: $650 is normal in a pistol round, but $650 in round 10 can probably be considered a saving round.*
- **first_kill**: Whether team_1 made the first kill. Making the first kill results in a higher chance of winning the round.
- **map**: The map that is played. Maybe there are more clutches on Mirage?
- **damage_done**: Amount of damage team_1 did to team_2.
- **damage_taken**: Amount of damage team_2 did to team_1.
- **median_player_health**: Median health of all players of the team.
- **is_bomb_planted**: Whether the bomb has been planted.

- **utility**: [TODO]
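
As an illustration of the equipment_value_ratio idea, here is a minimal sketch. The function name and the choice to express the ratio as team_1's share of the combined value (which avoids dividing by zero when team_2 has no equipment) are assumptions, not the project's actual implementation.

```python
def equipment_value_ratio(team_1_value: float, team_2_value: float) -> float:
    """Return team_1's share of the combined equipment value (0.0 to 1.0).

    Hypothetical helper: the share form is an assumption made here so that
    a team_2 value of 0 does not cause a division by zero.
    """
    total = team_1_value + team_2_value
    if total == 0:
        # Neither team has equipment: treat the round as even.
        return 0.5
    return team_1_value / total


# Pistol round: both teams spend the same, so the ratio is even.
pistol_round = equipment_value_ratio(5 * 650, 5 * 650)

# Round 10: the same $650 per player against a full buy signals a saving round.
saving_round = equipment_value_ratio(5 * 650, 25000)

print(pistol_round, round(saving_round, 3))
```

The same absolute value ($650 per player) gives very different ratios in the two rounds, which is the information the raw values alone would hide.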

## Features to implement
- **defuse_kit_count**: Number of defuse kits.
  - **values**: [None,0,1,2,3,4,5] (None when team_1 are terrorists)
- **alive_ratio_on_bomb_planted**: Number of team_1 players alive relative to the number of team_2 players alive when the bomb is planted.
- **distance_from_bomb_on_plant**: Average distance of team_1 players from the bomb when it is planted. (Maybe they are already saving.)




In [8]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd



In [9]:
# Feature engineering
from feature_engineering import db
from feature_engineering import parser
from dataclasses import make_dataclass,fields
from bson.objectid import ObjectId
from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import minmax_scale

import warnings
warnings.filterwarnings('ignore')

collections = db.get_collections()
Round = parser.Round.get_dataclass()

parsed_rounds = []
target = []
# TODO: remove this, we limit to 50 while testing.
matches = collections["matches"].find({}).limit(50)

# MATCHES
for match in matches:
  round_winstreak = 0
  team_1_id = match["teams"][0]

  # ROUND
  for round_id in match["rounds"]:

    # Sometimes a round is not parsed correctly and the data is not available.
    round = collections["rounds"].find_one({ "_id": ObjectId(round_id)})
    if round is None:
      continue

    # Create a set of 4 ticks: @35%, @50%, @75%, @100% of the round duration.
    # During a round the model does not always have all kills and all information,
    # so we also sample the state at earlier ticks.
    duration = round["endTick"] - round["startTick"]
    ticks = [round["startTick"] + duration*n for n in [.35,.5,.75,1]]

    # TODO: we'll need to parse this beforehand.
    round = parser.Round(round_id,team_1_id,round["endTick"])

    # Update the win streak only once per round, not for every tick we sample.
    # NOTE: the streak is updated before the rows below are stored,
    # so round_winstreak already includes the outcome of the current round.
    if round.is_win():
      round_winstreak+= 1
    else:
      round_winstreak=0

    for tick in ticks:
      round = parser.Round(round_id,team_1_id, tick)
      (kills, deaths) = round.kills_and_deaths()
      target.append(1 if round.is_win() else 0)
      parsed_rounds.append(Round(kills=kills,deaths=deaths,first_blood=round.is_first_blood(), round_winstreak = round_winstreak))

print("Parsed rounds:", len(parsed_rounds))
data = pd.DataFrame(parsed_rounds)



Parsed rounds: 5344


      kills  deaths  first_blood  round_winstreak  first_blood__0.0  \
0         0       0          NaN                1                 0   
1         0       0          NaN                1                 0   
2         2       3          0.0                1                 1   
3         3       5          0.0                1                 1   
4         0       0          NaN                0                 0   
...     ...     ...          ...              ...               ...   
5339      5       1          1.0                4                 0   
5340      0       0          NaN                5                 0   
5341      1       1          1.0                5                 0   
5342      3       3          1.0                5                 0   
5343      5       3          1.0                5                 0   

      first_blood__1.0  
0                    0  
1                    0  
2                    0  
3                    0  
4                    0

In [15]:
# Keep this separate, otherwise we can't rerun the cell above.
processed_data = parser.preprocessing(data)

In [17]:
# We should check how balanced our dataset is. If team_1 wins in 80% of our games,
# the model can learn to always pick team_1 and still get an accuracy of 80%. We don't want that :)

# Split the data in training and testing data
train_X,test_X, train_Y, test_Y = train_test_split(processed_data,target,test_size=0.2, random_state=42)
print(train_X)


      kills  deaths  round_winstreak  first_blood__0.0  first_blood__1.0
5054      2       4                1                 1                 0
120       0       0                0                 0                 0
2351      5       3                1                 1                 0
1907      5       4                1                 1                 0
3648      0       0                0                 0                 0
...     ...     ...              ...               ...               ...
3092      0       0                1                 0                 0
3772      0       0                0                 0                 0
5191      5       1                1                 0                 1
5226      2       1                3                 0                 1
860       0       0                0                 0                 0

[4275 rows x 5 columns]
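
The balance check mentioned in the comment of this cell could look roughly like the sketch below. It uses toy stand-ins (`toy_features`, `toy_target`) rather than the real `processed_data` and `target`, and the use of `stratify` is a suggestion, not something the notebook currently does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for `processed_data` and `target`: 100 samples, 80% wins for team_1.
toy_target = np.array([1] * 80 + [0] * 20)
toy_features = np.arange(100).reshape(-1, 1)

# Share of samples in which team_1 wins; a value far from 0.5 means imbalance.
win_share = toy_target.mean()
print(f"team_1 win share: {win_share:.2f}")

# stratify keeps the same win share in both the training and the test split.
train_X, test_X, train_Y, test_Y = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=42, stratify=toy_target
)
print(f"train: {train_Y.mean():.2f}, test: {test_Y.mean():.2f}")
```

If the win share turns out to be far from 0.5, plain accuracy becomes a misleading metric and should be compared against the share of the majority class.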


# Hyperparameter tuning
With an SVM we try to find the maximum-margin separator between our two classes (team_1_wins, team_2_wins). This is the line furthest from the nearest training data points. Among all possible separating lines, the SVM looks at the distance to the closest data point and picks the line for which this distance is largest. This makes an SVM a **maximum margin estimator**.

In most real problems, it is not possible to find a perfect separating plane. Sometimes there are data points that lie closer to the other class. To handle this we allow the SVM to soften the margin, which means we allow some of the points to creep into the margin if that allows a better fit. This transforms our SVM into a soft-margin classifier, since we are allowing for a few mistakes. The hardness of the margin is controlled by a parameter that is typically called C.

C is a hyperparameter that needs to be tuned based on the dataset.
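
To get a feeling for what C does, the sketch below fits a linear SVM on synthetic, overlapping data for a few values of C and counts the support vectors. The data and the chosen C values are assumptions for illustration only. In scikit-learn a small C gives a softer, wider margin (more points inside it), while a large C penalizes margin violations more heavily.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters as a stand-in for the two classes.
X, y = make_blobs(
    n_samples=200, centers=[[-2, -2], [2, 2]], cluster_std=1.5, random_state=0
)

support_counts = {}
for C in (0.01, 1, 100):
    model = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    support_counts[C] = int(model.n_support_.sum())
    print(f"C={C}: {support_counts[C]} support vectors")
```

A softer margin typically means more training points end up on or inside the margin, so the support vector count usually drops as C grows.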

In [28]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

parameters = {
  "kernel": ["rbf", "linear", "sigmoid"],
  "C": range(1,50),
}

clf = GridSearchCV(svm.SVC(probability=True), parameters)
clf.fit(train_X, train_Y)

print(clf.cv_results_)


# TODO: handle overfitting, with cross validation and folding

{'mean_fit_time': array([0.1490994 , 0.00877304, 0.80585294, 0.11360884, 0.009513  ,
       0.82696967, 0.08790784, 0.00745935, 0.7352952 , 0.08256621,
       0.01148181, 0.77894468, 0.08114972, 0.00850945, 0.75271811,
       0.07339115, 0.00803609, 0.74470992, 0.0771153 , 0.00731525,
       0.80215673, 0.07510953, 0.00616293, 0.78338604, 0.07408609,
       0.00819111, 0.74001718, 0.07324615, 0.00747485, 0.74961782,
       0.07732539, 0.00645604, 0.71357889, 0.07369452, 0.00641627,
       0.74338303, 0.07338452, 0.00722771, 0.58009415, 0.07403798,
       0.00664129, 0.64220848, 0.07532516, 0.00785208, 0.60028772,
       0.07454095, 0.00759997, 0.6046989 , 0.07217684, 0.00658679,
       0.41780152, 0.07406902, 0.00865002, 0.58207464, 0.07050257,
       0.00803018, 0.60389848, 0.07549262, 0.00863109, 0.6368453 ,
       0.07556677, 0.00870028, 0.6295423 , 0.07339158, 0.00636778,
       0.50186682, 0.07436051, 0.00771565, 0.5599546 , 0.07355018,
       0.00803828, 0.64421325, 0.08605409, 0
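
The raw `cv_results_` dictionary is hard to read when printed directly. A possible way to inspect it is to load it into a DataFrame and sort by rank, as sketched below on synthetic data with a much smaller grid. The data, the grid and the explicit `cv=5` are assumptions for illustration. `GridSearchCV` already scores every parameter combination with k-fold cross validation, which is the starting point for the TODO about overfitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for train_X / train_Y.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small grid, scored with 5-fold cross validation.
search = GridSearchCV(SVC(), {"kernel": ["rbf", "linear"], "C": [1, 10]}, cv=5)
search.fit(X, y)

# One row per parameter combination, best combination first.
results = pd.DataFrame(search.cv_results_)
columns = ["param_kernel", "param_C", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")

print(summary)
print("best parameters:", search.best_params_)
```

A large `std_test_score` relative to the differences in `mean_test_score` suggests that the ranking between parameter combinations is not very reliable.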

In [45]:
from sklearn.metrics import accuracy_score, confusion_matrix
# Performance metrics
prediction = clf.predict(test_X)
accuracy = accuracy_score(test_Y, prediction, normalize=True)
print(f"accuracy:{accuracy*100}%")
confusion_matrix(test_Y, prediction)

# TODO: Should add some more performance metrics to tweak the model.

[[3.00000090e-14 1.00000000e+00]
 [3.00000090e-14 1.00000000e+00]
 [3.96029623e-07 9.99999604e-01]
 ...
 [3.00000090e-14 1.00000000e+00]
 [5.79727857e-06 9.99994203e-01]
 [9.99128568e-01 8.71432070e-04]]
accuracy:100.0%


array([[492,   0],
       [  0, 577]])
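
For the TODO about additional performance metrics, the sketch below shows a few candidates from scikit-learn: precision, recall and F1 per class, ROC AUC, and a majority-class baseline. It runs on synthetic data, and the class names passed to `target_names` are assumptions. A perfect score like the one above is usually worth double-checking against such a baseline and against features that may already encode the outcome.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the processed rounds and their targets.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=0.2, random_state=42)

model = SVC(probability=True, random_state=0).fit(train_X, train_Y)
prediction = model.predict(test_X)

# Precision, recall and F1 per class.
report = classification_report(
    test_Y, prediction, target_names=["team_2_wins", "team_1_wins"]
)
print(report)

# ROC AUC uses the predicted probability of class 1 (team_1 wins).
auc = roc_auc_score(test_Y, model.predict_proba(test_X)[:, 1])
print(f"ROC AUC: {auc:.3f}")

# Baseline that always predicts the most frequent class in the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(train_X, train_Y)
baseline_accuracy = baseline.score(test_X, test_Y)
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```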

In [38]:
import pickle

# Export model to disk
try:
  filename = "../model/finalized_model.sav"
  # The with-block makes sure the file is closed, even when dumping fails.
  with open(filename, 'wb') as f:
    pickle.dump(clf, f)
except pickle.PickleError:
  print("could not dump model")
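
Loading the exported model again could look like the sketch below. It trains a small stand-in model and writes it to a temporary directory instead of `../model/`, so the names used here are assumptions. Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small stand-in model instead of the tuned `clf`.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "finalized_model.sav")

    # Export the model to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it again, for example in a separate prediction service.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("same predictions:", same_predictions)
```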




# Live demo parsing

A second part, which we unfortunately had no time to implement, is **live demo parsing**.

We need to start by creating an N by N probability matrix, where N is the number of different states.
This is also called a **stochastic matrix**. The entry in row x and column y is the probability of going from state x to state y.
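
One possible way to estimate such a matrix is to count the transitions in observed state sequences and normalize each row, as sketched below. The state names and the observed sequences are made up for illustration; the real states would have to come from the parsed demos.

```python
import numpy as np

# Hypothetical states and observed state sequences (one list per round).
states = ["even", "team1_ahead", "team2_ahead"]
observed = [
    ["even", "team1_ahead", "team1_ahead"],
    ["even", "team2_ahead", "even", "team1_ahead"],
]

index = {state: i for i, state in enumerate(states)}
counts = np.zeros((len(states), len(states)))

# Count every transition from one state to the next.
for sequence in observed:
    for current, following in zip(sequence, sequence[1:]):
        counts[index[current], index[following]] += 1

# Normalize each row so it sums to 1; rows without observations stay 0.
row_sums = counts.sum(axis=1, keepdims=True)
P = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

print(P)
```

Each row of `P` is the probability distribution over the next state, given the current state of that row.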

Since we eventually want to find the winner of a game, we also need so-called absorbing states. **An absorbing state is a state that, once entered, cannot be left**. In this case: ["team1_wins", "team2_wins"].
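
With absorbing states in place, the probability of ending in each of them can be computed from the transition matrix. The sketch below uses the standard result for absorbing Markov chains, B = (I - Q)^-1 R, where Q holds the transitions between non-absorbing states and R the transitions into absorbing states. The three transient states and all probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical states: three transient states followed by two absorbing states.
states = ["even", "team1_ahead", "team2_ahead", "team1_wins", "team2_wins"]

# Row x, column y: probability of going from state x to state y (made-up values).
P = np.array([
    [0.2, 0.4, 0.4, 0.0, 0.0],
    [0.1, 0.3, 0.1, 0.5, 0.0],
    [0.1, 0.1, 0.3, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0],  # team1_wins: absorbing, stays in itself
    [0.0, 0.0, 0.0, 0.0, 1.0],  # team2_wins: absorbing, stays in itself
])

n_transient = 3
Q = P[:n_transient, :n_transient]  # transient -> transient
R = P[:n_transient, n_transient:]  # transient -> absorbing

# B[i, j]: probability of eventually ending in absorbing state j from transient state i.
B = np.linalg.solve(np.eye(n_transient) - Q, R)

for state, (p_team1, p_team2) in zip(states[:n_transient], B):
    print(f"{state}: team1_wins={p_team1:.4f}, team2_wins={p_team2:.4f}")
```

Because the example matrix is symmetric between the two teams, the "even" state gives both teams a probability of 0.5, while "team1_ahead" favors team_1.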
