<h1 style="color:#ff5500; font-family:Play; font-size:3em; margin:auto 32px;align:center">Part III - Feature Engineering</h1>

---


This document is a part of the FACEIT Predictor Data Science Workflow.

In this notebook, several features are created and stored for the whole dataset.

# Imports


In [315]:
import pandas as pd
import numpy as np
import json
from pymongo import MongoClient
from statistics import mean
from glob import glob
import datetime
from tqdm import tqdm
from collections import defaultdict
from ast import literal_eval

# enable imports from parent directory
import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent))

# local modules
from src.db.config import read_config
from src.utils.loaders import read_data_iter
from src.utils.dirs import EXTERNAL_DATA_DIR, PROCESSED_DATA_DIR
from src.features.featurize import create_features
from src.data.performance_indicators import create_performance_indicators

from IPython import get_ipython

ipython = get_ipython()

# autoreload extension
if "autoreload" not in ipython.extension_manager.loaded:
    %load_ext autoreload

# autoreload python modules
%autoreload 2

# Database Connection

In [2]:
db_cfg = read_config("local.ingestorDB")

In [3]:
client = MongoClient(**db_cfg)
db = client['faceit_imported']

In [4]:
# Connect to the collections inside the local ingestor database
players_coll = db['player']
matches_coll = db['match']
lifetime_stats_coll = db['player_lifetime_stats']

# Performance indicators statistics

Compute the *Mean* and *Standard Deviation* for:
* Kills Per Round (KPR)
* Survived Rounds per Round (SPR)
* Multikills Score per Round (MKPR)
* Assists per Round (APR)
* MVPs per Round (MVPPR)

The Performance Indicators become important further on, when some of the features are implemented. For instance, the Rating Feature measures how well a player performed compared to the mean values over all matches. These values can also be used to impute missing data whenever a certain feature cannot be computed.
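
To make the statistics concrete, the sketch below computes the mean and standard deviation of two per-round indicators from a few made-up player records. The record layout (`kills`, `survived`, `rounds`), the helper `per_round_stats`, and the choice of the population standard deviation are assumptions for illustration only; the project's `create_performance_indicators` runs against the MongoDB collection and may be implemented differently.

```python
from statistics import mean, pstdev

# Hypothetical per-player match records; the field names are illustrative only.
records = [
    {"kills": 20, "survived": 8, "rounds": 25},
    {"kills": 15, "survived": 6, "rounds": 30},
    {"kills": 24, "survived": 12, "rounds": 24},
]


def per_round_stats(records, field):
    """Mean and population standard deviation of `field` divided by rounds."""
    values = [r[field] / r["rounds"] for r in records if r["rounds"] > 0]
    return {"mean": mean(values), "std_dev": pstdev(values)}


stats = {
    "kills_pr": per_round_stats(records, "kills"),
    "survived_pr": per_round_stats(records, "survived"),
}
```

Records with zero rounds are skipped so that the per-round division is always defined.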

In [None]:
create_performance_indicators(matches_coll)

In [313]:
performance_statistics_coll = db['performance_statistics']

In [318]:
# Collect the mean and standard deviation of each indicator, keyed by indicator name
performance_statistics = {}
for ps in performance_statistics_coll.find({}):
    performance_statistics[ps["_id"]] = {
        "mean": ps["value"]["mean"],
        "std_dev": ps["value"]["stdDev"],
    }
pd.DataFrame(performance_statistics)

| | assists_pr | kills_pr | multikills_rating_pr | mvps_pr | survived_pr |
|---|---|---|---|---|---|
| mean | 0.145781 | 0.702935 | 0.504272 | 0.101607 | 0.296504 |
| std_dev | 0.080769 | 0.238984 | 0.53184 | 0.077306 | 0.139417 |
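
The two uses mentioned earlier, comparing a player against the global mean and imputing missing values, could look roughly like the sketch below. The standardised score is only one plausible way to compare against the mean and is an assumption here, not the project's Rating formula. The helpers `standardize` and `impute` are hypothetical names.

```python
# Values copied from the statistics table above.
performance_statistics = {
    "kills_pr": {"mean": 0.702935, "std_dev": 0.238984},
    "assists_pr": {"mean": 0.145781, "std_dev": 0.080769},
}


def standardize(value, indicator, stats=performance_statistics):
    """How many standard deviations `value` lies above the global mean."""
    s = stats[indicator]
    return (value - s["mean"]) / s["std_dev"]


def impute(value, indicator, stats=performance_statistics):
    """Fall back to the global mean when a value is missing (None)."""
    return stats[indicator]["mean"] if value is None else value


above_average = standardize(0.95, "kills_pr")   # positive: better than the mean
filled = impute(None, "assists_pr")             # the global mean for assists_pr
```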


# Featurization

The feature functions were first implemented and tested in this notebook, then moved to the appropriate modules inside `src/features/`.

The features are divided into 4 types:
* match features
* lifetime features
* date features
* previous matches features

**Match Features**

Features that are directly computed from the match configuration or from data that is available in the match document. 

Example:

`add_feature(data, get_mean_elo)`  
`add_feature(data, get_num_parties)`  
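
A minimal sketch of how such a call could work is shown below. The two functions share their names with the project's helpers but are hypothetical stand-ins: the match layout (`teamA`, `teamB`, `elo`) and the idea of storing each feature as a team A minus team B difference under a `dif_` prefix are assumptions, loosely inferred from the output column names, and the real code in `src/features/` may differ.

```python
from statistics import mean

# Hypothetical match record: two teams, each a list of players with an Elo.
match = {
    "teamA": [{"elo": 2100}, {"elo": 2300}, {"elo": 2500}],
    "teamB": [{"elo": 2000}, {"elo": 2200}, {"elo": 2400}],
}


def get_mean_elo(team):
    """Mean Elo of one team (stand-in for the project's function)."""
    return mean(player["elo"] for player in team)


def add_feature(match, feature_fn, features=None):
    """Apply feature_fn to both teams and store the A - B difference.

    Assumption: features are stored as team differences under a `dif_`
    prefix; the project's add_feature may work differently.
    """
    features = {} if features is None else features
    name = feature_fn.__name__
    if name.startswith("get_"):
        name = name[len("get_"):]
    features[f"dif_{name}"] = feature_fn(match["teamA"]) - feature_fn(match["teamB"])
    return features


features = add_feature(match, get_mean_elo)
```

Deriving the column name from the function name keeps each feature function and its output column in sync without a separate mapping.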

**Lifetime Features**

The Lifetime Features are computed from the Lifetime Stats of each player as they were immediately before the beginning of the match. The values are averaged for each team.

The 5 main indicators are *Number of Matches*, *Winrate*, *Kill/Death ratio*, *Multikills Score* and *Rating*. Each is computed globally (the total over all maps) and specifically for the map of the match (on map). The preference, i.e. the on-map / total ratio, is also taken into account.

Example:

Global: `add_feature(data, get_mean_matches)`  
Specific: `add_feature(data, get_mean_winrate_on_map)`  
Preference (Specific/Global): `add_feature(data, get_mean_rating_map_preference)`  
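
The preference ratio can be illustrated with a small sketch. The player fields (`matches`, `matches_on_map`) and the helper `map_preference` are hypothetical, and skipping players whose total is zero is an assumption about how undefined ratios might be handled.

```python
from statistics import mean


def map_preference(on_map, total):
    """On-map / total ratio, or None when the total is zero or missing."""
    if not total:
        return None
    return on_map / total


# Hypothetical lifetime stats for three players of one team.
team = [
    {"matches": 400, "matches_on_map": 100},
    {"matches": 250, "matches_on_map": 50},
    {"matches": 0, "matches_on_map": 0},  # new player: ratio undefined
]

prefs = [map_preference(p["matches_on_map"], p["matches"]) for p in team]
known = [x for x in prefs if x is not None]

# Team-level value: average over the players with a defined ratio.
mean_matches_map_preference = mean(known) if known else None
```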

**Date Features**

At the time of writing, the Date Features only cover the age of the players' FACEIT accounts.

Example:

`add_feature(data, get_mean_created_at_faceit)`  
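
As an illustration, the account age of a team could be averaged as in the sketch below. Measuring age in days relative to the match start is an assumption, and `mean_account_age_days` is a hypothetical helper; the project's `get_mean_created_at_faceit` may use a different unit or reference point.

```python
from datetime import datetime, timezone
from statistics import mean


def mean_account_age_days(created_at_list, match_start):
    """Average account age in days at the moment the match starts."""
    return mean(
        (match_start - created).total_seconds() / 86400
        for created in created_at_list
    )


# Hypothetical timestamps for a two-player team.
match_start = datetime(2020, 1, 31, tzinfo=timezone.utc)
team_created_at = [
    datetime(2020, 1, 1, tzinfo=timezone.utc),   # 30 days old
    datetime(2020, 1, 21, tzinfo=timezone.utc),  # 10 days old
]

team_age = mean_account_age_days(team_created_at, match_start)
```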

**Previous Matches Features**

The Previous Matches Features are computed from the data of the 10 previous matches. The 5 main indicators listed under *Lifetime Features* are also computed over these matches. In addition, the recency of the matches is considered, as well as the number of matches played on the same day. Finally, the features check whether any of the players have played together in the past 10 matches and how well they performed together.

Example:

`add_feature(match, get_mean_matches_on_map_prev, previous_matches=previous_matches)`  
`add_feature(match, get_mean_interval_time_prev, previous_matches=previous_matches)`  
`add_feature(match, get_mean_matches_on_day, previous_matches=previous_matches)`   
`add_feature(match, get_winrate_togthr_prev, previous_matches=previous_matches)` 
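
Two of these ideas, match recency and matches played on the same day, are sketched below for a single player. The helpers `mean_interval_hours` and `matches_on_day` and the plain list of start times are hypothetical simplifications; the project's functions receive a `previous_matches` argument whose structure is not shown here.

```python
from datetime import date, datetime
from statistics import mean


def mean_interval_hours(start_times):
    """Average gap in hours between consecutive matches, or None if fewer than two."""
    ordered = sorted(start_times)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(ordered, ordered[1:])]
    return mean(gaps) if gaps else None


def matches_on_day(start_times, day):
    """Number of matches that started on the given calendar day."""
    return sum(1 for t in start_times if t.date() == day)


# Hypothetical start times of a player's previous matches.
previous_starts = [
    datetime(2020, 3, 1, 18, 0),
    datetime(2020, 3, 1, 20, 0),
    datetime(2020, 3, 2, 20, 0),
]

interval = mean_interval_hours(previous_starts)
same_day = matches_on_day(previous_starts, date(2020, 3, 1))
```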

In [2]:
batch_data_gen = read_data_iter("interim")

In [3]:
# Featurize each interim batch and write it to its own CSV file
for index, data in enumerate(batch_data_gen):
    data_featurized = create_features(data)
    data_featurized.to_csv(f'{PROCESSED_DATA_DIR}/batch_{index}.csv')

In [4]:
data_featurized.head()

| | _id | winner | match_mean_elo | 5v5_free_queue | 5v5_premium_queue | dif_mean_elo | dif_stddev_elo | dif_num_paid_memberships | dif_num_solo_players | dif_num_parties | ... | dif_mean_dif_rounds_prev | dif_mean_dif_elo_prev | dif_mean_matches_afk | dif_num_played_togthr_prev | dif_winrate_togthr_prev | dif_mean_first_matches_on_day | dif_mean_matches_on_day | dif_mean_played_map_on_day | entity_dummies_hub | entity_dummies_matchmaking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-42b7574e-2ff5-4413-84c8-5101096a46cb | 0 | 2304.9 | 0 | 1 | 39.0 | 423.672183 | 0 | -5 | -3 | ... | -0.76 | 80.196 | 0.0 | 0.4 | -0.818182 | 0 | 1.2 | -0.4 | 0 | 1 |
| 1 | 1-42b8405a-abd2-4932-8b49-a7d7340d56b0 | 0 | 2799.9 | 1 | 0 | 250.6 | -106.472846 | -2 | 0 | 0 | ... | -1.98 | -196.732 | 0.0 | 0.02 | -0.062069 | 2 | -0.2 | -2.4 | 0 | 1 |
| 2 | 1-42b8a143-500e-45ca-9ee9-d0f064b63481 | 0 | 2310.4 | 1 | 0 | 338.8 | 221.640169 | 3 | 0 | 0 | ... | -1.5 | 86.628 | 0.0 | -0.4 | -0.102564 | 1 | -1.6 | -0.6 | 0 | 1 |
| 3 | 1-42b8f6db-d606-4f54-9cec-3dd072db5a8e | 0 | 2621.0 | 1 | 0 | -48.4 | 386.535716 | 3 | 0 | 0 | ... | -2.56 | 103.628 | 0.0 | -0.7 | 0.51773 | 0 | 0.2 | 0.6 | 0 | 1 |
| 4 | 1-42b96028-cd51-4a24-bf42-d622f53f3454 | 0 | 1796.4 | 0 | 1 | 4.0 | -371.477952 | 0 | 5 | 3 | ... | 0.74 | -10.588 | 0.0 | -0.52 | -0.230769 | 1 | -0.4 | -0.4 | 0 | 1 |
