<h1 style="color:#ff5500; font-family:Play; font-size:3em; margin:auto 32px;align:center">Part I - Data Preparation</h1>

---


This document is a part of the FACEIT Predictor Data Science Workflow.

In this notebook, the collected data (stored in the local MongoDB database) is processed to create a dataset.


# Imports


In [5]:
from pprint import pprint

# enable imports from parent directory
import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent.parent))

# local modules
from src.db.connections import get_local_db
from src.data.create_lifetime_stats import create_all_lifetime_stats
from src.utils.dirs import INTERIM_DATA_DIR, EXTERNAL_DATA_DIR, RAW_DATA_DIR
from src.data.processable_matches import processable_matches_complete
from src.data.build_raw_dataset import build_raw_dataset
from src.data.build_interim_dataset import build_interim_dataset
from src.data.performance_indicators import create_performance_indicators

from IPython import get_ipython

ipython = get_ipython()

# autoreload extension
if "autoreload" not in ipython.extension_manager.loaded:
   %load_ext autoreload

# autoreload python modules
%autoreload 2

# Database Connection


In [6]:
db = get_local_db()

In [7]:
# Connect to the collections inside the ingestor database
players_coll = db['player']
matches_coll = db['match']
lifetime_stats_coll = db['player_lifetime_stats']

In [8]:
print("Number of player documents", players_coll.estimated_document_count())
print("Number of matches documents", matches_coll.estimated_document_count())
print("Number of lifetime stats documents", lifetime_stats_coll.estimated_document_count())

Number of player documents 136007
Number of matches documents 1267413
Number of lifetime stats documents 6388405


# Sample Documents

## Player *Schema*

| Field Name   |      Type      |  Description |
|----------|:-------------:|------|
| _id |  str | FACEIT account ID formatted as a UUID|
| activatedAt |    str  |   Date of activation of the FACEIT account |
| steamCreatedAt| datetime |   Date of creation of the Steam account linked to the FACEIT account |
| updatedAt | datetime |    Date of the last FACEIT profile update |
| csgoId | str |    Steam64ID of the Steam account linked to the FACEIT account |
| verified | bool |    Whether or not it is a verified FACEIT account.<br/>(usually reserved for high-profile players, including pros and streamers) |
| mapStats | dict |    Dictionary where the keys are the map names and the values are dictionaries<br/>containing the *Lifetime Stats* of the player on the corresponding map.<br/>The stats for each map include the following fields:<ul style="margin:8px"><li>kills</li><li>deaths</li><li>assists</li><li>matches</li><li>wins</li><li>rounds</li><li>headshots</li><li>mvps</li><li>tripleKills</li><li>quadraKills</li><li>pentaKills</li></ul>|
| updatedAtIngestor | int |    Unix timestamp of the last time the player was processed in the FACEIT Ingestor service |
| matchHistory | list |    List of the player's match history containing both match ID and its start time.<br/>It is ordered from the most recent to the oldest match.|

In [16]:
sample_player = players_coll.find_one({})
sample_player["matchHistory"] = sample_player["matchHistory"][:3]
sample_player["mapStats"] = dict(list(sample_player["mapStats"].items())[:2])
pprint(sample_player)

{'_id': '1efb9d10-6871-479a-9d39-9cbb68546792',
 'activatedAt': 'Sat Sep 28 01:00:32 UTC 2013',
 'csgoId': '76561197998383222',
 'mapStats': {'de_cache': {'assists': 2029,
                           'deaths': 8536,
                           'headshots': 4354,
                           'kills': 9605,
                           'matches': 469,
                           'mvps': 1338,
                           'name': 'de_cache',
                           'pentaKills': 11,
                           'quadraKills': 130,
                           'rounds': 12535,
                           'tripleKills': 575,
                           'wins': 219},
              'de_inferno': {'assists': 2016,
                             'deaths': 7851,
                             'headshots': 4318,
                             'kills': 9170,
                             'matches': 441,
                             'mvps': 1354,
                             'name': 'de_inferno',
                    

## Match *Schema*

| Field Name   |      Type      |  Description |
|----------|:-------------:|------|
| _id |  str | FACEIT match ID |
| entity |    str  |   The kind of match it belongs to: matchmaking, hub or championship (tournament). |
| entityName| str |   The name of the entity in which the match was played.<br/>For the _matchmaking_ entity there are two queues: Free and Premium.<br/>In the other cases the entity name is the name of the hub or tournament.|
| mapPlayed | str |    The CS:GO map on which the match was played. |
| parties | dict |    Dictionary where the keys are the party IDs and the values are<br/> the lists of the players' IDs in each party. |
| score | str |    The score of the match: rounds won by team A followed by rounds won by team B. |
| startTime | int |     Unix timestamp of the match start time. |
| teamA | list |    List of players in team A. For each player the following fields are collected:<ul style="margin:8px"><li>elo</li><li>id</li><li>membership</li><li>playerStats<ul style="margin:8px"><li>kills</li><li>deaths</li><li>assists</li><li>headshots</li><li>mvps</li><li>tripleKills</li><li>quadraKills</li><li>pentaKills</li></ul></li></ul>|
| teamB | list |    Equivalent of teamA but for the players in team B. |
| teams | list |    List composed of teamA and teamB fields.|

In [17]:
sample_match = matches_coll.find_one({})
pprint(sample_match)

{'_id': '1-000004ff-41ce-4326-ae87-0fe580a42b03',
 'entity': 'matchmaking',
 'entityName': 'CS:GO 5v5 PREMIUM',
 'mapPlayed': 'de_inferno',
 'parties': {'1ee0dae1-8eb2-4918-a479-0015296fcc99': ['04ac1e03-7faf-4f7b-b2fa-3133e4ddfa2e',
                                                      '77662fe9-9ea1-4331-b960-e78f24721af2',
                                                      'ddbaeff3-519b-49bf-9ed0-96f440caf851'],
             '3b10e972-5f05-44cd-8bad-4e6a869a67a5': ['3b10e972-5f05-44cd-8bad-4e6a869a67a5'],
             'c2678d2e-8a18-4f5f-8c47-bc6e99e876c2': ['c2678d2e-8a18-4f5f-8c47-bc6e99e876c2'],
             'c6caff55-10c8-45eb-a22a-fc54cdd3dc61': ['c6caff55-10c8-45eb-a22a-fc54cdd3dc61'],
             'ea1a9416-62f8-4b6a-b791-aac5320225f9': ['8952ebf4-f135-4799-a7b7-1d3f66fe90ce',
                                                      'a638f08a-f223-4a08-8662-652d6494c191'],
             'edab45a5-2dd8-48bb-a54d-bc64545d92b5': ['4e6771cd-db7e-480f-853f-112fe616005e',
         

## Lifetime Stats *Schema*

| Field Name   |      Type      |  Description |
|----------|:-------------:|------|
| _id |  ObjectId | MongoDB Document ObjectID|
| mapStats | dict |    Dictionary where the keys are the map names and the values are dictionaries<br/>containing the *Lifetime Stats* of the player on the corresponding map.<br/>The data refers to the lifetime stats right before the start of the match with *_id = matchId*.<br/>The stats for each map include the following fields:<ul style="margin:8px"><li>kills</li><li>deaths</li><li>assists</li><li>matches</li><li>wins</li><li>rounds</li><li>headshots</li><li>mvps</li><li>tripleKills</li><li>quadraKills</li><li>pentaKills</li></ul>|
| matchId | str |    FACEIT match ID |
| playerId | str |   FACEIT account ID formatted as a UUID|
| startTime | int |    Unix timestamp of the match start time.|

In [19]:
sample_lifetime_stats = lifetime_stats_coll.find_one({})
sample_lifetime_stats["mapStats"] = dict(list(sample_lifetime_stats["mapStats"].items())[:2])
pprint(sample_lifetime_stats)

{'_id': ObjectId('6160f42c4100b69cfb8b95be'),
 'mapStats': {'de_cache': {'assists': 46,
                           'deaths': 195,
                           'headshots': 66,
                           'kills': 216,
                           'matches': 12,
                           'mvps': 38,
                           'name': 'de_cache',
                           'pentaKills': 0,
                           'quadraKills': 0,
                           'rounds': 305,
                           'tripleKills': 12,
                           'wins': 10},
              'de_inferno': {'assists': 342,
                             'deaths': 1746,
                             'headshots': 667,
                             'kills': 1843,
                             'matches': 97,
                             'mvps': 287,
                             'name': 'de_inferno',
                             'pentaKills': 2,
                             'quadraKills': 24,
                            

# Create Lifetime Stats


Initially, only the most recent lifetime stats are stored in the DB (`player.mapStats`). To have consistent player lifetime stats for each match without recomputing them every time, the lifetime stats are processed once and stored in the DB.

To do so, one must work backwards through each player's match history, repeatedly subtracting the player's stats in each match from the lifetime stats.
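The backwards subtraction can be sketched as follows. This is only an illustration under stated assumptions, not the actual `create_all_lifetime_stats` implementation: the helper name `rollback_lifetime_stats`, the shape of the per-match records (`matchId`, `map`, `won`, `rounds`, `stats`) and the reduced set of stat fields are all hypothetical.

```python
from copy import deepcopy


def rollback_lifetime_stats(current_map_stats, matches_newest_first):
    """Yield (match_id, lifetime stats right before that match).

    Hypothetical helper: each match record is assumed to be a dict with the
    keys `matchId`, `map`, `won`, `rounds` and `stats`.
    """
    running = deepcopy(current_map_stats)  # never mutate the caller's data
    for match in matches_newest_first:
        map_stats = running[match["map"]]
        # Undo this match's contribution to the per-map totals.
        for field, value in match["stats"].items():
            map_stats[field] -= value
        map_stats["matches"] -= 1
        map_stats["rounds"] -= match["rounds"]
        map_stats["wins"] -= 1 if match["won"] else 0
        # What remains is the state right before the match started.
        yield match["matchId"], deepcopy(running)


current = {
    "de_inferno": {"kills": 50, "deaths": 40, "matches": 2, "rounds": 55, "wins": 1}
}
history = [  # ordered from the most recent to the oldest match
    {"matchId": "m2", "map": "de_inferno", "won": True, "rounds": 30,
     "stats": {"kills": 30, "deaths": 20}},
    {"matchId": "m1", "map": "de_inferno", "won": False, "rounds": 25,
     "stats": {"kills": 20, "deaths": 20}},
]
snapshots = dict(rollback_lifetime_stats(current, history))
```

Each snapshot is a deep copy, so the stats stored for one match are not affected by the subtractions performed for older matches.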

In [None]:
create_all_lifetime_stats()

# Get Complete & Processable Matches


Even though more than one million matches are stored in the DB, not all of them qualify for the dataset.

A valid processable match must have complete player data for every participant: the lifetime stats right before the start of the match as well as the player's *10* previous matches. Two counters are kept per match:

- `match_history`: the match id is in the player's match history and the previous 10 matches are in the DB
- `lifetime_stats`: the lifetime stats of the player regarding the match are available

Both counters are initially set to 0 and incremented each time the corresponding condition is met for one player.

Finally, the match ids are filtered to include only those that have full data for all ten players.
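A minimal sketch of this counting and filtering step is given below. It is not the code behind `processable_matches_complete`: the function name `complete_match_ids` and the input format (one `(match_id, has_match_history, has_lifetime_stats)` tuple per player per match) are assumptions made for illustration.

```python
from collections import Counter

PLAYERS_PER_MATCH = 10  # five players per team


def complete_match_ids(player_checks):
    """Return the ids of matches with full data for all ten players.

    `player_checks` is a hypothetical iterable of
    (match_id, has_match_history, has_lifetime_stats) tuples.
    """
    match_history = Counter()
    lifetime_stats = Counter()
    for match_id, has_history, has_stats in player_checks:
        # Both counters start at 0 and grow by one per player that qualifies.
        match_history[match_id] += int(has_history)
        lifetime_stats[match_id] += int(has_stats)
    return sorted(
        match_id
        for match_id in match_history
        if match_history[match_id] == PLAYERS_PER_MATCH
        and lifetime_stats[match_id] == PLAYERS_PER_MATCH
    )


# Match "a" is complete; match "b" lacks lifetime stats for one player.
checks = [("a", True, True)] * 10 + [("b", True, True)] * 9 + [("b", True, False)]
complete = complete_match_ids(checks)
```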

In [None]:
processable_matches_complete()

# Build Dataset


The dataset is stored in batch files of size 2000, in CSV format.
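Batched CSV output of this kind could look roughly like the sketch below. The helper `write_batches` and the `batch_<n>.csv` naming scheme are hypothetical; the real `build_raw_dataset` and `build_interim_dataset` functions may organise their files differently.

```python
import csv
from itertools import islice
from pathlib import Path

BATCH_SIZE = 2000  # batch size stated above


def write_batches(rows, fieldnames, out_dir, batch_size=BATCH_SIZE):
    """Write an iterable of dict rows into numbered CSV files; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = iter(rows)
    paths = []
    while True:
        # Pull at most `batch_size` rows without loading everything in memory.
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        path = out_dir / f"batch_{len(paths)}.csv"  # hypothetical naming scheme
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(batch)
        paths.append(path)
    return paths
```

Because the rows are consumed lazily, the function can be fed directly from a database cursor or a generator.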

## Raw Dataset

In [None]:
build_raw_dataset(RAW_DATA_DIR, EXTERNAL_DATA_DIR)

## Interim Dataset

Iterate over the matches and, for each player, get the lifetime stats right before the beginning of the match as well as the match ids of the 10 previous matches.

Store each player's data in the corresponding element of the `teamA` and `teamB` lists.
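Looking up the 10 previous matches can be sketched as below, relying on the fact that `matchHistory` is ordered from the most recent to the oldest match. The helper `previous_match_ids` and the `matchId` key of the history entries are assumptions; the actual entry layout is not shown in this notebook.

```python
def previous_match_ids(match_history, match_id, n=10):
    """Return the ids of the `n` matches played right before `match_id`.

    Assumes `match_history` is ordered from newest to oldest and that each
    entry is a dict with a `matchId` key (the key name is an assumption).
    Returns None when the match is missing or fewer than `n` older matches exist.
    """
    ids = [entry["matchId"] for entry in match_history]
    if match_id not in ids:
        return None
    # Older matches come after the current one in a newest-first list.
    start = ids.index(match_id) + 1
    previous = ids[start:start + n]
    return previous if len(previous) == n else None


history = [{"matchId": f"m{i}"} for i in range(12, 0, -1)]  # m12, m11, ..., m1
latest_previous = previous_match_ids(history, "m12")
```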

In [None]:
build_interim_dataset(INTERIM_DATA_DIR, EXTERNAL_DATA_DIR, is_complete=True)

# Performance Indicators Statistics

Compute the *Mean* and *Standard Deviation* for:
* Kills per Round (KPR)
* Survived Rounds per Round (SPR)
* Multikills Score per Round (MKPR)
* Assists per Round (APR)
* MVPs per Round (MVPPR)

The Performance Indicators are needed later on when implementing some of the features. For instance, the Rating feature measures how well a player performed compared to the mean values over all matches. These values can also be used for missing data imputation whenever a certain feature is not computable.
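As a rough illustration of these statistics, the sketch below computes per-round indicators and their mean and standard deviation. It is not the `create_performance_indicators` implementation: the helper names are hypothetical, SPR is assumed to be `(rounds - deaths) / rounds`, the population standard deviation is an arbitrary choice, and MKPR is left out because the multikill scoring weights are not given in this notebook.

```python
from statistics import mean, pstdev


def per_round_indicators(stats):
    """Per-round indicators for one player in one match.

    `stats` is assumed to hold `kills`, `deaths`, `assists`, `mvps` and `rounds`.
    """
    rounds = stats["rounds"]
    return {
        "KPR": stats["kills"] / rounds,
        "SPR": (rounds - stats["deaths"]) / rounds,  # assumed definition
        "APR": stats["assists"] / rounds,
        "MVPPR": stats["mvps"] / rounds,
    }


def summarize(all_stats):
    """Mean and (population) standard deviation of each indicator."""
    rows = [per_round_indicators(stats) for stats in all_stats]
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        summary[name] = {"mean": mean(values), "std": pstdev(values)}
    return summary


sample = [
    {"kills": 20, "deaths": 15, "assists": 5, "mvps": 3, "rounds": 25},
    {"kills": 10, "deaths": 20, "assists": 5, "mvps": 2, "rounds": 25},
]
summary = summarize(sample)
```

With such a summary, a player's indicator can be compared against the mean, or the mean can serve as a fallback value when an indicator cannot be computed.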

In [None]:
create_performance_indicators()