# Data Preprocessing

Before we start conducting a detailed Exploratory Data Analysis (EDA), we need to build our final DataFrame by merging all the potentially valuable features for predicting fair matchmaking. This will include consolidating data from all the previously cleaned files and creating new features as needed.

---

## Initial Setup

In [1]:
# ---------------- Suppress all future warnings ---------------- #
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

# ---------------- Basic Data Science Libraries ---------------- #
import numpy as np # Linear algebra
import pandas as pd # Data processing
import random
import dask.dataframe as dd # Data processing for large DataFrames

# ---------------- System Libraries ---------------- #
import os # Miscellaneous operating system interfaces
import gc # Garbage collector interface
import ast
import nbimporter # Use functions from other Jupyter Notebooks'
from subprocess import check_output # Saves results written to the current directory as output

# ---------------- Plotting Libraries ---------------- #
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# ---------------- TrueSkill Library ---------------- #
import trueskill
import itertools
import math

# Function obtained from the documentation found in https://trueskill.org/
def win_probability(team1, team2):
    delta_mu = sum(r.mu for r in team1) - sum(r.mu for r in team2)
    sum_sigma = sum(r.sigma ** 2 for r in itertools.chain(team1, team2))
    size = len(team1) + len(team2)
    ts = trueskill.global_env()
    BETA = ts.beta
    denom = math.sqrt(size * (BETA * BETA) + sum_sigma)
    return ts.cdf(delta_mu / denom)

# Function to obtain a conservative skill rating
def conservative_trueskill_rating(mu, sigma):
    return mu - (3 * sigma)

# ---------------- Define Clean and Raw Directories ---------------- #
clean_folder = '../Data/Clean'
raw_folder = '../Data/Raw'

# ---------------- Set new DataFrame limiters ---------------- #
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 300)

# ---------------- Print files in my clean data folder ---------------- #
print(check_output(['ls', '../Data/Clean']).decode('utf8'))

ability_ids.csv
ability_upgrades.csv
chat.csv
eng_chat.csv
hero_ids.csv
item_ids.csv
matches.csv
mmr.csv
objectives.csv
patch_dates.csv
player_time.csv
players.csv
positions.csv
prev_outcomes.csv
purchase_log.csv
regions.csv
teamfights.csv
teamfights_players.csv
test_players.csv
trueskill.csv



---

## Feature Selection

Let's begin with our most crucial files in the dataset: matches.csv and players.csv. These contain the most information for each player per game.

In [2]:
# Load required files
players = pd.read_csv(clean_folder + '/players.csv', index_col=0)
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))

matches = pd.read_csv(clean_folder + '/matches.csv', index_col=0)
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))

players: 500,000 observations, 73 features
matches: 50,000 observations, 13 features


It is challenging to consider adding more features from other files when our players' DataFrame already has 73 features. To simplify the process of creating fair matchmaking, we should reduce the number of features by removing those with a high number of null values, as well as those that have little impact on the match outcome.

In [3]:
# List unwanted features from the players DataFrame
features_to_drop = [
    'unit_order_none', 'unit_order_move_to_position', 'unit_order_move_to_target', 
    'unit_order_attack_move', 'unit_order_attack_target', 'unit_order_cast_position', 
    'unit_order_cast_target', 'unit_order_cast_target_tree', 'unit_order_cast_no_target', 
    'unit_order_cast_toggle', 'unit_order_hold_position', 'unit_order_train_ability', 
    'unit_order_drop_item', 'unit_order_give_item', 'unit_order_pickup_item', 
    'unit_order_pickup_rune', 'unit_order_purchase_item', 'unit_order_sell_item', 
    'unit_order_disassemble_item', 'unit_order_move_item', 'unit_order_cast_toggle_auto', 
    'unit_order_stop', 'unit_order_buyback', 'unit_order_glyph', 
    'unit_order_eject_item_from_stash', 'unit_order_cast_rune', 'unit_order_ping_ability', 
    'unit_order_move_to_direction', 'gold_abandon', 'gold_sell', 
    'gold_destroying_structure', 'gold_killing_couriers', 'match_slot_id'
]

# Drop the features
players = players.drop(columns=features_to_drop)
players.shape

(500000, 40)

Now that we have 40 features in our players' DataFrame, let's group together the features that can provide more insight into the overall team performance.

In [4]:
# Player Performance Features
player_features = ['kills', 'deaths', 'assists', 'denies', 'gold', 'gold_spent']

# Define categorical features
players['cluster'] = players['cluster'].astype('category')
players['hero_id'] = players['hero_id'].astype('category')
players['player_slot'] = players['player_slot'].astype('category')

# Display player features
players[player_features].head()

Unnamed: 0,kills,deaths,assists,denies,gold,gold_spent
0,9,3,18,1,3261,10960
1,13,3,18,9,2954,17760
2,0,4,15,1,110,12195
3,8,4,19,6,1179,22505
4,20,3,17,13,3307,23825


In [5]:
# Match Features
match_features = ['match_id', 'start_time', 'tower_status_radiant', 
                  'tower_status_dire', 'barracks_status_dire', 
                  'barracks_status_radiant', 'first_blood_time']
matches['start_time'] = pd.to_datetime(matches['start_time'], unit='s')
matches[match_features].head()

Unnamed: 0,match_id,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
0,0,2015-11-05 19:01:52,1982,4,3,63,1
1,1,2015-11-05 19:51:18,0,1846,63,0,221
2,2,2015-11-05 23:03:06,256,1972,63,48,190
3,3,2015-11-05 23:22:03,4,1924,51,3,40
4,4,2015-11-06 07:53:05,2047,0,0,63,58


In [6]:
# Merge the match features to the players DataFrame
players = players.merge(matches[match_features], on='match_id', how='left')
display(players.head(20))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
0,0,1,Double T,0,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
2,0,1,Trash!!!,0,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
5,0,0,4,4,106,128,476,12285,397,524,5,6,8,5,162,0.0,10725,0,112,145,73,149,48,212,0,19,0,8502,12259,0,1,0,-2394,-2240,5281,6193,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
6,0,0,6k Slayer,0,102,129,317,10355,303,369,4,13,5,2,107,0.0,15028,764,0,50,11,102,36,185,81,16,0,5201,9417,0,1,0,-3287,0,3396,4356,0,18,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
7,0,0,ｔｏｍｉａ～♥,5,46,130,2390,13395,452,517,4,8,6,31,208,0.0,10230,0,2438,41,63,36,147,168,21,19,0,6853,13396,0,244,107,-3682,0,4350,8797,0,6,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
8,0,0,-,0,7,131,475,5035,189,223,1,14,8,0,27,67.0277,4774,0,0,36,0,0,46,0,180,12,0,4798,4038,0,27,0,-3286,-39,2127,1089,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
9,0,0,u didnt see who highest here?,6,73,132,60,17550,496,456,1,11,6,0,147,60.9748,6398,292,0,63,9,116,65,229,79,18,0,6659,10471,0,933,5679,-4039,-1063,2685,7011,0,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1


2019

---

## Feature Engineering

### Players DataFrame

#### Dealing with Player Anonymity

To achieve optimal results in our modelling stage, it is crucial to maximize player identification and treat truly anonymous players as individual entities *(i.e., as new players)*. Identifying as many players as possible will enhance the accuracy and effectiveness of our modelling process. We can achieve this by replacing the 0's from `account_id` with the string value from our `account` feature and recreating our `match_slot_id` from the Data Cleanup notebook for the truly anonymous.

In [7]:
# Look at the head from our players DataFrame
players.head()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
0,0,1,Double T,0,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
2,0,1,Trash!!!,0,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1


In [8]:
# Replacing the 0 values from account_id
for idx, row in players.iterrows():
    if row['account_id'] == 0:
        if row['account'] == '-':
            players.loc[idx, 'account_id'] = str(row['match_id'])+'_'+str(row['player_slot'])
        else:
            players.loc[idx, 'account_id'] = row['account']
    else:
        players.loc[idx, 'account_id'] = str(row['account_id']) # Might throw some errors when calculating the initial TrueSkill data

print(players['account_id'].dtype)
display(players[players['account'] == '-'].sample(5))

object


Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
92128,9212,1,-,9212_131,48,131,934,10695,290,313,2,9,17,0,62,16.7823,6043,299,769,63,181,46,152,92,40,15,0,4839,7366,596,258,98,-2391,0,2385,2517,200,0,2499,156,2015-11-13 13:41:55,4,1574,63,51,119
100318,10031,0,-,10031_131,26,131,539,12115,304,264,4,15,14,0,25,29.7553,6758,0,214,1,218,37,29,102,36,15,0,8155,3648,0,177,150,-5115,0,4152,736,600,0,2716,122,2015-11-13 16:54:01,0,384,48,3,268
310534,31053,1,-,31053_4,30,4,7384,17180,368,479,10,9,18,8,120,79.5375,22077,2180,2210,50,116,36,96,42,108,25,0,20199,12242,0,128,50,-2241,0,9626,4707,0,0,4077,132,2015-11-15 21:28:30,1540,0,16,63,137
58126,5812,1,-,5812_129,21,129,4056,14335,561,529,5,3,4,12,198,37.6912,9102,0,4942,1,108,69,50,0,168,18,0,3194,12752,1043,104,0,-777,0,1828,8061,835,0,1935,224,2015-11-12 23:35:51,4,2046,63,3,152
140002,14000,1,-,14000_2,15,2,5482,17760,560,652,12,8,21,9,187,0.0,14297,0,3707,147,63,108,112,46,0,23,0,14725,13664,0,72,0,-2482,0,7947,7655,0,0,2614,204,2015-11-14 03:17:09,1830,0,1,63,126


#### Hero Attributes

In addition to capturing the hero IDs, it is important for us to gather detailed insights into the decision-making process behind each player's hero selections, as well as their capabilities in different roles with these chosen heroes. As a result, it is essential to extract this information from our `hero_id` file and meticulously process it to enrich our DataFrame with valuable insights.

In [9]:
# Load required files
heroes = pd.read_csv(clean_folder + '/hero_ids.csv')
print(f'heroes:', '{:,} observations, {:,} features'.format(heroes.shape[0], heroes.shape[1]))

heroes: 112 observations, 4 features


In [10]:
# Look at the head
heroes.head()

Unnamed: 0,hero_id,name,primary_attribute,roles
0,1,Anti-Mage,agi,"{'Nuker', 'Escape', 'Carry'}"
1,2,Axe,str,"{'Carry', 'Disabler', 'Durable', 'Initiator'}"
2,3,Bane,all,"{'Support', 'Disabler', 'Durable', 'Nuker'}"
3,4,Bloodseeker,agi,"{'Nuker', 'Initiator', 'Disabler', 'Carry'}"
4,5,Crystal Maiden,int,"{'Support', 'Disabler', 'Nuker'}"


The first thing we can notice is that these roles were saved as a set to avoid any duplicated values. We need to confirm that the file was read correctly and ensure that we can read the feature values as sets. Once we accomplish this, we'll be able to one-hot encode the roles that each hero can fulfill during a match.

In [11]:
# Convert the roles feature from strings to actual sets
heroes['roles'] = heroes['roles'].apply(ast.literal_eval)

# Extract all unique roles
unique_hero_roles = set(role for roles in heroes['roles'] for role in roles) # To ensure we won't have duplicated values
unique_hero_roles = list(unique_hero_roles) # To convert to columns later

# Create a new DataFrame for one-hot encoding the roles
one_hot_encoded_roles = pd.DataFrame(0, index=heroes.index, columns=unique_hero_roles)

# Fill the DataFrame with ones for each role covered by a hero
for idx, roles in enumerate(heroes['roles']):
    for role in roles:
        one_hot_encoded_roles.at[idx, role] = 1

# Concatenate to original DataFrame and dropping original feature
heroes = pd.concat([heroes, one_hot_encoded_roles], axis=1)
heroes.drop(columns='roles', inplace=True)

# Look at the head
heroes.head()

Unnamed: 0,hero_id,name,primary_attribute,Nuker,Durable,Initiator,Pusher,Escape,Disabler,Support,Carry
0,1,Anti-Mage,agi,1,0,0,0,1,0,0,1
1,2,Axe,str,0,1,1,0,0,1,0,1
2,3,Bane,all,1,1,0,0,0,1,1,0
3,4,Bloodseeker,agi,1,0,1,0,0,1,0,1
4,5,Crystal Maiden,int,1,0,0,0,0,1,1,0


Now that we have our one-hot encoded roles in place, it is time to merge them into the players' DataFrame.

In [12]:
# Rename the columns before merge
heroes.rename(columns={
    'primary_attribute': 'hero_primary_attribute',
    'Carry': 'hero_role_carry',
    'Escape': 'hero_role_escape',
    'Durable': 'hero_role_durable',
    'Pusher': 'hero_role_pusher',
    'Initiator': 'hero_role_initiator',
    'Disabler': 'hero_role_disabler',
    'Nuker': 'hero_role_nuker',
    'Support': 'hero_role_support'
}, inplace=True)

# Merge into the players' DataFrame
players = players.merge(right=heroes.drop(columns='name'), on='hero_id', how='left')

# Sanity check
display(players.head(20))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_nuker,hero_role_durable,hero_role_initiator,hero_role_pusher,hero_role_escape,hero_role_disabler,hero_role_support,hero_role_carry
0,0,1,Double T,Double T,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,all,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
2,0,1,Trash!!!,Trash!!!,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
5,0,0,4,4,106,128,476,12285,397,524,5,6,8,5,162,0.0,10725,0,112,145,73,149,48,212,0,19,0,8502,12259,0,1,0,-2394,-2240,5281,6193,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0
6,0,0,6k Slayer,6k Slayer,102,129,317,10355,303,369,4,13,5,2,107,0.0,15028,764,0,50,11,102,36,185,81,16,0,5201,9417,0,1,0,-3287,0,3396,4356,0,18,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,all,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
7,0,0,ｔｏｍｉａ～♥,5,46,130,2390,13395,452,517,4,8,6,31,208,0.0,10230,0,2438,41,63,36,147,168,21,19,0,6853,13396,0,244,107,-3682,0,4350,8797,0,6,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
8,0,0,-,0_131,7,131,475,5035,189,223,1,14,8,0,27,67.0277,4774,0,0,36,0,0,46,0,180,12,0,4798,4038,0,27,0,-3286,-39,2127,1089,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
9,0,0,u didnt see who highest here?,6,73,132,60,17550,496,456,1,11,6,0,147,60.9748,6398,292,0,63,9,116,65,229,79,18,0,6659,10471,0,933,5679,-4039,-1063,2685,7011,0,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0


1232

#### Team Features

To fully grasp the extent of each player's impact on the team, it is essential to calculate the ratio of each player's individual contribution to the overall team statistics for every match. This involves extracting specific performance metrics and consolidating them for each team to obtain a comprehensive understanding of player involvement.

##### Overall Team Performance

In [13]:
# Aggregate team stats
team_stats = players.groupby(['match_id', 'player_slot'])[player_features].sum().reset_index()
team_stats['radiant_team'] = team_stats['player_slot'].apply(lambda x: 1 if x < 5 else 0)

# Aggregating them by team
team_features = team_stats.groupby(['match_id', 'radiant_team'], observed=False)[player_features].sum().reset_index()

# Rename columns for merge
team_features.rename(columns={'kills': 'team_kills', 
                              'deaths': 'team_deaths', 
                              'assists': 'team_assists', 
                              'denies': 'team_denies', 
                              'gold': 'team_gold', 
                              'gold_spent': 'team_gold_spent'}, inplace=True)

display(team_stats.head(10))
display(team_features.head(10))

Unnamed: 0,match_id,player_slot,kills,deaths,assists,denies,gold,gold_spent,radiant_team
0,0,0,9,3,18,1,3261,10960,1
1,0,1,13,3,18,9,2954,17760,1
2,0,2,0,4,15,1,110,12195,1
3,0,3,8,4,19,6,1179,22505,1
4,0,4,20,3,17,13,3307,23825,1
5,0,128,5,6,8,5,476,12285,0
6,0,129,4,13,5,2,317,10355,0
7,0,130,4,8,6,31,2390,13395,0
8,0,131,1,14,8,0,475,5035,0
9,0,132,1,11,6,0,60,17550,0


Unnamed: 0,match_id,radiant_team,team_kills,team_deaths,team_assists,team_denies,team_gold,team_gold_spent
0,0,0,15,52,33,38,3718,58620
1,0,1,50,17,87,30,10811,87245
2,1,0,50,37,83,16,9085,107750
3,1,1,35,53,49,27,4776,69310
4,2,0,48,22,90,16,11177,81620
5,2,1,22,49,31,10,2494,54990
6,3,0,63,65,110,29,5954,94430
7,3,1,64,66,92,32,6455,76685
8,4,0,16,37,30,21,2030,38980
9,4,1,37,16,59,26,14099,78980


##### KDA Scores

KDA Scores can be calculated using the following formula:

$\displaystyle{KDA = \frac{kills + assists}{deaths +1}}$

In [14]:
# Calculate KDA Scores
kda_score = (team_stats['kills'] + team_stats['assists']) / (team_stats['deaths'] + 1)
team_stats.insert(2, column='kda', value=kda_score)
team_stats.drop(columns=['kills', 'deaths', 'assists'], inplace=True)

team_kda_score = (team_features['team_kills'] + team_features['team_assists']) / (team_features['team_deaths'] + 1)
team_features.insert(2, column='team_kda', value=team_kda_score)
team_features.drop(columns=['team_kills', 'team_deaths', 'team_assists'], inplace=True)

display(team_stats.head(10))
display(team_features.head(10))

Unnamed: 0,match_id,player_slot,kda,denies,gold,gold_spent,radiant_team
0,0,0,6.75,1,3261,10960,1
1,0,1,7.75,9,2954,17760,1
2,0,2,3.0,1,110,12195,1
3,0,3,5.4,6,1179,22505,1
4,0,4,9.25,13,3307,23825,1
5,0,128,1.857143,5,476,12285,0
6,0,129,0.642857,2,317,10355,0
7,0,130,1.111111,31,2390,13395,0
8,0,131,0.6,0,475,5035,0
9,0,132,0.583333,0,60,17550,0


Unnamed: 0,match_id,radiant_team,team_kda,team_denies,team_gold,team_gold_spent
0,0,0,0.90566,38,3718,58620
1,0,1,7.611111,30,10811,87245
2,1,0,3.5,16,9085,107750
3,1,1,1.555556,27,4776,69310
4,2,0,6.0,16,11177,81620
5,2,1,1.06,10,2494,54990
6,3,0,2.621212,29,5954,94430
7,3,1,2.328358,32,6455,76685
8,4,0,1.210526,21,2030,38980
9,4,1,5.647059,26,14099,78980


Now let's calculate the ratios for each player.

In [15]:
# Calculate participation ratios
team_stats = team_stats.merge(team_features, on=['match_id', 'radiant_team'], how='left')
for col in team_stats.columns:
    if col.startswith('team_'):
        player_col = col.split('team_')
        team_stats[col] = team_stats[player_col[1]] / team_stats[col]

display(team_stats.head(10))

Unnamed: 0,match_id,player_slot,kda,denies,gold,gold_spent,radiant_team,team_kda,team_denies,team_gold,team_gold_spent
0,0,0,6.75,1,3261,10960,1,0.886861,0.033333,0.301637,0.125623
1,0,1,7.75,9,2954,17760,1,1.018248,0.3,0.27324,0.203565
2,0,2,3.0,1,110,12195,1,0.394161,0.033333,0.010175,0.139779
3,0,3,5.4,6,1179,22505,1,0.709489,0.2,0.109056,0.257952
4,0,4,9.25,13,3307,23825,1,1.215328,0.433333,0.305892,0.273082
5,0,128,1.857143,5,476,12285,0,2.050595,0.131579,0.128026,0.20957
6,0,129,0.642857,2,317,10355,0,0.709821,0.052632,0.085261,0.176646
7,0,130,1.111111,31,2390,13395,0,1.226852,0.815789,0.642819,0.228506
8,0,131,0.6,0,475,5035,0,0.6625,0.0,0.127757,0.085892
9,0,132,0.583333,0,60,17550,0,0.644097,0.0,0.016138,0.299386


Finally, we are able to proceed with merging the team participation ratios into our players' DataFrame.

In [16]:
# Merge to the players DataFrame
players = players.merge(team_stats.drop(columns=['denies', 'gold', 'gold_spent']), 
                        on=['match_id', 'player_slot'], how='left')

# Define radiant_team as categorical
players['radiant_team'] = players['radiant_team'].astype('category')

# Sanity check
display(players.head(20))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_nuker,hero_role_durable,hero_role_initiator,hero_role_pusher,hero_role_escape,hero_role_disabler,hero_role_support,hero_role_carry,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent
0,0,1,Double T,Double T,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,all,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,7.75,1,1.018248,0.3,0.27324,0.203565
2,0,1,Trash!!!,Trash!!!,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,3.0,1,0.394161,0.033333,0.010175,0.139779
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.4,1,0.709489,0.2,0.109056,0.257952
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,9.25,1,1.215328,0.433333,0.305892,0.273082
5,0,0,4,4,106,128,476,12285,397,524,5,6,8,5,162,0.0,10725,0,112,145,73,149,48,212,0,19,0,8502,12259,0,1,0,-2394,-2240,5281,6193,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.857143,0,2.050595,0.131579,0.128026,0.20957
6,0,0,6k Slayer,6k Slayer,102,129,317,10355,303,369,4,13,5,2,107,0.0,15028,764,0,50,11,102,36,185,81,16,0,5201,9417,0,1,0,-3287,0,3396,4356,0,18,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,all,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.642857,0,0.709821,0.052632,0.085261,0.176646
7,0,0,ｔｏｍｉａ～♥,5,46,130,2390,13395,452,517,4,8,6,31,208,0.0,10230,0,2438,41,63,36,147,168,21,19,0,6853,13396,0,244,107,-3682,0,4350,8797,0,6,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.111111,0,1.226852,0.815789,0.642819,0.228506
8,0,0,-,0_131,7,131,475,5035,189,223,1,14,8,0,27,67.0277,4774,0,0,36,0,0,46,0,180,12,0,4798,4038,0,27,0,-3286,-39,2127,1089,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.6,0,0.6625,0.0,0.127757,0.085892
9,0,0,u didnt see who highest here?,6,73,132,60,17550,496,456,1,11,6,0,147,60.9748,6398,292,0,63,9,116,65,229,79,18,0,6659,10471,0,933,5679,-4039,-1063,2685,7011,0,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.583333,0,0.644097,0.0,0.016138,0.299386


0

#### Teamfights

Team fights are a great way to assess each team's performance and coordination. Engineering features that help us analyze the dynamics and results of team fights can provide valuable insights into player performance, team synergy, and the influence of team fights on the overall match result.

In [17]:
# Load required files
tf_players = pd.read_csv(clean_folder + '/teamfights_players.csv', index_col=0)
print(f'tf_players:', '{:,} observations, {:,} features'.format(tf_players.shape[0], tf_players.shape[1]))

tf_players: 5,390,470 observations, 9 features


In [18]:
# Teamfight Participation
def count_values_not_zero(series):
    return (series > 0).sum()

player_teamfights = tf_players.groupby(['match_id', 'player_slot'])['damage'].agg(count_values_not_zero).reset_index(name='teamfights')

# Teamfight Performance
tf_player_damage = tf_players.groupby(['match_id', 'player_slot'])['damage'].sum().reset_index(name='tf_damage_dealt')
tf_player_buybacks = tf_players.groupby(['match_id', 'player_slot'])['buybacks'].sum().reset_index(name='tf_buybacks')
tf_player_deaths = tf_players.groupby(['match_id', 'player_slot'])['deaths'].sum().reset_index(name='tf_deaths')

# Teamfight Impact
tf_player_gold_delta = tf_players.groupby(['match_id', 'player_slot'])['gold_delta'].mean().reset_index(name='tf_avg_gold_delta')
tf_player_xp_delta = tf_players.groupby(['match_id', 'player_slot']).apply(lambda x: (x['xp_end'] - x['xp_start']).mean()).reset_index(name='tf_avg_xp_delta')

# Merge all features in a single DataFrame
player_teamfights = player_teamfights.merge(tf_player_damage, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_buybacks, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_deaths, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_gold_delta, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_xp_delta, on=['match_id', 'player_slot'], how='left')


# Display the head and shape of player_teamfights
display(player_teamfights.head(20))
print(f'player_teamfights:', '{:,} observations, {:,} features'.format(player_teamfights.shape[0], player_teamfights.shape[1]))

Unnamed: 0,match_id,player_slot,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta
0,0,0,10,6099,0,2,329.166667,538.333333
1,0,1,10,13663,0,4,409.833333,1112.25
2,0,2,7,1155,1,3,123.666667,495.166667
3,0,3,9,15201,0,4,317.333333,795.75
4,0,4,12,30774,1,2,460.583333,1189.416667
5,0,128,10,23616,2,5,86.75,731.666667
6,0,129,9,12807,0,4,211.583333,516.75
7,0,130,8,15988,0,5,193.0,610.25
8,0,131,10,5718,1,9,-29.833333,401.75
9,0,132,10,9786,1,9,-65.75,639.5


player_teamfights: 499,310 observations, 8 features


It's odd that we are missing 690 observations in this new DataFrame. It's possible that these players did not engage in any teamfights, either by avoiding them entirely or due to thrown matches.

In [19]:
# Look for original match_ids
print('Total matches in original:', tf_players['match_id'].nunique())

Total matches in original: 49931


After verifying that there are no random missing values, instead of disregarding specific matches, let's merge these new features into our players' DataFrame and then examine the observations with null values. We will investigate the teamfight missing values after exploring the matches DataFrame.

In [20]:
# Merging to the players' DataFrame
players = players.drop(columns=['kills', 'deaths', 'assists'])\
                .merge(player_teamfights, on=['match_id', 'player_slot'], how='left')
display(players[players['teamfights'].isna()].sample(50))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_nuker,hero_role_durable,hero_role_initiator,hero_role_pusher,hero_role_escape,hero_role_disabler,hero_role_support,hero_role_carry,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta
314388,31438,1,36382,36382,46,131,515,585,479,487,1,6,0.0,304,0,0,241,241,16,75,16,0,2,0,32,351,0,100,100,0,0,28,247,0,0,59,123,2015-11-15 22:10:50,2047,2047,63,63,0,agi,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0,0.142857,0.333333,0.264238,0.171053,,,,,,
366471,36647,0,649,649,11,1,1,820,368,43,0,3,0.0,368,0,0,75,16,16,0,0,0,2,1,0,412,0,91,0,0,0,66,128,0,0,681,204,2015-11-16 16:21:14,1920,2047,63,48,84,agi,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,0.0,,0.000134,0.292857,,,,,,
67063,6706,1,ImFaded,ImFaded,75,3,1656,3910,268,207,3,9,0.0,3240,0,497,34,102,0,29,44,0,8,0,689,2619,0,202,74,-179,0,379,298,0,2,1015,121,2015-11-13 02:08:30,2047,900,51,63,257,int,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.5,1,0.088235,0.107143,0.188032,0.125683,,,,,,
476134,47613,0,NooBaSaurus,83246,15,4,1,2820,155,172,5,34,0.0,3376,0,0,0,0,0,0,0,0,9,3,0,4578,0,126,100,-478,0,0,1404,0,2,1634,121,2015-11-17 23:28:26,0,2047,63,0,167,agi,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1,0.0,0.714286,0.000351,0.134831,,,,,,
384594,38459,1,-,38459_4,100,4,332,1025,171,254,0,2,1.06641,1306,0,0,20,29,39,0,16,16,4,0,160,1058,0,0,0,-89,0,261,81,0,0,286,138,2015-11-16 21:13:36,2047,2047,63,63,71,str,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.5,1,0.083333,0.0,0.170082,0.143057,,,,,,
196161,19616,1,5184,5184,80,1,2336,8265,520,565,0,172,0.0,920,0,1679,214,0,46,182,0,0,14,0,348,10387,0,90,58,-329,0,201,6607,0,0,1149,112,2015-11-14 19:26:30,2047,1830,63,63,209,all,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.5,1,0.214286,0.0,0.442089,0.26011,,,,,,
384593,38459,1,131997,131997,49,3,383,1860,317,440,2,20,0.0,740,0,51,44,29,16,41,0,182,6,0,0,1935,0,171,158,0,0,0,881,0,0,286,138,2015-11-16 21:13:36,2047,2047,63,63,71,str,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1,0.0,0.2,0.196209,0.259595,,,,,,
409935,40993,0,136450,136450,107,128,1,595,111,42,0,1,0.0,0,0,0,0,0,0,0,0,0,1,3,0,154,0,0,0,0,0,0,42,0,0,217,171,2015-11-17 03:44:40,2047,2047,63,63,5,str,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0,0.0,0.0,0.00036,0.147643,,,,,,
482598,48259,0,-,48259_131,101,131,1,1110,110,18,0,1,0.0,1504,0,0,0,38,44,15,20,0,2,4,97,226,0,0,-100,0,0,135,43,0,0,1030,171,2015-11-18 01:20:24,2047,256,48,63,5,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,2.0,0,2.0,0.0,0.000703,0.046766,,,,,,
139145,13914,1,~SweetGuy~,34499,7,128,159,1245,645,164,0,1,1.03308,1055,0,0,16,44,41,16,0,0,1,0,122,51,0,0,0,0,0,535,38,0,2,63,112,2015-11-14 03:02:04,2047,2047,63,63,0,str,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,2.0,0,0.444444,0.0,0.194614,0.270652,,,,,,


0

It seems that some of these players have a leaver status, and others have very poor stats overall, suggesting that these matches may have been thrown. However, others seem to have decent stats across the board, indicating that the data could have been corrupted before manipulation. Let's check how many null values we have from our engineered features so far.

#### TrueSkill

We must consistently evaluate each player's skill level as they engage in matches to ensure a fair matchmaking process. Our data cleanup notebook includes a file with TrueSkill ratings for each player based on previous match results. However, it's important to note that some players in our current dataset may not be included in the TrueSkill file, which is a potential factor that needs to be addressed.

In [21]:
# Load required files
trueskill_df = pd.read_csv(clean_folder + '/trueskill.csv')
print(f'trueskill:', '{:,} observations, {:,} features'.format(trueskill_df.shape[0], trueskill_df.shape[1]))

trueskill: 834,226 observations, 6 features


In [22]:
trueskill_df.head()

Unnamed: 0,account_id,total_wins,total_matches,trueskill_mu,trueskill_sigma,conservative_skill_estimate
0,236579,14,24,27.868035,5.212361,12.230953
1,-343,1,1,26.544163,8.065475,2.347736
2,-1217,1,1,26.521103,8.114989,2.176136
3,-1227,1,1,27.248025,8.092217,2.971375
4,-1284,0,1,22.931016,8.092224,-1.345657


We need to extract the values from the `trueskill.csv` file and add them to our players' DataFrame. Additionally, we should assign an initial ranking to players who are not listed in the original file. To establish suitable parameters, we should set our environment using the average TrueSkill $\mu$ and $\sigma$. We can assume that the unrated players have a similar ranking to the ones in the file.

In [23]:
# Set up the global parameters
trueskill.setup(mu=trueskill_df['trueskill_mu'].mean(), 
                sigma=trueskill_df['trueskill_sigma'].mean(),
                draw_probability=0)

# Instantiate the ratings dictionary
ts_init_ratings = {}

# Get the list of unique account IDs from the players DataFrame
account_ids = players['account_id'].unique().tolist()

# Read the mu and sigma from the file and append them to ts_ratings
for account in account_ids:
    try:
        account_row = trueskill_df[trueskill_df['account_id'] == int(account)]
    
        if not account_row.empty:
            ts_mu = account_row['trueskill_mu'].values[0]
            ts_sigma = account_row['trueskill_sigma'].values[0]
            ts_init_ratings[account] = trueskill.Rating(mu=ts_mu, sigma=ts_sigma)
        else:
            ts_init_ratings[account] = trueskill.Rating()
            
    except ValueError:
        ts_init_ratings[account] = trueskill.Rating()

print('Total account IDs in players DF: {:,}'.format(len(account_ids)))
print('Total keys in initial ratings dictionary: {:,}'.format(len(ts_init_ratings)))

Total account IDs in players DF: 311,717
Total keys in initial ratings dictionary: 311,717


In [24]:
# Check a random player's initial TrueSkill
rand_player_ts = '80491'
display(f'{rand_player_ts} TrueSkill:', ts_init_ratings[rand_player_ts])

# Sanity check from our dictionary
trueskill_df[trueskill_df['account_id'] == int(rand_player_ts)][['account_id', 'trueskill_mu', 'trueskill_sigma']]

'80491 TrueSkill:'

trueskill.Rating(mu=24.638, sigma=6.284)

Unnamed: 0,account_id,trueskill_mu,trueskill_sigma
119612,80491,24.637552,6.284008


Our dictionary now holds the TrueSkill ratings for all our players. Let's test the `win_probability` function from the TrueSkill library to assess its potential usefulness in the future.

In [25]:
# Test the win_probability function with one match's results
rand_match_id = random.choice(players['match_id'].unique().tolist())
print('Match ID picked:', rand_match_id)

team_radiant = []
team_dire = []

for i in range(5):
    team_radiant.append(ts_init_ratings[players[(players['match_id']==rand_match_id) &\
                        (players['player_slot']==i)]['account_id'].values[0]])

for i in range(128,133):
    team_dire.append(ts_init_ratings[players[(players['match_id']==rand_match_id) &\
                     (players['player_slot']==i)]['account_id'].values[0]])

print('Match quality:', trueskill.quality([team_radiant, team_dire]))
print('Win probability:', win_probability(team_radiant, team_dire))
print('True outcome:', matches[matches['match_id']==rand_match_id]['radiant_win'])

Match ID picked: 15971
Match quality: 0.5830021980251764
Win probability: 0.4617500254518506
True outcome: 15971    0
Name: radiant_win, dtype: int64


After establishing our initial ratings for each player, we can proceed by creating a function that will update the ratings after each match. This function should also incorporate a mechanism to penalize players who intentionally disconnect from a match, thus putting their team at a disadvantage.

After receiving the latest rankings, we can leverage them to evaluate the expected quality of a match, allowing us to ascertain whether it was fair and well-balanced or unfair.

In [26]:
# Function to determine leaver status weights
def get_leaver_weight(leaver_status):
    leaver_weights = {
        0: 1.0, # NONE - finished match, no abandon.
        1: 0.5, # DISCONNECTED - player DC, no abandon.
        2: 0.3, # DISCONNECTED_TOO_LONG - player DC > 5min, abandoned.
        3: 0.1, # ABANDONED - player DC, clicked leave, abandoned.
        4: 0.01, # AFK - player AFK, abandoned.
        5: 0.0, # NEVER_CONNECTED - player never connected, no abandon.
        6: 0.0, # NEVER_CONNECTED_TOO_LONG - player took too long to connect, no abandon.
    }
    return leaver_weights.get(leaver_status, 1.0)

# Sort the players DataFrame in chronological order
ts_players = players.sort_values(by=['start_time', 'match_id', 'player_slot'])

# Instantiate updated ratings' dictionary
ts_updated_ratings = ts_init_ratings.copy()

# Function to update the ratings based on a single match
def update_ratings(match_):
    global ts_updated_ratings

    # Extract player data from the match
    match_account_ids = match_['account_id'].tolist()
    match_leaver_status = match_['leaver_status'].tolist()
    match_outcome = match_['match_outcome'].tolist()
    
    # Get the current ratings and leaver weights
    ratings = [ts_updated_ratings[acc_id] for acc_id in match_account_ids]
    weights = [get_leaver_weight(ls) for ls in match_leaver_status]

    player_weights = {}
    for idx, weight in enumerate(weights):
        if idx < 5:
            key = (0, match_account_ids[idx])
        else:
            key = (1, match_account_ids[idx])
        player_weights[key] = weight    
    
    # Split players into teams
    radiant = {}
    dire = {}
    for idx, player in enumerate(ratings):
        if idx < 5:
            radiant[match_account_ids[idx]] = ratings[idx]
        else:
            dire[match_account_ids[idx]] = ratings[idx]

    # Lists to store TrueSkill values before each match
    trueskill_mu_list = []
    trueskill_sigma_list = []
    cons_trueskill_list = []
    
    # Add the current TrueSkill rating prior the match
    for rating in ratings:
        trueskill_mu_list.append(rating.mu)
        trueskill_sigma_list.append(rating.sigma)
        cons_trueskill_list.append(conservative_trueskill_rating(rating.mu, rating.sigma))
        
    match_['trueskill_mu'] = trueskill_mu_list
    match_['trueskill_sigma'] = trueskill_sigma_list
    match_['trueskill'] = cons_trueskill_list
    
    # Determine the outcome
    if match_outcome[0] == 0:
        ranks = [0, 1] # Radiant wins
    else:
        ranks = [1, 0] # Dire wins
    
    # Update ratings
    new_radiant, new_dire = trueskill.rate([radiant, dire], weights=player_weights, ranks=ranks)
    
    for i in range(5):
        ts_updated_ratings[match_account_ids[i]] = new_radiant[match_account_ids[i]]
        ts_updated_ratings[match_account_ids[i+5]] = new_dire[match_account_ids[i+5]]

    # Calculate match quality
    match_quality = trueskill.quality([new_radiant, new_dire])
    match_['match_quality'] = match_quality

    return match_

In [27]:
# Calculate the Conservative Skill Estimate for each player before a match
processed_matches = []

# Iterate through each match and update ratings
for m_id, match_group in ts_players.groupby('match_id'):
    processed_match = update_ratings(match_group)
    processed_matches.append(processed_match)

# Concatenate all processed match groups back into one DataFrame
players_with_ts = pd.concat(processed_matches)
players_with_ts[['match_id', 'start_time', 'account_id', 'player_slot', 'leaver_status', 'kda', 
                 'trueskill_mu', 'trueskill_sigma', 'trueskill', 'match_quality', 'match_outcome']].head(20)

Unnamed: 0,match_id,start_time,account_id,player_slot,leaver_status,kda,trueskill_mu,trueskill_sigma,trueskill,match_quality,match_outcome
0,0,2015-11-05 19:01:52,Double T,0,0,6.75,25.112577,7.270275,3.301753,0.333641,1
1,0,2015-11-05 19:01:52,1,1,0,7.75,26.232905,4.854238,11.670192,0.333641,1
2,0,2015-11-05 19:01:52,Trash!!!,2,0,3.0,25.112577,7.270275,3.301753,0.333641,1
3,0,2015-11-05 19:01:52,2,3,0,5.4,27.614505,6.550771,7.96219,0.333641,1
4,0,2015-11-05 19:01:52,3,4,0,9.25,20.221006,5.961434,2.336703,0.333641,1
5,0,2015-11-05 19:01:52,4,128,0,1.857143,26.773302,5.322094,10.807019,0.333641,0
6,0,2015-11-05 19:01:52,6k Slayer,129,0,0.642857,25.112577,7.270275,3.301753,0.333641,0
7,0,2015-11-05 19:01:52,5,130,0,1.111111,32.190551,2.93714,23.379132,0.333641,0
8,0,2015-11-05 19:01:52,0_131,131,0,0.6,23.1395,6.807861,2.715916,0.333641,0
9,0,2015-11-05 19:01:52,6,132,0,0.583333,34.77452,5.783084,17.425268,0.333641,0


In [28]:
# Sanity Check
display(players_with_ts.shape)
display(players.shape)

(500000, 68)

(500000, 64)

In [29]:
# Update the players' DataFrame
players = players_with_ts
display(players.sample(50))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_nuker,hero_role_durable,hero_role_initiator,hero_role_pusher,hero_role_escape,hero_role_disabler,hero_role_support,hero_role_carry,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta,trueskill_mu,trueskill_sigma,trueskill,match_quality
394958,39495,1,OG.fuckface,113746,44,131,6337,14625,507,579,7,147,3.64614,13732,0,1772,46,145,187,63,143,164,21,0,12896,9068,894,1157,620,-2152,0,5591,6217,352,1,2485,112,2015-11-16 23:30:35,0,1846,63,0,251,agi,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,3.111111,0,0.794809,0.241379,0.279545,0.198709,12.0,12239.0,0.0,6.0,359.75,806.5625,19.960772,5.419198,3.703178,0.537057
221297,22129,0,icн liєbє dicн :3,icн liєbє dicн :3,93,130,90,6700,235,240,5,73,0.0,3586,0,0,63,154,181,0,46,0,11,0,890,6241,0,100,100,-1734,0,726,2942,0,1,1806,123,2015-11-15 00:39:00,1982,0,0,63,10,agi,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.714286,0,0.795918,0.172414,0.01449,0.236624,6.0,2841.0,0.0,5.0,-64.5,144.5,26.375492,7.097063,5.084303,0.381952
178637,17863,1,Lwi,72429,2,130,2198,12085,490,460,0,117,0.0,10822,0,1167,1,127,36,214,53,61,16,0,5998,8054,0,87,0,-2004,0,3460,4459,0,1,1842,132,2015-11-14 15:27:39,0,2046,63,0,104,str,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,3.285714,0,0.566502,0.0,0.128133,0.193298,8.0,6268.0,0.0,5.0,265.9,552.7,25.112577,7.270275,3.301753,0.471857
220342,22034,0,-,22034_2,38,2,46,8615,268,335,5,71,34.2385,10505,0,210,1,36,187,23,214,182,17,0,8476,6426,0,315,202,-6071,-845,4080,2743,200,0,2721,204,2015-11-15 00:23:36,0,1974,63,0,7,all,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.9,1,0.640206,0.178571,0.010125,0.13836,12.0,8728.0,0.0,13.0,18.25,506.5625,21.597842,7.42535,-0.678209,0.47481
380528,38052,1,YoKo™,YoKo™,28,131,2805,10955,340,330,3,110,47.9802,5108,0,220,1,151,116,0,46,63,16,0,4582,9547,0,13,0,-2213,0,1478,4321,0,1,2567,111,2015-11-16 20:18:14,0,1798,51,0,126,str,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.25,0,0.511364,0.142857,0.149784,0.152142,4.0,3419.0,0.0,3.0,87.0,393.777778,25.112577,7.270275,3.301753,0.506124
9173,917,0,Escu,4873,26,3,638,5465,217,244,1,24,49.5208,6477,0,4,214,86,188,46,36,94,11,0,2525,4324,0,114,375,-1046,0,2364,428,0,4,1704,132,2015-11-12 09:49:57,256,1983,63,48,131,int,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.2,1,1.0125,0.076923,0.122598,0.161281,7.0,3325.0,0.0,3.0,112.7,301.0,19.432581,7.016206,-1.616036,0.570028
206069,20606,0,Norman.,Norman.,26,132,1856,17635,380,458,0,65,45.8382,8365,815,30,48,79,108,1,0,0,22,0,15010,11534,0,465,230,-3320,-1675,9497,2053,0,5,3533,138,2015-11-14 21:25:30,1974,0,0,63,190,int,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.727273,0,0.945887,0.0,0.11533,0.193653,9.0,6677.0,0.0,5.0,339.666667,624.166667,25.112577,7.270275,3.301753,0.20547
54130,5413,0,JackDragon,26825,71,0,125,11655,309,333,2,81,120.715,8168,0,0,63,152,40,92,55,166,16,0,6804,6917,0,304,211,-2810,0,5383,2903,0,3,2520,123,2015-11-12 22:35:55,0,2038,63,0,63,str,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.545455,1,1.030303,0.111111,0.032442,0.196361,6.0,5646.0,0.0,4.0,274.125,574.75,24.045396,4.771731,9.730203,0.309658
382145,38214,0,Danger-Sendo,129615,39,128,613,19495,520,606,14,178,0.0,23328,0,56,24,116,112,98,46,63,21,0,12309,10835,357,859,554,-2332,0,8332,7327,200,9,2410,153,2015-11-16 20:39:35,1974,0,0,63,90,int,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2.777778,0,1.43909,0.378378,0.147038,0.318546,9.0,16490.0,0.0,4.0,441.416667,795.666667,22.175439,5.539225,5.557764,0.419361
24589,2458,1,kuba,10342,21,132,432,13560,233,220,12,45,106.387,8138,0,324,102,108,50,46,0,0,17,0,5965,8702,804,270,178,-3769,-1603,2705,1633,800,10,4284,132,2015-11-12 15:34:47,0,1824,63,0,146,all,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.666667,0,0.570175,0.342857,0.026115,0.102977,13.0,8608.0,1.0,5.0,22.722222,354.166667,27.458478,5.305167,11.542977,0.157508


7588114

#### Filling Missing Values

In [30]:
# Find features with missing values
display(players.isna().sum().sort_values(ascending=False)\
[players.isna().sum().sort_values(ascending=False) > 0 ])

tf_avg_xp_delta           690
tf_avg_gold_delta         690
tf_deaths                 690
tf_buybacks               690
tf_damage_dealt           690
teamfights                690
team_denies               275
team_kda                  105
hero_role_carry            37
hero_primary_attribute     37
hero_role_initiator        37
hero_role_pusher           37
hero_role_escape           37
hero_role_disabler         37
hero_role_support          37
hero_role_durable          37
hero_role_nuker            37
dtype: int64

We have a large number of missing values in some of our newly created features. Besides the team aggregations, teamfight-related and hero-related features all have the same amount of missing values per category. Let's start by exploring the team aggregations for now.

In [31]:
# Explore team missing values in Team KDA and Team Denies
print('Team KDA:')
display(players[(players['team_kda'].isna()) & (players['kda'] != 0)][['match_id', 'player_slot', 'kda']])
print('\n------------------------------------\nTeam Denies')
display(players[(players['team_denies'].isna()) & (players['denies'] != 0)][['match_id', 'player_slot', 'denies']])

Team KDA:


Unnamed: 0,match_id,player_slot,kda



------------------------------------
Team Denies


Unnamed: 0,match_id,player_slot,denies


Based on the results, the null values in the team features can be interpreted as representing a 0. This is because these particular observations and the rest of the team's players had a 0 in their original columns. In this context, we should treat the null values as 0.

In [32]:
# Fill missing team_deaths and team_kda values
players['team_kda'].fillna(0, inplace=True)
players['team_denies'].fillna(0, inplace=True)

# Find features with missing values
display(players.isna().sum().sort_values(ascending=False)\
[players.isna().sum().sort_values(ascending=False) > 0])

tf_avg_xp_delta           690
tf_avg_gold_delta         690
tf_deaths                 690
tf_buybacks               690
tf_damage_dealt           690
teamfights                690
hero_role_carry            37
hero_primary_attribute     37
hero_role_escape           37
hero_role_disabler         37
hero_role_support          37
hero_role_durable          37
hero_role_nuker            37
hero_role_pusher           37
hero_role_initiator        37
dtype: int64

Now let's look at the hero-related features:

In [33]:
# Look at the rows with missing hero features' values
players[players[['hero_role_durable', 'hero_primary_attribute', 'hero_role_nuker', 
         'hero_role_carry', 'hero_role_initiator', 'hero_role_disabler', 
         'hero_role_support', 'hero_role_pusher', 'hero_role_escape']].isna().any(axis=1)]

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_nuker,hero_role_durable,hero_role_initiator,hero_role_pusher,hero_role_escape,hero_role_disabler,hero_role_support,hero_role_carry,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta,trueskill_mu,trueskill_sigma,trueskill,match_quality
7203,720,0,2956,2956,0,3,0,0,135,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2021,132,2015-11-12 08:49:43,0,1926,59,0,152,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.411772,2.694663,19.327783,0.509527
10320,1032,0,-,1032_0,0,0,0,0,124,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2358,132,2015-11-12 10:24:54,0,1956,63,0,0,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.363303,7.063993,-2.828676,0.45605
11088,1108,0,4765,4765,0,131,0,0,104,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2193,111,2015-11-12 10:45:17,2039,0,0,63,18,,,,,,,,,,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.979409,2.081422,21.735143,0.431137
21343,2134,0,9442,9442,0,3,0,0,100,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,485,138,2015-11-12 14:43:48,2047,2047,63,63,118,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.599931,4.060508,15.418408,0.011821
21344,2134,0,9441,9441,0,4,0,0,100,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,485,138,2015-11-12 14:43:48,2047,2047,63,63,118,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29.731222,2.19974,23.132003,0.011821
27738,2773,0,13922,13922,0,131,0,0,108,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1084,204,2015-11-12 16:25:25,1983,384,48,63,271,,,,,,,,,,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.971784,5.596345,5.182748,0.334627
70983,7098,0,-,7098_3,0,3,0,0,100,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2156,112,2015-11-13 03:23:51,0,2047,63,0,0,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.59853,6.256666,14.828533,0.25831
74882,7488,0,6069,6069,0,2,0,0,117,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2709,171,2015-11-13 05:04:51,256,1846,63,48,204,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29.535976,2.641245,21.61224,0.594716
75829,7582,0,35721,35721,0,132,0,0,111,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1959,121,2015-11-13 05:23:06,2044,6,3,63,105,,,,,,,,,,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.783225,6.723435,8.61292,0.525323
78314,7831,0,35936,35936,0,4,0,0,106,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1321,111,2015-11-13 06:23:06,0,2047,63,0,85,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.820241,1.773467,20.499841,0.436867


It seems that the hero ID 0 is responsible for the null values in these rows. Since we can't fill the values without introducing bias into our models, we need to remove these matches entirely from our dataset. Let's create a list of the unique match IDs that we can use to drop the matches when we explore the Matches DataFrame.

In [34]:
# Create a list of the match IDs to be dropped
dropped_matches = players[players['hero_id'] == 0]['match_id'].tolist()
print('Matches to be removed:', len(dropped_matches))
print(dropped_matches)

Matches to be removed: 37
[720, 1032, 1108, 2134, 2134, 2773, 7098, 7488, 7582, 7831, 13396, 14592, 22764, 24509, 24711, 25043, 25888, 26106, 29388, 30150, 30790, 33087, 33443, 36245, 36647, 37020, 37020, 37147, 37860, 38150, 39664, 40522, 40706, 41029, 43519, 45879, 46280]


Let's maintain the teamfight-related features as they are for now. We can review them later after exploring the Matches DataFrame.

In [35]:
# Sanity Check
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))

players: 500,000 observations, 68 features


### Matches DataFrame

In [36]:
# Look at the DataFrame
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))
matches.head()

matches: 50,000 observations, 13 features


Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156


#### Team Aggregations

Next, we should proceed to create a new column containing the team hero picks for each player in their respective rows. This modification will provide us with valuable insight into the team composition for each match observation.

In [37]:
# Group players by match_id and separate radiant/dire heroes
match_picks = players.groupby(['match_id', 'radiant_team'], as_index=False, observed=False)['hero_id']\
                .apply(list).rename(columns={'hero_id': 'team_hero_picks'})

In [38]:
# Pivot the team_features DataFrame
team_pivoted = team_features.pivot(index='match_id', columns='radiant_team')

# Flatten MultiIndex columns
team_pivoted.columns = ['{}_{}'.format(col[0], 'radiant' if col[1] == 1 else 'dire')\
                        for col in team_pivoted.columns]

# Reset the index
team_pivoted.reset_index(inplace=True)
display(team_pivoted.head())
print('----------------------------')

# Merge results with the matches DataFrame
matches = matches.merge(team_pivoted, on=['match_id'], how='left')
display(matches.head())

Unnamed: 0,match_id,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant
0,0,0.90566,7.611111,38,30,3718,10811,58620,87245
1,1,3.5,1.555556,16,27,9085,4776,107750,69310
2,2,6.0,1.06,16,10,11177,2494,81620,54990
3,3,2.621212,2.328358,29,32,5954,6455,94430,76685
4,4,1.210526,5.647059,21,26,2030,14099,38980,78980


----------------------------


Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980


In [39]:
# Pivot the team_features DataFrame
picks_pivoted = match_picks.pivot(index='match_id', columns='radiant_team')

# Flatten MultiIndex columns
picks_pivoted.columns = ['{}_{}'.format(col[0], 'radiant' if col[1] == 1 else 'dire')\
                        for col in picks_pivoted.columns]

# Reset the index
picks_pivoted.reset_index(inplace=True)
display(picks_pivoted.head())
print('----------------------------')

# Expanding the lists
radiant_picks = pd.DataFrame(picks_pivoted['team_hero_picks_radiant'].tolist(), 
                             index=picks_pivoted.index, 
                             columns=[f'hero_slot_{i}' for i in range(5)])

dire_picks = pd.DataFrame(picks_pivoted['team_hero_picks_dire'].tolist(), 
                          index=picks_pivoted.index, 
                          columns=[f'hero_slot_{i+128}' for i in range(5)])

picks_pivoted = pd.concat([picks_pivoted, radiant_picks, dire_picks], axis=1)
picks_pivoted.drop(columns=['team_hero_picks_radiant', 'team_hero_picks_dire'], inplace=True)

# Merge results with the matches DataFrame
matches = matches.merge(picks_pivoted, on=['match_id'], how='left')
display(matches.head())

Unnamed: 0,match_id,team_hero_picks_dire,team_hero_picks_radiant
0,0,"[106, 102, 46, 7, 73]","[86, 51, 83, 11, 67]"
1,1,"[73, 22, 5, 67, 106]","[7, 82, 71, 39, 21]"
2,2,"[38, 7, 10, 12, 85]","[51, 109, 9, 41, 27]"
3,3,"[78, 19, 31, 40, 47]","[50, 44, 32, 26, 39]"
4,4,"[101, 100, 22, 67, 21]","[8, 39, 55, 87, 69]"


----------------------------


Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245,86,51,83,11,67,106,102,46,7,73
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310,7,82,71,39,21,73,22,5,67,106
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990,51,109,9,41,27,38,7,10,12,85
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685,50,44,32,26,39,78,19,31,40,47
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980,8,39,55,87,69,101,100,22,67,21


#### Hero Features

In [40]:
# Create a mask from our players DataFrame
team_hero_features = players[['match_id', 'player_slot', 'radiant_team', 'hero_primary_attribute',
                              'hero_role_disabler', 'hero_role_support', 'hero_role_carry',
                              'hero_role_initiator', 'hero_role_durable', 'hero_role_pusher',
                              'hero_role_nuker', 'hero_role_escape']]

# Pivot the mask
hero_features_pivot = team_hero_features.pivot_table(
    index='match_id',
    columns='player_slot',
    values=['radiant_team', 'hero_primary_attribute', 'hero_role_disabler', 
            'hero_role_support', 'hero_role_carry', 'hero_role_initiator', 
            'hero_role_durable', 'hero_role_pusher', 'hero_role_nuker', 'hero_role_escape'],
    aggfunc='first'
)

# Flatten the multi-level columns
hero_features_pivot.columns = [f'{i}_{j}' if j!= '' else f'{i}' for i, j in hero_features_pivot.columns]
hero_features_pivot.drop(columns=['radiant_team_0', 'radiant_team_1', 'radiant_team_2', 'radiant_team_3', 'radiant_team_4', 
                                  'radiant_team_128', 'radiant_team_129', 'radiant_team_130', 'radiant_team_131', 'radiant_team_132'], 
                         inplace=True)
hero_features_pivot.reset_index(inplace=True)

# Merge results with the matches DataFrame
matches = matches.merge(hero_features_pivot, on='match_id', how='left')
display(matches.head())

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132,hero_primary_attribute_0,hero_primary_attribute_1,hero_primary_attribute_2,hero_primary_attribute_3,hero_primary_attribute_4,hero_primary_attribute_128,hero_primary_attribute_129,hero_primary_attribute_130,hero_primary_attribute_131,hero_primary_attribute_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245,86,51,83,11,67,106,102,46,7,73,int,all,str,agi,agi,agi,all,agi,str,str,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310,7,82,71,39,21,73,22,5,67,106,str,agi,str,int,all,str,int,int,agi,agi,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990,51,109,9,41,27,38,7,10,12,85,all,agi,all,agi,int,all,str,agi,agi,str,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685,50,44,32,26,39,78,19,31,40,47,all,agi,agi,int,int,all,str,int,all,agi,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980,8,39,55,87,69,101,100,22,67,21,agi,int,all,int,str,int,str,int,agi,all,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0


We have successfully included each player's one-hot encoded hero roles into our DataFrame. Let's continue with the features from Teamfights.

#### Teamfights

In [41]:
# Load required file
teamfights = pd.read_csv(clean_folder + '/teamfights.csv', index_col=0)
print(f'teamfights:', '{:,} observations, {:,} features'.format(teamfights.shape[0], teamfights.shape[1]))

teamfights: 539,047 observations, 6 features


In [42]:
# Check the total matches
teamfights['match_id'].nunique()

49931

In [43]:
# Look at the head
teamfights.head()

Unnamed: 0,match_id,tf_order,start,end,last_death,deaths
0,0,1,220,252,237,3
1,0,2,429,475,460,3
2,0,3,900,936,921,3
3,0,4,1284,1328,1313,3
4,0,5,1614,1666,1651,5


We want to calculate the total number of teamfights per match in the matches DataFrame, and we can also determine the average duration of teamfights per match.

In [44]:
# Create duration feature
tf_duration = teamfights['end'] - teamfights['start']
teamfights.insert(4, 'duration', value=tf_duration)

# Selecting the agg functions for each column
agg_funcs = {
    'tf_order': 'max',
    'duration': 'mean'
}

# Aggregating features
tfs_per_match = teamfights[['match_id', 'tf_order', 'duration']].groupby('match_id', as_index=False).agg(agg_funcs)

# Rename time column to match format
tfs_per_match.rename(columns={'tf_order': 'teamfights', 'duration': 'tf_avg_duration'}, inplace=True)

tfs_per_match.head(10)

Unnamed: 0,match_id,teamfights,tf_avg_duration
0,0,12,42.583333
1,1,15,44.066667
2,2,11,45.272727
3,3,16,52.0625
4,4,6,42.666667
5,5,13,47.384615
6,6,10,44.4
7,7,12,47.0
8,8,10,42.5
9,9,13,49.0


In [45]:
# Merge results with the matches DataFrame
matches = matches.merge(tfs_per_match, on=['match_id'], how='left')
display(matches.head())

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132,hero_primary_attribute_0,hero_primary_attribute_1,hero_primary_attribute_2,hero_primary_attribute_3,hero_primary_attribute_4,hero_primary_attribute_128,hero_primary_attribute_129,hero_primary_attribute_130,hero_primary_attribute_131,hero_primary_attribute_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132,teamfights,tf_avg_duration
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245,86,51,83,11,67,106,102,46,7,73,int,all,str,agi,agi,agi,all,agi,str,str,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,12.0,42.583333
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310,7,82,71,39,21,73,22,5,67,106,str,agi,str,int,all,str,int,int,agi,agi,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,15.0,44.066667
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990,51,109,9,41,27,38,7,10,12,85,all,agi,all,agi,int,all,str,agi,agi,str,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,11.0,45.272727
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685,50,44,32,26,39,78,19,31,40,47,all,agi,agi,int,int,all,str,int,all,agi,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,16.0,52.0625
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980,8,39,55,87,69,101,100,22,67,21,agi,int,all,int,str,int,str,int,agi,all,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.0,42.666667


#### TrueSkill

In [46]:
# Filtering trueskill features and looking at the head 
ts_df = players[['match_id', 'player_slot', 'trueskill_mu', 'trueskill_sigma', 'trueskill', 'match_quality']]
ts_df.head(20)

Unnamed: 0,match_id,player_slot,trueskill_mu,trueskill_sigma,trueskill,match_quality
0,0,0,25.112577,7.270275,3.301753,0.333641
1,0,1,26.232905,4.854238,11.670192,0.333641
2,0,2,25.112577,7.270275,3.301753,0.333641
3,0,3,27.614505,6.550771,7.96219,0.333641
4,0,4,20.221006,5.961434,2.336703,0.333641
5,0,128,26.773302,5.322094,10.807019,0.333641
6,0,129,25.112577,7.270275,3.301753,0.333641
7,0,130,32.190551,2.93714,23.379132,0.333641
8,0,131,23.1395,6.807861,2.715916,0.333641
9,0,132,34.77452,5.783084,17.425268,0.333641


In [47]:
# Group by match ID to avoid duplicates
ts_df_grouped = ts_df.groupby('match_id')['match_quality'].first().reset_index()

# Merge the match_quality to the matches DataFrame
matches = matches.merge(ts_df_grouped, on=['match_id'], how='left')
display(matches[['match_id', 'match_quality']].head())

Unnamed: 0,match_id,match_quality
0,0,0.333641
1,1,0.58092
2,2,0.402769
3,3,0.462334
4,4,0.476433


In [48]:
# Pivot the table
ts_pivot = ts_df.pivot_table(
    index='match_id',
    columns='player_slot',
    values='trueskill',
    aggfunc='first'
)

# Flatten the multi-level columns
ts_pivot.columns = [f'trueskill_{col}' for col in ts_pivot.columns]
ts_pivot.reset_index(inplace=True)

# Merge into the matches DataFrame
matches = matches.merge(ts_pivot, on='match_id', how='left')
display(matches.head())

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132,hero_primary_attribute_0,hero_primary_attribute_1,hero_primary_attribute_2,hero_primary_attribute_3,hero_primary_attribute_4,hero_primary_attribute_128,hero_primary_attribute_129,hero_primary_attribute_130,hero_primary_attribute_131,hero_primary_attribute_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132,teamfights,tf_avg_duration,match_quality,trueskill_0,trueskill_1,trueskill_2,trueskill_3,trueskill_4,trueskill_128,trueskill_129,trueskill_130,trueskill_131,trueskill_132
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245,86,51,83,11,67,106,102,46,7,73,int,all,str,agi,agi,agi,all,agi,str,str,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,12.0,42.583333,0.333641,3.301753,11.670192,3.301753,7.96219,2.336703,10.807019,3.301753,23.379132,2.715916,17.425268
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310,7,82,71,39,21,73,22,5,67,106,str,agi,str,int,all,str,int,int,agi,agi,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,15.0,44.066667,0.58092,3.301753,16.999098,2.89536,2.649581,11.455655,4.795492,23.069381,3.301753,23.548781,3.301753
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990,51,109,9,41,27,38,7,10,12,85,all,agi,all,agi,int,all,str,agi,agi,str,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,11.0,45.272727,0.402769,3.301753,3.858394,3.858394,6.099669,10.246954,3.301753,21.139734,3.301753,3.301753,3.301753
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685,50,44,32,26,39,78,19,31,40,47,all,agi,agi,int,int,all,str,int,all,agi,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,16.0,52.0625,0.462334,0.386932,3.301753,3.301753,3.301753,-3.142269,4.672139,3.301753,8.500198,4.637102,3.301753
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980,8,39,55,87,69,101,100,22,67,21,agi,int,all,int,str,int,str,int,agi,all,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.0,42.666667,0.476433,9.539045,13.140809,6.390039,12.183661,10.246954,5.658858,5.387267,3.301753,23.106103,7.392703


#### Filling Missing Values

In [49]:
# Find features with missing values
display(matches.isna().sum().sort_values(ascending=False)\
[matches.isna().sum().sort_values(ascending=False) > 0])

tf_avg_duration               69
teamfights                    69
hero_primary_attribute_4       6
hero_role_pusher_4             6
hero_role_carry_4              6
hero_role_nuker_4              6
hero_role_escape_4             6
hero_role_disabler_4           6
hero_role_support_4            6
hero_role_initiator_4          6
hero_role_durable_4            6
hero_role_pusher_3             5
hero_role_pusher_2             5
hero_primary_attribute_2       5
hero_primary_attribute_3       5
hero_role_escape_2             5
hero_role_durable_3            5
hero_role_pusher_131           5
hero_role_durable_2            5
hero_primary_attribute_131     5
hero_role_disabler_131         5
hero_role_carry_2              5
hero_role_carry_3              5
hero_role_disabler_3           5
hero_role_durable_131          5
hero_role_disabler_2           5
hero_role_initiator_3          5
hero_role_nuker_131            5
hero_role_escape_3             5
hero_role_support_2            5
hero_role_

In [50]:
# Look at the rows with missing teamfight values
matches[matches['teamfights'].isna()]

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132,hero_primary_attribute_0,hero_primary_attribute_1,hero_primary_attribute_2,hero_primary_attribute_3,hero_primary_attribute_4,hero_primary_attribute_128,hero_primary_attribute_129,hero_primary_attribute_130,hero_primary_attribute_131,hero_primary_attribute_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132,teamfights,tf_avg_duration,match_quality,trueskill_0,trueskill_1,trueskill_2,trueskill_3,trueskill_4,trueskill_128,trueskill_129,trueskill_130,trueskill_131,trueskill_132
1221,1221,2015-11-12 11:23:25,272,2047,2047,63,63,8,22,0,0,0,171,5.0,0.142857,14,4,3891,2134,7015,4740,85,10,21,14,2,7,61,20,106,15,str,agi,all,str,str,str,all,all,agi,agi,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,,,0.060841,0.633727,23.478384,7.994206,23.631038,22.468244,20.491866,24.919724,14.926128,19.17216,9.736056
1632,1632,2015-11-12 13:09:28,605,2047,2047,63,63,145,22,0,0,0,138,3.5,0.4,20,8,2351,2636,15360,10255,22,9,56,72,57,93,39,50,36,104,int,all,agi,agi,str,agi,int,all,int,str,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,0.035195,11.582781,3.301753,3.301753,2.958588,3.301753,4.915572,6.19391,14.054129,19.173923,11.758558
2420,2420,2015-11-12 15:27:58,681,2047,2046,63,63,143,22,1,0,0,133,0.0,8.5,18,16,2320,5617,10490,12550,19,74,2,26,5,18,25,10,104,101,str,all,str,int,int,str,int,agi,str,int,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,,,0.030676,14.758948,17.221748,8.773373,2.054592,10.027432,3.301753,3.301753,3.301753,17.386708,7.92052
2774,2774,2015-11-12 16:25:31,1411,384,2047,63,48,95,22,0,0,0,111,8.166667,0.416667,21,31,17090,7006,46980,28045,99,17,100,11,9,112,46,69,28,62,str,int,str,agi,all,all,agi,str,str,agi,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,,,0.139828,3.301753,11.269129,3.301753,21.348246,13.357926,22.415946,3.301753,2.044854,8.521596,14.393527
3204,3204,2015-11-12 17:36:29,1122,2047,260,51,63,175,22,1,0,0,133,0.352941,11.333333,5,30,5453,8132,11590,35425,86,49,106,27,8,7,67,29,63,62,int,str,agi,int,agi,str,agi,str,agi,agi,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.235813,17.276265,16.535279,-1.662855,3.301753,1.262703,6.437503,3.301753,3.301753,-1.420664,3.301753
3548,3548,2015-11-12 18:27:22,362,2047,2047,63,63,132,2,0,0,0,182,4.0,0.25,17,8,4530,1922,6990,7300,100,85,47,2,40,50,61,28,7,39,str,str,agi,str,all,all,all,str,str,int,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,,,0.008901,9.953612,17.996606,12.795343,13.005647,15.01634,3.301753,3.301753,3.301753,3.301753,1.034654
3587,3587,2015-11-12 18:33:19,289,2047,2039,63,63,0,22,1,0,0,204,0.0,6.0,3,16,2712,4673,3895,6245,5,69,11,14,67,7,63,42,39,30,int,str,agi,str,agi,str,agi,str,int,int,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,,,0.00457,1.020875,3.301753,2.013298,14.464078,10.591201,14.125489,3.301753,17.659963,3.301753,12.280876
4896,4896,2015-11-12 21:24:44,904,2046,391,51,63,0,22,1,0,0,112,1.111111,2.428571,10,18,7496,8749,11135,23885,79,62,73,110,93,21,50,2,71,36,int,agi,str,all,agi,all,all,str,str,int,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,,,0.011893,18.397198,3.301753,7.033719,10.073099,9.667373,3.301753,3.301753,14.413403,0.679104,17.632735
6213,6213,2015-11-13 00:41:55,148,2047,2047,63,63,0,22,1,0,0,132,0.0,6.0,3,4,1638,1523,3160,6275,55,75,62,93,86,19,90,57,1,7,all,int,agi,agi,int,str,int,str,agi,str,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,,,0.027122,21.263721,11.079367,4.579395,17.273957,18.110785,10.745818,3.301753,10.293069,20.574189,6.872508
6706,6706,2015-11-13 02:08:30,1015,2047,900,51,63,257,22,1,0,0,121,0.3,5.666667,10,28,3978,8807,13295,31110,12,43,93,75,83,84,36,41,57,19,agi,int,agi,int,str,str,int,agi,str,str,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,,,0.00991,3.301753,3.301753,3.665339,3.301753,1.797714,3.301753,19.54846,3.765979,16.449575,0.349506


Upon reviewing the data, we can notice no discernible patterns indicating the absence of teamfights during these matches. Therefore, it would be best to exclude them from all of our DataFrames since they don't have complete information.

In [51]:
# Append match IDs to be dropped
tf_drop_matches = matches[matches['teamfights'].isna()]['match_id'].tolist()
dropped_matches = list(set(dropped_matches + tf_drop_matches))
print('Matches to be removed:', len(dropped_matches),'\n')

# Removing from the players and matches dfs
players = players.drop(players[players['match_id'].isin(dropped_matches)].index)
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))
print('-----------------------------------------------------')
matches = matches.drop(matches[matches['match_id'].isin(dropped_matches)].index)
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))

Matches to be removed: 102 

players: 498,980 observations, 68 features
-----------------------------------------------------
matches: 49,898 observations, 134 features


#### Simplified Version

For our initial round of base modeling, we plan to start with a simple DataFrame at the match level, including only the information from our TrueSkill calculations, as well as the hero role values.

In [52]:
# Create a simplified version for modelling

simplified_match_columns = [
    'match_id', 'radiant_win', 'match_quality', 
    'trueskill_0', 'trueskill_1', 'trueskill_2', 'trueskill_3', 'trueskill_4', 
    'trueskill_128', 'trueskill_129', 'trueskill_130', 'trueskill_131', 'trueskill_132',
    'hero_role_carry_0', 'hero_role_carry_1', 'hero_role_carry_2', 'hero_role_carry_3', 'hero_role_carry_4', 
    'hero_role_carry_128', 'hero_role_carry_129', 'hero_role_carry_130', 'hero_role_carry_131', 'hero_role_carry_132', 
    'hero_role_disabler_0', 'hero_role_disabler_1', 'hero_role_disabler_2', 'hero_role_disabler_3', 'hero_role_disabler_4', 
    'hero_role_disabler_128', 'hero_role_disabler_129', 'hero_role_disabler_130', 'hero_role_disabler_131', 'hero_role_disabler_132', 
    'hero_role_durable_0', 'hero_role_durable_1', 'hero_role_durable_2', 'hero_role_durable_3', 'hero_role_durable_4', 
    'hero_role_durable_128', 'hero_role_durable_129', 'hero_role_durable_130', 'hero_role_durable_131', 'hero_role_durable_132', 
    'hero_role_escape_0', 'hero_role_escape_1', 'hero_role_escape_2', 'hero_role_escape_3', 'hero_role_escape_4', 
    'hero_role_escape_128', 'hero_role_escape_129', 'hero_role_escape_130', 'hero_role_escape_131', 'hero_role_escape_132', 
    'hero_role_initiator_0', 'hero_role_initiator_1', 'hero_role_initiator_2', 'hero_role_initiator_3', 'hero_role_initiator_4', 
    'hero_role_initiator_128', 'hero_role_initiator_129', 'hero_role_initiator_130', 'hero_role_initiator_131', 'hero_role_initiator_132', 
    'hero_role_nuker_0', 'hero_role_nuker_1', 'hero_role_nuker_2', 'hero_role_nuker_3', 'hero_role_nuker_4', 
    'hero_role_nuker_128', 'hero_role_nuker_129', 'hero_role_nuker_130', 'hero_role_nuker_131', 'hero_role_nuker_132', 
    'hero_role_pusher_0', 'hero_role_pusher_1', 'hero_role_pusher_2', 'hero_role_pusher_3', 'hero_role_pusher_4', 
    'hero_role_pusher_128', 'hero_role_pusher_129', 'hero_role_pusher_130', 'hero_role_pusher_131', 'hero_role_pusher_132', 
    'hero_role_support_0', 'hero_role_support_1', 'hero_role_support_2', 'hero_role_support_3', 'hero_role_support_4', 
    'hero_role_support_128', 'hero_role_support_129', 'hero_role_support_130', 'hero_role_support_131', 'hero_role_support_132'                           
]

matches_simple = matches[simplified_match_columns]
display(matches_simple.head())
gc.collect()

Unnamed: 0,match_id,radiant_win,match_quality,trueskill_0,trueskill_1,trueskill_2,trueskill_3,trueskill_4,trueskill_128,trueskill_129,trueskill_130,trueskill_131,trueskill_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132
0,0,1,0.333641,3.301753,11.670192,3.301753,7.96219,2.336703,10.807019,3.301753,23.379132,2.715916,17.425268,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
1,1,0,0.58092,3.301753,16.999098,2.89536,2.649581,11.455655,4.795492,23.069381,3.301753,23.548781,3.301753,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
2,2,0,0.402769,3.301753,3.858394,3.858394,6.099669,10.246954,3.301753,21.139734,3.301753,3.301753,3.301753,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,3,0,0.462334,0.386932,3.301753,3.301753,3.301753,-3.142269,4.672139,3.301753,8.500198,4.637102,3.301753,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,4,1,0.476433,9.539045,13.140809,6.390039,12.183661,10.246954,5.658858,5.387267,3.301753,23.106103,7.392703,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0


0

### Timeseries DataFrame

<div class="alert alert-block alert-warning">
<b>Note:</b> <i>This section is still under development and will be completed once the project is ready for time series modelling.</i>.</div>

In [53]:
# Load required file
player_time = pd.read_csv(clean_folder + '/player_time.csv', index_col=0)
print(f'player_time:', '{:,} observations, {:,} features'.format(player_time.shape[0], player_time.shape[1]))

player_time: 2,209,778 observations, 32 features


In [54]:
# Melt the player_time DataFrame
player_time_melted = pd.melt(player_time, id_vars=['match_id', 'times'], 
                              value_vars=[col for col in player_time.columns if\
                                          col.startswith(('gold_t_', 'lh_t_', 'xp_t_'))],
                              var_name='metric', value_name='value')

# Look at the shape of the melted DataFrame
print(f'player_time_melted:', '{:,} observations, {:,} features'.format(player_time_melted.shape[0], player_time_melted.shape[1]))
display(player_time_melted.head())
gc.collect()

player_time_melted: 66,293,340 observations, 4 features


Unnamed: 0,match_id,times,metric,value
0,0,0,gold_t_0,0
1,0,60,gold_t_0,409
2,0,120,gold_t_0,546
3,0,180,gold_t_0,683
4,0,240,gold_t_0,956


0

In [55]:
# Create separate columns for gold, lh, and xp
player_time_melted[['metric_type', 'player_slot']] = player_time_melted['metric'].str.split('_t_', expand=True)
player_time_melted['player_slot'] = player_time_melted['player_slot'].astype(int)
display(player_time_melted.head())
gc.collect()

Unnamed: 0,match_id,times,metric,value,metric_type,player_slot
0,0,0,gold_t_0,0,gold,0
1,0,60,gold_t_0,409,gold,0
2,0,120,gold_t_0,546,gold,0
3,0,180,gold_t_0,683,gold,0
4,0,240,gold_t_0,956,gold,0


13

In [56]:
# Pivot the table to create a wide format
player_time_wide = player_time_melted.pivot_table(index=['match_id', 'times', 'player_slot'], 
                                                  columns='metric_type', 
                                                  values='value',
                                                  aggfunc='sum').reset_index()

# Look at the shape of the wide DataFrame
print(f'player_time_wide:', '{:,} observations, {:,} features'.format(player_time_wide.shape[0], player_time_wide.shape[1]))
display(player_time_wide.head(50))
gc.collect()

player_time_wide: 22,097,780 observations, 6 features


metric_type,match_id,times,player_slot,gold,lh,xp
0,0,0,0,0,0,0
1,0,0,1,0,0,0
2,0,0,2,0,0,0
3,0,0,3,0,0,0
4,0,0,4,0,0,0
5,0,0,128,0,0,0
6,0,0,129,0,0,0
7,0,0,130,0,0,0
8,0,0,131,0,0,0
9,0,0,132,0,0,0


0

#### Ability Upgrades

#### Purchase Log

#### Objectives

In [57]:
# Load the file
objectives = pd.read_csv(clean_folder + '/objectives.csv', index_col=0)
print(f'objectives:', '{:,} observations, {:,} features'.format(objectives.shape[0], objectives.shape[1]))

objectives: 716,498 observations, 6 features


In [58]:
objectives.groupby('subtype')['value'].nunique()

subtype
CHAT_MESSAGE_AEGIS             1
CHAT_MESSAGE_AEGIS_STOLEN      1
CHAT_MESSAGE_FIRSTBLOOD      322
CHAT_MESSAGE_ROSHAN_KILL       1
CHAT_MESSAGE_TOWER_DENY        4
CHAT_MESSAGE_TOWER_KILL        2
Name: value, dtype: int64

In [59]:
# Separate the objectives into multiple features
objectives['aegis'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_AEGIS', 1, 0)
objectives['aegis_stolen'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_AEGIS_STOLEN', 1, 0)
objectives['firstblood'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_FIRSTBLOOD', 1, 0)
objectives['roshan_kill'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_ROSHAN_KILL', 1, 0)
objectives['tower_deny'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_TOWER_DENY', 1, 0)
objectives['tower_kill'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_TOWER_KILL', 1, 0)

# Look at the head
objectives.head()

Unnamed: 0,match_id,player1,player2,subtype,time,value,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill
0,0,0,129,CHAT_MESSAGE_FIRSTBLOOD,1,309,0,0,1,0,0,0
1,0,3,-1,CHAT_MESSAGE_TOWER_KILL,894,2,0,0,0,0,0,1
2,0,2,-1,CHAT_MESSAGE_ROSHAN_KILL,925,200,0,0,0,1,0,0
3,0,1,-1,CHAT_MESSAGE_AEGIS,925,0,1,0,0,0,0,0
4,0,130,-1,CHAT_MESSAGE_TOWER_KILL,1016,3,0,0,0,0,0,1


In [60]:
# Round up the time values to the nearest 60-second intervals
objectives['time'] = (objectives['time'] // 60) * 60

# Aggregate objectives
objective_features = objectives.drop(columns=['player2', 'subtype', 'value']).\
                        groupby(['match_id', 'player1', 'time']).sum().reset_index()

# Look at the objectives features DataFrame head
display(objective_features.head())

Unnamed: 0,match_id,player1,time,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill
0,0,0,0,0,0,1,0,0,0
1,0,1,900,1,0,0,0,0,0
2,0,1,1740,1,0,0,0,0,0
3,0,1,2280,0,0,0,0,0,1
4,0,2,900,0,0,0,1,0,0


In [61]:
# Rename time, player1 and subtype column to match format
objective_features.rename(columns={'time': 'times', 'player1': 'player_slot'}, inplace=True)

# Merge aggregated objectives data
player_time_wide = player_time_wide.merge(objective_features, on=['match_id', 'player_slot', 'times'], how='left')
display(player_time_wide.head(50))
gc.collect()

Unnamed: 0,match_id,times,player_slot,gold,lh,xp,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill
0,0,0,0,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0
1,0,0,1,0,0,0,,,,,,
2,0,0,2,0,0,0,,,,,,
3,0,0,3,0,0,0,,,,,,
4,0,0,4,0,0,0,,,,,,
5,0,0,128,0,0,0,,,,,,
6,0,0,129,0,0,0,,,,,,
7,0,0,130,0,0,0,,,,,,
8,0,0,131,0,0,0,,,,,,
9,0,0,132,0,0,0,,,,,,


0

#### Teamfight Durations

In [62]:
teamfights.head()

Unnamed: 0,match_id,tf_order,start,end,duration,last_death,deaths
0,0,1,220,252,32,237,3
1,0,2,429,475,46,460,3
2,0,3,900,936,36,921,3
3,0,4,1284,1328,44,1313,3
4,0,5,1614,1666,52,1651,5


In [63]:
# Create duration feature
tf_last_death = teamfights['last_death'] - teamfights['start']
teamfights.insert(6, 'tf_last_death', value=tf_last_death)

# Round up the time values to the nearest 60-second intervals
teamfights['times'] = (teamfights['start'] // 60) * 60

tfs_features = teamfights.drop(columns=['start', 'end', 'last_death', 'deaths'])

# Look at the teamfights features DataFrame head
display(tfs_features.head())

Unnamed: 0,match_id,tf_order,duration,tf_last_death,times
0,0,1,32,17,180
1,0,2,46,31,420
2,0,3,36,21,900
3,0,4,44,29,1260
4,0,5,52,37,1560


In [64]:
tf_players.head()

Unnamed: 0,match_id,tf_id,player_slot,buybacks,damage,deaths,gold_delta,xp_end,xp_start
0,0,0,0,0,105,0,173,536,314
1,0,0,1,0,566,1,0,1583,1418
2,0,0,2,0,0,0,0,391,391
3,0,0,3,0,0,0,123,1775,1419
4,0,0,4,0,444,0,336,1267,983


In [65]:
# Calculate xp_delta on tf_players
tf_players['tf_xp_delta'] = tf_players['xp_end'] - tf_players['xp_start']
tf_players.drop(columns=['xp_end', 'xp_start'], inplace=True)

# Reset the index from tfs_features
tfs_features = tfs_features.reset_index()

# Merge tf_players with tfs_features
tfs_features = tf_players.merge(tfs_features, left_on=['match_id', 'tf_id'], right_on=['match_id', 'index'])
tfs_features.drop(columns=['index', 'tf_id', 'tf_order'], inplace=True)
display(tfs_features.head(15))

Unnamed: 0,match_id,player_slot,buybacks,damage,deaths,gold_delta,tf_xp_delta,duration,tf_last_death,times
0,0,0,0,105,0,173,222,32,17,180
1,0,1,0,566,1,0,165,32,17,180
2,0,2,0,0,0,0,0,32,17,180
3,0,3,0,0,0,123,356,32,17,180
4,0,4,0,444,0,336,284,32,17,180
5,0,128,0,477,1,249,283,32,17,180
6,0,129,0,636,1,-27,144,32,17,180
7,0,130,0,0,0,190,315,32,17,180
8,0,131,0,0,0,0,0,32,17,180
9,0,132,0,0,0,378,70,32,17,180


In [66]:
# Rename duration and deaths columns to match format
tfs_features.rename(columns={
    'buybacks': 'tf_buybacks',
    'damage': 'tf_damage',
    'deaths': 'tf_deaths',
    'gold_delta': 'tf_gold_delta',
    'duration': 'tf_duration'
}, inplace=True)

# Merge aggregated objectives data
player_time_wide = player_time_wide.merge(tfs_features, on=['match_id', 'player_slot', 'times'], how='left')

# Drop the matches with incomplete teamfight data
player_time_wide = player_time_wide[~player_time_wide['match_id'].isin(dropped_matches)]

# Display the first 50 final observations
display(player_time_wide.head(50))
gc.collect()

Unnamed: 0,match_id,times,player_slot,gold,lh,xp,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill,tf_buybacks,tf_damage,tf_deaths,tf_gold_delta,tf_xp_delta,tf_duration,tf_last_death
0,0,0,0,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0,,,,,,,
1,0,0,1,0,0,0,,,,,,,,,,,,,
2,0,0,2,0,0,0,,,,,,,,,,,,,
3,0,0,3,0,0,0,,,,,,,,,,,,,
4,0,0,4,0,0,0,,,,,,,,,,,,,
5,0,0,128,0,0,0,,,,,,,,,,,,,
6,0,0,129,0,0,0,,,,,,,,,,,,,
7,0,0,130,0,0,0,,,,,,,,,,,,,
8,0,0,131,0,0,0,,,,,,,,,,,,,
9,0,0,132,0,0,0,,,,,,,,,,,,,


0

#### Chat Log

In [67]:
# Load the file
chat = pd.read_csv(clean_folder + '/chat.csv', index_col=0)
print(f'chat:', '{:,} observations, {:,} features'.format(objectives.shape[0], objectives.shape[1]))

chat: 716,498 observations, 12 features


In [68]:
# Look at the head
chat.head()

Unnamed: 0,match_id,player_slot,match_slot_id,account,chat,time,match_outcome
0,0,129,0_129,6k Slayer,force it,-8,0
1,0,1,0_1,Monkey,space created,5,1
2,0,1,0_1,Monkey,hah,6,1
3,0,129,0_129,6k Slayer,ez 500,9,0
4,0,4,0_4,Kira,mvp ulti,934,1


In [69]:
# Round up the time values to the nearest 60-second intervals
chat['time'] = (chat['time'] // 60) * 60

# Aggregate chat messages
chat_features = chat.groupby(['match_id', 'player_slot', 'time'])['chat'].count().reset_index()

# Look at the objectives features DataFrame head
display(chat_features.head())

Unnamed: 0,match_id,player_slot,time,chat
0,0,0,1500,1
1,0,0,1680,1
2,0,0,2340,2
3,0,1,0,2
4,0,1,1440,1


In [70]:
chat_features[chat_features['time'] < 0]

Unnamed: 0,match_id,player_slot,time,chat
13,0,129,-60,1
36,2,0,-60,2
47,2,128,-60,1
54,2,130,-60,1
56,2,131,-60,1
...,...,...,...,...
811137,49994,130,-120,1
811148,49995,3,-60,1
811163,49995,130,-60,1
811184,49998,130,-300,1


In [71]:
# Rename time column to match format
chat_features.rename(columns={'time': 'times', 'chat': 'chats_sent'}, inplace=True)

# Merge aggregated objectives data
player_time_wide = player_time_wide.merge(chat_features, on=['match_id', 'player_slot', 'times'], how='left')
display(player_time_wide.head(50))
gc.collect()

Unnamed: 0,match_id,times,player_slot,gold,lh,xp,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill,tf_buybacks,tf_damage,tf_deaths,tf_gold_delta,tf_xp_delta,tf_duration,tf_last_death,chats_sent
0,0,0,0,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0,,,,,,,,
1,0,0,1,0,0,0,,,,,,,,,,,,,,2.0
2,0,0,2,0,0,0,,,,,,,,,,,,,,
3,0,0,3,0,0,0,,,,,,,,,,,,,,
4,0,0,4,0,0,0,,,,,,,,,,,,,,
5,0,0,128,0,0,0,,,,,,,,,,,,,,
6,0,0,129,0,0,0,,,,,,,,,,,,,,1.0
7,0,0,130,0,0,0,,,,,,,,,,,,,,
8,0,0,131,0,0,0,,,,,,,,,,,,,,
9,0,0,132,0,0,0,,,,,,,,,,,,,,


0

#### Filling Missing Values

In [72]:
# Remove the matches from dropped_matches
player_time_wide = player_time_wide.drop(player_time_wide[player_time_wide['match_id'].isin(dropped_matches)].index)
print(f'player_time_wide:', '{:,} observations, {:,} features'.format(player_time_wide.shape[0], player_time_wide.shape[1]))

player_time_wide: 22,200,460 observations, 20 features


In [73]:
# Find features with missing values
display(player_time_wide.isna().sum().sort_values(ascending=False)\
[player_time_wide.isna().sum().sort_values(ascending=False) > 0])

tower_deny       21512572
aegis            21512572
tower_kill       21512572
roshan_kill      21512572
firstblood       21512572
aegis_stolen     21512572
chats_sent       21425209
tf_buybacks      16825020
tf_damage        16825020
tf_deaths        16825020
tf_gold_delta    16825020
tf_xp_delta      16825020
tf_duration      16825020
tf_last_death    16825020
dtype: int64

---

## Saving all DataFrames

In [74]:
# Rename columns before merging
players_merge = players.drop(columns=['gold_per_min', 'xp_per_min', 'tf_buybacks', 
                                      'tf_deaths', 'tf_avg_gold_delta', 'tf_avg_xp_delta'])
player_time_wide.rename(columns={'gold': 'gold_per_min', 'lh': 'lh_per_min', 'xp': 'xp_per_min'}, inplace=True)

# Merge the player_time_wide with players df
dask_players = dd.from_pandas(players_merge, npartitions=4)  
dask_player_time = dd.from_pandas(player_time_wide, npartitions=8)  

merged_dask = dask_players.merge(dask_player_time, on=['match_id', 'player_slot'], how='left')
final_df = merged_dask.compute()

# Look at the initial shape of the final DataFrame
print(f'final_df:', '{:,} observations, {:,} features'.format(final_df.shape[0], final_df.shape[1]))
final_df = final_df.sort_values(by=['match_id', 'player_slot']).reset_index()
display(final_df.head(50))
gc.collect()

final_df: 22,200,460 observations, 80 features


Unnamed: 0,index,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_nuker,hero_role_durable,hero_role_initiator,hero_role_pusher,hero_role_escape,hero_role_disabler,hero_role_support,hero_role_carry,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,teamfights,tf_damage_dealt,trueskill_mu,trueskill_sigma,trueskill,match_quality,times,gold_per_min,lh_per_min,xp_per_min,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill,tf_buybacks,tf_damage,tf_deaths,tf_gold_delta,tf_xp_delta,tf_duration,tf_last_death,chats_sent
0,2100156,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,0,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0,,,,,,,,
1,2100157,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,60,409,0,63,,,,,,,,,,,,,,
2,2100158,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,120,546,0,283,,,,,,,,,,,,,,
3,2100159,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,180,683,1,314,,,,,,,0.0,105.0,0.0,173.0,222.0,32.0,17.0,
4,2100160,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,240,956,1,485,,,,,,,,,,,,,,
5,2100161,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,300,1056,1,649,,,,,,,,,,,,,,
6,2100162,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,360,1156,1,680,,,,,,,,,,,,,,
7,2100163,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,420,1257,2,778,,,,,,,0.0,159.0,0.0,452.0,337.0,46.0,31.0,
8,2100164,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,480,1809,3,1135,,,,,,,,,,,,,,
9,2100165,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.333641,540,2111,3,1393,,,,,,,,,,,,,,


215

In [75]:
# Save the final DataFrames to a CSV file
matches_simple.to_csv('../Data/Merged/matches_simple.csv', index=False)
print(f'matches_simple:', '{:,} observations, {:,} features'.format(matches_simple.shape[0], matches_simple.shape[1]))

matches.to_csv('../Data/Merged/matches.csv', index=False)
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))

players.to_csv('../Data/Merged/players.csv', index=False)
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))

player_time_wide.to_csv('../Data/Merged/timeseries.csv', index=False)
print(f'timeseries:', '{:,} observations, {:,} features'.format(player_time_wide.shape[0], player_time_wide.shape[1]))

final_df.to_csv('../Data/Merged/final_df.csv', index=False)
print(f'final_df:', '{:,} observations, {:,} features'.format(final_df.shape[0], final_df.shape[1]))

matches_simple: 49,898 observations, 93 features
matches: 49,898 observations, 134 features
players: 498,980 observations, 68 features
timeseries: 22,200,460 observations, 20 features
final_df: 22,200,460 observations, 81 features
