# Data Preprocessing

Before we start conducting a detailed Exploratory Data Analysis (EDA), we need to build our final DataFrame by merging all the potentially valuable features for predicting fair matchmaking. This will include consolidating data from all the previously cleaned files and creating new features as needed.

---

## Initial Setup

In [1]:
# ---------------- Suppress all future warnings ---------------- #
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

# ---------------- Basic Data Science Libraries ---------------- #
import numpy as np # Linear algebra
import pandas as pd # Data processing
import random
import dask.dataframe as dd # Data processing for large DataFrames

# ---------------- System Libraries ---------------- #
import os # Miscellaneous operating system interfaces
import gc # Garbage collector interface
import ast
import nbimporter # Use functions from other Jupyter Notebooks'
from subprocess import check_output # Saves results written to the current directory as output

# ---------------- Plotting Libraries ---------------- #
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# ---------------- TrueSkill Library ---------------- #
import trueskill
import itertools
import math

# Function obtained from the documentation found in https://trueskill.org/
def win_probability(team1, team2):
    delta_mu = sum(r.mu for r in team1) - sum(r.mu for r in team2)
    sum_sigma = sum(r.sigma ** 2 for r in itertools.chain(team1, team2))
    size = len(team1) + len(team2)
    ts = trueskill.global_env()
    BETA = ts.beta
    denom = math.sqrt(size * (BETA * BETA) + sum_sigma)
    return ts.cdf(delta_mu / denom)

# Function to obtain a conservative skill rating
def conservative_trueskill_rating(mu, sigma):
    return mu - (3 * sigma)

# ---------------- Define Clean and Raw Directories ---------------- #
clean_folder = '../Data/Clean'
raw_folder = '../Data/Raw'

# ---------------- Set new DataFrame limiters ---------------- #
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 300)

# ---------------- Print files in my clean data folder ---------------- #
print(check_output(['ls', '../Data/Clean']).decode('utf8'))

ability_ids.csv
ability_upgrades.csv
chat.csv
eng_chat.csv
hero_ids.csv
item_ids.csv
matches.csv
mmr.csv
objectives.csv
patch_dates.csv
player_time.csv
players.csv
positions.csv
prev_outcomes.csv
purchase_log.csv
regions.csv
teamfights.csv
teamfights_players.csv
test_players.csv
trueskill.csv



---

## Feature Selection

Let's begin with our most crucial files in the dataset: matches.csv and players.csv. These contain the most information for each player per game.

In [2]:
# Load required files
players = pd.read_csv(clean_folder + '/players.csv', index_col=0)
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))

matches = pd.read_csv(clean_folder + '/matches.csv', index_col=0)
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))

players: 500,000 observations, 73 features
matches: 50,000 observations, 13 features


It is challenging to consider adding more features from other files when our players' DataFrame already has 73 features. To simplify the process of creating fair matchmaking, we should reduce the number of features by removing those with a high number of null values, as well as those that have little impact on the match outcome.

In [3]:
# List unwanted features from the players DataFrame
features_to_drop = [
    'unit_order_none', 'unit_order_move_to_position', 'unit_order_move_to_target', 
    'unit_order_attack_move', 'unit_order_attack_target', 'unit_order_cast_position', 
    'unit_order_cast_target', 'unit_order_cast_target_tree', 'unit_order_cast_no_target', 
    'unit_order_cast_toggle', 'unit_order_hold_position', 'unit_order_train_ability', 
    'unit_order_drop_item', 'unit_order_give_item', 'unit_order_pickup_item', 
    'unit_order_pickup_rune', 'unit_order_purchase_item', 'unit_order_sell_item', 
    'unit_order_disassemble_item', 'unit_order_move_item', 'unit_order_cast_toggle_auto', 
    'unit_order_stop', 'unit_order_buyback', 'unit_order_glyph', 
    'unit_order_eject_item_from_stash', 'unit_order_cast_rune', 'unit_order_ping_ability', 
    'unit_order_move_to_direction', 'gold_abandon', 'gold_sell', 
    'gold_destroying_structure', 'gold_killing_couriers', 'match_slot_id'
]

# Drop the features
players = players.drop(columns=features_to_drop)
players.shape

(500000, 40)

Now that we have 40 features in our players' DataFrame, let's group together the features that can provide more insight into the overall team performance.

In [4]:
# Player Performance Features
player_features = ['kills', 'deaths', 'assists', 'denies', 'gold', 'gold_spent']

# Define categorical features
players['cluster'] = players['cluster'].astype('category')
players['hero_id'] = players['hero_id'].astype('category')
players['player_slot'] = players['player_slot'].astype('category')

# Display player features
players[player_features].head()

Unnamed: 0,kills,deaths,assists,denies,gold,gold_spent
0,9,3,18,1,3261,10960
1,13,3,18,9,2954,17760
2,0,4,15,1,110,12195
3,8,4,19,6,1179,22505
4,20,3,17,13,3307,23825


In [5]:
# Match Features
match_features = ['match_id', 'start_time', 'tower_status_radiant', 
                  'tower_status_dire', 'barracks_status_dire', 
                  'barracks_status_radiant', 'first_blood_time']
matches['start_time'] = pd.to_datetime(matches['start_time'], unit='s')
matches[match_features].head()

Unnamed: 0,match_id,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
0,0,2015-11-05 19:01:52,1982,4,3,63,1
1,1,2015-11-05 19:51:18,0,1846,63,0,221
2,2,2015-11-05 23:03:06,256,1972,63,48,190
3,3,2015-11-05 23:22:03,4,1924,51,3,40
4,4,2015-11-06 07:53:05,2047,0,0,63,58


In [6]:
# Merge the match features to the players DataFrame
players = players.merge(matches[match_features], on='match_id', how='left')
display(players.head(20))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
0,0,1,Double T,0,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
2,0,1,Trash!!!,0,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
5,0,0,4,4,106,128,476,12285,397,524,5,6,8,5,162,0.0,10725,0,112,145,73,149,48,212,0,19,0,8502,12259,0,1,0,-2394,-2240,5281,6193,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
6,0,0,6k Slayer,0,102,129,317,10355,303,369,4,13,5,2,107,0.0,15028,764,0,50,11,102,36,185,81,16,0,5201,9417,0,1,0,-3287,0,3396,4356,0,18,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
7,0,0,ｔｏｍｉａ～♥,5,46,130,2390,13395,452,517,4,8,6,31,208,0.0,10230,0,2438,41,63,36,147,168,21,19,0,6853,13396,0,244,107,-3682,0,4350,8797,0,6,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
8,0,0,-,0,7,131,475,5035,189,223,1,14,8,0,27,67.0277,4774,0,0,36,0,0,46,0,180,12,0,4798,4038,0,27,0,-3286,-39,2127,1089,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
9,0,0,u didnt see who highest here?,6,73,132,60,17550,496,456,1,11,6,0,147,60.9748,6398,292,0,63,9,116,65,229,79,18,0,6659,10471,0,933,5679,-4039,-1063,2685,7011,0,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1


2019

---

## Feature Engineering

### Players DataFrame

#### Dealing with Player Anonymity

To achieve optimal results in our modelling stage, it is crucial to maximize player identification and treat truly anonymous players as individual entities *(i.e., as new players)*. Identifying as many players as possible will enhance the accuracy and effectiveness of our modelling process. We can achieve this by replacing the 0's from `account_id` with the string value from our `account` feature and recreating our `match_slot_id` from the Data Cleanup notebook for the truly anonymous.

In [7]:
# Look at the head from our players DataFrame
players.head()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
0,0,1,Double T,0,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
2,0,1,Trash!!!,0,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1


In [8]:
# Replacing the 0 values from account_id
for idx, row in players.iterrows():
    if row['account_id'] == 0:
        if row['account'] == '-':
            players.loc[idx, 'account_id'] = str(row['match_id'])+'_'+str(row['player_slot'])
        else:
            players.loc[idx, 'account_id'] = row['account']
    else:
        players.loc[idx, 'account_id'] = str(row['account_id']) # Might throw some errors when calculating the initial TrueSkill data

print(players['account_id'].dtype)
display(players[players['account'] == '-'].sample(5))

object


Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
161504,16150,1,-,16150_4,72,4,1791,18835,718,600,10,1,9,14,242,4.59878,12769,0,5008,212,164,63,154,154,139,17,0,3803,11905,357,612,584,-29,0,2967,8492,200,0,1658,152,2015-11-14 10:06:38,2047,256,48,63,3
162203,16220,0,-,16220_3,73,3,1316,27490,650,642,8,10,15,3,240,45.6699,17816,0,263,50,147,137,135,129,127,24,0,13192,15913,0,950,8656,-5210,-1351,5338,11206,0,0,2807,171,2015-11-14 10:28:03,0,1974,63,0,13
473002,47300,0,-,47300_2,73,2,4009,14910,564,431,0,7,9,1,214,30.9765,13180,0,490,29,152,137,135,0,0,17,0,2711,12539,0,708,7295,-2663,0,1207,8171,0,0,2217,132,2015-11-17 22:44:06,0,1974,63,0,0
79975,7997,0,-,7997_128,71,128,197,8680,298,278,7,8,10,10,57,105.233,11620,0,31,46,40,36,63,152,127,12,0,3849,5036,0,37,0,-2062,0,3553,2285,0,0,1925,112,2015-11-13 07:09:57,2038,0,0,63,81
86086,8608,0,-,8608_129,95,129,2252,20465,695,663,13,3,10,2,241,19.1905,17695,0,2192,116,65,50,51,156,154,21,0,8960,13769,447,502,1339,-1857,-1144,5453,10372,455,0,2135,112,2015-11-13 10:27:52,1926,0,0,63,5


#### Hero Attributes

In [9]:
# Load required files
heroes = pd.read_csv(clean_folder + '/hero_ids.csv')
print(f'heroes:', '{:,} observations, {:,} features'.format(heroes.shape[0], heroes.shape[1]))

heroes: 112 observations, 4 features


In [10]:
# Look at the head
heroes.head()

Unnamed: 0,hero_id,name,primary_attribute,roles
0,1,Anti-Mage,agi,"{'Nuker', 'Escape', 'Carry'}"
1,2,Axe,str,"{'Carry', 'Disabler', 'Durable', 'Initiator'}"
2,3,Bane,all,"{'Support', 'Disabler', 'Durable', 'Nuker'}"
3,4,Bloodseeker,agi,"{'Nuker', 'Initiator', 'Disabler', 'Carry'}"
4,5,Crystal Maiden,int,"{'Support', 'Disabler', 'Nuker'}"


In [11]:
# Convert the roles feature from strings to actual sets
heroes['roles'] = heroes['roles'].apply(ast.literal_eval)

# Extract all unique roles
unique_hero_roles = set(role for roles in heroes['roles'] for role in roles) # To ensure we won't have duplicated values
unique_hero_roles = list(unique_hero_roles) # To convert to columns later

# Create a new DataFrame for one-hot encoding the roles
one_hot_encoded_roles = pd.DataFrame(0, index=heroes.index, columns=unique_hero_roles)

# Fill the DataFrame with ones for each role covered by a hero
for idx, roles in enumerate(heroes['roles']):
    for role in roles:
        one_hot_encoded_roles.at[idx, role] = 1

# Concatenate to original DataFrame and dropping original feature
heroes = pd.concat([heroes, one_hot_encoded_roles], axis=1)
heroes.drop(columns='roles', inplace=True)

# Look at the head
heroes.head()

Unnamed: 0,hero_id,name,primary_attribute,Pusher,Nuker,Escape,Disabler,Initiator,Durable,Carry,Support
0,1,Anti-Mage,agi,0,1,1,0,0,0,1,0
1,2,Axe,str,0,0,0,1,1,1,1,0
2,3,Bane,all,0,1,0,1,0,1,0,1
3,4,Bloodseeker,agi,0,1,0,1,1,0,1,0
4,5,Crystal Maiden,int,0,1,0,1,0,0,0,1


In [12]:
# Rename the columns before merge
heroes.rename(columns={
    'primary_attribute': 'hero_primary_attribute',
    'Carry': 'hero_role_carry',
    'Escape': 'hero_role_escape',
    'Durable': 'hero_role_durable',
    'Pusher': 'hero_role_pusher',
    'Initiator': 'hero_role_initiator',
    'Disabler': 'hero_role_disabler',
    'Nuker': 'hero_role_nuker',
    'Support': 'hero_role_support'
}, inplace=True)

# Merge into the players' DataFrame
players = players.merge(right=heroes.drop(columns='name'), on='hero_id', how='left')

# Sanity check
display(players.head(20))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_pusher,hero_role_nuker,hero_role_escape,hero_role_disabler,hero_role_initiator,hero_role_durable,hero_role_carry,hero_role_support
0,0,1,Double T,Double T,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,all,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
2,0,1,Trash!!!,Trash!!!,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
5,0,0,4,4,106,128,476,12285,397,524,5,6,8,5,162,0.0,10725,0,112,145,73,149,48,212,0,19,0,8502,12259,0,1,0,-2394,-2240,5281,6193,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0
6,0,0,6k Slayer,6k Slayer,102,129,317,10355,303,369,4,13,5,2,107,0.0,15028,764,0,50,11,102,36,185,81,16,0,5201,9417,0,1,0,-3287,0,3396,4356,0,18,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,all,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
7,0,0,ｔｏｍｉａ～♥,5,46,130,2390,13395,452,517,4,8,6,31,208,0.0,10230,0,2438,41,63,36,147,168,21,19,0,6853,13396,0,244,107,-3682,0,4350,8797,0,6,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
8,0,0,-,0_131,7,131,475,5035,189,223,1,14,8,0,27,67.0277,4774,0,0,36,0,0,46,0,180,12,0,4798,4038,0,27,0,-3286,-39,2127,1089,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
9,0,0,u didnt see who highest here?,6,73,132,60,17550,496,456,1,11,6,0,147,60.9748,6398,292,0,63,9,116,65,229,79,18,0,6659,10471,0,933,5679,-4039,-1063,2685,7011,0,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


1232

#### Team Features

To fully grasp the extent of each player's impact on the team, it is essential to calculate the ratio of each player's individual contribution to the overall team statistics for every match. This involves extracting specific performance metrics and consolidating them for each team to obtain a comprehensive understanding of player involvement.

##### Overall Team Performance

In [13]:
# Aggregate team stats
team_stats = players.groupby(['match_id', 'player_slot'])[player_features].sum().reset_index()
team_stats['radiant_team'] = team_stats['player_slot'].apply(lambda x: 1 if x < 5 else 0)

# Aggregating them by team
team_features = team_stats.groupby(['match_id', 'radiant_team'], observed=False)[player_features].sum().reset_index()

# Rename columns for merge
team_features.rename(columns={'kills': 'team_kills', 
                              'deaths': 'team_deaths', 
                              'assists': 'team_assists', 
                              'denies': 'team_denies', 
                              'gold': 'team_gold', 
                              'gold_spent': 'team_gold_spent'}, inplace=True)

display(team_stats.head(10))
display(team_features.head(10))

Unnamed: 0,match_id,player_slot,kills,deaths,assists,denies,gold,gold_spent,radiant_team
0,0,0,9,3,18,1,3261,10960,1
1,0,1,13,3,18,9,2954,17760,1
2,0,2,0,4,15,1,110,12195,1
3,0,3,8,4,19,6,1179,22505,1
4,0,4,20,3,17,13,3307,23825,1
5,0,128,5,6,8,5,476,12285,0
6,0,129,4,13,5,2,317,10355,0
7,0,130,4,8,6,31,2390,13395,0
8,0,131,1,14,8,0,475,5035,0
9,0,132,1,11,6,0,60,17550,0


Unnamed: 0,match_id,radiant_team,team_kills,team_deaths,team_assists,team_denies,team_gold,team_gold_spent
0,0,0,15,52,33,38,3718,58620
1,0,1,50,17,87,30,10811,87245
2,1,0,50,37,83,16,9085,107750
3,1,1,35,53,49,27,4776,69310
4,2,0,48,22,90,16,11177,81620
5,2,1,22,49,31,10,2494,54990
6,3,0,63,65,110,29,5954,94430
7,3,1,64,66,92,32,6455,76685
8,4,0,16,37,30,21,2030,38980
9,4,1,37,16,59,26,14099,78980


##### KDA Scores

KDA Scores can be calculated using the following formula: 
<center>$\frac{kills + assists}{deaths +1}$</center>

In [14]:
# Calculate KDA Scores
kda_score = (team_stats['kills'] + team_stats['assists']) / (team_stats['deaths'] + 1)
team_stats.insert(2, column='kda', value=kda_score)
team_stats.drop(columns=['kills', 'deaths', 'assists'], inplace=True)

team_kda_score = (team_features['team_kills'] + team_features['team_assists']) / (team_features['team_deaths'] + 1)
team_features.insert(2, column='team_kda', value=team_kda_score)
team_features.drop(columns=['team_kills', 'team_deaths', 'team_assists'], inplace=True)

display(team_stats.head(10))
display(team_features.head(10))

Unnamed: 0,match_id,player_slot,kda,denies,gold,gold_spent,radiant_team
0,0,0,6.75,1,3261,10960,1
1,0,1,7.75,9,2954,17760,1
2,0,2,3.0,1,110,12195,1
3,0,3,5.4,6,1179,22505,1
4,0,4,9.25,13,3307,23825,1
5,0,128,1.857143,5,476,12285,0
6,0,129,0.642857,2,317,10355,0
7,0,130,1.111111,31,2390,13395,0
8,0,131,0.6,0,475,5035,0
9,0,132,0.583333,0,60,17550,0


Unnamed: 0,match_id,radiant_team,team_kda,team_denies,team_gold,team_gold_spent
0,0,0,0.90566,38,3718,58620
1,0,1,7.611111,30,10811,87245
2,1,0,3.5,16,9085,107750
3,1,1,1.555556,27,4776,69310
4,2,0,6.0,16,11177,81620
5,2,1,1.06,10,2494,54990
6,3,0,2.621212,29,5954,94430
7,3,1,2.328358,32,6455,76685
8,4,0,1.210526,21,2030,38980
9,4,1,5.647059,26,14099,78980


Now let's calculate the ratios for each player.

In [15]:
# Calculate participation ratios
team_stats = team_stats.merge(team_features, on=['match_id', 'radiant_team'], how='left')
for col in team_stats.columns:
    if col.startswith('team_'):
        player_col = col.split('team_')
        team_stats[col] = team_stats[player_col[1]] / team_stats[col]

display(team_stats.head(10))

Unnamed: 0,match_id,player_slot,kda,denies,gold,gold_spent,radiant_team,team_kda,team_denies,team_gold,team_gold_spent
0,0,0,6.75,1,3261,10960,1,0.886861,0.033333,0.301637,0.125623
1,0,1,7.75,9,2954,17760,1,1.018248,0.3,0.27324,0.203565
2,0,2,3.0,1,110,12195,1,0.394161,0.033333,0.010175,0.139779
3,0,3,5.4,6,1179,22505,1,0.709489,0.2,0.109056,0.257952
4,0,4,9.25,13,3307,23825,1,1.215328,0.433333,0.305892,0.273082
5,0,128,1.857143,5,476,12285,0,2.050595,0.131579,0.128026,0.20957
6,0,129,0.642857,2,317,10355,0,0.709821,0.052632,0.085261,0.176646
7,0,130,1.111111,31,2390,13395,0,1.226852,0.815789,0.642819,0.228506
8,0,131,0.6,0,475,5035,0,0.6625,0.0,0.127757,0.085892
9,0,132,0.583333,0,60,17550,0,0.644097,0.0,0.016138,0.299386


Finally, we are able to proceed with merging the team participation ratios into our players' DataFrame.

In [16]:
# Merge to the players DataFrame
players = players.merge(team_stats.drop(columns=['denies', 'gold', 'gold_spent']), 
                        on=['match_id', 'player_slot'], how='left')

# Define radiant_team as categorical
players['radiant_team'] = players['radiant_team'].astype('category')

# Sanity check
display(players.head(20))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_pusher,hero_role_nuker,hero_role_escape,hero_role_disabler,hero_role_initiator,hero_role_durable,hero_role_carry,hero_role_support,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent
0,0,1,Double T,Double T,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,all,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,7.75,1,1.018248,0.3,0.27324,0.203565
2,0,1,Trash!!!,Trash!!!,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,3.0,1,0.394161,0.033333,0.010175,0.139779
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,5.4,1,0.709489,0.2,0.109056,0.257952
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,9.25,1,1.215328,0.433333,0.305892,0.273082
5,0,0,4,4,106,128,476,12285,397,524,5,6,8,5,162,0.0,10725,0,112,145,73,149,48,212,0,19,0,8502,12259,0,1,0,-2394,-2240,5281,6193,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.857143,0,2.050595,0.131579,0.128026,0.20957
6,0,0,6k Slayer,6k Slayer,102,129,317,10355,303,369,4,13,5,2,107,0.0,15028,764,0,50,11,102,36,185,81,16,0,5201,9417,0,1,0,-3287,0,3396,4356,0,18,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,all,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.642857,0,0.709821,0.052632,0.085261,0.176646
7,0,0,ｔｏｍｉａ～♥,5,46,130,2390,13395,452,517,4,8,6,31,208,0.0,10230,0,2438,41,63,36,147,168,21,19,0,6853,13396,0,244,107,-3682,0,4350,8797,0,6,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,agi,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.111111,0,1.226852,0.815789,0.642819,0.228506
8,0,0,-,0_131,7,131,475,5035,189,223,1,14,8,0,27,67.0277,4774,0,0,36,0,0,46,0,180,12,0,4798,4038,0,27,0,-3286,-39,2127,1089,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.6,0,0.6625,0.0,0.127757,0.085892
9,0,0,u didnt see who highest here?,6,73,132,60,17550,496,456,1,11,6,0,147,60.9748,6398,292,0,63,9,116,65,229,79,18,0,6659,10471,0,933,5679,-4039,-1063,2685,7011,0,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,str,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.583333,0,0.644097,0.0,0.016138,0.299386


0

#### Teamfights

Team fights are a great way to assess each team's performance and coordination. Engineering features that help us analyze the dynamics and results of team fights can provide valuable insights into player performance, team synergy, and the influence of team fights on the overall match result.

In [17]:
# Load required files
tf_players = pd.read_csv(clean_folder + '/teamfights_players.csv', index_col=0)
print(f'tf_players:', '{:,} observations, {:,} features'.format(tf_players.shape[0], tf_players.shape[1]))

tf_players: 5,390,470 observations, 9 features


In [18]:
# Teamfight Participation
def count_values_not_zero(series):
    return (series > 0).sum()

player_teamfights = tf_players.groupby(['match_id', 'player_slot'])['damage'].agg(count_values_not_zero).reset_index(name='teamfights')

# Teamfight Performance
tf_player_damage = tf_players.groupby(['match_id', 'player_slot'])['damage'].sum().reset_index(name='tf_damage_dealt')
tf_player_buybacks = tf_players.groupby(['match_id', 'player_slot'])['buybacks'].sum().reset_index(name='tf_buybacks')
tf_player_deaths = tf_players.groupby(['match_id', 'player_slot'])['deaths'].sum().reset_index(name='tf_deaths')

# Teamfight Impact
tf_player_gold_delta = tf_players.groupby(['match_id', 'player_slot'])['gold_delta'].mean().reset_index(name='tf_avg_gold_delta')
tf_player_xp_delta = tf_players.groupby(['match_id', 'player_slot']).apply(lambda x: (x['xp_end'] - x['xp_start']).mean()).reset_index(name='tf_avg_xp_delta')

# Merge all features in a single DataFrame
player_teamfights = player_teamfights.merge(tf_player_damage, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_buybacks, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_deaths, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_gold_delta, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_xp_delta, on=['match_id', 'player_slot'], how='left')


# Display the head and shape of player_teamfights
display(player_teamfights.head(20))
print(f'player_teamfights:', '{:,} observations, {:,} features'.format(player_teamfights.shape[0], player_teamfights.shape[1]))

Unnamed: 0,match_id,player_slot,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta
0,0,0,10,6099,0,2,329.166667,538.333333
1,0,1,10,13663,0,4,409.833333,1112.25
2,0,2,7,1155,1,3,123.666667,495.166667
3,0,3,9,15201,0,4,317.333333,795.75
4,0,4,12,30774,1,2,460.583333,1189.416667
5,0,128,10,23616,2,5,86.75,731.666667
6,0,129,9,12807,0,4,211.583333,516.75
7,0,130,8,15988,0,5,193.0,610.25
8,0,131,10,5718,1,9,-29.833333,401.75
9,0,132,10,9786,1,9,-65.75,639.5


player_teamfights: 499,310 observations, 8 features


It's odd that we are missing 690 observations in this new DataFrame. It's possible that these players did not engage in any teamfights, either by avoiding them entirely or due to thrown matches.

In [19]:
# Look for original match_ids
print('Total matches in original:', tf_players['match_id'].nunique())

Total matches in original: 49931


After verifying that there are no random missing values, instead of disregarding specific matches, let's merge these new features into our players' DataFrame and then examine the observations with null values. We will investigate the teamfight missing values after exploring the matches DataFrame.

In [20]:
# Merging to the players' DataFrame
players = players.drop(columns=['kills', 'deaths', 'assists'])\
                .merge(player_teamfights, on=['match_id', 'player_slot'], how='left')
display(players[players['teamfights'].isna()].sample(50))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_pusher,hero_role_nuker,hero_role_escape,hero_role_disabler,hero_role_initiator,hero_role_durable,hero_role_carry,hero_role_support,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta
488292,48829,1,Smurf on my own turf :::,47446,82,2,3989,6025,486,390,0,108,0.0,1499,0,1408,63,0,0,0,0,108,11,0,155,3100,0,4381,58,-209,0,446,3487,200,3,1168,132,2015-11-18 02:56:11,2047,256,48,63,1,agi,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1,0.173913,0.0,0.306752,0.130369,,,,,,
244746,24474,1,38207,38207,100,129,345,1825,458,257,1,7,3.13257,2052,0,70,41,46,20,29,16,16,3,0,254,614,0,0,0,0,0,931,276,0,0,202,156,2015-11-15 07:05:32,2047,2047,63,63,0,str,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,5.0,0,0.833333,0.25,0.252933,0.279265,,,,,,
306334,30633,0,★【Wu-Tang-Clan】★™♥,111847,2,4,1,1885,388,166,2,16,0.0,1321,0,0,39,44,13,0,29,13,4,1,0,1393,0,0,0,-237,0,0,668,0,1,500,123,2015-11-15 20:43:24,2046,2047,63,63,95,str,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1,,0.333333,0.000481,0.245763,,,,,,
48964,4896,1,f_[o]r,24520,93,4,788,5300,390,301,13,32,0.0,5419,0,775,71,181,0,63,0,152,9,0,1028,3486,0,31,0,-269,-430,1096,1267,0,3,904,112,2015-11-12 21:24:44,2046,391,51,63,0,agi,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,3.0,1,1.235294,0.722222,0.090067,0.221897,,,,,,
467471,46747,0,bOb-jOw,bOb-jOw,56,1,1,3250,234,221,3,39,0.0,2293,0,0,0,0,0,0,0,0,8,1,300,3636,0,100,70,-776,0,382,1725,0,1,1095,204,2015-11-17 21:31:35,390,2046,63,51,280,agi,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1,0.0,0.2,0.000254,0.157805,,,,,,
402778,40277,0,Affirmative™,136180,100,131,2,2600,231,215,1,10,5.0321,2863,0,0,41,50,0,0,16,16,7,1,322,2326,0,0,0,-179,0,660,407,0,1,738,122,2015-11-17 01:44:11,2047,2047,63,63,100,str,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,2.0,0,2.75,0.125,0.001281,0.190128,,,,,,
92730,9273,0,42125,42125,83,0,1,1415,112,71,2,3,0.0,1468,134,0,0,0,0,0,0,0,4,3,0,1168,0,0,0,-296,0,0,122,0,0,983,123,2015-11-13 13:57:36,1974,2039,63,63,49,str,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.2,1,0.186667,0.5,0.000312,0.064597,,,,,,
162823,16282,0,66752,66752,11,3,1,610,128,35,1,4,0.0,734,0,0,0,0,0,0,0,0,3,3,0,639,0,0,0,-89,0,0,163,0,0,1069,132,2015-11-14 10:41:40,48,2039,63,12,123,agi,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1,0.0,0.090909,0.00027,0.028903,,,,,,
432910,43291,1,=DB=|Bucks|.G2A,142866,50,0,339,16150,430,495,24,140,8.46826,8263,982,1075,231,46,43,31,108,229,18,0,4801,13063,0,482,238,-958,0,1922,5492,0,4,2220,133,2015-11-17 12:41:59,2047,0,0,63,7,all,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1,0.425532,0.48,0.023643,0.210588,,,,,,
493333,49333,0,Louis Litt,791,1,3,548,9335,451,400,26,159,0.0,1434,0,1148,63,145,46,81,71,0,12,0,43,8342,0,50,0,-1047,0,94,6361,200,6,1264,112,2015-11-18 04:21:16,256,1855,63,48,23,agi,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.25,1,0.545455,0.619048,0.117824,0.337064,,,,,,


0

It seems that some of these players have a leaver status, and others have very poor stats overall, suggesting that these matches may have been thrown. However, others seem to have decent stats across the board, indicating that the data could have been corrupted before manipulation. Let's check how many null values we have from our engineered features so far.

#### TrueSkill

We must consistently evaluate each player's skill level as they engage in matches to ensure a fair matchmaking process. Our data cleanup notebook includes a file with TrueSkill ratings for each player based on previous match results. However, it's important to note that some players in our current dataset may not be included in the TrueSkill file, which is a potential factor that needs to be addressed.

In [21]:
# Load required files
trueskill_df = pd.read_csv(clean_folder + '/trueskill.csv')
print(f'trueskill:', '{:,} observations, {:,} features'.format(trueskill_df.shape[0], trueskill_df.shape[1]))

trueskill: 834,226 observations, 6 features


In [22]:
trueskill_df.head()

Unnamed: 0,account_id,total_wins,total_matches,trueskill_mu,trueskill_sigma,conservative_skill_estimate
0,236579,14,24,27.868035,5.212361,12.230953
1,-343,1,1,26.544163,8.065475,2.347736
2,-1217,1,1,26.521103,8.114989,2.176136
3,-1227,1,1,27.248025,8.092217,2.971375
4,-1284,0,1,22.931016,8.092224,-1.345657


In [23]:
# Set up the global parameters
trueskill.setup(mu=trueskill_df['trueskill_mu'].mean(), 
                sigma=trueskill_df['trueskill_sigma'].mean(),
                draw_probability=0)

# Instantiate the ratings dictionary
ts_init_ratings = {}

# Get the list of unique account IDs from the players DataFrame
account_ids = players['account_id'].unique().tolist()

# Read the mu and sigma from the file and append them to ts_ratings
for account in account_ids:
    try:
        account_row = trueskill_df[trueskill_df['account_id'] == int(account)]
    
        if not account_row.empty:
            ts_mu = account_row['trueskill_mu'].values[0]
            ts_sigma = account_row['trueskill_sigma'].values[0]
            ts_init_ratings[account] = trueskill.Rating(mu=ts_mu, sigma=ts_sigma)
        else:
            ts_init_ratings[account] = trueskill.Rating()
            
    except ValueError:
        ts_init_ratings[account] = trueskill.Rating()

print('Total account IDs in players DF: {:,}'.format(len(account_ids)))
print('Total keys in initial ratings dictionary: {:,}'.format(len(ts_init_ratings)))

Total account IDs in players DF: 311,717
Total keys in initial ratings dictionary: 311,717


In [24]:
# Check a random player's initial TrueSkill
rand_player_ts = '80491'
display(f'{rand_player_ts} TrueSkill:', ts_init_ratings[rand_player_ts])

# Sanity check from our dictionary
trueskill_df[trueskill_df['account_id'] == int(rand_player_ts)][['account_id', 'trueskill_mu', 'trueskill_sigma']]

'80491 TrueSkill:'

trueskill.Rating(mu=24.638, sigma=6.284)

Unnamed: 0,account_id,trueskill_mu,trueskill_sigma
119612,80491,24.637552,6.284008


In [25]:
ts_init_ratings[players[(players['match_id']==21169) & (players['player_slot']==1)]['account_id'].values[0]]

trueskill.Rating(mu=27.193, sigma=2.003)

In [26]:
# Test the win_probability function with one match's results
rand_match_id = random.choice(players['match_id'].unique().tolist())
print('Match ID picked:', rand_match_id)

team_radiant = []
team_dire = []

for i in range(5):
    team_radiant.append(ts_init_ratings[players[(players['match_id']==rand_match_id) &\
                        (players['player_slot']==i)]['account_id'].values[0]])

for i in range(128,133):
    team_dire.append(ts_init_ratings[players[(players['match_id']==rand_match_id) &\
                     (players['player_slot']==i)]['account_id'].values[0]])

print('Match quality:', trueskill.quality([team_radiant, team_dire]))
print('Win probability:', win_probability(team_radiant, team_dire))
print('True outcome:', matches[matches['match_id']==rand_match_id]['radiant_win'])

Match ID picked: 19237
Match quality: 0.6425416755193922
Win probability: 0.5484400128713172
True outcome: 19237    1
Name: radiant_win, dtype: int64


In [27]:
# Function to determine leaver status weights
def get_leaver_weight(leaver_status):
    leaver_weights = {
        0: 1.0, # NONE - finished match, no abandon.
        1: 0.5, # DISCONNECTED - player DC, no abandon.
        2: 0.3, # DISCONNECTED_TOO_LONG - player DC > 5min, abandoned.
        3: 0.1, # ABANDONED - player DC, clicked leave, abandoned.
        4: 0.01, # AFK - player AFK, abandoned.
        5: 0.0, # NEVER_CONNECTED - player never connected, no abandon.
        6: 0.0, # NEVER_CONNECTED_TOO_LONG - player took too long to connect, no abandon.
    }
    return leaver_weights.get(leaver_status, 1.0)

# Sort the players DataFrame in chronological order
ts_players = players.sort_values(by=['start_time', 'match_id', 'player_slot'])

# Instantiate updated ratings' dictionary
ts_updated_ratings = ts_init_ratings.copy()

# Function to update the ratings based on a single match
def update_ratings(match_):
    global ts_updated_ratings

    # Extract player data from the match
    match_account_ids = match_['account_id'].tolist()
    match_leaver_status = match_['leaver_status'].tolist()
    match_outcome = match_['match_outcome'].tolist()
    
    # Get the current ratings and leaver weights
    ratings = [ts_updated_ratings[acc_id] for acc_id in match_account_ids]
    weights = [get_leaver_weight(ls) for ls in match_leaver_status]

    player_weights = {}
    for idx, weight in enumerate(weights):
        if idx < 5:
            key = (0, match_account_ids[idx])
        else:
            key = (1, match_account_ids[idx])
        player_weights[key] = weight    
    
    # Split players into teams
    radiant = {}
    dire = {}
    for idx, player in enumerate(ratings):
        if idx < 5:
            radiant[match_account_ids[idx]] = ratings[idx]
        else:
            dire[match_account_ids[idx]] = ratings[idx]

    # Calculate pre-match quality
    pre_match_quality = trueskill.quality([radiant, dire])

    # Lists to store TrueSkill values before each match
    trueskill_mu_list = []
    trueskill_sigma_list = []
    cons_trueskill_list = []
    
    # Add the current TrueSkill rating prior the match
    for rating in ratings:
        trueskill_mu_list.append(rating.mu)
        trueskill_sigma_list.append(rating.sigma)
        cons_trueskill_list.append(conservative_trueskill_rating(rating.mu, rating.sigma))
        
    match_['trueskill_mu'] = trueskill_mu_list
    match_['trueskill_sigma'] = trueskill_sigma_list
    match_['trueskill'] = cons_trueskill_list
    match_['pre_match_quality'] = pre_match_quality
    
    # Determine the outcome
    if match_outcome[0] == 0:
        ranks = [0, 1] # Radiant wins
    else:
        ranks = [1, 0] # Dire wins
    
    # Update ratings
    new_radiant, new_dire = trueskill.rate([radiant, dire], weights=player_weights, ranks=ranks)
    
    for i in range(5):
        ts_updated_ratings[match_account_ids[i]] = new_radiant[match_account_ids[i]]
        ts_updated_ratings[match_account_ids[i+5]] = new_dire[match_account_ids[i+5]]

    return match_

In [28]:
# Calculate the Conservative Skill Estimate for each player before a match
processed_matches = []

# Iterate through each match and update ratings
for m_id, match_group in ts_players.groupby('match_id'):
    processed_match = update_ratings(match_group)
    processed_matches.append(processed_match)

# Concatenate all processed match groups back into one DataFrame
players_with_ts = pd.concat(processed_matches)
players_with_ts[['match_id', 'start_time', 'account_id', 'player_slot', 'leaver_status', 'kda', 
                 'trueskill_mu', 'trueskill_sigma', 'trueskill', 'pre_match_quality', 'match_outcome']].head(20)

Unnamed: 0,match_id,start_time,account_id,player_slot,leaver_status,kda,trueskill_mu,trueskill_sigma,trueskill,pre_match_quality,match_outcome
0,0,2015-11-05 19:01:52,Double T,0,0,6.75,25.112577,7.270275,3.301753,0.422474,1
1,0,2015-11-05 19:01:52,1,1,0,7.75,26.232905,4.854238,11.670192,0.422474,1
2,0,2015-11-05 19:01:52,Trash!!!,2,0,3.0,25.112577,7.270275,3.301753,0.422474,1
3,0,2015-11-05 19:01:52,2,3,0,5.4,27.614505,6.550771,7.96219,0.422474,1
4,0,2015-11-05 19:01:52,3,4,0,9.25,20.221006,5.961434,2.336703,0.422474,1
5,0,2015-11-05 19:01:52,4,128,0,1.857143,26.773302,5.322094,10.807019,0.422474,0
6,0,2015-11-05 19:01:52,6k Slayer,129,0,0.642857,25.112577,7.270275,3.301753,0.422474,0
7,0,2015-11-05 19:01:52,5,130,0,1.111111,32.190551,2.93714,23.379132,0.422474,0
8,0,2015-11-05 19:01:52,0_131,131,0,0.6,23.1395,6.807861,2.715916,0.422474,0
9,0,2015-11-05 19:01:52,6,132,0,0.583333,34.77452,5.783084,17.425268,0.422474,0


In [29]:
# Sanity Check
display(players_with_ts.shape)
display(players.shape)

(500000, 68)

(500000, 64)

In [30]:
# Update the players' DataFrame
players = players_with_ts
display(players.sample(50))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_pusher,hero_role_nuker,hero_role_escape,hero_role_disabler,hero_role_initiator,hero_role_durable,hero_role_carry,hero_role_support,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta,trueskill_mu,trueskill_sigma,trueskill,pre_match_quality
240170,24017,1,[Sq].a.re.S4...,91609,14,0,6114,13150,412,531,0,78,47.6488,16925,839,1027,108,92,214,102,37,60,21,0,17022,7805,0,288,170,-1882,0,7442,2924,0,3,2836,122,2015-11-15 05:49:45,1968,0,0,60,187,str,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,3.333333,1,0.983607,0.0,0.370658,0.160797,9.0,12782.0,0.0,4.0,442.363636,908.545455,23.231012,6.768598,2.925219,0.42585
135745,13574,1,ECHHS.Fisherswamp,53964,1,128,4520,39000,766,615,6,508,22.1732,28004,373,8320,220,208,114,147,116,145,25,0,9263,22857,894,145,100,-3084,-1830,9759,20563,400,1,3230,122,2015-11-14 02:04:08,0,1828,63,0,29,agi,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,4.142857,0,1.824377,0.24,0.397118,0.372084,10.0,26797.0,1.0,4.0,861.454545,816.181818,20.642746,5.69223,3.566056,0.598972
437411,43741,0,sebra,109892,73,1,4119,24160,667,564,8,348,43.5279,14375,0,925,48,152,116,137,143,112,23,0,5226,21832,0,660,11146,-5029,-1140,2901,13315,0,3,2948,132,2015-11-17 14:02:17,0,1846,63,0,4,str,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.416667,1,0.85,0.470588,0.627801,0.280881,6.0,12424.0,0.0,5.0,-82.625,121.75,20.786948,6.300831,1.884455,0.539783
250669,25066,1,P3G || FreexIIV,3350,92,132,1415,19740,579,619,6,78,10.478,22600,0,2134,110,0,229,29,108,160,20,0,14833,7534,447,106,50,-1586,0,10511,2984,200,2,2218,171,2015-11-15 08:57:53,0,1974,63,0,163,all,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,7.2,0,1.169231,0.214286,0.072032,0.261215,9.0,15107.0,0.0,4.0,844.222222,1391.222222,26.703339,5.273298,10.883443,0.300647
89612,8961,0,DatFeedingTho,2736,36,2,1108,9360,261,281,2,57,4.30425,5979,6645,112,180,40,108,79,0,46,15,0,5300,6783,0,5,0,-2184,0,4290,1880,0,1,2576,153,2015-11-13 12:43:11,0,1830,63,0,5,int,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,3.571429,1,1.347709,0.052632,0.243945,0.129362,11.0,6400.0,0.0,5.0,81.0,281.25,24.22116,6.813557,3.780489,0.52353
259354,25935,1,82818,82818,11,4,5628,23710,662,746,9,390,0.0,17955,0,3326,1,156,147,30,48,160,25,0,10412,21412,0,622,500,-1316,0,5876,14983,0,0,2609,132,2015-11-15 11:33:26,1974,256,51,63,0,agi,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,5.0,1,1.52,0.25,0.402979,0.296876,8.0,13299.0,0.0,2.0,594.454545,914.0,27.742208,7.212398,6.105015,0.455042
221699,22169,0,noMercy,22002,28,132,763,26550,450,493,3,261,93.8295,14521,0,2414,135,112,116,1,63,114,25,0,19002,13422,0,14,0,-3262,-31,9270,11727,200,3,3943,138,2015-11-15 00:45:16,1796,0,0,51,7,str,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,2.888889,0,1.444444,0.125,0.156449,0.292336,14.0,12309.0,1.0,7.0,249.0625,948.3125,20.067786,5.640122,3.147421,0.2756
397431,39743,1,-,39743_1,90,1,979,8515,357,351,0,51,0.0,7439,422,757,129,63,77,79,73,0,13,0,4710,5593,0,78,0,-1435,-786,2457,2064,0,0,1769,121,2015-11-17 00:08:58,1983,256,48,63,56,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.5,1,0.386598,0.0,0.11923,0.129054,4.0,2017.0,0.0,3.0,11.2,415.8,25.112577,7.270275,3.301753,0.476994
156210,15621,0,NOOBSTER,NOOBSTER,25,0,381,10250,295,275,0,88,10.8255,12572,0,525,50,185,108,46,215,0,15,0,4494,7465,0,51,0,-3978,0,4046,3432,0,1,2603,123,2015-11-14 08:02:41,0,1974,63,0,0,int,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.384615,1,0.878412,0.0,0.157178,0.199397,13.0,9463.0,0.0,10.0,-7.411765,192.529412,25.112577,7.270275,3.301753,0.48705
356001,35600,1,marchino11,40060,73,1,7334,26600,1019,761,2,308,34.2512,18210,0,6147,48,137,208,147,41,235,22,0,8574,15963,447,650,10053,-1077,0,4066,11188,464,4,2018,171,2015-11-16 12:36:15,1926,0,0,59,61,str,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,5.0,1,1.236559,0.1,0.35394,0.362497,7.0,10270.0,0.0,2.0,802.142857,977.142857,26.861318,3.348509,16.815793,0.581782


7588114

#### Filling Missing Values

In [31]:
# Find features with missing values
display(players.isna().sum().sort_values(ascending=False)\
[players.isna().sum().sort_values(ascending=False) > 0 ])

tf_avg_xp_delta           690
tf_avg_gold_delta         690
tf_deaths                 690
tf_buybacks               690
tf_damage_dealt           690
teamfights                690
team_denies               275
team_kda                  105
hero_role_support          37
hero_primary_attribute     37
hero_role_escape           37
hero_role_disabler         37
hero_role_initiator        37
hero_role_durable          37
hero_role_carry            37
hero_role_nuker            37
hero_role_pusher           37
dtype: int64

In [32]:
# Explore team missing values in Team KDA and Team Denies
print('Team KDA:')
display(players[(players['team_kda'].isna()) & (players['kda'] != 0)][['match_id', 'player_slot', 'kda']])
print('\n------------------------------------\nTeam Denies')
display(players[(players['team_denies'].isna()) & (players['denies'] != 0)][['match_id', 'player_slot', 'denies']])

Team KDA:


Unnamed: 0,match_id,player_slot,kda



------------------------------------
Team Denies


Unnamed: 0,match_id,player_slot,denies


Based on the results, the null values in the team features can be interpreted as representing a 0. This is because these particular observations and the rest of the team's players had a 0 in their original columns. In this context, we should treat the null values as 0.

In [33]:
# Fill missing team_deaths and team_kda values
players['team_kda'].fillna(0, inplace=True)
players['team_denies'].fillna(0, inplace=True)

# Find features with missing values
display(players.isna().sum().sort_values(ascending=False)\
[players.isna().sum().sort_values(ascending=False) > 0])

tf_avg_xp_delta           690
tf_avg_gold_delta         690
tf_deaths                 690
tf_buybacks               690
tf_damage_dealt           690
teamfights                690
hero_role_support          37
hero_primary_attribute     37
hero_role_initiator        37
hero_role_durable          37
hero_role_carry            37
hero_role_nuker            37
hero_role_pusher           37
hero_role_disabler         37
hero_role_escape           37
dtype: int64

In [34]:
# Look at the rows with missing hero features' values
players[players[['hero_role_durable', 'hero_primary_attribute', 'hero_role_nuker', 
         'hero_role_carry', 'hero_role_initiator', 'hero_role_disabler', 
         'hero_role_support', 'hero_role_pusher', 'hero_role_escape']].isna().any(axis=1)]

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_pusher,hero_role_nuker,hero_role_escape,hero_role_disabler,hero_role_initiator,hero_role_durable,hero_role_carry,hero_role_support,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta,trueskill_mu,trueskill_sigma,trueskill,pre_match_quality
7203,720,0,2956,2956,0,3,0,0,135,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2021,132,2015-11-12 08:49:43,0,1926,59,0,152,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.411772,2.694663,19.327783,0.634574
10320,1032,0,-,1032_0,0,0,0,0,124,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2358,132,2015-11-12 10:24:54,0,1956,63,0,0,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.363303,7.063993,-2.828676,0.396666
11088,1108,0,4765,4765,0,131,0,0,104,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2193,111,2015-11-12 10:45:17,2039,0,0,63,18,,,,,,,,,,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.979409,2.081422,21.735143,0.624907
21343,2134,0,9442,9442,0,3,0,0,100,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,485,138,2015-11-12 14:43:48,2047,2047,63,63,118,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.599931,4.060508,15.418408,0.505559
21344,2134,0,9441,9441,0,4,0,0,100,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,485,138,2015-11-12 14:43:48,2047,2047,63,63,118,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29.731222,2.19974,23.132003,0.505559
27738,2773,0,13922,13922,0,131,0,0,108,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1084,204,2015-11-12 16:25:25,1983,384,48,63,271,,,,,,,,,,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.971784,5.596345,5.182748,0.534558
70983,7098,0,-,7098_3,0,3,0,0,100,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2156,112,2015-11-13 03:23:51,0,2047,63,0,0,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.59853,6.256666,14.828533,0.515006
74882,7488,0,6069,6069,0,2,0,0,117,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2709,171,2015-11-13 05:04:51,256,1846,63,48,204,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29.535976,2.641245,21.61224,0.55469
75829,7582,0,35721,35721,0,132,0,0,111,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1959,121,2015-11-13 05:23:06,2044,6,3,63,105,,,,,,,,,,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.783225,6.723435,8.61292,0.68701
78314,7831,0,35936,35936,0,4,0,0,106,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1321,111,2015-11-13 06:23:06,0,2047,63,0,85,,,,,,,,,,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.820241,1.773467,20.499841,0.614545


It seems that the hero ID 0 is responsible for the null values in these rows. Since we can't fill the values without introducing bias into our models, we need to remove these matches entirely from our dataset. Let's create a list of the unique match IDs that we can use to drop the matches when we explore the Matches DataFrame.

In [35]:
# Create a list of the match IDs to be dropped
dropped_matches = players[players['hero_id'] == 0]['match_id'].tolist()
print('Matches to be removed:', len(dropped_matches))
print(dropped_matches)

Matches to be removed: 37
[720, 1032, 1108, 2134, 2134, 2773, 7098, 7488, 7582, 7831, 13396, 14592, 22764, 24509, 24711, 25043, 25888, 26106, 29388, 30150, 30790, 33087, 33443, 36245, 36647, 37020, 37020, 37147, 37860, 38150, 39664, 40522, 40706, 41029, 43519, 45879, 46280]


Let's maintain the teamfight-related features as they are for now. We can review them later after exploring the Matches DataFrame.

In [36]:
# Sanity Check
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))

players: 500,000 observations, 68 features


### Matches DataFrame

In [37]:
# Look at the DataFrame
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))
matches.head()

matches: 50,000 observations, 13 features


Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156


#### Team Aggregations

Next, we should proceed to create a new column containing the team hero picks for each player in their respective rows. This modification will provide us with valuable insight into the team composition for each match observation.

In [38]:
# Group players by match_id and separate radiant/dire heroes
match_picks = players.groupby(['match_id', 'radiant_team'], as_index=False, observed=False)['hero_id']\
                .apply(list).rename(columns={'hero_id': 'team_hero_picks'})

In [39]:
# Pivot the team_features DataFrame
team_pivoted = team_features.pivot(index='match_id', columns='radiant_team')

# Flatten MultiIndex columns
team_pivoted.columns = ['{}_{}'.format(col[0], 'radiant' if col[1] == 1 else 'dire')\
                        for col in team_pivoted.columns]

# Reset the index
team_pivoted.reset_index(inplace=True)
display(team_pivoted.head())
print('----------------------------')

# Merge results with the matches DataFrame
matches = matches.merge(team_pivoted, on=['match_id'], how='left')
display(matches.head())

Unnamed: 0,match_id,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant
0,0,0.90566,7.611111,38,30,3718,10811,58620,87245
1,1,3.5,1.555556,16,27,9085,4776,107750,69310
2,2,6.0,1.06,16,10,11177,2494,81620,54990
3,3,2.621212,2.328358,29,32,5954,6455,94430,76685
4,4,1.210526,5.647059,21,26,2030,14099,38980,78980


----------------------------


Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980


In [40]:
# Pivot the team_features DataFrame
picks_pivoted = match_picks.pivot(index='match_id', columns='radiant_team')

# Flatten MultiIndex columns
picks_pivoted.columns = ['{}_{}'.format(col[0], 'radiant' if col[1] == 1 else 'dire')\
                        for col in picks_pivoted.columns]

# Reset the index
picks_pivoted.reset_index(inplace=True)
display(picks_pivoted.head())
print('----------------------------')

# Expanding the lists
radiant_picks = pd.DataFrame(picks_pivoted['team_hero_picks_radiant'].tolist(), 
                             index=picks_pivoted.index, 
                             columns=[f'hero_slot_{i}' for i in range(5)])

dire_picks = pd.DataFrame(picks_pivoted['team_hero_picks_dire'].tolist(), 
                          index=picks_pivoted.index, 
                          columns=[f'hero_slot_{i+128}' for i in range(5)])

picks_pivoted = pd.concat([picks_pivoted, radiant_picks, dire_picks], axis=1)
picks_pivoted.drop(columns=['team_hero_picks_radiant', 'team_hero_picks_dire'], inplace=True)

# Merge results with the matches DataFrame
matches = matches.merge(picks_pivoted, on=['match_id'], how='left')
display(matches.head())

Unnamed: 0,match_id,team_hero_picks_dire,team_hero_picks_radiant
0,0,"[106, 102, 46, 7, 73]","[86, 51, 83, 11, 67]"
1,1,"[73, 22, 5, 67, 106]","[7, 82, 71, 39, 21]"
2,2,"[38, 7, 10, 12, 85]","[51, 109, 9, 41, 27]"
3,3,"[78, 19, 31, 40, 47]","[50, 44, 32, 26, 39]"
4,4,"[101, 100, 22, 67, 21]","[8, 39, 55, 87, 69]"


----------------------------


Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245,86,51,83,11,67,106,102,46,7,73
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310,7,82,71,39,21,73,22,5,67,106
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990,51,109,9,41,27,38,7,10,12,85
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685,50,44,32,26,39,78,19,31,40,47
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980,8,39,55,87,69,101,100,22,67,21


#### Hero Features

In [41]:
# Create a mask from our players DataFrame
team_hero_features = players[['match_id', 'player_slot', 'radiant_team', 'hero_primary_attribute',
                              'hero_role_disabler', 'hero_role_support', 'hero_role_carry',
                              'hero_role_initiator', 'hero_role_durable', 'hero_role_pusher',
                              'hero_role_nuker', 'hero_role_escape']]

# Pivot the mask
hero_features_pivot = team_hero_features.pivot_table(
    index='match_id',
    columns='player_slot',
    values=['radiant_team', 'hero_primary_attribute', 'hero_role_disabler', 
            'hero_role_support', 'hero_role_carry', 'hero_role_initiator', 
            'hero_role_durable', 'hero_role_pusher', 'hero_role_nuker', 'hero_role_escape'],
    aggfunc='first'
)

# Flatten the multi-level columns
hero_features_pivot.columns = [f'{i}_{j}' if j!= '' else f'{i}' for i, j in hero_features_pivot.columns]
hero_features_pivot.drop(columns=['radiant_team_0', 'radiant_team_1', 'radiant_team_2', 'radiant_team_3', 'radiant_team_4', 
                                  'radiant_team_128', 'radiant_team_129', 'radiant_team_130', 'radiant_team_131', 'radiant_team_132'], 
                         inplace=True)
hero_features_pivot.reset_index(inplace=True)

# Merge results with the matches DataFrame
matches = matches.merge(hero_features_pivot, on='match_id', how='left')
display(matches.head())

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132,hero_primary_attribute_0,hero_primary_attribute_1,hero_primary_attribute_2,hero_primary_attribute_3,hero_primary_attribute_4,hero_primary_attribute_128,hero_primary_attribute_129,hero_primary_attribute_130,hero_primary_attribute_131,hero_primary_attribute_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245,86,51,83,11,67,106,102,46,7,73,int,all,str,agi,agi,agi,all,agi,str,str,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310,7,82,71,39,21,73,22,5,67,106,str,agi,str,int,all,str,int,int,agi,agi,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990,51,109,9,41,27,38,7,10,12,85,all,agi,all,agi,int,all,str,agi,agi,str,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685,50,44,32,26,39,78,19,31,40,47,all,agi,agi,int,int,all,str,int,all,agi,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980,8,39,55,87,69,101,100,22,67,21,agi,int,all,int,str,int,str,int,agi,all,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0


#### Teamfights

In [42]:
# Load required file
teamfights = pd.read_csv(clean_folder + '/teamfights.csv', index_col=0)
print(f'teamfights:', '{:,} observations, {:,} features'.format(teamfights.shape[0], teamfights.shape[1]))

teamfights: 539,047 observations, 6 features


In [43]:
# Check the total matches
teamfights['match_id'].nunique()

49931

In [44]:
# Look at the head
teamfights.head()

Unnamed: 0,match_id,tf_order,start,end,last_death,deaths
0,0,1,220,252,237,3
1,0,2,429,475,460,3
2,0,3,900,936,921,3
3,0,4,1284,1328,1313,3
4,0,5,1614,1666,1651,5


We want to calculate the total number of teamfights per match in the matches DataFrame, and we can also determine the average duration of teamfights per match.

In [45]:
# Create duration feature
tf_duration = teamfights['end'] - teamfights['start']
teamfights.insert(4, 'duration', value=tf_duration)

# Selecting the agg functions for each column
agg_funcs = {
    'tf_order': 'max',
    'duration': 'mean'
}

# Aggregating features
tfs_per_match = teamfights[['match_id', 'tf_order', 'duration']].groupby('match_id', as_index=False).agg(agg_funcs)

# Rename time column to match format
tfs_per_match.rename(columns={'tf_order': 'teamfights', 'duration': 'tf_avg_duration'}, inplace=True)

tfs_per_match.head(10)

Unnamed: 0,match_id,teamfights,tf_avg_duration
0,0,12,42.583333
1,1,15,44.066667
2,2,11,45.272727
3,3,16,52.0625
4,4,6,42.666667
5,5,13,47.384615
6,6,10,44.4
7,7,12,47.0
8,8,10,42.5
9,9,13,49.0


In [46]:
# Merge results with the matches DataFrame
matches = matches.merge(tfs_per_match, on=['match_id'], how='left')
display(matches.head())

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132,hero_primary_attribute_0,hero_primary_attribute_1,hero_primary_attribute_2,hero_primary_attribute_3,hero_primary_attribute_4,hero_primary_attribute_128,hero_primary_attribute_129,hero_primary_attribute_130,hero_primary_attribute_131,hero_primary_attribute_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132,teamfights,tf_avg_duration
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245,86,51,83,11,67,106,102,46,7,73,int,all,str,agi,agi,agi,all,agi,str,str,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,12.0,42.583333
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310,7,82,71,39,21,73,22,5,67,106,str,agi,str,int,all,str,int,int,agi,agi,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,15.0,44.066667
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990,51,109,9,41,27,38,7,10,12,85,all,agi,all,agi,int,all,str,agi,agi,str,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,11.0,45.272727
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685,50,44,32,26,39,78,19,31,40,47,all,agi,agi,int,int,all,str,int,all,agi,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,16.0,52.0625
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980,8,39,55,87,69,101,100,22,67,21,agi,int,all,int,str,int,str,int,agi,all,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.0,42.666667


#### TrueSkill

In [47]:
# Filtering trueskill features and looking at the head 
ts_df = players[['match_id', 'player_slot', 'trueskill_mu', 'trueskill_sigma', 'trueskill', 'pre_match_quality']]
ts_df.head(20)

Unnamed: 0,match_id,player_slot,trueskill_mu,trueskill_sigma,trueskill,pre_match_quality
0,0,0,25.112577,7.270275,3.301753,0.422474
1,0,1,26.232905,4.854238,11.670192,0.422474
2,0,2,25.112577,7.270275,3.301753,0.422474
3,0,3,27.614505,6.550771,7.96219,0.422474
4,0,4,20.221006,5.961434,2.336703,0.422474
5,0,128,26.773302,5.322094,10.807019,0.422474
6,0,129,25.112577,7.270275,3.301753,0.422474
7,0,130,32.190551,2.93714,23.379132,0.422474
8,0,131,23.1395,6.807861,2.715916,0.422474
9,0,132,34.77452,5.783084,17.425268,0.422474


In [48]:
# Group by match ID to avoid duplicates
ts_df_grouped = ts_df.groupby('match_id')['pre_match_quality'].first().reset_index()

# Merge the pre_match_quality to the matches DataFrame
matches = matches.merge(ts_df_grouped, on=['match_id'], how='left')
display(matches[['match_id', 'pre_match_quality']].head())

Unnamed: 0,match_id,pre_match_quality
0,0,0.422474
1,1,0.436399
2,2,0.501245
3,3,0.495215
4,4,0.562281


In [49]:
# Pivot the table
ts_pivot = ts_df.pivot_table(
    index='match_id',
    columns='player_slot',
    values='trueskill',
    aggfunc='first'
)

# Flatten the multi-level columns
ts_pivot.columns = [f'trueskill_{col}' for col in ts_pivot.columns]
ts_pivot.reset_index(inplace=True)

# Merge into the matches DataFrame
matches = matches.merge(ts_pivot, on='match_id', how='left')
display(matches.head())

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132,hero_primary_attribute_0,hero_primary_attribute_1,hero_primary_attribute_2,hero_primary_attribute_3,hero_primary_attribute_4,hero_primary_attribute_128,hero_primary_attribute_129,hero_primary_attribute_130,hero_primary_attribute_131,hero_primary_attribute_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132,teamfights,tf_avg_duration,pre_match_quality,trueskill_0,trueskill_1,trueskill_2,trueskill_3,trueskill_4,trueskill_128,trueskill_129,trueskill_130,trueskill_131,trueskill_132
0,0,2015-11-05 19:01:52,2375,1982,4,3,63,1,22,1,0,1,155,0.90566,7.611111,38,30,3718,10811,58620,87245,86,51,83,11,67,106,102,46,7,73,int,all,str,agi,agi,agi,all,agi,str,str,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,12.0,42.583333,0.422474,3.301753,11.670192,3.301753,7.96219,2.336703,10.807019,3.301753,23.379132,2.715916,17.425268
1,1,2015-11-05 19:51:18,2582,0,1846,63,0,221,22,0,0,2,154,3.5,1.555556,16,27,9085,4776,107750,69310,7,82,71,39,21,73,22,5,67,106,str,agi,str,int,all,str,int,int,agi,agi,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,15.0,44.066667,0.436399,3.301753,16.999098,2.89536,2.649581,11.455655,4.795492,23.069381,3.301753,23.548781,3.301753
2,2,2015-11-05 23:03:06,2716,256,1972,63,48,190,22,0,0,0,132,6.0,1.06,16,10,11177,2494,81620,54990,51,109,9,41,27,38,7,10,12,85,all,agi,all,agi,int,all,str,agi,agi,str,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,11.0,45.272727,0.501245,3.301753,3.858394,3.858394,6.099669,10.246954,3.301753,21.139734,3.301753,3.301753,3.301753
3,3,2015-11-05 23:22:03,3085,4,1924,51,3,40,22,0,0,0,191,2.621212,2.328358,29,32,5954,6455,94430,76685,50,44,32,26,39,78,19,31,40,47,all,agi,agi,int,int,all,str,int,all,agi,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,16.0,52.0625,0.495215,0.386932,3.301753,3.301753,3.301753,-3.142269,4.672139,3.301753,8.500198,4.637102,3.301753
4,4,2015-11-06 07:53:05,1887,2047,0,0,63,58,22,1,0,0,156,1.210526,5.647059,21,26,2030,14099,38980,78980,8,39,55,87,69,101,100,22,67,21,agi,int,all,int,str,int,str,int,agi,all,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.0,42.666667,0.562281,9.539045,13.140809,6.390039,12.183661,10.246954,5.658858,5.387267,3.301753,23.106103,7.392703


#### Filling Missing Values

In [50]:
# Find features with missing values
display(matches.isna().sum().sort_values(ascending=False)\
[matches.isna().sum().sort_values(ascending=False) > 0])

tf_avg_duration               69
teamfights                    69
hero_primary_attribute_4       6
hero_role_pusher_4             6
hero_role_carry_4              6
hero_role_nuker_4              6
hero_role_escape_4             6
hero_role_disabler_4           6
hero_role_support_4            6
hero_role_initiator_4          6
hero_role_durable_4            6
hero_role_pusher_3             5
hero_role_pusher_2             5
hero_primary_attribute_2       5
hero_primary_attribute_3       5
hero_role_escape_2             5
hero_role_durable_3            5
hero_role_pusher_131           5
hero_role_durable_2            5
hero_primary_attribute_131     5
hero_role_disabler_131         5
hero_role_carry_2              5
hero_role_carry_3              5
hero_role_disabler_3           5
hero_role_durable_131          5
hero_role_disabler_2           5
hero_role_initiator_3          5
hero_role_nuker_131            5
hero_role_escape_3             5
hero_role_support_2            5
hero_role_

In [51]:
# Look at the rows with missing teamfight values
matches[matches['teamfights'].isna()]

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster,team_kda_dire,team_kda_radiant,team_denies_dire,team_denies_radiant,team_gold_dire,team_gold_radiant,team_gold_spent_dire,team_gold_spent_radiant,hero_slot_0,hero_slot_1,hero_slot_2,hero_slot_3,hero_slot_4,hero_slot_128,hero_slot_129,hero_slot_130,hero_slot_131,hero_slot_132,hero_primary_attribute_0,hero_primary_attribute_1,hero_primary_attribute_2,hero_primary_attribute_3,hero_primary_attribute_4,hero_primary_attribute_128,hero_primary_attribute_129,hero_primary_attribute_130,hero_primary_attribute_131,hero_primary_attribute_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132,teamfights,tf_avg_duration,pre_match_quality,trueskill_0,trueskill_1,trueskill_2,trueskill_3,trueskill_4,trueskill_128,trueskill_129,trueskill_130,trueskill_131,trueskill_132
1221,1221,2015-11-12 11:23:25,272,2047,2047,63,63,8,22,0,0,0,171,5.0,0.142857,14,4,3891,2134,7015,4740,85,10,21,14,2,7,61,20,106,15,str,agi,all,str,str,str,all,all,agi,agi,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,,,0.574539,0.633727,23.478384,7.994206,23.631038,22.468244,20.491866,24.919724,14.926128,19.17216,9.736056
1632,1632,2015-11-12 13:09:28,605,2047,2047,63,63,145,22,0,0,0,138,3.5,0.4,20,8,2351,2636,15360,10255,22,9,56,72,57,93,39,50,36,104,int,all,agi,agi,str,agi,int,all,int,str,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,0.471328,11.582781,3.301753,3.301753,2.958588,3.301753,4.915572,6.19391,14.054129,19.173923,11.758558
2420,2420,2015-11-12 15:27:58,681,2047,2046,63,63,143,22,1,0,0,133,0.0,8.5,18,16,2320,5617,10490,12550,19,74,2,26,5,18,25,10,104,101,str,all,str,int,int,str,int,agi,str,int,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,,,0.532036,14.758948,17.221748,8.773373,2.054592,10.027432,3.301753,3.301753,3.301753,17.386708,7.92052
2774,2774,2015-11-12 16:25:31,1411,384,2047,63,48,95,22,0,0,0,111,8.166667,0.416667,21,31,17090,7006,46980,28045,99,17,100,11,9,112,46,69,28,62,str,int,str,agi,all,all,agi,str,str,agi,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,,,0.601266,3.301753,11.269129,3.301753,21.348246,13.357926,22.415946,3.301753,2.044854,8.521596,14.393527
3204,3204,2015-11-12 17:36:29,1122,2047,260,51,63,175,22,1,0,0,133,0.352941,11.333333,5,30,5453,8132,11590,35425,86,49,106,27,8,7,67,29,63,62,int,str,agi,int,agi,str,agi,str,agi,agi,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.517699,17.276265,16.535279,-1.662855,3.301753,1.262703,6.437503,3.301753,3.301753,-1.420664,3.301753
3548,3548,2015-11-12 18:27:22,362,2047,2047,63,63,132,2,0,0,0,182,4.0,0.25,17,8,4530,1922,6990,7300,100,85,47,2,40,50,61,28,7,39,str,str,agi,str,all,all,all,str,str,int,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,,,0.550853,9.953612,17.996606,12.795343,13.005647,15.01634,3.301753,3.301753,3.301753,3.301753,1.034654
3587,3587,2015-11-12 18:33:19,289,2047,2039,63,63,0,22,1,0,0,204,0.0,6.0,3,16,2712,4673,3895,6245,5,69,11,14,67,7,63,42,39,30,int,str,agi,str,agi,str,agi,str,int,int,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,,,0.421774,1.020875,3.301753,2.013298,14.464078,10.591201,14.125489,3.301753,17.659963,3.301753,12.280876
4896,4896,2015-11-12 21:24:44,904,2046,391,51,63,0,22,1,0,0,112,1.111111,2.428571,10,18,7496,8749,11135,23885,79,62,73,110,93,21,50,2,71,36,int,agi,str,all,agi,all,all,str,str,int,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,,,0.57444,18.397198,3.301753,7.033719,10.073099,9.667373,3.301753,3.301753,14.413403,0.679104,17.632735
6213,6213,2015-11-13 00:41:55,148,2047,2047,63,63,0,22,1,0,0,132,0.0,6.0,3,4,1638,1523,3160,6275,55,75,62,93,86,19,90,57,1,7,all,int,agi,agi,int,str,int,str,agi,str,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,,,0.631958,21.263721,11.079367,4.579395,17.273957,18.110785,10.745818,3.301753,10.293069,20.574189,6.872508
6706,6706,2015-11-13 02:08:30,1015,2047,900,51,63,257,22,1,0,0,121,0.3,5.666667,10,28,3978,8807,13295,31110,12,43,93,75,83,84,36,41,57,19,agi,int,agi,int,str,str,int,agi,str,str,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,,,0.493377,3.301753,3.301753,3.665339,3.301753,1.797714,3.301753,19.54846,3.765979,16.449575,0.349506


Upon reviewing the data, we can notice no discernible patterns indicating the absence of teamfights during these matches. Therefore, it would be best to exclude them from all of our DataFrames since they don't have complete information.

In [52]:
# Append match IDs to be dropped
tf_drop_matches = matches[matches['teamfights'].isna()]['match_id'].tolist()
dropped_matches = list(set(dropped_matches + tf_drop_matches))
print('Matches to be removed:', len(dropped_matches),'\n')

# Removing from the players and matches dfs
players = players.drop(players[players['match_id'].isin(dropped_matches)].index)
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))
print('-----------------------------------------------------')
matches = matches.drop(matches[matches['match_id'].isin(dropped_matches)].index)
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))

Matches to be removed: 102 

players: 498,980 observations, 68 features
-----------------------------------------------------
matches: 49,898 observations, 134 features


In [53]:
# Create a simplified version for modelling
simplified_match_columns = ['match_id', 'radiant_win', 
                            'hero_primary_attribute_0', 'hero_primary_attribute_1', 'hero_primary_attribute_2', 'hero_primary_attribute_3', 'hero_primary_attribute_4', 
                            'hero_primary_attribute_128', 'hero_primary_attribute_129', 'hero_primary_attribute_130', 'hero_primary_attribute_131', 'hero_primary_attribute_132', 
                            'hero_role_carry_0', 'hero_role_carry_1', 'hero_role_carry_2', 'hero_role_carry_3', 'hero_role_carry_4', 
                            'hero_role_carry_128', 'hero_role_carry_129', 'hero_role_carry_130', 'hero_role_carry_131', 'hero_role_carry_132', 
                            'hero_role_disabler_0', 'hero_role_disabler_1', 'hero_role_disabler_2', 'hero_role_disabler_3', 'hero_role_disabler_4', 
                            'hero_role_disabler_128', 'hero_role_disabler_129', 'hero_role_disabler_130', 'hero_role_disabler_131', 'hero_role_disabler_132', 
                            'hero_role_durable_0', 'hero_role_durable_1', 'hero_role_durable_2', 'hero_role_durable_3', 'hero_role_durable_4', 
                            'hero_role_durable_128', 'hero_role_durable_129', 'hero_role_durable_130', 'hero_role_durable_131', 'hero_role_durable_132', 
                            'hero_role_escape_0', 'hero_role_escape_1', 'hero_role_escape_2', 'hero_role_escape_3', 'hero_role_escape_4', 
                            'hero_role_escape_128', 'hero_role_escape_129', 'hero_role_escape_130', 'hero_role_escape_131', 'hero_role_escape_132', 
                            'hero_role_initiator_0', 'hero_role_initiator_1', 'hero_role_initiator_2', 'hero_role_initiator_3', 'hero_role_initiator_4', 
                            'hero_role_initiator_128', 'hero_role_initiator_129', 'hero_role_initiator_130', 'hero_role_initiator_131', 'hero_role_initiator_132', 
                            'hero_role_nuker_0', 'hero_role_nuker_1', 'hero_role_nuker_2', 'hero_role_nuker_3', 'hero_role_nuker_4', 
                            'hero_role_nuker_128', 'hero_role_nuker_129', 'hero_role_nuker_130', 'hero_role_nuker_131', 'hero_role_nuker_132', 
                            'hero_role_pusher_0', 'hero_role_pusher_1', 'hero_role_pusher_2', 'hero_role_pusher_3', 'hero_role_pusher_4', 
                            'hero_role_pusher_128', 'hero_role_pusher_129', 'hero_role_pusher_130', 'hero_role_pusher_131', 'hero_role_pusher_132', 
                            'hero_role_support_0', 'hero_role_support_1', 'hero_role_support_2', 'hero_role_support_3', 'hero_role_support_4', 
                            'hero_role_support_128', 'hero_role_support_129', 'hero_role_support_130', 'hero_role_support_131', 'hero_role_support_132']
matches_simple = matches[simplified_match_columns]
display(matches_simple.head())
gc.collect()

Unnamed: 0,match_id,radiant_win,hero_primary_attribute_0,hero_primary_attribute_1,hero_primary_attribute_2,hero_primary_attribute_3,hero_primary_attribute_4,hero_primary_attribute_128,hero_primary_attribute_129,hero_primary_attribute_130,hero_primary_attribute_131,hero_primary_attribute_132,hero_role_carry_0,hero_role_carry_1,hero_role_carry_2,hero_role_carry_3,hero_role_carry_4,hero_role_carry_128,hero_role_carry_129,hero_role_carry_130,hero_role_carry_131,hero_role_carry_132,hero_role_disabler_0,hero_role_disabler_1,hero_role_disabler_2,hero_role_disabler_3,hero_role_disabler_4,hero_role_disabler_128,hero_role_disabler_129,hero_role_disabler_130,hero_role_disabler_131,hero_role_disabler_132,hero_role_durable_0,hero_role_durable_1,hero_role_durable_2,hero_role_durable_3,hero_role_durable_4,hero_role_durable_128,hero_role_durable_129,hero_role_durable_130,hero_role_durable_131,hero_role_durable_132,hero_role_escape_0,hero_role_escape_1,hero_role_escape_2,hero_role_escape_3,hero_role_escape_4,hero_role_escape_128,hero_role_escape_129,hero_role_escape_130,hero_role_escape_131,hero_role_escape_132,hero_role_initiator_0,hero_role_initiator_1,hero_role_initiator_2,hero_role_initiator_3,hero_role_initiator_4,hero_role_initiator_128,hero_role_initiator_129,hero_role_initiator_130,hero_role_initiator_131,hero_role_initiator_132,hero_role_nuker_0,hero_role_nuker_1,hero_role_nuker_2,hero_role_nuker_3,hero_role_nuker_4,hero_role_nuker_128,hero_role_nuker_129,hero_role_nuker_130,hero_role_nuker_131,hero_role_nuker_132,hero_role_pusher_0,hero_role_pusher_1,hero_role_pusher_2,hero_role_pusher_3,hero_role_pusher_4,hero_role_pusher_128,hero_role_pusher_129,hero_role_pusher_130,hero_role_pusher_131,hero_role_pusher_132,hero_role_support_0,hero_role_support_1,hero_role_support_2,hero_role_support_3,hero_role_support_4,hero_role_support_128,hero_role_support_129,hero_role_support_130,hero_role_support_131,hero_role_support_132
0,0,1,int,all,str,agi,agi,agi,all,agi,str,str,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
1,1,0,str,agi,str,int,all,str,int,int,agi,agi,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
2,2,0,all,agi,all,agi,int,all,str,agi,agi,str,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,3,0,all,agi,agi,int,int,all,str,int,all,agi,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,4,1,agi,int,all,int,str,int,str,int,agi,all,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0


0

### Timeseries DataFrame

In [54]:
# Load required file
player_time = pd.read_csv(clean_folder + '/player_time.csv', index_col=0)
print(f'player_time:', '{:,} observations, {:,} features'.format(player_time.shape[0], player_time.shape[1]))

player_time: 2,209,778 observations, 32 features


In [55]:
# Melt the player_time DataFrame
player_time_melted = pd.melt(player_time, id_vars=['match_id', 'times'], 
                              value_vars=[col for col in player_time.columns if\
                                          col.startswith(('gold_t_', 'lh_t_', 'xp_t_'))],
                              var_name='metric', value_name='value')

# Look at the shape of the melted DataFrame
print(f'player_time_melted:', '{:,} observations, {:,} features'.format(player_time_melted.shape[0], player_time_melted.shape[1]))
display(player_time_melted.head())
gc.collect()

player_time_melted: 66,293,340 observations, 4 features


Unnamed: 0,match_id,times,metric,value
0,0,0,gold_t_0,0
1,0,60,gold_t_0,409
2,0,120,gold_t_0,546
3,0,180,gold_t_0,683
4,0,240,gold_t_0,956


0

In [56]:
# Create separate columns for gold, lh, and xp
player_time_melted[['metric_type', 'player_slot']] = player_time_melted['metric'].str.split('_t_', expand=True)
player_time_melted['player_slot'] = player_time_melted['player_slot'].astype(int)
display(player_time_melted.head())
gc.collect()

Unnamed: 0,match_id,times,metric,value,metric_type,player_slot
0,0,0,gold_t_0,0,gold,0
1,0,60,gold_t_0,409,gold,0
2,0,120,gold_t_0,546,gold,0
3,0,180,gold_t_0,683,gold,0
4,0,240,gold_t_0,956,gold,0


13

In [57]:
# Pivot the table to create a wide format
player_time_wide = player_time_melted.pivot_table(index=['match_id', 'times', 'player_slot'], 
                                                  columns='metric_type', 
                                                  values='value',
                                                  aggfunc='sum').reset_index()

# Look at the shape of the wide DataFrame
print(f'player_time_wide:', '{:,} observations, {:,} features'.format(player_time_wide.shape[0], player_time_wide.shape[1]))
display(player_time_wide.head(50))
gc.collect()

player_time_wide: 22,097,780 observations, 6 features


metric_type,match_id,times,player_slot,gold,lh,xp
0,0,0,0,0,0,0
1,0,0,1,0,0,0
2,0,0,2,0,0,0
3,0,0,3,0,0,0
4,0,0,4,0,0,0
5,0,0,128,0,0,0
6,0,0,129,0,0,0
7,0,0,130,0,0,0
8,0,0,131,0,0,0
9,0,0,132,0,0,0


0

#### Ability Upgrades

#### Purchase Log

#### Objectives

In [58]:
# Load the file
objectives = pd.read_csv(clean_folder + '/objectives.csv', index_col=0)
print(f'objectives:', '{:,} observations, {:,} features'.format(objectives.shape[0], objectives.shape[1]))

objectives: 716,498 observations, 6 features


In [59]:
objectives.groupby('subtype')['value'].nunique()

subtype
CHAT_MESSAGE_AEGIS             1
CHAT_MESSAGE_AEGIS_STOLEN      1
CHAT_MESSAGE_FIRSTBLOOD      322
CHAT_MESSAGE_ROSHAN_KILL       1
CHAT_MESSAGE_TOWER_DENY        4
CHAT_MESSAGE_TOWER_KILL        2
Name: value, dtype: int64

In [60]:
# Separate the objectives into multiple features
objectives['aegis'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_AEGIS', 1, 0)
objectives['aegis_stolen'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_AEGIS_STOLEN', 1, 0)
objectives['firstblood'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_FIRSTBLOOD', 1, 0)
objectives['roshan_kill'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_ROSHAN_KILL', 1, 0)
objectives['tower_deny'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_TOWER_DENY', 1, 0)
objectives['tower_kill'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_TOWER_KILL', 1, 0)

# Look at the head
objectives.head()

Unnamed: 0,match_id,player1,player2,subtype,time,value,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill
0,0,0,129,CHAT_MESSAGE_FIRSTBLOOD,1,309,0,0,1,0,0,0
1,0,3,-1,CHAT_MESSAGE_TOWER_KILL,894,2,0,0,0,0,0,1
2,0,2,-1,CHAT_MESSAGE_ROSHAN_KILL,925,200,0,0,0,1,0,0
3,0,1,-1,CHAT_MESSAGE_AEGIS,925,0,1,0,0,0,0,0
4,0,130,-1,CHAT_MESSAGE_TOWER_KILL,1016,3,0,0,0,0,0,1


In [61]:
# Round up the time values to the nearest 60-second intervals
objectives['time'] = (objectives['time'] // 60) * 60

# Aggregate objectives
objective_features = objectives.drop(columns=['player2', 'subtype', 'value']).\
                        groupby(['match_id', 'player1', 'time']).sum().reset_index()

# Look at the objectives features DataFrame head
display(objective_features.head())

Unnamed: 0,match_id,player1,time,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill
0,0,0,0,0,0,1,0,0,0
1,0,1,900,1,0,0,0,0,0
2,0,1,1740,1,0,0,0,0,0
3,0,1,2280,0,0,0,0,0,1
4,0,2,900,0,0,0,1,0,0


In [62]:
# Rename time, player1 and subtype column to match format
objective_features.rename(columns={'time': 'times', 'player1': 'player_slot'}, inplace=True)

# Merge aggregated objectives data
player_time_wide = player_time_wide.merge(objective_features, on=['match_id', 'player_slot', 'times'], how='left')
display(player_time_wide.head(50))
gc.collect()

Unnamed: 0,match_id,times,player_slot,gold,lh,xp,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill
0,0,0,0,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0
1,0,0,1,0,0,0,,,,,,
2,0,0,2,0,0,0,,,,,,
3,0,0,3,0,0,0,,,,,,
4,0,0,4,0,0,0,,,,,,
5,0,0,128,0,0,0,,,,,,
6,0,0,129,0,0,0,,,,,,
7,0,0,130,0,0,0,,,,,,
8,0,0,131,0,0,0,,,,,,
9,0,0,132,0,0,0,,,,,,


0

#### Teamfight Durations

In [63]:
teamfights.head()

Unnamed: 0,match_id,tf_order,start,end,duration,last_death,deaths
0,0,1,220,252,32,237,3
1,0,2,429,475,46,460,3
2,0,3,900,936,36,921,3
3,0,4,1284,1328,44,1313,3
4,0,5,1614,1666,52,1651,5


In [64]:
# Create duration feature
tf_last_death = teamfights['last_death'] - teamfights['start']
teamfights.insert(6, 'tf_last_death', value=tf_last_death)

# Round up the time values to the nearest 60-second intervals
teamfights['times'] = (teamfights['start'] // 60) * 60

tfs_features = teamfights.drop(columns=['start', 'end', 'last_death', 'deaths'])

# Look at the teamfights features DataFrame head
display(tfs_features.head())

Unnamed: 0,match_id,tf_order,duration,tf_last_death,times
0,0,1,32,17,180
1,0,2,46,31,420
2,0,3,36,21,900
3,0,4,44,29,1260
4,0,5,52,37,1560


In [65]:
tf_players.head()

Unnamed: 0,match_id,tf_id,player_slot,buybacks,damage,deaths,gold_delta,xp_end,xp_start
0,0,0,0,0,105,0,173,536,314
1,0,0,1,0,566,1,0,1583,1418
2,0,0,2,0,0,0,0,391,391
3,0,0,3,0,0,0,123,1775,1419
4,0,0,4,0,444,0,336,1267,983


In [66]:
# Calculate xp_delta on tf_players
tf_players['tf_xp_delta'] = tf_players['xp_end'] - tf_players['xp_start']
tf_players.drop(columns=['xp_end', 'xp_start'], inplace=True)

# Reset the index from tfs_features
tfs_features = tfs_features.reset_index()

# Merge tf_players with tfs_features
tfs_features = tf_players.merge(tfs_features, left_on=['match_id', 'tf_id'], right_on=['match_id', 'index'])
tfs_features.drop(columns=['index', 'tf_id', 'tf_order'], inplace=True)
display(tfs_features.head(15))

Unnamed: 0,match_id,player_slot,buybacks,damage,deaths,gold_delta,tf_xp_delta,duration,tf_last_death,times
0,0,0,0,105,0,173,222,32,17,180
1,0,1,0,566,1,0,165,32,17,180
2,0,2,0,0,0,0,0,32,17,180
3,0,3,0,0,0,123,356,32,17,180
4,0,4,0,444,0,336,284,32,17,180
5,0,128,0,477,1,249,283,32,17,180
6,0,129,0,636,1,-27,144,32,17,180
7,0,130,0,0,0,190,315,32,17,180
8,0,131,0,0,0,0,0,32,17,180
9,0,132,0,0,0,378,70,32,17,180


In [67]:
# Rename duration and deaths columns to match format
tfs_features.rename(columns={
    'buybacks': 'tf_buybacks',
    'damage': 'tf_damage',
    'deaths': 'tf_deaths',
    'gold_delta': 'tf_gold_delta',
    'duration': 'tf_duration'
}, inplace=True)

# Merge aggregated objectives data
player_time_wide = player_time_wide.merge(tfs_features, on=['match_id', 'player_slot', 'times'], how='left')

# Drop the matches with incomplete teamfight data
player_time_wide = player_time_wide[~player_time_wide['match_id'].isin(dropped_matches)]

# Display the first 50 final observations
display(player_time_wide.head(50))
gc.collect()

Unnamed: 0,match_id,times,player_slot,gold,lh,xp,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill,tf_buybacks,tf_damage,tf_deaths,tf_gold_delta,tf_xp_delta,tf_duration,tf_last_death
0,0,0,0,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0,,,,,,,
1,0,0,1,0,0,0,,,,,,,,,,,,,
2,0,0,2,0,0,0,,,,,,,,,,,,,
3,0,0,3,0,0,0,,,,,,,,,,,,,
4,0,0,4,0,0,0,,,,,,,,,,,,,
5,0,0,128,0,0,0,,,,,,,,,,,,,
6,0,0,129,0,0,0,,,,,,,,,,,,,
7,0,0,130,0,0,0,,,,,,,,,,,,,
8,0,0,131,0,0,0,,,,,,,,,,,,,
9,0,0,132,0,0,0,,,,,,,,,,,,,


0

#### Chat Log

In [68]:
# Load the file
chat = pd.read_csv(clean_folder + '/chat.csv', index_col=0)
print(f'chat:', '{:,} observations, {:,} features'.format(objectives.shape[0], objectives.shape[1]))

chat: 716,498 observations, 12 features


In [69]:
# Look at the head
chat.head()

Unnamed: 0,match_id,player_slot,match_slot_id,account,chat,time,match_outcome
0,0,129,0_129,6k Slayer,force it,-8,0
1,0,1,0_1,Monkey,space created,5,1
2,0,1,0_1,Monkey,hah,6,1
3,0,129,0_129,6k Slayer,ez 500,9,0
4,0,4,0_4,Kira,mvp ulti,934,1


In [70]:
# Round up the time values to the nearest 60-second intervals
chat['time'] = (chat['time'] // 60) * 60

# Aggregate chat messages
chat_features = chat.groupby(['match_id', 'player_slot', 'time'])['chat'].count().reset_index()

# Look at the objectives features DataFrame head
display(chat_features.head())

Unnamed: 0,match_id,player_slot,time,chat
0,0,0,1500,1
1,0,0,1680,1
2,0,0,2340,2
3,0,1,0,2
4,0,1,1440,1


In [71]:
chat_features[chat_features['time'] < 0]

Unnamed: 0,match_id,player_slot,time,chat
13,0,129,-60,1
36,2,0,-60,2
47,2,128,-60,1
54,2,130,-60,1
56,2,131,-60,1
...,...,...,...,...
811137,49994,130,-120,1
811148,49995,3,-60,1
811163,49995,130,-60,1
811184,49998,130,-300,1


In [72]:
# Rename time column to match format
chat_features.rename(columns={'time': 'times', 'chat': 'chats_sent'}, inplace=True)

# Merge aggregated objectives data
player_time_wide = player_time_wide.merge(chat_features, on=['match_id', 'player_slot', 'times'], how='left')
display(player_time_wide.head(50))
gc.collect()

Unnamed: 0,match_id,times,player_slot,gold,lh,xp,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill,tf_buybacks,tf_damage,tf_deaths,tf_gold_delta,tf_xp_delta,tf_duration,tf_last_death,chats_sent
0,0,0,0,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0,,,,,,,,
1,0,0,1,0,0,0,,,,,,,,,,,,,,2.0
2,0,0,2,0,0,0,,,,,,,,,,,,,,
3,0,0,3,0,0,0,,,,,,,,,,,,,,
4,0,0,4,0,0,0,,,,,,,,,,,,,,
5,0,0,128,0,0,0,,,,,,,,,,,,,,
6,0,0,129,0,0,0,,,,,,,,,,,,,,1.0
7,0,0,130,0,0,0,,,,,,,,,,,,,,
8,0,0,131,0,0,0,,,,,,,,,,,,,,
9,0,0,132,0,0,0,,,,,,,,,,,,,,


0

#### Filling Missing Values

In [73]:
# Remove the matches from dropped_matches
player_time_wide = player_time_wide.drop(player_time_wide[player_time_wide['match_id'].isin(dropped_matches)].index)
print(f'player_time_wide:', '{:,} observations, {:,} features'.format(player_time_wide.shape[0], player_time_wide.shape[1]))

player_time_wide: 22,200,460 observations, 20 features


In [74]:
# Find features with missing values
display(player_time_wide.isna().sum().sort_values(ascending=False)\
[player_time_wide.isna().sum().sort_values(ascending=False) > 0])

tower_deny       21512572
aegis            21512572
tower_kill       21512572
roshan_kill      21512572
firstblood       21512572
aegis_stolen     21512572
chats_sent       21425209
tf_buybacks      16825020
tf_damage        16825020
tf_deaths        16825020
tf_gold_delta    16825020
tf_xp_delta      16825020
tf_duration      16825020
tf_last_death    16825020
dtype: int64

---

## Saving all DataFrames

In [75]:
# Rename columns before merging
players_merge = players.drop(columns=['gold_per_min', 'xp_per_min', 'tf_buybacks', 
                                      'tf_deaths', 'tf_avg_gold_delta', 'tf_avg_xp_delta'])
player_time_wide.rename(columns={'gold': 'gold_per_min', 'lh': 'lh_per_min', 'xp': 'xp_per_min'}, inplace=True)

# Merge the player_time_wide with players df
dask_players = dd.from_pandas(players_merge, npartitions=4)  
dask_player_time = dd.from_pandas(player_time_wide, npartitions=8)  

merged_dask = dask_players.merge(dask_player_time, on=['match_id', 'player_slot'], how='left')
final_df = merged_dask.compute()

# Look at the initial shape of the final DataFrame
print(f'final_df:', '{:,} observations, {:,} features'.format(final_df.shape[0], final_df.shape[1]))
final_df = final_df.sort_values(by=['match_id', 'player_slot']).reset_index()
display(final_df.head(50))
gc.collect()

final_df: 22,200,460 observations, 80 features


Unnamed: 0,index,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,hero_primary_attribute,hero_role_pusher,hero_role_nuker,hero_role_escape,hero_role_disabler,hero_role_initiator,hero_role_durable,hero_role_carry,hero_role_support,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,teamfights,tf_damage_dealt,trueskill_mu,trueskill_sigma,trueskill,pre_match_quality,times,gold_per_min,lh_per_min,xp_per_min,aegis,aegis_stolen,firstblood,roshan_kill,tower_deny,tower_kill,tf_buybacks,tf_damage,tf_deaths,tf_gold_delta,tf_xp_delta,tf_duration,tf_last_death,chats_sent
0,1400541,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,0,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0,,,,,,,,
1,1400542,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,60,409,0,63,,,,,,,,,,,,,,
2,1400543,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,120,546,0,283,,,,,,,,,,,,,,
3,1400544,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,180,683,1,314,,,,,,,0.0,105.0,0.0,173.0,222.0,32.0,17.0,
4,1400545,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,240,956,1,485,,,,,,,,,,,,,,
5,1400546,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,300,1056,1,649,,,,,,,,,,,,,,
6,1400547,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,360,1156,1,680,,,,,,,,,,,,,,
7,1400548,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,420,1257,2,778,,,,,,,0.0,159.0,0.0,452.0,337.0,46.0,31.0,
8,1400549,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,480,1809,3,1135,,,,,,,,,,,,,,
9,1400550,0,1,Double T,Double T,86,0,3261,10960,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,int,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,6.75,1,0.886861,0.033333,0.301637,0.125623,10.0,6099.0,25.112577,7.270275,3.301753,0.422474,540,2111,3,1393,,,,,,,,,,,,,,


215

In [76]:
# Save the final DataFrames to a CSV file
matches_simple.to_csv('../Data/Merged/matches_simple.csv', index=False)
print(f'matches_simple:', '{:,} observations, {:,} features'.format(matches_simple.shape[0], matches_simple.shape[1]))

matches.to_csv('../Data/Merged/matches.csv', index=False)
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))

players.to_csv('../Data/Merged/players.csv', index=False)
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))

player_time_wide.to_csv('../Data/Merged/timeseries.csv', index=False)
print(f'timeseries:', '{:,} observations, {:,} features'.format(player_time_wide.shape[0], player_time_wide.shape[1]))

final_df.to_csv('../Data/Merged/final_df.csv', index=False)
print(f'final_df:', '{:,} observations, {:,} features'.format(final_df.shape[0], final_df.shape[1]))

matches_simple: 49,898 observations, 92 features
matches: 49,898 observations, 134 features
players: 498,980 observations, 68 features
timeseries: 22,200,460 observations, 20 features
final_df: 22,200,460 observations, 81 features
