# Data Preprocessing

Before starting a detailed Exploratory Data Analysis (EDA), we need to build our final DataFrame by merging every feature that could be valuable for predicting fair matchmaking. This means consolidating data from all the previously cleaned files and creating new features as needed.

---

## Initial Setup

In [21]:
# ---------------- Suppress future and deprecation warnings ---------------- #
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

# ---------------- Basic Data Science Libraries ---------------- #
import numpy as np # Linear algebra
import pandas as pd # Data processing
import dask.dataframe as dd # Data processing for large DataFrames

# ---------------- System Libraries ---------------- #
import os # Miscellaneous operating system interfaces
import gc # Garbage collector interface
import nbimporter # Import functions from other Jupyter Notebooks
from subprocess import check_output # Run a command and capture its output

# ---------------- Plotting Libraries ---------------- #
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# ---------------- TrueSkill Library ---------------- #
import trueskill
trueskill.setup(draw_probability=0)
import itertools
import math

# Function obtained from the documentation found in https://trueskill.org/
def win_probability(team1, team2):
    delta_mu = sum(r.mu for r in team1) - sum(r.mu for r in team2)
    sum_sigma = sum(r.sigma ** 2 for r in itertools.chain(team1, team2))
    size = len(team1) + len(team2)
    ts = trueskill.global_env()
    BETA = ts.beta
    denom = math.sqrt(size * (BETA * BETA) + sum_sigma)
    return ts.cdf(delta_mu / denom)

# Function to obtain a conservative skill rating
def conservative_trueskill_rating(mu, sigma):
    conservative_skill_rating = mu - (3 * sigma)
    return conservative_skill_rating

# ---------------- Define Clean and Raw Directories ---------------- #
clean_folder = '../Data/Clean'
raw_folder = '../Data/Raw'

# ---------------- Set new DataFrame limiters ---------------- #
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# ---------------- Print files in my clean data folder ---------------- #
print(check_output(['ls', clean_folder]).decode('utf8'))

ability_ids.csv
ability_upgrades.csv
chat.csv
eng_chat.csv
hero_ids.csv
item_ids.csv
matches.csv
mmr.csv
objectives.csv
patch_dates.csv
player_time.csv
players.csv
positions.csv
prev_outcomes.csv
purchase_log.csv
regions.csv
teamfights.csv
teamfights_players.csv
test_players.csv
trueskill.csv



---

## Feature Selection

Let's begin with our most crucial files in the dataset: matches.csv and players.csv. These contain the most information for each player per game.

In [2]:
# Load required files
players = pd.read_csv(clean_folder + '/players.csv', index_col=0)
print('players: {:,} observations, {:,} features'.format(*players.shape))

matches = pd.read_csv(clean_folder + '/matches.csv', index_col=0)
print('matches: {:,} observations, {:,} features'.format(*matches.shape))

players: 500,000 observations, 73 features
matches: 50,000 observations, 13 features


With 73 features already in the players DataFrame, adding more from other files would quickly become unwieldy. To simplify the process of creating fair matchmaking, we first reduce the number of features by removing those with a high number of null values, as well as those that have little impact on the match outcome.
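As a minimal sketch of the null-value screening described above, the snippet below ranks columns by their share of missing values. The toy data and the 0.5 cutoff (`NULL_THRESHOLD`) are assumptions for illustration only; the notebook's actual drop list is chosen by hand in the next cell.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the players DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    'kills': [9, 13, 0, 8],
    'gold_abandon': [np.nan, np.nan, np.nan, 0.0],
    'unit_order_glyph': [np.nan, 1.0, np.nan, np.nan],
})

# Share of missing values per column, highest first
null_fraction = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns that are more than half empty
NULL_THRESHOLD = 0.5
sparse_columns = null_fraction[null_fraction > NULL_THRESHOLD].index.tolist()

toy_reduced = toy.drop(columns=sparse_columns)
print(null_fraction)
print(toy_reduced.columns.tolist())
```

A ranking like this only covers the null-value criterion; judging which features have little impact on the outcome still requires domain knowledge.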

In [3]:
# List unwanted features from the players DataFrame
features_to_drop = [
    'unit_order_none', 'unit_order_move_to_position', 'unit_order_move_to_target', 
    'unit_order_attack_move', 'unit_order_attack_target', 'unit_order_cast_position', 
    'unit_order_cast_target', 'unit_order_cast_target_tree', 'unit_order_cast_no_target', 
    'unit_order_cast_toggle', 'unit_order_hold_position', 'unit_order_train_ability', 
    'unit_order_drop_item', 'unit_order_give_item', 'unit_order_pickup_item', 
    'unit_order_pickup_rune', 'unit_order_purchase_item', 'unit_order_sell_item', 
    'unit_order_disassemble_item', 'unit_order_move_item', 'unit_order_cast_toggle_auto', 
    'unit_order_stop', 'unit_order_buyback', 'unit_order_glyph', 
    'unit_order_eject_item_from_stash', 'unit_order_cast_rune', 'unit_order_ping_ability', 
    'unit_order_move_to_direction', 'gold_abandon', 'gold_sell', 
    'gold_destroying_structure', 'gold_killing_couriers', 'match_slot_id'
]

# Drop the features
players = players.drop(columns=features_to_drop)
players.shape

(500000, 40)

Now that we have 40 features in our players' DataFrame, let's group together the features that can provide more insight into the overall team performance.

In [4]:
# Player Performance Features
player_features = ['kills', 'deaths', 'assists', 'denies', 'gold', 'gold_spent']

# Define categorical features
players['cluster'] = players['cluster'].astype('category')
players['hero_id'] = players['hero_id'].astype('category')
players['player_slot'] = players['player_slot'].astype('category')

# Display player features
players[player_features].head()

Unnamed: 0,kills,deaths,assists,denies,gold,gold_spent
0,9,3,18,1,3261,10960
1,13,3,18,9,2954,17760
2,0,4,15,1,110,12195
3,8,4,19,6,1179,22505
4,20,3,17,13,3307,23825


In [5]:
# Match Features
match_features = ['match_id', 'start_time', 'tower_status_radiant', 
                  'tower_status_dire', 'barracks_status_dire', 
                  'barracks_status_radiant', 'first_blood_time']
matches['start_time'] = pd.to_datetime(matches['start_time'], unit='s')
matches[match_features].head()

Unnamed: 0,match_id,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
0,0,2015-11-05 19:01:52,1982,4,3,63,1
1,1,2015-11-05 19:51:18,0,1846,63,0,221
2,2,2015-11-05 23:03:06,256,1972,63,48,190
3,3,2015-11-05 23:22:03,4,1924,51,3,40
4,4,2015-11-06 07:53:05,2047,0,0,63,58


In [6]:
# Merge the match features to the players DataFrame
players = players.merge(matches[match_features], on='match_id', how='left')
display(players.head(20))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time
0,0,1,Double T,0,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
2,0,1,Trash!!!,0,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
5,0,0,4,4,106,128,476,12285,397,524,5,6,8,5,162,0.0,10725,0,112,145,73,149,48,212,0,19,0,8502,12259,0,1,0,-2394,-2240,5281,6193,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
6,0,0,6k Slayer,0,102,129,317,10355,303,369,4,13,5,2,107,0.0,15028,764,0,50,11,102,36,185,81,16,0,5201,9417,0,1,0,-3287,0,3396,4356,0,18,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
7,0,0,ｔｏｍｉａ～♥,5,46,130,2390,13395,452,517,4,8,6,31,208,0.0,10230,0,2438,41,63,36,147,168,21,19,0,6853,13396,0,244,107,-3682,0,4350,8797,0,6,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
8,0,0,-,0,7,131,475,5035,189,223,1,14,8,0,27,67.0277,4774,0,0,36,0,0,46,0,180,12,0,4798,4038,0,27,0,-3286,-39,2127,1089,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1
9,0,0,u didnt see who highest here?,6,73,132,60,17550,496,456,1,11,6,0,147,60.9748,6398,292,0,63,9,116,65,229,79,18,0,6659,10471,0,933,5679,-4039,-1063,2685,7011,0,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1


2019

---

## Feature Engineering

### Players DataFrame

#### Team Features

To understand each player's impact on the team, we calculate the ratio of each player's individual contribution to the overall team statistics for every match. This involves selecting specific performance metrics and aggregating them per team.

##### Overall Team Performance

In [7]:
# Aggregate team stats
team_stats = players.groupby(['match_id', 'player_slot'])[player_features].sum().reset_index()
team_stats['radiant_team'] = team_stats['player_slot'].apply(lambda x: 1 if x < 5 else 0)

# Aggregating them by team
team_features = team_stats.groupby(['match_id', 'radiant_team'], observed=False)[player_features].sum().reset_index()

# Rename columns for merge
team_features.rename(columns={'kills': 'team_kills', 
                              'deaths': 'team_deaths', 
                              'assists': 'team_assists', 
                              'denies': 'team_denies', 
                              'gold': 'team_gold', 
                              'gold_spent': 'team_gold_spent'}, inplace=True)

display(team_stats.head(10))
display(team_features.head(10))

Unnamed: 0,match_id,player_slot,kills,deaths,assists,denies,gold,gold_spent,radiant_team
0,0,0,9,3,18,1,3261,10960,1
1,0,1,13,3,18,9,2954,17760,1
2,0,2,0,4,15,1,110,12195,1
3,0,3,8,4,19,6,1179,22505,1
4,0,4,20,3,17,13,3307,23825,1
5,0,128,5,6,8,5,476,12285,0
6,0,129,4,13,5,2,317,10355,0
7,0,130,4,8,6,31,2390,13395,0
8,0,131,1,14,8,0,475,5035,0
9,0,132,1,11,6,0,60,17550,0


Unnamed: 0,match_id,radiant_team,team_kills,team_deaths,team_assists,team_denies,team_gold,team_gold_spent
0,0,0,15,52,33,38,3718,58620
1,0,1,50,17,87,30,10811,87245
2,1,0,50,37,83,16,9085,107750
3,1,1,35,53,49,27,4776,69310
4,2,0,48,22,90,16,11177,81620
5,2,1,22,49,31,10,2494,54990
6,3,0,63,65,110,29,5954,94430
7,3,1,64,66,92,32,6455,76685
8,4,0,16,37,30,21,2030,38980
9,4,1,37,16,59,26,14099,78980


##### KDA Scores

KDA scores can be calculated using the following formula:
<center>$\text{KDA} = \frac{\text{kills} + \text{assists}}{\text{deaths} + 1}$</center>

In [8]:
# Calculate KDA Scores
kda_score = (team_stats['kills'] + team_stats['assists']) / (team_stats['deaths'] + 1)
team_stats.insert(2, column='kda', value=kda_score)
team_stats.drop(columns=['kills', 'deaths', 'assists'], inplace=True)

team_kda_score = (team_features['team_kills'] + team_features['team_assists']) / (team_features['team_deaths'] + 1)
team_features.insert(2, column='team_kda', value=team_kda_score)
team_features.drop(columns=['team_kills', 'team_deaths', 'team_assists'], inplace=True)

display(team_stats.head(10))
display(team_features.head(10))

Unnamed: 0,match_id,player_slot,kda,denies,gold,gold_spent,radiant_team
0,0,0,6.75,1,3261,10960,1
1,0,1,7.75,9,2954,17760,1
2,0,2,3.0,1,110,12195,1
3,0,3,5.4,6,1179,22505,1
4,0,4,9.25,13,3307,23825,1
5,0,128,1.857143,5,476,12285,0
6,0,129,0.642857,2,317,10355,0
7,0,130,1.111111,31,2390,13395,0
8,0,131,0.6,0,475,5035,0
9,0,132,0.583333,0,60,17550,0


Unnamed: 0,match_id,radiant_team,team_kda,team_denies,team_gold,team_gold_spent
0,0,0,0.90566,38,3718,58620
1,0,1,7.611111,30,10811,87245
2,1,0,3.5,16,9085,107750
3,1,1,1.555556,27,4776,69310
4,2,0,6.0,16,11177,81620
5,2,1,1.06,10,2494,54990
6,3,0,2.621212,29,5954,94430
7,3,1,2.328358,32,6455,76685
8,4,0,1.210526,21,2030,38980
9,4,1,5.647059,26,14099,78980
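The KDA values in the tables above can be checked by hand. The short sketch below applies the same formula to the first player of match 0 and to the Dire team totals of that match; the helper name `kda` is introduced here only for illustration.

```python
def kda(kills, deaths, assists):
    """(kills + assists) / (deaths + 1); the +1 keeps the ratio defined at 0 deaths."""
    return (kills + assists) / (deaths + 1)

# Player in slot 0 of match 0: 9 kills, 3 deaths, 18 assists
player_kda = kda(kills=9, deaths=3, assists=18)

# Dire team totals for match 0: 15 kills, 52 deaths, 33 assists
dire_team_kda = kda(kills=15, deaths=52, assists=33)

print(player_kda)               # 6.75
print(round(dire_team_kda, 5))  # 0.90566
```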


Now let's calculate the ratios for each player.

In [9]:
# Calculate participation ratios (each team_* column is overwritten with player value / team value)
team_stats = team_stats.merge(team_features, on=['match_id', 'radiant_team'], how='left')
team_cols = [col for col in team_stats.columns if col.startswith('team_')]
for col in team_cols:
    player_col = col[len('team_'):]  # e.g. 'team_gold' -> 'gold'
    team_stats[col] = team_stats[player_col] / team_stats[col]

display(team_stats.head(10))

Unnamed: 0,match_id,player_slot,kda,denies,gold,gold_spent,radiant_team,team_kda,team_denies,team_gold,team_gold_spent
0,0,0,6.75,1,3261,10960,1,0.886861,0.033333,0.301637,0.125623
1,0,1,7.75,9,2954,17760,1,1.018248,0.3,0.27324,0.203565
2,0,2,3.0,1,110,12195,1,0.394161,0.033333,0.010175,0.139779
3,0,3,5.4,6,1179,22505,1,0.709489,0.2,0.109056,0.257952
4,0,4,9.25,13,3307,23825,1,1.215328,0.433333,0.305892,0.273082
5,0,128,1.857143,5,476,12285,0,2.050595,0.131579,0.128026,0.20957
6,0,129,0.642857,2,317,10355,0,0.709821,0.052632,0.085261,0.176646
7,0,130,1.111111,31,2390,13395,0,1.226852,0.815789,0.642819,0.228506
8,0,131,0.6,0,475,5035,0,0.6625,0.0,0.127757,0.085892
9,0,132,0.583333,0,60,17550,0,0.644097,0.0,0.016138,0.299386
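The same kind of ratio can also be obtained without a separate merge. The sketch below is an alternative, not the method used in this notebook: it uses `groupby(...).transform('sum')` on the `denies` values of match 0 shown earlier, and the column name `denies_share` is hypothetical.

```python
import pandas as pd

# Denies for the ten players of match 0, copied from the tables above
toy = pd.DataFrame({
    'match_id': [0] * 10,
    'player_slot': [0, 1, 2, 3, 4, 128, 129, 130, 131, 132],
    'denies': [1, 9, 1, 6, 13, 5, 2, 31, 0, 0],
})
toy['radiant_team'] = (toy['player_slot'] < 5).astype(int)

# transform('sum') broadcasts each team's total back onto its players' rows
team_total = toy.groupby(['match_id', 'radiant_team'])['denies'].transform('sum')
toy['denies_share'] = toy['denies'] / team_total

print(toy[['player_slot', 'denies', 'denies_share']])
```

The shares within each team add up to 1, and the first row (1 deny out of a Radiant total of 30) agrees with the 0.033333 displayed in the `team_denies` column.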


Finally, we can merge the team participation ratios into our players' DataFrame.

In [10]:
# Merging to the players DataFrame
players = players.merge(team_stats.drop(columns=['denies', 'gold', 'gold_spent']), 
                        on=['match_id', 'player_slot'], how='left')

# Define radiant_team as categorical
players['radiant_team'] = players['radiant_team'].astype('category')

# Sanity check
display(players.head(20))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent
0,0,1,Double T,0,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,6.75,1,0.886861,0.033333,0.301637,0.125623
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,7.75,1,1.018248,0.3,0.27324,0.203565
2,0,1,Trash!!!,0,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,3.0,1,0.394161,0.033333,0.010175,0.139779
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,5.4,1,0.709489,0.2,0.109056,0.257952
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,9.25,1,1.215328,0.433333,0.305892,0.273082
5,0,0,4,4,106,128,476,12285,397,524,5,6,8,5,162,0.0,10725,0,112,145,73,149,48,212,0,19,0,8502,12259,0,1,0,-2394,-2240,5281,6193,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,1.857143,0,2.050595,0.131579,0.128026,0.20957
6,0,0,6k Slayer,0,102,129,317,10355,303,369,4,13,5,2,107,0.0,15028,764,0,50,11,102,36,185,81,16,0,5201,9417,0,1,0,-3287,0,3396,4356,0,18,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,0.642857,0,0.709821,0.052632,0.085261,0.176646
7,0,0,ｔｏｍｉａ～♥,5,46,130,2390,13395,452,517,4,8,6,31,208,0.0,10230,0,2438,41,63,36,147,168,21,19,0,6853,13396,0,244,107,-3682,0,4350,8797,0,6,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,1.111111,0,1.226852,0.815789,0.642819,0.228506
8,0,0,-,0,7,131,475,5035,189,223,1,14,8,0,27,67.0277,4774,0,0,36,0,0,46,0,180,12,0,4798,4038,0,27,0,-3286,-39,2127,1089,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,0.6,0,0.6625,0.0,0.127757,0.085892
9,0,0,u didnt see who highest here?,6,73,132,60,17550,496,456,1,11,6,0,147,60.9748,6398,292,0,63,9,116,65,229,79,18,0,6659,10471,0,933,5679,-4039,-1063,2685,7011,0,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,0.583333,0,0.644097,0.0,0.016138,0.299386


0

#### Team Picks

Next, we create a new column that holds the team's hero picks on each player's row. This gives us the team composition for every match observation.

In [11]:
# Group players by match_id and separate radiant/dire heroes
match_picks = players.groupby(['match_id', 'radiant_team'], as_index=False, observed=False)['hero_id']\
                .apply(list).rename(columns={'hero_id': 'team_hero_picks'})
display(match_picks.head(10))

Unnamed: 0,match_id,radiant_team,team_hero_picks
0,0,0,"[106, 102, 46, 7, 73]"
1,0,1,"[86, 51, 83, 11, 67]"
2,1,0,"[73, 22, 5, 67, 106]"
3,1,1,"[7, 82, 71, 39, 21]"
4,2,0,"[38, 7, 10, 12, 85]"
5,2,1,"[51, 109, 9, 41, 27]"
6,3,0,"[78, 19, 31, 40, 47]"
7,3,1,"[50, 44, 32, 26, 39]"
8,4,0,"[101, 100, 22, 67, 21]"
9,4,1,"[8, 39, 55, 87, 69]"


In [12]:
# Merging to the players DataFrame
players = players.merge(match_picks, on=['match_id', 'radiant_team'], how='left')
display(players.head(20))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,team_hero_picks
0,0,1,Double T,0,86,0,3261,10960,347,362,9,3,18,1,30,76.7356,8690,218,143,180,37,73,56,108,0,16,0,8840,5440,0,83,50,-957,0,5145,1087,400,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,6.75,1,0.886861,0.033333,0.301637,0.125623,"[86, 51, 83, 11, 67]"
1,0,1,Monkey,1,51,1,2954,17760,494,659,13,3,18,9,109,87.4164,23747,0,423,46,63,119,102,24,108,22,0,14331,8440,2683,671,395,-1137,0,6676,4317,937,16,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,7.75,1,1.018248,0.3,0.27324,0.203565,"[86, 51, 83, 11, 67]"
2,0,1,Trash!!!,0,83,2,110,12195,350,385,0,4,15,1,58,0.0,4217,1595,399,48,60,59,108,65,0,17,0,6692,8112,0,453,259,-1436,-1015,2418,3697,400,2,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,3.0,1,0.394161,0.033333,0.010175,0.139779,"[86, 51, 83, 11, 67]"
3,0,1,2,2,11,3,1179,22505,599,605,8,4,19,6,271,0.0,14832,2714,6055,63,147,154,164,79,160,21,0,8583,14230,894,293,100,-2156,0,4104,10432,400,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,5.4,1,0.709489,0.2,0.109056,0.257952,"[86, 51, 83, 11, 67]"
4,0,1,Kira,3,67,4,3307,23825,613,762,20,3,17,13,245,0.0,33740,243,1833,114,92,147,0,137,63,24,0,15814,14325,0,62,0,-1437,-1056,7467,9220,400,1,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,9.25,1,1.215328,0.433333,0.305892,0.273082,"[86, 51, 83, 11, 67]"
5,0,0,4,4,106,128,476,12285,397,524,5,6,8,5,162,0.0,10725,0,112,145,73,149,48,212,0,19,0,8502,12259,0,1,0,-2394,-2240,5281,6193,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,1.857143,0,2.050595,0.131579,0.128026,0.20957,"[106, 102, 46, 7, 73]"
6,0,0,6k Slayer,0,102,129,317,10355,303,369,4,13,5,2,107,0.0,15028,764,0,50,11,102,36,185,81,16,0,5201,9417,0,1,0,-3287,0,3396,4356,0,18,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,0.642857,0,0.709821,0.052632,0.085261,0.176646,"[106, 102, 46, 7, 73]"
7,0,0,ｔｏｍｉａ～♥,5,46,130,2390,13395,452,517,4,8,6,31,208,0.0,10230,0,2438,41,63,36,147,168,21,19,0,6853,13396,0,244,107,-3682,0,4350,8797,0,6,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,1.111111,0,1.226852,0.815789,0.642819,0.228506,"[106, 102, 46, 7, 73]"
8,0,0,-,0,7,131,475,5035,189,223,1,14,8,0,27,67.0277,4774,0,0,36,0,0,46,0,180,12,0,4798,4038,0,27,0,-3286,-39,2127,1089,0,0,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,0.6,0,0.6625,0.0,0.127757,0.085892,"[106, 102, 46, 7, 73]"
9,0,0,u didnt see who highest here?,6,73,132,60,17550,496,456,1,11,6,0,147,60.9748,6398,292,0,63,9,116,65,229,79,18,0,6659,10471,0,933,5679,-4039,-1063,2685,7011,0,4,2375,155,2015-11-05 19:01:52,1982,4,3,63,1,0.583333,0,0.644097,0.0,0.016138,0.299386,"[106, 102, 46, 7, 73]"


0

#### Teamfights

Team fights are a great way to assess each team's performance and coordination. Engineering features that help us analyze the dynamics and results of team fights can provide valuable insights into player performance, team synergy, and the influence of team fights on the overall match result.

In [13]:
# Load required files
tf_players = pd.read_csv(clean_folder + '/teamfights_players.csv', index_col=0)
print('tf_players: {:,} observations, {:,} features'.format(*tf_players.shape))

tf_players: 5,390,470 observations, 9 features


In [14]:
# Teamfight Participation
def count_values_not_zero(series):
    return (series > 0).sum()

player_teamfights = tf_players.groupby(['match_id', 'player_slot'])['damage'].agg(count_values_not_zero).reset_index(name='teamfights')

# Teamfight Performance
tf_player_damage = tf_players.groupby(['match_id', 'player_slot'])['damage'].sum().reset_index(name='tf_damage_dealt')
tf_player_buybacks = tf_players.groupby(['match_id', 'player_slot'])['buybacks'].sum().reset_index(name='tf_buybacks')
tf_player_deaths = tf_players.groupby(['match_id', 'player_slot'])['deaths'].sum().reset_index(name='tf_deaths')

# Teamfight Impact
tf_player_gold_delta = tf_players.groupby(['match_id', 'player_slot'])['gold_delta'].mean().reset_index(name='tf_avg_gold_delta')
tf_player_xp_delta = tf_players.groupby(['match_id', 'player_slot']).apply(lambda x: (x['xp_end'] - x['xp_start']).mean()).reset_index(name='tf_avg_xp_delta')

# Merge all features in a single DataFrame
player_teamfights = player_teamfights.merge(tf_player_damage, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_buybacks, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_deaths, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_gold_delta, on=['match_id', 'player_slot'], how='left')
player_teamfights = player_teamfights.merge(tf_player_xp_delta, on=['match_id', 'player_slot'], how='left')


# Display the head and shape of player_teamfights
display(player_teamfights.head(20))
print('player_teamfights: {:,} observations, {:,} features'.format(*player_teamfights.shape))

Unnamed: 0,match_id,player_slot,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta
0,0,0,10,6099,0,2,329.166667,538.333333
1,0,1,10,13663,0,4,409.833333,1112.25
2,0,2,7,1155,1,3,123.666667,495.166667
3,0,3,9,15201,0,4,317.333333,795.75
4,0,4,12,30774,1,2,460.583333,1189.416667
5,0,128,10,23616,2,5,86.75,731.666667
6,0,129,9,12807,0,4,211.583333,516.75
7,0,130,8,15988,0,5,193.0,610.25
8,0,131,10,5718,1,9,-29.833333,401.75
9,0,132,10,9786,1,9,-65.75,639.5


player_teamfights: 499,310 observations, 8 features
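The per-player teamfight features can also be produced in a single pass with pandas named aggregation. The following is a sketch on made-up data, assuming the same column names as `teamfights_players.csv`; it is an alternative formulation rather than the code that generated the output above.

```python
import pandas as pd

# Toy teamfight log: two fights for each of two players (values are invented)
toy_tf = pd.DataFrame({
    'match_id':    [0, 0, 0, 0],
    'player_slot': [0, 0, 1, 1],
    'damage':      [100, 0, 250, 50],
    'deaths':      [0, 1, 0, 0],
    'buybacks':    [0, 0, 1, 0],
    'gold_delta':  [200, -50, 300, 100],
    'xp_start':    [1000, 1500, 900, 1400],
    'xp_end':      [1500, 1600, 1400, 2000],
})

summary = (
    toy_tf.assign(xp_delta=toy_tf['xp_end'] - toy_tf['xp_start'])
          .groupby(['match_id', 'player_slot'])
          .agg(
              teamfights=('damage', lambda s: (s > 0).sum()),  # fights with damage dealt
              tf_damage_dealt=('damage', 'sum'),
              tf_buybacks=('buybacks', 'sum'),
              tf_deaths=('deaths', 'sum'),
              tf_avg_gold_delta=('gold_delta', 'mean'),
              tf_avg_xp_delta=('xp_delta', 'mean'),
          )
          .reset_index()
)
print(summary)
```

Computing `xp_delta` as a column first avoids a row-wise `groupby.apply`, which tends to be slow on frames with millions of rows.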


This new DataFrame has 690 fewer observations than the players DataFrame (499,310 versus 500,000). These players may not have engaged in any teamfights, either by avoiding them entirely or because the matches were thrown.

In [15]:
# Look for original match_ids
print('Total matches in original:', tf_players['match_id'].nunique())

Total matches in original: 49931


Teamfight data exists for 49,931 of the 50,000 matches, so the 690 missing observations are consistent with 69 whole matches of 10 players each rather than randomly scattered rows. Instead of disregarding these matches, let's merge the new features into our players' DataFrame and then examine the observations with null values. We will investigate the missing teamfight values after exploring the matches DataFrame.
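One way to confirm that unmatched rows belong to whole matches is a left merge with `indicator=True`. This is a minimal sketch on toy frames (`toy_players` and `toy_tf` are hypothetical stand-ins, with two players per match for brevity).

```python
import pandas as pd

# Three matches with two players each; match 1 has no teamfight records
toy_players = pd.DataFrame({'match_id': [0, 0, 1, 1, 2, 2],
                            'player_slot': [0, 128] * 3})
toy_tf = pd.DataFrame({'match_id': [0, 0, 2, 2],
                       'player_slot': [0, 128, 0, 128],
                       'teamfights': [3, 3, 5, 4]})

# Matches that never appear in the teamfight data
missing_matches = set(toy_players['match_id']) - set(toy_tf['match_id'])

# Rows that found no partner in the merge are flagged as 'left_only'
merged = toy_players.merge(toy_tf, on=['match_id', 'player_slot'],
                           how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']

# If gaps are whole matches, every unmatched row sits in a missing match
only_whole_matches = set(unmatched['match_id']) == missing_matches
print(missing_matches, len(unmatched), only_whole_matches)
```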

In [16]:
# Merging to the players' DataFrame
players = players.drop(columns=['kills', 'deaths', 'assists'])\
                .merge(player_teamfights, on=['match_id', 'player_slot'], how='left')
display(players[players['teamfights'].isna()].sample(50))
gc.collect()

Unnamed: 0,match_id,match_outcome,account,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,messages_sent,time_played,cluster,start_time,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,kda,radiant_team,team_kda,team_denies,team_gold,team_gold_spent,team_hero_picks,teamfights,tf_damage_dealt,tf_buybacks,tf_deaths,tf_avg_gold_delta,tf_avg_xp_delta
384595,38459,0,[uNs]NiceDay,131998,42,128,2,800,364,310,4,23,4.06567,772,0,0,11,182,44,16,34,0,5,1,202,1282,0,0,0,0,0,295,958,0,1,286,138,2015-11-16 21:13:36,2047,2047,63,63,71,1.0,0,3.0,0.571429,0.000643,0.149254,"[42, 75, 39, 64, 98]",,,,,,
373964,37396,1,Ziko,129004,5,4,902,3705,292,175,1,14,0.433295,1068,0,662,214,44,43,73,46,19,6,0,230,2130,0,26,0,0,0,296,530,0,1,815,133,2015-11-16 18:46:39,2047,1798,59,63,44,1.0,1,0.285714,0.047619,0.112778,0.149818,"[55, 44, 25, 50, 5]",,,,,,
162829,16282,1,bolu0beyi,66755,21,132,1341,8845,517,420,9,77,4.53223,5592,0,2365,50,166,168,16,16,88,11,0,923,6030,0,542,428,0,0,548,3004,0,1,1069,132,2015-11-14 10:41:40,48,2039,63,12,123,3.0,0,0.26087,0.2,0.129804,0.238089,"[48, 50, 39, 71, 21]",,,,,,
108787,10878,1,15176,15176,1,130,1372,1900,358,284,7,38,0.0,328,0,437,56,11,16,182,29,0,6,0,61,2044,0,12,0,0,0,35,1571,0,0,446,132,2015-11-13 19:19:49,2038,2047,63,63,0,1.0,0,0.4,0.368421,0.25597,0.220035,"[75, 5, 1, 47, 42]",,,,,,
185186,18518,0,-,0,8,129,769,3930,519,325,5,27,0.0,1598,0,0,50,11,71,36,46,0,7,1,0,2637,0,0,0,-119,0,0,1115,0,0,485,156,2015-11-14 17:00:39,2047,1975,63,63,12,0.0,0,0.0,0.714286,0.989704,0.441077,"[36, 8, 28, 70, 86]",,,,,,
92736,9273,1,322,0,99,129,589,6610,402,389,4,53,0.0,8816,0,626,0,34,63,125,127,0,11,0,2143,4189,0,56,0,-29,0,2144,2167,0,10,983,123,2015-11-13 13:57:36,1974,2039,63,63,49,5.0,0,1.5,0.666667,0.237022,0.290677,"[74, 99, 93, 68, 41]",,,,,,
16329,1632,1,gissi18,8326,104,132,391,3300,305,363,0,55,0.0,531,0,0,1,11,182,16,17,0,8,0,0,3503,0,170,158,0,0,0,1912,0,4,605,138,2015-11-12 13:09:28,2047,2047,63,63,145,0.0,0,0.0,0.0,0.166312,0.214844,"[93, 39, 50, 36, 104]",,,,,,
12210,1221,0,6341,6341,85,0,1,625,100,71,0,0,0.0,147,0,0,42,216,44,16,16,216,2,3,0,326,0,0,0,-29,0,0,0,0,0,272,171,2015-11-12 11:23:25,2047,2047,63,63,8,0.0,1,0.0,0.0,0.000469,0.131857,"[85, 10, 21, 14, 2]",,,,,,
314385,31438,1,63251,63251,104,128,179,625,180,187,0,2,0.0,0,0,0,44,182,12,16,16,0,1,0,0,186,0,0,0,0,0,0,79,0,0,59,123,2015-11-15 22:10:50,2047,2047,63,63,0,0.0,0,0.0,0.0,0.091842,0.182749,"[104, 21, 9, 46, 100]",,,,,,
35877,3587,0,the king i was,16344,42,130,0,1065,325,139,3,15,0.0,0,0,0,16,44,11,16,20,29,3,1,0,658,0,16,0,0,0,0,618,0,4,289,204,2015-11-12 18:33:19,2047,2039,63,63,0,0.0,0,,1.0,0.0,0.273427,"[7, 63, 42, 39, 30]",,,,,,


0

Some of these players have a leaver status, and others have very poor stats overall, suggesting that these matches may have been thrown. However, others have decent stats across the board, indicating that the data could have been corrupted before manipulation. We will count the null values across our engineered features once the TrueSkill ratings have been added.

#### TrueSkill

To ensure a fair matchmaking process, each player's skill level must be re-evaluated as they play more matches. Our data cleanup notebook produced a file with TrueSkill ratings for each player based on previous match results. However, some players in our current dataset may not be included in the TrueSkill file, which is a potential issue that needs to be addressed.

In [17]:
# Load required files
trueskill_df = pd.read_csv(clean_folder + '/trueskill.csv')
print('trueskill: {:,} observations, {:,} features'.format(*trueskill_df.shape))

trueskill: 834,226 observations, 6 features


In [18]:
trueskill_df.head()

Unnamed: 0,account_id,total_wins,total_matches,trueskill_mu,trueskill_sigma,conservative_skill_estimate
0,236579,14,24,27.868035,5.212361,12.230953
1,-343,1,1,26.544163,8.065475,2.347736
2,-1217,1,1,26.521103,8.114989,2.176136
3,-1227,1,1,27.248025,8.092217,2.971375
4,-1284,0,1,22.931016,8.092224,-1.345657


In [22]:
# Instantiate the ratings dictionary
ts_init_ratings = {}

# Get the list of unique account IDs from the players DataFrame
account_ids = players['account_id'].unique().tolist()
account_ids.remove(0) # Remove anon accounts with value 0

# Read mu and sigma from the file and store them in ts_init_ratings.
# Build the account_id lookup once instead of filtering trueskill_df for every account.
ts_lookup = trueskill_df.drop_duplicates('account_id')\
                .set_index('account_id')[['trueskill_mu', 'trueskill_sigma']]\
                .to_dict('index')

for account in account_ids:
    rating = ts_lookup.get(account)
    if rating is not None:
        ts_init_ratings[account] = trueskill.Rating(mu=rating['trueskill_mu'], sigma=rating['trueskill_sigma'])
    else:
        # Accounts missing from the file start with the default rating
        ts_init_ratings[account] = trueskill.Rating()

print('Total account IDs in players DF:', len(account_ids))
print('Total keys in initial ratings dictionary:', len(ts_init_ratings))

Total account IDs in players DF: 158360
Total keys in initial ratings dictionary: 158360


In [23]:
trueskill_df[trueskill_df['account_id'] == 9][['trueskill_mu', 'trueskill_sigma']].values[0]

array([27.24786409,  7.48412389])

In [24]:
ts_init_ratings[9]

trueskill.Rating(mu=27.248, sigma=7.484)

In [25]:
team1 = [ts_init_ratings[9],ts_init_ratings[10],ts_init_ratings[1],ts_init_ratings[2],ts_init_ratings[5]]
team2 = [ts_init_ratings[8],ts_init_ratings[7],ts_init_ratings[6],ts_init_ratings[3],ts_init_ratings[4]]

win_probability(team1, team2)

0.6533004626640428
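To make the calculation behind `win_probability` explicit, here is a dependency-free sketch of the same formula: the difference in summed means is divided by the square root of `size * beta**2` plus the summed variances, and passed through the standard normal CDF. It assumes the `trueskill` package defaults (`mu = 25`, `sigma = 25/3`, `beta = sigma / 2`) are unchanged; the function name and the example ratings are hypothetical.

```python
import math
from statistics import NormalDist

BETA = 25 / 6  # assumed package default: sigma / 2 with sigma = 25 / 3

def win_probability_sketch(team1, team2, beta=BETA):
    """Probability that team1 beats team2; teams are lists of (mu, sigma) tuples."""
    delta_mu = sum(mu for mu, _ in team1) - sum(mu for mu, _ in team2)
    sum_sigma = sum(sigma ** 2 for _, sigma in team1 + team2)
    size = len(team1) + len(team2)
    denom = math.sqrt(size * beta ** 2 + sum_sigma)
    return NormalDist().cdf(delta_mu / denom)

fresh = (25.0, 25 / 3)   # a player with the default rating
strong = (30.0, 4.0)     # made-up rating for a stronger, better-known player

even = win_probability_sketch([fresh] * 5, [fresh] * 5)
favoured = win_probability_sketch([strong] * 5, [fresh] * 5)
underdog = win_probability_sketch([fresh] * 5, [strong] * 5)
print(even, favoured, underdog)
```

Two identical teams come out at exactly 0.5, and swapping the teams gives complementary probabilities, which is a quick sanity check for any matchup.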

In [None]:
# Rename the conservative_skill_estimate to trueskill
trueskill_df.rename(columns={'conservative_skill_estimate': 'trueskill'}, inplace=True)

# Merging to the players' DataFrame
players = players.merge(trueskill_df[['account_id', 'trueskill']], on='account_id', how='left')
display(players.sample(50))
gc.collect()

#### Filling Missing Values

In [None]:
# Find features with missing values
null_counts = players.isna().sum().sort_values(ascending=False)
display(null_counts[null_counts > 0])

In [None]:
# Explore team missing values
print('Team KDA:')
display(players[(players['team_kda'].isna()) & (players['kda'] != 0)][['match_id', 'player_slot', 'kda']])
print('\n------------------------------------\nTeam Denies')
display(players[(players['team_denies'].isna()) & (players['denies'] != 0)][['match_id', 'player_slot', 'denies']])

Based on the results, the null values in the team features can be treated as 0. They arise when a player's value is divided by a team value of 0, which happens only when the player and the rest of the team's players had a 0 in the original column.
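The mechanism can be reproduced on a toy example: in pandas, dividing 0 by 0 yields `NaN`, which is why a team with no denies produces null ratios for all of its players. The values below are made up for illustration.

```python
import pandas as pd

# Two players on a team with no denies, one player on a team with 12 denies
player_denies = pd.Series([0, 0, 3])
team_denies = pd.Series([0, 0, 12])

ratio = player_denies / team_denies   # 0 / 0 evaluates to NaN
ratio_filled = ratio.fillna(0)        # treat "no contribution possible" as 0

print(ratio.tolist())
print(ratio_filled.tolist())
```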

In [None]:
# Fill missing team_kda and team_denies values
players['team_kda'] = players['team_kda'].fillna(0)
players['team_denies'] = players['team_denies'].fillna(0)

# Find features with missing values
null_counts = players.isna().sum().sort_values(ascending=False)
display(null_counts[null_counts > 0])

There are many null values in our trueskill column. We can assume that these scores were either 0 or belong to players whose account ID was hidden (recorded as 0). Let's explore this assumption.
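A sketch of how the two cases could be told apart is shown below, using a toy frame with the same `account_id` and `trueskill` columns. It also illustrates why a score of 0 is plausible for an unrated player: assuming the `trueskill` package defaults (`mu = 25`, `sigma = 25/3`), the conservative estimate `mu - 3 * sigma` of a default rating is 0.

```python
import pandas as pd

# Toy rows: two hidden accounts (ID 0), one rated player, one unrated player
toy = pd.DataFrame({
    'account_id': [0, 9, 42, 0],
    'trueskill': [None, 4.8, None, None],
})

missing = toy['trueskill'].isna()
hidden = missing & (toy['account_id'] == 0)    # anonymous accounts
unrated = missing & (toy['account_id'] != 0)   # visible ID, but no rating found

# Conservative estimate of a default rating (assumed package defaults)
DEFAULT_MU, DEFAULT_SIGMA = 25.0, 25.0 / 3
default_estimate = DEFAULT_MU - 3 * DEFAULT_SIGMA

print('hidden accounts:', int(hidden.sum()))
print('unrated accounts:', int(unrated.sum()))
print('default conservative estimate:', round(default_estimate, 6))
```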

In [None]:
print('Displayed Account IDs:')
display(players[(players['trueskill'].isna()) & (players['account_id'] != 0)][['match_id', 'player_slot', 'account_id', 'trueskill']])
account_ids_null_trueskill = players[(players['trueskill'].isna()) & (players['account_id'] != 0)]['account_id']
print('\n------------------------------------\nTrueskill Dataset')
display(trueskill_df[trueskill_df['account_id'].isin(account_ids_null_trueskill)])

It seems that some of these scores were 

### Matches DataFrame

#### Team Aggregations

In [None]:
# Pivot the team_features DataFrame
team_pivoted = team_features.pivot(index='match_id', columns='radiant_team')

# Flatten MultiIndex columns
team_pivoted.columns = ['{}_{}'.format(col[0], 'radiant' if col[1] == 1 else 'dire')\
                        for col in team_pivoted.columns]

# Reset the index
team_pivoted.reset_index(inplace=True)
display(team_pivoted.head())
print('----------------------------')

# Merge results with the matches DataFrame
matches = matches.drop(columns=['negative_votes', 'positive_votes'])\
            .merge(team_pivoted, on=['match_id'], how='left')
display(matches.head())
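
To make the pivot-and-flatten step easier to follow, here is a minimal sketch on a toy frame. The column names `team_kills` and `team_denies` and all values are assumptions for illustration. Pivoting without a `values` argument yields MultiIndex columns of the form `(feature, radiant_team)`, which we then flatten into `feature_radiant` / `feature_dire`.

```python
import pandas as pd

# Toy team-level frame: one row per (match, side); values are made up
team_toy = pd.DataFrame({
    'match_id': [1, 1, 2, 2],
    'radiant_team': [1, 0, 1, 0],
    'team_kills': [30, 22, 15, 41],
    'team_denies': [12, 9, 4, 7],
})

# One row per match; columns become a MultiIndex of (feature, radiant_team)
pivoted_toy = team_toy.pivot(index='match_id', columns='radiant_team')

# Flatten the MultiIndex into single-level names
pivoted_toy.columns = [
    f"{feature}_{'radiant' if flag == 1 else 'dire'}"
    for feature, flag in pivoted_toy.columns
]
pivoted_toy = pivoted_toy.reset_index()
print(pivoted_toy)
```

Note that `pivot` raises an error if a `(match_id, radiant_team)` pair appears more than once, which doubles as a sanity check on the input.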

In [None]:
# Pivot the match_picks DataFrame
picks_pivoted = match_picks.pivot(index='match_id', columns='radiant_team')

# Flatten MultiIndex columns
picks_pivoted.columns = ['{}_{}'.format(col[0], 'radiant' if col[1] == 1 else 'dire')\
                        for col in picks_pivoted.columns]

# Reset the index
picks_pivoted.reset_index(inplace=True)
display(picks_pivoted.head())
print('----------------------------')

# Expanding the lists
radiant_picks = pd.DataFrame(picks_pivoted['team_hero_picks_radiant'].tolist(), 
                             index=picks_pivoted.index, 
                             columns=[f'hero_slot_{i}' for i in range(5)])

dire_picks = pd.DataFrame(picks_pivoted['team_hero_picks_dire'].tolist(), 
                          index=picks_pivoted.index, 
                          columns=[f'hero_slot_{i+128}' for i in range(5)])

picks_pivoted = pd.concat([picks_pivoted, radiant_picks, dire_picks], axis=1)
picks_pivoted.drop(columns=['team_hero_picks_radiant', 'team_hero_picks_dire'], inplace=True)

# Merge results with the matches DataFrame
matches = matches.merge(picks_pivoted, on=['match_id'], how='left')
display(matches.head())
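
The list expansion above can be traced on a small example. This sketch assumes the pick column holds actual Python lists of exactly five hero IDs (if the column was read from a CSV it may hold strings instead, which would need parsing first). The frame `picks_toy` and its values are hypothetical.

```python
import pandas as pd

# Toy frame with one list of five hero IDs per match (values are made up)
picks_toy = pd.DataFrame({
    'match_id': [1, 2],
    'team_hero_picks_radiant': [[11, 22, 33, 44, 55], [5, 4, 3, 2, 1]],
})

# Turn each list into five separate columns, keeping the original row index
expanded = pd.DataFrame(
    picks_toy['team_hero_picks_radiant'].tolist(),
    index=picks_toy.index,
    columns=[f'hero_slot_{i}' for i in range(5)],
)

# Replace the list column with the expanded columns
picks_wide = pd.concat(
    [picks_toy.drop(columns='team_hero_picks_radiant'), expanded], axis=1
)
print(picks_wide)
```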

#### Teamfights

In [None]:
# Load required file
teamfights = pd.read_csv(clean_folder + '/teamfights.csv', index_col=0)
print(f'teamfights:', '{:,} observations, {:,} features'.format(teamfights.shape[0], teamfights.shape[1]))

In [None]:
# Check the total matches
teamfights['match_id'].nunique()

In [None]:
# Look at the head
teamfights.head()

We want to calculate the total number of teamfights per match in the matches DataFrame, and we can also determine the average duration of teamfights per match.
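
As a minimal sketch of this aggregation on toy data, the example below uses pandas named aggregation, an alternative to passing a dict and renaming afterwards. It assumes `tf_order` numbers the teamfights of a match starting at 1, so that its maximum equals the teamfight count; the frame `tf_toy` and its values are made up.

```python
import pandas as pd

# Toy teamfight log: one row per teamfight (values are made up)
tf_toy = pd.DataFrame({
    'match_id': [1, 1, 1, 2, 2],
    'tf_order': [1, 2, 3, 1, 2],  # assumed to start at 1 within each match
    'start': [100, 400, 900, 250, 700],
    'end': [130, 460, 930, 300, 720],
})
tf_toy['duration'] = tf_toy['end'] - tf_toy['start']

# Named aggregation: output column = (input column, function)
per_match_toy = tf_toy.groupby('match_id', as_index=False).agg(
    teamfights=('tf_order', 'max'),
    tf_avg_duration=('duration', 'mean'),
)
print(per_match_toy)
```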

In [None]:
# Create duration feature
tf_duration = teamfights['end'] - teamfights['start']
teamfights.insert(4, 'duration', value=tf_duration)

# Selecting the agg functions for each column
agg_funcs = {
    'tf_order': 'max',
    'duration': 'mean'
}

# Aggregating features
tfs_per_match = teamfights[['match_id', 'tf_order', 'duration']].groupby('match_id', as_index=False).agg(agg_funcs)

# Rename the aggregated columns to match format
tfs_per_match.rename(columns={'tf_order': 'teamfights', 'duration': 'tf_avg_duration'}, inplace=True)

tfs_per_match.head(10)

In [None]:
# Merge results with the matches DataFrame
matches = matches.merge(tfs_per_match, on=['match_id'], how='left')
display(matches.head())

#### Filling Missing Values

In [None]:
# Find features with missing values
display(matches.isna().sum().sort_values(ascending=False)\
[matches.isna().sum().sort_values(ascending=False) > 0])

In [None]:
# Look at the rows with missing values
matches[matches['teamfights'].isna()]

Reviewing these rows, we find no discernible pattern that explains why teamfight data is absent for these matches. Since their information is incomplete, we exclude them from all of our DataFrames.
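
A minimal sketch of excluding the same matches from several DataFrames is shown below. The toy frames and values are hypothetical; the idea is to collect the affected match IDs once and filter every frame with `isin`, so the frames stay consistent with each other.

```python
import numpy as np
import pandas as pd

# Toy frames (values are made up): match 2 has no teamfight data
matches_toy = pd.DataFrame({
    'match_id': [1, 2, 3],
    'teamfights': [4.0, np.nan, 7.0],
})
players_toy = pd.DataFrame({
    'match_id': [1, 1, 2, 2, 3],
    'player_slot': [0, 128, 0, 128, 0],
})

# Collect the IDs of matches with missing teamfight data
dropped_ids = matches_toy.loc[matches_toy['teamfights'].isna(), 'match_id'].tolist()

# Apply the same exclusion to every frame
matches_kept = matches_toy[~matches_toy['match_id'].isin(dropped_ids)]
players_kept = players_toy[~players_toy['match_id'].isin(dropped_ids)]

print(dropped_ids, len(matches_kept), len(players_kept))
```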

In [None]:
# Create a list of the match IDs to be dropped
dropped_matches = matches[matches['teamfights'].isna()]['match_id'].tolist()

# Remove these matches from the players and matches DataFrames
players = players[~players['match_id'].isin(dropped_matches)].dropna()
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))
print('-----------------------------------------------------')
matches.dropna(inplace=True)
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))

### Temporal Features

In [None]:
# Load required file
player_time = pd.read_csv(clean_folder + '/player_time.csv', index_col=0)
print(f'player_time:', '{:,} observations, {:,} features'.format(player_time.shape[0], player_time.shape[1]))

In [None]:
# Melt the player_time DataFrame
player_time_melted = pd.melt(player_time, id_vars=['match_id', 'times'], 
                              value_vars=[col for col in player_time.columns if\
                                          col.startswith(('gold_t_', 'lh_t_', 'xp_t_'))],
                              var_name='metric', value_name='value')

# Look at the shape of the melted DataFrame
print(f'player_time_melted:', '{:,} observations, {:,} features'.format(player_time_melted.shape[0], player_time_melted.shape[1]))
display(player_time_melted.head())
gc.collect()

In [None]:
# Split the metric name into its type (gold, lh, xp) and the player slot
player_time_melted[['metric_type', 'player_slot']] = player_time_melted['metric'].str.split('_t_', expand=True)
player_time_melted['player_slot'] = player_time_melted['player_slot'].astype(int)
display(player_time_melted.head())
gc.collect()

In [None]:
# Pivot the table to create a wide format
player_time_wide = player_time_melted.pivot_table(index=['match_id', 'times', 'player_slot'], 
                                                  columns='metric_type', 
                                                  values='value',
                                                  aggfunc='sum').reset_index()

# Look at the shape of the wide DataFrame
print(f'player_time_wide:', '{:,} observations, {:,} features'.format(player_time_wide.shape[0], player_time_wide.shape[1]))
display(player_time_wide.head(50))
gc.collect()
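
The reshape performed in the three cells above (melt, split, pivot) can be traced end to end on a toy frame. The sketch below uses made-up values and only two player slots; it assumes the same `<metric>_t_<slot>` column naming as `player_time`.

```python
import pandas as pd

# Toy wide frame: one row per (match, time), one column per metric and slot
time_toy = pd.DataFrame({
    'match_id': [1, 1],
    'times': [0, 60],
    'gold_t_0': [0, 150],
    'xp_t_0': [0, 90],
    'gold_t_128': [0, 200],
    'xp_t_128': [0, 120],
})

# 1) Wide to long: one row per (match, time, metric column)
long_toy = time_toy.melt(id_vars=['match_id', 'times'],
                         var_name='metric', value_name='value')

# 2) Split 'gold_t_128' into metric_type='gold' and player_slot=128
long_toy[['metric_type', 'player_slot']] = long_toy['metric'].str.split('_t_', expand=True)
long_toy['player_slot'] = long_toy['player_slot'].astype(int)

# 3) Long to wide again: one row per (match, time, player), one column per metric
wide_toy = long_toy.pivot_table(index=['match_id', 'times', 'player_slot'],
                                columns='metric_type',
                                values='value',
                                aggfunc='sum').reset_index()
print(wide_toy)
```

Each `(match_id, times, player_slot)` combination is unique here, so `aggfunc='sum'` only passes the single value through.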

#### Ability Upgrades

#### Purchase Log

#### Objectives

In [None]:
# Load the file
objectives = pd.read_csv(clean_folder + '/objectives.csv', index_col=0)
print(f'objectives:', '{:,} observations, {:,} features'.format(objectives.shape[0], objectives.shape[1]))

In [None]:
objectives.groupby('subtype')['value'].nunique()

In [None]:
# Separate the objectives into multiple features
objectives['aegis'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_AEGIS', 1, 0)
objectives['aegis_stolen'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_AEGIS_STOLEN', 1, 0)
objectives['firstblood'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_FIRSTBLOOD', 1, 0)
objectives['roshan_kill'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_ROSHAN_KILL', 1, 0)
objectives['tower_deny'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_TOWER_DENY', 1, 0)
objectives['tower_kill'] = np.where(objectives['subtype'] == 'CHAT_MESSAGE_TOWER_KILL', 1, 0)

# Look at the head
objectives.head()

In [None]:
# Round the time values down to the start of their 60-second interval
objectives['time'] = (objectives['time'] // 60) * 60

# Aggregate objectives
objective_features = objectives.drop(columns=['player2', 'subtype', 'value']).\
                        groupby(['match_id', 'player1', 'time']).sum().reset_index()

# Look at the objectives features DataFrame head
display(objective_features.head())

In [None]:
# Rename the time and player1 columns to match format
objective_features.rename(columns={'time': 'times', 'player1': 'player_slot'}, inplace=True)

# Merge aggregated objectives data
player_time_wide = player_time_wide.merge(objective_features, on=['match_id', 'player_slot', 'times'], how='left')
display(player_time_wide.head(50))
gc.collect()

#### Teamfight Durations

In [None]:
teamfights.head()

In [None]:
# Create the time-to-last-death feature (seconds from teamfight start)
tf_last_death = teamfights['last_death'] - teamfights['start']
teamfights.insert(6, 'tf_last_death', value=tf_last_death)

# Round the start times down to the start of their 60-second interval
teamfights['times'] = (teamfights['start'] // 60) * 60

tfs_features = teamfights.drop(columns=['start', 'end', 'last_death', 'deaths'])

# Look at the teamfights features DataFrame head
display(tfs_features.head())

In [None]:
tf_players.head()

In [None]:
# Calculate xp_delta on tf_players
tf_players['tf_xp_delta'] = tf_players['xp_end'] - tf_players['xp_start']
tf_players.drop(columns=['xp_end', 'xp_start'], inplace=True)

# Reset the index from tfs_features
tfs_features = tfs_features.reset_index()

# Merge tf_players with tfs_features
tfs_features = tf_players.merge(tfs_features, left_on=['match_id', 'tf_id'], right_on=['match_id', 'index'])
tfs_features.drop(columns=['index', 'tf_id', 'tf_order'], inplace=True)
display(tfs_features.head(15))

In [None]:
# Rename the teamfight columns to match format
tfs_features.rename(columns={
    'buybacks': 'tf_buybacks',
    'damage': 'tf_damage',
    'deaths': 'tf_deaths',
    'gold_delta': 'tf_gold_delta',
    'duration': 'tf_duration'
}, inplace=True)

# Merge teamfight features data
player_time_wide = player_time_wide.merge(tfs_features, on=['match_id', 'player_slot', 'times'], how='left')

# Drop the matches with incomplete teamfight data
player_time_wide = player_time_wide[~player_time_wide['match_id'].isin(dropped_matches)]

# Display the first 50 final observations
display(player_time_wide.head(50))
gc.collect()

#### Chat Log

In [None]:
# Load the file
chat = pd.read_csv(clean_folder + '/chat.csv', index_col=0)
print(f'chat:', '{:,} observations, {:,} features'.format(chat.shape[0], chat.shape[1]))

In [None]:
# Look at the head
chat.head()

In [None]:
# Round the time values down to the start of their 60-second interval
chat['time'] = (chat['time'] // 60) * 60

# Aggregate chat messages
chat_features = chat.groupby(['match_id', 'player_slot', 'time'])['chat'].count().reset_index()

# Look at the chat features DataFrame head
display(chat_features.head())

In [None]:
chat_features[chat_features['time'] < 0]
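
The check above looks for messages sent before the game clock reaches 0. A short sketch of how the floor-division bucketing treats such timestamps is given below (the sample times are made up). Floor division rounds toward negative infinity, so a message at -1 s lands in the -60 bucket rather than in 0.

```python
import pandas as pd

# Made-up chat timestamps in seconds, including pre-game (negative) times
chat_times = pd.Series([-75, -1, 0, 59, 60, 125])

# Same bucketing as in the notebook: floor to the start of the 60-second interval
buckets = (chat_times // 60) * 60

print(buckets.tolist())  # [-120, -60, 0, 0, 60, 120]
```

If the `times` column of `player_time_wide` starts at 0 (an assumption we have not verified here), rows in negative buckets find no partner in a left merge from `player_time_wide` and are silently left out.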

In [None]:
# Rename time column to match format
chat_features.rename(columns={'time': 'times', 'chat': 'chats_sent'}, inplace=True)

# Merge aggregated chat data
player_time_wide = player_time_wide.merge(chat_features, on=['match_id', 'player_slot', 'times'], how='left')
display(player_time_wide.head(50))
gc.collect()

#### Filling Missing Values

In [None]:
# Find features with missing values
display(player_time_wide.isna().sum().sort_values(ascending=False)\
[player_time_wide.isna().sum().sort_values(ascending=False) > 0])

In [None]:
# Drop the remaining rows with missing values from the players DataFrame
players.dropna(inplace=True)
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))

---

## Merging all DataFrames

In [None]:
# Drop overlapping columns and rename the time-series columns before merging
players_merge = players.drop(columns=['gold_per_min', 'xp_per_min', 'tf_buybacks', 
                                      'tf_deaths', 'tf_avg_gold_delta', 'tf_avg_xp_delta'])
player_time_wide.rename(columns={'gold': 'gold_per_min', 'lh': 'lh_per_min', 'xp': 'xp_per_min'}, inplace=True)

# Merge the player_time_wide with players df
dask_players = dd.from_pandas(players_merge, npartitions=4)  
dask_player_time = dd.from_pandas(player_time_wide, npartitions=8)  

merged_dask = dask_players.merge(dask_player_time, on=['match_id', 'player_slot'], how='left')
final_df = merged_dask.compute()

# Look at the initial shape of the final DataFrame
print(f'final_df:', '{:,} observations, {:,} features'.format(final_df.shape[0], final_df.shape[1]))
final_df = final_df.sort_values(by=['match_id', 'player_slot']).reset_index(drop=True)
display(final_df.head(50))
gc.collect()

In [None]:
# Save the final DataFrames to a CSV file
matches.to_csv('../Data/Merged/matches.csv', index=False)
print(f'matches:', '{:,} observations, {:,} features'.format(matches.shape[0], matches.shape[1]))

players.to_csv('../Data/Merged/players.csv', index=False)
print(f'players:', '{:,} observations, {:,} features'.format(players.shape[0], players.shape[1]))

player_time_wide.to_csv('../Data/Merged/timeseries.csv', index=False)
print(f'timeseries:', '{:,} observations, {:,} features'.format(player_time_wide.shape[0], player_time_wide.shape[1]))

final_df.to_csv('../Data/Merged/final_df.csv', index=False)
print(f'final_df:', '{:,} observations, {:,} features'.format(final_df.shape[0], final_df.shape[1]))