# Dota 2 Matches: EDA

Created by: **Juan Pablo Nieto**

---

## Dataset Overview

*From [**Devin Anzelmo**](https://www.kaggle.com/datasets/devinanzelmo/dota-2-matches/data):*

This dataset contains 50,000 ranked ladder matches from the Dota 2 data dump created by [Opendota](https://www.opendota.com/). It was inspired by the [Dota 2 Matches](https://www.kaggle.com/jraramirez/dota-2-matches-dataset) data published here by **Joe Ramir**. This is an updated and improved version of that dataset.

Dota 2 is a popular free-to-play MOBA that can take up thousands of hours of your life. About as many games as this dataset contains are played every hour. If you like the data, an additional 2-3 million matches are easily available for download.

This dataset aims to enable the exploration of player behaviour, skill estimation, or anything else you find interesting. The intent is to create an accessible and easy-to-use resource that can be expanded and modified if needed. As such, I am open to a wide variety of suggestions as to what additions or changes to make.

> [**Quick look at how the dataset is structured**](https://www.kaggle.com/code/devinanzelmo/a-quick-look-at-dota-2-dataset)

|   CSV File             |  Description  | Notes |
|:-----------------------|:--------------|:------|
|  **Match Info**        |  |  |
| match                  | Top-level information about each match | `tower_status` and `barracks_status` are binary masks indicating whether various structures have been destroyed |
| players                | Statistics about each player's individual performance in each match | Players who chose to hide their `account_id` are marked as `0` |
| player_time            | Contains XP, gold, and last-hit totals for each player at one-minute intervals | The suffix for each variable indicates the value of the `player_slot` variable |
| teamfights             | Basic information about each team fight | `start`, `end`, and `last_death` contain the time for those events in seconds |
| teamfights_players     | Detailed info about each team fight | Each row in `teamfights.csv` corresponds to ten rows in this file |
| chat                   | Chat log for all matches | These include the player's name in game |
| objectives             | Gives information on all the objectives completed, by which player and at what time |  |
| ability_upgrades       | Contains the upgrade performed at each level for each player |  |
| purchase_log           | Contains the time in seconds for each purchase made by every player in every match |  |
| **Game Info**          |  |  |
| ability_ids            | Ability names and ids | Use with `ability_upgrades.csv` to get the names of upgraded abilities |
| item_ids               | Contains `item_id` and item name | Use with `purchase_log.csv` to get the names of purchased items |
| hero_ids               | Contains the `name`, `hero_id`, and `localized_name` for each hero a player can pick | This file was merged with the one found [here](https://www.kaggle.com/datasets/nihalbarua/dota2-hero-preference-by-mmr) to obtain the `Primary Attribute` and possible `Roles` |
| cluster_regions        | Contains the cluster number and geographic region | Allows filtering matches by region |
| patch_dates            | Release dates for various patches | Use `start_time` from `match.csv` to determine which patch each match was played on |
| **Historical Info**    |  |  |
| MMR                    | Contains `account_id` and players' **Matchmaking Rating** *(**MMR** for short)* | File extracted from the [**OpenDota Core Wiki**](https://github.com/odota/core/wiki/MMR-Data), on which the original dataset is based |
| player_ratings         | Skill data computed on **900k** previous matches and a possible way to measure skill rating when **MMR** is not available | `trueskill` ratings have two components: `mu`, which can be interpreted as the skill (higher is better), and `sigma`, which is the uncertainty of the rating. Negative `account_id` values are players who do not appear in the other data available in this dataset |
| match_outcomes         | Results with `account_id` for **900k** matches occurring prior to the rest of the dataset | Each match has data on two rows. The `rad` feature indicates whether the team is Radiant or Dire. *Useful for creating custom skill calculations* |
| **Tests**              |  |  |
| test_labels            | `match_id` and `radiant_win` as integer 1 or 0 |  |
| test_player            | Full player and match table with `hero_id`, `player_slot`, `match_id`, and `account_id`|  |
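
The `tower_status` and `barracks_status` masks mentioned in the table can be unpacked bit by bit. The sketch below is only an illustration: the helper names are hypothetical, and the bit widths (11 for towers, 6 for barracks) are an assumption inferred from the column maxima of 2047 and 63 reported in the match overview. Whether a set bit means "still standing" or "destroyed" should be checked against the OpenDota documentation before drawing conclusions.

```python
def count_set_bits(mask: int) -> int:
    """Return how many bits are set to 1 in an integer status mask."""
    return bin(int(mask)).count("1")


def mask_to_flags(mask: int, n_bits: int) -> list:
    """Expand a mask into a list of booleans, lowest bit first."""
    return [bool((int(mask) >> bit) & 1) for bit in range(n_bits)]


# 2047 and 63 are the maxima of the tower and barracks columns
print(count_set_bits(2047))    # every one of the 11 assumed tower bits is set
print(count_set_bits(1974))    # a partially set tower mask
print(mask_to_flags(63, 6))    # every one of the 6 assumed barracks bits is set
```

Applied to a whole column, `matches['overview']['tower_status_radiant'].apply(count_set_bits)` would give one count per match.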

---

## Initial Setup

In [1]:
# Basic Data Science Libraries
import pandas as pd
import numpy as np
import os

# Plotting Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# Statistics Libraries
from statsmodels.api import tsa
import statsmodels.api as sm

# Removing the max columns limiter
pd.set_option('display.max_columns', None)

In [2]:
dfs = {}

# Iterate over files in the directory
for file in os.listdir('Dataset/'):
    
    # Skip hidden files and directories
    if file.startswith('.'):
        continue

    # Include CSV files exclusively
    if file.endswith('.csv'):
        
        # Construct full file path
        file_path = os.path.join('Dataset/', file)
    
        # Use the file name without its extension as the dictionary key
        key = file.split('.csv')[0]

        # Store each DataFrame in the dfs dictionary
        dfs[key] = pd.read_csv(file_path)

print('Total files imported:', len(dfs))
print('DataFrame Shapes:')
for df in dfs:
    print('\n', f'{df}:', '{:,} observations, {:,} features'.format(dfs[df].shape[0], dfs[df].shape[1]))

Total files imported: 19
DataFrame Shapes:

 player_time: 2,209,778 observations, 32 features

 test_player: 1,000,000 observations, 4 features

 teamfights_players: 5,390,470 observations, 8 features

 item_ids: 189 observations, 2 features

 test_labels: 100,000 observations, 2 features

 chat: 1,439,488 observations, 5 features

 ability_upgrades: 8,939,599 observations, 5 features

 purchase_log: 18,193,745 observations, 4 features

 match: 50,000 observations, 13 features

 cluster_regions: 53 observations, 2 features

 players: 500,000 observations, 73 features

 MMR: 1,069,672 observations, 2 features

 ability_ids: 688 observations, 2 features

 match_outcomes: 1,828,588 observations, 10 features

 player_ratings: 834,226 observations, 5 features

 teamfights: 539,047 observations, 5 features

 hero_ids: 112 observations, 5 features

 objectives: 1,173,396 observations, 9 features

 patch_dates: 19 observations, 2 features


---

## Data Wrangling

Even after going through the [**quick look into the dataset structure**](https://www.kaggle.com/code/devinanzelmo/a-quick-look-at-dota-2-dataset) linked above, I still need to make sure that the data is clean before modelling.

I'll leave the original DataFrames intact in the `dfs` dictionary as a backup and organize the cleaned versions in new dictionaries.

In [26]:
# Grouping the game metadata in one dictionary
dota2 = {
    'abilities': dfs['ability_ids'].copy(),
    'items': dfs['item_ids'].copy(),
    'heroes': dfs['hero_ids'].copy(),
    'regions': dfs['cluster_regions'].copy(),
    'patches': dfs['patch_dates'].copy(),
    'positions': pd.DataFrame({
        'player_slot': [0,1,2,3,4,
                        128,129,130,131,132], 
        'side': ['Radiant','Radiant','Radiant','Radiant','Radiant',
                 'Dire','Dire','Dire','Dire','Dire'],
        'position': ['Carry', 'Midlaner', 'Offlaner', 'Roamer', 'Hard Support',
                    'Carry', 'Midlaner', 'Offlaner', 'Roamer', 'Hard Support'],
        'roles': [{'Carry','Escape','Pushers'}, 
                  {'Carry','Durable','Pushers','Disabler','Nuker'}, 
                  {'Carry','Durable','Initiator','Pushers','Disabler'}, 
                  {'Support','Escape','Initiator','Disabler'}, 
                  {'Support','Escape','Initiator','Disabler'},
                  {'Carry','Escape','Pushers'}, 
                  {'Carry','Durable','Pushers','Disabler','Nuker'}, 
                  {'Carry','Durable','Initiator','Pushers','Disabler'}, 
                  {'Support','Escape','Initiator','Disabler'}, 
                  {'Support','Escape','Initiator','Disabler'}],
    }) # Included the 'positions' dataframe to have a relationship between player_slot and other in-game data
}

# Grouping the data containing info regarding the 50k matches
matches = {
    'overview': dfs['match'].copy(),
    'players': dfs['players'].copy(),
    'time': dfs['player_time'].copy(),
    'objectives': dfs['objectives'].copy(),
    'upgrades': dfs['ability_upgrades'].copy(),
    'purchases': dfs['purchase_log'].copy(),
    'chatlog': dfs['chat'].copy()
}

# Grouping teamfight datasets
tfs = {
    'overview': dfs['teamfights'].copy(),
    'breakdown': dfs['teamfights_players'].copy()
}

# Grouping historical and referential information
ref = {
    'prev_outcomes': dfs['match_outcomes'].copy(),
    'ratings': dfs['player_ratings'].copy(),
    'mmr': dfs['MMR'].copy()
}

# Grouping the test datasets together
test = {
    'matches': dfs['test_labels'].copy(),
    'players': dfs['test_player'].copy()
}

In [27]:
matches.keys()

dict_keys(['overview', 'players', 'time', 'objectives', 'upgrades', 'purchases', 'chatlog'])

---

### Dota2

#### Abilities

In [4]:
dota2['abilities'].head()

Unnamed: 0,ability_id,ability_name
0,0,ability_base
1,5001,default_attack
2,5002,attribute_bonus
3,5003,antimage_mana_break
4,5004,antimage_blink


Since this data frame contains only the `ability_id` and the ability name, I'll convert it to a dictionary so it is easier to map onto other data frames if needed.

In [5]:
# Converting the dataframe to a dictionary
dota2['abilities'] = dota2['abilities'].set_index('ability_id')['ability_name'].to_dict()
dota2['abilities']

{0: 'ability_base',
 5001: 'default_attack',
 5002: 'attribute_bonus',
 5003: 'antimage_mana_break',
 5004: 'antimage_blink',
 5005: 'antimage_spell_shield',
 5006: 'antimage_mana_void',
 5007: 'axe_berserkers_call',
 5008: 'axe_battle_hunger',
 5009: 'axe_counter_helix',
 5010: 'axe_culling_blade',
 5011: 'bane_brain_sap',
 5012: 'bane_enfeeble',
 5013: 'bane_fiends_grip',
 5014: 'bane_nightmare',
 5015: 'bloodseeker_bloodrage',
 5016: 'bloodseeker_blood_bath',
 5017: 'bloodseeker_thirst',
 5018: 'bloodseeker_rupture',
 5019: 'drow_ranger_frost_arrows',
 5020: 'drow_ranger_silence',
 5021: 'drow_ranger_trueshot',
 5022: 'drow_ranger_marksmanship',
 5023: 'earthshaker_fissure',
 5024: 'earthshaker_enchant_totem',
 5025: 'earthshaker_aftershock',
 5026: 'earthshaker_echo_slam',
 5027: 'juggernaut_blade_dance',
 5028: 'juggernaut_blade_fury',
 5029: 'juggernaut_healing_ward',
 5030: 'juggernaut_omni_slash',
 5031: 'kunkka_torrent',
 5032: 'kunkka_tidebringer',
 5033: 'kunkka_x_marks_the_

#### Items

In [6]:
dota2['items'].head()

Unnamed: 0,item_id,item_name
0,1,blink
1,2,blades_of_attack
2,3,broadsword
3,4,chainmail
4,5,claymore


In [7]:
# Converting the dataframe to a dictionary
dota2['items'] = dota2['items'].set_index('item_id')['item_name'].to_dict()
dota2['items']

{1: 'blink',
 2: 'blades_of_attack',
 3: 'broadsword',
 4: 'chainmail',
 5: 'claymore',
 6: 'helm_of_iron_will',
 7: 'javelin',
 8: 'mithril_hammer',
 9: 'platemail',
 10: 'quarterstaff',
 11: 'quelling_blade',
 12: 'ring_of_protection',
 13: 'gauntlets',
 14: 'slippers',
 15: 'mantle',
 16: 'branches',
 17: 'belt_of_strength',
 18: 'boots_of_elves',
 19: 'robe',
 20: 'circlet',
 21: 'ogre_axe',
 22: 'blade_of_alacrity',
 23: 'staff_of_wizardry',
 24: 'ultimate_orb',
 25: 'gloves',
 26: 'lifesteal',
 27: 'ring_of_regen',
 28: 'sobi_mask',
 29: 'boots',
 30: 'gem',
 31: 'cloak',
 32: 'talisman_of_evasion',
 33: 'cheese',
 34: 'magic_stick',
 36: 'magic_wand',
 37: 'ghost',
 38: 'clarity',
 39: 'flask',
 40: 'dust',
 41: 'bottle',
 42: 'ward_observer',
 43: 'ward_sentry',
 44: 'tango',
 45: 'courier',
 46: 'tpscroll',
 48: 'travel_boots',
 50: 'phase_boots',
 51: 'demon_edge',
 52: 'eagle',
 53: 'reaver',
 54: 'relic',
 55: 'hyperstone',
 56: 'ring_of_health',
 57: 'void_stone',
 58: 'my
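
With the ids held in plain dictionaries, names can be attached to other tables with `Series.map`. The following is a minimal sketch on a toy purchase log: the three dictionary entries are copied from the output above, while the `time` values and the unknown id `999` are made up for illustration.

```python
import pandas as pd

# A few entries copied from the item dictionary shown above
item_names = {29: 'boots', 44: 'tango', 46: 'tpscroll'}

# Toy purchase log; 999 is a made-up id with no entry in the dictionary
purchases = pd.DataFrame({
    'item_id': [44, 29, 999],
    'time': [-81, 120, 300],
})

# Ids missing from the dictionary become NaN instead of raising an error
purchases['item_name'] = purchases['item_id'].map(item_names)
unmatched = purchases['item_name'].isna().sum()

print(purchases)
print('Purchases without a known item name:', unmatched)
```

Counting the unmatched rows after mapping is a cheap way to spot ids that the lookup file does not cover.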

#### Heroes

In [8]:
dota2['heroes'].sample(5)

Unnamed: 0,name,hero_id,localized_name,Primary Attribute,Roles
15,npc_dota_hero_sand_king,16,Sand King,all,"Initiator, Disabler, Support, Nuker, Escape"
70,npc_dota_hero_gyrocopter,72,Gyrocopter,agi,"Carry, Nuker, Disabler"
12,npc_dota_hero_puck,13,Puck,int,"Initiator, Disabler, Escape, Nuker"
97,npc_dota_hero_bristleback,99,Bristleback,str,"Carry, Durable, Initiator, Nuker"
51,npc_dota_hero_furion,53,Nature's Prophet,int,"Carry, Pusher, Escape, Nuker"


The `Roles` column packs several values into a single comma-separated string; they would be easier to work with as a list.

In [9]:
# Setting the hero_id as the index
dota2['heroes'].set_index('hero_id', inplace=True)

# Dropping the name feature since it is redundant
dota2['heroes'].drop(columns='name', inplace=True)

# Renaming the localized_name to name and formatting Primary Attribute and Roles to follow the same as the rest
dota2['heroes'].rename(columns={'localized_name': 'name', 'Primary Attribute': 'primary_attribute', 'Roles': 'roles'},
                      inplace=True)

# Changing the Role values to list type
dota2['heroes']['roles'] = dota2['heroes']['roles'].apply(lambda x: x.split(', ') if isinstance(x, str) else [])

dota2['heroes'].head()

Unnamed: 0_level_0,name,primary_attribute,roles
hero_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Anti-Mage,agi,"[Carry, Escape, Nuker]"
2,Axe,str,"[Initiator, Durable, Disabler, Carry]"
3,Bane,all,"[Support, Disabler, Nuker, Durable]"
4,Bloodseeker,agi,"[Carry, Disabler, Nuker, Initiator]"
5,Crystal Maiden,int,"[Support, Disabler, Nuker]"
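
Once the roles are lists, they can be counted or filtered directly. This sketch rebuilds the five rows shown above as a toy frame (it does not touch `dota2['heroes']`) and uses `DataFrame.explode` to get one row per hero and role.

```python
import pandas as pd

# Toy copy of the five heroes displayed above
heroes = pd.DataFrame(
    {
        'name': ['Anti-Mage', 'Axe', 'Bane', 'Bloodseeker', 'Crystal Maiden'],
        'roles': [
            ['Carry', 'Escape', 'Nuker'],
            ['Initiator', 'Durable', 'Disabler', 'Carry'],
            ['Support', 'Disabler', 'Nuker', 'Durable'],
            ['Carry', 'Disabler', 'Nuker', 'Initiator'],
            ['Support', 'Disabler', 'Nuker'],
        ],
    },
    index=pd.Index([1, 2, 3, 4, 5], name='hero_id'),
)

# One row per (hero, role) pair, then count the heroes in each role
role_counts = heroes.explode('roles')['roles'].value_counts()

# Heroes whose role list contains 'Carry'
carries = heroes[heroes['roles'].apply(lambda roles: 'Carry' in roles)]

print(role_counts)
print(carries['name'].tolist())
```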


#### Regions

In [10]:
dota2['regions'].head()

Unnamed: 0,cluster,region
0,111,US WEST
1,112,US WEST
2,113,US WEST
3,121,US EAST
4,122,US EAST


In [11]:
# Changing the index to cluster
dota2['regions'].set_index('cluster', inplace=True)
dota2['regions'].head()

Unnamed: 0_level_0,region
cluster,Unnamed: 1_level_1
111,US WEST
112,US WEST
113,US WEST
121,US EAST
122,US EAST


#### Patches

In [12]:
dota2['patches']

Unnamed: 0,patch_date,name
0,2010-12-24T00:00:00Z,6.7
1,2011-01-21T00:00:00Z,6.71
2,2011-04-27T00:00:00Z,6.72
3,2011-12-24T00:00:00Z,6.73
4,2012-03-10T00:00:00Z,6.74
5,2012-09-30T00:00:00Z,6.75
6,2012-10-21T00:00:00Z,6.76
7,2012-12-15T00:00:00Z,6.77
8,2013-05-30T00:00:00Z,6.78
9,2013-11-24T00:00:00Z,6.79
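
The dataset notes suggest using `start_time` from the match table to find the patch each match was played on. One possible way to do that is sketched below, assuming `start_time` is a Unix timestamp in seconds and that a patch stays active until the next release date. The helper names and the toy match rows are hypothetical; the three patch rows are copied from the output above, with the names written as strings.

```python
import numpy as np
import pandas as pd

EPOCH = pd.Timestamp('1970-01-01', tz='UTC')


def to_epoch_seconds(timestamps):
    """Convert timezone-aware timestamps to whole seconds since the Unix epoch."""
    return (timestamps - EPOCH) // pd.Timedelta(seconds=1)


def assign_patch(match_df, patch_df):
    """Label each match with the latest patch released at or before its start."""
    patches = patch_df.assign(
        patch_epoch=to_epoch_seconds(pd.to_datetime(patch_df['patch_date'], utc=True))
    ).sort_values('patch_epoch')

    # Position of the last release date that is <= start_time (-1 if there is none)
    idx = np.searchsorted(
        patches['patch_epoch'].to_numpy(),
        match_df['start_time'].to_numpy(),
        side='right',
    ) - 1

    names = patches['name'].to_numpy()
    out = match_df.copy()
    out['patch'] = [names[i] if i >= 0 else None for i in idx]
    return out


patch_dates = pd.DataFrame({
    'patch_date': ['2012-12-15T00:00:00Z', '2013-05-30T00:00:00Z', '2013-11-24T00:00:00Z'],
    'name': ['6.77', '6.78', '6.79'],
})

# Toy matches: only the first start_time comes from the real match table
toy_matches = pd.DataFrame({
    'match_id': [0, 1, 2],
    'start_time': [1447526095, 1370000000, 1300000000],
})

labelled = assign_patch(toy_matches, patch_dates)
print(labelled)
```

Matches that start before the earliest listed patch are left without a label rather than being assigned to the first patch.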


#### Positions

In [24]:
dota2['positions']

Unnamed: 0,player_slot,side,position,roles
0,0,Radiant,Carry,"{Carry, Escape, Pushers}"
1,1,Radiant,Midlaner,"{Carry, Durable, Nuker, Pushers, Disabler}"
2,2,Radiant,Offlaner,"{Carry, Durable, Pushers, Disabler, Initiator}"
3,3,Radiant,Roamer,"{Support, Escape, Disabler, Initiator}"
4,4,Radiant,Hard Support,"{Support, Escape, Disabler, Initiator}"
5,128,Dire,Carry,"{Carry, Escape, Pushers}"
6,129,Dire,Midlaner,"{Carry, Durable, Nuker, Pushers, Disabler}"
7,130,Dire,Offlaner,"{Carry, Durable, Pushers, Disabler, Initiator}"
8,131,Dire,Roamer,"{Support, Escape, Disabler, Initiator}"
9,132,Dire,Hard Support,"{Support, Escape, Disabler, Initiator}"


---

### Matches

#### Players

We begin cleaning the match-related data with the players data frame. Some players chose to play anonymously, so their `account_id` is replaced with a value of $0$. Since this project depends on identifying each player and tracking their in-game actions, every match with one or more hidden `account_id` values needs to be removed.
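
One way this removal could look is sketched below on toy data. The function name is hypothetical, and the frames only carry the columns needed for the filter.

```python
import pandas as pd


def drop_matches_with_hidden_players(players, overview):
    """Remove every match in which at least one player has account_id == 0."""
    hidden_matches = players.loc[players['account_id'] == 0, 'match_id'].unique()
    kept_players = players[~players['match_id'].isin(hidden_matches)]
    kept_overview = overview[~overview['match_id'].isin(hidden_matches)]
    return kept_players, kept_overview


# Toy data: match 1 contains one anonymous player
toy_players = pd.DataFrame({
    'match_id': [0, 0, 1, 1],
    'account_id': [101, 102, 0, 103],
})
toy_overview = pd.DataFrame({'match_id': [0, 1], 'radiant_win': [True, False]})

kept_players, kept_overview = drop_matches_with_hidden_players(toy_players, toy_overview)
print(kept_players)
print('Matches kept:', len(kept_overview), 'of', len(toy_overview))
```

Because hidden accounts appear frequently in the samples below, it is worth printing how many matches survive before committing to this filter.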

In [9]:
matches['players'].sample(10)

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_abandon,gold_sell,gold_destroying_structure,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,gold_killing_couriers,unit_order_none,unit_order_move_to_position,unit_order_move_to_target,unit_order_attack_move,unit_order_attack_target,unit_order_cast_position,unit_order_cast_target,unit_order_cast_target_tree,unit_order_cast_no_target,unit_order_cast_toggle,unit_order_hold_position,unit_order_train_ability,unit_order_drop_item,unit_order_give_item,unit_order_pickup_item,unit_order_pickup_rune,unit_order_purchase_item,unit_order_sell_item,unit_order_disassemble_item,unit_order_move_item,unit_order_cast_toggle_auto,unit_order_stop,unit_order_taunt,unit_order_buyback,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue
324905,32490,91333,44,128,127,16430,446,519,12,11,11,0,129,1.63895,14157,0,326,151,135,143,164,212,63,20,0,13310.0,7613.0,596.0,108.0,100.0,-4429.0,,,750.0,901.0,7855.0,5020.0,570.0,,,3885.0,168.0,,748.0,5.0,247.0,6.0,9.0,63.0,,20.0,1.0,,2.0,4.0,31.0,2.0,,13.0,2.0,,,,,,,,,,,,,
435632,43563,72616,22,2,2564,30480,512,480,21,8,17,0,356,28.6776,34358,1762,220,108,96,1,110,121,48,25,0,17093.0,15151.0,,414.0,371.0,-4799.0,,,1777.0,760.0,15096.0,11798.0,,,,5605.0,266.0,,642.0,84.0,575.0,3.0,136.0,,2.0,25.0,5.0,,11.0,17.0,46.0,3.0,1.0,12.0,,330.0,,,,,,21.0,,,,,,
413843,41384,0,62,3,4649,15095,526,466,6,4,14,2,24,1.10122,8574,354,749,181,50,204,81,71,46,18,0,12575.0,4537.0,,111.0,,-776.0,,,50.0,4113.0,10586.0,899.0,,150.0,,5902.0,140.0,39.0,428.0,26.0,91.0,,149.0,,5.0,16.0,,,,,48.0,2.0,,1.0,,,,,,,,,,,,,,
149735,14973,0,83,128,1131,6740,255,214,0,8,7,2,9,,1499,1905,109,180,46,0,1,0,0,11,0,3523.0,3124.0,,261.0,166.0,-1612.0,,,,3040.0,1151.0,278.0,200.0,175.0,,2965.0,40.0,14.0,89.0,38.0,91.0,4.0,34.0,,,11.0,,,,11.0,32.0,,,14.0,,2.0,,,,,,21.0,,,,,,
186063,18606,74983,32,3,2162,14690,396,444,12,5,10,1,50,9.53315,14489,0,413,196,208,36,0,63,46,19,0,11886.0,6869.0,447.0,214.0,119.0,-2215.0,,,,3860.0,6374.0,1869.0,522.0,175.0,,4524.0,57.0,1.0,618.0,49.0,59.0,4.0,37.0,,,19.0,1.0,,8.0,14.0,23.0,,,3.0,,,,,1.0,,,,,,,,,
446658,44665,146427,84,131,681,5210,221,192,3,10,5,1,19,25.2318,6624,0,0,180,46,102,43,0,0,10,0,1824.0,3755.0,,,,-2210.0,,,,320.0,2435.0,764.0,,,,2883.0,63.0,1.0,89.0,10.0,79.0,8.0,18.0,,,10.0,,,,3.0,23.0,,,11.0,,,,,1.0,,,8.0,,,,,,
288547,28854,48827,32,130,5059,9785,347,402,7,12,11,3,67,,10753,404,894,36,0,181,196,81,63,19,0,10312.0,7979.0,596.0,583.0,291.0,-4218.0,,,137.0,3780.0,4512.0,2813.0,600.0,,,4725.0,82.0,6.0,368.0,58.0,67.0,8.0,14.0,,,19.0,,,5.0,11.0,36.0,2.0,,9.0,,,,,3.0,,,,,,,,,
126924,12692,0,71,4,988,4195,235,232,3,11,6,0,13,54.9538,3661,0,0,180,43,16,0,92,0,11,0,3345.0,3025.0,,207.0,105.0,-2299.0,-605.0,,,320.0,2846.0,397.0,,70.0,,2075.0,61.0,10.0,135.0,18.0,52.0,3.0,6.0,,83.0,11.0,,,3.0,7.0,23.0,,,7.0,,,,1.0,,,,,,,,,,
374919,37491,0,87,132,756,7075,353,440,3,5,6,1,8,,4239,0,38,102,34,254,46,42,214,14,0,7601.0,2962.0,,182.0,132.0,-1285.0,-607.0,,75.0,,5749.0,303.0,,,,1481.0,11.0,52.0,129.0,29.0,35.0,,20.0,,120.0,14.0,,,,3.0,27.0,1.0,,5.0,,,,1.0,,,,,,,,,,
266832,26683,99718,68,2,1734,11620,270,305,6,9,12,2,41,19.3542,8525,0,47,29,65,102,36,108,43,17,0,7081.0,8927.0,357.0,9.0,,-2841.0,,,,600.0,4327.0,3980.0,200.0,,,6574.0,54.0,25.0,219.0,48.0,65.0,3.0,35.0,,224.0,17.0,,,,1.0,33.0,,,11.0,,,,,,,,11.0,,,,,,


In [12]:
matches['players'][matches['players']['match_id'] == 49999]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_abandon,gold_sell,gold_destroying_structure,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,gold_killing_couriers,unit_order_none,unit_order_move_to_position,unit_order_move_to_target,unit_order_attack_move,unit_order_attack_target,unit_order_cast_position,unit_order_cast_target,unit_order_cast_target_tree,unit_order_cast_no_target,unit_order_cast_toggle,unit_order_hold_position,unit_order_train_ability,unit_order_drop_item,unit_order_give_item,unit_order_pickup_item,unit_order_pickup_rune,unit_order_purchase_item,unit_order_sell_item,unit_order_disassemble_item,unit_order_move_item,unit_order_cast_toggle_auto,unit_order_stop,unit_order_taunt,unit_order_buyback,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue
499990,49999,158359,94,0,251,16840,430,449,2,11,5,3,237,3.01025,3922,0,287,50,65,160,166,170,172,19,0,4136.0,16728.0,,26.0,,-4729.0,,,193.0,160.0,2999.0,12190.0,,,,2791.0,,166.0,521.0,8.0,62.0,39.0,177.0,34.0,1.0,19.0,1.0,,2.0,5.0,31.0,3.0,,5.0,,86.0,,,,,,,,,,,,
499991,49999,0,19,1,80,14050,407,611,11,14,4,1,148,54.3521,17104,0,0,170,41,152,29,108,24,23,0,16361.0,10868.0,,1208.0,746.0,-6166.0,-1345.0,,,160.0,8125.0,5259.0,,,,4532.0,94.0,31.0,334.0,53.0,45.0,,113.0,,49.0,23.0,1.0,,,14.0,25.0,,,13.0,,,,1.0,4.0,,,,,,,,,
499992,49999,0,68,2,473,17130,414,463,6,9,9,22,90,26.4467,12748,0,0,36,108,254,96,29,65,20,0,12721.0,8689.0,,126.0,220.0,-2811.0,-1071.0,,600.0,160.0,9249.0,4980.0,,,,4394.0,65.0,236.0,469.0,128.0,73.0,1.0,73.0,,66.0,20.0,,,,2.0,34.0,3.0,,3.0,,,,1.0,,,,3.0,,,,,,
499993,49999,0,35,3,51,10590,311,398,4,11,9,5,131,0.566851,15062,0,0,3,63,75,152,212,170,18,0,7463.0,11025.0,,18.0,160.0,-4189.0,-73.0,,424.0,160.0,4416.0,5102.0,,,,3948.0,94.0,106.0,781.0,60.0,26.0,3.0,106.0,1.0,124.0,18.0,1.0,2.0,1.0,1.0,26.0,3.0,,13.0,,,,1.0,,,,4.0,,,,,,
499994,49999,2737,21,4,15,19165,406,515,7,5,7,10,171,51.9928,9702,0,0,108,50,46,158,149,102,21,0,12615.0,11298.0,,42.0,,-2395.0,-1230.0,,2510.0,160.0,7785.0,6285.0,,,,22310.0,7.0,78.0,6233.0,82.0,113.0,4.0,210.0,,570.0,21.0,1.0,,,,35.0,3.0,,8.0,,,,1.0,1.0,,,8.0,54.0,,,,,
499995,49999,0,100,128,2718,17735,468,626,16,9,16,2,70,54.4912,22127,0,1227,249,41,0,50,141,168,23,0,21496.0,6025.0,596.0,1007.0,528.0,-4131.0,,,237.0,3860.0,9377.0,2940.0,400.0,,,4042.0,79.0,192.0,423.0,81.0,38.0,5.0,281.0,,,22.0,,,2.0,21.0,32.0,3.0,,9.0,21.0,169.0,,,,,,6.0,,,,,,
499996,49999,0,9,129,3755,20815,507,607,12,6,11,7,115,43.0999,12381,0,2269,135,63,166,30,36,139,23,0,16360.0,9653.0,1490.0,740.0,329.0,-2274.0,,,1587.0,4945.0,8292.0,4346.0,857.0,175.0,,4412.0,,89.0,625.0,75.0,,4.0,162.0,2.0,224.0,21.0,,,,9.0,42.0,7.0,,13.0,,,,,1.0,,,15.0,,,,,,
499997,49999,0,90,130,1059,16225,371,404,5,3,11,2,92,18.1353,7050,872,87,79,48,152,108,102,117,18,0,8205.0,10012.0,,600.0,303.0,-1287.0,,,,3860.0,4027.0,3833.0,400.0,175.0,,4824.0,146.0,53.0,266.0,135.0,153.0,7.0,49.0,,1.0,18.0,,,1.0,22.0,39.0,,,5.0,,108.0,,,,,,2.0,,,,,,
499998,49999,0,73,131,3165,31015,780,703,8,6,17,6,306,64.3631,16474,0,2851,249,154,112,48,114,137,25,0,11773.0,20005.0,596.0,327.0,8302.0,-3084.0,-1681.0,,1107.0,4668.0,5152.0,12927.0,400.0,175.0,,4111.0,67.0,172.0,785.0,75.0,37.0,6.0,349.0,2.0,,25.0,2.0,,2.0,15.0,38.0,3.0,,7.0,,75.0,,1.0,,,,4.0,,,,,,
499999,49999,158360,53,132,2972,21195,520,533,6,6,17,0,294,,9822,0,5747,1,158,48,51,40,141,21,0,8210.0,15571.0,894.0,103.0,,-2874.0,,,1207.0,4628.0,3442.0,10888.0,400.0,175.0,,1209.0,1.0,,448.0,131.0,37.0,,27.0,,6.0,21.0,1.0,,1.0,1.0,28.0,4.0,,7.0,,,,,1.0,,,,,,,,,


#### Overview

In [6]:
# Taking a look at the features and values
matches['overview'].sample(10)

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster
19227,19227,1447526095,2455,4,1972,63,3,8,22,False,0,0,123
20653,20653,1447536642,2174,1847,0,0,63,81,22,True,0,0,133
2420,2420,1447342078,681,2047,2046,63,63,143,22,True,0,0,133
36256,36256,1447686325,2606,1828,4,3,63,168,22,True,0,2,224
7486,7486,1447391082,3068,1844,0,0,63,23,22,True,0,0,112
20543,20543,1447535944,2902,256,1956,63,48,283,22,False,0,0,138
30439,30439,1447619253,1982,1844,0,16,63,164,22,True,0,0,121
1679,1679,1447334506,1672,2038,260,51,63,5,22,True,0,0,138
7708,7708,1447394040,3020,1974,0,0,63,291,22,True,0,0,156
49354,49354,1447820786,2266,0,1974,63,0,84,22,False,0,0,123


In [7]:
matches['overview'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   match_id                 50000 non-null  int64
 1   start_time               50000 non-null  int64
 2   duration                 50000 non-null  int64
 3   tower_status_radiant     50000 non-null  int64
 4   tower_status_dire        50000 non-null  int64
 5   barracks_status_dire     50000 non-null  int64
 6   barracks_status_radiant  50000 non-null  int64
 7   first_blood_time         50000 non-null  int64
 8   game_mode                50000 non-null  int64
 9   radiant_win              50000 non-null  bool 
 10  negative_votes           50000 non-null  int64
 11  positive_votes           50000 non-null  int64
 12  cluster                  50000 non-null  int64
dtypes: bool(1), int64(12)
memory usage: 4.6 MB


In [8]:
matches['overview'].describe()

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,negative_votes,positive_votes,cluster
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,24999.5,1447573000.0,2476.4535,1000.01644,935.25006,34.52946,34.77526,93.82552,21.468,0.01548,0.03682,142.30472
std,14433.901067,148527.0,634.631261,948.211846,937.974714,29.209672,29.73214,92.648332,3.218258,0.364696,0.871068,25.156608
min,0.0,1446750000.0,59.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,111.0
25%,12499.75,1447456000.0,2029.0,0.0,0.0,0.0,0.0,9.0,22.0,0.0,0.0,123.0
50%,24999.5,1447577000.0,2415.0,1536.0,384.0,51.0,51.0,77.0,22.0,0.0,0.0,133.0
75%,37499.25,1447700000.0,2872.0,1974.0,1972.0,63.0,63.0,144.0,22.0,0.0,0.0,154.0
max,49999.0,1447829000.0,16037.0,2047.0,2047.0,63.0,63.0,831.0,22.0,47.0,80.0,242.0


#### Time

In [9]:
matches['time'][matches['time']['match_id'] == 49999]

Unnamed: 0,match_id,times,gold_t_0,lh_t_0,xp_t_0,gold_t_1,lh_t_1,xp_t_1,gold_t_2,lh_t_2,xp_t_2,gold_t_3,lh_t_3,xp_t_3,gold_t_4,lh_t_4,xp_t_4,gold_t_128,lh_t_128,xp_t_128,gold_t_129,lh_t_129,xp_t_129,gold_t_130,lh_t_130,xp_t_130,gold_t_131,lh_t_131,xp_t_131,gold_t_132,lh_t_132,xp_t_132
2209729,49999,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2209730,49999,60,99,0,0,242,1,265,285,2,162,141,1,175,139,1,93,226,3,196,121,1,41,173,2,206,249,3,206,181,2,82
2209731,49999,120,326,2,238,470,4,429,460,4,450,531,8,691,402,5,423,494,7,499,297,3,261,392,5,474,847,12,784,461,6,345
2209732,49999,180,556,4,455,750,7,582,597,5,770,836,13,1145,714,10,856,755,11,808,441,4,456,577,7,876,1280,18,1176,631,8,448
2209733,49999,240,824,7,755,887,8,746,820,8,1058,1019,15,1625,938,13,1021,1100,16,1150,639,8,740,677,7,1133,1699,23,1635,872,15,665
2209734,49999,300,924,7,755,1190,13,951,1009,10,1346,1203,17,2079,1253,18,1563,1532,24,1490,739,8,894,823,8,1236,2084,28,2089,1185,19,975
2209735,49999,360,1193,10,1034,1290,13,1126,1190,12,1609,1422,20,2471,1565,23,1893,1882,30,1964,965,11,1177,1877,12,2110,2489,33,2481,1511,24,1357
2209736,49999,420,1611,13,1427,1431,14,1250,1290,12,1908,1522,20,2502,1873,28,2285,2104,31,2302,1160,11,1330,1977,12,2234,3186,34,3096,1706,28,1511
2209737,49999,480,2059,18,2058,1609,16,1745,1390,12,1908,2325,24,3633,2224,34,2806,2528,38,2782,1554,11,1495,2077,12,2399,3447,35,3349,1999,34,1892
2209738,49999,540,2382,21,2518,1817,17,2124,1490,12,1908,2510,26,3787,2876,36,3330,2668,39,2813,1899,17,2021,2545,13,2773,3966,42,3854,2310,42,2228
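
Each column suffix in this table is a `player_slot` value, so the wide layout can be reshaped into one row per match, time, and player. The sketch below shows one possible reshape on a toy frame holding the first two rows above for slots `0` and `128` only; the variable names are illustrative.

```python
import pandas as pd

# First two rows of the output above, restricted to slots 0 and 128
wide = pd.DataFrame({
    'match_id': [49999, 49999],
    'times': [0, 60],
    'gold_t_0': [0, 99], 'lh_t_0': [0, 0], 'xp_t_0': [0, 0],
    'gold_t_128': [0, 226], 'lh_t_128': [0, 3], 'xp_t_128': [0, 196],
})

# Stack every stat column into (variable, value) pairs
long = wide.melt(id_vars=['match_id', 'times'], var_name='variable', value_name='value')

# Split names such as 'gold_t_128' into the stat and the player slot
parts = long['variable'].str.extract(r'^(?P<stat>gold|lh|xp)_t_(?P<player_slot>\d+)$')
long = long.assign(stat=parts['stat'], player_slot=parts['player_slot'].astype(int))

# One row per (match, time, player) with gold, lh and xp as columns
tidy = (
    long.pivot(index=['match_id', 'times', 'player_slot'], columns='stat', values='value')
        .reset_index()
        .rename_axis(None, axis=1)
)
print(tidy)
```

In this shape the table can be joined to the players data on `match_id` and `player_slot`.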


In [None]:
dfs['ability_upgrades'].sample(5)

In [None]:
dfs['chat'].sample(5)

---

### Teamfights

#### Overview

In [14]:
tfs['overview'].head(20)

Unnamed: 0,match_id,start,end,last_death,deaths
0,0,220,252,237,3
1,0,429,475,460,3
2,0,900,936,921,3
3,0,1284,1328,1313,3
4,0,1614,1666,1651,5
5,0,1672,1709,1694,3
6,0,1734,1783,1768,6
7,0,1818,1867,1852,5
8,0,1863,1912,1897,5
9,0,2101,2145,2130,4


Since several teamfights occur during a single match, it is useful to record the order in which they happened so that per-fight statistics can be derived later.

In [15]:
# Inserting the teamfight order from each match
tfs['overview'].insert(1, 'tf_order', tfs['overview'].groupby('match_id').cumcount().add(1))
tfs['overview'].head(20)

Unnamed: 0,match_id,tf_order,start,end,last_death,deaths
0,0,1,220,252,237,3
1,0,2,429,475,460,3
2,0,3,900,936,921,3
3,0,4,1284,1328,1313,3
4,0,5,1614,1666,1651,5
5,0,6,1672,1709,1694,3
6,0,7,1734,1783,1768,6
7,0,8,1818,1867,1852,5
8,0,9,1863,1912,1897,5
9,0,10,2101,2145,2130,4


#### Breakdown

According to the notes in the dataset overview, each observation in the `tfs['overview']` dataframe corresponds to 10 rows *(one per player)* in the breakdown. If the two tables end up being merged, a teamfight id will come in handy.

In [19]:
# First we need to know the total teamfights we have
print('Total teamfights', tfs['overview'].shape[0])
print('Total observations in breakdown', tfs['breakdown'].shape[0])
print('Teamfights have 10x observations:', tfs['overview'].shape[0] == (tfs['breakdown'].shape[0]/10))

Total teamfights 539047
Total observations in breakdown 5390470
Teamfights have 10x observations: True


In [17]:
tfs['breakdown'].head(25)

Unnamed: 0,match_id,player_slot,buybacks,damage,deaths,gold_delta,xp_end,xp_start
0,0,0,0,105,0,173,536,314
1,0,1,0,566,1,0,1583,1418
2,0,2,0,0,0,0,391,391
3,0,3,0,0,0,123,1775,1419
4,0,4,0,444,0,336,1267,983
5,0,128,0,477,1,249,1318,1035
6,0,129,0,636,1,-27,1048,904
7,0,130,0,0,0,190,1904,1589
8,0,131,0,0,0,0,210,210
9,0,132,0,0,0,378,659,589


In [20]:
# Including the tf_id column in the detailed dataset
tfs['breakdown'].insert(1, 'tf_id', tfs['breakdown'].index // 10)
tfs['breakdown'].head(50)

Unnamed: 0,match_id,tf_id,player_slot,buybacks,damage,deaths,gold_delta,xp_end,xp_start
0,0,0,0,0,105,0,173,536,314
1,0,0,1,0,566,1,0,1583,1418
2,0,0,2,0,0,0,0,391,391
3,0,0,3,0,0,0,123,1775,1419
4,0,0,4,0,444,0,336,1267,983
5,0,0,128,0,477,1,249,1318,1035
6,0,0,129,0,636,1,-27,1048,904
7,0,0,130,0,0,0,190,1904,1589
8,0,0,131,0,0,0,0,210,210
9,0,0,132,0,0,0,378,659,589
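
Deriving the id from the row position relies on the rows arriving in complete, ordered blocks of ten. A hedged way to guard that assumption is sketched below; the function name and the toy `damage` values are hypothetical.

```python
import pandas as pd


def add_tf_id(breakdown, players_per_fight=10):
    """Add a positional teamfight id and check that every block is complete."""
    out = breakdown.reset_index(drop=True).copy()
    out['tf_id'] = out.index // players_per_fight

    # Each block should hold 10 rows, 10 distinct slots and a single match
    per_fight = out.groupby('tf_id').agg(
        n_rows=('player_slot', 'size'),
        n_slots=('player_slot', 'nunique'),
        n_matches=('match_id', 'nunique'),
    )
    complete = (
        (per_fight['n_rows'] == players_per_fight)
        & (per_fight['n_slots'] == players_per_fight)
        & (per_fight['n_matches'] == 1)
    )
    if not complete.all():
        raise ValueError('Rows are not grouped in complete per-teamfight blocks')
    return out


# Toy breakdown with two teamfights from the same match
slots = [0, 1, 2, 3, 4, 128, 129, 130, 131, 132]
fights = pd.DataFrame({
    'match_id': [0] * 20,
    'player_slot': slots * 2,
    'damage': range(20),
})

with_ids = add_tf_id(fights)
print(with_ids[['player_slot', 'tf_id']].tail(3))
```

If the overview rows follow the same order, giving them their own row position as `tf_id` would provide a shared key for merging the two tables.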


---

### Reference

In [9]:
ref['ratings'].sample(5)

Unnamed: 0,account_id,total_wins,total_matches,trueskill_mu,trueskill_sigma
682674,-191735769,4,6,26.876154,7.319408
143704,29808,5,13,19.115807,5.934214
752777,-221663725,1,1,27.134355,8.060786
593190,-162750222,3,9,20.000501,6.613655
593052,117142,1,4,23.370685,7.356318


In [10]:
# Compute a conservative skill estimate for each player with the formula mu - 3*sigma
# This penalizes players with fewer matches, since their ratings carry more uncertainty

ref['ratings']['conservative_skill_estimate'] = ref['ratings']['trueskill_mu'] - 3*ref['ratings']['trueskill_sigma']

In [11]:
ref['ratings'].sample(5)

Unnamed: 0,account_id,total_wins,total_matches,trueskill_mu,trueskill_sigma,conservative_skill_estimate
735538,-211875317,1,2,25.796672,7.875411,2.17044
109077,-55631835,0,1,22.48685,8.044155,-1.645616
228037,-91407748,1,1,26.988738,8.080127,2.748355
389845,-117865043,5,7,29.803824,6.765374,9.507703
746338,-217777610,1,2,25.719225,7.866682,2.119179


---

### Tests

In [None]:
dfs['test_labels']

In [None]:
dfs['test_player']

---

## Chat Sentiment Analysis

In [None]:
dfs['cluster_regions'].groupby('region').count()

In [None]:
regions = {'US WEST', 'US EAST', 'EUROPE', 'AUSTRALIA'}

# Clusters located in the selected regions
# (after the wrangling step, 'cluster' is the index of dota2['regions'])
region_mask = dota2['regions']['region'].isin(regions)
clusters = set(dota2['regions'].loc[region_mask].index)

# Matches played on those clusters
match_mask = dfs['match']['cluster'].isin(clusters)
match_ids = set(dfs['match'].loc[match_mask, 'match_id'])

# Players from those matches, skipping everyone who plays anonymously (account_id == 0)
player_mask = dfs['players']['match_id'].isin(match_ids) & (dfs['players']['account_id'] != 0)
player_ids = set(dfs['players'].loc[player_mask, 'account_id'])

print('Total matches:', len(match_ids))
print('Total players:', len(player_ids))

In [None]:
dfs['chat'][dfs['chat']['match_id'].isin(match_ids)]

In [None]:
dfs['player_ratings'][dfs['player_ratings']['account_id'].isin(player_ids)]

In [None]:
dfs['test_labels']['match_id']