# Dota 2 Matches: EDA

Created by: **Juan Pablo Nieto**

---

## Dataset Overview

*From [**Devin Anzelmo**](https://www.kaggle.com/datasets/devinanzelmo/dota-2-matches/data):*

This dataset contains 50,000 ranked ladder matches from the Dota 2 data dump created by [Opendota](https://www.opendota.com/). It was inspired by the [Dota 2 Matches](https://www.kaggle.com/jraramirez/dota-2-matches-dataset) data published here by **Joe Ramir**. This is an updated and improved version of that dataset.

Dota 2 is a popular MOBA available as free to play, and can take up thousands of hours of your life. The number of games in this dataset are played about every hour. If you like the data there are an additional 2-3 million matches easily available for download.

This dataset aims to enable the exploration of player behaviour, skill estimation, or anything else you find interesting. The intent is to create an accessible and easy-to-use resource that can be expanded and modified if needed. As such, I am open to a wide variety of suggestions as to what additions or changes to make.

> [**Quick look at how the dataset is structured**](https://www.kaggle.com/code/devinanzelmo/a-quick-look-at-dota-2-dataset)

|   CSV File             |  Description  | Notes |
|:-----------------------|:--------------|:------|
|  **Match Info**        |  |  |
| match                  | Top-level information about each match | `tower_status` and `barracks_status` are binary masks indicating whether various structures have been destroyed |
| players                | Statistics about each player's individual performance in each match | Players who chose to hide their `account_id` are marked as `0` |
| player_time            | Contains XP, gold, and last-hit totals for each player at one-minute intervals | The suffix for each variable indicates the value of the `player_slot` variable |
| teamfights             | Basic information about each team fight | `start`, `end`, and `last_death` contain the time for those events in seconds |
| teamfights_players     | Detailed info about each team fight | Each row in `teamfights.csv` corresponds to ten rows in this file |
| chat                   | Chat log for all matches | These include the player's name in game |
| objectives             | Gives information on all the objectives completed, by which player and at what time |  |
| ability_upgrades       | Contains the upgrade performed at each level for each player |  |
| purchase_log           | Contains the time in seconds for each purchase made by every player in every match |  |
| **Game Info**          |  |  |
| ability_ids            | Ability names and ids | Use with `ability_upgrades.csv` to get the names of upgraded abilities |
| item_ids               | Contains `item_id` and item name | Use with `purchase_log.csv` to get the names of purchased items |
| hero_ids               | Contains the `name`, `hero_id`, and `localized_name` for each hero a player can pick | This file was combined with the one found [here](https://www.kaggle.com/datasets/nihalbarua/dota2-hero-preference-by-mmr) to add the `Primary Attribute` and possible `Roles` |
| cluster_regions        | Contains the cluster number and geographic region | Lets you filter matches by region |
| patch_dates            | Release dates for various patches | Use `start_time` from `match.csv` to determine which patch each match was played on |
| **Historical Info**    |  |  |
| MMR                    | Contains `account_id` and players' **Matchmaking Rating** *(**MMR** for short)* | File extracted from the [**OpenDota Core Wiki**](https://github.com/odota/core/wiki/MMR-Data), the source on which the original dataset is based |
| player_ratings         | Skill data computed on **900k** previous matches and a possible way to measure skill rating when **MMR** is not available | `trueskill` ratings have two components: `mu`, which can be interpreted as the skill (higher is better), and `sigma`, which is the uncertainty of the rating. Negative `account_id` values belong to players who do not appear in the other data available in this dataset |
| match_outcomes         | Results with `account_id` for **900k** matches occurring prior to the rest of the dataset | Each match has data on two rows. The `rad` feature indicates whether the team is Radiant or Dire. *Useful for creating custom skill calculations* |
| **Tests**              |  |  |
| test_labels            | `match_id` and `radiant_win` as integer 1 or 0 |  |
| test_player            | Full player and match table with `hero_id`, `player_slot`, `match_id`, and `account_id`|  |
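
The `tower_status` and `barracks_status` masks described in the table can be unpacked with plain bit operations. The sketch below is only illustrative: it assumes that a set bit means the structure is still standing (my reading of the OpenDota convention, not something stated in the dataset notes), and the helper name `standing_structures` is hypothetical. The sample values are ones that occur in `match.csv`, where the masks top out at 2047 for towers (11 bits) and 63 for barracks (6 bits).

```python
def standing_structures(mask: int) -> int:
    """Count the set bits in a status mask (assumption: 1 = still standing)."""
    return bin(mask).count("1")


full_towers = standing_structures(2047)     # every tower bit set
full_barracks = standing_structures(63)     # every barracks bit set
partial_towers = standing_structures(1972)  # some tower bits cleared
```

The same helper can be applied to a whole column with `Series.apply` once the match table is loaded.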

---

## Initial Setup

In [1]:
# Basic Data Science Libraries
import pandas as pd
import numpy as np
import os

# Plotting Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# Statistics Libraries
from statsmodels.api import tsa
import statsmodels.api as sm

# Removing the max columns limiter
pd.set_option('display.max_columns', None)

In [2]:
dfs = {}

# Iterate over files in the directory
for file in os.listdir('Dataset/'):

    # Skip hidden files and anything that is not a CSV
    if file.startswith('.') or not file.endswith('.csv'):
        continue

    # Construct full file path
    file_path = os.path.join('Dataset/', file)

    # Use the file name without its extension as the dictionary key
    key = os.path.splitext(file)[0]

    # Read each file into the dfs dictionary
    dfs[key] = pd.read_csv(file_path)

print('Total files imported:', len(dfs))
print('DataFrame Shapes:')
for df in dfs:
    print('\n', f'{df}:', '{:,} observations, {:,} features'.format(dfs[df].shape[0], dfs[df].shape[1]))

Total files imported: 19
DataFrame Shapes:

 player_time: 2,209,778 observations, 32 features

 test_player: 1,000,000 observations, 4 features

 teamfights_players: 5,390,470 observations, 8 features

 item_ids: 189 observations, 2 features

 test_labels: 100,000 observations, 2 features

 chat: 1,439,488 observations, 5 features

 ability_upgrades: 8,939,599 observations, 5 features

 purchase_log: 18,193,745 observations, 4 features

 match: 50,000 observations, 13 features

 cluster_regions: 53 observations, 2 features

 players: 500,000 observations, 73 features

 MMR: 1,069,672 observations, 2 features

 ability_ids: 688 observations, 2 features

 match_outcomes: 1,828,588 observations, 10 features

 player_ratings: 834,226 observations, 5 features

 teamfights: 539,047 observations, 5 features

 hero_ids: 112 observations, 5 features

 objectives: 1,173,396 observations, 9 features

 patch_dates: 19 observations, 2 features


---

## Data Wrangling

Despite going through the [**quick look into the dataset structure**](https://www.kaggle.com/code/devinanzelmo/a-quick-look-at-dota-2-dataset) posted above, I still have to make sure that the data is clean for modelling.

I'll leave the original DataFrames intact under the `dfs` dictionary as a backup while organizing the cleaned versions in new dictionaries.

In [3]:
# Grouping the game metadata in one dictionary
dota2 = {
    'abilities': dfs['ability_ids'].copy(),
    'items': dfs['item_ids'].copy(),
    'heroes': dfs['hero_ids'].copy(),
    'regions': dfs['cluster_regions'].copy(),
    'patches': dfs['patch_dates'].copy(),
    'positions': pd.DataFrame({
        'player_slot': [0,1,2,3,4,
                        128,129,130,131,132], 
        'side': ['Radiant','Radiant','Radiant','Radiant','Radiant',
                 'Dire','Dire','Dire','Dire','Dire'],
        'position': ['Carry', 'Midlaner', 'Offlaner', 'Roamer', 'Hard Support',
                    'Carry', 'Midlaner', 'Offlaner', 'Roamer', 'Hard Support'],
        'roles': [{'Carry','Escape','Pushers'}, 
                  {'Carry','Durable','Pushers','Disabler','Nuker'}, 
                  {'Carry','Durable','Initiator','Pushers','Disabler'}, 
                  {'Support','Escape','Initiator','Disabler'}, 
                  {'Support','Escape','Initiator','Disabler'},
                  {'Carry','Escape','Pushers'}, 
                  {'Carry','Durable','Pushers','Disabler','Nuker'}, 
                  {'Carry','Durable','Initiator','Pushers','Disabler'}, 
                  {'Support','Escape','Initiator','Disabler'}, 
                  {'Support','Escape','Initiator','Disabler'}],
    }).set_index('player_slot') # Included the 'positions' dataframe to have a relationship between player_slot and other in-game data
}

# Grouping the data containing info regarding the 50k matches
matches = {
    'overview': dfs['match'].copy(),
    'players': dfs['players'].copy(),
    'time': dfs['player_time'].copy(),
    'objectives': dfs['objectives'].copy(),
    'upgrades': dfs['ability_upgrades'].copy(),
    'purchases': dfs['purchase_log'].copy(),
    'chatlog': dfs['chat'].copy()
}

# Grouping teamfight datasets
tfs = {
    'overview': dfs['teamfights'].copy(),
    'breakdown': dfs['teamfights_players'].copy()
}

# Grouping historical and referential information
ref = {
    'prev_outcomes': dfs['match_outcomes'].copy(),
    'ratings': dfs['player_ratings'].copy(),
    'mmr': dfs['MMR'].copy()
}

# Grouping the test datasets together
test = {
    'matches': dfs['test_labels'].copy(),
    'players': dfs['test_player'].copy()
}

---

### Dota2 Dictionary

#### Abilities

In [4]:
dota2['abilities'].head()

Unnamed: 0,ability_id,ability_name
0,0,ability_base
1,5001,default_attack
2,5002,attribute_bonus
3,5003,antimage_mana_break
4,5004,antimage_blink


Since this dataframe contains only the `ability_id` and the ability name, I'll convert it to a dictionary so it is easier to map onto other dataframes if needed.

In [5]:
# Converting the dataframe to a dictionary
dota2['abilities'] = dota2['abilities'].set_index('ability_id').to_dict()['ability_name']
dota2['abilities'][5232] # Checking a random ability

'dragon_knight_frost_breath'
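
As a rough illustration of why the dictionary form is convenient, the sketch below maps ability ids to names with `Series.map`. The `upgrades` frame is a toy stand-in for `matches['upgrades']`, and the id `9999` is made up to show what happens when an id has no entry.

```python
import pandas as pd

# Subset of the id-to-name dictionary built above
abilities = {5003: 'antimage_mana_break', 5004: 'antimage_blink'}

# Toy upgrades table; 9999 is a made-up id to show the unmatched case
upgrades = pd.DataFrame({'ability': [5003, 5004, 9999], 'level': [1, 2, 3]})

# Series.map looks each id up in the dictionary; ids without an entry become NaN
upgrades['ability_name'] = upgrades['ability'].map(abilities)
```

The same pattern should work for `dota2['items']` and the `item_id` column of the purchase log.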

#### Items

In [6]:
dota2['items'].head()

Unnamed: 0,item_id,item_name
0,1,blink
1,2,blades_of_attack
2,3,broadsword
3,4,chainmail
4,5,claymore


Same as the abilities, I'll convert this into a dictionary that can be mapped to other dataframes.

In [7]:
# Converting the dataframe to a dictionary
dota2['items'] = dota2['items'].set_index('item_id').to_dict()['item_name']
dota2['items'][32] # Checking a random item

'talisman_of_evasion'

#### Heroes

In [8]:
dota2['heroes'].sample(5)

Unnamed: 0,name,hero_id,localized_name,Primary Attribute,Roles
9,npc_dota_hero_morphling,10,Morphling,agi,"Carry, Escape, Durable, Nuker, Disabler"
24,npc_dota_hero_lion,26,Lion,int,"Support, Disabler, Nuker, Initiator"
16,npc_dota_hero_storm_spirit,17,Storm Spirit,int,"Carry, Escape, Nuker, Initiator, Disabler"
34,npc_dota_hero_necrolyte,36,Necrophos,int,"Carry, Nuker, Durable, Disabler"
61,npc_dota_hero_weaver,63,Weaver,agi,"Carry, Escape"


The `Roles` column holds several comma-separated values that would be easier to work with as lists. The `name` and `localized_name` columns also carry essentially the same information, so I'll drop `name` and use `hero_id` as the index.

In [9]:
# Setting the hero_id as the index
dota2['heroes'].set_index('hero_id', inplace=True)

# Dropping the name feature since it is redundant
dota2['heroes'].drop(columns='name', inplace=True)

# Renaming the localized_name to name and formatting Primary Attribute and Roles to follow the same as the rest
dota2['heroes'].rename(columns={'localized_name': 'name', 'Primary Attribute': 'primary_attribute', 'Roles': 'roles'},
                      inplace=True)

# Changing the Role values to list type
dota2['heroes']['roles'] = dota2['heroes']['roles'].apply(lambda x: x.split(', ') if isinstance(x, str) else [])

dota2['heroes'].head()

Unnamed: 0_level_0,name,primary_attribute,roles
hero_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Anti-Mage,agi,"[Carry, Escape, Nuker]"
2,Axe,str,"[Initiator, Durable, Disabler, Carry]"
3,Bane,all,"[Support, Disabler, Nuker, Durable]"
4,Bloodseeker,agi,"[Carry, Disabler, Nuker, Initiator]"
5,Crystal Maiden,int,"[Support, Disabler, Nuker]"
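
With the roles stored as lists, questions such as "how many heroes can fill each role?" become short one-liners. The sketch below rebuilds the five rows shown above as a standalone frame, so it is an illustration of the idea rather than a step in the cleaning pipeline.

```python
import pandas as pd

# The five heroes displayed above, with roles already split into lists
heroes = pd.DataFrame(
    {
        'name': ['Anti-Mage', 'Axe', 'Bane', 'Bloodseeker', 'Crystal Maiden'],
        'roles': [
            ['Carry', 'Escape', 'Nuker'],
            ['Initiator', 'Durable', 'Disabler', 'Carry'],
            ['Support', 'Disabler', 'Nuker', 'Durable'],
            ['Carry', 'Disabler', 'Nuker', 'Initiator'],
            ['Support', 'Disabler', 'Nuker'],
        ],
    },
    index=pd.Index([1, 2, 3, 4, 5], name='hero_id'),
)

# explode() yields one row per (hero, role) pair, which makes counting simple
role_counts = heroes['roles'].explode().value_counts()

# Membership tests work directly on the list values
supports = heroes[heroes['roles'].apply(lambda roles: 'Support' in roles)]
```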


#### Regions

In [10]:
dota2['regions']

Unnamed: 0,cluster,region
0,111,US WEST
1,112,US WEST
2,113,US WEST
3,121,US EAST
4,122,US EAST
5,123,US EAST
6,124,US EAST
7,131,EUROPE
8,132,EUROPE
9,133,EUROPE


Looking at the structure, it might be simpler to have a dictionary with the regions as keys, storing the cluster values for each region as a set.

In [11]:
# Converting the dataframe into a dictionary with clusters as sets
dota2['regions'] = dota2['regions'].groupby('region')['cluster'].apply(set).to_dict()
dota2['regions']['EUROPE'] # Checking the values for Europe clusters

{131, 132, 133, 134, 135, 136, 137, 138}
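
One way this structure could be used is to filter matches by region with `Series.isin`, or to invert it into a cluster-to-region lookup. The sketch below uses a few `match_id`/`cluster` pairs taken from the match table; the variable names `overview` and `cluster_to_region` are illustrative.

```python
import pandas as pd

# Region dictionary in the shape produced above (three regions only)
regions = {
    'US WEST': {111, 112, 113},
    'US EAST': {121, 122, 123, 124},
    'EUROPE': {131, 132, 133, 134, 135, 136, 137, 138},
}

# A handful of matches with their cluster numbers
overview = pd.DataFrame({
    'match_id': [19227, 20653, 2420, 7486, 20543],
    'cluster': [123, 133, 133, 112, 138],
})

# Keep only the matches played on European clusters
europe_matches = overview[overview['cluster'].isin(regions['EUROPE'])]

# Inverting the dictionary gives a cluster -> region lookup for Series.map
cluster_to_region = {c: region for region, clusters in regions.items() for c in clusters}
overview['region'] = overview['cluster'].map(cluster_to_region)
```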

#### Patches

In [12]:
dota2['patches']

Unnamed: 0,patch_date,name
0,2010-12-24T00:00:00Z,6.7
1,2011-01-21T00:00:00Z,6.71
2,2011-04-27T00:00:00Z,6.72
3,2011-12-24T00:00:00Z,6.73
4,2012-03-10T00:00:00Z,6.74
5,2012-09-30T00:00:00Z,6.75
6,2012-10-21T00:00:00Z,6.76
7,2012-12-15T00:00:00Z,6.77
8,2013-05-30T00:00:00Z,6.78
9,2013-11-24T00:00:00Z,6.79


In [13]:
# Checking the datatypes for patches
dota2['patches'].dtypes

patch_date     object
name          float64
dtype: object

Apart from changing the `patch_date` type, I'll leave this one as a dataframe, since a plain list or dictionary would not give us a datetime-typed column to work with.

In [14]:
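# Converting patch_date to datetime and storing it back in the dataframe
dota2['patches']['patch_date'] = pd.to_datetime(dota2['patches']['patch_date'])
dota2['patches']['patch_date']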

0    2010-12-24 00:00:00+00:00
1    2011-01-21 00:00:00+00:00
2    2011-04-27 00:00:00+00:00
3    2011-12-24 00:00:00+00:00
4    2012-03-10 00:00:00+00:00
5    2012-09-30 00:00:00+00:00
6    2012-10-21 00:00:00+00:00
7    2012-12-15 00:00:00+00:00
8    2013-05-30 00:00:00+00:00
9    2013-11-24 00:00:00+00:00
10   2014-01-27 00:00:00+00:00
11   2014-04-29 00:00:00+00:00
12   2014-09-24 00:00:00+00:00
13   2014-12-17 00:00:00+00:00
14   2015-04-30 21:00:00+00:00
15   2015-09-24 20:00:00+00:00
16   2015-12-16 20:00:00+00:00
17   2016-04-26 01:00:00+00:00
18   2016-06-12 08:00:00+00:00
Name: patch_date, dtype: datetime64[ns, UTC]
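
The dataset notes suggest using `start_time` to find the patch a match was played on. A minimal sketch of that lookup follows, assuming `start_time` is in Unix seconds and the patch dates are sorted ascending. It uses three patch rows from the table above and hypothetical start times, and relies on `numpy.searchsorted` rather than anything specific to this notebook.

```python
import numpy as np
import pandas as pd

# Three rows copied from the patches table
patches = pd.DataFrame({
    'patch_date': pd.to_datetime([
        '2012-12-15T00:00:00Z', '2013-05-30T00:00:00Z', '2013-11-24T00:00:00Z',
    ]),
    'name': [6.77, 6.78, 6.79],
})

# Hypothetical match start times as Unix seconds (same format as `start_time`)
start_time = pd.Series([
    int(pd.Timestamp('2012-12-20', tz='UTC').timestamp()),
    int(pd.Timestamp('2013-06-01', tz='UTC').timestamp()),
    int(pd.Timestamp('2014-01-01', tz='UTC').timestamp()),
])

# Express the sorted patch dates as Unix seconds so both sides are plain integers
epoch = pd.Timestamp('1970-01-01', tz='UTC')
patch_seconds = ((patches['patch_date'] - epoch) // pd.Timedelta(seconds=1)).to_numpy()

# For each match, find the last patch released at or before its start time
idx = np.searchsorted(patch_seconds, start_time.to_numpy(), side='right') - 1
match_patch = patches['name'].to_numpy()[idx]
```

A start time earlier than the first patch would produce an index of `-1`, so that case would need separate handling.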

#### Positions

In [15]:
dota2['positions']

Unnamed: 0_level_0,side,position,roles
player_slot,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,Radiant,Carry,"{Pushers, Carry, Escape}"
1,Radiant,Midlaner,"{Durable, Pushers, Nuker, Carry, Disabler}"
2,Radiant,Offlaner,"{Durable, Initiator, Pushers, Carry, Disabler}"
3,Radiant,Roamer,"{Support, Initiator, Disabler, Escape}"
4,Radiant,Hard Support,"{Support, Initiator, Disabler, Escape}"
128,Dire,Carry,"{Pushers, Carry, Escape}"
129,Dire,Midlaner,"{Durable, Pushers, Nuker, Carry, Disabler}"
130,Dire,Offlaner,"{Durable, Initiator, Pushers, Carry, Disabler}"
131,Dire,Roamer,"{Support, Initiator, Disabler, Escape}"
132,Dire,Hard Support,"{Support, Initiator, Disabler, Escape}"
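
Because `positions` is indexed by `player_slot`, it can be attached to any table that carries that column. The sketch below shows one possible way with `DataFrame.join`; the `players` frame is a toy example and only the `side` column is joined.

```python
import pandas as pd

# Reduced version of the positions table, indexed by player_slot
positions = pd.DataFrame({
    'player_slot': [0, 1, 2, 3, 4, 128, 129, 130, 131, 132],
    'side': ['Radiant'] * 5 + ['Dire'] * 5,
}).set_index('player_slot')

# Toy players table with slots from both teams
players = pd.DataFrame({'match_id': [0, 0, 0], 'player_slot': [0, 4, 129]})

# Look up each row's player_slot in the index of positions
players = players.join(positions['side'], on='player_slot')
```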


---

### Matches

#### Players

We begin cleaning the match-related data with the players dataframe. Some players chose to play anonymously, so their `account_id` is replaced with `0`. Since this project depends on identifying each player and following their in-game actions, the ideal approach would be to remove every match with one or more hidden `account_id` values, so I'll first check how many matches that would affect.

In [16]:
matches['players'].sample(10)

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_abandon,gold_sell,gold_destroying_structure,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,gold_killing_couriers,unit_order_none,unit_order_move_to_position,unit_order_move_to_target,unit_order_attack_move,unit_order_attack_target,unit_order_cast_position,unit_order_cast_target,unit_order_cast_target_tree,unit_order_cast_no_target,unit_order_cast_toggle,unit_order_hold_position,unit_order_train_ability,unit_order_drop_item,unit_order_give_item,unit_order_pickup_item,unit_order_pickup_rune,unit_order_purchase_item,unit_order_sell_item,unit_order_disassemble_item,unit_order_move_item,unit_order_cast_toggle_auto,unit_order_stop,unit_order_taunt,unit_order_buyback,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue
387819,38781,132824,9,132,206,4010,199,172,0,14,7,3,35,34.8581,4947,0,0,41,29,46,0,36,73,10,0,2629.0,2713.0,,302.0,180.0,-2746.0,-239.0,,75.0,320.0,1515.0,1238.0,,,,3017.0,91.0,23.0,187.0,54.0,4.0,3.0,154.0,,87.0,10.0,,,,7.0,32.0,1.0,,1.0,,,,1.0,,,,,,,,,,
49221,4922,0,3,1,4396,7690,334,395,6,6,13,8,34,77.337,6324,0,758,180,254,46,73,102,178,16,0,7294.0,6550.0,447.0,667.0,109.0,-1404.0,,,,3412.0,3809.0,1331.0,200.0,,,3660.0,2.0,22.0,620.0,4.0,103.0,2.0,74.0,1.0,84.0,16.0,1.0,,,14.0,21.0,,,12.0,,,,,,,,2.0,,,,,,
359764,35976,124995,85,4,3589,10495,398,511,7,16,23,7,69,,10101,3339,1008,180,79,108,36,46,0,20,0,15795.0,6661.0,,379.0,387.0,-5624.0,-1097.0,,250.0,2860.0,7266.0,2445.0,,350.0,,6623.0,262.0,6.0,1036.0,131.0,37.0,4.0,98.0,,34.0,20.0,1.0,,79.0,21.0,38.0,2.0,,14.0,,,,1.0,1.0,,,14.0,,,,,,
489933,48993,0,35,3,1621,31365,604,613,15,5,17,3,460,0.066875,28117,0,5631,48,172,158,247,141,160,25,0,10794.0,21132.0,596.0,370.0,191.0,-2575.0,-609.0,,3809.0,2753.0,7824.0,15744.0,536.0,,,2073.0,51.0,856.0,613.0,120.0,24.0,4.0,523.0,,,25.0,,,1.0,18.0,53.0,9.0,,7.0,,3.0,,1.0,,,,,,,,,,
197110,19711,0,18,0,2327,11165,443,409,10,11,11,5,82,38.5112,12066,0,1882,116,172,0,36,63,149,17,0,8607.0,6264.0,,475.0,252.0,-3709.0,-1092.0,,262.0,4207.0,5276.0,3164.0,,,,2273.0,95.0,212.0,547.0,7.0,66.0,5.0,148.0,,1.0,18.0,,,,21.0,29.0,1.0,,10.0,,,,1.0,,,,9.0,,,,,,
215394,21539,84776,94,4,1294,14500,444,578,6,5,7,4,219,6.06519,11901,0,493,212,63,160,147,0,46,21,0,8172.0,14693.0,,296.0,195.0,-2365.0,-702.0,,232.0,400.0,4604.0,8405.0,,175.0,,5172.0,33.0,160.0,211.0,11.0,94.0,8.0,58.0,99.0,40.0,21.0,,,,3.0,34.0,1.0,,6.0,,,,1.0,4.0,,,4.0,,,,,,
212811,21281,0,53,1,1921,16075,527,354,0,3,4,0,200,,3186,0,2217,65,29,108,158,168,0,15,0,211.0,11358.0,,538.0,335.0,-837.0,,,,4164.0,56.0,9842.0,200.0,,,1749.0,53.0,,562.0,70.0,18.0,7.0,3.0,,,15.0,,,,48.0,17.0,,,4.0,,,,,2.0,,,,,,,,,
184755,18475,0,44,128,2944,11790,493,384,2,3,2,3,122,,2835,0,0,36,145,21,63,164,8,14,0,1621.0,9484.0,,,,-1107.0,,3107.0,287.0,160.0,2571.0,4928.0,200.0,150.0,,3733.0,89.0,12.0,240.0,8.0,78.0,7.0,12.0,,,13.0,,,,,25.0,11.0,,4.0,,,,,,,,,,,,,,
423505,42350,140591,37,128,3460,13775,375,361,5,2,7,3,109,27.8898,8288,3323,1937,110,180,0,108,24,0,17,0,6287.0,9261.0,447.0,380.0,171.0,-718.0,,,75.0,3988.0,3367.0,4335.0,600.0,,,3439.0,81.0,63.0,253.0,31.0,45.0,4.0,22.0,,14.0,17.0,,,,7.0,25.0,1.0,,8.0,,,,,1.0,,,12.0,,,,,,
35076,3507,17592,19,129,9946,20185,723,761,12,5,13,41,418,32.8889,15850,0,5120,108,158,46,1,63,116,25,0,11420.0,20068.0,,973.0,879.0,-2485.0,,,687.0,4367.0,6656.0,14485.0,200.0,,,2356.0,110.0,,351.0,111.0,30.0,4.0,39.0,1.0,,25.0,1.0,,,10.0,38.0,5.0,,15.0,,,,,1.0,,,3.0,,,,,,


In [17]:
anonym_accounts = matches['players'][matches['players']['account_id'] == 0]
print('Observations with anonymous ids: {:,}'.format(len(anonym_accounts)))

Observations with anonymous ids: 181,169


In [18]:
# Looking for the total number of matches that include one or more anonymous account ids
anonym_match_ids = anonym_accounts['match_id'].unique()

print('Total matches with at least one anonymous account: {:,}'.format(len(anonym_match_ids)))

Total matches with at least one anonymous account: 46,485


With so many matches containing at least one player who is hiding their account id, keeping only the matches where every account is visible would leave too little data to work with. However, since we are focused on each individual player, we'll just remove the rows of players with a hidden `account_id` instead.
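
Dropping only the rows with hidden ids, rather than whole matches, could look like the sketch below. The frame and its ids are hypothetical, and the second step is included only to contrast the row-level filter with the match-level one.

```python
import pandas as pd

# Toy players table: match 0 has one anonymous player, match 1 has none
players = pd.DataFrame({
    'match_id':   [0, 0, 0, 1, 1, 1],
    'account_id': [0, 101, 102, 103, 104, 105],
})

# Row-level filter: keep every player whose account_id is visible
identified = players[players['account_id'] != 0]

# For comparison: flag the matches in which every player is identified
fully_identified = (
    players.groupby('match_id')['account_id']
           .apply(lambda ids: (ids != 0).all())
)
```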

In [12]:
matches['players'][matches['players']['match_id'] == 49999]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,stuns,hero_damage,hero_healing,tower_damage,item_0,item_1,item_2,item_3,item_4,item_5,level,leaver_status,xp_hero,xp_creep,xp_roshan,xp_other,gold_other,gold_death,gold_buyback,gold_abandon,gold_sell,gold_destroying_structure,gold_killing_heros,gold_killing_creeps,gold_killing_roshan,gold_killing_couriers,unit_order_none,unit_order_move_to_position,unit_order_move_to_target,unit_order_attack_move,unit_order_attack_target,unit_order_cast_position,unit_order_cast_target,unit_order_cast_target_tree,unit_order_cast_no_target,unit_order_cast_toggle,unit_order_hold_position,unit_order_train_ability,unit_order_drop_item,unit_order_give_item,unit_order_pickup_item,unit_order_pickup_rune,unit_order_purchase_item,unit_order_sell_item,unit_order_disassemble_item,unit_order_move_item,unit_order_cast_toggle_auto,unit_order_stop,unit_order_taunt,unit_order_buyback,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue
499990,49999,158359,94,0,251,16840,430,449,2,11,5,3,237,3.01025,3922,0,287,50,65,160,166,170,172,19,0,4136.0,16728.0,,26.0,,-4729.0,,,193.0,160.0,2999.0,12190.0,,,,2791.0,,166.0,521.0,8.0,62.0,39.0,177.0,34.0,1.0,19.0,1.0,,2.0,5.0,31.0,3.0,,5.0,,86.0,,,,,,,,,,,,
499991,49999,0,19,1,80,14050,407,611,11,14,4,1,148,54.3521,17104,0,0,170,41,152,29,108,24,23,0,16361.0,10868.0,,1208.0,746.0,-6166.0,-1345.0,,,160.0,8125.0,5259.0,,,,4532.0,94.0,31.0,334.0,53.0,45.0,,113.0,,49.0,23.0,1.0,,,14.0,25.0,,,13.0,,,,1.0,4.0,,,,,,,,,
499992,49999,0,68,2,473,17130,414,463,6,9,9,22,90,26.4467,12748,0,0,36,108,254,96,29,65,20,0,12721.0,8689.0,,126.0,220.0,-2811.0,-1071.0,,600.0,160.0,9249.0,4980.0,,,,4394.0,65.0,236.0,469.0,128.0,73.0,1.0,73.0,,66.0,20.0,,,,2.0,34.0,3.0,,3.0,,,,1.0,,,,3.0,,,,,,
499993,49999,0,35,3,51,10590,311,398,4,11,9,5,131,0.566851,15062,0,0,3,63,75,152,212,170,18,0,7463.0,11025.0,,18.0,160.0,-4189.0,-73.0,,424.0,160.0,4416.0,5102.0,,,,3948.0,94.0,106.0,781.0,60.0,26.0,3.0,106.0,1.0,124.0,18.0,1.0,2.0,1.0,1.0,26.0,3.0,,13.0,,,,1.0,,,,4.0,,,,,,
499994,49999,2737,21,4,15,19165,406,515,7,5,7,10,171,51.9928,9702,0,0,108,50,46,158,149,102,21,0,12615.0,11298.0,,42.0,,-2395.0,-1230.0,,2510.0,160.0,7785.0,6285.0,,,,22310.0,7.0,78.0,6233.0,82.0,113.0,4.0,210.0,,570.0,21.0,1.0,,,,35.0,3.0,,8.0,,,,1.0,1.0,,,8.0,54.0,,,,,
499995,49999,0,100,128,2718,17735,468,626,16,9,16,2,70,54.4912,22127,0,1227,249,41,0,50,141,168,23,0,21496.0,6025.0,596.0,1007.0,528.0,-4131.0,,,237.0,3860.0,9377.0,2940.0,400.0,,,4042.0,79.0,192.0,423.0,81.0,38.0,5.0,281.0,,,22.0,,,2.0,21.0,32.0,3.0,,9.0,21.0,169.0,,,,,,6.0,,,,,,
499996,49999,0,9,129,3755,20815,507,607,12,6,11,7,115,43.0999,12381,0,2269,135,63,166,30,36,139,23,0,16360.0,9653.0,1490.0,740.0,329.0,-2274.0,,,1587.0,4945.0,8292.0,4346.0,857.0,175.0,,4412.0,,89.0,625.0,75.0,,4.0,162.0,2.0,224.0,21.0,,,,9.0,42.0,7.0,,13.0,,,,,1.0,,,15.0,,,,,,
499997,49999,0,90,130,1059,16225,371,404,5,3,11,2,92,18.1353,7050,872,87,79,48,152,108,102,117,18,0,8205.0,10012.0,,600.0,303.0,-1287.0,,,,3860.0,4027.0,3833.0,400.0,175.0,,4824.0,146.0,53.0,266.0,135.0,153.0,7.0,49.0,,1.0,18.0,,,1.0,22.0,39.0,,,5.0,,108.0,,,,,,2.0,,,,,,
499998,49999,0,73,131,3165,31015,780,703,8,6,17,6,306,64.3631,16474,0,2851,249,154,112,48,114,137,25,0,11773.0,20005.0,596.0,327.0,8302.0,-3084.0,-1681.0,,1107.0,4668.0,5152.0,12927.0,400.0,175.0,,4111.0,67.0,172.0,785.0,75.0,37.0,6.0,349.0,2.0,,25.0,2.0,,2.0,15.0,38.0,3.0,,7.0,,75.0,,1.0,,,,4.0,,,,,,
499999,49999,158360,53,132,2972,21195,520,533,6,6,17,0,294,,9822,0,5747,1,158,48,51,40,141,21,0,8210.0,15571.0,894.0,103.0,,-2874.0,,,1207.0,4628.0,3442.0,10888.0,400.0,175.0,,1209.0,1.0,,448.0,131.0,37.0,,27.0,,6.0,21.0,1.0,,1.0,1.0,28.0,4.0,,7.0,,,,,1.0,,,,,,,,,


#### Overview

In [6]:
# Taking a look at the features and values
matches['overview'].sample(10)

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster
19227,19227,1447526095,2455,4,1972,63,3,8,22,False,0,0,123
20653,20653,1447536642,2174,1847,0,0,63,81,22,True,0,0,133
2420,2420,1447342078,681,2047,2046,63,63,143,22,True,0,0,133
36256,36256,1447686325,2606,1828,4,3,63,168,22,True,0,2,224
7486,7486,1447391082,3068,1844,0,0,63,23,22,True,0,0,112
20543,20543,1447535944,2902,256,1956,63,48,283,22,False,0,0,138
30439,30439,1447619253,1982,1844,0,16,63,164,22,True,0,0,121
1679,1679,1447334506,1672,2038,260,51,63,5,22,True,0,0,138
7708,7708,1447394040,3020,1974,0,0,63,291,22,True,0,0,156
49354,49354,1447820786,2266,0,1974,63,0,84,22,False,0,0,123


In [7]:
matches['overview'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   match_id                 50000 non-null  int64
 1   start_time               50000 non-null  int64
 2   duration                 50000 non-null  int64
 3   tower_status_radiant     50000 non-null  int64
 4   tower_status_dire        50000 non-null  int64
 5   barracks_status_dire     50000 non-null  int64
 6   barracks_status_radiant  50000 non-null  int64
 7   first_blood_time         50000 non-null  int64
 8   game_mode                50000 non-null  int64
 9   radiant_win              50000 non-null  bool 
 10  negative_votes           50000 non-null  int64
 11  positive_votes           50000 non-null  int64
 12  cluster                  50000 non-null  int64
dtypes: bool(1), int64(12)
memory usage: 4.6 MB


In [8]:
matches['overview'].describe()

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,negative_votes,positive_votes,cluster
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,24999.5,1447573000.0,2476.4535,1000.01644,935.25006,34.52946,34.77526,93.82552,21.468,0.01548,0.03682,142.30472
std,14433.901067,148527.0,634.631261,948.211846,937.974714,29.209672,29.73214,92.648332,3.218258,0.364696,0.871068,25.156608
min,0.0,1446750000.0,59.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,111.0
25%,12499.75,1447456000.0,2029.0,0.0,0.0,0.0,0.0,9.0,22.0,0.0,0.0,123.0
50%,24999.5,1447577000.0,2415.0,1536.0,384.0,51.0,51.0,77.0,22.0,0.0,0.0,133.0
75%,37499.25,1447700000.0,2872.0,1974.0,1972.0,63.0,63.0,144.0,22.0,0.0,0.0,154.0
max,49999.0,1447829000.0,16037.0,2047.0,2047.0,63.0,63.0,831.0,22.0,47.0,80.0,242.0


#### Time

In [9]:
matches['time'][matches['time']['match_id'] == 49999]

Unnamed: 0,match_id,times,gold_t_0,lh_t_0,xp_t_0,gold_t_1,lh_t_1,xp_t_1,gold_t_2,lh_t_2,xp_t_2,gold_t_3,lh_t_3,xp_t_3,gold_t_4,lh_t_4,xp_t_4,gold_t_128,lh_t_128,xp_t_128,gold_t_129,lh_t_129,xp_t_129,gold_t_130,lh_t_130,xp_t_130,gold_t_131,lh_t_131,xp_t_131,gold_t_132,lh_t_132,xp_t_132
2209729,49999,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2209730,49999,60,99,0,0,242,1,265,285,2,162,141,1,175,139,1,93,226,3,196,121,1,41,173,2,206,249,3,206,181,2,82
2209731,49999,120,326,2,238,470,4,429,460,4,450,531,8,691,402,5,423,494,7,499,297,3,261,392,5,474,847,12,784,461,6,345
2209732,49999,180,556,4,455,750,7,582,597,5,770,836,13,1145,714,10,856,755,11,808,441,4,456,577,7,876,1280,18,1176,631,8,448
2209733,49999,240,824,7,755,887,8,746,820,8,1058,1019,15,1625,938,13,1021,1100,16,1150,639,8,740,677,7,1133,1699,23,1635,872,15,665
2209734,49999,300,924,7,755,1190,13,951,1009,10,1346,1203,17,2079,1253,18,1563,1532,24,1490,739,8,894,823,8,1236,2084,28,2089,1185,19,975
2209735,49999,360,1193,10,1034,1290,13,1126,1190,12,1609,1422,20,2471,1565,23,1893,1882,30,1964,965,11,1177,1877,12,2110,2489,33,2481,1511,24,1357
2209736,49999,420,1611,13,1427,1431,14,1250,1290,12,1908,1522,20,2502,1873,28,2285,2104,31,2302,1160,11,1330,1977,12,2234,3186,34,3096,1706,28,1511
2209737,49999,480,2059,18,2058,1609,16,1745,1390,12,1908,2325,24,3633,2224,34,2806,2528,38,2782,1554,11,1495,2077,12,2399,3447,35,3349,1999,34,1892
2209738,49999,540,2382,21,2518,1817,17,2124,1490,12,1908,2510,26,3787,2876,36,3330,2668,39,2813,1899,17,2021,2545,13,2773,3966,42,3854,2310,42,2228
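
Since the column suffix encodes the `player_slot`, this wide table can be reshaped into one row per match, minute, and player. The sketch below is one possible approach using `melt`, a regular expression, and `pivot`; it uses two slots and two time steps copied from the output above, and the name `tidy` is illustrative.

```python
import pandas as pd

# Two time steps and two player slots taken from match 49999
time_wide = pd.DataFrame({
    'match_id': [49999, 49999],
    'times': [60, 120],
    'gold_t_0': [99, 326],
    'lh_t_0': [0, 2],
    'xp_t_0': [0, 238],
    'gold_t_128': [226, 494],
    'lh_t_128': [3, 7],
    'xp_t_128': [196, 499],
})

# Stack all stat columns into (variable, value) pairs
melted = time_wide.melt(id_vars=['match_id', 'times'],
                        var_name='variable', value_name='value')

# Split names such as 'gold_t_128' into the stat and the player slot
parts = melted['variable'].str.extract(r'^(?P<stat>gold|lh|xp)_t_(?P<player_slot>\d+)$')
melted = melted.join(parts)
melted['player_slot'] = melted['player_slot'].astype(int)

# Spread the stats back out so each row is one player at one time step
tidy = (
    melted.pivot(index=['match_id', 'times', 'player_slot'],
                 columns='stat', values='value')
          .reset_index()
)
tidy.columns.name = None
```

In this long format the table can be joined to the players data on `match_id` and `player_slot`.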


#### Objectives

In [29]:
matches['objectives'].sample(10)

Unnamed: 0,match_id,key,player1,player2,slot,subtype,team,time,value
920773,39253,2.0,-1,-1,,CHAT_MESSAGE_BARRACKS_KILL,,3068,2
765014,32649,,3,-1,,CHAT_MESSAGE_ROSHAN_KILL,3.0,1425,200
938129,39986,,7,-1,7.0,CHAT_MESSAGE_TOWER_KILL,3.0,1351,3
1050307,44763,,5,-1,5.0,CHAT_MESSAGE_AEGIS,,1005,0
858532,36600,,8,-1,8.0,CHAT_MESSAGE_AEGIS,,1972,0
356752,15233,2.0,-1,-1,,CHAT_MESSAGE_BARRACKS_KILL,,1710,2
977909,41680,8.0,-1,-1,,CHAT_MESSAGE_BARRACKS_KILL,,2335,8
656521,28014,64.0,-1,-1,,CHAT_MESSAGE_BARRACKS_KILL,,2556,64
521293,22275,8.0,-1,-1,,CHAT_MESSAGE_BARRACKS_KILL,,1829,8
545964,23319,128.0,-1,-1,,CHAT_MESSAGE_BARRACKS_KILL,,2127,128


#### Upgrades

In [28]:
matches['upgrades'].sample(10)

Unnamed: 0,ability,level,time,player_slot,match_id
7203765,5134,2,466,130,40273
3716516,5595,13,1930,128,20812
6616287,5297,10,1911,2,37003
7327358,5002,23,3037,132,40975
8489847,5157,15,2523,131,47488
1702933,5022,6,779,129,9544
917934,5514,1,347,128,5133
3664528,5506,2,521,1,20522
4970848,5448,14,2421,131,27814
7861859,5568,16,2666,128,43963


#### Purchases

In [27]:
matches['purchases'].sample(10)

Unnamed: 0,item_id,time,player_slot,match_id
8100927,12,455,1,22307
16429811,218,2177,132,45150
8634053,188,1104,129,23773
3961619,46,1161,129,10911
7431220,46,1716,4,20456
8999691,22,1796,3,24778
8638236,28,140,131,23785
15924861,77,1875,0,43773
9801827,181,207,0,26973
342948,42,-45,4,933


#### Chat

In [26]:
matches['chatlog'].sample(10)

Unnamed: 0,match_id,key,slot,time,unit
207598,7003,i meant to say ult,3,1516,Deodorant spray
275849,9221,hahaaha,6,1264,Slim Shady
681967,23611,ggwp,9,3191,Raging Potato Fanboy
453168,15382,XD lol,7,1165,STyLe.FQ-Park
1346521,46887,coward,0,1924,Leslie F. Chow
345813,11750,for keeping the team,4,2641,Nova.*Maria* [Ro]
262933,8751,GGWP,8,2291,a C k y
782738,27111,the comeback,0,1221,Coru
288083,9635,король прост,5,656,Просто БОХ
241531,8094,:d,5,2214,barelyrobot


---

### Teamfights

#### Overview

In [14]:
tfs['overview'].head(20)

Unnamed: 0,match_id,start,end,last_death,deaths
0,0,220,252,237,3
1,0,429,475,460,3
2,0,900,936,921,3
3,0,1284,1328,1313,3
4,0,1614,1666,1651,5
5,0,1672,1709,1694,3
6,0,1734,1783,1768,6
7,0,1818,1867,1852,5
8,0,1863,1912,1897,5
9,0,2101,2145,2130,4


Since multiple teamfights occur during a single match, numbering them in order within each match will make it easier to derive other stats later.

In [15]:
# Inserting the teamfight order from each match
tfs['overview'].insert(1, 'tf_order', tfs['overview'].groupby('match_id').cumcount().add(1))
tfs['overview'].head(20)

Unnamed: 0,match_id,tf_order,start,end,last_death,deaths
0,0,1,220,252,237,3
1,0,2,429,475,460,3
2,0,3,900,936,921,3
3,0,4,1284,1328,1313,3
4,0,5,1614,1666,1651,5
5,0,6,1672,1709,1694,3
6,0,7,1734,1783,1768,6
7,0,8,1818,1867,1852,5
8,0,9,1863,1912,1897,5
9,0,10,2101,2145,2130,4


#### Breakdown

According to the file notes, each observation in the `tfs['overview']` dataframe corresponds to a set of 10 observations *(one per player)* in the breakdown. If we end up merging the two tables, a teamfight id will come in handy.

In [19]:
# First we need to know the total teamfights we have
print('Total teamfights', tfs['overview'].shape[0])
print('Total observations in breakdown', tfs['breakdown'].shape[0])
print('Teamfights have 10x observations:', tfs['overview'].shape[0] == (tfs['breakdown'].shape[0]/10))

Total teamfights 539047
Total observations in breakdown 5390470
Teamfights have 10x observations: True


In [17]:
tfs['breakdown'].head(25)

Unnamed: 0,match_id,player_slot,buybacks,damage,deaths,gold_delta,xp_end,xp_start
0,0,0,0,105,0,173,536,314
1,0,1,0,566,1,0,1583,1418
2,0,2,0,0,0,0,391,391
3,0,3,0,0,0,123,1775,1419
4,0,4,0,444,0,336,1267,983
5,0,128,0,477,1,249,1318,1035
6,0,129,0,636,1,-27,1048,904
7,0,130,0,0,0,190,1904,1589
8,0,131,0,0,0,0,210,210
9,0,132,0,0,0,378,659,589


In [20]:
# Including the tf_id column in the detailed dataset.
# This relies on the default 0..n-1 index and on every teamfight taking up 10 consecutive rows.
tfs['breakdown'].insert(1, 'tf_id', tfs['breakdown'].index // 10)
tfs['breakdown'].head(50)

Unnamed: 0,match_id,tf_id,player_slot,buybacks,damage,deaths,gold_delta,xp_end,xp_start
0,0,0,0,0,105,0,173,536,314
1,0,0,1,0,566,1,0,1583,1418
2,0,0,2,0,0,0,0,391,391
3,0,0,3,0,0,0,123,1775,1419
4,0,0,4,0,444,0,336,1267,983
5,0,0,128,0,477,1,249,1318,1035
6,0,0,129,0,636,1,-27,1048,904
7,0,0,130,0,0,0,190,1904,1589
8,0,0,131,0,0,0,0,210,210
9,0,0,132,0,0,0,378,659,589
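
A minimal sketch of the merge this id is meant to support, assuming the overview keeps its default index so that its row number equals `tf_id`. The first teamfight's `deaths` values are copied from the output above; the second teamfight's per-player values are made up for illustration, and the `_player` / `_total` suffixes are hypothetical names.

```python
import pandas as pd

# Two teamfights from match 0 (totals as in the overview output)
overview = pd.DataFrame({
    'match_id': [0, 0],
    'tf_order': [1, 2],
    'start': [220, 429],
    'end': [252, 475],
    'deaths': [3, 3],
})

slots = [0, 1, 2, 3, 4, 128, 129, 130, 131, 132]
breakdown = pd.DataFrame({
    'match_id': [0] * 20,
    'player_slot': slots * 2,
    # First 10 values come from the breakdown output; the last 10 are invented
    'deaths': [0, 1, 0, 0, 0, 1, 1, 0, 0, 0] + [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})
breakdown['tf_id'] = breakdown.index // 10

# Sanity check: every tf_id holds exactly 10 rows that share one match_id
groups = breakdown.groupby('tf_id')
well_formed = bool((groups.size() == 10).all() and (groups['match_id'].nunique() == 1).all())

# Expose the overview's row number as tf_id, then join on it
merged = breakdown.merge(
    overview.rename_axis('tf_id').reset_index(),
    on=['tf_id', 'match_id'],
    suffixes=('_player', '_total'),
)

print(well_formed)
print(merged[['tf_id', 'tf_order', 'player_slot', 'deaths_player', 'deaths_total']].head())
```

Joining on both `tf_id` and `match_id` means a misaligned id shows up as missing rows instead of silently pairing a teamfight with the wrong match.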


---

### Reference

#### Previous Outcomes

In [25]:
ref['prev_outcomes'].sample(10)

Unnamed: 0,match_id,account_id_0,account_id_1,account_id_2,account_id_3,account_id_4,start_time,parser_version,win,rad
663392,1815768171,283959,291308,-233086354,108665,286258,1443082906,13,0,0
1483024,1882269882,0,-194817304,0,0,24712,1445437983,13,0,0
456221,1793890744,-124550828,-189244936,-201711238,0,0,1442249019,12,0,0
1019713,1838205284,-121446312,273951,268317,114280,195945,1443823622,13,1,1
315735,1751746140,209869,0,324357,-142763804,0,1440782519,12,0,0
264280,1733711933,-20012022,220608,-72868501,-110866782,0,1440215658,12,0,0
1521952,1886667270,33952,126224,-105556674,-33617657,-105527690,1445614723,13,1,1
910114,1830498229,0,0,0,0,131956,1443574358,13,0,0
1034011,1838882140,58381,107351,0,0,-81635188,1443857848,13,0,0
566082,1810041645,-80071131,77454,-28161904,-34915047,-75634857,1442848953,12,1,1


#### Ratings

In [9]:
ref['ratings'].sample(5)

Unnamed: 0,account_id,total_wins,total_matches,trueskill_mu,trueskill_sigma
682674,-191735769,4,6,26.876154,7.319408
143704,29808,5,13,19.115807,5.934214
752777,-221663725,1,1,27.134355,8.060786
593190,-162750222,3,9,20.000501,6.613655
593052,117142,1,4,23.370685,7.356318


In [10]:
# Conservative estimate (mu - 3*sigma): players with fewer matches have a larger sigma (more uncertainty), so they rank lower
ref['ratings']['conservative_skill_estimate'] = ref['ratings']['trueskill_mu'] - 3*ref['ratings']['trueskill_sigma']

In [11]:
ref['ratings'].sample(5)

Unnamed: 0,account_id,total_wins,total_matches,trueskill_mu,trueskill_sigma,conservative_skill_estimate
735538,-211875317,1,2,25.796672,7.875411,2.17044
109077,-55631835,0,1,22.48685,8.044155,-1.645616
228037,-91407748,1,1,26.988738,8.080127,2.748355
389845,-117865043,5,7,29.803824,6.765374,9.507703
746338,-217777610,1,2,25.719225,7.866682,2.119179
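
To make the penalty concrete, here is a small worked example using two rows from the sample above (the player with 2 matches and the one with 7). The frame is a toy copy of those values, not a new computation on the full table.

```python
import pandas as pd

# Two rows copied from the sample output above
ratings = pd.DataFrame({
    'total_matches': [2, 7],
    'trueskill_mu': [25.796672, 29.803824],
    'trueskill_sigma': [7.875411, 6.765374],
})

# Same formula as in the cell above: mean skill minus three standard deviations
ratings['conservative_skill_estimate'] = (
    ratings['trueskill_mu'] - 3 * ratings['trueskill_sigma']
)

print(ratings.round(3))
```

The gap between the two players is about 4 points in `trueskill_mu` but more than 7 points in the conservative estimate, because the player with fewer matches also carries the larger `trueskill_sigma`.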


#### MMR

While looking into the source of the original dump, I found the [**OpenDota Core Repository**](https://github.com/odota/core/wiki/MMR-Data), whose wiki page hosts a file containing the **Matchmaking Ratings (MMR)** of all sampled players.

In [30]:
ref['mmr'].sample(5)

Unnamed: 0,account_id,MMR
491045,260118435,3188
393715,298284056,2998
828652,239292186,3975
694647,194852608,3621
178198,93682092,2318


---

### Tests

In [31]:
test['matches'].sample(5)

Unnamed: 0,match_id,radiant_win
20348,70348,1
36759,86759,1
1596,51596,1
89267,139267,1
99627,149627,0


In [32]:
test['players'].sample(5)

Unnamed: 0,match_id,account_id,hero_id,player_slot
33731,53373,166350,74,1
111785,61178,0,98,128
679639,117963,12093,19,132
458421,95842,0,12,1
658219,115821,281398,17,132


---

## Chat Sentiment Analysis

In [36]:
dota2['regions'].keys()

dict_keys(['AUSTRALIA', 'AUSTRIA', 'BRAZIL', 'CHILE', 'DUBAI', 'EUROPE', 'INDIA', 'JAPAN', 'PERU', 'PW TELECOM GUANGDONG', 'PW TELECOM SHANGHAI', 'PW TELECOM WUHAN', 'PW TELECOM ZHEJIANG', 'PW UNICOM', 'PW UNICOM TIANJIN', 'SINGAPORE', 'SOUTHAFRICA', 'STOCKHOLM', 'US EAST', 'US WEST'])

In [None]:
regions = {'US WEST', 'US EAST', 'EUROPE', 'AUSTRALIA'}

# Clusters that belong to the selected regions
region_clusters = dota2['regions']['cluster']
clusters = set(region_clusters[region_clusters.isin(regions)])

# Matches played on those clusters
in_region = dfs['match']['cluster'].isin(clusters)
match_ids = set(dfs['match'].loc[in_region, 'match_id'])

# Players from those matches, skipping the ones that play anonymously (account_id == 0)
in_matches = dfs['players']['match_id'].isin(match_ids)
player_ids = set(dfs['players'].loc[in_matches, 'account_id']) - {0}

print('Total matches:', len(match_ids))
print('Total players:', len(player_ids))

In [None]:
dfs['chat'][dfs['chat']['match_id'].isin(match_ids)]

In [None]:
dfs['player_ratings'][dfs['player_ratings']['account_id'].isin(player_ids)]

In [None]:
dfs['test_labels']['match_id']