## Dota Match Prediction Project

- Source/Credit: The data for this project comes from a Kaggle dataset last updated 1 year ago by Devin Anzelmo.
- The dataset is available on Kaggle at: https://www.kaggle.com/devinanzelmo/dota-2-matches

In [1]:
# Importing the libraries:
import pandas as pd
import numpy as np
from math import sqrt
from scipy import stats

# visualizing
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# plt.rc('figure', figsize=(13, 10))
# plt.rc('font', size=14)

# preparing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# modeling and evaluating
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, accuracy_score

# turn off warnings
import warnings
warnings.filterwarnings("ignore")

# acquiring
from pydataset import data

## Acquire

In [2]:
players = pd.read_csv("data/players.csv")
match = pd.read_csv("data/match.csv")
heroes = pd.read_csv("data/hero_names.csv")
items = pd.read_csv("data/item_ids.csv")
test_player = pd.read_csv("data/test_player.csv")
test_label = pd.read_csv("data/test_labels.csv")

In [3]:
# Additional data to be joined (as needed):

outcomes = pd.read_csv("data/match_outcomes.csv")
player_rating = pd.read_csv("data/player_ratings.csv")

In [4]:
players.head()

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue
0,0,0,86,0,3261,10960,347,362,9,3,...,,,,6.0,,,,,,
1,0,1,51,1,2954,17760,494,659,13,3,...,,,,14.0,,,,,,
2,0,0,83,2,110,12195,350,385,0,4,...,,,,17.0,,,,,,
3,0,2,11,3,1179,22505,599,605,8,4,...,1.0,,,13.0,,,,,,
4,0,3,67,4,3307,23825,613,762,20,3,...,3.0,,,23.0,,,,,,


In [5]:
match.head()

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster
0,0,1446750112,2375,1982,4,3,63,1,22,True,0,1,155
1,1,1446753078,2582,0,1846,63,0,221,22,False,0,2,154
2,2,1446764586,2716,256,1972,63,48,190,22,False,0,0,132
3,3,1446765723,3085,4,1924,51,3,40,22,False,0,0,191
4,4,1446796385,1887,2047,0,0,63,58,22,True,0,0,156


In [6]:
heroes.head()

Unnamed: 0,name,hero_id,localized_name
0,npc_dota_hero_antimage,1,Anti-Mage
1,npc_dota_hero_axe,2,Axe
2,npc_dota_hero_bane,3,Bane
3,npc_dota_hero_bloodseeker,4,Bloodseeker
4,npc_dota_hero_crystal_maiden,5,Crystal Maiden


In [7]:
items.head()

Unnamed: 0,item_id,item_name
0,1,blink
1,2,blades_of_attack
2,3,broadsword
3,4,chainmail
4,5,claymore


In [8]:
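# info() prints its report and returns None, so call it on each frame directly
# instead of wrapping the calls in print():
for df in (players, heroes, match, items):
    df.info()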

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 73 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   match_id                           500000 non-null  int64  
 1   account_id                         500000 non-null  int64  
 2   hero_id                            500000 non-null  int64  
 3   player_slot                        500000 non-null  int64  
 4   gold                               500000 non-null  int64  
 5   gold_spent                         500000 non-null  int64  
 6   gold_per_min                       500000 non-null  int64  
 7   xp_per_min                         500000 non-null  int64  
 8   kills                              500000 non-null  int64  
 9   deaths                             500000 non-null  int64  
 10  assists                            500000 non-null  int64  
 11  denies                             5000

In [9]:
# What are the most commonly picked heroes?
players.hero_id.value_counts()

21     20881
11     17007
74     11676
7      11323
28     11181
39     10590
8      10394
100    10306
73      9823
14      9447
1       9396
104     9025
93      8426
50      8403
25      8255
86      8183
69      7938
5       7846
42      7794
112     7697
106     7533
26      7382
30      7321
71      7311
44      7280
75      7224
9       7210
72      6856
62      6793
68      6753
67      6660
46      6042
36      5969
85      5951
19      5305
57      5161
87      4750
31      4687
2       4601
22      4589
84      4353
70      4302
51      4301
55      4219
20      4194
99      4167
32      4140
35      3809
59      3782
47      3690
12      3650
27      3589
18      3450
97      3431
53      3344
102     3310
41      3193
16      3150
110     3029
60      3023
40      3015
101     2976
4       2956
98      2934
107     2808
64      2748
88      2701
34      2610
6       2608
54      2585
33      2579
63      2566
3       2553
23      2543
56      2479
17      2407
29      2400

In [10]:
heroes.shape

(112, 3)

In [11]:
# Taking a quick look at the top 5 heroes picked:

heroes[(heroes.hero_id == 21) | (heroes.hero_id == 11) | (heroes.hero_id == 74) | (heroes.hero_id == 7) | (heroes.hero_id == 28)]

Unnamed: 0,name,hero_id,localized_name
6,npc_dota_hero_earthshaker,7,Earthshaker
10,npc_dota_hero_nevermore,11,Shadow Fiend
20,npc_dota_hero_windrunner,21,Windranger
26,npc_dota_hero_slardar,28,Slardar
72,npc_dota_hero_invoker,74,Invoker
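
The five IDs in the filter above were copied by hand from the `value_counts()` output. A minimal sketch of deriving them instead, assuming small stand-in frames (`players_demo`, `heroes_demo`, both hypothetical) that reuse the real column names `hero_id` and `localized_name`:

```python
import pandas as pd

# Toy stand-ins for the real `players` and `heroes` frames (values are illustrative only).
players_demo = pd.DataFrame({"hero_id": [21, 21, 21, 11, 11, 74, 7, 7, 7, 7]})
heroes_demo = pd.DataFrame({
    "hero_id": [7, 11, 21, 74],
    "localized_name": ["Earthshaker", "Shadow Fiend", "Windranger", "Invoker"],
})

# value_counts() sorts by frequency, so the first n index labels are the most-picked IDs.
top_n = 2
top_ids = players_demo["hero_id"].value_counts().head(top_n).index

# isin() replaces the chain of `==` comparisons joined with `|`.
top_heroes = heroes_demo[heroes_demo["hero_id"].isin(top_ids)]
print(top_heroes)
```

On the real data, the same pattern with `top_n = 5` should select the same five rows as the hand-written filter.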


#### Takeaways:

- I've identified the five most frequently picked heroes.
- I still need to answer the questions posed in the Prep section below.

#### Removing the 37 rows that don't have a valid hero attached (`hero_id == 0`).
- In the interest of time, I'm simply going to drop these 37 rows out of 500,000.

In [13]:
heroes.hero_id.value_counts()

113    1
112    1
31     1
32     1
33     1
34     1
35     1
36     1
37     1
38     1
39     1
40     1
41     1
42     1
43     1
44     1
45     1
46     1
47     1
48     1
49     1
50     1
51     1
52     1
53     1
54     1
55     1
30     1
29     1
28     1
13     1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1
10     1
11     1
12     1
14     1
27     1
15     1
16     1
17     1
18     1
19     1
20     1
21     1
22     1
23     1
25     1
26     1
56     1
57     1
58     1
99     1
88     1
89     1
90     1
91     1
92     1
93     1
94     1
95     1
96     1
97     1
98     1
100    1
86     1
101    1
102    1
103    1
104    1
105    1
106    1
107    1
108    1
109    1
110    1
111    1
87     1
85     1
59     1
71     1
60     1
61     1
62     1
63     1
64     1
65     1
66     1
67     1
68     1
69     1
70     1
72     1
84     1
73     1
74     1
75     1
76     1
77     1
78     1
79     1
80     1
81     1
82     1
83     1
1

In [14]:
# Dropping hero_id == 0 (`player` keeps an untouched copy so the two shapes can be compared)
player = players.copy()
players = players[players.hero_id != 0]
players.shape, player.shape

((499963, 73), (500000, 73))
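
The drop above targets `hero_id == 0` specifically. A more general check, sketched here on hypothetical toy frames, is to look for any `hero_id` in the player rows that has no entry in the hero lookup table:

```python
import pandas as pd

# Toy stand-ins; the real frames are `players` and `heroes`.
players_demo = pd.DataFrame({"hero_id": [0, 1, 2, 2, 0, 5]})
heroes_demo = pd.DataFrame({"hero_id": [1, 2, 5]})

# Rows whose hero_id has no entry in the lookup table.
known = players_demo["hero_id"].isin(heroes_demo["hero_id"])
unmatched = players_demo[~known]
print(unmatched["hero_id"].value_counts())

# Keep only rows that can be matched to a hero name.
players_kept = players_demo[known]
print(len(players_demo) - len(players_kept), "rows dropped")
```

If `0` is the only unmatched ID in the real data, this gives the same result as the explicit filter; if other IDs show up, they would otherwise surface later as missing hero names after the merge.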

## Prep

- Key points I need to answer:
    - What is the time scale? I think it's either in seconds or minutes. Probably seconds.
    - How is 'player skill' determined, and is there a better set of features to create a "player skill" feature?
    - I need to join these tables; are there different types of data, i.e., are there time-series tables vs. static tables that I need to make sure I'm not mixing?
    - Is there a specific combination of heroes and items that makes for a match-winning combination? That's the goal, so how do I prep the data to get those features in a df?
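
One way to sanity-check the time-scale question is to convert the values under each assumption and see which result is plausible. The sketch below uses the `start_time` and `duration` values from the first two rows of `match.head()` above; the frame name `match_demo` and the new column names are hypothetical, and the units are assumptions being tested, not documented facts.

```python
import pandas as pd

# First two rows of match.head() shown earlier.
match_demo = pd.DataFrame({
    "start_time": [1446750112, 1446753078],
    "duration": [2375, 2582],
})

# Assumption: start_time is a Unix timestamp in seconds.
match_demo["start_dt"] = pd.to_datetime(match_demo["start_time"], unit="s")

# Assumption: duration is in seconds, so dividing by 60 gives minutes.
match_demo["duration_min"] = match_demo["duration"] / 60

print(match_demo[["start_dt", "duration_min"]])
```

Under these assumptions both matches start in November 2015 and last roughly 40 minutes, which is far more believable than a 2,375-minute game, so seconds looks like the right reading.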
    

In [15]:
# First off, need to join the heroes df to my players df so that I have all the names of the heroes together.

In [16]:
# Checking first that there are no nulls 
players[players.hero_id.isna()]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue


In [None]:
# Now need to add the list of heroes full names to main df:

In [18]:
player_heroes = pd.merge(players, heroes, on = 'hero_id', how = 'left')
player_heroes.head()

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,name,localized_name
0,0,0,86,0,3261,10960,347,362,9,3,...,,6.0,,,,,,,npc_dota_hero_rubick,Rubick
1,0,1,51,1,2954,17760,494,659,13,3,...,,14.0,,,,,,,npc_dota_hero_rattletrap,Clockwerk
2,0,0,83,2,110,12195,350,385,0,4,...,,17.0,,,,,,,npc_dota_hero_treant,Treant Protector
3,0,2,11,3,1179,22505,599,605,8,4,...,,13.0,,,,,,,npc_dota_hero_nevermore,Shadow Fiend
4,0,3,67,4,3307,23825,613,762,20,3,...,,23.0,,,,,,,npc_dota_hero_spectre,Spectre


In [20]:
player_heroes.drop(columns = ['name'], inplace = True)
player_heroes.rename(columns = {"localized_name": "hero"}, inplace = True)

In [28]:
player_heroes[player_heroes.hero_id != 0]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,hero
0,0,0,86,0,3261,10960,347,362,9,3,...,,,6.0,,,,,,,Rubick
1,0,1,51,1,2954,17760,494,659,13,3,...,,,14.0,,,,,,,Clockwerk
2,0,0,83,2,110,12195,350,385,0,4,...,,,17.0,,,,,,,Treant Protector
3,0,2,11,3,1179,22505,599,605,8,4,...,,,13.0,,,,,,,Shadow Fiend
4,0,3,67,4,3307,23825,613,762,20,3,...,,,23.0,,,,,,,Spectre
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499958,49999,0,100,128,2718,17735,468,626,16,9,...,,,6.0,,,,,,,Tusk
499959,49999,0,9,129,3755,20815,507,607,12,6,...,,,15.0,,,,,,,Mirana
499960,49999,0,90,130,1059,16225,371,404,5,3,...,,,2.0,,,,,,,Keeper of the Light
499961,49999,0,73,131,3165,31015,780,703,8,6,...,,,4.0,,,,,,,Alchemist


In [29]:
player_heroes.shape

(499963, 74)

In [30]:
items.head()

Unnamed: 0,item_id,item_name
0,1,blink
1,2,blades_of_attack
2,3,broadsword
3,4,chainmail
4,5,claymore


#### Dropping unneeded columns

- Since this modeling is going to revolve around heroes and items directly, I don't need columns that are only indirectly related to those potential features. Thus, I'm going to drop any column that mentions `unit_order` or `gold`, which gets the dataframe down to a more manageable size.
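
The rule described above (drop anything mentioning `unit_order` or `gold`) can be expressed as a name-based filter. This is a minimal sketch on a hypothetical one-row frame with a handful of the real column names; the notebook itself takes a different route and selects an explicit keep-list, which also removes columns this rule would leave in place.

```python
import pandas as pd

# Toy frame with a few of the real column names.
demo = pd.DataFrame({
    "match_id": [0], "hero_id": [86], "gold": [3261], "gold_per_min": [347],
    "kills": [9], "unit_order_glyph": [None], "gold_killing_roshan": [0],
})

# Boolean mask over column names: True where the name mentions either pattern.
noisy = demo.columns.str.contains("unit_order|gold")

# Keep everything that is not flagged.
trimmed = demo.loc[:, ~noisy]
print(list(trimmed.columns))
```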

In [31]:
player_heroes.columns

Index(['match_id', 'account_id', 'hero_id', 'player_slot', 'gold',
       'gold_spent', 'gold_per_min', 'xp_per_min', 'kills', 'deaths',
       'assists', 'denies', 'last_hits', 'stuns', 'hero_damage',
       'hero_healing', 'tower_damage', 'item_0', 'item_1', 'item_2', 'item_3',
       'item_4', 'item_5', 'level', 'leaver_status', 'xp_hero', 'xp_creep',
       'xp_roshan', 'xp_other', 'gold_other', 'gold_death', 'gold_buyback',
       'gold_abandon', 'gold_sell', 'gold_destroying_structure',
       'gold_killing_heros', 'gold_killing_creeps', 'gold_killing_roshan',
       'gold_killing_couriers', 'unit_order_none',
       'unit_order_move_to_position', 'unit_order_move_to_target',
       'unit_order_attack_move', 'unit_order_attack_target',
       'unit_order_cast_position', 'unit_order_cast_target',
       'unit_order_cast_target_tree', 'unit_order_cast_no_target',
       'unit_order_cast_toggle', 'unit_order_hold_position',
       'unit_order_train_ability', 'unit_order_drop_item',


In [32]:
# Instead of dropping columns, I'm simply selecting the columns I want to keep.
# .copy() makes this an independent frame, so the later item_* assignments
# don't write to a slice of player_heroes:

player_heroes_cleaned = player_heroes[['match_id', 'account_id', 'hero_id', 'hero', 'player_slot', 
                                       'item_0', 'item_1', 'item_2', 'item_3', 'item_4', 'item_5', 
                                       'kills', 'deaths', 'assists', 'denies', 'last_hits']].copy()

In [33]:
columns_reduced = player_heroes.shape[1] - player_heroes_cleaned.shape[1]
print(f'Reduced the number of columns by {columns_reduced}.')

Reduced the number of columns by 58.


In [34]:
player_heroes_cleaned.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits
0,0,0,86,Rubick,0,180,37,73,56,108,0,9,3,18,1,30
1,0,1,51,Clockwerk,1,46,63,119,102,24,108,13,3,18,9,109
2,0,0,83,Treant Protector,2,48,60,59,108,65,0,0,4,15,1,58
3,0,2,11,Shadow Fiend,3,63,147,154,164,79,160,8,4,19,6,271
4,0,3,67,Spectre,4,114,92,147,0,137,63,20,3,17,13,245


In [35]:
player_heroes.item_0

0         180
1          46
2          48
3          63
4         114
         ... 
499958    249
499959    135
499960     79
499961    249
499962      1
Name: item_0, Length: 499963, dtype: int64

In [40]:
item_lookup = dict(zip(items['item_id'], items['item_name']))
item_lookup[0] = 'Unknown'

In [41]:
item_lookup

{1: 'blink',
 2: 'blades_of_attack',
 3: 'broadsword',
 4: 'chainmail',
 5: 'claymore',
 6: 'helm_of_iron_will',
 7: 'javelin',
 8: 'mithril_hammer',
 9: 'platemail',
 10: 'quarterstaff',
 11: 'quelling_blade',
 12: 'ring_of_protection',
 13: 'gauntlets',
 14: 'slippers',
 15: 'mantle',
 16: 'branches',
 17: 'belt_of_strength',
 18: 'boots_of_elves',
 19: 'robe',
 20: 'circlet',
 21: 'ogre_axe',
 22: 'blade_of_alacrity',
 23: 'staff_of_wizardry',
 24: 'ultimate_orb',
 25: 'gloves',
 26: 'lifesteal',
 27: 'ring_of_regen',
 28: 'sobi_mask',
 29: 'boots',
 30: 'gem',
 31: 'cloak',
 32: 'talisman_of_evasion',
 33: 'cheese',
 34: 'magic_stick',
 36: 'magic_wand',
 37: 'ghost',
 38: 'clarity',
 39: 'flask',
 40: 'dust',
 41: 'bottle',
 42: 'ward_observer',
 43: 'ward_sentry',
 44: 'tango',
 45: 'courier',
 46: 'tpscroll',
 48: 'travel_boots',
 50: 'phase_boots',
 51: 'demon_edge',
 52: 'eagle',
 53: 'reaver',
 54: 'relic',
 55: 'hyperstone',
 56: 'ring_of_health',
 57: 'void_stone',
 58: 'my

In [42]:
def find_item(_id):
    return item_lookup.get(_id, 'u_' + str(_id))

In [43]:
# Map every item slot from numeric ID to item name:
for col in ['item_0', 'item_1', 'item_2', 'item_3', 'item_4', 'item_5']:
    player_heroes_cleaned[col] = player_heroes_cleaned[col].apply(find_item)

In [44]:
player_heroes_cleaned.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unknown,9,3,18,1,30
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unknown,0,4,15,1,58
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unknown,radiance,power_treads,20,3,17,13,245


In [46]:
player_combos = pd.get_dummies(player_heroes_cleaned['hero'])
player_combos.head()

Unnamed: 0,Abaddon,Alchemist,Ancient Apparition,Anti-Mage,Axe,Bane,Batrider,Beastmaster,Bloodseeker,Bounty Hunter,...,Venomancer,Viper,Visage,Warlock,Weaver,Windranger,Winter Wyvern,Witch Doctor,Wraith King,Zeus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
item0 = pd.get_dummies(players['item_0'].fillna(0))
item1 = pd.get_dummies(players['item_1'].fillna(0))
item2 = pd.get_dummies(players['item_2'].fillna(0))
item3 = pd.get_dummies(players['item_3'].fillna(0))
item4 = pd.get_dummies(players['item_4'].fillna(0))
item5 = pd.get_dummies(players['item_5'].fillna(0))
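
The six dummy frames above are built per slot, so each can end up with a different set of columns, and the same item counts as a different feature depending on which slot it sits in. If slot position is not meant to matter (an assumption), the frames can be aligned and combined into a single "held this item in any slot" matrix. A minimal sketch on hypothetical toy data with three slots:

```python
import pandas as pd

item_cols = ["item_0", "item_1", "item_2"]  # the real frame has item_0 .. item_5
demo = pd.DataFrame({
    "item_0": [1, 46, 0],
    "item_1": [46, 1, 0],
    "item_2": [0, 63, 1],
})

# Every item ID that appears in any slot, so all dummy frames share one column set.
all_ids = sorted(pd.unique(demo[item_cols].to_numpy().ravel()))

# One aligned dummy frame per slot; IDs missing from a slot become all-zero columns.
per_slot = [
    pd.get_dummies(demo[col], dtype=int).reindex(columns=all_ids, fill_value=0)
    for col in item_cols
]

# Add the slots together and cap at 1: "did this player hold the item in any slot?"
item_flags = sum(per_slot).clip(upper=1)
print(item_flags)
```

ID `0` (the empty slot) still appears as a column here; whether to keep it as a feature is a separate decision.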

In [48]:
matches = match.copy()
matches.columns

Index(['match_id', 'start_time', 'duration', 'tower_status_radiant',
       'tower_status_dire', 'barracks_status_dire', 'barracks_status_radiant',
       'first_blood_time', 'game_mode', 'radiant_win', 'negative_votes',
       'positive_votes', 'cluster'],
      dtype='object')

In [49]:
match_result = matches[['match_id', 'radiant_win']]

In [52]:
dota_df = pd.merge(player_heroes_cleaned, match_result, on = 'match_id', how = 'left')
dota_df.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unknown,9,3,18,1,30,True
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109,True
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unknown,0,4,15,1,58,True
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271,True
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unknown,radiance,power_treads,20,3,17,13,245,True


In [53]:
dota_df[dota_df.match_id == 0]

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unknown,9,3,18,1,30,True
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109,True
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unknown,0,4,15,1,58,True
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271,True
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unknown,radiance,power_treads,20,3,17,13,245,True
5,0,4,106,Ember Spirit,128,bfury,bracer,lesser_crit,travel_boots,ring_of_aquila,Unknown,5,6,8,5,162,True
6,0,0,102,Abaddon,129,phase_boots,quelling_blade,force_staff,magic_wand,ancient_janggo,vladmir,4,13,5,2,107,True
7,0,5,46,Templar Assassin,130,bottle,power_treads,magic_wand,manta,desolator,ogre_axe,4,8,6,31,208,True
8,0,0,7,Earthshaker,131,magic_wand,Unknown,Unknown,tpscroll,Unknown,arcane_boots,1,14,8,0,27,True
9,0,6,73,Alchemist,132,power_treads,platemail,black_king_bar,hand_of_midas,solar_crest,mekansm,1,11,6,0,147,True


In [57]:
dota_df.player_slot.value_counts()

129    49999
128    49998
1      49998
132    49997
130    49996
0      49996
131    49995
3      49995
2      49995
4      49994
Name: player_slot, dtype: int64
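
After the merge, every one of the ten players in a match carries the same `radiant_win` value, including the players on the losing side. The slot values above fall into two groups (0-4 and 128-132), which suggests a per-player outcome can be derived. The sketch below assumes the usual Dota 2 convention that slots 0-4 are Radiant and 128-132 are Dire; the column names `is_radiant` and `win` are hypothetical.

```python
import pandas as pd

# Toy rows: one Radiant and one Dire player from each of two matches.
demo = pd.DataFrame({
    "match_id": [0, 0, 1, 1],
    "player_slot": [0, 128, 4, 132],
    "radiant_win": [True, True, False, False],
})

# Assumption: slots 0-4 are Radiant and 128-132 are Dire.
demo["is_radiant"] = demo["player_slot"] < 128

# A player wins when their side matches the winning side.
demo["win"] = demo["is_radiant"] == demo["radiant_win"]
print(demo)
```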

In [56]:
dota_df[dota_df.match_id == 20]

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win
200,20,112,71,Spirit Breaker,0,urn_of_shadows,lotus_orb,monkey_king_bar,power_treads,hand_of_midas,tpscroll,9,3,11,1,73,True
201,20,113,59,Huskar,1,armlet,solar_crest,black_king_bar,power_treads,assault,helm_of_the_dominator,14,9,10,19,153,True
202,20,114,95,Troll Warlord,2,phase_boots,tpscroll,helm_of_the_dominator,manta,skadi,invis_sword,4,5,6,8,229,True
203,20,115,21,Windranger,3,phase_boots,ultimate_orb,pers,blink,tpscroll,ultimate_scepter,6,4,11,6,51,True
204,20,0,7,Earthshaker,4,travel_boots,tpscroll,ward_dispenser,ghost,blink,force_staff,3,9,17,1,39,True
205,20,116,69,Doom,128,hyperstone,phase_boots,ultimate_scepter,platemail,hand_of_midas,silver_edge,10,8,3,0,150,True
206,20,117,31,Lich,129,tranquil_boots,ghost,force_staff,tpscroll,urn_of_shadows,Unknown,4,7,11,4,24,True
207,20,118,39,Queen of Pain,130,null_talisman,power_treads,bottle,ultimate_scepter,Unknown,null_talisman,7,13,10,7,137,True
208,20,119,8,Juggernaut,131,javelin,bfury,sange_and_yasha,javelin,poor_mans_shield,phase_boots,3,7,8,25,239,True
209,20,120,55,Dark Seer,132,pipe,soul_ring,blink,tpscroll,guardian_greaves,ogre_axe,6,3,10,2,199,True


In [58]:
dota_df[dota_df['hero'].isna()]

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win


In [59]:
# I should now be ready to split the data, and start exploring the data.

In [60]:
dota_df.groupby(['item_0', 'item_1', 'item_2', 'item_3', 'item_4', 'item_5'])['hero'].count()

item_0   item_1       item_2            item_3                 item_4        item_5      
Unknown  Unknown      Unknown           Unknown                Unknown       Unknown         4493
                                                                             aegis              2
                                                                             arcane_boots       2
                                                                             blink             10
                                                                             boots              4
                                                                                             ... 
yasha    wraith_band  tpscroll          javelin                power_treads  javelin            1
                      ultimate_scepter  Unknown                power_treads  Unknown            1
                      wraith_band       lifesteal              lesser_crit   power_treads       1
         yasha        armlet

In [61]:
item_only = dota_df[['match_id', 'account_id', 'hero', 'item_0', 'item_1', 'item_2', 'item_3', 'item_4', 'item_5']]

In [62]:
item_only.hero.isnull().sum()

0

In [63]:
# Using the pandas unstack on the original df:
dota_unstack = item_only.unstack()

In [64]:
# Printing the shape and series:
dota_unstack.shape, dota_unstack

((4499667,),
 match_id  0                    0
           1                    0
           2                    0
           3                    0
           4                    0
                         ...     
 item_5    499958       desolator
           499959       butterfly
           499960           aegis
           499961        radiance
           499962    greater_crit
 Length: 4499667, dtype: object)

In [68]:
dota_unstack[dota_unstack == "Unknown"].count()

343738

In [69]:
dota_unstack[dota_unstack != "Unknown"].count()

4155929

In [71]:
4155929 + 343738

4499667

In [65]:
dota_item_clean = dota_unstack[dota_unstack != "Unknown"]

In [79]:
df_test = pd.DataFrame(dota_item_clean)
df_test

Unnamed: 0,Unnamed: 1,0
match_id,0,0
match_id,1,0
match_id,2,0
match_id,3,0
match_id,4,0
...,...,...
item_5,499958,desolator
item_5,499959,butterfly
item_5,499960,aegis
item_5,499961,radiance


In [80]:
df_test.index

MultiIndex([('match_id',      0),
            ('match_id',      1),
            ('match_id',      2),
            ('match_id',      3),
            ('match_id',      4),
            ('match_id',      5),
            ('match_id',      6),
            ('match_id',      7),
            ('match_id',      8),
            ('match_id',      9),
            ...
            (  'item_5', 499953),
            (  'item_5', 499954),
            (  'item_5', 499955),
            (  'item_5', 499956),
            (  'item_5', 499957),
            (  'item_5', 499958),
            (  'item_5', 499959),
            (  'item_5', 499960),
            (  'item_5', 499961),
            (  'item_5', 499962)],
           length=4155929)

4499667

In [None]:
# Attempting to put series into a df so that I can drop rows?

dota_unstack_df = pd.DataFrame(dota_unstack)
dota_unstack_df
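
Calling `unstack()` on the whole frame stacks every column, identifiers included, into one long Series, which is why its length is 9 columns times 499,963 rows = 4,499,667. If the goal is one row per (player, item) with the identifiers kept alongside, `DataFrame.melt` may be a better fit. A minimal sketch on hypothetical toy data with two item slots; `slot` and `item` are names chosen here for illustration:

```python
import pandas as pd

demo = pd.DataFrame({
    "match_id": [0, 0],
    "hero": ["Rubick", "Spectre"],
    "item_0": ["arcane_boots", "heart"],
    "item_1": ["ghost", "Unknown"],
})

# Keep the identifier columns; turn the item_* columns into (slot, item) pairs.
long_items = demo.melt(
    id_vars=["match_id", "hero"],
    value_vars=["item_0", "item_1"],
    var_name="slot",
    value_name="item",
)

# Because the result is an ordinary DataFrame, dropping empty slots is a plain filter.
long_items = long_items[long_items["item"] != "Unknown"].reset_index(drop=True)
print(long_items)
```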

In [None]:
dota_unstack

In [None]:
dota_df.unstack()

### Splitting into Train, Validate, and Test

In [None]:
def df_split(df):
    train_validate, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.radiant_win)
    train, validate = train_test_split(train_validate, test_size=.3, random_state=123, stratify=train_validate.radiant_win)
    return train, validate, test
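
Because `dota_df` holds ten player rows per match, all sharing one `radiant_win` value, a row-level split can place players from the same match in both train and test, which may leak the label. One alternative is to split on `match_id` so each match lands in exactly one subset. The helper below, `split_by_match`, is a hypothetical sketch using only NumPy and pandas; it mirrors the 20% / 30% proportions of `df_split` but does not stratify. scikit-learn's `GroupShuffleSplit` is a library option for the same idea.

```python
import numpy as np
import pandas as pd

def split_by_match(df, test_size=0.2, validate_size=0.3, seed=123):
    """Hypothetical helper: split rows so each match_id lands in exactly one subset."""
    rng = np.random.default_rng(seed)
    match_ids = rng.permutation(df["match_id"].unique())

    # Sizes are counted in matches, not rows.
    n_test = int(round(len(match_ids) * test_size))
    n_validate = int(round((len(match_ids) - n_test) * validate_size))

    test_ids = match_ids[:n_test]
    validate_ids = match_ids[n_test:n_test + n_validate]
    held_out = match_ids[:n_test + n_validate]

    test = df[df["match_id"].isin(test_ids)]
    validate = df[df["match_id"].isin(validate_ids)]
    train = df[~df["match_id"].isin(held_out)]
    return train, validate, test

# Toy data: 10 matches with 10 player rows each.
demo = pd.DataFrame({
    "match_id": np.repeat(np.arange(10), 10),
    "radiant_win": np.repeat([True, False] * 5, 10),
})
train, validate, test = split_by_match(demo)
print(train.shape, validate.shape, test.shape)
```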

In [None]:
train, validate, test = df_split(dota_df)
print(train.shape, validate.shape, test.shape)

In [None]:
dota_df.radiant_win.value_counts(normalize = True).plot(kind = "bar")

## Explore

Questions I would like to answer:

- Is there a common item bought by winning teams?
- Is there a common set of items bought by winning teams?
- Is there an average player skill level distinct to winning teams (hypothesis t-test...?)
- Are there player K/D ratios that lead to a higher win %?
- Does Radiant or Dire win more often? Is that difference random, or can a feature be developed from the team side?
- Create visuals of the most popular heroes picked over time (2012 - 2015).
- If I can get more data from the OpenDota API, add it to the already existing data.
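
For the K/D question, the ratio has to handle players who finish with zero deaths. A minimal sketch, using the kills/deaths/assists from the first three rows of `player_heroes_cleaned.head()` plus one made-up zero-death row; treating zero deaths as one death is a common convention but is an assumption here, and `kd_ratio` / `kda_ratio` are hypothetical column names.

```python
import pandas as pd

demo = pd.DataFrame({
    "kills": [9, 13, 0, 5],
    "deaths": [3, 3, 4, 0],   # last row is a made-up zero-death game
    "assists": [18, 18, 15, 2],
})

# Avoid division by zero: treat deaths as at least 1.
safe_deaths = demo["deaths"].clip(lower=1)

demo["kd_ratio"] = demo["kills"] / safe_deaths
demo["kda_ratio"] = (demo["kills"] + demo["assists"]) / safe_deaths
print(demo)
```

These statistics are recorded at the end of a match, so they describe how a game went; they are not known before it starts.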

#### Other things to explore:

- Which heroes have a low pick % but a high win %?
- A high win rate for a hero is anything above 50%, since the developers spend a lot of time trying to balance the game.
- Look at Dotabuff/Dota Plus. It'll give some good numbers on hero pick rate vs. win rate.
- How do I want to visualize "winning"? Do I want to consider a Radiant win as a "win"?
- I think my baseline should be Radiant wins overall; that would be an interesting baseline to use...
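
The pick-rate versus win-rate comparison can be sketched with a single `groupby`. This assumes a per-player boolean `win` column, which `dota_df` does not have as built above (it would need to be derived from `radiant_win` and the player's side). The data, the 0.4 pick-rate cutoff, and the name `sleepers` are all illustrative; pick rate is defined here as a hero's share of all picks.

```python
import pandas as pd

demo = pd.DataFrame({
    "hero": ["Windranger", "Windranger", "Windranger", "Visage", "Visage", "Axe"],
    "win":  [True, False, False, True, True, False],
})

# Picks and win rate per hero (the mean of a boolean column is the share of True).
hero_stats = demo.groupby("hero")["win"].agg(picks="count", win_rate="mean")
hero_stats["pick_rate"] = hero_stats["picks"] / len(demo)

# Heroes picked rarely (illustrative cutoff) that still win more than half the time.
sleepers = hero_stats[(hero_stats["pick_rate"] < 0.4) & (hero_stats["win_rate"] > 0.5)]
print(sleepers)
```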


In [None]:
# What is my baseline?

train
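
One common baseline for a classifier is to always predict the most frequent class; its accuracy is that class's share of the training rows. A minimal sketch on a hypothetical five-row stand-in for `train` (the real proportions are not shown in this notebook):

```python
import pandas as pd

# Toy stand-in for the training split.
train_demo = pd.DataFrame({"radiant_win": [True, True, True, False, False]})

# Share of each class; the largest share is the accuracy of always predicting that class.
class_share = train_demo["radiant_win"].value_counts(normalize=True)
baseline_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"Always predicting radiant_win={baseline_class} gives {baseline_accuracy:.0%} accuracy")
```

Any model trained later would need to beat this number on the validate split to be worth keeping.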