## Dota Match Prediction Project

- Source/Credit: The data for this project comes from a Kaggle dataset last updated 1 year ago by Devin Anzelmo.
- The dataset is available on Kaggle at: https://www.kaggle.com/devinanzelmo/dota-2-matches

In [104]:
# Importing the libraries:
import pandas as pd
import numpy as np
from math import sqrt
from scipy import stats

# visualizing
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# plt.rc('figure', figsize=(13, 10))
# plt.rc('font', size=14)

# preparing
import sklearn.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# modeling and evaluating
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, accuracy_score

# turn off warnings
import warnings
warnings.filterwarnings("ignore")

# acquiring
from pydataset import data

## Acquire

In [2]:
players = pd.read_csv("data/players.csv")
match = pd.read_csv("data/match.csv")
heroes = pd.read_csv("data/hero_names.csv")
items = pd.read_csv("data/item_ids.csv")
test_player = pd.read_csv("data/test_player.csv")
test_label = pd.read_csv("data/test_labels.csv")

In [3]:
# Additional data to be joined (as needed):

# outcomes = pd.read_csv("data/match_outcomes.csv")
# player_rating = pd.read_csv("data/player_ratings.csv")
# objectives = pd.read_csv("data/objectives.csv")

In [4]:
players.hero_id.value_counts().head(10)

21     20881
11     17007
74     11676
7      11323
28     11181
39     10590
8      10394
100    10306
73      9823
14      9447
Name: hero_id, dtype: int64

In [28]:
# Taking a quick look at the top 10 most-picked heroes:

# heroes[(heroes.hero_id == 21) | (heroes.hero_id == 11) | (heroes.hero_id == 74) | (heroes.hero_id == 7) | (heroes.hero_id == 28)]
top_ten = [21, 11, 74, 7, 28, 39, 8, 100, 73, 14]
heroes[heroes.hero_id.isin(top_ten)]

Unnamed: 0,name,hero_id,localized_name
6,npc_dota_hero_earthshaker,7,Earthshaker
7,npc_dota_hero_juggernaut,8,Juggernaut
10,npc_dota_hero_nevermore,11,Shadow Fiend
13,npc_dota_hero_pudge,14,Pudge
20,npc_dota_hero_windrunner,21,Windranger
26,npc_dota_hero_slardar,28,Slardar
37,npc_dota_hero_queenofpain,39,Queen of Pain
71,npc_dota_hero_alchemist,73,Alchemist
72,npc_dota_hero_invoker,74,Invoker
98,npc_dota_hero_tusk,100,Tusk


#### Takeaways:

- I've identified the ten most frequently picked heroes.
- I still need to answer the questions posed in my Prep section below.

#### Removing the 37 rows that don't have a hero_id attached to them.
- In the interest of time, I'm simply going to drop these 37 rows out of 500,000.

In [6]:
# Dropping hero_id == 0
player = players.copy()
players = players[players.hero_id != 0]
players.shape, player.shape

((499963, 73), (500000, 73))

## Prep

- Key points I need to answer:
    - What is the time scale? I think it's either in seconds or minutes. Probably seconds.
    - How is 'player skill' determined, and is there a better set of features to create a "player skill" feature?
    - I need to join these tables; are there different types of data, i.e., are there time-series tables vs. static tables that I need to make sure I'm not mixing?
    - Is there a specific combination of heroes and items that makes for a match-winning combination? That's the goal, so how do I prep the data to get those features in a df?
    

In [7]:
# First off, need to join the heroes df to my players df so that I have all the names of the heroes together.

In [8]:
# Checking first that there are no nulls 
players[players.hero_id.isna()]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue


In [9]:
# Now need to add the list of heroes full names to main df:

In [10]:
player_heroes = pd.merge(players, heroes, left_on = 'hero_id', right_on = 'hero_id', how = 'left')
player_heroes.head()

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,name,localized_name
0,0,0,86,0,3261,10960,347,362,9,3,...,,6.0,,,,,,,npc_dota_hero_rubick,Rubick
1,0,1,51,1,2954,17760,494,659,13,3,...,,14.0,,,,,,,npc_dota_hero_rattletrap,Clockwerk
2,0,0,83,2,110,12195,350,385,0,4,...,,17.0,,,,,,,npc_dota_hero_treant,Treant Protector
3,0,2,11,3,1179,22505,599,605,8,4,...,,13.0,,,,,,,npc_dota_hero_nevermore,Shadow Fiend
4,0,3,67,4,3307,23825,613,762,20,3,...,,23.0,,,,,,,npc_dota_hero_spectre,Spectre
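
A left merge leaves `NaN` hero names for any `hero_id` missing from the lookup table, and it duplicates rows if the lookup has repeated keys, both without raising an error. Below is a minimal sketch of how both cases could be checked with pandas' `validate` and `indicator` merge options. The toy frames are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the players and heroes tables (illustrative values only).
toy_players = pd.DataFrame({"match_id": [0, 0, 1], "hero_id": [86, 51, 0]})
toy_heroes = pd.DataFrame({"hero_id": [86, 51],
                           "localized_name": ["Rubick", "Clockwerk"]})

merged = pd.merge(
    toy_players,
    toy_heroes,
    on="hero_id",
    how="left",
    validate="many_to_one",  # raises MergeError if hero_id repeats in toy_heroes
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows whose hero_id had no match in the lookup table:
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

If `unmatched` is empty and the row count is unchanged, the join added names without dropping or duplicating players.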


In [11]:
player_heroes.drop(columns = ['name'], inplace = True)
player_heroes.rename(columns = {"localized_name": "hero"}, inplace = True)

In [12]:
player_heroes[player_heroes.hero_id != 0]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,hero
0,0,0,86,0,3261,10960,347,362,9,3,...,,,6.0,,,,,,,Rubick
1,0,1,51,1,2954,17760,494,659,13,3,...,,,14.0,,,,,,,Clockwerk
2,0,0,83,2,110,12195,350,385,0,4,...,,,17.0,,,,,,,Treant Protector
3,0,2,11,3,1179,22505,599,605,8,4,...,,,13.0,,,,,,,Shadow Fiend
4,0,3,67,4,3307,23825,613,762,20,3,...,,,23.0,,,,,,,Spectre
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499958,49999,0,100,128,2718,17735,468,626,16,9,...,,,6.0,,,,,,,Tusk
499959,49999,0,9,129,3755,20815,507,607,12,6,...,,,15.0,,,,,,,Mirana
499960,49999,0,90,130,1059,16225,371,404,5,3,...,,,2.0,,,,,,,Keeper of the Light
499961,49999,0,73,131,3165,31015,780,703,8,6,...,,,4.0,,,,,,,Alchemist


#### Dropping unneeded columns

- Since the point of this modeling is going to revolve around directly modeling heroes and items, I don't need columns that are only indirectly related to those potential features. Thus, I'm going to drop any `unit_order` or `gold` columns. I need to get the dataframe down to a more manageable size.

In [13]:
# Instead of dropping columns, I'm simply assigning the columns I want to keep:

player_heroes_cleaned = player_heroes[['match_id', 'account_id', 'hero_id', 'hero', 'player_slot', 
                                       'item_0', 'item_1', 'item_2', 'item_3', 'item_4', 'item_5', 
                                       'kills', 'deaths', 'assists', 'denies', 'last_hits']]

In [14]:
columns_reduced = player_heroes.shape[1] - player_heroes_cleaned.shape[1]
print(f'Reduced the number of columns by {columns_reduced}.')

Reduced the number of columns by 58.


In [15]:
player_heroes_cleaned.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits
0,0,0,86,Rubick,0,180,37,73,56,108,0,9,3,18,1,30
1,0,1,51,Clockwerk,1,46,63,119,102,24,108,13,3,18,9,109
2,0,0,83,Treant Protector,2,48,60,59,108,65,0,0,4,15,1,58
3,0,2,11,Shadow Fiend,3,63,147,154,164,79,160,8,4,19,6,271
4,0,3,67,Spectre,4,114,92,147,0,137,63,20,3,17,13,245


In [16]:
item_lookup = dict(zip(items['item_id'], items['item_name']))
item_lookup[0] = "Unkown"

In [17]:
def find_item(_id):
    return item_lookup.get(_id, 'u_' + str(_id))

In [18]:
# Replace item IDs with item names in each of the six inventory slots:
for col in ['item_0', 'item_1', 'item_2', 'item_3', 'item_4', 'item_5']:
    player_heroes_cleaned[col] = player_heroes_cleaned[col].apply(find_item)

In [19]:
player_heroes_cleaned.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unkown,9,3,18,1,30
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unkown,0,4,15,1,58
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unkown,radiance,power_treads,20,3,17,13,245


In [20]:
player_combos = pd.get_dummies(player_heroes_cleaned['hero'])
player_combos.head()

Unnamed: 0,Abaddon,Alchemist,Ancient Apparition,Anti-Mage,Axe,Bane,Batrider,Beastmaster,Bloodseeker,Bounty Hunter,...,Venomancer,Viper,Visage,Warlock,Weaver,Windranger,Winter Wyvern,Witch Doctor,Wraith King,Zeus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
item0 = pd.get_dummies(players['item_0'].fillna(0))
item1 = pd.get_dummies(players['item_1'].fillna(0))
item2 = pd.get_dummies(players['item_2'].fillna(0))
item3 = pd.get_dummies(players['item_3'].fillna(0))
item4 = pd.get_dummies(players['item_4'].fillna(0))
item5 = pd.get_dummies(players['item_5'].fillna(0))

In [22]:
matches = match.copy()
matches.columns

Index(['match_id', 'start_time', 'duration', 'tower_status_radiant',
       'tower_status_dire', 'barracks_status_dire', 'barracks_status_radiant',
       'first_blood_time', 'game_mode', 'radiant_win', 'negative_votes',
       'positive_votes', 'cluster'],
      dtype='object')

In [23]:
match_result = matches[['match_id', 'radiant_win']]

In [24]:
dota_df = pd.merge(player_heroes_cleaned, match_result, left_on = 'match_id', right_on = 'match_id', how = 'left')
dota_df.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unkown,9,3,18,1,30,True
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109,True
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unkown,0,4,15,1,58,True
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271,True
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unkown,radiance,power_treads,20,3,17,13,245,True


In [25]:
dota_df[dota_df.match_id == 0]

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unkown,9,3,18,1,30,True
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109,True
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unkown,0,4,15,1,58,True
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271,True
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unkown,radiance,power_treads,20,3,17,13,245,True
5,0,4,106,Ember Spirit,128,bfury,bracer,lesser_crit,travel_boots,ring_of_aquila,Unkown,5,6,8,5,162,True
6,0,0,102,Abaddon,129,phase_boots,quelling_blade,force_staff,magic_wand,ancient_janggo,vladmir,4,13,5,2,107,True
7,0,5,46,Templar Assassin,130,bottle,power_treads,magic_wand,manta,desolator,ogre_axe,4,8,6,31,208,True
8,0,0,7,Earthshaker,131,magic_wand,Unkown,Unkown,tpscroll,Unkown,arcane_boots,1,14,8,0,27,True
9,0,6,73,Alchemist,132,power_treads,platemail,black_king_bar,hand_of_midas,solar_crest,mekansm,1,11,6,0,147,True


In [26]:
dota_df[dota_df.match_id == 1]

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win
10,1,0,7,Earthshaker,0,Unkown,Unkown,Unkown,Unkown,Unkown,Unkown,3,4,9,0,36,False
11,1,7,82,Meepo,1,travel_boots,Unkown,ethereal_blade,sheepstick,blink,ultimate_scepter,9,10,8,9,343,False
12,1,0,71,Spirit Breaker,2,mask_of_madness,orb_of_venom,power_treads,black_king_bar,platemail,bracer,5,13,11,3,76,False
13,1,8,39,Queen of Pain,3,black_king_bar,blade_mail,power_treads,tpscroll,cyclone,Unkown,12,15,9,8,169,False
14,1,4,21,Windranger,4,phase_boots,magic_wand,bottle,desolator,ultimate_scepter,ogre_axe,6,11,12,7,131,False
15,1,9,73,Alchemist,128,soul_ring,radiance,blink,ultimate_orb,soul_booster,travel_boots,2,12,24,2,220,False
16,1,0,22,Zeus,129,dagon_2,blink,veil_of_discord,travel_boots,hand_of_midas,octarine_core,8,5,17,0,193,False
17,1,0,5,Crystal Maiden,130,glimmer_cape,force_staff,Unkown,black_king_bar,ultimate_scepter,travel_boots,14,8,9,0,101,False
18,1,5,67,Spectre,131,mjollnir,phase_boots,manta,diffusal_blade_2,heart,tpscroll,16,5,21,11,226,False
19,1,0,106,Ember Spirit,132,travel_boots,Unkown,blink,greater_crit,bfury,bfury,10,7,12,3,250,False


In [36]:
# Adding another feature:
# dota_df["k_d"] = (dota_df.)

dota_df['k_d'] = round((dota_df.kills / dota_df.deaths), 2)
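
One caveat with a plain kills/deaths ratio: a player with zero deaths produces `inf` (or `NaN` when kills are also zero), which can cause trouble for scaling and modeling later. Below is a minimal sketch of one possible workaround, assuming it is acceptable to treat zero deaths as one death. That convention is an assumption, not something the dataset prescribes, and `safe_kd` is a hypothetical helper name.

```python
import pandas as pd

def safe_kd(kills: pd.Series, deaths: pd.Series) -> pd.Series:
    """Kill/death ratio that treats 0 deaths as 1 to avoid inf/NaN (assumption)."""
    return (kills / deaths.clip(lower=1)).round(2)

# Toy data: the last two rows would give NaN and inf with a plain division.
demo = pd.DataFrame({"kills": [9, 0, 7], "deaths": [3, 0, 0]})
demo["k_d"] = safe_kd(demo["kills"], demo["deaths"])
print(demo)
```

An alternative would be to keep the plain ratio and drop or impute the non-finite rows before modeling; which is better depends on how the feature ends up being used.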

In [38]:
dota_df['top_ten_hero'] = dota_df.hero_id.isin(top_ten).astype(int)
dota_df.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win,top_ten_hero,k_d
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unkown,9,3,18,1,30,True,0,3.0
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109,True,0,4.33
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unkown,0,4,15,1,58,True,0,0.0
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271,True,1,2.0
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unkown,radiance,power_treads,20,3,17,13,245,True,0,6.67


In [39]:
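# Checking for missing values in item_5.
# Note: `== None` never matches element-wise in pandas, so isna() is used instead.
dota_df[dota_df['item_5'].isna()].sum()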

match_id        0.0
account_id      0.0
hero_id         0.0
hero            0.0
player_slot     0.0
item_0          0.0
item_1          0.0
item_2          0.0
item_3          0.0
item_4          0.0
item_5          0.0
kills           0.0
deaths          0.0
assists         0.0
denies          0.0
last_hits       0.0
radiant_win     0.0
top_ten_hero    0.0
k_d             0.0
dtype: float64

# Carving off needed columns for actual train/validate/test dataframe

In [40]:
dota = dota_df[['match_id', 'account_id', 'hero_id', 'hero', 'top_ten_hero', 'player_slot', 'kills', 'deaths', 'k_d', 'radiant_win']]
dota.head()

Unnamed: 0,match_id,account_id,hero_id,hero,top_ten_hero,player_slot,kills,deaths,k_d,radiant_win
0,0,0,86,Rubick,0,0,9,3,3.0,True
1,0,1,51,Clockwerk,0,1,13,3,4.33,True
2,0,0,83,Treant Protector,0,2,0,4,0.0,True
3,0,2,11,Shadow Fiend,1,3,8,4,2.0,True
4,0,3,67,Spectre,0,4,20,3,6.67,True


### Creating a target variable

I have to create a new y, or target variable, that indicates whether each player was on the winning team. The original dataset only records `radiant_win` per match and doesn't explicitly label which players were Radiant and which were Dire, so I couldn't tell directly which players won.

In [41]:
dota['is_radiant'] = pd.cut(dota.player_slot, bins = [-1,100,200], labels = [1, 0])
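
For context, `player_slot` in Dota 2 match data encodes the team in its high bit: slots 0-127 are Radiant and 128-255 are Dire (0-4 and 128-132 in this data). Below is a minimal sketch of an equivalent way to derive the flag as a plain integer column rather than a categorical; `slot_is_radiant` is a hypothetical helper name.

```python
import pandas as pd

def slot_is_radiant(player_slot: pd.Series) -> pd.Series:
    # The 128 bit is set for Dire players, so Radiant slots are below 128.
    return (player_slot < 128).astype(int)

slots = pd.Series([0, 1, 4, 128, 132])
print(slot_is_radiant(slots).tolist())
```

An integer column compares directly against an integer-encoded `radiant_win`, which avoids relying on how pandas compares a categorical with an int column.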

In [42]:
dota.groupby(['is_radiant', 'radiant_win'])['kills'].mean()

is_radiant  radiant_win
1           False          5.939393
            True           8.745035
0           False          9.142906
            True           5.812356
Name: kills, dtype: float64

In [43]:
dota.groupby(['is_radiant', 'radiant_win'])['kills'].median()

is_radiant  radiant_win
1           False          5
            True           8
0           False          8
            True           5
Name: kills, dtype: int64

In [44]:
# I can also verify with elims, last_hits, etc...

In [45]:
dota['radiant_win'] = dota.radiant_win.astype(int)

In [46]:
dota['win'] = dota.radiant_win == dota.is_radiant

In [47]:
dota.win.value_counts(normalize = True)

True     0.500025
False    0.499975
Name: win, dtype: float64

In [48]:
dota['win'] = dota.win.astype(int)

In [49]:
dota.head()

Unnamed: 0,match_id,account_id,hero_id,hero,top_ten_hero,player_slot,kills,deaths,k_d,radiant_win,is_radiant,win
0,0,0,86,Rubick,0,0,9,3,3.0,1,1,1
1,0,1,51,Clockwerk,0,1,13,3,4.33,1,1,1
2,0,0,83,Treant Protector,0,2,0,4,0.0,1,1,1
3,0,2,11,Shadow Fiend,1,3,8,4,2.0,1,1,1
4,0,3,67,Spectre,0,4,20,3,6.67,1,1,1


In [50]:
# Final cleanup of hero column:

dota['hero'] = dota['hero'].str.lower()

In [51]:
dota

Unnamed: 0,match_id,account_id,hero_id,hero,top_ten_hero,player_slot,kills,deaths,k_d,radiant_win,is_radiant,win
0,0,0,86,rubick,0,0,9,3,3.00,1,1,1
1,0,1,51,clockwerk,0,1,13,3,4.33,1,1,1
2,0,0,83,treant protector,0,2,0,4,0.00,1,1,1
3,0,2,11,shadow fiend,1,3,8,4,2.00,1,1,1
4,0,3,67,spectre,0,4,20,3,6.67,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
499958,49999,0,100,tusk,1,128,16,9,1.78,0,0,1
499959,49999,0,9,mirana,0,129,12,6,2.00,0,0,1
499960,49999,0,90,keeper of the light,0,130,5,3,1.67,0,0,1
499961,49999,0,73,alchemist,1,131,8,6,1.33,0,0,1


In [52]:
dota_dummy = pd.get_dummies(dota[['hero']], dummy_na = False)
dota_dummy

Unnamed: 0,hero_abaddon,hero_alchemist,hero_ancient apparition,hero_anti-mage,hero_axe,hero_bane,hero_batrider,hero_beastmaster,hero_bloodseeker,hero_bounty hunter,...,hero_venomancer,hero_viper,hero_visage,hero_warlock,hero_weaver,hero_windranger,hero_winter wyvern,hero_witch doctor,hero_wraith king,hero_zeus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499958,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
499959,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
499960,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
499961,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [53]:
dota = pd.concat([dota, dota_dummy], axis = 1)
dota.head()

Unnamed: 0,match_id,account_id,hero_id,hero,top_ten_hero,player_slot,kills,deaths,k_d,radiant_win,...,hero_venomancer,hero_viper,hero_visage,hero_warlock,hero_weaver,hero_windranger,hero_winter wyvern,hero_witch doctor,hero_wraith king,hero_zeus
0,0,0,86,rubick,0,0,9,3,3.0,1,...,0,0,0,0,0,0,0,0,0,0
1,0,1,51,clockwerk,0,1,13,3,4.33,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,83,treant protector,0,2,0,4,0.0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,2,11,shadow fiend,1,3,8,4,2.0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,3,67,spectre,0,4,20,3,6.67,1,...,0,0,0,0,0,0,0,0,0,0


In [54]:
dota.drop(columns = ['hero_id', 'radiant_win'], inplace = True)
dota.head()

Unnamed: 0,match_id,account_id,hero,top_ten_hero,player_slot,kills,deaths,k_d,is_radiant,win,...,hero_venomancer,hero_viper,hero_visage,hero_warlock,hero_weaver,hero_windranger,hero_winter wyvern,hero_witch doctor,hero_wraith king,hero_zeus
0,0,0,rubick,0,0,9,3,3.0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,1,clockwerk,0,1,13,3,4.33,1,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,treant protector,0,2,0,4,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,0,2,shadow fiend,1,3,8,4,2.0,1,1,...,0,0,0,0,0,0,0,0,0,0
4,0,3,spectre,0,4,20,3,6.67,1,1,...,0,0,0,0,0,0,0,0,0,0


## I need to come back and scale the kills, deaths, and k_d once I figure out how to. **But right now, it's a race to the MVP finish line.**
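
A minimal sketch of how that scaling could be done later, assuming the train/validate/test split already exists and that `MinMaxScaler` is the scaler of choice. The key point is to fit the scaler on train only and then apply the same transform to validate and test. `scale_columns` and the toy frames are hypothetical, and `k_d` would need to be free of `inf`/`NaN` values first.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_columns(train, validate, test, cols):
    """Fit a MinMaxScaler on train only, then transform all three splits."""
    scaler = MinMaxScaler().fit(train[cols])
    scaled_splits = []
    for df in (train, validate, test):
        # Cast to float first so the scaled values fit the column dtype.
        scaled = df.astype({c: float for c in cols})
        scaled[cols] = scaler.transform(df[cols])
        scaled_splits.append(scaled)
    return scaled_splits

# Toy splits (illustrative values only).
toy_train = pd.DataFrame({"kills": [0, 5, 10], "deaths": [2, 4, 6]})
toy_validate = pd.DataFrame({"kills": [5], "deaths": [6]})
toy_test = pd.DataFrame({"kills": [20], "deaths": [2]})

tr, va, te = scale_columns(toy_train, toy_validate, toy_test, ["kills", "deaths"])
print(tr, va, te, sep="\n\n")
```

Because the scaler only sees train, values in validate or test can fall outside the 0 to 1 range (the toy test row has more kills than any train row). That is expected and avoids leaking information from the held-out splits.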

In [None]:
dota.columns.tolist()

Now I should be done with all the data prep; I have my independent variables and my target variable, the `win` column, which has also been encoded.

### Splitting into Train, Validate, and Test

Splitting now so that I explore only on the train set.

In [55]:
def df_split(df):
    train_validate, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.win)
    train, validate = train_test_split(train_validate, test_size=.3, random_state=123, stratify=train_validate.win)
    return train, validate, test

In [56]:
train, validate, test = df_split(dota)
print(train.shape, validate.shape, test.shape)

(279979, 120) (119991, 120) (99993, 120)


In [57]:
train.win.value_counts(normalize = True), dota.win.value_counts(normalize = True)

(1    0.500027
 0    0.499973
 Name: win, dtype: float64,
 1    0.500025
 0    0.499975
 Name: win, dtype: float64)

In [58]:
players.columns.tolist()

['match_id',
 'account_id',
 'hero_id',
 'player_slot',
 'gold',
 'gold_spent',
 'gold_per_min',
 'xp_per_min',
 'kills',
 'deaths',
 'assists',
 'denies',
 'last_hits',
 'stuns',
 'hero_damage',
 'hero_healing',
 'tower_damage',
 'item_0',
 'item_1',
 'item_2',
 'item_3',
 'item_4',
 'item_5',
 'level',
 'leaver_status',
 'xp_hero',
 'xp_creep',
 'xp_roshan',
 'xp_other',
 'gold_other',
 'gold_death',
 'gold_buyback',
 'gold_abandon',
 'gold_sell',
 'gold_destroying_structure',
 'gold_killing_heros',
 'gold_killing_creeps',
 'gold_killing_roshan',
 'gold_killing_couriers',
 'unit_order_none',
 'unit_order_move_to_position',
 'unit_order_move_to_target',
 'unit_order_attack_move',
 'unit_order_attack_target',
 'unit_order_cast_position',
 'unit_order_cast_target',
 'unit_order_cast_target_tree',
 'unit_order_cast_no_target',
 'unit_order_cast_toggle',
 'unit_order_hold_position',
 'unit_order_train_ability',
 'unit_order_drop_item',
 'unit_order_give_item',
 'unit_order_pickup_item',
 

In [None]:
X = dota.drop(columns = ['match_id',
                         'account_id',
                         'hero',
                         'player_slot',
                         'kills',
                         'deaths',
                         'is_radiant',
                         'win',])
X

In [None]:
x_train_and_validate, x_test = train_test_split(X, random_state=123)
x_train, x_validate = train_test_split(x_train_and_validate)

In [None]:
X = dota[['hero_abaddon',
 'hero_alchemist',
 'hero_ancient apparition',
 'hero_anti-mage',
 'hero_axe',
 'hero_bane',
 'hero_batrider',
 'hero_beastmaster',
 'hero_bloodseeker',
 'hero_bounty hunter',
 'hero_brewmaster',
 'hero_bristleback',
 'hero_broodmother',
 'hero_centaur warrunner',
 'hero_chaos knight',
 'hero_chen',
 'hero_clinkz',
 'hero_clockwerk',
 'hero_crystal maiden',
 'hero_dark seer',
 'hero_dazzle',
 'hero_death prophet',
 'hero_disruptor',
 'hero_doom',
 'hero_dragon knight',
 'hero_drow ranger',
 'hero_earth spirit',
 'hero_earthshaker',
 'hero_elder titan',
 'hero_ember spirit',
 'hero_enchantress',
 'hero_enigma',
 'hero_faceless void',
 'hero_gyrocopter',
 'hero_huskar',
 'hero_invoker',
 'hero_io',
 'hero_jakiro',
 'hero_juggernaut',
 'hero_keeper of the light',
 'hero_kunkka',
 'hero_legion commander',
 'hero_leshrac',
 'hero_lich',
 'hero_lifestealer',
 'hero_lina',
 'hero_lion',
 'hero_lone druid',
 'hero_luna',
 'hero_lycan',
 'hero_magnus',
 'hero_medusa',
 'hero_meepo',
 'hero_mirana',
 'hero_morphling',
 'hero_naga siren',
 "hero_nature's prophet",
 'hero_necrophos',
 'hero_night stalker',
 'hero_nyx assassin',
 'hero_ogre magi',
 'hero_omniknight',
 'hero_oracle',
 'hero_outworld devourer',
 'hero_phantom assassin',
 'hero_phantom lancer',
 'hero_phoenix',
 'hero_puck',
 'hero_pudge',
 'hero_pugna',
 'hero_queen of pain',
 'hero_razor',
 'hero_riki',
 'hero_rubick',
 'hero_sand king',
 'hero_shadow demon',
 'hero_shadow fiend',
 'hero_shadow shaman',
 'hero_silencer',
 'hero_skywrath mage',
 'hero_slardar',
 'hero_slark',
 'hero_sniper',
 'hero_spectre',
 'hero_spirit breaker',
 'hero_storm spirit',
 'hero_sven',
 'hero_techies',
 'hero_templar assassin',
 'hero_terrorblade',
 'hero_tidehunter',
 'hero_timbersaw',
 'hero_tinker',
 'hero_tiny',
 'hero_treant protector',
 'hero_troll warlord',
 'hero_tusk',
 'hero_undying',
 'hero_ursa',
 'hero_vengeful spirit',
 'hero_venomancer',
 'hero_viper',
 'hero_visage',
 'hero_warlock',
 'hero_weaver',
 'hero_windranger',
 'hero_winter wyvern',
 'hero_witch doctor',
 'hero_wraith king',
 'hero_zeus']]

In [None]:
X

In [None]:
dota.columns.tolist()

In [None]:
dota.drop(columns = ['match_id',
 'account_id',
 'hero',
 'player_slot',
 'kills',
 'deaths',
 'k_d',
 'is_radiant',
 'win',])

## Explore

Questions I would like to answer:

- Is there a common item bought by winning teams?
- Is there a common set of items bought by winning teams?
- Is there an average player skill level distinct to winning teams (hypothesis t-test?)
- Are there player K/D ratios that lead to a higher win %?
- Does the Radiant or the Dire team win more often? Is that random, or is it something a feature can be developed from?
- Create visuals of the most popular heroes picked over time (2012 - 2015).
- If I can get more data from the OpenDota API, add it to the already existing data.

#### Hypotheses:

1.  Does the mean 

#### Other things to explore:

- Which heroes have a low pick % but a high win %?
- A high win rate for a hero is > 50%. The game's developers spend a lot of time trying to balance the game.
- Look at Dotabuff/Dota Plus. It'll give some good pick rate vs. win rate data.
- How do I want to visualize "winning"? Do I want to consider a Radiant win as a "win"?
- I think my baseline should be Radiant wins overall; that would be an interesting baseline to use...
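
The pick % vs. win % question above could be explored with a simple per-hero summary. This is a minimal sketch, assuming a dataframe with one row per player, a `hero` column, and a 0/1 `win` column (as built in Prep). `hero_pick_and_win_rates` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def hero_pick_and_win_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Per-hero pick count, pick rate, and win rate, sorted by win rate."""
    summary = df.groupby("hero")["win"].agg(picks="count", win_rate="mean")
    summary["pick_rate"] = summary["picks"] / len(df)
    return summary.sort_values("win_rate", ascending=False)

toy = pd.DataFrame({"hero": ["pudge", "pudge", "pudge", "chen"],
                    "win": [1, 0, 0, 1]})
rates = hero_pick_and_win_rates(toy)
print(rates)
```

Running this on `train` only would keep validate and test untouched during exploration. Heroes with very few picks will show noisy win rates, so a minimum-picks filter is probably worth adding.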


In [None]:
# Baseline: `win` is balanced by construction (five winners and five losers per match), so always predicting one class is right about 50% of the time.


train.win.value_counts(normalize = True).plot(kind = "bar")
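
To put a number on the baseline rather than only plotting it, the accuracy of always predicting the most frequent class can be computed directly. This is a minimal sketch; `baseline_accuracy` is a hypothetical helper and the toy series is made up.

```python
import pandas as pd

def baseline_accuracy(y: pd.Series) -> float:
    """Accuracy of always predicting the most frequent class in y."""
    return float(y.value_counts(normalize=True).max())

toy_y = pd.Series([1, 1, 0, 1])
print(baseline_accuracy(toy_y))
```

Calling `baseline_accuracy(train.win)` would give the figure that any model has to beat on this target.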

## Hypothesis Testing:


#### Hypothesis 1:
- $H_0$: The average K/D rate for the winning team's players is not different than the average K/D of the losing team's players
- $H_a$: The average K/D rate for the winning team's players is statistically different than the average K/D of the losing team's players

#### Hypothesis 2: Using a $\chi^2$ Test
- $H_0$: The heroes a team picks have no impact on (are independent of) the team's outcome in the match (whether that team wins or loses).

- $H_a$: The heroes a team picks *do* have an impact on (are not independent of) the team's outcome in the match.
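
A minimal sketch of how Hypothesis 2 could be tested with `scipy.stats.chi2_contingency`, assuming one row per player with a `hero` column and a 0/1 `win` column. This tests hero vs. outcome at the player level, which is a simplification of the team-level question. `hero_win_chi2`, the 0.05 alpha, and the toy data are all assumptions.

```python
import pandas as pd
from scipy import stats

def hero_win_chi2(df: pd.DataFrame, alpha: float = 0.05):
    """Chi-squared test of independence between hero and win."""
    observed = pd.crosstab(df["hero"], df["win"])
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    return chi2, p, dof, p < alpha

# Toy data: hero "a" wins 30 of 40 games, hero "b" wins 10 of 40.
toy = pd.DataFrame({
    "hero": ["a"] * 40 + ["b"] * 40,
    "win": [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30,
})
chi2, p, dof, reject_null = hero_win_chi2(toy)
print(f"chi2={chi2:.2f}, p={p:.4g}, dof={dof}, reject H0: {reject_null}")
```

With a few hundred thousand rows, even tiny differences in win rate are likely to come out as statistically significant, so the size of the effect matters as much as the p-value.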

In [59]:
train.head()

Unnamed: 0,match_id,account_id,hero,top_ten_hero,player_slot,kills,deaths,k_d,is_radiant,win,...,hero_venomancer,hero_viper,hero_visage,hero_warlock,hero_weaver,hero_windranger,hero_winter wyvern,hero_witch doctor,hero_wraith king,hero_zeus
191915,19192,8839,pudge,1,130,9,10,0.9,0,0,...,0,0,0,0,0,0,0,0,0,0
36408,3641,0,vengeful spirit,0,4,4,9,0.44,1,1,...,0,0,0,0,0,0,0,0,0,0
12827,1283,6348,night stalker,0,0,4,5,0.8,1,0,...,0,0,0,0,0,0,0,0,0,0
336629,33665,0,bane,0,2,6,4,1.5,1,1,...,0,0,0,0,0,0,0,0,0,0
30884,3089,0,invoker,1,0,13,6,2.17,1,1,...,0,0,0,0,0,0,0,0,0,0


### Hypothesis Test #1:

- $H_0$: The average K/D rate for the winning team's players is not different than the average K/D of the losing team's players
- $H_a$: The average K/D rate for the winning team's players is statistically different than the average K/D of the losing team's players
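
A minimal sketch of how this hypothesis could be tested with a two-sample t-test from `scipy.stats`, assuming a `k_d` column and a 0/1 `win` column. Welch's version (`equal_var=False`) is used here because equal variances between winners and losers are not guaranteed; that choice, the 0.05 alpha, the helper name `kd_ttest`, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

def kd_ttest(df: pd.DataFrame, alpha: float = 0.05):
    """Welch's t-test comparing mean K/D of winners vs. losers."""
    # Drop inf/NaN ratios (e.g. players with zero deaths) before testing.
    finite = df[np.isfinite(df["k_d"])]
    winners = finite.loc[finite["win"] == 1, "k_d"]
    losers = finite.loc[finite["win"] == 0, "k_d"]
    t, p = stats.ttest_ind(winners, losers, equal_var=False)
    return t, p, p < alpha

toy = pd.DataFrame({
    "k_d": [3.0, 4.0, 5.0, 6.0, np.inf, 0.5, 1.0, 1.5, 2.0],
    "win": [1, 1, 1, 1, 1, 0, 0, 0, 0],
})
t, p, reject_null = kd_ttest(toy)
print(f"t={t:.2f}, p={p:.4g}, reject H0: {reject_null}")
```

Running this on `train` keeps the test consistent with the rest of the exploration, which uses only the training split.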


In [106]:
X_train_log = train.drop(columns = ['match_id',
                                     'account_id',
                                     'hero',
                                     'player_slot',
                                     'k_d',
                                     'deaths',
                                     'is_radiant',
                                     'win'])
y_train_log = train[['win']]

# Validate dataset features:
X_validate_log = validate.drop(columns = ['match_id',
                                         'account_id',
                                         'hero',
                                         'player_slot',
                                         'k_d',
                                         'deaths',
                                         'is_radiant',
                                         'win'])
y_validate_log = validate[['win']]

# Test dataset features:
X_test_log = test.drop(columns = ['match_id',
                                 'account_id',
                                 'hero',
                                 'player_slot',
                                 'k_d',
                                 'deaths',
                                 'is_radiant',
                                 'win'])
y_test_log = test[['win']]

In [107]:
import sklearn.preprocessing

In [108]:
# Scaling the data:

scaler = sklearn.preprocessing.MinMaxScaler()

scaler.fit(X_train_log)

x_train_scaled_log = scaler.transform(X_train_log)
x_validate_scaled_log = scaler.transform(X_validate_log)
x_test_scaled_log = scaler.transform(X_test_log)

In [109]:
y_test_log

Unnamed: 0,win
48932,1
397521,1
195443,0
94504,0
186988,0
...,...
113844,1
70415,1
273306,0
222144,1


In [110]:
# Modeling practice using Logistic Regression Model, basic hyper-parameters

# Using a logistic regression model first:

# Only fit on my training dataset
logit = LogisticRegression(C = 1.0, random_state=123)

# Fitting the data to the train dataset:
logit.fit(x_train_scaled_log, y_train_log)

# Printing the coefficients and intercept of the model:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

# Train data prediction:
y_pred_log = logit.predict(x_train_scaled_log)

# Now the estimated probability of a win based on the train predictions:
y_pred_prob_log = logit.predict_proba(x_train_scaled_log)

print('Accuracy of Logistic Classifier on training set: {:.2f}'
     .format(logit.score(x_train_scaled_log, y_train_log)))
print(classification_report(y_train_log, y_pred_log))

Coefficient: 
 [[-1.50810805e-01  8.88037622e+00  6.33082084e-01  3.15498404e-01
   3.14995617e-01 -1.79998444e-01 -5.14043900e-01  1.82488426e-01
   2.60737343e-02  3.74241922e-01 -5.63040923e-01  6.91722801e-02
  -4.64657798e-03 -1.10441324e-01 -4.20270872e-01  1.28983340e-01
  -4.41243227e-01  3.15650313e-01 -5.89562436e-01 -1.16835126e-02
   5.47803177e-01  3.77473716e-01  7.74456236e-01 -1.30024442e-01
   5.20564733e-01  2.23129137e-01 -1.98909033e-02  1.07394810e-01
   3.58282893e-01  4.69613551e-01  4.47349094e-01 -5.19307698e-01
  -3.59622167e-01  3.43809368e-01 -4.27516245e-01 -3.75524243e-01
  -5.70888945e-01 -1.56684642e-01  3.67549112e-01  4.89444166e-01
  -1.58102497e-01  5.14271263e-01 -2.21935142e-01 -2.09338902e-01
  -3.94411720e-02  4.77671094e-01 -2.11678215e-01 -5.76659127e-01
   2.19355118e-01 -6.38371502e-02 -9.92269366e-02  1.06733233e-01
  -2.47880341e-02  3.53691616e-01 -4.31506807e-01  2.25462726e-01
  -3.67158240e-01  2.03436141e-01 -1.00162029e-01  9.00521984

In [98]:
# Modeling practice using Logistic Regression Model, basic hyper-parameters

# Using a logistic regression model first:

# Only fit on my training dataset
logit = LogisticRegression(C = 1.0, random_state=123)

# Fitting the data to the train dataset:
logit.fit(X_train_log, y_train_log)

# Printing the coefficients and intercept of the model:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

# Train data prediction:
y_pred_log = logit.predict(X_train_log)

# Predicted class probabilities (win vs. loss) on the training data:
y_pred_prob_log = logit.predict_proba(X_train_log)

print('Accuracy of Logistic Classifier on training set: {:.2f}'
     .format(logit.score(X_train_log, y_train_log)))
print(classification_report(y_train_log, y_pred_log))

Coefficient: 
 [[ 0.32584975  0.10700504  0.02177606 -0.11273622 -0.24623023 -0.12725986
  -0.097759    0.16727136 -0.14013044  0.04035175 -0.02618095 -0.04474803
  -0.35199641  0.05168187 -0.02154732 -0.15785179 -0.08577911 -0.01172006
   0.18357011  0.08923931  0.14962025  0.02888399  0.14405114  0.13990761
   0.00061618  0.08711004  0.13144131  0.03194242  0.0457541  -0.10307564
  -0.30764515 -0.01742943 -0.21395155  0.02430734 -0.01186964  0.02326798
  -0.13594839  0.02154939  0.07434342 -0.09186874 -0.22425422 -0.04163195
  -0.10828094  0.28041022 -0.10727794 -0.22502628 -0.09127305 -0.34237304
   0.15030636  0.06220102 -0.19284912  0.2172797  -0.02179038  0.12479217
  -0.23134185 -0.1813497  -0.28776365  0.32828028  0.08832002  0.00537524
   0.07484093  0.38294041 -0.07438465 -0.13732029 -0.08814132 -0.20794577
   0.14200841 -0.27205564  0.01167151 -0.03325269 -0.18726103 -0.0795775
   0.21713005 -0.17080807 -0.08752006 -0.26685947  0.10133036  0.14425331
   0.17186413 -0.1482950
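
The raw `coef_` array above is hard to read because the weights are not tied to column names. A minimal sketch of one way to label and rank them, using toy data and hypothetical feature names (`feat_a` to `feat_d`) in place of `X_train_log`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the training data; the feature names are hypothetical.
rng = np.random.default_rng(123)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)),
                      columns=["feat_a", "feat_b", "feat_c", "feat_d"])
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo["feat_a"] - 0.5 * X_demo["feat_c"] + noise > 0).astype(int)

logit_demo = LogisticRegression(C=1.0, random_state=123).fit(X_demo, y_demo)

# For a binary target, coef_ has shape (1, n_features), so row 0 holds the weights.
# Pair each weight with its column and sort by absolute size.
coef_table = (pd.Series(logit_demo.coef_[0], index=X_demo.columns, name="coefficient")
                .sort_values(key=np.abs, ascending=False))
print(coef_table)
```

Coefficient sizes are only comparable across features when the features share a scale, so a ranking like this is most meaningful on the scaled inputs.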

In [78]:
# Random Forest Model

In [79]:
from sklearn.ensemble import RandomForestClassifier

In [111]:
X_train_rf = X_train_log
y_train_rf = y_train_log

In [116]:
# Using Random Forest Model
rf = RandomForestClassifier(bootstrap=True,
                            min_samples_leaf=3,
                            n_estimators=100,
                            max_depth=5, 
                            random_state=123)

# Fitting the model using the train data:
rf.fit(X_train_rf, y_train_rf)

# Making prediction:
y_pred_rf = rf.predict(X_train_rf)

# Estimating the probability of a win using the training dataset:
y_pred_proba_rf = rf.predict_proba(X_train_rf)


print('Accuracy of Random Forest Model on training set: {:.2f}'
     .format(rf.score(X_train_rf, y_train_rf)))
print(classification_report(y_train_rf, y_pred_rf))

Accuracy of Random Forest Model on training set: 0.63
              precision    recall  f1-score   support

           0       0.62      0.64      0.63    139982
           1       0.63      0.61      0.62    139997

    accuracy                           0.63    279979
   macro avg       0.63      0.63      0.63    279979
weighted avg       0.63      0.63      0.63    279979



### Decision Tree

In [113]:
# Defining the X and Y variables for my modeling
X_train_dt = X_train_log
y_train_dt = y_train_log

# Fitting the DT model:
clf = DecisionTreeClassifier(max_depth=5, random_state=123)
clf.fit(X_train_dt, y_train_dt)

# prediction with training data
y_pred_dt = clf.predict(X_train_dt)
#estimate the probability
y_pred_proba_dt = clf.predict_proba(X_train_dt)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train_dt, y_train_dt)))
print(classification_report(y_train_dt, y_pred_dt))

Accuracy of Decision Tree classifier on training set: 0.63
              precision    recall  f1-score   support

           0       0.62      0.64      0.63    139982
           1       0.63      0.61      0.62    139997

    accuracy                           0.63    279979
   macro avg       0.63      0.63      0.63    279979
weighted avg       0.63      0.63      0.63    279979



In [114]:
# Adjusting Hyperparameters of DT Model: 
X_train_dt2 = X_train_log
y_train_dt2 = y_train_log


# Fitting the DT model:
clf = DecisionTreeClassifier(max_depth=10, random_state=123)
clf.fit(X_train_dt2, y_train_dt2)

# prediction with training data
y_pred_dt2 = clf.predict(X_train_dt2)
#estimate the probability
y_pred_proba_2 = clf.predict_proba(X_train_dt2)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train_dt2, y_train_dt2)))
print(classification_report(y_train_dt2, y_pred_dt2))

Accuracy of Decision Tree classifier on training set: 0.63
              precision    recall  f1-score   support

           0       0.63      0.64      0.64    139982
           1       0.63      0.62      0.62    139997

    accuracy                           0.63    279979
   macro avg       0.63      0.63      0.63    279979
weighted avg       0.63      0.63      0.63    279979



In [91]:
from sklearn.neighbors import KNeighborsClassifier

In [92]:
# Trying KNN:

knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')


X_train_knn = X_train_log
y_train_knn = y_train_log

# Fitting the model:
knn.fit(X_train_knn, y_train_knn)

# Getting the score:
knn.score(X_train_knn, y_train_knn)


# Predicting y values from the training features:
y_predknn = knn.predict(X_train_knn)

KeyboardInterrupt: 
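
KNN has to search the fitted training rows for every row it predicts, so scoring and predicting on roughly 280,000 rows can take a long time. A minimal sketch of checking the model on a random subsample first; the data is synthetic and the sample size of 500 is an arbitrary assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Dota features and the win label.
X_demo, y_demo = make_classification(n_samples=5000, n_features=20, random_state=123)

knn_demo = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_demo.fit(X_demo, y_demo)

# Draw a random subset of rows so the neighbor search stays cheap.
rng = np.random.default_rng(123)
sample_idx = rng.choice(len(X_demo), size=500, replace=False)

# predict() takes the feature matrix, not the labels.
y_pred_sample = knn_demo.predict(X_demo[sample_idx])
sample_accuracy = (y_pred_sample == y_demo[sample_idx]).mean()
print('KNN accuracy on a 500-row sample: {:.2f}'.format(sample_accuracy))
```

Passing `n_jobs=-1` to `KNeighborsClassifier` is another option, since it lets the neighbor search use all available cores.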

In [None]:
# X_initial = dota[dota.columns.drop(['match_id', 'account_id', 'hero_id', 'hero', 'player_slot', 'radiant_win', 'is_radiant', 'win'])]
# y_initial = dota[['win']]

In [117]:
# Next steps: count the items held per team and check whether that feature improves model accuracy.
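
A minimal sketch of that item count on a toy frame. The column names (`item_0`, `item_1`), the use of `0` for an empty slot, and the rule that `player_slot` values below 128 are Radiant are all assumptions that should be checked against `players.csv`:

```python
import pandas as pd

# Toy stand-in for players.csv; column names and values are hypothetical.
players_demo = pd.DataFrame({
    "match_id":    [0, 0, 0, 0, 1, 1],
    "player_slot": [0, 1, 128, 129, 0, 128],
    "item_0":      [29, 0, 41, 63, 0, 50],
    "item_1":      [36, 0, 0, 46, 0, 1],
})
item_cols = ["item_0", "item_1"]

# Assumption: an item id of 0 marks an empty inventory slot.
players_demo["item_count"] = (players_demo[item_cols] != 0).sum(axis=1)

# Assumption: player_slot values below 128 belong to the Radiant team.
players_demo["is_radiant"] = players_demo["player_slot"] < 128

# One row per match, one column per team, holding the summed item counts.
team_items = (players_demo
              .groupby(["match_id", "is_radiant"])["item_count"].sum()
              .unstack("is_radiant")
              .rename(columns={True: "radiant_items", False: "dire_items"}))
print(team_items)
```

The resulting frame is indexed by `match_id`, so it could be joined to the match-level data before modeling.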