## Dota Match Prediction Project

- Source/Credit: The data for this project comes from a Kaggle dataset last updated 1 year ago by Devin Anzelmo.
- The dataset is available on Kaggle at: https://www.kaggle.com/devinanzelmo/dota-2-matches

In [1]:
# Importing the libraries:
import pandas as pd
import numpy as np
from math import sqrt
from scipy import stats

# visualizing
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# plt.rc('figure', figsize=(13, 10))
# plt.rc('font', size=14)

# preparing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# modeling and evaluating
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, accuracy_score

# turn off warnings
import warnings
warnings.filterwarnings("ignore")

# acquiring
from pydataset import data

## Acquire

In [2]:
players = pd.read_csv("data/players.csv")
match = pd.read_csv("data/match.csv")
heroes = pd.read_csv("data/hero_names.csv")
items = pd.read_csv("data/item_ids.csv")
test_player = pd.read_csv("data/test_player.csv")
test_label = pd.read_csv("data/test_labels.csv")

In [3]:
# Additional data to be joined (as needed):

# outcomes = pd.read_csv("data/match_outcomes.csv")
# player_rating = pd.read_csv("data/player_ratings.csv")
# objectives = pd.read_csv("data/objectives.csv")

In [4]:
# Taking a quick look at the top 5 heroes picked:

heroes[heroes.hero_id.isin([21, 11, 74, 7, 28])]

Unnamed: 0,name,hero_id,localized_name
6,npc_dota_hero_earthshaker,7,Earthshaker
10,npc_dota_hero_nevermore,11,Shadow Fiend
20,npc_dota_hero_windrunner,21,Windranger
26,npc_dota_hero_slardar,28,Slardar
72,npc_dota_hero_invoker,74,Invoker


#### Takeaways:

- I've identified the 5 most frequently picked heroes.
- I still need to answer the questions posed in my prep section below.
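
The cell that produced those five hero ids isn't shown, so here is a minimal sketch of one way to get them. It assumes the picks live in a `hero_id` column, as in `players.csv`. The `players_demo` and `heroes_demo` frames are toy stand-ins with made-up counts, not values from the dataset.

```python
import pandas as pd

# Toy stand-ins for players.csv and hero_names.csv (counts are illustrative only).
players_demo = pd.DataFrame({"hero_id": [21, 21, 21, 11, 11, 74, 7]})
heroes_demo = pd.DataFrame({
    "hero_id": [7, 11, 21, 74],
    "localized_name": ["Earthshaker", "Shadow Fiend", "Windranger", "Invoker"],
})

# Count picks per hero_id, keep the n most frequent, then attach readable names.
top_n = 2
pick_counts = (
    players_demo["hero_id"]
    .value_counts()
    .head(top_n)
    .rename_axis("hero_id")
    .reset_index(name="picks")
)
top_heroes = pick_counts.merge(heroes_demo, on="hero_id", how="left")
print(top_heroes)
```

On the real data, swapping in `players` and `heroes` with `top_n = 5` should give the same kind of table.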

#### Removing the 37 rows that don't have any hero_ids attached to them.
- In the interest of time, I'm simply going to drop these 37 rows out of 500,000.

In [5]:
# Dropping hero_id == 0
player = players.copy()
players = players[players.hero_id != 0]
players.shape, player.shape

((499963, 73), (500000, 73))

## Prep

- Key points I need to answer:
    - What is the time scale? I think it's either in seconds or minutes. Probably seconds.
    - How is 'player skill' determined, and is there a better set of features to create a "player skill" feature?
    - I need to join these tables. Are there different types of data, i.e., time-series tables vs. static tables, that I need to make sure I'm not mixing?
    - Is there a specific combination of heroes and items that makes for a match-winning combination? That's the goal, so how do I prep the data to get those features into a df?
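
For the last question, one possible shape for the data is one row per team per match, with an indicator column per hero. The sketch below uses toy rows with two players per team (real matches have five). It assumes that slots below 128 are Radiant, which matches the slot values seen in this dataset. `team_features` is a hypothetical name.

```python
import pandas as pd

# Toy player-level rows: two matches, two players per team.
rows = pd.DataFrame({
    "match_id":    [0, 0, 0, 0, 1, 1, 1, 1],
    "player_slot": [0, 1, 128, 129, 0, 1, 128, 129],
    "hero":        ["rubick", "spectre", "tusk", "mirana",
                    "tusk", "rubick", "spectre", "mirana"],
    "radiant_win": [True] * 4 + [False] * 4,
})

# Assumption: slots below 128 belong to Radiant, the rest to Dire.
rows["is_radiant"] = (rows["player_slot"] < 128).astype(int)

# One row per (match, team); one indicator column per hero.
team_heroes = pd.crosstab([rows["match_id"], rows["is_radiant"]], rows["hero"])

# Team-level target: 1 if this team won the match, else 0.
team = rows.groupby(["match_id", "is_radiant"], as_index=False)["radiant_win"].first()
team["win"] = (team["radiant_win"].astype(int) == team["is_radiant"]).astype(int)

team_features = team_heroes.reset_index().merge(
    team[["match_id", "is_radiant", "win"]], on=["match_id", "is_radiant"]
)
print(team_features)
```

Item columns could be added the same way, though six slots per player would first need to be stacked into one long column.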
    

In [6]:
# First off, need to join the heroes df to my players df so that I have all the names of the heroes together.

In [7]:
# Checking first that there are no nulls 
players[players.hero_id.isna()]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue


In [8]:
# Now need to add the list of heroes full names to main df:

In [9]:
player_heroes = pd.merge(players, heroes, on = 'hero_id', how = 'left')
player_heroes.head()

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,name,localized_name
0,0,0,86,0,3261,10960,347,362,9,3,...,,6.0,,,,,,,npc_dota_hero_rubick,Rubick
1,0,1,51,1,2954,17760,494,659,13,3,...,,14.0,,,,,,,npc_dota_hero_rattletrap,Clockwerk
2,0,0,83,2,110,12195,350,385,0,4,...,,17.0,,,,,,,npc_dota_hero_treant,Treant Protector
3,0,2,11,3,1179,22505,599,605,8,4,...,,13.0,,,,,,,npc_dota_hero_nevermore,Shadow Fiend
4,0,3,67,4,3307,23825,613,762,20,3,...,,23.0,,,,,,,npc_dota_hero_spectre,Spectre
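
A left join like this silently keeps rows with no matching hero and would duplicate rows if the lookup table had repeated ids. As a hedged sketch, the `validate` and `indicator` arguments of `pd.merge` can surface both problems. The frames below are toy data, and `hero_id` 999 is a deliberately unmatched placeholder.

```python
import pandas as pd

players_demo = pd.DataFrame({"match_id": [0, 0, 1], "hero_id": [86, 51, 999]})
heroes_demo = pd.DataFrame({
    "hero_id": [51, 86],
    "localized_name": ["Clockwerk", "Rubick"],
})

merged = pd.merge(
    players_demo,
    heroes_demo,
    on="hero_id",
    how="left",
    validate="many_to_one",  # raises if hero_id is duplicated in the lookup table
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows that found no hero name in the lookup table.
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} player rows had no matching hero name")
```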


In [10]:
player_heroes.drop(columns = ['name'], inplace = True)
player_heroes.rename(columns = {"localized_name": "hero"}, inplace = True)

In [11]:
player_heroes[player_heroes.hero_id != 0]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,hero
0,0,0,86,0,3261,10960,347,362,9,3,...,,,6.0,,,,,,,Rubick
1,0,1,51,1,2954,17760,494,659,13,3,...,,,14.0,,,,,,,Clockwerk
2,0,0,83,2,110,12195,350,385,0,4,...,,,17.0,,,,,,,Treant Protector
3,0,2,11,3,1179,22505,599,605,8,4,...,,,13.0,,,,,,,Shadow Fiend
4,0,3,67,4,3307,23825,613,762,20,3,...,,,23.0,,,,,,,Spectre
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499958,49999,0,100,128,2718,17735,468,626,16,9,...,,,6.0,,,,,,,Tusk
499959,49999,0,9,129,3755,20815,507,607,12,6,...,,,15.0,,,,,,,Mirana
499960,49999,0,90,130,1059,16225,371,404,5,3,...,,,2.0,,,,,,,Keeper of the Light
499961,49999,0,73,131,3165,31015,780,703,8,6,...,,,4.0,,,,,,,Alchemist


#### Dropping unneeded columns

- Since the modeling is going to revolve around heroes and items directly, I don't need columns that are only indirectly related to those potential features. I'm going to drop every `unit_order` and `gold` column, along with the other columns I won't be using, to get the dataframe down to a more manageable size.

In [12]:
# Instead of dropping columns, I'm simply assigning the columns I want to keep:

player_heroes_cleaned = player_heroes[['match_id', 'account_id', 'hero_id', 'hero', 'player_slot', 
                                       'item_0', 'item_1', 'item_2', 'item_3', 'item_4', 'item_5', 
                                       'kills', 'deaths', 'assists', 'denies', 'last_hits']].copy()

In [13]:
columns_reduced = player_heroes.shape[1] - player_heroes_cleaned.shape[1]
print(f'Reduced the number of columns by {columns_reduced}.')

Reduced the number of columns by 58.


In [14]:
player_heroes_cleaned.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits
0,0,0,86,Rubick,0,180,37,73,56,108,0,9,3,18,1,30
1,0,1,51,Clockwerk,1,46,63,119,102,24,108,13,3,18,9,109
2,0,0,83,Treant Protector,2,48,60,59,108,65,0,0,4,15,1,58
3,0,2,11,Shadow Fiend,3,63,147,154,164,79,160,8,4,19,6,271
4,0,3,67,Spectre,4,114,92,147,0,137,63,20,3,17,13,245


In [15]:
item_lookup = dict(zip(items['item_id'], items['item_name']))
item_lookup[0] = 'Unknown'

In [16]:
def find_item(_id):
    return item_lookup.get(_id, 'u_' + str(_id))

In [17]:
# Replace each item id with its name, one slot column at a time:
for col in ['item_0', 'item_1', 'item_2', 'item_3', 'item_4', 'item_5']:
    player_heroes_cleaned[col] = player_heroes_cleaned[col].apply(find_item)

In [18]:
player_heroes_cleaned.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unknown,9,3,18,1,30
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unknown,0,4,15,1,58
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unknown,radiance,power_treads,20,3,17,13,245


In [19]:
player_combos = pd.get_dummies(player_heroes_cleaned['hero'])
player_combos.head()

Unnamed: 0,Abaddon,Alchemist,Ancient Apparition,Anti-Mage,Axe,Bane,Batrider,Beastmaster,Bloodseeker,Bounty Hunter,...,Venomancer,Viper,Visage,Warlock,Weaver,Windranger,Winter Wyvern,Witch Doctor,Wraith King,Zeus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
item0 = pd.get_dummies(players['item_0'].fillna(0))
item1 = pd.get_dummies(players['item_1'].fillna(0))
item2 = pd.get_dummies(players['item_2'].fillna(0))
item3 = pd.get_dummies(players['item_3'].fillna(0))
item4 = pd.get_dummies(players['item_4'].fillna(0))
item5 = pd.get_dummies(players['item_5'].fillna(0))

In [21]:
matches = match.copy()
matches.columns

Index(['match_id', 'start_time', 'duration', 'tower_status_radiant',
       'tower_status_dire', 'barracks_status_dire', 'barracks_status_radiant',
       'first_blood_time', 'game_mode', 'radiant_win', 'negative_votes',
       'positive_votes', 'cluster'],
      dtype='object')

In [22]:
match_result = matches[['match_id', 'radiant_win']]

In [23]:
dota_df = pd.merge(player_heroes_cleaned, match_result, on = 'match_id', how = 'left')
dota_df.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unknown,9,3,18,1,30,True
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109,True
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unknown,0,4,15,1,58,True
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271,True
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unknown,radiance,power_treads,20,3,17,13,245,True


In [None]:
dota_df[dota_df.match_id == 0]

In [None]:
dota_df[dota_df.match_id == 1]

In [24]:
# Adding another feature: kill/death ratio.
# Note: rows with deaths == 0 produce inf (or NaN when kills is also 0).

dota_df['k_d'] = (dota_df.kills / dota_df.deaths).round(2)
dota_df.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,item_0,item_1,item_2,item_3,item_4,item_5,kills,deaths,assists,denies,last_hits,radiant_win,k_d
0,0,0,86,Rubick,0,arcane_boots,ghost,bracer,ring_of_health,ultimate_scepter,Unknown,9,3,18,1,30,True,3.0
1,0,1,51,Clockwerk,1,tpscroll,power_treads,shivas_guard,force_staff,ultimate_orb,ultimate_scepter,13,3,18,9,109,True,4.33
2,0,0,83,Treant Protector,2,travel_boots,point_booster,energy_booster,ultimate_scepter,hand_of_midas,Unknown,0,4,15,1,58,True,0.0
3,0,2,11,Shadow Fiend,3,power_treads,manta,sange_and_yasha,helm_of_the_dominator,mekansm,skadi,8,4,19,6,271,True,2.0
4,0,3,67,Spectre,4,heart,urn_of_shadows,manta,Unknown,radiance,power_treads,20,3,17,13,245,True,6.67
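
The first five rows all have at least one death, so the division looks clean here. A player with zero deaths would get `inf` (or `NaN` with zero kills as well), which most scikit-learn models reject. A minimal sketch of one workaround is to treat zero deaths as one death. That is a common convention but an assumption on my part, not something the dataset defines.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"kills": [9, 0, 7, 0], "deaths": [3, 4, 0, 0]})

# Plain division: 7/0 becomes inf and 0/0 becomes NaN.
raw_kd = demo["kills"] / demo["deaths"]

# Workaround (assumption): count zero deaths as one death before dividing.
demo["k_d"] = (demo["kills"] / demo["deaths"].clip(lower=1)).round(2)
print(demo)
```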


In [26]:
dota = dota_df[['match_id', 'account_id', 'hero_id', 'hero', 'player_slot', 'kills', 'deaths', 'k_d', 'radiant_win']].copy()
dota.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,kills,deaths,k_d,radiant_win
0,0,0,86,Rubick,0,9,3,3.0,True
1,0,1,51,Clockwerk,1,13,3,4.33,True
2,0,0,83,Treant Protector,2,0,4,0.0,True
3,0,2,11,Shadow Fiend,3,8,4,2.0,True
4,0,3,67,Spectre,4,20,3,6.67,True


### Creating a target variable

In [27]:
dota['is_radiant'] = pd.cut(dota.player_slot, bins = [-1,100,200], labels = [1, 0])
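
`pd.cut` returns a categorical column, which is easy to forget when comparing it against integers later. As a sketch of an alternative, a plain comparison gives an integer flag directly. It assumes Radiant slots sit below 128 and Dire slots at 128 and above, consistent with the 0 to 4 and 128 to 131 values visible in the outputs.

```python
import pandas as pd

slots = pd.DataFrame({"player_slot": [0, 4, 128, 132]})

# Integer flag: 1 for Radiant (slot < 128), 0 for Dire.
slots["is_radiant"] = (slots["player_slot"] < 128).astype(int)

# The pd.cut version for comparison; its dtype is 'category', not int.
cut_version = pd.cut(slots["player_slot"], bins=[-1, 100, 200], labels=[1, 0])
print(slots["is_radiant"].dtype, cut_version.dtype)
```

Both versions agree on these slot values; only the dtype differs.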

In [29]:
dota.groupby(['is_radiant', 'radiant_win'])['kills'].mean()

is_radiant  radiant_win
1           False          5.939393
            True           8.745035
0           False          9.142906
            True           5.812356
Name: kills, dtype: float64

In [30]:
dota.groupby(['is_radiant', 'radiant_win'])['kills'].median()

is_radiant  radiant_win
1           False          5
            True           8
0           False          8
            True           5
Name: kills, dtype: int64

In [31]:
# I can also verify with elims, last_hits, etc...

In [32]:
dota['radiant_win'] = dota.radiant_win.astype(int)

In [33]:
dota['win'] = dota.radiant_win == dota.is_radiant

In [34]:
dota.win.value_counts(normalize = True)

True     0.500025
False    0.499975
Name: win, dtype: float64

In [37]:
dota['win'] = dota.win.astype(int)

In [38]:
dota.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,kills,deaths,k_d,radiant_win,is_radiant,win
0,0,0,86,Rubick,0,9,3,3.0,1,1,1
1,0,1,51,Clockwerk,1,13,3,4.33,1,1,1
2,0,0,83,Treant Protector,2,0,4,0.0,1,1,1
3,0,2,11,Shadow Fiend,3,8,4,2.0,1,1,1
4,0,3,67,Spectre,4,20,3,6.67,1,1,1


In [40]:
# Final cleanup of hero column:

dota['hero'] = dota['hero'].str.lower()

In [41]:
dota

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,kills,deaths,k_d,radiant_win,is_radiant,win
0,0,0,86,rubick,0,9,3,3.00,1,1,1
1,0,1,51,clockwerk,1,13,3,4.33,1,1,1
2,0,0,83,treant protector,2,0,4,0.00,1,1,1
3,0,2,11,shadow fiend,3,8,4,2.00,1,1,1
4,0,3,67,spectre,4,20,3,6.67,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...
499958,49999,0,100,tusk,128,16,9,1.78,0,0,1
499959,49999,0,9,mirana,129,12,6,2.00,0,0,1
499960,49999,0,90,keeper of the light,130,5,3,1.67,0,0,1
499961,49999,0,73,alchemist,131,8,6,1.33,0,0,1


In [42]:
dota_dummy = pd.get_dummies(dota[['hero']], dummy_na = False)
dota_dummy

Unnamed: 0,hero_abaddon,hero_alchemist,hero_ancient apparition,hero_anti-mage,hero_axe,hero_bane,hero_batrider,hero_beastmaster,hero_bloodseeker,hero_bounty hunter,...,hero_venomancer,hero_viper,hero_visage,hero_warlock,hero_weaver,hero_windranger,hero_winter wyvern,hero_witch doctor,hero_wraith king,hero_zeus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499958,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
499959,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
499960,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
499961,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
dota = pd.concat([dota, dota_dummy], axis = 1)
dota.head()

Unnamed: 0,match_id,account_id,hero_id,hero,player_slot,kills,deaths,k_d,radiant_win,is_radiant,...,hero_venomancer,hero_viper,hero_visage,hero_warlock,hero_weaver,hero_windranger,hero_winter wyvern,hero_witch doctor,hero_wraith king,hero_zeus
0,0,0,86,rubick,0,9,3,3.0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,1,51,clockwerk,1,13,3,4.33,1,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,83,treant protector,2,0,4,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,0,2,11,shadow fiend,3,8,4,2.0,1,1,...,0,0,0,0,0,0,0,0,0,0
4,0,3,67,spectre,4,20,3,6.67,1,1,...,0,0,0,0,0,0,0,0,0,0


In [45]:
# Preview without the hero_id and hero columns (not assigned back, so dota itself is unchanged):
dota.drop(columns = ['hero_id', 'hero'])

Unnamed: 0,match_id,account_id,player_slot,kills,deaths,k_d,radiant_win,is_radiant,win,hero_abaddon,...,hero_venomancer,hero_viper,hero_visage,hero_warlock,hero_weaver,hero_windranger,hero_winter wyvern,hero_witch doctor,hero_wraith king,hero_zeus
0,0,0,0,9,3,3.00,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,1,13,3,4.33,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,2,0,4,0.00,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,2,3,8,4,2.00,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,3,4,20,3,6.67,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499958,49999,0,128,16,9,1.78,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
499959,49999,0,129,12,6,2.00,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
499960,49999,0,130,5,3,1.67,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
499961,49999,0,131,8,6,1.33,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


#### Scaling the data:

Now I should be done with all the data prep. I have my independent variables and my target variable, the `win` column, which has also been encoded.

### Splitting into Train, Validate, and Test

In [None]:
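def df_split(df):
    # 80/20 split into train_validate/test, then 70/30 into train/validate
    # (56% / 24% / 20% of the rows overall), stratified on radiant_win.
    train_validate, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.radiant_win)
    train, validate = train_test_split(train_validate, test_size=.3, random_state=123, stratify=train_validate.radiant_win)
    return train, validate, test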

In [None]:
train, validate, test = df_split(dota_df)
print(train.shape, validate.shape, test.shape)
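
One thing to watch with a row-level split: the ten players of a match share the same outcome, so a random split can put teammates from one match in both train and test. A hedged sketch of a match-level split using scikit-learn's `GroupShuffleSplit` follows. `split_by_match` is a hypothetical helper and the data is a toy frame. Unlike `df_split`, it does not stratify.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_match(df, test_size=0.2, seed=123):
    """Hypothetical helper: each match_id lands in exactly one subset."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["match_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]

# Toy data: 10 matches with 10 player rows each, 5 winners per match.
demo = pd.DataFrame({
    "match_id": [m for m in range(10) for _ in range(10)],
    "win": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] * 10,
})

train_validate, test = split_by_match(demo, test_size=0.2)
train, validate = split_by_match(train_validate, test_size=0.3)
print(train.shape, validate.shape, test.shape)
```

Since every complete match has five winners and five losers, the player-level `win` column stays balanced in each subset even without stratifying.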

In [None]:
dota_df.radiant_win.value_counts(normalize = True).plot(kind = "bar")

In [None]:
# X_initial = dota[dota.columns.drop(['match_id', 'account_id', 'hero_id', 'hero', 'player_slot', 'radiant_win', 'is_radiant', 'win'])]
# y_initial = dota[['win']]

## Explore

Questions I would like to answer:

- Is there a common item bought by winning teams?
- Is there a common set of items bought by winning teams?
- Is there an average player skill level distinct to winning teams (hypothesis t-test...?)
- Are there player K/D ratios that lead to a higher win %?
- Does Radiant or Dire win more often? Is that random, or is it something a feature can be developed from?
- Create visuals of the most popular heroes picked over time (2012 - 2015).
- If I can get more data from the OpenDota API, add it to the already existing data.
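
For the Radiant vs. Dire question, a minimal sketch is to take the mean of `radiant_win` at the match level and use the majority side as a baseline. The `match_demo` frame holds made-up outcomes. On the real data the same two lines would run against `match`.

```python
import pandas as pd

# Toy match table: 5 Radiant wins out of 8 matches (illustrative values only).
match_demo = pd.DataFrame({
    "match_id": range(8),
    "radiant_win": [True, True, True, False, True, False, True, False],
})

# Share of matches won by Radiant.
radiant_rate = match_demo["radiant_win"].mean()

# Accuracy of always predicting the more frequent winner.
baseline_accuracy = max(radiant_rate, 1 - radiant_rate)
print(f"Radiant win rate: {radiant_rate:.3f}, baseline accuracy: {baseline_accuracy:.3f}")
```

Note that the per-player `win` column is close to 50/50 by construction, so this baseline only makes sense when predicting at the match level.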

#### Other things to explore:

- Which heroes have a low pick % but a high win %?
- A high win rate for a hero is > 50%. The game's developers spend a lot of time trying to balance the game.
- Look at Dotabuff/Dota Plus. It'll give some good player pick vs. win rate data.
- How do I want to visualize "winning"? Do I want to count a Radiant win as a "win"?
- I think my baseline should be Radiant wins overall; that would be an interesting baseline to use...
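
The pick % vs. win % question can be sketched with a single groupby. It assumes a player-level frame with `hero` and `win` columns like `dota` above. The data and both thresholds here are arbitrary placeholders.

```python
import pandas as pd

# Toy player rows (illustrative values only).
demo = pd.DataFrame({
    "hero": ["windranger"] * 6 + ["visage"] * 2 + ["tusk"] * 2,
    "win":  [1, 0, 1, 0, 0, 1, 1, 1, 0, 1],
})

# Picks and win rate per hero, plus each hero's share of all picks.
hero_stats = demo.groupby("hero")["win"].agg(picks="count", win_rate="mean")
hero_stats["pick_rate"] = hero_stats["picks"] / len(demo)

# Placeholder thresholds: rarely picked, but winning more than half the time.
low_pick_high_win = hero_stats[
    (hero_stats["pick_rate"] < 0.25) & (hero_stats["win_rate"] > 0.5)
].sort_values("win_rate", ascending=False)
print(low_pick_high_win)
```

With rarely picked heroes the win rate rests on few games, so a minimum pick count is probably worth adding before reading much into the result.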
