## Dota Match Prediction Project

- Source/Credit: The data for this project comes from a Kaggle dataset last updated 1 year ago by Devin Anzelmo.
- The dataset is available on Kaggle at: https://www.kaggle.com/devinanzelmo/dota-2-matches

In [1]:
# Importing the libraries:
import pandas as pd
import numpy as np
from math import sqrt
from scipy import stats

# visualizing
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# plt.rc('figure', figsize=(13, 10))
# plt.rc('font', size=14)

# preparing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# modeling and evaluating
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, accuracy_score

# turn off warnings
import warnings
warnings.filterwarnings("ignore")

# acquiring
from pydataset import data

## Acquire

In [21]:
players = pd.read_csv("data/players.csv")
match = pd.read_csv("data/match.csv")
heroes = pd.read_csv("data/hero_names.csv")
items = pd.read_csv("data/item_ids.csv")
test_player = pd.read_csv("data/test_player.csv")
test_label = pd.read_csv("data/test_labels.csv")

In [24]:
# Additional data to be joined (as needed):

outcomes = pd.read_csv("data/match_outcomes.csv")
player_rating = pd.read_csv("data/player_ratings.csv")

In [15]:
players.head()

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue
0,0,0,86,0,3261,10960,347,362,9,3,...,,,,6.0,,,,,,
1,0,1,51,1,2954,17760,494,659,13,3,...,,,,14.0,,,,,,
2,0,0,83,2,110,12195,350,385,0,4,...,,,,17.0,,,,,,
3,0,2,11,3,1179,22505,599,605,8,4,...,1.0,,,13.0,,,,,,
4,0,3,67,4,3307,23825,613,762,20,3,...,3.0,,,23.0,,,,,,


In [16]:
match.head()

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster
0,0,1446750112,2375,1982,4,3,63,1,22,True,0,1,155
1,1,1446753078,2582,0,1846,63,0,221,22,False,0,2,154
2,2,1446764586,2716,256,1972,63,48,190,22,False,0,0,132
3,3,1446765723,3085,4,1924,51,3,40,22,False,0,0,191
4,4,1446796385,1887,2047,0,0,63,58,22,True,0,0,156


In [50]:
heroes.head()

Unnamed: 0,name,hero_id,localized_name,0
0,npc_dota_hero_antimage,1,Anti-Mage,Unkown
1,npc_dota_hero_axe,2,Axe,Unkown
2,npc_dota_hero_bane,3,Bane,Unkown
3,npc_dota_hero_bloodseeker,4,Bloodseeker,Unkown
4,npc_dota_hero_crystal_maiden,5,Crystal Maiden,Unkown


In [18]:
items.head()

Unnamed: 0,item_id,item_name
0,1,blink
1,2,blades_of_attack
2,3,broadsword
3,4,chainmail
4,5,claymore


In [23]:
# info() prints its own report and returns None, so call it on each frame instead of wrapping it in print()
for df in (players, heroes, match, items):
    df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 73 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   match_id                           500000 non-null  int64  
 1   account_id                         500000 non-null  int64  
 2   hero_id                            500000 non-null  int64  
 3   player_slot                        500000 non-null  int64  
 4   gold                               500000 non-null  int64  
 5   gold_spent                         500000 non-null  int64  
 6   gold_per_min                       500000 non-null  int64  
 7   xp_per_min                         500000 non-null  int64  
 8   kills                              500000 non-null  int64  
 9   deaths                             500000 non-null  int64  
 10  assists                            500000 non-null  int64  
 11  denies                             5000

In [39]:
# What are the most commonly picked heroes?
players.hero_id.value_counts()

21     20881
11     17007
74     11676
7      11323
28     11181
39     10590
8      10394
100    10306
73      9823
14      9447
1       9396
104     9025
93      8426
50      8403
25      8255
86      8183
69      7938
5       7846
42      7794
112     7697
106     7533
26      7382
30      7321
71      7311
44      7280
75      7224
9       7210
72      6856
62      6793
68      6753
67      6660
46      6042
36      5969
85      5951
19      5305
57      5161
87      4750
31      4687
2       4601
22      4589
84      4353
70      4302
51      4301
55      4219
20      4194
99      4167
32      4140
35      3809
59      3782
47      3690
12      3650
27      3589
18      3450
97      3431
53      3344
102     3310
41      3193
16      3150
110     3029
60      3023
40      3015
101     2976
4       2956
98      2934
107     2808
64      2748
88      2701
34      2610
6       2608
54      2585
33      2579
63      2566
3       2553
23      2543
56      2479
17      2407
29      2400

In [57]:
heroes.shape

(112, 4)

In [38]:
# Taking a quick look at the top 5 heroes picked:

heroes[heroes.hero_id.isin([21, 11, 74, 7, 28])]

Unnamed: 0,name,hero_id,localized_name
6,npc_dota_hero_earthshaker,7,Earthshaker
10,npc_dota_hero_nevermore,11,Shadow Fiend
20,npc_dota_hero_windrunner,21,Windranger
26,npc_dota_hero_slardar,28,Slardar
72,npc_dota_hero_invoker,74,Invoker
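
The lookup above hardcodes the five IDs and returns them in `hero_id` order rather than pick order. As a rough sketch, the pick counts can be merged with the hero names so the ranking is preserved. The `players_toy` and `heroes_toy` frames are hypothetical stand-ins with made-up counts, not the real data.

```python
import pandas as pd

# Toy stand-ins for players and heroes; values are illustrative only.
players_toy = pd.DataFrame({"hero_id": [21, 21, 21, 11, 11, 74]})
heroes_toy = pd.DataFrame({
    "hero_id": [11, 21, 74],
    "localized_name": ["Shadow Fiend", "Windranger", "Invoker"],
})

# Count picks per hero_id and keep the two most picked.
top = (
    players_toy["hero_id"]
    .value_counts()
    .head(2)
    .rename_axis("hero_id")
    .reset_index(name="picks")
)

# Attach names; a left merge keeps the pick-count order of `top`.
top_named = top.merge(heroes_toy, on="hero_id", how="left")
print(top_named)
```

On the real frames, the same pattern with `.head(5)` would list the top five heroes by name in ranked order.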


#### Takeaways:

- I've identified the five most frequently picked heroes.
- I still need to answer the questions posed in the Prep section below.

## Prep

- Key points I need to answer:
    - What is the time scale? I think it's either seconds or minutes, probably seconds.
    - How is 'player skill' determined, and is there a better set of features for building a "player skill" feature?
    - I need to join these tables. Are there different types of data, i.e., time-series tables vs. static tables, that I need to avoid mixing?
    - Is there a specific combination of heroes and items that wins matches? That's the goal, so how do I prep the data to get those features into a df?
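
The time-scale question can be checked directly. The sketch below assumes `start_time` is a Unix timestamp in seconds and `duration` is in seconds; both are assumptions to confirm against the dataset documentation. The `sample` frame copies the first three rows shown in the `match.head()` output.

```python
import pandas as pd

# First three rows of match.head(), copied from the output above.
sample = pd.DataFrame({
    "match_id": [0, 1, 2],
    "start_time": [1446750112, 1446753078, 1446764586],
    "duration": [2375, 2582, 2716],
})

# Assumption: start_time is seconds since the Unix epoch (UTC).
sample["start_dt"] = pd.to_datetime(sample["start_time"], unit="s")
# Assumption: duration is in seconds, so divide by 60 for minutes.
sample["duration_min"] = sample["duration"] / 60

print(sample[["match_id", "start_dt", "duration_min"]])
```

Under that reading the dates fall in November 2015 and the matches last roughly 40 to 45 minutes, which is plausible for Dota 2 and supports the "seconds" guess.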
    

In [64]:
# First off, need to join the heroes df to my players df so that I have all the names of the heroes together.

In [None]:
# Checking first that there are no nulls 
players[players.hero_id.isna()]

In [None]:
# Now need to add the list of heroes full names to main df:

In [56]:
players.head()

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue
0,0,0,86,0,3261,10960,347,362,9,3,...,,,,6.0,,,,,,
1,0,1,51,1,2954,17760,494,659,13,3,...,,,,14.0,,,,,,
2,0,0,83,2,110,12195,350,385,0,4,...,,,,17.0,,,,,,
3,0,2,11,3,1179,22505,599,605,8,4,...,1.0,,,13.0,,,,,,
4,0,3,67,4,3307,23825,613,762,20,3,...,3.0,,,23.0,,,,,,


In [74]:
player_heroes = pd.merge(players, heroes, on='hero_id', how='left')
player_heroes.head()

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,name,localized_name,0
0,0,0,86,0,3261,10960,347,362,9,3,...,6.0,,,,,,,npc_dota_hero_rubick,Rubick,Unkown
1,0,1,51,1,2954,17760,494,659,13,3,...,14.0,,,,,,,npc_dota_hero_rattletrap,Clockwerk,Unkown
2,0,0,83,2,110,12195,350,385,0,4,...,17.0,,,,,,,npc_dota_hero_treant,Treant Protector,Unkown
3,0,2,11,3,1179,22505,599,605,8,4,...,13.0,,,,,,,npc_dota_hero_nevermore,Shadow Fiend,Unkown
4,0,3,67,4,3307,23825,613,762,20,3,...,23.0,,,,,,,npc_dota_hero_spectre,Spectre,Unkown


In [75]:
player_heroes.drop(columns = ['name', 0], inplace = True)
player_heroes.rename(columns = {"localized_name": "hero"}, inplace = True)

In [76]:
player_heroes.hero

0                      Rubick
1                   Clockwerk
2            Treant Protector
3                Shadow Fiend
4                     Spectre
                 ...         
499995                   Tusk
499996                 Mirana
499997    Keeper of the Light
499998              Alchemist
499999       Nature's Prophet
Name: hero, Length: 500000, dtype: object

In [78]:
player_heroes.shape

(500000, 74)
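
The row count is unchanged after the left join, but that alone does not show that every `hero_id` found a name. One way to check, sketched on hypothetical toy frames, is to pass `validate` and `indicator` to `pd.merge`. The ID `999` is made up to force an unmatched row.

```python
import pandas as pd

# Toy stand-ins for the real frames; values are illustrative only.
players_toy = pd.DataFrame({"match_id": [0, 0, 0], "hero_id": [86, 51, 999]})
heroes_toy = pd.DataFrame({
    "hero_id": [86, 51],
    "localized_name": ["Rubick", "Clockwerk"],
})

merged = pd.merge(
    players_toy,
    heroes_toy,
    on="hero_id",
    how="left",
    validate="many_to_one",  # raises MergeError if hero_id repeats in heroes_toy
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows whose hero_id had no match in the lookup table.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

Any `left_only` rows on the real data would show up as missing values in the `hero` column and would need handling before modeling.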

## Explore

Questions I would like to answer:

- Is there a common item bought by winning teams?
- Is there a common set of items bought by winning teams?
- Is there an average player skill level distinct to winning teams (hypothesis test, e.g. a t-test)?
- Are there player K/D ratios that lead to a higher win %?
- Does Radiant or Dire win more often? Is the difference random, or could a feature be developed from team side?
- Create visuals of the most popular heroes picked over time (2012-2015).
- If I can get more data from the OpenDota API, add it to the existing data.
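
Several of these questions need a per-player win flag, which neither table has on its own. A minimal sketch follows. It assumes `player_slot` values below 128 are Radiant and 128 or above are Dire (an assumption to verify against the data), and it uses hypothetical toy frames shaped like `players` and `match`.

```python
import pandas as pd

# Toy rows shaped like players and match; values are illustrative only.
players_toy = pd.DataFrame({
    "match_id": [0, 0, 1, 1],
    "player_slot": [0, 128, 1, 129],
    "kills": [9, 2, 4, 10],
    "deaths": [3, 8, 0, 5],
})
match_toy = pd.DataFrame({"match_id": [0, 1], "radiant_win": [True, False]})

df = players_toy.merge(match_toy, on="match_id", how="left", validate="many_to_one")

# Assumption: slots below 128 are Radiant, 128 and above are Dire.
df["is_radiant"] = df["player_slot"] < 128
# A player wins when their side matches the match outcome.
df["win"] = df["is_radiant"] == df["radiant_win"]
# Clip deaths to 1 so a deathless game does not divide by zero.
df["kd_ratio"] = df["kills"] / df["deaths"].clip(lower=1)

print(df[["match_id", "player_slot", "win", "kd_ratio"]])
```

With `win` and `kd_ratio` on the same row, K/D can be compared between winners and losers, for example with `scipy.stats.ttest_ind`.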

#### Other things to explore:

- Which heroes have a low pick % but a high win %?
- A high win rate for a hero is > 50%. The developers spend a lot of time trying to balance the game.
- Look at Dotabuff/Dota Plus. It'll give some good pick rate vs. win rate data.
- How do I want to visualize "winning"? Do I want to consider a Radiant win as a "win"?
- I think my baseline should be the overall Radiant win rate; that would be an interesting baseline to use.
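
As a sketch of the last two ideas, the snippet below computes a majority-class baseline from `radiant_win` and a pick rate vs. win rate table per hero. It assumes a per-player frame that already has `hero` and `win` columns. The toy values and the "below-average pick rate" cutoff are illustrative choices, not results from the real data.

```python
import pandas as pd

# Toy per-player rows with a hero name and a win flag; values are illustrative only.
df = pd.DataFrame({
    "hero": ["Windranger", "Windranger", "Windranger", "Spectre", "Rubick", "Rubick"],
    "win": [True, False, False, True, True, False],
})
radiant_win = pd.Series([True, False, True, True])  # one value per match

# Baseline: accuracy of always predicting the more common outcome.
radiant_rate = radiant_win.mean()
baseline_accuracy = max(radiant_rate, 1 - radiant_rate)

# Pick count, win rate, and pick rate per hero.
hero_stats = df.groupby("hero")["win"].agg(picks="count", win_rate="mean")
hero_stats["pick_rate"] = hero_stats["picks"] / len(df)

# Heroes picked less often than average that still win more than half their games.
low_pick_high_win = hero_stats[
    (hero_stats["pick_rate"] < hero_stats["pick_rate"].mean())
    & (hero_stats["win_rate"] > 0.5)
]
print(baseline_accuracy)
print(low_pick_high_win)
```

Any model built later would need to beat `baseline_accuracy` to be useful.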
