## Dota Match Prediction Project

- Source/Credit: The data for this project comes from a Kaggle dataset last updated 1 year ago by Devin Anzelmo.
- The dataset is available on Kaggle at: https://www.kaggle.com/devinanzelmo/dota-2-matches

In [1]:
# Importing the libraries:
import pandas as pd
import numpy as np
from math import sqrt
from scipy import stats

# visualizing
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# plt.rc('figure', figsize=(13, 10))
# plt.rc('font', size=14)

# preparing
import sklearn.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# modeling and evaluating
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, accuracy_score

# turn off warnings
import warnings
warnings.filterwarnings("ignore")

# acquiring
from pydataset import data

## Acquire

In [2]:
players = pd.read_csv("data/players.csv")
match = pd.read_csv("data/match.csv")
heroes = pd.read_csv("data/hero_names.csv")
items = pd.read_csv("data/item_ids.csv")
test_player = pd.read_csv("data/test_player.csv")
test_label = pd.read_csv("data/test_labels.csv")

In [23]:
# Additional data to be joined (as needed):

# outcomes = pd.read_csv("data/match_outcomes.csv")
# player_rating = pd.read_csv("data/player_ratings.csv")
objectives = pd.read_csv("data/objectives.csv")

In [4]:
players.hero_id.value_counts().head(10)

21     20881
11     17007
74     11676
7      11323
28     11181
39     10590
8      10394
100    10306
73      9823
14      9447
Name: hero_id, dtype: int64

In [5]:
# Taking a quick look at the top 10 heroes picked:

# heroes[(heroes.hero_id == 21) | (heroes.hero_id == 11) | (heroes.hero_id == 74) | (heroes.hero_id == 7) | (heroes.hero_id == 28)]
top_ten = [21, 11, 74, 7, 28, 39, 8, 100, 73, 14]
heroes[heroes.hero_id.isin(top_ten)]

Unnamed: 0,name,hero_id,localized_name
6,npc_dota_hero_earthshaker,7,Earthshaker
7,npc_dota_hero_juggernaut,8,Juggernaut
10,npc_dota_hero_nevermore,11,Shadow Fiend
13,npc_dota_hero_pudge,14,Pudge
20,npc_dota_hero_windrunner,21,Windranger
26,npc_dota_hero_slardar,28,Slardar
37,npc_dota_hero_queenofpain,39,Queen of Pain
71,npc_dota_hero_alchemist,73,Alchemist
72,npc_dota_hero_invoker,74,Invoker
98,npc_dota_hero_tusk,100,Tusk


#### Takeaways:

- I've discovered the top 10 most often picked heroes.
- I still need to answer the questions posed in my prep section below.

#### Removing the 37 rows that don't have any hero_ids attached to them.
- In the interest of time, I'm simply going to drop these 37 rows out of 500,000.

In [6]:
# Dropping hero_id == 0
player = players.copy()
players = players[players.hero_id != 0]
players.shape, player.shape

((499963, 73), (500000, 73))

## Prep

- Key points I need to answer:
    - What is the time scale? I think it's either in seconds or minutes. Probably seconds.
    - How is 'player skill' determined, and is there a better set of features to create a "player skill" feature?
    - I need to join these tables. Are there different types of data, i.e., time-series tables vs. static tables, that I need to make sure I'm not mixing?
    - Is there a specific combination of heroes and items that makes for a match-winning combination? That's the goal, so how do I prep the data to get those features into a df?
    

In [7]:
# First off, need to join the heroes df to my players df so that I have all the names of the heroes together.

In [8]:
# Checking first that there are no nulls 
players[players.hero_id.isna()]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_glyph,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue


In [9]:
# Now need to add the list of heroes full names to main df:

In [10]:
player_heroes = pd.merge(players, heroes, left_on = 'hero_id', right_on = 'hero_id', how = 'left')
player_heroes.head()

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,name,localized_name
0,0,0,86,0,3261,10960,347,362,9,3,...,,6.0,,,,,,,npc_dota_hero_rubick,Rubick
1,0,1,51,1,2954,17760,494,659,13,3,...,,14.0,,,,,,,npc_dota_hero_rattletrap,Clockwerk
2,0,0,83,2,110,12195,350,385,0,4,...,,17.0,,,,,,,npc_dota_hero_treant,Treant Protector
3,0,2,11,3,1179,22505,599,605,8,4,...,,13.0,,,,,,,npc_dota_hero_nevermore,Shadow Fiend
4,0,3,67,4,3307,23825,613,762,20,3,...,,23.0,,,,,,,npc_dota_hero_spectre,Spectre


In [11]:
player_heroes.drop(columns = ['name'], inplace = True)
player_heroes.rename(columns = {"localized_name": "hero"}, inplace = True)

In [12]:
player_heroes[player_heroes.hero_id != 0]

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,hero
0,0,0,86,0,3261,10960,347,362,9,3,...,,,6.0,,,,,,,Rubick
1,0,1,51,1,2954,17760,494,659,13,3,...,,,14.0,,,,,,,Clockwerk
2,0,0,83,2,110,12195,350,385,0,4,...,,,17.0,,,,,,,Treant Protector
3,0,2,11,3,1179,22505,599,605,8,4,...,,,13.0,,,,,,,Shadow Fiend
4,0,3,67,4,3307,23825,613,762,20,3,...,,,23.0,,,,,,,Spectre
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499958,49999,0,100,128,2718,17735,468,626,16,9,...,,,6.0,,,,,,,Tusk
499959,49999,0,9,129,3755,20815,507,607,12,6,...,,,15.0,,,,,,,Mirana
499960,49999,0,90,130,1059,16225,371,404,5,3,...,,,2.0,,,,,,,Keeper of the Light
499961,49999,0,73,131,3165,31015,780,703,8,6,...,,,4.0,,,,,,,Alchemist


In [24]:
objectives.drop(columns = ['key', 'slot', 'player1', 'player2', 'time'], inplace = True)
objectives.head()

Unnamed: 0,match_id,subtype,team,value
0,0,CHAT_MESSAGE_FIRSTBLOOD,,309
1,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
2,0,CHAT_MESSAGE_ROSHAN_KILL,2.0,200
3,0,CHAT_MESSAGE_AEGIS,,0
4,0,CHAT_MESSAGE_TOWER_KILL,3.0,3


In [25]:
objectives.dropna(inplace = True)
objectives[:40]

Unnamed: 0,match_id,subtype,team,value
1,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
2,0,CHAT_MESSAGE_ROSHAN_KILL,2.0,200
4,0,CHAT_MESSAGE_TOWER_KILL,3.0,3
5,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
6,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
7,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
8,0,CHAT_MESSAGE_ROSHAN_KILL,2.0,200
10,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
11,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
12,0,CHAT_MESSAGE_TOWER_KILL,3.0,3


In [27]:
objindex = objectives[objectives['team'] == 80].index
objectives.drop(objindex, inplace = True)
objectives.head(20)

Unnamed: 0,match_id,subtype,team,value
1,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
2,0,CHAT_MESSAGE_ROSHAN_KILL,2.0,200
4,0,CHAT_MESSAGE_TOWER_KILL,3.0,3
5,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
6,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
7,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
8,0,CHAT_MESSAGE_ROSHAN_KILL,2.0,200
10,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
11,0,CHAT_MESSAGE_TOWER_KILL,2.0,2
12,0,CHAT_MESSAGE_TOWER_KILL,3.0,3


In [30]:
player_heroes.head(20)

Unnamed: 0,match_id,account_id,hero_id,player_slot,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,...,unit_order_eject_item_from_stash,unit_order_cast_rune,unit_order_ping_ability,unit_order_move_to_direction,unit_order_patrol,unit_order_vector_target_position,unit_order_radar,unit_order_set_item_combine_lock,unit_order_continue,hero
0,0,0,86,0,3261,10960,347,362,9,3,...,,,6.0,,,,,,,Rubick
1,0,1,51,1,2954,17760,494,659,13,3,...,,,14.0,,,,,,,Clockwerk
2,0,0,83,2,110,12195,350,385,0,4,...,,,17.0,,,,,,,Treant Protector
3,0,2,11,3,1179,22505,599,605,8,4,...,,,13.0,,,,,,,Shadow Fiend
4,0,3,67,4,3307,23825,613,762,20,3,...,,,23.0,,,,,,,Spectre
5,0,4,106,128,476,12285,397,524,5,6,...,,,2.0,,,,,,,Ember Spirit
6,0,0,102,129,317,10355,303,369,4,13,...,1.0,,1.0,,,,,,,Abaddon
7,0,5,46,130,2390,13395,452,517,4,8,...,,,4.0,110.0,,,,,,Templar Assassin
8,0,0,7,131,475,5035,189,223,1,14,...,,,4.0,,,,,,,Earthshaker
9,0,6,73,132,60,17550,496,456,1,11,...,,,14.0,,,,,,,Alchemist


In [None]:
# What was I thinking... use gold, and gold_per_min as one of my main features for predicting a win...
# So groupby
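
A minimal sketch of the groupby idea in the note above, run on a made-up toy frame rather than the real `player_heroes`. It assumes that slots below 128 are Radiant and the rest are Dire, which matches the slots visible in the output above (0 to 4 and 128 to 132). The `team_gold_per_min` name is hypothetical.

```python
import pandas as pd

# Toy stand-in for player_heroes; the values are made up for illustration.
toy = pd.DataFrame({
    "match_id":     [0, 0, 0, 0],
    "player_slot":  [0, 1, 128, 129],
    "gold_per_min": [347, 494, 397, 303],
})

# Assumption: slots below 128 are Radiant, 128 and above are Dire.
toy["is_radiant"] = (toy["player_slot"] < 128).astype(int)

# One row per match and team, with gold_per_min summed across that team's players.
team_gold = (
    toy.groupby(["match_id", "is_radiant"])["gold_per_min"]
       .sum()
       .reset_index(name="team_gold_per_min")
)
print(team_gold)
```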

#### Dropping unneeded columns

- Since the point of this modeling is going to revolve around directly modeling heroes and items, I don't need columns that are only indirectly related to those potential features. Thus, I'm going to drop any mention of `unit_order` or `gold` in my columns. I need to get the dataframe down to a more manageable size.

In [None]:
# Instead of dropping columns, I'm simply assigning the columns I want to keep:

player_heroes_cleaned = player_heroes[['match_id', 'account_id', 'hero_id', 'hero', 'player_slot', 
                                       'item_0', 'item_1', 'item_2', 'item_3', 'item_4', 'item_5', 
                                       'kills', 'deaths', 'assists', 'denies', 'last_hits']]

In [None]:
columns_reduced = player_heroes.shape[1] - player_heroes_cleaned.shape[1]
print(f'Reduced the number of columns by {columns_reduced}.')

In [None]:
player_heroes_cleaned.head()

In [None]:
item_lookup = dict(zip(items['item_id'], items['item_name']))
item_lookup[0] = "Unknown"

In [None]:
def find_item(_id):
    return item_lookup.get(_id, 'u_' + str(_id))

In [None]:
player_heroes_cleaned['item_0'] = player_heroes_cleaned['item_0'].apply(find_item)
player_heroes_cleaned['item_1'] = player_heroes_cleaned['item_1'].apply(find_item)
player_heroes_cleaned['item_2'] = player_heroes_cleaned['item_2'].apply(find_item)
player_heroes_cleaned['item_3'] = player_heroes_cleaned['item_3'].apply(find_item)
player_heroes_cleaned['item_4'] = player_heroes_cleaned['item_4'].apply(find_item)
player_heroes_cleaned['item_5'] = player_heroes_cleaned['item_5'].apply(find_item)

In [None]:
player_heroes_cleaned.head()

In [None]:
player_combos = pd.get_dummies(player_heroes_cleaned['hero'])
player_combos.head()

In [None]:
item0 = pd.get_dummies(players['item_0'].fillna(0))
item1 = pd.get_dummies(players['item_1'].fillna(0))
item2 = pd.get_dummies(players['item_2'].fillna(0))
item3 = pd.get_dummies(players['item_3'].fillna(0))
item4 = pd.get_dummies(players['item_4'].fillna(0))
item5 = pd.get_dummies(players['item_5'].fillna(0))

In [None]:
matches = match.copy()
matches.columns

In [None]:
match_result = matches[['match_id', 'radiant_win']]

In [None]:
dota_df = pd.merge(player_heroes_cleaned, match_result, left_on = 'match_id', right_on = 'match_id', how = 'left')
dota_df.head()

In [None]:
dota_df[dota_df.match_id == 0]

In [None]:
dota_df[dota_df.match_id == 1]

In [None]:
# Adding another feature:
# dota_df["k_d"] = (dota_df.)

dota_df['k_d'] = round((dota_df.kills / dota_df.deaths), 2)
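
One thing to watch with this feature: plain division returns `inf` when a player has zero deaths. Below is a hedged sketch on toy data of one common convention, counting zero deaths as one death. That convention is an assumption here, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Toy kills/deaths; two players finished with zero deaths.
toy = pd.DataFrame({"kills": [9, 13, 0, 5], "deaths": [3, 0, 4, 0]})

# Plain division yields inf whenever deaths == 0 and kills > 0.
raw_kd = toy["kills"] / toy["deaths"]

# Assumed convention: treat zero deaths as one death so the ratio stays finite.
toy["k_d"] = (toy["kills"] / toy["deaths"].clip(lower=1)).round(2)
print(toy)
```

A finite `k_d` matters later if the column is passed to a scaler or a model, since most scikit-learn estimators reject infinite values.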

In [None]:
dota_df['top_ten_hero'] = dota_df.hero_id.isin(top_ten).astype(int)
dota_df.head()

In [None]:
# Item sum column:

# `== None` compares element-wise and never matches, so count missing values with isna() instead:
dota_df['item_5'].isna().sum()

# Carving off needed columns for actual train/validate/test dataframe

In [None]:
dota = dota_df[['match_id', 'account_id', 'hero_id', 'hero', 'top_ten_hero', 'player_slot', 'kills', 'deaths', 'k_d', 'radiant_win']]
dota.head()

### Creating a target variable

I have to create a new y, or target variable that actually matches up with the winning team. In other words, the original dataset didn't clearly dictate which team was radiant and which team was dire, so I couldn't tell which teams

In [None]:
dota['is_radiant'] = pd.cut(dota.player_slot, bins = [-1,100,200], labels = [1, 0])

In [None]:
dota.groupby(['is_radiant', 'radiant_win'])['kills'].mean()

In [None]:
dota.groupby(['is_radiant', 'radiant_win'])['kills'].median()

In [None]:
# I can also verify with elims, last_hits, etc...

In [None]:
dota['radiant_win'] = dota.radiant_win.astype(int)

In [None]:
dota['win'] = dota.radiant_win == dota.is_radiant

In [None]:
dota.win.value_counts(normalize = True)

In [None]:
dota['win'] = dota.win.astype(int)
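
The target construction in the cells above can be sketched on a toy frame. This version uses a plain comparison instead of `pd.cut`, so `is_radiant` comes out as an ordinary integer column. It assumes slots below 128 are Radiant, consistent with the slots 0 to 4 and 128 to 132 seen earlier.

```python
import pandas as pd

# Toy rows: two Radiant players and two Dire players from a match Radiant won.
toy = pd.DataFrame({
    "player_slot": [0, 4, 128, 132],
    "radiant_win": [True, True, True, True],
})

# Assumption: slots below 128 belong to Radiant.
toy["is_radiant"] = (toy["player_slot"] < 128).astype(int)

# A player wins when their side matches the side that won the match.
toy["win"] = (toy["is_radiant"] == toy["radiant_win"].astype(int)).astype(int)
print(toy)
```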

In [None]:
dota.head()

In [None]:
# Final cleanup of hero column:

dota['hero'] = dota['hero'].str.lower()

In [None]:
dota

In [None]:
dota_dummy = pd.get_dummies(dota[['hero']], dummy_na = False)
dota_dummy

In [None]:
dota = pd.concat([dota, dota_dummy], axis = 1)
dota.head()

In [None]:
dota.drop(columns = ['hero_id', 'radiant_win'], inplace = True)
dota.head()

## I need to come back and scale the kills, deaths, and k_d once I figure out how to. **But right now, it's a race to the MVP finish line.**

In [None]:
dota.columns.tolist()

Now I should be done with all the data prep; I have my independent variables, and my target variable, the `win` column, which has also been encoded.

### Splitting into Train, Validate, and Test

So that I can explore using train.

In [None]:
def df_split(df):
    train_validate, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.win)
    train, validate = train_test_split(train_validate, test_size=.3, random_state=123, stratify=train_validate.win)
    return train, validate, test
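
A possible refinement, not what this notebook does: each match contributes ten player rows, so a row-level split can place players from the same match in both train and test. The sketch below uses scikit-learn's `GroupShuffleSplit` on toy data to keep every match on one side of the split. Whether that matters for this project is an open question.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 matches with 4 player rows each.
toy = pd.DataFrame({
    "match_id": [m for m in range(10) for _ in range(4)],
    "win": [1, 1, 0, 0] * 10,
})

# test_size is a share of groups (matches), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
train_idx, test_idx = next(splitter.split(toy, groups=toy["match_id"]))

train_toy, test_toy = toy.iloc[train_idx], toy.iloc[test_idx]
print(train_toy.shape, test_toy.shape)
```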

In [None]:
train, validate, test = df_split(dota)
print(train.shape, validate.shape, test.shape)

In [None]:
train.win.value_counts(normalize = True), dota.win.value_counts(normalize = True)

In [None]:
players.columns.tolist()

In [None]:
X = dota.drop(columns = ['match_id',
                         'account_id',
                         'hero',
                         'player_slot',
                         'kills',
                         'deaths',
                         'is_radiant',
                         'win',])
X

In [None]:
x_train_and_validate, x_test = train_test_split(X, random_state=123)
x_train, x_validate = train_test_split(x_train_and_validate)

In [None]:
# X = dota[['hero_abaddon',
#  'hero_alchemist',
#  'hero_ancient apparition',
#  'hero_anti-mage',
#  'hero_axe',
#  'hero_bane',
#  'hero_batrider',
#  'hero_beastmaster',
#  'hero_bloodseeker',
#  'hero_bounty hunter',
#  'hero_brewmaster',
#  'hero_bristleback',
#  'hero_broodmother',
#  'hero_centaur warrunner',
#  'hero_chaos knight',
#  'hero_chen',
#  'hero_clinkz',
#  'hero_clockwerk',
#  'hero_crystal maiden',
#  'hero_dark seer',
#  'hero_dazzle',
#  'hero_death prophet',
#  'hero_disruptor',
#  'hero_doom',
#  'hero_dragon knight',
#  'hero_drow ranger',
#  'hero_earth spirit',
#  'hero_earthshaker',
#  'hero_elder titan',
#  'hero_ember spirit',
#  'hero_enchantress',
#  'hero_enigma',
#  'hero_faceless void',
#  'hero_gyrocopter',
#  'hero_huskar',
#  'hero_invoker',
#  'hero_io',
#  'hero_jakiro',
#  'hero_juggernaut',
#  'hero_keeper of the light',
#  'hero_kunkka',
#  'hero_legion commander',
#  'hero_leshrac',
#  'hero_lich',
#  'hero_lifestealer',
#  'hero_lina',
#  'hero_lion',
#  'hero_lone druid',
#  'hero_luna',
#  'hero_lycan',
#  'hero_magnus',
#  'hero_medusa',
#  'hero_meepo',
#  'hero_mirana',
#  'hero_morphling',
#  'hero_naga siren',
#  "hero_nature's prophet",
#  'hero_necrophos',
#  'hero_night stalker',
#  'hero_nyx assassin',
#  'hero_ogre magi',
#  'hero_omniknight',
#  'hero_oracle',
#  'hero_outworld devourer',
#  'hero_phantom assassin',
#  'hero_phantom lancer',
#  'hero_phoenix',
#  'hero_puck',
#  'hero_pudge',
#  'hero_pugna',
#  'hero_queen of pain',
#  'hero_razor',
#  'hero_riki',
#  'hero_rubick',
#  'hero_sand king',
#  'hero_shadow demon',
#  'hero_shadow fiend',
#  'hero_shadow shaman',
#  'hero_silencer',
#  'hero_skywrath mage',
#  'hero_slardar',
#  'hero_slark',
#  'hero_sniper',
#  'hero_spectre',
#  'hero_spirit breaker',
#  'hero_storm spirit',
#  'hero_sven',
#  'hero_techies',
#  'hero_templar assassin',
#  'hero_terrorblade',
#  'hero_tidehunter',
#  'hero_timbersaw',
#  'hero_tinker',
#  'hero_tiny',
#  'hero_treant protector',
#  'hero_troll warlord',
#  'hero_tusk',
#  'hero_undying',
#  'hero_ursa',
#  'hero_vengeful spirit',
#  'hero_venomancer',
#  'hero_viper',
#  'hero_visage',
#  'hero_warlock',
#  'hero_weaver',
#  'hero_windranger',
#  'hero_winter wyvern',
#  'hero_witch doctor',
#  'hero_wraith king',
#  'hero_zeus']]

In [None]:
# dota.drop(columns = ['match_id',
#  'account_id',
#  'hero',
#  'player_slot',
#  'kills',
#  'deaths',
#  'k_d',
#  'is_radiant',
#  'win',])

## Explore

Questions I would like to answer:

- Is there a common item bought by winning teams?
- Is there a common set of items bought by winning teams?
- Is there an average player skill level distinct to winning teams (hypo t-test...?)
- Are there player K/D ratios that lead to higher win %?
- Do the Radiant or Dire teams win more? Is that random, or is it something a feature can be developed from?
- Create visuals of most popular heroes picked over time = 2012 - 2015.
- If I can get more data from the OpenDota API, add it to the already existing data.

#### Hypotheses:

1.  Does the mean 

#### Other things to explore:

- Which heroes have a low pick % but a high win %, so in other words.
- A high win rate for a hero is > 50%. They spend a lot of time trying to balance the game.
- Look at Dotabuff/Dota Plus. It'll give some good player pick vs. win rate.
- How do I want to visualize "winning"? Do I want to consider a Radiant win as a "win"?
- I think my baseline should be Radiant wins overall; that would be an interesting baseline to use...
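
The "low pick % but high win %" question above can be sketched with a groupby. This is a toy illustration only: the 0.5 cutoffs and the names `hero_stats` and `sleepers` are hypothetical, and on the real data the same idea would be applied to `train`.

```python
import pandas as pd

# Toy rows: one popular hero with a poor record, one rare hero with a good one.
toy = pd.DataFrame({
    "hero": ["pudge"] * 6 + ["chen"] * 2,
    "win":  [1, 0, 0, 1, 0, 0, 1, 1],
})

# Picks and win rate per hero.
hero_stats = toy.groupby("hero")["win"].agg(picks="count", win_rate="mean")
hero_stats["pick_rate"] = hero_stats["picks"] / len(toy)

# Arbitrary illustrative cutoffs: rarely picked but winning more than half the time.
sleepers = hero_stats[(hero_stats["pick_rate"] < 0.5) & (hero_stats["win_rate"] > 0.5)]
print(sleepers)
```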


In [None]:
# Baseline model: Radiant team wins just over 50% of the time.


train.win.value_counts(normalize = True).plot(kind = "bar")

## Hypothesis Testing:


#### Hypothesis 1:
- $H_0$: The average K/D rate for the winning team's players is not different than the average K/D of the losing team's players
- $H_a$: The average K/D rate for the winning team's players is statistically different than the average K/D of the losing team's players

#### Hypothesis 2: Using a $\chi^2$ Test
- $H_0$: The heroes a team picks have no impact on (are independent of) the team's outcome in the match (whether that team wins or loses).

- $H_a$: The heroes a team picks *do* have an impact on (are not independent of) the team's outcome in the match.
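
A minimal sketch of how Hypothesis 2 could be tested, using made-up rows. It simplifies the hypothesis to a row-level check of `hero` against `win` rather than whole team compositions, which is an assumption of this sketch. On the real data, `train` would take the place of `toy`.

```python
import pandas as pd
from scipy import stats

# Toy player rows: hero picked and whether that player's team won.
toy = pd.DataFrame({
    "hero": ["pudge", "pudge", "pudge", "pudge", "tusk", "tusk", "tusk", "tusk"],
    "win":  [1, 1, 1, 0, 0, 0, 1, 0],
})

# Contingency table of hero vs. outcome.
observed = pd.crosstab(toy["hero"], toy["win"])

# Chi-square test of independence on the observed counts.
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
```

With counts this small the test is not meaningful; the point is only the shape of the call.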

In [None]:
train.head()

### Hypothesis Test #1:

- $H_0$: The average K/D rate for the winning team's players is not different than the average K/D of the losing team's players
- $H_a$: The average K/D rate for the winning team's players is statistically different than the average K/D of the losing team's players
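
A hedged sketch of how this test could be run with `scipy.stats.ttest_ind`, shown on made-up K/D values. Welch's variant (`equal_var=False`) and `alpha = 0.05` are choices of this sketch, not decisions recorded in the notebook.

```python
import pandas as pd
from scipy import stats

# Toy K/D values for winning and losing players.
toy = pd.DataFrame({
    "k_d": [3.0, 4.3, 2.0, 6.7, 0.8, 0.3, 1.0, 0.5],
    "win": [1, 1, 1, 1, 0, 0, 0, 0],
})

winners = toy.loc[toy["win"] == 1, "k_d"]
losers = toy.loc[toy["win"] == 0, "k_d"]

# Two-sample, two-tailed t-test without assuming equal variances.
t_stat, p_value = stats.ttest_ind(winners, losers, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, reject H0: {reject_null}")
```

On the real data, any infinite `k_d` values from zero-death players would need handling before the test.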


In [None]:
X_train_log = train.drop(columns = ['match_id',
                                     'account_id',
                                     'hero',
                                     'player_slot',
                                     'k_d',
                                     'deaths',
                                     'is_radiant',
                                     'win'])
y_train_log = train[['win']]

# Validate dataset features:
X_validate_log = validate.drop(columns = ['match_id',
                                         'account_id',
                                         'hero',
                                         'player_slot',
                                         'k_d',
                                         'deaths',
                                         'is_radiant',
                                         'win'])
y_validate_log = validate[['win']]

# Test dataset features:
X_test_log = test.drop(columns = ['match_id',
                                 'account_id',
                                 'hero',
                                 'player_slot',
                                 'k_d',
                                 'deaths',
                                 'is_radiant',
                                 'win'])
y_test_log = test[['win']]

In [None]:
import sklearn.preprocessing

In [None]:
# Scaling the data:

scaler = sklearn.preprocessing.MinMaxScaler()

scaler.fit(X_train_log)

x_train_scaled_log = scaler.transform(X_train_log)
x_validate_scaled_log = scaler.transform(X_validate_log)
x_test_scaled_log = scaler.transform(X_test_log)

In [None]:
# Modeling practice using Logistic Regression Model, basic hyper-parameters

# Using a logistic regression model first:

# Only fit on my training dataset
logit1 = LogisticRegression(C = 1.0, random_state=123)

# Fitting the data to the train dataset:
logit1.fit(x_train_scaled_log, y_train_log)

# Printing the coefficients and intercept of the model:
print('Coefficient: \n', logit1.coef_)
print('Intercept: \n', logit1.intercept_)

# Train data prediction:
y_pred_log_1 = logit1.predict(x_train_scaled_log)

# Now the estimated win probability based on the train data:
y_pred_prob_log_1 = logit1.predict_proba(x_train_scaled_log)

print('Accuracy of Logistic Classifier on training set: {:.2f}'
     .format(logit1.score(x_train_scaled_log, y_train_log)))
print(classification_report(y_train_log, y_pred_log_1))

In [None]:
# Modeling practice using Logistic Regression Model, basic hyper-parameters

# Using a logistic regression model first:

# Only fit on my training dataset
logit2 = LogisticRegression(C = .05, random_state=123)

# Fitting the data to the train dataset:
logit2.fit(X_train_log, y_train_log)

# Printing the coefficients and intercept of the model:
print('Coefficient: \n', logit2.coef_)
print('Intercept: \n', logit2.intercept_)

# Train data prediction:
y_pred_log2 = logit2.predict(X_train_log)

# Now the estimated win probability based on the train data:
y_pred_prob_log2 = logit2.predict_proba(X_train_log)

print('Accuracy of Logistic Classifier on training set: {:.2f}'
     .format(logit2.score(X_train_log, y_train_log)))
print(classification_report(y_train_log, y_pred_log2))

In [None]:
# Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
X_train_rf = X_train_log
y_train_rf = y_train_log

In [None]:
# Using Random Forest Model
rf = RandomForestClassifier(bootstrap=True,
                            min_samples_leaf=3,
                            n_estimators=100,
                            max_depth=5, 
                            random_state=123)

# Fitting the model using the train data:
rf.fit(X_train_rf, y_train_rf)

# Making prediction:
y_pred_rf = rf.predict(X_train_rf)

# Estimating the probability of a win using the training dataset:
y_pred_proba_rf = rf.predict_proba(X_train_rf)


print('Accuracy of Random Forest Model on training set: {:.2f}'
     .format(rf.score(X_train_rf, y_train_rf)))
print(classification_report(y_train_rf, y_pred_rf))

### Decision Tree

In [None]:
# Defining the X and Y variables for my modeling
X_train_dt = X_train_log
y_train_dt = y_train_log

# Fitting the DT model:
clf = DecisionTreeClassifier(max_depth=5, random_state=123)
clf.fit(X_train_dt, y_train_dt)

# prediction with training data
y_pred_dt = clf.predict(X_train_dt)
#estimate the probability
y_pred_proba_dt = clf.predict_proba(X_train_dt)

print('Accuracy of Decision Tree Classifier on training set: {:.2f}'
     .format(clf.score(X_train_dt, y_train_dt)))
print(classification_report(y_train_dt, y_pred_dt))

In [None]:
# Adjusting Hyperparameters of DT Model: 
X_train_dt2 = X_train_log
y_train_dt2 = y_train_log


# Fitting the DT model:
clf = DecisionTreeClassifier(max_depth=10, random_state=123)
clf.fit(X_train_dt2, y_train_dt2)

# prediction with training data
y_pred_dt2 = clf.predict(X_train_dt2)
#estimate the probability
y_pred_proba_2 = clf.predict_proba(X_train_dt2)

print('Accuracy of Decision Tree Classifier on training set: {:.2f}'
     .format(clf.score(X_train_dt2, y_train_dt2)))
print(classification_report(y_train_dt2, y_pred_dt2))

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# # Trying KNN:

# knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')


# X_train_knn = X_train_log
# y_train_knn = y_train_log

# # Fitting the model:
# knn.fit(X_train_knn, y_train_knn)

# # Getting the score:
# knn.score(X_train_knn, y_train_knn)


# # predict y values
# y_predknn = knn.predict(X_train_knn)

### Validate:

- Logistic Regression and DT models

In [None]:
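# logit1 was fit on MinMax-scaled features, so the validate features go through the same fitted scaler:
y_pred_log_1 = logit1.predict(scaler.transform(X_validate_log))


In [None]:
print("model 1\n", logit1.score(scaler.transform(X_validate_log), y_validate_log))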

In [None]:
# Using the logistic regression model on the validate dataset:

# Validate data prediction (logit2 was fit on the unscaled features):
y_pred_val_1 = logit2.predict(X_validate_log)

# Now the estimated win probability based on the validate data:
y_pred_prob_val_1 = logit2.predict_proba(X_validate_log)

print('Accuracy of Logistic Classifier on validate set: {:.2f}'
     .format(logit2.score(X_validate_log, y_validate_log)))
print(classification_report(y_validate_log, y_pred_val_1))

In [None]:
X_validate_log

# MVP version #2:
- Focus only on radiant team
- count of items **per team**. Probably means I'm going to have to groupby or something and reduce the rows
- Count of objectives completed per match per team. Don't know if that's possible.
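
Counting objectives per match per team does look feasible with a groupby on the cleaned `objectives` frame. The sketch below uses a toy frame with the same `match_id`, `subtype`, and `team` columns. The `team` values are taken at face value; how they map to Radiant and Dire is not established in this notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned objectives frame.
toy = pd.DataFrame({
    "match_id": [0, 0, 0, 0, 1],
    "subtype": ["CHAT_MESSAGE_TOWER_KILL", "CHAT_MESSAGE_ROSHAN_KILL",
                "CHAT_MESSAGE_TOWER_KILL", "CHAT_MESSAGE_TOWER_KILL",
                "CHAT_MESSAGE_TOWER_KILL"],
    "team": [2.0, 2.0, 3.0, 2.0, 3.0],
})

# One row per match and team, one column per objective type, values are counts.
objective_counts = (
    toy.groupby(["match_id", "team", "subtype"]).size()
       .unstack("subtype", fill_value=0)
       .reset_index()
)
print(objective_counts)
```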

In [None]:
# X_initial = dota[dota.columns.drop(['match_id', 'account_id', 'hero_id', 'hero', 'player_slot', 'radiant_win', 'is_radiant', 'win'])]
# y_initial = dota[['win']]

In [None]:
# Next steps: I want to do a count of items per team, and see if that'll help me get a more accurate result on the modeling.
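
A rough sketch of that next step on toy data, reducing player rows to one row per team. It assumes an item id of 0 means an empty slot and that the item columns still hold numeric ids (before `find_item` is applied). Both are assumptions, and `items_held` and `team_items` are hypothetical names.

```python
import pandas as pd

# Toy player rows with two item slots each; ids are made up.
toy = pd.DataFrame({
    "match_id":   [0, 0, 0, 0],
    "is_radiant": [1, 1, 0, 0],
    "item_0":     [1, 50, 0, 29],
    "item_1":     [0, 63, 0, 0],
})
item_cols = ["item_0", "item_1"]

# Assumption: id 0 marks an empty slot, so count the non-zero slots per player.
toy["items_held"] = (toy[item_cols] != 0).sum(axis=1)

# Sum the per-player counts to get one row per match and team.
team_items = (
    toy.groupby(["match_id", "is_radiant"])["items_held"]
       .sum()
       .reset_index(name="team_items")
)
print(team_items)
```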