# Linear Regression
## Goal
The goal of this linear regression is to predict the duration of a DOTA 2 match given the breakdown of the competing teams.
The data comes from Kaggle, at this link: https://www.kaggle.com/devinanzelmo/dota-2-matches#test_player.csv
The datasets to be used from this source are listed below:
1. match.csv - Basic information regarding match duration, winner, etc.
2. match_outcomes.csv - Links matches to players
3. player_ratings.csv - Lists the statistics for each player in the dataset
4. test_player.csv - Lists the player and hero participating in each match

From these we will use 

In [55]:
import pandas as pd
from sklearn import linear_model

## Data Import

In [56]:
match = pd.read_csv('match.csv')
match_outcomes = pd.read_csv('match_outcomes.csv')
player_ratings = pd.read_csv('player_ratings.csv')
test_player = pd.read_csv('test_player.csv')

## Data Manipulation
To get these datasets into a usable format, we need to identify our variables and work out how to derive them.
### Class Variable
The class variable for this regression will be match duration, found in the match dataset.

In [57]:
match[0:5]

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster
0,0,1446750112,2375,1982,4,3,63,1,22,True,0,1,155
1,1,1446753078,2582,0,1846,63,0,221,22,False,0,2,154
2,2,1446764586,2716,256,1972,63,48,190,22,False,0,0,132
3,3,1446765723,3085,4,1924,51,3,40,22,False,0,0,191
4,4,1446796385,1887,2047,0,0,63,58,22,True,0,0,156


Since we only need match_id to link rows of this table to others, and duration to use as our class variable, we can trim this table.

In [58]:
match = match[['match_id', 'duration']]

### Descriptive Variables
Several variables could help predict match length. A reasonable assumption is that a match between two evenly matched teams takes longer, so we can compare the two teams by the difference in their aggregate games won, games played, and trueskill, each aggregated over the players on a team. These statistics can be calculated from the player_ratings dataset.
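One way to express how evenly matched two teams are is the absolute gap between their aggregate statistics. The sketch below assumes a per-team summary with one row per (match_id, radiant) pair; the frame `team_stats`, the `_gap` suffix, and all numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical per-team aggregates: one row per (match_id, radiant) pair.
team_stats = pd.DataFrame({
    'match_id':      [1, 1, 2, 2],
    'radiant':       [True, False, True, False],
    'total_wins':    [5.0, 2.0, 6.0, 7.0],
    'total_matches': [10.0, 4.0, 11.0, 12.0],
    'trueskill_mu':  [25.0, 26.5, 27.0, 24.0],
})
stat_cols = ['total_wins', 'total_matches', 'trueskill_mu']

# Split the two teams and index both by match_id so they align row by row.
radiant_side = team_stats[team_stats['radiant']].set_index('match_id')[stat_cols]
dire_side = team_stats[~team_stats['radiant']].set_index('match_id')[stat_cols]

# Absolute gap per statistic: small values mean evenly matched teams.
team_gap = (radiant_side - dire_side).abs().add_suffix('_gap')
print(team_gap)
```

Taking the absolute value discards which side was stronger, which suits the assumption that only the closeness of the teams affects duration; a signed difference would be an equally plausible choice.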

In [59]:
player_ratings[5:13]  # some players have only one game played, so their statistics rest on very little data

Unnamed: 0,account_id,total_wins,total_matches,trueskill_mu,trueskill_sigma
5,308663,1,1,26.761476,8.10888
6,79749,21,40,30.553417,3.868734
7,-1985,0,1,23.263409,8.09802
8,-2160,8,12,27.426018,6.3913
9,26500,26,50,27.943621,4.049005
10,-2776,0,1,23.053522,8.110911
11,137046,46,89,26.025998,2.865184
12,56881,15,23,32.856424,5.132469


#### Transforming player_ratings
We need to join the ratings table to the match table to get the appropriate descriptive variables. To do this we must first join the player ratings with the test_player dataframe to connect players to matches. From there we must summarise the resulting dataframe to get aggregate values for each team in a match, and then we can compare the values of both teams when we connect it to the match dataframe.

In [60]:
test_player[0:10]

Unnamed: 0,match_id,account_id,hero_id,player_slot
0,50000,117784,96,0
1,50000,158361,84,1
2,50000,158362,46,2
3,50000,137970,85,3
4,50000,1090,39,4
5,50000,2391,9,128
6,50000,2393,75,129
7,50000,2394,106,130
8,50000,36737,74,131
9,50000,2392,62,132


In [61]:
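# Note: join(..., on='account_id') matches test_player's account_id against the index of player_ratings, not against its account_id column
match_ratings = test_player.join(player_ratings, on = 'account_id', how = 'left', lsuffix = 'l', rsuffix = 'r')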

In [62]:
def findTeam(player_slot):
    # Determine whether a player is on the Radiant team.
    # In the sample above, slots 0-4 are one team and 128-132 the other,
    # so any slot below 50 is treated as Radiant; otherwise it is Dire.
    return player_slot < 50
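The same flag can also be computed for a whole column at once, without calling a function per row. This is a minimal sketch that uses the `player_slot` values from the `test_player` sample (0 to 4 and 128 to 132) and the same threshold of 50 as `findTeam`; the frame name `slots` is hypothetical.

```python
import pandas as pd

# player_slot values copied from the test_player sample for match 50000.
slots = pd.DataFrame({'player_slot': [0, 1, 2, 3, 4, 128, 129, 130, 131, 132]})

# A vectorized comparison yields a boolean column in one step.
slots['radiant'] = slots['player_slot'] < 50
print(slots)
```

On a large table the vectorized comparison is usually much faster than a row-wise `apply`, and it produces the same boolean column.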

In [63]:
match_ratings['radiant'] = match_ratings.apply(lambda row: findTeam(row.player_slot), axis = 1)
match_ratings[0:10]

Unnamed: 0,match_id,account_idl,hero_id,player_slot,account_idr,total_wins,total_matches,trueskill_mu,trueskill_sigma,radiant
0,50000,117784,96,0,-58808162,3,5,30.032321,6.985622,True
1,50000,158361,84,1,-74184241,0,1,23.113746,8.109135,True
2,50000,158362,46,2,-74184365,0,4,16.547848,7.230257,True
3,50000,137970,85,3,-66786661,0,1,22.030027,8.020303,True
4,50000,1090,39,4,91317,25,39,33.719716,4.454236,True
5,50000,2391,9,128,50100,4,8,23.864505,6.794551,False
6,50000,2393,75,129,-445349,1,3,23.641084,7.623091,False
7,50000,2394,106,130,-445365,2,2,30.058583,7.704654,False
8,50000,36737,74,131,-24571240,1,1,27.560054,8.081343,False
9,50000,2392,62,132,-444994,1,4,21.150729,7.202481,False
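
In the output above, `account_idl` and `account_idr` differ on every row (for example 117784 versus -58808162). This is what `DataFrame.join` produces when the `on` column is matched against the other frame's index rather than its `account_id` column. A hedged sketch of a key-based alternative using `DataFrame.merge` follows; the two small frames are made-up stand-ins for `test_player` and `player_ratings`.

```python
import pandas as pd

# Made-up stand-ins for test_player and player_ratings.
players = pd.DataFrame({'match_id': [50000, 50000, 50000],
                        'account_id': [10, 20, 30],
                        'player_slot': [0, 1, 128]})
ratings = pd.DataFrame({'account_id': [30, 10],
                        'total_wins': [7, 3],
                        'trueskill_mu': [28.0, 30.0]})

# merge pairs rows whose account_id values are equal in both frames.
# Account 20 has no rating row, so its rating columns become NaN.
merged = players.merge(ratings, on='account_id', how='left')
print(merged)
```

With a key-based merge there is a single `account_id` column in the result, so no suffixes are needed.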


match_id,radiant,total_wins,total_matches,trueskill_mu
50000,False,1.8,3.6,25.254991
50000,True,5.6,10.0,25.088732
50001,False,6.8,12.2,26.193805
50001,True,6.4,11.0,26.026271
50002,False,12.4,20.8,28.498263
50002,True,7.2,15.2,24.646404
50003,False,10.8,19.4,25.82627
50003,True,6.4,11.0,26.62662
50004,False,6.8,11.2,27.821453
50004,True,11.2,19.6,26.200487
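
The per-team values above are consistent with averaging each statistic over the five players on a team. For match 50000, the Radiant rows shown earlier have `total_wins` of 3, 0, 0, 0 and 25, whose mean is 5.6. A minimal sketch of such an aggregation, assuming a `groupby` on `match_id` and `radiant` followed by `mean`, is given below. The frame `rows` copies only the match 50000 rows from the earlier output, and the name `team_means` is hypothetical.

```python
import pandas as pd

# Match 50000 rows copied from the match_ratings output shown earlier.
rows = pd.DataFrame({
    'match_id': [50000] * 10,
    'radiant': [True] * 5 + [False] * 5,
    'total_wins': [3, 0, 0, 0, 25, 4, 1, 2, 1, 1],
    'total_matches': [5, 1, 4, 1, 39, 8, 3, 2, 1, 4],
    'trueskill_mu': [30.032321, 23.113746, 16.547848, 22.030027, 33.719716,
                     23.864505, 23.641084, 30.058583, 27.560054, 21.150729],
})

# Average each statistic over the players of each team in each match.
stat_cols = ['total_wins', 'total_matches', 'trueskill_mu']
team_means = rows.groupby(['match_id', 'radiant'])[stat_cols].mean()
print(team_means)
```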
