# Linear Regression
## Goal
The goal of this linear regression is to predict the duration of a DOTA 2 match given the breakdown of the competing teams.
The data comes from Kaggle: https://www.kaggle.com/devinanzelmo/dota-2-matches#test_player.csv
The datasets to be used from this source are listed below:
1. match.csv - Basic information regarding match duration, winner, etc.
2. match_outcomes.csv - Links matches to players
3. player_ratings.csv - Lists the statistics for each player in the dataset
4. test_player.csv - Lists the player and hero participating in each match

From these we will use 

In [2]:
import pandas as pd
from sklearn import linear_model

## Data Import

In [3]:
match = pd.read_csv('match.csv')
match_outcomes = pd.read_csv('match_outcomes.csv')
player_ratings = pd.read_csv('player_ratings.csv')
test_player = pd.read_csv('test_player.csv')

## Data Manipulation
To get these datasets into a usable format, we need to identify our variables, and what we need to do to find them.
### Class Variable
The class variable for this regression will be match duration, found in the match dataset.

In [4]:
match[0:5]

Unnamed: 0,match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster
0,0,1446750112,2375,1982,4,3,63,1,22,True,0,1,155
1,1,1446753078,2582,0,1846,63,0,221,22,False,0,2,154
2,2,1446764586,2716,256,1972,63,48,190,22,False,0,0,132
3,3,1446765723,3085,4,1924,51,3,40,22,False,0,0,191
4,4,1446796385,1887,2047,0,0,63,58,22,True,0,0,156


Since we only need match_id to link rows of this table to others, and duration for use as our class variable, we can trim this table to those two columns.

In [5]:
match = match[['match_id', 'duration']]

### Descriptive Variables
There are several variables that can be used to predict match length. A logical assumption to make is that a match will take longer for two evenly matched teams, so we can use the difference in the aggregate values for games won, games played, and trueskill for each player on the team to compare how well the two teams stack up against one another. These statistics can be calculated from the player_ratings dataset.
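As a minimal sketch of the team comparison described above, the block below averages each statistic per team and subtracts one team's means from the other's. The toy data, the two-player teams, and the names `players`, `stats`, and `team_diff` are assumptions made for illustration. They are not part of the Kaggle files.

```python
import pandas as pd

# Hypothetical toy data: one match, two players per team (real matches have five).
players = pd.DataFrame({
    "match_id":      [1, 1, 1, 1],
    "radiant":       [True, True, False, False],
    "total_wins":    [10, 20, 5, 15],
    "total_matches": [20, 40, 10, 50],
    "trueskill_mu":  [30.0, 26.0, 25.0, 27.0],
})

stats = ["total_wins", "total_matches", "trueskill_mu"]

# Mean of each statistic per match, computed separately for each team.
radiant_means = players[players["radiant"]].groupby("match_id")[stats].mean()
dire_means = players[~players["radiant"]].groupby("match_id")[stats].mean()

# Radiant minus the other team; values near zero suggest evenly matched teams.
team_diff = radiant_means - dire_means
print(team_diff)
```

Taking the absolute value of each difference would be one way to express "how evenly matched" without regard to which team is ahead. Whether that helps the regression is untested here.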

In [6]:
player_ratings[5:13]  # some players have only one recorded game, so their statistics are unreliable

Unnamed: 0,account_id,total_wins,total_matches,trueskill_mu,trueskill_sigma
5,308663,1,1,26.761476,8.10888
6,79749,21,40,30.553417,3.868734
7,-1985,0,1,23.263409,8.09802
8,-2160,8,12,27.426018,6.3913
9,26500,26,50,27.943621,4.049005
10,-2776,0,1,23.053522,8.110911
11,137046,46,89,26.025998,2.865184
12,56881,15,23,32.856424,5.132469


An important note about this dataset is that players have the option to play anonymously. About a third of players choose to do so. These players have account_id 0, and statistics are calculated for this shared account. Compared with non-anonymous players, their trueskill is slightly below average and their sigma is above average.
#### Transforming player_ratings
We need to join the ratings table to the match table to get the appropriate descriptive variables. To do this we must first join the player ratings with the test_player dataframe to connect players to matches. From there we summarise the resulting dataframe to get aggregate values for the match, and then we can compare the values of both teams once we connect it to the match dataframe.

In [17]:
match_outcomes[1:10]

Unnamed: 0,match_id,account_id_0,account_id_1,account_id_2,account_id_3,account_id_4,start_time,parser_version,win,rad
1,1636204962,0,61598,138825,0,207232,1437014585,12,0,1
2,1636322679,0,-44943233,-240360907,19599,0,1437019968,12,0,0
3,1636322679,-97530201,0,0,0,-116349387,1437019968,12,1,1
4,1637385965,0,0,0,104738,0,1437052551,12,1,0
5,1637385965,0,0,278620,278619,0,1437052551,12,0,1
6,1637623870,-123447796,68408,-100048908,-16784805,320715,1437058007,12,1,0
7,1637623870,-108454938,-251819996,0,51172,-106710926,1437058007,12,0,1
8,1637739731,320093,0,178850,-45490226,-119392638,1437060903,12,0,0
9,1637739731,0,241925,-115963827,14072,-67386586,1437060903,12,1,1


In [31]:
values = ['account_id_' + str(i) for i in range(0,5)]
ids = ['match_id', 'win']
accounts = pd.melt(match_outcomes, id_vars = ids, value_vars = values, value_name = 'account_id')
#accounts.sort_values(by = ['match_id', 'win', 'variable'])#This was sorted to ensure the data behaved. It did.
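To make the reshaping step concrete, here is a small sketch of what `pd.melt` does to a frame shaped like match_outcomes. The toy frame has only two account columns, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame: one row per team, one column per player slot.
wide = pd.DataFrame({
    "match_id": [100, 100],
    "win": [1, 0],
    "account_id_0": [11, 21],
    "account_id_1": [12, 22],
})

# Each account column becomes its own set of rows; the original column
# name is kept in the 'variable' column and the id in 'account_id'.
long = pd.melt(
    wide,
    id_vars=["match_id", "win"],
    value_vars=["account_id_0", "account_id_1"],
    value_name="account_id",
)
print(long)
```

The result has one row per player per team, which is the shape needed to attach one rating row to each player.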

In [33]:
match_ratings = accounts.join(player_ratings, on = 'account_id', how = 'left', lsuffix = 'l', rsuffix = 'r')
match_ratings

Unnamed: 0,match_id,win,variable,account_idl,account_idr,total_wins,total_matches,trueskill_mu,trueskill_sigma
0,1636204962,1,account_id_0,34549,-23137528.0,1.0,2.0,24.084556,7.817849
1,1636204962,0,account_id_0,0,236579.0,14.0,24.0,27.868035,5.212361
2,1636322679,0,account_id_0,0,236579.0,14.0,24.0,27.868035,5.212361
3,1636322679,1,account_id_0,-97530201,,,,,
4,1637385965,1,account_id_0,0,236579.0,14.0,24.0,27.868035,5.212361
5,1637385965,0,account_id_0,0,236579.0,14.0,24.0,27.868035,5.212361
6,1637623870,1,account_id_0,-123447796,,,,,
7,1637623870,0,account_id_0,-108454938,,,,,
8,1637739731,0,account_id_0,320093,-107742417.0,2.0,2.0,31.300924,7.744427
9,1637739731,1,account_id_0,0,236579.0,14.0,24.0,27.868035,5.212361
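In the output above, account_idl and account_idr do not agree on any row, and negative ids match nothing. This appears consistent with how `DataFrame.join` works: the `on` column of the left frame is matched against the row index of the right frame, not against its account_id column. A minimal sketch of matching on the column itself with `merge`, using invented toy data, might look like this:

```python
import pandas as pd

# Hypothetical toy data; the rating values are made up for illustration.
accounts = pd.DataFrame({
    "match_id": [1, 1, 2],
    "account_id": [0, 34549, 999],
})
player_ratings = pd.DataFrame({
    "account_id": [34549, 0],
    "trueskill_mu": [24.1, 27.9],
})

# merge matches rows on the values of the shared account_id column,
# so the row order and index of player_ratings do not matter.
match_ratings = accounts.merge(player_ratings, on="account_id", how="left")
print(match_ratings)
```

An equivalent alternative would be to call `player_ratings.set_index('account_id')` before using `join`. Accounts with no rating row still come back as NaN with `how='left'`.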


In [9]:
def findTeam(player_slot):
    # A function to determine if the team is radiant or not.
    # There are two possible teams, so if it's not radiant it is the other one.
    return player_slot < 50

In [10]:
match_ratings['radiant'] = match_ratings.apply(lambda row: findTeam(row.player_slot), axis = 1)
match_ratings


Unnamed: 0,match_id,account_idl,hero_id,player_slot,account_idr,total_wins,total_matches,trueskill_mu,trueskill_sigma,radiant
0,50000,117784,96,0,-58808162,3,5,30.032321,6.985622,True
1,50000,158361,84,1,-74184241,0,1,23.113746,8.109135,True
2,50000,158362,46,2,-74184365,0,4,16.547848,7.230257,True
3,50000,137970,85,3,-66786661,0,1,22.030027,8.020303,True
4,50000,1090,39,4,91317,25,39,33.719716,4.454236,True
5,50000,2391,9,128,50100,4,8,23.864505,6.794551,False
6,50000,2393,75,129,-445349,1,3,23.641084,7.623091,False
7,50000,2394,106,130,-445365,2,2,30.058583,7.704654,False
8,50000,36737,74,131,-24571240,1,1,27.560054,8.081343,False
9,50000,2392,62,132,-444994,1,4,21.150729,7.202481,False


In [11]:
regress = match_ratings.join(match, on = 'match_id', lsuffix = 'l', rsuffix = 'r')
regress

Unnamed: 0,match_idl,account_idl,hero_id,player_slot,account_idr,total_wins,total_matches,trueskill_mu,trueskill_sigma,radiant,match_idr,duration
0,50000,117784,96,0,-58808162,3,5,30.032321,6.985622,True,,
1,50000,158361,84,1,-74184241,0,1,23.113746,8.109135,True,,
2,50000,158362,46,2,-74184365,0,4,16.547848,7.230257,True,,
3,50000,137970,85,3,-66786661,0,1,22.030027,8.020303,True,,
4,50000,1090,39,4,91317,25,39,33.719716,4.454236,True,,
5,50000,2391,9,128,50100,4,8,23.864505,6.794551,False,,
6,50000,2393,75,129,-445349,1,3,23.641084,7.623091,False,,
7,50000,2394,106,130,-445365,2,2,30.058583,7.704654,False,,
8,50000,36737,74,131,-24571240,1,1,27.560054,8.081343,False,,
9,50000,2392,62,132,-444994,1,4,21.150729,7.202481,False,,


In [12]:
regress.groupby(by = ['match_id', 'radiant']).mean()[['total_wins', 'total_matches', 'trueskill_mu']][0:15]

KeyError: 'match_id'
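The KeyError follows from the previous join: because both frames contain a match_id column, the suffixes renamed them to match_idl and match_idr, so no column called match_id remains. The sketch below reproduces this on invented toy data and shows one possible alternative, a `merge` on the shared key, which keeps a single match_id column. The duration value is hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for match_ratings and match.
left = pd.DataFrame({
    "match_id": [50000, 50000],
    "radiant": [True, False],
    "trueskill_mu": [30.0, 24.0],
})
right = pd.DataFrame({"match_id": [50000], "duration": [2400]})

# join with suffixes renames the overlapping column on both sides.
joined = left.join(right, on="match_id", lsuffix="l", rsuffix="r")
print(joined.columns.tolist())

# merge on the shared key keeps one match_id column, so grouping works.
merged = left.merge(right, on="match_id", how="left")
team_means = merged.groupby(["match_id", "radiant"])[["trueskill_mu"]].mean()
print(team_means)
```

With the notebook's own frame, grouping by match_idl instead of match_id would also avoid the error, although the duration column would still be empty for these rows.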

## A New Direction
The data used above is not suitable for our purposes: the class variable, match duration, exists only for match ids 1 - 50 000, but the players are only linked to matches 50 000 to 150 000. I will proceed by looking for a dataset linking players to matches with a known duration.
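A quick way to confirm this kind of mismatch before joining is to compare the id ranges and count the ids the two tables share. The sketch below uses small invented frames in place of the real match and test_player tables.

```python
import pandas as pd

# Hypothetical stand-ins for the real tables.
match = pd.DataFrame({"match_id": [0, 1, 2]})
test_player = pd.DataFrame({"match_id": [50000, 50000, 50001]})

# Ids present in both tables; an empty set means a join can never match.
shared = set(match["match_id"]) & set(test_player["match_id"])
print(len(shared))

# The id ranges show at a glance whether the tables cover the same matches.
print(match["match_id"].min(), match["match_id"].max())
print(test_player["match_id"].min(), test_player["match_id"].max())
```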

In [37]:
# Largest match_id in match_outcomes
match_outcomes['match_id'].max()


1930334829