# Fantasy baseball draft - Lahman data

The purpose of this document is to prepare data for my fantasy baseball draft with Will. I will develop a scoring algorithm whose result should correlate with the league's winning conditions, and then apply machine learning to find other markers that might help us maximize the score.

Two separate models will be needed, one for hitters and another for pitchers, since each group is judged by different criteria.

The data is from the latest version of the Lahman database.

## Import needed libraries

In [1]:
import pandas as pd
import numpy as np

## Import the hitters data

In [2]:
hitters = pd.read_csv("hitterdata.csv", header=0)

In [3]:
hitters.head()

Unnamed: 0,playerID,yearID,LastOfstint,LastOfteamID,retroID,nameFirst,nameLast,weight,height,bats,...,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh
0,abreujo02,2014,1,CHA,abrej003,Jose,Abreu,255,75,R,...,0.0,152.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0
1,abreujo02,2015,1,CHA,abrej003,Jose,Abreu,255,75,R,...,0.0,152.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0
2,abreujo02,2016,1,CHA,abrej003,Jose,Abreu,255,75,R,...,0.0,152.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0
3,ackledu01,2011,1,SEA,ackld001,Dustin,Ackley,205,73,L,...,0.0,13.0,1.0,0.0,0.0,0.0,0.0,9.0,9.0,3.0
4,ackledu01,2012,1,SEA,ackld001,Dustin,Ackley,205,73,L,...,0.0,13.0,1.0,0.0,0.0,0.0,0.0,9.0,9.0,3.0


## Eligibility criteria

Under league rules, players are eligible to be selected "at their primary position, plus positions they've played 10 games last year or 1 game this year." For the draft, we need to figure out what these are. Since I have no source for "primary position", I will assume the position at which a player has played the most games is his primary position. We need to construct a function to identify these.

In [4]:
hitters.columns

Index(['playerID', 'yearID', 'LastOfstint', 'LastOfteamID', 'retroID',
       'nameFirst', 'nameLast', 'weight', 'height', 'bats', 'lgID', 'G', 'H',
       'AB', 'HR', 'RBI', 'SB', 'R', '2B', '3B', 'CS', 'BB', 'SO', 'IBB',
       'HBP', 'SH', 'SF', 'GIDP', 'G_p', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss',
       'G_lf', 'G_cf', 'G_rf', 'G_of', 'G_dh'],
      dtype='object')

In [5]:
gamescols = ['G_p', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss', 'G_lf', 'G_cf', 'G_rf', 'G_of', 'G_dh']

In [6]:
hitters[gamescols].max(axis = 1)

0       152.0
1       152.0
2       152.0
3        13.0
4        13.0
5        13.0
6        13.0
7        26.0
8        13.0
9        47.0
10       47.0
11       47.0
12       86.0
13       86.0
14       86.0
15       86.0
16       86.0
17       13.0
18       13.0
19       13.0
20       13.0
21        7.0
22        7.0
23        7.0
24       88.0
25       88.0
26       88.0
27       11.0
28       11.0
29        5.0
        ...  
2447      2.0
2448     69.0
2449     69.0
2450     69.0
2451     69.0
2452     69.0
2453     69.0
2454     69.0
2455      4.0
2456      4.0
2457      8.0
2458      4.0
2459      8.0
2460      4.0
2461    114.0
2462    114.0
2463    114.0
2464    114.0
2465    114.0
2466    114.0
2467    119.0
2468    119.0
2469    119.0
2470    119.0
2471    238.0
2472    119.0
2473     52.0
2474     52.0
2475     52.0
2476     52.0
dtype: float64

In [7]:
# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
prim_pos_index = np.argmax(hitters[gamescols].values, axis=1)

In [10]:
# For each hitter:
#   look up the column with the most games played (index from prim_pos_index)
#   strip the 'G_' prefix to get the position code
#   store the position code in the 'prim_pos' column
prim_pos = []
for i in range(len(hitters)):
    prim_pos.append(gamescols[prim_pos_index[i]][2:])
hitters['prim_pos'] = prim_pos
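
The loop above could probably also be written without an explicit index array by using `DataFrame.idxmax`. The following is only a sketch on a made-up toy frame: `toy`, `toy_gamescols`, and the values in it are assumptions for illustration, not rows from `hitterdata.csv`.

```python
import pandas as pd

# Hypothetical subset of the G_* columns with made-up values
toy_gamescols = ["G_1b", "G_2b", "G_rf", "G_dh"]
toy = pd.DataFrame({
    "G_1b": [152.0, 13.0],
    "G_2b": [0.0, 120.0],
    "G_rf": [0.0, 9.0],
    "G_dh": [7.0, 3.0],
})

# idxmax(axis=1) returns, for each row, the label of the column holding the
# maximum (the first such column on ties); .str[2:] strips the 'G_' prefix.
toy["prim_pos"] = toy[toy_gamescols].idxmax(axis=1).str[2:]
print(toy["prim_pos"].tolist())
```

Like `np.argmax`, `idxmax` picks the first column when two positions are tied, so the order of the column list still matters.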

Now I need to find other eligible positions.

1. For each player, find the gamescols where their games played is >= 10.
2. Combine these into a string representation and store it in 'other_pos'.

In [22]:
test = hitters.iloc[178,]

In [42]:
np.where(test[gamescols]>=10)[0].tolist()

[6, 7, 8, 9]

In [46]:
[gamescols[i][2:] for i in np.where(test[gamescols]>=10)[0].tolist()]

['lf', 'cf', 'rf', 'of']

In [51]:
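def findOtherPositions(x):
    """Return the position codes at which row x shows at least 10 games played."""
    # np.where gives the indices into gamescols; [2:] strips the 'G_' prefix
    other_pos = [gamescols[i][2:] for i in np.where(x[gamescols]>=10)[0].tolist()]
    return other_pos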

In [55]:
hitters['other_pos'] = hitters.apply(findOtherPositions, axis=1)

In [56]:
findOtherPositions(hitters.iloc[178,])

['lf', 'cf', 'rf', 'of']
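
Step 2 of the plan above mentions a string representation, while `findOtherPositions` returns a list. If a single string per player is preferred, something along these lines might work. This is a sketch only: `other_positions_str`, its `min_games` parameter, `toy`, and `toy_gamescols` are hypothetical names, and the values are made up.

```python
import pandas as pd

# Hypothetical subset of the G_* columns with made-up values
toy_gamescols = ["G_lf", "G_cf", "G_rf", "G_dh"]
toy = pd.DataFrame({
    "G_lf": [40.0, 0.0],
    "G_cf": [62.0, 3.0],
    "G_rf": [10.0, 0.0],
    "G_dh": [0.0, 9.0],
})

def other_positions_str(row, min_games=10):
    """Join the codes of all positions with at least min_games games played."""
    return ",".join(col[2:] for col in toy_gamescols if row[col] >= min_games)

toy["other_pos"] = toy.apply(other_positions_str, axis=1)
print(toy["other_pos"].tolist())  # ['lf,cf,rf', '']
```

A plain string is easier to write back to CSV than a list, at the cost of needing a `split(",")` when the positions are read again.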

In [59]:
# Check row 178 - it works!
hitters.iloc[178,]

playerID               blancgr01
yearID                      2012
LastOfstint                    1
LastOfteamID                 SFN
retroID                 blang001
nameFirst                 Gregor
nameLast                  Blanco
weight                       175
height                        71
bats                           L
lgID                          NL
G                            141
H                             96
AB                           393
HR                             5
RBI                           34
SB                            26
R                             56
2B                            14
3B                             5
CS                             6
BB                            51
SO                           104
IBB                            2
HBP                            2
SH                             5
SF                             2
GIDP                           0
G_p                            0
G_c                            0
G_1b      

I now have each player's primary position (assumed to be the position at which he played the most games during 2016) and every position at which he is eligible (at least 10 games played during 2016).

## Calculating any other desired statistics

I will also calculate the following statistics:

* On-base percentage
* Slugging percentage
* OPS (on-base plus slugging)
* ... others
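
The first three statistics in the list can be sketched from columns that already appear in `hitters.columns`, using the standard definitions: OBP = (H + BB + HBP) / (AB + BB + HBP + SF), SLG = total bases / AB, and OPS = OBP + SLG. The helper name `add_rate_stats` is an assumption for illustration, and treating a zero denominator as NaN is one possible choice rather than a requirement.

```python
import numpy as np
import pandas as pd

def add_rate_stats(df):
    """Return a copy of df with OBP, SLG and OPS columns added."""
    out = df.copy()
    # On-base percentage: times on base over plate appearances that count
    obp_den = out["AB"] + out["BB"] + out["HBP"] + out["SF"]
    out["OBP"] = (out["H"] + out["BB"] + out["HBP"]) / obp_den.replace(0, np.nan)
    # Total bases: H counts every hit once, so add 1 extra base per double,
    # 2 per triple and 3 per home run
    total_bases = out["H"] + out["2B"] + 2 * out["3B"] + 3 * out["HR"]
    out["SLG"] = total_bases / out["AB"].replace(0, np.nan)
    out["OPS"] = out["OBP"] + out["SLG"]
    return out

# Gregor Blanco's 2012 line, copied from the row 178 output shown earlier
blanco = pd.DataFrame({
    "H": [96], "AB": [393], "HR": [5], "2B": [14], "3B": [5],
    "BB": [51], "HBP": [2], "SF": [2],
})
print(add_rate_stats(blanco)[["OBP", "SLG", "OPS"]].round(3))
```

On the real data this would be called as `hitters = add_rate_stats(hitters)`; players with no at-bats end up with NaN rather than raising a division error.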