## Predicting the Baseball World Series Champion

Last week, in the Moneyball lecture, we discussed how regular season performance is not strongly correlated with winning the World Series in baseball. In this homework question, we'll use the same data to investigate how well we can predict the World Series winner at the beginning of the playoffs.

To begin, load the dataset baseball.csv into R using the read.csv function, and call the data frame "baseball". This is the same data file we used during the Moneyball lecture, and the data comes from Baseball-Reference.com.

As a reminder, this dataset contains data concerning a baseball team's performance in a given year. It has the following variables:

- Team: A code for the name of the team
- League: The Major League Baseball league the team belongs to, either AL (American League) or NL (National League)
- Year: The year of the corresponding record
- RS: The number of runs scored by the team in that year
- RA: The number of runs allowed by the team in that year
- W: The number of regular season wins by the team in that year
- OBP: The on-base percentage of the team in that year
- SLG: The slugging percentage of the team in that year
- BA: The batting average of the team in that year
- Playoffs: Whether the team made the playoffs in that year (1 for yes, 0 for no)
- RankSeason: Among the playoff teams in that year, the ranking of their regular season records (1 is best)
- RankPlayoffs: Among the playoff teams in that year, how well they fared in the playoffs. The team winning the World Series gets a RankPlayoffs of 1.
- G: The number of games a team played in that year
- OOBP: The team's opponents' on-base percentage in that year
- OSLG: The team's opponents' slugging percentage in that year

### Limiting to Teams Making the Playoffs

In [1]:
baseball = read.csv('./dataset/baseball.csv')
str(baseball)

'data.frame':	1232 obs. of  15 variables:
 $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
 $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
 $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
 $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
 $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
 $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
 $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
 $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
 $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
 $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
 $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
 $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
 $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.3

In [5]:
table(baseball$Year)


1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 
  20   20   20   20   20   20   20   24   24   24   24   24   24   24   26   26 
1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 
  26   26   26   26   26   26   26   26   26   26   26   26   26   28   28   28 
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
  30   30   30   30   30   30   30   30   30   30   30   30   30   30   30 

In [6]:
length(table(baseball$Year))

In [7]:
baseball = subset(baseball, Playoffs == 1)

In [8]:
nrow(baseball)

In [16]:
table(baseball$Year)


1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 
   2    2    2    2    2    2    2    4    4    4    4    4    4    4    4    4 
1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 
   4    4    4    4    4    4    4    4    4    4    4    4    4    4    8    8 
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
   8    8    8    8    8    8    8    8    8    8    8    8    8    8   10 

In [17]:
table(table(baseball$Year))


 2  4  8 10 
 7 23 16  1 

### Adding an Important Predictor

It's much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the total number of teams that made the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.

We start by storing the output of the table() function that counts the number of playoff teams from each year:
```R
PlayoffTable = table(baseball$Year)
```
You can output the table with the following command:
```R
PlayoffTable
```
We will use this stored table to look up the number of teams in the playoffs in the year of each team/year pair.
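As a minimal sketch of the lookup described above, the following uses a small made-up data frame (`toy`, with hypothetical teams and years, not rows from baseball.csv). The key point is that the names of a table are character strings, so the numeric year has to be converted with as.character() before indexing; indexing with the number itself would be treated as a position.

```R
# Toy data (hypothetical, not taken from baseball.csv): one row per playoff team.
toy = data.frame(
  Team = c("A", "B", "C", "D", "E", "F"),
  Year = c(1962, 1962, 1969, 1969, 1969, 1969)
)

# Count playoff teams per year; the names of the table are the years as strings.
toy_table = table(toy$Year)

# Look up each row's year by name, then drop the table class with as.numeric().
toy$NumCompetitors = as.numeric(toy_table[as.character(toy$Year)])

toy$NumCompetitors
```

Wrapping the lookup in as.numeric() is optional; it only ensures the new column is a plain numeric vector rather than a table object.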

In [18]:
PlayoffTable = table(baseball$Year)

In [21]:
names(PlayoffTable)

In [25]:
str(names(PlayoffTable))

 chr [1:47] "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969" "1970" ...


In [26]:
PlayoffTable[c("1990", "2001")]


1990 2001 
   4    8 

In [27]:
baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]

In [30]:
table(baseball$NumCompetitors)


  2   4   8  10 
 14  92 128  10 

In [31]:
playoff_team = subset(baseball, NumCompetitors == 8)
nrow(playoff_team)

### Bivariate Models for Predicting World Series Winner

In this problem, we seek to predict whether a team won the World Series; in our dataset this is denoted with a RankPlayoffs value of 1. Add a variable named WorldSeries to the baseball data frame, by typing the following command in your R console:
```R
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
```
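To illustrate what this command does, here is a small sketch on a made-up vector (`rank_playoffs` is hypothetical). The comparison produces TRUE/FALSE values, and as.numeric() turns them into 1/0. A missing rank stays missing, which is one reason the data frame is limited to playoff teams first: in the str() output above, the teams with Playoffs equal to 0 show NA for RankPlayoffs.

```R
# Hypothetical playoff ranks; NA stands for a team with no playoff rank.
rank_playoffs = c(1, 2, 3, 3, NA)

# TRUE/FALSE from the comparison becomes 1/0; NA is carried through unchanged.
world_series = as.numeric(rank_playoffs == 1)

world_series
table(world_series, useNA = "ifany")
```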

In [32]:
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)

In [33]:
nrow(subset(baseball, WorldSeries == 0))

When we're not sure which of our variables are useful in predicting a particular outcome, it's often helpful to build bivariate models, which are models that predict the outcome using a single independent variable. Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model? The candidate variables are Year, RS, RA, W, OBP, SLG, BA, RankSeason, OOBP, OSLG, NumCompetitors, and League. To determine significance, look at the stars in the summary output of the model. We'll define an independent variable as significant if there is at least one star at the end of the coefficients row for that variable (this is equivalent to the Pr(>|z|) column having a value smaller than 0.05). Note that you have to build 12 models to answer this question!
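Building the models one at a time works, but the same check can also be scripted. The sketch below is an assumption-laden illustration, not the notebook's method: it uses simulated data (`toy`, with made-up predictors `x1` and `x2` and outcome `y`), fits one bivariate logistic regression per predictor, and pulls the p-value from the second row of the coefficient table.

```R
# Simulated data (hypothetical): y depends on x1 but not on x2.
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y = rbinom(n, 1, plogis(-1 + 1.5 * toy$x1))

predictors = c("x1", "x2")

# Fit one bivariate model per predictor and keep the predictor's p-value.
pvals = sapply(predictors, function(v) {
  fit = glm(reformulate(v, response = "y"), data = toy, family = "binomial")
  # Row 1 is the intercept, row 2 is the predictor.
  coef(summary(fit))[2, "Pr(>|z|)"]
})

pvals
# Predictors that would earn at least one star at the 0.05 level.
names(pvals)[pvals < 0.05]
```

With the real data, `predictors` would hold the twelve column names and `data` would be the baseball data frame. Taking row 2 assumes each predictor contributes a single coefficient, which holds for numeric columns and for a two-level factor such as League.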

In [37]:
WSLogs1 = glm(WorldSeries ~ RS, data=baseball, family='binomial')
summary(WSLogs1)


Call:
glm(formula = WorldSeries ~ RS, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8254  -0.6819  -0.6363  -0.5561   2.0308  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.661226   1.636494   0.404    0.686
RS          -0.002681   0.002098  -1.278    0.201

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 237.45  on 242  degrees of freedom
AIC: 241.45

Number of Fisher Scoring iterations: 4


In [38]:
WSLogs2 = glm(WorldSeries ~ RA, data=baseball, family='binomial')
summary(WSLogs2)


Call:
glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9749  -0.6883  -0.6118  -0.4746   2.1577  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.888174   1.483831   1.272   0.2032  
RA          -0.005053   0.002273  -2.223   0.0262 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 233.88  on 242  degrees of freedom
AIC: 237.88

Number of Fisher Scoring iterations: 4


In [39]:
WSLogs3 = glm(WorldSeries ~ W, data=baseball, family='binomial')
summary(WSLogs3)


Call:
glm(formula = WorldSeries ~ W, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0623  -0.6777  -0.6117  -0.5367   2.1254  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -6.85568    2.87620  -2.384   0.0171 *
W            0.05671    0.02988   1.898   0.0577 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 235.51  on 242  degrees of freedom
AIC: 239.51

Number of Fisher Scoring iterations: 4


In [40]:
WSLogs4 = glm(WorldSeries ~ OBP, data=baseball, family='binomial')
summary(WSLogs4)


Call:
glm(formula = WorldSeries ~ OBP, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8071  -0.6749  -0.6365  -0.5797   1.9753  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    2.741      3.989   0.687    0.492
OBP          -12.402     11.865  -1.045    0.296

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 238.02  on 242  degrees of freedom
AIC: 242.02

Number of Fisher Scoring iterations: 4


In [41]:
WSLogs5 = glm(WorldSeries ~ SLG, data=baseball, family='binomial')
summary(WSLogs5)


Call:
glm(formula = WorldSeries ~ SLG, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9498  -0.6953  -0.6088  -0.5197   2.1136  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    3.200      2.358   1.357   0.1748  
SLG          -11.130      5.689  -1.956   0.0504 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 235.23  on 242  degrees of freedom
AIC: 239.23

Number of Fisher Scoring iterations: 4


In [42]:
WSLogs6 = glm(WorldSeries ~ BA, data=baseball, family='binomial')
summary(WSLogs6)


Call:
glm(formula = WorldSeries ~ BA, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6797  -0.6592  -0.6513  -0.6389   1.8431  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.6392     3.8988  -0.164    0.870
BA           -2.9765    14.6123  -0.204    0.839

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 239.08  on 242  degrees of freedom
AIC: 243.08

Number of Fisher Scoring iterations: 4


In [43]:
WSLogs7 = glm(WorldSeries ~ RankSeason, data=baseball, family='binomial')
summary(WSLogs7)


Call:
glm(formula = WorldSeries ~ RankSeason, family = "binomial", 
    data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7805  -0.7131  -0.5918  -0.4882   2.1781  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.8256     0.3268  -2.527   0.0115 *
RankSeason   -0.2069     0.1027  -2.016   0.0438 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 234.75  on 242  degrees of freedom
AIC: 238.75

Number of Fisher Scoring iterations: 4


In [44]:
WSLogs8 = glm(WorldSeries ~ OOBP, data=baseball, family='binomial')
summary(WSLogs8)


Call:
glm(formula = WorldSeries ~ OOBP, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5318  -0.5176  -0.5106  -0.5023   2.0697  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9306     8.3728  -0.111    0.912
OOBP         -3.2233    26.0587  -0.124    0.902

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 84.926  on 113  degrees of freedom
Residual deviance: 84.910  on 112  degrees of freedom
  (130 observations deleted due to missingness)
AIC: 88.91

Number of Fisher Scoring iterations: 4


In [45]:
WSLogs9 = glm(WorldSeries ~ OSLG, data=baseball, family='binomial')
summary(WSLogs9)


Call:
glm(formula = WorldSeries ~ OSLG, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5610  -0.5209  -0.5088  -0.4902   2.1268  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.08725    6.07285  -0.014    0.989
OSLG        -4.65992   15.06881  -0.309    0.757

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 84.926  on 113  degrees of freedom
Residual deviance: 84.830  on 112  degrees of freedom
  (130 observations deleted due to missingness)
AIC: 88.83

Number of Fisher Scoring iterations: 4


In [46]:
WSLogs10 = glm(WorldSeries ~ NumCompetitors, data=baseball, family='binomial')
summary(WSLogs10)


Call:
glm(formula = WorldSeries ~ NumCompetitors, family = "binomial", 
    data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9871  -0.8017  -0.5089  -0.5089   2.2643  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     0.03868    0.43750   0.088 0.929559    
NumCompetitors -0.25220    0.07422  -3.398 0.000678 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 226.96  on 242  degrees of freedom
AIC: 230.96

Number of Fisher Scoring iterations: 4


In [47]:
WSLogs11 = glm(WorldSeries ~ League, data=baseball, family='binomial')
summary(WSLogs11)


Call:
glm(formula = WorldSeries ~ League, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6772  -0.6772  -0.6306  -0.6306   1.8509  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.3558     0.2243  -6.045  1.5e-09 ***
LeagueNL     -0.1583     0.3252  -0.487    0.626    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 238.88  on 242  degrees of freedom
AIC: 242.88

Number of Fisher Scoring iterations: 4


In [48]:
WSLogs0 = glm(WorldSeries ~ Year, data=baseball, family='binomial')
summary(WSLogs0)


Call:
glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0297  -0.6797  -0.5435  -0.4648   2.1504  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 72.23602   22.64409    3.19  0.00142 **
Year        -0.03700    0.01138   -3.25  0.00115 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 228.35  on 242  degrees of freedom
AIC: 232.35

Number of Fisher Scoring iterations: 4


### Multivariate Models for Predicting World Series Winner

In [49]:
WSLogsMult = glm(WorldSeries ~ Year + RA + RankSeason + NumCompetitors, data=baseball, family='binomial')
summary(WSLogsMult)


Call:
glm(formula = WorldSeries ~ Year + RA + RankSeason + NumCompetitors, 
    family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0336  -0.7689  -0.5139  -0.4583   2.2195  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)    12.5874376 53.6474210   0.235    0.814
Year           -0.0061425  0.0274665  -0.224    0.823
RA             -0.0008238  0.0027391  -0.301    0.764
RankSeason     -0.0685046  0.1203459  -0.569    0.569
NumCompetitors -0.1794264  0.1815933  -0.988    0.323

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 226.37  on 239  degrees of freedom
AIC: 236.37

Number of Fisher Scoring iterations: 4


In [53]:
cor(baseball$Year, baseball$RA)
cor(baseball$Year, baseball$RankSeason)
cor(baseball$Year, baseball$NumCompetitors)
cor(baseball$RA, baseball$RankSeason)
cor(baseball$RA, baseball$NumCompetitors)
cor(baseball$RankSeason, baseball$NumCompetitors)
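The six pairwise calls above can also be collapsed into a single correlation matrix, which makes highly correlated pairs easier to spot (presumably the reason for checking, since none of the variables is significant in the combined model). The sketch below uses a made-up data frame with the same column names; the values are simulated and are not the notebook's results.

```R
# Simulated stand-in for the playoff data (hypothetical values).
set.seed(2)
toy = data.frame(Year = rep(1990:1999, each = 4))
toy$NumCompetitors = ifelse(toy$Year < 1995, 4, 8)  # playoff field grows over time
toy$RA = round(rnorm(nrow(toy), mean = 700, sd = 50))
toy$RankSeason = rep(1:4, times = 10)

# One call returns every pairwise correlation among the selected columns.
cm = cor(toy[, c("Year", "RA", "RankSeason", "NumCompetitors")])
round(cm, 2)
```

In this toy setup Year and NumCompetitors are strongly correlated by construction, because the number of playoff teams only changes as the years advance.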

In [54]:
library("ROCR")


In [58]:
WSLogsMult = glm(WorldSeries ~ Year + RA, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)

In [59]:
WSLogsMult = glm(WorldSeries ~ Year + RankSeason, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)

In [60]:
WSLogsMult = glm(WorldSeries ~ Year + NumCompetitors, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)

In [61]:
WSLogsMult = glm(WorldSeries ~ RA + RankSeason, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)

In [62]:
WSLogsMult = glm(WorldSeries ~ RA + NumCompetitors, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)

In [63]:
WSLogsMult = glm(WorldSeries ~ RankSeason + NumCompetitors, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)

In [64]:
WSLogsMult = glm(WorldSeries ~ Year, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)

In [65]:
WSLogsMult = glm(WorldSeries ~ RA, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)

In [66]:
WSLogsMult = glm(WorldSeries ~ RankSeason, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)

In [67]:
WSLogsMult = glm(WorldSeries ~ NumCompetitors, data=baseball, family='binomial')
pred = predict(WSLogsMult, type='response')
ROCRpred = prediction(pred, baseball$WorldSeries)
as.numeric(performance(ROCRpred, "auc")@y.values)
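The ten cells above repeat the same three steps for each formula, so the comparison can be sketched as a loop. To keep the example free of package dependencies, it swaps in a rank-based (Mann-Whitney) AUC written in base R in place of ROCR; `auc_rank` and `model_auc` are hypothetical helper names, and the data are simulated rather than taken from baseball.csv.

```R
# Rank-based (Mann-Whitney) AUC in base R, a stand-in for ROCR's "auc" measure.
auc_rank = function(scores, labels) {
  r = rank(scores)  # average ranks handle tied scores
  n_pos = sum(labels == 1)
  n_neg = sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical helper: fit a logistic model and return its in-sample AUC.
# Assumes the model variables have no missing values.
model_auc = function(formula, data) {
  fit = glm(formula, data = data, family = "binomial")
  outcome = data[[all.vars(formula)[1]]]
  auc_rank(predict(fit, type = "response"), outcome)
}

# Simulated data (hypothetical): y depends on x1 only.
set.seed(3)
toy = data.frame(x1 = rnorm(150), x2 = rnorm(150))
toy$y = rbinom(150, 1, plogis(-1 + toy$x1))

formulas = list(y ~ x1, y ~ x2, y ~ x1 + x2)
aucs = sapply(formulas, model_auc, data = toy)
names(aucs) = sapply(formulas, function(f) paste(deparse(f), collapse = ""))
aucs
```

With the real data, `formulas` would list the ten candidate models, for example `WorldSeries ~ Year + RA`, and `data` would be the baseball data frame. These are in-sample AUC values, as in the cells above, so they describe fit on the training data rather than out-of-sample accuracy.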