# HW4: Baseball Modeling

For this problem, please install the <i>Lahman</i> package, a comprehensive database of baseball statistics,
and use it to answer a few questions.

Important information:
<ul><li>
project home page (with links to impressive graphics):  http://lahman.r-forge.r-project.org/
</li><li>
package documentation (html):  http://lahman.r-forge.r-project.org/doc/
</li></ul>

The documentation includes descriptions of the many tables in this package, such as the
Salaries table: http://lahman.r-forge.r-project.org/doc/Salaries.html


#  The Goal

There are two problems for you to solve:
<ul><li>
Problem 1: construct a model that predicts a player's salary based on his baseball statistics.
Your model should have better performance (higher R-squared) than the baseline model given.
</li><li>
Problem 2: construct a model that predicts whether a player will be inducted into the Hall of Fame.
Your model should have better performance (higher Hall-of-Fame-Accuracy-Rate) than the baseline model given.
</li></ul>
Here, <i>Hall-of-Fame-Accuracy-Rate</i> is a weighted percentage of correct predictions:
<u>a correct prediction for a player in the Hall of Fame
is worth 100 times more than one for a player who is not in the Hall of Fame.</u>

Then, as in HW3, upload a .csv file containing your models to CCLE.


## Step 1: build the models

Using the 'RelevantInformation' table, one model should predict a player's maximum salary,
the other should predict whether or not they will get into the Hall of Fame.

<b>YOU CAN USE ANY MODEL YOU LIKE.</b>
The baseline models are a linear regression model and a logistic regression model,
but you can choose <i>any</i> model.
Please produce the most accurate models you can;
more accurate models will get a higher score.

<hr style="border-width:20px;">

## Step 2: generate a CSV file "HW4_Baseball_Models.csv" including your 2 models

For example, if the two commands shown below were your models, you would complete the assignment by creating
a CSV file <tt>HW4_Baseball_Models.csv</tt> containing two lines:

<code>
      0.8999,"lm( log10(max_salary) ~ AB+R+H+X2B+X3B+HR+RBI+SB, data = RelevantInformation)"
      0.7888,"glm( HallOfFame ~ AB+R+H+X2B+X3B+HR+SlugPct, data = RelevantInformation, family=binomial)"
</code>

<b>Each line gives the accuracy of a model</b>,
as well as <b>the exact command you used to generate the model</b>.
There is no length restriction on the lines.
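
The two-line format can be sketched as follows. This is only an illustration: the accuracy values and commands are the placeholder ones from the example above, and `out_file` (a name chosen here) points to a temporary directory rather than your working directory. Because each command contains commas, the sketch quotes the command column so that the file still parses as two fields per line.

```r
# Placeholder accuracies and commands, copied from the example above.
models <- data.frame(
  accuracy = c(0.8999, 0.7888),
  command  = c(
    "lm( log10(max_salary) ~ AB+R+H+X2B+X3B+HR+RBI+SB, data = RelevantInformation)",
    "glm( HallOfFame ~ AB+R+H+X2B+X3B+HR+SlugPct, data = RelevantInformation, family=binomial)"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical output location, used only so the sketch does not touch your files.
out_file <- file.path(tempdir(), "HW4_Baseball_Models.csv")

# quote = 2 wraps only the second column (the command) in double quotes;
# qmethod = "double" doubles any embedded double quotes, as CSV expects.
write.table(models, file = out_file, sep = ",", quote = 2, qmethod = "double",
            row.names = FALSE, col.names = FALSE)

# Read the file back to confirm it parses as two rows and two columns.
check <- read.csv(out_file, header = FALSE, stringsAsFactors = FALSE)
dim(check)
```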

<hr style="border-width:20px;">

## Step 3: upload your CSV file and notebook to CCLE

Finally, go to CCLE and upload:
<ul><li>
your output CSV file <tt>HW4_Baseball_Models.csv</tt>
</li><li>
your notebook file <tt>HW4_Baseball_Modeling.ipynb</tt>
</li></ul>

We are not planning to run any of the uploaded notebooks.
However, your notebook should contain the commands you used to develop your models,
in order to show your work.
As announced, all assignment grading in this course will be automated,
and the notebook is needed to check the results of the grading program.

# Get the Lahman package for R -- a database of Baseball Statistics

<hr style="border-width:20px;">

### The safe way to install it, so it will work with Jupyter -- execute the command:

<pre>
         sudo conda install -c https://conda.anaconda.org/asmeurer r-lahman
</pre>
### (The 'sudo' is not necessary if your conda installation is not write-protected.)

<hr style="border-width:20px;">

### Another way to install the Lahman package (if this works from within your Jupyter session):

In [1]:
if (!(is.element("Lahman", installed.packages()))) install.packages("Lahman", repos="http://cran.us.r-project.org")

### Load the Lahman baseball data

In [2]:
library(Lahman)

: package 'Lahman' was built under R version 3.2.5

<hr style="border-width:20px;">

### Another way to get the data, if you cannot load the Lahman package:

The files
<tt>PlayersAndStats.csv</tt>
and
<tt>PlayersAndStatsAndSalary.csv</tt>
are distributed with the homework assignment, and are used in the notebook below.

You can use these files rather than recomputing the tables with the Lahman package.
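
A minimal sketch of loading one of the distributed files, assuming it sits in the notebook's working directory. The helper `load_if_present` is a hypothetical name introduced here. It converts `bats` and `throws` to factors so that the table resembles the one built from the Lahman package.

```r
# Hypothetical helper: read a CSV file if it exists, otherwise return NULL.
load_if_present <- function(path) {
  if (!file.exists(path)) return(NULL)
  tbl <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
  # Keep the handedness columns as factors, as in the summaries printed in this notebook.
  for (col in intersect(c("bats", "throws"), names(tbl))) {
    tbl[[col]] <- factor(tbl[[col]])
  }
  tbl
}

# Only overwrite the table if the file was actually found.
csv_tbl <- load_if_present("PlayersAndStats.csv")
if (!is.null(csv_tbl)) PlayersAndStats <- csv_tbl
```

The same call with `"PlayersAndStatsAndSalary.csv"` would load the salary table.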

# Extract Tables of Relevant Information for your Models

### Player information -- from the Master table
http://lahman.r-forge.r-project.org/doc/Master.html

In [3]:
head(Master)

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,deathDate,birthDate
1,aardsda01,1981,12,27,USA,CO,Denver,,,,...,205,75,R,R,4/6/2004,9/28/2013,aardd001,aardsda01,,1981-12-27
2,aaronha01,1934,2,5,USA,AL,Mobile,,,,...,180,72,R,R,4/13/1954,10/3/1976,aaroh101,aaronha01,,1934-02-05
3,aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,...,190,75,R,R,4/10/1962,9/26/1971,aarot101,aaronto01,1984-08-16,1939-08-05
4,aasedo01,1954,9,8,USA,CA,Orange,,,,...,190,75,R,R,7/26/1977,10/3/1990,aased001,aasedo01,,1954-09-08
5,abadan01,1972,8,25,USA,FL,Palm Beach,,,,...,184,73,L,L,9/10/2001,4/13/2006,abada001,abadan01,,1972-08-25
6,abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,...,220,73,L,L,7/28/2010,9/27/2014,abadf001,abadfe01,,1985-12-17


In [4]:
SelectedColumns = c("playerID","nameFirst","nameLast","birthYear", "weight","height","bats","throws")
Players = na.omit( Master[, SelectedColumns] )
summary(Players)

   playerID          nameFirst           nameLast           birthYear   
 Length:17071       Length:17071       Length:17071       Min.   :1835  
 Class :character   Class :character   Class :character   1st Qu.:1902  
 Mode  :character   Mode  :character   Mode  :character   Median :1941  
                                                          Mean   :1935  
                                                          3rd Qu.:1969  
                                                          Max.   :1994  
     weight          height      bats      throws   
 Min.   : 65.0   Min.   :43.00   B: 1131   L: 3430  
 1st Qu.:170.0   1st Qu.:71.00   L: 4721   R:13641  
 Median :185.0   Median :72.00   R:11219            
 Mean   :186.2   Mean   :72.34                      
 3rd Qu.:200.0   3rd Qu.:74.00                      
 Max.   :320.0   Max.   :83.00                      

### Player Maximum Salary -- from the Salaries table
http://lahman.r-forge.r-project.org/doc/Salaries.html

In [5]:
head(Salaries)

Unnamed: 0,yearID,teamID,lgID,playerID,salary
1,1985,ATL,NL,barkele01,870000
2,1985,ATL,NL,bedrost01,550000
3,1985,ATL,NL,benedbr01,545000
4,1985,ATL,NL,campri01,633333
5,1985,ATL,NL,ceronri01,625000
6,1985,ATL,NL,chambch01,800000


In [6]:
summary(Salaries)

# example(Salaries)  # see demos of results from the Salaries table

PlayerMaxSalary = aggregate( salary ~ playerID, Salaries, max )
head(PlayerMaxSalary)
colnames(PlayerMaxSalary) = gsub( "salary", "max_salary", colnames(PlayerMaxSalary) )

head(PlayerMaxSalary)

     yearID         teamID      lgID         playerID        
 Min.   :1985   CLE    :  893   AL:12123   Length:24758      
 1st Qu.:1993   LAN    :  893   NL:12635   Class :character  
 Median :2000   PHI    :  893              Mode  :character  
 Mean   :2000   SLN    :  886                                
 3rd Qu.:2007   BAL    :  883                                
 Max.   :2014   BOS    :  883                                
                (Other):19427                                
     salary        
 Min.   :       0  
 1st Qu.:  260000  
 Median :  525000  
 Mean   : 1932905  
 3rd Qu.: 2199643  
 Max.   :33000000  
                   

Unnamed: 0,playerID,salary
1,aardsda01,4500000
2,aasedo01,675000
3,abadan01,327000
4,abadfe01,525900
5,abbotje01,300000
6,abbotji01,2775000


Unnamed: 0,playerID,max_salary
1,aardsda01,4500000
2,aasedo01,675000
3,abadan01,327000
4,abadfe01,525900
5,abbotje01,300000
6,abbotji01,2775000
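
To see what the `aggregate()` call does, consider a toy salary table. The player IDs and amounts below are invented for illustration and are not taken from the `Salaries` table.

```r
# Toy data: two made-up players with several salary records each.
toy_salaries <- data.frame(
  playerID = c("a01", "a01", "b02", "b02", "b02"),
  yearID   = c(2001, 2002, 1999, 2000, 2001),
  salary   = c(300000, 450000, 1000000, 2500000, 2000000),
  stringsAsFactors = FALSE
)

# One row per player, holding the largest salary seen for that player.
toy_max <- aggregate(salary ~ playerID, toy_salaries, max)

# Rename the aggregated column, as the notebook does with gsub().
colnames(toy_max)[colnames(toy_max) == "salary"] <- "max_salary"
toy_max
```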


In [7]:
PlayerStartYear = aggregate( yearID ~ playerID, Salaries, min )
colnames(PlayerStartYear) = gsub( "yearID", "startYear", colnames(PlayerStartYear) )

PlayerEndYear = aggregate( yearID ~ playerID, Salaries, max )
colnames(PlayerEndYear) = gsub( "yearID", "endYear", colnames(PlayerEndYear) )

head(PlayerStartYear)
head(PlayerEndYear)

Unnamed: 0,playerID,startYear
1,aardsda01,2004
2,aasedo01,1986
3,abadan01,2006
4,abadfe01,2011
5,abbotje01,1998
6,abbotji01,1989


Unnamed: 0,playerID,endYear
1,aardsda01,2012
2,aasedo01,1989
3,abadan01,2006
4,abadfe01,2014
5,abbotje01,2001
6,abbotji01,1999


### Batting Statistics -- from the battingStats() function
http://lahman.r-forge.r-project.org/doc/battingStats.html
   
(See also the Batting table:
http://lahman.r-forge.r-project.org/doc/Batting.html )

A glossary for Baseball Statistics Acronyms is in
   http://en.wikipedia.org/wiki/Baseball_statistics

In [8]:
BattingStats = battingStats()

In [9]:
head(BattingStats)

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,X2B,...,SH,SF,GIDP,BA,PA,TB,SlugPct,OBP,OPS,BABIP
1,abercda01,1871,1,TRO,,1,4,0,0,0,...,,,,0.0,4,0,0.0,0.0,0.0,0.0
2,addybo01,1871,1,RC1,,25,118,30,32,6,...,,,,0.271,122,38,0.322,0.295,0.617,0.271
3,allisar01,1871,1,CL1,,29,137,28,40,4,...,,,,0.292,139,54,0.394,0.302,0.696,0.303
4,allisdo01,1871,1,WS3,,27,133,28,44,10,...,,,,0.331,133,64,0.481,0.331,0.812,0.326
5,ansonca01,1871,1,RC1,,25,120,29,39,11,...,,,,0.325,122,56,0.467,0.336,0.803,0.328
6,armstbo01,1871,1,FW1,,12,49,9,11,2,...,,,,0.224,49,15,0.306,0.224,0.53,0.229


### Aggregate Batting Stats -- cumulative, over a player's career

In [10]:
TotalBattingCounts = aggregate( cbind(AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB) ~ playerID,
                                     BattingStats, sum)
nrow(TotalBattingCounts)
head(TotalBattingCounts)
MaxBattingPcts = aggregate( cbind(SlugPct,OBP,OPS,BABIP) ~ playerID,
                                 BattingStats, max )
nrow(MaxBattingPcts)
head(MaxBattingPcts)

AggregateBattingStats = merge(TotalBattingCounts,MaxBattingPcts, by="playerID")
summary(AggregateBattingStats)
nrow(AggregateBattingStats)

Unnamed: 0,playerID,AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB
1,aardsda01,3,0,0,0,0,0,0,0,0,0,0.0,4,0
2,aaronha01,12364,2174,3771,624,98,755,2297,240,73,1402,6.927,13940,6856
3,aaronto01,944,102,216,42,6,13,94,9,8,86,1.545,1045,309
4,aasedo01,5,0,0,0,0,0,0,0,0,0,0.0,5,0
5,abadan01,21,1,2,0,0,0,0,0,1,4,0.118,25,2
6,abadfe01,8,0,1,0,0,0,0,0,0,0,0.143,8,1


Unnamed: 0,playerID,SlugPct,OBP,OPS,BABIP
1,aardsda01,0.0,0.0,0.0,0.0
2,aaronha01,0.669,0.41,1.079,0.338
3,aaronto01,0.374,0.318,0.686,0.276
4,aasedo01,0.0,0.0,0.0,0.0
5,abadan01,0.118,0.4,0.4,0.167
6,abadfe01,0.143,0.143,0.286,0.25


   playerID               AB                R                H         
 Length:11532       Min.   :    1.0   Min.   :   0.0   Min.   :   0.0  
 Class :character   1st Qu.:   19.0   1st Qu.:   1.0   1st Qu.:   3.0  
 Mode  :character   Median :  136.5   Median :  12.0   Median :  25.0  
                    Mean   :  896.7   Mean   : 117.6   Mean   : 234.8  
                    3rd Qu.:  834.5   3rd Qu.:  95.0   3rd Qu.: 199.0  
                    Max.   :14053.0   Max.   :2295.0   Max.   :4256.0  
      X2B              X3B                HR             RBI        
 Min.   :  0.00   Min.   :  0.000   Min.   :  0.0   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  0.0   1st Qu.:   1.0  
 Median :  4.00   Median :  0.000   Median :  1.0   Median :  10.0  
 Mean   : 41.29   Mean   :  6.723   Mean   : 21.4   Mean   : 109.6  
 3rd Qu.: 33.00   3rd Qu.:  5.000   3rd Qu.: 10.0   3rd Qu.:  85.0  
 Max.   :746.00   Max.   :173.000   Max.   :762.0   Max.   :2297.0  
       SB    
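
Note that `BA` in `TotalBattingCounts` is a sum of seasonal batting averages (6.927 for `aaronha01` in the output above), not a career average. One possible alternative, which is not part of the baseline, is to recompute a career average as total hits divided by total at-bats. The sketch below uses invented toy data, and the column name `careerBA` is an assumption introduced here.

```r
# Toy data: player p1 has two seasons, player p2 has one.
toy_batting <- data.frame(
  playerID = c("p1", "p1", "p2"),
  AB       = c(100, 300, 50),
  H        = c(20, 90, 10),
  stringsAsFactors = FALSE
)
toy_batting$BA <- toy_batting$H / toy_batting$AB   # seasonal averages: 0.2, 0.3, 0.2

# Summing every column, as the notebook does, also sums the seasonal averages.
totals <- aggregate(cbind(AB, H, BA) ~ playerID, toy_batting, sum)

# Career average from the career totals instead.
totals$careerBA <- totals$H / totals$AB
totals
```

For `p1`, the summed `BA` is 0.5 while the career average is 110 / 400 = 0.275.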

### Inducted into the Hall of Fame?  -- from the HallOfFame table
http://lahman.r-forge.r-project.org/doc/HallOfFame.html

In [11]:
data(HallOfFame)
head(HallOfFame)
nrow(HallOfFame)
InductedIntoHallOfFame = subset(HallOfFame, inducted == 'Y')[ , 1:2]

head(InductedIntoHallOfFame)
nrow(InductedIntoHallOfFame)

Unnamed: 0,playerID,yearID,votedBy,ballots,needed,votes,inducted,category,needed_note
1,cobbty01,1936,BBWAA,226,170,222,Y,Player,
2,ruthba01,1936,BBWAA,226,170,215,Y,Player,
3,wagneho01,1936,BBWAA,226,170,215,Y,Player,
4,mathech01,1936,BBWAA,226,170,205,Y,Player,
5,johnswa01,1936,BBWAA,226,170,189,Y,Player,
6,lajoina01,1936,BBWAA,226,170,146,N,Player,


Unnamed: 0,playerID,yearID
1,cobbty01,1936
2,ruthba01,1936
3,wagneho01,1936
4,mathech01,1936
5,johnswa01,1936
111,lajoina01,1937


### Include HallOfFame information in the Players table

In [12]:
PlayersWithHallOfFame = transform( merge( Players, InductedIntoHallOfFame, all.x=TRUE, by="playerID"),
                                        HallOfFame = ifelse( is.na(yearID), 0, 1 ),
                                        yearID = ifelse( is.na(yearID), 0, yearID )
                                        )
colnames(PlayersWithHallOfFame) = gsub( "yearID", "HallOfFameYear", colnames(PlayersWithHallOfFame) )
head(PlayersWithHallOfFame, 20)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame
1,aardsda01,David,Aardsma,1981,205,75,R,R,0,0
2,aaronha01,Hank,Aaron,1934,180,72,R,R,1982,1
3,aaronto01,Tommie,Aaron,1939,190,75,R,R,0,0
4,aasedo01,Don,Aase,1954,190,75,R,R,0,0
5,abadan01,Andy,Abad,1972,184,73,L,L,0,0
6,abadfe01,Fernando,Abad,1985,220,73,L,L,0,0
7,abadijo01,John,Abadie,1854,192,72,R,R,0,0
8,abbated01,Ed,Abbaticchio,1877,170,71,R,R,0,0
9,abbeybe01,Bert,Abbey,1869,175,71,R,R,0,0
10,abbeych01,Charlie,Abbey,1866,169,68,L,L,0,0
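
The merge with `all.x=TRUE` is a left join: every player is kept, and players without an induction record get `NA` for `yearID`, which `ifelse()` then turns into a 0/1 flag. A toy version, with invented player IDs:

```r
# Toy data: three made-up players, one of whom was inducted.
toy_players  <- data.frame(playerID = c("a01", "b02", "c03"), stringsAsFactors = FALSE)
toy_inducted <- data.frame(playerID = "b02", yearID = 1999, stringsAsFactors = FALSE)

# Left join: keep all players, fill missing induction years with NA.
joined <- merge(toy_players, toy_inducted, by = "playerID", all.x = TRUE)

# 1 if an induction year was found, 0 otherwise.
joined$HallOfFame <- ifelse(is.na(joined$yearID), 0, 1)
joined
```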


In [13]:
nrow(PlayersWithHallOfFame)
nrow(subset(PlayersWithHallOfFame, HallOfFame == 1))

In [14]:
PlayersAndStats = merge( PlayersWithHallOfFame, AggregateBattingStats )

nrow(PlayersAndStats)
nrow(subset(PlayersAndStats, HallOfFame == 1))

head(PlayersAndStats)
summary(PlayersAndStats)
# write.csv(PlayersAndStats, file="PlayersAndStats.csv", quote=FALSE, row.names=FALSE)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,...,SB,CS,BB,BA,PA,TB,SlugPct,OBP,OPS,BABIP
1,aardsda01,David,Aardsma,1981,205,75,R,R,0,0,...,0,0,0,0.0,4,0,0.0,0.0,0.0,0.0
2,aaronha01,Hank,Aaron,1934,180,72,R,R,1982,1,...,240,73,1402,6.927,13940,6856,0.669,0.41,1.079,0.338
3,aaronto01,Tommie,Aaron,1939,190,75,R,R,0,0,...,9,8,86,1.545,1045,309,0.374,0.318,0.686,0.276
4,aasedo01,Don,Aase,1954,190,75,R,R,0,0,...,0,0,0,0.0,5,0,0.0,0.0,0.0,0.0
5,abadan01,Andy,Abad,1972,184,73,L,L,0,0,...,0,1,4,0.118,25,2,0.118,0.4,0.4,0.167
6,abadfe01,Fernando,Abad,1985,220,73,L,L,0,0,...,0,0,0,0.143,8,1,0.143,0.143,0.286,0.25


   playerID          nameFirst           nameLast           birthYear   
 Length:11299       Length:11299       Length:11299       Min.   :1835  
 Class :character   Class :character   Class :character   1st Qu.:1921  
 Mode  :character   Mode  :character   Mode  :character   Median :1951  
                                                          Mean   :1945  
                                                          3rd Qu.:1972  
                                                          Max.   :1994  
     weight          height      bats     throws   HallOfFameYear   
 Min.   :120.0   Min.   :63.00   B: 854   L:2179   Min.   :   0.00  
 1st Qu.:175.0   1st Qu.:71.00   L:3181   R:9120   1st Qu.:   0.00  
 Median :185.0   Median :73.00   R:7264            Median :   0.00  
 Mean   :188.5   Mean   :72.59                     Mean   :  33.95  
 3rd Qu.:200.0   3rd Qu.:74.00                     3rd Qu.:   0.00  
 Max.   :320.0   Max.   :83.00                     Max.   :2015.00  
   Hal

# Join Information for your Baseball Salary model into one Table

### Merge Aggregate Batting Statistics and Maximum Salary into the Relevant Information table

In [15]:
PlayersAndStatsAndSalary = transform(
    merge( merge( merge( PlayersAndStats, PlayerMaxSalary ), PlayerStartYear), PlayerEndYear ),
    totalYears = endYear - startYear + 1
    )
head(PlayersAndStatsAndSalary)
nrow(PlayersAndStatsAndSalary)

PlayersAndStatsAndSalaryAndFame=subset(PlayersAndStatsAndSalary, HallOfFame == 1)
head(PlayersAndStatsAndSalaryAndFame)
nrow(PlayersAndStatsAndSalaryAndFame)
# write.csv(PlayersAndStatsAndSalary, file="PlayersAndStatsAndSalary.csv", quote=FALSE, row.names=FALSE)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,...,PA,TB,SlugPct,OBP,OPS,BABIP,max_salary,startYear,endYear,totalYears
1,aardsda01,David,Aardsma,1981,205,75,R,R,0,0,...,4,0,0.0,0.0,0.0,0.0,4500000,2004,2012,9
2,aasedo01,Don,Aase,1954,190,75,R,R,0,0,...,5,0,0.0,0.0,0.0,0.0,675000,1986,1989,4
3,abadan01,Andy,Abad,1972,184,73,L,L,0,0,...,25,2,0.118,0.4,0.4,0.167,327000,2006,2006,1
4,abadfe01,Fernando,Abad,1985,220,73,L,L,0,0,...,8,1,0.143,0.143,0.286,0.25,525900,2011,2014,4
5,abbotje01,Jeff,Abbott,1972,190,74,R,L,0,0,...,649,248,0.492,0.343,0.79,0.32,300000,1998,2001,4
6,abbotji01,Jim,Abbott,1967,200,75,L,L,0,0,...,24,2,0.095,0.095,0.19,0.182,2775000,1989,1999,11


Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,...,PA,TB,SlugPct,OBP,OPS,BABIP,max_salary,startYear,endYear,totalYears
61,alomaro01,Roberto,Alomar,1968,184,72,B,R,2011,1,...,10400,4018,0.541,0.422,0.956,0.351,8000000,1989,2004,16
317,biggicr01,Craig,Biggio,1965,185,71,R,R,2015,1,...,12503,4711,0.503,0.415,0.916,0.368,9750000,1989,2007,19
347,blylebe01,Bert,Blyleven,1951,200,75,R,R,2011,1,...,514,66,0.181,0.177,0.358,0.28,2000000,1985,1992,8
359,boggswa01,Wade,Boggs,1958,190,74,L,R,2005,1,...,10740,4064,0.588,0.476,1.049,0.396,4724316,1985,1999,15
432,brettge01,George,Brett,1953,185,72,L,R,1999,1,...,11624,5044,0.664,0.454,1.118,0.368,3105000,1985,1993,9
593,carewro01,Rod,Carew,1945,170,72,L,R,1991,1,...,10550,3998,0.57,0.449,1.019,0.415,875000,1985,1985,1


# Problem 1: construct a model with better performance  (higher R-squared) than this Baseline Salary Model

### For this salary model, consider only those players who started playing in 2000 or later:

In [16]:
RecentPlayersAndStatsAndSalary = subset( PlayersAndStatsAndSalary, startYear >= 2000 )
nrow(RecentPlayersAndStatsAndSalary)

In [17]:
#summary(PlayersAndStatsAndSalary)
#head(PlayersAndStatsAndSalary)

In [18]:
BaselineSalaryModel = lm( log10(max_salary) ~
                         AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP + startYear + totalYears,
                         data = PlayersAndStatsAndSalary)
summary(BaselineSalaryModel)


Call:
lm(formula = log10(max_salary) ~ AB + R + H + X2B + X3B + HR + 
    RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP + startYear + 
    totalYears, data = PlayersAndStatsAndSalary)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9378 -0.2139 -0.0706  0.2238  1.5111 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.752e+01  1.247e+00 -38.122  < 2e-16 ***
AB          -2.826e-03  2.860e-04  -9.881  < 2e-16 ***
R           -1.949e-03  2.633e-04  -7.400 1.64e-13 ***
H            4.363e-04  1.886e-04   2.314 0.020727 *  
X2B          5.930e-04  3.455e-04   1.716 0.086173 .  
X3B          3.191e-03  8.454e-04   3.774 0.000163 ***
HR           2.807e-03  5.168e-04   5.432 5.89e-08 ***
RBI         -5.176e-04  2.389e-04  -2.167 0.030328 *  
SB           6.342e-04  2.307e-04   2.749 0.006005 ** 
CS          -6.031e-04  7.514e-04  -0.803 0.422225    
BB          -2.548e-03  2.635e-04  -9.672  < 2e-16 ***
BA          -9.005e-02  1.061e-02  -8.4
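
The performance measure for Problem 1 is R-squared, which `summary()` of an `lm` fit exposes as `r.squared`. The sketch below uses simulated toy data (not the baseball tables) to show how to read it, and to illustrate that adding a predictor to a least-squares model never lowers the in-sample R-squared, even when the predictor is pure noise. The adjusted value `adj.r.squared` penalizes extra predictors and is shown for comparison.

```r
set.seed(1)                    # toy data, invented for illustration
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                 # unrelated to y
y  <- 1 + 2 * x1 + rnorm(n)

small <- lm(y ~ x1)
large <- lm(y ~ x1 + x2)       # same model plus the noise predictor

# In-sample R-squared: the larger model is never worse on this measure.
c(small = summary(small)$r.squared,     large = summary(large)$r.squared)

# Adjusted R-squared accounts for the number of predictors.
c(small = summary(small)$adj.r.squared, large = summary(large)$adj.r.squared)
```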

In [19]:
trans.PlayersAndStatsAndSalary  =  transform(
                                            PlayersAndStatsAndSalary,
                                            log10_max_salary=log10(max_salary),
                                            log10_AB = log10(1+AB),
                                            log10_R = log10(1+R),
                                            log10_H = log10(1+H),
                                            log10_X2B = log10(1+X2B),
                                            log10_X3B = log10(1+X3B),
                                            log10_HR= log10(1+HR) ,
                                            log10_RBI = log10(1+RBI),
                                            log10_SB = log10(1+SB), 
                                            log10_CS = log10(1+CS),
                                            log10_BB = log10(1+BB),
                                            log10_BA = log10(1+BA),
                                            log10_PA = log10(1+PA),
                                            log10_SlugPct = log10(1+SlugPct),
                                            log10_OBP= log10(1+OBP) ,
                                            log10_BABIP = log10(1+BABIP),
                                            log10_startYear = log10(startYear),
                                            log10_totalYears= log10(totalYears)
                                            )

In [20]:
head(trans.PlayersAndStatsAndSalary)
#summary(trans.PlayersAndStatsAndSalary)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,...,log10_SB,log10_CS,log10_BB,log10_BA,log10_PA,log10_SlugPct,log10_OBP,log10_BABIP,log10_startYear,log10_totalYears
1,aardsda01,David,Aardsma,1981,205,75,R,R,0,0,...,0.0,0.0,0.0,0.0,0.69897,0.0,0.0,0.0,3.301898,0.9542425
2,aasedo01,Don,Aase,1954,190,75,R,R,0,0,...,0.0,0.0,0.0,0.0,0.7781513,0.0,0.0,0.0,3.297979,0.60206
3,abadan01,Andy,Abad,1972,184,73,L,L,0,0,...,0.0,0.30103,0.69897,0.0484418,1.414973,0.0484418,0.146128,0.06707086,3.302331,0.0
4,abadfe01,Fernando,Abad,1985,220,73,L,L,0,0,...,0.0,0.0,0.0,0.05804623,0.9542425,0.05804623,0.05804623,0.09691001,3.303412,0.60206
5,abbotje01,Jeff,Abbott,1972,190,74,R,L,0,0,...,0.845098,0.7781513,1.591065,0.3494718,2.812913,0.1737688,0.128076,0.1205739,3.300595,0.60206
6,abbotji01,Jim,Abbott,1967,200,75,L,L,0,0,...,0.0,0.0,0.0,0.03941412,1.39794,0.03941412,0.03941412,0.07261748,3.298635,1.041393
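
The `log10(1 + x)` form is used for the counting statistics because many players have zero counts, and `log10(0)` is `-Inf`. A more compact way to build such columns, sketched on invented toy data (the names `toy` and `count_cols` are assumptions for this example), is to transform a set of columns at once:

```r
# Toy data: counts that include zeros.
toy <- data.frame(AB = c(0, 9, 99), HR = c(0, 0, 9))

# Columns to transform; in the notebook this would be the list of batting counts.
count_cols <- c("AB", "HR")

# log10(1 + x) applied to every selected column; zeros map to 0 instead of -Inf.
logged <- log10(1 + toy[count_cols])
names(logged) <- paste0("log10_", count_cols)

toy_trans <- cbind(toy, logged)
toy_trans
```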


In [41]:
ImprovedSalaryModel = lm( log10(max_salary) ~ 
                         AB + R + H + X2B + X3B + HR + RBI + SB + CS + BB + BA + PA 
                         + SlugPct + OBP + BABIP + startYear + totalYears
                         + weight + height + bats + throws,
                         data = PlayersAndStatsAndSalary)
test.salary.set = read.csv( file("HW4_Baseball_test/HW4_Baseball_Salary_test.csv"), header = TRUE )
salaryPred = 10^(predict(ImprovedSalaryModel,test.salary.set))
summary(ImprovedSalaryModel) 


Call:
lm(formula = log10(max_salary) ~ AB + R + H + X2B + X3B + HR + 
    RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP + startYear + 
    totalYears + weight + height + bats + throws, data = PlayersAndStatsAndSalary)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.82865 -0.21440 -0.06089  0.21559  1.52102 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.721e+01  1.411e+00 -33.461  < 2e-16 ***
AB          -2.755e-03  2.852e-04  -9.661  < 2e-16 ***
R           -1.616e-03  2.630e-04  -6.145 8.77e-10 ***
H            2.737e-04  1.880e-04   1.456 0.145489    
X2B          5.793e-04  3.417e-04   1.695 0.090080 .  
X3B          2.987e-03  8.431e-04   3.543 0.000401 ***
HR           2.358e-03  5.154e-04   4.575 4.91e-06 ***
RBI         -5.559e-04  2.366e-04  -2.349 0.018869 *  
SB           4.857e-04  2.283e-04   2.127 0.033458 *  
CS          -3.692e-04  7.436e-04  -0.496 0.619599    
BB          -2.585e-03  2.649e-04  -9.758  < 2e-16

In [38]:
# salaryPred was already converted back to dollars with 10^(...) when it was computed
head(salaryPred)

# Problem 2: construct a model with better performance  (higher accuracy) than this Baseline Hall of Fame Model

###  Hall of Fame election rules:


A. A baseball player must have been active as a player in the Major Leagues at some time during a period beginning fifteen (15) years before and ending five (5) years prior to election.

B. Player must have played in each of ten (10) Major League championship seasons, some part of which must have been within the period described in 3(A).

C. Player shall have ceased to be an active player in the Major Leagues at least five (5) calendar years preceding the election but may be otherwise connected with baseball.

### Consequently:   only consider players born before 1970
(They must start around 20 years of age, play at least 10 years, have stopped playing at least 5 years earlier, and take perhaps 10 years to win the ballot -- so born at least 45 years ago.)

In [23]:
HallOfFameContenders = subset( PlayersAndStats, birthYear < 1970 )
head(HallOfFameContenders)
nrow(HallOfFameContenders)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,...,SB,CS,BB,BA,PA,TB,SlugPct,OBP,OPS,BABIP
2,aaronha01,Hank,Aaron,1934,180,72,R,R,1982,1,...,240,73,1402,6.927,13940,6856,0.669,0.41,1.079,0.338
3,aaronto01,Tommie,Aaron,1939,190,75,R,R,0,0,...,9,8,86,1.545,1045,309,0.374,0.318,0.686,0.276
4,aasedo01,Don,Aase,1954,190,75,R,R,0,0,...,0,0,0,0.0,5,0,0.0,0.0,0.0,0.0
7,abadijo01,John,Abadie,1854,192,72,R,R,0,0,...,1,0,0,0.472,49,11,0.25,0.25,0.5,0.25
9,abbotji01,Jim,Abbott,1967,200,75,L,L,0,0,...,0,0,0,0.095,24,2,0.095,0.095,0.19,0.182
10,abbotku01,Kurt,Abbott,1969,180,71,R,R,0,0,...,22,11,133,2.511,2227,864,0.465,0.326,0.77,0.354


In [24]:
BaselineHallOfFameModel = glm( HallOfFame ~ AB + R + H + X2B + X3B + HR + RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP,
                         data = HallOfFameContenders, family=binomial)

#summary(BaselineHallOfFameModel)
baselineConfusionMatrix = table( round(predict(BaselineHallOfFameModel, type="response")), HallOfFameContenders$HallOfFame )
baselineConfusionMatrix

   
       0    1
  0 7899  155
  1   19   38

In [25]:
sum(round(predict(BaselineHallOfFameModel, type="response")))
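
The baseline turns predicted probabilities into labels with `round()`, which amounts to a cutoff of about 0.5. With so few positive cases, a lower cutoff labels more players as Hall of Famers and can only keep or increase the share of true Hall of Famers that are found. The sketch below illustrates this on simulated, imbalanced toy data (not the baseball tables). The cutoff 0.1 and the helper `recall_at` are assumptions chosen for illustration, not recommended settings.

```r
set.seed(42)                   # simulated toy data with a rare positive class
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-4 + 1.5 * x))

fit  <- glm(y ~ x, family = binomial)
prob <- predict(fit, type = "response")

# Hypothetical helper: share of actual positives predicted positive at a cutoff.
recall_at <- function(cutoff) sum(prob >= cutoff & y == 1) / sum(y == 1)

recall_half <- recall_at(0.5)  # roughly what round() does
recall_low  <- recall_at(0.1)  # a lower cutoff flags more cases as positive
c(recall_half = recall_half, recall_low = recall_low)
```

A lower cutoff also produces more false positives, so the trade-off should be judged with the weighted accuracy rate used in this assignment.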

In [26]:
# we need the e1071 package for the svm() function
not.installed <- function(pkg) !is.element(pkg, installed.packages()[,1])

if (not.installed("e1071")) install.packages("e1071", repos = "http://cran.r-project.org")
library(e1071)
svmModel = svm( 
                 HallOfFame~ AB + R + H + X2B + X3B + HR + RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP + weight + height + bats + throws + birthYear, 
                data=HallOfFameContenders, 
                type = "C", 
                kernel = "radial", gamma = 0.5,
                cost = 62.5,  tolerance = 0.001, epsilon = 0.001,
                na.action = na.omit
              )
test.hallOfFame.set = read.csv( file("HW4_Baseball_test/HW4_Baseball_HallOfFame_test.csv"), header = TRUE )
hallOfFamePred = predict(svmModel,test.hallOfFame.set)
#summary(svmModel)

: package 'e1071' was built under R version 3.2.5

In [43]:
sum(hallOfFamePred==1)

In [32]:
svmPred = predict(svmModel, HallOfFameContenders)
svmConfusionMatrix=table( predict(svmModel, type="response"), HallOfFameContenders$HallOfFame )
svmConfusionMatrix

   
       0    1
  0 7918    3
  1    0  190

In [33]:
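# Cells in column-major order: [1] pred 0/actual 0, [2] pred 1/actual 0,
# [3] pred 0/actual 1, [4] pred 1/actual 1. Hall of Fame players (actual 1) are weighted by 100.
svmWeightedAccuracy = 
                    (svmConfusionMatrix[1]+svmConfusionMatrix[4]*100)/
                    (svmConfusionMatrix[1]+svmConfusionMatrix[2] +svmConfusionMatrix[3]*100+svmConfusionMatrix[4]*100)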


In [34]:
svmWeightedAccuracy

In [40]:
IAccuracy = summary(ImprovedSalaryModel)$r.squared
vecLine1=c( IAccuracy, "lm( log10(max_salary) ~ AB + R + H + X2B + X3B + HR + RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP + startYear + totalYears+ weight + height + bats + throws,data = PlayersAndStatsAndSalary)" )  
vecLine2=c( svmWeightedAccuracy, "svmModel = svm( HallOfFame~AB + R + H + X2B + X3B + HR + RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP + weight + height + bats + throws + birthYear, data=HallOfFameContenders, type = \"C\", kernel = \"radial\", gamma = 0.5,cost = 62.5,  tolerance = 0.001, epsilon = 0.001,na.action = na.omit)")
DF = rbind(vecLine1,vecLine2)
# quote = 2 wraps the command column in double quotes, because the commands contain commas
write.table(DF, file = "HW4_Baseball_Models.csv", append =FALSE, quote = 2, sep = ",",
            eol = "\n", na = "", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = "double",
            fileEncoding = "")
write.table(salaryPred, file = "HW4_Baseball_Salary_predictions.csv", append =FALSE, quote = FALSE, sep = ",",
            eol = "\n", na = "", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")
write.table(hallOfFamePred, file = "HW4_Baseball_HallOfFame_predictions.csv", append =FALSE, quote = FALSE, sep = ",",
            eol = "\n", na = "", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")

##  Warning!  This dataset is severely imbalanced.  Read Ch.16 of [APM]

Only about 1% or 2% of all players are inducted into the Hall of Fame:

In [None]:
( FameTally = table( HallOfFameContenders$HallOfFame ) )

In [None]:
data.frame( percentageOfHallOfFamers = 100 * FameTally[[2]] / sum(FameTally) )

##  The measure of accuracy will heavily emphasize correct prediction of Hall-of-Fame players

Even though classifying everybody as a NON-Hall-of-Fame player is right
for about 98% of the players, predictions for Hall-of-Fame players will be weighted heavily in this assignment.
A model that ignores these players will get a very low score.

Specifically, your model will be evaluated by its <b>Hall-of-Fame-Accuracy-Rate</b>:
<blockquote>
This rate is a weighted percentage of correct predictions:
<u>a correct prediction for a player in the Hall of Fame
is worth 100 times more than one for a player who is not in the Hall of Fame.</u>
</blockquote>
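
One reading of this rate, consistent with the `svmWeightedAccuracy` formula used earlier in this notebook, is sketched below. The helper name `hall_of_fame_accuracy` is introduced here for illustration, and the grading program's exact computation is not shown in this assignment, so treat this as an assumption. The labels are rebuilt from the baseline confusion matrix printed above.

```r
# Hypothetical helper: weighted share of correct predictions, where each
# Hall of Fame player (actual == 1) counts `weight` times as much as other players.
hall_of_fame_accuracy <- function(actual, predicted, weight = 100) {
  w <- ifelse(actual == 1, weight, 1)
  sum(w * (predicted == actual)) / sum(w)
}

# Baseline confusion matrix: 7899 true negatives, 19 false positives,
# 155 false negatives, 38 true positives.
actual    <- c(rep(0, 7899), rep(0, 19), rep(1, 155), rep(1, 38))
predicted <- c(rep(0, 7899), rep(1, 19), rep(0, 155), rep(1, 38))

baseline_rate <- hall_of_fame_accuracy(actual, predicted)

# Predicting that nobody is in the Hall of Fame, for comparison.
all_zero_rate <- hall_of_fame_accuracy(actual, rep(0, length(actual)))
c(baseline = baseline_rate, all_zero = all_zero_rate)
```

Under this reading the baseline scores 11699 / 27218, about 0.43, and the all-zero prediction scores 7918 / 27218, about 0.29, even though it is right for about 98% of the players.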


In [None]:
#  Create your model here ...