# Variable selection

In [2]:
data bloodp;
  set '/folders/myfolders/lect2exa.sas7bdat';
run;

In [2]:
proc print data=bloodp  (obs=5);
run;

Obs,sbp,bmi,smkstat,age,smoke_current,smoke_former,eversmoker,patient,weight,CardDeath10yr,BetaBlocker,SRHealth
1,108.828,34.4536,F,56.5066,0,1,1,1,286.458,0,0,.
2,108.828,34.4536,F,56.5066,0,1,1,2,286.458,0,0,.
3,108.828,34.4536,F,56.5066,0,1,1,3,286.458,0,0,.
4,108.828,34.4536,F,56.5066,0,1,1,4,286.458,0,0,.
5,108.828,34.4536,F,56.5066,0,1,1,5,286.458,0,0,.


Using the same data set as in lecture 2, we will try different methods for variable selection. SAS lets you partition the data into three parts: training, validation, and test data.

1. Models are fit on the training data, and a prediction error is computed for each candidate model. The most common measure is the mean squared error (MSE), which SAS reports as the average squared error (ASE), i.e.,

$$
MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2
$$


2. Tuning, i.e., choosing among the candidate models produced by variable selection, is done on the validation data.
3. The test set allows us to see which model performs best on data used for neither fitting nor tuning.
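The three-way split described above can also be built by hand, which makes the role of each part explicit. The following is only a minimal sketch: the data set name `bloodp_split` and the variables `u` and `role` are hypothetical, and the 50/30/20 proportions are chosen to mirror `fraction(validate=0.3 test=0.2)`. It will not reproduce the exact partition that `seed=1` gives inside `proc glmselect`.

```sas
/* Assign each observation a role using a uniform random draw. */
data bloodp_split;
  set bloodp;
  length role $8;                      /* long enough for 'VALIDATE' */
  if _n_ = 1 then call streaminit(1);  /* fix the random stream once */
  u = rand('uniform');
  if u < 0.5 then role = 'TRAIN';
  else if u < 0.8 then role = 'VALIDATE';
  else role = 'TEST';
  drop u;
run;

/* Tell GLMSELECT which value of role marks each part. */
proc glmselect data=bloodp_split;
  partition rolevar=role(train='TRAIN' validate='VALIDATE' test='TEST');
  class smkstat BetaBlocker SRHealth;
  model sbp = SRHealth smkstat BetaBlocker bmi age
              / selection=forward(choose=validate stop=none);
run;
```

Storing the role in the data set keeps the same split available to other procedures, which `partition fraction(...)` alone does not do.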

Why do this? Because of the bias-variance trade-off.

<center><img src='Bias-Variance.png'  width="600"></center>

$$
\mathbb{E}\left[(y-\hat{f}(x))^2\right]=\mathrm{Var}[\hat{f}(x)]+(\mathrm{Bias}[\hat{f}(x)])^2+\sigma^2
$$
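For readers who want to see where the three terms come from, here is a short derivation sketch. It assumes the usual additive model $y=f(x)+\varepsilon$ with $\mathbb{E}[\varepsilon]=0$, $\mathrm{Var}[\varepsilon]=\sigma^2$, and $\varepsilon$ independent of $\hat{f}(x)$; these assumptions are not stated in the notes above.

$$
\begin{aligned}
\mathbb{E}\left[(y-\hat{f}(x))^2\right]
&=\mathbb{E}\left[(f(x)-\hat{f}(x)+\varepsilon)^2\right]\\
&=\mathbb{E}\left[(f(x)-\hat{f}(x))^2\right]+2\,\mathbb{E}[\varepsilon]\,\mathbb{E}\left[f(x)-\hat{f}(x)\right]+\mathbb{E}[\varepsilon^2]\\
&=\mathbb{E}\left[(f(x)-\hat{f}(x))^2\right]+\sigma^2\\
&=\left(f(x)-\mathbb{E}[\hat{f}(x)]\right)^2+\mathbb{E}\left[\left(\hat{f}(x)-\mathbb{E}[\hat{f}(x)]\right)^2\right]+\sigma^2\\
&=(\mathrm{Bias}[\hat{f}(x)])^2+\mathrm{Var}[\hat{f}(x)]+\sigma^2
\end{aligned}
$$

The cross term in the second line vanishes because $\mathbb{E}[\varepsilon]=0$, and the fourth line follows by adding and subtracting $\mathbb{E}[\hat{f}(x)]$ inside the square.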

Time to run some examples.

In [4]:
proc glmselect data=bloodp plots=(CriterionPanel ASE) seed=1;
     partition fraction(validate=0.3 test=0.2);
     class smkstat BetaBlocker SRHealth; 
     model sbp = SRHealth smkstat BetaBlocker bmi age 
                  / selection=forward(choose=validate stop=none);
run;

Data Set,WORK.BLOODP
Dependent Variable,sbp
Selection Method,Forward
Select Criterion,SBC
Stop Criterion,
Choose Criterion,Validation ASE
Effect Hierarchy Enforced,
Random Number Seed,1

Number of Observations Read,4963
Number of Observations Used,1934
Number of Observations Used for Training,980
Number of Observations Used for Validation,547
Number of Observations Used for Testing,407

Class Level Information,Class Level Information,Class Level Information
Class,Levels,Values
smkstat,3,C F N
BetaBlocker,2,0 1
SRHealth,5,1 2 3 4 5

Dimensions,Dimensions
Number of Effects,6
Number of Parameters,13

Forward Selection Summary,Forward Selection Summary,Forward Selection Summary,Forward Selection Summary,Forward Selection Summary,Forward Selection Summary,Forward Selection Summary,Forward Selection Summary
Step,Effect Entered,Number Effects In,Number Parms In,SBC,ASE,Validation ASE,Test ASE
0,Intercept,1,1,5630.3701,310.5198,332.1197,333.2859
1,smkstat,2,3,5612.9380,300.7873,323.3835,332.6240
2,age,3,4,5599.0676,294.4832,313.0288,327.2807
3,bmi,4,5,5589.2159*,289.4959,303.1074,321.2684
4,BetaBlocker,5,6,5590.9456,287.9762,299.6076*,322.3838
5,SRHealth,6,10,5613.5267,286.5197,302.1574,324.1637
* Optimal Value of Criterion

Selection stopped because all effects are in the final model.

Effects:,Intercept smkstat BetaBlocker bmi age

Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value
Model,5,22093,4418.53548,15.25
Error,974,282217,289.75019,
Corrected Total,979,304309,,

Root MSE,17.02205
Dependent Mean,124.64659
R-Square,0.0726
Adj R-Sq,0.0678
AIC,6543.6203
AICC,6543.73553
SBC,5590.94562
ASE (Train),287.97621
ASE (Validate),299.60763
ASE (Test),322.38378

Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates
Parameter,DF,Estimate,Standard Error,t Value
Intercept,1,120.279323,7.698234,15.62
smkstat C,1,-10.026465,1.593268,-6.29
smkstat F,1,-5.084009,1.725869,-2.95
smkstat N,0,0.0,.,.
BetaBlocker 0,1,-3.645106,1.607816,-2.27
BetaBlocker 1,0,0.0,.,.
bmi,1,-0.632731,0.149362,-4.24
age,1,0.663179,0.138381,4.79


In [5]:
proc glmselect data=bloodp plots=(CriterionPanel ASE) seed=1;
     partition fraction(validate=0.3 test=0.2);
     class smkstat BetaBlocker SRHealth; 
     model sbp = SRHealth smkstat BetaBlocker bmi age 
                  / selection=stepwise(select=SL choose=validate stop=none);
run;

Data Set,WORK.BLOODP
Dependent Variable,sbp
Selection Method,Stepwise
Select Criterion,Significance Level
Stop Criterion,
Choose Criterion,Validation ASE
Entry Significance Level (SLE),0.15
Stay Significance Level (SLS),0.15
Effect Hierarchy Enforced,
Random Number Seed,1

Number of Observations Read,4963
Number of Observations Used,1934
Number of Observations Used for Training,980
Number of Observations Used for Validation,547
Number of Observations Used for Testing,407

Class Level Information,Class Level Information,Class Level Information
Class,Levels,Values
smkstat,3,C F N
BetaBlocker,2,0 1
SRHealth,5,1 2 3 4 5

Dimensions,Dimensions
Number of Effects,6
Number of Parameters,13

Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary
Step,Effect Entered,Effect Removed,Number Effects In,Number Parms In,ASE,Validation ASE,Test ASE,F Value,Pr > F
0,Intercept,,1,1,310.5198,332.1197,333.2859,0.00,1.0000
1,smkstat,,2,3,300.7873,323.3835,332.6240,15.81,<.0001
2,age,,3,4,294.4832,313.0288,327.2807,20.89,<.0001
3,bmi,,4,5,289.4959,303.1074,321.2684,16.80,<.0001
4,BetaBlocker,,5,6,287.9762,299.6076*,322.3838,5.14,0.0236
* Optimal Value of Criterion

Selection stopped because the candidate for entry has SLE > 0.15 and the candidate for removal has SLS < 0.15.

Effects:,Intercept smkstat BetaBlocker bmi age

Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value
Model,5,22093,4418.53548,15.25
Error,974,282217,289.75019,
Corrected Total,979,304309,,

Root MSE,17.02205
Dependent Mean,124.64659
R-Square,0.0726
Adj R-Sq,0.0678
AIC,6543.6203
AICC,6543.73553
SBC,5590.94562
ASE (Train),287.97621
ASE (Validate),299.60763
ASE (Test),322.38378

Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates
Parameter,DF,Estimate,Standard Error,t Value
Intercept,1,120.279323,7.698234,15.62
smkstat C,1,-10.026465,1.593268,-6.29
smkstat F,1,-5.084009,1.725869,-2.95
smkstat N,0,0.0,.,.
BetaBlocker 0,1,-3.645106,1.607816,-2.27
BetaBlocker 1,0,0.0,.,.
bmi,1,-0.632731,0.149362,-4.24
age,1,0.663179,0.138381,4.79


In [6]:
proc glmselect data=bloodp plots=(CriterionPanel ASE) seed=1;
     partition fraction(validate=0.3 test=0.2);
     class smkstat BetaBlocker SRHealth; 
     model sbp = SRHealth smkstat BetaBlocker bmi age 
                  / selection=stepwise(select=SL choose=validate stop=SBC);
run;

Data Set,WORK.BLOODP
Dependent Variable,sbp
Selection Method,Stepwise
Select Criterion,Significance Level
Stop Criterion,SBC
Choose Criterion,Validation ASE
Entry Significance Level (SLE),0.15
Stay Significance Level (SLS),0.15
Effect Hierarchy Enforced,
Random Number Seed,1

Number of Observations Read,4963
Number of Observations Used,1934
Number of Observations Used for Training,980
Number of Observations Used for Validation,547
Number of Observations Used for Testing,407

Class Level Information,Class Level Information,Class Level Information
Class,Levels,Values
smkstat,3,C F N
BetaBlocker,2,0 1
SRHealth,5,1 2 3 4 5

Dimensions,Dimensions
Number of Effects,6
Number of Parameters,13

Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary
Step,Effect Entered,Effect Removed,Number Effects In,Number Parms In,SBC,ASE,Validation ASE,Test ASE,F Value,Pr > F
0,Intercept,,1,1,5630.3701,310.5198,332.1197,333.2859,0.00,1.0000
1,smkstat,,2,3,5612.9380,300.7873,323.3835,332.6240,15.81,<.0001
2,age,,3,4,5599.0676,294.4832,313.0288,327.2807,20.89,<.0001
3,bmi,,4,5,5589.2159*,289.4959,303.1074*,321.2684,16.80,<.0001
* Optimal Value of Criterion

Selection stopped at a local minimum of the SBC criterion.

Stop Details,Stop Details,Stop Details,Stop Details,Stop Details
Candidate For,Effect,Candidate SBC,,Compare SBC
Entry,BetaBlocker,5590.9456,>,5589.2159
Removal,bmi,5599.0676,>,5589.2159

Effects:,Intercept smkstat bmi age

Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value
Model,4,20603,5150.85382,17.7
Error,975,283706,290.98046,
Corrected Total,979,304309,,

Root MSE,17.05815
Dependent Mean,124.64659
R-Square,0.0677
Adj R-Sq,0.0639
AIC,6546.77818
AICC,6546.86451
SBC,5589.21594
ASE (Train),289.49586
ASE (Validate),303.10738
ASE (Test),321.2684

Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates
Parameter,DF,Estimate,Standard Error,t Value
Intercept,1,115.230643,7.384728,15.60
smkstat C,1,-10.722896,1.566689,-6.84
smkstat F,1,-5.504869,1.719495,-3.20
smkstat N,0,0.0,.,.
bmi,1,-0.612327,0.149406,-4.10
age,1,0.696442,0.137893,5.05


## Cross validation

In [7]:
proc glmselect data=bloodp plots=(CriterionPanel ASE) seed=1;
     class smkstat BetaBlocker SRHealth; 
     model sbp = SRHealth smkstat BetaBlocker bmi age 
                  / selection=stepwise(select=SL choose=cv)
                  cvMethod=split(10)
                  cvDetails = all;
run;

Data Set,WORK.BLOODP
Dependent Variable,sbp
Selection Method,Stepwise
Select Criterion,Significance Level
Stop Criterion,Significance Level
Choose Criterion,Cross Validation
Entry Significance Level (SLE),0.15
Stay Significance Level (SLS),0.15
Cross Validation Method,Split
Cross Validation Fold,10

Number of Observations Read,4963
Number of Observations Used,1934

Class Level Information,Class Level Information,Class Level Information
Class,Levels,Values
smkstat,3,C F N
BetaBlocker,2,0 1
SRHealth,5,1 2 3 4 5

Dimensions,Dimensions
Number of Effects,6
Number of Parameters,13

Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary,Stepwise Selection Summary
Step,Effect Entered,Effect Removed,Number Effects In,Number Parms In,CV PRESS,F Value,Pr > F
0,Intercept,,1,1,621947.599,0.00,1.0000
1,smkstat,,2,3,607706.343,23.87,<.0001
2,age,,3,4,593901.297,45.52,<.0001
3,bmi,,4,5,580874.066,43.67,<.0001
4,BetaBlocker,,5,6,578374.389*,9.86,0.0017
* Optimal Value of Criterion

Selection stopped because the candidate for entry has SLE > 0.15 and the candidate for removal has SLS < 0.15.

Stop Details,Stop Details,Stop Details,Stop Details,Stop Details,Stop Details
Candidate For,Effect,Candidate Significance,,Compare Significance,
Entry,SRHealth,0.697,>,0.15,(SLE)
Removal,BetaBlocker,0.0017,<,0.15,(SLS)

Effects:,Intercept smkstat BetaBlocker bmi age

Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value
Model,5,45039,9007.79028,30.13
Error,1928,576498,299.01358,
Corrected Total,1933,621537,,

Root MSE,17.29201
Dependent Mean,124.86117
R-Square,0.0725
Adj R-Sq,0.0701
AIC,12967.0
AICC,12967.0
SBC,11064.0
CV PRESS,578374.0

Cross Validation Details,Cross Validation Details,Cross Validation Details,Cross Validation Details
Index,Observations Fitted,Observations Left Out,CV PRESS
1,1740,194,55243.193
2,1740,194,61391.594
3,1740,194,59623.711
4,1740,194,56227.423
5,1741,193,48459.687
6,1741,193,50877.621
7,1741,193,58437.345
8,1741,193,62786.246
9,1741,193,65458.248
10,1741,193,59869.32

Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates
Parameter,DF,Estimate,Standard Error,t Value,CV Estimate 1,CV Estimate 2,CV Estimate 3,CV Estimate 4,CV Estimate 5,CV Estimate 6,CV Estimate 7,CV Estimate 8,CV Estimate 9,CV Estimate 10
Intercept,1,120.920253,5.720644,21.14,121.173,120.735,122.287,121.594,120.814,120.989,118.612,120.841,120.806,121.414
smkstat C,1,-10.033185,1.172245,-8.56,-10.2,-10.12,-9.655,-9.937,-9.952,-10.16,-9.848,-10.233,-10.039,-10.179
smkstat F,1,-5.746808,1.249818,-4.60,-5.642,-6.277,-5.458,-5.491,-5.785,-5.851,-6.005,-5.743,-5.682,-5.526
smkstat N,0,0.0,.,.,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BetaBlocker 0,1,-3.679431,1.172026,-3.14,-4.071,-3.239,-3.443,-3.712,-3.313,-3.924,-3.201,-3.686,-4.154,-4.027
BetaBlocker 1,0,0.0,.,.,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
bmi,1,-0.75194,0.110372,-6.81,-0.766,-0.743,-0.737,-0.734,-0.756,-0.752,-0.765,-0.767,-0.767,-0.734
age,1,0.736633,0.102847,7.16,0.75,0.731,0.69,0.711,0.738,0.741,0.782,0.75,0.754,0.718


In [4]:
proc glmselect data=bloodp plots=(CriterionPanel ASE);
     class smkstat BetaBlocker SRHealth; 
     model sbp = SRHealth smkstat BetaBlocker bmi age 
                  / selection=forward(select=SL);
run;

Data Set,WORK.BLOODP
Dependent Variable,sbp
Selection Method,Forward
Select Criterion,Significance Level
Stop Criterion,Significance Level
Entry Significance Level (SLE),0.5
Effect Hierarchy Enforced,

Number of Observations Read,4963
Number of Observations Used,1934

Class Level Information,Class Level Information,Class Level Information
Class,Levels,Values
smkstat,3,C F N
BetaBlocker,2,0 1
SRHealth,5,1 2 3 4 5

Dimensions,Dimensions
Number of Effects,6
Number of Parameters,13

Forward Selection Summary,Forward Selection Summary,Forward Selection Summary,Forward Selection Summary,Forward Selection Summary,Forward Selection Summary
Step,Effect Entered,Number Effects In,Number Parms In,F Value,Pr > F
0,Intercept,1,1,0.0,1.0000
1,smkstat,2,3,23.87,<.0001
2,age,3,4,45.52,<.0001
3,bmi,4,5,43.67,<.0001
4,BetaBlocker,5,6,9.86,0.0017

Selection stopped as the candidate for entry has SLE > 0.5.

Stop Details,Stop Details,Stop Details,Stop Details,Stop Details,Stop Details
Candidate For,Effect,Candidate Significance,,Compare Significance,
Entry,SRHealth,0.697,>,0.5,(SLE)

Effects:,Intercept smkstat BetaBlocker bmi age

Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value
Model,5,45039,9007.79028,30.13
Error,1928,576498,299.01358,
Corrected Total,1933,621537,,

Root MSE,17.29201
Dependent Mean,124.86117
R-Square,0.0725
Adj R-Sq,0.0701
AIC,12967.0
AICC,12967.0
SBC,11064.0

Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates
Parameter,DF,Estimate,Standard Error,t Value
Intercept,1,120.920253,5.720644,21.14
smkstat C,1,-10.033185,1.172245,-8.56
smkstat F,1,-5.746808,1.249818,-4.60
smkstat N,0,0.0,.,.
BetaBlocker 0,1,-3.679431,1.172026,-3.14
BetaBlocker 1,0,0.0,.,.
bmi,1,-0.75194,0.110372,-6.81
age,1,0.736633,0.102847,7.16
