# BAN110 Group project - Heart disease dataset



In [17]:
*Reading dataset;
libname mylib '/folders/myfolders/ban110/group_project';

FILENAME REFFILE '/folders/myfolders/ban110/group_project/heart.csv';

PROC IMPORT DATAFILE=REFFILE
    DBMS=CSV
    OUT= mylib.heart replace;
    GETNAMES=YES;    
RUN;

*the error reported by the import is caused by missing values and can be ignored here. Missing values are handled later;

In [18]:
*label the dataset;
data mylib.heart;
    set mylib.heart;
    label
        age ="age in years"
        sex="Gender (1 Male, 0 Female)"
        cp="chest pain type (4 values)"
        trestbps = "resting blood pressure (in mm Hg on admission to the hospital)"
        chol="serum cholestoral in mg/dl"
        fbs="(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)"
        restecg="restecg: resting electrocardiographic results 
                (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
                ,2: showing probable or definite left ventricular hypertrophy by Estes' criteria)"
        thalach="maximum heart rate achieved"
        exang="exercise induced angina (1 = yes; 0 = no)"
        oldpeak="ST depression induced by exercise relative to rest"
        slope="the slope of the peak exercise ST segment (1 Upsloping, 2 flat, 3 downsloping)"
        ca="number of major vessels (0-3) colored by flourosopy"
        thal="3 = normal; 6 = fixed defect; 7 = reversable defect"
        target= "presence of heart disease in the patient with integer value 0-4";
run;

*create id field;
data mylib.heart;
    retain id;
    set mylib.heart;
    id = _n_;
run;

title "First 15 rows from dataset";
proc print data=mylib.heart (obs=15);
run;


Obs,id,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
1,1,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
2,2,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
3,3,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
4,4,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
5,5,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
6,6,56,1,2,120,236,0,0,178,0,0.8,1,0,3,0
7,7,62,0,4,140,268,0,2,160,0,3.6,3,2,3,3
8,8,57,0,4,120,354,0,0,163,1,0.6,1,0,3,0
9,9,63,1,4,130,254,0,2,147,0,1.4,2,1,7,2
10,10,53,1,4,140,203,1,2,155,1,3.1,3,0,7,1


## Dataset characteristic

In [20]:
title "Contents of the dataset";
proc contents data=mylib.heart order=varnum;
run;

title "Frequency table of target variable";
proc freq data=mylib.heart;
    tables target /  nocum;
run;

Data Set Name,MYLIB.HEART,Observations,303
Member Type,DATA,Variables,15
Engine,V9,Indexes,0
Created,11/25/2019 00:17:02,Observation Length,120
Last Modified,11/25/2019 00:17:02,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information,Engine/Host Dependent Information.1
Data Set Page Size,65536
Number of Data Set Pages,1
First Data Page,1
Max Obs per Page,545
Obs in First Data Page,303
Number of Data Set Repairs,0
Filename,/folders/myfolders/ban110/group_project/heart.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,153

Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order
#,Variable,Type,Len,Format,Informat,Label
1,id,Num,8,,,
2,age,Num,8,BEST12.,BEST32.,age in years
3,sex,Num,8,BEST12.,BEST32.,"Gender (1 Male, 0 Female)"
4,cp,Num,8,BEST12.,BEST32.,chest pain type (4 values)
5,trestbps,Num,8,BEST12.,BEST32.,resting blood pressure (in mm Hg on admission to the hospital)
6,chol,Num,8,BEST12.,BEST32.,serum cholestoral in mg/dl
7,fbs,Num,8,BEST12.,BEST32.,(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
8,restecg,Num,8,BEST12.,BEST32.,"restecg: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) ,2: showing probable or definite left ventricular hypertrophy by Est"
9,thalach,Num,8,BEST12.,BEST32.,maximum heart rate achieved
10,exang,Num,8,BEST12.,BEST32.,exercise induced angina (1 = yes; 0 = no)

presence of heart disease in the patient with integer value 0-4,presence of heart disease in the patient with integer value 0-4,presence of heart disease in the patient with integer value 0-4
target,Frequency,Percent
0,164,54.13
1,55,18.15
2,36,11.88
3,35,11.55
4,13,4.29


1. Number of instances: 303
2. List of variable names: see the output of PROC CONTENTS. Type: all numeric
3. Target variable (dependent variable): target. Independent variables: all other variables (id is only a row identifier)
4. The dataset is not balanced: value 0 accounts for 54% of target, while values 1-4 share the remaining 46%. Suggestion: create a derived variable that maps 1-4 to 1. The new variable will be more balanced between its two values, 0 and 1.


## Categorical variable
   **List of categorical variables in the dataset**
   
   - sex
   - cp
   - fbs
   - restecg
   - exang
   - slope
   - ca
   - thal
   - target
   
   
### Check and correct errors when necessary

#### Var: Sex

In [21]:
title "Frequency table of sex variable";
proc freq data=mylib.heart;
    tables sex /  nocum;
run;

"Gender (1 Male, 0 Female)","Gender (1 Male, 0 Female)","Gender (1 Male, 0 Female)"
sex,Frequency,Percent
0,97,32.01
1,206,67.99


Unbalanced between 2 values, no error in this variable

#### Var: cp

In [22]:
title "Frequency table of cp variable";
proc freq data=mylib.heart;
    tables cp /  nocum;
run;

chest pain type (4 values),chest pain type (4 values),chest pain type (4 values)
cp,Frequency,Percent
1,23,7.59
2,50,16.5
3,86,28.38
4,144,47.52


Unbalanced between 4 values, no error in this variable

#### Var: fbs

In [23]:
title "Frequency table of fbs variable";
proc freq data=mylib.heart;
    tables fbs /  nocum;
run;

(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false),(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false),(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
fbs,Frequency,Percent
0,258,85.15
1,45,14.85


Unbalanced, no error

#### Var: restecg

In [24]:
title "Frequency table of restecg variable";
proc freq data=mylib.heart;
    tables restecg /  nocum;
run;

"restecg: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) ,2: showing probable or definite left ventricular hypertrophy by Est","restecg: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) ,2: showing probable or definite left ventricular hypertrophy by Est","restecg: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) ,2: showing probable or definite left ventricular hypertrophy by Est"
restecg,Frequency,Percent
0,151,49.83
1,4,1.32
2,148,48.84


Unbalanced, no error

#### Var: exang

In [25]:
title "Frequency table of exang variable";
proc freq data=mylib.heart;
    tables exang /  nocum;
run;

exercise induced angina (1 = yes; 0 = no),exercise induced angina (1 = yes; 0 = no),exercise induced angina (1 = yes; 0 = no)
exang,Frequency,Percent
0,204,67.33
1,99,32.67


Unbalanced, no error

#### Var: slope

In [26]:
title "Frequency table of slope variable";
proc freq data=mylib.heart;
    tables slope /  nocum;
run;

"the slope of the peak exercise ST segment (1 Upsloping, 2 flat, 3 downsloping)","the slope of the peak exercise ST segment (1 Upsloping, 2 flat, 3 downsloping)","the slope of the peak exercise ST segment (1 Upsloping, 2 flat, 3 downsloping)"
slope,Frequency,Percent
1,142,46.86
2,140,46.2
3,21,6.93


Unbalanced, no error

#### Var: ca

In [27]:
title "Frequency table of ca variable";
proc freq data=mylib.heart;
    tables ca /  nocum;
run;

number of major vessels (0-3) colored by flourosopy,number of major vessels (0-3) colored by flourosopy,number of major vessels (0-3) colored by flourosopy
ca,Frequency,Percent
0,176,58.86
1,65,21.74
2,38,12.71
3,20,6.69
Frequency Missing = 4,Frequency Missing = 4,Frequency Missing = 4


Unbalanced, no invalid values. There are 4 missing values, which are handled in the missing-value check below.


#### Var: thal

In [28]:
title "Frequency table of thal variable";
proc freq data=mylib.heart;
    tables thal /  nocum;
run;

3 = normal; 6 = fixed defect; 7 = reversable defect,3 = normal; 6 = fixed defect; 7 = reversable defect,3 = normal; 6 = fixed defect; 7 = reversable defect
thal,Frequency,Percent
3,166,55.15
6,18,5.98
7,117,38.87
Frequency Missing = 2,Frequency Missing = 2,Frequency Missing = 2


Unbalanced, no invalid values. There are 2 missing values, which are handled in the missing-value check below.

#### Var: target

In [29]:
title "Frequency table of target variable";
proc freq data=mylib.heart;
    tables target /  nocum;
run;

presence of heart disease in the patient with integer value 0-4,presence of heart disease in the patient with integer value 0-4,presence of heart disease in the patient with integer value 0-4
target,Frequency,Percent
0,164,54.13
1,55,18.15
2,36,11.88
3,35,11.55
4,13,4.29


Unbalanced, no error

### Check for missing values

In [30]:
proc format;
    value MissingCheck
            . = 'Missing'
        other = 'non-missing';
run;

proc freq data=mylib.heart;    
    tables sex cp fbs restecg exang slope ca thal target /  nocum missing;   
    format sex cp fbs restecg exang slope ca thal target MissingCheck.;
run;



"Gender (1 Male, 0 Female)","Gender (1 Male, 0 Female)","Gender (1 Male, 0 Female)"
sex,Frequency,Percent
non-missing,303,100.0

chest pain type (4 values),chest pain type (4 values),chest pain type (4 values)
cp,Frequency,Percent
non-missing,303,100.0

(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false),(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false),(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
fbs,Frequency,Percent
non-missing,303,100.0

"restecg: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) ,2: showing probable or definite left ventricular hypertrophy by Est","restecg: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) ,2: showing probable or definite left ventricular hypertrophy by Est","restecg: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) ,2: showing probable or definite left ventricular hypertrophy by Est"
restecg,Frequency,Percent
non-missing,303,100.0

exercise induced angina (1 = yes; 0 = no),exercise induced angina (1 = yes; 0 = no),exercise induced angina (1 = yes; 0 = no)
exang,Frequency,Percent
non-missing,303,100.0

"the slope of the peak exercise ST segment (1 Upsloping, 2 flat, 3 downsloping)","the slope of the peak exercise ST segment (1 Upsloping, 2 flat, 3 downsloping)","the slope of the peak exercise ST segment (1 Upsloping, 2 flat, 3 downsloping)"
slope,Frequency,Percent
non-missing,303,100.0

number of major vessels (0-3) colored by flourosopy,number of major vessels (0-3) colored by flourosopy,number of major vessels (0-3) colored by flourosopy
ca,Frequency,Percent
Missing,4,1.32
non-missing,299,98.68

3 = normal; 6 = fixed defect; 7 = reversable defect,3 = normal; 6 = fixed defect; 7 = reversable defect,3 = normal; 6 = fixed defect; 7 = reversable defect
thal,Frequency,Percent
Missing,2,0.66
non-missing,301,99.34

presence of heart disease in the patient with integer value 0-4,presence of heart disease in the patient with integer value 0-4,presence of heart disease in the patient with integer value 0-4
target,Frequency,Percent
non-missing,303,100.0


Missing values were found in the variables ca and thal. According to the SAS [documentation](https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_stdize_sect012.htm "Standardize method"), PROC STDIZE does not offer MODE as a replacement method, so we impute the mode manually.

In [31]:
*write the imputed values back to mylib.heart so that the later steps, which all read mylib.heart, use them;
data mylib.heart;
    set mylib.heart;
    if missing(ca) then ca = 0; *most frequent value in var ca;
    if missing(thal) then thal = 3; *most frequent value in var thal;
run;

proc freq data=mylib.heart;
    tables ca thal / nocum;
run;

number of major vessels (0-3) colored by flourosopy,number of major vessels (0-3) colored by flourosopy,number of major vessels (0-3) colored by flourosopy
ca,Frequency,Percent
0,180,59.41
1,65,21.45
2,38,12.54
3,20,6.6

3 = normal; 6 = fixed defect; 7 = reversable defect,3 = normal; 6 = fixed defect; 7 = reversable defect,3 = normal; 6 = fixed defect; 7 = reversable defect
thal,Frequency,Percent
3,168,55.45
6,18,5.94
7,117,38.61
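
The modes used for the imputation (0 for ca, 3 for thal) are hard-coded from the frequency tables. As a minimal sketch, assuming the MODE statistic of PROC MEANS is acceptable for these integer-coded variables, the modes could instead be derived from the data. The dataset names `work.modes` and `work.heart_imputed` and the variables `ca_mode` and `thal_mode` are hypothetical and only serve this illustration.

```sas
/* Sketch: compute the mode of ca and thal instead of hard-coding it.   */
/* work.modes, work.heart_imputed, ca_mode, thal_mode are assumed names. */
proc means data=mylib.heart noprint;
    var ca thal;
    output out=work.modes mode(ca thal)=ca_mode thal_mode;
run;

data work.heart_imputed;
    set mylib.heart;
    /* read the one-row mode table once; its values are retained */
    if _n_ = 1 then set work.modes(keep=ca_mode thal_mode);
    /* replace a missing value with the mode, keep non-missing values */
    ca   = coalesce(ca, ca_mode);
    thal = coalesce(thal, thal_mode);
    drop ca_mode thal_mode;
run;
```

If a variable has several modes, PROC MEANS reports the lowest one, so the result should still be checked against the frequency table.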


### Derived variable
Target: presence of heart disease in the patient, with integer values 0-4. This variable is highly unbalanced: 54% of the values are 0 and the rest are spread over four classes.
We create a derived variable target_new: 0 if target = 0, 1 otherwise. This variable can serve as the label for a two-class classifier.

In [32]:
data  mylib.heart;
    set mylib.heart;
    if target = 0 then target_new = 0;
    else target_new = 1;
run;

proc freq data=mylib.heart;
    tables target target_new /nocum;
run;

presence of heart disease in the patient with integer value 0-4,presence of heart disease in the patient with integer value 0-4,presence of heart disease in the patient with integer value 0-4
target,Frequency,Percent
0,164,54.13
1,55,18.15
2,36,11.88
3,35,11.55
4,13,4.29

target_new,Frequency,Percent
0,164,54.13
1,139,45.87


The values of target_new are more balanced, which is better for machine learning algorithms.

## Numerical variables
- age 
- trestbps
- chol
- thalach
- oldpeak


### Check and correct errors

In [38]:
*using proc means to get min max mean;
proc means data= mylib.heart nolabel  n nmiss min max mean maxdec = 2;
    var age trestbps chol thalach oldpeak;
run;

Variable,N,N Miss,Minimum,Maximum,Mean
age,303,0,29.00,77.00,54.44
trestbps,303,0,94.00,200.00,131.69
chol,303,0,126.00,564.00,246.69
thalach,303,0,71.00,202.00,149.61
oldpeak,303,0,0.00,6.20,1.04


In [44]:
title "Running PROC UNIVARIATE on numerical variables to get quantile table and some basic statistical measures";
*select output of proc univariate;
ODS Select BasicMeasures Quantiles; 
proc univariate data = mylib.heart;
    id id;
    var age trestbps chol thalach oldpeak;
run;


Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,54.43894,Std Deviation,9.03866
Median,56.0,Variance,81.69742
Mode,58.0,Range,48.0
,,Interquartile Range,13.0

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,77
99%,71
95%,68
90%,66
75% Q3,61
50% Median,56
25% Q1,48
10%,42
5%,40
1%,35

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,131.6898,Std Deviation,17.59975
Median,130.0,Variance,309.75112
Mode,120.0,Range,106.0
,,Interquartile Range,20.0

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,200
99%,180
95%,160
90%,152
75% Q3,140
50% Median,130
25% Q1,120
10%,110
5%,108
1%,100

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,246.6931,Std Deviation,51.77692
Median,241.0,Variance,2681.0
Mode,197.0,Range,438.0
,,Interquartile Range,64.0

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,564
99%,407
95%,327
90%,309
75% Q3,275
50% Median,241
25% Q1,211
10%,188
5%,175
1%,149

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,149.6073,Std Deviation,22.875
Median,153.0,Variance,523.26577
Mode,162.0,Range,131.0
,,Interquartile Range,33.0

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,202
99%,192
95%,182
90%,177
75% Q3,166
50% Median,153
25% Q1,133
10%,116
5%,108
1%,95

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,1.039604,Std Deviation,1.16108
Median,0.8,Variance,1.3481
Mode,0.0,Range,6.2
,,Interquartile Range,1.6

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,6.2
99%,4.2
95%,3.4
90%,2.8
75% Q3,1.6
50% Median,0.8
25% Q1,0.0
10%,0.0
5%,0.0
1%,0.0


Looking at the minimum, maximum and quantile tables of the numerical variables, we conclude that all values seem reasonable and that this group of variables contains no errors. If there were errors, such as an extreme thalach (maximum heart rate achieved), an extreme trestbps (resting blood pressure in mm Hg on admission to the hospital) or text values in these numerical variables, we could easily find them and remove the affected observations in a data step with IF statements (if ... then delete).
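
The data step mentioned above could look like the following minimal sketch. The plausibility limits and the output dataset name `work.heart_checked` are assumptions made for illustration only. They are not clinical thresholds and are not taken from the dataset documentation.

```sas
/* Sketch: drop observations with implausible values.                  */
/* The limits below are hypothetical and must be agreed on beforehand. */
data work.heart_checked;
    set mylib.heart;
    /* missing values are kept here and handled separately */
    if not missing(thalach)  and (thalach  < 60 or thalach  > 220) then delete;
    if not missing(trestbps) and (trestbps < 70 or trestbps > 250) then delete;
run;
```

Writing to a separate dataset keeps the original observations available, so the removed rows can still be inspected.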

### Check for missing values

In [43]:
*using proc means to check for missing value;
proc means data= mylib.heart nolabel  n nmiss ;
    var age trestbps chol thalach oldpeak;
run;

Variable,N,N Miss
age,303,0
trestbps,303,0
chol,303,0
thalach,303,0
oldpeak,303,0


The numerical variables have no missing values. If they had, we would need to examine the outliers first and only then impute new values.

### Detect and remove outliers

In [66]:
title "Normality test of numerical variables";
*select output of proc univariate;
ODS Select TestsForNormality;  * select only the tests for normality;
proc univariate data = mylib.heart normal; *option normal to get normality test;
    id id;
    var age trestbps chol thalach oldpeak;
run;


Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality
Test,Statistic,Statistic.1,p Value,p Value.1
Shapiro-Wilk,W,0.986707,Pr < W,0.0070
Kolmogorov-Smirnov,D,0.077054,Pr > D,<0.0100
Cramer-von Mises,W-Sq,0.259668,Pr > W-Sq,<0.0050
Anderson-Darling,A-Sq,1.502735,Pr > A-Sq,<0.0050

Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality
Test,Statistic,Statistic.1,p Value,p Value.1
Shapiro-Wilk,W,0.966929,Pr < W,<0.0001
Kolmogorov-Smirnov,D,0.101995,Pr > D,<0.0100
Cramer-von Mises,W-Sq,0.409466,Pr > W-Sq,<0.0050
Anderson-Darling,A-Sq,2.508495,Pr > A-Sq,<0.0050

Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality
Test,Statistic,Statistic.1,p Value,p Value.1
Shapiro-Wilk,W,0.947452,Pr < W,<0.0001
Kolmogorov-Smirnov,D,0.054295,Pr > D,0.0289
Cramer-von Mises,W-Sq,0.261756,Pr > W-Sq,<0.0050
Anderson-Darling,A-Sq,1.692224,Pr > A-Sq,<0.0050

Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality
Test,Statistic,Statistic.1,p Value,p Value.1
Shapiro-Wilk,W,0.977448,Pr < W,0.0001
Kolmogorov-Smirnov,D,0.071169,Pr > D,<0.0100
Cramer-von Mises,W-Sq,0.380789,Pr > W-Sq,<0.0050
Anderson-Darling,A-Sq,2.233849,Pr > A-Sq,<0.0050

Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality
Test,Statistic,Statistic.1,p Value,p Value.1
Shapiro-Wilk,W,0.843764,Pr < W,<0.0001
Kolmogorov-Smirnov,D,0.185658,Pr > D,<0.0100
Cramer-von Mises,W-Sq,2.247159,Pr > W-Sq,<0.0050
Anderson-Darling,A-Sq,14.21443,Pr > A-Sq,<0.0050


Since the sample size is less than 2000, we rely on the Shapiro-Wilk test, even though the three other tests are reported as well. More information about normality tests in SAS is available at this [link](https://www.lexjansen.com/pharmasug/2004/Posters/PO04.pdf).

The Hypothesis:

H_0: Data is Normally Distributed.

H_a: Data is not Normally Distributed


<font color=blue>Since all the p-values of the Shapiro-Wilk test are < 0.05, we reject H_0 and conclude that none of these variables is normally distributed.</font> We will therefore use the interquartile range (IQR) to check for outliers.

<font color=red>We create a macro to detect outliers using the IQR method.</font>

In [60]:

%macro outliers_IQR_oneVar(
                    Dsn=, /* Dataset name  */
                    id=,  /* ID variable   */
                    Var=  /* Variable name */
              );
    title "Outliers Based on Interquartile Range";

    /* Compute Q1, Q3 and the interquartile range of &Var */
    proc means data=&Dsn noprint;
        var &Var;
        output out=Tmp
            Q1=Q1
            Q3=Q3
            QRange=QRange;
    run;

    /* Report every non-missing value on or beyond the 1.5*IQR fences */
    data _null_;
        file print;
        set &Dsn;
        if _n_ = 1 then set Tmp;
        if (&Var le Q1 - 1.5*QRange and not missing(&Var)) or
           (&Var ge Q3 + 1.5*QRange) then
            put "Possible Outlier for id " &id "Value of &Var is " &Var;
    run;
%mend outliers_IQR_oneVar;



Calling the macro to check for outliers in all numerical variables

In [61]:
%outliers_IQR_oneVar(dsn=mylib.heart, id = id, var = age)
%outliers_IQR_oneVar(dsn=mylib.heart, id = id, var = trestbps)
%outliers_IQR_oneVar(dsn=mylib.heart, id = id, var = chol)
%outliers_IQR_oneVar(dsn=mylib.heart, id = id, var = thalach)
%outliers_IQR_oneVar(dsn=mylib.heart, id = id, var = oldpeak)



Based on our investigation, we remove the outlier in thalach: a maximum heart rate of 71 is too low and may distort our model.

In [64]:
data mylib.heart;
    set mylib.heart;
    if id=246 then delete;
run;
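
As a worked check of the IQR rule for thalach, the fences can be computed by hand from the quantile table above (Q1 = 133, Q3 = 166). This is only a sketch of the arithmetic the macro performs, with the two quartiles typed in as constants.

```sas
/* Sketch: 1.5*IQR fences for thalach, using Q1 and Q3 from the quantile table */
data _null_;
    q1 = 133;
    q3 = 166;
    iqr = q3 - q1;              /* 33 */
    lower = q1 - 1.5 * iqr;     /* 133 - 49.5 */
    upper = q3 + 1.5 * iqr;     /* 166 + 49.5 */
    put lower= upper=;          /* writes lower=83.5 upper=215.5 to the log */
run;
```

The minimum of thalach, 71, lies below the lower fence of 83.5, which is why it is reported as a possible outlier, while the maximum of 202 stays below the upper fence of 215.5.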


### Test for normality and transformation of distribution

In [69]:
ods select histogram;
title "Histogram of numerical variable";
proc univariate data = mylib.heart;
    id id;
    var age trestbps chol thalach oldpeak;
    Histogram / normal;
run;

We choose oldpeak, a highly right-skewed variable, and draw the histogram, P-P plot and Q-Q plot for it.

In [78]:
ods select histogram Plots;
title "Histogram, P-P plots and Q-Q plots of oldpeak";
proc univariate data = mylib.heart plot;
    id id;
    var oldpeak;
    Histogram / normal;
run;

Comment: We can clearly see that the data points lie far from the straight line (which represents normally distributed data), especially at the lower end, where nearly 40% of the values are 0. Based on the Q-Q plot, we can conclude that oldpeak is highly right-skewed.

We apply a log transformation (note that the minimum of oldpeak is 0) and a fourth-root transformation.

In [84]:
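data mylib.heart;
    set mylib.heart;
    log_oldpeak = log(oldpeak);          *log(0) is undefined, so log_oldpeak is missing when oldpeak = 0;
    root4_oldpeak = (oldpeak) ** 0.25;   *fourth root, defined for oldpeak = 0;
run;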

ods select histogram Plots TestsForNormality;
title "Histogram, P-P plots and Q-Q plots of log_oldpeak and root4_oldpeak";
proc univariate data = mylib.heart plot normal;
    id id;
    var log_oldpeak root4_oldpeak;
    Histogram / normal;
run;

Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality
Test,Statistic,Statistic.1,p Value,p Value.1
Shapiro-Wilk,W,0.939517,Pr < W,<0.0001
Kolmogorov-Smirnov,D,0.123686,Pr > D,<0.0100
Cramer-von Mises,W-Sq,0.546396,Pr > W-Sq,<0.0050
Anderson-Darling,A-Sq,3.505673,Pr > A-Sq,<0.0050

Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality,Tests for Normality
Test,Statistic,Statistic.1,p Value,p Value.1
Shapiro-Wilk,W,0.818928,Pr < W,<0.0001
Kolmogorov-Smirnov,D,0.238538,Pr > D,<0.0100
Cramer-von Mises,W-Sq,3.475643,Pr > W-Sq,<0.0050
Anderson-Darling,A-Sq,22.95621,Pr > A-Sq,<0.0050


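After both transformations, the data still do not follow a normal distribution. Based on the W statistic of the Shapiro-Wilk test (0.94 for log_oldpeak versus 0.82 for root4_oldpeak), the log transformation looks better in this situation. However, the comparison is not like-for-like: log(0) is undefined, so log_oldpeak is missing for every observation with oldpeak = 0 and its test only covers the nonzero values, while root4_oldpeak keeps the zeros. In the Q-Q plots, many points lie off the straight line in both cases, but log_oldpeak appears closer to normal than root4_oldpeak.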
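
Because oldpeak contains zeros, one common workaround is to shift the variable before taking the logarithm. The following is a minimal sketch under that assumption. The dataset name `work.heart_shifted`, the variable name `log1_oldpeak` and the shift of 1 are illustrative choices and not part of the original analysis.

```sas
/* Sketch: shifted log transform so that oldpeak = 0 stays in the data. */
/* work.heart_shifted and log1_oldpeak are hypothetical names.          */
data work.heart_shifted;
    set mylib.heart;
    log1_oldpeak = log(oldpeak + 1);   /* log(0 + 1) = 0, so no missing values are created */
run;

ods select TestsForNormality;
title "Normality test of the shifted log of oldpeak";
proc univariate data=work.heart_shifted normal;
    var log1_oldpeak;
run;
```

With this variant the normality tests run on the same observations as for root4_oldpeak, so the W statistics can be compared directly.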