Michael Wilson

DSC-609: Machine Learning

Programming Assignment 2 - Ridge & Lasso Regression

## Dataset

The dataset used for this assignment is the Heart Failure Prediction Dataset, compiled and posted on Kaggle by user fedesoriano. It is available for public download from https://www.kaggle.com/fedesoriano/heart-failure-prediction. The dataset is a compilation of several smaller datasets and covers 11 features that can be used to predict the existence of heart disease (fedesoriano, 2021).


The features tracked as independent variables are:

Age: age of the patient [years]

Sex: sex of the patient [M: Male, F: Female]

ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]

RestingBP: resting blood pressure [mm Hg]

Cholesterol: serum cholesterol [mg/dl]

FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]

RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]

MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]

ExerciseAngina: exercise-induced angina [Y: Yes, N: No]

Oldpeak: oldpeak = ST [Numeric value measured in depression]

ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]

(fedesoriano, 2021)


The target variable is each record's Heart Disease status, with 1 representing the presence of heart disease and 0 its absence (fedesoriano, 2021).

Of the 11 features available in the dataset, we use only 5 numeric variables to predict that target. The independent variables chosen are Age, resting blood pressure (RestingBP), Cholesterol, maximum heart rate during exercise (MaxHR), and Oldpeak. Summary statistics for these variables are below:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display

#Read in the data
HeartDataRaw = pd.read_csv(r'C:\Users\Mike\Documents\Grad School 2021\DSC-609 Machine Learning\heart.csv')

HeartDataRaw.head()

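| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |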


In [2]:
# Drop the columns that are not going to be used in the model

HeartData = HeartDataRaw.drop(['Sex','ChestPainType','FastingBS',
                              'RestingECG','ExerciseAngina','ST_Slope'], axis = 1)

HeartData.head() #Check to see that variable selection is as expected

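| | Age | RestingBP | Cholesterol | MaxHR | Oldpeak | HeartDisease |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 40 | 140 | 289 | 172 | 0.0 | 0 |
| 1 | 49 | 160 | 180 | 156 | 1.0 | 1 |
| 2 | 37 | 130 | 283 | 98 | 0.0 | 0 |
| 3 | 48 | 138 | 214 | 108 | 1.5 | 1 |
| 4 | 54 | 150 | 195 | 122 | 0.0 | 0 |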


In [3]:
# Generate summary statistics

HeartData.describe()

| | Age | RestingBP | Cholesterol | MaxHR | Oldpeak | HeartDisease |
| --- | --- | --- | --- | --- | --- | --- |
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.510893 | 132.396514 | 198.799564 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 25.460334 | 1.06657 | 0.497414 |
| min | 28.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 173.25 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 202.0 | 6.2 | 1.0 |


In [4]:
# The summary shows one zero value for RestingBP and 172 zero values for Cholesterol,
# which need to be excluded as data points.

# The sole RestingBP zero is also a Cholesterol zero, so one filter removes both.
# .copy() makes HeartData an independent frame so later column edits apply cleanly.
HeartData = HeartData[HeartData.Cholesterol != 0].copy()

HeartData.describe()

| | Age | RestingBP | Cholesterol | MaxHR | Oldpeak | HeartDisease |
| --- | --- | --- | --- | --- | --- | --- |
| count | 746.0 | 746.0 | 746.0 | 746.0 | 746.0 | 746.0 |
| mean | 52.882038 | 133.022788 | 244.635389 | 140.226542 | 0.901609 | 0.477212 |
| std | 9.505888 | 17.28275 | 59.153524 | 24.524107 | 1.072861 | 0.499816 |
| min | 28.0 | 92.0 | 85.0 | 69.0 | -0.1 | 0.0 |
| 25% | 46.0 | 120.0 | 207.25 | 122.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 237.0 | 140.0 | 0.5 | 0.0 |
| 75% | 59.0 | 140.0 | 275.0 | 160.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 202.0 | 6.2 | 1.0 |
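
The filter above relies on the single zero blood pressure record also having zero cholesterol. A minimal sketch of how that assumption could be checked, and how both columns could be filtered explicitly, is below. The `toy` DataFrame and its values are invented for illustration and are not the real data.

```python
import pandas as pd

# Toy stand-in for the real data; values are invented for illustration only.
toy = pd.DataFrame({
    "RestingBP":   [140, 0, 130, 120],
    "Cholesterol": [289, 0, 0, 214],
})

# Count zero values per column.
zero_counts = (toy[["RestingBP", "Cholesterol"]] == 0).sum()

# Check that every zero blood pressure row is also a zero cholesterol row.
bp_zero_covered = bool((toy.loc[toy["RestingBP"] == 0, "Cholesterol"] == 0).all())

# Filter on both columns explicitly so the result does not depend on that overlap.
clean = toy[(toy["RestingBP"] != 0) & (toy["Cholesterol"] != 0)].copy()

print(zero_counts.to_dict(), bp_zero_covered, len(clean))
```

Filtering on both columns costs nothing here and would still be correct if a future version of the data contained a zero blood pressure with a valid cholesterol reading.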


Finally, to use the Ridge and Lasso regressors with the binary target, we recode the label 0 as -1 for the Heart Disease output, so the two classes are represented by -1 and 1.

In [5]:
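# Change zeros to negative ones on target variable.
# The result is assigned back rather than using inplace = True on a column selection,
# which may not update HeartData in newer pandas versions.

HeartData = HeartData.assign(HeartDisease = HeartData['HeartDisease'].replace(to_replace = 0, value = -1))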

HeartData.head()

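| | Age | RestingBP | Cholesterol | MaxHR | Oldpeak | HeartDisease |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 40 | 140 | 289 | 172 | 0.0 | -1 |
| 1 | 49 | 160 | 180 | 156 | 1.0 | 1 |
| 2 | 37 | 130 | 283 | 98 | 0.0 | -1 |
| 3 | 48 | 138 | 214 | 108 | 1.5 | 1 |
| 4 | 54 | 150 | 195 | 122 | 0.0 | -1 |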


In [6]:
HeartData.describe()  # This is with the final manipulation of the dataset, ready for model-building

| | Age | RestingBP | Cholesterol | MaxHR | Oldpeak | HeartDisease |
| --- | --- | --- | --- | --- | --- | --- |
| count | 746.0 | 746.0 | 746.0 | 746.0 | 746.0 | 746.0 |
| mean | 52.882038 | 133.022788 | 244.635389 | 140.226542 | 0.901609 | -0.045576 |
| std | 9.505888 | 17.28275 | 59.153524 | 24.524107 | 1.072861 | 0.999631 |
| min | 28.0 | 92.0 | 85.0 | 69.0 | -0.1 | -1.0 |
| 25% | 46.0 | 120.0 | 207.25 | 122.0 | 0.0 | -1.0 |
| 50% | 54.0 | 130.0 | 237.0 | 140.0 | 0.5 | -1.0 |
| 75% | 59.0 | 140.0 | 275.0 | 160.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 202.0 | 6.2 | 1.0 |


In [7]:
from sklearn.model_selection import train_test_split

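# Split the data into training and testing sets.
# train_test_split defaults to a 75/25 split; no random_state is set,
# so the exact split (and the scores below) will vary between runs.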

target = HeartData['HeartDisease']
Predictors = HeartData.drop(['HeartDisease'], axis = 1)

Predictors_train, Predictors_test, target_train, target_test = train_test_split(Predictors, target)
print('Number of records to train with: ', len(Predictors_train))
print('Number of records to test with: ', len(Predictors_test))

Number of records to train with:  559
Number of records to test with:  187
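
Because the split above is random, rerunning the notebook will give slightly different scores. A minimal sketch of a reproducible split follows, using synthetic data with the same row count as the cleaned dataset. The seed value `42`, the explicit `test_size`, and the use of `stratify` are assumptions for illustration, not settings used in this assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the cleaned dataset (746).
rng = np.random.default_rng(0)
X = rng.normal(size=(746, 5))
y = np.where(rng.random(746) < 0.48, 1, -1)

# random_state fixes the shuffle; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```

With a fixed seed, the same rows land in the training and test sets on every run, which makes reported scores repeatable.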


### Ridge Regression

In [8]:
from sklearn.linear_model import RidgeCV

#5-fold cross validation used on 7 alpha levels
HD_ridgeCV = RidgeCV(alphas = (0.001, 0.01, 0.1, 1, 10, 100, 1000), cv = 5).fit(Predictors_train, target_train)

print('Alpha = ', HD_ridgeCV.alpha_)
print('\tTraining Score :\t{:.3f}'.format(HD_ridgeCV.score(Predictors_train, target_train)))
print('\tTest Score :\t\t{:.3f}'.format(HD_ridgeCV.score(Predictors_test, target_test)))
print('\tIntercept :\t\t{:.4f}'.format(HD_ridgeCV.intercept_))
print('\tCoefficients :\t\t', [round(coeff,4) for coeff in HD_ridgeCV.coef_])

Alpha =  10.0
	Training Score :	0.310
	Test Score :		0.351
	Intercept :		-0.3447
	Coefficients :		 [0.0099, 0.0018, 0.0014, -0.0082, 0.3661]
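
Since the target is coded as -1 and 1, the regression output can also be read as a classifier by taking its sign. The sketch below shows one way this might be done on synthetic data; the zero threshold and the use of `accuracy_score` are assumptions for illustration and are not part of the analysis reported here.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import accuracy_score

# Synthetic data, invented for illustration; not the heart dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = np.where(X[:, 4] + 0.5 * rng.normal(size=400) > 0, 1, -1)  # labels coded -1 / 1

# Same alpha grid and fold count as the cell above, fit on the first 300 rows.
model = RidgeCV(alphas=(0.001, 0.01, 0.1, 1, 10, 100, 1000), cv=5).fit(X[:300], y[:300])

# The regressor returns real numbers; threshold at zero to recover a class label.
scores = model.predict(X[300:])
labels = np.where(scores >= 0, 1, -1)

print("R-squared:", round(model.score(X[300:], y[300:]), 3))
print("Accuracy :", round(accuracy_score(y[300:], labels), 3))
```

R-squared and accuracy measure different things, so a modest R-squared on a binary target does not by itself mean the thresholded predictions are poor.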


### Lasso Regression

In [9]:
from sklearn.linear_model import LassoCV

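# 5-fold cross-validation over LassoCV's default path of 100 candidate alphas;
# eps sets the ratio of the smallest to the largest alpha on that path.
HD_lassoCV = LassoCV(eps = 0.001, cv = 5).fit(Predictors_train, target_train)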

print('Alpha = ', HD_lassoCV.alpha_)
print('\tTraining Score :\t{:.3f}'.format(HD_lassoCV.score(Predictors_train, target_train)))
print('\tTest Score :\t\t{:.3f}'.format(HD_lassoCV.score(Predictors_test, target_test)))
print('\tIntercept :\t\t{:.4f}'.format(HD_lassoCV.intercept_))
print('\tCoefficients :\t\t', [round(coeff,4) for coeff in HD_lassoCV.coef_])
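# Count nonzero coefficients; n_features_in_ only reports how many columns were passed to fit.
print('\tFeatures used :\t\t', int(np.sum(HD_lassoCV.coef_ != 0)))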

Alpha =  0.008485482317324891
	Training Score :	0.310
	Test Score :		0.351
	Intercept :		-0.3380
	Coefficients :		 [0.0099, 0.0018, 0.0014, -0.0083, 0.3645]
	Features used :		 5
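
Lasso kept all 5 features at the selected alpha. To illustrate how the number of features used depends on alpha, the sketch below fits `Lasso` at a few hand-picked alpha values on synthetic data in which only two of five features carry signal. The data and the alpha values are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=300)

n_used = {}
for alpha in (0.001, 0.1, 5.0):
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    n_used[alpha] = int(np.sum(coef != 0))  # features with a nonzero coefficient
    print(f"alpha={alpha}: {n_used[alpha]} of {X.shape[1]} features used")
```

A small alpha, like the one selected above, applies little shrinkage, which is consistent with no coefficient being driven to exactly zero.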


## Results

Using these 5 variables as input, the Ridge and Lasso methods (both with 5-fold cross-validation) arrive at very similar results. The regression coefficients are nearly identical between the two methods; the coefficient for Oldpeak differs the most (0.3661 for Ridge versus 0.3645 for Lasso). The intercepts are also very close (-0.3447 versus -0.3380).
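
For reference, the objectives minimized by scikit-learn's Ridge and Lasso estimators, as stated in the scikit-learn documentation, are given below. Here $X$ is the predictor matrix, $y$ the target, $w$ the coefficient vector, and $n$ the number of training samples; the unpenalized intercept is omitted for brevity.

$$
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The squared-error term is scaled by $1/(2n)$ in the Lasso objective but not in the Ridge objective, so the two selected alpha values (10.0 and roughly 0.0085) are not directly comparable in magnitude.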

The training and test scores reported here are the R-squared values for the best-performing model over the alpha values searched. For both regularization methods (Ridge and Lasso), the training and test scores agree to three decimal places (0.310 and 0.351, respectively).
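
As a check on what the score means, the sketch below computes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the estimator's `score` method. The data are synthetic and the fixed `alpha=10.0` is chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=200)

model = Ridge(alpha=10.0).fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, model.score(X, y)))
```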

### Conclusion

While the two models agree closely with each other, their scores do not represent a very complete explanation for why Heart Disease would or would not be present for any particular record. The test score (0.351) being somewhat higher than the training score (0.310) suggests that we have not overfit the model. Both scores are low, however, so this subset of variables explains only about a third of the variation in the presence of Heart Disease. Because of the low scores, there are likely additional independent variables that could be included that, while increasing model complexity, may improve overall fit and R-squared values without immediately causing generalization to suffer.
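
One possible way to bring the dropped categorical features into the model is one-hot encoding. The sketch below applies `pd.get_dummies` to a few columns from the five rows shown in the first `head()` output; the choice of columns and of `drop_first=True` is an assumption for illustration, and this step was not run as part of the assignment.

```python
import pandas as pd

# Five rows copied from the head() output shown earlier in this notebook.
sample = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "Sex": ["M", "F", "M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP"],
    "ExerciseAngina": ["N", "N", "N", "Y", "N"],
})

# One indicator column per category; drop_first removes one level per feature
# so the indicators are not perfectly collinear with the intercept.
encoded = pd.get_dummies(
    sample,
    columns=["Sex", "ChestPainType", "ExerciseAngina"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
```

The encoded frame could then be passed to `RidgeCV` or `LassoCV` in place of the numeric-only predictors, and Lasso's shrinkage would indicate which of the added indicators contribute.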

## References

fedesoriano. (September 2021). Heart Failure Prediction Dataset. Retrieved [5 Nov 2021] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.