This notebook contains the hyperparameter tuning for the different models.

# DATA LOADING


In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data_up.csv')

In [3]:
data.head(5)

Unnamed: 0,SEQN,Gender,Age,Annual-Family-Income,Ratio-Family-Income-Poverty,X60-sec-pulse,Systolic,Diastolic,Weight,Height,...,Total-Cholesterol,HDL,Glycohemoglobin,Vigorous-work,Moderate-work,Health-Insurance,Diabetes,Blood-Rel-Diabetes,Blood-Rel-Stroke,CoronaryHeartDisease
0,2,1,77,8,5.0,68,98,56,75.4,174.0,...,5.56,1.39,4.7,3,3,1,2,2,2,0
1,5,1,49,11,5.0,66,122,83,92.5,178.3,...,7.21,1.08,5.5,1,1,1,2,2,2,0
2,12,1,37,11,4.93,64,174,99,99.2,180.0,...,4.03,0.98,5.2,2,1,1,2,1,1,0
3,13,1,70,3,1.07,102,130,66,63.6,157.7,...,8.12,1.28,7.6,3,3,1,1,1,2,0
4,14,1,81,5,2.67,72,136,61,75.5,166.2,...,4.5,1.04,5.8,1,1,1,2,2,2,0


In [4]:
data.columns

Index(['SEQN', 'Gender', 'Age', 'Annual-Family-Income',
       'Ratio-Family-Income-Poverty', 'X60-sec-pulse', 'Systolic', 'Diastolic',
       'Weight', 'Height', 'Body-Mass-Index', 'White-Blood-Cells',
       'Lymphocyte', 'Monocyte', 'Eosinophils', 'Basophils', 'Red-Blood-Cells',
       'Hemoglobin', 'Mean-Cell-Vol', 'Mean-Cell-Hgb-Conc.',
       'Mean-cell-Hemoglobin', 'Platelet-count', 'Mean-Platelet-Vol',
       'Segmented-Neutrophils', 'Hematocrit', 'Red-Cell-Distribution-Width',
       'Albumin', 'ALP', 'AST', 'ALT', 'Cholesterol', 'Creatinine', 'Glucose',
       'GGT', 'Iron', 'LDH', 'Phosphorus', 'Bilirubin', 'Protein', 'Uric.Acid',
       'Triglycerides', 'Total-Cholesterol', 'HDL', 'Glycohemoglobin',
       'Vigorous-work', 'Moderate-work', 'Health-Insurance', 'Diabetes',
       'Blood-Rel-Diabetes', 'Blood-Rel-Stroke', 'CoronaryHeartDisease'],
      dtype='object')

In [5]:
data.shape

(37079, 51)

In [6]:
data.isnull().sum()

Unnamed: 0,0
SEQN,0
Gender,0
Age,0
Annual-Family-Income,0
Ratio-Family-Income-Poverty,0
X60-sec-pulse,0
Systolic,0
Diastolic,0
Weight,0
Height,0


Conclusion : Data is clean and all the data is numerical
CoronaryHeartDisease is target class. This is a classification problem predicting the heart disease.
SEQN is just representing the row number hence we need not take in the prediction columns.
Here, we have 49 features and 1 target column.
We have originally 37079 rows with no cell being empty.

# DATA ANALYSIS


In [7]:
data.describe()

Unnamed: 0,SEQN,Gender,Age,Annual-Family-Income,Ratio-Family-Income-Poverty,X60-sec-pulse,Systolic,Diastolic,Weight,Height,...,Total-Cholesterol,HDL,Glycohemoglobin,Vigorous-work,Moderate-work,Health-Insurance,Diabetes,Blood-Rel-Diabetes,Blood-Rel-Stroke,CoronaryHeartDisease
count,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0,...,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0,37079.0
mean,48901.041236,1.513282,48.943661,7.358208,2.559026,72.57925,124.090078,69.919253,80.988276,167.389601,...,5.081713,1.370344,5.676496,1.78384,1.598856,1.218587,1.907333,1.549502,1.796165,0.04067
std,26753.636441,0.49983,18.01044,3.994083,1.624789,12.242108,19.254741,13.575804,20.678734,10.122908,...,1.072682,0.415985,1.050223,0.448324,0.511199,0.461102,0.349674,0.49755,0.402853,0.197527
min,2.0,1.0,20.0,1.0,0.0,32.0,0.0,0.0,32.3,129.7,...,1.53,0.16,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,26120.5,1.0,33.0,4.0,1.14,64.0,111.0,62.0,66.5,160.0,...,4.32,1.07,5.2,2.0,1.0,1.0,2.0,1.0,2.0,0.0
50%,50065.0,2.0,48.0,7.0,2.18,72.0,121.0,70.0,78.2,167.1,...,5.02,1.29,5.4,2.0,2.0,1.0,2.0,2.0,2.0,0.0
75%,71173.5,2.0,63.0,10.0,4.13,80.0,134.0,78.0,92.1,174.6,...,5.74,1.6,5.8,2.0,2.0,1.0,2.0,2.0,2.0,0.0
max,93702.0,2.0,85.0,15.0,5.0,224.0,270.0,132.0,371.0,204.5,...,14.09,5.84,18.8,3.0,3.0,9.0,3.0,2.0,2.0,1.0


In [8]:
data['Pulse Pressure'] = data['Diastolic'] - data['Systolic']

In [9]:
data['BMI'] = data['Weight'] / (data['Height'] ** 2)

In [10]:
data = data.drop(columns=['Diastolic', 'Systolic', 'Height', 'Weight'])

In [11]:
correlation_matrix = data.corr()

In [12]:
correlation_matrix

Unnamed: 0,SEQN,Gender,Age,Annual-Family-Income,Ratio-Family-Income-Poverty,X60-sec-pulse,Body-Mass-Index,White-Blood-Cells,Lymphocyte,Monocyte,...,Glycohemoglobin,Vigorous-work,Moderate-work,Health-Insurance,Diabetes,Blood-Rel-Diabetes,Blood-Rel-Stroke,CoronaryHeartDisease,Pulse Pressure,BMI
SEQN,1.0,-0.001776,0.001023,0.178423,-0.044699,0.011624,0.05961,-0.006061,0.050807,0.006424,...,0.085223,0.054571,0.05667,-0.001002,-0.032711,0.054461,0.176478,-0.004985,0.012633,0.059638
Gender,-0.001776,1.0,-0.031112,-0.044304,-0.05115,0.154636,0.059787,0.041347,0.043523,-0.198896,...,-0.042684,0.151403,0.058853,-0.049752,0.014295,-0.05413,-0.043613,-0.078408,0.005023,0.059808
Age,0.001023,-0.031112,1.0,-0.042854,0.051216,-0.149767,0.029349,-0.073692,-0.097218,0.153205,...,0.283285,0.207679,0.11795,-0.243152,-0.190432,-0.000434,0.024753,0.222649,-0.4799,0.029341
Annual-Family-Income,0.178423,-0.044304,-0.042854,1.0,0.871092,-0.056541,-0.02535,-0.07853,0.017299,0.014725,...,-0.070729,-0.053815,-0.066759,-0.219581,0.064417,0.046367,0.073422,-0.037952,0.13475,-0.025332
Ratio-Family-Income-Poverty,-0.044699,-0.05115,0.051216,0.871092,1.0,-0.075063,-0.051215,-0.089086,-0.012918,0.051305,...,-0.083041,-0.078578,-0.108287,-0.27886,0.062939,0.033664,0.022297,-0.012472,0.080485,-0.051201
X60-sec-pulse,0.011624,0.154636,-0.149767,-0.056541,-0.075063,1.0,0.125678,0.206151,-0.127914,-0.113995,...,0.071834,0.055944,0.026132,0.028147,-0.049708,-0.04443,-0.026403,-0.071665,0.112258,0.125675
Body-Mass-Index,0.05961,0.059787,0.029349,-0.02535,-0.051215,0.125678,1.0,0.145922,-0.013779,-0.087412,...,0.203001,0.044353,0.026196,-0.02116,-0.133292,-0.141442,-0.0533,0.022211,-0.043846,0.999997
White-Blood-Cells,-0.006061,0.041347,-0.073692,-0.07853,-0.089086,0.206151,0.145922,1.0,-0.234926,-0.300591,...,0.063116,0.029061,0.013363,0.04149,-0.038354,-0.04854,-0.036055,0.011565,-0.018193,0.14591
Lymphocyte,0.050807,0.043523,-0.097218,0.017299,-0.012918,-0.127914,-0.013779,-0.234926,1.0,0.111487,...,0.015934,-0.032706,-0.001298,0.039871,0.032459,-0.01274,0.011889,-0.076522,0.073651,-0.013772
Monocyte,0.006424,-0.198896,0.153205,0.014725,0.051305,-0.113995,-0.087412,-0.300591,0.111487,1.0,...,-0.028304,-0.025576,-0.01973,-0.043162,0.018914,0.041089,0.01541,0.068353,-0.082676,-0.087401


# Feature Selection


**ALGORITHM : BMPA**

In [13]:
!ls -l


total 8184
-rw-r--r-- 1 root root 8374838 Oct 13 16:05 data_up.csv
drwxr-xr-x 1 root root    4096 Oct  9 13:36 sample_data


In [14]:
!pip install mealpy



In [15]:
import mealpy
import pkgutil
mods = [name for _, name, _ in pkgutil.walk_packages(mealpy.__path__, mealpy.__name__ + ".")]

print([m for m in mods if "mpa" in m.lower() or "bmpa" in m.lower()])


['mealpy.swarm_based.MPA']


In [16]:
!pip install --upgrade git+https://github.com/thieu1995/mealpy.git scikit-learn pandas numpy -q

  Preparing metadata (setup.py) ... [?25l[?25hdone


In [17]:
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from mealpy.swarm_based.MPA import OriginalMPA
from mealpy.utils.space import BinaryVar

Load dataset


In [18]:
df = pd.read_csv('data_up.csv')

In [19]:
X = df.drop(columns=['SEQN', 'CoronaryHeartDisease'])

In [20]:
y = df['CoronaryHeartDisease']

Here we are defining fitness function by (minimize 1- accuracy)
And also we Train/Test split



In [21]:
def fitness(sol):
    mask = sol > 0.5
    if not mask.any():
        return 1
    Xs = X.loc[:, mask]

    Xtr, Xte, ytr, yte = train_test_split(Xs, y, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(random_state=42).fit(Xtr, ytr)

    return 1 - accuracy_score(yte, clf.predict(Xte))

In [22]:
bounds = [BinaryVar() for _ in range(X.shape[1])]

Defining Optimization Problem




In [23]:
problem = {
    "obj_func": fitness,
    "bounds": bounds,
    "minmax": "min",
}

Initializing Model and Running Binary MPA

In [24]:
model = OriginalMPA(epoch=5, pop_size=20)
result = model.solve(problem)

INFO:mealpy.swarm_based.MPA.OriginalMPA:OriginalMPA(epoch=5, pop_size=20)
INFO:mealpy.swarm_based.MPA.OriginalMPA:>>>Problem: P, Epoch: 1, Current best: 0.04153182308522119, Global best: 0.04153182308522119, Runtime: 148.30503 seconds
INFO:mealpy.swarm_based.MPA.OriginalMPA:>>>Problem: P, Epoch: 2, Current best: 0.04153182308522119, Global best: 0.04153182308522119, Runtime: 160.68031 seconds
INFO:mealpy.swarm_based.MPA.OriginalMPA:>>>Problem: P, Epoch: 3, Current best: 0.04153182308522119, Global best: 0.04153182308522119, Runtime: 162.87467 seconds
INFO:mealpy.swarm_based.MPA.OriginalMPA:>>>Problem: P, Epoch: 4, Current best: 0.04153182308522119, Global best: 0.04153182308522119, Runtime: 165.66727 seconds
INFO:mealpy.swarm_based.MPA.OriginalMPA:>>>Problem: P, Epoch: 5, Current best: 0.04153182308522119, Global best: 0.04153182308522119, Runtime: 173.85492 seconds


In [25]:
best_position = result.solution

if hasattr(result, "solution_fitness"):
    best_fitness = result.solution_fitness
elif hasattr(result, "fitness_history"):
    best_fitness = result.fitness_history[-1]
else:
    best_fitness = None
    print("Could not find fitness attribute in result. Inspect result.__dict__")

Could not find fitness attribute in result. Inspect result.__dict__


Selected Features


In [26]:
selected_features = np.where(best_position > 0.05)[0]
X_selected = X.iloc[:, selected_features] if selected_features.size > 0 else pd.DataFrame()

Calculating the Actual Accuracy and printing the results

In [27]:
if X_selected.shape[1] > 0:
    clf_final = RandomForestClassifier(random_state=42)
    accuracy = np.mean(cross_val_score(clf_final, X_selected, y, cv=5))
else:
    accuracy = 0



In [28]:
print("Best Fitness (1 - Accuracy):", best_fitness)


Best Fitness (1 - Accuracy): None


In [29]:
print("Selected Features:", selected_features)


Selected Features: [ 0  1  4  6  7  8 10 11 12 13 14 15 17 20 21 22 24 26 27 30 33 34 36 39
 40 46 48]


In [30]:
print("Total Features Selected:", len(selected_features))


Total Features Selected: 27


In [31]:
print("Accuracy of Selected Features:", accuracy)

Accuracy of Selected Features: 0.9593300810843644


# MODEL SELECTION


In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

models = {
    'RandomForestClassifier' : RandomForestClassifier(random_state=42),
    'LogisticRegression' : LogisticRegression(random_state=42),
}

In [33]:
X = df.drop(columns=['SEQN', 'CoronaryHeartDisease'])
y = df['CoronaryHeartDisease']
filtered_X = X.iloc[:, selected_features]

In [34]:
filtered_X

Unnamed: 0,Gender,Age,X60-sec-pulse,Diastolic,Weight,Height,White-Blood-Cells,Lymphocyte,Monocyte,Eosinophils,...,ALP,AST,Creatinine,Iron,LDH,Bilirubin,Triglycerides,Total-Cholesterol,Diabetes,Blood-Rel-Stroke
0,1,77,68,56,75.4,174.0,7.6,21.1,7.1,4.4,...,62,19,61.90,11.28,140,12.00,1.298,5.56,2,2
1,1,49,66,83,92.5,178.3,5.9,37.8,6.2,3.4,...,63,22,70.70,24.54,133,8.60,3.850,7.21,2,2
2,1,37,64,99,99.2,180.0,10.2,23.7,9.0,3.2,...,63,17,88.40,11.28,131,6.80,1.581,4.03,2,1
3,1,70,102,66,63.6,157.7,11.6,13.1,3.8,0.4,...,103,24,61.90,12.18,181,8.60,3.635,8.12,1,2
4,1,81,72,61,75.5,166.2,9.1,29.8,5.6,1.7,...,110,23,88.40,11.82,150,10.30,0.756,4.50,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37074,1,25,92,76,39.2,136.5,7.1,24.2,8.9,0.9,...,67,34,56.58,13.10,144,11.97,1.264,4.14,2,2
37075,2,76,78,46,59.1,165.8,6.4,21.0,12.7,3.9,...,50,22,97.24,15.00,124,18.81,0.948,3.62,2,2
37076,2,80,74,58,71.7,152.2,4.7,45.9,12.5,3.8,...,54,25,81.33,8.40,120,5.13,1.095,6.62,2,2
37077,1,35,76,66,78.2,173.3,7.6,26.4,9.2,2.0,...,140,73,82.21,9.00,136,3.42,0.937,3.72,2,1


In [35]:
for name, model in models.items():
    # Use the filtered_X data for model selection
    X_train, X_test, y_train, y_test = train_test_split(filtered_X, y, test_size=0.2, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy}")

RandomForestClassifier Accuracy: 0.9584681769147788
LogisticRegression Accuracy: 0.9572545846817692


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Hyperparameter Tuning
Tune the hyperparameters of the models using GridSearchCV.

## Define parameter grids

### Subtask:
Define parameter grids for each model to search during tuning.


**Reasoning**:
Defining parameter grids for each model as requested in the instructions.



In [36]:
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 6],
    'min_samples_split': [2, 5],
  }

param_grid_lr = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l2'],
    'solver': ['lbfgs']
}

## Set up gridsearchcv

### Subtask:
Use GridSearchCV for each model with appropriate cross-validation.


**Reasoning**:
Import GridSearchCV and create GridSearchCV objects for each model with the defined parameter grids and cross-validation.



In [37]:
from sklearn.model_selection import GridSearchCV

tuned_models = {}
param_grids = {
    'RandomForestClassifier': param_grid_rf,
    'LogisticRegression': param_grid_lr
}

for name, model in models.items():
    if name in param_grids:
        grid_search = GridSearchCV(estimator=model, param_grid=param_grids[name], cv=5, scoring='accuracy')
        tuned_models[name] = grid_search


## Perform tuning

### Subtask:
Fit the GridSearchCV objects to the selected data.


**Reasoning**:
Fit each GridSearchCV object to the selected features and target variable.



In [38]:
subset_size = 0.3
X_subset, _, y_subset, _ = train_test_split(filtered_X, y, test_size=1 - subset_size, random_state=42)

for name, grid_search in tuned_models.items():
    print(f"Tuning {name}...")
    grid_search.fit(X_subset, y_subset)
    print(f"Finished tuning {name}.")

Tuning RandomForestClassifier...
Finished tuning RandomForestClassifier.
Tuning LogisticRegression...


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to sca

Finished tuning LogisticRegression.


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [39]:
results = {}
for name, grid_search in tuned_models.items():
    best_model = grid_search.best_estimator_
    X_train, X_test, y_train, y_test = train_test_split(filtered_X, y, test_size=0.2, random_state=42)
    best_model.fit(X_train, y_train)
    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy
    print(f"Tuned {name} Accuracy: {accuracy}")

Tuned RandomForestClassifier Accuracy: 0.9583333333333334
Tuned LogisticRegression Accuracy: 0.9571197411003236


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
