I tested the different solvers and regularization penalties (where the solver supports them) and found that the accuracies are nearly identical. After some research, the most likely reason is that every solver minimizes the same convex objective, so with the same penalty, C and class weights they all converge to essentially the same coefficients. The data is also not very complicated, although at roughly 73% accuracy it is clearly not linearly separable.

However, in terms of training time, the second model (solver='liblinear', penalty='l2') was significantly faster than the other four models.
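
To put numbers behind the timing comparison instead of going by feel, each fit can be wrapped in a timer. This is only a minimal sketch: `time_fits` is a hypothetical helper (not part of scikit-learn), and it assumes every model exposes a scikit-learn style `fit(X, y)` method.

```python
import time


def time_fits(models, X, y):
    """Fit each estimator once and return {label: seconds elapsed}.

    `models` maps a label to any object with a `fit(X, y)` method.
    Hypothetical helper for illustration, not a scikit-learn function.
    """
    timings = {}
    for name, model in models.items():
        start = time.perf_counter()  # high-resolution clock meant for timing
        model.fit(X, y)
        timings[name] = time.perf_counter() - start
    return timings
```

With the estimators defined in the cell below, a call such as `time_fits({"lbfgs": log_reg1, "liblinear-l2": log_reg2}, X_train, y_train)` would return one duration per solver. A single run is noisy, so repeating the measurement a few times gives a steadier comparison.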

It is probably best to adjust the class_weight parameter, since there are likely many more instances of Diabetes=0 than of Diabetes=1.


In [None]:
from ucimlrepo import fetch_ucirepo
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import accuracy_score

# fetch dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets

# create pandas dataframe
df = pd.concat([X, y], axis=1)

# Define features (X) and target (y)
selected_features = [
    'HighBP', 'GenHlth', 'DiffWalk', 'BMI', 'HighChol', 'Age',
    'PhysHlth', 'HeartDiseaseorAttack', 'NoDocbcCost', 'MentHlth'
]
X = df[selected_features]
y = df['Diabetes_binary']

# Split the original data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Vary the solver and, where the solver allows it, the penalty; C stays at 1.0 throughout

# Model 1: solver=lbfgs
# Initialize the model
# This model takes a moderately long time to run
log_reg1 = LogisticRegression(
    penalty='l2',
    C=1.0,
    random_state=42,
    class_weight='balanced',
    solver='lbfgs',
    max_iter=1000
    )

# train the model
log_reg1.fit(X_train, y_train)

# make predictions
y_pred1 = log_reg1.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred1)
print(f"\nModel 1 Accuracy: {accuracy:.4f}")



# Model 2: solver=liblinear, penalty=l2
# Initialize the model
# This model is significantly faster than the other models
log_reg2 = LogisticRegression(
    penalty='l2',
    C=1.0,
    random_state=42,
    class_weight='balanced',
    solver='liblinear',
    max_iter=1000
    )

# train the model
log_reg2.fit(X_train, y_train)

# make predictions
y_pred2 = log_reg2.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred2)
print(f"\nModel 2 Accuracy: {accuracy:.4f}")



# Model 3: solver=liblinear, penalty=l1
# Initialize the model
# This model is relatively fast compared to other models
log_reg3 = LogisticRegression(
    penalty='l1',
    C=1.0,
    random_state=42,
    class_weight='balanced',
    solver='liblinear',
    max_iter=1000
    )

# train the model
log_reg3.fit(X_train, y_train)

# make predictions
y_pred3 = log_reg3.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred3)
print(f"\nModel 3 Accuracy: {accuracy:.4f}")



# Model 4: solver=sag, penalty=l2
# Initialize the model
# This model takes decently long to run
log_reg4 = LogisticRegression(
    penalty='l2',
    C=1.0,
    random_state=42,
    class_weight='balanced',
    solver='sag',
    max_iter=1000
    )

# train the model
log_reg4.fit(X_train, y_train)

# make predictions
y_pred4 = log_reg4.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred4)
print(f"\nModel 4 Accuracy: {accuracy:.4f}")



# Model 5: solver=saga, penalty=l2
# Initialize the model
# This model takes significantly longer to run than the rest
log_reg5 = LogisticRegression(
    penalty='l2',
    C=1.0,
    class_weight='balanced',
    random_state=42,
    solver='saga',
    max_iter=1000
    )

# train the model
log_reg5.fit(X_train, y_train)

# make predictions
y_pred5 = log_reg5.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred5)
print(f"\nModel 5 Accuracy: {accuracy:.4f}")


Model 1 Accuracy: 0.7277

Model 2 Accuracy: 0.7277

Model 3 Accuracy: 0.7277

Model 4 Accuracy: 0.7275

Model 5 Accuracy: 0.7277


In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd

# Count the Instances for Diabetes vs No Diabetes

# fetch dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets

# create pandas dataframe
df = pd.concat([X, y], axis=1)

y = df['Diabetes_binary']

# Count the rows in each class (any value other than 0 counts as Diabetes)
no_diabetes = int((y == 0).sum())
diabetes = len(y) - no_diabetes

print(f"Diabetes: {diabetes}")
print(f"No Diabetes: {no_diabetes}")

Diabetes: 35346
No Diabetes: 218334


Based on the above results, the ratio of Diabetes to No Diabetes is 1:6.177, or roughly 1:6. Hence, it could be worthwhile to set class_weight directly from this ratio and see how that affects model performance. In theory, this makes each misclassified Diabetes example count about 6x as much in the loss as a misclassified No Diabetes example.
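
For comparison, it helps to see what class_weight='balanced' works out to on these counts. The sketch below applies scikit-learn's documented heuristic, n_samples / (n_classes * count of that class), to the totals printed above. One assumption to keep in mind: these are counts for the full dataset, while the models are fit on the training split, so the weights actually used will differ slightly.

```python
# Class counts printed by the cell above (full dataset, before the train/test split).
counts = {0: 218334, 1: 35346}  # 0 = No Diabetes, 1 = Diabetes

n_samples = sum(counts.values())
n_classes = len(counts)

# Heuristic behind class_weight='balanced':
#   n_samples / (n_classes * count_of_that_class)
balanced = {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Rescale so class 0 has weight 1, to compare with a manual {0: 1, 1: w}.
relative = {label: w / balanced[0] for label, w in balanced.items()}

print({label: round(w, 3) for label, w in balanced.items()})
print({label: round(w, 3) for label, w in relative.items()})
```

After rescaling, the Diabetes weight is the same 6.177 ratio as above, so 'balanced' and a manual {0: 1, 1: 6} should behave very similarly.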

In [None]:
from ucimlrepo import fetch_ucirepo
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import accuracy_score


# Use the liblinear / l2 configuration (model 2 above) for both models here:
# model 1 keeps class_weight='balanced' as a reference, model 2 uses manual weights

# fetch dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets

# create pandas dataframe
df = pd.concat([X, y], axis=1)

# Define features (X) and target (y)
selected_features = [
    'HighBP', 'GenHlth', 'DiffWalk', 'BMI', 'HighChol', 'Age',
    'PhysHlth', 'HeartDiseaseorAttack', 'NoDocbcCost', 'MentHlth'
]
X = df[selected_features]
y = df['Diabetes_binary']

# Split the original data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model 1: solver=liblinear, penalty=l2
# Initialize the model
# Balanced class weights
log_reg1 = LogisticRegression(
    penalty='l2',
    C=1.0,
    random_state=42,
    class_weight='balanced',
    solver='liblinear',
    max_iter=1000
    )

# train the model
log_reg1.fit(X_train, y_train)

# make predictions
y_pred1 = log_reg1.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred1)
print(f"\nModel 1 Accuracy: {accuracy:.4f}")


# Model 2: solver=liblinear, penalty=l2
# Initialize the model
# Manually adjust class weights:
#   - 0 refers to No Diabetes
#   - 1 refers to Diabetes
#   - Hence, the class weight should be
#       class_weight={0: 1, 1: 6}

log_reg2 = LogisticRegression(
    penalty='l2',
    C=1.0,
    random_state=42,
    class_weight={0: 1, 1: 6},
    solver='liblinear',
    max_iter=1000
    )

# train the model
log_reg2.fit(X_train, y_train)

# make predictions
y_pred2 = log_reg2.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred2)
print(f"\nModel 2 Accuracy: {accuracy:.4f}")




Model 1 Accuracy: 0.7277

Model 2 Accuracy: 0.7325


The difference in model performance is small: accuracy did increase, but by less than 1% (0.7277 to 0.7325). Next, we can try a range of weights on Diabetes to see how this affects the performance of the model. I did notice that a ratio of about 1:1 makes the accuracy jump sharply, but I think that is because the model then mostly ignores the Diabetes class and predicts No Diabetes, not because it is better at predicting whether someone has Diabetes.
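
A quick way to check that suspicion is to work out what a "model" that always predicts No Diabetes would score. The sketch below uses only the class counts from the counting cell earlier in this notebook (full dataset, so the test-split figures will differ slightly). The always-0 predictor is a hypothetical baseline for comparison, not one of the fitted models.

```python
# Class counts from the counting cell earlier in this notebook (full dataset).
n_no_diabetes = 218334
n_diabetes = 35346
total = n_no_diabetes + n_diabetes

# Hypothetical baseline: always predict No Diabetes (class 0).
accuracy = n_no_diabetes / total              # all class-0 rows right, all class-1 rows wrong
recall_diabetes = 0 / n_diabetes              # it never flags a single Diabetes case
recall_no_diabetes = n_no_diabetes / n_no_diabetes
balanced_accuracy = (recall_diabetes + recall_no_diabetes) / 2  # mean of per-class recall

print(f"Always-0 accuracy:          {accuracy:.4f}")
print(f"Always-0 Diabetes recall:   {recall_diabetes:.4f}")
print(f"Always-0 balanced accuracy: {balanced_accuracy:.4f}")
```

Any accuracy close to this baseline says little about how well Diabetes cases are found, which is why per-class recall or balanced accuracy is worth checking alongside plain accuracy on data this imbalanced.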

In [None]:
from ucimlrepo import fetch_ucirepo
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold


# fetch dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets

# create pandas dataframe
df = pd.concat([X, y], axis=1)

# Define features (X) and target (y)
selected_features = [
    'HighBP', 'GenHlth', 'DiffWalk', 'BMI', 'HighChol', 'Age',
    'PhysHlth', 'HeartDiseaseorAttack', 'NoDocbcCost', 'MentHlth'
]
X = df[selected_features]
y = df['Diabetes_binary']

# No train/test split in this cell: the grid search cross-validates on the full dataset

# define models and parameters
model = LogisticRegression()
solvers = ['lbfgs', 'liblinear']
penalty = ['l2']
c_values = [10, 1.0, 0.1, 0.01, 0.001]
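# Use None (the Python object) for unweighted classes, not the string 'none':
# scikit-learn rejects 'none' with InvalidParameterError, which is the cause of
# the 300 failed fits (10 parameter combinations x 30 CV splits) in the output recorded below
class_weights = ['balanced', None, {0: 1, 1: 1}, {0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 4}, {0: 1, 1: 5}, {0: 1, 1: 6}]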

# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values, class_weight=class_weights)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


300 fits failed out of a total of 2400.
The score on these train-test partitions for these parameters will be set to 0.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
300 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
skl

Best: 0.863681 using {'C': 0.01, 'class_weight': {0: 1, 1: 1}, 'penalty': 'l2', 'solver': 'liblinear'}
0.728644 (0.002651) with: {'C': 10, 'class_weight': 'balanced', 'penalty': 'l2', 'solver': 'lbfgs'}
0.728726 (0.002617) with: {'C': 10, 'class_weight': 'balanced', 'penalty': 'l2', 'solver': 'liblinear'}
0.000000 (0.000000) with: {'C': 10, 'class_weight': 'none', 'penalty': 'l2', 'solver': 'lbfgs'}
0.000000 (0.000000) with: {'C': 10, 'class_weight': 'none', 'penalty': 'l2', 'solver': 'liblinear'}
0.863094 (0.001163) with: {'C': 10, 'class_weight': {0: 1, 1: 1}, 'penalty': 'l2', 'solver': 'lbfgs'}
0.863119 (0.001111) with: {'C': 10, 'class_weight': {0: 1, 1: 1}, 'penalty': 'l2', 'solver': 'liblinear'}
0.846982 (0.001905) with: {'C': 10, 'class_weight': {0: 1, 1: 2}, 'penalty': 'l2', 'solver': 'lbfgs'}
0.847038 (0.001991) with: {'C': 10, 'class_weight': {0: 1, 1: 2}, 'penalty': 'l2', 'solver': 'liblinear'}
0.818856 (0.001890) with: {'C': 10, 'class_weight': {0: 1, 1: 3}, 'penalty': 'l2'