# **Summary**
* Explore logistic regression for classifying diabetes using L1 and L2 penalties across a range of C values (C is the inverse of the regularization strength).


# **Import Data & Data Overview**

In [None]:
import pandas as pd

#Import data from drive
file_path = '/content/drive/My Drive/Diabetes.csv'
Diabetes = pd.read_csv(file_path)

In [None]:
#Check null value and data type
Diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4949 entries, 0 to 4948
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   HighBP                4949 non-null   int64  
 1   HighChol              4949 non-null   int64  
 2   CholCheck             4949 non-null   int64  
 3   Smoker                4949 non-null   int64  
 4   Stroke                4949 non-null   int64  
 5   HeartDiseaseorAttack  4949 non-null   int64  
 6   PhysActivity          4949 non-null   int64  
 7   Fruits                4949 non-null   int64  
 8   Veggies               4949 non-null   int64  
 9   HvyAlcoholConsump     4949 non-null   int64  
 10  AnyHealthcare         4949 non-null   int64  
 11  NoDocbcCost           4949 non-null   int64  
 12  GenHlth               4949 non-null   int64  
 13  DiffWalk              4949 non-null   int64  
 14  Sex                   4949 non-null   int64  
 15  Age                  

In [None]:
#Check shape
Diabetes.shape

(4949, 22)

# **Import necessary packages**

In [None]:
# packages for data
import numpy as np
import pandas as pd

# packages for machine learning
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# **Training Data and Testing Data Preparation**

In [None]:
columns = Diabetes.columns
columns

Index(['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke',
       'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'DiffWalk', 'Sex', 'Age', 'Education', 'Income', 'Diabetes_binary',
       'MenHlth_bins', 'PhysHlth_bins', 'BMI_normalized'],
      dtype='object')

In [None]:
#Target outcome
target = 'Diabetes_binary'

#Numerical Feature
num = ['BMI_normalized']

#Categorical Features
cats = columns[~columns.isin(num) & ~columns.isin([target])]

In [None]:
#Create a new data frame named "Dia_X" by concatenating numerical features and categorical features
Dia_X = pd.concat([Diabetes[num], Diabetes[cats]], axis = 1)

#Check shape of features columns
Dia_X.shape

(4949, 21)

In [None]:
#Target variable column
Dia_y = Diabetes[target]
Dia_y.shape

(4949,)

In [None]:
# Create split with Sklearn, 70% of data for training, 30% for testing, random_state = 66
X_train, X_test, y_train, y_test  = train_test_split(Dia_X, Dia_y, test_size = 0.3,random_state = 66)

#Check the shape of train set and test set
X_train.shape, X_test.shape

((3464, 21), (1485, 21))
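The 3464/1485 split can be reproduced by hand. The sketch below assumes that scikit-learn rounds a fractional test size up and gives the remaining rows to the training set; `split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives whatever rows remain.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 4949 rows with test_size = 0.3, as in the cell above
print(split_sizes(4949, 0.3))  # (3464, 1485)
```

The result matches the shapes printed above: 0.3 × 4949 = 1484.7, which rounds up to 1485 test rows.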

# **Logistic Regression**
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html



**Warning: The choice of the algorithm depends on the penalty chosen. Supported penalties by solver:**
* 'lbfgs' - [‘l2’, None]

* 'liblinear' - [‘l1’, ‘l2’]

* 'newton-cg' - [‘l2’, None]

* 'newton-cholesky' - [‘l2’, None]

* 'sag' - [‘l2’, None]

* 'saga' - [‘elasticnet’, ‘l1’, ‘l2’, None]


## **Comparison Between No Penalty, L1 Penalty, and L2 Penalty**

* Solver = 'saga' will be used for this comparison because it is the only solver that supports no penalty (*None*), L1, and L2.
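The choice of 'saga' can be checked against the solver list above. The sketch below simply transcribes that list into a dictionary; `SUPPORTED_PENALTIES` and `solvers_supporting` are hypothetical names used for illustration only.

```python
# Solver -> supported penalties, transcribed from the list above.
SUPPORTED_PENALTIES = {
    "lbfgs": {"l2", None},
    "liblinear": {"l1", "l2"},
    "newton-cg": {"l2", None},
    "newton-cholesky": {"l2", None},
    "sag": {"l2", None},
    "saga": {"elasticnet", "l1", "l2", None},
}

def solvers_supporting(required):
    """Return the solvers whose supported penalties include every required one."""
    required = set(required)
    return sorted(
        solver
        for solver, penalties in SUPPORTED_PENALTIES.items()
        if required <= penalties
    )

print(solvers_supporting([None, "l1", "l2"]))  # ['saga']
print(solvers_supporting(["l1", "l2"]))        # ['liblinear', 'saga']
```

The second call shows why 'liblinear' is still usable for comparisons that involve only the L1 and L2 penalties.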



In [None]:
#Create a LogisticRegression object
#penalty = None
Log_saga_reg = LogisticRegression(random_state=66, solver = 'saga', penalty = None)

#Fit the model with training data
Log_saga_reg.fit(X_train,y_train)

#Evaluate the fitted model with testing data
np.round(Log_saga_reg.score(X_test, y_test),3)



0.865

In [None]:
#ConvergenceWarning: increase max_iter = 1000
#penalty = None

#Create a LogisticRegression object
Log_saga_reg = LogisticRegression(random_state=66, solver = 'saga', penalty = None, max_iter = 1000)

#Fit the model with training data
Log_saga_reg.fit(X_train, y_train)

#Evaluate the fitted model with testing data
np.round(Log_saga_reg.score(X_test, y_test),3)

0.868

In [None]:
#penalty = 'l1'
#max_iter = 1000

#Create a LogisticRegression object
Log_saga_l1 = LogisticRegression(random_state=66, solver = 'saga', penalty = 'l1', max_iter = 1000)

#Fit the model with training data
Log_saga_l1.fit(X_train, y_train)

#Evaluate the fitted model with testing data
np.round(Log_saga_l1.score(X_test, y_test),3)

0.867

In [None]:
#penalty = 'l2'
#max_iter = 1000

#Create a LogisticRegression object
Log_saga_l2 = LogisticRegression(random_state=66, solver = 'saga', penalty = 'l2', max_iter = 1000)

#Fit the model with training data
Log_saga_l2.fit(X_train,y_train)

#Evaluate the fitted model with testing data
np.round(Log_saga_l2.score(X_test, y_test),3)

0.869

In [None]:
#Comparing the value of coefficients from three models

#Extract the coefficients from the three models
Log_saga_reg_coef = Log_saga_reg.coef_[0]
Log_saga_l1_coef = Log_saga_l1.coef_[0]
Log_saga_l2_coef = Log_saga_l2.coef_[0]

#Create a dataframe indexed by feature name
df = pd.DataFrame({'Log_saga_reg_coef':Log_saga_reg_coef, 'l2_coef':Log_saga_l2_coef, 'l1_coef':Log_saga_l1_coef}, index = X_train.columns)
df = df.sort_values(by=['Log_saga_reg_coef'],ascending = False)
df = np.round(df,3)
df

Unnamed: 0,Log_saga_reg_coef,l2_coef,l1_coef
BMI_normalized,2.627,2.623,2.827
HighBP,0.702,0.691,0.679
HighChol,0.663,0.66,0.657
GenHlth,0.449,0.506,0.506
Stroke,0.329,0.297,0.279
Sex,0.3,0.318,0.308
DiffWalk,0.299,0.277,0.252
HeartDiseaseorAttack,0.176,0.141,0.135
Age,0.112,0.125,0.131
Veggies,0.031,0.064,0.041


**Discussion**
* The coefficient of 'BMI_normalized' is the highest among all features across all three models, followed by 'HighBP' and 'HighChol'.
* L1 is more likely than L2 to shrink coefficients all the way to zero ('AnyHealthcare' in this case).
* Overall, at the default C = 1.0, the shrinkage of coefficients from applying a regularization term is not substantial.
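Why L1 produces exact zeros while L2 does not can be seen in a one-dimensional toy problem. The sketch below is a simplification, not the logistic-regression objective itself: it assumes a quadratic loss `0.5 * (w - a)**2`, where `a` stands for the unpenalized coefficient and `lam` for the penalty weight. Both function names are hypothetical.

```python
def l1_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w), i.e. soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # any |a| <= lam is set to exactly zero

def l2_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2, i.e. proportional shrinkage."""
    return a / (1.0 + lam)

# A small and a large unpenalized coefficient (illustrative values only)
for a in (0.03, 2.6):
    print(a, l1_shrink(a, 0.1), l2_shrink(a, 0.1))
```

In this toy setting the L1 penalty subtracts a fixed amount and clips small coefficients to zero, whereas the L2 penalty rescales every coefficient and never reaches zero unless the coefficient was already zero.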

## **Compare Model Performance with Different C Values**

* This comparison applies only to the L1 and L2 penalties. With penalty = None there is no regularization term, so there is no regularization strength to vary.
* Solvers used: 'liblinear' and 'saga'.
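To make the role of C concrete, the sketch below writes out a penalized objective in the form given in the scikit-learn user guide, up to a constant rescaling: C multiplies the data loss, so a small C lets the penalty dominate. It assumes labels coded as -1/+1. `penalized_objective`, `penalty_share`, and the toy data are hypothetical and for illustration only.

```python
import math

def penalized_objective(weights, intercept, X, y, C, penalty):
    """Return C * (sum of log-losses) + penalty term, with y in {-1, +1}."""
    data_loss = 0.0
    for row, label in zip(X, y):
        margin = label * (sum(w * x for w, x in zip(weights, row)) + intercept)
        data_loss += math.log1p(math.exp(-margin))
    if penalty == "l1":
        reg = sum(abs(w) for w in weights)
    elif penalty == "l2":
        reg = 0.5 * sum(w * w for w in weights)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return C * data_loss + reg

def penalty_share(weights, intercept, X, y, C, penalty):
    """Fraction of the objective contributed by the penalty term."""
    total = penalized_objective(weights, intercept, X, y, C, penalty)
    no_weights = [0.0] * len(weights)
    data_part = penalized_objective(no_weights, 0.0, X, y, C, penalty)
    # With zero weights the penalty vanishes, so recompute the data part directly.
    data_part = total - (sum(abs(w) for w in weights) if penalty == "l1"
                         else 0.5 * sum(w * w for w in weights))
    return (total - data_part) / total

# Toy data (made up for illustration)
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_toy = [1, -1, 1]
w_toy = [0.5, -0.5]

for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    print(C, round(penalty_share(w_toy, 0.0, X_toy, y_toy, C, "l2"), 3))
```

For fixed weights, the penalty's share of the objective falls as C grows, which is why larger C values correspond to weaker regularization.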

In [None]:
def compare_c(X_train, y_train, X_test, y_test, solver, p):
    """Fit one model per C value and summarize its coefficients and test accuracy."""

    rows = []

    for c in [0.001, 0.01, 0.1, 1, 10, 100]:
        comp = LogisticRegression(random_state=66, max_iter=1000, solver=solver, C=c, penalty=p)
        comp.fit(X_train, y_train)

        coef = comp.coef_

        score = comp.score(X_test, y_test)        # accuracy on the test set
        min_coef = np.min(coef)
        max_coef = np.max(coef)
        average_coef = np.mean(np.abs(coef))      # mean absolute coefficient
        zero_coef = np.sum(coef == 0)             # coefficients that are exactly zero

        rows.append({"c":c, "min" : min_coef, "max" : max_coef,
                        "mean_abs": average_coef, "n_zero": zero_coef,
                        "test_score": score})

    df = np.round(pd.DataFrame(rows), 3)

    return df


**solver = 'liblinear'**

In [None]:
#penalty = 'l1', solver = 'liblinear'
c_l1_liblinear= compare_c(X_train, y_train, X_test, y_test, p='l1', solver = 'liblinear')
c_l1_liblinear

Unnamed: 0,c,min,max,mean_abs,n_zero,test_score
0,0.001,-0.208,0.0,0.012,18,0.858
1,0.01,-0.348,0.133,0.031,16,0.858
2,0.1,-0.209,1.34,0.191,8,0.861
3,1.0,-0.201,2.758,0.321,2,0.867
4,10.0,-0.256,2.918,0.366,0,0.869
5,100.0,-0.264,2.927,0.369,0,0.869


**Discussion**
* As C increases (weaker regularization), the number of features with zero coefficients decreases and the test score increases.
* Strong L1 regularization (C = 0.001, 0.01) shrinks more than half of the coefficients to zero, which suggests underfitting at these settings (test score 0.858 versus 0.869 at C = 10).

In [None]:
#penalty = 'l2', solver = 'liblinear'
c_l2_liblinear= compare_c(X_train, y_train, X_test, y_test, p='l2', solver = 'liblinear')
c_l2_liblinear

Unnamed: 0,c,min,max,mean_abs,n_zero,test_score
0,0.001,-0.212,0.084,0.05,0,0.858
1,0.01,-0.35,0.399,0.151,0,0.859
2,0.1,-0.474,0.992,0.281,0,0.858
3,1.0,-0.294,2.356,0.307,0,0.865
4,10.0,-0.269,2.855,0.357,0,0.868
5,100.0,-0.263,2.929,0.371,0,0.869


**Discussion**
* None of the coefficients is exactly zero (L2 does not shrink coefficients as aggressively as L1).
* As C increases, test_score generally increases, although not monotonically (0.859 at C = 0.01 versus 0.858 at C = 0.1).





**solver = 'saga'**

In [None]:
#penalty = 'l1', solver = 'saga'
c_l1_saga= compare_c(X_train, y_train, X_test, y_test, p='l1', solver = 'saga')
c_l1_saga

Unnamed: 0,c,min,max,mean_abs,n_zero,test_score
0,0.001,0.0,0.0,0.0,21,0.858
1,0.01,-0.032,0.496,0.031,18,0.858
2,0.1,-0.069,1.931,0.208,7,0.867
3,1.0,-0.19,2.827,0.336,1,0.867
4,10.0,-0.256,2.921,0.367,0,0.869
5,100.0,-0.263,2.931,0.371,0,0.868


**Discussion**
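* C = 0.001 results in all features having zero coefficients.
* Even when all coefficients are shrunk to zero, the accuracy remains at 0.858. With no active features the model predicts the same class for every sample, so 0.858 is simply the share of that class in the test set. Accuracy alone therefore hides how poor this model is; other evaluation metrics (e.g., precision) have not been examined yet and will be covered in the next part of the project.
* As the regularization strength decreases, the score increases overall but not steadily (it dips from 0.869 at C = 10 to 0.868 at C = 100).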
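The point about accuracy with all-zero coefficients can be illustrated with a constant predictor. The sketch below uses made-up toy labels rather than the diabetes data, and `constant_prediction_metrics` is a hypothetical helper. It assumes the constant prediction is class 0.

```python
def constant_prediction_metrics(y_true, predicted_class=0, positive_class=1):
    """Accuracy and positive-class recall when every sample gets the same prediction."""
    n = len(y_true)
    accuracy = sum(1 for y in y_true if y == predicted_class) / n
    positives = sum(1 for y in y_true if y == positive_class)
    # A constant predictor finds either all of the positives or none of them.
    true_positives = positives if predicted_class == positive_class else 0
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall

# Toy labels (illustrative only): six negatives and one positive
toy_labels = [0, 0, 0, 0, 0, 0, 1]
accuracy, recall = constant_prediction_metrics(toy_labels)
print(round(accuracy, 3), recall)
```

On imbalanced labels such a predictor scores high accuracy while identifying none of the positive cases, which is why metrics beyond accuracy are worth checking.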

In [None]:
#penalty = 'l2', solver = 'saga'
c_l2_saga= compare_c(X_train, y_train, X_test, y_test, p='l2', solver = 'saga')
c_l2_saga

Unnamed: 0,c,min,max,mean_abs,n_zero,test_score
0,0.001,-0.07,0.177,0.045,0,0.858
1,0.01,-0.129,0.442,0.127,0,0.859
2,0.1,-0.184,1.383,0.253,0,0.861
3,1.0,-0.257,2.623,0.345,0,0.869
4,10.0,-0.263,2.898,0.368,0,0.869
5,100.0,-0.264,2.929,0.371,0,0.869


**Discussion**
* The score improves overall as the value of C increases.
* No coefficient is shrunk to exactly zero.


# **Interpretation of results**

* Applying the L1 penalty helps avoid overfitting by controlling the complexity of the model and selecting important features.
* With the L1 penalty, the default setting of C (1.0) gives a good balance between model complexity and regularization strength.
* Selected parameters: C = 1.0, max_iter = 1000, penalty = 'l1'

In [None]:
L1_saga = LogisticRegression(random_state=66, max_iter=1000, solver='saga', C=1.0, penalty='l1')
L1_saga.fit(X_train,y_train)
L1_saga.score(X_test, y_test)
pd.DataFrame({'features': L1_saga.feature_names_in_, 'coefficient': L1_saga.coef_[0]}).sort_values(by='coefficient', ascending=False).reset_index(drop=True)

Unnamed: 0,features,coefficient
0,BMI_normalized,2.827426
1,HighBP,0.678963
2,HighChol,0.657256
3,CholCheck,0.598991
4,GenHlth,0.506462
5,Sex,0.308296
6,Stroke,0.279389
7,DiffWalk,0.251646
8,HeartDiseaseorAttack,0.135017
9,Age,0.130538


In [None]:
L1_lib = LogisticRegression(random_state=66, max_iter=1000, solver='liblinear', C=1.0, penalty='l1')
L1_lib.fit(X_train,y_train)
L1_lib.score(X_test, y_test)
pd.DataFrame({'features': L1_lib.feature_names_in_, 'coefficient': L1_lib.coef_[0]}).sort_values(by='coefficient', ascending=False).reset_index(drop=True)

Unnamed: 0,features,coefficient
0,BMI_normalized,2.758217
1,HighBP,0.681824
2,HighChol,0.655894
3,GenHlth,0.488459
4,CholCheck,0.35476
5,Sex,0.302265
6,Stroke,0.283851
7,DiffWalk,0.260106
8,HeartDiseaseorAttack,0.143633
9,Age,0.124089


**Discussion**

* Both models assign a zero coefficient to 'AnyHealthcare', suggesting that it contributes little to predicting diabetes. In the model using the 'liblinear' solver, the coefficient of 'NoDocbcCost' is also zero, which suggests that healthcare coverage and affordability of healthcare are not important indicators of diabetes.
* Healthcare coverage and affordability of healthcare are likely to be correlated in this case. Individuals without a healthcare plan often have difficulty affording care, given how high medical costs are in the U.S.
* The models with L1 regularization suggest that healthcare coverage and affordability are not indicators of diabetes. This may make sense because the effects of daily habits and health history often dominate the effects of other factors in the case of diabetes.
* Mental health, income, physical health, education, fruit consumption, physical activity, and heavy alcohol consumption have negative coefficients, which means the predicted probability of having diabetes decreases as the levels of these features increase.
* Surprisingly, heavy alcohol drinkers are less likely to have diabetes based on this analysis. This is counterintuitive, given that alcoholic drinks are typically high in sugar and often mixed with high-carb mixers. Moreover, heavy drinkers are more likely to manage their health poorly, which can lead to diabetes.
* BMI, high blood pressure, high cholesterol, general health, cholesterol check, sex, history of stroke, difficulty walking, history of heart disease, age, smoking habit, and vegetable consumption have positive coefficients. These features are identified as indicators of diabetes by the model with L1 regularization.
* Vegetable consumption is intuitively considered a healthy habit, so its positive coefficient is unexpected. It could be due to complex relationships between features, or to changes in dietary habits among patients after being diagnosed with diabetes.
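One way to read the signs and sizes of these coefficients is to convert them to odds ratios. In logistic regression, exponentiating a coefficient gives the multiplicative change in the odds of the positive class for a one-unit increase in that feature, holding the other features fixed. The sketch below applies this to two coefficients from the 'saga' table above; `odds_ratio` is a hypothetical helper.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return math.exp(coefficient)

# Coefficients copied from the L1 'saga' model output above
saga_l1_coefficients = {"HighBP": 0.678963, "Age": 0.130538}

for feature, coefficient in saga_l1_coefficients.items():
    print(feature, round(odds_ratio(coefficient), 2))
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient. For 'HighBP' the ratio is roughly 1.97, so in this model the odds of diabetes are about twice as high for respondents with high blood pressure, other features being equal. How to read a "one-unit increase" for binned or normalized features such as 'BMI_normalized' depends on how those features were scaled, which is not shown here.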