---

## Data Analysis

- This file differs from [2_data_analysis_1_base_data.ipynb](2_data_analysis_1_base_data.ipynb) in that it:
    - scales the base cleaned data created in [1_data_cleaning.ipynb](1_data_cleaning.ipynb).

Source dataset: 247076 rows × 37 columns
Processed and analyzed dataset: 247076 rows × 37 columns

---

In [2]:
# package imports go here
import pandas as pd
import numpy as np
import fastparquet as fp
import os
import sys
import pickle
import matplotlib.pyplot as plt
import importlib
import config

sys.path.insert(1, config.package_path)
import ml_analysis as mlanlys
import ml_clean_feature as mlclean

import warnings
warnings.filterwarnings("ignore")

---

## 1. Read the cleaned dataset from file

---

In [3]:
# reload any changes to Config Settings
importlib.reload(config)

year                        = config.year

clean_file                  = config.clean_file

print(f"Year:                        {year}")
print(f"Clean File:                  {clean_file}")


Year:                        2021
Clean File:                  data/brfss_2021_clean.parquet.gzip


In [4]:
# Read final cleaned dataset from parquet file
df = pd.read_parquet(clean_file, engine="fastparquet")

In [5]:
diabetes_labels = df.columns

In [6]:
df.shape

(247076, 37)

---

## 2. Prepare the dataset for analysis

- Split the dataset into features and labels.
- Split the dataset into training and testing sets.
- Scale the dataset.
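
The three steps above are handled in this notebook by `mlanlys.modify_base_dataset()`, whose internals are not shown here. As a rough illustration only, the sketch below performs the same kind of preparation directly with scikit-learn on a small synthetic frame. The `toy_df` data and all `toy_`-prefixed names are hypothetical. The default 75/25 split is an assumption, although it matches the 185307/61769 row counts reported later.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cleaned dataset (not the real BRFSS data)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "bmi": rng.normal(28, 6, 200),
    "age": rng.integers(18, 80, 200),
    "diabetes": rng.integers(0, 2, 200),
})

# 1. Split into features and labels
toy_X = toy_df.drop(columns="diabetes")
toy_y = toy_df["diabetes"]

# 2. Split into training and testing sets (default test_size is 0.25)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=42
)

# 3. Fit the scaler on the training features only, then transform both sets
toy_scaler = StandardScaler().fit(toy_X_train)
toy_X_train_scaled = toy_scaler.transform(toy_X_train)
toy_X_test_scaled = toy_scaler.transform(toy_X_test)

print(len(toy_X_train), len(toy_X_test))
print(toy_X_train_scaled.mean(axis=0).round(6), toy_X_train_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training features alone keeps statistics from the test set out of the model inputs.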

---

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [8]:
# reload any changes to mlanlys
importlib.reload(mlanlys)

target = 'diabetes'
# Dictionary defining modification to be made to the base dataset
operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  True,
                    'scaler'            : 'standard', # options: none, standard, minmax
                    'random_sample'     : 'none'      # options: none, undersample, oversample
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
prepared_data = mlanlys.modify_base_dataset(df_modified, operation_dict)

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  True
  -- Converting dataset to binary (0,1) from (0,1,2)


****Cleaning Feature: diabetes
  Initial Unique features in [diabetes]:  [0. 1. 2.]
  values_to_drop: ********* NO Parameters were specified *********
  translate: {1: 0, 2: 1}
  scale: ********* NO Parameters were specified *********
  FINAL Unique features in [diabetes]:  [0. 1.]
**Operation:scaler  standard
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
  -- Performing StandardScaler on X_train: Updates X_train, y_test
**Operation:random_sample  none

Dataframe, Train Test Summary
-----------------------------
Dataframe: (247076, 37)  Data:4, X_train:185307, y_train:185307, X_test:61769, y_test:61769
ValueCounts:   y_train: len:2   0: 160529   1: 24778
ValueCounts:   y_test : len:2   0:  53514  

In [9]:
# Print some statistics about the original df and the modified dataframe
print(f"Original Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

print(f"\nModified Dataframe")
print(f"------------------")
print(f"df_modified.shape: {df_modified.shape}")
print(f"df_modified[{target}].value_counts:  {df_modified[target].value_counts()}")

Original Dataframe
------------------
df.shape: (247076, 37)
df[diabetes].value_counts:  diabetes
0.0    208389
2.0     33033
1.0      5654
Name: count, dtype: int64

Modified Dataframe
------------------
df_modified.shape: (247076, 37)
df_modified[diabetes].value_counts:  diabetes
0.0    214043
1.0     33033
Name: count, dtype: int64
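
The binary conversion can be checked by hand against the counts printed above. The sketch below is not the `mlclean` implementation. It assumes the mapping reported in the log (`translate: {1: 0, 2: 1}`) and applies it to the original class counts, which should reproduce the modified counts.

```python
import pandas as pd

# Class counts copied from the "Original Dataframe" output above
original_counts = pd.Series({0.0: 208389, 2.0: 33033, 1.0: 5654})

# Mapping assumed from the log: pre-diabetes (1) folds into 0, diabetes (2) becomes 1
to_binary = {0.0: 0.0, 1.0: 0.0, 2.0: 1.0}

# Relabel the classes, then add up the counts that now share a label
binary_counts = original_counts.rename(index=to_binary).groupby(level=0).sum()
print(binary_counts)

# The same mapping applied to a small column of labels
example_labels = pd.Series([0.0, 1.0, 2.0, 2.0, 0.0])
example_binary = example_labels.map(to_binary)
print(example_binary.tolist())
```

The 5654 pre-diabetes rows join the 208389 no-diabetes rows, giving the 214043 negatives shown for the modified dataframe.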


In [10]:
X_train, X_test, y_train, y_test = prepared_data
print(f"Dataframe: {df_modified.shape}  Data:{len(prepared_data)}, X_train:{len(X_train)}, y_train:{len(y_train)}, X_test:{len(X_test)}, y_test:{len(y_test)}")

Dataframe: (247076, 37)  Data:4, X_train:185307, y_train:185307, X_test:61769, y_test:61769


In [11]:
y_train.value_counts()

diabetes
0.0    160529
1.0     24778
Name: count, dtype: int64

In [12]:
y_test.value_counts()

diabetes
0.0    53514
1.0     8255
Name: count, dtype: int64

---

## 3. Optimization

---

#### 3.1 Pre-optimization metric results

- The summary report of the metrics for all pre-optimization runs is here:  [performance_report.txt](reports/performance_report.txt)
- The details of the runs are contained in these files:
    - [base_dataset_detailed_performance_report.txt](reports/base_dataset_detailed_performance_report.txt)
    - [binary_dataset_detailed_performance_report.txt](reports/binary_dataset_detailed_performance_report.txt)
    - [minmax_scaled_dataset_detailed_performance_report.txt](reports/minmax_scaled_dataset_detailed_performance_report.txt)
    - [randomoversample_dataset_detailed_performance_report.txt](reports/randomoversample_dataset_detailed_performance.txt)
    - [cluster_dataset_detailed_performance_report.txt](reports/cluster_dataset_detailed_performance_report.txt)
    - [randomundersampled_dataset_detailed_performance_report.txt](reports/randomundersampled_dataset_detailed_performance_report.txt)
    - [smoteen_dataset_detailed_performance_report.txt](reports/smoteen_dataset_detailed_performance_report.txt)
    - [standard_scaled_dataset_detailed_performance_report.txt](reports/standard_scaled_dataset_detailed_performance_report.txt)
    - [smote_dataset_detailed_performance_report.txt](reports/smote_dataset_detailed_performance_report.txt)

#### 3.2 Optimization Dataset used

**Note:**  Modify the dataset as desired in Section 2.
<br><br>
Currently the dataset uses the Base dataset for 2021 (the year set in `config`) with the following modifications:
- **Target converted to Binary**:  (0,1) from (0,1,2).
    - Base:  0: No diabetes, 1: Pre-diabetes, 2: have diabetes
    - binary: 0: No diabetes/pre-diabetes, 1: have diabetes
- Scaled the data with **StandardScaler**
- No resampling applied (`random_sample` is set to `'none'` in Section 2), so the classes remain imbalanced


#### 3.3 Optimization Analysis

In [13]:
# Note:  Modify the dataset as desired in Section 2.
# Currently the dataset uses the Base dataset for 2021 with the modifications listed in Section 3.2

In [14]:
# Dataset summary:
X_train, X_test, y_train, y_test = prepared_data

print(f"Dataset Lens, X_train:{len(X_train)}, y_train:{len(y_train)}, X_test:{len(X_test)}, y_test:{len(y_test)}")
print(f"y_train value_counts: {y_train.value_counts()}")
print(f"y_test value_counts: {y_test.value_counts()}")


Dataset Lens, X_train:185307, y_train:185307, X_test:61769, y_test:61769
y_train value_counts: diabetes
0.0    160529
1.0     24778
Name: count, dtype: int64
y_test value_counts: diabetes
0.0    53514
1.0     8255
Name: count, dtype: int64


##### 3.3.1 Perform Optimization:

- Dataset:  2021 Diabetes Data Set
- Optimization approach: LogisticRegression tuned with RandomizedSearchCV


In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.metrics import classification_report, accuracy_score

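# Define the parameter distributions to sample from
# Note: of the solvers listed, only 'saga' supports the 'l1' penalty; candidates that pair
# 'l1' with 'newton-cg', 'lbfgs', or 'sag' fail to fit and are scored as nan.
param_distributions = {
    'C': uniform(loc=0, scale=4),  # Inverse of regularization strength, sampled from [0, 4]
    'penalty': ['l1', 'l2'],       # Regularization type
    'solver': ['newton-cg', 'lbfgs', 'sag', 'saga']  # All four support 'l2'; only 'saga' supports 'l1'
}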

# Initialize the Logistic Regression model
log_reg = LogisticRegression(max_iter=5000)

# Initialize the RandomizedSearchCV
random_search = RandomizedSearchCV(
    log_reg, 
    param_distributions=param_distributions, 
    n_iter=100,  # Number of parameter settings that are sampled
    cv=5,        # 5-fold cross-validation
    verbose=3,   # Verbosity mode
    n_jobs=-1,   # Use all available cores
    random_state=42  # Seed for reproducibility
)

# Fit the model
random_search.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
[CV 2/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.1s
[CV 3/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.1s
[CV 1/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 5/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 1/5] END C=2.9279757672456204, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 2/5] END C=2.9279757672456204, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 4/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 4/5] END C=2.9279757672456204, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 2/5] END C=0.6240745617697461, penalty=l1, solver=sag;, score=nan total time=   0.1s
[CV 3/5] END C=0.6240745617697461, penalty=l1, solver=sag;, score=nan total time=   0.1s
[CV 5/5] END C=0.6240745617697461, pena
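
The `score=nan` entries in the log above come from candidates that pair `penalty='l1'` with `newton-cg` or `sag`, which only support L2 (or no) regularization. One possible way to avoid spending fits on invalid combinations, shown here only as a sketch, is to pass `RandomizedSearchCV` a list of dictionaries so that each sampled candidate is drawn from a valid penalty/solver group. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `valid_param_distributions`, and `demo_search` are assumptions for illustration. This is not what the notebook ran.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the sketch runs quickly (not the BRFSS data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Each dictionary only contains penalty/solver pairs that LogisticRegression accepts
valid_param_distributions = [
    {
        "C": uniform(loc=0, scale=4),
        "penalty": ["l2"],
        "solver": ["newton-cg", "lbfgs", "sag", "saga"],
    },
    {
        "C": uniform(loc=0, scale=4),
        "penalty": ["l1"],
        "solver": ["saga"],
    },
]

demo_search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000),
    param_distributions=valid_param_distributions,
    n_iter=8,         # kept small for the sketch
    cv=3,
    random_state=42,  # seed for reproducibility
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print("Any nan scores:", np.isnan(demo_search.cv_results_["mean_test_score"]).any())
```

When a list is given, a dictionary is picked first and the parameters are then sampled from it, so every candidate in the search is a combination that can actually be fitted.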

In [16]:
# Summarize the Performance of the Best parameters
print(f"---------------------------------------------")
print(f"Best Parameters for Logistic Regression: {random_search.best_params_}")
print(f"---------------------------------------------\n\n")

# Best model
best_log_reg = random_search.best_estimator_

# Make predictions on the test set
y_pred_log_reg = best_log_reg.predict(X_test)

best_data = X_test, y_test, y_pred_log_reg
mlanlys.model_performance_metrics(best_log_reg, best_data, "Best Logistic Regression")

---------------------------------------------
Best Parameters for Logistic Regression: {'C': 1.4338629141770904, 'penalty': 'l1', 'solver': 'saga'}
---------------------------------------------


------------------------------------------------------------------------
---------- Best Logistic Regressioning Data Performance
------------------------------------------------------------------------
Confusion Matrix
[[52444  1070]
 [ 7003  1252]]

-----------------------
Best Logistic Regression score: 0.8693
Balanced Accuracy Score: 0.5658
ROC AUC Score: 0.8241
Mean Squared Error: 0.1307
------------------------------
--- Classification values
------------------------------
Accuracy: 0.8693
Precision: 0.5392
Recall: 0.1517
F1-score: 0.2368
Specificity: 0.98
False Positive Rate: 0.02
Matthews Correlation Coefficient: 0.2356

-----------------------
Classification Report
              precision    recall  f1-score   support

         0.0       0.88      0.98      0.93     53514
         1.0 

{'model': 'LogisticRegression',
 'slice': 'Best Logistic Regression',
 'score': 0.8693,
 'balanced_accuracy': 0.5658,
 'roc_auc_score': 0.8241,
 'Mean Squared Error': 0.1307,
 'Accuracy': 0.8693,
 'Precision': 0.5392,
 'Recall': 0.1517,
 'F1-score': 0.2368,
 'Specificity': 0.98,
 'False Positive Rate': 0.02,
 'Matthews Correlation Coefficient': 0.2356}
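
Most of the headline numbers above can be re-derived from the confusion matrix alone. The sketch below assumes the matrix follows the scikit-learn convention (rows are true labels, columns are predicted labels), which is consistent with the row totals of 53514 and 8255 matching the test-set class counts. It is a hand check, not the code inside `mlanlys.model_performance_metrics()`.

```python
import numpy as np

# Confusion matrix copied from the "Best Logistic Regression" output above
cm = np.array([[52444, 1070],
               [7003, 1252]])

# Assumed layout: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # sensitivity: share of diabetes cases found
specificity = tn / (tn + fp)       # share of non-diabetes cases correctly rejected
balanced_accuracy = (recall + specificity) / 2

print(f"Accuracy:          {accuracy:.4f}")
print(f"Precision:         {precision:.4f}")
print(f"Recall:            {recall:.4f}")
print(f"Specificity:       {specificity:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
```

The gap between accuracy (about 0.87) and balanced accuracy (about 0.57) reflects the class imbalance: the model is right on most negatives but finds only about 15% of the positive cases.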

In [16]:
# Model without optimization
X_train, X_test, y_train, y_test = prepared_data

model = LogisticRegression()
# Train the model
model = model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

test_data = [ X_test, y_test, y_test_pred]

mlanlys.model_performance_metrics(model, test_data, "Original Logistic Regression" )


------------------------------------------------------------------------
---------- Original Logistic Regressioning Data Performance
------------------------------------------------------------------------
Confusion Matrix
[[52571  1020]
 [ 6962  1216]]

-----------------------
Original Logistic Regression score: 0.8708
Balanced Accuracy Score: 0.5648
ROC AUC Score: 0.825
Mean Squared Error: 0.1292
------------------------------
--- Classification values
------------------------------
Accuracy: 0.8708
Precision: 0.5438
Recall: 0.1487
F1-score: 0.2335
Specificity: 0.981
False Positive Rate: 0.019
Matthews Correlation Coefficient: 0.2353

-----------------------
Classification Report
              precision    recall  f1-score   support

         0.0       0.88      0.98      0.93     53591
         1.0       0.54      0.15      0.23      8178

    accuracy                           0.87     61769
   macro avg       0.71      0.56      0.58     61769
weighted avg       0.84      0.87    

{'model': 'LogisticRegression',
 'slice': 'Original Logistic Regression',
 'score': 0.8708,
 'balanced_accuracy': 0.5648,
 'roc_auc_score': 0.825,
 'Mean Squared Error': 0.1292,
 'Accuracy': 0.8708,
 'Precision': 0.5438,
 'Recall': 0.1487,
 'F1-score': 0.2335,
 'Specificity': 0.981,
 'False Positive Rate': 0.019,
 'Matthews Correlation Coefficient': 0.2353}
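
To compare the two runs side by side, the returned dictionaries can be placed in one table. The sketch below copies the values printed above into two hypothetical variables, `best_metrics` and `original_metrics`. Note that the class supports differ between the two reports (53514/8255 versus 53591/8178), which suggests the models were evaluated on different random splits, so small differences should be read with care.

```python
import pandas as pd

# Values copied from the two result dictionaries printed above
best_metrics = {
    "score": 0.8693, "balanced_accuracy": 0.5658, "roc_auc_score": 0.8241,
    "Mean Squared Error": 0.1307, "Accuracy": 0.8693, "Precision": 0.5392,
    "Recall": 0.1517, "F1-score": 0.2368, "Specificity": 0.98,
    "False Positive Rate": 0.02, "Matthews Correlation Coefficient": 0.2356,
}
original_metrics = {
    "score": 0.8708, "balanced_accuracy": 0.5648, "roc_auc_score": 0.825,
    "Mean Squared Error": 0.1292, "Accuracy": 0.8708, "Precision": 0.5438,
    "Recall": 0.1487, "F1-score": 0.2335, "Specificity": 0.981,
    "False Positive Rate": 0.019, "Matthews Correlation Coefficient": 0.2353,
}

# One row per metric, one column per model, plus the difference between them
comparison = pd.DataFrame({"optimized": best_metrics, "original": original_metrics})
comparison["delta"] = (comparison["optimized"] - comparison["original"]).round(4)
print(comparison)
```

Every difference in the table is 0.005 or smaller in absolute value.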

---

## 4. Conclusions

---