---

## Data Analysis

- This file differs from [2_data_analysis_1_base_data.ipynb](2_data_analysis_1_base_data.ipynb) in that it:
    - scales the base cleaned data created in [1_data_cleaning.ipynb](1_data_cleaning.ipynb).

Source dataset: 247076 rows × 37 columns
Processed and analyzed dataset: 247076 rows × 37 columns

---

In [1]:
# package imports go here
import pandas as pd
import numpy as np
import fastparquet as fp
import os
import sys
import pickle
import matplotlib.pyplot as plt
import importlib

sys.path.insert(1, 'pkgs')
import ml_analysis as mlanlys
import ml_clean_feature as mlclean

import warnings
warnings.filterwarnings("ignore")

ModuleNotFoundError: No module named 'ml_analysis'

---

## 1. Read the cleaned dataset from file

---

In [2]:
# Path to results
year = 2015
source_path     = "data/"
clean_file      = source_path + 'brfss_' + str(year) + '_clean.parquet.gzip'

report_path = 'reports/'
performance_report = report_path + 'performance_report.pkl'

# BE SURE TO UPDATE THE LABEL FOR THIS ANALYSIS
dataset_label = 'RandomUndersampled Dataset'

file_label = dataset_label.lower().replace(' ','_')

detailed_performance_report = report_path + file_label + '_detailed_performance_report.txt'
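The report file names above are derived from `dataset_label`, so a mistyped label silently writes to a different file. As a minimal sketch, the same derivation can be wrapped in a small helper and checked before any model is run. The helper name `report_paths` and its dictionary keys are hypothetical and are not part of the notebook or of `ml_analysis`.

```python
from pathlib import Path


def report_paths(dataset_label, report_dir="reports"):
    """Build the report paths for a dataset label (hypothetical helper)."""
    # Same normalisation as the cell above: lower-case, spaces to underscores
    file_label = dataset_label.lower().replace(" ", "_")
    report_dir = Path(report_dir)
    return {
        "summary": report_dir / "performance_report.pkl",
        "detailed": report_dir / f"{file_label}_detailed_performance_report.txt",
    }


paths = report_paths("RandomUndersampled Dataset")
print(paths["detailed"].as_posix())
# reports/randomundersampled_dataset_detailed_performance_report.txt
```

Printing the derived path once per run makes it easy to confirm that it matches the intended entry in the report list in Section 3.1.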

In [3]:
# Read final cleaned dataset from parquet file
df = pd.read_parquet(clean_file, engine="fastparquet")

In [4]:
diabetes_labels = df.columns

In [5]:
df.shape

(253680, 22)

---

## 2. Prepare the dataset for analysis

- Split the dataset into features and labels.
- Split the dataset into training and testing sets.
- Scale the dataset.
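The steps above can be sketched with scikit-learn alone. This is an illustration on synthetic data, not the implementation of `mlanlys.modify_base_dataset()`, whose internals are not shown in this notebook. The array shapes, the class ratio, and the 25% test size are assumptions chosen for the example. The scaler is fitted on the training split only, so no information from the test split leaks into the scaling statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the binary target (assumed shapes)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (rng.random(200) < 0.2).astype(int)

# Stratify so both splits keep roughly the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The training columns end up with mean 0 and standard deviation 1. The test columns will be close to, but not exactly, those values because they reuse the training statistics.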

---

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [7]:
# reload any changes to mlanlys
importlib.reload(mlanlys)

target = 'diabetes'
# Dictionary defining modification to be made to the base dataset
operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  True,
                    'scaler'            : 'standard', # options: none, standard, minmax
                    'random_sample'     : 'undersample'      # options: none, undersample, oversample
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

NameError: name 'mlanlys' is not defined
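The `'random_sample': 'undersample'` option asks for the majority class to be reduced to the size of the minority class. The following dependency-free sketch shows the effect on class counts. It is not the code behind `modify_base_dataset()`, and the function name `random_undersample` is hypothetical. In practice this step is usually delegated to a library such as imbalanced-learn.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=42):
    """Keep an equal number of rows per class, sampled without replacement."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    # Every class is cut down to the size of the smallest class
    n_keep = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_keep))
    kept.sort()  # preserve the original row order

    return [rows[i] for i in kept], [labels[i] for i in kept]


labels = [0] * 90 + [1] * 10   # 9:1 imbalance, as an example
rows = list(range(100))
rows_bal, labels_bal = random_undersample(rows, labels)
print(Counter(labels_bal))
```

All minority rows are kept and most majority rows are discarded, so this trades training-set size for balanced classes.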

In [8]:
# Print some statistics about the original df and the modified dataframe
print("Original Dataframe")
print("------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

print("\nModified Dataframe")
print("------------------")
print(f"df_modified.shape: {df_modified.shape}")
print(f"df_modified[{target}].value_counts:  {df_modified[target].value_counts()}")

Original Dataframe
------------------
df.shape: (253680, 22)


NameError: name 'target' is not defined

In [9]:
X_train, X_test, y_train, y_test = data
print(f"Dataframe: {df_modified.shape}  Data:{len(data)}, X_train:{len(X_train)}, y_train:{len(y_train)}, X_test:{len(X_test)}, y_test:{len(y_test)}")

NameError: name 'data' is not defined

In [10]:
y_train.value_counts()

NameError: name 'y_train' is not defined

In [11]:
y_test.value_counts()

NameError: name 'y_test' is not defined

---

## 3. Optimization

---

#### 3.1 Pre-optimization metric results

- The summary report of the metrics for all pre-optimization runs is here:  [performance_report.txt](reports/performance_report.txt)
- The details of the runs are contained in these files:
    - [base_dataset_detailed_performance_report.txt](reports/base_dataset_detailed_performance_report.txt)
    - [binary_dataset_detailed_performance_report.txt](reports/binary_dataset_detailed_performance_report.txt)
    - [minmax_scaled_dataset_detailed_performance_report.txt](reports/minmax_scaled_dataset_detailed_performance_report.txt)
    - [randomoversample_dataset_detailed_performance_report.txt](reports/randomoversample_dataset_detailed_performance.txt)
    - [cluster_dataset_detailed_performance_report.txt](reports/cluster_dataset_detailed_performance_report.txt)
    - [randomundersampled_dataset_detailed_performance_report.txt](reports/randomundersampled_dataset_detailed_performance_report.txt)
    - [smoteen_dataset_detailed_performance_report.txt](reports/smoteen_dataset_detailed_performance_report.txt)
    - [standard_scaled_dataset_detailed_performance_report.txt](reports/standard_scaled_dataset_detailed_performance_report.txt)
    - [smote_dataset_detailed_performance_report.txt](reports/smote_dataset_detailed_performance_report.txt)

#### 3.2 Optimization Dataset used

**Note:**  Modify the dataset as desired in Section 2.
<br><br>
Currently the dataset uses the Base dataset for 2015 with the following modifications:
- **Target converted to binary**: (0, 1) from (0, 1, 2).
    - Base: 0: no diabetes, 1: pre-diabetes, 2: diabetes
    - Binary: 0: no diabetes or pre-diabetes, 1: diabetes
- Scaled the data with **StandardScaler**
- Resampled the data with **RandomUnderSampler**
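The binary conversion listed above amounts to a fixed mapping of the three base classes. A minimal pandas sketch follows. The helper name `to_binary_target` and the `BINARY_MAP` constant are hypothetical and only illustrate the mapping. The notebook itself delegates this step to `modify_base_dataset()` through `'convert_to_binary': True`.

```python
import pandas as pd

# 0: no diabetes, 1: pre-diabetes, 2: diabetes  ->  0: no diabetes or pre-diabetes, 1: diabetes
BINARY_MAP = {0: 0, 1: 0, 2: 1}


def to_binary_target(target: pd.Series) -> pd.Series:
    """Collapse the three-class target into a binary one (hypothetical helper)."""
    unknown = set(target.unique()) - set(BINARY_MAP)
    if unknown:
        # Fail loudly rather than silently producing NaN for unexpected codes
        raise ValueError(f"Unexpected target values: {sorted(unknown)}")
    return target.map(BINARY_MAP).astype(int)


example = pd.Series([0, 1, 2, 2, 0], name="diabetes")
binary = to_binary_target(example)
print(binary.tolist())  # [0, 0, 1, 1, 0]
```

Checking for unexpected codes first matters because `Series.map` would otherwise turn them into missing values.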


#### 3.3 Optimization Analysis

In [12]:
# Note:  Modify the dataset as desired in Section 2.
# Currently the dataset uses the Base dataset for 2015 with the modifications listed in Section 3.2.

In [13]:
# Dataset summary:
X_train, X_test, y_train, y_test = data
print(f"Dataset Lens, X_train:{len(X_train)}, y_train:{len(y_train)}, X_test:{len(X_test)}, y_test:{len(y_test)}")
print(f"y_train value_counts: {y_train.value_counts()}")
print(f"y_test value_counts: {y_test.value_counts()}")


NameError: name 'data' is not defined

##### 3.3.1 Perform Optimization:

- Data model: 2015 diabetes dataset
- Optimization approach: LogisticRegression tuned with RandomizedSearchCV


In [14]:
# Small sample kept for quick smoke tests; the search below uses the full training split
df_modified_sample = df_modified.head(100)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.metrics import classification_report, accuracy_score

# Define the parameter distributions.
# Each dictionary pairs a penalty with solvers that support it, so no sampled
# combination is invalid (newton-cg, lbfgs, and sag only support L2).
param_distributions = [
    {
        'C': uniform(loc=0, scale=4),  # Inverse of regularization strength
        'penalty': ['l2'],
        'solver': ['newton-cg', 'lbfgs', 'sag', 'saga']
    },
    {
        'C': uniform(loc=0, scale=4),
        'penalty': ['l1'],
        'solver': ['saga']
    }
]

# Initialize the Logistic Regression model
log_reg = LogisticRegression(max_iter=5000)

# Initialize the RandomizedSearchCV
random_search = RandomizedSearchCV(
    log_reg,
    param_distributions=param_distributions,
    n_iter=100,  # Number of parameter settings that are sampled
    cv=5,        # 5-fold cross-validation
    verbose=3,   # Verbosity mode
    n_jobs=-1,   # Use all available cores
    random_state=42  # Seed for reproducibility
)

# Fit the model. X_train and X_test come from the `data` tuple unpacked in In [13]
# and are expected to be already scaled via operation_dict['scaler'].
random_search.fit(X_train, y_train)

# Best model
best_log_reg = random_search.best_estimator_

# Make predictions on the test set
y_pred_log_reg = best_log_reg.predict(X_test)

# Evaluate the classifier
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
report_log_reg = classification_report(y_test, y_pred_log_reg, target_names=['No Diabetes','Diabetes'])

print(f"Best Parameters for Logistic Regression: {random_search.best_params_}")
print(f"Accuracy of Logistic Regression classifier: {accuracy_log_reg}")
print("Classification Report for Logistic Regression:")
print(report_log_reg)

# ASSUMPTION: model_performance() is not defined in this notebook; it is assumed
# to be provided by ml_analysis. The tuple keeps the original argument layout but
# uses a new name so that `data` from Section 2 is not overwritten.
performance_data = X_train, y_train, y_pred_log_reg
best_performance = mlanlys.model_performance(best_log_reg, performance_data, 'Best_Logistic_Regression')

NameError: name 'df_modified' is not defined
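Because the cell above could not run without `ml_analysis`, the following self-contained sketch shows the same kind of search on synthetic data. It is an illustration under stated assumptions, not a reproduction of the notebook's results. The dataset comes from `make_classification`, the search is kept small (`n_iter=8`, `cv=3`), `C` is drawn from a log-uniform range, and `liblinear` is added as an L1-capable solver. The parameter space is given as a list of dictionaries so that each penalty is only paired with solvers that support it.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic binary classification data (assumed stand-in for the real dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# One dictionary per penalty, each listing only compatible solvers
param_distributions = [
    {
        "penalty": ["l2"],
        "solver": ["lbfgs", "newton-cg", "sag", "saga"],
        "C": loguniform(1e-3, 1e2),
    },
    {
        "penalty": ["l1"],
        "solver": ["liblinear", "saga"],
        "C": loguniform(1e-3, 1e2),
    },
]

search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000, random_state=0),
    param_distributions=param_distributions,
    n_iter=8,
    cv=3,
    random_state=42,
    error_score="raise",  # surface invalid settings instead of scoring them as NaN
)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Setting `error_score="raise"` is a deliberate choice for development: with the default, a failing parameter combination is recorded as NaN with a warning and is easy to miss in verbose output.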

---

## 4. Conclusions

---