---

## Data Analysis

- This file differs from [2_data_analysis_1_base_data.ipynb](2_data_analysis_1_base_data.ipynb) in that it:
    - scales the base cleaned data created in [1_data_cleaning.ipynb](1_data_cleaning.ipynb).

Source dataset: 247076 rows × 37 columns
Processed and analyzed dataset: 247076 rows × 37 columns

---

In [21]:
# package imports go here
import pandas as pd
import numpy as np
import fastparquet as fp
import os
import sys
import pickle
import matplotlib.pyplot as plt
import importlib
import config
import time

sys.path.insert(1, config.package_path)
import ml_analysis as mlanlys
import ml_clean_feature as mlclean

import warnings
warnings.filterwarnings("ignore")

In [22]:
start_time = time.time()

---

## 1. Read the cleaned dataset from file

---

In [23]:
year                        = config.year

clean_file                  = config.clean_file
optimization_report         = config.optimization_report

# report_path                 = config.report_path
# file_label                  = dataset_label.lower().replace(' ','_')
# detailed_performance_report = report_path + file_label + '_detailed_performance_report.txt'

print(f"Year:                        {year}")
print(f"Clean File:                  {clean_file}")
# print(f"Performance Report:          {performance_report}")
# print(f"Detailed Performance Report: {detailed_performance_report}")

Year:                        2021
Clean File:                  data/brfss_2021_clean.parquet.gzip


In [24]:
# Read final cleaned dataset from parquet file
df = pd.read_parquet(clean_file, engine="fastparquet")

In [25]:
diabetes_labels = df.columns

In [26]:
df.shape

(247076, 37)

---

## 2. Prepare the dataset for analysis

- Split the dataset into features and labels.
- Split the dataset into training and testing sets.
- Scale the features and undersample the training set.
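The steps above can be sketched on toy data as follows. This is a minimal illustration, not the code inside `mlanlys.modify_base_dataset()`, whose internals are not shown here. The helper name `split_scale_undersample`, the seed, and the synthetic data are assumptions, and the NumPy-based undersampling is a stand-in for the `RandomUnderSampler` named in the log output.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_scale_undersample(X, y, test_size=0.25, seed=0):
    """Hypothetical helper: split, standardize, then balance the training set."""
    # Split first so the test set never influences scaling or resampling.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )

    # Fit the scaler on the training features only, then apply it to both sets.
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # Undersample: keep as many rows of every class as the rarest class has.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y_train, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y_train == c), size=n_min, replace=False)
        for c in classes
    ])
    return X_train[keep], X_test, y_train[keep], y_test


# Toy, imbalanced data (about 15% positives); not the BRFSS dataset.
toy_rng = np.random.default_rng(0)
X_toy = toy_rng.normal(size=(1000, 5))
y_toy = (toy_rng.random(1000) < 0.15).astype(int)

X_bal, X_te, y_bal, y_te = split_scale_undersample(X_toy, y_toy)
print(np.bincount(y_bal), len(X_te))
```

Only the training set is balanced; the test set keeps its natural class ratio, which matches the `y_train` and `y_test` counts reported later in this section.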

---

In [27]:
from sklearn.datasets import make_regression, make_swiss_roll
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [28]:
# reload any changes to mlanlys
importlib.reload(mlanlys)

target = 'diabetes'
# Dictionary defining the modifications to be made to the base dataset
operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  True,
                    'scaler'            : 'standard', # options: none, standard, minmax
                    'random_sample'     : 'undersample'      # options: none, undersample, oversample
                    }

# Prefix the label with 'binary_' only when the target is converted to binary
if operation_dict['convert_to_binary']:
    dataset_label = 'binary_' + operation_dict['scaler'] + '_' + operation_dict['random_sample']
else:
    dataset_label = operation_dict['scaler'] + '_' + operation_dict['random_sample']

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  True
  -- Converting dataset to binary (0,1) from (0,1,2)


****Cleaning Feature: diabetes
  Initial Unique features in [diabetes]:  [0. 1. 2.]
  values_to_drop: ********* NO Parameters were specified *********
  translate: {1: 0, 2: 1}
  scale: ********* NO Parameters were specified *********
  FINAL Unique features in [diabetes]:  [0. 1.]
**Operation:scaler  standard
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
  -- Performing StandardScaler on X_train: Updates X_train, y_test
**Operation:random_sample  undersample
  -- Performing RandomUnderSampler on X_train, y_train: Updates X_train, y_train

Dataframe, Train Test Summary
-----------------------------
Dataframe: (247076, 37)  Data:4, X_train:49358, y_train:49358, X_test:61769, y_test:61769
ValueCount

In [29]:
# Print some statistics about the original df and the modified dataframe
print(f"Original Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

print(f"\nModified Dataframe")
print(f"------------------")
print(f"df_modified.shape: {df_modified.shape}")
print(f"df_modified[{target}].value_counts:  {df_modified[target].value_counts()}")

Original Dataframe
------------------
df.shape: (247076, 37)
df[diabetes].value_counts:  diabetes
0.0    208389
2.0     33033
1.0      5654
Name: count, dtype: int64

Modified Dataframe
------------------
df_modified.shape: (247076, 37)
df_modified[diabetes].value_counts:  diabetes
0.0    214043
1.0     33033
Name: count, dtype: int64


In [30]:
X_train, X_test, y_train, y_test = data
print(f"Dataframe: {df_modified.shape}  Data:{len(data)}, X_train:{len(X_train)}, y_train:{len(y_train)}, X_test:{len(X_test)}, y_test:{len(y_test)}")

Dataframe: (247076, 37)  Data:4, X_train:49358, y_train:49358, X_test:61769, y_test:61769


In [31]:
y_train.value_counts()

diabetes
0.0    24679
1.0    24679
Name: count, dtype: int64

In [32]:
y_test.value_counts()

diabetes
0.0    53415
1.0     8354
Name: count, dtype: int64
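The counts printed above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below uses only numbers reported in this notebook; the variable names are illustrative.

```python
# Figures copied from the outputs above.
total_rows = 247076   # df.shape[0]
total_pos = 33033     # binary target == 1 in df_modified
test_rows = 61769     # len(X_test)
test_pos = 8354       # y_test == 1

# The test set is one quarter of the data.
test_fraction = test_rows / total_rows

# Rows and positives left for training before undersampling.
train_rows_before = total_rows - test_rows
train_pos = total_pos - test_pos

# Undersampling keeps every positive and an equal number of negatives.
train_rows_after = 2 * train_pos

# The untouched test set stays imbalanced.
test_pos_rate = test_pos / test_rows

print(test_fraction, train_rows_before, train_pos, train_rows_after)
print(round(test_pos_rate, 4))
```

The training set shrinks from 185307 rows to 49358 balanced rows, while roughly 13.5% of the test rows are positive. This imbalance is why accuracy and balanced accuracy differ in the reports that follow.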

---

## 3. Optimization Prep

---

#### 3.1 Pre-optimization metric results

- The summary report of the metrics for all pre-optimization runs is here:  [performance_report.txt](reports/performance_report.txt)
- The details of the runs are contained in these files:
    - [base_dataset_detailed_performance_report.txt](reports/base_dataset_detailed_performance_report.txt)
    - [binary_dataset_detailed_performance_report.txt](reports/binary_dataset_detailed_performance_report.txt)
    - [minmax_scaled_dataset_detailed_performance_report.txt](reports/minmax_scaled_dataset_detailed_performance_report.txt)
    - [randomoversample_dataset_detailed_performance_report.txt](reports/randomoversample_dataset_detailed_performance.txt)
    - [cluster_dataset_detailed_performance_report.txt](reports/cluster_dataset_detailed_performance_report.txt)
    - [randomundersampled_dataset_detailed_performance_report.txt](reports/randomundersampled_dataset_detailed_performance_report.txt)
    - [smoteen_dataset_detailed_performance_report.txt](reports/smoteen_dataset_detailed_performance_report.txt)
    - [standard_scaled_dataset_detailed_performance_report.txt](reports/standard_scaled_dataset_detailed_performance_report.txt)
    - [smote_dataset_detailed_performance_report.txt](reports/smote_dataset_detailed_performance_report.txt)

#### 3.2 Optimization Dataset used

**Note:**  Modify the dataset as desired in Section 2.
<br><br>
Currently the dataset uses the Base dataset for 2021 with the following modifications:
- **Target converted to Binary**:  (0,1) from (0,1,2).
    - Base:  0: No diabetes, 1: Pre-diabetes, 2: Diabetes
    - Binary: 0: No diabetes or pre-diabetes, 1: Diabetes
- Scaled the data with **StandardScaler**
- Resampled the data with **RandomUnderSampler**
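The binary conversion can be illustrated with the `translate: {1: 0, 2: 1}` mapping shown in the log output of Section 2. This is a small sketch on the reported class counts, not the code in `mlclean`; the names `TRANSLATE` and `to_binary` are assumptions.

```python
from collections import Counter

# Mapping reported in the log: pre-diabetes (1) joins class 0, diabetes (2) becomes 1.
TRANSLATE = {1: 0, 2: 1}


def to_binary(label):
    """Return the binary label; labels absent from the mapping are unchanged."""
    return TRANSLATE.get(label, label)


# Class counts of the original three-class target, as printed in Section 2.
original_counts = Counter({0: 208389, 2: 33033, 1: 5654})

binary_counts = Counter()
for label, n in original_counts.items():
    binary_counts[to_binary(label)] += n

print(dict(binary_counts))
```

The merged class 0 holds 208389 + 5654 = 214043 rows, which matches the `value_counts()` output for `df_modified`.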

#### 3.3 Notes:  Hyperparameter Spaces

| Section| Classifier | hyperparameter space |
| -------| ---------- | -------------------- |
|4.1| **DecisionTreeClassifier:** | dt_param_distributions |
|4.2| **LogisticRegression:** | lr_param_distributions |
|4.3| **AdaBoostClassifier:** | ada_param_distributions |
|4.4| **GradientBoostingClassifier:** | gb_param_distributions |
|4.5| **RandomForestClassifier:** | rf_param_distributions |
|4.6| **ExtraTreesClassifier:** | et_param_distributions |
|4.7| **KNeighborsClassifier:** | knn_param_distributions |


In [33]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import ParameterSampler
from scipy.stats import uniform

In [34]:
# Define the hyperparameter search space for the Decision Tree Classifier:


dt_param_distributions = {
    'max_depth': [int(x) for x in uniform(1, 10).rvs(10)],
    'min_samples_split': [int(x) for x in uniform(2, 20).rvs(10)],
    'min_samples_leaf': [int(x) for x in uniform(1, 10).rvs(10)],
}

# Define the hyperparameter search space for the AdaBoostClassifier

ada_param_distributions = {
#    'algorithm': ['SAMME'],
    'n_estimators': [int(x) for x in uniform(10, 1000).rvs(10)],
    'learning_rate': [x for x in uniform(0.001, 1).rvs(10)],
    'estimator': [DecisionTreeClassifier(**dt_params) for dt_params in
                       ParameterSampler(dt_param_distributions, n_iter=10)]
}

# GradientBoostingClassifier:
gb_param_distributions = {
    'n_estimators': [int(x) for x in uniform(100, 1000).rvs(10)],
    'max_depth': [int(x) for x in uniform(2, 10).rvs(10)],
    'min_samples_split': [int(x) for x in uniform(2, 20).rvs(10)],
    'min_samples_leaf': [int(x) for x in uniform(1, 10).rvs(10)],
    'max_features': ['sqrt', 'log2', None],
    'learning_rate': [x for x in uniform(0.01, 0.5).rvs(10)],
    # uniform(loc, scale) samples from [loc, loc + scale]; subsample must stay within (0, 1]
    'subsample': [x for x in uniform(0.5, 0.5).rvs(10)],
}

# RandomForestClassifier
rf_param_distributions = {
    'n_estimators': [int(x) for x in uniform(100, 1000).rvs(10)],
    'max_depth': [int(x) for x in uniform(2, 10).rvs(10)],
    'min_samples_split': [int(x) for x in uniform(2, 20).rvs(10)],
    'min_samples_leaf': [int(x) for x in uniform(1, 10).rvs(10)],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy'],
}

# KNeighborsClassifier
knn_param_distributions = {
    'n_neighbors': [int(x) for x in uniform(1, 20).rvs(10)],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
    'leaf_size': [int(x) for x in uniform(10, 60).rvs(10)],
}

# LogisticRegression
lr_param_distributions = {
    'C': [x for x in uniform(0.001, 10).rvs(10)],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [int(x) for x in uniform(100, 1000).rvs(10)],
}

# ExtraTreesClassifier 
et_param_distributions = {
    'n_estimators': [int(x) for x in uniform(100, 1000).rvs(10)],
    'max_depth': [int(x) for x in uniform(2, 10).rvs(10)],
    'min_samples_split': [int(x) for x in uniform(2, 20).rvs(10)],
    'min_samples_leaf': [int(x) for x in uniform(1, 10).rvs(10)],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy'],
    'max_leaf_nodes': [int(x) for x in uniform(10, 100).rvs(10)],
    'min_impurity_decrease': [x for x in uniform(0, 0.1).rvs(10)],
}
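A point worth checking in the search spaces above is that `scipy.stats.uniform(loc, scale)` covers the interval `[loc, loc + scale]`, so its second argument is a width rather than an upper bound. The sketch below reads the bounds off the distribution with `ppf`; the helper name `bounds` is illustrative.

```python
from scipy.stats import uniform


def bounds(dist):
    """Return the lower and upper end of a frozen distribution's support."""
    return float(dist.ppf(0.0)), float(dist.ppf(1.0))


# The second argument is the width of the interval, not its upper end.
wide = bounds(uniform(0.5, 1))      # spans 0.5 .. 1.5
narrow = bounds(uniform(0.5, 0.5))  # spans 0.5 .. 1.0

print(wide, narrow)

# Truncating with int() after sampling, as done above for the integer parameters:
# uniform(1, 10) draws floats in [1, 11), so int(x) lands on 1..10.
depths = sorted({int(x) for x in uniform(1, 10).rvs(1000, random_state=0)})
print(depths[0], depths[-1])
```

For parameters with a hard upper limit, such as `subsample` in `GradientBoostingClassifier`, a call like `uniform(0.5, 1)` can therefore produce invalid values above 1.0.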

## 4. Optimization Analysis

In [35]:
# Note:  Modify the dataset as desired in Section 2.
# Currently the dataset uses the Base dataset for 2021 with the modifications listed in Section 3.2

In [36]:
# Dataset summary:
X_train, X_test, y_train, y_test = data
print(f"Dataset Lens, X_train:{len(X_train)}, y_train:{len(y_train)}, X_test:{len(X_test)}, y_test:{len(y_test)}")
print(f"y_train value_counts: {y_train.value_counts()}")
print(f"y_test value_counts: {y_test.value_counts()}")


Dataset Lens, X_train:49358, y_train:49358, X_test:61769, y_test:61769
y_train value_counts: diabetes
0.0    24679
1.0    24679
Name: count, dtype: int64
y_test value_counts: diabetes
0.0    53415
1.0     8354
Name: count, dtype: int64


In [37]:
dataset_label

'binary_standard_undersample'

#### 4.1 Perform Optimization:  DecisionTreeClassifier

- **model:** DecisionTreeClassifier 
- **Optimization:** RandomizedSearchCV
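As a rough picture of what a randomized search does, the sketch below runs scikit-learn's `RandomizedSearchCV` directly on synthetic data. It does not show `mlanlys.run_random_search()`, whose implementation is not part of this notebook. The data, `n_iter`, and seeds are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared (X_train, X_test, y_train, y_test) tuple.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same search space as the DecisionTreeClassifier cell in this section.
param_distributions = {
    'max_depth': [int(d) for d in np.linspace(10, 110, num=11)],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
}

# Sample 10 candidate settings and score each with 5-fold cross-validation.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring='accuracy',
    random_state=0,
)
search.fit(X_train, y_train)

# Compare an untuned tree with the refit best candidate on held-out data.
baseline = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
optimized_acc = search.score(X_test, y_test)

print(search.best_params_)
print(round(baseline_acc, 4), round(optimized_acc, 4))
```

The un-optimized and optimized rows in the reports below presumably correspond to a comparison of this kind, although the helper may compute them differently.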


In [38]:
# reload any changes to mlanlys
importlib.reload(mlanlys)
from sklearn.tree import DecisionTreeClassifier

# Model
model = DecisionTreeClassifier()

# Define the parameter grid
param_distributions = {
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)],  # Depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],    # Minimum number of samples required to be at a leaf node
    'max_features': ['sqrt', 'log2']  # Number of features to consider for the best split
}

optimize_report_params = {
    'param_distributions': param_distributions,
    'optimize_path': config.optimize_path,
    'dataset': dataset_label,
    'optimization_report': config.optimization_report,
    'scoring': 'accuracy',
    'verbose': 3,
    'print_results': False
    }

optimize_report = mlanlys.run_random_search(model, optimize_report_params, data)

print(f"Optimization Completed: Execution Time %s seconds:" % round((time.time() - start_time),2) )

optimize_report

[CV 4/5] END max_depth=40, max_features=log2, min_samples_leaf=1, min_samples_split=10;, score=0.680 total time=   0.1s
[CV 1/5] END max_depth=40, max_features=log2, min_samples_leaf=1, min_samples_split=10;, score=0.676 total time=   0.2s
[CV 2/5] END max_depth=40, max_features=log2, min_samples_leaf=1, min_samples_split=10;, score=0.672 total time=   0.1s
[CV 5/5] END max_depth=40, max_features=log2, min_samples_leaf=1, min_samples_split=10;, score=0.668 total time=   0.2s
[CV 3/5] END max_depth=40, max_features=log2, min_samples_leaf=1, min_samples_split=10;, score=0.668 total time=   0.1s
[CV 5/5] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=5;, score=0.717 total time=   0.1s
[CV 2/5] END max_depth=70, max_features=sqrt, min_samples_leaf=4, min_samples_split=2;, score=0.676 total time=   0.2s
[CV 3/5] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=5;, score=0.730 total time=   0.1s
[CV 1/5] END max_depth=10, max_features=log

| Unnamed: 0 | dataset | model | slice | score | balanced_accuracy | roc_auc_score | Mean Squared Error | Accuracy | Precision | Recall | F1-score | Specificity | False Positive Rate | Matthews Correlation Coefficient | Optimizer | best_parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | binary_standard_undersample | DecisionTreeClassifier | un-optimized | 0.6652 | 0.6605 | 0.6605 | 0.3348 | 0.6652 | 0.2349 | 0.6539 | 0.3456 | 0.667 | 0.333 | 0.2265 | RandomizedSearchCV | |
| 1 | binary_standard_undersample | DecisionTreeClassifier | optimized | 0.7016 | 0.7329 | 0.7974 | 0.2984 | 0.7016 | 0.2813 | 0.7758 | 0.4129 | 0.69 | 0.31 | 0.3294 | RandomizedSearchCV | {'min_samples_split': 10, 'min_samples_leaf': ... |
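Several columns in this report follow from recall, specificity, and the class sizes of the test set. The sketch below recomputes three of them for the optimized DecisionTreeClassifier row, using the rounded values reported above and the `y_test` counts from Section 2. The function names are illustrative, and small deviations from the table can come from rounding.

```python
def balanced_accuracy(recall, specificity):
    """Mean of the per-class recalls for a binary problem."""
    return (recall + specificity) / 2


def accuracy_from_rates(recall, specificity, n_pos, n_neg):
    """Overall accuracy: correct positives plus correct negatives over all rows."""
    true_pos = recall * n_pos
    true_neg = specificity * n_neg
    return (true_pos + true_neg) / (n_pos + n_neg)


def precision_from_rates(recall, specificity, n_pos, n_neg):
    """Share of predicted positives that are actually positive."""
    true_pos = recall * n_pos
    false_pos = (1 - specificity) * n_neg
    return true_pos / (true_pos + false_pos)


# Test-set class sizes and the optimized DecisionTreeClassifier rates from the report.
n_pos, n_neg = 8354, 53415
recall, specificity = 0.7758, 0.69

bal = balanced_accuracy(recall, specificity)
acc = accuracy_from_rates(recall, specificity, n_pos, n_neg)
prec = precision_from_rates(recall, specificity, n_pos, n_neg)

print(round(bal, 4), round(acc, 4), round(prec, 4))
```

Accuracy sits below balanced accuracy because negatives dominate the test set, so specificity carries most of the weight. Precision is low for the same reason: even a 31% false positive rate on 53415 negatives yields far more false positives than there are true positives.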


#### 4.2 Perform Optimization:  LogisticRegression

- **model:** LogisticRegression 
- **Optimization:** RandomizedSearchCV

In [39]:
# reload any changes to mlanlys
importlib.reload(mlanlys)
from sklearn.linear_model import LogisticRegression


# Model
model = LogisticRegression(max_iter=5000)

# Define the parameter grid
param_distributions = {
    'C': uniform(loc=0, scale=4),   # Regularization parameter
    'penalty': ['l1', 'l2'],        # Regularization type; of the solvers below, only 'saga' supports 'l1'
    'solver': ['newton-cg', 'lbfgs', 'sag', 'saga']  # All support L2; unsupported penalty/solver pairs fail and score nan
}


optimize_report_params = {
    'param_distributions': param_distributions,
    'optimize_path': config.optimize_path,
    'dataset': dataset_label,
    'optimization_report': config.optimization_report,
    'scoring': 'accuracy',
    'verbose': 3,
    'print_results': False
    }

optimize_report = mlanlys.run_random_search(model, optimize_report_params, data)

print(f"Optimization Completed: Execution Time %s seconds:" % round((time.time() - start_time),2) )

optimize_report

[CV 1/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 3/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 4/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 2/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 5/5] END C=1.49816047538945, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 1/5] END C=2.9279757672456204, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 2/5] END C=2.9279757672456204, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 3/5] END C=2.9279757672456204, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 5/5] END C=2.9279757672456204, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 4/5] END C=2.9279757672456204, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 1/5] END C=0.6240745617697461, penalty=l1, solver=sag;, score=nan total time=   0.0s
[

| Unnamed: 0 | dataset | model | slice | score | balanced_accuracy | roc_auc_score | Mean Squared Error | Accuracy | Precision | Recall | F1-score | Specificity | False Positive Rate | Matthews Correlation Coefficient | Optimizer | best_parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | binary_standard_undersample | LogisticRegression | un-optimized | 0.732 | 0.7532 | 0.8275 | 0.268 | 0.732 | 0.3072 | 0.7824 | 0.4412 | 0.7241 | 0.2759 | 0.3645 | RandomizedSearchCV | |
| 1 | binary_standard_undersample | LogisticRegression | optimized | 0.7319 | 0.7532 | 0.8275 | 0.2681 | 0.7319 | 0.3071 | 0.7824 | 0.4411 | 0.724 | 0.276 | 0.3644 | RandomizedSearchCV | {'C': 0.4783769837532068, 'penalty': 'l2', 'so... |
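The `score=nan` lines in the log above come from penalty/solver pairs that `LogisticRegression` rejects, such as `'l1'` with `'sag'` or `'newton-cg'`. One possible way to avoid wasting candidates on such pairs is to pass a list of dictionaries, which scikit-learn's `ParameterSampler` and `RandomizedSearchCV` both accept. The sketch below only samples candidates and fits nothing; the split into two dictionaries is an assumption, not the notebook's configuration.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# One dictionary per penalty, each listing only solvers that support it.
param_distributions = [
    {
        'penalty': ['l1'],
        'solver': ['saga'],
        'C': uniform(loc=0, scale=4),
    },
    {
        'penalty': ['l2'],
        'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
        'C': uniform(loc=0, scale=4),
    },
]

# Solvers in scikit-learn's LogisticRegression that accept an L1 penalty.
L1_SOLVERS = {'liblinear', 'saga'}

# Draw 20 candidates; each draw first picks one dictionary, then samples from it.
samples = list(ParameterSampler(param_distributions, n_iter=20, random_state=0))

# Collect any candidate that pairs 'l1' with a solver that cannot handle it.
invalid = [
    s for s in samples
    if s['penalty'] == 'l1' and s['solver'] not in L1_SOLVERS
]

print(len(samples), len(invalid))
```

The same list can be passed as `param_distributions` to `RandomizedSearchCV`, so that every sampled candidate can actually be fitted.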


#### 4.3 Perform Optimization: AdaBoostClassifier

- **model:** AdaBoostClassifier 
- **Optimization:** RandomizedSearchCV

In [41]:
# reload any changes to mlanlys
importlib.reload(mlanlys)
from sklearn.ensemble import AdaBoostClassifier
from scipy.stats import uniform

# Model
model = AdaBoostClassifier(random_state=1, algorithm='SAMME')

# Define the parameter grid
ada_param_distributions = {
    'n_estimators': [int(x) for x in uniform(10, 1000).rvs(10)],
    'learning_rate': [x for x in uniform(0.001, 1).rvs(10)],
    'estimator': [DecisionTreeClassifier(**dt_params) for dt_params in
                       ParameterSampler(dt_param_distributions, n_iter=10)]
}

ada_param_distributions2 = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0],
    'estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2)]
}
# Valid AdaBoostClassifier parameters: ['algorithm', 'estimator', 'learning_rate', 'n_estimators', 'random_state']


optimize_report_params = {
    'param_distributions': ada_param_distributions2,
    'optimize_path': config.optimize_path,
    'dataset': dataset_label,
    'optimization_report': config.optimization_report,
    'scoring': 'accuracy',
    'verbose': 3,
    'print_results': False
    }

optimize_report = mlanlys.run_random_search(model, optimize_report_params, data)

print(f"Optimization Completed: Execution Time %s seconds:" % round((time.time() - start_time),2) )

optimize_report

[CV 1/5] END estimator=DecisionTreeClassifier(max_depth=1), learning_rate=0.01, n_estimators=50;, score=0.694 total time=   6.4s
[CV 5/5] END estimator=DecisionTreeClassifier(max_depth=1), learning_rate=0.01, n_estimators=50;, score=0.689 total time=   6.3s
[CV 4/5] END estimator=DecisionTreeClassifier(max_depth=1), learning_rate=0.1, n_estimators=50;, score=0.737 total time=   6.2s
[CV 5/5] END estimator=DecisionTreeClassifier(max_depth=1), learning_rate=0.1, n_estimators=50;, score=0.728 total time=   6.5s
[CV 2/5] END estimator=DecisionTreeClassifier(max_depth=1), learning_rate=0.01, n_estimators=50;, score=0.691 total time=   7.1s
[CV 2/5] END estimator=DecisionTreeClassifier(max_depth=1), learning_rate=0.1, n_estimators=50;, score=0.730 total time=   7.2s
[CV 3/5] END estimator=DecisionTreeClassifier(max_depth=1), learning_rate=0.1, n_estimators=50;, score=0.731 total time=   7.2s
[CV 1/5] END estimator=DecisionTreeClassifier(max_depth=1), learning_rate=0.1, n_estimators=50;, scor

| Unnamed: 0 | dataset | model | slice | score | balanced_accuracy | roc_auc_score | Mean Squared Error | Accuracy | Precision | Recall | F1-score | Specificity | False Positive Rate | Matthews Correlation Coefficient | Optimizer | best_parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | binary_standard_undersample | AdaBoostClassifier | un-optimized | 0.7311 | 0.755 | 0.8304 | 0.2689 | 0.7311 | 0.3073 | 0.7878 | 0.4421 | 0.7223 | 0.2777 | 0.3665 | RandomizedSearchCV | |
| 1 | binary_standard_undersample | AdaBoostClassifier | optimized | 0.7404 | 0.7619 | 0.8374 | 0.2596 | 0.7404 | 0.3163 | 0.7914 | 0.452 | 0.7324 | 0.2676 | 0.3786 | RandomizedSearchCV | {'n_estimators': 200, 'learning_rate': 1.0, 'e... |


#### 4.4 Perform Optimization: GradientBoostingClassifier

- **model:** GradientBoostingClassifier 
- **Optimization:** RandomizedSearchCV

In [20]:
# GradientBoostingClassifier:
Optimize_start_time = time.time()

# Model
model = GradientBoostingClassifier(random_state=1)

gb_param_distributions = {
    'n_estimators': [int(x) for x in uniform(100, 1000).rvs(10)],
    'max_depth': [int(x) for x in uniform(2, 10).rvs(10)],
    'min_samples_split': [int(x) for x in uniform(2, 20).rvs(10)],
    'min_samples_leaf': [int(x) for x in uniform(1, 10).rvs(10)],
    'max_features': ['sqrt', 'log2', None],
    'learning_rate': [x for x in uniform(0.01, 0.5).rvs(10)],
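    # NOTE: scipy's uniform(loc, scale) samples from [loc, loc + scale], so uniform(0.5, 1)
    # can yield subsample > 1.0, which is invalid and scores nan; uniform(0.5, 0.5) stays within [0.5, 1.0]
    'subsample': [x for x in uniform(0.5, 1).rvs(10)]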
}

optimize_report_params = {
    'param_distributions': gb_param_distributions,    
    'optimize_path': config.optimize_path,
    'dataset': dataset_label,
    'optimization_report': config.optimization_report,
    'scoring': 'accuracy',
    'verbose': 3,
    'print_results': False
    }

optimize_report = mlanlys.run_random_search(model, optimize_report_params, data)

print(f"Optimization Completed: Execution Time %s seconds:" % round((time.time() - start_time),2) )

optimize_report

Fitting 5 folds for each of 100 candidates, totalling 500 fits
[CV 3/5] END learning_rate=0.31319921722362126, max_depth=11, max_features=sqrt, min_samples_leaf=6, min_samples_split=20, n_estimators=601, subsample=1.0205771576053602;, score=nan total time=   0.0s
[CV 2/5] END learning_rate=0.3518748455557092, max_depth=7, max_features=sqrt, min_samples_leaf=3, min_samples_split=5, n_estimators=829, subsample=1.4609444104406972;, score=nan total time=   0.0s
[CV 4/5] END learning_rate=0.3518748455557092, max_depth=7, max_features=sqrt, min_samples_leaf=3, min_samples_split=5, n_estimators=829, subsample=1.4609444104406972;, score=nan total time=   0.0s
[CV 5/5] END learning_rate=0.3518748455557092, max_depth=7, max_features=sqrt, min_samples_leaf=3, min_samples_split=5, n_estimators=829, subsample=1.4609444104406972;, score=nan total time=   0.0s
[CV 1/5] END learning_rate=0.3518748455557092, max_depth=7, max_features=sqrt, min_samples_leaf=3, min_samples_split=5, n_estimators=829, subs

| Unnamed: 0 | dataset | model | slice | score | balanced_accuracy | roc_auc_score | Mean Squared Error | Accuracy | Precision | Recall | F1-score | Specificity | False Positive Rate | Matthews Correlation Coefficient | Optimizer | best_parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | binary_standard_undersample | GradientBoostingClassifier | un-optimized | 0.7272 | 0.7602 | 0.8381 | 0.2728 | 0.7272 | 0.3042 | 0.8053 | 0.4416 | 0.7151 | 0.2849 | 0.3705 | RandomizedSearchCV | |
| 1 | binary_standard_undersample | GradientBoostingClassifier | optimized | 0.7231 | 0.7564 | 0.8342 | 0.2769 | 0.7231 | 0.3002 | 0.8018 | 0.4368 | 0.7109 | 0.2891 | 0.3643 | RandomizedSearchCV | {'subsample': 0.701619217698642, 'n_estimators... |


#### 4.5 Perform Optimization: RandomForestClassifier

- **model:** RandomForestClassifier 
- **Optimization:** RandomizedSearchCV

In [22]:
# RandomForestClassifier

# Model
model = RandomForestClassifier()

rf_param_distributions = {
    'n_estimators': [int(x) for x in uniform(100, 1000).rvs(10)],
    'max_depth': [int(x) for x in uniform(2, 10).rvs(10)],
    'min_samples_split': [int(x) for x in uniform(2, 20).rvs(10)],
    'min_samples_leaf': [int(x) for x in uniform(1, 10).rvs(10)],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy'],
}

optimize_report_params = {
    'param_distributions': rf_param_distributions,
    'optimize_path': config.optimize_path,
    'dataset': dataset_label,
    'optimization_report': config.optimization_report,
    'scoring': 'accuracy',
    'verbose': 0,
    'print_results': False
    }

optimize_report = mlanlys.run_random_search(model, optimize_report_params, data)

print(f"Optimization Completed: Execution Time %s seconds:" % round((time.time() - start_time),2) )

optimize_report


#### 4.6 Perform Optimization: ExtraTreesClassifier

- **model:** ExtraTreesClassifier 
- **Optimization:** RandomizedSearchCV

In [None]:
# ExtraTreesClassifier 

# Model
model = ExtraTreesClassifier(random_state=1)

et_param_distributions = {
    'n_estimators': [int(x) for x in uniform(100, 1000).rvs(10)],
    'max_depth': [int(x) for x in uniform(2, 10).rvs(10)],
    'min_samples_split': [int(x) for x in uniform(2, 20).rvs(10)],
    'min_samples_leaf': [int(x) for x in uniform(1, 10).rvs(10)],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy'],
    'max_leaf_nodes': [int(x) for x in uniform(10, 100).rvs(10)],
    'min_impurity_decrease': [x for x in uniform(0, 0.1).rvs(10)],
}

optimize_report_params = {
    'param_distributions': et_param_distributions,
    'optimize_path': config.optimize_path,
    'dataset': dataset_label,
    'optimization_report': config.optimization_report,
    'scoring': 'accuracy',
    'verbose': 0,
    'print_results': False
    }

optimize_report = mlanlys.run_random_search(model, optimize_report_params, data)

print(f"Optimization Completed: Execution Time %s seconds:" % round((time.time() - start_time),2) )

optimize_report

#### 4.7 Perform Optimization: KNeighborsClassifier

- **model:** KNeighborsClassifier 
- **Optimization:** RandomizedSearchCV

**Note:** A single KNN run took a long time, so KNN may need to be skipped during optimization.

In [None]:
# KNeighborsClassifier

# Model
k_value = 3
model = KNeighborsClassifier(n_neighbors=k_value)

knn_param_distributions = {
    'n_neighbors': [int(x) for x in uniform(1, 20).rvs(10)],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
    'leaf_size': [int(x) for x in uniform(10, 60).rvs(10)],
}

optimize_report_params = {
    'param_distributions': knn_param_distributions,
    'optimize_path': config.optimize_path,
    'dataset': dataset_label,
    'optimization_report': config.optimization_report,
    'scoring': 'accuracy',
    'verbose': 0,
    'print_results': False
    }

optimize_report = mlanlys.run_random_search(model, optimize_report_params, data)

print(f"Optimization Completed: Execution Time %s seconds:" % round((time.time() - start_time),2) )

optimize_report

---

In [None]:
print(f"Completed: Execution Time %s seconds:" % round((time.time() - start_time),2) )

---

## 5. Conclusions

For the DecisionTreeClassifier run in Section 4.1 (un-optimized vs. optimized):

- The confusion matrix shows more TP/TN values and fewer FP/FN values (recall rose from .65 to .78, specificity from .67 to .69)
- Classification accuracy increased from .67 to .70
- Balanced accuracy increased from .66 to .73
- ROC AUC score increased from .66 to .80

**Summary: Decision Tree Classifier performance improved**

---

