---

## Data Analysis

- This file differs from [2_data_analysis_1_base_data.ipynb](2_data_analysis_1_base_data.ipynb) in that it:
    - scales the base cleaned data created in [1_data_cleaning.ipynb](1_data_cleaning.ipynb).

Source dataset: 247076 rows × 37 columns
Processed and analyzed dataset: 247076 rows × 37 columns

---

In [1]:
# package imports go here
import pandas as pd
import numpy as np
import fastparquet as fp
import os
import sys
import pickle
import matplotlib.pyplot as plt
import importlib
import config

sys.path.insert(1, config.package_path)
import ml_analysis as mlanlys
import ml_clean_feature as mlclean

---

## 1. Read the cleaned dataset from file

---

In [2]:
# reload any changes to Config Settings
importlib.reload(config)

# BE SURE TO UPDATE THE LABEL FOR THIS ANALYSIS
# #############################
dataset_label = '2.0 StandardScaler Dataset'
# #############################

year                        = config.year

clean_file                  = config.clean_file

print(f"Year:                        {year}")
print(f"Clean File:                  {clean_file}")

Year:                        2015
Clean File:                  data/brfss_2015_clean.parquet.gzip


In [3]:
prepared_data_path = config.prepared_data_path
os.listdir(prepared_data_path)

['minmax_scaled.pkl',
 'sb_cluster_scaled.pkl',
 'sb_random_oversample_scaled.pkl',
 'sb_random_undersample_scaled.pkl',
 'sb_smoteenn_scaled.pkl',
 'sb_smote_scaled.pkl',
 'scaled.pkl',
 'standard_scaled.pkl']

In [4]:
# Read final cleaned dataset from parquet file
df = pd.read_parquet(clean_file, engine="fastparquet")

In [5]:
diabetes_labels = df.columns

In [6]:
df.shape

(253680, 22)

---

## 2. Prepare the dataset for analysis

- Split the dataset into features and labels.
- Split the dataset into training and testing sets.
- Scale the dataset.
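
The three steps above can be sketched with scikit-learn directly. This is a minimal illustration on a toy dataframe, not the code inside `mlanlys.modify_base_dataset()`. The `bmi` and `age` columns and the `random_state` value are assumptions made for the example. The scaler is fitted on the training split only, so no information from the test split leaks into it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned dataframe; 'bmi' and 'age' are hypothetical columns.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "bmi": rng.normal(28, 6, size=200),
    "age": rng.integers(1, 14, size=200).astype(float),
    "diabetes": rng.choice([0.0, 1.0, 2.0], size=200, p=[0.8, 0.05, 0.15]),
})

# 1. Split the dataset into features (X) and labels (y).
X = df_demo.drop(columns=["diabetes"])
y = df_demo["diabetes"]

# 2. Split into training and testing sets (scikit-learn holds out 25% by default).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 3. Fit the scaler on the training features, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The split sizes reported later in this notebook (190260 training rows and 63420 test rows out of 253680) are consistent with that 75/25 default.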

---

Options

    operation_dict = {  'target_column'    : 'diabetes',
                        'convert_to_binary':  True,
                        'scaler'           : 'standard', # options: none, standard, minmax
                        'random_sample'    : 'none'      # options: none, undersample, oversample, cluster, smote, smoteenn
                        }
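
Because the options are plain strings, a misspelled value is easy to miss. The helper below is a hypothetical sketch of a pre-flight check; `validate_operation_dict` is not part of `ml_analysis`, and the allowed values are copied from the option comments above.

```python
# Allowed values copied from the option comments; the helper itself is hypothetical.
ALLOWED = {
    "scaler": {"none", "standard", "minmax"},
    "random_sample": {"none", "undersample", "oversample",
                      "cluster", "smote", "smoteenn"},
}
REQUIRED = {"target_column", "convert_to_binary", "scaler", "random_sample"}


def validate_operation_dict(operation_dict):
    """Raise ValueError if a key is missing or an option value is not recognised."""
    missing = REQUIRED - set(operation_dict)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(operation_dict["convert_to_binary"], bool):
        raise ValueError("convert_to_binary must be True or False")
    for key, allowed in ALLOWED.items():
        if operation_dict[key] not in allowed:
            raise ValueError(
                f"{key}={operation_dict[key]!r} is not one of {sorted(allowed)}"
            )
    return operation_dict


checked = validate_operation_dict({
    "target_column": "diabetes",
    "convert_to_binary": True,
    "scaler": "standard",
    "random_sample": "none",
})
```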

In [7]:
from sklearn.datasets import make_regression, make_swiss_roll
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [8]:
target = 'diabetes'

#### 2.1 StandardScaler 
- Standard Scaler

In [9]:
# Dictionary defining modification to be made to the base dataset
prepared_file = config.prepared_data_standard

operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  False,
                    'scaler'            : 'standard', # options: none, standard, minmax
                    'random_sample'     : 'none'      # options: none, undersample, oversample, cluster, smote, smoteenn
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

X_Trn, X_Tst, y_Trn, y_Tst = data
print(f"\nBefore write: lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")

# Save Prepared Data to a file
with open(prepared_file, 'wb') as file: pickle.dump(data, file)

with open(prepared_file, 'rb') as file: data_prepared = pickle.load(file)

X_Trn, X_Tst, y_Trn, y_Tst = data_prepared
print(f"After write:  lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")
print(f"\ny_train.value_counts {y_Trn.value_counts()}")
print(f"\ny_test.value_counts {y_Tst.value_counts()}")

# Print some statistics about the original df and the modified dataframe
print(f"\nOriginal Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  False
**Operation:scaler  standard
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
  -- Performing StandardScaler on X_train: Updates X_train, y_test
**Operation:random_sample  none

Dataframe, Train Test Summary
-----------------------------
Dataframe: (253680, 22)  Data:4, X_train:190260, y_train:190260, X_test:63420, y_test:63420
ValueCounts:   y_train: len:2   0: 160371   1:  3457
ValueCounts:   y_test : len:2   0:  53332   1:  1174

Before write: lengths:  X_Trn: 190260  X_Tst: 63420  y_Trn: 190260 y_Tst: 63420
After write:  lengths:  X_Trn: 190260  X_Tst: 63420  y_Trn: 190260 y_Tst: 63420

y_train.value_counts diabetes
0.0    160371
2.0     26432
1.0      3457
Name: count, dtype: int64

y_test.value_counts diabetes
0.0    53332
2.0     8914
1.0     117

#### 2.2 MinMaxScaler
- MinMax Scaler
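
For reference, min-max scaling maps each feature to `(x - min) / (max - min)`, with the minimum and maximum taken from the training split. The sketch below uses scikit-learn's `MinMaxScaler` on made-up numbers and checks the result against the formula; it is an illustration, not the code inside `modify_base_dataset()`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature values for illustration.
X_train = np.array([[10.0, 1.0], [20.0, 3.0], [40.0, 5.0]])
X_test = np.array([[25.0, 2.0], [50.0, 6.0]])

# Fit on the training split only, then transform both splits.
scaler = MinMaxScaler().fit(X_train)
X_train_mm = scaler.transform(X_train)
X_test_mm = scaler.transform(X_test)

# The same training result by hand, using the training minimum and maximum.
col_min, col_max = X_train.min(axis=0), X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)

print(np.allclose(X_train_mm, manual))
print(X_test_mm)
```

Training values land in the range 0 to 1, but test values can fall outside it when they exceed the range seen during fitting, as the second test row does here.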

In [10]:
# Dictionary defining modification to be made to the base dataset
prepared_file = config.prepared_data_minmax

operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  False,
                    'scaler'            : 'minmax',   # options: none, standard, minmax
                    'random_sample'     : 'none'      # options: none, undersample, oversample, cluster, smote, smoteenn
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

X_Trn, X_Tst, y_Trn, y_Tst = data
print(f"\nBefore write: lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")

# Save Prepared Data to a file
with open(prepared_file, 'wb') as file: pickle.dump(data, file)

with open(prepared_file, 'rb') as file: data_prepared = pickle.load(file)

X_Trn, X_Tst, y_Trn, y_Tst = data_prepared
print(f"After write:  lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")
print(f"\ny_train.value_counts {y_Trn.value_counts()}")
print(f"\ny_test.value_counts {y_Tst.value_counts()}")

# Print some statistics about the original df and the modified dataframe
print(f"\nOriginal Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  False
**Operation:scaler  minmax
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
  -- Performing MinMaxScaler on X_train: Updates X_train, y_test
**Operation:random_sample  none

Dataframe, Train Test Summary
-----------------------------
Dataframe: (253680, 22)  Data:4, X_train:190260, y_train:190260, X_test:63420, y_test:63420
ValueCounts:   y_train: len:2   0: 160403   1:  3476
ValueCounts:   y_test : len:2   0:  53300   1:  1155

Before write: lengths:  X_Trn: 190260  X_Tst: 63420  y_Trn: 190260 y_Tst: 63420
After write:  lengths:  X_Trn: 190260  X_Tst: 63420  y_Trn: 190260 y_Tst: 63420

y_train.value_counts diabetes
0.0    160403
2.0     26381
1.0      3476
Name: count, dtype: int64

y_test.value_counts diabetes
0.0    53300
2.0     8965
1.0     1155
Na

#### 2.3 Binary
- Convert target from 0/1/2 (No Diabetes/Pre-Diabetes/Diabetes) to 0/1 values  (No Diabetes/Diabetes)
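
The cleaning log for this step reports the translation `{1: 0, 2: 1}`, which means pre-diabetes is merged into the "No Diabetes" class. A minimal pandas sketch of that mapping, on a made-up series, might look like this; the explicit `0.0` entry is added here so that every value has a mapping.

```python
import pandas as pd

# Made-up target values: 0 = no diabetes, 1 = pre-diabetes, 2 = diabetes.
y = pd.Series([0.0, 1.0, 2.0, 0.0, 2.0], name="diabetes")

# Pre-diabetes (1) joins class 0; diabetes (2) becomes class 1.
y_binary = y.map({0.0: 0, 1.0: 0, 2.0: 1})

print(y_binary.value_counts())
```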

In [17]:
# Dictionary defining modification to be made to the base dataset
prepared_file = config.prepared_data_binary

operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  True,
                    'scaler'            : 'none',     # options: none, standard, minmax
                    'random_sample'     : 'none'      # options: none, undersample, oversample, cluster, smote, smoteenn
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

X_Trn, X_Tst, y_Trn, y_Tst = data
print(f"\nBefore write: lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")
print(f"Types: X_Trn: {type(X_Trn)}  X_Tst: {type(X_Tst)}  y_Trn: {type(y_Trn)}  y_Tst: {type(y_Tst)}")
# Save Prepared Data to a file
with open(prepared_file, 'wb') as file: pickle.dump(data, file)

with open(prepared_file, 'rb') as file: data_prepared = pickle.load(file)

X_Trn, X_Tst, y_Trn, y_Tst = data_prepared
print(f"After write:  lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")
print(f"\ny_train.value_counts {y_Trn.value_counts()}")
print(f"\ny_test.value_counts {y_Tst.value_counts()}")

# Print some statistics about the original df and the modified dataframe
print(f"\nOriginal Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  True
  -- Converting dataset to binary (0,1) from (0,1,2)


****Cleaning Feature: diabetes
  Initial Unique features in [diabetes]:  [0. 1. 2.]
  values_to_drop: ********* NO Parameters were specified *********
  translate: {1: 0, 2: 1}
  scale: ********* NO Parameters were specified *********
  FINAL Unique features in [diabetes]:  [0. 1.]
**Operation:scaler  none
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
**Operation:random_sample  none

Dataframe, Train Test Summary
-----------------------------
Dataframe: (253680, 22)  Data:4, X_train:190260, y_train:190260, X_test:63420, y_test:63420
ValueCounts:   y_train: len:2   0: 163556   1: 26704
ValueCounts:   y_test : len:2   0:  54778   1:  8642

Before write: lengths:  X_Trn: 190260  X_Tst: 63420  y_Trn: 

#### 2.4 RandomOverSampler
- Standard Scaler
- Binary
- RandomOverSampler sampling method
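
Random oversampling balances the classes by duplicating randomly chosen minority rows until every class is as large as the biggest one. The function below is a from-scratch NumPy illustration of that idea, not the imbalanced-learn `RandomOverSampler` class named in the log; `random_oversample` and its `seed` parameter are hypothetical.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate random rows of each smaller class until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the shortfall with replacement; the majority class draws zero rows.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]


# Made-up data: 8 rows of class 0 and 2 rows of class 1.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))
```

Only the training split should be resampled; the test split keeps its original class balance, which matches the unchanged `X_test` length in the summary below.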

In [12]:
# Dictionary defining modification to be made to the base dataset
prepared_file = config.prepared_data_sb_random_oversample

operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  True,
                    'scaler'            : 'standard', # options: none, standard, minmax
                    'random_sample'     : 'oversample'      # options: none, undersample, oversample, cluster, smote, smoteenn
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

X_Trn, X_Tst, y_Trn, y_Tst = data
print(f"\nBefore write: lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")

# Save Prepared Data to a file
with open(prepared_file, 'wb') as file: pickle.dump(data, file)

with open(prepared_file, 'rb') as file: data_prepared = pickle.load(file)

X_Trn, X_Tst, y_Trn, y_Tst = data_prepared
print(f"After write:  lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")
print(f"\ny_train.value_counts {y_Trn.value_counts()}")
print(f"\ny_test.value_counts {y_Tst.value_counts()}")

# Print some statistics about the original df and the modified dataframe
print(f"\nOriginal Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  True
  -- Converting dataset to binary (0,1) from (0,1,2)


****Cleaning Feature: diabetes
  Initial Unique features in [diabetes]:  [0. 1. 2.]
  values_to_drop: ********* NO Parameters were specified *********
  translate: {1: 0, 2: 1}
  scale: ********* NO Parameters were specified *********
  FINAL Unique features in [diabetes]:  [0. 1.]
**Operation:scaler  standard
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
  -- Performing StandardScaler on X_train: Updates X_train, y_test
**Operation:random_sample  oversample
  -- Performing RandomOverSampler on X_train, y_train: Updates X_train, y_train

Dataframe, Train Test Summary
-----------------------------
Dataframe: (253680, 22)  Data:4, X_train:327292, y_train:327292, X_test:63420, y_test:63420
ValueCount

#### 2.5 RandomUnderSampler
- Standard Scaler
- Binary
- RandomUnderSampler sampling method
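
Random undersampling goes the other way: it keeps a random subset of each larger class so that every class shrinks to the size of the smallest one. The function below is a from-scratch NumPy illustration, not the imbalanced-learn `RandomUnderSampler` class named in the log; `random_undersample` and its `seed` parameter are hypothetical.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    # Sample without replacement so no row appears twice.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=target, replace=False)
        for cls in classes
    ])
    return X[idx], y[idx]


# Made-up data: 8 rows of class 0 and 2 rows of class 1.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_res, y_res = random_undersample(X, y)
print(np.bincount(y_res))
```

The trade-off is that majority rows are discarded, so the training set becomes much smaller than the original split.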

In [13]:
# Dictionary defining modification to be made to the base dataset
prepared_file = config.prepared_data_sb_random_undersample

operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  True,
                    'scaler'            : 'standard', # options: none, standard, minmax
                    'random_sample'     : 'undersample'      # options: none, undersample, oversample, cluster, smote, smoteenn
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

X_Trn, X_Tst, y_Trn, y_Tst = data
print(f"\nBefore write: lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")

# Save Prepared Data to a file
with open(prepared_file, 'wb') as file: pickle.dump(data, file)

with open(prepared_file, 'rb') as file: data_prepared = pickle.load(file)

X_Trn, X_Tst, y_Trn, y_Tst = data_prepared
print(f"After write:  lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")
print(f"\ny_train.value_counts {y_Trn.value_counts()}")
print(f"\ny_test.value_counts {y_Tst.value_counts()}")

# Print some statistics about the original df and the modified dataframe
print(f"\nOriginal Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  True
  -- Converting dataset to binary (0,1) from (0,1,2)


****Cleaning Feature: diabetes
  Initial Unique features in [diabetes]:  [0. 1. 2.]
  values_to_drop: ********* NO Parameters were specified *********
  translate: {1: 0, 2: 1}
  scale: ********* NO Parameters were specified *********
  FINAL Unique features in [diabetes]:  [0. 1.]
**Operation:scaler  standard
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
  -- Performing StandardScaler on X_train: Updates X_train, y_test
**Operation:random_sample  undersample
  -- Performing RandomUnderSampler on X_train, y_train: Updates X_train, y_train

Dataframe, Train Test Summary
-----------------------------
Dataframe: (253680, 22)  Data:4, X_train:52950, y_train:52950, X_test:63420, y_test:63420
ValueCount

#### 2.6 ClusterCentroids
- Standard Scaler
- Binary
- ClusterCentroids sampling method
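
Cluster-centroid undersampling shrinks the majority class by clustering it and keeping one representative point per cluster. The sketch below illustrates the idea for a binary target using scikit-learn's `KMeans`, with one cluster per minority row; it is an assumption-laden illustration rather than the imbalanced-learn `ClusterCentroids` implementation, and `cluster_centroid_undersample` is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_centroid_undersample(X, y, majority=0, seed=0):
    """Replace the majority class with as many k-means centres as there are minority rows."""
    is_major = y == majority
    X_major, X_minor, y_minor = X[is_major], X[~is_major], y[~is_major]
    kmeans = KMeans(n_clusters=len(X_minor), n_init=10, random_state=seed)
    kmeans.fit(X_major)
    # The centres are synthetic points, not rows from the original data.
    X_res = np.vstack([kmeans.cluster_centers_, X_minor])
    y_res = np.concatenate([np.full(len(X_minor), majority), y_minor])
    return X_res, y_res


# Made-up data: 40 majority rows around the origin, 5 minority rows shifted away.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(40, 2)), rng.normal(4, 1, size=(5, 2))])
y = np.array([0] * 40 + [1] * 5)

X_res, y_res = cluster_centroid_undersample(X, y)
print(np.bincount(y_res))
```

Clustering a large majority class is far more expensive than random undersampling, which is worth keeping in mind when running this step on the full training split.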

In [14]:
# Dictionary defining modification to be made to the base dataset
prepared_file = config.prepared_data_sb_cluster

operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  True,
                    'scaler'            : 'standard', # options: none, standard, minmax
                    'random_sample'     : 'cluster'      # options: none, undersample, oversample, cluster, smote, smoteenn
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

X_Trn, X_Tst, y_Trn, y_Tst = data
print(f"\nBefore write: lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")

# Save Prepared Data to a file
with open(prepared_file, 'wb') as file: pickle.dump(data, file)

with open(prepared_file, 'rb') as file: data_prepared = pickle.load(file)

X_Trn, X_Tst, y_Trn, y_Tst = data_prepared
print(f"After write:  lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")
print(f"\ny_train.value_counts {y_Trn.value_counts()}")
print(f"\ny_test.value_counts {y_Tst.value_counts()}")

# Print some statistics about the original df and the modified dataframe
print(f"\nOriginal Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  True
  -- Converting dataset to binary (0,1) from (0,1,2)


****Cleaning Feature: diabetes
  Initial Unique features in [diabetes]:  [0. 1. 2.]
  values_to_drop: ********* NO Parameters were specified *********
  translate: {1: 0, 2: 1}
  scale: ********* NO Parameters were specified *********
  FINAL Unique features in [diabetes]:  [0. 1.]
**Operation:scaler  standard
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
  -- Performing StandardScaler on X_train: Updates X_train, y_test
**Operation:random_sample  cluster
  -- Performing ClusterCentroids on X_train, y_train: Updates X_train, y_train

Dataframe, Train Test Summary
-----------------------------
Dataframe: (253680, 22)  Data:4, X_train:52968, y_train:52968, X_test:63420, y_test:63420
ValueCounts:   y

#### 2.7 SMOTE
- Standard Scaler
- Binary
- SMOTE sampling method
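
SMOTE creates new minority rows instead of duplicating existing ones: each synthetic row lies on the line segment between a minority row and one of its nearest minority neighbours. The function below is a simplified from-scratch sketch of that interpolation step, not the imbalanced-learn `SMOTE` class; `smote_like`, `n_new`, `k`, and `seed` are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(X_minor, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_minor) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minor)
    _, neigh = nn.kneighbors(X_minor)  # for distinct rows, column 0 is the row itself
    base = rng.integers(0, len(X_minor), size=n_new)           # starting rows
    pick = neigh[base, rng.integers(1, k + 1, size=n_new)]     # one neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_minor[base] + gap * (X_minor[pick] - X_minor[base])


# Made-up minority rows.
X_minor = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]])

synthetic = smote_like(X_minor, n_new=10, k=3)
print(synthetic.shape)
```

Because the new rows are interpolated, SMOTE is usually applied after scaling, which matches the order shown in the log below (StandardScaler first, then SMOTE).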

In [15]:
# Dictionary defining modification to be made to the base dataset
prepared_file = config.prepared_data_sb_smote

operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  True,
                    'scaler'            : 'standard', # options: none, standard, minmax
                    'random_sample'     : 'smote'      # options: none, undersample, oversample, cluster, smote, smoteenn
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

X_Trn, X_Tst, y_Trn, y_Tst = data
print(f"\nBefore write: lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")

# Save Prepared Data to a file
with open(prepared_file, 'wb') as file: pickle.dump(data, file)

with open(prepared_file, 'rb') as file: data_prepared = pickle.load(file)

X_Trn, X_Tst, y_Trn, y_Tst = data_prepared
print(f"After write:  lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")
print(f"\ny_train.value_counts {y_Trn.value_counts()}")
print(f"\ny_test.value_counts {y_Tst.value_counts()}")

# Print some statistics about the original df and the modified dataframe
print(f"\nOriginal Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  True
  -- Converting dataset to binary (0,1) from (0,1,2)


****Cleaning Feature: diabetes
  Initial Unique features in [diabetes]:  [0. 1. 2.]
  values_to_drop: ********* NO Parameters were specified *********
  translate: {1: 0, 2: 1}
  scale: ********* NO Parameters were specified *********
  FINAL Unique features in [diabetes]:  [0. 1.]
**Operation:scaler  standard
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
  -- Performing StandardScaler on X_train: Updates X_train, y_test
**Operation:random_sample  smote
  -- Performing SMOTE on X_train, y_train: Updates X_train, y_train

Dataframe, Train Test Summary
-----------------------------
Dataframe: (253680, 22)  Data:4, X_train:327622, y_train:327622, X_test:63420, y_test:63420
ValueCounts:   y_train: len

#### 2.8 SMOTEENN
- Standard Scaler
- Binary
- SMOTEENN sampling method
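
SMOTEENN combines SMOTE oversampling with a cleaning pass based on Edited Nearest Neighbours (ENN), which drops rows whose neighbours disagree with their label. The sketch below illustrates only the cleaning pass, using the rule "keep a row only if all of its `k` nearest neighbours share its label". That rule and the function name `edited_nearest_neighbours` are assumptions for this example, not a reproduction of the imbalanced-learn implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def edited_nearest_neighbours(X, y, k=3):
    """Keep only rows whose k nearest neighbours all share their label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neigh = nn.kneighbors(X)
    neighbour_labels = y[neigh[:, 1:]]  # drop column 0, the row itself (rows are distinct)
    keep = (neighbour_labels == y[:, None]).all(axis=1)
    return X[keep], y[keep]


# Made-up data: two well-separated 5x5 grids plus one class-1 row inside the class-0 grid.
gx, gy = np.meshgrid(np.arange(5.0), np.arange(5.0))
grid = np.column_stack([gx.ravel(), gy.ravel()])
X = np.vstack([grid, grid + [10.0, 0.0], [[2.5, 2.5]]])
y = np.array([0] * 25 + [1] * 25 + [1])

X_clean, y_clean = edited_nearest_neighbours(X, y)
print(len(y), "->", len(y_clean))
```

The cleaning pass removes the out-of-place row together with the class-0 rows right next to it, so the result is smaller than the oversampled input. That is consistent with the training length in the summary below being lower than the one reported for plain SMOTE, although the two runs use different random splits.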

In [16]:
# Dictionary defining modification to be made to the base dataset
prepared_file = config.prepared_data_sb_smoteenn

operation_dict = {  'target_column'     :  target,
                    'convert_to_binary' :  True,
                    'scaler'            : 'standard', # options: none, standard, minmax
                    'random_sample'     : 'smoteenn'      # options: none, undersample, oversample, cluster, smote, smoteenn
                    }

# This ensures that df is not modified during the call to modify_base_dataset()
df_modified = df.copy()

# Modify the base dataset
# data is returned where: X_train, X_test, y_train, y_test = data
data = mlanlys.modify_base_dataset(df_modified, operation_dict)

X_Trn, X_Tst, y_Trn, y_Tst = data
print(f"\nBefore write: lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")

# Save Prepared Data to a file
with open(prepared_file, 'wb') as file: pickle.dump(data, file)

with open(prepared_file, 'rb') as file: data_prepared = pickle.load(file)

X_Trn, X_Tst, y_Trn, y_Tst = data_prepared
print(f"After write:  lengths:  X_Trn: {len(X_Trn)}  X_Tst: {len(X_Tst)}  y_Trn: {len(y_Trn)} y_Tst: {len(y_Tst)}")
print(f"\ny_train.value_counts {y_Trn.value_counts()}")
print(f"\ny_test.value_counts {y_Tst.value_counts()}")

# Print some statistics about the original df and the modified dataframe
print(f"\nOriginal Dataframe")
print(f"------------------")
print(f"df.shape: {df.shape}")
print(f"df[{target}].value_counts:  {df[target].value_counts()}")

Base Dataset Modifications in Process
-------------------------------------
**Operation:target_column  diabetes
**Operation:convert_to_binary  True
  -- Converting dataset to binary (0,1) from (0,1,2)


****Cleaning Feature: diabetes
  Initial Unique features in [diabetes]:  [0. 1. 2.]
  values_to_drop: ********* NO Parameters were specified *********
  translate: {1: 0, 2: 1}
  scale: ********* NO Parameters were specified *********
  FINAL Unique features in [diabetes]:  [0. 1.]
**Operation:scaler  standard
  -- Performing train_test_split on dataframe with target:'diabetes'
     -- Run automatically before scalar or random_sample operations
  -- Performing StandardScaler on X_train: Updates X_train, y_test
**Operation:random_sample  smoteenn
  -- Performing SMOTEENN on X_train, y_train: Updates X_train, y_train

Dataframe, Train Test Summary
-----------------------------
Dataframe: (253680, 22)  Data:4, X_train:254380, y_train:254380, X_test:63420, y_test:63420
ValueCounts:   y_trai