# Deep Learning Predictions on Multiclass Obesity Risk dataset

<HR>

[<b>Multi-Class Prediction of Obesity Risk</b>](https://www.kaggle.com/competitions/playground-series-s4e2) dataset.

<hr>

<div class="alert alert-block alert-info"><p style ="font-size:1.3em">
<h4>Accuracy should improve if dataset EDA and clean-up adhere to the guidelines for training <u>Deep Neural Network models</u> (I am not yet sure what those guidelines really are)</h4>
</div>
<p>

**Note:** I tried multiple different ways to clean the dataset to get it into a shape that forms clusters (spatially) and improves predictions. The best I could get so far was 88.945. I put comments in the appropriate places in the notebook describing the EDA methods I tried. <br />

**If you try out this notebook, I would appreciate it if you let me know what you did to improve accuracy**
</p>


<div class="alert alert-block alert-info">

<b>Used Keras_Tuner to search for the best hyperparameter values - the search covered `25` parameters</b><br />
Since prediction is on a non-image dataset and we only use fully connected Dense layers, the model has only a few layers. This is a Functional Neural Network model:<br />

$$Input \rightarrow Dense \rightarrow Dropout \rightarrow BatchNormalization \rightarrow Dense \rightarrow Dropout \rightarrow Dense(output,\ softmax\ activation)$$

[Notebook on Kaggle](https://www.kaggle.com/code/jayyanamandala/keras-tuner-hyperparameters-search-obesiry-risk)
<ul>
    <li>batch_size - also referenced in the 'HyperTuningNetwork' class, under a different name</li>
    <li>number of epochs in run</li>
    <li>Number of Neurons in first fully connected Dense layer</li>
    <li>Number of Neurons in second fully connected Dense layer</li>
    <li>drop_rates - for two Dropout layers</li>
    <li>kernel_regularizers(4), bias_regularizers(4), activity_regularizers(4)</li>
    <li>layer activation - relu, tanh, sigmoid, ...</li>
    <li>model optimizer - adam, sgd, ...</li>
    <li>learning_rate - learning_rate for Model optimizer</li>
    <li>decay_steps - learning rate decay steps</li>
    <li>decay_rate - learning_rate decay</li>
  </ul>

</div>

<div class="alert alert-block alert-info"><h1>import packages</h1></div>

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
%autosave 60
from datetime import datetime

pd.set_option('display.max_columns', None)

In [None]:
import os
import shutil
import sys
from glob import glob
import re
import math
import random as py_random   # to differentiate between NumPy and Python - in case random is set to np.random

In [None]:
# preprocessing and model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer

# metrics and utils
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils import compute_class_weight, compute_sample_weight
from sklearn.pipeline import Pipeline

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from sklearn import set_config
set_config(display="diagram")

import scipy

In [None]:
import tensorflow as tf
import tensorflow.keras as keras
from keras.utils import to_categorical

In [None]:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.set_visible_devices(gpus[1:],'GPU')
        tf.config.experimental.set_memory_growth(gpus[0], True)
        print('setting session for memory growth')
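    except (ValueError, RuntimeError) as err:
        # devices cannot be reconfigured once the TensorFlow runtime is initialized
        print(err)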

In [None]:
# Notebook uses PowerTransformer scaling - version of scikit-learn package must be higher than 1.2.2
# (the requirement is quoted so that the shell does not treat '>' as a redirect)
# !pip install "scikit-learn>1.2.2" --upgrade
import sklearn as sk
print(sk.__version__)

In [None]:
def reset_seeds():
   np.random.seed(42)
   py_random.seed(42)
   tf.random.set_seed(42)

# set a beginning for consistency
reset_seeds()

<div class="alert alert-block alert-info"><h1>load datasets</h1></div>

In [None]:
# download dataset
# If you are on Kaggle go to competition page and create a notebook
# -OR- if Kaggle is setup at home, please download dataset
# !kaggle competitions download -c playground-series-s4e2

In [None]:
train_df = pd.read_csv('/kaggle/input/playground-series-s4e2/train.csv')
test_df = pd.read_csv('/kaggle/input/playground-series-s4e2/test.csv')
train_cat_cols = train_df.select_dtypes(include=['object', 'category'])
test_cat_cols = test_df.select_dtypes(include=['object', 'category'])
train_num_cols = train_df.select_dtypes(exclude=['object', 'category'])
test_num_cols = test_df.select_dtypes(exclude=['object', 'category'])

# Drop 'id' column from Train and Test
train_num_cols.drop(['id'], inplace=True, axis=1)
test_num_cols.drop(['id'], inplace=True, axis=1)

<div class="alert alert-block alert-info"><h1>Initial Data Exploration - and cleanup</h1></div>

In [None]:
# print columns to check
pd.DataFrame(data=[train_df.columns, test_df.columns]).T.rename({0:'Train', 1:'Test'}, axis=1)

<div class="alert alert-block alert-info"><h4>Get list of category columns from train and test datasets and check if the unique values match</h4>
 <b>do not check the 'ground truth' column</b>

</div>

In [None]:
# check unique values in Category columns - must be equal
def check_columns_exist(df1, df2, check_equal=True, traindf=train_df, testdf=test_df):
  for i in df1.columns:
    if i in df2.columns:
      try:
        train_vals = np.sort(np.unique(traindf[i]))
        test_vals = np.sort(np.unique(testdf[i]))
        if not np.array_equal(train_vals, test_vals):
          print('\n#####Column:', i, 'elements are not equal ######')
          print('Train:', train_vals,
                '\nTest:', test_vals, end='\n\n')
      except Exception:
        # skip columns whose values cannot be compared
        pass
    else:
      print('\nColumn:', i, 'does not exist in testdf', end='\n\n')

In [None]:
# check category columns in Train and Test and also check if categorical elements in each Category are same
check_columns_exist(train_cat_cols, test_cat_cols)

In [None]:
# get totals and percentages for each category and kind
def get_percent(col, kk, df):
  # for kk in np.sort(np.unique(df[col])):
  total = len(df[df[col] == kk])
  val = (total/len(df)) * 100
  val = f'{val:.2f}'
  val=float(val)
  return total, val


In [None]:
train_totals=[]
train_values=[]
train_categories = []
train_columns = []

# check unique values in Category columns - must be equal
for i in train_cat_cols.columns:
  if i in test_cat_cols.columns:
    kk = np.sort(np.unique(train_df[i]))
    for k in kk:
      # print(i, kk, k)
      train_columns.append(i)
      tr_total, tr_val = get_percent(i, k, train_df)
      train_categories.append(k)
      train_totals.append(tr_total)
      train_values.append(tr_val)
  

In [None]:
test_categories = []
test_totals=[]
test_values=[]

# check unique values in Category columns - must be equal
for i in train_cat_cols.columns:
  if i in test_cat_cols.columns:
    kk = np.sort(np.unique(test_df[i]))
    for k in kk:
      # print(i, kk, k)
      tr_total, tr_val = get_percent(i, k, test_df)
      test_categories.append(k)
      test_totals.append(tr_total)
      test_values.append(tr_val)
  

In [None]:
# Create MultiIndex DataFrame
arrays_col_cats = [np.array(train_columns), np.array(train_categories)]
arrays = list(zip(train_totals, train_values, test_totals, test_values))
df = pd.DataFrame(arrays, columns=['Train-Totals', 'Train-Values', 'Test-Totals', 'Test-Values'])
df.set_index(arrays_col_cats, inplace=True)

diff1 = abs(df['Train-Values'] - df['Test-Values'])
sum1 = abs(df['Train-Values'] + df['Test-Values'])
tr_std = np.std(df['Train-Values'].astype(np.float32))
te_std = np.std(df['Test-Values'].astype(np.float32))

df['Diff %'] = round(diff1,2).astype(str) + '%'
df['Train-Values'] = df['Train-Values'].astype(str) + '%'
df['Test-Values'] = df['Test-Values'].astype(str) + '%'
df['Diff/Sum'] = round(np.divide(diff1,sum1) * 100,2)

df

<h4>From the above stats we can see that the 'Train' and 'Test' datasets are distributed approximately equally across the individual categorical features, except for an extra CALC category ('Always') in the Test dataset</h4>

From the above table, based on the Diff/Sum column, we make the following changes:

1. Bike + Motorbike                   - Two_Wheelers
2. Public_Transportation + Walking    - Non_Motors      (it looks like the dataset is based on entirely different demographics)
3. Combine Test CALC 'Always' with 'Frequently' and drop 'Always'
4. Create a BMI column and delete Height and Weight from both datasets
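
A train/test comparison of this kind can also be assembled with `value_counts(normalize=True)`, which keeps the rows aligned even when a category occurs in only one of the two datasets. This is a minimal sketch on made-up data; the helper name `category_shares` and the toy frames are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def category_shares(train, test, col):
    """Percentage share of each category of `col` in train and test, side by side."""
    tr = train[col].value_counts(normalize=True).mul(100).rename('Train %')
    te = test[col].value_counts(normalize=True).mul(100).rename('Test %')
    # outer join on the category labels; a category absent from one side becomes 0
    out = pd.concat([tr, te], axis=1).fillna(0.0).sort_index()
    out['Diff %'] = (out['Train %'] - out['Test %']).abs()
    return out.round(2)

# toy data (hypothetical): 'Always' appears only in the test frame
toy_train = pd.DataFrame({'CALC': ['no', 'Sometimes', 'Sometimes', 'Frequently']})
toy_test = pd.DataFrame({'CALC': ['no', 'Sometimes', 'Always', 'Frequently']})

shares = category_shares(toy_train, toy_test, 'CALC')
print(shares)
```

Because the join is on the category labels, a category that is missing on one side shows up as a 0% row instead of shifting the rows below it.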



<div class="alert alert-block alert-info"><h4>Create a BMI column and delete Height and Weight from both datasets</h4>

In most people, BMI correlates with body fat<b>
[https://my.clevelandclinic.org/health/body/24052-adipose-tissue-body-fat](https://my.clevelandclinic.org/health/body/24052-adipose-tissue-body-fat) </b>
- the higher the number, the more body fat you have, but according to some clinical studies it's not accurate in some cases.

From the train and test datasets we assume that the weight is in 'pounds' and the height is in meters.   
We convert height to inches and calculate BMI.
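
For reference, here is a small sketch of the two standard BMI definitions. It assumes nothing about the dataset's units; the function names and sample values are made up. If the Weight column turns out to be in kilograms rather than pounds (worth checking against the dataset description), the SI form applies directly.

```python
def bmi_si(weight_kg, height_m):
    """BMI in SI units: kg / m^2."""
    return weight_kg / height_m ** 2

def bmi_us(weight_lb, height_in):
    """BMI in US customary units: 703 * lb / in^2."""
    return 703.0 * weight_lb / height_in ** 2

# the same (hypothetical) person expressed in both unit systems
weight_kg, height_m = 70.0, 1.75
weight_lb = weight_kg * 2.20462      # kilograms -> pounds
height_in = height_m * 39.3701       # meters -> inches

print(round(bmi_si(weight_kg, height_m), 2))
print(round(bmi_us(weight_lb, height_in), 2))
```

Both calls give nearly the same value, because 703 is the (rounded) conversion factor between the two unit systems.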


In [None]:
def calc_bmi(x,y): 

    # Assuming Height is in Meters and Weight in 'pounds'
    # USC - ONE -of- x in lbs, and y in inches
    # x = x * 703;  y = np.square(y); x/y

    # SI -ONE- of - x in kgs, and y in meters
    # x = x;  y = np.square(y); x/y

    # convert to inches - since weight is in pounds
    # convert height from meters first to Centimeters 
    # and multiply by 0.394 to convert to inches
    # calculate BMI and return value
    return (x * 703 )/np.square(y * 100 * 0.394)

## create BMI columns for train and test datasets - and drop 'Weight' and 'Height'

In [None]:
train_df['BMI'] = train_df.apply(lambda x: calc_bmi(x['Weight'], x['Height']), axis=1)
test_df['BMI'] = test_df.apply(lambda x: calc_bmi(x['Weight'], x['Height']), axis=1)

# drop columns 'Weight' and 'Height' from both train_df and test_df create_datasets
train_df.drop(['Weight', 'Height'], axis=1, inplace=True)
test_df.drop(['Weight', 'Height'], axis=1, inplace=True)

<div class="alert alert-block alert-info"><h4>Replace 'Public_Transportation' & 'Walking' with 'Non_Motors',<br>
and 'Bike' & 'Motorbike' with 'Two_Wheelers'</h4>

In [None]:
# assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and may not apply to the frame
test_df['MTRANS'] = test_df['MTRANS'].replace(['Public_Transportation', 'Walking'], 'Non_Motors')
train_df['MTRANS'] = train_df['MTRANS'].replace(['Public_Transportation', 'Walking'], 'Non_Motors')

test_df['MTRANS'] = test_df['MTRANS'].replace(['Bike', 'Motorbike'], 'Two_Wheelers')
train_df['MTRANS'] = train_df['MTRANS'].replace(['Bike', 'Motorbike'], 'Two_Wheelers')
test_df['CALC'] = test_df['CALC'].replace(['Always'], 'Frequently')

In [None]:
# let's recreate columns and numerals lists
train_cat_cols = train_df.select_dtypes(include=['object', 'category'])
test_cat_cols = test_df.select_dtypes(include=['object', 'category'])
train_num_cols = train_df.select_dtypes(exclude=['object', 'category'])
test_num_cols = test_df.select_dtypes(exclude=['object', 'category'])

In [None]:
test_df.head()

In [None]:
train_df.head()

<div class="alert alert-block alert-info"><h4>check duplicates and NAs in train and test datasets</h4></div>

In [None]:
# ref: https://www.kaggle.com/code/nnjjpp/pipelines-for-preprocessing-a-tutorial
train_df.duplicated().sum()
pd.DataFrame([train_df.duplicated().sum(), 
           test_df.duplicated().sum()]).T.rename({0:'Train', 
                                                  1:'Test'}, 
                                                 axis=1).rename(index={0: '# of Duplicates'})

In [None]:
# check NA values
pd.concat([train_df.isna().sum(0), 
           test_df.isna().sum(0)], 
          axis=1).T.rename(index={0:'Train', 
                          1:'Test'})

In [None]:
# print columns
pd.DataFrame([train_cat_cols.columns, test_cat_cols.columns, 
              train_num_cols.columns, test_num_cols.columns, 
             ]).T.rename({0:'Train Cat', 1:'Test Cat', 2:'Train Num', 3:'Test Num'}, axis=1)

<div class="alert alert-block alert-info"><h1>plots</h1></div>

<div class="alert alert-block alert-info"><p style ="font-size:1.2em">Outliers Detection</p></div>

In [None]:
num_rows =  2
plt.figure(figsize=(num_rows*10,4))

plt.suptitle('Age/BMI Box Plots')
plt.subplots_adjust(hspace=0.7)
plt.subplot(num_rows,2,1)
plt.boxplot(train_df.Age, vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Train Age Box Plot')

plt.subplot(num_rows,2,2)
plt.boxplot(train_df.BMI, vert=False)
plt.ylabel('Variable')
plt.xlabel('BMI')
plt.title('Train BMI Box Plot')

plt.subplot(num_rows,2,3)
plt.boxplot(test_df.Age, vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Test Age Box Plot')

plt.subplot(num_rows,2,4)
plt.boxplot(test_df.BMI, vert=False)
plt.ylabel('Variable')
plt.xlabel('BMI')
plt.title('Test BMI Box Plot')

plt.show()

From the above boxplots for 'Age' we can see that the spread for Test and Train datasets seems similar.<br>
**we will look into 'Age' column later**

In [None]:
# create a new dataframe to show boxplots between numerals and:
# family_history_with_overweight
# SMOKE
# Gender
train_gender_df = pd.concat([train_df[train_num_cols.columns],  
                             train_df['Gender'].to_frame(),
                             train_df['family_history_with_overweight'].to_frame(),
                             train_df['SMOKE'].to_frame(),
                            ], 
                            axis=1)

In [None]:
train_gender_df.head()

In [None]:
num_rows=1
plt.figure(figsize=(num_rows*14,4))
plt.suptitle('Age/BMI against family_history_with_overweight Box Plots')
plt.subplots_adjust(hspace=0.7)
plt.subplot(num_rows,2,1)
sns.boxplot( x="Age", y='family_history_with_overweight', data=train_gender_df, )
plt.subplot(num_rows,2,2)
sns.boxplot( x="BMI", y='family_history_with_overweight', data=train_gender_df, )
plt.show()

From the above boxplot we see that we have a lot of data for young people with no family_history_with_overweight
- for Age, the data is not evenly distributed between yes & no of family_history_with_overweight - it is skewed
- <b><p style ="color:red; font-size:1.2em">we will not discard 'Age', but create bins and convert it to categorical</p></b>

In [None]:
train_gender_df = pd.concat([train_df[train_num_cols.columns],  
                             train_df['Gender'].to_frame(),
                             train_df['family_history_with_overweight'].to_frame(),
                             train_df['SMOKE'].to_frame(),
                            ], 
                            axis=1)

In [None]:
num_rows=3
plt.figure(figsize=(num_rows*12,25))
plt.suptitle('Age/BMI against family_history_with_overweight/SMOKE/Gender Box Plots')
plt.subplots_adjust(hspace=0.7)
plt.subplot(num_rows,2,1)
sns.boxplot( x="Age", y='family_history_with_overweight', data=train_gender_df, )
plt.subplot(num_rows,2,2)
sns.boxplot( x="BMI", y='family_history_with_overweight', data=train_gender_df, )

plt.subplot(num_rows,2,3)
sns.boxplot( x="Age", y='SMOKE', data=train_gender_df, )
plt.subplot(num_rows,2,4)
sns.boxplot( x="BMI", y='SMOKE', data=train_gender_df, )

plt.subplot(num_rows,2,5)
sns.boxplot( x="Age", y='Gender', data=train_gender_df, )
plt.subplot(num_rows,2,6)
sns.boxplot( x="BMI", y='Gender', data=train_gender_df, )

plt.show()

<p style ="font-style:bold; font-size:1.4em; color:red">Bin BMI and convert to  categories</p>
**follow the sequence listed**: <br />
1. bin  <br />
2. create labels  <br />
3. run qcut the first time without labels  <br />
4. run qcut again with labels  <br />
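
`pd.qcut` computes its quantiles from whatever data it is given, so calling it separately on Train and Test gives each dataset its own bin edges. If the same edges should apply to both, one option is to learn them on Train with `retbins=True` and reuse them with `pd.cut`. A minimal sketch on made-up values; `fit_quantile_bins` and `apply_bins` are hypothetical helpers, not functions from the notebook:

```python
import numpy as np
import pandas as pd

def fit_quantile_bins(values, q):
    """Learn quantile bin edges from `values`; the outer edges are opened to +/-inf."""
    _, edges = pd.qcut(values, q=q, retbins=True, duplicates='drop')
    edges = edges.astype(float)
    edges[0], edges[-1] = -np.inf, np.inf   # so unseen extremes still fall into a bin
    return edges

def apply_bins(values, edges, prefix):
    """Assign each value to a labelled bin using previously learned edges."""
    labels = [f'{prefix}_{i}' for i in range(len(edges) - 1)]
    return pd.cut(values, bins=edges, labels=labels)

# made-up BMI values
train_bmi = pd.Series([18.0, 21.5, 24.0, 27.0, 30.5, 33.0, 36.0, 41.0])
test_bmi = pd.Series([15.0, 25.0, 50.0])

edges = fit_quantile_bins(train_bmi, q=4)
train_bins = apply_bins(train_bmi, edges, 'bmi')
test_bins = apply_bins(test_bmi, edges, 'bmi')
print(edges)
print(test_bins.tolist())
```

Opening the outer edges to infinity is a design choice: without it, test values outside the training range would become NaN.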

In [None]:
# let's bin BMI - we have 7 classes
num_bins = 7
bmi1 = ((train_df.BMI//num_bins)*num_bins).min()
bmi2 = ((train_df.BMI//num_bins+1)*num_bins).max()

bmi_bins = np.arange(bmi1,bmi2+num_bins,num_bins)
bmi_labels = ['bmi_'+str(round(f)) for f in np.arange(bmi1,bmi2+num_bins,num_bins)]
bmi_labels, bmi_bins

<p style ="font-style:bold; font-size:1.4em; color:red">Bin Age and convert to  categories</p>

In [None]:
# let's bin Age - we have 7 classes
num_bins = 7
age1 = ((train_df.Age//num_bins)*num_bins).min()
age2 = ((train_df.Age//num_bins+1)*num_bins).max()

age_bins = np.arange(age1,age2+num_bins,num_bins)
age_labels = ['age_'+str(round(f)) for f in np.arange(age1,age2+num_bins,num_bins)]
age_labels, age_bins

In [None]:
# We use same bins for Training and Test datasets

In [None]:
test_df.shape, train_df.shape

In [None]:
test_df.shape, train_df.shape

In [None]:
train_df['BMI_bins'] = pd.qcut(train_df.BMI, q=len(bmi_bins), duplicates='drop' )
train_df['BMI_bins'] = pd.qcut(train_df.BMI, q=len(bmi_bins),  labels=bmi_labels, duplicates='drop' )

In [None]:
test_df['BMI_bins'] = pd.qcut(test_df.BMI, q=len(bmi_bins), duplicates='drop' )
test_df['BMI_bins'] = pd.qcut(test_df.BMI, q=len(bmi_bins),  labels=bmi_labels, duplicates='drop' )

In [None]:
train_df['Age_bins'] = pd.qcut(train_df.Age, q=len(age_bins), duplicates='drop' )
train_df['Age_bins'] = pd.qcut(train_df.Age, q=len(age_bins),  labels=age_labels, duplicates='drop' )

In [None]:
test_df['Age_bins'] = pd.qcut(test_df.Age, q=len(age_bins), duplicates='drop' )
test_df['Age_bins'] = pd.qcut(test_df.Age, q=len(age_bins),  labels=age_labels, duplicates='drop' )

In [None]:
test_df.shape, train_df.shape

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
test_df.drop(['Age'], axis=1, inplace=True)
train_df.drop(['Age'], axis=1, inplace=True)
test_df.drop(['BMI'], axis=1, inplace=True)
train_df.drop(['BMI'], axis=1, inplace=True)

In [None]:
# categorical
train_cat_cols = train_df.select_dtypes(include=['object', 'category'])
test_cat_cols = test_df.select_dtypes(include=['object', 'category'])

# numerical
train_num_cols = train_df.select_dtypes(exclude=['object', 'category'])
test_num_cols = test_df.select_dtypes(exclude=['object', 'category'])

In [None]:
# Plot all features
plt.figure(figsize=(3,3))
plt.subplots_adjust(hspace=0.4)
cols = 3
rows=2
fig,ax = plt.subplots(nrows=rows,ncols=cols,figsize=(20,20))
ax = ax.flatten()
plt.rcParams["axes.labelsize"] = 10
truth_label='NObeyesdad'
plt.suptitle("Distributions of Multi-Class Obesity dataset\n",size=24)
# textprops={'fontsize': 16},
xx = 0
for i,col in enumerate(train_df.columns):
    if col not in train_cat_cols.columns:
      # print('1',col)
        if col == 'id':    # do not plot 'id'column
            continue
        else:
            sns.histplot(data=train_df,x=col,ax=ax[xx],kde=True,line_kws={"ls":"-"})
            xx += 1

plt.tight_layout()
plt.show()

<div class="alert alert-block alert-info"><h4>function to split trainXY and test_X</h4></div>

<div class="alert alert-block alert-warning"><h4>In the function below from the train and test datasets we will drop 'id'</h4>

In [None]:
train_df.head()

In [None]:
def create_datasets(trainxy, testx):
  # capture test_ids
  submit_id = testx['id']
  
  # Remove 'id' from datasets
  testx = testx.drop(['id'], axis=1)
  trainx = trainxy.drop(['id'], axis=1)
  
  return trainx, testx, submit_id
  

In [None]:
# create training sets train_X, train_Y, test_X, test_id 
# remove 'id' column from Train and Test
# train_X, train_Y,  test_X, test_id = create_datasets(train_df, test_df)
train_X, test_X, test_id = create_datasets(train_df, test_df)

# recapture columns list - category, and numerical
# 
# categorical
train_cat_cols = train_df.select_dtypes(include=['object', 'category'])
test_cat_cols = test_df.select_dtypes(include=['object', 'category'])

# numerical
train_num_cols = train_df.select_dtypes(exclude=['object', 'category'])
test_num_cols = test_df.select_dtypes(exclude=['object', 'category'])

# drop some columns - id, NObeyesdad from list
train_num_cols.drop(['id'], inplace=True, axis=1)
test_num_cols.drop(['id'], inplace=True, axis=1)
train_cat_cols.drop(['NObeyesdad'], inplace=True, axis=1)

# show columns
pd.DataFrame([train_num_cols.columns, test_num_cols.columns]).rename(index={0:'Train', 1:'Test'})


<div class="alert alert-block alert-info"><p style ="font-size:1.2em">Create train_Y</p></div>

In [None]:
train_Y = train_X.NObeyesdad
train_X.drop(['NObeyesdad'], axis=1, inplace=True)

In [None]:
# check NA values
pd.concat([train_X.isna().sum(0), 
           test_X.isna().sum(0)], 
          axis=1).T.rename(index={0:'Train', 
                          1:'Test'})

<div class="alert alert-block alert-info"><h1>plot categorical columns</h1></div>

In [None]:
plt.figure(figsize=(22,22))
plt.rcParams["axes.labelsize"] = 20
rows, num = 3, 3
cols = 0

# ref: https://stackoverflow.com/questions/63687789/how-do-i-create-a-pie-chart-using-categorical-data-in-matplotlib
def label_function(val):
  return f'{val / 100 * len(train_cat_cols):.0f}\n{val:.0f}%'   # returns nums and percent
  # return f'{val:.0f}%'

for n in range(rows):
  for i in range(num):
    plt.subplot(3,3,cols+1)
    if len(test_cat_cols.columns) > cols:
      train_cat_cols.groupby(train_cat_cols.columns[cols]).size().plot(kind='pie', 
                                                                     autopct=label_function, 
                                                                     textprops={'fontsize': 16},
                                                                     colormap='prism_r'
                                                                    )
      plt.title(train_cat_cols.columns[cols])
    cols += 1
    plt.axis('off')
  
  

In [None]:
plt.figure(figsize=(22,22))
plt.rcParams["axes.labelsize"] = 20
rows, num = 3, 3
cols = 0

# ref: https://stackoverflow.com/questions/63687789/how-do-i-create-a-pie-chart-using-categorical-data-in-matplotlib
def label_function(val):
  return f'{val / 100 * len(test_cat_cols):.0f}\n{val:.0f}%'   # returns nums and percent
  # return f'{val:.0f}%'

for n in range(rows):
  for i in range(num):
    plt.subplot(3,3,cols+1)
    if len(test_cat_cols.columns) > cols:
      test_cat_cols.groupby(test_cat_cols.columns[cols]).size().plot(kind='pie', 
                                                                     autopct=label_function, 
                                                                     textprops={'fontsize': 16},
                                                                     colormap='plasma_r'
                                                                    )
      plt.title(test_cat_cols.columns[cols])
    cols += 1
    plt.axis('off')

<div class="alert alert-block alert-info"><h1>Final Data Exploration - one last look</h1></div>

In [None]:
type(train_X), train_X.shape, train_X.columns

In [None]:
type(test_X), test_X.shape, test_X.columns

In [None]:
train_X.describe().transpose()

In [None]:
train_X.info(show_counts=False)

In [None]:
# print(train_X.isnull().sum() !=0)
# `1 in Series` tests the index labels, not the values, so check the values directly
print('No Null values in train dataset') if not train_X.isnull().values.any() else print(train_X.isnull().sum())

In [None]:
# print(test_X.isnull().sum() !=0)
print('No Null values in test dataset') if not test_X.isnull().values.any() else print(test_X.isnull().sum())

# Datasets - scaling and encoding

## Scikit-Learn PowerTransformer scaling technique

<div class="alert alert-block alert-info"><h4>
This applies to Fully Connected Neural Networks used to predict the majority class in classification problems or to predict values against continuous data.<br>
**Depending on EDA** analysis and how the data is cleaned, the sklearn scalers seem to react differently, since the spread of the numerical values differs<br />
<br><u>PowerTransformer</u> scaling seems to work best in some cases - but the accuracy on the validation set never seems to cross 89.945%
<br><u>StandardScaler and/or MinMaxScaler</u> - I think StandardScaler did OK in the last few runs

<br>The prediction (or inference) result against the synthetically created test set (Kaggle competition) <u>rose by <b>14.2</b> percentage points</u>
  
</h4></div>
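
One reason the scalers can behave differently: standard scaling is a linear map, so it leaves the shape of a feature (for example its skew) untouched, while a power transform changes it. The sketch below illustrates this with NumPy only, on made-up data. `np.log1p` is used as a stand-in: it equals the Yeo-Johnson transform with λ fixed at 0 for non-negative inputs, whereas sklearn's `PowerTransformer` estimates λ from the data, so this is an illustration rather than a reproduction of the notebook's scaling.

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # made-up, right-skewed feature

standardized = (raw - raw.mean()) / raw.std()          # what standard scaling does
log_scaled = np.log1p(raw)                             # Yeo-Johnson with lambda = 0 (x >= 0)

print(f'raw          skew: {skewness(raw):.2f}')
print(f'standardized skew: {skewness(standardized):.2f}')
print(f'log1p        skew: {skewness(log_scaled):.2f}')
```

The first two numbers match, while the third is much closer to zero, which is the kind of difference a network may pick up on.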

In [None]:
data_tr = train_X.copy()
data_te = test_X.copy()

In [None]:
labelEnc = LabelEncoder()
y_encoded = labelEnc.fit_transform(train_Y)

y = tf.keras.utils.to_categorical(y_encoded)

In [None]:
print(y[:10])

In [None]:
# categorical  
cat_col_ = train_X.select_dtypes(include=['object', 'category']).columns 
train_cat_cols = train_X.select_dtypes(include=['object', 'category'])   
test_cat_cols = test_X.select_dtypes(include=['object', 'category'])
print('categorical columns:', cat_col_)                                                                                                                                                                                                                                               

# numerical                                                                                                                                
num_col_ = train_X.select_dtypes(exclude=['object', 'category']).columns                                                                   
train_num_cols = train_X.select_dtypes(exclude=['object', 'category'])  
test_num_cols = test_X.select_dtypes(exclude=['object', 'category'])                                                                       
print('numerical columns:', num_col_)

In [None]:
test_X = pd.get_dummies(test_X, dtype=int)
train_X = pd.get_dummies(train_X, dtype=int)

In [None]:
type(train_X), train_X.shape, type(test_X), test_X.shape

In [None]:
scaler = StandardScaler()
train_X[num_col_] = scaler.fit_transform(train_X[num_col_])
test_X[num_col_]  = scaler.transform(test_X[num_col_])

<div class="alert alert-block alert-info"><p style ="font-size:1.3em">EnScaling</div>

In [None]:
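# convert the encoded and scaled frames to NumPy arrays; X and X_test are used from here on
X = train_X.to_numpy()
X_test = test_X.to_numpy()
X.shape, type(X), X_test.shape, type(X_test)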

## end different scaling technique

# split datasets into three - training, val, and hold_out

<div class="alert alert-block alert-info"><h4>
We split the dataset into three sets: <b>Train, Validation, and Test</b>.<br><br>
All 3 come from the same stream, but only Train/Validation are used for training and evaluation.<br>
We will use 'Test' to check predictions and graph confusion matrix (sns.heatmap)
</h4></div>
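
Holding out 20% of the data and then splitting that part 70/30 gives overall proportions of 80% train, 14% validation and 6% test. The sketch below shows only the arithmetic, using NumPy index arrays; `three_way_split` is a hypothetical helper and does not reproduce the exact shuffling of sklearn's `train_test_split`.

```python
import numpy as np

def three_way_split(n_samples, holdout_frac=0.2, test_frac_of_holdout=0.3, seed=42):
    """Return shuffled, disjoint index arrays for train, validation and test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_holdout = int(round(n_samples * holdout_frac))
    n_test = int(round(n_holdout * test_frac_of_holdout))
    holdout, train = idx[:n_holdout], idx[n_holdout:]
    test, val = holdout[:n_test], holdout[n_test:]
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```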

In [None]:
# split data before augmentation
trainX, valX, trainY, valY = train_test_split(X, y,
                                              test_size=0.2,    # hold out 20% for validation & test
                                              shuffle=True,
                                              random_state=42)

valX, testX, valY, testY = train_test_split(valX, valY,
                                            test_size=0.3,    # split: 30% for test and 70% for validation
                                            shuffle=True,
                                            random_state=42)

<div class="alert alert-block alert-info"><h4>
<b>Ref:</b>

[barbagrande007](https://www.kaggle.com/code/barbagrande007/bbg007-s4e2-obesity)

<br>Add jittering: introduce noise to X to increase the data size, similar to image augmentation techniques in Convolutional Neural Networks.
  <ul>
  <li>Creates more training samples</li>
  <li>can decrease overfitting</li>
  <li>can improve accuracy and predictability</li>
  </ul>
  
</h4></div>
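
Jittering as described above can be sketched as follows. This variant draws independent Gaussian noise for every element of X, which is one common way to do it and not necessarily the referenced notebook's exact method; the function name, noise scale and toy arrays are assumptions made for illustration.

```python
import numpy as np

def jitter_augment(X, y, n_copies=3, scale=0.05, seed=42):
    """Stack X with `n_copies` noisy copies, repeat y to match, then shuffle both."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        # a fresh noise value for every row and column
        X_parts.append(X + rng.normal(0.0, scale, size=X.shape))
        y_parts.append(y)
    X_aug = np.vstack(X_parts)
    y_aug = np.vstack(y_parts)
    order = rng.permutation(len(X_aug))   # permute over the augmented length
    return X_aug[order], y_aug[order]

# toy data: 4 samples, 3 features, one-hot labels
X_demo = np.arange(12, dtype=float).reshape(4, 3)
y_demo = np.eye(4)

X_aug, y_aug = jitter_augment(X_demo, y_demo)
print(X_aug.shape, y_aug.shape)
```

In practice one might restrict the noise to the continuous columns and leave one-hot encoded columns untouched.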

# Augment - Datasets for Analysis and Prediction

In [None]:
dng1 = np.random.default_rng(seed=42)
dng2 = np.random.default_rng(seed=46)
dng3 = np.random.default_rng(seed=142)
dng4 = np.random.default_rng(seed=146)
dng5 = np.random.default_rng(seed=16)
dng6 = np.random.default_rng(seed=66)
dng7 = np.random.default_rng(seed=116)
dng8 = np.random.default_rng(seed=166)

X_jitter1 = trainX + dng1.random(1) * 0.3
X_jitter2 = trainX + dng2.random(1) * 0.3
X_jitter3 = trainX + dng3.random(1) * 0.3
X_jitter4 = trainX + dng4.random(1) * 0.3
X_jitter5 = trainX + dng5.random(1) * 0.3
X_jitter6 = trainX + dng6.random(1) * 0.3
X_jitter7 = trainX + dng7.random(1) * 0.3
X_jitter8 = trainX + dng8.random(1) * 0.3

# Duplicate X, y - COMMENT OUT those we don't need
trainX = np.vstack((trainX,
                    X_jitter1, 
                    X_jitter2, 
                    X_jitter3, 
#                X_jitter4, 
#               X_jitter5, 
#               X_jitter6, 
#               X_jitter7,
#               X_jitter8,
              ))

trainY = np.vstack((trainY,
                    trainY, # 1
                    trainY, # 2
                    trainY, # 3 
#               trainY, # 4 
#               trainY, # 5 
#               trainY, # 6 
#               trainY, # 7 
#               trainY, # 8 
              ))

# Randomize samples (permute over the augmented training set, not the original X)
shuffled_indices = np.random.permutation(len(trainX))
trainX = trainX[shuffled_indices]
trainY = trainY[shuffled_indices]

# delete jitter arrays
del X_jitter1, X_jitter2, X_jitter3, X_jitter4
del X_jitter5, X_jitter6, X_jitter7, X_jitter8

In [None]:
# check shapes of all 3 sets
print(f'Train: X:{trainX.shape} Y:{trainY.shape}')
print(f'Val  : X:{valX.shape}   Y:{valY.shape}')
print(f'Test : X:{testX.shape}  Y:{testY.shape}')

# Deep Neural Network - Tensorflow -> Keras

<div class="alert alert-block alert-info"><h4>
Most of the imported packages are not used in this notebook; a similar setup can be used to tune hyperparameters with Keras_Tuner for Convolutional Neural Networks
</h4></div>

In [None]:
import tensorflow.keras as keras
import keras.backend as K

from tensorflow.keras.layers import Dense, Dropout, Flatten, BatchNormalization
from tensorflow.keras.layers import Input
from keras.layers import ReLU, LeakyReLU

from keras.models import Model, Sequential
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping, LearningRateScheduler
from tensorflow.keras.optimizers.schedules import ExponentialDecay
from keras.regularizers import L1, L2, L1L2
import keras.regularizers as regularizers
from keras.optimizers import Adam, SGD
from tensorflow.keras.backend import clear_session

<div class="alert alert-block alert-info"><h1>model callbacks</h1></div>

In [None]:
# ModelCheckpoint
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
  filepath='obesity.hd5',
  save_weights_only=True,         # only save weights
  monitor='val_accuracy',
  mode='max',
  save_best_only=True,
)

# Reduce Learning Rate
# Raises an error when enabled - it does not work when a learning-rate schedule is assigned to Adam
reduce_lr = ReduceLROnPlateau(
  monitor='val_loss',
  factor=0.04,
  patience=5,
  min_lr=0.0,
)

# Early Stopping
early_stop = EarlyStopping(
  monitor='val_loss',
  mode='auto',
  verbose=0,
  patience=3,
)


In [None]:
class My_Callback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.epoch = epoch

    def on_batch_end(self, batch, logs=None):
        # if self.epoch == 10 and batch == 3:
        if self.epoch == 10:
            print(f"\nStopping at Epoch {self.epoch}, Batch {batch}")
            self.model.stop_training = True

    # def on_epoch_end(self, epoch, logs=None):
    #     if self.epoch == 20:
    #         print(f"\nStopping at Epoch {self.epoch}")
    #         # cannot access self.model.stop_training in this function - check source

<div class="alert alert-block alert-info"><h4>
The function below is stand-alone and uses the following <b>hyperparameters</b> to tune the neural network, with the aim of improving validation accuracy and decreasing training and validation loss.<br>
  <ul>
    <li>batch_size - also referenced in the 'HyperTuningNetwork' class, under a different name</li>
    <li>epochs - number of epochs in run</li>
    <li>layer1 - Number of Neurons in first fully connected Dense layer</li>
    <li>layer2 - A float to scale the number of neurons in layer1 and use as Number of Neurons in second fully connected Dense layer</li>
    <li>l2_reg - Float - value of Kernel_regularizer - L2</li>
    <li>learning_rate - initial learning rate of the learning-rate schedule used by Adam (the model optimizer)</li>
    <li>decay_steps - number of steps used by the learning-rate schedule for Adam</li>
    <li>decay_rate - decay factor used by the learning-rate schedule for Adam</li>
  </ul>
</h4></div>
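
As a rough illustration of how `learning_rate`, `decay_steps` and `decay_rate` interact, the sketch below re-implements the exponential-decay formula in plain Python: `initial_lr * decay_rate ** (step / decay_steps)`. The helper name `exponential_decay` is hypothetical. The three values are the ones hard-coded in this notebook's model function.

```python
def exponential_decay(step, initial_lr, decay_steps, decay_rate, staircase=False):
    """Learning rate after `step` optimizer steps (hypothetical helper)."""
    exponent = step / decay_steps
    if staircase:
        # decay in discrete jumps instead of continuously
        exponent = step // decay_steps
    return initial_lr * decay_rate ** exponent


# Values taken from the notebook's build function
initial_lr = 0.0003450908797091988
decay_steps = 43510.0
decay_rate = 0.14500871466470402

for step in (0, 10_000, 43_510):
    print(step, exponential_decay(step, initial_lr, decay_steps, decay_rate))
```

A step here is one batch update, not one epoch, so the rate only reaches `initial_lr * decay_rate` after `decay_steps` batches have been processed.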

<div class="alert alert-block alert-info"><h1>create & build model</h1></div>

In [None]:
# Functional Model
def build_keras_model():
    keras.backend.clear_session()       # turn off and check how Keras_Tuner behaves

    # parameter search values for batch_size are set up in the
    # 'HyperTuningNetwork' class - 'fit' function
    tr_shape = trainX.shape[1]          # not strictly needed - used to shorten the name
    num_classes = trainY.shape[1]       # same reason as above

    # Number of units in the first Dense layer
    # base number is '560'   - min is 450, max is 700
    hp_units1 = 1520  # 980

    # For layer2 units, use a scaling factor based on # of Neurons in first layer
    hp_units2 = 190  # 750

    # setting conditional hyperparameters
    # https://github.com/keras-team/keras-tuner/issues/66
    # a = hp.Int('a', 0, 10)
    # with hp.conditional_scope('a', 5):
    #   b = hp.Int('b', 0, 10)

    # drop rates
    hp_drop1 = 0.44
    hp_drop2 = 0.71

    # Input
    inp = Input(shape=(tr_shape,))

    #### ONE ####
    stage1 = Dense(units=hp_units1,
                   activation='relu',
                   kernel_regularizer=L2(l2=0.000144),
                   bias_regularizer=L2(l2=0.00849),
                   # activity_regularizer: value is missing in the source, so it is left unset
                   kernel_initializer='glorot_uniform',
                  )(inp)
    drop1 = Dropout(hp_drop1)(stage1)
    batch1 = BatchNormalization()(drop1)

    #### TWO ####
    stage2 = Dense(units=hp_units2,
                   activation='relu',
                   kernel_regularizer=L1(l1=0.0004999),
                   bias_regularizer=L2(l2=0.00098332),
                   activity_regularizer=L1(l1=0.00010774),
                   kernel_initializer='glorot_uniform',
                  )(batch1)
    drop2 = Dropout(hp_drop2)(stage2)

    #### OUT ####
    outp = Dense(num_classes, activation='softmax')(drop2)

    ##################################################################
    # To set up learning_rate_scheduler for Adam - Model optimizer
    hp_learning_rate = 0.0003450908797091988      # 0.0004122296936091948      # 0.0005776234810416469 # 0.0010388011586892568
    hp_decay_steps =   43510.0                    # 39990.0                    # 30570.0               # 19990
    hp_decay_rate =    0.14500871466470402        # 0.36073213824416583        # 0.28642696462807177   # 0.3132430029454445

    ##################################################################
    # tensorflow.keras.optimizers.schedules.ExponentialDecay
    lr_schedule = ExponentialDecay(initial_learning_rate=hp_learning_rate,
                                   decay_steps=hp_decay_steps,
                                   decay_rate=hp_decay_rate,
                                  )

    ## opt = Adam(learning_rate=hp_learning_rate)
    # opt = Adam(learning_rate=lr_schedule)
    optimizer = 'adam' # 'adamax'

    # get optimizer from tensorflow.keras.optimizers baseclass
    opt = keras.optimizers.get(optimizer)
    # learning_rate = hp.Choice("learning_rate", values=[0.01, 0.1])
    opt.learning_rate = lr_schedule    # set up learning_rate

    model = Model(inp, outp)
    model.compile(loss='categorical_crossentropy',
                  # optimizer=optimizer,
                  optimizer=opt,
                  metrics=['accuracy'],
                 )
    return model

In [None]:
# build model
epochs = 480     
batch_size = 2000

# select between using valX, valY or a subset of training data as validation
USE_VAL_BATCH_DATA = True

# build model and print summary
model = build_keras_model()

model.summary()

<div class="alert alert-block alert-info"><h3>
first epoch run - stop after epoch 10 completes
</h3></div>

In [None]:
history = model.fit(trainX, trainY, 
                    epochs = epochs, 
                    batch_size = batch_size, 
                    validation_data=(valX,valY) if USE_VAL_BATCH_DATA else None,
                    validation_split=0.3 if not USE_VAL_BATCH_DATA else None,
                    callbacks=[model_checkpoint_callback, My_Callback()],
                    verbose=2,
                   )

<div class="alert alert-block alert-info">
<h2>plot loss and accuracy - first run</h2>
</div>

In [None]:
hist_frame=pd.DataFrame(data=history.history)

plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
sns.lineplot(data=(hist_frame.loss, hist_frame.val_loss))
plt.title('Loss')
plt.subplot(1,2,2)
sns.lineplot(data=(hist_frame.accuracy, hist_frame.val_accuracy))
plt.title('Accuracy')

<div class="alert alert-block alert-info"><h3>
second epoch run - start from epoch 11
</h3></div>

In [None]:
use_early_stop = True
history = model.fit(trainX, trainY, 
                    epochs = epochs, 
                    batch_size = batch_size, 
                    validation_data=(valX,valY) if USE_VAL_BATCH_DATA else None,
                    validation_split=0.3 if not USE_VAL_BATCH_DATA else None,
                    callbacks=[model_checkpoint_callback, early_stop] if use_early_stop else [model_checkpoint_callback],
                    initial_epoch=11,
                    verbose=2,
                   )
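
One detail worth keeping in mind for the resumed run: in Keras `fit`, `epochs` is the index of the final epoch, not the number of additional epochs to train. The small sketch below (the helper `epochs_actually_run` is hypothetical) shows how many epochs a call covers for a given `initial_epoch`.

```python
def epochs_actually_run(initial_epoch, epochs):
    """Number of epochs a fit call covers when nothing stops it early."""
    return len(range(initial_epoch, epochs))


print(epochs_actually_run(0, 480))    # fresh run: 480
print(epochs_actually_run(11, 480))   # resumed run: 469
```

Early stopping or a custom callback can of course end either run sooner than this upper bound.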


<div class="alert alert-block alert-info">
<h2>plot loss and accuracy - second run</h2>
</div>

In [None]:
hist_frame=pd.DataFrame(data=history.history)

plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
sns.lineplot(data=(hist_frame.loss, hist_frame.val_loss))
plt.title('Loss')
plt.subplot(1,2,2)
sns.lineplot(data=(hist_frame.accuracy, hist_frame.val_accuracy))
plt.title('Accuracy')

<div class="alert alert-block alert-info"><h1>model evaluate (validation) and predict (hold-out)</h1></div>

In [None]:
model.evaluate(valX, valY, batch_size=32)

In [None]:
val_predictions = model.predict(valX)
v_predictions=[]
for i in range(len(val_predictions)):
  # print("Predicted=%s" % np.argmax(val_predictions[i]))
  v_predictions.append(np.argmax(val_predictions[i]))


In [None]:
# convert valY (one-hot) to true class indices
valY_actual=[]
for i in range(len(valY)):
  valY_actual.append(np.argmax(valY[i]))

unique_nums = np.unique([valY_actual, v_predictions])
unique_label = labelEnc.inverse_transform(unique_nums)

In [None]:
sns.heatmap(confusion_matrix(valY_actual, 
                             v_predictions), 
            annot=True, 
            cmap='viridis', 
            fmt='d', 
            xticklabels=unique_label,
            yticklabels=unique_label,
            square=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - Validation Set')
plt.show()

In [None]:
# classification report
print(classification_report(valY_actual, v_predictions))

In [None]:
# confusion_matrix
print(confusion_matrix(valY_actual, v_predictions))
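
To make the layout of the matrix explicit, here is a minimal pure-Python sketch of what a confusion matrix counts: rows are true classes and columns are predicted classes, the same convention scikit-learn uses. The helper names and the tiny label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) pairs; rows = true class, columns = predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Accuracy is the diagonal (correct predictions) over all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

m = confusion_counts(y_true, y_pred, num_classes=3)
for row in m:
    print(row)
print(accuracy_from_matrix(m))
```

Off-diagonal cells show which pairs of classes the model confuses, which is usually more informative for this dataset than the single accuracy number.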

<div class="alert alert-block alert-info"><h1>Predictions - Hold-Out set</h1></div>

In [None]:
test_predictions = model.predict(testX)

In [None]:
predictions=[]
for i in range(len(test_predictions)):
  # print("Predicted=%s" % np.argmax(test_predictions[i]))
  predictions.append(np.argmax(test_predictions[i]))

In [None]:
# convert testY to true_labels
testY_actual=[]
for i in range(len(testY)):
  testY_actual.append(np.argmax(testY[i]))

In [None]:
unique_nums = np.unique([testY_actual, predictions])
unique_label = labelEnc.inverse_transform(unique_nums)
sns.heatmap(confusion_matrix(testY_actual, predictions), 
            xticklabels=unique_label, 
            yticklabels=unique_label,
            annot=True, 
            cmap='viridis', 
            fmt='d', 
            square=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - Test (Hold-Out Set)')
plt.show()

In [None]:
# classification report
print(classification_report(testY_actual, predictions))

In [None]:
# confusion_matrix
print(confusion_matrix(testY_actual, predictions))

<div class="alert alert-block alert-info"><h1>inference - Multi-Class Prediction (Obesity Risk)</h1></div>

In [None]:
# predictions_ = model.predict(X_test_cluster)
predictions_ = model.predict(X_test)

print(predictions_[:5])

predictions_max=[]
for i in range(len(predictions_)):
  predictions_max.append(np.argmax(predictions_[i]))

# Inverse label encoder
predictions_submit = labelEnc.inverse_transform(predictions_max)
print(predictions_submit[:5])
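
The inference step above boils down to two operations: pick the index of the largest softmax probability in each row, then map that index back to its class label. A framework-free sketch follows. The helper names, the class names and the probabilities are placeholders, not the competition's real labels or model output.

```python
def argmax(row):
    """Index of the largest value (first one wins on ties, like np.argmax)."""
    return max(range(len(row)), key=row.__getitem__)


def decode_predictions(probabilities, class_names):
    """Map each row of class probabilities to its most likely class name."""
    return [class_names[argmax(row)] for row in probabilities]


# Placeholder class names and probabilities, for illustration only
class_names = ["class_a", "class_b", "class_c"]
probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

decoded = decode_predictions(probs, class_names)
print(decoded)
```

With NumPy arrays the per-row loop can be replaced by `np.argmax(predictions_, axis=1)`, which returns all class indices in one call.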

In [None]:
file_name = 'submission_88945.csv'
submit_pd = pd.read_csv(file_name)

In [None]:
# convert labels to integer codes with the already-fitted encoder (label encoding, not one-hot)
submit_encoded = labelEnc.transform(submit_pd.iloc[:,1])

In [None]:
# heatmap
sns.heatmap(confusion_matrix(submit_encoded, predictions_max), 
            xticklabels=unique_label, 
            yticklabels=unique_label,
            annot=True, 
            cmap='viridis', 
            fmt='d', 
            square=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# classification report
print(classification_report(submit_encoded, predictions_max))

In [None]:
# confusion_matrix
confusion_matrix(submit_encoded, predictions_max)

In [None]:
curr = datetime.now()
curr = curr.strftime('%y_%m_%d_%H_%M_%S')

In [None]:
# create DataFrame to write CSV file
# del predictions_data
predictions_data = pd.DataFrame(predictions_submit, columns=['NObeyesdad'])
predictions_data.insert(0, 'id', test_id)
hold_submission = f'submission_{curr}.csv'
predictions_data.to_csv(hold_submission, index = False)
predictions_data.to_csv('submission.csv', index = False)

predictions_data.head()

In [None]:
!head -10 submission.csv

## end inference

<div class="alert alert-block alert-success">
<h4>submit your CSV file</h4>
</div>

In [None]:
stop    # undefined name: raises NameError on purpose so "Run All" halts before the submit cells

In [None]:
curr = f'Multi-Class Prediction (Obesity Risk) submitted:  {curr}'
print(curr)

In [None]:
!kaggle competitions submit -c playground-series-s4e2 -f submission.csv -m "{curr}"

<block><pre>
@misc{playground-series-s4e2,
    author = {Walter Reade, Ashley Chow},
    title = {Multi-Class Prediction of Obesity Risk},
    publisher = {Kaggle},
    year = {2024},
    url = {https://kaggle.com/competitions/playground-series-s4e2}
}
</pre></block>

<block><pre>
@misc{omalley2019kerastuner,
    title        = {KerasTuner},
    author       = {O'Malley, Tom and Bursztein, Elie and Long, James and Chollet, Fran\c{c}ois and Jin, Haifeng and Invernizzi, Luca and others},
    year         = 2019,
    howpublished = {\url{https://github.com/keras-team/keras-tuner}}
}
</pre></block>