### **By Kyle Weldon**
Everything in this file is written and maintained by Kyle Weldon.

### **Importing needed packages**
The versions used are displayed below the code.

In [25]:
import pip
import os
# Suppress TensorFlow messages
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import numpy as np
import pandas as pd

import warnings
# Ignore the specific TensorFlow warning
warnings.filterwarnings('ignore', category=UserWarning, message="Your TensorFlow version is newer than 2.4.0 and so graph support has been removed in eager mode and some static graphs may not be supported. See PR #1483 for discussion.")

import tensorflow as tf
from tensorflow.keras.layers import Input,Dense, Concatenate, Reshape
from tensorflow.keras.models import Model
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.regularizers import l2

import shap

import sklearn
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

print('pip version:', pip.__version__)
print('NumPy version:', np.__version__)
print('Pandas version:', pd.__version__)
print('Scikit-learn version:', sklearn.__version__)
print('Tensorflow version:', tf.__version__)
print('shap version:', shap.__version__)

pip version: 24.1.2
NumPy version: 1.26.4
Pandas version: 2.2.2
Scikit-learn version: 1.5.1
Tensorflow version: 2.17.0
shap version: 0.46.0


### **Filtering raw data given:** 

The data was provided as an '.xlsx' file that contained incomplete samples. The code below removes every incomplete sample and saves the result as a '.csv' file.

In [26]:
def filter_data(excel_file, output_csv):
    try:
        df = pd.read_excel(excel_file)
    except FileNotFoundError:
        print(f"Error: The file '{excel_file}' was not found.")
        return
    except Exception as e:
        print(f"Error occurred while reading '{excel_file}': {str(e)}")
        return

    complete_rows = []
    for index, row in df.iterrows():
        if is_row_complete(row):
            complete_rows.append(row)

    cleaned_df = pd.DataFrame(complete_rows, columns=df.columns)

    try:
        cleaned_df.to_csv(output_csv, index=False)
        print(f"Cleaned data saved to '{output_csv}' successfully.")
        print(f"There are {len(cleaned_df)} samples in the cleaned data.")
    except Exception as e:
        print(f"Error occurred while saving to '{output_csv}': {str(e)}")
        return

def is_row_complete(row):
    for cell in row:
        if pd.isna(cell) or str(cell).strip() == '':
            return False
    return True

filter_data('Data/RawData.xlsx', 'Data/FilteredData.csv')

Cleaned data saved to 'Data/FilteredData.csv' successfully.
There are 881 samples in the cleaned data.
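The row-by-row loop above can also be written with vectorized pandas operations. The following is a minimal sketch under the same definition of "incomplete" (a cell that is NaN or a blank string); the helper name `drop_incomplete_rows` and the small demo frame are hypothetical and not part of the original data.

```python
import numpy as np
import pandas as pd

def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows where no cell is NaN or a blank/whitespace-only string."""
    # True wherever the cell, viewed as text, is empty after stripping whitespace
    blank = df.apply(lambda col: col.astype(str).str.strip() == '')
    incomplete = df.isna() | blank
    return df.loc[~incomplete.any(axis=1)].reset_index(drop=True)

# Hypothetical demo data: row 1 has a NaN, row 2 has a whitespace-only string
demo = pd.DataFrame({'a': [1, np.nan, 3, 4], 'b': ['x', 'y', '   ', 'z']})
cleaned = drop_incomplete_rows(demo)
print(cleaned)
```

On the real file this would be applied to the frame returned by `pd.read_excel` before calling `to_csv`.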


### **How to use the data:**
There are 10 different 'scenarios' or 'decisions' made by each sample (each sample represents one person). For each decision they chose a value from 0 to 10 based on how sure they were, which gives 11 possible choices per decision per sample. With only 881 samples, predicting all 11 possible choices accurately is unlikely because of the limited data. To account for this, the decisions are split into three categories: anyone who chose 0, 1, 2, or 3 is in category one, anyone who chose 4, 5, or 6 is in category two, and anyone who chose 7, 8, 9, or 10 is in category three. This gives a 4-3-4 categorical split. The code below performs this step. It is also essential to split the data into training and validation sets.

In [27]:
df = pd.read_csv('Data/FilteredData.csv')
# Column titles for all the output data
output_columns = ['Scenario 1 ',
                  'Unnamed: 40',
                  'Scenario 2 ',
                  'Unnamed: 42',
                  'Scenario 3 ',
                  'Unnamed: 44',
                  'Scenario 4',
                  'Unnamed: 46',
                  'Scenario 5 ',
                  'Unnamed: 48']

def classiy_and_catigorize(column):
    return to_categorical([0 if x <= 3 else 1 if x <= 6 else 2 for x in column])

columns = df[output_columns].to_numpy().T # The 'T' is to transpose the array

S1P1, S1P2, S2P1, S2P2, S3P1, S3P2, S4P1, S4P2, S5P1, S5P2 = [classiy_and_catigorize(col) for col in columns]
all_situations = [S1P1, S1P2, S2P1, S2P2, S3P1, S3P2, S4P1, S4P2, S5P1, S5P2]

for situation in all_situations:
    print(situation.shape)

(881, 3)
(881, 3)
(881, 3)
(881, 3)
(881, 3)
(881, 3)
(881, 3)
(881, 3)
(881, 3)
(881, 3)
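As a quick check of the 4-3-4 split, the sketch below maps every possible answer (0 through 10) to its category with NumPy only. It uses `np.digitize` instead of the list comprehension above, which is a swapped-in technique chosen so the example runs without TensorFlow; the bin edges `[4, 7]` are an assumption derived from the category boundaries described in the text.

```python
import numpy as np

scores = np.arange(11)                       # every possible answer, 0 through 10
classes = np.digitize(scores, bins=[4, 7])   # 0-3 -> 0, 4-6 -> 1, 7-10 -> 2
one_hot = np.eye(3)[classes]                 # one row per answer, one column per category

for score, label in zip(scores, classes):
    print(score, '->', label)

# Number of distinct answers that fall into each category
print(np.bincount(classes))
```

Running `np.bincount` on the class labels of a real output column gives the number of people per category, which is a useful check for class imbalance.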


### **Splitting into training and validating:**
Before this data can be used to train a model, it first needs to be split into training and validation data. The code below does that. The first 800 samples (people) are used to train the model, while the last 81 are used for validation.

In [28]:
def split(array):
    return array[:800], array[800:]

S1P1_train, S1P1_val = split(S1P1)
S1P2_train, S1P2_val = split(S1P2)
S2P1_train, S2P1_val = split(S2P1)
S2P2_train, S2P2_val = split(S2P2)
S3P1_train, S3P1_val = split(S3P1)
S3P2_train, S3P2_val = split(S3P2)
S4P1_train, S4P1_val = split(S4P1)
S4P2_train, S4P2_val = split(S4P2)
S5P1_train, S5P1_val = split(S5P1)
S5P2_train, S5P2_val = split(S5P2)

print(f"S1P1 training shape: {S1P1_train.shape}")
print(f"Validation set length: {len(S1P1_val)}")

S1P1 training shape: (800, 3)
Validation set length: 81
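The split above takes rows in file order. If the order of people in the file carries any pattern, a shuffled split is one alternative. The sketch below is not what the original code does: the helper name `shuffled_split_indices`, the seed, and the zero-filled stand-in arrays are all assumptions for illustration.

```python
import numpy as np

def shuffled_split_indices(n_samples, n_train, seed=0):
    """Return (train_idx, val_idx) taken from one fixed random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return order[:n_train], order[n_train:]

train_idx, val_idx = shuffled_split_indices(n_samples=881, n_train=800)

# Stand-in arrays with the same shapes as S1P1 and the first input block
demo_labels = np.zeros((881, 3))
demo_inputs = np.zeros((881, 3))

# Indexing every array with the same indices keeps inputs and labels aligned
labels_train, labels_val = demo_labels[train_idx], demo_labels[val_idx]
inputs_train, inputs_val = demo_inputs[train_idx], demo_inputs[val_idx]
print(labels_train.shape, labels_val.shape)
```

The same `train_idx` and `val_idx` would have to be applied to all ten output arrays and both input arrays so that each person's inputs stay paired with their decisions.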


### **Prepare input data:**
Now that the output data is fully prepared and ready for training, it is time to prepare the corresponding input data. Similar to the multiple outputs, there are also multiple inputs. The code below prepares them.

In [29]:
# Column titles used for this input set
input_columns1 = ['MAx1', 'Max2', 'Max3']
input_columns2 = ['Q105_1','Q105_2','Q105_3','Q105_4','Q105_5','Q105_6','Q105_7','Q105_8','Q105_9','Q105_10','Q105_11','Q105_12','Q105_13','Q105_14','Q105_15','Q105_16','Q105_17','Q105_18','Q105_19','Q105_20','Q105_21','Q105_22','Q105_23','Q105_24','Q105_25','Q105_26','Q105_27','Q105_28','Q105_29','Q105_30','Q105_31','Q105_32','Q105_33','Q105_34']

input_df1 = df[input_columns1]
input1_X_values = input_df1.to_numpy()
input_df2 = df[input_columns2]
input2_X_values = input_df2.to_numpy()

layer1_X_train = input1_X_values[:800]
layer1_X_val = input1_X_values[800:]
layer2_X_train = input2_X_values[:800]
layer2_X_val = input2_X_values[800:]

print(f"Layer1 training shape: {layer1_X_train.shape} -> validating shape: {layer1_X_val.shape}")
print(f"Layer2 training shape: {layer2_X_train.shape} -> validating shape: {layer2_X_val.shape}")

Layer1 training shape: (800, 3) -> validating shape: (81, 3)
Layer2 training shape: (800, 34) -> validating shape: (81, 34)
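The 34 questionnaire column names follow a regular pattern, so the list could also be generated instead of typed out. The sketch below also shows one way to confirm that the expected columns exist before selecting them; the `demo_df` frame is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Generates 'Q105_1' through 'Q105_34'
input_columns2 = [f"Q105_{i}" for i in range(1, 35)]
print(len(input_columns2), input_columns2[0], input_columns2[-1])

# Hypothetical stand-in frame that is missing the last column on purpose
demo_df = pd.DataFrame(columns=input_columns2[:-1])
missing = [col for col in input_columns2 if col not in demo_df.columns]
print("Missing columns:", missing)
```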


### **Building model architecture:**
The first instinct is to build a deep neural network. Because this initial model predicts all 10 decisions for each sample, a multi-output model is needed. The output layers needed some adjustment so that they have the correct shape. The functional API from Keras is used for this.

In [30]:
input1 = Input(shape=(3,), name='InputLayer1')

hidden1_input1 = Dense(256, activation='relu', kernel_regularizer=l2(0.01), name='DenseOne_Input1')(input1)
hidden2_input1 = Dense(128, activation='relu', kernel_regularizer=l2(0.01), name='DenseTwo_Input1')(hidden1_input1)
hidden3_input1 = Dense(64, activation='relu', kernel_regularizer=l2(0.01), name='DenseThree_Input1')(hidden2_input1)
hidden4_input1 = Dense(32, activation='relu', kernel_regularizer=l2(0.01), name='DenseFour_Input1')(hidden3_input1)
hidden5_input1 = Dense(16, activation='relu', kernel_regularizer=l2(0.01), name='DenseFive_Input1')(hidden4_input1)

input2 = Input(shape=(34,), name='InputLayer2')

hidden1_input2 = Dense(256, activation='relu', kernel_regularizer=l2(0.01), name='DenseOne_Input2')(input2)
hidden2_input2 = Dense(128, activation='relu', kernel_regularizer=l2(0.01), name='DenseTwo_Input2')(hidden1_input2)
hidden3_input2 = Dense(64, activation='relu', kernel_regularizer=l2(0.01), name='DenseThree_Input2')(hidden2_input2)
hidden4_input2 = Dense(32, activation='relu', kernel_regularizer=l2(0.01), name='DenseFour_Input2')(hidden3_input2)
hidden5_input2 = Dense(16, activation='relu', kernel_regularizer=l2(0.01), name='DenseFive_Input2')(hidden4_input2)

concatenated = Concatenate(name='ConcatinatedInput')([hidden5_input1, hidden5_input2])

def create_output_layer(name, input_layer):
    return Dense(3, activation='softmax', name=name)(input_layer)
def itterate_situations_and_parts(num_itts=10):
    scenerio = 1
    sittos = []
    for i in range(num_itts):
        part = 1 if i % 2 == 0 else 2
        sittos.append(f"S{scenerio}P{part}")
        if part == 2: scenerio += 1
    return sittos

outputs = [create_output_layer(name, concatenated) for name in itterate_situations_and_parts()]

model = Model(inputs=[input1, input2], outputs=outputs, name='CustomizedDeepNeuralNetwork')

metrics = ['accuracy'] * 10  

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=metrics)

model.summary()  # summary() prints the table itself and returns None, so print() is not needed

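The naming helper above produces the output-layer names in the order S1P1, S1P2, ..., S5P2. An equivalent, more compact way to build the same list is sketched below. The `metrics_by_output` dict is an assumption for illustration: Keras accepts per-output metrics keyed by output-layer name, but the original code passes a plain list instead.

```python
# Five scenarios, two parts each, in the same order as the helper above
output_names = [f"S{scenario}P{part}" for scenario in range(1, 6) for part in (1, 2)]
print(output_names)

# Hypothetical alternative to ['accuracy'] * 10: metrics keyed by output-layer name
metrics_by_output = {name: ['accuracy'] for name in output_names}
```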


### **Training the model:**
Now that the model has been built and compiled it is ready for the training data that has been previously prepared. Below is the code for training the model.

In [31]:
model.fit([layer1_X_train, layer2_X_train], [S1P1_train, S1P2_train, S2P1_train,
                    S2P2_train, S3P1_train, S3P2_train, S4P1_train,
                    S4P2_train, S5P1_train, S5P2_train],
          epochs=10,
          batch_size=32,
          validation_data=([layer1_X_val, layer2_X_val], [S1P1_val, S1P2_val, S2P1_val,
                                                          S2P2_val, S3P1_val, S3P2_val, S4P1_val,
                                                          S4P2_val, S5P1_val, S5P2_val]))

Epoch 1/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 7s 41ms/step - S1P1_accuracy: 0.1341 - S1P2_accuracy: 0.4188 - S2P1_accuracy: 0.2602 - S2P2_accuracy: 0.6191 - S3P1_accuracy: 0.5262 - S3P2_accuracy: 0.3560 - S4P1_accuracy: 0.3877 - S4P2_accuracy: 0.4369 - S5P1_accuracy: 0.7120 - S5P2_accuracy: 0.6360 - loss: 18.4647 - val_S1P1_accuracy: 0.8642 - val_S1P2_accuracy: 0.7654 - val_S2P1_accuracy: 0.7901 - val_S2P2_accuracy: 0.7654 - val_S3P1_accuracy: 0.7160 - val_S3P2_accuracy: 0.2222 - val_S4P1_accuracy: 0.8889 - val_S4P2_accuracy: 0.8272 - val_S5P1_accuracy: 0.7531 - val_S5P2_accuracy: 0.7654 - val_loss: 13.0960
Epoch 2/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - S1P1_accuracy: 0.8288 - S1P2_accuracy: 0.7471 - S2P1_accuracy: 0.7930 - S2P2_accuracy: 0.6781 - S3P1_accuracy: 0.6589 - S3P2_accuracy: 0.5254 - S4P1_accuracy: 0.8779 - S4P2_accuracy: 0.6000 - S5P1_accuracy: 0.7785 - S5P2_accuracy: 0.7082 - loss: 13.0331 - val_S1P1_accuracy

<keras.src.callbacks.history.History at 0x1b6075fe240>

### **Attempts made to improve output:**
The loss in this output is high. The closer a loss is to 0 the better, but the values above (about 18.5 and 13.0 in the first two epochs) are far from 0. Simplifying the model, adjusting the L2 regularization penalties, and changing the number of epochs were all tried, but none seemed to have any effect on the loss.
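For a sense of scale, note that with default settings Keras reports `loss` for a multi-output model as the sum of the per-output losses plus any regularization penalties, so the number is not comparable to a single-output loss. The sketch below computes a reference point: the categorical cross-entropy of a uniform guess over three classes, summed over ten outputs. How much of the reported loss comes from cross-entropy versus the L2 penalties cannot be read from the log above, so this is only a baseline and not a breakdown.

```python
import math
import numpy as np

n_outputs = 10
n_classes = 3

# Cross-entropy of a uniform guess over three classes is ln(3)
uniform_ce = math.log(n_classes)
baseline_total = n_outputs * uniform_ce
print(f"per-output baseline: {uniform_ce:.4f}")
print(f"summed over {n_outputs} outputs: {baseline_total:.4f}")

def mean_categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean categorical cross-entropy for one-hot labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# Four hypothetical one-hot labels scored against a uniform prediction
y_true = np.eye(3)[[0, 1, 2, 1]]
y_uniform = np.full((4, 3), 1 / 3)
uniform_check = mean_categorical_crossentropy(y_true, y_uniform)
print(f"check against the formula: {uniform_check:.4f}")
```

A summed cross-entropy near 11 would correspond to ten outputs that are no better than guessing; anything the reported loss shows beyond the cross-entropy part comes from the regularization terms.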

### **Explaining the output with SHAP (SHapley Additive exPlanations):**
The code below does not currently work, and the cause has not yet been identified.

In [32]:
separate_models = []

for i, output in enumerate(model.outputs):
    separate_model = Model(inputs=model.inputs, outputs=output)
    separate_models.append(separate_model)

background = [layer1_X_train[:50], layer2_X_train[:50]]

X1_sample = layer1_X_train[50:60]
X2_sample = layer2_X_train[50:60]

shap_values = []

for separate_model in separate_models:
    explainer = shap.DeepExplainer(separate_model, background)
    shap_values.append(explainer.shap_values([X1_sample, X2_sample]))
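One possible workaround, untested against this model, is a model-agnostic explainer such as `shap.KernelExplainer`, which expects a prediction function that takes a single 2-D matrix. The sketch below shows only the wrapper that packs the two input blocks into one matrix and unpacks them again. `make_flat_predict` and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in for `separate_model.predict` so the example runs without TensorFlow or shap.

```python
import numpy as np

N_INPUT1 = 3    # width of the first input block (MAx1, Max2, Max3)
N_INPUT2 = 34   # width of the second input block (Q105_1 ... Q105_34)

def make_flat_predict(predict_two_inputs):
    """Wrap a two-input predict function so it accepts one (n, 37) matrix."""
    def predict_flat(X):
        X = np.asarray(X)
        return predict_two_inputs([X[:, :N_INPUT1], X[:, N_INPUT1:]])
    return predict_flat

def fake_predict(inputs):
    """Hypothetical stand-in for separate_model.predict: returns 3-class probabilities."""
    x1, x2 = inputs
    assert x1.shape[1] == N_INPUT1 and x2.shape[1] == N_INPUT2
    logits = np.stack([x1.sum(axis=1), x2.sum(axis=1), np.zeros(len(x1))], axis=1)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

predict_flat = make_flat_predict(fake_predict)

# Five demo rows: the two input blocks stacked side by side
X_flat = np.hstack([np.ones((5, N_INPUT1)), np.zeros((5, N_INPUT2))])
probs = predict_flat(X_flat)
print(probs.shape)
```

With the real model, `make_flat_predict(separate_model.predict)` and a background matrix built with `np.hstack([layer1_X_train[:50], layer2_X_train[:50]])` could be handed to `shap.KernelExplainer`. Whether that resolves the issue here is an assumption that would need to be checked.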