# Primitive Modeling
The data processing involved:
- removing a few outliers
- dropping unnecessary features
- dropping rows with high missingness
- imputing/interpolating remaining missing values

We will first fit a simple regression model to get a baseline, and then try more complex methods to:
- Transform the feature space
- Deal with missing values
- Overcome imbalanced classes (may not be necessary if regression and thresholding work)

## Preprocessing

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
# read survey
data_dictionary = pd.read_csv('data/data_dictionary.csv')
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

# read aggregated actigraphy
actigraphy_data = pd.read_csv('data/PA_data.csv', index_col=0)

In [51]:
# Outliers ----------------------------------------------------------------------------
# removing outliers (should change this to be automatic so it applies to the test data)
# From CGAS
train_data.loc[2065, 'CGAS-CGAS_Score'] = 99
# From BIA (there might still be some suspicious extreme values)
cols = data_dictionary.loc[(data_dictionary['Instrument'] == 'Bio-electric Impedance Analysis') & (data_dictionary['Type'] == 'float'), 'Field']
train_data.loc[[3205, 3511], cols] = np.nan # remove 3511 and 3205's BIA values because they seem wrong. They have normal heights and weights but extreme values for BIA measures

# Drop features ----------------------------------------------------------------------
# combine FitnessGram Minutes and seconds
train_data['Fitness_Endurance-Total_Time_sec'] = train_data['Fitness_Endurance-Time_Mins'] * 60 + train_data['Fitness_Endurance-Time_Sec'] # remove remaining Fitness_Endurance Columns
# drop all PCIAT columns, any column that ends in -Season, FitnessGram Zones, remaining Fitness_Endurance columns, and redundant SDS column
columns_to_drop = [col for col in train_data.columns if col.startswith('PCIAT') or col.endswith('Season') or col.endswith('Zone')]
columns_to_drop.extend(['Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Sec', 'Fitness_Endurance-Time_Mins', 'SDS-SDS_Total_Raw'])
train_data_cleaned = train_data.drop(columns=columns_to_drop)

# merge PAQ_A and PAQ_C
# keep adolescent value if 13 or older
train_data_cleaned.loc[train_data_cleaned['PAQ_A-PAQ_A_Total'].notna() & train_data_cleaned['PAQ_C-PAQ_C_Total'].notna() & (train_data_cleaned['Basic_Demos-Age'] >= 13), 'PAQ_C-PAQ_C_Total'] = np.nan
# keep child value if younger than 13
train_data_cleaned.loc[train_data_cleaned['PAQ_A-PAQ_A_Total'].notna() & train_data_cleaned['PAQ_C-PAQ_C_Total'].notna() & (train_data_cleaned['Basic_Demos-Age'] < 13), 'PAQ_A-PAQ_A_Total'] = np.nan
# merge columns
train_data_cleaned['PAQ-PAQ_Total'] = train_data_cleaned['PAQ_A-PAQ_A_Total'].fillna(train_data_cleaned['PAQ_C-PAQ_C_Total'])
# drop columns
train_data_cleaned = train_data_cleaned.drop(columns = ['PAQ_A-PAQ_A_Total', 'PAQ_C-PAQ_C_Total'])

# include aggregate actigraphy features
train_data_cleaned = pd.merge(train_data_cleaned, actigraphy_data, left_on='id', how='left', right_index=True)

# Missing Values (might not have to handle completely manually for CatBoost, though it could improve performance) -----------
# Drop rows with high missingness (should investigate characteristics of rows with high missingness)
thresh = 50
percent_missing_per_row = train_data_cleaned.isnull().mean(axis=1) * 100
high_missingness_idx = percent_missing_per_row[percent_missing_per_row > thresh].index.values
train_data_cleaned = train_data_cleaned.drop(high_missingness_idx)
print(f'{len(high_missingness_idx)} rows dropped because {thresh}% or more of the data was missing.')
print(f'There are {len(train_data_cleaned)} rows remaining in the train data.')

# Drop features with high missingness?
print('This is the proportion of data available per feature. It might be wise to drop features with high missingness. We can try both ways though.')
display(train_data_cleaned.notna().mean().sort_values())

# Impute/interpolate remaining missing values (not necessary for CatBoost)

# Convert categorical columns to the correct data type? Since they're ordinal, it might work to not convert
# use data dictionary to convert features that have type=="categorical int" into str dtype

# drop rows without target
train_data_cleaned = train_data_cleaned.dropna(subset='sii')
print(f'Final length of the training dataset: {len(train_data_cleaned)}')

809 rows dropped because 50% or more of the data was missing.
There are 1928 rows remaining in the train data.
This is the proportion of data available per feature. It might be wise to drop features with high missingness. We can try both ways though.


Physical-Waist_Circumference              0.213174
Fitness_Endurance-Total_Time_sec          0.335062
FGC-FGC_GSD                               0.404046
FGC-FGC_GSND                              0.404046
vigorous                                  0.418568
moderate                                  0.429979
light                                     0.430498
sendentary                                0.430498
PAQ-PAQ_Total                             0.672199
FGC-FGC_PU                                0.860477
FGC-FGC_SRL                               0.860996
FGC-FGC_SRR                               0.862033
FGC-FGC_CU                                0.863589
FGC-FGC_TL                                0.864108
CGAS-CGAS_Score                           0.880705
SDS-SDS_Total_T                           0.906639
BIA-BIA_LST                               0.939315
BIA-BIA_SMM                               0.939315
BIA-BIA_TBW                               0.939315
BIA-BIA_LDM                    

Final length of the training dataset: 1928
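
The comments in the cell above note that outlier handling should become automatic so it can also be applied to the test data. A minimal sketch of one way to do that is shown here, assuming the cleaning is expressed as two small functions. The names `mask_out_of_range` and `drop_sparse_rows`, the toy frame, and the valid range for `CGAS-CGAS_Score` are all hypothetical and only illustrate the idea; the real ranges would have to come from the instrument documentation.

```python
import numpy as np
import pandas as pd

def mask_out_of_range(df, valid_ranges):
    """Return a copy of df where values outside [low, high] are set to NaN."""
    out = df.copy()
    for col, (low, high) in valid_ranges.items():
        if col in out.columns:
            # between() is False for NaN, so missing values stay missing
            out[col] = out[col].where(out[col].between(low, high))
    return out

def drop_sparse_rows(df, thresh_pct=50):
    """Keep only rows whose percentage of missing values is at most thresh_pct."""
    pct_missing = df.isnull().mean(axis=1) * 100
    return df.loc[pct_missing <= thresh_pct]

# Toy data (not from the survey), using column names that appear in this notebook
demo = pd.DataFrame({
    'CGAS-CGAS_Score': [65.0, 999.0, np.nan, 80.0],
    'Physical-Height': [50.0, 55.0, np.nan, np.nan],
    'Physical-HeartRate': [80.0, 90.0, np.nan, 100.0],
})
valid_ranges = {'CGAS-CGAS_Score': (1, 100)}  # hypothetical range, an assumption

masked = mask_out_of_range(demo, valid_ranges)
cleaned = drop_sparse_rows(masked, thresh_pct=50)
print(cleaned)
```

Because both functions take a DataFrame and return a new one, the same calls could be applied to `train_data` and `test_data` without relying on hard-coded row indices.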


## CatBoost
CatBoost handles categorical features automatically, so one-hot encoding is not needed. The categorical features do have to be converted to str, though.
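
The conversion to str could be driven by the data dictionary rather than done column by column. The sketch below assumes the dictionary has the `Field` and `Type` columns used earlier and that categorical columns are marked `categorical int`. The helper name `cast_categoricals`, the `'missing'` placeholder, and the toy dictionary are assumptions for illustration. Missing values are mapped to an explicit string because, as far as we know, CatBoost does not accept NaN in categorical features.

```python
import numpy as np
import pandas as pd

def cast_categoricals(df, data_dictionary, cat_type='categorical int'):
    """Convert columns marked as categorical in the data dictionary to str."""
    fields = data_dictionary.loc[data_dictionary['Type'] == cat_type, 'Field']
    cat_cols = [c for c in fields if c in df.columns]
    out = df.copy()
    for col in cat_cols:
        # NaN becomes its own category; integer codes become '0', '1', ...
        out[col] = out[col].map(lambda v: 'missing' if pd.isna(v) else str(int(v)))
    return out, cat_cols

# Toy dictionary and data (hypothetical, not the real files)
demo_dictionary = pd.DataFrame({
    'Field': ['Basic_Demos-Sex', 'Basic_Demos-Age'],
    'Type': ['categorical int', 'int'],
})
demo = pd.DataFrame({
    'Basic_Demos-Sex': [0, 1, np.nan],
    'Basic_Demos-Age': [9, 12, 15],
})

converted, cat_cols = cast_categoricals(demo, demo_dictionary)
print(cat_cols)
print(converted['Basic_Demos-Sex'].tolist())
```

The returned `cat_cols` list is what would be passed as `cat_features=cat_cols` when constructing the model, provided the conversion is applied to the same frame that the features are taken from.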

In [62]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error, r2_score
from catboost import CatBoostRegressor
import pandas as pd
import numpy as np

The hyperparameters below are not tuned yet and should be optimized.

In [63]:
X = train_data_cleaned.drop(['id','sii'], axis=1)
y = train_data_cleaned['sii']

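# Sex is technically already dummy coded, so this conversion is optional.
# Note: X was created above with drop(), which returns a new DataFrame, so this conversion does not change X.
train_data_cleaned['Basic_Demos-Sex'] = train_data_cleaned['Basic_Demos-Sex'].astype(str)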
categorical_features = ['Basic_Demos-Sex'] # can add to this

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = CatBoostRegressor(iterations=100, depth=6, learning_rate=0.1, verbose=10) # cat_features=categorical_features (had issues)

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the performance of the regression model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print regression metrics
print(f'Mean Absolute Error (MAE): {mae:.4f}')
print(f'Mean Squared Error (MSE): {mse:.4f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')
print(f'R-squared (R²): {r2:.4f}')

# Get feature importance
feature_importances = model.get_feature_importance()
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
display(importance_df)

0:	learn: 0.7583334	total: 3.9ms	remaining: 386ms
10:	learn: 0.6802886	total: 20.7ms	remaining: 167ms
20:	learn: 0.6416826	total: 31.9ms	remaining: 120ms
30:	learn: 0.6159614	total: 43.5ms	remaining: 96.8ms
40:	learn: 0.5954498	total: 54ms	remaining: 77.7ms
50:	learn: 0.5798528	total: 64.7ms	remaining: 62.2ms
60:	learn: 0.5676271	total: 75.7ms	remaining: 48.4ms
70:	learn: 0.5520926	total: 88.4ms	remaining: 36.1ms
80:	learn: 0.5424177	total: 98.4ms	remaining: 23.1ms
90:	learn: 0.5306812	total: 109ms	remaining: 10.7ms
99:	learn: 0.5181189	total: 118ms	remaining: 0us
Mean Absolute Error (MAE): 0.5387
Mean Squared Error (MSE): 0.4636
Root Mean Squared Error (RMSE): 0.6809
R-squared (R²): 0.2311


    Feature                                 Importance
33  SDS-SDS_Total_T                          13.390361
34  PreInt_EduHx-computerinternet_hoursday   12.246025
 0  Basic_Demos-Age                           5.086482
 4  Physical-Height                           4.167854
 8  Physical-HeartRate                        3.175268
 2  CGAS-CGAS_Score                            2.92739
 1  Basic_Demos-Sex                           2.923253
 9  Physical-Systolic_BP                      2.888344
16  FGC-FGC_TL                                2.860958
10  FGC-FGC_CU                                2.818706

In [87]:
# convert to classes
y_pred = y_pred.round()
y_pred[y_pred <= 0] = 0
y_pred[y_pred > 3] = 3
acc = (y_pred == y_test).mean()
print(f'Accuracy: {acc}')

0.5440414507772021
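
The rounding and clipping in the cell above can be wrapped in one helper so the same thresholding is reused for every regression model. This is a minimal sketch on made-up numbers; the name `to_classes` and the demo arrays are hypothetical. The class range 0 to 3 follows the bounds used in the cell above.

```python
import numpy as np

def to_classes(pred, low=0, high=3):
    """Round continuous predictions to the nearest integer class in [low, high]."""
    return np.clip(np.rint(pred), low, high).astype(int)

# Made-up predictions and labels, for illustration only
demo_pred = np.array([-0.3, 0.4, 1.6, 2.2, 3.7])
demo_true = np.array([0, 1, 2, 2, 3])

demo_classes = to_classes(demo_pred)
accuracy = (demo_classes == demo_true).mean()

# Predicted count per class, useful for spotting classes that are never predicted
class_counts = np.bincount(demo_classes, minlength=4)

print(demo_classes, accuracy, class_counts)
```

With imbalanced classes, the per-class counts are worth checking alongside accuracy, since a model can score reasonably well while rarely predicting the rarer classes.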

In [None]:
# try CatBoostClassifier too

## CNN on time series

**Steps to Build a CNN for Time Series**
1. Prepare the Data
    - Shape the Input: CNNs for time series require input of shape (samples, timesteps, features):
        - samples: Number of data points or observations.
        - timesteps: Length of the time series for each sample.
        - features: Number of features per timestep (for a univariate time series, features=1).
    - Split into Training and Testing Sets:
        - Ensure a robust split, often chronological (e.g., train on earlier data and test on later).
    - Scale the Data:
        - Normalize or standardize the data for better convergence.
2. Build the CNN
3. Adjust hyperparameters
    - Filters
    - Kernel Size
    - Pooling Size
    - Stride
4. Evaluate and tune
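
The data preparation in step 1 can be sketched with NumPy alone. The example below is an illustration on a synthetic univariate signal, not the actigraphy data. The helper `make_windows`, the window length of 10, and the 80/20 split are assumptions. It cuts the series into non-overlapping windows, splits them chronologically, and standardizes using statistics from the training portion only.

```python
import numpy as np

def make_windows(series, timesteps):
    """Reshape a series into non-overlapping windows of shape (samples, timesteps, features)."""
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]  # univariate series: features=1
    n_samples = series.shape[0] // timesteps
    trimmed = series[:n_samples * timesteps]  # drop the incomplete last window
    return trimmed.reshape(n_samples, timesteps, series.shape[1])

# Synthetic signal standing in for one recording
rng = np.random.default_rng(0)
signal = rng.normal(size=100)

windows = make_windows(signal, timesteps=10)  # shape (10, 10, 1)

# Chronological split: earlier windows for training, later ones for testing
split = int(0.8 * len(windows))
X_tr, X_te = windows[:split], windows[split:]

# Standardize per feature with training statistics only, to avoid leakage
mean = X_tr.mean(axis=(0, 1), keepdims=True)
std = X_tr.std(axis=(0, 1), keepdims=True)
X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std

print(X_tr_scaled.shape, X_te_scaled.shape)
```

The resulting arrays have the `(samples, timesteps, features)` layout that a `Conv1D` input layer expects, with `timesteps=10` and `features=1` in this toy case.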

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

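# Create a CNN model
# (timesteps and features are not defined in this notebook yet; set them to match the prepared input shape)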
model = Sequential([
    Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(timesteps, features)),
    MaxPooling1D(pool_size=2),
    Conv1D(filters=32, kernel_size=3, activation='relu'),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.5),
    Dense(1)  # Output layer (adjust units and activation for specific tasks)
])

# Compile the model
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

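# Train the model
# (X_train and y_train must be time-series arrays of shape (samples, timesteps, features), not the tabular split above; X_val and y_val still need to be created)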
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_val, y_val))


## Combine