# Introduction

The goal of this competition is to **detect freezing of gait (FOG)**, a debilitating symptom that afflicts many people **with Parkinson’s disease**. The task is to **develop a machine learning model trained on data collected from a wearable 3D lower back sensor** to better understand **when and why FOG episodes occur**.

# Import Libraries

In [None]:
!pip install tsflex
!pip install alive-progress

In [None]:
from alive_progress import alive_bar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from imblearn.over_sampling import SMOTE
import scipy.stats as ss
from tsflex.features import MultipleFeatureDescriptors, FeatureCollection, FeatureDescriptor
from tsflex.features.utils import make_robust
import warnings


The file **tdcsfog_metadata.csv identifies** each series in the tdcsfog dataset by **a unique Subject, Visit, Test, and Medication condition**.

In [None]:
# tdcsfog metadata file
tdcsfog_metadata = pd.read_csv("/kaggle/input/tlvmc-parkinsons-freezing-gait-prediction/tdcsfog_metadata.csv")
tdcsfog_metadata.head(5)

# Take All the CSV Files in the Train tdcsfog Folder

In [None]:
def slope(x): return (x[-1] - x[0]) / x[0] if x[0] else 0
def abs_diff_mean(x): return np.mean(np.abs(x[1:] - x[:-1])) if len(x) > 1 else 0
def diff_std(x): return np.std(x[1:] - x[:-1]) if len(x) > 1 else 0

strides = [5] # If stride size == window size there is no overlap between windows
windows = [50]


funcs = [make_robust(f) for f in [np.min,np.var, np.max, np.std, np.mean, slope, ss.skew, ss.kurtosis, abs_diff_mean, diff_std, np.sum,]]

fc = FeatureCollection(
    MultipleFeatureDescriptors(
          functions=funcs,
          series_names=["AccV", "AccML", "AccAP"],
          windows=windows,
          strides=strides[0],
    )
)

npmean = make_robust(np.mean)

fc.add(FeatureDescriptor(npmean, "StartHesitation", windows[0], strides[0]))
fc.add(FeatureDescriptor(npmean, "Walking", windows[0], strides[0]))
fc.add(FeatureDescriptor(npmean, "Turn", windows[0], strides[0]))
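
To make the window and stride settings concrete, here is a minimal NumPy-only sketch (it does not use tsflex) of how a window of 50 samples advanced by a stride of 5 slices a signal, and what the three custom functions return on one window. The synthetic ramp signal and the `sliding_windows` helper are assumptions made for illustration only.

```python
import numpy as np

# Same custom feature functions as in the cell above.
def slope(x): return (x[-1] - x[0]) / x[0] if x[0] else 0
def abs_diff_mean(x): return np.mean(np.abs(x[1:] - x[:-1])) if len(x) > 1 else 0
def diff_std(x): return np.std(x[1:] - x[:-1]) if len(x) > 1 else 0

def sliding_windows(signal, window, stride):
    """Hypothetical helper: return every full window as a row of a 2D array."""
    starts = range(0, len(signal) - window + 1, stride)
    return np.array([signal[s:s + window] for s in starts])

signal = np.arange(1.0, 101.0)  # synthetic ramp: 1, 2, ..., 100
wins = sliding_windows(signal, window=50, stride=5)
n_windows = wins.shape[0]       # consecutive windows overlap by 45 samples

first = wins[0]                 # samples 1..50
feats = {
    "slope": slope(first),
    "abs_diff_mean": abs_diff_mean(first),
    "diff_std": diff_std(first),
}
print(n_windows, feats)
```

Note that `slope` as defined here is the relative change between the first and last sample of the window, not a slope per unit of time.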

In [None]:
# Set the directory path to the folder containing the CSV files.
tdcsfog_path = '/kaggle/input/tlvmc-parkinsons-freezing-gait-prediction/train/tdcsfog'

# Initialize an empty list to store the dataframes.
tdcsfog_list = []


# Loop through each file in the directory and read it into a dataframe.
for file_name in os.listdir(tdcsfog_path):
    if file_name.endswith('.csv'):
        file_path = os.path.join(tdcsfog_path, file_name)
        file = pd.read_csv(file_path)
        file.Time = file.Time # / (len(file) - 1)
        tdcsfog_list.append(file)

In [None]:
# For each tdcsfog DataFrame, extract the features and add them to the total Dataframe.

# Initialize final dataframe
tdcsfog_final = pd.DataFrame()

for idx, tdcsfog in enumerate(tdcsfog_list): 
    tdcsfog = tdcsfog.reset_index(drop = True)
    step = max(1, len(tdcsfog_list) // 10)  # avoid a modulo by zero with fewer than 10 files
    if idx % step == 0 and idx != 0:
        print("Progress: " + str(idx // step * 10) + "%")
    df_feats = fc.calculate(data=[tdcsfog], window_idx="end", approve_sparsity=True, return_df=True)
    df_feats = df_feats.join(tdcsfog.drop(columns = ["StartHesitation","Turn","Walking"]))
    tdcsfog_final = pd.concat([tdcsfog_final,df_feats],ignore_index = True)
    
print(tdcsfog_final.tail())

To save memory, we downcast the numeric columns to smaller dtypes. Reference: [Reducing DataFrame memory size by ~65%](https://www.kaggle.com/code/arjanso/reducing-dataframe-memory-size-by-65)

In [None]:
def reduce_memory_usage(df):
    
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype.name
        if ((col_type != 'datetime64[ns]') & (col_type != 'category')):
            if (col_type != 'object'):
                c_min = df[col].min()
                c_max = df[col].max()

                if str(col_type)[:3] == 'int':
                    if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)

                else:
                    if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                        df[col] = df[col].astype(np.float16)
                    elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                        df[col] = df[col].astype(np.float32)
                    else:
                        pass
            else:
                df[col] = df[col].astype('category')
    mem_usg = df.memory_usage().sum() / 1024 ** 2 
    print("Memory usage became: ",mem_usg," MB")
    
    return df

In [None]:
tdcsfog = reduce_memory_usage(tdcsfog_final)
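
As a side note on downcasting, the sketch below shows the same idea on a tiny made-up DataFrame using `pd.to_numeric` with its `downcast` argument, which stops at float32 for floats. It also illustrates one caveat of going down to float16, as `reduce_memory_usage` does when the value range allows it: float16 keeps only about three significant decimal digits. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): one int64 column and one float64 column.
df = pd.DataFrame({
    "count": np.arange(100, dtype=np.int64),
    "acc": np.linspace(-9.8, 9.8, 100),
})
before = df.memory_usage(index=False).sum()

small = df.copy()
small["count"] = pd.to_numeric(small["count"], downcast="integer")  # smallest int type that fits
small["acc"] = pd.to_numeric(small["acc"], downcast="float")        # float32 at the smallest
after = small.memory_usage(index=False).sum()

# float16 cannot represent 2049 exactly; it is rounded to 2048.
lossy = float(np.float16(2049.0))
print(before, after, lossy)
```

If precision matters for a feature such as a windowed sum, keeping that column at float32 is one possible compromise.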

In [None]:
tdcsfog.describe()

In [None]:
tdcsfog['StartHesitation__mean__w=' + str(windows[0])].mean()

In [None]:
tdcsfog['Turn__mean__w='+ str(windows[0])].mean()

In [None]:
tdcsfog['Walking__mean__w='+str(windows[0])].mean()

In [None]:
len(tdcsfog)

# Take All the CSV Files in the Train defog Folder

In [None]:
# Set the directory path to the folder containing the CSV files.
defog_path = '/kaggle/input/tlvmc-parkinsons-freezing-gait-prediction/train/defog'

# Initialize an empty list to store the dataframes.
defog_list = []

# Loop through each file in the directory and read it into a dataframe.
for file_name in os.listdir(defog_path):
    if file_name.endswith('.csv'):
        file_path = os.path.join(defog_path, file_name)
        file = pd.read_csv(file_path)
        file.Time = file.Time # / (len(file) - 1)
        file = file[(file['Task'] == 1) & (file['Valid'] == 1)]
        defog_list.append(file)


In [None]:
# For each defog DataFrame, extract the features and add them to the total Dataframe.
# Initialize final dataframe
defog_final = pd.DataFrame()

for idx, defog in enumerate(defog_list): 
    defog = defog.reset_index(drop = True)
    step = max(1, len(defog_list) // 10)  # avoid a modulo by zero with fewer than 10 files
    if idx % step == 0 and idx != 0:
        print("Progress: " + str(idx // step * 10) + "%")
    df_feats = fc.calculate(data=[defog], window_idx="end", approve_sparsity=True, return_df=True)
    df_feats = df_feats.join(defog.drop(columns = ["StartHesitation","Turn","Walking", "Task", "Valid"]))
    defog_final = pd.concat([defog_final,df_feats],ignore_index = True)
    
print(defog_final.tail())

In [None]:
defog = reduce_memory_usage(defog_final)

**We use only valid defog data, i.e. rows where both Task and Valid are 1.**

In [None]:
defog.describe()

**We merge tdcsfog and defog datasets into one merged dataset.**

In [None]:
# Concatenate the dataframes vertically using pd.concat().
merged = pd.concat([tdcsfog, defog], axis = 0, ignore_index = True)

merged

# Create Dataset

First, we need to **split the data into input features (i.e. "Time", "AccV", "AccML", and "AccAP") and target variables (i.e. "StartHesitation", "Turn", and "Walking")**. We can do this using the .iloc method to select the appropriate columns.

In [None]:
# input features
# merged['label'] = np.where(merged['Turn'] == 1, 1,
#                      np.where(merged['Walking'] == 1, 2,
#                      np.where(merged['StartHesitation'] == 1, 3, 0)))

# Use smote to create synthetic data

# smote = SMOTE(random_state = 4, k_neighbors=100)
# X_syn, y_syn = smote.fit_resample(X_merged, merged['label'])

In [None]:
# Create Synthetic dataset
# syn = pd.concat([X_syn,y_syn.to_frame(name = "label")], axis=1)
# syn["Turn"], syn["Walking"], syn["StartHesitation"] = (syn["label"] == 1).astype(int), (syn["label"] == 2).astype(int), (syn["label"] == 3).astype(int)

In [None]:
# tot = pd.concat([merged,syn])
# tot = tot.sort_values("Time",ignore_index = True)

# Normalize time
# tot["Time"] = tot["Time"] / (len(tot) - 1)

In [None]:
# tot = reduce_memory_usage(tot)
tot = reduce_memory_usage(merged)

In [None]:
data = np.array([tot['Walking__mean__w='+ str(windows[0])], tot['StartHesitation__mean__w=' + str(windows[0])], tot['Turn__mean__w='+ str(windows[0])]])

labels = np.argmax(data, axis = 0)
sums = np.sum(data, axis = 0)
labels = np.where(sums == 0 , 5, labels)


tot['Walking'] = np.where(labels == 0 , 1, 0)
tot['StartHesitation'] = np.where(labels == 1 , 1, 0)
tot['Turn'] = np.where(labels == 2 , 1, 0)

# try:
#     tot = tot.drop(columns=["Time"])
# finally:
#     pass

# Change this by hand if you want to try more features
X_tot = pd.concat([tot.iloc[:, 0:33],tot.iloc[:, 36:39]], axis = 1, ignore_index = False) 

y1 = tot['StartHesitation']  # target variable for StartHesitation
y2 = tot['Turn']  # target variable for Turn
y3 = tot['Walking']  # target variable for Walking
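
The label logic in the cell above can be traced on a small example. The sketch below uses four hypothetical windows with made-up label means; rows are ordered Walking, StartHesitation, Turn, as in the cell above. A window whose three means are all zero receives the sentinel value 5, so none of the three targets is set for it.

```python
import numpy as np

# Rows: Walking, StartHesitation, Turn. Columns: four hypothetical windows.
data = np.array([
    [0.0, 0.2, 0.0, 0.6],  # Walking mean per window
    [0.0, 0.8, 0.0, 0.0],  # StartHesitation mean per window
    [0.0, 0.0, 1.0, 0.4],  # Turn mean per window
])

labels = np.argmax(data, axis=0)          # index of the dominant event per window
sums = np.sum(data, axis=0)
labels = np.where(sums == 0, 5, labels)   # 5 marks "no event in this window"

walking = np.where(labels == 0, 1, 0)
start_hesitation = np.where(labels == 1, 1, 0)
turn = np.where(labels == 2, 1, 0)
print(labels, walking, start_hesitation, turn)
```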


Most of the target values are 0, so for each target we **build a balanced dataset with equal numbers of 0 and 1 samples**.

In [None]:
# Find the positions of y1 where it equals 0.
y1_zeros = np.where(y1 == 0)[0]
y1_ones = np.where(y1 == 1)[0]

# Choose as many samples with y1 == 0 as there are with y1 == 1.
num1_ones = (y1 == 1).sum()
np.random.seed(42)
y1_zeros = np.random.choice(np.where(y1 == 0)[0], size = num1_ones, replace = False)

# Combine the positions of y1 == 0 and y1 == 1.
y1_balanced_idxs = np.sort(np.concatenate([y1_zeros, y1_ones]))

# Use the balanced indices to get the corresponding rows of X and y1.
X1_balanced = X_tot.iloc[y1_balanced_idxs, :]
y1_balanced = y1.iloc[y1_balanced_idxs]

In [None]:
# Find the positions of y2 where it equals 0.
y2_zeros = np.where(y2 == 0)[0]
y2_ones = np.where(y2 == 1)[0]

# Choose as many samples with y2 == 0 as there are with y2 == 1.
num2_ones = (y2 == 1).sum()
np.random.seed(42)
y2_zeros = np.random.choice(np.where(y2 == 0)[0], size = num2_ones, replace = False)

# Combine the positions of y2 == 0 and y2 == 1.
y2_balanced_idxs = np.sort(np.concatenate([y2_zeros, y2_ones]))

# Use the balanced indices to get the corresponding rows of X and y2.
X2_balanced = X_tot.iloc[y2_balanced_idxs, :]
y2_balanced = y2.iloc[y2_balanced_idxs]

In [None]:
# Find the positions of y3 where it equals 0.
y3_zeros = np.where(y3 == 0)[0]
y3_ones = np.where(y3 == 1)[0]

# Choose as many samples with y3 == 0 as there are with y3 == 1.
num3_ones = (y3 == 1).sum()
np.random.seed(42)
y3_zeros = np.random.choice(np.where(y3 == 0)[0], size = num3_ones, replace = False)

# Combine the positions of y3 == 0 and y3 == 1.
y3_balanced_idxs = np.sort(np.concatenate([y3_zeros, y3_ones]))

# Use the balanced indices to get the corresponding rows of X and y3.
X3_balanced = X_tot.iloc[y3_balanced_idxs, :]
y3_balanced = y3.iloc[y3_balanced_idxs]
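
The three cells above repeat the same undersampling steps. As a sketch, the logic could be wrapped in one function; `undersample_indices` is a hypothetical helper name, and it uses `np.random.default_rng` rather than `np.random.seed`, so it would not select exactly the same rows as the cells above.

```python
import numpy as np

def undersample_indices(y, seed=42):
    """Hypothetical helper: keep all positives and at most as many random negatives."""
    y = np.asarray(y)
    ones = np.flatnonzero(y == 1)
    zeros = np.flatnonzero(y == 0)
    rng = np.random.default_rng(seed)
    n_keep = min(len(ones), len(zeros))
    kept_zeros = rng.choice(zeros, size=n_keep, replace=False)
    # Sort so that the selected rows keep their original order.
    return np.sort(np.concatenate([kept_zeros, ones]))

# Toy target with 2 positives out of 10 samples.
y = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
idx = undersample_indices(y)
balanced = y[idx]
print(idx, balanced)
```

With such a helper, each balanced pair could be obtained as `X_tot.iloc[idx, :]` and `y1.iloc[idx]` for the respective target.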

Next, we can **split the data into training and testing sets using the train_test_split function from scikit-learn**.

In [None]:
from sklearn.model_selection import train_test_split

X1_train, X1_test, y1_train, y1_test = train_test_split(X1_balanced, y1_balanced, test_size = 0.2, random_state = 42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2_balanced, y2_balanced, test_size = 0.2, random_state = 42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X3_balanced, y3_balanced, test_size = 0.2, random_state = 42)

Then, we **standardize the independent variables**.

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the independent variables.
scaler1 = StandardScaler()
X1_train = scaler1.fit_transform(X1_train)
X1_test = scaler1.transform(X1_test)

scaler2 = StandardScaler()
X2_train = scaler2.fit_transform(X2_train)
X2_test = scaler2.transform(X2_test)

scaler3 = StandardScaler()
X3_train = scaler3.fit_transform(X3_train)
X3_test = scaler3.transform(X3_test)

# Create, Train, and Evaluate Model

Finally, we can **create and train three separate models**, one for each target variable, using a suitable algorithm.
       
### This time we use **Random Forest Regressor instead of the Logistic Regression model**.

**For a Logistic Regression model, please see [PD FOG Prediction Baseline by Logistic Regression](https://www.kaggle.com/code/gokifujiya/pd-fog-prediction-baseline-by-logistic-regression).**

In [None]:
#from sklearn.linear_model import LogisticRegression
from sklearn import ensemble

# Create three separate logistic regression models.
#model1 = LogisticRegression()
#model2 = LogisticRegression()
#model3 = LogisticRegression()

# Create three separate Random Forest Regressor models.
model1 = ensemble.RandomForestRegressor(n_estimators = 100, max_depth = 7, n_jobs = -1, random_state = 42)
model2 = ensemble.RandomForestRegressor(n_estimators = 100, max_depth = 7, n_jobs = -1, random_state = 42)
model3 = ensemble.RandomForestRegressor(n_estimators = 100, max_depth = 7, n_jobs = -1, random_state = 42)

# Train the models on the training data.
model1.fit(X1_train, y1_train)
model2.fit(X2_train, y2_train)
model3.fit(X3_train, y3_train)

# Evaluate the models on the test data.
print('R2 for StartHesitation:', model1.score(X1_test, y1_test))
print('R2 for Turn:', model2.score(X2_test, y2_test))
print('R2 for Walking:', model3.score(X3_test, y3_test))
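
The `score` method of a regressor reports R², which can be hard to interpret for 0/1 targets. Because the regressor output can be read as a score between 0 and 1, a ranking metric such as average precision is one possible complement. The sketch below uses synthetic data, not the competition data, and the smaller forest size is an arbitrary choice for speed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic binary target driven mostly by the first feature (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=50, max_depth=7, random_state=42)
model.fit(X_train, y_train)

scores = model.predict(X_test)                 # averages of 0/1 leaf values, so within [0, 1]
ap = average_precision_score(y_test, scores)   # ranking quality of the scores
r2 = model.score(X_test, y_test)               # what the cell above prints
print("Average precision:", ap, "R2:", r2)
```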

# Recreate Dataset and Training

**For the submission, we train on the full balanced datasets without a train/test split, so that no data is held out, which may give a higher score.**

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the independent variables.
scaler1 = StandardScaler()
X1_balanced = scaler1.fit_transform(X1_balanced)

scaler2 = StandardScaler()
X2_balanced = scaler2.fit_transform(X2_balanced)

scaler3 = StandardScaler()
X3_balanced = scaler3.fit_transform(X3_balanced)

In [None]:
#from sklearn.linear_model import LogisticRegression
from sklearn import ensemble

# Create three separate logistic regression models.
#model1 = LogisticRegression()
#model2 = LogisticRegression()
#model3 = LogisticRegression()

# Create three separate Random Forest Regressor models.
model1 = ensemble.RandomForestRegressor(n_estimators = 100, max_depth = 7, n_jobs = -1, random_state = 42)
model2 = ensemble.RandomForestRegressor(n_estimators = 100, max_depth = 7, n_jobs = -1, random_state = 42)
model3 = ensemble.RandomForestRegressor(n_estimators = 100, max_depth = 7, n_jobs = -1, random_state = 42)

# Train the models on the training data.
model1.fit(X1_balanced, y1_balanced)
model2.fit(X2_balanced, y2_balanced)
model3.fit(X3_balanced, y3_balanced)

# Create Test Dataset

In [None]:
# Set the directory path to the folder containing the CSV files.
tdcsfog_test_path = '/kaggle/input/tlvmc-parkinsons-freezing-gait-prediction/test/tdcsfog'

# Initialize an empty list to store the dataframes.
tdcsfog_test_list = []

# Loop through each file in the directory and read it into a dataframe.
for file_name in os.listdir(tdcsfog_test_path):
    if file_name.endswith('.csv'):
        file_path = os.path.join(tdcsfog_test_path, file_name)
        file = pd.read_csv(file_path)
        file['Id'] = file_name[:-4] + '_' + file['Time'].apply(str)
        file.Time = file.Time 
        tdcsfog_test_list.append(file)

In [None]:
def slope(x): return (x[-1] - x[0]) / x[0] if x[0] else 0
def abs_diff_mean(x): return np.mean(np.abs(x[1:] - x[:-1])) if len(x) > 1 else 0
def diff_std(x): return np.std(x[1:] - x[:-1]) if len(x) > 1 else 0

strides = [5] # If stride size == window size there is no overlap between windows
windows = [10]


funcs = [make_robust(f) for f in [np.min,np.var, np.max, np.std, np.mean, slope, ss.skew, ss.kurtosis, abs_diff_mean, diff_std, np.sum,]]

fc_test = FeatureCollection(
    MultipleFeatureDescriptors(
          functions=funcs,
          series_names=["AccV", "AccML", "AccAP"],
          windows=windows,
          strides=strides[0],
    )
)


In [None]:
tdcsfog_test = pd.DataFrame()

for idx, tdcsfog in enumerate(tdcsfog_test_list): 
    df_feats = fc_test.calculate(data=[tdcsfog], window_idx="end", approve_sparsity=True, return_df=True)
    df_feats = df_feats.join(tdcsfog)
    tdcsfog_test = pd.concat([tdcsfog_test,df_feats],ignore_index = True)
    
print(tdcsfog_test.tail())

In [None]:
tdcsfog_test = reduce_memory_usage(tdcsfog_test)

In [None]:
# Set the directory path to the folder containing the CSV files.
defog_test_path = '/kaggle/input/tlvmc-parkinsons-freezing-gait-prediction/test/defog'

# Initialize an empty list to store the dataframes.
defog_test_list = []

# Loop through each file in the directory and read it into a dataframe.
for file_name in os.listdir(defog_test_path):
    if file_name.endswith('.csv'):
        file_path = os.path.join(defog_test_path, file_name)
        file = pd.read_csv(file_path)
        file['Id'] = file_name[:-4] + '_' + file['Time'].apply(str)
        file.Time = file.Time
        defog_test_list.append(file)

In [None]:
defog_test = pd.DataFrame()

for idx, defog in enumerate(defog_test_list): 
    df_feats = fc_test.calculate(data=[defog], window_idx="end", approve_sparsity=True, return_df=True)
    df_feats = df_feats.join(defog)
    defog_test = pd.concat([defog_test,df_feats],ignore_index = True)
    
print(defog_test.tail())

In [None]:
defog_test = reduce_memory_usage(defog_test)

In [None]:
test = pd.concat([tdcsfog_test, defog_test], axis = 0).reset_index(drop = True)
test = test.drop(columns=['Time'])
test

# Inference

In [None]:
# Separate the dataset for the independent variables.
# Change by hand
test_X = test.iloc[:, 0:36]

# Standardize the independent variables by a new scaler.
scaler = StandardScaler()
test_X = scaler.fit_transform(test_X)

# Get the predictions for the three models on the test data.
pred_y1 = model1.predict(test_X)
pred_y2 = model2.predict(test_X)
pred_y3 = model3.predict(test_X)

test['StartHesitation'] = pred_y1 # target variable for StartHesitation
test['Turn'] = pred_y2 # target variable for Turn
test['Walking'] = pred_y3 # target variable for Walking

test
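
The inference cell above fits a new `StandardScaler` on the test features, so the test data are scaled with different means and variances than the data each model was trained on. One alternative, sketched below on synthetic arrays, is to reuse the scaler that was fitted on the training features. The array names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_X = rng.normal(loc=10.0, scale=2.0, size=(1000, 3))
test_X_raw = rng.normal(loc=12.0, scale=2.0, size=(200, 3))  # shifted relative to training

# Option A: reuse the statistics learned from the training data.
train_scaler = StandardScaler().fit(train_X)
test_reused = train_scaler.transform(test_X_raw)

# Option B: fit a fresh scaler on the test data (what the cell above does).
test_refit = StandardScaler().fit_transform(test_X_raw)

shift_reused = test_reused.mean(axis=0)  # keeps the shift between train and test
shift_refit = test_refit.mean(axis=0)    # shift is removed, means are ~0
print(shift_reused, shift_refit)
```

In this notebook that would mean calling `scaler1.transform`, `scaler2.transform`, and `scaler3.transform` on the unscaled test features before the respective `predict` calls, provided the test columns match the training columns.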

# Submission

In [None]:
submission = test.iloc[:, 35:].fillna(0.0)
submission

In [None]:
submission.to_csv("submission.csv", index = False)

# Save, Load, and Use Model

To save the trained Random Forest models, we can use the joblib library and call joblib.dump() on each model. This saves each model to a file in the current working directory. **To load the saved models later**, we can use the joblib.load() function.

In [None]:
import joblib

# Save the model to disk.
joblib.dump(model1, 'model1.joblib')
joblib.dump(model2, 'model2.joblib')
joblib.dump(model3, 'model3.joblib')

# Load the saved models from disk.
model1_loaded = joblib.load('model1.joblib')
model2_loaded = joblib.load('model2.joblib')
model3_loaded = joblib.load('model3.joblib')

# Use the loaded models to make predictions on test data.
y1_pred_loaded = model1_loaded.predict(test_X)
y2_pred_loaded = model2_loaded.predict(test_X)
y3_pred_loaded = model3_loaded.predict(test_X)
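
As a quick sanity check of the save and load steps, the sketch below trains a small forest on synthetic data, writes it to a temporary directory, reloads it, and confirms that the reloaded model gives the same predictions. The data, model size, and file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (not the competition data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

model = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42).fit(X, y)

# Round trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

same = np.array_equal(model.predict(X), loaded.predict(X))
print("Predictions identical after reload:", same)
```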

# Conclusion

It is possible that **more features or more advanced machine learning algorithms** could improve the accuracy of the models. Additionally, it may be useful to **investigate other factors** that contribute to the occurrence of freezing of gait events, such as cognitive or environmental factors.

I am a medical doctor working on **artificial intelligence (AI) for medicine**. AI is now widely used in the medical field, in particular for the following tasks: **image classification, object detection, semantic segmentation, GANs, text classification, etc**. **If you are interested in AI for medicine, please see my other notebooks.**