# MJF-1B: Parkinson's Freezing of Gait 
Link to the Kaggle competition dataset and information:
- https://www.kaggle.com/competitions/tlvmc-parkinsons-freezing-gait-prediction/data
## Objective:
- Detect the start and stop of each freezing episode, and identify which of three types of freezing of gait (FoG) events occurs in each series:
  - Start Hesitation
  - Turn
  - Walking

## File and Field Description
- `train/`: Folder containing the data series in the training set, within three subfolders: `tdcsfog/`, `defog/`, and `notype/`.
- Series in the `notype/` folder are from the defog dataset but lack event-type annotations.
- The fields present in these series vary by folder.
  - *Time*: An integer timestep. Series from the tdcsfog dataset are recorded at 128 Hz (128 timesteps per second), while series from the defog and daily series are recorded at 100 Hz (100 timesteps per second).
  - *AccV*, *AccML*, and *AccAP*: Acceleration from a lower-back sensor on three axes: V - vertical, ML - mediolateral, AP - anteroposterior. Data is in units of m/s^2 for `tdcsfog/` and g for `defog/` and `notype/`.
  - *StartHesitation*, *Turn*, *Walking*: Indicator variables for the occurrence of each of the event types.
  - *Event*: Indicator variable for the occurrence of any FoG-type event. Present only in the notype series, which lack type-level annotations.
  - *Valid*: During video annotation, it was sometimes hard for the annotator to decide whether the subject had an akinetic (i.e., essentially no movement) FoG or had stopped voluntarily. Only event annotations where this field is true should be considered unambiguous.
  - *Task*: Series were only annotated where this value is true. Portions marked false should be considered unannotated.
    
- Note that the *Valid* and *Task* fields are only present in the defog dataset (including the notype series, which come from it). They are not relevant for the tdcsfog data.
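Because the sampling rate and acceleration units differ by folder, it can help to put the series on a common footing before comparing them. The snippet below is a minimal sketch of that idea, not part of the original notebook: `harmonize`, `TimeSec`, and the `source` argument are hypothetical names, the demo values are made up, and the notype rate is assumed to match defog because those series come from the defog dataset.

```python
import pandas as pd

# Sampling rates (Hz) as stated in the field description above;
# the notype rate is an assumption (notype series come from the defog dataset)
SAMPLE_RATE_HZ = {'tdcsfog': 128, 'defog': 100, 'notype': 100}
G_IN_MS2 = 9.80665  # standard gravity, used to convert m/s^2 to g

def harmonize(df, source):
    """Return a copy with a TimeSec column and accelerations expressed in g."""
    out = df.copy()
    # Convert integer timesteps to seconds using the folder's sampling rate
    out['TimeSec'] = out['Time'] / SAMPLE_RATE_HZ[source]
    # Only tdcsfog is recorded in m/s^2; defog and notype are already in g
    if source == 'tdcsfog':
        acc_cols = ['AccV', 'AccML', 'AccAP']
        out[acc_cols] = out[acc_cols] / G_IN_MS2
    return out

# Tiny illustrative frame (made-up values, not competition data)
demo = pd.DataFrame({'Time': [0, 128, 256],
                     'AccV': [-9.80665, -9.80665, -9.80665],
                     'AccML': [0.0, 0.0, 0.0],
                     'AccAP': [0.0, 0.0, 0.0]})
demo_g = harmonize(demo, 'tdcsfog')
print(demo_g)
```

Whether to convert to g or to m/s^2 is a design choice; what matters is that all three dataframes end up in the same unit.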

# Data Exploration 
Objectives:
- Set up the IDE and install the relevant Python libraries.
- Load and inspect the three training sets from the Parkinson's FoG Kaggle folder.
- Standardize categorical variables, and normalize numerical features.
- Understand the structure, types of sensors, and data format.

In [50]:
# Disable IPython history so the notebook does not try to write to a read-only history database
%config HistoryManager.enabled = False

## Import Python Libraries

In [51]:
# Import necessary Python libraries
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Inspect Dataset Contents

In [52]:
# List all relevant files, folders, and subfolders 
all_files = os.listdir('../input/tlvmc-parkinsons-freezing-gait-prediction')
print('All competition datasets:')
print(all_files)

train_files = os.listdir('../input/tlvmc-parkinsons-freezing-gait-prediction/train')
print('\nFolders in train:')
print(train_files)

defog_path = '../input/tlvmc-parkinsons-freezing-gait-prediction/train/defog'
defog_files = os.listdir(defog_path)
print('\nFiles in defog:')
print(f'{defog_files[:10]}... plus {len(defog_files)-10} more remaining csv files')

tdcsfog_path = '../input/tlvmc-parkinsons-freezing-gait-prediction/train/tdcsfog'
tdcsfog_files = os.listdir(tdcsfog_path)
print('\nFiles in tdcsfog:')
print(f'{tdcsfog_files[:10]}... plus {len(tdcsfog_files)-10} more remaining csv files')

notype_path = '../input/tlvmc-parkinsons-freezing-gait-prediction/train/notype'
notype_files = os.listdir(notype_path)
print('\nFiles in notype:')
print(f'{notype_files[:10]}... plus {len(notype_files)-10} more remaining csv files')


All competition datasets:
['sample_submission.csv', 'unlabeled', 'subjects.csv', 'tasks.csv', 'defog_metadata.csv', 'daily_metadata.csv', 'test', 'events.csv', 'tdcsfog_metadata.csv', 'train']

Folders in train:
['defog', 'tdcsfog', 'notype']

Files in defog:
['be9d33541d.csv', '4c3aa8ea6e.csv', '18e7abc37e.csv', '6a20935af5.csv', 'e642d9ea5f.csv', '3f3b08f78d.csv', '68e7e02a47.csv', 'f17eacf7d8.csv', '3f970065e5.csv', '7030643376.csv']... plus 81 more remaining csv files

Files in tdcsfog:
['a171e61840.csv', '4171ea3a0c.csv', '0f985a8440.csv', '5d320ade20.csv', 'ae8c67086b.csv', 'b7214cbf21.csv', 'e18fcafee8.csv', '79568b8e25.csv', 'feba449e1a.csv', '7ebad45aec.csv']... plus 823 more remaining csv files

Files in notype:
['1e8d55d48d.csv', '89e9ed32d1.csv', 'e5a0e226fe.csv', '1b3bc93401.csv', '34b979fc28.csv', '9cd837fd53.csv', '60f28aa837.csv', '02ab235146.csv', '6a886a3bb8.csv', '339c0cc15f.csv']... plus 36 more remaining csv files
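The three listings above repeat the same steps for each folder. As a sketch only, they could be wrapped in a small helper; `summarize_folder` is a hypothetical name, and the demo runs on a temporary folder with made-up file names rather than the competition data. Unlike `len(files)-10`, the helper never reports a negative remainder for folders with fewer files than the preview size.

```python
import os
import tempfile

def summarize_folder(path, n=10):
    """Return (first n CSV file names, number of CSV files not shown)."""
    csv_files = sorted(f for f in os.listdir(path) if f.endswith('.csv'))
    shown = csv_files[:n]
    remaining = max(len(csv_files) - n, 0)  # never negative for small folders
    return shown, remaining

# Demonstrate on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ['a.csv', 'b.csv', 'c.csv', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()
    shown, remaining = summarize_folder(tmp, n=2)
    print(f'{shown}... plus {remaining} more remaining csv files')
```

Note that this helper sorts the file names, so the preview order would differ from the unsorted `os.listdir` output shown above.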


## Combine CSV Files into a Single Dataset
- Dataframes for: defog, tdcsfog, notype
- Add a *PatientID* column holding the name of the source csv file, which uniquely identifies each series

In [53]:
# Function to combine all csv files in a folder, adding a unique identifier column
def CombineDataset(path, df):
    # Collect one dataframe per file and concatenate once at the end;
    # calling pd.concat inside the loop re-copies every accumulated row each time
    frames = [] if df.empty else [df]

    # glob.glob('folder/*.type') -> look inside folder and return all files of .type
    for file in glob.glob(f'{path}/*.csv'):

        # Get the unique identifier of the file (name of the csv file)
        # os.path.basename(...) -> strips the directory, leaving only filename.csv
        # os.path.splitext('filename.csv') -> splits into tuple ('filename', '.csv')
        patientID = os.path.splitext(os.path.basename(file))[0]

        temp_df = pd.read_csv(file)

        # Create a new column for the ID
        temp_df['PatientID'] = patientID
        frames.append(temp_df)

    # Combine into one, final dataframe (return df unchanged if no files were found)
    return pd.concat(frames, ignore_index=True) if frames else df

In [54]:
# Initialize an empty defog dataframe 
defog_df = pd.DataFrame()

# Call the above function to combine all csv files in defog 
defog_df = CombineDataset(defog_path, defog_df)
defog_df.head()

Unnamed: 0,Time,AccV,AccML,AccAP,StartHesitation,Turn,Walking,Valid,Task,PatientID
0,0,-1.002697,0.022371,0.068304,0,0,0,False,False,be9d33541d
1,1,-1.002641,0.019173,0.066162,0,0,0,False,False,be9d33541d
2,2,-0.99982,0.019142,0.067536,0,0,0,False,False,be9d33541d
3,3,-0.998023,0.018378,0.068409,0,0,0,False,False,be9d33541d
4,4,-0.998359,0.016726,0.066448,0,0,0,False,False,be9d33541d


In [55]:
# Initialize an empty tdcsfog dataframe 
tdcsfog_df = pd.DataFrame()

# Call the above function to combine all csv files in tdcsfog 
tdcsfog_df = CombineDataset(tdcsfog_path, tdcsfog_df)
tdcsfog_df.head()

Unnamed: 0,Time,AccV,AccML,AccAP,StartHesitation,Turn,Walking,PatientID
0,0,-9.66589,0.04255,0.184744,0,0,0,a171e61840
1,1,-9.672969,0.049217,0.184644,0,0,0,a171e61840
2,2,-9.67026,0.03362,0.19379,0,0,0,a171e61840
3,3,-9.673356,0.035159,0.184369,0,0,0,a171e61840
4,4,-9.671458,0.043913,0.197814,0,0,0,a171e61840


In [56]:
# Initialize an empty notype dataframe 
notype_df = pd.DataFrame()

# Call the above function to combine all csv files in notype 
notype_df = CombineDataset(notype_path, notype_df)
notype_df.head()

Unnamed: 0,Time,AccV,AccML,AccAP,Event,Valid,Task,PatientID
0,0,-0.991926,-0.119916,0.050087,0,False,False,1e8d55d48d
1,1,-0.994243,-0.118624,0.049909,0,False,False,1e8d55d48d
2,2,-0.99584,-0.118602,0.048774,0,False,False,1e8d55d48d
3,3,-0.995865,-0.121627,0.04809,0,False,False,1e8d55d48d
4,4,-0.99233,-0.122146,0.048878,0,False,False,1e8d55d48d


In [57]:
# Check that PatientID column logic was implemented correctly 
unique_ids = defog_df['PatientID'].unique()
print(unique_ids[:10])

['be9d33541d' '4c3aa8ea6e' '18e7abc37e' '6a20935af5' 'e642d9ea5f'
 '3f3b08f78d' '68e7e02a47' 'f17eacf7d8' '3f970065e5' '7030643376']
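The PatientID values are csv file names, so several of them may belong to the same person. The competition folder listing includes `defog_metadata.csv` and `tdcsfog_metadata.csv`; assuming such a file has an `Id` column matching the file names and a `Subject` column (both column names are assumptions to verify against the actual file), a merge along the following lines would attach a subject to each row. The frames below are made-up stand-ins so the sketch runs on its own.

```python
import pandas as pd

# Made-up stand-ins: in practice series_df would be defog_df and meta would be
# read from the metadata csv. Column names `Id` and `Subject` are assumptions.
series_df = pd.DataFrame({'PatientID': ['s1', 's1', 's2', 's3'],
                          'AccV': [-1.0, -0.99, -1.01, -0.98]})
meta = pd.DataFrame({'Id': ['s1', 's2', 's3'],
                     'Subject': ['p1', 'p1', 'p2']})

# many_to_one: many sensor rows per series, exactly one metadata row per series
merged = series_df.merge(meta, left_on='PatientID', right_on='Id',
                         how='left', validate='many_to_one').drop(columns='Id')

# Number of distinct series recorded for each subject
series_per_subject = merged.groupby('Subject')['PatientID'].nunique()
print(series_per_subject)
```

Grouping by subject rather than by file matters later if the data is split for validation, since series from one person should usually stay on the same side of the split.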


In [58]:
# Shape of each dataframe (rows, col) 
print(f"\nCombined defog shape: ({defog_df.shape[0]} rows, {defog_df.shape[1]} cols)")
print(f"\nCombined tdcsfog shape: ({tdcsfog_df.shape[0]} rows, {tdcsfog_df.shape[1]} cols)")
print(f"\nCombined notype shape: ({notype_df.shape[0]} rows, {notype_df.shape[1]} cols)")


Combined defog shape: (13525702 rows, 10 cols)

Combined tdcsfog shape: (7062672 rows, 8 cols)

Combined notype shape: (10251114 rows, 8 cols)
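Row counts are easier to interpret as recording time. The sketch below divides the row counts printed above by the sampling rates from the field description; it assumes the notype series are sampled at 100 Hz like the rest of the defog dataset.

```python
# Row counts copied from the shapes printed above
rows = {'defog': 13525702, 'tdcsfog': 7062672, 'notype': 10251114}
# Sampling rates in Hz; the notype rate is assumed to match defog
rate_hz = {'defog': 100, 'tdcsfog': 128, 'notype': 100}

# rows / (rows per second) = seconds; seconds / 3600 = hours
hours = {name: rows[name] / rate_hz[name] / 3600 for name in rows}
for name, h in hours.items():
    print(f'{name}: {h:.2f} hours')
```

This works out to roughly 37.6 hours for defog, 15.3 hours for tdcsfog, and 28.5 hours for notype, so defog has far fewer files than tdcsfog but much longer recordings per file.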


## Standardize Categorical Variables
- Convert booleans in *Valid* and *Task* to integers (True -> 1, False -> 0)

In [59]:
# Data types of features 
print(f'DEFOG DATA TYPES:\n{defog_df.dtypes}\n')
print(f'TDCSFOG DATA TYPES:\n{tdcsfog_df.dtypes}\n')
print(f'NOTYPE DATA TYPES:\n{notype_df.dtypes}\n')

DEFOG DATA TYPES:
Time                 int64
AccV               float64
AccML              float64
AccAP              float64
StartHesitation      int64
Turn                 int64
Walking              int64
Valid                 bool
Task                  bool
PatientID           object
dtype: object

TDCSFOG DATA TYPES:
Time                 int64
AccV               float64
AccML              float64
AccAP              float64
StartHesitation      int64
Turn                 int64
Walking              int64
PatientID           object
dtype: object

NOTYPE DATA TYPES:
Time           int64
AccV         float64
AccML        float64
AccAP        float64
Event          int64
Valid           bool
Task            bool
PatientID     object
dtype: object
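With tens of millions of rows stored as `int64` and `float64`, memory use can become a concern. One option, sketched below on made-up data, is to downcast the columns; `downcast` is a hypothetical helper, not part of the original notebook. Note that `float32` keeps only about seven significant digits, so this trades a little precision for space.

```python
import numpy as np
import pandas as pd

def downcast(df):
    """Return a copy with float64 columns as float32 and small-valued int64 columns as int8."""
    out = df.copy()
    for col in out.select_dtypes('float64').columns:
        out[col] = out[col].astype('float32')
    for col in out.select_dtypes('int64').columns:
        # Only shrink columns whose values fit in int8 (e.g. 0/1 indicators);
        # columns such as Time that exceed the range are left as int64
        if out[col].between(-128, 127).all():
            out[col] = out[col].astype('int8')
    return out

# Made-up frame with the same kinds of columns as the combined datasets
demo = pd.DataFrame({'Time': np.arange(1000, dtype='int64'),
                     'AccV': np.full(1000, -1.0),
                     'Turn': np.zeros(1000, dtype='int64')})
small = downcast(demo)
print(small.dtypes)
print(demo.memory_usage(deep=True).sum(), '->', small.memory_usage(deep=True).sum())
```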



In [60]:
# Encode the boolean Valid and Task columns as integers (defog dataset)
defog_df['Valid'] = defog_df['Valid'].astype(int)
defog_df['Task'] = defog_df['Task'].astype(int)

defog_df.head()

Unnamed: 0,Time,AccV,AccML,AccAP,StartHesitation,Turn,Walking,Valid,Task,PatientID
0,0,-1.002697,0.022371,0.068304,0,0,0,0,0,be9d33541d
1,1,-1.002641,0.019173,0.066162,0,0,0,0,0,be9d33541d
2,2,-0.99982,0.019142,0.067536,0,0,0,0,0,be9d33541d
3,3,-0.998023,0.018378,0.068409,0,0,0,0,0,be9d33541d
4,4,-0.998359,0.016726,0.066448,0,0,0,0,0,be9d33541d
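Per the field description, rows are only annotated where *Task* is true, and event labels are only unambiguous where *Valid* is true. A minimal sketch of selecting those rows is shown below; `annotated_unambiguous` is a hypothetical helper and the demo frame is made up. Whether to drop the remaining rows or keep them with a weight is a modelling decision this sketch does not settle.

```python
import pandas as pd

def annotated_unambiguous(df):
    """Keep rows where both Task and Valid are set (works for bool or 0/1 columns)."""
    mask = df['Task'].astype(bool) & df['Valid'].astype(bool)
    return df[mask]

# Made-up frame: only timesteps 1 and 3 have both Task and Valid set
demo = pd.DataFrame({'Time': [0, 1, 2, 3],
                     'Turn': [0, 1, 1, 0],
                     'Valid': [0, 1, 0, 1],
                     'Task': [0, 1, 1, 1]})
kept = annotated_unambiguous(demo)
print(kept)
```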


In [61]:
# Encode the boolean Valid and Task columns as integers (notype dataset)
notype_df['Valid'] = notype_df['Valid'].astype(int)
notype_df['Task'] = notype_df['Task'].astype(int)

notype_df.head()

Unnamed: 0,Time,AccV,AccML,AccAP,Event,Valid,Task,PatientID
0,0,-0.991926,-0.119916,0.050087,0,0,0,1e8d55d48d
1,1,-0.994243,-0.118624,0.049909,0,0,0,1e8d55d48d
2,2,-0.99584,-0.118602,0.048774,0,0,0,1e8d55d48d
3,3,-0.995865,-0.121627,0.04809,0,0,0,1e8d55d48d
4,4,-0.99233,-0.122146,0.048878,0,0,0,1e8d55d48d
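The two encoding cells above name *Valid* and *Task* explicitly. As an alternative sketch, the boolean columns can be found by dtype so the same call works on any of the dataframes, including tdcsfog, which has no boolean columns. `bools_to_int` is a hypothetical helper and the demo frame is made up.

```python
import pandas as pd

def bools_to_int(df):
    """Return a copy with every bool column cast to int64 (True -> 1, False -> 0)."""
    bool_cols = df.select_dtypes(include='bool').columns
    # astype with a column -> dtype mapping leaves all other columns unchanged
    return df.astype({col: 'int64' for col in bool_cols})

demo = pd.DataFrame({'Event': [0, 1],
                     'Valid': [False, True],
                     'Task': [True, True],
                     'PatientID': ['x', 'x']})
encoded = bools_to_int(demo)
print(encoded.dtypes)
```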
