
# Preprocessing & Training

## NeuroSense Analytics 

### v0.1.5

#### Planned Features  

- Dedicated preprocessor class for LSTM model
- LSTM model class for training
- Pipelining preprocessing into training
- Hyperparameter optimisation using Optuna


# Documentation  

## DataPreprocessor  

Creating a `preprocessor` class to automate the import workflow - preprocessor handles data import, datatype casting, one-hot encoding, MICE imputation, joining feature data with survey data and returns a Polars dataframe or NumPy series.  

Output is in the shape of `batch_size`, `time_steps`, `features` for LSTM processing.   

> ### Parameters:  

- `path`: *`{1, 2, 3, 4}`*  
Data path to use (INS_W1 = `1`, INS_W2 = `2`, INS_W3 = `3`, INS_W4 = `4`)  

- `imputer_max_iter`: *`int`, default = `10`*  
Max amount of iterations for IterativeImputer.  

- `imputer_random_state`: *`int`, default = `42`*  
Imputer random state.  

- `nearest_features`: *`int`, default = `None`*  
How many neighbours to sample when imputing.  

- `strategy`: *`{‘mean’, ‘median’, ‘most_frequent’, ‘constant’}`, default = `median`*  
What strategy to use when imputing  

- `impute`: *`bool`, default = `True`*  
Whether or not to run imputation.  

- `exclude_history`: *`bool`, default = `True`*  
Whether or not to exclude 14- and 7-day histories during preprocessing.

> ### Functions:  

- `import_csv_feature_data(csv)`:  
Imports and normalises feature dataset. Returns `pl.DataFrame`.

    > Parameters:  

    - `csv`: *`str`*   
    File name of csv to be processed.  

- `import_csv_survey_data(csv)`:  
Imports and normalises survey dataset. Returns `pl.DataFrame`. 

    > Parameters:  

    - `csv`: *`str`*  
    File name of csv to be processed.  
    
- `import_dep_endterm()`:  
Imports, one-hot encodes and normalises the `dep_endterm` dataset for the selected datapath. Returns `pl.DataFrame`.

- `merge_dataframe(dataframe_1, dataframe_2, join_type)`:  
Merges two dataframes on `join_type`, returns `pl.DataFrame`.  

    > Parameters:  

    - `dataframe_1`: *`pl.DataFrame`*  
    DataFrame to join with. 

    - `dataframe-2`: *`pl.DataFrame`*  
    DataFrame to be joined to `dataframe_1`.  

    - `join_type`: *`{‘inner’, ‘left’, ‘right’, ‘full’, ‘semi’, ‘anti’, ‘cross’}, default="inner"`*  
    Join strategy.
    

> ### Example Usage:   

`preprocessor_INS_W1 = DataPreprocessor(1, imputer_max_iter=20)`


## PreprocessorLSTM

## ModelTraining

# Imports

In [2]:
import numpy as np
import polars as pl
from datetime import date
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from tensorflow.keras.utils import pad_sequences
import logging

# Class Declaration  

## DataPreprocessor

In [3]:
# This script imports CSV files containing feature and survey data, processes them, and prepares them for analysis.
# It also includes functions to load and preprocess the data, including scaling and encoding categorical variables.

# data path for the CSV files
# The data is organized into four directories, each containing feature and survey data.

class DataPreprocessor:
    def __init__(self, path: str, imputer_max_iter: int = 10, imputer_random_state: int = 42, nearest_features: int = None, strategy: str = "mean", only_history: bool = True, impute: bool = True): # Initialize the DataPreprocessor class
        self.only_history = only_history # Include only history data if specified
        self.scaler = MinMaxScaler() # Initialize the MinMaxScaler with specified parameters
        self.impute = impute
        self.imputer = IterativeImputer(max_iter=imputer_max_iter, random_state=imputer_random_state, n_nearest_features=nearest_features, initial_strategy=strategy) # Initialize the IterativeImputer with specified parameters

        match path: ## Set the path based on the input parameter
            case 1:
                self.path = "./csv_data/INS-W_1/"
            case 2:
                self.path = "./csv_data/INS-W_2/"
            case 3:
                self.path = "./csv_data/INS-W_3/"
            case 4:
                self.path = "./csv_data/INS-W_4/"
            case _:
                raise ValueError("Invalid path specified. Please choose a valid path (1, 2, 3, or 4).")

    # Load the CSV files, cast columns to appropriate types, and drop empty columns
    def import_csv_feature_data(self, file_name: str) -> pl.DataFrame:
        try:
            q = (
                pl.scan_csv(self.path + "FeatureData/" + file_name + ".csv")
                .select(pl.col("*"))
                .cast({"date": pl.Date})
                .drop("")
                .with_columns(pl.col("pid").str.replace_all("INS-W_",""))
                .cast({"pid": pl.Int32})
                .select(pl.exclude(pl.String))
            )
            data = q.collect() # Collect the lazy frame into a DataFrame

            if self.impute:
                if self.only_history: # If only history data is to be included
                    data = data.select(pl.col(["pid","date","^.*14dhist$","^.*7dhist$"]))
                scaled_data = pl.from_numpy( # Convert to numpy array for scaling
                        self.scaler.fit_transform(data.select(pl.exclude([pl.Date, pl.Int32]))), schema=data.select(pl.exclude([pl.Date, pl.Int32])).columns # min max scaling on all columns except date and pid
                    )
                try:
                    self.imputer.fit(scaled_data) # Fit the imputer to the scaled data
                    imputed_data = pl.from_numpy(
                        self.imputer.transform(scaled_data), schema=data.select(pl.exclude(["pid","date"])).columns # Transform the scaled data using the imputer
                    )
                    data = data.select(["pid","date"])
                    data = data.hstack(imputed_data) # Add imputed data back to the DataFrame
                    del imputed_data # Delete the imputed data variable to free up memory
                except: 
                    print("Error in imputation, returning scaled data without imputation.")
                    return data
                return data
            return data
        
        except Exception as e:
            print(f"Error importing feature data from {self.path + 'FeatureData/' + file_name}: {e}")
            return pl.DataFrame()

    def import_csv_survey_data(self, file_name: str) -> pl.DataFrame:
        try: # Load survey data from CSV file
            q = (
                pl.scan_csv(self.path + "SurveyData/" + file_name + ".csv")
                .select(pl.col("*"))
                .cast({"date": pl.Date})
                .drop("")
                .with_columns(pl.col("pid").str.replace_all("INS-W_",""))
                .cast({"pid": pl.Int32})
            )
            data = q.collect()
            match file_name:
                case "ema":
                    survey_data = data.select(pl.exclude(["pid","date"])) # Convert to numpy array for scaling
                    scaled_data = pl.from_numpy(
                        self.scaler.fit_transform(survey_data), schema=survey_data.columns
                    )
                    data = data.select(["pid","date"])
                    data = data.hstack(scaled_data)
                    del scaled_data
                    return data
                case "post":
                    survey_data = data.select(pl.exclude(["pid","date"])) # Convert to numpy array for scaling
                    scaled_data = pl.from_numpy(
                        self.scaler.fit_transform(survey_data), schema=survey_data.columns
                    )
                    data = data.select(["pid","date"])
                    data = data.hstack(scaled_data)
                    del scaled_data
                    return data
                case "pre":
                    survey_data = data.select(pl.exclude(["pid","date"]))
                    scaled_data = pl.from_numpy(
                        self.scaler.fit_transform(survey_data), schema=survey_data.columns
                    )
                    data = data.select(["pid","date"])
                    data = data.hstack(scaled_data)
                    del scaled_data
                    return data
        except Exception as e:
            print(f"Error importing survey data from {self.path + 'SurveyData/' + file_name}: {e}")
            return pl.DataFrame()

    def import_dep_endterm(self) -> pl.DataFrame:
        try:
            q = (
                pl.scan_csv(self.path + "SurveyData/dep_endterm.csv")
                .select(pl.col("*"))
                .cast({"date": pl.Date})
                .drop("")
                .with_columns(pl.col("pid").str.replace_all("INS-W_",""))
                .cast({"pid": pl.Int32})    
                )
            data = q.collect()
            bdi2 = data.select(pl.exclude(["pid","date", "dep"]))
            data_scaled = pl.from_numpy(
                self.scaler.fit_transform(bdi2), schema=bdi2.columns # min max scaling on all columns except date and pid
            )
            data = data.select(["pid","date", "dep"])
            data = data.hstack(data_scaled) # Add scaled data back to the DataFrame
            del data_scaled # Delete the scaled data variable to free up memory
            data = data.to_dummies("dep")
            return data
        except Exception as e:
            print(f"Error importing endterm data from {self.path + 'SurveyData/dep_endterm.csv'}: {e}")
            return pl.DataFrame()

    def merge_dataframe(self, dataframe_1: pl.DataFrame, dataframe_2: pl.DataFrame, join_type: str = "inner") -> pl.DataFrame:
        try:
            merged_data = dataframe_1.join(dataframe_2, on=["pid", "date"], how=join_type) # Merge feature and survey data on 'pid'
            return merged_data
        except Exception as e:
            print(f"Error merging feature and survey data: {e}")
            return pl.DataFrame()
    
    def import_dep_weekly(self) -> pl.DataFrame:
        try:
            q = (
                pl.scan_csv(self.path + "SurveyData/dep_weekly.csv")
                .select(pl.col("*"))
                .cast({"date": pl.Date})
                .drop("")
                .with_columns(pl.col("pid").str.replace_all("INS-W_",""))
                .cast({"pid": pl.Int32})
                .select(pl.col(["pid","date","dep_weeklysubscale_endterm_merged"]))
                )
            data = q.collect()
            data = data.to_dummies("dep_weeklysubscale_endterm_merged") # Convert categorical variable to dummy variables
            return data
        
        except Exception as e:
            print(f"Error importing weekly data from {self.path + 'SurveyData/dep_weekly.csv'}: {e}")
            return pl.DataFrame()
        
    def merge_on_date(self, dataframe_1: pl.DataFrame, dataframe_2: pl.DataFrame) -> pl.DataFrame:
        try:
            merged_data = dataframe_1.join(dataframe_2, on=["date"], how="inner") # Merge feature and survey data on 'date'
            return merged_data
        except Exception as e:
            print(f"Error merging dataframes on date: {e}")
            return pl.DataFrame()

## ModelLSTM

In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Masking

class ModelLSTM:
    def __init__(self): # Initialize the PreprocessorLSTM class
        self.logger = logging.getLogger(__name__) # Initialize the logger for the PreprocessorLSTM class
        logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') # Set the logging level and format
        self.logger.info("PreprocessorLSTM initialized") # Log the initialization of the PreprocessorLSTM class
        self.padded_sequences = None # Initialize the padded sequences variable

        
    def select_features(self, data: pl.DataFrame, features: list) -> pl.DataFrame:
        try:
            selected_data = data.select(pl.col(features)) # Select specified features from the DataFrame
            self.logger.info(f"Selected features: {features}")
            return selected_data
        except Exception as e:
            self.logger.error(f"Error selecting features: {e}")
            return pl.DataFrame()


    def create_padded_sequences(self, data: pl.DataFrame) -> np.ndarray:
        self.sequences = data.rows()
        try:
            self.padded_sequences = pad_sequences(
                self.sequences, padding='post', dtype='float32'
            )
            self.logger.info(f"Created padded sequences with shape: {self.padded_sequences.shape}")
            return self.padded_sequences
        
        except Exception as e:
            self.logger.error(f"Error creating padded sequences: {e}")
            return np.array([])
    

    def split_data(self, padded_sequences: np.ndarray, test_size: float = 0.2) -> tuple:
        try:
            self.participant_ids = np.arange(len(padded_sequences))  # [0, 1, 2, ...]
            self.train_ids, test_ids = train_test_split(self.participant_ids, test_size=test_size, shuffle=False) # Split the data into training and testing sets
            self.X_train = padded_sequences[self.train_ids]
            self.X_test = padded_sequences[test_ids]
            self.y_train = np.array([seq[1:, :] for seq in self.X_train])  # Shift by 1 time step
            self.y_test = np.array([seq[1:, :] for seq in self.X_test])
            # Logging
            self.logger.info(f"Split data into train and test sets with sizes: {len(self.X_train)}, {len(self.X_test)}")
            self.logger.info(f"Train and test data shapes: {self.X_train.shape}, {self.X_test.shape}")
            self.logger.info(f"Train and test labels shapes: {self.y_train.shape}, {self.y_test.shape}")
            return self.X_train, self.X_test, self.y_train, self.y_test
        
        except Exception as e:
            self.logger.error(f"Error splitting data: {e}")
            return None, None, None, None
    
    
    def build_lstm_model(self, n_features):
        try:
            inputs = Input(shape=(None, n_features))  # Variable-length sequences
            x = Masking(mask_value=0.0)(inputs)      # Ignore padded zeros
            x = LSTM(64, return_sequences=True)(x)  # Process entire sequence
            outputs = Dense(n_features)(x)           # Predict all features
            model = Model(inputs, outputs)
            model.compile(optimizer="adam", loss="mse")
            self.logger.info("LSTM model built successfully")
            self.logger.info(f"Model summary: {model.summary()}")
            return model
        
        except Exception as e:
            self.logger.error(f"Error building LSTM model: {e}")
            return None
            

In [None]:
TestModel = ModelLSTM() # Create an instance of the ModelLSTM class
model = TestModel.build_lstm_model(n_features=2)
model.fit(TestModel.X_train, TestModel.y_train, epochs=10, validation_data=(TestModel.X_test, TestModel.y_test)) # Since x_train and y_train are initialized in the class we can use them directly

# INS-W_1

In [47]:
PreprocessorLSTM = PreprocessorLSTM()

2025-05-12 10:47:43,105 - INFO - PreprocessorLSTM initialized


In [6]:
preprocessor_INS_W1 = DataPreprocessor(1, imputer_max_iter=10, nearest_features=10)

# Testing, ignore this section

In [7]:
rapids_1 = preprocessor_INS_W1.import_csv_feature_data("rapids")



In [8]:
dep_weekly_1 = preprocessor_INS_W1.import_dep_weekly()

In [6]:
dep_weekly_1.sample(10)

pid,date,dep_weeklysubscale_endterm_merged_false,dep_weeklysubscale_endterm_merged_true
i32,date,u8,u8
20,2018-05-09,1,0
198,2018-04-18,1,0
168,2018-05-13,1,0
163,2018-05-20,1,0
44,2018-05-02,1,0
121,2018-05-13,1,0
177,2018-05-30,1,0
13,2018-04-18,1,0
38,2018-04-11,0,1
44,2018-04-04,1,0


In [9]:
rapids_weekly = dep_weekly_1.join(rapids_1, on=["pid","date"], how="left")

In [31]:
rapids_weekly_float = rapids_weekly.select(pl.exclude(pl.Date))

In [32]:
rapids_weekly_float.head()

pid,dep_weeklysubscale_endterm_merged_false,dep_weeklysubscale_endterm_merged_true,f_slp:fitbit_sleep_summary_rapids_sumdurationafterwakeupmain:14dhist,f_slp:fitbit_sleep_summary_rapids_sumdurationasleepmain:14dhist,f_slp:fitbit_sleep_summary_rapids_sumdurationawakemain:14dhist,f_slp:fitbit_sleep_summary_rapids_sumdurationtofallasleepmain:14dhist,f_slp:fitbit_sleep_summary_rapids_sumdurationinbedmain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgefficiencymain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationafterwakeupmain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationasleepmain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationawakemain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationtofallasleepmain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationinbedmain:14dhist,f_slp:fitbit_sleep_summary_rapids_countepisodemain:14dhist,f_slp:fitbit_sleep_summary_rapids_firstbedtimemain:14dhist,f_slp:fitbit_sleep_summary_rapids_lastbedtimemain:14dhist,f_slp:fitbit_sleep_summary_rapids_firstwaketimemain:14dhist,f_slp:fitbit_sleep_summary_rapids_lastwaketimemain:14dhist,f_slp:fitbit_sleep_intraday_rapids_avgdurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_avgdurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_maxdurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_maxdurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_sumdurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_sumdurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_countepisodeasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_countepisodeawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_stddurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_stddurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_mindurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_mindurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_mediandurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_mediandurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_ratiocountasleepunifiedwithinmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_ratiocountawakeunifiedwithinmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_ratiodurationasleepunifiedwithinmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_ratiodurationawakeunifiedwithinmain:14dhist,…,f_loc:phone_locations_barnett_probpause_norm:7dhist,f_loc:phone_locations_barnett_rog_norm:7dhist,f_loc:phone_locations_barnett_siglocentropy_norm:7dhist,f_loc:phone_locations_barnett_siglocsvisited_norm:7dhist,f_loc:phone_locations_barnett_stdflightdur_norm:7dhist,f_loc:phone_locations_barnett_stdflightlen_norm:7dhist,f_loc:phone_locations_barnett_wkenddayrtn_norm:7dhist,f_loc:phone_locations_doryab_avglengthstayatclusters_norm:7dhist,f_loc:phone_locations_doryab_avgspeed_norm:7dhist,f_loc:phone_locations_doryab_homelabel_norm:7dhist,f_loc:phone_locations_doryab_locationentropy_norm:7dhist,f_loc:phone_locations_doryab_locationvariance_norm:7dhist,f_loc:phone_locations_doryab_loglocationvariance_norm:7dhist,f_loc:phone_locations_doryab_maxlengthstayatclusters_norm:7dhist,f_loc:phone_locations_doryab_minlengthstayatclusters_norm:7dhist,f_loc:phone_locations_doryab_movingtostaticratio_norm:7dhist,f_loc:phone_locations_doryab_normalizedlocationentropy_norm:7dhist,f_loc:phone_locations_doryab_numberlocationtransitions_norm:7dhist,f_loc:phone_locations_doryab_numberofsignificantplaces_norm:7dhist,f_loc:phone_locations_doryab_outlierstimepercent_norm:7dhist,f_loc:phone_locations_doryab_radiusgyration_norm:7dhist,f_loc:phone_locations_doryab_stdlengthstayatclusters_norm:7dhist,f_loc:phone_locations_doryab_timeathome_norm:7dhist,f_loc:phone_locations_doryab_timeattop1location_norm:7dhist,f_loc:phone_locations_doryab_timeattop2location_norm:7dhist,f_loc:phone_locations_doryab_timeattop3location_norm:7dhist,f_loc:phone_locations_doryab_totaldistance_norm:7dhist,f_loc:phone_locations_doryab_varspeed_norm:7dhist,f_loc:phone_locations_locmap_duration_in_locmap_study_norm:7dhist,f_loc:phone_locations_locmap_percent_in_locmap_study_norm:7dhist,f_loc:phone_locations_locmap_duration_in_locmap_exercise_norm:7dhist,f_loc:phone_locations_locmap_percent_in_locmap_exercise_norm:7dhist,f_loc:phone_locations_locmap_duration_in_locmap_greens_norm:7dhist,f_loc:phone_locations_locmap_percent_in_locmap_greens_norm:7dhist,f_wifi:phone_wifi_connected_rapids_countscans_norm:7dhist,f_wifi:phone_wifi_connected_rapids_uniquedevices_norm:7dhist,f_wifi:phone_wifi_connected_rapids_countscansmostuniquedevice_norm:7dhist
i32,u8,u8,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
1,1,0,0.0,0.093041,0.016954,0.0,0.092651,0.867925,0.0,0.490092,0.097161,0.0,0.509579,0.133333,0.664553,0.687161,0.713582,0.603806,0.123021,0.140463,0.253021,0.124989,0.11734,0.021833,0.271134,0.017449,0.144736,0.1521,0.026507,0.10324,0.043685,0.112673,0.102509,0.905913,0.871094,0.119761,…,0.609419,0.518666,0.459295,0.472384,0.383791,0.448952,0.438403,0.348326,0.513956,0.0,0.497736,0.534031,0.551363,0.536144,0.403216,0.462368,0.511755,0.458369,0.407266,0.317716,0.516602,0.402223,0.465916,0.547802,0.439564,0.378483,0.438629,0.317451,0.376704,0.385906,0.378761,0.398675,0.274647,0.32547,0.336787,0.348105,0.316142
1,1,0,0.01182,0.339687,0.049046,0.0,0.333914,0.893082,0.022523,0.596433,0.093691,0.0,0.612175,0.4,0.634249,0.699096,0.764706,0.661246,0.121998,0.135132,0.345157,0.141564,0.337833,0.052292,0.287451,0.1669,0.151455,0.145222,0.017244,0.101366,0.043466,0.096922,0.098912,0.905287,0.891066,0.094509,…,0.627651,0.517038,0.465076,0.543568,0.362298,0.447053,0.460403,0.368515,0.513698,0.0,0.483816,0.525127,0.549962,0.617473,0.394272,0.478384,0.495894,0.453336,0.407757,0.282597,0.513793,0.430388,0.54416,0.616798,0.415924,0.383443,0.443552,0.311207,0.408718,0.404391,0.3987,0.417286,0.345124,0.359358,0.421382,0.416981,0.416434
1,1,0,0.01182,0.532288,0.065698,0.0,0.518635,0.907757,0.015015,0.623074,0.083666,0.0,0.633887,0.6,0.558844,0.699096,0.75435,0.649611,0.115555,0.112165,0.350906,0.144901,0.53306,0.064957,0.305628,0.268474,0.14422,0.12541,0.011267,0.100188,0.04365,0.07983,0.094605,0.904736,0.906447,0.083229,…,0.641934,0.517407,0.469908,0.583079,0.350798,0.446441,0.473107,0.379823,0.512659,0.0,0.477636,0.521019,0.552425,0.656072,0.388106,0.482036,0.489459,0.452908,0.407238,0.270708,0.512939,0.446661,0.589792,0.649832,0.407105,0.385448,0.447612,0.309423,0.42659,0.413919,0.413205,0.426701,0.389336,0.371606,0.469639,0.462706,0.479704
1,1,0,0.052009,0.782445,0.09779,0.0,0.765703,0.907112,0.045738,0.634082,0.086217,0.0,0.647903,0.866667,0.424595,0.746474,0.741276,0.634921,0.118672,0.124146,0.271186,0.195402,0.772186,0.098937,0.358209,0.359914,0.139153,0.168044,0.010101,0.1,0.037363,0.076923,0.08607,0.91393,0.897756,0.102244,…,0.793154,0.43167,0.794586,0.499424,0.237571,0.299051,0.553364,0.414056,0.401915,0.0,0.389542,0.45987,0.473004,0.623381,0.388765,0.551272,0.526318,0.410903,0.328469,0.180639,0.43846,0.462924,0.530929,0.610621,0.382734,0.38056,0.376078,0.291474,0.415427,0.478331,0.3796,0.435981,0.361637,0.377584,0.49393,0.543688,0.402526
1,1,0,0.040189,0.714734,0.091129,0.0,0.701614,0.904088,0.038288,0.627477,0.08704,0.0,0.643146,0.8,0.424595,0.746474,0.713084,0.603247,0.124494,0.13169,0.367666,0.195402,0.771497,0.098664,0.341151,0.338362,0.147807,0.187949,0.010101,0.1,0.03956,0.076923,0.092008,0.907992,0.897936,0.102064,…,0.801192,0.612689,0.589495,0.797588,0.233678,0.535339,0.485436,0.366904,0.532871,0.0,0.551965,0.557861,0.655238,0.67948,0.353477,0.469403,0.513295,0.492052,0.420313,0.368434,0.561858,0.454755,0.694326,0.615664,0.439867,0.422545,0.524813,0.324633,0.437919,0.4458,0.384594,0.4275,0.472736,0.393942,0.566841,0.612135,0.594258


In [48]:
rapids_padded = PreprocessorLSTM.create_padded_sequences(rapids_weekly_float)

2025-05-12 10:47:48,102 - INFO - Created padded sequences with shape: (2360, 819)


In [53]:
print(np.arange(len(rapids_padded)).shape)

(2360,)


In [49]:
x_train, x_test, y_train, y_test = PreprocessorLSTM.split_data(rapids_padded)

2025-05-12 10:47:54,370 - ERROR - Error splitting data: too many indices for array: array is 1-dimensional, but 2 were indexed


In [11]:
rapids_weekly.head(10)

pid,date,dep_weeklysubscale_endterm_merged_false,dep_weeklysubscale_endterm_merged_true,f_slp:fitbit_sleep_summary_rapids_sumdurationafterwakeupmain:14dhist,f_slp:fitbit_sleep_summary_rapids_sumdurationasleepmain:14dhist,f_slp:fitbit_sleep_summary_rapids_sumdurationawakemain:14dhist,f_slp:fitbit_sleep_summary_rapids_sumdurationtofallasleepmain:14dhist,f_slp:fitbit_sleep_summary_rapids_sumdurationinbedmain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgefficiencymain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationafterwakeupmain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationasleepmain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationawakemain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationtofallasleepmain:14dhist,f_slp:fitbit_sleep_summary_rapids_avgdurationinbedmain:14dhist,f_slp:fitbit_sleep_summary_rapids_countepisodemain:14dhist,f_slp:fitbit_sleep_summary_rapids_firstbedtimemain:14dhist,f_slp:fitbit_sleep_summary_rapids_lastbedtimemain:14dhist,f_slp:fitbit_sleep_summary_rapids_firstwaketimemain:14dhist,f_slp:fitbit_sleep_summary_rapids_lastwaketimemain:14dhist,f_slp:fitbit_sleep_intraday_rapids_avgdurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_avgdurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_maxdurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_maxdurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_sumdurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_sumdurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_countepisodeasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_countepisodeawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_stddurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_stddurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_mindurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_mindurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_mediandurationasleepunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_mediandurationawakeunifiedmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_ratiocountasleepunifiedwithinmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_ratiocountawakeunifiedwithinmain:14dhist,f_slp:fitbit_sleep_intraday_rapids_ratiodurationasleepunifiedwithinmain:14dhist,…,f_loc:phone_locations_barnett_probpause_norm:7dhist,f_loc:phone_locations_barnett_rog_norm:7dhist,f_loc:phone_locations_barnett_siglocentropy_norm:7dhist,f_loc:phone_locations_barnett_siglocsvisited_norm:7dhist,f_loc:phone_locations_barnett_stdflightdur_norm:7dhist,f_loc:phone_locations_barnett_stdflightlen_norm:7dhist,f_loc:phone_locations_barnett_wkenddayrtn_norm:7dhist,f_loc:phone_locations_doryab_avglengthstayatclusters_norm:7dhist,f_loc:phone_locations_doryab_avgspeed_norm:7dhist,f_loc:phone_locations_doryab_homelabel_norm:7dhist,f_loc:phone_locations_doryab_locationentropy_norm:7dhist,f_loc:phone_locations_doryab_locationvariance_norm:7dhist,f_loc:phone_locations_doryab_loglocationvariance_norm:7dhist,f_loc:phone_locations_doryab_maxlengthstayatclusters_norm:7dhist,f_loc:phone_locations_doryab_minlengthstayatclusters_norm:7dhist,f_loc:phone_locations_doryab_movingtostaticratio_norm:7dhist,f_loc:phone_locations_doryab_normalizedlocationentropy_norm:7dhist,f_loc:phone_locations_doryab_numberlocationtransitions_norm:7dhist,f_loc:phone_locations_doryab_numberofsignificantplaces_norm:7dhist,f_loc:phone_locations_doryab_outlierstimepercent_norm:7dhist,f_loc:phone_locations_doryab_radiusgyration_norm:7dhist,f_loc:phone_locations_doryab_stdlengthstayatclusters_norm:7dhist,f_loc:phone_locations_doryab_timeathome_norm:7dhist,f_loc:phone_locations_doryab_timeattop1location_norm:7dhist,f_loc:phone_locations_doryab_timeattop2location_norm:7dhist,f_loc:phone_locations_doryab_timeattop3location_norm:7dhist,f_loc:phone_locations_doryab_totaldistance_norm:7dhist,f_loc:phone_locations_doryab_varspeed_norm:7dhist,f_loc:phone_locations_locmap_duration_in_locmap_study_norm:7dhist,f_loc:phone_locations_locmap_percent_in_locmap_study_norm:7dhist,f_loc:phone_locations_locmap_duration_in_locmap_exercise_norm:7dhist,f_loc:phone_locations_locmap_percent_in_locmap_exercise_norm:7dhist,f_loc:phone_locations_locmap_duration_in_locmap_greens_norm:7dhist,f_loc:phone_locations_locmap_percent_in_locmap_greens_norm:7dhist,f_wifi:phone_wifi_connected_rapids_countscans_norm:7dhist,f_wifi:phone_wifi_connected_rapids_uniquedevices_norm:7dhist,f_wifi:phone_wifi_connected_rapids_countscansmostuniquedevice_norm:7dhist
i32,date,u8,u8,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
1,2018-04-04,1,0,0.0,0.093041,0.016954,0.0,0.092651,0.867925,0.0,0.490092,0.097161,0.0,0.509579,0.133333,0.664553,0.687161,0.713582,0.603806,0.123021,0.140463,0.253021,0.124989,0.11734,0.021833,0.271134,0.017449,0.144736,0.1521,0.026507,0.10324,0.043685,0.112673,0.102509,0.905913,0.871094,…,0.609419,0.518666,0.459295,0.472384,0.383791,0.448952,0.438403,0.348326,0.513956,0.0,0.497736,0.534031,0.551363,0.536144,0.403216,0.462368,0.511755,0.458369,0.407266,0.317716,0.516602,0.402223,0.465916,0.547802,0.439564,0.378483,0.438629,0.317451,0.376704,0.385906,0.378761,0.398675,0.274647,0.32547,0.336787,0.348105,0.316142
1,2018-04-08,1,0,0.01182,0.339687,0.049046,0.0,0.333914,0.893082,0.022523,0.596433,0.093691,0.0,0.612175,0.4,0.634249,0.699096,0.764706,0.661246,0.121998,0.135132,0.345157,0.141564,0.337833,0.052292,0.287451,0.1669,0.151455,0.145222,0.017244,0.101366,0.043466,0.096922,0.098912,0.905287,0.891066,…,0.627651,0.517038,0.465076,0.543568,0.362298,0.447053,0.460403,0.368515,0.513698,0.0,0.483816,0.525127,0.549962,0.617473,0.394272,0.478384,0.495894,0.453336,0.407757,0.282597,0.513793,0.430388,0.54416,0.616798,0.415924,0.383443,0.443552,0.311207,0.408718,0.404391,0.3987,0.417286,0.345124,0.359358,0.421382,0.416981,0.416434
1,2018-04-11,1,0,0.01182,0.532288,0.065698,0.0,0.518635,0.907757,0.015015,0.623074,0.083666,0.0,0.633887,0.6,0.558844,0.699096,0.75435,0.649611,0.115555,0.112165,0.350906,0.144901,0.53306,0.064957,0.305628,0.268474,0.14422,0.12541,0.011267,0.100188,0.04365,0.07983,0.094605,0.904736,0.906447,…,0.641934,0.517407,0.469908,0.583079,0.350798,0.446441,0.473107,0.379823,0.512659,0.0,0.477636,0.521019,0.552425,0.656072,0.388106,0.482036,0.489459,0.452908,0.407238,0.270708,0.512939,0.446661,0.589792,0.649832,0.407105,0.385448,0.447612,0.309423,0.42659,0.413919,0.413205,0.426701,0.389336,0.371606,0.469639,0.462706,0.479704
1,2018-04-18,1,0,0.052009,0.782445,0.09779,0.0,0.765703,0.907112,0.045738,0.634082,0.086217,0.0,0.647903,0.866667,0.424595,0.746474,0.741276,0.634921,0.118672,0.124146,0.271186,0.195402,0.772186,0.098937,0.358209,0.359914,0.139153,0.168044,0.010101,0.1,0.037363,0.076923,0.08607,0.91393,0.897756,…,0.793154,0.43167,0.794586,0.499424,0.237571,0.299051,0.553364,0.414056,0.401915,0.0,0.389542,0.45987,0.473004,0.623381,0.388765,0.551272,0.526318,0.410903,0.328469,0.180639,0.43846,0.462924,0.530929,0.610621,0.382734,0.38056,0.376078,0.291474,0.415427,0.478331,0.3796,0.435981,0.361637,0.377584,0.49393,0.543688,0.402526
1,2018-04-22,1,0,0.040189,0.714734,0.091129,0.0,0.701614,0.904088,0.038288,0.627477,0.08704,0.0,0.643146,0.8,0.424595,0.746474,0.713084,0.603247,0.124494,0.13169,0.367666,0.195402,0.771497,0.098664,0.341151,0.338362,0.147807,0.187949,0.010101,0.1,0.03956,0.076923,0.092008,0.907992,0.897936,…,0.801192,0.612689,0.589495,0.797588,0.233678,0.535339,0.485436,0.366904,0.532871,0.0,0.551965,0.557861,0.655238,0.67948,0.353477,0.469403,0.513295,0.492052,0.420313,0.368434,0.561858,0.454755,0.694326,0.615664,0.439867,0.422545,0.524813,0.324633,0.437919,0.4458,0.384594,0.4275,0.472736,0.393942,0.566841,0.612135,0.594258
1,2018-05-02,1,0,0.023641,0.711724,0.139873,0.0,0.717404,0.856918,0.022523,0.624835,0.133596,0.0,0.65762,0.8,0.536998,0.780108,0.736654,0.629729,0.104387,0.154175,0.367666,0.183908,0.719667,0.127283,0.379531,0.372845,0.132202,0.199157,0.00505,0.1,0.028571,0.076923,0.096391,0.903609,0.861789,…,0.7435,0.435432,0.410851,0.598812,0.282515,0.308321,0.454431,0.425334,0.423559,0.0,0.354951,0.50104,0.562604,0.752148,0.331481,0.490942,0.450422,0.417509,0.401858,0.285897,0.495539,0.499238,0.583442,0.626098,0.612331,0.261141,0.419363,0.252539,0.535506,0.444833,0.293089,0.33344,0.880853,0.582071,0.680646,0.665372,0.673207
1,2018-05-09,1,0,0.016548,0.75185,0.125946,0.0,0.746198,0.880987,0.014553,0.609288,0.111041,0.0,0.631398,0.866667,0.451022,0.780108,0.736941,0.630051,0.112694,0.155766,0.445893,0.183908,0.737654,0.122649,0.360341,0.355603,0.145465,0.199335,0.010101,0.1,0.028571,0.076923,0.094311,0.905689,0.86948,…,0.433068,0.41777,0.556965,0.573965,0.526349,0.385178,0.434534,0.276719,0.475889,0.0,0.516325,0.486297,0.541377,0.526951,0.331481,0.3914,0.535909,0.440462,0.380572,0.262522,0.465885,0.333772,0.594638,0.639793,0.408385,0.336578,0.363615,0.24475,0.446782,0.578805,0.330847,0.417733,0.680668,0.673164,0.599664,0.574109,0.675087
1,2018-05-13,1,0,0.014184,0.831348,0.122616,0.0,0.818995,0.894879,0.011583,0.62559,0.100383,0.0,0.643496,0.933333,0.451022,0.670886,0.722051,0.613322,0.117218,0.14691,0.445893,0.195402,0.844447,0.126192,0.396588,0.387931,0.153533,0.19888,0.010101,0.1,0.026374,0.076923,0.098361,0.901639,0.881843,…,0.646358,0.53552,0.453942,0.723047,0.318338,0.362356,0.582125,0.335492,0.497846,0.0,0.499208,0.567387,0.62263,0.746287,0.331481,0.377691,0.445087,0.611333,0.515384,0.428742,0.538771,0.444174,0.797206,0.857544,0.402567,0.356577,0.582149,0.561927,0.446782,0.390272,0.461271,0.402246,0.76079,0.499902,0.713469,0.58932,0.987123
1,2018-05-16,1,0,0.014184,0.750972,0.108689,0.0,0.738303,0.896952,0.012474,0.608576,0.095826,0.0,0.624718,0.866667,0.451022,0.656058,0.706593,0.595954,0.126933,0.144894,0.445893,0.195402,0.752194,0.103025,0.326226,0.321121,0.149839,0.200203,0.010101,0.1,0.048352,0.076923,0.095475,0.904525,0.891119,…,0.751048,0.581439,0.431507,0.673353,0.272707,0.471796,0.628249,0.349088,0.51426,0.0,0.46168,0.593455,0.638413,0.821789,0.331481,0.477119,0.431798,0.547575,0.529575,0.380052,0.583044,0.484624,0.866008,0.930447,0.36234,0.400082,0.56509,0.603684,0.467213,0.386419,0.475452,0.398434,0.489778,0.378493,0.62334,0.536083,0.712682
1,2018-05-20,1,0,0.021277,0.735423,0.096882,0.0,0.719726,0.902758,0.018711,0.595976,0.085416,0.0,0.608999,0.866667,0.513742,0.643761,0.68295,0.56939,0.12825,0.143199,0.328553,0.195402,0.710328,0.094304,0.304904,0.297414,0.139067,0.203782,0.010101,0.1,0.043956,0.076923,0.099644,0.900356,0.89427,…,0.745854,0.539839,0.472311,0.648506,0.260699,0.493793,0.593629,0.291906,0.548304,0.0,0.50341,0.570371,0.624582,0.708708,0.331481,0.507823,0.443743,0.527173,0.52248,0.363173,0.573245,0.414054,0.763946,0.821258,0.357589,0.380859,0.482056,0.2713,0.470509,0.433424,0.416306,0.404983,0.409657,0.351287,0.496352,0.505662,0.458918


In [13]:
target = rapids_weekly.select("dep_weeklysubscale_endterm_merged_true").to_numpy().tolist()
features = rapids_weekly.to_numpy().tolist()

In [None]:
sleep_1 = preprocessor_INS_W1.import_csv_feature_data("sleep")
wifi_1 = preprocessor_INS_W1.import_csv_feature_data("wifi")
bluetooth_1 = preprocessor_INS_W1.import_csv_feature_data("bluetooth")
call_1 = preprocessor_INS_W1.import_csv_feature_data("call")
location_1 = preprocessor_INS_W1.import_csv_feature_data("location")
screen_1 = preprocessor_INS_W1.import_csv_feature_data("screen")
steps_1 = preprocessor_INS_W1.import_csv_feature_data("steps")

In [120]:
sleep_wifi_temp_1 = preprocessor_INS_W1.merge_survey_to_feature(sleep_1, wifi_1, join_type="inner")
bluetooth_call_temp_1 = preprocessor_INS_W1.merge_survey_to_feature(bluetooth_1, call_1, join_type="inner")
location_screen_temp_1 = preprocessor_INS_W1.merge_survey_to_feature(location_1, screen_1, join_type="inner")

sleep_wifi_steps_1 = preprocessor_INS_W1.merge_survey_to_feature(sleep_wifi_temp_1, steps_1, join_type="inner")
bluetooth_call_location_screen = preprocessor_INS_W1.merge_survey_to_feature(bluetooth_call_temp_1, location_screen_temp_1, join_type="inner")
merged_all_temp = preprocessor_INS_W1.merge_survey_to_feature(sleep_wifi_steps_1, bluetooth_call_location_screen, join_type="inner")