<a href="https://colab.research.google.com/github/mdomingo2029/aai_530_final_project_group_4/blob/main/ai_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install ucimlrepo

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!ls "/content/drive/MyDrive/WESAD.zip"

In [None]:
print('Listing contents of MyDrive:')
!ls "/content/drive/MyDrive"

If your `WESAD.zip` file is inside a subfolder (for example, one named `MyProject`), list that folder instead:

```python
!ls "/content/drive/MyDrive/MyProject"
```

Once you find the file, use the full path in your code.
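If browsing folder by folder is tedious, a recursive search can locate the archive. This is a minimal sketch: `find_archive` is a hypothetical helper, not part of the notebook, and the Drive path in the trailing comment assumes the mount point used above.

```python
from pathlib import Path


def find_archive(root, filename="WESAD.zip"):
    """Return every file under `root` (searched recursively) named `filename`."""
    root = Path(root)
    if not root.is_dir():
        # Nothing to search, e.g. the drive has not been mounted yet
        return []
    return sorted(p for p in root.rglob(filename) if p.is_file())


# In Colab, after mounting, the call would look like:
# matches = find_archive("/content/drive/MyDrive")
# for match in matches:
#     print(match)
```

Any path it prints can be assigned to `zip_path` in the next cell.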

In [None]:
from google.colab import drive
import os

# 1. Mount the drive
drive.mount('/content/drive')

# 2. Define the exact path (Google Drive root is 'MyDrive')
zip_path = '/content/drive/MyDrive/WESAD.zip'

# 3. Check if the file exists before unzipping
if os.path.exists(zip_path):
    print("Found it! Unzipping now...")
    !unzip -q "{zip_path}" -d "/content/WESAD_data"
    print("Done! Files are now in the 'WESAD_data' folder on the left sidebar.")
else:
    print("Still can't see it. Make sure the file is in the main 'My Drive' folder, not a subfolder.")

# Task
Explore the structure of the unzipped WESAD data by listing the contents of the `/content/WESAD_data` directory to identify the main data files.

## Explore Unzipped Data

### Subtask:
Explore the structure of the unzipped WESAD data by listing the contents of the `/content/WESAD_data` directory to identify the main data files.


**Reasoning**:
To explore the structure of the unzipped WESAD data, I will list the contents of the `/content/WESAD_data` directory using the `!ls -F` command as instructed.



In [None]:
print('Listing contents of the unzipped WESAD data directory:')
!ls -F /content/WESAD_data

**Reasoning**:
Since the previous `ls` command revealed a subdirectory named `WESAD/`, I will now list the contents of this subdirectory to further explore the data structure and locate the actual data files.



In [None]:
print('Listing contents of the WESAD subdirectory:')
!ls -F /content/WESAD_data/WESAD

**Reasoning**:
The previous command revealed subject-specific subdirectories (S2/, S3/, etc.) and a `wesad_readme.pdf` file. To understand the data structure further, I will list the contents of one of these subject directories, specifically `S2/`, to see what kind of data files are stored there.



In [None]:
print('Listing contents of the S2 subdirectory:')
!ls -F /content/WESAD_data/WESAD/S2

## Load Data

### Subtask:
Load the relevant WESAD data files into pandas DataFrames for cleaning.


**Reasoning**:
I will import the pandas library, define the base path to the S2 subject data, and then load the 'S2.pkl' and 'S2_quest.csv' files into two separate pandas DataFrames, `df_s2_data` and `df_s2_quest` respectively, to begin the data loading process as instructed.



In [None]:
import pandas as pd

# Define the base path to the S2 subject data directory
s2_data_path = '/content/WESAD_data/WESAD/S2'

# Load the S2.pkl file into a pandas DataFrame
df_s2_data = pd.read_pickle(f'{s2_data_path}/S2.pkl')

# Load the S2_quest.csv file into a pandas DataFrame
df_s2_quest = pd.read_csv(f'{s2_data_path}/S2_quest.csv')

print("df_s2_data head:")
print(df_s2_data.head())

print("\ndf_s2_quest head:")
print(df_s2_quest.head())

**Reasoning**:
The previous code failed because `df_s2_data` was a dictionary, not a pandas DataFrame, and therefore did not have a `.head()` method. The `S2.pkl` file, when loaded, is a dictionary containing various data points. To address the subtask of loading relevant data into DataFrames, I will first load the `.pkl` file into a dictionary, then extract the 'chest' and 'wrist' signal data from it and convert them into separate pandas DataFrames. I will also correctly load the `S2_quest.csv` into a DataFrame.
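Before converting anything, it can help to print the shape of every array in the dictionary. The sketch below assumes the nested `{device: {sensor: array}}` layout described above for `s2_raw_data_dict['signal']`; `describe_signals` and the demo arrays are illustrative stand-ins, not real WESAD data.

```python
import numpy as np


def describe_signals(signal_dict):
    """Map 'device/sensor' to the array shape for a nested {device: {sensor: array}} dict."""
    shapes = {}
    for device, sensors in signal_dict.items():
        for sensor_name, data_array in sensors.items():
            shapes[f"{device}/{sensor_name}"] = np.asarray(data_array).shape
    return shapes


# Tiny synthetic stand-in for s2_raw_data_dict['signal'] (values are made up)
demo_signal = {
    "chest": {"ACC": np.zeros((700, 3)), "ECG": np.zeros((700, 1))},
    "wrist": {"ACC": np.zeros((32, 3)), "EDA": np.zeros((4, 1))},
}

for name, shape in describe_signals(demo_signal).items():
    print(name, shape)
```

With the real file, the call would be `describe_signals(s2_raw_data_dict['signal'])`. Arrays with more than one column, or with differing row counts, need separate handling before they can share a DataFrame.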



In [None]:
import pandas as pd

# Define the base path to the S2 subject data directory
s2_data_path = '/content/WESAD_data/WESAD/S2'

# Load the S2.pkl file into a dictionary first
s2_raw_data_dict = pd.read_pickle(f'{s2_data_path}/S2.pkl')

# Extract 'chest' and 'wrist' signals and convert them to DataFrames
df_s2_chest_data = pd.DataFrame(s2_raw_data_dict['signal']['chest'])
df_s2_wrist_data = pd.DataFrame(s2_raw_data_dict['signal']['wrist'])

# Load the S2_quest.csv file into a pandas DataFrame
df_s2_quest = pd.read_csv(f'{s2_data_path}/S2_quest.csv')

print("df_s2_chest_data head:")
print(df_s2_chest_data.head())

print("\ndf_s2_wrist_data head:")
print(df_s2_wrist_data.head())

print("\ndf_s2_quest head:")
print(df_s2_quest.head())

**Reasoning**:
The previous code failed because the 'ACC' (accelerometer) data within the 'chest' and 'wrist' signals of the `S2.pkl` file are multi-dimensional arrays, causing a `ValueError` when `pd.DataFrame` tried to interpret them as single columns. To fix this, I need to explicitly extract the 'ACC' data into separate columns (e.g., 'ACC_x', 'ACC_y', 'ACC_z') and then create DataFrames from all the extracted 1-dimensional signal arrays.



In [None]:
import pandas as pd
import numpy as np

# Define the base path to the S2 subject data directory
s2_data_path = '/content/WESAD_data/WESAD/S2'

# Load the S2.pkl file into a dictionary first
s2_raw_data_dict = pd.read_pickle(f'{s2_data_path}/S2.pkl')

# Process 'chest' signals
chest_signals = s2_raw_data_dict['signal']['chest']
chest_data = {}
for sensor_name, data_array in chest_signals.items():
    if sensor_name == 'ACC':
        # Accelerometer data is typically 3-axis, so create three columns
        chest_data['ACC_x'] = data_array[:, 0]
        chest_data['ACC_y'] = data_array[:, 1]
        chest_data['ACC_z'] = data_array[:, 2]
    else:
        # Other sensors are typically 1D
        chest_data[sensor_name] = data_array

df_s2_chest_data = pd.DataFrame(chest_data)

# Process 'wrist' signals
wrist_signals = s2_raw_data_dict['signal']['wrist']
wrist_data = {}
for sensor_name, data_array in wrist_signals.items():
    if sensor_name == 'ACC':
        # Accelerometer data is typically 3-axis, so create three columns
        wrist_data['ACC_x'] = data_array[:, 0]
        wrist_data['ACC_y'] = data_array[:, 1]
        wrist_data['ACC_z'] = data_array[:, 2]
    else:
        # Other sensors are typically 1D
        wrist_data[sensor_name] = data_array

df_s2_wrist_data = pd.DataFrame(wrist_data)

# Load the S2_quest.csv file into a pandas DataFrame
df_s2_quest = pd.read_csv(f'{s2_data_path}/S2_quest.csv')

print("df_s2_chest_data head:")
print(df_s2_chest_data.head())

print("\ndf_s2_wrist_data head:")
print(df_s2_wrist_data.head())

print("\ndf_s2_quest head:")
print(df_s2_quest.head())

**Reasoning**:
The previous code failed because some non-accelerometer sensor data (e.g., ECG, EMG) were still 2-dimensional arrays with a single column, which `pd.DataFrame` cannot directly handle as individual columns. I need to flatten these 2D arrays into 1-dimensional arrays before adding them to the `chest_data` and `wrist_data` dictionaries to resolve the `ValueError`.



In [None]:
import pandas as pd
import numpy as np

# Define the base path to the S2 subject data directory
s2_data_path = '/content/WESAD_data/WESAD/S2'

# Load the S2.pkl file into a dictionary first
s2_raw_data_dict = pd.read_pickle(f'{s2_data_path}/S2.pkl')

# Process 'chest' signals
chest_signals = s2_raw_data_dict['signal']['chest']
chest_data = {}
for sensor_name, data_array in chest_signals.items():
    if sensor_name == 'ACC':
        # Accelerometer data is typically 3-axis, so create three columns
        chest_data['ACC_x'] = data_array[:, 0]
        chest_data['ACC_y'] = data_array[:, 1]
        chest_data['ACC_z'] = data_array[:, 2]
    else:
        # Other sensors might be 2D arrays with a single column, flatten them
        chest_data[sensor_name] = data_array.flatten()

df_s2_chest_data = pd.DataFrame(chest_data)

# Process 'wrist' signals
wrist_signals = s2_raw_data_dict['signal']['wrist']
wrist_data = {}
for sensor_name, data_array in wrist_signals.items():
    if sensor_name == 'ACC':
        # Accelerometer data is typically 3-axis, so create three columns
        wrist_data['ACC_x'] = data_array[:, 0]
        wrist_data['ACC_y'] = data_array[:, 1]
        wrist_data['ACC_z'] = data_array[:, 2]
    else:
        # Other sensors might be 2D arrays with a single column, flatten them
        wrist_data[sensor_name] = data_array.flatten()

df_s2_wrist_data = pd.DataFrame(wrist_data)

# Load the S2_quest.csv file into a pandas DataFrame
df_s2_quest = pd.read_csv(f'{s2_data_path}/S2_quest.csv')

print("df_s2_chest_data head:")
print(df_s2_chest_data.head())

print("\ndf_s2_wrist_data head:")
print(df_s2_wrist_data.head())

print("\ndf_s2_quest head:")
print(df_s2_quest.head())

**Reasoning**:
The previous code failed because sensor data arrays within 'wrist' (and likely 'chest') have different lengths due to varying sampling rates, making it impossible to directly create a single pandas DataFrame. To resolve the `ValueError: All arrays must be of the same length`, I will create separate DataFrames for each individual sensor signal (e.g., 'ACC', 'BVP', 'EDA', 'TEMP', 'ECG', 'EMG') and store them in dictionaries (`chest_dataframes` and `wrist_dataframes`) for organized access.



In [None]:
import pandas as pd
import numpy as np

# Define the base path to the S2 subject data directory
s2_data_path = '/content/WESAD_data/WESAD/S2'

# Load the S2.pkl file into a dictionary first
s2_raw_data_dict = pd.read_pickle(f'{s2_data_path}/S2.pkl')

# Initialize dictionaries to hold individual sensor DataFrames
chest_dataframes = {}
wrist_dataframes = {}

# Process 'chest' signals
chest_signals = s2_raw_data_dict['signal']['chest']
for sensor_name, data_array in chest_signals.items():
    if sensor_name == 'ACC':
        # Accelerometer data is 3-axis, create three columns
        chest_dataframes[sensor_name] = pd.DataFrame(data_array, columns=['ACC_x', 'ACC_y', 'ACC_z'])
    else:
        # Other sensors are typically 1D or 2D with one column, flatten and create a DataFrame
        chest_dataframes[sensor_name] = pd.DataFrame(data_array.flatten(), columns=[sensor_name])

# Process 'wrist' signals
wrist_signals = s2_raw_data_dict['signal']['wrist']
for sensor_name, data_array in wrist_signals.items():
    if sensor_name == 'ACC':
        # Accelerometer data is 3-axis, create three columns
        wrist_dataframes[sensor_name] = pd.DataFrame(data_array, columns=['ACC_x', 'ACC_y', 'ACC_z'])
    else:
        # Other sensors are typically 1D or 2D with one column, flatten and create a DataFrame
        wrist_dataframes[sensor_name] = pd.DataFrame(data_array.flatten(), columns=[sensor_name])

# Load the S2_quest.csv file into a pandas DataFrame
df_s2_quest = pd.read_csv(f'{s2_data_path}/S2_quest.csv')

print("Chest DataFrames (first 5 rows of each):")
for sensor_name, df in chest_dataframes.items():
    print(f"\n{sensor_name}:")
    print(df.head())

print("\nWrist DataFrames (first 5 rows of each):")
for sensor_name, df in wrist_dataframes.items():
    print(f"\n{sensor_name}:")
    print(df.head())

print("\ndf_s2_quest head:")
print(df_s2_quest.head())

FileNotFoundError: [Errno 2] No such file or directory: '/content/WESAD_data/WESAD/S2/S2.pkl'

In [None]:

    import pandas as pd
    import numpy as np
    import os

    # Initialize dictionaries to hold individual sensor DataFrames from ALL subjects
    all_chest_dataframes = {}
    all_wrist_dataframes = {}
    all_quest_dataframes = {}

    # Base path to the WESAD data directory
    base_data_path = '/content/WESAD_data/WESAD'
    print(f"Base data path: {base_data_path}")
    print(f"Does base path exist? {os.path.exists(base_data_path)}")

    # Loop through subjects S2 to S17
    print("Starting loop through subjects S2-S17...")
    for subject_id in range(2, 18):
        subject_name = f'S{subject_id}'
        print(f"Looking for subject: {subject_name}")
        subject_data_path = os.path.join(base_data_path, subject_name)
        pkl_file_path = os.path.join(subject_data_path, f'{subject_name}.pkl')
        quest_file_path = os.path.join(subject_data_path, f'{subject_name}_quest.csv')

        print(f"Checking for pkl file: {pkl_file_path}")
        if os.path.exists(pkl_file_path):
            print(f"Processing {subject_name} data...")
            # Load the .pkl file
            raw_data_dict = pd.read_pickle(pkl_file_path)

            # Process 'chest' signals
            if 'chest' in raw_data_dict['signal']:
                chest_signals = raw_data_dict['signal']['chest']
                for sensor_name, data_array in chest_signals.items():
                    key = f'{subject_name}_{sensor_name}'
                    if sensor_name == 'ACC':
                        all_chest_dataframes[key] = pd.DataFrame(data_array, columns=['ACC_x', 'ACC_y', 'ACC_z'])
                    else:
                        all_chest_dataframes[key] = pd.DataFrame(data_array.flatten(), columns=[sensor_name])

            # Process 'wrist' signals
            if 'wrist' in raw_data_dict['signal']:
                wrist_signals = raw_data_dict['signal']['wrist']
                for sensor_name, data_array in wrist_signals.items():
                    key = f'{subject_name}_{sensor_name}'
                    if sensor_name == 'ACC':
                        all_wrist_dataframes[key] = pd.DataFrame(data_array, columns=['ACC_x', 'ACC_y', 'ACC_z'])
                    else:
                        all_wrist_dataframes[key] = pd.DataFrame(data_array.flatten(), columns=[sensor_name])

            # Load the _quest.csv file
            if os.path.exists(quest_file_path):
                all_quest_dataframes[subject_name] = pd.read_csv(quest_file_path)
                print(f"Loaded {subject_name}_quest.csv")
            else:
                print(f"Warning: {subject_name}_quest.csv not found.")

        else:
            print(f"Warning: {subject_name}.pkl not found at {pkl_file_path}")

    print("\nFinished loading data for all subjects.")
    print("Total chest dataframes loaded:", len(all_chest_dataframes))
    print("Total wrist dataframes loaded:", len(all_wrist_dataframes))
    print("Total quest dataframes loaded:", len(all_quest_dataframes))

    # For compatibility with the next cell, we'll make chest_dataframes and wrist_dataframes
    # point to the new dictionaries. And df_s2_quest will be S2's quest df.
    chest_dataframes = all_chest_dataframes
    wrist_dataframes = all_wrist_dataframes
    if 'S2' in all_quest_dataframes:
        df_s2_quest = all_quest_dataframes['S2']
    else:
        df_s2_quest = None # Or handle error if S2 quest is essential


Base data path: /content/WESAD_data/WESAD
Does base path exist? False
Starting loop through subjects S2-S17...
Looking for subject: S2
Checking for pkl file: /content/WESAD_data/WESAD/S2/S2.pkl
Looking for subject: S3
Checking for pkl file: /content/WESAD_data/WESAD/S3/S3.pkl
Looking for subject: S4
Checking for pkl file: /content/WESAD_data/WESAD/S4/S4.pkl
Looking for subject: S5
Checking for pkl file: /content/WESAD_data/WESAD/S5/S5.pkl
Looking for subject: S6
Checking for pkl file: /content/WESAD_data/WESAD/S6/S6.pkl
Looking for subject: S7
Checking for pkl file: /content/WESAD_data/WESAD/S7/S7.pkl
Looking for subject: S8
Checking for pkl file: /content/WESAD_data/WESAD/S8/S8.pkl
Looking for subject: S9
Checking for pkl file: /content/WESAD_data/WESAD/S9/S9.pkl
Looking for subject: S10
Checking for pkl file: /content/WESAD_data/WESAD/S10/S10.pkl
Looking for subject: S11
Checking for pkl file: /content/WESAD_data/WESAD/S11/S11.pkl
Looking for subject: S12
Checking for pkl file: /cont

In [None]:

    import pandas as pd
    import numpy as np
    from scipy.signal import resample

    # Target frequency for resampling
    TARGET_FREQ = 32  # Hz
    TOLERANCE = '10ms' # Tolerance for nearest matching

    # --- Resample Chest Data (Downsampling) ---
    # Every chest signal (not only ACC) is recorded at 700 Hz, so all of them
    # are downsampled to TARGET_FREQ rather than being assumed to be at 32 Hz.
    CHEST_FREQ = 700.0  # Hz
    resampled_chest_dataframes = {}
    for sensor_name, df in chest_dataframes.items():
        original_len = len(df)
        if original_len > 1:
            # Attach a time index that reflects the original sampling rate
            df.index = pd.to_timedelta(np.arange(original_len) / CHEST_FREQ, unit='s')
            resample_len = int(original_len * TARGET_FREQ / CHEST_FREQ)
            resampled_data = {}
            for col in df.columns:
                resampled_data[col] = resample(df[col], resample_len)
            resampled_df = pd.DataFrame(resampled_data)
            resampled_df.index = pd.to_timedelta(np.arange(resample_len) / float(TARGET_FREQ), unit='s')
            resampled_chest_dataframes[sensor_name] = resampled_df
            print(f"Resampled {sensor_name} from {original_len} to {resample_len} points (700Hz to {TARGET_FREQ}Hz)")
        else:
            resampled_chest_dataframes[sensor_name] = df.copy()
            print(f"Kept {sensor_name} as is (empty or single row)")

    # --- Resample Wrist Data (Upsampling/Downsampling) ---
    resampled_wrist_dataframes = {}
    wrist_original_freqs = {'ACC': 32, 'BVP': 64, 'EDA': 4, 'TEMP': 4}

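    for sensor_name, df in wrist_dataframes.items():
        # Keys may be subject-prefixed (e.g. 'S2_ACC'), so look up the rate by the bare sensor name
        base_sensor = sensor_name.split('_')[-1]
        if not df.empty and base_sensor in wrist_original_freqs:
            original_freq = wrist_original_freqs[base_sensor]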
            original_len = len(df)
            if original_len > 1 and original_freq != TARGET_FREQ:
                time_index = pd.to_timedelta(np.arange(original_len) / float(original_freq), unit='s')
                df.index = time_index
                resample_len = int(original_len * TARGET_FREQ / float(original_freq))
                resampled_data = {}
                for col in df.columns:
                    resampled_data[col] = resample(df[col], resample_len)
                resampled_df = pd.DataFrame(resampled_data)
                resampled_time_index = pd.to_timedelta(np.arange(resample_len) / float(TARGET_FREQ), unit='s')
                resampled_df.index = resampled_time_index
                resampled_wrist_dataframes[sensor_name] = resampled_df
                print(f"Resampled {sensor_name} from {original_len} to {resample_len} points ({original_freq}Hz to {TARGET_FREQ}Hz)")
            else:
                if original_len > 0:
                    time_index = pd.to_timedelta(np.arange(original_len) / float(original_freq), unit='s')
                    df.index = time_index
                    resampled_wrist_dataframes[sensor_name] = df.copy()
                    print(f"Kept {sensor_name} as is with {original_len} points (at/near {TARGET_FREQ}Hz or single row)")
                else:
                    resampled_wrist_dataframes[sensor_name] = df.copy()
                    print(f"Kept {sensor_name} as is (empty)")
        else:
            resampled_wrist_dataframes[sensor_name] = df.copy()
            print(f"Kept {sensor_name} as is (empty or not in freq map)")

    print("\nResampling complete.")

    # --- Alignment ---
    # Alignment needs a 'df_event_timings' DataFrame with 'Start_Time' and 'End_Time'
    # stored as datetime values. If it has not been created yet, fall back to timings
    # parsed from df_s2_quest, or to dummy timings, so the rest of the cell can be demonstrated.
    if 'df_event_timings' not in locals() and 'df_event_timings' not in globals():
        print("Warning: 'df_event_timings' not found. Using dummy timings for alignment demonstration.")
        # Extract from df_s2_quest if it exists, otherwise dummy
        if 'df_s2_quest' in locals() or 'df_s2_quest' in globals():
             start_time_str = df_s2_quest.iloc[1,0].split(': ')[1] # Assuming # START: time is at index 1
             end_time_str = df_s2_quest.iloc[2,0].split(': ')[1]   # Assuming # END: time is at index 2
             start_time = pd.to_datetime(start_time_str)
             end_time = pd.to_datetime(end_time_str)
             df_event_timings = pd.DataFrame({'Start_Time': [start_time], 'End_Time': [end_time]})
        else:
            start_time = pd.to_datetime('1970-01-01 00:00:00') # Dummy start
            end_time = pd.to_datetime('1970-01-01 00:05:00')   # Dummy end (5 mins later)
            df_event_timings = pd.DataFrame({'Start_Time': [start_time], 'End_Time': [end_time]})


    start_time = df_event_timings['Start_Time'].iloc[0]
    end_time = df_event_timings['End_Time'].iloc[0]

    aligned_data = {}

    # Align resampled chest data
    for sensor_name, df in resampled_chest_dataframes.items():
        if not df.empty and isinstance(df.index, pd.TimedeltaIndex):
            df_abs_time = df.copy()
            df_abs_time.index = start_time + df_abs_time.index
            aligned_df = df_abs_time[(df_abs_time.index >= start_time) & (df_abs_time.index <= end_time)]
            # Make index relative to start_time again for merging
            aligned_df.index = aligned_df.index - start_time
            aligned_data[f'chest_{sensor_name}'] = aligned_df
            print(f"Aligned chest_{sensor_name}")

    # Align resampled wrist data
    for sensor_name, df in resampled_wrist_dataframes.items():
        if not df.empty and isinstance(df.index, pd.TimedeltaIndex):
            df_abs_time = df.copy()
            df_abs_time.index = start_time + df_abs_time.index
            aligned_df = df_abs_time[(df_abs_time.index >= start_time) & (df_abs_time.index <= end_time)]
            aligned_df.index = aligned_df.index - start_time
            aligned_data[f'wrist_{sensor_name}'] = aligned_df
            print(f"Aligned wrist_{sensor_name}")

    # --- Consolidation ---
    df_final = pd.DataFrame()
    for sensor_key, df in aligned_data.items():
        if not df.empty:
            df_renamed = df.rename(columns={col: f'{sensor_key}_{col}' for col in df.columns})
            if df_final.empty:
                df_final = df_renamed
            else:
                df_final = pd.merge_asof(df_final.sort_index(), df_renamed.sort_index(), left_index=True, right_index=True, direction='nearest', tolerance=pd.Timedelta(TOLERANCE))

    print("\nAlignment and Consolidation complete.")
    print("Final DataFrame head:")
    print(df_final.head())
    print("Final DataFrame info:")
    df_final.info()


NameError: name 'chest_dataframes' is not defined

## Initial Data Inspection

### Subtask:
Perform an initial inspection of the loaded data, including checking the first few rows (.head()), data types (.info()), and descriptive statistics (.describe()) to identify potential issues.


**Reasoning**:
To perform an initial inspection, I will iterate through the `chest_dataframes` and `wrist_dataframes` dictionaries, applying `.head()`, `.info()`, and `.describe()` to each sensor DataFrame. I will also apply these methods to the `df_s2_quest` DataFrame.



In [None]:
print("--- Initial Data Inspection for Chest Sensor DataFrames ---")
for sensor_name, df in chest_dataframes.items():
    print(f"\n----- {sensor_name} Data (Chest) -----")
    print(f"\n{sensor_name}.head():")
    print(df.head())
    print(f"\n{sensor_name}.info():")
    df.info()
    print(f"\n{sensor_name}.describe():")
    print(df.describe())

print("\n--- Initial Data Inspection for Wrist Sensor DataFrames ---")
for sensor_name, df in wrist_dataframes.items():
    print(f"\n----- {sensor_name} Data (Wrist) -----")
    print(f"\n{sensor_name}.head():")
    print(df.head())
    print(f"\n{sensor_name}.info():")
    df.info()
    print(f"\n{sensor_name}.describe():")
    print(df.describe())

print("\n--- Initial Data Inspection for Questionnaire Data (df_s2_quest) ---")
print("\ndf_s2_quest.head():")
print(df_s2_quest.head())
print("\ndf_s2_quest.info():")
df_s2_quest.info()
print("\ndf_s2_quest.describe():")
print(df_s2_quest.describe())

## Handle Missing Values

### Subtask:
Identify and address any missing values in the dataset. This may involve imputation, deletion of rows/columns, or other appropriate strategies based on the extent and nature of missing data.


**Reasoning**:
To identify missing values, I will iterate through each DataFrame in the `chest_dataframes` and `wrist_dataframes` dictionaries and print the sum of null values for each column. I will also add a markdown block to acknowledge that `df_s2_quest` needs parsing before checking for missing values.



In [None]:
print("--- Checking for Missing Values in Chest Sensor DataFrames ---")
for sensor_name, df in chest_dataframes.items():
    print(f"\nMissing values in {sensor_name} (Chest):")
    print(df.isnull().sum())

print("\n--- Checking for Missing Values in Wrist Sensor DataFrames ---")
for sensor_name, df in wrist_dataframes.items():
    print(f"\nMissing values in {sensor_name} (Wrist):")
    print(df.isnull().sum())


```markdown
**Regarding `df_s2_quest`:**

The `df_s2_quest` DataFrame is not yet in a tabular format suitable for a direct `.isnull().sum()` check. It contains metadata and questionnaire responses within a few rows, separated by semicolons. This DataFrame will require further parsing and cleaning before a meaningful assessment of missing values can be performed for individual questions/fields.

Based on the current state of the chest and wrist sensor data, there are no missing values detected.
```
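As a starting point for that parsing, the sketch below groups semicolon-separated rows by their leading `#` label. The row layout and the sample text are assumptions made for illustration (the sample is invented, not copied from `S2_quest.csv`), and `parse_hash_rows` is a hypothetical helper.

```python
import csv
import io


def parse_hash_rows(text, delimiter=";"):
    """Group delimiter-separated rows by their leading '# KEY' label."""
    parsed = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row or not row[0].startswith("#"):
            continue  # skip blank rows and rows without a label
        key = row[0].lstrip("#").strip()
        # Drop the empty cells left behind by trailing delimiters
        values = [cell.strip() for cell in row[1:] if cell.strip()]
        parsed.setdefault(key, []).append(values)
    return parsed


# Invented sample in the assumed layout: a label cell, then values
sample = "# ORDER;Base;Stress;;\n# START;1.0;20.0;;\n# END;10.0;30.0;;\n"
rows = parse_hash_rows(sample)
for key, value_lists in rows.items():
    print(key, value_lists)
```

Labels that occur on several rows are collected into a list of lists, so a later step can turn each group into its own DataFrame and run `.isnull().sum()` on it.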

## Handle Duplicates

### Subtask:
Check for and remove any duplicate rows in the dataset to ensure data integrity.


**Reasoning**:
To identify duplicate rows, I will iterate through each DataFrame in the `chest_dataframes` and `wrist_dataframes` dictionaries and use the `.duplicated().sum()` method to count and print the number of duplicate rows for each sensor.



In [None]:
print("--- Checking for Duplicate Rows in Chest Sensor DataFrames ---")
for sensor_name, df in chest_dataframes.items():
    print(f"\nDuplicate rows in {sensor_name} (Chest): {df.duplicated().sum()}")

print("\n--- Checking for Duplicate Rows in Wrist Sensor DataFrames ---")
for sensor_name, df in wrist_dataframes.items():
    print(f"\nDuplicate rows in {sensor_name} (Wrist): {df.duplicated().sum()}")

```markdown
**Regarding `df_s2_quest`:**

The `df_s2_quest` DataFrame is not yet in a tabular format suitable for a direct `.duplicated().sum()` check. It contains metadata and questionnaire responses within a few rows, separated by semicolons. This DataFrame will require further parsing and cleaning before a meaningful assessment of duplicate rows can be performed for individual questions/fields.

**Regarding Sensor DataFrames:**

Many of the chest and wrist sensor DataFrames contain a significant number of duplicate rows. This will need to be addressed in the next step to ensure data integrity and prevent skewed analysis.
```

**Reasoning**:
Many sensor dataframes were found to contain duplicate rows, which can affect data integrity. To address this, I will iterate through each DataFrame in the `chest_dataframes` and `wrist_dataframes` dictionaries and use the `.drop_duplicates()` method to remove all duplicate rows. After removal, I will confirm by re-checking the count of duplicate rows for each DataFrame.



In [None]:
print("--- Removing Duplicate Rows from Chest Sensor DataFrames ---")
for sensor_name, df in chest_dataframes.items():
    initial_duplicates = df.duplicated().sum()
    if initial_duplicates > 0:
        chest_dataframes[sensor_name] = df.drop_duplicates().reset_index(drop=True)
        print(f"Removed {initial_duplicates} duplicate rows from {sensor_name} (Chest). New count: {chest_dataframes[sensor_name].duplicated().sum()}")
    else:
        print(f"No duplicate rows found in {sensor_name} (Chest).")

print("\n--- Removing Duplicate Rows from Wrist Sensor DataFrames ---")
for sensor_name, df in wrist_dataframes.items():
    initial_duplicates = df.duplicated().sum()
    if initial_duplicates > 0:
        wrist_dataframes[sensor_name] = df.drop_duplicates().reset_index(drop=True)
        print(f"Removed {initial_duplicates} duplicate rows from {sensor_name} (Wrist). New count: {wrist_dataframes[sensor_name].duplicated().sum()}")
    else:
        print(f"No duplicate rows found in {sensor_name} (Wrist).")
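One point worth checking before relying on the de-duplicated frames: `duplicated()` and `drop_duplicates()` compare values only, and the sensor DataFrames have no timestamp column. In a uniformly sampled signal a repeated reading is therefore flagged as a duplicate row. The sketch below shows the effect on a small synthetic trace (the values are made up).

```python
import pandas as pd

# Synthetic 4 Hz temperature trace: the reading holds steady, steps up, then returns
temp = pd.DataFrame({"TEMP": [30.1, 30.1, 30.1, 30.1, 30.2, 30.2, 30.1, 30.1]})

n_repeated = int(temp.duplicated().sum())
deduplicated = temp.drop_duplicates().reset_index(drop=True)

print(f"rows before: {len(temp)}")
print(f"rows flagged as duplicates: {n_repeated}")
print(f"rows after drop_duplicates: {len(deduplicated)}")
```

If the sample count per sensor matters later (for example, when a time index is rebuilt from the row number and the sampling rate), it may be safer to keep the original frames and treat this check as informational only.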

## Check Data Types

### Subtask:
Verify that all columns have appropriate data types (e.g., numeric, categorical, datetime) and convert them if necessary to ensure correct processing in subsequent analysis.


**Reasoning**:
To verify data types, I will iterate through each DataFrame in `chest_dataframes` and `wrist_dataframes` and print their data types and information using `.info()`. I will also print the `.info()` for `df_s2_quest` and acknowledge that it requires parsing.



In [None]:
print("--- Checking Data Types for Chest Sensor DataFrames ---")
for sensor_name, df in chest_dataframes.items():
    print(f"\n----- {sensor_name} Data (Chest) -----")
    df.info()

print("\n--- Checking Data Types for Wrist Sensor DataFrames ---")
for sensor_name, df in wrist_dataframes.items():
    print(f"\n----- {sensor_name} Data (Wrist) -----")
    df.info()

print("\n--- Checking Data Types for Questionnaire Data (df_s2_quest) ---")
df_s2_quest.info()

## Final Task

### Subtask:
Summarize the data cleaning steps performed and prepare the cleaned data for further analysis or modeling.


## Summary:

### Data Analysis Key Findings

*   **Data Structure Identification**: The WESAD dataset is organized into subject-specific subdirectories (e.g., `S2/`), each containing a `.pkl` file for processed sensor data, a `_quest.csv` file for questionnaire data, and a `_E4_Data.zip` file for raw data.
*   **Data Loading Strategy**: Due to varying sampling rates and data structures within the `S2.pkl` file, individual pandas DataFrames were created for each chest and wrist sensor (e.g., `ACC`, `ECG`, `BVP`, `EDA`, `Temp`). Accelerometer data was specifically handled to separate its x, y, and z components into distinct columns. The `S2_quest.csv` file was loaded into a separate DataFrame.
*   **Sensor Data Consistency**:
    *   **Chest Sensors**: All chest sensor DataFrames (`ACC`, `ECG`, `EMG`, `EDA`, `Temp`, `Resp`) have a consistent length of 4,255,300 entries, indicating uniform sampling and duration. Their data types are primarily `float64`, with `Temp` being `float32`.
    *   **Wrist Sensors**: Wrist sensor DataFrames show varying lengths, suggesting different sampling rates or recording durations (e.g., `ACC` with 194,528 entries, `BVP` with 389,056 entries, `EDA` and `TEMP` with 24,316 entries). All wrist sensor data are `float64`.
*   **Missing Values**: No missing values were found in any of the chest or wrist sensor DataFrames.
*   **Duplicate Rows**: A significant number of duplicate rows were initially identified across all sensor DataFrames:
    *   Chest sensors: Ranged from 470,641 (ACC) to 4,251,080 (Temp).
    *   Wrist sensors: Ranged from 22,615 (EDA) to 343,616 (BVP).
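    *   All identified duplicate rows were removed, leaving zero duplicates in the cleaned sensor DataFrames. These duplicates are repeated sensor readings (for example, nearly every chest `Temp` row), so dropping them shortens each series and removes the even sample spacing that position-based resampling relies on.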
*   **Questionnaire Data (`df_s2_quest`) State**: The `df_s2_quest` DataFrame was loaded as a single `object` type column. It is not yet in a tabular format, requiring further parsing to extract meaningful questionnaire data and to perform checks for missing values or duplicates within its content.
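
As a quick cross-check of the row counts listed above, the sketch below divides each count by a sampling rate. The rates (700 Hz for the chest device; 32, 64, 4, and 4 Hz for the wrist sensors) are assumptions taken from the resampling cell later in this notebook, not values read from the data.

```python
# Row counts reported in the findings, paired with assumed sampling rates in Hz
SENSORS = {
    'chest (all sensors)': (4_255_300, 700),
    'wrist ACC': (194_528, 32),
    'wrist BVP': (389_056, 64),
    'wrist EDA': (24_316, 4),
    'wrist TEMP': (24_316, 4),
}

# Duration in seconds implied by each (row count, rate) pair
durations = {name: rows / rate for name, (rows, rate) in SENSORS.items()}

for name, seconds in durations.items():
    print(f'{name}: {seconds:.1f} s')
```

If every line prints the same duration, the differing lengths are explained by sampling rate alone.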

### Insights or Next Steps

*   **Parse Questionnaire Data**: The `df_s2_quest` DataFrame needs to be parsed and transformed into a structured tabular format to make its contents accessible for analysis and to properly assess its data quality (e.g., missing values, data types).
*   **Synchronize Sensor Data**: Given the different sampling rates and lengths of the wrist sensor DataFrames (and potentially between chest and wrist), future steps should focus on synchronizing these time-series datasets, possibly through resampling or alignment, to enable integrated analysis.
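
One way to think about alignment, sketched under the assumption that all sensors share a common start and are evenly sampled: a moment in time maps to a different row position in each DataFrame, depending on the sampling rate. The names in `RATES_HZ` and the helper `sample_index` are hypothetical and only illustrate the idea.

```python
# Assumed sampling rates in Hz for three example sensors
RATES_HZ = {'chest_ECG': 700, 'wrist_BVP': 64, 'wrist_EDA': 4}

def sample_index(t_seconds, rate_hz):
    """Index of the sample closest to `t_seconds`, counted from the start of recording."""
    return round(t_seconds * rate_hz)

# Row positions that describe the same instant, 10 s into the recording
aligned = {name: sample_index(10.0, rate) for name, rate in RATES_HZ.items()}
print(aligned)
```

This mapping only holds for series whose rows have not been dropped or reordered.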


## Save Cleaned Data

### Subtask:
Save the cleaned `df_event_timings`, `df_questionnaire_responses`, and the individual sensor DataFrames (from `chest_dataframes` and `wrist_dataframes`) to CSV files for future use. This will involve creating a new directory for cleaned data and then exporting each DataFrame.

**Reasoning**:
To save the cleaned data, I will first define a path for a new 'cleaned_data' directory within the `S2` subject folder. I will then create this directory if it doesn't already exist. Afterward, I will iterate through the `chest_dataframes` and `wrist_dataframes` dictionaries, saving each sensor DataFrame as a CSV file. Finally, I will save `df_event_timings` and `df_questionnaire_responses` as separate CSV files to the same directory, ensuring all processed data is persistently stored.

In [None]:
import os

# Define the directory to save cleaned data
cleaned_data_dir = f'{s2_data_path}/cleaned_data'

# Create the directory if it does not exist
os.makedirs(cleaned_data_dir, exist_ok=True)

print(f"Saving cleaned data to: {cleaned_data_dir}")

# Save df_event_timings
df_event_timings.to_csv(f'{cleaned_data_dir}/df_event_timings.csv', index=False)
print("Saved df_event_timings.csv")

# Save df_questionnaire_responses
df_questionnaire_responses.to_csv(f'{cleaned_data_dir}/df_questionnaire_responses.csv', index=False)
print("Saved df_questionnaire_responses.csv")

# Save individual chest sensor DataFrames
for sensor_name, df in chest_dataframes.items():
    df.to_csv(f'{cleaned_data_dir}/chest_{sensor_name.lower()}.csv', index=False)
    print(f"Saved chest_{sensor_name.lower()}.csv")

# Save individual wrist sensor DataFrames
for sensor_name, df in wrist_dataframes.items():
    df.to_csv(f'{cleaned_data_dir}/wrist_{sensor_name.lower()}.csv', index=False)
    print(f"Saved wrist_{sensor_name.lower()}.csv")

print("All cleaned DataFrames saved successfully!")

# Task
Inspect the `df_s2_quest` DataFrame to understand its structure for parsing, including printing additional rows and specific slices to identify patterns in the header information and data rows.

## Inspect df_s2_quest structure

### Subtask:
Further inspect the `df_s2_quest` to understand its exact structure and identify patterns for parsing, especially header information and data rows. This may involve printing more rows or specific slices of the DataFrame.


**Reasoning**:
To understand the structure of `df_s2_quest` for parsing, I will display all its rows, then specifically print the first row and a few other distinct rows to identify patterns in how different types of information are presented.



In [None]:
print("--- Displaying all rows of df_s2_quest ---")
print(df_s2_quest.to_string())

print("\n--- First row of df_s2_quest ---")
print(df_s2_quest.iloc[0])

print("\n--- Second row of df_s2_quest ---")
print(df_s2_quest.iloc[1])

print("\n--- Third row of df_s2_quest ---")
print(df_s2_quest.iloc[2])

print("\n--- Fifth row of df_s2_quest (index 4) ---")
print(df_s2_quest.iloc[4])

The inspection of `df_s2_quest` has provided a clear understanding of its structure:

*   **Single Column Format**: The entire CSV content is loaded into a single column, with each row being a string containing semicolon-separated values.
*   **Header Rows**: The first few rows (`# ORDER`, `# START`, `# END`) provide metadata about the experimental setup (e.g., order of tasks, start and end times). These rows contain both text labels and numeric values.
*   **Questionnaire Data Rows**: Subsequent rows (`# PANAS`, `# STAI`, `# DIM`, `# SSSQ`) contain responses to different questionnaires. Each of these lines starts with a tag (e.g., `# PANAS`) followed by a series of numeric responses, also semicolon-separated.
*   **Empty Rows**: There are rows containing only semicolons (`;;;;...`) which act as separators between different sections of the questionnaire data.
*   **Delimiter**: The primary delimiter for all values appears to be a semicolon (`;`).
*   **Subject Identifier**: The very first line `"# Subj;S2;;;;;;;;;;;;;;;;;;;;;;;;;"` identifies the subject. This line appears to have been consumed as the column header when `S2_quest.csv` was loaded, which is why `# ORDER` sits at row index 0.

**Next Steps for Parsing `df_s2_quest`:**

1.  **Extract Metadata**: Parse the `# ORDER`, `# START`, and `# END` rows to extract the task sequence and their corresponding start/end times. This data could be stored in a dictionary or a separate DataFrame.
2.  **Extract Questionnaire Data**: Iterate through the remaining rows, identify each questionnaire type (PANAS, STAI, DIM, SSSQ), and extract the numerical responses into separate pandas DataFrames, possibly with appropriate column names derived from the questionnaire structure (if available in documentation or inferable).
3.  **Clean and Convert Data Types**: Ensure extracted numerical data is converted to appropriate numeric types (e.g., `int` or `float`).
4.  **Handle Empty/Irrelevant Values**: Remove or appropriately handle the trailing empty fields (`;;`) and the empty separator rows.

This detailed understanding of the `df_s2_quest` structure will guide the next data cleaning steps to transform this raw questionnaire data into a usable tabular format.
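
The structure described above can also be read with the standard `csv` module by setting the delimiter to a semicolon. The following is a minimal sketch, not the approach used in the cells below; `SAMPLE` is an illustrative stand-in for a few lines of `S2_quest.csv` (the `DIM` values are placeholders), and `parse_tagged_rows` is a hypothetical helper.

```python
import csv
import io

# Illustrative stand-in for a few lines of the file; DIM values are placeholders
SAMPLE = """# Subj;S2;;
# ORDER;Base;;
# START;7.08;;
# END;26.32;;
;;;
# DIM;1;2;
"""

def parse_tagged_rows(text):
    """Return {tag: [non-empty fields of each matching line]} for every '# TAG' line."""
    rows = {}
    for fields in csv.reader(io.StringIO(text), delimiter=';'):
        if not fields or not fields[0].startswith('#'):
            continue  # skip blank lines and ';;;' separator rows
        tag = fields[0].lstrip('# ').strip()
        values = [f.strip() for f in fields[1:] if f.strip()]
        rows.setdefault(tag, []).append(values)
    return rows

parsed = parse_tagged_rows(SAMPLE)
print(parsed['ORDER'])
print(parsed['DIM'])
```

Grouping by tag keeps repeated questionnaire rows (one per condition) together in file order.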

## Extract event timings

### Subtask:
Parse the 'START' and 'END' rows from `df_s2_quest` to extract the event timings for different experimental phases (e.g., Base, TSST, Medi 1, etc.) and store them in a suitable data structure, like a dictionary or a small DataFrame.


**Reasoning**:
To extract event timings, I will locate the '# ORDER', '# START', and '# END' rows, split their string content by semicolons, clean the resulting lists by removing empty strings and the initial '#' character, create event names from the '# ORDER' row, and then construct dictionaries for 'START' and 'END' times. Finally, I will combine these into a new pandas DataFrame as instructed.



In [None]:
import pandas as pd

# 1. Locate and extract the string content from the relevant rows
order_str = df_s2_quest.iloc[0, 0]
start_str = df_s2_quest.iloc[1, 0]
end_str = df_s2_quest.iloc[2, 0]

# 2. Split by semicolon and clean the lists
# Remove empty strings and the initial '#' from the first element

def clean_split_list(s):
    # Remove the leading '#' (first occurrence only) and split on semicolons
    parts = s.replace('#', '', 1).split(';')
    # Strip whitespace and drop empty fields (the rows end in runs of ';')
    return [p.strip() for p in parts if p.strip()]

cleaned_order = clean_split_list(order_str)
cleaned_start = clean_split_list(start_str)
cleaned_end = clean_split_list(end_str)

# 3. Create a list of event names (skipping the 'ORDER' label)
event_names = cleaned_order[1:]

# 4. Create dictionaries for START and END times, converting to float
start_times = {}
for i, event in enumerate(event_names):
    # Ensure index is within bounds for start_str data, skipping the 'START' label
    if (i + 1) < len(cleaned_start):
        try:
            start_times[event] = float(cleaned_start[i + 1])
        except ValueError:
            start_times[event] = None # Handle cases where conversion to float fails

end_times = {}
for i, event in enumerate(event_names):
    # Ensure index is within bounds for end_str data, skipping the 'END' label
    if (i + 1) < len(cleaned_end):
        try:
            end_times[event] = float(cleaned_end[i + 1])
        except ValueError:
            end_times[event] = None # Handle cases where conversion to float fails

# 5. Combine into a new pandas DataFrame
# Create lists for DataFrame construction
events_list = []
start_time_list = []
end_time_list = []

for event in event_names:
    events_list.append(event)
    start_time_list.append(start_times.get(event))
    end_time_list.append(end_times.get(event))

df_event_timings = pd.DataFrame({
    'Event': events_list,
    'Start_Time': start_time_list,
    'End_Time': end_time_list
})

print("Event Timings DataFrame:")
print(df_event_timings)
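
The cell above stores the timing values as plain floats, which is only meaningful if the file records decimal minutes. The WESAD documentation is understood to give these times as `minutes.seconds` (so `7.08` would mean 7 min 8 s); that reading is an assumption here and should be checked against the dataset's readme. Under that assumption, a conversion from the raw string token could look like the sketch below, where `mmss_to_seconds` is a hypothetical helper that is not part of this notebook.

```python
def mmss_to_seconds(token):
    """Convert a 'minutes.seconds' string such as '7.08' to seconds (assumed format)."""
    minutes, _, seconds = token.strip().partition('.')
    # Assumption: a single digit after the point is tens of seconds ('39.5' -> 39 min 50 s)
    seconds = seconds.ljust(2, '0') if seconds else '0'
    return int(minutes) * 60 + int(seconds)

base_start = mmss_to_seconds('7.08')
base_end = mmss_to_seconds('26.32')
print(base_start, base_end)
```

The conversion works on the string rather than the float because converting to float first discards a trailing zero in the seconds part.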


## Extract questionnaire responses

### Subtask:
Parse the 'PANAS', 'STAI', 'DIM', and 'SSSQ' questionnaire response rows from `df_s2_quest`.


**Reasoning**:
To parse the questionnaire responses, I will iterate through the `df_s2_quest` DataFrame, identify rows corresponding to 'PANAS', 'STAI', 'DIM', and 'SSSQ' questionnaires, extract the numeric data by splitting the strings and converting values to float, and then store these responses into separate pandas DataFrames as instructed.



In [None]:
import pandas as pd

# Initialize empty lists to store parsed responses for each questionnaire type
panas_data = []
stai_data = []
dim_data = []
sssq_data = []

# Map each questionnaire tag to the list that collects its parsed rows
questionnaire_rows = {
    '# PANAS': panas_data,
    '# STAI': stai_data,
    '# DIM': dim_data,
    '# SSSQ': sssq_data,
}

# Iterate through df_s2_quest from row index 4 onwards (as metadata is already extracted)
for i in range(4, len(df_s2_quest)):
    row_string = df_s2_quest.iloc[i, 0]

    # Check if the row starts with a questionnaire identifier and process it
    for tag, rows in questionnaire_rows.items():
        if row_string.startswith(tag):
            # Remove the tag and split by semicolon, then convert to numeric
            parts = row_string.replace(f'{tag};', '').split(';')
            numeric_parts = [float(p.strip()) for p in parts if p.strip()]
            if numeric_parts:
                rows.append(numeric_parts)
            break

# Convert lists of lists to pandas DataFrames. pd.DataFrame handles varying row lengths by padding with NaN.
df_panas = pd.DataFrame(panas_data)
df_stai = pd.DataFrame(stai_data)
df_dim = pd.DataFrame(dim_data)
df_sssq = pd.DataFrame(sssq_data)

# Print the head of each created DataFrame to inspect the parsed questionnaire responses
print("\n--- df_panas head ---")
print(df_panas.head())

print("\n--- df_stai head ---")
print(df_stai.head())

print("\n--- df_dim head ---")
print(df_dim.head())

print("\n--- df_sssq head ---")
print(df_sssq.head())

## Create structured DataFrame

### Subtask:
Combine the extracted questionnaire responses into a single, comprehensive pandas DataFrame, ensuring each question's response is in its own column and rows represent individual entries or participants, as appropriate.


**Reasoning**:
To combine the extracted questionnaire responses into a single DataFrame as instructed, I will first rename the columns of each individual questionnaire DataFrame with a descriptive prefix. Then, I will horizontally concatenate 'df_panas', 'df_stai', and 'df_dim', extend 'df_sssq' to match the row count, and finally concatenate it to the combined DataFrame, before printing the head and info of the resulting 'df_questionnaire_responses'.



In [None]:
import numpy as np

# 1. Rename the columns of each DataFrame
df_panas.columns = [f'PANAS_{i}' for i in range(len(df_panas.columns))]
df_stai.columns = [f'STAI_{i}' for i in range(len(df_stai.columns))]
df_dim.columns = [f'DIM_{i}' for i in range(len(df_dim.columns))]
df_sssq.columns = [f'SSSQ_{i}' for i in range(len(df_sssq.columns))]

# 2. Concatenate df_panas, df_stai, and df_dim horizontally
df_combined_responses = pd.concat([df_panas, df_stai, df_dim], axis=1)

# 3. Create a new DataFrame for df_sssq with the same number of rows as df_combined_responses
# Get the number of rows from df_combined_responses
num_rows = len(df_combined_responses)

# Create an empty DataFrame with the correct number of rows and columns from df_sssq
df_sssq_extended = pd.DataFrame(np.nan, index=range(num_rows), columns=df_sssq.columns)

# Place the actual df_sssq data (which is a single row) into the first row of the extended DataFrame
if not df_sssq.empty:
    df_sssq_extended.iloc[0] = df_sssq.iloc[0]

# 4. Horizontally concatenate the extended df_sssq DataFrame with df_combined_responses
df_questionnaire_responses = pd.concat([df_combined_responses, df_sssq_extended], axis=1)

# 5. Print the head and information of the final df_questionnaire_responses DataFrame
print("--- df_questionnaire_responses head ---")
print(df_questionnaire_responses.head())

print("\n--- df_questionnaire_responses info ---")
df_questionnaire_responses.info()

## Final Task

### Subtask:
Summarize the parsing process and the resulting structured DataFrame, including a preview of the cleaned questionnaire data and a brief discussion of its readiness for further analysis.


## Summary:

### Data Analysis Key Findings

*   The initial `df_s2_quest` DataFrame was identified as a single-column structure where each row contained semicolon-separated values.
*   The raw data contained distinct sections: header information (`# ORDER`, `# START`, `# END`), questionnaire responses (`# PANAS`, `# STAI`, `# DIM`, `# SSSQ`), and empty separator rows.
*   Event timings were successfully extracted from the `# ORDER`, `# START`, and `# END` rows into a `df_event_timings` DataFrame. For instance, the 'Base' event has a `Start_Time` of 7.08 and an `End_Time` of 26.32.
*   Individual questionnaire responses were parsed into separate DataFrames: `df_panas` (26 columns), `df_stai` (6 columns), `df_dim` (2 columns), and `df_sssq` (6 columns), handling varying response lengths with `NaN` values where necessary.
*   The final combined `df_questionnaire_responses` DataFrame consists of 5 entries and 40 columns, with all data converted to `float64`. Columns were systematically renamed (e.g., `PANAS_0`, `STAI_0`), and `df_sssq` (originally a single row) was appropriately extended with `NaN` values to match the row count of other questionnaires during concatenation.
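
The NaN extension described in the last finding can also be obtained from `pd.concat` alone, because column-wise concatenation aligns frames on their index and fills unmatched rows with `NaN`. The sketch below uses toy frames with placeholder values, not the real questionnaire data.

```python
import pandas as pd

# Toy stand-ins: a three-row frame and a single-row frame (placeholder values)
df_multi = pd.DataFrame({'DIM_0': [1.0, 2.0, 3.0]})
df_single = pd.DataFrame({'SSSQ_0': [4.0]})

# axis=1 aligns on the index; rows missing from df_single become NaN
combined = pd.concat([df_multi, df_single], axis=1)
print(combined)
```

Either approach places the single row at index 0, so which condition that row belongs to still has to be tracked separately.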

### Insights or Next Steps

*   The questionnaire data is now in a clean, structured format, making it ready for statistical analysis, such as calculating questionnaire scores, conducting correlation analysis with experimental events, or preparing for machine learning models.
*   Further analysis could involve integrating `df_questionnaire_responses` with `df_event_timings` and other subject-specific data to build a holistic dataset for in-depth insights into experimental outcomes and participant responses.


## Verify Saved Files

### Subtask:
List the contents of the `cleaned_data` directory to confirm that all DataFrames were successfully saved as CSV files.

**Reasoning**:
To verify that the files have been saved, I will use the `!ls` command to list the contents of the `cleaned_data_dir`.

In [None]:
print(f"Listing files in {cleaned_data_dir}:")
!ls -F {cleaned_data_dir}
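
Since these files are headed for GitHub, which blocks individual files larger than 100 MiB, it can help to check sizes as well as names. The sketch below is one possible check; `files_over` is a hypothetical helper, and the demonstration runs on a throwaway directory with tiny stand-in files rather than the real data.

```python
import tempfile
from pathlib import Path

def files_over(directory, limit_bytes):
    """Return sorted (name, size_in_bytes) pairs for files larger than `limit_bytes`."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_size > limit_bytes
    )

# Demonstration on a temporary directory with two small files
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, 'small.csv').write_bytes(b'x' * 10)
    Path(demo_dir, 'large.csv').write_bytes(b'x' * 1000)
    demo_result = files_over(demo_dir, limit_bytes=100)

print(demo_result)
```

For the real directory, a call such as `files_over(cleaned_data_dir, 100 * 1024**2)` would list the CSVs that need Git LFS or another storage route.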

# Task
Upload the cleaned data files from the directory "/content/WESAD_data/WESAD/S2/cleaned_data" to a GitHub repository.

## Configure Git and Add Files

### Subtask:
Initialize a Git repository if necessary, configure your Git user name and email, and then add the cleaned data files to the staging area. This might also involve navigating to the root of your repository.


**Reasoning**:
I will navigate to the directory where the cleaned data files are saved, as this is the first step in initializing a Git repository and managing these files.



In [None]:
import os

# Ensure the directory exists before navigating
os.makedirs(cleaned_data_dir, exist_ok=True)

print(f"Navigating to: {cleaned_data_dir}")
%cd {cleaned_data_dir}

**Reasoning**:
Now that the current working directory is set to `cleaned_data_dir`, I will initialize a Git repository, configure the Git user name and email, and add all files in the directory to the Git staging area, as instructed.



In [None]:
print('Initializing Git repository...')
!git init

print('Configuring Git user name and email...')
!git config user.name "Your Name"  # Replace "Your Name" with your GitHub username
!git config user.email "your.email@example.com"  # Replace "your.email@example.com" with your GitHub email

print('Adding all files to the Git staging area...')
!git add .

print('Git initialization and staging complete.')

**Reasoning**:
Now that the files are in the staging area, I will commit them to the local Git repository with a descriptive message.



In [None]:
print('Committing staged files...')
!git commit -m "Add cleaned WESAD S2 data files"

print('Commit complete.')

**Reasoning**:
The previous attempt to save cleaned data failed with a `NameError`, and consequently, the `git commit` command committed nothing as there were no files in the directory. Although the traceback suggested `df_event_timings` was not defined, the kernel state now shows both `df_event_timings` and `df_questionnaire_responses` as existing DataFrames. Therefore, I will re-attempt to save the cleaned `df_event_timings`, `df_questionnaire_responses`, and the sensor DataFrames to CSV files. This will ensure the files are actually present in the `cleaned_data_dir` before they are added to Git.



In [None]:
import os

# Define the directory to save cleaned data
cleaned_data_dir = f'{s2_data_path}/cleaned_data'

# Create the directory if it does not exist
os.makedirs(cleaned_data_dir, exist_ok=True)

print(f"Saving cleaned data to: {cleaned_data_dir}")

# Save df_event_timings
df_event_timings.to_csv(f'{cleaned_data_dir}/df_event_timings.csv', index=False)
print("Saved df_event_timings.csv")

# Save df_questionnaire_responses
df_questionnaire_responses.to_csv(f'{cleaned_data_dir}/df_questionnaire_responses.csv', index=False)
print("Saved df_questionnaire_responses.csv")

# Save individual chest sensor DataFrames
for sensor_name, df in chest_dataframes.items():
    df.to_csv(f'{cleaned_data_dir}/chest_{sensor_name.lower()}.csv', index=False)
    print(f"Saved chest_{sensor_name.lower()}.csv")

# Save individual wrist sensor DataFrames
for sensor_name, df in wrist_dataframes.items():
    df.to_csv(f'{cleaned_data_dir}/wrist_{sensor_name.lower()}.csv', index=False)
    print(f"Saved wrist_{sensor_name.lower()}.csv")

print("All cleaned DataFrames saved successfully!")

**Reasoning**:
Since the files have now been successfully saved to the `cleaned_data_dir`, I will re-add all files in the current directory to the Git staging area to ensure the newly created CSVs are tracked. After staging, I will commit these files to the local Git repository.



In [None]:
print('Re-adding all files to the Git staging area...')
!git add .

print('Committing staged files...')
!git commit -m "Add cleaned WESAD S2 data files"

print('Commit complete.')

Before pushing to a GitHub repository, you'll need to create a new, empty repository on GitHub and get its URL. You will also need to authenticate with GitHub. The recommended way to do this in a Colab environment is using a Personal Access Token (PAT).

**Instructions to generate a GitHub Personal Access Token (PAT):**
1. Go to your GitHub profile settings.
2. Navigate to `Developer settings` > `Personal access tokens` > `Tokens (classic)`.
3. Click `Generate new token`.
4. Give it a descriptive name (e.g., `Colab-Access`).
5. Set an expiration date (e.g., 7 days or 30 days).
6. Select the `repo` scope (or `public_repo` if you only plan to push to public repositories).
7. Click `Generate token` and **copy the token immediately** (you won't be able to see it again).

**Once you have your GitHub token and the new repository URL, execute the next code block.**

**Reasoning**:
Now that the files are committed locally and the user has been instructed on how to get a GitHub PAT, I will provide a code block that takes the user's PAT and repository URL as input, adds the remote origin, and then pushes the committed files to the specified GitHub repository.



In [None]:
import getpass
import os

# 1. Prompt for credentials securely
github_token = getpass.getpass('Enter your GitHub Personal Access Token: ')
github_repo_url = input('Enter your GitHub repository URL (e.g., https://github.com/username/repo.git): ')

# 2. Parse the URL to inject the token for authentication
if "https://github.com/" in github_repo_url:
    auth_repo_url = github_repo_url.replace(
        "https://github.com/", f"https://{github_token}@github.com/"
    )
else:
    auth_repo_url = github_repo_url

# 3. Clean up existing remote and add the new authenticated one
print('Configuring remote origin...')
# '|| true' keeps this line from failing when no 'origin' remote exists yet
!git remote remove origin 2>/dev/null || true
# Note: the token becomes part of the remote URL and is stored in plain text in .git/config
!git remote add origin "{auth_repo_url}"

# 4. Push to the repository
print('Pushing to GitHub...')
# This pushes the local 'master' branch. If the local branch is named 'main' instead,
# replace 'master' with 'main' here.
!git push -u origin master

print('Push complete. Your files should now be on GitHub!')
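
The string replacement used above to inject the token works for standard GitHub HTTPS URLs. A slightly stricter variant, sketched here with the standard library, parses the URL first and only rewrites it when the scheme and host match; `add_token` is a hypothetical helper and `'TOKEN'` is a placeholder, not a real credential.

```python
from urllib.parse import urlsplit, urlunsplit

def add_token(repo_url, token):
    """Return `repo_url` with `token` as the user part, for https://github.com URLs only."""
    parts = urlsplit(repo_url)
    if parts.scheme != 'https' or parts.hostname != 'github.com':
        return repo_url  # leave anything else untouched
    return urlunsplit(parts._replace(netloc=f'{token}@{parts.hostname}'))

authed = add_token('https://github.com/username/repo.git', 'TOKEN')
print(authed)
```

Parsing avoids rewriting a URL that merely contains the GitHub prefix somewhere in its path.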

In [None]:
import getpass

# 1. Prompt for credentials securely
github_token = getpass.getpass('Enter your GitHub Personal Access Token: ')
github_repo_url = input('Enter your GitHub repository URL: ')

# 2. Parse the URL to inject the token
if "https://github.com/" in github_repo_url:
    auth_repo_url = github_repo_url.replace(
        "https://github.com/", f"https://{github_token}@github.com/"
    )
else:
    auth_repo_url = github_repo_url

print("Authentication URL prepared.")

In [None]:
# 1. Initialize Git LFS for large files
print('Initializing Git LFS...')
!git lfs install
!git lfs track "*.csv"

# 2. Clean up existing remote and add the authenticated one
print('Configuring remote origin...')
!git remote remove origin 2>/dev/null || true
!git remote add origin "{auth_repo_url}"

# 3. Verify the remote is set correctly
!git remote -v

In [None]:
# 1. Stage all files (including .gitattributes and your large CSV)
print('Staging files...')
!git add .

# 2. Create the commit
print('Committing changes...')
!git commit -m "Upload agentic AI project with LFS tracking for large datasets"

# 3. Push to the repository
print('Pushing to GitHub (this may take a moment due to file size)...')
!git push -u origin master

print('Push complete! Your project is now live on GitHub.')

In [None]:
# 1. Ensure all CSVs are actually handled by LFS
!git lfs track "*.csv"
!git add .gitattributes

# 2. "Undo" the previous commit but keep your work (Reset to previous state)
!git reset --soft HEAD~1

# 3. Re-add everything. LFS will now intercept the large files properly.
print("Re-staging files with LFS...")
!git add .

# 4. Re-commit
!git commit -m "Corrected LFS tracking for agentic AI datasets"

# 5. Push
print("Attempting push...")
!git push -u origin master

In [None]:
# 1. DELETE the corrupted history and start over
!rm -rf .git

# 2. Re-initialize Git
!git init

# 3. Setup LFS FIRST (before adding files)
!git lfs install
!git lfs track "*.csv"
!git add .gitattributes

# 4. Add your project files
print("Staging files with LFS...")
!git add .
!git commit -m "Initial commit: Agentic AI project with proper LFS tracking"

# 5. Connect to the remote (using the auth_repo_url from your first cell)
!git remote add origin "{auth_repo_url}"

# 6. Force push to overwrite the broken history on GitHub
print("Pushing clean history to GitHub...")
!git push -f -u origin master

In [None]:
# 1. DELETE the corrupted history and start over
!rm -rf .git

# 2. Re-initialize Git and set identity
!git init
!git config user.email "michaeldomingo@sandiego.edu"
!git config user.name "mdomingo2029"

# 3. Setup LFS FIRST (before adding files)
print("Setting up Git LFS...")
!git lfs install
!git lfs track "*.csv"
!git add .gitattributes

# 4. Add your project files
print("Staging and committing files with LFS...")
!git add .
!git commit -m "Initial commit: Agentic AI project with proper LFS tracking"

# 5. Connect to the remote (Using the auth_repo_url from your previous setup)
!git remote add origin "{auth_repo_url}"

# 6. Force push to overwrite the broken history on GitHub
print("Pushing clean history to GitHub... this may take a few minutes.")
!git push -f -u origin master

In [None]:
# 1. Save the current state of the notebook file manually
!git add data_cleaning.ipynb
!git commit -m "Manual save of data_cleaning notebook"
!git push origin master

In [None]:
# 1. Check exactly where you are and what files are visible
print("Current Directory:")
!pwd
print("\nFiles in this folder:")
!ls

# 2. Try to add the notebook using its full path or by moving back to /content
# Replace 'data_cleaning.ipynb' with the actual name if it's different in the 'ls' list above.
!git add /content/data_cleaning.ipynb
!git commit -m "Manual save of notebook from root"
!git push origin master

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# 1. Search for your notebook in Drive
# Replace 'data_cleaning' with a keyword from your actual file name if different
!find /content/drive/MyDrive -name "*data_cleaning*.ipynb"

# 2. Once you see the path in the output, copy it here:
# Example: !cp "/content/drive/MyDrive/Colab Notebooks/data_cleaning.ipynb" /content/
# (Make sure to use quotes if there are spaces in the name)

In [None]:
# 1. Copy the notebook from Drive to the active Git repository folder
!cp "/content/drive/MyDrive/Colab Notebooks/data_cleaning.ipynb" /content/

# 2. Add the notebook and any remaining data
import os
os.chdir('/content')
!git add data_cleaning.ipynb
!git add WESAD_data/

# 3. Commit and Push
print("Committing and pushing your final project...")
!git commit -m "Final Project: Complete repository with notebook and LFS data"
!git push origin master

print("\nAll done! Your notebook and data are now live on GitHub.")

In [None]:
# 1. Increase the Git post buffer to handle large transfers (500MB)
!git config --global http.postBuffer 524288000

# 2. Ensure we are in the right folder
import os
os.chdir('/content')

# 3. Attempt the push again with a focus on stability
print("Resuming heavy upload (2.32 GB)... this will take several minutes.")
!git push -u origin master

In [None]:
# 1. Ensure we are at the root
import os
os.chdir('/content')

# 2. Increase the HTTP buffer and disable compression for large transfers
!git config --global http.postBuffer 1048576000
!git config --global core.compression 0

# 3. Push the branch; this sends all pending commits, and Git LFS uploads
# the objects they reference as part of the same push
print("Pushing notebook and metadata...")
!git push origin master

# 4. If the LFS upload was interrupted, push the LFS objects explicitly to resume
print("Resuming LFS data transfer...")
!git lfs push origin master

In [None]:

import numpy as np
import scipy.signal
import pandas as pd

def resample_sensor_data(df, target_freq, current_freq):
    if 'timestamp' not in df.columns:
        print("Timestamp column missing, cannot resample based on time. Assuming uniform sampling.")
        num_samples = len(df)
        time_duration = num_samples / current_freq
        new_num_samples = int(time_duration * target_freq)

        resampled_df = pd.DataFrame()
        for col in df.columns:
            if np.issubdtype(df[col].dtype, np.number): # Only resample numeric columns
                resampled_data = scipy.signal.resample(df[col].values, new_num_samples)
                resampled_df[col] = resampled_data
            else:
                # For non-numeric, we can't directly resample, maybe forward fill or skip
                # For now, let's just carry over if index matches, though index won't align
                pass
        # Need to create a new time index for resampled_df
        new_time_index = np.linspace(0, time_duration, new_num_samples, endpoint=False)
        # If we had an original start time, we'd add it here.
        # resampled_df['timestamp'] = new_time_index + (original_start_time if available)
        # Since we don't have original timestamp, we create a relative one.
        resampled_df.insert(0, 'relative_time', new_time_index)

    else:
        time_seconds = (df['timestamp'] - df['timestamp'].iloc[0]) / np.timedelta64(1, 's')
        time_duration = time_seconds.iloc[-1]
        new_num_samples = int(time_duration * target_freq)
        new_time_index = np.linspace(0, time_duration, new_num_samples, endpoint=False)

        resampled_df = pd.DataFrame()
        resampled_df['timestamp_new'] = pd.to_timedelta(new_time_index, unit='s') + df['timestamp'].iloc[0]

        for col in df.columns:
            if col != 'timestamp' and np.issubdtype(df[col].dtype, np.number):
                resampled_data = np.interp(new_time_index, time_seconds.values, df[col].values)
                resampled_df[col] = resampled_data
        resampled_df = resampled_df.rename(columns={'timestamp_new': 'timestamp'})

    return resampled_df

# Assuming chest_dataframes and wrist_dataframes are already loaded
# and contain DataFrames for each sensor, with original frequencies known.

# Example frequencies (replace with actual frequencies if known and different)
chest_freq = 700
wrist_freqs = {'ACC': 32, 'BVP': 64, 'EDA': 4, 'TEMP': 4} # Example freqs for wrist sensors

target_freq = 64

print("Resampling chest dataframes...")
for sensor_name, df in chest_dataframes.items():
    print(f"Resampling {sensor_name} from {chest_freq}Hz to {target_freq}Hz")
    chest_dataframes[sensor_name] = resample_sensor_data(df.copy(), target_freq, chest_freq)
    print(f"New shape of {sensor_name}: {chest_dataframes[sensor_name].shape}")

print("\nResampling wrist dataframes...")
for sensor_name, df in wrist_dataframes.items():
    current_f = wrist_freqs.get(sensor_name, 32) # Default to 32 if not in map
    print(f"Resampling {sensor_name} from {current_f}Hz to {target_freq}Hz")
    wrist_dataframes[sensor_name] = resample_sensor_data(df.copy(), target_freq, current_f)
    print(f"New shape of {sensor_name}: {wrist_dataframes[sensor_name].shape}")

print("\nResampling complete.")



Resampling chest dataframes...

Resampling wrist dataframes...

Resampling complete.
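
The output above contains no per-sensor lines, which suggests both dictionaries were empty when the cell ran. As a reference for a later re-run, the sketch below computes the row count the no-timestamp branch of `resample_sensor_data` would produce, using the same formula and the original (pre-cleaning) row counts reported earlier in this notebook. `expected_length` is a hypothetical helper, and the sampling rates are the assumed values from the cell above.

```python
def expected_length(n_rows, current_hz, target_hz):
    """Row count after resampling an evenly sampled series without timestamps."""
    return int(n_rows / current_hz * target_hz)

# (original row count, assumed sampling rate in Hz) per sensor group
ORIGINAL = {
    'chest (all sensors)': (4_255_300, 700),
    'wrist ACC': (194_528, 32),
    'wrist BVP': (389_056, 64),
    'wrist EDA/TEMP': (24_316, 4),
}

expected = {name: expected_length(n, hz, 64) for name, (n, hz) in ORIGINAL.items()}
print(expected)
```

If the resampled shapes differ from these values, the input series were probably not the full-length originals.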
