# Creation of DataFrame for Visual Saliency Analysis of ET COCO experiment

This Jupyter notebook contains several code cells that create a DataFrame for analyzing visual saliency in an eye-tracking experiment.

*Information about the eye tracking experiment* 

The eye-tracking experiment involved 1,000 stimuli retrieved from the Microsoft COCO image dataset, 250 (1/4) of which were repeated. In a within-subjects design with randomized trials at a single time point, each participant completed 1,250 trials during data collection. The experimental conditions matched those of the well-known experiment in the NSD dataset by Allen et al. (2022), who recorded only the voxel data of participants performing the same experiment while lying in an fMRI scanner. In contrast, our study captured the missing ground-truth eye-tracking data: we recorded eye movements from 23 participants as they viewed the same images used by Allen et al. (2022). The focus of this analysis is on the participants' eye movements, particularly fixations, during image perception.

Participants (*N* = 23) performed a continuous recognition task, which required them to remember and identify as many stimuli as possible. This indirect task design enables natural image exploration, yielding more authentic eye-tracking data that better represent real-world visual perception.

Written by Lisa Heinemann  
Last edited: 15/12/2024

Used Tools: 
- MATLAB (R2022b)
- EyeLink (1000 Plus)
- Python 3.11.5


## Create a pandas DataFrame to access the data



*Overview*

This code imports the required libraries, including scipy.io, pandas, and several others.

- ... uses the pathlib module to organize data access: it sets the base, data and save directories as Path objects that resolve correctly on the operating system the code is running on (e.g. Windows, macOS or Linux)
    - the data directory holds the MATLAB files and the ASCII eye-tracking file of each subject

- ... creates a DataFrame per block for all subjects (subjects 3-23 for all 5 blocks)
    - ... handles the exceptions caused by data loss at the beginning of data collection (for subject 1 the concatenated DataFrame can only be created for block 1, and for subject 2 only for blocks 1, 2, 3 and 5)
        - *for subject 1*: the [MATLAB Data Subject Experiment file e.g.:](../../../../bartels_data/lheinemann/0_subjectData/VP1/data_subj_1_24yrs_f_20231220_1.mat) exists for blocks 1 and 5, the ASCII files [Eye Tracking Data e.g.:](../../../../bartels_data/lheinemann/0_subjectData/VP1/ET_1_1.asc) exist for blocks 1-4, and the [Generated image sequence matrix](../../../../bartels_data/lheinemann/0_subjectData/VP1/img_seq_sub661_24yrs_f_20231220.mat) exists

        - *for subject 2*: the MATLAB experimental data files exist for blocks 1, 2, 3 and 5; all ASCII files and the image sequence matrix exist

        - *for subjects 1-4*: the signal detection outcomes are calculated post hoc from oldResponsesDf and img_is_old
        - oldResponsesDf: value *1* if the subject pressed key 'j' for 'old image', value *0* if the subject pressed 'h' for 'new image'
            - stored in the MATLAB Data Subject Experiment file
        - img_is_old = 1 if the image presented at position *k* already occurred earlier in the image sequence matrix, meaning it had already been presented to the subject within this experiment

- ... then defines the MATLAB file paths relative to the data_dir object to access the data that will be converted into a pandas DataFrame

- ... loads the MATLAB file using scipy.io.loadmat.

- ... loads the Eye Tracking Data from the ASCII file [Eye Tracking Data](../../../../bartels_data/lheinemann/0_subjectData/VP1/ET_1_1.asc)

- ... extracts the relevant data from the MATLAB data files and the ASCII file

- ... converts the extracted data to a Pandas DataFrame

- ... saves df_combined as a CSV file at the end of the notebook cell; this CSV file can then be opened with Data Wrangler via "Data Wrangler: Open File"
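
The file lookup described in the list above can be sketched with pathlib alone. This is a minimal illustration, not the notebook's implementation: the helper name `find_block_files` is hypothetical, the file name patterns are the ones used in the code cells of this notebook, and the demonstration runs on empty placeholder files in a temporary directory.

```python
from pathlib import Path
import tempfile


def find_block_files(sub_data_dir: Path, sub_num: int, block: int) -> dict:
    """Return the first match per file type, or None when a file is missing."""
    patterns = {
        "mat": f"data_subj_{sub_num}_*{block}.mat",   # experimental data
        "img_seq": f"img_seq_sub{sub_num}_*.mat",     # image sequence matrix
        "asc": f"ET_{sub_num}_{block}.asc",           # eye-tracking ASCII file
    }
    found = {}
    for key, pattern in patterns.items():
        matches = sorted(sub_data_dir.glob(pattern))
        found[key] = matches[0] if matches else None
    return found


# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    vp_dir = Path(tmp) / "VP1"
    vp_dir.mkdir()
    for name in ("data_subj_1_24yrs_f_20231220_1.mat",
                 "img_seq_sub1_24yrs_f_20231220.mat",
                 "ET_1_1.asc"):
        (vp_dir / name).touch()
    block1 = find_block_files(vp_dir, 1, 1)
    block2 = find_block_files(vp_dir, 1, 2)
    block1_names = {k: (v.name if v else None) for k, v in block1.items()}
    block2_names = {k: (v.name if v else None) for k, v in block2.items()}

print(block1_names)
print(block2_names)
```

Returning `None` for a missing file makes it possible to decide per block whether a DataFrame can be built, instead of failing at the first missing file.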

In the second part of the function definition "create_dataframe", the DataFrame df_combined is restructured into df_expanded.
- **In the reorganized DataFrame df_expanded, one row represents one fixation.**
    - The function *reorganize_dataframe* changes the structure of the DataFrame so that each row corresponds to one fixation (fix_id). It takes the DataFrame generated in the code cell above (for the given currentBlockNumber) and reorganizes its structure.
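
The reorganization into one row per fixation can be illustrated as follows. This is a sketch in plain Python to keep it dependency-free; the function name `expand_trials`, the `fixations` key and the sample values are assumptions made for illustration and are not taken from *reorganize_dataframe*.

```python
def expand_trials(trials):
    """Turn trial-level records into one record per fixation."""
    rows = []
    fix_id = 0  # running fixation counter across the whole block
    for trial in trials:
        for n, (x, y, dur) in enumerate(trial["fixations"], start=1):
            fix_id += 1
            rows.append({
                "fixation_num": fix_id,        # position within the block
                "fixation_num_in_trial": n,    # position within the trial
                "trial_nr": trial["trial_nr"],
                "img_id": trial["img_id"],
                "x": x,
                "y": y,
                "fixation_dur": dur,
            })
    return rows


# Two made-up trials with two and one fixation(s)
trials = [
    {"trial_nr": 1, "img_id": "img_a",
     "fixations": [(512.0, 384.0, 210), (300.5, 400.2, 180)]},
    {"trial_nr": 2, "img_id": "img_b",
     "fixations": [(640.1, 120.9, 250)]},
]

expanded = expand_trials(trials)
for row in expanded:
    print(row)
```

Trial-level values such as `img_id` are repeated on every fixation row, which is what allows each fixation to be analyzed without looking up its trial again.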

This Jupyter notebook accesses the file paths of the MATLAB files and the ASCII file according to the subject number ('nr') and block number ('currentBlockNumber'), creates a DataFrame and saves it as a CSV file (e.g. *df_expanded_data_subj_1_1_2024-11-27.csv*) in the conda environment saliency-nsd *(Path: /Volumes/lheinemann/saliency-nsd/data/preprocessed)*

For this it is necessary to consider: 
- if currentBlockNumber = 1 => consider trial_nr: 1-250 (Index: 0 to 249)
- if currentBlockNumber = 2 => consider trial_nr: 251-500 (Index: 250 to 499)
- if currentBlockNumber = 3 => consider trial_nr: 501-750 (Index: 500 to 749)
- if currentBlockNumber = 4 => consider trial_nr: 751-1000 (Index: 750 to 999)
- if currentBlockNumber = 5 => consider trial_nr: 1001-1250 (Index: 1000 to 1249)
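
The mapping above follows from 250 trials per block. A minimal sketch of the arithmetic, where the helper name `block_row_range` is hypothetical:

```python
ROWS_PER_BLOCK = 250  # trials per block, as listed above


def block_row_range(block):
    """Return the (start, end) index pair for a block; end is exclusive."""
    if not 1 <= block <= 5:
        raise ValueError(f"block must be between 1 and 5, got {block}")
    start = (block - 1) * ROWS_PER_BLOCK
    end = block * ROWS_PER_BLOCK
    return start, end


for block in range(1, 6):
    start, end = block_row_range(block)
    print(f"block {block}: trial_nr {start + 1}-{end}, index {start} to {end - 1}")
```

Because `end` is exclusive, the pair can be used directly in a slice such as `values[start:end]`.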

#### Regarding the extraction of MATLAB data: 
The code extracts and processes each field, using the **flatten_and_round function** to handle the nested structures of the variables stored in the *'images'* field of the MATLAB file. *'images'* is a nested struct stored inside the *'data'* struct (data.images).

To access only the rows for the current block in the MATLAB file (see the underlying logic above), the start_row and end_row variables are used to slice the arrays.

### The structure of the DataFrame should thus be: 

### A: Variables accessed from the MATLAB files

- **subject_nr** = Subject ID (*integer*)
- **block_nr** = Number of the block (*integer*)
- **trial_nr** = Trial number within the current block (*integer*)
- **img_id** = Name of the picture shown within the trial (*string*)
- **img_onset** = Timestamp of the start of the image presentation (*float*)
- **img_offset** = Timestamp of the end of the image presentation (*float*)
- **oldResponsesDf** = Key response of the subject: **1** if the subject pressed 'j' for 'old image', **0** if the subject pressed 'h' for 'new image'. --> named old_responses in the MATLAB script --> variable that can have more than 1 entry, or be empty if the subject did not answer
- **img_is_old** = **true = 1**, if the image had already been presented to the participant within the experiment (is an old image). If not = **false = 0**. (*boolean*)
- **rt** = Reaction time from image onset until the first key response, measured in seconds, milliseconds (only the first key response is considered). (*float*)
- **number_of_responses** = The number of key presses made by the subject during the presentation of the trial. (*integer*)
- **correct_response** = **true**, if the subject's reaction was correct (the subject answered 'old' if the image presented was old and 'new' if it was new). If not, correct_response = **false = 0**. (*integer*)
- **hits** = **true**, if the image had already been presented to the participant within the experiment (is an old image) and the subject correctly answered old. If not = **false**. --> hits corresponds to correct_response on old images (*boolean*)
- **misses** = **true**, if the image had already been presented to the participant within the experiment (is an old image) and the subject incorrectly answered new. If not = **false**. (*boolean*)
- **correctRejections** = **true**, if the image had not yet been presented to the participant within the experiment (is a new image) and the subject correctly answered new. If not = **false**. (*boolean*)
- **falseAlarms** = **true**, if the image had not yet been presented to the participant within the experiment (is a new image) and the subject incorrectly answered old. If not = **false**. (*boolean*)
- **Image_actual_dur** = Actual duration of the image presentation in milliseconds (nominally 3 s --> 3000 ms) (*float*)

This dataframe has the name df_combined.

### B: Then the ASCII file is added and the DataFrame is reorganized into df_expanded.

- **fixation_num** = Number of the fixation within the current trial (the first, second, third ...) (*integer*)
- **x** and **y** = Coordinates of the current fixation (*float*)
- **fixation_time** = Time of the fixation, relative to the start time of the eye tracking within the experiment (*float* or *integer*)
- **fixation_dur** = Duration of the fixation in milliseconds (*integer*)
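
How these values could be read from the ASCII file is sketched below. It assumes that fixations are taken from the `EFIX` event lines of the EyeLink ASC format (`EFIX <eye> <start> <end> <duration> <x> <y> <pupil>`); the helper name `parse_efix` and the sample lines are made up for illustration, and the notebook's own extraction may differ.

```python
def parse_efix(line):
    """Parse one 'EFIX' (end of fixation) line; return None for other lines."""
    parts = line.split()
    if len(parts) < 7 or parts[0] != "EFIX":
        return None
    return {
        "eye": parts[1],
        "fixation_time": int(parts[2]),  # start timestamp of the fixation
        "fixation_end": int(parts[3]),   # end timestamp of the fixation
        "fixation_dur": int(parts[4]),   # duration as reported by the tracker
        "x": float(parts[5]),            # average x position
        "y": float(parts[6]),            # average y position
    }


# Made-up lines in the style of an EyeLink ASC file
sample_lines = [
    "START\t1000000 \tRIGHT\tSAMPLES\tEVENTS",
    "SFIX R   1000020",
    "EFIX R   1000020\t1000230\t211\t  512.3\t  384.1\t   1050",
    "ESACC R  1000231\t1000260\t30\t  512.0\t  384.0\t  300.5\t  400.2\t   4.10\t    250",
    "EFIX R   1000261\t1000440\t180\t  300.5\t  400.2\t   1100",
]

fixations = [f for f in map(parse_efix, sample_lines) if f is not None]
for fixation in fixations:
    print(fixation)
```

The timestamps are parsed as integers here, in line with how the notebook reads the `START` timestamp; recordings with fractional timestamps would need `float` instead.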

In df_expanded one row represents one fixation. The reorganization transforms the DataFrame into the following column structure:

| fixation_num | fixation num in trial | subject_nr | block_nr | trial_nr | img_id | fixation_dur | fixation_time_relative | x | y | img_onset | img_offset | Image actual_dur | oldResponsesDf | is_old_img | rt | number_of_responses | correct_response | hits | misses | correctRejections | falseAlarms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |

- **fixation_num** = Number of the fixation within the whole block (the first, second, third ...) (*integer*)
- **fixation num in trial** = Number of the fixation within the current trial
- **subject_nr** = Subject ID (*integer*)
- **block_nr** = Number of the block (*integer*)
- **trial_nr** = Trial number within the current block (*integer*)
- **img_id** = ID of the picture presented within this trial (*string*)
- **fixation_dur** = Duration of the fixation in milliseconds (*integer*)
- **fixation_time_relative** = Timestamp of the start of the fixation relative to the start of the trial (img_onset timestamp) (*float*)
- **x** and **y** = Coordinates of the current fixation (*float*)
- **img_onset** = Timestamp of the start of the image presentation (*float*)
- **img_offset** = Timestamp of the end of the image presentation (*float*)
- **Image_actual_dur** = Actual duration of the image presentation in milliseconds (nominally 3 s --> 3000 ms) (*float*)
- **oldResponsesDf** = Key response of the subject: *1* if the subject pressed 'j' for 'old image', *0* if the subject pressed 'h' for 'new image'. --> named old_responses in the MATLAB script --> variable that can have more than 1 entry or could be empty if the subject did not answer
- **is_old_img** = Whether the image had already been presented to the subject within the experiment (value = 1) or not (value = 0). This variable is retrieved from the MATLAB files that generate the individual image sequence matrix of each participant.
- **rt** = Reaction time from image onset until the first key response, measured in seconds, milliseconds (only the first key response is considered). (*float*) --> variable that can have more than 1 entry or could be empty if the subject did not answer
- **number_of_responses** = Number of responses (key presses) a participant made during one trial (*integer*)
- **correct_response** = *true*, if the subject's reaction was correct (the subject answered old if the image presented was old and new if it was new). If not, correct_response = *false*. (*integer*)
- **hits** = *true*, if the image had already been presented to the participant within the experiment (is an old image) and the subject correctly answered old. If not = *false*. --> hits corresponds to correct_response on old images (*boolean*)
- **misses** = *true*, if the image had already been presented to the participant within the experiment (is an old image) and the subject incorrectly answered new. If not = *false*. (*boolean*)
- **correctRejections** = *true*, if the image had not yet been presented to the participant within the experiment (is a new image) and the subject correctly answered new. If not = *false*. (*boolean*)
- **falseAlarms** = *true*, if the image had not yet been presented to the participant within the experiment (is a new image) and the subject incorrectly answered old. If not = *false*. (*boolean*)

### *Note 1:* The variables oldResponsesDf (old_responses within the MATLAB file) and rt (RTs in the MATLAB file) can contain more than one entry within one cell.

To be able to create a DataFrame for every participant, a column named number_of_responses is added to the DataFrame. This variable counts the responses of a participant within one trial. If number_of_responses > 1, only the last entry is considered, because participants were told to react fast but were allowed to correct their answer within the trial (task: determine whether the image presented in this trial had already been presented to them or not).
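
The rule in this note can be sketched as follows. The sketch assumes the storage format used in the code cells below, where multiple entries are kept as a comma-separated string and a missing answer is stored as the string "None"; the helper name `summarize_responses` and the example values are hypothetical.

```python
def summarize_responses(old_responses, rts):
    """Count the key presses of one trial and keep only the last one."""
    if old_responses == "None":
        # No key press was recorded in this trial
        return {"number_of_responses": 0, "final_response": None, "final_rt": None}
    responses = [int(value) for value in old_responses.split(",")]
    times = [float(value) for value in rts.split(",")]
    return {
        "number_of_responses": len(responses),
        "final_response": responses[-1],  # the last entry counts as the answer
        "final_rt": times[-1],
    }


corrected = summarize_responses("0,1", "0.81234,1.40211")  # answer was corrected
single = summarize_responses("1", "0.95")                  # one key press
no_answer = summarize_responses("None", "None")            # no key press

print(corrected)
print(single)
print(no_answer)
```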

### *Note 2*: Covering edge cases - This code also prepares subjects 1, 2, 3 and 4 for concatenation into the DataFrame. The MATLAB files of these subjects do not contain the variables hits, misses, falseAlarms and correctRejections.

(05.11) The code works and can be used for any subject number (variable name: sub_num) from 1 to 23. For subjects 1 to 4 it calculates the hits, misses, falseAlarms and correctRejections post hoc via img_is_old and oldResponsesDf, and for subjects 5 to 23 it retrieves these values from the corresponding MATLAB data file.
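
The post hoc calculation can be sketched from the definitions of the four outcomes given earlier. The helper name `classify_trial` is hypothetical, and treating a trial without any response as belonging to none of the four categories is an assumption of this sketch, not something the notebook states.

```python
def classify_trial(img_is_old, response):
    """Derive the four signal detection flags for one trial.

    img_is_old: 1 if the image was shown before, 0 otherwise.
    response:   1 for 'old' (key 'j'), 0 for 'new' (key 'h'), None if no answer.
    """
    outcome = {"hits": 0, "misses": 0, "correctRejections": 0, "falseAlarms": 0}
    if response is None:
        # Assumption: unanswered trials are not counted in any category
        return outcome
    if img_is_old == 1:
        outcome["hits" if response == 1 else "misses"] = 1
    else:
        outcome["falseAlarms" if response == 1 else "correctRejections"] = 1
    return outcome


for old, resp in [(1, 1), (1, 0), (0, 0), (0, 1), (1, None)]:
    print(old, resp, classify_trial(old, resp))
```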

### *I. Import necessary libraries for this code cell*


In [1]:
from pathlib import Path
from itertools import chain
from scipy.io import loadmat
import pandas as pd
import scipy.io
import os
import glob
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import itertools


### II. Base directory, data directory and save directory setup 

In [None]:
# Create a path object for the base directory
base_dir = Path('~/saliency-nsd').expanduser()

# Check if the base directory exists with an assert 
assert base_dir.exists(), f'{base_dir} does not exist'
# Print the base directory
print(base_dir)

# Combine paths to create new Path objects for the data directory and the save directory
data_dir = base_dir / 'data/raw'
save_dir = base_dir / 'data/preprocessed'

# Check if the data and save directories exist with assert statements, if not assertion error is raised
assert data_dir.exists(), f'{data_dir} does not exist'
assert save_dir.exists(), f'{save_dir} does not exist'

# Print statement of path 
print(data_dir)
print(save_dir)

# Print the working directory as a sanity check; it should be /gpfs01/bartels/user/lheinemann/saliency-nsd/code/code_analysis_saliency-nsd/1_data-preprocessing/creation_of_dataframes
print(os.getcwd())


/gpfs01/bartels/user/lheinemann/saliency-nsd
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/preprocessed
/gpfs01/bartels/user/lheinemann/saliency-nsd/code/code_analysis_saliency-nsd/1_data-preprocessing/creation_of_dataframes


### *III. Test accessibility of data file*

In [3]:
# To test the accessibility of the data files: print the content of the data directory
print(list(data_dir.iterdir()))

""" # Delete all files with ._VP in the name for all 
for file in data_dir.iterdir():
    if '._VP' in file.name:
        file.unlink() """

# Print contents of VP23 as an example
print(list((data_dir / 'VP23').iterdir()))

""" # Delete files with ._ in the name for VP23 and did it for all other VP's as well 
for file in (data_dir / 'VP23').iterdir():
    if '._' in file.name:
        file.unlink()
 """


[PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP6'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP2'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/._.DS_Store'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP7'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP18'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP8'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP19'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP14'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP9'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP15'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/.DS_Store'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP3'), PosixPath('/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP16'), PosixPath('/gpfs01/bartels/user/lheinemann/s

" # Delete files with ._ in the name for VP23 and did it for all other VP's as well \nfor file in (data_dir / 'VP23').iterdir():\n    if '._' in file.name:\n        file.unlink()\n "

## 1. Running this code creates the DataFrames for each block of the individual subjects (1-23). Each DataFrame combines the information from the ASCII eye-tracking file and two MATLAB files (the image sequence matrix file and the experimental data .mat file).

The image sequence matrix file contains the calculated unique image sequence matrix, and the experimental data .mat file contains the key presses, the img_onset and img_offset times, and further variables.

Due to data loss there are some exceptions for subjects 1 and 2. *Subject 1* only has a DataFrame for block 1, because the experimental data .mat files are missing for blocks 2, 3 and 4 and the ASCII file is missing for block 5. *Subject 2* has no DataFrame for block 4 because the .mat file is missing.
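
The exceptions described above can be captured in a small lookup so that only the subject and block combinations with complete data are processed. This is a sketch; the names `BLOCK_EXCEPTIONS`, `available_blocks` and `jobs` are hypothetical.

```python
ALL_BLOCKS = [1, 2, 3, 4, 5]

# Blocks for which a DataFrame can be created despite the data loss
BLOCK_EXCEPTIONS = {
    1: [1],           # subject 1: only block 1 is complete
    2: [1, 2, 3, 5],  # subject 2: the .mat file of block 4 is missing
}


def available_blocks(sub_num):
    """Return the blocks with complete data for a subject."""
    return BLOCK_EXCEPTIONS.get(sub_num, ALL_BLOCKS)


# All (subject, block) pairs that can be turned into a DataFrame
jobs = [(sub, block) for sub in range(1, 24) for block in available_blocks(sub)]
print(len(jobs))  # 1 + 4 + 21 * 5 = 110
```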

In [None]:
########################################## Load the Matfile path. Add the range of subjects you want to create the DataFrame of (N= 23). ###########################################

# Alternative way of passing the paths of base_dir, data_dir and save_dir to the function. In the code cell above pathlib is used to create Path objects instead of strings
# This makes it possible to use the code in different environments and operating systems. The Path objects are resolved automatically and passed to the function as arguments

""" # Define directory paths with os.path.join as an alternative to pathlib.Path
# Because pathlib is used, it is not necessary to use os.path.join
base_dir = os.path.expanduser('~/saliency-nsd')
input_dir = os.path.join(base_dir, 'data/raw')
output_dir = os.path.join(base_dir, 'data/preprocessed') """

########################################### Check that the data directory of every subject exists #################################################################################

for sub_num in range(1, 24):

    # Path to subject data directory
    sub_data_dir = data_dir / f'VP{sub_num}'
    print(sub_data_dir)

    # Make sure that subject data directory exists
    assert sub_data_dir.exists(), f"Data directory for subject {sub_num} does not exist"
    # Print that sub_num is processed
    print(f"Processing subject number {sub_num}")

################################### Function to create the DataFrames for the subjects and save them as CSV files #############################################################
# Define a function to create the DataFrames 
def create_dataframe(sub_num, currentBlockNumber):

    ######################################################### Load the .mat file with experimental data ######################################################################

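    # Derive the subject data directory here so the function does not depend on the loop variable of the cell above
    sub_data_dir = data_dir / f'VP{sub_num}'

    # Construct the search pattern for the .mat file with the experimental data
    search_pattern = f'data_subj_{sub_num}_*{currentBlockNumber}.mat'
    search_path = os.path.join(sub_data_dir, search_pattern)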

    # Debugging: Print the search path and pattern
    print(f"Searching for MATLAB data file with pattern: {search_pattern}")
    print(f"Full search path: {search_path}")

    # Use glob to find the file that matches the pattern
    matching_files = sorted(glob.glob(search_path))
    if matching_files:
        print(f"Found {len(matching_files)} matching files.")
        print(f"First matching file: {matching_files[0]}")
    else:
        print(f"No matching files found with pattern: {search_pattern}")
        print(f"Please check if the file exists in the directory: {sub_data_dir}")

    # Check if any matching files are found
    # The assert statement is used to check if the condition is true, if not, the program will raise an AssertionError.
    assert len(matching_files) > 0, f"No matching MATLAB Data file found for subject number {sub_num} and block number {currentBlockNumber}"

    # There is only one matching file, get the first one.
    mat_file_path = matching_files[0]
    print(f"Found file: {mat_file_path}")

    # Load the .mat file
    try:
        mat_file_data = scipy.io.loadmat(mat_file_path)
    except Exception as e:
        print(f"Error loading .mat file: {e}")
        raise  # re-raise instead of exit(), which would shut down the Jupyter kernel
    ######################################################### Load the Image Sequence Matrix File (also MATLAB file) ##################################################################

    # This way the img_is_old variable can be loaded and added to the DataFrame as well as the images can be reconstructed from the image sequence matrix that were shown to subject 1 in block 2,3,4, where the 
    # .mat-files of the experimental data are missing (for subject 1, block 5 the ASCII file is missing, so this can not be reconstructed)
    # For subject 2 the .mat-file with the experimental data of block 4 is missing, all other files of the subject are available.

    # Construct the search pattern for the image sequence matrix file
    # load the image sequence matrix according to the number 
    # the image sequence matrix just needs the subject number to be loaded and not the currentBlockNumber 
    # because the image sequence matrix is the same for all blocks of one subject
     
    search_pattern = f'img_seq_sub{sub_num}_*.mat' 
    search_path = os.path.join(sub_data_dir, search_pattern)

    # Use glob to find the file that matches the pattern
    matching_files = glob.glob(search_path)
    if matching_files:
        print(f"Found {len(matching_files)} matching files.")
        print(f"First matching file: {matching_files[0]}")

    # Check if any matching files are found; without the image sequence matrix img_is_old cannot be read later on
    if not matching_files:
        raise FileNotFoundError(f"No matching Image Sequence Matrix file found for subject number {sub_num} and block number {currentBlockNumber}")

    else:
        # Assuming there is only one matching file, access it
        img_seq_mat_file_path = matching_files[0]
        print(f"Found file: {img_seq_mat_file_path}")

        # Load the second .mat file
        try:
            img_seq_mat_file_data = scipy.io.loadmat(img_seq_mat_file_path)
        except Exception as e:
            print(f"Error loading Image Sequence Matrix file: {e}")
            raise  # re-raise instead of exit(), which would shut down the Jupyter kernel

    ############################################################## Load the ASCII file ############################################################################################################

    # Construct the search pattern for the ASCII file
    search_pattern = f'ET_{sub_num}_{currentBlockNumber}.asc'
    search_path = os.path.join(sub_data_dir, search_pattern)

    # Use glob to find the file that matches the pattern
    matching_files = glob.glob(search_path)
    if matching_files:
        print(f"Found {len(matching_files)} matching files.")
        print(f"First matching file: {matching_files[0]}")

    # Check if any matching files are found; without the ASCII file no fixations can be extracted
    if not matching_files:
        raise FileNotFoundError(f"No matching ASCII files found for subject number {sub_num} and block number {currentBlockNumber}")
    else:
        # Assuming there is only one matching file, get the first one
        ascii_file_path = matching_files[0]
        print(f"Found file: {ascii_file_path}")

        # Load the ASCII file
        try:
            with open(ascii_file_path, 'r') as file:
                lines = file.readlines()
        except Exception as e:
            print(f"Error loading ASCII file: {e}")
            raise  # re-raise instead of exit(), which would shut down the Jupyter kernel

    # Define a function to extract the eye tracker start time from the ASCII file
    def extract_eye_tracker_start_time(lines):
        for line in lines:
            if line.startswith('START'):
                parts = line.split()
                if len(parts) > 1:
                    return int(parts[1])
        return None

    # Extract the eye tracker start time
    eye_tracker_start_time = extract_eye_tracker_start_time(lines)
    print(f"Eye tracker start time: {eye_tracker_start_time}")
    if eye_tracker_start_time is None:
        raise ValueError("Eye tracker start time not found in the ASCII file.")

    # Set the display option to show only 5 decimals within the pandas DataFrame
    pd.options.display.float_format = '{:.5f}'.format

    # Define a function to handle nested structures and round numeric values
    def flatten_and_round(x):
        if isinstance(x, list) or isinstance(x, np.ndarray):
            flattened = []
            for item in x:
                if isinstance(item, (list, np.ndarray)):
                    flattened.extend(flatten_and_round(item))
                else:
                    flattened.append(round(item, 5) if isinstance(item, (int, float)) else item)
            return flattened
        return round(x, 5) if isinstance(x, (int, float)) else x

    # Calculate the start and end rows based on the block number
    rows_per_block = 250
    start_row = (int(currentBlockNumber) - 1) * rows_per_block
    end_row = int(currentBlockNumber) * rows_per_block

    print(f"Considering rows from {start_row} to {end_row - 1} for block {currentBlockNumber}")

    # Process the 'data' field of the MATLAB file to extract the relevant information for the DataFrame
    if 'data' in mat_file_data:
        data = mat_file_data['data']
        
        if isinstance(data, np.ndarray) and data.dtype.names is not None:
            if 'nr' in data.dtype.names and 'currentBlockNumber' in data.dtype.names and 'images' in data.dtype.names:
                
                # Print the shape of data struct in MATLAB file and the contents of 'sub_num' and 'currentBlockNumber'
                print("Shape of 'data':", data.shape)
                print("Contents of 'sub_num':", data['nr'])
                print("Contents of 'currentBlockNumber':", data['currentBlockNumber'])
                
                if data['nr'].size > 0 and data['currentBlockNumber'].size > 0:
                    try:
                        if data['nr'][0, 0].size > 0 and data['currentBlockNumber'][0, 0].size > 0:
                            subject_nr = data['nr'][0,0][0][0]
                            block_nr = data['currentBlockNumber'][0,0][0][0]
                        else:
                            raise IndexError("The 'sub_num' or 'currentBlockNumber' array is empty.")
                    except IndexError as e:
                        print(f"IndexError while accessing 'sub_num' or 'currentBlockNumber': {e}")
                        print(f"Shape of 'sub_num': {data['nr'].shape}")
                        print(f"Shape of 'currentBlockNumber': {data['currentBlockNumber'].shape}")
                        raise  # re-raise instead of exit(), which would shut down the Jupyter kernel
                else:
                    print(f"Shape of 'sub_num': {data['nr'].shape}")
                    print(f"Shape of 'currentBlockNumber': {data['currentBlockNumber'].shape}")
                    raise IndexError("The 'sub_num' or 'currentBlockNumber' array is empty.")
                
                # Access the 'images' field as a nested struct of the data array and extract the relevant fields
                images = data['images'][start_row:end_row]  # Adjust this to extract only the rows for the current block
                
                # Check if the array and its nested elements are not empty
                def is_valid_array(arr, index):
                    return arr.size > 0 and index < arr.shape[0] and len(arr[index]) > 0 and len(arr[index][0]) > 0

                # Check the structure of the 'images' field to find the correct field names
                print(data['images'][0, 0].dtype.names)

                ################# Accessing the variables in the MATLAB file (stored in nested data.images struct), and applying the flatten_and_round_function to add them to the DataFrame ############################

                # Extract and process each field, with the flatten_and_round function to handle the nested structures of the variables stored in the 'images' field of the MATLAB file
                # To access only the rows for the current block, the start_row and end_row variables are used to slice the arrays.
                img_id = [item[0] if isinstance(item, list) and len(item) > 0 else item 
                        for item in data['images'][0, 0]['pictures'][0][start_row:end_row]]
                img_id = [str(item).strip('[]') for item in img_id]  # Remove brackets from img_id
                img_onset = flatten_and_round(data['images'][0, 0]['img_onset'][0][start_row:end_row]) if is_valid_array(data['images'][0, 0]['img_onset'][0], start_row) else []
                img_offset = flatten_and_round(data['images'][0, 0]['img_offset'][0][start_row:end_row]) if is_valid_array(data['images'][0, 0]['img_offset'][0], start_row) else []
                
                ##### Extra check for old_responses, because this colum could have several responses and could also be empty, same done for rt because rt (name RTs in MATLAB file) shows the time stamps of the reactions stored 
                # in old_responses ########
                old_responses = data['images'][0, 0]['old_responses'][0]

                # Collect the responses per trial; trials without a key press are stored as "None"
                oldResponsesDf = []
                for i in range(start_row, end_row):
                    if i < len(old_responses) and len(old_responses[i]) > 0:
                        # Flatten and round the response, then convert to integer
                        response = flatten_and_round(old_responses[i][0])
                        if isinstance(response, list):
                            response = [int(item) for item in response]
                            oldResponsesDf.append(','.join(map(str, response)))
                        else:
                            # A single scalar response: store it so the list keeps one entry per trial
                            oldResponsesDf.append(str(int(response)))
                    else:
                        # Index out of bounds or no response recorded for this trial
                        oldResponsesDf.append("None")


                # Check the shape of the RTs array within the 'images' field of the MATLAB file that gets stored under rt variable within the dataframe, the length should be equal to the length of the trial_num = 250
                RTs = data['images'][0, 0]['RTs'][0]

                # Collect the reaction times per trial; trials without a key press are stored as "None"
                rt = []
                for i in range(start_row, end_row):
                    if i < len(RTs) and len(RTs[i]) > 0:
                        value = flatten_and_round(RTs[i][0])
                        # Convert the reaction times to a comma-separated string if there are several
                        rt.append(','.join(map(str, value)) if isinstance(value, list) else str(value))
                    else:
                        # Index out of bounds or no response recorded for this trial
                        rt.append("None")  # This way it prints 'None' in the DataFrame

                # Number of responses per trial = number of key presses per trial.
                # .size counts every entry of the (possibly 2-D) per-trial array, whereas len() would only count its rows
                number_of_responses = [data['images'][0, 0]['RTs'][0][i].size for i in range(start_row, end_row)]

                correct_response = flatten_and_round(data['images'][0, 0]['correct_response'][0][start_row:end_row]) if is_valid_array(data['images'][0, 0]['correct_response'][0], start_row) else []

                # Access the 'img_is_old' field from the image sequence matrix file
                img_seq_matrix = img_seq_mat_file_data
                
                # Debugging: Print the keys of the img_seq_matrix 
                # TODO: Remove this when debugging was successful
                print(img_seq_matrix.keys())

                # Access the nested dictionary or array
                nested_structure = img_seq_matrix['img_seq_matrix']
                print(type(nested_structure))

                # If it's a numpy structured array, print its dtype
                if isinstance(nested_structure, np.ndarray):
                    print(nested_structure.dtype)

                # Check if 'img_is_old' is a field of the structured array and whether it covers start_row
                if 'img_is_old' in nested_structure.dtype.names:
                    is_valid = is_valid_array(nested_structure['img_is_old'][0], start_row)
                else:
                    print("Field 'img_is_old' does not exist in the structured array")
                    is_valid = False

                # Extract the relevant slice if the array is valid
                if is_valid:
                    img_slice = nested_structure['img_is_old'][0][start_row:end_row]
                    img_is_old = flatten_and_round(img_slice)
                else:
                    img_is_old = []

                ################################## For subject nr 1,2,3,4:  Reconstruct hits, misses, correct rejections, false alarms from the img_is_old and old_responses variables #######################################
                if sub_num in range(1,5):

                    # Calculate the values of the variables hits, misses, correct rejections, false alarms post-hoc. These values were added to the MATLAB file during data collection after subject nr 4.
                    # Taking in consideration the values of img_is_old (indicating whether an image was already presented in the image sequence matrix) and old_responses variable 
                    # (indicating the keypresses the subject made during the trial). 
                    # ('j' for old image = 1, 'h' for new image = 0; if there are multiple responses, the last one is taken into account)
                    # --> More precise: If the length of the oldResponsesDf array is greater than 1, the last element (key press) is taken into account.

                    # Then the variables are directly added to the DataFrame

                    # Create lists to store the values of hits, misses, correct rejections, false alarms
                    hits = []
                    misses = []
                    correct_rejections = []
                    false_alarms = []

                    # Iterate over the rows (representing the trial_nr) to calculate the hits, misses, correct rejections, false alarms
                    # oldResponsesDf holds one entry per trial from start_row to end_row, so its length is used as the number of iterations.
                    for i in range(len(oldResponsesDf)):
                        # If oldResponsesDf contains more than one response (comma-separated), only the last one is taken into account.
                        # Trials without a response ("None") are left unchanged, so they are not counted in any category.
                        if oldResponsesDf[i] != "None" and len(oldResponsesDf[i]) > 1:
                            oldResponsesDf[i] = oldResponsesDf[i].split(',')[-1]

                        # If the image is old and the response is old, it is a hit
                        if img_is_old[i] == 1 and oldResponsesDf[i] == '1':
                            hits.append(1)
                        else:
                            hits.append(0)

                        # If the image is old and the response is new, it is a miss
                        if img_is_old[i] == 1 and oldResponsesDf[i] == '0':
                            misses.append(1)
                        else:
                            misses.append(0) 
                        
                        # If the image is new and the response is new, it is a correct rejection
                        if img_is_old[i] == 0 and oldResponsesDf[i] == '0':
                            correct_rejections.append(1)
                        else:
                            correct_rejections.append(0)

                        # If the image is new and the response is old, it is a false alarm
                        if img_is_old[i] == 0 and oldResponsesDf[i] == '1':
                            false_alarms.append(1)
                        else:
                            false_alarms.append(0)
                # Add the values of hits, misses, correct rejections, false alarms of Subject 5-23 (already stored in the Matlab Experimental Data file) to the DataFrame
                
                elif sub_num in range(5,24):
                    hits = flatten_and_round(data['images'][0, 0]['hits'][0][start_row:end_row]) if is_valid_array(data['images'][0, 0]['hits'][0], start_row) else []
                    misses = flatten_and_round(data['images'][0, 0]['misses'][0][start_row:end_row]) if is_valid_array(data['images'][0, 0]['misses'][0], start_row) else []
                    correct_rejections = flatten_and_round(data['images'][0, 0]['correctRejections'][0][start_row:end_row]) if is_valid_array(data['images'][0, 0]['correctRejections'][0], start_row) else []
                    false_alarms = flatten_and_round(data['images'][0, 0]['falseAlarms'][0][start_row:end_row]) if is_valid_array(data['images'][0, 0]['falseAlarms'][0], start_row) else []

                    print("Retrieved the hits, misses, correct rejections, false alarms values from the MATLAB file.")

                image_actual_duration = flatten_and_round(data['images'][0, 0]['Image_actual_dur'][0][start_row:end_row])

                # Define the number of trials per block
                trials_per_block = 250

                # Calculate the starting trial number based on the currentBlockNumber
                start_trial_nr = (currentBlockNumber - 1) * trials_per_block + 1

                # Generate the trial_nr list starting from the calculated trial number
                trial_nr = list(range(start_trial_nr, start_trial_nr + len(img_id)))

                ############################################################ Debug Option to ensure the length is the same for the variables that should be stored in the Dataframe ##############################################

                # TODO: Remove the debug option if it is no longer necessary: Print the lengths of the extracted fields
                print(f"Length of img_id: {len(img_id)}")
                print(f"Length of img_onset: {len(img_onset)}")
                print(f"Length of img_offset: {len(img_offset)}")
                print(f"Length of oldResponsesDf: {len(oldResponsesDf)}")
                print(f"Length of img_is_old: {len(img_is_old)}")
                print(f"Length of rt: {len(rt)}")
                print(f"Length of number_of_responses: {len(number_of_responses)}")
                print(f"Length of correct_response: {len(correct_response)}")
                print(f"Length of hits: {len(hits)}")
                print(f"Length of misses: {len(misses)}")
                print(f"Length of correct_rejections: {len(correct_rejections)}")
                print(f"Length of false_alarms: {len(false_alarms)}")
                print(f"Length of image_actual_duration: {len(image_actual_duration)}")
                
                ############################################################# Adding the variables to the DataFrame ###############################################################################################################
                # Create a DataFrame

                df_combined = pd.DataFrame({
                    "Subject Number": [subject_nr] * len(img_id),
                    "Block Number": [block_nr] * len(img_id),
                    "Trial Number": trial_nr,
                    "Image Id": img_id,
                    "Image Onset": img_onset,
                    "Image Offset": img_offset,
                    "Old Responses": oldResponsesDf,
                    "Image Is Old": img_is_old,
                    "Reaction Time": rt,
                    "Number of Responses": number_of_responses,
                    "Correct Response": correct_response,
                    "Hits": hits,
                    "Misses": misses,
                    "Correct Rejections": correct_rejections,
                    "False Alarms": false_alarms,
                    "Image Actual Duration": image_actual_duration
                })

                # Print the cleaned DataFrame
                print(df_combined)
                
            else:
                print("'nr', 'currentBlockNumber', or 'images' field not found in 'data'.")
        else:
            print("'data' is not a structured array.")
    else:
        print("'data' key not found in the .mat file.")

    ############################# Extracting the fixation_num, x, y, fixation_time relative to start time from the ASCII file #####################################

    # The fixation data is stored in the ASCII file, so it has to be extracted from there and added to the DataFrame.
    # This is done by reading the ASCII file, parsing the fixation events, and filtering them by the
    # image onset and offset times stored in the DataFrame.

    # Read the eye tracking file
    def read_eye_tracking_file(ascii_file_path):
        with open(ascii_file_path, 'r') as file:
            lines = file.readlines()
        return lines

    # Parse the fixations
    def parse_fixations(lines):
        fixations = []
        for line in lines:
            if line.startswith('EFIX'):
                parts = line.split()
                if len(parts) >= 8:
                    try:
                        fixation = {
                            'start': int(parts[2]),     # Fixation start time
                            'end': int(parts[3]),       # Fixation end time
                            'duration': int(parts[4]),  # Fixation duration
                            'x': float(parts[5]),       # X coordinate of fixation
                            'y': float(parts[6]),       # Y coordinate of fixation
                            'pupil_size': float(parts[7]) # Pupil size
                        }
                        fixations.append(fixation)
                    except ValueError as e:
                        print(f"Skipping line due to ValueError: {line.strip()} - {e}")
                else:
                    print(f"Skipping line due to insufficient data: {line.strip()}")
        return fixations

    def filter_fixations_by_time(fixations, img_onset, img_offset, eye_tracker_start_time):
        img_onset_ms = img_onset * 1000  
        img_offset_ms = img_offset * 1000 
        print(f"Image onset (ms): {img_onset_ms}, Image offset (ms): {img_offset_ms}")
        
        filtered_fixations = [fix for fix in fixations if (fix['start'] - eye_tracker_start_time) >= img_onset_ms and (fix['end'] - eye_tracker_start_time) <= img_offset_ms]
        return filtered_fixations

    # Read the eye tracking file
    lines = read_eye_tracking_file(ascii_file_path)

    # Parse the fixations
    fixations = parse_fixations(lines)

    # Create lists to store the fixation data for each row
    fixation_numbers = []
    fixation_durations = []
    fixation_x_coords = []
    fixation_y_coords = []
    filtered_fixations_list  = []
    
    # Fixation start time relative to the start time of the eye tracker
    fixation_time_relative = []

    # Iterate over the DataFrame rows to filter fixations and extract the fixation data
    for index, row in df_combined.iterrows():
        img_onset = row['Image Onset']
        img_offset = row['Image Offset']
        filtered_fixations = filter_fixations_by_time(fixations, img_onset, img_offset, eye_tracker_start_time)
        
        # Numerate the fixations
        fixation_numbers.append(list(range(1, len(filtered_fixations) + 1)))
        # Durations are kept in milliseconds here (as stored in the ASCII file); they are converted to seconds after padding
        fixation_durations.append([fix['duration'] for fix in filtered_fixations])
        fixation_x_coords.append([fix['x'] for fix in filtered_fixations])
        fixation_y_coords.append([fix['y'] for fix in filtered_fixations])
        filtered_fixations_list.append(filtered_fixations)

        # Calculate the time stamp of each fixation relative to the image onset. The image onset is stored in the MATLAB file in seconds
        # and the fixation start time is stored in the ASCII file in milliseconds, so the fixation start time is converted to seconds by dividing it by 1000.
        # Each value is rounded to 5 decimals.
        fixation_time_relative.append([round((fix['start'] - eye_tracker_start_time)/1000 - img_onset, 5) for fix in filtered_fixations])

    # Find the maximum length of the fixation lists
    max_length = max(len(lst) for lst in fixation_numbers)

    # Pad the lists with NaN values to make them of equal length
    fixation_durations = [lst + [np.nan] * (max_length - len(lst)) for lst in fixation_durations]
    fixation_durations_in_seconds =  [list(map(lambda x: x / 1000, lst)) for lst in fixation_durations]
    fixation_x_coords = [lst + [np.nan] * (max_length - len(lst)) for lst in fixation_x_coords]
    fixation_y_coords = [lst + [np.nan] * (max_length - len(lst)) for lst in fixation_y_coords]

    # Add the fixation data to the DataFrame
    df_combined['Fixation Numbers'] = fixation_numbers

    # Pad the filtered_fixations list to match the length of the DataFrame index
    filtered_fixations_padded = filtered_fixations_list + [np.nan] * (len(df_combined) - len(filtered_fixations_list))

    df_combined['Fixation Durations'] = fixation_durations_in_seconds
    df_combined['Fixation X Coordinates'] = fixation_x_coords
    df_combined['Fixation Y Coordinates'] = fixation_y_coords
    df_combined['Filtered Fixations'] = filtered_fixations_padded
    df_combined['Fixation Time Relatives to Image Onset'] = fixation_time_relative

    # Print the updated DataFrame
    print(df_combined)

    # Save the DataFrame to a CSV file in save_dir
    output_file = save_dir/ f'df_combined_subj_{sub_num}_{currentBlockNumber}.csv'
    # save the DataFrame to a CSV file in the directory
    df_combined.to_csv(output_file, index=False)
    
    ################################### Second Part ############################################
    # Separate this part into a new cell for debugging reasons. These two parts are merged to enhance the speed of execution.

    # Define a function to reformat the DataFrame by fixation, so that each fixation (fix_id) is represented by one row of the DataFrame
    def reorganize_dataframe(df):
        # Initialize lists to store the expanded data
        expanded_data = {
            "Fixation Numbers": [],
            "Fixation Number in Trial": [],
            "Subject Number": [],
            "Block Number": [],
            "Trial Number": [],
            "Image Id": [],
            "Fixation Duration": [],
            "Fixation Time Relative to Image Onset": [],
            "Fixation X Coordinate": [],
            "Fixation Y Coordinate": [],
            "Image Onset": [],
            "Image Offset": [],
            "Image Actual Duration": [],
            "Old Responses": [],
            "Image Is Old": [],
            "Reaction Time": [],
            "Number of Responses": [],
            "Correct Response": [],
            "Hits": [],
            "Misses": [],
            "Correct Rejections": [],
            "False Alarms": [],   
        }

        # Iterate over each row in the original DataFrame
        for index, row in df.iterrows():
            # Get the number of fixations for the current row
            num_fixations = len(row['Fixation Numbers'])

            # Iterate over each fixation
            for i in range(num_fixations):
                # Calculate the fixation time relative to the image onset
                #fixation_time_relative = row(filtered_fixations) - img_onset

                expanded_data["Fixation Numbers"].append(len(expanded_data["Fixation Numbers"]) + 1)
                expanded_data["Fixation Number in Trial"].append(row["Fixation Numbers"][i])
                expanded_data["Subject Number"].append(row["Subject Number"])
                expanded_data["Block Number"].append(row["Block Number"])
                expanded_data["Trial Number"].append(row["Trial Number"])
                expanded_data["Image Id"].append(row["Image Id"])
                expanded_data["Fixation Duration"].append(row["Fixation Durations"][i])
                expanded_data["Fixation Time Relative to Image Onset"].append(row["Fixation Time Relatives to Image Onset"][i])
                expanded_data["Fixation X Coordinate"].append(row["Fixation X Coordinates"][i])
                expanded_data["Fixation Y Coordinate"].append(row["Fixation Y Coordinates"][i])
                expanded_data["Image Onset"].append(row["Image Onset"])
                expanded_data["Image Offset"].append(row["Image Offset"])
                expanded_data["Image Actual Duration"].append(row["Image Actual Duration"])
                expanded_data["Old Responses"].append(row["Old Responses"])
                expanded_data["Image Is Old"].append(row["Image Is Old"])
                expanded_data["Reaction Time"].append(row["Reaction Time"])
                expanded_data["Number of Responses"].append(row["Number of Responses"])
                expanded_data["Correct Response"].append(row["Correct Response"])
                expanded_data["Hits"].append(row["Hits"])
                expanded_data["Misses"].append(row["Misses"])
                expanded_data["Correct Rejections"].append(row["Correct Rejections"])
                expanded_data["False Alarms"].append(row["False Alarms"])
                
        # Create a new DataFrame from the expanded data
        df_expanded = pd.DataFrame(expanded_data)

        return df_expanded

    # df_combined is the DataFrame generated from the previous code cell
    df_expanded = reorganize_dataframe(df_combined)
    print(df_expanded)

    ############################################ Save the expanded DataFrame to a CSV file #############################################    
    output_file = f'df_expanded_data_subj_{sub_num}_{currentBlockNumber}.csv'

    # save the expanded DataFrame to a CSV file in the directory
    file_path = os.path.join(save_dir, output_file)
    
    df_expanded.to_csv(file_path, index=False)

    # Print statement to show that the code has been executed successfully
    print(f"Successfully created full Dataframes for nr = {sub_num}\n")

#################################################################################################################################################################################
################################ Handle edge cases: Determine for what block the create_dataframes function (created above) is called ###########################################

# Run this code cell for all participants automatically (range(1,24) covers all 23 participants; subjects 1 and 2 are handled as edge cases below)
for sub_num in range(1,24):
    sub_data_dir = data_dir / f'VP{sub_num}'

    if sub_num in range (3,24):
        # Run create_dataframes function for all 5 blocks of each subject automatically
        for currentBlockNumber in range(1,6):
            create_dataframe(sub_num, currentBlockNumber)
    elif sub_num == 1:
        # Run the create_dataframe function only for the first block of subject 1 (for blocks 2, 3, 4 the MATLAB files with the experimental data are missing and for block 5 the ASCII file is missing)
        for currentBlockNumber in range(1,2):
            create_dataframe(sub_num, currentBlockNumber)
    elif sub_num == 2:
        # Run create_dataframes function only for currentBlockNumber 1,2,3,5 of subject 2 (in block 4 the MATLAB  file with the experimental data is missing)
        for currentBlockNumber in itertools.chain(range(1, 4), range(5, 6)):
            create_dataframe(sub_num, currentBlockNumber)
        
    # Else type in the subject number for which you want to create the DataFrames (for all 5 blocks) manually, this statement, does not have to be reached,
    # ... because the range of subjects is defined above 
    else: 
        # TODO: Type the subject number for which you want to create the DataFrames (for all 5 blocks) manually
        create_dataframe(sub_num, currentBlockNumber)
        
############################################################# Final print statement #####################################################################

# Print statement to show that the code has been executed successfully (placed after the loop so that it runs only once)
print("Successfully created full Dataframes for all participants with full datasets (sub_num = 1 to 23)\n")

# Runs completely and saves the data in the folder "Block DataFrames" in the directory "/Volumes/lheinemann/0_Master Thesis_NSD Visual Saliency/0_Data preprocessing/Creation of Dataframe and Fixation maps"
# as well as in the working directory of the Jupyter Notebook


/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP1
Processing subject number 1
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP2
Processing subject number 2
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP3
Processing subject number 3
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP4
Processing subject number 4
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP5
Processing subject number 5
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP6
Processing subject number 6
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP7
Processing subject number 7
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP8
Processing subject number 8
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP9
Processing subject number 9
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP10
Processing subject number 10
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw/VP11
Processing subject number 11
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/

*Note:* If the code does not work, change the working directory to the writable directory '/Volumes/lheinemann/0_Master Thesis_NSD Visual Saliency/0_Data preprocessing/Creation of Dataframe and Fixation maps'

## 2. Code cell that combines all the data of one subject ("sub_num").

Further explanation: In this code cell, a single DataFrame is created for each participant, containing all the block data for that subject. The trial_num represents the trial number within the entire experiment for that participant (for example, the first trial of block 2 will have a trial_num of 251).
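
As a small worked example of this numbering, the sketch below assumes 250 trials per block (the value of `trials_per_block` in the block-level cell). The helper name `global_trial_number` is hypothetical and not part of the notebook.

```python
TRIALS_PER_BLOCK = 250  # assumption: same value as trials_per_block in the block-level cell


def global_trial_number(block_number: int, trial_in_block: int) -> int:
    """Map a 1-based trial index within a block to its 1-based index in the whole experiment."""
    return (block_number - 1) * TRIALS_PER_BLOCK + trial_in_block


print(global_trial_number(2, 1))    # first trial of block 2 -> 251
print(global_trial_number(5, 250))  # last trial of block 5 -> 1250
```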

2.1 Concatenate the five block-level DataFrames of a subject, each referring to one block. (Name: 'df_expanded_data_subj_{nr}_{currentBlockNumber}.csv')

2.2 Add a new variable **test_is_old_img** to the concatenated DataFrame:

**test_is_old_img** = *true* (stored as 1 in the column 'Test - Img is Old') if the image has already been presented in an earlier trial of the experiment (across all blocks), otherwise 0. This variable is then added to the DataFrame for all blocks of a single subject.

2.3 Save the concatenated DataFrame for each subject as a CSV file named 'df_expanded_data_subj_{nr}_concatenated.csv'.

This code cell should be executed after running the previous code cells in this Jupyter Notebook.

*Note: img_is_old also exists in the img_seq_matrix of each subject, stored under e.g. img_seq_sub1_24yrs_f_20231220 (the distinct image sequence matrix for every participant).*
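
The flag described in 2.2 can also be computed without a row-by-row loop. The following is a minimal sketch on made-up toy data, not the notebook's own implementation: it assumes one row per fixation with the columns 'Trial Number' and 'Image Id', and it marks an image as old in every trial after the first trial in which it appears.

```python
import pandas as pd

# Toy data (made up for illustration): one row per fixation, as in the expanded DataFrames
df = pd.DataFrame({
    "Trial Number": [1, 1, 2, 3, 3, 4],
    "Image Id": ["a.png", "a.png", "b.png", "a.png", "a.png", "c.png"],
})

# Trial in which each image first appears for this subject
first_trial = df.groupby("Image Id")["Trial Number"].transform("min")

# An image counts as old in every trial after its first presentation
df["Test - Img is Old"] = (df["Trial Number"] > first_trial).astype(int)

print(df)
```

Like the loop-based version, this only sees trials that contain at least one fixation, because trials without fixations have no rows in the expanded DataFrame.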

In [5]:
##### This code cell: Can only work with already created dataframes for each block of a subject #####

# Concatenate the expanded DataFrames for all blocks of the subjects, since there exists just one DataFrame for subject 1, 
# subject 1 is excluded from the loop and the DataFrame for subject 1 is loaded separately
for sub_num in range(2, 24):

    # Define a function to concatenate the expanded DataFrames for all blocks of a subject
    def concatenate_dataframes_for_subject(sub_num):
        # Initialize an empty list to store the DataFrames
        dfs = []
        # Make conditions for the subject number, because the range of the subject numbers is from 3 to 23 that have complete datasets (in subject 1 and 2 data is missing)
        if sub_num in range(3,24):

            # Iterate over the block numbers (1 to 5) because there are 5 blocks per subject, in Python the range function implemented in a for-loop is exclusive for the end value (e.g. 6 is not included)
            for currentBlockNumber in range(1, 6):
                # Load the DataFrame for the current block stored in save_dir
                df_filename = save_dir / f'df_expanded_data_subj_{sub_num}_{currentBlockNumber}.csv' 
                df = pd.read_csv(df_filename)

                # Append the DataFrame to the list
                dfs.append(df)

            # Concatenate the DataFrames
            df_concatenated = pd.concat(dfs, ignore_index=True)

            return df_concatenated
            
        elif sub_num == 2: 
            print("Subject 2 is missing the MATLAB file for block 4, so the concatenated DataFrame only contains the DataFrames of block 1, 2, 3, 5")
            # Iterate over the block numbers (1 to 3) and (5) because the MATLAB file for block 4 is missing

            for currentBlockNumber in chain(range(1, 4), range(5, 6)):

                # Load the DataFrame for the current block
                df_filename = save_dir/ f'df_expanded_data_subj_{sub_num}_{currentBlockNumber}.csv'
                df = pd.read_csv(df_filename)

                # Append the DataFrame to the list
                dfs.append(df)

            # Concatenate the DataFrames
            df_concatenated = pd.concat(dfs, ignore_index=True)

            return df_concatenated
        elif sub_num == 1:
            print("Subject 1 is missing the MATLAB file for block 2, 3, 4, so the only Dataframe that can be reconstructed is block 1, look for df_expanded_data_subj_1_1.csv ")
            
            # Maybe construct a new dataframe with the eye movements for block 2, 3, 4 of subject 1 because the MATLAB data files are missing, but the images shown are stored in the image sequence matrix file
            # However, we do not have the exact image_onset and image_offset times for the images shown in block 2, 3, 4 of subject 1

    # Concatenate the DataFrames for the subject
    df_concatenated2 = concatenate_dataframes_for_subject(sub_num)

############################################# Add the img_is_old variable to the DataFrame ####################################################

    # Define a function to add the img_is_old variable to the DataFrame
    def add_test_img_is_old(df_concatenated):
        
        # Initialize an empty list to store the values of the 'Test - Img is Old' column
        test_img_is_old = []

        # Iterate over the rows of the DataFrame
        for index, row in df_concatenated.iterrows():
            image_name = row['Image Id']  # 'Image Id' is the column that shows the images presented at a trial.
            trial_nr = row['Trial Number']  # 'trial_nr' is the column for trial numbers

            #######################################

            # Sanity check: strip quotation marks from the image name so that a differently formatted name does not prevent detecting an old image
            image_name = image_name.replace("'", "")

            #######################################
            ### If statement with fixed values for debugging purposes: checks a specific trial number and subject number
            # to understand why img_is_old and test_img_is_old sometimes do not match
            if row['Subject Number'] == 5 and trial_nr == 1001:
                pass

            # Create temporary DataFrame consisting of fixations from previous trials only
            prev_df = df_concatenated.loc[df_concatenated['Trial Number'] < trial_nr]

            # Check if image name exists in DataFrame of previous trials
            is_old = image_name in prev_df['Image Id'].values


            # Add the img_is_old value to the list 
            test_img_is_old.append(1 if is_old else 0)   

        # Add the img_is_old values to the DataFrame
        df_concatenated['Test - Img is Old'] = test_img_is_old

        return df_concatenated

    ###### Now save the concatenated DataFrame of all blocks of one subject to a CSV file with img_is_old in it ######    
    
    # Set the location where the concatenated DataFrame should be saved - in save_dir (defined above)

    output_file = f'df_expanded_data_subj_{sub_num}_concatenated.csv'
    file_path = os.path.join(save_dir, output_file)

    df_concatenated2 = add_test_img_is_old(df_concatenated2)

    # Save the concatenated DataFrame to a CSV file
    df_concatenated2.to_csv(file_path, index=False)

    # Print a confirmation message 
    print(f"Saved concatenated DataFrame for subject {sub_num} to CSV file.")

    # This can be done for all subjects by iterating over the subject numbers
    # Keep in mind the exceptions for subject 1, subject 2 because the MATLAB file is not complete (Leave out?)



Subject 2 is missing the MATLAB file for block 4, so the concatenated DataFrame only contains the DataFrames of block 1, 2, 3, 5
Saved concatenated DataFrame for subject 2 to CSV file.
Saved concatenated DataFrame for subject 3 to CSV file.
Saved concatenated DataFrame for subject 4 to CSV file.
Saved concatenated DataFrame for subject 5 to CSV file.
Saved concatenated DataFrame for subject 6 to CSV file.
Saved concatenated DataFrame for subject 7 to CSV file.
Saved concatenated DataFrame for subject 8 to CSV file.
Saved concatenated DataFrame for subject 9 to CSV file.
Saved concatenated DataFrame for subject 10 to CSV file.
Saved concatenated DataFrame for subject 11 to CSV file.
Saved concatenated DataFrame for subject 12 to CSV file.
Saved concatenated DataFrame for subject 13 to CSV file.
Saved concatenated DataFrame for subject 14 to CSV file.
Saved concatenated DataFrame for subject 15 to CSV file.
Saved concatenated DataFrame for subject 16 to CSV file.
Saved concatenated DataF

## 3. Concatenate the DataFrames of all subjects into one full DataFrame (named: df_expanded_data_subjs_concatenated_all.csv)

This code saves the concatenated DataFrame of all available subject data to /Volumes/bartels_data/lheinemann/0_subjectData/DataframesConcatenated/ 
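
The idea of this step can be sketched as follows. This is an illustration under assumptions rather than the notebook's cell: the function name `concat_subject_files` and the toy files in a temporary directory are hypothetical, and only the file name pattern 'df_expanded_data_subj_{sub_num}_concatenated.csv' is taken from the notebook. Missing files are reported and skipped.

```python
from pathlib import Path
import tempfile

import pandas as pd


def concat_subject_files(directory: Path, sub_nums) -> pd.DataFrame:
    """Concatenate the per-subject CSV files that exist, reporting the ones that do not."""
    frames = []
    for sub_num in sub_nums:
        path = directory / f"df_expanded_data_subj_{sub_num}_concatenated.csv"
        if path.exists():
            frames.append(pd.read_csv(path))
        else:
            print(f"File not found: {path}")
    if not frames:
        raise FileNotFoundError("No per-subject files found to concatenate")
    return pd.concat(frames, ignore_index=True)


# Demo with toy files in a temporary directory (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for sub_num in (2, 3):
        pd.DataFrame({"Subject Number": [sub_num] * 2, "Trial Number": [1, 2]}).to_csv(
            tmp_dir / f"df_expanded_data_subj_{sub_num}_concatenated.csv", index=False
        )
    # Subject 4 has no file on purpose, to show how a missing file is handled
    combined = concat_subject_files(tmp_dir, range(2, 5))

print(combined["Subject Number"].tolist())
```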

In [10]:
##### This code cell: Can only work with already created dataframes for every subject (N=23) #####

# Define the subject numbers you want to concatenate 
# In this case the subject numbers should be between the range of 1 to 23, however since for subject 1 only the DataFrame of block 1 exists 
# This DataFrame gets loaded separately and added to the concatenated DataFrame of all subjects
sub_nums = list(range(2, 24))

print(sub_nums)

# Define a function to concatenate the expanded DataFrames for all blocks of a subject
def concatenate_dataframes_for_subject(sub_nums):
    # Initialize an empty list to store the DataFrames
    dfs = []

    # Iterate over the subject numbers; each subject has one CSV file that already contains all of its available blocks
    for sub_num in sub_nums:
        # Load the concatenated DataFrame for the current subject
        df_filename = save_dir / f'df_expanded_data_subj_{sub_num}_concatenated.csv'
        try:
            df = pd.read_csv(df_filename)
            # Append the DataFrame to the list
            dfs.append(df)
        except FileNotFoundError:
            print(f"File not found: {df_filename}")

    # Concatenate the DataFrames
    df_concatenated_all = pd.concat(dfs, ignore_index=True)

    return df_concatenated_all

# Concatenate the DataFrames of all subjects listed in sub_nums
df_concatenated_all = concatenate_dataframes_for_subject(sub_nums)

# Add df_expanded_data_subj_1_1.csv (subject 1, block 1) at the very beginning of the concatenated DataFrame
# Load the DataFrame for block 1 of subject 1
df_filename_block1 = save_dir / 'df_expanded_data_subj_1_1.csv'
df_block1 = pd.read_csv(df_filename_block1)

# Concatenate the DataFrames
df_concatenated_all = pd.concat([df_block1, df_concatenated_all], ignore_index=True)

# Save the concatenated DataFrame to a CSV file in the save_dir 'smb://172.25.250.112/lheinemann/saliency-nsd/data/preprocessed'
output_file = save_dir / 'df_expanded_data_subjs_concatenated_all.csv'

df_concatenated_all.to_csv(output_file, index=False)

# Print path where the concatenated DataFrame is saved
print(f"Saved concatenated DataFrame for all subjects to CSV file: {output_file}")

# This creates one dataframe with all the data from all subjects

# Print a confirmation message
print("Saved concatenated DataFrame for all subjects to CSV file.")

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]


  df = pd.read_csv(df_filename)


Saved concatenated DataFrame for all subjects to CSV file: /gpfs01/bartels/user/lheinemann/saliency-nsd/data/preprocessed/df_expanded_data_subjs_concatenated_all.csv
Saved concatenated DataFrame for all subjects to CSV file.
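
The cell above skips missing per-subject files with a printed message. As a minimal sketch of the same concatenation pattern, the variant below also returns the list of files that were not found and raises an error if no file could be loaded, since `pd.concat` fails on an empty list. The function name `concatenate_subject_csvs`, the toy file names and the columns `subject` and `trial` are hypothetical and are not taken from this notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def concatenate_subject_csvs(csv_paths):
    """Concatenate the CSV files that exist and return the paths that are missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    frames, missing = [], []
    for path in csv_paths:
        path = Path(path)
        if path.exists():
            frames.append(pd.read_csv(path))
        else:
            missing.append(path)
    if not frames:
        # pd.concat raises a ValueError on an empty list, so fail with a clearer message
        raise FileNotFoundError("None of the requested CSV files exist.")
    return pd.concat(frames, ignore_index=True), missing


# Toy demonstration in a temporary directory (made-up columns and values)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pd.DataFrame({"subject": [2, 2], "trial": [1, 2]}).to_csv(tmp_dir / "subj_2.csv", index=False)
    pd.DataFrame({"subject": [3], "trial": [1]}).to_csv(tmp_dir / "subj_3.csv", index=False)

    # subj_4.csv is deliberately not created, so it ends up in the missing list
    paths = [tmp_dir / f"subj_{n}.csv" for n in (2, 3, 4)]
    df_all, missing_files = concatenate_subject_csvs(paths)

print(df_all.shape)                      # (3, 2)
print([p.name for p in missing_files])   # ['subj_4.csv']
```

Returning the missing paths makes it possible to record which subjects are absent from the combined file instead of relying on the printed messages alone.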
