#  Filter DataFrame df_expanded_data_subjs_concatenated_all.csv 
This Jupyter notebook loads the concatenated DataFrame of all participants (*N* = 23) and removes fixations shorter than 50 ms.


*Information about the eye tracking experiment* 

*The eye tracking experiment involved 1,000 stimuli (retrieved from the Microsoft COCO image dataset), 250 (1/4) of which were repeated. In a within-subjects design with randomized trials at a single time point, each participant completed 1,250 trials during data collection. The experimental conditions were the same as in the well-known NSD experiment by Allen et al. (2022), who recorded only voxel data from participants performing the same experiment while lying in an fMRI scanner. In contrast, our study captured the missing ground-truth eye-tracking data: we recorded eye movements from 23 participants as they viewed the same images used by Allen et al. (2022). The focus of this analysis is on the participants' eye movements, particularly fixations, during image perception.*

*Participants (N = 23) performed a continuous recognition task, which required them to remember and identify as many stimuli as possible. This indirect task design enables natural image exploration, yielding more authentic eye-tracking data that better represent real-world visual perception.*

Tools used:
- MATLAB (R2022b)
- EyeLink (1000 Plus)
- Python 3.11.5


## Structure of the DataFrame 

After the code was reorganized, the DataFrame has the following structure: | fixation_num | fixation num in trial | subject_nr | block_nr | trial_nr | img_id | fixation_dur | fixation_time_relative | x | y | img_onset | img_offset | Image actual_dur | oldResponsesDf | is_old_img | rt | number_of_responses | correct_response | hits | misses | correctRejections | falseAlarms |

- **fixation numbers** = Running number of the fixation within the whole block (the first, second, third, ...) (*integer*)
- **fixation number in trial** = Running number of the fixation within the current trial
- **subject_nr** = Subject id  (*integer*)
- **block_nr** = Number of the block (*integer*)
- **trial_nr** = Trial number within the current block (*integer*)
- **image id** = ID of the presented picture within this trial (*string*)
- **fixation_dur** = Duration of the fixation; in the CSV loaded below the values are given in seconds (e.g. 0.609), so 50 ms corresponds to 0.05 (*float*)
- **fixation_time_relative** = Timestamp of start of fixation in relation to the start of the trial (img_onset timestamp) (*float*)
- **x** and **y** = Coordinates of the current fixation (*float*)
- **img_onset** = Time stamp of the start of the presentation of the image (*float*)
- **img_offset** = Time stamp of the end of the presentation of the image (*float*)
- **Image_actual_dur** = Actual duration of the image presentation in milliseconds (intended duration: 3 s = 3000 ms) (*float*)
- **oldResponsesDf** = *true* if the image had already been presented to the participant during the experiment (*is an old image*) and the subject correctly answered old; otherwise *false*. (*boolean*) Named old_responses in the MATLAB script. This variable can hold more than one entry or be empty if the subject did not answer.
- **is_old_img** = Whether the image had already been presented to the subject earlier in the experiment (value = 1) or not (value = 0). This variable is retrieved from the MATLAB files that generate the image sequence matrix for each participant.
- **rt** = Reaction time from image onset to the first key response, measured in seconds, milliseconds (only the first key response is considered). (*float*) This variable can hold more than one entry or be empty if the subject did not answer.
- **number_of_responses** = Number of responses (key presses) a participant made during one trial (*integer*)
- **correct_response** = *true* if the subject's response was correct (the subject answered old if the presented image was old and new if it was new); otherwise *false*. (*integer*)
- **hits** = *true* if the image had already been presented to the participant during the experiment (is an old image) and the subject correctly answered old; otherwise *false*. Every hit is also a correct_response. (*boolean*)
- **misses** = *true* if the image had already been presented to the participant during the experiment (is an old image) and the subject incorrectly answered new; otherwise *false*. (*boolean*)
- **correctRejections** = *true* if the image had not yet been presented to the participant during the experiment (is a new image) and the subject correctly answered new; otherwise *false*. (*boolean*)
- **falseAlarms** = *true* if the image had not yet been presented to the participant during the experiment (is a new image) and the subject incorrectly answered old; otherwise *false*. (*boolean*)
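
The four signal-detection columns at the end of this list follow from two booleans per trial: whether the image was old and whether the participant answered "old". The following is a minimal sketch of that mapping, not the original MATLAB or Python pipeline. It assumes exactly one response per trial; the function name `classify_responses` and the input `answered_old` are hypothetical.

```python
import pandas as pd


def classify_responses(is_old_img: pd.Series, answered_old: pd.Series) -> pd.DataFrame:
    """Derive hit/miss/correct-rejection/false-alarm flags per trial.

    Assumption: both inputs are aligned, one row per trial, with exactly
    one response per trial (trials without a response are not handled).
    """
    is_old = is_old_img.astype(bool)
    said_old = answered_old.astype(bool)
    return pd.DataFrame(
        {
            "hits": is_old & said_old,
            "misses": is_old & ~said_old,
            "correctRejections": ~is_old & ~said_old,
            "falseAlarms": ~is_old & said_old,
        }
    )


# Toy example (made-up values) with four trials covering each outcome once
trials = pd.DataFrame(
    {"is_old_img": [1, 1, 0, 0], "answered_old": [True, False, False, True]}
)
labels = classify_responses(trials["is_old_img"], trials["answered_old"])
print(labels)
```

By the definition of correct_response in the list, a trial is correct exactly when it is a hit or a correct rejection, i.e. `labels["hits"] | labels["correctRejections"]`.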

### *I. Import necessary libraries for this code cell* 


In [1]:
# Standard library
import glob
import itertools
import os
from itertools import chain
from pathlib import Path

# Third-party
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.io
from scipy.io import loadmat


### II. Base directory, data directory and save directory setup 

In [13]:
# Create a Path object for the base directory
base_dir = Path('~/saliency-nsd').expanduser()

# Check if the base directory exists with an assert 
assert base_dir.exists(), f'{base_dir} does not exist'
# Print the base directory
print(base_dir)

# Combine paths to create new Path objects for the data directory and the save directory
data_dir = base_dir / 'data/raw'
save_dir = base_dir / 'data/preprocessed'

# Check if the data and save directories exist with assert statements, if not assertion error is raised
assert data_dir.exists(), f'{data_dir} does not exist'
assert save_dir.exists(), f'{save_dir} does not exist'

# Print the data and save directory paths
print(data_dir)
print(save_dir)

# Print the working directory for reference; it should be /gpfs01/bartels/user/lheinemann/saliency-nsd/code/code_analysis_saliency-nsd/1_data-preprocessing/creation_of_dataframes
print(os.getcwd())


/gpfs01/bartels/user/lheinemann/saliency-nsd
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/raw
/gpfs01/bartels/user/lheinemann/saliency-nsd/data/preprocessed
/gpfs01/bartels/user/lheinemann/saliency-nsd/code/code_analysis_saliency-nsd/1_data-preprocessing/creation_of_dataframes


### *III. Test accessibility of concatenated DataFrame 'df_expanded_data_subjs_concatenated_all.csv' in the save_dir*

In [3]:
# Test the accessibility of the concatenated DataFrame 'df_expanded_data_subjs_concatenated_all.csv' in the save_dir
# Load the concatenated DataFrame
df = pd.read_csv(save_dir / 'df_expanded_data_subjs_concatenated_all.csv')
# Print the first 5 rows of the DataFrame
print(df.head())




   Fixation Numbers  Fixation Number in Trial  Subject Number  Block Number  \
0                 1                         1               1             1   
1                 2                         2               1             1   
2                 3                         3               1             1   
3                 4                         4               1             1   
4                 5                         5               1             1   

   Trial Number                     Image Id  Fixation Duration  \
0             1  'nsd_stimulus_id_10007.jpg'              0.609   
1             1  'nsd_stimulus_id_10007.jpg'              0.246   
2             1  'nsd_stimulus_id_10007.jpg'              0.158   
3             1  'nsd_stimulus_id_10007.jpg'              0.176   
4             1  'nsd_stimulus_id_10007.jpg'              0.528   

   Fixation Time Relative to Image Onset  Fixation X Coordinate  \
0                                0.67981               
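
The `Image Id` values in the preview carry literal single quotes (pandas does not add quotes when printing strings), which may complicate matching against image file names later on. The following is a small sketch for stripping them. It assumes the column name shown in the output; the helper name `strip_quotes` is hypothetical.

```python
import pandas as pd


def strip_quotes(ids: pd.Series) -> pd.Series:
    """Remove surrounding whitespace and single quotes from each ID string."""
    return ids.astype(str).str.strip().str.strip("'")


# Toy column mimicking the values shown in the preview
toy_ids = pd.Series(["'nsd_stimulus_id_10007.jpg'", "nsd_stimulus_id_10007.jpg"])
clean_ids = strip_quotes(toy_ids)
print(clean_ids.tolist())
```

On the real data this could be applied as `df['Image Id'] = strip_quotes(df['Image Id'])`.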

## A. Eliminate the rows with the fixation_durations < 50ms (0.05 s) from the DataFrame.  

Note that, according to Rothkegel et al. (2019), "Searchers adjust their eye-movement dynamics to target characteristics in natural scenes", fixations lasting less than 50 ms have to be removed because they are more likely glissades (p. 10, red note). See also Bethge's statements (2023) in the journal article "Predicting Visual Fixations".

Drift correction does not seem necessary, as it was already performed during data collection in the eye tracking experiment. 

In [4]:
# Load the concatenated and expanded DataFrame df_expanded_data_subjs_concatenated_all.csv,
# delete the rows where the fixation duration is less than 50 ms (0.05 s),
# and save the result to a new CSV file

# Load the concatenated DataFrame (separate names for the path and the DataFrame)
csv_path = save_dir / 'df_expanded_data_subjs_concatenated_all.csv'
df_concatenated_all = pd.read_csv(csv_path)

# Keep only rows where the fixation duration is at least 50 ms = 0.05 s
df_concatenated_filtered = df_concatenated_all[df_concatenated_all['Fixation Duration'] >= 0.05]

# Save the filtered DataFrame to a CSV file in the save_dir
df_concatenated_filtered.to_csv(save_dir / 'df_expanded_data_subjs_concatenated_all_filtered.csv', index=False)
print(f"Saved filtered DataFrame to CSV file in the save_dir: {save_dir.resolve()}")





Saved filtered DataFrame to CSV file in the save_dir: /gpfs01/bartels/user/lheinemann/saliency-nsd/data/preprocessed
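
As a sanity check on the filter, it can help to see how many fixations each participant loses. The following is a minimal sketch on made-up toy data. It assumes the column names `Fixation Duration` and `Subject Number` from the preview above; the helper name `count_removed_fixations` is hypothetical.

```python
import pandas as pd


def count_removed_fixations(df, duration_col="Fixation Duration",
                            subject_col="Subject Number", threshold_s=0.05):
    """Return per-subject counts of total, removed, and kept fixations.

    Mirrors the filter `duration >= threshold_s`: rows with a missing
    duration also count as removed, because NaN >= x evaluates to False.
    """
    is_removed = ~(df[duration_col] >= threshold_s)
    summary = (
        df.assign(is_removed=is_removed)
        .groupby(subject_col)["is_removed"]
        .agg(n_total="size", n_removed="sum")
    )
    summary["n_kept"] = summary["n_total"] - summary["n_removed"]
    return summary


# Toy data standing in for the concatenated DataFrame (values are made up)
toy = pd.DataFrame({
    "Subject Number": [1, 1, 1, 2, 2],
    "Fixation Duration": [0.609, 0.030, 0.050, 0.049, 0.246],
})
summary = count_removed_fixations(toy)
print(summary)
```

On the real data, `count_removed_fixations(df_concatenated_all)` would give the same breakdown for all 23 participants; a fixation of exactly 0.05 s is kept, matching the `>=` comparison in the filter.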
