# iCatcher annotations --> looking time pipeline

This script organizes iCatcher annotations and other subject- and trial-level data into a format that can feed into the rest of the lab's looking time pipeline. It does this by organizing subject-, trial-, and look-level data in the same format that Datavyu manual coding outputs. 

## Inputs: 


### 1. subject-level data: 
    - subject ID* 
    - session number**
    - DOB 
    - gender
    - session date 
    - subject group, if relevant
    - any other subject-level variables 
    
    * required
    ** default is '1' if not provided 
    
### 2. iCatcher annotation file/s (.npz file per subject, per session)

These are the main outputs from running iCatcher. Expects one file per subject, per session, named accordingly. 

### 3. actual video files 

   Needed to extract frame rates for converting frame numbers to ms (to get look event onsets/offsets relative to the start of the video).
    
   If this conversion has already been run, .json files with the relevant information should exist in the videos directory. If not, these .json files will be written in the process of obtaining this info. 
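
   As a rough illustration of the conversion described above, the sketch below maps frame indices to millisecond timestamps and caches the result as JSON. It assumes a constant frame rate. The function names `frame_timestamps_ms` and `load_or_write_timestamps` and the JSON layout are hypothetical; this is not the lab's `get_frame_information` helper, which may read per-frame timestamps from the video itself.

```python
import json
import os
import tempfile

def frame_timestamps_ms(n_frames, fps):
    """Timestamp (ms from video onset) of each frame, assuming a constant frame rate."""
    return [round(i * 1000 / fps) for i in range(n_frames)]

def load_or_write_timestamps(json_path, n_frames, fps):
    """Reuse cached timestamps if the .json exists; otherwise compute and save them."""
    if os.path.isfile(json_path):
        with open(json_path) as f:
            return json.load(f)['timestamps_ms']
    stamps = frame_timestamps_ms(n_frames, fps)
    with open(json_path, 'w') as f:
        # hypothetical layout: frame rate plus one timestamp per frame
        json.dump({'fps': fps, 'timestamps_ms': stamps}, f)
    return stamps

# example: a 30 fps video with 31 frames (frame 0 through frame 30)
stamps = frame_timestamps_ms(31, 30)
cache_path = os.path.join(tempfile.mkdtemp(), 'example_video.json')
first = load_or_write_timestamps(cache_path, 31, 30)   # computes and writes the cache
second = load_or_write_timestamps(cache_path, 31, 30)  # reads the cached file
```

   Videos recorded with a variable frame rate would need real per-frame timestamps instead of this arithmetic.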
   
### 4. trial onsets relative to start of video

   Note, for experiments that used manual triggers for trials or software that does not output log files with trial timing info, you will still need to create a manual log of trial onsets with respect to the start of the video using visual inspection of recorded videos or handwritten logs. Please see notes below on manual timing info for formatting instructions. 

   **(4a) trial onsets/offsets relative to start of experiment**

   Trial onsets are typically recorded in stimulus presentation software with respect to the start of the experiment. That is, you get an output log file with timestamps for when each trial began (and either trial durations or trial ends). 
    
   In Lookit, which can record _and_ present stimuli, both video and experiment onsets are logged, so the single Lookit log file (.json) is enough to get step (4b) below. 

   **(4b) experiment onset with respect to start of video**
    
   Ultimately we need trial onsets with respect to the start of the video, since this is how look events are annotated. If we only have trial onsets/offsets with respect to the start of the experiment, we need one additional number, which is how long after the video onset (in ms) the experiment started. Then we can add this value to the trial times to get trial onsets/offsets with respect to video onset. 
    
   If this number is not recorded automatically (Lookit is the only software we know of that does this), you will need to determine it by manually "coding" the video and produce a file called `sub-<subjID>_session-<sessionID>_experiment_onset.txt` in your input_files folder. 
    
   E.g., if the experiment starts at video timestamp 423 ms for subjectID=SAX_ASDF_03 and sessionID=1, then there should be a file called `sub-SAX_ASDF_03_session-1_experiment_onset.txt` that ONLY contains the content: `423` 
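
   The arithmetic in (4b) can be sketched as follows. This is a minimal example, not the pipeline's own code: the helper name `read_experiment_onset`, the temporary folder, and the trial times are made up for illustration. Only the file name pattern and the value `423` come from the description above.

```python
import os
import tempfile

def read_experiment_onset(onsets_dir, subj_id, session_id):
    """Read the single integer (ms) stored in the experiment onset file."""
    fname = 'sub-{}_session-{}_experiment_onset.txt'.format(subj_id, session_id)
    with open(os.path.join(onsets_dir, fname)) as f:
        return int(f.read().strip())

# write an example file that contains only the value 423
onsets_dir = tempfile.mkdtemp()
example_file = os.path.join(onsets_dir, 'sub-SAX_ASDF_03_session-1_experiment_onset.txt')
with open(example_file, 'w') as f:
    f.write('423')

expt_onset = read_experiment_onset(onsets_dir, 'SAX_ASDF_03', 1)

# made-up trial [onset, offset] pairs in ms, relative to the start of the experiment
trials_experiment = [[0, 5000], [6000, 11000]]

# shift every time by the experiment onset to get video-relative times
trials_video = [[on + expt_onset, off + expt_onset] for on, off in trials_experiment]
```

   If the video started after the experiment did, the value in the file would be negative and the same addition still applies.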
    
    



## Setting the `trial_timing_method` variable 

This variable defines how trial onsets/durations/offsets are obtained, whether they were recorded with respect to the start of the video or the start of the experiment. 


### 1:  from 'manually-formatted' file 

This supposes you are either manually writing down trial onset times, or that you have written your own code to extract this data from a file format not yet supported.

To use this option, you are expected to have CSVs named `sub-<subjID>_session-<sessionID>_trial_info.csv` in the input_files folder. 

These files are expected to have the following columns (including header names): 

    - trialNumber: integer, from 1 to _n_ trials; which trial 
    - onset: integer or float, in ms; the time at which the trial started, with respect to the start of the experiment OR the video (if with respect to the start of the experiment, `sub-<subjID>_session-<sessionID>_experiment_onset.txt` should contain a non-zero value -- see the `experiment_onset_method` section below) 
    - offset: integer or float, in ms; the time at which the trial ended; follows the same rules as onset 

Therefore, you should have 3 columns and _n_ rows, where _n_ is the number of trials for this subject/session.
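
A trial info file with this layout might look like the example below, together with a small sanity check you could run before feeding files to the pipeline. The checker `check_trial_info` is hypothetical and the trial times are invented. The rule that each offset must be later than its onset is an assumption, not something the pipeline enforces.

```python
import io
import pandas as pd

REQUIRED_COLUMNS = ['trialNumber', 'onset', 'offset']

def check_trial_info(df):
    """Raise ValueError if the trial info table does not match the expected layout."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError('missing columns: {}'.format(missing))
    if list(df['trialNumber']) != list(range(1, len(df) + 1)):
        raise ValueError('trialNumber should run from 1 to n')
    # assumption: a trial cannot end before (or exactly when) it starts
    if (df['offset'] <= df['onset']).any():
        raise ValueError('every offset should be later than its onset')
    return df

# stand-in for a file on disk: 3 columns, one row per trial, times in ms
example_csv = io.StringIO(
    'trialNumber,onset,offset\n'
    '1,0,5000\n'
    '2,6000,11000\n'
    '3,12500,17500\n'
)
trial_info = check_trial_info(pd.read_csv(example_csv))
```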


### 2: from lookit logs 

uses lookit 'parser' (code from Gal to parse lookit .json logs) to obtain both trial onsets/offsets and experiment/video onsets to define trial times with respect to the video, all in one go


### _3+: future parsers_

_In the future, we can make parsers for other log formats, e.g., jsPsych, PsychoPy, OpenSesame, PsychToolBox, etc. Please inform your lab tech if you have another log format and/or could not easily convert your logs to 'manually-formatted files' on your own_

In [6]:
trial_timing_method = 1


## Setting the `experiment_onset_method` variable

this variable defines whether you are reading experiment onset with respect to the video from a file (see 4b above) or by using a Lookit parser 

### 1: manual  

looks for `sub-<subjID>_session-<sessionID>_experiment_onset.txt` to add value in file to all trial times  

note: if you use some software besides lookit that outputs this information, you can:
    (1) create a secondary script to pull this information from whatever log files you have and write the files `sub-<subjID>_session-<sessionID>_experiment_onset.txt`; that is, set to manual (`experiment_onset_method = 1`) even if not literally manually obtained 
    (2) ask your lab tech to build you a parser for your file type 
    
    Note, if your trial times are read from a file but are already defined with respect to the video, you still should create these expected files, with the value written in each file set to 0 to prevent crashing 
    
### 2: lookit 

uses lookit 'parser' (code from Gal to parse lookit .json logs) to obtain both trial onsets/offsets and experiment/video onsets to define trial times with respect to the video, all in one go

### _3+: future parsers_

_In the future, we can make parsers for other log formats, e.g., jsPsych, PsychoPy, OpenSesame, PsychToolBox, etc. Please inform your lab tech if you have another log format and/or could not easily convert your logs to 'manually-formatted files' on your own_

In [7]:
experiment_onset_method = 1

## Setting the `subject_info_method` variable

this variable defines whether you are reading subject-level variables from an isolated CSV or from some other parser 


### 1: manual  

looks for `sub-<subjID>_session-<sessionID>_subject_info.csv` 
    
Expects the following information, read as a CSV with headers (i.e., the CSV should have 2 rows: row 1 is column names, row 2 is info; as many columns as subject-level variables you want to include):

    - subject_ID* (as type: string)
    - session_number** (as type: integer)
    - DOB -- (mm/dd/yyyy)
    - gender (as type: string)
    - session_date (mm/dd/yyyy)
    - subject_group, if relevant (as type: string)
    
    * required
    ** default is '1' if not provided 
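
A subject info file in this format could be read roughly as sketched below. The function `read_subject_info` is hypothetical (it is not the notebook's `get_subject_info`), and the DOB, gender, date, and group values are invented. The sketch applies the default session number of 1 and parses the two date columns as mm/dd/yyyy.

```python
import io
import pandas as pd

def read_subject_info(csv_file):
    """Read a one-row subject info CSV, applying the default session number."""
    df = pd.read_csv(csv_file, dtype={'subject_ID': str})
    if 'subject_ID' not in df.columns or df['subject_ID'].isna().any():
        raise ValueError('subject_ID is required')
    # session_number defaults to 1 when the column or the value is missing
    if 'session_number' not in df.columns:
        df['session_number'] = 1
    df['session_number'] = df['session_number'].fillna(1).astype(int)
    # dates are expected as mm/dd/yyyy
    for col in ('DOB', 'session_date'):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')
    return df

# stand-in for a file on disk: row 1 is column names, row 2 is the info
example_csv = io.StringIO(
    'subject_ID,DOB,gender,session_date,subject_group\n'
    'SAX_ASDF_03,01/15/2021,F,07/20/2021,control\n'
)
subject_info = read_subject_info(example_csv)
age_days = (subject_info.loc[0, 'session_date'] - subject_info.loc[0, 'DOB']).days
```

Parsing the dates up front makes it straightforward to derive age at test from `session_date` and `DOB`.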
   
    
### 2: lookit 

uses lookit 'parser' (code from Gal to parse lookit .json logs) to obtain subject, session data 


### _3+: future parsers_

_In the future, we can make parsers for other log formats, e.g., jsPsych, PsychoPy, OpenSesame, PsychToolBox, etc. Please inform your lab tech if you have another log format and/or could not easily convert your logs to 'manually-formatted files' on your own_

In [8]:
subject_info_method = 1

next steps: 


- check over lookit reading stuff (using gal's example)

- read in new CSV w/ subject level info 
- edit write to CSV func 
- check over main func 

- see what other parsers he has (trial info)
- build in parser flexibility

- reorganize/clean up/document 

## import libraries 

In [9]:
import glob
import os
import os.path as op
import sys
from pathlib import Path
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
from helperfuncs.video_framerates import get_frame_information
from helperfuncs.video_framerates import write_to_json
from helperfuncs.lookit_json_parser import get_lookit_trial_times

ModuleNotFoundError: No module named 'helperfuncs'

## Set relevant paths  -- todo: fix paths for various input files -> just folders

In [20]:
# global directory path variables. make these your folder names under MCS
project_dir = '/om3/group/saxelab/LAB_STANDARD_LOOKING_TIME_CODE/looking_time/template_project_dir'
data_dir = op.join(project_dir, 'data')

# where are icatcher outputs
icatcher_outputs_dir = op.join(data_dir, 'icatcher_outputs')

# where are trial info files 
trial_info_dir = op.join(data_dir, 'trial_info')
if (trial_timing_method == 2) or (experiment_onset_method == 2): # if lookit used for anything
    lookit_trial_info_csv = op.join(trial_info_dir, 'lookit_trial_timing_info.csv')

experiment_onsets_dir = op.join(data_dir, 'video_relative_experiment_onsets')

# where are subject info files 
subject_info_dir = op.join(data_dir, 'subject_info')

# where are videos 
videos_dir = op.join(project_dir, 'data/videos')

### Create all necessary functions

#### get all non-hidden files in dir (helper function):

In [14]:
# list all files except those beginning with '.' i.e., hidden files 
def listdir_nohidden(path):
    for f in os.listdir(path):
        if not f.startswith('.'):
            yield f

#### convert iCatcher annotated look-events from frame-wise to timing (ms from video onset) 

In [15]:
def read_convert_output(filename, stamps):
    """
    Given an npz file containing icatcher annotated frames and looks,
    converts to pandas DataFrame with another column mapping each frame
    to its time stamp in the video
    
    INPUTS: 
    filename (string): name of tabulated iCatcher output file in format
    '[CHILD_ID].npz'
    stamps (List[int]): time stamp for each frame, where stamps[i] is the 
    time stamp at frame i (determined in function get_frame_information(), IMPORTED function)
    
    OUTPUTS: 
    rtype: DataFrame
    
    """
    npz = np.load(filename)
    df = pd.DataFrame([])

    lst = npz.files

    df['frame'] = range(1, len(npz[lst[0]]) + 1)
    df['on_off'] = ['on' if frame > 0 else 'off' for frame in npz[lst[0]]]
    
    # TO DO: DELETE IF UNUSED 
    #df['confidence'] = npz[lst[1]]

    # convert frames to ms using frame rate
    df['time_ms'] = stamps
    df['time_ms'] = df['time_ms'].astype(int)
    
    return df

#### get trial onsets w/r/t video

In [16]:
def get_trial_sets(child_id, session_id, trial_info_file):
    """
    Finds corresponding trial info 
    and returns a list of [onset, offset] times for each trial in 
    milliseconds, with respect to video onset

    """
    
    if trial_timing_method == 1: # trial info obtained from manually-formatted file
        
        df = pd.read_csv(trial_info_file)
        
    if trial_timing_method == 2: # trial info obtained from lookit 
        
        if Path(lookit_trial_info_csv).is_file(): # check whether lookit file already parsed
            df = pd.read_csv(trial_info_file)
            
        else: # otherwise, parse and save out relevant info  
            df = get_lookit_trial_times(icatcher_outputs_dir)
            df.to_csv(lookit_trial_info_csv)
            
        # get part of df from current child 
        # (applies whether the file was already parsed or was parsed just now)
        df = df[df['child_id'] == child_id] 
        df = df[df['session_id'] == session_id] 
            
            
    if experiment_onset_method == 1: # if NOT lookit for experiment onset info too... 
    
        onset_file = glob.glob(op.join(experiment_onsets_dir, 'sub-{}_session-{}_experiment_onset.txt'.format(child_id, session_id)))[0]
    
        #get experiment onset 
        with open(onset_file) as f:
            text = f.read()
            expt_onset = int(text)
            
            # add difference to create relative onsets/offsets  
            
            # note: relative means, trial onsets/offsets relative to the start of video 
            # also note: here we add under assumption that video starts BEFORE experiment starts 
            # if your video starts AFTER experiment starts, set as negative value in file 
            df['relative_onset'] = df['onset'] + expt_onset
            df['relative_offset'] = df['offset'] + expt_onset

    
    
    # there's two different file formats -- updated as needed 
    # WHAT IS THIS ?? ASK GAL, MAYBE REMOVE THIS STEP 
    
    df_sets = df[['relative_onset', 'relative_offset']]
    df_sets = df_sets.rename(columns={"relative_onset": "onset", "relative_offset": "offset"})

    # reassign rather than drop in place, since df_sets is derived from a slice of df
    df_sets = df_sets.dropna()
        
        
    trial_sets = []
    for _, trial in df_sets.iterrows():
        trial_sets.append([int(trial['onset']), int(trial['offset'])])

    def unique(sequence):
        seen = set()
        return [x for x in sequence if not (tuple(x) in seen or seen.add(tuple(x)))]

    
    
    return unique(trial_sets), df
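
The main function further down relies on a `trial` column in the iCatcher DataFrame and on a `get_on_off_times` function, neither of which is defined in this notebook. One way those steps could work is sketched below. The names `assign_trials` and `sum_on_off_times` are hypothetical, and this is not the lab's implementation: it assumes each frame lasts until the next frame's time stamp and reports totals in ms.

```python
import pandas as pd

def assign_trials(frames, trial_sets):
    """Label each frame with a 1-based trial number; 0 means outside any trial."""
    frames = frames.copy()
    frames['trial'] = 0
    for i, (onset, offset) in enumerate(trial_sets, start=1):
        in_trial = (frames['time_ms'] >= onset) & (frames['time_ms'] < offset)
        frames.loc[in_trial, 'trial'] = i
    return frames

def sum_on_off_times(frames):
    """Return [[on_ms, off_ms], ...], one pair per trial."""
    frames = frames.copy()
    # each frame is credited with the gap to the next frame;
    # the final frame has no successor, so it gets the median gap
    gaps = frames['time_ms'].diff().shift(-1)
    frames['dur_ms'] = gaps.fillna(gaps.median())
    totals = []
    for _, grp in frames[frames['trial'] > 0].groupby('trial'):
        on = grp.loc[grp['on_off'] == 'on', 'dur_ms'].sum()
        off = grp.loc[grp['on_off'] == 'off', 'dur_ms'].sum()
        totals.append([int(on), int(off)])
    return totals

# synthetic data: 10 frames, 100 ms apart, in the layout read_convert_output produces
frames = pd.DataFrame({
    'frame': range(1, 11),
    'on_off': ['on'] * 3 + ['off'] * 2 + ['on'] * 5,
    'time_ms': range(0, 1000, 100),
})
trial_sets = [[0, 500], [500, 1000]]  # [onset, offset] in ms from video onset

labeled = assign_trials(frames, trial_sets)
on_off_totals = sum_on_off_times(labeled)
```

With this synthetic input, the first trial has 300 ms on and 200 ms off, and the second has 500 ms on and 0 ms off.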

#### Get subject level data 

In [18]:
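def get_subject_info(child_id, session_id): 
    """
    Returns a DataFrame of subject-level info for this child and session
    """
    
    if subject_info_method == 1: # subject info read from manually-formatted CSV
        subject_info_file = glob.glob(op.join(subject_info_dir, 'sub-{}_session-{}_subject_info.csv'.format(child_id, session_id)))[0]
        df = pd.read_csv(subject_info_file)
    else: # other methods (e.g., lookit) are not implemented here yet
        raise NotImplementedError('subject_info_method {} is not supported yet'.format(subject_info_method))
        
    return df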

#### write a csv with subject, trial, and look level information (datavyu format)

In [None]:
def write_to_csv(data_filename, child_id, icatcher_data, session, trial_type, stim_type, icatcher):
    """
    checks if output file is in directory. if not, writes new file
    containing looking times computed by iCatcher for child
    with ID child_id. if the file exists and the child is not in it yet,
    appends the child's rows. 
    
    data_filename (string): path of the output CSV
    child_id (string): unique child ID associated with subject
    icatcher_data (List[List[int]]): list of [on times, off times] per trial
                calculated from iCatcher
    session (string): the experiment session the participant was placed in
    trial_type (list-like): trial type for each trial
    stim_type (list-like): stimulus type for each trial
    icatcher (DataFrame): frame-level iCatcher data; must contain the
                'on_off', 'trial', and 'confidence' columns
    rtype: None
    """
    # assert(len(icatcher_data) == len(datavyu_data))
    num_trials = len(icatcher_data)
    id_arr = [child_id] * len(icatcher_data)
    data = {
        'child': id_arr, # * subject level info
        'session': [session] * num_trials, # * subject level info
        'trial_num': [i + 1 for i in range(len(icatcher_data))], # * Trials.ordinal
        'trial_type': trial_type, # * Trials.x
        'stim_type': stim_type, # * Trial level info
        'confidence': list(icatcher[(icatcher['on_off'] == 'on') & (icatcher['trial'] != 0)].groupby('trial')[['confidence']].mean().squeeze()), # * no confidence
        'iCatcher_on(s)': [trial[0] for trial in icatcher_data], # * don't want this
        'iCatcher_off(s)': [trial[1] for trial in icatcher_data] # * don't want this
    }

    df = pd.DataFrame(data)

    output_file = Path(data_filename)
    if not output_file.is_file():
        df.to_csv(data_filename)
        return
    
    output_df = pd.read_csv(data_filename, index_col=0)
    ids = output_df['child'].unique()

    if child_id not in ids:
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        output_df = pd.concat([output_df, df], ignore_index=True)
        output_df.to_csv(data_filename)
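
The "write if missing, otherwise append only new children" pattern used above can be tried in isolation with the sketch below. The helper `append_if_new`, the temporary file, and the toy rows are all made up for illustration. It uses `pd.concat`, which works in current pandas versions.

```python
import os
import tempfile
import pandas as pd

def append_if_new(csv_path, new_rows, id_col='child'):
    """Append new_rows to csv_path unless their ID is already there. Returns True if written."""
    if not os.path.isfile(csv_path):
        new_rows.to_csv(csv_path)
        return True
    existing = pd.read_csv(csv_path, index_col=0)
    if new_rows[id_col].isin(existing[id_col]).any():
        return False
    pd.concat([existing, new_rows], ignore_index=True).to_csv(csv_path)
    return True

csv_path = os.path.join(tempfile.mkdtemp(), 'example_output.csv')
child_a = pd.DataFrame({'child': ['A', 'A'], 'trial_num': [1, 2]})
child_b = pd.DataFrame({'child': ['B'], 'trial_num': [1]})

wrote_a = append_if_new(csv_path, child_a)        # file does not exist yet, so it is created
wrote_a_again = append_if_new(csv_path, child_a)  # skipped: 'A' is already in the file
wrote_b = append_if_new(csv_path, child_b)        # appended below the rows for 'A'
result = pd.read_csv(csv_path, index_col=0)
```

Because the check is by child ID only, a second session for the same child would be skipped; keying on both child and session would avoid that.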

### main function: run processes 

In [4]:
def run_analyze_output(data_filename="BBB_output.csv", session=None):
    """
    For every iCatcher output file in icatcher_outputs_dir that has not
    already been processed, gets the time stamp of each video frame, 
    gets trial onsets/offsets, computes looking times per trial, and
    writes the results to data_filename. 
    data_filename (string): name of file you want output data to be written
            to. Must have .csv ending. 
    session (string): ID of the experiment session. If session is not
            specified, looks for videos only within videos_dir, otherwise
            searches within [videos_dir]/session[session]
    """
    for filename in listdir_nohidden(icatcher_outputs_dir):
        child_id = filename.split('.')[0]
        
        session_id = 1
        # TO DO: UPDATE 
        
        # determine trial info files 
        if trial_timing_method == 1:
            trial_info_file = glob.glob(op.join(trial_info_dir, 'sub-{}_session-{}_trial_info.csv'.format(child_id, session_id)))[0]
        elif trial_timing_method == 2:
            trial_info_file = lookit_trial_info_csv
                                 

        # skip if child data already added
        output_file = Path(data_filename)
        if output_file.is_file():
            output_df = pd.read_csv(data_filename, index_col=0)
            ids = output_df['child'].unique()
            if child_id in ids: 
                print(child_id + ' already processed')
                continue
        
        # build the folder first so the .mp4 and .json paths share it
        vid_dir = videos_dir + '/'
        if session:
            vid_dir += "session" + session + '/'
        vid_path = vid_dir + child_id + ".mp4"
        json_video_data = vid_dir + child_id + '.json'

        # get timestamp for each frame in the video
        print('getting frame information for {}...'.format(vid_path))
        timestamps, length = get_frame_information(vid_path, json_video_data)
        if not timestamps:
            print('video not found for {} in {} folder'.format(child_id, videos_dir))
            continue
        
        # initialize df with time stamps for iCatcher file
        icatcher_path = icatcher_outputs_dir + '/' + filename
        icatcher = read_convert_output(icatcher_path, timestamps)

        # get trial onsets and offsets from input file, match to iCatcher file
        trial_sets, df = get_trial_sets(child_id, session_id, trial_info_file)
        
        # sum on looks and off looks for each trial
        icatcher_times = get_on_off_times(icatcher)
        # datavyu_times = get_output_times(output_file)

        # check whether number of trials from trial info is the same as 
        # the number of trials found in the iCatcher data
        if icatcher['trial'].max() != len(df):
            print('mismatch in # of trials between icatcher and session info: {} in {} folder'.format(child_id, videos_dir))
            continue

        write_to_csv(data_filename, child_id, icatcher_times, session, df['fam_or_test'], df['scene'], icatcher)
        # return comparison metrics 
        # icatcher_arr, datavyu_arr = np.array(icatcher_times).flatten(), np.array(datavyu_times).flatten()
        #stat, p = pearsonr(icatcher_arr, datavyu_arr)
       # print('Datavyu total on-off looks per trial: \n', datavyu_times)
      #  print('iCatcher total on-off looks per trial: \n', icatcher_times)
      #  print('Pearson R coefficient: {} \np-value: {}'.format(round(stat, 3), round(p, 3)))


# EXECUTE HERE: 

In [19]:
run_analyze_output()

NameError: name 'run_analyze_output' is not defined