## New Dataset Cleaning

**Goal**: This notebook creates functions that clean the new datasets coming in from `extract_AthenaDelayComp_dat.m`. I want to reshape the new df so that it can be used directly with the existing design matrix generators.

For context, `extract_AthenaDelayComp_dat.m` is a script written by Chuck and me to get violation data from all sessions of a trained PWM animal (not just up to session 200). It appears that Athena encoded a trial as a timeout whenever the "wait_for_cpoke" state Tup'd (timed out) after 2 minutes. In the old dataset, these were counted as violations. I want to remove them from the current dataset and reset the trial counters, while still storing how many timeouts occurred in a row. It would also be good to find sessions with high timeout rates, since they will show a large change in total trial counts.
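
A minimal sketch of how sessions with high timeout rates could be found, assuming `timeout_history` is coded 0/1 and that `session_counter` identifies a session (both column names come from the raw CSV; the function name and the toy data are made up for illustration):

```python
import pandas as pd


def timeout_rate_by_session(df, session_col="session_counter", timeout_col="timeout_history"):
    """Summarize timeouts per session, highest timeout rate first."""
    summary = df.groupby(session_col)[timeout_col].agg(n_trials="size", n_timeouts="sum")
    summary["timeout_rate"] = summary["n_timeouts"] / summary["n_trials"]
    # Trial count each session should have once timeout rows are dropped
    summary["n_trials_after_drop"] = summary["n_trials"] - summary["n_timeouts"]
    return summary.sort_values("timeout_rate", ascending=False)


# Toy data (made up for illustration, not taken from W078)
toy = pd.DataFrame(
    {
        "session_counter": [1, 1, 1, 1, 2, 2],
        "timeout_history": [0, 1, 1, 0, 0, 0],
    }
)
summary = timeout_rate_by_session(toy)
```

The `n_trials_after_drop` column gives a quick check on how much each session's trial count should shrink.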

In [1]:
import pathlib
import sys

# Make every folder under ../src/ importable
for folder in pathlib.Path("../src/").iterdir():
    if folder.is_dir():
        sys.path.append(str(folder))

import pandas as pd
import numpy as np

**What is expected by the design matrix generator**

`animal_id` : str <-- `rat_name`

`session`: int <-- `session_counter`

`trial`: int <-- to be created (no raw column)

`s_a`: float64 <-- `A1_dB`

`s_b`: float64 <-- `A2_dB`
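
A rough sketch of the dtype casting this list implies, assuming the columns have already been renamed to the target names above. The function name and toy values are hypothetical, and `trial` is skipped because it does not exist until a trial counter is built:

```python
import pandas as pd


def format_dtypes_sketch(df):
    """Cast the renamed columns to the dtypes listed above."""
    df = df.copy()
    df["animal_id"] = df["animal_id"].astype(str)
    df["session"] = df["session"].astype("int64")
    for col in ("s_a", "s_b"):
        # NaN is preserved by the float cast
        df[col] = df[col].astype("float64")
    return df


# Toy data (made up for illustration)
toy = pd.DataFrame(
    {
        "animal_id": ["W078", "W078"],
        "session": [1.0, 2.0],
        "s_a": [60, 70],
        "s_b": [62.5, float("nan")],
    }
)
formatted = format_dtypes_sketch(toy)
```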





In [15]:
renaming_map = {
    "rat_name": "animal_id",
    "session_counter": "session_id",
    "A1_dB": "s_a",
    "A2_dB": "s_b",
    "hit_history": "hit",
    "violation_history": "violation",
    "timeout_history": "trial_not_started",
    "A1_sigma": "s_a_sigma",  # not sure why i kept this since db mapping is done in matlab
    "Rule": "rule",
    "violation_iti": "violation_penalty_time",
    "error_iti": "error_penalty_time",
    "secondhit_delay": "delayed_reward_time",
    "PreStim_time": "pre_stim_time",
    "A1_time": "s_a_time",
    "Del_time": "delay_time",
    "A2_time": "s_b_time",
    "time_bet_aud2_gocue": "post_s_b_to_go_cue_time",
    "time_go_cue": "go_cue_time",
    "CP_duration": "fixation_time",
    "Left_volume": "l_water_vol",
    "Right_volume": "r_water_vol",
    "Beta": "antibias_beta",  # higher = stronger antibias
    "RtProb": "antibias_right_prob",  # higher = more likely for a right trial to occur
    "pysch_pairs": "using_psychometric_pairs",
    # "THIS_SIDE": "correct_side", TODO
    # "CenterLED_duration" : "trial_start_wait_time" TODO
}
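
One way the map might be applied, as a sketch. `DataFrame.rename` silently ignores keys that are not present, so this hypothetical helper also reports which raw names were not found. The toy mapping is a small subset of `renaming_map` plus one made-up key:

```python
import pandas as pd


def rename_columns_sketch(raw_df, mapping):
    """Rename columns per `mapping`; also return the raw names that were not found."""
    missing = sorted(set(mapping) - set(raw_df.columns))
    return raw_df.rename(columns=mapping), missing


# Subset of the mapping, plus a made-up key to show the missing-column report
toy_map = {"rat_name": "animal_id", "A1_dB": "s_a", "not_a_column": "unused"}
toy = pd.DataFrame({"rat_name": ["W078"], "A1_dB": [60.0], "rig_id": [1]})
renamed, missing = rename_columns_sketch(toy, toy_map)
```

Columns without an entry in the map (here `rig_id`) pass through unchanged.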

In [17]:
# TODO:
# in python:
# -- confirm dB is NaN if stimuli are off
# -- map correct side where Left=0 and Right=1
# -- map choice (the side chosen **by** the rat) where Left=0 and Right=1 (and Violation=NaN)
# -- compute delay len (subtract go or not?)
# -- drop timeouts but maintain dur
# -- make trial counter
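
For the choice item, a minimal sketch of how choice could be derived once correct side is coded Left=0 / Right=1. It assumes `hit` is 1 for correct, 0 for incorrect, and NaN when the rat made no choice (e.g. a violation); that coding is an assumption to confirm against the raw data, and the function name is hypothetical:

```python
import numpy as np


def infer_choice(correct_side, hit):
    """Return the chosen side (Left=0, Right=1, NaN if no choice was made)."""
    correct_side = np.asarray(correct_side, dtype=float)
    hit = np.asarray(hit, dtype=float)
    # A hit means the rat chose the correct side; a miss means the opposite side
    choice = np.where(hit == 1, correct_side, 1 - correct_side)
    # No hit value means no choice was made
    return np.where(np.isnan(hit), np.nan, choice)


# Toy data (made up for illustration)
choice = infer_choice(correct_side=[0, 0, 1, 1, 1], hit=[1, 0, 1, 0, np.nan])
```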

In [13]:
ndf = pd.read_csv("/Volumes/brody/jbreda/PWM_data_scrape/W078_trials_data.csv")

ndf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108475 entries, 0 to 108474
Data columns (total 26 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   rat_name             108475 non-null  object 
 1   session_date         108475 non-null  int64  
 2   session_counter      108475 non-null  int64  
 3   rig_id               108475 non-null  int64  
 4   training_stage       108475 non-null  int64  
 5   A1_dB                102034 non-null  float64
 6   A2_dB                102034 non-null  float64
 7   hit_history          91273 non-null   float64
 8   violation_history    108475 non-null  int64  
 9   timeout_history      108475 non-null  int64  
 10  A1_sigma             108475 non-null  float64
 11  Rule                 108475 non-null  object 
 12  violation_iti        108475 non-null  int64  
 13  error_iti            108475 non-null  int64  
 14  secondhit_delay      108475 non-null  int64  
 15  PreStim_time     

In [9]:
rdf = pd.read_csv(
    "/Users/jessbreda/Desktop/github/animal-learning/data/raw/rat_behavior.csv"
)

In [10]:
rdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2540006 entries, 0 to 2540005
Data columns (total 10 columns):
 #   Column          Dtype  
---  ------          -----  
 0   subject_id      object 
 1   session         int64  
 2   trial           int64  
 3   s_a             float64
 4   s_b             float64
 5   choice          float64
 6   correct_side    int64  
 7   hit             float64
 8   delay           float64
 9   training_stage  int64  
dtypes: float64(5), int64(4), object(1)
memory usage: 193.8+ MB


In [6]:
from get_rat_data import get_rat_viol_data

df = get_rat_viol_data(animal_ids="W078")

df.info()

returning viol data for W078
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53299 entries, 0 to 53298
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   animal_id           53299 non-null  object 
 1   session             53299 non-null  int64  
 2   trial               53299 non-null  int64  
 3   s_a                 40115 non-null  float64
 4   s_b                 40115 non-null  float64
 5   choice              43481 non-null  float64
 6   correct_side        53299 non-null  int64  
 7   hit                 43481 non-null  float64
 8   delay               53299 non-null  float64
 9   training_stage      53299 non-null  int64  
 10  violation           53299 non-null  bool   
 11  n_trial             53299 non-null  int64  
 12  training_stage_cat  53299 non-null  int64  
dtypes: bool(1), float64(5), int64(6), object(1)
memory usage: 4.9+ MB


In [None]:
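def load_raw_animal_df(animal_id):
    """TODO: read the scraped trials CSV for `animal_id` into a DataFrame."""
    pass

def rename_columns(raw_df):
    """TODO: rename raw columns using `renaming_map`."""
    pass

def format_dtypes(raw_df):
    """TODO: cast columns to the dtypes the design matrix generator expects."""
    pass

def drop_and_account_for_timeouts(raw_df):
    """TODO: drop timeout trials while storing how many occurred in a row."""
    pass

def add_trial_counter(raw_df):
    """TODO: rebuild the per-session trial counter after timeouts are dropped."""
    pass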
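
A minimal sketch of what `drop_and_account_for_timeouts` and `add_trial_counter` might do together, assuming `timeout_history` is coded 0/1 and rows are already in trial order within each session. The output column `n_prev_timeouts`, the 1-based counter, and the toy data are all assumptions made for illustration:

```python
import pandas as pd


def drop_timeouts_sketch(df, session_col="session_counter", timeout_col="timeout_history"):
    """Drop timeout rows, count the timeouts preceding each kept trial, renumber trials."""
    is_timeout = df[timeout_col].astype(bool)

    # Running count of timeouts within each session
    timeouts_so_far = is_timeout.astype(int).groupby(df[session_col]).cumsum()

    kept = df.loc[~is_timeout].copy()
    kept_cum = timeouts_so_far[~is_timeout]

    # Timeouts since the previous kept trial = difference in the running count
    prev_cum = kept_cum.groupby(kept[session_col]).shift(1).fillna(0)
    kept["n_prev_timeouts"] = (kept_cum - prev_cum).astype(int)

    # Fresh per-session trial counter (1-based here, which is an assumption)
    kept["trial"] = kept.groupby(session_col).cumcount() + 1
    return kept.reset_index(drop=True)


# Toy data (made up for illustration, not taken from W078)
toy = pd.DataFrame(
    {
        "session_counter": [1, 1, 1, 1, 1, 2, 2, 2],
        "timeout_history": [0, 1, 1, 0, 1, 1, 0, 0],
    }
)
out = drop_timeouts_sketch(toy)
```

Timeouts at the very end of a session have no following trial to attach to, so this sketch does not record them. If those matter, a per-session total would need to be stored separately.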