<a href="https://colab.research.google.com/github/jobellet/fast_and_rich_decoding_in_VLPFC/blob/main/Download_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This dataset was used in the study "Decoding rapidly presented visual stimuli from prefrontal ensembles without report nor post-perceptual processing," published in *Neuroscience of Consciousness* in 2022 (DOI: [10.1093/nc/niac005](https://doi.org/10.1093/nc/niac005)).

# Downloading, Renaming, and Cleaning Up the Dataset

- The `monkeyA.pkl` dataset contains recordings from Monkey A, encompassing data from both the Prefrontal Cortex (PFC) and Posterior Parietal Cortex (PPC), recorded simultaneously.

- The `monkeyH.pkl` dataset includes PFC data for Monkey H, recorded during a session separate from the PPC data found in `monkeyH_PPC.pkl`.

In the original dataset, variable names are not intuitive, time scales vary among some variables, and some variables that were not analyzed previously may contain incorrect or unverified information.

**The next cell will perform the following operations:**

1. **Variable Renaming:** Update column names to be more descriptive, reducing potential confusion.
2. **Trial Index Calculation:** Generate a `TrialIndex` that groups consecutive stimuli into the same trial as long as the inter-stimulus interval stays below 600 ms. A new trial starts when the interval reaches or exceeds 600 ms.
3. **Dataframe Cleaning:** Remove unneeded columns such as various saccade metrics and blink details to simplify the dataset.
4. **Type Conversion and Corrections:**
   - Convert `InterStimulusInterval` from milliseconds to seconds.
   - Correct `StimulusDuration` when each row mistakenly holds the whole list of durations for a recording day, so that every row keeps only the duration of its own stimulus.
   - Convert `StimulusPosition`, `StimulusIdentity`, and `RecordingDayIndex` to integers for consistent and easier data handling.

These steps will make the data more intuitive for further analysis, with clearly named variables and a streamlined dataset structure.
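The trial-grouping rule from step 2 can be sketched on a toy table. This is a minimal vectorized illustration under stated assumptions, not the notebook's own implementation: the helper `assign_trial_index` and the column names `isi_ms` and `day` are hypothetical, each row's interval is assumed to describe the gap to the next stimulus, and a change of recording day is assumed to start a new trial.

```python
import numpy as np
import pandas as pd

def assign_trial_index(isi_ms, day, threshold_ms=600):
    """Return 1-based trial indices (hypothetical helper for illustration).

    A row opens a new trial when the interval stored on the previous row
    reaches the threshold, or when the recording day changes.
    """
    isi_ms = np.asarray(isi_ms, dtype=float)
    day = np.asarray(day)
    new_trial = np.zeros(len(isi_ms), dtype=bool)
    # Row k starts a new trial if row k-1 had a long interval or a different day
    new_trial[1:] = (isi_ms[:-1] >= threshold_ms) | (day[1:] != day[:-1])
    return np.cumsum(new_trial) + 1

# Toy data: intervals in milliseconds and a recording-day label
toy = pd.DataFrame({
    "isi_ms": [200, 200, 1500, 200, 900, 200],
    "day":    [1, 1, 1, 1, 1, 2],
})
toy["trial"] = assign_trial_index(toy["isi_ms"], toy["day"])
print(toy["trial"].tolist())  # [1, 1, 1, 2, 2, 3]
```

The cumulative sum of the "new trial" flags gives the same kind of running counter that a row-by-row loop would maintain, without iterating over the DataFrame.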

## Execute the following cell to download and clean the datasets


In [None]:
import urllib.request
import pandas as pd
import numpy as np

# Functions for loading data

def get_monkeyA_df():
    """Load or download the Monkey A dataset from both PFC and PPC."""
    try:
        df = pd.read_pickle('monkeyA.pkl')
    except FileNotFoundError:
        link_to_monkeyA_data = 'https://figshare.com/ndownloader/files/27869238'
        urllib.request.urlretrieve(link_to_monkeyA_data, 'monkeyA.pkl')
        df = pd.read_pickle('monkeyA.pkl')
    return df

def get_monkeyH_dfs():
    """Load or download the Monkey H datasets for PFC and PPC from separate sessions."""
    try:
        dfPFC = pd.read_pickle('monkeyH.pkl')
    except FileNotFoundError:
        link_to_monkeyH_PFC_data = 'https://figshare.com/ndownloader/files/27946635'
        urllib.request.urlretrieve(link_to_monkeyH_PFC_data, 'monkeyH.pkl')
        dfPFC = pd.read_pickle('monkeyH.pkl')

    try:
        dfPPC = pd.read_pickle('monkeyH_PPC.pkl')
    except FileNotFoundError:
        link_to_monkeyH_PPC_data = 'https://figshare.com/ndownloader/files/28224414'
        urllib.request.urlretrieve(link_to_monkeyH_PPC_data, 'monkeyH_PPC.pkl')
        try:
            dfPPC = pd.read_pickle('monkeyH_PPC.pkl')
        except Exception:
            # Fall back to the pickle5 backport if the standard loader cannot read the file
            !pip3 install pickle5
            import pickle5 as pickle
            with open('monkeyH_PPC.pkl', "rb") as fh:
                dfPPC = pickle.load(fh)

    return dfPFC, dfPPC

def rename_columns(df, name_pairs):
    """Renames columns of a pandas DataFrame based on a list of name pairs.

    Parameters:
        df (pd.DataFrame): The DataFrame whose columns are to be renamed.
        name_pairs (list of tuples): A list where each tuple contains two strings,
                                     the first string is the original column name and
                                     the second is the new name.

    Returns:
        pd.DataFrame: A DataFrame with updated column names.
    """
    rename_dict = {old_name: new_name for old_name, new_name in name_pairs}
    return df.rename(columns=rename_dict)

def recalculate_trial_id_count(df):
    """Recalculates 'TrialIDcount' based on the 'TrialID' column (the inter-stimulus interval, in ms).
    Stimuli stay grouped in the same trial while 'TrialID' is below 600 ms.

    Parameters:
        df (pd.DataFrame): The DataFrame containing 'TrialID'.

    Returns:
        pd.DataFrame: The DataFrame with updated 'TrialIDcount'.
    """
    # Initialize a counter and a list to store the new 'TrialIDcount' values
    trial_counter = 1
    trial_id_counts = [1]

    # Iterate through each row in the DataFrame
    for i, row in df.iterrows():
        if i>1:
            if df.sesID[i]>df.sesID[i-1]:
                trial_counter += 1
        if i<len(df)-1:
            # If 'TrialID' is less than 600 ms, the next stimulus belongs to the same trial
            if row['TrialID'] < 600:
                trial_id_counts.append(trial_counter)
            else:
                trial_counter += 1
                trial_id_counts.append(trial_counter)


    # Assign the new 'TrialIDcount' values to the DataFrame
    df['TrialIDcount'] = trial_id_counts
    return df

def clean_dataset(df):
    """Cleans the DataFrame by removing unnecessary columns.

    Parameters:
        df (pd.DataFrame): The DataFrame to be cleaned.

    Returns:
        pd.DataFrame: A cleaned DataFrame with irrelevant columns removed.
    """
    columns_to_remove = ['blinks_on', 'blinks_off', 'saccades_on', 'saccades_off', 'saccades_x', 'saccades_y', 'Spikes_phase']

    # Group stimuli in trials based on the SOAs
    df  = recalculate_trial_id_count(df)

    # Correct the case where the whole duration list was passed in every row
    if isinstance(df['duration'][0], np.ndarray):
        durations = []
        for ses in np.unique(np.array(df['sesID'])):
            durations.append(df['duration'][np.where(np.array(df['sesID'])==ses)[0][0]])
        df['duration'] = np.concatenate(durations)

    # Convert the SOA from milliseconds to seconds
    df['TrialID'] = np.array(df['TrialID']).astype(float)/1000

    # Convert the stimulus position within a sequence to integers
    df['ItemID'] = np.array(df['ItemID']).astype(int)

    # Convert the stimulus identity to integers
    df['StimID'] = np.array(df['StimID']).astype(int)

    # Convert the recording day index to integers
    df['sesID'] = np.array(df['sesID']).astype(int)

    return df.drop(columns=columns_to_remove, errors='ignore')

name_pairs = [
    ('Spikes', 'SpikeTimes_vlPFC'),
    ('PPC_Spikes', 'SpikeTimes_PPC'),
    ('TrialID', 'InterStimulusInterval'),
    ('duration', 'StimulusDuration'),
    ('ItemID', 'StimulusPosition'),
    ('StimID', 'StimulusIdentity'),
    ('TrialIDcount', 'TrialIndex'),
    ('sesID', 'RecordingDayIndex')
]

# Load datasets
monkeyA_PFC_and_PPC = get_monkeyA_df()
monkeyH_PFC, monkeyH_PPC = get_monkeyH_dfs()



monkeyA_PFC_and_PPC = clean_dataset(monkeyA_PFC_and_PPC)
monkeyA_PFC_and_PPC = rename_columns(monkeyA_PFC_and_PPC, name_pairs)


monkeyH_PFC = clean_dataset(monkeyH_PFC)
monkeyH_PFC = rename_columns(monkeyH_PFC, name_pairs)


monkeyH_PPC = clean_dataset(monkeyH_PPC)
monkeyH_PPC = rename_columns(monkeyH_PPC, name_pairs)

monkeyA_PFC_and_PPC.to_pickle('cleaned_monkeyA.pkl')
monkeyH_PFC.to_pickle('cleaned_monkeyH_PFC.pkl')
monkeyH_PPC.to_pickle('cleaned_monkeyH_PPC.pkl')

## Detailed Description of the Cleaned Dataset Variables

- **InterStimulusInterval**: Measures the time in seconds between the onset of one stimulus and the onset of the next.

- **TrialIndex**: An integer that uniquely identifies each trial. All stimuli presented within the same trial share this index, facilitating the grouping of data by trial.

- **RecordingDayIndex**: An integer that uniquely identifies each recording day, allowing analyses to be segmented by specific experimental sessions.

- **StimulusIdentity**: A number between 0 and 17 that identifies the specific image displayed during a trial. This identifier is used to correlate specific stimuli with behavioral and neural responses.

- **StimulusPosition**: An ordinal number indicating the position of the stimulus within a trial sequence. The first stimulus in a sequence is numbered 1, which is important for studying response patterns to stimulus order.

- **StimulusDuration**: Time in seconds that the stimulus is physically present on the screen, as measured by the photodiode.

- **SpikeTimes_vlPFC**: Lists the spike times in seconds, relative to stimulus onset, for each of the 96 channels of the vlPFC array. These recordings from the ventrolateral prefrontal cortex provide insight into neural activity during stimulus presentation.

- **SpikeTimes_PPC**: Similar to `SpikeTimes_vlPFC`, but for the posterior parietal cortex (PPC) array. It holds the spike times in seconds relative to stimulus onset, enabling comparisons of neural dynamics across brain regions.

Each entry in the **SpikeTimes...** columns is a list of 96 lists. Each of these 96 lists contains the times at which spikes were detected relative to the stimulus onset. The time interval considered for detecting a spike is between -0.1 seconds and 0.7 seconds relative to stimulus onset.
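As an illustration of how one such entry might be used, the sketch below counts spikes per channel within a time window. It is a minimal example under stated assumptions: the helper `spike_counts` and its default window are hypothetical, and the toy entry has 3 channels rather than 96.

```python
import numpy as np

def spike_counts(spike_times, t_start=0.0, t_stop=0.3):
    """Count spikes per channel in [t_start, t_stop) for one stimulus.

    `spike_times` is a list of per-channel lists of spike times in seconds,
    relative to stimulus onset (hypothetical helper for illustration).
    """
    counts = []
    for channel in spike_times:
        times = np.asarray(channel, dtype=float)
        counts.append(int(np.sum((times >= t_start) & (times < t_stop))))
    return np.array(counts)

# Toy entry with 3 channels instead of 96; the second channel has no spikes
entry = [[-0.05, 0.02, 0.15], [], [0.31, 0.65]]
print(spike_counts(entry))                # [2 0 0]
print(spike_counts(entry, -0.1, 0.7))     # [3 0 2]
```

Applying such a function to every row of a `SpikeTimes...` column would yield one count vector per stimulus, a common starting point for decoding analyses.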

# Download to your local machine
## If you ran the previous cells in Google Colab and want to download the cleaned dataset to your local machine, run the following cell:

In [None]:
from google.colab import files

# List of file names to be downloaded
file_names = ['cleaned_monkeyA.pkl', 'cleaned_monkeyH_PFC.pkl', 'cleaned_monkeyH_PPC.pkl']

# Loop through each file and initiate a download
for file_name in file_names:
    files.download(file_name)

