# Creating a dataset from a session of the IBL Brain Wide Map Dataset

You will need to install the following packages:
- numpy
- pandas
- spikeinterface
- ONE-api
- ibllib
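
Before running the notebook, it can help to confirm that these packages are present in the active environment. The following is a minimal sketch using only the standard library; `REQUIRED` and `missing_packages` are hypothetical names introduced here, and the entries are the distribution (pip) names listed above rather than import names.

```python
from importlib import metadata

# Distribution (pip) names as listed above; these are not import names.
REQUIRED = ["numpy", "pandas", "spikeinterface", "ONE-api", "ibllib"]


def missing_packages(names):
    """Return the names for which no installed distribution is found."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Any names reported as missing can then be installed with pip before continuing.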

---

In [1]:
import os
import sys

import numpy as np
import pandas as pd

from one.api import ONE

Install the latest version of SpikeInterface by following the **From source** section of the [installation guide](https://spikeinterface.readthedocs.io/en/latest/get_started/installation.html).

In [2]:
import spikeinterface as si
import spikeinterface.extractors as se
import spikeinterface.preprocessing as spre
from spikeinterface.sortingcomponents.peak_detection import detect_peaks

print(f"SpikeInterface version: {si.__version__}")

SpikeInterface version: 0.102.1


Import the local helper modules. The parent directory is added to `sys.path` so that they can be found.

In [3]:
import process_peaks

sys.path.append("..")
import preprocessing
import util

## 1. Read recording session

For this project, we use session [sub-CSHL049](https://dandiarchive.org/dandiset/000409/draft/files?location=sub-CSHL049&page=1) of the [IBL Brain Wide Map Dataset](https://dandiarchive.org/dandiset/000409/draft).

In [4]:
data_folder = "../data/sub-CSHL049"

os.makedirs(data_folder, exist_ok=True)

To obtain this data, we stream it through the ONE API using the session identifier (`eid`), which is listed in the [metadata](https://api.dandiarchive.org/api/dandisets/000409/versions/draft/assets/7e4fa468-349c-44a9-a482-26898682eed1/).

In [5]:
one_folder = os.path.join(data_folder, "one")

os.makedirs(one_folder, exist_ok=True)

In [6]:
one = ONE(
    base_url="https://openalyx.internationalbrainlab.org",
    username="intbrainlab",
    password="international",
    silent=True
)

eid = "c99d53e6-c317-4c53-99ba-070b26673ac4"
pids, _ = one.eid2pid(eid)
pid = pids[0]

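Using SpikeInterface, we read the recording, preprocess it (a band-pass filter followed by a global common median reference), and save the result to disk. If a preprocessed copy already exists on disk, it is loaded instead.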

In [7]:
extractors_folder = os.path.join(data_folder, "extractors")

os.makedirs(extractors_folder, exist_ok=True)

In [8]:
preprocessed_folder = os.path.join(extractors_folder, "preprocessed")

if not os.path.exists(preprocessed_folder): 
    recording = se.read_ibl_recording(eid=eid, stream_name='probe00.ap', cache_folder=one_folder)
    
    # Preprocess the recording
    recording_f = spre.bandpass_filter(recording, freq_min=300, freq_max=6000)
    recording_cmr = spre.common_reference(recording_f, reference='global', operator='median')
    
    # Save the preprocessed recording to disk
    job_kwargs = dict(n_jobs=10, chunk_duration="1s", progress_bar=True)
    recording_cmr.save(folder=preprocessed_folder, **job_kwargs)
else:
    recording_cmr = si.load_extractor(preprocessed_folder)
    
recording_cmr

Downloading: /global/homes/r/rly/Downloads/ONE/openalyx.internationalbrainlab.org/histology/ATLAS/Needles/Allen/average_template_25.nrrd Bytes: 32998960


100%|██████████| 31.470260620117188/31.470260620117188 [00:00<00:00, 210.30it/s]


Downloading: /global/homes/r/rly/Downloads/ONE/openalyx.internationalbrainlab.org/histology/ATLAS/Needles/Allen/annotation_25.nrrd Bytes: 4035363


100%|██████████| 3.848422050476074/3.848422050476074 [00:00<00:00, 146.83it/s]
(S3) ../data/sub-CSHL049/one/churchlandlab/Subjects/CSHL049/2020-01-09/001/raw_ephys_data/probe00/_spikeglx_ephysData_g0_t0.imec.ap.ch: 100%|██████████| 129k/129k [00:00<00:00, 338kB/s]
(S3) ../data/sub-CSHL049/one/churchlandlab/Subjects/CSHL049/2020-01-09/001/raw_ephys_data/probe00/_spikeglx_ephysData_g0_t0.imec.ap.meta: 100%|██████████| 16.8k/16.8k [00:00<00:00, 77.3kB/s]

write_binary_recording 
engine=process - n_jobs=10 - samples_per_chunk=30,000 - chunk_memory=21.97 MiB - total_memory=219.73 MiB - chunk_duration=1.00s



write_binary_recording (workers: 10 processes): 100%|██████████| 4173/4173 [12:59<00:00,  5.35it/s]
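
To make the referencing step in the cell above more concrete: with `reference='global'` and `operator='median'`, the median across all channels is subtracted from every channel at each time sample, which removes noise shared by the whole probe. The NumPy sketch below illustrates the idea on synthetic data. It is not SpikeInterface's implementation, and `global_median_reference` and the synthetic arrays are hypothetical names introduced for illustration.

```python
import numpy as np


def global_median_reference(traces):
    """Subtract the per-sample median across channels.

    traces: array of shape (num_samples, num_channels).
    """
    traces = np.asarray(traces, dtype=float)
    return traces - np.median(traces, axis=1, keepdims=True)


# Synthetic example: one noise source shared by all channels plus
# a small independent component on each channel.
rng = np.random.default_rng(0)
shared_noise = rng.normal(size=(1000, 1))
channel_noise = rng.normal(scale=0.1, size=(1000, 8))
raw = shared_noise + channel_noise

referenced = global_median_reference(raw)

# The shared component dominates the raw traces and is largely removed.
print("raw std:       ", raw.std())
print("referenced std:", referenced.std())
```

The median is used rather than the mean so that a large spike on a few channels has little influence on the reference that is subtracted from the others.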


---

## 2. Create peaks dataset

### Retrieve channel locations

In [9]:
channel_locations_file = os.path.join(data_folder, "channel_locations.npy")

if not os.path.exists(channel_locations_file):
    channel_locations = preprocessing.extract_channels(recording_cmr)
    np.save(channel_locations_file, channel_locations)
else:
    channel_locations = np.load(channel_locations_file)

display(pd.DataFrame(channel_locations))

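| | channel_index | channel_location_x | channel_location_y |
|---|---|---|---|
| 0 | 0 | 16 | 0 |
| 1 | 1 | 48 | 0 |
| 2 | 2 | 0 | 20 |
| 3 | 3 | 32 | 20 |
| 4 | 4 | 16 | 40 |
| ... | ... | ... | ... |
| 379 | 379 | 32 | 3780 |
| 380 | 380 | 16 | 3800 |
| 381 | 381 | 48 | 3800 |
| 382 | 382 | 0 | 3820 |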


### Detect peaks

In [10]:
peaks_folder = os.path.join(data_folder, "peaks")
peaks_file = os.path.join(peaks_folder, "peaks.npy")

if os.path.exists(peaks_file):
    peaks_filtered = np.load(peaks_file)
else:
    os.makedirs(peaks_folder, exist_ok=True)
    
    job_kwargs = dict(chunk_duration='1s', n_jobs=10, progress_bar=True)
    
    peaks = detect_peaks(
        recording_cmr,
        method='locally_exclusive',
        peak_sign='neg',
        detect_threshold=6,
        radius_um=100,
        **job_kwargs
    )    
    
    peaks_filtered = process_peaks.filter_peaks(recording_cmr, peaks)
    
    np.save(peaks_file, peaks_filtered)
    
display(pd.DataFrame(peaks_filtered))

noise_level (no parallelization): 100%|██████████| 20/20 [00:17<00:00,  1.16it/s]
detect peaks using locally_exclusive (workers: 10 processes): 100%|██████████| 4173/4173 [08:20<00:00,  8.34it/s]


| | sample_index | channel_index | amplitude |
|---|---|---|---|
| 0 | 56 | 132 | -26.0 |
| 1 | 147 | 348 | -40.0 |
| 2 | 177 | 337 | -67.0 |
| 3 | 269 | 330 | -34.0 |
| 4 | 314 | 330 | -59.0 |
| ... | ... | ... | ... |
| 3177917 | 125189311 | 222 | -36.0 |
| 3177918 | 125189392 | 273 | -24.0 |
| 3177919 | 125189402 | 89 | -37.0 |
| 3177920 | 125189402 | 269 | -21.0 |
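
As a small illustration of how the three peak properties can be used, the sketch below takes the first five rows of the table above, converts sample indices to seconds, and counts peaks per channel. It assumes a sampling frequency of 30 kHz, inferred from the log above (`samples_per_chunk=30,000` for `chunk_duration=1.00s`), and it assumes the peaks are stored as a structured array with these three fields. The names `example_peaks` and `SAMPLING_FREQUENCY_HZ` are hypothetical.

```python
import numpy as np

# Assumption: 30 kHz, inferred from the write_binary_recording log.
# In the notebook, recording_cmr.get_sampling_frequency() gives the exact value.
SAMPLING_FREQUENCY_HZ = 30_000.0

# First five rows of the table above, stored as a structured array.
peaks_dtype = [
    ("sample_index", "int64"),
    ("channel_index", "int64"),
    ("amplitude", "float64"),
]
example_peaks = np.array(
    [
        (56, 132, -26.0),
        (147, 348, -40.0),
        (177, 337, -67.0),
        (269, 330, -34.0),
        (314, 330, -59.0),
    ],
    dtype=peaks_dtype,
)

# Convert sample indices to times in seconds.
peak_times_s = example_peaks["sample_index"] / SAMPLING_FREQUENCY_HZ

# Count how many peaks were detected on each channel.
channels, counts = np.unique(example_peaks["channel_index"], return_counts=True)
peaks_per_channel = dict(zip(channels.tolist(), counts.tolist()))

# Peaks are negative-going, so the largest peak has the minimum amplitude.
largest = example_peaks[np.argmin(example_peaks["amplitude"])]

print(peak_times_s)
print(peaks_per_channel)
print(largest)
```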


---

## 3. Create dataset

First, run `create_peaks_files.sh` to create HDF5 files that each contain part of the peaks data.
You will need to specify the ID of the recording (e.g., 'sub-CSHL049'). The number of HDF5 files created depends on the number of tasks you set in the job script.

Next, run `combine_peaks_files.py` to combine the HDF5 files.
You will again need to specify the ID of the recording (e.g., 'sub-CSHL049'). This creates a single HDF5 file containing the complete peaks dataset.

The combined file contains three datasets, where `n` is the number of peaks:
- `channel_locations` [384, 3]: The channel indices and their corresponding locations on the probe.
- `properties` [n, 3]: The properties of each peak: sample_index, channel_index, and amplitude.
- `traces` [n, 64, 192, 2]: The traces for each peak.
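
A file with this layout could be inspected with `h5py`, which is not in the package list above and is assumed here. The sketch below writes a tiny stand-in file with the three dataset names and shapes described above, then reads it back. The file name `mock_peaks.h5` is hypothetical, as is the assumption that the datasets sit at the root of the file; the real name and layout are determined by `combine_peaks_files.py`.

```python
import os
import tempfile

import h5py
import numpy as np

n_peaks = 4  # tiny stand-in for the real number of peaks

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name used only for this illustration.
    mock_path = os.path.join(tmp_dir, "mock_peaks.h5")

    # Write a small file with the three documented datasets (all zeros).
    with h5py.File(mock_path, "w") as f:
        f.create_dataset("channel_locations", data=np.zeros((384, 3)))
        f.create_dataset("properties", data=np.zeros((n_peaks, 3)))
        f.create_dataset(
            "traces", data=np.zeros((n_peaks, 64, 192, 2), dtype="float32")
        )

    # Read it back. Shapes are available without loading the data,
    # and slicing a dataset loads only the requested peaks.
    with h5py.File(mock_path, "r") as f:
        shapes = {name: f[name].shape for name in f}
        first_trace = f["traces"][0]
        first_properties = f["properties"][0]

print(shapes)
print(first_trace.shape, first_properties.shape)
```

Because slicing reads only the requested rows from disk, the `traces` dataset can be explored peak by peak without loading the full array into memory.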