# Investigate FAST5 Files

A short description of the multi-FAST5 format (file_version >= 2.0) is available [here](https://hasindu2008.github.io/slow5specs/fast5_demystified.pdf).

Constant attributes across all reads in a single sequencing run:
- run_id of reads
- all attributes in context_tags (all read groups link to the first read group)
- all attributes in tracking_id (all read groups link to the first read group)

Variable attributes across all reads in a single sequencing run:
- Raw: duration, end_reason, median_before, read_id, read_number, start_mux, start_time
- channel_id: channel_number, offset
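The split between constant and variable attributes can be checked programmatically. The sketch below is a minimal illustration: `split_attributes` is a hypothetical helper, and the two records are made-up values rather than data from the run. With ont_fast5_api, similar per-read dictionaries could presumably be built from calls such as `read.get_channel_info()`.

```python
def split_attributes(reads):
    """Split attribute names into (constant, variable) lists across per-read dicts."""
    constant, variable = [], []
    for key in reads[0]:
        # Collect the distinct values this attribute takes over all reads
        values = {read[key] for read in reads}
        (constant if len(values) == 1 else variable).append(key)
    return constant, variable

# Made-up example records (not taken from the sequencing run)
reads = [
    {'run_id': 'run-a', 'read_number': 1, 'channel_number': '12', 'offset': 6.0},
    {'run_id': 'run-a', 'read_number': 2, 'channel_number': '87', 'offset': -3.0},
]

constant, variable = split_attributes(reads)
print('constant:', constant)
print('variable:', variable)
```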

In [1]:
from ont_fast5_api.fast5_interface import get_fast5_file

run_id = '20220321_1207_MN24598_FAR91003_ff83ee47'
label = 'plasmid' # or 'chr'
batch_nr = 0
plasmid_file_path = f'data/prototype_original/{run_id}/{label}_fast5/batch{batch_nr}.fast5'

In [2]:
with get_fast5_file(plasmid_file_path, mode='r') as plasmid_file:
    for read in plasmid_file.get_reads():
        print(read.read_id, read.get_raw_data(scale=True))  # "scale" parameter transforms signals to pA values

00018174-9903-4176-be47-668b2774c065 [77.2538   85.31975  83.70656  ... 47.320194  0.        1.075459]
0005b791-9f04-4dc9-9a92-e2f17360d682 [118.12125  99.83844  72.41424 ...  84.06504  78.50851  91.05553]
00136645-2240-4b7a-a1f8-8efe7ba5222f [143.21529  101.630875  92.84796  ...  79.94245   66.8577    71.15954 ]
001e2294-d2af-4029-96aa-8c94105f55f6 [148.05486    115.97033    109.15909    ... 109.696815    72.77272
  -0.35848632]
0020add1-d885-4a43-a1d7-acb2c71c3b4a [111.48925  98.04601  91.7725  ...  86.21596  82.81034  88.90461]
004e4387-7f9a-45ba-9598-320c227f2cc6 [ 92.48947   89.26309   88.725365 ... 104.31952  102.885574 120.27216 ]
0057c377-b2dc-4f40-9bd1-a9336e6d2f10 [134.79086  96.97055  91.41401 ...  92.31023  91.41401 135.68707]
005a1f94-06d3-43a8-a1c9-ee40bc308ad2 [121.52686  100.9139    90.15931  ...  68.65013   69.00862   97.149796]
006e3758-c470-4932-aa98-9959d9f495a6 [96.07433  91.951744 89.44234  ... 88.725365 89.80083  94.46114 ]
0084154f-f431-4e17-ab00-be617eda82e7 [6

In [3]:
def get_pA_values(scale, raw, offset):
    """Convert raw ADC counts to picoampere (pA) values."""
    return scale * (raw + offset)

plasmid_file = get_fast5_file(plasmid_file_path, mode='r')
read_ids = plasmid_file.get_read_ids()
first_read = plasmid_file.get_read(read_ids[0])

channel_info = first_read.get_channel_info()  # attributes of the read's channel_id group
scale = channel_info['range'] / channel_info['digitisation']
raw = first_read.get_raw_data()  # unscaled int16 ADC counts
offset = channel_info['offset']  # listed above among the attributes that vary across reads

raw

array([425, 470, 461, ..., 258,  -6,   0], dtype=int16)

In [4]:
get_pA_values(scale, raw, offset)

array([77.25380294, 85.31974524, 83.70655678, ..., 47.32019484,
        0.        ,  1.07545897])
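The manual conversion can be reproduced without the FAST5 file. The sketch below reuses a few raw counts from the output of cell 3. The `offset` and `scale` values are not read from the file: they are inferred from the printed results, where a raw count of -6 maps to 0 pA and a raw count of 0 maps to 1.07545897 pA. Treat them as assumptions.

```python
import numpy as np

def get_pA_values(scale, raw, offset):
    """Convert raw ADC counts to picoampere (pA) values."""
    return scale * (raw + offset)

# A few raw counts from the first read, as printed in cell 3
raw = np.array([425, 470, 461, 258, -6, 0], dtype=np.int16)

# Assumptions inferred from the printed output, not read from the file
offset = 6.0                 # raw == -6 maps to 0 pA
scale = 1.07545897 / offset  # raw == 0 maps to 1.07545897 pA

pa = get_pA_values(scale, raw, offset)
print(np.round(pa, 4))
```

Up to rounding, the results should match the values returned by `get_raw_data(scale=True)` for the same read.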

In [5]:
len(raw)

61817
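The length of `raw` is the number of signal samples in the read. Dividing it by the sampling rate gives the read duration in seconds. The sketch below assumes a sampling rate of 4000 Hz, which is not stated in this notebook. The actual value is stored as `sampling_rate` in the channel_id attributes.

```python
def samples_to_seconds(n_samples, sampling_rate):
    """Convert a sample count to a duration in seconds."""
    return n_samples / sampling_rate

n_samples = 61817       # len(raw) for the first read
sampling_rate = 4000.0  # assumed value in Hz; read it from channel_id in practice

duration_s = samples_to_seconds(n_samples, sampling_rate)
print(f'{duration_s:.3f} s')
```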