
Question about using the C API to access a POD5 file efficiently using multiple threads #1

Closed
hasindu2008 opened this issue May 21, 2022 · 5 comments

Comments

@hasindu2008

Dear POD5 developers,

I have been trying to use the POD5 C API to write a simple example that converts raw signal data to picoamperes. I have a single POD5 file containing a large number of reads, and I want to iterate through all the reads while exploiting as many threads as possible. Learning from the Dorado code, I have written something; a code snippet is given below. I have a few questions.

  1. Is there a way to do the pod5_get_signal_row_info() call without using a C++ vector (i.e., with pure C structs)? See the comment in the code below.
  2. Depending on the memory available on the system, is there an example that shows how to fetch a batch of an arbitrary size? That is to say, my programme takes the batch size as a user input parameter, and I want to load that many reads at a time rather than the batch size hardcoded into the file.
  3. Are the Arrow libraries underneath exploiting the multiple threads available on the system? If so, how can I control the number of threads they are allowed to use based on user input? If not, how do I exploit multiple threads to efficiently load, decompress, and parse the data into in-memory arrays on the user side?
pod5_init();

Pod5FileReader_t* file = pod5_open_combined_file(argv[1]);

if (!file) {
   fprintf(stderr,"Error in opening file\n");
   perror("perr: ");
   exit(EXIT_FAILURE);
}

size_t batch_count = 0;
if (pod5_get_read_batch_count(&batch_count, file) != POD5_OK) {
	fprintf(stderr, "Failed to query batch count: %s\n", pod5_get_error_string());
}

int read_count = 0;

for (size_t batch_index = 0; batch_index < batch_count; ++batch_index) {


	Pod5ReadRecordBatch_t* batch = NULL;
	if (pod5_get_read_batch(&batch, file, batch_index) != POD5_OK) {
	   fprintf(stderr,"Failed to get batch: %s\n", pod5_get_error_string());
	}

	size_t batch_row_count = 0;
	if (pod5_get_read_batch_row_count(&batch_row_count, batch) != POD5_OK) {
		fprintf(stderr,"Failed to get batch row count\n");
	}

	rec_t *rec = (rec_t*)malloc(batch_row_count * sizeof(rec_t));

	// need to find out if this part can be multi-threaded, and if so the best way; for instance, should this be parallelised using an OpenMP for? or is it internally threaded by the Arrow library, opaque to the user?
	for (size_t row = 0; row < batch_row_count; ++row) {
		uint8_t read_id[16];
		int16_t pore = 0;
		int16_t calibration_idx = 0;
		uint32_t read_number = 0;
		uint64_t start_sample = 0;
		float median_before = 0.0f;
		int16_t end_reason = 0;
		int16_t run_info = 0;
		int64_t signal_row_count = 0;
		if (pod5_get_read_batch_row_info(batch, row, read_id, &pore, &calibration_idx,
										&read_number, &start_sample, &median_before,
										&end_reason, &run_info, &signal_row_count) != POD5_OK) {
			fprintf(stderr, "Failed to get read %zu\n", row);
		}
		read_count += 1;

		char read_id_tmp[37];
		pod5_error_t err = pod5_format_read_id(read_id, read_id_tmp);
		if (err != POD5_OK) {
			fprintf(stderr, "Failed to format read id for read %zu\n", row);
		}

		CalibrationDictData_t *calib_data = NULL;
		if (pod5_get_calibration(batch, calibration_idx, &calib_data) != POD5_OK) {
			fprintf(stderr, "Failed to get read %zu calibration_idx data: %s\n", row, pod5_get_error_string());
		}

		uint64_t *signal_rows_indices = (uint64_t*) malloc(signal_row_count * sizeof(uint64_t));

		if (pod5_get_signal_row_indices(batch, row, signal_row_count,
									   signal_rows_indices) != POD5_OK) {
			fprintf(stderr, "Failed to get read %zu signal row indices: %s\n", row, pod5_get_error_string());
		}

		// could not get this to work in C, so using C++
		//SignalRowInfo_t *signal_rows = (SignalRowInfo_t *)malloc(sizeof(SignalRowInfo_t)*signal_row_count);
		std::vector<SignalRowInfo_t *> signal_rows(signal_row_count);

		if (pod5_get_signal_row_info(file, signal_row_count, signal_rows_indices,
									signal_rows.data()) != POD5_OK) {
			fprintf(stderr, "Failed to get read %zu signal row locations: %s\n", row, pod5_get_error_string());
		}

		size_t total_sample_count = 0;
		for (size_t i = 0; i < signal_row_count; ++i) {
			total_sample_count += signal_rows[i]->stored_sample_count;
		}

		int16_t *samples = (int16_t*)malloc(sizeof(int16_t)*total_sample_count);
		size_t samples_read_so_far = 0;
		for (size_t i = 0; i < signal_row_count; ++i) {
			if (pod5_get_signal(file, signal_rows[i], signal_rows[i]->stored_sample_count,
							   samples + samples_read_so_far) != POD5_OK) {
				fprintf(stderr, "Failed to get read %zu signal: %s\n", row, pod5_get_error_string());
			}

			samples_read_so_far += signal_rows[i]->stored_sample_count;
		}

		rec[row].len_raw_signal = samples_read_so_far;
		rec[row].raw_signal = samples;
		rec[row].scale = calib_data->scale;
		rec[row].offset = calib_data->offset;
		rec[row].read_id = strdup(read_id_tmp);

		pod5_release_calibration(calib_data);
		pod5_free_signal_row_info(signal_row_count, signal_rows.data());

		free(signal_rows_indices);

	}


	//process the batch here 
	
	//print the output here
	
	if (pod5_free_read_batch(batch) != POD5_OK) {
		fprintf(stderr,"Failed to release batch\n");
	}

	for (size_t row = 0; row < batch_row_count; ++row) {
		free(rec[row].read_id);
		free(rec[row].raw_signal);
	}
	free(rec);

}

Is the above implementation the most efficient way to use POD5 on a multi core system?

@0x55555555
Collaborator

Hi @hasindu2008,

Great shouts on the above, I'll try to answer in order:

  1. This API is one on my list to improve - I am planning to add a call that returns all the signal for a read up front, with fewer calls. Any input on the call structure that would work best for you?
  2. Arrow writes in a specific batch size, so you cannot change that at read time; you can, however, read multiple batches, or half a batch, at once - if your code allows this.
  3. Currently the above code is single threaded internally - loading only one read into memory at a time. However, I have implemented a multi-threaded reader for the Python API, and again, it is on my list to move this to the general-purpose C API. Do you have input on how you would like to call this as a user?

I will endeavour to push a release shortly that at least makes getting signal in a single-threaded way easier, then push on to getting it faster using multiple threads.

Thanks,

  • George

@hasindu2008
Author

Hi George

Thank you for the quick response. (1) is not a big requirement, as I can use g++ for now. For (2), how are you envisioning the MinKNOW output - what would the default batch size be? Also, is MinKNOW going to output one large single POD5 file per sequencing run, or will it be multiple POD5 files, as is currently done with FAST5?

(3) is what I am mostly interested in. Isn't Arrow capable of internally utilising threads for decompression and/or parsing?
From a C user's perspective, high-level API calls for threading are not strictly necessary, as long as thread-safe functions and an example of how they can be used are provided. Can I simply parallelise the following loop? I tried, but it crashes - probably something is not thread-safe:

for (size_t row = 0; row < batch_row_count; ++row) {
....
}

@hasindu2008
Author

@jorj1988
In addition to the above, it would be great if you could give some insights on how this random access (https://github.com/nanoporetech/pod5-file-format/blob/master/c%2B%2B/examples/find_specific_read_ids.cpp) can be parallelised on the user side.

@hasindu2008
Author

Is this multi-threading-related crash in POD5 fixed now?

@0x55555555
Collaborator

Hello @hasindu2008,

My apologies - this issue has slipped by me.

Yes, this issue is now resolved - it is now safe to read POD5 files from as many threads as you like. For cache efficiency, I would recommend reading one batch at a time in each thread; however, this is not required by the API.

  • George
