
Question about using the C API to access a POD5 file efficiently using multiple threads #1

Closed
hasindu2008 opened this issue May 21, 2022 · 5 comments

Comments

@hasindu2008

Dear POD5 developers,

I have been trying to use the POD5 C API to write a simple example that converts raw signal data to picoamperes. I have a single POD5 file containing a large number of reads, and I want to iterate through all the reads while exploiting as many threads as possible. Learning from the Dorado code, I have written something; a code snippet is given below. I have a few questions.

  1. Is there a way to do the pod5_get_signal_row_info() call without using a C++ vector (i.e., with pure C structs)? See the comment in the code below.
  2. Depending on the memory available on the system, is there an example that shows how to fetch a batch of an arbitrary size? That is to say, my programme takes the batch size as a user input parameter, and I want to load that many reads at a time rather than the batch size hardcoded into the file.
  3. Are the Arrow libraries underneath exploiting the multiple threads available on the system? If so, how can I control the number of threads they are allowed to use based on user input? If not, how do I exploit multiple threads to efficiently load, decompress, and parse the data into in-memory arrays on the user side?
pod5_init();

Pod5FileReader_t* file = pod5_open_combined_file(argv[1]);

if (!file) {
   fprintf(stderr,"Error in opening file\n");
   perror("perr: ");
   exit(EXIT_FAILURE);
}

size_t batch_count = 0;
if (pod5_get_read_batch_count(&batch_count, file) != POD5_OK) {
	fprintf(stderr, "Failed to query batch count: %s\n", pod5_get_error_string());
}

int read_count = 0;

for (size_t batch_index = 0; batch_index < batch_count; ++batch_index) {


	Pod5ReadRecordBatch_t* batch = NULL;
	if (pod5_get_read_batch(&batch, file, batch_index) != POD5_OK) {
	   fprintf(stderr,"Failed to get batch: %s\n", pod5_get_error_string());
	}

	size_t batch_row_count = 0;
	if (pod5_get_read_batch_row_count(&batch_row_count, batch) != POD5_OK) {
		fprintf(stderr,"Failed to get batch row count\n");
	}

	rec_t *rec = (rec_t*)malloc(batch_row_count * sizeof(rec_t));

	// need to find out if this part can be multi-threaded, and if so the best way; for instance, should this be parallelised using an OpenMP for? or is it internally threaded by the Arrow library, opaque to the user?
	for (size_t row = 0; row < batch_row_count; ++row) {
		uint8_t read_id[16];
		int16_t pore = 0;
		int16_t calibration_idx = 0;
		uint32_t read_number = 0;
		uint64_t start_sample = 0;
		float median_before = 0.0f;
		int16_t end_reason = 0;
		int16_t run_info = 0;
		int64_t signal_row_count = 0;
		if (pod5_get_read_batch_row_info(batch, row, read_id, &pore, &calibration_idx,
										&read_number, &start_sample, &median_before,
										&end_reason, &run_info, &signal_row_count) != POD5_OK) {
			fprintf(stderr, "Failed to get read %zu\n", row);
		}
		read_count += 1;

		char read_id_tmp[37];
		pod5_error_t err = pod5_format_read_id(read_id, read_id_tmp);
		if (err != POD5_OK) {
			fprintf(stderr, "Failed to format read id for read %zu\n", row);
		}

		CalibrationDictData_t *calib_data = NULL;
		if (pod5_get_calibration(batch, calibration_idx, &calib_data) != POD5_OK) {
			fprintf(stderr, "Failed to get read %zu calibration_idx data: %s\n", row, pod5_get_error_string());
		}

		uint64_t *signal_rows_indices = (uint64_t*) malloc(signal_row_count * sizeof(uint64_t));

		if (pod5_get_signal_row_indices(batch, row, signal_row_count,
									   signal_rows_indices) != POD5_OK) {
			fprintf(stderr, "Failed to get read %zu signal row indices: %s\n", row, pod5_get_error_string());
		}

		// could not get this to work in C, so using C++
		//SignalRowInfo_t *signal_rows = (SignalRowInfo_t *)malloc(sizeof(SignalRowInfo_t)*signal_row_count);
		std::vector<SignalRowInfo_t *> signal_rows(signal_row_count);

		if (pod5_get_signal_row_info(file, signal_row_count, signal_rows_indices,
									signal_rows.data()) != POD5_OK) {
			fprintf(stderr, "Failed to get read %zu signal row locations: %s\n", row, pod5_get_error_string());
		}

		size_t total_sample_count = 0;
		for (size_t i = 0; i < signal_row_count; ++i) {
			total_sample_count += signal_rows[i]->stored_sample_count;
		}

		int16_t *samples = (int16_t*)malloc(sizeof(int16_t)*total_sample_count);
		size_t samples_read_so_far = 0;
		for (size_t i = 0; i < signal_row_count; ++i) {
			if (pod5_get_signal(file, signal_rows[i], signal_rows[i]->stored_sample_count,
							   samples + samples_read_so_far) != POD5_OK) {
				fprintf(stderr, "Failed to get read %zu signal: %s\n", row, pod5_get_error_string());
			}

			samples_read_so_far += signal_rows[i]->stored_sample_count;
		}

		rec[row].len_raw_signal = samples_read_so_far;
		rec[row].raw_signal = samples;
		rec[row].scale = calib_data->scale;
		rec[row].offset = calib_data->offset;
		rec[row].read_id = strdup(read_id_tmp);

		pod5_release_calibration(calib_data);
		pod5_free_signal_row_info(signal_row_count, signal_rows.data());

		free(signal_rows_indices);

	}


	//process the batch here 
	
	//print the output here
	
	if (pod5_free_read_batch(batch) != POD5_OK) {
		fprintf(stderr,"Failed to release batch\n");
	}

	for (size_t row = 0; row < batch_row_count; ++row) {
		free(rec[row].read_id);
		free(rec[row].raw_signal);
	}
	free(rec);

}

Is the above implementation the most efficient way to use POD5 on a multi core system?

@0x55555555
Collaborator

Hi @hasindu2008,

Great shouts on the above, I'll try to answer in order:

  1. This API is one on my list to improve - I am planning to add a call that returns all the signal for a read up front, with fewer calls. Any input on the call structure that would work best for you?
  2. Arrow writes in a specific batch size, so you cannot change that at read time; you can, however, read multiple batches, or half a batch, at once - if your code allows this.
  3. Currently the above code is single threaded internally - loading only one read into memory at a time. However, I have implemented a multi-threaded reader for the Python API, and again, it is on my list to move this to the general-purpose C API. Do you have input on how you would like to call this as a user?

I will endeavour to push a release shortly that at least makes getting signal in a single-threaded way easier, then push on to getting it faster using multiple threads.

Thanks,

  • George

@hasindu2008
Author

Hi George

Thank you for the quick response. (1) is not a big requirement, as I can use g++ for now. For (2), how are you envisioning the MinKNOW output - what would the default batch size be? Also, is MinKNOW going to output one large single POD5 file per sequencing run, or will it be multiple POD5 files, as is currently done with FAST5?

(3) is what I am mostly interested in. Isn't Arrow capable of internally utilising threads for decompression and/or parsing?
From a C user's perspective, high-level API calls for threading are not strictly necessary, as long as thread-safe functions and an example of how they can be used are provided. Can I simply parallelise the following loop? I tried, but it crashes - probably something is not thread-safe:

for (size_t row = 0; row < batch_row_count; ++row) {
....
}

@hasindu2008
Author

@jorj1988
In addition to the above, it would be great if you could give some insights on how this random access (https://github.com/nanoporetech/pod5-file-format/blob/master/c%2B%2B/examples/find_specific_read_ids.cpp) can be parallelised on the user side.

@hasindu2008
Author

Is this multi-threading-related crash in POD5 fixed now?

@0x55555555
Collaborator

Hello @hasindu2008,

My apologies - this issue has slipped by me.

Yes, this issue is now resolved - it is now safe to read POD5 files from as many threads as you like. For cache efficiency, I would recommend reading one batch at a time in each thread; however, this is not required by the API.

  • George
