Covid19 #140

jrober84 · 2021-03-03T17:26:41Z

This update makes minor modifications to BioHansel to improve compatibility with amplicon-based SARS-CoV-2 data. It introduces the min-kmer-frac parameter, which works in combination with the min-kmer-cov parameter to ignore k-mers in a sample whose share of the coverage at a position falls below a set fraction (default 0.05). Additionally, max-kmer-cov has been increased so that high-coverage regions are no longer ignored. The error message field has also been removed from the k-mer report, since it adds a lot of repeated data that adds up for larger sample pools.

github-actions · 2021-03-03T17:26:58Z

Hi @jrober84,
It looks like this pull request has been made against the phac-nml/biohansel master branch.
The master branch on repositories should always contain code from the latest release.
Because of this, PRs to master are only allowed if they come from the phac-nml/biohansel development branch.
You do not need to close this PR, you can change the target branch to development by clicking the "Edit" button at the top of this page.
Thanks again for your contribution!

peterk87 · 2021-03-03T20:25:58Z

bio_hansel/subtyper.py

@@ -228,6 +228,24 @@ def parallel_query_reads(reads: List[Tuple[List[str], str]],
    outputs = [x.get() for x in res]
    return outputs

+def filter_by_kmer_fraction(df,min_kmer_frac=0.05):


Hi @jrober84 I'm having trouble understanding what this function is supposed to do.

Is kmer fraction supposed to be like the alternate allele fraction (AF) from variant calling?

What is a noisy kmer? Any kmer that is observed at a low frequency relative to the sum of frequencies of all kmers observed at that position?

kmer_freq / sum(kmer_freqs_at_position) < min_kmer_frac

Hey @peterk87, I have updated the description in the code to be a bit clearer. Essentially, the function determines the total frequency of the positive and negative k-mers covering a position and then determines the fraction of that total contributed by each k-mer at the position. When only a single k-mer is present, its fraction is always one, so it is never filtered. When both positive and negative k-mers are present, the function checks whether each k-mer's contribution to the total is above the set threshold. If it isn't, that specific k-mer is filtered from the data frame, so the QA/QC module never sees it. I would say this process has some similarity to the AF for variant calling, but it also lets the user configure a "contamination" level that is acceptable for their application.
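The calculation described above can be sketched with pandas. This is a minimal illustration, not the code from the PR: the refposition, freq, kmer_fraction and total_refposition_kmer_frequency column names come from this thread, while the add_kmer_fraction function name, the kmer label column and the sample values are assumptions made up for the example.

```python
import pandas as pd


def add_kmer_fraction(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with per-position totals and each k-mer's share of them.

    Hypothetical helper for illustration; assumes every freq value is > 0.
    """
    out = df.copy()
    # Sum the frequencies of all k-mers (positive and negative) seen at each position.
    out["total_refposition_kmer_frequency"] = (
        out.groupby("refposition")["freq"].transform("sum")
    )
    # Share of the position's total coverage contributed by each k-mer.
    out["kmer_fraction"] = out["freq"] / out["total_refposition_kmer_frequency"]
    return out


# Made-up data: position 100 has two competing k-mers, position 200 has one.
kmers = pd.DataFrame(
    {
        "refposition": [100, 100, 200],
        "kmer": ["positive", "negative", "positive"],  # illustrative label only
        "freq": [97, 3, 50],
    }
)
result = add_kmer_fraction(kmers)
```

With a threshold of 0.05, the k-mer seen 3 times out of 100 at position 100 falls below the cutoff, while the lone k-mer at position 200 has a fraction of one and can never be excluded.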

Hi @jrober84

I'm not sure this code is doing what your description says it's doing. I don't see any accessing of kmer frequency values. If I'm interpreting your description above and in the docstring correctly, you need to be calculating the sum of kmer frequencies at each position. With .value_counts() on refposition you're simply getting a count of how many times you observe each refposition value (similar to Python's Counter). It also doesn't make sense that the refposition count is being divided by the refposition value.
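The difference between counting positions and summing frequencies can be seen on a small, made-up data frame. This is only a sketch: the refposition and freq column names are taken from this thread, and the values are invented.

```python
import pandas as pd

# Made-up data: two k-mers matched at position 100, one at position 200.
kmers = pd.DataFrame({"refposition": [100, 100, 200], "freq": [97, 3, 50]})

# value_counts() counts rows per position, i.e. how many k-mers matched there.
rows_per_position = kmers["refposition"].value_counts().to_dict()

# groupby(...).sum() adds up the freq column per position, i.e. total k-mer coverage.
freq_per_position = kmers.groupby("refposition")["freq"].sum().to_dict()
```

Here rows_per_position maps position 100 to 2 (two rows), whereas freq_per_position maps it to 100 (97 + 3). Only the second is a usable denominator for a k-mer fraction.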

Also, when possible, I think it's a better idea to include all results for the detailed results report for debugging and troubleshooting rather than filtering those results out. I would instead just modify this line:

biohansel/bio_hansel/subtyper.py

Line 289 in d20a00b

st, df = process_subtyping_results(st, df[df.is_kmer_freq_okay], scheme_subtype_counts)

Where instead of just getting the subset of results where is_kmer_freq_okay, the results could also be required to pass the kmer fraction threshold:

df[(df.is_kmer_freq_okay & (df.kmer_fraction >= subtyping_params.min_kmer_fraction))]

This way no results are removed from the detailed report, and only the "good" kmers are used to determine the subtype result.

So I would propose adding columns like kmer_fraction and maybe total_refposition_kmer_frequency and does_pass_kmer_fraction_threshold. This would be useful information for potential contamination detection.

That's a good suggestion. I have updated the code so that it no longer filters the df for failed k-mers. This way all detected k-mers appear in the detailed report, but QA/QC is run only on the k-mers that pass. Passing k-mers must pass both the k-mer freq filter and the k-mer frac filter. I have added the additional fields that you suggested for troubleshooting purposes.
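The "keep everything in the report, subtype only on passing k-mers" approach can be sketched as below. This is an assumption-laden illustration rather than the merged code: the column names follow the ones proposed in this thread, MIN_KMER_FRACTION stands in for subtyping_params.min_kmer_fraction, and the data is invented.

```python
import pandas as pd

MIN_KMER_FRACTION = 0.05  # stand-in for subtyping_params.min_kmer_fraction

# Made-up detailed report with the fraction already calculated.
report = pd.DataFrame(
    {
        "refposition": [100, 100, 200],
        "freq": [97, 3, 50],
        "is_kmer_freq_okay": [True, True, True],
        "kmer_fraction": [0.97, 0.03, 1.0],
    }
)

# Record the outcome as a column instead of dropping rows from the report.
report["does_pass_kmer_fraction_threshold"] = (
    report["kmer_fraction"] >= MIN_KMER_FRACTION
)

# Only k-mers that pass BOTH checks are handed on for subtype calling.
passing = report[
    report["is_kmer_freq_okay"] & report["does_pass_kmer_fraction_threshold"]
]
```

The report keeps all three rows, including the low-abundance k-mer that may point to contamination, while passing holds only the two rows used to determine the subtype.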

peterk87 · 2021-03-04T17:29:36Z

bio_hansel/subtyper.py

+    Returns:
+        - pd.DataFrame with k-mers with kmer_fraction column
+    """
+    position_frequencies = df[['refposition','freq']].groupby(['refposition']).sum().reset_index()


It might be easier to use a dict:

position_frequencies = df.groupby('refposition')['freq'].sum().to_dict()

The dict should have refposition keys and summed frequency values, so getting total_freq would be easier and clearer:

total_freq = position_frequencies[row.refposition]
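One detail matters when building such a lookup dict: calling to_dict() on a summed DataFrame nests the result under the column name, whereas selecting the freq column first yields a flat mapping from refposition to total. A small sketch on invented data, using only the column names from this thread:

```python
import pandas as pd

# Made-up data: two k-mers at position 100, one at position 200.
kmers = pd.DataFrame({"refposition": [100, 100, 200], "freq": [97, 3, 50]})

# Summing a DataFrame and converting it nests by column name:
# {"freq": {100: 100, 200: 50}}
nested = kmers[["refposition", "freq"]].groupby(["refposition"]).sum().to_dict()

# Selecting the column first gives a flat {refposition: total} mapping.
position_frequencies = kmers.groupby("refposition")["freq"].sum().to_dict()

# Direct lookup by position, as in the review comment.
total_freq = position_frequencies[100]
```

With the flat mapping, position_frequencies[row.refposition] works directly. With the nested form the lookup would have to go through nested["freq"] first.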

peterk87 · 2021-03-04T17:39:55Z

bio_hansel/subtyper.py

+    position_frequencies = df[['refposition','freq']].groupby(['refposition']).sum().reset_index()
+    percentages = []
+    total_refposition_kmer_frequencies = []
+    for index,row in df.iterrows():


I'd recommend using .itertuples() for performance reasons:

for row in df.itertuples():
    refposition = row.refposition  # cannot do string based access (row['refposition']) with tuples

You could also look into using apply instead of using a for-loop:

def get_kmer_fraction(row):
    total_freq = position_frequencies.get(row.refposition, 0)
    return row.freq / total_freq if total_freq > 0 else 0.0

df['kmer_fraction'] = df.apply(get_kmer_fraction, axis=1)
df['total_refposition_kmer_frequency'] = df.apply(lambda row: position_frequencies.get(row.refposition, 0), axis=1)
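A further option, not proposed in the review itself, is to skip row-wise iteration altogether and map each refposition to its total with Series.map. The sketch below compares that with an itertuples loop on invented data. It assumes position_frequencies is a flat dict of refposition to summed freq and that every position in df has an entry in it.

```python
import pandas as pd

# Made-up data: two k-mers at position 100, one at position 200.
kmers = pd.DataFrame({"refposition": [100, 100, 200], "freq": [97, 3, 50]})
position_frequencies = kmers.groupby("refposition")["freq"].sum().to_dict()

# Row-wise loop with itertuples: fields are read as attributes, not by string key.
loop_fractions = [
    row.freq / position_frequencies[row.refposition] for row in kmers.itertuples()
]

# Column-wise alternative: look up every position's total at once, then divide.
totals = kmers["refposition"].map(position_frequencies)
kmers["total_refposition_kmer_frequency"] = totals
kmers["kmer_fraction"] = kmers["freq"] / totals
```

Both routes produce the same fractions. The column-wise version avoids calling a Python function per row, which tends to matter for large batch reports.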

peterk87

Hi @jrober84

Thanks for making the suggested changes and addressing my comments and questions.

The results I'm getting from Nanopore and Illumina data make sense with the low abundance kmer matches excluded from subtype calling yet still present in the detailed report. So everything looks good on my end and ready to merge into dev.

jrober84 added 5 commits March 2, 2021 15:00

increased max kmer freq to 10000000 for high coverage amplicon datasets

a64e1e0

increased max kmer freq to 10000000 for high coverage amplicon datasets

6c5b566

removed qc_message column from k-mer results because it is highly red…

0f45df1

…undant and adds large amounts of data to batch samples

removed qc_message column from k-mer results because it is highly red…

34d8e99

…undant and adds large amounts of data to batch samples

added minimum k-mer fraction as additional filtering option

37afbea

jrober84 requested a review from peterk87 March 3, 2021 17:26

peterk87 changed the base branch from master to development March 3, 2021 18:32

peterk87 reviewed Mar 3, 2021

jrober84 added 4 commits March 4, 2021 09:50

improved documentation of the new kmer filtering function

cdc7004

fixed issue with using count of position instead of frequency

cb9f22e

enhanced detailed report with additional information and df no longer…

a4d5572

… being filtered for failed kmers

updated tests with new fields for read kmer reports

8a1c62c

peterk87 reviewed Mar 4, 2021

simplified calc kmer fraction function

15aa720

peterk87 approved these changes Mar 4, 2021

peterk87 merged commit e3c2c82 into development Mar 4, 2021

jrober84 deleted the covid19 branch March 4, 2021 19:17

peterk87 mentioned this pull request Mar 5, 2021

Release v2.6.0 #144

Merged
