Illumina single-end data #4

Closed
hoelzer opened this issue Jan 24, 2020 · 7 comments

@hoelzer
Member

hoelzer commented Jan 24, 2020

No description provided.

@hoelzer hoelzer added the enhancement New feature or request label Jan 24, 2020
@hoelzer hoelzer self-assigned this Jan 24, 2020
@MarieLataretu
Collaborator

I'd add an extra parameter --illumina-single-end (like --nano and --illumina) so that one can clean single- and paired-end reads in a single run.

@hoelzer
Member Author

hoelzer commented May 20, 2020 via email

hoelzer added a commit that referenced this issue May 21, 2020
@MarieLataretu
Collaborator

I just scrolled by. Is the renaming of the reads also applicable to the single-end reads?

@hoelzer
Member Author

hoelzer commented May 22, 2020

Renaming of the reads? Do you have an example?

@MarieLataretu
Collaborator

We do this, before mapping:

```
# this is working for ENA reads that have at the end of a read id '/1' or '/2'
EXAMPLE_ID=\$(zcat ${reads[0]} | head -1)
if [[ \$EXAMPLE_ID == */1 ]]; then
  if [[ ${reads[0]} =~ \\.gz\$ ]]; then
    zcat ${reads[0]} | sed 's/ /DECONTAMINATE/g' > ${name}.R1.id.fastq
    TOTALREADS_1=\$(zcat ${reads[0]} | echo \$((`wc -l`/4)))
  else
    sed 's/ /DECONTAMINATE/g' ${reads[0]} > ${name}.R1.id.fastq
    TOTALREADS_1=\$(cat ${reads[0]} | echo \$((`wc -l`/4)))
  fi
  if [[ ${reads[1]} =~ \\.gz\$ ]]; then
    zcat ${reads[1]} | sed 's/ /DECONTAMINATE/g' > ${name}.R2.id.fastq
    TOTALREADS_2=\$(zcat ${reads[1]} | echo \$((`wc -l`/4)))
  else
    sed 's/ /DECONTAMINATE/g' ${reads[1]} > ${name}.R2.id.fastq
    TOTALREADS_2=\$(cat ${reads[1]} | echo \$((`wc -l`/4)))
  fi
else
[....]
```

But I just saw that we also do this for the ONT data, so I'll implement it for the Illumina single-end data as well!

@hoelzer
Member Author

hoelzer commented May 22, 2020

Ah sorry, I got confused with the rnaseq pipeline ;)

Yeah, I introduced this renaming stuff because I experienced problems with some FASTQ headers. I think we could also have a more convenient renaming script (in Python or so) that

  • renames the reads
  • saves the mapping between the original and new ids in a tsv
  • restores ids based on the tsv

see:
https://github.com/EBI-Metagenomics/emg-viral-pipeline/blob/master/bin/rename_fasta.py

So we could have a separate rename step for any FASTQ, then the filtering happens, and then we have a restore module...

maybe that's cleaner?

But I am also happy with any other simple solution
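
For illustration only, the rename / filter / restore idea could be sketched roughly like this in Python. This is a minimal sketch, not the interface of the linked rename_fasta.py: the function names `rename_fastq` and `restore_fastq`, the `read<N>` id scheme, and the two-column TSV layout are all hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequences).

```python
import io


def rename_fastq(src, dst, mapping, prefix="read"):
    """Replace each FASTQ header with a simple id and log the original.

    Hypothetical helper: assumes four-line records. Writes one
    "new_id<TAB>original_header" line per read to `mapping` and
    returns the number of reads seen.
    """
    count = 0
    while True:
        header = src.readline()
        if not header.strip():  # end of file (or a trailing blank line)
            break
        seq, plus, qual = (src.readline() for _ in range(3))
        count += 1
        new_id = f"{prefix}{count}"
        original = header.rstrip("\n")[1:]  # drop the leading '@'
        mapping.write(f"{new_id}\t{original}\n")
        dst.write(f"@{new_id}\n{seq}{plus}{qual}")
    return count


def restore_fastq(src, dst, mapping):
    """Put the original headers back, using the TSV from rename_fastq."""
    # Split on the first tab only, in case an original header contains tabs.
    ids = dict(line.rstrip("\n").split("\t", 1) for line in mapping)
    while True:
        header = src.readline()
        if not header.strip():
            break
        seq, plus, qual = (src.readline() for _ in range(3))
        new_id = header.rstrip("\n")[1:]
        dst.write(f"@{ids[new_id]}\n{seq}{plus}{qual}")


if __name__ == "__main__":
    raw = "@SRR1.1 1 length=4/1\nACGT\n+\nIIII\n@SRR1.2 2 length=4/1\nTTGA\n+\nIIII\n"
    renamed, table, restored = io.StringIO(), io.StringIO(), io.StringIO()
    total = rename_fastq(io.StringIO(raw), renamed, table)
    # ... the filtering step would run on `renamed` here ...
    renamed.seek(0)
    table.seek(0)
    restore_fastq(renamed, restored, table)
    assert restored.getvalue() == raw
    print(f"{total} reads renamed and restored")
```

Since the restore step only looks up ids that are still present, reads removed by the filtering in between would simply be absent from the restored file. The return value of `rename_fastq` could also stand in for the separate `wc -l`/4 read count.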

@MarieLataretu
Collaborator

yeah, an extra process for renaming would definitely reduce code redundancy!

I'll go for the copy-paste solution for the moment and open a new issue
