Snakemake pipeline for de novo transcriptome assembly with 454 reads

smsk: A Snakemake skeleton to jumpstart projects

1. Description

This workflow assembles 454 RNA reads into a transcriptome de novo. The procedure is as follows:

  1. Quality control

    1. Base calling with PyroBayes (if .sff files are provided)
    2. Quality, length and adaptor trimming with SnoWhite
  2. Assembly

    With gsAssembler from Roche's Data Analysis tools (newbler)
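
The branching in step 1 (base calling only when .sff files are provided) can be sketched as a small shell helper. The function name `first_step_for` and the step labels are hypothetical, chosen for illustration; they are not names used by the pipeline itself. The sketch also assumes that FASTQ inputs end in .fastq or .fq.

```shell
# Hypothetical helper: decide where a raw input file enters the workflow.
first_step_for() {
    case "$1" in
        *.sff)        echo "base_calling" ;;          # raw 454 flowgrams: call bases first
        *.fastq|*.fq) echo "trimming" ;;              # already base-called: go to trimming
        *)            echo "unsupported"; return 1 ;; # anything else is rejected
    esac
}

first_step_for data/sample.sff     # prints: base_calling
first_step_for data/sample.fastq   # prints: trimming
```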

2. First steps

  1. Clone the repo

    git clone https://github.com/jlanga/smsk_454.git # Clone
    cd smsk_454
  2. Activate the environment (run deactivate to leave it):

    source bin/activate
  3. Install software and packages via pip and Homebrew (edit the scripts as necessary):

    bash bin/install/brew.sh
    bash bin/install/from_brew.sh
    bash bin/install/from_pip3.sh
    bash bin/install/from_tarball.sh
  4. Additional requirements

    PyroBayes is an accurate base caller for 454 datasets. It used to be available through free registration, as a file called pyrobayes.unified_release_64bit.tar.gz. If you are able to get it, do the following:

    mkdir -p src/
    pushd src/
    cp /path/to/pyrobayes.unified_release_64bit.tar.gz .
    tar xvf pyrobayes.unified_release_64bit.tar.gz
    popd
    cp src/UnifiedRelease/bin/PyroBayes bin/

    The same applies to gsAssembler: ask Roche for a copy (it is free but requires registration). You should get a file called DataAnalysis_2.9_All_20130530_1559.tgz. The installer opens a graphical window, so if you are connecting through ssh, use the -X option to allow graphic interfaces (ssh -X server).

    From here,

    mkdir -p src/
    pushd src/
    cp /path/to/DataAnalysis_2.9_All_20130530_1559.tgz .
    tar xvf DataAnalysis_2.9_All_20130530_1559.tgz
    pushd DataAnalysis_2.9_All/
    bash setup.sh

    A window will pop up. Select "local installation" and set the installation path to the src/ directory of this project (in my case, /home/jlanga/pipelines/smsk_454/src/454/).

  5. Download sample data from the European Nucleotide Archive (ENA; two sff files and two fastq files):

    bash bin/download_test_data.sh
  6. Execute the pipeline (should take up to 10 minutes):

    snakemake -j 24
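
Before launching the full run, it can help to confirm that the manually installed tools are in place and to preview the jobs. The following is a minimal sketch, assuming PyroBayes was copied to bin/PyroBayes as in step 4 and that the commands run from the repository root. The helper names `require_executable` and `run_pipeline` are hypothetical. `snakemake -n` performs a dry run and `-p` prints the shell commands; the job count here comes from `nproc` rather than the fixed 24 used above.

```shell
# Hypothetical helper: fail with a message if a file is missing or not executable.
require_executable() {
    if [ ! -x "$1" ]; then
        echo "missing or not executable: $1" >&2
        return 1
    fi
}

# Hypothetical wrapper: check the tools, preview the jobs, then run them.
run_pipeline() {
    require_executable bin/PyroBayes || return 1
    command -v snakemake >/dev/null || { echo "snakemake not on PATH" >&2; return 1; }
    jobs_n="$(nproc 2>/dev/null || echo 1)"     # fall back to one job if nproc is unavailable
    snakemake -n -p -j "$jobs_n" || return 1    # dry run: list jobs and their commands
    snakemake -j "$jobs_n"                      # actual run
}
```

Call `run_pipeline` after activating the environment. If only fastq files are used, the PyroBayes check can be dropped, since base calling applies to .sff input only.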

3. File organization

The folder hierarchy follows the one described in A Quick Guide to Organizing Computational Biology Projects:

smsk
├── .linuxbrew: brew files
├── bin: scripts, binaries and snakemake-related files.
├── data: raw data, hopefully links to backed-up data.
├── doc: logs.
├── README.md
├── results: processed data, reports, etc.
└── src: additional source code, tarballs, etc.
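
To start a new project with the same layout, the top-level folders can be created in one step. This is a minimal sketch; the function name `make_skeleton` is hypothetical, and the .linuxbrew directory is omitted on the assumption that the Homebrew install script creates it.

```shell
# Hypothetical helper: create the folder layout shown above under a project root.
make_skeleton() {
    root="${1:-.}"                       # default to the current directory
    for dir in bin data doc results src; do
        mkdir -p "$root/$dir" || return 1
    done
    touch "$root/README.md"              # create an empty README if none exists
}

# Example: make_skeleton my_project
```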
