build/data_availability.Rmd

---
title: "Data Availability"
output:
  html_document:
    code_folding: none
    toc: false
---

On this page you will find information on obtaining various data, data products, and processing scripts. Of course, all the R code is embeded in the workflows on the website as well.

<p id = "steel">
To start from the beginning you will need the DADA2 workflow and the raw data.
</p>

* [10.6084/m9.figshare.6875522](https://doi.org/10.6084/m9.figshare.6875522){target="_new"}: Raw data for each sample (before removing primers) plus two tables for ENA submission.
* [PRJEB28397](https://www.ebi.ac.uk/ena/browser/view/PRJEB28397){target="_new"}: Study accession number for trimmed data (primers removed)  deposited at the European Nucleotide Archive.
* [10.6084/m9.figshare.6997253](https://doi.org/10.6084/m9.figshare.6997253){target="_new"}: DADA2 workflow for processing 16S rRNA reads.

*The data in the ENA (primers removed) is the input for the DADA2 workflow but if you prefer to remove primers yourself, then please download the raw data instead.*

<div style="padding-top: 1em"></div>

<p id = "steel">
To run the phyloseq workflow only, download this file:
</p>

[10.6084/m9.figshare.7357178](https://doi.org/10.6084/m9.figshare.7357178){target="_new"}: DOI for the phyloseq workflow. Includes output from the DADA2 workflow, the phyloseq script, and other necessary input files.

*The input file for this workflow is the output from the DADA2 workflow (`combo_pipeline.rdata`). There are also some additional files included that are needed for some of the analyses.*

<div style="padding-top: 1em"></div>

<p id = "steel">
Additional data products.
</p>

* [10.6084/m9.figshare.7379930](https://doi.org/10.6084/m9.figshare.7379930){target="_new"}: Data products from the workflows including sequence tables, taxonomy tables, and ASV fasta files.
* [10.6084/m9.figshare.7379936](https://doi.org/10.6084/m9.figshare.7379936){target="_new"}: Fasta files for the 59 DA ASVs and top BLAST results, plus the alignment file (including top BLAST hits).
* [10.6084/m9.figshare.7379597](https://doi.org/10.6084/m9.figshare.7379597){target="_new"}: Supplementary file from the paper.

<div style="padding-top: 1em"></div>

<p id = "steel">
Accessing the R Code only.
</p>

The  R code is available by clicking on the code button in the menu bar or [here](raw_code.txt). Please note that this  R code is pulled from all the `.Rmd` files. This has not been tested independent of the R Markdown workflows so <span class="paper">Use at Your Own Risk</span>. In other words, the code works in the Rmarkdown format but the complete pipeline has not been tested using just this code. I used `knitr::purl()` to pull  the code from the Rmarkdown file. I did this for you just in case you wanted the code and not hear me drone on about colors, zoomable figures, or Keanu Reeves. The first part is the DADA2 workflow and the second part is the phyloseq workflow. Commands that are commented out are things I tried that I could never get to work. Any line that starts like this: `## ----` is the code chuck name and details.


<p id = "steel">
Submitting sequence data to nucleotide archives
</p>


We submitted out data to the [European Nucleotide Archive (ENA)](https://www.ebi.ac.uk/ena). The ENA does not like RAW data and prefers to have primers removed. So we submitted the trimmed Fastq files to the ENA. You can find these data under the study accession number **PRJEB28397**. The RAW files on our figshare site (see above).  

To submit to the ENA you need two data tables (plus your sequence data). One file describes the samples and the other file describes the sequencing data. 

You can download these data tables here:

[Description of sample data](ena_submit/ENA_Sample_SUBMISSION_DIGEST.txt)

[Description of sequence data](ena_submit/ENA_PAIRED_FASTQ_SUBMISSION_DIGEST.txt)

Step by step instructions for submitting to the ENA

1) go to https://www.ebi.ac.uk/ena/submit and select **Submit to ENA**.  
2) Login or Register.  
3) Go to **New Submission** tab and select **Register study (project)**. 
4) Hit Next
5) Enter details and hit Submit.
6) Next, Select Checklist. This will be specific to the type of samples you have and basically will create a template so you can add your sample metadata. For this study I chose **GSC MIxS host associated**
7) Next
8) Now go through and select/deselect fields as needed. Note, some fields are mandatory. 
9) Once finished, hit **Next** to fill in any details that applay to *All* samples. 
10) Fill in the sheet
11) Hit the **Next** button, change the number of samples, and download the sheet. (*This is a little messy and you just need to wade through it*)
12) Once everything looks good and uploaded, click Next to get to the **Run** page. 
13) Select **Two Fastq files (Paired)** and Download the template.
14) Before filling out the form, gzip **.gz** all the trimmed fastq files (these are what you submit)
15) Then run `md5` on all the `tar.gz` files. 
16) Upload all the fastq files. There are different options for this step.
17) Fill in the sheet including md5 checksum values.
18) Upload and submit the sheet.

<p class="edit-page">
  <a href="https://github.com/projectdigest/web/blob/master/build//data_availability.Rmd">
    <i class="fas fa-pen pr-2"></i> Edit this page
  </a>
</p>