# Dataset overview

Welcome! In this notebook, we will work with data from two recent studies that evaluate nucleic acid extraction kits applied to soil samples:

- [**Comparative evaluation of soil DNA extraction kits for long-read metagenomic sequencing**](https://www.microbiologyresearch.org/content/journal/acmi/10.1099/acmi.0.000868.v3)
- [**Evaluation of commercial RNA extraction kits for long-read metatranscriptomics in soil**](https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001298)

More details are available at:
https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP493328&o=acc_s%3Aa

## Dataset used

We’ll focus on a selected subset of samples covering two soil types (heathland and woodland) and two extraction types: DNA (metagenomics) and RNA (metatranscriptomics).

| Type | Soil Type | SRA Run ID    |
| ---- | --------- | ------------- |
| DNA  | Heathland | `SRR28415665` |
| DNA  | Woodland  | `SRR28415630` |
| RNA  | Heathland | `SRR28223365` |
| RNA  | Woodland  | `SRR28223359` |


## Dataset download

While we won’t download any datasets during the workshop, these are the standard commands you would use (with `sra-tools` and EDirect installed).

**Fetching project metadata**

```bash
# DNA
esearch -db sra -query PRJNA1090675 | efetch -format runinfo > dna_runinfo.csv

# RNA
esearch -db sra -query PRJNA1079547 | efetch -format runinfo > rna_runinfo.csv
```

The run metadata can then be loaded with pandas and filtered to the four runs used here:

```python
import pandas as pd

# Load the run metadata tables produced by the esearch/efetch commands
dna = pd.read_csv('../../data/metatranscriptomics/dataset_overview/dna_runinfo.csv')
rna = pd.read_csv('../../data/metatranscriptomics/dataset_overview/rna_runinfo.csv')

# The four runs used in this workshop
srr_ids = [
    "SRR28223365",  # metatranscriptomic reads: RNA Heathland
    "SRR28223359",  # metatranscriptomic reads: RNA Woodland
    "SRR28415630",  # metagenomic reads: DNA Woodland
    "SRR28415665",  # metagenomic reads: DNA Heathland
]

# Keep only the rows whose run accession is in the list above
rna_selected = rna[rna["Run"].isin(srr_ids)]
dna_selected = dna[dna["Run"].isin(srr_ids)]

# display() is available by default in Jupyter notebooks
display(dna_selected.head())
display(rna_selected.head())
```

| Unnamed: 0 | Run | ReleaseDate | LoadDate | spots | bases | spots_with_mates | avgLength | size_MB | AssemblyName | download_path | ... | Affection_Status | Analyte_Type | Histological_Type | Body_Site | CenterName | Submission | dbgap_study_accession | Consent | RunHash | ReadHash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 104 | SRR28415665 | 2024-04-04 04:24:09 | 2024-03-22 05:58:08 | 219908 | 590409480 | 0 | 2684 | 492 |  | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s... | ... |  |  |  |  | UNIVERSITY OF EXETER | SRA1831207 |  | public | 61E8A884574D9B291CA552D04DABF7AB | 12126B9BA2828ACBD4EE0F5B6C6898F7 |
| 125 | SRR28415630 | 2024-04-04 04:24:09 | 2024-03-22 05:49:41 | 156764 | 224066343 | 0 | 1429 | 199 |  | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s... | ... |  |  |  |  | UNIVERSITY OF EXETER | SRA1831207 |  | public | 2C08A86B60D2C7B5C37CB2AFFC35ED3F | E2E5F33E4BE3131A92F79A393C9D873C |

| Unnamed: 0 | Run | ReleaseDate | LoadDate | spots | bases | spots_with_mates | avgLength | size_MB | AssemblyName | download_path | ... | Affection_Status | Analyte_Type | Histological_Type | Body_Site | CenterName | Submission | dbgap_study_accession | Consent | RunHash | ReadHash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | SRR28223365 | 2024-10-09 00:30:14 | 2024-03-05 03:20:30 | 1453236 | 283197290 | 0 | 194 | 300 |  | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s... | ... |  |  |  |  | UNIVERSITY OF EXETER | SRA1812356 |  | public | 1C96ECEFADB6A5A81A3B9779418C4A3B | 6D66547223CF37787A9749F2430E2646 |
| 34 | SRR28223359 | 2024-10-09 00:30:15 | 2024-03-05 03:20:50 | 2916327 | 594503386 | 0 | 203 | 608 |  | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s... | ... |  |  |  |  | UNIVERSITY OF EXETER | SRA1812356 |  | public | 14B5C9548E9E97EE4593471CAFADC570 | 743CED5DF72AFCB3DCE7D8CF999ED03E |
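
Before moving on, it can help to confirm that every requested run was found in the metadata and to see the key numbers side by side. The sketch below is one possible way to do that, not part of the original notebook. It rebuilds a small stand-in table from the values shown above so that it runs on its own; in the notebook, that table would presumably come from `pd.concat([dna_selected, rna_selected])`. The `sample_sheet` frame and its column names `molecule` and `soil`, as well as `mean_read_length`, are hypothetical names chosen for illustration.

```python
import pandas as pd

# Stand-in for the selected runinfo rows, using only values shown above.
# In the notebook: selected = pd.concat([dna_selected, rna_selected])
selected = pd.DataFrame(
    {
        "Run": ["SRR28415665", "SRR28415630", "SRR28223365", "SRR28223359"],
        "spots": [219908, 156764, 1453236, 2916327],
        "bases": [590409480, 224066343, 283197290, 594503386],
        "avgLength": [2684, 1429, 194, 203],
    }
)

# Labels copied from the sample table; the column names are hypothetical.
sample_sheet = pd.DataFrame(
    {
        "Run": ["SRR28415665", "SRR28415630", "SRR28223365", "SRR28223359"],
        "molecule": ["DNA", "DNA", "RNA", "RNA"],
        "soil": ["Heathland", "Woodland", "Heathland", "Woodland"],
    }
)

# Any requested run that is absent from the metadata would be listed here.
missing = sorted(set(sample_sheet["Run"]) - set(selected["Run"]))

# Attach the labels and derive the mean read length from bases / spots.
summary = sample_sheet.merge(selected, on="Run", how="left")
summary["mean_read_length"] = (summary["bases"] / summary["spots"]).round(1)

print("Missing runs:", missing)
print(summary.to_string(index=False))
```

For these four runs, `bases / spots` rounded down agrees with the reported `avgLength`. The DNA runs average roughly 1.4 to 2.7 kb per read, while the RNA runs average about 200 bases.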


**Download readsets with `fasterq-dump`**

**Heathland**

Metagenomic (DNA)
```
fasterq-dump SRR28415665 --split-files --threads 4
spots read      : 219,908
reads read      : 219,908
reads written   : 219,908

```
Metatranscriptomic (RNA)
```
fasterq-dump SRR28223365 --split-files --threads 4
spots read      : 1,453,236
reads read      : 1,453,236
reads written   : 1,453,236
```

**Woodland**

Metagenomic (DNA)
```
fasterq-dump SRR28415630 --split-files --threads 4
spots read      : 156,764
reads read      : 156,764
reads written   : 156,764
```
Metatranscriptomic (RNA)
```
fasterq-dump SRR28223359 --split-files --threads 4
spots read      : 2,916,327
reads read      : 2,916,327
reads written   : 2,916,327
```
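
After a download, one simple check is to count the records in the resulting FASTQ file and compare the total with the `reads written` value in the logs above. The following is a minimal sketch of that idea, assuming standard four-line FASTQ records. The helper names `count_fastq_reads` and `check_run` are hypothetical, and the FASTQ path is passed in explicitly rather than guessed, since the output file names are not shown in this notebook. The small demo at the end writes a two-record temporary file only to show the counting step.

```python
import gzip
import tempfile
from pathlib import Path

# Read counts reported by fasterq-dump in the logs above
EXPECTED_READS = {
    "SRR28415665": 219_908,    # DNA, Heathland
    "SRR28223365": 1_453_236,  # RNA, Heathland
    "SRR28415630": 156_764,    # DNA, Woodland
    "SRR28223359": 2_916_327,  # RNA, Woodland
}


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    path = Path(path)
    # Transparently handle gzip-compressed files
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path} has {n_lines} lines, not a multiple of 4")
    return n_lines // 4


def check_run(run_id, fastq_path):
    """Return (matches_expected, observed_count) for one downloaded run."""
    observed = count_fastq_reads(fastq_path)
    return observed == EXPECTED_READS[run_id], observed


# Demo on a tiny synthetic file with two records
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.fastq"
    demo_path.write_text("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
    demo_count = count_fastq_reads(demo_path)

print("Records in demo file:", demo_count)
```

With real data, a call such as `check_run("SRR28415665", fastq_path)` would return `True` in its first element when the file contains the 219,908 reads reported in the log.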