# SRA_download API test

Downloading SRA data can be tedious, especially when you only want the samples that match certain metadata criteria (treatment, sample group, patient sex, age...).

For that purpose, we have created a library containing some useful functions that act as wrappers for other packages such as GEOparse, pysradb, etc.

From a single GEO accession, we can easily retrieve the list of samples and their metadata, and then download the fastq files for each sample.

The user can also optionally use their own filtering script in order to download only the samples that meet their requirements based on GEO metadata information. 

It would be nice to automate this process, but because the metadata fields in SOFT files don't follow any standard, users need to explore their dataset and write their own filtering scripts.

We provide some useful functions for this purpose in the [SRA_download_lib.py](SRA_download_lib.py) library, but the rest will be up to what your dataset requires.
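
One such helper, used later in this notebook, is `list_to_dict`, which turns a list of `"key: value"` strings into a dictionary. The snippet below is only a minimal sketch of how that kind of conversion can be done. The function `characteristics_to_dict` is a hypothetical name introduced here for illustration, and it is not the library's actual implementation.

```python
def characteristics_to_dict(entries, sep=": "):
    """Convert ["key: value", ...] into {"key": "value", ...}.

    Hypothetical helper for illustration; not the library's list_to_dict.
    """
    result = {}
    for entry in entries:
        # partition splits on the first separator only, so values that
        # themselves contain the separator are kept whole.
        key, found, value = entry.partition(sep)
        if not found:
            # Entry without a separator: skip it rather than guess a key.
            continue
        result[key.strip()] = value.strip()
    return result


# Made-up metadata entries, for illustration only.
example = ["Sex: Female", "disease status: control", "age: 54"]
parsed = characteristics_to_dict(example)
print(parsed)
```

Splitting on the first separator only is a design choice: free-text values sometimes contain `": "` themselves, and a plain `split` would break them apart.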



In [None]:
from SRA_download_lib import *

In [None]:
# Download metadata from a GEO dataset by parsing its SOFT file using GEOparse.
gse = get_GEO_info("GSE140069")

In [None]:
# Explore platform (GPL) data.
show_GPL_info(gse)

In [None]:
# Explore sample (GSM) data.
show_GSM_info(gse)

To download only the desired fastq files, specific filters need to be scripted. It is also possible to use the SRA Run Selector: https://trace.ncbi.nlm.nih.gov/Traces/study/
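
As a rough illustration of what such a filter can look like, the sketch below works on plain dictionaries rather than on GEOparse objects. The function `select_samples`, the sample names and the metadata values are all hypothetical and only serve the example. It uses `dict.get` so that a sample with a missing field is skipped instead of raising a `KeyError`.

```python
def select_samples(samples, required):
    """Return sample names whose characteristics match every required pair.

    `samples` maps a sample name to a {key: value} dict; `required` maps
    a key to the wanted value. Keys and values are compared case-insensitively.
    """
    selected = []
    for name, traits in samples.items():
        normalized = {k.lower(): str(v).lower() for k, v in traits.items()}
        # A missing key yields None from .get(), which never equals a wanted value.
        if all(normalized.get(k.lower()) == v.lower() for k, v in required.items()):
            selected.append(name)
    return selected


# Made-up sample names and values, for illustration only.
samples = {
    "sample_A": {"Sex": "Female", "disease status": "Control"},
    "sample_B": {"Sex": "Male", "disease status": "Control"},
    "sample_C": {"Sex": "Female"},  # "disease status" is missing
}
wanted = select_samples(samples, {"sex": "female", "disease status": "control"})
print(wanted)
```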

In [None]:

##### FILTERING STEP #####

# This is the part where users have to create their own filter scripts.
# Because in many cases metadata slots are stored as "dict-like lists" (["key_1: value_1", ..., "key_N: value_N"]),
# we have created useful functions such as list_to_dict that automatically convert the list into a dictionary.

split = ": " # Set the split string of the "dict-like list"
for sample_name in gse.gsms.keys():
    # Avoid naming the variable "dict", which would shadow the Python built-in.
    characteristics = list_to_dict(gse.gsms[sample_name].metadata["characteristics_ch1"], split)
    gse.gsms[sample_name].metadata["characteristics_ch1"] = characteristics

# We want to obtain the list of samples that we want to download.
# In this case, we want all samples labeled as "female" and "control".
id_list = []
for sample_name, sample_info in gse.gsms.items():
    sex = sample_info.metadata["characteristics_ch1"]["Sex"].lower()
    group = sample_info.metadata["characteristics_ch1"]["disease status"].lower()
    if sex == "female" and group == "control":
        # gsm_to_srr is our backend to convert GSM to SRR, which is what is recognized by most SRA downloaders.
        id_list.append(gsm_to_srr(sample_name)) 

In [None]:
# Download fastq files for the list of samples in parallel using kingfisher.
download_folder = "~/Data/miRNA/miRNA_Blanca_Rueda_20231221_140148/Fastq/test/"

# This function is the parallelized version of download_fastq, which considerably reduces download time.
download_fastq_parallel(sample_list=id_list, out_dir=download_folder)

You can also read the sample IDs from a file (for example, the output from the SRA Run Selector):

In [None]:
from os import remove as rm

with open("sample_list.txt", "w") as file:
    file.write("\n".join(id_list))

# Download fastq files for the list of samples in parallel using kingfisher.
download_folder = "~/Data/miRNA/miRNA_Blanca_Rueda_20231221_140148/Fastq/test/"

download_fastq_parallel(file="sample_list.txt", out_dir=download_folder)
rm("sample_list.txt")
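
A list exported from another tool may contain blank lines, duplicates or stray text, so it can help to check it before starting a download. The sketch below assumes a plain text file with one run accession per line and assumes that run accessions start with `SRR`, `ERR` or `DRR` followed by digits. The function `read_run_accessions` is a hypothetical helper, not part of the library, and the accessions in the demo are placeholders.

```python
import os
import re
import tempfile

# Assumed pattern for run accessions: SRR, ERR or DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def read_run_accessions(path):
    """Read one accession per line, skipping blanks, invalid entries and duplicates."""
    accessions = []
    seen = set()
    with open(path) as handle:
        for line in handle:
            token = line.strip()
            if RUN_ACCESSION.match(token) and token not in seen:
                seen.add(token)
                accessions.append(token)
    return accessions


# Demo with placeholder accessions written to a temporary directory,
# which is removed automatically when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_list.txt")
    with open(path, "w") as handle:
        handle.write("SRR0000001\n\nSRR0000002\nSRR0000001\nnot_an_id\n")
    ids = read_run_accessions(path)
print(ids)
```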

For a straightforward download of the whole dataset, it is possible to use traditional tools such as the ones in SRA-toolkit or the ones provided by ENA. Nevertheless, because downloading in parallel reduces the download time, we provide a method to download the whole dataset in parallel:


In [None]:
download_GEO_dataset(GEO_id="GSE140069", out_dir="~/Data/miRNA/miRNA_Blanca_Rueda_20231221_140148/Fastq/test")
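
The internals of the parallel functions are not shown in this notebook. As a general illustration of why parallel downloads save time, the sketch below runs one task per run ID in a thread pool using Python's standard `concurrent.futures` module and collects failures instead of stopping at the first one. The names `run_in_parallel` and `fake_download` are hypothetical; `fake_download` is a stand-in for a real downloader call and does not touch the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(run_ids, download_one, max_workers=4):
    """Call download_one(run_id) concurrently; return (succeeded, failed)."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its run ID so results can be attributed.
        futures = {pool.submit(download_one, rid): rid for rid in run_ids}
        for future in as_completed(futures):
            rid = futures[future]
            try:
                future.result()
                succeeded.append(rid)
            except Exception as error:
                # Record the failure and keep going with the other downloads.
                failed[rid] = str(error)
    return sorted(succeeded), failed


def fake_download(run_id):
    # Stand-in for a real downloader call; fails for IDs ending in "3".
    if run_id.endswith("3"):
        raise RuntimeError("simulated failure")
    return run_id


ok, errors = run_in_parallel(["RUN1", "RUN2", "RUN3"], fake_download)
print(ok, errors)
```

Threads suit this kind of work because each task spends most of its time waiting on the network rather than using the CPU.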