Inria-Chile/oceania-query-demo


oceania-query-fasta Demo

A demo of how to use the oceania-query-fasta Python package to run queries on the OcéanIA FASTA Query Service.

License: CeCILLv2.1

oceania-query-fasta is a pip-installable Python client for the OcéanIA FASTA Query Service, an online service for querying large FASTA files stored in the OcéanIA data storage. It currently supports the Ocean Microbial Reference Gene Catalog v2, with 100 GB of gzipped FASTA, CSV, and TSV files.

By using oceania-query-fasta you do not need to move large files around. Instead, you run queries on our online service right from your Python code and get the results as a Pandas DataFrame.

Install

Requirements:

Clone the repository, create and activate a Python virtual environment, and install the dependencies:

git clone https://github.com/Inria-Chile/oceania-query-demo.git
cd oceania-query-demo/
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

Option 1. Query of Tara Oceans Data from the command line

The library may be used directly as a command line tool:

oceania query-fasta -h

Usage: oceania query-fasta [OPTIONS] <key> <query_file> <output_format>
                           <output_file>

  Extract secuences from a fasta file in the OceanIA Storage.

  <key> object key in the OceanIA storage
  <query_file> CSV file containing the values to query.
               Each line represents a sequence to extract in the format "sequence_id,start,end,type"
               "sequence_id" sequence ID
               "start" start index position of the sequence to be extracted
               "end" end index position of the sequence to extract
               "type" type of the sequence to extract
                      options are ["raw", "complement", "reverse_complement"]
                      type value is optional, if not provided default is "raw"
  <output_format> results format
                  options are ["csv", "fasta"]
  <output_file> name of the file to write the results

Options:
  -h, --help  Show this message and exit.
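
The query file format described in the help text can be produced from Python with the standard csv module. The following is a minimal sketch, not part of the package: build_query_csv is a hypothetical helper name, and the only validation it performs is the list of type values given in the help text above.

```python
import csv
import io

# Sequence types listed in the `oceania query-fasta -h` help text.
VALID_TYPES = {"raw", "complement", "reverse_complement"}


def build_query_csv(rows):
    """Return query-file text, one "sequence_id,start,end[,type]" line per row.

    Hypothetical helper for illustration; each row is a tuple of
    (sequence_id, start, end) with an optional fourth "type" element.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for sequence_id, start, end, *rest in rows:
        if rest and rest[0] not in VALID_TYPES:
            raise ValueError(f"unknown sequence type: {rest[0]!r}")
        # When the type is omitted, the service default ("raw") applies.
        writer.writerow([sequence_id, start, end, *rest])
    return buffer.getvalue()


query_text = build_query_csv([
    ("TARA_A100000171_G_scaffold48_1", 10, 50, "complement"),
    ("TARA_A100000171_G_scaffold48_1", 10, 50),
])
print(query_text, end="")
```

Saving query_text to a file (for example with pathlib.Path("my_query.csv").write_text(query_text), where my_query.csv is a placeholder name) should give a file that can be passed as <query_file>.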

For an overview of all available commands, run:

oceania -h

Usage: oceania [OPTIONS] COMMAND [ARGS]...

  A simple OceanIA command line tool.

Options:
  -h, --help  Show this message and exit.

Commands:
  query-fasta  Extract secuences from a fasta file in the OceanIA Storage.

Example 1.A. Query in storage TARA_A100000171

The sample-data/ folder contains the query file query_tara_a100000171.csv

TARA_A100000171_G_scaffold48_1,10,50,complement
TARA_A100000171_G_scaffold48_1,10,50
TARA_A100000171_G_scaffold48_1,10,50,reverse_complement
TARA_A100000171_G_scaffold181_1,0,50
TARA_A100000171_G_scaffold181_1,100,200
TARA_A100000171_G_scaffold181_1,200,230
TARA_A100000171_G_scaffold493_2,54,76
TARA_A100000171_G_scaffold50396_2,87,105
TARA_A100000171_G_C2001995_1,20,635
TARA_A100000171_G_C2026460_1,0,100

Run the query:

oceania query-fasta TARA_A100000171 query_tara_a100000171.csv csv example_tara_a100000171.output.csv
[08-06-2021 21:48:52] Sending request for fasta sequences
[08-06-2021 21:48:54] Request accepted
[08-06-2021 21:48:54] Waiting for results...

Then check the output file example_tara_a100000171.output.csv, which should look like:

TARA_A100000171_G_scaffold48_1,10,50,complement,ACCGTAACGTAGGCCATATTATTTTCATGGTCTTCCACAA
TARA_A100000171_G_scaffold48_1,10,50,raw,TGGCATTGCATCCGGTATAATAAAAGTACCAGAAGGTGTT
TARA_A100000171_G_scaffold48_1,10,50,reverse_complement,AACACCTTCTGGTACTTTTATTATACCGGATGCAATGCCA
TARA_A100000171_G_scaffold181_1,0,50,raw,CCAAGACCAAGCAATTTTAACACCACACTTAGATACTGCGCAAACAGCGT
TARA_A100000171_G_scaffold181_1,100,200,raw,ATTATGTTACCAGCACTTGATAACCAAAAAGTTTGGGcaggattaaaattaactaaTGATCAATTAATTGCAACTGACGATGATCAAGCATACTTTAAGT
TARA_A100000171_G_scaffold181_1,200,230,raw,ATCAAACTGATGCTACTAACTCAGAAGCAT
TARA_A100000171_G_scaffold493_2,54,76,raw,TAAGTTTTTATTATTATATTTT
TARA_A100000171_G_scaffold50396_2,87,105,raw,AGCTGTTCGGAAAACTAG
TARA_A100000171_G_C2001995_1,20,635,raw,ACAGCACACCAAGCAGGTCGTCGACCGAAACGATATTGAGAAGAATAAGAACGGAAACCGCGATGGCTGCACTCACCTCCGGCGAGCGCCATTCGCGGGCAAACGCTATAAAGAGACCGATAATGACGACGCCAACGATCAGCGCGCCATAGGGCTCAATCAGGCTAGCGAACAAATGCACCCTCCGCTCGGTCCACGGCGCACTCTATGCGATGCCGGCCTGTATTGGAAAGCAGTCAGAATCAATTCGGACTTCTTTTTTAAGCAAACGGGCTTGGGCATTACCGCCCGGATAATGTACGGCTGACTGCATCCCGCCAACCGGCCAGCTTTTCCTTGCGCGCCGCTCCGTCCATTTCGGGAACGAACTGACGTTCGAGCGCCCAGCTTCTTGAAAACGCTTCTTGATCCGGCCAAAGCCCTGTCGCTTGCCCTGCGAGCCAGGCGGCCCCCAGAACCGTTGTTTCGAGCATATTTGGCCGGTCGACCGGTGCGTCGAGAA
TARA_A100000171_G_C2026460_1,0,100,raw,AATTTGAAACAACCCTAAAGTGTTTACCATAATAGGTTCTTAAATCAAAACCAACATTCCAAGTTAGGTTGTCGCCTAGCTTTTTCTCAAGGTTTGAAAT
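
The first three output rows show how the three type values relate to each other for the same range. The sketch below reproduces the "complement" and "reverse_complement" rows locally from the "raw" row. It is only an illustration of the relationship seen in the sample output, not the service's implementation: transform is a hypothetical helper name, and keeping lowercase bases lowercase is an assumption.

```python
# Base-pairing table; preserving lowercase bases is an assumption.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def transform(fragment, kind="raw"):
    """Apply one of the query "type" values to an already extracted fragment."""
    if kind == "raw":
        return fragment
    if kind == "complement":
        return fragment.translate(COMPLEMENT)
    if kind == "reverse_complement":
        return fragment.translate(COMPLEMENT)[::-1]
    raise ValueError(f"unknown sequence type: {kind!r}")


# "raw" row for TARA_A100000171_G_scaffold48_1,10,50 from the output above.
RAW = "TGGCATTGCATCCGGTATAATAAAAGTACCAGAAGGTGTT"

print(transform(RAW, "complement"))
print(transform(RAW, "reverse_complement"))
```

The two printed strings can be compared against the "complement" and "reverse_complement" rows of the output file.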

The previous steps can also be run from the bash script query_tara_a100000171.sh:

bash query_tara_a100000171.sh
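
The same command line could also be launched from a Python script with the standard subprocess module. This is a sketch under stated assumptions: the oceania executable is on the PATH of the active virtual environment, and build_command and run_query are hypothetical helper names, not part of the package. Only the argument order and the output formats come from the help text above.

```python
import subprocess

# Output formats listed in the `oceania query-fasta -h` help text.
OUTPUT_FORMATS = ("csv", "fasta")


def build_command(key, query_file, output_format, output_file):
    """Build the argument list for `oceania query-fasta` (hypothetical helper)."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return ["oceania", "query-fasta", key, query_file, output_format, output_file]


def run_query(key, query_file, output_format, output_file):
    """Run the CLI; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(
        build_command(key, query_file, output_format, output_file),
        check=True,
    )


command = build_command(
    "TARA_A100000171",
    "query_tara_a100000171.csv",
    "csv",
    "example_tara_a100000171.output.csv",
)
print(" ".join(command))
```

Calling run_query with the same four arguments would execute the query; it is not called here because it needs the installed CLI and network access to the service.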

Example 1.B. Query in storage TARA_R110002003

For this example we use the file sample-queries/query_tara_r110002003.csv

TARA_R110002003_G_scaffold3_1,3290,6293
TARA_R110002003_G_scaffold3_3,0,327
TARA_R110002003_G_scaffold3_3,944,2742
TARA_R110002003_G_scaffold3_4,379,379
TARA_R110002003_G_scaffold3_4,1530,1669

Execute the query:

oceania query-fasta TARA_R110002003 query_tara_r110002003.csv csv example_tara_r110002003.output.csv
[08-06-2021 21:48:52] Sending request for fasta sequences
[08-06-2021 21:48:54] Request accepted
[08-06-2021 21:48:54] Waiting for results...

Then check the output file example_tara_r110002003.output.csv:

id,start,end,type,sequence
TARA_R110002003_G_scaffold3_1,3290,6293,raw,TGATCGGGAGTCCTCCAGGCTTTGGATCGTTTGGGATAGATTTGTTCGAAGGAATACGGTGTCAGGAAAAGAGGATGAGGGATCGATAGTTGTGAGCTGGCATGAGCCATCAACGGTTCTGGAGTCTCGGGTACAAGTCTCACGCAGGTCTGACTGCTGGGCCACGTGCTGAAATGTATTGCTTGTAAAAGCAAATGCTTCACCGAGTAGGGTACAACAGATTGCGAATCGCATGATTTTGGATTGTTCGAGAGGTTGAATGTCTGAGAAGACGAACTTACTACTACAGCCTGCAAAGATTCATTGGGGTTGATATACTGTTGACGGTGGAGTTGGTGCGCCGAGTTATGAAACGCGGGATCGCAGTGAAGCGAAGAGCTGAAACATTTACTGCGAAACATGCCGTCTGTGTTCGAAACTGTACAGCTACCTCGTTGCTACAGCTTGAGTCTACGGGCACCGACTTCAGGCAGCACAATAGGCGCTCCTGACCTCTGCAGGAGGTACTATGAGCTTGCTGTTGAAGGCCTTATGCCACTAATTTGACGAGACCTGAGTTGCTACCCGCACATTTAAACATGCAAGACATACATCATGACAGCTTCGTTAATTGGGTCCGTCGATACAAGATCGAGCGGCGGAAATATCGATGAGCGCTGTTTTCAATAGTGTACTGTGATTTGCGATTTGCGGGGGAAGCAAGAGCGAGACGCGGATGACGGGGGAAGGTTGTCGCATTTGTTGTTCGAGGCTGAAACGAAGCGCTGCTCGGCAAAGCCTGCCATTCCGCGCTGGGAGGCTCGCCATTTCTTTCTTCCAATTGGACGAGGGAGCGTCTTGAGAATTTTCGAAATGACATGAAAGTCCAATAAGTCGATAGGCATGTTGACCGAGTCCGTAGGCACGAAAGACTGCAGCATTGATTTGAATTGCCCTCAATTTTCTTTGGTGACTACTCGATCGATCTCTGCCCACAATGTTGTTGCTCAGCACGACAACGGACTGCCATCCCTGGGACTATGAGTCAAGGGTGGAATGGTGATGACCGGTCATGATGTGCCAGAAGAGACAGCCATTGACTTGCCAGACGCATCTCCTGGTGCGATTGGCGCGATCAGACCCTTCGGCAGTGCGATACTGTATGCTCATTGATGTCATTCGTTATGGCTGAACGAGATGTCACACCCCTCGCGCCATGCAATGATCGCGGATCTCTGAAGACGCTGATGTTGTCGTCTAGCTCACTGTTATTTTCTCAAGAACTTCGCGACGGTAATCTCGTCCGAGGTGTCGGGCGTACTGATTGCACGTGCATGGGCATATGGGTGCGGTTGTTGACCAGGATGGGCCGTTTGGATGACTGCGCTCCGGGTAGGAGTCGCCAAGAGCTTACATGGTGCAAGAACAGAGGGCGGTATGTCGACATCTGTGAGGCGGCAGGCTGAAGTCAAGCTATTTCTGTCCTGATCAGCGCGAGGGCAGACAGGCAATTGTGCGACGGTAAATTCGGTGCCAATGGCTCTCGATATCAACGACAGGCCGGCGTCGTGCCCAGCACTACCGCCGGGCCGCCTAGTGCGAGCCCTTGCTGACGGGTTGCGCGTTATGCACCTGTTCAGTAGATTCATGAGCTCATGGTAGTGGTGGTAGCTGGCGTGGTGCTGAACGGGATGGTGAATGGTTCCGATGCTCCGATCCGAACCCCAAAGCGCCCCTCGCAACCCTACCCGGTAGCAGATGCCCCGGGATTCAGGCCACGTCAACGTAAGCTCAGCAAAGTGCGCAAAGACCTGCCTCAACTAGTCGACACTGTGGCCAAGCCTTTGCATATTCACCAGAGAGAGGAGAGCTCTCTTGTACGCCCCGTACCGTGTAGCACAGTCAGCGCGTCGAGAGTCCTGGAGTCGTCTCGTCGTAAGTCACGGTAATGGCAGTACAACGGGCGCGAAAGTCGAC
ATAAGACCAGCTCCTCGAGCGAACAAGCCCGTCATCTTGATCGATGGAGCAAACATCTCAGGCTCTCTAGCTTTGTTCATCGGAGTTGGAACTGAAGACATTGTTTTTGGTGTTTGCGACCACAAGTTGAACGTCAACCGGCGCAAGAACGCCCCGAGCCCGATTGACTTCCGCTCCGCCCCCTTCTCGTCCGACGTGCACCGCCTTTCCACCATGTGGCATGACCCACCAAGGCAGCCAATCTGGCGCCCAATGATCTTCGTTGCTTCCAGCCTCATGTCCCAGTTTGGTTCGACCTAATCCGTTCTGGGGCCATTCTGCGATGTCCTGGAAACGCACGCTACACGCCACACTGCCTAGGATTCCGACCGCTGGACGGGAGTTATTATCTGACATGCTAGACTCGCGTCTTCGACCGCTTTGGGAAGGGCTTCGTCTCCACTGCGCGCTCGTAATCTACGCGTAGTCGCTGCCTCAGGAGGATTACTGCAGCGCCATGGAGAGGCGAATGCGAAACAACATTGTGTCCGGTCACCACTCGCAGCATATAAGGCATCAGGTTTCGCCCTCGTAAGCGATTGTTCCAACCCAAGTCAGCATTACTACACTCGCAACGAGAGACTATTCGCCTCGGCCTCCCCTTCAGAGTCATAGCTAGAAGTTTCTCATTGGCTTCCTTTCGACACAACTTCACCTCGCAATCTGCAAAATGACTGTTCCCCTACCAAACCCCGATCTCTGTCAGTCGAAATTGTCCAGCTTCACACCTTCAACTCACTAATCTGTTTCTCAAAGGTACCGCAGTCGGTCAGGTCGTTGCTGGCCGGCCATGTACAGTCGAAGGGACACTCTACGGCTACTACCCTAGCCTTGGCGCCAACGCTTTCTTCGCTGCTTTCTTCGCGGTCTGCTTTGCCTGGCAACTATATTGCGGCATCAGATACAAGACATGGACCTACATGGTGTGTATTGGAGTACGAGTCCTCATTGTCCATGACGCTGACCACTAGCAGATCGCCCTTTGTCTTGGGTGTGTCGGTGAAGCCG
TARA_R110002003_G_scaffold3_3,0,327,raw,TCCCTCTACACAGAGCAAACCTCCCAGGTAAGATCAGCCCGGGCTAGTCCCCTACCTGGGGTCGATGGATAAATAACCTTGAGTCCAGCTTACTTGCCCAGGATTCTACAGGCACTTCCGGGAGTGGTGTGAGCACTGATTCGACAGCCGAATACAGCGATGGCATGGCCTCATCGTACCGACACAGCTCCCGGGCTTCATACTCACCGTCTGTTGGACACCATCCTAGCTGGCCAGGCTCTAGCACGATTGCCTCATCATCCCAATCGATATCAACGAAAGGGAAGCAGCCAGCGCCCACGGCGGATGCTCTCGGGCGGCCATTTT
TARA_R110002003_G_scaffold3_3,944,2742,raw,CAACATCTCCCTCTTCTTTACTTTGAATCTCTCGTCCTTATTTCGTATCTATGAAATGAGTGCTAAAAATCTCAGGGAACGACTTCACAGCCTTTGCATCTATGTTTCACTTCTGGTACCTCATGCGATGGATGATATCACCGTCACCCGAAACTTACGAAGCGATCCCAGAATGGCTGAGACCAACGTAAGATACCGGTAGCAGCAGTTTGGTCTTTGCGCTCACGTTGTTCATTCTAGACCAAACCAGTTATTCATGCCTCACATCAACATGCTCGATTTCATCGCTTGGCCTGCGTTCCGCGAATTCGCTGTACAGGTTCCACGCATGCAAGAGCGGATGGACTGGATGATGGACATGAGTCTTACAATCCAGTGCGACTGGTCATTTGCCAACGATGAGGCTTTTCGAAGAGATGATGAGACAGGTTTGCTAGACCTATGTTTGGTGGCAAAGGTATGCTCACTTCGCTACATTAAGACTCCTCGAAAGAACCATGCAGTCAAAGGCTCAGGGACACACGCTATGAACCCTTGCTGACTAGGTCCAGACGGCTATGCGTGATCTCTCCTGTTGGTCTGTAGGGCCAACATTTAGAGCCTACGTAAGCAATGCGGATTCGTACGTGCGAATCAGGACAGAAGAATCATCCGGGTGAAGAGATTATTCGGACACCTTGGATATAATAGCCGAACGACAACATCAAATACAGTCTGTTGTGCAGCAACAACAAGAGTTTTATTTACGAATCTTTCCCAGATAAGTTATTATAATTGCCTCTAACTTACCACTTACTTAAGACTATAGAGCTGTAGAGGTTGTAGTGCTAACTATCATGCAAAAGGAAACCTTTGGTGGGGTGTCGAAATGTGACCGATTTTCTTTTACCCGGGTGGAACATTGACCGAGCTTGGTAACGACCTCCGCTTGGAAGGCGGAGTAAAGAAAGTGTAAGTTGCCCATACATACGTACTAGTAATCTCAGTCGGAAGCACGGAAAACCAGCATGCACACCAAGCCACTAAATAACACACCGATACCAAATGAAAACACCGCCAGGCATCTTTACGTCCGTCATCAGTACTACAACCTTCGCGCCATATACCGTTGGTACGTATGACGGCTTTTCGTACGGCCTTTTCACTGGATGTAATACCCATATGACTCGATATAAATATGCGAAACATCGTACGATGCGCCTCCAGAAATTCGATGACCACGTTAACTACGATGCACGTCATAAGTCGATGCTCATCGCGACAATGAGGGGCACGGAGGGGCAGACCCCCTGGTCAAGTCTTCCGACCCAATCATATTGTTCCTTTCCCTAGGGAAACTCGATCTCTTCATATAGAATCGATTCCGATCTTGTGATTCAACCACGGAAGTACCTCAGCTTGTCTGCTTGGGAGATGAGGCCGATTCACGACGGATTACGACGATTGCAGCGTGGGAGGACGTCTGGGCCAGTGGCGCTGCGGTAGTGGCGTTGTTCTAGTGTCGCAAACGGTCGTGATGGAAGCCGGATAGCTTCACACATTTGGGGGAGGGTCGAACGGAATATTACAAACAGATGGTGTTAAGTGCATGCGATCTTAGTGATGAGAGATGCTACTAACGAAGCTAGTCTTGCCGCTGCTGTGCCTTGTGAGGGATACCGGTAGGAGACCGATACCGTTAACTCAATCTCTCCAACCCGGAGACATAGCGCGGATCGGAATATGCATAGAACTTTTAGTCCAAGAGAGAAGCCAGTCGTAAGGAGAGTAGCAGGCAATGCCGAGTAGGTGACCAACT
TARA_R110002003_G_scaffold3_4,379,379,raw,
TARA_R110002003_G_scaffold3_4,1530,1669,raw,GAGCAATTTGCAGATGGTGGTGTAGTCCTCGAAGTTGGAACAGATGCTCGCGAGACTCCACGGTGTCAGGAGTGTCGGGAACCAACGATAGCTAGGAAAGTTAGTCCAGGCTCAGGGAACCAAAGGCCAAAAAAAAacc
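
In the sample outputs, each returned sequence has length end - start, and the row with start equal to end (379,379) is empty, which suggests that positions are 0-based with an exclusive end. The following sketch checks that observation on a few short rows copied from the examples in this document. It assumes the output file begins with the id,start,end,type,sequence header shown above; read_results is a hypothetical helper name.

```python
import csv
import io

# Short rows copied from the example outputs, under the header of Example 1.B.
SAMPLE_OUTPUT = """\
id,start,end,type,sequence
TARA_A100000171_G_scaffold493_2,54,76,raw,TAAGTTTTTATTATTATATTTT
TARA_A100000171_G_scaffold50396_2,87,105,raw,AGCTGTTCGGAAAACTAG
TARA_R110002003_G_scaffold3_4,379,379,raw,
"""


def read_results(handle):
    """Parse a CSV result file into dicts with integer start/end fields."""
    records = []
    for row in csv.DictReader(handle):
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        records.append(row)
    return records


records = read_results(io.StringIO(SAMPLE_OUTPUT))
for record in records:
    expected_length = record["end"] - record["start"]
    print(record["id"], len(record["sequence"]), expected_length)
```

To run the same check on a real result, pass an open file handle, such as open("example_tara_r110002003.output.csv", newline=""), to read_results instead of the in-memory sample.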

The previous steps can also be run from the bash script query_tara_r110002003.sh:

bash query_tara_r110002003.sh

Option 2. Query of Tara Oceans Data from Python package

The library may also be used directly as a Python package.

Example 2.A. Run Python query TARA_A100000171

Run query_tara_a100000171.py with:

python3 examples/query_tara_a100000171.py

Example 2.B. Run Python query TARA_R110002003

Run query_tara_r110002003.py with:

python3 examples/query_tara_r110002003.py

Option 3. Query of Tara Oceans Data from Jupyter Notebook

Example Jupyter notebooks are available in the notebooks/ folder. To use them, start a local Jupyter Notebook server with:

jupyter notebook

or, alternatively, use the Google Colab links that are provided below.

Example 3.A. Query in storage TARA_A100000171

Open notebooks/query_tara_a100000171.ipynb to see the example code, then execute all the cells. Open In Colab

Example 3.B. Query in storage TARA_R110002003

Open notebooks/query_tara_r110002003.ipynb to see the example code, then execute all the cells. Open In Colab

Note: more Jupyter notebooks are available in the notebooks/ folder.
