
NOTE: this repo is deprecated, work was continued here: https://github.com/naturalis/arise-barcode-metadata

Custom-databases-DNA-sequences

Bioinformatics project B19017-555 (Applicant: B. van der Hoorn)

Project overview

Project repository for the biodiversity assessment in Dutch freshwater and saltwater areas, presenting a management and reporting system for reference DNA barcodes. Contains source code (Python, R, and SQL) and data files. The overall layout is as follows:

  • script - scripts in Python, R, and SQL
  • data - data, including exports from the NSR and BOLD
  • results - output files for the creation of custom databases

The paper accompanying this project can be found here

Workflow

A custom export from the Dutch Species Register (NSR) (see data/NSR_exports/) contains the taxonomic classification of species of interest, including synonyms and expected species. A selection of curated taxa drives the retrieval of BOLD specimen data and sequence records, Naturalis internal specimen records, names and phylogenetic lineages from the NCBI Taxonomy database, and the higher NSR taxonomic classification. The resulting data sets provide a snapshot with which to assess the accuracy and reliability of BOLD’s reference data and to determine its overlap with, and discrepancies from, Naturalis internal records. The underlying data structure combines the molecular data under NSR’s accepted names and links each sequence record to its taxonomic data.

(Workflow diagram)
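
For illustration, the sketch below shows how combined specimen and sequence records for a single taxon could be retrieved from BOLD's public data API. It is a minimal example, not the project's actual retrieval code; the v4 endpoint, the example taxon, and the geographic filter are assumptions.

# Minimal sketch: fetch combined specimen and sequence records for one taxon
# from BOLD's public data API. Not the project's actual retrieval code; the
# geographic filter ("Netherlands") and example taxon are assumptions.
import requests

BOLD_API = "http://v4.boldsystems.org/index.php/API_Public/combined"

def fetch_bold_records(taxon, geo="Netherlands"):
    """Return BOLD specimen and sequence records for a taxon as TSV text."""
    response = requests.get(
        BOLD_API,
        params={"taxon": taxon, "geo": geo, "format": "tsv"},
        timeout=300,
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    records = fetch_bold_records("Gammarus pulex")
    print(records.splitlines()[0])  # TSV header with BOLD column names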

Getting Started

For instructions and requirements to get a local copy up and running, see the INSTALL file.

Part 1: Downloading public specimen data and sequence records

Navigate to the installation directory and run the Python script. By default, the parameters point to files and directories within a local GitHub clone.

usage: custom_databases.py [-h] [-indir INDIR] [-infile1 INFILE1] [-infile2 INFILE2] [-outdir1 OUTDIR1] [-outdir2 OUTDIR2] [-outfile1 OUTFILE1] [-outfile2 OUTFILE2]

optional arguments:
  -h, --help          show this help message and exit
  -indir INDIR        Input folder: NSR export directory
  -infile1 INFILE1    Input file 1: NSR taxonomy export
  -infile2 INFILE2    Input file 2: NSR synonyms export
  -outdir1 OUTDIR1    Output folder 1: BOLD export directory
  -outdir2 OUTDIR2    Output folder 2: Result data directory
  -outfile1 OUTFILE1  Output file 1: Matching records
  -outfile2 OUTFILE2  Output file 2: Non-matching records

Any of the optional arguments can be used to change the input/output locations of the files and directories. Example of argument usage:

python custom_databases.py -indir ../data/NSR_exports -outdir1 ../data/BOLD_exports -outdir2 ../data/TSV_files -outfile1 match.tsv -outfile2 mismatch.tsv
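
For reference, the command-line interface above corresponds to an argparse setup along the following lines. This is a sketch only; the default values and the two input file names are assumptions, and the actual defaults are defined in script/custom_databases.py.

# Sketch of an argparse interface matching the usage shown above.
# Default values and input file names are assumptions, not the
# project's actual defaults.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Match NSR taxa against public specimen and sequence records.")
    parser.add_argument("-indir", default="../data/NSR_exports",
                        help="Input folder: NSR export directory")
    parser.add_argument("-infile1", default="NSR_taxonomy.csv",
                        help="Input file 1: NSR taxonomy export (placeholder name)")
    parser.add_argument("-infile2", default="NSR_synonyms.csv",
                        help="Input file 2: NSR synonyms export (placeholder name)")
    parser.add_argument("-outdir1", default="../data/BOLD_exports",
                        help="Output folder 1: BOLD export directory")
    parser.add_argument("-outdir2", default="../data/TSV_files",
                        help="Output folder 2: Result data directory")
    parser.add_argument("-outfile1", default="match.tsv",
                        help="Output file 1: Matching records")
    parser.add_argument("-outfile2", default="mismatch.tsv",
                        help="Output file 2: Non-matching records")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())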

Part 2: Data classification & analyses

Classification and analysis of the data set is done by executing the code in the R Markdown file, e.g. in RStudio (see script/).

  1. Open RStudio and install the necessary packages:
install.packages(c("rmarkdown","data.table","taxizedb","myTAI","tidyr","shiny","DT","plyr","dplyr","stringr","d3Tree","billboarder","nbaR"))
  2. Go to "File" in the top left corner, click "Open file" / "Open script" and navigate to the script/custom_databases.Rmd file.

  3. Set the working directory to the script source (in RStudio: Session > Set Working Directory > To Source File Location).

  4. In the top right corner of the opened script file, click "run app" if you're using RStudio. Alternatively, run the script line by line using CTRL+ENTER. Note: chunks of code can be run independently (CTRL+SHIFT+ENTER) as long as the necessary data is loaded into the environment.

  5. After running the script, or the tree-visualization segment, a window will pop up; click "open in browser" to run the Shiny app in your default browser.

Part 3: Creation of custom databases

Output files, as generated by parts 1 and 2, are written to the results/ folder and can be used to create custom databases. To do so, first create a new database file:

  1. Start DB Browser for SQLite
  2. Choose the File -> Import -> Database from SQL file menu option, which will open a dialog box. Use it to navigate to, select, and open the created schema.
  3. Choose a filename (and location) to save the database under.

Populating the database with the available data sets is done in a similar way:

  1. Choose the File -> Import -> Table from CSV file menu option, which will open a dialog box. Use it to navigate to, select, and open all of the project’s results files.
  2. Enable the checkbox next to the “Column names in first line” label, so that the names in the first line are used as column names.
  3. If not already selected by default, choose “comma” as Field separator, “double quotation mark” as Quote character, “UTF-8” as Encoding, and enable the checkboxes for the “Trim Fields” and “Separate tables” labels.
  4. Click the OK button at the bottom of the tab to import each table.
  5. Click on “Write Changes” or press “Ctrl+S” to commit the changes to the database file.
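
As a scripted alternative to the DB Browser steps above, the same database can be created and populated with Python's built-in sqlite3 module. The sketch below makes this concrete; the file and table names in it are placeholders, so substitute the project's actual schema and results files.

# Sketch: create and populate the SQLite database without DB Browser.
# File and table names are placeholders, not the project's actual names.
import csv
import sqlite3

con = sqlite3.connect("custom_databases.db")

# Create the tables from the SQL schema
# (equivalent to File -> Import -> Database from SQL file).
with open("schema.sql", encoding="utf-8") as schema:
    con.executescript(schema.read())

# Import one results file into a table
# (equivalent to File -> Import -> Table from CSV file).
# Assumes the first line holds the column names and fields are comma-separated.
with open("match.csv", newline="", encoding="utf-8") as results:
    reader = csv.reader(results)
    columns = next(reader)                       # header line -> column names
    placeholders = ", ".join(["?"] * len(columns))
    con.executemany(
        "INSERT INTO match ({}) VALUES ({})".format(", ".join(columns), placeholders),
        reader,
    )

con.commit()  # same effect as "Write Changes" in DB Browser
con.close()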

Changelog

See the changelog for a list of all notable changes made to the project.

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgements

Special thanks to Rutger Vos as supervisor on this project, Berry van der Hoorn as applicant/supervisor, and Dick Groenenberg, Kevin Beentjes, and Oscar Vorst as advisory group of Naturalis Biodiversity Center.

Additional remarks

Related projects:

  • B19009-560 - Barcoding database status reporting tool (Applicant: A. Speksnijder)
  • B19011-560 - Local custom database creator in the galaxy (Applicant: A. Speksnijder)
