6. Input files

Jérémy Rousseau edited this page Jan 14, 2026 · 2 revisions

FASTA files

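During pipeline execution, only the first part of each sequence header (the sequence identifier, e.g., MALV-II_sp_EP00452|sequence00001 in the example below) is retained; the rest of the header line is discarded.
Each sequence identifier must therefore be unique.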

Sequences in the FASTA file can be either single-line or multi-line.

>MALV-II_sp_EP00452|sequence00001 EukProt EP00452 MALV-II_sp_L67-6 single-cell
LLGSEIVHCSLQRTVPRTLQKILSQSGLRELSVLLQEELPLRYAVRIKLIESVPGWEGQKNLRHVRKIYADSFKQIRLLCGEGDELDPQFVDALRDIKRRHTNVIQHIILGAHAMKDELEFQQMNGFMSAFFMSRMGTEMLTTHFLARAA
>MALV-II_sp_EP00452|sequence00002 EukProt EP00452 MALV-II_sp_L67-6 single-cell
KPGSVFLSLGRGQAVDEAALVKHSARFGGIVMDVFEKEPLSKESPLWTLDNAMLTSHNADIVPSYEQDTIDVFLERFAEFSRGE
>MALV-II_sp_EP00452|sequence00003 EukProt EP00452 MALV-II_sp_L67-6 single-cell
LEPFYPIYQHGLMYGKPEERYLAAKALGELVSHTTEEALKPFVVKITGPLIRIVGDRFAANVKIAIIDTLKALLIKGGAALRPFLPQLQTTYLKCLNN
>MALV-II_sp_EP00452|sequence00004 EukProt EP00452 MALV-II_sp_L67-6 single-cell
HGVKSVLPHLLDGIADKQWRTKLHSVELLGTMVKVSPKQLTISLPSIVPVLAETVNDTHAKVKEAAKLSLEKVARVVTNPEIKALAPELLTTSNLHDERLEGFFSRMADEFAQGLGSE
>MALV-II_sp_EP00452|sequence00005 EukProt EP00452 MALV-II_sp_L67-6 single-cell
MVIAGWELGVPESIAVVIVIGFSVDYVVHLAAHYVHSPFNSREERATESVTAMGVSIFSGAITTMGSGVFLFGG
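
Because only the identifier is kept, duplicate identifiers are worth catching before the pipeline is launched. The following is a minimal sketch, not part of LAGOON-MCL: it assumes that "the first part" of a header means the text between ">" and the first whitespace character, and the function names `fasta_ids` and `duplicated_ids` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def fasta_ids(lines: Iterable[str]) -> list[str]:
    """Return the first whitespace-delimited token of every FASTA header line.

    Assumption: the retained identifier is the text before the first whitespace.
    """
    ids = []
    for line in lines:
        if line.startswith(">"):
            # Drop the leading ">" and keep only the first token of the header.
            parts = line[1:].split(maxsplit=1)
            ids.append(parts[0] if parts else "")
    return ids


def duplicated_ids(lines: Iterable[str]) -> list[str]:
    """Return the identifiers that occur more than once, sorted alphabetically."""
    counts = Counter(fasta_ids(lines))
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

To check a file, pass an open file handle, for example `duplicated_ids(open("sequences.fasta"))` (the file name is a placeholder); an empty list means every identifier is unique. Sequence lines are ignored, so single-line and multi-line records are handled the same way.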

Correspondence and annotation files

Correspondence file

When users provide their own annotations to LAGOON-MCL, they must first create a correspondence file that links annotation names to the paths of the corresponding annotation files.

This file must be in CSV (Comma-Separated Values) format and consist of two columns:

  • Column 1: annotation – the name of the annotation (e.g., Pfam, Gene3D, Species, Phylum, etc.).
  • Column 2: file – the path to the annotation file. For more details, see here.

annotation,file
funfam,/path/to/lagoon-mcl/data-test/malv/labels/funfam.tsv
gene3d,/path/to/lagoon-mcl/data-test/malv/labels/gene3d.tsv
tmhmm,/path/to/lagoon-mcl/data-test/malv/labels/tmhmm.tsv

An example of the annotationsheet.csv file is available here.
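
The correspondence file can also be generated programmatically. The sketch below is only an illustration under two assumptions: the file starts with a header row naming the columns `annotation` and `file`, and fields are separated by commas. The helper name `write_correspondence` is hypothetical.

```python
import csv
import io
from typing import TextIO


def write_correspondence(annotations: dict[str, str], handle: TextIO) -> None:
    """Write one 'annotation,file' row per annotation name.

    Assumption: the pipeline expects a header row with these two column names.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["annotation", "file"])
    for name, path in annotations.items():
        writer.writerow([name, path])


# Usage: build the content in memory and print it.
buffer = io.StringIO()
write_correspondence(
    {
        "funfam": "/path/to/lagoon-mcl/data-test/malv/labels/funfam.tsv",
        "gene3d": "/path/to/lagoon-mcl/data-test/malv/labels/gene3d.tsv",
    },
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open("annotationsheet.csv", "w", newline="")` and pass that handle.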

Annotation files

Each annotation file must contain a single type of annotation. For example, if the user provides Pfam, Gene3D, Phobius, and taxonomic annotations (phylum, genus, and species), then six separate annotation files will be required.

Each annotation file must consist of two columns:

  • Column 1: sequence_id – the sequence identifiers.
  • Column 2: label – the sequence-specific annotations.

If a sequence has multiple annotations, they should be separated by a semicolon (";"). Sequences with no annotations should not appear in the file.

sequence_id label
MALV-I-01_sp_EP00398|sequence00236 G3DSA:3.40.50.300;G3DSA:2.40.30.230;G3DSA:6.10.140.1240
MALV-I-01_sp_EP00398|sequence00079 G3DSA:3.40.50.300;G3DSA:1.10.8.60;G3DSA:2.30.30.190
MALV-I-01_sp_EP00398|sequence00009 G3DSA:3.30.60.90
MALV-I-01_sp_EP00398|sequence00443 G3DSA:1.25.10.10
MALV-I-01_sp_EP00398|sequence00399 G3DSA:3.40.630.30
MALV-I-01_sp_EP00398|sequence00573 G3DSA:1.25.40.20
MALV-I-01_sp_EP00398|sequence00985 G3DSA:2.130.10.10
MALV-I-01_sp_EP00398|sequence00545 G3DSA:1.10.287.3240
MALV-I-01_sp_EP00398|sequence00949 G3DSA:2.160.20.60;G3DSA:1.10.1060.10;G3DSA:3.50.50.60;G3DSA:3.20.20.70
MALV-I-01_sp_EP00398|sequence00334 G3DSA:3.40.50.720

An example of an annotation file is available here.
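
To make the semicolon convention concrete, here is a minimal sketch of reading an annotation file into a mapping from sequence identifier to a list of labels. It is not the pipeline's own parser. It assumes a tab-separated file (the example paths end in .tsv) whose first row is the header `sequence_id`, `label`; the function name `read_annotations` is hypothetical.

```python
import csv
from typing import Iterable


def read_annotations(lines: Iterable[str], delimiter: str = "\t") -> dict[str, list[str]]:
    """Map each sequence identifier to its list of labels.

    Assumptions: tab-separated columns and a header row on the first line.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # skip the header row (sequence_id, label)
    labels: dict[str, list[str]] = {}
    for row in reader:
        # Ignore blank or incomplete rows; unannotated sequences should not be listed.
        if len(row) < 2 or not row[1].strip():
            continue
        # Multiple annotations for one sequence are separated by semicolons.
        labels[row[0]] = [item.strip() for item in row[1].split(";") if item.strip()]
    return labels
```

Pass an open file handle, or any iterable of lines, as `lines`. A row such as `MALV-I-01_sp_EP00398|sequence00079` with `G3DSA:3.40.50.300;G3DSA:1.10.8.60;G3DSA:2.30.30.190` yields a list of three labels.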

Alignment file

The DIAMOND BLASTp output file contains the following columns:

  • qseqid: Query sequence ID.
  • qlen: Length of the query sequence.
  • qstart: Start position of the alignment on the query sequence.
  • qend: End position of the alignment on the query sequence.
  • sseqid: Subject sequence ID.
  • slen: Length of the subject sequence.
  • sstart: Start position of the alignment on the subject sequence.
  • send: End position of the alignment on the subject sequence.
  • length: Length of the alignment.
  • pident: Percentage of identical matches.
  • ppos: Percentage of positive matches.
  • score: Raw score of the alignment.
  • evalue: E-value of the alignment.
  • bitscore: Bit score of the alignment.
qseqid qlen qstart qend sseqid slen sstart send length pident ppos score evalue bitscore
MALV-I-01_sp_EP00398|sequence00001 475 1 475 MALV-I-01_sp_EP00398|sequence00001 475 1 475 475 100 100 2448 0.0 947
MALV-I-01_sp_EP00398|sequence00001 475 197 475 MALV-II-16_sp_EP00396|sequence00239 290 16 268 279 56.6 71.3 812 5.11e-107 317
MALV-I-01_sp_EP00398|sequence00001 475 128 207 MALV-II-16_sp_EP00396|sequence00238 83 5 82 82 46.3 67.1 195 2.66e-19 79.7
MALV-I-01_sp_EP00398|sequence00002 28 1 28 MALV-I-01_sp_EP00398|sequence00002 28 1 28 28 100 100 134 1.52e-14 56.2
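
The column list above can be turned into a small reader that converts each field to a suitable type. This is a sketch under stated assumptions rather than the pipeline's own code: columns are separated by whitespace (tabs or spaces), identifiers contain no whitespace, and a header line starting with `qseqid` may or may not be present. The names `COLUMNS` and `parse_alignments` are hypothetical.

```python
from typing import Iterable, Iterator

# Column order as listed in this section.
COLUMNS = [
    "qseqid", "qlen", "qstart", "qend", "sseqid", "slen", "sstart",
    "send", "length", "pident", "ppos", "score", "evalue", "bitscore",
]
INT_COLUMNS = {"qlen", "qstart", "qend", "slen", "sstart", "send", "length"}
FLOAT_COLUMNS = {"pident", "ppos", "score", "evalue", "bitscore"}


def parse_alignments(lines: Iterable[str]) -> Iterator[dict[str, object]]:
    """Yield one dictionary per alignment row, with numeric fields converted."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and an optional header row.
        if not fields or fields[0] == "qseqid":
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row: dict[str, object] = dict(zip(COLUMNS, fields))
        for name in INT_COLUMNS:
            row[name] = int(fields[COLUMNS.index(name)])
        for name in FLOAT_COLUMNS:
            row[name] = float(fields[COLUMNS.index(name)])
        yield row
```

For example, `list(parse_alignments(open("alignments.tsv")))` (the file name is a placeholder) returns every row as a dictionary keyed by column name, so a hit can be inspected with `row["sseqid"]` or `row["evalue"]`. Rows with an unexpected number of columns raise an error instead of being silently misread.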
