6. Input files
During pipeline execution, only the first part of each sequence header is retained.
It is therefore essential that each sequence identifier is unique.
Sequences in the FASTA file can be either single-line or multi-line.
>MALV-II_sp_EP00452|sequence00001 EukProt EP00452 MALV-II_sp_L67-6 single-cell
LLGSEIVHCSLQRTVPRTLQKILSQSGLRELSVLLQEELPLRYAVRIKLIESVPGWEGQKNLRHVRKIYADSFKQIRLLCGEGDELDPQFVDALRDIKRRHTNVIQHIILGAHAMKDELEFQQMNGFMSAFFMSRMGTEMLTTHFLARAA
>MALV-II_sp_EP00452|sequence00002 EukProt EP00452 MALV-II_sp_L67-6 single-cell
KPGSVFLSLGRGQAVDEAALVKHSARFGGIVMDVFEKEPLSKESPLWTLDNAMLTSHNADIVPSYEQDTIDVFLERFAEFSRGE
>MALV-II_sp_EP00452|sequence00003 EukProt EP00452 MALV-II_sp_L67-6 single-cell
LEPFYPIYQHGLMYGKPEERYLAAKALGELVSHTTEEALKPFVVKITGPLIRIVGDRFAANVKIAIIDTLKALLIKGGAALRPFLPQLQTTYLKCLNN
>MALV-II_sp_EP00452|sequence00004 EukProt EP00452 MALV-II_sp_L67-6 single-cell
HGVKSVLPHLLDGIADKQWRTKLHSVELLGTMVKVSPKQLTISLPSIVPVLAETVNDTHAKVKEAAKLSLEKVARVVTNPEIKALAPELLTTSNLHDERLEGFFSRMADEFAQGLGSE
>MALV-II_sp_EP00452|sequence00005 EukProt EP00452 MALV-II_sp_L67-6 single-cell
MVIAGWELGVPESIAVVIVIGFSVDYVVHLAAHYVHSPFNSREERATESVTAMGVSIFSGAITTMGSGVFLFGG
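
Because only the first part of each header is kept, it can be worth checking identifier uniqueness before launching the pipeline. The following is a minimal sketch, not part of LAGOON-MCL: it assumes that the retained part is the header text up to the first whitespace (consistent with the identifiers shown in the tables of this section), and the function names `fasta_ids` and `duplicated_ids` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def fasta_ids(lines: Iterable[str]) -> List[str]:
    """Return one identifier per record: the header text up to the first whitespace."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # A header with no text after ">" yields an empty identifier.
            ids.append(fields[0] if fields else "")
    return ids


def duplicated_ids(lines: Iterable[str]) -> List[str]:
    """Return identifiers that occur more than once, in order of first appearance."""
    counts = Counter(fasta_ids(lines))
    return [seq_id for seq_id, n in counts.items() if n > 1]


# Two records from the example above (sequences shortened for display).
example = [
    ">MALV-II_sp_EP00452|sequence00001 EukProt EP00452 MALV-II_sp_L67-6 single-cell",
    "LLGSEIVHCSLQRTVPRTLQKILSQSGLRELSVLLQEELPLRYAVRIKLIESV",
    ">MALV-II_sp_EP00452|sequence00002 EukProt EP00452 MALV-II_sp_L67-6 single-cell",
    "KPGSVFLSLGRGQAVDEAALVKHSARFGGIVMDVFEKEPLSKESPLWTLDNAM",
]

print(fasta_ids(example))
print(duplicated_ids(example))
```

For a real file, the same functions can be applied to an open file handle, for example `duplicated_ids(open("proteins.fasta"))`, where the file name is a placeholder.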
When users provide their own annotations to LAGOON-MCL, they must first create a correspondence file that links annotation names to the paths of the corresponding annotation files.
This file must be in CSV (Comma-Separated Values) format and consist of two columns:
- Column 1 (annotation): the name of the annotation (e.g., Pfam, Gene3D, Species, Phylum).
- Column 2 (file): the path to the annotation file. For more details, see here.
| annotation | file |
|---|---|
| funfam | /path/to/lagoon-mcl/data-test/malv/labels/funfam.tsv |
| gene3d | /path/to/lagoon-mcl/data-test/malv/labels/gene3d.tsv |
| tmhmm | /path/to/lagoon-mcl/data-test/malv/labels/tmhmm.tsv |
An example of the annotationsheet.csv file is available here.
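
The correspondence file can be written by hand or generated. The sketch below uses Python's standard `csv` module and assumes that the file starts with a header row named `annotation,file`, as the table above suggests. The helper name `write_annotation_sheet` is hypothetical, and the paths are the placeholders from the example.

```python
import csv
import io
from typing import Dict, TextIO

# Annotation names mapped to annotation file paths (placeholder paths).
annotations = {
    "funfam": "/path/to/lagoon-mcl/data-test/malv/labels/funfam.tsv",
    "gene3d": "/path/to/lagoon-mcl/data-test/malv/labels/gene3d.tsv",
    "tmhmm": "/path/to/lagoon-mcl/data-test/malv/labels/tmhmm.tsv",
}


def write_annotation_sheet(entries: Dict[str, str], handle: TextIO) -> None:
    """Write a two-column CSV: annotation name, then path to the annotation file."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["annotation", "file"])  # assumed header row
    for name, path in entries.items():
        writer.writerow([name, path])


# Write to an in-memory buffer here; pass an open file to write to disk instead.
buffer = io.StringIO()
write_annotation_sheet(annotations, buffer)
sheet_text = buffer.getvalue()
print(sheet_text)
```

To write the file to disk, open it with `open("annotationsheet.csv", "w", newline="")` and pass the handle to the function.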
Each annotation file must contain a single type of annotation. For example, if the user provides Pfam, Gene3D, Phobius, and taxonomic annotations (phylum, genus, and species), then six separate annotation files will be required.
The annotation file must consist of two columns:
- Column 1 (sequence_id): the sequence identifiers.
- Column 2 (label): the sequence-specific annotations.
If a sequence has multiple annotations, they should be separated by a semicolon (";"). Sequences with no annotations should not appear in the file.
| sequence_id | label |
|---|---|
| MALV-I-01_sp_EP00398\|sequence00236 | G3DSA:3.40.50.300;G3DSA:2.40.30.230;G3DSA:6.10.140.1240 |
| MALV-I-01_sp_EP00398\|sequence00079 | G3DSA:3.40.50.300;G3DSA:1.10.8.60;G3DSA:2.30.30.190 |
| MALV-I-01_sp_EP00398\|sequence00009 | G3DSA:3.30.60.90 |
| MALV-I-01_sp_EP00398\|sequence00443 | G3DSA:1.25.10.10 |
| MALV-I-01_sp_EP00398\|sequence00399 | G3DSA:3.40.630.30 |
| MALV-I-01_sp_EP00398\|sequence00573 | G3DSA:1.25.40.20 |
| MALV-I-01_sp_EP00398\|sequence00985 | G3DSA:2.130.10.10 |
| MALV-I-01_sp_EP00398\|sequence00545 | G3DSA:1.10.287.3240 |
| MALV-I-01_sp_EP00398\|sequence00949 | G3DSA:2.160.20.60;G3DSA:1.10.1060.10;G3DSA:3.50.50.60;G3DSA:3.20.20.70 |
| MALV-I-01_sp_EP00398\|sequence00334 | G3DSA:3.40.50.720 |
An example of an annotation file is available here.
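
To illustrate the layout, here is a minimal sketch that reads an annotation file into a dictionary and splits multiple annotations on the semicolon. It assumes that the file is tab-separated (suggested by the `.tsv` extension in the examples) and that it starts with a `sequence_id`/`label` header row. The function name `read_annotation_file` is hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_annotation_file(handle: TextIO) -> Dict[str, List[str]]:
    """Map each sequence identifier to its list of annotations."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumes a tab-separated file
    labels: Dict[str, List[str]] = {}
    for row in reader:
        # Multiple annotations for one sequence are separated by ";".
        labels[row["sequence_id"]] = [x for x in row["label"].split(";") if x]
    return labels


# Two rows taken from the example table above.
example = io.StringIO(
    "sequence_id\tlabel\n"
    "MALV-I-01_sp_EP00398|sequence00236\t"
    "G3DSA:3.40.50.300;G3DSA:2.40.30.230;G3DSA:6.10.140.1240\n"
    "MALV-I-01_sp_EP00398|sequence00009\tG3DSA:3.30.60.90\n"
)

labels = read_annotation_file(example)
for seq_id, values in labels.items():
    print(seq_id, values)
```

Sequences without annotations are simply absent from the resulting dictionary, matching the rule that they should not appear in the file.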
The Diamond BLASTp output file contains the following columns:
- qseqid: Query sequence ID.
- qlen: Length of the query sequence.
- qstart: Start position of the alignment on the query sequence.
- qend: End position of the alignment on the query sequence.
- sseqid: Subject sequence ID.
- slen: Length of the subject sequence.
- sstart: Start position of the alignment on the subject sequence.
- send: End position of the alignment on the subject sequence.
- length: Length of the alignment.
- pident: Percentage of identical matches.
- ppos: Percentage of positive matches.
- score: Raw score of the alignment.
- evalue: E-value of the alignment.
- bitscore: Bit score of the alignment.
| qseqid | qlen | qstart | qend | sseqid | slen | sstart | send | length | pident | ppos | score | evalue | bitscore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MALV-I-01_sp_EP00398\|sequence00001 | 475 | 1 | 475 | MALV-I-01_sp_EP00398\|sequence00001 | 475 | 1 | 475 | 475 | 100 | 100 | 2448 | 0.0 | 947 |
| MALV-I-01_sp_EP00398\|sequence00001 | 475 | 197 | 475 | MALV-II-16_sp_EP00396\|sequence00239 | 290 | 16 | 268 | 279 | 56.6 | 71.3 | 812 | 5.11e-107 | 317 |
| MALV-I-01_sp_EP00398\|sequence00001 | 475 | 128 | 207 | MALV-II-16_sp_EP00396\|sequence00238 | 83 | 5 | 82 | 82 | 46.3 | 67.1 | 195 | 2.66e-19 | 79.7 |
| MALV-I-01_sp_EP00398\|sequence00002 | 28 | 1 | 28 | MALV-I-01_sp_EP00398\|sequence00002 | 28 | 1 | 28 | 28 | 100 | 100 | 134 | 1.52e-14 | 56.2 |
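
The column order listed above can be used to load such a file into typed records. The sketch below is an illustration, not the pipeline's own parser: it assumes that the file is tab-separated and has no header line, and the names `COLUMNS` and `read_alignments` are hypothetical.

```python
import csv
import io
from typing import Dict, Iterator, TextIO, Union

# Column order as listed in this section.
COLUMNS = [
    "qseqid", "qlen", "qstart", "qend",
    "sseqid", "slen", "sstart", "send",
    "length", "pident", "ppos", "score", "evalue", "bitscore",
]
INT_COLUMNS = {"qlen", "qstart", "qend", "slen", "sstart", "send", "length"}
FLOAT_COLUMNS = {"pident", "ppos", "score", "evalue", "bitscore"}

Value = Union[str, int, float]


def read_alignments(handle: TextIO) -> Iterator[Dict[str, Value]]:
    """Yield one dictionary per alignment, with numeric columns converted."""
    for fields in csv.reader(handle, delimiter="\t"):  # assumes tab-separated, no header
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row: Dict[str, Value] = dict(zip(COLUMNS, fields))
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        yield row


# The first two rows of the example table above.
example = io.StringIO(
    "MALV-I-01_sp_EP00398|sequence00001\t475\t1\t475\t"
    "MALV-I-01_sp_EP00398|sequence00001\t475\t1\t475\t475\t100\t100\t2448\t0.0\t947\n"
    "MALV-I-01_sp_EP00398|sequence00001\t475\t197\t475\t"
    "MALV-II-16_sp_EP00396|sequence00239\t290\t16\t268\t279\t56.6\t71.3\t812\t5.11e-107\t317\n"
)

hits = list(read_alignments(example))
for hit in hits:
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```

Raising an error on an unexpected column count makes a file produced with a different set of output fields fail early instead of being silently misread.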