Inference of the pan‐ncRNAome
We have developed a pipeline to infer the pan-ncRNAome. The pipeline follows the steps outlined below:
This step converts the cluster file generated by MMseqs2 into the same format as OrthoFinder's clustering output. Cluster names are listed in the rows and genotypes in the columns, with each cell containing the sequence ID of the genotype within each cluster, as in this example.
This step is performed with this script.
Note: Before running this script, it is necessary to carry out an intermediate step to keep only the genotype names in the second column of the clusters file generated by MMseqs2, using the following code:
awk -F '\t' 'BEGIN {OFS = FS} {split($2, arr, "_"); $2 = arr[1]; print}' DB_clust.tsv >> DB_clust_genotypeName.tsv
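For readers who prefer Python, the same reduction can be sketched as follows. This is a minimal illustration, not the code used in the pipeline: the function name `genotype_only` is hypothetical, and it assumes, as the awk command does, that the genotype name is the text before the first underscore of the sequence ID.

```python
def genotype_only(line):
    """Keep only the genotype name in the second tab-separated column."""
    fields = line.rstrip("\n").split("\t")
    # Same assumption as the awk command: the genotype name is the text
    # before the first underscore of the member sequence ID.
    fields[1] = fields[1].split("_")[0]
    return "\t".join(fields)


example = "Co06022_k25_TRINITY_DN6264_c0_g2_i8\tCo06022_k31_TRINITY_DN12248_c0_g2_i5"
print(genotype_only(example))
```

Note that this rule would truncate any genotype name that itself contains an underscore.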
Before (MMseqs2 clustering output format):
Co06022_k31_TRINITY_DN17309_c0_g2_i1 | Co06022_k31_TRINITY_DN17309_c0_g2_i1 |
---|---|
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022_k25_TRINITY_DN6264_c0_g2_i8 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022_k25_TRINITY_DN6264_c0_g2_i10 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022_k25_TRINITY_DN6264_c0_g2_i4 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022_k31_TRINITY_DN12248_c0_g2_i5 |
Intermediate step (MMseqs2 clustering output format with only genotype names):
Co06022_k31_TRINITY_DN17309_c0_g2_i1 | Co06022 |
---|---|
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022 |
Co06022_k25_TRINITY_DN6264_c0_g2_i8 | Co06022 |
After (OrthoFinder-like matrix):
Orthogroup | B1 | B2 | CP74-2005 | Co06022 | group_name |
---|---|---|---|---|---|
OG1 | 0 | 0 | 0 | 1 | OG1 |
OG2 | 0 | 0 | 0 | 4 | OG2 |
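The conversion from the intermediate two-column file to a matrix like the one above could be sketched as below. This is a hedged illustration rather than the pipeline's script: `build_count_matrix` is a hypothetical name, the `OG1`, `OG2`, ... labels are assumed to follow the order in which cluster representatives first appear, and only genotypes that occur in the input become columns (the real matrix also lists genotypes with zero counts, such as B1 and B2).

```python
from collections import Counter, defaultdict


def build_count_matrix(pairs):
    """Count member sequences per genotype for each cluster representative."""
    counts = defaultdict(Counter)
    for representative, genotype in pairs:
        counts[representative][genotype] += 1
    genotypes = sorted({g for per_cluster in counts.values() for g in per_cluster})
    matrix = {}
    # Hypothetical naming: clusters are labelled OG1, OG2, ... in order of
    # first appearance; the pipeline's script may number them differently.
    for index, per_genotype in enumerate(counts.values(), start=1):
        matrix[f"OG{index}"] = [per_genotype[g] for g in genotypes]
    return genotypes, matrix


# (representative, genotype) pairs taken from the intermediate example.
pairs = [
    ("Co06022_k31_TRINITY_DN17309_c0_g2_i1", "Co06022"),
    ("Co06022_k25_TRINITY_DN6264_c0_g2_i8", "Co06022"),
    ("Co06022_k25_TRINITY_DN6264_c0_g2_i8", "Co06022"),
    ("Co06022_k25_TRINITY_DN6264_c0_g2_i8", "Co06022"),
    ("Co06022_k25_TRINITY_DN6264_c0_g2_i8", "Co06022"),
]
genotypes, matrix = build_count_matrix(pairs)
print(genotypes)
print(matrix)
```

With the five example rows, the first cluster receives a count of 1 and the second a count of 4 for Co06022, matching the Co06022 column of the matrix.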
This step removes the last column of the matrix (group_name), which duplicates the cluster names already given in the first column. See this example.
cut -f1-49 DB_clust_groups.tsv >> DB_clust_groups_withoutLastColumn.tsv
We developed a Python script that categorizes each cluster into the following classes:
- Pan (sum of total classes)
- Hard-core (clusters with 100% of genotypes present)
- Soft-core (clusters with 80% of genotypes present)
- Exclusive (clusters with only one genotype present)
- Accessory (clusters with more than one genotype up to 80% of genotypes present)
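One possible reading of the classes above is sketched below for a single cluster. It is an illustration, not the pipeline's script. The function name `classify_cluster` is hypothetical, and the boundaries are assumptions: soft-core is taken as at least 80% but fewer than 100% of genotypes, accessory as more than one genotype but fewer than 80%, and the checks run from hard-core downwards, so a cluster present in every genotype is always hard-core.

```python
def classify_cluster(counts, soft_core_fraction=0.8):
    """Return the class of one cluster from its per-genotype sequence counts."""
    total = len(counts)
    present = sum(1 for count in counts if count > 0)
    if present == 0:
        raise ValueError("cluster has no sequences")
    if present == total:
        return "hard-core"
    # Assumed boundary: at least 80% (but not all) genotypes present.
    if present >= soft_core_fraction * total:
        return "soft-core"
    if present == 1:
        return "exclusive"
    # Assumed boundary: more than one genotype, below the soft-core cut-off.
    return "accessory"


print(classify_cluster([0, 0, 0, 1]))   # one genotype out of four
print(classify_cluster([2, 1, 0, 0]))   # two genotypes out of four
print(classify_cluster([1, 1, 1, 1]))   # all four genotypes
```

The pan set is then simply every classified cluster taken together.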
This bash script runs the classification script on the reformatted similarity matrix from the previous step, generating this file.
Output format example:
Genotypes | Groups | Class | Sequences |
---|---|---|---|
1 | 96872 | pan | 120200 |
1 | 0 | accessory | 0 |
1 | 0 | soft-core | 0 |
1 | 96872 | hard-core | 120200 |
1 | 3310316 | exclusive | 0 |
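As a rough sketch of how rows in this layout could be tallied once each cluster has a class, consider the helper below. It is not the pipeline's script: `summarise` is a hypothetical name, the input is assumed to be one `(class, n_sequences)` pair per cluster, the Genotypes value is simply passed through, and pan is computed as the sum over all classes.

```python
from collections import defaultdict

# Row order follows the output example: pan first, then the four classes.
CLASSES = ("accessory", "soft-core", "hard-core", "exclusive")


def summarise(n_genotypes, clusters):
    """Build output rows from (class, n_sequences) pairs, one pair per cluster."""
    groups = defaultdict(int)
    sequences = defaultdict(int)
    for cluster_class, n_sequences in clusters:
        groups[cluster_class] += 1
        sequences[cluster_class] += n_sequences
    # Pan is the sum of all classes, for both groups and sequences.
    rows = [(n_genotypes, sum(groups.values()), "pan", sum(sequences.values()))]
    for cluster_class in CLASSES:
        rows.append(
            (n_genotypes, groups[cluster_class], cluster_class, sequences[cluster_class])
        )
    return rows


# Made-up input purely for illustration.
rows = summarise(4, [("hard-core", 5), ("exclusive", 1), ("exclusive", 2)])
for row in rows:
    print("\t".join(str(value) for value in row))
```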
This script generates strip plots of the group classes, with the number of identified groups represented on the Y-axis and the number of genotypes on the X-axis.
At the end of the pipeline, we obtain a representation of the group classes as follows: