# Poster supporting file
Index:

* Part 1: Methods
* Part 2: References

# Part 1. Methods

# 1.  Alignment
Chromosome 4 of each species was aligned pairwise with [minimap2](https://github.com/lh3/minimap2/).

Variables:   
CHR_SP1 = fasta file containing chromosome of species 1  
CHR_SP2 = fasta file containing chromosome of species 2  
CHR_SP3 = fasta file containing chromosome of species 3

In [None]:
#!/bin/bash
 ~/Software/minimap2/minimap2 -ax asm5 -f 0.02 -t 4 --eqx CHR_SP1 CHR_SP2 | samtools sort -O BAM - > SP1_SP2.bam
 ~/Software/minimap2/minimap2 -ax asm5 -f 0.02 -t 4 --eqx CHR_SP2 CHR_SP3 | samtools sort -O BAM - > SP2_SP3.bam

samtools index SP1_SP2.bam
samtools index SP2_SP3.bam
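
Because `--eqx` makes minimap2 write `=` (match) and `X` (mismatch) CIGAR operators instead of `M`, alignment identity can be read directly from the CIGAR string. The following is a minimal sketch of that idea, not part of the original pipeline; the function name `eqx_identity` and the example CIGAR are hypothetical, and identity is assumed here to be matches divided by matches, mismatches and gap bases.

```python
import re

# One CIGAR operation: a length followed by an operator letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def eqx_identity(cigar):
    """Return (matches, mismatches, identity) for an =/X style CIGAR string."""
    matches = mismatches = gaps = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "=":
            matches += n
        elif op == "X":
            mismatches += n
        elif op in "ID":
            gaps += n  # inserted or deleted bases
        # clipping (S, H) and other operators are ignored
    aligned = matches + mismatches + gaps
    identity = matches / aligned if aligned else 0.0
    return matches, mismatches, identity

# Hypothetical CIGAR: 5 clipped bases, 95 matches, 2 mismatches, 3 inserted bases.
print(eqx_identity("5S90=2X3I5="))
```

In practice the CIGAR strings would come from the sorted BAM files, for example through `samtools view`.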

# 2. Structural variant calling and synteny plot
[SyRI](https://github.com/schneebergerlab/syri) was used to call structural variants from the alignments, and [plotsr](https://github.com/schneebergerlab/plotsr) to draw the synteny plots.


In [None]:
#!/bin/bash
conda activate syri
syri -c SP1_SP2.bam -r CHR_SP1 -q CHR_SP2 -F B -f --prefix SP1_SP2 --tdgaplen 5000000 --hdrseq  --all --allow-offset 100
syri -c SP2_SP3.bam -r CHR_SP2 -q CHR_SP3 -F B -f --prefix SP2_SP3 --tdgaplen 5000000 --hdrseq  --all --allow-offset 100
# produce synteny plots
plotsr --sr SP1_SP2syri.out --sr SP2_SP3syri.out --genomes genomes.txt --tracks tracks.txt --markers markers.bed
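
To get a quick overview of what SyRI called before plotting, the annotation types in a `syri.out` file can be tallied. This is a rough sketch that assumes the tab-separated SyRI output layout, with the parent ID in column 10 (`-` for top-level annotations) and the annotation type in column 11; the function name and the example rows are hypothetical.

```python
from collections import Counter

def count_annotations(lines, top_level_only=True):
    """Count annotation types in syri.out-style rows (assumed 12 tab-separated columns)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed or empty lines
        parent, annotation = fields[9], fields[10]
        if top_level_only and parent != "-":
            continue  # skip annotations nested inside a larger block (e.g. SNPs)
        counts[annotation] += 1
    return counts

# Hypothetical rows for illustration only.
example = [
    "Chr4\t1\t1000\t-\t-\tChr4\t1\t990\tSYN1\t-\tSYN\t-",
    "Chr4\t1500\t2500\t-\t-\tChr4\t1400\t2400\tINV1\t-\tINV\t-",
    "Chr4\t3000\t4000\t-\t-\tChr4\t2900\t3900\tSYN2\t-\tSYN\t-",
    "Chr4\t3100\t3100\tA\tG\tChr4\t3000\t3000\tSNP1\tSYN2\tSNP\t-",
]
print(count_annotations(example))
```

With a real file, the rows would be read with `open("SP1_SP2syri.out")` and passed to the same function.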


# 3. Repeat discovery and annotation with RepeatModeler and RepeatMasker
[RepeatModeler](https://github.com/Dfam-consortium/RepeatModeler) was run on usegalaxy.org using the complete genomes, followed by clustering with cd-hit-est (following Goubert et al. 2022) to remove redundancy.  
[RepeatMasker](https://github.com/rmhubley/RepeatMasker) was run locally to annotate repeats in chromosome 4, and the RM2BED.py script from the RepeatMasker utilities was used to generate bed files without redundancy.

In [None]:
#!/bin/bash
RepeatMasker -lib SP1-families_clustered.fasta CHR_SP1 -s -a -pa 8
#produce bed file with repeat annotations
python RM2BED.py -o "longer_element" CHR_SP1.out -dmax 20 -m 100
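
Once repeat annotations are available as BED intervals, the proportion of the chromosome covered by repeats can be estimated by merging overlapping intervals and summing their lengths. The sketch below illustrates that calculation under simple assumptions (half-open BED coordinates, a known chromosome length); the function name and the numbers are hypothetical and this is not the procedure used to build the table plotted below.

```python
def covered_bases(intervals):
    """Sum the bases covered by (start, end) intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # The previous merged block is finished; add its length.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# Hypothetical intervals on a hypothetical 1,000 bp chromosome.
repeats = [(0, 100), (50, 150), (200, 250)]
chrom_length = 1000
print(100 * covered_bases(repeats) / chrom_length)  # percentage of bases in repeats
```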


Repeat content was plotted after manually compiling the results from the RepeatMasker *.tbl outputs into a single table (all_table.tsv).

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the manually compiled table (one row per species, one column per repeat class)
df = pd.read_csv("all_table.tsv", sep="\t")
df = df.loc[:, ["TOTAL", "UNCLASSIFIED", "DNA", "LINES", "LTR", "SPECIES", "SIMPLE_REPEATS", "OTHERS (<1%)"]]
# Reshape to long format: one row per (species, repeat class) pair
df = df.melt(["SPECIES"])
palette = ["#785EF0", "#648FFF", "#FE6100"]
fig, ax = plt.subplots(1, figsize=(10, 3))
barplot = sns.barplot(data=df, x="variable", y="value", hue="SPECIES", palette=palette, saturation=1, ax=ax)
plt.xticks(rotation=45)
barplot.spines["top"].set_visible(False)
barplot.spines["right"].set_visible(False)
barplot.set_ylabel("Genome proportion (%)")
barplot.set_xlabel("Repeat Class")
plt.savefig("repeats.svg", dpi=300)

# 4. Centromere analysis


## 4.1. Detection of putative location of centromere
[Nessie](https://github.com/B3rse/nessie) was used to calculate linguistic complexity and entropy for each chromosome separately, in windows of 10,000 bp with a step of 1,000 bp.
The resulting profiles were inspected graphically for drops in complexity and entropy, and the centromere coordinates were extracted by manually searching the output file.

A custom script was used for plotting.

In [None]:
#!/bin/bash
for file in *.fa; do nessie -I "$file" -O "${file}_complexity.tsv" -L -l 10000 -s 1000; done
for file in *.fa; do nessie -I "$file" -O "${file}_entropy.tsv" -E -l 10000 -s 1000; done
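
To make the windowed measure concrete, the sketch below computes Shannon entropy over k-mers in sliding windows. It is only an illustration of the general idea and is not Nessie's implementation; the function names, the toy sequence and the choice of `k=1` are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(seq, k=1):
    """Shannon entropy (bits) of the k-mer distribution in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def sliding_entropy(seq, window, step, k=1):
    """Return (window_start, entropy) for each full window along seq."""
    return [
        (start, shannon_entropy(seq[start:start + window], k))
        for start in range(0, len(seq) - window + 1, step)
    ]

# Toy sequence with a low-complexity stretch in the middle.
profile = sliding_entropy("ACGTAAAAACGT", window=4, step=4)
lowest_start = min(profile, key=lambda item: item[1])[0]
print(profile, lowest_start)
```

A repetitive region such as a satellite array is expected to show up as a run of windows with low values, which is what the plots in this section are used to spot.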

In [None]:
# to graph in python:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
sns.set_theme()

df_c = pd.read_csv("complexity.txt", sep="\t", skiprows=3, header=None)
df_e = pd.read_csv("entropy.txt", sep="\t", skiprows=3, header=None)

fig, ax = plt.subplots(2,1)
complexity = sns.scatterplot(x=df_c.index, y=df_c[1], ax=ax[0], s=1)
entropy = sns.scatterplot(x=df_e.index, y=df_e[1], ax=ax[1], s=1)
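
The manual search for low-complexity stretches could be partly automated by flagging runs of consecutive windows below a threshold. The sketch below assumes that window `i` starts at `i * step`, consistent with plotting against the row index above; the threshold, the function name and the example values are hypothetical, and the centromeres in this work were located manually.

```python
def low_value_runs(values, threshold, step, window):
    """Merge consecutive windows with value < threshold into (start, end) intervals."""
    runs = []
    run_start = None
    last = None
    for i, value in enumerate(values):
        if value < threshold:
            if run_start is None:
                run_start = i  # first window of a new run
            last = i
        elif run_start is not None:
            # Run ended: it spans from its first window start to its last window end.
            runs.append((run_start * step, last * step + window))
            run_start = None
    if run_start is not None:
        runs.append((run_start * step, last * step + window))
    return runs

# Hypothetical entropy values for six consecutive windows.
values = [1.9, 1.9, 0.5, 0.4, 1.8, 1.9]
print(low_value_runs(values, threshold=1.0, step=1000, window=10000))
```

The resulting intervals could then be written to a BED file and inspected alongside the plots before extracting sequence.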

In [None]:
#!/bin/bash
# extract a centromeric fasta
seqkit subseq --bed CHR_SP_centromere.bed CHR_SP.fa

## 4.2. Use NTRprism to detect repetitive profiles of centromeric regions
[NTRprism](https://github.com/altemose/NTRprism) was used to obtain the Nested Tandem Repeat (NTR) "spectrum", which indicates the most abundant tandem repeat periodicities found in an input DNA sequence.


The first plot shows the repeat length distribution at a large scale. Further plots then zoom in on the part of the distribution where the most frequent repeat periodicity is found.

Periodicity is useful for predicting how tandem repeats are organized (as monomers or Higher Order Repeats).
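
The idea behind a periodicity spectrum can be illustrated by recording, for every k-mer, the distance to its previous occurrence and counting how often each distance appears. The sketch below is a simplified illustration in that spirit, not NTRprism's code; the function name and the 10 bp monomer are hypothetical.

```python
from collections import Counter

def kmer_interval_spectrum(seq, k=6):
    """Count distances between consecutive occurrences of each k-mer in seq."""
    last_seen = {}
    spectrum = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            spectrum[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    return spectrum

# Hypothetical 10 bp monomer repeated 20 times in tandem.
monomer = "ACGTTGCAAT"
spectrum = kmer_interval_spectrum(monomer * 20, k=6)
print(spectrum.most_common(1)[0][0])
```

For a perfect tandem array the dominant interval equals the monomer length. In real arrays a second, larger peak at a multiple of the monomer length would be consistent with a higher-order repeat structure.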

In [None]:
#!/bin/bash
perl NTRprism_ProcessFasta_v0.22.pl CHR_SP output_folder 1 3000 30 6 0
perl NTRprism_ProcessFasta_v0.22.pl CHR_SP output_folder 100 3000 30 6 0

# Produce Heatmap
for span in 30000 5000 500; do
    for i in *.bin100.txt; do
        j=$(perl -lae 'if(m/region_(.+)\.span/){print "$1"}' <(echo "$i"))
        Rscript ~/Software/NTRprism/NTRprism_PlotHeatmap.r --args "$i" "$j" "$span" 100 NTRprism_Plot
    done
done


# Produce Spectral plots
for span in 30000 5000 500; do
    for i in *.bin1.txt; do
        j=$(perl -lae 'if(m/region_(.+)\.span/){print "$1"}' <(echo "$i"))
        Rscript ~/Software/NTRprism/NTRprism_PlotSpectrum.r --args "$i" "$j" "$span" NTRprism_Plot
    done
done


## 4.3. Predict monomeric sequences
[Tandem Repeats Finder](https://tandem.bu.edu/trf) was run to recover the monomer sequences of tandem repeats, from which the most abundant monomers were selected. Monomer lengths were then compared manually with the length profiles in the NTRprism output.

In [None]:
#!/bin/bash
~/Software/repeatmasker/trf AcitC1_centromere.fasta 2 7 7 80 10 50 400 -dq
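
Selecting the most abundant monomers from the TRF `.dat` output can be sketched as below. The sketch assumes the usual 15-field record layout (start, end, period size, copy number, ..., consensus, repeat sequence) and takes "abundance" to mean the total number of bases spanned per consensus, which is one possible definition and not necessarily the criterion used here. The function names and the example records are hypothetical.

```python
def parse_trf_dat(lines):
    """Parse TRF .dat repeat records (assumed 15 whitespace-separated fields)."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 15 or not fields[0].isdigit():
            continue  # skip header, parameter and blank lines
        start, end = int(fields[0]), int(fields[1])
        records.append({
            "start": start,
            "end": end,
            "period": int(fields[2]),
            "copies": float(fields[3]),
            "consensus": fields[13],
            "span": end - start + 1,
        })
    return records

def rank_by_span(records):
    """Rank consensus monomers by the total number of bases they span."""
    totals = {}
    for rec in records:
        totals[rec["consensus"]] = totals.get(rec["consensus"], 0) + rec["span"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical .dat content for illustration only.
example = [
    "Parameters: 2 7 7 80 10 50 400",
    "1 400 10 40.0 10 95 2 700 30 20 20 30 1.97 ACGTTGCAAT ACGTTGCAATACGTTGCAAT",
    "1000 1119 12 10.0 12 90 5 200 25 25 25 25 2.00 AACCGGTTAACC AACCGGTTAACCAACCGGTTAACC",
]
print(rank_by_span(parse_trf_dat(example)))
```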


# 5. Prediction of Actinopterygii single copy orthologs
Single-copy orthologs were predicted with [BUSCO](https://busco.ezlab.org/) using the Actinopterygii lineage dataset (actinopterygii_odb10).

In [None]:
#!/bin/bash
busco -i Genome.fa -l /home/ferro/PROGRAMAS/BUSCO_DB/actinopterygii_odb10 -o BUSCO_output -m genome -c 10

# Get Chr 4 genes
grep "Chr4" BUSCO_output/run_actinopterygii_odb10/full_table.tsv > Chr4_table.tsv
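
A plain `grep "Chr4"` keeps every line containing that string, which would also include sequence names such as `Chr40` and BUSCOs that are duplicated or fragmented. A stricter filter can be sketched as below, assuming the tab-separated `full_table.tsv` layout (BUSCO id, status, sequence, gene start, gene end, ...) and that the sequence column holds exactly the chromosome name; the function name and the example rows are hypothetical.

```python
def busco_hits(lines, sequence, status="Complete"):
    """Return (busco_id, start, end) for rows with the given status on one sequence."""
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue  # header and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # e.g. Missing BUSCOs carry no coordinates
        if fields[1] == status and fields[2] == sequence:
            hits.append((fields[0], int(fields[3]), int(fields[4])))
    return hits

# Hypothetical rows for illustration only.
example = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength",
    "1A\tComplete\tChr4\t100\t900\t+\t500.0\t300",
    "2B\tComplete\tChr40\t50\t700\t-\t450.0\t250",
    "3C\tDuplicated\tChr4\t2000\t2600\t+\t300.0\t200",
    "4D\tMissing",
]
print(busco_hits(example, "Chr4"))
```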


# Part 2. References

(not included in the previous sections)

**Methods**  
Centromere Analysis
*   Brekke, T. D., Papadopulos, A. S. T., Julià, E., Fornas, O., Fu, B., Yang, F., de la Fuente, R., Page, J., Baril, T., Hayward, A., & Mulley, J. F. (2023). A New Chromosome-Assigned Mongolian Gerbil Genome Allows Characterization of Complete Centromeres and a Fully Heterochromatic Chromosome. Molecular Biology and Evolution, 40(5), msad115. https://doi.org/10.1093/molbev/msad115

*   Altemose, N. (2022). A classical revival: Human satellite DNAs enter the genomics era. In Seminars in Cell & Developmental Biology (Vol. 128, pp. 2-14). Academic Press.

Repeat content
*   Goubert, C., Craig, R. J., Bilat, A. F., Peona, V., Vogan, A. A., & Protasio, A. V. (2022). A beginner’s guide to manual curation of transposable elements. Mobile DNA, 13(1), 7. https://doi.org/10.1186/s13100-021-00259-7


**Introduction**


*   Salzburger, W. (2018). Understanding explosive diversification through cichlid fish genomics. Nature Reviews Genetics, 19(11), 705-717. https://doi.org/10.1038/s41576-018-0043-9

*   Feldberg, E., Porto, J. I. R., & Bertollo, L. A. C. (2003). Chromosomal changes and adaptation of cichlid fishes during evolution. Fish adaptations, 285-308.
*   Conte, M. A., Joshi, R., Moore, E. C., Nandamuri, S. P., Gammerdinger, W. J., Roberts, R. B., ... & Kocher, T. D. (2019). Chromosome-scale assemblies reveal the structural evolution of African cichlid genomes. Gigascience, 8(4), giz030.  https://doi.org/10.1093/gigascience/giz030


