# Nanopore long-read sequencing uncovers hundreds of structural variants in the Schistosome genome
*Shalini Nair, Elisha Enabulele, Xue Li, Tim Anderson, Neal Platt*

__ABSTRACT__: Nanopore sequencing generates extremely long reads, allowing detailed characterization of genomic features (genome rearrangements/structural variants) that are difficult to assay accurately with Illumina short-read sequencing. Technological improvements in nanopore long-read sequencing have reduced prices and increased accuracy and output. Long-read sequencing was named “Method of the Year 2022” by Nature Methods. Our laboratory is exploring the use of nanopore sequencing for characterizing pathogen genomes. Here, we detail the application of nanopore sequencing to schistosome parasites. These parasites have a 380 Mb genome that is riddled with repetitive elements and structural variants. We have developed protocols that generate reads up to 200 kb, with 50% of reads > 24 kb, and 13.5 Gb of high-quality data from a single R10 flow cell, providing 27x coverage of the parasite genome. Using these new data, we have identified multiple structural variants segregating in a lab strain of *S. mansoni* that would not have been visible using short-read sequencing alone. For example, we found 468 large structural variants (>500 bp) that were completely fixed in our laboratory population relative to the reference genome/strain. We anticipate that characterization of large structural variants may help to identify causative features underlying important schistosome phenotypes. Nanopore sequencing also has many possible uses for other researchers at Texas Biomed, including sequencing entire viral genomes or host MHC, and rapid monitoring of infectious diseases in global populations.

**[08/04/2023] Nanopore analysis ideas**

- Are the genomes of lab populations syntenic or have large changes in gene structure/order been introduced?
  - __This would be extremely interesting__
  - Aside – one possible explanation for the division between Sh populations might be rearrangements/inversions
- What impact do large SVs have on the coding regions of the genome?
  - Comment – define impact here. It would certainly be interesting to know whether SVs in coding sequences seem likely to reduce/eliminate gene expression
- Is there evidence of population specific SVs (including CNVs/gene family expansions/contractions)?
- Can we recover and examine variability in complex regions like SmPoMucs and how is this variability distributed among and within lab populations?
  - __Yes – v important.__
  - Simply identifying the genomic location of these genes will be very important – we would like to be able to state definitively whether SmPoMucs do or do not colocalize with the host specificity QTLs that Fred has mapped.
- Are there large SVs in important regions associated with parasite phenotypes?
  - Important question but difficult to do. I think we can certainly examine how many SVs are segregating in QTL regions, to highlight possible causative SVs
- (Some interesting question about the tandem repeat region in the mitochondrial genome...not sure what yet)
  - There is some old literature about these regions, suggesting v rapid change and extensive polymorphism within individuals. This might be difficult with the population data, but would be great to do when we have data from individual genotypes. Perhaps worth waiting until we have that data.
- __Is it possible to identify SVs that are being selected (for/against)?__
  - I think this is particularly interesting. We expect most SVs to be selected against.
  - Comparing allele frequencies within populations would be informative – we expect most to be at low frequency. High frequency SVs are good candidates for positive selection
  - __Comparing SV burden in coding vs non-coding regions would be informative to understand strength of selection against SVs__
  - __Likewise, comparing SVs in low and high recombination rate regions. We predict higher frequency/larger SVs in high recombination regions__
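The frequency and burden comparisons in the last few bullets could be prototyped along these lines. This is only a sketch on toy data: it assumes an SV table shaped like the one loaded in the Fst section of this notebook, where columns ending in `.MAF` are treated as per-population SV allele frequencies and `gene_ids` is empty when an SV overlaps no annotated gene. The function name `summarize_sv_frequencies`, the population names, and the `high_freq` cutoff of 0.75 are hypothetical choices for illustration.

```python
import pandas as pd

# Toy rows mimicking the SV table (all values are made up for illustration).
toy = pd.DataFrame({
    "ID": ["sv1", "sv2", "sv3", "sv4"],
    "popA.MAF": [0.05, 0.90, 0.10, 1.00],
    "popB.MAF": [0.15, 0.80, 0.00, 1.00],
    "gene_ids": ["gene_1", None, None, "gene_2"],
})


def summarize_sv_frequencies(df, high_freq=0.75):
    """Per-SV mean frequency across populations, plus genic / high-frequency flags.

    `high_freq` is an arbitrary illustrative cutoff, not a validated threshold.
    """
    maf_cols = [c for c in df.columns if c.endswith(".MAF")]
    out = df[["ID"]].copy()
    out["mean_freq"] = df[maf_cols].mean(axis=1)
    out["genic"] = df["gene_ids"].notna()          # True if the SV overlaps a gene
    out["high_freq"] = out["mean_freq"] >= high_freq
    return out


summary = summarize_sv_frequencies(toy)

# SV count and mean frequency in genic vs non-genic regions.
burden = summary.groupby("genic")["mean_freq"].agg(["count", "mean"])
print(summary)
print(burden)
```

Raw counts in coding and non-coding regions would still need to be normalized by the total length of each region class before reading them as a difference in SV burden.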

# Analyzing structural variants

## Prep analyses

In [448]:
import os
from pathlib import Path
import pandas as pd
import numpy as np
# from Bio import SeqIO
# from Bio.SeqRecord import SeqRecord
# import vcf
# from itertools import combinations
# import venn
# import io
# import gffutils
# from sklearn.decomposition import PCA
# from sklearn.preprocessing import StandardScaler

In [449]:
proj_dir = "/master/nplatt/sch_man_ont"
ref_fas = "{}/data/genome/SM_V10.fa".format(proj_dir)
Path("{}/results".format(proj_dir)).mkdir(parents=True, exist_ok=True)

In [3]:
# Lab population identifiers; each population is treated as one sample.
samples = [
    "smor",
    "smle_pzq_es",
    "smle_pzq_er",
    "smbre",
    "smeg",
]

pops = samples

In [4]:
os.chdir("{}/results".format(proj_dir))

# Fst

In [958]:
Path("{}/results/fst".format(proj_dir)).mkdir(parents=True, exist_ok=True)
os.chdir("{}/results/fst".format(proj_dir))

In [959]:
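# Depth threshold (dp >= 8) encoded in the name of the input file.
cov_thresh = 8

# Load the SV allele-frequency/coverage table with overlapping gene IDs.
filt_df = pd.read_csv(f"{proj_dir}/results/svs_and_genes/sv_cat_afs_dp_ge_{cov_thresh}_w_genes.csv", sep=",", header=0)
filt_df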

Unnamed: 0,CHROM,POS,ID,REF,ALT,SVTYPE,SVLEN,smor.MAF,smle_pzq_es.MAF,smle_pzq_er.MAF,smbre.MAF,smeg.MAF,smor.COV,smle_pzq_es.COV,smle_pzq_er.COV,smbre.COV,smeg.COV,gene_ids
0,SM_V10_1,108036,cuteSV.INS.4,C,['CATTATTATTATTATTATTACTACTATTATTATTACTATT'],INS,39.0,1.0000,0.9429,0.8889,1.0000,0.9565,25,70,18,25,23,
1,SM_V10_1,156118,cuteSV.DEL.7,ATGTTATCTTTGGCAGCCTATTTAAACATCTGGGCTACCTGTTCCT...,['A'],DEL,-401.0,0.3500,0.1587,0.3846,0.8750,0.2800,20,63,13,8,25,
2,SM_V10_1,214273,cuteSV.INS.8,T,['TGATGATGATGATAGATATTGATGATGATGATGATGATGATGAT...,INS,51.0,0.1304,0.6667,0.9333,0.8696,0.3125,23,36,15,23,16,
3,SM_V10_1,258698,cuteSV.INS.9,A,['ACTCGGGAATAACATTAGGATCACTTCAATTTTTTTAATAATTT...,INS,4470.0,0.2778,0.1897,0.1429,0.0930,0.3125,18,58,14,43,16,
4,SM_V10_1,264300,cuteSV.INS.10,C,['CCTGGAAGCACTGGACGGCCGTTTCGTCCTATTGGCGGACCCCT...,INS,324.0,0.6579,0.2637,0.5455,0.6269,0.1522,38,91,22,67,46,Smp_318880
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8096,SM_V10_Z,86357919,cuteSV.INS.4444,T,['TTGTTATGTCCCCTGGTGAATTATAAATGGTAACTCTGAGTCTA...,INS,604.0,0.3333,0.2222,0.1250,0.9545,0.3750,18,36,16,22,16,
8097,SM_V10_Z,86361688,cuteSV.INS.4445,A,['AATCAGTAAGCGAGTAGTGATGGAAGTTTGGTTATTTTGGTACA...,INS,325.0,1.0000,1.0000,1.0000,1.0000,1.0000,17,47,26,24,20,
8098,SM_V10_Z,86465601,cuteSV.DEL.5146,GTGCGCGTTCAAAAAGCGGATCACAAACTTTGTAGAAAAAT,['G'],DEL,-40.0,0.5000,0.2022,0.2292,0.0000,0.1667,38,89,48,29,54,
8099,SM_V10_Z,86467302,cuteSV.INS.4446,G,['GAATTCTATATTTTTCAAGGGAGTTGAAATCATAAGTCAATCGA...,INS,229.0,0.5238,0.1860,0.0263,1.0000,0.1500,21,86,38,35,40,


In [764]:
#use grenedalf to calculate per site fst from freq table

In [962]:
#prep for grenedalf with coverages
g_df=filt_df.copy()

# Strip the list-style wrapper (['...']) from ALT; raw strings avoid invalid-escape warnings.
g_df["ALT"] = g_df["ALT"].str.replace(r"^\['", "", regex=True)
g_df["ALT"] = g_df["ALT"].str.replace(r"'\]$", "", regex=True)
g_df.to_csv("to_grenedalf.csv", sep=",", header=True, index=False)
g_df
 
    

Unnamed: 0,CHROM,POS,ID,REF,ALT,SVTYPE,SVLEN,smor.MAF,smle_pzq_es.MAF,smle_pzq_er.MAF,smbre.MAF,smeg.MAF,smor.COV,smle_pzq_es.COV,smle_pzq_er.COV,smbre.COV,smeg.COV,gene_ids
0,SM_V10_1,108036,cuteSV.INS.4,C,CATTATTATTATTATTATTACTACTATTATTATTACTATT,INS,39.0,1.0000,0.9429,0.8889,1.0000,0.9565,25,70,18,25,23,
1,SM_V10_1,156118,cuteSV.DEL.7,ATGTTATCTTTGGCAGCCTATTTAAACATCTGGGCTACCTGTTCCT...,A,DEL,-401.0,0.3500,0.1587,0.3846,0.8750,0.2800,20,63,13,8,25,
2,SM_V10_1,214273,cuteSV.INS.8,T,TGATGATGATGATAGATATTGATGATGATGATGATGATGATGATGA...,INS,51.0,0.1304,0.6667,0.9333,0.8696,0.3125,23,36,15,23,16,
3,SM_V10_1,258698,cuteSV.INS.9,A,ACTCGGGAATAACATTAGGATCACTTCAATTTTTTTAATAATTTTT...,INS,4470.0,0.2778,0.1897,0.1429,0.0930,0.3125,18,58,14,43,16,
4,SM_V10_1,264300,cuteSV.INS.10,C,CCTGGAAGCACTGGACGGCCGTTTCGTCCTATTGGCGGACCCCTCA...,INS,324.0,0.6579,0.2637,0.5455,0.6269,0.1522,38,91,22,67,46,Smp_318880
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8096,SM_V10_Z,86357919,cuteSV.INS.4444,T,TTGTTATGTCCCCTGGTGAATTATAAATGGTAACTCTGAGTCTATT...,INS,604.0,0.3333,0.2222,0.1250,0.9545,0.3750,18,36,16,22,16,
8097,SM_V10_Z,86361688,cuteSV.INS.4445,A,AATCAGTAAGCGAGTAGTGATGGAAGTTTGGTTATTTTGGTACAGA...,INS,325.0,1.0000,1.0000,1.0000,1.0000,1.0000,17,47,26,24,20,
8098,SM_V10_Z,86465601,cuteSV.DEL.5146,GTGCGCGTTCAAAAAGCGGATCACAAACTTTGTAGAAAAAT,G,DEL,-40.0,0.5000,0.2022,0.2292,0.0000,0.1667,38,89,48,29,54,
8099,SM_V10_Z,86467302,cuteSV.INS.4446,G,GAATTCTATATTTTTCAAGGGAGTTGAAATCATAAGTCAATCGAAG...,INS,229.0,0.5238,0.1860,0.0263,1.0000,0.1500,21,86,38,35,40,
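Before handing the table to grenedalf, a rough per-site Fst computed directly from two frequency columns can serve as a sanity check. The sketch below uses a simple heterozygosity-based Fst, (Ht - Hs) / Ht, for a biallelic site in two equally weighted populations. It is not grenedalf's estimator: it ignores read depth and applies no pool-seq corrections, so values will differ. The function name `naive_fst` and the toy frequencies are hypothetical.

```python
import numpy as np


def naive_fst(p1, p2):
    """Naive per-site Fst = (Ht - Hs) / Ht for two populations.

    p1, p2: allele frequencies (scalars or arrays) in each population.
    Returns NaN where the site is monomorphic across both populations (Ht == 0).
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population heterozygosity
    p_bar = (p1 + p2) / 2                              # unweighted mean frequency
    ht = 2 * p_bar * (1 - p_bar)                       # total heterozygosity
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(ht > 0, (ht - hs) / ht, np.nan)


# Toy frequencies: fixed difference, identical intermediate, fixed in both.
fst = naive_fst([1.0, 0.5, 1.0], [0.0, 0.5, 1.0])
print(fst)
```

With the real table, the inputs would be two of the `.MAF` columns, for example `g_df["smor.MAF"]` and `g_df["smbre.MAF"]`.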
