# HRP2 manual investigation

## Please read before continuing. You may have to use `gen3` to run IGV. 

#### This notebook examines new HRP2 deletion calls in Pf8: samples that are either new to Pf8 or were called differently in Pf7. 

The first cell initialises some key functions and shows metadata on HRP2-deleted samples from Pf7, as well as the HRP deletion figure from the Pf7 paper. 

The subsequent cells represent individual analyses, and will usually consist of:

- a `print` statement describing the sample, my analysis, and my verdict,
- a `plot_cnv_diagnostic` call, which displays the corresponding CNV diagnostic plot,
- and an `igv` call, which opens the IGV browser, usually at custom coordinates relevant to the analysis. 

Please take a look at the analyses and let me know if you disagree with anything or want to discuss something with me. I strongly suggest you right-click and `Clear cell output` once you are done with individual analyses. 
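
The three-step cell pattern described above could be wrapped in a single helper. The following is a minimal sketch only: `review_sample` is a hypothetical name that does not exist in this notebook, and the sample ID and verdict text are placeholders. The `plot` and `browse` arguments are passed in as callables so that the sketch runs outside the notebook as well.

```python
def review_sample(sample, gene, locus, note, plot=None, browse=None):
    """Run one analysis: print the verdict, then show the plot and the browser.

    Hypothetical helper, not part of the notebook. `plot` and `browse` are
    callables such as `plot_cnv_diagnostic` and `igv`; passing None skips a step.
    """
    print(f"{sample} ({gene}): {note}")
    if plot is not None:
        plot(sample, gene)
    if browse is not None:
        browse(sample, locus)
    # Return the inputs as a record so verdicts can be collected in a list.
    return {"sample": sample, "gene": gene, "locus": locus, "note": note}


# Stand-in callables so the sketch runs without IGV or the plot files.
calls = []
record = review_sample(
    "SAMPLE_A",
    "HRP2",
    "Pf3D7_08_v3:1,374,100-1,375,347",
    "placeholder verdict",
    plot=lambda s, g: calls.append(("plot", s, g)),
    browse=lambda s, l: calls.append(("igv", s, l)),
)
```

In the notebook itself, the call would presumably pass `plot=plot_cnv_diagnostic` and `browse=igv`.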

In [None]:
import igv_notebook
import pandas as pd

from IPython.display import Image, display

def plot_cnv_diagnostic(sample, gene):
    # Display the pre-rendered CNV diagnostic plot for one sample and gene.
    display(Image(filename=f"<INSERT PATH HERE>/malariagen-pf8-cnv-calling/04_gcnv_calls_validation/results/{gene}/{sample}/plot.png"))

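def igv(sample, locus):
    # Open an IGV browser at `locus` with the gene annotation and the sample's BAM loaded.
    REF = "<INSERT PATH HERE>/resource-bundle/PlasmoDB-54_Pfalciparum3D7_Genome.fasta"
    GFF = "<INSERT PATH HERE>/resource-bundle/snpEff/data/Pfalciparum3D7_PlasmoDB_55/PlasmoDB-55_Pfalciparum3D7.gff"
    # Two-column TSV read without a header row: sample ID, then path to its BAM.
    meta = pd.read_csv("../../assets_pf8/01_paths_to_bams.tsv", sep="\t", names=["SAMPLE", "PATH"])

    # Fail early with a readable message if the sample has no BAM listed.
    assert sample in meta.SAMPLE.tolist(), f"{sample} not found in 01_paths_to_bams.tsv"

    path_to_bam = meta.loc[meta.SAMPLE == sample, "PATH"].values[0]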
    
    igv_notebook.init()
    
    b = igv_notebook.Browser({
        "reference": {
            "name": "Plasmodium falciparum (3D7)",
            "fastaPath": REF,
            "indexed": False
        },
        "locus": locus
    })
    
    b.load_track({
        "name": "genes",
        "type": "annotation",
        "format": "gff3",
        "path": GFF,
        "displayMode": "COLLAPSED",
    })
    
    # Alignment track for the sample's BAM; the index is expected at <bam>.bai.
    b.load_track({
        "path": path_to_bam,
        "indexPath": path_to_bam + ".bai",
        "format": "bam",
        "type": "alignment",
        "displayMode": "EXPANDED",
        "height": 800,
        "squishedRowHeight": 5,
        "viewAsPairs": True,
        "showSoftClips": True
    })

display(Image(url = "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f4e4/9971654/c2bf7ea4a9e7/wellcomeopenres-8-20716-g0004.jpg"))

df = pd.read_csv("hrp_calls_pf7.tsv", sep = "\t")
meta = pd.read_csv("../../assets_pf8/Pf_8_samples_20241212.txt", sep = "\t", usecols = ["Sample", "Study", "Country", "Year"])
meta.Year = meta.Year.fillna(-1).astype(int)

pd.set_option("display.max_rows", None)
df.loc[df.HRP2_breakpoint != "-", ["Sample", "HRP2_breakpoint", "HRP2_deletion_type"]].reset_index(drop = True).merge(meta, on = "Sample")

| | Sample | HRP2_breakpoint | HRP2_deletion_type | Study | Country | Year |
|---|---|---|---|---|---|---|
| 0 | PJ0135-Cx | Pf3D7_08_v3:1374462 | Telomere healing | 1146-PF-MULTI-PRICE | Indonesia | 2011 |
| 1 | PJ0233-C | Pf3D7_08_v3:1374462 | Telomere healing | 1146-PF-MULTI-PRICE | Indonesia | 2010 |
| 2 | PJ0258-C | Pf3D7_08_v3:1374462 | Telomere healing | 1146-PF-MULTI-PRICE | Indonesia | 2013 |
| 3 | PP0002-C | Pf3D7_08_v3:1374932 | Telomere healing | 1013-PF-PEGB-BRANCH | Peru | 2011 |
| 4 | PP0011-C | Pf3D7_08_v3:1374932 | Telomere healing | 1013-PF-PEGB-BRANCH | Peru | 2011 |
| 5 | PP0017-C | Pf3D7_08_v3:1374986 | Telomere healing | 1013-PF-PEGB-BRANCH | Peru | 2011 |
| 6 | PP0024-C | Pf3D7_08_v3:1374986 | Telomere healing | 1145-PF-PE-GAMBOA | Peru | 2009 |
| 7 | PP0025-C | Pf3D7_08_v3:1374986 | Telomere healing | 1145-PF-PE-GAMBOA | Peru | 2009 |
| 8 | PP0026-C | Pf3D7_08_v3:1374986 | Telomere healing | 1145-PF-PE-GAMBOA | Peru | 2009 |
| 9 | PP0028-C | Pf3D7_08_v3:1374986 | Telomere healing | 1145-PF-PE-GAMBOA | Peru | 2009 |
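
The Pf7 breakpoints listed above are a natural starting point for IGV coordinates. As a rough sketch, a breakpoint in `chrom:position` form could be turned into a locus string for `igv` as follows. The helper name `locus_around` and the 600 bp default flank are assumptions made for illustration, not values taken from this analysis.

```python
def locus_around(breakpoint, flank=600):
    """Build an IGV locus string centred on a 'chrom:position' breakpoint.

    Hypothetical helper; `flank` (bases shown on each side) is an arbitrary default.
    """
    # Split on the last colon so the chromosome name is kept intact.
    chrom, pos = breakpoint.rsplit(":", 1)
    pos = int(pos.replace(",", ""))
    start = max(1, pos - flank)  # IGV coordinates are 1-based
    end = pos + flank
    # Thousands separators match the locus strings used elsewhere in this notebook.
    return f"{chrom}:{start:,}-{end:,}"


print(locus_around("Pf3D7_08_v3:1374462"))
# Pf3D7_08_v3:1,373,862-1,375,062
```

The returned string can be passed as the `locus` argument of `igv`.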


In [None]:
print("Example 1: Called as GT=-1 in Pf7 but GT=1 in Pf8. I think I could believe something is going on here, but safest to call as -1 in my opinion.")
plot_cnv_diagnostic("RCN13093", "HRP2")
igv("RCN13093", "Pf3D7_08_v3:1,374,100-1,375,347")

---
# New to Pf8

#### The following four samples look like real deletions based on the CNV diagnostic plots. However, there is practically no confirmation from IGV. Reads at the suspected breakpoint region look really dodgy. All samples will be called as -1, meaning there are no new HRP2-deleted samples in Pf8. 

In [None]:
print("Really dodgy evidence. Sort of a telomere repeat detected. Very messy but there is a single GGGTTCA in a repetitive soft clip in just one read, whose mate read is on chromosome 13, far to the right of HRP3, so probably a dodgy mapping. Mate CIGAR isn't that good anyway. ")
plot_cnv_diagnostic("SPT42757", "HRP2")
igv("SPT42757", "Pf3D7_08_v3:1,370,859-1,371,173")
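
The verdict in the cell above rests on spotting a telomere repeat by eye in a soft clip. A small sketch of how such a check could be scripted is given below. It assumes the *P. falciparum* telomere repeat unit is GGGTT(T/C)A, which is how it is commonly described and which covers the GGGTTCA seen here. The function name `count_telomere_repeats` and the example sequences are made up for illustration; they are not reads from these samples.

```python
import re

# Assumed repeat unit GGGTT(T/C)A and its reverse complement T(A/G)AACCC.
TELOMERE_FWD = re.compile(r"GGGTT[TC]A")
TELOMERE_REV = re.compile(r"T[AG]AACCC")


def count_telomere_repeats(seq):
    """Count non-overlapping telomere repeat units on each strand of `seq`."""
    seq = seq.upper()
    return {
        "forward": len(TELOMERE_FWD.findall(seq)),
        "reverse": len(TELOMERE_REV.findall(seq)),
    }


# Made-up sequences: a single unit versus a tandem run of three units.
single_unit = count_telomere_repeats("ATATGGGTTCAATAT")
tandem_run = count_telomere_repeats("ACGTGGGTTCAGGGTTTAGGGTTCA")
```

A single 7-base match, as in the read described above, can easily occur by chance, so it is weak evidence on its own; a run of several tandem units in the soft clip would be more convincing support for telomere healing.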

In [None]:
plot_cnv_diagnostic("RCN15908", "HRP2")
igv("RCN15908", "Pf3D7_08_v3:1,371,062-1,371,218")

In [None]:
plot_cnv_diagnostic("RCN18529", "HRP2")
igv("RCN18529", "Pf3D7_08_v3:1,371,062-1,371,218")

In [None]:
plot_cnv_diagnostic("RCN18542", "HRP2")
igv("RCN18542", "Pf3D7_08_v3:1,370,859-1,371,173")