# HRP3 manual investigation

## Please read before continuing. You may have to use `gen3` to run IGV. 

#### This notebook examines new HRP3 deletions in Pf8, i.e. samples that are either new to Pf8 or were called differently in Pf7.

The first cell will initialise some key functions and show you some metadata on HRP3-deleted samples from Pf7, as well as the HRP deletion figure from the Pf7 paper. 

The subsequent cells represent individual analyses, and will usually consist of:

- a `print` statement which describes the nature of the sample, my analysis, and my verdict,
- a `plot_cnv_diagnostic` function, which shows you the corresponding CNV diagnostic plot,
- and an `igv` function, which shows you the IGV browser, usually pointing you to custom coordinates that are relevant to the analysis. 

Please take a look at the analyses and let me know if you disagree with anything or want to discuss something with me. I strongly suggest you right-click and `Clear cell output` once you are done with individual analyses. 
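
Because most analyses hinge on whether a quoted breakpoint sits inside the window passed to `igv`, a small helper for parsing locus strings can be handy. The following is a minimal sketch, not part of the original workflow: `parse_locus` and `locus_contains` are hypothetical names, and the sketch assumes loci are written as `contig:start-end` or `contig:position`, with optional thousands separators.

```python
import re

# Assumed locus format: "contig:start-end" or "contig:position", commas optional.
LOCUS_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>[\d,]+)(?:-(?P<end>[\d,]+))?$")

def parse_locus(locus):
    """Split a locus string into (contig, start, end) with integer coordinates."""
    match = LOCUS_RE.match(locus)
    if match is None:
        raise ValueError(f"Unrecognised locus string: {locus!r}")
    start = int(match["start"].replace(",", ""))
    # A single-position locus is treated as a window of length one.
    end = int(match["end"].replace(",", "")) if match["end"] else start
    return match["contig"], start, end

def locus_contains(locus, position):
    """Return True if position falls inside the locus window (inclusive)."""
    _, start, end = parse_locus(locus)
    return start <= position <= end

print(parse_locus("Pf3D7_13_v3:2,806,781-2,806,917"))
# ('Pf3D7_13_v3', 2806781, 2806917)
print(locus_contains("Pf3D7_13_v3:2,806,781-2,806,917", 2806840))  # True
print(locus_contains("Pf3D7_13_v3:2800004-2807159", 2807399))      # False
```

The last call illustrates the situation flagged later in the notebook, where an observed breakpoint of 2,807,399 lies outside the range `Pf3D7_13_v3:2800004-2807159`.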

In [None]:
import igv_notebook
import pandas as pd

from IPython.display import Image, display

def plot_cnv_diagnostic(sample, gene):
    display(Image(filename = f"<INSERT PATH HERE>/malariagen-pf8-cnv-calling/04_gcnv_calls_validation/results/{gene}/{sample}/plot.png"))

def igv(sample, locus):
    REF = "<INSERT PATH HERE>/resource-bundle/PlasmoDB-54_Pfalciparum3D7_Genome.fasta"
    GFF = "<INSERT PATH HERE>/resource-bundle/snpEff/data/Pfalciparum3D7_PlasmoDB_55/PlasmoDB-55_Pfalciparum3D7.gff"
    meta = pd.read_csv("../../assets_pf8/01_paths_to_bams.tsv", sep = "\t", names = ["SAMPLE", "PATH"])

    assert sample in meta.SAMPLE.tolist(), f"{sample} not found in 01_paths_to_bams.tsv"

    path_to_bam = meta.loc[meta.SAMPLE == sample, "PATH"].values[0]
    
    igv_notebook.init()
    
    b = igv_notebook.Browser({
        "reference": {
            "name": "Plasmodium falciparum (3D7)",
            "fastaPath": REF,
            "indexed": False
        },
        "locus": locus
    })
    
    b.load_track({
        "name": "genes",
        "type": "annotation",
        "format": "gff3",
        "path": GFF,
        "displayMode": "COLLAPSED",
    })
    
    b.load_track({
        "path": path_to_bam,
        "indexPath": path_to_bam + ".bai",
        "format": "bam",
        "type": "alignment",
        "displayMode": "EXPANDED",
        "height": 800,
        "squishedRowHeight": 5,
        "viewAsPairs": "true",
        "showSoftClips": "true"
    })

display(Image(url = "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f4e4/9971654/c2bf7ea4a9e7/wellcomeopenres-8-20716-g0004.jpg"))

df = pd.read_csv("hrp_calls_pf7.tsv", sep = "\t")
meta = pd.read_csv("../../assets_pf8/Pf_8_samples_20241212.txt", sep = "\t", usecols = ["Sample", "Study", "Country", "Year"])
meta.Year = meta.Year.fillna(-1).astype(int)

pd.set_option("display.max_rows", None)
df.loc[df.HRP3_breakpoint != "-", ["Sample", "HRP3_breakpoint", "HRP3_deletion_type"]].reset_index(drop = True).merge(meta, on = "Sample")

In [None]:
print("Example 1: Confirmed Pf7 example of chromosome 11 recombination from Peru, 2009. Left breakpoint has XA=chromosome 11 and has coords 2,806,840")
plot_cnv_diagnostic("PP0029-C", "HRP3")
igv("PP0029-C", "Pf3D7_13_v3:2,806,781-2,806,917")

In [None]:
print("Example 2: Confirmed Pf7 example of telomere healing from Cambodia, 2009. GGGTTCA - telomere repeat")
print('From Richard: ...I think the mates always map to a telomeric sequence, though which one is somewhat random, which is probably to be expected as the telomeric sequences at each end of each chromosome are essentially identical. This means that when looking for telomere healing events like those around hrp genes we could look for mates mapping to telomeres in addition to "split reads" containing telomeric sequence. You can see some such reads in the IGV plot which end before the breakpoint but for which the mate is in a telomere. A few years ago we were working with someone called Fai who was doing a more comprehensive analysis of telomere healing events, but never finished the work before moving on to a different job.')
plot_cnv_diagnostic("PH1616-C", "HRP3")
igv("PH1616-C", "Pf3D7_13_v3:2,837,329-2,837,492")

In [None]:
print("Example 3: Confirmed Pf7 example of chromosome 5 recombination, 2009")
plot_cnv_diagnostic("PH0959-Cx", "HRP3")
igv("PH0959-Cx", "Pf3D7_13_v3:2,835,143-2,835,982")

---
# Pf7 samples that may require revisiting
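
Several calls in this notebook rest on spotting the telomere repeat `GGGTTCA` in soft-clipped read sequence in IGV. As a rough aid, the sketch below counts the longest run of back-to-back copies of that motif in a sequence string, on either strand. It is an illustration only: `longest_tandem_run` is a hypothetical helper, it matches exact copies of the motif quoted above (variant repeat units are not counted), and the input sequences are made up.

```python
TELOMERE_REPEAT = "GGGTTCA"  # motif quoted in the analyses in this notebook

def reverse_complement(seq):
    """Reverse-complement an A/C/G/T sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def longest_tandem_run(seq, motif=TELOMERE_REPEAT):
    """Largest number of consecutive exact copies of motif, on either strand."""
    seq = seq.upper()
    best = 0
    for unit in (motif, reverse_complement(motif)):
        start = seq.find(unit)
        while start != -1:
            # Count how many copies follow each other from this start point.
            copies, pos = 0, start
            while seq.startswith(unit, pos):
                copies += 1
                pos += len(unit)
            best = max(best, copies)
            start = seq.find(unit, start + 1)
    return best

# Made-up sequences, purely for illustration.
print(longest_tandem_run("ACGT" + "GGGTTCA" * 3))  # 3
print(longest_tandem_run("TGAACCC" * 2 + "AT"))    # 2 (reverse strand)
print(longest_tandem_run("ACGTACGT"))              # 0
```

A read whose soft-clipped portion gives a run of several copies would be a candidate for telomere healing, but the threshold is a judgment call and the IGV view remains the primary evidence.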

In [None]:
print("Figure 1: Pf7 uncallable. Pf8 potentially del? Breakpoint different to Example 1, but lots of XA and Mate chromosomes point to chromosome 11 at left breakpoint. Left breakpoint coords are 2,807,173. Right breakpoint points to chromosome 4? Need help calling this. Calling as GT=1 for now. ")
plot_cnv_diagnostic("SPT16994", "HRP3")
igv("SPT16994", "Pf3D7_13_v3:2,807,067-2,807,340")

In [None]:
print("Figure 2: Pf7 uncallable. Pf8 potentially del? Telomere healing. GGGTTCA - telomere repeat. Pretty happy for this to be called as GT=1 personally. Coords 2,837,144")
plot_cnv_diagnostic("PV0132-C", "HRP3")
igv("PV0132-C", "Pf3D7_13_v3:2,836,863-2,837,411")

In [None]:
print("Figure 3: Pf7 uncallable. Pf8 potentially del? Quite a clear stretch of near-zero coverage. Lots of F1F2 pair orientation reads on the left breakpoint. Not sure if that indicates something in particular. Lots of junk reads at the right breakpoint. Will call -1 for now. ")
plot_cnv_diagnostic("RCN02146", "HRP3")
igv("RCN02146", "Pf3D7_13_v3:2,830,001-2,832,197")
igv("RCN02146", "Pf3D7_13_v3:2,844,941-2,845,854")

In [None]:
print("Figure 4: Pf7 uncallable. Pf8 potentially del? Same comments as Figure 3. Will call -1 for now. ")
plot_cnv_diagnostic("SPT35228", "HRP3")
igv("SPT35228", "Pf3D7_13_v3:2,830,001-2,832,197")
igv("SPT35228", "Pf3D7_13_v3:2,844,941-2,845,854")

In [None]:
print("Figure 5: Pf7 uncallable. Pf8 potentially del? Left breakpoint somewhat similar to Figure 1, but seemingly no right breakpoint. Left breakpoint's XA points to chromosome 11. Calling as GT=1 for now. Coords 2,807,399 which are different to Figure 1's. A little worried about the low complexity region here BUT I think it would have enough complexity towards the 5' end for this to be real. ")
plot_cnv_diagnostic("SPT38537", "HRP3")
igv("SPT38537", "Pf3D7_13_v3:2,807,007-2,807,799")

In [None]:
print("Figure 6: Pf7 uncallable. Pf8 potentially del? Lots of XA values point to chromosome 11. Calling as GT=1 for now. Same exact left breakpoint as Figure 5: 2,807,399")
plot_cnv_diagnostic("SPT34382", "HRP3")
igv("SPT34382", "Pf3D7_13_v3:2,806,832-2,807,943")

---
# New to Pf8

In [None]:
print("Figure 7: New to Pf8. Telomere healing. GGGTTCA detected. Breakpoint: 2,835,899. Calling as GT=1")
plot_cnv_diagnostic("SPT43994", "HRP3")
igv("SPT43994", "Pf3D7_13_v3:2,835,645-2,836,252")

In [None]:
print("Figure 8: New to Pf8. Chromosome 11 recombination. Supplementary alignments to chromosome 11. Breakpoint: 2,807,296, but the exact coords of this one are likely unreliable. It's different to the other coords and only one read with soft clips. Calling as GT=1")
plot_cnv_diagnostic("RCN15335", "HRP3")
igv("RCN15335", "Pf3D7_13_v3:2,806,832-2,807,943")

In [None]:
print("Figure 8b: New to Pf8. Chromosome 11 recombination. Supplementary alignments to chromosome 11. Breakpoint: 2,807,399. Calling as GT=1")
plot_cnv_diagnostic("RCN14910", "HRP3")
igv("RCN14910", "Pf3D7_13_v3:2,806,832-2,807,943")

In [None]:
print("Figure 9: New to Pf8. Telomere healing. GGGTTCA detected. One clean read with the telomere repeat sequence and one dodgy read. Calling as GT=1. Note that this sample breaks the paradigm about how we shouldn't trust samples with copy number gradually decreasing. This is however in a very repetitive part of HRP3 - note that the amino acid sequence has HA[G/D]DH repeats. This feels a bit riskier than some of the other calls. breakpoint: 2,840,859")
plot_cnv_diagnostic("SPT83055", "HRP3")
igv("SPT83055", "Pf3D7_13_v3:2,840,672-2,841,155")

In [None]:
print("Figure 10: New to Pf8. Telomere healing. GGGTTCA detected. Very similar to Figure 9. Just one read of evidence though. Would suggest calling as GT=1, breakpoint 2,840,825")
plot_cnv_diagnostic("SPT83047", "HRP3")
igv("SPT83047", "Pf3D7_13_v3:2,840,672-2,841,155")

In [None]:
print("Figure 11: New to Pf8. I will mark this as GT=1 for now, but read pair evidence is a bit weak. Maybe two reads that point to chromosome 11. Breakpoint at the usual 2,807,399 coords.")
plot_cnv_diagnostic("RCN15215", "HRP3")
igv("RCN15215", "Pf3D7_13_v3:2,807,252-2,807,572")

In [None]:
print("Figure 12: New to Pf8. I will mark this as GT=1 for now, but similar to Figure 11, read pair evidence is a bit weak. Maybe an abundance of reads nearby that point to chromosome 11? ")
plot_cnv_diagnostic("SPT83039", "HRP3")
igv("SPT83039", "Pf3D7_13_v3:2,807,252-2,807,572")

In [None]:
print("Figure 13: New to Pf8. I will mark this as GT=1 for now, but again, read pair evidence is weak. Maybe an abundance of reads nearby that point to chromosome 11? A little similar to Figure 1")
plot_cnv_diagnostic("SPT83076", "HRP3")
igv("SPT83076", "Pf3D7_13_v3:2,806,832-2,807,943")

In [None]:
print("Figure 14: Same comments as Figure 13 above. Calling GT=1")
plot_cnv_diagnostic("RCN15202", "HRP3")
igv("RCN15202", "Pf3D7_13_v3:2,806,832-2,807,943")

In [None]:
print("Figure 15: Same comments as Figure 13 and 14 above. Calling GT=1")
plot_cnv_diagnostic("RCN15231", "HRP3")
igv("RCN15231", "Pf3D7_13_v3:2,806,832-2,807,943")

In [None]:
print("Figure 16: Breakpoint 2,807,399. Calling as GT=1")
plot_cnv_diagnostic("SPT83051", "HRP3")
igv("SPT83051", "Pf3D7_13_v3:2,806,832-2,807,943")

In [None]:
print("Figure 17: Breakpoint 2,807,399. Calling as GT=1")
plot_cnv_diagnostic("SPT83074", "HRP3")
igv("SPT83074", "Pf3D7_13_v3:2,806,832-2,807,943")

In [None]:
print("Figure 18: Breakpoint 2,807,399. Calling as GT=1")
plot_cnv_diagnostic("SPT50714", "HRP3")
igv("SPT50714", "Pf3D7_13_v3:2,806,832-2,807,943")

---
---
---

# Given that Pf8 has no new hrp2 deletions (see the other notebook - `2024_12_01_-_breakpoint_hrp2.ipynb`), I will simply make the hrp breakpoints file in this notebook below:

In [None]:
# loading old Pf7 hrp breakpoints file
pf7_hrp = pd.read_csv("hrp_calls_pf7.tsv", sep = "\t")

# loading QC pass Pf8 samples
pf8_meta = pd.read_csv("../../assets_pf8/Pf_8_samples_20241212.txt", sep = "\t", usecols = ["Sample", "QC pass"])
pf8_meta = pf8_meta.loc[pf8_meta["QC pass"] == True, "Sample"].to_frame()

# loading HRP3-deleted samples in Pf8
pf8_cnv = (
    pd.read_csv("../../04_gcnv_calls_validation/2024_12_02_pf8_coverage_calls.tsv",
                sep = "\t",
                usecols = ["SAMPLE", "HRP2", "HRP3"])
    .rename(columns = {"SAMPLE": "Sample", "HRP3": "Pf8 HRP3", "HRP2": "Pf8 HRP2"})
)

# merging all together
pf8_hrp = pf8_meta.merge(pf7_hrp, on = "Sample", how = "left").merge(pf8_cnv, on = "Sample", how = "left")

In [None]:
print("1. Distribution of HRP3 GT calls in Pf8:")
print(pf8_cnv["Pf8 HRP3"].value_counts().to_dict())
print()

print("2. Distribution of HRP3 GT calls in Pf7:")
print(pf7_hrp.HRP3.value_counts().to_dict())
print()

# just double-checking I haven't lost any deletions
print("3. Distribution of combinations of old and new GTs:")
print(pf8_hrp[["HRP3", "Pf8 HRP3"]].fillna("new to pf8").value_counts())

# 198 + 13 + 4 = 215, which is the number we're looking for. 

# However, Pf7 had 201 HRP3-deleted samples. Here, there are 198 samples with the [del, 1] combination. Where did the three samples go?

# Analysing the above

First, Excerpt 1 tells us to expect 215 samples with deletions. Excerpt 3 shows no worrying combinations such as (nodel, 1) or (del, 0), which would mean that a call flipped from 0 to 1, or from 1 to 0, between Pf7 and Pf8. 

If we focus on the last three rows of Excerpt 3, we see combinations: 
- (del, 1) : samples that were called as del in Pf7 and Pf8
- (new to pf8, 1) : samples that are new to Pf8 and are HRP3-deleted
- (uncallable, 1) : samples that were previously called as -1 in Pf7 but are now 1

These sample numbers sum to 215, which matches Excerpt 1, so nothing catastrophic is going on here. 
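
The same reasoning can be expressed as an explicit check rather than read off the printed table. The sketch below runs on a made-up toy table, not the real Pf8 data: it flags any sample whose call flipped between releases and breaks the Pf8 deletions down by their Pf7 status. `unexpected_transitions` is a hypothetical helper; the column names follow the merged `pf8_hrp` frame above.

```python
import pandas as pd

# Toy data only: Pf7 call in "HRP3" (None = sample new to Pf8), Pf8 GT in "Pf8 HRP3".
toy = pd.DataFrame({
    "Sample": ["A", "B", "C", "D", "E"],
    "HRP3": ["del", "nodel", "uncallable", None, "nodel"],
    "Pf8 HRP3": [1, 0, 1, 1, 0],
})

def unexpected_transitions(df, old="HRP3", new="Pf8 HRP3"):
    """Samples whose call flipped: (nodel, 1) or (del, 0)."""
    flipped = ((df[old] == "nodel") & (df[new] == 1)) | ((df[old] == "del") & (df[new] == 0))
    return df.loc[flipped, "Sample"].tolist()

print(unexpected_transitions(toy))  # []

# Break down Pf8 deletions by where they came from in Pf7.
deleted = toy[toy["Pf8 HRP3"] == 1]
by_origin = deleted["HRP3"].fillna("new to pf8").value_counts().to_dict()
print(by_origin)

# The per-origin counts must add up to the total number of Pf8 deletions.
assert sum(by_origin.values()) == len(deleted)
```

On the real table, an empty list from `unexpected_transitions(pf8_hrp)` would correspond to the absence of (nodel, 1) and (del, 0) rows noted above.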

Next, Excerpt 2 shows that 201 samples were HRP3-deleted in Pf7. In theory this should equal the number of samples with the combination (del, 1), but that number is only 198, so three samples are missing. The cell below counts the samples that were HRP3-deleted in Pf7 but now fail sample QC for Pf8. The missing three samples are here. 

In [None]:
sum([sample not in pf8_meta.Sample.values for sample in pf7_hrp.loc[pf7_hrp.HRP3 == "del", "Sample"].tolist()])

---

# We're good to proceed. Let's build the hrp breakpoint file

In [None]:
pf8_hrp_breakpoint = pf8_hrp.drop(columns = ["HRP2", "HRP3"]).rename(columns = {"Pf8 HRP2": "HRP2", "Pf8 HRP3": "HRP3"})
pf8_hrp_breakpoint[["HRP2", "HRP3"]] = pf8_hrp_breakpoint[["HRP2", "HRP3"]].replace({
    -1: "uncallable",
    0: "nodel",
    1: "del"
})

pf8_hrp_breakpoint.head()

In [None]:
breakpoint_labels_dict = {
    "RCN14910": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"), # breakpoint is maybe 2,807,399 which is outside this range
    "RCN15202": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),
    "RCN15215": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"), # similarly, breakpoint maybe slightly outside?
    "RCN15231": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),
    "RCN15335": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"), # different to ones above, but still might be outside quoted range
    "SPT43994": ("Pf3D7_13_v3:2835899", "Telomere healing"), # new, but confident this is the breakpoint
    "SPT50714": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),
    "SPT83039": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),
    "SPT83047": ("Pf3D7_13_v3:2840825", "Telomere healing"), # only one read evidence and it's soft-clipped on both ends..
    "SPT83051": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),
    "SPT83055": ("Pf3D7_13_v3:2840859", "Telomere healing"),
    "SPT83074": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),
    "SPT83076": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),

    # The following samples were called as -1 in Pf7
    "PV0132-C": ("Pf3D7_13_v3:2837144", "Telomere healing"),
    "SPT16994": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),
    "SPT34382": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),
    "SPT38537": ("Pf3D7_13_v3:2800004-2807159", "Chromosome 11 recombination"),
}

for sample, values in breakpoint_labels_dict.items():
    pf8_hrp_breakpoint.loc[pf8_hrp_breakpoint.Sample == sample, "HRP3_breakpoint"] = values[0]
    pf8_hrp_breakpoint.loc[pf8_hrp_breakpoint.Sample == sample, "HRP3_deletion_type"] = values[1]

pf8_hrp_breakpoint[["HRP2_breakpoint", "HRP3_breakpoint"]] = pf8_hrp_breakpoint[["HRP2_breakpoint", "HRP3_breakpoint"]].fillna("-")

---

# Final sanity checks
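
Alongside the value counts below, a couple of structural checks could be run on the table before export. This is a minimal sketch on toy data, assuming the final table should have unique `Sample` IDs and only the three call labels used in this notebook; `check_hrp_table` is a hypothetical helper, not part of the original workflow.

```python
import pandas as pd

ALLOWED_CALLS = {"del", "nodel", "uncallable"}

def check_hrp_table(df):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if df["Sample"].duplicated().any():
        problems.append("duplicate Sample IDs")
    for gene in ("HRP2", "HRP3"):
        # NaN and any leftover integer codes both count as unexpected values.
        bad = ~df[gene].isin(ALLOWED_CALLS)
        if bad.any():
            problems.append(f"{gene}: {int(bad.sum())} unexpected call value(s)")
    return problems

# Toy tables, purely for illustration.
good = pd.DataFrame({
    "Sample": ["A", "B"],
    "HRP2": ["nodel", "nodel"],
    "HRP3": ["del", "uncallable"],
})
broken = pd.DataFrame({
    "Sample": ["A", "A"],
    "HRP2": ["nodel", None],
    "HRP3": ["del", 1],
})

print(check_hrp_table(good))    # []
print(check_hrp_table(broken))
```

Calling `check_hrp_table(pf8_hrp_breakpoint)` would flag samples missing from the coverage calls file, since the left merge leaves their calls as NaN.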

In [None]:
print(pf8_hrp_breakpoint[["HRP3_breakpoint", "HRP3_deletion_type"]].value_counts())

pf8_hrp_breakpoint[["HRP3_breakpoint", "HRP3_deletion_type"]].value_counts().sum()

In [None]:
print(pf8_hrp_breakpoint[["HRP2_breakpoint", "HRP2_deletion_type"]].value_counts())

pf8_hrp_breakpoint[["HRP2_breakpoint", "HRP2_deletion_type"]].value_counts().sum()

---

# All good. Exporting as `hrp_calls_pf8.tsv`

In [None]:
pf8_hrp_breakpoint[['Sample', 'HRP2', 'HRP3', 'HRP2_breakpoint', 'HRP3_breakpoint', 'HRP2_deletion_type', 'HRP3_deletion_type']].to_csv("hrp_calls_pf8.tsv", sep = "\t", index = False)