In [1]:
#| hide
%load_ext autoreload
%autoreload 2

In [8]:
#| hide
from cryptosporidium_host_adaptation.core import *
import matplotlib.pyplot as plt
import pandas as pd

# Divergent Pathways: Tracking Cryptosporidium's Host Adaptation


### [This guide is best visualized here](https://mtinti.github.io/cryptosporidium_host_adaptation/)

## The Origin: Strain M4

Our journey begins with a single infected mouse (M4), harboring the Cryptosporidium population that would become the progenitor of two distinct evolutionary paths. This initial host served as the critical branching point for our experimental design.

From this single origin, the parasite's story split into two parallel narratives:

### 🐭 The Murine Passage 🐭
> In this pathway, Cryptosporidium continued its journey through a series of mouse hosts, adapting to the murine environment through sequential passages:

```
M4 → M5 → M6 → M7
```

Each passage potentially allowed the parasite to optimize its survival and reproductive strategies within these genetically similar mammalian hosts.

### 🐄 The Bovine Passage 🐄
> Simultaneously, we challenged the adaptability of the same initial strain by introducing it into an entirely different mammalian host, neonatal calves:

```
M4 → C1 → C2 → C3
```

This cross-species transmission forced the parasite to navigate a dramatically different physiological environment, potentially driving rapid adaptation.


## 🧬 Specialized Variant Calling Strategy 🧬

The progenitor Cryptosporidium population in mouse M4 wasn't a homogeneous colony, but rather a **diverse mixture of strains**. 

> This discovery fundamentally shaped our analytical approach.

To capture the true genetic complexity within our samples, we implemented a customized variant calling pipeline:

```
┌────────────────────────────────┐
│ FREEBAYES VARIANT CALLING      │
├────────────────────────────────┤
│ • Ploidy = 1                   │
│ • --pooled-continuous option   │
└────────────────────────────────┘
```
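
As a minimal sketch, the two settings in the box could be passed to freebayes as shown below. The file names are hypothetical placeholders, and any other options used in the actual run are not documented here, so this is an illustration rather than the exact command behind the results.

```python
import shlex


def build_freebayes_command(reference, bams, output_vcf):
    """Assemble a freebayes call that uses the two settings listed above."""
    cmd = [
        "freebayes",
        "--fasta-reference", reference,  # hypothetical reference path
        "--ploidy", "1",                 # haploid parasite genome
        "--pooled-continuous",           # frequency-based calling for mixed populations
        "--vcf", output_vcf,             # hypothetical output path
    ]
    cmd.extend(bams)                     # one BAM per host sample
    return cmd


# Hypothetical file names, for illustration only
cmd = build_freebayes_command(
    "reference.fasta", ["M4.bam", "C1.bam"], "raw_variants.vcf"
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run(cmd, check=True)` without shell quoting issues.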

### Why This Approach Matters

1. **Beyond Binary Detection**:
> Traditional presence/absence variant calling would have flattened the rich complexity of our samples, obscuring the very phenomenon we aimed to study.

2. **Quantitative Insight**:
> Focusing on allele frequencies rather than simple variant calls allows us to track subtle shifts in population genetics across hosts.
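
To make the contrast concrete, the sketch below compares the hard genotype call (GT) with the read-based alternate allele frequency at one site, CM000429:76625. The counts are copied from the AD field in the `df_vcf.head()` output shown later in this notebook; using `ref + alt` reads as the denominator is an assumption of this illustration.

```python
# (reference reads, alternate reads, GT) per sample at CM000429:76625
site = {
    "M4": (50, 34, 0),
    "M5": (52, 28, 0),
    "M6": (61, 28, 0),
    "M7": (76, 28, 0),
    "C1": (63, 49, 0),
    "C2": (78, 34, 0),
    "C3": (36, 47, 1),
}


def alt_frequency(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else float("nan")


frequencies = {
    sample: alt_frequency(ref, alt) for sample, (ref, alt, _) in site.items()
}

for sample, (_, _, gt) in site.items():
    print(f"{sample}: GT={gt}  alt frequency={frequencies[sample]:.2f}")
```

Six of the seven samples carry GT = 0 at this site, yet each of them has a substantial share of reads supporting the alternate allele. A presence/absence view would hide that signal.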


# Variant Analysis: Filtering Strategy

Our approach employed a strategic sequence of filtering steps, each addressing specific aspects of data quality:

```
┌─────────────────────────────────────┐
│ THREE-TIER FILTERING STRATEGY       │
├─────────────────────────────────────┤
│ 1. Quality-based Filtering          │
│ 2. Read Depth Optimization          │
│ 3. Variant Type Selection           │
└─────────────────────────────────────┘
```

In [11]:
filter_variants()

Starting Variant Filtering Process
Total variants before filtering: 15901
Stage 1: QUAL filtering: 14087 Variants removed and 1814 variants left
Stage 2: FORMAT/DP filtering, DP >= 30 & DP <= 150: 355 Variants removed and 1459 variants left
Stage 3: After keeping SNPs and indels: 203 Variants removed and 1256 variants left
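
The counts in this log can be cross-checked with a few lines of Python. This is only a bookkeeping sketch built from the numbers printed above; it does not call `filter_variants()` or read any file.

```python
total_before = 15901

# (stage label, variants removed), as reported in the log above
stages = [
    ("QUAL filtering", 14087),
    ("FORMAT/DP filtering", 355),
    ("SNPs and indels only", 203),
]

remaining = total_before
history = []
for label, removed in stages:
    remaining -= removed
    history.append((label, remaining))
    share = remaining / total_before
    print(f"{label}: {remaining} variants left ({share:.1%} of the raw calls)")
```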


### Stage 1: Quality-Based Filtering

```bash
# Eliminate low confidence variant calls
bcftools filter -e 'QUAL < 30' "$INPUT_VCF" -o "$QUAL_FILTERED_VCF"
```

**Rationale**: The QUAL score is a Phred-scaled measure of the statistical confidence in each variant call; a value of 30 corresponds to a 1-in-1,000 chance that the call is wrong. By establishing a minimum threshold of 30, we:
- Eliminated variants likely to be sequencing errors
- Retained variants with at least a 99.9% probability of being genuine
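
The link between a QUAL threshold and a confidence level follows from the Phred scale, where the error probability is 10^(-QUAL/10). A small sketch of that conversion (the function name is illustrative):

```python
def phred_to_error_probability(qual):
    """Convert a Phred-scaled score into the probability that the call is wrong."""
    return 10 ** (-qual / 10)


for qual in (10, 20, 30):
    p_error = phred_to_error_probability(qual)
    print(f"QUAL {qual}: error probability {p_error:.4f}, confidence {1 - p_error:.1%}")
```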

### Stage 2: Read Depth Optimization

```bash
# Balance between coverage requirements and anomalous amplification
bcftools view -i 'FMT/DP >= 30 & FMT/DP <= 150' "$QUAL_FILTERED_VCF" -o "$DP_FILTERED_VCF"
```

**Rationale**: Read depth optimization addressed two critical concerns:
- **Lower bound (DP ≥ 30)**: Ensured enough reads per sample to estimate allele frequencies reliably
- **Upper bound (DP ≤ 150)**: Protected against false positives from regions with anomalous read pileups, which often indicate repetitive elements
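
As a minimal sketch of the depth window, the check below applies the same inclusive bounds to the per-sample DP values at CM000429:76625 (copied from the `df_vcf.head()` output later in this notebook). How bcftools combines a per-sample `FMT/DP` expression across the seven samples is not modeled here.

```python
# Per-sample DP values at CM000429:76625
depths = {"M4": 84, "M5": 80, "M6": 89, "M7": 104, "C1": 112, "C2": 112, "C3": 83}


def depth_in_window(depth, low=30, high=150):
    """True when the depth lies inside the inclusive [low, high] window."""
    return low <= depth <= high


flags = {sample: depth_in_window(dp) for sample, dp in depths.items()}
print(flags)
print("all samples inside window:", all(flags.values()))
```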


### Stage 3: Variant Type Selection

```bash
# Keep only SNPs and indels
bcftools view -v snps,indels "$DP_FILTERED_VCF" -o "$SNP_FILTERED_VCF"
```
**Rationale**: This final step ensured our analysis focused exclusively on:

- Single nucleotide polymorphisms (SNPs)
- Small insertions and deletions (indels)
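
For intuition only, the two classes can be approximated from the lengths of the REF and ALT alleles. This is a deliberately simplified, hypothetical classifier: the logic bcftools uses for `-v snps,indels` is more detailed (for example for multi-allelic or complex records) and is not reproduced here.

```python
def classify_variant(ref, alt):
    """Rough variant class based only on allele lengths (simplified assumption)."""
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions of equal length


examples = [
    ("A", "G"),                 # CM000429:76625 in this dataset
    ("ACCCCACT", "ACCCCCACT"),  # CM000429:60889 in this dataset
    ("AT", "GC"),               # hypothetical equal-length substitution
]
for ref, alt in examples:
    print(f"{ref} -> {alt}: {classify_variant(ref, alt)}")
```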

# From Raw Variants to Biological Insights

## 📊 From Data to Discovery with 🐼 Pandas

We used Python's pandas library to load the filtered variants into a table that can be cleaned and reshaped for per-sample allele frequency calculations.

```
┌─────────────────────────────────────────────────────┐
│ DATA TRANSFORMATION PIPELINE                        │
├─────────────────────────────────────────────────────┤
│ 1. Load filtered VCF file                           │
│ 2. Remove ambiguous reference calls (REF = 'N')     │
│ 3. Restructure for computing frequency              │
└─────────────────────────────────────────────────────┘
```


In [24]:
vcf_file = "../data/filtered_final.vcf"
df_vcf = read_vcf(vcf_file)
print(f'step 1: {df_vcf.shape}')
# Drop sites whose reference allele is the ambiguous base 'N'
df_vcf = df_vcf[df_vcf['REF'] != 'N']
print(f'step 2: {df_vcf.shape}')

step 1: (1256, 16)
step 2: (945, 16)
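
The implementation of `read_vcf` lives in the project's `core` module and is not shown here. As a hedged sketch, one common way to load a VCF into pandas is to skip the `##` meta lines and keep the `#CHROM` header row. The function name `read_vcf_sketch` and the toy records below are hypothetical; the second record is invented to demonstrate the REF filter.

```python
import io

import pandas as pd


def read_vcf_sketch(handle):
    """Hypothetical stand-in for read_vcf: drop '##' meta lines, keep the header."""
    lines = [line for line in handle if not line.startswith("##")]
    return pd.read_csv(io.StringIO("".join(lines)), sep="\t")


# Toy VCF text with a single sample column
toy_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tM4\n"
    "CM000429\t76625\t.\tA\tG\t265.872\tPASS\t.\tGT:AD\t0:50,34\n"
    "CM000429\t90000\t.\tN\tG\t100.0\tPASS\t.\tGT:AD\t1:0,40\n"
)

df = read_vcf_sketch(io.StringIO(toy_vcf))
df = df[df["REF"] != "N"]  # same filter as step 2 above
print(df.shape)
```

With a real file, the same function would be called on an open handle, for example `with open(vcf_file) as handle: df = read_vcf_sketch(handle)`.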


In [26]:
df_vcf.head()

Unnamed: 0,#CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,M7,M5,M4,M6,C3,C2,C1
0,CM000429,60867,.,TAAAAAAAAAAGATAT,"TAAAAAAAAAAAGATTT,TAAAAAAAAAAAGATAT,TAAAAAAAAA...",10088.4,PASS,"AB=0,0,0;ABP=0,0,0;AC=0,7,0;AF=0,1,0;AN=7;AO=1...",GT:GQ:DP:AD:RO:QR:AO:QA:GL,"2:138:82:9,2,67,2:9:296:2,67,2:24,2218,68:-172...","2:138:46:3,2,35,1:3:100:2,35,1:24,1116,34:-91....","2:138:69:4,4,59,1:4:132:4,59,1:48,1980,34:-166...","2:138:82:14,2,62,1:14:461:2,62,1:24,2092,26:-1...","2:138:57:5,1,45,1:5:163:1,45,1:12,1492,34:-119...","2:138:69:1,0,60,4:1:34:0,60,4:0,1966,136:-173....","2:138:62:4,1,50,4:4:130:1,50,4:12,1678,128:-13..."
1,CM000429,60889,.,ACCCCACT,ACCCCCACT,11705.8,PASS,AB=0;ABP=0;AC=7;AF=1;AN=7;AO=435;CIGAR=1M1I7M;...,GT:GQ:DP:AD:RO:QR:AO:QA:GL,"1:137:90:9,81:9:295:81:2686:-215.126,0","1:137:53:1,50:1:34:50:1596:-140.656,0","1:137:70:2,66:2:68:66:2184:-190.372,0","1:137:82:11,70:11:359:70:2241:-169.324,0","1:137:45:5,40:5:169:40:1237:-96.064,0","1:137:68:1,66:1:31:66:2174:-192.923,0","1:137:69:4,62:4:126:62:2002:-168.787,0"
2,CM000429,76625,.,A,G,265.872,PASS,AB=0;ABP=0;AC=1;AF=0.142857;AN=7;AO=248;CIGAR=...,GT:GQ:DP:AD:RO:QR:AO:QA:GL,"0:131:104:76,28:76:2560:28:952:0,-144.672","0:131:80:52,28:52:1738:28:930:0,-72.6994","0:131:84:50,34:50:1596:34:1156:0,-39.5691","0:131:89:61,28:61:2066:28:944:0,-100.949","1:131:83:36,47:36:1216:47:1598:-34.3706,0","0:131:112:78,34:78:2630:34:1126:0,-135.322","0:131:112:63,49:63:2126:49:1658:0,-42.1062"
3,CM000429,82019,.,A,T,8192.19,PASS,AB=0;ABP=0;AC=7;AF=1;AN=7;AO=410;CIGAR=1X;DP=5...,GT:GQ:DP:AD:RO:QR:AO:QA:GL,"1:160:98:9,89:9:306:89:2994:-241.841,0","1:160:51:15,36:15:510:36:1186:-60.8115,0","1:160:86:19,67:19:638:67:2240:-144.133,0","1:160:90:15,75:15:488:75:2488:-179.952,0","1:160:70:26,44:26:862:44:1472:-54.8869,0","1:160:81:31,50:31:1046:50:1662:-55.4182,0","1:160:71:22,49:22:748:49:1666:-82.5935,0"
4,CM000429,82192,.,G,A,6765.84,PASS,AB=0;ABP=0;AC=6;AF=0.857143;AN=7;AO=398;CIGAR=...,GT:GQ:DP:AD:RO:QR:AO:QA:GL,"1:134:104:11,93:11:374:93:3124:-247.418,0","1:134:67:20,47:20:658:47:1598:-84.5836,0","1:134:75:25,50:25:842:50:1692:-76.4771,0","1:134:102:26,76:26:846:76:2568:-154.943,0","0:0:53:28,25:28:944:25:850:0,-8.45503","1:134:92:35,57:35:1190:57:1938:-67.2984,0","1:134:83:33,50:33:1084:50:1692:-54.7114,0"
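
Each sample column packs several values into one colon-separated string whose layout is given by the FORMAT column. The sketch below shows one possible way to restructure such rows into a site-by-sample table of non-reference read fractions. The helper names are hypothetical, the denominator is the sum of the AD values (an assumption, since DP can differ slightly from that sum), and missing values such as `.` are not handled.

```python
import pandas as pd

SAMPLES = ["M4", "M5", "M6", "M7", "C1", "C2", "C3"]


def allele_depths(format_field, sample_field):
    """Return the AD values (reference first, then each ALT) as integers."""
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    return [int(value) for value in fields["AD"].split(",")]


def non_ref_fraction_table(df, samples=SAMPLES):
    """Build a site-by-sample table of the fraction of non-reference reads."""
    records = []
    for _, row in df.iterrows():
        for sample in samples:
            depths = allele_depths(row["FORMAT"], row[sample])
            total = sum(depths)
            fraction = (total - depths[0]) / total if total else float("nan")
            records.append(
                {"CHROM": row["#CHROM"], "POS": row["POS"],
                 "sample": sample, "non_ref_fraction": fraction}
            )
    tidy = pd.DataFrame(records)
    wide = tidy.pivot(index=["CHROM", "POS"], columns="sample",
                      values="non_ref_fraction")
    return wide[samples]  # murine passage first, then bovine passage


# One record copied from the table above (CM000429:76625)
demo = pd.DataFrame([{
    "#CHROM": "CM000429", "POS": 76625, "REF": "A", "ALT": "G",
    "FORMAT": "GT:GQ:DP:AD:RO:QR:AO:QA:GL",
    "M7": "0:131:104:76,28:76:2560:28:952:0,-144.672",
    "M5": "0:131:80:52,28:52:1738:28:930:0,-72.6994",
    "M4": "0:131:84:50,34:50:1596:34:1156:0,-39.5691",
    "M6": "0:131:89:61,28:61:2066:28:944:0,-100.949",
    "C3": "1:131:83:36,47:36:1216:47:1598:-34.3706,0",
    "C2": "0:131:112:78,34:78:2630:34:1126:0,-135.322",
    "C1": "0:131:112:63,49:63:2126:49:1658:0,-42.1062",
}])

table = non_ref_fraction_table(demo)
print(table.round(3))
```

Ordering the columns by passage (M4 to M7, then C1 to C3) makes it easier to read frequency changes along each lineage.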


### Reproducibility

Install latest from the GitHub [repository][repo]:

```sh
$ pip install git+https://github.com/mtinti/cryptosporidium_host_adaptation.git
```

or from [conda][conda]:

```sh
$ conda install -c mtinti cryptosporidium_host_adaptation
```

or from [pypi][pypi]:

```sh
$ pip install cryptosporidium_host_adaptation
```


[repo]: https://github.com/mtinti/cryptosporidium_host_adaptation
[docs]: https://mtinti.github.io/cryptosporidium_host_adaptation/
[pypi]: https://pypi.org/project/cryptosporidium_host_adaptation/
[conda]: https://anaconda.org/mtinti/cryptosporidium_host_adaptation

In [5]:
#| hide
import nbdev; nbdev.nbdev_export()