<a href="https://colab.research.google.com/github/Elpastore/WENETAM_VECTOR_GENOMICS_TRAINING_WORKSHOP/blob/main/Population_structure.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌍 Population Structure, Gene Flow & Cryptic Species Detection

Understanding how mosquito populations are genetically structured is critical for effective vector control strategies. Population structure reveals how **genetic variation** is distributed across **geographic regions**, **time**, or **ecological niches**, and provides insights into **evolutionary history**, **gene flow**, and the presence of **cryptic species**.

---

## 🧬 Why Population Structure Matters

- 🔄 **Gene Flow**: The movement of genes between populations can **spread beneficial or resistance mutations**, such as those conferring insecticide resistance.
- 🚧 **Barriers to Gene Flow**: Genetic structure may reflect **geographic**, **ecological**, or **behavioral barriers**.
- 🧬 **Cryptic Species**: Some species, like *An. gambiae* and *An. coluzzii*, are morphologically identical but **genetically distinct**. Such cryptic species may respond differently to interventions.
- 🎯 **Targeted Vector Control**: Detecting population structure enables **localized strategies** based on the specific genetic make-up of vector populations.

---

## 🔍 Key Analytical Approaches

To investigate structure, gene flow, and cryptic species, we will use:

| Method | Description |
|--------|-------------|
| **PCA** (Principal Component Analysis) | Visualizes major genetic clusters based on genome-wide variation. |
| **F<sub>ST</sub>** (Fixation Index) | Measures genetic differentiation between populations. |
| **NJT** (Neighbour-Joining Tree) | Reconstructs relationships between individuals or populations using genetic distances. |
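
As a rough illustration of the distance step behind a neighbour-joining tree, the sketch below computes pairwise distances between individuals from toy genotype calls coded as alternate-allele counts (0, 1 or 2). The sample names and genotype values are invented for illustration, and the city-block metric is assumed here as one common choice rather than taken from the workshop pipeline.

```python
from itertools import combinations

# Toy genotypes: alternate-allele counts (0, 1 or 2) at six SNPs.
# Sample names and values are hypothetical, for illustration only.
genotypes = {
    "sample_A": [0, 0, 1, 2, 0, 1],
    "sample_B": [0, 1, 1, 2, 0, 1],
    "sample_C": [2, 2, 0, 0, 1, 0],
}


def cityblock(g1, g2):
    """Sum of absolute differences in allele counts across SNPs."""
    return sum(abs(a - b) for a, b in zip(g1, g2))


# One distance per unordered pair of samples.
distances = {
    (s1, s2): cityblock(genotypes[s1], genotypes[s2])
    for s1, s2 in combinations(genotypes, 2)
}

for pair, dist in distances.items():
    print(pair, dist)
```

Samples A and B differ at a single SNP, so they would be joined first in a tree, while sample C sits on a much longer branch.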


---

## 🎯 Learning Objectives

In this section, you will:

- 🔹 Detect **genetic clusters** in your vector population data.
- 🔹 Quantify **genetic divergence** using F<sub>ST</sub>.
- 🔹 Visualize relationships using **PCA** and **phylogenetic trees**.
- 🔹 Explore **evidence for gene flow** or **cryptic diversity**.

---



## 🧬 Introduction to Population Genetics Concepts

Before diving into our practical session, let’s briefly review some key concepts that are fundamental to understanding mosquito population dynamics and how they relate to vector control strategies.

---

### 🌍 1. Population Structure

**Population structure** refers to the presence of genetically distinct groups within a species. These groups may be separated by geography, behavior, or ecological barriers.

- Structured populations exhibit differences in **allele frequencies**.
- Structure can affect how genes (including resistance genes) spread.
- Tools like **PCA**, **F<sub>ST</sub>**, and **phylogenetic trees** help us detect structure.
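
To make the link between allele-frequency differences and F<sub>ST</sub> concrete, here is a minimal sketch using the textbook heterozygosity-based definition at a single biallelic SNP. It assumes two equally sized populations and is not necessarily the estimator used by `malariagen_data`; the function name `fst_two_pops` is hypothetical.

```python
def fst_two_pops(p1, p2):
    """F_ST at one biallelic SNP from the allele frequency in each population.

    Assumes two equally sized populations. Returns 0.0 when the SNP is not
    variable in the pooled sample.
    """
    p_mean = (p1 + p2) / 2
    # Expected heterozygosity if both populations were pooled.
    h_total = 2 * p_mean * (1 - p_mean)
    # Mean expected heterozygosity within each population.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        return 0.0
    return (h_total - h_within) / h_total


print(fst_two_pops(0.5, 0.5))            # identical frequencies
print(fst_two_pops(0.0, 1.0))            # fixed difference
print(round(fst_two_pops(0.2, 0.8), 3))  # intermediate differentiation
```

Identical frequencies give a value of 0 and a fixed difference gives 1; real data fall in between, and genome-wide estimates average over many SNPs.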

---

### 🔄 2. Evolutionary Forces

Several forces shape the genetic variation within and between populations:

- **Mutation**: Introduces new genetic variation.
- **Genetic drift**: Random changes in allele frequencies, especially in small populations.
- **Natural selection**: Increases or decreases allele frequencies based on fitness.
- **Gene flow**: Movement of genes between populations via migration.

These forces interact and leave **signatures in the genome** that we can detect through analysis.
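
Genetic drift is the easiest of these forces to explore by simulation. The sketch below is a simple Wright-Fisher model with drift only (no mutation, selection or migration); the function name, population sizes and seed are arbitrary choices for illustration.

```python
import random


def wright_fisher(p0, n_individuals, n_generations, seed=0):
    """Simulate allele frequency under drift alone in a diploid population."""
    rng = random.Random(seed)
    n_copies = 2 * n_individuals  # diploid: two gene copies per individual
    freqs = [p0]
    p = p0
    for _ in range(n_generations):
        # Each gene copy in the next generation is drawn at random
        # from the current allele pool.
        count = sum(rng.random() < p for _ in range(n_copies))
        p = count / n_copies
        freqs.append(p)
    return freqs


small = wright_fisher(p0=0.5, n_individuals=10, n_generations=50)
large = wright_fisher(p0=0.5, n_individuals=1000, n_generations=50)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

Running this with different seeds should show the small population wandering much further from its starting frequency, and often losing or fixing the allele, while the large population changes only slightly.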

---

### 🔗 3. Importance of Gene Flow in Vector Control

**Gene flow** plays a crucial role in the spread of traits across mosquito populations:

- Can spread **insecticide resistance** mutations across regions.
- May **counteract local selection**, maintaining genetic diversity.
- Understanding gene flow helps design **effective vector control strategies** by identifying connected populations and predicting how resistance might spread.
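
The effect of gene flow on a resistance allele can be sketched with a deterministic two-population migration model. This is a simplified illustration: it assumes symmetric migration at rate `m` per generation and ignores drift and selection, and the starting frequencies are invented.

```python
def migrate(p1, p2, m, n_generations):
    """Symmetric two-population model.

    Each generation, a fraction m of each population is replaced by
    migrants from the other population (no drift, no selection).
    """
    history = [(p1, p2)]
    for _ in range(n_generations):
        p1, p2 = (1 - m) * p1 + m * p2, (1 - m) * p2 + m * p1
        history.append((p1, p2))
    return history


# A resistance allele starts at 80% in population 1 and is absent
# from population 2.
for m in (0.0, 0.01, 0.1):
    p1, p2 = migrate(0.8, 0.0, m, n_generations=20)[-1]
    print(f"m={m}: pop1={p1:.3f}, pop2={p2:.3f}, difference={p1 - p2:.3f}")
```

With no migration the two populations stay fully distinct, whereas even modest migration steadily erodes the frequency difference, which is why connected populations tend to share resistance alleles.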

---

These concepts form the basis for interpreting results in our upcoming analyses of **population structure**, **diversity**, and **selection**.



### Practical session

In [None]:
!pip install -qq malariagen_data

In [None]:
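import plotly.io as pio
pio.renderers.default = "notebook+colab"
import plotly.express as px
import malariagen_data
import pandas as pd

# Set up access to the Ag3 data resource; `ag3` is used in every cell below
ag3 = malariagen_data.Ag3()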

In [None]:
# Select our region of interest and set the number of SNPs
region = "3L:15,000,000-41,000,000"
n_snps = 100_000

In [None]:
# Run the PCA; alternative sample sets: ["AG1000G-BF-A", "AG1000G-BF-B"]
pca_bf_df, evr_bf = ag3.pca(
    region=region,
    n_snps=n_snps,
    sample_sets="1191-VO-MULTI-OLOUGHLIN-VMF00140",
);

In [None]:
ag3.plot_pca_variance(evr_bf)


In [None]:
ag3.plot_pca_coords(
    pca_bf_df,
    color="taxon",
    title="Taxonomic structure in Burkina Faso"
)

In [None]:
ag3.plot_njt(
    region=region,
    n_snps=n_snps,
    sample_sets = "1191-VO-MULTI-OLOUGHLIN-VMF00140",
    color="taxon",
)

In [None]:
ag3.plot_njt(
    region="X",
    n_snps=n_snps,
    sample_sets = "1191-VO-MULTI-OLOUGHLIN-VMF00140",
    color="taxon",
)

In [None]:
# Average Fst between taxa, computed on 4-fold degenerate coding sites
taxon_fst = ag3.pairwise_average_fst(
    region="3L:15,000,000-41,000,000",
    cohorts="taxon",
    sample_sets="AG1000G-BF-B",
    # Alternative sample selections:
    # sample_query="country=='Gambia, The' and taxon in ['coluzzii', 'gambiae', 'gcx1']",
    # sample_sets=my_sample_set,
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
taxon_fst

In [None]:
ag3.plot_pairwise_average_fst(taxon_fst)


The combination of PCA, NJT and F<sub>ST</sub> shows clear genetic structure within the *An. gambiae* complex, driven mainly by genetic differentiation between taxa.
The NJT built from the X chromosome is particularly informative about reproductive isolation: it shows strong isolation between taxa, with little or no gene flow between them.


### Identification of cryptic species

## 🧪 Practical Exercises – Population Structure

In this session, you will analyze the **population structure** of *An. gambiae* mosquitoes sampled from **Tanzania** and **Kenya**. Your goal is to explore genetic patterns, detect clustering, and assess genetic differentiation between populations.

---

### 🌍 Sample Sets

We will work with the following sample sets:

- 🇹🇿 **Tanzania**: `AG1000G-TZ`
- 🇰🇪 **Kenya**: `AG1000G-KE`

These sets contain samples from the *gambiae* taxon only.

---

### 🧬 Instructions

1. **Subset your data** to include only the Tanzania and Kenya samples.
   - ✅ Ensure you're working only with *An. gambiae* individuals.

2. **Perform PCA (Principal Component Analysis)**.
   - ➤ Can you identify genetic clusters?
   - ➤ Do samples from Kenya and Tanzania separate clearly?

3. **Compute pairwise F<sub>ST</sub>** between the two populations.
   - ➤ What is the degree of genetic differentiation?
   - 💡 *Hint:* Use windowed F<sub>ST</sub> to identify genomic regions of elevated divergence.

4. **Construct a Neighbour-Joining Tree (NJT)** using a genome-wide distance matrix.
   - ➤ Do the clusters support the PCA result?

5. **Use the metadata to plot the sampling map** and explain the genetic pattern seen.

6. **Optional**: Use **AIMs (Ancestry Informative Markers)** to reinforce your clustering and species-level assignments.
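
One possible starting point for steps 1 to 3 is sketched below. It reuses the `pca`, `plot_pca_coords` and `pairwise_average_fst` calls from the practical session and adds `sample_metadata` to check which taxa are present. The helper names (`count_taxa`, `run_pca`, `run_fst`), the `country` values in the queries and the choice of region are assumptions to adapt to your own analysis, not a reference solution.

```python
# Sample sets named in the exercise.
SAMPLE_SETS = ["AG1000G-TZ", "AG1000G-KE"]
# Keep only individuals assigned to the gambiae taxon.
SAMPLE_QUERY = "taxon == 'gambiae'"
# Custom cohorts for F_ST: one metadata query per country (assumed values).
COUNTRY_COHORTS = {
    "Tanzania": "country == 'Tanzania' and taxon == 'gambiae'",
    "Kenya": "country == 'Kenya' and taxon == 'gambiae'",
}
REGION = "3L:15,000,000-41,000,000"


def count_taxa(ag3):
    """Step 1: check how many samples of each taxon each country has."""
    df = ag3.sample_metadata(sample_sets=SAMPLE_SETS)
    return df.groupby(["country", "taxon"]).size()


def run_pca(ag3, region=REGION, n_snps=100_000):
    """Step 2: PCA on the filtered samples, coloured by country."""
    pca_df, evr = ag3.pca(
        region=region,
        n_snps=n_snps,
        sample_sets=SAMPLE_SETS,
        sample_query=SAMPLE_QUERY,
    )
    return ag3.plot_pca_coords(pca_df, color="country")


def run_fst(ag3, region=REGION):
    """Step 3: average F_ST between the two country cohorts."""
    return ag3.pairwise_average_fst(
        region=region,
        cohorts=COUNTRY_COHORTS,
        sample_sets=SAMPLE_SETS,
    )
```

Each helper takes an `Ag3` client such as `malariagen_data.Ag3()` as its argument, for example `count_taxa(ag3)`. Running `count_taxa` first is worthwhile, because a cohort query that matches too few samples may cause the F<sub>ST</sub> call to fail.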

---

### 🧠 Reflection

- What do these results tell you about **gene flow** between Tanzania and Kenya?
- Could **geographical distance** or **local adaptation** be contributing to population structure?
- Why is it important to understand structure before doing selection scans?

---

💡 *Bonus Challenge:* Try coloring samples by region or year to investigate finer-scale structure.
