# Demonstrating New Phenotype Data Access and Filtering

This notebook showcases the updated functionalities for accessing and filtering insecticide resistance phenotype data, and merging it with genetic data (SNPs and Haplotypes), as per the recent refactoring.



## 1. Setup and Imports

First, we'll import the necessary libraries and instantiate the `Ag3` API client.

In [1]:
import pandas as pd
import xarray as xr
import warnings
from malariagen_data import Ag3

# Suppress warnings for cleaner output in the notebook
warnings.simplefilter("ignore", UserWarning)

# Instantiate the Ag3 API client
# Use pre=True for development versions, or specify a release like release="v3.2"
ag3 = Ag3(pre=True)

print("MalariaGEN Ag3 API client initialized.")
print(ag3)


MalariaGEN Ag3 API client initialized.
<MalariaGEN Ag3 API client>
Storage URL             : gs://vo_agam_release_master_us_central1/
Data releases available : 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14, 3.15
Results cache           : None
Cohorts analysis        : 20250502
AIM analysis            : 20220528
Site filters analysis   : dt_20200416
Software version        : malariagen_data 15.2.2.post22+48d1097a
Client location         : Tanger-Tetouan-Al Hoceima, Morocco
---
Please note that data are subject to terms of use,
for more information see https://www.malariagen.net/data
or contact support@malariagen.net. For API documentation see 
https://malariagen.github.io/malariagen-data-python/v15.2.2.post22+48d1097a/Ag3.html


## 2. Listing Available Phenotype Sample Sets
Let's see which sample sets have phenotype data available.

In [2]:
phenotype_sample_sets = ag3.phenotype_sample_sets()
print(f"Available phenotype sample sets: {phenotype_sample_sets}")

# We'll pick one sample set for demonstration, preferably one known to have data
# For this example, we'll use '1237-VO-BJ-DJOGBENOU-VMF00050'
demo_sample_set = '1237-VO-BJ-DJOGBENOU-VMF00050'
if demo_sample_set not in phenotype_sample_sets:
    print(f"Warning: '{demo_sample_set}' not found. Using the first available: {phenotype_sample_sets}")
    demo_sample_set = phenotype_sample_sets

print(f"\nUsing sample set for demonstration: {demo_sample_set}")


Available phenotype sample sets: ['1237-VO-BJ-DJOGBENOU-VMF00050', '1244-VO-GH-YAWSON-VMF00051', '1245-VO-CI-CONSTANT-VMF00054', '1253-VO-TG-DJOGBENOU-VMF00052']

Using sample set for demonstration: 1237-VO-BJ-DJOGBENOU-VMF00050


## 3. Loading Phenotype Data with sample_query
The phenotype_data() method now exclusively returns a Pandas DataFrame containing phenotype and sample metadata. Filtering by attributes like insecticide, location, dose, etc., is done using the sample_query parameter, which accepts a Pandas-style query string.

### Example 1: Filter by a specific insecticide

In [3]:
print(f"\n--- Loading phenotype data for '{demo_sample_set}' filtered by Deltamethrin ---")
df_deltamethrin = ag3.phenotype_data(
    sample_sets=[demo_sample_set],
    sample_query="insecticide == 'Deltamethrin'"
)

print(f"Shape of DataFrame: {df_deltamethrin.shape}")
print("\nFirst 5 rows of the filtered DataFrame:")
df_deltamethrin.head()
print(f"\nUnique insecticides in filtered data: {df_deltamethrin['insecticide'].unique()}")
print("\nDataFrame Info:")
df_deltamethrin.info() 


--- Loading phenotype data for '1237-VO-BJ-DJOGBENOU-VMF00050' filtered by Deltamethrin ---
Shape of DataFrame: (88, 60)         

First 5 rows of the filtered DataFrame:

Unique insecticides in filtered data: ['Deltamethrin']

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 88 entries, 0 to 87
Data columns (total 60 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   sample_id                        88 non-null     object 
 1   country                          88 non-null     object 
 2   location                         88 non-null     object 
 3   insecticide                      88 non-null     object 
 4   dose                             88 non-null     float64
 5   phenotype                        88 non-null     object 
 6   sample_set                       88 non-null     object 
 7   partner_sample_id                88 non-null     object 
 8   contributor                     

### Example 2: Filter by multiple conditions (insecticide and dose)

In [4]:
print(f"\n--- Loading phenotype data filtered by Deltamethrin and dose >= 1.0 ---")
df_filtered_multi = ag3.phenotype_data(
    sample_sets=[demo_sample_set],
    sample_query="insecticide == 'Deltamethrin' and dose >= 1.0"
)

print(f"Shape of DataFrame: {df_filtered_multi.shape}")
print("\nFirst 5 rows of the multi-filtered DataFrame:")
df_filtered_multi.head()

print(f"\nUnique insecticides: {df_filtered_multi['insecticide'].unique()}")
print(f"Unique doses: {df_filtered_multi['dose'].unique()}")
print("\nDataFrame Info:")
df_filtered_multi.info()



--- Loading phenotype data filtered by Deltamethrin and dose >= 1.0 ---
Shape of DataFrame: (48, 60)

First 5 rows of the multi-filtered DataFrame:

Unique insecticides: ['Deltamethrin']
Unique doses: [2.]

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 48 entries, 40 to 87
Data columns (total 60 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   sample_id                        48 non-null     object 
 1   country                          48 non-null     object 
 2   location                         48 non-null     object 
 3   insecticide                      48 non-null     object 
 4   dose                             48 non-null     float64
 5   phenotype                        48 non-null     object 
 6   sample_set                       48 non-null     object 
 7   partner_sample_id                48 non-null     object 
 8   contributor                      48 non-null     obj

### Example 3: Applying cohort size filtering
Cohort filtering parameters (cohort_size, min_cohort_size, max_cohort_size) can be combined with sample_query.

In [5]:
print(f"\n--- Loading phenotype data with min_cohort_size=10 ---")
df_cohort_filtered = ag3.phenotype_data(
    sample_sets=[demo_sample_set],
    sample_query="insecticide == 'Deltamethrin'",
    min_cohort_size=10
)

print(f"Shape of DataFrame: {df_cohort_filtered.shape}")
print("\nFirst 5 rows of the cohort-filtered DataFrame:")
df_cohort_filtered.head() 
print("\nDataFrame Info:")
df_cohort_filtered.info() 
# Verify cohort sizes (optional, for internal testing)
# if not df_cohort_filtered.empty:
#     cohort_keys = ["insecticide", "dose", "location", "country", "sample_set"]
#     available_keys = [col for col in cohort_keys if col in df_cohort_filtered.columns]
#     if available_keys:
#         cohort_sizes = df_cohort_filtered.groupby(available_keys).size()
#         print("\nCohort sizes after filtering:")
#         print(cohort_sizes)
#         print(f"All cohorts meet min_cohort_size (>=10): {all(cohort_sizes >= 10)}")



--- Loading phenotype data with min_cohort_size=10 ---
Shape of DataFrame: (88, 60)

First 5 rows of the cohort-filtered DataFrame:

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88 entries, 0 to 87
Data columns (total 60 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   sample_id                        88 non-null     object 
 1   country                          88 non-null     object 
 2   location                         88 non-null     object 
 3   insecticide                      88 non-null     object 
 4   dose                             88 non-null     float64
 5   phenotype                        88 non-null     object 
 6   sample_set                       88 non-null     object 
 7   partner_sample_id                88 non-null     object 
 8   contributor                      88 non-null     object 
 9   year                             88 non-null     int64  
 10

In [6]:
print(f"\n--- Getting binary phenotype outcomes for '{demo_sample_set}' ---")

# Example 1: Binary outcomes for all Deltamethrin samples
binary_deltamethrin = ag3.phenotype_binary(
    sample_sets=[demo_sample_set],
    sample_query="insecticide == 'Deltamethrin'"
)

print(f"Shape of binary series: {binary_deltamethrin.shape}")
print("First 5 entries:")
print(binary_deltamethrin.head())
print(f"Unique values in series: {binary_deltamethrin.unique()}")


# Example 2: Binary outcomes for samples that were 'alive' with Deltamethrin
binary_alive_deltamethrin = ag3.phenotype_binary(
    sample_sets=[demo_sample_set],
    sample_query="insecticide == 'Deltamethrin' and phenotype == 'alive'"
)

print(f"\nShape of binary series (alive Deltamethrin): {binary_alive_deltamethrin.shape}")
print("First 5 entries:")
print(binary_alive_deltamethrin.head())
print(f"Unique values in series: {binary_alive_deltamethrin.unique()}")

# Example 3: Binary outcomes for samples with dose 0.5
binary_dose_0_5 = ag3.phenotype_binary(
    sample_sets=[demo_sample_set],
    sample_query="dose == 0.5"
)

print(f"\nShape of binary series (dose 0.5): {binary_dose_0_5.shape}")
print("First 5 entries:")
print(binary_dose_0_5.head())
print(f"Unique values in series: {binary_dose_0_5.unique()}")


--- Getting binary phenotype outcomes for '1237-VO-BJ-DJOGBENOU-VMF00050' ---
Shape of binary series: (88,)
First 5 entries:
sample_id
VBS18949-5562STDY7801785    0.0
VBS18950-5562STDY7801786    0.0
VBS18951-5562STDY7801787    0.0
VBS18952-5562STDY7801788    0.0
VBS18953-5562STDY7801789    0.0
Name: phenotype_binary, dtype: float64
Unique values in series: [0. 1.]

Shape of binary series (alive Deltamethrin): (48,)
First 5 entries:
sample_id
VBS18995-5562STDY7801828    1.0
VBS18996-5562STDY7801829    1.0
VBS18998-5562STDY7801830    1.0
VBS18999-5562STDY7801831    1.0
VBS19000-5562STDY7801832    1.0
Name: phenotype_binary, dtype: float64
Unique values in series: [1.]

Shape of binary series (dose 0.5): (40,)
First 5 entries:
sample_id
VBS18949-5562STDY7801785    0.0
VBS18950-5562STDY7801786    0.0
VBS18951-5562STDY7801787    0.0
VBS18952-5562STDY7801788    0.0
VBS18953-5562STDY7801789    0.0
Name: phenotype_binary, dtype: float64
Unique values in series: [0.]


## 4. Getting Binary Phenotype Outcomes with phenotype_binary

The `phenotype_binary()` method provides a convenient way to get phenotype outcomes as a binary Pandas Series (1 for alive/resistant, 0 for dead/susceptible, NaN for unmapped). It also uses the `sample_query` for filtering.

In [7]:
print(f"\n--- Getting binary phenotype outcomes for '{demo_sample_set}' ---")

# Example 1: Binary outcomes for all Deltamethrin samples
binary_deltamethrin = ag3.phenotype_binary(
    sample_sets=[demo_sample_set],
    sample_query="insecticide == 'Deltamethrin'"
)

print(f"Shape of binary series: {binary_deltamethrin.shape}")
print("First 5 entries:")
print(binary_deltamethrin.head())
print(f"Unique values in series: {binary_deltamethrin.unique()}")


# Example 2: Binary outcomes for samples that were 'alive' with Deltamethrin
binary_alive_deltamethrin = ag3.phenotype_binary(
    sample_sets=[demo_sample_set],
    sample_query="insecticide == 'Deltamethrin' and phenotype == 'alive'"
)

print(f"\nShape of binary series (alive Deltamethrin): {binary_alive_deltamethrin.shape}")
print("First 5 entries:")
print(binary_alive_deltamethrin.head())
print(f"Unique values in series: {binary_alive_deltamethrin.unique()}")

# Example 3: Binary outcomes for samples with dose 0.5
binary_dose_0_5 = ag3.phenotype_binary(
    sample_sets=[demo_sample_set],
    sample_query="dose == 0.5"
)

print(f"\nShape of binary series (dose 0.5): {binary_dose_0_5.shape}")
print("First 5 entries:")
print(binary_dose_0_5.head())
print(f"Unique values in series: {binary_dose_0_5.unique()}")


--- Getting binary phenotype outcomes for '1237-VO-BJ-DJOGBENOU-VMF00050' ---
Shape of binary series: (88,)
First 5 entries:
sample_id
VBS18949-5562STDY7801785    0.0
VBS18950-5562STDY7801786    0.0
VBS18951-5562STDY7801787    0.0
VBS18952-5562STDY7801788    0.0
VBS18953-5562STDY7801789    0.0
Name: phenotype_binary, dtype: float64
Unique values in series: [0. 1.]

Shape of binary series (alive Deltamethrin): (48,)
First 5 entries:
sample_id
VBS18995-5562STDY7801828    1.0
VBS18996-5562STDY7801829    1.0
VBS18998-5562STDY7801830    1.0
VBS18999-5562STDY7801831    1.0
VBS19000-5562STDY7801832    1.0
Name: phenotype_binary, dtype: float64
Unique values in series: [1.]

Shape of binary series (dose 0.5): (40,)
First 5 entries:
sample_id
VBS18949-5562STDY7801785    0.0
VBS18950-5562STDY7801786    0.0
VBS18951-5562STDY7801787    0.0
VBS18952-5562STDY7801788    0.0
VBS18953-5562STDY7801789    0.0
Name: phenotype_binary, dtype: float64
Unique values in series: [0.]


## 5. Loading Phenotype Data Merged with SNP Calls
The phenotypes_with_snps() method returns an xarray.Dataset that combines phenotype data with SNP calls for a specified genomic region. The sample_query parameter is still used to filter the phenotype data before merging.

In [8]:
print(f"\n--- Loading phenotype data merged with SNP calls ---")

# Choose a small region for faster loading
demo_region_snps = "2L:2420000-2430000"

ds_snps = ag3.phenotypes_with_snps(
    sample_sets=[demo_sample_set],
    sample_query="insecticide == 'Deltamethrin' and phenotype == 'alive'",
    region=demo_region_snps
)

print(f"Dataset dimensions: {ds_snps.dims}")
print("\nDataset structure (rich display):")
ds_snps

print("\nFirst 5 samples and their phenotype_binary values:")
print(ds_snps["phenotype_binary"].head(5).values)
print("\nFirst 5 samples and their insecticide values:")
print(ds_snps["insecticide"].head(5).values)
print("\nFirst 5 variant positions:")
print(ds_snps["variant_position"].head(5).values)
print("\nDataset Info:")
ds_snps.info()



--- Loading phenotype data merged with SNP calls ---

Dataset structure (rich display):

First 5 samples and their phenotype_binary values:
[1. 1. 1. 1. 1.]

First 5 samples and their insecticide values:
['Deltamethrin' 'Deltamethrin' 'Deltamethrin' 'Deltamethrin'
 'Deltamethrin']

First 5 variant positions:
[2420000 2420001 2420002 2420003 2420004]

Dataset Info:
xarray.Dataset {
dimensions:
	samples = 48 ;
	variants = 10001 ;
	alleles = 4 ;
	ploidy = 2 ;

variables:
	float64 phenotype_binary(samples) ;
	object insecticide(samples) ;
	float64 dose(samples) ;
	object phenotype(samples) ;
	object location(samples) ;
	object country(samples) ;
	object sample_set(samples) ;
	object samples(samples) ;
	|S1 variant_allele(variants, alleles) ;
	bool variant_filter_pass_gamb_colu_arab(variants) ;
	bool variant_filter_pass_gamb_colu(variants) ;
	bool variant_filter_pass_arab(variants) ;
	int32 variant_position(variants) ;
	uint8 variant_contig(variants) ;
	int8 call_genotype(variants, samples

## 6. Loading Phenotype Data Merged with Haplotypes
Similarly, the phenotypes_with_haplotypes() method returns an xarray.Dataset combining phenotype data with haplotype calls for a given region.

In [9]:
print(f"\n--- Loading phenotype data merged with Haplotypes ---")

# Choose a small region for faster loading
demo_region_haps = "2L:2420000-2430000"

ds_haps = ag3.phenotypes_with_haplotypes(
    sample_sets=[demo_sample_set],
    sample_query="insecticide == 'Deltamethrin' and phenotype == 'dead'",
    region=demo_region_haps
)

print(f"Dataset dimensions: {ds_haps.dims}")
print("\nDataset structure (rich display):")
ds_haps

print("\nFirst 5 samples and their phenotype_binary values:")
print(ds_haps["phenotype_binary"].head(5).values)
print("\nFirst 5 samples and their insecticide values:")
print(ds_haps["insecticide"].head(5).values)
print("\nFirst 5 variant positions:")
print(ds_haps["variant_position"].head(5).values)
print("\nDataset Info:")
ds_haps.info()



--- Loading phenotype data merged with Haplotypes ---

Dataset structure (rich display):

First 5 samples and their phenotype_binary values:
[0. 0. 0. 0. 0.]

First 5 samples and their insecticide values:
['Deltamethrin' 'Deltamethrin' 'Deltamethrin' 'Deltamethrin'
 'Deltamethrin']

First 5 variant positions:
[2420351 2420359 2420370 2420371 2420375]

Dataset Info:
xarray.Dataset {
dimensions:
	samples = 40 ;
	variants = 2153 ;
	alleles = 2 ;
	ploidy = 2 ;

variables:
	float64 phenotype_binary(samples) ;
	object insecticide(samples) ;
	float64 dose(samples) ;
	object phenotype(samples) ;
	object location(samples) ;
	object country(samples) ;
	object sample_set(samples) ;
	object samples(samples) ;
	int32 variant_position(variants) ;
	uint8 variant_contig(variants) ;
	|S1 variant_allele(variants, alleles) ;
	int8 call_genotype(variants, samples, ploidy) ;

// global attributes:
}

## Conclusion
This notebook demonstrates the streamlined approach to accessing and combining phenotype and genetic data. The use of sample_query provides flexible filtering, and the dedicated functions for SNPs and Haplotypes (and potentially CNVs in the future) ensure clarity and separation of concerns.