# scNET Tutorial: Malaria-Associated B Cell Pathway Reconstruction

This tutorial demonstrates how to use scNET for pathway reconstruction on a human malaria-associated B cell dataset. It shows how scNET identifies enriched pathways in different B cell subpopulations and reconstructs biological pathways relevant to the malaria immune response.

## Dataset Requirements

This tutorial is designed for a human malaria-associated B cell dataset with:
- **Genes**: 19,374 genes
- **Cells**: 7,044 cells (or similar datasets with 5,000+ cells)
- **Data type**: Single-cell RNA sequencing data
- **Format**: AnnData object (.h5ad) or compatible format

## Data Sources

Since the exact dataset from the paper may not be publicly available, you can use similar malaria-associated immune cell datasets from:

1. **Gene Expression Omnibus (GEO)**:
   - Search for "malaria B cell single cell" or "Plasmodium immune response scRNA-seq"
   - Look for datasets with B cell populations from malaria-exposed individuals

2. **Single Cell Portal**:
   - Browse malaria-related immune studies

3. **European Bioinformatics Institute (EBI)**:
   - Single Cell Expression Atlas

4. **10x Genomics Public Datasets**:
   - Look for immune system datasets that may include malaria-exposed samples

## Alternative: Using Your Own Dataset

If you have your own malaria-associated B cell dataset, ensure it:
- Contains B cell populations
- Has sufficient cell numbers (>1,000 cells recommended)
- Includes appropriate metadata for cell type annotation
- Is in AnnData format or can be converted to it
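
The checklist above can be turned into a quick pre-flight check. The sketch below is only an illustration: the helper name `check_dataset_requirements` and the metadata column `cell_type` are assumptions, not part of scNET, and the function relies only on `n_obs` and `obs.columns`, so any AnnData-like object can be passed in.

```python
from types import SimpleNamespace


def check_dataset_requirements(adata, min_cells=1000, annotation_columns=("cell_type",)):
    """Return a list of problems; an empty list means the basic checks passed.

    Only `adata.n_obs` and `adata.obs.columns` are used. The default column
    name 'cell_type' is a placeholder assumption; adjust it to your metadata.
    """
    problems = []
    if adata.n_obs <= min_cells:
        problems.append(f"only {adata.n_obs} cells; more than {min_cells} recommended")
    missing = [c for c in annotation_columns if c not in list(adata.obs.columns)]
    if missing:
        problems.append(f"missing metadata columns: {missing}")
    return problems


# Stand-in object for demonstration; replace it with your real AnnData.
toy = SimpleNamespace(n_obs=800, obs=SimpleNamespace(columns=["sample_id"]))
report = check_dataset_requirements(toy)
for problem in report:
    print("WARNING:", problem)
```

With a real dataset, call `check_dataset_requirements(adata)` right after loading and resolve any reported problems before training.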


## Installation and Setup

First, ensure you have scNET installed and import the necessary libraries:

In [None]:
# Uncomment to install scNET if not already installed
# !pip install scnet

import scNET
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set up scanpy settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white')

## Step 1: Load and Prepare the Malaria B Cell Dataset

Replace the path below with the path to your own malaria B cell dataset:

In [None]:
# Example dataset loading - replace with your actual dataset path
# adata = sc.read_h5ad('path/to/your/malaria_b_cell_dataset.h5ad')

# For demonstration purposes, we'll show how to structure the code
# assuming you have loaded your dataset into 'adata'

print("Dataset loading example:")
print("# Load your malaria B cell dataset")
print("adata = sc.read_h5ad('malaria_b_cell_data.h5ad')")
print("\n# Expected dataset properties:")
print("# - adata.shape: (~7044 cells, ~19374 genes)")
print("# - adata.obs should contain cell type annotations")
print("# - adata.var should contain gene information")

# Example of what your dataset structure should look like
print("\n# Example dataset info:")
print("# adata.obs.columns might include: 'cell_type', 'sample_id', 'treatment', etc.")
print("# Cell types might include: 'Memory B cells', 'Naive B cells', 'Plasma cells', etc.")

## Step 2: Basic Dataset Exploration

Before running scNET, let's explore the dataset:

In [None]:
# Uncomment and modify these lines when you have your dataset loaded

# print(f"Dataset shape: {adata.shape}")
# print(f"Number of cells: {adata.n_obs}")
# print(f"Number of genes: {adata.n_vars}")

# # Check cell type distribution
# if 'cell_type' in adata.obs.columns:
#     print("\nCell type distribution:")
#     print(adata.obs['cell_type'].value_counts())
# else:
#     print("\nAvailable metadata columns:")
#     print(adata.obs.columns.tolist())

# # Basic visualization
# sc.pl.highest_expr_genes(adata, n_top=20)

print("Once you load your dataset, this section will show:")
print("- Dataset dimensions")
print("- Cell type distribution")
print("- Highest expressed genes")
print("- Available metadata columns (if no 'cell_type' column is present)")

## Step 3: Preprocessing for scNET

Prepare the data for scNET analysis:

In [None]:
# Example preprocessing steps
# Uncomment and modify when you have your dataset

# # Basic filtering
# sc.pp.filter_cells(adata, min_genes=200)  # Filter cells with too few genes
# sc.pp.filter_genes(adata, min_cells=3)    # Filter genes expressed in too few cells

# # Calculate QC metrics (qc_vars=['mt'] makes scanpy report mitochondrial percentages)
# adata.var['mt'] = adata.var_names.str.startswith('MT-')
# sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# # Additional filtering based on QC
# sc.pp.filter_cells(adata, max_genes=5000)  # Filter cells with too many genes (potential doublets)

# print(f"After filtering - Cells: {adata.n_obs}, Genes: {adata.n_vars}")

print("Preprocessing steps include:")
print("1. Filter cells with too few or too many genes")
print("2. Filter genes expressed in too few cells")
print("3. Calculate quality control metrics")
print("4. Remove potential doublets")
print("5. Normalize and log-transform (handled by scNET if pre_processing_flag=True)")
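
To make the filtering thresholds concrete, here is a minimal NumPy sketch of what "genes per cell", "cells per gene", and "mitochondrial percentage" mean on a toy count matrix. The matrix, gene names, and thresholds are made up for illustration; on real data the scanpy calls in the cell above do this work.

```python
import numpy as np

# Toy count matrix: 6 cells x 5 genes (rows are cells, as in AnnData's .X).
counts = np.array([
    [5, 0, 3, 1, 0],
    [0, 0, 0, 1, 0],
    [2, 4, 0, 0, 7],
    [1, 1, 1, 1, 1],
    [0, 0, 9, 0, 0],
    [3, 0, 2, 0, 0],
])
gene_names = np.array(["CD19", "MS4A1", "MT-CO1", "CD79A", "MT-ND1"])

min_genes, min_cells = 2, 3  # toy thresholds, far smaller than for real data

# Step 1: keep cells that express at least `min_genes` genes.
genes_per_cell = (counts > 0).sum(axis=1)
cell_mask = genes_per_cell >= min_genes
counts_cells = counts[cell_mask]

# Step 2: on the remaining cells, keep genes detected in at least `min_cells` cells.
cells_per_gene = (counts_cells > 0).sum(axis=0)
gene_mask = cells_per_gene >= min_cells
filtered = counts_cells[:, gene_mask]
kept_genes = gene_names[gene_mask]

# Step 3: percentage of each cell's counts that come from mitochondrial genes.
is_mt = np.char.startswith(kept_genes, "MT-")
pct_mt = 100 * filtered[:, is_mt].sum(axis=1) / filtered.sum(axis=1)

print(f"Kept {filtered.shape[0]} cells and {filtered.shape[1]} genes: {list(kept_genes)}")
print("Mitochondrial percentage per cell:", pct_mt)
```

The two filters are applied in sequence, so the gene filter counts only cells that survived the cell filter, which matches the order of the commented scanpy calls.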

## Step 4: Run scNET Analysis

Train scNET on the malaria B cell dataset:

In [None]:
# Set project name for this analysis
project_name = "malaria_b_cell_pathway_reconstruction"

# Run scNET with appropriate parameters for human data
# Uncomment when you have your dataset loaded

# scNET.run_scNET(
#     adata, 
#     pre_processing_flag=True,     # Let scNET handle preprocessing
#     human_flag=True,              # Use human gene names
#     number_of_batches=3,          # Number of training batches
#     split_cells=True,             # Split by cells for training
#     max_epoch=250,                # Training epochs
#     model_name=project_name       # Name for saving outputs
# )

print("scNET training parameters for malaria B cell analysis:")
print("- pre_processing_flag=True: Automatic preprocessing")
print("- human_flag=True: Human gene name formatting")
print("- number_of_batches=3: Number of mini-batches used during training")
print("- max_epoch=250: Maximum number of training epochs")
print("- split_cells=True: Cell-based training splits")

print("\nTraining will generate:")
print("- Gene embeddings")
print("- Cell embeddings")
print("- Reconstructed gene expression matrix")
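
For intuition about `number_of_batches=3` combined with `split_cells=True`, the sketch below deals shuffled cell indices into three roughly equal groups. It only illustrates the idea of splitting by cells; it is not scNET's internal batching code, and the helper name is hypothetical.

```python
import random


def split_cells_into_batches(n_cells, number_of_batches, seed=0):
    """Shuffle cell indices and deal them into roughly equal batches."""
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    # Striding by number_of_batches gives batch sizes that differ by at most one.
    return [indices[i::number_of_batches] for i in range(number_of_batches)]


batches = split_cells_into_batches(n_cells=7044, number_of_batches=3)
print([len(batch) for batch in batches])  # [2348, 2348, 2348]
```

More batches mean fewer cells per batch, which generally lowers memory use per training step.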

## Step 5: Load scNET Results

Retrieve the trained embeddings and reconstructed data:

In [None]:
# Load the scNET results
# Uncomment when scNET training is complete

# embedded_genes, embedded_cells, node_features, out_features = scNET.load_embeddings(project_name)

# print(f"Gene embeddings shape: {embedded_genes.shape}")
# print(f"Cell embeddings shape: {embedded_cells.shape}")
# print(f"Original features shape: {node_features.shape}")
# print(f"Reconstructed features shape: {out_features.shape}")

print("After training, scNET will provide:")
print("- embedded_genes: Gene representation in learned space")
print("- embedded_cells: Cell representation in learned space")
print("- node_features: Original gene expression matrix")
print("- out_features: scNET-reconstructed gene expression")
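
One simple way to inspect gene embeddings is to look up a gene's nearest neighbors by cosine similarity. The sketch below assumes that the embedding matrix has one row per gene and that you have a matching list of gene names; both are assumptions to verify against your own scNET output. The toy vectors are invented.

```python
import numpy as np


def nearest_genes(embeddings, gene_names, query, k=3):
    """Return the k genes whose embeddings are most cosine-similar to `query`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    idx = gene_names.index(query)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # exclude the query gene itself
    order = np.argsort(sims)[::-1][:k]
    return [(gene_names[i], float(sims[i])) for i in order]


# Toy embedding: 4 genes x 3 dimensions (assumed layout: one row per gene).
genes = ["CD19", "CD79A", "JCHAIN", "XBP1"]
emb = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.9],
    [0.1, 0.9, 1.0],
])
print(nearest_genes(emb, genes, "CD19", k=1))
```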

## Step 6: Create Reconstructed Dataset Object

Generate a new AnnData object with the reconstructed data:

In [None]:
# Create reconstructed AnnData object
# Uncomment when you have the scNET results

# recon_obj = scNET.create_reconstructed_obj(node_features, out_features, adata)

# print(f"Reconstructed object shape: {recon_obj.shape}")
# print(f"Available observations: {recon_obj.obs.columns.tolist()}")

# # Basic visualization of reconstructed data
# sc.pl.umap(recon_obj, color=['leiden'], legend_loc='on data')

print("The reconstructed object will contain:")
print("- scNET-enhanced gene expression data")
print("- Original cell metadata")
print("- New clustering based on enhanced features")
print("- UMAP visualization of enhanced data")

## Step 7: Annotate B Cell Subtypes

Define B cell subtypes relevant to malaria immune response:

In [None]:
# Define B cell subtypes based on clustering
# This mapping should be customized based on your specific dataset and clustering results

# Example B cell subtype mapping for malaria studies
malaria_b_cell_types = {
    "0": "Memory B cells",
    "1": "Naive B cells", 
    "2": "Atypical Memory B cells",
    "3": "Plasma cells",
    "4": "Germinal Center B cells",
    "5": "Activated B cells",
    "6": "Transitional B cells",
    "7": "Regulatory B cells"
}

# Apply the mapping to your reconstructed object
# Uncomment when you have the reconstructed object

# recon_obj.obs["B_Cell_Subtype"] = recon_obj.obs.leiden.map(malaria_b_cell_types)

# # Visualize the B cell subtypes
# sc.pl.umap(recon_obj, color=['B_Cell_Subtype'], legend_loc='on data')

print("B cell subtypes relevant to malaria immune response:")
for cluster, cell_type in malaria_b_cell_types.items():
    print(f"  Cluster {cluster}: {cell_type}")

print("\nNote: Adjust the cluster-to-cell-type mapping based on:")
print("- Marker gene expression")
print("- Literature on malaria B cell responses")
print("- Your specific dataset characteristics")
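
A common pitfall with this kind of mapping is that any cluster missing from the dictionary silently becomes `NaN`. The sketch below uses a toy pandas Series in place of `recon_obj.obs.leiden` to show how to detect unmapped clusters; the cluster label "8" and the "Unassigned" fallback are illustrative choices.

```python
import pandas as pd

cluster_to_subtype = {
    "0": "Memory B cells",
    "1": "Naive B cells",
    "2": "Atypical Memory B cells",
}

# Toy stand-in for recon_obj.obs.leiden; scanpy stores Leiden labels as
# categorical strings, so the dictionary keys must be strings as well.
leiden = pd.Series(["0", "1", "2", "8", "0"], dtype="category", name="leiden")

labels = leiden.astype(str)
subtype = labels.map(cluster_to_subtype)

# Clusters with no entry in the dictionary map to NaN; list them explicitly.
unmapped = sorted(labels[subtype.isna()].unique())
if unmapped:
    print(f"Clusters without a label: {unmapped}")

# Make the gap visible downstream instead of carrying NaN values.
subtype = subtype.fillna("Unassigned")
print(subtype.value_counts())
```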

## Step 8: Pathway Reconstruction Analysis

This is the core pathway reconstruction analysis using scNET's pathway enrichment capabilities:

In [None]:
# Perform pathway enrichment analysis on B cell subtypes
# Focus on subtypes most relevant to malaria immune response

# Define B cell subtypes of interest for malaria pathway analysis
target_subtypes = ["Memory B cells", "Atypical Memory B cells", "Plasma cells", "Activated B cells"]

print("Pathway Reconstruction Analysis Settings:")
print(f"Target B cell subtypes: {target_subtypes}")
print("Pathway database: KEGG 2021 Human")
print("Analysis method: Differential expression + pathway enrichment")
print("Statistical threshold: Adjusted p-value < 0.05")

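# Run pathway enrichment analysis
# Uncomment when you have the reconstructed object with annotations
# Note: `pathway_enricment` (without the "h") is the function name as called below; keep the spelling as written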

# de_genes_per_group, significant_pathways, filtered_kegg, enrichment_results = scNET.pathway_enricment(
#     recon_obj.copy()[recon_obj.obs["B_Cell_Subtype"].isin(target_subtypes)],
#     groupby="B_Cell_Subtype",
#     logfc_threshold=0.25,  # Minimum log fold change
#     pval_threshold=0.05    # Adjusted p-value threshold
# )

# print("\nPathway enrichment analysis completed!")
# print(f"Number of B cell subtypes analyzed: {len(de_genes_per_group)}")
# print(f"Total significant pathways found: {sum(len(df) for df in significant_pathways.values())}")

print("\nExpected outputs:")
print("- de_genes_per_group: Differentially expressed genes per B cell subtype")
print("- significant_pathways: Enriched pathways per subtype")
print("- enrichment_results: Detailed enrichment statistics")
print("- filtered_kegg: KEGG pathways filtered for dataset genes")
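
To clarify what "pathway enrichment" and "adjusted p-value" mean here, the sketch below implements a generic over-representation test (hypergeometric tail probability) followed by Benjamini-Hochberg correction, using only the standard library. This is a textbook illustration, not scNET's implementation, which may use a different test or correction. The gene counts in the example are made up, apart from the 19,374-gene background taken from the dataset description.

```python
from math import comb


def hypergeom_pvalue(n_background, n_pathway, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn from n_background genes,
    of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_de)
    total = comb(n_background, n_de)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical numbers: 200 DE genes, an 80-gene pathway, 8 genes in common.
p = hypergeom_pvalue(n_background=19374, n_pathway=80, n_de=200, n_overlap=8)
print(f"Raw enrichment p-value: {p:.2e}")
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A pathway counts as significant for a subtype when its adjusted p-value falls below the chosen threshold (0.05 in this tutorial).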

## Step 9: Visualize Pathway Reconstruction Results

Create comprehensive visualizations of the pathway analysis:

In [None]:
# Visualize the pathway enrichment results
# Uncomment when you have the pathway analysis results

# # Create pathway enrichment heatmap
# scNET.plot_de_pathways(significant_pathways, enrichment_results, head=20)
# plt.title('Malaria B Cell Pathway Reconstruction: Top Enriched Pathways', fontsize=16)
# plt.tight_layout()
# plt.show()

# # Display top pathways for each B cell subtype
# print("\nTop 5 pathways per B cell subtype:")
# for subtype, pathways_df in significant_pathways.items():
#     if len(pathways_df) > 0:
#         print(f"\n{subtype}:")
#         top_pathways = pathways_df.head(5)
#         for idx, row in top_pathways.iterrows():
#             print(f"  - {row['Term']} (adj. p-val: {row['Adjusted P-value']:.2e})")
#     else:
#         print(f"\n{subtype}: No significant pathways found")

print("Pathway reconstruction visualization will include:")
print("1. Heatmap of pathway significance across B cell subtypes")
print("2. Top enriched pathways per subtype")
print("3. Pathway significance scores (-log10 p-values)")
print("4. Hierarchical clustering of pathway patterns")

print("\nExpected malaria-relevant pathways:")
print("- B cell receptor signaling pathway")
print("- Antigen processing and presentation")
print("- Cytokine-cytokine receptor interaction")
print("- NF-kappa B signaling pathway")
print("- JAK-STAT signaling pathway")
print("- Complement and coagulation cascades")
print("- Toll-like receptor signaling pathway")
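
If you want to build your own heatmap input, per-subtype results can be combined into a pathway-by-subtype table of -log10 adjusted p-values. The sketch below assumes each subtype maps to a DataFrame with `Term` and `Adjusted P-value` columns, as used in the commented code above; the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `significant_pathways`: one DataFrame per subtype.
toy_pathways = {
    "Memory B cells": pd.DataFrame({
        "Term": ["B cell receptor signaling pathway", "NF-kappa B signaling pathway"],
        "Adjusted P-value": [1e-6, 1e-3],
    }),
    "Plasma cells": pd.DataFrame({
        "Term": ["Protein processing in endoplasmic reticulum", "NF-kappa B signaling pathway"],
        "Adjusted P-value": [1e-8, 1e-2],
    }),
}


def significance_matrix(pathways_by_group):
    """Build a pathway x subtype table of -log10(adjusted p-value).

    Pathways that are not significant in a subtype are filled with 0.
    """
    columns = {
        group: -np.log10(df.set_index("Term")["Adjusted P-value"])
        for group, df in pathways_by_group.items()
    }
    return pd.DataFrame(columns).fillna(0.0)


scores = significance_matrix(toy_pathways)
print(scores.round(2))
```

The resulting table can be passed to `sns.heatmap(scores)` or `sns.clustermap(scores)` for a custom figure.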

## Step 10: Analyze Malaria-Specific Pathways

Focus on pathways specifically relevant to malaria immune response:

In [None]:
# Define malaria-relevant pathways for focused analysis
malaria_relevant_pathways = [
    'B cell receptor signaling pathway',
    'Antigen processing and presentation',
    'Complement and coagulation cascades',
    'Cytokine-cytokine receptor interaction',
    'NF-kappa B signaling pathway',
    'JAK-STAT signaling pathway',
    'Toll-like receptor signaling pathway',
    'IL-17 signaling pathway',
    'TNF signaling pathway',
    'Fc gamma R-mediated phagocytosis'
]

print("Malaria-specific pathway analysis:")
print("Target pathways for detailed reconstruction:")
for i, pathway in enumerate(malaria_relevant_pathways, 1):
    print(f"{i:2d}. {pathway}")

# Filter results for malaria-relevant pathways
# Uncomment when you have the pathway analysis results

# malaria_pathways_results = {}
# for subtype, enrichment_df in enrichment_results.items():
#     malaria_subset = enrichment_df[enrichment_df['Term'].str.contains(
#         '|'.join(malaria_relevant_pathways), case=False, na=False
#     )]
#     if len(malaria_subset) > 0:
#         malaria_pathways_results[subtype] = malaria_subset.sort_values('Adjusted P-value')

# # Display malaria-specific results
# print("\nMalaria-relevant pathway enrichment results:")
# for subtype, pathways_df in malaria_pathways_results.items():
#     print(f"\n{subtype} ({len(pathways_df)} malaria-relevant pathways):")
#     for idx, row in pathways_df.head(3).iterrows():
#         print(f"  - {row['Term']}")
#         print(f"    Adj. p-value: {row['Adjusted P-value']:.2e}")
#         print(f"    Genes: {len(row['Genes'].split(';'))} genes involved")

print("\nThis analysis will reveal:")
print("- Which B cell subtypes are most active in malaria response")
print("- Specific pathways upregulated in each subtype")
print("- Shared vs. unique pathway responses")
print("- Key genes driving pathway enrichment")
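
The "shared vs. unique" comparison can be done with plain set operations once you have a list of enriched terms per subtype. The sketch below uses invented toy input; with real results you could build the input from `malaria_pathways_results`, for example as `{subtype: df['Term'].tolist()}`.

```python
from collections import Counter


def shared_and_unique(pathways_by_subtype):
    """Split terms into those enriched in 2+ subtypes and those specific to one."""
    counts = Counter(
        term for terms in pathways_by_subtype.values() for term in set(terms)
    )
    shared = {term for term, n in counts.items() if n > 1}
    unique = {
        subtype: set(terms) - shared
        for subtype, terms in pathways_by_subtype.items()
    }
    return shared, unique


# Toy input for illustration only.
toy_terms = {
    "Memory B cells": ["B cell receptor signaling pathway", "NF-kappa B signaling pathway"],
    "Plasma cells": ["NF-kappa B signaling pathway", "Protein processing in endoplasmic reticulum"],
}
shared, unique = shared_and_unique(toy_terms)
print("Shared:", sorted(shared))
for subtype, terms in unique.items():
    print(f"Unique to {subtype}: {sorted(terms)}")
```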

## Step 11: Build Co-embedded Gene Network

Construct a network based on gene co-embeddings to understand gene relationships:

In [None]:
# Build co-embedded network from gene embeddings
# Uncomment when you have the embeddings

# # Build the co-embedded network
# graph, modularity = scNET.build_co_embeded_network(
#     embedded_genes, 
#     node_features, 
#     threshold=95  # Use 95th percentile for network edges
# )

# print(f"Co-embedded gene network statistics:")
# print(f"Number of nodes (genes): {graph.number_of_nodes()}")
# print(f"Number of edges: {graph.number_of_edges()}")
# print(f"Network density: {graph.number_of_edges() / (graph.number_of_nodes() * (graph.number_of_nodes() - 1) / 2):.4f}")
# print(f"Modularity score: {modularity:.4f}")

# # Identify highly connected genes (potential key regulators)
# import networkx as nx  # required for degree_centrality; assumes `graph` is a networkx graph
# degree_centrality = nx.degree_centrality(graph)
# top_connected_genes = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:20]

# print("\nTop 10 most connected genes in the co-embedded network:")
# for gene, centrality in top_connected_genes[:10]:
#     print(f"  {gene}: {centrality:.4f}")

print("Gene co-embedding network analysis will provide:")
print("- Network of functionally related genes")
print("- Identification of gene modules")
print("- Hub genes (potential key regulators)")
print("- Community structure in gene relationships")

print("\nExpected insights for malaria B cell analysis:")
print("- Gene modules related to B cell activation")
print("- Antibody production gene networks")
print("- Immune response regulatory hubs")
print("- Memory formation gene clusters")
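
To show the general idea behind a percentile-thresholded co-embedding network, the sketch below correlates gene embeddings and keeps only pairs above the 95th percentile. It is a simplified stand-in written with NumPy, not a reproduction of `scNET.build_co_embeded_network`; the one-row-per-gene layout and the random toy embeddings are assumptions.

```python
from collections import Counter

import numpy as np


def coembedding_edges(embeddings, gene_names, percentile=95):
    """Connect gene pairs whose embedding correlation exceeds the given percentile."""
    corr = np.corrcoef(embeddings)            # gene x gene Pearson correlation
    upper = np.triu_indices_from(corr, k=1)   # each unordered pair exactly once
    cutoff = np.percentile(corr[upper], percentile)
    keep = corr[upper] > cutoff
    return [
        (gene_names[i], gene_names[j], float(corr[i, j]))
        for i, j in zip(upper[0][keep], upper[1][keep])
    ]


# Toy data: 20 hypothetical genes with random 8-dimensional embeddings.
rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(20)]
toy_embeddings = rng.normal(size=(20, 8))

edges = coembedding_edges(toy_embeddings, genes, percentile=95)

# Node degree as a simple hub score: how many edges touch each gene.
degree = Counter(gene for a, b, _ in edges for gene in (a, b))
print(f"{len(edges)} edges; most connected genes: {degree.most_common(3)}")
```

Raising the percentile gives a sparser network with fewer, stronger edges; lowering it gives a denser one.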

## Step 12: Summary and Biological Interpretation

Summarize the pathway reconstruction results and their biological significance:

In [None]:
# Summary analysis function
def summarize_malaria_pathway_analysis():
    """
    Provides a comprehensive summary of the malaria B cell pathway reconstruction analysis.
    """
    
    print("=" * 60)
    print("MALARIA B CELL PATHWAY RECONSTRUCTION SUMMARY")
    print("=" * 60)
    
    print("\n1. DATASET CHARACTERISTICS:")
    print("   - Human malaria-associated B cells")
    print("   - ~7,044 cells, ~19,374 genes (or similar dataset)")
    print("   - Multiple B cell subtypes identified")
    
    print("\n2. scNET ANALYSIS OUTCOMES:")
    print("   - Enhanced gene and cell embeddings")
    print("   - Reconstructed gene expression matrix")
    print("   - Improved cell type resolution")
    print("   - Pathway-informed feature representation")
    
    print("\n3. PATHWAY RECONSTRUCTION RESULTS:")
    print("   - B cell subtype-specific pathway enrichment")
    print("   - Malaria-relevant immune pathway identification")
    print("   - Differential pathway activation patterns")
    print("   - Gene network community structure")
    
    print("\n4. BIOLOGICAL INSIGHTS EXPECTED:")
    print("   a) Memory B Cell Response:")
    print("      - Enhanced BCR signaling")
    print("      - Antigen presentation pathways")
    print("      - Memory maintenance mechanisms")
    
    print("   b) Atypical Memory B Cells:")
    print("      - Unique activation signatures")
    print("      - Altered cytokine responses")
    print("      - Exhaustion-related pathways")
    
    print("   c) Plasma Cells:")
    print("      - Antibody production machinery")
    print("      - ER stress and UPR pathways")
    print("      - Metabolic reprogramming")
    
    print("   d) Activated B Cells:")
    print("      - Proliferation and activation")
    print("      - Inflammatory cytokine production")
    print("      - Differentiation pathways")
    
    print("\n5. CLINICAL RELEVANCE:")
    print("   - Understanding malaria vaccine responses")
    print("   - Identifying therapeutic targets")
    print("   - Biomarker discovery for protection")
    print("   - Insights into immune memory formation")
    
    print("\n6. NEXT STEPS:")
    print("   - Validate key pathway findings")
    print("   - Compare with other malaria datasets")
    print("   - Integrate with clinical outcomes")
    print("   - Design follow-up experiments")
    
    print("\n" + "=" * 60)

# Run the summary
summarize_malaria_pathway_analysis()

## Conclusion

This tutorial demonstrates how to use scNET for pathway reconstruction in malaria-associated B cell datasets. The analysis workflow includes:

1. **Data preparation**: Loading and preprocessing malaria B cell scRNA-seq data
2. **scNET training**: Learning enhanced gene and cell representations
3. **Cell type annotation**: Identifying B cell subtypes relevant to malaria response
4. **Pathway reconstruction**: Using differential expression and pathway enrichment
5. **Network analysis**: Building co-embedded gene networks
6. **Biological interpretation**: Understanding malaria-specific immune responses

### Key Advantages of scNET for Pathway Reconstruction:

- **Context-aware embeddings**: Integrates PPI networks with expression data
- **Enhanced signal**: Reduces noise in pathway analysis
- **Comprehensive analysis**: Combines multiple analysis modalities
- **Biological relevance**: Leverages protein interaction information

### Adapting This Tutorial:

To use this tutorial with your own dataset:
1. Replace the dataset loading section with your data path
2. Adjust cell type annotations based on your clustering results
3. Modify the target pathways based on your research focus
4. Customize the analysis parameters for your dataset size and characteristics

For questions or support, refer to the scNET documentation or open an issue in the scNET GitHub repository.