# Cell-Level Projection and Similarity Analysis

This notebook combines **local** single-cell analysis (scanpy) with
**server-side** coexpression analysis. The local side gives fine-grained
control over individual cells in one sample; the server side draws on
the statistical power of the full Malva atlas.

## 1. Setup

Import the required libraries and create a client. `scanpy` must be
installed for the local analysis steps.

In [None]:
from malva_client import MalvaClient
from malva_client.tools import score_correlated_features, run_go_enrichment
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

client = MalvaClient()

## 2. Cell-Level Search

Search at single-cell resolution using `search_cells()`. This returns
individual cell IDs with their expression values.

In [None]:
cells = client.search_cells("SPP1")
print(cells)

## 3. Download a Reference Sample

Aggregate cells by sample, pick one of the top samples, and download
its full AnnData object.

In [None]:
# Find samples with the most matching cells
agg = cells.aggregate_by_sample()
agg = agg.sort_values('cell_count', ascending=False)
print(agg.head())

# Enrich to get sample UUIDs
df_cells = cells.enrich_with_metadata()

# Pick the top sample
top_sample_id = agg.iloc[0]['sample_id']
sample_uuid = df_cells[df_cells['sample_id'] == top_sample_id]['uuid'].iloc[0]

# Download
adata = client.download_sample(sample_uuid)
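
To make the selection logic concrete, the sketch below reproduces it on a toy table with plain pandas. The column names `sample_id`, `cell_count`, and `uuid` mirror the cell above; the toy values and the `groupby` aggregation are illustrative assumptions, not necessarily how `aggregate_by_sample()` is implemented.

```python
import pandas as pd

# Toy stand-in for the enriched per-cell table (values are made up).
df_cells = pd.DataFrame({
    "cell_id": ["c1", "c2", "c3", "c4", "c5"],
    "sample_id": ["s1", "s1", "s2", "s1", "s2"],
    "uuid": ["uuid-a", "uuid-a", "uuid-b", "uuid-a", "uuid-b"],
    "expression": [3.0, 1.0, 2.0, 5.0, 4.0],
})

# Count matching cells per sample and rank samples by that count.
agg = (
    df_cells.groupby("sample_id")
    .agg(cell_count=("cell_id", "size"), uuid=("uuid", "first"))
    .reset_index()
    .sort_values("cell_count", ascending=False)
)

# The first row is the sample with the most matching cells.
top = agg.iloc[0]
top_sample_id, sample_uuid = top["sample_id"], top["uuid"]
print(agg)
print(top_sample_id, sample_uuid)
```

Picking the sample with the most matching cells is a pragmatic default; any other row of `agg` could be used the same way.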

## 4. Standard scanpy Preprocessing

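Run standard preprocessing: total-count normalisation, log-transform,
highly variable gene selection, PCA, neighbor graph construction, and
UMAP. Note that `highly_variable_genes` only flags genes in `adata.var`;
it does not remove any genes from the object.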

In [None]:
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
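
The first two steps are simple enough to write out by hand. The sketch below applies the same arithmetic to a toy count matrix with NumPy: scale each cell to a fixed total, then take `log(1 + x)`. The matrix values are made up, and the code only illustrates the idea with default-style settings rather than replacing the scanpy calls.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns).
counts = np.array([
    [10, 0, 5, 5],
    [1, 1, 1, 1],
    [0, 50, 25, 25],
], dtype=float)

target_sum = 1e4

# Scale every cell so that its counts sum to target_sum ...
per_cell_total = counts.sum(axis=1, keepdims=True)
normalised = counts / per_cell_total * target_sum

# ... then apply log(1 + x) to compress the dynamic range.
logged = np.log1p(normalised)

print(normalised.sum(axis=1))
print(logged.round(2))
```

After normalisation, cells with very different sequencing depth (20 versus 100 total counts here) become directly comparable.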

## 5. Overlay Search Expression on Local UMAP

Map the Malva cell-level expression values onto the local AnnData object
and visualise on the UMAP.

In [None]:
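# Filter to cells from this sample; copy so the index can be reset safely
sample_cells = df_cells[df_cells['sample_id'] == top_sample_id].copy()
sample_cells.index = sample_cells['cell_id'].astype(str)

# Map expression onto adata. The assignment aligns on cell ID, so cells
# in adata without a search hit are left as NaN.
adata.obs['malva_match'] = np.log1p(sample_cells['expression'])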

sc.pl.umap(adata, color=['malva_match'], title='SPP1 expression (Malva)')
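
The mapping step relies on pandas index alignment, which is easy to overlook. The toy example below, with made-up cell IDs, shows that assigning a Series to a DataFrame column matches on index labels: cells without a hit become `NaN`, and hits for unknown cell IDs are dropped silently. Here `obs` stands in for `adata.obs` and `hits` for the search results.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: `obs` plays the role of adata.obs, `hits` the search results.
obs = pd.DataFrame(index=["c1", "c2", "c3", "c4"])
hits = pd.Series([3.0, 0.0, 7.0], index=["c1", "c3", "c9"], name="expression")

# Assignment aligns on index labels, not on position.
obs["malva_match"] = np.log1p(hits)

# Cells with no matching label receive NaN; "c9" is not in obs and is ignored.
n_missing = int(obs["malva_match"].isna().sum())
print(obs)
print(f"cells without a Malva value: {n_missing}")
```

If most values come out as `NaN`, the cell ID formats on the two sides probably differ (for example a sample prefix or suffix), which is worth checking before plotting.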

## 6. Find Correlated Features Locally

Use `score_correlated_features` from `malva_client.tools` to identify
genes whose expression is correlated with the Malva match score.

In [None]:
thr, pos_markers, neg_markers = score_correlated_features(
    adata,
    feature_key="malva_match",
    method="gmm",
    gmm_components=2,
    n_markers=50,
    significance=0.01,
    show=True,
)
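
For intuition about what "correlated with the match score" means, the sketch below ranks genes by their Pearson correlation with a per-cell score on synthetic data. It is a simplified stand-in: it does not reproduce the GMM thresholding or significance testing that the parameters above suggest, and the gene names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 cells, a per-cell score, and 3 genes with known behaviour.
score = rng.normal(size=200)
expr = np.column_stack([
    score + 0.1 * rng.normal(size=200),   # gene_up: tracks the score
    -score + 0.1 * rng.normal(size=200),  # gene_down: mirrors the score
    rng.normal(size=200),                 # gene_noise: unrelated
])
genes = np.array(["gene_up", "gene_down", "gene_noise"])

# Pearson r of each gene column against the score.
score_c = score - score.mean()
expr_c = expr - expr.mean(axis=0)
r = (expr_c.T @ score_c) / (
    np.linalg.norm(expr_c, axis=0) * np.linalg.norm(score_c)
)

# Sort ascending: the last entry is the strongest positive marker.
order = np.argsort(r)
print("most positive:", genes[order[-1]])
print("most negative:", genes[order[0]])
```

Positive markers correspond to genes at the top of this ranking and negative markers to those at the bottom.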

## 7. Local GO Enrichment

Run GO enrichment on the locally identified positive markers.

In [None]:
go_local = run_go_enrichment(
    pos_markers,
    organism="hsapiens",
    significance=0.05,
    top_n=10,
    plot=True,
)
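
Over-representation analyses of this kind are commonly based on a hypergeometric test: given a marker list of size `n` drawn from `N` background genes, how surprising is it that `k` of them carry a term annotated to `K` genes? The sketch below computes that tail probability with the standard library. It illustrates the general statistic only; whether `run_go_enrichment` uses exactly this test or which multiple-testing correction it applies is not stated here, and the numbers in the example are made up.

```python
from math import comb


def hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical numbers: 50 markers, 5 of them in a term with 200 genes,
# against a background of 20,000 genes.
p = hypergeom_tail(k=5, n=50, K=200, N=20_000)
print(f"enrichment p-value (uncorrected): {p:.2e}")
```

In practice many terms are tested at once, so the raw p-values need a multiple-testing correction before they are compared with the `significance` threshold.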

## 8. Server-Side Coexpression for Comparison

Run the same gene through the server-side coexpression pipeline and
compare the results.

In [None]:
# Get the dataset hierarchy to find the dataset ID
hierarchy = client.get_datasets_hierarchy()
client.print_dict_summary(hierarchy)

DATASET_ID = "DATASET_ID"  # <-- replace with your dataset

In [None]:
result = client.search("SPP1")
coexpr = client.get_coexpression(result.job_id, DATASET_ID)
coexpr.plot_umap(color_by='positive_fraction')

## 9. Cross-Comparison: Local vs Server

Compare the locally identified markers with the server-side correlated
genes to see which findings are consistent.

In [None]:
local_genes = set(pos_markers) if isinstance(pos_markers, list) else set(pos_markers.index)
server_genes = set(coexpr.get_top_genes(50))

shared = local_genes & server_genes
local_only = local_genes - server_genes
server_only = server_genes - local_genes

print(f"Local markers:  {len(local_genes)}")
print(f"Server genes:   {len(server_genes)}")
print(f"Shared:         {len(shared)}")
print(f"Local only:     {len(local_only)}")
print(f"Server only:    {len(server_only)}")
print(f"\nShared genes: {sorted(shared)}")
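
Raw counts are hard to compare across runs with different list sizes, so a normalised overlap score can help. The sketch below computes the Jaccard index and the fraction of each list that is shared. The helper name `overlap_summary` and the placeholder gene names are hypothetical.

```python
def overlap_summary(local: set, server: set) -> dict:
    """Summarise the overlap between two gene sets."""
    shared = local & server
    union = local | server
    return {
        "shared": len(shared),
        # Jaccard index: shared genes relative to all genes seen in either set.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # How much of each list is confirmed by the other approach.
        "frac_of_local": len(shared) / len(local) if local else 0.0,
        "frac_of_server": len(shared) / len(server) if server else 0.0,
    }


# Placeholder gene names, for illustration only.
summary = overlap_summary({"G1", "G2", "G3", "G4"}, {"G3", "G4", "G5"})
print(summary)
```

With the notebook's variables, the call would be `overlap_summary(local_genes, server_genes)`.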

## 10. Side-by-Side Visualisation

Compare the local UMAP (single sample) with the server UMAP
(full atlas).

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Local UMAP
sc.pl.umap(adata, color='malva_match', ax=axes[0], show=False,
           title='Local UMAP (single sample)')

# Server UMAP
scores_df = coexpr.scores_to_dataframe()
if not scores_df.empty:
    x_col = 'x' if 'x' in scores_df.columns else scores_df.columns[0]
    y_col = 'y' if 'y' in scores_df.columns else scores_df.columns[1]
    color_col = 'positive_fraction' if 'positive_fraction' in scores_df.columns else None
    scatter = axes[1].scatter(
        scores_df[x_col], scores_df[y_col],
        c=scores_df[color_col] if color_col else 'steelblue',
        cmap='viridis', s=5, alpha=0.7
    )
    if color_col:
        plt.colorbar(scatter, ax=axes[1], label=color_col)
axes[1].set_xlabel('UMAP 1')
axes[1].set_ylabel('UMAP 2')
axes[1].set_title('Server UMAP (full atlas)')

plt.tight_layout()
plt.show()

## Summary

This notebook demonstrated how **local** and **server-side** analyses
complement each other:

| Approach | Strengths |
|----------|----------|
| **Local (scanpy)** | Full control, custom preprocessing, single-sample resolution |
| **Server (coexpression API)** | Atlas-wide statistics, no heavy computation locally |

By comparing the two, you can identify robust correlated genes that
appear consistently across approaches and distinguish sample-specific
signals from atlas-wide patterns.