# Preparing adata for cellxgene / MaMPlan creation

### Preparing for cellxgene
This notebook prepares the AnnData object for cellxgene.
The preparation includes:
 - Removing unnecessary data to keep the resulting h5ad file as small as possible
 - Renaming columns for a nicer presentation in cellxgene
 - Converting unsupported datatypes to supported datatypes
 - Applying fixes for incompatibilities between scanpy, anndata and cellxgene
   
### MaMPlan creation
Additionally, a MaMPlan can be created, which is needed to deploy the dataset to the BCU repository using mampok or the BCU repository overlay.  
A MaMPlan acts as the config file for a specific dataset. It holds the parameters needed by mampok and the BCU repository.  
To simplify the creation process, only the important parameters need to be set here. The remaining parameters receive default values, which are often required.

See the [MaMpok wiki](https://gitlab.gwdg.de/loosolab/software/mampok/-/wikis/Getting-Started/MaMPlan_keys) for more detailed information about each parameter.

#### Parameters
| Parameter | Description | Options |
|:---:|:---|:---|
| project_id | Project ID, e.g. 'ext123', 'dst123' | str |
| tool | Select the cellxgene docker container. | 'cellxgene-new', 'cellxgene-fix', 'cellxgene-vip-latest'  |
| cluster | Select the kubernetes cluster. | 'BN', 'GI', 'GWDG', 'GWDGmanagt' |
| organization | Select organizations related to the project.<br> Every user in one of the organizations will be able to access the dataset via the BCU repository. | [Options](https://gitlab.gwdg.de/loosolab/software/metadata_whitelists/-/blob/main/whitelists/department?ref_type=heads) |
| label | Set label shown in the browser tab. | str |
| user | List of users who, in addition to the organization members, get access to the dataset via the BCU repository. | List of LDAP user IDs |
| pubmedid | PubMed ID of a public dataset. | PubMed ID |
| citation | Citation of public dataset. | str |
| cpu_limit | Set the maximum number of CPU cores that can be used by the deployment. | int |
| mem_limit | Set the maximum amount of memory (in GB) that can be used by the deployment. | int |
| cpu_request | Set the number of CPU cores requested for the deployment. | int |
| mem_request | Set the amount of memory (in GB) requested for the deployment. | int |

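The options in the table can be checked before a MaMPlan is written. The following is a minimal sketch, not part of mampok: `check_mamplan_options` is a hypothetical helper, the allowed values are copied from the table above, and the rule that a request must not exceed its limit is an assumption based on how Kubernetes treats resource specifications. Whether mampok performs similar checks itself is not covered here.

```python
# Allowed values copied from the parameter table above.
ALLOWED_TOOLS = {"cellxgene-new", "cellxgene-fix", "cellxgene-vip-latest"}
ALLOWED_CLUSTERS = {"BN", "GI", "GWDG", "GWDGmanagt"}


def check_mamplan_options(tool, cluster, cpu_limit=None, cpu_request=None,
                          mem_limit=None, mem_request=None):
    """Hypothetical helper: return a list of problems found in the options."""
    problems = []
    if tool not in ALLOWED_TOOLS:
        problems.append(f"unknown tool: {tool!r}")
    if cluster not in ALLOWED_CLUSTERS:
        problems.append(f"unknown cluster: {cluster!r}")
    # Assumption: a request larger than its limit is a configuration mistake.
    for name, request, limit in (("cpu", cpu_request, cpu_limit),
                                 ("mem", mem_request, mem_limit)):
        if request is not None and limit is not None and request > limit:
            problems.append(
                f"{name}_request ({request}) exceeds {name}_limit ({limit})"
            )
    return problems


# Valid options produce an empty list.
print(check_mamplan_options("cellxgene-fix", "BN"))
# An unknown tool and an oversized CPU request are both reported.
print(check_mamplan_options("cellxgene", "BN", cpu_limit=2, cpu_request=4))
```

Parameters left at `None` are skipped, which matches the defaults used in the input cell of this notebook.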
## Input - 1

In [1]:
last_notebook_adata = "anndata_5.h5ad"
datatype = "scRNA"

# MaMPlan options

## Project options
project_id = "ID"
tool = "cellxgene-fix"  # alternatives: "cellxgene-new", "cellxgene-vip-latest"
cluster = "BN"
organization = ["AG-nerds"]
label = None
user = None

## Options for public datasets
pubmedid = None
citation = None

## Options for computational resource management

### Limit
cpu_limit = None
mem_limit = None
### Requested
cpu_request = None
mem_request = None

mamplan_filename = f"{project_id}_MaMPlan.yaml"

## Setup

In [2]:
try:
    from mampok import mamplan_creator
except ModuleNotFoundError:
    raise ModuleNotFoundError("Please install the latest mampok version.")
import sctoolbox.utilities as utils


utils.settings_from_config("config.yaml", key="99")



In [3]:
adata = utils.load_h5ad(last_notebook_adata)
display(adata)

[INFO] The adata object was loaded from: pipeline_output/adatas/anndata_5.h5ad


AnnData object with n_obs × n_vars = 2773 × 21154
    obs: 'orig.ident', 'chamber', 'donor', 'batch', 'sample', 'celltype', 'total_counts', 'log1p_total_counts', 'total_counts_is_ribo', 'log1p_total_counts_is_ribo', 'pct_counts_is_ribo', 'total_counts_is_mito', 'log1p_total_counts_is_mito', 'pct_counts_is_mito', 'total_counts_is_gender', 'log1p_total_counts_is_gender', 'pct_counts_is_gender', 'doublet_score', 'predicted_doublet', 'predicted_sex', 'n_genes', 'log1p_n_genes', 'S_score', 'G2M_score', 'phase', 'leiden', 'LISI_score_X_pca', 'LISI_score_X_umap', 'leiden_0.1', 'leiden_0.2', 'leiden_0.3', 'leiden_0.4', 'leiden_0.5', 'leiden_0.6', 'leiden_0.7', 'leiden_0.8', 'leiden_0.9', 'leiden_0.5_recluster', 'clustering', 'SCSA_pred_celltype', 'marker_pred_celltype'
    var: 'is_ribo', 'is_mito', 'cellcycle', 'is_gender', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispers

## Prepare adata for cellxgene

### Input - 2 

In [4]:
# Keep columns in adata.obs (Cell metadata)
keep_obs = [
    "sample",
    "batch",
    "celltype",
    "pct_counts_is_mito",
    "pct_counts_is_ribo",
    "phase",
    "clustering",
    "SCSA_pred_celltype",
    "marker_pred_celltype"
]

# Rename columns in adata.obs
rename_obs = {
    "sample": "Sample",
    "batch": "Batch",
    "celltype": "Celltype",
    "pct_counts_is_mito": "Mitochondiral content (%)",
    "pct_counts_is_ribo": "Ribosomal content (%)",
    "phase": "Phase",
    "clustering": "Final Clustering",
    "SCSA_pred_celltype": "Predicted Celltype (SCSA)",
    "marker_pred_celltype": "Predicted Celltype (Marker)"
}

# Keep columns in adata.var (Gene metadata)
# An empty list removes all columns
keep_var = []
rename_var = {}

### Add leiden columns

In [5]:
leiden_cols = [col for col in adata.obs.columns if col.startswith("leiden")]
keep_obs += leiden_cols
rename_obs |= {c: c.replace("_", " ").capitalize() for c in leiden_cols}
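
The effect of the renaming expression above can be previewed on plain strings without loading any data. This sketch uses a few column names from the `adata.obs` listing shown earlier, plus one non-leiden column to show that the filter skips it; `obs_columns` and `leiden_rename` are illustrative names only.

```python
# Example column names taken from the adata.obs listing above.
obs_columns = ["sample", "leiden", "leiden_0.1", "leiden_0.5_recluster"]

# Same filter and renaming rule as in the cell above.
leiden_cols = [col for col in obs_columns if col.startswith("leiden")]
leiden_rename = {c: c.replace("_", " ").capitalize() for c in leiden_cols}

for old, new in leiden_rename.items():
    print(f"{old!r} -> {new!r}")
```

Note that `str.capitalize()` lowercases everything after the first character, so a column name containing uppercase letters would lose them. The in-place dict merge operator `|=` used in the cell above requires Python 3.9 or newer.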

The cellxgene preparation removes all data from the anndata object that is not required for the cellxgene deployment.  
This saves memory on the cluster and improves runtime performance.

In addition, invalid or problematic datatypes are detected and cast to a suitable datatype where possible.

**Note: Keep in mind that the resulting adata object should not be used for further analysis.**
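
Since the preparation drops everything that is not listed, it can help to check the keep and rename settings for consistency first. The following is a minimal sketch: `check_keep_and_rename` is a hypothetical helper and not part of sctoolbox, and how `prepare_for_cellxgene` itself reacts to a missing column is not shown in this notebook.

```python
def check_keep_and_rename(available, keep, rename):
    """Hypothetical pre-check for the keep/rename settings.

    Returns three lists:
      missing    - columns in `keep` that are not in `available`
      unused     - keys in `rename` that are not in `keep`
      duplicates - target names that more than one kept column maps to
    """
    available = set(available)
    missing = [c for c in keep if c not in available]
    unused = [c for c in rename if c not in keep]
    # Columns without a rename entry keep their original name.
    targets = [rename.get(c, c) for c in keep]
    duplicates = sorted({t for t in targets if targets.count(t) > 1})
    return missing, unused, duplicates


# Small stand-in example; in the notebook the call would be
# check_keep_and_rename(adata.obs.columns, keep_obs, rename_obs)
missing, unused, duplicates = check_keep_and_rename(
    available=["sample", "batch", "phase"],
    keep=["sample", "celltype"],
    rename={"sample": "Sample", "phase": "Phase"},
)
print("missing:", missing)
print("unused rename keys:", unused)
print("duplicate targets:", duplicates)
```

Three empty lists mean that every kept column exists, every rename entry applies to a kept column, and no two columns end up with the same label.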

In [6]:
utils.prepare_for_cellxgene(adata,
                           keep_obs=keep_obs,
                           keep_var=keep_var,
                           rename_obs=rename_obs,
                           rename_var=rename_var,
                           inplace=True)

In [7]:
display(adata)

AnnData object with n_obs × n_vars = 2773 × 21154
    obs: 'Batch', 'Sample', 'Celltype', 'Ribosomal content (%)', 'Mitochondiral content (%)', 'Phase', 'Leiden', 'Leiden 0.1', 'Leiden 0.2', 'Leiden 0.3', 'Leiden 0.4', 'Leiden 0.5', 'Leiden 0.6', 'Leiden 0.7', 'Leiden 0.8', 'Leiden 0.9', 'Leiden 0.5 recluster', 'Final Clustering', 'Predicted Celltype (SCSA)', 'Predicted Celltype (Marker)'
    uns: 'SCSA', 'dendrogram_clustering', 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'rank_genes_clustering', 'rank_genes_groups', 'scrublet', 'sctoolbox', 'umap', 'Sample_colors', 'Batch_colors', 'Final Clustering_colors', 'Predicted Celltype (SCSA)_colors', 'Predicted Celltype (Marker)_colors', 'Leiden_colors', 'Leiden 0.1_colors', 'Leiden 0.2_colors', 'Leiden 0.3_colors', 'Leiden 0.4_colors', 'Leiden 0.5_colors', 'Leiden 0.6_colors', 'Leiden 0.7_colors', 'Leiden 0.8_colors', 'Leiden 0.9_colors', 'Leiden 0.5 recluster_colors'
    obsm: 'X_pca', 'X_scanorama', 'X_umap'
    varm: 'PCs'
    layers: 

### Save adata

In [8]:
# Save the prepared adata object
adata_output = f"{project_id}_cellxgene.h5ad"
utils.save_h5ad(adata, adata_output)

[INFO] The adata object was saved to: pipeline_output/adatas/deployment/ID_cellxgene.h5ad


## Write MaMPlan

In [9]:
mamplan = mamplan_creator.SimpleMamplan(
    exp_id = project_id,
    files = adata_output,
    tool = tool,
    analyst = utils.get_user(),
    datatype = datatype,
    cluster = cluster,
    label = label,
    organization = organization,
    user = user,
    pubmedid = pubmedid,
    citation = citation,
    cpu_limit = cpu_limit,
    mem_limit = mem_limit,
    cpu_request = cpu_request,
    mem_request = mem_request
)

In [10]:
mamplan.save(f"{project_id}")